Cloud AI’s Hidden Price Tag: Why On-Prem May Be the Smarter Long-Term Bet (Part 1 of 2)
- Tax the Robots
- Sep 28
- 5 min read
Cloud AI pricing looks simple (“£X per million tokens”), but the real bill includes inference, embeddings, vector DB reads/writes/storage, safety filters, observability, egress, logging, KMS/Key Vault calls, caching quirks, retries, and agentic orchestration: costs that scale with usage and user behaviour. For sustained, predictable workloads or strict data-residency/compliance needs, on-prem or hybrid AI increasingly wins on total cost of ownership (TCO), control, and latency, especially as modern inference GPUs (e.g., NVIDIA L40S/Blackwell-class) drive down cost per token on owned infrastructure. Cloud still shines for bursty R&D and rapid multi-model prototyping. The sweet spot for many is a tiered approach: prototype in the cloud, scale predictable inference on-prem, and apply rigorous cost governance everywhere.
The promise (and trap) of simple token pricing
Most cloud AI providers headline “per-million-token” rates. That’s useful for back-of-the-envelope budgeting, but real spend quickly diverges:
- Model calls (prompt + output): charged per token; pricing varies by model and vendor. OpenAI’s public pricing pages show per-million-token rates for text and embeddings, plus features like prompt-caching discounts.
- Safety/guardrails: Many platforms encourage or require content filters; those can bill per text record or image.
- Observability/tracing/evals: Tools such as LangSmith are invaluable but add per-trace or retention costs.
- RAG plumbing (embeddings + vector DB): Embedding tokens, vector storage, reads/writes, and (often) rerankers add up.
- Data movement: Egress fees for pulling data or serving results outside the cloud region.
- Security primitives: KMS/Key Vault operations are metered and can surprise at scale.
- Platform logging/metrics: CloudWatch/Azure Monitor/Cloud Logging ingestion and retention are billable.
Each component may be “pennies”, but compounded across millions of users, retries, multi-step agents, and background jobs, your monthly spend can drift far beyond the headline token price. A rough model like the one below makes the compounding visible.
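To make that concrete, here is a minimal monthly cost model in Python. Every unit price and traffic figure in it is an illustrative assumption, not a quote from any provider’s rate card; substitute your own vendor pricing and measured usage.

```python
# Minimal monthly cost model for a cloud LLM pipeline.
# All unit prices and volumes below are illustrative placeholders --
# replace them with your provider's current rate card and your own traffic.

MONTHLY_REQUESTS = 5_000_000          # assumed user-facing calls per month
AVG_INPUT_TOKENS = 1_200              # prompt + retrieved context
AVG_OUTPUT_TOKENS = 350
RETRY_AND_TOOL_MULTIPLIER = 1.4       # retries, tool calls, agent steps

# Illustrative unit prices
PRICE_INPUT_PER_M = 2.50              # £ per million input tokens
PRICE_OUTPUT_PER_M = 10.00            # £ per million output tokens
PRICE_EMBED_PER_M = 0.10              # £ per million embedding tokens
PRICE_MODERATION_PER_1K = 0.75        # £ per 1,000 text records screened
PRICE_VECTOR_READ_PER_M = 8.00        # £ per million vector DB read units
PRICE_EGRESS_PER_GB = 0.07            # £ per GB served out of region
PRICE_LOG_INGEST_PER_GB = 0.40        # £ per GB of logs ingested

calls = MONTHLY_REQUESTS * RETRY_AND_TOOL_MULTIPLIER
input_tokens = calls * AVG_INPUT_TOKENS
output_tokens = calls * AVG_OUTPUT_TOKENS

costs = {
    "model_input":  input_tokens / 1e6 * PRICE_INPUT_PER_M,
    "model_output": output_tokens / 1e6 * PRICE_OUTPUT_PER_M,
    "embeddings":   (calls * 40) / 1e6 * PRICE_EMBED_PER_M,        # ~40 query tokens embedded per call
    "moderation":   calls / 1_000 * PRICE_MODERATION_PER_1K,
    "vector_reads": calls / 1e6 * PRICE_VECTOR_READ_PER_M,
    "egress":       (calls * 4_000 / 1e9) * PRICE_EGRESS_PER_GB,   # ~4 KB response payload
    "logging":      (calls * 2_000 / 1e9) * PRICE_LOG_INGEST_PER_GB,  # ~2 KB of logs per call
}

for item, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{item:>14}: £{cost:,.0f}/month")
print(f"{'total':>14}: £{sum(costs.values()):,.0f}/month")
```

Even with deliberately modest unit prices, the non-token line items are rarely zero, and the retry/agent multiplier inflates everything upstream of it.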
Where the hidden costs hide
1) Prompt & output tokens aren’t the whole story
Embeddings for RAG: Every document chunk you index and every query you run incurs embedding tokens. OpenAI’s current public embedding prices are low per million tokens, but at scale, millions of chunks add up.
Caching is not a silver bullet: Prompt caching can reduce input-token cost (e.g., 50% for OpenAI cached inputs; Anthropic offers steeper discounts with different rules), but you must architect for cacheable prefixes and hit rates. Overhead for cache writes and expiry windows can blunt savings if your prompts are highly dynamic, as the rough arithmetic below illustrates.
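A quick way to sanity-check caching claims is to blend the discount with your expected hit rate. The sketch below uses the 50% cached-input discount cited above; the prefix share and hit rate are placeholders you would measure from real traffic.

```python
# Effective input-token price under prompt caching -- a rough sketch.
# The 50% cached-input discount follows the figure quoted in the text;
# prefix_share and hit_rate are assumptions, not measurements.

def effective_input_price(base_price_per_m: float,
                          cached_discount: float = 0.50,  # cached tokens billed at half price here
                          prefix_share: float = 0.60,     # fraction of each prompt that is a stable prefix
                          hit_rate: float = 0.70) -> float:
    """Blended price per million input tokens."""
    cached_fraction = prefix_share * hit_rate             # tokens actually served from cache
    return base_price_per_m * (1 - cached_fraction * cached_discount)

base = 2.50  # illustrative £ per million input tokens
print(f"No caching:     £{base:.2f}/M tokens")
print(f"With caching:   £{effective_input_price(base):.2f}/M tokens")
print(f"Highly dynamic: £{effective_input_price(base, prefix_share=0.15, hit_rate=0.30):.2f}/M tokens")
```

The third line is the trap: if most of the prompt changes per request, the blended saving collapses to a few percent.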
2) Safety & compliance layers quietly meter usage
Safety filters: Azure AI Content Safety bills per text record or image screened; Google’s Gemini/Vertex AI includes configurable filters that can affect throughput/latency and operational complexity. If you add red-team checks or jailbreak detection (e.g., prompt-guard models), cost rises again.
Key management: AWS KMS and Azure Key Vault charge per operation; at high request rates these fees and quotas become material, particularly if you envelope-encrypt payloads or rotate secrets aggressively.
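To get a feel for the scale, the sketch below multiplies a per-operation price by a steady request rate. The per-10,000-operations price and the traffic figures are illustrative assumptions; check your provider’s current KMS/Key Vault pricing and quotas.

```python
# Rough key-management operation cost at scale -- illustrative rates only.

requests_per_second = 2_000            # assumed steady API traffic
kms_ops_per_request = 2                # e.g., one encrypt + one decrypt per request
price_per_10k_ops = 0.03               # illustrative $ per 10,000 operations

ops_per_month = requests_per_second * kms_ops_per_request * 60 * 60 * 24 * 30
monthly_cost = ops_per_month / 10_000 * price_per_10k_ops
print(f"{ops_per_month:,.0f} key-management ops/month ≈ ${monthly_cost:,.0f}")

# Caching data keys locally (classic envelope encryption) cuts the operation count
# dramatically, but adds rotation and invalidation complexity -- the workaround
# trade-off discussed later in this piece.
```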
3) Observability, tracing, and evals
Tracing and evaluation tooling (e.g., LangSmith) bills per trace and for longer retention, and automated eval runs generate extra model calls of their own; verbose agent chains multiply both.
4) RAG vector databases: reads, writes, storage, reranking
Pinecone (serverless): per-GB storage and per-million read/write units; storage (~$0.33/GB/mo) and read/write charges stack with your traffic. Weaviate’s serverless lists per-million vector dimensions stored/month plus plan fees.
Operational nuance: Pricing is sensitive to vector dimensionality, number of objects, query patterns, and SLA tier; benchmarks or calculators help, but real bills depend on usage shape.
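A back-of-the-envelope sizing sketch shows why dimensionality and object count dominate the storage line. The ~$0.33/GB/month figure echoes the serverless storage rate quoted above; bytes per dimension, metadata overhead, and corpus size are assumptions.

```python
# Back-of-the-envelope vector storage sizing and monthly cost.

num_vectors = 50_000_000        # e.g., 50M document chunks (assumed corpus size)
dimensions = 1536               # embedding dimensionality (model-dependent)
bytes_per_dim = 4               # float32
metadata_bytes = 200            # assumed per-vector metadata overhead

storage_gb = num_vectors * (dimensions * bytes_per_dim + metadata_bytes) / 1e9
monthly_storage_cost = storage_gb * 0.33   # $/GB/month figure from the text

print(f"Raw storage: {storage_gb:,.0f} GB -> ~${monthly_storage_cost:,.0f}/month before reads/writes")

# Halving dimensionality (or quantising vectors) roughly halves this line item,
# which is why dimensionality shows up in the pricing sensitivity noted above.
```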
5) Data egress and logging
Network egress from major clouds is billed per-GB; if your app serves large outputs, images, or streams results cross-region, egress fees can rival compute costs. Logging ingestion and retention (CloudWatch/Azure Monitor/Cloud Logging) can also be non-trivial for verbose AI systems.
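A toy comparison makes the point: multiply response size by request volume and an assumed per-GB rate. All figures below are placeholders; real egress and log-ingestion prices vary by cloud, region, and destination.

```python
# Toy egress and logging estimate -- all rates and volumes are assumptions.

monthly_responses = 20_000_000
avg_response_kb = 1_500                    # e.g., generated images or long streamed outputs
egress_price_per_gb = 0.09                 # illustrative internet/cross-region rate ($)

egress_gb = monthly_responses * avg_response_kb / 1e6
print(f"Egress: {egress_gb:,.0f} GB ≈ ${egress_gb * egress_price_per_gb:,.0f}/month")

log_kb_per_request = 8                     # verbose traces, prompts, and tool logs
log_gb = monthly_responses * log_kb_per_request / 1e6
print(f"Log ingestion: {log_gb:,.0f} GB ≈ ${log_gb * 0.50:,.0f}/month at an assumed $0.50/GB")
```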
6) Rate limits and throttling side-effects
KMS/Key Vault quotas and per-request costs can trigger architectural workarounds (local envelope caches, batching), which add complexity and sometimes new costs (extra storage, retries).

The inference reality: your biggest long-term cost centre
Across the industry, inference (the act of serving model outputs) is already eclipsing training for many real-world deployments. That matters because inference cost scales with every user interaction—forever.
NVIDIA emphasises the need to balance latency and throughput to keep inference cost in check; vendor materials and annual reports highlight orders-of-magnitude efficiency gains with new platforms (e.g., Blackwell claiming large inference cost reductions). Gains flow disproportionately to those who control their stack.
When usage is steady and high, owning the inference plane (on-prem or colocation) can flip the cost curve: you amortise capex over years, squeeze GPUs at high utilisation, and avoid per-token markups and egress.
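A simple break-even sketch, with every number an assumption rather than a quote, shows how the curve can flip once volume is steady:

```python
# Toy break-even: retail per-token cloud pricing vs amortised owned GPUs.
# Plug in your own quotes, token mix, and measured throughput before drawing conclusions.

monthly_tokens = 20_000_000_000            # 20B tokens/month of steady inference (assumed)
cloud_price_per_m = 5.00                   # blended £ per million tokens at retail (assumed)

server_capex = 250_000                     # £ for a GPU inference cluster (assumed)
amortisation_months = 36
monthly_opex = 6_000                       # £ power, cooling, colo, support (assumed)

cloud_monthly = monthly_tokens / 1e6 * cloud_price_per_m
onprem_monthly = server_capex / amortisation_months + monthly_opex

print(f"Cloud:   £{cloud_monthly:,.0f}/month")
print(f"On-prem: £{onprem_monthly:,.0f}/month (if the hardware can serve the load)")
print(f"Capex recovered in ~{server_capex / max(cloud_monthly - monthly_opex, 1):,.1f} months")
```

The comparison only holds if the owned hardware can actually serve the monthly volume at acceptable latency, which is why utilisation and throughput measurements matter more than the capex figure itself.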
When cloud AI pricing makes sense
Cloud remains phenomenal for:
- Exploration & prototyping: spin up models in minutes, compare vendors, A/B prompts, and build POCs without capex.
- Spiky, unpredictable workloads: elasticity and managed SLAs mitigate idle hardware risk.
- Niche model access: specialty models or APIs you won’t host yourself.
But for sustained production inference at scale, paying retail per token forever can be more expensive than owning well-utilised inference hardware, especially as new GPUs improve perf/W and perf/£.
The on-prem case: cost, control, and compliance
1) Cost per token you actually own
Modern inference-optimised GPUs (e.g., NVIDIA L40S class today; Blackwell-class incoming) deliver far better tokens-per-watt and tokens-per-£ than older cards. With proper batching and serving stacks, you can drive down cost per million tokens below cloud retail, particularly for high-volume, steady workloads. (Independent and vendor studies consistently show strong inference scaling on L40S-class servers.)
2) Data gravity, residency, and governance
Keeping prompts, context, and outputs inside your perimeter reduces egress, simplifies data-residency controls, and can shorten audit cycles. You also have direct control over logs, retention, and redaction policies—versus paying for external logging volumes. (Cloud logging is excellent, just metered.)
3) Predictable spend & utilisation discipline
Owning the kit nudges teams to engineer for utilisation (batching, quantisation, prompt hygiene) because the savings accrue to you—not your provider. Over 12–36 months, this discipline compounds.
4) Real-world “cloud exit” datapoints
While not AI-specific, companies like 37signals and Dropbox publicly reported substantial savings from moving key workloads off hyperscalers; the underlying lesson generalises: predictable, heavy workloads often cost less on owned infra.
A pragmatic buyer’s guide: model the whole pipeline
To decide cloud vs on-prem (or where to split), model costs across your full LLM pipeline:
Model calls
Input/output tokens by use case. Include retries, function/tool calls, chain steps, and agents’ background actions. Apply provider pricing and any caching discounts.
RAG
Embedding tokens (ingest + query), vector DB storage/reads/writes, reranking calls.
Safety/guardrails
Per-request moderation scans (text/image) and prompt-guard models.
Observability & evals
Per-trace and retention costs for tracing tools, plus any extra model calls your eval suites generate.
Security services
KMS/Key Vault operations per request, secret rotations, HSM pools if used.
Networking & logging
Egress by endpoint and region; log ingestion/retention by environment.
On-prem alternative
Capex (servers, GPUs, storage, networking), energy, cooling, space/colo, support, and staff; amortise over 3–5 years. Use vendor performance data to estimate tokens/sec and cost per million tokens at target utilisation (a rough worked example follows).
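A minimal amortisation sketch, assuming placeholder throughput, capex, and opex figures, shows how the on-prem side of the comparison reduces to cost per million tokens at a target utilisation:

```python
# Estimating owned-infrastructure cost per million tokens at a target utilisation.
# Throughput, capex, and power figures are placeholders -- use measured tokens/sec
# from your own serving stack (batching, quantisation) and real vendor quotes.

tokens_per_sec_per_gpu = 2_500      # assumed sustained throughput for your model/server
num_gpus = 8
target_utilisation = 0.60           # fraction of wall-clock time actually serving

capex = 300_000                     # £ servers, GPUs, storage, networking (assumed)
amortisation_years = 4
annual_opex = 90_000                # £ energy, cooling, space/colo, support, staff share (assumed)

annual_tokens = tokens_per_sec_per_gpu * num_gpus * target_utilisation * 3600 * 24 * 365
annual_cost = capex / amortisation_years + annual_opex
cost_per_million = annual_cost / (annual_tokens / 1e6)

print(f"~{annual_tokens / 1e12:.1f}T tokens/year at {target_utilisation:.0%} utilisation")
print(f"≈ £{cost_per_million:.3f} per million tokens")
```

Compare that per-million figure directly against your blended cloud rate from the pipeline model above; the gap (or lack of one) is the decision.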



