Loading...


Updated 23 Jun 2026 • 6 mins read

LLM input prices span about 140x, from $0.035 per million tokens (Amazon Nova Micro) to $5 (GPT-5.5, Claude Opus 4.8). We compare cost per token across OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, Amazon, and Cohere, then explain why cost per answer, not per token, decides your bill.
Choosing a large language model in 2026 starts with a deceptively simple question: what does it cost? The answer is anything but simple. Input prices across the major providers span roughly 140x, from a few cents per million tokens at the value end to five dollars at the frontier, and output can cost anywhere from two to ten times more than input. On top of that, every provider prices caching, context tiers, and reasoning tokens differently, so two models with similar headline rates can produce very different bills.
This guide gives you the full per-token picture across OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, Amazon, and Cohere in one master table, then goes a step further. Because the number that actually lands on your invoice is not price per token, it is cost per answer: total spend divided by the answers your application delivers, after retries, tool calls, context bloat, and reasoning overhead. We cover both, so you can compare per token to choose a model and budget per answer to control the bill.
The table below is the single source of truth for this guide. It compares list price per million tokens for input, output, and cached input across every major model family, with context window. Figures are on-demand, standard-tier rates. This is written for engineering and platform leaders sizing model spend and for FinOps managers approving it, so it leads with the numbers you budget against.
| Provider | Model | Input $/1M | Output $/1M | Cached $/1M | Context |
|---|---|---|---|---|---|
| OpenAI | GPT-5.5 (flagship) | 5.00 | 30.00 | 0.50 | ~1.05M |
| OpenAI | GPT-5 | 1.25 | 10.00 | 0.125 | 400K |
| OpenAI | GPT-5 mini | 0.25 | 2.00 | 0.025 | 400K |
| OpenAI | o3 (reasoning) | 2.00 | 8.00 | 0.50 | 200K |
| OpenAI | o4-mini | 1.10 | 4.40 | 0.275 | 200K |
| OpenAI | GPT-4o | 2.50 | 10.00 | 1.25 | 128K |
| OpenAI | GPT-4o mini | 0.15 | 0.60 | 0.075 | 128K |
| Anthropic | Claude Opus 4.8 | 5.00 | 25.00 | 0.50 | 1M |
| Anthropic | Claude Sonnet 4.6 | 3.00 | 15.00 | 0.30 | 1M |
| Anthropic | Claude Haiku 4.5 | 1.00 | 5.00 | 0.10 | 200K |
| Gemini 2.5 Pro (≤200K) | 1.25 | 10.00 | 0.125 | 1M | |
| Gemini 2.5 Pro (>200K) | 2.50 | 15.00 | 0.25 | 1M | |
| Gemini 2.5 Flash | 0.30 | 2.50 | 0.03 | 1M | |
| Gemini 2.5 Flash-Lite | 0.10 | 0.40 | 0.01 | 1M | |
| Gemini 3.1 Pro (≤200K) | 2.00 | 12.00 | 0.20 | 1M | |
| Gemini 3.5 Flash | 1.50 | 9.00 | 0.15 | 1M | |
| DeepSeek | deepseek-chat (V4 Flash) | 0.14 | 0.28 | 0.0028 | 1M |
| DeepSeek | deepseek-chat (legacy V3) | 0.27 | 1.10 | 0.07 | 64K |
| DeepSeek | deepseek-reasoner (legacy R1) | 0.55 | 2.19 | 0.14 | 64K |
| Meta / Bedrock | Llama 4 Maverick | 0.24 | 0.97 | n/p | 1M |
| Meta / Bedrock | Llama 4 Scout | 0.17 | 0.66 | n/p | 3.5M |
| Meta / Bedrock | Llama 3.3 70B | 0.72 | 0.72 | n/p | 128K |
| Mistral | Mistral Large 3 | 0.50 | 1.50 | n/p | 256K |
| Mistral | Mistral Medium 3.5 | 1.50 | 7.50 | n/p | 128K |
| Mistral | Mistral Small 4 | 0.10 | 0.30 | n/p | n/p |
| xAI | Grok 4.3 | 1.25 | 2.50 | 0.20 | 1M |
| Amazon / Bedrock | Nova Pro | 0.80 | 3.20 | ~0.20 (est) | 300K |
| Amazon / Bedrock | Nova Lite | 0.06 | 0.24 | ~0.015 (est) | 300K |
| Amazon / Bedrock | Nova Micro | 0.035 | 0.14 | 0.00875 | 128K |
| Cohere | Command A | 2.50 | 10.00 | n/p | 256K |
| Cohere | Command R | 0.50 | 1.50 | n/p | 128K |
All prices in USD per 1M tokens, on-demand standard tier. "n/p" means not published. Official sources by provider: OpenAI, Anthropic, Google Gemini, DeepSeek, AWS Bedrock (Llama, Nova), Mistral, xAI Grok, Cohere.
On raw input cost per token, Amazon Nova Micro is the cheapest at $0.035/M input and $0.14/M output, followed by Nova Lite ($0.06/$0.24) and Gemini 2.5 Flash-Lite ($0.10/$0.40). Among frontier-class models that handle general production work, DeepSeek V4 Flash is the price leader at $0.14/$0.28, undercutting GPT-5.5 and Claude Opus 4.8 by roughly 15x to 35x on input.
But "cheapest API" is the wrong question for a budget owner. The cheapest list price frequently produces a higher bill, because small value-tier models often need more retries, longer outputs, or human correction to reach the same quality on a hard task. The cheapest sticker rate and the cheapest delivered answer are different numbers, and only one of them appears on your invoice.
Definition. Cost per token is the price a provider charges per token of input or output, quoted in USD per million tokens. It is the list-price unit of LLM billing, not the unit that determines a real invoice.
Providers meter input and output tokens separately and bill in USD per million. A token is a chunk of text - roughly 0.75 of an English word, so about 1,500 words is near 2,000 tokens. Your bill for any model is a simple sum, computed per call and aggregated across all traffic:
LLM bill = (input tokens x input rate) + (output tokens x output rate), with cached input billed at the lower cache-read rate where the provider supports it.
The trap is modeling cost as "tokens times one rate." In practice your effective input rate is a blend of the standard and cached rates, weighted by your cache-hit ratio, which you control through prompt design. A million tokens is an abstraction reached through volume, not through a single call; a typical request is a few thousand tokens, and you arrive at a million by serving many of them.
Definition. A token is the atomic unit an LLM reads and writes - a short piece of text, on average about 0.75 of an English word. Both your prompt and the model's response are counted in tokens, and both are billed.
Output is the more expensive operation, so providers price it 2x to 10x above input. The reason is mechanical. Input tokens are processed in a single parallel prefill pass, where the model reads the whole prompt at once. Output tokens are produced one at a time in a sequential decode loop, each new token conditioned on every prior one - a compute-bound process that does not parallelize the same way.
The ratio shows up consistently in the table. GPT-5.5 charges $5 input and $30 output, a 6x ratio. Claude Opus 4.8 is $5/$25, a 5x ratio. Gemini 2.5 Flash is $0.30/$2.50, an 8x ratio. The practical implication for a budget: capping output length is a direct, high-leverage cost control, because every output token is billed in the most expensive class. Verbose, unconstrained responses are the quiet driver of an inflated bill.
Caching is the dominant cost lever in 2026. When a provider recognizes that the leading portion of your prompt - the prefix - matches context it processed recently, it serves those tokens from cache at a steep discount instead of reprocessing them. Most providers discount cache reads by about 90% versus standard input; DeepSeek V4 reaches roughly 98%, while GPT-4o is an outlier at only 50% (Opslyft pricing observation, 2026).
Definition. Cached input is prompt text the provider has seen recently and can reuse from its context cache, billed at a discounted cache-read rate instead of the full input rate. Stable, repeated prompt prefixes maximize cache hits.
The design discipline is simple and high-ROI. Put everything stable and repeated - system prompt, tool definitions, few-shot examples, a fixed document - at the front of the prompt, byte-identical across calls, so it lands on the cache rate. Put the variable part, the user's actual question, at the end. On a chatbot with a large fixed system prompt, ordering prompts this way can cut input cost by more than half without changing the model or the answer.
List price per token is not what you pay. Cost per answer is. This is the single most important idea in LLM cost management, and it is invisible on every rate card. Two models with identical per-token prices can produce wildly different bills, because the number of tokens consumed to deliver one finished answer varies by an order of magnitude depending on how the application is built.
Definition. Cost per answer is total LLM infrastructure spend divided by the number of answers your application actually delivers to users. It is the true unit cost of an AI feature, and it is what a budget should track.
We frame this as the Cost-Per-Answer formula: total infrastructure spend divided by answers delivered. Four hidden multipliers move that denominator and inflate the numerator, and none of them appear in a price-per-token comparison:
The consequence is that token consumption is non-linear with user activity. A single query routed through retrieval, answered by a reasoning model, with a few tool calls and a validation pass, can consume one to two orders of magnitude more tokens than a direct prompt. The forecast models the visible activity; the bill reflects the invisible fan-out. This is why a per-token comparison is necessary but never sufficient.
This is the gap Opslyft closes.A low per-token rate does not protect you from a five-figure surprise when a prompt change quietly triples context or an agent loop runs away. Opslyft surfaces LLM token cost where engineers work, allocates it across teams, features, and customers without requiring perfect tags, and ties it back to answers delivered so you optimize cost per answer, not price per token. See how Opslyft handles AI cost →
There is no single winner, because cost-to-performance depends on the task and the quality bar. The honest framework is to match the tier to the work:
The right comparison is cost per successfully completed task at your quality bar - a measurement you run on your own traffic, not a number you read off a rate card. The disciplined pattern is to default to a cheap model and escalate to a flagship only when a classifier or confidence score says the task needs it. That single routing decision is typically worth more than any negotiated discount.
Concrete math makes the rate card legible. Take a retrieval-augmented (RAG) support assistant with this per-request profile: a 1,200-token system prompt (stable, cacheable), 1,800 tokens of retrieved context and history, and a 200-token user message - 3,200 input tokens total - returning a 400-token answer. Volume is 500,000 requests per month. We assume the 1,200-token system prompt always hits cache (a 37.5% cache-hit ratio on input) and the rest is uncached.
Per month that is 600M cached input tokens, 1,000M uncached input tokens, and 200M output tokens. Applying each model's rates:
| Model | Cached input (600M) | Uncached input (1,000M) | Output (200M) | Monthly total |
|---|---|---|---|---|
| DeepSeek V4 Flash ($0.14/$0.28, cache $0.0028) | $1.68 | $140 | $56 | ~$198 |
| Claude Sonnet 4.6 ($3/$15, cache $0.30) | $180 | $3,000 | $3,000 | ~$6,180 |
| GPT-5.5 ($5/$30, cache $0.50) | $300 | $5,000 | $6,000 | ~$11,300 |
Same workload, same token profile, three published rate cards: the bill ranges from about $198 to $11,300 per month, a 57x spread driven entirely by model choice. The lesson is not "always pick the cheapest." It is that the model decision dwarfs almost every other cost lever, so it deserves to be made deliberately, per task, and re-measured against cost per correct answer rather than assumed. If GPT-5.5 resolves a class of tickets in one pass that DeepSeek needs two attempts and an escalation to handle, the gap narrows - which is exactly why you measure on your own traffic.
Most cost surprises come from never doing the arithmetic before launch. Here is the method, worked end to end, so you can size any workload in advance:
For a deeper treatment of how token consumption maps to value and budget, see the token economics pillar, and for the broader playbook see AI cost optimization. The same forecasting discipline that governs cloud spend - estimate, attribute, govern - applies directly here; teams already doing it for infrastructure (for example via our OCI cost optimization guide) extend it to LLM tokens with the same muscles.
Three observable shifts should shape any 2026 model budget:
A related caution from the value tier: prices reprice upward across generations. Gemini 3.5 Flash ($1.50/$9) costs about 5x the input of Gemini 2.5 Flash ($0.30/$2.50), so upgrading to the newest "fast" model is not automatically a cost saving. Always re-check the rate when you migrate model versions.
The 2026 LLM market gives buyers a real spread to work with - roughly 140x between the cheapest and most expensive input rates, a frontier-class price leader in DeepSeek V4 Flash, and a ~90% cache discount that most teams under-exploit. Use the master table to shortlist by tier and price. But stop the comparison there and you will mis-budget, because the rate card omits the four multipliers - retries, tool calls, context bloat, and reasoning tokens - that decide the real number. Compare per token to choose a model; budget per answer to control the bill; and attribute every token to the value it produces, because that is the only view that survives contact with production traffic.
On raw input cost, Amazon Nova Micro is cheapest at $0.035/M input and $0.14/M output, followed by Nova Lite and Gemini 2.5 Flash-Lite. Among frontier-class models, DeepSeek V4 Flash leads at $0.14/$0.28. But the cheapest sticker rate is not always the cheapest delivered answer.
Providers meter input and output tokens separately and bill per million. Your bill is (input tokens x input rate) + (output tokens x output rate), with cached input billed at a lower cache-read rate where supported. Your effective input rate is a blend of standard and cached rates weighted by your cache-hit ratio.
Output is priced 2x to 10x above input because it is generated sequentially, one token at a time in a compute-bound decode loop, while input is processed in a single parallel prefill pass. Capping output length is therefore a high-leverage cost control.
Most providers discount cache reads by about 90% versus standard input, DeepSeek V4 by roughly 98%, with GPT-4o an outlier at 50%. Putting stable, byte-identical content at the front of the prompt maximizes cache hits and can cut input cost by more than half.