Updated 23 Jun 2026 • 6 mins read

LLM Pricing Comparison: Cost Per Token Across Every Major Model (2026)

Khushi Dubey
Author

LLM input prices span about 140x, from $0.035 per million tokens (Amazon Nova Micro) to $5 (GPT-5.5, Claude Opus 4.8). We compare cost per token across OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, Amazon, and Cohere, then explain why cost per answer, not per token, decides your bill.

Introduction

Choosing a large language model in 2026 starts with a deceptively simple question: what does it cost? The answer is anything but simple. Input prices across the major providers span roughly 140x, from a few cents per million tokens at the value end to five dollars at the frontier, and output typically costs two to ten times as much as input. On top of that, every provider prices caching, context tiers, and reasoning tokens differently, so two models with similar headline rates can produce very different bills.

This guide gives you the full per-token picture across OpenAI, Anthropic, Google, DeepSeek, Meta, Mistral, xAI, Amazon, and Cohere in one master table, then goes a step further, because the number that actually lands on your invoice is not price per token but cost per answer: total spend divided by the answers your application delivers, after retries, tool calls, context bloat, and reasoning overhead. We cover both, so you can compare per token to choose a model and budget per answer to control the bill.

The Master LLM Pricing Comparison Table (2026)

The table below is the single source of truth for this guide. It compares list price per million tokens for input, output, and cached input across every major model family, along with each model's context window. Figures are on-demand, standard-tier rates. The guide is written for engineering and platform leaders sizing model spend and for FinOps managers approving it, so it leads with the numbers you budget against.

Provider	Model	Input $/1M	Output $/1M	Cached $/1M	Context
OpenAI	GPT-5.5 (flagship)	5.00	30.00	0.50	~1.05M
OpenAI	GPT-5	1.25	10.00	0.125	400K
OpenAI	GPT-5 mini	0.25	2.00	0.025	400K
OpenAI	o3 (reasoning)	2.00	8.00	0.50	200K
OpenAI	o4-mini	1.10	4.40	0.275	200K
OpenAI	GPT-4o	2.50	10.00	1.25	128K
OpenAI	GPT-4o mini	0.15	0.60	0.075	128K
Anthropic	Claude Opus 4.8	5.00	25.00	0.50	1M
Anthropic	Claude Sonnet 4.6	3.00	15.00	0.30	1M
Anthropic	Claude Haiku 4.5	1.00	5.00	0.10	200K
Google	Gemini 2.5 Pro (≤200K)	1.25	10.00	0.125	1M
Google	Gemini 2.5 Pro (>200K)	2.50	15.00	0.25	1M
Google	Gemini 2.5 Flash	0.30	2.50	0.03	1M
Google	Gemini 2.5 Flash-Lite	0.10	0.40	0.01	1M
Google	Gemini 3.1 Pro (≤200K)	2.00	12.00	0.20	1M
Google	Gemini 3.5 Flash	1.50	9.00	0.15	1M
DeepSeek	deepseek-chat (V4 Flash)	0.14	0.28	0.0028	1M
DeepSeek	deepseek-chat (legacy V3)	0.27	1.10	0.07	64K
DeepSeek	deepseek-reasoner (legacy R1)	0.55	2.19	0.14	64K
Meta / Bedrock	Llama 4 Maverick	0.24	0.97	n/p	1M
Meta / Bedrock	Llama 4 Scout	0.17	0.66	n/p	3.5M
Meta / Bedrock	Llama 3.3 70B	0.72	0.72	n/p	128K
Mistral	Mistral Large 3	0.50	1.50	n/p	256K
Mistral	Mistral Medium 3.5	1.50	7.50	n/p	128K
Mistral	Mistral Small 4	0.10	0.30	n/p	n/p
xAI	Grok 4.3	1.25	2.50	0.20	1M
Amazon / Bedrock	Nova Pro	0.80	3.20	~0.20 (est)	300K
Amazon / Bedrock	Nova Lite	0.06	0.24	~0.015 (est)	300K
Amazon / Bedrock	Nova Micro	0.035	0.14	0.00875	128K
Cohere	Command A	2.50	10.00	n/p	256K
Cohere	Command R	0.50	1.50	n/p	128K

All prices in USD per 1M tokens, on-demand standard tier. "n/p" means not published. Official sources by provider: OpenAI, Anthropic, Google Gemini, DeepSeek, AWS Bedrock (Llama, Nova), Mistral, xAI Grok, Cohere.

Accuracy footnotes - do not treat flagged rows as hard fact

DeepSeek dual pricing. There are two conflicting official pages. The deepseek-chat and deepseek-reasoner aliases now route to V4 Flash (0.14/0.28, cache 0.0028, 1M context). The standalone legacy V3 and R1 entries (64K context) are scheduled for deprecation on 2026-07-24. Both are listed; confirm which one your account hits.
Llama 4 on Bedrock (medium confidence). The AWS pricing page is JavaScript-rendered; per-token figures are corroborated via aggregators consistent with Together AI hosted rates, and context windows are confirmed by the AWS launch blog. Treat the rates as indicative, not final.
Nova cached input (estimated). Nova Lite and Nova Pro cached-input figures are computed estimates at roughly 25% of input; only Nova Micro's cache-read rate is published. Base input/output rates are high confidence.
Cohere Command A (aggregator-sourced). The 2.50/10.00 rate is corroborated via aggregators (OpenRouter, AI Pricing Guru), not on Cohere's own published table, where the flagship is priced behind sales. Verify with Cohere directly before budgeting.
Gemini 3.x (preview). Gemini 3.1 Pro and 3.5 Flash are Preview pricing; GA pricing on non-global endpoints is effective 2026-07-01 and may reprice. Gemini Pro keeps a two-tier rate (above/below a 200K prompt), and cache storage is billed separately.
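
As a rough illustration of the two-tier Gemini Pro rate described in the last footnote, the sketch below picks a rate by prompt size. It assumes the tier is chosen by prompt length and applies to the whole request, input and output alike, which should be confirmed against Google's current pricing page. The function name and tier table are hypothetical, and cache storage is ignored.

```python
# Gemini 2.5 Pro list rates from the master table, USD per 1M tokens.
# Assumption: the tier is selected by prompt (input) size and then applies
# to both the input and the output of that request.
TIER_THRESHOLD_TOKENS = 200_000
GEMINI_25_PRO_TIERS = {
    "standard": {"input": 1.25, "output": 10.00},  # prompts of 200K tokens or fewer
    "long": {"input": 2.50, "output": 15.00},      # prompts above 200K tokens
}


def gemini_pro_request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one uncached request under the two-tier rate."""
    tier = "long" if input_tokens > TIER_THRESHOLD_TOKENS else "standard"
    rates = GEMINI_25_PRO_TIERS[tier]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000


# Two hypothetical requests that differ only in prompt length.
short_prompt = gemini_pro_request_cost(150_000, 2_000)
long_prompt = gemini_pro_request_cost(250_000, 2_000)
print(f"150K-token prompt: ${short_prompt:.4f}")
print(f"250K-token prompt: ${long_prompt:.4f}")
```

Under these assumptions, crossing the threshold raises the rate on every token in the request, not only on the tokens above 200K, so trimming a prompt back under the line can matter more than the token count alone suggests.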

What is the cheapest LLM API in 2026?

On raw input cost per token, Amazon Nova Micro is the cheapest at $0.035/M input and $0.14/M output, followed by Nova Lite ($0.06/$0.24) and Gemini 2.5 Flash-Lite ($0.10/$0.40). Among frontier-class models that handle general production work, DeepSeek V4 Flash is the price leader at $0.14/$0.28, undercutting GPT-5.5 and Claude Opus 4.8 (both $5.00/M input) by roughly 36x on input.

But "cheapest API" is the wrong question for a budget owner. The cheapest list price frequently produces a higher bill, because small value-tier models often need more retries, longer outputs, or human correction to reach the same quality on a hard task. The cheapest sticker rate and the cheapest delivered answer are different numbers, and only one of them appears on your invoice.

Definition. Cost per token is the price a provider charges per token of input or output, quoted in USD per million tokens. It is the list-price unit of LLM billing, not the unit that determines a real invoice.

How is LLM pricing calculated per token?

Providers meter input and output tokens separately and bill in USD per million. A token is a chunk of text - roughly 0.75 of an English word, so about 1,500 words is near 2,000 tokens. Your bill for any model is a simple sum, computed per call and aggregated across all traffic:

LLM bill = (uncached input tokens x input rate) + (cached input tokens x cache-read rate) + (output tokens x output rate), with each rate quoted per million tokens. Where a provider does not support caching, all input is billed at the standard input rate.

The trap is modeling cost as "tokens times one rate." In practice your effective input rate is a blend of the standard and cached rates, weighted by your cache-hit ratio, which you control through prompt design. A million tokens is an abstraction reached through volume, not through a single call; a typical request is a few thousand tokens, and you arrive at a million by serving many of them.
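
The blended input rate can be sketched in a few lines. This is a minimal illustration, not a provider API: the function names are hypothetical, the rates are the GPT-5 list prices from the master table, and the token counts and 50% cache-hit ratio are made-up inputs.

```python
def effective_input_rate(input_rate: float, cached_rate: float, cache_hit_ratio: float) -> float:
    """Blend the standard and cache-read rates (USD per 1M tokens) by cache-hit ratio."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return cache_hit_ratio * cached_rate + (1.0 - cache_hit_ratio) * input_rate


def request_cost(input_tokens: int, output_tokens: int, input_rate: float,
                 output_rate: float, cached_rate: float, cache_hit_ratio: float) -> float:
    """USD cost of one call: blended input plus output, with rates per 1M tokens."""
    blended = effective_input_rate(input_rate, cached_rate, cache_hit_ratio)
    return (input_tokens * blended + output_tokens * output_rate) / 1_000_000


# GPT-5 list rates from the master table; token counts and the 50%
# cache-hit ratio are illustrative assumptions.
blended_rate = effective_input_rate(1.25, 0.125, 0.5)
cost = request_cost(3_000, 500, 1.25, 10.00, 0.125, 0.5)
print(f"effective input rate: ${blended_rate:.4f} per 1M tokens")
print(f"cost per request: ${cost:.6f}")
```

Summing `request_cost` over all calls gives the bill; the cache-hit ratio is the one input in that sum you can move through prompt design alone.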

Definition. A token is the atomic unit an LLM reads and writes - a short piece of text, on average about 0.75 of an English word. Both your prompt and the model's response are counted in tokens, and both are billed.

Why is output priced higher than input?

Output is the more expensive operation, so providers typically price it 2x to 10x above input. The reason is mechanical. Input tokens are processed in a single parallel prefill pass, where the model reads the whole prompt at once. Output tokens are produced one at a time in a sequential decode loop, each new token conditioned on every prior one - a process that is typically limited by memory bandwidth and does not parallelize the same way.

The ratio shows up consistently in the table. GPT-5.5 charges $5 input and $30 output, a 6x ratio. Claude Opus 4.8 is $5/$25, a 5x ratio. Gemini 2.5 Flash is $0.30/$2.50, roughly an 8x ratio. The practical implication for a budget: capping output length is a direct, high-leverage cost control, because every output token is billed in the most expensive class. Verbose, unconstrained responses are the quiet driver of an inflated bill.
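
To make the output-cap lever concrete, here is a small sketch that derives the ratios from the table and prices one request with and without a cap. The request shape (2,000 input tokens, an answer cut from 800 to 400 tokens) is a hypothetical example, not a measured workload.

```python
# (input, output) list rates in USD per 1M tokens, from the master table.
RATES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "Gemini 2.5 Flash": (0.30, 2.50),
}

# Output-to-input price ratio per model.
output_ratio = {model: out / inp for model, (inp, out) in RATES.items()}


def request_cost(input_tokens: int, output_tokens: int, model: str) -> float:
    """USD cost of one uncached request at list price."""
    input_rate, output_rate = RATES[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


# Hypothetical request: same prompt, answer capped from 800 to 400 tokens.
uncapped = request_cost(2_000, 800, "GPT-5.5")
capped = request_cost(2_000, 400, "GPT-5.5")
saving = 1 - capped / uncapped

for model, ratio in output_ratio.items():
    print(f"{model}: output costs {ratio:.1f}x input")
print(f"halving the answer length saves {saving:.0%} on this request")
```

The higher a model's output-to-input ratio, the larger the share of the bill an output cap controls.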

How much can prompt caching cut LLM costs?

Caching is the dominant cost lever in 2026. When a provider recognizes that the leading portion of your prompt - the prefix - matches context it processed recently, it serves those tokens from cache at a steep discount instead of reprocessing them. Most providers discount cache reads by about 90% versus standard input, and DeepSeek V4 reaches roughly 98% (Opslyft pricing observation, 2026). Not every model reaches that level: in the master table, o3, o4-mini, and Nova Micro sit at 75%, and GPT-4o and GPT-4o mini at only 50%.
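
The discount is simply one minus the ratio of the cached rate to the standard input rate. A quick sketch using four rows from the master table (the helper name is hypothetical):

```python
# (input, cached input) list rates in USD per 1M tokens, from the master table.
RATES = {
    "GPT-5": (1.25, 0.125),
    "Claude Sonnet 4.6": (3.00, 0.30),
    "DeepSeek V4 Flash": (0.14, 0.0028),
    "GPT-4o": (2.50, 1.25),
}


def cache_discount(input_rate: float, cached_rate: float) -> float:
    """Fraction saved on a token served from cache instead of reprocessed."""
    return 1.0 - cached_rate / input_rate


discounts = {model: cache_discount(inp, cached) for model, (inp, cached) in RATES.items()}
for model, discount in discounts.items():
    print(f"{model}: {discount:.0%} off cached input")
```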

Definition. Cached input is prompt text the provider has seen recently and can reuse from its context cache, billed at a discounted cache-read rate instead of the full input rate. Stable, repeated prompt prefixes maximize cache hits.

The design discipline is simple and high-ROI. Put everything stable and repeated - system prompt, tool definitions, few-shot examples, a fixed document - at the front of the prompt, byte-identical across calls, so it lands on the cache rate. Put the variable part, the user's actual question, at the end. On a chatbot with a large fixed system prompt, ordering prompts this way can cut input cost by more than half without changing the model or the answer.
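
A minimal sketch of that ordering, assuming a simple string-concatenated prompt. All names and prompt contents here are hypothetical, and real providers match prefixes on tokens rather than characters; how caching is triggered, and any minimum prefix length, varies by provider and should be checked in their documentation.

```python
# Stable content, defined once so the bytes are identical on every call.
SYSTEM_PROMPT = "You are a support assistant for ExampleCo. Answer concisely."
TOOL_DEFINITIONS = '[{"name": "lookup_order", "parameters": {"order_id": "string"}}]'
FEW_SHOT_EXAMPLES = "Q: Where is my order?\nA: Let me check the tracking status."

STABLE_PREFIX = "\n\n".join([SYSTEM_PROMPT, TOOL_DEFINITIONS, FEW_SHOT_EXAMPLES])


def build_prompt(user_question: str) -> str:
    """Put stable, cacheable content first and the variable question last."""
    return f"{STABLE_PREFIX}\n\nUser question: {user_question}"


def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading characters two prompts have in common."""
    length = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        length += 1
    return length


first = build_prompt("How do I reset my password?")
second = build_prompt("Can I change my delivery address?")
shared = shared_prefix_length(first, second)
print(f"{shared} of {len(first)} characters are shared with the next request")
```

Anything that varies per call (a timestamp, a request ID, the user's name) placed ahead of the stable block would shorten the shared prefix to almost nothing, which is why the variable part goes last.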

Cost per token vs cost per answer: what actually matters?

List price per token is not what you pay. Cost per answer is. This is the single most important idea in LLM cost management, and it is invisible on every rate card. Two models with identical per-token prices can produce wildly different bills, because the number of tokens consumed to deliver one finished answer varies by an order of magnitude depending on how the application is built.

Definition. Cost per answer is total LLM infrastructure spend divided by the number of answers your application actually delivers to users. It is the true unit cost of an AI feature, and it is what a budget should track.

We frame this as the Cost-Per-Answer formula: total infrastructure spend divided by answers delivered. Four hidden multipliers inflate the numerator (spend) without adding to the denominator (answers delivered), and none of them appear in a price-per-token comparison:

Retries and validation. Failed calls, guardrail passes, and JSON-repair loops all burn tokens for zero net answers. A cheaper model that retries three times can cost more per answer than a flagship that succeeds once.
Tool calls and agent loops. In multi-step agents, the model is invoked many times per user-facing answer. The intermediate reasoning and tool-routing chatter can exceed the cost of the visible work.
Context bloat. System prompts and conversation history are re-sent and re-billed on every turn. In a long session, the context can dwarf the user's actual question.
Reasoning tokens. Reasoning models (o3, V4 Pro, and similar) generate large volumes of internal tokens to reach a short answer, and you pay for all of them.

The consequence is that token consumption is non-linear with user activity. A single query routed through retrieval, answered by a reasoning model, with a few tool calls and a validation pass, can consume one to two orders of magnitude more tokens than a direct prompt. The forecast models the visible activity; the bill reflects the invisible fan-out. This is why a per-token comparison is necessary but never sufficient.
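
The fan-out can be put into numbers with a small model of the four multipliers. This is a sketch under stated assumptions: the `AnswerProfile` fields and both example profiles are hypothetical, the rates are the o3 list prices from the master table, and reasoning tokens are assumed to be billed as output tokens.

```python
from dataclasses import dataclass


@dataclass
class AnswerProfile:
    calls_per_answer: float       # model invocations per user-facing answer (tool calls, agent steps)
    retry_rate: float             # extra calls per call spent on retries and validation (0.25 = 25%)
    input_tokens_per_call: int    # includes re-sent system prompt and history (context bloat)
    output_tokens_per_call: int   # includes reasoning tokens, assumed billed as output


def cost_per_answer(profile: AnswerProfile, input_rate: float, output_rate: float) -> float:
    """USD spent to deliver one finished answer, with rates per 1M tokens."""
    billed_calls = profile.calls_per_answer * (1 + profile.retry_rate)
    per_call = (profile.input_tokens_per_call * input_rate
                + profile.output_tokens_per_call * output_rate) / 1_000_000
    return billed_calls * per_call


# o3 list rates from the master table; both profiles are made-up examples.
O3_INPUT, O3_OUTPUT = 2.00, 8.00
direct = AnswerProfile(calls_per_answer=1, retry_rate=0.0,
                       input_tokens_per_call=500, output_tokens_per_call=300)
agentic = AnswerProfile(calls_per_answer=6, retry_rate=0.25,
                        input_tokens_per_call=4_000, output_tokens_per_call=1_500)

direct_cost = cost_per_answer(direct, O3_INPUT, O3_OUTPUT)
agentic_cost = cost_per_answer(agentic, O3_INPUT, O3_OUTPUT)
print(f"direct prompt: ${direct_cost:.4f} per answer")
print(f"agentic path:  ${agentic_cost:.4f} per answer")
print(f"fan-out: {agentic_cost / direct_cost:.0f}x at the same per-token price")
```

Both paths use the same rate card; only the shape of the application differs.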

This is the gap Opslyft closes. A low per-token rate does not protect you from a five-figure surprise when a prompt change quietly triples context or an agent loop runs away. Opslyft surfaces LLM token cost where engineers work, allocates it across teams, features, and customers without requiring perfect tags, and ties it back to answers delivered so you optimize cost per answer, not price per token.

Which LLM gives the best cost-to-performance?

There is no single winner, because cost-to-performance depends on the task and the quality bar. The honest framework is to match the tier to the work:

High-volume, low-difficulty work (classification, extraction, routing, summarization): value-tier models win decisively on cost per answer. Nova Lite ($0.06/$0.24), Gemini 2.5 Flash-Lite ($0.10/$0.40), and DeepSeek V4 Flash ($0.14/$0.28) deliver near-flagship quality on simple tasks at a fraction of the price.
Hard reasoning where a wrong answer is costly (complex code, multi-constraint planning, analysis): a flagship that solves in one attempt can beat a cheap model that needs three retries plus human correction. Here Claude Opus 4.8, GPT-5.5, and Gemini 3.1 Pro earn their premium when measured on cost per correct answer.
The middle (general assistants, RAG chat): mid-tier models such as Claude Sonnet 4.6, GPT-5, and Gemini 2.5 Pro balance quality and price, and are the common default for production assistants.

The right comparison is cost per successfully completed task at your quality bar - a measurement you run on your own traffic, not a number you read off a rate card. The disciplined pattern is to default to a cheap model and escalate to a flagship only when a classifier or confidence score says the task needs it. That single routing decision is typically worth more than any negotiated discount.
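
A minimal sketch of that routing pattern, assuming some upstream classifier already produces a difficulty score between 0 and 1. The score values, the 0.7 threshold, and the two per-request costs are all hypothetical; in practice they come from your own traffic.

```python
# Hypothetical per-request costs in USD, measured on your own workload.
COST_PER_REQUEST = {"cheap": 0.0004, "flagship": 0.02}


def route(difficulty_score: float, threshold: float = 0.7) -> str:
    """Default to the cheap model; escalate only when the score crosses the threshold."""
    return "flagship" if difficulty_score >= threshold else "cheap"


def blended_cost(scores: list[float], threshold: float = 0.7) -> float:
    """Average cost per request for a batch of scored requests."""
    return sum(COST_PER_REQUEST[route(s, threshold)] for s in scores) / len(scores)


# Ten made-up requests; two of them score as hard.
scores = [0.10, 0.20, 0.30, 0.90, 0.40, 0.15, 0.05, 0.80, 0.35, 0.25]
routed = blended_cost(scores)
all_flagship = COST_PER_REQUEST["flagship"]
print(f"routed: ${routed:.5f} per request")
print(f"saving vs all-flagship: {1 - routed / all_flagship:.0%}")
```

The sketch prices requests only. A real evaluation would also count the retries and escalations caused by misrouted hard tasks, which is what cost per correct answer captures.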

A worked RAG cost example across three models

Concrete math makes the rate card legible. Take a retrieval-augmented (RAG) support assistant with this per-request profile: a 1,200-token system prompt (stable, cacheable), 1,800 tokens of retrieved context and history, and a 200-token user message - 3,200 input tokens total - returning a 400-token answer. Volume is 500,000 requests per month. We assume the 1,200-token system prompt always hits cache (a 37.5% cache-hit ratio on input) and the rest is uncached.

Per month that is 600M cached input tokens, 1,000M uncached input tokens, and 200M output tokens. Applying each model's rates:

Model	Cached input (600M)	Uncached input (1,000M)	Output (200M)	Monthly total
DeepSeek V4 Flash ($0.14/$0.28, cache $0.0028)	$1.68	$140	$56	~$198
Claude Sonnet 4.6 ($3/$15, cache $0.30)	$180	$3,000	$3,000	~$6,180
GPT-5.5 ($5/$30, cache $0.50)	$300	$5,000	$6,000	~$11,300

Same workload, same token profile, three published rate cards: the bill ranges from about $198 to $11,300 per month, a 57x spread driven entirely by model choice. The lesson is not "always pick the cheapest." It is that the model decision dwarfs almost every other cost lever, so it deserves to be made deliberately, per task, and re-measured against cost per correct answer rather than assumed. If GPT-5.5 resolves a class of tickets in one pass that DeepSeek needs two attempts and an escalation to handle, the gap narrows - which is exactly why you measure on your own traffic.
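
The table above can be reproduced with a few lines of arithmetic, which is also a convenient starting point for swapping in your own token profile. The constant and function names are illustrative; the rates and workload are the ones stated in this section.

```python
REQUESTS_PER_MONTH = 500_000
CACHED_INPUT_TOKENS = 1_200     # system prompt, assumed to always hit cache
UNCACHED_INPUT_TOKENS = 2_000   # 1,800 retrieved context and history + 200 user message
OUTPUT_TOKENS = 400

# (input, output, cached input) list rates in USD per 1M tokens.
MODELS = {
    "DeepSeek V4 Flash": (0.14, 0.28, 0.0028),
    "Claude Sonnet 4.6": (3.00, 15.00, 0.30),
    "GPT-5.5": (5.00, 30.00, 0.50),
}


def monthly_cost(input_rate: float, output_rate: float, cached_rate: float) -> float:
    """Sum the cached-input, uncached-input, and output streams for one month."""
    cached = REQUESTS_PER_MONTH * CACHED_INPUT_TOKENS * cached_rate
    uncached = REQUESTS_PER_MONTH * UNCACHED_INPUT_TOKENS * input_rate
    output = REQUESTS_PER_MONTH * OUTPUT_TOKENS * output_rate
    return (cached + uncached + output) / 1_000_000


totals = {model: monthly_cost(*rates) for model, rates in MODELS.items()}
for model, total in totals.items():
    print(f"{model}: ${total:,.2f} per month")
print(f"spread: {max(totals.values()) / min(totals.values()):.0f}x")
```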

How do you forecast LLM costs before you ship?

Most cost surprises come from never doing the arithmetic before launch. Here is the method, worked end to end, so you can size any workload in advance:

Estimate tokens per request. Sum the system prompt, retrieved context, conversation history, user message (input), and the expected answer length (output).
Split input into cached and uncached. Identify the stable, byte-identical prefix that will hit cache versus the variable remainder. This gives your cache-hit ratio.
Multiply by monthly request volume. Produce three totals: cached input, uncached input, and output tokens per month.
Apply each candidate model's rates from the table above, summing the three streams.
Add the hidden multipliers. Inflate for expected retries, tool calls, and agent loops - the cost-per-answer factors that a naive per-token forecast omits.
Re-run against real logs after a week of traffic. The gap between forecast and actuals is your optimization backlog.
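
The six steps above can be sketched as one function plus a comparison against actuals. Everything workload-specific here is a hypothetical input: the token profile, the 200,000 monthly requests, the 1.25 overhead multiplier for retries and tool calls, and the $900 "actual" figure. The rates are the Claude Haiku 4.5 list prices from the master table.

```python
def forecast_monthly_cost(cached_tokens: int, uncached_tokens: int, output_tokens: int,
                          requests_per_month: int, input_rate: float, output_rate: float,
                          cached_rate: float, overhead_multiplier: float = 1.0) -> float:
    """Steps 1-5: per-request tokens -> monthly streams -> rates -> hidden multipliers."""
    cached = requests_per_month * cached_tokens * cached_rate
    uncached = requests_per_month * uncached_tokens * input_rate
    output = requests_per_month * output_tokens * output_rate
    base = (cached + uncached + output) / 1_000_000
    return base * overhead_multiplier


# Claude Haiku 4.5 list rates; the workload numbers are illustrative assumptions.
forecast = forecast_monthly_cost(
    cached_tokens=800, uncached_tokens=1_500, output_tokens=300,
    requests_per_month=200_000,
    input_rate=1.00, output_rate=5.00, cached_rate=0.10,
    overhead_multiplier=1.25,  # assumed allowance for retries, tool calls, agent loops
)

# Step 6: compare against the bill after real traffic (hypothetical figure).
actual = 900.00
gap = actual - forecast
print(f"forecast: ${forecast:,.2f}  actual: ${actual:,.2f}  gap: ${gap:,.2f} ({gap / forecast:.0%})")
```

A single overhead multiplier is a deliberate simplification. Once real logs exist, it is usually worth replacing it with measured retry rates and calls per answer.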

For a deeper treatment of how token consumption maps to value and budget, see the token economics pillar, and for the broader playbook see AI cost optimization. The same forecasting discipline that governs cloud spend - estimate, attribute, govern - applies directly here; teams already doing it for infrastructure (for example via our OCI cost optimization guide) extend it to LLM tokens with the same muscles.

2026 LLM pricing trends to budget around

Three observable shifts should shape any 2026 model budget:

Flagship prices diverged. OpenAI raised its top tier from GPT-5 ($1.25/$10) to GPT-5.5 ($5/$30), while Anthropic cut Opus to roughly a third of its earlier-generation price (Opslyft pricing observation, 2026). You can no longer assume "newer flagship equals higher price" across vendors; the direction now depends on the provider's market strategy.
Cached input became the standard discount lever. A ~90% cache discount is now common, with DeepSeek V4 at ~98% (Opslyft pricing observation, 2026). Any 2026 cost model that ignores cache economics overstates spend by a wide margin.
Context windows standardized near 1M tokens for frontier models, pushing 128K into the older-model band (Opslyft pricing observation, 2026). Larger windows enable more context per call - and, if used carelessly, more billed input per call, which loops back to the cost-per-answer point.

A related caution from the value tier: prices can rise from one generation to the next. Gemini 3.5 Flash ($1.50/$9) costs 5x the input rate of Gemini 2.5 Flash ($0.30/$2.50), so upgrading to the newest "fast" model is not automatically a cost saving. Always re-check the rate when you migrate model versions.
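
A quick way to run that re-check is to price the same monthly token volume at both rate cards before migrating. The workload below (1,000M input and 200M output tokens, no caching) is a hypothetical example, and the Gemini 3.5 Flash figures are the preview rates from the master table.

```python
# List rates in USD per 1M tokens, from the master table.
OLD_RATES = {"input": 0.30, "output": 2.50}   # Gemini 2.5 Flash
NEW_RATES = {"input": 1.50, "output": 9.00}   # Gemini 3.5 Flash (preview pricing)


def monthly_cost(rates: dict[str, float], input_millions: float, output_millions: float) -> float:
    """USD per month for a workload expressed in millions of tokens."""
    return input_millions * rates["input"] + output_millions * rates["output"]


# Hypothetical workload, uncached for simplicity.
old_cost = monthly_cost(OLD_RATES, input_millions=1_000, output_millions=200)
new_cost = monthly_cost(NEW_RATES, input_millions=1_000, output_millions=200)
multiplier = new_cost / old_cost
print(f"before: ${old_cost:,.0f}  after: ${new_cost:,.0f}  ({multiplier:.2f}x)")
```

The workload-level multiplier lands between the input and output price ratios, weighted by how input-heavy the traffic is.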

Conclusion

The 2026 LLM market gives buyers a real spread to work with - roughly 140x between the cheapest and most expensive input rates, a frontier-class price leader in DeepSeek V4 Flash, and a ~90% cache discount that most teams under-exploit. Use the master table to shortlist by tier and price. But stop the comparison there and you will mis-budget, because the rate card omits the four multipliers - retries, tool calls, context bloat, and reasoning tokens - that decide the real number. Compare per token to choose a model; budget per answer to control the bill; and attribute every token to the value it produces, because that is the only view that survives contact with production traffic.

FAQs

What is the cheapest LLM API in 2026?

On raw input cost, Amazon Nova Micro is cheapest at $0.035/M input and $0.14/M output, followed by Nova Lite and Gemini 2.5 Flash-Lite. Among frontier-class models, DeepSeek V4 Flash leads at $0.14/$0.28. But the cheapest sticker rate is not always the cheapest delivered answer.

How is LLM pricing calculated per token?

Providers meter input and output tokens separately and bill per million. Your bill is (input tokens x input rate) + (output tokens x output rate), with cached input billed at a lower cache-read rate where supported. Your effective input rate is a blend of standard and cached rates weighted by your cache-hit ratio.

Why is output priced higher than input on LLM APIs?

Output is typically priced 2x to 10x above input because it is generated sequentially, one token at a time in a decode loop that is typically limited by memory bandwidth, while input is processed in a single parallel prefill pass. Capping output length is therefore a high-leverage cost control.

How much can prompt caching cut LLM costs?

Most providers discount cache reads by about 90% versus standard input and DeepSeek V4 by roughly 98%, while some older models discount less (GPT-4o only 50%). Putting stable, byte-identical content at the front of the prompt maximizes cache hits and can cut input cost by more than half.

