“What’s the cheapest LLM API?” is the wrong question. There is no single answer, because input and output tokens are priced separately — usually 4 to 6 times apart — and your workload decides which one dominates the bill. The same provider that’s cheapest for classification can be 10x more expensive for reasoning. Below: the verified cheapest model for each of seven common workloads, with the math that shows why the winner changes.
The cheapest model is a function of your workload, not a leaderboard
Every “cheapest LLM” ranking you’ve seen sorts by a single number — usually input price — and calls it a day. That’s how you end up paying 10x too much.
Here’s the mechanism. A high-volume classification job sends 500 tokens of input and emits a 20-token label. Input is ~96% of the tokens and, at the prices in use case 1, roughly 80-93% of the cost; output is a minor line item. A reasoning agent does the opposite: a short prompt, then thousands of tokens of chain-of-thought output. Now output price is the whole game. A model that’s cheapest for the first job can be the most expensive for the second.
Four levers move the real bill, in roughly this order of impact:
- Input/output ratio — which token type dominates your spend.
- Cache reuse — whether your prompts share a long, stable prefix (system prompt, codebase, document set). A 90-98% cache discount is the single biggest lever for agents.
- Context size — a few models surcharge requests above 200K tokens. Crossing that line can double the bill on the same call.
- Batch tolerance — if you can wait up to 24 hours, the Batch API is a flat 50% off almost everywhere.
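The four levers can be folded into one per-request estimate. What follows is a minimal sketch, not any vendor's billing formula: the function name `request_cost`, its parameters, and the way the discounts stack are illustrative assumptions. In particular it assumes the cache discount applies only to the cached share of input tokens, and that a long-context surcharge and a batch discount both scale the whole request; real providers differ on whether these combine.

```python
from typing import Optional


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,                    # USD per 1M input tokens
    output_price: float,                   # USD per 1M output tokens
    cached_price: Optional[float] = None,  # USD per 1M cached input tokens
    cache_hit_rate: float = 0.0,           # share of input tokens served from cache
    batch_discount: float = 0.0,           # e.g. 0.5 for a flat 50% off
    long_context_multiplier: float = 1.0,  # hypothetical surcharge factor
) -> float:
    """Estimate the USD cost of one request under the four levers."""
    if cached_price is None:
        # No cache pricing available: bill every input token at the full rate.
        cached_price, cache_hit_rate = input_price, 0.0
    blended_input = cache_hit_rate * cached_price + (1 - cache_hit_rate) * input_price
    base = (input_tokens * blended_input + output_tokens * output_price) / 1_000_000
    # Assumption: surcharge and batch discount both scale the whole request.
    return base * long_context_multiplier * (1 - batch_discount)


# Classification shape (500 input + 20 output tokens) at $0.10/M in, $0.40/M out.
live = request_cost(500, 20, input_price=0.10, output_price=0.40)
batched = request_cost(500, 20, input_price=0.10, output_price=0.40, batch_discount=0.5)
print(f"live: ${live:.6f}  batched: ${batched:.6f}")
```

Keeping each lever as a separate argument makes it easy to see which one moves a given workload: change one argument at a time and compare.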
So we’ll go use case by use case. Every price below was verified against the vendor’s own pricing page on 2026-06-02, and the tables pull live from our model pages so they don’t drift from this prose.
Use case 1: High-volume classification and extraction
You’re tagging support tickets, scoring sentiment, extracting fields from invoices, or routing emails. The pattern: a lot of input, a tiny structured output (a label, a JSON stub, a number). Input price is the only number that matters here.
| Model | Vendor | Input /M | Output /M | Context |
|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.40 | 1M |
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.28 | 1M |
| GPT-5.4 mini | OpenAI | $0.75 | $4.50 | 400K |
| Claude Haiku 4.5 | Anthropic | $1 | $5 | 200K |
Run the math on 1 million items at 500 input + 20 output tokens each (500M input tokens, 20M output tokens total):
| Model | Input cost | Output cost | Total |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | $50.00 | $8.00 | $58.00 |
| DeepSeek V4-Flash | $70.00 | $5.60 | $75.60 |
| GPT-5.4-mini | $375.00 | $90.00 | $465.00 |
| Claude Haiku 4.5 | $500.00 | $100.00 | $600.00 |
Same job. Flash-Lite costs roughly 10x less than Haiku 4.5 — and the gap is almost entirely on the input side. Winner: Gemini 2.5 Flash-Lite at $0.10/M input, with DeepSeek V4-Flash a close second that pulls ahead the moment caching enters the picture (use case 3).
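The cost table above can be reproduced in a few lines. This is a sketch using the list prices from this section; the helper name `job_cost` is ours, and it ignores caching, batch discounts, and any per-request overhead.

```python
# Prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "DeepSeek V4-Flash": (0.14, 0.28),
    "GPT-5.4-mini": (0.75, 4.50),
    "Claude Haiku 4.5": (1.00, 5.00),
}


def job_cost(items, in_tokens, out_tokens, in_price, out_price):
    """Return (input_cost, output_cost, total) in USD for a bulk job."""
    input_cost = items * in_tokens * in_price / 1_000_000
    output_cost = items * out_tokens * out_price / 1_000_000
    return input_cost, output_cost, input_cost + output_cost


# 1 million items at 500 input + 20 output tokens each.
for model, (in_price, out_price) in PRICES.items():
    inp, out, total = job_cost(1_000_000, 500, 20, in_price, out_price)
    print(f"{model:24s} input ${inp:7.2f}  output ${out:7.2f}  total ${total:7.2f}")
```

Swapping in your own item count and token sizes is the quickest way to check whether the ranking holds for your job.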
Use case 2: Chatbots and conversational assistants
A customer-support bot or a personal assistant sends a moderate prompt and emits a few hundred tokens of conversational reply. The input/output ratio is closer to balanced, so output price starts to matter — and you can’t go as cheap as classification, because quality and latency are now user-facing.
| Model | Vendor | Input /M | Output /M | Cached input /M |
|---|---|---|---|---|
| Gemini 2.5 Flash | Google | $0.30 | $2.50 | $0.03 |
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.28 | $0.0028 |
| Claude Haiku 4.5 | Anthropic | $1 | $5 | $0.10 |
| MiniMax M2.7 | MiniMax | $0.30 | $1.20 | $0.06 |
For a turn of ~1,500 input + 400 output tokens, Gemini 2.5 Flash ($0.30/M in, $2.50/M out) costs about $0.0015/turn, or roughly 3,500 reply turns per $5. DeepSeek V4-Flash is cheaper still on raw tokens (about $0.0003 for the same turn), and Claude Haiku 4.5 is the pick if you’re already on the Anthropic stack and want its instruction-following at the low end. Winner: it’s a practical tie between Gemini 2.5 Flash and DeepSeek V4-Flash. Choose on latency and quality in your own eval, not on the price sheet, because the gap is about a tenth of a cent per turn.
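The per-turn arithmetic is simple enough to keep in a helper. This sketch uses the list prices from the table above and ignores caching; `turn_cost` and `turns_per_budget` are illustrative names, not part of any SDK.

```python
def turn_cost(in_tokens, out_tokens, in_price, out_price):
    """USD cost of one chat turn; prices are USD per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


def turns_per_budget(budget_usd, cost_per_turn):
    """Whole number of turns a budget covers at a fixed per-turn cost."""
    return int(budget_usd // cost_per_turn)


# A turn of 1,500 input + 400 output tokens.
gemini_flash = turn_cost(1_500, 400, 0.30, 2.50)
deepseek_flash = turn_cost(1_500, 400, 0.14, 0.28)

print(f"Gemini 2.5 Flash:  ${gemini_flash:.5f}/turn, "
      f"{turns_per_budget(5, gemini_flash)} turns per $5")
print(f"DeepSeek V4-Flash: ${deepseek_flash:.5f}/turn, "
      f"{turns_per_budget(5, deepseek_flash)} turns per $5")
```

Real conversations grow as history accumulates, so a production estimate would feed in measured token counts per turn rather than a single fixed shape.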
Use case 3: Coding agents (where caching is the whole game)
A coding agent re-sends the same large context every turn — system prompt, tool definitions, file tree, the open files. That prefix is stable across the session, which makes it perfectly cacheable. The model with the deepest cache discount wins, and it isn’t close.
| Model | Vendor | Input /M | Cached input /M | Cache discount | Output /M |
|---|---|---|---|---|---|
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.0028 | -98% | $0.28 |
| Claude Haiku 4.5 | Anthropic | $1 | $0.10 | -90% | $5 |
| Claude Sonnet 4.6 | Anthropic | $3 | $0.30 | -90% | $15 |
DeepSeek V4-Flash’s cache hit is $0.0028/M — 98% below its $0.14/M input price. Take a realistic agent turn of 30K input (mostly cached) + 2K output, at the 82% cache-hit rate we model in the cornerstone post on prompt-caching math:
- DeepSeek V4-Flash: blended input ~$0.0275/M → ~$0.0008/turn input + ~$0.0006/turn output = ~$0.0014/turn
- Claude Sonnet 4.6: blended input ~$0.786/M → ~$0.024/turn input + $0.030/turn output = ~$0.054/turn
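That’s nearly 40x cheaper per turn. But here’s the honest part the price sheet won’t tell you: DeepSeek is the cheapest capable-enough model for this work, not the strongest. It trails the frontier tiers on hard reasoning and agentic benchmarks, so measure it on your own tasks before switching.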
If you’re optimizing a Claude Code workflow specifically, output reduction stacks on top of caching; see the breakdown of caveman mode’s real savings. Caching attacks the input that gets re-sent every turn; output trimming attacks the other half of the bill.
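The per-turn figures for the two agents above come from a blended input price: the cached rate weighted by the hit rate, plus the full rate on the remainder. The sketch below reproduces that arithmetic and then sweeps the hit rate. It is a simplification: it assumes no surcharge for writing to the cache and no cache storage fee, which some providers do charge, and the function names are ours.

```python
def blended_input_price(input_price, cached_price, hit_rate):
    """Effective USD per 1M input tokens at a given cache-hit rate."""
    return hit_rate * cached_price + (1 - hit_rate) * input_price


def agent_turn_cost(in_tokens, out_tokens, input_price, cached_price,
                    output_price, hit_rate):
    """USD cost of one agent turn with part of the input served from cache."""
    blended = blended_input_price(input_price, cached_price, hit_rate)
    return (in_tokens * blended + out_tokens * output_price) / 1_000_000


# 30K input + 2K output per turn, swept across cache-hit rates.
for hit_rate in (0.0, 0.5, 0.82, 0.95):
    deepseek = agent_turn_cost(30_000, 2_000, 0.14, 0.0028, 0.28, hit_rate)
    sonnet = agent_turn_cost(30_000, 2_000, 3.00, 0.30, 15.00, hit_rate)
    print(f"hit rate {hit_rate:.0%}: DeepSeek ${deepseek:.4f}  "
          f"Sonnet ${sonnet:.4f}  ratio {sonnet / deepseek:.1f}x")
```

Sweeping the hit rate shows how much of the saving depends on the prefix actually staying stable; a session that keeps invalidating its cache lands closer to the 0% row.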
Use case 4: Long-context RAG and document Q&A
You stuff retrieved documents — or an entire codebase, contract, or transcript — into a single call. The input is enormous and the output is small. You want the cheapest input price that comes with a large enough context window, and you want to watch for context-tier surcharges.
| Model | Vendor | Input /M | Cached input /M | Context |
|---|---|---|---|---|
| Gemini 2.5 Flash-Lite | Google | $0.10 | $0.01 | 1M |
| DeepSeek V4 Flash | DeepSeek | $0.14 | $0.0028 | 1M |
| Qwen 3.5 Flash | Alibaba (Qwen) | $0.10 | n/a | 1M |
| Gemini 2.5 Pro | Google | $1.25 | $0.13 | 2M |
For a 200K-token RAG call with a 1K-token answer, Gemini 2.5 Flash-Lite costs about $0.02/call ($0.10/M input × 200K = $0.020). DeepSeek V4-Flash is ~$0.028, and its $0.0028/M cache hit makes a re-queried document set nearly free. Qwen3.5-Flash matches Flash-Lite at $0.10/M with a 1M window. Winner: Gemini 2.5 Flash-Lite, unless you re-query the same corpus repeatedly — then DeepSeek’s cache wins outright.
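How many repeat queries it takes for DeepSeek's cache to overtake Flash-Lite's lower list price can be estimated directly. This is a rough sketch under loud assumptions: the first query over a corpus pays the full input rate, every later query is a 100% cache hit, there are no cache write or storage fees, and output cost is ignored because the answers are short. The function names are illustrative.

```python
def corpus_cost(n_queries, context_tokens, input_price, cached_price):
    """USD input cost of n queries over one corpus: first call cold, rest cached."""
    cold = context_tokens * input_price / 1_000_000
    warm = context_tokens * cached_price / 1_000_000
    return cold + (n_queries - 1) * warm


def break_even_queries(context_tokens, a, b, max_queries=10_000):
    """Smallest n where model b's total is at or below model a's.

    a and b are (input_price, cached_price) tuples in USD per 1M tokens.
    Returns None if b never catches up within max_queries.
    """
    for n in range(1, max_queries + 1):
        if corpus_cost(n, context_tokens, *b) <= corpus_cost(n, context_tokens, *a):
            return n
    return None


flash_lite = (0.10, 0.01)    # Gemini 2.5 Flash-Lite, from the table above
deepseek = (0.14, 0.0028)    # DeepSeek V4-Flash, from the table above
print(break_even_queries(200_000, flash_lite, deepseek))
```

Under these assumptions the crossover arrives at the seventh query over the same 200K-token corpus; partial cache hits or storage fees would push it later.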
Use case 5: Bulk and overnight batch jobs
Embeddings backfills, dataset labeling, document summarization at scale, eval runs — anything that doesn’t need a synchronous answer. If you can wait, this is the cheapest tier on the entire board.
The decision rule is simple: if a job has no human waiting on it, it should almost never run on the synchronous tier. Batching a million-document summarization job instead of streaming it through the live API halves the bill for zero engineering cost beyond using a different endpoint. Winner: Gemini 2.5 Flash-Lite on Batch for general work; DeepSeek V4 off-peak if you’re already on DeepSeek and your jobs cluster in the UTC evening.
Use case 6: Reasoning, planning, and agentic loops
Now output dominates. A reasoning model emits long chains of thought, and that output is billed at the full rate. The cheap-input models from use case 1 are no longer the answer — you need a model that’s actually good at multi-step reasoning, as cheaply as that’s available.
| Model | Vendor | Input /M | Output /M | Cached input /M |
|---|---|---|---|---|
| DeepSeek V4 Pro | DeepSeek | $0.43 | $0.87 | $0.0036 |
| Gemini 2.5 Pro | Google | $1.25 | $10 | $0.13 |
| GPT-5.5 | OpenAI | $5 | $30 | $0.50 |
DeepSeek V4-Pro sits at $0.435/M input (shown rounded to $0.43 in the table), $0.87/M output, and ~$0.0036/M on cache hits. This is its standing rate after DeepSeek made its end-of-May price cut permanent. Compare the output side, which is what you’ll actually pay for on reasoning: V4-Pro’s $0.87/M is 34x below GPT-5.5’s $30/M and ~11x below Gemini 2.5 Pro’s $10/M.
The tradeoff is the same one as the coding case: V4-Pro is the cheapest capable reasoner, not the strongest. For most agentic planning, structured reasoning, and tool-use loops it’s more than enough and the savings are enormous. Reserve GPT-5.5 or Gemini 2.5 Pro for the tasks where you’ve measured that the frontier actually wins. Winner: DeepSeek V4-Pro on cost-per-capability; frontier tiers only where an eval justifies the 10x+ premium.
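To see how completely output dominates a reasoning call, it helps to compute the output share of the bill. The sketch below uses the prices quoted in this section; the call shape (a 1K-token prompt with 8K tokens of reasoning output) is a hypothetical example, not a measured workload.

```python
# (input USD/M, output USD/M), from the prices quoted in this section.
REASONERS = {
    "DeepSeek V4-Pro": (0.435, 0.87),
    "Gemini 2.5 Pro": (1.25, 10.00),
    "GPT-5.5": (5.00, 30.00),
}


def call_cost_and_output_share(in_tokens, out_tokens, in_price, out_price):
    """Return (USD cost, fraction of that cost due to output tokens)."""
    in_cost = in_tokens * in_price / 1_000_000
    out_cost = out_tokens * out_price / 1_000_000
    total = in_cost + out_cost
    return total, out_cost / total


# Hypothetical reasoning call: 1K-token prompt, 8K tokens of output.
for model, (in_price, out_price) in REASONERS.items():
    cost, share = call_cost_and_output_share(1_000, 8_000, in_price, out_price)
    print(f"{model:16s} ${cost:.4f}/call  output share {share:.0%}")
```

At this shape more than nine tenths of every bill is output, which is why sorting reasoners by input price is misleading.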
Use case 7: Free tier — for prototypes, not production
Several models are genuinely $0/M, not trial-limited:
| Model | Vendor | Input /M | Output /M | Context |
|---|---|---|---|---|
| GLM-4.7 Flash | Zhipu (Z.ai / GLM) | $0 | $0 | 200K |
| Hunyuan Lite | Tencent (Hunyuan) | $0 | $0 | documented elsewhere |
| Leanstral | Mistral | $0 | $0 | 128K |
GLM-4.7-Flash is free on Z.ai with a 203K context window and no credit card required. The catch is in the limits, not the price: the free tier is capped at 1,000 requests/day with a concurrency of 1. That’s a fine budget for a prototype, a side project, or a low-traffic internal tool — and useless the moment you need throughput.
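A prototype that leans on a capped free tier is easier to reason about with a client-side guard that stops before the provider does. The class below is a hypothetical sketch: `DailyQuota` is our own name, the default of 1,000 comes from the limit quoted above, and the moment at which the provider's counter actually resets is an assumption the caller supplies through the `today` argument.

```python
from datetime import date


class DailyQuota:
    """Client-side counter that refuses calls once a daily cap is reached."""

    def __init__(self, limit: int = 1_000):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today: date) -> bool:
        """Return True and count the request if quota remains for `today`."""
        if today != self.day:
            # New day: reset the counter (assumed reset boundary).
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


quota = DailyQuota()
if quota.try_acquire(date.today()):
    pass  # send the request here, one at a time (concurrency of 1)

# Spread evenly, 1,000 requests/day is one request every 86.4 seconds.
print(86_400 / 1_000)
```

With a concurrency of 1, requests also have to be sent sequentially, so a single slow response delays everything queued behind it.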
The cheapest number on the page is usually a trap
The lowest price in any comparison is almost always a model you can’t actually use for the job in front of you. The $0/M tiers are rate-capped, and the GLM and Hunyuan ones are CN-hosted. The cheapest-input model loses to a pricier one the moment output or reasoning dominates. The headline-cheap reasoner trails the frontier on the 5% of tasks that are genuinely hard.
“Cheapest” only means something once you’ve pinned down the workload. So the order of operations is:
- Profile your token mix. Pull a week of real usage and compute your input/output ratio. This alone tells you whether to optimize for input price (classification, RAG) or output price (chat, reasoning).
- Check cache reuse. If prompts share a long stable prefix, the cache discount dwarfs every other lever — and DeepSeek’s 98% hit discount likely wins. If they don’t, ignore the cached column entirely.
- Mind the context tier. If you run near 200K tokens, confirm whether your model surcharges above it before you scale.
- Default latency-tolerant jobs to batch. A flat 50% for zero code change is the easiest money you’ll save.
- Confirm access and compliance before price. A model your legal team won’t approve, or whose rate limits throttle your traffic, has an effective price of infinity.
Get those five right and you’re usually within a few percent of the true floor — without overpaying 10x by sorting a leaderboard on a single column.
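The first step, profiling the token mix, is enough on its own to show the winner changing. The sketch below prices a week of usage against a small catalog and returns the cheapest entry. The two price pairs come from the tables in this post; the output-heavy token counts are a hypothetical mix, and the helper names are ours. It looks at list price only, so caching, batch, and context tiers are left out.

```python
# (input USD/M, output USD/M), from the tables in this post.
CATALOG = {
    "Gemini 2.5 Flash-Lite": (0.10, 0.40),
    "DeepSeek V4-Flash": (0.14, 0.28),
}


def mix_cost(input_tokens, output_tokens, in_price, out_price):
    """USD list-price cost of a token mix; prices are USD per 1M tokens."""
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def cheapest_for(input_tokens, output_tokens, catalog=CATALOG):
    """Return the (model, USD cost) pair with the lowest list-price cost."""
    costs = {
        model: mix_cost(input_tokens, output_tokens, *prices)
        for model, prices in catalog.items()
    }
    best = min(costs, key=costs.get)
    return best, costs[best]


print(cheapest_for(500_000_000, 20_000_000))    # input-heavy mix
print(cheapest_for(100_000_000, 400_000_000))   # hypothetical output-heavy mix
```

With these two price pairs the ranking flips once output volume exceeds about one third of input volume, which is the kind of threshold a real usage export makes visible.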
All prices in this post were verified on 2026-06-02 against each vendor’s canonical pricing page — DeepSeek, Google AI, OpenAI, Anthropic, and Z.ai — and the tables above pull live from our model pages, so they update when we re-verify. See our methodology for the verification cadence and how we handle vendor price changes.
Frequently asked questions
What is the cheapest LLM API overall in 2026?
Ignoring access friction, the cheapest priced models are the $0/M free tiers: GLM-4.7-Flash and Hunyuan Lite. Among models a Western team can comfortably put in production, the floor is Gemini 2.5 Flash-Lite at $0.10/M input / $0.40/M output, with DeepSeek V4-Flash ($0.14/$0.28, $0.0028/M cache hit) winning anything cache-heavy. But "cheapest overall" is the wrong question — the cheapest model for your workload depends on its input/output ratio. See the use-case breakdown above.
Why does the cheapest model change by use case?
Because input and output tokens are priced separately, usually 4-6x apart. A classification job sends a lot of input and emits a one-word label, so input price dominates and a cheap-input model wins. A reasoning job emits long chains of output, so output price dominates and the ranking flips. Caching, batch discounts, and context-tier surcharges shift it further. A model that's cheapest for one shape can be 10x more expensive for another.
Is DeepSeek really that much cheaper than Claude or GPT?
On price, yes: DeepSeek V4-Flash input ($0.14/M) is roughly 1/36th of GPT-5.5's $5/M, and its cache hit is 98% below its own input price. On capability, no: it trails the frontier tiers on hard reasoning and agentic benchmarks. The honest framing is that DeepSeek is the cheapest capable-enough model for high-volume and cache-heavy work, not a drop-in replacement for a frontier model on every task. Match the tier to the task.
Do the free LLM APIs have a catch?
Two. First, rate limits: GLM-4.7-Flash's free tier on Z.ai is capped at 1,000 requests/day with a concurrency of 1 — fine for a prototype, useless at production scale. Second, data residency: the genuinely-free models are CN-hosted, which is a compliance non-starter for many teams routing customer data. Free on the price sheet is not free of constraints.
How much does the Batch API actually save?
A flat 50% on most providers (Google, OpenAI, Anthropic) in exchange for asynchronous, up-to-24-hour turnaround. That makes batch the cheapest accessible tier on the board: Gemini 2.5 Flash-Lite falls to $0.05/M input / $0.20/M output. DeepSeek stacks a separate lever — a 50%-off off-peak window from 16:30 to 00:30 UTC that applies automatically on the direct API. If your job tolerates latency, batch first.
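Because the off-peak window crosses midnight UTC, a scheduler that checks it needs a wraparound comparison rather than a simple range test. A minimal sketch, assuming the 16:30 to 00:30 UTC window quoted above, a start-inclusive and end-exclusive boundary (our assumption), and timezone-aware datetimes as input:

```python
from datetime import datetime, time, timezone

OFF_PEAK_START = time(16, 30)  # UTC, from the window quoted above
OFF_PEAK_END = time(0, 30)     # UTC, on the following day


def in_off_peak(moment: datetime) -> bool:
    """True if a timezone-aware `moment` falls inside the off-peak window."""
    t = moment.astimezone(timezone.utc).time()
    # The window wraps past midnight, so it is the union of two ranges:
    # [16:30, 24:00) and [00:00, 00:30).
    return t >= OFF_PEAK_START or t < OFF_PEAK_END


now = datetime.now(timezone.utc)
print("run now" if in_off_peak(now) else "hold until 16:30 UTC")
```

Passing a naive datetime would make `astimezone` interpret it as local time, so callers should attach a timezone explicitly.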