Cheapest LLM API by use case (2026 guide)

TL;DR

There is no single cheapest LLM API. The cheapest model is whichever one fits your workload's input/output ratio, cache reuse, context size, and batch tolerance — these swing the bill 10x for the same task.
For high-volume classification (tiny output), Gemini 2.5 Flash-Lite at $0.10/M input wins. Processing 1M tickets costs ~$58 on Flash-Lite vs ~$600 on Claude Haiku 4.5 — same job, 10x the spend.
For coding agents, DeepSeek V4-Flash's $0.0028/M cache hit (98% below its own input price) makes a cached turn cost ~$0.0014 vs ~$0.054 on Sonnet 4.6 — nearly 40x cheaper, at a real capability cost.
The absolute cheapest accessible per-token tier is batch: Gemini 2.5 Flash-Lite drops to $0.05/$0.20 on the Batch API, and DeepSeek V4 is 50% off in its 16:30-00:30 UTC off-peak window.
The $0/M models (GLM-4.7-Flash, Hunyuan Lite) are real but rate-capped and CN-hosted — fine for prototypes, not for production data.

“What’s the cheapest LLM API?” is the wrong question. There is no single answer, because input and output tokens are priced separately — usually 4 to 6 times apart — and your workload decides which one dominates the bill. The same provider that’s cheapest for classification can be 10x more expensive for reasoning. Below: the verified cheapest model for each of seven common workloads, with the math that shows why the winner changes.

The cheapest model is a function of your workload, not a leaderboard

Every “cheapest LLM” ranking you’ve seen sorts by a single number — usually input price — and calls it a day. That’s how you end up paying 10x too much.

Here’s the mechanism. A high-volume classification job sends 500 tokens of input and emits a 20-token label. Input price is ~96% of the cost; output price is a rounding error. A reasoning agent does the opposite: a short prompt, then thousands of tokens of chain-of-thought output. Now output price is the whole game. A model that’s cheapest for the first job can be the most expensive for the second.

Four levers move the real bill, in roughly this order of impact:

Input/output ratio — which token type dominates your spend.
Cache reuse — whether your prompts share a long, stable prefix (system prompt, codebase, document set). A 90-98% cache discount is the single biggest lever for agents.
Context size — a few models surcharge requests above 200K tokens. Crossing that line can double the bill on the same call.
Batch tolerance — if you can wait up to 24 hours, the Batch API is a flat 50% off almost everywhere.

So we’ll go use case by use case. Every price below was verified against the vendor’s own pricing page on 2026-06-02, and the tables pull live from our model pages so they don’t drift from this prose.

Use case 1: High-volume classification and extraction

You’re tagging support tickets, scoring sentiment, extracting fields from invoices, or routing emails. The pattern: a lot of input, a tiny structured output (a label, a JSON stub, a number). Input price is the only number that matters here.

Model	Input /M	Output /M	Context
Gemini 2.5 Flash-Lite Google	$0.10	$0.40	1M
DeepSeek V4 Flash DeepSeek	$0.14	$0.28	1M
GPT-5.4 mini OpenAI	$0.75	$4.50	400K
Claude Haiku 4.5 Anthropic	$1	$5	200K

Cheap models for input-dominant work. Output price is almost irrelevant at this shape.

Run the math on 1 million items at 500 input + 20 output tokens each (500M input tokens, 20M output tokens total):

Model	Input cost	Output cost	Total
Gemini 2.5 Flash-Lite	$50.00	$8.00	$58.00
DeepSeek V4-Flash	$70.00	$5.60	$75.60
GPT-5.4-mini	$375.00	$90.00	$465.00
Claude Haiku 4.5	$500.00	$100.00	$600.00

Same job. Flash-Lite costs roughly 10x less than Haiku 4.5 — and the gap is almost entirely on the input side. Winner: Gemini 2.5 Flash-Lite at $0.10/M input, with DeepSeek V4-Flash a close second that pulls ahead the moment caching enters the picture (use case 3).

Use case 2: Chatbots and conversational assistants

A customer-support bot or a personal assistant sends a moderate prompt and emits a few hundred tokens of conversational reply. The input/output ratio is closer to balanced, so output price starts to matter — and you can’t go as cheap as classification, because quality and latency are now user-facing.

Model	Input /M	Output /M	Cached input /M
Gemini 2.5 Flash Google	$0.30	$2.50	$0.03
DeepSeek V4 Flash DeepSeek	$0.14	$0.28	$0.0028
Claude Haiku 4.5 Anthropic	$1	$5	$0.10
MiniMax M2.7 MiniMax	$0.30	$1.20	$0.06

Balanced conversational tiers. Output price is now ~5x input, so it moves the bill.

For a turn of ~1,500 input + 400 output tokens, Gemini 2.5 Flash ($0.30/M in, $2.50/M out) costs about $0.0015/turn — roughly 3,500 reply turns per $5. DeepSeek V4-Flash is cheaper still on raw tokens, and Claude Haiku 4.5 is the pick if you’re already on the Anthropic stack and want its instruction-following at the low end. Winner: it’s a genuine tie between Gemini 2.5 Flash and DeepSeek V4-Flash — choose on latency and quality in your own eval, not on the price sheet, because at this volume the difference is cents.

Use case 3: Coding agents (where caching is the whole game)

A coding agent re-sends the same large context every turn — system prompt, tool definitions, file tree, the open files. That prefix is stable across the session, which makes it perfectly cacheable. The model with the deepest cache discount wins, and it isn’t close.

Model	Input /M	Cached input /M	Cache discount	Output /M
DeepSeek V4 Flash DeepSeek	$0.14	$0.0028	-98%	$0.28
Claude Haiku 4.5 Anthropic	$1	$0.10	-90%	$5
Claude Sonnet 4.6 Anthropic	$3	$0.30	-90%	$15

Cache discount is the decisive number for agents. DeepSeek V4-Flash cuts 98% off its own input price on a hit.

DeepSeek V4-Flash’s cache hit is $0.0028/M — 98% below its $0.14/M input price. Take a realistic agent turn of 30K input (mostly cached) + 2K output, at the 82% cache-hit rate we model in the cornerstone post on prompt-caching math:

DeepSeek V4-Flash: blended input ~$0.0275/M → ~$0.0008/turn input + ~$0.0006/turn output = ~$0.0014/turn
Claude Sonnet 4.6: blended input ~$0.786/M → ~$0.024/turn input + $0.030/turn output = ~$0.054/turn

That’s nearly 40x cheaper per turn. But here’s the honest part the price sheet won’t tell you:

If you’re optimizing a Claude Code workflow specifically, output reduction stacks on top of caching — see the breakdown of caveman mode’s real savings. Caching attacks the input rewrite; output trimming attacks the other half of the bill.

Use case 4: Long-context RAG and document Q&A

You stuff retrieved documents — or an entire codebase, contract, or transcript — into a single call. The input is enormous and the output is small. You want the cheapest input price that comes with a large enough context window, and you want to watch for context-tier surcharges.

Model	Input /M	Cached input /M	Context
Gemini 2.5 Flash-Lite Google	$0.10	$0.01	1M
DeepSeek V4 Flash DeepSeek	$0.14	$0.0028	1M
Qwen 3.5 Flash Alibaba (Qwen)	$0.10	n/a	1M
Gemini 2.5 Pro Google	$1.25	$0.13	2M

Cheap input at 1M+ context. Note: Gemini 2.5 Pro reprices above 200K tokens (see warning).

For a 200K-token RAG call with a 1K-token answer, Gemini 2.5 Flash-Lite costs about $0.02/call ($0.10/M input × 200K = $0.020). DeepSeek V4-Flash is ~$0.028, and its $0.0028/M cache hit makes a re-queried document set nearly free. Qwen3.5-Flash matches Flash-Lite at $0.10/M with a 1M window. Winner: Gemini 2.5 Flash-Lite, unless you re-query the same corpus repeatedly — then DeepSeek’s cache wins outright.

Use case 5: Bulk and overnight batch jobs

Embeddings backfills, dataset labeling, document summarization at scale, eval runs — anything that doesn’t need a synchronous answer. If you can wait, this is the cheapest tier on the entire board.

The decision rule is simple: if a job has no human waiting on it, it should almost never run on the synchronous tier. Batching a million-document summarization job instead of streaming it through the live API halves the bill for zero engineering cost beyond using a different endpoint. Winner: Gemini 2.5 Flash-Lite on Batch for general work; DeepSeek V4 off-peak if you’re already on DeepSeek and your jobs cluster in the UTC evening.

Use case 6: Reasoning, planning, and agentic loops

Now output dominates. A reasoning model emits long chains of thought, and that output is billed at the full rate. The cheap-input models from use case 1 are no longer the answer — you need a model that’s actually good at multi-step reasoning, as cheaply as that’s available.

Model	Input /M	Output /M	Cached input /M
DeepSeek V4 Pro DeepSeek	$0.43	$0.87	$0.0036
Gemini 2.5 Pro Google	$1.25	$10	$0.13
GPT-5.5 OpenAI	$5	$30	$0.50

Reasoning-capable tiers. DeepSeek V4-Pro is the budget floor; the frontier costs 10x+ on output.

DeepSeek V4-Pro sits at $0.435/M input, $0.87/M output (cache hits ~$0.0036/M) — this is its standing rate after DeepSeek made its end-of-May price cut permanent. Compare the output side, which is what you’ll actually pay for on reasoning: V4-Pro’s $0.87/M is 34x below GPT-5.5’s $30/M and ~11x below Gemini 2.5 Pro’s $10/M.

The tradeoff is the same one as the coding case: V4-Pro is the cheapest capable reasoner, not the strongest. For most agentic planning, structured reasoning, and tool-use loops it’s more than enough and the savings are enormous. Reserve GPT-5.5 or Gemini 2.5 Pro for the tasks where you’ve measured that the frontier actually wins. Winner: DeepSeek V4-Pro on cost-per-capability; frontier tiers only where an eval justifies the 10x+ premium.

Use case 7: Free tier — for prototypes, not production

Several models are genuinely $0/M, not trial-limited:

Model	Input /M	Output /M	Context
GLM-4.7 Flash Zhipu (Z.ai / GLM)	$0	$0	200K
Hunyuan Lite Tencent (Hunyuan)	$0	$0	documented elsewhere
Leanstral Mistral	$0	$0	128K

Genuinely free models. Real $0/M — but read the constraints before you build on them.

GLM-4.7-Flash is free on Z.ai with a 203K context window and no credit card required. The catch is in the limits, not the price: the free tier is capped at 1,000 requests/day with a concurrency of 1. That’s a fine budget for a prototype, a side project, or a low-traffic internal tool — and useless the moment you need throughput.

The cheapest number on the page is usually a trap

The lowest price in any comparison is almost always a model you can’t actually use for the job in front of you. The $0/M tiers are rate-capped and CN-hosted. The cheapest-input model loses to a pricier one the moment output or reasoning dominates. The headline-cheap reasoner trails the frontier on the 5% of tasks that are genuinely hard.

“Cheapest” only means something once you’ve pinned down the workload. So the order of operations is:

Profile your token mix. Pull a week of real usage and compute your input/output ratio. This alone tells you whether to optimize for input price (classification, RAG) or output price (chat, reasoning).
Check cache reuse. If prompts share a long stable prefix, the cache discount dwarfs every other lever — and DeepSeek’s 98% hit discount likely wins. If they don’t, ignore the cached column entirely.
Mind the context tier. If you run near 200K tokens, confirm whether your model surcharges above it before you scale.
Default latency-tolerant jobs to batch. A flat 50% for zero code change is the easiest money you’ll save.
Confirm access and compliance before price. A model your legal team won’t approve, or whose rate limits throttle your traffic, has an effective price of infinity.

Get those five right and you’re usually within a few percent of the true floor — without overpaying 10x by sorting a leaderboard on a single column.

All prices in this post were verified on 2026-06-02 against each vendor’s canonical pricing page — DeepSeek, Google AI, OpenAI, Anthropic, and Z.ai — and the tables above pull live from our model pages, so they update when we re-verify. See our methodology for the verification cadence and how we handle vendor price changes.

Frequently asked questions

What is the cheapest LLM API overall in 2026?

Ignoring access friction, the cheapest priced models are the $0/M free tiers: GLM-4.7-Flash and Hunyuan Lite. Among models a Western team can comfortably put in production, the floor is Gemini 2.5 Flash-Lite at $0.10/M input / $0.40/M output, with DeepSeek V4-Flash ($0.14/$0.28, $0.0028/M cache hit) winning anything cache-heavy. But "cheapest overall" is the wrong question — the cheapest model for your workload depends on its input/output ratio. See the use-case breakdown above.

Why does the cheapest model change by use case?

Because input and output tokens are priced separately, usually 4-6x apart. A classification job sends a lot of input and emits a one-word label, so input price dominates and a cheap-input model wins. A reasoning job emits long chains of output, so output price dominates and the ranking flips. Caching, batch discounts, and context-tier surcharges shift it further. A model that's cheapest for one shape can be 10x more expensive for another.

Is DeepSeek really that much cheaper than Claude or GPT?

On price, yes — DeepSeek V4-Flash input is roughly 1/20th of GPT-5.5's and its cache hit is 98% below its own input price. On capability, no: it trails the frontier tiers on hard reasoning and agentic benchmarks. The honest framing is that DeepSeek is the cheapest capable-enough model for high-volume and cache-heavy work, not a drop-in replacement for a frontier model on every task. Match the tier to the task.

Do the free LLM APIs have a catch?

Two. First, rate limits: GLM-4.7-Flash's free tier on Z.ai is capped at 1,000 requests/day with a concurrency of 1 — fine for a prototype, useless at production scale. Second, data residency: the genuinely-free models are CN-hosted, which is a compliance non-starter for many teams routing customer data. Free on the price sheet is not free of constraints.

How much does the Batch API actually save?

A flat 50% on most providers (Google, OpenAI, Anthropic) in exchange for asynchronous, up-to-24-hour turnaround. That makes batch the cheapest accessible tier on the board: Gemini 2.5 Flash-Lite falls to $0.05/M input / $0.20/M output. DeepSeek stacks a separate lever — a 50%-off off-peak window from 16:30 to 00:30 UTC that applies automatically on the direct API. If your job tolerates latency, batch first.

About the author

Yaroslav Vikhariev

Founder

Founder, AI//COST.

LinkedIn Author profile