Prompt caching and LLM cost: the math that matters

Q: Does Anthropic charge a fee to write a cache?

Yes. Writing a 5-minute ephemeral cache costs 1.25x the base input price; writing a 1-hour ephemeral cache costs 2x. Reads (and the cache refresh on a hit within TTL) cost 10% of base input. The write fee is the breakeven question -- if a cached prefix is reused at least once within its TTL window, the math favors caching.

TL;DR

Anthropic charges 1.25x base input to WRITE a 5-min cache, 2x to write a 1-hour cache, and 0.1x (10%) to READ. Verified 2026-06-01 on Anthropic's prompt caching docs (linked below).
OpenAI charges 0.1x (10%) for cached input, automatic above a 1,024-token prefix. No write fee. Verified 2026-06-01 on developers.openai.com.
Google Gemini 3.1 Pro charges $0.20/M for cached input (10% of $2/M) PLUS $4.50/M/hour storage. Storage flips the math if your cache sits idle.
DeepSeek V4-Flash cache hit is $0.0028/M -- 2% of the cache-miss price. The smallest discount-after-discount in the market.
Caching favors workloads with >=1,024 token stable prefixes, repeated within ~5 minutes. Single-turn classification, dynamic system prompts, and code-gen agents with long outputs are the exceptions where caching saves little or nothing.

LLM pricing pages quote two numbers: an input rate and an output rate per million tokens. That math has been wrong for two years. The third number — the cached input rate — decides which model is cheapest for any workload that re-sends a system prompt, tool definitions, or accumulated conversation history. On a typical agent loop with a 5K-token stable prefix re-used 10 times, it cuts per-call input cost on the cached portion by ~90% across Anthropic, OpenAI, and Google, and roughly halves total request cost (see the worked example below). Here is what every vendor actually charges today, when caching helps, and when it does nothing.

What prompt caching actually is (in 80 words)

A prompt cache stores the model’s internal representation of a token prefix and bills subsequent calls that send the same prefix at a discounted rate. The discount applies only to the cached prefix — not the new portion of the prompt, not the output. Caches are scoped per workspace, expire on a TTL (5 minutes default, up to 1 hour or 24 hours depending on vendor), and bust on any change to the prefix tokens. Caching is automatic on OpenAI, opt-in on Anthropic via cache_control, and configurable on Google.

The math: standard rate vs cached rate

Every major frontier model now offers a cache-read rate around 10% of the standard input rate. The pattern is the same across vendors — with one big exception (DeepSeek), and one quiet trap (Google’s storage fee).

Model	Input /M	Cached input /M	Cache discount	Output /M
Claude Opus 4.8 Anthropic	$5	$0.50	-90%	$25
Claude Sonnet 4.6 Anthropic	$3	$0.30	-90%	$15
GPT-5.5 OpenAI	$5	$0.50	-90%	$30
Gemini 3.1 Pro Preview Google	$2	$0.20	-90%	$12
DeepSeek V4 Flash DeepSeek	$0.14	$0.0028	-98%	$0.28

Standard input vs cached input rates across current frontier tiers. Numbers verified against each vendor's canonical pricing page on 2026-06-01.

Read across the row: Anthropic, OpenAI, and Google all charge a hair under or exactly 10% of input for cache reads. DeepSeek charges roughly 2% — the cache-hit rate on V4-Flash is $0.0028/M against a cache-miss rate of $0.14/M.

Three things this table does not show, which matter more than the discount percentages:

Cache write fees. Anthropic charges 1.25x base input to write a 5-minute cache and 2x base input to write a 1-hour cache. OpenAI and DeepSeek do not charge for the write. Google charges no write fee on implicit caching but bills $4.50/M/hour storage on explicit caching for Gemini 3.1 Pro Preview.
Minimum prefix length. Anthropic requires 4,096 tokens to cache on Opus 4.5/4.6/4.7 AND on Haiku 4.5 (per Anthropic’s prompt caching docs); 1,024 tokens on Opus 4.8, Sonnet 4.6/4.5, and Opus 4.1. OpenAI requires 1,024 uniformly. Google requires 4,096 on Gemini 3.1 Pro Preview / 2.5 Pro, 1,024 on Flash tiers.
TTL. Anthropic: 5 min default, 1 hour optional. OpenAI: “5 to 10 minutes of inactivity, up to a maximum of one hour” with up to 24 hours on extended retention for select models. Google explicit: configurable (you pay storage). DeepSeek: unstated on the public docs.

Sources: Anthropic prompt caching docs, OpenAI prompt caching guide, Google Gemini caching docs, DeepSeek pricing.

When caching saves money (and when it doesn’t)

The caching savings curve has two parameters that matter: how often you re-send a stable prefix, and how big that prefix is relative to the new tokens per call. A long stable prefix re-sent inside the TTL collapses input cost. A short volatile prefix saves nothing.

Caching saves the most when:

The cached prefix is >=4,096 tokens on Opus tiers, >=1,024 tokens elsewhere.
The prefix is stable across calls within the TTL window — system prompt, tool definitions, retrieved-document context.
The call rate is higher than once per 5 minutes so the cache stays warm without paying for the 1-hour extension.
You can structure the prompt so the stable bytes come FIRST (the cache key is prefix-based; any drift busts the cache from that point on).

Caching does little or nothing when:

The system prompt changes every call (per-user injection, timestamps, randomized examples).
The workload is single-turn classification with short inputs that never repeat.
The dollars sit in output tokens (long-form generation, code emission) where the input share is small.
The conversation is strictly user-only with no system or tool overhead.

Worked example: agent loop with 15-minute conversation

Take a document-processing agent that runs 10 tool calls over 15 minutes against a single user request. System prompt + tool definitions = 5,000 tokens (well above both Opus 4.8’s 1,024-token minimum and Opus 4.7’s 4,096-token minimum). Per-call user delta + tool result = 800 tokens. Per-call output = 400 tokens.

Without caching on Claude Opus 4.8:

Input per call: 5,800 tokens at $5/M = $0.029
Output per call: 400 tokens at $25/M = $0.010
Per call: $0.039 — 10 calls = $0.39 per agent run

With Anthropic’s 5-minute cache (write once, hit 9 times):

Call 1: write 5,000-token cache at 1.25x ($6.25/M) = $0.0313 + 800 fresh input at $5/M = $0.004 + $0.010 output = $0.045
Calls 2-10 (within 5-min TTL, refreshed by each hit): 5,000-token cache READ at $0.5/M = $0.0025 + 800 fresh input at $5/M = $0.004 + $0.010 output = $0.0165
Total: $0.045 + 9 x $0.0165 = $0.045 + $0.1485 = $0.194 per agent run

Caching cuts the cost in half. The cache write fee on call 1 is recovered on call 2; every call after that bleeds the discount. (Opus 4.7 produces identical numbers at the same $5/$25/$6.25/$0.50 rates.)

Now redo the same math on GPT-5.5. The arithmetic is simpler because OpenAI does not bill a cache write:

Call 1: 5,800 input at $5/M = $0.029 + $0.012 output (400 tokens at $30/M) = $0.041
Calls 2-10: 5,000 cached input at $0.5/M = $0.0025 + 800 fresh input at $5/M = $0.004 + $0.012 output = $0.0185
Total: $0.041 + 9 x $0.0185 = $0.208 per agent run

GPT-5.5 lands at $0.208 vs Opus 4.8’s $0.194 — Opus 4.8 wins this scenario by ~7%, primarily because Anthropic’s lower output rate ($25/M vs $30/M) compounds across all 10 calls.

The 5-min vs 1-hour cache tradeoff (Anthropic)

Anthropic’s two TTL tiers force one decision: is your re-use rate higher than once every 5 minutes? If yes, the default 5-minute cache costs 1.25x input to write — cheaper. If you need to keep the cache warm across a longer pause (a paused agent waiting on user input, an inactive analyst between queries), the 1-hour extension costs 2x input to write.

The breakeven math: a 1-hour write costs 2x - 1.25x = 0.75x extra input price. A 5-minute cache write avoided by extending is worth 1.25x input. So if you would otherwise write the cache twice or more inside an hour, the 1-hour option pays for itself.

This means: dispatched workloads (cron-driven agents, batch processors) generally want 1-hour. Interactive workloads with conversation flows under 5 minutes want default 5-minute. Mixed workloads should benchmark both, not assume.

How OpenAI’s auto-caching differs

OpenAI does not give you a cache_control knob. You get caching automatically on every prefix of 1,024+ tokens, and you can tell from the response payload whether your call was a hit or a miss (the cached_tokens field reports the count). The trade-offs:

No write fee. You never pay extra to populate a cache.
No explicit control. You cannot pin a cache, change its TTL, or guarantee it stays warm. OpenAI documents “5 to 10 minutes of inactivity, up to a maximum of one hour” — which means you cannot rely on cache hits if traffic is bursty.
Extended retention is available on GPT-5.5 (incl. 5.5-pro), GPT-5.4 family, GPT-5.2, GPT-5.1 variants, GPT-5 (incl. 5-codex), and GPT-4.1, holding cache up to 24 hours per OpenAI’s caching guide. Same pricing applies.
Prefix-only. Any drift from the cached prefix invalidates the cache from that token onward, same as Anthropic.

For most production loops the OpenAI behavior is fine: write your prompt to maximize the stable prefix, run frequently enough to keep the cache warm, and you will get the 10% discount on most calls without managing TTLs.

Things that break caching (and how to spot them)

Six failure modes account for most “I enabled caching but my bill didn’t change” complaints. All are visible in the response payload if you look.

Prefix not long enough. Check the response: Anthropic’s cache_read_input_tokens or OpenAI’s cached_tokens will be 0. The fix is to lengthen the stable prefix (add few-shot examples, hoist tool definitions to the top) so it clears the threshold.
System prompt has a per-call variable. A timestamp, a session ID, a user name injected at position 50 in the prefix busts the cache after token 50 — meaning everything after position 50 pays full input. Move variables to the user portion of the prompt; keep the system block byte-stable.
Tool definition list changes per call. If your agent loads tools dynamically and the order or shape varies, every call writes a fresh cache. Pin the tool list at construction time.
Workspace isolation. Anthropic switched to per-workspace cache isolation on 2026-02-05 for Claude API, Platform on AWS, and Microsoft Foundry beta. Calls from different workspace IDs do not share a cache, even with identical prefixes.
You hit TTL. If your agent waits longer than the TTL between calls, the cache evicts and the next call pays the write fee again. Either fall back to a longer TTL or keep an idle ping running.
Provider experimental flags. OpenAI flex / batch / priority processing options have different cache behaviors. Verify cache hits on the same processing tier you bill on.

For DeepSeek V4-Pro specifically, the promotional pricing was 75% off through 2026-05-31 15:59 UTC. Per DeepSeek’s pricing page, the post-promo rate is permanently adjusted to 1/4 of the original list price — the discounted rate becomes the new standard, not a temporary offer. The cache-hit rate stays at ~$0.003625/M and cache-miss at ~$0.435/M after 2026-06-01. If you priced your workload during the promo, your budget math does NOT need to be re-run upward.

Google’s storage fee changes the implicit-vs-explicit choice

Google offers two caching modes. Implicit caching is automatic on Gemini 2.5 and newer, has no storage fee, and provides “no cost saving guarantee” per the docs — Google may or may not cache. Explicit caching lets you create a named cache and bills you $4.50/M tokens stored per hour on Gemini 3.1 Pro Preview (listed as Gemini 3 Pro Preview in Google’s caching docs at time of writing; the storage rate varies per model — $1/M/hour on most others).

The storage fee math matters. A 100K-token cached prefix held idle for one hour costs $0.45 in storage. If your workload re-uses it once and discards it, the read discount (saving 90% on 100K tokens at $2/M = $0.18) is dwarfed by the storage fee — you net a loss. If you re-use it twenty times the discount wins. The breakeven is around 3 re-uses per hour for Gemini 3.1 Pro Preview.

For burst workloads, implicit caching is the safer default — you take whatever discount Google decides to give you, with no storage exposure. For predictable agent loops with known re-use counts, explicit caching beats implicit if your math says so.

Practical guidance

If you are reading this because you are about to ship a model behind an agent, here is the order of operations that has worked across the analyses I have run on this site:

Measure your actual prefix size. Log token counts on real production prompts (or a representative sample) for one full TTL window. If your stable prefix is under the minimum threshold for your candidate model, caching cannot help and the input rate is what you will pay — pick the model with the lowest list input rate, not the model with the deepest cache discount.
Measure your actual re-use rate. A workload with one call per 30 minutes pays the write fee on every call on Anthropic — net negative vs no caching. OpenAI is friendlier here (no write fee) but the cache may evict before your next call. Match the TTL to your re-use rate.
Price both with and without caching. A spreadsheet with (input_tokens * input_rate) + (cached_tokens * cache_rate * cache_hit_rate) + (cached_tokens * input_rate * (1 - cache_hit_rate)) + (output_tokens * output_rate) for each candidate model tells you the real comparison. List rates lie about agent cost.
Verify each price against the vendor’s own page. Cache rates change quarterly across all four major vendors. We refresh our pricing data daily against canonical pricing pages — see the methodology page for how. Quoting a number from a 6-month-old blog post (including this one, eventually) is the fastest way to ship a budget miss.
Re-architect if caching is breaking. Lifting timestamps out of the system prompt, freezing tool definitions, and pinning the prefix order are the three biggest unlocks. None of them require a model change.

Caching is not magic. It is a 10% discount with strings attached, and the strings change per vendor. Read the contract, log your own tokens, and the math falls out.

Frequently asked questions

What is the actual discount for cached input across major providers?

All four major providers price cache READS at roughly 10% of standard input: Anthropic explicitly at 0.1x base input across Claude Opus 4.8 / 4.7 and Sonnet 4.6 / 4.5 per Anthropic's caching docs; OpenAI at the same 10% on GPT-5.5, GPT-5.4, and GPT-5.4-mini per OpenAI's caching guide; Google at 10% for Gemini 3.1 Pro Preview (with an additional storage fee) per Google's caching docs. DeepSeek V4-Flash is the outlier at roughly 2% of cache-miss input -- a much larger discount.

Does Anthropic charge a fee to write a cache?

Yes. Writing a 5-minute ephemeral cache costs 1.25x the base input price; writing a 1-hour ephemeral cache costs 2x. Reads (and the cache refresh on a hit within TTL) cost 10% of base input. The write fee is the breakeven question -- if a cached prefix is reused at least once within its TTL window, the math favors caching.

Does OpenAI charge a write fee?

No. OpenAI's caching is automatic on prefixes of 1,024 tokens or longer, with no per-write surcharge. The first call pays standard input; subsequent calls within the cache window (typically 5-10 minutes of inactivity, up to 1 hour) pay the discounted cache-read rate.

What is the minimum cacheable prefix length?

Anthropic: 4,096 tokens on Opus 4.7 / 4.6 / 4.5 and Haiku 4.5; 1,024 tokens on Opus 4.8, Sonnet 4.6 / 4.5, and Opus 4.1 per Anthropic's caching docs. OpenAI: 1,024 tokens uniformly on GPT-4o and newer. Google: 1,024 tokens on Gemini 2.5 Flash and Gemini 3.5 Flash; 4,096 tokens on Gemini 2.5 Pro and Gemini 3.1 Pro Preview (listed as Gemini 3 Pro Preview in the caching docs at time of writing). Prefixes shorter than the threshold are processed without caching (no error, no discount).

When does prompt caching NOT save money?

Three workload types: (1) single-turn classification where the prefix is short and the call never repeats, (2) agents with dynamic system prompts that change per call (cache key is busted), and (3) output-heavy generation where the dollars sit in output tokens, not input. Code generation that emits 2K+ output tokens per call often costs less on GPT-5.4 ($15/M output) than on Opus 4.8 ($25/M output) regardless of cache.

Does Gemini's storage fee change the math?

Yes, materially. Gemini 3.1 Pro Preview's explicit cache charges $4.50 per 1M cached tokens per hour stored. A 100K-token cached prefix held idle for one hour costs $0.45 in storage alone. If you re-use the cache once and discard it, this offsets the read discount; if you reuse it dozens of times the discount still wins. Implicit caching on Gemini 2.5+ has no storage fee but no guaranteed discount either -- Google may or may not cache. See Google's caching docs for the implicit vs explicit distinction.

About the author

Yaroslav Vikhariev

Founder

Founder, AI//COST.

LinkedIn Author profile