Hidden LLM API costs nobody quotes (2026)

Q: What is the single most overlooked LLM API cost?

The tokenizer. Every comparison ranks models by price per token, but the same text turns into a different number of tokens on each model. Anthropic states that Opus 4.7 and later 'may use up to 35% more tokens for the same fixed text' because of a new tokenizer. A model that looks 10% cheaper per token can be more expensive per request. Always estimate cost on token counts from the actual model's tokenizer, not on the sticker price alone.

Q: Does prompt caching always save money?

No. A cache write costs more than standard input — 1.25x base for a 5-minute cache and 2x for a 1-hour cache on Anthropic — and a cache read costs 0.1x. So caching only pays off once a cached prefix is reused: after one read for the 5-minute tier, two reads for the 1-hour tier. If your prefix changes every call, or the cache expires before you reuse it, caching costs you more. Google adds a separate per-hour storage fee on top. We walk the full math in the cornerstone caching post.

TL;DR

The headline '$X/M input, $Y/M output' is a sticker, not an invoice. Ten line items move your real bill, and none of them show up on comparison sites. Each one below is quoted from the vendor's own docs.
The tokenizer tax: Anthropic states Opus 4.7+ 'may use up to 35% more tokens for the same fixed text.' The per-token price is honest; the token count is 35% higher — so the effective price is too.
Caching can lose you money: a 5-min cache write costs 1.25x base input and a 1-hour write costs 2x (Anthropic). It only pays off after 1-2 reads. Google adds a separate storage fee of $1.00/M tokens/hour just to keep the cache warm.
You pay for tokens you never see: OpenAI o-series reasoning tokens are 'not visible via the API' but 'billed as output tokens.' One image is 2,000-8,000 tokens; a 10-minute Gemini video is ~158,000.
Surcharges stack: Anthropic fast mode is 2-6x base, US-only routing adds 1.1x, server-side web search is $10/1,000 calls on top of tokens, and every tool call injects a 290-804 token system prompt you didn't write.

Every LLM comparison ranks models by two numbers: dollars per million input tokens, dollars per million output tokens. That’s the sticker. Your actual invoice has line items the sticker never mentions — a cache write that costs more than the input it replaces, a tokenizer that quietly inflates every request, reasoning tokens you’re billed for but never see. None of them are secret; most are in the vendor’s own docs. They’re just not on the comparison sites. Here are ten, each quoted from the source.

A note before the list

These aren’t scams. Anthropic, to its credit, documents nearly all of its surcharges on one pricing page. Google’s caching storage fee is in the caching docs. OpenAI’s reasoning-token billing is in the reasoning guide. The problem isn’t disclosure — it’s that the ecosystem quotes models by two numbers, so anyone comparing on the sticker is comparing the wrong thing. Every figure below links to the canonical source. Where the wording matters, it’s quoted verbatim.

1. The tokenizer tax

The price per token is honest. The number of tokens is the part nobody checks. The same sentence becomes a different token count on each model — and on the same family across versions. Anthropic states it plainly:

“Opus 4.7 and later use a new tokenizer compared to previous models… This new tokenizer may use up to 35% more tokens for the same fixed text.”

— Anthropic pricing docs

Read that again with your invoice in mind. Opus 4.8 and Opus 4.7 carry the same $5/M input sticker as Opus 4.5 — but if the newer tokenizer turns your prompt into up to 35% more tokens, your effective input price is up to 35% higher for identical text. A model that looks 10% cheaper per token can be more expensive per request.

2. Cache writes cost more than the input they replace

Prompt caching is sold as a discount. The cache read is — but the cache write is a premium. Anthropic’s multipliers:

Cache operation	Multiplier vs base input
5-minute cache write	1.25x base input price
1-hour cache write	2x base input price
Cache read (hit)	0.1x base input price

Anthropic’s own note on when it pays off:

“A cache hit costs 10% of the standard input price, which means caching pays off after just one cache read for the 5-minute duration (1.25x write), or after two cache reads for the 1-hour duration (2x write).”

— Anthropic pricing docs

So caching is only a win once a cached prefix is reused enough. If your system prompt changes every call, or the 5-minute cache expires before the next request, you paid 1.25-2x base input for nothing. We walk the full break-even math in the cornerstone post on caching — the short version is that caching is a bet that you’ll reuse the prefix, and the house edge is the write premium.

3. Cache rent: Google charges you to store it

There’s a second caching cost that almost nobody models. On Google’s explicit context caching, the read discount is only half the bill — the other half is a per-hour storage fee for keeping the cache alive. From the pricing page, for both Gemini 2.5 Pro and 2.5 Flash:

“$1.00 / 1,000,000 tokens per hour (storage price)”

— Google Gemini pricing

Do the math on a Gemini 2.5 Pro cache holding a 200K-token document for an 8-hour workday: 0.2M tokens × $1.00/M/hr × 8 hr = $1.60 in storage alone, before you read it a single time. Hold a full 1M-token context cache for a day and that’s about $24 in pure storage. Cache something large, forget to expire it, and you’re paying rent on tokens you may never re-query — and check your specific model’s line, because some newer and larger Pro tiers carry a higher storage rate. The read discount is real; so is the meter running while the cache sits idle.

4. Crossing the context line reprices the whole request

Some models charge a higher rate above a context threshold — and it applies to the entire request, not just the overflow. Gemini 2.5 Pro is $1.25/M input and $10/M output at or below 200K tokens, and $2.50/M / $15/M above it (Google pricing). A 200K-token call runs about $0.26; push it to 250K and it doesn’t cost 25% more — it costs about $0.64, roughly 2.5x, because the whole prompt reprices at the higher tier.

Not every vendor does this, and the contrast is worth noting. Anthropic explicitly does not surcharge long context on its current models:

“Claude Opus 4.8, Opus 4.7, Opus 4.6, and Sonnet 4.6 include the full 1M token context window at standard pricing. (A 900k-token request is billed at the same per-token rate as a 9k-token request.)”

— Anthropic pricing docs

We flagged the Gemini line in the cheapest-by-use-case roundup too, because it’s the single easiest way to accidentally double a long-context bill. If you run near 200K on any model, confirm its tier policy before you scale.

5. Reasoning tokens you pay for but never see

Reasoning models generate a chain of thought before they answer. You don’t see it — but you’re billed for it. OpenAI’s docs are explicit:

“While reasoning tokens are not visible via the API, they still occupy space in the model’s context window and are billed as output tokens.”

— OpenAI reasoning guide

Output tokens are the expensive side of the bill — typically 4-8x input. A response that shows 200 visible tokens may have burned several thousand invisible reasoning tokens first, every one billed at the output rate. OpenAI recommends reserving at least 25,000 tokens for reasoning plus output when you start, which tells you how large the invisible part can get. If you’re costing a GPT-5.5-class reasoning workload on visible output alone, your estimate is low — sometimes by an order of magnitude.

6. One image is hundreds to thousands of tokens

“Vision input” sounds like a flat add-on. It isn’t — images are converted to tokens, and a single photo can dwarf your text prompt. Anthropic’s formula:

An image uses approximately (width px × height px) / 750 tokens.

— Anthropic vision docs

That makes a 1500×1000 image ~2,000 tokens and a 3000×2000 image ~8,000 tokens. On Opus 4.8 input at $5/M, that 8,000-token image is $0.04 — and ten of them in one prompt is 80,000 tokens, $0.40, before you’ve written a word of text.

OpenAI bills images differently but no more cheaply: high-detail images are scaled and split into 512px tiles at 170 tokens per tile plus an 85-token base, and image input is metered at $8/M tokens (OpenAI vision docs). Either way: a multi-image prompt is a large-prompt bill wearing a small-prompt disguise.

7. Video and audio explode the token count

If images are surprising, time-based media is a different order of magnitude. Google converts media to tokens at fixed rates:

Media	Tokens	A 10-minute clip
Video	263 tokens / second	~157,800 tokens
Audio	32 tokens / second	~19,200 tokens

Source: Google Gemini token docs. A ten-minute video is ~158,000 input tokens — most of a 200K context window, consumed by one clip. A one-hour video is ~947,000 tokens, brushing the 1M ceiling and costing real money on input alone. Audio input also runs several times the text rate per token on most Gemini tiers. If your product ingests media, model the bill on seconds of footage, not “number of files.”

8. Fast mode: paying 2-6x for the same model

Need lower latency? Some providers sell a faster path to the same model at a steep premium. Anthropic’s fast mode:

Model	Standard (in / out)	Fast mode (in / out)	Premium
Opus 4.8	$5 / $25	$10 / $50	2x
Opus 4.6 / 4.7	$5 / $25	$30 / $150	6x

Source: Anthropic pricing docs. Fast mode can be the right call when latency is directly tied to revenue — but at 6x on Opus 4.7, it’s a line item that should be a deliberate decision, not a default someone flipped on. And it stacks: caching multipliers and the data-residency surcharge apply on top of fast-mode pricing.

9. The geography surcharge

If compliance forces you to pin inference to a region, you pay for it. On Anthropic’s first-party API, requesting US-only routing adds a flat multiplier across every token category:

“Specifying US-only inference through the inference_geo parameter incurs a 1.1x multiplier on all token pricing categories, including input tokens, output tokens, cache writes, and cache reads.”

— Anthropic pricing docs

On the partner clouds it’s similar: Bedrock and Vertex regional and multi-region endpoints carry a 10% premium over global endpoints for current models. A 10% tax on the entire bill is the kind of thing a data-residency requirement quietly adds — and it’s invisible until you compare a regulated deployment’s invoice against the sticker.

10. Tool use bills twice

Give a model tools and you pay on two axes the sticker never mentions. First, server-side tools carry their own usage charges on top of tokens — on Anthropic, web search is $10 per 1,000 searches and code execution is $0.05 per hour per container beyond a free monthly allowance (Anthropic pricing docs). Second, simply enabling tools injects a system prompt you didn’t write and are billed for every call:

Model	Tool-use system prompt overhead
Opus 4.8	290-410 tokens
Opus 4.7	675-804 tokens
Sonnet 4.6 / Haiku 4.5	~496-589 tokens

Source: Anthropic pricing docs (range depends on tool_choice). It’s small per call — but an agent that makes thousands of tool-augmented calls a day is paying for a system prompt it never authored, every single turn, plus per-search and per-container-hour fees on the work the tools actually do.

How to not get surprised

None of this means any vendor is cheating you. It means the two-number sticker is the start of a cost estimate, not the end. A checklist that closes the gap between your estimate and your invoice:

Count tokens with the real tokenizer, not a rule of thumb — and re-count when you switch model families. The tokenizer tax (§1) can move the bill 20-35%.
Model cache reuse before enabling caching. If you won’t clear the write premium (§2) with enough reads, don’t cache. And set a TTL you’ll actually use, so you don’t pay storage rent (§3).
Know your context tier. If you run near 200K tokens, confirm whether the model reprices the whole request above it (§4).
Budget reasoning tokens as output. For o-series and extended-thinking models, the invisible chain of thought (§5) is the bill.
Meter media by the second and the pixel, not by the file (§6, §7). One video can fill a context window.
Treat fast mode and US-only routing as deliberate line items (§8, §9), not defaults.
Account for tool overhead — per-call system prompts plus per-use server-tool fees (§10).

Do that and your forecast will match your invoice. Skip it and you’ll learn these line items the expensive way — on the bill. For how we verify every price on this site against the vendor’s own page, see our methodology.

All figures in this post were verified on 2026-06-03 against the vendors’ own documentation: Anthropic pricing and vision docs, OpenAI reasoning and vision guides, and Google Gemini pricing, caching, and token docs. Where a number drives a decision, we link the source so you can check it yourself — see our methodology for the verification cadence.

Frequently asked questions

What is the single most overlooked LLM API cost?

The tokenizer. Every comparison ranks models by price per token, but the same text turns into a different number of tokens on each model. Anthropic states that Opus 4.7 and later 'may use up to 35% more tokens for the same fixed text' because of a new tokenizer. A model that looks 10% cheaper per token can be more expensive per request. Always estimate cost on token counts from the actual model's tokenizer, not on the sticker price alone.

Does prompt caching always save money?

No. A cache write costs more than standard input — 1.25x base for a 5-minute cache and 2x for a 1-hour cache on Anthropic — and a cache read costs 0.1x. So caching only pays off once a cached prefix is reused: after one read for the 5-minute tier, two reads for the 1-hour tier. If your prefix changes every call, or the cache expires before you reuse it, caching costs you more. Google adds a separate per-hour storage fee on top. We walk the full math in the cornerstone caching post.

Why is my reasoning-model bill higher than the tokens I can see?

Because reasoning models bill for hidden thinking tokens. OpenAI's docs state that for o-series models, reasoning tokens 'are not visible via the API' but 'are billed as output tokens.' A response that shows 200 visible output tokens may have generated thousands of billed-but-invisible reasoning tokens first. Budget for it: OpenAI recommends reserving at least 25,000 tokens for reasoning plus output when you start.

How many tokens does an image or a video cost?

Far more than you'd guess. On Anthropic, an image is about (width x height) / 750 tokens — a 3000x2000 photo is ~8,000 tokens. On OpenAI, a high-detail image is billed per 512px tile and image input runs $8/M tokens. On Google, video is 263 tokens/second and audio is 32 tokens/second — a 10-minute video is ~158,000 input tokens. 'One image' or 'one clip' is never one unit on your bill.

Are these hidden costs a scam?

No — most are documented, just not on the comparison sites that rank models by two numbers. Anthropic, to its credit, publishes nearly all of these surcharges on one pricing page. The problem is that the ecosystem quotes only input and output price, so a real cost comparison requires reading past the sticker. That's the point of this post: surface the line items so your estimate matches your invoice.

About the author

Yaroslav Vikhariev

Founder

Founder, AI//COST.

LinkedIn Author profile