Claude API Prompt Caching: Complete Guide to Cutting Token Costs by 90% (2026)
[01]The Problem Prompt Caching Solves
If your Claude-powered app sends the same large context with every request — a 20-page system prompt, a tool catalog, a long document the user is asking questions about — you are paying full price for the same tokens hundreds or thousands of times. Claude API prompt caching lets you tell Anthropic "this prefix won't change for the next few minutes; charge me cheaply when I send it again."
The savings are real. Cache reads price at 10% of base input cost. A 30,000-token system prompt that costs $0.09 per request on Sonnet 4.6 ($3/M input) costs $0.009 with a cache hit — a 90% reduction on the prefix portion. For high-traffic endpoints this turns into thousands of dollars a month.
This guide is the practical companion to our Top 10 MCP servers article, but at the API level rather than the Claude Code level. If you're building on the SDK, prompt caching is the optimization that pays for itself in the first day.
[02]How Prompt Caching Works
The mental model: the API has an internal cache keyed by the exact prefix of your request. You mark cut points in your messages with cache_control, and Anthropic stores everything up to that point. The next request that has the same prefix gets a cache hit and pays the discounted read price.
```
Request 1 (cache write: base price + 25% surcharge):
[system: 30K tokens ← cache_control here] [user: 100 tokens]
        ↓
[Anthropic stores the 30K-token prefix]

Request 2 within TTL (cache hit: 10% read price):
[system: 30K tokens] [user: 200 tokens] ← same prefix, new tail
        ↓
[prefix served from cache; only the new tail bills at full input price]
```
Two important rules:
- The prefix must match exactly. One character of difference and you miss the cache. Build your prefixes deterministically.
- The cut point must be at a block boundary. You can't cache the middle of a message.
cache_control goes on the last block you want included in the cached prefix.
[03]The Two Cache Tiers — 5 Minutes and 1 Hour
Anthropic offers two TTL options, with different write surcharges:
| Tier | TTL | Write Surcharge | Read Discount | Best For |
|---|---|---|---|---|
| Default (5 min) | 5 minutes | +25% on write | −90% on read | Active conversations, multi-turn agents |
| Extended (1 hour) | 1 hour | +100% on write | −90% on read | Long-lived sessions, batch workloads with bursty traffic |
The 1-hour tier costs twice as much to write but reads at the same discount, so its advantage is avoiding re-writes, not cheaper reads. Note that the TTL is refreshed each time the cached prefix is read, so steady traffic can keep even a 5-minute cache alive indefinitely. The math: a 5-minute write bills at 1.25× base and a 1-hour write at 2×, so the 1-hour tier pays off once traffic gaps would force the 5-minute cache to lapse and be re-written at least once per hour (two 5-minute writes cost 2.5× base vs a single 2× write).
```ts
// Default 5-minute cache
{ "type": "text", "text": "...", "cache_control": { "type": "ephemeral" } }

// Extended 1-hour cache
{ "type": "text", "text": "...", "cache_control": { "type": "ephemeral", "ttl": "1h" } }
```
Default rule of thumb: start with the 5-minute cache. Switch to 1-hour only when telemetry shows your traffic pattern would benefit (see the measuring section below).
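To make the break-even concrete, here's a minimal sketch (using the Sonnet 4.6 write prices from the next section; expiriesPerHour is a hypothetical input for how often traffic gaps let the 5-minute cache lapse):

```typescript
// Hourly write cost per 1M prefix tokens for each tier.
// expiriesPerHour: how many times traffic gaps let a 5-minute cache lapse (assumption).
function hourlyWriteCost(expiriesPerHour: number): { fiveMin: number; oneHour: number } {
  const WRITE_5M = 3.75; // $/M tokens, 5-minute tier write
  const WRITE_1H = 6.0;  // $/M tokens, 1-hour tier write
  return {
    fiveMin: (1 + expiriesPerHour) * WRITE_5M, // initial write plus one re-write per lapse
    oneHour: WRITE_1H,                         // a single write covers the whole hour
  };
}

console.log(hourlyWriteCost(0)); // { fiveMin: 3.75, oneHour: 6 }  → stay on 5-min
console.log(hourlyWriteCost(1)); // { fiveMin: 7.5,  oneHour: 6 }  → one lapse and 1h wins
```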
[04]Pricing Math With Real Numbers
Concrete example — Sonnet 4.6 prices (May 2026):
| Operation | Price per 1M tokens |
|---|---|
| Input (uncached) | $3.00 |
| Cache write (5min) | $3.75 (+25%) |
| Cache write (1h) | $6.00 (+100%) |
| Cache read | $0.30 (−90%) |
| Output | $15.00 |
Scenario: 30K-token system prompt, 1000 requests over 5 minutes
| Strategy | System prompt cost | Total saved |
|---|---|---|
| No caching | 1000 × 30K × $3/M = $90.00 | — |
| 5-min caching | 1× $3.75/M × 30K + 999× $0.30/M × 30K = $0.11 + $8.99 = $9.10 | $80.90 (90%) |
Scenario: same 30K prompt, 5000 requests over 1 hour (heavy multi-turn agent)
| Strategy | System prompt cost | Total saved |
|---|---|---|
| No caching | 5000 × 30K × $3/M = $450.00 | — |
| 5-min caching (12 writes) | 12× $3.75/M × 30K + 4988× $0.30/M × 30K = $1.35 + $44.89 = $46.24 | $403.76 (90%) |
| 1h caching (1 write) | 1× $6/M × 30K + 4999× $0.30/M × 30K = $0.18 + $44.99 = $45.17 | $404.83 (90%) |
Both tiers save roughly the same in steady state; the 1h tier wins by about $1 in this scenario because it avoids re-writes. (Since the TTL refreshes on each read, truly continuous traffic would need only one write even on the default tier; the 12-write figure models gaps long enough to let the cache lapse.) The real difference shows up in burst-then-quiet patterns where the 5-minute cache expires between bursts.
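If you want to sanity-check the tables, here's a minimal sketch that reproduces the numbers (prices from the table above; the write counts are the scenarios' assumptions):

```typescript
// Cost of sending a cached prefix `requests` times, paying the write price
// `writes` times and the discounted read price for the rest ($/M tokens).
const PRICES = { uncached: 3.0, write5m: 3.75, write1h: 6.0, read: 0.3 };

function prefixCost(prefixTokens: number, requests: number, writes: number, writePrice: number): number {
  const m = prefixTokens / 1_000_000;
  return writes * m * writePrice + (requests - writes) * m * PRICES.read;
}

const m = 30_000 / 1_000_000;
console.log(5000 * m * PRICES.uncached);                   // 450.00 (no caching)
console.log(prefixCost(30_000, 5000, 12, PRICES.write5m)); // ≈46.24 (5-min, 12 writes)
console.log(prefixCost(30_000, 5000, 1, PRICES.write1h));  // ≈45.17 (1h, single write)
```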
[05]The Anthropic SDK Pattern
TypeScript
```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const SYSTEM_PROMPT = "..."; // 30K tokens of stable instructions

async function ask(question: string) {
  return client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: SYSTEM_PROMPT,
        cache_control: { type: "ephemeral" }, // ← cut here
      },
    ],
    messages: [{ role: "user", content: question }],
  });
}
```
Python
```python
import anthropic

client = anthropic.Anthropic()
SYSTEM_PROMPT = "..."  # 30K tokens

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": question}],
    )
```
Multiple cut points (cache layers)
You can mark up to 4 cache breakpoints per request. Useful when you have a stable global prefix + per-tenant context + per-conversation context:
```ts
system: [
  { type: "text", text: GLOBAL_INSTRUCTIONS, cache_control: { type: "ephemeral" } },
  { type: "text", text: TENANT_DOCS, cache_control: { type: "ephemeral" } },
],
messages: [
  {
    role: "user",
    content: [
      { type: "text", text: CONVERSATION_HISTORY, cache_control: { type: "ephemeral" } },
      { type: "text", text: latest_question },
    ],
  },
],
```
A request hits the cache at the longest matching prefix: new tenant traffic still benefits from the global cache, and new conversations still benefit from the tenant cache.
[06]What to Cache (and What Not to)
The bar is simple: cache anything large that doesn't change between requests within the TTL. Anthropic enforces a minimum cacheable size (1024 tokens for Sonnet/Opus, 2048 for Haiku); blocks below the minimum are simply processed without caching, so marking them gains nothing.
Strong candidates
- System prompts over 1K tokens — almost always worth caching
- Tool definitions — schemas don't change between calls in a session
- RAG context when the same chunks are retrieved repeatedly within a session
- Long documents that the user is asking multiple questions about ("read this 50-page contract, then ask"), as sketched after this list
- Few-shot examples — same examples used many times
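For the long-document candidate, a minimal sketch (reusing client from the SDK section; CONTRACT_TEXT is a placeholder for the document text):

```typescript
// First question about a long document: the document block gets cached, so
// follow-up questions within the TTL re-read it at 10% of the input price.
const response = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: CONTRACT_TEXT, cache_control: { type: "ephemeral" } }, // stable
        { type: "text", text: "What are the termination clauses?" }, // changes per question
      ],
    },
  ],
});
```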
Don't cache
- Anything below the minimum size
- Anything that changes per request (timestamps, request IDs, recent dynamic context)
- Output blocks (only input is cacheable; this isn't even an option in the API)
- Conversation tails that are still growing — the unstable part
The hidden trap: dynamic data inside system prompts
If your system prompt includes something like Current time: ${new Date()}, you'll never get a cache hit because the prefix changes every request. Common gotchas: timestamps, user IDs, request UUIDs, A/B test variants. Move dynamic content out of the cached prefix and into the message tail.
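A sketch of the fix, assuming STABLE_INSTRUCTIONS holds the unchanging part of your prompt:

```typescript
const STABLE_INSTRUCTIONS = "..."; // the unchanging part of your prompt (placeholder)

// Anti-pattern: the interpolated timestamp makes the cached block differ on
// every request, so each request is a cache miss plus a fresh cache write.
const badSystem = [{
  type: "text" as const,
  text: `${STABLE_INSTRUCTIONS}\nCurrent time: ${new Date().toISOString()}`,
  cache_control: { type: "ephemeral" as const },
}];

// Fix: keep the cached block byte-identical; dynamic data goes in a later, uncached block.
const goodSystem = [
  { type: "text" as const, text: STABLE_INSTRUCTIONS, cache_control: { type: "ephemeral" as const } },
  { type: "text" as const, text: `Current time: ${new Date().toISOString()}` },
];
```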
[07]Measuring Cache Hit Rate and Common Mistakes
Every API response includes usage with cache stats. Log these to know if your caching is actually working:
```ts
response.usage = {
  input_tokens: 50,                 // tokens NOT served from cache
  cache_creation_input_tokens: 0,   // tokens written to cache this request
  cache_read_input_tokens: 30000,   // tokens read from cache
  output_tokens: 200,
}
```
Healthy steady-state cache: cache_read_input_tokens should dominate. cache_creation_input_tokens should be near zero except on the first request of each TTL window.
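A minimal logging sketch, assuming the SDK's Anthropic.Message response type and the usage fields shown above:

```typescript
import type Anthropic from "@anthropic-ai/sdk";

// Log per-request cache behavior; hit rate = share of the prefix served from cache.
function logCacheStats(response: Anthropic.Message): void {
  const wrote = response.usage.cache_creation_input_tokens ?? 0;
  const read = response.usage.cache_read_input_tokens ?? 0;
  const hitRate = wrote + read > 0 ? read / (wrote + read) : 0;
  console.log(
    `cache read=${read} wrote=${wrote} uncached=${response.usage.input_tokens}`,
    `hit-rate=${(hitRate * 100).toFixed(1)}%`,
  );
}
```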
Five Common Mistakes
- Caching below the minimum size. Blocks under the model's minimum (1024 tokens for Sonnet/Opus, 2048 for Haiku) are processed without caching, so the cache_control marker does nothing. Check block sizes first.
- Dynamic data in cached blocks. Timestamps, UUIDs, and user IDs in the cached prefix mean every request is a cache miss plus a cache write. Telemetry shows this immediately: cache_creation_input_tokens won't drop.
- Wrong cut point. Putting cache_control too early misses caching the bulk of your stable context. Put it on the last block you want included.
- Using the 1h tier for low-traffic endpoints. The 100% write surcharge eats your savings unless you get many reads per write. Stay on the default (5 min) until telemetry justifies switching.
- Not warming the cache before bursts. If you know a flood of requests is coming (e.g. a scheduled batch job), send one warm-up request a few seconds before, as sketched below. The write surcharge is paid by the warm-up, not by your real traffic.
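A sketch of that warm-up pattern, reusing client and SYSTEM_PROMPT from the SDK section:

```typescript
// One tiny request pays the write surcharge up front so the burst that
// follows reads the prefix at 10% of the input price.
async function warmCache(): Promise<void> {
  await client.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1, // minimal output; we only want the prefix written to cache
    system: [{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }],
    messages: [{ role: "user", content: "ping" }],
  });
}
```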
[08]Frequently Asked Questions
Does prompt caching work with streaming responses?
Yes. The cache hit/miss is determined at request submission, independent of streaming. Both streaming and non-streaming benefit equally.
Can I cache messages from previous turns in a conversation?
Yes — that's one of the most valuable patterns. Mark cache_control on the last message of the conversation history. As the conversation grows, the cached prefix grows with it, and only the new turn pays full price.
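A sketch of the pattern, assuming the SDK's Anthropic.MessageParam message type (the turn contents are placeholders):

```typescript
import Anthropic from "@anthropic-ai/sdk";

// Everything up to and including the breakpoint is the cached prefix;
// only the newest user turn bills at the uncached input rate.
const messages: Anthropic.MessageParam[] = [
  { role: "user", content: "Summarize section 3 of the contract." },
  {
    role: "assistant",
    content: [
      {
        type: "text",
        text: "Section 3 covers...",
        cache_control: { type: "ephemeral" }, // breakpoint: end of history
      },
    ],
  },
  { role: "user", content: "And what about section 4?" }, // new turn, uncached
];
```

On the next turn, move the breakpoint to the new last history message; the cached prefix grows with the conversation.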
Does prompt caching work for tool use?
Yes. Tool definitions are part of the cacheable prefix. If you have 20 tools with detailed descriptions, caching them saves substantially.
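A sketch, marking the final tool so the whole catalog lands in the cached prefix (the tool name and schema here are illustrative):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const tools: Anthropic.Tool[] = [
  // ...19 earlier tool definitions...
  {
    name: "search_docs",
    description: "Search the product documentation for a query.",
    input_schema: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
    cache_control: { type: "ephemeral" }, // caches every definition above it as well
  },
];
```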
What happens when the cache expires mid-request?
The next request after expiration is a cache miss + new cache write. The expiration is silent — your code doesn't get notified. Monitor cache_creation_input_tokens to detect expiration patterns.
Is caching available on all Claude models?
Caching is available on Sonnet 4.x, Opus 4.x, and Haiku 4.x. Older models (Claude 3.x) had limited caching; they're being deprecated. Always check the official model availability page for the current matrix.
Can I share a cache across different API keys or projects?
No. Caches are scoped to your organization and the exact request prefix. Different API keys within the same org may share a cache if their prefixes match exactly, but cross-org sharing doesn't happen.
How does caching interact with the Anthropic Workbench / Console?
The same caching rules apply — your Workbench requests can hit caches if the prefix matches. Most teams don't see Workbench cache hits because Workbench traffic is too low-volume for prefixes to repeat within TTL.
Related: Top 10 MCP servers for Claude Code · .claude/ folder complete guide · Claude Code Hooks practical guide