How to cut Claude API costs by up to 95 percent with three features most developers skip
Prompt caching, batch processing, and model routing can slash Claude API bills by 50 to 95 percent when combined correctly. Anthropic buries the most powerful savings behind prefix-match rules and feature flags most developers never configure.
What you will learn
- How prompt caching cuts 90% of your input token bill with a single field added to your API request
- Why batch processing gives you a flat 50% discount and how to stack it with caching for 95% savings
- The 70/20/10 model routing pattern that routes cheap tasks to cheap models automatically
- Which monitoring tools catch runaway costs before they hit your invoice
Most developers using the Claude API pay standard rates on every single request. They send the same system prompt on every call without caching it. They process overnight jobs at real-time prices. They run simple classification tasks through Opus when Haiku would do the same work for a fraction of the cost. It is genuinely painful to watch, especially when the fix for each of these takes less than an hour to implement.
Three features - prompt caching, batch processing, and model routing - can stack to cut your total API bill by 95%. That isn’t a typo. Anthropic documented these savings themselves. But the configuration details are scattered across multiple documentation pages, and most developers never put all three together. In building Tallyfy, I’ve had conversations with teams spending 10x what they need to on API costs simply because nobody configured these features.
If you are on a Claude subscription plan instead (Pro, Max, Team), the cost levers are completely different - context management and session hygiene matter more than token pricing. A separate post covers the subscription-specific techniques.
Prompt caching cuts 90 percent of your input bill
This is the single most powerful cost reduction available on the Claude API, and most developers either skip it or configure it wrong. When consulting with companies about their API integration costs, caching is almost never set up on day one - and the first month’s bill is always the wake-up call.
Prompt caching works by marking a section of your prompt to be cached on Anthropic’s servers. The first request writes the cache at 1.25x the normal input token price. Every subsequent request that starts with the same prefix reads from cache at 0.1x - a 90% discount on input tokens. There is also a 1-hour cache option where writes cost 2x but reads stay at 0.1x, useful for workloads with irregular traffic patterns.
The critical detail most people miss: it is a prefix match. Anthropic caches everything from the beginning of your prompt up to the cache breakpoint. If anything changes in that prefix - a single character, a reordered tool definition, a timestamp in your system prompt - the entire cache invalidates and you pay the full write cost again.
Cache killers that will quietly eat your budget:
- Timestamps or dates injected into system prompts (the current date changes daily - that breaks your cache)
- Reordering tool definitions between requests (tool order matters for prefix matching)
- Switching models mid-conversation (different models, different caches)
- Dynamic content placed before static content in the prompt (put variable data at the end, not the beginning)
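The list above boils down to one rule: anything that changes between requests must sit after the cached prefix. A minimal sketch of that discipline - the helper name and prompt text are illustrative, not from any SDK:

```python
from datetime import date

# Static, byte-identical on every request: this is the cacheable prefix.
SYSTEM_PROMPT = "You are a support assistant for Acme Corp. Follow the policies below..."

def build_messages(user_text: str) -> tuple[str, list[dict]]:
    """Keep the system prompt unchanged across requests; put anything
    volatile (like today's date) in the user turn, after the prefix."""
    dynamic_context = f"Today's date: {date.today().isoformat()}"
    messages = [{"role": "user", "content": f"{dynamic_context}\n\n{user_text}"}]
    return SYSTEM_PROMPT, messages

# The cached prefix is identical no matter what the user asks or what day it is:
sys_a, _ = build_messages("Where is my order?")
sys_b, _ = build_messages("Cancel my subscription")
assert sys_a == sys_b  # prefix match holds, cache stays warm
```

Had the date been interpolated into `SYSTEM_PROMPT` instead, the prefix would change at midnight and every first request of the day would pay the full cache-write price.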
One genuinely useful side effect: cached tokens don’t count against your Input Tokens Per Minute rate limits. So caching isn’t just cheaper, it also lets you make more requests per minute. That combination - 90% cheaper AND higher throughput - makes caching sort of a no-brainer for any application with repeated context. If you have been hitting rate limits on a high-volume application, enabling caching might solve two problems at once.
Funnily enough, many production applications already have perfect caching opportunities they are ignoring. A chatbot with a 2,000-word system prompt sends those same 2,000 words on every single user message. Without caching, that is roughly 2,600 input tokens billed at full price per request. With caching, the effective cost drops to that of about 260 tokens after the first call. Multiply by thousands of daily conversations and the savings are massive.
The implementation itself is straightforward. Add cache_control: {"type": "ephemeral"} to the last content block you want cached in your API request. Anthropic handles the rest. The 5-minute TTL means the cache stays warm during active conversations and expires naturally during quiet periods. For chatbot workloads where users send messages every few seconds, you basically never pay full input price after the first message.
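As a concrete sketch, here is what a cacheable request body looks like using the Anthropic Python SDK's keyword shapes - the model id and prompt text are illustrative, and note that caching has a minimum cacheable prompt length (around 1,024 tokens on most models), so very short system prompts won't benefit:

```python
# Stands in for a real ~2,000-word system prompt (caching needs a
# minimum prefix length, roughly 1,024 tokens on most models).
LONG_SYSTEM_PROMPT = "You are a support assistant for Acme Corp. " * 200

def cached_request(user_text: str) -> dict:
    """Build Messages API kwargs with a cache breakpoint on the system prompt."""
    return {
        "model": "claude-sonnet-4-5",  # illustrative model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                # Everything up to and including this block is cached.
                # First call writes the cache at 1.25x input price;
                # later calls read at 0.1x for the 5-minute TTL.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Variable content goes after the breakpoint, so it never
        # invalidates the cached prefix.
        "messages": [{"role": "user", "content": user_text}],
    }

# Send with the SDK as: client.messages.create(**cached_request("Hi"))
```

The response's usage block reports `cache_creation_input_tokens` and `cache_read_input_tokens`, which is how you verify the cache is actually being hit.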
Batch processing is free money for patient workloads
The Claude Batch API gives you a flat 50% discount on all tokens - input and output - for workloads that can wait up to 24 hours. Most batches finish in under an hour. Half price for waiting 60 minutes. That is about as close to free money as API pricing gets.
Best use cases:
- Overnight content generation pipelines
- Bulk document analysis and extraction
- Model evaluation and benchmarking runs
- Data enrichment and classification at scale
- Batch email or notification personalization
Not suitable for:
- Real-time chat interfaces
- Interactive coding assistants
- Anything where a user is actively waiting
The real power comes from stacking. Send a batch request with prompt caching enabled and you get both discounts. Cache reads at 0.1x base price, then the batch discount cuts that in half again. The math works out to roughly 95% total savings on input tokens for cached batch requests. Anthropic confirmed this stacking behavior in their token-saving updates.
This stacking is spot on for any workload that isn't latency-sensitive. If your pipeline currently processes documents synchronously during business hours, switching to a nightly batch job with caching enabled is probably the single biggest cost reduction you can make this week.
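A sketch of what that stacked nightly job looks like: every batch request shares the same cached instruction prefix, so the 0.1x cache-read rate and the 50% batch discount apply together. The model id and prompt are illustrative; the request shape follows the Message Batches API's `custom_id`/`params` format:

```python
def build_batch(documents: dict[str, str]) -> list[dict]:
    """One batch request per document, all sharing one cached
    instruction prefix so cache reads and the batch discount stack."""
    instructions = [
        {
            "type": "text",
            "text": "Extract the key terms from the contract below.",
            "cache_control": {"type": "ephemeral"},  # shared cached prefix
        }
    ]
    return [
        {
            "custom_id": doc_id,  # your identifier, echoed back in results
            "params": {
                "model": "claude-haiku-4-5",  # illustrative model id
                "max_tokens": 512,
                "system": instructions,
                "messages": [{"role": "user", "content": text}],
            },
        }
        for doc_id, text in documents.items()
    ]

# Submit with: client.messages.batches.create(requests=build_batch(docs))
# then poll the returned batch id until processing_status is "ended"
# and fetch per-request results keyed by custom_id.
```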
Model routing saves more than model selection
Picking the right model is table stakes. Routing tasks to the right model automatically is where the real savings live.
The 70/20/10 pattern: route 70% of your workload to Haiku, 20% to Sonnet, 10% to Opus. Haiku handles classification, extraction, simple formatting, and high-volume repetitive tasks at a fraction of Sonnet's cost per token. Sonnet covers most coding tasks, general-purpose writing, and API building. Opus is reserved for complex multi-step reasoning, architecture decisions, and tasks that genuinely need deep thinking.
According to Anthropic, Opus 4.6 uses 65-76% fewer tokens than previous versions for the same tasks. So if you migrated from an older Opus model, your per-request costs dropped significantly even at the same price-per-token tier. The effort parameter on Opus lets you explicitly control how many tokens the model uses - set it to low for straightforward tasks and high only when you actually need deep reasoning.
Token-efficient tool calling reduces output tokens by up to 70% for tool-heavy workflows. If your application makes frequent function calls, enabling this feature on Sonnet 4.6+ can cut your output costs dramatically without changing your code.
For applications already in production, implementing a router doesn’t require rebuilding your stack. A simple classifier (which itself can run on Haiku) examines each incoming request and routes it to the cheapest model that can handle it. The classifier adds minimal overhead - maybe 50-100 tokens per request - but saves thousands by preventing expensive models from handling trivial work.
The pattern works especially well for developer-facing tools where request complexity varies wildly. A developer asking “rename this variable” doesn’t need the same model as one asking “redesign this authentication flow to support SAML SSO.” Route the first to Haiku, the second to Opus, and everything in between to Sonnet. The savings compound fast across hundreds of daily requests.
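A toy version of that router, to make the shape concrete. In production the classifier would itself be a cheap Haiku call; here it is a keyword-and-length heuristic, and the model ids, signal lists, and threshold are all illustrative assumptions:

```python
def pick_model(prompt: str) -> str:
    """Toy complexity heuristic for the 70/20/10 routing pattern.
    A real router would use a Haiku call as the classifier."""
    hard_signals = ("architecture", "redesign", "security", "migration plan")
    simple_signals = ("classify", "extract", "rename", "format", "label")
    text = prompt.lower()
    if any(s in text for s in hard_signals):
        return "claude-opus-4-5"      # ~10% of traffic: deep reasoning
    if any(s in text for s in simple_signals) and len(prompt) < 500:
        return "claude-haiku-4-5"     # ~70% of traffic: cheap, fast
    return "claude-sonnet-4-5"        # ~20% of traffic: the default tier
```

The two example requests from above land where you'd expect: "rename this variable" routes to Haiku, "redesign this authentication flow to support SAML SSO" routes to Opus, and everything else falls through to Sonnet.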
The monitoring tools you probably skip
Costs spiral when nobody watches them. Anthropic provides monitoring tools that most developers set up too late - usually after an unexpected bill.
The Token Counting API lets you count tokens before sending a request. Free to use, rate-limited but adequate for pre-flight checks. If a request would exceed your budget threshold, you can abort it before incurring any cost. Especially useful for user-submitted prompts where input length is unpredictable.
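A pre-flight check can be a few lines. The sketch below injects the token counter as a function so it runs without credentials; the real counter would wrap the SDK's `client.messages.count_tokens(...)` call, and the budget and the characters-per-token stub are illustrative assumptions:

```python
from typing import Callable

def preflight_check(count_tokens: Callable[[list[dict]], int],
                    messages: list[dict],
                    budget: int = 50_000) -> bool:
    """Return False (abort) if the request would exceed the token budget.
    `count_tokens` wraps the free Token Counting API, e.g.:
        lambda msgs: client.messages.count_tokens(
            model="claude-sonnet-4-5", messages=msgs).input_tokens
    """
    return count_tokens(messages) <= budget

# Offline stub for illustration: roughly 4 characters per token.
approx = lambda msgs: sum(len(m["content"]) for m in msgs) // 4

assert preflight_check(approx, [{"role": "user", "content": "hi"}])
```

Because the counting endpoint is free, the check costs nothing except a small latency hit, which is a fine trade for user-submitted prompts of unpredictable size.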
Code execution through MCP is a janky name for a brilliant idea. Instead of having Claude return massive code outputs in its response (which you pay output tokens for), MCP runs the code in a sandbox and only returns the result. Anthropic’s own testing showed this reduces token usage from 150,000 tokens to 2,000 - a 98.7% reduction for output-heavy operations. If your application generates or executes code, this is worth investigating. The token savings alone justify the integration effort.
The Compaction API automatically summarizes long conversations, reducing the context size that gets sent with each subsequent message. Same principle as /compact in Claude Code but available programmatically for API applications. Essential for chatbots or agents that maintain long-running conversations where context would otherwise grow unbounded.
Set up alerting on your Usage and Cost API endpoint. Even basic threshold alerts - “notify me when daily spend exceeds X” - catch runaway costs from bugs, prompt injection attacks, or unexpected traffic spikes before they become expensive problems. The teams that skip monitoring are always the ones who discover a runaway loop on Monday morning after it burned through a weekend’s worth of budget. I’ve seen it happen, and it isn’t a pleasant conversation.
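The alert logic itself is trivial; the only real work is wiring it to a scheduler and a notification channel. A minimal sketch, with the spend fetcher injected so the check is testable - in practice `get_spend_usd` would query the Usage and Cost API and the threshold is whatever your budget dictates:

```python
def check_daily_spend(get_spend_usd, threshold_usd: float):
    """Return an alert message when today's spend crosses the threshold,
    otherwise None. `get_spend_usd` should wrap your Usage and Cost API
    query; it is injected here so the check runs without credentials."""
    spend = get_spend_usd()
    if spend > threshold_usd:
        return f"API spend ${spend:.2f} exceeded daily threshold ${threshold_usd:.2f}"
    return None

# Run on a cron schedule and pipe any non-None result to Slack,
# PagerDuty, or email.
assert check_daily_spend(lambda: 12.0, 100.0) is None
assert check_daily_spend(lambda: 250.0, 100.0) is not None
```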
Stack these for maximum savings
The savings compound multiplicatively, not additively. Here is how they stack in a real-world scenario.
Start with your system prompt and tool definitions. Cache them. That cuts your per-request input cost to 10% of standard pricing. Route incoming requests through a classifier: simple tasks go to Haiku (cheapest), moderate tasks to Sonnet, and only genuinely hard problems reach Opus. For any non-urgent workload - and this applies broadly to batch jobs, evaluations, and overnight processing - run them through the Batch API for an additional 50% discount on top of everything else.
The compounding math: caching saves 90% on cached input tokens. Routing to Haiku saves another 60-80% compared to Opus pricing. Batching saves 50% on top. Combined, your effective cost per token can drop to roughly 5% of standard Opus pricing for cached, batched, Haiku-routed requests.
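The headline multiplication is worth writing out, since multiplicative stacking surprises people. The per-layer factors below come straight from the discounts described above; the Haiku-versus-Opus price ratio is illustrative, so check current pricing before relying on it:

```python
# Each layer as a fraction of the standard rate.
CACHE_READ = 0.10   # prompt caching: cached input reads at 0.1x
BATCH = 0.50        # Batch API: flat 50% discount on all tokens

# Caching and batching stacked on input tokens:
effective = CACHE_READ * BATCH
assert round(effective, 10) == 0.05   # 5% of standard price: 95% off

# Relative to Opus pricing, routing the request to a cheaper model
# multiplies the savings further (price ratio here is illustrative).
HAIKU_VS_OPUS = 1 / 15
relative_to_opus = HAIKU_VS_OPUS * CACHE_READ * BATCH
```

Note the layers multiply, not add: 90% off then 50% off is 95% off, not 140%.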
Extended thinking tokens deserve a mention. They are charged as output tokens, which makes them expensive on Opus. The effort parameter controls this directly. Default to medium effort and only escalate to high when the task genuinely requires deep reasoning. Most classification, summarization, and extraction tasks work fine on low effort. Setting effort to low on a batch of 10,000 classification requests versus leaving it on high can mean the difference between a manageable bill and a painful one.
The pattern that works: start with caching (highest impact, lowest effort to implement), add model routing (moderate effort, high ongoing savings), then migrate eligible workloads to batch processing (requires workflow changes but delivers the biggest multiplier). Each layer builds on the last. The teams I have talked to who implemented all three typically see 70-95% total cost reduction compared to their starting point.
One mistake I keep seeing: teams optimize one layer and declare victory. They enable caching but keep running everything through Opus. Or they route to Haiku but process everything synchronously at peak-hour rates. The teams that actually hit 90%+ savings are the ones who methodically stack all three layers. It takes maybe a week of engineering work. The payback period is usually measured in days.
If you are on a subscription plan rather than using the API directly, the cost levers work differently. Context management, session hygiene, and model switching are what matter there - covered in detail here.
About the Author
Amit Kothari is an experienced consultant, advisor, coach, and educator specializing in AI and operations for executives and their companies. With 25+ years of experience and as the founder of Tallyfy (raised $3.6m), he helps mid-size companies identify, plan, and implement practical AI solutions that actually work. Originally British and now based in St. Louis, MO, Amit combines deep technical expertise with real-world business understanding.
Disclaimer: The content in this article represents personal opinions based on extensive research and practical experience. While every effort has been made to ensure accuracy through data analysis and source verification, this should not be considered professional advice. Always consult with qualified professionals for decisions specific to your situation.