The short version
A big context window looks like free storage. It behaves like a running meter. Every token you load gets re-sent, and re-billed, on every turn that follows, and the cache that is meant to soften that has a five-minute fuse.
- The context window is re-sent in full on every turn, so a file loaded once is paid for many times
- Prompt caching cuts that cost, but the cache expires after five minutes by default
- A cache read costs a tenth of a fresh send; a cache write costs a quarter more, so a miss is a steep swing
- Model accuracy fades as the window fills, so a fuller window is not a smarter one
There is a comfortable belief about large context windows that goes like this: the window is big, so I can load whatever I want, and Claude will sort it out. A million tokens of room. Pour the whole codebase in.
The window is real and the room is real. What the belief misses is that a context window is not storage. It is not a drawer you fill once. It is closer to a meter that runs. Everything you put in the window is sent to the model again on every single turn for the rest of the session, and paid for again each time. A file you loaded on turn two is not a turn-two cost. It is a cost on turn three, turn four, and every turn after.
So the real question about a large context window is not “will it fit.” Almost anything fits. The question is “what does carrying it cost,” and the answer has three parts that compound: the re-billing, a cache that expires faster than people expect, and a quieter tax where the model gets less reliable as the window fills. Put together, brute-force loading is usually the expensive choice dressed up as the easy one.
What a full window costs
Start with the mechanism, because it is the part people skip. Claude Code’s context window holds the entire conversation: every message, every file Claude has read, every command output. And the model does not get a diff each turn. It gets the whole window, resent.
That is the cost engine. A large file you load early does not cost you once. It joins the window and is re-sent on every following turn until the session ends or the context is compacted. Load a handful of big files at the start of a long session and you have not made a handful of purchases. You have signed up for a subscription that bills every turn, and you signed it without a price shown. This is why a session that never felt expensive can still be expensive: no single move in it was dramatic, but a dozen heavy reads, each re-sent across thirty turns, add up to a large number arrived at quietly. The first cost of a big window is not the loading. It is the keeping. I made the same point from the budgeting side in how to budget tokens in Claude Code; here the point is sharper, because a large window is precisely the thing that makes the re-billing large.
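The arithmetic behind "a dozen heavy reads, each re-sent across thirty turns" can be sketched in a few lines. This is a rough illustration, not a billing calculator: the function name, the 20,000-token file size, and the assumption that nothing is cached or compacted are all invented for the example.

```python
def tokens_billed_for_file(file_tokens: int, loaded_on_turn: int, total_turns: int) -> int:
    """Input tokens attributable to one file across a session.

    Assumes the file enters the window on `loaded_on_turn`, is sent on that
    turn and every later one, and is never compacted away or served from cache.
    """
    if not 1 <= loaded_on_turn <= total_turns:
        raise ValueError("loaded_on_turn must fall within the session")
    turns_carried = total_turns - loaded_on_turn + 1
    return file_tokens * turns_carried


# Hypothetical session: twelve 20,000-token files, all read on turn 1 of 30.
one_file = tokens_billed_for_file(20_000, loaded_on_turn=1, total_turns=30)
dozen_files = 12 * one_file
print(one_file)     # 600000
print(dozen_files)  # 7200000
```

Under those assumptions, 240,000 tokens of files turn into 7.2 million input tokens by the end of the session. Nothing about any single read looks expensive; the multiplier is the turn count.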
The five-minute cache cliff
The obvious objection is prompt caching, and it is a fair one. Caching exists to stop you re-paying full price for the same tokens turn after turn. It works. It also has an edge that almost nobody plans for.
The verified numbers are worth stating precisely. A cache read costs 0.1 times the price of a normal input token. A cache write costs 1.25 times that price. So reading cached context is a tenth of the price of sending it fresh, a real saving, and writing it to the cache costs a quarter more than a fresh send. The trap is in the lifetime. The official prompt caching documentation is exact: “By default, the cache has a 5-minute lifetime.”
Five minutes. Step away to read a Slack message, think through a hard problem, take a call, and the cache behind your big context window quietly expires. The next request cannot read at 0.1 times. It has to write again, at 1.25 times. That is the cache cliff: the same chunk of context swings from a tenth of the price to a quarter above it, a more than twelvefold jump, the instant you cross five idle minutes. A 1M-token window that you imagined as cheap-to-carry because it is mostly cache reads is only cheap while you keep the cache warm. Work in long, gappy stretches and you are not getting the cached price. You are paying the write price, again and again, for a window you were told would be efficient.
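To make the cliff concrete, here is a minimal sketch of the swing for a single request. The multipliers and the five-minute lifetime come from the figures above; the function name, the 200,000-token context, and the choice to treat the cache as expired the moment idle time reaches the lifetime are assumptions for illustration. Costs are expressed in units of one base input token, so no real prices are implied.

```python
CACHE_READ_MULTIPLIER = 0.1    # from the text: a read costs 0.1x a normal input token
CACHE_WRITE_MULTIPLIER = 1.25  # from the text: a write costs 1.25x a normal input token
DEFAULT_TTL_SECONDS = 5 * 60   # from the text: 5-minute default lifetime


def carry_cost(context_tokens: int, idle_seconds: float,
               ttl_seconds: float = DEFAULT_TTL_SECONDS) -> float:
    """Cost of sending previously cached context on the next request,
    in units of one base input token.

    Assumption: the cache counts as expired once idle time reaches the TTL.
    """
    if idle_seconds < ttl_seconds:
        return context_tokens * CACHE_READ_MULTIPLIER
    return context_tokens * CACHE_WRITE_MULTIPLIER


warm = carry_cost(200_000, idle_seconds=60)      # back within a minute
cold = carry_cost(200_000, idle_seconds=6 * 60)  # back after six minutes
print(f"warm: {warm:,.0f}  cold: {cold:,.0f}  ratio: {cold / warm:.1f}x")
# warm: 20,000  cold: 250,000  ratio: 12.5x
```

The context is identical in both calls. The only thing that changed is how long you were away.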
Accuracy fades as it fills
The cost so far has been measured in tokens. There is a second cost measured in quality, and it is the one that should worry you more, because you cannot see it on a bill.
A fuller context window is not a smarter one. Anthropic’s own best-practices guide states it without hedging: performance degrades as the context fills, and “when the context window is getting full, Claude may start forgetting earlier instructions or making more mistakes.” Read that as the real penalty of brute-force loading. When you pour the whole codebase into the window, you are not handing the model more to reason with. Past a point you are handing it more to lose track of. The instruction you gave at the start and the file that actually mattered are now competing for attention with hundreds of files the task never needed. The model does not fail loudly. It just gets a little less precise, a little more forgetful, and you pay for the bigger window and get a worse answer inside it. That is the trap at its purest: you spent more to make the result worse.
When the window earns its cost
None of this means the large window is a mistake. It is a tool, and there are tasks it is right for. Knowing the cost is what lets you spend it well, so consider the other side.
A large context window earns its cost when the task needs breadth that cannot be chunked: tracing one behavior across many files at once, reviewing a wide change where the interactions are the point, holding a long document whole because a summary would drop the detail that matters. In those cases the alternative, repeated narrow loads, would cost more in tokens and more in your time, and the degradation is a price worth paying for work that cannot be done any other way. The test is simple to state. Does this task need everything loaded at the same time, or does it just feel easier to load everything than to choose? The first is a good reason. The second is the habit this post is arguing against. If you are trying to set that discipline across a team rather than relying on each person to judge it, Blue Sheen runs engagements like this.
Load like it costs something
The fix is not a smaller context window. It is treating the one you have as something with a price, because it has one.
Load narrow. Ask for the specific file and the specific function, not the directory, because every token you pull in joins the re-billing. Work in focused stretches rather than long gappy ones, so the cache stays warm and you keep the tenth-price read instead of the write. Compact or clear the context between unrelated tasks, before the old material has cost you twenty turns of re-billing and started to crowd the model’s attention; the discipline of keeping a plan and a clean context pays here too. And lean on caching deliberately. It has a craft of its own, covered properly in LLM caching strategies and in the broader question of what actually saves Claude costs.
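The difference between focused and gappy work can be put in rough numbers. The sketch below is a deliberately simplified model, and its assumptions are loud ones: the context stays a fixed size for the whole session, only the carried context is counted (not new messages or output), the first request writes the cache, and each later request reads the cache if the gap since the previous request is under five minutes and rewrites it otherwise. The function name and the session shapes are hypothetical.

```python
def session_cost(context_tokens, gaps_seconds, ttl_seconds=300,
                 read=0.1, write=1.25):
    """Total cost of carrying a fixed-size context through a session,
    in units of one base input token.

    `gaps_seconds` lists the idle time before each request after the first.
    """
    total = context_tokens * write  # first request writes the cache
    for gap in gaps_seconds:
        # Warm cache: cheap read. Expired cache: pay the write price again.
        multiplier = read if gap < ttl_seconds else write
        total += context_tokens * multiplier
    return total


# Ten turns over the same 100,000-token context, two working styles.
focused = session_cost(100_000, [60] * 9)   # one minute between turns
gappy = session_cost(100_000, [600] * 9)    # ten minutes between turns
print(f"focused: {focused:,.0f}  gappy: {gappy:,.0f}")
# focused: 215,000  gappy: 1,250,000
```

Same context, same number of turns, and in this toy model the gappy session costs nearly six times as much. Real sessions grow and mix both patterns, so treat the ratio as a direction, not a forecast.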
The large context window is one of the best things about modern Claude, and it is most useful to the people who respect what it costs. The trap was never the window. It was the belief that filling it is free. It is not free. It is re-billed every turn, it is cheap only while the cache is warm, and it makes the model less sharp as it fills. Load it the way you would pack a bag you have to carry all day. Only what the day needs, and nothing you will not use.



