The context window is the maximum amount of text an AI model can consider in a single request, measured in tokens. It covers both the input you send and the output the model generates.
The basic idea
A context window is the budget an AI model has for what it can “see” in a single request. It is measured in tokens, chunks of text that average about three-quarters of an English word. Everything the model processes in that request shares the same budget: your instructions, the conversation history, any documents, tool definitions, tool results, and the model’s own reply.
Once the budget is used up, nothing else fits. Think of it like RAM. The model is a CPU, the context window is RAM, and your prompt is the program being loaded. A program that needs more memory than the machine has simply will not run; likewise, input that exceeds the window is truncated or rejected.
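The budget arithmetic can be sketched in a few lines. This uses the common rule of thumb that one token is about four characters (equivalently, about three-quarters of a word); the exact ratio varies by tokenizer and model, so treat these numbers as estimates, not measurements.

```python
# Rough token accounting for a single request.
# The 4-characters-per-token ratio is a heuristic, not any model's real tokenizer.

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_window(prompt: str, max_output_tokens: int, window: int = 128_000) -> bool:
    """Input and output share the same budget, so both must fit together."""
    return estimate_tokens(prompt) + max_output_tokens <= window
```

A 600,000-character prompt is roughly 150k tokens, so it already overflows a 128k window before the model has written a single word of its reply.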
Why context windows exist at all
Attention mechanisms in transformer models scale poorly with input length. Doubling the context roughly quadruples the computation for the attention step. Hardware and algorithmic advances have pushed the limit up dramatically over the past few years, but long context is still not free. Every extra token costs compute, latency, and money.
So model providers set a cap. In 2026 the ceiling is around one million tokens for the largest frontier models, with most production use sitting between 128k and 200k. Cheap and fast models tend to have smaller windows.
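The quadratic scaling claim above is just the square law of attention, where cost grows with the square of sequence length. A one-line sketch makes the arithmetic concrete:

```python
# Attention compute grows roughly quadratically with sequence length.
# Costs are relative to an arbitrary baseline length, not real FLOP counts.

def relative_attention_cost(n_tokens: int, baseline: int = 1_000) -> float:
    """Cost of attending over n_tokens, relative to a baseline-length sequence."""
    return (n_tokens / baseline) ** 2
```

Doubling from 1,000 to 2,000 tokens gives a relative cost of 4.0, which is why providers cap the window rather than let it grow unboundedly.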
What fills up a context window
Most people underestimate how quickly context gets used.
- System prompts can be 1k to 10k tokens by themselves
- Tool definitions for an agent with many MCP servers can run into tens of thousands
- Prior turns in a long conversation accumulate fast
- Retrieved documents in a RAG setup are often the biggest single item
- The response itself has to fit inside the same window
A long coding session in an agent can easily push through hundreds of thousands of tokens before doing anything unusual. This is why compaction and compression are becoming standard features in AI tools.
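Adding up the components listed above shows how little room can be left for the reply. The figures here are illustrative, drawn from the ranges the list mentions, not measurements of any particular system:

```python
# Illustrative accounting of a single agent request against a 128k window.
# Component sizes are examples in the ranges described above, not real data.
window = 128_000

components = {
    "system_prompt": 4_000,
    "tool_definitions": 20_000,     # many MCP servers add up fast
    "conversation_history": 60_000,
    "retrieved_documents": 30_000,  # often the biggest single item
}

used = sum(components.values())
remaining_for_reply = window - used  # the model's answer must fit here
```

With 114k tokens consumed before generation starts, only 14k remain for the response, and the next turn will have even less.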
Big context versus retrieval
There are two strategies for dealing with data that is larger than your context budget.
Stuff it all in. If your model has a 1M token window and your document fits, you can just include the whole thing. This works for book-length single documents and is the simplest approach.
Retrieve what is relevant. For data that is much bigger than any window (entire codebases, personal bookmark archives, company knowledge bases), you use RAG. Only the most relevant chunks for the current query go into the window.
Retrieval usually wins for large or changing data. It is cheaper per call, faster, and keeps the model focused on what actually matters. Research consistently shows that answer quality drops when irrelevant content is packed into the window, even when everything technically fits.
Context windows and MCP tools
The Model Context Protocol is designed with context windows in mind. Instead of preloading all possible data, MCP servers expose tools that the AI can call on demand. The tool runs, returns only the relevant results, and those get added to the context.
ContextBolt is a good example. Your saved articles might total tens of megabytes of text. That would never fit in any context window, and sending it all would drown the model in noise. Instead, the MCP server exposes a search tool. The AI calls it with a specific query, gets back the top matching saves, and uses just those. The context stays efficient. The data stays outside the window until it is needed.
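The on-demand pattern looks roughly like the sketch below. The class and method names (`Archive`, `search_saves`) are hypothetical, not ContextBolt's or MCP's actual API, and the overlap scoring again stands in for a real search backend:

```python
# Hypothetical sketch of the MCP pattern described above: the full archive
# stays outside the context window, and a search tool returns only the
# top matches. Names are illustrative, not a real server's API.

class Archive:
    def __init__(self, saves: dict[str, str]):
        # title -> full text; the whole archive may be tens of megabytes
        self.saves = saves

    def search_saves(self, query: str, limit: int = 3) -> list[str]:
        """Tool the AI calls on demand; only these titles enter the context."""
        q = set(query.lower().split())
        ranked = sorted(
            self.saves.items(),
            key=lambda kv: len(q & set(kv[1].lower().split())),
            reverse=True,
        )
        return [title for title, _ in ranked[:limit]]
```

The model never sees `self.saves` directly; it sees only what a specific query pulls out, which is what keeps the window tight no matter how large the archive grows.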
This is the pattern most agent systems converge on by 2026: tight context, powerful tools, retrieval for scale.