Glossary

What is a Context Window?

Concept by David Hamilton
Definition

A context window is the maximum amount of text an AI model can consider in a single request, measured in tokens. It includes both the input you send and the output the model generates.

The basic idea

A context window is the budget an AI model has for what it can “see” in a single request. It is measured in tokens, which are chunks of text roughly three-quarters the size of an English word. Everything the model processes in that request shares the same budget: your instructions, the conversation history, any documents, tool definitions, tool results, and the model’s own reply.
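A handy rule of thumb is that one token is about four characters of English text, so you can estimate usage before sending anything. A minimal sketch, assuming nothing about any particular model's tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: one token is about three-quarters of an
    English word, i.e. roughly four characters. Real tokenizers vary
    by model, so treat this as a ballpark figure only."""
    return max(1, round(len(text) / 4))

prompt = "Summarise the attached report in three bullet points."
print(estimate_tokens(prompt))  # roughly 13 tokens
```

Every piece of the request (instructions, history, documents, tool results, the reply) would be estimated the same way and summed against the budget.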

Once the budget is used up, nothing else fits. Think of it like RAM: the model is a CPU, the context window is RAM, and your prompt is the program being loaded. Ask for more memory than the machine has and the program simply will not run.

Why context windows exist at all

Attention mechanisms in transformer models scale poorly with input length. Doubling the context roughly quadruples the computation for the attention step. Hardware and algorithmic advances have pushed the limit up dramatically over the past few years, but infinite context is still not free. Every extra token costs compute, latency, and money.
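The quadratic scaling is easy to see with a back-of-the-envelope calculation. A minimal sketch of the attention-step cost, ignoring constant factors and real-world kernel optimisations:

```python
def attention_cost(n_tokens: int) -> int:
    # Self-attention compares every token with every other token,
    # so the core cost grows with the square of the sequence length.
    return n_tokens ** 2

# Doubling the context roughly quadruples the attention cost.
print(attention_cost(2000) / attention_cost(1000))  # 4.0
```

Production systems use tricks (sparse attention, caching, sliding windows) that soften this, but the underlying pressure is why caps exist.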

So model providers set a cap. In 2026 the ceiling is around one million tokens for the largest frontier models, with most production use sitting between 128k and 200k. Cheap and fast models tend to have smaller windows.

What fills up a context window

Most people underestimate how quickly context gets used.

A long coding session in an agent can easily push through hundreds of thousands of tokens before doing anything unusual. This is why compaction and compression are becoming standard features in AI tools.
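To see how quickly the budget goes, here is a back-of-the-envelope accounting. All figures below are illustrative, not measurements from any real session:

```python
# Illustrative numbers only; real sizes vary widely by setup.
budget = 200_000  # a common production window size
used = {
    "system prompt": 2_000,
    "tool definitions": 5_000,
    "conversation history": 60_000,
    "attached documents": 90_000,
    "tool results": 25_000,
}
remaining = budget - sum(used.values())
print(f"used {sum(used.values()):,}, remaining {remaining:,} for the reply")
```

With 182,000 tokens already spoken for, only 18,000 remain for the model's answer, and the next tool call shrinks that further.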

Big context versus retrieval

There are two strategies for dealing with data that is larger than your context budget.

Stuff it all in. If your model has a 1M token window and your document fits, you can just include the whole thing. This works for book-length single documents and is the simplest approach.

Retrieve what is relevant. For data that is much bigger than any window (entire codebases, personal bookmark archives, company knowledge bases), you use retrieval-augmented generation (RAG). Only the most relevant chunks for the current query go into the window.

Retrieval usually wins for large or changing data. It is cheaper per call, faster, and keeps the model focused on what actually matters. Research consistently shows that answer quality drops when irrelevant content is packed into the window, even when everything technically fits.
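The retrieval strategy can be sketched in a few lines. This toy version scores chunks by word overlap; real systems use embeddings or BM25, but the budget logic is the same: only the top-k chunks enter the window.

```python
def score(query: str, doc: str) -> int:
    # Toy relevance score: count of shared words between query and doc.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Only the k best-matching chunks are placed into the context window.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

docs = [
    "Notes on transformer attention and context length",
    "Quarterly sales figures for the retail division",
    "How retrieval keeps the context window small",
]
print(retrieve("context window retrieval", docs, k=2))
```

Everything not returned stays outside the window, costing nothing and adding no noise.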

Context windows and MCP tools

The Model Context Protocol is designed with context windows in mind. Instead of preloading all possible data, MCP servers expose tools that the AI can call on demand. The tool runs, returns only the relevant results, and those get added to the context.

ContextBolt is a good example. Your saved articles might total tens of megabytes of text. That would never fit in any context window, and sending it all would drown the model in noise. Instead, the MCP server exposes a search tool. The AI calls it with a specific query, gets back the top matching saves, and uses just those. The context stays efficient. The data stays outside the window until it is needed.
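The on-demand pattern looks roughly like this. The tool name, sample data, and result shape below are hypothetical stand-ins for illustration, not ContextBolt's actual API:

```python
# Hypothetical sketch of an MCP-style search tool; names and data
# are illustrative, not ContextBolt's real interface.
SAVED_ARTICLES = {
    "mcp-intro": "A primer on the Model Context Protocol",
    "rag-basics": "Retrieval-augmented generation explained",
    "pasta-recipes": "Fifty ways to cook pasta",
}

def search_saves(query: str, limit: int = 2) -> list[str]:
    # Only the matching saves are returned and added to the context;
    # everything else stays outside the window until it is needed.
    hits = [text for text in SAVED_ARTICLES.values()
            if any(word in text.lower() for word in query.lower().split())]
    return hits[:limit]

print(search_saves("retrieval context"))
```

The AI never sees the full archive, only the handful of results the tool hands back for the current query.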

This is the pattern most agent systems converge on by 2026: tight context, powerful tools, retrieval for scale.

Frequently asked questions

How big is a modern context window?
As of 2026, frontier models range from 128,000 tokens (GPT-4-class) to 1,000,000 tokens (Claude Opus 4.7 in 1M mode, Gemini 2.5 Pro). One million tokens is roughly 750,000 words, or around 3,000 pages of text. Smaller open-source models often sit at 32,000 to 128,000.
What counts against the context window?
Everything the model sees. The system prompt, your messages, previous assistant replies, tool definitions, tool output, and retrieved documents all count. The response the model generates also counts. If you have a 200k window and a 180k input, you only have 20k left for the answer.
What happens when the context window fills up?
Different clients handle this differently. Some error out. Some truncate the oldest messages. Some compress or summarise earlier content. Claude Code, for example, automatically compacts old messages as the window fills up, so long sessions keep working.
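The simplest of these strategies, dropping the oldest messages, can be sketched like this (the token estimator is a crude stand-in for a real tokenizer):

```python
def truncate_oldest(messages: list[str], budget: int,
                    estimate=lambda m: len(m) // 4) -> list[str]:
    # Drop the oldest messages until the remainder fits the budget.
    # Real clients may instead summarise or compact old content.
    kept = list(messages)
    while kept and sum(estimate(m) for m in kept) > budget:
        kept.pop(0)
    return kept
```

Compaction works the same way at the trigger level but replaces old messages with a summary rather than discarding them outright.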
Why do bigger context windows not solve everything?
Two reasons. First, cost and latency scale with input size, so stuffing a million tokens into every call gets expensive and slow. Second, models still struggle to attend reliably over very long inputs, so recall tends to degrade, especially for content buried in the middle. Retrieval (RAG) usually beats brute-force stuffing.
How does this relate to bookmarks and MCP?
You cannot fit your entire bookmark history in a context window, and even if you could, it would be wasteful. Instead, an MCP server like ContextBolt retrieves only the most relevant saves for a given query and feeds just those into the context. The window stays focused on what matters.