Retrieval-Augmented Generation (RAG) is a pattern where an AI model fetches relevant documents at query time and uses them as context, instead of relying only on what it learned during training.
What RAG actually is
Retrieval-Augmented Generation is a pattern for giving AI models information they did not learn during training. The model stays the same. What changes is that before answering, the system retrieves relevant content from some data source and stuffs it into the prompt as context.
The flow is simple. You ask a question. A retrieval layer finds the most relevant documents or snippets. Those get added to the prompt. The model generates an answer grounded in what was retrieved. Without RAG, you either have to fine-tune the model on your data or accept that it only knows what was in its training set.
Why RAG exists
Large language models have two well-known limits. They have a knowledge cutoff, so they know nothing that happened after their training data was collected. And they cannot access your private data, because your data was not in their training set.
You can solve both by fine-tuning, but fine-tuning is slow, expensive, and has to be redone every time the data changes. For most use cases, that is the wrong tool. RAG solves the same problem by keeping the model untouched and feeding it fresh context on demand. Cheaper. Faster. Easier to keep current.
How RAG works under the hood
A typical RAG pipeline has three stages.
Ingestion. Your documents get split into chunks, and each chunk gets converted into a vector embedding. Embeddings are numerical representations of meaning. Similar content ends up with similar vectors. The chunks and their embeddings get stored in a vector database.
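The ingestion stage can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `embed` function here is a toy bag-of-words stand-in for a real embedding model, and `chunk` uses simple word-count windows where real systems often split on sentences or tokens.

```python
from collections import Counter

def chunk(text, size=50, overlap=10):
    """Split text into word-based chunks with a small overlap,
    so context is not lost at chunk boundaries."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(text):
    """Toy embedding: lowercase word counts.
    A real system would call an embedding model here."""
    return Counter(text.lower().split())

# Ingest: each chunk is stored alongside its embedding.
document = "RAG retrieves relevant chunks at query time and adds them to the prompt"
index = [(c, embed(c)) for c in chunk(document, size=8, overlap=2)]
```

A real pipeline would write `index` to a vector database instead of keeping it in memory, but the shape of the data is the same: chunk text paired with its vector.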
Retrieval. When a query comes in, it gets embedded the same way. The system finds the chunks whose embeddings are closest to the query embedding. These are your most semantically relevant results, not just keyword matches.
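Nearest-neighbour retrieval reduces to "embed the query, rank chunks by similarity." The sketch below uses cosine similarity over the same toy bag-of-words embedding as above; a real system would query a vector database with learned embeddings, but the ranking logic is the same idea.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks whose embeddings are closest to the query's."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "embeddings map text to vectors",
    "fine-tuning retrains the model",
    "vector search finds nearest neighbours",
]
retrieve("embeddings map meaning", chunks, k=1)  # → ["embeddings map text to vectors"]
```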
Generation. The retrieved chunks get inserted into the prompt alongside the user’s question. The model generates an answer grounded in that context.
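The generation stage is mostly prompt assembly. A minimal sketch, with the instruction wording as an illustrative choice rather than a fixed template:

```python
def build_prompt(question, retrieved):
    """Assemble retrieved chunks and the user's question into one prompt.
    Numbering the chunks lets the model cite its sources."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the context below. Cite sources by number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What is RAG?",
    ["RAG fetches relevant documents at query time and uses them as context."],
)
```

This string is what actually gets sent to the model; everything upstream exists to decide what goes into `retrieved`.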
Good RAG systems add more: query rewriting, reranking, hybrid search that combines keyword and semantic matching, and filters on metadata like source or date. The model is usually the easy part. Retrieval is where the work lives.
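One common way to combine keyword and semantic results is reciprocal rank fusion: each search method produces a ranked list, and documents are scored by their rank in each list rather than by raw scores, which sidesteps the problem that keyword and embedding scores are on different scales. A minimal sketch:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists into one.
    Each document scores 1 / (k + rank + 1) per list it appears in;
    k=60 is a conventional damping constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_results = ["a", "b", "c"]   # e.g. from BM25
semantic_results = ["b", "c", "a"]  # e.g. from vector search
fused = rrf([keyword_results, semantic_results])  # → ["b", "a", "c"]
```

"b" wins here because it ranks highly in both lists, even though neither method put it unambiguously first, which is exactly the behaviour hybrid search is after.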
Personal RAG and MCP
RAG used to mean building your own pipeline: vector database, embedding model, retrieval logic, prompt construction. That is a lot of infrastructure for a single user.
The Model Context Protocol changes this. An MCP server can expose a retrieval tool to any compatible AI client. ContextBolt is a good example. The extension captures your bookmarks, processes them into searchable chunks, and exposes a search tool via MCP. When you ask Claude something like “what did I save about React state management”, Claude calls the tool, gets back the most relevant saves, and answers with that as context.
You get personal RAG without building anything. The AI client handles the prompt, the MCP server handles the retrieval, and your saves stay in your system.
When to use RAG
RAG makes sense when:
- The data changes often and fine-tuning would be too slow
- The data is private and cannot be used for training
- You need to cite sources in the answer
- The data volume is too large to fit in a context window
It is less useful when you need the model to learn a new skill or style, rather than access new facts. For that, fine-tuning is still the right tool.
For most people saving content they want to find later, RAG is the pattern that makes AI actually useful. Your browsing context becomes a queryable knowledge base, and the AI acts as the interface.