Retrieval-Augmented Generation (RAG) Explained & Implemented | Codersarts
Retrieval-Augmented Generation (RAG): The Paper That Grounded AI in Real Knowledge
Published by Codersarts · AI Research Paper Series | https://labs.codersarts.com/
The Paper at a Glance
Title | Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks |
Authors | Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela |
Institution | Facebook AI Research (FAIR) |
Published | 2020 |
arXiv | |
Citations | 10,000+ |

What This Paper Introduced
Language models have a fundamental problem: their knowledge is frozen at training time.
Ask a pre-trained LLM about something that happened after its training cutoff, or about a proprietary document inside your company, and it will either hallucinate an answer or admit it doesn't know. Both outcomes are unacceptable in production systems.
The RAG paper introduced a clean architectural solution: give the language model access to an external knowledge source at inference time.
Instead of relying solely on what the model memorized during pre-training, RAG retrieves relevant documents from an external store and conditions the generation on that retrieved context. The model's output is no longer just a function of its weights — it is a function of its weights plus the most relevant knowledge available right now.
This was a conceptual shift as much as a technical one. It changed how the industry thought about LLMs in production: not as static knowledge bases, but as reasoning engines that can be connected to live, updatable information.
Most enterprise AI assistants, knowledge-base chatbots, and document Q&A tools you see today build on this idea.
The Core Architecture
RAG combines two components that are trained jointly, end to end:
         User Query
             ↓
┌─────────────────────────────┐
│          RETRIEVER          │
│   Dense Passage Retrieval   │
│      (DPR, bi-encoder)      │
│  Query Encoder → q vector   │
│  Doc Encoder → doc vectors  │
│      MIPS: top-k docs       │
└────────────┬────────────────┘
             │ top-k documents
             ↓
┌─────────────────────────────┐
│          GENERATOR          │
│      BART seq2seq LLM       │
│ Input: query + doc context  │
│       Output: answer        │
└─────────────────────────────┘
             ↓
      Generated Answer
The paper introduced two variants:
RAG-Sequence: retrieves the top-k documents once per query, scores the full answer against each document independently, then marginalizes over the k documents to produce the final sequence probability.
RAG-Token: uses the same top-k documents, but marginalizes over them at every generation step, so different tokens in the output can draw on different retrieved documents. More flexible, more compute-intensive to decode.
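A toy calculation can make the difference between the two variants concrete. In the paper's formulation both variants score the same k retrieved documents and differ only in where the sum over documents sits: once per sequence, or once per token. The sketch below is illustrative only; the probabilities and function names are hypothetical, not taken from the paper.

```python
from math import prod

# Hypothetical toy numbers: k = 2 retrieved documents, a 2-token answer.
# doc_prior[z] plays the role of the retriever probability p(z | x).
doc_prior = [0.6, 0.4]

# token_probs[z][i] is the generator's probability of the i-th answer token
# when conditioned on document z (and on the tokens generated so far).
token_probs = [
    [0.9, 0.2],  # document 0 supports token 0 well, token 1 poorly
    [0.1, 0.8],  # document 1 is the opposite
]

def rag_sequence_prob(doc_prior, token_probs):
    """One document explains the whole answer: sum_z p(z) * prod_i p(y_i | z)."""
    return sum(p_z * prod(probs_z) for p_z, probs_z in zip(doc_prior, token_probs))

def rag_token_prob(doc_prior, token_probs):
    """Each token mixes over documents: prod_i sum_z p(z) * p(y_i | z)."""
    num_tokens = len(token_probs[0])
    return prod(
        sum(p_z * probs_z[i] for p_z, probs_z in zip(doc_prior, token_probs))
        for i in range(num_tokens)
    )

print(rag_sequence_prob(doc_prior, token_probs))
print(rag_token_prob(doc_prior, token_probs))
```

In this toy setup the answer needs evidence from both documents, so the per-token mixture assigns it a higher probability than the per-sequence mixture does.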
Component 1: The Retriever (DPR)
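The retriever is a Dense Passage Retriever (DPR): a bi-encoder architecture with two BERT-based encoders:
Query encoder q(x): encodes the query x into a dense vector
Document encoder d(z): pre-encodes each document z into a dense vector, stored in a pre-built document index
Together they define the retrieval distribution p_η(z|x) ∝ exp(d(z)ᵀ q(x)), so a document becomes more probable as the inner product between its vector and the query vector grows.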
At inference time, retrieval is performed via Maximum Inner Product Search (MIPS) — finding the top-k documents whose embeddings have the highest dot product with the query embedding.
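# Conceptual retrieval with NumPy; query_encoder, doc_index, and documents are assumed to exist
import numpy as np
query_vector = query_encoder(query)             # shape: (d,)
doc_scores = doc_index @ query_vector           # shape: (N,), one inner product per document
top_k_ids = np.argsort(doc_scores)[::-1][:k]    # indices of the k highest scores, best first
top_k_docs = [documents[i] for i in top_k_ids]  # the passages themselves, not their vectors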
The key difference from sparse retrieval (BM25, TF-IDF) is that dense retrieval captures semantic similarity, not just keyword overlap. A query for "heart attack" can retrieve documents about "myocardial infarction" even though the two phrases share no words; pure keyword search cannot.
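The inner-product scores from MIPS are also what the paper turns into a retrieval probability, p_η(z|x) ∝ exp(d(z)ᵀ q(x)). A minimal sketch of that step, assuming the softmax is taken over just the retrieved candidates; the vectors and the function name `retrieval_distribution` are hypothetical:

```python
import math

def retrieval_distribution(query_vec, doc_vecs):
    """Softmax over inner products between the query vector and each document vector."""
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    max_score = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - max_score) for s in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]

# Toy 2-dimensional embeddings (illustrative values only)
query_vec = [1.0, 0.0]
doc_vecs = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]

probs = retrieval_distribution(query_vec, doc_vecs)
print(probs)
```

These probabilities are the weights each document receives when the generator's outputs are marginalized.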
Component 2: The Generator (BART)
The generator is BART — a seq2seq Transformer pre-trained with denoising objectives.
For each retrieved document z_i, the generator receives a concatenation of the query and the document:
Input: [query] [SEP] [retrieved document z_i]
Output: [answer tokens]
For RAG-Sequence, the final output probability is:
p(y|x) = Σ_z p_η(z|x) × p_θ(y|x, z)
The model marginalizes over the k retrieved documents, weighting each generated answer by how relevant the retrieval was.
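Written out per token, the two variants differ only in the order of the sum and the product. The block below restates them in the same symbols as the formula above, with y_i the i-th output token and N the output length; the sum runs over the top-k documents returned by the retriever.

```latex
\begin{aligned}
p_{\text{RAG-Sequence}}(y \mid x) &\approx \sum_{z \in \text{top-}k} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1}) \\
p_{\text{RAG-Token}}(y \mid x) &\approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})
\end{aligned}
```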
End-to-End Training
Crucially, both the retriever and generator are trained jointly — the retriever learns to fetch documents that make the generator's job easier. The document index itself is kept frozen during training (updating it on every step would be prohibitively expensive), but the query encoder is updated via backpropagation through the generator's loss.
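To see why the generator's loss carries a training signal for the retriever, consider the negative log of the marginal likelihood as a function of the retrieval scores. The sketch below is a toy illustration, not the paper's training code: the scores, likelihoods, and function names are hypothetical, and a real implementation would rely on automatic differentiation.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def marginal_nll(doc_scores, gen_likelihoods):
    """-log sum_z p(z|x) * p(y|x,z), with p(z|x) = softmax(doc_scores)."""
    prior = softmax(doc_scores)
    return -math.log(sum(p * g for p, g in zip(prior, gen_likelihoods)))

def grad_wrt_doc_scores(doc_scores, gen_likelihoods):
    """Analytic gradient of the loss w.r.t. each retrieval score: prior minus posterior."""
    prior = softmax(doc_scores)
    joint = [p * g for p, g in zip(prior, gen_likelihoods)]
    evidence = sum(joint)
    posterior = [j / evidence for j in joint]
    return [p - q for p, q in zip(prior, posterior)]

# Hypothetical values: the retriever prefers document 0,
# but the generator finds the gold answer most likely given document 1.
doc_scores = [1.0, 0.5, -0.2]
gen_likelihoods = [0.05, 0.60, 0.10]

grads = grad_wrt_doc_scores(doc_scores, gen_likelihoods)
print(grads)
```

The gradient for document 1 is negative, so a gradient-descent step raises its retrieval score: documents that help the generator get pulled toward the query. In RAG this signal updates only the query encoder, since the document vectors stay frozen.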
How Modern RAG Works in Practice
The paper's original architecture used BART as the generator. In 2024–2025, production RAG systems have evolved considerably, but the core pattern is identical:
┌────────────────────────────────────────────────┐
│                  RAG PIPELINE                  │
├────────────────────────────────────────────────┤
│                                                │
│  1. INDEXING (offline)                         │
│     Documents → Chunks → Embeddings → Vector DB│
│                                                │
│  2. RETRIEVAL (online, per query)              │
│     Query → Embedding → ANN Search → Top-k docs│
│                                                │
│  3. AUGMENTATION                               │
│     Prompt = System + Retrieved Docs + Query   │
│                                                │
│  4. GENERATION                                 │
│     LLM(Prompt) → Grounded Answer              │
│                                                │
└────────────────────────────────────────────────┘
Modern stack:
Component | Common Choices |
Embedding model | text-embedding-3-small, bge-large, e5-mistral |
Vector database | Pinecone, Weaviate, Qdrant, pgvector, Chroma |
Retrieval strategy | Dense, sparse (BM25), or hybrid |
Generator | GPT-4o, Claude 3.5, LLaMA 3, Mistral |
Orchestration | LangChain, LlamaIndex, Haystack |
Implementation: Production RAG Pipeline in Python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA  # legacy chain API, deprecated in recent LangChain releases

# ── Step 1: Load and chunk documents ──────────────────────────
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,      # measured in characters by default, not tokens
    chunk_overlap=64,    # overlap preserves context across chunks
    separators=["\n\n", "\n", ".", " "]
)
chunks = splitter.split_documents(documents)

# ── Step 2: Embed and index ────────────────────────────────────
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# ── Step 3: Build retriever ────────────────────────────────────
retriever = vectorstore.as_retriever(
    search_type="mmr",   # Maximal Marginal Relevance: reduces redundancy
    search_kwargs={
        "k": 5,          # retrieve top 5 chunks
        "fetch_k": 20    # candidate pool for MMR
    }
)

# ── Step 4: Build RAG chain ────────────────────────────────────
llm = ChatOpenAI(model="gpt-4o", temperature=0)
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # concat all docs into context
    retriever=retriever,
    return_source_documents=True
)

# ── Step 5: Query ──────────────────────────────────────────────
result = rag_chain.invoke({"query": "What are the key findings?"})
print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f" - {doc.metadata['source']} (page {doc.metadata.get('page', '?')})")
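To make the `chunk_size` and `chunk_overlap` settings from Step 1 concrete, here is a deliberately simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on the listed separators); the function name `sliding_window_chunks` is hypothetical and sizes are counted in characters.

```python
def sliding_window_chunks(text, chunk_size=512, chunk_overlap=64):
    """Split text into fixed-size windows where each window repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

demo_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo_chunks)
```

Each chunk starts with the last `chunk_overlap` characters of the one before it, which is what lets a sentence that straddles a boundary appear whole in at least one chunk.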
Advanced: Hybrid Retrieval (Dense + Sparse)
Pure dense retrieval misses exact keyword matches. Pure sparse retrieval misses semantic similarity. Hybrid combines both:
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Hybrid: 60% dense, 40% sparse
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
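Dense and sparse scores live on different scales, so hybrid retrievers usually fuse ranks, not raw scores. LangChain documents `EnsembleRetriever` as using weighted Reciprocal Rank Fusion; the sketch below is a simplified standalone version of that idea, not the library's code. The function name `weighted_rrf`, the document IDs, and the constant `c=60` (a commonly used default) are assumptions for illustration.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings: each list adds weight / (c + rank) to a document's score."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (c + rank)
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from a dense and a sparse retriever (best first)
dense_ranking = ["a", "b", "c"]
sparse_ranking = ["c", "a", "d"]

fused_ranking = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.6, 0.4])
print(fused_ranking)
```

Documents that both retrievers rank well ("a" and "c" here) rise above documents that only one retriever found.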
Advanced: Reranking
After retrieval, a cross-encoder reranker scores each candidate more accurately than the bi-encoder used for retrieval:
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=3):
    """Score each (query, passage) pair with the cross-encoder and keep the best top_n."""
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    # Sort on the score only; sorting raw (score, doc) tuples would compare Document objects on ties
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# Retrieve more, rerank to fewer
query = "What are the key findings?"
candidates = hybrid_retriever.invoke(query)
reranked_docs = rerank(query, candidates, top_n=3)
Original Paper Results
Evaluated on open-domain QA benchmarks against previous state-of-the-art:
Model | Natural Questions (EM) | TriviaQA (EM) | WebQuestions (EM) |
Closed-book GPT-2 | 4.1 | 29.9 | 3.3 |
REALM (retrieval-augmented) | 40.4 | — | 40.7 |
T5-11B + SSM (closed-book) | 36.6 | 60.5 | 44.7 |
RAG (Lewis et al., best of RAG-Sequence and RAG-Token) | 44.5 | 68.0 | 45.5 |
RAG outperformed much larger closed-book models, including the 11B-parameter T5, by grounding generation in retrieved evidence rather than relying only on memorized parameters.
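The EM columns report Exact Match: the share of questions where the predicted answer equals one of the reference answers after light normalization. A minimal sketch, assuming the SQuAD-style normalization commonly used for open-domain QA (lowercasing, stripping punctuation and English articles); it is not the paper's own evaluation script, and the function names are hypothetical.

```python
import re
import string

def normalize_answer(text):
    """Lowercase, drop punctuation and the articles a/an/the, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized reference answer."""
    normalized_prediction = normalize_answer(prediction)
    return any(normalized_prediction == normalize_answer(gold) for gold in gold_answers)

def em_score(predictions, gold_answer_lists):
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, golds) for p, golds in zip(predictions, gold_answer_lists))
    return 100.0 * hits / len(predictions)

print(em_score(["The Eiffel Tower.", "Paris, France"], [["Eiffel Tower"], ["Paris"]]))
```

EM is strict: "Paris, France" earns no credit against the reference "Paris", which is one reason generated answers are harder to score than extracted spans.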
Why It Still Matters in 2025
RAG is not just a research concept; it has become one of the most widely used architectures for production AI systems:
Enterprise chatbots — answer questions grounded in internal documentation
Legal and compliance tools — cite specific clauses from contracts and regulations
Medical assistants — retrieve from clinical guidelines and research literature
Customer support — ground responses in product manuals and support history
Code assistants — retrieve from private codebases and internal APIs
Financial analysis — ground responses in earnings reports and filings
The fundamental problem RAG solves — LLMs don't know your data — remains unsolved by fine-tuning alone. Fine-tuning bakes knowledge into weights; RAG keeps knowledge external, updatable, and citeable.
Common RAG Pitfalls
1. Chunks that are too large or too small
Large chunks dilute relevance: the retrieved context contains too much noise. Small chunks lose context: a sentence without its surrounding paragraph is often meaningless. Start around 512 tokens with a 64-token overlap, then adjust for your document type. Note that RecursiveCharacterTextSplitter measures chunk_size in characters unless you give it a token-based length function.
2. No overlap between chunks
A sentence split across two chunks will never be retrieved coherently. Always set chunk_overlap > 0.
3. Using plain top-k similarity when MMR is better
Plain top-k similarity search returns the 5 most similar chunks, which are often near-duplicates. Maximal Marginal Relevance (MMR) balances relevance with diversity. Prefer MMR when your corpus contains repeated or overlapping passages.
4. Ignoring metadata filtering
If your corpus spans multiple domains, date ranges, or access levels, filter by metadata before semantic search, not after. Pre-filtering removes irrelevant candidates before they can crowd out relevant ones, which can substantially improve precision.
retriever = vectorstore.as_retriever(
    search_kwargs={
        # Chroma filter syntax: multiple conditions must be combined with $and
        "filter": {"$and": [{"department": "engineering"}, {"year": 2024}]},
        "k": 5
    }
)
5. No reranking step
Bi-encoders are fast but imprecise. The top-5 by cosine similarity is rarely the best-5 for answering the specific query. A lightweight cross-encoder reranker adds some latency (roughly 50 ms, depending on model size and hardware) and meaningfully improves answer quality.
6. Prompting the LLM without clear grounding instructions
# Weak — model may still hallucinate
prompt = f"Answer this: {query}\nContext: {context}"
# Strong — explicitly instructs grounded generation
prompt = f"""Answer the question using ONLY the context provided below.
If the answer is not in the context, say "I don't have enough information."
Context:
{context}
Question: {query}
Answer:"""
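Pitfall 3 above recommends MMR over plain top-k similarity. The greedy selection MMR performs can be sketched in a few lines; this is a simplified illustration with toy 2-dimensional vectors, not a vector store's implementation, and the names `mmr_select` and `lambda_mult` (the relevance/diversity trade-off) are chosen here for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedily pick k documents, trading query relevance against similarity to earlier picks."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],   # doc 0: highly relevant
    [1.0, 0.12],   # doc 1: near-duplicate of doc 0
    [0.7, -0.70],  # doc 2: less relevant, but different content
]

print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
```

With `lambda_mult=1.0` the redundancy term vanishes and the selection falls back to plain top-k similarity, which would return the two near-duplicates.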
RAG vs Fine-Tuning: When to Use Which
Criterion | RAG | Fine-Tuning |
Knowledge updates | Easy — update the index | Hard — retrain required |
Factual grounding | Strong — cites sources | Weak — knowledge in weights |
Domain style/tone | Weak | Strong |
Private data | Ideal | Possible but risky |
Hallucination risk | Lower | Higher |
Cost to implement | Moderate | High |
Best for | Dynamic knowledge, Q&A, search | Style, format, domain behavior |
In practice, the best production systems combine both: a fine-tuned model for tone and format, with a RAG layer for factual grounding.
How to Go Deeper
Read next from this series:
Attention Is All You Need → — the Transformer that powers the generator in every RAG system
BERT → — the encoder model behind both halves of DPR's bi-encoder retriever
Chain-of-Thought → — reasoning techniques that combine powerfully with RAG
Recommended resources:
BEIR Benchmark — standard benchmark for evaluating retrieval systems
LlamaIndex Docs — production RAG patterns and advanced techniques
RAGAs — framework for evaluating RAG pipeline quality
Need a RAG Pipeline Built for Your Use Case?
RAG is one of those architectures that looks simple in a tutorial and reveals serious complexity in production — chunking strategy, embedding model choice, retrieval quality, reranking, prompt design, evaluation, and latency all interact.
At Codersarts, we help engineers, researchers, and founders:
✅ Design and implement RAG pipelines tailored to your document types and use case
✅ Evaluate and improve retrieval quality on your specific corpus
✅ Build hybrid retrieval systems combining dense and sparse search
✅ Add reranking, metadata filtering, and query rewriting layers
✅ Reproduce the original RAG paper results on standard benchmarks
✅ Consult on RAG architecture decisions — vector DB selection, chunking strategy, embedding model choice
This post is Part 6 of the Codersarts AI Research Paper Series. Next: Chain-of-Thought Prompting →


