
Retrieval-Augmented Generation (RAG) Explained & Implemented | Codersarts


Retrieval-Augmented Generation (RAG): The Paper That Grounded AI in Real Knowledge

Published by Codersarts · AI Research Paper Series | https://labs.codersarts.com/



The Paper at a Glance



Title: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Authors: Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel, Kiela

Institution: Facebook AI Research (FAIR)

Published: 2020, arXiv

Citations: 10,000+



What This Paper Introduced

Language models have a fundamental problem: their knowledge is frozen at training time.


Ask a pre-trained LLM about something that happened after its training cutoff, or about a proprietary document inside your company, and it will either hallucinate an answer or admit it doesn't know. Both outcomes are unacceptable in production systems.


The RAG paper introduced a clean architectural solution: give the language model access to an external knowledge source at inference time.


Instead of relying solely on what the model memorized during pre-training, RAG retrieves relevant documents from an external store and conditions the generation on that retrieved context. The model's output is no longer just a function of its weights — it is a function of its weights plus the most relevant knowledge available right now.


This was a conceptual shift as much as a technical one. It changed how the industry thought about LLMs in production: not as static knowledge bases, but as reasoning engines that can be connected to live, updatable information.


Most enterprise chatbots with a knowledge base and most document Q&A tools you see today build on this idea.




The Core Architecture

RAG combines two components, a retriever and a generator, that are trained together end-to-end:



User Query
    ↓
┌─────────────────────────────┐
│        RETRIEVER            │
│  Dense Passage Retrieval    │
│  (DPR — bi-encoder)         │
│  Query Encoder → q vector   │
│  Doc Encoder → doc vectors  │
│  MIPS: top-k docs           │
└────────────┬────────────────┘
             │ top-k documents
             ↓
┌─────────────────────────────┐
│        GENERATOR            │
│  BART seq2seq LLM           │
│  Input: query + doc context │
│  Output: answer             │
└─────────────────────────────┘
    ↓
Generated Answer



The paper introduced two variants:


RAG-Sequence: retrieves k documents once per query, scores the full answer against each document independently, then marginalizes over all k documents to produce the final answer probability.


RAG-Token: also retrieves k documents once per query, but marginalizes over them at every generation step, so different tokens in the output can draw on different retrieved documents. This is more flexible when an answer combines facts from several documents.



Component 1: The Retriever (DPR)

The retriever is a Dense Passage Retriever (DPR) — a bi-encoder architecture with two BERT-based encoders:


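  • Query encoder q(x): encodes the query x into a dense vector

  • Document encoder d(z): pre-encodes each document z into a dense vector, stored in a pre-built document index

The retrieval distribution is p_η(z|x) ∝ exp(d(z)ᵀ q(x)), so a document's probability grows with the inner product of its vector and the query vector.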


At inference time, retrieval is performed via Maximum Inner Product Search (MIPS) — finding the top-k documents whose embeddings have the highest dot product with the query embedding.



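import numpy as np

# Conceptual retrieval (query_encoder, doc_index, query, and k are assumed to be defined)
query_vector = query_encoder(query)             # shape: (d,)
doc_scores = doc_index @ query_vector           # shape: (N,)
top_k_ids = np.argsort(doc_scores)[::-1][:k]    # indices of the k highest scores, best first
top_k_docs = doc_index[top_k_ids]               # shape: (k, d)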

The key difference from sparse retrieval (BM25, TF-IDF) is that dense retrieval captures semantic similarity, not just keyword overlap. "Heart attack" will retrieve documents about "myocardial infarction" — keyword search won't.



Component 2: The Generator (BART)

The generator is BART — a seq2seq Transformer pre-trained with denoising objectives.

For each retrieved document z_i, the generator receives a concatenation of the query and the document:


Input:  [query] [SEP] [retrieved document z_i]
Output: [answer tokens]

For RAG-Sequence, the final output probability is:


p(y|x) = Σ_z  p_η(z|x) × p_θ(y|x, z)

The model marginalizes over the k retrieved documents, weighting each generated answer by how relevant the retrieval was.
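
To make the marginalization concrete, here is a small sketch with made-up numbers. It assumes k = 2 retrieved documents and a two-token answer; every probability is hypothetical, and the per-token generator probabilities are treated as fixed values (in the real model each one depends on the previously generated tokens). It also contrasts the RAG-Sequence sum above with the per-token mixing that RAG-Token uses.

```python
from math import prod

# Hypothetical retriever probabilities p_eta(z|x) for k = 2 retrieved documents.
p_doc = [0.7, 0.3]

# Hypothetical generator probabilities for a 2-token answer.
# Rows are documents, columns are answer tokens.
p_tok = [
    [0.9, 0.2],  # document 1 supports token 1 well, token 2 poorly
    [0.1, 0.8],  # document 2 is the other way round
]

def rag_sequence(p_doc, p_tok):
    """Score the whole answer under each document, then mix over documents."""
    return sum(pz * prod(tokens) for pz, tokens in zip(p_doc, p_tok))

def rag_token(p_doc, p_tok):
    """Mix over documents at every token, then multiply across tokens."""
    n_tokens = len(p_tok[0])
    per_token = [
        sum(pz * tokens[i] for pz, tokens in zip(p_doc, p_tok))
        for i in range(n_tokens)
    ]
    return prod(per_token)

seq = rag_sequence(p_doc, p_tok)
tok = rag_token(p_doc, p_tok)
print(f"RAG-Sequence p(y|x) = {seq:.3f}")
print(f"RAG-Token    p(y|x) = {tok:.3f}")
```

With these toy numbers the token-level mixture scores the answer higher (about 0.251 versus 0.150), because each token can lean on whichever document supports it best.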


End-to-End Training

Crucially, both the retriever and generator are trained jointly — the retriever learns to fetch documents that make the generator's job easier. The document index itself is kept frozen during training (updating it on every step would be prohibitively expensive), but the query encoder is updated via backpropagation through the generator's loss.
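
Written out, the training objective described in the paper is the negative marginal log-likelihood of the targets over a corpus of input/output pairs (x_j, y_j). The block below reuses the η (retriever) and θ (generator) notation from the formula above. Because the document encoder and index are frozen, only the query-encoder part of η actually receives gradient updates, and no supervision about which document to retrieve is needed.

```latex
\min_{\eta,\,\theta} \; \sum_{j} -\log p(y_j \mid x_j),
\qquad
p(y \mid x) \approx \sum_{z \,\in\, \text{top-}k\left(p_\eta(\cdot \mid x)\right)}
    p_\eta(z \mid x)\; p_\theta(y \mid x, z)
```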




How Modern RAG Works in Practice

The paper's original architecture used BART as the generator. In 2024–2025, production RAG systems have evolved considerably, but the core pattern is identical:



┌────────────────────────────────────────────────┐
│                  RAG PIPELINE                  │
├────────────────────────────────────────────────┤
│                                                │
│  1. INDEXING (offline)                         │
│     Documents → Chunks → Embeddings → Vector DB│
│                                                │
│  2. RETRIEVAL (online, per query)              │
│     Query → Embedding → ANN Search → Top-k docs│
│                                                │
│  3. AUGMENTATION                               │
│     Prompt = System + Retrieved Docs + Query   │
│                                                │
│  4. GENERATION                                 │
│     LLM(Prompt) → Grounded Answer              │
│                                                │
└────────────────────────────────────────────────┘
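
The four stages can be sketched end to end with nothing but the standard library. This is a toy illustration under loud assumptions: the "embedding" is a bag-of-words count (a stand-in for a real embedding model), the "vector DB" is a Python list, the example chunks are invented, and the generation step is left out because it would need an actual LLM call.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing (offline): chunk -> embedding, kept in a plain list instead of a vector DB.
chunks = [
    "RAG retrieves documents and conditions generation on them.",
    "BART is a seq2seq Transformer pre-trained with denoising objectives.",
    "Dense Passage Retrieval uses two BERT-based encoders.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval (online): embed the query and keep the k most similar chunks.
def retrieve(query, k=2):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Augmentation: instruction + retrieved chunks + query in one prompt.
def build_prompt(query, docs):
    context = "\n".join(f"- {d}" for d in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

# 4. Generation would pass this prompt to an LLM; that call is omitted here.
query = "Which encoders does Dense Passage Retrieval use?"
top_docs = retrieve(query)
prompt = build_prompt(query, top_docs)
print(prompt)
```

Swapping `embed` for a real embedding model and the list for a vector database gives the production shape shown in the diagram; the control flow stays the same.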


Modern stack:

| Component | Common Choices |
|---|---|
| Embedding model | text-embedding-3-small, bge-large, e5-mistral |
| Vector database | Pinecone, Weaviate, Qdrant, pgvector, Chroma |
| Retrieval strategy | Dense, sparse (BM25), or hybrid |
| Generator | GPT-4o, Claude 3.5, LLaMA 3, Mistral |
| Orchestration | LangChain, LlamaIndex, Haystack |



Implementation: Production RAG Pipeline in Python



from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA

# ── Step 1: Load and chunk documents ──────────────────────────
loader = PyPDFLoader("your_document.pdf")
documents = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,            # measured in characters by default, not tokens
    chunk_overlap=64,          # overlap preserves context across chunks
    separators=["\n\n", "\n", ".", " "]
)
chunks = splitter.split_documents(documents)

# ── Step 2: Embed and index ────────────────────────────────────
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# ── Step 3: Build retriever ────────────────────────────────────
retriever = vectorstore.as_retriever(
    search_type="mmr",         # Maximal Marginal Relevance — reduces redundancy
    search_kwargs={
        "k": 5,                # retrieve top 5 chunks
        "fetch_k": 20          # candidate pool for MMR
    }
)

# ── Step 4: Build RAG chain ────────────────────────────────────
llm = ChatOpenAI(model="gpt-4o", temperature=0)

rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",        # concat all docs into context
    retriever=retriever,
    return_source_documents=True
)

# ── Step 5: Query ──────────────────────────────────────────────
result = rag_chain.invoke({"query": "What are the key findings?"})

print(result["result"])
print("\nSources:")
for doc in result["source_documents"]:
    print(f"  - {doc.metadata['source']} (page {doc.metadata.get('page', '?')})")
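
The `search_type="mmr"` setting in Step 3 deserves a closer look. Below is a minimal, library-free sketch of the Maximal Marginal Relevance idea: pick documents one at a time, rewarding similarity to the query and penalizing similarity to what has already been picked. The vectors are invented unit vectors, and `lambda_mult=0.5` is an assumed trade-off value; this illustrates the technique and is not the vector store's own implementation.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = dot(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical unit vectors: docs 0 and 1 are near-duplicates, doc 2 is different.
query_vec = [0.8, 0.6]
doc_vecs = [
    [0.96, 0.28],  # doc 0: most relevant
    [1.0, 0.0],    # doc 1: relevant, but very close to doc 0
    [0.0, 1.0],    # doc 2: less relevant, but adds new information
]

by_similarity = sorted(range(len(doc_vecs)),
                       key=lambda i: dot(query_vec, doc_vecs[i]), reverse=True)[:2]
by_mmr = mmr(query_vec, doc_vecs, k=2)
print("top-2 by similarity:", by_similarity)
print("top-2 by MMR:       ", by_mmr)
```

Plain similarity returns the two near-duplicates (documents 0 and 1), while MMR keeps document 0 and swaps the duplicate for document 2.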




Advanced: Hybrid Retrieval (Dense + Sparse)

Pure dense retrieval misses exact keyword matches. Pure sparse retrieval misses semantic similarity. Hybrid combines both:




from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# Dense retriever
dense_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Sparse retriever (BM25)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Hybrid: 60% dense, 40% sparse
hybrid_retriever = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4]
)
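
One common way to fuse two ranked lists with weights like these is weighted Reciprocal Rank Fusion (RRF). The sketch below is a standalone illustration of that idea, not LangChain's code: the function name `weighted_rrf`, the constant `c=60`, and the document IDs are all assumptions made for the example.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns weight / (c + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results from a dense and a sparse retriever for the same query.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_c", "doc_a", "doc_d"]

fused = weighted_rrf([dense_ranking, sparse_ranking], weights=[0.6, 0.4])
print(fused)
```

Documents that both retrievers agree on (doc_a and doc_c here) rise above documents found by only one of them, which is the behavior you want from a hybrid setup.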


Advanced: Reranking

After retrieval, a cross-encoder reranker scores each candidate more accurately than the bi-encoder used for retrieval:


from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=3):
    """Score each (query, document) pair with the cross-encoder and keep the best top_n."""
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    # Sort on the score only; comparing Document objects on a tied score would raise TypeError
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# Retrieve more, rerank to fewer
query = "What are the key findings?"
candidates = hybrid_retriever.invoke(query)
reranked_docs = rerank(query, candidates, top_n=3)


Original Paper Results

Evaluated on open-domain QA benchmarks against previous state-of-the-art:

| Model | NaturalQ (EM) | TriviaQA (EM) | WebQ (EM) |
|---|---|---|---|
| Closed-book GPT-2 | 4.1 | 29.9 | 3.3 |
| REALM (retrieval-augmented) | 40.4 | - | 40.7 |
| T5 (closed-book, 11B) | 36.6 | 60.5 | 44.7 |
| RAG (Lewis et al.) | 44.5 | 68.0 | 45.5 |


RAG outperformed much larger closed-book models — including the 11B T5 — by grounding generation in retrieved evidence rather than memorized parameters.
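
The EM columns above are exact-match scores: the percentage of questions where the predicted answer equals a reference answer after light normalization. Here is a minimal sketch of how such a score is typically computed. The normalization steps follow the common SQuAD-style convention and are an assumption, not necessarily the exact script used in the paper, and the predictions and references are invented.

```python
import re
import string

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, references):
    """Return 1 if the normalized prediction equals any normalized reference, else 0."""
    return int(any(normalize(prediction) == normalize(ref) for ref in references))

# Hypothetical model outputs and gold answers.
predictions = ["The Eiffel Tower", "paris, france"]
references = [["Eiffel Tower"], ["Paris"]]

em = 100 * sum(exact_match(p, r) for p, r in zip(predictions, references)) / len(predictions)
print(f"EM = {em:.1f}")
```

The second prediction contains the right city but still scores zero, which is why EM is a strict metric and why small gains on it are meaningful.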




Why It Still Matters in 2025

RAG is not just a research concept. It has become a standard architecture for production AI systems that must answer from specific documents:

  • Enterprise chatbots — answer questions grounded in internal documentation

  • Legal and compliance tools — cite specific clauses from contracts and regulations

  • Medical assistants — retrieve from clinical guidelines and research literature

  • Customer support — ground responses in product manuals and support history

  • Code assistants — retrieve from private codebases and internal APIs

  • Financial analysis — ground responses in earnings reports and filings


The fundamental problem RAG solves — LLMs don't know your data — remains unsolved by fine-tuning alone. Fine-tuning bakes knowledge into weights; RAG keeps knowledge external, updatable, and citeable.




Common RAG Pitfalls

1. Chunks that are too large or too small
Large chunks dilute relevance: the retrieved context contains too much noise. Small chunks lose context: a sentence without its surrounding paragraph is often meaningless. Start with 512-token chunks and a 64-token overlap, then adjust based on your document type.


2. No overlap between chunks
A sentence split across two chunks is unlikely to be retrieved coherently. Always set chunk_overlap > 0.


3. Using cosine similarity when MMR is better
Basic cosine similarity returns the 5 most similar chunks, which are often near-duplicates. Maximal Marginal Relevance (MMR) balances relevance with diversity. Prefer MMR in production when your corpus contains repeated or overlapping content.


4. Ignoring metadata filtering
If your corpus spans multiple domains, date ranges, or access levels, filter by metadata before semantic search, not after. Pre-filtering can substantially improve precision.




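# Chroma expects multiple conditions to be combined with an explicit operator such as $and
retriever = vectorstore.as_retriever(
    search_kwargs={
        "filter": {"$and": [{"department": "engineering"}, {"year": 2024}]},
        "k": 5
    }
)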

5. No reranking step
Bi-encoders are fast but imprecise. The top 5 by cosine similarity are rarely the best 5 for answering the specific query. A lightweight cross-encoder reranker (roughly ~50ms of added latency, depending on hardware and the number of candidates) meaningfully improves answer quality.


6. Prompting the LLM without clear grounding instructions



# Weak — model may still hallucinate
prompt = f"Answer this: {query}\nContext: {context}"

# Strong — explicitly instructs grounded generation
prompt = f"""Answer the question using ONLY the context provided below.
If the answer is not in the context, say "I don't have enough information."

Context:
{context}

Question: {query}
Answer:"""
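
Pitfalls 1 and 2 both come down to how chunks are cut, so here is a minimal sliding-window sketch of chunking with overlap. It splits on whitespace as a stand-in for a real tokenizer, and the function name `chunk_with_overlap` and the tiny sizes are assumptions chosen so the output is easy to read.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

# Whitespace "tokens" stand in for real tokenizer output.
tokens = "the retriever fetches passages and the generator writes the answer".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=2)

for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk repeats the last two tokens of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk. In a real pipeline you would use values closer to the 512/64 starting point suggested above.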



RAG vs Fine-Tuning: When to Use Which


| | RAG | Fine-Tuning |
|---|---|---|
| Knowledge updates | Easy: update the index | Hard: retrain required |
| Factual grounding | Strong: cites sources | Weak: knowledge in weights |
| Domain style/tone | Weak | Strong |
| Private data | Ideal | Possible but risky |
| Hallucination risk | Lower | Higher |
| Cost to implement | Moderate | High |
| Best for | Dynamic knowledge, Q&A, search | Style, format, domain behavior |


In practice, the best production systems combine both: a fine-tuned model for tone and format, with a RAG layer for factual grounding.




How to Go Deeper

Read next from this series:

  • Attention Is All You Need: the Transformer architecture behind the generator in RAG systems

  • BERT: the encoder model behind the bi-encoder used in DPR retrieval

  • Chain-of-Thought: reasoning techniques that combine powerfully with RAG


Recommended resources:

  • BEIR Benchmark — standard benchmark for evaluating retrieval systems

  • LlamaIndex Docs — production RAG patterns and advanced techniques

  • RAGAs — framework for evaluating RAG pipeline quality




Need a RAG Pipeline Built for Your Use Case?


RAG is one of those architectures that looks simple in a tutorial and reveals serious complexity in production — chunking strategy, embedding model choice, retrieval quality, reranking, prompt design, evaluation, and latency all interact.


At Codersarts, we help engineers, researchers, and founders:

  • ✅ Design and implement RAG pipelines tailored to your document types and use case

  • ✅ Evaluate and improve retrieval quality on your specific corpus

  • ✅ Build hybrid retrieval systems combining dense and sparse search

  • ✅ Add reranking, metadata filtering, and query rewriting layers

  • ✅ Reproduce the original RAG paper results on standard benchmarks

  • ✅ Consult on RAG architecture decisions — vector DB selection, chunking strategy, embedding model choice




This post is Part 6 of the Codersarts AI Research Paper Series. Next: Chain-of-Thought Prompting →
