Build Your First RAG System: A Python Walkthrough
- 6 days ago
- 7 min read
Many tutorials explain the high-level concepts of Retrieval-Augmented Generation (RAG) using architecture diagrams and analogies like “open-book exams.” But the fastest way to truly understand RAG is to build one yourself.

In this guide, you’ll create a working RAG pipeline in Python that runs from your machine, uses the OpenAI API for embeddings and generation, and can:
Read custom documents
Convert them into embeddings
Store them in a vector database
Retrieve relevant context
Generate grounded answers using an LLM
By the end, you’ll have a complete command-line RAG application running locally on your machine.
What We’re Building
Our RAG pipeline will follow this workflow:
flowchart TD
Prepare[1. Prepare Workspace & knowledge.txt] --> Load[2. Read text & split into chunks]
Load --> Embed[3. Embed chunks & store in ChromaDB]
Embed --> Search[4. Search database for question context]
Search --> Generate[5. Send context + question to LLM API]
Generate --> Output[6. Print final answer]
Step 1 - Setting Up Your Workspace
First, create a clean Python project environment.
Create the Project Folder
mkdir first-rag-system
cd first-rag-system
Create a Virtual Environment
python -m venv venv
Activate the Environment
Windows
venv\Scripts\activate
macOS/Linux
source venv/bin/activate
Install Required Libraries
We’ll use two core libraries:
Library | Purpose |
chromadb | Local vector database |
openai | Embeddings + LLM generation |
Install them with:
pip install chromadb openai
Step 2 - Preparing the Knowledge Base
Every RAG system needs a source of truth.
Create a file named:
knowledge.txt
Paste the following fictional company policy document into it:
Remote Work and Reimbursement Policy 2026
1. Equipment Reimbursement:
Employees are eligible for a one-time workspace setup reimbursement of up to $500. This includes chairs, desks, monitors, and keyboards. All requests must be submitted through the ExpensePortal within 30 days of purchase. Receipts are mandatory.
2. Core Collaboration Hours:
To ensure smooth collaboration across time zones, all employees must be online and active during our core hours of 10:00 AM to 3:00 PM Eastern Standard Time (EST).
3. Internet Subsidy:
OmniCorp provides a monthly home internet stipend of $50. This is processed automatically in the second paycheck of every month. No expense reports are required for this stipend.
4. Travel Policy:
If an employee is required to travel to a regional office, travel expenses (flights, hotels, and meals up to $75/day) are fully covered. All bookings must be made through the OmniTravel portal at least 14 days in advance.
Why RAG Matters
An off-the-shelf LLM has never seen OmniCorp’s internal policies.
If you ask:
“What is OmniCorp’s monthly internet subsidy?”
A generic model will either:
hallucinate,
guess,
or say it doesn’t know.
Our RAG pipeline fixes this by retrieving the relevant company policy first and injecting it into the prompt.
Step 3 - Writing the RAG Pipeline
Create a new file:
rag_app.py
We’ll build the pipeline one step at a time.
1. Imports & Client Initialization
Start by importing the required libraries and initializing the clients.
import os
import chromadb
from openai import OpenAI
# Initialize the OpenAI client
openai_client = OpenAI()
# Initialize ChromaDB
chroma_client = chromadb.Client()
2. Loading & Chunking Documents
Before storing data in a vector database, we must split the document into smaller chunks.
Why Chunking Matters
Large documents:
can exceed the model’s context window,
introduce noise,
reduce retrieval precision.
Chunking isolates smaller semantic units, so each retrieved piece is about one topic.
Adding Overlap
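Overlapping chunks preserve context across chunk boundaries.
Without overlap:
Chunk 1 → “Employees receive reimbursement...”
Chunk 2 → “...within 30 days of purchase.”
The deadline is cut off from the benefit it applies to.
With overlap, the last few words of each chunk are repeated at the start of the next, so text near a boundary keeps more of its surrounding context.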
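To make the effect concrete, here is a minimal sketch that chunks a single sentence from the policy document with and without overlap. The helper name `chunk_words` and the tiny window sizes (6 words, overlap of 2) are assumptions chosen for illustration; the tutorial's real function uses larger defaults.

```python
def chunk_words(text, chunk_size_words, overlap_words):
    """Split text into word windows that share `overlap_words` words with the next window."""
    step = chunk_size_words - overlap_words
    if step <= 0:
        raise ValueError("overlap_words must be smaller than chunk_size_words")
    words = text.split()
    return [
        " ".join(words[start : start + chunk_size_words])
        for start in range(0, len(words), step)
    ]


sentence = (
    "All requests must be submitted through the ExpensePortal "
    "within 30 days of purchase."
)

no_overlap = chunk_words(sentence, chunk_size_words=6, overlap_words=0)
with_overlap = chunk_words(sentence, chunk_size_words=6, overlap_words=2)

for label, chunks in (("no overlap", no_overlap), ("overlap of 2", with_overlap)):
    print(label)
    for chunk in chunks:
        print("   ", chunk)
```

Without overlap, "within 30 days of" and "purchase." land in different chunks. With an overlap of 2, one chunk contains the whole phrase "within 30 days of purchase."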
Chunking Function
def load_and_chunk_document(filepath, chunk_size_words=80, overlap_words=20):
    """Reads a text file and splits it into overlapping chunks."""
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Could not find the file: {filepath}")
    # The window advances by this many words; it must be positive or the loop never ends
    step = chunk_size_words - overlap_words
    if step <= 0:
        raise ValueError("overlap_words must be smaller than chunk_size_words")
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk_words = words[i : i + chunk_size_words]
        chunks.append(" ".join(chunk_words))
        i += step
    return chunks
# Run chunking
document_chunks = load_and_chunk_document("knowledge.txt")
print(f"Split document into {len(document_chunks)} chunks.")
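As a sanity check on the printed count: the loop starts a new chunk every `chunk_size_words - overlap_words` words, so the number of chunks is the word count divided by that step, rounded up. The helper below is a hypothetical illustration of that arithmetic, and the 150-word figure is a made-up example, not a measurement of knowledge.txt.

```python
import math


def expected_chunk_count(total_words, chunk_size_words=80, overlap_words=20):
    """Number of chunks a sliding word window produces for a document of total_words."""
    step = chunk_size_words - overlap_words
    if step <= 0:
        raise ValueError("overlap_words must be smaller than chunk_size_words")
    # One chunk starts at word 0, step, 2*step, ... for every start below total_words
    return math.ceil(total_words / step)


# A hypothetical 150-word document with the tutorial's defaults (step = 60 words)
print(expected_chunk_count(150))
```

If the number printed by the script is far from this estimate, the file was probably not read as expected (wrong path, empty file, or different parameters).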
Step 4 - Creating the Vector Store
Now we convert chunks into embeddings and store them in ChromaDB.
What Are Embeddings?
Embeddings are numerical vector representations of text.
Semantically similar sentences produce vectors that are mathematically close together.
Example:
Text A | Text B | Distance between embeddings |
“internet reimbursement” | “monthly WiFi allowance” | small (similar meaning) |
This enables semantic search.
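"Mathematically close" is often measured with cosine similarity, which compares the direction of two vectors. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings from an embedding model have far more dimensions, and the numbers here are not output from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors standing in for real embeddings
toy_vectors = {
    "internet reimbursement": [0.9, 0.1, 0.2],
    "monthly WiFi allowance": [0.8, 0.2, 0.3],
    "parental leave": [0.1, 0.9, 0.1],
}

sim_related = cosine_similarity(
    toy_vectors["internet reimbursement"], toy_vectors["monthly WiFi allowance"]
)
sim_unrelated = cosine_similarity(
    toy_vectors["internet reimbursement"], toy_vectors["parental leave"]
)

print(f"related phrases:   {sim_related:.2f}")
print(f"unrelated phrases: {sim_unrelated:.2f}")
```

The two phrases about internet costs score much higher than the unrelated pair, which is the property a vector database relies on when it ranks chunks against a question.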
Creating the Collection
collection = chroma_client.create_collection(
name="omnicorp_policies"
)
Generating & Storing Embeddings
for idx, chunk in enumerate(document_chunks):
    # Convert chunk into vector embedding
    response = openai_client.embeddings.create(
        input=[chunk],
        model="text-embedding-3-small"
    )
    vector = response.data[0].embedding
    # Store inside ChromaDB
    collection.add(
        ids=[f"chunk_{idx}"],
        embeddings=[vector],
        documents=[chunk]
    )
print(f"Stored {len(document_chunks)} chunks.")
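The loop above makes one API call per chunk, which is fine for a short document. For larger files, a common refinement is to send several chunks per request, since the embeddings endpoint accepts a list of inputs. The sketch below shows only the batching logic; `batched`, `embed_all`, and `fake_embed_batch` are hypothetical names, and the fake function stands in for the real API so the example runs offline.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


def embed_all(chunks, embed_batch, batch_size=64):
    """Embed every chunk, calling `embed_batch` once per batch instead of once per chunk."""
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors


# Stand-in for the real API call. With the tutorial's client, embed_batch could
# call openai_client.embeddings.create(input=list(texts), model="text-embedding-3-small")
# and read item.embedding from each entry of response.data.
calls = []


def fake_embed_batch(texts):
    calls.append(len(texts))
    return [[float(len(text))] for text in texts]


vectors = embed_all(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed_batch, batch_size=2)
print(f"{len(vectors)} vectors from {len(calls)} calls")
```

Because `collection.add` already takes lists of ids, embeddings, and documents, the resulting vectors could be stored in a single call as well.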
Step 5 - Semantic Retrieval
Now we build the retrieval step.
When the user asks a question:
Convert the question into an embedding
Search the vector database
Retrieve the most relevant chunks
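Conceptually, these three steps amount to a nearest-neighbor search. The sketch below does that search by brute force over a handful of made-up vectors so the idea is visible without a database. It is a stand-in for illustration, not how ChromaDB works internally: a vector database uses an index and its own configured distance metric. The function names and toy vectors are assumptions.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, stored, k=2):
    """Return the k documents whose vectors are most similar to the query vector.

    `stored` is a list of (document, vector) pairs.
    """
    best = heapq.nlargest(
        k, stored, key=lambda pair: cosine_similarity(query_vector, pair[1])
    )
    return [document for document, _ in best]


# Invented vectors standing in for embedded policy sections
stored = [
    ("Internet Subsidy", [0.9, 0.1, 0.1]),
    ("Travel Policy", [0.1, 0.9, 0.2]),
    ("Equipment Reimbursement", [0.7, 0.2, 0.6]),
]
query_vector = [0.8, 0.1, 0.2]  # pretend embedding of a question about internet costs

matches = top_k(query_vector, stored, k=2)
print(matches)
```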
Retrieval Function
def retrieve_relevant_context(query, limit=2):
    # Embed the query
    response = openai_client.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    )
    query_vector = response.data[0].embedding
    # Search vector DB
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=limit
    )
    return results['documents'][0]
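The `[0]` at the end is there because `collection.query` accepts a list of query embeddings and returns one list of matches per query. The sketch below uses a hand-written dictionary with that nested shape (it is a mock, not real ChromaDB output, and the distances are invented) to show how documents pair up with their distances. The helper `documents_within` and the threshold values are assumptions; a sensible cutoff depends on the distance metric and embedding model.

```python
# Hand-written stand-in with the same nested shape as a query result:
# one inner list per query embedding.
mock_results = {
    "ids": [["chunk_1", "chunk_0"]],
    "documents": [[
        "OmniCorp provides a monthly home internet stipend of $50.",
        "Employees are eligible for a one-time workspace setup reimbursement of up to $500.",
    ]],
    "distances": [[0.41, 0.87]],  # invented values; smaller means closer
}


def documents_within(results, max_distance, query_index=0):
    """Keep only the documents for one query whose distance is at most max_distance."""
    documents = results["documents"][query_index]
    distances = results["distances"][query_index]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


close_matches = documents_within(mock_results, max_distance=0.5)
print(close_matches)
```

Filtering on distance like this is one way to avoid passing weakly related chunks to the LLM when nothing in the knowledge base really matches the question.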
Testing Retrieval
question = "How much can I get reimbursed for my home workspace setup?"
context_chunks = retrieve_relevant_context(question)
print("\n--- RETRIEVED CONTEXT ---")
for idx, chunk in enumerate(context_chunks):
    print(f"Match {idx+1}: {chunk}\n")
Step 6 - Generating the Final Answer
Now we combine:
retrieved context,
user question,
and the LLM.
This is the “Generation” part of RAG.
RAG Generation Function
def generate_rag_answer(query):
    # Retrieve relevant chunks
    context_chunks = retrieve_relevant_context(query, limit=2)
    # Merge chunks into one context block
    context_text = "\n---\n".join(context_chunks)
    # Build grounded prompt
    prompt = f"""
You are a helpful company assistant.
Use ONLY the provided context below to answer the user's question.
If you do not know the answer based on the context, say:
"I cannot find that information in the company guidelines."
Context:
{context_text}
User Question:
{query}
"""
    # Generate response
    chat_completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a policy assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.0
    )
    return chat_completion.choices[0].message.content
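One optional refactor is to pull the prompt assembly into its own function, so the exact text sent to the model can be printed and checked without making an API call. This is a sketch under that assumption; `build_grounded_prompt` is a hypothetical name, and the wording mirrors the prompt used above.

```python
def build_grounded_prompt(context_chunks, query):
    """Assemble the grounded prompt from retrieved chunks and the user's question."""
    context_text = "\n---\n".join(context_chunks)
    return (
        "You are a helpful company assistant.\n"
        "Use ONLY the provided context below to answer the user's question.\n"
        "If you do not know the answer based on the context, say:\n"
        '"I cannot find that information in the company guidelines."\n\n'
        f"Context:\n{context_text}\n\n"
        f"User Question:\n{query}\n"
    )


example_prompt = build_grounded_prompt(
    [
        "OmniCorp provides a monthly home internet stipend of $50.",
        "No expense reports are required for this stipend.",
    ],
    "What is OmniCorp's monthly internet subsidy?",
)
print(example_prompt)
```

Printing the prompt during development makes it easy to see whether a wrong answer comes from poor retrieval (the right chunk is missing) or from generation (the chunk is present but ignored).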
Why Temperature = 0?
temperature=0.0
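Lower temperature:
reduces randomness in token sampling,
makes answers more repeatable across runs,
lowers the chance of creative drift away from the retrieved context.
It does not eliminate hallucinations on its own; the grounding instruction in the prompt and the quality of the retrieved chunks still do most of the work.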
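To see why a lower temperature reduces randomness, it helps to look at how temperature is conventionally applied when sampling: the model's raw scores are divided by the temperature before being turned into probabilities. The sketch below illustrates that general mechanism with made-up scores for three candidate tokens. It is not a description of how any particular hosted model handles `temperature=0.0`, which in this formula is the limiting case of always picking the top-scoring token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is the greedy (argmax) limit")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(score - highest) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

for temperature in (1.0, 0.5, 0.1):
    probabilities = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probabilities])
```

As the temperature drops, nearly all of the probability mass moves to the highest-scoring token, so repeated runs are far more likely to produce the same answer.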
Step 7 - Final RAG Implementation
Here’s the full end-to-end implementation.
import os
import chromadb
from openai import OpenAI
# Initialize Clients
openai_client = OpenAI()
chroma_client = chromadb.Client()
# Chunking Function
def load_and_chunk_document(filepath, chunk_size_words=80, overlap_words=20):
    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Could not find the file: {filepath}")
    # The window must advance by at least one word or the loop never ends
    step = chunk_size_words - overlap_words
    if step <= 0:
        raise ValueError("overlap_words must be smaller than chunk_size_words")
    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunk_words = words[i : i + chunk_size_words]
        chunks.append(" ".join(chunk_words))
        i += step
    return chunks
# Ingestion
print("Starting ingestion...")
document_chunks = load_and_chunk_document("knowledge.txt")
collection = chroma_client.create_collection(
name="omnicorp_policies"
)
for idx, chunk in enumerate(document_chunks):
    response = openai_client.embeddings.create(
        input=[chunk],
        model="text-embedding-3-small"
    )
    vector = response.data[0].embedding
    collection.add(
        ids=[f"chunk_{idx}"],
        embeddings=[vector],
        documents=[chunk]
    )
print("Ingestion complete.\n")
# Retrieval Function
def retrieve_relevant_context(query, limit=2):
    response = openai_client.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    )
    query_vector = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=limit
    )
    return results['documents'][0]
# Generation Function
def generate_rag_answer(query):
    context_chunks = retrieve_relevant_context(query)
    context_text = "\n---\n".join(context_chunks)
    prompt = f"""
You are a helpful company assistant.
Use ONLY the provided context below to answer the user's question.
If you do not know the answer based on the context, say:
"I cannot find that information in the company guidelines."
Context:
{context_text}
User Question:
{query}
"""
    chat_completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a policy assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.0
    )
    return chat_completion.choices[0].message.content
# Run Queries
if __name__ == "__main__":
    q1 = "What is OmniCorp's monthly internet subsidy?"
    print(f"Question: {q1}")
    print(f"Answer: {generate_rag_answer(q1)}\n")
    q2 = "What are the rules for travel hotel bookings?"
    print(f"Question: {q2}")
    print(f"Answer: {generate_rag_answer(q2)}\n")
    q3 = "What is our company's policy on parental leave?"
    print(f"Question: {q3}")
    print(f"Answer: {generate_rag_answer(q3)}\n")
Step 8 - Running the Application
Before running the script, export your OpenAI API key.
Windows PowerShell
$env:OPENAI_API_KEY="your-api-key"
macOS/Linux
export OPENAI_API_KEY="your-api-key"
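By default, `OpenAI()` reads the key from the `OPENAI_API_KEY` environment variable. If you would like a clearer error message when the variable is missing, one option is a small check at the top of the script. This is a sketch; `require_api_key` is a hypothetical helper, not part of the openai library.

```python
import os


def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail early with a readable message."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set. Export it before running rag_app.py.")
    return key


# Example with an explicit dictionary instead of the real environment
print(require_api_key({"OPENAI_API_KEY": "your-api-key"}))
```

In rag_app.py, calling `require_api_key()` with no arguments before `OpenAI()` would check the real environment.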
Run the Script
python rag_app.py
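Expected Output
The exact wording of each answer may vary between runs and model versions, but the output should look like this: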
Starting ingestion...
Ingestion complete.
Question: What is OmniCorp's monthly internet subsidy?
Answer: OmniCorp provides a monthly home internet stipend of $50.
Question: What are the rules for travel hotel bookings?
Answer: All travel bookings must be made through the OmniTravel portal at least 14 days in advance.
Question: What is our company's policy on parental leave?
Answer: I cannot find that information in the company guidelines.
Why This Is Powerful
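Notice what happened with the third question: the policy document says nothing about parental leave, so the model returned the fallback line, “I cannot find that information…”, instead of inventing an answer.
That behavior comes from the grounding instruction in the prompt, and it’s one of the biggest advantages of RAG systems in production AI applications.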
Next Steps & Improvements
Now that your basic RAG pipeline works, here are some powerful upgrades you can build next.
1. Add PDF Support
Instead of plain text files, parse PDFs directly.
Example libraries:
pypdf
pdfplumber
PyMuPDF
2. Persist the Vector Database
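Right now the DB resets every run, because chromadb.Client() keeps everything in memory.
Switch to:
chromadb.PersistentClient(path="./db")
This saves the collection to disk so it survives restarts. Once the data persists, also replace create_collection with get_or_create_collection and skip ingestion when collection.count() is already greater than zero; otherwise the second run fails because the collection already exists.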
3. Build an Interactive Chat CLI
Replace static queries with a loop:
while True:
    user_input = input("Ask a question: ")
    print(generate_rag_answer(user_input))
Now your RAG system becomes a live chatbot.
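A slightly fuller sketch of such a loop is shown below, with an exit command and handling for Ctrl+C or end of input. The function name `chat_loop` and its injectable `input_fn` and `output_fn` parameters are assumptions made so the loop can be exercised without a terminal; in rag_app.py you would pass the tutorial's `generate_rag_answer` as `answer_fn`.

```python
def chat_loop(answer_fn, input_fn=input, output_fn=print):
    """Keep asking for questions until the user types 'exit' or 'quit'."""
    while True:
        try:
            user_input = input_fn("Ask a question (or 'exit'): ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl+D / Ctrl+C ends the session cleanly
        if user_input.lower() in {"exit", "quit"}:
            break
        if not user_input:
            continue  # ignore empty lines instead of sending them to the model
        output_fn(answer_fn(user_input))


# In the tutorial's script this would be started with:
# chat_loop(generate_rag_answer)
```

Each question is answered independently here: the loop keeps no conversation history, so follow-up questions that depend on earlier answers would need extra work.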
4. Add Metadata Filtering
Store:
departments,
timestamps,
document types,
authors.
Then retrieve selectively.
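ChromaDB supports this through a `metadatas` list on `collection.add` and a `where` filter on `collection.query`. The sketch below emulates only the simplest case, exact-match filtering, in plain Python so the idea is clear without a database. The helper `filter_by_metadata`, the record layout, and the department values are all invented for illustration.

```python
def filter_by_metadata(records, where):
    """Keep records whose metadata matches every key/value pair in `where`."""
    return [
        record
        for record in records
        if all(record["metadata"].get(key) == value for key, value in where.items())
    ]


# Invented records: the tutorial's chunks plus hypothetical metadata
records = [
    {"id": "chunk_0", "metadata": {"department": "HR", "doc_type": "policy"}},
    {"id": "chunk_1", "metadata": {"department": "Finance", "doc_type": "policy"}},
    {"id": "chunk_2", "metadata": {"department": "HR", "doc_type": "faq"}},
]

hr_policies = filter_by_metadata(records, {"department": "HR", "doc_type": "policy"})
print([record["id"] for record in hr_policies])
```

In a real pipeline the filter runs inside the vector database, so similarity search only considers chunks that pass it.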
5. Upgrade Retrieval
Explore:
hybrid search,
rerankers,
contextual compression,
parent-child retrieval,
multi-query retrieval.
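As a taste of the first item, hybrid search runs a keyword search and a vector search and then merges the two ranked lists. One common merging method is reciprocal rank fusion (RRF), sketched below. This is one technique among several, not something the tutorial's pipeline uses; the rankings are made up, and `k=60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; items ranked high in many lists score best."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Invented results from two hypothetical retrievers
keyword_ranking = ["chunk_2", "chunk_0", "chunk_1"]
vector_ranking = ["chunk_0", "chunk_3", "chunk_2"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Chunks that appear near the top of both lists rise above chunks that only one retriever found, which is the main appeal of combining the two signals.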
These can noticeably improve retrieval quality, though how much depends on your documents and queries, so measure the effect on your own data.
Final Thoughts
You’ve now built a complete Retrieval-Augmented Generation pipeline from scratch.
More importantly, you’ve seen the core mechanics behind:
embeddings,
chunking,
semantic retrieval,
vector databases,
grounded prompting.
This foundation is what powers many modern AI products today.
From here, you can evolve this into:
enterprise document assistants,
internal copilots,
AI customer support systems,
research assistants,
knowledge management tools,
or full-scale production RAG architectures.



