
Build Your First RAG System: A Python Walkthrough

  • 6 days ago
Many tutorials explain the high-level concepts of Retrieval-Augmented Generation (RAG) using architecture diagrams and analogies like “open-book exams.” But the fastest way to truly understand RAG is to build one yourself.


In this guide, you’ll create a working RAG pipeline in Python, using a local vector store and the OpenAI API for embeddings and generation, that can:

  • Read custom documents

  • Convert them into embeddings

  • Store them in a vector database

  • Retrieve relevant context

  • Generate grounded answers using an LLM


By the end, you’ll have a complete command-line RAG application that you run from your own machine.



What We’re Building


Our RAG pipeline will follow this workflow:

flowchart TD
    Prepare[1. Prepare Workspace & knowledge.txt] --> Load[2. Read text & split into chunks]
    Load --> Embed[3. Embed chunks & store in ChromaDB]
    Embed --> Search[4. Search database for question context]
    Search --> Generate[5. Send context + question to LLM API]
    Generate --> Output[6. Print final answer]

Step 1 - Setting Up Your Workspace


First, create a clean Python project environment.


Create the Project Folder

mkdir first-rag-system
cd first-rag-system

Create a Virtual Environment

python -m venv venv

Activate the Environment

Windows

venv\Scripts\activate

macOS/Linux

source venv/bin/activate

Install Required Libraries

We’ll use two core libraries:

Library     Purpose
chromadb    Local vector database
openai      Embeddings + LLM generation

Install them with:

pip install chromadb openai

Step 2 - Preparing the Knowledge Base


Every RAG system needs a source of truth.

Create a file named:

knowledge.txt

Paste the following fictional company policy document into it:

Remote Work and Reimbursement Policy 2026

1. Equipment Reimbursement:
Employees are eligible for a one-time workspace setup reimbursement of up to $500. This includes chairs, desks, monitors, and keyboards. All requests must be submitted through the ExpensePortal within 30 days of purchase. Receipts are mandatory.

2. Core Collaboration Hours:
To ensure smooth collaboration across time zones, all employees must be online and active during our core hours of 10:00 AM to 3:00 PM Eastern Standard Time (EST).

3. Internet Subsidy:
OmniCorp provides a monthly home internet stipend of $50. This is processed automatically in the second paycheck of every month. No expense reports are required for this stipend.

4. Travel Policy:
If an employee is required to travel to a regional office, travel expenses (flights, hotels, and meals up to $75/day) are fully covered. All bookings must be made through the OmniTravel portal at least 14 days in advance.

Why RAG Matters


A general-purpose LLM has never seen OmniCorp’s internal policies.

If you ask:

“What is OmniCorp’s monthly internet subsidy?”

A generic model will either:

  • hallucinate a plausible-sounding figure,

  • or say it doesn’t know.


Our RAG pipeline fixes this by retrieving the relevant company policy first and injecting it into the prompt.



Step 3 - Writing the RAG Pipeline


Create a new file:

rag_app.py

We’ll build the pipeline one step at a time.


1. Imports & Client Initialization


Start by importing the required libraries and initializing the clients.

import os
import chromadb
from openai import OpenAI

# Initialize the OpenAI client
openai_client = OpenAI()

# Initialize ChromaDB
chroma_client = chromadb.Client()

2. Loading & Chunking Documents


Before storing data in a vector database, we must split the document into smaller chunks.


Why Chunking Matters

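Embedding a large document as a single unit causes problems. Large documents:

  • can exceed the model’s input limit,

  • mix several topics into one vector, which adds noise,

  • reduce retrieval precision, because the whole document comes back even when only one sentence is relevant.

Chunking splits the text into smaller units, so each vector covers a narrower slice of the document.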


Adding Overlap

Overlapping chunks preserve context between boundaries.

Without overlap:

Chunk 1 → “Employees receive reimbursement...”
Chunk 2 → “...within 30 days of purchase.”

Important meaning gets split apart.

Overlap fixes this.
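To make the effect of overlap concrete, here is a minimal sketch that chunks a single ten-word sentence with and without overlap. The helper name `sliding_chunks` and the tiny window sizes are illustrative assumptions, chosen only so the result is easy to read. It also stops as soon as a window reaches the end of the text, so the last chunk is never a pure repeat of the previous one.

```python
def sliding_chunks(words, size, overlap):
    """Split a list of words into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sentence = "All requests must be submitted within 30 days of purchase".split()

no_overlap = sliding_chunks(sentence, size=5, overlap=0)
with_overlap = sliding_chunks(sentence, size=5, overlap=2)

for label, chunks in (("no overlap", no_overlap), ("overlap=2", with_overlap)):
    print(label)
    for chunk in chunks:
        print("  ", chunk)
```

Without overlap, “submitted” and “within 30 days” land in different chunks. With an overlap of two words, the middle chunk keeps them together, so a question about the submission deadline can match a single chunk.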


Chunking Function

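def load_and_chunk_document(filepath, chunk_size_words=80, overlap_words=20):
    """Reads a text file and splits it into overlapping chunks of words."""

    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Could not find the file: {filepath}")

    # A step of zero or less would make the loop below run forever
    if not 0 <= overlap_words < chunk_size_words:
        raise ValueError("overlap_words must be non-negative and smaller than chunk_size_words")

    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    words = text.split()
    chunks = []

    i = 0
    while i < len(words):
        chunk_words = words[i : i + chunk_size_words]
        chunks.append(" ".join(chunk_words))

        # Stop once a chunk reaches the end, so the tail is not repeated
        if i + chunk_size_words >= len(words):
            break

        # Advance by less than a full chunk so neighbours share overlap_words words
        i += (chunk_size_words - overlap_words)

    return chunks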

# Run chunking
document_chunks = load_and_chunk_document("knowledge.txt")

print(f"Split document into {len(document_chunks)} chunks.")

Step 4 - Creating the Vector Store


Now we convert chunks into embeddings and store them in ChromaDB.


What Are Embeddings?


Embeddings are numerical vector representations of text.

Semantically similar sentences produce vectors that are mathematically close together.


Example:

Text                        Semantic Meaning
“internet reimbursement”    similar
“monthly WiFi allowance”    similar

This enables semantic search.
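“Mathematically close” is usually measured with a similarity score such as cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embedding vectors have many more dimensions, and the numbers here are assumptions, not actual model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real embeddings
internet_reimbursement = [0.9, 0.1, 0.0]
monthly_wifi_allowance = [0.8, 0.2, 0.1]
travel_booking_rules = [0.1, 0.9, 0.3]

sim_related = cosine_similarity(internet_reimbursement, monthly_wifi_allowance)
sim_unrelated = cosine_similarity(internet_reimbursement, travel_booking_rules)

print(f"related phrases:   {sim_related:.2f}")
print(f"unrelated phrases: {sim_unrelated:.2f}")
```

The two phrases about internet costs share no keywords, yet their vectors point in nearly the same direction, which is what lets a vector search match them.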


Creating the Collection

collection = chroma_client.create_collection(
    name="omnicorp_policies"
)

Generating & Storing Embeddings

for idx, chunk in enumerate(document_chunks):

    # Convert chunk into vector embedding
    response = openai_client.embeddings.create(
        input=[chunk],
        model="text-embedding-3-small"
    )

    vector = response.data[0].embedding

    # Store inside ChromaDB
    collection.add(
        ids=[f"chunk_{idx}"],
        embeddings=[vector],
        documents=[chunk]
    )

print(f"Stored {len(document_chunks)} chunks.")

Step 5 - Semantic Retrieval


Now we build the retrieval step.

When the user asks a question:

  1. Convert the question into an embedding

  2. Search the vector database

  3. Retrieve the most relevant chunks
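Conceptually, steps 2 and 3 are a nearest-neighbour search: compare the query vector against every stored vector and keep the closest few. The brute-force sketch below illustrates that idea with invented two-dimensional vectors. The names `squared_distance`, `top_k`, and `toy_store` are hypothetical, and a real vector database typically relies on an index rather than scanning every entry.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors (smaller means closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vector, store, k=2):
    """Return the (id, text) pairs of the k stored vectors closest to the query."""
    ranked = sorted(store, key=lambda item: squared_distance(query_vector, item[1]))
    return [(chunk_id, text) for chunk_id, _vector, text in ranked[:k]]


# Hypothetical store of (id, vector, text) entries; the vectors are made up
toy_store = [
    ("chunk_0", [0.9, 0.1], "Equipment reimbursement of up to $500."),
    ("chunk_1", [0.2, 0.8], "Monthly home internet stipend of $50."),
    ("chunk_2", [0.1, 0.3], "Meals up to $75/day while travelling."),
]

query_vector = [0.25, 0.75]  # stands in for the embedded user question
matches = top_k(query_vector, toy_store, k=2)

for chunk_id, text in matches:
    print(chunk_id, "->", text)
```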


Retrieval Function

def retrieve_relevant_context(query, limit=2):

    # Embed the query
    response = openai_client.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    )

    query_vector = response.data[0].embedding

    # Search vector DB
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=limit
    )

    return results['documents'][0]

Testing Retrieval

question = "How much can I get reimbursed for my home workspace setup?"

context_chunks = retrieve_relevant_context(question)

print("\n--- RETRIEVED CONTEXT ---")

for idx, chunk in enumerate(context_chunks):
    print(f"Match {idx+1}: {chunk}\n")

Step 6 - Generating the Final Answer


Now we combine:

  • retrieved context,

  • user question,

  • and the LLM.

This is the “Generation” part of RAG.


RAG Generation Function

def generate_rag_answer(query):

    # Retrieve relevant chunks
    context_chunks = retrieve_relevant_context(query, limit=2)

    # Merge chunks into one context block
    context_text = "\n---\n".join(context_chunks)

    # Build grounded prompt
    prompt = f"""
You are a helpful company assistant.

Use ONLY the provided context below to answer the user's question.

If you do not know the answer based on the context, say:
"I cannot find that information in the company guidelines."

Context:
{context_text}

User Question:
{query}
"""

    # Generate response
    chat_completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a policy assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.0
    )

    return chat_completion.choices[0].message.content
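Because the model only ever sees the final prompt string, it can help to build that string in a separate function and print it before any API call is made. The sketch below does this with the same prompt wording; `build_grounded_prompt` and `NOT_FOUND_MESSAGE` are hypothetical names that are not part of the tutorial script.

```python
NOT_FOUND_MESSAGE = "I cannot find that information in the company guidelines."


def build_grounded_prompt(context_chunks, query):
    """Join retrieved chunks and the question into one grounded prompt string."""
    context_text = "\n---\n".join(context_chunks)
    return (
        "You are a helpful company assistant.\n\n"
        "Use ONLY the provided context below to answer the user's question.\n\n"
        "If you do not know the answer based on the context, say:\n"
        f'"{NOT_FOUND_MESSAGE}"\n\n'
        f"Context:\n{context_text}\n\n"
        f"User Question:\n{query}\n"
    )


# Example chunks copied from knowledge.txt; no API call is involved
example_prompt = build_grounded_prompt(
    [
        "OmniCorp provides a monthly home internet stipend of $50.",
        "All bookings must be made through the OmniTravel portal.",
    ],
    "What is the internet stipend?",
)
print(example_prompt)
```

Inspecting the printed prompt is a quick way to check whether a wrong answer comes from retrieval (the right chunk is missing) or from generation (the chunk is present but the model ignored it).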

Why Temperature = 0?

temperature=0.0

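Lower temperature:

  • reduces randomness in token sampling,

  • makes answers more repeatable across runs,

  • keeps the model close to its most likely reading of the supplied context.

A low temperature does not prevent hallucinations on its own. The grounding comes from the retrieved context and the prompt instructions; the temperature setting only makes the output more predictable, which is a sensible default for policy question answering.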
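The general idea behind temperature can be illustrated with a softmax over a few made-up token scores. This is a sketch of the textbook formula, not a description of how the API implements it; the scores in `toy_logits` are invented. The function rejects a temperature of zero because the formula divides by it. As the temperature shrinks, the distribution approaches always picking the highest-scoring token.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this formula")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

probs_default = softmax_with_temperature(toy_logits, 1.0)
probs_low = softmax_with_temperature(toy_logits, 0.2)

print("temperature 1.0:", [round(p, 3) for p in probs_default])
print("temperature 0.2:", [round(p, 3) for p in probs_low])
```

At the lower temperature nearly all of the probability mass sits on the top-scoring token, which is why repeated runs tend to produce the same answer.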


Step 7 - Final RAG Implementation


Here’s the full end-to-end implementation.

import os
import chromadb
from openai import OpenAI

# Initialize Clients
openai_client = OpenAI()
chroma_client = chromadb.Client()

# Chunking Function
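def load_and_chunk_document(filepath, chunk_size_words=80, overlap_words=20):
    """Reads a text file and splits it into overlapping chunks of words."""

    if not os.path.exists(filepath):
        raise FileNotFoundError(f"Could not find the file: {filepath}")

    # A step of zero or less would make the loop below run forever
    if not 0 <= overlap_words < chunk_size_words:
        raise ValueError("overlap_words must be non-negative and smaller than chunk_size_words")

    with open(filepath, "r", encoding="utf-8") as f:
        text = f.read()

    words = text.split()
    chunks = []

    i = 0

    while i < len(words):
        chunk_words = words[i : i + chunk_size_words]
        chunks.append(" ".join(chunk_words))

        # Stop once a chunk reaches the end, so the tail is not repeated
        if i + chunk_size_words >= len(words):
            break

        i += (chunk_size_words - overlap_words)

    return chunks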

# Ingestion
print("Starting ingestion...")

document_chunks = load_and_chunk_document("knowledge.txt")

collection = chroma_client.create_collection(
    name="omnicorp_policies"
)

for idx, chunk in enumerate(document_chunks):

    response = openai_client.embeddings.create(
        input=[chunk],
        model="text-embedding-3-small"
    )

    vector = response.data[0].embedding

    collection.add(
        ids=[f"chunk_{idx}"],
        embeddings=[vector],
        documents=[chunk]
    )

print("Ingestion complete.\n")

# Retrieval Function
def retrieve_relevant_context(query, limit=2):

    response = openai_client.embeddings.create(
        input=[query],
        model="text-embedding-3-small"
    )

    query_vector = response.data[0].embedding

    results = collection.query(
        query_embeddings=[query_vector],
        n_results=limit
    )

    return results['documents'][0]

# Generation Function
def generate_rag_answer(query):

    context_chunks = retrieve_relevant_context(query)

    context_text = "\n---\n".join(context_chunks)

    prompt = f"""
You are a helpful company assistant.

Use ONLY the provided context below to answer the user's question.

If you do not know the answer based on the context, say:
"I cannot find that information in the company guidelines."

Context:
{context_text}

User Question:
{query}
"""

    chat_completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "You are a policy assistant."
            },
            {
                "role": "user",
                "content": prompt
            }
        ],
        temperature=0.0
    )

    return chat_completion.choices[0].message.content

# Run Queries
if __name__ == "__main__":

    q1 = "What is OmniCorp's monthly internet subsidy?"

    print(f"Question: {q1}")
    print(f"Answer: {generate_rag_answer(q1)}\n")

    q2 = "What are the rules for travel hotel bookings?"

    print(f"Question: {q2}")
    print(f"Answer: {generate_rag_answer(q2)}\n")

    q3 = "What is our company's policy on parental leave?"

    print(f"Question: {q3}")
    print(f"Answer: {generate_rag_answer(q3)}\n")

Step 8 - Running the Application


Before running the script, set your OpenAI API key as an environment variable. The OpenAI() client reads it from OPENAI_API_KEY automatically.


Windows PowerShell

$env:OPENAI_API_KEY="your-api-key"

macOS/Linux

export OPENAI_API_KEY="your-api-key"
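If the variable is missing, the script fails only once it tries to create the client. A small pre-flight check can give a friendlier message first. This is a minimal sketch; `has_api_key` is a hypothetical helper, and it only checks that the variable is set, not that the key is valid.

```python
import os


def has_api_key(env=None):
    """Return True when OPENAI_API_KEY is present and not blank."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; export it before running rag_app.py.")
```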

Run the Script

python rag_app.py

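Expected Output

The exact wording of each answer can vary between runs and model versions, but the output should look similar to this: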

Starting ingestion...
Ingestion complete.

Question: What is OmniCorp's monthly internet subsidy?
Answer: OmniCorp provides a monthly home internet stipend of $50.

Question: What are the rules for travel hotel bookings?
Answer: All travel bookings must be made through the OmniTravel portal at least 14 days in advance.

Question: What is our company's policy on parental leave?
Answer: I cannot find that information in the company guidelines.

Why This Is Powerful


Notice something important:

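For a question the knowledge base doesn’t cover, the model replies:

“I cannot find that information…”

instead of inventing an answer. This behavior comes from two things working together: retrieval supplies only the policy text, and the prompt instructs the model to answer from that text alone.

That’s one of the biggest advantages of RAG systems in production AI applications.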


Next Steps & Improvements


Now that your basic RAG pipeline works, here are some powerful upgrades you can build next.


1. Add PDF Support

Instead of plain text files, parse PDFs directly.

Example libraries:

  • pypdf

  • pdfplumber

  • PyMuPDF


2. Persist the Vector Database

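Right now the database is rebuilt on every run, because chromadb.Client() keeps everything in memory.

Switch to:

chromadb.PersistentClient(path="./db")

This saves the collection and its embeddings to disk under ./db. Once the data persists, also swap create_collection for get_or_create_collection, since create_collection raises an error when the collection already exists, and skip ingestion when the chunks are already stored.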


3. Build an Interactive Chat CLI

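Replace static queries with a loop:

while True:
    user_input = input("Ask a question: ")
    print(generate_rag_answer(user_input))

Now your RAG system behaves like a live chatbot. Each question is answered independently, since the script keeps no conversation history. Press Ctrl+C to stop.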
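A slightly more defensive version of this loop might handle an exit command, empty input, and Ctrl+C. The sketch below is one way to do it; `chat_loop` and its `input_fn` and `output_fn` parameters are hypothetical, added so the loop can be exercised with scripted input instead of a live terminal and a real API call.

```python
def chat_loop(answer_fn, input_fn=input, output_fn=print):
    """Read questions until the user types 'exit' or 'quit', or interrupts."""
    while True:
        try:
            user_input = input_fn("Ask a question (or 'exit' to quit): ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl+D or Ctrl+C ends the session quietly
        if user_input.lower() in {"exit", "quit"}:
            break
        if not user_input:
            continue  # ignore empty lines instead of sending them to the model
        output_fn(f"Answer: {answer_fn(user_input)}\n")


# Demo with a stand-in answer function and scripted input
scripted_inputs = iter(["What is the stipend?", "", "exit"])
transcript = []

chat_loop(
    answer_fn=lambda question: f"(answer to: {question})",
    input_fn=lambda _prompt: next(scripted_inputs),
    output_fn=transcript.append,
)
print(transcript)
```

In rag_app.py the call would simply be chat_loop(generate_rag_answer), which uses the real keyboard input and prints each answer.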


4. Add Metadata Filtering

Store:

  • departments,

  • timestamps,

  • document types,

  • authors.

Then retrieve selectively.
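ChromaDB lets you attach a metadata dictionary to each chunk through the metadatas argument of collection.add, and narrow a search with the where argument of collection.query. The pure-Python sketch below only illustrates the filtering idea itself; `filter_by_metadata` and the records in `toy_records` are hypothetical and do not come from the tutorial’s knowledge base.

```python
def filter_by_metadata(records, where):
    """Keep records whose metadata matches every key/value pair in `where`."""
    return [
        record
        for record in records
        if all(record["metadata"].get(key) == value for key, value in where.items())
    ]


# Made-up records for illustration only
toy_records = [
    {"id": "chunk_0", "text": "Equipment reimbursement of up to $500.",
     "metadata": {"department": "finance", "doc_type": "policy"}},
    {"id": "chunk_1", "text": "Core hours are 10:00 AM to 3:00 PM EST.",
     "metadata": {"department": "hr", "doc_type": "policy"}},
    {"id": "chunk_2", "text": "Example travel budget summary.",
     "metadata": {"department": "finance", "doc_type": "report"}},
]

finance_policies = filter_by_metadata(
    toy_records, {"department": "finance", "doc_type": "policy"}
)

for record in finance_policies:
    print(record["id"], "->", record["text"])
```

Filtering before the similarity search keeps chunks from the wrong department or document type out of the prompt, even when their text happens to resemble the question.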


5. Upgrade Retrieval

Explore:

  • hybrid search,

  • rerankers,

  • contextual compression,

  • parent-child retrieval,

  • multi-query retrieval.

These can noticeably improve retrieval quality on larger or messier document collections, at the cost of added complexity.


Final Thoughts

You’ve now built a complete Retrieval-Augmented Generation pipeline from scratch.


More importantly, you’ve understood the core mechanics behind:

  • embeddings,

  • chunking,

  • semantic retrieval,

  • vector databases,

  • grounded prompting.


This foundation is what powers many modern AI products today.

From here, you can evolve this into:

  • enterprise document assistants,

  • internal copilots,

  • AI customer support systems,

  • research assistants,

  • knowledge management tools,

  • or full-scale production RAG architectures.




Ready to Build Smarter RAG Systems?

At Codersarts, we help developers, startups, and enterprises design production-ready AI systems powered by modern retrieval architectures, LLM pipelines, and scalable RAG workflows.


Whether you're building:

  • enterprise knowledge assistants,

  • AI search systems,

  • document intelligence platforms,

  • agentic workflows,

  • or domain-specific copilots,


Our team can help you engineer reliable, retrieval-aware AI systems that go beyond basic chatbot demos.


From:

  • chunking strategy optimization,

  • vector database design,

  • and retrieval evaluation,

to:

  • end-to-end RAG deployment,

  • multimodal AI pipelines,

  • and custom LLM integration,


we work on practical AI systems built for real-world scale.


Explore more AI engineering insights and projects at: https://www.codersarts.com or connect with the Codersarts team to build your next AI solution.


