Building a RAG Knowledge Base Pipeline

Mar 24
5 min read

Course: RAG from Scratch

Level: Beginner to Medium

Type: Individual

Duration: 5 to 7 days

Objective

This assignment tests your ability to build the foundational stages of a RAG pipeline: loading documents, extracting clean text, attaching metadata, enriching documents with LLM-generated keywords, and splitting them into retrievable chunks. By completing this assignment, you will have built a reusable knowledge base preparation pipeline that you can apply to any document collection.

Tasks

Task 1: Document Loading and Text Extraction (10 marks)

Select a collection of 10 to 15 documents on a topic of your choice. Your collection must include at least two different file formats (for example, PDF and plain text or Markdown).
Write an extract_text_from_file() function that reads each file and returns the raw text content.
Print the file name, format, and character count for each document to confirm successful loading.

Task 2: Text Cleaning and Normalization (10 marks)

Implement a clean_text() function that removes extra whitespace, normalises line breaks, and strips boilerplate content (for example, repeated headers or footers).
Apply the function to all loaded documents and print the before and after character counts for at least two examples.
Explain in a comment or markdown cell why each cleaning step improves retrieval quality.

Task 3: Document Metadata Extraction (15 marks)

For each document, build a metadata dictionary containing at least six fields: a unique document ID (content-based hash), filename, source directory, file type, word count, and character count.
Write a generate_document_id() function that creates a consistent, reproducible ID for each document.
Display the metadata for three documents in a formatted table.

Task 4: LLM-Enriched Keywords (20 marks)

Write a generate_keywords() function that calls the OpenAI API and returns 3 to 5 lowercase topic keywords for each document.
The function should use gpt-4o-mini-2024-07-18 and return a Python list. Handle API errors gracefully.
Apply the function to at least five documents and display the document name alongside its generated keywords.
Add the keywords list to each document's metadata dictionary.

Task 5: Unified Document Loading Pipeline (20 marks)

Write a load_document() function that combines text extraction, cleaning, metadata generation, and keyword generation into a single call that returns a complete document dictionary.
Write a load_knowledge_base() function that applies load_document() to every file in a directory and returns a list of document dictionaries.
Save the complete knowledge base to a JSON file and verify you can reload it correctly.
Print a summary showing total documents loaded, total characters, and total words.

Task 6: Chunking Implementation (15 marks)

Implement a sentence-aware chunking function that splits each document into chunks of approximately 200 to 500 characters, respecting sentence boundaries.
Add an overlap parameter (default: 50 characters) so that consecutive chunks share some context.
Each chunk dictionary must include: chunk_id, parent_doc_id, chunk_index, text, character_count, and all parent metadata fields.
Apply your chunking function to the full knowledge base and print the total number of chunks produced.

Task 7: Analysis and Reflection (10 marks)

Choose one document from your collection. Generate embeddings for both the raw extracted text and the cleaned version of the same text using text-embedding-3-small.
Compute cosine similarity between a relevant query and each version. Show the similarity scores and discuss whether cleaning made a measurable difference.
Write a short paragraph (150 to 250 words) reflecting on the most important design decisions in your pipeline and what you would change for a production system.

Evaluation Rubric

Criteria	Marks
Document Loading and Text Extraction	10
Text Cleaning and Normalization	10
Metadata Extraction	15
LLM-Enriched Keywords	20
Unified Document Loading Pipeline	20
Chunking Implementation	15
Analysis and Reflection	10
Total	100

Deliverables

A Jupyter Notebook (.ipynb) containing all code, outputs, and markdown explanations.
A knowledge_base.json file containing all loaded and cleaned documents with metadata.
A chunks.json file containing all chunks produced by your chunking pipeline.
A short written reflection (150 to 250 words) either in the notebook or as a separate PDF.

Submission Guidelines

Submit your work via the course LMS (for example, Moodle or Google Classroom).

File Naming Convention: <YourName>_RAG_Assignment1.zip

Inside the ZIP:

notebook.ipynb
knowledge_base.json
chunks.json
reflection.pdf (or included in the notebook)

Deadline: 7 days from the date of release.

Late Submission Policy

Up to 24 hours late: 10% penalty applied to the final mark.
24 to 48 hours late: 20% penalty applied to the final mark.
Beyond 48 hours: submission will not be accepted.

Important Instructions

Do not copy code from external sources without understanding it. You must be able to explain every function you submit.
Use of libraries is allowed, but core logic (extraction, cleaning, chunking) must be implemented by you.
Plagiarism of any kind will result in disqualification from the assignment.
Do not hardcode file paths. Use pathlib.Path and relative paths so the notebook runs on any machine.
If the OpenAI API is unavailable, you may stub the generate_keywords() function with a fixed list for testing, but note this clearly in the notebook.

Guidance and Tips

Start with a small set of documents (3 to 5) to verify your pipeline before scaling to the full collection.
Print intermediate outputs at each stage so you can catch errors early.
Do not just implement — explain your decisions. Why did you choose that chunk size? Why does that metadata field matter?
Think from a retrieval perspective. A good chunk is one that can be matched to a user query, not just one that is the right length.
You may use any document collection you have access to. Wikipedia exports, company reports, research abstracts, and policy documents all work well.

Bonus (Optional — up to +10 Marks)

Embed your chunks and run a similarity search to verify that your chunks are actually retrievable for relevant queries.
Compare retrieval quality between your sentence-aware chunks and a naive fixed-size chunking baseline.
Visualise the distribution of chunk lengths across your knowledge base as a histogram.

Instructor Note

This assignment is designed to simulate real-world pipeline design thinking. There is no single correct implementation. What matters is the clarity of your reasoning, the quality of your implementation, and the depth of your analysis. A well-explained pipeline with minor bugs will score better than a working pipeline with no explanation.

Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?

Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.

Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.

Get Started Today

Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.

Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.

Email: contact@codersarts.com

Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.

Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.

Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.