
Designing a Production-Ready Chunking Pipeline for Retrieval-Augmented Generation

  • Mar 25
  • 9 min read




ASSIGNMENT REQUIREMENT DOCUMENT


Course Name: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers

Institution: [Institution Name]

Semester: [Semester / Term — e.g., Spring 2026]

Instructor: [Instructor Name]

Student Level: Postgraduate / Senior Undergraduate (Year 3–4)

Submission Platform: Moodle LMS

Total Assignments: 2



Note to Students: This document contains the complete requirements for one of the course assignments. Read the entire document carefully before beginning. The assignment is self-contained but builds on the cumulative knowledge of the course. Deadlines, rubrics, and submission procedures are specified individually for each assignment.





Purpose

This assignment requires you to design, implement, and evaluate a chunking pipeline that produces high-quality text chunks optimized for retrieval in a RAG (Retrieval-Augmented Generation) system. You will explore multiple chunking strategies—from simple fixed-size splitting to advanced semantic grouping—and critically analyze the trade-offs that arise when preparing documents for vector-based retrieval.





Connection to Course Learning Outcomes (CLOs)

This assignment directly supports the following course learning outcomes:


| CLO | Description | Relevance |
|-----|-------------|-----------|
| CLO 1 | Identify structural failure modes of pure vector search | Understanding why chunking quality matters for retrieval precision |
| CLO 2 | Implement hybrid search strategies | Chunks produced here feed into hybrid retrieval pipelines |
| CLO 3 | Evaluate trade-offs between retrieval approaches | The comparative analysis of chunking strategies mirrors retrieval strategy evaluation |
| CLO 4 | Design production-ready information retrieval pipelines | The hybrid chunking pipeline simulates real-world system design |




Assignment Type & Duration


  • Type: Individual Assignment

  • Duration: 7 days from release

  • Weighting: 50% of total coursework





Learning Objectives

Upon successful completion of this assignment, you will be able to:


  1. Implement at least five distinct text chunking strategies (fixed-size, token-aware, sliding window, semantic, structure-aware) and articulate the assumptions underlying each approach.

  2. Analyze the trade-offs between chunking strategies in terms of context preservation, token efficiency, redundancy, and retrieval readiness.

  3. Design a hybrid chunking pipeline that combines multiple strategies to produce coherent, token-compliant, and context-preserving chunks.

  4. Evaluate chunk quality using quantitative and qualitative methods, including cosine similarity, token distribution analysis, and boundary coherence checks.

  5. Justify design decisions with evidence-based reasoning, explaining why a particular chunking approach is suited to a given document type.

  6. Construct a clear, well-documented technical report that communicates system design and experimental findings to both technical and non-technical audiences.





Task Description




Dataset Requirements

You must work with a minimum of 3 documents in the following formats:


| Format | Description | Suggested Sources |
|--------|-------------|-------------------|
| Plain text (.txt) | Unstructured prose, 500–1500 words | Technical blog posts, research article excerpts |
| Markdown (.md) | Structured with headings and sections, 500–1500 words | Documentation pages, README files |
| HTML (.html) | Structured with heading tags (h1–h6), 500–1500 words | Web articles, technical documentation |


You may source documents from publicly available technical blogs, documentation pages, or research articles. Include your source data in your submission.




Part A: Core Chunking Implementations (55 Marks)



Task 1: Baseline Chunking — 10 Marks

Implement at least two basic chunking strategies:


  • Fixed-size chunking (word-based or character-based)

  • Sentence-based chunking


Requirements:

  • Clearly define the chunk size parameter and justify your choice

  • Print sample output chunks for at least one document

  • Discuss the limitations of each strategy with specific reference to the failure modes studied in Chapter 1 (e.g., broken context, split identifiers)


Deliverables: Working code with sample outputs; 200–300 word analysis of limitations.
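As a starting point, the two baseline strategies could be sketched as follows. This is a minimal illustration rather than a reference solution: the function names, the default sizes, and the regex sentence splitter are assumptions, and the naive splitter will mis-split abbreviations such as "e.g.", which is exactly the kind of limitation the analysis should discuss.

```python
import re


def fixed_size_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Group consecutive sentences into chunks of `sentences_per_chunk` sentences."""
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


sample = "RAG systems retrieve chunks. Chunk quality matters! Does size matter? It often does."
for i, chunk in enumerate(fixed_size_chunks(sample, chunk_size=5)):
    print(f"[fixed {i}] {chunk}")
for i, chunk in enumerate(sentence_chunks(sample, sentences_per_chunk=2)):
    print(f"[sentence {i}] {chunk}")
```

Note how the fixed-size output cuts through sentences, while the sentence-based output keeps each sentence whole but produces chunks of uneven length.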



Task 2: Token-Aware Chunking — 15 Marks

Implement token-based chunking using a tokenizer (e.g., tiktoken, transformers tokenizer).


Requirements:

  • Define chunk_size and overlap parameters

  • Display the token count for each generated chunk

  • Verify that no chunk exceeds the defined token limit


Analysis (required):

  • Compare token-aware chunking with fixed-size chunking on the same document

  • Discuss token efficiency: do token-aware chunks better preserve semantic boundaries?

  • Write 300–400 words of comparative analysis
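One possible shape for token-aware chunking is sketched below. To keep the sketch runnable without extra installs, the tokenizer is passed in as a pair of `encode`/`decode` callables, and a whitespace stand-in is used for the demonstration; the function name `token_chunks` and the stand-in helpers are assumptions. With tiktoken installed, the `encode` and `decode` methods of `tiktoken.get_encoding("cl100k_base")` could be passed instead.

```python
from typing import Callable, Sequence


def token_chunks(
    text: str,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
    chunk_size: int = 64,
    overlap: int = 8,
) -> list[tuple[str, int]]:
    """Return (chunk_text, token_count) pairs of at most `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = list(encode(text))
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append((decode(window), len(window)))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Stand-in tokenizer (assumption): whitespace "tokens" so the sketch runs anywhere.
def whitespace_encode(text: str) -> list[str]:
    return text.split()


def whitespace_decode(tokens: Sequence) -> str:
    return " ".join(tokens)


text = " ".join(f"w{i}" for i in range(20))
chunks = token_chunks(text, whitespace_encode, whitespace_decode, chunk_size=8, overlap=2)
for chunk_text, n_tokens in chunks:
    print(n_tokens, chunk_text)

# Verification step required by the task: no chunk exceeds the limit.
assert all(n_tokens <= 8 for _, n_tokens in chunks)
```

With a real subword tokenizer, slicing the token sequence can cut through a word or a multi-byte character, so the decoded chunk edges are worth inspecting as part of the comparison with fixed-size chunking.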



Task 3: Sliding Window Chunking — 10 Marks

Implement chunking with configurable overlap.


Requirements:

  • Define window_size and stride parameters

  • Demonstrate how overlap preserves contextual continuity between adjacent chunks


Analysis (required):

  • When is overlap useful? When does it introduce harmful redundancy?

  • Discuss the trade-off between redundancy and context preservation (200–300 words)
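A sliding window over any sequence of units (words, sentences, or tokens) might look like the sketch below. The helper `redundancy_ratio` is a hypothetical metric introduced here for illustration: it divides the number of emitted units by the number of original units, so 1.0 means no duplication.

```python
def sliding_window_chunks(units: list[str], window_size: int, stride: int) -> list[list[str]]:
    """Slide a window of `window_size` units forward by `stride` units at a time.

    Overlap between neighbours is window_size - stride. A stride larger than
    the window would leave gaps, so it is rejected here.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    if stride > window_size:
        raise ValueError("stride larger than window_size would skip content")
    chunks = []
    for start in range(0, len(units), stride):
        chunks.append(units[start:start + window_size])
        if start + window_size >= len(units):
            break  # the final window already covers the tail
    return chunks


def redundancy_ratio(chunks: list[list[str]], n_units: int) -> float:
    """Emitted units divided by original units (hypothetical metric)."""
    return sum(len(c) for c in chunks) / n_units if n_units else 0.0


sentences = ["S1", "S2", "S3", "S4", "S5", "S6"]
windows = sliding_window_chunks(sentences, window_size=3, stride=2)
for w in windows:
    print(w)
print("redundancy:", round(redundancy_ratio(windows, len(sentences)), 2))
```

Each window here shares its last sentence with the start of the next one, which is the continuity the task asks you to demonstrate; the ratio makes the storage and embedding cost of that continuity explicit.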



Task 4: Semantic Chunking — 20 Marks

Use embedding-based similarity to group semantically related sentences into chunks.


Requirements:

  • Use an embedding model (e.g., sentence-transformers, OpenAI embeddings)

  • Compute pairwise cosine similarity between consecutive sentences

  • Split chunks at points where similarity drops below a defined threshold


Experimentation (required):

  • Test at least 3 different threshold values

  • Compare the resulting chunk boundaries across thresholds

  • Explain how semantic chunking preserves meaning that fixed-size methods destroy


Deliverables: Working code, comparison table/chart of threshold experiments, 400–500 word analysis.
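The threshold-splitting logic can be sketched independently of the embedding model. In the sketch below, `bow_embed` is a toy bag-of-words stand-in (an assumption, used only so the code runs without downloads); for the actual task you would pass an `embed` function backed by a real model such as all-MiniLM-L6-v2 and adapt `cosine` to dense vectors.

```python
import math
import re
from collections import Counter
from typing import Callable


def bow_embed(sentence: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(
    sentences: list[str],
    threshold: float,
    embed: Callable[[str], Counter] = bow_embed,
) -> list[list[str]]:
    """Start a new chunk wherever consecutive-sentence similarity drops below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append([sentence])  # similarity dropped: open a new chunk
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = [
    "Vector search retrieves similar chunks.",
    "Vector search ranks chunks by similarity.",
    "Bananas are rich in potassium.",
    "Bananas ripen faster in warm rooms.",
]
for threshold in (0.1, 0.4, 0.6):
    result = semantic_chunks(sentences, threshold)
    print(threshold, len(result), result)
```

Running the same sentences through several thresholds, as in the loop above, gives the raw material for the required comparison table: low thresholds merge loosely related sentences, while high thresholds fragment the text into near-singletons.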




Part B: Advanced Pipeline Design (35 Marks)



Task 5: Structure-Aware Chunking — 15 Marks

Handle structured documents by respecting document hierarchy.


Markdown documents:

  • Split based on header levels (##, ###, etc.)

  • Preserve section-level grouping


HTML documents:

  • Extract sections using heading tags (h1, h2, h3, etc.)

  • Maintain the logical structure of the original document


Requirements:

  • Demonstrate that chunks do not mix content from unrelated sections

  • Show at least 2 examples of structure-aware chunks from each format
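A standard-library sketch of heading-based splitting for both formats is shown below. The names `markdown_sections`, `HeadingSectioner`, and `html_sections` are illustrative assumptions. The Markdown regex does not know about fenced code blocks, so a line starting with "#" inside one would be misread as a heading, and the HTML parser ignores text that appears before the first heading; BeautifulSoup (listed under permitted resources) is a reasonable alternative for messier pages.

```python
import re
from html.parser import HTMLParser

HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$")
HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def markdown_sections(md: str) -> list[dict]:
    """Split Markdown into one section per ATX heading (# ... ######)."""
    sections = []
    heading, level, lines = "(preamble)", 0, []

    def flush() -> None:
        text = "\n".join(lines).strip()
        if level or text:  # skip an empty preamble
            sections.append({"heading": heading, "level": level, "text": text})

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, level, lines = match.group(2).strip(), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections


class HeadingSectioner(HTMLParser):
    """Collect the text that follows each h1-h6 tag into its own section."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: list[dict] = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self.sections.append({"heading": "", "level": int(tag[1]), "text": ""})
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is ignored in this sketch
        key = "heading" if self._in_heading else "text"
        self.sections[-1][key] += data


def html_sections(html: str) -> list[dict]:
    parser = HeadingSectioner()
    parser.feed(html)
    parser.close()
    return [
        {**s, "heading": s["heading"].strip(), "text": " ".join(s["text"].split())}
        for s in parser.sections
    ]


md = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it."
html = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><p>Install <b>it</b>.</p>"
print(markdown_sections(md))
print(html_sections(html))
```

Keeping the heading and its level alongside each chunk makes it straightforward to show that no chunk mixes content from two sections, and the level can later serve as retrieval metadata.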



Task 6: Hybrid Chunking Pipeline — 20 Marks

Design and implement a combined pipeline that integrates:


  1. Structure-aware splitting (as a first pass)

  2. Semantic grouping (to refine boundaries)

  3. Token normalization (to enforce limits)

  4. Optional overlap (for context preservation)


Requirements:

  • Clearly define the pipeline stages with a visual diagram or pseudocode

  • Show the final chunk outputs for all 3 document types

  • Verify that all final chunks are:

    • Coherent (do not mix unrelated content)

    • Within token limits

    • Context-preserving (important information is not split across chunks)


Deliverables: Pipeline diagram, working code, final chunk samples, 400–500 word justification of design decisions.
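To make the four stages concrete, here is one way they might be composed for a Markdown input. Everything in this sketch is an assumption made for illustration: the stage function names, the bag-of-words similarity used in place of a real embedding model, and whitespace words used in place of real tokens. It is meant to show the control flow, not a finished design.

```python
import math
import re
from collections import Counter


def split_sections(md: str) -> list[tuple[str, str]]:
    """Stage 1: structure-aware split into (heading, body) pairs."""
    sections, heading, body = [], "(preamble)", []
    for line in md.splitlines():
        match = re.match(r"^#{1,6}\s+(.+)$", line)
        if match:
            if body:
                sections.append((heading, " ".join(body)))
            heading, body = match.group(1).strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((heading, " ".join(body)))
    return sections


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def group_sentences(text: str, threshold: float) -> list[str]:
    """Stage 2: merge consecutive sentences whose similarity stays above threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    if not sentences:
        return []
    vecs = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]  # toy embedding
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vecs, vecs[1:], sentences[1:]):
        if _cosine(prev, cur) < threshold:
            groups.append([sentence])
        else:
            groups[-1].append(sentence)
    return [" ".join(g) for g in groups]


def enforce_limit(text: str, max_tokens: int, overlap: int) -> list[str]:
    """Stages 3 and 4: cap each piece at max_tokens, with optional overlap."""
    tokens = text.split()  # whitespace words stand in for real tokens
    step = max_tokens - overlap
    pieces = []
    for start in range(0, len(tokens), step):
        pieces.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break
    return pieces


def hybrid_chunks(md: str, threshold: float = 0.2, max_tokens: int = 40, overlap: int = 0) -> list[dict]:
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    chunks = []
    for heading, body in split_sections(md):
        for group in group_sentences(body, threshold):
            for piece in enforce_limit(group, max_tokens, overlap):
                chunks.append({"section": heading, "text": piece, "tokens": len(piece.split())})
    return chunks


doc = """# Retrieval
Dense retrieval embeds queries. Dense retrieval embeds documents too.
# Chunking
Chunk size affects recall. Bananas are yellow."""
for chunk in hybrid_chunks(doc):
    print(chunk)
```

Because semantic grouping runs inside each section and token limits are enforced last, a chunk can never span two sections, but a long semantic group can still be cut mid-thought by the token cap. That ordering is one of the design decisions worth justifying in the report.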




Part C: Evaluation & Insights (10 Marks)



Task 7: Evaluation & Reflection — 10 Marks

Provide a comprehensive evaluation of your chunking strategies.


You must answer the following questions:


  1. Which chunking method produced the best chunks, and by what criteria did you assess “best”?

  2. What trade-offs did you observe between different strategies?

  3. How does chunking quality affect retrieval quality in a RAG system? (Connect to concepts from Chapters 1–2 of this course.)
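When defining "best", it may help to put numbers next to each strategy. The sketch below computes a token-count summary and a crude boundary check; both helper names are hypothetical, whitespace word counts stand in for real token counts, and "ends with sentence-final punctuation" is only one possible proxy for boundary coherence.

```python
import statistics
from typing import Callable


def token_stats(
    chunks: list[str],
    limit: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> dict:
    """Summarise the token-count distribution of a list of chunks."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    counts = [count_tokens(c) for c in chunks]
    return {
        "n_chunks": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": statistics.mean(counts),
        "stdev": statistics.pstdev(counts),
        "over_limit": sum(c > limit for c in counts),
    }


def boundary_coherence(chunks: list[str]) -> float:
    """Fraction of chunks that end on sentence-final punctuation (heuristic)."""
    if not chunks:
        return 0.0
    return sum(c.rstrip().endswith((".", "!", "?")) for c in chunks) / len(chunks)


example_chunks = ["Chunking splits text.", "Overlap repeats some", "tokens across chunks."]
print(token_stats(example_chunks, limit=3))
print(round(boundary_coherence(example_chunks), 2))
```

Reporting these figures per strategy, alongside a few hand-inspected chunks, gives the qualitative discussion something concrete to refer to.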



Optional Bonus (up to +10 Marks):

  • Build a mini RAG demo using your chunks with a vector store (e.g., ChromaDB, FAISS)

  • Compare retrieval quality across chunking strategies using a set of test queries

  • Visualize similarity scores or chunk quality distributions
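Before wiring up a real vector store, the retrieval comparison can be prototyped with a brute-force search. The sketch below is an assumption-heavy stand-in: it uses an in-memory list instead of ChromaDB or FAISS, a bag-of-words embedding instead of a trained model, and a hypothetical `hit_rate` metric that counts a query as a hit when any top-k chunk contains the expected answer string. The chunks and queries are toy strings, not experimental results.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query (brute-force scan)."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def hit_rate(chunks: list[str], test_queries: list[tuple[str, str]], k: int = 1) -> float:
    """Share of queries whose top-k chunks contain the expected answer string."""
    hits = sum(
        any(expected in chunk for chunk in top_k(query, chunks, k))
        for query, expected in test_queries
    )
    return hits / len(test_queries)


strategies = {
    "fixed": ["The API key rotates every", "90 days. Tokens expire after", "one hour."],
    "sentence": ["The API key rotates every 90 days.", "Tokens expire after one hour."],
}
queries = [
    ("how often does the API key rotate", "90 days"),
    ("when do tokens expire", "one hour"),
]
for name, chunks in strategies.items():
    print(name, hit_rate(chunks, queries, k=1))
```

In this toy case the fixed-size chunks separate each fact from the words that a query would match, so the top-ranked chunk never contains the answer. The same harness can be pointed at your real chunks and a real embedding model to see whether that effect holds on your documents.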





Difficulty & Scope




Expectations by Performance Level


| Level | Description |
|-------|-------------|
| Basic (40–59%) | Implements Tasks 1–3 correctly with minimal analysis. Code runs but lacks clear documentation. Report is descriptive rather than analytical. Chunking parameters are chosen without justification. |
| Proficient (60–79%) | Implements Tasks 1–6 with clear code, reasonable analysis, and justified parameter choices. The hybrid pipeline works but may lack refinement. Report demonstrates understanding of trade-offs and connects to RAG retrieval concepts. |
| Advanced (80–100%) | All 7 tasks completed with depth. Semantic chunking experiments are thorough. Hybrid pipeline is well-designed with clear engineering rationale. Report demonstrates critical thinking, connects chunking to retrieval quality (Chapters 1–2), and identifies non-obvious insights. Bonus tasks attempted with meaningful results. |





Marking Rubric


| Criteria | Marks | Description |
|----------|-------|-------------|
| Task 1: Baseline Chunking | 10 | Correct implementation of fixed-size and sentence-based chunking; clear sample outputs; meaningful limitation analysis |
| Task 2: Token-Aware Chunking | 15 | Working tokenizer integration; comparative analysis with fixed-size; discussion of token efficiency |
| Task 3: Sliding Window Chunking | 10 | Correct overlap implementation; analysis of redundancy vs. context trade-offs |
| Task 4: Semantic Chunking | 20 | Embedding-based grouping works; minimum 3 threshold experiments; insightful boundary analysis |
| Task 5: Structure-Aware Chunking | 15 | Correct handling of Markdown and HTML structure; sections not mixed; examples shown |
| Task 6: Hybrid Chunking Pipeline | 20 | Pipeline integrates multiple strategies; output is coherent, token-compliant, and context-preserving; design is justified |
| Task 7: Evaluation & Insights | 10 | Addresses all 3 required questions; connects to course concepts; demonstrates critical thinking |
| Total | 100 | |
| Bonus (optional) | +10 | RAG demo, retrieval comparison, or visualization |





Formatting & Structural Requirements




Code Submission


  • Format: Jupyter Notebook (.ipynb)

  • All code must be well-commented with clear sectioning for each task

  • Use markdown cells to separate tasks and provide brief explanations

  • Code must be executable without modification (include any required pip install commands)




Report


  • Length: 3–5 pages (excluding references and appendices)

  • Format: PDF or DOCX

  • Font: Times New Roman, 12pt

  • Spacing: 1.5 line spacing

  • Margins: 2.54 cm (1 inch) on all sides

  • Headings: Use a consistent heading hierarchy (Heading 1 for task sections, Heading 2 for subsections)

  • Citation Style: IEEE format



Required Sections:


  • Introduction (brief overview of your approach)

  • Task-by-Task Analysis (one section per task)

  • Hybrid Pipeline Design (with pipeline diagram)

  • Evaluation & Reflection

  • References




Output Samples


  • Include representative sample chunks from each chunking strategy

  • Include final hybrid pipeline chunks for all 3 document types

  • May be embedded in the notebook or submitted as a separate document





Permitted Resources & Academic Integrity Policy




Permitted Resources


  • Course notebooks (Chapters 1–5) and lecture materials

  • Official documentation for Python libraries: tiktoken, sentence-transformers, scikit-learn, BeautifulSoup, NLTK, spaCy

  • Public tutorials and documentation for chunking techniques

  • Textbooks listed in the course syllabus




AI Tool Policy


  • Permitted: AI tools (e.g., ChatGPT, GitHub Copilot) may be used for:

    • Debugging code errors

    • Understanding library documentation

    • Generating boilerplate code (e.g., file I/O, parsing)

  • NOT Permitted: Using AI to generate entire task solutions, analysis paragraphs, or report sections

  • Requirement: If you use an AI tool, you must declare it in a dedicated “AI Usage Declaration” section of your report, specifying:


    • Which tool was used

    • For which specific purpose

    • What portion of your work was assisted




Academic Integrity Declaration


I declare that this assignment is my own original work. All sources and AI tools used have been appropriately acknowledged. I understand that plagiarism, collusion, and undeclared AI-generated content constitute academic misconduct and will be dealt with according to the institution’s academic integrity policy.


Student Name: ____________________________

Student ID: ____________________________

Date: ____________________________

Signature: ____________________________





Step-by-Step Submission Instructions (Moodle)


  1. Log in to Moodle at [institution Moodle URL] using your student credentials.

  2. Navigate to the course page: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers.

  3. Click on the assignment link: Assignment 1: Chunking Pipeline.

  4. Prepare your submission as a single ZIP file with the following structure:

     <YourName>_Chunking_Assignment.zip
     ├── notebook.ipynb
     ├── report.pdf (or report.docx)
     └── data/ (optional: your source documents)

  5. File naming convention: LastName_FirstName_Chunking_Assignment.zip

  6. Upload your ZIP file using the file submission area.

  7. Click “Submit Assignment” and do NOT just save as draft.

  8. Verify your submission by checking the confirmation email from Moodle.




Accepted File Formats


  • Notebook: .ipynb

  • Report: .pdf (preferred) or .docx

  • Archive: .zip




Deadline

Submit within 7 days from assignment release.




Late Submission Policy

| Submission Window | Penalty |
|-------------------|---------|
| Up to 24 hours late | 10% deduction |
| 24–48 hours late | 20% deduction |
| Beyond 48 hours | Not accepted |





Support & Communication Guidelines


  • Office Hours: [Day and Time — e.g., Tuesdays 2:00–4:00 PM], [Location / Online Link]

  • Discussion Forum: Use the Moodle discussion forum for all general questions. Post in the Assignment 1 Q&A thread.

  • Email: For confidential matters only, email [instructor@institution.edu] with the subject line: [Hybrid Search] Assignment 1 — [Your Name]

  • Response Time: Expect a response within 48 hours on working days. Questions posted on weekends will be addressed on the next working day.

  • Peer Collaboration: You may discuss high-level strategies and concepts with classmates, but all code and written analysis must be your own. Sharing code, output, or report text is not permitted.





Frequently Asked Questions (FAQ)




Q1: Can I use documents shorter or longer than 500–1500 words?

A: The 500–1500 word range is a guideline for each document. Slightly shorter or longer is acceptable, but ensure your documents are substantial enough to demonstrate meaningful chunking behaviour.




Q2: Do I need an API key (e.g., OpenAI) for semantic chunking?

A: No. You can use free, locally-running models such as sentence-transformers (e.g., all-MiniLM-L6-v2) for embeddings. No paid API access is required.




Q3: What if my hybrid pipeline doesn’t always produce perfect chunks?

A: That is expected and encouraged. The evaluation values your analysis of why certain chunks are imperfect more than achieving perfection. Document the failures and explain what would improve them.




Q4: How much analysis is “enough” for each task?

A: Follow the word counts specified in each task description. Prioritize depth over breadth — a focused analysis of one interesting failure case is more valuable than a superficial summary of everything.




Q5: Can I use LangChain or similar frameworks for chunking?

A: You may use framework utilities for helper functions (e.g., text splitting), but the core logic of each strategy must be implemented by you. Do not submit a one-line call to a framework’s chunking function as your implementation.




Q6: What is the difference between the notebook and the report?

A: The notebook contains your code, outputs, and brief inline explanations. The report is a formal document that presents your approach, comparisons, pipeline design, and reflections in a structured narrative format. They are complementary, not redundant.




Q7: Is the bonus section worth attempting?

A: If you have completed all 7 tasks with quality, the bonus can significantly improve your grade. However, prioritize completing the core tasks well before attempting the bonus.




Q8: Can I use a language other than Python?

A: Python is the expected and recommended language. If you use another language, you must justify it and ensure the grader can run your code without special setup.








