Designing a Production-Ready Chunking Pipeline for Retrieval-Augmented Generation

Mar 25

ASSIGNMENT REQUIREMENT DOCUMENT

Course Name: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers

Institution: [Institution Name]

Semester: [Semester / Term — e.g., Spring 2026]

Instructor: [Instructor Name]

Student Level: Postgraduate / Senior Undergraduate (Year 3–4)

Submission Platform: Moodle LMS

Total Assignments: 2

Note to Students: This document contains the complete requirements for one of the two course assignments. Read the entire document carefully before beginning. The assignment is self-contained but builds upon the cumulative knowledge of the course. Deadlines, rubrics, and submission procedures are specified individually per assignment.

Purpose

This assignment requires you to design, implement, and evaluate a chunking pipeline that produces high-quality text chunks optimized for retrieval in a RAG (Retrieval-Augmented Generation) system. You will explore multiple chunking strategies—from simple fixed-size splitting to advanced semantic grouping—and critically analyze the trade-offs that arise when preparing documents for vector-based retrieval.

Connection to Course Learning Outcomes (CLOs)

This assignment directly supports the following course learning outcomes:

CLO	Description	Relevance
CLO 1	Identify structural failure modes of pure vector search	Understanding why chunking quality matters for retrieval precision
CLO 2	Implement hybrid search strategies	Chunks produced here feed into hybrid retrieval pipelines
CLO 3	Evaluate trade-offs between retrieval approaches	The comparative analysis of chunking strategies mirrors retrieval strategy evaluation
CLO 4	Design production-ready information retrieval pipelines	The hybrid chunking pipeline simulates real-world system design

Assignment Type & Duration

Type: Individual Assignment
Duration: 7 days from release
Weighting: 50% of total coursework

Learning Objectives

Upon successful completion of this assignment, you will be able to:

Implement at least five distinct text chunking strategies (fixed-size, token-aware, sliding window, semantic, structure-aware) and articulate the assumptions underlying each approach.
Analyze the trade-offs between chunking strategies in terms of context preservation, token efficiency, redundancy, and retrieval readiness.
Design a hybrid chunking pipeline that combines multiple strategies to produce coherent, token-compliant, and context-preserving chunks.
Evaluate chunk quality using quantitative and qualitative methods, including cosine similarity, token distribution analysis, and boundary coherence checks.
Justify design decisions with evidence-based reasoning, explaining why a particular chunking approach is suited to a given document type.
Construct a clear, well-documented technical report that communicates system design and experimental findings to both technical and non-technical audiences.

Task Description

Dataset Requirements

You must work with a minimum of 3 documents in the following formats:

Format	Description	Suggested Sources
Plain text (.txt)	Unstructured prose, 500–1500 words	Technical blog posts, research article excerpts
Markdown (.md)	Structured with headings and sections, 500–1500 words	Documentation pages, README files
HTML (.html)	Structured with heading tags (h1–h6), 500–1500 words	Web articles, technical documentation

You may source documents from publicly available technical blogs, documentation pages, or research articles. Include your source data in your submission.

Part A: Core Chunking Implementations (55 Marks)

Task 1: Baseline Chunking — 10 Marks

Implement at least two basic chunking strategies:

Fixed-size chunking (word-based or character-based)
Sentence-based chunking

Requirements:
- Clearly define the chunk size parameter and justify your choice
- Print sample output chunks for at least one document
- Discuss the limitations of each strategy with specific reference to the failure modes studied in Chapter 1 (e.g., broken context, split identifiers)

Deliverables: Working code with sample outputs; 200–300 word analysis of limitations.
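
As a starting point for Task 1, the two baseline strategies might be sketched as follows. This is a minimal illustration, not a reference solution: the function names, the default parameter values, and the regex-based sentence splitter are all assumptions made for the example, and a library such as NLTK or spaCy would likely segment sentences more reliably.

```python
import re

def fixed_size_chunks(text, chunk_size=50):
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def sentence_chunks(text, sentences_per_chunk=3):
    """Group consecutive sentences into chunks of `sentences_per_chunk`."""
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

# Hypothetical sample text, used only to make the splitting behaviour visible.
sample = ("Order ID ABC-12345 was delayed. The customer called twice. "
          "Support issued a refund. The case was closed on Friday.")

for i, chunk in enumerate(fixed_size_chunks(sample, chunk_size=4)):
    print("fixed", i, repr(chunk))
for i, chunk in enumerate(sentence_chunks(sample, sentences_per_chunk=2)):
    print("sentence", i, repr(chunk))
```

Printing both outputs side by side makes the main limitation easy to discuss: the fixed-size version cuts through sentences wherever the word count runs out, while the sentence-based version keeps sentences whole but produces chunks of uneven length.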

Task 2: Token-Aware Chunking — 15 Marks

Implement token-based chunking using a tokenizer (e.g., tiktoken, transformers tokenizer).

Requirements:
- Define chunk_size and overlap parameters
- Display the token count for each generated chunk
- Verify that no chunk exceeds the defined token limit

Analysis (required):
- Compare token-aware chunking with fixed-size chunking on the same document
- Discuss token efficiency: do token-aware chunks better preserve semantic boundaries?
- Write 300–400 words of comparative analysis
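
The chunk_size and overlap mechanics for Task 2 could look roughly like the sketch below. To keep the example runnable without downloads, it uses a stand-in regex tokenizer (`simple_encode` and `simple_decode` are hypothetical helpers, not part of any library); in a real submission the `encode` and `decode` methods of a tiktoken encoding or a transformers tokenizer would be passed in instead, and the token counts would differ.

```python
import re

def simple_encode(text):
    """Stand-in tokenizer (assumption): words and punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def simple_decode(tokens):
    """Stand-in decoder: join tokens with spaces (a real tokenizer restores exact text)."""
    return " ".join(tokens)

def token_chunks(text, chunk_size=20, overlap=5,
                 encode=simple_encode, decode=simple_decode):
    """Return (chunk_text, token_count) pairs; consecutive chunks share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    tokens = encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append((decode(window), len(window)))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

text = "one two three four five six seven eight nine ten"
chunks = token_chunks(text, chunk_size=4, overlap=1)
for chunk_text, count in chunks:
    print(count, repr(chunk_text))

# Verification step required by the task: no chunk may exceed the limit.
assert all(count <= 4 for _, count in chunks)
```

The token count is taken from the window itself rather than by re-encoding the decoded text, because with subword tokenizers a decode and re-encode round trip is not guaranteed to give the same count.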

Task 3: Sliding Window Chunking — 10 Marks

Implement chunking with configurable overlap.

Requirements:
- Define window_size and stride parameters
- Demonstrate how overlap preserves contextual continuity between adjacent chunks

Analysis (required):
- When is overlap useful? When does it introduce harmful redundancy?
- Discuss the trade-off between redundancy and context preservation (200–300 words)
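
One way to make the redundancy trade-off in Task 3 measurable is sketched below. The overlap between adjacent chunks is window_size minus stride, so a smaller stride means more shared context and more repeated text. The `redundancy_ratio` helper is a hypothetical metric introduced for this example, not one defined by the course.

```python
def sliding_window_chunks(words, window_size=6, stride=4):
    """Slide a window of `window_size` words forward by `stride` words at a time."""
    if not 0 < stride <= window_size:
        raise ValueError("stride must satisfy 0 < stride <= window_size")
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # the window has reached the end of the text
    return chunks

def redundancy_ratio(chunks, total_words):
    """Fraction of emitted words that are repeats (0.0 means no overlap at all)."""
    emitted = sum(len(chunk) for chunk in chunks)
    return (emitted - total_words) / emitted if emitted else 0.0

words = list("abcdefghij")  # ten single-letter "words" keep the overlap easy to see
for stride in (4, 2, 1):
    chunks = sliding_window_chunks(words, window_size=4, stride=stride)
    print(stride, ["".join(c) for c in chunks], round(redundancy_ratio(chunks, len(words)), 3))
```

Running the loop with several stride values gives concrete numbers to cite when arguing where overlap stops helping continuity and starts inflating the index.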

Task 4: Semantic Chunking — 20 Marks

Use embedding-based similarity to group semantically related sentences into chunks.

Requirements:
- Use an embedding model (e.g., sentence-transformers, OpenAI embeddings)
- Compute the cosine similarity between each pair of consecutive sentences
- Split chunks at points where similarity drops below a defined threshold

Experimentation (required):
- Test at least 3 different threshold values
- Compare the resulting chunk boundaries across thresholds
- Explain how semantic chunking preserves meaning that fixed-size methods destroy

Deliverables: Working code, comparison table/chart of threshold experiments, 400–500 word analysis.
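
The threshold logic in Task 4 can be illustrated independently of any particular embedding model. The sketch below uses a bag-of-words counter as a stand-in embedding (`toy_embed` is an assumption made so the example runs offline; it captures word overlap, not meaning). With a real model such as all-MiniLM-L6-v2, the `embed` function would return dense vectors and `cosine` would need to operate on arrays instead of counters.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Stand-in embedding (assumption): lowercase bag-of-words counts, not a semantic model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def semantic_chunks(sentences, threshold=0.2, embed=toy_embed):
    """Start a new chunk wherever consecutive-sentence similarity falls below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, curr) < threshold:
            chunks.append([sentence])   # similarity dropped: open a new chunk
        else:
            chunks[-1].append(sentence)
    return chunks

# Hypothetical sentences covering two unrelated topics.
sentences = [
    "Vector search ranks chunks by embedding similarity.",
    "Embedding similarity can miss exact identifiers in chunks.",
    "The cafeteria opens at nine.",
    "The cafeteria menu changes weekly.",
]
for threshold in (0.0, 0.3, 0.6):
    sizes = [len(chunk) for chunk in semantic_chunks(sentences, threshold)]
    print(threshold, sizes)
```

Recording the chunk sizes per threshold in this way gives the raw material for the required comparison table: a low threshold merges everything, a high one isolates every sentence, and the interesting analysis lies in between.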

Part B: Advanced Pipeline Design (35 Marks)

Task 5: Structure-Aware Chunking — 15 Marks

Handle structured documents by respecting document hierarchy.

Markdown documents:
- Split based on header levels (##, ###, etc.)
- Preserve section-level grouping

HTML documents:
- Extract sections using heading tags (h1, h2, h3, etc.)
- Maintain the logical structure of the original document

Requirements:
- Demonstrate that chunks do not mix content from unrelated sections
- Show at least 2 examples of structure-aware chunks from each format
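
A rough sketch of header-based splitting for both formats follows, using only the Python standard library. The helper names (`markdown_sections`, `HeadingSplitter`, `html_sections`) are hypothetical. The sketch makes simplifying assumptions worth noting in your analysis: the Markdown splitter treats any line starting with `#` characters as a header, even inside a fenced code block, and the HTML splitter does not filter out script or style content. BeautifulSoup, listed among the permitted resources, would be a more robust choice for real HTML.

```python
import re
from html.parser import HTMLParser

def markdown_sections(md_text):
    """Return (heading, body) pairs, opening a new section at every #-style header."""
    sections, heading, body = [], None, []

    def flush():
        # Keep a section if it has a heading or any non-blank body text.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading, "\n".join(body).strip()))

    for line in md_text.splitlines():
        if re.match(r"#{1,6}\s+\S", line):
            flush()
            heading, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections

class HeadingSplitter(HTMLParser):
    """Collect text into sections that begin at each h1-h6 tag."""
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []        # each entry: [heading_parts, body_parts]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append([[], []])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.sections:
            self.sections.append([[], []])  # text before the first heading
        self.sections[-1][0 if self._in_heading else 1].append(text)

def html_sections(html_text):
    """Return (heading, body) pairs; heading is None for text before the first heading."""
    parser = HeadingSplitter()
    parser.feed(html_text)
    parser.close()
    return [(" ".join(h) or None, " ".join(b)) for h, b in parser.sections]

print(markdown_sections("# Intro\nHello.\n\n## Setup\nInstall it.\nRun it.\n"))
print(html_sections("<h1>Intro</h1><p>Hello.</p><h2>Setup</h2><p>Install it.</p>"))
```

Because every chunk carries its heading, it is straightforward to demonstrate that no chunk mixes content from two sections: each body is collected strictly between one heading and the next.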

Task 6: Hybrid Chunking Pipeline — 20 Marks

Design and implement a combined pipeline that integrates:

Structure-aware splitting (as a first pass)
Semantic grouping (to refine boundaries)
Token normalization (to enforce limits)
Optional overlap (for context preservation)

Requirements:
- Clearly define the pipeline stages with a visual diagram or pseudocode
- Show the final chunk outputs for all 3 document types
- Verify that all final chunks are:
  - Coherent (do not mix unrelated content)
  - Within token limits
  - Context-preserving (important information is not split across chunks)

Deliverables: Pipeline diagram, working code, final chunk samples, 400–500 word justification of design decisions.
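
To show how the four stages might fit together, here is a deliberately small sketch for Markdown input. Everything in it is an assumption made for illustration: the function names, the whitespace-based token count, and the Jaccard word overlap used as a stand-in for embedding similarity. The one design choice worth noting is that the overlap is subtracted from the packing budget, so adding the overlap tail never pushes a chunk past max_tokens.

```python
import re

def split_by_headers(md_text):
    """Stage 1: one section per Markdown header (text before the first header is kept)."""
    parts = re.split(r"(?m)^(?=#{1,6}\s)", md_text)
    return [p.strip() for p in parts if p.strip()]

def jaccard(a, b):
    """Stand-in similarity (assumption): word-set overlap instead of embedding cosine."""
    wa = set(re.findall(r"[a-z0-9]+", a.lower()))
    wb = set(re.findall(r"[a-z0-9]+", b.lower()))
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def group_sentences(sentences, threshold):
    """Stage 2: open a new group where similarity to the previous sentence drops."""
    groups = []
    for sentence in sentences:
        if groups and jaccard(groups[-1][-1], sentence) >= threshold:
            groups[-1].append(sentence)
        else:
            groups.append([sentence])
    return groups

def enforce_limit(group, budget):
    """Stage 3: pack sentences into token lists of at most `budget` whitespace tokens."""
    chunks, current = [], []
    for sentence in group:
        words = sentence.split()
        # An over-long sentence is hard-split into budget-sized pieces.
        for i in range(0, len(words), budget):
            piece = words[i:i + budget]
            if current and len(current) + len(piece) > budget:
                chunks.append(current)
                current = []
            current = current + piece
    if current:
        chunks.append(current)
    return chunks

def hybrid_chunks(md_text, max_tokens=40, threshold=0.1, overlap=0):
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")
    results = []
    for section in split_by_headers(md_text):
        first_line, _, rest = section.partition("\n")
        has_heading = first_line.startswith("#")
        heading = first_line.lstrip("#").strip() if has_heading else None
        body = rest if has_heading else section
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
        previous = []  # reset per section so overlap never crosses a header
        for group in group_sentences(sentences, threshold):
            for tokens in enforce_limit(group, max_tokens - overlap):
                # Stage 4 (optional): prepend the tail of the previous chunk.
                tail = previous[-overlap:] if overlap else []
                results.append({"heading": heading,
                                "text": " ".join(tail + tokens),
                                "tokens": len(tail) + len(tokens)})
                previous = tokens
    return results

doc = "# A\nCats purr. Cats sleep.\n## B\nRust compiles."
for chunk in hybrid_chunks(doc, max_tokens=4, threshold=0.1, overlap=1):
    print(chunk)
```

The heading is stored as metadata rather than counted against the token limit; whether to prepend it to the chunk text before embedding is one of the design decisions your justification could address.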

Part C: Evaluation & Insights (10 Marks)

Task 7: Evaluation & Reflection — 10 Marks

Provide a comprehensive evaluation of your chunking strategies.

You must answer the following questions:

Which chunking method produced the best chunks, and by what criteria did you assess “best”?
What trade-offs did you observe between different strategies?
How does chunking quality affect retrieval quality in a RAG system? (Connect to concepts from Chapters 1–2 of this course.)
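
To support the answers above with numbers, a small summary helper along the following lines may be useful. It is only a sketch: `chunk_report` is a hypothetical function, tokens are approximated by whitespace-separated words, and "clean ending" (a chunk that ends on sentence-final punctuation) is just one simple proxy for boundary coherence.

```python
import statistics

def chunk_report(chunks, max_tokens):
    """Summarise token distribution and a simple boundary check for a list of chunk strings."""
    if not chunks:
        return {"chunks": 0}
    counts = [len(chunk.split()) for chunk in chunks]  # whitespace tokens (assumption)
    clean = sum(chunk.rstrip().endswith((".", "!", "?")) for chunk in chunks)
    return {
        "chunks": len(counts),
        "min": min(counts),
        "max": max(counts),
        "mean": round(statistics.mean(counts), 2),
        "stdev": round(statistics.pstdev(counts), 2),
        "over_limit": sum(n > max_tokens for n in counts),
        "clean_endings": round(clean / len(chunks), 2),
    }

# Hypothetical outputs from two strategies on the same text.
strategies = {
    "fixed": ["The index was rebuilt", "overnight. Queries were faster", "afterwards."],
    "sentence": ["The index was rebuilt overnight.", "Queries were faster afterwards."],
}
for name, chunks in strategies.items():
    print(name, chunk_report(chunks, max_tokens=5))
```

Running the same report over every strategy's output yields a comparable row per strategy, which makes the criteria behind "best" explicit rather than impressionistic.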

Optional Bonus (up to +10 Marks):
- Build a mini RAG demo using your chunks with a vector store (e.g., ChromaDB, FAISS)
- Compare retrieval quality across chunking strategies using a set of test queries
- Visualize similarity scores or chunk quality distributions

Difficulty & Scope

Expectations by Performance Level

Level	Description
Basic (40–59%)	Implements Tasks 1–3 correctly with minimal analysis. Code runs but lacks clear documentation. Report is descriptive rather than analytical. Chunking parameters are chosen without justification.
Proficient (60–79%)	Implements Tasks 1–6 with clear code, reasonable analysis, and justified parameter choices. The hybrid pipeline works but may lack refinement. Report demonstrates understanding of trade-offs and connects to RAG retrieval concepts.
Advanced (80–100%)	All 7 tasks completed with depth. Semantic chunking experiments are thorough. Hybrid pipeline is well-designed with clear engineering rationale. Report demonstrates critical thinking, connects chunking to retrieval quality (Chapters 1–2), and identifies non-obvious insights. Bonus tasks attempted with meaningful results.

Marking Rubric

Criteria	Marks	Description
Task 1: Baseline Chunking	10	Correct implementation of fixed-size and sentence-based chunking; clear sample outputs; meaningful limitation analysis
Task 2: Token-Aware Chunking	15	Working tokenizer integration; comparative analysis with fixed-size; discussion of token efficiency
Task 3: Sliding Window Chunking	10	Correct overlap implementation; analysis of redundancy vs. context trade-offs
Task 4: Semantic Chunking	20	Embedding-based grouping works; minimum 3 threshold experiments; insightful boundary analysis
Task 5: Structure-Aware Chunking	15	Correct handling of Markdown and HTML structure; sections not mixed; examples shown
Task 6: Hybrid Chunking Pipeline	20	Pipeline integrates multiple strategies; output is coherent, token-compliant, and context-preserving; design is justified
Task 7: Evaluation & Insights	10	Addresses all 3 required questions; connects to course concepts; demonstrates critical thinking
Total	100
Bonus (optional)	+10	RAG demo, retrieval comparison, or visualization

Formatting & Structural Requirements

Code Submission

Format: Jupyter Notebook (.ipynb)
All code must be well-commented with clear sectioning for each task
Use markdown cells to separate tasks and provide brief explanations
Code must be executable without modification (include any required pip install commands)

Report

Length: 3–5 pages (excluding references and appendices)
Format: PDF or DOCX
Font: Times New Roman, 12pt
Spacing: 1.5 line spacing
Margins: 2.54 cm (1 inch) on all sides
Headings: Use a consistent heading hierarchy (Heading 1 for task sections, Heading 2 for subsections)
Citation Style: IEEE format

Required Sections:

Introduction (brief overview of your approach)
Task-by-Task Analysis (one section per task)
Hybrid Pipeline Design (with pipeline diagram)
Evaluation & Reflection
References

Output Samples

Include representative sample chunks from each chunking strategy
Include final hybrid pipeline chunks for all 3 document types
May be embedded in the notebook or submitted as a separate document

Permitted Resources & Academic Integrity Policy

Permitted Resources

Course notebooks (Chapters 1–5) and lecture materials
Official documentation for Python libraries: tiktoken, sentence-transformers, scikit-learn, BeautifulSoup, NLTK, spaCy
Public tutorials and documentation for chunking techniques
Textbooks listed in the course syllabus

AI Tool Policy

Permitted: AI tools (e.g., ChatGPT, GitHub Copilot) may be used for:
- Debugging code errors
- Understanding library documentation
- Generating boilerplate code (e.g., file I/O, parsing)
NOT Permitted: Using AI to generate entire task solutions, analysis paragraphs, or report sections
Requirement: If you use an AI tool, you must declare it in a dedicated “AI Usage Declaration” section of your report, specifying:
- Which tool was used
- For which specific purpose
- What portion of your work was assisted

Academic Integrity Declaration

I declare that this assignment is my own original work. All sources and AI tools used have been appropriately acknowledged. I understand that plagiarism, collusion, and undeclared AI-generated content constitute academic misconduct and will be dealt with according to the institution’s academic integrity policy.

Student Name: ____________________________
Student ID: ____________________________
Date: ____________________________
Signature: ____________________________

Step-by-Step Submission Instructions (Moodle)

Log in to Moodle at [institution Moodle URL] using your student credentials.
Navigate to the course page: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers.
Click on the assignment link: Assignment 1: Chunking Pipeline.
Prepare your submission as a single ZIP file with the following structure:

LastName_FirstName_Chunking_Assignment.zip
├── notebook.ipynb
├── report.pdf (or report.docx)
└── data/ (optional: your source documents)

File naming convention: LastName_FirstName_Chunking_Assignment.zip
Upload your ZIP file using the file submission area.
Click “Submit Assignment” — do NOT just save as draft.
Verify your submission by checking the confirmation email from Moodle.

Accepted File Formats

Notebook: .ipynb
Report: .pdf (preferred) or .docx
Archive: .zip

Deadline

Submit within 7 days from assignment release.

Late Submission Policy

Submission Window	Penalty
Up to 24 hours late	10% deduction
24–48 hours late	20% deduction
Beyond 48 hours	Not accepted

Support & Communication Guidelines

Office Hours: [Day and Time — e.g., Tuesdays 2:00–4:00 PM], [Location / Online Link]
Discussion Forum: Use the Moodle discussion forum for all general questions. Post in the Assignment 1 Q&A thread.
Email: For confidential matters only, email [instructor@institution.edu] with the subject line: [Hybrid Search] Assignment 1 — [Your Name]
Response Time: Expect a response within 48 hours on working days. Questions posted on weekends will be addressed on the next working day.
Peer Collaboration: You may discuss high-level strategies and concepts with classmates, but all code and written analysis must be your own. Sharing code, output, or report text is not permitted.

Frequently Asked Questions (FAQ)

Q1: Can I use documents shorter or longer than 500–1500 words?

A: The 500–1500 word range is a guideline for each document. Slightly shorter or longer is acceptable, but ensure your documents are substantial enough to demonstrate meaningful chunking behaviour.

Q2: Do I need an API key (e.g., OpenAI) for semantic chunking?

A: No. You can use free, locally-running models such as sentence-transformers (e.g., all-MiniLM-L6-v2) for embeddings. No paid API access is required.

Q3: What if my hybrid pipeline doesn’t always produce perfect chunks?

A: That is expected. The evaluation rewards your analysis of why certain chunks are imperfect more than it rewards perfect output. Document the failures and explain what would improve them.

Q4: How much analysis is “enough” for each task?

A: Follow the word counts specified in each task description. Prioritize depth over breadth — a focused analysis of one interesting failure case is more valuable than a superficial summary of everything.

Q5: Can I use LangChain or similar frameworks for chunking?

A: You may use framework utilities for helper functions (e.g., text splitting), but the core logic of each strategy must be implemented by you. Do not submit a one-line call to a framework’s chunking function as your implementation.

Q6: What is the difference between the notebook and the report?

A: The notebook contains your code, outputs, and brief inline explanations. The report is a formal document that presents your approach, comparisons, pipeline design, and reflections in a structured narrative format. They are complementary, not redundant.

Q7: Is the bonus section worth attempting?

A: If you have completed all 7 tasks with quality, the bonus can significantly improve your grade. However, prioritize completing the core tasks well before attempting the bonus.

Q8: Can I use a language other than Python?

A: Python is the expected and recommended language. If you use another language, you must justify it and ensure the grader can run your code without special setup.
