Designing a Production-Ready Chunking Pipeline for Retrieval-Augmented Generation
- Mar 25

ASSIGNMENT REQUIREMENT DOCUMENT
Course Name: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers
Institution: [Institution Name]
Semester: [Semester / Term — e.g., Spring 2026]
Instructor: [Instructor Name]
Student Level: Postgraduate / Senior Undergraduate (Year 3–4)
Submission Platform: Moodle LMS
Total Assignments: 2
Note to Students: This document contains the complete requirements for one of the two course assignments. Read the entire document carefully before beginning. The assignment is self-contained but builds upon the cumulative knowledge of the course. Deadlines, rubrics, and submission procedures are specified individually per assignment.
Purpose
This assignment requires you to design, implement, and evaluate a chunking pipeline that produces high-quality text chunks optimized for retrieval in a RAG (Retrieval-Augmented Generation) system. You will explore multiple chunking strategies—from simple fixed-size splitting to advanced semantic grouping—and critically analyze the trade-offs that arise when preparing documents for vector-based retrieval.
Connection to Course Learning Outcomes (CLOs)
This assignment directly supports the following course learning outcomes:
CLO | Description | Relevance |
CLO 1 | Identify structural failure modes of pure vector search | Understanding why chunking quality matters for retrieval precision |
CLO 2 | Implement hybrid search strategies | Chunks produced here feed into hybrid retrieval pipelines |
CLO 3 | Evaluate trade-offs between retrieval approaches | The comparative analysis of chunking strategies mirrors retrieval strategy evaluation |
CLO 4 | Design production-ready information retrieval pipelines | The hybrid chunking pipeline simulates real-world system design |
Assignment Type & Duration
Type: Individual Assignment
Duration: 7 days from release
Weighting: 50% of total coursework
Learning Objectives
Upon successful completion of this assignment, you will be able to:
Implement at least five distinct text chunking strategies (fixed-size, token-aware, sliding window, semantic, structure-aware) and articulate the assumptions underlying each approach.
Analyze the trade-offs between chunking strategies in terms of context preservation, token efficiency, redundancy, and retrieval readiness.
Design a hybrid chunking pipeline that combines multiple strategies to produce coherent, token-compliant, and context-preserving chunks.
Evaluate chunk quality using quantitative and qualitative methods, including cosine similarity, token distribution analysis, and boundary coherence checks.
Justify design decisions with evidence-based reasoning, explaining why a particular chunking approach is suited to a given document type.
Construct a clear, well-documented technical report that communicates system design and experimental findings to both technical and non-technical audiences.
Task Description
Dataset Requirements
You must work with a minimum of 3 documents in the following formats:
Format | Description | Suggested Sources |
Plain text (.txt) | Unstructured prose, 500–1500 words | Technical blog posts, research article excerpts |
Markdown (.md) | Structured with headings and sections, 500–1500 words | Documentation pages, README files |
HTML (.html) | Structured with heading tags (h1–h6), 500–1500 words | Web articles, technical documentation |
You may source documents from publicly available technical blogs, documentation pages, or research articles. Include your source data in your submission.
Part A: Core Chunking Implementations (55 Marks)
Task 1: Baseline Chunking — 10 Marks
Implement at least two basic chunking strategies:
Fixed-size chunking (word-based or character-based)
Sentence-based chunking
Requirements:
- Clearly define the chunk size parameter and justify your choice
- Print sample output chunks for at least one document
- Discuss the limitations of each strategy with specific reference to the failure modes studied in Chapter 1 (e.g., broken context, split identifiers)
Deliverables: Working code with sample outputs; 200–300 word analysis of limitations.
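As a starting point only, the two baseline strategies can be sketched with the standard library. The function names, default sizes, and the regex-based sentence splitter below are illustrative assumptions, not part of the assignment specification; NLTK or spaCy would give more robust sentence boundaries.

```python
import re


def fixed_size_chunks(text: str, chunk_size: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` whitespace-separated words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Group consecutive sentences into chunks.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    This naive rule mis-splits abbreviations such as "e.g. this".
    """
    if sentences_per_chunk <= 0:
        raise ValueError("sentences_per_chunk must be positive")
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]


sample = (
    "Chunking splits documents into pieces. Each piece is embedded separately. "
    "Poor boundaries break context. Good boundaries keep related ideas together."
)
for index, chunk in enumerate(fixed_size_chunks(sample, chunk_size=8)):
    print("fixed", index, repr(chunk))
for index, chunk in enumerate(sentence_chunks(sample, sentences_per_chunk=2)):
    print("sentence", index, repr(chunk))
```

Printing the fixed-size output next to the sentence-based output makes it easy to spot where a word-count boundary lands mid-sentence, which is the kind of limitation the analysis should discuss.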
Task 2: Token-Aware Chunking — 15 Marks
Implement token-based chunking using a tokenizer (e.g., tiktoken, transformers tokenizer).
Requirements:
- Define chunk_size and overlap parameters
- Display the token count for each generated chunk
- Verify that no chunk exceeds the defined token limit
Analysis (required):
- Compare token-aware chunking with fixed-size chunking on the same document
- Discuss token efficiency: do token-aware chunks better preserve semantic boundaries?
- Write 300–400 words of comparative analysis
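One possible shape for the token-aware chunker is sketched below. To stay runnable without extra packages it uses a whitespace tokenizer as a stand-in; this is an assumption for illustration, and in the assignment the `encode` and `decode` callables would come from a real tokenizer such as a tiktoken encoding. The function name `token_chunks` is hypothetical.

```python
from typing import Callable, Sequence


def token_chunks(
    text: str,
    chunk_size: int,
    overlap: int,
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> list[tuple[str, int]]:
    """Return (chunk_text, token_count) pairs of at most `chunk_size` tokens.

    Consecutive chunks share `overlap` tokens.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    tokens = encode(text)
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append((decode(window), len(window)))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Stand-in tokenizer (assumption): one token per whitespace-separated word.
chunks = token_chunks(
    "retrieval quality depends heavily on how documents are split into chunks",
    chunk_size=5,
    overlap=2,
    encode=str.split,
    decode=" ".join,
)
for text, count in chunks:
    print(count, repr(text))

# Verification step required by the task: no chunk exceeds the limit.
assert all(count <= 5 for _, count in chunks)
```

Passing the tokenizer in as two callables keeps the chunking logic independent of any particular library, so the same function can be compared across tokenizers in the analysis.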
Task 3: Sliding Window Chunking — 10 Marks
Implement chunking with configurable overlap.
Requirements:
- Define window_size and stride parameters
- Demonstrate how overlap preserves contextual continuity between adjacent chunks
Analysis (required):
- When is overlap useful? When does it introduce harmful redundancy?
- Discuss the trade-off between redundancy and context preservation (200–300 words)
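The relationship between window_size, stride, and redundancy can be illustrated with a small sketch. The helper names and the redundancy measure (words emitted divided by words in the source) are assumptions chosen for illustration; other definitions of redundancy are equally defensible.

```python
def sliding_window(words: list[str], window_size: int, stride: int) -> list[list[str]]:
    """Return word windows of length `window_size`, advancing by `stride` each time.

    Adjacent windows share (window_size - stride) words when stride < window_size.
    A stride larger than window_size would skip words entirely.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows = []
    for start in range(0, len(words), stride):
        windows.append(words[start:start + window_size])
        if start + window_size >= len(words):
            break  # the final window already covers the tail of the document
    return windows


def redundancy_ratio(windows: list[list[str]], total_words: int) -> float:
    """Words emitted across all windows divided by words in the source (1.0 = no overlap)."""
    emitted = sum(len(window) for window in windows)
    return emitted / total_words if total_words else 0.0


words = "a b c d e f g h".split()
windows = sliding_window(words, window_size=4, stride=2)
for previous, current in zip(windows, windows[1:]):
    shared = [word for word in previous if word in current]
    print(previous, "->", current, "shared:", shared)
print("redundancy ratio:", redundancy_ratio(windows, len(words)))
```

With a window of 4 and a stride of 2, every word except those at the document edges appears in two chunks, so the index stores roughly 1.5 times the original text in this example. That cost is the redundancy side of the trade-off the analysis asks about.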
Task 4: Semantic Chunking — 20 Marks
Use embedding-based similarity to group semantically related sentences into chunks.
Requirements:
- Use an embedding model (e.g., sentence-transformers, OpenAI embeddings)
- Compute pairwise cosine similarity between consecutive sentences
- Split chunks at points where similarity drops below a defined threshold
Experimentation (required):
- Test at least 3 different threshold values
- Compare the resulting chunk boundaries across thresholds
- Explain how semantic chunking preserves meaning that fixed-size methods destroy
Deliverables: Working code, comparison table/chart of threshold experiments, 400–500 word analysis.
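The threshold-splitting logic can be sketched independently of the embedding model. The `embed` function below is a deliberately crude stand-in (a bag-of-words count vector) so that the example runs without downloads. It is an assumption for illustration only and does not capture meaning the way a sentence-transformers model would. The boundary rule itself follows the requirements above: start a new chunk whenever the similarity between consecutive sentences falls below the threshold.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Stand-in embedding (assumption): lowercase word counts, not a real model."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float) -> list[list[str]]:
    """Start a new chunk wherever consecutive-sentence similarity drops below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(sentence) for sentence in sentences]
    chunks = [[sentences[0]]]
    for previous, current, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(previous, current) < threshold:
            chunks.append([sentence])
        else:
            chunks[-1].append(sentence)
    return chunks


sentences = [
    "Vector search ranks chunks by embedding similarity.",
    "Embedding similarity depends on chunk boundaries.",
    "The cafeteria serves lunch at noon.",
    "Lunch at the cafeteria costs five dollars.",
]
for threshold in (0.0, 0.2, 0.5):
    chunks = semantic_chunks(sentences, threshold)
    print(threshold, [len(chunk) for chunk in chunks])
```

Running the same sentences through several thresholds, as the loop does, gives the raw material for the required comparison table: a low threshold merges everything, while a high one fragments even closely related sentences.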
Part B: Advanced Pipeline Design (35 Marks)
Task 5: Structure-Aware Chunking — 15 Marks
Handle structured documents by respecting document hierarchy.
Markdown documents:
- Split based on header levels (##, ###, etc.)
- Preserve section-level grouping
HTML documents:
- Extract sections using heading tags (h1, h2, h3, etc.)
- Maintain the logical structure of the original document
Requirements:
- Demonstrate that chunks do not mix content from unrelated sections
- Show at least 2 examples of structure-aware chunks from each format
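A minimal sketch of heading-based splitting for both formats follows, using only the standard library (`re` and `html.parser`). The function and class names are hypothetical, and the sketch makes simplifying assumptions: Markdown headings are ATX-style lines starting with `#`, lines starting with `#` inside fenced code are not handled, and all HTML text (including any script or style content) is treated as body text. BeautifulSoup, which is on the permitted list, would be a reasonable alternative for messier HTML.

```python
import re
from html.parser import HTMLParser


def markdown_sections(markdown: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs, starting a new section at every '#' heading line."""
    sections, title, body = [], None, []

    def flush():
        text = "\n".join(body).strip()
        if title is not None or text:
            sections.append((title or "", text))

    for line in markdown.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)$", line)
        if match:
            flush()
            title, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return sections


class HeadingSplitter(HTMLParser):
    """Collect text into sections, opening a new section at each h1-h6 tag."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.sections = []  # each entry is [heading_text, list_of_body_fragments]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.sections.append(["", []])
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._in_heading = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if not self.sections:  # text that appears before the first heading
            self.sections.append(["", []])
        if self._in_heading:
            self.sections[-1][0] += text
        else:
            self.sections[-1][1].append(text)


def html_sections(html: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs for an HTML document."""
    parser = HeadingSplitter()
    parser.feed(html)
    parser.close()
    return [(title, " ".join(parts)) for title, parts in parser.sections]


print(markdown_sections("# Intro\nHello.\n\n## Setup\nInstall it.\n"))
print(html_sections("<h1>Intro</h1><p>Hello.</p><h2>Setup</h2><p>Install it.</p>"))
```

Because each section is collected separately, content from different headings never ends up in the same chunk. This sketch flattens the hierarchy, though: it does not record which h2 sits under which h1, which a fuller solution would need in order to preserve section-level grouping.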
Task 6: Hybrid Chunking Pipeline — 20 Marks
Design and implement a combined pipeline that integrates:
Structure-aware splitting (as a first pass)
Semantic grouping (to refine boundaries)
Token normalization (to enforce limits)
Optional overlap (for context preservation)
Requirements:
- Clearly define the pipeline stages with a visual diagram or pseudocode
- Show the final chunk outputs for all 3 document types
- Verify that all final chunks are:
  - Coherent (do not mix unrelated content)
  - Within token limits
  - Context-preserving (important information is not split across chunks)
Deliverables: Pipeline diagram, working code, final chunk samples, 400–500 word justification of design decisions.
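The four stages listed above could be wired together roughly as follows. This is a sketch under stated assumptions, not a reference solution: it handles Markdown only, counts tokens by whitespace, and uses a bag-of-words similarity in place of a real embedding model. All function names, the default threshold, and the token limit are hypothetical. Overlap is taken from the previous chunk in the same section, and the packing budget is reduced by the overlap so that the final chunks stay within the limit.

```python
import math
import re
from collections import Counter


def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in for a real tokenizer (assumption)


def split_sections(markdown: str) -> list[tuple[str, str]]:
    """Stage 1: structure-aware split on Markdown heading lines."""
    sections, title, body = [], "", []
    for line in markdown.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            if body:
                sections.append((title, " ".join(body)))
            title, body = match.group(1).strip(), []
        elif line.strip():
            body.append(line.strip())
    if body:
        sections.append((title, " ".join(body)))
    return sections


def similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (stand-in for embedding similarity)."""
    va, vb = (Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in (a, b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def semantic_groups(text: str, threshold: float) -> list[list[str]]:
    """Stage 2: group consecutive sentences until similarity drops below threshold."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    groups = [[sentences[0]]] if sentences else []
    for previous, current in zip(sentences, sentences[1:]):
        if similarity(previous, current) < threshold:
            groups.append([current])
        else:
            groups[-1].append(current)
    return groups


def normalize(group: list[str], budget: int) -> list[str]:
    """Stage 3: pack sentences greedily into chunks of at most `budget` tokens."""
    pieces = []
    for sentence in group:  # hard-split any single sentence that exceeds the budget
        words = sentence.split()
        pieces.extend(" ".join(words[i:i + budget]) for i in range(0, len(words), budget))
    chunks, current = [], []
    for piece in pieces:
        if current and count_tokens(" ".join(current + [piece])) > budget:
            chunks.append(" ".join(current))
            current = []
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


def hybrid_chunks(markdown: str, threshold: float = 0.1, max_tokens: int = 40, overlap: int = 0):
    budget = max_tokens - overlap
    if budget <= 0:
        raise ValueError("overlap must be smaller than max_tokens")
    results = []
    for title, body in split_sections(markdown):
        section_chunks = []
        for group in semantic_groups(body, threshold):
            section_chunks.extend(normalize(group, budget))
        for index, chunk in enumerate(section_chunks):
            if overlap and index > 0:  # Stage 4: optional overlap within a section
                tail = section_chunks[index - 1].split()[-overlap:]
                chunk = " ".join(tail) + " " + chunk
            results.append({"section": title, "text": chunk, "tokens": count_tokens(chunk)})
    return results


doc = """# Intro
Chunking prepares text for retrieval. Retrieval quality depends on chunking.

## Limits
Models accept a fixed number of tokens. Long chunks waste the token budget.
"""
for chunk in hybrid_chunks(doc, threshold=0.1, max_tokens=12, overlap=2):
    print(chunk)
```

Running the structural split first means later stages can never merge text across headings, and enforcing the token budget last makes the limit a property of the output rather than a hope. The sketch does not address whether an "important" fact is split across chunks; that check needs the qualitative inspection the task asks for.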
Part C: Evaluation & Insights (10 Marks)
Task 7: Evaluation & Reflection — 10 Marks
Provide a comprehensive evaluation of your chunking strategies.
You must answer the following questions:
Which chunking method produced the best chunks, and by what criteria did you assess “best”?
What trade-offs did you observe between different strategies?
How does chunking quality affect retrieval quality in a RAG system? (Connect to concepts from Chapters 1–2 of this course.)
Optional Bonus (up to +10 Marks):
- Build a mini RAG demo using your chunks with a vector store (e.g., ChromaDB, FAISS)
- Compare retrieval quality across chunking strategies using a set of test queries
- Visualize similarity scores or chunk quality distributions
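The retrieval comparison can be prototyped before any vector store is involved. In the sketch below, an in-memory list with bag-of-words cosine similarity stands in for ChromaDB or FAISS; this substitution, the helper names, the toy chunks, and the hit-rate metric are all assumptions for illustration. A query counts as a hit when an expected phrase appears in one of the top-k retrieved chunks.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 1) -> list[tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    query_vector = embed(query)
    scored = sorted(((cosine(query_vector, embed(c)), c) for c in chunks), reverse=True)
    return scored[:k]


def hit_rate(chunks: list[str], labelled_queries: list[tuple[str, str]], k: int = 1) -> float:
    """Fraction of queries whose expected phrase appears in a top-k chunk."""
    hits = 0
    for query, expected in labelled_queries:
        if any(expected in chunk for _, chunk in top_k(query, chunks, k)):
            hits += 1
    return hits / len(labelled_queries)


# Toy data: the same text chunked at a sentence boundary versus mid-sentence.
sentence_aligned = [
    "Overlap repeats tokens between neighbours.",
    "Semantic chunking splits where similarity drops.",
]
mid_sentence = [
    "Overlap repeats tokens between neighbours. Semantic chunking",
    "splits where similarity drops.",
]
queries = [
    ("where does semantic chunking split", "similarity drops"),
    ("what does overlap repeat", "repeats tokens"),
]
print("sentence-aligned hit rate:", hit_rate(sentence_aligned, queries))
print("mid-sentence hit rate:", hit_rate(mid_sentence, queries))
```

In this toy case the mid-sentence split separates the query's keywords from the answer phrase, so the top-ranked chunk no longer contains the answer. With a real embedding model and real documents the numbers will differ, and that difference is what the bonus task asks you to measure.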
Difficulty & Scope
Expectations by Performance Level
Level | Description |
Basic (40–59%) | Implements Tasks 1–3 correctly with minimal analysis. Code runs but lacks clear documentation. Report is descriptive rather than analytical. Chunking parameters are chosen without justification. |
Proficient (60–79%) | Implements Tasks 1–6 with clear code, reasonable analysis, and justified parameter choices. The hybrid pipeline works but may lack refinement. Report demonstrates understanding of trade-offs and connects to RAG retrieval concepts. |
Advanced (80–100%) | All 7 tasks completed with depth. Semantic chunking experiments are thorough. Hybrid pipeline is well-designed with clear engineering rationale. Report demonstrates critical thinking, connects chunking to retrieval quality (Chapters 1–2), and identifies non-obvious insights. Bonus tasks attempted with meaningful results. |
Marking Rubric
Criteria | Marks | Description |
Task 1: Baseline Chunking | 10 | Correct implementation of fixed-size and sentence-based chunking; clear sample outputs; meaningful limitation analysis |
Task 2: Token-Aware Chunking | 15 | Working tokenizer integration; comparative analysis with fixed-size; discussion of token efficiency |
Task 3: Sliding Window Chunking | 10 | Correct overlap implementation; analysis of redundancy vs. context trade-offs |
Task 4: Semantic Chunking | 20 | Embedding-based grouping works; minimum 3 threshold experiments; insightful boundary analysis |
Task 5: Structure-Aware Chunking | 15 | Correct handling of Markdown and HTML structure; sections not mixed; examples shown |
Task 6: Hybrid Chunking Pipeline | 20 | Pipeline integrates multiple strategies; output is coherent, token-compliant, and context-preserving; design is justified |
Task 7: Evaluation & Insights | 10 | Addresses all 3 required questions; connects to course concepts; demonstrates critical thinking |
Total | 100 | |
Bonus (optional) | +10 | RAG demo, retrieval comparison, or visualization |
Formatting & Structural Requirements
Code Submission
Format: Jupyter Notebook (.ipynb)
All code must be well-commented with clear sectioning for each task
Use markdown cells to separate tasks and provide brief explanations
Code must be executable without modification (include any required pip install commands)
Report
Length: 3–5 pages (excluding references and appendices)
Format: PDF or DOCX
Font: Times New Roman, 12pt
Spacing: 1.5 line spacing
Margins: 2.54 cm (1 inch) on all sides
Headings: Use a consistent heading hierarchy (Heading 1 for task sections, Heading 2 for subsections)
Citation Style: IEEE format
Required Sections:
Introduction (brief overview of your approach)
Task-by-Task Analysis (one section per task)
Hybrid Pipeline Design (with pipeline diagram)
Evaluation & Reflection
References
Output Samples
Include representative sample chunks from each chunking strategy
Include final hybrid pipeline chunks for all 3 document types
May be embedded in the notebook or submitted as a separate document
Permitted Resources & Academic Integrity Policy
Permitted Resources
Course notebooks (Chapters 1–5) and lecture materials
Official documentation for Python libraries: tiktoken, sentence-transformers, scikit-learn, BeautifulSoup, NLTK, spaCy
Public tutorials and documentation for chunking techniques
Textbooks listed in the course syllabus
AI Tool Policy
Permitted: AI tools (e.g., ChatGPT, GitHub Copilot) may be used for:
Debugging code errors
Understanding library documentation
Generating boilerplate code (e.g., file I/O, parsing)
NOT Permitted: Using AI to generate entire task solutions, analysis paragraphs, or report sections
Requirement: If you use an AI tool, you must declare it in a dedicated “AI Usage Declaration” section of your report, specifying:
Which tool was used
For which specific purpose
What portion of your work was assisted
Academic Integrity Declaration
I declare that this assignment is my own original work. All sources and AI tools used have been appropriately acknowledged. I understand that plagiarism, collusion, and undeclared AI-generated content constitute academic misconduct and will be dealt with according to the institution’s academic integrity policy.
Student Name: ____________________________
Student ID: ____________________________
Date: ____________________________
Signature: ____________________________
Step-by-Step Submission Instructions (Moodle)
Log in to Moodle at [institution Moodle URL] using your student credentials.
Navigate to the course page: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers.
Click on the assignment link: Assignment 1: Chunking Pipeline.
Prepare your submission as a single ZIP file with the following structure:
<YourName>_Chunking_Assignment.zip
├── notebook.ipynb
├── report.pdf (or report.docx)
└── data/ (optional: your source documents)
File naming convention: LastName_FirstName_Chunking_Assignment.zip
Upload your ZIP file using the file submission area.
Click “Submit Assignment” — do NOT just save as draft.
Verify your submission by checking the confirmation email from Moodle.
Accepted File Formats
Notebook: .ipynb
Report: .pdf (preferred) or .docx
Archive: .zip
Deadline
Submit within 7 days from assignment release.
Late Submission Policy
Submission Window | Penalty |
Up to 24 hours late | 10% deduction |
24–48 hours late | 20% deduction |
Beyond 48 hours | Not accepted |
Support & Communication Guidelines
Office Hours: [Day and Time — e.g., Tuesdays 2:00–4:00 PM], [Location / Online Link]
Discussion Forum: Use the Moodle discussion forum for all general questions. Post in the Assignment 1 Q&A thread.
Email: For confidential matters only, email [instructor@institution.edu] with the subject line: [Hybrid Search] Assignment 1 — [Your Name]
Response Time: Expect a response within 48 hours on working days. Questions posted on weekends will be addressed on the next working day.
Peer Collaboration: You may discuss high-level strategies and concepts with classmates, but all code and written analysis must be your own. Sharing code, output, or report text is not permitted.
Frequently Asked Questions (FAQ)
Q1: Can I use documents shorter or longer than 500–1500 words?
A: The 500–1500 word range is a guideline for each document. Slightly shorter or longer is acceptable, but ensure your documents are substantial enough to demonstrate meaningful chunking behaviour.
Q2: Do I need an API key (e.g., OpenAI) for semantic chunking?
A: No. You can use free, locally-running models such as sentence-transformers (e.g., all-MiniLM-L6-v2) for embeddings. No paid API access is required.
Q3: What if my hybrid pipeline doesn’t always produce perfect chunks?
A: That is expected and encouraged. The evaluation values your analysis of why certain chunks are imperfect more than achieving perfection. Document the failures and explain what would improve them.
Q4: How much analysis is “enough” for each task?
A: Follow the word counts specified in each task description. Prioritize depth over breadth — a focused analysis of one interesting failure case is more valuable than a superficial summary of everything.
Q5: Can I use LangChain or similar frameworks for chunking?
A: You may use framework utilities for helper functions (e.g., text splitting), but the core logic of each strategy must be implemented by you. Do not submit a one-line call to a framework’s chunking function as your implementation.
Q6: What is the difference between the notebook and the report?
A: The notebook contains your code, outputs, and brief inline explanations. The report is a formal document that presents your approach, comparisons, pipeline design, and reflections in a structured narrative format. They are complementary, not redundant.
Q7: Is the bonus section worth attempting?
A: If you have completed all 7 tasks with quality, the bonus can significantly improve your grade. However, prioritize completing the core tasks well before attempting the bonus.
Q8: Can I use a language other than Python?
A: Python is the expected and recommended language. If you use another language, you must justify it and ensure the grader can run your code without special setup.



