Designing an Adaptive Chunking Engine for Real-World RAG Systems

Purpose
In this assignment, you move beyond isolated chunking techniques to design a complete, adaptive chunking system that intelligently detects document types and selects or combines chunking strategies accordingly. This simulates how chunking is actually deployed in production RAG systems — not as a fixed function, but as a design decision that adapts to input characteristics.
Connection to Course Learning Outcomes (CLOs)
| CLO | Description | Relevance |
| --- | --- | --- |
| CLO 1 | Identify structural failure modes of pure vector search | Understanding document-type-specific failures informs strategy selection |
| CLO 2 | Implement hybrid search strategies | The adaptive engine produces chunks tailored for hybrid retrieval |
| CLO 3 | Evaluate trade-offs in retrieval systems | Comparative experiments across document types mirror real evaluation workflows |
| CLO 4 | Design production-ready pipelines | The modular, extensible engine architecture reflects engineering best practices |
| CLO 5 | Debug and evaluate RAG pipelines | The evaluation framework directly applies Chapter 5 concepts |
Assignment Type & Duration
Type: Individual Assignment
Duration: 10 days from release
Weighting: 50% of total coursework
Learning Objectives
Upon successful completion of this assignment, you will be able to:
Design a modular, extensible chunking engine with clearly separated concerns (strategy implementation, document detection, strategy selection, evaluation).
Construct a document-type detection system that classifies input text based on structural heuristics.
Evaluate chunking strategies systematically using quantitative metrics (chunk size consistency, redundancy, token distribution) and qualitative assessment (context preservation, semantic coherence).
Synthesize multiple chunking approaches into hybrid strategies with clearly justified layering decisions.
Compare system performance across diverse document types and articulate why different documents require different treatment.
Communicate engineering design decisions through a structured technical report that connects implementation choices to system-level goals.
Task Description
System Architecture
You will build a modular chunking engine with the following file structure:
chunking_engine/
├── strategies.py # Core chunking strategy implementations
├── detector.py # Document type detection logic
├── controller.py # Strategy selection and orchestration
├── evaluation.py # Chunk quality evaluation framework
└── main.py # Entry point and demonstration
Part A: Core Engine Components (55 Marks)
Task 1: Core Chunking Strategy Implementations — 20 Marks
Implement the following functions in strategies.py:
fixed_size_chunk(text, chunk_size)
chunk_with_overlap(text, size, overlap)
sentence_chunker(sentences, max_words)
token_chunk(text, chunk_size, overlap)
semantic_chunk(sentences, embeddings, threshold)
Requirements:
Each function must be modular and reusable — callable independently
Add descriptive docstrings explaining behavior, parameters, assumptions, and return types
Include input validation and edge case handling (e.g., empty text, chunk_size larger than document)
All functions must return a consistent output format
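To make the "consistent output format" requirement concrete, here is a hedged sketch of `chunk_with_overlap`. The word-based splitting and the dict keys are illustrative assumptions, not a prescribed format; any format is acceptable as long as every strategy returns it:

```python
def chunk_with_overlap(text, size, overlap):
    """Split text into word-based chunks of `size` words, repeating
    `overlap` words between consecutive chunks.

    Returns a list of dicts so every strategy shares one output format.
    (The keys here are an illustrative choice, not a requirement.)
    """
    if not text or not text.strip():
        return []  # edge case: empty input produces no chunks
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + size]
        chunks.append({
            "chunk": " ".join(piece),
            "strategy": "overlap",
            "length": len(piece),
        })
        if start + size >= len(words):
            break  # last chunk already covers the tail; avoid tiny fragments
    return chunks
```

The `overlap >= size` guard matters: without it, `step` becomes zero or negative and the loop never advances through the document.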
Task 2: Document Type Detection — 15 Marks
Create a detection function in detector.py:
def detect_document_type(text) -> str:
    """Classifies input text into one of:
    - 'markdown'
    - 'technical_documentation'
    - 'narrative'
    - 'plain_text'
    """
Detection Heuristics (suggested, not exhaustive):
- Presence of Markdown headers (#, ##) or HTML tags (<h1>, <h2>)
- Sentence density and paragraph spacing patterns
- Average sentence length
- Presence of code blocks, lists, or structured formatting
- Vocabulary technicality (e.g., frequency of specialized terms)
Requirements:
- Must handle at least 4 distinct document types
- Include a confidence score (or per-type scores) in the output so the selection engine can make informed decisions
- Demonstrate the detector on at least 3 different sample documents
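A heuristic detector can be quite small. The sketch below combines two of the suggested heuristics (header patterns and average sentence length) and returns per-type scores alongside the label. The weights, keyword list, and thresholds are illustrative assumptions — you are expected to design and justify your own:

```python
import re

def detect_document_type(text):
    # Heuristic sketch: the weights, keywords, and thresholds below are
    # illustrative assumptions, not required values.
    scores = {"markdown": 0.0, "technical_documentation": 0.0,
              "narrative": 0.0, "plain_text": 0.1}  # small plain-text prior
    if re.search(r"^#{1,6}\s", text, flags=re.MULTILINE):
        scores["markdown"] += 1.0  # Markdown headers at line start
    if re.search(r"</?h[12]>", text) or re.search(
            r"\b(API|function|parameter|returns?)\b", text):
        scores["technical_documentation"] += 0.6  # HTML tags / jargon
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    avg_words = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    if avg_words > 15 and scores["markdown"] == 0.0:
        scores["narrative"] += 0.7  # long flowing sentences, no structure
    label = max(scores, key=scores.get)
    return label, scores  # scores let the controller handle low confidence
```

Returning the full score dictionary (rather than just the winning label) is what makes the low-confidence fallback in Task 3 possible.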
Task 3: Strategy Selection Engine — 20 Marks
Create a controller function in controller.py:
def chunk_document(text) -> list:
    """Detects document type, selects the appropriate chunking
    strategy, and returns the chunked output.
    """
Strategy Mapping (reference — you may design your own logic):
| Detected Document Type | Suggested Strategy |
| --- | --- |
| Markdown | Structure-aware chunking |
| Technical documentation | Sentence + token-aware chunking |
| Narrative text | Semantic chunking |
| Plain text | Fixed-size / overlap chunking |
Requirements:
- The controller must call detect_document_type() internally
- Strategy selection must be justified — explain in your report why each mapping was chosen
- The controller must be extensible: adding a new document type should not require rewriting existing logic
- Handle edge cases: what happens when detection confidence is low?
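One common way to satisfy the extensibility requirement is a strategy registry: document types map to callables, so adding a new type means adding one dictionary entry rather than rewriting conditional logic. The registry contents, stub strategies, and confidence floor below are illustrative assumptions only:

```python
def chunk_document(text, registry, detect, confidence_floor=0.5):
    # Controller sketch: detection and strategy choice are decoupled, so
    # adding a document type means adding one registry entry, not new ifs.
    label, scores = detect(text)
    if scores.get(label, 0.0) < confidence_floor:
        label = "plain_text"  # safe fallback when detection is uncertain
    return registry[label](text)

# Hypothetical wiring with stub strategies (illustrative only):
registry = {
    "markdown": lambda t: [{"chunk": t, "strategy": "structure"}],
    "plain_text": lambda t: [{"chunk": t, "strategy": "fixed"}],
}

def toy_detect(text):
    # Stand-in detector for demonstration; your detector from Task 2
    # would return a richer score dictionary.
    if text.lstrip().startswith("#"):
        return "markdown", {"markdown": 0.9}
    return "plain_text", {"plain_text": 0.6}
```

The confidence floor answers the edge-case question directly: when no score clears the threshold, the controller falls back to the most conservative strategy instead of guessing.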
Part B: Advanced Features (35 Marks)
Task 4: Hybrid Chunking Strategies — 15 Marks
Extend your system to support layered/hybrid strategies, such as:
Structure → Sentence → Token normalization
Sentence → Semantic refinement
Fixed → Overlap → Token limit enforcement
Output Format:
Each final chunk must be a structured object:
{ "chunk": "...", "strategy": "semantic + token", "length": 78, "tokens": 120 }

Requirements:
- Implement at least 2 distinct hybrid strategies
- Show that hybrid strategies produce different (ideally better) results than single strategies
- Include metadata with each chunk to enable traceability
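As an illustration of the "Sentence → Token" layering, the sketch below splits on sentence boundaries and then merges sentences greedily up to a token budget. Whitespace word counts stand in for a real tokenizer (in practice you would use something like tiktoken), and the metadata keys follow the output format shown in the task:

```python
import re

def sentence_then_token(text, max_tokens=50):
    # Hybrid sketch: sentence-boundary split, then greedy merging under a
    # token budget. Whitespace words approximate tokens for simplicity.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))  # close chunk at the budget
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return [
        {"chunk": c, "strategy": "sentence + token",
         "length": len(c), "tokens": len(c.split())}
        for c in chunks
    ]
```

Note the ordering: sentence boundaries are respected first, so no chunk cuts mid-sentence, and the token budget only decides where one group of whole sentences ends and the next begins.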
Task 5: Evaluation Framework — 10 Marks
Design and implement an evaluation system in evaluation.py:
def evaluate_chunks(chunks) -> dict:
    """Evaluates chunk quality across multiple dimensions.
    Returns a dictionary of metrics.
    """
Evaluation Dimensions:
| Metric | Description |
| --- | --- |
| Chunk size consistency | Variance and standard deviation of chunk lengths |
| Context preservation | Whether semantic boundaries align with chunk boundaries |
| Redundancy | Amount of duplicate content across overlapping chunks |
| Semantic coherence | Average intra-chunk cosine similarity between sentences |
| Token compliance | Percentage of chunks within the defined token limit |
Requirements:
Implement quantitative metrics (at least 3 of the above)
Apply your evaluation to the output of at least 2 different strategies on the same document
Present results in a comparison table or chart
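Three of the five metrics can be computed without embeddings, which makes them a good starting point. The sketch below assumes each chunk is a dict with `"chunk"` text and a `"tokens"` count (as in the Task 4 output format); the metric names and the word-overlap definition of redundancy are illustrative choices:

```python
import statistics

def evaluate_chunks(chunks, token_limit=128):
    # Sketch of three quantitative metrics: size consistency, redundancy,
    # and token compliance. Semantic coherence would additionally need
    # sentence embeddings and is omitted here.
    if not chunks:
        return {}
    word_counts = [len(c["chunk"].split()) for c in chunks]
    # Redundancy: fraction of unique words already seen in an earlier chunk.
    seen, duplicated, total = set(), 0, 0
    for c in chunks:
        words = set(c["chunk"].split())
        duplicated += len(words & seen)
        seen |= words
        total += len(words)
    return {
        "size_mean": statistics.mean(word_counts),
        "size_stdev": statistics.pstdev(word_counts),
        "redundancy": duplicated / total if total else 0.0,
        "token_compliance": sum(
            c.get("tokens", 0) <= token_limit for c in chunks
        ) / len(chunks),
    }
```

Running this on the output of two strategies over the same document gives the numbers for your comparison table directly.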
Task 6: Comparative Experiment — 10 Marks
Run your complete system on at least 3 different types of documents:
A Markdown file (e.g., documentation or README)
A technical explanation (e.g., how Transformers work, how RAG pipelines operate)
Mixed paragraph text (e.g., a blog post or opinion article)
Compare the following across document types:
Number of chunks produced
Average chunk size (words and tokens)
Strategy selected by the detector
Qualitative retrieval readiness (would these chunks be useful in a vector search?)
Deliverables: Comparison table, 400–500 word analysis of differences across document types.
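Collecting the per-document statistics for the comparison table is mechanical once the controller exists. A possible harness, assuming your controller returns chunk dicts with `"chunk"` and `"strategy"` keys (an assumption carried over from the earlier output format, not a requirement):

```python
def compare_documents(docs, chunk_document):
    """Collect per-document statistics for the Task 6 comparison table.

    `docs` maps a label to raw text; `chunk_document` is your controller
    from Task 3. Returns one row dict per document.
    """
    rows = []
    for name, text in docs.items():
        chunks = chunk_document(text)
        words = [len(c["chunk"].split()) for c in chunks]
        rows.append({
            "document": name,
            "n_chunks": len(chunks),
            "avg_words": round(sum(words) / max(len(words), 1), 1),
            "strategy": chunks[0]["strategy"] if chunks else "n/a",
        })
    return rows
```

The qualitative "retrieval readiness" column still has to be filled in by hand; only the counts and averages are automated here.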
Difficulty & Scope
Expectations by Performance Level
| Level | Description |
| --- | --- |
| Basic (40–59%) | Implements core strategies (Task 1) and detector (Task 2), but controller logic is hardcoded. Evaluation is minimal. Report is descriptive. Code structure is flat (not modular). |
| Proficient (60–79%) | All tasks completed. Modular file structure followed. Detector works on multiple types. Controller uses detection output. Evaluation includes at least 2 quantitative metrics. Report connects design to rationale. |
| Advanced (80–100%) | System is fully adaptive and extensible. Hybrid strategies produce measurably better results. Evaluation framework is comprehensive. Report demonstrates systems thinking — connects chunking design to retrieval quality (Chapters 1–5). Code is production-quality with error handling, docstrings, and clean separation of concerns. Bonus tasks attempted. |
Marking Rubric
| Criteria | Weight | Description |
| --- | --- | --- |
| Strategy Implementation | 20% | All 5 functions implemented, modular, reusable, with docstrings and edge case handling |
| Document Detection Logic | 15% | Detector classifies at least 4 types; uses heuristics; provides confidence scores |
| Adaptive System Design | 20% | Controller integrates detection and strategy selection; extensible architecture; justified mapping |
| Hybrid Strategy Effectiveness | 15% | At least 2 hybrid strategies; metadata output; demonstrated improvement over single strategies |
| Evaluation Framework | 10% | At least 3 quantitative metrics; comparison across strategies; presented clearly |
| Report Quality | 10% | Well-structured; justifies design; discusses trade-offs; connects to course concepts |
| Code Quality & Modularity | 10% | Clean file structure; proper naming; error handling; consistent interfaces; docstrings |
| Total | 100% | |
Formatting & Structural Requirements
Code Submission
Structure: Follow the modular file structure specified in the task description
Language: Python 3.10+
Documentation: All public functions must have docstrings
Runnable: main.py must execute end-to-end without errors when run from the project root
Dependencies: Include a requirements.txt file listing all required packages
Report
Length: 1500–2000 words
Format: PDF (required)
Font: Times New Roman, 12pt
Spacing: 1.5 line spacing
Margins: 2.54 cm (1 inch) on all sides
Citation Style: IEEE format
Required Sections:
System Design Overview (with architecture diagram)
Strategy Selection Logic & Justification
Trade-off Analysis (where strategies fail; when semantic chunking helps/hurts)
Hybrid Strategy Justification
Observations Across Document Types
References
Output Samples
Include sample chunks from each document type processed
Annotate at least 3 chunks explaining why they are “good” or “bad”
Permitted Resources & Academic Integrity Policy
Permitted Resources
Course notebooks (Chapters 1–5) and lecture materials
Official documentation for Python libraries: tiktoken, sentence-transformers, scikit-learn, BeautifulSoup, NLTK, spaCy, re
Public tutorials and documentation for NLP and chunking techniques
Textbooks listed in the course syllabus
AI Tool Policy
Permitted: AI tools may be used for:
Debugging code errors
Understanding library APIs
Generating boilerplate code (e.g., file I/O, argument parsing)
NOT Permitted: Using AI to generate strategy implementations, detection logic, system architecture, analysis text, or report content
Requirement: All AI tool usage must be declared (see declaration below)
Academic Integrity Declaration
I declare that this assignment is my own original work. All sources and AI tools used have been appropriately acknowledged. I understand that plagiarism, collusion, and undeclared AI-generated content constitute academic misconduct and will be dealt with according to the institution’s academic integrity policy.
Student Name: ____________________________
Student ID: ____________________________
Date: ____________________________
Signature: ____________________________
Step-by-Step Submission Instructions (Moodle)
Log in to Moodle at [institution Moodle URL] using your student credentials.
Navigate to the course page: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers.
Click on the assignment link: Assignment 2: Adaptive Chunking Engine.
Prepare your submission as a single ZIP file or a GitHub repository link:
Option A — ZIP File:
<YourName>_ChunkingEngine_Assignment.zip
├── chunking_engine/
│   ├── strategies.py
│   ├── detector.py
│   ├── controller.py
│   ├── evaluation.py
│   └── main.py
├── report.pdf
├── output_samples/
│   ├── markdown_chunks.txt
│   ├── technical_chunks.txt
│   └── mixed_chunks.txt
├── requirements.txt
└── data/ (your source documents)
Option B — GitHub Link:
- Repository must be public or shared with the instructor
- Include the repository URL in the Moodle submission text box
File naming convention: LastName_FirstName_ChunkingEngine_Assignment.zip
Upload your file or paste your GitHub link.
Click “Submit Assignment” — do NOT just save as draft.
Verify your submission by checking the confirmation email from Moodle.
Accepted File Formats
Code: .py
Report: .pdf
Archive: .zip
Output: .txt, .json, .csv
Deadline
Submit within 10 days of assignment release. Exact date: [Instructor to specify]
Late Submission Policy
| Submission Window | Penalty |
| --- | --- |
| Up to 24 hours late | 10% deduction |
| 24–48 hours late | 20% deduction |
| Beyond 48 hours | Not accepted |
Support & Communication Guidelines
Office Hours: [Day and Time — e.g., Thursdays 3:00–5:00 PM], [Location / Online Link]
Discussion Forum: Use the Moodle discussion forum for all general questions. Post in the Assignment 2 Q&A thread.
Email: For confidential matters only, email [instructor@institution.edu] with the subject line: [Hybrid Search] Assignment 2 — [Your Name]
Response Time: Expect a response within 48 hours on working days.
Peer Collaboration: You may discuss design approaches and concepts with classmates, but all code, system design, and written analysis must be your own. Sharing code or report text is not permitted.
Technical Issues: If you encounter technical issues with libraries or environments, post on the forum first — your classmates may have already solved the same problem.
Frequently Asked Questions (FAQ)
Q1: Must I follow the exact file structure (strategies.py, detector.py, etc.)?
A: Yes. The modular structure is part of the assessment. It evaluates your ability to design maintainable, production-like systems. You may add additional files (e.g., utils.py) but cannot merge the required files.
Q2: Can I use pre-built chunking libraries like LangChain’s text splitters?
A: You may use them as reference or for comparison, but your core implementations must be written by you. Submitting a wrapper around a library function will not receive credit.
Q3: How sophisticated does the document type detector need to be?
A: It does not need to be perfect. A heuristic-based detector that reliably distinguishes 3–4 types is sufficient. What matters is that you justify your heuristics and acknowledge limitations.
Q4: What if my adaptive system selects the “wrong” strategy for a document?
A: This is an excellent analysis opportunity. Document the misclassification, explain why it occurred, and discuss how you would improve the detector. Honest error analysis is valued more highly than false perfection.
Q5: Is the GitHub submission option acceptable?
A: Yes, provided the repository is accessible to the instructor and contains all required files (code, report, output samples). Include a clear README.md.
Q6: How should I measure “retrieval readiness” in Task 6?
A: This is a qualitative assessment. Ask yourself: would a vector search engine return useful results if these chunks were indexed? Are chunks self-contained? Do they preserve enough context for an LLM to generate a meaningful answer? Connect your reasoning to the failure modes from Chapter 1.
Q7: Can I add extra strategy functions beyond the 5 required?
A: Absolutely. Additional strategies (e.g., paragraph-based, recursive character splitting) are welcome and demonstrate initiative. Make sure the required 5 are implemented first.
Q8: What counts as a “hybrid strategy”?
A: A hybrid strategy chains two or more individual strategies in a pipeline. For example: first split by structure (headers), then refine each section with semantic chunking, then enforce token limits. The key is that multiple strategies work together, not in isolation.
GENERAL GUIDANCE FOR THE ASSIGNMENTS
Instructor’s Note
Both assignments are designed to simulate real-world system design thinking. There is no single correct answer. What matters is:
Clarity of reasoning — Can you explain why you made each design choice?
Quality of implementation — Is your code clean, modular, and well-documented?
Depth of analysis — Do you go beyond describing what happened to explaining why it happened and what it means?
Tips for Success
Start simple, then build complexity. Get a basic version working first, then iterate.
Visualize chunks wherever possible. Print them, count them, compare them visually.
Focus on why a chunk is good or bad. The analysis is where marks are earned, not just the implementation.
Don’t just implement — analyze deeply. Every chunking strategy has strengths and weaknesses. Your job is to discover and articulate them.
Think from a retrieval perspective, not just a splitting perspective. Ask: would this chunk help an LLM answer a question? Would a vector search find it given a relevant query?
Connect to the course material. Reference the failure modes from Chapter 1, the hybrid strategies from Chapter 2, the re-ranking concepts from Chapter 3, and the evaluation metrics from Chapter 5.
This document was prepared for the course “Hybrid Search and Re-ranking — From Retrieval to Reliable Answers” and is intended for distribution to enrolled students only.
© [Institution Name], [Semester / Year]. All rights reserved.