
Designing an Adaptive Chunking Engine for Real-World RAG Systems




Purpose


In this assignment, you move beyond isolated chunking techniques to design a complete, adaptive chunking system that intelligently detects document types and selects or combines chunking strategies accordingly. This simulates how chunking is actually deployed in production RAG systems — not as a fixed function, but as a design decision that adapts to input characteristics.





Connection to Course Learning Outcomes (CLOs)


CLO   | Description                                              | Relevance
CLO 1 | Identify structural failure modes of pure vector search  | Understanding document-type-specific failures informs strategy selection
CLO 2 | Implement hybrid search strategies                       | The adaptive engine produces chunks tailored for hybrid retrieval
CLO 3 | Evaluate trade-offs in retrieval systems                 | Comparative experiments across document types mirror real evaluation workflows
CLO 4 | Design production-ready pipelines                        | The modular, extensible engine architecture reflects engineering best practices
CLO 5 | Debug and evaluate RAG pipelines                         | The evaluation framework directly applies Chapter 5 concepts




Assignment Type & Duration


  • Type: Individual Assignment

  • Duration: 10 days from release

  • Weighting: 50% of total coursework





Learning Objectives

Upon successful completion of this assignment, you will be able to:


  1. Design a modular, extensible chunking engine with clearly separated concerns (strategy implementation, document detection, strategy selection, evaluation).

  2. Construct a document-type detection system that classifies input text based on structural heuristics.

  3. Evaluate chunking strategies systematically using quantitative metrics (chunk size consistency, redundancy, token distribution) and qualitative assessment (context preservation, semantic coherence).

  4. Synthesize multiple chunking approaches into hybrid strategies with clearly justified layering decisions.

  5. Compare system performance across diverse document types and articulate why different documents require different treatment.

  6. Communicate engineering design decisions through a structured technical report that connects implementation choices to system-level goals.





Task Description


System Architecture

You will build a modular chunking engine with the following file structure:



chunking_engine/

├── strategies.py      # Core chunking strategy implementations

├── detector.py        # Document type detection logic

├── controller.py      # Strategy selection and orchestration

├── evaluation.py      # Chunk quality evaluation framework

└── main.py            # Entry point and demonstration




Part A: Core Engine Components (55 Marks)



Task 1: Core Chunking Strategy Implementations — 20 Marks

Implement the following functions in strategies.py:


  • fixed_size_chunk(text, chunk_size)

  • chunk_with_overlap(text, size, overlap)

  • sentence_chunker(sentences, max_words)

  • token_chunk(text, chunk_size, overlap)

  • semantic_chunk(sentences, embeddings, threshold)



Requirements: 


  • Each function must be modular and reusable — callable independently

  • Add descriptive docstrings explaining behavior, parameters, assumptions, and return types

  • Include input validation and edge case handling (e.g., empty text, chunk_size larger than document)

  • All functions must return a consistent output format
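As a sketch of what "consistent output format" and edge-case handling might look like, here are two of the required strategies implemented word-wise. The dict shape and word-based sizing are assumptions for illustration, not requirements:

```python
def fixed_size_chunk(text, chunk_size):
    """Split text into chunks of at most chunk_size words.

    Returns a list of dicts so every strategy shares one output format.
    """
    if not text or not text.strip():
        return []  # edge case: empty or whitespace-only input
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        {"chunk": " ".join(words[i:i + chunk_size]), "strategy": "fixed_size"}
        for i in range(0, len(words), chunk_size)
    ]


def chunk_with_overlap(text, size, overlap):
    """Like fixed_size_chunk, but consecutive chunks share `overlap` words."""
    if not text or not text.strip():
        return []
    if not 0 <= overlap < size:
        raise ValueError("require 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each iteration
    return [
        {"chunk": " ".join(words[i:i + size]), "strategy": "overlap"}
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Because both functions return the same list-of-dicts shape, the downstream controller and evaluator can treat every strategy's output uniformly.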



Task 2: Document Type Detection — 15 Marks

Create a detection function in detector.py:



def detect_document_type(text) -> str:
    """Classifies input text into one of:
    - 'markdown'
    - 'technical_documentation'
    - 'narrative'
    - 'plain_text'
    """


Detection Heuristics (suggested, not exhaustive):

  • Presence of Markdown headers (#, ##) or HTML tags (<h1>, <h2>)

  • Sentence density and paragraph spacing patterns

  • Average sentence length

  • Presence of code blocks, lists, or structured formatting

  • Vocabulary technicality (e.g., frequency of specialized terms)


Requirements:

  • Must handle at least 4 distinct document types

  • Include a confidence score (or per-type scores) in the output so the selection engine can make informed decisions

  • Demonstrate the detector on at least 3 different sample documents
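A minimal heuristic detector along these lines might look as follows. The patterns, weights, and the small `plain_text` prior are illustrative assumptions, and returning a `(label, scores)` tuple is just one way to expose confidence to the selection engine:

```python
import re

def detect_document_type(text):
    """Heuristic classifier returning (label, scores).

    Exposing the raw scores lets the selection engine judge confidence.
    All weights and patterns below are illustrative, not prescriptive.
    """
    scores = {"markdown": 0.0, "technical_documentation": 0.0,
              "narrative": 0.0, "plain_text": 0.1}  # small default prior
    if re.search(r"^#{1,6}\s", text, re.MULTILINE) or "```" in text:
        scores["markdown"] += 1.0  # Markdown headers or fenced code blocks
    if re.search(r"</?h[12]>", text) or re.search(r"\bAPI\b", text):
        scores["technical_documentation"] += 0.6  # HTML headings / jargon
    sentences = [s for s in re.split(r"[.!?]+\s", text.strip()) if s]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    if avg_len > 15 and scores["markdown"] == 0:
        scores["narrative"] += 0.5  # long flowing sentences suggest prose
    return max(scores, key=scores.get), scores
```

A real detector would add more signals (list density, code-block frequency, paragraph spacing), but even this skeleton shows the shape the requirements ask for.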



Task 3: Strategy Selection Engine — 20 Marks

Create a controller function in controller.py:



def chunk_document(text) -> list:
    """Detects document type, selects the appropriate chunking strategy,
    and returns the chunked output.
    """


Strategy Mapping (reference — you may design your own logic):


Detected Document Type  | Suggested Strategy
Markdown                | Structure-aware chunking
Technical documentation | Sentence + token-aware chunking
Narrative text          | Semantic chunking
Plain text              | Fixed-size / overlap chunking


Requirements:

  • The controller must call detect_document_type() internally

  • Strategy selection must be justified — explain in your report why each mapping was chosen

  • The controller must be extensible: adding a new document type should not require rewriting existing logic

  • Handle edge cases: what happens when detection confidence is low?
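One extensible pattern is a registry mapping detected types to strategy callables, so supporting a new document type means registering one entry rather than editing branching logic. In this sketch the detector is injected as a parameter for testability, and a trivial stand-in strategy is registered for every type; in your engine the real functions from strategies.py and detector.py would be wired in:

```python
def _fixed(text):
    # Stand-in strategy so the sketch runs; register your real
    # strategies.py functions here instead.
    return [{"chunk": text, "strategy": "fixed_size"}]

STRATEGY_REGISTRY = {
    "markdown": _fixed,
    "technical_documentation": _fixed,
    "narrative": _fixed,
    "plain_text": _fixed,
}

CONFIDENCE_FLOOR = 0.5  # below this, fall back to a safe generic strategy

def chunk_document(text, detector, fallback="plain_text"):
    """Detect the type, look up its strategy, and fall back on low confidence.

    Assumes the detector returns (label, scores); in the assignment this
    would be detect_document_type imported from detector.py.
    """
    label, scores = detector(text)
    if scores.get(label, 0.0) < CONFIDENCE_FLOOR:
        label = fallback  # low-confidence edge case: don't trust the label
    return STRATEGY_REGISTRY.get(label, STRATEGY_REGISTRY[fallback])(text)
```

The registry also gives you a natural place to register hybrid pipelines later (Task 4) without touching the selection code.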




Part B: Advanced Features (35 Marks)


Task 4: Hybrid Chunking Strategies — 15 Marks

Extend your system to support layered/hybrid strategies, such as:


  • Structure → Sentence → Token normalization

  • Sentence → Semantic refinement

  • Fixed → Overlap → Token limit enforcement


Output Format:

Each final chunk must be a structured object:



{
  "chunk": "...",
  "strategy": "semantic + token",
  "length": 78,
  "tokens": 120
}


Requirements:

  • Implement at least 2 distinct hybrid strategies

  • Show that hybrid strategies produce different (ideally better) results than single strategies

  • Include metadata with each chunk to enable traceability
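As an illustration of the sentence → token pattern, the following sketch packs whole sentences into chunks under an approximate token budget and emits the structured objects described above. Whitespace-split words stand in for real token counts; a tokenizer such as tiktoken would replace that approximation:

```python
import re

def sentence_then_token_chunks(text, max_tokens=50):
    """Hybrid sketch: sentence split first, then pack under a token budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []

    def flush():
        if current:
            joined = " ".join(current)
            chunks.append({"chunk": joined,
                           "strategy": "sentence + token",
                           "length": len(joined),           # characters
                           "tokens": len(joined.split())})  # approx tokens
            current.clear()

    for sent in sentences:
        if current and len(" ".join(current + [sent]).split()) > max_tokens:
            flush()  # adding this sentence would exceed the budget
        current.append(sent)
    flush()
    return chunks
```

Because sentence boundaries are respected before the token limit is enforced, no chunk ends mid-sentence, which is exactly the kind of layering justification the report should spell out.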





Task 5: Evaluation Framework — 10 Marks

Design and implement an evaluation system in evaluation.py:



def evaluate_chunks(chunks) -> dict:
    """Evaluates chunk quality across multiple dimensions.
    Returns a dictionary of metrics.
    """



Evaluation Dimensions:


Metric                 | Description
Chunk size consistency | Variance and standard deviation of chunk lengths
Context preservation   | Whether semantic boundaries align with chunk boundaries
Redundancy             | Amount of duplicate content across overlapping chunks
Semantic coherence     | Average intra-chunk cosine similarity between sentences
Token compliance       | Percentage of chunks within the defined token limit


Requirements: 


  • Implement quantitative metrics (at least 3 of the above)

  • Apply your evaluation to the output of at least 2 different strategies on the same document

  • Present results in a comparison table or chart
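A sketch covering three of the quantitative dimensions (size consistency, redundancy, token compliance) might look like this. The 128-token default and the word-overlap notion of redundancy are illustrative choices; semantic coherence would additionally require sentence embeddings:

```python
import statistics

def evaluate_chunks(chunks, token_limit=128):
    """Quantitative sketch over a list of {"chunk": str, ...} dicts."""
    lengths = [len(c["chunk"].split()) for c in chunks]
    word_sets = [set(c["chunk"].lower().split()) for c in chunks]
    all_words = set().union(*word_sets) if word_sets else set()
    # redundancy: fraction of distinct words appearing in more than one chunk
    repeated = {w for w in all_words if sum(w in s for s in word_sets) > 1}
    return {
        "mean_length": statistics.mean(lengths) if lengths else 0,
        "length_stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "redundancy": len(repeated) / len(all_words) if all_words else 0.0,
        "token_compliance": (sum(n <= token_limit for n in lengths) / len(lengths)
                             if lengths else 0.0),
    }
```

Running this on the output of two strategies over the same document yields directly comparable rows for the required table.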



Task 6: Comparative Experiment — 10 Marks

Run your complete system on at least 3 different types of documents:


  1. A Markdown file (e.g., documentation or README)

  2. A technical explanation (e.g., how Transformers work, how RAG pipelines operate)

  3. Mixed paragraph text (e.g., a blog post or opinion article)




Compare the following across document types: 


  • Number of chunks produced

  • Average chunk size (words and tokens)

  • Strategy selected by the detector

  • Qualitative retrieval readiness (would these chunks be useful in a vector search?)



Deliverables: a comparison table and a 400–500 word analysis of the differences across document types.
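One way to assemble the comparison table is to reduce each document's run to a row of summary statistics; the field names here are illustrative, and a token column could be added alongside the word count:

```python
def comparison_row(doc_name, detected_type, chunks):
    """Reduce one document's chunked output to a comparison-table row."""
    sizes = [len(c["chunk"].split()) for c in chunks]
    return {
        "document": doc_name,
        "detected_type": detected_type,
        "num_chunks": len(chunks),
        "avg_words": round(sum(sizes) / len(sizes), 1) if sizes else 0,
    }
```

Printing a list of such rows (or loading them into a DataFrame) gives the required table, leaving only the qualitative retrieval-readiness column to fill in by hand.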





Difficulty & Scope



Expectations by Performance Level

Level | Description
Basic (40–59%) | Implements core strategies (Task 1) and detector (Task 2), but controller logic is hardcoded. Evaluation is minimal. Report is descriptive. Code structure is flat (not modular).
Proficient (60–79%) | All tasks completed. Modular file structure followed. Detector works on multiple types. Controller uses detection output. Evaluation includes at least 2 quantitative metrics. Report connects design to rationale.
Advanced (80–100%) | System is fully adaptive and extensible. Hybrid strategies produce measurably better results. Evaluation framework is comprehensive. Report demonstrates systems thinking — connects chunking design to retrieval quality (Chapters 1–5). Code is production-quality with error handling, docstrings, and clean separation of concerns. Bonus tasks attempted.







Marking Rubric

Criteria | Weight | Description
Strategy Implementation | 20% | All 5 functions implemented, modular, reusable, with docstrings and edge case handling
Document Detection Logic | 15% | Detector classifies at least 4 types; uses heuristics; provides confidence scores
Adaptive System Design | 20% | Controller integrates detection and strategy selection; extensible architecture; justified mapping
Hybrid Strategy Effectiveness | 15% | At least 2 hybrid strategies; metadata output; demonstrated improvement over single strategies
Evaluation Framework | 10% | At least 3 quantitative metrics; comparison across strategies; presented clearly
Report Quality | 10% | Well-structured; justifies design; discusses trade-offs; connects to course concepts
Code Quality & Modularity | 10% | Clean file structure; proper naming; error handling; consistent interfaces; docstrings
Total | 100% |






Formatting & Structural Requirements




Code Submission


  • Structure: Follow the modular file structure specified in the task description

  • Language: Python 3.10+

  • Documentation: All public functions must have docstrings

  • Runnable: main.py must execute end-to-end without errors when run from the project root

  • Dependencies: Include a requirements.txt file listing all required packages




Report


  • Length: 1500–2000 words

  • Format: PDF (required)

  • Font: Times New Roman, 12pt

  • Spacing: 1.5 line spacing

  • Margins: 2.54 cm (1 inch) on all sides

  • Citation Style: IEEE format



Required Sections:


  • System Design Overview (with architecture diagram)

  • Strategy Selection Logic & Justification

  • Trade-off Analysis (where strategies fail; when semantic chunking helps/hurts)

  • Hybrid Strategy Justification

  • Observations Across Document Types

  • References




Output Samples


  • Include sample chunks from each document type processed

  • Annotate at least 3 chunks explaining why they are “good” or “bad”





Permitted Resources & Academic Integrity Policy




Permitted Resources


  • Course notebooks (Chapters 1–5) and lecture materials

  • Official documentation for Python libraries: tiktoken, sentence-transformers, scikit-learn, BeautifulSoup, NLTK, spaCy, re

  • Public tutorials and documentation for NLP and chunking techniques

  • Textbooks listed in the course syllabus




AI Tool Policy


  • Permitted: AI tools may be used for:

      - Debugging code errors

      - Understanding library APIs

      - Generating boilerplate code (e.g., file I/O, argument parsing)

  • NOT Permitted: Using AI to generate strategy implementations, detection logic, system architecture, analysis text, or report content

  • Requirement: All AI tool usage must be declared (see declaration below)




Academic Integrity Declaration


I declare that this assignment is my own original work. All sources and AI tools used have been appropriately acknowledged. I understand that plagiarism, collusion, and undeclared AI-generated content constitute academic misconduct and will be dealt with according to the institution’s academic integrity policy.


Student Name: ____________________________

Student ID: ____________________________

Date: ____________________________

Signature: ____________________________





Step-by-Step Submission Instructions (Moodle)


  1. Log in to Moodle at [institution Moodle URL] using your student credentials.

  2. Navigate to the course page: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers.

  3. Click on the assignment link: Assignment 2: Adaptive Chunking Engine.

  4. Prepare your submission as a single ZIP file or a GitHub repository link:


Option A — ZIP File:




LastName_FirstName_ChunkingEngine_Assignment.zip
├── chunking_engine/
│   ├── strategies.py
│   ├── detector.py
│   ├── controller.py
│   ├── evaluation.py
│   └── main.py
├── report.pdf
├── output_samples/
│   ├── markdown_chunks.txt
│   ├── technical_chunks.txt
│   └── mixed_chunks.txt
├── requirements.txt
└── data/ (your source documents)



Option B — GitHub Link:

  • Repository must be public or shared with the instructor

  • Include the repository URL in the Moodle submission text box


  5. File naming convention: LastName_FirstName_ChunkingEngine_Assignment.zip

  6. Upload your file or paste your GitHub link.

  7. Click “Submit Assignment” — do NOT just save as draft.

  8. Verify your submission by checking the confirmation email from Moodle.




Accepted File Formats


  • Code: .py

  • Report: .pdf

  • Archive: .zip

  • Output: .txt, .json, .csv




Deadline

Submit within 10 days from assignment release.
Exact date: [Instructor to specify]




Late Submission Policy


Submission Window | Penalty
Up to 24 hours late | 10% deduction
24–48 hours late | 20% deduction
Beyond 48 hours | Not accepted





Support & Communication Guidelines


  • Office Hours: [Day and Time — e.g., Thursdays 3:00–5:00 PM], [Location / Online Link]

  • Discussion Forum: Use the Moodle discussion forum for all general questions. Post in the Assignment 2 Q&A thread.

  • Email: For confidential matters only, email [instructor@institution.edu] with the subject line: [Hybrid Search] Assignment 2 — [Your Name]

  • Response Time: Expect a response within 48 hours on working days.

  • Peer Collaboration: You may discuss design approaches and concepts with classmates, but all code, system design, and written analysis must be your own. Sharing code or report text is not permitted.

  • Technical Issues: If you encounter technical issues with libraries or environments, post on the forum first — your classmates may have already solved the same problem.





Frequently Asked Questions (FAQ)


Q1: Must I follow the exact file structure (strategies.py, detector.py, etc.)?

A: Yes. The modular structure is part of the assessment: it evaluates your ability to design maintainable, production-like systems. You may add additional files (e.g., utils.py), but you may not merge the required files.




Q2: Can I use pre-built chunking libraries like LangChain’s text splitters?

A: You may use them as reference or for comparison, but your core implementations must be written by you. Submitting a wrapper around a library function will not receive credit.




Q3: How sophisticated does the document type detector need to be?

A: It does not need to be perfect. A heuristic-based detector that reliably distinguishes 3–4 types is sufficient. What matters is that you justify your heuristics and acknowledge limitations.




Q4: What if my adaptive system selects the “wrong” strategy for a document?

A: This is an excellent analysis opportunity. Document the misclassification, explain why it occurred, and discuss how you would improve the detector. Honest error analysis is valued more highly than false perfection.




Q5: Is the GitHub submission option acceptable?

A: Yes, provided the repository is accessible to the instructor and contains all required files (code, report, output samples). Include a clear README.md.




Q6: How should I measure “retrieval readiness” in Task 6?

A: This is a qualitative assessment. Ask yourself: would a vector search engine return useful results if these chunks were indexed? Are chunks self-contained? Do they preserve enough context for an LLM to generate a meaningful answer? Connect your reasoning to the failure modes from Chapter 1.




Q7: Can I add extra strategy functions beyond the 5 required?

A: Absolutely. Additional strategies (e.g., paragraph-based, recursive character splitting) are welcome and demonstrate initiative. Make sure the required 5 are implemented first.




Q8: What counts as a “hybrid strategy”?

A: A hybrid strategy chains two or more individual strategies in a pipeline. For example: first split by structure (headers), then refine each section with semantic chunking, then enforce token limits. The key is that multiple strategies work together, not in isolation.





GENERAL GUIDANCE FOR THE ASSIGNMENTS




Instructor’s Note

Both assignments are designed to simulate real-world system design thinking. There is no single correct answer. What matters is:


  • Clarity of reasoning — Can you explain why you made each design choice?

  • Quality of implementation — Is your code clean, modular, and well-documented?

  • Depth of analysis — Do you go beyond describing what happened to explaining why it happened and what it means?




Tips for Success


  1. Start simple, then build complexity. Get a basic version working first, then iterate.

  2. Visualize chunks wherever possible. Print them, count them, compare them visually.

  3. Focus on why a chunk is good or bad. The analysis is where marks are earned, not just the implementation.

  4. Don’t just implement — analyze deeply. Every chunking strategy has strengths and weaknesses. Your job is to discover and articulate them.

  5. Think from a retrieval perspective, not just a splitting perspective. Ask: would this chunk help an LLM answer a question? Would a vector search find it given a relevant query?

  6. Connect to the course material. Reference the failure modes from Chapter 1, the hybrid strategies from Chapter 2, the re-ranking concepts from Chapter 3, and the evaluation metrics from Chapter 5.



This document was prepared for the course “Hybrid Search and Re-ranking — From Retrieval to Reliable Answers” and is intended for distribution to enrolled students only.

© [Institution Name], [Semester / Year]. All rights reserved.





Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.



