
Designing an Adaptive Chunking Engine for Real-World RAG Systems




Purpose


In this assignment, you move beyond isolated chunking techniques to design a complete, adaptive chunking system that intelligently detects document types and selects or combines chunking strategies accordingly. This simulates how chunking is actually deployed in production RAG systems — not as a fixed function, but as a design decision that adapts to input characteristics.





Connection to Course Learning Outcomes (CLOs)


CLO   | Description                                              | Relevance
CLO 1 | Identify structural failure modes of pure vector search  | Understanding document-type-specific failures informs strategy selection
CLO 2 | Implement hybrid search strategies                       | The adaptive engine produces chunks tailored for hybrid retrieval
CLO 3 | Evaluate trade-offs in retrieval systems                 | Comparative experiments across document types mirror real evaluation workflows
CLO 4 | Design production-ready pipelines                        | The modular, extensible engine architecture reflects engineering best practices
CLO 5 | Debug and evaluate RAG pipelines                         | The evaluation framework directly applies Chapter 5 concepts




Assignment Type & Duration


  • Type: Individual Assignment

  • Duration: 10 days from release

  • Weighting: 50% of total coursework





Learning Objectives

Upon successful completion of this assignment, you will be able to:


  1. Design a modular, extensible chunking engine with clearly separated concerns (strategy implementation, document detection, strategy selection, evaluation).

  2. Construct a document-type detection system that classifies input text based on structural heuristics.

  3. Evaluate chunking strategies systematically using quantitative metrics (chunk size consistency, redundancy, token distribution) and qualitative assessment (context preservation, semantic coherence).

  4. Synthesize multiple chunking approaches into hybrid strategies with clearly justified layering decisions.

  5. Compare system performance across diverse document types and articulate why different documents require different treatment.

  6. Communicate engineering design decisions through a structured technical report that connects implementation choices to system-level goals.





Task Description


System Architecture

You will build a modular chunking engine with the following file structure:



chunking_engine/

├── strategies.py      # Core chunking strategy implementations

├── detector.py        # Document type detection logic

├── controller.py      # Strategy selection and orchestration

├── evaluation.py      # Chunk quality evaluation framework

└── main.py            # Entry point and demonstration




Part A: Core Engine Components (55 Marks)



Task 1: Core Chunking Strategy Implementations — 20 Marks

Implement the following functions in strategies.py:


  • fixed_size_chunk(text, chunk_size)

  • chunk_with_overlap(text, size, overlap)

  • sentence_chunker(sentences, max_words)

  • token_chunk(text, chunk_size, overlap)

  • semantic_chunk(sentences, embeddings, threshold)



Requirements: 


  • Each function must be modular and reusable — callable independently

  • Add descriptive docstrings explaining behavior, parameters, assumptions, and return types

  • Include input validation and edge case handling (e.g., empty text, chunk_size larger than document)

  • All functions must return a consistent output format
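As a sketch of what "consistent output format" and edge-case handling might look like, here are two of the required strategies implemented word-wise. The dict shape and word-based sizing are assumptions for illustration, not requirements:

```python
def fixed_size_chunk(text, chunk_size):
    """Split text into chunks of at most chunk_size words.

    Returns a list of dicts so every strategy shares one output format.
    """
    if not text or not text.strip():
        return []  # edge case: empty or whitespace-only input
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    words = text.split()
    return [
        {"chunk": " ".join(words[i:i + chunk_size]), "strategy": "fixed_size"}
        for i in range(0, len(words), chunk_size)
    ]


def chunk_with_overlap(text, size, overlap):
    """Like fixed_size_chunk, but consecutive chunks share `overlap` words."""
    if not text or not text.strip():
        return []
    if not 0 <= overlap < size:
        raise ValueError("require 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far the window advances each iteration
    return [
        {"chunk": " ".join(words[i:i + size]), "strategy": "overlap"}
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Because both functions return the same list-of-dicts shape, the downstream controller and evaluator can treat every strategy's output uniformly.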



Task 2: Document Type Detection — 15 Marks

Create a detection function in detector.py:



def detect_document_type(text) -> str:
    """Classifies input text into one of:
    - 'markdown'
    - 'technical_documentation'
    - 'narrative'
    - 'plain_text'
    """


Detection Heuristics (suggested, not exhaustive):

  • Presence of Markdown headers (#, ##) or HTML tags (<h1>, <h2>)

  • Sentence density and paragraph spacing patterns

  • Average sentence length

  • Presence of code blocks, lists, or structured formatting

  • Vocabulary technicality (e.g., frequency of specialized terms)


Requirements:

  • Must handle at least 4 distinct document types

  • Include a confidence score (or per-type scores) in the output so the selection engine can make informed decisions

  • Demonstrate the detector on at least 3 different sample documents
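A minimal heuristic detector along these lines might look as follows. The patterns, weights, and the small `plain_text` prior are illustrative assumptions, and returning a `(label, scores)` tuple is just one way to expose confidence to the selection engine:

```python
import re

def detect_document_type(text):
    """Heuristic classifier returning (label, scores).

    Exposing the raw scores lets the selection engine judge confidence.
    All weights and patterns below are illustrative, not prescriptive.
    """
    scores = {"markdown": 0.0, "technical_documentation": 0.0,
              "narrative": 0.0, "plain_text": 0.1}  # small default prior
    if re.search(r"^#{1,6}\s", text, re.MULTILINE) or "```" in text:
        scores["markdown"] += 1.0  # Markdown headers or fenced code blocks
    if re.search(r"</?h[12]>", text) or re.search(r"\bAPI\b", text):
        scores["technical_documentation"] += 0.6  # HTML headings / jargon
    sentences = [s for s in re.split(r"[.!?]+\s", text.strip()) if s]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    if avg_len > 15 and scores["markdown"] == 0:
        scores["narrative"] += 0.5  # long flowing sentences suggest prose
    return max(scores, key=scores.get), scores
```

A real detector would add more signals (list density, code-block frequency, paragraph spacing), but even this skeleton shows the shape the requirements ask for.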



Task 3: Strategy Selection Engine — 20 Marks

Create a controller function in controller.py:



def chunk_document(text) -> list:
    """Detects document type, selects the appropriate chunking strategy,
    and returns the chunked output.
    """


Strategy Mapping (reference — you may design your own logic):


Detected Document Type  | Suggested Strategy
Markdown                | Structure-aware chunking
Technical documentation | Sentence + token-aware chunking
Narrative text          | Semantic chunking
Plain text              | Fixed-size / overlap chunking


Requirements:

  • The controller must call detect_document_type() internally

  • Strategy selection must be justified — explain in your report why each mapping was chosen

  • The controller must be extensible: adding a new document type should not require rewriting existing logic

  • Handle edge cases: what happens when detection confidence is low?
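One extensible pattern is a registry mapping detected types to strategy callables, so supporting a new document type means registering one entry rather than editing branching logic. In this sketch the detector is injected as a parameter for testability, and a trivial stand-in strategy is registered for every type; in your engine the real functions from strategies.py and detector.py would be wired in:

```python
def _fixed(text):
    # Stand-in strategy so the sketch runs; register your real
    # strategies.py functions here instead.
    return [{"chunk": text, "strategy": "fixed_size"}]

STRATEGY_REGISTRY = {
    "markdown": _fixed,
    "technical_documentation": _fixed,
    "narrative": _fixed,
    "plain_text": _fixed,
}

CONFIDENCE_FLOOR = 0.5  # below this, fall back to a safe generic strategy

def chunk_document(text, detector, fallback="plain_text"):
    """Detect the type, look up its strategy, and fall back on low confidence.

    Assumes the detector returns (label, scores); in the assignment this
    would be detect_document_type imported from detector.py.
    """
    label, scores = detector(text)
    if scores.get(label, 0.0) < CONFIDENCE_FLOOR:
        label = fallback  # low-confidence edge case: don't trust the label
    return STRATEGY_REGISTRY.get(label, STRATEGY_REGISTRY[fallback])(text)
```

The registry also gives you a natural place to register hybrid pipelines later (Task 4) without touching the selection code.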




Part B: Advanced Features (35 Marks)


Task 4: Hybrid Chunking Strategies — 15 Marks

Extend your system to support layered/hybrid strategies, such as:


  • Structure → Sentence → Token normalization

  • Sentence → Semantic refinement

  • Fixed → Overlap → Token limit enforcement


Output Format:

Each final chunk must be a structured object:



{
  "chunk": "...",
  "strategy": "semantic + token",
  "length": 78,
  "tokens": 120
}


Requirements:

  • Implement at least 2 distinct hybrid strategies

  • Show that hybrid strategies produce different (ideally better) results than single strategies

  • Include metadata with each chunk to enable traceability
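As an illustration of the sentence → token pattern, the following sketch packs whole sentences into chunks under an approximate token budget and emits the structured objects described above. Whitespace-split words stand in for real token counts; a tokenizer such as tiktoken would replace that approximation:

```python
import re

def sentence_then_token_chunks(text, max_tokens=50):
    """Hybrid sketch: sentence split first, then pack under a token budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks, current = [], []

    def flush():
        if current:
            joined = " ".join(current)
            chunks.append({"chunk": joined,
                           "strategy": "sentence + token",
                           "length": len(joined),           # characters
                           "tokens": len(joined.split())})  # approx tokens
            current.clear()

    for sent in sentences:
        if current and len(" ".join(current + [sent]).split()) > max_tokens:
            flush()  # adding this sentence would exceed the budget
        current.append(sent)
    flush()
    return chunks
```

Because sentence boundaries are respected before the token limit is enforced, no chunk ends mid-sentence, which is exactly the kind of layering justification the report should spell out.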





Task 5: Evaluation Framework — 10 Marks

Design and implement an evaluation system in evaluation.py:



def evaluate_chunks(chunks) -> dict:
    """Evaluates chunk quality across multiple dimensions.
    Returns a dictionary of metrics.
    """



Evaluation Dimensions:


Metric                 | Description
Chunk size consistency | Variance and standard deviation of chunk lengths
Context preservation   | Whether semantic boundaries align with chunk boundaries
Redundancy             | Amount of duplicate content across overlapping chunks
Semantic coherence     | Average intra-chunk cosine similarity between sentences
Token compliance       | Percentage of chunks within the defined token limit


Requirements: 


  • Implement quantitative metrics (at least 3 of the above)

  • Apply your evaluation to the output of at least 2 different strategies on the same document

  • Present results in a comparison table or chart
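A sketch covering three of the quantitative dimensions (size consistency, redundancy, token compliance) might look like this. The 128-token default and the word-overlap notion of redundancy are illustrative choices; semantic coherence would additionally require sentence embeddings:

```python
import statistics

def evaluate_chunks(chunks, token_limit=128):
    """Quantitative sketch over a list of {"chunk": str, ...} dicts."""
    lengths = [len(c["chunk"].split()) for c in chunks]
    word_sets = [set(c["chunk"].lower().split()) for c in chunks]
    all_words = set().union(*word_sets) if word_sets else set()
    # redundancy: fraction of distinct words appearing in more than one chunk
    repeated = {w for w in all_words if sum(w in s for s in word_sets) > 1}
    return {
        "mean_length": statistics.mean(lengths) if lengths else 0,
        "length_stdev": statistics.stdev(lengths) if len(lengths) > 1 else 0.0,
        "redundancy": len(repeated) / len(all_words) if all_words else 0.0,
        "token_compliance": (sum(n <= token_limit for n in lengths) / len(lengths)
                             if lengths else 0.0),
    }
```

Running this on the output of two strategies over the same document yields directly comparable rows for the required table.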



Task 6: Comparative Experiment — 10 Marks

Run your complete system on at least 3 different types of documents:


  1. A Markdown file (e.g., documentation or README)

  2. A technical explanation (e.g., how Transformers work, how RAG pipelines operate)

  3. Mixed paragraph text (e.g., a blog post or opinion article)




Compare the following across document types: 


  • Number of chunks produced

  • Average chunk size (words and tokens)

  • Strategy selected by the detector

  • Qualitative retrieval readiness (would these chunks be useful in a vector search?)



Deliverables: a comparison table and a 400–500 word analysis of the differences across document types.
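One way to assemble the comparison table is to reduce each document's run to a row of summary statistics; the field names here are illustrative, and a token column could be added alongside the word count:

```python
def comparison_row(doc_name, detected_type, chunks):
    """Reduce one document's chunked output to a comparison-table row."""
    sizes = [len(c["chunk"].split()) for c in chunks]
    return {
        "document": doc_name,
        "detected_type": detected_type,
        "num_chunks": len(chunks),
        "avg_words": round(sum(sizes) / len(sizes), 1) if sizes else 0,
    }
```

Printing a list of such rows (or loading them into a DataFrame) gives the required table, leaving only the qualitative retrieval-readiness column to fill in by hand.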





Difficulty & Scope



Expectations by Performance Level

Level | Description
Basic (40–59%) | Implements core strategies (Task 1) and detector (Task 2), but controller logic is hardcoded. Evaluation is minimal. Report is descriptive. Code structure is flat (not modular).
Proficient (60–79%) | All tasks completed. Modular file structure followed. Detector works on multiple types. Controller uses detection output. Evaluation includes at least 2 quantitative metrics. Report connects design to rationale.
Advanced (80–100%) | System is fully adaptive and extensible. Hybrid strategies produce measurably better results. Evaluation framework is comprehensive. Report demonstrates systems thinking — connects chunking design to retrieval quality (Chapters 1–5). Code is production-quality with error handling, docstrings, and clean separation of concerns. Bonus tasks attempted.







Marking Rubric

Criteria | Weight | Description
Strategy Implementation | 20% | All 5 functions implemented, modular, reusable, with docstrings and edge case handling
Document Detection Logic | 15% | Detector classifies at least 4 types; uses heuristics; provides confidence scores
Adaptive System Design | 20% | Controller integrates detection and strategy selection; extensible architecture; justified mapping
Hybrid Strategy Effectiveness | 15% | At least 2 hybrid strategies; metadata output; demonstrated improvement over single strategies
Evaluation Framework | 10% | At least 3 quantitative metrics; comparison across strategies; presented clearly
Report Quality | 10% | Well-structured; justifies design; discusses trade-offs; connects to course concepts
Code Quality & Modularity | 10% | Clean file structure; proper naming; error handling; consistent interfaces; docstrings
Total | 100% |






Formatting & Structural Requirements




Code Submission


  • Structure: Follow the modular file structure specified in the task description

  • Language: Python 3.10+

  • Documentation: All public functions must have docstrings

  • Runnable: main.py must execute end-to-end without errors when run from the project root

  • Dependencies: Include a requirements.txt file listing all required packages




Report


  • Length: 1500–2000 words

  • Format: PDF (required)

  • Font: Times New Roman, 12pt

  • Spacing: 1.5 line spacing

  • Margins: 2.54 cm (1 inch) on all sides

  • Citation Style: IEEE format



Required Sections:


  • System Design Overview (with architecture diagram)

  • Strategy Selection Logic & Justification

  • Trade-off Analysis (where strategies fail; when semantic chunking helps/hurts)

  • Hybrid Strategy Justification

  • Observations Across Document Types

  • References




Output Samples


  • Include sample chunks from each document type processed

  • Annotate at least 3 chunks explaining why they are “good” or “bad”





Permitted Resources & Academic Integrity Policy




Permitted Resources


  • Course notebooks (Chapters 1–5) and lecture materials

  • Official documentation for Python libraries: tiktoken, sentence-transformers, scikit-learn, BeautifulSoup, NLTK, spaCy, re

  • Public tutorials and documentation for NLP and chunking techniques

  • Textbooks listed in the course syllabus




AI Tool Policy


  • Permitted: AI tools may be used for:

      - Debugging code errors

      - Understanding library APIs

      - Generating boilerplate code (e.g., file I/O, argument parsing)

  • NOT Permitted: Using AI to generate strategy implementations, detection logic, system architecture, analysis text, or report content

  • Requirement: All AI tool usage must be declared (see declaration below)




Academic Integrity Declaration


I declare that this assignment is my own original work. All sources and AI tools used have been appropriately acknowledged. I understand that plagiarism, collusion, and undeclared AI-generated content constitute academic misconduct and will be dealt with according to the institution’s academic integrity policy.


Student Name: ____________________________

Student ID: ____________________________

Date: ____________________________

Signature: ____________________________





Step-by-Step Submission Instructions (Moodle)


  1. Log in to Moodle at [institution Moodle URL] using your student credentials.

  2. Navigate to the course page: Hybrid Search and Re-ranking — From Retrieval to Reliable Answers.

  3. Click on the assignment link: Assignment 2: Adaptive Chunking Engine.

  4. Prepare your submission as a single ZIP file or a GitHub repository link:


Option A — ZIP File:




LastName_FirstName_ChunkingEngine_Assignment.zip
├── chunking_engine/
│   ├── strategies.py
│   ├── detector.py
│   ├── controller.py
│   ├── evaluation.py
│   └── main.py
├── report.pdf
├── output_samples/
│   ├── markdown_chunks.txt
│   ├── technical_chunks.txt
│   └── mixed_chunks.txt
├── requirements.txt
└── data/ (your source documents)



Option B — GitHub Link:

  • Repository must be public or shared with the instructor

  • Include the repository URL in the Moodle submission text box


  5. File naming convention: LastName_FirstName_ChunkingEngine_Assignment.zip

  6. Upload your file or paste your GitHub link.

  7. Click “Submit Assignment” — do NOT just save as draft.

  8. Verify your submission by checking the confirmation email from Moodle.




Accepted File Formats


  • Code: .py

  • Report: .pdf

  • Archive: .zip

  • Output: .txt, .json, .csv




Deadline

Submit within 10 days from assignment release.
Exact date: [Instructor to specify]




Late Submission Policy


Submission Window | Penalty
Up to 24 hours late | 10% deduction
24–48 hours late | 20% deduction
Beyond 48 hours | Not accepted





Support & Communication Guidelines


  • Office Hours: [Day and Time — e.g., Thursdays 3:00–5:00 PM], [Location / Online Link]

  • Discussion Forum: Use the Moodle discussion forum for all general questions. Post in the Assignment 2 Q&A thread.

  • Email: For confidential matters only, email [instructor@institution.edu] with the subject line: [Hybrid Search] Assignment 2 — [Your Name]

  • Response Time: Expect a response within 48 hours on working days.

  • Peer Collaboration: You may discuss design approaches and concepts with classmates, but all code, system design, and written analysis must be your own. Sharing code or report text is not permitted.

  • Technical Issues: If you encounter technical issues with libraries or environments, post on the forum first — your classmates may have already solved the same problem.





Frequently Asked Questions (FAQ)


Q1: Must I follow the exact file structure (strategies.py, detector.py, etc.)?

A: Yes. The modular structure is part of the assessment: it evaluates your ability to design maintainable, production-like systems. You may add additional files (e.g., utils.py), but you may not merge the required files.




Q2: Can I use pre-built chunking libraries like LangChain’s text splitters?

A: You may use them as reference or for comparison, but your core implementations must be written by you. Submitting a wrapper around a library function will not receive credit.




Q3: How sophisticated does the document type detector need to be?

A: It does not need to be perfect. A heuristic-based detector that reliably distinguishes 3–4 types is sufficient. What matters is that you justify your heuristics and acknowledge limitations.




Q4: What if my adaptive system selects the “wrong” strategy for a document?

A: This is an excellent analysis opportunity. Document the misclassification, explain why it occurred, and discuss how you would improve the detector. Honest error analysis is valued more highly than false perfection.




Q5: Is the GitHub submission option acceptable?

A: Yes, provided the repository is accessible to the instructor and contains all required files (code, report, output samples). Include a clear README.md.




Q6: How should I measure “retrieval readiness” in Task 6?

A: This is a qualitative assessment. Ask yourself: would a vector search engine return useful results if these chunks were indexed? Are chunks self-contained? Do they preserve enough context for an LLM to generate a meaningful answer? Connect your reasoning to the failure modes from Chapter 1.




Q7: Can I add extra strategy functions beyond the 5 required?

A: Absolutely. Additional strategies (e.g., paragraph-based, recursive character splitting) are welcome and demonstrate initiative. Make sure the required 5 are implemented first.




Q8: What counts as a “hybrid strategy”?

A: A hybrid strategy chains two or more individual strategies in a pipeline. For example: first split by structure (headers), then refine each section with semantic chunking, then enforce token limits. The key is that multiple strategies work together, not in isolation.





GENERAL GUIDANCE FOR THE ASSIGNMENTS




Instructor’s Note

Both assignments are designed to simulate real-world system design thinking. There is no single correct answer. What matters is:


  • Clarity of reasoning — Can you explain why you made each design choice?

  • Quality of implementation — Is your code clean, modular, and well-documented?

  • Depth of analysis — Do you go beyond describing what happened to explaining why it happened and what it means?




Tips for Success


  1. Start simple, then build complexity. Get a basic version working first, then iterate.

  2. Visualize chunks wherever possible. Print them, count them, compare them visually.

  3. Focus on why a chunk is good or bad. The analysis is where marks are earned, not just the implementation.

  4. Don’t just implement — analyze deeply. Every chunking strategy has strengths and weaknesses. Your job is to discover and articulate them.

  5. Think from a retrieval perspective, not just a splitting perspective. Ask: would this chunk help an LLM answer a question? Would a vector search find it given a relevant query?

  6. Connect to the course material. Reference the failure modes from Chapter 1, the hybrid strategies from Chapter 2, the re-ranking concepts from Chapter 3, and the evaluation metrics from Chapter 5.



This document was prepared for the course “Hybrid Search and Re-ranking — From Retrieval to Reliable Answers” and is intended for distribution to enrolled students only.

© [Institution Name], [Semester / Year]. All rights reserved.





Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.



