
Building a Metadata-Aware Ingestion & Retrieval Pipeline




Course: Metadata Filtering

Level: Medium → Advanced

Type: Individual Assignment

Duration: 5–7 days





Objective

The objective of this assignment is to help you:


  • Understand why metadata filtering is essential for production RAG systems

  • Design a metadata schema for a real-world knowledge base

  • Implement metadata-preserving chunking so that chunk-level metadata is never lost

  • Build and apply pre-filters using ChromaDB's filter syntax

  • Compare pre-filtering vs post-filtering approaches with measurable outcomes

  • Think critically about access control, multi-tenancy, and time-sensitivity in retrieval


    By the end of this assignment, you will have built a fully functional metadata-aware ingestion and filtered retrieval pipeline from scratch — the foundation every production RAG system needs.





Problem Statement

You work as a developer at a mid-sized company. The company has decided to build an internal knowledge assistant powered by RAG. Documents come from multiple departments (e.g., HR, Engineering, Finance), carry different access levels (public, internal, confidential), and are updated across years.


Your task is to:


  • Design, implement, and evaluate a metadata-aware ingestion and filtered retrieval pipeline that ensures the right documents reach the right users — and only them.

  • Demonstrate that your system correctly handles department scoping, access control, time-based freshness, and combined filtering — all while preserving metadata at the chunk level.





Dataset Requirements

You must create or generate a synthetic dataset of at least 15 documents with the following characteristics:



Required Metadata Fields (per document)

Field | Type | Description | Example Values
department | string | Originating department | HR, Engineering, Marketing, Finance, Legal
access_level | string | Human-readable access tier | public, internal, confidential, restricted
access_level_num | integer | Numeric access tier (for range comparisons) | 1 (public) → 4 (restricted)
year | integer | Year the document was created/last updated | 2022, 2023, 2024, 2025
category | string | Document category | policy, procedure, standard, guide, memo



Required Distribution


  • At least 3 departments represented

  • At least 3 different access levels used

  • At least 2 different years represented

  • Documents should range from 300–1000 words each




Content Guidance

Write realistic organizational content. Examples:


  • HR: Leave policies, onboarding guides, code of conduct

  • Engineering: Coding standards, deployment procedures, incident response

  • Finance: Expense policies, budget guidelines, audit procedures


Note: You may write these documents manually or generate them programmatically. What matters is that they are realistic and well-tagged with metadata.
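If you choose the programmatic route, a generator loop over the required fields keeps the distribution requirements easy to satisfy. The sketch below is one possible starting point, not a prescribed solution — the field names follow the schema table above, while the titles and placeholder bodies are illustrative and should be replaced with realistic content.

```python
import random

# Assumed value pools -- adjust to match your own schema choices.
DEPARTMENTS = ["HR", "Engineering", "Finance"]
ACCESS = [("public", 1), ("internal", 2), ("confidential", 3)]
YEARS = [2023, 2024, 2025]
CATEGORIES = ["policy", "procedure", "guide"]

def generate_documents(n=15, seed=42):
    """Generate n synthetic, metadata-tagged documents (placeholder bodies)."""
    rng = random.Random(seed)
    docs = []
    for i in range(n):
        # Cycling guarantees at least 3 departments and 3 access levels.
        dept = DEPARTMENTS[i % len(DEPARTMENTS)]
        level, level_num = ACCESS[i % len(ACCESS)]
        docs.append({
            "id": f"doc_{i:03d}",
            "title": f"{dept} {CATEGORIES[i % len(CATEGORIES)]} #{i}",
            # ~360 words -- within the required 300-1000 word range.
            "text": f"Placeholder body for a {dept} document. " * 60,
            "metadata": {
                "department": dept,
                "access_level": level,
                "access_level_num": level_num,
                "year": rng.choice(YEARS),
                "category": rng.choice(CATEGORIES),
            },
        })
    return docs

docs = generate_documents()
print(len(docs), "documents,", {d["metadata"]["department"] for d in docs})
```

Replacing the placeholder bodies with hand-written policy text (even short paragraphs) will make the later retrieval comparisons much more meaningful.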





Tasks & Requirements




Task 1: Metadata Schema Design (10 Marks)

Design a formal metadata schema for your knowledge base.


Requirements:

Define your schema clearly, specifying:


  • Required fields — fields that every document MUST have

  • Filterable fields — fields that can be used as retrieval filters

  • Type constraints — expected data type for each field (string, integer, etc.)

  • Allowed values — enumeration of valid values where applicable (e.g., access levels)



Write a validation function validate_document_metadata(metadata, schema) that:


  • Checks if all required fields are present

  • Checks if field values match expected types

  • Checks if enumerated fields contain only allowed values

  • Returns clear error messages for invalid documents



Include a brief write-up (5–8 sentences) explaining:


  • Why you chose these specific fields

  • How each field supports a filtering use case



Deliverable: Schema definition (as a Python dictionary or dataclass) + validation function + explanation.
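One possible shape for the deliverable, sketched below under the assumption that the schema is a plain dictionary with `required`, `types`, and `allowed` sections (a dataclass works equally well):

```python
# Illustrative schema; field names follow the assignment's metadata table.
SCHEMA = {
    "required": ["department", "access_level", "access_level_num", "year", "category"],
    "types": {
        "department": str, "access_level": str,
        "access_level_num": int, "year": int, "category": str,
    },
    "allowed": {
        "access_level": {"public", "internal", "confidential", "restricted"},
        "access_level_num": {1, 2, 3, 4},
    },
}

def validate_document_metadata(metadata, schema):
    """Return a list of error strings; an empty list means the metadata is valid."""
    errors = []
    for field in schema["required"]:
        if field not in metadata:
            errors.append(f"missing required field: {field}")
    for field, expected in schema["types"].items():
        if field in metadata and not isinstance(metadata[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(metadata[field]).__name__}")
    for field, allowed in schema["allowed"].items():
        if field in metadata and metadata[field] not in allowed:
            errors.append(f"{field}: {metadata[field]!r} is not an allowed value")
    return errors
```

Returning a list of errors (rather than raising on the first one) lets you report every problem with a document in a single pass, which is friendlier during bulk ingestion.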




Task 2: Metadata-Preserving Chunking (20 Marks)

This is the most critical step. When you split documents into chunks, metadata must flow from the parent document to every chunk.


Requirements:


  1. Implement a chunking function/class that:


  • Accepts a document (text + metadata)

  • Splits text into chunks of a configurable size (default: ~1000 characters) with configurable overlap (default: ~200 characters)

  • Copies (not references!) all parent document metadata to each chunk

  • Adds the following chunk-specific metadata to each chunk:

    • chunk_index — position of the chunk within the document (0-based)

    • total_chunks — total number of chunks from this document

    • source_doc_id — identifier of the parent document

    • source_doc_title — title of the parent document


  2. Demonstrate your chunker on at least 3 documents from different departments

  3. Print a clear output table showing, for each chunk:


  • Chunk ID

  • First 80 characters of text

  • Inherited metadata (department, access_level, year)

  • Chunk-specific metadata (chunk_index, total_chunks)


  4. Write a short explanation (5–8 sentences) answering:


  • What would go wrong if you used a reference instead of a copy for metadata?

  • Why is chunk_index / total_chunks useful in production?



Common Pitfall to Avoid: If you use chunk_metadata = doc_metadata (assignment, not a copy), then modifying one chunk's metadata will corrupt all other chunks from the same document. You must use .copy() or equivalent.
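A minimal sketch of the required chunker, assuming character-based splitting (you may prefer sentence- or token-based boundaries); note the `.copy()` on the parent metadata:

```python
def chunk_document(doc_id, title, text, metadata, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks; each chunk gets a COPY of the parent metadata."""
    step = chunk_size - overlap
    starts = range(0, max(len(text), 1), step)
    pieces = [text[s:s + chunk_size] for s in starts]
    chunks = []
    for i, piece in enumerate(pieces):
        meta = metadata.copy()  # copy, not reference -- chunks must not share one dict
        meta.update({
            "chunk_index": i,                # 0-based position within the document
            "total_chunks": len(pieces),
            "source_doc_id": doc_id,
            "source_doc_title": title,
        })
        chunks.append({"id": f"{doc_id}_chunk_{i}", "text": piece, "metadata": meta})
    return chunks
```

Because each chunk carries its own copy, editing one chunk's metadata (say, during debugging) leaves its siblings untouched — exactly the behavior the pitfall note above warns you to verify.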




Task 3: Ingestion into ChromaDB (15 Marks)

Ingest your chunks (with metadata) into a ChromaDB collection.


Requirements:


  1. Create a ChromaDB collection with a meaningful name

  2. Generate embeddings for each chunk using OpenAI's embedding model (e.g., text-embedding-3-small)

  3. Store chunks with:

    • ids — unique chunk identifiers

    • documents — chunk text

    • embeddings — vector embeddings

    • metadatas — the full metadata dictionary for each chunk

  4. After ingestion, verify by:

    • Printing the total number of chunks stored

    • Querying a sample chunk and confirming its metadata is intact

    • Demonstrating that metadata fields are present and correct


Tip: Use tiktoken to verify that no chunk exceeds a reasonable token limit (e.g., 500 tokens) before embedding. Print a warning if any chunk is too large.
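The ingestion call can be sketched as below. The collection name `company_kb` is an arbitrary assumption; the heavy dependencies (chromadb, the OpenAI client) are imported lazily inside the function since they require installed packages and an API key, while the parallel-list construction is plain Python you can verify independently:

```python
def to_parallel_lists(chunks):
    """Arrange chunks into the parallel lists that collection.add expects."""
    ids = [c["id"] for c in chunks]
    documents = [c["text"] for c in chunks]
    metadatas = [c["metadata"] for c in chunks]
    return ids, documents, metadatas

def ingest_chunks(chunks, collection_name="company_kb"):
    """Sketch: embed chunks with OpenAI and store them in a ChromaDB collection."""
    import chromadb              # lazy imports: only needed at ingest time
    from openai import OpenAI

    ids, documents, metadatas = to_parallel_lists(chunks)
    oai = OpenAI()
    resp = oai.embeddings.create(model="text-embedding-3-small", input=documents)
    embeddings = [item.embedding for item in resp.data]

    client = chromadb.Client()
    collection = client.get_or_create_collection(collection_name)
    collection.add(ids=ids, documents=documents,
                   embeddings=embeddings, metadatas=metadatas)
    print(f"stored {collection.count()} chunks")
    return collection
```

For large datasets you would batch the embedding calls, but for 15 documents a single request per ingestion run is fine.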




Task 4: Implementing Pre-Filters (25 Marks)

Build four types of metadata filters and demonstrate each with a retrieval query.


4a. Time-Based Filter (5 Marks)


  • Build a filter that restricts results to documents from a specific year or year range

  • Example: "Show only results from 2024 or later"

  • ChromaDB syntax: {"year": {"$gte": 2024}}

  • Run a query with and without this filter and compare the results




4b. Access-Control Filter (5 Marks)


  • Build a filter that restricts results based on a user's access level

  • A user with access_level_num = 2 should only see chunks where access_level_num <= 2

  • ChromaDB syntax: {"access_level_num": {"$lte": 2}}

  • Run a query as a "public only" user and a "confidential" user — show the difference




4c. Department-Scoping Filter (5 Marks)

  • Build a filter that restricts results to a specific department

  • ChromaDB syntax: {"department": {"$eq": "HR"}}

  • Run a query for "leave policy" with department filter for HR vs without filter — show the difference




4d. Combined Filter (10 Marks)

  • Build a filter that combines at least 3 conditions using the $and operator

  • Example: "HR documents, from 2024 or later, with access level ≤ 2"

  • ChromaDB syntax:



{
    "$and": [
        {"department": {"$eq": "HR"}},
        {"year": {"$gte": 2024}},
        {"access_level_num": {"$lte": 2}}
    ]
}



  • Run the query and display the results with their metadata

  • Discuss: What happens if a query matches zero results? How would you handle that in production?



For each sub-task, present:


  • The filter used (as Python dict)

  • The query text

  • The returned results (chunk text excerpt + metadata)

  • A brief analysis (2–3 sentences) of what the filter accomplished
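For the zero-results discussion in 4d, one reasonable pattern is to detect the empty case explicitly and decide which filters may be relaxed. The sketch below assumes a results dict shaped like ChromaDB's `query()` output (lists-of-lists, one inner list per query); the function name and fallback message are illustrative:

```python
def handle_query_results(results, fallback_message="No documents matched your filters."):
    """Unpack a ChromaDB-style query result for a single query; handle zero matches."""
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    if not docs:
        # Production options: relax the least critical filter (e.g. widen the
        # year range) and retry, or return an explicit "no results" answer.
        # NEVER silently drop the access-control filter to get more results.
        return {"hits": [], "message": fallback_message}
    return {"hits": list(zip(docs, metas)), "message": None}
```

The key design point is which filter is safe to relax: freshness and department scope are usability constraints, while access control is a security constraint and must stay fixed.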




Task 5: Pre-Filtering vs Post-Filtering Comparison (20 Marks)

This is a critical concept. You must implement both approaches and compare them empirically.


Pre-Filtering (The Correct Way)


  1. Apply metadata filter during the ChromaDB .query() call using the where parameter

  2. ChromaDB narrows the candidate set first, then performs similarity search within that subset

  3. You should always get up to n_results items back (if enough documents match the filter)




Post-Filtering (The Problematic Way)


  1. Perform a regular similarity search without any filter (retrieve top K from the full collection)

  2. After receiving results, manually filter out chunks that don't match the desired metadata conditions

  3. You may end up with fewer than K results — or even zero results




Requirements:


  1. Implement both approaches as separate functions:


  • pre_filtered_search(query, filters, n_results) — uses ChromaDB's where parameter

  • post_filtered_search(query, filters, n_results) — queries without filter, then filters results in Python


  2. Run both functions with the same query and same filter for at least 3 different test cases:


  • A broad filter (many matching chunks)

  • A narrow filter (few matching chunks)

  • A very specific filter (only 1–2 matching chunks)


  3. For each test case, record and compare:


  • Number of results returned

  • Whether the correct/relevant chunks were found

  • Whether any results violate the filter conditions (should be 0 for pre-filter, possibly >0 in raw results for post-filter)


  4. Present your results in a comparison table:


Test Case | Filter | Pre-Filter Results | Post-Filter Results | Key Observation
... | ... | ... | ... | ...


  5. Write an analysis (8–12 sentences) that addresses:


  • Why does post-filtering frequently return fewer results than requested?

  • In what scenario could post-filtering return wrong results (violating security)?

  • Why is pre-filtering the correct approach for production systems?

  • Is there ever a valid reason to post-filter? If so, when?
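The heart of `post_filtered_search` is a Python re-implementation of the filter check. One way to sketch it, supporting the subset of ChromaDB operators used in this assignment (`$eq`, `$gte`, `$lte`, `$and`) — the function names are illustrative:

```python
OPS = {
    "$eq":  lambda value, target: value == target,
    "$gte": lambda value, target: value >= target,
    "$lte": lambda value, target: value <= target,
}

def matches(metadata, where):
    """Evaluate a (subset of) ChromaDB-style where-filter against one metadata dict."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    field, condition = next(iter(where.items()))
    op, target = next(iter(condition.items()))
    return field in metadata and OPS[op](metadata[field], target)

def post_filter(documents, metadatas, where, n_results):
    """Keep only results whose metadata satisfies the filter, AFTER retrieval."""
    kept = [(d, m) for d, m in zip(documents, metadatas) if matches(m, where)]
    return kept[:n_results]  # may return fewer than n_results -- that's the problem
```

Notice that `post_filter` can only discard results, never recover ones that similarity search left out of the top K — which is exactly why your narrow-filter test case should show it under-delivering.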




Task 6: Building Filter Helper Functions (10 Marks)

Build a small library of reusable helper functions that simplify filter construction.


Required Functions:


  1. build_access_filter(user_access_level: int) -> dict
     Returns a ChromaDB-compatible filter for access control.

  2. build_time_filter(after: int = None, before: int = None) -> dict
     Returns a time-range filter. Must handle:

     • Only after provided → $gte
     • Only before provided → $lte
     • Both provided → $and with both conditions

  3. build_department_filter(department: str) -> dict
     Returns a department equality filter.

  4. combine_filters(*filters) -> dict
     Takes multiple individual filter dicts and combines them into a single $and filter. Must handle edge cases:

     • Single filter → return it as-is (no wrapping)
     • Empty input → return empty dict or None



Requirements:


  • Each function must include a docstring explaining its behavior

  • Demonstrate all 4 functions individually, then show them combined

  • Show a real query using the combined output
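The four helpers can be sketched as follows; this is one reasonable implementation of the signatures above (here the empty-input edge case returns an empty dict rather than None):

```python
def build_access_filter(user_access_level: int) -> dict:
    """Filter to chunks at or below the user's access tier."""
    return {"access_level_num": {"$lte": user_access_level}}

def build_time_filter(after: int = None, before: int = None) -> dict:
    """Year-range filter handling one-sided and two-sided ranges."""
    clauses = []
    if after is not None:
        clauses.append({"year": {"$gte": after}})
    if before is not None:
        clauses.append({"year": {"$lte": before}})
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def build_department_filter(department: str) -> dict:
    """Exact-match filter on department."""
    return {"department": {"$eq": department}}

def combine_filters(*filters) -> dict:
    """Combine non-empty filters under $and; single filters pass through unwrapped."""
    non_empty = [f for f in filters if f]
    if not non_empty:
        return {}
    return non_empty[0] if len(non_empty) == 1 else {"$and": non_empty}
```

Combining all three builders reproduces the Task 4d example filter, which is a convenient sanity check for your own implementation.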





Deliverables




1. Code (Required)


  • Jupyter Notebook (.ipynb) — well-organized with clear section headers matching the task numbers

  • Code must be well-commented — explain your reasoning, not just what the code does

  • All outputs (print statements, tables) must be visible in the submitted notebook (run all cells before submitting)




2. Report (Required)


A short report (3–5 pages) covering:


  • Introduction — Brief description of the problem and your approach

  • Schema Design — Your chosen metadata schema and rationale

  • Chunking Approach — How you preserved metadata and key decisions made

  • Filtering Analysis — Summary of pre-filter vs post-filter comparison with your key findings

  • Challenges & Learnings — What was difficult, what surprised you, what you would do differently

  • Conclusion — How metadata filtering changes the quality of a RAG system


Format: PDF or DOCX




3. Output Samples (Required)

Include clearly labeled outputs for:


  • Sample chunks from Task 2 (showing metadata inheritance)

  • Query results from each filter type in Task 4

  • Pre-filter vs post-filter comparison table from Task 5





Submission Guidelines




Platform

Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).





File Naming Convention

<YourName>_MetadataFiltering_Assignment1.zip





ZIP Structure




<YourName>_MetadataFiltering_Assignment1/

├── notebook.ipynb

├── report.pdf

└── data/                  (optional — if you stored your documents as files)

    ├── hr_leave_policy.txt

    ├── eng_coding_standards.txt

    └── ...



Deadline

Submit within 7 days from assignment release date.




Late Submission Policy

Delay | Penalty
Up to 24 hours | 10% deduction
24–48 hours | 20% deduction
Beyond 48 hours | Not accepted





Important Instructions


  1. Implement everything yourself. You must write your own chunking logic, filter builders, and comparison functions. Demonstrate that you understand how each component works.

  2. Explain your reasoning clearly. Code alone is not enough — use markdown cells and comments to explain why you made specific decisions.

  3. Stick to the taught technology stack. Use ChromaDB as your vector database and OpenAI for embeddings. You may use tiktoken for token counting, pandas for tabular display, and numpy for numerical operations.

  4. Use of external libraries is permitted for utility tasks (e.g., formatting output), but all core logic (chunking, filtering, comparison) must be your own implementation.

  5. Plagiarism will result in disqualification. If you reference any external resource, cite it. Submitting copied code without understanding will be treated as academic dishonesty.

  6. Run all cells before submission. A notebook with missing outputs will lose marks.





Evaluation Rubric


Criteria | Marks
Task 1 — Metadata Schema Design | 10
Task 2 — Metadata-Preserving Chunking | 20
Task 3 — Ingestion into ChromaDB | 15
Task 4 — Implementing Pre-Filters | 25
Task 5 — Pre-Filter vs Post-Filter Comparison | 20
Task 6 — Filter Helper Functions | 10
Total | 100





Grading Breakdown


Grade Range | Interpretation
90–100 | Exceptional — all tasks complete, deep analysis, production-quality code
75–89 | Strong — all tasks complete, good analysis, minor gaps
60–74 | Satisfactory — most tasks complete, basic analysis
40–59 | Needs Improvement — several tasks incomplete or shallow
Below 40 | Unsatisfactory — major tasks missing or fundamentally incorrect





Guidance & Tips


  • Start with your dataset. Everything depends on having well-structured documents with proper metadata. Spend time here first.

  • Test your chunker on a single document before running it on the full set. Print intermediate outputs to verify metadata is flowing correctly.

  • Visualize your results. Use pandas DataFrames to display chunks and their metadata — it makes analysis much easier.

  • Think from a user's perspective. Ask yourself: "If I were an HR employee, would I see only what I should see?"

  • The pre-filter vs post-filter comparison is where most learning happens. Don't rush it — set up thoughtful test cases that clearly demonstrate the difference.

  • Don't over-engineer. A clean, correct, well-explained implementation is worth more than a complex one that is hard to follow.





Bonus (Optional — up to +10 Marks)


  • Multi-Tenant Isolation (+5): Add an organization_id field. Create documents for 2 organizations. Demonstrate that a query by Org A never returns Org B's documents, even if the content is semantically identical.

  • Filter Compliance Metric (+5): Implement a function that takes query results and a filter, and returns the percentage of results that satisfy the filter. Show that pre-filtering achieves 100% compliance while post-filtering may not.
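The compliance metric can be sketched as below; the private `_satisfies` helper is an illustrative, minimal evaluator for the operators used in this assignment (`$eq`, `$gte`, `$lte`, `$and`), and by convention an empty result set counts as 100% compliant:

```python
def _satisfies(meta, where):
    """Minimal evaluator for $eq / $gte / $lte / $and filters (illustrative)."""
    if "$and" in where:
        return all(_satisfies(meta, clause) for clause in where["$and"])
    field, condition = next(iter(where.items()))
    op, target = next(iter(condition.items()))
    if op == "$eq":
        return meta.get(field) == target
    if op == "$gte":
        return field in meta and meta[field] >= target
    if op == "$lte":
        return field in meta and meta[field] <= target
    raise ValueError(f"unsupported operator: {op}")

def filter_compliance(result_metadatas, where):
    """Percentage of returned results whose metadata satisfies the filter."""
    if not result_metadatas:
        return 100.0  # vacuously compliant: nothing returned, nothing violated
    ok = sum(1 for m in result_metadatas if _satisfies(m, where))
    return 100.0 * ok / len(result_metadatas)
```

Run this over the raw (unfiltered) retrieval results from Task 5 to quantify how often post-filtering's candidate set violates the filter, versus the guaranteed 100% for pre-filtered queries.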





Instructor Note

This assignment focuses on the foundational mechanics of metadata filtering: schema design, metadata-preserving chunking, and filtered retrieval. These are non-negotiable requirements for any production RAG system.


There is no single correct schema or perfect filter combination. What matters is:


  • Clarity of design — Can you justify your schema choices?

  • Correctness of implementation — Does metadata truly flow to every chunk?

  • Depth of comparison — Did you genuinely compare pre-filter vs post-filter with meaningful test cases?

  • Quality of reasoning — Do you understand why these techniques matter, not just how to code them?





Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.





