
Building a Metadata-Aware Ingestion & Retrieval Pipeline




Course: Metadata Filtering

Level: Medium → Advanced

Type: Individual Assignment

Duration: 5–7 days





Objective

The objective of this assignment is to help you:


  • Understand why metadata filtering is essential for production RAG systems

  • Design a metadata schema for a real-world knowledge base

  • Implement metadata-preserving chunking so that chunk-level metadata is never lost

  • Build and apply pre-filters using ChromaDB's filter syntax

  • Compare pre-filtering vs post-filtering approaches with measurable outcomes

  • Think critically about access control, multi-tenancy, and time-sensitivity in retrieval


    By the end of this assignment, you will have built a fully functional metadata-aware ingestion and filtered retrieval pipeline from scratch — the foundation every production RAG system needs.





Problem Statement

You work as a developer at a mid-sized company. The company has decided to build an internal knowledge assistant powered by RAG. Documents come from multiple departments (e.g., HR, Engineering, Finance), carry different access levels (public, internal, confidential), and are updated across years.


Your task is to:


  • Design, implement, and evaluate a metadata-aware ingestion and filtered retrieval pipeline that ensures the right documents reach the right users — and only them.

  • Demonstrate that your system correctly handles department scoping, access control, time-based freshness, and combined filtering — all while preserving metadata at the chunk level.





Dataset Requirements

You must create or generate a synthetic dataset of at least 15 documents with the following characteristics:



Required Metadata Fields (per document)

Field | Type | Description | Example Values
department | string | Originating department | HR, Engineering, Marketing, Finance, Legal
access_level | string | Human-readable access tier | public, internal, confidential, restricted
access_level_num | integer | Numeric access tier (for range comparisons) | 1 (public) → 4 (restricted)
year | integer | Year the document was created/last updated | 2022, 2023, 2024, 2025
category | string | Document category | policy, procedure, standard, guide, memo



Required Distribution


  • At least 3 departments represented

  • At least 3 different access levels used

  • At least 2 different years represented

  • Documents should range from 300–1000 words each




Content Guidance

Write realistic organizational content. Examples:


  • HR: Leave policies, onboarding guides, code of conduct

  • Engineering: Coding standards, deployment procedures, incident response

  • Finance: Expense policies, budget guidelines, audit procedures


Note: You may write these documents manually or generate them programmatically. What matters is that they are realistic and well-tagged with metadata.
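If you choose the programmatic route, a generator loop over the required fields keeps the distribution requirements easy to satisfy. The sketch below is one possible starting point, not a prescribed solution — the field names follow the schema table above, while the titles and placeholder bodies are illustrative and should be replaced with realistic content.

```python
import random

# Assumed value pools -- adjust to match your own schema choices.
DEPARTMENTS = ["HR", "Engineering", "Finance"]
ACCESS = [("public", 1), ("internal", 2), ("confidential", 3)]
YEARS = [2023, 2024, 2025]
CATEGORIES = ["policy", "procedure", "guide"]

def generate_documents(n=15, seed=42):
    """Generate n synthetic, metadata-tagged documents (placeholder bodies)."""
    rng = random.Random(seed)
    docs = []
    for i in range(n):
        # Cycling guarantees at least 3 departments and 3 access levels.
        dept = DEPARTMENTS[i % len(DEPARTMENTS)]
        level, level_num = ACCESS[i % len(ACCESS)]
        docs.append({
            "id": f"doc_{i:03d}",
            "title": f"{dept} {CATEGORIES[i % len(CATEGORIES)]} #{i}",
            # ~360 words -- within the required 300-1000 word range.
            "text": f"Placeholder body for a {dept} document. " * 60,
            "metadata": {
                "department": dept,
                "access_level": level,
                "access_level_num": level_num,
                "year": rng.choice(YEARS),
                "category": rng.choice(CATEGORIES),
            },
        })
    return docs

docs = generate_documents()
print(len(docs), "documents,", {d["metadata"]["department"] for d in docs})
```

Replacing the placeholder bodies with hand-written policy text (even short paragraphs) will make the later retrieval comparisons much more meaningful.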





Tasks & Requirements




Task 1: Metadata Schema Design (10 Marks)

Design a formal metadata schema for your knowledge base.


Requirements:

Define your schema clearly, specifying:


  • Required fields — fields that every document MUST have

  • Filterable fields — fields that can be used as retrieval filters

  • Type constraints — expected data type for each field (string, integer, etc.)

  • Allowed values — enumeration of valid values where applicable (e.g., access levels)



Write a validation function validate_document_metadata(metadata, schema) that:


  • Checks if all required fields are present

  • Checks if field values match expected types

  • Checks if enumerated fields contain only allowed values

  • Returns clear error messages for invalid documents



Include a brief write-up (5–8 sentences) explaining:


  • Why you chose these specific fields

  • How each field supports a filtering use case



Deliverable: Schema definition (as a Python dictionary or dataclass) + validation function + explanation.
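One possible shape for the deliverable, sketched below under the assumption that the schema is a plain dictionary with `required`, `types`, and `allowed` sections (a dataclass works equally well):

```python
# Illustrative schema; field names follow the assignment's metadata table.
SCHEMA = {
    "required": ["department", "access_level", "access_level_num", "year", "category"],
    "types": {
        "department": str, "access_level": str,
        "access_level_num": int, "year": int, "category": str,
    },
    "allowed": {
        "access_level": {"public", "internal", "confidential", "restricted"},
        "access_level_num": {1, 2, 3, 4},
    },
}

def validate_document_metadata(metadata, schema):
    """Return a list of error strings; an empty list means the metadata is valid."""
    errors = []
    for field in schema["required"]:
        if field not in metadata:
            errors.append(f"missing required field: {field}")
    for field, expected in schema["types"].items():
        if field in metadata and not isinstance(metadata[field], expected):
            errors.append(f"{field}: expected {expected.__name__}, "
                          f"got {type(metadata[field]).__name__}")
    for field, allowed in schema["allowed"].items():
        if field in metadata and metadata[field] not in allowed:
            errors.append(f"{field}: {metadata[field]!r} is not an allowed value")
    return errors
```

Returning a list of errors (rather than raising on the first one) lets you report every problem with a document in a single pass, which is friendlier during bulk ingestion.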




Task 2: Metadata-Preserving Chunking (20 Marks)

This is the most critical step. When you split documents into chunks, metadata must flow from the parent document to every chunk.


Requirements:


  1. Implement a chunking function/class that:


  • Accepts a document (text + metadata)

  • Splits text into chunks of a configurable size (default: ~1000 characters) with configurable overlap (default: ~200 characters)

  • Copies (not references!) all parent document metadata to each chunk

  • Adds the following chunk-specific metadata to each chunk:

    • chunk_index — position of the chunk within the document (0-based)

    • total_chunks — total number of chunks from this document

    • source_doc_id — identifier of the parent document

    • source_doc_title — title of the parent document


  2. Demonstrate your chunker on at least 3 documents from different departments

  3. Print a clear output table showing, for each chunk:


  • Chunk ID

  • First 80 characters of text

  • Inherited metadata (department, access_level, year)

  • Chunk-specific metadata (chunk_index, total_chunks)


  4. Write a short explanation (5–8 sentences) answering:


  • What would go wrong if you used a reference instead of a copy for metadata?

  • Why is chunk_index / total_chunks useful in production?



Common Pitfall to Avoid: If you use chunk_metadata = doc_metadata (assignment, not a copy), then modifying one chunk's metadata will corrupt all other chunks from the same document. You must use .copy() or equivalent.
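A minimal sketch of the required chunker, assuming character-based splitting (you may prefer sentence- or token-based boundaries); note the `.copy()` on the parent metadata:

```python
def chunk_document(doc_id, title, text, metadata, chunk_size=1000, overlap=200):
    """Split text into overlapping chunks; each chunk gets a COPY of the parent metadata."""
    step = chunk_size - overlap
    starts = range(0, max(len(text), 1), step)
    pieces = [text[s:s + chunk_size] for s in starts]
    chunks = []
    for i, piece in enumerate(pieces):
        meta = metadata.copy()  # copy, not reference -- chunks must not share one dict
        meta.update({
            "chunk_index": i,                # 0-based position within the document
            "total_chunks": len(pieces),
            "source_doc_id": doc_id,
            "source_doc_title": title,
        })
        chunks.append({"id": f"{doc_id}_chunk_{i}", "text": piece, "metadata": meta})
    return chunks
```

Because each chunk carries its own copy, editing one chunk's metadata (say, during debugging) leaves its siblings untouched — exactly the behavior the pitfall note above warns you to verify.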




Task 3: Ingestion into ChromaDB (15 Marks)

Ingest your chunks (with metadata) into a ChromaDB collection.


Requirements:


  1. Create a ChromaDB collection with a meaningful name

  2. Generate embeddings for each chunk using OpenAI's embedding model (e.g., text-embedding-3-small)

  3. Store chunks with:

    • ids — unique chunk identifiers

    • documents — chunk text

    • embeddings — vector embeddings

    • metadatas — the full metadata dictionary for each chunk

  4. After ingestion, verify by:

    • Printing the total number of chunks stored

    • Querying a sample chunk and confirming its metadata is intact

    • Demonstrating that metadata fields are present and correct


Tip: Use tiktoken to verify that no chunk exceeds a reasonable token limit (e.g., 500 tokens) before embedding. Print a warning if any chunk is too large.
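The ingestion call can be sketched as below. The collection name `company_kb` is an arbitrary assumption; the heavy dependencies (chromadb, the OpenAI client) are imported lazily inside the function since they require installed packages and an API key, while the parallel-list construction is plain Python you can verify independently:

```python
def to_parallel_lists(chunks):
    """Arrange chunks into the parallel lists that collection.add expects."""
    ids = [c["id"] for c in chunks]
    documents = [c["text"] for c in chunks]
    metadatas = [c["metadata"] for c in chunks]
    return ids, documents, metadatas

def ingest_chunks(chunks, collection_name="company_kb"):
    """Sketch: embed chunks with OpenAI and store them in a ChromaDB collection."""
    import chromadb              # lazy imports: only needed at ingest time
    from openai import OpenAI

    ids, documents, metadatas = to_parallel_lists(chunks)
    oai = OpenAI()
    resp = oai.embeddings.create(model="text-embedding-3-small", input=documents)
    embeddings = [item.embedding for item in resp.data]

    client = chromadb.Client()
    collection = client.get_or_create_collection(collection_name)
    collection.add(ids=ids, documents=documents,
                   embeddings=embeddings, metadatas=metadatas)
    print(f"stored {collection.count()} chunks")
    return collection
```

For large datasets you would batch the embedding calls, but for 15 documents a single request per ingestion run is fine.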




Task 4: Implementing Pre-Filters (25 Marks)

Build four types of metadata filters and demonstrate each with a retrieval query.


4a. Time-Based Filter (5 Marks)


  • Build a filter that restricts results to documents from a specific year or year range

  • Example: "Show only results from 2024 or later"

  • ChromaDB syntax: {"year": {"$gte": 2024}}

  • Run a query with and without this filter and compare the results




4b. Access-Control Filter (5 Marks)


  • Build a filter that restricts results based on a user's access level

  • A user with access_level_num = 2 should only see chunks where access_level_num <= 2

  • ChromaDB syntax: {"access_level_num": {"$lte": 2}}

  • Run a query as a "public only" user and a "confidential" user — show the difference




4c. Department-Scoping Filter (5 Marks)

  • Build a filter that restricts results to a specific department

  • ChromaDB syntax: {"department": {"$eq": "HR"}}

  • Run a query for "leave policy" with department filter for HR vs without filter — show the difference




4d. Combined Filter (10 Marks)

  • Build a filter that combines at least 3 conditions using the $and operator

  • Example: "HR documents, from 2024 or later, with access level ≤ 2"

  • ChromaDB syntax:



{
    "$and": [
        {"department": {"$eq": "HR"}},
        {"year": {"$gte": 2024}},
        {"access_level_num": {"$lte": 2}}
    ]
}



  • Run the query and display the results with their metadata

  • Discuss: What happens if a query matches zero results? How would you handle that in production?



For each sub-task, present:


  • The filter used (as Python dict)

  • The query text

  • The returned results (chunk text excerpt + metadata)

  • A brief analysis (2–3 sentences) of what the filter accomplished
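For the zero-results discussion in 4d, one reasonable pattern is to detect the empty case explicitly and decide which filters may be relaxed. The sketch below assumes a results dict shaped like ChromaDB's `query()` output (lists-of-lists, one inner list per query); the function name and fallback message are illustrative:

```python
def handle_query_results(results, fallback_message="No documents matched your filters."):
    """Unpack a ChromaDB-style query result for a single query; handle zero matches."""
    docs = results.get("documents", [[]])[0]
    metas = results.get("metadatas", [[]])[0]
    if not docs:
        # Production options: relax the least critical filter (e.g. widen the
        # year range) and retry, or return an explicit "no results" answer.
        # NEVER silently drop the access-control filter to get more results.
        return {"hits": [], "message": fallback_message}
    return {"hits": list(zip(docs, metas)), "message": None}
```

The key design point is which filter is safe to relax: freshness and department scope are usability constraints, while access control is a security constraint and must stay fixed.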




Task 5: Pre-Filtering vs Post-Filtering Comparison (20 Marks)

This is a critical concept. You must implement both approaches and compare them empirically.


Pre-Filtering (The Correct Way)


  1. Apply metadata filter during the ChromaDB .query() call using the where parameter

  2. ChromaDB narrows the candidate set first, then performs similarity search within that subset

  3. You should always get up to n_results items back (if enough documents match the filter)




Post-Filtering (The Problematic Way)


  1. Perform a regular similarity search without any filter (retrieve top K from the full collection)

  2. After receiving results, manually filter out chunks that don't match the desired metadata conditions

  3. You may end up with fewer than K results — or even zero results




Requirements:


  1. Implement both approaches as separate functions:


  • pre_filtered_search(query, filters, n_results) — uses ChromaDB's where parameter

  • post_filtered_search(query, filters, n_results) — queries without filter, then filters results in Python


  2. Run both functions with the same query and same filter for at least 3 different test cases:


  • A broad filter (many matching chunks)

  • A narrow filter (few matching chunks)

  • A very specific filter (only 1–2 matching chunks)


  3. For each test case, record and compare:


  • Number of results returned

  • Whether the correct/relevant chunks were found

  • Whether any results violate the filter conditions (should be 0 for pre-filter, possibly >0 in raw results for post-filter)


  4. Present your results in a comparison table:


Test Case | Filter | Pre-Filter Results | Post-Filter Results | Key Observation
... | ... | ... | ... | ...


  5. Write an analysis (8–12 sentences) that addresses:


  • Why does post-filtering frequently return fewer results than requested?

  • In what scenario could post-filtering return wrong results (violating security)?

  • Why is pre-filtering the correct approach for production systems?

  • Is there ever a valid reason to post-filter? If so, when?
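The heart of `post_filtered_search` is a Python re-implementation of the filter check. One way to sketch it, supporting the subset of ChromaDB operators used in this assignment (`$eq`, `$gte`, `$lte`, `$and`) — the function names are illustrative:

```python
OPS = {
    "$eq":  lambda value, target: value == target,
    "$gte": lambda value, target: value >= target,
    "$lte": lambda value, target: value <= target,
}

def matches(metadata, where):
    """Evaluate a (subset of) ChromaDB-style where-filter against one metadata dict."""
    if "$and" in where:
        return all(matches(metadata, clause) for clause in where["$and"])
    field, condition = next(iter(where.items()))
    op, target = next(iter(condition.items()))
    return field in metadata and OPS[op](metadata[field], target)

def post_filter(documents, metadatas, where, n_results):
    """Keep only results whose metadata satisfies the filter, AFTER retrieval."""
    kept = [(d, m) for d, m in zip(documents, metadatas) if matches(m, where)]
    return kept[:n_results]  # may return fewer than n_results -- that's the problem
```

Notice that `post_filter` can only discard results, never recover ones that similarity search left out of the top K — which is exactly why your narrow-filter test case should show it under-delivering.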




Task 6: Building Filter Helper Functions (10 Marks)

Build a small library of reusable helper functions that simplify filter construction.


Required Functions:


  1. build_access_filter(user_access_level: int) -> dict
     Returns a ChromaDB-compatible filter for access control.

  2. build_time_filter(after: int = None, before: int = None) -> dict
     Returns a time-range filter. Must handle:

     • Only after provided → $gte
     • Only before provided → $lte
     • Both provided → $and with both conditions

  3. build_department_filter(department: str) -> dict
     Returns a department equality filter.

  4. combine_filters(*filters) -> dict
     Takes multiple individual filter dicts and combines them into a single $and filter. Must handle edge cases:

     • Single filter → return it as-is (no wrapping)
     • Empty input → return empty dict or None



Requirements:


  • Each function must include a docstring explaining its behavior

  • Demonstrate all 4 functions individually, then show them combined

  • Show a real query using the combined output
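The four helpers can be sketched as follows; this is one reasonable implementation of the signatures above (here the empty-input edge case returns an empty dict rather than None):

```python
def build_access_filter(user_access_level: int) -> dict:
    """Filter to chunks at or below the user's access tier."""
    return {"access_level_num": {"$lte": user_access_level}}

def build_time_filter(after: int = None, before: int = None) -> dict:
    """Year-range filter handling one-sided and two-sided ranges."""
    clauses = []
    if after is not None:
        clauses.append({"year": {"$gte": after}})
    if before is not None:
        clauses.append({"year": {"$lte": before}})
    if not clauses:
        return {}
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}

def build_department_filter(department: str) -> dict:
    """Exact-match filter on department."""
    return {"department": {"$eq": department}}

def combine_filters(*filters) -> dict:
    """Combine non-empty filters under $and; single filters pass through unwrapped."""
    non_empty = [f for f in filters if f]
    if not non_empty:
        return {}
    return non_empty[0] if len(non_empty) == 1 else {"$and": non_empty}
```

Combining all three builders reproduces the Task 4d example filter, which is a convenient sanity check for your own implementation.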





Deliverables




1. Code (Required)


  • Jupyter Notebook (.ipynb) — well-organized with clear section headers matching the task numbers

  • Code must be well-commented — explain your reasoning, not just what the code does

  • All outputs (print statements, tables) must be visible in the submitted notebook (run all cells before submitting)




2. Report (Required)


A short report (3–5 pages) covering:


  • Introduction — Brief description of the problem and your approach

  • Schema Design — Your chosen metadata schema and rationale

  • Chunking Approach — How you preserved metadata and key decisions made

  • Filtering Analysis — Summary of pre-filter vs post-filter comparison with your key findings

  • Challenges & Learnings — What was difficult, what surprised you, what you would do differently

  • Conclusion — How metadata filtering changes the quality of a RAG system


Format: PDF or DOCX




3. Output Samples (Required)

Include clearly labeled outputs for:


  • Sample chunks from Task 2 (showing metadata inheritance)

  • Query results from each filter type in Task 4

  • Pre-filter vs post-filter comparison table from Task 5





Submission Guidelines




Platform

Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).





File Naming Convention

<YourName>_MetadataFiltering_Assignment1.zip





ZIP Structure




<YourName>_MetadataFiltering_Assignment1/

├── notebook.ipynb

├── report.pdf

└── data/                  (optional — if you stored your documents as files)

    ├── hr_leave_policy.txt

    ├── eng_coding_standards.txt

    └── ...



Deadline

Submit within 7 days from assignment release date.




Late Submission Policy

Delay | Penalty
Up to 24 hours | 10% deduction
24–48 hours | 20% deduction
Beyond 48 hours | Not accepted





Important Instructions


  1. Implement everything yourself. You must write your own chunking logic, filter builders, and comparison functions. Demonstrate that you understand how each component works.

  2. Explain your reasoning clearly. Code alone is not enough — use markdown cells and comments to explain why you made specific decisions.

  3. Stick to the taught technology stack. Use ChromaDB as your vector database and OpenAI for embeddings. You may use tiktoken for token counting, pandas for tabular display, and numpy for numerical operations.

  4. Use of external libraries is permitted for utility tasks (e.g., formatting output), but all core logic (chunking, filtering, comparison) must be your own implementation.

  5. Plagiarism will result in disqualification. If you reference any external resource, cite it. Submitting copied code without understanding will be treated as academic dishonesty.

  6. Run all cells before submission. A notebook with missing outputs will lose marks.





Evaluation Rubric


Criteria | Marks
Task 1 — Metadata Schema Design | 10
Task 2 — Metadata-Preserving Chunking | 20
Task 3 — Ingestion into ChromaDB | 15
Task 4 — Implementing Pre-Filters | 25
Task 5 — Pre-Filter vs Post-Filter Comparison | 20
Task 6 — Filter Helper Functions | 10
Total | 100





Grading Breakdown


Grade Range | Interpretation
90–100 | Exceptional — all tasks complete, deep analysis, production-quality code
75–89 | Strong — all tasks complete, good analysis, minor gaps
60–74 | Satisfactory — most tasks complete, basic analysis
40–59 | Needs Improvement — several tasks incomplete or shallow
Below 40 | Unsatisfactory — major tasks missing or fundamentally incorrect





Guidance & Tips


  • Start with your dataset. Everything depends on having well-structured documents with proper metadata. Spend time here first.

  • Test your chunker on a single document before running it on the full set. Print intermediate outputs to verify metadata is flowing correctly.

  • Visualize your results. Use pandas DataFrames to display chunks and their metadata — it makes analysis much easier.

  • Think from a user's perspective. Ask yourself: "If I were an HR employee, would I see only what I should see?"

  • The pre-filter vs post-filter comparison is where most learning happens. Don't rush it — set up thoughtful test cases that clearly demonstrate the difference.

  • Don't over-engineer. A clean, correct, well-explained implementation is worth more than a complex one that is hard to follow.





Bonus (Optional — up to +10 Marks)


  • Multi-Tenant Isolation (+5): Add an organization_id field. Create documents for 2 organizations. Demonstrate that a query by Org A never returns Org B's documents, even if the content is semantically identical.

  • Filter Compliance Metric (+5): Implement a function that takes query results and a filter, and returns the percentage of results that satisfy the filter. Show that pre-filtering achieves 100% compliance while post-filtering may not.
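The compliance metric can be sketched as below; the private `_satisfies` helper is an illustrative, minimal evaluator for the operators used in this assignment (`$eq`, `$gte`, `$lte`, `$and`), and by convention an empty result set counts as 100% compliant:

```python
def _satisfies(meta, where):
    """Minimal evaluator for $eq / $gte / $lte / $and filters (illustrative)."""
    if "$and" in where:
        return all(_satisfies(meta, clause) for clause in where["$and"])
    field, condition = next(iter(where.items()))
    op, target = next(iter(condition.items()))
    if op == "$eq":
        return meta.get(field) == target
    if op == "$gte":
        return field in meta and meta[field] >= target
    if op == "$lte":
        return field in meta and meta[field] <= target
    raise ValueError(f"unsupported operator: {op}")

def filter_compliance(result_metadatas, where):
    """Percentage of returned results whose metadata satisfies the filter."""
    if not result_metadatas:
        return 100.0  # vacuously compliant: nothing returned, nothing violated
    ok = sum(1 for m in result_metadatas if _satisfies(m, where))
    return 100.0 * ok / len(result_metadatas)
```

Run this over the raw (unfiltered) retrieval results from Task 5 to quantify how often post-filtering's candidate set violates the filter, versus the guaranteed 100% for pre-filtered queries.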





Instructor Note

This assignment focuses on the foundational mechanics of metadata filtering: schema design, metadata-preserving chunking, and filtered retrieval. These are non-negotiable requirements for any production RAG system.


There is no single correct schema or perfect filter combination. What matters is:


  • Clarity of design — Can you justify your schema choices?

  • Correctness of implementation — Does metadata truly flow to every chunk?

  • Depth of comparison — Did you genuinely compare pre-filter vs post-filter with meaningful test cases?

  • Quality of reasoning — Do you understand why these techniques matter, not just how to code them?





Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.





