
Building a Smart Metadata-Driven Retrieval System






Course: Metadata Filtering

Level: Medium → Advanced

Type: Individual Assignment

Duration: 7–10 days





Objective

The objective of this assignment is to help you:


  • Build dynamic filter construction driven by user context (role, department, organization)

  • Implement query-time filter inference — extracting metadata filters from natural language queries

  • Design a hybrid search system that combines keyword matching, vector similarity, and metadata filtering

  • Enforce non-overridable security filters that protect against filter injection

  • Implement a smart search with fallback strategy for production resilience

  • Think like a system designer building retrieval that is both intelligent and secure


This assignment moves beyond individual filter types into designing an integrated, production-grade retrieval system where filters are applied automatically, inferred intelligently, and enforced strictly.





Problem Statement

You are building a company-wide intelligent search system. Employees from different departments, roles, and access levels ask natural language questions and expect accurate, scoped, secure results.


Your system must:


Automatically apply security filters based on who is asking, intelligently extract additional filters from what they are asking, combine both to retrieve precisely the right chunks, and fall back gracefully when filters are too restrictive.


The user should never need to manually specify filters — the system should figure it out. But security filters must never be bypassed, no matter what the user types.




Prerequisites

This assignment assumes you are comfortable with:


  • Metadata schema design and metadata-preserving chunking

  • ChromaDB filter syntax ($eq, $gte, $lte, $in, $and, $or)

  • Pre-filtering during retrieval

  • Building basic filter helper functions


If you are not confident in these areas, review the foundational course material (Notebooks 1–3) before starting.





Dataset Requirements

You must create or generate a synthetic dataset of at least 20 documents across:


  • At least 3 departments (e.g., HR, Engineering, Finance, Legal, Marketing)

  • At least 3 access levels (public, internal, confidential — with numeric equivalents 1–3)

  • At least 2 organizations (e.g., org_acme, org_globex) for multi-tenant scenarios

  • At least 2 different years (e.g., 2023, 2024, 2025)

  • Multiple categories per department (e.g., policy, procedure, memo, standard)


Each document should be 300–1000 words with realistic organizational content.





Required Metadata Fields


| Field            | Type    | Description                                |
| ---------------- | ------- | ------------------------------------------ |
| department       | string  | Originating department                     |
| access_level     | string  | Human-readable access tier                 |
| access_level_num | integer | Numeric access tier for range comparisons  |
| year             | integer | Year the document was created/last updated |
| category         | string  | Document category                          |
| organization_id  | string  | Tenant/organization identifier             |


Important: Ingest all documents into ChromaDB with metadata-preserving chunking before starting the tasks. This is a prerequisite, not a scored task — but your pipeline must work correctly for everything else to function.





Tasks & Requirements




Task 1: User Context & Automatic Filter Construction (20 Marks)

In production systems, users don't type filters — the system derives them from the user's session context.



1a. Define a User Session Model (5 Marks)

Create a UserSession class or dataclass with at least the following fields:



from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str              # e.g., "employee", "manager", "admin"
    access_level: int      # numeric (1=public, 2=internal, 3=confidential)
    organization_id: str



Create at least 4 different user profiles representing different access levels and departments. For example:


| User    | Department  | Role     | Access Level | Organization |
| ------- | ----------- | -------- | ------------ | ------------ |
| alice   | HR          | manager  | 3            | org_acme     |
| bob     | Engineering | employee | 2            | org_acme     |
| charlie | Finance     | employee | 1            | org_globex   |
| diana   | Engineering | admin    | 3            | org_globex   |
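The four profiles above can be created as plain `UserSession` objects. A minimal sketch (the `USERS` dictionary name and the inline dataclass are illustrative, not required by the assignment):

```python
from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str
    access_level: int
    organization_id: str

# The four example profiles from the table above
USERS = {
    "alice":   UserSession("alice",   "HR",          "manager",  3, "org_acme"),
    "bob":     UserSession("bob",     "Engineering", "employee", 2, "org_acme"),
    "charlie": UserSession("charlie", "Finance",     "employee", 1, "org_globex"),
    "diana":   UserSession("diana",   "Engineering", "admin",    3, "org_globex"),
}
```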



1b. Build a Context-Based Filter Constructor (15 Marks)

Implement a function:



def build_user_context_filter(user: UserSession) -> dict:

    ...


This function must:


  1. Always include an organization_id filter (multi-tenant isolation — non-negotiable)

  2. Always include an access_level_num filter based on the user's access level

  3. Optionally include a department filter — document your decision:

    • Should a user only see their own department's documents?

    • Or should they see all departments they have access to?

    • Justify your choice in comments or markdown
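A minimal sketch of one possible implementation, assuming ChromaDB filter syntax and choosing, as one of the documented options, to scope non-admin users to their own department:

```python
from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str
    access_level: int
    organization_id: str

def build_user_context_filter(user: UserSession) -> dict:
    """Derive non-overridable security filters from the session context."""
    clauses = [
        # 1. Multi-tenant isolation: always applied
        {"organization_id": {"$eq": user.organization_id}},
        # 2. Access ceiling: never return chunks above the user's level
        {"access_level_num": {"$lte": user.access_level}},
    ]
    # 3. Design choice (one valid option): non-admins see only their own
    #    department; admins see every department their level allows.
    if user.role != "admin":
        clauses.append({"department": {"$eq": user.department}})
    return {"$and": clauses}
```

Passing the returned dict as ChromaDB's `where` argument applies all clauses as a conjunction; note that `$and` requires at least two clauses, which this shape always satisfies.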



Requirements:


  • Run the same query (e.g., "What is the leave policy?") for all 4 users

  • Show that each user gets different results based on their context

  • Present results in a comparison table:


| User  | Org      | Access | Dept        | # Results | Departments in Results | Access Levels in Results |
| ----- | -------- | ------ | ----------- | --------- | ---------------------- | ------------------------ |
| alice | org_acme | 3      | HR          | ...       | ...                    | ...                      |
| bob   | org_acme | 2      | Engineering | ...       | ...                    | ...                      |


  • Verify that no user ever sees another organization's data

  • Verify that no user sees data above their access level




Task 2: Query-Time Filter Inference — Rule-Based (20 Marks)

Users express filtering intent through natural language. Your system must detect this and convert it to metadata filters.



2a. Implement a Rule-Based Filter Extractor (15 Marks)

Create a function:



def extract_filters_from_query(query: str) -> dict:

    ...



This function must handle at least the following patterns:


| Natural Language Pattern    | Extracted Filter                          |
| --------------------------- | ----------------------------------------- |
| "from 2024" or "since 2024" | {"year": {"$gte": 2024}}                  |
| "in 2023"                   | {"year": {"$eq": 2023}}                   |
| "latest" or "most recent"   | {"year": {"$gte": <current_year>}}        |
| "HR policies" or "from HR"  | {"department": {"$eq": "HR"}}             |
| "engineering docs"          | {"department": {"$eq": "Engineering"}}    |
| "confidential"              | {"access_level": {"$eq": "confidential"}} |



Implementation guidance:


  • Use string matching, keyword detection, or regular expressions

  • Handle case-insensitivity

  • Return an empty dict if no filter patterns are detected

  • The function should be extensible — adding new patterns should be straightforward
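A compact sketch covering the required patterns. The keyword tables here are illustrative; extending the extractor means adding entries to them rather than rewriting logic:

```python
import re
from datetime import date

# Illustrative keyword-to-value table; extend as your schema grows
DEPARTMENTS = {"hr": "HR", "engineering": "Engineering", "finance": "Finance"}

def extract_filters_from_query(query: str) -> dict:
    q = query.lower()
    filters: dict = {}

    # Year patterns: "from/since 2024" -> $gte, "in 2023" -> $eq
    if m := re.search(r"\b(?:from|since)\s+(\d{4})\b", q):
        filters["year"] = {"$gte": int(m.group(1))}
    elif m := re.search(r"\bin\s+(\d{4})\b", q):
        filters["year"] = {"$eq": int(m.group(1))}
    elif "latest" in q or "most recent" in q:
        filters["year"] = {"$gte": date.today().year}

    # Department keywords, matched on word boundaries to avoid substrings
    for keyword, dept in DEPARTMENTS.items():
        if re.search(rf"\b{keyword}\b", q):
            filters["department"] = {"$eq": dept}
            break

    if "confidential" in q:
        filters["access_level"] = {"$eq": "confidential"}

    return filters
```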




2b. Demonstrate with Test Queries (5 Marks)

Test your extractor on at least 6 different queries, including:


  • A query with a year reference

  • A query with a department reference

  • A query with both year and department

  • A query with "latest" or "recent"

  • A query with no filterable terms (should return empty)

  • An ambiguous query (discuss how your system handles it)


Present results as:


| Query                                | Extracted Filters                                     |
| ------------------------------------ | ----------------------------------------------------- |
| "What are the latest HR policies?"   | {"year": {"$gte": 2025}, "department": "HR"}          |
| "Show me engineering docs from 2024" | {"year": {"$gte": 2024}, "department": "Engineering"} |
| "How do I submit an expense report?" | {}                                                    |




Task 3: Query-Time Filter Inference — LLM-Based (15 Marks)

Rule-based extraction is limited. Now use an LLM to understand query intent and extract filters more intelligently.



3a. Implement an LLM-Based Filter Extractor (10 Marks)

Create a function:



def extract_filters_with_llm(query: str, schema_description: str) -> dict:

    ...


Requirements:


  1. Construct a prompt that:

    • Describes your metadata schema (which fields exist, what values are valid)

    • Provides the user's query

    • Asks the LLM to return a JSON object with any filters it can infer

    • Instructs the LLM to return {} if no filters can be inferred


  2. Parse the LLM's response into a Python dictionary

  3. Validate the LLM's output against your schema:

    • Are the field names valid?

    • Are the values within allowed ranges?

    • Discard any invalid fields (do not blindly trust LLM output)


  4. Handle edge cases:

    • LLM returns invalid JSON → fall back to empty filter

    • LLM hallucinates a field that doesn't exist → strip it out
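The validation step is the part worth sketching, since it is what keeps hallucinated output from ever reaching ChromaDB. The allow-list below is illustrative, and the LLM call itself is omitted so the parsing and validation logic stands alone:

```python
import json

# Illustrative allow-list: field name -> validator for its bare value
ALLOWED_FIELDS = {
    "department": lambda v: v in {"HR", "Engineering", "Finance"},
    "year": lambda v: isinstance(v, int) and 2000 <= v <= 2100,
    "category": lambda v: v in {"policy", "procedure", "memo", "standard"},
}

def validate_llm_filters(raw_response: str) -> dict:
    """Parse the LLM's JSON reply and keep only schema-valid fields."""
    try:
        candidate = json.loads(raw_response)
    except json.JSONDecodeError:
        return {}  # invalid JSON: fall back to an empty filter
    if not isinstance(candidate, dict):
        return {}
    validated = {}
    for field, value in candidate.items():
        check = ALLOWED_FIELDS.get(field)
        if check is None:
            continue  # hallucinated field: strip it out
        # Accept either a bare value or an operator dict like {"$gte": 2024}
        if isinstance(value, dict):
            if not value:
                continue
            bare = next(iter(value.values()))
        else:
            bare = value
        if check(bare):
            validated[field] = value
    return validated
```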



3b. Compare Rule-Based vs LLM-Based (5 Marks)

Run at least 5 queries through both extractors and compare:


| Query                                         | Rule-Based Output         | LLM-Based Output                        | Better? |
| --------------------------------------------- | ------------------------- | --------------------------------------- | ------- |
| "What's the current leave policy for HR?"     | {department: HR}          | {department: HR, year: >=2025}          | LLM     |
| "engineering deployment procedures from 2024" | {dept: Eng, year: >=2024} | {dept: Eng, year: 2024, cat: procedure} | LLM     |



Write a short analysis (5–8 sentences):


  • When does the LLM outperform rules?

  • When might rules be preferable (cost, latency, reliability)?

  • What risks does LLM-based inference introduce?




Task 4: Hybrid Search — Keyword + Vector + Metadata (15 Marks)

ChromaDB supports three search dimensions simultaneously. Build a hybrid search function that uses all three.




Requirements:

  1. Implement a function:



def hybrid_search(query: str, metadata_filter: dict, keyword_filter: str = None, n_results: int = 5) -> list:

    ...


This function must:


  • Use the where parameter for metadata filtering

  • Use the where_document parameter for keyword filtering (e.g., {"$contains": "deployment"})

  • Use query embeddings for semantic similarity



  2. Demonstrate with 3 test cases showing the value of combining all three:


Test Case A — Metadata + Vector (no keyword):


  • Query: "How do I handle incidents?"

  • Filter: department = Engineering

  • Observation: Gets engineering docs about incidents



Test Case B — Metadata + Vector + Keyword:


  • Query: "How do I handle incidents?"

  • Filter: department = Engineering

  • Keyword: "deployment" (via where_document)

  • Observation: Narrows to engineering docs about incidents that specifically mention deployment



Test Case C — Show the difference:


  • Run the same query with and without the keyword filter

  • Compare results — which is more precise for the user's actual intent?



  3. Write an analysis (5–8 sentences):


  • When is keyword filtering useful on top of vector + metadata?

  • What are the risks of keyword filtering (e.g., too restrictive for synonyms)?

  • How do the three dimensions complement each other?




Task 5: Safe Smart Search with Security Enforcement (20 Marks)

This is the capstone task. Combine everything into a production-grade smart search function that is both intelligent and secure.



5a. Implement Smart Search (15 Marks)

Create a function:



def smart_search(query: str, user: UserSession, n_results: int = 5, enable_fallback: bool = True) -> dict:

    ...



This function must follow this exact pipeline:


  1. Step 1 — Security Filters (NON-NEGOTIABLE)


  • Build user context filters from UserSession (organization + access level)

  • These filters must always be applied — they cannot be overridden



  2. Step 2 — Infer Additional Filters


  • Use your rule-based or LLM-based extractor to infer filters from the query

  • These are "nice-to-have" filters — they narrow the search but are not security-critical



  3. Step 3 — Combine & Search


  • Combine security filters + inferred filters using $and

  • Execute the filtered query against ChromaDB



  4. Step 4 — Fallback (if no results)


  • If combined filters return 0 results:


    • Drop the inferred filters (they may have been too restrictive)

    • Retry with only security filters

    • Flag the response as a fallback



  5. Return a structured result:




{
    "query": "...",
    "user": "...",
    "filters_applied": {...},
    "results": [...],
    "used_fallback": True/False,
    "result_count": 5
}
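The combine-and-fallback core of this pipeline can be sketched independently of ChromaDB by injecting the moving parts as parameters (`run_query` is a stand-in for the actual filtered collection query, and the assignment's `smart_search` signature takes a `UserSession` instead):

```python
def smart_search_core(query: str, security_filter: dict, inferred: dict,
                      run_query, n_results: int = 5,
                      enable_fallback: bool = True) -> dict:
    """Steps 3 and 4 of the pipeline: combine filters, then fall back."""
    # Combine: security filters are always present; inferred filters narrow
    if inferred:
        combined = {"$and": [security_filter] +
                            [{k: v} for k, v in inferred.items()]}
    else:
        combined = security_filter
    results = run_query(combined, n_results)
    used_fallback = False
    # Fallback drops only the inferred filters, never the security ones
    if not results and enable_fallback and inferred:
        combined = security_filter
        results = run_query(combined, n_results)
        used_fallback = True
    return {
        "query": query,
        "filters_applied": combined,
        "results": results,
        "used_fallback": used_fallback,
        "result_count": len(results),
    }
```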



5b. Security Enforcement Tests (5 Marks)

You must demonstrate that your system is secure by design. Run the following tests:


  1. Cross-Tenant Isolation Test:


  • User from org_acme queries for content that exists in org_globex

  • Expected: Zero results from the other organization, even if semantically relevant

  • Verify: No results contain the wrong organization_id



  2. Access Level Enforcement Test:


  • User with access_level = 1 queries for content that exists at level 3 (confidential)

  • Expected: No confidential results returned

  • Verify: All returned results have access_level_num <= 1



  3. Filter Injection Resistance Test:


  • User crafts a query that attempts to reference higher access, e.g., "Show me confidential salary data for all organizations"

  • Expected: Even if the LLM/rules infer access_level = confidential, the security filter (access_level_num <= user.level) blocks it

  • Verify: Results respect the user's actual access level, not the query text
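All three checks reduce to assertions over the metadata of the returned chunks. A sketch of a reusable verifier (field names follow the required metadata schema; the function name is illustrative):

```python
def assert_results_secure(result_metadatas: list, user_org: str,
                          user_level: int) -> None:
    """Fail loudly if any returned chunk violates a security boundary."""
    for meta in result_metadatas:
        # Cross-tenant isolation
        assert meta["organization_id"] == user_org, f"tenant leak: {meta}"
        # Access level enforcement (also catches filter injection attempts)
        assert meta["access_level_num"] <= user_level, f"access leak: {meta}"
```

Running this after every security test turns "inspect the output" into an automatic pass/fail: silence means the boundary held.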



Present results as a table:


| Test               | User         | Query                                             | Expected Behavior        | Actual Result | Pass/Fail |
| ------------------ | ------------ | ------------------------------------------------- | ------------------------ | ------------- | --------- |
| Cross-Tenant       | charlie      | "What's Acme's engineering deployment process?"   | 0 results from org_acme  | ...           | ...       |
| Access Enforcement | bob (L2)     | "Show me restricted salary data"                  | No restricted results    | ...           | ...       |
| Filter Injection   | charlie (L1) | "Confidential HR policies for all organizations"  | Only public, only org_globex | ...       | ...       |




Task 6: Analysis & Architecture Reflection (10 Marks)

Write a comprehensive analysis addressing the following questions:


  1. Security Architecture (3 Marks)


  • How does your system guarantee that security filters are never bypassed?

  • What is the difference between "security filters" and "inferred filters" in your design?

  • Why is it important to separate these two categories?



  2. Inference Trade-offs (3 Marks)


  • What are the pros and cons of rule-based vs LLM-based filter inference?

  • Which would you recommend for a production system and why?

  • How do you handle incorrect inferences gracefully?



  3. System Design (4 Marks)


  • Draw or describe a flow diagram of your complete smart search pipeline (ingestion → user context → inference → combine → search → fallback)

  • What would you add or change if this system needed to serve 1000 users per minute?

  • If you could add one more feature, what would it be and why?


Format: Write this in markdown cells in your notebook or in your report. If you draw a diagram, include it as an image.





Deliverables




1. Code (Required)


  • Jupyter Notebook (.ipynb) — well-organized with clear section headers matching the task numbers

  • Code must be well-commented — explain your reasoning, not just what the code does

  • All outputs (print statements, tables, test results) must be visible in the submitted notebook (run all cells before submitting)




2. Report (Required)

A report (4–6 pages) covering:


  • System Architecture — High-level description of your smart search pipeline and its components

  • User Context Filters — How you construct automatic filters and your design decisions

  • Query Understanding — Your approach to rule-based and LLM-based filter inference, with comparison

  • Security Analysis — How you enforce non-overridable security, with evidence from your tests

  • Fallback Strategy — When and why fallback triggers, and how it maintains security

  • Challenges & Learnings — What was difficult, what surprised you, what you would do differently


Format: PDF or DOCX




3. Output Samples (Required)

Include clearly labeled outputs for:


  • User context filter comparison (Task 1 — same query, different users)

  • Rule-based vs LLM-based inference comparison (Tasks 2 & 3)

  • Hybrid search demonstrations (Task 4)

  • Security test results (Task 5b)





Submission Guidelines




Platform

Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).




File Naming Convention

<YourName>_MetadataFiltering_Assignment2.zip




ZIP Structure



<YourName>_MetadataFiltering_Assignment2/

├── notebook.ipynb

├── report.pdf

└── data/                  (optional — if you stored your documents as files)

    ├── hr_leave_policy.txt

    ├── eng_coding_standards.txt

    └── ...




Deadline

Submit within 10 days from assignment release date.




Late Submission Policy

| Delay           | Penalty       |
| --------------- | ------------- |
| Up to 24 hours  | 10% deduction |
| 24–48 hours     | 20% deduction |
| Beyond 48 hours | Not accepted  |





Important Instructions


  1. Implement everything yourself. You must write your own user context filter builder, query inference functions, smart search pipeline, and security tests. Show that you understand how and why each component works.

  2. Explain your reasoning clearly. Code alone is not enough — use markdown cells and comments to explain why you made specific design decisions, especially around security enforcement.

  3. Stick to the taught technology stack. Use ChromaDB as your vector database and OpenAI for embeddings and LLM calls. You may use tiktoken for token counting, pandas for tabular display, and numpy for numerical operations.

  4. Use of external libraries is permitted for utility tasks (e.g., formatting output, regex), but all core logic must be your own implementation.

  5. Plagiarism will result in disqualification. If you reference any external resource, cite it. Submitting copied code without understanding will be treated as academic dishonesty.

  6. Run all cells before submission. A notebook with missing outputs will lose marks.

  7. Security is not optional. If your smart search allows cross-tenant data leakage or access level bypass, the Task 5 score will be significantly impacted regardless of other qualities.





Evaluation Rubric


| Criteria                                     | Marks |
| -------------------------------------------- | ----- |
| Task 1 — User Context & Automatic Filters    | 20    |
| Task 2 — Rule-Based Filter Inference         | 20    |
| Task 3 — LLM-Based Filter Inference          | 15    |
| Task 4 — Hybrid Search                       | 15    |
| Task 5 — Smart Search with Security          | 20    |
| Task 6 — Analysis & Architecture Reflection  | 10    |
| Total                                        | 100   |





Grading Breakdown

| Grade Range | Interpretation                                                                    |
| ----------- | --------------------------------------------------------------------------------- |
| 90–100      | Exceptional — all tasks complete, security airtight, deep analysis, production-quality |
| 75–89       | Strong — all tasks complete, security sound, good analysis, minor gaps            |
| 60–74       | Satisfactory — most tasks complete, security mostly correct, basic analysis       |
| 40–59       | Needs Improvement — several tasks incomplete or security gaps                     |
| Below 40    | Unsatisfactory — major tasks missing or security fundamentally broken             |





Guidance & Tips


  • Get your data ingested first. All tasks depend on having a working ChromaDB collection with properly metadata-tagged chunks. Reuse or adapt your data setup as needed.

  • Start with Task 1 (user context filters) — it's the foundation for Task 5. If your context filters don't work correctly, smart search won't either.

  • For LLM-based inference (Task 3), start with a simple prompt and iterate. The prompt should describe your schema clearly — the LLM needs to know what fields and values are valid.

  • Test security aggressively. Don't just show that correct queries work — show that malicious or edge-case queries are handled safely.

  • The fallback mechanism is subtle. Think carefully about when to fall back and what to fall back to. Dropping inferred filters is safe; dropping security filters is never acceptable.

  • Use pandas DataFrames for comparison tables and result displays — they make your analysis more readable and professional.

  • Don't over-engineer. A clean, correct, well-explained implementation beats a complex one with gaps. Focus on getting the pipeline right, then polish.





Bonus (Optional — up to +10 Marks)


  • Metadata-Driven Query Routing (+5): Implement a routing system that directs queries to different ChromaDB collections based on detected intent (e.g., HR queries → HR collection, Engineering queries → Engineering collection). Show that routing reduces search space and improves relevance.

  • Filter Transparency Dashboard (+5): For every smart search call, produce a clear summary showing the user: (a) what security filters were enforced, (b) what filters were inferred from their query, (c) whether a fallback was used, and (d) why. This mirrors real-world systems where filter transparency builds user trust.




Instructor Note


This assignment simulates real-world retrieval system design where intelligence and security must coexist.


In production, users don't specify filters — systems derive them. But derived filters must never compromise security boundaries. This tension between helpfulness (inferring what the user wants) and safety (enforcing what the user is allowed to access) is the central challenge.


There is no single correct architecture. What matters is:


  • Security is non-negotiable — Can a user ever see data they shouldn't?

  • Intelligence is practical — Does the system extract useful filters from natural language?

  • Resilience is built-in — Does the system degrade gracefully when filters are too restrictive?

  • Reasoning is clear — Can you articulate why your design works the way it does?






Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.



