
Building a Smart Metadata-Driven Retrieval System






Course: Metadata Filtering

Level: Medium → Advanced

Type: Individual Assignment

Duration: 7–10 days





Objective

The objective of this assignment is to help you:


  • Build dynamic filter construction driven by user context (role, department, organization)

  • Implement query-time filter inference — extracting metadata filters from natural language queries

  • Design a hybrid search system that combines keyword matching, vector similarity, and metadata filtering

  • Enforce non-overridable security filters that protect against filter injection

  • Implement a smart search with fallback strategy for production resilience

  • Think like a system designer building retrieval that is both intelligent and secure


This assignment moves beyond individual filter types into designing an integrated, production-grade retrieval system where filters are applied automatically, inferred intelligently, and enforced strictly.





Problem Statement

You are building a company-wide intelligent search system. Employees from different departments, roles, and access levels ask natural language questions and expect accurate, scoped, secure results.


Your system must:


Automatically apply security filters based on who is asking, intelligently extract additional filters from what they are asking, combine both to retrieve precisely the right chunks, and fall back gracefully when filters are too restrictive.


The user should never need to manually specify filters — the system should figure it out. But security filters must never be bypassed, no matter what the user types.




Prerequisites

This assignment assumes you are comfortable with:


  • Metadata schema design and metadata-preserving chunking

  • ChromaDB filter syntax ($eq, $gte, $lte, $in, $and, $or)

  • Pre-filtering during retrieval

  • Building basic filter helper functions


If you are not confident in these areas, review the foundational course material (Notebooks 1–3) before starting.





Dataset Requirements

You must create or generate a synthetic dataset of at least 20 documents across:


  • At least 3 departments (e.g., HR, Engineering, Finance, Legal, Marketing)

  • At least 3 access levels (public, internal, confidential — with numeric equivalents 1–3)

  • At least 2 organizations (e.g., org_acme, org_globex) for multi-tenant scenarios

  • At least 2 different years (e.g., 2023, 2024, 2025)

  • Multiple categories per department (e.g., policy, procedure, memo, standard)


Each document should be 300–1000 words with realistic organizational content.





Required Metadata Fields


| Field            | Type    | Description                                |
| ---------------- | ------- | ------------------------------------------ |
| department       | string  | Originating department                     |
| access_level     | string  | Human-readable access tier                 |
| access_level_num | integer | Numeric access tier for range comparisons  |
| year             | integer | Year the document was created/last updated |
| category         | string  | Document category                          |
| organization_id  | string  | Tenant/organization identifier             |


Important: Ingest all documents into ChromaDB with metadata-preserving chunking before starting the tasks. This is a prerequisite, not a scored task — but your pipeline must work correctly for everything else to function.





Tasks & Requirements




Task 1: User Context & Automatic Filter Construction (20 Marks)

In production systems, users don't type filters — the system derives them from the user's session context.



1a. Define a User Session Model (5 Marks)

Create a UserSession class or dataclass with at least the following fields:



from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str              # e.g., "employee", "manager", "admin"
    access_level: int      # numeric (1=public, 2=internal, 3=confidential)
    organization_id: str



Create at least 4 different user profiles representing different access levels and departments. For example:


| User    | Department  | Role     | Access Level | Organization |
| ------- | ----------- | -------- | ------------ | ------------ |
| alice   | HR          | manager  | 3            | org_acme     |
| bob     | Engineering | employee | 2            | org_acme     |
| charlie | Finance     | employee | 1            | org_globex   |
| diana   | Engineering | admin    | 3            | org_globex   |
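The four profiles above can be created as plain `UserSession` objects. A minimal sketch (the `USERS` dictionary name and the inline dataclass are illustrative, not required by the assignment):

```python
from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str
    access_level: int
    organization_id: str

# The four example profiles from the table above
USERS = {
    "alice":   UserSession("alice",   "HR",          "manager",  3, "org_acme"),
    "bob":     UserSession("bob",     "Engineering", "employee", 2, "org_acme"),
    "charlie": UserSession("charlie", "Finance",     "employee", 1, "org_globex"),
    "diana":   UserSession("diana",   "Engineering", "admin",    3, "org_globex"),
}
```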



1b. Build a Context-Based Filter Constructor (15 Marks)

Implement a function:



def build_user_context_filter(user: UserSession) -> dict:

    ...


This function must:


  1. Always include an organization_id filter (multi-tenant isolation — non-negotiable)

  2. Always include an access_level_num filter based on the user's access level

  3. Optionally include a department filter — document your decision:

    • Should a user only see their own department's documents?

    • Or should they see all departments they have access to?

    • Justify your choice in comments or markdown
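A minimal sketch of one possible implementation, assuming ChromaDB filter syntax and choosing, as one of the documented options, to scope non-admin users to their own department:

```python
from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str
    access_level: int
    organization_id: str

def build_user_context_filter(user: UserSession) -> dict:
    """Derive non-overridable security filters from the session context."""
    clauses = [
        # 1. Multi-tenant isolation: always applied
        {"organization_id": {"$eq": user.organization_id}},
        # 2. Access ceiling: never return chunks above the user's level
        {"access_level_num": {"$lte": user.access_level}},
    ]
    # 3. Design choice (one valid option): non-admins see only their own
    #    department; admins see every department their level allows.
    if user.role != "admin":
        clauses.append({"department": {"$eq": user.department}})
    return {"$and": clauses}
```

Passing the returned dict as ChromaDB's `where` argument applies all clauses as a conjunction; note that `$and` requires at least two clauses, which this shape always satisfies.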



Requirements:


  • Run the same query (e.g., "What is the leave policy?") for all 4 users

  • Show that each user gets different results based on their context

  • Present results in a comparison table:


| User  | Org      | Access | Dept        | # Results | Departments in Results | Access Levels in Results |
| ----- | -------- | ------ | ----------- | --------- | ---------------------- | ------------------------ |
| alice | org_acme | 3      | HR          | ...       | ...                    | ...                      |
| bob   | org_acme | 2      | Engineering | ...       | ...                    | ...                      |


  • Verify that no user ever sees another organization's data

  • Verify that no user sees data above their access level




Task 2: Query-Time Filter Inference — Rule-Based (20 Marks)

Users express filtering intent through natural language. Your system must detect this and convert it to metadata filters.



2a. Implement a Rule-Based Filter Extractor (15 Marks)

Create a function:



def extract_filters_from_query(query: str) -> dict:

    ...



This function must handle at least the following patterns:


| Natural Language Pattern    | Extracted Filter                          |
| --------------------------- | ----------------------------------------- |
| "from 2024" or "since 2024" | {"year": {"$gte": 2024}}                  |
| "in 2023"                   | {"year": {"$eq": 2023}}                   |
| "latest" or "most recent"   | {"year": {"$gte": <current_year>}}        |
| "HR policies" or "from HR"  | {"department": {"$eq": "HR"}}             |
| "engineering docs"          | {"department": {"$eq": "Engineering"}}    |
| "confidential"              | {"access_level": {"$eq": "confidential"}} |



Implementation guidance:


  • Use string matching, keyword detection, or regular expressions

  • Handle case-insensitivity

  • Return an empty dict if no filter patterns are detected

  • The function should be extensible — adding new patterns should be straightforward
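A compact sketch covering the required patterns. The keyword tables here are illustrative; extending the extractor means adding entries to them rather than rewriting logic:

```python
import re
from datetime import date

# Illustrative keyword-to-value table; extend as your schema grows
DEPARTMENTS = {"hr": "HR", "engineering": "Engineering", "finance": "Finance"}

def extract_filters_from_query(query: str) -> dict:
    q = query.lower()
    filters: dict = {}

    # Year patterns: "from/since 2024" -> $gte, "in 2023" -> $eq
    if m := re.search(r"\b(?:from|since)\s+(\d{4})\b", q):
        filters["year"] = {"$gte": int(m.group(1))}
    elif m := re.search(r"\bin\s+(\d{4})\b", q):
        filters["year"] = {"$eq": int(m.group(1))}
    elif "latest" in q or "most recent" in q:
        filters["year"] = {"$gte": date.today().year}

    # Department keywords, matched on word boundaries to avoid substrings
    for keyword, dept in DEPARTMENTS.items():
        if re.search(rf"\b{keyword}\b", q):
            filters["department"] = {"$eq": dept}
            break

    if "confidential" in q:
        filters["access_level"] = {"$eq": "confidential"}

    return filters
```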




2b. Demonstrate with Test Queries (5 Marks)

Test your extractor on at least 6 different queries, including:


  • A query with a year reference

  • A query with a department reference

  • A query with both year and department

  • A query with "latest" or "recent"

  • A query with no filterable terms (should return empty)

  • An ambiguous query (discuss how your system handles it)


Present results as:


| Query                                | Extracted Filters                                     |
| ------------------------------------ | ----------------------------------------------------- |
| "What are the latest HR policies?"   | {"year": {"$gte": 2025}, "department": "HR"}          |
| "Show me engineering docs from 2024" | {"year": {"$gte": 2024}, "department": "Engineering"} |
| "How do I submit an expense report?" | {}                                                    |




Task 3: Query-Time Filter Inference — LLM-Based (15 Marks)

Rule-based extraction is limited. Now use an LLM to understand query intent and extract filters more intelligently.



3a. Implement an LLM-Based Filter Extractor (10 Marks)

Create a function:



def extract_filters_with_llm(query: str, schema_description: str) -> dict:

    ...


Requirements:


  1. Construct a prompt that:

    • Describes your metadata schema (which fields exist, what values are valid)

    • Provides the user's query

    • Asks the LLM to return a JSON object with any filters it can infer

    • Instructs the LLM to return {} if no filters can be inferred


  2. Parse the LLM's response into a Python dictionary

  3. Validate the LLM's output against your schema:

    • Are the field names valid?

    • Are the values within allowed ranges?

    • Discard any invalid fields (do not blindly trust LLM output)


  4. Handle edge cases:

    • LLM returns invalid JSON → fall back to empty filter

    • LLM hallucinates a field that doesn't exist → strip it out
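The validation step is the part worth sketching, since it is what keeps hallucinated output from ever reaching ChromaDB. The allow-list below is illustrative, and the LLM call itself is omitted so the parsing and validation logic stands alone:

```python
import json

# Illustrative allow-list: field name -> validator for its bare value
ALLOWED_FIELDS = {
    "department": lambda v: v in {"HR", "Engineering", "Finance"},
    "year": lambda v: isinstance(v, int) and 2000 <= v <= 2100,
    "category": lambda v: v in {"policy", "procedure", "memo", "standard"},
}

def validate_llm_filters(raw_response: str) -> dict:
    """Parse the LLM's JSON reply and keep only schema-valid fields."""
    try:
        candidate = json.loads(raw_response)
    except json.JSONDecodeError:
        return {}  # invalid JSON: fall back to an empty filter
    if not isinstance(candidate, dict):
        return {}
    validated = {}
    for field, value in candidate.items():
        check = ALLOWED_FIELDS.get(field)
        if check is None:
            continue  # hallucinated field: strip it out
        # Accept either a bare value or an operator dict like {"$gte": 2024}
        if isinstance(value, dict):
            if not value:
                continue
            bare = next(iter(value.values()))
        else:
            bare = value
        if check(bare):
            validated[field] = value
    return validated
```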



3b. Compare Rule-Based vs LLM-Based (5 Marks)

Run at least 5 queries through both extractors and compare:


| Query                                         | Rule-Based Output         | LLM-Based Output                        | Better? |
| --------------------------------------------- | ------------------------- | --------------------------------------- | ------- |
| "What's the current leave policy for HR?"     | {department: HR}          | {department: HR, year: >=2025}          | LLM     |
| "engineering deployment procedures from 2024" | {dept: Eng, year: >=2024} | {dept: Eng, year: 2024, cat: procedure} | LLM     |



Write a short analysis (5–8 sentences):


  • When does the LLM outperform rules?

  • When might rules be preferable (cost, latency, reliability)?

  • What risks does LLM-based inference introduce?




Task 4: Hybrid Search — Keyword + Vector + Metadata (15 Marks)

ChromaDB supports three search dimensions simultaneously. Build a hybrid search function that uses all three.




Requirements:

  1. Implement a function:



def hybrid_search(query: str, metadata_filter: dict, keyword_filter: str = None, n_results: int = 5) -> list:

    ...


This function must:


  • Use the where parameter for metadata filtering

  • Use the where_document parameter for keyword filtering (e.g., {"$contains": "deployment"})

  • Use query embeddings for semantic similarity



  2. Demonstrate with 3 test cases showing the value of combining all three:


Test Case A — Metadata + Vector (no keyword):


  • Query: "How do I handle incidents?"

  • Filter: department = Engineering

  • Observation: Gets engineering docs about incidents



Test Case B — Metadata + Vector + Keyword:


  • Query: "How do I handle incidents?"

  • Filter: department = Engineering

  • Keyword: "deployment" (via where_document)

  • Observation: Narrows to engineering docs about incidents that specifically mention deployment



Test Case C — Show the difference:


  • Run the same query with and without the keyword filter

  • Compare results — which is more precise for the user's actual intent?



  3. Write an analysis (5–8 sentences):


  • When is keyword filtering useful on top of vector + metadata?

  • What are the risks of keyword filtering (e.g., too restrictive for synonyms)?

  • How do the three dimensions complement each other?




Task 5: Safe Smart Search with Security Enforcement (20 Marks)

This is the capstone task. Combine everything into a production-grade smart search function that is both intelligent and secure.



5a. Implement Smart Search (15 Marks)

Create a function:



def smart_search(query: str, user: UserSession, n_results: int = 5, enable_fallback: bool = True) -> dict:

    ...



This function must follow this exact pipeline:


  1. Step 1 — Security Filters (NON-NEGOTIABLE)


  • Build user context filters from UserSession (organization + access level)

  • These filters must always be applied — they cannot be overridden



  2. Step 2 — Infer Additional Filters


  • Use your rule-based or LLM-based extractor to infer filters from the query

  • These are "nice-to-have" filters — they narrow the search but are not security-critical



  3. Step 3 — Combine & Search


  • Combine security filters + inferred filters using $and

  • Execute the filtered query against ChromaDB



  4. Step 4 — Fallback (if no results)


  • If combined filters return 0 results:


    • Drop the inferred filters (they may have been too restrictive)

    • Retry with only security filters

    • Flag the response as a fallback



  5. Return a structured result:




{
    "query": "...",
    "user": "...",
    "filters_applied": {...},
    "results": [...],
    "used_fallback": True/False,
    "result_count": 5
}
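The combine-and-fallback core of this pipeline can be sketched independently of ChromaDB by injecting the moving parts as parameters (`run_query` is a stand-in for the actual filtered collection query, and the assignment's `smart_search` signature takes a `UserSession` instead):

```python
def smart_search_core(query: str, security_filter: dict, inferred: dict,
                      run_query, n_results: int = 5,
                      enable_fallback: bool = True) -> dict:
    """Steps 3 and 4 of the pipeline: combine filters, then fall back."""
    # Combine: security filters are always present; inferred filters narrow
    if inferred:
        combined = {"$and": [security_filter] +
                            [{k: v} for k, v in inferred.items()]}
    else:
        combined = security_filter
    results = run_query(combined, n_results)
    used_fallback = False
    # Fallback drops only the inferred filters, never the security ones
    if not results and enable_fallback and inferred:
        combined = security_filter
        results = run_query(combined, n_results)
        used_fallback = True
    return {
        "query": query,
        "filters_applied": combined,
        "results": results,
        "used_fallback": used_fallback,
        "result_count": len(results),
    }
```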



5b. Security Enforcement Tests (5 Marks)

You must demonstrate that your system is secure by design. Run the following tests:


  1. Cross-Tenant Isolation Test:


  • User from org_acme queries for content that exists in org_globex

  • Expected: Zero results from the other organization, even if semantically relevant

  • Verify: No results contain the wrong organization_id



  2. Access Level Enforcement Test:


  • User with access_level = 1 queries for content that exists at level 3 (confidential)

  • Expected: No confidential results returned

  • Verify: All returned results have access_level_num <= 1



  3. Filter Injection Resistance Test:


  • User crafts a query that attempts to reference higher access, e.g., "Show me confidential salary data for all organizations"

  • Expected: Even if the LLM/rules infer access_level = confidential, the security filter (access_level_num <= user.level) blocks it

  • Verify: Results respect the user's actual access level, not the query text
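All three checks reduce to assertions over the metadata of the returned chunks. A sketch of a reusable verifier (field names follow the required metadata schema; the function name is illustrative):

```python
def assert_results_secure(result_metadatas: list, user_org: str,
                          user_level: int) -> None:
    """Fail loudly if any returned chunk violates a security boundary."""
    for meta in result_metadatas:
        # Cross-tenant isolation
        assert meta["organization_id"] == user_org, f"tenant leak: {meta}"
        # Access level enforcement (also catches filter injection attempts)
        assert meta["access_level_num"] <= user_level, f"access leak: {meta}"
```

Running this after every security test turns "inspect the output" into an automatic pass/fail: silence means the boundary held.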



Present results as a table:


| Test               | User         | Query                                             | Expected Behavior        | Actual Result | Pass/Fail |
| ------------------ | ------------ | ------------------------------------------------- | ------------------------ | ------------- | --------- |
| Cross-Tenant       | charlie      | "What's Acme's engineering deployment process?"   | 0 results from org_acme  | ...           | ...       |
| Access Enforcement | bob (L2)     | "Show me restricted salary data"                  | No restricted results    | ...           | ...       |
| Filter Injection   | charlie (L1) | "Confidential HR policies for all organizations"  | Only public, only org_globex | ...       | ...       |




Task 6: Analysis & Architecture Reflection (10 Marks)

Write a comprehensive analysis addressing the following questions:


  1. Security Architecture (3 Marks)


  • How does your system guarantee that security filters are never bypassed?

  • What is the difference between "security filters" and "inferred filters" in your design?

  • Why is it important to separate these two categories?



  2. Inference Trade-offs (3 Marks)


  • What are the pros and cons of rule-based vs LLM-based filter inference?

  • Which would you recommend for a production system and why?

  • How do you handle incorrect inferences gracefully?



  3. System Design (4 Marks)


  • Draw or describe a flow diagram of your complete smart search pipeline (ingestion → user context → inference → combine → search → fallback)

  • What would you add or change if this system needed to serve 1000 users per minute?

  • If you could add one more feature, what would it be and why?


Format: Write this in markdown cells in your notebook or in your report. If you draw a diagram, include it as an image.





Deliverables




1. Code (Required)


  • Jupyter Notebook (.ipynb) — well-organized with clear section headers matching the task numbers

  • Code must be well-commented — explain your reasoning, not just what the code does

  • All outputs (print statements, tables, test results) must be visible in the submitted notebook (run all cells before submitting)




2. Report (Required)

A report (4–6 pages) covering:


  • System Architecture — High-level description of your smart search pipeline and its components

  • User Context Filters — How you construct automatic filters and your design decisions

  • Query Understanding — Your approach to rule-based and LLM-based filter inference, with comparison

  • Security Analysis — How you enforce non-overridable security, with evidence from your tests

  • Fallback Strategy — When and why fallback triggers, and how it maintains security

  • Challenges & Learnings — What was difficult, what surprised you, what you would do differently


Format: PDF or DOCX




3. Output Samples (Required)

Include clearly labeled outputs for:


  • User context filter comparison (Task 1 — same query, different users)

  • Rule-based vs LLM-based inference comparison (Tasks 2 & 3)

  • Hybrid search demonstrations (Task 4)

  • Security test results (Task 5b)





Submission Guidelines




Platform

Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).




File Naming Convention

<YourName>_MetadataFiltering_Assignment2.zip




ZIP Structure



<YourName>_MetadataFiltering_Assignment2/

├── notebook.ipynb

├── report.pdf

└── data/                  (optional — if you stored your documents as files)

    ├── hr_leave_policy.txt

    ├── eng_coding_standards.txt

    └── ...




Deadline

Submit within 10 days from assignment release date.




Late Submission Policy

| Delay           | Penalty       |
| --------------- | ------------- |
| Up to 24 hours  | 10% deduction |
| 24–48 hours     | 20% deduction |
| Beyond 48 hours | Not accepted  |





Important Instructions


  1. Implement everything yourself. You must write your own user context filter builder, query inference functions, smart search pipeline, and security tests. Show that you understand how and why each component works.

  2. Explain your reasoning clearly. Code alone is not enough — use markdown cells and comments to explain why you made specific design decisions, especially around security enforcement.

  3. Stick to the taught technology stack. Use ChromaDB as your vector database and OpenAI for embeddings and LLM calls. You may use tiktoken for token counting, pandas for tabular display, and numpy for numerical operations.

  4. Use of external libraries is permitted for utility tasks (e.g., formatting output, regex), but all core logic must be your own implementation.

  5. Plagiarism will result in disqualification. If you reference any external resource, cite it. Submitting copied code without understanding will be treated as academic dishonesty.

  6. Run all cells before submission. A notebook with missing outputs will lose marks.

  7. Security is not optional. If your smart search allows cross-tenant data leakage or access level bypass, the Task 5 score will be significantly impacted regardless of other qualities.





Evaluation Rubric


| Criteria                                     | Marks |
| -------------------------------------------- | ----- |
| Task 1 — User Context & Automatic Filters    | 20    |
| Task 2 — Rule-Based Filter Inference         | 20    |
| Task 3 — LLM-Based Filter Inference          | 15    |
| Task 4 — Hybrid Search                       | 15    |
| Task 5 — Smart Search with Security          | 20    |
| Task 6 — Analysis & Architecture Reflection  | 10    |
| Total                                        | 100   |





Grading Breakdown

| Grade Range | Interpretation                                                                    |
| ----------- | --------------------------------------------------------------------------------- |
| 90–100      | Exceptional — all tasks complete, security airtight, deep analysis, production-quality |
| 75–89       | Strong — all tasks complete, security sound, good analysis, minor gaps            |
| 60–74       | Satisfactory — most tasks complete, security mostly correct, basic analysis       |
| 40–59       | Needs Improvement — several tasks incomplete or security gaps                     |
| Below 40    | Unsatisfactory — major tasks missing or security fundamentally broken             |





Guidance & Tips


  • Get your data ingested first. All tasks depend on having a working ChromaDB collection with properly metadata-tagged chunks. Reuse or adapt your data setup as needed.

  • Start with Task 1 (user context filters) — it's the foundation for Task 5. If your context filters don't work correctly, smart search won't either.

  • For LLM-based inference (Task 3), start with a simple prompt and iterate. The prompt should describe your schema clearly — the LLM needs to know what fields and values are valid.

  • Test security aggressively. Don't just show that correct queries work — show that malicious or edge-case queries are handled safely.

  • The fallback mechanism is subtle. Think carefully about when to fall back and what to fall back to. Dropping inferred filters is safe; dropping security filters is never acceptable.

  • Use pandas DataFrames for comparison tables and result displays — they make your analysis more readable and professional.

  • Don't over-engineer. A clean, correct, well-explained implementation beats a complex one with gaps. Focus on getting the pipeline right, then polish.





Bonus (Optional — up to +10 Marks)


  • Metadata-Driven Query Routing (+5): Implement a routing system that directs queries to different ChromaDB collections based on detected intent (e.g., HR queries → HR collection, Engineering queries → Engineering collection). Show that routing reduces search space and improves relevance.

  • Filter Transparency Dashboard (+5): For every smart search call, produce a clear summary showing the user: (a) what security filters were enforced, (b) what filters were inferred from their query, (c) whether a fallback was used, and (d) why. This mirrors real-world systems where filter transparency builds user trust.




Instructor Note


This assignment simulates real-world retrieval system design where intelligence and security must coexist.


In production, users don't specify filters — systems derive them. But derived filters must never compromise security boundaries. This tension between helpfulness (inferring what the user wants) and safety (enforcing what the user is allowed to access) is the central challenge.


There is no single correct architecture. What matters is:


  • Security is non-negotiable — Can a user ever see data they shouldn't?

  • Intelligence is practical — Does the system extract useful filters from natural language?

  • Resilience is built-in — Does the system degrade gracefully when filters are too restrictive?

  • Reasoning is clear — Can you articulate why your design works the way it does?






Call to Action

Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?


Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.


Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.




Get Started Today



Schedule an AI & Data Science Consultation:

Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.




Request a Custom AI Demo:

Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.









Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.


Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.


Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.



