Building a Smart Metadata-Driven Retrieval System

Course: Metadata Filtering
Level: Medium → Advanced
Type: Individual Assignment
Duration: 7–10 days
Objective
The objective of this assignment is to help you:
Build dynamic filter construction driven by user context (role, department, organization)
Implement query-time filter inference — extracting metadata filters from natural language queries
Design a hybrid search system that combines keyword matching, vector similarity, and metadata filtering
Enforce non-overridable security filters that protect against filter injection
Implement a smart search with fallback strategy for production resilience
Think like a system designer building retrieval that is both intelligent and secure
This assignment moves beyond individual filter types into designing an integrated, production-grade retrieval system where filters are applied automatically, inferred intelligently, and enforced strictly.
Problem Statement
You are building a company-wide intelligent search system. Employees from different departments, roles, and access levels ask natural language questions and expect accurate, scoped, secure results.
Your system must:
Automatically apply security filters based on who is asking
Intelligently extract additional filters from what they are asking
Combine both to retrieve precisely the right chunks
Fall back gracefully when filters are too restrictive
The user should never need to manually specify filters — the system should figure it out. But security filters must never be bypassed, no matter what the user types.
Prerequisites
This assignment assumes you are comfortable with:
Metadata schema design and metadata-preserving chunking
ChromaDB filter syntax ($eq, $gte, $lte, $in, $and, $or)
Pre-filtering during retrieval
Building basic filter helper functions
If you are not confident in these areas, review the foundational course material (Notebooks 1–3) before starting.
Dataset Requirements
You must create or generate a synthetic dataset of at least 20 documents across:
At least 3 departments (e.g., HR, Engineering, Finance, Legal, Marketing)
At least 3 access levels (public, internal, confidential — with numeric equivalents 1–3)
At least 2 organizations (e.g., org_acme, org_globex) for multi-tenant scenarios
At least 2 different years (e.g., 2023, 2024, 2025)
Multiple categories per department (e.g., policy, procedure, memo, standard)
Each document should be 300–1000 words with realistic organizational content.
Required Metadata Fields
| Field | Type | Description |
| --- | --- | --- |
| department | string | Originating department |
| access_level | string | Human-readable access tier |
| access_level_num | integer | Numeric access tier for range comparisons |
| year | integer | Year the document was created/last updated |
| category | string | Document category |
| organization_id | string | Tenant/organization identifier |
Important: Ingest all documents into ChromaDB with metadata-preserving chunking before starting the tasks. This is a prerequisite, not a scored task — but your pipeline must work correctly for everything else to function.
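Metadata-preserving chunking boils down to copying each document's metadata dict onto every chunk before ingesting. A minimal sketch (the word-window chunk size and id scheme are illustrative assumptions, not requirements):

```python
def chunk_with_metadata(doc_id: str, text: str, metadata: dict, chunk_size: int = 200) -> list:
    """Naive word-window chunker that copies the parent document's
    metadata onto every chunk, so filters work at chunk granularity."""
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size):
        chunks.append({
            "id": f"{doc_id}_chunk{i // chunk_size}",
            "text": " ".join(words[i:i + chunk_size]),
            "metadata": dict(metadata),  # each chunk inherits all doc-level fields
        })
    return chunks
```

The resulting ids, texts, and metadatas can then be passed straight to your ChromaDB collection's add call.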
Tasks & Requirements
Task 1: User Context & Automatic Filter Construction (20 Marks)
In production systems, users don't type filters — the system derives them from the user's session context.
1a. Define a User Session Model (5 Marks)
Create a UserSession class or dataclass with at least the following fields:
```python
from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str            # e.g., "employee", "manager", "admin"
    access_level: int    # numeric (1=public, 2=internal, 3=confidential)
    organization_id: str
```
Create at least 4 different user profiles representing different access levels and departments. For example:
| User | Department | Role | Access Level | Organization |
| --- | --- | --- | --- | --- |
| alice | HR | manager | 3 | org_acme |
| bob | Engineering | employee | 2 | org_acme |
| charlie | Finance | employee | 1 | org_globex |
| diana | Engineering | admin | 3 | org_globex |
1b. Build a Context-Based Filter Constructor (15 Marks)
Implement a function:

```python
def build_user_context_filter(user: UserSession) -> dict:
    ...
```

This function must:
Always include an organization_id filter (multi-tenant isolation — non-negotiable)
Always include an access_level_num filter based on the user's access level
Optionally include a department filter — document your decision:
Should a user only see their own department's documents?
Or should they see all departments they have access to?
Justify your choice in comments or markdown
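One possible shape for this constructor is sketched below. It assumes the "see all departments you are cleared for" design, so no department clause is added; your justified choice may differ:

```python
from dataclasses import dataclass

@dataclass
class UserSession:
    user_id: str
    department: str
    role: str
    access_level: int
    organization_id: str

def build_user_context_filter(user: UserSession) -> dict:
    """Security filters derived from the session, never from query text.

    Design choice (one option): no department clause, so users can see
    any department's documents at or below their clearance level.
    """
    return {
        "$and": [
            {"organization_id": {"$eq": user.organization_id}},  # multi-tenant isolation
            {"access_level_num": {"$lte": user.access_level}},   # clearance ceiling
        ]
    }
```

Because the filter is built from the session object alone, nothing the user types can alter it.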
Requirements:
Run the same query (e.g., "What is the leave policy?") for all 4 users
Show that each user gets different results based on their context
Present results in a comparison table:
| User | Org | Access | Dept | # Results | Departments in Results | Access Levels in Results |
| --- | --- | --- | --- | --- | --- | --- |
| alice | org_acme | 3 | HR | ... | ... | ... |
| bob | org_acme | 2 | Engineering | ... | ... | ... |
Verify that no user ever sees another organization's data
Verify that no user sees data above their access level
Task 2: Query-Time Filter Inference — Rule-Based (20 Marks)
Users express filtering intent through natural language. Your system must detect this and convert it to metadata filters.
2a. Implement a Rule-Based Filter Extractor (15 Marks)
Create a function:

```python
def extract_filters_from_query(query: str) -> dict:
    ...
```
This function must handle at least the following patterns:
| Natural Language Pattern | Extracted Filter |
| --- | --- |
| "from 2024" or "since 2024" | {"year": {"$gte": 2024}} |
| "in 2023" | {"year": {"$eq": 2023}} |
| "latest" or "most recent" | {"year": {"$gte": <current_year>}} |
| "HR policies" or "from HR" | {"department": {"$eq": "HR"}} |
| "engineering docs" | {"department": {"$eq": "Engineering"}} |
| "confidential" | {"access_level": {"$eq": "confidential"}} |
Implementation guidance:
Use string matching, keyword detection, or regular expressions
Match patterns case-insensitively
Return an empty dict if no filter patterns are detected
The function should be extensible — adding new patterns should be straightforward
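A minimal version of this extractor might look like the following. The department vocabulary is an assumed mapping for the synthetic corpus; to stay extensible, new patterns should be added to the lookup tables rather than the logic:

```python
import re
from datetime import date

# Assumed token -> canonical department values for the synthetic corpus.
DEPARTMENT_TOKENS = {"hr": "HR", "engineering": "Engineering", "finance": "Finance"}

def extract_filters_from_query(query: str) -> dict:
    """Rule-based filter inference: returns {} when nothing is detected."""
    q = query.lower()
    filters = {}
    if m := re.search(r"\b(?:from|since)\s+(\d{4})\b", q):
        filters["year"] = {"$gte": int(m.group(1))}
    elif m := re.search(r"\bin\s+(\d{4})\b", q):
        filters["year"] = {"$eq": int(m.group(1))}
    elif "latest" in q or "most recent" in q:
        filters["year"] = {"$gte": date.today().year}
    for token, dept in DEPARTMENT_TOKENS.items():
        if re.search(rf"\b{token}\b", q):  # word boundaries avoid false hits
            filters["department"] = {"$eq": dept}
            break
    if "confidential" in q:
        filters["access_level"] = {"$eq": "confidential"}
    return filters
```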
2b. Demonstrate with Test Queries (5 Marks)
Test your extractor on at least 6 different queries, including:
A query with a year reference
A query with a department reference
A query with both year and department
A query with "latest" or "recent"
A query with no filterable terms (should return empty)
An ambiguous query (discuss how your system handles it)
Present results as:
| Query | Extracted Filters |
| --- | --- |
| "What are the latest HR policies?" | {"year": {"$gte": 2025}, "department": "HR"} |
| "Show me engineering docs from 2024" | {"year": {"$gte": 2024}, "department": "Engineering"} |
| "How do I submit an expense report?" | {} |
Task 3: Query-Time Filter Inference — LLM-Based (15 Marks)
Rule-based extraction is limited. Now use an LLM to understand query intent and extract filters more intelligently.
3a. Implement an LLM-Based Filter Extractor (10 Marks)
Create a function:

```python
def extract_filters_with_llm(query: str, schema_description: str) -> dict:
    ...
```
Requirements:
Construct a prompt that:
Describes your metadata schema (which fields exist, what values are valid)
Provides the user's query
Asks the LLM to return a JSON object with any filters it can infer
Instructs the LLM to return {} if no filters can be inferred
Parse the LLM's response into a Python dictionary
Validate the LLM's output against your schema:
Are the field names valid?
Are the values within allowed ranges?
Discard any invalid fields (do not blindly trust LLM output)
Handle edge cases:
LLM returns invalid JSON → fall back to empty filter
LLM hallucinates a field that doesn't exist → strip it out
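The prompt itself is straightforward; the part worth sketching is the validation layer between the LLM and your retriever. Assuming the reply arrives as a raw JSON string (the schema below is illustrative), the hardening might look like:

```python
import json

# Illustrative schema: valid fields and the values/ranges we accept.
SCHEMA = {
    "department": {"HR", "Engineering", "Finance"},
    "category": {"policy", "procedure", "memo", "standard"},
    "access_level": {"public", "internal", "confidential"},
    "year": range(2000, 2101),
}

def parse_and_validate(raw_reply: str) -> dict:
    """Never trust the LLM: bad JSON -> {}, unknown fields/values -> stripped."""
    try:
        candidate = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        return {}  # invalid JSON -> fall back to no inferred filters
    if not isinstance(candidate, dict):
        return {}
    clean = {}
    for field, value in candidate.items():
        if field not in SCHEMA:
            continue  # hallucinated field -> strip it out
        # Accept either a bare value or an operator dict like {"$gte": 2024}.
        ops = value if isinstance(value, dict) else {"$eq": value}
        if all(v in SCHEMA[field] for v in ops.values()):
            clean[field] = value
    return clean
```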
3b. Compare Rule-Based vs LLM-Based (5 Marks)
Run at least 5 queries through both extractors and compare:
| Query | Rule-Based Output | LLM-Based Output | Better? |
| --- | --- | --- | --- |
| "What's the current leave policy for HR?" | {department: HR} | {department: HR, year: >=2025} | LLM |
| "engineering deployment procedures from 2024" | {dept: Eng, year: >=2024} | {dept: Eng, year: 2024, cat: procedure} | LLM |
Write a short analysis (5–8 sentences):
When does the LLM outperform rules?
When might rules be preferable (cost, latency, reliability)?
What risks does LLM-based inference introduce?
Task 4: Hybrid Search — Keyword + Vector + Metadata (15 Marks)
ChromaDB supports three search dimensions simultaneously. Build a hybrid search function that uses all three.
Requirements:
Implement a function:

```python
def hybrid_search(query: str, metadata_filter: dict, keyword_filter: str = None, n_results: int = 5) -> list:
    ...
```
This function must:
Use the where parameter for metadata filtering
Use the where_document parameter for keyword filtering (e.g., {"$contains": "deployment"})
Use query embeddings for semantic similarity
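Before wiring this up to ChromaDB, it can help to reason about what each dimension contributes by simulating the semantics in plain Python. The word-overlap "similarity" below is a deliberately crude stand-in for embeddings, and only a subset of the where-clause operators is handled:

```python
def passes_where(meta: dict, where: dict) -> bool:
    """Evaluate a small subset of ChromaDB where-clause syntax in Python."""
    for key, cond in where.items():
        if key == "$and":
            if not all(passes_where(meta, c) for c in cond):
                return False
            continue
        if key == "$or":
            if not any(passes_where(meta, c) for c in cond):
                return False
            continue
        if not isinstance(cond, dict):
            cond = {"$eq": cond}  # bare value shorthand
        actual = meta.get(key)
        for op, val in cond.items():
            if op == "$eq" and actual != val:
                return False
            if op == "$gte" and not (actual is not None and actual >= val):
                return False
            if op == "$lte" and not (actual is not None and actual <= val):
                return False
            if op == "$in" and actual not in val:
                return False
    return True

def hybrid_search_sim(query, chunks, metadata_filter=None, keyword_filter=None, n_results=5):
    """Metadata pre-filter + $contains-style keyword filter + crude ranking."""
    q_words = set(query.lower().split())
    scored = []
    for c in chunks:
        if metadata_filter and not passes_where(c["metadata"], metadata_filter):
            continue
        if keyword_filter and keyword_filter.lower() not in c["text"].lower():
            continue
        scored.append((len(q_words & set(c["text"].lower().split())), c))
    scored.sort(key=lambda pair: -pair[0])
    return [c for _, c in scored[:n_results]]
```

In your actual implementation the three stages map to the where, where_document, and query-embedding arguments of a single collection query.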
Demonstrate with 3 test cases showing the value of combining all three:
Test Case A — Metadata + Vector (no keyword):
Query: "How do I handle incidents?"
Filter: department = Engineering
Observation: Gets engineering docs about incidents
Test Case B — Metadata + Vector + Keyword:
Query: "How do I handle incidents?"
Filter: department = Engineering
Keyword: "deployment" (via where_document)
Observation: Narrows to engineering docs about incidents that specifically mention deployment
Test Case C — Show the difference:
Run the same query with and without the keyword filter
Compare results — which is more precise for the user's actual intent?
Write an analysis (5–8 sentences):
When is keyword filtering useful on top of vector + metadata?
What are the risks of keyword filtering (e.g., too restrictive for synonyms)?
How do the three dimensions complement each other?
Task 5: Safe Smart Search with Security Enforcement (20 Marks)
This is the capstone task. Combine everything into a production-grade smart search function that is both intelligent and secure.
5a. Implement Smart Search (15 Marks)
Create a function:

```python
def smart_search(query: str, user: UserSession, n_results: int = 5, enable_fallback: bool = True) -> dict:
    ...
```
This function must follow this exact pipeline:
Step 1 — Security Filters (NON-NEGOTIABLE)
Build user context filters from UserSession (organization + access level)
These filters must always be applied — they cannot be overridden
Step 2 — Infer Additional Filters
Use your rule-based or LLM-based extractor to infer filters from the query
These are "nice-to-have" filters — they narrow the search but are not security-critical
Step 3 — Combine & Search
Combine security filters + inferred filters using $and
Execute the filtered query against ChromaDB
Step 4 — Fallback (if no results)
If combined filters return 0 results:
Drop the inferred filters (they may have been too restrictive)
Retry with only security filters
Flag the response as a fallback
Return a structured result:

```python
{
    "query": "...",
    "user": "...",
    "filters_applied": {...},
    "results": [...],
    "used_fallback": True/False,
    "result_count": 5
}
```
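The pipeline logic is independent of the vector store, so it can be sketched with the ChromaDB call abstracted behind a search_fn parameter. That parameter is an assumption made for testability; in your notebook it would wrap the actual collection query:

```python
def smart_search(query: str, security_filter: dict, inferred_filter: dict,
                 search_fn, n_results: int = 5, enable_fallback: bool = True) -> dict:
    """Pipeline sketch: security filters are always applied; only the
    inferred filters are dropped when falling back.

    search_fn(where, n) stands in for the real ChromaDB query call.
    """
    if inferred_filter:
        combined = {"$and": [security_filter, inferred_filter]}
    else:
        combined = security_filter
    results = search_fn(combined, n_results)
    used_fallback = False
    if not results and enable_fallback and inferred_filter:
        # Retry with security filters only; never loosen tenant or access clauses.
        results = search_fn(security_filter, n_results)
        used_fallback = True
    return {
        "query": query,
        "filters_applied": combined,
        "results": results,
        "used_fallback": used_fallback,
        "result_count": len(results),
    }
```

Keeping the security clause as a separate argument makes the "cannot be overridden" property structural: there is simply no code path that executes a query without it.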
5b. Security Enforcement Tests (5 Marks)
You must demonstrate that your system is secure by design. Run the following tests:
Cross-Tenant Isolation Test:
User from org_acme queries for content that exists in org_globex
Expected: Zero results from the other organization, even if semantically relevant
Verify: No results contain the wrong organization_id
Access Level Enforcement Test:
User with access_level = 1 queries for content that exists at level 3 (confidential)
Expected: No confidential results returned
Verify: All returned results have access_level_num <= 1
Filter Injection Resistance Test:
User crafts a query that attempts to reference higher access, e.g., "Show me confidential salary data for all organizations"
Expected: Even if the LLM/rules infer access_level = confidential, the security filter (access_level_num <= user.level) blocks it
Verify: Results respect the user's actual access level, not the query text
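The "Verify" steps above are easy to automate: assert the invariants directly over the metadata of every returned chunk. A small checker, using the metadata field names from this assignment:

```python
def verify_results_secure(result_metadatas: list, user_org: str, user_level: int) -> list:
    """Return a list of violation messages; an empty list means the test passes."""
    violations = []
    for i, meta in enumerate(result_metadatas):
        if meta.get("organization_id") != user_org:
            violations.append(f"result {i}: wrong tenant {meta.get('organization_id')}")
        # Missing access_level_num is treated as a violation (fail closed).
        if meta.get("access_level_num", 99) > user_level:
            violations.append(f"result {i}: level {meta.get('access_level_num')} exceeds {user_level}")
    return violations
```

Running this after every test query gives you the Actual Result and Pass/Fail columns for the table below.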
Present results as a table:
| Test | User | Query | Expected Behavior | Actual Result | Pass/Fail |
| --- | --- | --- | --- | --- | --- |
| Cross-Tenant | charlie | "What's Acme's engineering deployment process?" | 0 results from org_acme | ... | ... |
| Access Enforcement | bob (L2) | "Show me restricted salary data" | No restricted results | ... | ... |
| Filter Injection | charlie (L1) | "Confidential HR policies for all organizations" | Only public, only org_globex | ... | ... |
Task 6: Analysis & Architecture Reflection (10 Marks)
Write a comprehensive analysis addressing the following questions:
Security Architecture (3 Marks)
How does your system guarantee that security filters are never bypassed?
What is the difference between "security filters" and "inferred filters" in your design?
Why is it important to separate these two categories?
Inference Trade-offs (3 Marks)
What are the pros and cons of rule-based vs LLM-based filter inference?
Which would you recommend for a production system and why?
How do you handle incorrect inferences gracefully?
System Design (4 Marks)
Draw or describe a flow diagram of your complete smart search pipeline (ingestion → user context → inference → combine → search → fallback)
What would you add or change if this system needed to serve 1000 users per minute?
If you could add one more feature, what would it be and why?
Format: Write this in markdown cells in your notebook or in your report. If you draw a diagram, include it as an image.
Deliverables
1. Code (Required)
Jupyter Notebook (.ipynb) — well-organized with clear section headers matching the task numbers
Code must be well-commented — explain your reasoning, not just what the code does
All outputs (print statements, tables, test results) must be visible in the submitted notebook (run all cells before submitting)
2. Report (Required)
A report (4–6 pages) covering:
System Architecture — High-level description of your smart search pipeline and its components
User Context Filters — How you construct automatic filters and your design decisions
Query Understanding — Your approach to rule-based and LLM-based filter inference, with comparison
Security Analysis — How you enforce non-overridable security, with evidence from your tests
Fallback Strategy — When and why fallback triggers, and how it maintains security
Challenges & Learnings — What was difficult, what surprised you, what you would do differently
Format: PDF or DOCX
3. Output Samples (Required)
Include clearly labeled outputs for:
User context filter comparison (Task 1 — same query, different users)
Rule-based vs LLM-based inference comparison (Tasks 2 & 3)
Hybrid search demonstrations (Task 4)
Security test results (Task 5b)
Submission Guidelines
Platform
Submit via your LMS (e.g., Moodle / Google Classroom / institutional portal).
File Naming Convention
<YourName>_MetadataFiltering_Assignment2.zip
ZIP Structure
```
<YourName>_MetadataFiltering_Assignment2/
├── notebook.ipynb
├── report.pdf
└── data/ (optional — if you stored your documents as files)
    ├── hr_leave_policy.txt
    ├── eng_coding_standards.txt
    └── ...
```
Deadline
Submit within 10 days from assignment release date.
Late Submission Policy
| Delay | Penalty |
| --- | --- |
| Up to 24 hours | 10% deduction |
| 24–48 hours | 20% deduction |
| Beyond 48 hours | Not accepted |
Important Instructions
Implement everything yourself. You must write your own user context filter builder, query inference functions, smart search pipeline, and security tests. Show that you understand how and why each component works.
Explain your reasoning clearly. Code alone is not enough — use markdown cells and comments to explain why you made specific design decisions, especially around security enforcement.
Stick to the taught technology stack. Use ChromaDB as your vector database and OpenAI for embeddings and LLM calls. You may use tiktoken for token counting, pandas for tabular display, and numpy for numerical operations.
Use of external libraries is permitted for utility tasks (e.g., formatting output, regex), but all core logic must be your own implementation.
Plagiarism will result in disqualification. If you reference any external resource, cite it. Submitting copied code without understanding will be treated as academic dishonesty.
Run all cells before submission. A notebook with missing outputs will lose marks.
Security is not optional. If your smart search allows cross-tenant data leakage or access level bypass, the Task 5 score will be significantly impacted regardless of other qualities.
Evaluation Rubric
| Criteria | Marks |
| --- | --- |
| Task 1 — User Context & Automatic Filters | 20 |
| Task 2 — Rule-Based Filter Inference | 20 |
| Task 3 — LLM-Based Filter Inference | 15 |
| Task 4 — Hybrid Search | 15 |
| Task 5 — Smart Search with Security | 20 |
| Task 6 — Analysis & Architecture Reflection | 10 |
| Total | 100 |
Grading Breakdown
| Grade Range | Interpretation |
| --- | --- |
| 90–100 | Exceptional — all tasks complete, security airtight, deep analysis, production-quality |
| 75–89 | Strong — all tasks complete, security sound, good analysis, minor gaps |
| 60–74 | Satisfactory — most tasks complete, security mostly correct, basic analysis |
| 40–59 | Needs Improvement — several tasks incomplete or security gaps |
| Below 40 | Unsatisfactory — major tasks missing or security fundamentally broken |
Guidance & Tips
Get your data ingested first. All tasks depend on having a working ChromaDB collection with properly metadata-tagged chunks. Reuse or adapt your data setup as needed.
Start with Task 1 (user context filters) — it's the foundation for Task 5. If your context filters don't work correctly, smart search won't either.
For LLM-based inference (Task 3), start with a simple prompt and iterate. The prompt should describe your schema clearly — the LLM needs to know what fields and values are valid.
Test security aggressively. Don't just show that correct queries work — show that malicious or edge-case queries are handled safely.
The fallback mechanism is subtle. Think carefully about when to fall back and what to fall back to. Dropping inferred filters is safe; dropping security filters is never acceptable.
Use pandas DataFrames for comparison tables and result displays — they make your analysis more readable and professional.
Don't over-engineer. A clean, correct, well-explained implementation beats a complex one with gaps. Focus on getting the pipeline right, then polish.
Bonus (Optional — up to +10 Marks)
Metadata-Driven Query Routing (+5): Implement a routing system that directs queries to different ChromaDB collections based on detected intent (e.g., HR queries → HR collection, Engineering queries → Engineering collection). Show that routing reduces search space and improves relevance.
Filter Transparency Dashboard (+5): For every smart search call, produce a clear summary showing the user: (a) what security filters were enforced, (b) what filters were inferred from their query, (c) whether a fallback was used, and (d) why. This mirrors real-world systems where filter transparency builds user trust.
Instructor Note
This assignment simulates real-world retrieval system design where intelligence and security must coexist.
In production, users don't specify filters — systems derive them. But derived filters must never compromise security boundaries. This tension between helpfulness (inferring what the user wants) and safety (enforcing what the user is allowed to access) is the central challenge.
There is no single correct architecture. What matters is:
Security is non-negotiable — Can a user ever see data they shouldn't?
Intelligence is practical — Does the system extract useful filters from natural language?
Resilience is built-in — Does the system degrade gracefully when filters are too restrictive?
Reasoning is clear — Can you articulate why your design works the way it does?
Call to Action
Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?
Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.
Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.
Get Started Today
Schedule an AI & Data Science Consultation:
Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.
Request a Custom AI Demo:
Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.
Email: contact@codersarts.com
Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.
Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.
Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.
