Building a Golden Dataset and Evaluating Retrieval Quality

Course: RAG Evaluation
Level: Beginner to Intermediate
Type: Individual
Duration: 5 to 7 days
Objective
This assignment tests your ability to build the two foundational components of any RAG evaluation workflow: a golden dataset and a retrieval quality report. Without a golden dataset, no evaluation metric has meaning. Without retrieval evaluation, you cannot tell whether failures come from the retrieval stage or the generation stage. By completing this assignment, you will have a reusable evaluation foundation that all subsequent evaluation work can build on.
Problem Statement
You are given a chunked knowledge base of at least 30 chunks drawn from a document collection of your choice. Your task is to build a golden dataset of 15 query-answer pairs, validate it, analyse its topic coverage, and then run a full retrieval evaluation using recall, precision, MRR, and NDCG metrics.
Tasks
Task 1: Manually Create 5 Golden Entries (15 marks)
Read through your knowledge base chunks and identify 5 distinct, answerable queries. Each query must be grounded in a specific chunk or set of chunks.
For each entry, write: a natural language query, an expected answer (drawn directly from the source text), and the chunk ID of the ground-truth source chunk.
Use create_golden_record() to build each entry with a deterministic record ID generated from the query text.
Tag at least 2 entries as EASY and at least 2 as MEDIUM or HARD. Print all 5 entries in a formatted table.
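The exact signature of create_golden_record() is not specified in the brief, so here is one plausible sketch. It derives the deterministic record ID from a SHA-256 hash of the query text (the hashing scheme and field names are assumptions, not the course's definition):

```python
import hashlib

def create_golden_record(query, expected_answer, chunk_id, difficulty="EASY"):
    """Build one golden-dataset entry with a deterministic record ID.

    The ID is the first 12 hex characters of the SHA-256 of the query
    text, so re-running the notebook regenerates the same IDs.
    """
    record_id = hashlib.sha256(query.encode("utf-8")).hexdigest()[:12]
    return {
        "record_id": record_id,
        "query": query,
        "expected_answer": expected_answer,
        "chunk_id": chunk_id,
        "difficulty": difficulty,
    }

# Example entry (contents are illustrative)
entry = create_golden_record(
    "What is the refund window for annual plans?",
    "Annual plans can be refunded within 30 days of purchase.",
    "chunk_007",
    difficulty="MEDIUM",
)
```

Because the ID depends only on the query text, two runs over the same queries produce identical IDs, which makes diffs between dataset versions meaningful.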
Task 2: Synthetically Generate 10 Golden Entries (20 marks)
Write a generate_synthetic_entry(chunk, persona) function that calls gpt-4o-mini-2024-07-18 with a prompt that asks the model to produce a realistic query and expected answer for a given chunk, written from the perspective of the given persona.
Use at least three distinct personas (for example: new_employee, manager, customer). Rotate personas across your 10 entries to ensure query diversity.
Run the function on 10 different chunks selected to cover as many source documents as possible.
Print a summary table showing: chunk ID, source document, persona used, generated query, and assigned difficulty.
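One way to structure the persona rotation and the generation call is sketched below. The persona descriptions, prompt wording, and JSON response shape are all assumptions; the OpenAI client is passed in by the caller rather than constructed here, so the helpers stay testable without an API key:

```python
import json
from itertools import cycle

# Illustrative persona descriptions; replace with personas that fit your knowledge base.
PERSONAS = {
    "new_employee": "a new employee reading internal documentation for the first time",
    "manager": "a manager who needs a quick, decision-oriented answer",
    "customer": "an external customer unfamiliar with internal jargon",
}

def build_prompt(chunk_text, persona):
    """Compose the generation prompt for one chunk/persona pair."""
    return (
        f"You are {PERSONAS[persona]}.\n"
        "Read the passage below and write ONE realistic question this persona "
        "would ask that the passage answers, plus the expected answer drawn "
        "directly from the passage. Respond as JSON with keys 'query' and "
        "'expected_answer'.\n\n"
        f"Passage:\n{chunk_text}"
    )

def generate_synthetic_entry(chunk, persona, client):
    """client is an openai.OpenAI() instance supplied by the caller."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": build_prompt(chunk["text"], persona)}],
        response_format={"type": "json_object"},
        temperature=0.7,
    )
    return json.loads(resp.choices[0].message.content)

# Rotate personas across chunks so no single persona dominates the dataset.
persona_cycle = cycle(PERSONAS)
```

Pairing `persona_cycle` with your 10 selected chunks (e.g. via `zip`) guarantees each persona is used at least three times across the synthetic entries.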
Task 3: Validate, Combine, and Analyse Coverage (15 marks)
Write a validation function that checks each entry for: duplicate queries, queries that are too generic (fewer than 5 words), and missing required fields.
Remove or fix any entries that fail validation. Print the number of entries removed and the reason for each removal.
Combine the 5 manual and 10 synthetic entries into a single list and save it to golden_dataset.json.
Compute topic coverage: what percentage of your source documents are represented by at least one golden entry? Display a coverage table. Aim for at least 80% coverage.
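The validation and coverage checks above can be sketched as follows. Field names and the chunk schema (`chunk_id`, `source_document`) are assumptions carried over from the earlier tasks:

```python
def validate_golden_dataset(entries,
                            required=("record_id", "query", "expected_answer", "chunk_id")):
    """Return (valid_entries, removals); removals is a list of (query, reason)."""
    seen = set()
    valid, removals = [], []
    for e in entries:
        missing = [f for f in required if not e.get(f)]
        if missing:
            removals.append((e.get("query", "<no query>"), f"missing fields: {missing}"))
            continue
        q = e["query"].strip().lower()
        if len(q.split()) < 5:  # too generic per the assignment's threshold
            removals.append((e["query"], "too generic (fewer than 5 words)"))
            continue
        if q in seen:
            removals.append((e["query"], "duplicate query"))
            continue
        seen.add(q)
        valid.append(e)
    return valid, removals

def topic_coverage(entries, chunks):
    """Percentage of source documents represented by at least one golden entry."""
    chunk_to_doc = {c["chunk_id"]: c["source_document"] for c in chunks}
    covered = {chunk_to_doc[e["chunk_id"]] for e in entries
               if e["chunk_id"] in chunk_to_doc}
    return 100.0 * len(covered) / len(set(chunk_to_doc.values()))
```

Printing `removals` directly satisfies the requirement to report the number of removed entries and the reason for each.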
Task 4: Build the Retrieval Function and Embed Chunks (10 marks)
Write a get_embedding(text) function that calls the OpenAI API using text-embedding-3-small.
Embed all chunks in your knowledge base. Store the result as a list of dictionaries, each containing the original chunk fields plus an embedding key.
Write a retrieve_chunks(query, embedded_chunks, k) function that embeds the query and returns the top-k chunks ranked by cosine similarity.
Test the function on 3 queries from your golden dataset and print the top-3 results with their similarity scores.
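A minimal sketch of the retrieval step is below. To keep it self-contained, the embedding function is injected as a parameter; in the assignment itself `embed_fn` would be your `get_embedding()` wrapper around `text-embedding-3-small`:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_chunks(query, embedded_chunks, k, embed_fn):
    """Embed the query and return the top-k (score, chunk) pairs.

    embed_fn(text) -> list[float]; each chunk dict carries an 'embedding' key.
    """
    q_emb = embed_fn(query)
    scored = [(cosine_similarity(q_emb, c["embedding"]), c)
              for c in embedded_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Injecting `embed_fn` also makes the retrieval logic unit-testable with hand-crafted vectors, without spending API calls.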
Task 5: Compute Recall at k and Precision at k (20 marks)
Write recall_at_k(retrieved_ids, ground_truth_ids, k) and precision_at_k(retrieved_ids, ground_truth_ids, k) functions.
Run retrieval for all 15 golden queries at k=3 and k=5. Compute recall and precision for each query and display the results in a table.
Compute mean Recall@3, Recall@5, Precision@3, and Precision@5 across all queries.
Identify any query where recall is below 1.0. Print the query, the ground-truth chunk, and the top-3 retrieved chunks, and explain why retrieval may have failed.
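With binary relevance (a chunk is either the ground truth or not), the two metrics reduce to short set operations. A sketch matching the signatures named above:

```python
def recall_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of ground-truth chunks that appear in the top-k results."""
    if not ground_truth_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for gt in ground_truth_ids if gt in top_k)
    return hits / len(ground_truth_ids)

def precision_at_k(retrieved_ids, ground_truth_ids, k):
    """Fraction of the top-k results that are ground-truth chunks."""
    if k == 0:
        return 0.0
    gt = set(ground_truth_ids)
    hits = sum(1 for rid in retrieved_ids[:k] if rid in gt)
    return hits / k
```

Note the different denominators: recall divides by the number of ground-truth chunks, precision by k. With a single ground-truth chunk per query, Recall@k is simply 1.0 or 0.0, while Precision@k is at most 1/k.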
Task 6: Compute MRR and NDCG, and Produce a Full Report (20 marks)
Write compute_mrr(results) and compute_ndcg(results, k) functions. For NDCG, use logarithm base 2 for the discount.
Compute MRR and NDCG@5 for all 15 queries and display per-query scores alongside recall and precision.
Generate a final retrieval evaluation report showing: mean Recall@5, Precision@5, MRR, NDCG@5, and a breakdown by query difficulty (EASY vs MEDIUM vs HARD).
Simulate a before/after comparison by changing k from 3 to 5. Show how recall and precision change and explain the trade-off.
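One way to implement the two signatures named above, assuming `results` is a list of `(retrieved_ids, ground_truth_ids)` pairs (that representation is an assumption, not specified in the brief). NDCG uses binary relevance and the required log-base-2 discount:

```python
import math

def _reciprocal_rank(retrieved_ids, ground_truth_ids):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    gt = set(ground_truth_ids)
    for rank, rid in enumerate(retrieved_ids, start=1):
        if rid in gt:
            return 1.0 / rank
    return 0.0

def compute_mrr(results):
    """Mean reciprocal rank over (retrieved_ids, ground_truth_ids) pairs."""
    if not results:
        return 0.0
    return sum(_reciprocal_rank(r, g) for r, g in results) / len(results)

def _ndcg_at_k(retrieved_ids, ground_truth_ids, k):
    gt = set(ground_truth_ids)
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, rid in enumerate(retrieved_ids[:k], start=1)
              if rid in gt)
    ideal_hits = min(len(gt), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def compute_ndcg(results, k):
    """Mean NDCG@k with binary relevance and a log2 discount."""
    if not results:
        return 0.0
    return sum(_ndcg_at_k(r, g, k) for r, g in results) / len(results)
```

With a single ground-truth chunk, both metrics depend only on the rank of that chunk, but NDCG decays logarithmically (rank 2 scores about 0.63) while reciprocal rank decays harmonically (rank 2 scores 0.5), so the two reward ranking improvements differently.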
Evaluation Rubric
Criteria | Marks
Manual Golden Entries | 15 |
Synthetic Golden Entries | 20 |
Validation, Combination, and Coverage | 15 |
Retrieval Function and Embedding | 10 |
Recall at k and Precision at k | 20 |
MRR, NDCG, and Full Report | 20 |
Total | 100 |
Deliverables
A Jupyter Notebook (.ipynb) containing all code, outputs, and markdown explanations.
A golden_dataset.json file containing all 15 validated entries.
A retrieval_evaluation.json file containing per-query recall, precision, MRR, and NDCG scores.
A coverage analysis table (in the notebook) showing which source documents are represented.
Submission Guidelines
Submit your work via the course LMS (for example, Moodle or Google Classroom).
File Naming Convention: <YourName>_RAGEval_Assignment1.zip
Inside the ZIP:
notebook.ipynb
golden_dataset.json
retrieval_evaluation.json
Deadline: 7 days from the date of release.
Late Submission Policy
Up to 24 hours late: 10% penalty applied to the final mark.
24 to 48 hours late: 20% penalty applied to the final mark.
Beyond 48 hours: submission will not be accepted.
Important Instructions
Do not reuse the sample golden dataset from the course notebooks. You must build your own from a knowledge base of your choice.
Your knowledge base must contain at least 30 chunks from at least 5 distinct source documents.
All metric functions (recall_at_k, precision_at_k, compute_mrr, compute_ndcg) must be implemented by you. Do not use evaluation libraries such as RAGAS or TruLens.
Plagiarism of any kind will result in disqualification from the assignment.
Do not hardcode file paths. Use pathlib.Path and relative paths.
Guidance and Tips
Start by reading your chunks carefully before writing queries. A good golden query has exactly one best answer in the knowledge base.
Persona-based generation produces more diverse queries than generic prompting. Experiment with personas that reflect realistic users of your knowledge base.
If MRR is significantly lower than Recall, the right chunks are being found but ranked too low. Investigate why.
Do not optimise for a perfect score. A score below 1.0 with a clear diagnosis is more valuable than a perfect score with no explanation.
Think from an evaluator perspective, not just an implementer. What does a low NDCG tell you that low Recall does not?
Bonus (Optional — up to +10 Marks)
Extend the coverage analysis to compute query diversity: measure the average pairwise cosine similarity between golden query embeddings. Lower similarity means more diverse queries.
Implement graded relevance scoring (0 = irrelevant, 1 = partially relevant, 2 = highly relevant) instead of binary labelling and recompute NDCG.
Visualise MRR and NDCG scores as a bar chart grouped by query difficulty.
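For the query-diversity bonus, the measurement reduces to averaging cosine similarity over all unordered pairs of query embeddings. A sketch (the cosine helper is repeated here so the snippet stands alone):

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs of query embeddings.

    Lower values indicate a more diverse golden dataset.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

For 15 queries this is 105 pairs, so the naive loop is more than fast enough; there is no need for a vectorised implementation.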
Instructor Note
This assignment is designed to build the evaluation foundation that all later work depends on. A weak golden dataset will produce misleading metrics at every stage. There is no single correct set of queries. What matters is that each query is specific, answerable, grounded in the knowledge base, and representative of how a real user would interact with the system. Take the golden dataset step seriously — it is the most important output of this assignment.
