Evaluating Generation Quality and Building an LLM Judge

Course: RAG Evaluation
Level: Intermediate to Advanced
Type: Individual
Duration: 7 to 10 days
Objective
This assignment tests your ability to evaluate the generation stage of a RAG pipeline, attribute failures to the correct pipeline stage, and automate the entire evaluation workflow using an LLM as a judge. You will generate RAG answers, measure faithfulness and completeness, run end-to-end error attribution, build a structured LLM judge, and compare automated scores against your manual labels. By the end, you will have a complete, automated evaluation pipeline that can score hundreds of RAG answers without manual effort.
Problem Statement
Using the golden dataset and retrieval evaluation results from Assignment 1, you will generate RAG answers for all 15 queries and evaluate them across four dimensions: faithfulness, completeness, token F1, and answer similarity. You will then build an end-to-end error attribution system and replace manual evaluation with an automated LLM judge. Finally, you will validate the judge by comparing its scores against your manual labels.
Tasks
Task 1: Generate RAG Answers and Check Faithfulness (15 marks)
Write a generate_rag_answer(query, retrieved_chunks) function that builds a RAG prompt with a system instruction enforcing context-only answers, then calls gpt-4o-mini-2024-07-18 with temperature=0.
Generate answers for all 15 golden queries using the top-3 retrieved chunks from Assignment 1.
Write a check_faithfulness(answer, context) function that calls the LLM to identify any claims in the answer that are not supported by the retrieved context. Return a faithfulness score between 0.0 and 1.0 and a list of unsupported claims.
Run faithfulness checking on all 15 answers. Print a table showing query, faithfulness score, and any unsupported claims. Flag any answer scoring below 0.9.
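For orientation, the prompt-assembly half of Task 1 can be sketched as below. The function name and message layout are illustrative, not a required interface; the LLM call itself is shown only as a comment, since the exact client setup is up to you.

```python
def build_rag_prompt(query, retrieved_chunks):
    """Assemble a context-only RAG prompt as a chat message list.
    The system message forbids the model from using outside knowledge."""
    context = "\n\n".join(
        f"[Chunk {i + 1}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    system = (
        "Answer the question using ONLY the context provided. "
        "If the context does not contain the answer, reply 'I don't know.'"
    )
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# The actual generation call (client setup omitted) would then look like:
# response = client.chat.completions.create(
#     model="gpt-4o-mini-2024-07-18",
#     temperature=0,
#     messages=build_rag_prompt(query, top3_chunks),
# )
```

check_faithfulness can reuse the same pattern: send the answer and context to the model with a prompt asking it to list unsupported claims, then derive the score as (supported claims) / (total claims).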
Task 2: Compute Token F1 and Answer Similarity (15 marks)
Implement token_f1_score(generated, expected) from scratch. Tokenise by splitting on whitespace and punctuation, compute token-level precision and recall, and return the F1 score.
Implement answer_similarity(generated, expected) using text-embedding-3-small embeddings and cosine similarity.
Compute both metrics for all 15 queries. Display per-query scores and mean scores across the dataset.
Identify the 3 queries with the lowest token F1 scores. For each one, display the generated answer, the expected answer, and a brief explanation of why the scores differ.
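One way to structure the two Task 2 metrics is sketched below: a SQuAD-style multiset token F1, plus a plain cosine similarity that you would apply to the text-embedding-3-small vectors (the embedding call itself is not shown). The tokenisation regex is one reasonable choice, not the only valid one.

```python
import math
import re
from collections import Counter

def tokenise(text):
    """Lowercase and split on whitespace and punctuation, keeping alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def token_f1_score(generated, expected):
    """Token-level F1 using multiset overlap (SQuAD-style)."""
    gen, exp = tokenise(generated), tokenise(expected)
    if not gen or not exp:
        return float(gen == exp)  # both empty -> 1.0, one empty -> 0.0
    overlap = sum((Counter(gen) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(gen), overlap / len(exp)
    return 2 * precision * recall / (precision + recall)

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

Comparing the two per query is instructive: a correct paraphrase will often score low on token F1 but high on embedding similarity, which is exactly the kind of divergence Task 2 asks you to explain.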
Task 3: Completeness Scoring (10 marks)
Write a score_completeness(generated, expected) function that measures how much of the expected answer is covered by the generated answer. You may use token overlap, embedding similarity, or an LLM-based approach.
Compute completeness for all 15 queries. Print per-query scores and the mean completeness score.
Identify the queries where completeness is below 0.7. For each one, state what information is present in the expected answer but missing from the generated answer.
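If you choose the token-overlap route for Task 3, completeness reduces to recall of the expected answer's tokens. A minimal sketch, assuming the same tokenisation convention as Task 2:

```python
import re
from collections import Counter

def score_completeness(generated, expected):
    """Token-overlap completeness: fraction of expected-answer tokens that
    also appear in the generated answer (i.e. recall of the expected answer)."""
    tok = lambda t: re.findall(r"[a-z0-9]+", t.lower())
    gen, exp = Counter(tok(generated)), Counter(tok(expected))
    if not exp:
        return 1.0  # nothing expected -> trivially complete
    overlap = sum((gen & exp).values())
    return overlap / sum(exp.values())
```

The embedding- or LLM-based variants trade this transparency for robustness to paraphrase; whichever you pick, state the trade-off in your notebook.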
Task 4: End-to-End Error Attribution (20 marks)
Write a classify_failure(query_metrics) function that takes the combined retrieval and generation metrics for a single query and returns one of four labels: no_failure, retrieval_failure, generation_hallucination, or generation_omission.
Use the following rules as a starting point — you may adjust thresholds based on your data: retrieval_failure if Recall@k < 1.0; generation_hallucination if faithfulness < 0.8; generation_omission if completeness < 0.7 and faithfulness >= 0.8; no_failure otherwise.
Run attribution across all 15 queries and print a summary showing the count and percentage of each failure type.
For at least one query in each failure category that exists in your results, display the full pipeline trace: query, retrieved chunks, generated answer, expected answer, and the metrics that triggered the label.
Simulate an improvement: change one parameter (for example, system prompt wording, number of retrieved chunks, or chunk size) and re-run evaluation on at least 5 queries. Show the before and after scores in a comparison table.
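The starting-point rules above translate almost directly into an ordered if-chain; a sketch (the metric key names are an assumption, match them to your own data structure):

```python
from collections import Counter

def classify_failure(query_metrics):
    """Ordered attribution rules from the assignment; thresholds are
    starting points to tune against your data.
    Expects keys: recall_at_k, faithfulness, completeness."""
    if query_metrics["recall_at_k"] < 1.0:
        return "retrieval_failure"
    if query_metrics["faithfulness"] < 0.8:
        return "generation_hallucination"
    if query_metrics["completeness"] < 0.7:  # faithfulness >= 0.8 is implied here
        return "generation_omission"
    return "no_failure"

def attribution_summary(all_metrics):
    """Count and percentage of each failure label across all queries."""
    labels = Counter(classify_failure(m) for m in all_metrics)
    total = sum(labels.values())
    return {label: (n, 100 * n / total) for label, n in labels.items()}
```

Note that the rule order matters: a query with both poor retrieval and low faithfulness is attributed to retrieval first, since the generator cannot be faithful to context it never received.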
Task 5: Build an LLM Judge (25 marks)
Write a judge_answer(query, context, answer) function that calls gpt-4o-mini-2024-07-18 with a structured judge prompt. The prompt must include: a system instruction defining the judge role, a scoring rubric for at least three dimensions (faithfulness, completeness, and relevance), and an instruction to return a JSON object with scores and reasoning.
Use Pydantic or Python dataclasses to parse and validate the JSON output. Handle parsing errors gracefully.
Run the judge on all 15 queries using the same retrieved chunks and generated answers from Task 1.
Write a judge_chunk_relevance(query, chunk) function that scores each retrieved chunk on a 0 to 2 scale (0 = irrelevant, 1 = partially relevant, 2 = highly relevant). Run it on 5 queries and display the relevance matrix.
Save the full judge report to judge_evaluation.json. The file must include: timestamp, model name, per-query scores with reasoning, and aggregate scores.
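The parsing-and-validation half of the judge can be isolated from the LLM call, which makes it easy to handle errors gracefully and to unit-test. A sketch using stdlib dataclasses (the field names mirror the three required rubric dimensions; swap in Pydantic if you prefer its validators):

```python
import json
from dataclasses import dataclass

@dataclass
class JudgeVerdict:
    faithfulness: float
    completeness: float
    relevance: float
    reasoning: str

def parse_judge_output(raw):
    """Parse the judge's JSON reply into a JudgeVerdict.
    Returns (verdict, None) on success or (None, error_message) on failure,
    so the caller can log bad outputs instead of crashing mid-run."""
    try:
        data = json.loads(raw)
        verdict = JudgeVerdict(
            faithfulness=float(data["faithfulness"]),
            completeness=float(data["completeness"]),
            relevance=float(data["relevance"]),
            reasoning=str(data.get("reasoning", "")),
        )
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return None, str(exc)
    for score in (verdict.faithfulness, verdict.completeness, verdict.relevance):
        if not 0.0 <= score <= 1.0:
            return None, f"score out of range: {score}"
    return verdict, None
```

judge_answer then becomes: build the rubric prompt, call the model with temperature=0, and pass the raw reply through parse_judge_output, retrying or logging when an error comes back.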
Task 6: Compare Judge Scores Against Manual Labels (15 marks)
Build a comparison table showing, for each query: manual faithfulness score (from Task 1), judge faithfulness score (from Task 5), manual completeness score (from Task 3), and judge completeness score.
Compute the agreement rate between manual and judge scores for faithfulness and for completeness. Define agreement as scores within 0.2 of each other.
Identify any query where the judge and manual scores diverge by more than 0.3. Examine the judge reasoning for that query and explain why the disagreement occurred.
Write a short analysis (150 to 200 words) on when you would trust the LLM judge over manual evaluation, when you would not, and what steps you would take to improve judge calibration.
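The agreement computation in this task is small enough to sketch in full; the 0.2 and 0.3 thresholds below come straight from the task description:

```python
def agreement_rate(manual, judge, tolerance=0.2):
    """Fraction of queries whose manual and judge scores fall within `tolerance`."""
    pairs = list(zip(manual, judge))
    return sum(abs(m - j) <= tolerance for m, j in pairs) / len(pairs)

def divergent_queries(manual, judge, threshold=0.3):
    """Indices of queries where manual and judge scores diverge by more than `threshold`."""
    return [i for i, (m, j) in enumerate(zip(manual, judge)) if abs(m - j) > threshold]
```

Run both functions separately for faithfulness and for completeness, then pull the judge's stored reasoning for each divergent index when writing your analysis.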
Evaluation Rubric
Criteria | Marks
RAG Answer Generation and Faithfulness Check | 15 |
Token F1 and Answer Similarity | 15 |
Completeness Scoring | 10 |
End-to-End Error Attribution | 20 |
LLM Judge Pipeline | 25 |
Judge vs Manual Comparison and Analysis | 15 |
Total | 100 |
Deliverables
A Jupyter Notebook (.ipynb) containing all code, outputs, and markdown explanations.
A generation_evaluation.json file containing per-query faithfulness, completeness, token F1, and answer similarity scores.
A judge_evaluation.json file containing per-query judge scores, reasoning, and aggregate metrics.
A before/after comparison table (in the notebook) showing the impact of your improvement in Task 4.
A written analysis (150 to 200 words) on judge reliability in the notebook.
Submission Guidelines
Submit your work via the course LMS (for example, Moodle or Google Classroom).
File Naming Convention: <YourName>_RAGEval_Assignment2.zip
Inside the ZIP:
notebook.ipynb
generation_evaluation.json
judge_evaluation.json
Deadline: 7 days from the date of release.
Late Submission Policy
Up to 24 hours late: 10% penalty applied to the final mark.
24 to 48 hours late: 20% penalty applied to the final mark.
Beyond 48 hours: submission will not be accepted.
Important Instructions
All metric functions (token_f1_score, score_completeness) must be implemented by you. Do not use evaluation libraries such as RAGAS, DeepEval, or TruLens.
The judge prompt in Task 5 must include a rubric that defines what each score level means. A prompt that only says 'rate from 0 to 1' will not receive full marks.
For Task 4, the improvement simulation must involve a real change to the pipeline, not just re-running the same code.
Your notebook must be fully runnable from top to bottom without errors. Use a .env file for the API key and load it with python-dotenv.
Plagiarism of any kind will result in disqualification from the assignment.
Guidance and Tips
Set temperature=0 for all evaluation calls (faithfulness, completeness, judge) to ensure reproducible results.
When comparing judge vs manual scores, do not expect perfect agreement. A well-calibrated judge with a 90% agreement rate is a strong result.
Faithfulness and completeness guard against opposite failure modes: the first catches hallucination, the second catches omission. A system that is perfectly faithful but incomplete is still failing. Make sure your analysis addresses both dimensions.
Do not just implement — diagnose. A well-explained attribution result showing 33% generation omissions is far more actionable than a list of scores with no interpretation.
Think about what your judge prompt rewards. If the rubric rewards verbosity, the judge may score longer answers higher regardless of faithfulness.
Bonus (Optional — up to +10 Marks)
Extend the judge to score a fourth dimension: source citation quality. Score 0 if the answer cites no source, 1 if it references the document type, and 2 if it cites the specific chunk or section.
Run the full pipeline on 20 additional queries (beyond the 15 golden entries) and produce a summary report showing the distribution of failure types.
Visualise judge scores as a heatmap across queries and scoring dimensions.
Instructor Note
This assignment is intentionally open-ended in the improvement simulation and the judge design. There is no single correct judge prompt or attribution threshold. What matters is that you can justify your choices with evidence from your results, explain what your metrics are actually measuring, and demonstrate awareness of the limitations of automated evaluation. A thoughtful analysis of a moderate implementation will always score better than a high-quality implementation with no explanation.
Call to Action
Ready to transform your business with AI-powered intelligence that accelerates insights, enhances decision-making, and unlocks the full value of your data?
Codersarts is here to help you turn complex data workflows into efficient, scalable, and evidence-driven AI systems that empower teams to make smarter, faster, and more confident decisions.
Whether you’re a startup looking to build AI-driven products, an enterprise aiming to optimize operations through data science, or a research organization advancing innovation with intelligent data solutions, we bring the expertise and experience needed to design, develop, and deploy impactful AI systems that drive measurable business outcomes.
Get Started Today
Schedule an AI & Data Science Consultation:
Book a 30-minute discovery call with our AI strategists and data science experts to discuss your challenges, identify high-impact opportunities, and explore how intelligent AI solutions can transform your workflows and performance.
Request a Custom AI Demo:
Experience AI in action with a personalized demonstration built around your business use cases, datasets, operational environment, and decision workflows — showcasing practical value and real-world impact.
Email: contact@codersarts.com
Transform your organization from data accumulation to intelligent decision enablement — accelerating insight generation, improving operational efficiency, and strengthening competitive advantage.
Partner with Codersarts to build scalable AI solutions including RAG systems, predictive analytics platforms, intelligent automation tools, recommendation engines, and custom machine learning models that empower your teams to deliver exceptional results.
Contact us today and take the first step toward next-generation AI and data science capabilities that grow with your business ambitions.
