
Build Your First LLM-as-a-Judge for RAG Pipelines with Python and OpenAI



Introduction


Retrieval-Augmented Generation (RAG) pipelines are widely used to build question-answering systems grounded in private or domain-specific documents. But evaluating whether a RAG pipeline is actually working well is harder than building it. Traditional metrics like BLEU and ROUGE measure surface-level word overlap and miss the semantic quality of answers. Human review is accurate but expensive and slow at any meaningful scale.


LLM-as-a-Judge sits between these two extremes. The idea is to use a capable language model to evaluate the output of another model against a structured rubric, the same way a human reviewer would, but at the speed and cost of an API call. Because checking an answer against a rubric is usually easier than producing that answer, a strong judge model can assess specific, narrowly defined properties at scale, even when the answers are complex.


In this tutorial, we build a complete RAG evaluation system using a university academic FAQ as the knowledge base. We create a Chroma vector store from FAQ documents, build a RAG pipeline with GPT-4o-mini as the generator, and then use GPT-4o as the assessor to score every response on two dimensions: whether the answer is grounded in the retrieved context, and whether it actually addresses the student’s question.






What We Are Building


We are building an automated evaluation pipeline for a university academic FAQ RAG system. The workflow has six steps:


  1. Index academic FAQ documents into a Chroma vector store

  2. Retrieve relevant chunks for each student question via similarity search

  3. Generate answers using GPT-4o-mini grounded in the retrieved context

  4. Assess each answer for grounding using GPT-4o with a structured rubric

  5. Assess each answer for topicality using a separate rubric

  6. Report scores and reasoning across all test questions




Tech Stack


Component          Tool
Generator model    gpt-4o-mini (configurable via .env)
Assessor model     gpt-4o (configurable via .env)
Embeddings         text-embedding-3-small (configurable via .env)
Vector store       Chroma via LangChain
Orchestration      LangChain
Environment        python-dotenv




Project Structure




llm_judge_rag/
├── evaluate.py       # full pipeline: index, retrieve, generate, assess, report
├── .env              # OPENAI_API_KEY and model configuration
└── requirements.txt




Setting Up the Environment


Install the required packages. langchain-chroma and langchain-openai are distributed separately from the core langchain package, so they need to be installed explicitly.



pip install openai langchain langchain-openai langchain-chroma chromadb python-dotenv rich


All model names and configuration values are read from .env so you can change models without touching the code:



OPENAI_API_KEY=your_openai_api_key_here
GENERATOR_MODEL=gpt-4o-mini
JUDGE_MODEL=gpt-4o
EMBED_MODEL=text-embedding-3-small
TOP_K=3
JUDGE_TEMPERATURE=0.0


Then load everything at the top of the script:



import os                        # access environment variables set by load_dotenv
import json                      # parse JSON responses returned by the assessor model
import time                      # measure wall-clock latency for each API call
import textwrap                  # dedent multi-line rubric strings without leading whitespace
from datetime import datetime    # timestamp each logged API call in ISO 8601 format

from dotenv import load_dotenv                   # reads .env file and injects keys into os.environ
from openai import OpenAI                        # official OpenAI Python SDK for chat completions
from langchain_openai import OpenAIEmbeddings    # OpenAI embedding model wrapped for LangChain
from langchain_chroma import Chroma              # Chroma vector store integration for LangChain
from langchain_core.documents import Document    # LangChain document wrapper used by the vector store
from rich.console import Console                 # provides colored, styled terminal output
from rich.panel import Panel                     # renders a boxed panel around a block of text

load_dotenv()  # reads all key=value pairs from .env into os.environ before any os.environ.get() call

api_key        = os.environ.get("OPENAI_API_KEY")                # secret key required for all OpenAI API calls
gen_model      = os.environ.get("GENERATOR_MODEL", "gpt-4o-mini")            # weaker, cheaper model that answers student questions
assessor_model = os.environ.get("JUDGE_MODEL", "gpt-4o")                     # stronger model that only scores answers, never generates them
embed_model    = os.environ.get("EMBED_MODEL", "text-embedding-3-small")     # embedding model used to index and query the vector store
top_k          = int(os.environ.get("TOP_K", 3))                 # how many FAQ chunks to retrieve per question
assessor_temp  = float(os.environ.get("JUDGE_TEMPERATURE", 0.0)) # zero temperature gives stable, reproducible scores

if not api_key:                                                            # fail fast rather than getting a cryptic auth error later
    raise ValueError("Set OPENAI_API_KEY in your .env file before running.")  # force the developer to configure the key before proceeding

gpt_client = OpenAI(api_key=api_key)    # single client instance reused for every API call in the pipeline
console    = Console()                   # single Rich console instance shared across all output functions

REQUEST_METADATA = {                     # dict attached to every OpenAI call for tracking in the usage dashboard
    "dev_name":    "Ganesh",             # name of the developer running this script
    "project":     "llm-judge-rag",      # project identifier shown in the OpenAI usage dashboard
    "environment": "local",              # separates local runs from CI or production calls
    "purpose":     "testing",            # intent label for this run
}

TOKEN_COST = {                                                              # per-token cost in USD for each model used
    "gpt-4o-mini": {"input": 0.000150 / 1000, "output": 0.000600 / 1000}, # $0.15 per million input, $0.60 per million output
    "gpt-4o":      {"input": 0.002500 / 1000, "output": 0.010000 / 1000}, # $2.50 per million input, $10.00 per million output
}

call_log       = []   # grows by one entry per API call — written to stats.json after every question
assessment_log = []   # grows by one entry per question — stores chunks, answer, and both scores


def tracked_call(stage, model, messages, text, extra_params=None):  # wraps every OpenAI call to log tokens, cost, and latency
    extra_params = extra_params or {}             # default to empty dict so **extra_params never raises a TypeError
    timestamp    = datetime.now().isoformat()     # capture the exact time this call was initiated
    start        = time.time()                    # start stopwatch before sending the HTTP request

    response = gpt_client.chat.completions.create(  # send the chat completion request to OpenAI
        model=model,               # the model to call — gen_model for answers, assessor_model for scoring
        messages=messages,         # full conversation array: system prompt followed by user content
        metadata=REQUEST_METADATA, # attached to every call so usage appears tagged in the dashboard
        **extra_params,            # optional overrides such as temperature or response_format
    )

    elapsed        = round(time.time() - start, 2)                    # wall-clock seconds rounded to 2 decimal places
    usage          = response.usage                                    # token count object attached to every OpenAI response
    rates          = TOKEN_COST.get(model, {"input": 0, "output": 0}) # look up per-token rates; fall back to zero for unknown models
    input_cost     = usage.prompt_tokens     * rates["input"]         # cost of the prompt tokens in USD
    output_cost    = usage.completion_tokens * rates["output"]        # cost of the generated tokens in USD
    total_cost     = input_cost + output_cost                         # total USD cost for this single API call

    call_log.append({                                                              # append a full record for this call to the global log
        "timestamp":          timestamp,                                           # ISO timestamp of when the call started
        "stage":              stage,                                               # pipeline stage: "generator", "grounding", or "topicality"
        "model":              model,                                               # which OpenAI model handled this call
        "text":               text,                                                # the question that triggered this call
        "result": {                                                                # nested dict holding the model's output and token counts
            "result":             response.choices[0].message.content.strip(),    # the model's response text, whitespace trimmed
            "prompt_tokens":      usage.prompt_tokens,                            # number of tokens in the input
            "completion_tokens":  usage.completion_tokens,                        # number of tokens in the output
            "total_tokens":       usage.total_tokens,                             # prompt_tokens + completion_tokens
            "input_cost":         round(input_cost,  7),                          # prompt cost in USD, 7 decimal places
            "output_cost":        round(output_cost, 7),                          # completion cost in USD, 7 decimal places
            "total_cost":         round(total_cost,  7),                          # combined cost in USD, 7 decimal places
        },
        "time_taken_seconds": elapsed,                                            # latency for this call in seconds
    })

    return response   # return the full response object so the caller can extract content or usage
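

As a quick sanity check on the cost arithmetic inside tracked_call, the sketch below recomputes the cost of a single call from its token counts. The rates mirror the TOKEN_COST table above; estimate_cost is a hypothetical helper and the token counts (1,200 prompt, 150 completion) are made-up illustration values, not measurements from this pipeline.

```python
# Per-token USD rates, mirroring the TOKEN_COST table in the script
TOKEN_COST = {
    "gpt-4o-mini": {"input": 0.15 / 1_000_000, "output": 0.60 / 1_000_000},
    "gpt-4o":      {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000},
}


def estimate_cost(model, prompt_tokens, completion_tokens):
    """Return the USD cost of one call; unknown models are priced at zero."""
    rates = TOKEN_COST.get(model, {"input": 0.0, "output": 0.0})
    return prompt_tokens * rates["input"] + completion_tokens * rates["output"]


# Illustrative token counts only (assumed, not measured)
judge_cost = estimate_cost("gpt-4o", 1200, 150)
mini_cost = estimate_cost("gpt-4o-mini", 1200, 150)

print(f"gpt-4o:      ${judge_cost:.6f}")
print(f"gpt-4o-mini: ${mini_cost:.6f}")
```

With these assumed counts the assessor call comes to about $0.0045 and the generator call to about $0.00027. Because every question triggers two assessor calls, the assessor is likely to dominate the bill.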


Using a stronger model (gpt-4o) to assess the outputs of a weaker one (gpt-4o-mini) helps reduce self-preference bias: when a model evaluates its own outputs, it tends to favour its own style regardless of quality. Separating the generator from the assessor tends to produce more objective scores, although two models from the same family may still share stylistic preferences. Setting JUDGE_TEMPERATURE=0.0 makes the assessor's scores much more consistent when the same input is evaluated more than once, though the API does not guarantee identical outputs on every run.
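

One way to check how stable the assessor really is would be to score the same input several times and measure agreement. The sketch below is a minimal illustration under that assumption: score_agreement and fake_judge are hypothetical names, and the stand-in judge returns canned scores so the example runs without an API key.

```python
from collections import Counter


def score_agreement(judge_fn, n_runs=5):
    """Call judge_fn() n_runs times and summarise how often the scores agree."""
    scores = [judge_fn()["score"] for _ in range(n_runs)]
    modal_score, modal_count = Counter(scores).most_common(1)[0]
    return {
        "scores": scores,                   # every score observed, in call order
        "modal_score": modal_score,         # the most frequent score
        "agreement": modal_count / n_runs,  # share of runs that returned the modal score
    }


# Stand-in judge with canned scores (hypothetical, replaces a real API call)
_canned_scores = iter([4, 4, 5, 4, 4])


def fake_judge():
    return {"score": next(_canned_scores), "reasoning": "stub"}


report = score_agreement(fake_judge, n_runs=5)
print(report)
```

In the real pipeline, judge_fn could be a lambda that wraps one of the assessor functions defined later in this tutorial with a fixed question, context, and answer. An agreement well below 1.0 would suggest tightening the rubric wording.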




Building the Knowledge Base


The knowledge base is a set of text chunks representing a university academic FAQ. Each chunk covers a distinct policy area: admission deadlines, financial aid, course registration, graduation requirements, campus housing, academic integrity, grade appeals, and internship credit. These are stored as plain strings and converted into LangChain Document objects for indexing.



ACADEMIC_DOCS = [                   # list of 8 plain-text FAQ chunks that form the entire knowledge base
    (                               # chunk 0 — undergraduate admission deadlines
     "Undergraduate admission applications must be submitted by January 15 for fall enrollment. "   # fall application cutoff
     "Applications received after the deadline are reviewed on a rolling basis subject to available capacity. "  # late application handling
     "Transfer students have a separate deadline of March 1 for the same fall term."),              # transfer deadline differs

    (                               # chunk 1 — financial aid FAFSA deadline and GPA requirement
     "Financial aid applications require the FAFSA to be completed by February 28 each year. "     # FAFSA priority deadline
     "Students who miss the priority deadline may still receive loans but are unlikely to receive grant funding. "  # consequence of missing deadline
     "All financial aid awards are contingent on maintaining a minimum GPA of 2.0 and full-time enrollment status."),  # ongoing eligibility conditions

    (                               # chunk 2 — course registration priority and drop/add window
     "Course registration opens four weeks before the start of each semester. "                    # when registration begins
     "Priority registration is granted first to students with disabilities, then by credit hours earned in descending order. "  # registration priority order
     "Students may add or drop courses without academic penalty during the first two weeks of the semester."),  # penalty-free change window

    (                               # chunk 3 — graduation credit hours and GPA thresholds
     "To graduate, students must complete a minimum of 120 credit hours with at least 40 credits at the 300 level or above. "  # total and upper-division credit requirements
     "A cumulative GPA of 2.0 is required for graduation, with a minimum 2.0 in the declared major. "  # GPA requirement for graduation
     "All students must complete the university writing requirement and the diversity and inclusion requirement."),  # mandatory non-credit requirements

    (                               # chunk 4 — on-campus housing and first-year residency rule
     "On-campus housing applications open on April 1 for the following academic year. "            # housing application open date
     "Returning students are given priority over incoming students. "                              # priority order for room assignment
     "First-year students are required to live on campus unless they are commuting from a permanent family residence within 30 miles."),  # residency requirement and commuter exemption

    (                               # chunk 5 — academic integrity violations and escalating consequences
     "Academic integrity violations are handled by the Office of Student Conduct. "                # which office manages violations
     "First violations typically result in a zero on the assignment and a formal written warning. "  # first-offense penalty
     "A second violation may result in course failure or suspension depending on severity."),       # escalated consequences for repeat violations

    (                               # chunk 6 — grade appeal process, deadline, and eligibility
     "Students may appeal a final grade within 30 days of the grade being posted. "               # appeal submission window
     "Appeals must be submitted in writing to the department chair with supporting documentation. "  # submission format and recipient
     "Grade appeals are only considered for computational errors or procedural violations, not for disagreement with grading standards."),  # valid grounds for appeal

    (                               # chunk 7 — internship credit rules and elective-only limitation
     "Internship credit is available for approved work placements of at least 10 hours per week over a full semester. "  # minimum hours to qualify
     "Students must secure the placement independently and submit a signed employer agreement before the semester begins. "  # student responsibility and paperwork
     "Internship credits count as elective credits only and cannot substitute for required major coursework."),  # credit type limitation
]


Each chunk is short enough to fit inside a single context window but specific enough that the retriever can identify which chunks are relevant for a given question.




Building the Vector Store


We convert the FAQ chunks into LangChain Document objects, embed them using the model specified in .env, and store them in a local Chroma collection. The vector store handles similarity search at query time, returning the top-k most relevant chunks for any input question.




embedder = OpenAIEmbeddings(model=embed_model, api_key=api_key)  # embedding model that converts text strings into numeric vectors

doc_objects = [                                   # build a LangChain Document for each FAQ chunk
    Document(                                     # LangChain wrapper that pairs text with metadata
        page_content=chunk,                       # the raw FAQ text that will be embedded and stored
        metadata={                                # dict of extra fields attached to every document
            "source":      f"faq_chunk_{idx}",   # unique string name used to trace which chunk was retrieved
            "chunk_index": idx,                  # numeric index matching the position in ACADEMIC_DOCS
        },
    )
    for idx, chunk in enumerate(ACADEMIC_DOCS)   # pair each chunk string with its zero-based index
]

faq_store = Chroma.from_documents(               # embed all documents and load them into a Chroma collection
    doc_objects,                                 # the list of Document objects to embed and index
    embedder,                                    # the embedding function used to vectorise each document
    collection_name="academic_faq",              # name for the in-memory Chroma collection — no disk persistence needed
)

console.print(f"  Vector store ready  |  [cyan]{len(doc_objects)} FAQ chunks indexed[/cyan]")  # confirm index size at startup
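

Conceptually, the similarity search behind this store ranks stored vectors by how close they are to the query vector. The sketch below illustrates that with cosine similarity in plain Python. It is a toy model, not Chroma's implementation: the 3-dimensional vectors are invented, the function names are hypothetical, and Chroma's configured distance metric may differ from cosine.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, indexed, k=2):
    """Return the k (name, similarity) pairs most similar to query_vec, best first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in indexed.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (invented values); real embeddings have far more dimensions
indexed = {
    "faq_chunk_1": [0.9, 0.1, 0.0],
    "faq_chunk_4": [0.1, 0.9, 0.1],
    "faq_chunk_6": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

ranked = top_k(query, indexed, k=2)
print([name for name, _ in ranked])
```

The query vector points mostly along the first axis, so faq_chunk_1 ranks first, followed by faq_chunk_4.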


The source metadata is carried through to the assessment report. When a low grounding score appears alongside irrelevant source chunks, that points to a retrieval failure rather than a generation failure. These two failure modes need completely different fixes, so tracking sources makes diagnosis much faster.
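

If each test question is labelled with the chunk that should answer it, the source names make retrieval quality measurable on its own. The sketch below assumes such labels exist. The expected mapping, the sample questions, and retrieval_hit_rate are all hypothetical additions for illustration.

```python
def retrieval_hit_rate(results, expected):
    """Fraction of questions whose expected source appears among the retrieved sources."""
    if not results:
        return 0.0
    hits = sum(
        1 for question, sources in results.items()
        if expected[question] in sources
    )
    return hits / len(results)


# Hypothetical labels: the chunk that should answer each question
expected = {
    "When is the FAFSA due?": "faq_chunk_1",
    "How do I appeal a grade?": "faq_chunk_6",
}

# Hypothetical retrieval output: the "sources" list returned for each question
results = {
    "When is the FAFSA due?": ["faq_chunk_1", "faq_chunk_0", "faq_chunk_3"],
    "How do I appeal a grade?": ["faq_chunk_5", "faq_chunk_2", "faq_chunk_3"],
}

rate = retrieval_hit_rate(results, expected)
print(f"retrieval hit rate: {rate:.2f}")
```

Here the second question misses its expected chunk, so a low grounding score on it would point at retrieval first.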




Building the RAG Query Function


With the vector store in place, we build the retrieve-and-generate function. For each incoming question, it fetches the top-k relevant chunks from Chroma, joins them into a single context block, and sends that context along with the question to the generator model. The system prompt instructs the model to answer using only the provided context and to say so clearly if the context does not contain enough information.



RESPONDER_PROMPT = (                                                                              # system prompt that controls how the generator answers
    "You are a helpful university academic advisor. "                                             # defines the model's role and perspective
    "Answer the student's question using only the information in the provided context. "          # restricts the answer strictly to retrieved content
    "If the context does not contain a clear answer, say so rather than guessing. "              # instructs the model to admit when context is insufficient
    "Do not make up policies, deadlines, or requirements not stated in the context."              # explicit anti-hallucination instruction
)


def query_and_respond(question, store=faq_store, k=top_k):  # retrieves relevant chunks and generates a grounded answer
    hits          = store.similarity_search(question, k=k)                    # retrieve the top-k nearest chunks by embedding distance (Chroma defaults to L2 unless the collection is configured for cosine)
    context_block = "\n\n".join(doc.page_content for doc in hits)             # join all retrieved chunk texts into one string for the prompt

    messages = [                                                               # build the conversation array for the generator
        {"role": "system", "content": RESPONDER_PROMPT},                      # system prompt sets the advisor persona and grounding rules
        {"role": "user",   "content": f"Context:\n{context_block}\n\nQuestion: {question}"},  # user message puts retrieved context above the question
    ]

    reply = tracked_call("generator", gen_model, messages, text=question)     # call the generator and log tokens, cost, and latency

    return {                                                                    # return all outputs needed by the assessors and the UI
        "question": question,                                                  # original question — passed through for the assessors
        "context":  context_block,                                             # full joined context — sent to the grounding assessor
        "answer":   reply.choices[0].message.content.strip(),                  # the model's answer with leading/trailing whitespace removed
        "sources":  [doc.metadata["source"] for doc in hits],                 # chunk names only — used for terminal display
        "chunks":   [                                                          # full chunk text with source name — saved to stats.json and shown in terminal
            {"source": doc.metadata["source"], "text": doc.page_content}      # one dict per retrieved chunk: name and full text
            for doc in hits                                                    # iterate over all retrieved Document objects
        ],
    }


The system prompt is one of the most direct levers for grounding. Explicitly telling the model not to infer information beyond the context tends to reduce fabricated answers, even before any evaluation is applied.




Grounding Assessor


The grounding assessor evaluates whether every factual claim in the generated answer is directly supported by the retrieved context. It does not assess whether the answer is useful or complete, only whether it invents or misrepresents information. A low grounding score means the generator is producing claims that are not in the retrieved documents.



# rubric sent to the assessor as the system prompt; dedent removes the shared leading indentation
GROUNDING_RUBRIC = textwrap.dedent("""
    You are an expert evaluator measuring the grounding of an AI-generated answer.
    Grounding means every factual claim in the answer is directly supported by the provided context.
    An answer is ungrounded if it introduces policies, deadlines, or requirements not stated in the context.

    Score the answer on a scale of 1 to 5:
      5 — Every claim is explicitly supported by the context. No invented information.
      4 — Nearly all claims are supported; one minor inference that does not change the meaning.
      3 — Most claims are supported but one notable unsupported statement is present.
      2 — Several claims go beyond or contradict the context.
      1 — The answer contains multiple fabricated or contradictory claims.

    Respond in valid JSON with exactly these keys:
      "score": integer from 1 to 5
      "reasoning": one or two sentences explaining the score
""").strip()  # strip() removes the leading and trailing newlines added by the triple-quote string


def check_grounding(question, context, answer):  # sends the answer and its source context to the assessor and returns a score
    prompt = (                                   # assemble all three pieces the assessor needs to evaluate grounding
        f"CONTEXT:\n{context}\n\n"               # the retrieved FAQ chunks — what the answer should be based on
        f"QUESTION: {question}\n\n"              # the original question — included for framing, not directly scored
        f"ANSWER: {answer}"                      # the generated answer being evaluated for grounding
    )

    messages = [                                                            # build the conversation array for the grounding assessor
        {"role": "system", "content": GROUNDING_RUBRIC},                   # system prompt sets the rubric and scoring criteria
        {"role": "user",   "content": prompt},                             # user message contains context, question, and answer
    ]

    outcome = tracked_call(                                                 # call the assessor and log tokens, cost, and latency
        "grounding", assessor_model, messages, text=question,              # stage label "grounding" and the question for the log
        extra_params={                                                      # additional parameters controlling assessor behaviour
            "temperature": assessor_temp,                                   # zero temperature for more consistent scores across repeated runs
            "response_format": {"type": "json_object"},                    # JSON mode: the reply is a syntactically valid JSON object, but keys and value ranges still need checking
        },
    )

    return json.loads(outcome.choices[0].message.content)  # parse the JSON string and return {"score": int, "reasoning": str}


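response_format={"type": "json_object"} switches on JSON mode, which makes the model return a syntactically valid JSON object, so the output can be passed straight to json.loads without any text stripping. JSON mode requires the word "JSON" to appear in the prompt, which the rubric satisfies. It does not enforce the rubric's schema, and a reply cut off by a token limit can still be incomplete, so it is still worth checking that "score" is an integer from 1 to 5.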
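

A defensive parser for the assessor's reply might look like the sketch below. parse_verdict is a hypothetical helper, not part of the original script. It assumes the two-key schema from the rubric and raises ValueError for anything that does not match.

```python
import json


def parse_verdict(raw):
    """Parse a judge reply into {"score": int, "reasoning": str} or raise ValueError."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValueError(f"Judge reply is not valid JSON: {exc}") from exc

    if not isinstance(data, dict):
        raise ValueError("Judge reply must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"Score must be an integer from 1 to 5, got {score!r}")

    return {"score": score, "reasoning": str(data.get("reasoning", ""))}


verdict = parse_verdict('{"score": 4, "reasoning": "One minor inference."}')
print(verdict)
```

Swapping the bare json.loads call in an assessor for a helper like this would let the caller retry or skip a question when the reply is malformed, instead of crashing the whole run.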




Topicality Assessor


The topicality assessor evaluates whether the answer actually addresses what the student asked, independently of whether it is grounded. An answer can be perfectly supported by the retrieved context but still miss the point of the question. Separating grounding from topicality lets you diagnose problems more precisely: a low topicality score with a high grounding score usually means the retriever is pulling the wrong chunks, not that the generator is hallucinating.



# rubric sent to the assessor as the system prompt; dedent removes the shared leading indentation
TOPICALITY_RUBRIC = textwrap.dedent("""
    You are an expert evaluator measuring the topicality of an AI-generated answer.
    Topicality means the answer directly addresses what the student asked.
    Evaluate topicality independently of whether the answer is factually grounded.

    Score the answer on a scale of 1 to 5:
      5 — The answer fully addresses the question with no off-topic content.
      4 — The answer addresses the main question with minor tangential content.
      3 — The answer partially addresses the question but misses key aspects.
      2 — The answer touches the topic but mostly addresses something else.
      1 — The answer does not address the question at all.

    Respond in valid JSON with exactly these keys:
      "score": integer from 1 to 5
      "reasoning": one or two sentences explaining the score
""").strip()  # strip() removes the leading and trailing newlines added by the triple-quote string


def check_topicality(question, answer):  # sends the question and answer to the assessor without context and returns a score
    messages = [                                                                    # build the conversation array for the topicality assessor
        {"role": "system", "content": TOPICALITY_RUBRIC},                          # system prompt sets the rubric and scoring criteria
        {"role": "user",   "content": f"QUESTION: {question}\n\nANSWER: {answer}"}, # context intentionally excluded — topicality is purely question vs answer
    ]

    outcome = tracked_call(                                                         # call the assessor and log tokens, cost, and latency
        "topicality", assessor_model, messages, text=question,                     # stage label "topicality" and the question for the log
        extra_params={                                                              # additional parameters controlling assessor behaviour
            "temperature": assessor_temp,                                           # zero temperature for more consistent scores
            "response_format": {"type": "json_object"},                            # JSON mode: the reply parses with json.loads, but the schema is not enforced
        },
    )

    return json.loads(outcome.choices[0].message.content)  # parse and return {"score": int, "reasoning": str}


The topicality assessor receives only the question and the answer, not the context. This is intentional: topicality is about the relationship between the question and the answer, not about whether the answer matches the source documents.
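

Taken together, the two scores suggest where to look first when an answer goes wrong. The sketch below encodes that reasoning as a simple heuristic. The diagnose function and the pass threshold (scores above 3 count as passing) are assumptions for illustration, not part of the original pipeline.

```python
def diagnose(grounding_score, topicality_score, threshold=3):
    """Map a pair of 1-5 judge scores to a likely failure mode (heuristic only)."""
    grounded = grounding_score > threshold
    on_topic = topicality_score > threshold

    if grounded and on_topic:
        return "ok"
    if grounded:
        # Faithful to the chunks but off-topic: the chunks were probably the wrong ones
        return "check retrieval: answer is faithful to the chunks but off-topic"
    if on_topic:
        # On-topic but unsupported: the generator likely added claims of its own
        return "check generation: answer is on-topic but adds unsupported claims"
    return "check both retrieval and generation"


for pair in [(5, 5), (5, 2), (2, 5), (1, 1)]:
    print(pair, "->", diagnose(*pair))
```

This is only a first-pass triage. A low grounding score can also follow from poor retrieval, so the retrieved sources are still worth inspecting in either failing case.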




Saving Stats


After each question is answered and scored, save_stats() writes the full session state to stats.json. This runs after every question so no data is lost if the session ends unexpectedly. The file includes a run_info header with per-stage token and cost totals, an assessments array with one record per question including the retrieved chunks, and a calls array with the raw data for every individual API call made so far.
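

Because the file is rewritten after every question, a crash in the middle of a write could leave stats.json truncated. One possible safeguard, sketched below, is to write to a temporary file and then rename it over the target. write_json_atomic is a hypothetical helper. It assumes the payload is JSON-serialisable and that the temporary file lives on the same filesystem as the target.

```python
import json
import os
import tempfile


def write_json_atomic(payload, path):
    """Write payload as JSON to path via a temp file followed by os.replace."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
        os.replace(tmp_path, path)  # swaps the new file in as a single step
    except BaseException:
        # Clean up the partial temp file, then re-raise the original error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


# Demo in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "stats.json")
    write_json_atomic({"run_info": {"total_questions": 1}}, target)
    with open(target, encoding="utf-8") as handle:
        loaded = json.load(handle)

print(loaded)
```

With this pattern a reader of stats.json sees either the previous complete file or the new complete file, never a half-written one.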



def save_stats():  # computes per-stage totals from call_log and writes the full session state to stats.json
    if not call_log:   # skip if no API calls have been made yet — nothing to save
        return         # exit early so an empty stats file is never written

    gen_entries   = [e for e in call_log if e["stage"] == "generator"]    # filter call_log to only generator calls
    grnd_entries  = [e for e in call_log if e["stage"] == "grounding"]    # filter call_log to only grounding assessor calls
    topic_entries = [e for e in call_log if e["stage"] == "topicality"]   # filter call_log to only topicality assessor calls

    def _sum(entries, key):                                # helper: sum a result field across all entries in a stage list
        return sum(e["result"][key] for e in entries)

    gen_input    = _sum(gen_entries,   "prompt_tokens")       # total input tokens sent to the generator across all questions
    gen_output   = _sum(gen_entries,   "completion_tokens")   # total output tokens returned by the generator across all questions
    gen_tokens   = _sum(gen_entries,   "total_tokens")        # generator input + output tokens combined

    grnd_input   = _sum(grnd_entries,  "prompt_tokens")       # total input tokens sent to the grounding assessor
    grnd_output  = _sum(grnd_entries,  "completion_tokens")   # total output tokens returned by the grounding assessor
    grnd_tokens  = _sum(grnd_entries,  "total_tokens")        # grounding assessor input + output tokens combined

    topic_input  = _sum(topic_entries, "prompt_tokens")       # total input tokens sent to the topicality assessor
    topic_output = _sum(topic_entries, "completion_tokens")   # total output tokens returned by the topicality assessor
    topic_tokens = _sum(topic_entries, "total_tokens")        # topicality assessor input + output tokens combined

    total_input  = gen_input  + grnd_input  + topic_input    # grand total input tokens across all three stages
    total_output = gen_output + grnd_output + topic_output   # grand total output tokens across all three stages
    total_tokens = gen_tokens + grnd_tokens + topic_tokens   # overall token count for the entire session so far

    gen_cost   = _sum(gen_entries,   "total_cost")   # total USD cost for all generator calls
    grnd_cost  = _sum(grnd_entries,  "total_cost")   # total USD cost for all grounding assessor calls
    topic_cost = _sum(topic_entries, "total_cost")   # total USD cost for all topicality assessor calls
    total_cost = gen_cost + grnd_cost + topic_cost   # combined USD cost for the entire session so far

    output = {                                           # top-level dict that will be serialised to stats.json
        "run_info": {                                    # metadata header describing this evaluation run
            "generator_model": gen_model,               # name of the model that answered questions
            "judge_model":     assessor_model,           # name of the model that scored answers
            "embed_model":     embed_model,              # name of the model that created embeddings
            "timestamp":       datetime.now().isoformat(), # ISO timestamp of when this file was last written
            "total_questions": len(gen_entries),         # one generator call per question — equals questions asked so far
            "total_api_calls": len(call_log),            # three calls per question: generator + grounding + topicality
            "usage": {                                   # per-stage and overall token and cost breakdown
                "generator": {                           # stats for the answer-generation stage only
                    "model":               gen_model,    # model used for answering
                    "total_input_tokens":  gen_input,    # prompt tokens sent to the generator across all questions
                    "total_output_tokens": gen_output,   # completion tokens returned by the generator
                    "total_tokens":        gen_tokens,   # generator input + output combined
                    "total_cost":          round(gen_cost, 6),    # generator total cost in USD
                },
                "grounding": {                           # stats for the grounding assessment stage only
                    "model":               assessor_model,  # model used for grounding assessment
                    "total_input_tokens":  grnd_input,      # prompt tokens sent to the grounding assessor
                    "total_output_tokens": grnd_output,     # completion tokens returned by the grounding assessor
                    "total_tokens":        grnd_tokens,     # grounding assessor input + output combined
                    "total_cost":          round(grnd_cost, 6),   # grounding assessor total cost in USD
                },
                "topicality": {                          # stats for the topicality assessment stage only
                    "model":               assessor_model,  # model used for topicality assessment
                    "total_input_tokens":  topic_input,     # prompt tokens sent to the topicality assessor
                    "total_output_tokens": topic_output,    # completion tokens returned by the topicality assessor
                    "total_tokens":        topic_tokens,    # topicality assessor input + output combined
                    "total_cost":          round(topic_cost, 6),  # topicality assessor total cost in USD
                },
                "total_input_tokens":  total_input,         # grand total input tokens for the session
                "total_output_tokens": total_output,        # grand total output tokens for the session
                "total_tokens":        total_tokens,        # grand total tokens across all three stages
                "total_cost":          round(total_cost, 6), # grand total cost in USD for the session
            },
        },
        "assessments": assessment_log,   # one record per question: answer, retrieved chunks, grounding score, topicality score
        "calls":       call_log,         # one record per API call: stage, model, tokens, cost, latency, response text
    }

    with open("stats.json", "w", encoding="utf-8") as f:   # open stats.json for writing — overwrites previous content
        json.dump(output, f, indent=2)                       # write pretty-printed JSON with 2-space indentation


Saving after every question rather than only at the end means the file always reflects the current session state. If the process is interrupted or the user closes the terminal, all completed questions are already on disk.
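One caveat: writing directly to stats.json means an interruption in the middle of the write itself could leave a truncated file. A minimal sketch of one way to guard against that is shown here. It assumes a hypothetical helper named write_json_atomic (not part of the original code) that save_stats() could call in place of the open/json.dump pair. It writes to a temporary file in the same directory and then renames it over the target with os.replace.

```python
import json
import os
import tempfile


def write_json_atomic(data, path="stats.json"):
    """Hypothetical helper: write JSON so readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
            f.flush()
            os.fsync(f.fileno())          # push the bytes to disk before renaming
        os.replace(tmp_path, path)        # swap the new file in over the old one
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this in place, the previous stats.json stays intact until the new one is fully written.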




Interactive Evaluation Loop


Instead of running a fixed batch of questions at startup, the pipeline drops into an interactive loop where the user can type any question and receive a scored answer immediately. Each question triggers three API calls: one generator call and two assessor calls. After the scores are displayed, save_stats() updates stats.json with the latest data.



def _score_color(score):  # maps a 1-5 integer score to a Rich color name for terminal output
    if score >= 4:         # 4 or 5 means the answer is well grounded or fully on-topic
        return "green"     # green signals a good result
    if score == 3:         # 3 means the answer is partially correct or misses some aspects
        return "yellow"    # yellow signals a partial result that may need attention
    return "red"           # 1 or 2 means the answer is hallucinated, contradictory, or off-topic


def interactive_loop():  # runs the REPL: reads a question, scores it, prints results, loops until the user quits
    console.print()                                                               # blank line before the header rule
    console.rule("[bold cyan]LLM-as-a-Judge RAG Evaluator[/bold cyan]")          # full-width rule with centered title
    console.print()                                                               # blank line after the rule
    console.print("  Type a question and press Enter. Type [bold]quit[/bold] to exit.\n")  # usage instructions shown once at startup

    while True:                    # loop indefinitely until the user types quit or presses Ctrl+C
        try:
            question = input("  Question: ").strip()   # read user input from stdin and strip surrounding whitespace
        except (KeyboardInterrupt, EOFError):           # catch Ctrl+C and piped end-of-input
            console.print("\n  Goodbye.")               # print farewell message before exiting
            break                                       # exit the loop cleanly

        if not question:           # skip completely empty input and show the prompt again
            continue               # go back to the top of the while loop without calling any APIs

        if question.lower() in ("quit", "exit", "q"):  # accept several common exit commands
            console.print("  Goodbye.")                 # print farewell message before exiting
            break                                       # exit the loop cleanly

        console.print("  [cyan]Thinking...[/cyan]")    # notify the user that API calls are in progress

        rag_output = query_and_respond(question)        # retrieve top-k context chunks and generate a grounded answer
        grounding  = check_grounding(                   # score whether every claim in the answer is supported by the context
            rag_output["question"],                     # the original question — passed through from rag_output
            rag_output["context"],                      # the retrieved context the answer should be based on
            rag_output["answer"],                       # the generated answer being evaluated
        )
        topicality = check_topicality(                  # score whether the answer actually addresses what was asked
            rag_output["question"],                     # the original question
            rag_output["answer"],                       # the generated answer being evaluated
        )

        g_score = grounding["score"]    # extract the integer grounding score (1–5) from the assessor's JSON response
        t_score = topicality["score"]   # extract the integer topicality score (1–5) from the assessor's JSON response

        console.print()                         # blank line before the answer panel for visual spacing
        console.print(Panel(                    # display the answer inside a bordered panel
            rag_output["answer"],               # the generated answer text shown inside the panel
            title="[bold green]Answer[/bold green]",   # panel title shown centered in the top border
            border_style="green",               # green border to visually group the answer block
            padding=(1, 2),                     # 1 line top/bottom padding, 2 characters left/right padding
        ))

        console.print(                          # print the grounding score and assessor reasoning on one line
            f"  [bold]Grounding :[/bold]  "
            f"[{_score_color(g_score)}]{g_score}/5[/{_score_color(g_score)}]  "  # score colored green/yellow/red based on value
            f"{grounding['reasoning']}"         # one or two sentences from the assessor explaining the grounding score
        )
        console.print(                          # print the topicality score and assessor reasoning on one line
            f"  [bold]Topicality:[/bold]  "
            f"[{_score_color(t_score)}]{t_score}/5[/{_score_color(t_score)}]  "  # score colored green/yellow/red based on value
            f"{topicality['reasoning']}"        # one or two sentences from the assessor explaining the topicality score
        )

        for chunk in rag_output["chunks"]:      # iterate over each retrieved FAQ chunk to display it
            console.print(Panel(                # display each chunk inside a bordered panel
                chunk["text"],                  # the full FAQ chunk text shown inside the panel
                title=f"[bold yellow]{chunk['source']}[/bold yellow]",  # chunk name shown in the top border
                border_style="yellow",          # yellow border to visually distinguish source chunks from the answer
                padding=(0, 2),                 # no top/bottom padding, 2 characters left/right padding
            ))

        assessment_log.append({                            # record this question's complete result before writing to disk
            "question":          rag_output["question"],   # the question that was asked
            "answer":            rag_output["answer"],     # the generated answer
            "chunks":            rag_output["chunks"],     # list of dicts with source name and full chunk text
            "grounding_score":   g_score,                  # integer 1–5 grounding score from the assessor
            "grounding_reason":  grounding["reasoning"],   # assessor's explanation for the grounding score
            "topicality_score":  t_score,                  # integer 1–5 topicality score from the assessor
            "topicality_reason": topicality["reasoning"],  # assessor's explanation for the topicality score
        })

        save_stats()   # write stats.json immediately after every question so data survives an unexpected exit

        console.print()                    # blank line after the last chunk panel for visual spacing
        console.rule(style="cyan")         # full-width cyan rule to visually separate this Q&A block from the next
        console.print()                    # blank line between the rule and the next Question prompt


if __name__ == "__main__":
    interactive_loop()   # entry point: starts the interactive session once the vector store has been built

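The reasoning field is the most useful part of the output. A score alone tells you that something is wrong; the reasoning tells you exactly what the assessor objected to, which gives you a concrete starting point for improving the retriever, the generator, or the rubric itself. In the sample stats.json shown later, for example, several topicality scores of 3 come with reasoning that requirements can vary by institution. The topicality assessor sees only the question and the answer, so this likely points to a rubric that never states the answers concern a single university, not to off-topic answers.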
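To act on the reasoning, it helps to pull out only the questions that scored below some bar. The following is a minimal sketch, assuming records shaped like the assessment_log entries above. The function name flag_low_scores and the default threshold of 4 are hypothetical choices for illustration, and the sample record uses abbreviated reasoning text.

```python
def flag_low_scores(assessments, threshold=4):
    """Return (question, dimension, score, reasoning) for every score below threshold."""
    flagged = []
    for record in assessments:
        for dimension in ("grounding", "topicality"):
            score = record[f"{dimension}_score"]
            if score < threshold:
                flagged.append((
                    record["question"],
                    dimension,
                    score,
                    record[f"{dimension}_reason"],
                ))
    return flagged


# Illustrative record in the same shape as an assessment_log entry.
sample = [{
    "question": "How many credit hours do I need to graduate?",
    "grounding_score": 5,
    "grounding_reason": "Fully supported by the context.",
    "topicality_score": 3,
    "topicality_reason": "Lacks specificity.",
}]

for question, dimension, score, reasoning in flag_low_scores(sample):
    print(f"[{dimension} {score}/5] {question}\n    {reasoning}")
```

Reading the flagged reasoning side by side makes it easier to tell a retrieval problem from a generation problem or a rubric problem.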




Running the Pipeline


With all the code in place, running the pipeline is a single command. Make sure your .env file is configured and your virtual environment is active, then:



python evaluate.py



Output





Retrieved source chunks are shown below the scores so you can verify which FAQ entries the answer was built from. After each question, stats.json is updated with token counts, costs, and the full assessment record. Type quit to exit.


stats.json file




{
  "run_info": {
    "generator_model": "gpt-4o-mini",
    "judge_model": "gpt-4o",
    "embed_model": "text-embedding-3-small",
    "timestamp": "2026-06-18T12:29:14.579397",
    "total_questions": 8,
    "total_api_calls": 24,
    "usage": {
      "generator": {
        "model": "gpt-4o-mini",
        "total_input_tokens": 1887,
        "total_output_tokens": 213,
        "total_tokens": 2100,
        "total_cost": 0.000411
      },
      "grounding": {
        "model": "gpt-4o",
        "total_input_tokens": 3116,
        "total_output_tokens": 401,
        "total_tokens": 3517,
        "total_cost": 0.0118
      },
      "topicality": {
        "model": "gpt-4o",
        "total_input_tokens": 1724,
        "total_output_tokens": 326,
        "total_tokens": 2050,
        "total_cost": 0.00757
      },
      "total_input_tokens": 6727,
      "total_output_tokens": 940,
      "total_tokens": 7667,
      "total_cost": 0.019781
    }
  },
  "assessments": [
    {
      "question": "When is the application deadline for undergraduate admission in fall?",
      "answer": "The application deadline for undergraduate admission in fall is January 15.",
      "grounding_score": 5,
      "grounding_reason": "The answer correctly states the application deadline for undergraduate admission in fall as January 15, which is explicitly supported by the provided context.",
      "topicality_score": 3,
      "topicality_reason": "The answer provides a specific date for the application deadline, which partially addresses the question. However, it lacks context or clarification that deadlines can vary by institution, which is a key aspect of the question.",
      "chunks": [
        {"source": "faq_chunk_0", "text": "Undergraduate admission applications must be submitted by January 15 for fall enrollment. Applications received after the deadline are reviewed on a rolling basis subject to available capacity. Transfer students have a separate deadline of March 1 for the same fall term."},
        {"source": "faq_chunk_4", "text": "On-campus housing applications open on April 1 for the following academic year. Returning students are given priority over incoming students. First-year students are required to live on campus unless they are commuting from a permanent family residence within 30 miles."},
        {"source": "faq_chunk_1", "text": "Financial aid applications require the FAFSA to be completed by February 28 each year. Students who miss the priority deadline may still receive loans but are unlikely to receive grant funding. All financial aid awards are contingent on maintaining a minimum GPA of 2.0 and full-time enrollment status."}
      ]
    },
    {
      "question": "What GPA do I need to maintain to keep my financial aid?",
      "answer": "You need to maintain a minimum GPA of 2.0 to keep your financial aid.",
      "grounding_score": 5,
      "grounding_reason": "The answer is fully supported by the context, which states that all financial aid awards are contingent on maintaining a minimum GPA of 2.0.",
      "topicality_score": 3,
      "topicality_reason": "The answer partially addresses the question by providing a specific GPA requirement, but it misses key aspects such as the variability of GPA requirements depending on the type of financial aid or institution.",
      "chunks": [
        {"source": "faq_chunk_1", "text": "Financial aid applications require the FAFSA to be completed by February 28 each year. Students who miss the priority deadline may still receive loans but are unlikely to receive grant funding. All financial aid awards are contingent on maintaining a minimum GPA of 2.0 and full-time enrollment status."},
        {"source": "faq_chunk_3", "text": "To graduate, students must complete a minimum of 120 credit hours with at least 40 credits at the 300 level or above. A cumulative GPA of 2.0 is required for graduation, with a minimum 2.0 in the declared major. All students must complete the university writing requirement and the diversity and inclusion requirement."},
        {"source": "faq_chunk_0", "text": "Undergraduate admission applications must be submitted by January 15 for fall enrollment. Applications received after the deadline are reviewed on a rolling basis subject to available capacity. Transfer students have a separate deadline of March 1 for the same fall term."}
      ]
    },
    {
      "question": "Can I drop a course after the semester starts?",
      "answer": "Yes, you can drop a course after the semester starts without academic penalty during the first two weeks of the semester.",
      "grounding_score": 5,
      "grounding_reason": "The answer is fully supported by the context, which states that students may add or drop courses without academic penalty during the first two weeks of the semester. There is no additional or invented information in the answer.",
      "topicality_score": 5,
      "topicality_reason": "The answer directly addresses the question by confirming that a course can be dropped after the semester starts and provides additional detail about the timeframe for doing so without penalty.",
      "chunks": [
        {"source": "faq_chunk_2", "text": "Course registration opens four weeks before the start of each semester. Priority registration is granted first to students with disabilities, then by credit hours earned in descending order. Students may add or drop courses without academic penalty during the first two weeks of the semester."},
        {"source": "faq_chunk_0", "text": "Undergraduate admission applications must be submitted by January 15 for fall enrollment. Applications received after the deadline are reviewed on a rolling basis subject to available capacity. Transfer students have a separate deadline of March 1 for the same fall term."},
        {"source": "faq_chunk_7", "text": "Internship credit is available for approved work placements of at least 10 hours per week over a full semester. Students must secure the placement independently and submit a signed employer agreement before the semester begins. Internship credits count as elective credits only and cannot substitute for required major coursework."}
      ]
    },
    {
      "question": "How many credit hours do I need to graduate?",
      "answer": "You need to complete a minimum of 120 credit hours to graduate.",
      "grounding_score": 5,
      "grounding_reason": "The answer accurately reflects the information provided in the context, stating that a minimum of 120 credit hours is required to graduate. There is no additional or invented information present in the answer.",
      "topicality_score": 3,
      "topicality_reason": "The answer provides a general number of credit hours needed to graduate, but it lacks specificity as requirements can vary by institution and program.",
      "chunks": [
        {"source": "faq_chunk_3", "text": "To graduate, students must complete a minimum of 120 credit hours with at least 40 credits at the 300 level or above. A cumulative GPA of 2.0 is required for graduation, with a minimum 2.0 in the declared major. All students must complete the university writing requirement and the diversity and inclusion requirement."},
        {"source": "faq_chunk_7", "text": "Internship credit is available for approved work placements of at least 10 hours per week over a full semester. Students must secure the placement independently and submit a signed employer agreement before the semester begins. Internship credits count as elective credits only and cannot substitute for required major coursework."},
        {"source": "faq_chunk_2", "text": "Course registration opens four weeks before the start of each semester. Priority registration is granted first to students with disabilities, then by credit hours earned in descending order. Students may add or drop courses without academic penalty during the first two weeks of the semester."}
      ]
    },
 
 ...

    {
      "question": "Can internship credits count toward my major requirements?",
      "answer": "No, internship credits cannot substitute for required major coursework and only count as elective credits.",
      "grounding_score": 5,
      "grounding_reason": "The answer accurately reflects the context, stating that internship credits cannot substitute for required major coursework and only count as elective credits. This is directly supported by the provided information.",
      "topicality_score": 3,
      "topicality_reason": "The answer partially addresses the question by stating that internship credits cannot count toward major requirements, but it misses key aspects such as potential exceptions or variations depending on the institution or program.",
      "chunks": [
        {"source": "faq_chunk_7", "text": "Internship credit is available for approved work placements of at least 10 hours per week over a full semester. Students must secure the placement independently and submit a signed employer agreement before the semester begins. Internship credits count as elective credits only and cannot substitute for required major coursework."},
        {"source": "faq_chunk_3", "text": "To graduate, students must complete a minimum of 120 credit hours with at least 40 credits at the 300 level or above. A cumulative GPA of 2.0 is required for graduation, with a minimum 2.0 in the declared major. All students must complete the university writing requirement and the diversity and inclusion requirement."},
        {"source": "faq_chunk_2", "text": "Course registration opens four weeks before the start of each semester. Priority registration is granted first to students with disabilities, then by credit hours earned in descending order. Students may add or drop courses without academic penalty during the first two weeks of the semester."}
      ]
    }
  ],
  "calls": [
    {
      "timestamp": "2026-06-18T12:26:33.185806",
      "stage": "generator",
      "model": "gpt-4o-mini",
      "text": "When is the application deadline for undergraduate admission in fall?",
      "result": {
        "result": "The application deadline for undergraduate admission in fall is January 15.",
        "prompt_tokens": 229,
        "completion_tokens": 13,
        "total_tokens": 242,
        "input_cost": 3.44e-05,
        "output_cost": 7.8e-06,
        "total_cost": 4.22e-05
      },
      "time_taken_seconds": 1.64
    },
    {
      "timestamp": "2026-06-18T12:26:34.822822",
      "stage": "grounding",
      "model": "gpt-4o",
      "text": "When is the application deadline for undergraduate admission in fall?",
      "result": {
        "result": "{\n  \"score\": 5,\n  \"reasoning\": \"The answer correctly states the application deadline for undergraduate admission in fall as January 15, which is explicitly supported by the provided context.\"\n}",
        "prompt_tokens": 369,
        "completion_tokens": 41,
        "total_tokens": 410,
        "input_cost": 0.0009225,
        "output_cost": 0.00041,
        "total_cost": 0.0013325
      },
      "time_taken_seconds": 1.12
    },
    {
      "timestamp": "2026-06-18T12:26:35.938573",
      "stage": "topicality",
      "model": "gpt-4o",
      "text": "When is the application deadline for undergraduate admission in fall?",
      "result": {
        "result": "{\"score\":3,\"reasoning\":\"The answer provides a specific deadline, which partially addresses the question, but it lacks context such as the specific institution or any variations in deadlines that might exist.\"}",
        "prompt_tokens": 202,
        "completion_tokens": 40,
        "total_tokens": 242,
        "input_cost": 0.000505,
        "output_cost": 0.0004,
        "total_cost": 0.000905
      },
      "time_taken_seconds": 1.95
    },
    {
      "timestamp": "2026-06-18T12:27:12.294590",
      "stage": "generator",
      "model": "gpt-4o-mini",
      "text": "What GPA do I need to maintain to keep my financial aid?",
      "result": {
        "result": "You need to maintain a minimum GPA of 2.0 to keep your financial aid.",
        "prompt_tokens": 251,
        "completion_tokens": 18,
        "total_tokens": 269,
        "input_cost": 3.77e-05,
        "output_cost": 1.08e-05,
        "total_cost": 4.84e-05
      },
      "time_taken_seconds": 1.01
    },

...

  ]
}
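The run_info block makes it easy to see where the money goes. As a rough sketch, the helper below (summarize_costs is a hypothetical name, not part of the original code) derives the cost per question and each stage's share of the total, using the figures from the sample file above.

```python
def summarize_costs(run_info):
    """Derive cost per question and per-stage cost share from a run_info dict."""
    usage = run_info["usage"]
    total = usage["total_cost"]
    questions = run_info["total_questions"]
    stages = ("generator", "grounding", "topicality")
    return {
        "cost_per_question": total / questions if questions else 0.0,
        "stage_share": {
            stage: (usage[stage]["total_cost"] / total if total else 0.0)
            for stage in stages
        },
    }


# Cost figures copied from the sample stats.json above.
run_info = {
    "total_questions": 8,
    "usage": {
        "generator":  {"total_cost": 0.000411},
        "grounding":  {"total_cost": 0.0118},
        "topicality": {"total_cost": 0.00757},
        "total_cost": 0.019781,
    },
}

summary = summarize_costs(run_info)
print(f"cost per question: ${summary['cost_per_question']:.6f}")
for stage, share in summary["stage_share"].items():
    print(f"{stage:<11} {share:.1%}")
```

For this run that works out to about a quarter of a cent per question, with the two assessor calls accounting for roughly 98% of the spend. The judge model, not the generator, dominates the evaluation cost.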



Who Can Benefit


  • Students learning NLP or applied AI can use this project to understand what RAG evaluation looks like in practice. Building the assessors from scratch makes the underlying concepts concrete in a way that using an evaluation framework alone does not.

  • Startups shipping AI assistants backed by private documents can integrate this pattern into their CI/CD pipeline to catch regressions automatically whenever the knowledge base or system prompt changes.

  • Data scientists maintaining RAG systems can use the per-question reasoning output to identify which document chunks are causing grounding failures and improve retrieval quality incrementally.

  • Enterprises running high-volume AI pipelines can adapt the assessment loop to sample a fraction of production traffic for continuous quality monitoring without reviewing every query manually.

  • AI engineers can extend the two-assessor pattern to additional dimensions: completeness, tone, citation accuracy, or any other property that matters for their specific domain.
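The CI/CD use case mentioned above can be sketched as a simple score gate. This is an illustration under stated assumptions, not part of the original pipeline: the function names mean_scores and gate and both threshold values are hypothetical, and it assumes the assessments list comes from a stats.json produced by running a fixed question set through the evaluator.

```python
def mean_scores(assessments):
    """Average grounding and topicality scores over a list of assessment records."""
    n = len(assessments)
    if n == 0:
        raise ValueError("no assessments to evaluate")
    return {
        "grounding": sum(a["grounding_score"] for a in assessments) / n,
        "topicality": sum(a["topicality_score"] for a in assessments) / n,
    }


def gate(assessments, min_grounding=4.5, min_topicality=3.5):
    """Return failure messages; an empty list means the gate passes.

    The thresholds are placeholder values and should be calibrated per project.
    """
    means = mean_scores(assessments)
    failures = []
    if means["grounding"] < min_grounding:
        failures.append(f"mean grounding {means['grounding']:.2f} < {min_grounding}")
    if means["topicality"] < min_topicality:
        failures.append(f"mean topicality {means['topicality']:.2f} < {min_topicality}")
    return failures


# Illustrative records; in CI these would be json.load(...)["assessments"] from stats.json.
records = [
    {"grounding_score": 5, "topicality_score": 3},
    {"grounding_score": 5, "topicality_score": 5},
]
for message in gate(records):
    print(f"FAIL: {message}")
```

A CI job could exit with a non-zero status whenever gate() returns any messages, which would block a merge that degrades answer quality.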




How Codersarts Can Help


If you want to take this further, Codersarts offers hands-on support at every stage.


  • For learners: Live 1-to-1 sessions with an AI engineer who can walk through the RAG architecture, the assessor design, and help you adapt the pipeline to your own knowledge base.

  • For teams: End-to-end development of custom RAG evaluation systems including rubric design, calibration against human labels, and integration with your existing testing infrastructure.

  • For enterprises: Architecture consulting for production-grade LLM evaluation pipelines, including sampling strategies, cost optimisation, and dashboarding.


Reach out at contact@codersarts.com or visit www.codersarts.com to get started.



