Evaluating Natural Language to SQL Generation with Promptfoo and Python

1 day ago
17 min read

Introduction

Most LLM evaluation tutorials check whether a generated answer “sounds right” by asking another LLM to grade it. That works for tone and style, but it falls apart for tasks with an objectively correct answer. SQL generation is exactly that kind of task: a query either returns the right rows or it does not, and no amount of LLM-rubric grading can substitute for actually running the query.

In this tutorial we build a promptfoo evaluation for a natural language to SQL generator. Instead of asking a second model whether the generated SQL “looks correct,” a custom Python assertion executes the generated query against a real SQLite database and compares the returned rows to a known-correct reference query. We compare two OpenAI models, gpt-4o-mini and gpt-5-mini, track real per-token cost despite promptfoo’s pricing table not yet knowing about newer models, and demonstrate that the grading logic genuinely catches wrong answers rather than rubber-stamping everything.

What We Are Building

A promptfoo project that evaluates a SQL-generation prompt against a fixed orders table. The workflow:

Define a prompt that takes a plain English question and returns a SQLite query
Write four test cases, each with a known-correct reference query
Execute every generated query against an in-memory database and compare actual results, not text
Compare two models on the same questions
Track real per-token cost in a stats.json file that accumulates across runs

Tech Stack

Component	Tool
Evaluation framework	promptfoo (Node.js CLI, run via npx)
Models compared	gpt-4o-mini and gpt-5-mini
Correctness check	Custom Python assertion executing SQL against SQLite
Cost tracking	Custom build_stats.py using per-token rates from .env
Environment	python-dotenv

Project Structure


promptfoo_code_review_eval/
├── promptfooconfig.yaml        # prompts, providers, test cases, assertions
├── assert_sql_correctness.py   # custom Python assertion — executes SQL and compares results
├── build_stats.py              # parses promptfoo's results.json into stats.json with our own cost rates
├── requirements.txt            # python-dotenv
├── .env                        # OPENAI_API_KEY and per-model cost rates
└── README.md

Setting Up the Environment

Promptfoo is a Node.js CLI tool, so no global install is required. npx downloads and runs it on first use:


npx promptfoo@latest eval

Promptfoo requires Node ^20.20.0 or >=22.22.0. Check your version first:


node --version

On Windows, upgrade via winget if needed:


winget upgrade --id OpenJS.NodeJS.22 --silent --accept-package-agreements --accept-source-agreements

.env holds the OpenAI key and the per-token cost rates used later for tracking spend:


OPENAI_API_KEY=your_openai_api_key_here
GPT_4O_MINI_INPUT_COST=0.00000015
GPT_4O_MINI_OUTPUT_COST=0.00000060
GPT_5_MINI_INPUT_COST=0.00000025
GPT_5_MINI_OUTPUT_COST=0.00000200

Promptfoo loads .env automatically, so the API key needs no extra wiring. The cost rates are read separately by build_stats.py, covered later.

Building the Promptfoo Configuration

Everything promptfoo needs, the prompts, the providers, the test cases, and the assertions, lives in one YAML file. Create a file named promptfooconfig.yaml in your project root. Promptfoo automatically detects this exact filename when you run npx promptfoo@latest eval from the same directory, so the name is not optional.

A quick note before the full file: lines inside a raw: | block scalar are literal prompt text. A trailing # there would become part of the string sent to the model rather than a YAML comment, so the prompt content itself is left uncommented below, exactly as it has to be. Every structural line around it is commented.


description: "Natural language to SQL query generator evaluation"  # shown as the eval's title in the promptfoo web UI

prompts:
  - id: prompt_correct                 # unique id used to wire this prompt to a provider below
    label: prompt_correct              # human-readable label shown in the results table
    raw: |                             # the literal prompt template; {{question}} is substituted per test case
      You are a SQL assistant. Given a table schema and a question in plain English,
      write a single SQLite query that answers the question exactly.

      Schema:
      CREATE TABLE orders (
          order_id INTEGER PRIMARY KEY,
          customer_name TEXT,
          amount REAL,
          order_date TEXT,
          status TEXT
      );

      Question: {{question}}

      Respond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.

  # DELIBERATE SABOTAGE — used only to prove the eval catches a wrong answer.
  # Identical to prompt_correct except for one false instruction that only matters
  # for the percentage test case (test 4). Tests 1-3 are unaffected by this line.
  - id: prompt_degraded                # unique id for the sabotaged prompt, explained in full further down
    label: prompt_degraded             # human-readable label shown in the results table
    raw: |                             # identical to prompt_correct except for the extra paragraph below
      You are a SQL assistant. Given a table schema and a question in plain English,
      write a single SQLite query that answers the question exactly.

      Schema:
      CREATE TABLE orders (
          order_id INTEGER PRIMARY KEY,
          customer_name TEXT,
          amount REAL,
          order_date TEXT,
          status TEXT
      );

      Question: {{question}}

      Note: whenever a question asks for a percentage or share of a total, always compute
      the denominator as the sum across ALL rows in the table, ignoring any status filter
      mentioned elsewhere in the question.

      Respond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.

providers:
  - id: openai:chat:gpt-4o-mini         # promptfoo's provider id format for OpenAI chat models
    label: "GPT-4o-mini"                # display name shown in the results table
    prompts: [prompt_degraded]          # this provider receives the sabotaged prompt, explained further down
    config:
      metadata:                         # tags visible in the OpenAI dashboard usage logs
        dev_name: "Ganesh"              # developer name for tracking who ran this eval
        project: "codex-test"           # project identifier in the usage dashboard
        environment: "local"            # separates local runs from CI or production
        purpose: "testing"              # intent label for this run
  - id: openai:chat:gpt-5-mini          # the second model being compared
    label: "GPT-5-mini"                 # display name shown in the results table
    prompts: [prompt_correct]           # this provider receives the real, correct prompt
    config:
      metadata:                         # same metadata shape, repeated per provider
        dev_name: "Ganesh"
        project: "codex-test"
        environment: "local"
        purpose: "testing"

defaultTest:
  assert:                               # assertions applied automatically to every test case below
    - type: latency                     # fails any call slower than this threshold
      threshold: 15000                  # 15 seconds, expressed in milliseconds
    - type: cost                        # fails any call costlier than this threshold
      threshold: 0.01                   # $0.01 per call, using promptfoo's own pricing table
    - type: python                      # every test case is also graded by our custom assertion
      value: file://assert_sql_correctness.py  # path to the file containing get_assert()

tests:
  - description: "Simple filter on a numeric column"           # row label shown in the results table
    vars:
      question: "List the order_id and amount for every order where the amount is greater than 100."  # substituted into {{question}}
      expected_query: "SELECT order_id, amount FROM orders WHERE amount > 100"  # known-correct answer, never shown to the model
      order_matters: "false"            # GROUP BY/WHERE results have no meaningful row order

  - description: "Aggregation with GROUP BY"                    # row label shown in the results table
    vars:
      question: "For each customer, show their name and their total order amount across all orders."  # substituted into {{question}}
      expected_query: "SELECT customer_name, SUM(amount) FROM orders GROUP BY customer_name"  # reference query for this question
      order_matters: "false"            # per-customer totals have no required row order

  - description: "Filter, sort, and limit combined"             # row label shown in the results table
    vars:
      question: "Show the order_id and order_date of the 2 most recently placed orders that have status 'completed', with the most recent first."  # substituted into {{question}}
      expected_query: "SELECT order_id, order_date FROM orders WHERE status = 'completed' ORDER BY order_date DESC LIMIT 2"  # reference query
      order_matters: "true"             # "most recent first" makes row order part of correctness

  - description: "Nested aggregation — each customer's share of total completed revenue"  # row label shown in the results table
    vars:
      question: "For each customer, what percentage of the total amount across all completed orders do they account for? Show customer_name and their percentage, rounded to 1 decimal place. Only consider completed orders, including in the total they are compared against."  # substituted into {{question}}
      expected_query: "SELECT customer_name, ROUND(SUM(amount) * 100.0 / (SELECT SUM(amount) FROM orders WHERE status = 'completed'), 1) AS pct FROM orders WHERE status = 'completed' GROUP BY customer_name"  # the trap: omitting this WHERE in the subquery gives wrong percentages
      order_matters: "false"            # per-customer percentages have no required row order

{{question}} is the only variable shown to the model. expected_query and order_matters exist purely for grading and are never substituted into the prompt, since the prompt template never references them. The fourth test case is the hardest one: it requires a subquery inside the denominator of a percentage calculation, and forgetting to filter that subquery by status = 'completed' produces a plausible-looking query that returns subtly wrong numbers. That subtlety is exactly what prompt_degraded exploits, covered in full further down.

The Custom Assertion: Executing SQL Instead of Trusting Appearances

This is the core of the evaluation. Rather than comparing the generated SQL text to the reference SQL text, the assertion runs both queries against the same database and compares the actual rows returned. Create a file named assert_sql_correctness.py in the same directory as promptfooconfig.yaml — this is the exact filename referenced by value: file://assert_sql_correctness.py in the config above.


import re                                            # extract SQL from a fenced markdown code block
import sqlite3                                        # in-memory database used to actually run both queries

SCHEMA_SQL = """
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_name TEXT,
    amount REAL,
    order_date TEXT,
    status TEXT
);
"""
# Note: this is a Python triple-quoted string, so a trailing # on any line above
# would become literal text inside the SQL rather than a comment, the same
# constraint as the raw: | block in the YAML file.

ORDER_ROWS = [                                       # fixed, deterministic order data so every test run sees the same table
    (1, "Alice",   120.50, "2024-01-15", "completed"),  # one of two completed orders for Alice
    (2, "Bob",      75.00, "2024-01-20", "completed"),  # first of two completed orders for Bob
    (3, "Alice",   200.00, "2024-02-01", "completed"),  # Alice's second completed order
    (4, "Charlie",  50.00, "2024-02-10", "cancelled"),  # cancelled order, excluded from completed-only totals
    (5, "Bob",     300.00, "2024-02-15", "completed"),  # Bob's second completed order
    (6, "Charlie", 150.00, "2024-03-01", "completed"),  # Charlie's only completed order
]


def build_test_database():                          # fresh in-memory SQLite DB, rebuilt for every grading call
    conn = sqlite3.connect(":memory:")               # no file on disk — every call starts from a clean, identical state
    conn.executescript(SCHEMA_SQL)                   # create the orders table
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", ORDER_ROWS)  # load the fixed seed rows
    conn.commit()                                    # persist the inserts before any query runs against this connection
    return conn                                      # caller runs both the generated and reference queries on this connection


def extract_sql_statement(model_output):             # pull the query out of a ```sql fenced block, or use raw text
    fenced = re.search(r"```(?:sql)?\s*(.*?)```", model_output, re.DOTALL | re.IGNORECASE)  # look for a fenced code block
    raw_sql = fenced.group(1) if fenced else model_output  # fall back to the whole output if no fence is found
    return raw_sql.strip().rstrip(";")               # trailing semicolons are harmless but stripped for consistency


def get_assert(output, context):                     # promptfoo calls this for every test wired to this file
    generated_sql = extract_sql_statement(output)                              # the model's SQL, fences stripped
    expected_sql  = context["vars"]["expected_query"]                          # known-correct reference query for this test
    order_matters = str(context["vars"].get("order_matters", "false")).lower() == "true"  # whether row order is part of the spec

    conn = build_test_database()                     # same seed data for both queries, every time

    try:
        generated_rows = conn.execute(generated_sql).fetchall()  # actually execute the model's query
    except sqlite3.Error as exc:                      # catches SQL syntax errors and invalid column or table references
        conn.close()                                  # release the connection before returning
        return {
            "pass": False,                            # a query that does not even run can never be correct
            "score": 0.0,                             # zero score for a non-executing query
            "reason": f"Generated SQL failed to execute: {exc}. Query was: {generated_sql}",  # syntax errors fail immediately
        }

    expected_rows = conn.execute(expected_sql).fetchall()  # run the reference query against the same data
    conn.close()                                      # release the connection now that both queries have run

    rows_match = (generated_rows == expected_rows) if order_matters else (sorted(generated_rows) == sorted(expected_rows))  # exact order vs sorted comparison

    if rows_match:
        return {
            "pass": True,                             # the generated query returned exactly the expected data
            "score": 1.0,                             # full score for a correct query
            "reason": f"Query executed successfully and returned {len(generated_rows)} row(s) matching the expected result exactly.",
        }

    return {
        "pass": False,                                # the generated query ran but returned the wrong data
        "score": 0.0,                                 # zero score for incorrect results
        "reason": f"Query results do not match. Generated SQL returned {generated_rows}, expected {expected_rows}.",  # shows both result sets for fast debugging
    }

get_assert(output, context) is the exact function signature promptfoo requires for file-based Python assertions, found via type: python and value: file://assert_sql_correctness.py in the config. Building a fresh in-memory database on every single call means there is no shared state between test cases and no risk of one test’s data leaking into another’s result. Sorting both result sets before comparing handles the common case where GROUP BY output order is not semantically meaningful, while order_matters: "true" enforces exact row order for questions where it is, like “most recent first.”

Tracking Cost and Token Usage

Promptfoo’s own cost assertion relies on its internal pricing table, and newer models like gpt-5-mini are not guaranteed to be in it yet. Rather than trust a possibly-missing price, this script reads promptfoo’s own results.json output and applies our own per-token rates from .env to the real token counts promptfoo reports. Create a file named build_stats.py in the same directory.


import os                           # read per-model cost rates from .env
import json                         # parse promptfoo's results.json and write stats.json
from datetime import datetime       # timestamp for each run and for the stats file

from dotenv import load_dotenv      # load .env before any os.environ.get() call

load_dotenv()  # injects all key=value pairs from .env into os.environ

COST_RATES = {  # per-token USD rates, keyed by the promptfoo provider id used in promptfooconfig.yaml
    "openai:chat:gpt-4o-mini": {
        "input":  float(os.environ.get("GPT_4O_MINI_INPUT_COST",  0.00000015)),
        "output": float(os.environ.get("GPT_4O_MINI_OUTPUT_COST", 0.00000060)),
    },
    "openai:chat:gpt-5-mini": {
        "input":  float(os.environ.get("GPT_5_MINI_INPUT_COST",  0.00000025)),
        "output": float(os.environ.get("GPT_5_MINI_OUTPUT_COST", 0.00000200)),
    },
}

RESULTS_FILE = "results.json"  # produced by: npx promptfoo@latest eval --output results.json
STATS_FILE   = "stats.json"    # written by this script — accumulates across every run


def load_eval_rows(path):                                  # read promptfoo's flat per-test, per-provider result rows
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    return data["results"]["results"]                       # promptfoo nests the row array under results.results


def build_call_record(row, run_id):                         # convert one promptfoo result row into our own call record
    provider_id    = row["provider"]["id"]                  # e.g. "openai:chat:gpt-5-mini"
    usage          = row.get("tokenUsage") or {}             # token counts reported by promptfoo for this call
    prompt_tok     = usage.get("prompt", 0)                  # input tokens for this call
    completion_tok = usage.get("completion", 0)              # output tokens for this call
    total_tok      = usage.get("total", prompt_tok + completion_tok)  # combined token count

    rates       = COST_RATES.get(provider_id, {"input": 0, "output": 0})  # our own rates, not promptfoo's built-in table
    input_cost  = prompt_tok     * rates["input"]            # cost of the prompt in USD
    output_cost = completion_tok * rates["output"]           # cost of the completion in USD
    total_cost  = input_cost + output_cost                   # combined cost for this single call

    return {
        "run_id":            run_id,                                                # which build_stats.py execution this call belongs to
        "provider":          provider_id,                                          # which model handled this call
        "test_description":  row.get("testCase", {}).get("description", ""),       # which test case this row belongs to
        "prompt_tokens":      prompt_tok,                    # input tokens for this call
        "completion_tokens":  completion_tok,                # output tokens for this call
        "total_tokens":       total_tok,                     # combined token count
        "input_cost":         round(input_cost,  7),         # prompt cost in USD
        "output_cost":        round(output_cost, 7),         # completion cost in USD
        "total_cost":         round(total_cost,  7),          # total cost in USD
        "latency_ms":         row.get("latencyMs", 0),        # wall-clock latency reported by promptfoo
        "passed":             row.get("success", False),     # whether every assertion on this row passed
    }


def compute_usage(calls, provider_id):                       # token and cost totals for one provider across a set of calls
    entries = [c for c in calls if c["provider"] == provider_id]  # filter down to only this provider's calls
    return {
        "total_input_tokens":  sum(c["prompt_tokens"]     for c in entries),  # summed input tokens for this provider
        "total_output_tokens": sum(c["completion_tokens"] for c in entries),  # summed output tokens for this provider
        "total_tokens":        sum(c["total_tokens"]      for c in entries),  # combined tokens for this provider
        "total_cost":          round(sum(c["total_cost"]  for c in entries), 6),  # combined USD cost for this provider
        "total_calls":         len(entries),                                  # number of calls made by this provider
    }


def summarize(calls):                                         # build a provider breakdown + totals for any list of calls
    provider_ids      = sorted(set(c["provider"] for c in calls))            # every distinct model present in this set
    usage_by_provider = {pid: compute_usage(calls, pid) for pid in provider_ids}  # per-model usage breakdown

    return {
        "providers":       provider_ids,                     # every model id present in this set of calls
        "total_api_calls": len(calls),                        # total calls in this set
        "usage": {
            **usage_by_provider,                              # one entry per provider with its own token/cost totals
            "total_tokens": sum(u["total_tokens"] for u in usage_by_provider.values()),  # combined tokens across providers
            "total_cost":   round(sum(u["total_cost"] for u in usage_by_provider.values()), 6),  # combined cost across providers
        },
    }


def main():
    try:                                                       # load everything saved by previous runs so stats accumulate
        with open(STATS_FILE, "r", encoding="utf-8") as f:
            previous = json.load(f)
        all_calls = previous.get("calls", [])                  # every call ever recorded, across all past runs
        runs      = previous.get("runs", [])                   # one summary entry per past execution of this script
    except (FileNotFoundError, json.JSONDecodeError):
        all_calls = []                                          # first run — start with empty history
        runs      = []

    run_id    = datetime.now().isoformat()                     # unique identifier for this execution of build_stats.py
    rows      = load_eval_rows(RESULTS_FILE)                    # load promptfoo's raw evaluation rows for this run only
    new_calls = [build_call_record(row, run_id) for row in rows]  # convert this run's rows into our own call record format
    all_calls.extend(new_calls)                                 # grow the lifetime call history with this run's calls

    run_summary = summarize(new_calls)                          # totals scoped to only this run
    run_summary["run_id"] = run_id
    runs.append(run_summary)                                    # add this run's summary to the run history

    lifetime_summary = summarize(all_calls)                     # totals across every run ever recorded

    output = {
        "run_info": {
            "timestamp":  datetime.now().isoformat(),           # when this stats file was last written
            "total_runs": len(runs),                            # how many times build_stats.py has been run
            **lifetime_summary,                                 # lifetime providers, total_api_calls, and usage
        },
        "runs":  runs,                                          # one entry per build_stats.py execution
        "calls": all_calls,                                     # every individual call ever recorded, across all runs
    }

    with open(STATS_FILE, "w", encoding="utf-8") as f:   # overwrite stats.json with the merged lifetime + per-run data
        json.dump(output, f, indent=2)                    # pretty-print with 2-space indentation

    print(                                                # summary line printed to the terminal after every run
        f"Wrote {STATS_FILE} — this run: {len(new_calls)} calls (${run_summary['usage']['total_cost']:.6f}). "
        f"Lifetime: {len(all_calls)} calls across {len(runs)} runs, total cost ${lifetime_summary['usage']['total_cost']:.6f}"
    )


if __name__ == "__main__":                                # run main() only when this file is executed directly, not when imported
    main()

Running build_stats.py more than once does not overwrite history. Each execution loads the existing stats.json, appends the new run’s calls, and recomputes both a per-run summary and a lifetime total. run_info always reflects every call ever recorded, runs has one entry per execution with its own scoped totals, and calls has every individual call tagged with the run_id that produced it.

A Deliberately Sabotaged Prompt: Proving the Evaluation Actually Catches Mistakes

In repeated live testing with a single, identical, correct prompt for both models, gpt-4o-mini and gpt-5-mini answered every question correctly, including several deliberately tricky variants tried along the way: a HAVING clause after GROUP BY, “second highest value” instead of maximum, and a JOIN-based regional rollup with a distinct-customer count. That is a genuinely useful finding: for single-table SQL generation at this level of difficulty, the cheaper model is just as reliable as the newer one.

It also meant there was no live failing row to point to as proof the evaluation actually catches mistakes rather than rubber-stamping everything. To demonstrate that the grading logic genuinely discriminates, gpt-4o-mini is wired to prompt_degraded instead of prompt_correct, using promptfoo’s per-provider prompts: field shown in the full configuration above. That prompt is identical to the correct one except for one extra paragraph of false guidance: it tells the model to always compute percentage denominators over every row in the table, ignoring any status filter mentioned in the question.

That instruction is irrelevant to the first three questions, since none of them involve a percentage calculation, so gpt-4o-mini still passes those normally. It only matters for test case 4, where following the bad instruction produces a denominator computed over every order instead of just completed ones, giving subtly wrong percentages. gpt-5-mini is wired to prompt_correct and passes every test.

This is intentionally not a fair model comparison, and the project’s own README says so directly: do not conclude from this result that gpt-4o-mini is worse at SQL generation than gpt-5-mini. Point both providers at prompt_correct and gpt-4o-mini passes everything too. The sabotaged prompt exists purely to generate a real, honest failing test case, produced by a real model following real instructions, rather than a fabricated result.

Running the Evaluation

With .env configured and a supported Node version installed, run the evaluation:


npx promptfoo@latest eval --output results.json --no-cache

--no-cache forces real API calls instead of reusing promptfoo’s local cache from an identical previous run. The terminal prints a pass/fail table for all 4 questions against both models:

The single failure is gpt-4o-mini on the percentage test, exactly as expected from the sabotaged prompt. Browse the same results in an interactive UI:

We will run it again. Instead of re-executing the entire file from the beginning, it will use the cached results from the previous execution. This saves both time and cost.


npx promptfoo@latest eval

or

npx promptfoo@latest eval --output results.json

results.json file


{
  "evalId": "eval-5wn-2026-06-25T04:51:29",
  "results": {
    "version": 3,
    "timestamp": "2026-06-25T04:51:29.297Z",
    "prompts": [
      {
        "id": "4f9469eead24401d82a9626541e1e36ae0352caab585774913d35a4760bb5f20",
        "raw": "You are a SQL assistant. Given a table schema and a question in plain English,\nwrite a single SQLite query that answers the question exactly.\n\nSchema:\nCREATE TABLE orders (\n    order_id INTEGER PRIMARY KEY,\n    customer_name TEXT,\n    amount REAL,\n    order_date TEXT,\n    status TEXT\n);\n\nQuestion: {{question}}\n\nNote: whenever a question asks for a percentage or share of a total, always compute\nthe denominator as the sum across ALL rows in the table, ignoring any status filter\nmentioned elsewhere in the question.\n\nRespond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.\n",
        "label": "prompt_degraded",
        "provider": "GPT-4o-mini",
        "metrics": {
          "score": 3.6666666666666665,
          "testPassCount": 3,
          "testFailCount": 1,
          "testErrorCount": 0,
          "assertPassCount": 11,
          "assertFailCount": 1,
          "totalLatencyMs": 5199,
          "tokenUsage": {
            "prompt": 0,
            "completion": 0,
            "cached": 761,
            "total": 761,
            "numRequests": 4,
            "completionDetails": {
              "reasoning": 0,
              "acceptedPrediction": 0,
              "rejectedPrediction": 0,
              "cacheReadInputTokens": 0,
              "cacheCreationInputTokens": 0
            },
            "assertions": {
              "total": 0,
              "prompt": 0,
              "completion": 0,
              "cached": 0,
              "numRequests": 0,
              "completionDetails": {
                "reasoning": 0,
                "acceptedPrediction": 0,
                "rejectedPrediction": 0,
                "cacheReadInputTokens": 0,
                "cacheCreationInputTokens": 0
              }
            }
          },
          "namedScores": {},
          "namedScoresCount": {},
          "namedScoreWeights": {},
          "cost": 0
        }
      },
      {
        "id": "c69301860a878254d9cc28f6b05e2cba8bb21563bfc267646d409598813492f8",
        "raw": "You are a SQL assistant. Given a table schema and a question in plain English,\nwrite a single SQLite query that answers the question exactly.\n\nSchema:\nCREATE TABLE orders (\n    order_id INTEGER PRIMARY KEY,\n    customer_name TEXT,\n    amount REAL,\n    order_date TEXT,\n    status TEXT\n);\n\nQuestion: {{question}}\n\nRespond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.\n",
        "label": "prompt_correct",
        "provider": "GPT-5-mini",
        "metrics": {
          "score": 4,
          "testPassCount": 4,
          "testFailCount": 0,
          "testErrorCount": 0,
          "assertPassCount": 12,
          "assertFailCount": 0,
          "totalLatencyMs": 15449,
          "tokenUsage": {
            "prompt": 0,
            "completion": 0,
            "cached": 1202,
            "total": 1202,
            "numRequests": 4,
            "completionDetails": {
              "reasoning": 0,
              "acceptedPrediction": 0,
              "rejectedPrediction": 0,
              "cacheReadInputTokens": 0,
              "cacheCreationInputTokens": 0
            },
            "assertions": {
              "total": 0,
              "prompt": 0,
              "completion": 0,
              "cached": 0,
              "numRequests": 0,
              "completionDetails": {
                "reasoning": 0,
                "acceptedPrediction": 0,
                "rejectedPrediction": 0,
                "cacheReadInputTokens": 0,
                "cacheCreationInputTokens": 0
              }
            }
          },
          "namedScores": {},
          "namedScoresCount": {},
          "namedScoreWeights": {},
          "cost": 0
        }
      }
    ],

....

  "vars": [
    "question",
    "expected_query",
    "order_matters"
  ],
  "runtimeOptions": {
    "eventSource": "cli",
    "showProgressBar": true,
    "repeat": 1,
    "maxConcurrency": 4,
    "cache": true
  }
}

The web view shows gpt-4o-mini at 75% passing (3/4 cases) and gpt-5-mini at 100% passing (4/4 cases), with the failing row expanded to show both generated queries side by side: gpt-4o-mini’s query computes the denominator with (SELECT SUM(amount) FROM orders), no status filter, while gpt-5-mini’s query correctly includes WHERE status = 'completed' in the same subquery.


npx promptfoo@latest view

Then build the cost report:


pip install -r requirements.txt
python build_stats.py

stats.json file


{
  "run_info": {
    "timestamp": "2026-06-25T10:20:18.428976",
    "total_runs": 1,
    "providers": [
      "openai:chat:gpt-4o-mini",
      "openai:chat:gpt-5-mini"
    ],
    "total_api_calls": 8,
    "usage": {
      "openai:chat:gpt-4o-mini": {
        "total_input_tokens": 632,
        "total_output_tokens": 129,
        "total_tokens": 761,
        "total_cost": 0.000172,
        "total_calls": 4
      },
      "openai:chat:gpt-5-mini": {
        "total_input_tokens": 464,
        "total_output_tokens": 738,
        "total_tokens": 1202,
        "total_cost": 0.001592,
        "total_calls": 4
      },
      "total_tokens": 1963,
      "total_cost": 0.001764
    }
  },
  "runs": [
    {
      "providers": [
        "openai:chat:gpt-4o-mini",
        "openai:chat:gpt-5-mini"
      ],
      "total_api_calls": 8,
      "usage": {
        "openai:chat:gpt-4o-mini": {
          "total_input_tokens": 632,
          "total_output_tokens": 129,
          "total_tokens": 761,
          "total_cost": 0.000172,
          "total_calls": 4
        },
        "openai:chat:gpt-5-mini": {
          "total_input_tokens": 464,
          "total_output_tokens": 738,
          "total_tokens": 1202,
          "total_cost": 0.001592,
          "total_calls": 4
        },
        "total_tokens": 1963,
        "total_cost": 0.001764
      },
      "run_id": "2026-06-25T10:20:18.426809"
    }
  ],
  "calls": [
    {
      "run_id": "2026-06-25T10:20:18.426809",
      "provider": "openai:chat:gpt-4o-mini",
      "test_description": "Simple filter on a numeric column",
      "prompt_tokens": 149,
      "completion_tokens": 17,
      "total_tokens": 166,
      "input_cost": 2.23e-05,
      "output_cost": 1.02e-05,
      "total_cost": 3.25e-05,
      "latency_ms": 1405,
      "passed": true
    },
...
    {
      "run_id": "2026-06-25T10:20:18.426809",
      "provider": "openai:chat:gpt-5-mini",
      "test_description": "Nested aggregation \u2014 each customer's share of total completed revenue",
      "prompt_tokens": 136,
      "completion_tokens": 383,
      "total_tokens": 519,
      "input_cost": 3.4e-05,
      "output_cost": 0.000766,
      "total_cost": 0.0008,
      "latency_ms": 5947,
      "passed": true
    }
  ]
}

Who Can Benefit

Students learning evaluation methodology can use this project to see the difference between text-similarity grading and execution-based grading concretely, on a task simple enough to reason about by hand.
Data teams building natural language interfaces to internal databases can adapt this pattern directly: the same get_assert function works for any SQL dialect with a Python driver, not just SQLite.
AI engineers evaluating model upgrades can use the prompts: per-provider mapping technique to A/B test prompt changes alongside model changes in the same run, rather than conflating the two.
Teams adopting promptfoo can use the build_stats.py pattern whenever they need cost tracking for a model promptfoo’s built-in pricing table does not yet recognize.
QA and trust teams can extend the assertion to other deterministic domains: API response schema validation, calculation correctness, or any output that can be checked by running code rather than reading text.

How Codersarts Can Help

If you want to take this further, Codersarts offers hands-on support at every stage.

For learners: Live 1-to-1 sessions with an AI engineer who can walk through promptfoo configuration, custom assertion design, and execution-based evaluation patterns in detail.
For teams: End-to-end development of custom LLM evaluation suites, including domain-specific correctness checks, CI/CD integration, and prompt regression testing.
For enterprises: Architecture consulting for production-grade evaluation pipelines, including cost tracking across model upgrades and multi-provider comparison frameworks.

Reach out at contact@codersarts.com or visit www.codersarts.com to get started.

Continue Your AI Learning Journey with Codersarts

If you enjoyed this article and would like to discover more about modern AI applications, production-ready LLM systems, and real-world RAG and MCP implementations, be sure to explore these other blogs from Codersarts:

Build a Cost-Efficient Writing Quality Checker with Tiered Model Routing and OpenAI
https://www.codersarts.com/post/build-a-cost-efficient-writing-quality-checker-with-tiered-model-routing-and-openai

Build Your First A2A Agent: An Email Drafting Pipeline Using Python and OpenAI
https://www.codersarts.com/post/build-your-first-a2a-agent-an-email-drafting-pipeline-using-python-and-openai

Building an AI Interview Prep Agent with Qwen 3.7 Max and Streamlit
https://www.codersarts.com/post/building-an-ai-interview-prep-agent-with-qwen-3-7-max-and-streamlit

Academic Research Assistance and Literature Review Automation Using RAG
https://www.codersarts.com/post/academic-research-assistance-and-literature-review-automation-using-rag

Clinical Decision Support Systems Using RAG: Intelligent Diagnostic Assistance for Healthcare
https://www.codersarts.com/post/clinical-decision-support-systems-using-rag-healthcare-with-intelligent-diagnostic-assistance

Financial Decision Making with RAG Powered Market Intelligence
https://www.codersarts.com/post/financial-decision-making-with-rag-powered-market-intelligence