Evaluating Natural Language to SQL Generation with Promptfoo and Python
- 1 day ago
- 17 min read
Introduction
Most LLM evaluation tutorials check whether a generated answer “sounds right” by asking another LLM to grade it. That works for tone and style, but it falls apart for tasks with an objectively correct answer. SQL generation is exactly that kind of task: a query either returns the right rows or it does not, and no amount of LLM-rubric grading can substitute for actually running the query.
In this tutorial we build a promptfoo evaluation for a natural language to SQL generator. Instead of asking a second model whether the generated SQL “looks correct,” a custom Python assertion executes the generated query against a real SQLite database and compares the returned rows to a known-correct reference query. We compare two OpenAI models, gpt-4o-mini and gpt-5-mini, track real per-token cost despite promptfoo’s pricing table not yet knowing about newer models, and demonstrate that the grading logic genuinely catches wrong answers rather than rubber-stamping everything.

What We Are Building
A promptfoo project that evaluates a SQL-generation prompt against a fixed orders table. The workflow:
Define a prompt that takes a plain English question and returns a SQLite query
Write four test cases, each with a known-correct reference query
Execute every generated query against an in-memory database and compare actual results, not text
Compare two models on the same questions
Track real per-token cost in a stats.json file that accumulates across runs
Tech Stack
Component | Tool |
Evaluation framework | promptfoo (Node.js CLI, run via npx) |
Models compared | gpt-4o-mini and gpt-5-mini |
Correctness check | Custom Python assertion executing SQL against SQLite |
Cost tracking | Custom build_stats.py using per-token rates from .env |
Environment | python-dotenv |
Project Structure
promptfoo_code_review_eval/
├── promptfooconfig.yaml # prompts, providers, test cases, assertions
├── assert_sql_correctness.py # custom Python assertion — executes SQL and compares results
├── build_stats.py # parses promptfoo's results.json into stats.json with our own cost rates
├── requirements.txt # python-dotenv
├── .env # OPENAI_API_KEY and per-model cost rates
└── README.md
Setting Up the Environment
Promptfoo is a Node.js CLI tool, so no global install is required. npx downloads and runs it on first use:
npx promptfoo@latest eval
Promptfoo requires Node ^20.20.0 or >=22.22.0. Check your version first:
node --version
On Windows, upgrade via winget if needed:
winget upgrade --id OpenJS.NodeJS.22 --silent --accept-package-agreements --accept-source-agreements
.env holds the OpenAI key and the per-token cost rates used later for tracking spend:
OPENAI_API_KEY=your_openai_api_key_here
GPT_4O_MINI_INPUT_COST=0.00000015
GPT_4O_MINI_OUTPUT_COST=0.00000060
GPT_5_MINI_INPUT_COST=0.00000025
GPT_5_MINI_OUTPUT_COST=0.00000200
Promptfoo loads .env automatically, so the API key needs no extra wiring. The cost rates are read separately by build_stats.py, covered later.
Building the Promptfoo Configuration
Everything promptfoo needs, the prompts, the providers, the test cases, and the assertions, lives in one YAML file. Create a file named promptfooconfig.yaml in your project root. Promptfoo automatically detects this exact filename when you run npx promptfoo@latest eval from the same directory, so the name is not optional.
A quick note before the full file: lines inside a raw: | block scalar are literal prompt text. A trailing # there would become part of the string sent to the model rather than a YAML comment, so the prompt content itself is left uncommented below, exactly as it has to be. Every structural line around it is commented.
description: "Natural language to SQL query generator evaluation" # shown as the eval's title in the promptfoo web UI
prompts:
- id: prompt_correct # unique id used to wire this prompt to a provider below
label: prompt_correct # human-readable label shown in the results table
raw: | # the literal prompt template; {{question}} is substituted per test case
You are a SQL assistant. Given a table schema and a question in plain English,
write a single SQLite query that answers the question exactly.
Schema:
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_name TEXT,
amount REAL,
order_date TEXT,
status TEXT
);
Question: {{question}}
Respond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.
# DELIBERATE SABOTAGE — used only to prove the eval catches a wrong answer.
# Identical to prompt_correct except for one false instruction that only matters
# for the percentage test case (test 4). Tests 1-3 are unaffected by this line.
- id: prompt_degraded # unique id for the sabotaged prompt, explained in full further down
label: prompt_degraded # human-readable label shown in the results table
raw: | # identical to prompt_correct except for the extra paragraph below
You are a SQL assistant. Given a table schema and a question in plain English,
write a single SQLite query that answers the question exactly.
Schema:
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_name TEXT,
amount REAL,
order_date TEXT,
status TEXT
);
Question: {{question}}
Note: whenever a question asks for a percentage or share of a total, always compute
the denominator as the sum across ALL rows in the table, ignoring any status filter
mentioned elsewhere in the question.
Respond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.
providers:
- id: openai:chat:gpt-4o-mini # promptfoo's provider id format for OpenAI chat models
label: "GPT-4o-mini" # display name shown in the results table
prompts: [prompt_degraded] # this provider receives the sabotaged prompt, explained further down
config:
metadata: # tags visible in the OpenAI dashboard usage logs
dev_name: "Ganesh" # developer name for tracking who ran this eval
project: "codex-test" # project identifier in the usage dashboard
environment: "local" # separates local runs from CI or production
purpose: "testing" # intent label for this run
- id: openai:chat:gpt-5-mini # the second model being compared
label: "GPT-5-mini" # display name shown in the results table
prompts: [prompt_correct] # this provider receives the real, correct prompt
config:
metadata: # same metadata shape, repeated per provider
dev_name: "Ganesh"
project: "codex-test"
environment: "local"
purpose: "testing"
defaultTest:
assert: # assertions applied automatically to every test case below
- type: latency # fails any call slower than this threshold
threshold: 15000 # 15 seconds, expressed in milliseconds
- type: cost # fails any call costlier than this threshold
threshold: 0.01 # $0.01 per call, using promptfoo's own pricing table
- type: python # every test case is also graded by our custom assertion
value: file://assert_sql_correctness.py # path to the file containing get_assert()
tests:
- description: "Simple filter on a numeric column" # row label shown in the results table
vars:
question: "List the order_id and amount for every order where the amount is greater than 100." # substituted into {{question}}
expected_query: "SELECT order_id, amount FROM orders WHERE amount > 100" # known-correct answer, never shown to the model
order_matters: "false" # GROUP BY/WHERE results have no meaningful row order
- description: "Aggregation with GROUP BY" # row label shown in the results table
vars:
question: "For each customer, show their name and their total order amount across all orders." # substituted into {{question}}
expected_query: "SELECT customer_name, SUM(amount) FROM orders GROUP BY customer_name" # reference query for this question
order_matters: "false" # per-customer totals have no required row order
- description: "Filter, sort, and limit combined" # row label shown in the results table
vars:
question: "Show the order_id and order_date of the 2 most recently placed orders that have status 'completed', with the most recent first." # substituted into {{question}}
expected_query: "SELECT order_id, order_date FROM orders WHERE status = 'completed' ORDER BY order_date DESC LIMIT 2" # reference query
order_matters: "true" # "most recent first" makes row order part of correctness
- description: "Nested aggregation — each customer's share of total completed revenue" # row label shown in the results table
vars:
question: "For each customer, what percentage of the total amount across all completed orders do they account for? Show customer_name and their percentage, rounded to 1 decimal place. Only consider completed orders, including in the total they are compared against." # substituted into {{question}}
expected_query: "SELECT customer_name, ROUND(SUM(amount) * 100.0 / (SELECT SUM(amount) FROM orders WHERE status = 'completed'), 1) AS pct FROM orders WHERE status = 'completed' GROUP BY customer_name" # the trap: omitting this WHERE in the subquery gives wrong percentages
order_matters: "false" # per-customer percentages have no required row order
{{question}} is the only variable shown to the model. expected_query and order_matters exist purely for grading and are never substituted into the prompt, since the prompt template never references them. The fourth test case is the hardest one: it requires a subquery inside the denominator of a percentage calculation, and forgetting to filter that subquery by status = 'completed' produces a plausible-looking query that returns subtly wrong numbers. That subtlety is exactly what prompt_degraded exploits, covered in full further down.
The Custom Assertion: Executing SQL Instead of Trusting Appearances
This is the core of the evaluation. Rather than comparing the generated SQL text to the reference SQL text, the assertion runs both queries against the same database and compares the actual rows returned. Create a file named assert_sql_correctness.py in the same directory as promptfooconfig.yaml — this is the exact filename referenced by value: file://assert_sql_correctness.py in the config above.
import re # extract SQL from a fenced markdown code block
import sqlite3 # in-memory database used to actually run both queries
SCHEMA_SQL = """
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_name TEXT,
amount REAL,
order_date TEXT,
status TEXT
);
"""
# Note: this is a Python triple-quoted string, so a trailing # on any line above
# would become literal text inside the SQL rather than a comment, the same
# constraint as the raw: | block in the YAML file.
ORDER_ROWS = [ # fixed, deterministic order data so every test run sees the same table
(1, "Alice", 120.50, "2024-01-15", "completed"), # one of two completed orders for Alice
(2, "Bob", 75.00, "2024-01-20", "completed"), # first of two completed orders for Bob
(3, "Alice", 200.00, "2024-02-01", "completed"), # Alice's second completed order
(4, "Charlie", 50.00, "2024-02-10", "cancelled"), # cancelled order, excluded from completed-only totals
(5, "Bob", 300.00, "2024-02-15", "completed"), # Bob's second completed order
(6, "Charlie", 150.00, "2024-03-01", "completed"), # Charlie's only completed order
]
def build_test_database(): # fresh in-memory SQLite DB, rebuilt for every grading call
conn = sqlite3.connect(":memory:") # no file on disk — every call starts from a clean, identical state
conn.executescript(SCHEMA_SQL) # create the orders table
conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", ORDER_ROWS) # load the fixed seed rows
conn.commit() # persist the inserts before any query runs against this connection
return conn # caller runs both the generated and reference queries on this connection
def extract_sql_statement(model_output): # pull the query out of a ```sql fenced block, or use raw text
fenced = re.search(r"```(?:sql)?\s*(.*?)```", model_output, re.DOTALL | re.IGNORECASE) # look for a fenced code block
raw_sql = fenced.group(1) if fenced else model_output # fall back to the whole output if no fence is found
return raw_sql.strip().rstrip(";") # trailing semicolons are harmless but stripped for consistency
def get_assert(output, context): # promptfoo calls this for every test wired to this file
generated_sql = extract_sql_statement(output) # the model's SQL, fences stripped
expected_sql = context["vars"]["expected_query"] # known-correct reference query for this test
order_matters = str(context["vars"].get("order_matters", "false")).lower() == "true" # whether row order is part of the spec
conn = build_test_database() # same seed data for both queries, every time
try:
generated_rows = conn.execute(generated_sql).fetchall() # actually execute the model's query
except sqlite3.Error as exc: # catches SQL syntax errors and invalid column or table references
conn.close() # release the connection before returning
return {
"pass": False, # a query that does not even run can never be correct
"score": 0.0, # zero score for a non-executing query
"reason": f"Generated SQL failed to execute: {exc}. Query was: {generated_sql}", # syntax errors fail immediately
}
expected_rows = conn.execute(expected_sql).fetchall() # run the reference query against the same data
conn.close() # release the connection now that both queries have run
rows_match = (generated_rows == expected_rows) if order_matters else (sorted(generated_rows) == sorted(expected_rows)) # exact order vs sorted comparison
if rows_match:
return {
"pass": True, # the generated query returned exactly the expected data
"score": 1.0, # full score for a correct query
"reason": f"Query executed successfully and returned {len(generated_rows)} row(s) matching the expected result exactly.",
}
return {
"pass": False, # the generated query ran but returned the wrong data
"score": 0.0, # zero score for incorrect results
"reason": f"Query results do not match. Generated SQL returned {generated_rows}, expected {expected_rows}.", # shows both result sets for fast debugging
}
get_assert(output, context) is the exact function signature promptfoo requires for file-based Python assertions, found via type: python and value: file://assert_sql_correctness.py in the config. Building a fresh in-memory database on every single call means there is no shared state between test cases and no risk of one test’s data leaking into another’s result. Sorting both result sets before comparing handles the common case where GROUP BY output order is not semantically meaningful, while order_matters: "true" enforces exact row order for questions where it is, like “most recent first.”
Tracking Cost and Token Usage
Promptfoo’s own cost assertion relies on its internal pricing table, and newer models like gpt-5-mini are not guaranteed to be in it yet. Rather than trust a possibly-missing price, this script reads promptfoo’s own results.json output and applies our own per-token rates from .env to the real token counts promptfoo reports. Create a file named build_stats.py in the same directory.
import os # read per-model cost rates from .env
import json # parse promptfoo's results.json and write stats.json
from datetime import datetime # timestamp for each run and for the stats file
from dotenv import load_dotenv # load .env before any os.environ.get() call
load_dotenv() # injects all key=value pairs from .env into os.environ
COST_RATES = { # per-token USD rates, keyed by the promptfoo provider id used in promptfooconfig.yaml
"openai:chat:gpt-4o-mini": {
"input": float(os.environ.get("GPT_4O_MINI_INPUT_COST", 0.00000015)),
"output": float(os.environ.get("GPT_4O_MINI_OUTPUT_COST", 0.00000060)),
},
"openai:chat:gpt-5-mini": {
"input": float(os.environ.get("GPT_5_MINI_INPUT_COST", 0.00000025)),
"output": float(os.environ.get("GPT_5_MINI_OUTPUT_COST", 0.00000200)),
},
}
RESULTS_FILE = "results.json" # produced by: npx promptfoo@latest eval --output results.json
STATS_FILE = "stats.json" # written by this script — accumulates across every run
def load_eval_rows(path): # read promptfoo's flat per-test, per-provider result rows
with open(path, "r", encoding="utf-8") as f:
data = json.load(f)
return data["results"]["results"] # promptfoo nests the row array under results.results
def build_call_record(row, run_id): # convert one promptfoo result row into our own call record
provider_id = row["provider"]["id"] # e.g. "openai:chat:gpt-5-mini"
usage = row.get("tokenUsage") or {} # token counts reported by promptfoo for this call
prompt_tok = usage.get("prompt", 0) # input tokens for this call
completion_tok = usage.get("completion", 0) # output tokens for this call
total_tok = usage.get("total", prompt_tok + completion_tok) # combined token count
rates = COST_RATES.get(provider_id, {"input": 0, "output": 0}) # our own rates, not promptfoo's built-in table
input_cost = prompt_tok * rates["input"] # cost of the prompt in USD
output_cost = completion_tok * rates["output"] # cost of the completion in USD
total_cost = input_cost + output_cost # combined cost for this single call
return {
"run_id": run_id, # which build_stats.py execution this call belongs to
"provider": provider_id, # which model handled this call
"test_description": row.get("testCase", {}).get("description", ""), # which test case this row belongs to
"prompt_tokens": prompt_tok, # input tokens for this call
"completion_tokens": completion_tok, # output tokens for this call
"total_tokens": total_tok, # combined token count
"input_cost": round(input_cost, 7), # prompt cost in USD
"output_cost": round(output_cost, 7), # completion cost in USD
"total_cost": round(total_cost, 7), # total cost in USD
"latency_ms": row.get("latencyMs", 0), # wall-clock latency reported by promptfoo
"passed": row.get("success", False), # whether every assertion on this row passed
}
def compute_usage(calls, provider_id): # token and cost totals for one provider across a set of calls
entries = [c for c in calls if c["provider"] == provider_id] # filter down to only this provider's calls
return {
"total_input_tokens": sum(c["prompt_tokens"] for c in entries), # summed input tokens for this provider
"total_output_tokens": sum(c["completion_tokens"] for c in entries), # summed output tokens for this provider
"total_tokens": sum(c["total_tokens"] for c in entries), # combined tokens for this provider
"total_cost": round(sum(c["total_cost"] for c in entries), 6), # combined USD cost for this provider
"total_calls": len(entries), # number of calls made by this provider
}
def summarize(calls): # build a provider breakdown + totals for any list of calls
provider_ids = sorted(set(c["provider"] for c in calls)) # every distinct model present in this set
usage_by_provider = {pid: compute_usage(calls, pid) for pid in provider_ids} # per-model usage breakdown
return {
"providers": provider_ids, # every model id present in this set of calls
"total_api_calls": len(calls), # total calls in this set
"usage": {
**usage_by_provider, # one entry per provider with its own token/cost totals
"total_tokens": sum(u["total_tokens"] for u in usage_by_provider.values()), # combined tokens across providers
"total_cost": round(sum(u["total_cost"] for u in usage_by_provider.values()), 6), # combined cost across providers
},
}
def main():
try: # load everything saved by previous runs so stats accumulate
with open(STATS_FILE, "r", encoding="utf-8") as f:
previous = json.load(f)
all_calls = previous.get("calls", []) # every call ever recorded, across all past runs
runs = previous.get("runs", []) # one summary entry per past execution of this script
except (FileNotFoundError, json.JSONDecodeError):
all_calls = [] # first run — start with empty history
runs = []
run_id = datetime.now().isoformat() # unique identifier for this execution of build_stats.py
rows = load_eval_rows(RESULTS_FILE) # load promptfoo's raw evaluation rows for this run only
new_calls = [build_call_record(row, run_id) for row in rows] # convert this run's rows into our own call record format
all_calls.extend(new_calls) # grow the lifetime call history with this run's calls
run_summary = summarize(new_calls) # totals scoped to only this run
run_summary["run_id"] = run_id
runs.append(run_summary) # add this run's summary to the run history
lifetime_summary = summarize(all_calls) # totals across every run ever recorded
output = {
"run_info": {
"timestamp": datetime.now().isoformat(), # when this stats file was last written
"total_runs": len(runs), # how many times build_stats.py has been run
**lifetime_summary, # lifetime providers, total_api_calls, and usage
},
"runs": runs, # one entry per build_stats.py execution
"calls": all_calls, # every individual call ever recorded, across all runs
}
with open(STATS_FILE, "w", encoding="utf-8") as f: # overwrite stats.json with the merged lifetime + per-run data
json.dump(output, f, indent=2) # pretty-print with 2-space indentation
print( # summary line printed to the terminal after every run
f"Wrote {STATS_FILE} — this run: {len(new_calls)} calls (${run_summary['usage']['total_cost']:.6f}). "
f"Lifetime: {len(all_calls)} calls across {len(runs)} runs, total cost ${lifetime_summary['usage']['total_cost']:.6f}"
)
if __name__ == "__main__": # run main() only when this file is executed directly, not when imported
main()
Running build_stats.py more than once does not overwrite history. Each execution loads the existing stats.json, appends the new run’s calls, and recomputes both a per-run summary and a lifetime total. run_info always reflects every call ever recorded, runs has one entry per execution with its own scoped totals, and calls has every individual call tagged with the run_id that produced it.
A Deliberately Sabotaged Prompt: Proving the Evaluation Actually Catches Mistakes
In repeated live testing with a single, identical, correct prompt for both models, gpt-4o-mini and gpt-5-mini answered every question correctly, including several deliberately tricky variants tried along the way: a HAVING clause after GROUP BY, “second highest value” instead of maximum, and a JOIN-based regional rollup with a distinct-customer count. That is a genuinely useful finding: for single-table SQL generation at this level of difficulty, the cheaper model is just as reliable as the newer one.
It also meant there was no live failing row to point to as proof the evaluation actually catches mistakes rather than rubber-stamping everything. To demonstrate that the grading logic genuinely discriminates, gpt-4o-mini is wired to prompt_degraded instead of prompt_correct, using promptfoo’s per-provider prompts: field shown in the full configuration above. That prompt is identical to the correct one except for one extra paragraph of false guidance: it tells the model to always compute percentage denominators over every row in the table, ignoring any status filter mentioned in the question.
That instruction is irrelevant to the first three questions, since none of them involve a percentage calculation, so gpt-4o-mini still passes those normally. It only matters for test case 4, where following the bad instruction produces a denominator computed over every order instead of just completed ones, giving subtly wrong percentages. gpt-5-mini is wired to prompt_correct and passes every test.
This is intentionally not a fair model comparison, and the project’s own README says so directly: do not conclude from this result that gpt-4o-mini is worse at SQL generation than gpt-5-mini. Point both providers at prompt_correct and gpt-4o-mini passes everything too. The sabotaged prompt exists purely to generate a real, honest failing test case, produced by a real model following real instructions, rather than a fabricated result.
Running the Evaluation
With .env configured and a supported Node version installed, run the evaluation:
npx promptfoo@latest eval --output results.json --no-cache
--no-cache forces real API calls instead of reusing promptfoo’s local cache from an identical previous run. The terminal prints a pass/fail table for all 4 questions against both models:
The single failure is gpt-4o-mini on the percentage test, exactly as expected from the sabotaged prompt. Browse the same results in an interactive UI:



We will run it again. Instead of re-executing the entire file from the beginning, it will use the cached results from the previous execution. This saves both time and cost.
npx promptfoo@latest eval
or
npx promptfoo@latest eval --output results.json

results.json file
{
"evalId": "eval-5wn-2026-06-25T04:51:29",
"results": {
"version": 3,
"timestamp": "2026-06-25T04:51:29.297Z",
"prompts": [
{
"id": "4f9469eead24401d82a9626541e1e36ae0352caab585774913d35a4760bb5f20",
"raw": "You are a SQL assistant. Given a table schema and a question in plain English,\nwrite a single SQLite query that answers the question exactly.\n\nSchema:\nCREATE TABLE orders (\n order_id INTEGER PRIMARY KEY,\n customer_name TEXT,\n amount REAL,\n order_date TEXT,\n status TEXT\n);\n\nQuestion: {{question}}\n\nNote: whenever a question asks for a percentage or share of a total, always compute\nthe denominator as the sum across ALL rows in the table, ignoring any status filter\nmentioned elsewhere in the question.\n\nRespond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.\n",
"label": "prompt_degraded",
"provider": "GPT-4o-mini",
"metrics": {
"score": 3.6666666666666665,
"testPassCount": 3,
"testFailCount": 1,
"testErrorCount": 0,
"assertPassCount": 11,
"assertFailCount": 1,
"totalLatencyMs": 5199,
"tokenUsage": {
"prompt": 0,
"completion": 0,
"cached": 761,
"total": 761,
"numRequests": 4,
"completionDetails": {
"reasoning": 0,
"acceptedPrediction": 0,
"rejectedPrediction": 0,
"cacheReadInputTokens": 0,
"cacheCreationInputTokens": 0
},
"assertions": {
"total": 0,
"prompt": 0,
"completion": 0,
"cached": 0,
"numRequests": 0,
"completionDetails": {
"reasoning": 0,
"acceptedPrediction": 0,
"rejectedPrediction": 0,
"cacheReadInputTokens": 0,
"cacheCreationInputTokens": 0
}
}
},
"namedScores": {},
"namedScoresCount": {},
"namedScoreWeights": {},
"cost": 0
}
},
{
"id": "c69301860a878254d9cc28f6b05e2cba8bb21563bfc267646d409598813492f8",
"raw": "You are a SQL assistant. Given a table schema and a question in plain English,\nwrite a single SQLite query that answers the question exactly.\n\nSchema:\nCREATE TABLE orders (\n order_id INTEGER PRIMARY KEY,\n customer_name TEXT,\n amount REAL,\n order_date TEXT,\n status TEXT\n);\n\nQuestion: {{question}}\n\nRespond with ONLY the SQL query inside a ```sql code block. Do not include any explanation or commentary.\n",
"label": "prompt_correct",
"provider": "GPT-5-mini",
"metrics": {
"score": 4,
"testPassCount": 4,
"testFailCount": 0,
"testErrorCount": 0,
"assertPassCount": 12,
"assertFailCount": 0,
"totalLatencyMs": 15449,
"tokenUsage": {
"prompt": 0,
"completion": 0,
"cached": 1202,
"total": 1202,
"numRequests": 4,
"completionDetails": {
"reasoning": 0,
"acceptedPrediction": 0,
"rejectedPrediction": 0,
"cacheReadInputTokens": 0,
"cacheCreationInputTokens": 0
},
"assertions": {
"total": 0,
"prompt": 0,
"completion": 0,
"cached": 0,
"numRequests": 0,
"completionDetails": {
"reasoning": 0,
"acceptedPrediction": 0,
"rejectedPrediction": 0,
"cacheReadInputTokens": 0,
"cacheCreationInputTokens": 0
}
}
},
"namedScores": {},
"namedScoresCount": {},
"namedScoreWeights": {},
"cost": 0
}
}
],
....
"vars": [
"question",
"expected_query",
"order_matters"
],
"runtimeOptions": {
"eventSource": "cli",
"showProgressBar": true,
"repeat": 1,
"maxConcurrency": 4,
"cache": true
}
}
The web view shows gpt-4o-mini at 75% passing (3/4 cases) and gpt-5-mini at 100% passing (4/4 cases), with the failing row expanded to show both generated queries side by side: gpt-4o-mini’s query computes the denominator with (SELECT SUM(amount) FROM orders), no status filter, while gpt-5-mini’s query correctly includes WHERE status = 'completed' in the same subquery.
npx promptfoo@latest view



Then build the cost report:
pip install -r requirements.txt
python build_stats.py

stats.json file
{
"run_info": {
"timestamp": "2026-06-25T10:20:18.428976",
"total_runs": 1,
"providers": [
"openai:chat:gpt-4o-mini",
"openai:chat:gpt-5-mini"
],
"total_api_calls": 8,
"usage": {
"openai:chat:gpt-4o-mini": {
"total_input_tokens": 632,
"total_output_tokens": 129,
"total_tokens": 761,
"total_cost": 0.000172,
"total_calls": 4
},
"openai:chat:gpt-5-mini": {
"total_input_tokens": 464,
"total_output_tokens": 738,
"total_tokens": 1202,
"total_cost": 0.001592,
"total_calls": 4
},
"total_tokens": 1963,
"total_cost": 0.001764
}
},
"runs": [
{
"providers": [
"openai:chat:gpt-4o-mini",
"openai:chat:gpt-5-mini"
],
"total_api_calls": 8,
"usage": {
"openai:chat:gpt-4o-mini": {
"total_input_tokens": 632,
"total_output_tokens": 129,
"total_tokens": 761,
"total_cost": 0.000172,
"total_calls": 4
},
"openai:chat:gpt-5-mini": {
"total_input_tokens": 464,
"total_output_tokens": 738,
"total_tokens": 1202,
"total_cost": 0.001592,
"total_calls": 4
},
"total_tokens": 1963,
"total_cost": 0.001764
},
"run_id": "2026-06-25T10:20:18.426809"
}
],
"calls": [
{
"run_id": "2026-06-25T10:20:18.426809",
"provider": "openai:chat:gpt-4o-mini",
"test_description": "Simple filter on a numeric column",
"prompt_tokens": 149,
"completion_tokens": 17,
"total_tokens": 166,
"input_cost": 2.23e-05,
"output_cost": 1.02e-05,
"total_cost": 3.25e-05,
"latency_ms": 1405,
"passed": true
},
...
{
"run_id": "2026-06-25T10:20:18.426809",
"provider": "openai:chat:gpt-5-mini",
"test_description": "Nested aggregation \u2014 each customer's share of total completed revenue",
"prompt_tokens": 136,
"completion_tokens": 383,
"total_tokens": 519,
"input_cost": 3.4e-05,
"output_cost": 0.000766,
"total_cost": 0.0008,
"latency_ms": 5947,
"passed": true
}
]
}
Who Can Benefit
Students learning evaluation methodology can use this project to see the difference between text-similarity grading and execution-based grading concretely, on a task simple enough to reason about by hand.
Data teams building natural language interfaces to internal databases can adapt this pattern directly: the same get_assert function works for any SQL dialect with a Python driver, not just SQLite.
AI engineers evaluating model upgrades can use the prompts: per-provider mapping technique to A/B test prompt changes alongside model changes in the same run, rather than conflating the two.
Teams adopting promptfoo can use the build_stats.py pattern whenever they need cost tracking for a model promptfoo’s built-in pricing table does not yet recognize.
QA and trust teams can extend the assertion to other deterministic domains: API response schema validation, calculation correctness, or any output that can be checked by running code rather than reading text.
How Codersarts Can Help
If you want to take this further, Codersarts offers hands-on support at every stage.
For learners: Live 1-to-1 sessions with an AI engineer who can walk through promptfoo configuration, custom assertion design, and execution-based evaluation patterns in detail.
For teams: End-to-end development of custom LLM evaluation suites, including domain-specific correctness checks, CI/CD integration, and prompt regression testing.
For enterprises: Architecture consulting for production-grade evaluation pipelines, including cost tracking across model upgrades and multi-provider comparison frameworks.
Reach out at contact@codersarts.com or visit www.codersarts.com to get started.
Continue Your AI Learning Journey with Codersarts
If you enjoyed this article and would like to discover more about modern AI applications, production-ready LLM systems, and real-world RAG and MCP implementations, be sure to explore these other blogs from Codersarts:
Build a Cost-Efficient Writing Quality Checker with Tiered Model Routing and OpenAI
Build Your First A2A Agent: An Email Drafting Pipeline Using Python and OpenAI
Building an AI Interview Prep Agent with Qwen 3.7 Max and Streamlit
https://www.codersarts.com/post/building-an-ai-interview-prep-agent-with-qwen-3-7-max-and-streamlit
Academic Research Assistance and Literature Review Automation Using RAG
Clinical Decision Support Systems Using RAG: Intelligent Diagnostic Assistance for Healthcare
Financial Decision Making with RAG Powered Market Intelligence
https://www.codersarts.com/post/financial-decision-making-with-rag-powered-market-intelligence




Comments