
How to Build an AI SQL Generator: Query Any CSV File with Plain English



You have a CSV file full of data. You know the answer is in there — which product is underperforming, which region is trending, which week had the anomaly. But finding it means importing the file into a database, remembering the exact syntax for GROUP BY, HAVING, and ORDER BY DESC, and writing a query you'll throw away the moment the meeting ends.


For non-developers, that's a dead end. For developers, it's twenty minutes of friction for a two-second answer.


The AI SQL Generator eliminates this entirely. Upload any CSV file, type a question in plain English, and receive a working SQL query executed against your data — with results displayed in a scrollable table — in seconds.


Real-world use cases this application handles:

  • Business analysts querying sales, revenue, or pipeline CSV exports without SQL knowledge

  • Product managers analyzing event logs and user behavior data from analytics tool exports

  • Startup founders interrogating their own data without waiting for a data team

  • Students learning how natural language maps to SQL queries

  • Freelancers running rapid ad-hoc analysis on client-supplied files

  • Data journalists investigating public datasets and government records


This article covers the system architecture, the dual-database design, the AI prompt engineering layer, the error-recovery loop, and the implementation phases you will work through to ship it. It does not include full source code — that is available in the complete course at labs.codersarts.com.



📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]


How It Works: Core Concept


The concept powering this AI SQL generator is called schema-injected text-to-SQL.

Every SQL query is a structured instruction to a database. Natural language is an unstructured instruction from a human. The gap between the two is context — specifically, the schema (what tables and columns exist) and intent (what the user wants to know). A large language model like GPT-4o-mini, when given both pieces in a carefully engineered prompt, reliably bridges that gap and produces valid, executable SQL.


Why the naive approach fails. The obvious first idea is to send the user's question directly to the AI and ask it to write SQL. This breaks immediately because the model doesn't know your column names. If your CSV has a column called total_rev_usd, the AI guesses revenue or sales_amount — plausible names that don't exist in your database. The generated SQL fails on execution every time.


How schema injection solves it. Before sending anything to the AI, the backend parses the uploaded CSV, infers column types, sanitizes all names, and builds a schema description block listing every table, every column, its type, and three sample values. This block is injected into the prompt alongside the user's question. The model now knows exactly what it is working with.

Think of it like handing a translator a dictionary before asking them to translate. Without the dictionary, they guess. With it, they are precise.
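
To make that concrete, here is a minimal sketch of how the schema block might be rendered. The function name and output format are illustrative, not the course's exact code; it assumes a tables dict mapping table names to DataFrames produced at upload time.

```python
import pandas as pd

# Minimal sketch of schema-block rendering for the AI prompt.
# `tables` maps sanitized table names to the DataFrames loaded at upload.
def build_schema_block(tables: dict[str, pd.DataFrame]) -> str:
    lines = []
    for name, df in tables.items():
        lines.append(f"Table: {name}")
        for col in df.columns:
            samples = df[col].dropna().head(3).tolist()  # 3 sample values
            lines.append(f"  - {col} ({df[col].dtype}): e.g. {samples}")
    return "\n".join(lines)
```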





SETUP PHASE (on CSV upload):
  CSV file(s)
      │
      ▼
  Pandas parse + type inference
      │
      ▼
  Column sanitization (spaces → underscores, lowercased)
      │
      ▼
  DataFrame → in-memory SQLite
      │
      ▼
  Schema object (table name, columns, types, 3 sample values per column)
      │
      ▼
  Schema rendered in UI sidebar

QUERY PHASE (on each natural language question):
  User plain English question
      │
      ▼
  Build structured prompt: [system role] + [schema block] + [question]
      │
      ▼
  OpenAI GPT-4o-mini → JSON { "sql": "...", "explanation": "..." }
      │
      ▼
  Execute SQL against in-memory SQLite
      │
      ├─ Success → return columns + rows to frontend
      │
      └─ Failure → send error to AI for one-shot retry
                       │
                       ▼
                   Corrected SQL → re-execute → return results

                   


System Architecture Deep Dive


This natural-language-to-SQL application has five distinct layers that work in sequence on every request.


Layer 1 — Frontend (React + TypeScript + Tailwind CSS). A dark-themed single-page application with three states: the CSV upload screen, the query interface (schema sidebar, question input, SQL display, results table), and the history panel. Vite provides fast hot-reload during development. The entire UI fits comfortably in a single App.tsx with React useState for state management.


Layer 2 — Backend API (FastAPI + Python). The REST layer that handles file uploads, routes queries to the AI service, executes SQL against the in-memory database, manages session state through a global singleton, and persists query history via SQLAlchemy. FastAPI's automatic OpenAPI docs make the API self-documenting during development.


Layer 3 — AI Layer (OpenAI GPT-4o-mini). The natural-language-to-SQL engine. The backend sends a crafted prompt and receives a JSON object with sql and explanation fields. Temperature is set to 0.1 for near-deterministic output, and the system prompt strictly constrains the model to SELECT-only statements.


Layer 4 — Data Layer (dual SQLite). Two SQLite databases run simultaneously. The in-memory database holds the user's uploaded CSV data and is discarded on each new upload, preserving privacy. The persistent query_history.db file stores saved query results across server restarts, managed through SQLAlchemy ORM.


Layer 5 — External APIs (OpenAI). Only the OpenAI Chat Completions API is used. The API key is read from an environment variable and the client is initialized as a singleton to avoid repeated instantiation overhead.
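
That initialization can be as small as this, assuming the official openai package (v1+), which reads the API key from the environment when none is passed:

```python
# Sketch: module-level singleton client (openai>=1.0).
# The constructor reads OPENAI_API_KEY from the environment by default.
from openai import OpenAI

client = OpenAI()  # instantiate once at import time, reuse everywhere
```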



Component Table

| Component | Role | Options |
| --- | --- | --- |
| CSV Parser | Reads raw bytes, infers types, sanitizes names | Pandas (used), Polars, csv stdlib |
| In-memory DB | Executes SQL against uploaded data | SQLite in-memory (used), DuckDB |
| SQL Generator | Converts natural language to SQL via AI | OpenAI GPT-4o-mini (used), Claude, Gemini |
| Prompt Builder | Injects schema context into AI prompt | Custom (used), LangChain, DSPy |
| Error Recovery | Retries failed SQL with the error message | Custom one-shot retry (used) |
| History Store | Persists saved query results across sessions | SQLite + SQLAlchemy (used), PostgreSQL |
| REST API | Routes requests between frontend and services | FastAPI (used), Flask, Django |
| Frontend | UI for upload, query, results, history | React + TypeScript (used), Next.js, Streamlit |
| State Manager | Holds tables, result, and active view | React useState (used), Zustand |
| Type System | Validates API request/response shapes | Pydantic v2 (backend), TypeScript (frontend) |



Data Flow Walkthrough


  1. User drags and drops 1–3 CSV files onto the upload zone.

  2. Frontend sends POST /upload as a multipart request.

  3. Backend validates file count (max 3) and extension (.csv only).

  4. Each file is parsed with Pandas: column names are sanitized (spaces become underscores, lowercased), types are inferred (INTEGER / REAL / DATE / TEXT), and up to three sample values are extracted per column.

  5. Each DataFrame is loaded into a fresh in-memory SQLite database. The previous database is discarded if one exists.

  6. Schema objects are returned to the frontend. The sidebar renders table names, column names, and types.

  7. User types a natural language question and submits.

  8. Frontend sends POST /ask with the question string.

  9. Backend builds the AI prompt: system role (SQL expert, SELECT-only, JSON-only) + schema block (all tables, columns, types, samples) + user question.

  10. Backend calls the OpenAI Chat Completions API at temperature 0.1.

  11. AI returns JSON. Backend strips any accidental markdown fences before parsing.

  12. Backend executes the SQL against the in-memory SQLite connection.

  13. On success: columns, rows, and row count are returned in the API response.

  14. On failure: the error message is appended to the prompt and a second AI call is made. If the retry also fails, a 422 error returns the broken SQL and error message.

  15. Frontend renders the SQL with the one-sentence explanation and the results table.

  16. User clicks "Save to History" → POST /queries/save serializes the result to query_history.db.
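
To make the routing concrete, here is a heavily condensed sketch of the two core endpoints. The helper names (load_csvs, run_with_retry, current_schema_block) are assumptions, not the course's exact API; fuller sketches of the helpers appear in the phase sections below.

```python
# Condensed endpoint sketch (FastAPI). Helper functions are illustrative.
from fastapi import FastAPI, File, HTTPException, UploadFile

app = FastAPI()

@app.post("/upload")
async def upload(files: list[UploadFile] = File(...)):
    if len(files) > 3:
        raise HTTPException(status_code=400, detail="Maximum 3 CSV files")
    if any(not (f.filename or "").endswith(".csv") for f in files):
        raise HTTPException(status_code=400, detail=".csv files only")
    # load_csvs parses, sanitizes, loads SQLite, and returns schema objects
    return load_csvs([(f.filename, await f.read()) for f in files])

@app.post("/ask")
async def ask(body: dict):
    # run_with_retry executes the SQL and performs the one-shot AI retry
    return run_with_retry(body["question"], current_schema_block())
```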



Two Non-Obvious Design Decisions


Decision 1: In-memory SQLite instead of PostgreSQL. The obvious choice is a proper database server, but PostgreSQL requires schema creation DDL, connection pooling, and teardown scripts. In-memory SQLite requires none of that — Pandas writes directly to it via df.to_sql(), and it vanishes when replaced. Multi-table uploads work trivially.
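
A quick illustration of how little code this requires, using only Pandas and the stdlib sqlite3 module (the sample data here is made up):

```python
import sqlite3
import pandas as pd

# Example upload result; in the real app this comes from the CSV parser.
tables = {"sales": pd.DataFrame({"region": ["EU", "US"], "revenue": [100, 250]})}

conn = sqlite3.connect(":memory:")  # discarded when replaced on next upload
for table_name, df in tables.items():
    df.to_sql(table_name, conn, index=False, if_exists="replace")

print(conn.execute("SELECT SUM(revenue) FROM sales").fetchone())  # (350,)
```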



Decision 2: Dual-database architecture. Keeping user data in-memory and query history in a file database is a deliberate privacy boundary. One database for both would either accidentally persist user data or lose query history on every upload. The split gives each database exactly one job.




Tech Stack Recommendation


Stack A — Beginner / Prototype (Build in a Weekend)

| Layer | Technology | Why |
| --- | --- | --- |
| Backend | Python 3.11 + FastAPI | Minimal boilerplate; auto docs; async support |
| AI | OpenAI GPT-4o-mini | Cheapest capable model; no setup required |
| CSV processing | Pandas 2.x | df.to_sql() handles the DB load automatically |
| Query engine | SQLite in-memory | Zero infrastructure; built into Python stdlib |
| History store | SQLite file + SQLAlchemy | No server needed; survives restarts |
| Frontend | React 18 + Vite + Tailwind CSS | Fast setup; excellent developer experience |
| Deployment | Railway or Render (free tier) | Push-to-deploy; no DevOps required |

Estimated monthly cost: $5–$15 (OpenAI API for 500–1500 queries; hosting on free tier)



Stack B — Production-Ready (Designed to Scale)

| Layer | Technology | Why |
| --- | --- | --- |
| Backend | FastAPI + Gunicorn (4 workers) | Process-level concurrency for production traffic |
| AI | GPT-4o with gpt-4o-mini fallback | Better JOIN accuracy; fallback for cost control |
| Query engine | DuckDB in-process | 10–100× faster than SQLite for analytical queries |
| History store | PostgreSQL + SQLAlchemy + Alembic | ACID, migrations, multi-user support |
| Auth | Clerk or Auth0 | Per-user session isolation |
| Frontend | Next.js 14 + TypeScript | SSR, file-based routing, built-in API routes |
| Deployment | AWS ECS Fargate + RDS | Auto-scaling; managed database |
| Monitoring | Sentry + OpenTelemetry | Error tracking; AI latency tracing |
| Caching | Redis | Cache repeated identical queries |

Estimated monthly cost: $80–$200 (AWS ECS ~$30, RDS ~$15, OpenAI API $30–150 by volume)



Implementation Phases


Phase 1: CSV Upload and Schema Extraction


Build the file upload endpoint and the pipeline that converts raw CSV bytes into a structured schema object. This phase is the foundation — without reliable schema extraction, the AI cannot generate correct SQL.


Key decisions to make:

  • How to handle encoding (UTF-8 vs Latin-1 fallback for legacy files)

  • How to sanitize column names containing spaces and special characters, and how to resolve names that collide after sanitization

  • How strict to make type inference (strict reduces false positives; permissive reduces TEXT columns that should be REAL)

  • How many sample values to pass to the AI (more samples improve JOIN inference accuracy but increase token cost)

  • Whether to support Excel files in addition to CSV (adds openpyxl dependency)


Handling duplicate column names that appear after sanitization — a subtle bug that breaks SQL generation silently if left unfixed — is covered in detail in the full course with working, tested code.
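
As a rough starting point, here is a minimal sketch of the upload-time pipeline under simple assumptions: UTF-8 with a Latin-1 fallback, lowercase-and-underscore sanitization, and Pandas-driven type inference. The function names are illustrative, not the course's exact code.

```python
import io
import re
import pandas as pd

def parse_csv(raw: bytes) -> pd.DataFrame:
    # UTF-8 first; fall back to Latin-1 for legacy exports.
    try:
        df = pd.read_csv(io.BytesIO(raw))
    except UnicodeDecodeError:
        df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
    # Sanitize: lowercase, collapse non-word characters to underscores.
    df.columns = [re.sub(r"\W+", "_", c.strip().lower()) for c in df.columns]
    return df

def sqlite_type(dtype) -> str:
    # Map the inferred Pandas dtype onto the article's four SQL types.
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    if pd.api.types.is_float_dtype(dtype):
        return "REAL"
    if pd.api.types.is_datetime64_any_dtype(dtype):
        return "DATE"
    return "TEXT"
```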



Phase 2: AI-Powered SQL Generation


Connect the extracted schema to OpenAI and build the prompt engineering layer. This is the most iterative phase — small changes to the prompt produce dramatically different outputs.


Key decisions to make:

  • System prompt role: how explicitly to constrain SELECT-only output

  • Schema representation format: how to lay out the schema block (table-first layout, and whether to include sample values inline or strip them)

  • Output format: structured JSON vs raw SQL (JSON is required to extract the explanation separately)

  • Temperature: 0.1 for determinism vs higher values for varied query suggestions

  • Model selection: gpt-4o-mini (fast, cheap) vs gpt-4o (more accurate on complex multi-table JOINs)


Prompt engineering for reliable JSON output — including stripping markdown fences that the model adds despite explicit instructions — is covered in detail in the full course with working, tested code.
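
For orientation, a minimal sketch of what the generation call might look like with the official openai package (v1+). The system prompt shown here is a stand-in, not the course's exact prompt; strip_fences is the cleanup helper shown under Common Challenges below.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stand-in system prompt; the course's tuned version is more detailed.
SYSTEM_PROMPT = (
    "You are a SQLite expert. Generate a single SELECT statement only. "
    'Respond with JSON only: {"sql": "...", "explanation": "..."}'
)

def generate_sql(schema_block: str, question: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.1,  # near-deterministic output
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"{schema_block}\n\nQuestion: {question}"},
        ],
    )
    # Mandatory cleanup: the model sometimes wraps JSON in markdown fences.
    return json.loads(strip_fences(resp.choices[0].message.content))
```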



Phase 3: SQL Execution and Error Recovery


Execute the generated SQL against the in-memory database and build the one-shot retry loop that handles AI mistakes gracefully. This phase transforms the app from a SQL generator into a SQL executor.


Key decisions to make:

  • How to serialize SQLite rows for the API response (list of lists vs list of dicts)

  • Whether to sanitize AI-generated SQL server-side before execution (strip semicolons, block non-SELECT as a second safety layer)

  • How to construct the retry prompt (append error context vs replace original prompt)

  • What to return when both attempts fail (the broken SQL + error, so the user can debug)

  • How to surface the was_retried flag in the UI (transparency for the user)


Building the error-recovery retry loop that feeds the SQLite error message back to the AI for a corrected query — and testing it against 20+ failure modes — is covered in detail in the full course with working, tested code.
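
A minimal sketch of the retry flow, reusing the hypothetical generate_sql helper from Phase 2 and assuming an execute() helper that runs SQL on the in-memory connection:

```python
# Sketch of the one-shot retry loop; helper names are illustrative.
def run_with_retry(question: str, schema_block: str) -> dict:
    first = generate_sql(schema_block, question)
    try:
        cols, rows = execute(first["sql"])
        return {"sql": first["sql"], "columns": cols,
                "rows": rows, "was_retried": False}
    except Exception as err:
        # Feed the database error back to the model for one corrected
        # attempt; a second failure propagates and becomes the 422 response.
        second = generate_sql(
            schema_block,
            f"{question}\n\nYour previous query failed.\n"
            f"SQL: {first['sql']}\nError: {err}\nReturn corrected JSON.",
        )
        cols, rows = execute(second["sql"])
        return {"sql": second["sql"], "columns": cols,
                "rows": rows, "was_retried": True}
```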



Phase 4: React Frontend and Schema Sidebar


Build the three-state UI: the drag-and-drop upload screen, the query interface with schema sidebar, and the result display. This phase is where the application becomes usable by non-developers.


Key decisions to make:

  • State management: React useState (sufficient for single-user) vs Zustand (better for complex multi-view state)

  • Layout: how to split the screen between the schema sidebar and the query area

  • SQL display: syntax highlighting (requires react-syntax-highlighter) vs plain monospace

  • Error presentation: how to show failed queries without overwhelming the user

  • Responsiveness: how the sidebar collapses on mobile


Wiring the async query flow — including loading states, error boundaries, and the retry indicator badge — is covered in detail in the full course with working, tested code.



Phase 5: Query History and Persistent Storage


Add the persistent history panel using a second SQLite database. This is the polish phase that turns a demo into a product users return to.


Key decisions to make:

  • ORM vs raw SQL for history writes (SQLAlchemy makes schema migrations easier later)

  • How to serialize rows (a list of lists) into SQLite TEXT (JSON serialization is correct; consistent deserialization is the challenge)

  • History scope: per-session (no auth) vs per-user (requires authentication)

  • History UI: card list with truncated SQL, click to reload, delete on hover

  • Whether to allow re-running a saved query against fresh uploaded data


Designing the dual-database architecture that keeps user-uploaded data private while persisting query history across server restarts is covered in detail in the full course with working, tested code.
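
A minimal sketch of what the history store might look like in SQLAlchemy 2.x style; the table and column names are illustrative, not the course's exact schema:

```python
import json
from sqlalchemy import Integer, String, Text, create_engine
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column, Session

class Base(DeclarativeBase):
    pass

class SavedQuery(Base):
    __tablename__ = "saved_queries"
    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    question: Mapped[str] = mapped_column(String(500))
    sql: Mapped[str] = mapped_column(Text)
    result_json: Mapped[str] = mapped_column(Text)  # rows serialized as JSON

# Persistent file database: survives restarts, unlike the in-memory DB.
engine = create_engine("sqlite:///query_history.db")
Base.metadata.create_all(engine)

def save(question: str, sql: str, rows: list[list]) -> None:
    with Session(engine) as session:
        session.add(SavedQuery(question=question, sql=sql,
                               result_json=json.dumps(rows)))
        session.commit()
```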



Common Challenges


1. The AI returns markdown-fenced JSON despite explicit instructions.


Root cause: GPT models are trained to format code in triple backtick blocks. Even with "respond with JSON only" in the system prompt, the model adds ```json ``` wrappers regularly.


Fix: Write a post-processing function that strips leading ```json or ``` and trailing ``` before calling json.loads(). Make this step mandatory, not optional.
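
A minimal version of that helper might look like this (a sketch, not the course's exact implementation):

```python
# Fence-stripping helper; run it unconditionally before json.loads().
def strip_fences(text: str) -> str:
    text = text.strip()
    if text.startswith("```"):
        text = text.partition("\n")[2]    # drop the ```json or ``` line
        if text.rstrip().endswith("```"):
            text = text.rstrip()[:-3]     # drop the closing fence
    return text.strip()
```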



2. Column names with spaces break the generated SQL.


Root cause: When a CSV has a column called "Total Revenue", the backend sanitizes it to total_revenue. But the AI prompt sometimes receives the original unsanitized name if you're not careful about which schema object you inject. The AI then references a column that doesn't exist in the database.


Fix: Only inject sanitized column names into the AI prompt. Never expose pre-sanitization headers to the model.



3. Multi-table JOINs fail or use the wrong join key.


Root cause: When two CSV tables have a column that should be a foreign key (e.g., user_id), the AI guesses the join condition from column name similarity. If one table names it user_id and the other names it uid, the JOIN fails silently.


Fix: Include three sample values per column in the schema block — when the AI sees the same actual values appearing in both tables' columns, it reliably identifies the correct join key.



4. Currency and numeric columns inferred as TEXT.


Root cause: Pandas reads $1,234.56 as object dtype because of the $ and , characters. Type inference returns TEXT. The AI then generates arithmetic operations on a TEXT column, which fails in SQLite.


Fix: Add a preprocessing step that strips common currency and number formatting characters ($, ,, %) before type inference.
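
One possible shape for that step, under the assumption that a column is adopted as numeric only when nearly all of its values convert (the 0.95 threshold is arbitrary):

```python
import pandas as pd

def coerce_numeric(series: pd.Series) -> pd.Series:
    # Only object-dtype columns can hide formatted numbers like "$1,234.56".
    if series.dtype != object:
        return series
    cleaned = series.astype(str).str.replace(r"[$,%]", "", regex=True)
    converted = pd.to_numeric(cleaned, errors="coerce")
    # Adopt the numeric version only if (almost) every value converted.
    return converted if converted.notna().mean() > 0.95 else series
```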



5. The in-memory database is overwritten by concurrent uploads.


Root cause: The global dbmanager singleton is not thread-safe. If two users upload simultaneously, one can overwrite the other's database mid-session.


Fix: In production, key the DB manager by a UUID session token passed in the request header. Each session gets its own isolated in-memory database.
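
A minimal sketch of session keying; real production code would also need locking and session expiry:

```python
import sqlite3
import uuid

# Illustrative only: one isolated in-memory DB per session token.
_sessions: dict[str, sqlite3.Connection] = {}

def get_db(session_token: str | None) -> tuple[str, sqlite3.Connection]:
    token = session_token or str(uuid.uuid4())  # issue a token if missing
    if token not in _sessions:
        _sessions[token] = sqlite3.connect(":memory:", check_same_thread=False)
    return token, _sessions[token]
```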



6. Large CSVs with 80+ columns exceed the token limit before the question is even asked.


Root cause: Injecting the full schema for a wide table (80 columns × name + type + 3 samples) can exceed 2,000 tokens before the user's question is added.


Fix: Limit sample value injection to the first 5 columns per table; truncate column lists beyond 30 with an explicit note to the AI that the schema was abbreviated.
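
A sketch of that abbreviation logic, with the same limits as described above; the input format (name, type, samples) is an assumption:

```python
# Cap schema size for wide tables: sample values for the first 5 columns,
# list at most 30 columns, and tell the AI the schema was abbreviated.
def describe_columns(columns: list[tuple[str, str, list]]) -> list[str]:
    lines = []
    for i, (name, col_type, samples) in enumerate(columns[:30]):
        sample_note = f" e.g. {samples}" if i < 5 else ""
        lines.append(f"  - {name} ({col_type}){sample_note}")
    if len(columns) > 30:
        lines.append(f"  [schema abbreviated: {len(columns) - 30} more columns]")
    return lines
```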



7. SQLite type affinity causes unexpected comparison failures.


Root cause: SQLite has no dedicated DATE storage class; date values are stored as TEXT regardless of the declared column type. A column holding "2024-01-15" compares correctly against other ISO-formatted strings only because they sort lexicographically; other date formats can silently mis-compare.


Fix: Document the type inference rules clearly and include date-formatted sample values in the schema block so the AI uses DATE() functions appropriately.


Solving these issues took us over 40 hours of testing across different CSV shapes and edge cases — the course walks you through each fix with working code.




Ready to Build This Yourself?


Understanding an architecture is not the same as shipping it. The gap between this article and a working, deployed application is filled with dependency conflicts, prompt tuning sessions, edge-case CSVs, and debugging AI responses that almost work.


The AI SQL Generator course on labs.codersarts.com gives you everything you need to go from zero to deployed:


✅ Full source code for all 5 modules (backend + frontend, fully commented)

✅ Step-by-step tutorials walking through every architectural decision

✅ Prompt engineering guide with the exact system prompt that produces reliable JSON

✅ Working error-recovery retry loop with tests covering 20+ failure modes

✅ Docker Compose setup for reproducible local development

✅ Deployment walkthrough for Railway, Render, and AWS ECS

✅ Lifetime access — including all future module additions and updates

✅ Community support via the Codersarts Discord


$30. Everything above.



Already have a project in motion and need a faster path? Book a 1:1 guided session at $20/hour — build it alongside the Codersarts team, with your own data, your own stack decisions, and your own questions answered in real time.




Conclusion


The AI SQL Generator is a five-layer system: a React frontend, a FastAPI backend, an OpenAI schema-injection layer, a dual SQLite data layer, and an error-recovery loop that catches and corrects AI mistakes before they reach the user. The key architectural insight is the dual-database split — in-memory SQLite for user data (fast, private, disposable) and a persistent file database for query history (durable, sessionable, user-friendly).


The simplest place to start is Stack A: Python + FastAPI + GPT-4o-mini + SQLite in-memory. You can have a working prototype in a weekend without touching Docker or PostgreSQL.


When you are ready to move from architecture to working code, the full course is waiting at labs.codersarts.com — complete source, tested configurations, and a full deployment walkthrough included.


