How to Build an AI Voice Assistant with OpenAI, FastAPI, and Next.js

You have a working text chatbot. The code is clean, the responses are sharp — and every time you demo it, someone in the room says, "But can I just talk to it?"
Adding voice to an AI application sounds simple until you realise you are actually wiring together four completely separate concerns: browser microphone capture, speech-to-text transcription, an LLM agent with memory and tool use, and text-to-speech synthesis that plays back as audio. Each layer has its own failure modes, latency budget, and API contract. Get any one of them wrong and the whole experience collapses.
The AI Voice Assistant covered in this post solves all four concerns in a single, cohesive full-stack application. It captures audio in the browser, transcribes it with OpenAI Whisper, runs it through an Agents SDK-powered AI agent that can search a local knowledge base, synthesizes the answer as natural-sounding speech, and plays it back — all in a single round trip.
Real-world use cases:
Students building a personal AI coding tutor they can actually speak with
Developers prototyping customer support voice bots before committing to a telephony platform
Technical founders adding voice interaction to an existing SaaS product
Educators creating accessible AI learning tools for non-keyboard users
Freelancers delivering voice UI as a premium feature on client projects
Hobbyists experimenting with real-time AI audio pipelines on a weekend
This post covers the full architecture, recommended stacks, and the five implementation phases. It does not include the full source code — that is inside the course.
📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]
How It Works: Core Concept
The AI voice assistant is built on a speech pipeline — a sequential chain of models, each transforming the user's input one step closer to a natural spoken response. Think of it like a telephone interpreter: someone speaks in one language, the interpreter listens, thinks, formulates a reply, and speaks back. The whole cycle must complete within two to three seconds, or the conversation feels broken.
Why the naive approach fails. The most obvious approach — send the raw audio to a single API and get speech back — does not exist yet for general-purpose assistants that need custom logic, memory, or knowledge retrieval. You have to compose three separate models: a speech-to-text model, a language model, and a text-to-speech model. Doing this naively — calling each sequentially in a synchronous HTTP handler — works in development but ties up the server and compounds latency once real traffic arrives.
How this architecture solves it. The application separates concerns cleanly. The browser handles recording and playback using the native MediaRecorder API. The FastAPI backend owns the three-model pipeline: Whisper transcribes the audio to text, the OpenAI Agents SDK handles reasoning (and can call a ChromaDB tool for context retrieval), and OpenAI TTS synthesizes the reply. The backend returns base64-encoded MP3 audio alongside the transcript and response text — so the frontend can display the conversation and play the audio simultaneously.
ASCII data-flow diagram:
```
SETUP PHASE (server start)
──────────────────────────
ChromaDB ──seed()──► LocalVectorStore (knowledge notes pre-loaded)

RUNTIME PHASE (per voice request)
─────────────────────────────────
Browser (mic)
   │ audio/webm blob
   ▼
POST /api/voice (FastAPI)
   │ write to temp file
   ▼
OpenAI Whisper (STT)
   │ transcript: string
   ▼
SessionStore.history_text() ◄── in-memory conversation history
   │ combined prompt
   ▼
OpenAI Agents SDK (LLM Agent)
   │ └── tool: search_project_knowledge()
   │        └── ChromaDB keyword search
   │ answer: string
   ▼
OpenAI TTS
   │ audio bytes (MP3)
   ▼
base64 encode
   │ VoiceResponse JSON
   ▼
Browser
   │ display transcript + answer
   └── play MP3 audio
```
Analogy. Imagine a live interpreter booth at a conference. The speaker talks → the interpreter listens and understands → checks their reference notebook → speaks the translation aloud. This application is that booth, running on a server, with three AI models doing the listening, thinking, and speaking.
System Architecture Deep Dive
Architecture Overview
The AI Voice Assistant has five clearly defined layers that work together:
Frontend layer (Next.js + TypeScript + Tailwind CSS). The React component manages three states simultaneously: microphone recording (via MediaRecorder), HTTP communication with the backend, and audio playback (via the HTMLAudioElement API). The UI presents a mic button, a text input fallback, a live status indicator ("Listening → Uploading → Thinking → Speaking → Ready"), and a chat history panel.
Backend layer (FastAPI + Python). Two endpoints handle all traffic: POST /api/voice for audio input and POST /api/chat for text input. Both share the same session management and agent logic. FastAPI's async event loop stays free to serve other requests while each pipeline's blocking model calls run in worker threads, and an asynccontextmanager lifespan hook seeds the vector store once at startup.
AI layer (OpenAI Agents SDK + Whisper + TTS). The VoiceAssistantAgent wraps an Agents SDK Agent object with a search_project_knowledge function tool. The agent decides autonomously when to call the tool — the developer does not hardcode retrieval logic. Whisper (gpt-4o-mini-transcribe) handles STT; OpenAI TTS (gpt-4o-mini-tts) handles synthesis with a configurable voice and instruction style.
Data layer (ChromaDB + SessionStore). ChromaDB provides a persistent local vector store for project knowledge notes. The SessionStore is an in-memory Python dict that keeps the last 8 message turns per session — enough context for coherent conversation without ballooning the prompt.
External APIs (OpenAI). All three model calls hit OpenAI's API. The API key lives in a server-side .env file and is never exposed to the browser.
Component Table
| Component | Role | Technology Options |
| --- | --- | --- |
| Audio capture | Record browser microphone input | MediaRecorder API (WebM/MP4), WebRTC getUserMedia |
| STT | Convert audio bytes to text | OpenAI Whisper, Deepgram Nova-2, AssemblyAI |
| LLM Agent | Reason, retrieve context, generate reply | OpenAI Agents SDK, LangChain, LlamaIndex |
| TTS | Convert text to spoken audio | OpenAI TTS, ElevenLabs, Google Cloud TTS |
| Vector Store | Store and retrieve knowledge notes | ChromaDB (local), Pinecone, pgvector |
| Session Store | Track per-user conversation history | In-memory dict, Redis, PostgreSQL |
| Backend API | Route requests, manage pipeline | FastAPI, Flask, Django |
| Frontend UI | Record, display, play back | Next.js + React, SvelteKit, plain HTML/JS |
| Containerisation | Reproducible dev and production | Docker + Docker Compose |
| Config management | Secrets and environment variables | python-dotenv, Pydantic Settings |
Data Flow Walkthrough
1. User clicks the Mic button — navigator.mediaDevices.getUserMedia requests microphone permission.
2. MediaRecorder starts capturing audio chunks into a Blob (WebM on Chrome, MP4 on Safari).
3. User clicks Stop — recorder.onstop fires, assembles the Blob, and POST /api/voice is called with the audio file and the current sessionId.
4. FastAPI writes the audio to a named temp file (preserving the extension for MIME detection).
5. transcribe_audio() sends the file to OpenAI Whisper and returns the transcript string.
6. SESSION_STORE.history_text() retrieves up to 8 prior turns for this session.
7. The combined prompt (history + transcript) is passed to Runner.run(agent, prompt).
8. If the agent calls search_project_knowledge, ChromaDB performs a keyword-ranked search and returns matching notes.
9. The agent returns a final text answer (2–4 short sentences, optimised for spoken playback).
10. synthesize_speech() sends the answer to OpenAI TTS and streams the MP3 to a temp file.
11. The MP3 bytes are base64-encoded and returned in a VoiceResponse JSON payload.
12. The frontend displays the transcript and answer as chat bubbles, then calls audio.play() on the base64 data URL.
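Compressed into code, the twelve steps above reduce to a single handler. The sketch below is a simplified stand-in, not the course code: session history and the knowledge tool are omitted (they appear in later sketches), a plain chat completion stands in for the Agents SDK call, and the field names in the JSON payload are assumptions.

```python
import base64
import os
import tempfile

from fastapi import FastAPI, File, Form, UploadFile
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

@app.post("/api/voice")
async def voice(audio: UploadFile = File(...), sessionId: str = Form(...)):
    # STT: write the upload to a temp file whose suffix preserves the format
    suffix = os.path.splitext(audio.filename or "clip.webm")[1] or ".webm"
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(await audio.read())
        path = tmp.name
    try:
        with open(path, "rb") as f:
            transcript = (await client.audio.transcriptions.create(
                model="gpt-4o-mini-transcribe", file=f)).text
    finally:
        os.unlink(path)

    # LLM: a plain completion here; the real app runs the Agents SDK instead
    chat = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": transcript}])
    answer = chat.choices[0].message.content

    # TTS: synthesize the reply and ship it back base64-encoded
    speech = await client.audio.speech.create(
        model="gpt-4o-mini-tts", voice="coral", input=answer)
    return {"transcript": transcript, "response": answer,
            "audio_base64": base64.b64encode(speech.content).decode("ascii")}
```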
Non-Obvious Design Decisions
Decision 1 — In-memory session store capped at 8 turns. An unlimited history would grow the prompt on every turn, increasing cost and latency linearly. Eight turns is enough for natural conversation coherence. This is a deliberate trade-off: it means the server must stay up for the session to persist, which is acceptable for a single-server deployment. Scaling to multiple workers requires externalising the store to Redis.
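A minimal sketch of such a store shows how little code the cap actually takes; the class name and history_text method follow the post, everything else is an assumption:

```python
from collections import defaultdict, deque

class SessionStore:
    """In-memory conversation history, capped per session."""

    def __init__(self, max_turns: int = 8):
        # deque(maxlen=...) silently evicts the oldest turn once the cap is hit
        self._turns = defaultdict(lambda: deque(maxlen=max_turns))

    def append(self, session_id: str, role: str, text: str) -> None:
        self._turns[session_id].append((role, text))

    def history_text(self, session_id: str) -> str:
        # Flattened "role: text" lines, ready to prepend to the next prompt
        return "\n".join(f"{role}: {text}" for role, text in self._turns[session_id])

SESSION_STORE = SessionStore()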
Decision 2 — Agent SDK over a raw client.chat.completions.create() call. Using the Agents SDK adds a function-tool layer that lets the LLM decide when to query ChromaDB — rather than always retrieving and injecting context. This reduces unnecessary retrieval overhead and keeps the prompt clean. It also makes the agent extensible: adding new tools (web search, calendar lookup) requires one decorator, not a prompt rewrite.
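A sketch of that wiring, using the Agents SDK's actual Agent, Runner, and function_tool names; the tool body here is a stub rather than the course's ChromaDB search:

```python
from agents import Agent, Runner, function_tool  # pip install openai-agents

@function_tool
def search_project_knowledge(query: str) -> str:
    """Search the local project knowledge base for notes matching the query."""
    # Stub body; the real tool queries ChromaDB (see Phase 3 below)
    notes = {"privacy": "Audio is processed in memory and never written to disk."}
    hits = [text for key, text in notes.items() if key in query.lower()]
    return "\n".join(hits) or "No matching notes found."

agent = Agent(
    name="VoiceAssistant",
    instructions="Answer in 2 to 4 short sentences, optimised for speech. "
                 "Use search_project_knowledge for questions about this project.",
    tools=[search_project_knowledge],
)

# The model decides on its own whether the tool is worth calling:
#   result = await Runner.run(agent, prompt)
#   answer = result.final_output
```

Adding a web-search or calendar tool later is one more decorated function appended to tools; the prompt itself stays untouched.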
Tech Stack Recommendation
Stack A — Beginner / Prototype (build in a weekend)
| Layer | Technology | Why |
| --- | --- | --- |
| Frontend | Next.js 14 + Tailwind CSS | Zero-config, includes the API route layer if needed |
| Backend | FastAPI (Python 3.12) | Async by default, automatic OpenAPI docs |
| STT | OpenAI Whisper (gpt-4o-mini-transcribe) | Single API call, no infrastructure to manage |
| LLM | OpenAI Agents SDK + gpt-4o-mini | Cheapest capable model; Agents SDK adds tool use for free |
| TTS | OpenAI TTS (gpt-4o-mini-tts) | One API call, multiple voice options |
| Vector Store | ChromaDB (local persistent) | Zero-config, runs in-process, no server needed |
| Containerisation | Docker Compose | One command to run frontend + backend together |
Estimated monthly cost: $5–$15 in OpenAI API credits (at ~500 voice interactions/day, each costing ~$0.001 total across STT + LLM + TTS).
Stack B — Production-Ready (designed to scale)
| Layer | Technology | Why |
| --- | --- | --- |
| Frontend | Next.js 14 on Vercel | Edge network, automatic HTTPS, zero-downtime deploys |
| Backend | FastAPI on Railway / Fly.io | Persistent server with health checks and auto-restart |
| STT | Deepgram Nova-2 | Lower latency and cost at volume vs OpenAI Whisper |
| LLM | OpenAI GPT-4o via Agents SDK | Higher intelligence for complex queries |
| TTS | ElevenLabs or OpenAI TTS HD | Higher audio quality, more voice customisation |
| Vector Store | Pinecone or pgvector (PostgreSQL) | Horizontally scalable, proper embedding search |
| Session Store | Redis (Upstash serverless) | Survives server restarts, shared across workers |
| Queue | Celery + Redis | Offload long TTS jobs from the request thread |
| Auth | Clerk or Supabase Auth | User accounts, API key protection |
| Monitoring | Sentry + Axiom | Error tracking and latency observability |
Estimated monthly cost: $30–$120 depending on traffic (hosting ~$20, OpenAI credits ~$20–$80, Pinecone free tier covers most prototypes).
Implementation Phases
Phase 1: Backend Foundation and API Setup
In this phase the developer creates the FastAPI project structure, configures pydantic-settings for environment variable management, and defines the Pydantic schemas (ChatRequest, ChatResponse, VoiceResponse). The health endpoint (GET /health) is implemented and tested first — it is the simplest possible proof that the server is running and CORS is configured correctly.
Key decisions: Which Python version to target (3.12+ recommended for str | None union syntax); how to structure the app directory (app/, app/services/, app/tests/); whether to use python-dotenv or pydantic-settings for env loading (choose pydantic-settings — it validates types at startup and catches missing keys before the first request).
Getting CORS right for audio uploads — including the correct allow_methods and allow_headers — is covered in detail in the full course with working, tested code.
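A rough sketch of that foundation under those choices (field names are assumptions; the fail-fast behaviour is the point, since Settings() raises at startup if OPENAI_API_KEY is missing):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")
    openai_api_key: str          # no default: startup fails fast if missing
    tts_voice: str = "coral"     # safe default, overridable via env

settings = Settings()

class ChatRequest(BaseModel):
    session_id: str
    message: str

class VoiceResponse(BaseModel):
    transcript: str
    response: str
    audio_base64: str

app = FastAPI()

@app.get("/health")
def health() -> dict[str, str]:
    # Simplest possible proof the server is up before any OpenAI wiring exists
    return {"status": "ok"}
```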
Phase 2: Speech-to-Text and Text-to-Speech Integration
The STT and TTS services are standalone functions that wrap the OpenAI SDK. The STT service writes audio bytes to a named temp file (preserving the file extension so OpenAI can detect the MIME type), calls client.audio.transcriptions.create, and cleans up the temp file in a finally block. The TTS service uses with_streaming_response.create to stream the MP3 to disk before reading it back as bytes.
Key decisions: Whether to run the blocking STT and TTS SDK calls in worker threads via anyio.to_thread.run_sync (yes — otherwise they stall the event loop); which TTS voice to use (coral, alloy, echo, fable, onyx, nova — all configurable via env vars); what audio format to accept from the browser (prefer audio/webm with audio/mp4 as a Safari fallback, and validate the MIME type server-side).
Handling the Safari audio/mp4 vs Chrome audio/webm split without breaking either browser is covered in detail in the full course with working, tested code.
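Under those assumptions, the two services look roughly like this sketch; function names follow the post, signatures are guesses:

```python
import os
import tempfile

import anyio
from openai import OpenAI

client = OpenAI()  # blocking client; every call below runs in a worker thread

def _transcribe(data: bytes, suffix: str) -> str:
    # Keep the original extension so the API can detect the container format
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        with open(path, "rb") as f:
            return client.audio.transcriptions.create(
                model="gpt-4o-mini-transcribe", file=f).text
    finally:
        os.unlink(path)  # clean up even when the API call raises

def _synthesize(text: str) -> bytes:
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as tmp:
        path = tmp.name
    try:
        # Stream the MP3 to disk, then read it back as bytes
        with client.audio.speech.with_streaming_response.create(
            model="gpt-4o-mini-tts", voice="coral", input=text,
        ) as response:
            response.stream_to_file(path)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.unlink(path)

async def transcribe_audio(data: bytes, suffix: str) -> str:
    return await anyio.to_thread.run_sync(_transcribe, data, suffix)

async def synthesize_speech(text: str) -> bytes:
    return await anyio.to_thread.run_sync(_synthesize, text)
```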
Phase 3: AI Agent with Knowledge Retrieval
The VoiceAssistantAgent wraps an OpenAI Agent object with a @function_tool decorated function that queries ChromaDB. The LocalVectorStore seeds itself with project knowledge notes on first startup. The agent's instructions field shapes its personality and verbosity — critical for voice output, where a 500-word essay is unusable.
Key decisions: How many messages to keep in the session history (8 is a practical ceiling); whether to use real OpenAI embeddings in ChromaDB or a keyword-ranking fallback (keyword ranking is used in the demo; real embeddings are needed for semantic search in production); how to write the agent's system prompt so responses are short enough to synthesize without feeling clipped.
Tuning the agent prompt so answers are concise enough for audio playback without sacrificing accuracy is covered in detail in the full course with working, tested code.
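A sketch of the vector-store wrapper (class and method names follow the post; note that this version leans on ChromaDB's built-in default embedding function rather than the demo's keyword-ranking fallback):

```python
import chromadb

class LocalVectorStore:
    def __init__(self, path: str = "./chroma_data"):
        self._client = chromadb.PersistentClient(path=path)
        self._col = self._client.get_or_create_collection("project_knowledge")

    def seed(self, notes: dict[str, str]) -> None:
        # Idempotent: skip if a previous startup already loaded the notes
        if self._col.count() > 0:
            return
        self._col.add(ids=list(notes), documents=list(notes.values()))

    def search(self, query: str, k: int = 3) -> str:
        res = self._col.query(query_texts=[query], n_results=k)
        return "\n".join(res["documents"][0]) or "No matching notes."
```

In the app, seed() runs once inside the FastAPI lifespan hook at startup, and search() is exactly what the agent's function tool wraps.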
Phase 4: Frontend Voice UI
The Next.js VoiceAssistant component manages the full recording lifecycle using three useRef hooks: recorderRef for the MediaRecorder instance, chunksRef for accumulating audio data, and streamRef for releasing the microphone track on stop. The status machine (Ready → Listening → Uploading → Thinking → Speaking → Ready) gives users constant feedback and prevents double-submits.
Key decisions: Whether to start recording with recorder.start(timeslice) for streaming or collect all chunks on onstop (collect on stop — simpler and sufficient for short voice interactions); how to handle getUserMedia permission denial gracefully; how to structure the sendVoice API helper to pass sessionId via FormData alongside the audio file.
Building a reliable MediaRecorder state machine that handles browser differences, permission errors, and mid-recording failures is covered in detail in the full course with working, tested code.
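The component itself is TypeScript and lives in the course, but the FormData contract has a backend half worth pinning down. A sketch of the receiving endpoint, where the multipart field names are assumptions that must match the sendVoice helper:

```python
from fastapi import FastAPI, File, Form, HTTPException, UploadFile

app = FastAPI()
ALLOWED_TYPES = {"audio/webm", "audio/mp4"}  # Chrome and Safari recordings

@app.post("/api/voice")
async def voice(
    audio: UploadFile = File(...),   # FormData field "audio"
    sessionId: str = Form(...),      # FormData field "sessionId"
):
    # Reject unusable uploads before spending API credits on them;
    # strip codec hints like "audio/webm;codecs=opus" before comparing
    base_type = (audio.content_type or "").split(";")[0]
    if base_type not in ALLOWED_TYPES:
        raise HTTPException(status_code=415,
                            detail=f"Unsupported audio type: {audio.content_type}")
    data = await audio.read()
    # ...hand data and sessionId to the STT -> agent -> TTS pipeline...
    return {"received_bytes": len(data), "sessionId": sessionId}
```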
Phase 5: Docker, Environment Config, and Deployment
The production Docker setup uses a multi-service docker-compose.yml: a Python backend container and a Next.js frontend container, both with health checks. Environment variables are injected at runtime (never baked into the image). The backend Dockerfile copies requirements.txt first for layer caching, then the application code.
Key decisions: Whether to use a multi-stage build for the Next.js frontend to reduce image size (yes — the node:alpine builder stage vs the final node:alpine runner stage); how to handle the ChromaDB persistence volume so data survives container restarts; which CORS origins to allow in production (always restrict to the specific frontend domain — never "*" in production).
The full Docker Compose configuration with health checks, volume mounts, and production CORS settings is covered in detail in the full course with working, tested code.
Common Challenges
1. Cross-Browser Audio MIME Type Mismatch
Problem. MediaRecorder.isTypeSupported("audio/webm") returns true on Chrome and false on Safari. If you hardcode audio/webm, Safari records but the backend receives an unrecognised audio file and OpenAI Whisper returns an error.
Root cause. Safari only supports audio/mp4 (with AAC codec) for MediaRecorder output.
Fix. Feature-detect at runtime: const mimeType = MediaRecorder.isTypeSupported("audio/webm") ? "audio/webm" : ""; — and pass mimeType to the MediaRecorder constructor only if it is non-empty. Server-side, preserve the original filename extension when writing to the temp file so OpenAI can infer the correct format.
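On the server side, a small helper makes the extension-preservation half of the fix concrete; the MIME-to-suffix map below is an assumption, not the course code:

```python
SUFFIX_BY_MIME = {"audio/webm": ".webm", "audio/mp4": ".mp4",
                  "audio/mpeg": ".mp3", "audio/wav": ".wav"}

def pick_suffix(filename: str | None, content_type: str | None) -> str:
    # Prefer the uploaded filename's extension; fall back to the MIME map
    if filename and "." in filename:
        return "." + filename.rsplit(".", 1)[1].lower()
    base_type = (content_type or "").split(";")[0]  # drop ";codecs=opus"
    return SUFFIX_BY_MIME.get(base_type, ".webm")
```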
2. MediaRecorder Fires onstop Before All Chunks Arrive
Problem. In some browsers, onstop fires before the final ondataavailable event, meaning the last audio chunk is missing from the Blob and the transcript is cut short.
Root cause. The MediaRecorder spec requires the final dataavailable event to be dispatched before stop, but some browser implementations have historically fired them out of order, and any chunk that arrives after the Blob is assembled is silently dropped.
Fix. Accumulate chunks in an array ref inside ondataavailable and assemble the Blob only inside onstop; in spec-compliant browsers, every chunk has arrived by that point. As a defence against buggy implementations, call recorder.requestData() immediately before recorder.stop() to flush the final chunk, or defer the Blob assembly by one task inside onstop.
3. OpenAI API Key Exposed in Browser Network Tab
Problem. Developers new to FastAPI sometimes put the OpenAI API key in the frontend .env file (as NEXT_PUBLIC_OPENAI_API_KEY) and call the OpenAI API directly from the browser. The key is visible in the browser network tab and in the built JavaScript bundle.
Root cause. Misunderstanding of NEXT_PUBLIC_ prefix — it bakes values into the client-side bundle.
Fix. Always call OpenAI from the backend only. The frontend communicates with your FastAPI endpoints, which authenticate with OpenAI on the server side.
4. In-Memory Session Loss on Server Restart
Problem. The SessionStore is a plain Python defaultdict. Restarting the uvicorn process wipes all conversation history.
Root cause. No persistence layer.
Fix. For single-server deployments this is acceptable — just document it. For production, replace SESSION_STORE with a Redis-backed store (Upstash Redis is free tier and requires zero infrastructure). Use session_id as the Redis key and store the last N messages as a JSON list.
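A sketch of that upgrade with redis-py; the key layout and the 8-turn cap mirror the in-memory store, and the connection URL is a placeholder:

```python
import json

import redis

r = redis.Redis.from_url("redis://localhost:6379/0", decode_responses=True)
MAX_TURNS = 8

def append_turn(session_id: str, role: str, text: str) -> None:
    key = f"session:{session_id}"
    r.rpush(key, json.dumps({"role": role, "text": text}))
    r.ltrim(key, -MAX_TURNS, -1)  # keep only the newest N entries
    r.expire(key, 3600)           # optional: drop idle sessions after an hour

def history_text(session_id: str) -> str:
    raw = r.lrange(f"session:{session_id}", 0, -1)
    return "\n".join(f"{m['role']}: {m['text']}" for m in map(json.loads, raw))
```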
5. TTS Latency Feels Slow on Long Responses
Problem. The full pipeline takes 3–5 seconds on longer responses. Users click the mic button, wait, and wonder if anything happened.
Root cause. Three sequential API round trips (STT → LLM → TTS) each add 500ms–1.5s of latency. Long LLM outputs take longer to synthesize.
Fix. Keep agent instructions explicit about brevity: "Answer in 2 to 4 short sentences." Use a live status indicator (Listening → Thinking → Speaking) so users know the pipeline is progressing. For advanced use cases, stream the LLM output and synthesize TTS sentence by sentence.
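For the advanced path, a sketch of sentence-level synthesis; the regex splitter is naive, and the chunked delivery mechanism (SSE or WebSocket) is left out:

```python
import re

from openai import OpenAI

client = OpenAI()

def synthesize_by_sentence(answer: str):
    """Yield one MP3 chunk per sentence so playback can start early."""
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        if not sentence:
            continue
        resp = client.audio.speech.create(
            model="gpt-4o-mini-tts", voice="coral", input=sentence)
        yield resp.content  # ship each chunk to the client as it is ready
```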
6. ChromaDB Keyword Search Missing Semantic Matches
Problem. The demo uses a keyword-scoring fallback instead of true vector embeddings. A user asking "How does this app handle security?" may not match notes stored under "Privacy note" because the word "security" does not appear in the note title.
Root cause. The demo seeds ChromaDB with embeddings=[[0.0]] (zero vectors) to avoid a dependency on an embedding model during development. Cosine similarity against zero vectors is meaningless.
Fix. In production, replace zero vectors with real embeddings using text-embedding-3-small. Seed the collection with properly embedded documents so ChromaDB's built-in ANN search returns semantically relevant results.
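A sketch of the upgrade; note the query must be embedded with the same model as the documents (the note contents here are invented examples):

```python
import chromadb
from openai import OpenAI

client = OpenAI()
col = chromadb.PersistentClient(path="./chroma_data").get_or_create_collection(
    "project_knowledge", metadata={"hnsw:space": "cosine"})

notes = {"privacy-note": "Audio is processed in memory and never stored on disk.",
         "security-note": "The OpenAI key lives server-side in .env, never in JS."}

# Embed and store the documents with real vectors instead of [[0.0]]
resp = client.embeddings.create(model="text-embedding-3-small",
                                input=list(notes.values()))
col.add(ids=list(notes), documents=list(notes.values()),
        embeddings=[d.embedding for d in resp.data])

# Embed the query with the same model so similarity is semantic
q = client.embeddings.create(model="text-embedding-3-small",
                             input=["How does this app handle security?"])
print(col.query(query_embeddings=[q.data[0].embedding], n_results=2)["documents"])
```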
7. CORS Preflight Failing for Audio Uploads
Problem. The browser sends an OPTIONS preflight request before the multipart audio POST. If CORS middleware is not configured to handle Content-Type: multipart/form-data, the preflight returns 400 and the audio never uploads.
Root cause. The preflight only succeeds if CORSMiddleware allows the request's method and headers; it needs allow_methods=["*"] and allow_headers=["*"] (or an explicit Content-Type entry) for multipart audio uploads to pass.
Fix. Confirm that CORSMiddleware is the first middleware added to the app (order matters in ASGI stacks) and that allow_headers is not overly restrictive in production.
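A sketch of a configuration that passes the multipart preflight while staying locked down; the origin is a placeholder for your real frontend domain:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # your frontend, never "*" in prod
    allow_methods=["GET", "POST", "OPTIONS"],
    allow_headers=["Content-Type"],
)
```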
Solving these issues took us over 40 hours of testing across browsers, environments, and edge cases — the course walks you through each fix with working code.
Ready to Build This Yourself?
Understanding the architecture is one thing. Shipping a working, tested, production-ready AI voice assistant is another. The gap between reading this post and having deployable code is the part most tutorials skip.
The AI Voice Assistant course on Codersarts Labs gives you everything you need to close that gap:
✅ Complete, commented source code for backend and frontend
✅ Step-by-step tutorials for every phase of the build
✅ Docker Compose setup that runs the full stack with one command
✅ Tested browser compatibility for Chrome, Firefox, and Safari
✅ Production CORS and environment variable configuration walkthrough
✅ ChromaDB seeding with real embeddings for semantic search
✅ Redis session store upgrade guide for multi-worker deployments
✅ Lifetime access to all future updates and additions
✅ Community support from the Codersarts team and fellow students
$30. Everything above.
Want to build this alongside a senior Codersarts engineer? Book a 1:1 Guided Session at $20/hour — we pair-program the entire assistant with you, live.
Conclusion
The AI Voice Assistant connects five distinct technologies — browser MediaRecorder, OpenAI Whisper, OpenAI Agents SDK, ChromaDB, and OpenAI TTS — into a single, coherent pipeline that takes a spoken question and returns a spoken answer in seconds. The FastAPI backend owns the pipeline and keeps the API key off the client; the Next.js frontend owns the recording lifecycle and the audio playback.
If you are starting from scratch, begin with Stack A: FastAPI + ChromaDB (local) + OpenAI APIs + Next.js, all wired together with Docker Compose. You can have a working prototype in a weekend and upgrade individual layers (Redis, Pinecone, ElevenLabs) as your needs grow.
The full course at labs.codersarts.com includes the complete source code, video walkthroughs, and tested configurations for every phase — so you spend your time building, not debugging.


