OpenAI Whisper + FastAPI Integration: Working Example (2026)

You drop OpenAI Whisper into your FastAPI app, the Chrome demo works perfectly, you ship it - and then a user on Safari uploads a recording and Whisper throws Invalid file format. Or your event loop locks up on a 12 MB upload. Or temp files start piling up because the cleanup path was wrong. Most OpenAI Whisper + FastAPI tutorials online stop at the happy path. This post covers the four things that actually break in production (file-extension handling, event-loop blocking, temp-file cleanup, and upload validation), with copy-paste Python code.
TL;DR - the OpenAI Whisper + FastAPI integration pattern
To integrate OpenAI Whisper with FastAPI correctly:
(1) accept the audio as a multipart UploadFile,
(2) write it to a temporary file, preserving the original extension so OpenAI can detect the audio format,
(3) call client.audio.transcriptions.create wrapped in anyio.to_thread.run_sync so the blocking SDK call doesn't freeze the event loop, and
(4) delete the temp file in a finally block.
Validate MIME type and file size before transcription so bad uploads fail fast.
How the integration flows
Browser microphone (audio/webm on Chrome, audio/mp4 on Safari)
│
│ multipart POST → /api/transcribe
▼
FastAPI UploadFile
│
│ validate MIME + size
▼
Temp file (extension preserved: .webm or .mp4)
│
│ anyio.to_thread.run_sync() ← off the event loop
▼
OpenAI Whisper API (gpt-4o-mini-transcribe)
│
▼
{"transcript": "..."} → back to browser
Three things matter and they're invisible in the OpenAI Python quickstart: the extension preservation, the thread offload, and the cleanup. Each one fails silently - your code "works" until it doesn't.
Step 1 - The transcription service
Keep STT logic in a single module. Below is a minimal app/services/stt.py that handles extension preservation, the thread offload, and temp-file cleanup in about 25 lines. Upload validation lives in the endpoint in Step 2.
# app/services/stt.py
from pathlib import Path
import os
import tempfile

import anyio
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment


def _transcribe_sync(file_path: str) -> str:
    """Blocking call to OpenAI Whisper. Run inside a thread."""
    with open(file_path, "rb") as f:
        result = client.audio.transcriptions.create(
            model="gpt-4o-mini-transcribe",
            file=f,
        )
    return result.text


async def transcribe_audio(audio_bytes: bytes, filename: str) -> str:
    """
    Save uploaded audio to a temp file preserving its extension,
    then send it to OpenAI Whisper. Cleans up the temp file
    no matter what happens.
    """
    suffix = Path(filename).suffix or ".webm"
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # close the descriptor; the file is re-opened by name below
    try:
        Path(tmp_path).write_bytes(audio_bytes)
        return await anyio.to_thread.run_sync(_transcribe_sync, tmp_path)
    finally:
        Path(tmp_path).unlink(missing_ok=True)
Why these specific choices:
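tempfile.mkstemp(suffix=...) instead of NamedTemporaryFile. mkstemp returns a path you can re-open by name on every OS (including Windows). NamedTemporaryFile with its default settings cannot be re-opened by name on Windows while the original handle is still open, and re-opening by name is exactly what the open(file_path, "rb") call in _transcribe_sync does.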
suffix=Path(filename).suffix - this is the line that fixes Safari. OpenAI's audio API infers the MIME type from the file extension. If you write the upload to /tmp/upload.tmp, OpenAI returns Invalid file format. If you write it to /tmp/upload.mp4 or /tmp/upload.webm, it works.
anyio.to_thread.run_sync - the OpenAI Python SDK's audio.transcriptions.create is synchronous and network-bound. Calling it directly from an async def handler blocks the event loop for the full 500 ms–2 s of the API round trip, and your server can't handle other requests while it waits. Offloading to a thread is one line of code and immediately fixes it.
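missing_ok=True - without it, unlink raises FileNotFoundError if the temp file is already gone by the time the finally block runs. The flag requires Python 3.8 or newer. mkstemp creates the file up front, so it should normally exist; the flag simply keeps cleanup from masking the original error.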
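To see the extension and cleanup behavior in isolation, here is a minimal sketch that replaces the API call with a hypothetical callback that always fails. The names with_temp_audio and failing_process are illustrative and not part of the service above; the point is only that the temp path keeps the upload's suffix and is removed even when processing raises.

```python
import os
import tempfile
from pathlib import Path


def with_temp_audio(audio_bytes: bytes, filename: str, process):
    """Write bytes to a temp file that keeps the upload's extension,
    call process(path), and always remove the file afterwards."""
    suffix = Path(filename).suffix or ".webm"
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # mkstemp returns an open descriptor; close it before re-opening by name
    try:
        Path(tmp_path).write_bytes(audio_bytes)
        return process(tmp_path)
    finally:
        Path(tmp_path).unlink(missing_ok=True)


seen_paths = []


def failing_process(path: str) -> str:
    # Hypothetical stand-in for the API call: record the path, then fail.
    seen_paths.append(path)
    raise RuntimeError("simulated API failure")


try:
    with_temp_audio(b"fake audio", "recording.mp4", failing_process)
except RuntimeError:
    pass

print(seen_paths[0].endswith(".mp4"))  # the extension was preserved
print(Path(seen_paths[0]).exists())    # the file is gone despite the exception
```

Because the unlink sits in finally, the exception still propagates to the caller; cleanup does not swallow it.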
Step 2 - The FastAPI endpoint
# app/main.py
from fastapi import FastAPI, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware

from app.services.stt import transcribe_audio

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],  # tighten in production
    allow_methods=["*"],
    allow_headers=["*"],
)

ALLOWED_AUDIO = {
    "audio/webm",  # Chrome, Firefox
    "audio/mp4",   # Safari
    "audio/mpeg",
    "audio/wav",
}
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # OpenAI's hard limit


@app.post("/api/transcribe")
async def transcribe(file: UploadFile):
    # Browsers may append codec parameters (e.g. "audio/webm;codecs=opus"),
    # so compare only the base type against the allowlist.
    base_type = (file.content_type or "").split(";")[0].strip().lower()
    if base_type not in ALLOWED_AUDIO:
        raise HTTPException(
            status_code=415,
            detail=f"Unsupported audio type: {file.content_type}",
        )
    audio_bytes = await file.read()
    if len(audio_bytes) > MAX_AUDIO_BYTES:
        raise HTTPException(status_code=413, detail="Audio file too large (max 25 MB)")
    transcript = await transcribe_audio(
        audio_bytes=audio_bytes,
        filename=file.filename or "audio.webm",
    )
    return {"transcript": transcript}
Two non-obvious details:
The MIME whitelist is opinionated. The browser will send you whatever its MediaRecorder produced - accept the formats you've actually tested, reject the rest with 415 Unsupported Media Type. Don't let the OpenAI API be your validator; that wastes a network round trip and gives users a worse error message.
The 25 MB ceiling matches OpenAI's documented limit. Enforce it server-side; a malicious or buggy client can lie about Content-Length.
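If you would rather not buffer an oversized upload before rejecting it, one option is to read in chunks and stop as soon as the cap is crossed. The sketch below is written against any object with an async read(n) method, which is the shape FastAPI's UploadFile exposes; read_capped, UploadTooLarge, and FakeUpload are hypothetical names, and the demo runs only against the stand-in class, not a real request.

```python
import asyncio

MAX_AUDIO_BYTES = 25 * 1024 * 1024
CHUNK_SIZE = 1024 * 1024


class UploadTooLarge(Exception):
    """Raised when the stream exceeds the configured cap."""


async def read_capped(upload, limit: int = MAX_AUDIO_BYTES,
                      chunk_size: int = CHUNK_SIZE) -> bytes:
    """Read from an object exposing `async read(n)`, aborting once `limit` is exceeded."""
    chunks = []
    total = 0
    while True:
        chunk = await upload.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > limit:
            raise UploadTooLarge(f"upload exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)


class FakeUpload:
    """Hypothetical stand-in for an upload object, for demonstration only."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    async def read(self, n: int) -> bytes:
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk


small = asyncio.run(read_capped(FakeUpload(b"x" * 10), limit=16, chunk_size=4))

try:
    asyncio.run(read_capped(FakeUpload(b"x" * 32), limit=16, chunk_size=4))
    rejected = False
except UploadTooLarge:
    rejected = True
```

In an endpoint you would catch UploadTooLarge and raise the same 413 response as before. The check counts bytes actually received, so it does not depend on whatever Content-Length the client claimed.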
Step 3 - The minimum viable client (curl, for testing)
Test before you wire up the browser. This single curl confirms the whole backend works:
curl -X POST http://localhost:8000/api/transcribe \
  -F "file=@./test-audio.webm;type=audio/webm"
If you get {"transcript": "..."} back, your OpenAI Whisper FastAPI integration is working. If not, the error code tells you exactly which step failed: 415 = MIME validation, 413 = file size, 500 = inside transcribe_audio (usually the OpenAI call or a missing API key).
Step 4 - Environment configuration
Don't read OPENAI_API_KEY directly with os.getenv and hope. Use pydantic-settings so a missing key fails at startup with a useful error instead of mid-request:
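# app/settings.py
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Load values from .env in addition to the process environment.
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str  # required: startup fails if it is not set
    whisper_model: str = "gpt-4o-mini-transcribe"


settings = Settings()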
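Then in stt.py, import settings and pass settings.openai_api_key to the OpenAI client explicitly, for example OpenAI(api_key=settings.openai_api_key). If the key is not set in the environment or in .env, Settings() raises a validation error at import time and the server refuses to start - which is what you want. Note that a plain str field still accepts an empty string, so add a length constraint if you want an empty key rejected too.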
Common pitfalls in OpenAI Whisper + FastAPI integration
Why does OpenAI Whisper return "Invalid file format" only on Safari?
Safari's MediaRecorder produces audio/mp4, not audio/webm. If your code writes the upload to a temp file without preserving the extension (e.g., a generic .tmp path), OpenAI can't infer the format and rejects the upload. Fix: use tempfile.mkstemp(suffix=Path(filename).suffix) so the extension is always preserved.
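One gap worth guarding against: if the upload arrives with a filename that has no extension, a hard-coded .webm fallback would mislabel Safari's audio/mp4 recording. A minimal sketch of deriving the suffix from the content type in that case follows. The pick_suffix helper and the SUFFIX_BY_MIME mapping are assumptions for illustration, and the extensionless filename "blob" in the usage lines is an assumed example of what a client might send.

```python
from pathlib import Path
from typing import Optional

# Assumed mapping for illustration; extend it to match the types you accept.
SUFFIX_BY_MIME = {
    "audio/webm": ".webm",
    "audio/mp4": ".mp4",
    "audio/mpeg": ".mp3",
    "audio/wav": ".wav",
}


def pick_suffix(filename: Optional[str], content_type: Optional[str],
                default: str = ".webm") -> str:
    """Prefer the filename's own extension; otherwise fall back to the MIME type."""
    suffix = Path(filename or "").suffix.lower()
    if suffix:
        return suffix
    # Drop parameters such as ";codecs=opus" before looking up the base type.
    base_type = (content_type or "").split(";")[0].strip().lower()
    return SUFFIX_BY_MIME.get(base_type, default)


print(pick_suffix("recording.mp4", "audio/mp4"))
print(pick_suffix("blob", "audio/mp4;codecs=mp4a.40.2"))
print(pick_suffix("blob", "audio/webm;codecs=opus"))
```

The result can be passed straight to tempfile.mkstemp(suffix=...) in place of Path(filename).suffix.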
Why does my FastAPI server freeze when multiple users transcribe at once?
The OpenAI Python SDK's audio.transcriptions.create is a synchronous, blocking call. If you call it directly from an async def handler, it blocks FastAPI's event loop for the full duration of the network round trip - typically 500 ms to 2 seconds - and no other requests can be served. Wrap the call with anyio.to_thread.run_sync so it runs in a worker thread.
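The effect is easy to reproduce without touching the API. The sketch below uses time.sleep as a hypothetical stand-in for the blocking SDK call and the standard library's asyncio.to_thread in place of anyio.to_thread.run_sync, purely so it runs with no third-party dependencies. A background task counts how many turns the event loop gets while the "call" is in flight.

```python
import asyncio
import time


def blocking_call(delay: float = 0.2) -> str:
    """Hypothetical stand-in for a synchronous SDK call that waits on the network."""
    time.sleep(delay)
    return "transcript"


async def count_ticks(stop: asyncio.Event) -> int:
    """Count how often the event loop gets a turn while other work is in flight."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def measure(offload: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # give the ticker its first turn
    if offload:
        await asyncio.to_thread(blocking_call)  # runs in a worker thread
    else:
        blocking_call()  # runs on the event loop thread; nothing else can proceed
    stop.set()
    return await ticker


blocked_ticks = asyncio.run(measure(offload=False))
offloaded_ticks = asyncio.run(measure(offload=True))
print(f"ticks while blocked: {blocked_ticks}, ticks while offloaded: {offloaded_ticks}")
```

In the blocked case the ticker gets a single turn before the loop stalls; in the offloaded case it keeps ticking for the whole call, which is the same headroom other requests get on a real server.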
Should I use the streaming version of the OpenAI Whisper API?
For most voice-assistant use cases, no. Whisper's batch (non-streaming) API returns the full transcript in one response and is simpler to integrate with a temp-file workflow. Streaming Whisper is useful only when you need partial transcripts during a live recording (e.g., live captions). For a record-then-transcribe flow, batch is the right call.
How big can the audio file be?
OpenAI's documented limit is 25 MB per request. Enforce this server-side before calling the API - otherwise you waste an upload and get a confusing API error. For longer audio, split the file client-side or server-side before transcription.
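As a rough illustration of server-side splitting, here is a sketch that cuts uncompressed PCM WAV data into fixed-length pieces using only the standard library. It is deliberately narrow: it assumes WAV input, the split_wav and make_silence names are hypothetical, and it cuts at fixed time offsets, which may land mid-word. Compressed formats such as webm or mp4 cannot be cut at arbitrary offsets like this and need a dedicated audio tool.

```python
import io
import wave
from typing import List


def split_wav(data: bytes, max_seconds: float) -> List[bytes]:
    """Split WAV bytes into consecutive WAV chunks of at most max_seconds each."""
    chunks = []
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames_per_chunk = max(1, int(src.getframerate() * max_seconds))
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Each chunk gets its own header so it is a valid WAV file.
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks


def make_silence(seconds: float, rate: int = 8000) -> bytes:
    """Build a mono 16-bit WAV of silence, used here only as test input."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(rate * seconds))
    return buf.getvalue()


parts = split_wav(make_silence(1.0), max_seconds=0.4)
print(len(parts))  # one second at 0.4 s per chunk yields three pieces
```

Each piece can then be transcribed separately and the transcripts joined in order.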
Do I need to use real OpenAI embeddings or can I use the keyword-search fallback?
That's a question for your vector store (e.g., ChromaDB), not Whisper. Whisper itself doesn't generate embeddings - it transcribes audio to text. If your downstream pipeline retrieves context based on the transcript, that's a separate integration. See the main post for the full pipeline.
Where this fits in a full AI voice assistant
This OpenAI Whisper + FastAPI integration is the speech-to-text slice of a full voice assistant pipeline. The complete flow is:
Browser records audio with MediaRecorder.
Audio is uploaded to FastAPI and transcribed by Whisper (this post).
The transcript goes to an LLM agent (OpenAI Agents SDK) that may search a vector store for context.
The agent's text answer is synthesized to speech by OpenAI TTS.
The MP3 is base64-encoded and returned to the browser for playback.
For the full architecture, all five layers, browser code, and the design decisions behind each one, read the AI Voice Assistant architecture deep-dive.
Grab the free PRD template
We've packaged the complete Product Requirements Document for this voice assistant - architecture, API spec, sprint plan, system prompt, the full STT/LLM/TTS contracts - as a downloadable PDF. Use it as a starting point for your own build.
Download the AI Voice Assistant PRD → (free, 362 KB)
Get the complete project
If you want the working repo - backend, frontend, Docker Compose, tested across Chrome/Firefox/Safari, with the OpenAI Whisper FastAPI integration example wired into a full voice pipeline - the AI Voice Assistant course on Codersarts Labs ships it all:
Complete commented source code for backend and frontend.
Step-by-step lessons across 5 modules / 20 lessons.
Docker Compose setup that runs the full stack with one command.
Production CORS and environment variable configuration.
ChromaDB seeding patterns for semantic search.
Redis session store upgrade path for multi-worker deployments.
$29.99 self-paced. Everything above.
Closing
The OpenAI Whisper + FastAPI integration is short - about 50 lines of Python - but the line count hides three production gotchas: extension preservation, async offload, and temp-file cleanup. Get those right and your STT layer is solid. From here, the natural next step is the LLM agent, then OpenAI TTS for the spoken reply. The architecture deep-dive linked above covers all three layers end-to-end.


