Build Your First AI Voice Agent: Speech, Conversation, and Audio Playback with Python and OpenAI


Introduction

Most AI tutorials show you a text box. You type, the model replies, and the whole exchange stays on screen. That covers the mechanics of calling an LLM, but it leaves out what makes voice AI feel genuinely different: the question comes from a microphone, the answer comes back as speech, and the whole thing happens without touching a keyboard.

This tutorial builds a working voice AI agent from scratch in Python. Press Enter to start recording, speak your question, press Enter again to stop, and the agent transcribes your words, generates a reply, and reads it back to you out loud. No GUI, no cloud dashboard to configure, no framework to learn first.

What We Are Building

A command-line voice AI agent that runs a full four-stage pipeline in one turn:

1. Record your voice from the microphone into a WAV file
2. Transcribe the WAV using Google Speech Recognition (free, no API key)
3. Generate a reply using GPT-4o-mini with conversation memory
4. Speak the reply back using edge-tts (free, Microsoft neural voices)

The agent keeps a running conversation history within each session, so follow-up questions work naturally. Every exchange is logged to a history.json file with the transcript, reply, token counts, cost, and time taken.

Tech Stack

Component	Tool
Microphone recording	sounddevice + numpy
Audio file read/write	soundfile
Speech to text	SpeechRecognition + Google (free, no API key)
Language model	GPT-4o-mini via OpenAI API
Text to speech	edge-tts (free, Microsoft neural voices)
Audio playback	playsound
Terminal UI	rich
Config	python-dotenv

Project Structure


voice_ai_agent/
├── agent.py        # VoiceAgent class: record, transcribe, chat, speak
├── main.py         # terminal loop: recording UI, display, history logging
├── .env            # API key, model, voice, and pricing config
├── requirements.txt
└── history.json    # created at runtime, one record per exchange

Setting Up

1. Install Dependencies


pip install openai python-dotenv rich sounddevice soundfile numpy edge-tts "playsound==1.2.2" SpeechRecognition

sounddevice handles microphone capture. soundfile reads and writes WAV files. SpeechRecognition connects to Google’s free speech recognition service with no API key. edge-tts generates speech using Microsoft’s neural voices at no cost. playsound plays the resulting MP3 — it is a single-purpose library that does nothing except play a sound file and block until it finishes, which is exactly what a terminal app needs.

2. Configure Environment

Create a .env file in the project folder:



# Your OpenAI API key
OPENAI_API_KEY=your_openai_api_key_here

# Model for the LLM conversation
MODEL=gpt-4o-mini

# edge-tts voice (free, no API key needed)
# Options: en-US-JennyNeural, en-US-GuyNeural, en-GB-SoniaNeural, en-IN-NeerjaNeural
TTS_VOICE=en-US-JennyNeural

# GPT-4o-mini pricing per 1M tokens, update here if OpenAI changes rates
INPUT_COST_PER_1M=0.150
OUTPUT_COST_PER_1M=0.600

TTS_VOICE is an edge-tts voice name. Microsoft offers dozens of neural voices across languages and accents. en-US-JennyNeural is a clear, natural-sounding default. You can browse all available voices by running edge-tts --list-voices in the terminal.
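The two pricing values in the .env file are rates per one million tokens. As a rough illustration of how they turn into a per-call figure, the sketch below applies the same arithmetic that chat() uses later in agent.py. The function name estimate_cost and the token counts are made up for this example; the default rates are the ones from the .env file.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  input_per_1m: float = 0.150, output_per_1m: float = 0.600) -> dict:
    """Convert token counts to dollars using per-million-token rates."""
    # estimate_cost is a hypothetical helper, not part of agent.py
    input_cost = round(prompt_tokens / 1_000_000 * input_per_1m, 6)
    output_cost = round(completion_tokens / 1_000_000 * output_per_1m, 6)
    return {
        "input_cost": input_cost,
        "output_cost": output_cost,
        "total_cost": round(input_cost + output_cost, 6),
    }


# Illustrative numbers only: 1,200 prompt tokens and 150 reply tokens
print(estimate_cost(1_200, 150))
```

With those illustrative counts the total comes to about $0.00027, which shows why a short spoken exchange costs a small fraction of a cent at these rates.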

Building the Voice Agent: agent.py

agent.py contains the VoiceAgent class and the module-level constants it depends on. The class has four methods: record, transcribe, chat, and speak, each responsible for exactly one stage of the pipeline.

Imports and Constants



import asyncio          # lets us call async functions (like edge-tts) from normal synchronous Python code
import os               # reads values from environment variables and removes temp audio files from disk
import tempfile         # creates disposable files on disk for audio without picking names ourselves
import threading        # lets recording run in the background while the main thread waits for Enter

import edge_tts                  # Microsoft neural text-to-speech — free, no account or API key needed
import numpy as np               # merges the list of audio chunks into a single flat array
import sounddevice as sd         # opens the microphone and streams audio in real time
import soundfile as sf           # writes the captured audio to a WAV file that Google STT can read
import speech_recognition as sr  # wraps Google's free speech-to-text service — no API key required
from playsound import playsound  # plays an MP3 file through the speakers and waits until it finishes
from openai import OpenAI        # client for the OpenAI API — used here only for generating text replies
from dotenv import load_dotenv   # reads the .env file and injects its values into the process environment

load_dotenv()  # call this before any os.getenv() — otherwise the .env values are not visible to the process

_SAMPLE_RATE = 16_000  # 16 kHz is the standard sample rate for speech recognition models
_CHANNELS    = 1       # mono audio is all that speech recognition needs; stereo would double the file size

_PRICING = {
    "input":  float(os.getenv("INPUT_COST_PER_1M",  "0.150")),  # dollars per 1 million prompt tokens; read from .env so you can update without touching code
    "output": float(os.getenv("OUTPUT_COST_PER_1M", "0.600")),  # dollars per 1 million reply tokens; output costs more than input per token
}

_SYSTEM_PROMPT = (
    "You are a helpful voice assistant. Keep your answers concise and conversational, "
    "since they will be spoken aloud. Avoid markdown, bullet points, and special characters."
)

_CALL_METADATA = {
    "dev_name":    "Ganesh",      # your name; appears next to this API call in the OpenAI usage dashboard
    "project":     "codex-test",  # custom label so you can filter all calls from this project in the dashboard
    "environment": "local",       # tells you at a glance that this call came from a laptop, not a production server
    "purpose":     "testing",     # reminder of why the call was made when reviewing cost logs later
}

asyncio is needed because edge-tts is an async library. speech_recognition connects to Google’s free STT service. playsound is a single-purpose library for playing audio files with no game engine or GUI overhead. _SAMPLE_RATE and _CHANNELS define the audio capture format: 16 kHz mono is standard for speech recognition and keeps the WAV file small. _SYSTEM_PROMPT tells the model to give spoken-friendly answers without formatting characters that sound odd when read aloud.
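To put a number on "keeps the WAV file small", here is a back-of-the-envelope sketch. It assumes uncompressed 16-bit PCM (2 bytes per sample, matching dtype="int16" in the recorder) and ignores the small WAV header. The helper name pcm_bytes and the 10-second duration are illustrative.

```python
def pcm_bytes(seconds: float, sample_rate: int = 16_000,
              channels: int = 1, bytes_per_sample: int = 2) -> int:
    """Size of the raw audio data in an uncompressed PCM WAV (header excluded)."""
    # pcm_bytes is a hypothetical helper used only for this estimate
    return int(seconds * sample_rate * channels * bytes_per_sample)


speech = pcm_bytes(10)                                       # 16 kHz mono, 16-bit
cd_quality = pcm_bytes(10, sample_rate=44_100, channels=2)   # 44.1 kHz stereo, 16-bit

print(speech, cd_quality)  # 320000 1764000
```

Ten seconds of speech at the tutorial's settings is about 320 KB, roughly 5.5 times smaller than the same clip at CD-style 44.1 kHz stereo.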

The VoiceAgent Class



class VoiceAgent:
    def __init__(self):
        self.client     = OpenAI(api_key=os.getenv("OPENAI_API_KEY", ""))  # creates a connection to OpenAI; used only to generate text replies, not for any voice work
        self.model      = os.getenv("MODEL",     "gpt-4o-mini")            # which GPT model to call; gpt-4o-mini is fast and cheap enough for spoken conversations
        self.voice      = os.getenv("TTS_VOICE", "en-US-JennyNeural")      # which Microsoft voice to use; Jenny is a clear, natural-sounding American English voice
        self.recognizer = sr.Recognizer()                                   # the speech recognition engine; created once at startup so we don't rebuild it on every call
        self.history    = [{"role": "system", "content": _SYSTEM_PROMPT}]  # the full conversation so far; starts with the system prompt and gains one entry per question and answer

self.client is the OpenAI client used only for the LLM chat() call, not for any voice work. self.recognizer is created once and reused across all transcription calls rather than instantiated per request. self.history starts with the system prompt and grows with each turn so the model always has the full conversation context.

Recording Audio



    def record(self, stop_event: threading.Event) -> str:
        frames = []  # collects raw audio chunks as they arrive; sounddevice fills this list continuously while the mic is open

        def _callback(indata, frame_count, time_info, status):
            frames.append(indata.copy())  # sounddevice calls this automatically for every chunk of audio captured; we copy to avoid overwriting

        with sd.InputStream(samplerate=_SAMPLE_RATE, channels=_CHANNELS,
                            dtype="int16", callback=_callback):  # opens the microphone; audio flows into _callback until the 'with' block exits
            stop_event.wait()  # pauses here until main.py signals that the user pressed Enter to stop recording

        audio = np.concatenate(frames, axis=0)  # merges all the small captured chunks into one continuous audio array

        with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
            tmp_path = f.name  # save the file path before the 'with' block closes the file handle; we need the path to write audio next

        sf.write(tmp_path, audio, _SAMPLE_RATE)  # writes the audio array to disk as a standard WAV file that Google STT can read
        return tmp_path

sounddevice calls _callback on each audio chunk while the InputStream is open. The frames list accumulates all the chunks, and stop_event.wait() holds the stream open until the user presses Enter in main.py. After the event fires, the chunks are concatenated and written to a temporary WAV file for Google STT to read.

Transcribing with Google STT

 

    def transcribe(self, audio_path: str) -> str:
        with sr.AudioFile(audio_path) as source:
            audio = self.recognizer.record(source)  # loads the entire WAV file into an AudioData object that recognize_google can process
        try:
            return self.recognizer.recognize_google(audio)  # sends the audio to Google's speech API over the internet; free to use with no account or API key
        except sr.UnknownValueError:
            return ""  # Google could not make out any words — happens with silence, background noise, or very unclear speech
        except sr.RequestError as exc:
            raise RuntimeError(f"Google STT request failed: {exc}") from exc  # network problem or Google rejected the request; re-raise so main.py can display the error

sr.AudioFile reads the WAV and wraps it in an AudioData object that the recognizer understands. recognize_google() sends it to Google’s free speech API and returns the transcript as a plain string. UnknownValueError means the audio was too unclear or silent, so returning an empty string lets main.py handle that gracefully without crashing.

Generating a Reply



    def chat(self, text: str) -> dict:
        self.history.append({"role": "user", "content": text})  # adds this question to the conversation list so the model sees everything said before it

        response = self.client.chat.completions.create(
            model=self.model,         # the GPT model to use, read from .env
            messages=self.history,    # the entire conversation history; sending all of it lets the model answer follow-up questions
            max_tokens=300,           # limits the reply length; 300 tokens is roughly 225 words, enough for a spoken response
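            store=True,               # assumption: the API accepts metadata only on stored completions, so storage is switched on here
            metadata=_CALL_METADATA,  # extra labels sent to the OpenAI dashboard for tracking; the model never sees these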
        )

        reply = response.choices[0].message.content or ""  # extract the text of the model's reply from the API response
        self.history.append({"role": "assistant", "content": reply})  # add the reply to history so future questions know what was already said

        prompt_tokens     = response.usage.prompt_tokens      # number of tokens in everything we sent (system prompt + history + new question)
        completion_tokens = response.usage.completion_tokens  # number of tokens in the model's reply
        input_cost  = round((prompt_tokens     / 1_000_000) * _PRICING["input"],  6)  # convert prompt tokens to dollars using the per-million price
        output_cost = round((completion_tokens / 1_000_000) * _PRICING["output"], 6)  # reply tokens cost more per million than prompt tokens

        return {
            "reply":             reply,
            "prompt_tokens":     prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens":      prompt_tokens + completion_tokens,
            "input_cost":        input_cost,
            "output_cost":       output_cost,
            "total_cost":        round(input_cost + output_cost, 6),
        }

The user message is appended to self.history before the API call, and the assistant reply right after it. When the next question arrives, messages=self.history sends the entire conversation to the model, which is what makes follow-up questions like “explain that last part differently” work without any special handling. Because the whole history is resent each time, the prompt token count and the input cost grow with every turn.
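Since the full history is sent on every call, a long session keeps getting more expensive. One possible way to cap that, sketched here as an optional extension rather than part of the tutorial's agent, is to keep the system prompt plus only the most recent messages. The name trim_history and the default limit of 20 are assumptions; the sketch relies on the first entry being the system prompt, as it is in VoiceAgent.

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Keep the system prompt plus the most recent max_messages entries."""
    # trim_history is a hypothetical helper; history[0] is assumed to be the system prompt
    system, rest = history[:1], history[1:]
    recent = rest[-max_messages:] if max_messages > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are a helpful voice assistant."}]
for i in range(1, 4):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

With a limit of 4, the first question and answer drop out while the system prompt and the last two exchanges stay. The trade-off is that the model can no longer refer back to the trimmed turns.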

Speaking the Reply



# module level, outside the VoiceAgent class
async def _tts_save(text: str, voice: str, path: str) -> None:
    communicate = edge_tts.Communicate(text, voice)  # prepares the speech request with the text to say and the voice to use
    await communicate.save(path)                      # streams the MP3 audio from Microsoft's servers and writes it to the given file path


    # method of the VoiceAgent class
    def speak(self, text: str) -> None:
        with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
            tmp_path = f.name  # reserve a temp file path for the MP3; delete=False keeps the file on disk after closing so playsound can open it

        asyncio.run(_tts_save(text, self.voice, tmp_path))  # runs the async edge-tts function synchronously; downloads the spoken MP3 from Microsoft
        playsound(tmp_path)   # plays the MP3 through the speakers; blocks here until the audio finishes playing
        os.unlink(tmp_path)   # removes the temp MP3 from disk; it was only needed for this one reply

_tts_save is a module-level async function because edge-tts uses Python’s async networking. asyncio.run() bridges the synchronous speak() method into the async world without making the rest of the codebase async. playsound() blocks until the MP3 finishes playing, so the next recording prompt only appears after the agent has finished speaking.
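The sync-to-async bridge can be tried without a network connection or speakers. In the sketch below, fake_tts_save stands in for the real edge-tts call and the play argument stands in for playsound; both are made up for illustration. It also wraps playback in try/finally, a small variation on speak() that removes the temp file even if playback raises.

```python
import asyncio
import os
import tempfile


async def fake_tts_save(text: str, path: str) -> None:
    """Hypothetical stand-in for the edge-tts save call."""
    await asyncio.sleep(0)  # yields to the event loop, as a real network call would
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))


def speak_sketch(text: str, play) -> str:
    """Synchronous wrapper: create a temp file, fill it via async code, play it, clean up."""
    with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
        tmp_path = f.name  # reserve a path; the handle is closed when the block exits
    try:
        asyncio.run(fake_tts_save(text, tmp_path))  # run the coroutine to completion
        play(tmp_path)                              # blocking playback stand-in
    finally:
        os.unlink(tmp_path)                         # always remove the temp file
    return tmp_path


def show_size(path: str) -> None:
    print(f"would play {os.path.getsize(path)} bytes")


speak_sketch("hello", show_size)
```

asyncio.run() starts a fresh event loop for each call and closes it afterwards, which is fine for a terminal app that speaks one reply at a time.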

Building the Terminal App: main.py

main.py owns everything the user sees and controls: the header, the record and stop prompts, the transcript and reply panels, and the history file.

Imports and Setup



import json                    # reads and writes history.json to persist each exchange
import os                      # removes the temp WAV file after transcription is done
import threading               # creates a background thread for recording and a shared stop signal
import time                    # measures how long each exchange takes from transcription to reply
from datetime import datetime  # records the exact time each exchange happened
from pathlib import Path       # builds the path to history.json relative to this file's location

from rich.console import Console  # handles all coloured and styled terminal output
from rich.panel import Panel      # draws a bordered box around the transcript and the agent's reply

from agent import VoiceAgent  # the class that runs all four pipeline stages: record, transcribe, chat, speak

console = Console()     # one Console instance used by every display function in this file
agent   = VoiceAgent()  # one VoiceAgent instance; creating it here sets up the OpenAI client and starts a conversation history that holds only the system prompt

HISTORY_FILE = Path(__file__).parent / "history.json"  # resolves to history.json in the same folder as this script

console and agent are created once at module load and shared by all functions below. VoiceAgent() at module level means the OpenAI client is initialized before the first prompt appears, and self.history holds only the system prompt at that point, so every python main.py begins a fresh conversation.

Saving History



def save_history(transcript: str, result: dict, elapsed: float) -> None:
    history = []
    if HISTORY_FILE.exists():  # if the file already exists, load the previous exchanges into the list first
        try:
            history = json.loads(HISTORY_FILE.read_text(encoding="utf-8"))  # parse the JSON file into a Python list of records
        except json.JSONDecodeError:
            history = []  # reset to empty if the file is corrupted or was accidentally left blank
    history.append({
        "timestamp":          datetime.now().isoformat(),  # exact date and time this exchange completed, in ISO format
        "transcript":         transcript,                  # the text that Google STT converted your speech into
        "result":             result,                      # the model's reply, token counts, and cost for this exchange
        "time_taken_seconds": elapsed,                     # total seconds from transcription start to reply received
    })
    HISTORY_FILE.write_text(json.dumps(history, indent=2, ensure_ascii=False), encoding="utf-8")  # write the updated list back; indent=2 makes the file readable in any text editor

Each exchange appends one record to the same history.json file. Tokens, cost, and time are stored here but never printed to the terminal, keeping the on-screen experience focused on the conversation rather than billing details.
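Because tokens, cost, and time only live in history.json, a small script is a convenient way to review them. The sketch below assumes the record layout written by save_history() (a result dict with total_tokens and total_cost, plus time_taken_seconds); the function name summarize_history is made up for this example.

```python
import json
from pathlib import Path


def summarize_history(path: Path) -> dict:
    """Total up exchanges, tokens, cost, and time from a history.json file."""
    # summarize_history is a hypothetical helper, not part of main.py
    records = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    return {
        "exchanges": len(records),
        "total_tokens": sum(r["result"]["total_tokens"] for r in records),
        "total_cost": round(sum(r["result"]["total_cost"] for r in records), 6),
        "total_seconds": round(sum(r["time_taken_seconds"] for r in records), 2),
    }


if __name__ == "__main__":
    print(summarize_history(Path("history.json")))
```

Run it from the project folder after a few exchanges. If history.json does not exist yet, every total is zero.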

The Input Loop



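def print_header() -> None:
    # run() calls print_header(), but the listing does not show it; this minimal banner is an assumption
    console.print()
    console.print("  [bold]Voice AI Agent[/bold]")
    console.print()


def run() -> None: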
    print_header()  # print the "Voice AI Agent" banner at the top of the session
    console.print("  Speak your question and the agent will answer you aloud.")  # remind the user this is a voice interface
    console.print("  Type [bold]exit[/bold] at any prompt to quit.\n")           # tell the user how to exit cleanly

    while True:  # keep the conversation going until the user types exit or presses Ctrl+C
        try:
            cmd = console.input("  Press Enter to start recording (or type exit): ").strip().lower()  # wait for the user to press Enter or type a command
        except (KeyboardInterrupt, EOFError):  # handle Ctrl+C or a closed stdin gracefully
            console.print("\n  Goodbye.")
            break

        if cmd in ("exit", "quit"):  # check if the user typed a quit command instead of pressing Enter
            console.print("  Goodbye.")
            break

        stop_event   = threading.Event()    # a shared flag; calling stop_event.set() tells the recording thread to stop
        path_holder: list[str] = []         # a list we use to pass the WAV file path back from the background thread

        def _record():
            path_holder.append(agent.record(stop_event))  # records in the background; stores the WAV path in path_holder when done

        t = threading.Thread(target=_record, daemon=True)  # background thread for recording; daemon=True means it shuts down automatically if the program exits
        t.start()  # start capturing audio immediately in the background

        console.print()  # blank line for visual breathing room
        console.print("  [green]Recording...[/green] Press Enter to stop.")  # tell the user the mic is live
        console.print()  # blank line so the stop prompt doesn't crowd the recording message

        try:
            console.input("")  # waits here until the user presses Enter; no prompt is shown, which keeps the UI clean
        except (KeyboardInterrupt, EOFError):  # if the user hits Ctrl+C during recording, stop cleanly
            stop_event.set()   # tell the recording thread to stop
            t.join()           # wait for the thread to finish before exiting
            console.print("\n  Goodbye.")
            break

        stop_event.set()  # signals the recording thread to stop; the mic stream closes and the WAV file is written
        t.join()          # waits for the background thread to finish before we try to read the WAV path

        audio_path = path_holder[0] if path_holder else None  # retrieve the WAV path the thread stored; None if something went wrong
        if not audio_path:  # skip this turn if the recording thread didn't produce a file
            console.print("  [red]Recording failed.[/red]\n")
            continue

        console.print()  # blank line before the transcribing status
        console.print("  [dim]Transcribing...[/dim]")  # let the user know audio is being sent to Google STT

        try:
            start      = time.time()                   # record the start time so we can measure total processing time
            transcript = agent.transcribe(audio_path)  # sends the WAV to Google STT and returns the spoken words as a string
        except Exception as exc:
            console.print(f"  [red]Transcription error: {exc}[/red]\n")  # show the error and loop back for another attempt
            continue
        finally:
            os.unlink(audio_path)                      # delete the temp WAV whether or not transcription succeeded

        if not transcript:  # empty string means Google heard nothing useful — silence or background noise
            console.print("  [yellow]No speech detected. Try again.[/yellow]\n")
            continue  # skip the LLM call and go back to the recording prompt

        console.print()  # blank line before the transcript panel
        console.print(Panel(transcript, title="[bold]You said[/bold]", border_style="blue", padding=(0, 2)))  # show what Google STT heard in a blue bordered box
        console.print()  # blank line after the panel

        console.print("  [dim]Thinking...[/dim]")  # let the user know the LLM call is in progress

        try:
            result  = agent.chat(transcript)          # sends the transcript to GPT with full conversation history; returns reply, tokens, and cost
            elapsed = round(time.time() - start, 2)  # seconds from transcription start to LLM reply received
        except Exception as exc:
            console.print(f"  [red]Chat error: {exc}[/red]\n")  # show the error and loop back without crashing the session
            continue

        console.print()  # blank line before the reply panel
        console.print(Panel(result["reply"], title="[bold green]Agent[/bold green]",
                            border_style="green", padding=(1, 2)))  # show the agent's reply in a green bordered box
        console.print()  # blank line after the panel before speaking begins

        try:
            agent.speak(result["reply"])  # converts the text reply to MP3 using edge-tts and plays it out loud
        except Exception as exc:
            console.print(f"  [red]TTS error: {exc}[/red]\n")  # show the TTS error but keep the session alive

        console.print()  # blank line to separate this exchange from the next recording prompt

        save_history(transcript, result, elapsed)  # appends this exchange to history.json as a permanent record


if __name__ == "__main__":
    run()  # entry point: only runs when the file is executed directly, not when imported

run() is where the four-stage pipeline connects to user input. Recording runs in a background thread so the main thread is free to watch for the second Enter press. Once recording stops, the audio goes to Google STT, the transcript goes to the LLM, and the reply goes to edge-tts for playback. Any error at any stage is caught and shown without crashing the loop, so one bad request does not end the session.
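The background-thread handshake can be tried in isolation, without a microphone. In this sketch, fake_record stands in for agent.record() and time.sleep() stands in for waiting on the second Enter press; both names and the chunk size are illustrative assumptions.

```python
import threading
import time


def fake_record(stop_event: threading.Event, chunks: list) -> None:
    """Hypothetical stand-in for agent.record(): collect chunks until told to stop."""
    while not stop_event.is_set():
        chunks.append(b"\x00\x00" * 160)  # 10 ms of silence at 16 kHz, 16-bit mono
        stop_event.wait(0.01)             # pause briefly, waking early if the event is set


def record_for(seconds: float) -> bytes:
    stop_event = threading.Event()
    chunks: list = []
    t = threading.Thread(target=fake_record, args=(stop_event, chunks), daemon=True)
    t.start()            # capture begins in the background
    time.sleep(seconds)  # the main thread is free; in main.py it waits for Enter here
    stop_event.set()     # signal the recorder to stop
    t.join()             # wait for it to finish before reading the result
    return b"".join(chunks)


audio = record_for(0.05)
print(f"captured {len(audio)} bytes")
```

The key ordering is set() followed by join(): the main thread only reads the collected data after the worker has fully stopped, which is the same reason run() calls t.join() before looking at path_holder.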

Running the App


python main.py

Type exit at any recording prompt to end the session. The conversation history for that session lives in self.history inside VoiceAgent and is not reloaded on the next run, so each python main.py starts a fresh conversation.

Who Can Benefit

Students and beginners get a complete STT, LLM, TTS pipeline they can run immediately and trace line by line, without signing up for a voice platform or learning a framework first.
Developers get a clean reference implementation for the record, transcribe, generate, speak loop with no hidden abstractions, useful as a base to extend with wake-word detection, streaming TTS, or a persistent memory store.
Enterprises get a working pattern for internal voice assistants, meeting note tools, and accessibility features built on their own OpenAI account rather than a third-party vendor.
AI engineers get a foundation to add real-time streaming, voice activity detection (VAD), wake-word detection, or a more sophisticated memory layer such as a vector store.

How Codersarts Can Help

If you want to take this further, Codersarts offers hands-on support at every stage.

For learners: Live 1-to-1 sessions with an AI engineer who can walk through each line with you, explain the audio and API concepts, and help you extend the project for your own use case.
For teams: End-to-end development of production voice AI systems including wake-word detection, streaming pipelines, custom voices, and integration with existing products.
For enterprises: Architecture consulting and implementation for internal voice assistants, document Q&A with voice input, and call-center automation prototypes.

Reach out at contact@codersarts.com or visit www.codersarts.com to get started.
