How to Build Speech Recognition From Scratch with Python, PyTorch, MFCCs, and OpenAI TTS

Introduction
You have watched enough "speech recognition in 5 minutes" videos. Every tutorial starts the same way: pip install openai-whisper, load a model, call transcribe(), done. That is not learning speech recognition — it is calling someone else's black box. When the transcription is wrong, when latency is too high, or when you need to run entirely offline on a Raspberry Pi, you have no idea where to start debugging because you never understood what actually happens between a microphone and a word prediction.
This course fills that gap. Speech Recognition From Scratch With Python, PyTorch, and OpenAI TTS walks you from a raw WAV file all the way to a running keyword spotter, without touching Whisper, librosa, torchaudio, or any cloud speech API for the core recognition logic. You build every piece yourself: the audio loader, the FFT, the mel filter bank, the MFCC extractor, the PyTorch CNN, and the CTC decoding logic.
Real-world use cases this project unlocks:
Offline keyword spotting for voice commands on embedded or edge devices
Wake-word and command detection prototypes for home automation or robotics
Educational speech recognition labs for Python and machine learning courses
Synthetic speech dataset generation using OpenAI TTS for rapid model experiments
Audio feature extraction demos for digital signal processing and ML classes
A foundation for future full speech-to-text training with CTC loss
This post walks through the architecture, the tech stack, the implementation phases, and the non-obvious challenges you will hit along the way. It does not include the full source code — that lives in the course.
📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]
How It Works: Core Concepts
The naive approach and why it fails
The obvious first attempt at speech recognition is to feed raw audio waveform samples directly into a neural network. A one-second clip sampled at 16,000 Hz is a vector of 16,000 numbers. Small models cannot learn meaningful patterns from 16,000-dimensional inputs when the dataset is tiny. The model sees timing, amplitude, and noise as equally important features. Two people saying "yes" at different volumes, speeds, or accents produce completely different waveforms, even though the word is identical. A raw-waveform model would need millions of samples to generalise.
The solution that powers real speech models — Whisper included — is feature engineering. Instead of training on raw samples, you convert audio into a compact, perceptually meaningful representation: Mel-Frequency Cepstral Coefficients (MFCCs). Think of MFCCs as a compression of the tonal fingerprint of a sound: they capture which frequencies are present, weighted the way human hearing perceives them, with most of the noise and speaker-specific variation removed.
The pipeline in two lines
Dataset / setup phase:
WAV file → NumPy array → Frame + window → DFT/FFT → Power spectrum
→ Mel filter bank → Log mel spectrum → DCT → MFCC tensor → Keyword label

Inference / runtime phase:
WAV file → NumPy array → MFCC tensor (fixed shape) → PyTorch Conv1D CNN
→ Softmax → Keyword class + confidence score

The analogy
Imagine you are trying to identify a song by its sheet music rather than its waveform recording. The sheet music (MFCC) strips away microphone quality, room acoustics, and singer-specific timbre, leaving only the melody and rhythm that define the song. Your brain — or your CNN — learns to recognise the melody pattern, not the raw sound wave. That is exactly what mel-scale frequency compression does for speech.
System Architecture Deep Dive
Architecture overview
The project is organised into five layers, each with a distinct responsibility:
Audio IO Layer — loads and writes WAV files using Python's standard library wave module and NumPy. It handles mono/stereo conversion, sample-width normalisation (8-bit, 16-bit, 32-bit), and optional resampling to a target rate (16,000 Hz by default).
DSP Feature Layer — implements the full signal processing pipeline from scratch: framing, windowing (Hann window), the Discrete Fourier Transform (DFT), the recursive Fast Fourier Transform (FFT), magnitude spectrograms, mel filter banks, log-mel features, and the Discrete Cosine Transform (DCT) to produce MFCC coefficient arrays.
Dataset Layer — provides two data sources. The local keyword path uses short WAV clips labelled as silence, "no", or "yes". The optional OpenAI TTS path calls gpt-4o-mini-tts to synthesise speech WAV files, saves keyword labels and transcript text, and writes CSV manifests for reproducible training runs.
Model Layer — contains two PyTorch models. The keyword spotter is a compact 1D CNN with two convolutional blocks, batch normalisation, dropout, and a fully-connected classifier head. The CTC acoustic model is a BiGRU that outputs per-frame character log-probabilities for learners who want to explore full speech-to-text decoding.
Inference and Decoding Layer — runs the trained CNN on a WAV file and returns a predicted keyword with a confidence score. The CTC path implements greedy decoding (collapse repeated characters, remove blank tokens) and introduces beam search concepts, even though the included CTC model is a teaching demonstration rather than a production decoder.
Component reference table
Component | Role | Technology Options |
WAV reader/writer | Load and save audio arrays | Python wave + NumPy (used), soundfile, scipy.io.wavfile |
Resampler | Normalise sample rates | NumPy linear interpolation (used), librosa.resample, torchaudio.functional.resample |
FFT implementation | Frequency transform | NumPy recursive FFT (used), numpy.fft, scipy.fft |
Mel filter bank | Perceptual frequency weighting | NumPy (used), librosa.filters.mel, torchaudio.transforms.MelSpectrogram |
MFCC extractor | Compact feature vectors | NumPy DCT (used), librosa.feature.mfcc, torchaudio.transforms.MFCC |
Keyword dataset | Labelled training clips | Local WAV clips, OpenAI TTS (used), Google Speech Commands dataset |
TTS dataset builder | Synthetic audio generation | OpenAI gpt-4o-mini-tts (used), ElevenLabs, Google Cloud TTS |
CNN keyword model | Keyword classification | PyTorch 1D CNN (used), PyTorch MLP, scikit-learn SVM on MFCCs |
CTC acoustic model | Full speech-to-text decoding | PyTorch BiGRU (used), wav2vec 2.0, Whisper encoder |
Training loop | Model optimisation | PyTorch manual loop with tqdm (used), PyTorch Lightning, Hugging Face Trainer |
Data flow walkthrough
Load audio — the WAV reader opens a file, reads raw bytes, converts them to a normalised float32 NumPy array in the range [-1.0, 1.0], and collapses stereo to mono.
Frame the signal — the audio array is sliced into overlapping frames of ~25 ms with a 10 ms hop between frames.
Apply window — each frame is multiplied by a Hann window to reduce spectral leakage at frame edges.
Compute FFT — the FFT of each frame yields a complex spectrum; the magnitude squared gives the power spectrum.
Apply mel filters — a bank of triangular filters spaced on the mel scale weights the power spectrum, producing one mel energy value per filter per frame.
Log compression — the log of each mel energy value compresses the dynamic range, matching the logarithmic sensitivity of human hearing.
Apply DCT — the DCT decorrelates the log-mel energies and retains the first N coefficients (typically 13–40) as the final MFCC feature vector per frame.
Pad or crop to fixed shape — the MFCC matrix is padded or trimmed to a fixed number of frames so every sample produces identically shaped tensors.
Feed to CNN — the fixed MFCC tensor is passed through the 1D convolutional layers, and the softmax output gives class probabilities.
Return prediction — the argmax of the softmax is mapped to a keyword label (silence, no, yes) with a confidence score.
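To make the walkthrough concrete, here is a minimal NumPy sketch of steps 2–7, assuming a mono float32 signal at 16 kHz that is at least one frame long. The function name, default parameters, and filter-bank construction are illustrative, not the course's exact code.

```python
import numpy as np

def extract_mfcc(signal, sr=16000, frame_ms=25, hop_ms=10, n_fft=512, n_mels=26, n_mfcc=13):
    """Illustrative MFCC pipeline: frame -> Hann window -> power spectrum -> mel -> log -> DCT."""
    frame_len = int(sr * frame_ms / 1000)                       # 400 samples at 16 kHz
    hop_len = int(sr * hop_ms / 1000)                           # 160 samples
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len] for i in range(n_frames)])
    frames = frames * np.hanning(frame_len)                     # Hann window per frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2   # power spectrum, zero-padded to n_fft

    # Triangular mel filter bank spaced evenly on the mel scale
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)

    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, left:centre] = (np.arange(left, centre) - left) / max(centre - left, 1)
        fbank[m - 1, centre:right] = (right - np.arange(centre, right)) / max(right - centre, 1)

    log_mel = np.log(power @ fbank.T + 1e-10)                   # log compression

    # DCT-II over the mel axis; keep the first n_mfcc coefficients per frame
    k = np.arange(n_mfcc)[:, None]
    n = np.arange(n_mels)[None, :]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_mels))
    return log_mel @ dct_basis.T                                # shape: (n_frames, n_mfcc)
```

For a 1-second clip at 16 kHz this yields roughly 98 frames of 13 coefficients each, which step 8 then pads or crops to a fixed length.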
Non-obvious design decisions
Decision 1: Why fixed-length MFCC tensors instead of variable-length sequences. PyTorch Conv1d requires consistent input shapes across a batch. Rather than using padding masks or packing sequences (which adds complexity for beginners), the project pads short clips with zeros and crops long clips to a fixed maximum number of frames. This keeps the CNN architecture simple and the batch loading deterministic.
Decision 2: Why implement FFT from scratch rather than calling numpy.fft directly. The teaching implementations of the DFT (O(N²)) and the recursive Cooley-Tukey FFT (O(N log N)) exist purely to expose the mathematics. Once learners run both and compare their outputs and runtimes on the same audio frame, the algorithm becomes intuitive rather than magic. numpy.fft is used for production code in the project; the scratch implementation runs once as a diagnostic.
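A minimal sketch of the radix-2 Cooley-Tukey recursion, checked against numpy.fft, gives a sense of what that diagnostic looks like — the function name is illustrative and the input length is assumed to be a power of two:

```python
import numpy as np

def fft_recursive(x):
    """Radix-2 Cooley-Tukey FFT (teaching version); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return x.astype(complex)
    even = fft_recursive(x[0::2])                    # FFT of even-indexed samples
    odd = fft_recursive(x[1::2])                     # FFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)
    return np.concatenate([even + twiddle * odd, even - twiddle * odd])

frame = np.random.randn(512)
assert np.allclose(fft_recursive(frame), np.fft.fft(frame))   # agrees with the library implementation
```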
Tech Stack Recommendation
Stack A: Beginner / Prototype (one weekend build)
Layer | Technology | Why |
Language | Python 3.10+ | Universal ML ecosystem |
Audio IO | Python wave + NumPy | No extra dependencies |
Feature extraction | NumPy (from scratch) | Explicit maths, easy debugging |
Model | PyTorch 2.x 1D CNN | Minimal API surface for beginners |
TTS dataset builder | OpenAI API (gpt-4o-mini-tts) | Fast synthetic data, WAV output |
Training progress | tqdm | One-liner progress bars |
Environment | venv + pip | Zero setup friction |
Estimated monthly cost: $0–$3 (OpenAI TTS charges ~$0.015 per 1,000 characters; generating 200 training samples costs under $1. Everything else runs locally for free.)
Stack B: Production-Ready (designed to scale)
Layer | Technology | Why |
Language | Python 3.11+ | Performance improvements, better typing |
Audio IO | soundfile + NumPy | Supports more formats, better error handling |
Resampling | torchaudio.functional | GPU-accelerated, production-grade |
Feature extraction | torchaudio.transforms | Optimised C++ backend, batch transforms |
Model | PyTorch 2.x + Lightning | Clean training loops, checkpointing, early stopping |
Dataset | Google Speech Commands v2 | 105,000 real speech clips, 35 keywords |
Training infra | AWS SageMaker or GCP Vertex AI | Scalable GPU training |
Serving | FastAPI + ONNX export | Low-latency inference, language-agnostic |
Monitoring | Weights & Biases | Experiment tracking, model versioning |
Containerisation | Docker | Reproducible environments |
Estimated monthly cost: $20–$120 depending on GPU instance type and usage. Training a keyword model from scratch on Speech Commands v2 takes 1–3 hours on a T4 GPU instance (~$2–$6 per run). Inference serving on a small FastAPI container on AWS Fargate costs $5–$15/month at low traffic.
Implementation Phases
Phase 1: Audio IO and Smoke Tests
What you build: A WAV reader that converts raw binary audio into a normalised NumPy float array, a WAV writer that saves arrays back to disk, and a set of diagnostic checks that print sample rate, channel count, sample width, duration, and min/max amplitude for any input file.
Key technical decisions:
Which sample widths to support (8-bit unsigned, 16-bit signed, 32-bit signed)
Whether to normalise to [-1.0, 1.0] or keep integer samples
How to collapse stereo to mono (average channels vs. left-channel only)
Whether to resample at load time or as a separate preprocessing step
Correctly handling all WAV sample widths without introducing clipping or offset errors is covered in detail in the full course with working, tested code.
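As a rough illustration of the sample-width branching this phase deals with, here is a minimal sketch using only the standard-library wave module and NumPy — the course version handles more edge cases and the function name is illustrative:

```python
import wave
import numpy as np

def load_wav(path):
    """Read a WAV file into a mono float32 array in [-1.0, 1.0] -- illustrative sketch only."""
    with wave.open(path, "rb") as wav:
        sr = wav.getframerate()
        n_channels = wav.getnchannels()
        width = wav.getsampwidth()
        raw = wav.readframes(wav.getnframes())

    if width == 1:    # 8-bit WAV is unsigned, centred on 128
        samples = (np.frombuffer(raw, dtype=np.uint8).astype(np.float32) - 128.0) / 128.0
    elif width == 2:  # 16-bit signed
        samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    elif width == 4:  # 32-bit signed
        samples = np.frombuffer(raw, dtype=np.int32).astype(np.float32) / 2147483648.0
    else:
        raise ValueError(f"Unsupported sample width: {width} bytes")

    if n_channels == 2:                               # collapse stereo to mono by averaging channels
        samples = samples.reshape(-1, 2).mean(axis=1).astype(np.float32)
    return samples, sr
```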
Phase 2: DSP Feature Extraction — FFT, Spectrograms, and MFCCs
What you build: A teaching implementation of the Discrete Fourier Transform, the recursive Cooley-Tukey FFT, a magnitude spectrogram function, a mel filter bank generator, and a full MFCC extraction pipeline. You also implement Hann windowing and log compression.
Key technical decisions:
Frame size and hop length in milliseconds vs. samples (converting between them for different sample rates)
How many mel filters to use (26 is common for keyword spotting, 80 for Whisper-class models)
How many MFCC coefficients to retain (13 is classic, 20–40 is modern)
Whether to add delta and delta-delta coefficients (first and second derivatives of MFCCs over time) for richer features
Understanding why the mel scale is non-linear and how triangular filter banks encode perceptual frequency weighting is covered in detail in the full course with working, tested code.
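If you do opt into delta features, a minimal sketch of the standard regression-based formulation looks like this (the window width and helper names are illustrative assumptions):

```python
import numpy as np

def delta(mfcc, width=2):
    """First-derivative (delta) features: regression slope over +/- width neighbouring frames."""
    padded = np.pad(mfcc, ((width, width), (0, 0)), mode="edge")   # repeat edge frames
    denom = 2 * sum(k * k for k in range(1, width + 1))
    t = len(mfcc)
    return sum(
        k * (padded[width + k : width + k + t] - padded[width - k : width - k + t])
        for k in range(1, width + 1)
    ) / denom

def add_deltas(mfcc):
    """Stack static MFCCs, deltas, and delta-deltas along the coefficient axis."""
    d1 = delta(mfcc)
    return np.concatenate([mfcc, d1, delta(d1)], axis=1)
```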
Phase 3: Keyword Model — CNN Training and Inference
What you build: A PyTorch 1D CNN that classifies fixed-length MFCC tensors as one of three classes: silence, "no", or "yes". You write the dataset class, the DataLoader configuration, the training loop, the validation loop, and the inference function that takes a WAV path and returns a class label with confidence.
Key technical decisions:
How many convolutional layers and filter widths to use
Where to place BatchNorm and Dropout to reduce overfitting
How to split a small dataset into train/val/test without data leakage
Whether to freeze early layers when fine-tuning on new keywords
Debugging tensor shape mismatches between the MFCC padded output and the Conv1d input is covered in detail in the full course with working, tested code.
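A minimal sketch of the kind of model this phase describes — the layer counts, channel sizes, and 100-frame input length are illustrative assumptions, not the course's exact architecture:

```python
import torch
import torch.nn as nn

class KeywordCNN(nn.Module):
    """Compact 1D CNN keyword spotter sketch: two conv blocks plus a linear classifier head."""
    def __init__(self, n_mfcc=13, n_frames=100, n_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_mfcc, 32, kernel_size=3, padding=1),   # MFCC coefficients act as input channels
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2), nn.Dropout(0.3),
            nn.Conv1d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(2), nn.Dropout(0.3),
        )
        self.classifier = nn.Linear(64 * (n_frames // 4), n_classes)

    def forward(self, x):                      # x: (batch, n_mfcc, n_frames)
        return self.classifier(self.features(x).flatten(1))   # raw logits

model = KeywordCNN()
logits = model(torch.randn(8, 13, 100))        # a batch of 8 fixed-shape MFCC tensors
probs = logits.softmax(dim=-1)                 # class probabilities: silence / "no" / "yes"
```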
Phase 4: OpenAI TTS Dataset Generation
What you build: A dataset builder script that calls the OpenAI TTS API with gpt-4o-mini-tts to generate WAV files for a list of keywords or sentences, saves them to organised folders, writes a keyword label file, a transcript file, and a CSV manifest recording each sample's filename, transcript, keyword, voice, and duration.
Key technical decisions:
How to request WAV output format from the OpenAI TTS API
Which voices to use for variety (and how many samples per voice per keyword)
How to handle API rate limits and retry logic
How to structure the CSV manifest so it is compatible with both the keyword CNN and the CTC model
Generating a balanced, multi-voice synthetic dataset that avoids keyword-class imbalance is covered in detail in the full course with working, tested code.
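A hedged sketch of the builder's core loop, assuming the official openai Python SDK (check the exact response helpers against your installed SDK version); the voice names, folder layout, and manifest columns are illustrative:

```python
import csv
import os
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment -- never hard-code the key

keywords = ["yes", "no"]
voices = ["alloy", "echo", "nova"]     # illustrative subset of the available voices
os.makedirs("data/tts", exist_ok=True)

rows = []
for keyword in keywords:
    for voice in voices:
        response = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice=voice,
            input=keyword,
            response_format="wav",     # WAV must be requested explicitly; the default is mp3
        )
        path = f"data/tts/{keyword}_{voice}.wav"
        with open(path, "wb") as f:
            f.write(response.read())   # raw WAV bytes from the API response
        rows.append({"filename": path, "transcript": keyword, "keyword": keyword, "voice": voice})

with open("data/tts/manifest.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["filename", "transcript", "keyword", "voice"])
    writer.writeheader()
    writer.writerows(rows)
```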
Phase 5: CTC Decoding Fundamentals
What you build: A PyTorch BiGRU acoustic model that outputs per-frame character log-probabilities, and two decoding functions: greedy decoding (collapse repeated characters, remove blank tokens) and an introduction to beam search logic. You also write a CTC manifest loader that reads transcript CSV records and builds character vocabularies.
Key technical decisions:
How to define the CTC vocabulary (characters + blank token index)
Why the CTC blank token is distinct from the space character
How to interpret model outputs when the model is untrained vs. trained
How to connect the CTC model to a real audio dataset for future training
Understanding why an untrained CTC model produces garbage output and what a trained one looks like is covered in detail in the full course with working, tested code.
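A minimal sketch of greedy CTC decoding — the vocabulary, blank index, and tensor shapes here are illustrative assumptions:

```python
import torch

VOCAB = ["<blank>", " ", "a", "b", "c"]    # hypothetical character set; the blank token lives at index 0
BLANK = 0                                  # define the blank index once and share it everywhere

def ctc_greedy_decode(log_probs):
    """Greedy CTC decode: argmax per frame, collapse repeats, then drop blank tokens."""
    best_path = log_probs.argmax(dim=-1).tolist()   # (T,) best class index per frame
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != BLANK:
            decoded.append(VOCAB[idx])
        prev = idx
    return "".join(decoded)

# Per-frame character log-probabilities, e.g. from the BiGRU: (T=50 frames, len(VOCAB) classes)
log_probs = torch.randn(50, len(VOCAB)).log_softmax(dim=-1)
print(ctc_greedy_decode(log_probs))        # an untrained model produces short, garbled strings
```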
Common Challenges
1. WAV sample width inconsistency
Problem: WAV files store samples as 8-bit unsigned, 16-bit signed, or 32-bit signed integers. Reading them with wave.readframes() gives raw bytes; converting those bytes to floats requires different struct.unpack format codes for each width. Root cause: The wave module does not normalise output; it returns raw bytes and leaves format interpretation to the caller. Fix: Branch on wav.getsampwidth() and use np.frombuffer() with the matching dtype (uint8, int16, int32), then scale to [-1.0, 1.0] separately for signed vs. unsigned types.
2. FFT size mismatch
Problem: The FFT implementation raises an error or returns unexpected shapes when the frame length is not a power of two. Root cause: The recursive Cooley-Tukey FFT algorithm requires power-of-two input lengths to split the signal cleanly. Fix: Zero-pad each frame to the next power of two before the FFT, or set the frame length to exactly 512 (32 ms at 16 kHz), which is a natural power of two.
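A one-function sketch of the zero-padding fix (the helper name is hypothetical):

```python
import numpy as np

def pad_to_pow2(frame):
    """Zero-pad a frame so its length is the next power of two, as the radix-2 FFT requires."""
    n = 1 << (len(frame) - 1).bit_length()    # next power of two >= len(frame)
    return np.pad(frame, (0, n - len(frame)))

padded = pad_to_pow2(np.zeros(400))           # a 25 ms frame at 16 kHz becomes 512 samples
```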
3. Inconsistent MFCC tensor shapes
Problem: The CNN raises a shape error at the first Conv1d layer because different audio clips produce different numbers of frames. Root cause: Clips of different lengths produce MFCC matrices with different frame counts along the time axis. PyTorch batching requires identical shapes. Fix: Choose a fixed maximum number of frames (a 1-second clip at a 10 ms hop yields roughly 100 frames), pad shorter clips with zeros along the time axis, and truncate longer clips. Apply this consistently in both training and inference, as in the sketch below.
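A sketch of the pad-or-truncate step, applied identically at training and inference time (the frame budget is an illustrative choice):

```python
import numpy as np

def fix_frame_count(mfcc, max_frames=100):
    """Pad with zeros or truncate along the time axis so every clip yields the same tensor shape."""
    if mfcc.shape[0] < max_frames:
        return np.pad(mfcc, ((0, max_frames - mfcc.shape[0]), (0, 0)))
    return mfcc[:max_frames]
```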
4. Overfitting on synthetic or tiny datasets
Problem: Training accuracy reaches 98% but validation accuracy stalls at 60%. Root cause: A few dozen WAV clips per class is far too few for a CNN to generalise. The model memorises training samples rather than learning features. Fix: Add Dropout (0.3–0.5) after each convolutional block, use data augmentation (add Gaussian noise, time-shift clips by ±10%), generate more samples via OpenAI TTS with multiple voices, and monitor the validation loss curve to detect overfitting early.
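A small sketch of the two augmentations mentioned above — the noise level and shift range are illustrative defaults:

```python
import numpy as np

def augment(signal, noise_std=0.005, max_shift_frac=0.1):
    """Additive Gaussian noise plus a random time shift of up to +/-10% of the clip length."""
    max_shift = int(max_shift_frac * len(signal))
    shift = np.random.randint(-max_shift, max_shift + 1)
    shifted = np.roll(signal, shift)                       # circular shift; zero-fill is also common
    return (shifted + noise_std * np.random.randn(len(signal))).astype(np.float32)
```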
5. OpenAI TTS returning MP3 instead of WAV
Problem: The TTS API returns MP3 audio by default, and the WAV reader fails to parse it. Root cause: The default response_format in the OpenAI audio API is mp3; WAV must be requested explicitly. Fix: Set response_format="wav" in the API call parameters, and confirm the response Content-Type header is audio/wav before writing to disk.
6. CTC blank token confusion
Problem: The greedy decoder produces output filled with the blank character or collapses all repeated letters into single letters incorrectly. Root cause: The CTC blank token occupies a specific index in the vocabulary (usually index 0 or the last index). If the decoder uses the wrong index, it misidentifies real characters as blanks or vice versa. Fix: Define the blank index once in the vocabulary configuration and import it consistently in both the model definition and the decoder. Never hard-code the blank index as a magic number.
7. API key exposure in dataset builder scripts
Problem: The OpenAI API key gets committed to source control inside the dataset builder script. Root cause: Developers frequently hard-code keys during rapid prototyping. Fix: Load the key from an environment variable (os.environ["OPENAI_API_KEY"]) or a .env file that is listed in .gitignore. Never hard-code API keys, even in local projects.
Solving these issues took us 18 hours of testing — the course walks you through each fix with working code.
Ready to Build This Yourself?
Understanding an architecture is not the same as shipping working code. There is a significant gap between knowing what MFCCs are and actually debugging a Conv1d shape mismatch at 11 PM. The course closes that gap.
Here is everything included in the full course:
✅ Full source code for every module — WAV IO, FFT, MFCCs, CNN, CTC decoder, dataset builder
✅ 5 modules and 20 structured lessons that build on each other sequentially
✅ Local Python setup guide tested on macOS, Ubuntu, and Windows
✅ Tested environment configuration with pinned dependency versions
✅ OpenAI TTS dataset generation walkthrough with multi-voice CSV manifests
✅ CNN keyword model training workflow with checkpointing and validation curves
✅ CTC model architecture and greedy decoding walkthrough
✅ Project packaging guide — reproducible folder structure, README, and extension roadmap
✅ Lifetime access — take it at your own pace, revisit whenever you need
✅ Course updates as the stack evolves
✅ Community support via the Codersarts learner forum
$29.99. Everything above.
Want someone to sit with you, help you set up your environment, debug your local issues, train your first model, and plan your next extension? Book a 1:1 Guided Session with the Codersarts team — $99.99 and ship your first keyword spotter in a single session.
Conclusion
Speech recognition from scratch with Python and PyTorch is a five-layer problem: load the audio, extract MFCC features, train a compact CNN or BiGRU model on labelled data, decode the model output back to words, and package the whole pipeline for repeatability. None of those layers requires Whisper or a cloud API — they require NumPy, PyTorch, and a clear understanding of the maths.
If you are starting today, begin with the simplest viable stack: Python's wave module, NumPy for feature extraction, and a three-class CNN trained on a handful of WAV clips. Get that working first. The CTC path, the OpenAI TTS generator, and the production deployment layers are all extensions of the same core pipeline.


