How to Build a Diffusion Model from Scratch in PyTorch (DDPM + DDIM + Classifier-Free Guidance)
- 21 minutes ago
- 13 min read

1. Introduction: You Call from_pretrained() But Do You Know What Runs Inside?
You have seen the demos. Stable Diffusion turns a text prompt into a photorealistic image in seconds. DALL-E 3 generates anything you describe. Midjourney produces artwork that wins competitions. Diffusion models sit at the core of this generation of image generators, and yet most developers who use them have never looked past the API call.
If you have ever typed diffusers.StableDiffusionPipeline.from_pretrained("...") and moved on, you are not alone. But that abstraction hides a genuinely elegant piece of mathematics and engineering: a neural network that learns to reverse a Gaussian noise process, one tiny step at a time, until a crisp image emerges from static.
This post teaches you how to build a diffusion model from scratch in PyTorch — noise schedules, sinusoidal time embeddings, a U-Net with residual blocks and attention, the ε-prediction loss, DDPM and DDIM samplers, and classifier-free guidance — so you understand every parameter you are tuning.
Real-world applications of this exact implementation include:
Building a domain-specific image generator for logos, character art, scientific images, or medical scans
Implementing the latent denoiser core at the heart of a Stable Diffusion clone
Generating synthetic training data to augment downstream classifiers
Conducting ablation studies on noise schedules, U-Net depth, and guidance scale
Teaching diffusion models in workshops, university courses, or research publications
Understanding the internals before contributing to open-source generative AI libraries
This post covers the architecture, design decisions, and implementation phases for a full DDPM pipeline. It does not include the complete production-ready source code — that lives in the full course at labs.codersarts.com.
📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]
2. How It Works: Core Concept
The Big Idea in Plain English
Imagine you have a photograph. You sprinkle a tiny amount of Gaussian noise on it — barely perceptible. You repeat this 1,000 times. By step 1,000 the image is indistinguishable from pure white noise. That is the forward process, and it is mathematically trivial: each step just adds a small amount of noise according to a learned or fixed schedule of variances (β₁, β₂, …, β_T).
The interesting part is training a neural network to invert that process: given a noisy image at timestep t, predict what noise was added, so you can subtract it. If the network can do that reliably at every t, you can start from pure noise and denoise all the way back to a clean image.
Why the Naive Approaches Fail
Three things that seem obvious turn out to be wrong in practice:
Predicting x₀ directly tends to work worse. When the network outputs the clean image, the target is hardest to estimate at large t, where the input is nearly all noise, and Ho et al. (2020) reported worse sample quality with this parameterisation in their early experiments. DDPM instead predicts the noise ε that was added, which gives a simple MSE loss with well-behaved gradients.
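You do not need the full variational lower bound. The bound is tractable, since every term is a KL divergence between Gaussians, but the derivation is long and the per-timestep weights it produces are awkward to work with. Ho et al. (2020) showed that dropping those weights leaves a single MSE term, L = E[||ε − ε_θ(x_t, t)||²], which is simpler to implement and gave better samples in their experiments.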
Batch normalization hurts training. Samples in a single batch sit at very different noise levels across the 1,000 timesteps. Batch norm normalizes with statistics computed across the batch, so heavily noised samples distort the normalization applied to lightly noised ones. Group normalization computes its statistics within each sample independently, which sidesteps the problem.
How DDPM Solves It
A U-Net takes the noisy image and the current timestep t as inputs and predicts the noise ε; the model is trained with L_simple = MSE(ε_true, ε_pred). At inference, 1,000 reverse steps each subtract the predicted noise and add back a small Gaussian residual. DDIM replaces the stochastic reverse steps with a deterministic update (η = 0) that can skip timesteps, cutting the required steps from 1,000 to 50 or fewer.
ASCII Pipeline Diagram
══════════════════════════════════════════════════════════════════════
TRAINING PHASE
══════════════════════════════════════════════════════════════════════
[Clean image x₀]
│
▼
Sample t ~ Uniform(1, T)
Sample ε ~ N(0, I)
│
▼
Forward process (closed-form):
x_t = √ᾱ_t · x₀ + √(1−ᾱ_t) · ε
│
▼
U-Net with time conditioning:
ε_θ(x_t, t) ← sinusoidal embedding(t) → FiLM inject at each ResBlock
│
▼
Loss: MSE(ε, ε_θ(x_t, t))
│
▼
Backprop + AdamW + EMA on weights
══════════════════════════════════════════════════════════════════════
RUNTIME / INFERENCE PHASE (DDPM)
══════════════════════════════════════════════════════════════════════
[Pure Gaussian noise x_T]
│
▼ (repeat for t = T … 1)
Predict noise: ε_θ(x_t, t)
Optionally scale with CFG:
ε̃ = ε_uncond + w · (ε_cond − ε_uncond)
│
▼
Remove noise (DDPM step or DDIM step)
│
▼
[Generated image x₀]
══════════════════════════════════════════════════════════════════════
An Analogy
Think of the forward process as slowly grinding a marble sculpture into dust over 1,000 steps, recording how much dust you added each time. Diffusion training teaches the network to reverse that process: given the half-destroyed sculpture at any stage, figure out which dust to remove. Once the network is good enough, you hand it a pile of random dust and it sculpts something new, removing imagined dust from randomness until a coherent shape emerges.
3. System Architecture Deep Dive
Architecture Overview
A complete DDPM implementation has nine distinct layers, each with a clearly defined job.
Dataset layer — Loads raw images, applies transforms (resize, normalize to [-1, 1]), and builds a DataLoader. The normalization range matters: the forward process and U-Net output are both designed around zero-mean data.
Noise-schedule layer — Precomputes the β sequence and all derived quantities: αₜ = 1 − βₜ, ᾱₜ = ∏αᵢ, and the closed-form noising coefficients. These are registered as buffers on the model so they move to GPU automatically.
U-Net architecture — The core neural network. An encoder (down-sampling path) and decoder (up-sampling path) connected by skip connections, with sinusoidal time embeddings injected at every residual block via FiLM-style scale and shift. Self-attention layers appear at low spatial resolutions (8×8 and 16×16 for 32×32 images) to model long-range dependencies without exhausting GPU memory.
Training loop with EMA — Standard AdamW optimizer plus an exponential moving average copy of the model weights. The EMA model is never updated by gradients — only by a weighted average of the training weights after each step. All sample generation uses the EMA model; sampling from raw training weights produces noticeably inferior results.
DDPM sampler — The iterative reverse process. At each step t, compute the predicted mean using ε_θ and the noise schedule, then add a small Gaussian residual (σₜ · z) to stay stochastic. Runs T=1,000 steps.
DDIM sampler — A non-Markovian alternative to DDPM sampling. Sets η=0 for fully deterministic inference and sub-samples the timestep sequence (e.g., every 20th step), reducing sampling to 50 network forward passes with comparable quality.
Classifier-free guidance layer — Trains a single model that handles both conditional (label provided) and unconditional (label dropped to a null token, ~10–20% of training steps) generation. At inference, runs two forward passes and interpolates: ε̃ = ε_uncond + w · (ε_cond − ε_uncond) where w is the guidance scale.
Evaluation layer — Generates a batch of 10,000 samples and computes Fréchet Inception Distance (FID) and Inception Score (IS) against the real dataset statistics. Lower FID is better; higher IS is better.
API deployment layer — A FastAPI endpoint accepts a JSON payload (class label, guidance scale, sampler type, step count, random seed) and returns a sampled image as a base64-encoded PNG.
Component Table
Component | Role | Technology Options |
Dataset | Image loading, augmentation, normalization | torchvision.datasets, custom ImageFolder, Pillow |
Noise schedule | β sequence, closed-form noising | NumPy (linear, cosine, learned schedules) |
Time embedding | Encode integer t as a vector | Sinusoidal (fixed), learned MLP projection |
U-Net encoder | Down-sample spatial dims, increase channels | ResBlock + AvgPool, strided Conv2d |
U-Net decoder | Up-sample, fuse with skip connections | ResBlock + Upsample + Conv, ConvTranspose2d |
Attention | Long-range feature interaction | Multi-head self-attention (einops), linear attention |
EMA | Stable sampling weights | Manual EMA loop, torch_ema library |
Sampler | Reverse denoising at inference | DDPM stochastic, DDIM deterministic |
API server | Expose sampling endpoint | FastAPI + Uvicorn |
Data Flow Walkthrough
Training path:
Load a mini-batch of clean images x₀ ∈ ℝ^(B×C×H×W), normalized to [-1, 1].
Sample a random timestep t for each image in the batch.
Sample Gaussian noise ε of the same shape as x₀.
Apply the closed-form forward process to compute xₜ in one shot (no loop needed).
Pass xₜ and t to the U-Net; receive predicted noise ε_θ.
Compute L = MSE(ε, ε_θ), backpropagate, step AdamW.
Update EMA weights.
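The training path above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the course implementation: `TinyEps` is a hypothetical stand-in for the U-Net, timesteps are 0-indexed, and the batch size, image size, learning rate, and EMA decay are illustrative values.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear schedule (DDPM defaults)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product of alphas

class TinyEps(nn.Module):
    """Hypothetical stand-in for the U-Net: predicts noise from (x_t, t)."""
    def __init__(self, channels=3, hidden=16):
        super().__init__()
        self.t_proj = nn.Linear(1, hidden)
        self.conv_in = nn.Conv2d(channels, hidden, 3, padding=1)
        self.conv_out = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, x, t):
        # Crude time conditioning: scale t to [0, 1) and add it per channel.
        temb = self.t_proj(t.float().view(-1, 1) / T)
        h = F.silu(self.conv_in(x) + temb[:, :, None, None])
        return self.conv_out(h)

def q_sample(x0, t, noise):
    """Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    """EMA weights are a running average of the training weights, never trained directly."""
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

def train_step(model, ema_model, optimizer, x0):
    t = torch.randint(0, T, (x0.shape[0],))       # one timestep per image
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)                  # no loop over timesteps needed
    loss = F.mse_loss(model(x_t, t), noise)       # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(ema_model, model)
    return loss.item()

torch.manual_seed(0)
model = TinyEps()
ema_model = copy.deepcopy(model).requires_grad_(False)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
x0 = torch.rand(8, 3, 32, 32) * 2 - 1             # fake batch already in [-1, 1]
loss = train_step(model, ema_model, optimizer, x0)
```

Keeping the EMA copy as a separate module with gradients disabled makes it hard to train it by accident, and it can be passed to a sampler unchanged.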
Inference path:
Sample pure Gaussian noise x_T ∈ ℝ^(1×C×H×W).
For t from T down to 1: run the EMA U-Net forward to get ε_θ(xₜ, t); apply DDPM or DDIM update to get xₜ₋₁.
Optionally mix conditional and unconditional predictions with guidance scale w.
After the final step, clip x₀ to [-1, 1] and rescale to [0, 255].
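The inference path can be sketched as the loop below. It is a sketch under assumptions: it uses σₜ² = βₜ (one of the two variance choices discussed by Ho et al.), 0-indexed timesteps, and a hypothetical `dummy_eps_model` placeholder where a trained EMA U-Net would go, so the output here is not a meaningful image.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def ddpm_sample(eps_model, shape):
    """Ancestral sampling: start from noise and apply the reverse step T times."""
    x = torch.randn(shape)                                    # x_T ~ N(0, I)
    for t in reversed(range(T)):                              # indices T-1 ... 0
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = eps_model(x, t_batch)
        # Posterior mean: (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(alpha_t)
        mean = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # sigma_t^2 = beta_t
        else:
            x = mean                                          # no noise on the final step
    return x

def to_uint8(x):
    """Clip to [-1, 1], then rescale to [0, 255] for saving as an image."""
    return ((x.clamp(-1, 1) + 1) * 127.5).round().to(torch.uint8)

# Hypothetical placeholder network so the loop runs; a trained EMA U-Net goes here.
def dummy_eps_model(x, t):
    return torch.zeros_like(x)

torch.manual_seed(0)
images = to_uint8(ddpm_sample(dummy_eps_model, (2, 3, 8, 8)))
```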
Non-Obvious Design Decisions
Why group normalization, not batch normalization? In diffusion training, a single batch contains images noised at wildly different timesteps: some barely noisy, some pure static. Batch norm's per-batch mean and variance (and the running averages it uses at evaluation time) are skewed by this spread in noise scale. Group norm operates per-sample and is immune to this issue. Using batch norm here will not cause a crash; it will just silently produce worse generations with no obvious error message.
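A small experiment makes the difference concrete. The sketch below is illustrative only: the "clean" and "noisy" tensors are random stand-ins for feature maps at low and high noise levels, and the channel and group counts are arbitrary assumptions. It measures how much the normalized output for one sample changes when its batch neighbour changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
clean = torch.randn(1, 32, 8, 8) * 0.1         # stand-in for a lightly noised feature map
noisy = torch.randn(1, 32, 8, 8) * 5.0         # stand-in for a heavily noised one

gn = nn.GroupNorm(num_groups=8, num_channels=32)
bn = nn.BatchNorm2d(32)                        # training mode: uses batch statistics

alone = torch.cat([clean, clean], dim=0)       # clean sample next to another clean one
mixed = torch.cat([clean, noisy], dim=0)       # same clean sample next to a noisy one

# Compare the output for sample 0 under the two batch compositions.
gn_shift = (gn(alone)[0] - gn(mixed)[0]).abs().max().item()
bn_shift = (bn(alone)[0] - bn(mixed)[0]).abs().max().item()
print(f"GroupNorm change in sample 0: {gn_shift:.6f}")
print(f"BatchNorm change in sample 0: {bn_shift:.6f}")
```

The GroupNorm output for the clean sample does not depend on its neighbour, while the BatchNorm output does, which is the failure mode described above.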
Why restrict self-attention to low resolutions? Self-attention over an N-position feature map builds an N×N attention matrix, so memory grows as O(N²). At 64×64 spatial resolution N = 4,096, and the attention matrix alone has about 16.8 million entries per head per sample (64 MiB in fp32); multiply by heads, batch size, and the activations kept for the backward pass and the cost reaches gigabytes. Running attention only at 8×8 and 16×16 still captures long-range semantic structure at a fraction of the cost. Adding attention at 32×32 is a common cause of GPU OOM errors during training.
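The quadratic growth is easy to check with arithmetic. The helper below is a hypothetical back-of-the-envelope calculator: it counts only the N×N attention weights of a naive implementation that materialises the full matrix, and ignores queries, keys, values, and anything saved for the backward pass.

```python
def attention_matrix_mib(height, width, heads=1, batch=1, bytes_per_el=4):
    """Memory for the (N x N) attention weights only, in MiB, with N = height * width."""
    n = height * width
    return batch * heads * n * n * bytes_per_el / 2**20

for side in (8, 16, 32, 64):
    mib = attention_matrix_mib(side, side)
    print(f"{side}x{side}: {mib:.2f} MiB per head per sample (fp32)")
```

Each doubling of the spatial side multiplies the attention matrix by 16, which is why moving attention one resolution level up is so expensive.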
4. Tech Stack Recommendation
Stack A - Beginner / Prototype (buildable in a weekend)
This stack minimises friction. Everything runs on a single GPU or even a laptop CPU for small datasets like MNIST.
Layer | Technology | Why |
Language | Python 3.10+ | Native PyTorch ecosystem |
Framework | PyTorch 2.x | Autograd, GPU acceleration |
Datasets | torchvision (MNIST, CIFAR-10) | Zero dataset prep |
Tensor ops | einops | Readable attention reshaping |
Visualisation | Matplotlib | Quick denoising trajectory plots |
Progress | tqdm | Training loop feedback |
API (optional) | FastAPI + Uvicorn | Lightweight sampler endpoint |
Estimated cost: $0/month on a personal GPU, or ~$1–5/month on Google Colab Pro for occasional training runs on CIFAR-10.
Stack B - Production-Ready (designed to scale)
This stack is designed for training on custom datasets at 128×128 or larger, with reproducible experiments, proper logging, and a deployable inference service.
Layer | Technology | Why |
Language | Python 3.11 | Faster interpreter than 3.10 |
Framework | PyTorch 2.x + torch.compile | ~20% training speed-up on A100 |
Datasets | Custom ImageFolder + Pillow | Domain-specific image sets |
Mixed precision | torch.cuda.amp | Reduces activation memory, speeds training |
Experiment tracking | Weights & Biases | Loss curves, sample grids per epoch |
Distributed training | torchrun + DDP | Multi-GPU scale-out |
API | FastAPI + Uvicorn + Docker | Containerised, reproducible deployment |
Monitoring | Prometheus + Grafana | Latency and throughput metrics |
Cloud compute | AWS p3.2xlarge or Lambda Labs A10 | Cost-effective training |
Estimated cost: ~$50–150/month for periodic training runs on a cloud A10 GPU, plus ~$10–30/month for a small always-on inference instance.
5. Implementation Phases
Building a DDPM from scratch has a natural progression. Rushing ahead to the U-Net before the noise schedule is correct is the most common reason training diverges.
Phase 1: Forward Process and Beta Schedules
What you are building: The mathematical foundation. You implement the β schedule (linear and cosine variants), precompute ᾱₜ for all T timesteps, and write the closed-form q(x_t | x_0) function that adds exactly the right amount of noise in a single shot without stepping through t iterations.
Key technical decisions:
β_start and β_end for the linear schedule (0.0001 and 0.02 are the DDPM paper defaults for T = 1,000; they may need adjustment if you change T or move to much higher resolutions).
Whether to use the cosine schedule (often better for low-resolution images because it destroys information more gradually than the linear schedule) or a learned schedule.
How to register α, ᾱ, and related quantities as PyTorch buffers so they transfer to GPU automatically.
Verifying that x_T is visually indistinguishable from N(0, I) — visualise the forward trajectory before writing any model code.
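The decisions above can be sketched as follows. This is a minimal sketch, assuming the linear defaults quoted above and a cosine schedule in the style of Nichol and Dhariwal (2021) with offset s = 0.008; the `NoiseSchedule` class and its buffer names are hypothetical.

```python
import math
import torch
import torch.nn as nn

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    return torch.linspace(beta_start, beta_end, T, dtype=torch.float64)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve, clipped for stability."""
    steps = torch.arange(T + 1, dtype=torch.float64) / T
    f = torch.cos((steps + s) / (1 + s) * math.pi / 2) ** 2
    alpha_bar = f / f[0]
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]
    return betas.clamp(max=max_beta)

class NoiseSchedule(nn.Module):
    def __init__(self, betas):
        super().__init__()
        alpha_bar = torch.cumprod(1.0 - betas, dim=0)
        # Buffers follow the module through .to(device) and are saved in state_dict.
        self.register_buffer("betas", betas.float())
        self.register_buffer("sqrt_alpha_bar", alpha_bar.sqrt().float())
        self.register_buffer("sqrt_one_minus_alpha_bar", (1 - alpha_bar).sqrt().float())

    def q_sample(self, x0, t, noise):
        """Closed-form q(x_t | x_0) for a batch of 0-indexed timesteps t."""
        a = self.sqrt_alpha_bar[t].view(-1, 1, 1, 1)
        b = self.sqrt_one_minus_alpha_bar[t].view(-1, 1, 1, 1)
        return a * x0 + b * noise

schedule = NoiseSchedule(linear_beta_schedule(1000))
alpha_bar_T = schedule.sqrt_alpha_bar[-1].item() ** 2
print(f"alpha_bar at t=T (linear): {alpha_bar_T:.2e}")   # should be close to zero
```

Printing ᾱ at the final timestep is a cheap check that x_T is close to pure noise before any model code exists.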
Phase 2: Sinusoidal Time Embeddings and Residual Blocks
What you are building: The mechanism by which the U-Net knows which timestep t it is denoising. You implement the sinusoidal embedding (identical in structure to transformer positional encodings), project it through a small MLP, and inject it into every residual block via FiLM conditioning (scale + shift applied after the first GroupNorm).
Key technical decisions:
Embedding dimensionality (128 or 256 for small models; 512 for CIFAR-scale).
The sinusoidal frequency scaling: the base constant (10,000 in transformer positional encodings) sets the range of frequencies, and a wrong value silently degrades training with no obvious error.
Whether to use FiLM (scale + shift), AdaGN (scale + shift inside GroupNorm), or simple addition for time injection.
Number of groups for GroupNorm (32 is standard; must evenly divide your channel count).
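A sketch of both pieces is shown below. It assumes an even embedding dimension, the base of 10,000 from transformer positional encodings, and a hypothetical `FiLMResBlock` that applies scale and shift after the first GroupNorm; the small MLP that would normally project the embedding is left out for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_embedding(t, dim, max_period=10000.0):
    """Map integer timesteps (B,) to (B, dim) vectors of sines and cosines."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([args.sin(), args.cos()], dim=-1)

class FiLMResBlock(nn.Module):
    """Residual block whose first GroupNorm output is modulated by scale and shift from t."""
    def __init__(self, in_ch, out_ch, temb_dim, groups=32):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, in_ch)      # groups must divide in_ch
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.norm2 = nn.GroupNorm(groups, out_ch)     # groups must divide out_ch
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.film = nn.Linear(temb_dim, 2 * in_ch)    # produces (scale, shift)
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x, temb):
        scale, shift = self.film(F.silu(temb)).chunk(2, dim=1)
        h = self.norm1(x) * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        h = self.conv1(F.silu(h))
        h = self.conv2(F.silu(self.norm2(h)))
        return h + self.skip(x)

t = torch.tensor([0, 500, 999])
temb = sinusoidal_embedding(t, dim=128)
block = FiLMResBlock(in_ch=64, out_ch=128, temb_dim=128)
out = block(torch.randn(3, 64, 16, 16), temb)
```

Writing the modulation as `1 + scale` means a freshly initialised block starts close to a plain GroupNorm, which is one common way to keep early training stable.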
Phase 3: Full U-Net Architecture
What you are building: The U-Net itself — encoder path (down-blocks with strided convolution or average pooling), bottleneck (two residual blocks + attention), decoder path (up-blocks with bilinear upsampling + skip connections), and self-attention at the lowest two spatial resolutions.
Key technical decisions:
Channel multiplier sequence (e.g., [1, 2, 4, 8] for base_channels=128 gives [128, 256, 512, 1024]).
Attention at which resolutions — 8×8 and 16×16 for a 32×32 model is standard; adding 32×32 roughly doubles GPU memory.
Residual block depth (2 per level is typical; 3 slows training without proportional quality gain at CIFAR scale).
How to handle the skip connections: direct concatenation (adds channels) vs additive (preserves channels).
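Two of these decisions are easy to see in code: what an attention block over a feature map looks like, and how the skip-connection choice changes channel counts. The sketch below is an assumption-laden illustration: `SelfAttention2d` is a hypothetical single-head block that materialises the full attention matrix, not a tuned multi-head implementation.

```python
import torch
import torch.nn as nn

class SelfAttention2d(nn.Module):
    """Single-head self-attention over the H*W positions of a feature map."""
    def __init__(self, channels, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.qkv = nn.Conv2d(channels, 3 * channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.qkv(self.norm(x)).reshape(b, 3, c, h * w).unbind(dim=1)
        # attn[b, i, j] is the weight query position i places on key position j.
        attn = torch.softmax(q.transpose(1, 2) @ k / c ** 0.5, dim=-1)   # (B, N, N)
        out = (v @ attn.transpose(1, 2)).reshape(b, c, h, w)
        return x + self.proj(out)                                        # residual

base_channels, mults = 128, [1, 2, 4, 8]
widths = [base_channels * m for m in mults]               # [128, 256, 512, 1024]

torch.manual_seed(0)
attn = SelfAttention2d(widths[1])
h = attn(torch.randn(2, widths[1], 16, 16))               # a 16x16 level of a 32x32 model

# Concatenating an encoder skip doubles the channels the decoder block must accept;
# adding it keeps the channel count unchanged.
skip = torch.randn_like(h)
concat_in = torch.cat([h, skip], dim=1)
add_in = h + skip
```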
Phase 4: Training Loop, EMA, and DDPM/DDIM Sampling
What you are building: The full end-to-end training loop — batched noise sampling, loss computation, optimiser step — plus the EMA weight tracker and both samplers. This is where you see the model actually generate images for the first time.
Key technical decisions:
EMA decay rate (0.9999 for long training runs; 0.999 for fast prototyping on MNIST).
AdamW learning rate (1e-4 or 2e-4 with cosine annealing is standard).
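DDIM: how to correctly implement the η=0 update. With η=0 the noise term vanishes (σₜ = 0), so errors usually sit in the coefficients on the predicted x₀ and predicted noise, or in using t−1 instead of the previous timestep of the sub-sampled sequence; either produces blurry or degraded samples.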
Sampling frequency during training (generating a sample grid every 1,000 steps is a good signal of whether training is progressing).
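The η=0 DDIM update can be sketched as below. The assumptions are a linear schedule, 0-indexed timesteps, an evenly strided sub-sequence (T divisible by the number of steps), and a hypothetical `oracle_eps` that knows the clean image and exists only as a sanity check: with a perfect noise predictor, the sampler should return that image.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def ddim_timesteps(num_steps, T=T):
    """Evenly spaced sub-sequence of timesteps, returned in descending order."""
    stride = T // num_steps
    return list(range(0, T, stride))[::-1]

@torch.no_grad()
def ddim_sample(eps_model, shape, num_steps=50):
    x = torch.randn(shape)
    steps = ddim_timesteps(num_steps)
    for i, t in enumerate(steps):
        ab_t = alpha_bar[t]
        # The "previous" alpha_bar comes from the sub-sequence, not from t - 1.
        ab_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else torch.tensor(1.0)
        eps = eps_model(x, torch.full((shape[0],), t, dtype=torch.long))
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        # eta = 0: sigma_t = 0, so no fresh noise is added.
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x

# Sanity check with a hypothetical oracle that knows the clean image.
torch.manual_seed(0)
target = torch.rand(2, 3, 8, 8) * 2 - 1

def oracle_eps(x, t):
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    return (x - ab.sqrt() * target) / (1 - ab).sqrt()

sample = ddim_sample(oracle_eps, target.shape, num_steps=50)
```

An oracle test like this catches coefficient and indexing mistakes before any training time is spent.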
Phase 5: Classifier-Free Guidance and Inference API
What you are building: Class-conditional generation via label embeddings added to the time embedding, classifier-free guidance with configurable guidance scale w, FID/IS evaluation, and a FastAPI endpoint that exposes sampling as an HTTP service.
Key technical decisions:
Label dropout rate during training (10–20% unconditional training steps is the standard range).
Guidance scale w — values between 1.5 and 7.5 are typical; higher is sharper but less diverse.
Whether to implement unconditional generation with a null label embedding or by zeroing the label.
FastAPI response format: returning the image as a base64-encoded PNG in JSON vs streaming raw bytes.
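Label dropout and the guided prediction can be sketched as follows. This sketch assumes the null-token approach (one extra embedding row), and `ToyCondEps` is a hypothetical stand-in that adds the label embedding to the feature map rather than to a time embedding.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 10
NULL_LABEL = NUM_CLASSES          # extra embedding row reserved for "no label"

def drop_labels(labels, p_uncond=0.1):
    """Replace each label with the null token with probability p_uncond (training only)."""
    mask = torch.rand(labels.shape[0]) < p_uncond
    return torch.where(mask, torch.full_like(labels, NULL_LABEL), labels)

@torch.no_grad()
def guided_eps(eps_model, x, t, labels, w):
    """eps_uncond + w * (eps_cond - eps_uncond); w = 1 recovers the conditional prediction."""
    eps_cond = eps_model(x, t, labels)
    eps_uncond = eps_model(x, t, torch.full_like(labels, NULL_LABEL))
    return eps_uncond + w * (eps_cond - eps_uncond)

class ToyCondEps(nn.Module):
    """Hypothetical stand-in for a class-conditional U-Net."""
    def __init__(self, channels=3):
        super().__init__()
        self.label_emb = nn.Embedding(NUM_CLASSES + 1, channels)   # +1 for the null token
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, t, labels):
        return self.conv(x) + self.label_emb(labels)[:, :, None, None]

torch.manual_seed(0)
model = ToyCondEps()
x = torch.randn(4, 3, 8, 8)
t = torch.full((4,), 500, dtype=torch.long)
labels = torch.tensor([0, 3, 7, 9])
eps_w1 = guided_eps(model, x, t, labels, w=1.0)
eps_w3 = guided_eps(model, x, t, labels, w=3.0)
train_labels = drop_labels(torch.randint(0, NUM_CLASSES, (1000,)), p_uncond=0.1)
```

With this formula, w = 0 gives the unconditional prediction and values above 1 push the sample further toward the class, which is where the sharpness and diversity trade-off comes from.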
6. Common Challenges
When training a diffusion model from scratch, these are the non-obvious issues that consume the most debugging time.
1. Training loss stops decreasing after a few hundred steps
Root cause: β_end is too small, so the model never sees heavily noised images and never learns to denoise from noise-dominated inputs.
Fix: Visualise your ᾱₜ curve. At t=T, ᾱ_T should be close to zero (0.001 or lower). If it is still 0.1, increase β_end.
2. Samples are grey blobs even with low loss
Root cause: Sampling from the raw training weights instead of the EMA copy. The training weights oscillate; EMA smooths this.
Fix: Always generate samples with the EMA copy of the model (in eval mode), not the raw training model.
3. NaN loss after a few thousand steps
Root cause: Mixed precision (autocast) combined with large gradient norms.
Fix: Add GradScaler, call scaler.unscale_(optimizer) before clipping, and clip gradients to norm 1.0 (torch.nn.utils.clip_grad_norm_).
4. DDIM samples are noticeably blurrier than DDPM samples
Root cause: The DDIM update is implemented incorrectly. At η=0 the noise term is zero (σₜ = 0), so the usual culprits are the coefficients on the predicted x₀ and predicted noise, or indexing ᾱ at t−1 rather than at the previous timestep of the sub-sampled sequence.
Fix: Derive the DDIM update from Eq. 12 in Song et al. (2020) step by step and verify each coefficient, including the σₜ computation, against the paper.
5. GPU OOM error when adding self-attention
Root cause: Attention was added at too high a spatial resolution (e.g., 32×32 with 512 channels).
Fix: Move attention to 16×16 and below only, or reduce the number of attention heads.
6. Classifier-free guidance produces mode collapse (all samples look the same)
Root cause: Guidance scale w is too high (above 10), collapsing the distribution to the mode of the conditional.
Fix: Start at w=1.5 and increase slowly. Also verify that unconditional dropout is at 10–20%, not 0%.
7. FID is poor despite visually reasonable samples
Root cause: Too few training steps, or generating samples from a model checkpoint that is not fully converged. FID is sensitive to diversity, not just visual quality.
Fix: Train for at least 200,000 gradient steps on CIFAR-10. Use the EMA model. Generate ≥10,000 samples for a stable FID estimate.
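The fix for challenge 3 depends on the order of operations, so a sketch may help. It assumes the `torch.cuda.amp.GradScaler` API named in the stack table (newer PyTorch releases also expose it as `torch.amp.GradScaler`), falls back to a no-op scaler on CPU, and uses a single convolution as a hypothetical stand-in for the U-Net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"
amp_dtype = torch.float16 if use_cuda else torch.bfloat16

model = nn.Conv2d(3, 3, 3, padding=1).to(device)      # hypothetical stand-in for the U-Net
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # a no-op when CUDA is unavailable

def amp_train_step(x_t, noise):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, dtype=amp_dtype, enabled=use_cuda):
        loss = F.mse_loss(model(x_t), noise)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                         # clip true gradients, not scaled ones
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)                             # skips the step if grads contain inf/NaN
    scaler.update()
    return loss.item(), float(grad_norm)

torch.manual_seed(0)
loss, grad_norm = amp_train_step(
    torch.randn(4, 3, 8, 8, device=device), torch.randn(4, 3, 8, 8, device=device)
)
```

Clipping before `unscale_` would compare the threshold against gradients multiplied by the loss scale, so the clip would fire at the wrong magnitude.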
Solving these issues took us 16 hours of testing — the course walks you through each fix with working code.
7. Ready to Build This Yourself?
Understanding an architecture is not the same as shipping working code. The gap between "I understand DDPM conceptually" and "my model converges and generates sharp samples" is filled with subtle bugs: wrong noise schedule parameters, EMA not being applied to the right copy of the weights, DDIM update equations with sign errors, attention applied at the wrong resolution.
The "Build a Diffusion Model from Scratch in PyTorch" course at labs.codersarts.com closes that gap with fully tested, production-ready code and 20 structured lessons.
The course includes:
✅ Full source code for every component — no pseudocode
✅ 20 structured lessons covering every topic in this post, in order
✅ Linear and cosine beta schedule implementations with verification tests
✅ U-Net architecture with sinusoidal time conditioning and FiLM injection
✅ Training loop with EMA and proper gradient scaling
✅ Both DDPM (stochastic) and DDIM (deterministic, fast) samplers
✅ Classifier-free guidance with configurable w and dropout rate
✅ FID and Inception Score evaluation scripts
✅ FastAPI sampling API with Docker deployment
✅ MNIST, CIFAR-10, and custom dataset support
✅ Lifetime access to all future updates
✅ Community support and Q&A
$39.99. Everything above.
Want live guidance as you build? Book a 1:1 Guided Session at $149.99 — includes everything in the self-paced course plus two live sessions with the Codersarts team to help you pick a dataset, debug training divergence, tune your U-Net, and run your first conditional generation experiment. Book a guided session → labs.codersarts.com
8. Conclusion
A denoising diffusion probabilistic model is, at its core, a U-Net that learns to predict noise — trained on the simple insight that if you can predict what was added, you can subtract it. The architecture layers a fixed or learned noise schedule, a time-conditioned U-Net with group normalization and selective attention, an EMA weight tracker, and at inference, either a 1,000-step stochastic reverse process or a fast deterministic DDIM sampler with classifier-free guidance for sharper conditional output.
The recommended learning order is: forward process and beta schedule first, then sinusoidal time embeddings, then the U-Net block by block, then the training loop with EMA, and finally DDIM and classifier-free guidance. This order mirrors the data flow and makes each bug easier to isolate.
Most of today's widely used image generators run a variant of this pipeline. The best time to understand it from the inside out is now. Start with the full course at labs.codersarts.com →


