How to Build a CNN From Scratch in Python: Conv2D, TinyResNet, CIFAR-10, and Grad-CAM

May 15

1. Introduction: The "Black Box" Problem in Computer Vision

You open a PyTorch tutorial. Four lines of code, a pretrained ResNet, and boom: 94% CIFAR-10 accuracy. You follow along, copy the snippet, get the number. Then someone asks, "What does a convolution actually do?" and you realise you have no idea.

This is the most common frustration in beginner computer vision. Frameworks are deliberately designed to hide implementation details, and that abstraction is great for shipping products. But it is terrible for actually learning. You end up with a working model and zero transferable understanding of why it works, what can go wrong inside a convolutional layer, or how to debug a model whose predictions seem arbitrary.

Computer Vision From Scratch: Build, Train, and Explain a CNN solves this by making you build every important layer yourself — Conv2D, pooling, and batch normalisation in pure NumPy — and then connecting that understanding to a real, trainable PyTorch classifier on CIFAR-10, topped off with Grad-CAM explainability so you can see exactly what the model is looking at.

Real-world applications of these skills include:

Learning CNN internals before moving to production deep learning libraries
Building educational demos for CS students or engineering teams
Prototyping small image classifiers that run on a laptop — no GPU required
Debugging production model behaviour using saliency and activation maps
Preparing for ML engineering interviews or academic computer vision coursework
Understanding CIFAR-10-style image classification pipelines end to end

This post covers the full architecture, tech stack, implementation phases, and key design decisions. It does not include the complete source code — that lives in the full course at labs.codersarts.com.


2. How It Works: The Core Concept

Convolution is just matrix multiplication in disguise

A convolutional layer slides a small filter (say, 3×3 pixels) across an input image, computes a dot product at every position, and produces an activation map. Conceptually simple. But the naive implementation — four nested Python for loops over height, width, filter rows, and filter columns — is brutally slow for any real image size.

The standard fix is the im2col trick: you "unroll" every patch the filter would visit into the rows of one large matrix, then replace the nested loops with a single matrix multiplication. NumPy hands that multiplication to optimised BLAS routines, and the operation becomes roughly 10–50× faster than the Python loops, with the exact figure depending on input size and hardware. Deep learning frameworks rely on the same idea, alongside other convolution algorithms, behind the nn.Conv2d API.
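
To make the unrolling concrete, here is a minimal NumPy sketch of both approaches. It is an illustration under assumptions, not the course's implementation: the function names (conv2d_naive, im2col, conv2d_im2col) are hypothetical, bias is omitted, and like nn.Conv2d it computes cross-correlation (the filter is not flipped).

```python
import numpy as np


def conv2d_naive(x, w, stride=1, pad=0):
    """Loop-based reference: x is (N, C, H, W), w is (F, C, KH, KW)."""
    n, c, h, wd = x.shape
    f, _, kh, kw = w.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (wd + 2 * pad - kw) // stride + 1
    out = np.zeros((n, f, oh, ow), dtype=x.dtype)
    for i in range(oh):                      # output rows
        for j in range(ow):                  # output columns
            for u in range(kh):              # filter rows
                for v in range(kw):          # filter columns
                    pixels = xp[:, :, i * stride + u, j * stride + v]  # (N, C)
                    out[:, :, i, j] += pixels @ w[:, :, u, v].T        # (N, F)
    return out


def im2col(x, kh, kw, stride=1, pad=0):
    """Unroll every receptive field into one row of a 2-D matrix."""
    n, c, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    oh = (h + 2 * pad - kh) // stride + 1
    ow = (wd + 2 * pad - kw) // stride + 1
    cols = np.empty((n, c, kh, kw, oh, ow), dtype=x.dtype)
    for u in range(kh):
        for v in range(kw):
            # All pixels that sit under filter tap (u, v), for every output position.
            cols[:, :, u, v] = xp[:, :, u:u + stride * oh:stride, v:v + stride * ow:stride]
    # Rows index (image, out_row, out_col); columns index (channel, tap_row, tap_col).
    cols = cols.transpose(0, 4, 5, 1, 2, 3).reshape(n * oh * ow, c * kh * kw)
    return cols, oh, ow


def conv2d_im2col(x, w, stride=1, pad=0):
    """Same result as conv2d_naive, computed with one matrix multiplication."""
    n = x.shape[0]
    f, c, kh, kw = w.shape
    cols, oh, ow = im2col(x, kh, kw, stride, pad)
    out = cols @ w.reshape(f, -1).T          # (N*OH*OW, F)
    return out.reshape(n, oh, ow, f).transpose(0, 3, 1, 2)


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))
w = rng.standard_normal((4, 3, 3, 3))
assert np.allclose(conv2d_naive(x, w, pad=1), conv2d_im2col(x, w, pad=1))
```

The only Python-level loops left in the im2col version run over the filter taps (nine iterations for a 3×3 filter), not over every output pixel, which is where most of the speed-up comes from.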

Why the naive approach fails:

Nested Python loops on a 32×32×3 image with 64 filters take seconds per batch — unusable for training
Hard-coded output-shape assumptions break silently under different stride or padding settings
Relying on torch.nn.Conv2d from day one means you never understand what the shape transformations mean when they go wrong
Tiny-subset training without careful metric interpretation produces misleading 100% accuracy numbers that evaporate on real data
Models trained without explainability tooling are opaque — you cannot tell whether the classifier learned the right features or is cheating on spurious correlations
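
The output-shape point in the list above comes down to one formula: along each spatial axis, the output size is floor((size + 2 × padding − kernel) / stride) + 1. A small helper (the name conv_output_size is hypothetical) makes it easy to check a layer's shape instead of hard-coding it:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a convolution or pooling window."""
    if size + 2 * padding < kernel:
        raise ValueError("kernel is larger than the padded input")
    return (size + 2 * padding - kernel) // stride + 1


print(conv_output_size(32, 3, stride=1, padding=1))  # 32: "same" padding for a 3x3 filter
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: stride 2 halves the resolution
print(conv_output_size(32, 2, stride=2))             # 16: a 2x2 pooling window
```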

The end-to-end pipeline

PROJECT SETUP
  │
  ▼
[ NumPy Educational Layer ]
  Conv2D (im2col)  →  MaxPool / AvgPool  →  BatchNorm2D
  │
  ▼
[ Data Layer ]
  Download CIFAR-10 (urllib + tarfile)
  │  Unpack raw pickled batch files
  ▼  Convert to NCHW tensors
  Custom augmentation (flip, crop, colour jitter)
  │
  ▼
[ Training Layer ]
  TinyResNet (PyTorch)
  │  Smoke test on synthetic data
  ▼  Train on CIFAR-10 subsets
  Evaluate → save checkpoint
  │
  ▼
[ Visualization Layer ]
  Load checkpoint
  │  Forward pass — capture final conv activations
  │  Backward pass — capture gradients
  ▼  Grad-CAM heatmap → Matplotlib overlay
  │
  ▼
[ Artifact Layer ]
  Saved checkpoint (.pt)
  Saved heatmap (.png)
  Verification commands (smoke tests)

The analogy: Think of the pipeline like developing a photograph in a darkroom. The im2col step is like choosing the right developer chemistry — it makes the latent image (features) appear clearly and quickly. The Grad-CAM step is like holding the final print up to the light and asking "what parts of the scene are most exposed?" You get to see not just the result, but the reasoning.

3. System Architecture Deep Dive

Architecture overview

The project is split into six distinct layers, each with a clear responsibility boundary.

Educational NumPy Layer — exists purely to build intuition. These modules are not on the training hot path; they are reference implementations that you run, inspect, and compare to the PyTorch equivalents. They prove that nn.Conv2d is not magic.

Data Layer — downloads CIFAR-10 directly from the University of Toronto mirror using Python's standard urllib and tarfile modules, unpacks the raw Python-pickle batch files, and converts them into correct NCHW (batch × channels × height × width) float tensors. No torchvision dependency. Custom augmentation is applied per-batch using PyTorch tensor operations — horizontal flips, random crops, and colour jitter implemented as readable functions, not wrapped transforms.

Training Layer — a deterministic synthetic-data smoke test validates that the model can overfit a trivial dataset before you ever touch real data. The main training loop then uses CIFAR-10 subsets for practical speed on CPU, with tqdm progress bars and configurable hyperparameters.

Model Layer — TinyResNet is a compact ResNet-style architecture: two residual block groups with optional skip connections, global average pooling, and a linear classifier head. Designed to train to meaningful accuracy on CIFAR-10 subsets in under 30 minutes on a standard laptop CPU.
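
As a rough picture of what such a model can look like, here is a minimal PyTorch sketch. The class names, the default of 16 base channels, the use_skip flag, and the choice of one block per group are assumptions made for illustration; the course's actual TinyResNet may differ.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an optional skip connection around them."""

    def __init__(self, in_ch, out_ch, stride=1, use_skip=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.use_skip = use_skip
        # A 1x1 projection keeps the skip path shape-compatible when the
        # block changes the channel count or the spatial resolution.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        if self.use_skip:
            out = out + self.shortcut(x)
        return torch.relu(out)


class TinyResNetSketch(nn.Module):
    """Stem, two residual groups, global average pooling, linear head."""

    def __init__(self, base_channels=16, num_classes=10, use_skip=True):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, base_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(base_channels),
            nn.ReLU(),
        )
        self.group1 = ResidualBlock(base_channels, base_channels, use_skip=use_skip)
        self.group2 = ResidualBlock(base_channels, 2 * base_channels, stride=2, use_skip=use_skip)
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pooling
        self.head = nn.Linear(2 * base_channels, num_classes)

    def forward(self, x):
        x = self.group2(self.group1(self.stem(x)))
        return self.head(self.pool(x).flatten(1))


logits = TinyResNetSketch()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```

Setting use_skip=False turns each block into a plain two-convolution stack, which is a quick way to see what the skip connections contribute.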

Visualization Layer — a standalone Grad-CAM script loads a saved checkpoint, registers forward and backward hooks on the final convolutional layer, runs a single forward pass, backpropagates the predicted class score, and generates a heatmap overlay saved to disk.

Artifact Layer — checkpoints are saved with enough metadata (architecture name, base channels, epoch count, best validation accuracy, optimiser state) to reload without hard-coding assumptions anywhere.
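
One way to store that metadata is a plain dictionary passed to torch.save. The sketch below is illustrative: the key names, the helper functions, and the stand-in build_model factory are all assumptions, not the course's checkpoint format.

```python
import torch
from torch import nn


def build_model(base_channels):
    """Hypothetical stand-in for the real model factory."""
    return nn.Sequential(
        nn.Conv2d(3, base_channels, 3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(base_channels, 10),
    )


def save_checkpoint(path, model, optimizer, config, epoch, best_val_acc):
    """Write the weights together with everything needed to rebuild the model."""
    torch.save(
        {
            "arch": type(model).__name__,
            "config": config,                  # e.g. {"base_channels": 16}
            "epoch": epoch,
            "best_val_acc": best_val_acc,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model_factory):
    """Rebuild the model from the stored config, then restore its weights."""
    ckpt = torch.load(path, map_location="cpu")
    model = model_factory(**ckpt["config"])
    model.load_state_dict(ckpt["model_state"])
    return model, ckpt
```

Because the config travels inside the file, the loading script never has to guess the channel width the model was trained with.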

Component table

Component	Role	Key Technology Options
Conv2D implementation	Teach convolution mechanics via im2col	NumPy (educational), torch.nn.Conv2d (training)
Pooling layers	Downsample feature maps	NumPy sliding window (educational), torch.nn.MaxPool2d
Batch normalisation	Stabilise activations across batches	NumPy per-channel stats (educational), torch.nn.BatchNorm2d
Dataset loader	Download and parse CIFAR-10 without torchvision	urllib, tarfile, pickle, NumPy → PyTorch tensor
Data augmentation	Improve generalisation	Custom tensor ops, or torchvision.transforms (alternative)
TinyResNet model	Trainable image classifier	PyTorch nn.Module, residual blocks
Training loop	Gradient updates, metric logging	PyTorch autograd, SGD/Adam, tqdm
Checkpoint manager	Save and reload model state	torch.save / torch.load with metadata dict
Grad-CAM engine	Explainability heatmap generation	PyTorch hooks, torch.nn.functional, Matplotlib
Verification suite	Repeatable smoke tests	Python subprocess, assert checks, shell commands

Data flow walkthrough

Project setup: Create virtual environment, install dependencies (numpy, torch, matplotlib, tqdm), confirm Python 3.10+.
NumPy modules: Run educational Conv2D, pooling, and BatchNorm scripts. Observe output shapes. Compare numerically to PyTorch equivalents.
CIFAR-10 download: Script fetches the .tar.gz, extracts six batch files, unpacks pickles, and assembles a train/test split as (N, 3, 32, 32) float32 tensors with labels.
Augmentation: Training batches pass through flip, crop, and colour jitter. Test batches are only normalised.
Smoke test: TinyResNet is instantiated and trained for five epochs on 512 synthetic images. Loss must decrease; test passes if final training accuracy exceeds a threshold.
CIFAR-10 training: Full training run with configurable subset size, epochs, and learning rate. Validation accuracy is logged every epoch.
Checkpoint save: Best model state, config dict, and epoch metrics are serialised to a .pt file.
Grad-CAM: Script loads the checkpoint, runs a single test image through the model with hooks attached to the final conv layer, computes the class-weighted activation map, resizes to 32×32, and overlays on the original image.
Verification: A suite of shell commands confirms every output file exists, all shapes are correct, and the smoke test passes from a clean state.
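
The Grad-CAM step in the walkthrough above can be sketched with two hooks. This is a minimal version under assumptions: the function name grad_cam is hypothetical, it handles a single image, the target layer must not be followed by an in-place operation, and the Matplotlib overlay is left out.

```python
import torch
import torch.nn.functional as F
from torch import nn


def grad_cam(model, layer, image, class_idx=None):
    """Return a heatmap in [0, 1] with the same height and width as `image`.

    `image` has shape (1, C, H, W); `layer` is the conv module to explain.
    """
    store = {}
    fwd = layer.register_forward_hook(
        lambda module, inputs, output: store.update(act=output.detach())
    )
    bwd = layer.register_full_backward_hook(
        lambda module, grad_in, grad_out: store.update(grad=grad_out[0].detach())
    )
    try:
        model.eval()
        logits = model(image)
        if class_idx is None:
            class_idx = int(logits.argmax(dim=1))     # explain the predicted class
        model.zero_grad()
        logits[0, class_idx].backward()               # gradients of that class score
    finally:
        fwd.remove()
        bwd.remove()

    # Channel weights: gradients averaged over the spatial dimensions.
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)
    # Weighted sum of activation maps; ReLU keeps only positive evidence.
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
    # Upsample to the input resolution and rescale to [0, 1].
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)
    return cam[0, 0], class_idx
```

The returned tensor can be passed to Matplotlib's imshow with an alpha value on top of the original image to produce the overlay.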

Two non-obvious design decisions

Decision 1: NumPy layers are intentionally not used in training. It would be tempting to unify the NumPy and PyTorch code paths, but that conflation would introduce complexity without benefit. The NumPy modules are pedagogical artefacts. They exist to be read, not to be performant. Keeping them separate means the training code is clean PyTorch, and the educational code is clean NumPy — each optimised for its own purpose.

Decision 2: CIFAR-10 is loaded without torchvision. The torchvision.datasets.CIFAR10 loader hides the fact that CIFAR-10 is just six pickle files with NumPy arrays. Reimplementing the loader from scratch forces you to understand the NCHW tensor format, dtype casting, and normalisation — concepts that are invisible when a single dataset class handles everything.
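
To show how little is hidden, here is a minimal sketch of parsing one batch file. The function names are hypothetical and the download and extraction steps are omitted; the b"data" and b"labels" keys and the 3072-value row layout (1024 red, then 1024 green, then 1024 blue values) are those of the CIFAR-10 Python-format files.

```python
import pickle

import numpy as np


def batch_to_arrays(batch):
    """Convert one unpickled CIFAR-10 batch dict to NCHW float32 images and int64 labels."""
    data = np.asarray(batch[b"data"], dtype=np.uint8)          # (N, 3072)
    # Each row is already channel-major, so a plain reshape yields NCHW.
    images = data.reshape(-1, 3, 32, 32).astype(np.float32) / 255.0
    labels = np.asarray(batch[b"labels"], dtype=np.int64)
    return images, labels


def load_batch(path):
    """Read a batch file such as data_batch_1 from the extracted archive."""
    with open(path, "rb") as f:
        # The files were pickled under Python 2, hence encoding="bytes".
        batch = pickle.load(f, encoding="bytes")
    return batch_to_arrays(batch)
```

From here, torch.from_numpy(images) gives a tensor that shares memory with the array. Per-channel mean and standard deviation normalisation would follow as a separate step; this sketch only rescales pixel values to the [0, 1] range.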

4. Tech Stack Recommendation

Stack A — Beginner / Prototype (build in a weekend)

This stack uses the minimal dependencies from the course and runs entirely on CPU. No cloud account required.

Layer	Technology	Why
Language	Python 3.10+	Widest ecosystem compatibility
Array operations	NumPy 1.24+	Educational layers, data manipulation
Deep learning	PyTorch 2.0+ (CPU)	Autograd, optimiser, easy model definition
Dataset loading	urllib + tarfile + pickle	Zero extra dependencies
Augmentation	Custom tensor ops	Readable, dependency-light
Visualisation	Matplotlib 3.7+	Grad-CAM heatmap rendering
Progress	tqdm	Training progress bars

Estimated monthly cost: $0 — runs entirely on local hardware.

Stack B — Production-ready / Scalable

Extend the course project into a deployable service with GPU acceleration and experiment tracking.

Layer	Technology	Why
Language	Python 3.11	Performance improvements, better typing
Deep learning	PyTorch 2.2+ with CUDA 12	GPU-accelerated training
Dataset pipeline	PyTorch DataLoader + torchvision	Multi-worker prefetching
Augmentation	Albumentations	Fast, widely used augmentation library
Experiment tracking	MLflow or Weights & Biases	Reproducible runs, hyperparameter logging
Model serving	TorchServe or FastAPI	HTTP inference endpoint
Containerisation	Docker + NVIDIA Container Toolkit	Reproducible GPU environment
Cloud compute	AWS EC2 g4dn.xlarge or GCP T4 VM	~$0.50/hr on-demand
Monitoring	Prometheus + Grafana	Latency and throughput dashboards
Storage	AWS S3 or GCS	Checkpoint and artefact persistence

Estimated monthly cost: $30–$120 depending on GPU hours used (spot instances reduce this significantly).

5. Implementation Phases

Phase 1: Core CNN Layers (NumPy Educational Modules)

You start by implementing the three fundamental layers that every CNN builds on: Conv2D with the im2col trick, max pooling and average pooling with sliding windows, and batch normalisation with per-channel running statistics. Each module is a standalone Python file that takes an input array, performs the operation, and returns the output — no autograd, no framework magic.
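
For a feel of what these modules look like, here is a minimal forward-only sketch of pooling and batch normalisation in NumPy. The names pool2d and BatchNorm2D, the momentum default, and the use of the biased batch variance for the running statistics are simplifying assumptions, not the course's code.

```python
import numpy as np


def pool2d(x, size=2, stride=2, mode="max"):
    """Slide a size x size window over an (N, C, H, W) array."""
    n, c, h, w = x.shape
    oh = (h - size) // stride + 1
    ow = (w - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    out = np.empty((n, c, oh, ow), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            window = x[:, :, i * stride:i * stride + size, j * stride:j * stride + size]
            out[:, :, i, j] = reduce(window, axis=(2, 3))
    return out


class BatchNorm2D:
    """Per-channel normalisation with running statistics for inference."""

    def __init__(self, channels, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(channels)          # learnable scale
        self.beta = np.zeros(channels)          # learnable shift
        self.running_mean = np.zeros(channels)
        self.running_var = np.ones(channels)
        self.momentum = momentum
        self.eps = eps

    def forward(self, x, training=True):
        if training:
            # Statistics over batch and spatial axes, one value per channel.
            mean = x.mean(axis=(0, 2, 3))
            var = x.var(axis=(0, 2, 3))
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            mean, var = self.running_mean, self.running_var
        shape = (1, -1, 1, 1)                   # broadcast over N, H, W
        x_hat = (x - mean.reshape(shape)) / np.sqrt(var.reshape(shape) + self.eps)
        return self.gamma.reshape(shape) * x_hat + self.beta.reshape(shape)


x = np.arange(16, dtype=np.float64).reshape(1, 1, 4, 4)
print(pool2d(x, mode="max")[0, 0])   # [[ 5.  7.] [13. 15.]]
print(pool2d(x, mode="avg")[0, 0])   # [[ 2.5  4.5] [10.5 12.5]]
```

In training mode the output of each channel has approximately zero mean and unit variance; in inference mode the layer uses the running statistics instead, so its output no longer depends on which other images are in the batch.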