
How to Build OpenAI's CLIP from Scratch with PyTorch (ViT + BPE + InfoNCE)

  • 5 hours ago
  • 14 min read

Introduction


You have used CLIP. You may have called open_clip.create_model_and_transforms(), loaded the weights, and gotten great zero-shot classification results in fifteen minutes. And then someone on your team asked: "Why does batch size matter so much for contrastive learning?" or "Why are we pooling the EOT token instead of mean-pooling?" — and you had no answer, because the model was a black box.


That gap between using CLIP and understanding CLIP is exactly where multimodal AI knowledge stalls. Contrastive vision-language models now power image search, text-to-image generation (Stable Diffusion's text encoder is directly CLIP-derived), generative model evaluation, and domain-specific retrieval at scale. If you cannot reason about what is happening inside the model, you cannot debug it, fine-tune it correctly, or adapt it to your domain.


This post walks through how to build CLIP from scratch using PyTorch: a Vision Transformer image encoder, a causal text transformer with a byte-level BPE tokenizer, a shared joint embedding space, symmetric InfoNCE contrastive loss with a learnable temperature, and a deployable semantic search API. Real-world applications include:


  • Domain-specific image search (e-commerce catalogs, medical imaging, art archives)

  • Zero-shot image classification without labeled training data

  • Text encoder for a Stable Diffusion clone

  • CLIP score computation for evaluating diffusion-model outputs

  • Text-to-image and image-to-image retrieval in a SaaS product

  • Multimodal recommendation and content moderation pipelines


This post covers architecture, stack choices, implementation phases, and the most common failure modes. It does not include full working source code — that lives in the course.

📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]

How It Works: Core Concept


Contrastive Representation Learning in Plain English


Imagine you have 256 images and 256 captions. The key insight behind CLIP is that you do not need class labels at all — you just need the images and their matching captions. The model learns by being trained to answer one question: Given this image embedding and this text embedding, do they belong together or not?


A good analogy: think of CLIP as a pair of translators who share one dictionary. One encoder speaks "image language," the other speaks "text language," and the training objective teaches both to map the same concept to the same location in that shared dictionary, the joint embedding space. A picture of a golden retriever and the phrase "a golden retriever playing fetch" should land at nearly the same coordinates in that space. Every other (image, caption) combination in the batch is a "distractor" whose two embeddings should land far apart.


Why the Naive Approaches Fail


Training a single-modal image classifier with cross-entropy requires explicit class labels, does not generalise to unseen categories, and cannot reason about natural-language descriptions at all. Training a standalone text encoder or a standalone image encoder gives you one language in the dictionary but not the other — you cannot compare across modalities.


Even once you have dual encoders, several subtle mistakes break the model:


  • Wrong pooling choice — mean-pooling all text tokens gives materially worse retrieval scores than pooling the representation at the end-of-text (EOT) token, because the causal attention mask makes the EOT token the natural summary of the full sequence.

  • Fixed temperature — cosine similarity scores between normalised embeddings are small; a fixed temperature that is too large flattens the softmax and kills gradient signal. A learnable temperature that is initialised correctly and exp-clamped to prevent explosion is essential.

  • Small batches — contrastive learning needs hard negatives, and those negatives come for free from the rest of the batch. With 32 samples, the model has 31 negatives to contrast against. With 4096 samples it has 4095 — dramatically richer gradient signal.

  • Symmetric loss ignored — computing InfoNCE only image-to-text (or only text-to-image) means the encoders are not being trained equally, and the representations diverge.
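
To make the temperature point concrete, here is a minimal sketch in plain Python. The similarity values are made up for illustration (index 0 plays the matching caption), and `softmax` is a small helper written for this example rather than anything from the course code. It shows how the probability assigned to the correct pair changes as the temperature shrinks.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities for one image against 4 captions.
# Index 0 is the matching caption; the values are illustrative only.
sims = [0.30, 0.20, 0.15, 0.10]

for tau in (1.0, 0.07, 0.01):
    # Dividing by the temperature stretches the small cosine gaps
    # before the softmax turns them into probabilities.
    probs = softmax([s / tau for s in sims])
    print(f"tau={tau:<5} p(match)={probs[0]:.3f}")
```

With a temperature of 1.0 the distribution stays close to uniform even though the matching caption has the highest similarity, which is why unscaled cosine scores give such a weak training signal.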


How the Architecture Solves It


CLIP solves every one of these problems explicitly. The symmetric InfoNCE loss contrasts in both directions simultaneously. The learnable temperature (initialised at log(1/0.07) and clamped during training) adapts to the actual similarity scale. The EOT pooling strategy matches the causal masking of the text encoder. And the architecture is designed to run with large batches so that the in-batch negatives are genuinely hard.


Data-Flow Diagram

TRAINING PHASE
──────────────────────────────────────────────────────────────────────
  image_path ──► [PIL load + resize 224×224] ──► [patch 16×16]
                     ──► [ViT encoder] ──► [CLS token] ──► [proj head]
                          ──► L2 norm ──► image_emb  ────┐
                                                         ▼
                                          [N×N cosine similarity matrix]
                                                                     ▲
  caption ──► [BPE tokenizer] ──► [token ids + mask]                 │
                     ──► [causal text transformer] ──► [EOT token]   │
                          ──► [proj head] ──► L2 norm ──► text_emb ──┘
                                                     │
                                    Symmetric InfoNCE loss
                                    (diagonal = positives)

INFERENCE / RETRIEVAL PHASE
──────────────────────────────────────────────────────────────────────
  text query ──► text encoder ──► query_emb
                                        │
                                        ▼
                              [FAISS index search]
                                        │
                              top-k image results

  image query ──► image encoder ──► query_emb ──► [FAISS] ──► top-k images

  zero-shot classification:
  label prompts ("a photo of a {class}") ──► text encoder ──► class_embs
  image ──► image encoder ──► image_emb
  argmax(cosine_sim(image_emb, class_embs)) ──► predicted class

System Architecture Deep Dive


Architecture Overview


The full system has eight layers, each with a clear boundary and a distinct responsibility:


Dataset layer — A captioned image dataset where each row contains an image path and a natural-language caption. A preprocessing step validates image readability, filters captions by token length, and optionally provides a class label for zero-shot evaluation. This layer also handles augmentation (random crop, colour jitter, normalisation to ImageNet statistics).


Tokenizer layer — A byte-level BPE tokenizer trained on the caption corpus. Byte-level BPE handles Unicode, punctuation, emojis, and multi-language captions without any special pre-tokenisation rules. The trained tokenizer vocabulary and merge rules are serialised for reproducibility and loaded at inference time.


Image encoder — A Vision Transformer (ViT). The image is divided into 16×16 (or 32×32) non-overlapping patches, each flattened and projected into the embedding dimension. A learnable CLS token is prepended, learned positional embeddings are added, and the sequence is processed by a stack of multi-head self-attention + MLP transformer blocks. The CLS token's output serves as the global image representation.


Text encoder — A causal transformer (GPT-style). The tokenised caption passes through a learned token embedding + positional embedding layer, then through transformer blocks with causal (lower-triangular) attention masking. The representation at the position of the EOT (end-of-text) token is extracted as the caption embedding.


Projection heads — Both encoders output vectors of their own internal dimension. Two separate learned linear layers project image and text representations into a shared joint embedding space (typically 512 or 1024 dimensions). Both projections are followed by L2 normalisation so all embeddings lie on the unit hypersphere.


Contrastive training layer — The symmetric InfoNCE loss. Given a batch of N image embeddings and N text embeddings, an N×N cosine similarity matrix is computed. The diagonal entries are positive pairs; all off-diagonal entries are negatives. Cross-entropy is applied row-wise (image-to-text direction) and column-wise (text-to-image direction), and both losses are averaged. The temperature parameter τ (learnable, exp-clamped) scales the logits.


Retrieval index: A FAISS index (a flat inner-product index, or IVF for larger datasets) built from the L2-normalised image embeddings. At query time, a text or image query is encoded and the index returns the k nearest neighbours by inner product, which equals cosine similarity on normalised vectors.


API deployment layer — A FastAPI application with endpoints for text-to-image search, image-to-image search, zero-shot classification, and embedding lookup. The app loads the serialised encoders and FAISS index at startup and serves inference requests via Uvicorn.


Component Table

| Component        | Role                                         | Technology Options                          |
|------------------|----------------------------------------------|---------------------------------------------|
| Dataset loader   | Load image-caption pairs, apply augmentation | PyTorch Dataset + torchvision transforms    |
| Tokenizer        | Byte-level BPE tokenisation                  | HuggingFace Tokenizers (BPE trainer)        |
| Image encoder    | Patch + attend → CLS embedding               | ViT-B/16 (custom PyTorch), timm             |
| Text encoder     | Causal transformer → EOT embedding           | Custom PyTorch GPT block, HuggingFace GPT-2 |
| Projection heads | Map to joint embedding space                 | nn.Linear + L2 norm                         |
| Contrastive loss | Symmetric InfoNCE                            | Custom PyTorch, OpenCLIP reference impl     |
| Training loop    | Optimiser, LR schedule, checkpointing        | PyTorch + AdamW + cosine LR                 |
| Retrieval index  | ANN search over image embeddings             | FAISS flat / IVF-PQ                         |
| Serving API      | REST endpoints for search and classification | FastAPI + Uvicorn                           |
| Deployment       | Host the API                                 | Render, Railway, AWS EC2                    |


Data Flow Walkthrough


Training: Each batch loads N (image, caption) pairs. Images are resized, normalised, and passed to the ViT, which outputs N CLS vectors → N image projections → N L2-normalised image embeddings. Captions are tokenised, padded to context length, and passed through the causal text transformer → N EOT vectors → N text projections → N L2-normalised text embeddings. The N×N similarity matrix is computed, cross-entropy loss is applied symmetrically, and gradients flow back through both encoders and both projection heads simultaneously.


Inference: An incoming text query is tokenised, encoded by the text encoder, and L2-normalised. FAISS performs an inner product search against the indexed image embeddings and returns the top-k image paths. For zero-shot classification, a set of prompt strings ("a photo of a {label}") is encoded into text embeddings; the query image embedding is compared against all class embeddings, and the class with the highest cosine similarity is the predicted label.


Non-Obvious Design Decisions


Decision 1 — EOT pooling vs CLS vs mean: The text encoder uses causal masking, meaning each token can only attend to previous tokens. The EOT token is the only position that has attended to every other token in the sequence. Using mean pooling includes padding tokens and positional noise; using CLS would require adding a classification token with bidirectional attention (which changes the model architecture). EOT is the correct choice for this architecture, and the difference in retrieval recall at k=1 is measurable.


Decision 2: Temperature initialisation and clamping. The learnable parameter stores the log of the logit scale (the inverse temperature, 1/τ). It is initialised at math.log(1 / 0.07) ≈ 2.659 and clamped to the range [0, 4.605], so the logit scale exp(parameter) stays between 1 and 100 and the temperature τ stays between approximately 0.01 and 1.0. Without the upper clamp, the temperature can collapse toward zero, the logits grow without bound, and training becomes numerically unstable. Without the lower clamp, the temperature can grow past 1.0 and the softmax flattens toward uniform, which starves both encoders of gradient. This single parameter needs more care than most tutorials mention.
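
The clamping described above can be sketched as a small PyTorch module. This assumes the parameter stores the log of the logit scale, as the initialisation value math.log(1 / 0.07) suggests. The class name `LogitScale` and its argument names are hypothetical, chosen for this example only.

```python
import math
import torch
import torch.nn as nn

class LogitScale(nn.Module):
    """Learnable logit scale stored in log space (hypothetical helper)."""

    def __init__(self, init_temperature=0.07, max_scale=100.0):
        super().__init__()
        # log(1 / 0.07) is roughly 2.659
        self.log_scale = nn.Parameter(
            torch.tensor(math.log(1.0 / init_temperature))
        )
        # log(100) is roughly 4.605
        self.max_log_scale = math.log(max_scale)

    def forward(self):
        # Clamp in log space to [0, log(max_scale)], then exponentiate,
        # so the returned scale always lies in [1, max_scale].
        return self.log_scale.clamp(0.0, self.max_log_scale).exp()

scale = LogitScale()
print(scale().item())  # multiply the cosine similarity matrix by this value
```

One design note: clamping inside the forward pass zeroes the gradient while the parameter sits outside the range. An alternative is to clamp the parameter's data in place after each optimiser step; either approach keeps the scale bounded.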


Tech Stack Recommendation


Stack A - Beginner / Prototype (weekend build)

| Layer         | Technology             | Why                                            |
|---------------|------------------------|------------------------------------------------|
| Language      | Python 3.10            | Widely supported, clean typing                 |
| Deep learning | PyTorch 2.x            | Native autograd, clean ViT implementation      |
| Image I/O     | Pillow + torchvision   | Simple augmentation pipeline                   |
| Tokenizer     | HuggingFace Tokenizers | BPE trainer in 5 lines of Python               |
| Retrieval     | FAISS-cpu              | Works on any laptop, no GPU required for index |
| Serving       | FastAPI + Uvicorn      | Auto-generated OpenAPI docs, async support     |
| Deployment    | Render (free tier)     | Zero-config deployment from GitHub             |

Estimated monthly cost: $0–$7 (Render free tier or $7 starter instance covers a small prototype with pre-built index)


Stack B - Production-Ready (designed to scale)

| Layer               | Technology                   | Why                                          |
|---------------------|------------------------------|----------------------------------------------|
| Language            | Python 3.11                  | Faster interpreter, better typing            |
| Deep learning       | PyTorch 2.x + CUDA           | GPU training mandatory for real datasets     |
| Training infra      | AWS p3.2xlarge or Lambda Labs| Cost-effective V100 or A10 rental            |
| Image storage       | AWS S3 + CloudFront          | Durable, CDN-delivered                       |
| Tokenizer           | HuggingFace Tokenizers       | Serialisable, fast Rust-backed encoding      |
| Model serialisation | torch.save + ONNX export     | Faster inference, broader deployment targets |
| Retrieval           | FAISS-gpu + IVF-PQ           | Sub-millisecond ANN at 1M+ scale             |
| API                 | FastAPI + Uvicorn + Gunicorn | Multi-worker production serving              |
| Containerisation    | Docker + Docker Compose      | Reproducible environment                     |
| Deployment          | AWS ECS Fargate or Modal     | Autoscaling, cold-start tolerant             |
| Monitoring          | Prometheus + Grafana         | Retrieval latency, error rate tracking       |

Estimated monthly cost: $80–$250 (depends on instance size, storage, and query volume; GPU training is a one-time cost of $10–$40 for a small dataset)


Implementation Phases


Phase 1: Dataset Preparation, BPE Tokenizer, and ViT Image Encoder


The foundation of the project is a captioned image dataset — a folder of images paired with a CSV or JSONL file mapping each image path to a natural-language caption. This phase involves writing a CaptionDataset class in PyTorch, applying standard augmentation (random resized crop, colour jitter, normalisation to ImageNet mean/std), and validating that every image path is readable and every caption falls within the model's context length.


The BPE tokenizer is trained directly on the caption corpus using HuggingFace Tokenizers' BpeTrainer. The vocabulary size, special tokens (BOS, EOS, PAD, UNK), and minimum frequency thresholds are the key configuration decisions here. The trained tokenizer is serialised to disk so training and inference use identical vocabulary and merge rules.
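
To show what byte-level BPE does under the hood, here is a toy trainer in plain Python. It is not the HuggingFace Tokenizers implementation and skips pre-tokenisation and special tokens entirely; the function names are invented for this sketch. The point it illustrates is that the base vocabulary is the 256 possible byte values, so any Unicode string can be encoded and decoded, and merges only make sequences shorter.

```python
from collections import Counter

def merge_pair(seq, pair, new_id):
    """Replace every occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_byte_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges; ids 0..255 are the raw bytes."""
    seqs = [list(text.encode("utf-8")) for text in corpus]
    merges = []
    for step in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair = counts.most_common(1)[0][0]  # most frequent adjacent pair
        new_id = 256 + step
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges

def encode(text, merges):
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges:  # apply merges in the order they were learned
        seq = merge_pair(seq, pair, new_id)
    return seq

def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (a, b), new_id in merges:
        vocab[new_id] = vocab[a] + vocab[b]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

merges = train_byte_bpe(["a dog", "a dog runs", "a cat"], num_merges=5)
ids = encode("a dög 🐶", merges)  # characters never seen in training
print(ids, decode(ids, merges))
```

The emoji and the accented character never appeared in the toy corpus, yet they still round-trip because they fall back to their raw UTF-8 bytes.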


The Vision Transformer is built from scratch: a PatchEmbed module that splits 224×224 images into non-overlapping patches, a TransformerBlock implementing multi-head self-attention and an MLP with GELU activation, and a stack of these blocks followed by layer normalisation. The CLS token and learned positional embeddings require careful initialisation — zero-init for the CLS token, standard normal scaled by 1/sqrt(d_model) for positional embeddings.
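
The patch embedding step can be sketched as follows, under the assumptions stated in the paragraph above (zero-initialised CLS token, positional embeddings drawn from a standard normal scaled by 1/sqrt(d_model)). The default width of 768 is an assumption borrowed from the usual ViT-B configuration, and the attribute names are illustrative.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and embed them."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        assert img_size % patch_size == 0, "image size must divide by patch size"
        self.img_size = img_size
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch size is a per-patch linear layer.
        self.proj = nn.Conv2d(in_chans, d_model,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(
            torch.randn(1, self.num_patches + 1, d_model) * d_model ** -0.5
        )

    def forward(self, x):
        b, _, h, w = x.shape
        assert h == self.img_size and w == self.img_size, "unexpected image size"
        x = self.proj(x)                  # (B, d_model, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, d_model)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1)    # prepend the CLS token
        return x + self.pos_embed         # (B, num_patches + 1, d_model)

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # 196 patches plus one CLS token per image
```

The explicit size assertion in the forward pass turns a silent shape mismatch into an immediate, readable error.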


Key decisions: patch size (16×16 vs 32×32 — smaller patches = longer sequence = more compute but richer spatial detail), encoder depth (number of transformer blocks), embedding dimension, and number of attention heads.



Phase 2: Text Transformer Encoder and Projection Heads


The causal text transformer uses GPT-style architecture: a token embedding table, learned positional embeddings, transformer blocks with causal (masked) self-attention, and layer normalisation. The key difference from the image encoder is the attention mask — a lower-triangular boolean mask ensures each token only attends to its predecessors. The EOT token representation (at the position of the <|endoftext|> token in each sequence) is extracted as the text embedding.
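
Below is a simplified sketch of the two ideas in this paragraph: the causal mask and EOT pooling. The attention function is single-head with no learned projections, purely to show the masking, and `pool_eot` assumes every sequence contains the EOT id at least once. Both function names are illustrative.

```python
import torch

def causal_attention(q, k, v):
    """Single-head attention where position i only sees positions <= i."""
    t = q.shape[-2]
    allowed = torch.tril(torch.ones(t, t, dtype=torch.bool))  # lower triangle
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def pool_eot(hidden, token_ids, eot_id):
    """Pick the hidden state at the first EOT token of each sequence.

    hidden: (B, T, D) encoder outputs, token_ids: (B, T) integer ids.
    If a sequence has no EOT token, argmax falls back to position 0,
    so truncated captions should be checked upstream.
    """
    eot_positions = (token_ids == eot_id).int().argmax(dim=-1)
    return hidden[torch.arange(hidden.shape[0]), eot_positions]

# Toy batch: id 2 stands in for <|endoftext|>, id 0 for padding.
token_ids = torch.tensor([[5, 7, 2, 0, 0],
                          [5, 2, 0, 0, 0]])
x = torch.randn(2, 5, 8)
hidden = causal_attention(x, x, x)
text_features = pool_eot(hidden, token_ids, eot_id=2)
print(text_features.shape)  # one vector per caption
```

Because of the mask, the output at each position depends only on that token and the ones before it, so the EOT position summarises the whole caption while ignoring the padding that follows it.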


Two separate linear projection heads map the image encoder output and the text encoder output into a shared joint embedding space. Both projections are followed by F.normalize (L2 norm), placing all embeddings on the unit hypersphere. This is important: cosine similarity between unit vectors equals inner product, which is what FAISS uses for efficient search.
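
As a rough illustration, the projection heads might look like the sketch below. The encoder widths (768 for the image side, 512 for the text side) are assumptions for the example, as is the choice of a single bias-free linear layer; the paragraph above leaves those decisions open.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Linear map into the joint space followed by L2 normalisation."""

    def __init__(self, in_dim, embed_dim=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim, bias=False)

    def forward(self, x):
        return F.normalize(self.proj(x), dim=-1)  # unit-length rows

image_head = ProjectionHead(in_dim=768)  # assumed ViT width
text_head = ProjectionHead(in_dim=512)   # assumed text transformer width

image_embs = image_head(torch.randn(4, 768))
text_embs = text_head(torch.randn(4, 512))

# For unit vectors the plain matrix product is the cosine similarity matrix.
similarity = image_embs @ text_embs.T
print(similarity.shape)
```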


Key decisions: projection dimension (commonly 512 or 1024), whether to use a single linear layer or a two-layer MLP as the projection head, and whether to apply layer norm before the projection.



Phase 3: Symmetric InfoNCE Loss, Training Loop, and Evaluation Harness


The contrastive training loop is the core of the project. Given a batch of N image embeddings and N text embeddings, the logit matrix is (image_embs @ text_embs.T) * exp(temperature). The ground-truth labels are [0, 1, 2, ..., N-1] — each image matches exactly one caption and vice versa. Cross-entropy is computed across rows (image-to-text direction) and across columns (text-to-image direction), and both losses are averaged. This symmetric structure ensures both encoders receive equal gradient signal.
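
A minimal sketch of the loss described above, assuming both embedding matrices are already L2-normalised and that `logit_scale` is the exponentiated temperature parameter. The function name is chosen for this example.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(image_embs, text_embs, logit_scale):
    """Symmetric InfoNCE over a batch of N matched (image, text) pairs.

    image_embs, text_embs: (N, D) unit-normalised embeddings.
    logit_scale: float or 0-dim tensor, i.e. exp(temperature parameter).
    """
    logits = logit_scale * (image_embs @ text_embs.T)  # (N, N)
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # each row: image vs all texts
    loss_t2i = F.cross_entropy(logits.T, targets)  # each column: text vs all images
    return (loss_i2t + loss_t2i) / 2

# Sanity check with perfectly aligned, mutually orthogonal pairs.
perfect = torch.eye(4)
print(symmetric_info_nce(perfect, perfect, logit_scale=100.0).item())
```

A useful debugging reference: when every embedding is identical the logits are uniform and the loss equals ln(N), so a loss that sits at that value means the model has not yet learned to separate pairs.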


The training loop uses AdamW with a cosine learning rate schedule and linear warmup. During training, retrieval recall@1 is computed on a held-out validation set at regular intervals — this metric is a direct indicator of representation quality and is more informative than training loss alone. Checkpointing saves the full model state (both encoders, both projection heads, temperature parameter, and optimiser state).
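
The warmup-plus-cosine schedule can be written as a small pure function. This is one common formulation, not necessarily the one used in the course; the function name and the `min_lr` floor are assumptions for the example.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly; the last warmup step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (base_lr - min_lr) * cosine

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000, warmup_steps=100, base_lr=1e-3))
```

In a PyTorch loop, a function like this can be divided by `base_lr` and passed to `torch.optim.lr_scheduler.LambdaLR`, or applied by setting each parameter group's `lr` before the optimiser step.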

Key decisions: batch size (larger is better for contrastive learning, but VRAM-constrained), learning rate, warmup fraction, temperature initialisation and clamping range, and validation cadence.



Phase 4: Zero-Shot Classification, FAISS Retrieval Index, and Fine-Tuning


With trained encoders, zero-shot classification is straightforward: construct a set of text prompts by formatting class labels as "a photo of a {label}", encode all prompts with the text encoder, and compare each query image embedding against the full set of class embeddings. The predicted class is the argmax of the cosine similarities. Prompt engineering matters here: the "a photo of a {label}" template typically outperforms bare label words (the original CLIP paper reports a gain of about 1.3 percentage points on ImageNet from this template alone).
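
The classification step itself is only a few lines. In the sketch below, random vectors stand in for the outputs of the trained text and image encoders, so the labels, dimensions, and noise level are all illustrative; `zero_shot_classify` is a name chosen for this example.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_embs, class_embs, labels):
    """Return the label whose prompt embedding is most similar to each image."""
    image_embs = F.normalize(image_embs, dim=-1)
    class_embs = F.normalize(class_embs, dim=-1)
    sims = image_embs @ class_embs.T  # (num_images, num_classes) cosine scores
    return [labels[i] for i in sims.argmax(dim=-1).tolist()]

labels = ["dog", "cat", "car"]
prompts = [f"a photo of a {label}" for label in labels]

# Stand-ins for encoder outputs: in a real pipeline class_embs would come
# from the text encoder applied to `prompts`, image_embs from the ViT.
torch.manual_seed(0)
class_embs = torch.randn(len(prompts), 512)
image_embs = class_embs[[1, 2]] + 0.05 * torch.randn(2, 512)

print(zero_shot_classify(image_embs, class_embs, labels))
```

Since the class embeddings depend only on the label set, they can be computed once and reused for every image.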


The FAISS retrieval index is built by encoding the entire image corpus, L2-normalising all embeddings, and adding them to a faiss.IndexFlatIP (inner product, equivalent to cosine similarity on unit vectors). For larger datasets, IVF-PQ quantisation dramatically reduces memory usage at a small accuracy cost. Text-to-image and image-to-image retrieval are both handled by the same index — only the query encoder changes.
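
To see why inner-product search is the right primitive here, the sketch below runs an exact brute-force top-k search in PyTorch over random unit vectors. This is the same computation a flat inner-product index such as faiss.IndexFlatIP performs, written out by hand for illustration; the corpus size and dimension are arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
corpus = F.normalize(torch.randn(1000, 64), dim=-1)  # stand-in image embeddings
query = F.normalize(torch.randn(1, 64), dim=-1)      # stand-in query embedding
k = 5

# Exact inner-product search: higher score means more similar.
sims = query @ corpus.T                               # (1, 1000)
ip_scores, ip_idx = sims.topk(k, dim=-1)

# Exact squared-L2 search: lower distance means more similar.
l2_sq = ((query[:, None, :] - corpus[None, :, :]) ** 2).sum(dim=-1)
l2_idx = l2_sq.topk(k, dim=-1, largest=False).indices

# On unit vectors ||a - b||^2 = 2 - 2 * (a . b), so both rankings agree.
print(ip_idx.tolist(), l2_idx.tolist())
```

The identity in the last comment is why normalising before indexing matters: without unit-length vectors, inner product and cosine similarity no longer rank results the same way.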


Fine-tuning on a domain-specific dataset (e.g., a product catalog or a medical image archive) follows the same training loop with a lower learning rate and optionally frozen lower-layer parameters. Computing CLIP score (cosine similarity between image and text embeddings of generated images and their prompts) is a natural extension of the evaluation harness.


Key decisions: FAISS index type (flat vs IVF), number of clusters for IVF, PQ codebooks, which layers to freeze during fine-tuning.



Phase 5: FastAPI Semantic Search API and Deployment


The inference API wraps the trained encoders and FAISS index in a FastAPI application. At startup, the app loads the serialised tokenizer, both encoder weights, and the pre-built FAISS index into memory. Endpoints handle text-to-image search, image-to-image search, zero-shot classification, embedding lookup, and a health check. The API accepts image uploads as multipart form data and text queries as JSON body parameters.


For deployment, the application is containerised with Docker and pushed to a registry. Render or Railway offers a zero-configuration deployment path suitable for prototypes and small production workloads. For higher-scale deployments, AWS ECS Fargate with an Application Load Balancer provides autoscaling without managing EC2 instances directly.


Key decisions: whether to keep the FAISS index in memory or on disk with mmap, async vs sync endpoint handlers, authentication strategy, and whether to pre-compute and cache embeddings for common queries.



Common Challenges


Building CLIP from scratch surfaces a set of non-obvious failure modes that are rarely covered in tutorials. Here are the ones that matter most, along with their root causes and fixes.


1. Temperature explosion during early training

Root cause: The learnable temperature parameter is exponentiated before it scales the logits. If gradients push the parameter to a large positive value, exp(temperature) grows without bound, logit magnitudes explode, and the softmax saturates, so gradients vanish.

Fix: Initialise the parameter at math.log(1 / 0.07) and clamp the exponentiated scale in the forward pass, for example torch.clamp(self.temperature.exp(), 1.0, 100.0), which matches the [0, 4.605] range on the log parameter given in Decision 2.


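2. Contrastive training collapses with small batches

Root cause: InfoNCE loss quality scales with the number of in-batch negatives. With 16 samples per batch, each image has only 15 negatives, most of which are easy (unambiguous mismatches). The model rarely sees hard negatives and fails to learn fine-grained distinctions.

Fix: Use the largest batch size VRAM allows (256+ is a reasonable minimum for learning useful representations). Plain gradient accumulation does not help here: each micro-batch still builds its own small similarity matrix, so the number of negatives per sample is unchanged. On constrained hardware, the workable options are to gather embeddings across GPUs before computing the loss, or to use a gradient-caching scheme that computes the loss over the embeddings of all micro-batches and then re-runs the encoders chunk by chunk.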


3. Wrong token pooled from the text encoder

Root cause: Mean-pooling includes padding tokens, which carry no semantic content and add noise. CLS pooling requires a bidirectional attention mask. The causal text encoder with EOT pooling is the correct design because EOT has attended to the entire sequence.

Fix: Track the actual EOT token position per sequence and index it explicitly: text_features = x[torch.arange(x.shape[0]), eot_positions].


4. Image patches not aligned to tensor shapes

Root cause: The PatchEmbed layer uses nn.Conv2d with kernel_size=patch_size and stride=patch_size. If the input image width/height is not divisible by patch size, patches are silently dropped or an error is raised depending on the implementation.

Fix: Always resize images to a fixed size (224×224 or 336×336) before encoding, and assert divisibility in the PatchEmbed forward pass.


5. BPE tokenizer producing out-of-vocabulary tokens at inference

Root cause: The tokenizer was trained on the training caption corpus, which may not cover every character that appears in inference-time queries, especially if the inference domain differs from the training domain. If the base alphabet was built only from characters seen during training, unseen characters map to the unknown token.

Fix: Seed the BPE trainer with the full 256-symbol byte-level alphabet so that every input is encodable, and train the tokenizer on a broader corpus (e.g., include Wikipedia sentences in addition to the caption dataset) so that out-of-domain queries are not split into long runs of single bytes.


6. Asymmetric gradient flow due to single-direction InfoNCE

Root cause: Computing the loss only image-to-text normalises each image against all captions in the batch, but never normalises a caption against all images. The text-to-image retrieval direction is therefore never optimised directly, and the two encoders receive unbalanced training signal.

Fix: Always use symmetric InfoNCE: compute loss_i2t (cross-entropy over rows) and loss_t2i (cross-entropy over columns), then return (loss_i2t + loss_t2i) / 2.


7. CLIP-score reward hacking in generative model evaluation

Root cause: When CLIP score is used as a feedback signal for a generative model (e.g., RL fine-tuning of a diffusion model), the generator learns to produce images that score well under CLIP without being semantically faithful. This happens because CLIP has known failure modes with spatial reasoning and negations.

Fix: Use CLIP score as one signal among several (FID, human preference, diversity metrics). Do not optimise it directly.

Solving these issues took us 14 hours of testing — the course walks you through each fix with working code.


Ready to Build This Yourself?


Understanding architecture is half the battle. The other half is sitting down with a blank Python file and actually building a Vision Transformer from a nn.Module subclass, wiring together the dual encoders, watching your contrastive loss curve, and shipping a FastAPI endpoint that serves real retrieval results. That gap between understanding and shipping is what the course closes.


The Build CLIP from Scratch course on Codersarts Labs includes:

  • Full annotated source code for every component

  • 20 structured lessons from dataset prep to API deployment

  • Complete ViT image encoder implementation (patch embed, multi-head attention, CLS pooling)

  • Byte-level BPE tokenizer training and serialisation

  • Causal text transformer encoder with EOT pooling

  • Symmetric InfoNCE loss with learnable temperature, correct initialisation, and exp-clamping

  • Training and evaluation harness with retrieval recall@k tracking

  • FAISS flat and IVF index construction and querying

  • FastAPI semantic search API with text-to-image and image-to-image endpoints

  • Zero-shot classification pipeline with prompt engineering

  • Domain fine-tuning module

  • Deployment walkthrough (Docker + Render / Railway)

  • Lifetime access to all future updates

  • Community support


$29.99. Everything above.



Want to move faster? Book a 1:1 Guided Session at $99.99 — two live sessions with the Codersarts team to help you stand up the project, debug training, tune the temperature, build your domain dataset, and ship the retrieval API.


Conclusion


Building CLIP from scratch means implementing two parallel transformer encoders (a Vision Transformer for images and a causal text transformer for captions), projecting their outputs into a shared joint embedding space via learned linear heads and L2 normalisation, and training the full system end-to-end with symmetric InfoNCE contrastive loss and a learnable temperature parameter. The result is a model that can do zero-shot classification, text-to-image retrieval, image-to-image similarity, and power a deployable semantic search API.


The recommended starting path: prepare your captioned image dataset first, then train the BPE tokenizer on your captions, then build and test the ViT image encoder in isolation, then add the text encoder and the InfoNCE training loop together. Get retrieval recall moving upward before adding FAISS and the API layer.


Ready to go from architecture to working code? The full course is at labs.codersarts.com.

 
 
 
