How to Build RLHF and DPO from Scratch with PyTorch: The Technique Behind ChatGPT

Introduction: The Gap Between Using a Chatbot and Understanding One
You have called trl.PPOTrainer or trl.DPOTrainer. The loss goes down. The model seems better. But if someone asks you why the KL penalty coefficient is set to 0.04, or what happens to advantage estimation when the reward model starts overfitting, or why DPO even works without a separate reinforcement learning step — you don't have a clean answer.
That gap is the problem. Every modern chat model — ChatGPT, Claude, Gemini — is shaped by Reinforcement Learning from Human Feedback (RLHF), yet most developers treat alignment as a black box buried inside a library call. The consequence is that when your aligned model reward-hacks, when PPO training collapses, or when you need to adapt the pipeline for a domain-specific assistant, you are stuck.
Build RLHF and DPO from Scratch (labs.codersarts.com) is a structured project course that closes this gap. You implement the full alignment pipeline yourself in PyTorch: supervised fine-tuning, a reward model trained on Bradley-Terry preference loss, a complete PPO loop with KL penalty, and both DPO and IPO as reward-model-free alternatives.
Real-world use cases this course prepares you for:
Aligning a domain-specific assistant (support, legal, healthcare) to a preferred tone and policy
Training a reward model to rank or filter generations for quality control
Replacing expensive RLHF library wrappers with an owned, debuggable pipeline
Researching and comparing alignment algorithms — PPO vs DPO vs IPO
Preparing for advanced topics like verifiable-reward RL, GRPO, and reasoning-model training
Understanding how production models are aligned so you can evaluate or audit them
This post covers the full architecture, the tech stack, the implementation phases, and the common failure modes. It does not include the full source code — that is inside the course.
How It Works: The Core Concept
Why Supervised Fine-Tuning Alone Is Not Enough
The naive alignment approach is straightforward: take a pretrained language model, collect a dataset of (prompt, ideal response) pairs, and fine-tune with cross-entropy loss. This is supervised fine-tuning (SFT), and it is genuinely necessary - but it cannot capture preference.
Here is why. SFT treats every token in the training response as equally correct. It cannot express "this response is better than that one" - only "this response happened." Models trained with SFT can follow instructions, but they have no way to distinguish a nuanced, helpful answer from a plausible-but-unhelpful one if both appeared in training. For many tasks, the difference is precisely what matters.
The second naive patch - define a scalar reward and optimize it with RL - also fails on its own. A language model has enough capacity to find responses that score high on any fixed reward function through patterns that have nothing to do with quality. This is reward hacking, and unconstrained RL accelerates it.
The Three-Stage RLHF Solution
RLHF solves this with a three-stage pipeline:
SFT Baseline - fine-tune on instruction data to produce a model that can follow prompts. This is the starting policy.
Reward Model Training - collect pairwise preference data (human annotators pick the better of two responses to each prompt), then train a reward model to predict scores that satisfy those preferences. The Bradley-Terry preference loss is the standard objective: it models the probability that the chosen response wins as the sigmoid of the score difference, so minimizing it pushes the chosen score above the rejected one.
Policy Optimization with PPO and KL Control - sample rollouts from the policy, score them with the frozen reward model, compute advantages, and update the policy with the PPO clipped surrogate objective. A KL divergence penalty against the original SFT model is added to the reward: it penalizes the policy for drifting too far from the reference distribution, which is the primary brake on reward hacking.
A useful analogy: think of SFT as teaching a student to answer questions. The reward model is a grader who has seen thousands of examples of better and worse answers and learned to score them. PPO is the student studying under that grader — always trying to score higher, but with a rule that says "you cannot change your writing style completely just to game the rubric."
Direct Preference Optimization (DPO) takes this further. It shows mathematically that under a KL-penalized RL objective, the optimal policy is a closed-form transformation of the reference model and the reward. Inverting that relationship expresses the reward as a scaled log-ratio between the policy and the reference, so the reward model becomes implicit: you can train directly on preference pairs with a classification-style loss and skip the explicit RL step. IPO is a variant that replaces DPO's logistic loss with a squared loss toward a fixed target margin, which avoids overfitting to hard preference labels.
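The derivation can be sketched in four lines. The symbols are not defined in this post and follow the usual DPO notation, so treat them as assumptions: \pi_\theta is the policy, \pi_{ref} the frozen reference, \beta the KL coefficient, Z(x) a per-prompt normalizer, and (y_w, y_l) the chosen and rejected responses.

```latex
\begin{align*}
% 1. KL-penalized RL objective
&\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \\[4pt]
% 2. Its closed-form optimum
&\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big( r(x, y) / \beta \big) \\[4pt]
% 3. Solve for the reward
&r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\[4pt]
% 4. Substitute into the Bradley-Terry loss; the Z(x) terms cancel in the difference
&\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\end{align*}
```

The cancellation of Z(x) in step 4 is what makes the approach practical: the normalizer is intractable to compute, but it never appears in the final loss.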
Pipeline Data-Flow Diagrams
Training Phase:
Instruction Data ──► [SFT Fine-tuning] ──► SFT Policy (frozen as reference)
                                                  │
Preference Pairs ──► [Reward Model Training       │
(chosen / rejected)   w/ Bradley-Terry] ──► Reward Model (frozen after training)
                                                  │
                           ┌──────────────────────┘
                           ▼
                [PPO Policy Optimization]
                  Policy samples rollouts
                  Reward model scores them
                  KL penalty added to reward
                  Clipped surrogate update
                           │
                           ▼
                  Aligned Policy (PPO)

                       -- OR --

[DPO Training on preference pairs] ──► Aligned Policy (DPO / IPO)
Runtime / Evaluation Phase:
User Prompt ──► Aligned Policy ──► Generated Response
                                          │
                               ┌──────────┴──────────┐
                               ▼                     ▼
                     Reward Model Score    Head-to-Head vs SFT Baseline
                     (automated metric)    (win rate: human or LLM judge)
                               └──────────┬──────────┘
                                          ▼
                                   Alignment Report
System Architecture Deep Dive
The full alignment project is built in seven layers, each with a clear responsibility.
Layer Overview
Dataset Layer. The foundation is two datasets: an instruction dataset for SFT (prompt → response pairs) and a pairwise preference dataset for reward model training (prompt, chosen response, rejected response). Constructing a high-quality preference dataset is often the most time-consuming part of a real alignment project.
SFT Baseline Layer. A pretrained transformer (GPT-2 scale or a small open-weight model) is fine-tuned on the instruction dataset using standard cross-entropy loss. The resulting model is checkpointed twice: once as the reference model (frozen throughout all subsequent training) and once as the initial policy (the starting point for PPO or DPO).
Reward Model Layer. The reward model shares the transformer backbone of the SFT model but has a single scalar output head added at the final token position. It is trained on the preference dataset with Bradley-Terry loss: for a (prompt, chosen, rejected) triple, the loss increases the score of the chosen response and decreases the score of the rejected one. The reward model is then frozen.
PPO Training Layer. The PPO loop keeps four models in memory simultaneously: the policy (the model being trained), the reference model (frozen SFT), the reward model (frozen), and the value model (a critic that estimates expected cumulative reward). At each step, the policy samples a batch of rollout completions; the reward model scores each completion; advantages are estimated; and the policy is updated with the clipped surrogate objective while the KL penalty against the reference is subtracted from rewards.
DPO and IPO Layer. DPO collapses reward modeling and RL into a single step. It takes (prompt, chosen, rejected) triples and computes log-probability ratios between the policy and the reference model for each response. The loss is a binary cross-entropy over the difference between the chosen and rejected log-ratios, which implicitly optimizes the same KL-penalized objective that PPO optimizes explicitly. IPO replaces that logistic loss with a squared loss that pulls the difference toward a fixed target, which prevents the policy from concentrating all probability mass on the preferred response.
Evaluation Layer. All three aligned policies (PPO, DPO, IPO) are evaluated with two metrics: the scalar reward score assigned by the reward model, and the win rate — the fraction of head-to-head comparisons where the aligned model's response is judged better than the SFT baseline's. Win rate is more honest than reward score because the reward model can be gamed.
API Deployment Layer. The best-performing aligned model is served via a FastAPI endpoint backed by Uvicorn. The inference API accepts a prompt and returns a generated response along with the reward model's score for that response.
Component Table
Component | Role | Technology Options |
Instruction dataset | SFT training data | Alpaca, ShareGPT, custom JSONL |
Preference dataset | Reward model training data | Anthropic HH, custom pairwise annotation |
Base policy model | Pretrained transformer backbone | GPT-2, GPT-Neo, Pythia, Mistral-7B |
SFT fine-tuning loop | Instruction-following baseline | HuggingFace Trainer, custom PyTorch loop |
Reward model | Scalar preference predictor | Transformer + linear head (PyTorch) |
PPO training loop | KL-penalized policy optimization | Custom PyTorch (this course), TRL (library) |
DPO/IPO training | Reward-model-free alignment | Custom PyTorch (this course), TRL |
Evaluation harness | Win rate, reward score tracking | NumPy, Weights & Biases, GPT-4 judge |
Inference API | Serve aligned model | FastAPI + Uvicorn |
Deployment platform | Host the inference endpoint | Render, Modal, Hugging Face Spaces |
Data Flow Walkthrough
Training pipeline:
Raw instruction data is loaded with HuggingFace datasets and tokenized.
The base transformer is fine-tuned with cross-entropy loss to produce the SFT baseline.
The SFT model is checkpointed as the reference model and copied as the starting policy.
Preference data is loaded as (prompt, chosen, rejected) triples and tokenized.
The reward model is initialized from the SFT backbone with a scalar head appended.
The reward model is trained with Bradley-Terry loss until held-out preference accuracy stabilizes.
PPO training begins: the policy generates rollout completions; the reward model scores each; advantages are computed over the batch; the clipped PPO objective and the KL penalty are combined; the policy is updated via backpropagation.
DPO training is a separate path over the same preference triples: log-probability ratios are computed against the frozen reference, and a cross-entropy loss updates the policy directly.
All checkpoints are saved after each training phase.
Runtime inference pipeline:
A prompt is received by the FastAPI endpoint.
The aligned policy generates a response using top-p sampling.
The reward model scores the response and the score is returned alongside the text.
For evaluation runs, the SFT baseline also generates a response to the same prompt, and a judge model or human annotator picks the better one.
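The top-p (nucleus) sampling step in the runtime pipeline can be sketched as follows. This is a minimal illustration operating on a single step of logits, not the course's inference code; the function name `top_p_sample` and the default `top_p=0.9` are assumptions.

```python
import torch

def top_p_sample(logits: torch.Tensor, top_p: float = 0.9, generator=None) -> torch.Tensor:
    """Sample one token id per row from `logits` of shape [batch, vocab]."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)

    # Drop a token when the mass accumulated *before* it already reaches top_p.
    # This keeps the smallest prefix whose total mass is >= top_p, and it
    # always keeps the single most likely token.
    remove = (cumulative - sorted_probs) >= top_p
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)

    # Renormalize the surviving tokens and sample among them.
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1, generator=generator)

    # Map the position in the sorted order back to the original vocab id.
    return sorted_idx.gather(-1, choice).squeeze(-1)
```

In a generation loop this would be called on the logits of the last position at each step, with the sampled id appended to the input before the next forward pass.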
Non-Obvious Design Decisions
Decision 1: The reference model must be frozen. It might seem redundant to keep the SFT model in memory and never update it. The reason is that the KL penalty is computed as the log-ratio between the current policy and the reference policy at each token. If the reference drifts, the penalty becomes meaningless and reward hacking accelerates. The memory cost of four simultaneous models is the direct consequence of this architectural requirement.
Decision 2: Evaluate with win rate, not just reward score. The reward model was trained on a finite preference dataset; it generalizes imperfectly. A policy can score very highly on the reward model by exploiting its blind spots while producing text a human would not prefer. Win rate measured by a separate judge is more robust. This is why the course's evaluation harness runs both metrics and flags divergence between them as evidence of reward hacking.
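Decision 1 describes the KL penalty as a per-token log-ratio against the frozen reference. A minimal sketch of how that penalty could be folded into the reward is shown below. It assumes the per-token log-probs of the sampled tokens have already been gathered; the function name, the tensor layout, and the choice to place the scalar reward on the last response token are illustrative assumptions rather than the course's implementation. The default `beta=0.04` simply echoes the value mentioned in the introduction.

```python
import torch

def kl_shaped_rewards(policy_logprobs, ref_logprobs, reward_score, response_mask, beta=0.04):
    """
    policy_logprobs, ref_logprobs: [batch, seq] log-prob of each sampled token
    reward_score:                  [batch] scalar reward-model score per response
    response_mask:                 [batch, seq] 1 on response tokens, 0 on prompt/padding
    Returns per-token rewards [batch, seq] and the summed KL estimate [batch].
    """
    mask = response_mask.to(policy_logprobs.dtype)

    # Per-token log-ratio: a sample-based estimate of KL(policy || reference).
    kl = (policy_logprobs - ref_logprobs) * mask
    rewards = -beta * kl

    # Index of the last response token in each row (assumes at least one).
    positions = torch.arange(mask.size(1), device=mask.device)
    last_idx = (mask * positions).argmax(dim=1)

    # The reward model scores the whole response, so its score is added once,
    # at the final response token.
    batch_idx = torch.arange(rewards.size(0), device=rewards.device)
    rewards[batch_idx, last_idx] += reward_score
    return rewards, kl.sum(dim=1)
```

The second return value is the quantity worth logging: if the reference were allowed to drift, this number would stay small even as the policy moved arbitrarily far from the original SFT distribution.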
Tech Stack Recommendation
Stack A - Beginner / Prototype (Build in a Weekend)
This stack is designed to run on a single GPU (or a free Colab T4) and uses the smallest viable models.
Layer | Technology | Why |
Language | Python 3.10 | Universal ML ecosystem |
Deep learning | PyTorch 2.x | Required for autograd and custom training loops |
Base model | GPT-2 (124M) | Fits in 8 GB VRAM; fast rollout generation |
Model loading | HuggingFace Transformers | Pretrained weights and matching tokenizers for base models |
Dataset tooling | HuggingFace datasets | Efficient JSONL loading and tokenization |
Numerical ops | NumPy | Advantage computation, reward normalization |
API server | FastAPI + Uvicorn | Minimal setup for serving the aligned model |
Deployment | Render free tier | No DevOps required |
Estimated monthly cost: $0–$10 (free-tier Colab or Render; no dedicated GPU instance needed for inference at this scale)
Stack B - Production-Ready (Designed to Scale)
This stack targets a 7B-parameter policy with batched inference and experiment tracking.
Layer | Technology | Why |
Language | Python 3.11 | Typing improvements; compatible with all packages |
Deep learning | PyTorch 2.x + CUDA 12 | Full GPU utilization |
Base model | Mistral-7B or Llama-3-8B | Strong instruction-following baseline at 7–8B scale |
Model loading | HuggingFace Transformers + PEFT | LoRA fine-tuning reduces VRAM by 4–8× |
Dataset tooling | HuggingFace datasets + streaming | Handles multi-million-record preference datasets |
Reward model | Custom PyTorch scalar head | Owned, debuggable; no external scoring dependency |
Training infra | DeepSpeed ZeRO-2 | Shards optimizer states and gradients across GPUs |
Experiment tracking | Weights & Biases | SFT, reward model, PPO, and DPO runs in one dashboard |
API server | FastAPI + Uvicorn | Production-grade inference with async support |
Deployment | Modal or AWS SageMaker | Autoscaling GPU inference; pay-per-request |
Monitoring | W&B + custom reward/KL dashboards | Detects reward hacking in real time |
Estimated monthly cost: $80–$250 (4× A100 40GB hours for training; T4 instance for inference; W&B free tier sufficient)
Implementation Phases
Breaking the project into phases makes each decision tractable and prevents the common failure of trying to debug PPO before the reward model is actually good.
Phase 1: SFT Baseline
What you build: A fine-tuned language model that follows instructions. You load a pretrained transformer, prepare an instruction dataset in (prompt, response) format, and run a standard supervised training loop with cross-entropy loss. You then evaluate the model qualitatively on held-out prompts and save two checkpoints: the SFT model (your starting policy) and the reference model (frozen copy).
Key technical decisions:
Choose a base model small enough to fit in available VRAM with room for three more models later.
Decide whether to mask the prompt tokens from the loss (recommended: avoids learning to predict the prompt, which is noise).
Choose a learning rate schedule — cosine decay with warmup is standard.
Validate that the SFT model can follow basic instructions before proceeding to reward modeling.
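The prompt-masking decision above can be sketched as follows. This is a minimal illustration, assuming right-padded batches where each row starts with the prompt and that the padding id is distinct from every real token; the helper names `build_labels` and `sft_loss` are hypothetical.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # the default ignore_index of F.cross_entropy

def build_labels(input_ids, prompt_lengths, pad_token_id):
    """Copy input_ids into labels, blanking out prompt and padding positions."""
    labels = input_ids.clone()
    positions = torch.arange(input_ids.size(1), device=input_ids.device).unsqueeze(0)
    labels[positions < prompt_lengths.unsqueeze(1)] = IGNORE_INDEX  # prompt tokens
    labels[input_ids == pad_token_id] = IGNORE_INDEX                # padding tokens
    return labels

def sft_loss(logits, labels):
    """Next-token cross-entropy over response tokens only."""
    # The logits at position t predict the token at position t + 1.
    shift_logits = logits[:, :-1, :]
    shift_labels = labels[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )
```

If the tokenizer reuses the end-of-sequence id as the padding id, masking by `pad_token_id` would also hide the real end-of-sequence token; in that case mask padding with the attention mask instead.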
Phase 2: Preference Dataset and Reward Model Training
What you build: A pairwise preference dataset in (prompt, chosen, rejected) format, and a reward model trained on it with Bradley-Terry preference loss. The reward model is initialized from the SFT backbone with a single linear layer appended to the final hidden state of the last token.
Key technical decisions:
Decide how to source preference data: use a public dataset (Anthropic HH, OpenAssistant) or generate synthetic pairs by prompting two different SFT checkpoints and using a judge model to rank them.
Implement Bradley-Terry loss correctly: the loss is −log σ(r_chosen − r_rejected) where σ is the sigmoid function and r is the reward model's scalar output.
Monitor preference accuracy on a held-out validation split — this is your primary signal of reward model quality.
Choose a training budget: reward model overfitting is common on small datasets; early stopping on validation accuracy is essential.
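The reward head and the Bradley-Terry loss from this phase can be sketched together. This is a minimal illustration, not the course's code: `RewardHead` is a hypothetical module that expects the backbone's final hidden states as input, and the last-token lookup assumes each row has at least one non-padding token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Linear layer mapping the last real token's hidden state to one scalar."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: [batch, seq, hidden]; attention_mask: [batch, seq] of 0/1.
        # Find the last non-padding position; works for left or right padding.
        positions = torch.arange(attention_mask.size(1), device=attention_mask.device)
        last_idx = (attention_mask * positions).argmax(dim=1)
        batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
        last_hidden = hidden_states[batch_idx, last_idx]
        return self.score(last_hidden).squeeze(-1)  # [batch]

def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def preference_accuracy(r_chosen, r_rejected):
    """Fraction of pairs where the chosen response gets the higher score."""
    return (r_chosen > r_rejected).float().mean().item()
```

`preference_accuracy` on a held-out split is the early-stopping signal the last bullet refers to: training loss can keep falling while this number plateaus or drops.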
Phase 3: PPO Policy Optimization
What you build: The full PPO training loop. At each iteration, the policy generates a batch of rollout completions; the reward model scores each; the value model estimates a baseline; generalized advantage estimation (GAE) computes per-token advantage values; and the clipped surrogate objective plus KL penalty updates the policy.
Key technical decisions:
Choose the KL penalty coefficient (beta). A value between 0.01 and 0.1 is typical; too low causes reward hacking, too high freezes the policy.
Implement ratio clipping correctly: the PPO objective clips the probability ratio π_θ(a|s) / π_old(a|s) to [1−ε, 1+ε] with ε = 0.2 as the default.
Decide how many PPO epochs to run per rollout batch (typically 4).
Monitor entropy, KL divergence, and reward simultaneously — a divergence between reward score and these diagnostics is the earliest sign of instability.
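Two pieces of this phase, generalized advantage estimation and the clipped surrogate, can be sketched as below. This is a simplified illustration under stated assumptions: every position in the `[batch, T]` tensors passed to `compute_gae` is a response token (no padding), the value after the final token is zero, and the defaults `gamma=1.0` and `lam=0.95` are illustrative choices not taken from this post. The clip default matches the ε = 0.2 quoted above.

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """
    rewards, values: [batch, T] per-token rewards and value estimates.
    Returns (advantages, returns), both [batch, T].
    """
    T = rewards.size(1)
    advantages = torch.zeros_like(rewards)
    last_adv = torch.zeros_like(rewards[:, 0])
    for t in reversed(range(T)):
        # Bootstrap from the next value; zero after the final token.
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros_like(values[:, 0])
        delta = rewards[:, t] + gamma * next_value - values[:, t]  # TD error
        last_adv = delta + gamma * lam * last_adv
        advantages[:, t] = last_adv
    returns = advantages + values  # regression targets for the value model
    return advantages, returns

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, mask, clip_eps=0.2):
    """Negative clipped surrogate, averaged over tokens where mask == 1."""
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token = -torch.min(unclipped, clipped)
    mask = mask.to(per_token.dtype)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

Taking the minimum of the clipped and unclipped terms means the policy gets no extra credit for pushing the ratio beyond the clip range in the direction the advantage favors, which is what keeps each update small.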
Phase 4: DPO and IPO Implementation
What you build: A reward-model-free alternative to PPO. DPO computes the log-probability of each response (chosen and rejected) under both the current policy and the frozen reference, then applies a binary cross-entropy loss to the difference between the chosen and rejected log-ratios. IPO swaps that logistic loss for a squared loss toward a fixed target margin, which prevents the policy from over-concentrating on the preferred response.
Key technical decisions:
Derive the DPO loss from the RLHF objective to understand what beta means in this context (it controls the strength of the implicit KL penalty, just as in PPO).
Implement per-token log-probability computation correctly — a common bug is averaging log-probs over the entire sequence including the prompt.
Choose whether to run DPO from the SFT checkpoint or from a PPO checkpoint (the former is more common and less prone to instability).
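Implement IPO and verify that its loss has a finite optimum: the squared objective pulls the log-ratio margin toward a fixed target, whereas DPO's logistic loss keeps rewarding a larger margin without limit on well-separated preference pairs.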
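The log-probability computation and the two losses from this phase can be sketched as follows. This is a minimal illustration with hypothetical function names. It sums log-probs over response tokens only, which avoids the prompt-averaging bug mentioned above, and the defaults `beta=0.1` and `tau=0.1` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def response_logprob(logits, input_ids, response_mask):
    """
    Sum of token log-probs over the response only.
    logits: [batch, seq, vocab]; input_ids, response_mask: [batch, seq].
    """
    # The logits at position t predict the token at position t + 1.
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logprobs = logprobs.gather(-1, targets).squeeze(-1)
    mask = response_mask[:, 1:].to(token_logprobs.dtype)  # prompt tokens excluded
    return (token_logprobs * mask).sum(dim=-1)

def preference_margin(pol_chosen, pol_rejected, ref_chosen, ref_rejected):
    """Chosen log-ratio minus rejected log-ratio, each against the reference."""
    return (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)

def dpo_loss(margin, beta=0.1):
    """Logistic loss: keeps decreasing as the margin grows."""
    return -F.logsigmoid(beta * margin).mean()

def ipo_loss(margin, tau=0.1):
    """Squared loss: minimized when the margin equals 1 / (2 * tau)."""
    return ((margin - 1.0 / (2.0 * tau)) ** 2).mean()
```

The reference log-probs should be computed under `torch.no_grad()` with the frozen model, so only the policy terms in `preference_margin` carry gradients.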
Phase 5: Evaluation, Reward Hacking Detection, and Deployment
What you build: A win-rate evaluation harness that pits the aligned model against the SFT baseline, a reward hacking detector that flags divergence between reward score and win rate, and a FastAPI endpoint that serves the best-performing policy.
Key technical decisions:
Decide on a judge: using a stronger LLM (GPT-4) as the judge is common and correlates well with human preference, but introduces its own biases.
Design the win-rate evaluation to be prompt-controlled (same prompts for all models) and to randomly flip the order of responses to the judge to reduce position bias.
Set a threshold for reward hacking detection: if reward score increases but win rate does not, the model is likely overfitting to the reward model's blind spots.
Expose the reward score alongside generated text in the FastAPI response so downstream consumers can implement their own quality filters.
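The order-flipping and divergence check described above can be sketched in plain Python. Everything here is illustrative: `aligned_fn`, `baseline_fn`, and `judge_fn` are hypothetical callables you would supply, and the two thresholds in `looks_like_reward_hacking` are placeholder values, not recommendations from the course.

```python
import random

def win_rate(prompts, aligned_fn, baseline_fn, judge_fn, seed=0):
    """
    aligned_fn, baseline_fn: prompt -> response text
    judge_fn: (prompt, response_a, response_b) -> "A" or "B"
    Returns the fraction of prompts where the aligned model's response wins.
    """
    rng = random.Random(seed)
    wins = 0
    for prompt in prompts:
        aligned, baseline = aligned_fn(prompt), baseline_fn(prompt)
        # Randomly decide which model is shown first to reduce position bias.
        aligned_first = rng.random() < 0.5
        a, b = (aligned, baseline) if aligned_first else (baseline, aligned)
        verdict = judge_fn(prompt, a, b)
        # The aligned model wins if the judge picked the slot it was placed in.
        wins += int((verdict == "A") == aligned_first)
    return wins / len(prompts)

def looks_like_reward_hacking(reward_gain, win_rate_value,
                              min_reward_gain=0.5, min_win_rate=0.55):
    """Flag runs where reward rose clearly but the win rate stayed near chance."""
    return reward_gain >= min_reward_gain and win_rate_value < min_win_rate
```

A judge that always answers "A" should land near 50% under this harness, which is a quick sanity check that the order flipping is wired correctly.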
Common Challenges
Building RLHF from scratch surfaces a set of failure modes that no high-level tutorial explains. Here are the ones that will cost you the most time.
Challenge 1: PPO reward collapse. The reward score rises for a few hundred steps, then collapses to near zero and never recovers. Root cause: the policy has shifted too far from the reference, making rollout completions out-of-distribution for the reward model, which then assigns near-zero or negative scores. Fix: increase the KL coefficient so the penalty pulls the policy back toward the reference, add entropy regularization to the PPO objective, and always check that the KL divergence is staying below 5–10 nats.
Challenge 2: Reward model overfitting to surface features. The reward model achieves 95% preference accuracy on training data but only 55% on held-out data. Root cause: preferred responses in the training set happen to be longer, or to start with "Certainly!", and the reward model learns this rather than quality. Fix: balance response lengths between chosen and rejected pairs, and run your reward model on adversarial examples (short chosen, long rejected) to verify it has not learned a length heuristic.
Challenge 3: DPO loss going to zero on hard preference pairs. After a few thousand steps, the DPO loss reaches near-zero but the model has not improved — it has simply assigned very high confidence to the chosen response and very low to the rejected one. Root cause: unconstrained DPO loss can be minimized by making the ratio arbitrarily large, which over-concentrates the policy. Fix: use IPO instead of DPO, or add a regularization term that penalizes log-probability ratios that exceed a threshold.
Challenge 4: Four-model memory pressure. Keeping policy, reference, reward model, and value model simultaneously in GPU memory exceeds the budget of a single consumer GPU. Root cause: the PPO algorithm genuinely requires all four models active during the rollout-and-update phase. Fix: use LoRA for the policy and value model (reducing their memory footprint by 4–8×), offload the reference model to CPU and copy activations on demand, or reduce batch size aggressively.
Challenge 5: Advantage estimation instability. Advantages fluctuate wildly across batches, making PPO updates noisy and slow to converge. Root cause: the value model is not well-trained at the start of PPO, so its baseline estimates are poor. Fix: pre-train the value model on SFT rollouts with mean squared error loss before the PPO loop begins.
Challenge 6: Win rate not matching reward score. The PPO-aligned model scores significantly higher than the SFT baseline on the reward model, but the win rate in head-to-head evaluation is only slightly above 50%. Root cause: the reward model has been partially gamed — the policy has learned to produce responses that score highly on the reward model's heuristics without being meaningfully better. Fix: evaluate win rate with a judge that is independent of the reward model (a stronger LLM or human annotators), and treat divergence between the two metrics as a red flag.
Challenge 7: Tokenization mismatch between models. The reward model assigns scores that do not align with the policy's token positions, causing incorrect advantage assignment. Root cause: using different tokenizers or padding strategies for the policy and reward model. Fix: standardize on a single tokenizer and always left-pad sequences when computing reward scores on rollout batches, so that the final position the reward head reads is a real token rather than padding.
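The adversarial check from Challenge 2 (short chosen, long rejected) can be sketched as a small report over scored preference pairs. This is an illustrative helper with a hypothetical name and input format: each example is a tuple of token lengths and reward-model scores that you would compute beforehand.

```python
def length_bias_report(examples):
    """
    examples: iterable of (chosen_len, rejected_len, r_chosen, r_rejected).
    Returns preference accuracy overall and on the two length-split subsets.
    """
    def accuracy(rows):
        if not rows:
            return float("nan")
        return sum(rc > rr for _, _, rc, rr in rows) / len(rows)

    rows = list(examples)
    chosen_shorter = [r for r in rows if r[0] < r[1]]  # the adversarial subset
    chosen_longer = [r for r in rows if r[0] > r[1]]
    return {
        "overall": accuracy(rows),
        "chosen_shorter": accuracy(chosen_shorter),
        "chosen_longer": accuracy(chosen_longer),
    }
```

A large gap between `chosen_longer` and `chosen_shorter` suggests the reward model is leaning on length rather than quality, even when the overall accuracy looks acceptable.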
Solving these issues took us 16 hours of testing - the course walks you through each fix with working code.
Ready to Build This Yourself?
Understanding architecture is not the same as shipping code. The gap between "I know how PPO works" and "I have a running alignment pipeline that I understand well enough to debug" is exactly what this course closes.
The Build RLHF and DPO from Scratch course on labs.codersarts.com gives you everything you need:
✅ Full source code for every phase — SFT, reward model, PPO loop, DPO, IPO
✅ 20 structured lessons walking through each implementation decision
✅ Supervised fine-tuning baseline with prompt masking and learning-rate schedule
✅ Preference dataset tooling — generation, curation, and format validation
✅ Reward model with scalar head trained on Bradley-Terry preference loss
✅ PPO loop with KL control, advantage estimation, and ratio clipping
✅ DPO and IPO implementations with mathematical derivations in the code comments
✅ Win-rate evaluation harness with reward hacking detection
✅ FastAPI deployment walkthrough — from local inference to a live endpoint
✅ Lifetime access to course updates as new alignment methods are added
✅ Community support and discussion forum for every implementation question
$29.99. Everything above.
Want a faster path to shipping? Book a 1:1 Guided Session at $99.99 — three live sessions with the Codersarts team to help you stand up the pipeline, debug PPO instability, tune your KL coefficient, and build your own preference dataset.
Conclusion
The RLHF pipeline is three stages working together: SFT to produce an instruction-following baseline, reward modeling on human preference pairs to learn a scoring signal, and PPO with KL control to optimize the policy while keeping reward hacking in check. DPO and IPO collapse the last two stages into a single training step by deriving the optimal policy analytically.
The right order to work through this is the same order the pipeline runs: start with the SFT baseline and make sure it can follow instructions, then train the reward model and validate its preference accuracy on held-out data, then implement PPO with conservative KL settings before tuning aggressively. Add DPO as a comparison once PPO is stable — its simplicity will make more sense against the backdrop of having implemented PPO first.
If you want to go from architecture to working, tested code, the full course is at labs


