How to Build DeepSeek-R1 from Scratch - GRPO Reinforcement Learning Explained

  • May 14
  • 13 min read


Introduction: The Techniques Behind 2026's Most Impressive AI Systems Are Still a Black Box

If you have spent the last several months watching reasoning models outperform every benchmark in sight, you have probably asked the same question most ML engineers ask: how, exactly, does this work?


The honest answer — after reading every public paper, blog post, and GitHub repo — is that most explanations stop precisely where the interesting part begins. You get the high-level pitch ("we used GRPO with verifiable rewards") and a pointer to a library you are supposed to just wrap. The actual mechanics — how rollouts are batched at scale, how the group-relative advantage is computed, why the KL coefficient breaks training when you set it wrong, what the "aha moment" looks like in a real training log — are conspicuously absent.


GRPO (Group Relative Policy Optimization) is the algorithm that powers DeepSeek-R1, and it is arguably the most important RL technique to emerge in applied AI in years. This post is a complete technical walkthrough of how to implement it from scratch in PyTorch — no TRL wrappers, no OpenRLHF abstractions — covering real-world use cases including:


  • Training a domain-specific reasoner for medical diagnostics, legal research, or financial analysis

  • Replicating R1-Zero on a small base model (1.5B / 3B / 7B) for research or thesis projects

  • Building R1-style code reasoning models for software engineering assistants

  • Converting a base LM into a reasoner using verifiable rewards, with no preference data required

  • Adding chain-of-thought reasoning capabilities to an existing SFT-tuned chatbot

  • Conducting rigorous RL algorithm comparisons (GRPO vs PPO vs DPO on the same task)


We will cover the core architecture, the full tech stack, implementation phases, and the failure modes you will hit in production. What we will not cover here is the full source code — that lives in the complete course on labs.codersarts.com, which ships with working, tested implementations of every component described below.


📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]

How It Works: The Core Concept Behind GRPO

Why Naive Fine-Tuning Fails for Reasoning

Standard supervised fine-tuning (SFT) teaches a model to imitate: it learns to reproduce input-output pairs. If you want your model to reason, you can collect thousands of chain-of-thought examples and fine-tune on them (when the examples come from a stronger model, this is called distillation). The problem: the model can learn to look like it is reasoning without learning how to reason. Under distribution shift (novel problem types, slightly rephrased questions) it fails in exactly the ways a student who memorised answers fails on a final exam.


Reinforcement learning sidesteps this. Instead of telling the model what the right answer is, you tell it whether its answer was correct after it tries. This single change — shifting from imitation to outcome-driven exploration — is what produces the emergent self-verification behavior that DeepSeek-R1 became famous for.

Why PPO Is Complicated and GRPO Is Elegant

The standard RL algorithm for LLMs, PPO, requires a separate value network (a "critic") to estimate how good each state is. That is typically a second trainable model of about the same size as your policy, which roughly doubles the memory needed for weights, gradients, and optimizer state, and brings its own instabilities.


GRPO removes the value network entirely. Instead of estimating value from a separate model, it computes advantage directly from a group of completions for the same prompt. For each prompt, you sample N=8–64 completions, score every one with a rule-based reward function, then compute the group mean and standard deviation. The advantage of completion i is simply (r_i − mean(r)) / std(r). Completions that score above the group average have positive advantage; those below have negative. No critic required.


The analogy: imagine grading on a strict curve. A score of 80 means something very different depending on whether the class average was 75 or 95. GRPO uses this relative signal to weight the policy gradient: completions above the group average are made more likely, and those below are made less likely. This is why it is called group relative policy optimization.
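The curve analogy can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `group_advantages` and the reward values are made up for this example, and it assumes the population standard deviation (some implementations use the sample standard deviation instead).

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Normalise rewards within one prompt group: (r_i - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std over the group
    return [(r - mu) / sigma for r in rewards]

# The same raw reward of 0.8 appears in two hypothetical groups.
weak_group = [0.8, 0.2, 0.4, 0.6]    # 0.8 is the best completion here
strong_group = [0.8, 1.0, 0.9, 0.9]  # 0.8 is the worst completion here

adv_in_weak = group_advantages(weak_group)[0]      # above the group mean
adv_in_strong = group_advantages(strong_group)[0]  # below the group mean
print(adv_in_weak > 0, adv_in_strong < 0)
```

The same raw score pushes the policy in opposite directions depending on how the rest of the group performed, which is the whole point of the group-relative signal.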

ASCII Data-Flow Diagram

                        GRPO TRAINING PIPELINE

                        ══════════════════════
  ┌────────────────────────────────────────────────────────────┐
  │                     ROLLOUT PHASE                          │
  │                                                            │
  │  Prompt Batch ──► vLLM Sampler ──► N Completions/Prompt    │
  │                     (N=8–64)         per group             │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │                    REWARD PHASE                            │
  │                                                            │
  │  Each completion ──► Rule-based Reward Functions           │
  │                        ├── Accuracy reward (+1.0)          │
  │                        └── Format reward  (+0.1)           │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │                  ADVANTAGE PHASE                           │
  │                                                            │
  │  Group rewards ──► mean(r), std(r)                         │
  │                ──► advantage_i = (r_i − mean) / std        │
  │                    (no value network needed)               │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
  ┌────────────────────────────────────────────────────────────┐
  │                    LOSS PHASE                              │
  │                                                            │
  │  Policy log-probs + advantages ──► Clipped surrogate loss  │
  │  Reference model log-probs     ──► KL divergence penalty   │
  │  Combined loss                 ──► Backprop + update       │
  └────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    Updated Policy (repeat)

System Architecture Deep Dive

Architecture Overview

A GRPO training system has five distinct layers, each with its own concerns and failure modes.


Rollout Layer (vLLM): This is where the policy generates completions. For each prompt in your batch, vLLM samples N completions using batched inference. This is the most GPU-intensive part of the pipeline — expect rollouts to consume 70–80% of total training wall-clock time if you are not careful. vLLM's PagedAttention and continuous batching reduce memory overhead dramatically compared to naive HuggingFace generation.


Policy + Reference Model Layer (HuggingFace Transformers): The policy is the LM being trained. The reference model is a frozen copy of the policy at the start of training (or at the start of each epoch). The KL penalty between policy and reference prevents the policy from drifting too far from its pre-trained distribution — the key stabilisation mechanism in the entire system.


Reward Layer (NumPy + Python): Rule-based reward functions score each completion. These are deliberately simple: a regex or symbolic verifier checks whether the final answer is correct; a format checker confirms the <think>...</think><answer>...</answer> structure is present. The simplicity is a feature — complex learned reward models can be hacked.
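A format checker of the kind described above might look like the following. This is a minimal sketch, not the course's implementation: the pattern, the strictness rules (exactly one of each tag, nothing outside the two blocks), and the name `format_reward` are assumptions chosen for illustration.

```python
import re

# Assumed structure: one <think> block immediately followed by one <answer> block.
FORMAT_RE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)
TAGS = ("<think>", "</think>", "<answer>", "</answer>")

def format_reward(completion: str, bonus: float = 0.1) -> float:
    """Return `bonus` if the completion has the expected tag structure, else 0."""
    # Reject duplicated or nested tags before running the regex.
    if any(completion.count(tag) != 1 for tag in TAGS):
        return 0.0
    return bonus if FORMAT_RE.fullmatch(completion) else 0.0

print(format_reward("<think>2 + 2 = 4</think><answer>4</answer>"))
print(format_reward("The answer is 4."))
```

How strict to be is a design choice: a looser check gives the model an easier early signal, while a stricter one leaves fewer loopholes to exploit.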


Advantage + Loss Layer (PyTorch): Advantage computation is pure NumPy (group statistics per prompt). The GRPO loss is implemented in PyTorch — a clipped policy-ratio surrogate (identical in structure to PPO's) with the group-relative advantages as the weighting signal, plus a KL penalty term.


Training Orchestration Layer (Accelerate + W&B): Accelerate handles multi-GPU distribution. Weights & Biases (optional) logs training metrics including the reward trajectory and the point at which self-verification phrases start appearing in completions — what the DeepSeek team called the "aha moment."

Component Table

| Component | Role | Technology Options |
| --- | --- | --- |
| Policy model | LM being fine-tuned | Qwen2.5-1.5B/3B/7B, Llama-3.2, Phi-3 |
| Reference model | KL penalty anchor (frozen) | Same as policy, optionally 4-bit via bitsandbytes |
| Rollout engine | Fast parallel completions | vLLM (primary), TGI (alternative) |
| Reward functions | Score completions | Python + regex, SymPy for math, unit tests for code |
| Advantage computation | Group-relative normalisation | NumPy, pure Python |
| GRPO loss | Policy gradient objective | PyTorch (from scratch) |
| Training loop | Gradient updates | PyTorch + HuggingFace Trainer or custom loop |
| Multi-GPU support | Distributed training | Accelerate (primary), DeepSpeed (alternative) |
| Observability | Metrics, logs, aha-moment detection | W&B, TensorBoard |

Data Flow Walkthrough

  1. A batch of reasoning prompts (math, code, logic problems) is sampled from the dataset.

  2. The current policy weights are synced to the vLLM engine (this is non-trivial — see Challenges section).

  3. vLLM generates N=8–64 completions per prompt using temperature sampling.

  4. Each completion is passed to the reward function pipeline. The accuracy reward runs a symbolic or regex verifier against the ground-truth answer. The format reward checks structural compliance.

  5. Rewards are collected per group (all N completions for a single prompt). Group mean and std are computed. Per-completion advantages are computed as (r_i − mean) / std.

  6. The policy's log-probabilities for the completion tokens are computed (forward pass).

  7. The reference model's log-probabilities for the same tokens are computed (frozen forward pass).

  8. The GRPO clipped-surrogate loss is computed: −min(ratio × advantage, clip(ratio, 1−ε, 1+ε) × advantage) + β × KL.

  9. The loss is backpropagated. The optimiser updates the policy.

  10. Training metrics are logged. The completions are inspected for "aha moment" phrases.
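Steps 5 to 8 above can be sketched as a single loss function. This is a hedged sketch under stated assumptions, not the course's implementation: per-token log-probabilities are assumed to be gathered already; the ratio is taken against the log-probs recorded when the rollouts were sampled (`old_logp`), as in PPO; the KL term uses the non-negative estimator exp(d) − d − 1 with d = ref − policy; the loss is averaged over completion tokens; and `eps=0.2`, `beta=0.04` are illustrative defaults.

```python
import torch

def grpo_loss(policy_logp, old_logp, ref_logp, advantages, mask,
              eps=0.2, beta=0.04):
    """
    policy_logp, old_logp, ref_logp: (batch, seq) log-probs of the sampled tokens.
        old_logp and ref_logp are assumed to carry no gradient.
    advantages: (batch,) one group-relative advantage per completion.
    mask: (batch, seq) 1.0 for completion tokens, 0.0 for prompt/padding.
    """
    ratio = torch.exp(policy_logp - old_logp)
    adv = advantages.unsqueeze(1)  # broadcast the sequence advantage over tokens
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

    # Per-token KL estimate between policy and reference (always >= 0).
    log_diff = ref_logp - policy_logp
    kl = torch.exp(log_diff) - log_diff - 1

    per_token = -surrogate + beta * kl
    return (per_token * mask).sum() / mask.sum()

# Tiny smoke example with made-up numbers.
logp = torch.full((2, 3), -1.0, requires_grad=True)
loss = grpo_loss(logp, logp.detach(), logp.detach() - 0.5,
                 torch.tensor([1.0, -1.0]), torch.ones(2, 3))
loss.backward()
```

Whether to normalise per token or per sequence is a real design decision; the token-level mean used here is only one option.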

Two Non-Obvious Design Decisions

Why rule-based rewards instead of a reward model? Rule-based rewards are verifiable. A math answer is either symbolically correct or it is not. A code completion either passes the unit tests or it does not. Learned reward models introduce their own failure modes — they can be gamed in ways that look correct to the reward model but are semantically wrong. For tasks where ground truth is checkable, rule-based rewards are strictly superior and far cheaper.


Why is the reference model frozen at initialisation (not updated)? An adaptive reference model would let the policy drift arbitrarily, eventually collapsing into reward-hacking behaviour (outputting gibberish that scores well on the format reward). The frozen reference acts as a soft constraint: the KL penalty grows as the policy moves away from the pretrained distribution, counteracting excessive policy drift. This is analogous to the elastic regularisation in continual learning — it preserves capabilities while permitting task-specific improvement.




Tech Stack Recommendation

Stack A — Beginner / Prototype (Weekend Build)

| Layer | Technology | Why |
| --- | --- | --- |
| Language | Python 3.10 | Universal ML ecosystem support |
| Base model | Qwen2.5-1.5B | Fits on a single A100 40GB |
| Policy framework | HuggingFace Transformers | Familiar API, easy checkpointing |
| Rollout | HuggingFace generate() (batched) | No vLLM setup required for prototyping |
| Reward | Pure Python + regex | Zero dependencies |
| Training | Single-GPU PyTorch loop | Simple, debuggable |
| Logging | TensorBoard | Built into PyTorch |


Estimated cost: A single A100 40GB (Vast.ai or Lambda Labs) costs roughly $1.20–$2.00/hr. A prototype training run on 1.5B with a small dataset completes in 8–16 hours. Total cost: $10–$30 per run.

Stack B — Production-Ready (Designed to Scale)

| Layer | Technology | Why |
| --- | --- | --- |
| Language | Python 3.10+ | |
| Base model | Qwen2.5-7B or Llama-3.1-8B | Stronger base reasoning capability |
| Policy framework | HuggingFace Transformers 4.x | |
| Rollout | vLLM 0.4+ | 10–30× faster generation via PagedAttention |
| Reference model | 4-bit via bitsandbytes | Cuts reference model weight memory to roughly a quarter of fp16 |
| Multi-GPU | Accelerate + DeepSpeed ZeRO-2 | Shards optimizer states across GPUs |
| Reward | Python + SymPy + subprocess (code exec) | Symbolic math + sandboxed code eval |
| Loss | Custom PyTorch (from scratch) | Full control over clipping and KL scheduling |
| Logging | Weights & Biases | Aha-moment tracking, comparison dashboards |
| Experiment config | Hydra | Reproducible hyperparameter sweeps |


Estimated cost: 4× A100 80GB node (Lambda Labs or CoreWeave) at ~$8–12/hr. A full training run on 7B with 50K problems: 40–80 hours. Total cost: $320–$960 per experiment.




Implementation Phases

Phase 1: Dataset and Reward Function Engineering

The first step is not writing any training code — it is getting your reward functions right. Choose a dataset with verifiable ground truth: GSM8K, MATH, HumanEval, or a domain-specific set you have labelled. Implement two reward functions: an accuracy reward (symbolic verifier or regex extraction of the final answer) and a format reward (check for <think> and <answer> tags). Test both reward functions exhaustively before touching the training loop. Reward bugs are silent — a subtly incorrect verifier will train your model toward wrong answers with perfect confidence.
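To make the accuracy reward concrete, here is one possible regex-based verifier for numeric answers. It is a sketch under assumptions: the helper names are hypothetical, the answer is taken from the last `<answer>` block, and numbers are compared exactly as fractions so that "0.50" and "1/2" count as equal. A symbolic verifier (for example with SymPy) would replace `to_number` for non-numeric answers.

```python
import re
from fractions import Fraction

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def extract_answer(completion):
    """Return the text of the last <answer> block, or None if there is none."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1].strip() if matches else None

def to_number(text):
    """Parse '1,234', '0.5' or '1/2' into an exact Fraction; None if not numeric."""
    try:
        return Fraction(text.replace(",", "").replace("$", "").strip())
    except (ValueError, ZeroDivisionError):
        return None

def accuracy_reward(completion, ground_truth):
    """1.0 if the extracted answer matches the ground truth, else 0.0."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    pred_num, true_num = to_number(predicted), to_number(ground_truth)
    if pred_num is not None and true_num is not None:
        return 1.0 if pred_num == true_num else 0.0
    # Fall back to exact string match for non-numeric answers.
    return 1.0 if predicted == ground_truth.strip() else 0.0

print(accuracy_reward("<think>half</think><answer>0.50</answer>", "1/2"))
```

Each branch here (missing tags, unparseable numbers, string fallback) is a place where a silent reward bug can hide, so each deserves its own unit test.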


Key technical decisions: How strict is the format check? Do you reward partial credit for correct reasoning but wrong answer? How do you handle code execution timeouts in the code reward path?


Format reward edge cases — what to do when the model outputs partial tags, nested structures, or Unicode lookalikes — is covered in detail in the full course with working, tested code.

Phase 2: Rollout System with vLLM

Build the rollout pipeline: a function that takes a batch of prompts, generates N completions per prompt using vLLM, and returns the completions with their prompt IDs for reward computation. The central challenge here is keeping vLLM's internal model weights in sync with the HuggingFace policy weights after each gradient update. vLLM loads model weights independently; naive approaches either re-initialise vLLM after every step (too slow) or let it run stale weights (breaks training).
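The shape of that rollout function can be sketched without committing to a specific engine. In the sketch below, `sample_fn` is a hypothetical stand-in for whatever actually generates text (in practice a batched vLLM call rather than a per-prompt loop), and the record layout is an assumption made for illustration.

```python
from typing import Callable, Dict, List

def collect_rollouts(prompts: List[str],
                     sample_fn: Callable[[str, int], List[str]],
                     n: int = 8) -> List[Dict[str, object]]:
    """Sample n completions per prompt and tag each with its prompt ID."""
    records = []
    for prompt_id, prompt in enumerate(prompts):
        completions = sample_fn(prompt, n)
        if len(completions) != n:
            raise ValueError(f"expected {n} completions, got {len(completions)}")
        for text in completions:
            records.append(
                {"prompt_id": prompt_id, "prompt": prompt, "completion": text}
            )
    return records

def group_by_prompt(records):
    """Regroup flat records so rewards can be normalised per prompt."""
    groups = {}
    for rec in records:
        groups.setdefault(rec["prompt_id"], []).append(rec)
    return groups

# Stand-in sampler so the sketch runs without a GPU.
def fake_sampler(prompt, n):
    return [f"{prompt} -> draft {i}" for i in range(n)]

records = collect_rollouts(["2+2?", "3*3?"], fake_sampler, n=4)
groups = group_by_prompt(records)
```

Keeping the prompt ID on every record is what lets the reward and advantage stages stay independent of how the sampler batches its work.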


Key technical decisions: What is your N per group (≥8 for stable gradients, ≤64 for memory)? What temperature do you sample at? How do you handle vLLM out-of-memory errors mid-batch?
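The effect of N on gradient stability can be illustrated with a small simulation. This is a toy model, not a measurement from a real run: it assumes each completion is independently correct with probability `p_correct` (a made-up value) and measures how much the group mean, the baseline every advantage is computed against, fluctuates from group to group.

```python
import random
from statistics import mean, pstdev

def spread_of_group_mean(n, p_correct=0.3, trials=2000, seed=0):
    """Std of the group-mean reward across many simulated groups of size n."""
    rng = random.Random(seed)
    group_means = [
        mean(1.0 if rng.random() < p_correct else 0.0 for _ in range(n))
        for _ in range(trials)
    ]
    return pstdev(group_means)

small = spread_of_group_mean(4)
large = spread_of_group_mean(32)
print(f"N=4: {small:.3f}  N=32: {large:.3f}")
```

The baseline noise shrinks roughly with the square root of N, which is why larger groups trade rollout cost for steadier advantage estimates.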


The exact weight-sync mechanism between HuggingFace and vLLM — the single most under-documented engineering challenge in GRPO — is covered in detail in the full course with working, tested code.

Phase 3: Group Advantage Computation and GRPO Loss

Implement the group-relative advantage computation in NumPy: for each prompt group, compute mean(rewards) and std(rewards), then normalise. Handle the edge case where all completions in a group receive the same reward (std = 0 → advantage = 0 for all; skip gradient update for that group). Then implement the GRPO loss in PyTorch: compute the probability ratio between the current policy and the policy that generated the rollouts (the exponential of the log-probability difference), apply the clipped surrogate with your epsilon, weight by advantage, and add the KL penalty against the frozen reference model, scaled by your β coefficient.
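The zero-variance edge case is easy to get wrong, so here is a minimal NumPy sketch of the advantage step. Assumptions: rewards arrive as a flat array with each prompt's completions stored contiguously, the population standard deviation is used, and the function name and `eps` threshold are illustrative.

```python
import numpy as np

def grouped_advantages(rewards, group_size, eps=1e-6):
    """
    rewards: flat sequence of length num_prompts * group_size.
    Returns (advantages, keep) where advantages has shape
    (num_prompts, group_size) and keep[i] is False for groups to skip.
    """
    r = np.asarray(rewards, dtype=np.float64).reshape(-1, group_size)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)

    degenerate = (std < eps).squeeze(1)  # every completion scored the same
    safe_std = np.where(std < eps, 1.0, std)  # avoid division by zero
    adv = (r - mean) / safe_std
    adv[degenerate] = 0.0  # no learning signal from these groups
    return adv, ~degenerate

adv, keep = grouped_advantages([1, 0, 1, 0, 1, 1, 1, 1], group_size=4)
print(adv)
print(keep)
```

Returning the `keep` mask alongside the advantages lets the training loop drop uninformative groups before the forward pass instead of spending compute on zero gradients.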


Key technical decisions: What ε do you use for clipping (0.1–0.2 is typical)? What is the initial β for KL? Do you use a fixed β or an adaptive controller that targets a KL budget?
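An adaptive controller that targets a KL budget can be as simple as a clipped proportional update. The sketch below shows one possible shape, not a prescribed one; the class name and every default (`init_beta`, `target_kl`, `step_size`, `max_error`) are assumptions for illustration.

```python
class AdaptiveKLController:
    """Nudges beta up when measured KL exceeds the target, down when below."""

    def __init__(self, init_beta=0.04, target_kl=0.05,
                 step_size=0.1, max_error=0.2):
        self.beta = init_beta
        self.target_kl = target_kl
        self.step_size = step_size
        self.max_error = max_error

    def update(self, observed_kl):
        # Relative error versus the budget, clipped so one noisy batch
        # cannot swing beta too far.
        error = observed_kl / self.target_kl - 1.0
        error = max(-self.max_error, min(self.max_error, error))
        self.beta *= 1.0 + self.step_size * error
        return self.beta

controller = AdaptiveKLController()
beta_after_high_kl = controller.update(0.10)  # KL over budget -> beta rises
beta_after_low_kl = controller.update(0.01)   # KL under budget -> beta falls
```

Clipping the error term keeps the controller slow and stable, at the cost of reacting gradually when the policy drifts quickly.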


The numerical stability issues in the GRPO loss — log-sum overflow, advantage normalisation at the sequence vs token level — are covered in detail in the full course with working, tested code.

Phase 4: Multi-GPU Training Loop with Accelerate

Wrap your single-GPU training loop with Accelerate to distribute across multiple GPUs. The reference model requires special handling — it must stay frozen and is typically loaded in 4-bit to save memory. Implement gradient checkpointing on the policy to reduce activation memory. Set up your training scheduler (linear warmup + cosine decay), your logging (W&B or TensorBoard), and your checkpoint-saving logic.
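The linear warmup plus cosine decay schedule mentioned above reduces to a small function of the step count. This is a sketch: the function name and the `min_ratio` floor are assumptions, and the returned value is a multiplier on the base learning rate (the form expected by a lambda-style scheduler such as PyTorch's `LambdaLR`).

```python
import math

def lr_multiplier(step, warmup_steps, total_steps, min_ratio=0.0):
    """Linear warmup from 0 to 1, then cosine decay from 1 to min_ratio."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold at the floor after total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_ratio + (1.0 - min_ratio) * cosine

schedule = [lr_multiplier(s, warmup_steps=10, total_steps=100)
            for s in (0, 5, 10, 55, 100)]
print(schedule)
```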


Key technical decisions: DeepSpeed ZeRO-2 vs ZeRO-3? (ZeRO-3 shards model weights but complicates vLLM sync.) Gradient accumulation steps? Mixed precision: bf16 or fp16?


Configuring Accelerate + vLLM + DeepSpeed to co-exist without silent deadlocks is covered in detail in the full course with working, tested code.

Phase 5: Aha Moment Detection and Evaluation

The final phase is evaluation and observability. Implement a logging hook that scans completions for self-verification phrases: "Wait, let me check that again," "Actually, I made an error," "Let me reconsider." These are the hallmarks of the "aha moment" — the point in training where the model begins spontaneously auditing its own reasoning. Compare your trained model against a PPO baseline and a DPO baseline on your held-out test set. Export your final checkpoint in a chat-ready format.
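A logging hook of this kind could start as a simple phrase matcher. The sketch below is deliberately naive: the pattern list only covers the phrases quoted above (with minor variations), the names are hypothetical, and a rising match rate is a rough proxy for self-verification, not proof of it.

```python
import re

# Illustrative patterns based on the phrases quoted in the text.
AHA_PATTERNS = [
    r"\bwait,? let me check\b",
    r"\bactually,? i made an? (error|mistake)\b",
    r"\blet me reconsider\b",
]
AHA_RE = re.compile("|".join(AHA_PATTERNS), re.IGNORECASE)

def aha_rate(completions):
    """Fraction of completions containing at least one self-verification phrase."""
    if not completions:
        return 0.0
    hits = sum(1 for text in completions if AHA_RE.search(text))
    return hits / len(completions)

batch = [
    "Wait, let me check that again. 12 * 12 = 144.",
    "The answer is 4.",
    "Actually, I made an error in step 2.",
    "Done.",
]
rate = aha_rate(batch)
print(rate)
```

Logging this rate per training step gives a curve you can line up against the reward trajectory.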


Key technical decisions: How do you define "aha moment" precisely enough to log it automatically? What test sets do you use for evaluation? How do you export and serve the trained reasoner?


Automated aha-moment detection metrics, along with annotated training logs showing the exact epoch where emergence appears, are included in the full course.




Common Challenges

Building a GRPO training system from scratch surfaces a handful of issues that no tutorial blog post will warn you about. Here are the ones that will cost you the most time.


1. The vLLM–HuggingFace weight sync problem. After each gradient update, your HuggingFace policy has new weights but vLLM is still running the old ones. Simply calling generate() on the existing vLLM engine after a PyTorch optimiser step uses stale weights. Fixing this requires either re-initialising the vLLM engine (expensive) or using vLLM's llm.llm_engine.model_executor.driver_worker.model_runner.model.load_weights() API (undocumented, version-sensitive). Many GRPO attempts silently fail here: the loss looks like it is decreasing, but the rollout policy is not actually updating.


2. Reward hacking on the format reward. Given a chance, the model will learn that outputting extremely long <think> sections trivially earns the format reward, regardless of content quality. Fix: cap the format reward at a small value (+0.1) relative to the accuracy reward (+1.0), and penalise completions that exceed a token length threshold.


3. KL coefficient instability. A β that is too high causes the policy to barely move (the KL penalty overwhelms the reward signal). Too low and the policy collapses into reward-hacking. The standard fix is an adaptive KL controller that targets a KL budget: if the current KL exceeds the target, increase β; if it is below, decrease it. This turns a brittle scalar into a self-regulating system.


4. High variance from small group sizes. With N < 8, the group mean and std are dominated by noise. A single lucky or unlucky completion skews the advantage estimates across the group. Use N ≥ 8; for critical experiments, N = 16–32. This increases rollout cost but dramatically stabilises training.


5. Reference model memory pressure. A frozen 7B model at fp16 consumes ~14GB of GPU memory. Loading it in 4-bit via bitsandbytes reduces this to ~4GB, freeing memory for larger batch sizes and longer rollout sequences. Make sure you freeze all reference model parameters — a single unfrozen layer causes subtle gradient contamination.


6. The "aha moment" does not arrive on schedule. Emergence is unpredictable. Some training runs show self-verification at epoch 3; others require 20+ epochs on the same task. The intervention: ensure your reward functions are actually discriminative (if 80% of completions get the same reward score, the model has no signal to improve). Increasing dataset diversity and prompt complexity tends to accelerate emergence.


7. Gradient explosion on long reasoning chains. Completions with 1,000+ tokens produce very large gradient norms. Clip gradients aggressively (max_grad_norm = 0.5–1.0) and monitor the gradient norm closely in early training.
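Gradient clipping from item 7 is a one-line call in PyTorch. The toy example below uses a tiny linear layer and deliberately large inputs as a stand-in for a long reasoning chain; the model, data, and `max_norm=1.0` are illustrative assumptions. `clip_grad_norm_` returns the total norm measured before clipping, which is the value worth logging.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Large inputs produce large gradients, standing in for long completions.
x = torch.full((8, 4), 100.0)
loss = model(x).pow(2).mean()
loss.backward()

# Rescale all gradients in place so their combined L2 norm is at most max_norm.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
norm_after = torch.sqrt(
    sum(p.grad.pow(2).sum() for p in model.parameters())
)
print(float(norm_before), float(norm_after))

optimizer.step()
optimizer.zero_grad()
```

If the logged pre-clip norm sits far above the threshold for many steps, clipping is hiding a problem (learning rate, sequence length, or loss scaling) rather than solving it.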


Solving these issues took us over 200 hours of testing and iteration — the full course walks you through each fix with working, annotated code and the exact hyperparameter settings we converged on.




Ready to Build This Yourself?

Understanding the architecture is the easy part. Turning it into a working training run — one that actually produces an "aha moment," that doesn't silently fail because of a weight-sync bug, that trains stably across 40+ hours without KL collapse — is a different challenge entirely.


The GRPO and R1-Style Reasoning Training from Scratch course on labs.codersarts.com gives you everything you need to go from this article to a trained, exportable reasoning model:


  • Complete source code — every component described in this post, fully implemented and tested

  • Video walkthroughs — line-by-line explanation of the GRPO loss, rollout system, and reward engineering

  • Training checkpoints — model weights saved at every major phase so you can start from any stage

  • Annotated "aha moment" logs — real W&B and terminal outputs showing emergence in action

  • Side-by-side PPO and DPO baselines — run all three algorithms on the same task for direct comparison

  • GPU budget planning guide — cost estimates and configuration recommendations for 1.5B, 3B, and 7B models

  • Lifetime access + updates — as vLLM, Accelerate, and HuggingFace evolve, so does the course

  • Community support — ask questions, share training runs, get feedback from other researchers and engineers


$29.99. Everything above.






Need more than a self-paced course? Book a 1:1 Guided Session ($99.99) — three live hours with a senior ML engineer who will review your training run, help you debug failures, and help you plan your GPU budget for your specific use case. Book a session → labs.codersarts.com




Conclusion

GRPO is a surprisingly clean algorithm: sample a group of completions, compute their relative merit, and nudge the policy toward the better ones while keeping it anchored to its pre-trained distribution. The conceptual simplicity is deceptive — the engineering complexity lives in the rollout pipeline, the weight-sync mechanism, the reward function design, and the KL scheduling. These are tractable problems with known solutions, but they require hands-on implementation experience to navigate.


If you are starting from scratch, the simplest viable path is: Qwen2.5-1.5B + single A100 + HuggingFace generate() for rollouts + pure Python reward functions. This will not scale to 7B or to large datasets, but it will get you a working training loop in a weekend and teach you every component from first principles.


From there, the full course on labs.codersarts.com takes you the rest of the way — production rollouts with vLLM, multi-GPU training with Accelerate, and a trained reasoning model you can actually deploy.


 
 
 
