
How to Build a Chain-of-Thought Reasoning Model from Scratch (PyTorch + GSM8K)

  • 22 hours ago
  • 11 min read


1. The Problem: Your LLM Gives Confident, Fluent, Wrong Answers

You feed a language model a math word problem. It returns an answer instantly — beautifully formatted, grammatically perfect, completely wrong. No working. No intermediate steps. Just a number, stated with the quiet confidence of a student who copied from the back of the book.


This is not a corner case. It is the default behaviour of any model trained purely on next-token prediction. The model has learned to pattern-match to answers, not to reason toward them. Multi-step arithmetic, logical deductions, constraint-satisfaction puzzles — these all require holding intermediate state across many tokens, and standard generation provides no mechanism for that.


What this blog covers: We walk through the full architecture of a Chain-of-Thought (CoT) reasoning engine — from zero-shot prompting all the way to Tree-of-Thoughts search and supervised fine-tuning on rationales. You will see exactly how each component works and why each design decision was made.


What this blog does not cover: The full, tested, runnable source code. That lives in the Chain-of-Thought Reasoning Models from Scratch course on labs.codersarts.com.


Real-world use cases this architecture unlocks:


  • Math tutors that walk students through every solution step (EdTech)

  • Logic puzzle solvers that show their reasoning, not just their answer

  • Domain-specific reasoners for medical triage, legal research, and technical support

  • Educational assistants that scaffold thinking rather than handing over answers

  • Auditable AI for regulated industries where reasoning traces must be reviewable

  • Coding assistants that reason about edge cases before generating code


📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]

2. How It Works: Chain-of-Thought Reasoning, from First Principles

The Core Idea

The key insight behind Chain-of-Thought prompting is disarmingly simple: if you ask a model to write out its reasoning before giving an answer, it gets the right answer far more often. Wei et al. (2022) showed that prepending a few worked examples with explicit step-by-step reasoning dramatically improved accuracy on multi-step math, and Kojima et al. (2022) found that simply appending "Let's think step by step" achieves much of the same effect in a zero-shot prompt, no fine-tuning required.


Why does this work? Think of it like showing your work on an exam. The act of writing intermediate steps forces the generation process to stay coherent across many tokens. Each step becomes a "scratchpad" token that the next step can attend to. The model is not smarter — it has simply been given more surface area to reason on.

Why Naive Prompting Fails

A direct prompt ("What is 37 × 48?") gives the model one shot: predict the answer token. There is no room for partial computation. For problems requiring four or five arithmetic sub-steps, the probability of every step being correct compounds badly. Miss one step and the answer is wrong — silently, confidently wrong.
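This compounding is easy to quantify with a back-of-envelope sketch (the per-step success rates below are illustrative, not measured values):

```python
# If each sub-step is independently correct with probability p,
# a d-step solution is fully correct with probability p**d.
# These per-step rates are illustrative, not measured values.
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"{d} steps: {p**d:.2f}" for d in (3, 5, 8))
    print(f"p={p:.2f} -> {row}")
```

Even at 95% per-step accuracy, a five-step problem comes out fully correct only about 77% of the time, which is precisely the gap that self-consistency voting later attacks.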

How This Architecture Solves It

The system wraps every problem in a structured prompt that elicits step-by-step reasoning. The model generates multiple reasoning traces in parallel (temperature sampling). A regex-based extractor pulls the final answer from each trace. A majority-vote module picks the most common answer. For hard problems, Tree-of-Thoughts (ToT) replaces linear generation with a search over a tree of reasoning branches.

Pipeline at a Glance

╔══════════════════════════════════════════════════════════════╗
║                      SETUP PHASE                             ║
║  Load base LM (HuggingFace)                                  ║
║  Build prompt templates (zero-shot, few-shot exemplars)      ║
║  Load evaluation datasets (GSM8K, MATH, StrategyQA)          ║
╠══════════════════════════════════════════════════════════════╣
║                      RUNTIME PHASE                           ║
║                                                              ║
║  User problem                                                ║
║       │                                                      ║
║       ▼                                                      ║
║  CoT Prompt Template                                         ║
║       │  (wraps problem in "think step by step" structure)   ║
║       ▼                                                      ║
║  Base LM  ──► N parallel generations (temperature > 0)       ║
║       │          Trace 1: "Step 1: ... Step 2: ... = 42"     ║
║       │          Trace 2: "Step 1: ... Step 3: ... = 42"     ║
║       │          Trace 3: "Step 1: ... Step 2: ... = 41"     ║
║       ▼                                                      ║
║  Regex Answer Extractor  (pulls final numeric / text answer) ║
║       │                                                      ║
║       ▼                                                      ║
║  Majority Vote  ──► {42: 2 votes, 41: 1 vote} → 42           ║
║       │                                                      ║
║  [Hard problem?] → Tree-of-Thoughts BFS/DFS                  ║
║       │                                                      ║
║       ▼                                                      ║
║  Final answer + full reasoning trace                         ║
╚══════════════════════════════════════════════════════════════╝

Analogy: Self-consistency voting is like asking ten colleagues to solve a problem independently and then taking the majority answer. One person might make a careless error; the group almost never all makes the same error.
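In code, the vote module from the diagram is small. A minimal sketch using collections.Counter (the function name and the None-handling convention are ours, not a fixed API):

```python
from collections import Counter

def majority_vote(answers):
    """Return the plurality answer and its winning fraction.

    `answers` holds one extracted answer per trace; None entries
    (failed extractions) are ignored.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    winner, count = Counter(valid).most_common(1)[0]
    return winner, count / len(valid)

# The scenario from the diagram: two traces say 42, one says 41.
winner, confidence = majority_vote(["42", "42", "41"])
print(winner, confidence)  # winner '42' with a 2/3 winning fraction
```

The winning fraction doubles as a cheap confidence score: a 9/10 vote and a 4/10 plurality are very different situations, even though both return an answer.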




3. System Architecture Deep Dive

Architecture Overview

The system has five distinct layers:


Prompt Layer — Constructs the input to the model. Handles zero-shot ("Let's think step by step"), few-shot (pre-selected worked examples prepended to the problem), and task-specific templates.


Generation Layer — Runs the base language model via HuggingFace transformers. Controls temperature, top-p, max new tokens, and the number of parallel samples (N).


Extraction Layer — Uses regex patterns to pull structured answers from free-form model output. Handles common phrasings ("the answer is", "= ", "#### ") and normalises numeric formatting.


Voting / Search Layer — Implements self-consistency majority voting for parallel traces, and Tree-of-Thoughts BFS/DFS for structured search over reasoning branches.


Evaluation & Fine-Tuning Layer — Runs accuracy benchmarks on GSM8K, MATH, and StrategyQA. Implements CoT Supervised Fine-Tuning (SFT) where the model is trained on (problem, rationale, answer) triples with question tokens masked.

Component Table

| Component | Role | Technology Options |
| --- | --- | --- |
| Base language model | Core generation | GPT-2, LLaMA-3.2-1B, Mistral-7B (via HuggingFace) |
| Tokeniser | Encode/decode text | HuggingFace AutoTokenizer |
| Prompt template engine | Build structured prompts | Custom Python dataclass, Jinja2, LangChain PromptTemplate |
| Sampling pipeline | Generate N parallel traces | model.generate() with do_sample=True, num_return_sequences=N |
| Answer extractor | Parse final answer from text | re module, task-specific regex patterns |
| Majority vote module | Aggregate answers across traces | collections.Counter, custom normalisation |
| ToT search engine | BFS/DFS over reasoning tree | Custom Python class, optional beam search wrapper |
| Evaluation harness | Benchmark accuracy | Custom eval loop, HuggingFace Datasets, Pandas |
| Fine-tuning trainer | CoT-SFT on rationale data | HuggingFace Trainer, custom loss masking |
| Plotting | Accuracy comparison charts | Matplotlib |

Data Flow: Step-by-Step

  1. User provides a problem — a plain-text math word problem or logic puzzle.

  2. Prompt builder wraps the problem using the selected template (zero-shot or few-shot). Few-shot mode prepends 4–8 hand-curated worked examples.

  3. Tokeniser encodes the full prompt to input IDs.

  4. Model generates N traces — with temperature > 0 and num_return_sequences=N, the model produces N independent continuations in a single batched generate call.

  5. Answer extractor applies regex patterns to each trace to pull the final answer token(s). Normalises floats, fractions, and text.

  6. Majority vote module counts occurrences of each answer and returns the plurality winner, along with a confidence score (winning fraction).

  7. For hard problems, the ToT engine replaces steps 4–6 with a tree search: the LM proposes K candidate next steps at each node, a value function (another LM call, or a heuristic) scores them, and the search expands the best branches until a terminal state is reached.

  8. The final answer and the winning reasoning trace are returned to the user.

Non-Obvious Design Decisions

Decision 1: Mask question tokens during CoT-SFT. When fine-tuning on (question, rationale, answer) triples, you must set the loss to zero on question tokens. If you do not, the model learns to copy the question as well as predict the reasoning, wasting capacity and distorting the loss signal. This is implemented via a custom DataCollatorForSeq2SeqWithMasking class.
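The masking itself is only a few lines. A framework-free sketch (the helper name and the toy token IDs are illustrative; in the real collator this runs over batched tensors):

```python
IGNORE_INDEX = -100  # HuggingFace cross-entropy skips labels set to -100

def mask_question_tokens(input_ids, question_len):
    """Build CoT-SFT labels: loss is computed only on the
    rationale and answer tokens, never on the question tokens."""
    labels = list(input_ids)
    for i in range(min(question_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy sequence: 4 question tokens followed by 3 rationale/answer tokens.
ids = [101, 7592, 2088, 102, 2023, 2003, 42]
print(mask_question_tokens(ids, question_len=4))
# -> [-100, -100, -100, -100, 2023, 2003, 42]
```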


Decision 2: Use regex over a second LM call for answer extraction. It is tempting to ask the LM itself to extract its answer ("What was your final answer?"). This adds latency, cost, and a new failure mode (the LM mis-paraphrases itself). A well-tuned regex is faster, cheaper, and fails loudly — making it easier to diagnose extraction errors during evaluation.




4. Tech Stack Recommendation

Stack A — Beginner / Prototype (build in a weekend)

| Layer | Technology | Why |
| --- | --- | --- |
| Language | Python 3.10 | Universally supported, clean async |
| Modeling | HuggingFace Transformers + GPT-2 | Small enough to run on CPU/free GPU |
| Datasets | HuggingFace Datasets (GSM8K) | One-line download |
| Tensor ops | PyTorch 2.x CPU | No CUDA setup required |
| Evaluation | NumPy + Pandas | Fast, no extra dependencies |
| Visualisation | Matplotlib | Zero configuration |
| Notebook | Jupyter Lab | Interactive iteration |


Estimated monthly cost: ~$0 (CPU-only, Colab free tier or local laptop). Expect 30–60 seconds per problem with GPT-2; faster with a GPU runtime.

Stack B — Production-Ready (designed to scale)

| Layer | Technology | Why |
| --- | --- | --- |
| Language | Python 3.11 | Faster interpreter, better type support |
| Modeling | Mistral-7B-Instruct or LLaMA-3-8B | Strong reasoning at 7–8B params |
| Inference runtime | vLLM or TGI | PagedAttention for high-throughput batching |
| Quantisation | bitsandbytes 4-bit NF4 | 4× memory reduction, minimal accuracy loss |
| Fine-tuning | HuggingFace PEFT + LoRA | Efficient adapter training |
| Datasets | HuggingFace Datasets + local cache | Deterministic splits |
| API layer | FastAPI + Uvicorn | Async REST endpoints |
| Monitoring | W&B or MLflow | Experiment tracking |
| Deployment | Modal.com or AWS SageMaker | Serverless GPU billing |
| Containerisation | Docker + Compose | Reproducible builds |

Estimated monthly cost: $40–$150/month depending on GPU hours (Modal serverless billed per second; A10G ~$1.10/hr).




5. Implementation Phases

Phase 1: Prompt Template System & Zero-Shot CoT

What you are building: A CoTPromptBuilder class that takes a problem string, a task type (arithmetic, commonsense, symbolic), and a mode (zero-shot, few-shot) and returns a fully formatted prompt string.


Key technical decisions:


  • Whether to use a hard-coded "Let's think step by step" suffix or a task-adaptive instruction

  • How to structure few-shot exemplars (problem / rationale / answer format)

  • How many exemplars to include (4 is typical; more improves accuracy but costs tokens)

  • How to prevent exemplar answer strings from appearing in the model's output format
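A skeleton of what such a builder can look like (the class shape and exemplar format here are a sketch, not the course's exact template):

```python
from dataclasses import dataclass, field

ZERO_SHOT_SUFFIX = "Let's think step by step."

@dataclass
class CoTPromptBuilder:
    mode: str = "zero_shot"  # or "few_shot"
    # (problem, rationale, answer) triples used in few-shot mode
    exemplars: list = field(default_factory=list)

    def build(self, problem: str) -> str:
        if self.mode == "few_shot" and self.exemplars:
            shots = "\n\n".join(
                f"Q: {q}\nA: {r} The answer is {a}."
                for q, r, a in self.exemplars
            )
            return f"{shots}\n\nQ: {problem}\nA:"
        return f"Q: {problem}\nA: {ZERO_SHOT_SUFFIX}"

print(CoTPromptBuilder().build("What is 37 x 48?"))
```

Keeping the exemplar format ("The answer is X.") identical across all shots matters: the extractor downstream keys on exactly this phrasing.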


The exact prompt template format that maximises GSM8K accuracy on sub-7B models, including ablation results across five prompt variants, is covered in detail in the full course with working, tested code.




Phase 2: Evaluation Harness (GSM8K, MATH, StrategyQA)

What you are building: A benchmarking loop that loads a dataset split, runs the prompt builder and model generation, applies the regex extractor, compares to the gold label, and accumulates accuracy across the full split.


Key technical decisions:


  • Regex pattern design: how to handle "the answer is X", "= X", "#### X", and LaTeX-formatted answers

  • Whether to normalise floats before comparison (e.g., 42.0 == 42)

  • How to handle multi-part answers and answer ranges in the MATH dataset

  • How to build a reproducible evaluation seed so results are comparable across runs
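One way to structure the extractor is as an ordered cascade, most specific pattern first, with a last-number fallback. The patterns below are illustrative and GSM8K-leaning; MATH's LaTeX answers need additional patterns:

```python
import re

NUMBER = r"-?\d[\d,]*(?:\.\d+)?"
# Ordered from most to least specific.
PATTERNS = [
    re.compile(rf"####\s*({NUMBER})"),                       # GSM8K gold format
    re.compile(rf"the answer is\s*\$?({NUMBER})", re.IGNORECASE),
    re.compile(rf"=\s*\$?({NUMBER})\s*$"),                   # trailing "= X"
]
LAST_NUMBER = re.compile(NUMBER)

def extract_answer(trace: str):
    """Try exact-format patterns first; fall back to the last
    number in the text; return None if nothing matches."""
    for pattern in PATTERNS:
        match = pattern.search(trace)
        if match:
            return match.group(1).replace(",", "")
    numbers = LAST_NUMBER.findall(trace)
    return numbers[-1].replace(",", "") if numbers else None

print(extract_answer("Step 2: 6 * 7 = 42. #### 42"))      # '42'
print(extract_answer("Therefore, the result is 1,240."))  # fallback: '1240'
```

Logging every None return during evaluation makes it easy to grow the pattern list iteratively instead of guessing which phrasings the model uses.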


Designing a robust answer extractor that handles all three dataset formats without false positives is covered in detail in the full course with working, tested code.




Phase 3: Self-Consistency Voting Pipeline

What you are building: A SelfConsistencyPipeline that calls the model N times (or generates N sequences in one batched call), extracts the answer from each trace, and returns the plurality winner with a vote distribution.


Key technical decisions:


  • Whether to use num_return_sequences (one forward pass, faster) or N sequential calls (lower memory)

  • How to set temperature: too low → all traces identical, no diversity; too high → incoherent traces

  • How to handle ties in the vote distribution

  • At what N the accuracy curve plateaus (diminishing returns after N=20–40)
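The pipeline itself can be written model-agnostically, with the LM call injected as a function. In the sketch below the stub "model" stands in for a temperature-sampled generation so the code runs without a GPU; all names are ours:

```python
import random
from collections import Counter

def self_consistency(problem, sample_fn, extract_fn, n=8):
    """Draw n traces, extract an answer from each, and return the
    plurality winner plus the full vote distribution.

    `sample_fn(problem)` stands in for a temperature > 0 model call
    (e.g. one of the num_return_sequences batched generations)."""
    answers = [extract_fn(sample_fn(problem)) for _ in range(n)]
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes

# Stub "model": right about two times in three, mimicking the
# trace diversity that temperature sampling produces.
random.seed(0)
sample = lambda problem: f"Step 1: ... = {random.choice(['42', '42', '41'])}"
extract = lambda trace: trace.rsplit("= ", 1)[-1]

winner, votes = self_consistency("toy problem", sample, extract, n=9)
print(winner, dict(votes))
```

Swapping the stub for a real HuggingFace generate call changes nothing downstream, which also makes the voting logic trivially unit-testable.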


Profiling the accuracy/latency trade-off across N=1 to N=64 on GSM8K, and the temperature sweep that produced the best results, is covered in detail in the full course with working, tested code.




Phase 4: Tree-of-Thoughts Search

What you are building: A TreeOfThoughtsSearcher that, at each node in the reasoning tree, prompts the LM to propose K candidate next steps, scores each candidate with a value function, and expands the best branches using BFS or DFS. Demonstrated on the Game of 24.


Key technical decisions:


  • BFS vs. DFS: BFS explores all branches at depth d before going deeper (better for shallow problems); DFS commits early (better for deep problems with cheap evaluation)

  • Value function design: LM-based scoring ("Is this step promising? yes/no/maybe") vs. heuristic scoring

  • Beam width K and maximum search depth — both directly determine compute cost

  • How to detect terminal states and avoid infinite loops
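The control flow of the BFS variant fits in a short function. In this sketch, propose_fn and score_fn stand in for LM calls, and the demo task is a toy numeric search rather than the Game of 24:

```python
def tot_bfs(root, propose_fn, score_fn, is_terminal, k=5, beam=3, max_depth=8):
    """Breadth-first Tree-of-Thoughts sketch. At each level every
    surviving state proposes up to k next steps; candidates are
    scored and only the top `beam` branches survive, keeping cost
    near beam * k calls per level instead of k**depth."""
    frontier = [root]
    for _ in range(max_depth):
        candidates = []
        for state in frontier:
            if is_terminal(state):
                return state
            candidates.extend(propose_fn(state)[:k])
        if not candidates:
            break
        frontier = sorted(candidates, key=score_fn, reverse=True)[:beam]
    return max(frontier, key=score_fn) if frontier else root

# Toy stand-in for a reasoning task: reach 24 by adding 1, 2, or 3.
result = tot_bfs(
    root=0,
    propose_fn=lambda s: [s + step for step in (1, 2, 3)],
    score_fn=lambda s: -abs(24 - s),   # heuristic value function
    is_terminal=lambda s: s == 24,
    k=3, beam=2, max_depth=12,
)
print(result)  # 24
```

In the real system each "state" is a partial reasoning trace, propose_fn is an LM call that continues it K ways, and score_fn is either a second LM call or a cheap heuristic.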


The full BFS implementation with LM-based value scoring, applied to the Game of 24, is covered in detail in the full course with working, tested code.




Phase 5: CoT Supervised Fine-Tuning (CoT-SFT)

What you are building: A training loop that fine-tunes a small base model on (problem, step-by-step rationale, answer) triples from a curated dataset. The model learns to produce CoT reasoning without any prompting at inference time.


Key technical decisions:


  • Token masking strategy: loss computed only on rationale + answer tokens, not question tokens

  • Whether to use full fine-tuning or LoRA adapters (LoRA is strongly preferred for <16 GB VRAM)

  • Data curation: removing exemplars whose gold rationale is incorrect or whose answer extraction would fail

  • Overfitting detection: CoT-SFT models often overfit to the rationale format before the answer accuracy improves
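For the LoRA route, the adapter setup is only a few lines with HuggingFace PEFT. A configuration sketch (not run here; the rank, alpha, and dropout values are illustrative starting points rather than tuned settings):

```python
# Requires the transformers and peft packages; downloads GPT-2 weights.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora_config = LoraConfig(
    r=16,                       # adapter rank
    lora_alpha=32,              # scaling factor
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```

The masked-label batches from the collator decision in Section 3 feed straight into HuggingFace Trainer; since only the adapter weights receive gradients, the whole fine-tune fits comfortably under 16 GB of VRAM.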


Implementing the custom loss masking collator and the full LoRA fine-tuning loop on GSM8K rationale data is covered in detail in the full course with working, tested code.




6. Common Challenges (and How to Fix Them)

Building this system from scratch surfaces several non-obvious failure modes that cost days of debugging if you do not know to expect them.


Challenge 1: Answer extraction breaks for paraphrased endings. Root cause: The model sometimes writes "Therefore, the result is 42." instead of "The answer is #### 42", and your regex misses it. Fix: Build a layered extractor with multiple patterns ranked by specificity. Fall back from exact-format patterns to a "last number in the text" heuristic, and log all extraction failures so you can add patterns iteratively.


Challenge 2: Self-consistency cost is linear in N. Root cause: You are running the model N times. N=40 is literally 40× slower and 40× more expensive than greedy decoding. Fix: Use num_return_sequences=N in a single batched call so the prompt is processed once and the N continuations are sampled together. For a production system, cache traces for repeated or near-duplicate queries.


Challenge 3: Few-shot exemplars contaminate the test set. Root cause: If you pull exemplars from the same split you are evaluating on, you are testing the model on problems it has seen as examples. Fix: Maintain a held-out exemplar pool sourced from the training split or a separate manually curated set. Document the exact exemplar IDs in your evaluation config.


Challenge 4: Tree-of-Thoughts search explodes combinatorially. Root cause: At depth d with branching factor K, you have K^d candidate paths. K=5, depth=6 → 15,625 LM calls. Fix: Implement best-first search with aggressive value-function pruning. Prune any branch whose score falls below a threshold. Cap maximum search depth and use DFS with backtracking rather than pure BFS.


Challenge 5: Small models hallucinate step formatting without real reasoning. Root cause: A 1–3B parameter model learns to output "Step 1: ... Step 2: ..." because that was in its training data, not because it is actually computing anything. Fix: Evaluate with a rigorous answer extractor that ignores formatting and checks the final answer only. Use the evaluation harness to compare greedy accuracy vs. CoT accuracy — if CoT does not help, the model is too small to benefit.


Challenge 6: CoT-SFT model copies the question instead of reasoning. Root cause: The loss was not masked on question tokens, so the model wastes capacity learning to reproduce the input. Fix: Use a DataCollator that sets labels[i] = -100 for all tokens belonging to the question portion of each sequence.


Challenge 7: Temperature tuning is task-dependent. Root cause: Arithmetic problems benefit from low temperature (diversity hurts when the arithmetic is deterministic); open-ended reasoning problems benefit from higher temperature (diversity helps when multiple valid reasoning paths exist). Fix: Tune temperature on a held-out validation set per task type and hard-code task-specific defaults.


Solving these issues took us over 60 hours of testing and iteration — the course walks you through each fix with working code.




7. Ready to Build This Yourself?

Understanding an architecture on paper and actually shipping working, tested code are two very different things. Every section above describes a component that has at least one surprising failure mode in practice. The prompt format that works on GPT-4 does not work on LLaMA-3-1B. The regex that handles GSM8K breaks on MATH. The temperature that helps arithmetic hurts commonsense reasoning. Getting all of it working together requires building it — and building it with good tests.


The Chain-of-Thought Reasoning Models from Scratch course on labs.codersarts.com gives you everything you need to go from zero to a fully functional, evaluated reasoning engine:


  • ✅ Full, runnable PyTorch source code for every component described above

  • ✅ Step-by-step video walkthroughs of each implementation phase

  • ✅ Pre-configured Jupyter notebooks with all dependencies pinned

  • ✅ GSM8K, MATH, and StrategyQA evaluation harness with reproducible results

  • ✅ Self-consistency pipeline tested at N = 1, 8, 20, 40

  • ✅ Tree-of-Thoughts BFS/DFS implementation with Game of 24 demo

  • ✅ CoT-SFT training loop with LoRA, tested on a 1.5B model

  • ✅ Reproducible accuracy comparison charts (greedy vs. CoT vs. self-consistency vs. fine-tuned)

  • ✅ Lifetime access and free updates as new reasoning techniques land

  • ✅ Community support channel for questions on your implementation


$29.99. Everything above.



Want hands-on help with your specific use case? Upgrade to the 1:1 Guided Session ($99.99) — two live hours with a Codersarts mentor, a full code review of your fork, and a personalised extension project tailored to your domain.




8. Conclusion

Chain-of-Thought reasoning transforms a next-token predictor into a step-by-step problem solver by giving the model explicit scratchpad space. Self-consistency voting makes that solution robust by aggregating across multiple independent reasoning paths. Tree-of-Thoughts extends it to structured problems by replacing linear generation with a guided search over a tree of partial solutions.


If you are starting today, begin with Stack A: GPT-2, GSM8K, and a zero-shot CoT prompt. Get the evaluation harness working first — accurate measurement is the foundation of every subsequent improvement. Once you see the accuracy lift from CoT over greedy decoding, the motivation to add self-consistency and fine-tuning becomes concrete and measurable.


The full, tested implementation is available at labs.codersarts.com. Start building.


 
 
 
