How to Fine-Tune an LLM with QLoRA and DPO for Reliable JSON Extraction
- 19 hours ago
- 12 min read

Introduction
You have a production task that sounds simple: take a messy customer message and return a clean JSON object — intent, urgency, order ID, and nothing else. You write a careful system prompt, add three few-shot examples, and test with GPT-4. It works. Then you switch to a smaller open-source model to cut costs and the wheels fall off. The model wraps the JSON in Markdown fences. It adds extra keys. It occasionally returns plain prose instead of structured data. You tighten the prompt. It gets better but never perfect. You wonder whether you need a bigger model, or whether there is a smarter path.
There is. Fine-tuning a small open-source model with QLoRA supervised fine-tuning and Direct Preference Optimization (DPO) can push a 1.5B-parameter model from ~49% exact-match accuracy to 100% on a narrow task — using a single GPU, in a single afternoon.
This post covers the full end-to-end architecture for building that workflow: how the pipeline works, why naive approaches fail, the complete tech stack, the key implementation phases, and the real-world pitfalls you will hit. Concretely, we use Qwen2.5-1.5B-Instruct as the base model and target strict JSON extraction from free-form customer text.
Real-world use cases this architecture supports:
Customer support ticket classification and structured field extraction
Order, refund, and cancellation intent extraction from free-form messages
Converting messy operational text into strict JSON for downstream APIs
Training smaller open-source models to match larger models on narrow internal tasks
Aligning model output style to eliminate markdown fences, prose wrappers, and extra keys
Building proof-of-concept fine-tuning pipelines before scaling to 7B or 8B models
This post covers the architecture, tech stack, implementation phases, and known challenges. It does not include the full source code - that is available in the full QLoRA + DPO course on Codersarts Labs.
How It Works: Core Concept
The Problem with Prompting Alone
The naive approach is to keep adding instructions to the system prompt: "Return only JSON. No Markdown. No prose. Always include these exact keys." For large frontier models this often works. For smaller models running locally or on a budget GPU, it does not — at least not reliably. There are three root causes.
First, small models have limited instruction-following capacity. They learned the general concept of "return JSON" during pretraining and RLHF, but they did not specifically internalize your schema under your distribution of inputs. Long prompts help, but they are not the same as updating the weights.
Second, you have no measurable feedback loop. "The output looks about right" is not the same as "98% of outputs pass a JSON schema validator." Without a metric, you cannot tell whether a new prompt version is actually better.
Third, for production deployment you often want a smaller, faster, cheaper model, not a larger one. Prompting a 70B model to solve a task that a 1.5B model could handle after fine-tuning wastes compute and money.
The Solution: QLoRA SFT + DPO
Think of fine-tuning like hiring a specialist instead of briefing a generalist. The base model already understands language and instructions. Fine-tuning teaches it the exact format and style your task requires, at the weight level.
QLoRA (Quantized Low-Rank Adaptation) makes this practical on a single consumer GPU. The base model weights are frozen in 4-bit NF4 quantization, cutting memory use by roughly 4×. Two small trainable matrices — a low-rank decomposition A × B — are inserted alongside each target weight matrix. Only these adapter matrices are trained, which means fewer than 2% of parameters are updated. After training, the adapters can be saved and loaded independently of the base model.
DPO (Direct Preference Optimization) takes the SFT-trained adapter one step further by teaching the model to prefer correct responses over stylistically wrong ones — for example, a clean JSON object versus the same object wrapped in backtick fences — without needing a separate reward model.
Pipeline Data Flow
┌────────────────────────────────────────────────────────────────────┐
│ SETUP & DATA PHASE │
│ │
│ Raw text messages │
│ │ │
│ ▼ │
│ Define JSON schema ──► Generate labelled examples │
│ │ │
│ ▼ │
│ Format with chat template │
│ (apply_chat_template) │
│ │ │
│ ▼ │
│ Evaluate BASE model baseline │
└────────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────────────┐
│ TRAINING & EVALUATION PHASE │
│ │
│ Load Qwen2.5-1.5B (4-bit NF4) │
│ │ │
│ ▼ │
│ Attach LoRA adapter (PEFT) │
│ │ │
│ ▼ │
│ SFTTrainer (TRL) ──► SFT adapter ──► Evaluate SFT model │
│ │ │
│ ▼ │
│ Build DPO dataset (chosen / rejected pairs) │
│ │ │
│ ▼ │
│ DPOTrainer (TRL) ──► DPO adapter ──► Evaluate DPO model │
│ │ │
│ ▼ │
│ Compare Base / SFT / DPO metrics │
│ │ │
│ ▼ │
│ Save adapter OR Merge weights OR Push to Hub │
└────────────────────────────────────────────────────────────────────┘System Architecture Deep Dive
Architecture Overview
The system is structured as seven layers that work in sequence, from raw data to a deployable model artifact.
Component | Role | Technology Options |
Notebook interface | Author experiments, visualise results, iterate quickly | Jupyter / Google Colab / VS Code Notebooks |
Data layer | Store, format, and batch training examples | Hugging Face datasets, pandas DataFrames, JSON files |
Model loading layer | Load the base model with 4-bit quantization into GPU memory | transformers + bitsandbytes NF4, GPTQ, AWQ |
Adapter layer | Attach trainable LoRA matrices to target model modules | peft (LoRA, QLoRA), IA³, adapters |
SFT training layer | Supervised fine-tuning on labelled examples | TRL SFTTrainer, Trainer, Axolotl |
Preference training layer | Align model output style with preference pairs | TRL DPOTrainer, ORPO, SimPO, RLHF/PPO |
Evaluation layer | Measure JSON validity, schema validity, field accuracy, exact match | Custom Python metrics, evaluate, json_schema |
Deployment layer | Export adapter weights, merge, or publish | peft merge utilities, Hugging Face Hub push_to_hub |
Data Flow Walkthrough
Schema definition — The developer defines a target JSON schema specifying the exact keys, value types, and allowed enum values the model must output.
Dataset generation — Labelled examples are created: each example pairs a customer message (user turn) with a schema-compliant JSON object (assistant turn). The dataset should cover the full range of intents, urgency levels, and linguistic variations expected in production.
Chat template formatting — Each example is formatted with the model's native chat template using apply_chat_template. This converts the conversation into a single input string. The training loss is then masked so it applies only to the assistant response tokens, not the user prompt.
Baseline evaluation — The unmodified base model is run on the evaluation set. Metrics are recorded across four dimensions: JSON validity (is it parseable?), schema validity (does it match the schema?), field accuracy (are individual fields correct?), and exact match (does the full output match the reference?).
QLoRA model loading — The base model is loaded in 4-bit NF4 quantization using BitsAndBytesConfig. A LoRA adapter is attached to the query and value projection matrices (q_proj, v_proj) via get_peft_model.
SFT training — SFTTrainer runs for a small number of epochs. Only the LoRA adapter weights are updated.
SFT evaluation — The SFT adapter is merged with the base model in memory for inference, and the evaluation suite re-runs. In our tested configuration, exact match improves from 0.490 (baseline) to 1.000 (SFT).
DPO dataset construction — A preference dataset is built with chosen responses (correct schema-compliant JSON) and rejected responses (the same JSON with common style failures: backtick fences, extra keys, lowercase intent values).
DPO training — DPOTrainer trains the SFT adapter against its own frozen reference copy, weighted by the beta parameter.
DPO evaluation — The DPO adapter is evaluated. In our tested configuration, exact match settles at 0.900 — a slight regression from SFT, attributable to overly easy rejected examples.
Deployment — The final adapter is either saved as a lightweight checkpoint, merged into the base model weights using merge_and_unload, or pushed to the Hugging Face Hub.
Non-Obvious Design Decisions
Why apply loss only to assistant response tokens? If the loss is calculated over the entire sequence including the user prompt, the model wastes gradient signal trying to memorise the input. Masking the prompt tokens so only assistant tokens contribute to the cross-entropy loss makes the model learn what to output, not what it was asked.
Why keep DPO beta low (0.1–0.3)? A high beta makes the model stay very close to the reference SFT policy, limiting how much the preference data can shift behaviour. A very low beta lets the model diverge aggressively from SFT, which risks overwriting task accuracy in favour of stylistic correctness. Calibrating beta is the single most impactful knob in the DPO phase.
Tech Stack Recommendation
Stack A - Beginner / Prototype (build in a weekend)
Layer | Technology | Why |
Runtime | Google Colab (free T4 GPU) | No local setup, 15 GB GPU RAM sufficient for 1.5B at 4-bit |
Base model | Qwen/Qwen2.5-1.5B-Instruct | Small, fast, strong instruction following, good chat template |
Quantization | bitsandbytes NF4 | Drop-in 4-bit support with paged Adam optimiser |
Adapters | peft LoRA | Minimal code, integrates directly with TRL trainers |
Training | TRL SFTTrainer + DPOTrainer | Handles data collation, loss masking, and training loop |
Evaluation | Custom Python (json, jsonschema) | No additional dependencies |
Deployment | Save adapter locally | One function call: model.save_pretrained() |
Estimated monthly cost: ~$0 on Colab free tier for experimentation; ~$5–10 if using Colab Pro for longer training runs.
Stack B - Production-Ready (designed to scale)
Layer | Technology | Why |
Runtime | AWS EC2 p3.2xlarge or Lambda Labs A10 | Dedicated GPU, persistent storage, no session timeouts |
Base model | Qwen/Qwen2.5-7B-Instruct (or Llama-3-8B) | Higher accuracy ceiling for harder extraction tasks |
Quantization | bitsandbytes NF4 or GPTQ | GPTQ gives faster inference post-training |
Adapters | peft QLoRA (rank 16–64) | Higher rank for more complex tasks |
Training | TRL SFTTrainer + DPOTrainer with Accelerate | Multi-GPU support, gradient checkpointing |
Experiment tracking | Weights & Biases | Training curves, hyperparameter sweeps, model registry |
Evaluation | LM Eval Harness + custom metrics | Reproducible benchmark comparisons |
Deployment | Merge weights → vLLM or TGI serving | Maximum inference throughput; adapter overhead eliminated |
Model registry | Hugging Face Hub (private repo) | Versioned model artifacts, easy rollback |
Estimated monthly cost: ~$200–400 depending on GPU instance choice and training frequency.
Implementation Phases
Phase 1: Dataset and Schema Design
The foundation of any fine-tuning project is the dataset. In this phase you define the exact JSON schema your model must produce and generate labelled training examples.
Key decisions include: how many unique keys to include, what enum values are allowed for intent and urgency fields, how many training examples are sufficient (typically 200–500 for a narrow task), whether to generate synthetic examples or use real customer messages, and how to split examples across train and evaluation sets.
The quality bar is strict: every target response in your training set must be valid JSON, schema-compliant, and contain no extra keys. A single malformed example in the training set teaches the model that malformed output is acceptable.
Phase 2: Model Loading and Adapter Configuration
In this phase you load the base model under 4-bit NF4 quantization and attach a LoRA adapter via PEFT.
Key decisions include: which target modules to inject LoRA matrices into (q_proj and v_proj are standard starting points; adding k_proj, o_proj, and the MLP projections increases capacity at the cost of more trainable parameters), the LoRA rank (8–16 is a good starting range), the lora_alpha scaling factor (usually 2× the rank), the lora_dropout value (0.05–0.1), and whether to enable gradient checkpointing to trade compute for memory savings.
Getting this configuration right determines whether training fits in available GPU memory and whether the adapter has enough capacity to learn the task.
Phase 3: Supervised Fine-Tuning with SFTTrainer
In this phase you run the SFT training loop and validate that the adapter is learning.
Key decisions include: batch size and gradient accumulation steps (the effective batch size should be at least 16–32), number of training epochs (3–5 is typical for small datasets), learning rate and warmup schedule, the max_seq_length setting (long sequences increase memory use non-linearly), and how to apply loss masking so only the assistant response tokens contribute to the loss.
After training, you run the evaluation suite. The hallmark of a successful SFT run is that JSON validity and schema validity jump to near 100% before exact match does — the model learns structure before it learns precision.
Phase 4: DPO Preference Optimization
In this phase you build a preference dataset and run DPOTrainer to refine the adapter's output style.
Key decisions include: how to design rejected responses that are genuinely confusable with correct ones (too-easy negatives produce no useful gradient signal), the DPO beta parameter (start at 0.1), the learning rate (typically 1e-5 to 5e-5, lower than SFT), the reference model setup (TRL uses the SFT checkpoint automatically), and how many preference pairs to include (50–200 pairs is sufficient for a narrow style-alignment task).
A known pitfall: if rejected examples are obviously wrong (e.g., completely unparseable text), the DPO loss saturates early and the model learns nothing useful, or it slightly overwrites SFT task knowledge. Our tested configuration shows a 10-point regression from SFT exact match (1.000 → 0.900) when rejected examples are too easy.
Phase 5: Evaluation, Adapter Saving, and Deployment
In this final phase you compare Base, SFT, and DPO results across all four metrics, then choose a deployment strategy.
Key decisions include: whether to ship the lightweight adapter alone (fast to swap, requires the base model at runtime), merge the adapter weights into the base model for a standalone checkpoint (eliminates adapter overhead at inference, slightly larger file), or push either artifact to a private Hugging Face Hub repository for versioned deployment.
The evaluation comparison at this stage is the deliverable your team cares about most — it answers the question: "Did fine-tuning actually help, and is the improvement worth the deployment complexity?"
Common Challenges
Fine-tuning a small open-source model for a production task sounds straightforward on paper. In practice, you will hit these problems.
1. Out-of-memory (OOM) errors during training Root cause: The effective memory footprint during training includes the model weights, LoRA gradients, optimizer states, and the forward activation graph. Even at 4-bit quantization, a 1.5B model with a batch size of 4 and a long max sequence length can exceed 15 GB. Fix: Enable gradient checkpointing, reduce batch size to 1 and increase gradient accumulation steps to compensate, lower max_seq_length, and use the paged Adam optimizer from bitsandbytes.
2. Loss applies to the entire sequence including the prompt Root cause: By default, SFTTrainer may compute cross-entropy loss over all tokens. If the prompt is long, the model wastes capacity memorising the instruction rather than the response. Fix: Use DataCollatorForCompletionOnlyLM from TRL with the correct response template string to mask all prompt tokens before passing to the loss function.
3. Base model evaluation gives misleadingly high JSON validity Root cause: The base model sometimes outputs something that looks like JSON for simple inputs. A single-metric pass/fail can be deceiving; JSON parseable does not mean schema-compliant. Fix: Always evaluate all four metrics together — JSON validity, schema validity, field-level accuracy, and exact match — and treat exact match as the ground truth signal.
4. DPO degrades task accuracy (regression from SFT) Root cause: If rejected examples are too easy to distinguish from correct ones (e.g., completely different format), the DPO gradient does not meaningfully update the adapter. Instead, it slightly disrupts the SFT-learned behaviour by nudging the model away from its reference distribution. Fix: Create harder rejected examples — responses that are mostly correct but have one subtle failure (a missing field, a wrong enum value, or backtick wrapping). Reduce the beta parameter to stay closer to the SFT reference policy.
5. Chat template formatting mismatches Root cause: Different models use different special tokens and conversation structures. If you format examples manually without using apply_chat_template, the training input distribution will differ from what the model saw during pretraining, hurting convergence. Fix: Always use the model's tokenizer's apply_chat_template method with tokenize=False first to inspect the formatted string, then verify the loss mask aligns with the assistant turn.
6. Adapter not loading at inference time Root cause: Loading a PEFT adapter requires the base model to be loaded first with an identical configuration. If the base model is loaded with different quantization settings or a different dtype, the adapter weights will not align. Fix: Always load the base model using the same BitsAndBytesConfig you used during training before calling PeftModel.from_pretrained.
7. Schema keys appearing in wrong order or with wrong types Root cause: Language models generate JSON token by token. If the training data includes examples where keys appear in different orders, the model learns that order is flexible and will sometimes vary it. Fix: Sort keys in all training target responses. Use a schema validator that is order-insensitive for your evaluation metric, but standardise the training data so the model learns a consistent pattern.
Solving these issues took us 12+ hours of testing — the fine-tuning course on Codersarts Labs walks you through each fix with working code.
Ready to Build This Yourself?
Understanding the architecture is one thing. Writing, debugging, and shipping the actual code is another. Between memory errors, masked loss configs, DPO hyperparameter sensitivity, and deployment decisions, there are at least a dozen points where a working notebook is worth far more than a blog post.
The Fine-Tuning LLMs with LoRA & DPO for Task-Specific Performance course on Codersarts Labs gives you everything you need to go from zero to a deployed adapter in one focused session:
✅ Full source notebook — QLoRA + DPO pipeline, runnable end to end
✅ Step-by-step lessons covering every design decision
✅ Google Colab setup — no local GPU required
✅ Tested LoRA configurations for Qwen2.5-1.5B
✅ Complete SFTTrainer workflow with correct loss masking
✅ Complete DPOTrainer workflow with preference dataset builder
✅ Evaluation suite — JSON validity, schema validity, field accuracy, exact match
✅ Deployment walkthrough — adapter saving, weight merging, Hub upload
✅ Real training logs showing Base / SFT / DPO metric progression
✅ Lifetime access and future updates
✅ Community support
$29.99. Everything above.
Want a guided session instead? Book a private 1:1 implementation session with the Codersarts Labs team — dataset design, training troubleshooting, and deployment planning covered end to end. $99.99.
Conclusion
Fine-tuning a small open-source LLM with QLoRA and DPO is one of the highest-leverage techniques available to ML engineers building production AI applications today. The architecture is straightforward: load a quantized base model, attach a LoRA adapter, run supervised fine-tuning on a labelled task-specific dataset, optionally refine output style with DPO preference pairs, then evaluate and deploy. The whole pipeline fits on a single T4 GPU, and the measurable improvement over prompt engineering is dramatic for narrow tasks — a 10× gain in exact-match accuracy in our tested configuration.
The best place to start is Colab, Qwen2.5-1.5B-Instruct, and a focused 200-example JSON extraction dataset. Run the SFT phase first. Measure. Then decide whether DPO adds value for your specific style requirements.



Comments