LLM Inference Optimization: AWQ Quantization, Speculative Decoding, and Prefix Caching Benchmarked on a 7B Model
- 6 hours ago
- 13 min read

Introduction: Your GPU Bill Is Too High and Your TTFT Is Too Slow
You have a 7B model running in production — or about to go into production — and the numbers are not pretty. At fp16 baseline on an A100, you are burning through GPU hours faster than your runway allows, and time-to-first-token is hovering in the 800–1,200 ms range when your product SLA demands sub-500 ms. You have heard the terms thrown around — AWQ quantization, speculative decoding, prefix caching — but every resource you find either explains one technique in isolation or buries the practical details under academic notation. Nobody shows you all three working together with real latency numbers.
This post is the resource you have been looking for. It walks through a reproducible LLM inference optimization pipeline that applies AWQ 4-bit quantization, speculative decoding with a small draft model, and prefix caching — all inside vLLM — and measures the latency and cost impact of each technique with a Python benchmark harness.
Real-world teams that need exactly this:
ML engineers at a startup who need to cut GPU inference costs before a funding runway runs out
Teams hitting latency SLAs (< 500 ms TTFT) that the fp16 baseline cannot meet
Researchers running large-scale evaluations who need to maximise throughput on a fixed GPU budget
Platform engineers deciding which model variant to serve and needing reproducible benchmark evidence
Developers moving from cloud API spend to self-hosted inference who need to quantify the hardware ROI
MLOps teams building an internal model serving standard and needing a decision framework for future model onboarding
This post covers the full architecture, tech stack, implementation phases, and known challenges. It does not include full source code — that is in the course.
📄 Before you dive in — grab the free PRD template that maps out this entire system: architecture, API spec, sprint plan, and system prompt. [Download the free PRD]
How It Works: The Core Concept - LLM Inference
The Naive Approach and Why It Breaks Down
Running a 7B transformer in fp16 is the path of least resistance: load the weights, start vLLM, call the API. The problem is that autoregressive generation is fundamentally sequential — the model generates one token per forward pass, and each forward pass moves tens of gigabytes of weight tensors across the memory bus. At fp16, a 7B model consumes roughly 14 GB of VRAM just for weights, leaving little headroom for the KV cache that grows with every token in the context. The result: high VRAM pressure, low batch sizes, and slow throughput. On a single A100 80 GB you can serve the model, but at production traffic levels the cost per 1M tokens climbs quickly.
The insight behind this optimization pipeline is that these are three largely independent bottlenecks, and each has a well-understood engineering fix:
Weight memory bandwidth is the primary throughput limiter → fix with quantization (AWQ reduces weights from 16-bit to 4-bit)
Sequential token generation dominates median latency → fix with speculative decoding (a tiny draft model proposes multiple tokens per step)
Repeated KV-cache recomputation for shared prompt prefixes wastes cycles → fix with prefix caching (compute once, reuse on every request)
The Pipeline in Plain English
Think of it like a kitchen prep workflow. Quantization is mise en place — doing expensive weight compression work once offline so that serving is faster. Speculative decoding is having a prep cook (the draft model) rough-chop ingredients in parallel while the head chef (the target model) does only the final quality check. Prefix caching is keeping yesterday's stock in the fridge — you don't reboil from scratch on every order.
ASCII Data-Flow Diagram
SETUP PHASE (run once)
─────────────────────
7B fp16 checkpoint
│
▼
AutoAWQ calibration
(calibration dataset)
│
▼
4-bit AWQ checkpoint ──► stored on disk / HuggingFace Hub
│
▼
vLLM server start
--quantization awq
--speculative-model <draft>
--num-speculative-tokens 5
--enable-prefix-caching
│
▼
[Server ready]
RUNTIME / BENCHMARK PHASE
──────────────────────────
Benchmark harness
(warmup N requests)
│
▼
Timed request batch ───────────────────────────────────────────┐
│ │
▼ │
vLLM receives prompt │
│ │
├──► Prefix cache hit? ──YES──► reuse KV cache, skip │
│ │ prefill for prefix │
│ NO │
│ │ │
│ ▼ │
│ Full prefill (compute KV cache, store in cache) │
│ │
▼ │
Draft model generates K candidate tokens │
│ │
▼ │
Target model verifies all K tokens in ONE forward pass │
│ │
▼ │
Accepted tokens returned → stream to client │
│ │
▼ │
Record: TTFT, TPS, VRAM ◄─────────────────────────────────────┘
│
▼
CSV + decision matrix
System Architecture Deep Dive
Architecture Overview
The system has five distinct layers that work in concert:
Offline preparation layer handles quantization of the base model checkpoint using AutoAWQ. This is a one-time cost — typically 30–90 minutes on a GPU — that produces a 4-bit model checkpoint that is roughly 2× smaller in storage and VRAM footprint than the fp16 original.
Model serving layer is a vLLM inference server configured with three feature flags: the AWQ quantized weights, a speculative decoding draft model, and the prefix caching toggle. vLLM handles batching, KV cache management, and the speculative decoding verification loop internally.
Benchmark harness layer is a Python process that sends structured request batches to the vLLM OpenAI-compatible API endpoint, records wall-clock timing for first-token and full-response, computes tokens-per-second, and writes results to CSV.
Results layer is a pandas/matplotlib pipeline that reads the benchmark CSVs and renders a side-by-side comparison table and a decision matrix mapping hardware + latency target + quality tolerance to the recommended optimization configuration.
Infrastructure layer wraps everything in Docker and NVIDIA container toolkit to make the environment fully reproducible.
Component Table
Component | Role | Options |
Base Model | 7B fp16 target for quantization | Mistral-7B-v0.3, LLaMA-3-8B, Qwen2-7B |
Quantization Tool | Offline 4-bit weight compression | AutoAWQ (recommended), auto-gptq |
Quantized Checkpoint | 4-bit AWQ weights served by vLLM | Local disk, HuggingFace Hub private repo |
Draft Model | Small model for speculative decoding | TinyLlama-1.1B, Mistral-7B draft (same family) |
Inference Server | LLM serving with batching + caching | vLLM (primary), TGI (alternative) |
Benchmark Harness | Timing, TTFT, TPS, cost measurement | Custom Python harness, pytest-benchmark |
Results Renderer | CSV → tables + decision matrix | pandas + matplotlib, rich |
Container Runtime | Reproducible GPU environment | Docker + NVIDIA Container Toolkit |
GPU Hardware | Compute for serving and quantization | A100 80GB (production), RTX 4090 (dev) |
Config Management | vLLM server arguments and harness params | YAML config + argparse |
Data Flow: Step by Step
Developer runs the AutoAWQ calibration script, pointing it at the fp16 checkpoint and a calibration dataset (typically 512 samples from C4 or the domain-specific dataset).
AutoAWQ produces a 4-bit quantized checkpoint and saves it to disk (or pushes to HuggingFace Hub).
The docker-compose up command starts the vLLM server with the AWQ checkpoint, speculative decoding config pointing to the draft model, and --enable-prefix-caching flag.
The benchmark harness sends a configurable number of warmup requests (discarded from timing) to prime the CUDA kernels and the prefix cache.
The timed benchmark loop sends request batches at varying batch sizes (1, 4, 8, 16, 32) and records wall-clock TTFT and tokens-per-second for each.
The harness writes one CSV row per configuration per batch size: config, batch_size, ttft_p50_ms, ttft_p95_ms, tps_mean, vram_gb, cost_per_1m_tokens.
The results script reads all CSVs and renders a comparison table and decision matrix, showing which combination to use for each hardware/latency/quality scenario.
Non-Obvious Design Decisions
Warmup request exclusion is mandatory, not optional.
The first several requests after vLLM starts are significantly slower due to CUDA kernel compilation (torch.compile or CUDA graph capture) and an empty prefix cache. If your harness does not discard a warmup batch of at least 5–10 requests, your fp16 baseline will look artificially slow, making every optimization appear more impactful than it actually is. This is a common benchmarking error that produces misleadingly optimistic ROI projections.
Testing across batch sizes is non-negotiable for speculative decoding.
Speculative decoding's speedup degrades — and can reverse — at high batch sizes. At batch=1 you might see 1.5–2.5× TTFT improvement; at batch=16 the verification overhead can negate the gain entirely. A benchmark that only tests batch=1 will lead teams to enable speculative decoding in a production environment where they serve batch=16 requests, only to discover latency has regressed. The harness is explicitly designed to expose this interaction.
Tech Stack Recommendation
Stack A — Beginner / Weekend Prototype
This stack can be assembled in a Saturday afternoon. It uses managed cloud GPU rental (Lambda Labs or Vast.ai) so you avoid hardware setup entirely.
Layer | Technology | Why |
GPU Rental | Lambda Labs A100 or RTX 4090 (on-demand) | No hardware commitment; cheap for a weekend |
Base Model | Mistral-7B-v0.3 (HuggingFace) | Well-documented, great AWQ support |
Quantization | AutoAWQ (pip) | Simpler API than auto-gptq, better defaults |
Inference Server | vLLM latest (pip) | One command to install and serve |
Draft Model | TinyLlama-1.1B (same tokenizer family) | Free, small, publicly available |
Benchmark Harness | Custom Python script (aiohttp + time) | No framework dependency |
Results | pandas + rich (terminal table) | Zero frontend setup |
E
stimated monthly cost: $50–$100 in cloud GPU rental for development and benchmarking (assuming 20–30 GPU-hours total).
Stack B — Production-Ready
Layer | Technology | Why |
GPU Hardware | A100 80GB (bare metal or dedicated cloud) | Sufficient VRAM for AWQ + prefix cache |
Base Model | LLaMA-3-8B-Instruct (gated) | State-of-the-art quality at 7–8B scale |
Quantization | AutoAWQ with domain calibration dataset | Minimises task-specific quality regression |
Quantized Checkpoint | HuggingFace Hub private repo | Version-controlled, easy rollback |
Inference Server | vLLM with Docker Compose | Reproducible, restartable |
Draft Model | LLaMA-3-8B draft (matched tokenizer) | Highest acceptance rate for speculative decoding |
Config Management | YAML + environment variables | Separates benchmark configs from code |
Monitoring | Prometheus + Grafana | TTFT and TPS in production dashboards |
Benchmark Harness | Python + pytest-benchmark + pandas | Reproducible, CI-friendly |
CI / CD | GitHub Actions + DVC for checkpoint tracking | Automates re-benchmark on model updates |
Estimated monthly cost: $2,500–$4,000 for a dedicated A100 node (bare metal hosting); $800–$1,400 for on-demand cloud A100 at 40–60 GPU-hours/month serving load. Savings versus fp16 baseline on the same hardware typically offset the optimization engineering investment within 2–4 weeks.
Implementation Phases
Phase 1: Baseline Benchmarking
Before touching any optimization, you need a clean fp16 baseline. This phase involves standing up a stock vLLM server with the fp16 model, writing or configuring the benchmark harness, and running a full timed sweep across batch sizes 1, 4, 8, 16, and 32. The key technical decision here is your warmup strategy: you must choose a warmup request count that is large enough to saturate the CUDA compilation cache (typically 10–20 requests) but not so large that it dominates your benchmark wall time. You also need to lock in your prompt workload — a mix of short and long system prompts, at a fixed input/output token ratio — so that every subsequent phase is measured on an identical workload.
The output is a baseline CSV file: the control group against which all optimizations will be measured. Without a rigorous baseline, you cannot meaningfully claim any speedup.
Phase 2: AWQ Quantization
This phase applies offline 4-bit weight quantization using AutoAWQ. You will run the calibration process — feeding 512 representative samples through the model to determine the optimal per-channel weight scales — and then export the quantized checkpoint. The key technical decisions: which calibration dataset to use (a domain-specific sample from your actual traffic distribution almost always beats the generic C4 default), how many calibration samples to use, and whether to use AWQ or GPTQ (AWQ is generally preferred for vLLM due to better kernel support and slightly higher output quality at the same bit width, but GPTQ has a longer track record). After reloading vLLM with the quantized weights, you re-run the full benchmark harness and record VRAM reduction and throughput change.
Typical results: VRAM footprint drops from ~14 GB to ~4–5 GB, throughput increases 1.3–1.8×, and output quality as measured by perplexity on a held-out set degrades by less than 2% on most tasks with a good calibration dataset.
Phase 3: Speculative Decoding
With the AWQ checkpoint as the new baseline, you now configure speculative decoding. This requires selecting a draft model — a small model from the same family and with the same tokenizer vocabulary as the target — and setting the num_speculative_tokens parameter (typically 3–8; higher values help on long outputs but increase verification overhead on short ones). The critical constraint is tokenizer alignment: the draft and target models must share an identical tokenizer. A mismatch produces silent token-alignment errors that degrade output quality without raising an exception, one of the most dangerous failure modes in this project.
The harness re-runs the full batch-size sweep. You will see TTFT drop significantly at low batch sizes and flatten or regress at high batch sizes — and the decision matrix will capture exactly where the crossover point is for your hardware.
Phase 4: Prefix Caching
Prefix caching is enabled with a single vLLM flag: --enable-prefix-caching. The technique is conceptually simple — vLLM hashes the token sequence of each prompt prefix and stores the computed KV-cache blocks for reuse — but realising a high cache hit rate in practice requires prompt versioning discipline. Any change to the system prompt, including trailing whitespace or a version string in the first line, invalidates the entire cached prefix for that hash. The benchmark harness must run a workload that simulates realistic cache hit patterns: a mix of requests that share the same system prompt (cache hits) and requests with unique prefixes (cache misses), at a ratio that reflects your actual traffic distribution.
Phase 5: Combined Optimization and Decision Matrix
The final phase enables all three optimizations simultaneously — AWQ + speculative decoding + prefix caching — and runs a final benchmark sweep. Not all combinations of these features are supported in every vLLM version, and enabling them in the wrong order in the config can cause silent fallbacks (e.g. speculative decoding silently disabled when certain attention backends are active). The phase also compiles all five benchmark CSVs (fp16 baseline, AWQ only, AWQ + spec, AWQ + cache, all three) into the decision matrix: a lookup table that maps your hardware tier, TTFT target (e.g. < 200 ms, < 500 ms, < 1 s), and quality tolerance (e.g. < 1% perplexity regression) to the recommended combination of optimizations.
Common Challenges
Engineers who attempt this optimization stack without guidance typically hit the same set of non-obvious problems. Here are the most impactful ones.
AWQ quality regression is task-dependent, not universal.
The degradation from 4-bit quantization is not uniform — it is significantly higher on tasks involving precise numerical reasoning or structured output (e.g. JSON extraction) than on open-ended text generation. Teams that benchmark quality only on generic text and then deploy to a structured-output workload discover the regression in production. The fix is to run a task-specific quality evaluation on a held-out set from your actual use case before declaring AWQ acceptable.
Silent draft model tokenizer mismatch.
If you accidentally configure a draft model with a superficially compatible but subtly different tokenizer (e.g. different special token IDs or vocabulary size by 1), speculative decoding will proceed without raising an error but will produce corrupted outputs. The root cause is that token IDs from the draft model are passed to the target model for verification without any alignment check at the application layer. The fix is to explicitly assert draft_tokenizer.vocab == target_tokenizer.vocab before starting the server.
Speculative decoding reverses at large batch sizes.
At batch sizes above ~8–16 (hardware-dependent), the overhead of running the draft model and verifying K tokens per step exceeds the savings from reduced autoregressive iterations. The root cause is that the verification forward pass scales with batch size × speculative tokens, while the benefit (saved autoregressive steps) does not scale proportionally. The fix is to benchmark across batch sizes, not just batch=1, and configure speculative decoding only if your production traffic batch size is in the sweet spot.
Prefix cache invalidation from prompt drift.
Any whitespace change, version bump embedded in the system prompt, or personalisation token inserted at the start of the prompt invalidates the KV-cache prefix. Teams with "live" system prompts (updated weekly with product news or instructions) see cache hit rates collapse to near zero. The fix is to separate the stable system prompt prefix from the dynamic suffix, and ensure only the stable portion is subject to caching.
vLLM flag ordering for combined optimizations.
AWQ + speculative decoding + prefix caching in combination requires specific flag ordering and vLLM version pinning. In some vLLM releases, speculative decoding and prefix caching have conflicting KV-cache management requirements and one silently disables the other. The fix is to run vllm --version and cross-reference against the compatibility matrix in the course materials before upgrading.
Cold-start benchmark bias.
Torch.compile and CUDA graph capture on the first few requests after server start produce dramatically inflated latency numbers. Including these in your timing average makes the fp16 baseline look worse than it is, causing every subsequent optimization to appear more impactful. The fix is always discard the first N requests (N = 10–20 in practice) as warmup before starting the timed loop.
GPTQ vs AWQ interaction with vLLM's fused kernels.
GPTQ-quantized models on some GPU architectures do not benefit from vLLM's CUDA fused attention kernels the way AWQ models do. If you switch from AWQ to GPTQ to compare quality, the throughput numbers are not directly comparable without verifying kernel activation. The fix is to run vllm info after loading each quantized model and confirm which kernel is active.
Ready to Build This Yourself?
Understanding the architecture and knowing the pitfalls is not the same as having production-ready code running on your hardware. The gap between "I understand how AWQ quantization works" and "I have a benchmarked, optimized vLLM server with a decision matrix I trust" is filled with vLLM version mismatches, calibration dataset choices, silent tokenizer bugs, and cold-start timing artefacts.
The LLM Inference Optimization course at labs.codersarts.com closes that gap with everything you need to ship:
✅ Full working Python source code for the benchmark harness (TTFT, TPS, cost)
✅ AutoAWQ calibration scripts with domain dataset configuration
✅ vLLM Docker Compose configs for all five optimization combinations
✅ Speculative decoding setup with tokenizer alignment validation
✅ Prefix caching prompt schema and cache hit rate measurement
✅ Decision matrix template pre-filled with A100 and RTX 4090 benchmark results
✅ Benchmark result CSVs for Mistral-7B across all configurations (download and compare)
✅ Video walkthroughs for each of the 12 lessons (3 modules)
✅ Lifetime access and free updates as vLLM evolves
✅ Community support channel for questions on your specific hardware
$29.99. Everything above.
Need to run this pipeline on your specific model, hardware, and latency targets with a Codersarts engineer live on the call? The 1:1 Guided Session ($99) is a pair-programming session where we run the full optimization pipeline together and interpret your benchmark results.
Conclusion
LLM inference optimization is not a single technique — it is a pipeline. AWQ quantization shrinks the weight memory footprint and unlocks higher batch throughput. Speculative decoding cuts median TTFT by parallelising candidate token generation with a small draft model. Prefix caching eliminates redundant KV-cache computation for repeated prompt prefixes. Composing all three inside vLLM, measured with a rigorous benchmark harness, gives you a reproducible decision matrix that tells you exactly which combination to deploy for your hardware, latency target, and quality tolerance.
The simplest place to start: spin up a cloud A100, install vLLM, quantize your model with AutoAWQ using the default calibration settings, and re-run your existing benchmark. Even AWQ alone will show a meaningful VRAM and throughput improvement and give you confidence that the quality trade-off is acceptable for your task.
When you are ready to go deeper — add speculative decoding and prefix caching, sweep across batch sizes, and produce a decision matrix you can show your team — the full course at labs.codersarts.com has every script, config, and benchmark result waiting for you.



Comments