Model serving means making a trained model respond to thousands of requests per second in production. Typical targets: p99 latency < 100 ms, throughput of 1000 req/s, and expensive GPUs at 80%+ utilization. This guide covers dynamic batching, quantization, canary / shadow deploys, and the LLM-serving specifics.
Common Serving Tools
| Tool | Characteristics |
|---|---|
| TF Serving | TensorFlow official, SavedModel format |
| TorchServe | PyTorch official, Java-based |
| Triton (NVIDIA) | Multi-framework (TF, PT, ONNX, TensorRT), GPU-optimized |
| vLLM | LLM-specific, PagedAttention for throughput |
| BentoML | Python-friendly, auto Docker containerization |
| SageMaker / Vertex AI Endpoints | Cloud-managed, auto-scaling |
Dynamic Batching — The Throughput Key
Naive serving:
request 1 → predict → response
request 2 → predict → response
...
→ Low GPU utilization (idle between requests)
Dynamic batching:
Group requests arriving within 10ms into a batch → one GPU call
→ Exploit GPU parallelism (batch=32 can be 30× throughput of batch=1)
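Automatic batching in Triton / vLLM (the parameter names below follow Triton's model config; vLLM batches automatically with its own settings):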
max_batch_size = 32
max_queue_delay_microseconds = 10000
request 1 → queue
... (at 10ms or reaching 32)
→ batch GPU call → distribute results back to each request
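The queue-then-flush flow above can be sketched in plain Python with `asyncio`. This is a minimal illustration, not how Triton or vLLM implement batching internally: `DynamicBatcher`, `predict_batch`, and `fake_model` are hypothetical names, and the model call here runs inline on the event loop, whereas a real server would hand it to a worker thread or a GPU stream.

```python
import asyncio


class DynamicBatcher:
    """Group requests until max_batch_size is reached or max_delay_s has elapsed."""

    def __init__(self, predict_batch, max_batch_size=32, max_delay_s=0.010):
        self.predict_batch = predict_batch  # hypothetical: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_delay_s = max_delay_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per request; returns that request's own result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Background loop: build a batch, make one model call, distribute results."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # wait for the first request
            deadline = loop.time() + self.max_delay_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.predict_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo():
    batch_sizes = []

    def fake_model(inputs):  # stand-in for one batched GPU call
        batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_delay_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batch_sizes


results, batch_sizes = asyncio.run(demo())
print(results, batch_sizes)
```

Ten concurrent requests end up in a handful of model calls instead of ten, and each caller still receives only its own result. The last, partially filled batch is the one that pays the full queue delay.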
Trade-off:
- Larger batch = throughput ↑ but latency ↑ (queue wait)
- Smaller batch = latency ↓ but GPU underused
Quantization: Faster with Minimal Accuracy Loss
Default: FP32 (32-bit float) — large memory + heavy compute
FP16 (half precision):
- Memory 1/2
- 2-4× faster on modern GPU Tensor Cores
- Nearly identical accuracy on most models (< 0.5% drop)
- Default for modern training / serving
INT8 quantization:
- Memory 1/4
- 4-8× faster (NVIDIA TensorRT, ONNX Runtime)
- 1-3% accuracy drop possible — task-dependent
- Common in LLM production serving
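To make the INT8 idea concrete, here is a minimal sketch of symmetric per-tensor quantization, one common scheme. It is an illustration only: TensorRT and ONNX Runtime use calibrated and often per-channel variants, and the function names and sample weights below are made up for the example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is at most half the scale. Small weights such as `0.003` collapse to zero, which is one source of the task-dependent accuracy drop.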
INT4 / GPTQ / AWQ (LLM):
- 7B LLMs can serve from a single 24GB GPU
- Even LLaMA 70B fits a single A100 80GB at INT4
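The sizing claims above follow from simple arithmetic on bytes per parameter. The sketch below counts weights only; KV cache, activations, and the per-group scales that GPTQ / AWQ store are extra, so real footprints are somewhat higher.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for precision in BYTES_PER_PARAM:
        print(f"{name} {precision}: {weight_memory_gb(params, precision):.1f} GB")
```

A 70B model needs about 140 GB at FP16 but about 35 GB at INT4, which is why it fits on one 80 GB card only after quantization.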
→ Always compare quality after quantization. Confirm business impact.
Canary / Shadow / Blue-Green
Strategies for deploying new model v3:
Blue-Green:
blue (current v2, 100% traffic) ↔ green (new v3, 0% traffic, idle)
After tests, switch (100% v3) — instant
Swap back on failure
Canary:
v2: 95% traffic
v3: 5% traffic
Increase gradually on good metrics (10%, 25%, 50%, 100%)
Drop to 0% on bad
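A canary split is often implemented by hashing a stable request or user ID into a bucket, so the same caller always sees the same version. The sketch below assumes that approach; `route`, `next_canary_percent`, and the rollout steps are illustrative names and values, not part of any particular serving tool.

```python
import hashlib

ROLLOUT_STEPS = [5, 10, 25, 50, 100]


def route(request_id, canary_percent):
    """Deterministically send canary_percent% of request IDs to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 100)
    return "v3" if bucket < canary_percent else "v2"


def next_canary_percent(current, metrics_ok):
    """Advance one rollout step on good metrics; drop to 0% on bad ones."""
    if not metrics_ok:
        return 0
    for step in ROLLOUT_STEPS:
        if step > current:
            return step
    return 100


share = sum(route(f"req-{i}", 5) == "v3" for i in range(10000)) / 10000
print(f"v3 share at 5%: {share:.3f}")
```

Hash-based routing needs no shared state between replicas, and a rollback is a single change of `canary_percent` to 0.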
Shadow:
v2: 100% traffic (actual response)
v3: 100% traffic (runs but response unused)
→ Compare metrics only, zero user impact
→ Safer than A/B (users don't see results)
→ 2× cost (both models running)
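In code, shadow mode comes down to calling both models and returning only the primary's answer. The sketch below is a simplified, synchronous version with hypothetical stand-in models; in production the shadow call would usually run off the request path (for example on mirrored traffic) so it adds no latency.

```python
def serve_with_shadow(request, primary_model, shadow_model, log):
    """Return the primary (v2) response; run the shadow (v3) only for comparison."""
    response = primary_model(request)
    try:
        shadow_response = shadow_model(request)
        log.append({"request": request, "match": shadow_response == response})
    except Exception as exc:  # a shadow failure must never reach the user
        log.append({"request": request, "error": repr(exc)})
    return response


def v2(text):  # stand-in for the current model
    return text.upper()


def v3(text):  # stand-in for the new model, with a disagreement and a bug
    if text == "boom":
        raise ValueError("shadow bug")
    return "?" if text == "b" else text.upper()


log = []
responses = [serve_with_shadow(r, v2, v3, log) for r in ["a", "b", "boom"]]
print(responses, log)
```

Users only ever see v2 output, while the log records where v3 disagrees or fails.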
Multi-armed bandit:
Auto traffic allocation (Thompson sampling etc.)
Better model auto-gains traffic
Explore / exploit balance
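Thompson sampling can be sketched with a Beta distribution per model: sample a plausible success rate for each, route to the highest sample, then update with the observed outcome. The class name, the binary success signal, and the simulated success rates below are assumptions for illustration.

```python
import random


class ThompsonRouter:
    """Beta-Bernoulli Thompson sampling over candidate models."""

    def __init__(self, model_names, seed=None):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for each model: uniform over its unknown success rate
        self.successes = {name: 1 for name in model_names}
        self.failures = {name: 1 for name in model_names}

    def choose(self):
        """Sample a plausible success rate per model and route to the highest."""
        samples = {
            name: self.rng.betavariate(self.successes[name], self.failures[name])
            for name in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, name, success):
        """Record one observed outcome (e.g. click / no click) for a model."""
        if success:
            self.successes[name] += 1
        else:
            self.failures[name] += 1


# Simulated traffic with hypothetical true success rates (unknown to the router)
TRUE_RATES = {"v2": 0.50, "v3": 0.70}
router = ThompsonRouter(TRUE_RATES, seed=0)
outcomes = random.Random(1)
traffic = {name: 0 for name in TRUE_RATES}
for _ in range(5000):
    model = router.choose()
    traffic[model] += 1
    router.update(model, outcomes.random() < TRUE_RATES[model])
print(traffic)
```

Early on both models receive traffic (explore); as evidence accumulates, the better model wins most samples and therefore most requests (exploit).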
→ Critical models: shadow → canary → 100%; otherwise canary directly.
GPU vs CPU
GPU is faster for:
- Large models (hundreds of millions of parameters+)
- Batchable workloads (image classification, batch=32+)
- LLMs (matrix-multiplication heavy)
CPU is appropriate for:
- Small models (tens of thousands of parameters)
- Low-latency single inference (batch=1)
- Cost (1/10–1/100 of GPU cost)
- Low throughput
Modern decisions:
- LLM = GPU in practice (A100, H100, L4)
- Classic ML (XGBoost, LightGBM) = CPU is enough
- Vision (ResNet etc.) = CPU OK, batch → GPU
A quantized LLM can also run on CPU (for example with llama.cpp).
LLM Serving's Specifics
Traditional model: input → output in one call
LLM: sequential token generation (autoregressive)
input "What is..." → "the" → "the meaning" → "the meaning of"...
Issues:
- KV cache grows per token (memory ↑)
- Sequence lengths vary within a batch (padding waste)
- Streaming response is the default UX
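Two of these issues are easy to quantify. The sketch below estimates KV cache size per token and the share of compute wasted on padding in a batch. The model shape is an assumption (roughly a 7B LLaMA-style model: 32 layers, 32 KV heads, head dimension 128, FP16), and the sequence lengths are made up.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Keys and values (the factor 2) for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def padding_waste(lengths):
    """Fraction of a padded batch that is padding rather than real tokens."""
    padded = len(lengths) * max(lengths)
    return 1 - sum(lengths) / padded


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_sequence_gib = per_token * 4096 / 2**30  # one 4096-token sequence
waste = padding_waste([100, 500, 2000])
print(per_token, per_sequence_gib, round(waste, 3))
```

Under these assumptions each token costs 0.5 MiB of cache and a single 4096-token sequence costs 2 GiB, so the cache, not the weights, often limits how many sequences fit in a batch.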
vLLM's answers:
- PagedAttention: page-based KV cache, like virtual memory
- Continuous batching: a finished sequence's slot is filled by a new one
- → Up to 24× throughput (vs a HuggingFace Transformers baseline, as reported by the vLLM authors)
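The benefit of continuous batching shows up even in a toy simulation that counts decode steps. This is a simplified model, not vLLM's scheduler: it ignores prefill and memory limits, assumes every sequence generates at least one token, and uses made-up output lengths.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A finished sequence's slot is refilled from the queue on the next step."""
    waiting = deque(lengths)
    active = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1  # one decode step for all active sequences
        active = [r - 1 for r in active if r > 1]  # drop sequences that just finished
    return steps


lengths = [8, 2, 2, 2, 8, 2, 2, 2]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
print(static_steps, continuous_steps)
```

With static batching, short sequences wait for the longest one in their batch; with continuous batching their slots are reused immediately, so the same work finishes in fewer steps.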
Other LLM serving tools:
- TGI (Text Generation Inference, HuggingFace)
- LMDeploy
- TensorRT-LLM
- Ollama (local, dev-friendly)
Common Pitfalls
- batch=1 only: wastes an expensive GPU. Use dynamic batching.
- No quantization: serving in FP32 is costly and slow. Use at least FP16.
- Cold starts: new auto-scaled instances take minutes to load big models. Pin minimum replicas.
- No warm-up: the first request is very slow (JIT compile, cache miss). Run dummy inferences at startup.
- Insufficient monitoring: track p99 latency, throughput, GPU utilization, and error rate. Without these, regressions go unnoticed.
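The warm-up and monitoring points can be sketched with the standard library alone. The helper names and latency samples below are illustrative; a real deployment would export these numbers to a metrics system rather than keep them in a list.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def warm_up(predict, dummy_input, runs=5):
    """Run throwaway inferences at startup so the first real request is not the slow one."""
    for _ in range(runs):
        predict(dummy_input)


def timed(predict, request, latencies_ms):
    """Call the model and record the wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = predict(request)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return result


sample_ms = [12] * 98 + [40, 250]  # made-up latencies: mostly fast, two slow
p50 = percentile(sample_ms, 50)
p99 = percentile(sample_ms, 99)
mean = sum(sample_ms) / len(sample_ms)
print(p50, p99, mean)
```

In this sample the mean and the median look healthy while p99 and the maximum reveal the slow tail, which is why latency targets are usually stated as percentiles.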
Wrap-up
Model serving is ML's "production engineering": dynamic batching, quantization, and canary deploys are the standard toolkit. LLM serving is its own world (vLLM, TGI).
In practice: small models → BentoML or cloud endpoints; large models → Triton or cloud GPU endpoints; LLMs → vLLM or TGI. Always consider quantization and invest in strong monitoring.