How Model Serving Actually Works

Model serving means making a trained model answer production traffic, often thousands of requests per second, within a latency and cost budget. Typical targets: p99 latency under 100ms, throughput of 1000 req/s, and expensive GPUs kept at 80%+ utilization. This guide covers dynamic batching, quantization, canary / shadow deploys, and the LLM-serving specifics.

Common Serving Tools

Tool	Characteristics
TF Serving	TensorFlow official, SavedModel format
TorchServe	PyTorch official, Java-based
Triton (NVIDIA)	Multi-framework (TF, PT, ONNX, TensorRT), GPU-optimized
vLLM	LLM-specific, PagedAttention for throughput
BentoML	Python-friendly, auto Docker containerization
SageMaker / Vertex AI Endpoints	Cloud-managed, auto-scaling

Dynamic Batching — The Throughput Key

Naive serving:
  request 1 → predict → response
  request 2 → predict → response
  ...
  → Low GPU utilization (idle between requests)

Dynamic batching:
  Group requests arriving within 10ms into a batch → one GPU call
  → Exploits GPU parallelism (batch=32 can deliver up to ~30× the throughput of batch=1)

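Triton dynamic batching settings (vLLM instead batches continuously, per decoding step):
  max_batch_size = 32
  max_queue_delay_microseconds = 10000   (= 10ms)

  request 1 → queue
  ... (dispatch at 10ms or on reaching 32 requests, whichever comes first)
  → batch GPU call → distribute results back to each request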

Trade-off:
- Larger batch = throughput ↑ but latency ↑ (queue wait)
- Smaller batch = latency ↓ but GPU underused
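
The batching rule above can be sketched as a small, deterministic function. This is a minimal illustration, not Triton's actual scheduler: `form_batches` and its parameters are hypothetical names, and it assumes arrival times are given in milliseconds and sorted ascending.

```python
from typing import List


def form_batches(arrivals_ms: List[float],
                 max_batch_size: int = 32,
                 max_queue_delay_ms: float = 10.0) -> List[List[int]]:
    """Group request indices into batches.

    A batch is dispatched when it holds max_batch_size requests, or when
    its first request has waited max_queue_delay_ms, whichever comes first.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    deadline = 0.0
    for i, t in enumerate(arrivals_ms):
        # The pending batch timed out before this request arrived: flush it.
        if current and t > deadline:
            batches.append(current)
            current = []
        if not current:
            # The delay clock starts with the first request of a batch.
            deadline = t + max_queue_delay_ms
        current.append(i)
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


arrivals = [0, 2, 4, 15, 16, 40]
by_delay = form_batches(arrivals)                   # limited by the 10ms window
by_size = form_batches(arrivals, max_batch_size=2)  # limited by batch size
```

Raising `max_queue_delay_ms` produces fewer, larger batches at the cost of queue wait, which is the trade-off listed above.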

Quantization — Faster with Minimal Accuracy Loss

Default: FP32 (32-bit float) — large memory + heavy compute

FP16 (half precision):
- Memory 1/2
- 2-4× faster on modern GPU Tensor Cores
- Nearly identical accuracy on most models (< 0.5% drop)
- Default for modern training / serving

INT8 quantization:
- Memory 1/4
- Up to 4-8× faster than FP32 on hardware with INT8 support (NVIDIA TensorRT, ONNX Runtime)
- 1-3% accuracy drop possible — task-dependent
- Common in LLM production serving
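
To make the source of INT8's accuracy loss concrete, here is a minimal sketch of symmetric per-tensor quantization in pure Python. It is an illustration under simplifying assumptions (a single scale for the whole tensor, no calibration data), not what TensorRT or ONNX Runtime do internally; the function names and the sample weights are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats in [-max|w|, +max|w|] onto integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the INT8 codes."""
    return [v * scale for v in q]


weights = [0.02, -0.51, 0.33, 1.27, -1.0]   # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight now needs 1 byte instead of 4, and the rounding error is bounded by `scale / 2`. A single large outlier inflates `scale` and therefore the error on every other weight, which is one reason the accuracy impact is task- and model-dependent.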

INT4 / GPTQ / AWQ (LLM):
- At 4 bits, a 7B model's weights take about 3.5GB, so it serves comfortably from a single 24GB GPU
- Even LLaMA 70B (about 35GB of weights at INT4) fits on a single A100 80GB

→ Always compare quality after quantization. Confirm business impact.
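
The memory figures in this section follow from simple arithmetic: parameters × bits per parameter. The sketch below counts weights only; KV cache, activations, and runtime overhead come on top, so treat the results as a lower bound. `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name} {label}: {weight_memory_gb(params, bits):.1f} GB")
```

A 70B model at FP16 needs about 140GB for weights alone, which is why it does not fit on a single 80GB card without quantization.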

Canary / Shadow / Blue-Green

Strategies for deploying new model v3:

Blue-Green:
  blue (current v2, 100% traffic) ↔ green (new v3, 0% traffic, idle)
  After tests, switch (100% v3) — instant
  Swap back on failure
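
A blue-green switch amounts to flipping a single pointer. The sketch below assumes each model is a plain callable; `BlueGreenRouter` and the lambda "models" are hypothetical stand-ins, and a real deployment would usually flip a load balancer or service selector instead.

```python
from typing import Callable, Dict

Model = Callable[[str], str]


class BlueGreenRouter:
    """Sends all traffic to one live slot; the other slot stays idle."""

    def __init__(self, blue: Model, green: Model) -> None:
        self.slots: Dict[str, Model] = {"blue": blue, "green": green}
        self.live = "blue"

    def predict(self, x: str) -> str:
        return self.slots[self.live](x)

    def switch(self) -> None:
        """Flip 100% of traffic to the other slot; call again to roll back."""
        self.live = "green" if self.live == "blue" else "blue"


router = BlueGreenRouter(blue=lambda x: f"v2:{x}", green=lambda x: f"v3:{x}")
before = router.predict("req")   # served by v2
router.switch()
after = router.predict("req")    # served by v3
```

Because the old version keeps running, rollback is the same operation as rollout.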

Canary:
  v2: 95% traffic
  v3: 5% traffic
  Increase gradually on good metrics (10%, 25%, 50%, 100%)
  Drop to 0% on bad
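
One common way to implement the canary split is to hash a stable key, so the same user always lands on the same version. This is a sketch under that assumption; `route` and the `user-N` ids are hypothetical, and many teams do this in the load balancer or service mesh rather than in application code.

```python
import hashlib


def route(user_id: str, canary_percent: float) -> str:
    """Deterministically assign a user to 'v2' (stable) or 'v3' (canary)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000   # 0..9999
    return "v3" if bucket < canary_percent * 100 else "v2"


users = [f"user-{i}" for i in range(20_000)]
share = sum(route(u, 5.0) == "v3" for u in users) / len(users)
print(f"canary share at 5%: {share:.3f}")
```

Because buckets are fixed per user, raising the percentage from 5 to 10 only adds users to v3; nobody already on the canary is bounced back to v2.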

Shadow:
  v2: 100% traffic (actual response)
  v3: mirrored copy of 100% traffic (runs, but its response is discarded)
  → Compare metrics only, zero user impact
  → Safer than A/B (users never see v3's output)
  → 2× cost (both models running)
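
The shadow pattern can be sketched as a wrapper that always returns the primary model's answer and only records what the shadow model would have said. The names here (`serve_with_shadow`, `disagreement_rate`) are hypothetical, and the shadow call is made inline for simplicity; a production setup would normally mirror traffic asynchronously so the shadow adds no latency.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[float], float]
LogEntry = Tuple[float, float, Optional[float]]


def serve_with_shadow(x: float, primary: Model, shadow: Model,
                      log: List[LogEntry]) -> float:
    """Return the primary answer; record the shadow answer for comparison."""
    answer = primary(x)
    try:
        log.append((x, answer, shadow(x)))
    except Exception:
        # A shadow failure must never affect the user-facing response.
        log.append((x, answer, None))
    return answer


def disagreement_rate(log: List[LogEntry], tol: float = 1e-6) -> float:
    """Fraction of requests where the shadow failed or differed from primary."""
    if not log:
        return 0.0
    bad = sum(1 for _, p, s in log if s is None or abs(p - s) > tol)
    return bad / len(log)
```

The disagreement rate (or any task-specific quality metric computed from the log) is what decides whether v3 moves on to a canary.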

Multi-armed bandit:
  Auto traffic allocation (Thompson sampling etc.)
  Better model auto-gains traffic
  Explore / exploit balance
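
A minimal Thompson sampling sketch for two model versions, assuming each request yields a binary success signal (a click, a thumbs-up). The success rates below are made-up numbers for the simulation, and `thompson_route` is a hypothetical name.

```python
import random
from typing import List


def thompson_route(successes: List[int], failures: List[int],
                   rng: random.Random) -> int:
    """Sample a plausible success rate per arm from its Beta posterior
    and route the request to the arm with the highest sample."""
    samples = [rng.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return samples.index(max(samples))


rng = random.Random(0)
true_rates = [0.05, 0.20]          # hypothetical: arm 0 = v2, arm 1 = v3
successes, failures, pulls = [0, 0], [0, 0], [0, 0]
for _ in range(5000):
    arm = thompson_route(successes, failures, rng)
    pulls[arm] += 1
    if rng.random() < true_rates[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1
print("traffic per arm:", pulls)
```

Early on both arms receive traffic (explore); as evidence accumulates, the posterior of the better arm dominates and it receives most requests (exploit).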

→ Critical models: shadow → canary → 100%; otherwise canary directly.

GPU vs CPU

GPU is faster for:
- Large models (hundreds of millions of parameters+)
- Batchable workloads (image classification, batch=32+)
- LLMs (matrix-multiplication heavy)

CPU is appropriate for:
- Small models (tens of thousands of parameters)
- Low-latency single inference (batch=1)
- Cost (1/10–1/100 of GPU cost)
- Low throughput

Modern decisions:
- LLM = GPU in practice for production throughput (A100, H100, L4)
- Classic ML (XGBoost, LightGBM) = CPU is enough
- Vision (ResNet etc.) = CPU is fine at low volume; move to GPU once requests can be batched

Quantized LLMs can also run on CPU (e.g. with llama.cpp), typically at much lower throughput.

LLM Serving's Specifics

Traditional model: input → output in one call
LLM: sequential token generation (autoregressive)
     input "What is..." → "the" → "the meaning" → "the meaning of"...
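
The autoregressive loop can be sketched with a toy stand-in for the model. `generate`, `toy_next_token`, and the lookup table are hypothetical; a real LLM would run a forward pass over token ids and sample from a probability distribution at each step.

```python
from typing import Callable, Dict, List


def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int,
             eos: str = "<eos>") -> List[str]:
    """Generate tokens one at a time, feeding each one back as input."""
    tokens = list(prompt)
    new_tokens: List[str] = []
    for _ in range(max_new_tokens):
        tok = next_token(tokens)       # one model call per generated token
        if tok == eos:
            break
        tokens.append(tok)
        new_tokens.append(tok)
    return new_tokens


# Toy "model": predicts the next token from the last token only.
TABLE: Dict[str, str] = {"is": "the", "the": "meaning",
                         "meaning": "of", "of": "<eos>"}


def toy_next_token(tokens: List[str]) -> str:
    return TABLE.get(tokens[-1], "<eos>")


out = generate(["What", "is"], toy_next_token, max_new_tokens=10)
```

The point to notice is that the number of model calls equals the number of generated tokens, so response length directly drives latency and GPU time.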

Issues:
- KV cache grows per token (memory ↑)
- Sequence lengths vary within a batch (padding waste)
- Streaming response is the default UX
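
The KV cache growth can be estimated directly: for every token, each layer stores one key vector and one value vector per KV head. The sketch below assumes an FP16 cache and uses an illustrative 7B-class shape (32 layers, 32 KV heads, head dimension 128); these numbers are assumptions for the example, and models with grouped-query attention have fewer KV heads.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Bytes of KV cache: 2 (key + value) per layer, head, and token."""
    return (2 * num_layers * num_kv_heads * head_dim
            * bytes_per_value * seq_len * batch_size)


per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{per_request / 2**30:.2f} GiB for one 4096-token sequence")
```

Under these assumptions a single long request consumes gigabytes of GPU memory, so the cache, not the weights, often limits how many sequences can be batched together.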

vLLM's answers:
- PagedAttention: page-based KV cache, like virtual memory
- Continuous batching: a finished sequence's slot is filled by a new one
- → Up to 24× throughput (vs. a HuggingFace Transformers baseline, as reported by the vLLM authors)
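
Why continuous batching helps can be shown with a small simulation that counts decoding steps. This is a simplified model, not vLLM's scheduler: it assumes every sequence costs one step per generated token, ignores prefill, and uses hypothetical function names and output lengths.

```python
from typing import List


def static_batching_steps(lengths: List[int], slots: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    return sum(max(lengths[i:i + slots])
               for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths: List[int], slots: int) -> int:
    """A finished sequence's slot is refilled from the queue at the next step."""
    queue = list(lengths)
    active: List[int] = []     # remaining tokens per running sequence
    steps = 0
    while queue or active:
        while queue and len(active) < slots:
            active.append(queue.pop(0))
        steps += 1
        # Sequences with one token left finish on this step and free a slot.
        active = [r - 1 for r in active if r > 1]
    return steps


lengths = [100, 10, 10, 10, 100, 10, 10, 10]   # tokens to generate per request
static_steps = static_batching_steps(lengths, slots=4)
continuous_steps = continuous_batching_steps(lengths, slots=4)
```

With static batching the short sequences finish early and their slots sit idle until the 100-token sequence completes; with continuous batching those slots are reused immediately, so the same work completes in fewer steps.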

Other LLM serving tools:
- TGI (Text Generation Inference, HuggingFace)
- LMDeploy
- TensorRT-LLM
- Ollama (local, dev-friendly)

Common Pitfalls

batch=1 only: wastes the GPU and drives up cost. Enable dynamic batching.
No quantization: serving in FP32 is costly and slow. Use at least FP16.
Cold starts: newly auto-scaled instances can take minutes to load big models. Pin a minimum number of replicas.
No warm-up: the first request is very slow (JIT compile, cache miss). Run dummy inferences at startup.
Insufficient monitoring: track p99 latency, throughput, GPU utilization, and error rate. Without these, regressions go unnoticed.
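
Two of these pitfalls, warm-up and latency monitoring, can be sketched with the standard library alone. The helper names are hypothetical, `predict` stands for whatever callable wraps the model, and a real service would export these numbers to a metrics system rather than keep them in a list.

```python
import math
import time
from typing import Callable, List


def warm_up(predict: Callable[[object], object], dummy_input: object,
            runs: int = 5) -> None:
    """Run a few throwaway inferences so the first real request is not slow."""
    for _ in range(runs):
        predict(dummy_input)


def timed_call(predict: Callable[[object], object], x: object,
               latencies_ms: List[float]) -> object:
    """Call the model and record wall-clock latency in milliseconds."""
    start = time.perf_counter()
    result = predict(x)
    latencies_ms.append((time.perf_counter() - start) * 1000)
    return result


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with pct% of samples <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Calling `warm_up` before the instance is marked ready, then reporting `percentile(latencies_ms, 99)` over a sliding window, covers the warm-up and p99 items in the list above.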

Wrap-up

Model serving is the "production engineering" side of ML: dynamic batching, quantization, and canary deploys are the standard toolkit. LLM serving is its own world (vLLM, TGI).

In practice: small models → BentoML or cloud endpoints; large models → Triton or cloud GPU endpoints; LLMs → vLLM or TGI. Always consider quantization, and invest in monitoring.