How does model serving work?

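Model serving means taking a trained model and having it answer thousands of requests per second in production. Typical targets are p99 latency under 100 ms, throughput of 1,000 req/s, and, because GPUs are expensive, GPU utilization above 80%. This guide covers dynamic batching, quantization, canary and shadow deployment, and what makes LLM serving different.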

Common serving tools

Tool	Characteristics
TF Serving	Official TensorFlow server; uses the SavedModel format
TorchServe	Official PyTorch server; Java-based
Triton (NVIDIA)	Multi-framework (TF, PyTorch, ONNX, TensorRT); GPU-optimized
vLLM	LLM-specific; PagedAttention for higher throughput
BentoML	Python-friendly; builds Docker containers automatically
SageMaker / Vertex AI Endpoints	Managed cloud services with auto-scaling

Dynamic Batching: the key to throughput

naive serving:
  request 1 → predict → response
  request 2 → predict → response
  ...
  → low GPU utilization (the GPU idles between requests)

Dynamic batching:
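  requests arriving within 10 ms are grouped into a batch and sent to the GPU in one call
  → uses the GPU's parallelism (batch=32 can deliver far higher throughput than batch=1, up to roughly 30× in favorable cases)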

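Automatic batching in Triton (config.pbtxt settings):
  max_batch_size: 32
  dynamic_batching { max_queue_delay_microseconds: 10000 }

  request 1 → queue
  ... (until 10 ms pass or 32 requests are queued)
  → one batched GPU call → results are distributed back to each request

vLLM also batches automatically, but through continuous batching (see the LLM section) rather than these settings.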

trade-off:
- larger batch = higher throughput, but higher latency (time spent waiting in the queue)
- smaller batch = lower latency, but wasted GPU capacity
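
The collection step behind dynamic batching can be sketched in a few lines of Python. This is a minimal single-threaded illustration, not how Triton implements it: `collect_batch` and `fake_model` are hypothetical names, and the defaults simply mirror the 32-request / 10 ms settings quoted above.

```python
import queue
import time


def collect_batch(requests, max_batch_size=32, max_queue_delay_s=0.010):
    """Block for the first request, then gather more until the batch is
    full or the delay window (measured from the first request) expires."""
    batch = [requests.get()]  # wait for at least one request
    deadline = time.monotonic() + max_queue_delay_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # window expired with no further requests
    return batch


def fake_model(batch):
    # Hypothetical stand-in for a single batched GPU call.
    return [x * 2 for x in batch]


q = queue.Queue()
for i in range(40):
    q.put(i)

first = collect_batch(q)   # 32 requests were already waiting, so the batch fills
second = collect_batch(q)  # only 8 remain, so this returns when the window closes
outputs = fake_model(first)
```

A real server would run this loop on a worker thread and hand each result back to its waiting request handler. The delay window is what caps the latency a request can lose while waiting for peers.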

Quantization: faster with almost no accuracy loss

Baseline: FP32 (32-bit float), which needs a lot of memory and a lot of compute

FP16 (half precision):
- half the memory of FP32
- 2-4× faster when the Tensor Cores on modern GPUs are used
- accuracy is nearly identical for almost all models (< 0.5% loss)
- the default for modern training and serving

INT8 quantization:
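- a quarter of the memory of FP32
- up to 4-8× faster on runtimes that support it (NVIDIA TensorRT, ONNX Runtime); the actual gain depends on the model and hardware
- accuracy loss of 1-3% is possible, depending on the task
- common in production LLM serving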
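
To make the accuracy trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization, the simplest scheme. It is an illustration only: the function names and the sample weights are made up, and production toolchains typically add calibration data and finer-grained scales.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [x * scale for x in quantized]


weights = [0.82, -1.27, 0.003, 0.5, -0.25]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding error per value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Small weights such as 0.003 collapse to zero here, which is where the accuracy loss comes from: every value shares one scale set by the largest magnitude in the tensor.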

INT4 / GPTQ / AWQ (LLM):
- a 7B LLM can be served on a single 24 GB GPU with room to spare (about 3.5 GB of weights at 4 bits)
- even LLaMA 70B fits on a single A100 80GB at INT4 (about 35 GB of weights)

→ Always compare quality after quantizing, and check the business impact.
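
The memory figures above follow from simple arithmetic: parameters times bits per parameter. A small sketch, assuming weights only (no KV cache, activations, or quantization metadata) and GB meaning 10^9 bytes; `weight_memory_gb` is a hypothetical helper name.

```python
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gb(num_params, precision):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9


for name, num_params in [("7B", 7e9), ("70B", 70e9)]:
    sizes = {p: weight_memory_gb(num_params, p) for p in BITS_PER_PARAM}
    print(name, sizes)
```

Real deployments need headroom on top of these numbers, since the KV cache and activations also live in GPU memory.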

Canary / Shadow / Blue-Green

Strategies for rolling out a new model v3:

Blue-Green:
  blue (current v2, 100% traffic) ↔ green (new v3, 0% traffic, idle)
  after testing, switch over (100% v3); the switch is instant
  swap back if problems appear

Canary:
  v2: 95% traffic
  v3: 5% traffic
  if metrics look good, increase gradually (10%, 25%, 50%, 100%)
  if they look bad, drop v3 to 0%
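
One common way to implement the canary split is to hash a stable request attribute into a bucket. The sketch below assumes routing by user ID so that each user consistently sees the same model; `route` and the ID format are hypothetical.

```python
import hashlib


def route(user_id, canary_percent):
    """Deterministically send a fixed share of users to the canary (v3)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "v3" if bucket < canary_percent else "v2"


users = [f"user-{i}" for i in range(10_000)]
share_at_5 = sum(route(u, 5) == "v3" for u in users) / len(users)
print(f"share routed to v3 at 5%: {share_at_5:.3f}")

# Raising the percentage only adds users to v3; nobody flips back to v2.
stayed = all(route(u, 25) == "v3" for u in users if route(u, 5) == "v3")
```

Because the bucket depends only on the user ID, ramping from 5% to 25% keeps the original canary users on v3, and dropping to 0% sends everyone back to v2.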

Shadow:
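  v2: 100% traffic (serves the real response)
  v3: 100% traffic (runs, but its response is discarded)
  → only comparison metrics are recorded; zero user impact
  → safer than A/B testing (users never see v3's output)
  → 2× cost (both models run)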
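
The shadow pattern amounts to a thin wrapper around the primary model. This is a simplified synchronous sketch with hypothetical stand-in models; a production setup would usually run the shadow call off the request path so it cannot add latency.

```python
def serve_with_shadow(request, primary, shadow, log):
    """Answer from the primary model; run the shadow model only for comparison."""
    response = primary(request)
    try:
        shadow_response = shadow(request)
        log.append({"request": request, "match": shadow_response == response})
    except Exception as exc:  # a shadow failure must never reach the user
        log.append({"request": request, "error": repr(exc)})
    return response


# Hypothetical stand-in models: v3 uses a lower threshold and rejects negatives.
def model_v2(x):
    return x >= 0.5


def model_v3(x):
    if x < 0:
        raise ValueError("negative input")
    return x >= 0.4


log = []
answers = [serve_with_shadow(x, model_v2, model_v3, log) for x in [0.1, 0.45, 0.9, -1.0]]
disagreements = [entry for entry in log if entry.get("match") is False]
```

The user-facing answers come from v2 only, while the log captures both the disagreement at 0.45 and the v3 crash on the negative input.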

Multi-armed bandit:
  the traffic split is adjusted automatically (e.g. Thompson sampling)
  the better model's share of traffic grows automatically
  balances exploration against exploitation
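
A minimal sketch of Thompson sampling for two models, assuming a binary success signal per request (a click, for example). The success rates are made-up values used only to simulate feedback; the router itself never sees them.

```python
import random


def thompson_route(stats, rng):
    """Pick the model whose sampled success rate (Beta posterior) is highest."""
    samples = {
        name: rng.betavariate(s["success"] + 1, s["failure"] + 1)
        for name, s in stats.items()
    }
    return max(samples, key=samples.get)


rng = random.Random(0)
true_rate = {"v2": 0.10, "v3": 0.20}  # hypothetical, unknown to the router
stats = {m: {"success": 0, "failure": 0} for m in true_rate}
served = {m: 0 for m in true_rate}

for _ in range(20_000):
    model = thompson_route(stats, rng)
    served[model] += 1
    outcome = "success" if rng.random() < true_rate[model] else "failure"
    stats[model][outcome] += 1

print(served)
```

Early on both models get traffic because their posteriors overlap (exploration). As evidence accumulates, the better model wins most draws (exploitation).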

→ For critical models, go shadow → canary → 100%; for ordinary models, go straight to canary.

GPU vs CPU

When a GPU is faster:
- large models (hundreds of millions of parameters or more)
- workloads that can be batched (e.g. image classification at batch=32+)
- LLMs (dominated by matrix multiplication)

When a CPU is a good fit:
- small models (tens of thousands of parameters)
- low-latency single inference (batch=1)
- cost-sensitive deployments (roughly 1/10 to 1/100 the cost of a GPU)
- low throughput requirements

Rules of thumb today:
- LLM = GPU in practice for production-scale serving (A100, H100, L4)
- classic ML (XGBoost, LightGBM) = CPU is enough
- vision (ResNet etc.) = CPU works; use a GPU when you can batch

After quantization, LLMs can also run on a CPU (llama.cpp is the usual example).
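
The cost argument depends heavily on utilization, which a quick calculation makes visible. Every price and throughput figure below is a hypothetical placeholder, not a benchmark; substitute your own measurements.

```python
def cost_per_million(hourly_price_usd, peak_rps, utilization):
    """USD per one million requests for one instance at a given utilization."""
    served_per_hour = peak_rps * utilization * 3600
    return hourly_price_usd / served_per_hour * 1_000_000


# Hypothetical instances: a cheap CPU box and a GPU box that costs 10x more.
cpu = cost_per_million(0.20, peak_rps=50, utilization=0.8)
gpu_busy = cost_per_million(2.00, peak_rps=1000, utilization=0.8)
gpu_idle = cost_per_million(2.00, peak_rps=1000, utilization=0.02)

print(f"CPU: ${cpu:.2f}  GPU busy: ${gpu_busy:.2f}  GPU mostly idle: ${gpu_idle:.2f}")
```

Under these assumed numbers the GPU is cheaper per request when it is kept busy and far more expensive when traffic is low, which is why low-throughput workloads tend to favor CPUs.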

What makes LLM serving different

Traditional model: a single call, input → output
LLM: generates token by token, sequentially (autoregressive)
     input "What is..." → "the" → "the meaning" → "the meaning of"...

Problems:
- the KV cache grows with every generated token (memory use keeps rising)
- sequences in a batch have different lengths (padding is wasted work)
- streaming responses are the default UX
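
The KV cache growth can be estimated with a short formula: two tensors (key and value) per layer, per KV head, per token. The shape used below (32 layers, 32 KV heads of dimension 128, FP16 values) is an assumption chosen to resemble a 7B-class model without grouped-query attention; `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache held for one sequence of num_tokens tokens."""
    # The factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(1, num_layers=32, num_kv_heads=32, head_dim=128)
one_sequence = kv_cache_bytes(4096, num_layers=32, num_kv_heads=32, head_dim=128)
batch_of_16 = 16 * one_sequence

print(per_token / 1024**2, "MiB per token")
print(one_sequence / 1024**3, "GiB for one 4096-token sequence")
print(batch_of_16 / 1024**3, "GiB for 16 such sequences")
```

With these assumed dimensions a batch of long sequences needs more memory for its cache than for the FP16 weights themselves, which is why cache management dominates LLM serving design.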

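Solutions (vLLM):
- PagedAttention: manages the KV cache in fixed-size pages, much like virtual memory
- continuous batching: as soon as a sequence finishes, a new one takes over its slot
- → up to 24× throughput (versus a HuggingFace Transformers baseline, as reported by the vLLM authors)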
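
The benefit of continuous batching shows up even in a toy simulation that counts decode steps. This sketch is a deliberate simplification: it ignores prefill, memory limits, and scheduling overhead, and the request lengths are made up.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Each batch runs until its longest sequence has finished."""
    return sum(max(lengths[i:i + slots]) for i in range(0, len(lengths), slots))


def continuous_batching_steps(lengths, slots):
    """A finished sequence frees its slot for the next waiting one immediately."""
    waiting = deque(lengths)
    active = []  # tokens still to generate for each running sequence
    steps = 0
    while waiting or active:
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1  # one decode step produces one token per active sequence
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # tokens to generate per request (made up)
static = static_batching_steps(lengths, slots=4)
continuous = continuous_batching_steps(lengths, slots=4)
print(static, continuous)
```

With static batching the short requests sit idle in their slots while the 100-token request finishes. Continuous batching lets the second long request start as soon as a slot opens, so the two long generations overlap.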

Other LLM serving tools:
- TGI (Text Generation Inference, HuggingFace)
- LMDeploy
- TensorRT-LLM
- Ollama (local, developer-friendly)

Common pitfalls

Using only batch=1: wastes the GPU and drives up cost. Dynamic batching is a must.
Skipping quantization: serving FP32 as-is is expensive and slow. Use FP16 at minimum.
Cold start: a newly auto-scaled instance can take minutes to load a large model. Pin a minimum replica count.
No warm-up: the first request is very slow (JIT compilation, cache misses). Run dummy inferences at startup.
Insufficient monitoring: watch latency p99, throughput, GPU utilization, and error rate together. Otherwise performance degrades silently.
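
On the monitoring point, the reason to track p99 rather than the mean is easy to show. A minimal sketch using the nearest-rank percentile definition; the latency samples are invented and the 100 ms threshold echoes the target from the introduction.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


latencies_ms = [20] * 97 + [80, 150, 400]  # made-up window of 100 requests
mean = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
slo_breached = p99 > 100  # the p99 < 100 ms target

print(mean, p50, p99, slo_breached)
```

The mean and median both look healthy here, yet the slowest few requests blow through the target. That gap is what goes unnoticed without tail-latency monitoring.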

Wrap-up

Model serving is the "production engineering" side of ML, where dynamic batching, quantization, and canary deployment are standard practice. LLMs are a world of their own with dedicated tools (vLLM, TGI).

In practice: small models → BentoML or a cloud endpoint; large models → Triton or a cloud GPU endpoint; LLMs → vLLM or TGI. In every case, evaluate quantization and invest in solid monitoring.