How does structured logging work?

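Free-form logs such as printf("user %d logged in\n", id) are of little use in production: much of the information you need cannot be found with grep. The fix is structured logging: JSON output, consistent fields, and a correlation ID. This guide covers the mechanism and the operational patterns around it.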

Free-form vs Structured

// Free-form (printf)
log.info("User 42 logged in from 192.168.1.1")
log.error("DB failed: connection refused")

Problems:
  - "All logs for user 42 in the last hour" → hard to do with grep
  - "Which user hits the most errors?" → requires parsing
  - No field consistency (user 42 vs user id 42 vs uid=42)

// Structured (JSON)
log.info({user_id: 42, ip: "192.168.1.1"}, "User logged in")
log.error({error: e, db: "postgres"}, "DB failed")

→ JSON output:
{"ts":"...", "level":"info", "msg":"User logged in", "user_id":42, "ip":"..."}
{"ts":"...", "level":"error", "msg":"DB failed", "error":"...", "db":"postgres"}

Advantages:
  - Filterable and queryable (exact matches such as user_id=42)
  - Aggregation comes naturally (count by user_id)
  - Tool interoperability (Loki, Elasticsearch, and CloudWatch all understand JSON)
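
To make the contrast concrete, here is a minimal sketch of a JSON logger built on Python's standard logging module. The `JsonFormatter` class, the `make_logger` helper, and the `extra={"fields": ...}` convention are assumptions made for this illustration; production libraries such as structlog or pino have their own APIs.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Structured fields arrive via extra={"fields": {...}}. This is a
        # convention chosen for this sketch, not something logging mandates.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, default=str)


def make_logger(stream) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("structured-demo")
    logger.handlers.clear()
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buf = io.StringIO()
log = make_logger(buf)
log.info("User logged in", extra={"fields": {"user_id": 42, "ip": "192.168.1.1"}})
log.error("DB failed", extra={"fields": {"error": "connection refused", "db": "postgres"}})

# Once every line is JSON, exact-match filtering is a dictionary lookup.
records = [json.loads(line) for line in buf.getvalue().splitlines()]
user_42 = [r for r in records if r.get("user_id") == 42]
```

The same filter on free-form text would need a regular expression per message format; here it is a single key comparison.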

Log Levels: Five Tiers

DEBUG: development detail. Usually disabled in production
INFO:  a meaningful event during normal operation ("User logged in")
WARN:  still working, but worth attention ("Retry attempt 3", "Slow query 2s")
ERROR: an operation failed but the service keeps running ("DB query failed, retrying")
FATAL: the service is down or someone must be paged immediately ("Cannot bind port")

Rules:
  - Too much INFO raises cost and buries the signal
  - An ERROR entry should also record the next action taken (e.g. "retrying")
  - FATAL must be truly fatal: it is what automatically triggers alerts
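
As a rough illustration of how the five tiers behave as a threshold, the sketch below maps them onto the numeric levels of Python's logging module. The `should_emit` and `should_page` helpers are hypothetical names for this example, and the paging rule is simply the "FATAL triggers alerts" rule restated in code.

```python
import logging

# The five tiers mapped onto stdlib numeric levels.
# logging.FATAL is an alias of logging.CRITICAL.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.FATAL,
}


def should_emit(level: str, threshold: str) -> bool:
    """True if an entry at `level` passes a logger configured at `threshold`."""
    return LEVELS[level] >= LEVELS[threshold]


def should_page(level: str) -> bool:
    """Only FATAL entries are treated as automatic alert triggers."""
    return level == "FATAL"
```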

Correlation ID: Linking a Request Across Services

In a microservice environment:
  Client → API Gateway → Service A → Service B → DB

Logs from each service:
  api-gw:    {msg: "request received", request_id: "req_abc123"}
  service-a: {msg: "calling B", request_id: ???}      ← lost!
  service-b: {msg: "DB query slow", request_id: ???}

Solution: pass request_id along in an HTTP header
  - The first service generates a UUID (or reuses the incoming header)
  - Every downstream call carries the X-Request-ID header
  - Each service's logging middleware injects it automatically

Search:
  log_stream | filter request_id = "req_abc123"
  → every service's logs for that request at once
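
The propagation steps above can be sketched without any web framework. In this example, `begin_request`, `log_entry`, and `outgoing_headers` are hypothetical helpers standing in for what a real logging middleware and HTTP client wrapper would do, and headers are modeled as plain dictionaries.

```python
import contextvars
import uuid

HEADER = "X-Request-ID"

# Holds the ID of the request currently being handled.
_request_id = contextvars.ContextVar("request_id", default=None)


def begin_request(headers: dict) -> str:
    """Reuse the incoming ID if present, otherwise generate a new one."""
    # Real HTTP header names are case-insensitive; a plain dict lookup is
    # a simplification made for this sketch.
    rid = headers.get(HEADER) or f"req_{uuid.uuid4().hex}"
    _request_id.set(rid)
    return rid


def log_entry(msg: str, **fields) -> dict:
    """Build a log entry with the current request_id injected automatically."""
    return {"msg": msg, "request_id": _request_id.get(), **fields}


def outgoing_headers() -> dict:
    """Headers to attach to every downstream call."""
    return {HEADER: _request_id.get()}


# api-gw: no incoming header, so an ID is generated.
gw_id = begin_request({})
gw_log = log_entry("request received")
to_service_a = outgoing_headers()

# service-a: receives the header and continues with the same ID.
a_id = begin_request(to_service_a)
a_log = log_entry("calling B")
```

Using a context variable rather than a global keeps concurrent requests in the same process from overwriting each other's ID.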

Integrating with OpenTelemetry trace_id

If you also have tracing, use trace_id instead of request_id:

{"msg":"DB slow", "trace_id":"abc123", "span_id":"s5", "duration_ms":3200}

→ Logs and traces linked:
  - Find an ERROR in the logs → click the trace_id → see the full trace tree for that request
  - Find a slow span in a trace → look up the logs for that span

Integrated tools (Grafana, Honeycomb, etc.) use this link automatically.
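
Both lookups come down to filtering on two fields. The sketch below works on a few made-up log lines held in memory; the sample entries and the `logs_for_trace` / `logs_for_span` helpers are illustrative assumptions, and a real backend would run the equivalent query against its own index.

```python
import json

# Made-up sample lines in the shape shown above.
raw_logs = [
    '{"level":"info","msg":"request received","trace_id":"abc123","span_id":"s1"}',
    '{"level":"error","msg":"DB slow","trace_id":"abc123","span_id":"s5","duration_ms":3200}',
    '{"level":"info","msg":"request received","trace_id":"def456","span_id":"s1"}',
]
entries = [json.loads(line) for line in raw_logs]


def logs_for_trace(entries: list, trace_id: str) -> list:
    """Log to trace direction: everything logged under one trace."""
    return [e for e in entries if e.get("trace_id") == trace_id]


def logs_for_span(entries: list, trace_id: str, span_id: str) -> list:
    """Trace to log direction: the entries emitted inside one span."""
    return [
        e for e in entries
        if e.get("trace_id") == trace_id and e.get("span_id") == span_id
    ]


# Start from the first ERROR and pull in the rest of its trace.
first_error = next(e for e in entries if e["level"] == "error")
trace_logs = logs_for_trace(entries, first_error["trace_id"])
span_logs = logs_for_span(entries, "abc123", "s5")
```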

Log Aggregation: Collecting Everything in One Place

Tool	Type	Characteristics
Elasticsearch / OpenSearch (ELK)	self-hosted / cloud	strong full-text search, high storage cost
Loki (Grafana)	OSS	indexes labels only (the log body is not indexed) → cheap; same model as Prometheus
CloudWatch Logs (AWS)	cloud	AWS integration, query cost per GB
Cloud Logging (GCP)	cloud	GCP integration, JSON-aware queries
Datadog Logs	SaaS	expensive, but strong integration (metrics / traces)
Splunk	enterprise	long-established, powerful SPL query language

Sampling: Logs Need Sampling Too

The main culprit behind cost blowups in production: an INFO log on every request.

Option 1: adjust the log level
  production = WARN and above only → INFO is dropped

Option 2: head-based sampling
  INFO logs for only 10% of requests
  ERROR / FATAL are always logged

Option 3: duration-based
  INFO logs only when duration > 1s (observe only the slow path)

Option 4: adaptive
  as load rises, lower the sampling rate (load shedding)

Pitfall: after sampling, "trace this user's errors" becomes hard. Either keep 100% for
      specific user_ids, or make the sampling decision per trace ID (every log in one trace gets the same decision).
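
One way to make the decision consistent per trace is to derive it from a hash of the trace ID instead of a random draw. The following is a minimal sketch under that assumption; `keep_trace` and `should_log` are hypothetical names, SHA-256 is just one convenient stable hash, and only INFO is sampled here while every other level passes through.

```python
import hashlib


def keep_trace(trace_id: str, rate: float) -> bool:
    """Deterministic decision: the same trace_id always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    # Keep the trace when its bucket falls in the lowest `rate` fraction.
    return bucket < rate * 2**64


def should_log(level: str, trace_id: str, rate: float = 0.10) -> bool:
    """Sample INFO by trace; ERROR, FATAL, and other levels are always kept."""
    if level != "INFO":
        return True
    return keep_trace(trace_id, rate)


# Every INFO entry of one trace shares the same fate.
decisions = {should_log("INFO", "abc123") for _ in range(5)}
```

Because each service computes the same hash from the same trace ID, the services agree on which requests to keep without coordinating with each other.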

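Common Pitfalls

Printing PII with printf: email addresses, phone numbers, tokens, and other sensitive data end up in the logs. This risks violating GDPR. Redaction middleware is a must.
Missing stack traces on errors: a one-line "DB failed" on its own makes debugging impossible. Log error.stack along with it.
Logging bulk binary data or large objects: storage in the log aggregator blows up. Log only a link to separate storage (S3).
Inconsistent log structure: field names differ from service to service (user_id vs uid vs userId). Settle on a schema up front.
Synchronous log writes: disk I/O adds to request latency. Use an asynchronous or buffered logger (e.g. pino, zap).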
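
For the first pitfall, a redaction step can be sketched as a function applied to every entry before it is written. The key list, the email pattern, and the `redact` helper below are assumptions for illustration only ("password" is added to the email / phone / token examples), and a real deployment would need a reviewed, much more complete rule set.

```python
import re

# Field names treated as sensitive in this sketch.
SENSITIVE_KEYS = {"email", "phone", "token", "password"}
# Deliberately simple pattern: catches common addresses, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
REDACTED = "[REDACTED]"


def redact(value, key=None):
    """Return a copy of `value` with sensitive fields and emails masked."""
    if isinstance(key, str) and key.lower() in SENSITIVE_KEYS:
        return REDACTED
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    if isinstance(value, str):
        # Free-text values can still carry an address inside the message.
        return EMAIL_RE.sub(REDACTED, value)
    return value


entry = {
    "msg": "signup from alice@example.com",
    "user_id": 42,
    "user": {"email": "alice@example.com", "plan": "pro"},
    "token": "abc",
}
clean = redact(entry)
```

The function builds new containers rather than mutating its input, so the original entry remains available to application code that legitimately needs it.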

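Wrap-up

Structured logging = JSON + consistent fields + correlation ID. It is a simple change, but it decides whether production debugging is possible at all.

Where to start: pino (Node), zap (Go), structlog (Python). Define the log schema early and let middleware inject trace_id / request_id automatically. When costs blow up, sample, and keep the decision consistent with your traces.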