How Structured Logging Actually Works

Free-form logs in the style of printf("user %d logged in\n", id) are hard to work with in production: grep can match text, but it cannot reliably filter or aggregate by field. The fix is structured logging: JSON output, consistent field names, and correlation IDs. This guide covers the mechanics and operational patterns.

Free-form vs Structured

// Free-form (printf)
log.info("User 42 logged in from 192.168.1.1")
log.error("DB failed: connection refused")

Problems:
  - "All logs from user 42 last hour" → hard via grep
  - "Which user errors most?" → needs parsing
  - Inconsistent fields (user 42 vs user id 42 vs uid=42)

// Structured (JSON)
log.info({user_id: 42, ip: "192.168.1.1"}, "User logged in")
log.error({error: e, db: "postgres"}, "DB failed")

→ JSON output:
{"ts":"...", "level":"info", "msg":"User logged in", "user_id":42, "ip":"..."}
{"ts":"...", "level":"error", "msg":"DB failed", "error":"...", "db":"postgres"}

Pros:
  - Filter / query precisely (user_id=42 exact match)
  - Natural aggregation (count by user_id)
  - Tool compatibility (Loki, Elasticsearch, CloudWatch all understand JSON)
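
The filter and aggregation benefits above can be sketched with nothing but the Node standard library. This is a minimal illustration, not a replacement for a real logger: makeLogger is a hypothetical helper, and the in-memory lines array stands in for a log backend.

```javascript
// Minimal structured logger: one JSON object per line.
// makeLogger is a hypothetical helper, not a real library API.
function makeLogger(write = (line) => process.stdout.write(line + "\n")) {
  const emit = (level) => (fields, msg) => {
    // Fixed envelope fields first, then the caller-supplied fields.
    const record = { ts: new Date().toISOString(), level, msg, ...fields };
    write(JSON.stringify(record));
  };
  return { info: emit("info"), error: emit("error") };
}

// Capture lines in memory so they can be queried below.
const lines = [];
const log = makeLogger((line) => lines.push(line));

log.info({ user_id: 42, ip: "192.168.1.1" }, "User logged in");
log.error({ user_id: 42, db: "postgres" }, "DB failed");
log.error({ user_id: 7, db: "postgres" }, "DB failed");

const records = lines.map((line) => JSON.parse(line));

// Exact-match filter: every record for user 42.
const forUser42 = records.filter((r) => r.user_id === 42);

// Aggregation: error count per user_id.
const errorsByUser = {};
for (const r of records) {
  if (r.level === "error") {
    errorsByUser[r.user_id] = (errorsByUser[r.user_id] || 0) + 1;
  }
}

console.log(forUser42.length, errorsByUser);
```

Because user_id is a typed field rather than a substring, the filter cannot accidentally match "user 420" or an IP address containing 42.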

Log Levels — Five Tiers

DEBUG: development detail; usually disabled in production
INFO:  normal, meaningful events ("User logged in")
WARN:  unexpected but handled; worth a look ("Retry attempt 3", "Slow query 2s")
ERROR: an operation failed but the service keeps running ("DB query failed, retrying")
FATAL: the service cannot continue; page-worthy ("Cannot bind port")

Rules:
  - Excessive INFO = cost ↑ + signal lost
  - ERROR logs should include the next action ("retrying")
  - Reserve FATAL for cases where the process cannot continue; it is a natural trigger for automatic alerts
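
One way to picture how a level threshold works is a numeric rank per level plus a comparison. The sketch below assumes the five level names listed above; the numeric values and the levelFilter helper are illustrative choices, not taken from any particular library.

```javascript
// Numeric ranks for the five levels; higher means more severe.
// The exact numbers are arbitrary, only their order matters.
const LEVELS = { DEBUG: 10, INFO: 20, WARN: 30, ERROR: 40, FATAL: 50 };

// Returns a predicate that keeps records at or above the minimum level.
function levelFilter(minLevel) {
  const threshold = LEVELS[minLevel];
  if (threshold === undefined) throw new Error(`unknown level: ${minLevel}`);
  return (record) => LEVELS[record.level] >= threshold;
}

const records = [
  { level: "DEBUG", msg: "cache miss" },
  { level: "INFO", msg: "User logged in" },
  { level: "WARN", msg: "Retry attempt 3" },
  { level: "ERROR", msg: "DB query failed, retrying" },
  { level: "FATAL", msg: "Cannot bind port" },
];

// A typical production setting drops DEBUG and keeps the rest.
const production = records.filter(levelFilter("INFO"));

// An alerting rule might only look at FATAL.
const pageWorthy = records.filter(levelFilter("FATAL"));

console.log(production.map((r) => r.level), pageWorthy.map((r) => r.msg));
```

Raising the threshold to WARN is the cheapest way to cut volume, at the price of losing the INFO trail that explains what happened before an error.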

Correlation IDs — Tying a Request Across Services

Microservice call path:
  Client → API Gateway → Service A → Service B → DB

Per-service logs:
  api-gw:    {msg: "request received", request_id: "req_abc123"}
  service-a: {msg: "calling B", request_id: ???}      ← lost!
  service-b: {msg: "DB query slow", request_id: ???}

Fix: propagate request_id via an HTTP header
  - The first service generates a UUID (or reuses the incoming header's value)
  - Downstream calls carry it in X-Request-ID
  - Each service's logging middleware injects request_id into every log line automatically

Query:
  log_stream | filter request_id = "req_abc123"
  → All services' logs for that request in one place
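
The propagation steps above can be sketched with Node's AsyncLocalStorage, which carries a value along an asynchronous call chain without passing it through every function. This assumes a Node service; resolveRequestId, withRequestContext, logInfo, and outgoingHeaders are hypothetical names, and a real service would wire them into its HTTP framework's middleware.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

const requestContext = new AsyncLocalStorage();

// Reuse the caller's ID if present, otherwise mint one at the edge.
// Node lowercases incoming HTTP header names, hence "x-request-id".
function resolveRequestId(headers) {
  return headers["x-request-id"] || randomUUID();
}

// Runs a handler with request_id bound to the current async call chain.
function withRequestContext(headers, handler) {
  return requestContext.run({ request_id: resolveRequestId(headers) }, handler);
}

// Logger that injects request_id without the caller passing it.
function logInfo(fields, msg) {
  const ctx = requestContext.getStore() || {};
  const record = { level: "info", msg, ...fields, ...ctx };
  console.log(JSON.stringify(record));
  return record;
}

// Headers to attach to a downstream call so the next service reuses the ID.
function outgoingHeaders() {
  const ctx = requestContext.getStore() || {};
  return ctx.request_id ? { "X-Request-ID": ctx.request_id } : {};
}

// Simulated request that arrived with an ID set by the gateway.
const result = withRequestContext({ "x-request-id": "req_abc123" }, () => ({
  logged: logInfo({ target: "service-b" }, "calling B"),
  headers: outgoingHeaders(),
}));
```

Here the handler never mentions request_id, yet both the log record and the outgoing headers carry it, which is the property that keeps the ID from getting lost between services.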

Integration with OpenTelemetry trace_id

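With tracing, use trace_id instead of (or alongside) request_id. The IDs below are shortened for readability; real OpenTelemetry trace IDs are 32 hex characters and span IDs are 16: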

{"msg":"DB slow", "trace_id":"abc123", "span_id":"s5", "duration_ms":3200}

→ Log + trace unified:
  - In logs you find an ERROR → click trace_id → full trace tree
  - In traces a slow span → look up the span's logs

Tools such as Grafana and Honeycomb can follow this link between logs and traces, typically once the trace_id field is mapped in their configuration.
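
In practice an OpenTelemetry SDK supplies the current trace and span IDs. As a rough illustration of where those values come from on the wire, the sketch below parses a W3C Trace Context traceparent header by hand. The header choice is an assumption (the text above does not name a propagation format), and parseTraceparent is a hypothetical helper.

```javascript
// Parses a W3C Trace Context "traceparent" header:
//   version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex)
// Returns null for anything that does not match that shape.
function parseTraceparent(header) {
  const pattern = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
  const match = pattern.exec(header || "");
  if (!match) return null;
  const [, , traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return {
    trace_id: traceId,
    // This is the caller's span; an instrumented service would log its own span_id.
    parent_span_id: parentSpanId,
    sampled: (parseInt(flags, 16) & 1) === 1,
  };
}

// Example header value in the format used by the W3C specification.
const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const trace = parseTraceparent(incoming);

// Attach the trace fields to a log record so logs and traces can be joined.
const record = { msg: "DB slow", duration_ms: 3200, ...trace };
console.log(JSON.stringify(record));
```

The sampled flag is worth keeping in the record: it tells you whether a matching trace is likely to exist before you go looking for it.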

Log Aggregation — Centralizing

Tool	Type	Notes
Elasticsearch / OpenSearch (ELK)	Self-host / cloud	Strong full-text search; high storage cost
Loki (Grafana)	Self-host / cloud (OSS)	Indexes labels only, not the log body → cheap; Prometheus-like label model
CloudWatch Logs (AWS)	Cloud	AWS-integrated; priced per GB ingested, and Logs Insights queries per GB scanned
Cloud Logging (GCP)	Cloud	GCP-integrated, JSON-aware queries
Datadog Logs	SaaS	Expensive but strong (metric / trace) integration
Splunk	Enterprise	Classic, powerful SPL query language

Sampling — Logs Need It Too

The main production cost driver is an INFO log on every request.

Strategy 1 — adjust level
  production = WARN or higher → drop INFO

Strategy 2 — head-based sampling
  10% of requests get INFO logs (decided once, at the start of the request)
  ERROR / FATAL always

Strategy 3 — duration-based
  Only INFO when duration > 1s (observe slow path)

Strategy 4 — adaptive
  Drop sampling rate when load is high (load-shedding)

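Trap: once logs are sampled, "trace this user's error" gets hard, because the
      lines leading up to the error may have been dropped. Either keep 100% of
      logs for the user_ids you need to follow, or tie the sampling decision to
      the trace ID (all logs of the same trace share the same fate).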
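
A trace-consistent decision can be made without any coordination between services by hashing the trace ID into a number and comparing it with the sampling rate. The sketch below is one possible approach under that assumption; traceBucket and shouldLog are hypothetical names, and SHA-256 is used only because it ships with Node.

```javascript
const { createHash } = require("node:crypto");

// Map a trace ID to a stable number in [0, 1).
// The same trace ID yields the same value on every service.
function traceBucket(traceId) {
  const digest = createHash("sha256").update(traceId).digest();
  return digest.readUInt32BE(0) / 2 ** 32;
}

// ERROR / FATAL are always kept; everything else only for sampled traces.
function shouldLog(record, sampleRate) {
  if (record.level === "error" || record.level === "fatal") return true;
  return traceBucket(record.trace_id) < sampleRate;
}

// Two services logging for the same trace reach the same decision.
const fromServiceA = { level: "info", msg: "calling B", trace_id: "trace-1234" };
const fromServiceB = { level: "info", msg: "DB query slow", trace_id: "trace-1234" };
const sameFate = shouldLog(fromServiceA, 0.1) === shouldLog(fromServiceB, 0.1);

console.log({ sameFate });
```

Since the decision depends only on the trace ID and the rate, every service must use the same rate for the all-or-nothing property to hold.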

Common Pitfalls

Logging PII or secrets: emails, phone numbers, and tokens leak into logs when values are interpolated straight into messages. This is a GDPR risk; add redaction middleware.
Missing stack traces on errors: a one-line "DB failed" is undebuggable. Always include error.stack.
Logging large binaries or objects: aggregation storage blows up. Store the payload separately (for example in S3) and log only a link.
Inconsistent log shape: different services use different field names (user_id vs uid vs userId). Define a schema upfront.
Synchronous log writes: disk I/O adds to request latency. Buffer or write asynchronously (for example, pino with an asynchronous destination, or zap with a buffered write syncer).
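
For the PII pitfall, a redaction step can be as simple as masking known field names before a record is serialized. The sketch below is a minimal illustration: the key list and the redact helper are assumptions, it handles only plain JSON-like data (no circular references, no Error or Date instances), and it cannot catch PII embedded in free-text messages.

```javascript
// Field names treated as sensitive; this list is an assumption for illustration.
const SENSITIVE_KEYS = new Set(["email", "phone", "password", "token", "authorization"]);

// Returns a copy of the value with sensitive fields masked,
// recursing into nested objects and arrays. The input is not mutated.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value === null || typeof value !== "object") return value;
  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
  }
  return out;
}

const raw = {
  user_id: 42,
  email: "someone@example.com",
  request: { headers: { Authorization: "Bearer abc", accept: "application/json" } },
};

const safe = redact(raw);
console.log(JSON.stringify(safe));
```

Established loggers often provide this themselves (pino, for example, has a redact option), so a hand-rolled version like this is mainly useful for understanding what such a feature does.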

Wrap-up

Structured logging = JSON + consistent fields + correlation IDs. The change is simple, but it is the difference between production debugging being possible and impossible.

Recommended starting points: pino (Node), zap (Go), structlog (Python). Define a log schema early, and inject trace_id / request_id via middleware. If costs explode, sample, and keep the decisions trace-consistent.