Free-form logs in the style of printf("user %d logged in\n", id) are hard to work with in production: the message format varies from call site to call site, so grep rarely finds exactly what you need. The fix is structured logging: JSON output, consistent field names, and correlation IDs. This guide covers the mechanics and the operational patterns.
Free-form vs Structured
// Free-form (printf)
log.info("User 42 logged in from 192.168.1.1")
log.error("DB failed: connection refused")
Problems:
- "All logs from user 42 last hour" → hard via grep
- "Which user errors most?" → needs parsing
- Inconsistent fields (user 42 vs user id 42 vs uid=42)
// Structured (JSON)
log.info({user_id: 42, ip: "192.168.1.1"}, "User logged in")
log.error({error: e, db: "postgres"}, "DB failed")
→ JSON output:
{"ts":"...", "level":"info", "msg":"User logged in", "user_id":42, "ip":"..."}
{"ts":"...", "level":"error", "msg":"DB failed", "error":"...", "db":"postgres"}
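One way to produce JSON lines like these is a custom formatter on top of Python's standard logging module. This is a minimal sketch, not how pino or any specific library works. The names JsonFormatter and make_logger, and the convention of passing structured fields as extra={"fields": {...}}, are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        # Assumed convention: structured fields arrive as extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        if record.exc_info:
            entry["stack"] = self.formatException(record.exc_info)
        # default=str keeps non-JSON values (dates, exceptions) from raising.
        return json.dumps(entry, default=str)


def make_logger(name: str, stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()  # stands in for stdout or a log file
log = make_logger("demo", stream)
log.info("User logged in", extra={"fields": {"user_id": 42, "ip": "192.168.1.1"}})
log.error("DB failed", extra={"fields": {"error": "connection refused", "db": "postgres"}})
print(stream.getvalue(), end="")
```

Because every line is a self-contained JSON object, a query such as user_id=42 becomes a field comparison rather than a regular expression.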
Pros:
- Filter / query precisely (user_id=42 exact match)
- Natural aggregation (count by user_id)
- Tool compatibility (Loki, Elasticsearch, CloudWatch all understand JSON)
Log Levels: Five Tiers
DEBUG: development detail, usually disabled in production
INFO: normal, meaningful events ("User logged in")
WARN: unexpected or degraded, but handled ("Retry attempt 3", "Slow query 2s")
ERROR: an operation failed, but the service continues ("DB query failed, retrying")
FATAL: the service is down or someone should be paged ("Cannot bind port")
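As a rough illustration of "usually disabled in production", the threshold can come from configuration instead of code. The sketch below maps the five tiers onto Python's standard logging levels. The LOG_LEVEL environment variable and the configure_level helper are hypothetical names, and Python has no separate FATAL tier (logging.FATAL is an alias of CRITICAL).

```python
import logging
import os

# The five tiers mapped onto Python's built-in levels.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARN": logging.WARNING,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,  # logging.FATAL is an alias of CRITICAL
}


def configure_level(logger: logging.Logger, default: str = "INFO") -> int:
    """Set the threshold from the (hypothetical) LOG_LEVEL environment variable."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = LEVELS.get(name, LEVELS[default])  # unknown names fall back to the default
    logger.setLevel(level)
    return level


os.environ["LOG_LEVEL"] = "warn"  # e.g. set by the production deployment
log = logging.getLogger("levels-demo")
configure_level(log)

print(log.isEnabledFor(logging.DEBUG))  # below the threshold
print(log.isEnabledFor(logging.ERROR))  # at or above the threshold
```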
Rules:
- Excessive INFO logging raises cost and buries the signal
- ERROR logs should state the next action ("retrying")
- Reserve FATAL for real outages: every FATAL should be a candidate for an automatic alert
Correlation IDs: Tying a Request Across Services
Microservice:
Client → API Gateway → Service A → Service B → DB
Per-service logs:
api-gw: {msg: "request received", request_id: "req_abc123"}
service-a: {msg: "calling B", request_id: ???} ← lost!
service-b: {msg: "DB query slow", request_id: ???}
Fix: propagate request_id via HTTP header
- The first service generates a UUID (or reuses the incoming header)
- Every downstream call carries X-Request-ID
- Each service's logging middleware injects the ID into every log line automatically
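The three steps above can be sketched with the Python standard library: a context variable holds the current ID, a logging filter copies it onto every record, and a helper builds the headers for downstream calls. The function and class names are hypothetical, and a plain dict stands in for a web framework's header map, which would normally be case-insensitive.

```python
import contextvars
import io
import json
import logging
import uuid

HEADER = "X-Request-ID"
request_id_var = contextvars.ContextVar("request_id", default=None)


def begin_request(incoming_headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise generate one."""
    request_id = incoming_headers.get(HEADER) or f"req_{uuid.uuid4().hex}"
    request_id_var.set(request_id)
    return request_id


def outgoing_headers() -> dict:
    """Headers to attach to every downstream call."""
    return {HEADER: request_id_var.get()}


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


class JsonLineFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({"msg": record.getMessage(), "request_id": record.request_id})


stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonLineFormatter())
handler.addFilter(RequestIdFilter())
log = logging.getLogger("service-a")
log.handlers = [handler]
log.setLevel(logging.INFO)
log.propagate = False

# Simulate service A receiving a call from the gateway, then calling service B.
begin_request({"X-Request-ID": "req_abc123"})
log.info("calling B")
headers_for_b = outgoing_headers()
print(stream.getvalue(), end="")
```

A context variable is used here so that concurrent requests, whether in threads or asyncio tasks, each see their own ID without passing it through every function call.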
Query:
log_stream | filter request_id = "req_abc123"
→ All services' logs for that request in one place
Integration with OpenTelemetry trace_id
With tracing, use trace_id instead of (or alongside) request_id:
{"msg":"DB slow", "trace_id":"abc123", "span_id":"s5", "duration_ms":3200}
→ Log + trace unified:
- In logs you find an ERROR → click trace_id → full trace tree
- In traces a slow span → look up the span's logs
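A tracing SDK normally supplies these IDs. As a dependency-free sketch, they can also be read from the W3C Trace Context traceparent header, which OpenTelemetry propagates by default. The header is an assumption here, since the text above does not name it, and trace_fields is a hypothetical helper. Note that the header's third field is the caller's span ID, so a real service would log its own current span ID from the SDK instead.

```python
import json
import re

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def trace_fields(traceparent: str) -> dict:
    """Return trace_id / span_id log fields, or {} if the header is malformed."""
    match = TRACEPARENT.match(traceparent.strip())
    if not match:
        return {}
    _version, trace_id, span_id, _flags = match.groups()
    # All-zero IDs are invalid according to the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return {}
    return {"trace_id": trace_id, "span_id": span_id}


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
entry = {"msg": "DB slow", "duration_ms": 3200, **trace_fields(header)}
print(json.dumps(entry))
```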
Tools such as Grafana and Honeycomb can use this link to jump between a log line and its trace.
Log Aggregation: Centralizing
| Tool | Type | Notes |
|---|---|---|
| Elasticsearch / OpenSearch (ELK) | Self-host / cloud | Strong full-text search; high storage cost |
| Loki (Grafana) | OSS | Label-indexed (body not indexed) → cheap, Prometheus-like model |
| CloudWatch Logs (AWS) | Cloud | AWS-integrated; Logs Insights queries billed per GB scanned |
| Cloud Logging (GCP) | Cloud | GCP-integrated, JSON-aware queries |
| Datadog Logs | SaaS | Expensive, but strong integration with metrics and traces |
| Splunk | Enterprise | Classic, powerful SPL query language |
Sampling: Logs Need It Too
The main cost driver in production is an INFO log on every request.
Strategy 1 — adjust level
production = WARN or higher → drop INFO
Strategy 2 — head-based sampling
10% of requests get INFO logs
ERROR / FATAL always
Strategy 3 — duration-based
Only INFO when duration > 1s (observe slow path)
Strategy 4 — adaptive
Drop sampling rate when load is high (load-shedding)
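Strategies 2 and 3 can be sketched as one decision function. Hashing the trace ID (SHA-256 is an arbitrary choice here) makes the decision deterministic, so every service reaches the same verdict for the same trace without coordinating. The names should_log and keep_fraction are hypothetical, and the 10% and 1s defaults are taken from the strategies above.

```python
import hashlib

# Levels that bypass sampling; the text above only requires ERROR / FATAL.
ALWAYS_KEEP = {"error", "fatal"}


def keep_fraction(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def should_log(level: str, trace_id: str, duration_ms: float = 0.0,
               sample_rate: float = 0.10, slow_ms: float = 1000.0) -> bool:
    if level in ALWAYS_KEEP:
        return True  # never sample away failures
    if duration_ms > slow_ms:
        return True  # strategy 3: always observe the slow path
    # Strategy 2: the same trace ID always yields the same decision.
    return keep_fraction(trace_id) < sample_rate


ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(should_log("info", t) for t in ids)
print(f"kept {kept} of {len(ids)} INFO logs")
```

An adaptive variant (strategy 4) could lower sample_rate at runtime under load. The decision for any given trace stays consistent as long as all services use the same rate.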
Trap: after sampling, "trace this user's error" gets hard, because the logs leading up to the error may have been dropped. Either keep 100% of the logs for a given user_id, or tie the sampling decision to the trace ID so that all logs of the same trace share the same fate.
Common Pitfalls
- Logging PII via printf-style interpolation: emails, phone numbers, and tokens leak into logs, which is a GDPR risk. Add redaction middleware.
- Missing stack traces on errors: a one-line "DB failed" cannot be debugged. Always include error.stack.
- Logging large binaries or objects: log storage blows up. Store the payload separately (e.g. S3) and log only a link.
- Inconsistent log shape: different services use different field names (user_id vs uid vs userId). Define a schema upfront.
- Synchronous log writes: disk IO adds to request latency. Use asynchronous or buffered writes (e.g. pino's asynchronous destination mode, or a buffered write syncer in zap).
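For the first pitfall, a redaction step can run on every log entry before it is serialized. The sketch below is deliberately simple and not a complete PII filter. The key list, the e-mail pattern, and the redact function are assumptions, and a real deployment would need a reviewed list of sensitive fields.

```python
import json
import re

# Hypothetical deny-list of field names; compared case-insensitively.
SENSITIVE_KEYS = {"email", "phone", "password", "token", "authorization"}
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(value, key: str = ""):
    """Recursively mask sensitive keys and e-mail-like strings; returns a copy."""
    if key.lower() in SENSITIVE_KEYS:
        return "[REDACTED]"
    if isinstance(value, dict):
        return {k: redact(v, str(k)) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    if isinstance(value, str):
        # Catches addresses interpolated into free-form messages.
        return EMAIL_RE.sub("[REDACTED_EMAIL]", value)
    return value


entry = {
    "msg": "signup from alice@example.com",
    "user_id": 42,
    "email": "alice@example.com",
    "request": {"headers": {"Authorization": "Bearer abc"}},
}
print(json.dumps(redact(entry)))
```

Masking by field name only works when field names are consistent, which is one more reason to define the log schema upfront.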
Wrap-up
Structured logging = JSON + consistent fields + correlation IDs. It is a simple change, but it can make the difference between production debugging being possible and impossible.
Good starting points: pino (Node), zap (Go), structlog (Python). Define a log schema early, and inject trace_id / request_id via middleware. If costs explode, sample, keeping the decision consistent per trace.