Something is wrong in production. But what? One user? All users? The DB? An external API? Answering questions like these is what observability is for. Unlike pure monitoring, which watches predefined metrics, it has to answer questions you did not define in advance. This guide covers the three pillars (logs, metrics, traces) and why cardinality sets the cost ceiling.
Monitoring vs Observability — Meaning
Monitoring: answer known questions
- alert when CPU > 80%
- alert when 5xx rate > 1%
- predefined dashboards
Observability: the ability to answer new questions
- "Why did last night's 11pm latency spike?"
- "Where did this user's request fail?"
- "After feature X shipped, which endpoint's p99 changed?"
→ Both are needed. Monitoring detects incidents; observability diagnoses them.
Three Pillars — Logs · Metrics · Traces
Logs — Record of What Happened
Shape: events (timestamp + context)
{
  "ts": "2026-05-25T14:32:01Z",
  "level": "error",
  "msg": "DB connection refused",
  "host": "api-prod-3",
  "user_id": "u_42",
  "request_id": "req_abc123",
  "stack": "..."
}
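An event shaped like the one above could be produced with Python's standard `logging` module. This is a minimal sketch, not a recommended production setup: the `JsonFormatter` class, the `build_logger` helper, and the choice of context fields are assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Context fields copied from the record when the caller passes them via `extra=`.
    CONTEXT_FIELDS = ("host", "user_id", "request_id")

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "ts": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                event[field] = getattr(record, field)
        if record.exc_info:
            event["stack"] = self.formatException(record.exc_info)
        return json.dumps(event)


def build_logger(stream) -> logging.Logger:
    """Hypothetical helper: a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("api")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.error(
    "DB connection refused",
    extra={"host": "api-prod-3", "user_id": "u_42", "request_id": "req_abc123"},
)
print(buffer.getvalue(), end="")
```

Because every field is a key rather than part of a free-text message, a log backend can filter on `user_id` or `request_id` without parsing strings.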
Strength: rich context, precise debugging
Weakness: high volume (TBs), expensive aggregation, query cost
Metrics — Time-Series Numbers
Shape: timestamp + number + labels (aggregable)
http_requests_total{method="POST", path="/login", status="200"} 1234
http_requests_total{method="POST", path="/login", status="500"} 5
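To make the "number + labels" shape concrete, here is a small in-memory counter that keeps one value per label combination and prints lines in the same style as above. It is an illustrative sketch only: the `Counter` class is hypothetical and is not the API of any real metrics client library.

```python
from collections import defaultdict


class Counter:
    """Tiny in-memory counter keyed by label values (illustrative only)."""

    def __init__(self, name: str, label_names: tuple):
        self.name = name
        self.label_names = label_names
        self._values = defaultdict(int)

    def inc(self, **labels: str) -> None:
        # Each distinct combination of label values is its own time series.
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += 1

    def render(self) -> list:
        """Return one text line per series, sorted by label values."""
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ", ".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


requests_total = Counter("http_requests_total", ("method", "path", "status"))
for _ in range(3):
    requests_total.inc(method="POST", path="/login", status="200")
requests_total.inc(method="POST", path="/login", status="500")

for line in requests_total.render():
    print(line)
```

Four requests were recorded, but only two series exist. That is the aggregation that keeps metrics small, and also the reason per-event detail is gone.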
Strength: small (a few bytes per sample, regardless of request volume), fast aggregation (count/avg/p99), great for dashboards
Weakness: only what you defined; no per-event detail
Traces — A Request's Journey
Shape: tree of spans (time + parent-child)
trace abc123 (total 450ms)
├─ HTTP POST /login (450ms) service: api
│ ├─ DB query users (320ms) service: db
│ │ └─ INDEX scan (310ms) ← bottleneck!
│ └─ Redis SET session (15ms) service: cache
Strength: makes the cross-service flow visible and shows at a glance where the time goes
Weakness: needs instrumentation, high volume (usually sampled)
Cardinality — The Cost Ceiling
Cardinality = the number of unique label combinations
Example 1 — low:
http_requests{method="GET|POST|...", status="200|400|500|..."}
→ 5 methods × 6 statuses = 30 series
Example 2 — explosion:
http_requests{method, status, user_id, request_id}
→ request_id is unique per request, so 100M requests from 1M users = ~10^8 series, one per request 😱
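The effect of each extra label can be checked at toy scale. The sketch below counts the worst-case series for bounded labels and the series actually observed once `user_id` and `request_id` are added. The function names and the made-up traffic (1,000 requests from 50 users) are assumptions for illustration.

```python
import math


def worst_case_series(label_values: dict) -> int:
    """Upper bound on series: the product of distinct values per label."""
    return math.prod(label_values.values())


def observed_series(events: list, label_names: list) -> int:
    """Series actually created: distinct label-value combinations seen."""
    return len({tuple(e[n] for n in label_names) for e in events})


# Example 1: bounded labels only.
low = worst_case_series({"method": 5, "status": 6})

# Example 2 at toy scale: 1,000 requests spread over 50 users (made-up numbers).
events = [
    {"method": "GET", "status": "200", "user_id": f"u_{i % 50}", "request_id": f"req_{i}"}
    for i in range(1000)
]
bounded = observed_series(events, ["method", "status"])
per_user = observed_series(events, ["method", "status", "user_id"])
per_request = observed_series(events, ["method", "status", "user_id", "request_id"])

print(low, bounded, per_user, per_request)  # 30 1 50 1000
```

With `request_id` as a label, the series count equals the request count, so the metric no longer aggregates anything.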
Prometheus etc.: active series × samples per series = memory and storage cost
Datadog etc.: custom metrics are billed per unique series
→ This is why "record everything as a metric" doesn't work: you hit the cardinality ceiling.
Keep high-cardinality (user_id, request_id) in traces/logs,
NOT in metric labels.
Which Pillar Fits Which Question
| Question | Tool |
|---|---|
| Is the system healthy right now? | metrics (dashboards, alerts) |
| p99 trend for this endpoint? | metrics |
| Why was THIS request slow? | traces (look up by trace ID) |
| Why is this user erroring? | logs (filter by user_id) + trace |
| What happened at 11pm yesterday? | logs (time + context search) |
| Latency distribution A → B? | trace aggregation + metrics |
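The p99 row deserves a worked example, because a percentile and an average can tell very different stories about the same data. The sketch below uses the nearest-rank definition of a percentile and made-up latencies. Real metrics backends usually estimate percentiles from histogram buckets instead, so treat this as an illustration of the idea only.

```python
import math
import statistics


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 980 fast requests and 20 slow ones.
latencies = [50.0] * 980 + [2000.0] * 20

mean = statistics.fmean(latencies)
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)

print(f"mean={mean}ms p50={p50}ms p99={p99}ms")  # mean=89.0ms p50=50.0ms p99=2000.0ms
```

The mean of 89ms looks healthy, while the p99 shows that the slowest requests take 2 seconds.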
OpenTelemetry — A Single Standard
Before: different SDKs per vendor
- Datadog SDK / New Relic agent / Honeycomb SDK / ...
- Switching tools = changing every app
OpenTelemetry (CNCF, since 2019):
- Unified spec + SDKs for logs / metrics / traces
- Swap the exporter to change backend (Jaeger / Tempo / Datadog / ...)
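The "swap the exporter" idea can be illustrated without any vendor code. The sketch below is not the OpenTelemetry API: `Exporter`, `JsonLinesExporter`, `InMemoryExporter`, and `handle_login` are hypothetical names, chosen only to show application code that depends on an interface rather than on a backend.

```python
import json
from typing import Protocol


class Exporter(Protocol):
    """Hypothetical interface: anything that can receive a finished span."""

    def export(self, span: dict) -> None: ...


class JsonLinesExporter:
    """Stand-in for one backend: collects spans as JSON lines."""

    def __init__(self) -> None:
        self.lines = []

    def export(self, span: dict) -> None:
        self.lines.append(json.dumps(span, sort_keys=True))


class InMemoryExporter:
    """Stand-in for another backend: keeps span dicts in a list."""

    def __init__(self) -> None:
        self.spans = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


def handle_login(exporter: Exporter) -> None:
    # Instrumentation is written once, against the interface only.
    exporter.export({"name": "HTTP POST /login", "duration_ms": 450})


json_backend = JsonLinesExporter()
memory_backend = InMemoryExporter()
handle_login(json_backend)    # same application code ...
handle_login(memory_backend)  # ... different destination

print(json_backend.lines[0])
print(memory_backend.spans[0])
```

Changing the destination touches only the line that constructs the exporter; `handle_login` stays the same.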
→ Vendor lock-in is greatly reduced: instrument once, choose the destination freely.
Common Pitfalls
- "Just log everything" — volume + cost explosion + signal-vs-noise. Deliberate sampling + structured logs.
- Checking averages only — the average looks fine, but p99 = 2s means 1% of requests take 2s or longer. Histograms + percentiles are essential.
- user_id / request_id as metric labels — cardinality explosion. Put them in traces / logs instead.
- Too many alerts — 3am noisy pages → on-call burnout. SLO-based alerts (see the SLI/SLO guide).
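- Tracing impacting production — tracing every request adds per-request overhead (span creation, export) and storage cost. Use head sampling (e.g. keep 1%, decided when the trace starts) or tail sampling (decided after the trace completes, e.g. keep 100% of errors).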
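The two sampling strategies from the last pitfall can be sketched as two small decision functions. This is a minimal illustration under stated assumptions: the function names, the hash-based bucketing, and the span dictionaries with `trace_id` and `error` keys are all hypothetical, and production samplers are usually configured in the tracing SDK or collector rather than hand-written.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide when the trace starts, from the trace ID alone, so every service agrees."""
    if rate >= 1.0:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < rate


def tail_sample(spans: list, rate: float) -> bool:
    """Decide after the trace completes: keep every trace with an error, sample the rest."""
    if not spans:
        return False
    if any(s.get("error") for s in spans):
        return True
    return head_sample(spans[0]["trace_id"], rate)


# Made-up trace IDs, sampled at 1%.
trace_ids = [f"trace-{i}" for i in range(100_000)]
kept = sum(head_sample(t, 0.01) for t in trace_ids)
print(f"head sampling kept {kept} of {len(trace_ids)} traces")

failed_trace = [{"trace_id": "trace-x", "error": True}]
print("failed trace kept:", tail_sample(failed_trace, 0.0))
```

Hashing the trace ID makes the head decision deterministic, so a trace is either kept by every service or dropped by every service. Tail sampling needs the complete trace buffered before it can decide, which is why it is the more expensive of the two to operate.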
Wrap-up
Observability isn't a single tool — it's a capability: answering questions you didn't define ahead of time. The three pillars complement each other (logs = detail, metrics = aggregation, traces = flow). Cardinality is the cost ceiling.
Start: structured logging + basic metrics (RED: rate, errors, duration / USE: utilization, saturation, errors) + tracing on critical paths. Go incrementally rather than fully instrumenting every service at once.