How Observability Actually Works

Something is wrong in production. But what? One user? All users? The DB? An external API? Observability is the ability to answer questions like these. Unlike pure monitoring, which watches predefined metrics, it must answer questions you didn't define in advance. This guide covers the three pillars (logs, metrics, traces) and why cardinality sets the cost ceiling.

Monitoring vs Observability — Meaning

Monitoring: answer known questions
  - alert when CPU > 80%
  - alert when 5xx rate > 1%
  - predefined dashboards

Observability: the ability to answer new questions
  - "Why did last night's 11pm latency spike?"
  - "Where did this user's request fail?"
  - "After feature X shipped, which endpoint's p99 changed?"

→ Both needed. Monitoring detects incidents; observability diagnoses them.
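
The difference can be sketched in a few lines of Python. This is only an illustration: the event fields (endpoint, status, latency_ms, region) and both function names are made up for the example. The alert answers one question fixed in advance, while the grouping function takes the field to slice by at query time.

```python
from collections import Counter

# Hypothetical request events; the field names are illustrative only.
events = [
    {"endpoint": "/login", "status": 500, "latency_ms": 1900, "region": "eu"},
    {"endpoint": "/login", "status": 200, "latency_ms": 120, "region": "us"},
    {"endpoint": "/cart", "status": 200, "latency_ms": 95, "region": "eu"},
    {"endpoint": "/login", "status": 500, "latency_ms": 2100, "region": "eu"},
]


def error_rate_alert(events, threshold=0.01):
    """Monitoring: a question decided ahead of time (5xx rate > 1%)."""
    errors = sum(1 for e in events if e["status"] >= 500)
    return errors / len(events) > threshold


def errors_by(events, field):
    """Observability: group failures by any field, chosen at query time."""
    return Counter(e[field] for e in events if e["status"] >= 500)


print(error_rate_alert(events))       # did the predefined rule fire?
print(errors_by(events, "region"))    # ad-hoc follow-up: where do errors cluster?
print(errors_by(events, "endpoint"))  # another question nobody planned for
```

The second function only works because each event kept its context; a pre-aggregated error counter could not answer the regional question after the fact.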

Three Pillars — Logs · Metrics · Traces

Logs — Record of What Happened

Shape: events (timestamp + context)

{
  "ts": "2026-05-25T14:32:01Z",
  "level": "error",
  "msg": "DB connection refused",
  "host": "api-prod-3",
  "user_id": "u_42",
  "request_id": "req_abc123",
  "stack": "..."
}

Strength: rich context, precise debugging
Weakness: high volume (TBs), expensive aggregation, query cost
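
One way to emit events shaped like the sample above is a JSON formatter on top of Python's standard logging module. This is a minimal sketch, not a recommended production setup: the JsonFormatter class and its CONTEXT_FIELDS list are assumptions made for the example, and the field names simply mirror the sample event.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    # Context fields copied from the record when present (illustrative set).
    CONTEXT_FIELDS = ("host", "user_id", "request_id")

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        event = {
            "ts": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "msg": record.getMessage(),
        }
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                event[name] = getattr(record, name)
        return json.dumps(event)


logger = logging.getLogger("api")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Values passed via `extra` become attributes on the record.
logger.error(
    "DB connection refused",
    extra={"host": "api-prod-3", "user_id": "u_42", "request_id": "req_abc123"},
)
```

Because every line is a JSON object with stable keys, a log backend can filter by user_id or request_id without parsing free-form text.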

Metrics — Time-Series Numbers

Shape: timestamp + number + labels (aggregable)

http_requests_total{method="POST", path="/login", status="200"} 1234
http_requests_total{method="POST", path="/login", status="500"} 5

Strength: small (a few bytes per sample, independent of request volume), fast aggregation (count/avg/p99), great for dashboards
Weakness: only what you defined; no per-event detail
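
To make the shape concrete, here is a toy labeled counter in plain Python. It is a sketch for illustration only: LabeledCounter is a made-up class, not the API of any metrics client library, and its text output only imitates the sample lines above.

```python
from collections import defaultdict


class LabeledCounter:
    """A toy counter that keeps one number per unique label combination."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = defaultdict(int)  # key: tuple of label values

    def inc(self, **labels):
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += 1

    def expose(self):
        """Return one text line per series, in sorted label order."""
        lines = []
        for key, value in sorted(self._values.items()):
            pairs = ", ".join(
                f'{n}="{v}"' for n, v in zip(self.label_names, key)
            )
            lines.append(f"{self.name}{{{pairs}}} {value}")
        return lines


requests_total = LabeledCounter(
    "http_requests_total", ["method", "path", "status"]
)
for _ in range(3):
    requests_total.inc(method="POST", path="/login", status="200")
requests_total.inc(method="POST", path="/login", status="500")

print("\n".join(requests_total.expose()))
```

Four requests produced two series. Four million requests with the same labels would still produce two series, which is why metrics stay cheap, and also why the individual requests cannot be recovered from them.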

Traces — A Request's Journey

Shape: tree of spans (time + parent-child)

trace abc123 (total 450ms)
├─ HTTP POST /login         (450ms)  service: api
│  ├─ DB query users        (320ms)  service: db
│  │  └─ INDEX scan         (310ms)  ← bottleneck!
│  └─ Redis SET session     (15ms)   service: cache

Strength: makes cross-service flow visible; shows at a glance where the time goes
Weakness: needs instrumentation, high volume (usually sampled)
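
The span tree above can be modeled as a small recursive data structure. The sketch below is an assumption-laden illustration: the Span class and walk helper are hypothetical, and "self time" is computed by subtracting child durations, which only holds when children run one after another rather than concurrently.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    service: str
    duration_ms: int
    children: list["Span"] = field(default_factory=list)

    def self_time_ms(self):
        """Time spent in this span itself, excluding its children.

        Assumes children run sequentially, not in parallel.
        """
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def walk(span):
    """Yield the span and all of its descendants, depth first."""
    yield span
    for child in span.children:
        yield from walk(child)


# The example trace from the text, rebuilt as a tree.
trace = Span("HTTP POST /login", "api", 450, [
    Span("DB query users", "db", 320, [Span("INDEX scan", "db", 310)]),
    Span("Redis SET session", "cache", 15),
])

bottleneck = max(walk(trace), key=Span.self_time_ms)
print(bottleneck.name, bottleneck.self_time_ms())
```

Ranking by self time rather than total duration keeps the root span from always winning: the root takes 450ms in total, but most of that time belongs to its descendants.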

Cardinality — The Cost Ceiling

Cardinality = the number of unique label combinations

Example 1 — low:
  http_requests{method="GET|POST|...", status="200|400|500|..."}
  → 5 methods × 6 statuses = 30 series

Example 2 — explosion:
  http_requests{method, status, user_id, request_id}
  → request_id is unique per request, so every request creates a new series:
    100M requests = ~10^8 series, each holding a single sample 😱

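Prometheus etc.: every unique label combination is its own time series; the active series count drives memory and storage cost
DataDog etc.: custom metrics are billed per unique tag combination (event-priced tools such as Honeycomb bill by event volume instead)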

→ Why "put everything in labels" doesn't work = the cardinality ceiling.
   Keep high-cardinality (user_id, request_id) in traces/logs,
   NOT in metric labels.
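
The arithmetic can be checked with a short Python sketch. Both helper names are made up for the example. The product of per-label value counts gives an upper bound on the number of series; the number actually created is the count of distinct label combinations observed.

```python
from math import prod


def max_series(label_cardinalities):
    """Upper bound on series: product of distinct values per label."""
    return prod(label_cardinalities.values())


def observed_series(samples, label_names):
    """Actual cardinality: distinct label combinations seen in the data."""
    return len({tuple(s[n] for n in label_names) for s in samples})


# Low-cardinality case from the text: 5 methods x 6 statuses.
print(max_series({"method": 5, "status": 6}))

# 1000 hypothetical requests, all GET/200, each with its own request_id.
samples = [
    {"method": "GET", "status": "200", "request_id": f"req_{i}"}
    for i in range(1000)
]
print(observed_series(samples, ["method", "status"]))
print(observed_series(samples, ["method", "status", "request_id"]))
```

Without request_id, the 1000 requests collapse into a single series. With it, every request becomes its own series, so the series count grows with traffic instead of staying bounded.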

Which Pillar Fits Which Question

Question	Tool
Is the system healthy right now?	metrics (dashboards, alerts)
p99 trend for this endpoint?	metrics
Why was THIS request slow?	traces (look up by trace ID)
Why is this user erroring?	logs (filter by user_id) + trace
What happened at 11pm yesterday?	logs (time + context search)
Latency distribution A → B?	traces aggregation + metrics

OpenTelemetry — A Single Standard

Before: different SDKs per vendor
  - DataDog SDK / New Relic agent / Honeycomb SDK / ...
  - Switching tools = changing every app

OpenTelemetry (CNCF, since 2019):
  - Unified spec + SDKs for logs / metrics / traces
  - Swap the exporter to change backend (Jaeger / Tempo / DataDog / ...)

→ Vendor lock-in greatly reduced: instrument once, choose the destination later.
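
The idea behind swapping the exporter can be shown with a plain-Python sketch. None of the classes below are the OpenTelemetry API: Exporter, StdoutExporter, ListExporter and Tracer are hypothetical names invented to show the separation between instrumentation and destination.

```python
from typing import Protocol


class Exporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""

    def export(self, span: dict) -> None: ...


class StdoutExporter:
    def export(self, span: dict) -> None:
        print(f"span {span['name']} {span['duration_ms']}ms")


class ListExporter:
    """Stands in for a second backend; it just keeps spans in memory."""

    def __init__(self) -> None:
        self.spans: list[dict] = []

    def export(self, span: dict) -> None:
        self.spans.append(span)


class Tracer:
    """Application code talks only to this class, never to a backend."""

    def __init__(self, exporter: Exporter) -> None:
        self._exporter = exporter

    def record(self, name: str, duration_ms: int) -> None:
        self._exporter.export({"name": name, "duration_ms": duration_ms})


def handle_login(tracer: Tracer) -> None:
    # Instrumentation written once; it does not know where spans end up.
    tracer.record("POST /login", 450)


handle_login(Tracer(StdoutExporter()))  # destination 1
memory = ListExporter()
handle_login(Tracer(memory))            # destination 2, same app code
print(memory.spans)
```

The handler is identical in both calls; only the object passed at wiring time changes. That is the property the exporter model is meant to give you at the scale of a whole codebase.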

Common Pitfalls

"Just log everything": volume and cost explode, and signal drowns in noise. Use deliberate sampling + structured logs.
Checking averages only: the average can look fine while p99 = 2s, meaning 1% of requests take 2s or longer. Histograms + percentiles are essential.
user_id / request_id as metric labels: cardinality explosion. Put them in traces / logs instead.
Too many alerts: noisy 3am pages → on-call burnout. Use SLO-based alerts (see the SLI/SLO guide).
Tracing impacting production: recording and exporting every span adds overhead to each request. Use head sampling (decide at the start of the request, e.g. keep 1%) or tail sampling (decide after the trace completes, e.g. keep 100% of errors).
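
The averages pitfall is easy to reproduce. The sketch below uses made-up latencies (98 fast requests, 2 slow ones) and a simple nearest-rank percentile helper written for the example; real metric backends typically estimate percentiles from histogram buckets instead.

```python
import math
import statistics

# Hypothetical latencies in milliseconds: 98 fast requests, 2 slow ones.
latencies_ms = [100] * 98 + [2000] * 2


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct percent of the data is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


print("mean:", statistics.mean(latencies_ms))
print("p50: ", percentile(latencies_ms, 50))
print("p99: ", percentile(latencies_ms, 99))
```

The mean comes out at 138ms and the median at 100ms, both of which look healthy, while p99 is 2000ms. A dashboard showing only the average would hide the slow requests entirely.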

Wrap-up

Observability isn't a single tool — it's a capability: answering questions you didn't define ahead of time. The three pillars complement each other (logs = detail, metrics = aggregation, traces = flow). Cardinality is the cost ceiling.

Start: structured logging + basic metrics (RED: rate, errors, duration / USE: utilization, saturation, errors) + tracing on critical paths. Roll out incrementally rather than instrumenting every service at once.
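
As a starting point for the basic metrics, RED (rate, errors, duration) can be derived from per-request records. This is a rough sketch under stated assumptions: the record fields, the red_summary function and the choice of the median as the duration figure are all illustrative, and a real setup would usually track a latency histogram rather than a single number.

```python
import statistics


def red_summary(requests, window_seconds):
    """Compute rate, error ratio and median duration for one time window."""
    errors = sum(1 for r in requests if r["status"] >= 500)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "median_duration_ms": statistics.median(
            r["duration_ms"] for r in requests
        ),
    }


# Four hypothetical requests observed in a 2-second window.
window = [
    {"status": 200, "duration_ms": 80},
    {"status": 200, "duration_ms": 100},
    {"status": 500, "duration_ms": 900},
    {"status": 200, "duration_ms": 120},
]

summary = red_summary(window, window_seconds=2)
print(summary)
```

Three numbers per service per window is a small enough footprint to add everywhere first, before deciding which paths deserve tracing.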