How Metrics Actually Work

Dashboard graphs, Prometheus series, and Datadog charts are all built from metrics. This guide covers the four metric types (counter, gauge, histogram, summary), what each one means, why averages are dangerous, and the trade-off between Prometheus-style pull and StatsD-style push.

Four Metric Types

1. Counter — Monotonically Increasing

http_requests_total{path="/login"} 12450
http_requests_total{path="/login"} 12451  ← +1
http_requests_total{path="/login"} 12452  ← +1

Property: the value only stays the same or increases. It resets to 0 when the process restarts.
Use: request counts, error counts, bytes sent, etc.

Analyze:
  rate(http_requests_total[5m])  → requests per second
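
To make the idea behind rate() concrete, here is a minimal Python sketch of a per-second rate over raw counter samples that tolerates a reset. The function name and the sample window are hypothetical, and the sketch leaves out the extrapolation to the window edges that Prometheus itself performs.

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase over the window, tolerating counter resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter went backwards, so the process restarted and began
            # counting from 0 again: everything seen since then is new.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


# Hypothetical 60s window scraped every 15s, with a restart before t=45.
window = [(0, 12450), (15, 12465), (30, 12480), (45, 5), (60, 20)]
print(per_second_rate(window))
```

The raw values drop from 12480 to 5, yet the computed rate stays positive, which is why graphs and alerts should be built on the rate and not on the raw counter.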

2. Gauge — Arbitrary Up/Down

memory_used_bytes 4294967296
memory_used_bytes 5368709120  ← up
memory_used_bytes 3221225472  ← down

Property: a value at a point in time.
Use: memory/CPU usage, queue depth, active connections, temperature

3. Histogram — Distribution

Per-request latency counted into cumulative buckets (each le bucket also includes every smaller bucket):

http_request_duration_seconds_bucket{le="0.1"}   1000  ← requests ≤100ms
http_request_duration_seconds_bucket{le="0.5"}   1200
http_request_duration_seconds_bucket{le="1.0"}   1250
http_request_duration_seconds_bucket{le="+Inf"}  1260  ← total
http_request_duration_seconds_sum                 350.2  ← total time
http_request_duration_seconds_count               1260   ← total count

Analyze:
  histogram_quantile(0.99, ...)  → p99 latency
  _sum / _count                   → average latency (here 350.2 / 1260 ≈ 0.278s)
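
The quantile estimate can be sketched in Python as linear interpolation inside the bucket that contains the target rank, which is the general approach histogram_quantile takes. This is a simplified illustration, not Prometheus's code: the function name is hypothetical, and it assumes the lowest bucket starts at 0 and that counts are cumulative.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, int]  # (upper bound "le", cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets by linear interpolation."""
    if not 0 <= q <= 1:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(buckets)  # +Inf sorts last
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in ordered:
        if count >= rank:
            if math.isinf(upper_bound):
                # No finite upper edge: fall back to the highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# The bucket values from the example above.
buckets = [(0.1, 1000), (0.5, 1200), (1.0, 1250), (math.inf, 1260)]
print(estimate_quantile(0.99, buckets))  # roughly 0.974 seconds
print(estimate_quantile(0.50, buckets))  # roughly 0.063 seconds
```

The result is only as precise as the bucket boundaries: the p99 here is known to lie between 0.5s and 1.0s, and the interpolation picks a point inside that range.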

4. Summary — Quantile Computed Client-Side

http_request_duration_seconds{quantile="0.5"}  0.12
http_request_duration_seconds{quantile="0.9"}  0.45
http_request_duration_seconds{quantile="0.99"} 1.20
http_request_duration_seconds_count 1260

Pro: pre-computed on the client (less server work)
Con: can't aggregate quantiles across instances
     (avg of per-instance p99 ≠ p99 of all)
     → histogram is safer in distributed environments
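
A small Python sketch shows why averaging per-instance quantiles misleads. The two instances and their latencies are made up for illustration, and the percentile helper uses the nearest-rank definition.

```python
from statistics import mean
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical traffic: one busy, fast instance and one quiet, slow instance.
instance_a = [0.01] * 1000  # 1000 requests at 10 ms
instance_b = [1.0] * 10     # 10 requests at 1 s

averaged = mean([percentile(instance_a, 99), percentile(instance_b, 99)])
combined = percentile(instance_a + instance_b, 99)

print(averaged)  # about 0.505 s
print(combined)  # 0.01 s
```

The averaged figure weights both instances equally even though one of them served under 1% of the traffic. Summing histogram buckets across instances keeps the request counts, so the combined quantile stays meaningful.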

Why Averages Are Dangerous

Latency for one endpoint:

100 requests:
  98 × 50ms
   2 × 5000ms

average = (98 × 50 + 2 × 5000) / 100 = 149ms  ← looks "OK"
p50    = 50ms
p99    = 5000ms                                ← the slowest 2% of users wait 5s!
max    = 5000ms

→ The average alone hides the slow tail.
→ Look at p50 (median), p95, p99 together — inspect the tail.
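
The same comparison can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the sample is hypothetical (98 requests at 50 ms and 2 at 5000 ms), and the percentile helper uses the nearest-rank definition, which is only one of several in common use.

```python
from statistics import mean
from typing import List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # ceil(pct * n / 100) in integer math
    return ordered[rank - 1]


latencies_ms = [50] * 98 + [5000] * 2  # hypothetical sample

print("avg:", mean(latencies_ms))            # 149
print("p50:", percentile(latencies_ms, 50))  # 50
print("p95:", percentile(latencies_ms, 95))  # 50
print("p99:", percentile(latencies_ms, 99))  # 5000
print("max:", max(latencies_ms))             # 5000
```

Note that p95 still reads 50 ms here. Which percentile exposes a problem depends on how large the slow fraction is, which is one more reason to look at several percentiles plus the max.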

Rules of thumb:
  p99 > p95 × 2  → tail latency issue (investigate slow path)
  max  >> p99    → outlier (network glitch, GC pause)

Prometheus Pull vs StatsD Push

Pull (Prometheus)

Prometheus server → scrapes /metrics on the application
                    (every 15s, 30s, ...)

Pros:
- Service discovery (Prometheus knows which targets should exist)
- Simple app side (just expose /metrics)
- "Can't scrape" itself is a down signal
Cons:
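- Hard to measure short-lived jobs (CI etc.): they may exit before the next scrape, so they usually go through the Pushgateway
- App must expose an HTTP port that Prometheus can reach (awkward behind NAT or firewalls)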
- Scrape load on huge clusters
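
As a rough illustration of the application side of pull, the following Python sketch exposes a /metrics endpoint with only the standard library. The counter store, host, and port are hypothetical, label-value escaping is omitted, and a real service would normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process counter store, keyed by request path.
REQUESTS: Dict[str, int] = {"/login": 12452}


def render_metrics(requests: Dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for path, count in sorted(requests.items()):
        # Escaping of quotes and backslashes in label values is left out.
        lines.append(f'http_requests_total{{path="{path}"}} {count}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(host: str = "127.0.0.1", port: int = 8000) -> None:
    """Block forever, answering scrapes. Host and port are arbitrary choices."""
    HTTPServer((host, port), MetricsHandler).serve_forever()
```

Calling serve() starts the endpoint. The application never contacts the monitoring system; it only answers when scraped, which is why an unreachable endpoint doubles as a down signal.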

Push (StatsD, DogStatsD)

application → StatsD daemon → backend

Pros:
- Short-lived jobs can report
- App only needs the destination
- Low overhead for the app (UDP, fire-and-forget)
Cons:
- No automatic discovery
- "No metric" is ambiguous (down or idle?)
- No delivery or ordering guarantee (UDP packets can be dropped silently)
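
For comparison, the push side can be sketched in Python as a single UDP datagram in the plain StatsD line format ("name:value|type"). The function names and metric names are hypothetical; 8125 is the conventional StatsD port.

```python
import socket


def format_statsd(name: str, value: float, kind: str) -> bytes:
    """Build one StatsD datagram of the form "<name>:<value>|<type>"."""
    suffix = {"counter": "c", "gauge": "g", "timer": "ms"}[kind]
    return f"{name}:{value}|{suffix}".encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: no connection, no acknowledgement, no retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


print(format_statsd("http.requests", 1, "counter"))          # b'http.requests:1|c'
print(format_statsd("http.request_duration", 120, "timer"))  # b'http.request_duration:120|ms'
```

send_metric returns as soon as the datagram is handed to the kernel. If no daemon is listening, the data is simply lost, which is the flip side of the low overhead.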

Exemplars — Linking Metrics to Traces

Attach trace_id to a sample request in a histogram bucket:

http_request_duration_seconds_bucket{le="1.0"} 1250
  exemplar: trace_id=abc123, value=0.9, ts=2026-05-25T14:00:00Z

→ On a dashboard you see "p99 = 1.2s" and click the representative trace
→ Instantly see whether it's a service / DB / external API issue

Exemplars are defined in the OpenMetrics exposition format and are supported by Prometheus 2.27+ and OpenTelemetry. They are the metric ↔ trace link.
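
On the wire, OpenMetrics attaches an exemplar to a sample after a "#" separator, followed by the exemplar's labels, value, and optional timestamp. The Python sketch below renders such a line; the helper names are hypothetical, and label-value escaping is omitted.

```python
from typing import Dict


def format_labels(labels: Dict[str, str]) -> str:
    """Render a label set as {k="v",...}; escaping is left out of this sketch."""
    return "{" + ",".join(f'{key}="{value}"' for key, value in labels.items()) + "}"


def bucket_with_exemplar(
    name: str,
    labels: Dict[str, str],
    count: int,
    exemplar_labels: Dict[str, str],
    exemplar_value: float,
    exemplar_ts: float,
) -> str:
    """One histogram bucket line carrying an exemplar, in OpenMetrics text form."""
    return (
        f"{name}{format_labels(labels)} {count}"
        f" # {format_labels(exemplar_labels)} {exemplar_value} {exemplar_ts}"
    )


line = bucket_with_exemplar(
    "http_request_duration_seconds_bucket",
    {"le": "1.0"},
    1250,
    {"trace_id": "abc123"},
    0.9,
    1779717600.0,  # 2026-05-25T14:00:00Z as a Unix timestamp
)
print(line)
```

The exemplar is a pointer, not a label: trace_id does not become part of the series identity, so it does not add cardinality.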

RED / USE — Standard Metric Patterns

RED (request-oriented service)

Rate — requests per second
Errors — error ratio
Duration — latency distribution (p50/p99)
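
The three RED numbers can be derived from a window of request records, as in this Python sketch. The Request type, the sample window, and the nearest-rank percentile are assumptions for illustration; in practice these values usually come from counters and histograms queried in the metrics backend.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_s: float
    ok: bool


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Rate, error ratio, and p50/p99 duration for one observation window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_s for r in requests)
    n = len(durations)

    def nearest_rank(pct: int) -> float:
        return durations[max(1, -(-pct * n // 100)) - 1]

    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n,
        "p50_s": nearest_rank(50),
        "p99_s": nearest_rank(99),
    }


# Hypothetical 60s window: 117 fast successes and 3 slow failures.
window = [Request(0.05, True)] * 117 + [Request(2.0, False)] * 3
summary = red_summary(window, window_s=60)
print(summary)
```

For this window the sketch reports 2 requests per second, a 2.5% error ratio, a p50 of 0.05s, and a p99 of 2.0s.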

USE (resource)

Utilization — % used (CPU, memory)
Saturation — queue / waiting (load avg)
Errors — resource errors (disk IO etc.)

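RED per service, USE per machine. Together they cover much the same ground as the four golden signals from Google's SRE book (latency, traffic, errors, saturation).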

Common Pitfalls

Judging by averages alone: see above. Always pair the average with percentiles and max.
Label cardinality explosion: a label such as http_requests_total{user_id="..."} creates one series per user and blows up series count and storage. user_id belongs in logs and traces, not metrics.
Looking at raw counter values: counters are cumulative, so you need rate(counter[5m]).
Averaging summary quantiles across instances: the result is meaningless. Aggregate histogram buckets across instances first, then compute the quantile.
Scraping too often: a 1s scrape interval inflates Prometheus load and storage. 15-30s is usually fine; use 5s only for hot services.
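
The cardinality pitfall is simple arithmetic: the worst-case number of series for one metric is the product of the distinct values of each label. A short Python sketch, with made-up label counts:

```python
from math import prod
from typing import Dict


def series_count(label_cardinality: Dict[str, int]) -> int:
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinality.values())


# Hypothetical label sets for http_requests_total.
bounded = series_count({"path": 20, "method": 5, "status": 8})
with_user = series_count({"path": 20, "method": 5, "status": 8, "user_id": 100_000})

print(bounded)    # 800
print(with_user)  # 80000000
```

Real series counts are usually lower because not every combination occurs, but a single unbounded label still multiplies everything else.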

Wrap-up

Metrics are small, fast, aggregable — the source for dashboards and alerts in observability. Understand the four types (counter / gauge / histogram / summary), look at percentiles, mind cardinality.

Start with RED per service and USE per machine. Pair metrics with traces and logs when you need to debug a specific alert. Exemplars bridge metric → trace.