Dashboard graphs, Prometheus series, and Datadog charts are all built from metrics. This guide covers the four metric types (counter, gauge, histogram, summary) and what they mean, why averages are dangerous, and the trade-off between Prometheus pull and StatsD push.
Four Metric Types
1. Counter — Monotonically Increasing
http_requests_total{path="/login"} 12450
http_requests_total{path="/login"} 12451 ← +1
http_requests_total{path="/login"} 12452 ← +1
Property: the value only ever increases. It resets to 0 when the process restarts.
Use: request counts, error counts, bytes sent, etc.
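Because a counter only rises or resets, a per-second rate can be recovered from raw samples by summing the increases and treating any drop as a restart. The following is a minimal Python sketch of that idea, with hypothetical sample values. Prometheus's own rate() also extrapolates to the edges of the window, which is left out here.

```python
def per_second_rate(samples):
    """Average per-second increase over (timestamp_seconds, counter_value) samples.

    A drop in value is treated as a process restart: the counter started
    again from 0, so the whole new value counts as increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes 15 s apart; the process restarts before the third one.
samples = [(0, 12450), (15, 12480), (30, 20), (45, 50)]
print(per_second_rate(samples))
```

Subtracting the first value from the last would give a negative number across the restart, which is why the raw counter value is rarely useful on its own.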
Analyze:
rate(http_requests_total[5m]) → requests per second
2. Gauge — Arbitrary Up/Down
memory_used_bytes 4294967296
memory_used_bytes 5368709120 ← up
memory_used_bytes 3221225472 ← down
Property: a value at a point in time.
Use: memory/CPU usage, queue depth, active connections, temperature
3. Histogram — Distribution
Per-request latency counted into buckets:
http_request_duration_seconds_bucket{le="0.1"} 1000 ← requests ≤100ms
http_request_duration_seconds_bucket{le="0.5"} 1200
http_request_duration_seconds_bucket{le="1.0"} 1250
http_request_duration_seconds_bucket{le="+Inf"} 1260 ← total
http_request_duration_seconds_sum 350.2 ← total time
http_request_duration_seconds_count 1260 ← total count
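The cumulative buckets above are enough to estimate a percentile. The sketch below is a simplified Python stand-in for the bucket interpolation that histogram_quantile performs, not Prometheus's implementation. It assumes observations are spread evenly inside each bucket and that the lowest bucket starts at 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls past the last finite bucket: report that bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Bucket counts taken from the sample output above.
buckets = [(0.1, 1000), (0.5, 1200), (1.0, 1250), (math.inf, 1260)]
print("p99 estimate:", histogram_quantile(0.99, buckets))
print("mean:", 350.2 / 1260)
```

The estimate can only be as precise as the bucket boundaries, so buckets should be placed around the latencies that matter for the service.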
Analyze:
histogram_quantile(0.99, ...) → p99 latency
http_request_duration_seconds_sum / http_request_duration_seconds_count → average latency (350.2 / 1260 ≈ 0.28s)
4. Summary — Quantile Computed Client-Side
http_request_duration_seconds{quantile="0.5"} 0.12
http_request_duration_seconds{quantile="0.9"} 0.45
http_request_duration_seconds{quantile="0.99"} 1.20
http_request_duration_seconds_count 1260
Pro: pre-computed on the client (less server work)
Con: can't aggregate quantiles across instances
(avg of per-instance p99 ≠ p99 of all)
→ histogram is safer in distributed environments
Why Averages Are Dangerous
Latency for one endpoint, 100 requests:
98 × 50ms
2 × 5000ms
average = (98 × 50 + 2 × 5000) / 100 = 149ms ← looks "OK"
p99 = 5000ms ← the slowest users wait 5s!
max = 5000ms
→ Average alone hides the 2% in pain.
→ Look at p50 (median), p95, and p99 together, and inspect the tail.
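The gap between the average and the tail is easy to reproduce. The following is a minimal Python sketch, assuming the nearest-rank percentile definition and a hypothetical sample of 98 fast requests and 2 slow ones.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [50] * 98 + [5000] * 2

print("mean:", statistics.mean(latencies_ms))
for p in (50, 95, 99):
    print(f"p{p}:", percentile(latencies_ms, p))
print("max:", max(latencies_ms))
```

Percentile definitions differ slightly between tools (nearest rank, linear interpolation, bucket estimates), so values near a boundary can vary from one system to another.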
Rules of thumb:
p99 > p95 × 2 → tail latency issue (investigate slow path)
max >> p99 → outlier (network glitch, GC pause)
Prometheus Pull vs StatsD Push
Pull (Prometheus)
Prometheus server → scrapes /metrics on the application
(every 15s, 30s, ...)
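On the application side, the pull model only requires an HTTP endpoint that returns the current values as text. The sketch below uses only the Python standard library to illustrate the idea. The metric values, the port, and the helper names are hypothetical, and a real service would normally use a Prometheus client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process metric state; a real app would update these as it runs.
METRICS = {
    'http_requests_total{path="/login"}': 12452,
    "memory_used_bytes": 3221225472,
}


def render_metrics(metrics):
    """Render one 'name value' line per metric, the shape a scraper reads."""
    return "".join(f"{name} {value}\n" for name, value in metrics.items())


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block and serve /metrics; port 8000 is an arbitrary choice for this sketch."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(METRICS), end="")
```

Calling serve() starts the endpoint. The application never initiates a connection to the monitoring system; it only answers when scraped.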
Pros:
- Service discovery (knows which hosts are alive)
- Simple app side (just expose /metrics)
- "Can't scrape" itself is a down signal
Cons:
- Hard to measure short-lived jobs (CI etc.)
- App must expose an HTTP port that Prometheus can reach
- Scrape load on huge clusters
Push (StatsD, DogStatsD)
application → StatsD daemon → backend
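In the push model the application formats a short text datagram and sends it to the daemon over UDP. The following is a minimal Python sketch of the plain StatsD line format. The metric names are hypothetical, and the host and port assume a daemon on the local machine at StatsD's default port 8125.

```python
import socket


def statsd_line(name, value, metric_type):
    """Format one StatsD datagram: <name>:<value>|<type> (c=counter, g=gauge, ms=timer)."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one datagram over UDP without waiting for any reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# Hypothetical metric names.
counter_line = statsd_line("login.requests", 1, "c")
timer_line = statsd_line("login.duration", 120, "ms")
print(counter_line)
print(timer_line)
```

Passing either line to send_metric() delivers it. Because UDP gives no acknowledgement, the sender cannot tell whether the daemon received it, which is the fire-and-forget trade-off.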
Pros:
- Short-lived jobs can report
- App only needs the destination
- Low latency (UDP, fire-and-forget)
Cons:
- No automatic discovery
- "No metric" is ambiguous (down or idle?)
- No ordering guarantee
Exemplars — Linking Metrics to Traces
Attach trace_id to a sample request in a histogram bucket:
http_request_duration_bucket{le="1.0"} 1250
exemplar: trace_id=abc123, value=0.9, ts=2026-05-25T14:00:00Z
→ On a dashboard you see "p99 = 1.2s" and click the representative trace
→ Instantly see whether it's a service / DB / external API issue
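On the wire, the OpenMetrics text format carries an exemplar on the same line as the bucket sample, after a # separator. The sketch below formats such a line in Python using the values from the example above. The helper name is hypothetical, and instrumentation libraries normally produce this output for you.

```python
from datetime import datetime, timezone


def exemplar_line(series, count, trace_id, value, ts):
    """Format a bucket sample followed by an exemplar: labels, value, Unix timestamp."""
    return f'{series} {count} # {{trace_id="{trace_id}"}} {value} {ts.timestamp()}'


observed_at = datetime(2026, 5, 25, 14, 0, 0, tzinfo=timezone.utc)
line = exemplar_line(
    'http_request_duration_bucket{le="1.0"}',
    1250,
    "abc123",
    0.9,
    observed_at,
)
print(line)
```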
Supported by Prometheus 2.27+ and OpenTelemetry. This is the metric ↔ trace link.
RED / USE — Standard Metric Patterns
RED (request-oriented service)
- Rate — requests per second
- Errors — error ratio
- Duration — latency distribution (p50/p99)
USE (resource)
- Utilization — % used (CPU, memory)
- Saturation — queue / waiting (load avg)
- Errors — resource errors (disk IO etc.)
RED per service, USE per machine. Together they cover much the same ground as the four golden signals (latency, traffic, errors, saturation).
Common Pitfalls
- Judging by averages alone — see above. Always pair with percentiles + max.
- Label cardinality explosion — http_requests{user_id} blows up series + storage. user_id belongs in logs/traces, not metrics.
- Looking at raw counter values — counters are cumulative; you need rate(counter[5m]).
- Averaging summary quantiles across instances — meaningless. Aggregate histogram buckets first, then compute the quantile.
- Too-frequent scrape — 1s scrape blows Prometheus load + storage. 15-30s is usually fine; 5s only for hot services.
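The quantile-averaging pitfall is easy to demonstrate with numbers. The following is a small Python sketch, assuming nearest-rank percentiles and two hypothetical instances: one healthy and busy, one slow and lightly loaded.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
instance_a = [50] * 900    # healthy, handles most of the traffic
instance_b = [2000] * 100  # slow, handles a tenth of the traffic

avg_of_p99 = statistics.mean([percentile(instance_a, 99), percentile(instance_b, 99)])
true_p99 = percentile(instance_a + instance_b, 99)

print("average of per-instance p99:", avg_of_p99)
print("p99 over all requests:", true_p99)
```

The averaged figure ignores how much traffic each instance served, so it matches neither instance nor the fleet as a whole. Summing bucket counts across instances keeps that weighting intact.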
Wrap-up
Metrics are small, fast, aggregable — the source for dashboards and alerts in observability. Understand the four types (counter / gauge / histogram / summary), look at percentiles, mind cardinality.
Start with RED per service and USE per machine as your baseline signals. Pair metrics with traces and logs when debugging a specific alert. Exemplars bridge metric → trace.