How do metrics work?

Dashboard graphs, Prometheus series, Datadog charts: all of them are metrics. This guide covers the four metric types (counter, gauge, histogram, summary) and what each one means, why averages are dangerous, and the trade-offs between Prometheus pull and StatsD push.

The four metric types

1. Counter: monotonically increasing

http_requests_total{path="/login"} 12450
http_requests_total{path="/login"} 12451  ← +1
http_requests_total{path="/login"} 12452  ← +1

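Properties: the value only increases, except that it resets to 0 when the process restarts.
Use for: request counts, error counts, bytes sent, and so on.

Analysis:
  rate(http_requests_total[5m])  → per-second request rate, averaged over the last 5 minutes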
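
To make the reset behaviour concrete, here is a minimal sketch of how a per-second rate could be derived from raw counter samples. It is a simplification, not Prometheus's implementation (the real rate() also extrapolates to the edges of the window); the function name `counter_rate` and the sample values are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the window, tolerating resets."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter went down, so the process restarted and the
            # counter began again from 0: everything in curr is new.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return increase / elapsed


# Hypothetical scrapes every 15 s, with a restart between the 2nd and 3rd.
samples = [(0.0, 12450.0), (15.0, 12480.0), (30.0, 15.0), (45.0, 60.0)]
print(counter_rate(samples))  # 2.0
```

The restart does not show up as a negative rate: the drop from 12480 to 15 is treated as 15 new requests, which is why the raw counter value matters less than its rate.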

2. Gauge: goes up and down

memory_used_bytes 4294967296
memory_used_bytes 5368709120  ← up
memory_used_bytes 3221225472  ← down

Properties: a point-in-time value that can rise or fall freely.
Use for: memory and CPU usage, queue depth, active connection count, temperature.

3. Histogram: distribution

Each request's latency is counted into cumulative buckets:

http_request_duration_seconds_bucket{le="0.1"}   1000  ← requests that took <= 100 ms
http_request_duration_seconds_bucket{le="0.5"}   1200
http_request_duration_seconds_bucket{le="1.0"}   1250
http_request_duration_seconds_bucket{le="+Inf"}  1260  ← all requests
http_request_duration_seconds_sum                 350.2  ← total time in seconds
http_request_duration_seconds_count               1260   ← total number of requests

Analysis:
  histogram_quantile(0.99, ...)  → p99 latency (estimated from the bucket boundaries)
  avg                             → mean latency (sum / count)
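
The sketch below shows roughly how a quantile can be estimated from cumulative buckets by linear interpolation, using the bucket values listed above. It is an illustration of the idea, not Prometheus's code; `bucket_quantile` is a hypothetical name, and it assumes observations are spread evenly inside each bucket.

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, float]  # (upper bound "le", cumulative count)


def bucket_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative buckets."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # Beyond the last finite bucket: the best available answer
                # is that bucket's upper bound.
                return prev_bound
            # Interpolate linearly inside the bucket that holds the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


buckets = [(0.1, 1000), (0.5, 1200), (1.0, 1250), (math.inf, 1260)]
p99 = bucket_quantile(0.99, buckets)  # about 0.974 s
mean = 350.2 / 1260                   # about 0.278 s (sum / count)
print(round(p99, 3), round(mean, 3))
```

The estimate can never be more precise than the bucket layout: here p99 is only known to lie between 0.5 s and 1.0 s, and the interpolation picks a point inside that range.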

4. Summary: quantiles computed directly

http_request_duration_seconds{quantile="0.5"}  0.12
http_request_duration_seconds{quantile="0.9"}  0.45
http_request_duration_seconds{quantile="0.99"} 1.20
http_request_duration_seconds_count 1260

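Pros: quantiles are precomputed in the client, so the server does no extra work at query time.
Cons: quantiles from multiple instances cannot be combined (avg of p99 ≠ p99 of all).
      → histograms are the safer choice in distributed environments.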
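
A small sketch of why per-instance quantiles cannot be averaged. The two instances and their traffic are hypothetical, and `percentile` is a simple nearest-rank helper written for this example rather than any library's API.

```python
from typing import List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile; p is an integer percentage in 1..100."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need values and 1 <= p <= 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Hypothetical traffic split: instance A is busy and fast, B is idle and slow.
instance_a = [10.0] * 1000   # 1000 requests at 10 ms
instance_b = [1000.0] * 10   # 10 requests at 1000 ms

avg_of_p99 = (percentile(instance_a, 99) + percentile(instance_b, 99)) / 2
p99_of_all = percentile(instance_a + instance_b, 99)

print(avg_of_p99)  # 505.0
print(p99_of_all)  # 10.0
```

Averaging ignores how many requests each instance served, so the nearly idle instance drags the result far from the true p99. Histogram buckets do not have this problem because their counts can simply be added across instances before the quantile is computed.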

Why averages are dangerous

Latency data for one endpoint:

Latency of 100 requests (ms):
  98 requests: 50 ms each
  2 requests: 5000 ms each

average = (98 × 50 + 2 × 5000) / 100 = 149 ms  ← looks "OK"
p99     = 5000 ms                               ← the slowest 2% of users wait 5 seconds!
max     = 5000 ms

→ The average alone hides the pain of those 2% of users.
→ Look at p50 (median), p95, and p99 together to check the tail of the distribution.

Rules of thumb:
  p99 > p95 × 2  → tail latency problem (investigate the slow path separately)
  max  >> p99    → outliers (network glitch, GC pause)
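
The comparison can be reproduced in a few lines. This sketch assumes a dataset with two slow requests out of 100, so that a nearest-rank p99 lands on a slow request; the `percentile` helper is a hypothetical nearest-rank implementation, and other percentile definitions (for example interpolating ones) will give somewhat different numbers.

```python
import statistics
from typing import List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile; p is an integer percentage in 1..100."""
    if not values or not 1 <= p <= 100:
        raise ValueError("need values and 1 <= p <= 100")
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # ceil(p * n / 100) in integer math
    return ordered[rank - 1]


# Assumed dataset: 98 fast requests and 2 slow ones, in milliseconds.
latencies_ms = [50.0] * 98 + [5000.0] * 2

mean = statistics.mean(latencies_ms)
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
worst = max(latencies_ms)

print(mean, p50, p95, p99, worst)  # 149.0 50.0 50.0 5000.0 5000.0

# The first rule of thumb from the text, applied to this data.
tail_problem = p99 > 2 * p95
print(tail_problem)  # True
```

The mean and the median both look healthy here; only the high percentiles and the max reveal the slow path.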

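Prometheus pull vs StatsD push

Pull (Prometheus)

Prometheus server → scrapes the application's /metrics endpoint
                    (every 15s, 30s, ...)

Pros:
- Service discovery (Prometheus learns automatically which hosts are alive)
- The application stays simple (it only exposes a /metrics endpoint)
- A failed scrape is itself a "down" signal
Cons:
- Short-lived jobs (such as CI runs) are hard to measure, because they may exit before the next scrape
- Prometheus must be able to reach the application (an exposed port), which is awkward behind NAT or a firewall
- Scraping a large cluster puts load on the Prometheus server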
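
As a rough illustration of the pull model, the sketch below exposes a /metrics endpoint using only the Python standard library. In practice an official client library would normally do this; the `REQUESTS_TOTAL` dictionary, the handler name, and port 8000 are assumptions made for the example, and label values are not escaped.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process counters, keyed by request path.
REQUESTS_TOTAL: Dict[str, int] = {"/login": 12452}


def render_metrics(counters: Dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for path, value in sorted(counters.items()):
        # Note: label values are inserted as-is, without escaping.
        lines.append(f'http_requests_total{{path="{path}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS_TOTAL).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

The application never contacts Prometheus: it only answers GET requests, and the scrape schedule is decided entirely on the server side.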

Push (StatsD, DogStatsD)

application → StatsD daemon → backend

Pros:
- Short-lived jobs can be measured too
- The application only needs to know the destination
- Low latency (UDP, fire-and-forget)
Cons:
- No automatic discovery
- When no metrics arrive, "down" is hard to tell apart from "idle"
- No ordering or delivery guarantee (UDP packets can be dropped)
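
For comparison, a minimal sketch of the push side: formatting a StatsD datagram and sending it over UDP. The helper names and metric names are hypothetical, only the basic `name:value|type` line format is covered (no sample rates or DogStatsD tags), and 8125 is the conventional StatsD port.

```python
import socket
from typing import Union


def statsd_line(name: str, value: Union[int, float], kind: str) -> str:
    """Format one StatsD datagram: c = counter, g = gauge, ms = timer."""
    if kind not in ("c", "g", "ms"):
        raise ValueError(f"unsupported metric type: {kind}")
    return f"{name}:{value}|{kind}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: no connection, no acknowledgement, no retry."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


print(statsd_line("http.requests", 1, "c"))               # http.requests:1|c
print(statsd_line("http.request.duration", 320, "ms"))    # http.request.duration:320|ms
```

Calling `send_metric(statsd_line("http.requests", 1, "c"))` returns as soon as the datagram is handed to the kernel, which is what makes the push path cheap and also why a lost packet goes unnoticed.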

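Exemplars: linking metrics to traces

A trace_id is attached to a sample request in a histogram bucket:

http_request_duration_seconds_bucket{le="1.0"} 1250
  exemplar: trace_id=abc123, value=0.9, ts=2026-05-25T14:00:00Z

→ While looking at "p99 = 1.2s" on a dashboard, click through to a representative trace.
→ See immediately which service, DB, or external API is responsible.

Supported in Prometheus 2.27+ (exemplar storage has to be enabled with a feature flag) and defined in the OpenMetrics and OpenTelemetry specifications. Exemplars are the link between traces and metrics.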
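
On the wire, the OpenMetrics text format carries an exemplar after a `#` on the same line as the bucket sample. The sketch below renders such a line for the example values above; the function names are hypothetical, label values are not escaped, and a real client library would normally generate this for you.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def format_labels(labels: Dict[str, str]) -> str:
    """Render a label set as {k="v",...} (no escaping, for illustration)."""
    inner = ",".join(f'{key}="{value}"' for key, value in labels.items())
    return "{" + inner + "}"


def bucket_with_exemplar(
    name: str,
    le: str,
    count: int,
    exemplar_labels: Dict[str, str],
    exemplar_value: float,
    exemplar_ts: Optional[float] = None,
) -> str:
    """One histogram bucket line followed by its exemplar."""
    line = f'{name}_bucket{{le="{le}"}} {count}'
    line += f" # {format_labels(exemplar_labels)} {exemplar_value}"
    if exemplar_ts is not None:
        line += f" {exemplar_ts}"  # unix seconds; optional in the format
    return line


ts = datetime(2026, 5, 25, 14, 0, tzinfo=timezone.utc).timestamp()
print(
    bucket_with_exemplar(
        "http_request_duration_seconds", "1.0", 1250, {"trace_id": "abc123"}, 0.9, ts
    )
)
```

The exemplar does not change the bucket count; it only records one concrete observation (value 0.9, trace abc123) that fell into that bucket, which is what a dashboard uses to jump to the trace.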

RED / USE: standard metric patterns

RED (request-oriented services)

Rate: requests per second
Errors: fraction of requests that fail
Duration: latency distribution (p50/p99)
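
As a sketch of what the three RED numbers look like when computed from raw request records: the `Request` type, the choice of status >= 500 as "error", and the sample window are all assumptions for illustration. A real service would derive these from counters and histograms rather than from stored requests.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    status: int          # HTTP status code
    duration_ms: float   # time taken to serve the request


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, error ratio, and nearest-rank latency percentiles for one window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)

    def pct(p: int) -> float:
        rank = (p * len(durations) + 99) // 100  # ceil(p * n / 100)
        return durations[rank - 1]

    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": pct(50),
        "p99_ms": pct(99),
    }


# Hypothetical one-minute window: 57 fast successes and 3 slow failures.
window = [Request(200, 40.0)] * 57 + [Request(500, 900.0)] * 3
summary = red_summary(window, window_seconds=60.0)
print(summary)
```

For this window the sketch reports 1 request per second, a 5% error ratio, a p50 of 40 ms, and a p99 of 900 ms.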

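USE (resources)

Utilization: percentage of the resource in use (CPU, memory)
Saturation: queued or waiting work (load average)
Errors: resource errors (disk I/O errors and the like)

Use RED for every service and USE for every machine. Together they cover much the same ground as the four golden signals from Google's SRE book (latency, traffic, errors, saturation).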

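Common pitfalls

Judging health from the average alone: see above. Look at percentiles and max together.
Label cardinality explosion: a high-cardinality label such as http_requests{user_id} multiplies the number of series and the storage cost. Put user_id in logs or traces, not in metrics.
Reading a counter's raw value instead of its rate: the cumulative value of a counter means little on its own. Look at a derivative such as `rate(counter[5m])`.
Averaging summaries across instances: the mean of per-instance p99 values is meaningless. Aggregate histograms first, then compute the quantile.
Scraping too often: a 1s scrape interval inflates Prometheus load and storage. 15-30s is usually enough; use 5s only for hot services.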
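
The label cardinality pitfall is easy to quantify: in the worst case, the number of series for one metric is the product of the number of distinct values of each label. The label names and counts below are invented for illustration.

```python
import math
from typing import Dict


def series_count(label_cardinality: Dict[str, int]) -> int:
    """Worst-case number of series: every label combination occurs."""
    return math.prod(label_cardinality.values())


# Hypothetical label sets for one metric.
base = {"method": 5, "path": 20, "status": 6}
with_user = {**base, "user_id": 100_000}

print(series_count(base))       # 600
print(series_count(with_user))  # 60000000
```

A single unbounded label turns a metric with a few hundred series into one with tens of millions, which is why identifiers like user_id belong in logs and traces.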

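Wrapping up

Metrics are small, fast, and aggregatable, which makes them the raw material for observability dashboards and alerts. The essentials: understand the four types precisely (counter, gauge, histogram, summary), read percentiles alongside averages, and stay aware of cardinality.

Getting started: use the RED and USE patterns to track just a few dimensions per service and per machine, and combine metrics with traces and logs for specific alerts and debugging. Use exemplars to navigate from a metric to a trace.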