How does distributed tracing work?

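In a microservice environment a single request may pass through 10 services. When p99 latency hits 2 seconds, where is the time going? Traces answer that. This guide covers the mechanics of spans, context propagation, W3C Trace Context, and sampling, and shows how tail sampling fixes the trap where 1% sampling misses the slowest requests.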

Span: the basic unit of a trace

span:
  - one operation, from start to end
  - metadata: name, span_id, start_time, duration, attributes (tags)
  - parent span_id (absent on the root span)
  - trace_id (one ID shared by every span in the trace)

trace = the tree of spans that share a trace_id

Example:
trace_id = abc123
  span_id = root (POST /login)
  ├ span_id = s1 (parent=root, "validate input")
  ├ span_id = s2 (parent=root, "DB query")
  │ └ span_id = s2a (parent=s2, "INDEX scan")
  └ span_id = s3 (parent=root, "Redis set")
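
A backend receives spans as a flat list and rebuilds the tree from the parent links. The following is a minimal sketch of that step using the example above. The field names (`spanId`, `parentId`) and the helpers `buildTraceTree` and `renderTree` are illustrative assumptions, not part of any SDK.

```javascript
// Rebuild a trace tree from a flat list of spans.
function buildTraceTree(spans) {
  const nodes = new Map();
  for (const span of spans) {
    nodes.set(span.spanId, { ...span, children: [] });
  }
  const roots = [];
  for (const node of nodes.values()) {
    const parent = node.parentId ? nodes.get(node.parentId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      // No parent, or the parent span never arrived: treat as a root.
      roots.push(node);
    }
  }
  return roots;
}

// Render the tree as indented lines, one span per line.
function renderTree(nodes, depth = 0) {
  return nodes.flatMap((n) => [
    `${"  ".repeat(depth)}${n.spanId} (${n.name})`,
    ...renderTree(n.children, depth + 1),
  ]);
}

const spans = [
  { traceId: "abc123", spanId: "root", parentId: null, name: "POST /login" },
  { traceId: "abc123", spanId: "s1", parentId: "root", name: "validate input" },
  { traceId: "abc123", spanId: "s2", parentId: "root", name: "DB query" },
  { traceId: "abc123", spanId: "s2a", parentId: "s2", name: "INDEX scan" },
  { traceId: "abc123", spanId: "s3", parentId: "root", name: "Redis set" },
];

const tree = buildTraceTree(spans);
console.log(renderTree(tree).join("\n"));
```

A span whose parent is missing shows up as an extra root, which is how a broken trace typically looks in a trace viewer.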

Trace context propagation: linking spans across services

Service A calls service B. For B's span to belong to A's trace, the trace_id and the parent span_id must travel with the request, typically in an HTTP header.

W3C Trace Context (the standard):

HTTP request from A to B:
  traceparent: 00-abc123-s5-01
                │  │     │  │
                │  │     │  └─ trace flags (01 = sampled)
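                │  │     └──── parent span_id (A's current span)
                │  └─────────── trace_id
                └─────────────── version

The IDs above are shortened for readability. On the wire, version and flags are 2 hex characters each, trace_id is 32 hex characters, and the parent span_id is 16, e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01.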

When B creates a new span:
  - reuse the trace_id as-is (abc123)
  - generate a new span_id
  - set parent_id = s5 (from the header)

→ The whole trace can be reconstructed as a single tree.
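
What service B does with the incoming header can be sketched in a few lines. This is a dependency-free illustration that assumes the version 00 layout only; `parseTraceparent`, `startChildSpan`, and `toTraceparent` are hypothetical helper names. In a real service the OpenTelemetry propagator does this work.

```javascript
const { randomBytes } = require("node:crypto");

// version (2 hex) - trace_id (32 hex) - parent_id (16 hex) - flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const m = TRACEPARENT_RE.exec(String(header).trim());
  if (!m) return null;
  const [, version, traceId, parentId, flags] = m;
  // Version ff and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// B keeps the trace_id, generates its own span_id, and records A's span as parent.
function startChildSpan(incomingHeader, name) {
  const parent = parseTraceparent(incomingHeader);
  return {
    name,
    traceId: parent ? parent.traceId : randomBytes(16).toString("hex"),
    spanId: randomBytes(8).toString("hex"),
    parentId: parent ? parent.parentId : null,
    // Assumption: a new root defaults to unsampled here; a real SDK asks its sampler.
    sampled: parent ? parent.sampled : false,
  };
}

// Header B sends when it calls the next service: B's span becomes the parent.
function toTraceparent(span) {
  return `00-${span.traceId}-${span.spanId}-${span.sampled ? "01" : "00"}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const spanB = startChildSpan(incoming, "B: handle request");
console.log(spanB);
console.log(toTraceparent(spanB));
```

The shortened header from the diagram (`00-abc123-s5-01`) would be rejected by this parser, because it does not have the full-length IDs.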

Instrumentation: how spans get created

Manual

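// OpenTelemetry API (Node.js example)
const { trace, SpanStatusCode } = require("@opentelemetry/api");

const tracer = trace.getTracer("my-service");

async function handleLogin(req) {
  // startActiveSpan makes this span the active context, so spans created
  // inside the callback (e.g. by DB auto-instrumentation) become its children.
  return tracer.startActiveSpan("handle-login", async (span) => {
    span.setAttribute("user.id", req.userId);
    try {
      const user = await db.users.find(req.userId); // db: the app's own data layer
      span.setAttribute("user.found", !!user);
      return user;
    } catch (e) {
      span.recordException(e);
      span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
      throw e;
    } finally {
      span.end(); // always end the span, on success or failure
    }
  });
}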

Auto-instrumentation

OpenTelemetry's auto-instrumentation libraries hook into popular frameworks and clients (Express, FastAPI, Spring Boot, gRPC) and create spans automatically. Start with auto-instrumentation, then add manual spans only on hot paths.

Sampling: you cannot store every trace

Tracing every request makes cost explode. Sampling is mandatory.
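
A back-of-the-envelope estimate makes the cost concrete. The sketch below assumes 10 million requests per day, 100 spans per request, and about 1 KB per span. These are illustrative figures, not measurements, and `estimateDailyBytes` and `formatBytes` are hypothetical helpers.

```javascript
// Rough daily trace volume in bytes. All inputs are illustrative assumptions.
function estimateDailyBytes({ requestsPerDay, spansPerRequest, bytesPerSpan, sampleRatio = 1 }) {
  return requestsPerDay * spansPerRequest * bytesPerSpan * sampleRatio;
}

// Format a byte count using decimal units (1 KB = 1000 B).
function formatBytes(bytes) {
  const units = ["B", "KB", "MB", "GB", "TB", "PB"];
  let value = bytes;
  let i = 0;
  while (value >= 1000 && i < units.length - 1) {
    value /= 1000;
    i += 1;
  }
  return `${value.toFixed(1)} ${units[i]}`;
}

const workload = { requestsPerDay: 10_000_000, spansPerRequest: 100, bytesPerSpan: 1000 };
const full = estimateDailyBytes(workload);
const sampled = estimateDailyBytes({ ...workload, sampleRatio: 0.01 });

console.log("store everything:", formatBytes(full));
console.log("1% sampling:     ", formatBytes(sampled));
```

Under these assumptions, storing everything is about 1 TB per day, and 1% sampling brings that down to about 10 GB per day.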

Head sampling: decide at the start

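When the first service creates the trace_id:
  random < 0.01 ? sampled=true : sampled=false

Pros: simple; every service follows the same decision (consistent), because it travels with the request in the trace flags
Cons: the decision is made before latency is known, so even the slowest request is dropped with 99% probability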
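
One common way to keep the decision consistent is to derive it from the trace_id itself instead of a fresh random number, so any service that sees the same trace_id reaches the same answer. The sketch below is a simplified illustration of that idea, not OpenTelemetry's exact sampler; `headSample` and `keptFraction` are hypothetical names.

```javascript
const { randomBytes } = require("node:crypto");

// Map the last 8 hex characters of the trace_id to a number in [0, 1)
// and keep the trace if it falls below the sampling ratio.
function headSample(traceId, ratio) {
  const bucket = parseInt(traceId.slice(-8), 16) / 0x100000000;
  return bucket < ratio;
}

// The decision never looks at latency, so a slow request is kept with
// probability `ratio`, the same as any other request.
function keptFraction(ratio, trials) {
  let kept = 0;
  for (let i = 0; i < trials; i += 1) {
    if (headSample(randomBytes(16).toString("hex"), ratio)) kept += 1;
  }
  return kept / trials;
}

const traceId = "4bf92f3577b34da6a3ce929d0e0e4736";
// Same trace_id, same decision, no matter which service evaluates it.
console.log(headSample(traceId, 0.01), headSample(traceId, 0.01));
// Fraction of random traces kept at a 1% ratio (varies slightly per run).
console.log(keptFraction(0.01, 100000));
```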

Tail sampling: decide after seeing the whole trace

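Collect every span in a buffer → evaluate once the trace has finished:
  - duration > 1s? → keep
  - error occurred? → keep
  - normal and fast → keep only 1%

Pros: keeps 100% of slow/error traces (high debugging value)
Cons: must wait for the trace to finish → needs buffer memory, and traces show up in the backend slightly later (request latency itself is unaffected)
Tool: the tail_sampling processor in the OpenTelemetry Collector (contrib distribution)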
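
The policy above can be sketched as a small in-memory sampler. This is an illustration of the decision logic only, under the assumption that the caller knows when a trace is complete; a real collector uses a wait timeout and must route all spans of a trace to the same instance. `TailSampler` and its span fields are hypothetical.

```javascript
class TailSampler {
  constructor({ slowMs = 1000, baselineRatio = 0.01, random = Math.random } = {}) {
    this.slowMs = slowMs;
    this.baselineRatio = baselineRatio;
    this.random = random; // injectable so the decision can be tested
    this.buffers = new Map(); // traceId -> buffered spans
  }

  // Buffer every span until the trace is complete.
  addSpan(span) {
    if (!this.buffers.has(span.traceId)) this.buffers.set(span.traceId, []);
    this.buffers.get(span.traceId).push(span);
  }

  // Evaluate the whole trace and decide whether to keep it.
  finish(traceId) {
    const spans = this.buffers.get(traceId) || [];
    this.buffers.delete(traceId);

    let durationMs = 0;
    if (spans.length > 0) {
      const start = Math.min(...spans.map((s) => s.startMs));
      const end = Math.max(...spans.map((s) => s.startMs + s.durationMs));
      durationMs = end - start;
    }

    let reason = null;
    if (durationMs > this.slowMs) reason = "slow";
    else if (spans.some((s) => s.status === "ERROR")) reason = "error";
    else if (this.random() < this.baselineRatio) reason = "baseline";

    return { keep: reason !== null, reason, durationMs };
  }
}

const sampler = new TailSampler({ random: () => 0.5 }); // fixed "random" for the demo
sampler.addSpan({ traceId: "fast", startMs: 0, durationMs: 50, status: "OK" });
sampler.addSpan({ traceId: "fast", startMs: 10, durationMs: 20, status: "OK" });
sampler.addSpan({ traceId: "slow", startMs: 0, durationMs: 1500, status: "OK" });
sampler.addSpan({ traceId: "failed", startMs: 0, durationMs: 30, status: "ERROR" });

const fastDecision = sampler.finish("fast");
const slowDecision = sampler.finish("slow");
const failedDecision = sampler.finish("failed");
console.log(fastDecision, slowDecision, failedDecision);
```

With the fixed `random` value of 0.5, the fast and healthy trace is dropped, while the slow and failed traces are always kept.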

Backends in practice

Tool	Type	Strengths
Jaeger	Open source (CNCF)	self-hosted, Cassandra/Elasticsearch backend
Tempo (Grafana)	Open source	S3/GCS backend: cheap, Grafana integration
Zipkin	Open source	the oldest of these (2012, Twitter), simple
Honeycomb	SaaS	high cardinality + BubbleUp (automatically surfaces what makes outliers different)
Lightstep (now ServiceNow)	SaaS	distributed-systems specialist, large trace volumes
Datadog APM	SaaS	metrics / logs integration, strong marketing
AWS X-Ray	Cloud	automatic integration with AWS services

Common pitfalls

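Missing context propagation: when work moves to an async task or a queue, the trace context is not carried along → the trace tree breaks. A propagation wrapper is required.
Span attribute explosion: high-cardinality attributes (user_id, etc.) put indexing load on the backend. Choose them deliberately.
1% head sampling only: risk of dropping the slowest and failing traces. Use tail sampling, or always sample errors.
SDK overhead: instrumentation itself can cost roughly 1-5% CPU. Measure with a profiler, and add manual spans only on hot paths.
Underestimating trace volume: 1 million requests × 100 spans × ~1 KB is about 100 GB, so at 10 million requests a day that is roughly 1 TB/day. A retention policy is required.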
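
For the first pitfall, the wrapper idea can be sketched without any tracing library: the producer copies the active trace context into the message headers, and the consumer restores it before running the handler. Everything here is an illustrative assumption (an in-memory array as the queue, the `publish` and `consume` helpers, a hard-coded sampled flag). With OpenTelemetry, `propagation.inject` and `propagation.extract` from `@opentelemetry/api` play these two roles.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomBytes } = require("node:crypto");

// Holds the span that is "current" for the running async call chain.
const currentSpan = new AsyncLocalStorage();

const newSpanId = () => randomBytes(8).toString("hex");

// Producer-side wrapper: copy the active trace context into the message headers.
function publish(queue, payload) {
  const span = currentSpan.getStore();
  // Assumption: the trace is sampled, so the flags are hard-coded to 01.
  const headers = span ? { traceparent: `00-${span.traceId}-${span.spanId}-01` } : {};
  queue.push({ headers, payload });
}

// Consumer-side wrapper: restore the context from the headers, then run the handler.
function consume(queue, handler) {
  const results = [];
  for (const msg of queue.splice(0)) {
    const parts = (msg.headers.traceparent || "").split("-");
    const span = parts.length === 4
      ? { traceId: parts[1], parentId: parts[2], spanId: newSpanId() }
      : { traceId: randomBytes(16).toString("hex"), parentId: null, spanId: newSpanId() };
    results.push(currentSpan.run(span, () => handler(msg.payload)));
  }
  return results;
}

const queue = [];
const requestSpan = { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", spanId: "00f067aa0ba902b7" };

// Inside the request handler: enqueue background work.
currentSpan.run(requestSpan, () => publish(queue, { userId: 42 }));

// Later, in the worker: the handler sees a span that continues the same trace.
const seen = consume(queue, () => currentSpan.getStore());
console.log(seen[0]);
```

Without the header copy in `publish`, the worker would start a fresh trace_id and the background work would appear as an unrelated trace.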

Wrapping up

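The essence of distributed tracing is a span tree plus context propagation. W3C Trace Context standardizes the propagation format, so traces are compatible across vendors. OpenTelemetry provides a single, vendor-neutral SDK for instrumentation.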

In practice, storing every trace is not feasible, so sampling is mandatory. Start with head sampling, but consider tail sampling for its debugging value in production. Always keep errors.