"We promise 99.9% availability" — what does that mean? 43 minutes of downtime per month? Just downtime? Latency too? SLI · SLO · error budgets — the core framework from Google's SRE book. This guide covers defining SLIs, writing SLOs, using error budgets, and multi-window burn-rate alerts.
SLI · SLO · SLA — Meanings
SLI (Service Level Indicator) — measurement
"successful request ratio over the last 5 minutes"
"p99 latency over the last 7 days"
SLO (Service Level Objective) — target
"99.9% requests successful over 30 days"
"p99 latency < 200ms over 7 days"
→ the SLI must satisfy this target
SLA (Service Level Agreement) — contract (external + penalties)
"Below 99.9% monthly availability → 30% credit"
→ more conservative than the SLO: keep the internal SLO stricter than the SLA so there is a buffer before penalties apply
What Makes a Good SLI
- Reflects user experience — "CPU < 80%" doesn't (users don't feel it). User-visible measures like "request responded with 200".
- Binary classifiable — every event is either good or bad. "5xx → bad". Latency distributions work too once you pick a threshold ("p99 < 500ms → good").
- Expressed as a ratio — "good events / total events". Naturally pairs with SLO targets like 99.9%.
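A minimal Python sketch of the "good events / total events" idea follows. The `Request` record, the `sli` helper, and the 500 ms threshold are illustrative assumptions rather than any monitoring library's API; in practice the counts would come from your metrics system.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Request:
    """One observed request (hypothetical record shape for this sketch)."""
    status: int        # HTTP status code
    latency_ms: float  # time to respond, in milliseconds


def sli(requests: Iterable[Request], is_good: Callable[[Request], bool]) -> float:
    """Return good events / total events, or 1.0 when there was no traffic."""
    total = 0
    good = 0
    for r in requests:
        total += 1
        if is_good(r):
            good += 1
    # Treating an empty window as "met" is a policy choice, not a rule.
    return good / total if total else 1.0


def available(r: Request) -> bool:
    """Availability: anything that is not a 5xx counts as good."""
    return r.status < 500


def fast_enough(r: Request) -> bool:
    """Latency: a good event is a non-5xx response under the assumed 500 ms."""
    return r.status < 500 and r.latency_ms < 500


sample = [
    Request(200, 120.0),
    Request(200, 640.0),
    Request(503, 30.0),
    Request(404, 45.0),
]
availability_sli = sli(sample, available)  # 3 good events out of 4
latency_sli = sli(sample, fast_enough)     # 2 good events out of 4
```

Keeping the good/bad decision in a small predicate means the same ratio function serves availability, latency, or freshness; only the predicate changes.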
Common SLI Patterns
Availability:
SLI = (non-5xx requests) / (total requests)
Latency:
SLI = (1-min buckets with p99 < 500ms) / (total 1-min buckets)
Freshness (data pipeline):
SLI = (samples with lag < 5min) / (total samples)
Throughput:
SLI = (samples with queue depth < 100) / (total samples)
What an SLO Target Means
SLO: "99.9% successful over 30 days"
30 days = 43,200 minutes
0.1% = 43.2 minutes — that's the error budget
→ Up to 43 minutes of downtime per month is OK (or equivalent 5xx)
→ Under 43 minutes used → safe to deploy new features
→ Over → freeze (stabilize first)
That's the heart of error budgets:
"100% availability is impossible, so spend up to the agreed 0.1%.
Unused budget = freedom to ship."
What Availability % Really Means
| SLO | Monthly downtime | Daily | Meaning |
|---|---|---|---|
| 99% | 7.2 hours | 14.4 min | "two nines" — internal tools |
| 99.9% | 43 min | 1.4 min | "three nines" — typical SaaS |
| 99.95% | 22 min | 43 sec | Business-critical |
| 99.99% | 4.3 min | 9 sec | "four nines" — finance, payments core |
| 99.999% | 26 sec | 0.9 sec | "five nines" — telecom, very rare |
As a rule of thumb, each extra nine costs roughly 10× more to achieve, and 99.99 → 99.999 is the most expensive jump. Don't put five nines on every service; it wastes resources.
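The downtime figures in the table can be reproduced with a few lines of arithmetic. This sketch assumes a 30-day month and treats the whole budget as full downtime; `allowed_downtime_seconds` is a hypothetical helper name.

```python
def allowed_downtime_seconds(slo_percent: float, window_days: float) -> float:
    """Seconds of full downtime that fit inside the error budget for the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_seconds = window_days * 24 * 60 * 60
    # The budget is the share of the window the SLO does not require to be good.
    return window_seconds * (100 - slo_percent) / 100


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    monthly = allowed_downtime_seconds(slo, 30)
    daily = allowed_downtime_seconds(slo, 1)
    print(f"{slo}%: {monthly / 60:.1f} min/month, {daily:.1f} s/day")
```

Running the loop for your own window length (for example 28 or 90 days) shows how much the budget depends on the period you agree on.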
Error Budget — A Velocity Contract
Old tension:
Product: "Ship fast for the users!"
SRE: "Stability first! Slow down!"
Error budget resolves this:
- Agree on an SLO (e.g. 99.9% = 43 min budget/month)
- Budget remaining → ship freely
- Budget exhausted → freeze (pause releases, stabilize)
→ "Decide via an objective metric." Less political friction.
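One way to make the "objective metric" concrete is a small function that turns budget consumption into a release decision. This is a sketch only: the names `budget_consumed`, `release_policy`, and `Action` are hypothetical, the counts are assumed to cover the whole SLO window, and the 80% caution threshold is the one this article uses for the weekly review.

```python
from enum import Enum


class Action(Enum):
    SHIP = "ship freely"
    CAUTION = "reduce releases"
    FREEZE = "freeze releases"


def budget_consumed(error_budget: float, total: int, bad: int) -> float:
    """Fraction of the error budget used in the window (1.0 = all of it).

    error_budget is 1 - SLO, e.g. 0.001 for a 99.9% SLO.
    """
    if total == 0:
        return 0.0  # no traffic, nothing spent
    allowed_bad = error_budget * total
    return bad / allowed_bad


def release_policy(consumed: float, caution_at: float = 0.8) -> Action:
    """Map budget consumption to the agreed action."""
    if consumed >= 1.0:
        return Action.FREEZE
    if consumed >= caution_at:
        return Action.CAUTION
    return Action.SHIP


# 1,000,000 requests at a 99.9% SLO allow 1,000 bad requests.
for bad in (500, 850, 1200):
    consumed = budget_consumed(0.001, 1_000_000, bad)
    print(bad, f"{consumed:.0%}", release_policy(consumed).value)
```

The thresholds matter less than the fact that they are written down and agreed before the budget runs out.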
Operations:
Weekly budget review
- 80%+ used → reduce next week's releases (caution)
- 100% used → freeze
- Quarterly burn patterns → re-tune SLO itself
Multi-Window Multi-Burn-Rate Alerts
Problem with naive alerts: SLI dropped below 99.9% → page. But:
SLI = 99.0% for 30 min → ~0.7% of the 30-day budget spent. Tiny vs monthly → no page
SLI = 99.0% for 6 hours → ~8% of the budget spent. Significant → page
SLI = 50% for 5 min → ~6% of the budget spent within minutes. Immediate page
→ "How fast we burn the budget" is the right basis for alerts.
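Burn rate is simply the observed error rate divided by the error budget, and the share of budget an incident consumes follows from it. The sketch below assumes roughly uniform traffic over a 30-day window; `burn_rate` and `budget_spent` are hypothetical helper names.

```python
def burn_rate(error_rate: float, error_budget: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving.

    A burn rate of 1 spends the whole budget in exactly one SLO window.
    """
    return error_rate / error_budget


def budget_spent(error_rate: float, error_budget: float,
                 duration_min: float, window_days: float = 30) -> float:
    """Fraction of the window's budget consumed by an incident of this length."""
    window_min = window_days * 24 * 60
    return burn_rate(error_rate, error_budget) * duration_min / window_min


BUDGET = 0.001  # 99.9% SLO
scenarios = [
    ("SLI 99.0% for 30 min", 0.01, 30),
    ("SLI 99.0% for 6 hours", 0.01, 360),
    ("SLI 50% for 5 min", 0.50, 5),
]
for label, error_rate, minutes in scenarios:
    rate = burn_rate(error_rate, BUDGET)
    spent = budget_spent(error_rate, BUDGET, minutes)
    print(f"{label}: burn rate {rate:.0f}x, {spent:.1%} of budget")
```

Comparing the scenarios this way makes it clear why a short, severe outage and a long, mild one can deserve the same urgency.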
Multi-window, multi-burn-rate (Google SRE recommended):
severity | burn rate | long window | short window | action
---------|-----------|-------------|--------------|--------
Page | 14.4x | 1h | 5m | page immediately
Page | 6x | 6h | 30m | page immediately
Ticket | 3x | 1d | 2h | open ticket
Ticket | 1x | 3d | 6h | open ticket
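The evaluation logic behind such a table can be sketched in Python. This version pairs each burn-rate threshold with a long and a short window, following the pattern described in Google's SRE Workbook; the specific window pairs, the `BurnRateRule` type, and the `evaluate` function are assumptions for illustration, and the measured burn rates would come from your metrics system.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class BurnRateRule:
    severity: str      # "page" or "ticket"
    threshold: float   # burn-rate multiple that must be exceeded
    long_window: str   # key into the measured burn rates, e.g. "1h"
    short_window: str  # shorter window that confirms the burn is still ongoing


# Assumed rule set for illustration; tune thresholds and windows per service.
RULES: Sequence[BurnRateRule] = (
    BurnRateRule("page", 14.4, "1h", "5m"),
    BurnRateRule("page", 6.0, "6h", "30m"),
    BurnRateRule("ticket", 3.0, "1d", "2h"),
    BurnRateRule("ticket", 1.0, "3d", "6h"),
)


def evaluate(burn_by_window: Mapping[str, float],
             rules: Sequence[BurnRateRule] = RULES) -> Optional[str]:
    """Return the severity of the first rule whose two windows both exceed
    its threshold, or None. Rules are ordered most severe first."""
    for rule in rules:
        long_burn = burn_by_window.get(rule.long_window, 0.0)
        short_burn = burn_by_window.get(rule.short_window, 0.0)
        if long_burn > rule.threshold and short_burn > rule.threshold:
            return rule.severity
    return None


brief_spike = {"5m": 40, "30m": 8, "1h": 4, "2h": 2, "6h": 0.7, "1d": 0.2, "3d": 0.1}
sustained = {"5m": 20, "30m": 18, "1h": 16, "2h": 9, "6h": 3, "1d": 0.8, "3d": 0.3}
slow_burn = {"5m": 1.5, "30m": 1.5, "1h": 1.5, "2h": 1.5, "6h": 1.4, "1d": 1.3, "3d": 1.2}

print(evaluate(brief_spike), evaluate(sustained), evaluate(slow_burn))
```

Requiring the short window as well means the alert stops firing soon after the problem is fixed, instead of staying active until the long window drains.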
→ Both windows must agree = fewer false positives (ignore brief spikes).
Multiple burn rates = severity tiers.
Writing SLOs — Practical Steps
- Identify critical user journeys — checkout, login, search. Not every endpoint needs an SLO.
- Pick an SLI per journey — availability + latency (usually both).
- Set the SLO target — too low = meaningless, too high = budget always exhausted. 99.9% is often a reasonable starting point. Base it on 1-3 months of measured data.
- Agree on an error budget policy — what happens when exhausted, review cadence.
- Set multi-window burn-rate alerts — 1h and 6h windows are a common starting point.
- Quarterly review — is the SLO too low/high? Does it match user expectations?
Common Pitfalls
- SLI as an internal metric — CPU / memory etc. don't reflect user experience. Use user-visible measures only.
- Promising 100% SLO — impossible. Even 99.9% allows 43 minutes per month. 100% = endless alerts + permanent freeze.
- Alerting right when SLO is missed — too many false positives. Use burn-rate + multi-window.
- "Budget is unused, ship anything" — a single deploy can cause a big incident. Use gradual rollouts + monitoring.
- SLOs on every endpoint — operational burden. Only the critical user journeys.
Wrap-up
SLIs, SLOs, and error budgets aren't just metrics; they turn the stability-vs-velocity trade-off into an explicit, measurable agreement. That is the core insight from Google SRE.
Practical: SLOs on 3-5 critical user journeys (not every endpoint); explicit budget policy (freeze conditions); multi-window burn-rate alerts (1h + 6h); quarterly review. Start at 99.9% — adjust based on measurement.