How SLIs and SLOs Actually Work

"We promise 99.9% availability": what does that actually mean? Is it 43 minutes of downtime per month? Does only downtime count, or latency too? SLIs, SLOs, and error budgets, the core framework from Google's SRE book, answer these questions. This guide covers defining SLIs, writing SLOs, using error budgets, and setting up multi-window burn-rate alerts.

SLI · SLO · SLA — Meanings

SLI (Service Level Indicator) — measurement
  "successful request ratio over the last 5 minutes"
  "p99 latency over the last 7 days"

SLO (Service Level Objective) — target
  "99.9% requests successful over 30 days"
  "p99 latency < 200ms over 7 days"
  → the SLI must satisfy this target

SLA (Service Level Agreement) — contract (external + penalties)
  "Below 99.9% monthly availability → 30% credit"
  → looser than the internal SLO, so the SLO is breached first and leaves a buffer before penalties apply (e.g. a 99.95% SLO behind a 99.9% SLA)

What Makes a Good SLI

Reflects user experience: "CPU < 80%" does not, because users never feel CPU load directly. Prefer user-visible measures such as "request responded with 200".
Binary classifiable: every event is either good or bad, e.g. "5xx → bad". Distribution-based measures work too once a threshold is applied ("p99 < 500ms → good").
Expressed as a ratio: "good events / total events", which pairs naturally with SLO targets like 99.9%.
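
The ratio form can be sketched in a few lines. This is a minimal illustration, assuming request outcomes are available as HTTP status codes; the function name `availability_sli` and the sample data are hypothetical and not tied to any monitoring system.

```python
from typing import Iterable


def availability_sli(status_codes: Iterable[int]) -> float:
    """Return good events / total events, where any 5xx response counts as bad."""
    total = 0
    good = 0
    for code in status_codes:
        total += 1
        if code < 500:
            good += 1
    if total == 0:
        # No traffic means nothing failed; treating that as fully good is a policy choice.
        return 1.0
    return good / total


# Hypothetical sample: 997 non-5xx responses and 3 server errors.
sample = [200] * 990 + [404] * 7 + [503] * 3
sli = availability_sli(sample)
print(f"SLI = {sli:.3%}, meets 99.9% target: {sli >= 0.999}")
```

Note that 4xx responses count as good here because the classification is "non-5xx"; whether client errors should count against the service is a decision each team has to make.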

Common SLI Patterns

Availability:
  SLI = (non-5xx requests) / (total requests)

Latency:
  SLI = (1-min buckets with p99 < 500ms) / (total 1-min buckets)

Freshness (data pipeline):
  SLI = (samples with lag < 5min) / (total samples)

Throughput:
  SLI = (samples with queue depth < 100) / (total samples)
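
The bucket-based latency pattern above could be computed roughly as follows. This is a sketch under assumptions: samples arrive as `(unix_timestamp_seconds, latency_ms)` pairs, p99 uses the nearest-rank method, and the names `p99` and `latency_sli` are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def p99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n), 1-based
    return ordered[rank - 1]


def latency_sli(samples: Iterable[Tuple[float, float]], threshold_ms: float = 500.0) -> float:
    """Return (1-min buckets with p99 < threshold) / (total 1-min buckets)."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, latency_ms in samples:
        buckets[int(timestamp // 60)].append(latency_ms)
    if not buckets:
        return 1.0  # no data: treated as good here, which is a policy choice
    good = sum(1 for latencies in buckets.values() if p99(latencies) < threshold_ms)
    return good / len(buckets)


# Hypothetical data: three minutes with 100 requests each.
samples = []
for minute, slow_requests in enumerate([0, 2, 1]):
    for i in range(100):
        latency = 900.0 if i < slow_requests else 100.0
        samples.append((minute * 60 + i * 0.5, latency))

print(f"latency SLI = {latency_sli(samples):.3f}")
```

In this data only the minute with two slow requests out of 100 pushes p99 past the threshold, so two of the three buckets are good.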

What an SLO Target Means

SLO: "99.9% successful over 30 days"

30 days = 43,200 minutes
0.1% = 43.2 minutes — that's the error budget

→ Up to 43 minutes of downtime per month is OK (or equivalent 5xx)
→ Under 43 minutes used → safe to deploy new features
→ Over → freeze (stabilize first)

That's the heart of error budgets:
  "100% availability is impossible, so spend up to the agreed 0.1%.
   Unused budget = freedom to ship."
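
The budget arithmetic can be written down directly. A minimal sketch, assuming a time-based budget and a 30-day window; the function names and the 10.8 bad minutes are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime (or equivalent) that the SLO allows in the window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_fraction(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent; negative once the budget is blown."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - bad_minutes) / budget


budget = error_budget_minutes(0.999)
remaining = budget_remaining_fraction(0.999, bad_minutes=10.8)
print(f"budget: {budget:.1f} min, remaining: {remaining:.0%}")
```

A negative remaining fraction is kept rather than clamped to zero, since how far over budget a service is can matter for the review.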

What Availability % Really Means

SLO     | Monthly downtime (30 days) | Daily downtime | Meaning
--------|----------------------------|----------------|--------
99%     | 7.2 hours                  | 14.4 min       | "two nines": internal tools
99.9%   | 43 min                     | 1.4 min        | "three nines": typical SaaS
99.95%  | 22 min                     | 43 sec         | Business-critical
99.99%  | 4.3 min                    | 9 sec          | "four nines": finance, payments core
99.999% | 26 sec                     | 0.9 sec        | "five nines": telecom, very rare

As a rule of thumb, each extra nine costs roughly 10× more, and the jump from 99.99% to 99.999% is the most expensive. Don't demand five nines from every service; it wastes resources.
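
The downtime figures for each target can be reproduced with a short script. This is a sketch assuming a 30-day month; `allowed_downtime_seconds` is an illustrative name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(slo_percent: float, days: int) -> float:
    """Seconds of downtime permitted by an availability target over `days` days."""
    return (100.0 - slo_percent) / 100.0 * days * SECONDS_PER_DAY


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    monthly_min = allowed_downtime_seconds(target, 30) / 60
    daily_sec = allowed_downtime_seconds(target, 1)
    print(f"{target:>7}%  monthly: {monthly_min:8.1f} min  daily: {daily_sec:7.1f} s")
```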

Error Budget — A Velocity Contract

Old tension:
  Product: "Ship fast for the users!"
  SRE:    "Stability first! Slow down!"

Error budget resolves this:
  - Agree on an SLO (e.g. 99.9% = 43 min budget/month)
  - Budget remaining → ship freely
  - Budget exhausted → freeze (pause releases, stabilize)

→ "Decide via an objective metric." Less political friction.

Operations:
  Weekly budget review
  - 80%+ used → reduce next week's releases (caution)
  - 100% used → freeze
  - Quarterly burn patterns → re-tune SLO itself
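
The weekly review thresholds could be encoded as a small decision function. A sketch only: the 80% and 100% cut-offs come from the list above, while the enum and function names are made up for illustration.

```python
from enum import Enum


class ReleasePolicy(Enum):
    SHIP = "ship freely"
    CAUTION = "reduce next week's releases"
    FREEZE = "freeze releases and stabilize"


def weekly_review(budget_used_fraction: float) -> ReleasePolicy:
    """Map the fraction of error budget consumed so far to a release policy."""
    if budget_used_fraction >= 1.0:
        return ReleasePolicy.FREEZE
    if budget_used_fraction >= 0.8:
        return ReleasePolicy.CAUTION
    return ReleasePolicy.SHIP


for used in (0.35, 0.85, 1.10):
    print(f"{used:.0%} used -> {weekly_review(used).value}")
```

Keeping the policy in code (or in a written document with the same thresholds) is what makes the decision mechanical rather than a negotiation.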

Multi-Window Multi-Burn-Rate Alerts

Problem with naive alerts: SLI dropped below 99.9% → page. But:

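Against a 99.9% SLO over 30 days, assuming steady traffic:
SLI = 99.0% for 30 min → ~0.7% of the budget spent (10x burn, but brief). Tiny vs monthly → no page
SLI = 99.0% for 6 hours → ~8% of the budget spent (the same 10x burn, sustained). Critical → page
SLI = 50% for 5 min → ~6% of the budget spent (500x burn). Immediate page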

→ "How fast we burn the budget" is the right basis for alerts.
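
Burn rate is the observed error rate divided by the error rate the SLO allows, and the share of budget an incident consumes follows from it. A minimal sketch, assuming a 99.9% SLO over 30 days and uniform traffic; the function names are illustrative.

```python
def burn_rate(sli: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent."""
    return (1.0 - sli) / (1.0 - slo)


def budget_consumed(sli: float, slo: float, duration_minutes: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget used by an incident of the given length."""
    window_minutes = window_days * 24 * 60
    return burn_rate(sli, slo) * duration_minutes / window_minutes


SLO = 0.999
scenarios = [
    ("99.0% for 30 min", 0.99, 30),
    ("99.0% for 6 hours", 0.99, 360),
    ("50% for 5 min", 0.50, 5),
]
for label, sli, minutes in scenarios:
    rate = burn_rate(sli, SLO)
    used = budget_consumed(sli, SLO, minutes)
    print(f"{label}: burn rate {rate:.0f}x, budget consumed {used:.1%}")
```

Under these assumptions the three scenarios consume roughly 0.7%, 8.3%, and 5.8% of the monthly budget. A burn rate of 1x means the budget runs out exactly at the end of the window.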

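Multi-window, multi-burn-rate (recommended in Google's SRE Workbook):

severity | burn rate | long window | short window | budget spent over long window | action
---------|-----------|-------------|--------------|-------------------------------|--------
Page     | 14.4x     | 1h          | 5m           | 2%                            | page immediately
Page     | 6x        | 6h          | 30m          | 5%                            | page immediately
Ticket   | 3x        | 1d          | 2h           | 10%                           | open ticket
Ticket   | 1x        | 3d          | 6h           | 10%                           | open ticket

The 14.4x, 6x, and 1x rows follow the Workbook's recommended parameters; the 3x tier with 1d/2h windows is a common addition in SLO tooling.

→ An alert fires only when both the long and the short window exceed the burn rate = fewer false positives (brief spikes are ignored, and the alert clears soon after recovery).
   Multiple burn rates = severity tiers.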
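
The evaluation logic could look roughly like this. It is a sketch, not a production alerting rule: the long/short window pairs for the 14.4x, 6x, and 1x tiers follow the parameters recommended in Google's SRE Workbook, the 1d/2h pair for the 3x tier is an assumption, and the error-rate lookups are hypothetical stand-ins for queries against a metrics backend.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class BurnRateRule:
    severity: str
    burn_rate: float
    long_window: str
    short_window: str


# Ordered from most to least urgent.
RULES = [
    BurnRateRule("page", 14.4, "1h", "5m"),
    BurnRateRule("page", 6.0, "6h", "30m"),
    BurnRateRule("ticket", 3.0, "1d", "2h"),  # assumed window pair for this tier
    BurnRateRule("ticket", 1.0, "3d", "6h"),
]


def first_firing_rule(error_rate: Callable[[str], float], slo: float) -> Optional[BurnRateRule]:
    """Return the most urgent rule whose long AND short windows both exceed its threshold."""
    allowed_error_rate = 1.0 - slo
    for rule in RULES:
        threshold = rule.burn_rate * allowed_error_rate
        if error_rate(rule.long_window) > threshold and error_rate(rule.short_window) > threshold:
            return rule
    return None


# Hypothetical error rates per lookback window.
sustained = {"5m": 0.020, "30m": 0.018, "1h": 0.016, "2h": 0.010, "6h": 0.004, "1d": 0.001, "3d": 0.0004}
spike = {"5m": 0.050, "30m": 0.009, "1h": 0.004, "2h": 0.002, "6h": 0.0007, "1d": 0.0002, "3d": 0.0001}

print(first_firing_rule(sustained.__getitem__, slo=0.999))
print(first_firing_rule(spike.__getitem__, slo=0.999))
```

The sustained case trips the 14.4x page because both the 1h and 5m error rates exceed 1.44%. The spike case returns `None`: the 5m window is hot, but no long window has burned enough budget to confirm it.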

Writing SLOs — Practical Steps

Identify critical user journeys — checkout, login, search. Not every endpoint needs an SLO.
Pick an SLI per journey — availability + latency (usually both).
Set the SLO target — too low = meaningless, too high = always exhausted. 99.9% is often the sweet spot. Use 1-3 months of measurement data.
Agree on an error budget policy — what happens when exhausted, review cadence.
Set multi-window burn-rate alerts — 1h/6h windows are the standard.
Quarterly review — is the SLO too low/high? Does it match user expectations?
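
One way to keep the outcome of these steps reviewable is to record each SLO as data. A minimal sketch, assuming a simple in-code registry; `SLOSpec`, its fields, and the example journeys are illustrative rather than a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOSpec:
    journey: str           # critical user journey, e.g. "checkout"
    sli: str               # human-readable SLI definition
    target: float          # fraction, e.g. 0.999 for 99.9%
    window_days: int = 30  # rolling window the target applies to

    def __post_init__(self) -> None:
        # A 100% target leaves no error budget, so reject it up front.
        if not 0.0 < self.target < 1.0:
            raise ValueError(f"target must be strictly between 0 and 1, got {self.target}")
        if self.window_days <= 0:
            raise ValueError("window_days must be positive")


SLOS = [
    SLOSpec("checkout", "non-5xx requests / total requests", 0.999),
    SLOSpec("checkout", "1-min buckets with p99 < 500ms / total buckets", 0.999, window_days=7),
    SLOSpec("login", "non-5xx requests / total requests", 0.999),
]

for slo in SLOS:
    print(f"{slo.journey}: {slo.target:.2%} over {slo.window_days} days ({slo.sli})")
```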

Common Pitfalls

SLI as an internal metric — CPU / memory etc. don't reflect user experience. Use user-visible measures only.
Promising 100% SLO — impossible. Even 99.9% allows 43 minutes per month. 100% = endless alerts + permanent freeze.
Alerting right when SLO is missed — too many false positives. Use burn-rate + multi-window.
"Budget is unused, ship anything" — a single deploy can cause a big incident. Use gradual rollouts + monitoring.
SLOs on every endpoint — operational burden. Only the critical user journeys.

Wrap-up

SLIs, SLOs, and error budgets aren't just metrics; they're an explicit, data-driven agreement on how to trade stability against velocity. That is the core insight from Google SRE.

Practical: SLOs on 3-5 critical user journeys (not every endpoint); explicit budget policy (freeze conditions); multi-window burn-rate alerts (1h + 6h); quarterly review. Start at 99.9% — adjust based on measurement.