
How SLIs and SLOs Actually Work

What an SLI actually measures, how to write SLOs that mean something, error budgets as a feature-velocity contract, multi-window multi-burn-rate alerts that don't page you at 3 AM for noise, and why availability % is harder than it looks.

"We promise 99.9% availability": what does that actually mean? 43 minutes of downtime per month? Does only downtime count, or slow responses too? SLIs, SLOs, and error budgets, the core framework from Google's SRE book, answer those questions. This guide covers defining SLIs, writing SLOs, using error budgets, and multi-window burn-rate alerts.

SLI · SLO · SLA — Meanings

SLI (Service Level Indicator) — measurement
  "successful request ratio over the last 5 minutes"
  "p99 latency over the last 7 days"

SLO (Service Level Objective) — target
  "99.9% requests successful over 30 days"
  "p99 latency < 200ms over 7 days"
  → the SLI must satisfy this target

SLA (Service Level Agreement) — contract (external + penalties)
  "Below 99.9% monthly availability → 30% credit"
  → looser than the SLO: keep the internal SLO stricter than the SLA (e.g. a 0.1% buffer) so the SLO is missed before penalties apply

What Makes a Good SLI

  • Reflects user experience: "CPU < 80%" doesn't (users don't feel it). Prefer user-visible measures like "request responded with 200".
  • Binary classifiable: every event is good or bad. "5xx → bad". A latency threshold works the same way ("responded in < 500ms → good").
  • Expressed as a ratio: "good events / total events". Naturally pairs with SLO targets like 99.9%.

Common SLI Patterns

Availability:
  SLI = (non-5xx requests) / (total requests)

Latency:
  SLI = (1-min buckets with p99 < 500ms) / (total 1-min buckets)

Freshness (data pipeline):
  SLI = (samples with lag < 5min) / (total samples)

Throughput:
  SLI = (samples with queue depth < 100) / (total samples)
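
The "good events / total events" pattern can be sketched in a few lines of Python. This is a minimal illustration, not a production collector: the `Request` record and both function names are hypothetical, and the latency SLI here counts individual requests under the threshold rather than the per-minute p99 buckets shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to respond, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Good events / total events, where good = non-5xx response."""
    if not requests:
        return 1.0  # no traffic: counted as healthy here (a policy choice)
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Share of requests answered faster than the threshold."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 640.0),  # successful but slow: bad for latency only
    Request(503, 30.0),   # fast but failed: bad for availability only
    Request(404, 80.0),   # client error: still counts as "good" here
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
```

Whether 4xx responses count as good, and how empty windows are scored, are decisions each team has to make explicitly; the choices above are only one option.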

What an SLO Target Means

SLO: "99.9% successful over 30 days"

30 days = 43,200 minutes
0.1% = 43.2 minutes — that's the error budget

→ Up to 43 minutes of downtime per month is OK (or equivalent 5xx)
→ Under 43 minutes used → safe to deploy new features
→ Over → freeze (stabilize first)

That's the heart of error budgets:
  "100% availability is impossible, so spend up to the agreed 0.1%.
   Unused budget = freedom to ship."
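
The budget arithmetic is simple enough to check in code. The sketch below assumes a time-based budget (minutes of full downtime) over a 30-day window; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_consumed(bad_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the budget already used (1.0 means exhausted)."""
    return bad_minutes / error_budget_minutes(slo, window_days)


print(round(error_budget_minutes(0.999), 1))    # 43.2
print(round(budget_consumed(10.0, 0.999), 3))   # 0.231
```

A 10-minute outage against a 99.9% monthly SLO therefore uses about 23% of the budget.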

What Availability % Really Means

SLO     | Monthly downtime | Daily downtime | Meaning
--------|------------------|----------------|--------
99%     | 7.2 hours        | 14.4 min       | "two nines", internal tools
99.9%   | 43 min           | 1.4 min        | "three nines", typical SaaS
99.95%  | 22 min           | 43 sec         | Business-critical
99.99%  | 4.3 min          | 9 sec          | "four nines", finance, payments core
99.999% | 26 sec           | 0.9 sec        | "five nines", telecom, very rare

As a rule of thumb, each extra nine costs roughly 10× more to achieve, and 99.99 → 99.999 is the most expensive jump. Don't demand five nines from every service; it wastes resources.
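
The downtime figures for each availability level can be reproduced with a short loop. This sketch assumes a 30-day month and a 24-hour day; `allowed_downtime_seconds` is a hypothetical helper name.

```python
def allowed_downtime_seconds(slo_percent: float, period_hours: float) -> float:
    """Seconds of downtime permitted in the period at the given availability."""
    return (100.0 - slo_percent) / 100.0 * period_hours * 3600


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    monthly_min = allowed_downtime_seconds(slo, 30 * 24) / 60
    daily_sec = allowed_downtime_seconds(slo, 24)
    print(f"{slo}%: {monthly_min:.1f} min/month, {daily_sec:.1f} s/day")
```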

Error Budget — A Velocity Contract

Old tension:
  Product: "Ship fast for the users!"
  SRE:    "Stability first! Slow down!"

Error budget resolves this:
  - Agree on an SLO (e.g. 99.9% = 43 min budget/month)
  - Budget remaining → ship freely
  - Budget exhausted → freeze (pause releases, stabilize)

→ "Decide via an objective metric." Less political friction.

Operations:
  Weekly budget review
  - 80%+ used → reduce next week's releases (caution)
  - 100% used → freeze
  - Quarterly burn patterns → re-tune SLO itself
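
A budget policy like this is easy to encode so that the weekly review starts from the same answer every time. The sketch below is one possible encoding; the enum, the function name, and the default 80% caution threshold are taken from the policy above or invented for illustration.

```python
from enum import Enum


class ReleasePolicy(Enum):
    SHIP = "ship freely"
    CAUTION = "reduce next week's releases"
    FREEZE = "freeze releases, stabilize"


def release_policy(budget_used: float, caution_at: float = 0.8) -> ReleasePolicy:
    """Map the fraction of error budget used (0.0 to 1.0+) to a release stance."""
    if budget_used >= 1.0:
        return ReleasePolicy.FREEZE
    if budget_used >= caution_at:
        return ReleasePolicy.CAUTION
    return ReleasePolicy.SHIP


print(release_policy(0.35).value)  # ship freely
print(release_policy(0.85).value)  # reduce next week's releases
print(release_policy(1.10).value)  # freeze releases, stabilize
```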

Multi-Window Multi-Burn-Rate Alerts

Problem with naive alerts: SLI dropped below 99.9% → page. But (for a 99.9% SLO over 30 days):

SLI = 99.0% for 30 min → ~0.7% of budget spent. Tiny vs monthly → no page
SLI = 99.0% for 6 hours → ~8% of budget spent. Serious → page
SLI = 50% for 5 min → ~6% of budget spent within minutes. Immediate page

→ "How fast we burn the budget" is the right basis for alerts.
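
Burn rate is the observed error rate divided by the error rate the SLO allows; a burn rate of 1 spends exactly the whole budget over the window. The sketch below assumes a 99.9% SLO and a 30-day window, with illustrative function names.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    return error_rate / (1.0 - slo)


def budget_spent(error_rate: float, slo: float, duration_minutes: float,
                 window_days: int = 30) -> float:
    """Fraction of the window's error budget consumed by one incident."""
    window_minutes = window_days * 24 * 60
    return burn_rate(error_rate, slo) * duration_minutes / window_minutes


SLO = 0.999
# 1% errors (SLI = 99.0%) for 30 minutes
print(round(budget_spent(0.01, SLO, 30) * 100, 1))   # 0.7
# 1% errors for 6 hours
print(round(budget_spent(0.01, SLO, 360) * 100, 1))  # 8.3
# 50% errors for 5 minutes
print(round(budget_spent(0.50, SLO, 5) * 100, 1))    # 5.8
```

The last two incidents cost a similar share of the budget, but the 50% outage burns it at 500× the sustainable rate, which is why it deserves the faster page.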

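Multi-window, multi-burn-rate (based on the Google SRE Workbook recommendation; 99.9% SLO, 30-day window):

severity | burn rate | long window | short window | budget spent | action
---------|-----------|-------------|--------------|--------------|--------
Page     | 14.4x     | 1h          | 5m           | 2%           | page immediately
Page     | 6x        | 6h          | 30m          | 5%           | page immediately
Ticket   | 3x        | 1d          | 2h           | 10%          | open ticket
Ticket   | 1x        | 3d          | 6h           | 10%          | open ticket

→ An alert fires only when the burn rate exceeds the threshold in BOTH its long and short window.
   The 14.4x, 6x, and 1x tiers follow the Workbook; the 3x tier is a common addition.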

→ Both windows must agree = fewer false positives (ignore brief spikes).
   Multiple burn rates = severity tiers.
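
The evaluation logic can be sketched as a small rule table. This is an illustration only: the window pairs (a long window plus a short window one twelfth its length) follow the Google SRE Workbook's pairing and are an assumption about what you deploy, the `BurnRule` type and `evaluate` function are hypothetical, and in practice the per-window error rates would come from your monitoring system rather than a dict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BurnRule:
    severity: str
    threshold: float   # burn-rate multiple that triggers the rule
    long_window: str
    short_window: str


# Ordered from most to least urgent (assumed window pairs).
RULES = [
    BurnRule("page", 14.4, "1h", "5m"),
    BurnRule("page", 6.0, "6h", "30m"),
    BurnRule("ticket", 3.0, "1d", "2h"),
    BurnRule("ticket", 1.0, "3d", "6h"),
]


def evaluate(error_rates: dict[str, float], slo: float = 0.999) -> Optional[str]:
    """Return the severity of the first firing rule, or None if all is quiet."""
    budget = 1.0 - slo
    for rule in RULES:
        long_burn = error_rates[rule.long_window] / budget
        short_burn = error_rates[rule.short_window] / budget
        # Both windows must exceed the threshold: the long window shows the
        # burn is significant, the short window shows it is still happening.
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            return rule.severity
    return None


brief_spike = {"5m": 0.05, "30m": 0.009, "1h": 0.0045, "2h": 0.0023,
               "6h": 0.0008, "1d": 0.0002, "3d": 0.0001}
sustained = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "2h": 0.012,
             "6h": 0.004, "1d": 0.001, "3d": 0.0004}
slow_leak = {w: 0.0015 for w in ("5m", "30m", "1h", "2h", "6h", "1d", "3d")}

print(evaluate(brief_spike))  # None
print(evaluate(sustained))    # page
print(evaluate(slow_leak))    # ticket
```

The brief spike is ignored because no long window confirms it, while the slow leak never pages but still produces a ticket before the budget runs out.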

Writing SLOs — Practical Steps

  1. Identify critical user journeys — checkout, login, search. Not every endpoint needs an SLO.
  2. Pick an SLI per journey — availability + latency (usually both).
  3. Set the SLO target — too low = meaningless, too high = always exhausted. 99.9% is often the sweet spot. Use 1-3 months of measurement data.
  4. Agree on an error budget policy — what happens when exhausted, review cadence.
  5. Set multi-window burn-rate alerts — 1h/6h windows are the standard.
  6. Quarterly review — is the SLO too low/high? Does it match user expectations?

Common Pitfalls

  • SLI as an internal metric — CPU / memory etc. don't reflect user experience. Use user-visible measures only.
  • Promising 100% SLO — impossible. Even 99.9% allows 43 minutes per month. 100% = endless alerts + permanent freeze.
  • Alerting right when SLO is missed — too many false positives. Use burn-rate + multi-window.
  • "Budget is unused, ship anything" — a single deploy can cause a big incident. Use gradual rollouts + monitoring.
  • SLOs on every endpoint — operational burden. Only the critical user journeys.

Wrap-up

SLIs, SLOs, and error budgets aren't just metrics: together they form a metric-driven agreement on how to trade stability against velocity. That is the core insight from Google SRE.

Practical: SLOs on 3-5 critical user journeys (not every endpoint); explicit budget policy (freeze conditions); multi-window burn-rate alerts (1h + 6h); quarterly review. Start at 99.9% — adjust based on measurement.
