How Feature Flags Actually Work

A feature flag is a switch that separates "deploying code" from "exposing a feature". The code ships with the feature turned off, and you can turn it on later without redeploying. It looks simple, but each flag type has a different lifespan and operating model, and there are traps in gradual rollout, in wiring flags into A/B tests, and in where flags are evaluated (server vs client). This guide covers each of these. For the statistics of A/B experiments, see the separate guide (how-ab-testing-actually-works).

The Problem Flags Solve — Separating Deploy from Release

Without flags:
  merging code = immediate exposure to all users
  → fixing a bug means a rollback, which is itself a redeploy (minutes to
    tens of minutes)
  → this is one cause of the "no deploys on Friday" culture

With flags:
  if (flags.newCheckout) { new checkout } else { old checkout }
  - code is deployed but flag=off → nobody sees it
  - when ready, flag=on → instant exposure with no redeploy
  - if something breaks, flag=off → instant rollback with no redeploy
    (seconds)

Key: separate deploy from release.
Pairs with trunk-based development — even unfinished features merge to
main behind a flag.
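
As a rough illustration of the branch above, here is a minimal sketch in Python. The in-memory `FLAGS` dictionary and the `is_enabled` and `checkout` helpers are hypothetical names introduced for this example; a real system would read flag values from a flag service rather than a module-level dict.

```python
# Hypothetical in-memory flag store; stands in for a real flag service.
FLAGS = {"new_checkout": False}


def is_enabled(flag_key: str, default: bool = False) -> bool:
    """Return the flag's current value, or `default` if the flag is unknown."""
    return FLAGS.get(flag_key, default)


def checkout(cart_total: float) -> str:
    # Both code paths are deployed; the flag decides which one users see.
    if is_enabled("new_checkout"):
        return f"new checkout: {cart_total:.2f}"
    return f"old checkout: {cart_total:.2f}"


print(checkout(19.99))        # old checkout: 19.99 (deployed, not released)
FLAGS["new_checkout"] = True  # "release": flip the flag, no redeploy
print(checkout(19.99))        # new checkout: 19.99
```

Flipping the value back to `False` is the rollback path: the old branch is still deployed, so nothing has to be rebuilt.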

Flag Types — Different Lifespans

Treating all flags the same way causes trouble: flags with different lifespans and purposes need different management.

Type	Purpose	Lifespan	Changed by
Release	Gradual exposure of unfinished/new features	Short (remove after launch)	Dev team
Ops / kill-switch	Instantly cut a feature during an incident	Long (kept permanently)	Ops / SRE
Experiment (A/B)	Compare and measure variants	Only the experiment window	PM / data team
Permission	Gate features by plan / role	Permanent (part of the product)	Product / sales

Why the distinction matters:
- Release flags must be removed once shipped (else flag debt)
- Ops/kill-switches are permanent — never remove (incident readiness)
- Experiment flags are removed after the experiment, once the result is
  applied
- Permission flags are product logic itself — entitlements, not really
  "flags"

→ Even though they all look like "if (flag)", their management policies
   are opposites: some must be deleted quickly, others must never be
   deleted. Tagging each flag with a type to track its lifespan is the
   core of operating them.
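
One way to make the type tag enforceable is to attach it to each flag at registration time. The sketch below is an assumption about how that could look, not a reference to any particular flag product: `FlagType`, `FlagMeta`, and the rule that temporary flags must carry an expiry date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class FlagType(Enum):
    RELEASE = "release"
    OPS = "ops"
    EXPERIMENT = "experiment"
    PERMISSION = "permission"


# Types that are expected to be deleted eventually.
TEMPORARY_TYPES = {FlagType.RELEASE, FlagType.EXPERIMENT}


@dataclass(frozen=True)
class FlagMeta:
    key: str
    flag_type: FlagType
    owner: str
    expires: Optional[date] = None  # only meaningful for temporary flags

    def __post_init__(self) -> None:
        # Enforce the lifespan policy when the flag is registered.
        temporary = self.flag_type in TEMPORARY_TYPES
        if temporary and self.expires is None:
            raise ValueError(f"{self.key}: {self.flag_type.value} flags need an expiry")
        if not temporary and self.expires is not None:
            raise ValueError(f"{self.key}: {self.flag_type.value} flags are permanent")


release = FlagMeta("new_checkout", FlagType.RELEASE, "checkout-team", date(2030, 1, 1))
kill_switch = FlagMeta("payments_kill_switch", FlagType.OPS, "sre")
```

Rejecting a release flag without an expiry at creation time is a design choice: it moves the cleanup conversation to the moment the flag is added, when the owner still has the context.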

Gradual Rollout — Percentages and Rings

Percentage-based rollout:
  Day 1:  on for 1% of users  (canary — verify risk with a few)
  Day 2:  5% → if metrics are healthy
  Day 3:  25%
  Day 4:  50%
  Day 5:  100%
  On detecting a problem, drop back to the previous step (or 0%)

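Key: the same user always gets the same result (sticky)
  rollout % = decided by whether a stable hash of the user ID falls below
              a threshold
  bucket = hash(userId + flagKey) % 100     (bucket is in 0..99)
  if (bucket < rolloutPercent) → on
  → including flagKey gives each flag its own independent split of users
  → raising 1%→5% keeps the existing 1% on (no flickering), because a
    bucket below the old threshold is also below the new one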
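
The bucketing rule above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: SHA-256 is one reasonable choice of stable hash (the text does not name one), and the `:` separator between flag key and user ID is added here to avoid ambiguous concatenations. The function names are hypothetical.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map (user, flag) to a stable bucket in the range 0..99."""
    # Python's built-in hash() is salted per process for strings, so it is
    # not stable across restarts; a digest from hashlib is.
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_on(user_id: str, flag_key: str, rollout_percent: int) -> bool:
    return bucket(user_id, flag_key) < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
at_1 = {u for u in users if is_on(u, "new_checkout", 1)}
at_5 = {u for u in users if is_on(u, "new_checkout", 5)}

print(len(at_1), len(at_5))  # roughly 1% and 5% of the 10,000 users
print(at_1 <= at_5)          # True: the 1% cohort is a subset of the 5% cohort
```

The subset check is the sticky property in action: raising the threshold only adds users and never removes anyone who already had the feature.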

Ring-based rollout (expanding concentric circles):
  Ring 0: internal staff (dogfooding)
  Ring 1: beta sign-ups / some regions
  Ring 2: a % of general users
  Ring 3: everyone
  → advance to the next after each ring passes (the pattern Microsoft uses)
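
A ring check can be expressed as "which ring does this user belong to, and has the rollout reached it yet". The sketch below assumes a hypothetical `User` record with `is_staff` and `is_beta` attributes and treats Ring 2 as a hashed percentage of everyone else; real ring membership would come from your own user data.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    is_staff: bool = False
    is_beta: bool = False


def user_ring(user: User, general_percent: int) -> int:
    """Return the innermost ring the user belongs to (0 = internal staff)."""
    if user.is_staff:
        return 0
    if user.is_beta:
        return 1
    # Ring 2 is a stable percentage of the remaining users.
    digest = hashlib.sha256(user.user_id.encode("utf-8")).digest()
    if int.from_bytes(digest[:8], "big") % 100 < general_percent:
        return 2
    return 3


def is_on(user: User, current_ring: int, general_percent: int = 10) -> bool:
    # The feature is on for every ring up to and including the current one.
    return user_ring(user, general_percent) <= current_ring


staff = User("s-1", is_staff=True)
beta = User("b-1", is_beta=True)
print(is_on(staff, current_ring=0), is_on(beta, current_ring=0))  # True False
```

Advancing the rollout is then a single change to `current_ring`, and dropping back after a problem is the same change in the other direction.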

Sticky bucketing (same user = same result) is the key. If assignment is random per request, the same user flickers between the new and old UI, which is confusing and also pollutes A/B measurement. Assign stably via a hash of the user ID.

Wiring into A/B Tests

Feature flags and A/B experiments share the same infrastructure:
  - both "split users into groups" (bucketing)
  - both are sticky (same user = same variant)

Difference:
  rollout flag:  goal = safely reach 100% (gradual exposure)
  experiment:    goal = compare metrics across variants (a statistical
                 conclusion)

Wiring flow:
  1. Put the new feature behind a flag
  2. Split 50/50 for an A/B experiment (control vs new)
  3. If new wins on the metric and the result is statistically
     significant → decide to ship it
  4. Roll the winning variant out to 100%
  5. Remove the flag (delete the branch from the code)

Caveat: if one user is in several experiments at once, interactions
  pollute the results.
  → separate with orthogonal assignment or mutually-exclusive layers.
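
One possible shape for mutually-exclusive layers is sketched below. It assumes a layer is a set of experiments that each own a disjoint range of buckets, so a user can land in at most one experiment per layer. The layer definition, experiment names, and helper functions are hypothetical.

```python
import hashlib
from typing import Optional, Tuple


def _bucket(salt: str, user_id: str, buckets: int = 100) -> int:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


# Hypothetical layer: each experiment owns a disjoint bucket range [start, end).
CHECKOUT_LAYER = {
    "name": "checkout-layer",
    "experiments": [
        ("new_checkout_button", 0, 50),
        ("one_page_checkout", 50, 100),
    ],
}


def assign(layer: dict, user_id: str) -> Optional[Tuple[str, str]]:
    """Return (experiment, variant) for this user, or None if unassigned."""
    b = _bucket(layer["name"], user_id)
    for experiment, start, end in layer["experiments"]:
        if start <= b < end:
            # Re-hash with the experiment name so the control/new split is
            # independent of the bucket that selected the experiment.
            variant = "new" if _bucket(experiment, user_id, 2) == 1 else "control"
            return experiment, variant
    return None


print(assign(CHECKOUT_LAYER, "user-42"))
```

Experiments that cannot interact can live in different layers, each hashed with its own layer name, which keeps their assignments approximately independent of each other.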

For experiment stats (sample size, significance, peeking) see
how-ab-testing-actually-works.

Flag Evaluation — Server vs Client

Server-side evaluation:
  - the server decides the flag value and returns the result (or the
    evaluated UI)
  - pro: flag rules and unreleased features aren't exposed to the client
    (security)
  - pro: instantly consistent (the server is the single source of truth)
  - con: evaluation is on the request path, so it can affect latency

Client-side evaluation:
  - the client receives flag definitions (rules) and evaluates them itself
  - pro: branches instantly with no round trip (fast UI switching)
  - con: unreleased feature code and flag rules are exposed in the bundle
  - con: each user refreshes flags at a different time → consistency issues

Edge / SDK caching:
  - the SDK receives flag definitions via polling/streaming and caches
    locally
  - evaluation is local (fast) + definitions are centrally managed → a
    compromise between the two
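
The caching compromise can be sketched as a small wrapper that evaluates locally and refreshes definitions at most once per TTL. This is an illustrative sketch, not the API of any real SDK: `CachedFlags`, its `fetch` callback, and the injectable `clock` are assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional


class CachedFlags:
    """Evaluate locally against definitions refreshed at most every `ttl` seconds."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, bool]],
        ttl: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._flags: Dict[str, bool] = {}
        self._fetched_at: Optional[float] = None

    def is_enabled(self, key: str, default: bool = False) -> bool:
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            try:
                self._flags = self._fetch()
                self._fetched_at = now
            except Exception:
                # Keep serving the last known definitions; the next call
                # will try to refresh again.
                pass
        return self._flags.get(key, default)


remote = {"new_checkout": True}            # stands in for the central service
flags = CachedFlags(lambda: dict(remote), ttl=30.0)
print(flags.is_enabled("new_checkout"))    # True, served from the local cache
```

The trade-off is visible in the `ttl` parameter: a change made centrally is not seen by this process until the cached copy is older than the TTL.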

Consistency Traps

Trap 1 — flag propagation delay:
  even after you flip a flag off, clients keep serving the old value for
  up to the SDK cache TTL / polling interval.
  A kill-switch with 30 s polling → the incident can persist for up to
  30 s after the flip.
  → kill-switches need streaming (instant push) or a short TTL.

Trap 2 — consistency within one request:
  if a request evaluates a flag several times and the value changes
  in between, the earlier steps run the old logic and the later steps run
  the new logic → a broken state.
  → snapshot the flag once at request start and freeze it for the request.
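
A per-request snapshot can be as simple as copying the flag values once and passing the read-only copy through the request. In the sketch below, `LIVE_FLAGS`, `snapshot_flags`, and `handle_request` are hypothetical names, and the mid-request flip is simulated inline to show the effect.

```python
from types import MappingProxyType
from typing import Mapping, Tuple

# Hypothetical live flag store that an SDK update may mutate at any time.
LIVE_FLAGS = {"new_pricing": False}


def snapshot_flags() -> Mapping[str, bool]:
    """Copy the live flags into a read-only view for one request."""
    return MappingProxyType(dict(LIVE_FLAGS))


def handle_request() -> Tuple[str, str]:
    flags = snapshot_flags()          # evaluate once, at request start
    quote = "new" if flags["new_pricing"] else "old"
    LIVE_FLAGS["new_pricing"] = True  # simulate a flip arriving mid-request
    charge = "new" if flags["new_pricing"] else "old"
    return quote, charge


print(handle_request())  # ('old', 'old'): both steps saw the same value
```

Reading `LIVE_FLAGS` directly in the second step would have produced `('old', 'new')`, which is exactly the broken state this trap describes.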

Trap 3 — server/client mismatch:
  the server evaluates new while the client (due to cache lag) evaluates
  old → the UI and API mismatch.
  → unify the evaluation location to one side, or make the client follow
    the server's decision.

Trap 4 — fallback when the flag service fails:
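  what happens if the flag service is unreachable? Every flag needs an
  explicit default.
  → new feature: default=off. Kill-switch: default=its "normal" state
    (feature on), so a flag-service outage does not itself switch the
    product off (fail-safe).
  Design defaults so that, when the service is down, you fall to the safe
  side.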
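
A minimal sketch of the fallback, assuming a hypothetical `fetch` callable that talks to the flag service and a table of per-flag safe defaults (the flag names are invented for illustration):

```python
from typing import Callable, Dict

# Hypothetical per-flag defaults, chosen so an outage falls to the safe side.
SAFE_DEFAULTS: Dict[str, bool] = {
    "new_checkout": False,     # release flag: unreleased feature stays off
    "payments_enabled": True,  # kill-switch: feature stays in its normal state
}


def evaluate(key: str, fetch: Callable[[str], bool]) -> bool:
    """Ask the flag service; on any failure, return the flag's safe default."""
    try:
        return fetch(key)
    except Exception:
        # A flag with no declared default is a configuration error, so this
        # lookup raises KeyError instead of guessing a value.
        return SAFE_DEFAULTS[key]


def broken_service(key: str) -> bool:
    raise ConnectionError("flag service unreachable")


print(evaluate("new_checkout", broken_service))      # False
print(evaluate("payments_enabled", broken_service))  # True
```

Keeping the defaults in code (or in config shipped with the deploy) matters here: a default stored only in the flag service is unavailable at exactly the moment it is needed.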

Flag Debt — Pile It Up and the Code Rots

The problem:
  if you don't delete shipped release flags:
  if (flags.newCheckoutV1) {...} else {...}   ← V1 is already 100%
  if (flags.newCheckoutV2) {...} else {...}   ← then another on top
  → dead branches pile up, hurting readability and multiplying the cases
    that tests have to cover
  → "when does this else ever run?" nobody knows → fear of changing it

Combinatorial explosion:
  10 boolean flags = in theory 2^10 = 1024 possible state combinations.
  Most are never tested → unforeseen interaction bugs.

Cleanup rules:
- attach an expiry / ticket to release flags → remove right after 100%
- include "remove the flag" in the launch checklist (shipping ≠ done;
  done means cleaned up too)
- periodic flag audits — find old flags, flags that are always the same
  value, and remove them
- kill-switches and permissions are permanent → not debt (track separately)
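
A periodic audit can be automated once flags carry type and expiry metadata. The sketch below is one possible heuristic, not a standard: it reports temporary flags that are past their expiry or already at 100% rollout, and skips permanent types. `FlagRecord` and its fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class FlagRecord:
    key: str
    flag_type: str            # "release", "ops", "experiment", "permission"
    expires: Optional[date]   # None for permanent flags
    rollout_percent: int      # current rollout, 0 to 100


PERMANENT_TYPES = {"ops", "permission"}


def audit(flags: List[FlagRecord], today: date) -> List[str]:
    """Return keys of temporary flags that look ready for removal."""
    stale = []
    for f in flags:
        if f.flag_type in PERMANENT_TYPES:
            continue  # kill-switches and entitlements are not debt
        expired = f.expires is not None and f.expires < today
        fully_rolled_out = f.rollout_percent == 100
        if expired or fully_rolled_out:
            stale.append(f.key)
    return stale


inventory = [
    FlagRecord("new_checkout_v1", "release", date(2024, 1, 1), 100),
    FlagRecord("new_search", "release", date(2999, 1, 1), 25),
    FlagRecord("payments_kill_switch", "ops", None, 100),
    FlagRecord("old_experiment", "experiment", date(2020, 1, 1), 50),
]
print(audit(inventory, today=date(2025, 1, 1)))
# ['new_checkout_v1', 'old_experiment']
```

Running a check like this in CI or on a schedule turns "remove the flag" from a good intention into a visible, recurring list.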

Implementation — What a Good Flag System Needs

Central management + instant change — toggle flags via a dashboard/API with no redeploy. Changes apply in seconds.
Targeting rules — target not just by % but by user attributes (plan, region, version).
Sticky bucketing — same user = same result (hash based).
Audit log — who changed which flag when (for incident root-cause).
Safe defaults — a defined fallback for when the flag service is down.
Type/expiry metadata — for tracking flag debt.
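
To make the targeting-rule requirement concrete, here is a minimal sketch of attribute-based targeting. It assumes a deliberately simple rule model (each rule names an attribute and a set of allowed values, and all rules must match); `Rule` and `FlagDefinition` are hypothetical, and real systems typically support richer operators.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, FrozenSet, List


@dataclass(frozen=True)
class Rule:
    attribute: str            # e.g. "plan", "region", "version"
    allowed: FrozenSet[Any]   # values that satisfy this rule


@dataclass
class FlagDefinition:
    key: str
    rules: List[Rule] = field(default_factory=list)
    default: bool = False     # value when there are no rules or no match

    def evaluate(self, user: Dict[str, Any]) -> bool:
        # The flag turns on only if the user matches every rule (logical AND).
        if self.rules and all(user.get(r.attribute) in r.allowed for r in self.rules):
            return True
        return self.default


flag = FlagDefinition(
    key="new_dashboard",
    rules=[
        Rule("plan", frozenset({"pro", "enterprise"})),
        Rule("region", frozenset({"eu"})),
    ],
)
print(flag.evaluate({"plan": "pro", "region": "eu"}))   # True
print(flag.evaluate({"plan": "free", "region": "eu"}))  # False
```

A user record that is missing an attribute fails the corresponding rule and falls back to the default, which keeps incomplete data on the safe side.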

Common Pitfalls

Not deleting flags — release-flag debt piles up as dead branches and rots the code. Bake expiry and cleanup into the launch process.
Non-sticky bucketing — random per request makes the same user flicker between UIs and pollutes A/B measurement. Pin via a user-ID hash.
Slow kill-switch propagation — the incident persists for the polling interval. Kill-switches need streaming/instant push or a short TTL.
Flag value changing within one request — snapshot at request start and freeze it for the request to stay consistent.
Exposing flag rules to the client — unreleased features and internal rules leak into the bundle. Evaluate sensitive ones server-side.
No default defined for flag-service failure — if the service dies, the app breaks. Fail-safe: new features off, kill-switches to their normal value.
Ignoring overlapping-experiment interaction — one user in several experiments at once pollutes results. Separate with mutually-exclusive layers.

Wrap-up

The essence of feature flags is separating deploy from release. Deploy code safely first, turn exposure on gradually (by % or ring), and roll back in seconds with no redeploy when something breaks. A/B experiments ride on the same bucketing infrastructure.

The traps are mostly in two places — consistency (sticky bucketing, an in-request snapshot, fast kill-switch propagation, safe defaults) and flag debt (remove right after launch). Distinguishing flag types — keeping release flags short-lived while tracking kill-switches and permissions as permanent — is the core of healthy flag operations.