A feature flag is a switch that separates "deploying code" from "exposing a feature". The deploy is done but the feature stays off, and you can turn it on later without redeploying. It looks simple, but each flag type has a different lifespan and operating model, and traps hide in gradual rollout, A/B wiring, and where you evaluate (server vs client). This guide covers all of it. For the statistics of A/B experiments, see the separate guide (how-ab-testing-actually-works).
The Problem Flags Solve — Separating Deploy from Release
Without flags:
merging code = immediate exposure to all users
→ finding a bug means rollback = redeploy (minutes to tens of minutes)
→ the cause of "no deploys on Friday" culture
With flags:
if (flags.newCheckout) { new checkout } else { old checkout }
- code is deployed but flag=off → nobody sees it
- when ready, flag=on → instant exposure with no redeploy
- if something breaks, flag=off → instant rollback with no redeploy
(seconds)
Key: separate deploy from release.
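A minimal sketch of the idea, assuming a hypothetical in-memory `flags` dictionary standing in for a real flag service. The names `old_checkout`, `new_checkout`, and `checkout` are illustrative only:

```python
# Hypothetical in-memory flag store; a real system would read this
# from a flag service instead of hard-coding it.
flags = {"new_checkout": False}


def old_checkout(cart_total: float) -> str:
    return f"old checkout: {cart_total:.2f}"


def new_checkout(cart_total: float) -> str:
    return f"new checkout: {cart_total:.2f}"


def checkout(cart_total: float) -> str:
    # Both code paths are deployed; the flag decides which one is exposed.
    if flags.get("new_checkout", False):
        return new_checkout(cart_total)
    return old_checkout(cart_total)


print(checkout(10))            # flag off: old path
flags["new_checkout"] = True   # "release" without a redeploy
print(checkout(10))            # flag on: new path
flags["new_checkout"] = False  # instant rollback
```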
Pairs with trunk-based development — even unfinished features merge to
main behind a flag.
Flag Types: Different Lifespans
Treating all flags the same causes trouble. Each type has a different lifespan and purpose, so each needs a different management policy.
| Type | Purpose | Lifespan | Changed by |
|---|---|---|---|
| Release | Gradual exposure of unfinished/new features | Short (remove after launch) | Dev team |
| Ops / kill-switch | Instantly cut a feature during an incident | Long (kept permanently) | Ops / SRE |
| Experiment (A/B) | Compare and measure variants | Only the experiment window | PM / data team |
| Permission | Gate features by plan / role | Permanent (part of the product) | Product / sales |
Why the distinction matters:
- Release flags must be removed once shipped (else flag debt)
- Ops/kill-switches are permanent — never remove (incident readiness)
- Experiment flags are removed after the experiment, once the result is
applied
- Permission flags are product logic itself — entitlements, not really
"flags"
→ Even though they all look like "if (flag)", their management policies
pull in opposite directions. Tagging each flag with a type to track its
lifespan is the core of operating them.
Gradual Rollout: Percentages and Rings
Percentage-based rollout:
Day 1: on for 1% of users (canary — verify risk with a few)
Day 2: 5% → if metrics are healthy
Day 3: 25%
Day 4: 50%
Day 5: 100%
On detecting a problem, drop back to the previous step (or 0%)
Key: the same user always gets the same result (sticky)
rollout % = decided by whether the user-ID hash is below a threshold
bucket = hash(userId + flagKey) % 100
if (bucket < rolloutPercent) → on
→ raising 1%→5% keeps the existing 1% as-is (no flickering)
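The bucketing rule above can be sketched as follows. This is one possible implementation, not a reference one: the choice of SHA-256, the `:` separator between flag key and user ID, and the function names are assumptions. A stable digest is used because Python's built-in `hash()` is randomized per process for strings, which would break stickiness across servers.

```python
import hashlib


def bucket(user_id: str, flag_key: str) -> int:
    """Map (user, flag) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, flag_key: str, rollout_percent: int) -> bool:
    return bucket(user_id, flag_key) < rollout_percent


users = [f"user-{i}" for i in range(10_000)]
at_1 = {u for u in users if is_enabled(u, "new_checkout", 1)}
at_5 = {u for u in users if is_enabled(u, "new_checkout", 5)}

# Everyone enabled at 1% is still enabled at 5%: raising the
# percentage only adds users, it never removes them.
print(at_1 <= at_5)
print(len(at_1), len(at_5))
```

Including the flag key in the hash input means a user who lands in the first 1% for one flag is not automatically in the first 1% for every other flag.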
Ring-based rollout (expanding concentric circles):
Ring 0: internal staff (dogfooding)
Ring 1: beta sign-ups / some regions
Ring 2: a % of general users
Ring 3: everyone
→ advance to the next after each ring passes (the pattern Microsoft uses)
Sticky bucketing (same user = same result) is the key. If assignment is random per request, the same user flickers between the new and old UI, which is confusing and pollutes A/B measurement. Assign stably via a user-ID hash.
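The ring progression described above can be combined with sticky bucketing for the percentage ring. The sketch below is an assumption-laden illustration: the `User` fields, the `ring` parameter, and the 10% default for Ring 2 are hypothetical, not taken from any particular product.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    user_id: str
    is_staff: bool = False
    is_beta: bool = False


def bucket(user_id: str, flag_key: str) -> int:
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user: User, flag_key: str, ring: int, ring2_percent: int = 10) -> bool:
    """Return True if the user is inside the currently active ring."""
    if ring >= 3:
        return True                       # Ring 3: everyone
    if ring >= 0 and user.is_staff:
        return True                       # Ring 0: internal staff
    if ring >= 1 and user.is_beta:
        return True                       # Ring 1: beta sign-ups
    if ring >= 2:
        # Ring 2: a sticky percentage of general users
        return bucket(user.user_id, flag_key) < ring2_percent
    return False


staff = User("alice", is_staff=True)
beta = User("bob", is_beta=True)
print(is_enabled(staff, "new_checkout", ring=0))  # True
print(is_enabled(beta, "new_checkout", ring=0))   # False
print(is_enabled(beta, "new_checkout", ring=1))   # True
```

Each ring includes all earlier rings, so advancing never turns the feature off for someone who already had it.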
Wiring into A/B Tests
Feature flags and A/B experiments share the same infrastructure:
- both "split users into groups" (bucketing)
- both are sticky (same user = same variant)
Difference:
rollout flag: goal = safely reach 100% (gradual exposure)
experiment: goal = compare metrics across variants (a statistical
conclusion)
Wiring flow:
1. Put the new feature behind a flag
2. Split 50/50 for an A/B experiment (control vs new)
3. new wins on the metric + statistically significant → decision
4. Roll the winning variant out to 100%
5. Remove the flag (delete the branch from the code)
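Step 2 of the flow above (the 50/50 split) can reuse the same hashing idea. This is a sketch under assumptions: the variant names `control` and `new` come from the text, while `assign_variant`, the experiment key, and the use of SHA-256 are illustrative choices.

```python
import hashlib


def assign_variant(user_id: str, experiment_key: str) -> str:
    """Sticky 50/50 split: the same user always gets the same variant."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode("utf-8")).digest()
    return "control" if int.from_bytes(digest[:8], "big") % 2 == 0 else "new"


counts = {"control": 0, "new": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "checkout_experiment")] += 1

# The split is roughly even and identical on every run.
print(counts)
```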
Caveat: if one user is in several experiments at once, interactions
pollute the results.
→ separate with orthogonal assignment or mutually-exclusive layers.
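One way to picture mutually-exclusive layers: hash the user once per layer and give each experiment in that layer its own non-overlapping bucket range. The layer contents and names below are hypothetical, and real experimentation platforms differ in the details.

```python
import hashlib
from typing import Dict, Optional


def bucket(user_id: str, salt: str) -> int:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


# Hypothetical layer: each experiment owns a non-overlapping bucket range,
# so one user can land in at most one experiment of this layer.
CHECKOUT_LAYER: Dict[str, range] = {
    "checkout_button_color": range(0, 50),
    "checkout_one_page": range(50, 100),
}


def experiment_in_layer(user_id: str, layer_name: str,
                        layer: Dict[str, range]) -> Optional[str]:
    b = bucket(user_id, layer_name)
    for experiment, buckets in layer.items():
        if b in buckets:
            return experiment
    return None  # user falls outside every range: in no experiment


print(experiment_in_layer("user-42", "checkout_layer", CHECKOUT_LAYER))
```

Experiments that cannot interact can live in separate layers, each salted with its own layer name. With a well-mixed hash, assignment in one layer is then approximately independent of assignment in another, which is the orthogonal case.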
For experiment stats (sample size, significance, peeking) see
how-ab-testing-actually-works.
Flag Evaluation: Server vs Client
Server-side evaluation:
- the server decides the flag value and returns the result (or the
evaluated UI)
- pro: flag rules and unreleased features aren't exposed to the client
(security)
- pro: instantly consistent (the server is the single source of truth)
- con: evaluation is on the request path, so it can affect latency
Client-side evaluation:
- the client receives flag definitions (rules) and evaluates them itself
- pro: branches instantly with no round trip (fast UI switching)
- con: unreleased feature code and flag rules are exposed in the bundle
- con: each user refreshes flags at a different time → consistency issues
Edge / SDK caching:
- the SDK receives flag definitions via polling/streaming and caches
locally
- evaluation is local (fast) + definitions are centrally managed → a
compromise between the two
Consistency Traps
Trap 1 — flag propagation delay:
even after flipping a flag off, the change takes up to one SDK cache
TTL / polling interval to reach every instance.
A kill-switch with 30 s polling → the incident can persist for up to 30 s.
→ kill-switches need streaming (instant push) or a short TTL.
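The propagation delay can be reproduced with a small TTL cache. This is a simplified sketch, not how any specific SDK works: `CachedFlags`, the injected `clock`, and the simulated `remote` dictionary are all assumptions made so the example runs instantly.

```python
import time
from typing import Callable, Dict


class CachedFlags:
    """Caches flag values and refetches them once the TTL has elapsed."""

    def __init__(self, fetch: Callable[[], Dict[str, bool]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._values = fetch()
        self._fetched_at = clock()

    def get(self, key: str, default: bool) -> bool:
        if self._clock() - self._fetched_at >= self._ttl:
            self._values = self._fetch()
            self._fetched_at = self._clock()
        return self._values.get(key, default)


# Simulated flag service and a fake clock (both hypothetical).
remote = {"recommendations": True}
now = [0.0]
flags = CachedFlags(lambda: dict(remote), ttl_seconds=30, clock=lambda: now[0])

remote["recommendations"] = False            # operator flips the kill-switch
now[0] = 10.0
print(flags.get("recommendations", True))    # True: cache is still stale
now[0] = 30.0
print(flags.get("recommendations", True))    # False: TTL expired, refetched
```

Shortening `ttl_seconds` narrows the stale window at the cost of more fetches; a streaming connection removes the window without polling.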
Trap 2 — consistency within one request:
if a request evaluates a flag several times and the value changes
in between, the front uses new logic and the back uses old → a broken
state.
→ snapshot the flag once at request start and freeze it for the request.
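A sketch of the snapshot approach, assuming a hypothetical mutable `live_flags` dictionary that an operator can change at any moment. The mid-request flip is simulated inside the handler purely to make the effect visible.

```python
from types import MappingProxyType
from typing import Dict, Mapping, Tuple

# Hypothetical live flag state that may change while a request is running.
live_flags: Dict[str, bool] = {"new_pricing": False}


def snapshot_flags() -> Mapping[str, bool]:
    """Copy the live flags once and return a read-only view of the copy."""
    return MappingProxyType(dict(live_flags))


def handle_request() -> Tuple[str, str]:
    flags = snapshot_flags()               # evaluated once, at request start
    price_logic = "new" if flags["new_pricing"] else "old"
    live_flags["new_pricing"] = True       # simulate a flip mid-request
    tax_logic = "new" if flags["new_pricing"] else "old"
    return price_logic, tax_logic


print(handle_request())  # ('old', 'old'): both steps saw the same value
```

The next request takes a fresh snapshot and sees the new value for all of its steps.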
Trap 3 — server/client mismatch:
the server evaluates new while the client (due to cache lag) evaluates
old → the UI and API mismatch.
→ unify the evaluation location to one side, or make the client follow
the server's decision.
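One way to make the client follow the server is to ship the evaluated decision with the response. The payload shape and function names below are assumptions for illustration, not an established protocol.

```python
import json
from typing import Dict


def server_handle(server_flags: Dict[str, bool]) -> str:
    """Server evaluates the flag and ships its decision with the payload."""
    use_new = server_flags.get("new_checkout", False)
    body = {
        "checkout": {"version": "v2" if use_new else "v1"},
        "flags": {"new_checkout": use_new},  # the decision the client follows
    }
    return json.dumps(body)


def client_render(response_text: str, stale_client_cache: Dict[str, bool]) -> str:
    """Client prefers the server's decision over its own cached value."""
    body = json.loads(response_text)
    fallback = stale_client_cache.get("new_checkout", False)
    decided = body["flags"].get("new_checkout", fallback)
    return "new checkout UI" if decided else "old checkout UI"


response = server_handle({"new_checkout": True})
print(client_render(response, {"new_checkout": False}))  # new checkout UI
```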
Trap 4 — fallback when the flag service fails:
what if the flag service dies? the default must be clear.
→ new feature default=off, a kill-switch's "normal" default=on
(fail-safe).
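A sketch of fail-safe defaults, assuming a hypothetical `fetch` callable that talks to the flag service and raises when it is unreachable. The flag names and the `SAFE_DEFAULTS` table are illustrative.

```python
from typing import Callable, Dict

# Safe defaults used when the flag service cannot be reached (assumed names).
SAFE_DEFAULTS: Dict[str, bool] = {
    "new_checkout": False,      # release flag: keep the old path
    "recommendations": True,    # kill-switch: the "normal" state is on
}


def get_flag(key: str, fetch: Callable[[str], bool]) -> bool:
    """Ask the flag service; fall back to the safe default on any failure."""
    try:
        return fetch(key)
    except Exception:
        # Unknown flags default to off, the conservative choice in this sketch.
        return SAFE_DEFAULTS.get(key, False)


def broken_service(key: str) -> bool:
    raise ConnectionError("flag service unreachable")


print(get_flag("new_checkout", broken_service))     # False
print(get_flag("recommendations", broken_service))  # True
```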
Design defaults so that, when the service is down, you fall to the safe
side.
Flag Debt: Pile It Up and the Code Rots
The problem:
if you don't delete shipped release flags:
if (flags.newCheckoutV1) {...} else {...} ← V1 is already 100%
if (flags.newCheckoutV2) {...} else {...} ← then another on top
→ dead branches pile up, hurting readability and multiplying the test cases
→ "when does this else ever run?" nobody knows → fear of changing it
Combinatorial explosion:
10 flags = in theory 2^10 = 1024 possible state combinations.
Most are untested → unforeseen interaction bugs.
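The count is easy to verify by enumeration. The flag names here are placeholders, and the calculation assumes every flag is a plain on/off boolean:

```python
from itertools import product

flag_names = [f"flag_{i}" for i in range(10)]

# Every on/off assignment across all ten boolean flags.
combinations = list(product([False, True], repeat=len(flag_names)))
print(len(combinations))  # 1024
```

Each additional boolean flag doubles the total, which is why removing even a few stale flags shrinks the untested space quickly.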
Cleanup rules:
- attach an expiry / ticket to release flags → remove right after 100%
- include "remove the flag" in the launch checklist (shipping ≠ done;
done means cleaned up too)
- periodic flag audits: find and remove old flags and flags that always
evaluate to the same value
- kill-switches and permissions are permanent → not debt (track separately)
Implementation: What a Good Flag System Needs
- Central management + instant change — toggle flags via a dashboard/API with no redeploy. Changes apply in seconds.
- Targeting rules — target not just by % but by user attributes (plan, region, version).
- Sticky bucketing — same user = same result (hash based).
- Audit log — who changed which flag when (for incident root-cause).
- Safe defaults — a defined fallback for when the flag service is down.
- Type/expiry metadata — for tracking flag debt.
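As a sketch of how type/expiry metadata could drive a periodic audit, the registry below tags each flag with one of the four types from the table and an optional expiry date. The field names, flag keys, owners, and dates are all hypothetical sample data.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class FlagType(Enum):
    RELEASE = "release"
    OPS = "ops"
    EXPERIMENT = "experiment"
    PERMISSION = "permission"


@dataclass(frozen=True)
class FlagMeta:
    key: str
    type: FlagType
    owner: str
    expires: Optional[date] = None   # None for permanent flags


def find_flag_debt(registry: List[FlagMeta], today: date) -> List[str]:
    """Return keys of temporary flags that are past their expiry date."""
    temporary = {FlagType.RELEASE, FlagType.EXPERIMENT}
    return [
        f.key for f in registry
        if f.type in temporary and f.expires is not None and f.expires < today
    ]


registry = [
    FlagMeta("new_checkout_v1", FlagType.RELEASE, "checkout-team", date(2024, 1, 31)),
    FlagMeta("recommendations_kill", FlagType.OPS, "sre"),
    FlagMeta("premium_export", FlagType.PERMISSION, "product"),
]
print(find_flag_debt(registry, today=date(2024, 3, 1)))  # ['new_checkout_v1']
```

Kill-switches and permission flags never show up in the report because they are permanent by design.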
Common Pitfalls
- Not deleting flags — release-flag debt piles up as dead branches and rots the code. Bake expiry and cleanup into the launch process.
- Non-sticky bucketing — random per request makes the same user flicker between UIs and pollutes A/B measurement. Pin via a user-ID hash.
- Slow kill-switch propagation — the incident persists for the polling interval. Kill-switches need streaming/instant push or a short TTL.
- Flag value changing within one request — snapshot at request start and freeze it for the request to stay consistent.
- Exposing flag rules to the client — unreleased features and internal rules leak into the bundle. Evaluate sensitive ones server-side.
- No default defined for flag-service failure — if the service dies, the app breaks. Fail-safe: new features off, kill-switches to their normal value.
- Ignoring overlapping-experiment interaction — one user in several experiments at once pollutes results. Separate with mutually-exclusive layers.
Wrap-up
The essence of feature flags is separating deploy from release. Deploy code safely first, turn exposure on gradually (by % or ring), and roll back in seconds with no redeploy when something breaks. A/B experiments ride on the same bucketing infrastructure.
The traps are mostly in two places — consistency (sticky bucketing, an in-request snapshot, fast kill-switch propagation, safe defaults) and flag debt (remove right after launch). Distinguishing flag types — keeping release flags short-lived while tracking kill-switches and permissions as permanent — is the core of healthy flag operations.