
How A/B Testing Actually Works

Randomization (the magic), why 'peeking' invalidates p-values, sample size planning, sequential testing for early stopping, multi-armed bandits as the explore/exploit alternative, and the network effect / cross-experiment contamination traps.


"Changed this button color and conversion went up 5%" — really? Could it be chance? Statistics gives the answer via A/B testing. But traps abound — "peeking", flawed randomization, network effects. This guide covers the precise mechanics and pitfalls.

Randomization — The Core Magic

Naive comparison:
  v1 (last week's data): 3.5% conversion
  v2 (this week's data): 4.0% conversion
  → 0.5 percentage point improvement?

Problems:
- Time-of-day differences (weekend vs weekday)
- External events (marketing campaign, holidays)
- User mix changes (more new users)
- Seasonality

Randomization fixes this:
  Same window, randomly assign users to two groups:
  - control (A): see v1, 50%
  - treatment (B): see v2, 50%

→ Theoretically both groups sample from the same population
→ The diff = v1/v2 effect + random noise
→ Statistical tests separate noise from real effect
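One common way to implement the random split is to hash a stable user ID, so the same user always lands in the same group without storing any state. The following is a minimal sketch under that assumption; `assign_variant`, `experiment`, and `treatment_share` are hypothetical names chosen for illustration, not part of any specific platform.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salting with the experiment name keeps assignments in different
    # experiments independent of each other.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex digits as a roughly uniform number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical usage: split 10,000 made-up user IDs and check the balance.
groups = [assign_variant(f"user-{i}", "button-color") for i in range(10_000)]
treatment_fraction = groups.count("treatment") / len(groups)
```

Hashing rather than calling a random number generator per request means a returning user sees a consistent variant, which matters for any metric measured across multiple sessions.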

Statistical Significance — p-value

H0 (null): "v1 = v2, no difference"
H1 (alternative): "v1 ≠ v2"

p-value = "Probability of observing this difference or larger by
           chance, assuming H0 is true"

Example:
  observed: v2 has 0.5 percentage points higher conversion
  p-value = 0.03

  Interpretation: "If v1 = v2, a difference this large or larger would
  occur about 3% of the time by chance"
  → Reject H0 and conclude "v1 ≠ v2" (at the 5% significance level)

Common confusion:
- p-value < 0.05 = "95% sure v2 is better"? — not exactly
- True meaning: "probability of this difference (or larger) under H0"
  ≠ "probability v2 is better"
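For conversion rates, the p-value is often computed with a two-proportion z-test. The sketch below uses only the standard library and a pooled standard error; the function name and the counts in the usage line are made up for illustration.

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for H0: both groups share one conversion rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: 350/10,000 conversions in control, 400/10,000 in treatment.
z, p = two_proportion_z_test(350, 10_000, 400, 10_000)
```

The normal approximation is reasonable when each group has a large number of conversions and non-conversions; for very small counts an exact test is usually preferred.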

Peeking — The Biggest Trap

Watching p-value daily during the test:

Day 1: p = 0.4
Day 3: p = 0.2
Day 5: p = 0.07
Day 7: p = 0.04   ← "Significant! Stop!"

Problem: "wait until significant" inflates false positives.

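  Single test: 5% false positive rate (significance level)
  Daily check for 1 week (7 looks): ~17% false positive
  Daily check for 1 month (30 looks): ~28% false positive
  (approximate; assumes equal traffic each day and stopping at the first p < 0.05)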

Fixes:
1. Pre-decide sample size + only look at that point
2. Sequential testing (Wald's SPRT) — peeking allowed, thresholds adjusted
3. Bayesian — directly interpret credible intervals (with care)

→ Netflix, Meta, etc. use advanced sequential / mSPRT.
   Small teams: "decide sample size up front + wait to the end" is usually enough.
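The inflation from peeking can be checked with a small simulation. This is a simplified sketch, assuming equal-sized batches of data per look and a normal approximation: under H0 each batch contributes a standard normal increment, and the cumulative z-statistic after k looks is the running sum divided by √k. The function name and parameters are illustrative.

```python
import math
import random

def false_positive_rate(n_looks: int, n_sims: int = 5000,
                        z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of A/A experiments (no real effect) declared significant
    when stopping at the first look with |z| > z_crit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_looks + 1):
            # One more equal-sized batch of data under H0.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > z_crit:
                hits += 1  # a false positive: "Significant! Stop!"
                break
    return hits / n_sims

single_look = false_positive_rate(1)
daily_for_week = false_positive_rate(7)
daily_for_month = false_positive_rate(30)
```

With one look the rate should land near the nominal 5%, and it climbs as the number of looks grows, even though nothing about the data changed.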

Sample Size Planning

Pick an appropriate sample size:
- Baseline conversion: 3% (current)
- Minimum Detectable Effect (MDE): 0.3 percentage points absolute, i.e. 3.0% → 3.3% (smallest difference worth detecting)
- Significance level (α): 0.05
- Power (1-β): 0.8 (80% chance to detect a real effect)

Approximate formula:
  n = 16 × baseline × (1-baseline) / MDE²
  = 16 × 0.03 × 0.97 / 0.003²
  ≈ 51,733 per group

→ ~100K users needed in total across both groups. Less traffic → can only detect larger effects.
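The constant 16 in the approximate formula comes from 2 × (z_{α/2} + z_β)² ≈ 2 × (1.96 + 0.84)² ≈ 15.7 for α = 0.05 and 80% power. A sketch of both versions, using only the standard library; the function names are illustrative, and the fuller formula here uses the baseline variance for both groups as a simplification.

```python
import math
from statistics import NormalDist

def sample_size_rule_of_thumb(baseline: float, mde: float) -> int:
    """Per-group n from the '16 x variance / MDE^2' shortcut (alpha=0.05, power=0.8)."""
    return math.ceil(16 * baseline * (1 - baseline) / mde ** 2)

def sample_size_per_group(baseline: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group n for a two-sided test, with mde as an absolute difference."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    variance = baseline * (1 - baseline)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2)

n_shortcut = sample_size_rule_of_thumb(0.03, 0.003)
n_formula = sample_size_per_group(0.03, 0.003)
```

The two results should agree to within a few percent. Because n scales with 1/MDE², halving the MDE roughly quadruples the required sample.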

Tools:
- Evan Miller's calculator (online)
- Python statsmodels.stats.power
- R pwr package

Trade-off:
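- Smaller MDE (0.1 percentage points) → sample size explodes (n ∝ 1/MDE², so ~466K per group at a 3% baseline)
- Larger MDE (1 percentage point) → only ~4.7K per group, but smaller effects are missed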

Sequential Testing — Peeking Done Right

Classic A/B = "fixed sample size + decide at the end". Sequential
tests allow "continue / stop / suspend" at any point with statistical
correctness.

Tools:
- mSPRT (mixture Sequential Probability Ratio Test) — Optimizely, Netflix
- Always Valid Inference — Microsoft
- Group Sequential Testing — peek at fixed checkpoints
- Bayesian early stopping — credible interval based

Benefits:
- Strong effects → terminate early → save resources
- Small effects → wait long enough
- Statistical validity preserved (overall Type I error rate stays controlled across all looks)

Adoption:
- Simplest: allow 5 pre-planned peeks (e.g., over a one-week test), each tested at α/5 = 0.01 (Bonferroni, conservative)
- Formal mSPRT is more efficient but complex to implement
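The simplest option, splitting α evenly across a fixed number of pre-planned peeks, might be sketched as follows. `first_stopping_look` and its parameters are hypothetical names; the p-values in the usage lines are made up.

```python
from typing import Optional, Sequence

def first_stopping_look(p_values: Sequence[float], planned_looks: int = 5,
                        alpha: float = 0.05) -> Optional[int]:
    """Return the 1-based look at which the test may stop, or None.

    Each look is tested at alpha / planned_looks (a Bonferroni split), so the
    overall false positive rate across all looks stays at or below alpha.
    """
    if len(p_values) > planned_looks:
        raise ValueError("more looks than were planned up front")
    threshold = alpha / planned_looks
    for look, p in enumerate(p_values, start=1):
        if p < threshold:
            return look
    return None

# Hypothetical p-values observed at successive looks.
no_stop = first_stopping_look([0.4, 0.2, 0.07, 0.04])   # 0.04 does not clear 0.01
early_stop = first_stopping_look([0.4, 0.03, 0.008])     # third look clears 0.01
```

The price of this simplicity is power: a p-value of 0.04 at the final look would have counted as significant in a fixed-sample test but does not here.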

Multi-armed Bandit — A Different Approach

Weakness of A/B: "during the test, even the loser gets 50% of traffic".
   → Even clear losers waste traffic until samples are filled.

Multi-armed bandit:
- Better variants automatically receive more traffic
- explore (occasionally try others) vs exploit (use the best)
- Thompson sampling is a common algorithm (alternatives include ε-greedy and UCB)
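A minimal sketch of Thompson sampling for conversion-style (success/failure) rewards, assuming a uniform Beta(1, 1) prior per variant. The simulated `true_rates` are invented for illustration; in a real system they are unknown and the reward comes from actual user behavior.

```python
import random
from typing import List, Sequence

def thompson_sampling(true_rates: Sequence[float], n_rounds: int = 5000,
                      seed: int = 0) -> List[int]:
    """Simulate a Bernoulli bandit and return how often each arm was shown."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    for _ in range(n_rounds):
        # Explore/exploit in one step: sample a plausible rate for each arm
        # from its Beta posterior, then show the arm with the highest draw.
        draws = [rng.betavariate(1 + successes[i], 1 + failures[i])
                 for i in range(n_arms)]
        arm = max(range(n_arms), key=draws.__getitem__)
        # Simulated user response (stand-in for a real conversion event).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [successes[i] + failures[i] for i in range(n_arms)]

# Hypothetical variants converting at 5% and 15%.
pulls = thompson_sampling([0.05, 0.15])
```

Arms with uncertain posteriors still get sampled occasionally, which is the exploration; as evidence accumulates, the better arm's draws win more often and it receives most of the traffic.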

vs A/B:
- A/B: "confirm then change" — high learning cost but strong stats
- Bandit: "learn while exploiting" — lower opportunity cost, weaker stat analysis

Use:
- A/B: measure long-term causal effects of a new feature
- Bandit: short campaigns / recommendations / headline picks (immediate value)

Common Pitfalls

  • Peeking + early stop — false positives explode. Plan sample size + sequential tests.
  • Network effects — A's users interact with B's users (social, marketplaces) → effects bleed between groups. Use cluster randomization or a different randomization unit (e.g., region or community instead of individual user).
  • Simpson's paradox — A looks better overall but B wins every segment. Analyze by segment + weighted averages.
  • Multiple testing — testing 10 metrics simultaneously at α = 0.05 → ~40% chance of at least one false positive (if the tests are independent). Use Bonferroni (α/n) or declare a primary metric.
  • Ignoring metrics beyond conversion — A may help conversion but hurt retention. Use a North Star metric + guardrails.
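The multiple-testing pitfall is easy to quantify. The sketch below assumes the tests are independent, which real metrics rarely are, so treat the result as a rough guide; the function names are illustrative.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level that keeps the overall rate at or below alpha."""
    return alpha / n_tests

uncorrected = familywise_error_rate(10)                     # 10 metrics, each at 0.05
corrected = familywise_error_rate(10, bonferroni_alpha(10)) # each at 0.005
```

Bonferroni is conservative, which is one reason declaring a single primary metric up front is often the more practical fix.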

Wrap-up

A/B testing rests on two ideas: randomization balances confounders across groups, and statistical tests separate noise from real effects. The main pitfalls are peeking, multiple testing, and network effects.

In practice, start simple (fixed sample, one metric, no peeking). Larger orgs use platforms such as Optimizely or VWO, or build their own with sequential testing. For ML model comparison, consider both A/B tests and shadow deployments.
