"Changed this button color and conversion went up 5%" — really? Could it be chance? Statistics gives the answer via A/B testing. But traps abound — "peeking", flawed randomization, network effects. This guide covers the precise mechanics and pitfalls.
Randomization — The Core Magic
Naive comparison:
v1 (last week's data): 3.5% conversion
v2 (this week's data): 4.0% conversion
→ 0.5 percentage point improvement?
Problems:
- Day-of-week differences (weekend vs weekday)
- External events (marketing campaign, holidays)
- User mix changes (more new users)
- Seasonality
Randomization fixes this:
Same window, randomly assign users to two groups:
- control (A): see v1, 50%
- treatment (B): see v2, 50%
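One common way to implement this kind of assignment is to hash the user ID, so the same user always lands in the same group without storing any state. The sketch below is only an illustration under that assumption: the function name `assign_variant`, the experiment name, and the user ID format are hypothetical, not taken from any particular platform.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'."""
    # Salt with the experiment name so different experiments bucket independently.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 6 bytes as an integer in [0, 2**48) and scale to [0, 1).
    bucket = int.from_bytes(digest[:6], "big") / 2**48
    return "treatment" if bucket < treatment_share else "control"

# Hypothetical usage: count how 10,000 made-up user IDs are split.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "button-color")] += 1
print(counts)
```

Hashing on the user ID rather than drawing a fresh random number per request keeps a returning user in the same group for the whole test.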
→ Theoretically both groups sample from the same population
→ The diff = v1/v2 effect + random noise
→ Statistical tests separate noise from real effect
Statistical Significance — p-value
H0 (null): "v1 = v2, no difference"
H1 (alternative): "v1 ≠ v2"
p-value = "Probability of observing this difference or larger by
chance, assuming H0 is true"
Example:
observed: v2 conversion is 0.5 percentage points higher
p-value = 0.03
Interpretation: "If v1 = v2, a difference this large or larger would
occur by chance about 3% of the time"
→ Reject H0 at the 5% significance level (0.03 < 0.05)
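For conversion rates, a p-value like this is often computed with a two-proportion z-test. The sketch below is one way to do that using only the standard library. The group sizes and conversion counts are made up (13,600 users per group at 3.5% vs 4.0%), chosen so that the p-value comes out near 0.03.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one rate, so pool the data to estimate it.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided: probability of a difference at least this large in either direction.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: 476/13,600 (3.5%) in control, 544/13,600 (4.0%) in treatment.
z, p_value = two_proportion_z_test(476, 13_600, 544, 13_600)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

The normal approximation used here is reasonable when each group has many conversions; for very small counts an exact test would be a safer choice.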
Common confusion:
- p-value < 0.05 = "95% sure v2 is better"? — not exactly
- True meaning: "probability of this difference (or larger) under H0"
≠ "probability v2 is better"
Peeking — The Biggest Trap
Watching p-value daily during the test:
Day 1: p = 0.4
Day 3: p = 0.2
Day 5: p = 0.07
Day 7: p = 0.04 ← "Significant! Stop!"
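Problem: stopping as soon as p < 0.05 ("wait until significant") inflates false positives, because every extra look is another chance for noise to cross the threshold.
Single test at the planned end: 5% false positive rate (the significance level)
Daily check for 1 week (7 looks): roughly 15-20% false positives
Daily check for 1 month (30 looks): roughly 25-30% false positives
(Approximate figures for equal-sized daily batches; the exact rate depends on the number and spacing of looks.)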
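A quick simulation can make the inflation concrete. The sketch below runs many A/A experiments (no real difference) and compares "stop at the first significant look" with "test once at the end". It is an idealized model, not a replay of real traffic: it assumes equal traffic every day and a normally distributed test statistic, and the function name and defaults are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rates(n_looks, n_experiments=10_000, alpha=0.05, seed=0):
    """Simulate A/A tests; return (rate with daily peeking, rate with one final test)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        cumulative = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # One day's worth of evidence under H0 (no true effect).
            cumulative += rng.gauss(0.0, 1.0)
            # z-statistic of all data collected so far.
            z = cumulative / sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        if crossed_at_any_look:
            peeking_hits += 1
        if abs(z) > z_crit:
            final_hits += 1
    return peeking_hits / n_experiments, final_hits / n_experiments

for looks in (7, 30):
    peeking, final = false_positive_rates(looks)
    print(f"{looks} looks: peeking = {peeking:.3f}, final-only = {final:.3f}")
```

The final-only rate should stay close to the nominal 5%, while the peeking rate grows with the number of looks.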
Fixes:
1. Pre-decide sample size + only look at that point
2. Sequential testing (Wald's SPRT) — peeking allowed, thresholds adjusted
3. Bayesian — directly interpret credible intervals (with care)
→ Netflix, Meta, etc. use advanced sequential / mSPRT.
Small teams: "decide sample size up front + wait to the end" is usually enough.
Sample Size Planning
Pick an appropriate sample size:
- Baseline conversion: 3% (current)
- Minimum Detectable Effect (MDE): 0.3 percentage points, absolute (3.0% → 3.3%; the smallest difference worth detecting)
- Significance level (α): 0.05
- Power (1-β): 0.8 (80% chance to detect a real effect)
Approximate formula (rule of thumb for α = 0.05 and power = 0.8; MDE as an absolute proportion):
n = 16 × baseline × (1-baseline) / MDE²
= 16 × 0.03 × 0.97 / 0.003²
≈ 51,733 per group
→ ~100K users needed. Less traffic → can only detect larger effects.
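The constant 16 is a rounded value of 2 × (z for α/2 + z for power)² at α = 0.05 and power 0.8, which is about 15.7. The sketch below computes both the rule of thumb and the unrounded version with the standard library. It assumes a two-sided test, equal group sizes, and the baseline variance for both groups; the function names are hypothetical.

```python
from statistics import NormalDist

def rule_of_thumb(baseline, mde):
    """n per group using the rounded constant 16 (alpha = 0.05, power = 0.8)."""
    return 16 * baseline * (1 - baseline) / mde ** 2

def sample_size_per_group(baseline, mde, alpha=0.05, power=0.8):
    """n per group for a two-sided test, using the baseline variance for both groups."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.8
    variance = baseline * (1 - baseline)
    return 2 * (z_alpha + z_beta) ** 2 * variance / mde ** 2

print(round(rule_of_thumb(0.03, 0.003)))          # rule-of-thumb value from the text
print(round(sample_size_per_group(0.03, 0.003)))  # slightly smaller, unrounded constant
```

Changing `alpha` or `power` changes the constant, so the factor 16 should not be reused for other settings.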
Tools:
- Evan Miller's calculator (online)
- Python statsmodels.stats.power
- R pwr package
Trade-off:
- Smaller MDE (0.1%) → sample size explodes: n scales with 1/MDE², so this needs about 9× the sample (≈465,600 per group)
- Larger MDE (1%) → miss smaller effects
Sequential Testing — Peeking Done Right
Classic A/B = "fixed sample size + decide at the end". Sequential
tests allow "continue / stop / suspend" at any point with statistical
correctness.
Tools:
- mSPRT (mixture Sequential Probability Ratio Test) — Optimizely, Netflix
- Always Valid Inference — Microsoft
- Group Sequential Testing — peek at fixed checkpoints
- Bayesian early stopping — credible interval based
Benefits:
- Strong effects → terminate early → save resources
- Small effects → wait long enough
- Statistical validity preserved (the overall Type I error rate across all looks stays at α)
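As a minimal illustration of peeking at fixed checkpoints without inflating the error rate, the sketch below splits α evenly across a pre-declared number of looks (a Bonferroni split). This is the most conservative boundary, not what mSPRT or production group-sequential designs use, and the function name and return labels are hypothetical.

```python
def sequential_decision(p_values, planned_looks=5, alpha=0.05):
    """Decide after each look, testing every look at alpha / planned_looks."""
    if len(p_values) > planned_looks:
        # Extra looks were never budgeted for, so the guarantee would not hold.
        raise ValueError("more looks than planned")
    per_look_alpha = alpha / planned_looks
    for look, p in enumerate(p_values, start=1):
        if p < per_look_alpha:
            return "stop_significant", look
    if len(p_values) == planned_looks:
        return "stop_not_significant", planned_looks
    return "continue", len(p_values)

# p-values observed at four of five planned looks.
print(sequential_decision([0.4, 0.2, 0.07, 0.04]))
```

A p-value of 0.04 at the fourth look would pass a naive 0.05 cutoff, but it does not clear the adjusted 0.01 threshold, so the test keeps running.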
Adoption:
- Simplest: pre-declare 5 looks in total and test each at α/5 = 0.01 (a Bonferroni split; valid but conservative)
- Formal mSPRT is more efficient but complex to implement
Multi-armed Bandit — A Different Approach
Weakness of A/B: "during the test, even the loser gets 50% of traffic".
→ Even clear losers waste traffic until samples are filled.
Multi-armed bandit:
- Better variants automatically receive more traffic
- explore (occasionally try others) vs exploit (use the best)
- Thompson sampling is a widely used algorithm (alternatives include epsilon-greedy and UCB)
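To show how traffic shifts toward the better variant, here is a minimal Thompson sampling sketch for conversion-style (yes/no) outcomes. It assumes a uniform Beta(1, 1) prior on each arm and simulated users with made-up true conversion rates; real systems add batching, delayed feedback, and other details not shown here.

```python
import random

def thompson_sampling(true_rates, n_rounds=5_000, seed=0):
    """Simulate Beta-Bernoulli Thompson sampling; return how often each arm was shown."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Explore/exploit in one step: draw a plausible rate for each arm
        # from its Beta posterior, then show the arm with the highest draw.
        draws = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = max(range(n_arms), key=draws.__getitem__)
        pulls[arm] += 1
        # Simulated user: converts with the arm's (unknown to the algorithm) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls

# Hypothetical true conversion rates for two variants.
print(thompson_sampling([0.05, 0.15]))
```

Arms with little data have wide posteriors and still get shown occasionally, which is where the exploration comes from.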
vs A/B:
- A/B: "confirm then change" — high learning cost but strong stats
- Bandit: "learn while exploiting" — lower opportunity cost, weaker stat analysis
Use:
- A/B: measure long-term causal effects of a new feature
- Bandit: short campaigns / recommendations / headline picks (immediate value)
Common Pitfalls
- Peeking + early stop — false positives explode. Plan sample size + sequential tests.
- Network effects — A's users interact with B's users (social, marketplaces) → effects bleed. Use cluster-randomized or different randomization units.
- Simpson's paradox — A looks better overall but B wins every segment. Analyze by segment + weighted averages.
- Multiple testing — testing 10 metrics at α = 0.05 gives about a 40% chance of at least one false positive (1 - 0.95^10 ≈ 0.40, assuming independent metrics). Bonferroni (α/n) or declare a primary metric.
- Ignoring metrics beyond conversion — A may help conversion but hurt retention. Use a North Star metric + guardrails.
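The Simpson's paradox pitfall above is easier to see with numbers. The sketch below uses made-up counts in which A was shown mostly to desktop users and B mostly to mobile users. It compares the pooled rate, the per-segment rates, and a segment-weighted rate; all names and figures are hypothetical.

```python
# Made-up (conversions, users) per segment and variant, for illustration only.
data = {
    "desktop": {"A": (90, 900), "B": (11, 100)},
    "mobile": {"A": (2, 100), "B": (27, 900)},
}

def pooled_rate(variant):
    """Overall conversion rate, ignoring segments."""
    conversions = sum(seg[variant][0] for seg in data.values())
    users = sum(seg[variant][1] for seg in data.values())
    return conversions / users

def segment_rate(segment, variant):
    conversions, users = data[segment][variant]
    return conversions / users

def weighted_rate(variant):
    """Per-segment rates weighted by each segment's share of all users."""
    total_users = sum(users for seg in data.values() for _, users in seg.values())
    rate = 0.0
    for name, seg in data.items():
        segment_users = sum(users for _, users in seg.values())
        rate += (segment_users / total_users) * segment_rate(name, variant)
    return rate

for variant in ("A", "B"):
    print(variant, pooled_rate(variant), weighted_rate(variant))
```

Here A wins on the pooled rate while B wins in both segments and on the weighted rate. A split this uneven across segments is itself a sign that the assignment was not properly randomized.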
Wrap-up
A/B testing — randomization eliminates confounders + statistical tests separate noise from real effects. Lots of pitfalls: peeking / multiple testing / network effects.
Practical — start simple (fixed sample, one metric, no peeking). Larger orgs use platforms such as Optimizely or VWO, or build their own with sequential testing. For ML model comparison, consider both A/B tests and shadow deployments.