How A/B Test Statistical Significance Works
Statistical significance tells you whether the difference in conversion rates between two variants is real or just noise. This calculator uses a two-proportion z-test, the standard method for comparing two independent conversion rates.
The test pools both variants' data to estimate a shared conversion rate, calculates the standard error, then measures how many standard errors apart the two rates are (the z-score). A higher absolute z-score means stronger evidence of a real difference. The p-value converts that z-score into a probability: the chance you'd see results this extreme if there were truly no difference.
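The steps above can be sketched in a few lines of standard-library Python (the normal CDF comes from `math.erf`; the function name and sample counts are illustrative):

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pool both variants to estimate the shared rate under the null hypothesis
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Convert |z| to a two-tailed p-value via the standard normal CDF
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Example: A converts 200/10,000 (2.0%), B converts 250/10,000 (2.5%)
z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
```

Here z ≈ 2.38 and p ≈ 0.017, so the observed gap clears the conventional 0.05 threshold.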
When to Stop an A/B Test
Never stop a test early because one variant "looks like it's winning." Early results are noisy and unreliable. You need enough visitors to reach statistical power (typically 80%), which means your test has an 80% chance of detecting a real effect if one exists.
The required sample size depends on your baseline conversion rate and the minimum detectable effect (MDE) you care about. Smaller effects need more visitors to detect. Use the sample size recommendation from this calculator as your stopping rule.
Required Sample Size by Baseline Conversion Rate
| Baseline Rate | MDE (Relative) | Sample / Variant | At 1K/Variant/Day |
|---|---|---|---|
| 2% | 10% | ~81,000 | ~81 days |
| 5% | 10% | ~31,000 | ~31 days |
| 5% | 5% | ~122,000 | ~122 days |
| 10% | 10% | ~15,000 | ~15 days |
| 20% | 10% | ~6,500 | ~7 days |
MDE = minimum detectable effect. A 10% relative MDE on a 5% baseline means you want to detect a shift from 5.0% to 5.5% (or down to 4.5%). Figures assume a two-tailed test at 95% confidence with 80% power. Smaller MDEs require quadratically more traffic: halving the MDE roughly quadruples the required sample.
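A minimal sketch of the underlying sample-size calculation (the standard normal-approximation formula, assuming α = 0.05 two-tailed and 80% power; the function name is mine):

```python
# z critical values: alpha = 0.05 two-tailed (1.960) and 80% power (0.842)
Z_ALPHA, Z_BETA = 1.960, 0.842

def sample_size_per_variant(baseline, relative_mde):
    """Visitors needed per variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    # Normal-approximation variance of the difference in proportions
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return round((Z_ALPHA + Z_BETA) ** 2 * variance / (p2 - p1) ** 2)
```

For a 5% baseline and a 10% relative MDE this returns roughly 31,000 visitors per variant; halving the MDE to 5% roughly quadruples that.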
One-Tailed vs Two-Tailed Tests
This calculator uses a two-tailed test, which checks whether Variant B is significantly different from Variant A in either direction (better or worse). A one-tailed test checks only one direction, which halves the p-value for the same data but can miss a genuinely negative result.
Use two-tailed tests in practice. You always want to know if your change made things worse, not just whether it helped. One-tailed tests are tempting because they reach significance faster, but they mask regressions.
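To see the gap concretely, here is a hypothetical z-score of 1.80 evaluated both ways (standard-library Python; the z value is illustrative, not from the source):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

z = 1.80  # hypothetical z-score with B ahead of A
p_one_tailed = 1 - normal_cdf(z)        # ~0.036: crosses the 0.05 bar
p_two_tailed = 2 * (1 - normal_cdf(z))  # ~0.072: does not
```

The same data reads as "significant" one-tailed and not significant two-tailed; that gap is exactly the temptation described above.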
5 Common A/B Testing Mistakes
- Peeking and stopping early: Checking results daily and stopping when p < 0.05 inflates your false positive rate to 25–30%. Decide your sample size in advance and commit to it.
- Ignoring sample size: With 200 visitors per variant, you can only detect 30%+ relative differences. Most real effects are 5–15%, which need thousands of visitors.
- Testing too many variants at once: Each additional variant increases the chance of a false positive. Correct for multiple comparisons (Bonferroni) or run sequential tests.
- Not segmenting by device/source: A variant might win on mobile and lose on desktop. Aggregate results can hide this. Check segment-level data after reaching significance.
- Changing the test mid-flight: Modifying the variant, adding traffic sources, or changing goals mid-test invalidates the statistical assumptions. Start a new test instead.
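The first mistake is easy to demonstrate by simulation: run many A/A tests (no real difference between variants), peek daily, and count how often some peek at p < 0.05 would have declared a winner. The rates, traffic levels, and run counts below are illustrative choices, not from the source:

```python
import random
from math import erf, sqrt

def p_value(c_a, n_a, c_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)
RATE, DAILY, DAYS, RUNS = 0.05, 500, 20, 300  # both variants convert at 5%
peeking_fp = endpoint_fp = 0
for _ in range(RUNS):
    c_a = c_b = n = 0
    peek_fired = False
    for _ in range(DAYS):
        c_a += sum(random.random() < RATE for _ in range(DAILY))
        c_b += sum(random.random() < RATE for _ in range(DAILY))
        n += DAILY
        # A daily peeker stops as soon as any check shows p < 0.05
        peek_fired |= p_value(c_a, n, c_b, n) < 0.05
    peeking_fp += peek_fired
    # The disciplined tester checks exactly once, at the planned endpoint
    endpoint_fp += p_value(c_a, n, c_b, n) < 0.05
```

With these settings the single end-of-test check stays near the nominal 5% false positive rate, while "stop at the first significant peek" fires several times more often.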
Bayesian vs Frequentist A/B Testing
This calculator uses the frequentist approach (z-test, p-values). It asks: "If there were no real difference, how likely is the data I observed?" You get a binary significant/not-significant answer at a fixed confidence level.
Bayesian testing gives a probability that one variant is better than the other (e.g., "92% chance B beats A"). It's more intuitive and degrades more gracefully under continuous monitoring, but requires choosing a prior distribution. Tools like Google Optimize used Bayesian methods. For most teams, the frequentist approach here works well as long as you commit to a sample size before peeking.
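For contrast, a minimal Bayesian sketch for conversion data: each variant's rate gets a Beta posterior, and Monte Carlo sampling estimates the chance B beats A. The uniform Beta(1, 1) prior, draw count, and function name are my assumptions:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + conversions, 1 + non-conversions)
        pa = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        pb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += pb > pa
    return wins / draws

# 200/10,000 vs 250/10,000 -> roughly "99% chance B beats A"
```

On a clear winner the two framings agree: the same data that gives p ≈ 0.017 under the z-test yields a posterior probability near 99% that B is better.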
For splitting datasets into train/test groups, use our dataset split calculator. For dividing any amount by custom percentages, try the percentage split calculator.