How A/B Test Statistical Significance Works
Statistical significance tells you whether the difference in conversion rates between two variants is real or just noise. This calculator uses a two-proportion z-test, the standard method for comparing two independent conversion rates.
The test pools both variants' data to estimate a shared conversion rate, calculates the standard error, then measures how many standard errors apart the two rates are (the z-score). A higher absolute z-score means stronger evidence of a real difference. The p-value converts that z-score into a probability: the chance you'd see results this extreme if there were truly no difference.
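The steps above can be sketched in a few lines of standard-library Python (the normal CDF comes from `math.erf`; the function name and sample counts are illustrative):

```python
from math import erf, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Pool both variants to estimate the shared rate under the null hypothesis
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Convert |z| to a two-tailed p-value via the standard normal CDF
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - cdf)

# Example: A converts 200/10,000 (2.0%), B converts 250/10,000 (2.5%)
z, p = two_proportion_z_test(200, 10_000, 250, 10_000)
```

Here z ≈ 2.38 and p ≈ 0.017, so the observed gap clears the conventional 0.05 threshold.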
When to Stop an A/B Test
Never stop a test early because one variant "looks like it's winning." Early results are noisy and unreliable. You need enough visitors to reach statistical power (typically 80%), which means your test has an 80% chance of detecting a real effect if one exists.
The required sample size depends on your baseline conversion rate and the minimum detectable effect (MDE) you care about. Smaller effects need more visitors to detect. Use the sample size recommendation from this calculator as your stopping rule.
Required Sample Size by Baseline Conversion Rate
| Baseline Rate | MDE (Relative) | Sample / Variant | At 1K/Variant/Day |
|---|---|---|---|
| 2% | 10% | ~81,000 | ~81 days |
| 5% | 10% | ~31,000 | ~31 days |
| 5% | 5% | ~122,000 | ~122 days |
| 10% | 10% | ~15,000 | ~15 days |
| 20% | 10% | ~6,500 | ~7 days |
MDE = minimum detectable effect. A 10% relative MDE on a 5% baseline means you want to detect a shift from 5.0% to 5.5% (or down to 4.5%). Figures assume a two-tailed test at 95% confidence with 80% power. Smaller MDEs require quadratically more traffic: halving the MDE roughly quadruples the required sample.
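A minimal sketch of the underlying sample-size calculation (the standard normal-approximation formula, assuming α = 0.05 two-tailed and 80% power; the function name is mine):

```python
# z critical values: alpha = 0.05 two-tailed (1.960) and 80% power (0.842)
Z_ALPHA, Z_BETA = 1.960, 0.842

def sample_size_per_variant(baseline, relative_mde):
    """Visitors needed per variant to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    # Normal-approximation variance of the difference in proportions
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return round((Z_ALPHA + Z_BETA) ** 2 * variance / (p2 - p1) ** 2)
```

For a 5% baseline and a 10% relative MDE this returns roughly 31,000 visitors per variant; halving the MDE to 5% roughly quadruples that.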
One-Tailed vs Two-Tailed Tests
This calculator uses a two-tailed test, which checks whether Variant B is significantly different from Variant A in either direction (better or worse). A one-tailed test checks only one direction, which halves the p-value for the same data but can miss a genuinely negative result.
Use two-tailed tests in practice. You always want to know if your change made things worse, not just whether it helped. One-tailed tests are tempting because they reach significance faster, but they mask regressions.
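To see the gap concretely, here is a hypothetical z-score of 1.80 evaluated both ways (standard-library Python; the z value is illustrative, not from the source):

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

z = 1.80  # hypothetical z-score with B ahead of A
p_one_tailed = 1 - normal_cdf(z)        # ~0.036: crosses the 0.05 bar
p_two_tailed = 2 * (1 - normal_cdf(z))  # ~0.072: does not
```

The same data reads as "significant" one-tailed and not significant two-tailed; that gap is exactly the temptation described above.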
5 Common A/B Testing Mistakes
- Peeking and stopping early: Checking results daily and stopping when p < 0.05 inflates your false positive rate to 25–30%. Decide your sample size in advance and commit to it.
- Ignoring sample size: With 200 visitors per variant, you can only detect 30%+ relative differences. Most real effects are 5–15%, which need thousands of visitors.
- Testing too many variants at once: Each additional variant increases the chance of a false positive. Correct for multiple comparisons (Bonferroni) or run sequential tests.
- Not segmenting by device/source: A variant might win on mobile and lose on desktop. Aggregate results can hide this. Check segment-level data after reaching significance.
- Changing the test mid-flight: Modifying the variant, adding traffic sources, or changing goals mid-test invalidates the statistical assumptions. Start a new test instead.
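The first mistake is easy to demonstrate by simulation: run many A/A tests (no real difference between variants), peek daily, and count how often some peek at p < 0.05 would have declared a winner. The rates, traffic levels, and run counts below are illustrative choices, not from the source:

```python
import random
from math import erf, sqrt

def p_value(c_a, n_a, c_b, n_b):
    """Two-tailed p-value from a pooled two-proportion z-test."""
    pooled = (c_a + c_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (c_b / n_b - c_a / n_a) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

random.seed(42)
RATE, DAILY, DAYS, RUNS = 0.05, 500, 20, 300  # both variants convert at 5%
peeking_fp = endpoint_fp = 0
for _ in range(RUNS):
    c_a = c_b = n = 0
    peek_fired = False
    for _ in range(DAYS):
        c_a += sum(random.random() < RATE for _ in range(DAILY))
        c_b += sum(random.random() < RATE for _ in range(DAILY))
        n += DAILY
        # A daily peeker stops as soon as any check shows p < 0.05
        peek_fired |= p_value(c_a, n, c_b, n) < 0.05
    peeking_fp += peek_fired
    # The disciplined tester checks exactly once, at the planned endpoint
    endpoint_fp += p_value(c_a, n, c_b, n) < 0.05
```

With these settings the single end-of-test check stays near the nominal 5% false positive rate, while "stop at the first significant peek" fires several times more often.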
Bayesian vs Frequentist A/B Testing
This calculator uses the frequentist approach (z-test, p-values). It asks: "If there were no real difference, how likely is the data I observed?" You get a binary significant/not-significant answer at a fixed confidence level.
Bayesian testing gives a probability that one variant is better than the other (e.g., "92% chance B beats A"). It's more intuitive and degrades more gracefully under continuous monitoring, but requires choosing a prior distribution. Tools like Google Optimize used Bayesian methods. For most teams, the frequentist approach here works well as long as you commit to a sample size before peeking.
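For contrast, a minimal Bayesian sketch for conversion data: each variant's rate gets a Beta posterior, and Monte Carlo sampling estimates the chance B beats A. The uniform Beta(1, 1) prior, draw count, and function name are my assumptions:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=100_000, seed=7):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each variant: Beta(1 + conversions, 1 + non-conversions)
        pa = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        pb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += pb > pa
    return wins / draws

# 200/10,000 vs 250/10,000 -> roughly "99% chance B beats A"
```

On a clear winner the two framings agree: the same data that gives p ≈ 0.017 under the z-test yields a posterior probability near 99% that B is better.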
For splitting datasets into train/test groups, use our dataset split calculator. For dividing any amount by custom percentages, try the percentage split calculator.