
A/B Test Calculator — Statistical Significance & Sample Size

Most A/B tests get called too early. A 20% uplift on 500 visitors is almost certainly noise: at a typical 5% baseline conversion rate you need roughly 31,000 visitors per variant to detect a 10% relative change at 95% confidence and 80% power. Plug in your control and variant numbers to get the actual p-value and find out whether you have a winner or wishful thinking.

At a glance:

- Standard α: 0.05 (95% confidence)
- Power target: 80%
- Common MDE: 10% relative
- Minimum sample: ~31,000/variant (5% baseline, 10% relative MDE)

By SplitGenius Team · Updated February 2026

To check whether an A/B test is statistically significant, compare the conversion rates of both variants using a two-proportion z-test. If the p-value is below 0.05 (95% confidence), the observed difference is unlikely to be random chance. Enter your visitors and conversions for each variant below to get your z-score, p-value, and winner.

Variant A (Control)

Variant B (Treatment)

Confidence Level

Industry standard — balances false positives with practical sample sizes.

Minimum Sample Size by Baseline Conversion Rate

Visitors needed per variant to detect a 10% relative improvement at 95% confidence and 80% power.

| Baseline CVR | 10% Uplift Target | Sample/Variant | Total Needed |
|---|---|---|---|
| 1% | 1.1% | ~163,000 | ~326,000 |
| 2% | 2.2% | ~80,700 | ~161,400 |
| 3% | 3.3% | ~53,200 | ~106,400 |
| 5% | 5.5% | ~31,200 | ~62,500 |
| 10% | 11% | ~14,800 | ~29,500 |
| 20% | 22% | ~6,500 | ~13,000 |

How This Calculator Works

1

Enter Your Numbers

Fill in visitors and conversions for the control and the treatment. Takes under 30 seconds.

2

Get Your Results

See the z-score, p-value, and a significant/not-significant verdict instantly.

3

Share & Decide

Copy a shareable link to discuss the results with your team.



How A/B Test Statistical Significance Works

Statistical significance tells you whether the difference in conversion rates between two variants is real or just noise. This calculator uses a two-proportion z-test, the standard method for comparing two independent conversion rates.

The test pools both variants' data to estimate a shared conversion rate, calculates the standard error, then measures how many standard errors apart the two rates are (the z-score). A higher absolute z-score means stronger evidence of a real difference. The p-value converts that z-score into a probability: the chance you'd see results this extreme if there were truly no difference.
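The steps above can be sketched in a few lines of Python using only the standard library (the visitor and conversion counts below are made-up example data, not output from this calculator):

```python
from math import erf, sqrt

def two_proportion_z_test(conversions_a, visitors_a, conversions_b, visitors_b):
    """Pooled two-proportion z-test; returns (z, two-tailed p-value)."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    # Pool both variants to estimate the shared rate under the null hypothesis
    p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf, then convert to a two-tailed p-value
    cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Example: 5.0% vs 6.5% conversion on 2,000 visitors each
z, p = two_proportion_z_test(100, 2000, 130, 2000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these example numbers the test comes out just under the 0.05 threshold, which is exactly the regime where eyeballing conversion rates is least reliable.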

When to Stop an A/B Test

Never stop a test early because one variant "looks like it's winning." Early results are noisy and unreliable. You need enough visitors to reach statistical power (typically 80%), which means your test has an 80% chance of detecting a real effect if one exists.

The required sample size depends on your baseline conversion rate and the minimum detectable effect (MDE) you care about. Smaller effects need more visitors to detect. Use the sample size recommendation from this calculator as your stopping rule.
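Under the usual normal approximation, the per-variant sample size is n = (z_alpha/2 + z_beta)^2 · 2·p̄(1−p̄) / δ², where p̄ is the average of the two conversion rates and δ the absolute difference you want to detect. A minimal sketch, with the critical values hard-coded for two-tailed 95% confidence and 80% power:

```python
from math import ceil

def sample_size_per_variant(baseline, relative_mde, z_alpha=1.96, z_beta=0.8416):
    """Visitors per variant to detect a relative lift at the given
    significance (default two-tailed 95%) and power (default 80%)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    delta = p2 - p1
    p_bar = (p1 + p2) / 2                     # average rate under H1
    n = (z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar) / delta ** 2
    return ceil(n)

# 5% baseline, 10% relative MDE -> roughly 31,000 visitors per variant
print(sample_size_per_variant(0.05, 0.10))
```

Note how the δ² in the denominator drives the traffic requirement: halving the MDE roughly quadruples the sample size.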

Required Sample Size by Baseline Conversion Rate

| Baseline Rate | MDE (Relative) | Sample / Variant | At 1K/day per variant |
|---|---|---|---|
| 2% | 10% | ~80,700 | ~81 days |
| 5% | 10% | ~31,200 | ~31 days |
| 5% | 5% | ~122,000 | ~122 days |
| 10% | 10% | ~14,800 | ~15 days |
| 20% | 10% | ~6,500 | ~7 days |

MDE = minimum detectable effect. A 10% relative MDE on a 5% baseline means you want to detect a shift from 5.0% to 5.5% (or down to 4.5%). Smaller MDEs require far more traffic: the sample size scales with the inverse square of the effect, so halving the MDE roughly quadruples the visitors you need.

One-Tailed vs Two-Tailed Tests

This calculator uses a two-tailed test, which checks if Variant B is significantly different from Variant A in either direction (better or worse). A one-tailed test only checks one direction, giving you a lower p-value but missing potential negative results.

Use two-tailed tests in practice. You always want to know if your change made things worse, not just whether it helped. One-tailed tests are tempting because they reach significance faster, but they mask regressions.
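The relationship between the two is mechanical: for a z-score in the favorable direction, the one-tailed p-value is exactly half the two-tailed one. A small sketch with an example z-score chosen to sit between the two thresholds:

```python
from math import erf, sqrt

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

z = 1.8  # example z-score with the treatment ahead of the control

p_one_tailed = 1 - normal_cdf(z)              # only asks "is B better?"
p_two_tailed = 2 * (1 - normal_cdf(abs(z)))   # asks "is B different at all?"

print(f"one-tailed: {p_one_tailed:.3f}, two-tailed: {p_two_tailed:.3f}")
```

A z of 1.8 is "significant" one-tailed (p around 0.036) but not two-tailed (p around 0.072), which is precisely how one-tailed tests tempt you into premature winners.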

5 Common A/B Testing Mistakes

  • Peeking and stopping early: Checking results daily and stopping the moment p < 0.05 can inflate your false positive rate several-fold, often past 20%. Decide your sample size in advance and commit to it.
  • Ignoring sample size: With 200 visitors per variant, you can only detect 30%+ relative differences. Most real effects are 5–15%, which need thousands of visitors.
  • Testing too many variants at once: Each additional variant increases the chance of a false positive. Correct for multiple comparisons (Bonferroni) or run sequential tests.
  • Not segmenting by device/source: A variant might win on mobile and lose on desktop. Aggregate results can hide this. Check segment-level data after reaching significance.
  • Changing the test mid-flight: Modifying the variant, adding traffic sources, or changing goals mid-test invalidates the statistical assumptions. Start a new test instead.
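The first mistake is easy to demonstrate with an A/A simulation: both arms share the same true rate, so every "significant" result is a false positive. Repeatedly peeking and counting any significant-looking moment flags far more than the nominal 5% of tests. This sketch uses made-up traffic numbers and a fixed seed for reproducibility:

```python
import random
from math import sqrt

def z_significant(c_a, n_a, c_b, n_b, z_crit=1.96):
    """True if the pooled two-proportion z-test exceeds the critical value."""
    p_pool = (c_a + c_b) / (n_a + n_b)
    if p_pool in (0.0, 1.0):
        return False
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return abs(c_b / n_b - c_a / n_a) / se > z_crit

rng = random.Random(42)
TRUE_RATE = 0.05                  # identical in both arms: any "winner" is noise
DAYS, VISITORS_PER_DAY = 10, 200  # per arm
TRIALS = 1000

peeking_hits = final_hits = 0
for _ in range(TRIALS):
    c_a = c_b = n = 0
    any_peek = False
    for _day in range(DAYS):
        n += VISITORS_PER_DAY
        c_a += sum(rng.random() < TRUE_RATE for _ in range(VISITORS_PER_DAY))
        c_b += sum(rng.random() < TRUE_RATE for _ in range(VISITORS_PER_DAY))
        if z_significant(c_a, n, c_b, n):
            any_peek = True       # a peeker would have declared a winner here
    peeking_hits += any_peek
    final_hits += z_significant(c_a, n, c_b, n)  # look only once, at the end

print(f"false positives with daily peeking: {peeking_hits / TRIALS:.1%}")
print(f"false positives at fixed sample:    {final_hits / TRIALS:.1%}")
```

The fixed-sample rate stays near the nominal 5%, while the peeking rate climbs well above it, and the more often you peek, the worse it gets.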

Bayesian vs Frequentist A/B Testing

This calculator uses the frequentist approach (z-test, p-values). It asks: "If there were no real difference, how likely is the data I observed?" You get a binary significant/not-significant answer at a fixed confidence level.

Bayesian testing gives a probability that one variant is better than the other (e.g., "92% chance B beats A"). It's more intuitive and allows continuous monitoring without inflating error rates, but requires choosing a prior distribution. Tools like Google Optimize used Bayesian methods. For most teams, the frequentist approach here works well as long as you commit to a sample size before peeking.
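A minimal Bayesian version needs only Beta posteriors and Monte Carlo sampling. The sketch below assumes uniform Beta(1, 1) priors on both rates and uses made-up example counts, with a fixed seed for reproducibility:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=7):
    """Estimate P(rate_B > rate_A) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a binomial rate with a uniform prior is
        # Beta(conversions + 1, failures + 1)
        rate_a = rng.betavariate(conv_a + 1, n_a - conv_a + 1)
        rate_b = rng.betavariate(conv_b + 1, n_b - conv_b + 1)
        wins += rate_b > rate_a
    return wins / draws

# Example: 5.0% vs 6.5% conversion on 2,000 visitors each
print(f"P(B beats A) = {prob_b_beats_a(100, 2000, 130, 2000):.1%}")
```

With these counts the posterior probability that B beats A lands in the high 90s: the same conclusion the z-test reaches, expressed as a probability rather than a significance verdict.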

For splitting datasets into train/test groups, use our dataset split calculator. For dividing any amount by custom percentages, try the percentage split calculator.