9 A/B Testing Statistics Mistakes That Wreck Your Results

The statistics errors that make A/B test results lie: peeking, p-hacking, ignoring power, misreading confidence intervals, multiple comparisons, and more — explained simply with fixes.

By AB Test Plan

The most damaging A/B testing statistics mistake is peeking: stopping the moment results first cross 95% significance. That one habit can inflate your false positive rate from the nominal 5% to 20% or more with daily checks, and higher still with continuous monitoring, so many of your "winners" are noise. The eight other mistakes below compound the damage.

Quick Reference: Mistakes at a Glance

Mistake | Why it lies | Fix
--- | --- | ---
Peeking at results | Each check is an extra significance test; false positives stack | Pre-commit to a fixed sample size and stop date
Stopping at 95% the moment you hit it | Significance fluctuates; you cherry-pick the peak | Run to your pre-registered end point, every time
Confusing statistical and practical significance | Large N can make tiny effects "significant" | Define minimum business-relevant effect before launch
Ignoring statistical power | Underpowered tests miss real wins and manufacture false negatives | Target 80% power minimum; calculate sample size upfront
Multiple comparisons | Each extra metric/variant adds false positive probability | Adjust for multiple comparisons or designate one primary metric
Misinterpreting the p-value | p-value ≠ probability the variant is better | Learn the correct definition (see below)
Too-short test windows | Novelty effects, day-of-week patterns, and seasonal spikes corrupt results | Run for at least one full business cycle (usually 7 days minimum)
Sample ratio mismatch | Broken traffic splits mean your two groups aren't comparable | Check observed vs. expected visitor counts before reading results
Treating directional results as proven | "Trending positive" is not a win | Ship only clear wins; log directional results as hypotheses

1. Peeking at Results

Peeking means checking your test's p-value repeatedly during the run and stopping when you first see p < 0.05. It feels responsible: why let a loser run longer than necessary? But statistically, each check is another significance test on the accumulating data, and the more checks you run, the more chances you give noise to look like a signal.

Simulations show that checking every day of a two-week test can push your real false positive rate above 20%, even though every individual check uses a 5% threshold.
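The effect is easy to reproduce. The sketch below simulates A/A tests, where both arms share the same true conversion rate, and compares stopping at the first daily p < 0.05 against a single check on day 14. The traffic level, conversion rate, and function names are illustrative assumptions, not figures from a real test.

```python
import math
import random


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 0.0
    return (conv_b / n - conv_a / n) / se


def false_positive_rate(peek, n_tests=400, days=14, daily_n=100, rate=0.10, seed=0):
    """Share of A/A tests (no true difference) declared significant at |z| > 1.96.

    peek=True  : check after every day and stop at the first significant result.
    peek=False : check once, after the final day.
    All defaults are hypothetical values chosen for illustration.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        significant = False
        for day in range(1, days + 1):
            # Simulate one day of visitors per arm with the same true rate.
            conv_a += sum(rng.random() < rate for _ in range(daily_n))
            conv_b += sum(rng.random() < rate for _ in range(daily_n))
            if peek and abs(z_stat(conv_a, conv_b, day * daily_n)) > 1.96:
                significant = True
                break
        if not peek:
            significant = abs(z_stat(conv_a, conv_b, days * daily_n)) > 1.96
        hits += significant
    return hits / n_tests


if __name__ == "__main__":
    print("fixed horizon:", false_positive_rate(peek=False))
    print("daily peeking:", false_positive_rate(peek=True))
```

Because there is no real difference between the arms, every "significant" result here is a false positive. The fixed-horizon rate should sit near 5%, while the peeking rate typically lands several times higher.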

Fix: Use a fixed-horizon approach. Before the test launches, calculate the required sample size using your baseline conversion rate, minimum detectable effect, and target power. Commit to stopping only when that sample size is reached. If you need to monitor in real time, use sequential testing methods designed for continuous monitoring — but fixed-horizon is the right default for most teams.

2. Stopping at 95% the Moment You Hit It

Even if you check only occasionally, stopping the instant you cross 95% means you captured the moment random fluctuation peaked in your favor. Significance fluctuates throughout a test: a variant might cross 95% on day 4, drop to 88% on day 7, then climb back to 96% by day 14. Stopping at the day-4 peak captures noise.

Fix: Pre-register your end point — a specific date or visitor count — before the test starts. Stop there regardless of what the significance meter reads. That's the difference between a confirmatory experiment and data dredging.

3. Confusing Statistical and Practical Significance

A result can be statistically significant and practically meaningless. With a large enough sample, you can detect a 0.01% conversion rate lift at p < 0.001 — a difference that adds $40/month in revenue, far below the cost of shipping and maintaining the change.

The reverse is equally common: a test shows +12% relative lift, the team ships it, and later realizes the confidence interval was wide enough to include zero. The variant may have done nothing.

Fix: Before launch, define your minimum detectable effect in business terms. Ask: "What is the smallest improvement that would change a decision?" That threshold, not 0.05, is what you're actually testing against.
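To see how a healthy-looking relative lift can still be compatible with no effect, it helps to compute the confidence interval directly. This is a minimal sketch of a Wald interval for the absolute difference in conversion rates; the visitor and conversion counts are made up to produce a +12% relative lift, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for (rate_b - rate_a), unpooled standard error.

    Returns (difference, lower bound, upper bound) in absolute terms.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = rate_b - rate_a
    return diff, diff - z * se, diff + z * se


if __name__ == "__main__":
    # Hypothetical data: 10.0% vs 11.2% conversion, a +12% relative lift.
    diff, lower, upper = diff_confidence_interval(400, 4000, 448, 4000)
    print(f"difference: {diff:.4f}, 95% CI: ({lower:.4f}, {upper:.4f})")
```

With these counts the interval straddles zero, so the data cannot rule out a variant that does nothing. Comparing the interval's bounds against your business threshold, not just against zero, is what keeps statistical and practical significance apart.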

4. Ignoring Statistical Power

Power is the probability that your test will detect a real effect if one exists. Most teams focus entirely on statistical significance (controlling false positives) and ignore power (controlling false negatives). The industry default is 80% power, meaning even a well-designed test misses 20% of real effects of the size it was planned to detect.

Running an underpowered test is worse than not running it at all. You get a negative result, conclude the variant doesn't work, and file it away — but the variant might have produced a genuine +8% lift that your sample size was simply too small to detect.

Fix: Calculate your required sample size before you start, targeting at least 80% power. Use the A/B test sample size calculator to find the right number. If you can't reach that sample size in a reasonable time, accept a larger minimum detectable effect (MDE) or don't run the test.
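For a sense of where these numbers come from, here is a sketch of the standard normal-approximation formula for comparing two proportions. It is not necessarily the formula the calculator mentioned above uses, and other tools may differ slightly (pooled variance, continuity corrections). The function name and defaults are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided test.

    baseline     : control conversion rate, e.g. 0.10
    relative_mde : smallest relative lift worth detecting, e.g. 0.10 for +10%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against false positives
    z_beta = NormalDist().inv_cdf(power)           # guards against false negatives
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline, +10% relative MDE.
    print(sample_size_per_variant(0.10, 0.10))
    # Doubling the MDE cuts the requirement to roughly a quarter.
    print(sample_size_per_variant(0.10, 0.20))
```

The required sample scales with the inverse square of the effect size, which is why accepting a larger MDE is the most effective lever when traffic is limited.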

5. The Multiple Comparisons Problem

Each statistical test you run carries a false positive rate. Track five metrics in a single test and, treating them as independent, the probability of at least one false positive is 1 - (0.95)⁵, or about 23%, even though every individual metric uses a 5% threshold. Multiple variants multiply this further.

Fix: Designate one primary metric before launch — it alone determines ship/kill. Secondary metrics are for learning, not decisions. If you must test multiple variants, apply the Bonferroni correction (divide your alpha by the number of comparisons) or similar family-wise error rate control.
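The arithmetic behind both the problem and the Bonferroni fix fits in a few lines. This sketch assumes the comparisons are independent, which real metrics rarely are exactly; the function names are illustrative.

```python
def family_wise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-comparison threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag which p-values survive the Bonferroni-adjusted threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


if __name__ == "__main__":
    print(round(family_wise_error_rate(0.05, 5), 4))                       # uncorrected
    print(round(family_wise_error_rate(bonferroni_alpha(0.05, 5), 5), 4))  # corrected
    # Hypothetical p-values for five metrics.
    print(bonferroni_significant([0.004, 0.03, 0.20, 0.011, 0.049]))
```

Note how a p-value of 0.03, which looks like a win on its own, no longer clears the bar once five comparisons share the 5% budget.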

6. Misinterpreting the P-Value

This is the most widespread conceptual error in experimentation. Many practitioners read p = 0.03 as "there is a 97% probability that the variant is better than control." That is incorrect.

The p-value is the probability of observing data this extreme, or more extreme, assuming the null hypothesis is true — i.e., assuming no real difference exists between variants. It says nothing directly about the probability that your variant is better. A low p-value means the observed gap is unlikely if there were no true effect; it does not mean the null is probably false.

Fix: Replace "p = 0.03 means there's a 97% chance the variant wins" with "if there were truly no difference, we'd observe a gap this large only 3% of the time." That framing keeps you honest. For a direct probability estimate of which variant is better, Bayesian tools provide posterior probabilities — though they introduce their own assumptions.
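The definition is easier to hold onto when you see how the number is produced. Below is a sketch of a two-sided, pooled two-proportion z-test; the counts are hypothetical and the function name is an assumption. Every step is computed as if the null hypothesis were true, which is why the output cannot be read as the probability that the variant wins.

```python
import math
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """P(gap at least this large | no true difference), via a pooled z-test."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null both arms share one conversion rate, so pool the data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided: count gaps this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


if __name__ == "__main__":
    # Hypothetical data: 400/4000 conversions in control, 448/4000 in variant.
    print(round(two_sided_p_value(400, 4000, 448, 4000), 4))
```

Nothing in the calculation uses information about how likely a real effect was to begin with, so it cannot answer "how probable is it that the variant is better?" on its own.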

7. Too-Short Test Windows (No Full Business Cycles)

User behavior varies by day of week — B2B products peak Monday through Wednesday, e-commerce sites on weekends. A 48-hour window that falls on your highest-traffic days inflates conversion rates across both variants, but the relative difference between them is unreliable.

The novelty effect compounds this: users sometimes respond to any change simply because it's unfamiliar, and that lift vanishes within days.

Fix: Run every test for at least one complete business cycle — at minimum seven days, typically two weeks for B2C. Check the test duration guide to estimate the right window for your traffic and MDE. It sounds slow; it's the cost of results you can act on.
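One simple way to apply this rule is to convert the required sample size into days and then round up to whole business cycles. This is a rough sketch under the assumption of steady daily traffic; it is not the method used by the test duration guide mentioned above, and the function name and inputs are hypothetical.

```python
import math


def duration_days(required_per_variant, n_variants, daily_visitors, cycle_days=7):
    """Days to run, rounded up to full business cycles (one cycle minimum)."""
    total_needed = required_per_variant * n_variants
    raw_days = math.ceil(total_needed / daily_visitors)
    full_cycles = max(1, math.ceil(raw_days / cycle_days))
    return full_cycles * cycle_days


if __name__ == "__main__":
    # Hypothetical: 14,749 visitors per variant, 2 variants, 3,000 visitors/day.
    print(duration_days(14749, 2, 3000))  # about 10 days of traffic, so 2 full weeks
    # Plenty of traffic still means a full week.
    print(duration_days(1000, 2, 3000))
```

Rounding up rather than stopping at the raw day count keeps every weekday equally represented in both arms.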

8. Sample Ratio Mismatch (SRM)

Sample ratio mismatch happens when the actual split between your control and variant groups differs meaningfully from the intended split. You set up a 50/50 test and after two weeks you have 22,000 visitors in control and 18,500 in variant — a 54/46 split. Something went wrong in the assignment mechanism.

SRM invalidates the test entirely. An uneven split means the two groups may differ systematically beyond the treatment — perhaps bots were filtered differently, or mobile users landed in one variant at a higher rate. Any measured effect could reflect that compositional gap, not your change.

Fix: Before reading results, verify observed visitor counts match expected proportions. A chi-squared test works: flag if p < 0.01. Most platforms run this automatically. If you detect SRM, investigate assignment logic, bot filtering, and any JavaScript that might suppress the tracking call before declaring a result.
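If your platform does not run the check for you, it is short enough to do by hand. The sketch below is a chi-squared goodness-of-fit test for the two-group case only, using the closed form for one degree of freedom so it needs nothing beyond the standard library. The function name is an assumption; the first example reuses the 22,000 vs. 18,500 split described above.

```python
import math


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-group traffic split."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, P(chi-squared > x) equals erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


if __name__ == "__main__":
    p = srm_p_value(22_000, 18_500)
    print("SRM detected" if p < 0.01 else "split looks fine", p)
    # A small, ordinary imbalance for comparison (hypothetical counts).
    print(srm_p_value(20_100, 19_900))
```

A 54/46 split on 40,500 visitors produces a p-value far below 0.01, so chance alone is not a credible explanation. Tests with three or more arms need the general chi-squared distribution, which this shortcut does not cover.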

9. Treating Directional Results as Proven

A test ends with p = 0.12 and a +6% lift for the variant. The team says "it's trending positive, let's ship it." This is not how statistics works.

A p-value of 0.12 means that, if the variant truly did nothing, you would still see a gap at least this large about 12% of the time, or roughly 1 test in 8. That's not a trend; it's a result that missed the threshold you committed to. Ship on that basis and you have quietly raised your false positive rate from 5% to 12%, and you won't know which of your "directional positives" are real.

Fix: Treat the significance threshold as a hard line. Results that miss it are inconclusive: not winners, not losers. Log them as hypotheses, redesign the test around a bolder change or a larger sample, or move on. Use AB Test Plan to structure follow-up experiments properly.


Fixed-Horizon vs. Sequential Testing

All fixes above assume a fixed-horizon approach: calculate required sample size upfront, stop exactly there. This works for most teams.

Sequential testing (always-valid inference) lets you check results at any time without inflating error rates. Methods like mSPRT build in the cost of peeking from the start. The trade-off is slightly larger total sample sizes. If testing velocity is constrained by run time and you have the tooling, it's worth exploring. Otherwise, fixed-horizon is simpler and sufficient.


The Common Thread

Nine mistakes, one root cause: launching without pre-registering the end point, primary metric, and required sample size. Answer three questions before every test starts:

  1. What single metric determines the winner?
  2. What sample size do I need at 80% power?
  3. On what date or visitor count will I stop — unconditionally?

Use the AB Test Plan calculator for question two, and the test duration guide to anchor question three to your real traffic.
