What is AB Test Plan?

AB Test Plan is a free AI-powered tool that predicts A/B test outcomes using synthetic persona simulation. Instead of spending weeks of real traffic, you get a prediction in 60 seconds — complete with a Run/Iterate/Kill verdict, persona-by-persona reasoning, and specific iteration suggestions.

How does the A/B test prediction work?

AB Test Plan generates 6 diverse synthetic personas, each with real economic constraints (fixed budgets, time pressure), specific behavioral patterns (skepticism levels, decision styles), and existing workflow investments (switching costs). Each persona independently evaluates your control and variant, then the tool synthesizes their responses into an actionable prediction with a Run, Iterate, or Kill verdict.

Why should I trust synthetic persona predictions?

Unlike generic AI chatbots, AB Test Plan's personas have rigid constraints that force honest trade-offs — like a real person deciding whether to spend their limited budget on your tool vs. keeping their current workflow. The behavioral anchoring methodology is based on Stanford Generative Agent research and forces personas to prioritize rather than agree.

How does ICE scoring work?

ICE scoring rates each experiment idea on three dimensions: Impact (how much will this move the needle, 1-10), Confidence (how sure are you it will work, 1-10), and Ease (how easy is it to implement, 1-10). The total ICE score helps you prioritize which experiments to run first. Higher scores indicate better candidates for testing.
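
As a rough illustration of the prioritization step, the sketch below ranks a few made-up ideas. It assumes the "total" is a plain sum of the three ratings; some ICE variants multiply or average them instead, and the text does not say which the tool uses. The Idea class and the sample ideas are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    impact: int      # 1-10: how much will this move the needle
    confidence: int  # 1-10: how sure are you it will work
    ease: int        # 1-10: how easy is it to implement

    def ice_total(self) -> int:
        # Assumption: "total" is read as a plain sum of the three ratings.
        return self.impact + self.confidence + self.ease


def rank_ideas(ideas):
    """Return ideas sorted from highest to lowest ICE total."""
    return sorted(ideas, key=lambda idea: idea.ice_total(), reverse=True)


# Hypothetical experiment ideas, for illustration only.
ideas = [
    Idea("Shorter signup form", impact=7, confidence=6, ease=8),
    Idea("New pricing page layout", impact=9, confidence=4, ease=3),
    Idea("Add testimonial to checkout", impact=5, confidence=7, ease=8),
]
ranked = rank_ideas(ideas)
for idea in ranked:
    print(idea.ice_total(), idea.name)
```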

What frameworks does AB Test Plan use?

AB Test Plan uses ICE Scoring for prioritization, Reforge Growth Loops, Cialdini's 6 Principles of Persuasion, Fogg Behavior Model, Jobs-to-be-Done framework, loss aversion, cognitive load theory, behavioral anchoring, and trade-off forcing methodology for realistic persona simulation.

How do I calculate the right sample size for an A/B test?

The built-in calculator determines sample size based on your baseline conversion rate, minimum detectable effect (MDE), statistical significance level (typically 95%), and statistical power (typically 80%). It tells you exactly how many visitors per variation you need and how many days the test will take based on your daily traffic.
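
For readers who want to see the arithmetic, here is a minimal sketch of one standard way to compute these numbers: the normal-approximation formula for a two-sided test of two proportions. It is not necessarily what the built-in calculator does; the function names, the relative-MDE convention, and the example traffic figures are assumptions.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)  # assumption: MDE is a relative lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_beta = NormalDist().inv_cdf(power)           # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def days_needed(n_per_variation, daily_visitors, variations=2):
    """Days of traffic required to fill every variation."""
    return math.ceil(n_per_variation * variations / daily_visitors)


# Hypothetical inputs: 5% baseline, +10% relative MDE, 4,000 visitors per day.
n = sample_size_per_variation(baseline=0.05, relative_mde=0.10)
days = days_needed(n, daily_visitors=4000)
print(n, days)
```

Halving the MDE roughly quadruples the required sample, which is why the MDE is usually the most important input to settle first.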

Is AB Test Plan free?

Yes, AB Test Plan is completely free. Generate experiment ideas, build hypotheses, calculate sample sizes, preview variants, and run persona predictions at no cost. No account or credit card required.

How long should I run an A/B test?

Run your test until it reaches statistical significance (typically 95% confidence) and has run for at least 1-2 full business cycles (7-14 days minimum). But first, run it through AB Test Plan's prediction simulation to make sure the test is worth running at all — 70-80% of A/B tests lose or are inconclusive.

9 A/B Testing Statistics Mistakes That Wreck Your Results

The most damaging A/B testing statistics mistake is peeking: stopping when results first cross 95% significance. That one habit can inflate your false positive rate from the nominal 5% to 20% or more, and higher still the more often you check, so many of your "winners" are noise. The other eight mistakes below compound the damage.

Quick Reference: Mistakes at a Glance

Mistake	Why it lies	Fix
Peeking at results	Each check is an extra significance test; false positives stack	Pre-commit to a fixed sample size and stop date
Stopping at 95% the moment you hit it	Significance fluctuates; you cherry-pick the peak	Run to your pre-registered end point, every time
Confusing statistical and practical significance	Very large N can make trivially small effects "significant"	Define minimum business-relevant effect before launch
Ignoring statistical power	Underpowered tests miss real wins and manufacture false negatives	Target 80% power minimum; calculate sample size upfront
Multiple comparisons	Each extra metric/variant adds false positive probability	Adjust for multiple comparisons or designate one primary metric
Misinterpreting the p-value	p-value ≠ probability the variant is better	Learn the correct definition (see below)
Too-short test windows	Novelty effects, day-of-week patterns, and seasonal spikes corrupt results	Run for at least one full business cycle (usually 7 days minimum)
Sample ratio mismatch	Broken traffic splits mean your two groups aren't comparable	Check observed vs. expected visitor counts before reading results
Treating directional results as proven	"Trending positive" is not a win	Ship only clear wins; log directional results as hypotheses

1. Peeking at Results

Peeking means checking your test's p-value repeatedly during the run and stopping when you first see p < 0.05. It feels responsible: why let a loser run longer than necessary? But statistically, each check is another significance test on the accumulating data, and the more checks you run, the more chances you give noise to look like a signal.

Simulations show that checking every day of a two-week test can push your real false positive rate above 20%, even though every individual check uses a 5% threshold.
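
The inflation is easy to reproduce. The sketch below simulates tests where the variant has no effect at all and counts how often a daily check crosses the 5% threshold at least once. It is a simplified model, not a replay of real traffic: it assumes equal traffic every day and simulates the z-statistic directly as a running sum of standard normal increments.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(n_looks=14, n_sims=20000, alpha=0.05, seed=0):
    """Share of no-effect tests declared significant at any of n_looks checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    stopped_early = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Each day adds one standard-normal increment under the null.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / look ** 0.5  # z-statistic on all data so far
            if abs(z) > z_crit:
                stopped_early += 1  # the peeker stops and declares a winner
                break
    return stopped_early / n_sims


daily_peeking = peeking_false_positive_rate(n_looks=14)
single_look = peeking_false_positive_rate(n_looks=1)
print(f"14 daily looks: {daily_peeking:.3f}, one look: {single_look:.3f}")
```

With a single look the rate stays near the nominal 5%; with daily looks it lands several times higher.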

Fix: Use a fixed-horizon approach. Before the test launches, calculate the required sample size using your baseline conversion rate, minimum detectable effect, and target power. Commit to stopping only when that sample size is reached. If you need to monitor in real time, use sequential testing methods designed for continuous monitoring — but fixed-horizon is the right default for most teams.

2. Stopping at 95% the Moment You Hit It

Even if you check only occasionally, stopping the instant you cross 95% means you are likely capturing the moment random fluctuation peaked in your favor. Significance fluctuates throughout a test: a variant might cross 95% on day 4, drop to 88% on day 7, then climb back to 96% by day 14. Stopping at the day-4 peak risks locking in noise.

Fix: Pre-register your end point — a specific date or visitor count — before the test starts. Stop there regardless of what the significance meter reads. That's the difference between a confirmatory experiment and data dredging.

3. Confusing Statistical and Practical Significance

A result can be statistically significant and practically meaningless. With a large enough sample, you can detect a 0.01% conversion rate lift at p < 0.001 — a difference that adds $40/month in revenue, far below the cost of shipping and maintaining the change.

The reverse is equally common: a test shows +12% relative lift, the team ships it, and later realizes the confidence interval was wide enough to include zero. The variant may have done nothing.

Fix: Before launch, define your minimum detectable effect in business terms. Ask: "What is the smallest improvement that would change a decision?" That threshold, not 0.05, is what you're actually testing against.
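
One way to put this into practice is to read the confidence interval against the business threshold rather than against zero. The sketch below uses a simple Wald interval for the absolute difference in conversion rate; the function names, the three-way verdict wording, and the example counts are illustrative assumptions.

```python
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the absolute difference in conversion rate (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def verdict(ci, min_effect):
    """Compare the interval with the smallest effect that would change a decision."""
    low, high = ci
    if low >= min_effect:
        return "ship: whole interval clears the business threshold"
    if high < min_effect:
        return "skip: effect is too small to matter, even if real"
    return "inconclusive: interval straddles the business threshold"


# Hypothetical test: 5.0% vs 5.6% conversion (+12% relative) on 2,000 visitors each.
ci = lift_confidence_interval(conv_a=100, n_a=2000, conv_b=112, n_b=2000)
print(ci, verdict(ci, min_effect=0.005))
```

In this example the interval includes zero despite the +12% relative lift, which is the situation described above.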

4. Ignoring Statistical Power

Power is the probability that your test will detect a real effect if one exists. Most teams focus entirely on statistical significance (controlling false positives) and ignore power (controlling false negatives). The industry default is 80% power, meaning even a well-designed test misses about 20% of real effects that are exactly as large as the minimum detectable effect, and more of the smaller ones.

Running an underpowered test is worse than not running it at all. You get a negative result, conclude the variant doesn't work, and file it away — but the variant might have produced a genuine +8% lift that your sample size was simply too small to detect.

Fix: Calculate your required sample size before you start, targeting at least 80% power. Use the A/B test sample size calculator to find the right number. If you can't reach that sample size in a reasonable time, accept a larger MDE or don't run the test.
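
To see how quickly power collapses at small sample sizes, the sketch below approximates the power of a two-sided two-proportion z-test for a given sample size. The normal approximation and the example numbers (a 5% baseline with a true +8% relative lift) are assumptions for illustration.

```python
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_variation, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / se  # true effect measured in standard errors
    # Probability that the test statistic lands beyond either critical value.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


# Hypothetical: a genuine lift from 5.0% to 5.4% conversion.
small = power_two_proportions(0.05, 0.054, n_per_variation=5000)
large = power_two_proportions(0.05, 0.054, n_per_variation=50000)
print(f"5,000 per arm: {small:.2f}, 50,000 per arm: {large:.2f}")
```

With the smaller sample, the test would miss this real lift most of the time.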

5. The Multiple Comparisons Problem

Each statistical test you run carries a false positive rate. Track five metrics in a single test and, if the metrics are roughly independent and none truly moved, the probability of at least one false positive is 1 - (0.95)⁵, or about 23%, even though every individual metric uses a 5% threshold. Multiple variants multiply this further.

Fix: Designate one primary metric before launch — it alone determines ship/kill. Secondary metrics are for learning, not decisions. If you must test multiple variants, apply the Bonferroni correction (divide your alpha by the number of comparisons) or similar family-wise error rate control.
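
The arithmetic behind both the problem and the Bonferroni fix fits in a few lines. This sketch assumes the comparisons are independent, which is the simplest case; the function names are illustrative.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-comparison threshold under the Bonferroni correction."""
    return alpha / n_tests


fwer_uncorrected = family_wise_error_rate(0.05, 5)       # five metrics at 5% each
per_test_alpha = bonferroni_alpha(0.05, 5)               # stricter per-test bar
fwer_corrected = family_wise_error_rate(per_test_alpha, 5)
print(fwer_uncorrected, per_test_alpha, fwer_corrected)
```

After the correction, the overall chance of a false positive across all five comparisons drops back under 5%.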

6. Misinterpreting the P-Value

This is the most widespread conceptual error in experimentation. Many practitioners read p = 0.03 as "there is a 97% probability that the variant is better than control." That is incorrect.

The p-value is the probability of observing data this extreme, or more extreme, assuming the null hypothesis is true — i.e., assuming no real difference exists between variants. It says nothing directly about the probability that your variant is better. A low p-value means the observed gap is unlikely if there were no true effect; it does not mean the null is probably false.

Fix: Replace "p = 0.03 means there's a 97% chance the variant wins" with "if there were truly no difference, we'd observe a gap this large only 3% of the time." That framing keeps you honest. For a direct probability estimate of which variant is better, Bayesian tools provide posterior probabilities — though they introduce their own assumptions.
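
A small worked example may help anchor the definition. The sketch below computes a two-sided p-value with a pooled two-proportion z-test, one common choice for conversion data; the visitor and conversion counts are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """p-value of a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Under the null both arms share one conversion rate, so pool them.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical test: 5.0% vs 5.7% conversion on 10,000 visitors per arm.
p = two_sided_p_value(conv_a=500, n_a=10000, conv_b=570, n_b=10000)
p_same = two_sided_p_value(conv_a=500, n_a=10000, conv_b=500, n_b=10000)
# Read p as: if there were truly no difference, a gap at least this large
# would appear in about this share of repeated experiments.
# It is NOT the probability that B beats A.
print(f"p = {p:.3f}")
```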

7. Too-Short Test Windows (No Full Business Cycles)

User behavior varies by day of week: B2B products often peak Monday through Wednesday, e-commerce sites on weekends. A 48-hour window that falls on your highest-traffic days samples only one slice of your audience; conversion rates in both variants reflect that slice, and the relative difference between them may not hold for the rest of the week.

The novelty effect compounds this: users sometimes respond to any change simply because it's unfamiliar, and that lift vanishes within days.

Fix: Run every test for at least one complete business cycle — at minimum seven days, typically two weeks for B2C. Check the test duration guide to estimate the right window for your traffic and MDE. It sounds slow; it's the cost of results you can act on.
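
A simple way to apply this rule is to round the planned duration up to whole business cycles. The sketch below assumes a seven-day cycle by default; the function name and inputs are illustrative and not taken from the test duration guide.

```python
import math


def duration_in_days(required_visitors, daily_visitors, cycle_days=7):
    """Days to reach the sample size, rounded up to whole business cycles."""
    raw_days = math.ceil(required_visitors / daily_visitors)
    cycles = max(1, math.ceil(raw_days / cycle_days))  # never less than one cycle
    return cycles * cycle_days


# Hypothetical: 62,000 total visitors needed at 4,000 visitors per day.
print(duration_in_days(62000, 4000))
```

Rounding up rather than down means every weekday appears the same number of times in the data.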

8. Sample Ratio Mismatch (SRM)

Sample ratio mismatch happens when the actual split between your control and variant groups differs meaningfully from the intended split. You set up a 50/50 test and after two weeks you have 22,000 visitors in control and 18,500 in variant — a 54/46 split. Something went wrong in the assignment mechanism.

SRM invalidates the test entirely. An uneven split means the two groups may differ systematically beyond the treatment — perhaps bots were filtered differently, or mobile users landed in one variant at a higher rate. Any measured effect could reflect that compositional gap, not your change.

Fix: Before reading results, verify observed visitor counts match expected proportions. A chi-squared test works: flag if p < 0.01. Most platforms run this automatically. If you detect SRM, investigate assignment logic, bot filtering, and any JavaScript that might suppress the tracking call before declaring a result.
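
For teams whose platform does not run the check automatically, a minimal sketch of the chi-squared test for a two-group split is shown below. With two groups the test has one degree of freedom, so the p-value can be computed from the standard library alone; the function name is illustrative.

```python
import math


def srm_p_value(observed_control, observed_variant, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-group traffic split."""
    total = observed_control + observed_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((observed_control - expected_control) ** 2 / expected_control
            + (observed_variant - expected_variant) ** 2 / expected_variant)
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# The 22,000 vs 18,500 split from the example above.
p = srm_p_value(22000, 18500)
has_srm = p < 0.01
print(p, has_srm)
```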

9. Treating Directional Results as Proven

A test ends with p = 0.12 and a +6% lift for the variant. The team says "it's trending positive, let's ship it." This is not how statistics works.

A p-value of 0.12 means that, if there were no real difference, a lift at least this large would appear about 12% of the time. That's not a trend; it's a result that never cleared the 95% bar you set. Shipping at that threshold means a variant with no true effect would still get shipped roughly 1 time in 8, and you won't know which ones those are.

Fix: Treat the significance threshold as a hard line. Results below it are inconclusive — not winners, not losers. Log them as hypotheses, redesign with a larger expected effect, or move on. Use AB Test Plan to structure follow-up experiments properly.

Fixed-Horizon vs. Sequential Testing

All fixes above assume a fixed-horizon approach: calculate required sample size upfront, stop exactly there. This works for most teams.

Sequential testing (always-valid inference) lets you check results at any time without inflating error rates. Methods like mSPRT build in the cost of peeking from the start. The trade-off is somewhat lower power at a given sample size, so a test that runs to its full length typically needs more traffic than its fixed-horizon equivalent. If testing velocity is constrained by run time and you have the tooling, it's worth exploring. Otherwise, fixed-horizon is simpler and sufficient.
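
To make the idea concrete, here is a rough sketch of an mSPRT with a normal mixing distribution, run on simulated tests where the variant has no effect. It is a toy model under stated assumptions: each look adds one unit-variance observation of the standardized difference, and tau_sq (the variance of the mixing distribution) is a tuning parameter chosen arbitrarily here. Production implementations handle real conversion data and variance estimation, which this does not.

```python
import math
import random


def msprt_false_positive_rate(n_looks=14, n_sims=20000, alpha=0.05,
                              tau_sq=1.0, seed=0):
    """Share of no-effect tests rejected by an mSPRT checked at every look."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cumulative = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each new observation has mean zero, variance one.
            cumulative += rng.gauss(0.0, 1.0)
            scale = 1.0 + look * tau_sq
            # Mixture likelihood ratio for a N(0, tau_sq) mixing distribution.
            likelihood_ratio = (math.exp(cumulative ** 2 * tau_sq / (2 * scale))
                                / math.sqrt(scale))
            if likelihood_ratio >= 1 / alpha:  # reject as soon as it crosses
                rejections += 1
                break
    return rejections / n_sims


rate = msprt_false_positive_rate()
print(f"false positive rate with a check at every look: {rate:.4f}")
```

Even with a check at every look, the rejection rate under the null stays below the 5% target, which is the property that makes continuous monitoring safe with this family of methods.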

The Common Thread

Nine mistakes, one root cause: launching without pre-registering the end point, primary metric, and required sample size. Answer three questions before every test starts:

What single metric determines the winner?
What sample size do I need at 80% power?
On what date or visitor count will I stop — unconditionally?

Use the AB Test Plan calculator for question two, and the test duration guide to anchor question three to your real traffic.