How Long Should You Run an A/B Test?
The definitive guide to A/B test duration. Learn why minimum runtime matters, how to calculate it, and the costly mistakes teams make by stopping too early.
"Is my test done yet?" is the most dangerous question in A/B testing. End too early and you'll ship false positives. Run too long and you're wasting traffic that could power the next experiment. Here's how to get it right.
The Short Answer
Run your test until both conditions are met:
- You've reached your pre-calculated sample size
- The test has run for at least 1-2 full business cycles (typically 7-14 days)
Never stop a test just because it shows statistical significance before reaching these thresholds.
Why Minimum Runtime Matters
The peeking problem
If you check your test results daily and stop when you first see p < 0.05, your actual false positive rate isn't 5% — it can be as high as 20-30%.
This happens because statistical significance fluctuates randomly during a test. Early in the test, small sample sizes produce volatile results. A test might show "95% significant" on Day 3, lose significance on Day 5, and never reach it again.
The math: In a typical 14-day test with daily checks, there's approximately a 25% chance of seeing p < 0.05 at some point even when there's no real difference between variants. This is a form of the multiple comparisons problem — each peek at the data is another comparison, which is why it's also known as optional stopping.
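You can see the inflation directly with a small A/A simulation — both arms are identical, yet daily peeking "finds" a winner far more than 5% of the time. This is an illustrative sketch (the function name and parameters are ours, not from any library):

```python
import random

def peeking_false_positive_rate(n_sims=500, days=14, daily_n=400,
                                p_conv=0.05, z_crit=1.96, seed=42):
    """Simulate A/A tests (no real difference) with daily peeking.

    Each day adds daily_n visitors per arm; we run a two-proportion
    z-test and 'stop' the first time |z| clears the 95% threshold.
    Returns the fraction of simulations that ever declared a winner.
    """
    rng = random.Random(seed)
    stopped_early = 0
    for _ in range(n_sims):
        conv = [0, 0]
        n = [0, 0]
        for _ in range(days):
            for arm in (0, 1):
                conv[arm] += sum(rng.random() < p_conv for _ in range(daily_n))
                n[arm] += daily_n
            pooled = (conv[0] + conv[1]) / (n[0] + n[1])
            se = (pooled * (1 - pooled) * (1 / n[0] + 1 / n[1])) ** 0.5
            diff = conv[0] / n[0] - conv[1] / n[1]
            if se > 0 and abs(diff) / se > z_crit:
                stopped_early += 1
                break
    return stopped_early / n_sims

rate = peeking_false_positive_rate()
print(f"A/A tests falsely declared a winner: {rate:.0%}")
```

With these parameters the simulated rate lands in the 20-30% range the text describes, versus the nominal 5% you'd get from a single look at the end.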
Day-of-week effects
User behavior varies by day of the week. For most consumer businesses:
- Weekend vs weekday: Conversion rates can differ by 20-50%
- Monday vs Friday: Different user intents and session lengths
- Payday effects: Spending patterns shift around the 1st and 15th
If your test runs Monday through Thursday only, you're measuring a biased slice of your audience. Always run for complete weeks.
Novelty and primacy effects
When you change something on your site, two things happen:
- Novelty effect: Returning users notice the change and engage more (inflates results)
- Primacy effect: Returning users are accustomed to the old experience, so they initially engage less with the change while they relearn it (deflates results)
Both effects fade after 1-2 weeks as users acclimate. A test that runs only 3-4 days captures these transient effects rather than the true long-term impact.
How to Calculate Test Duration
Test duration is a function of your sample size and daily traffic:
Duration (days) = Total sample needed / Daily visitors
Where total sample = sample size per variation × number of variations.
Example:
- Sample size per variation: 15,000
- Variations: 2 (control + 1 variant)
- Daily visitors: 3,000
Duration = 30,000 / 3,000 = 10 days
Since 10 days covers more than one full week, this meets both criteria. If the math said 5 days, you'd still run for a full 7 days to capture a complete business cycle.
Duration Quick Reference
| Daily Traffic | 10% MDE | 15% MDE | 20% MDE |
|---|---|---|---|
| 500/day | ~30 weeks | ~14 weeks | 8 weeks |
| 1,000/day | ~15 weeks | 7 weeks | 4 weeks |
| 2,500/day | 6 weeks | 20 days | 12 days |
| 5,000/day | 3 weeks | 10 days | 7 days* |
| 10,000/day | 11 days | 7 days* | 7 days* |
Based on a 3% baseline conversion rate, relative MDE, 95% significance (two-sided), 80% power, and total traffic split evenly across control and one variant. Asterisk = sample size is reached before 7 days, so the minimum 7-day rule sets the duration. If your cell lands in the multi-month range, the test is impractical — raise the MDE or test a higher-traffic page instead.
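Durations like these come from the standard two-proportion sample-size formula. A sketch using only the standard library (the function name is ours; commercial calculators may use slightly different approximations, so expect numbers in the same ballpark rather than exact matches):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per variation for a two-sided two-proportion z-test
    (normal approximation, pooled variance at the midpoint rate)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # relative MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2
    return ceil(n)

# 3% baseline, 20% relative MDE -> roughly 14,000 visitors per arm
print(sample_size_per_arm(0.03, 0.20))
```

Note how halving the MDE roughly quadruples the required sample — sensitivity is expensive.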
When to Stop Early (Safely)
There are two statistically valid approaches to early stopping:
1. Sequential testing
Methods like always-valid p-values (also called anytime-valid inference) are designed for continuous monitoring. They use wider confidence intervals that account for multiple looks at the data.
Tools that support sequential testing: Optimizely (Stats Engine), Eppo, and some custom implementations using the mSPRT framework.
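For intuition only, here is a minimal sketch of the mixture-SPRT idea behind always-valid p-values, under a Gaussian approximation. Everything here is an assumption for illustration — `V` (per-visitor variance of the observed difference) and `tau2` (the mixture prior variance) are tuning inputs, and production systems like Stats Engine layer considerably more machinery on top:

```python
import math

def mixture_lr(theta_hat, n, V, tau2):
    """Mixture likelihood ratio for a Gaussian mean under H0: theta = 0."""
    shrink = V / (V + n * tau2)
    expo = (n * n * tau2 * theta_hat ** 2) / (2 * V * (V + n * tau2))
    return math.sqrt(shrink) * math.exp(expo)

def always_valid_p(theta_hat, n, V, tau2, p_prev=1.0):
    """Running always-valid p-value: never increases, safe to peek anytime."""
    return min(p_prev, 1.0 / mixture_lr(theta_hat, n, V, tau2))

# No observed difference after 10,000 visitors: p stays at 1.0
print(always_valid_p(theta_hat=0.0, n=10_000, V=0.06, tau2=1e-4))
```

The key property is the `min` with the previous value: the p-value can only ratchet down, so checking it every day doesn't inflate the false positive rate the way repeated fixed-horizon tests do.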
2. Pre-planned interim analyses
You can pre-commit to checking results at specific points (e.g., at 50% and 100% of your sample size) using adjusted significance thresholds. A common choice of thresholds is the O'Brien-Fleming boundary:
| Analysis | % of Sample | Required p-value |
|---|---|---|
| Interim (1st look) | 50% | p < 0.005 |
| Final (2nd look) | 100% | p < 0.048 |
This maintains an overall 5% false positive rate while allowing you to stop early if the effect is very large.
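The table above reduces to a tiny lookup. A sketch (function and constant names are ours) of the two-look rule:

```python
# O'Brien-Fleming-style thresholds from the table above (two looks)
OBF_THRESHOLDS = {1: 0.005, 2: 0.048}

def interim_decision(p_value, look):
    """Apply the pre-committed threshold for this look (1 or 2)."""
    if p_value < OBF_THRESHOLDS[look]:
        return "stop: significant at adjusted threshold"
    # Keep going after an interim look; the final look ends the test either way
    return "continue" if look < max(OBF_THRESHOLDS) else "stop: inconclusive"

print(interim_decision(0.02, look=1))  # continue
print(interim_decision(0.02, look=2))  # stop: significant at adjusted threshold
```

Notice that p = 0.02 is not enough to stop at the halfway look — the interim bar is deliberately strict so only very large effects end the test early.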
The Most Expensive Mistakes
1. Calling a winner after 2 days
Your test shows a 40% lift with p = 0.03 after 48 hours. Exciting — but with only 600 visitors per variation, this is almost certainly noise. The true effect is probably much smaller, and you'll see regression to the mean when you ship it.
Cost: Shipping a "winner" that actually has no effect, then wondering why your overall metrics didn't improve.
2. Running too long
If your test reached statistical significance at the planned sample size two weeks ago and you're still running it "to be sure," you're burning traffic. Every day a conclusive test keeps running is a day you're not running the next experiment.
Rule of thumb: Stop at your planned sample size. If results are significant, ship. If not, call it inconclusive and move on.
3. Restarting when results look bad
Halfway through your test, the variant is losing. You "restart" the test with a fresh audience. This is a form of p-hacking — you're selectively discarding data that doesn't support your hypothesis.
If a test is losing at the planned sample size, that's a valid result. Learn from it.
4. Ignoring external events
A flash sale, a viral social post, a server outage — any of these can contaminate your test data. If a major external event occurs during your test, note it. If it affected only part of the test period, consider excluding that data or extending the test.
Decision Framework
Is sample size reached?
├── NO → Keep running
└── YES → Has it run for 7+ days?
    ├── NO → Keep running to 7 days
    └── YES → Statistically significant?
        ├── NO → Inconclusive. Ship control.
        └── YES → Guardrail metrics healthy?
            ├── NO → Investigate before shipping
            └── YES → Ship the winner
Plan Your Test Duration
AB Test Plan calculates the exact duration for your test based on your baseline rate, MDE, and daily traffic — plus the projected business impact if the experiment wins.