Statistics · 5 min read

How Long Should You Run an A/B Test?

The definitive guide to A/B test duration. Learn why minimum runtime matters, how to calculate it, and the costly mistakes teams make by stopping too early.

By AB Test Plan

"Is my test done yet?" is the most dangerous question in A/B testing. End too early and you'll ship false positives. Run too long and you're wasting traffic that could power the next experiment. Here's how to get it right.

The Short Answer

Run your test until both conditions are met:

  1. You've reached your pre-calculated sample size
  2. The test has run for at least 1-2 full business cycles (typically 7-14 days)

Never stop a test just because it shows statistical significance before reaching these thresholds.
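These two conditions are easy to encode as a guard in your experiment tooling. A minimal sketch (the function and parameter names here are our own, purely illustrative):

```python
def is_test_complete(visitors_per_arm: int, required_per_arm: int,
                     days_running: int, min_days: int = 7) -> bool:
    """A test is done only when BOTH the pre-calculated sample size
    and the minimum runtime (full business cycles) are reached."""
    return visitors_per_arm >= required_per_arm and days_running >= min_days

# "Significant" on day 3 but under-sampled: keep running.
print(is_test_complete(4_000, 15_000, 3))    # False
# Sample reached, but only 5 days in: keep running.
print(is_test_complete(15_000, 15_000, 5))   # False
# Both thresholds met: analyze and decide.
print(is_test_complete(15_000, 15_000, 10))  # True
```

Statistical significance deliberately plays no role here: it is a question you ask only after both thresholds are met.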

Why Minimum Runtime Matters

The peeking problem

If you check your test results daily and stop when you first see p < 0.05, your actual false positive rate isn't 5% — it can be as high as 20-30%.

This happens because statistical significance fluctuates randomly during a test. Early in the test, small sample sizes produce volatile results. A test might show "95% significant" on Day 3, lose significance on Day 5, and never reach it again.

The math: In a typical 14-day test with daily checks, there's roughly a 25% chance of seeing p < 0.05 at some point even when there's no real difference between variants. Peeking turns one test into many implicit tests, a sequential form of the multiple comparisons problem.
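You can verify the inflation yourself with an A/A simulation: both arms share the same true conversion rate, yet stopping at the first p < 0.05 fires far more often than 5% of the time. A self-contained sketch (traffic numbers are hypothetical):

```python
import math
import random

def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test, two-sided p-value (pooled variance)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))  # 2 * (1 - Phi(|z|))

random.seed(1)
RATE, PER_ARM_PER_DAY, DAYS, TRIALS = 0.03, 500, 14, 500
peek_hits = final_hits = 0
for _ in range(TRIALS):
    ca = cb = na = nb = 0
    seen_significant = False
    for _ in range(DAYS):  # one "peek" per day, identical variants
        na += PER_ARM_PER_DAY
        nb += PER_ARM_PER_DAY
        ca += sum(random.random() < RATE for _ in range(PER_ARM_PER_DAY))
        cb += sum(random.random() < RATE for _ in range(PER_ARM_PER_DAY))
        if two_sided_p(ca, na, cb, nb) < 0.05:
            seen_significant = True
    peek_hits += seen_significant
    final_hits += two_sided_p(ca, na, cb, nb) < 0.05

print(f"stop at first significant peek: {peek_hits / TRIALS:.0%} false positives")
print(f"single test at day {DAYS}: {final_hits / TRIALS:.0%} false positives")
```

The peeking rate lands in the 20-30% range while the single fixed-horizon test stays near 5%, exactly the gap described above.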

Day-of-week effects

User behavior varies by day of the week. For most consumer businesses:

  • Weekend vs weekday: Conversion rates can differ by 20-50%
  • Monday vs Friday: Different user intents and session lengths
  • Payday effects: Spending patterns shift around the 1st and 15th

If your test runs Monday through Thursday only, you're measuring a biased slice of your audience. Always run for complete weeks.
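A small worked example of the bias (the rates are hypothetical): if weekends convert meaningfully better, a Monday-to-Thursday test measures a baseline that doesn't match your real week:

```python
# Hypothetical rates: weekdays convert at 2.5%, weekends at 4.0%.
weekday_rate, weekend_rate = 0.025, 0.040

# Full week: 5 weekdays + 2 weekend days, traffic-weighted average.
full_week = (5 * weekday_rate + 2 * weekend_rate) / 7
mon_thu = weekday_rate  # a Mon-Thu test sees only weekday behavior

print(f"full-week baseline: {full_week:.2%}")  # 2.93%
print(f"Mon-Thu baseline:   {mon_thu:.2%}")    # 2.50%
```

A variant that helps weekday browsers but hurts weekend shoppers would look like a clean winner in the truncated test and a loser over a full week.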

Novelty and primacy effects

When you change something on your site, two things happen:

  • Novelty effect: Returning users notice the change and engage more (inflates results)
  • Primacy effect: Returning users, accustomed to the old design, are initially confused by the change and engage less (deflates results)

Both effects fade after 1-2 weeks as users acclimate. A test that runs only 3-4 days captures these transient effects rather than the true long-term impact.

How to Calculate Test Duration

Test duration is a function of your sample size and daily traffic:

Duration (days) = Total sample needed / Daily visitors

Where total sample = sample size per variation × number of variations.

Example:

  • Sample size per variation: 15,000
  • Variations: 2 (control + 1 variant)
  • Daily visitors: 3,000

Duration = 30,000 / 3,000 = 10 days

Since 10 days covers more than one full week, this meets both criteria. If the math said 5 days, you'd still run for 7 to capture a full business cycle.

Duration Quick Reference

| Daily Traffic | 10% MDE | 15% MDE | 20% MDE |
|---------------|---------|---------|---------|
| 500/day       | 8 weeks | 4 weeks | 2 weeks |
| 1,000/day     | 4 weeks | 2 weeks | 10 days |
| 2,500/day     | 12 days | 7 days  | 7 days* |
| 5,000/day     | 7 days* | 7 days* | 7 days* |
| 10,000/day    | 7 days* | 7 days* | 7 days* |

Based on a 3% baseline conversion rate, 95% significance, 80% power. An asterisk (*) means the sample size is reached before 7 days, so the minimum 7-day rule applies.

When to Stop Early (Safely)

There are two statistically valid approaches to early stopping:

1. Sequential testing

Methods like always-valid p-values (also called anytime-valid inference) are designed for continuous monitoring. They use wider confidence intervals that account for multiple looks at the data.

Tools that support sequential testing: Optimizely (Stats Engine), Eppo, and some custom implementations using the mSPRT framework.

2. Pre-planned interim analyses

You can pre-commit to checking results at specific points (e.g., at 50% and 100% of your sample size) using adjusted significance thresholds. A common choice of thresholds is the O'Brien-Fleming boundary:

| Analysis           | % of Sample | Required p-value |
|--------------------|-------------|------------------|
| Interim (1st look) | 50%         | p < 0.005        |
| Final (2nd look)   | 100%        | p < 0.048        |

This maintains an overall 5% false positive rate while allowing you to stop early if the effect is very large.
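The two-look rule is simple to mechanize. A sketch (the function name and return strings are our own, not a standard API):

```python
def obf_decision(look: str, p_value: float) -> str:
    """O'Brien-Fleming-style thresholds for one interim look at 50% of
    the sample (p < 0.005) and a final look at 100% (p < 0.048)."""
    thresholds = {"interim": 0.005, "final": 0.048}
    if p_value < thresholds[look]:
        return "significant - stop early" if look == "interim" else "significant"
    return "keep running" if look == "interim" else "inconclusive"

# p = 0.03 mid-test is NOT enough to stop under the interim threshold...
print(obf_decision("interim", 0.03))   # keep running
# ...but a dramatic effect clears it.
print(obf_decision("interim", 0.001))  # significant - stop early
# At the final look, the ordinary-looking threshold applies.
print(obf_decision("final", 0.03))     # significant
```

Note the asymmetry: the interim bar is deliberately strict, so only very large effects justify stopping early.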

The Most Expensive Mistakes

1. Calling a winner after 2 days

Your test shows a 40% lift with p = 0.03 after 48 hours. Exciting — but with only 600 visitors per variation, this is almost certainly noise. The true effect is probably much smaller, and you'll see regression to the mean when you ship it.

Cost: Shipping a "winner" that actually has no effect, then wondering why your overall metrics didn't improve.
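To see why the 2-day "winner" is suspect, compute the smallest lift a test of that size could ever call significant. Assuming a hypothetical 10% baseline, 600 visitors per arm can only flag lifts of roughly 34% or more, so any effect that registers at that size is almost guaranteed to be inflated:

```python
import math

# Roughly the smallest difference that can reach p < 0.05 with n per arm
# (two-sided z-test, variance approximated at the baseline rate).
baseline, n_per_arm = 0.10, 600  # hypothetical numbers
se = math.sqrt(2 * baseline * (1 - baseline) / n_per_arm)
min_detectable = 1.96 * se

print(f"absolute lift needed: {min_detectable:.2%}")             # 3.39%
print(f"relative lift needed: {min_detectable / baseline:.0%}")  # 34%
```

If the true effect is a realistic 5%, a significant result at this sample size overstates it several-fold, which is the regression to the mean you see after shipping.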

2. Running too long

If your test reached statistical significance at the planned sample size two weeks ago and you're still running it "to be sure," you're burning traffic. Every day a conclusive test keeps running is a day you're not running the next experiment.

Rule of thumb: Stop at your planned sample size. If results are significant, ship. If not, call it inconclusive and move on.

3. Restarting when results look bad

Halfway through your test, the variant is losing. You "restart" the test with a fresh audience. This is a form of p-hacking — you're selectively discarding data that doesn't support your hypothesis.

If a test is losing at the planned sample size, that's a valid result. Learn from it.

4. Ignoring external events

A flash sale, a viral social post, a server outage — any of these can contaminate your test data. If a major external event occurs during your test, note it. If it affected only part of the test period, consider excluding that data or extending the test.

Decision Framework

Is sample size reached?
├── NO → Keep running
└── YES
    └── Has it run for 7+ days?
        ├── NO → Keep running to 7 days
        └── YES → Statistically significant?
            ├── YES → Check guardrail metrics, then ship the winner
            └── NO → Inconclusive. Ship control.
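The same tree works well as a function in a dashboard or runbook (names and return strings are illustrative, not a standard API):

```python
def decide(sample_reached: bool, days_running: int,
           significant: bool, guardrails_ok: bool = True) -> str:
    """Walk the decision framework: sample size first, then minimum
    runtime, then significance, with guardrails gating the ship."""
    if not sample_reached:
        return "keep running"
    if days_running < 7:
        return "keep running to 7 days"
    if not significant:
        return "inconclusive - ship control"
    return "ship the winner" if guardrails_ok else "hold - guardrail regression"

print(decide(False, 3, False))   # keep running
print(decide(True, 5, True))     # keep running to 7 days
print(decide(True, 10, True))    # ship the winner
print(decide(True, 10, False))   # inconclusive - ship control
```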

Plan Your Test Duration

AB Test Plan calculates the exact duration for your test based on your baseline rate, MDE, and daily traffic — plus the projected business impact if the experiment wins.

Tags: test duration, statistical significance, peeking problem, A/B testing

Ready to plan your next A/B test?

Use AI to generate experiment ideas, build hypotheses, and calculate sample sizes.

Start Planning — Free