How Long Should You Run an A/B Test?
The definitive guide to A/B test duration. Learn why minimum runtime matters, how to calculate it, and the costly mistakes teams make by stopping too early.
"Is my test done yet?" is the most dangerous question in A/B testing. End too early and you'll ship false positives. Run too long and you're wasting traffic that could power the next experiment. Here's how to get it right.
The Short Answer
Run your test until both conditions are met:
- You've reached your pre-calculated sample size
- The test has run for at least 1-2 full business cycles (typically 7-14 days)
Never stop a test just because it shows statistical significance before reaching these thresholds.
Why Minimum Runtime Matters
The peeking problem
If you check your test results daily and stop when you first see p < 0.05, your actual false positive rate isn't 5% — it can be as high as 20-30%.
This happens because statistical significance fluctuates randomly during a test. Early in the test, small sample sizes produce volatile results. A test might show "95% significant" on Day 3, lose significance on Day 5, and never reach it again.
The math: In a typical 14-day test with daily checks, there's approximately a 25% chance of seeing p < 0.05 at some point even when there's no real difference between variants. This is a form of the multiple comparisons problem — each peek at the data is another comparison, which is why it's also known as optional stopping.
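You can see the inflation directly with a small A/A simulation — both arms are identical, yet daily peeking "finds" a winner far more than 5% of the time. This is an illustrative sketch (the function name and parameters are ours, not from any library):

```python
import random

def peeking_false_positive_rate(n_sims=500, days=14, daily_n=400,
                                p_conv=0.05, z_crit=1.96, seed=42):
    """Simulate A/A tests (no real difference) with daily peeking.

    Each day adds daily_n visitors per arm; we run a two-proportion
    z-test and 'stop' the first time |z| clears the 95% threshold.
    Returns the fraction of simulations that ever declared a winner.
    """
    rng = random.Random(seed)
    stopped_early = 0
    for _ in range(n_sims):
        conv = [0, 0]
        n = [0, 0]
        for _ in range(days):
            for arm in (0, 1):
                conv[arm] += sum(rng.random() < p_conv for _ in range(daily_n))
                n[arm] += daily_n
            pooled = (conv[0] + conv[1]) / (n[0] + n[1])
            se = (pooled * (1 - pooled) * (1 / n[0] + 1 / n[1])) ** 0.5
            diff = conv[0] / n[0] - conv[1] / n[1]
            if se > 0 and abs(diff) / se > z_crit:
                stopped_early += 1
                break
    return stopped_early / n_sims

rate = peeking_false_positive_rate()
print(f"A/A tests falsely declared a winner: {rate:.0%}")
```

With these parameters the simulated rate lands in the 20-30% range the text describes, versus the nominal 5% you'd get from a single look at the end.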
Day-of-week effects
User behavior varies by day of the week. For most consumer businesses:
- Weekend vs weekday: Conversion rates can differ by 20-50%
- Monday vs Friday: Different user intents and session lengths
- Payday effects: Spending patterns shift around the 1st and 15th
If your test runs Monday through Thursday only, you're measuring a biased slice of your audience. Always run for complete weeks.
Novelty and primacy effects
When you change something on your site, two things happen:
- Novelty effect: Returning users notice the change and engage more (inflates results)
- Primacy effect: Returning users are accustomed to the old experience, so they initially engage less with the change while they relearn it (deflates results)
Both effects fade after 1-2 weeks as users acclimate. A test that runs only 3-4 days captures these transient effects rather than the true long-term impact.
How to Calculate Test Duration
Test duration is a function of your sample size and daily traffic:
Duration (days) = Total sample needed / Daily visitors
Where total sample = sample size per variation × number of variations.
Example:
- Sample size per variation: 15,000
- Variations: 2 (control + 1 variant)
- Daily visitors: 3,000
Duration = 30,000 / 3,000 = 10 days
Since 10 days covers more than one full week, this meets both criteria. If the math said 5 days, you'd still run for a full 7 days to capture a complete business cycle.
Duration Quick Reference
| Daily Traffic | 10% MDE | 15% MDE | 20% MDE |
|---|---|---|---|
| 500/day | ~30 weeks | ~14 weeks | 8 weeks |
| 1,000/day | ~15 weeks | 7 weeks | 4 weeks |
| 2,500/day | 6 weeks | 20 days | 12 days |
| 5,000/day | 3 weeks | 10 days | 7 days* |
| 10,000/day | 11 days | 7 days* | 7 days* |
Based on a 3% baseline conversion rate, relative MDE, 95% significance (two-sided), 80% power, and total traffic split evenly across control and one variant. Asterisk = sample size is reached before 7 days, so the minimum 7-day rule sets the duration. If your cell lands in the multi-month range, the test is impractical — raise the MDE or test a higher-traffic page instead.
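Durations like these come from the standard two-proportion sample-size formula. A sketch using only the standard library (the function name is ours; commercial calculators may use slightly different approximations, so expect numbers in the same ballpark rather than exact matches):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors per variation for a two-sided two-proportion z-test
    (normal approximation, pooled variance at the midpoint rate)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # relative MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    n = 2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / (p2 - p1) ** 2
    return ceil(n)

# 3% baseline, 20% relative MDE -> roughly 14,000 visitors per arm
print(sample_size_per_arm(0.03, 0.20))
```

Note how halving the MDE roughly quadruples the required sample — sensitivity is expensive.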
When to Stop Early (Safely)
There are two statistically valid approaches to early stopping:
1. Sequential testing
Methods like always-valid p-values (also called anytime-valid inference) are designed for continuous monitoring. They use wider confidence intervals that account for multiple looks at the data.
Tools that support sequential testing: Optimizely (Stats Engine), Eppo, and some custom implementations using the mSPRT framework.
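For intuition only, here is a minimal sketch of the mixture-SPRT idea behind always-valid p-values, under a Gaussian approximation. Everything here is an assumption for illustration — `V` (per-visitor variance of the observed difference) and `tau2` (the mixture prior variance) are tuning inputs, and production systems like Stats Engine layer considerably more machinery on top:

```python
import math

def mixture_lr(theta_hat, n, V, tau2):
    """Mixture likelihood ratio for a Gaussian mean under H0: theta = 0."""
    shrink = V / (V + n * tau2)
    expo = (n * n * tau2 * theta_hat ** 2) / (2 * V * (V + n * tau2))
    return math.sqrt(shrink) * math.exp(expo)

def always_valid_p(theta_hat, n, V, tau2, p_prev=1.0):
    """Running always-valid p-value: never increases, safe to peek anytime."""
    return min(p_prev, 1.0 / mixture_lr(theta_hat, n, V, tau2))

# No observed difference after 10,000 visitors: p stays at 1.0
print(always_valid_p(theta_hat=0.0, n=10_000, V=0.06, tau2=1e-4))
```

The key property is the `min` with the previous value: the p-value can only ratchet down, so checking it every day doesn't inflate the false positive rate the way repeated fixed-horizon tests do.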
2. Pre-planned interim analyses
You can pre-commit to checking results at specific points (e.g., at 50% and 100% of your sample size) using adjusted significance thresholds. A common choice of thresholds is the O'Brien-Fleming boundary:
| Analysis | % of Sample | Required p-value |
|---|---|---|
| Interim (1st look) | 50% | p < 0.005 |
| Final (2nd look) | 100% | p < 0.048 |
This maintains an overall 5% false positive rate while allowing you to stop early if the effect is very large.
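The table above reduces to a tiny lookup. A sketch (function and constant names are ours) of the two-look rule:

```python
# O'Brien-Fleming-style thresholds from the table above (two looks)
OBF_THRESHOLDS = {1: 0.005, 2: 0.048}

def interim_decision(p_value, look):
    """Apply the pre-committed threshold for this look (1 or 2)."""
    if p_value < OBF_THRESHOLDS[look]:
        return "stop: significant at adjusted threshold"
    # Keep going after an interim look; the final look ends the test either way
    return "continue" if look < max(OBF_THRESHOLDS) else "stop: inconclusive"

print(interim_decision(0.02, look=1))  # continue
print(interim_decision(0.02, look=2))  # stop: significant at adjusted threshold
```

Notice that p = 0.02 is not enough to stop at the halfway look — the interim bar is deliberately strict so only very large effects end the test early.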
The Most Expensive Mistakes
1. Calling a winner after 2 days
Your test shows a 40% lift with p = 0.03 after 48 hours. Exciting — but with only 600 visitors per variation, this is almost certainly noise. The true effect is probably much smaller, and you'll see regression to the mean when you ship it.
Cost: Shipping a "winner" that actually has no effect, then wondering why your overall metrics didn't improve.
2. Running too long
If your test reached statistical significance at the planned sample size two weeks ago and you're still running it "to be sure," you're burning traffic. Every day a conclusive test keeps running is a day you're not running the next experiment.
Rule of thumb: Stop at your planned sample size. If results are significant, ship. If not, call it inconclusive and move on.
3. Restarting when results look bad
Halfway through your test, the variant is losing. You "restart" the test with a fresh audience. This is a form of p-hacking — you're selectively discarding data that doesn't support your hypothesis.
If a test is losing at the planned sample size, that's a valid result. Learn from it.
4. Ignoring external events
A flash sale, a viral social post, a server outage — any of these can contaminate your test data. If a major external event occurs during your test, note it. If it affected only part of the test period, consider excluding that data or extending the test.
Decision Framework
Is sample size reached?
├── NO → Keep running
└── YES → Has it run for 7+ days?
    ├── NO → Keep running to 7 days
    └── YES → Statistically significant?
        ├── NO → Inconclusive. Ship control.
        └── YES → Guardrail metrics healthy?
            ├── NO → Investigate before shipping
            └── YES → Ship the winner
Plan Your Test Duration
AB Test Plan calculates the exact duration for your test based on your baseline rate, MDE, and daily traffic — plus the projected business impact if the experiment wins.