Services · A/B testing agency
A/B testing run with statistical rigour, not vibes.
Most A/B testing programs are statistically broken: peeking, segment hunting, calling wins on the point estimate alone, treating linked metrics as independent confirmations. We run testing programs that hold up six months after rollout, not just at the moment the dashboard goes green.
What goes wrong
Four ways A/B testing programs quietly lie to you.
Stopping the test the second it crosses 95%
Bayesian peeking is allowed, but only with rules. Continuous monitoring without sequential correction inflates false positives by two to three times. A test that hits 95% on day two and gets called is more likely to be noise than signal.
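To put a rough number on it, here is a minimal A/A simulation: both arms share the same true conversion rate, so any "win" is by definition a false positive, and the only question is how often a naive daily 95% check declares one compared with a single fixed-horizon read. The traffic volume and baseline CVR below are invented for illustration, not pulled from a client account.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def false_positive_rate(peek_daily, n_sims=2000, days=14,
                        sessions_per_day=2000, cvr=0.03):
    """A/A simulation: both arms share the same true CVR, so any
    'significant' result is by definition a false positive."""
    hits = 0
    for _ in range(n_sims):
        a = rng.binomial(sessions_per_day, cvr, size=days).cumsum()
        b = rng.binomial(sessions_per_day, cvr, size=days).cumsum()
        n = sessions_per_day * np.arange(1, days + 1)   # cumulative sessions per arm
        p_pool = (a + b) / (2 * n)
        se = np.sqrt(p_pool * (1 - p_pool) * 2 / n)
        z = (b - a) / n / se                            # daily two-proportion z-score
        sig = np.abs(z) > norm.ppf(0.975)               # naive 95% check each day
        hits += sig.any() if peek_daily else sig[-1]    # call on first cross vs fixed read
    return hits / n_sims

print("fixed-horizon false-positive rate:", false_positive_rate(peek_daily=False))
print("peek-every-day false-positive rate:", false_positive_rate(peek_daily=True))
```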
Segment hunting after the fact
Splitting the audience by device, source, country, and new-vs-returning until one of them goes green is how you generate confident-looking wins that don’t replicate. Segments are pre-registered or they don’t count.
Linked metrics counted as independent confirmations
Add-to-cart rate (ATC) and conversion rate (CVR) are mechanistically linked: the same noise event pumps both. Two metrics moving in the same direction is directional confirmation, not independent validation, yet most teams treat it as the latter.
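A small sketch of the mechanism, with invented funnel numbers: because orders come out of the add-to-cart pool, the daily ATC and CVR deltas between two identical arms move together even when there is nothing to find.

```python
import numpy as np

rng = np.random.default_rng(1)

# A/A world: no real effect. Orders come from the add-to-cart pool, so a
# lucky draw of cart-adders lifts ATC rate and CVR at the same time.
sessions, p_atc, p_buy, days = 2000, 0.08, 0.35, 500   # invented funnel rates

def daily_rates():
    atc = rng.binomial(sessions, p_atc, size=days)   # add-to-carts per day
    orders = rng.binomial(atc, p_buy)                # orders among those carts
    return atc / sessions, orders / sessions         # ATC rate, CVR

atc_a, cvr_a = daily_rates()
atc_b, cvr_b = daily_rates()

# How correlated are the two "lift" readings when there is nothing to find?
corr = np.corrcoef(atc_b - atc_a, cvr_b - cvr_a)[0, 1]
print(f"null correlation between ATC delta and CVR delta: {corr:.2f}")
```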
Calling wins on point estimate alone
The point estimate is the central best guess of a posterior whose credible interval often spans anywhere from zero to seventy percent. Shipping on the headline number without reading that interval is how you ship lifts that don’t hold up after rollout.
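For a feel of what sits behind a "+28 percent" headline early in a test, here is a minimal sketch with hypothetical order counts and a plain Beta posterior on each arm; the interval is the part most dashboards bury.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical early-test counts behind a "+28% CVR lift" headline
sessions_a, orders_a = 4200, 118
sessions_b, orders_b = 4200, 151

# Beta(1, 1) prior on each arm's CVR, posterior draws for the relative lift
post_a = rng.beta(1 + orders_a, 1 + sessions_a - orders_a, 100_000)
post_b = rng.beta(1 + orders_b, 1 + sessions_b - orders_b, 100_000)
lift = post_b / post_a - 1

point = (orders_b / sessions_b) / (orders_a / sessions_a) - 1
lo, hi = np.percentile(lift, [2.5, 97.5])

print(f"headline point estimate: {point:+.0%}")
print(f"95% credible interval:   {lo:+.0%} to {hi:+.0%}")   # roughly 0% to ~60%
print(f"probability lift > 0:    {(lift > 0).mean():.0%}")
```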
How we run it
Five rules locked at brief time, not at call time.
- 1
Pre-test power analysis
Before launch, we calculate the minimum detectable effect at your target test duration (a sketch of the calculation follows this list). If the surface doesn’t have the volume to detect a lift you’d care about, we don’t run the test.
- 2
Primary metric locked in the brief
We default to CVR unless there’s a specific reason to use a funnel-step or revenue metric. Locked before launch, not after.
- 3
Track A versus Track B
Track A (low-risk UX tests) follows a three-day stability rule. Track B (functional and commercial tests) runs a fourteen-day minimum to capture day-of-week effects. The track is picked at brief time, not at call time.
- 4
Stable signal, not magic dates
A win needs three consecutive days at or above ninety-five percent probability-to-beat-baseline, primary and secondary metrics in agreement, and a minimum order count per arm (see the sketch after this list).
- 5
Descriptive language at handoff
‘During the test we observed +28 percent CVR uplift with 96 percent confidence’, not ‘you got a +28 percent lift’. We don’t promise replication. We hardcode and watch.
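The calculation behind rule 1, sketched with a standard normal-approximation power formula and assumed traffic and baseline numbers rather than our production tooling:

```python
import math
from scipy.stats import norm

def mde_relative(baseline_cvr, sessions_per_day, days,
                 alpha=0.05, power=0.80, split=0.5):
    """Smallest relative CVR lift a two-sided test can reliably detect at the
    target duration (normal approximation, even traffic split)."""
    n_per_arm = sessions_per_day * days * split
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    delta = z * math.sqrt(2 * baseline_cvr * (1 - baseline_cvr) / n_per_arm)
    return delta / baseline_cvr

# Example: 3% baseline CVR, 6,000 sessions/day, 14-day Track B window
print(f"MDE over 14 days: {mde_relative(0.03, 6000, 14):.0%} relative lift")
# If the lift you would actually care about is smaller than that, the test doesn't run.
```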
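And the calling logic behind rule 4, sketched as Beta posteriors with a Monte Carlo probability-to-beat-baseline. The thresholds mirror the rule above; the helper names and counts are ours for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)

def prob_to_beat_baseline(orders_a, sessions_a, orders_b, sessions_b,
                          draws=50_000):
    """Monte Carlo P(CVR_B > CVR_A) under Beta(1, 1) priors on each arm."""
    a = rng.beta(1 + orders_a, 1 + sessions_a - orders_a, draws)
    b = rng.beta(1 + orders_b, 1 + sessions_b - orders_b, draws)
    return (b > a).mean()

def stable_win(daily_totals, threshold=0.95, run=3, min_orders_per_arm=100):
    """daily_totals: cumulative (orders_a, sessions_a, orders_b, sessions_b)
    per day. A win needs `run` consecutive days at or above `threshold` plus
    the order-count floor; secondary-metric agreement is checked separately."""
    streak = 0
    for oa, na, ob, nb in daily_totals:
        ok = (prob_to_beat_baseline(oa, na, ob, nb) >= threshold
              and min(oa, ob) >= min_orders_per_arm)
        streak = streak + 1 if ok else 0
        if streak >= run:
            return True
    return False

# One day's cumulative read: B leads, but the order floor is not met yet
print(prob_to_beat_baseline(80, 3000, 104, 3000))
```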
Tools we work with
Platform-agnostic. No referral fees.
We deploy through your existing testing tool, or help you pick one. The methodology matters more than the platform. These are the tools we ship through most often.
Intelligems
Preferred for Shopify. Server-side rendering, native theme integration.
GrowthBook
Open-source, self-hostable. Strong feature flag and stats engine.
Convert
Solid generalist. Good for non-Shopify ecommerce.
ABTasty
Enterprise license already in place? Fine. Otherwise expensive.
Optimizely
Works. Capable. Expensive. We don’t lead with it.
Frequently asked
Questions A/B testing buyers ask before booking.
- Do I need a testing tool already installed?
- No. Part of the engagement covers tool selection if you don’t already have one. We don’t take referral fees, so the recommendation is based on your stack and traffic volume, not on what pays us.
- How many tests can I expect per month?
- Two to four shipped experiments. More is possible on high-traffic stores; fewer is sometimes the right call when a single test needs three weeks to detect the lift that matters.
- What happens to the test code after a win?
- Hardcoded into your theme or platform during the next release window. The learnings library captures the hypothesis, the result, and the reasoning. Future hypotheses build on it.
- Do you do multivariate testing?
- Rarely. Most teams need fewer, sharper tests rather than more arms. Multivariate is the right call when interaction effects matter and you have the traffic to support eight or sixteen variants. Most ecommerce surfaces don’t.
- Can you audit our current testing program?
- Yes. The first month of every engagement is, in part, exactly that. We look at the tests you’ve called, the calling rule you used, and the results that actually replicated post-rollout. Most programs have a hidden false-positive rate above twenty percent.
You see a revenue uplift, or you don't pay.
That is the deal on every 90-day sprint we run. If the program does not produce a measurable revenue uplift by the end of the quarter, we refund the final 50% of the sprint fee. No asterisks, no vanity metrics, and no hiding behind “we ran some experiments.”
Ready when you are
Let's move your numbers.
Let's grab fifteen minutes to look at your funnel together, and we'll tell you straight whether we are a fit, with no slide deck or sales script in the way.
Prefer email? jono@impactconversion.com