
A/B Testing in UX: How to Run Tests That Actually Move Conversions in 2026

How to design an A/B test that's actually useful for product decisions: hypotheses, metrics, sample size, 2026 tools, and the statistical mistakes that invalidate most tests in the wild.

CorsoUX · 11 min read

A/B testing is the most democratic method for making product decisions. It doesn't require a senior researcher, it isn't expensive, and it speaks the language business loves: numbers. It's also one of the easiest methods to get wrong: a large share of the A/B tests run inside US and UK companies are tainted by statistical errors or bad framing, and end up backing decisions the data can't actually support.

This article walks you through A/B testing done right in 2026: how to formulate a hypothesis worth testing, how to size the sample, which tools to use today (different from the ones you'd have picked five years ago), and how to spot results that look like winners but aren't.

What you'll learn:

  • What an A/B test really is and when it makes sense to run one
  • How to formulate testable hypotheses and pick a primary metric
  • The A/B testing tools most teams use in 2026
  • How to avoid the 5 most common statistical mistakes
  • Real examples of tests that reshaped famous products

What an A/B test is and what it isn't

An A/B test (or split test) compares two versions of an element — a page, a button, a copy line, a flow — by randomly showing them to two similar groups of users and measuring which one produces better results on a metric defined before the test.

The methodological heart is randomization: if you split users in a genuinely random way, the differences you observe between A and B can be attributed, with high probability, to the design change rather than to other variables. It's a controlled experiment applied to product practice.
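
As an illustration, here is a minimal sketch of the random-but-sticky assignment most testing tools perform under the hood: hash a stable user identifier together with the experiment name, so the split is effectively random across users but each user always sees the same variant. The function and names here are illustrative, not any specific tool's API.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, share_in_b: float = 0.5) -> str:
    """Deterministically assign a user to variant 'A' or 'B'.

    Hashing user_id together with the experiment name gives a stable,
    roughly uniform value in [0, 1): the same user always lands in the
    same variant, and different experiments split users independently.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return "B" if bucket < share_in_b else "A"

print(assign_variant("user-42", "checkout-cta-above-fold"))  # stable across calls
```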

What an A/B test is not:

  • It is not "try something for two weeks and see if it's better than last month." That's a before/after comparison, contaminated by a thousand uncontrolled variables.
  • It is not "show the new variant to 100 people and ask which they prefer." That's a preference test, not a behavioral A/B test.
  • It is not "ship the new feature and stare at the metrics." That's a rollout, not an experiment.

A/B tests measure real behavior, not opinions. That's their superpower.

When it makes sense to run an A/B test

Three conditions must all be true at the same time for a test to be worth running.

1. You have enough volume. An A/B test needs thousands of conversions to detect small differences (5–10%). If your site has 200 visits a day and 5 conversions, a single test would take months. Below a certain threshold, qualitative tests (usability tests with 5 people) produce more insight with less risk.

2. The expected effect is meaningful. Testing whether "changing the button color from blue to green" lifts conversions by 0.3% is technically possible but almost never economically sensible. Focus on changes that could plausibly move the target metric by 10–30% or more.

3. You have a clear primary metric. "We want to improve the experience" is not a metric. "Increase the checkout completion rate from the cart" is. One metric per test, period.

If even one of the three conditions is missing, an A/B test is probably the wrong tool. A usability study, an interview, or a progressive rollout without statistical pretenses is usually the better call.

How to design a serious A/B test

Phase 1: Formulate the hypothesis

A strong hypothesis always has this structure:

If I change [element] to [variant], then [metric] will change by [expected direction and magnitude] because [reasoning grounded in data or research].

Weak example: "Let's test a red button and see if it works better."
Strong example: "If I move the 'Place order' button above the fold on mobile, the checkout completion rate will increase by at least 15%, because session recordings show that 40% of mobile users never scroll far enough to reach the current button."

A good hypothesis has reasoning, expected magnitude, and a source for the insight. Without those three, you're testing blind.

Phase 2: Size the sample

The key question: how many visitors per variant do you need to detect an effect of X% with 95% statistical confidence?

The formula is painful to do by hand, but free online calculators (or a few lines of code, as sketched below) give you the answer in seconds.

Concrete example: if your current conversion rate is 5% and you want to detect a 1-point absolute lift (to 6%), you need roughly 8,000 visits per variant with 95% confidence and 80% statistical power for a standard two-sided test (roughly 6,400 if you pre-commit to a one-sided test).
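
If you prefer code to a web calculator, here is a minimal sketch using statsmodels' power functions; the figures are the ones from the example above, and the exact result varies slightly between calculators depending on the approximation used.

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline = 0.05   # current conversion rate
target = 0.06     # smallest lift worth detecting (1 point absolute)

# Cohen's h effect size for comparing two proportions
effect = proportion_effectsize(target, baseline)

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect,
    alpha=0.05,               # 95% confidence, two-sided
    power=0.80,               # 80% chance of detecting a real effect of this size
    alternative="two-sided",
)
print(round(n_per_variant))   # roughly 8,100 visits per variant with these numbers
```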

The important message: size before, not after. Tests you "stop as soon as it looks like someone's winning" are one of the main sources of false positives in the wild.

Phase 3: Set the test up in the tool

Pick a tool (see below), and define the following (a plain-data sketch of such a plan follows the list):

  • Target URL (or flow, for mobile/app tests)
  • Variant A (control — the current design)
  • Variant B (your hypothesis)
  • Primary metric (and only one primary)
  • Secondary metrics (2–3 max, for sanity checks)
  • Percentage of traffic routed into the test (typically 50/50)
  • Minimum duration of the test (based on the required sample)
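
Written down as plain data, such a plan might look like the sketch below; every field name here is illustrative, not taken from any particular A/B testing tool.

```python
# Hypothetical experiment plan -- field names are illustrative only.
experiment_plan = {
    "name": "checkout-cta-above-fold",
    "target": "/checkout",                       # page or flow under test
    "variants": {
        "A": "control (current design)",
        "B": "'Place order' button above the fold",
    },
    "primary_metric": "checkout_completion_rate",
    "secondary_metrics": ["add_to_cart_rate", "support_contact_rate"],
    "traffic_split": {"A": 0.5, "B": 0.5},
    "required_visits_per_variant": 8_100,        # from the sample-size calculation
    "min_duration_days": 14,                     # at least two full weekly cycles
}
```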

Phase 4: Let the test run without touching it

The golden rule: do not look at the results before the planned end. Every time you peek and make decisions based on partial data, you introduce stopping bias that invalidates the test.

Suggested minimum duration: at least 2 full weeks, even if the numbers land earlier. This covers weekly cycles (weekend users vs. weekday users), which show different behavior in almost every product.

Phase 5: Analyze and decide

When the test ends, read the results:

  • Is the difference statistically significant? (p-value < 0.05 is the classic threshold)
  • Is the difference meaningful in practice? A test can be statistically significant but show an effect too small to justify rolling out.
  • Do the secondary metrics tell the same story? If click-through goes up but bounce rate goes up more, something is wrong with the framing.

Decide: ship the winning variant, or keep the control (and revisit the hypothesis) if there's no clear winner.
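
To make the significance check concrete, here is a minimal sketch of a two-proportion z-test with statsmodels; the conversion counts are made up, and any serious A/B testing tool reports this for you.

```python
from statsmodels.stats.proportion import proportions_ztest

conversions = [410, 480]   # conversions in A and B (made-up numbers)
visitors = [8100, 8100]    # visitors exposed to each variant

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"p-value: {p_value:.3f}")   # below 0.05 -> statistically significant here

# Practical significance: look at the size of the lift, not only the p-value
rate_a, rate_b = conversions[0] / visitors[0], conversions[1] / visitors[1]
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  relative lift: {(rate_b - rate_a) / rate_a:+.1%}")
```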

A/B testing tools in 2026

The landscape has shifted a lot in the last few years. Google Optimize (the free option) was sunset in 2023, and that pushed the market toward more specialized paid solutions.

Enterprise and scale-ups

  • Optimizely — the most complete option for high-volume companies. Web, mobile, feature flags, full-stack testing. Enterprise pricing. Heavily used at US scale-ups and Fortune 500s.
  • VWO — a more accessible alternative to Optimizely, with a complete suite that includes session recording and heatmaps. Starts at a few hundred dollars a month.
  • AB Tasty — a European platform with strong GDPR and UK DPA 2018 posture, widely used by UK retailers and EU-based companies serving British and European customers.

Feature flags and server-side testing

  • LaunchDarkly — the standard for feature flagging, with server-side A/B testing built in. Designed for tech-heavy product teams and widely adopted across Silicon Valley.
  • Statsig — a more recent alternative popular with US and UK tech startups, with a generous free tier and transparent stats methodology.
  • GrowthBook — open source and self-hostable, a strong pick for teams that need full control of their data for CCPA, HIPAA, or UK DPA reasons.

Small business and prototypes

  • Convert.com — mid-market, easy to use, popular with Shopify and BigCommerce stores.
  • Microsoft Clarity — free. Not a real A/B testing tool, but provides session recording and heatmaps that feed smarter test hypotheses.

UX research focused

  • Maze — unmoderated usability tests that can work as preference tests; not a replacement for a proper behavioral A/B test.

For a deeper look at unmoderated research tools, read the guide to unmoderated testing tools.

The most common statistical mistakes

Five traps that invalidate a huge share of the tests run in US and UK companies:

1. Peeking at the results

Checking results every day and stopping the test "as soon as B looks like it's winning" is one of the fastest ways to convince yourself that a worse variant is actually better. Every early stop inflates the false-positive rate.

Fix: set the duration before the test and don't act on partial results. If you must monitor along the way, use methods designed for repeated looks (group sequential analysis, sequential tests, or Bayesian approaches).
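
To see how much damage peeking does, here is a small simulation sketch of an A/A test — both variants identical, so every "significant" result is a false positive — checked every day versus only at the planned end. The traffic numbers are invented.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
rate = 0.05                # A and B are identical: any "winner" is a false positive
daily, days, sims = 1000, 14, 2000

def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test with equal sample sizes per variant."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = np.sqrt(2 * pooled * (1 - pooled) / n)
    z = (conv_b - conv_a) / se
    return 2 * norm.sf(np.abs(z))

# cumulative daily conversions for each simulated A/A test
conv_a = rng.binomial(daily, rate, (sims, days)).cumsum(axis=1)
conv_b = rng.binomial(daily, rate, (sims, days)).cumsum(axis=1)
n = daily * np.arange(1, days + 1)

p = p_value(conv_a, conv_b, n)              # p-value at the end of each day
peeking = (p < 0.05).any(axis=1).mean()     # stop the first day it "looks significant"
fixed = (p[:, -1] < 0.05).mean()            # look only once, at the planned end

print(f"false positives with daily peeking: {peeking:.0%}")   # well above the nominal 5%
print(f"false positives at fixed duration:  {fixed:.0%}")     # close to the nominal 5%
```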

2. Testing tiny changes

Testing whether "the button should be #3A7CF5 or #3B7EF6" is a waste of time: color differences that small don't produce effects detectable at typical traffic levels. Test big hypotheses, not pixel nudges.

3. Running multiple tests on the same page at once

If you A/B test the button and A/B test the headline at the same time, the variants entangle and you can't attribute effects anymore. One test at a time per product area — or a structured multivariate test (MVT), if you have the chops to design one.

4. Ignoring guardrail metrics

A test that raises click-through can lower lead quality. A test that raises signups can raise churn. Always define guardrail metrics and watch them.

5. Ignoring the novelty effect

When you show users a new variant, many click on it simply because it's different. The novelty effect fades in 1–2 weeks. If your test runs shorter, you'll attribute to design what was really short-lived curiosity.

Real A/B tests that changed famous products

Obama 2008: the test that changed political fundraising

The 2008 Obama presidential campaign ran one of the most celebrated A/B tests in the history of digital. The signup page for the newsletter (and donations) was tested as 24 combinations: six media options (photos and videos of Obama and his family) crossed with four CTA button labels ("Sign Up", "Learn More", "Join Us Now", "Sign Up Now").

The winning combination (a family photo plus the "Learn More" button) delivered a roughly 40% improvement in the signup rate. Projected across the campaign's total traffic, that meant roughly 2.8 million extra email addresses and an estimated $60 million in additional donations, according to post-campaign analysis by Dan Siroker, then Director of Analytics and later co-founder of Optimizely.

Airbnb: the pricing display that unlocked bookings

Airbnb runs on an aggressive A/B testing culture: every meaningful product change goes through a test. One of the best-known experiments covered how search results are rendered — showing total prices (with fees) instead of per-night prices reduced cart abandonment but also reduced clicks on the top results. The trade-off was accepted because the primary metric (completed bookings) improved.

Booking.com: a thousand parallel experiments

At peak growth, Booking.com was literally running thousands of experiments in parallel — a level of sophistication only possible with huge traffic and a dedicated platform. A public lesson from their engineering blog: "most of your winning tests will be in the 1–2% range, not the 20% range. Be suspicious of results that look too good to be true."

Frequently asked questions

How long does a typical A/B test last?

It depends on volume. For a site with medium traffic (10,000 visits a day), a typical test runs 2–4 weeks. For lower volumes it can stretch to 6–8 weeks. For very high volumes (Booking, Amazon), a test can wrap in 2–3 days.
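
As a rough back-of-the-envelope check, divide the required sample by the traffic that actually reaches the tested page; the numbers below are invented for illustration.

```python
required_per_variant = 8_100          # from the sample-size calculation above
daily_visits_on_page = 1_200          # hypothetical traffic reaching the tested page
traffic_in_test = 1.0                 # share of that traffic routed into the experiment
per_variant_per_day = daily_visits_on_page * traffic_in_test / 2   # 50/50 split

days_needed = required_per_variant / per_variant_per_day
print(f"~{days_needed:.0f} days")     # ~14 days here; never go below 2 full weeks anyway
```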

Can I run A/B tests with Google Analytics?

Google Analytics 4 is not an A/B testing tool: it measures user behavior but doesn't randomize traffic between variants. You need a dedicated tool (Optimizely, VWO, Statsig, and so on) that can then feed experiment data back into GA4 for integrated analysis.

Are A/B tests and usability tests alternatives?

No, they're complementary. A usability test (5 people, moderated) reveals why a design doesn't work. An A/B test (thousands of users, behavioral) measures how much a solution works. The best product teams at companies like Spotify and Monzo use both in sequence: usability tests to generate hypotheses, A/B tests to validate them at scale.

Do I need a statistician to run A/B tests?

For simple tests, no: modern tools handle the math. For complex tests (MVT, segmentations, cross-device), consulting a statistician or data scientist makes a huge difference — both for interpreting results and for avoiding mistakes.

What's the difference between A/B testing and multivariate testing (MVT)?

An A/B test compares two variants of a single element. An MVT compares many combinations of multiple elements at the same time (e.g. 3 headlines × 3 images × 2 buttons = 18 variants). MVTs need much more traffic — typically 5–10× more than a simple A/B test.

Can I run A/B tests on small volumes without formal stats?

Yes, but be honest about what the result is. A "mini A/B test" with a few hundred users can give you qualitative signal — "B might work better" — but not statistical proof. It's closer to an extended preference test. Read the guide to preference testing for the qualitative equivalent.

Next steps

A/B testing is a powerful tool, but only when used at the right moment in the product cycle. Three practical pieces of advice:

  1. Don't start here: before quantitative tests, run qualitative user research to understand the real problems
  2. Read the full guide to user research methods to put A/B testing in context alongside other available methods
  3. Study Hick's Law and other cognitive principles to build stronger hypotheses about user behavior

In the User Research course at CorsoUX, A/B testing sits alongside interviews, moderated and unmoderated usability tests, with hands-on exercises on real products supervised by mentors who run research every day at US and UK companies.
