Stats Primer: Designing a Trustworthy A/B Test

The design decisions that make a randomized experiment trustworthy: the causal estimand, the design checklist, and the validity threats (SRM, peeking, clustering, interference).

Published Oct 30, 2025 (updated)

By Sayan Biswas

Part of the statistical-inference arc, building on p-values and CIs and z and t statistics. Its companions cover the parametric tests and non-parametric tests; all lead into sample size for A/B tests.

In the previous article, we looked at hypothesis tests as tools for deciding whether a difference is real or just random noise. But in product experimentation, we care about something deeper: did the new feature cause the improvement?

Perhaps surprisingly, the statistical machinery behind causal inference in product analytics comes straight from the z-tests and t-tests we just explored. An A/B test is hypothesis testing embedded in a randomized experimental design: the randomization supplies the causal interpretation, and the test quantifies the uncertainty.

In this article, we’ll connect these dots and explore how the statistical tests we’ve mastered become the core engines powering parametric and non-parametric A/B testing.

1. What is an A/B Test?

An A/B test is a method used to compare two versions of a variable (like a webpage, email, or feature) to determine which one performs better. In a typical setup, we test a new version (Version B) against a current version (Version A) to see if the new version produces a statistically significant improvement.

The Motivation: Making business decisions based on small differences without rigorous testing is essentially acting on random noise. The goal is to move away from guessing and towards data-driven decisions. By isolating one change and testing it, we can check whether an observed improvement is real rather than a product of randomness or a short-lived novelty effect.

1.1 The Holy Grail: A/B Testing and Causal Inference

A/B testing is fundamentally an exercise in Causal Inference. In data science, we often struggle to distinguish correlation from causation. If users who use a “Wishlist” feature spend more money, did the feature cause the spending, or do high-spending users simply like using wishlists?

A/B tests, the online form of randomized controlled trials (RCTs), resolve this through randomization.

By randomly assigning users to Group A (Control) or Group B (Treatment), we balance every observable and unobservable characteristic (age, income, motivation, etc.) in expectation across the two groups. This nuance matters: randomization does not guarantee the two realized groups are identical (chance imbalance is always possible in any single experiment). What it guarantees is a known assignment mechanism under which the groups are comparable on average, so that a difference in outcomes is not systematically confounded.
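The Wishlist question can be made concrete with a small simulation. This is a minimal sketch under made-up assumptions: the function name `simulate_wishlist`, the spend distribution, the adoption probabilities, and the true lift of 2.0 are all hypothetical and chosen only for illustration. It compares the naive difference in means when high spenders self-select into the feature against the same comparison under a coin-flip assignment.

```python
import random

def simulate_wishlist(n_users=50_000, true_lift=2.0, seed=0):
    """Compare self-selected vs randomized assignment (all numbers are hypothetical)."""
    rng = random.Random(seed)
    # Spend each user would have WITHOUT the feature.
    baseline = [rng.gauss(50, 10) for _ in range(n_users)]

    def diff_in_means(flags):
        treated = [b + true_lift for b, t in zip(baseline, flags) if t]
        control = [b for b, t in zip(baseline, flags) if not t]
        return sum(treated) / len(treated) - sum(control) / len(control)

    # Observational world: high spenders are more likely to adopt the feature.
    self_selected = [rng.random() < (0.8 if b > 50 else 0.2) for b in baseline]
    # Experimental world: a coin flip decides, independent of baseline spend.
    randomized = [rng.random() < 0.5 for _ in baseline]

    return diff_in_means(self_selected), diff_in_means(randomized)

observational, experimental = simulate_wishlist()
print(f"self-selected difference: {observational:.2f}")
print(f"randomized difference:    {experimental:.2f}  (true lift is 2.0)")
```

Under these assumptions the self-selected comparison lands far above the true lift, because it mixes the feature's effect with the pre-existing spending gap, while the randomized comparison lands close to it. Any single randomized run still carries chance imbalance, which is exactly what the standard error measures.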

To state the causal target precisely, it helps to use potential outcomes. Each user has two hypothetical outcomes, $Y(1)$ if treated and $Y(0)$ if not, of which we only ever observe one. The estimand we care about is the Average Treatment Effect:

\[\text{ATE} = \mathbb{E}[Y(1) - Y(0)].\]

What we can actually estimate from data is the observed group difference, $\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]$, where $T$ is the treatment indicator. These are not the same object: the first is a causal quantity over counterfactuals, the second is a difference between two groups of different users. Randomization is the condition that makes the second an unbiased estimate of the first, provided a few further assumptions and analysis conventions hold:

No interference (SUTVA): one user’s treatment assignment does not affect another user’s outcome (violated by network effects, shared inventory, or marketplace supply).
Consistent treatment: the “treatment” is one well-defined experience, not a moving target.
Intent-to-treat (ITT) as the default: analyze users by the group they were assigned to, not the treatment they actually received, which cleanly handles non-compliance.
Assignment vs. exposure vs. analysis populations: the users randomized, the users who actually saw the change, and the users included in the analysis can differ (through triggering or attrition), and conflating them biases the estimate.
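The step from the observed difference to the ATE can be written out in two lines. This is a sketch under two assumptions: consistency, so that the observed outcome is $Y = Y(T)$, and random assignment, so that $T$ is independent of the pair $(Y(0), Y(1))$.

\[\begin{aligned}
\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]
&= \mathbb{E}[Y(1) \mid T=1] - \mathbb{E}[Y(0) \mid T=0] && \text{(consistency)} \\
&= \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] && \text{(random assignment)} \\
&= \mathbb{E}[Y(1) - Y(0)] = \text{ATE}.
\end{aligned}\]

Without randomization the second line fails: $\mathbb{E}[Y(0) \mid T=1]$ need not equal $\mathbb{E}[Y(0) \mid T=0]$, and the gap between them is the selection bias.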

1.2 When We Can’t Randomize: A/B Testing vs. Quasi-Experiments

Sometimes, A/B testing is impossible due to ethical concerns, technical limitations, or the need to analyze historical data. In these observational studies, we cannot assume the treated and untreated groups are comparable.

To infer causation without randomization, we rely on quasi-experimental methods. Unlike A/B testing, these methods require strong (and often unverifiable) assumptions to approximate a randomized experiment.

Difference-in-Differences (DiD): This method compares the change in outcomes over time between a treated group and a control group. Instead of comparing the groups directly (which might be different), we compare their trend lines.
- Example: Comparing traffic in a city that launched a marketing campaign vs. a similar city that didn’t, before and after the launch.
- The Catch: It relies on the “Parallel Trends Assumption”, that without the treatment, both groups would have moved in the same way.
Regression Discontinuity Design (RDD): This is used when treatment is assigned based on a strict cutoff (e.g., a scholarship given to students with a GPA above 3.5). RDD compares students just above the cutoff (3.51) to those just below (3.49).
- The Logic: Students right at the margin are likely very similar, so the cutoff acts as a “local” A/B test.
- The Catch: Results are only valid for units near the cutoff, not the entire population.
Instrumental Variables (IV): This technique uses a third variable (the instrument) that shifts the treatment but affects the outcome only through the treatment, and is itself unrelated to the unobserved confounders. It acts as a “natural experiment” that induces some as-good-as-random variation in the treatment.
- Example: Using a lottery win as an instrument to study the effect of wealth on health. The lottery changes wealth at random, breaking the link between wealth and unobserved factors like ambition.
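Of the three methods, difference-in-differences has the simplest arithmetic, so a tiny sketch may help. The function name `did_estimate` and the traffic figures are made up for illustration (they echo the two-city marketing example), and the result is only a causal estimate if the parallel-trends assumption holds.

```python
def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-period, two-group difference-in-differences on group averages."""
    treated_change = treated_post - treated_pre   # effect + shared trend
    control_change = control_post - control_pre   # shared trend only (by assumption)
    return treated_change - control_change

# Hypothetical average daily visits before and after the campaign launch.
campaign_city = {"pre": 1000, "post": 1300}
similar_city = {"pre": 900, "post": 1050}

naive_post_gap = campaign_city["post"] - similar_city["post"]
did = did_estimate(campaign_city["pre"], campaign_city["post"],
                   similar_city["pre"], similar_city["post"])
print(f"naive post-launch gap: {naive_post_gap}")
print(f"difference-in-differences: {did}")
```

The naive post-launch gap mixes the campaign effect with the fact that the two cities started at different levels; the difference-in-differences removes both the level gap and the shared trend.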

1.3 What Makes an A/B Test Trustworthy: A Design Checklist

Randomization gives us a license to infer causation, but only if the experiment is designed and run correctly. Before analyzing anything, a trustworthy test pins down, in advance:

The decision. What action does this test inform (ship, iterate, roll back)? Start from the decision, not the test.
Estimand and unit. The causal quantity (usually the ATE on a primary metric) and the randomization unit (user, session, account, cluster). The analysis unit should match the randomization unit; if it does not, the standard errors must account for the clustering.
Metrics. One primary metric, a small set of guardrail metrics (latency, cancellations, complaints) that must not regress, and diagnostic metrics for debugging.
Minimum detectable effect (MDE). The smallest effect worth acting on, chosen from business value before seeing data.
Sample size and duration. Enough traffic to detect the MDE at the chosen power, run over a window that covers weekday/weekend cycles and novelty effects (typically at least one to two full weeks).
A pre-specified analysis and stopping rule, so the test is not silently re-optimized after the fact.
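To give a feel for how the MDE drives traffic requirements, here is a rough sketch using the standard normal-approximation formula for comparing two proportions, $n \approx (z_{1-\alpha/2} + z_{1-\beta})^2 \, [p_1(1-p_1) + p_2(1-p_2)] / (p_2 - p_1)^2$ per arm. The function name `users_per_arm`, the 10% baseline, and the defaults (two-sided $\alpha = 0.05$, 80% power) are assumptions for illustration; the sample-size post treats the question properly.

```python
import math
from statistics import NormalDist

def users_per_arm(baseline_rate, mde_abs, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect an absolute lift of mde_abs."""
    p1 = baseline_rate
    p2 = baseline_rate + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)

# Hypothetical planning inputs: 10% baseline conversion, two candidate MDEs.
for mde in (0.01, 0.005):
    print(f"MDE {mde:.3f}: about {users_per_arm(0.10, mde):,} users per arm")
```

Under these inputs a one-point absolute MDE needs a little under 15,000 users per arm, and halving the MDE roughly quadruples that, which is why the MDE should come from business value rather than from whatever the available traffic happens to support.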

1.4 Threats to Validity

Most failed experiments fail for design reasons, not arithmetic. The recurring threats:

Sample-ratio mismatch (SRM): if a 50/50 split arrives as 51.8/48.2 on large traffic, the randomization or logging is almost certainly broken and no downstream p-value is trustworthy. Check the assignment counts with a chi-square goodness-of-fit test before anything else.
Peeking: repeatedly checking a fixed-horizon test and stopping the moment it crosses 0.05 inflates the false-positive rate well above 5%. Either fix the horizon in advance or use a valid sequential method (always-valid p-values, group-sequential boundaries).
Novelty and primacy effects: users react to change itself, so early results can over- or under-state the steady-state effect. This is one reason duration matters.
Clustering and the unit of analysis. The most common silent error. If users are randomized but we analyze sessions (or events) as if they were independent, we understate the standard error, often badly, and manufacture significance. With one user contributing many correlated sessions, the effective sample size is closer to the number of users than the number of sessions; use a cluster-robust standard error or aggregate to the user level first.
Interference and contamination: spillovers between users (marketplaces, social features, shared caches) violate SUTVA and bias the estimate.
Attrition and missing outcomes: differential dropout between arms reintroduces the very confounding randomization removed.
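The SRM check in the list above is short enough to sketch with the standard library alone. The function name `srm_check` and the strict threshold `alpha=0.001` are assumptions for illustration (a stricter-than-usual cutoff keeps a routine check from raising false alarms); `scipy.stats.chisquare` computes the same statistic if SciPy is available.

```python
import math

def srm_check(n_control, n_treatment, expected_ratio=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test (1 degree of freedom) on assignment counts."""
    total = n_control + n_treatment
    exp_control = total * expected_ratio
    exp_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < alpha

# The same 51.8/48.2 split at two hypothetical traffic levels.
for counts in ((51_800, 48_200), (518, 482)):
    chi2, p, flagged = srm_check(*counts)
    print(f"{counts}: chi2={chi2:.2f}, p={p:.3g}, SRM flagged={flagged}")
```

The same percentage split is damning at 100,000 users and unremarkable at 1,000, which is why the threat is phrased in terms of large traffic: the test, not the eyeballed ratio, decides whether the imbalance is beyond chance.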

A tempting but invalid analysis. Suppose we randomize 20,000 users 50/50 and each visits several times, for 120,000 sessions total. It is tempting to pool all sessions and run a two-proportion z-test on session-level conversion, which can yield a tiny p-value. This is pseudoreplication: the sessions from one user are correlated, so treating them as 120,000 independent trials overstates the information in the data. The fix is to compute each user’s conversion rate (or fit a cluster-robust / mixed model) and analyze at the user level, the unit we actually randomized. The corrected standard error is larger, sometimes several times larger when users contribute many highly correlated sessions, and the “significant” result can evaporate.
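The size of the gap can be checked by simulation. The sketch below mirrors the scenario above (10,000 users per arm, 6 sessions each, 120,000 sessions in total) with no true treatment effect. The Beta(0.5, 4.5) distribution of per-user conversion propensity is an assumption chosen only to create within-user correlation, and the helper names are hypothetical.

```python
import math
import random

def simulate_arm(n_users, sessions_per_user, rng):
    """One list of 0/1 session outcomes per user; propensity varies by user."""
    arm = []
    for _ in range(n_users):
        p_user = rng.betavariate(0.5, 4.5)  # assumed heterogeneity, mean rate 10%
        arm.append([1 if rng.random() < p_user else 0
                    for _ in range(sessions_per_user)])
    return arm

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def naive_session_se(arm_a, arm_b):
    """SE of the rate difference if every session were an independent trial."""
    total = 0.0
    for arm in (arm_a, arm_b):
        sessions = [s for user in arm for s in user]
        p = sum(sessions) / len(sessions)
        total += p * (1 - p) / len(sessions)
    return math.sqrt(total)

def user_level_se(arm_a, arm_b):
    """SE of the difference in mean per-user rates (the randomized unit)."""
    total = 0.0
    for arm in (arm_a, arm_b):
        rates = [sum(user) / len(user) for user in arm]
        total += sample_variance(rates) / len(rates)
    return math.sqrt(total)

rng = random.Random(42)
control = simulate_arm(10_000, 6, rng)
treatment = simulate_arm(10_000, 6, rng)  # same distribution: no true effect
ratio = user_level_se(control, treatment) / naive_session_se(control, treatment)
print(f"user-level SE is {ratio:.2f}x the naive session-level SE")
```

Because every user here has the same number of sessions, the user-level and session-level point estimates coincide and only the standard error changes. The inflation factor grows with the number of sessions per user and with how strongly a user's sessions are correlated, so heavier or more heterogeneous usage than assumed here widens the gap.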

2. Causal Inference Landscape

The tree below gives a concise (but not exhaustive) map of the well-known causal inference frameworks.

    Causal Inference
    │
    ├── 1. Randomized Experiments (True Experiments)
    │       └── A/B Testing (Online Controlled Experiments)
    │             │
    │             ├── Parametric A/B Tests
    │             │       ├── 1 and 2-Sample Z-Tests
    │             │       ├── 1 and 2-Proportion Z-Tests
    │             │       ├── 1-Sample t-Test
    │             │       ├── 2-Sample t-Test (Welch)
    │             │       └── ANOVA / Chi-Square tests
    │             │
    │             └── Non-Parametric A/B Tests
    │                     ├── Permutation Test
    │                     ├── Bootstrap Test (resampling-based)
    │                     └── Kolmogorov–Smirnov (KS) Test
    │
    └── 2. Quasi-Experiments (When Randomization Is Not Possible)
            ├── Difference-in-Differences (DiD)
            │      └── Uses time variation + treated/untreated groups
            ├── Regression Discontinuity Design (RDD)
            │      └── Uses sharp or fuzzy cutoffs as treatment assignment
            └── Instrumental Variables (IV)
                └── Uses external “instrument” that influences treatment but not outcome directly

What Lies Ahead: We covered z-tests and t-tests in the previous post. The remaining tests split naturally into two companion posts: the parametric tests (ANOVA and chi-square, with the Bonferroni correction for multiple comparisons) and the non-parametric tests (permutation and bootstrap).

Where This Leads

We now have the design discipline to run and interpret a trustworthy randomized experiment. The threads continue in the companion posts and beyond:

The specific tests. The parametric tests (ANOVA and chi-square, with the Bonferroni correction) and the non-parametric tests (permutation, bootstrap) are worked through in the two sibling posts.
How much traffic do we actually need to detect the effect that matters? That is the sample-size and power question.
What can we do when assignment was not randomized? The quasi-experimental methods sketched above (DiD, RDD, IV) are developed in the causal-inference post that follows.
