
In a previous post, we explored how large samples allow us to perform A/B testing based on the z-test and the normal distribution. But what happens when your sample size is small? Can you still perform A/B testing?

The answer is yes — but you need to switch tools. In this post, we’ll dive into how to perform A/B testing using the t-test, the theory behind it, and why it’s the right approach when your data is limited.


Workflow for A/B testing with t-Test

The workflow for A/B testing with the t-test is similar to the one for A/B testing with the z-test. You can find the workflow in my previous post.

Theory for A/B Testing with t-Test

🎯 Problem Setup

Suppose you’re testing a new product layout, and your metric of interest is average time on page. You have:

  • Group A (control): current layout
  • Group B (variant): new layout

Suppose:

  • \(\mu_A\) and \(\sigma_A\) are the true average time on page and standard deviation in A
  • \(\mu_B\) and \(\sigma_B\) are the true average time on page and standard deviation in B

These are unknown parameters we are trying to make inferences about using sample data:

  • Group A has \(n_A\) users with average time on page \(\bar{\mu}_A\) and standard deviation \(s_A\)
  • Group B has \(n_B\) users with average time on page \(\bar{\mu}_B\) and standard deviation \(s_B\)

We use \(\bar{\mu}_A\) and \(\bar{\mu}_B\) to estimate \(\mu_A\) and \(\mu_B\). We also use hypothesis testing to decide if the difference in the observed average times on page is statistically significant.

🔍 Step 1: Define Hypotheses

  • Null hypothesis \(H_0\): No difference in the average time on page, i.e.,
\[H_0: \mu_A = \mu_B\]
  • Alternative hypothesis \(H_1\): Variant B performs differently, i.e.,
\[H_1: \mu_A \neq \mu_B \,\, \text{(two-sided test)}\]

or if B is better, i.e.,

\[H_1: \mu_B > \mu_A \,\, \text{(one-sided test)}\]

⚙️ Step 2: Estimated Standard Error (SE) Under \(H_0\)

SE is a measure of the uncertainty or variability in the difference between the two observed average times on page \(\bar{\mu}_A\) and \(\bar{\mu}_B\). It tells you how much the observed difference between groups A and B might vary just due to random sampling.

The true SE for the difference in the observed average time on page under \(H_0\) is:

\[\text{SE} = \sqrt{ \text{Var} (\bar{\mu}_B - \bar{\mu}_A ) } = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}\]

where \(\text{Var} (\bar{\mu}_B - \bar{\mu}_A ) = \text{Var} (\bar{\mu}_B ) + \text{Var} (\bar{\mu}_A )\) (due to the independence between \(\bar{\mu}_A\) and \(\bar{\mu}_B\)), \(\text{Var} (\bar{\mu}_A ) = \frac{\sigma_A^2}{n_A}\), and \(\text{Var} (\bar{\mu}_B ) = \frac{\sigma_B^2}{n_B}\).

However, this true SE is only theoretical because \(\sigma_A\) and \(\sigma_B\) are unknown.

Instead, the estimated SE for the difference in the observed average time on page is computed using the sample standard deviations \(s_A\) and \(s_B\) in place of \(\sigma_A\) and \(\sigma_B\):

\[\text{SE} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\]
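
To make this concrete, here is a minimal sketch of the estimated SE computation in Python. The time-on-page values are the same illustrative samples used in the full example later in this post:

import numpy as np

# Illustrative time-on-page samples (seconds), same as the example later in this post
group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])
n_A, n_B = len(group_A), len(group_B)

# Sample variances (ddof=1 gives the unbiased estimates s_A^2 and s_B^2)
s_A2 = group_A.var(ddof=1)
s_B2 = group_B.var(ddof=1)

# Estimated SE of the difference in sample means
se = np.sqrt(s_A2 / n_A + s_B2 / n_B)
print(f"Estimated SE: {se:.4f}")  # about 6.11 for these values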

🧮 Step 3: T-Statistic Under \(H_0\)

The t-statistic tells us how far the observed difference in average time on page is from zero, using the standard error as the unit, assuming that \(H_0\) is true:

\[t = \frac{\bar{\mu}_B - \bar{\mu}_A}{\text{SE}}\]

So,

  • If \(t\) is close to 0, the observed difference is what we’d expect from random chance
  • If \(t\) is far from 0, the difference is larger than what we’d expect from chance, so it might be statistically significant
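
Continuing the sketch above, the t-statistic is just the observed difference in sample means divided by the estimated SE (same illustrative data):

import numpy as np

group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])
n_A, n_B = len(group_A), len(group_B)

se = np.sqrt(group_A.var(ddof=1) / n_A + group_B.var(ddof=1) / n_B)
t_stat = (group_B.mean() - group_A.mean()) / se  # how many SEs the observed difference is from 0
print(f"t-statistic: {t_stat:.4f}")  # about 4.97 for these values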

📉 Step 4: Compute P-Value Under \(H_0\)

Just knowing how many standard errors away the computed t-score is isn’t enough — we want to quantify how likely it is to observe such a result by chance. Therefore, we need to compute the p-value, which is the probability of obtaining a result as extreme as (or more extreme than) your observed data, assuming that the null hypothesis \(H_0\) is true.

To do this, given the t-score formula above, we use the t-distribution (instead of the normal distribution) to model this probability because we have extra uncertainty due to estimating standard deviations from small samples.

The assumption for using the t-distribution is that the data is approximately normal, so that the standard deviations computed from small samples (e.g., \(s_A\)) are good estimates of the true standard deviations (e.g., \(\sigma_A\)).

Unlike the normal distribution (which has a fixed shape), the t-distribution changes shape depending on the degrees of freedom (DOF). The t-distribution accounts for the extra uncertainty that arises when we estimate the population standard deviation from a small sample. That uncertainty depends on the sample size — and that’s where the DOF comes in.

  • Smaller sample size → smaller DOF → more uncertainty → heavier tails.
  • Larger sample size → larger DOF → less uncertainty → closer to standard normal distribution.

Therefore, choosing the correct DOF is critical for an accurate p-value.

This is the Welch–Satterthwaite formula for DOF, which is commonly used in practice:

\[\text{DOF} = \frac{ \Big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \Big)^2 }{ \frac{ \Big( \frac{s_A^2}{n_A} \Big)^2 }{n_A - 1} + \frac{ \Big( \frac{s_B^2}{n_B} \Big)^2 }{n_B - 1} }\]

(This DOF formula will be explained in more detail at the end of this post)
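
As a quick sanity check, here is a small sketch of the Welch–Satterthwaite formula in code, again using the illustrative samples from the earlier sketches:

import numpy as np

group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])
n_A, n_B = len(group_A), len(group_B)

# Per-group variance of the sample mean: s^2 / n
v_A = group_A.var(ddof=1) / n_A
v_B = group_B.var(ddof=1) / n_B

# Welch–Satterthwaite degrees of freedom
dof = (v_A + v_B) ** 2 / (v_A ** 2 / (n_A - 1) + v_B ** 2 / (n_B - 1))
print(f"DOF: {dof:.2f}")  # close to 8 for these values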

We use the t-score to find a p-value from the t-distribution.

The p-value answers this question: “If the null hypothesis \(H_0\) is true, what is the probability of seeing a result this extreme or more extreme just by chance?”

Mathematically, p-value is the area under the curve of the t-distribution beyond your t-score.

  • For a two-tailed test:
\[p_{value} = 2 P (T > |t|)\]
  • For a one-tailed test (e.g., \(H_1: \mu_B > \mu_A\)):
\[p_{value} = P (T > t)\]

where \(P (.)\) is the probability that is computed using a t-distribution with the DOF we calculated earlier.
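
In code, the p-value can be read off the t-distribution’s survival function. This is a minimal sketch with scipy.stats, plugging in the t-statistic and DOF computed in the sketches above:

from scipy.stats import t as t_dist

t_stat = 4.9736  # illustrative values from the earlier sketches
dof = 8.0

p_two_sided = 2 * t_dist.sf(abs(t_stat), df=dof)  # 2 * P(T > |t|)
p_one_sided = t_dist.sf(t_stat, df=dof)           # P(T > t), for H1: mu_B > mu_A
print(f"two-sided p-value: {p_two_sided:.4f}")
print(f"one-sided p-value: {p_one_sided:.4f}")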

✅ Step 5: Decide to reject \(H_0\) or not

  • If \(p_{value} < \alpha\) (commonly \(0.05\)), we reject \(H_0\) and conclude that the difference is statistically significant.
  • Otherwise, we fail to reject \(H_0\).
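
The decision step itself is a one-line comparison against the significance level chosen before the experiment:

alpha = 0.05       # significance level chosen in advance
p_value = 0.0011   # illustrative p-value from the steps above

if p_value < alpha:
    print("Reject H0: the difference is statistically significant.")
else:
    print("Fail to reject H0: the difference is not statistically significant.")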

Implementing an Example A/B Test in Python

Let’s simulate a basic A/B test for the average time on page.

from scipy.stats import ttest_ind

# Simulated time-on-page data (in seconds)
group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]

# Welch's t-test (does not assume equal variances)
t_stat, p_value = ttest_ind(group_B, group_A, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

Output:

t-statistic: 4.9736
p-value: 0.0011

With a p-value of 0.0011, well below 0.05, we reject the null hypothesis and conclude that the difference is statistically significant, with Group B showing a higher average time on page.
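
Note that ttest_ind performs a two-sided test by default. If you planned a one-sided test (e.g., \(H_1: \mu_B > \mu_A\)), SciPy 1.6.0 and later accept an alternative argument:

# One-sided Welch's t-test (requires SciPy >= 1.6)
t_stat, p_value = ttest_ind(group_B, group_A, equal_var=False, alternative="greater")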

🧪 When and Why to Use the t-Test in A/B Testing

✅ Use the t-test when:

  • Comparing Means: Use the t-test when your goal is to compare the average performance of two groups (e.g., average time on page or purchase amount).
  • Unknown Population Variance: The population standard deviation is usually unknown, so you need to estimate it from your sample.
  • Small to Moderate Sample Sizes: While a common rule of thumb is to use the t-test when the sample size is less than 30 per group, it remains useful even with slightly larger samples when the population variance is unknown.
  • Approximately Normal Data: The t-test assumes that the underlying data is approximately normally distributed. This assumption is particularly important when the sample size is very small.

❌ Don’t use the t-test when:

  • Comparing Proportions Directly: While a t-test can technically be used in some transformed or regression-based comparisons of proportions, in standard A/B testing practice we typically use a z-test for comparing two proportions (like conversion rates), because proportions follow a binomial distribution. Proportions are not continuous metrics like those compared with a t-test, and their variance depends on the proportion itself. For very small samples, Fisher’s exact test is more appropriate.
  • Severely Non-Normal Data in Very Small Samples: When the data is heavily skewed or not normally distributed and the sample size is extremely small, the t-test may not be reliable because the standard deviation estimates can be biased or unstable. In such cases, consider non-parametric alternatives like the Mann-Whitney U test (see the sketch after this list).
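
For reference, here is a minimal sketch of the Mann-Whitney U test in scipy.stats, run on the same illustrative time-on-page samples as the example above:

from scipy.stats import mannwhitneyu

group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]

# Non-parametric rank-based test; makes no normality assumption
u_stat, p_value = mannwhitneyu(group_B, group_A, alternative="two-sided")
print(f"U statistic: {u_stat}, p-value: {p_value:.4f}")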

💡 Real-World Advice:

The t-test is usually safe and practical in A/B testing, product analytics, and experiments as long as your sample isn’t too tiny or your data isn’t wildly non-normal.

Summary

The t-test is a powerful tool for A/B testing when your sample size is small and you’re comparing means. It accounts for the added uncertainty in small datasets, allowing you to make data-informed decisions — even when data is limited.


🚀 The code for this example is available here.

For further inquiries or collaboration, please contact me at my email.


DOF Formula Explained:

Recall that each group A or B has its own sample size, estimated variance of the average time on page, and DOF.

  • Group A has size \(n_A\), estimated variance \(s_A^2\) and \(n_A-1\) DOF.
  • Group B has size \(n_B\), estimated variance \(s_B^2\) and \(n_B-1\) DOF.

The DOF formula tells us: How much combined uncertainty we have from both A and B.

🎯 Numerator: It is the square of the estimated variance of the difference between the two groups’ average times on page.

🔽 Denominator: It represents the combined uncertainty in estimating that variance, based on the degrees of freedom of each variance estimate of each group.

Therefore, the DOF formula is saying: How stable is our estimate of the variance? The more stable (i.e., less relative uncertainty), the higher the degrees of freedom.

📊 Interpretation

  • If both groups have large samples and similar variances, the ratio gives a large DOF, and the t-distribution looks almost normal.
  • If one group has high variance or low sample size, the denominator is bigger → DOF is smaller → heavier tails in the t-distribution.
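
To see this behavior numerically, here is a small sketch that recomputes the Welch DOF for a few hypothetical combinations of sample sizes and variances (all values made up for illustration):

def welch_dof(s2_A, n_A, s2_B, n_B):
    """Welch–Satterthwaite degrees of freedom from sample variances and sizes."""
    v_A, v_B = s2_A / n_A, s2_B / n_B
    return (v_A + v_B) ** 2 / (v_A ** 2 / (n_A - 1) + v_B ** 2 / (n_B - 1))

# Hypothetical scenarios: (variance A, n_A, variance B, n_B)
scenarios = [
    (100, 50, 100, 50),   # large, balanced samples with similar variances -> DOF near 98
    (100, 5, 100, 5),     # small, balanced samples -> DOF near 8
    (100, 50, 1000, 5),   # one small, high-variance group -> DOF near 4
]
for s2_A, n_A, s2_B, n_B in scenarios:
    print(f"n_A={n_A}, n_B={n_B}, DOF={welch_dof(s2_A, n_A, s2_B, n_B):.1f}")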