In a previous post, we explored how large samples allow us to perform A/B testing based on the z-test and the normal distribution. But what happens when your sample size is small? Can you still perform A/B testing?
The answer is yes, but you need to switch tools. In this post, we’ll dive into how to perform A/B testing with the t-test, the theory behind it, and why it’s the right approach when your data is limited.
The workflow for A/B testing with the t-test is similar to that for A/B testing with the z-test; you can find the full workflow in my previous post.
Suppose you’re testing a new product layout, and your metric of interest is average time on page. You have:
- Group A (current layout): \(n_A\) users, with observed average time on page \(\bar{\mu}_A\) and sample standard deviation \(s_A\)
- Group B (new layout): \(n_B\) users, with observed average time on page \(\bar{\mu}_B\) and sample standard deviation \(s_B\)

These are the unknown parameters we are trying to make inferences about using sample data:
- \(\mu_A\), \(\mu_B\): the true average times on page under each layout
- \(\sigma_A\), \(\sigma_B\): the true standard deviations of time on page in each group

We use \(\bar{\mu}_A\) and \(\bar{\mu}_B\) to estimate \(\mu_A\) and \(\mu_B\), and we use hypothesis testing to decide whether the difference in the observed average times on page is statistically significant.
🔍 Step 1: Define Hypotheses
We want to test whether the new layout makes no difference, i.e.,
\[H_0: \mu_B = \mu_A\]
or if B is better, i.e.,
\[H_1: \mu_B > \mu_A \,\, \text{(one-sided test)}\]
⚙️ Step 2: Estimated Standard Error (SE) Under \(H_0\)
SE is a measure of the uncertainty, or variability, in the difference between the two observed average times on page \(\bar{\mu}_A\) and \(\bar{\mu}_B\). It tells you how much the observed difference between groups A and B might vary just due to random sampling.
The true SE for the difference in the observed average time on page under \(H_0\) is:
\[\text{SE} = \sqrt{ \text{Var} (\bar{\mu}_B - \bar{\mu}_A ) } = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}\]where \(\text{Var} (\bar{\mu}_B - \bar{\mu}_A ) = \text{Var} (\bar{\mu}_B ) + \text{Var} (\bar{\mu}_A )\) (due to the independence between \(\bar{\mu}_A\) and \(\bar{\mu}_B\)), \(\text{Var} (\bar{\mu}_A ) = \frac{\sigma_A^2}{n_A}\), and \(\text{Var} (\bar{\mu}_B ) = \frac{\sigma_B^2}{n_B}\).
However, this true SE is only theoretical because \(\sigma_A\) and \(\sigma_B\) are unknown.
Instead, the estimated SE for the difference in the observed average times on page is computed using the sample standard deviations \(s_A\) and \(s_B\) as estimates of \(\sigma_A\) and \(\sigma_B\):
\[\text{SE} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\]
🧮 Step 3: T-Statistic Under \(H_0\)
The t-statistic tells us how far the observed difference in average time on page is from zero, using the standard error as the unit, assuming that \(H_0\) is true:
\[t = \frac{\bar{\mu}_B - \bar{\mu}_A}{\text{SE}}\]
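To make Steps 2 and 3 concrete, here is a minimal NumPy sketch that computes the estimated SE and the t-statistic, using the same simulated time-on-page samples as the full example later in this post:

```python
import numpy as np

# Simulated time-on-page data (in seconds), same as the example below
group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])

# Sample means and sample variances (ddof=1 gives the unbiased estimator)
mean_A, mean_B = group_A.mean(), group_B.mean()
var_A, var_B = group_A.var(ddof=1), group_B.var(ddof=1)

# Step 2: estimated standard error of the difference in means
SE = np.sqrt(var_A / len(group_A) + var_B / len(group_B))

# Step 3: t-statistic under H0
t_stat = (mean_B - mean_A) / SE
print(f"SE = {SE:.4f}, t = {t_stat:.4f}")  # SE ≈ 6.1123, t ≈ 4.9736
```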
📉 Step 4: Compute P-Value Under \(H_0\)
Just knowing how many standard errors away the computed t-score is isn’t enough; we want to quantify how likely it is to observe such a result by chance. Therefore, we compute the p-value, which is the probability of obtaining a result as extreme as (or more extreme than) your observed data, assuming that the null hypothesis \(H_0\) is true.
To do this, given the t-score formula above, we use the t-distribution (instead of the normal distribution) to model this probability because we have extra uncertainty due to estimating standard deviations from small samples.
The assumption for using the t-distribution is that the data is approximately normal, so that the standard deviations computed from small samples (e.g., \(s_A\)) are good estimates of the true standard deviations (e.g., \(\sigma_A\)).
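If you want a rough sanity check of this assumption on a small sample, one option is a normality test such as Shapiro–Wilk (a sketch; with tiny samples the test has little power, so treat it as a hint, not proof):

```python
from scipy.stats import shapiro

# Hypothetical small sample of time-on-page values (seconds)
sample = [120, 130, 115, 123, 140]

stat, p = shapiro(sample)
# A large p-value means we cannot reject normality; with tiny samples
# this test has low power, so treat it as a rough check only
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p:.4f}")
```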
Unlike the normal distribution (which has a fixed shape), the t-distribution changes shape depending on the degrees of freedom (DOF). The t-distribution accounts for the extra uncertainty that arises when we estimate the population standard deviation from a small sample. That uncertainty depends on the sample size, and that’s where the DOF comes in.
Therefore, choosing the correct DOF is critical for an accurate p-value.
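A quick numerical illustration of this dependence: the upper-tail probability beyond the same t value shrinks toward the normal-distribution value as the DOF grows (illustrative sketch):

```python
from scipy.stats import t, norm

# Tail probability P(T > 2) approaches the normal value as DOF grows,
# reflecting the t-distribution's heavier tails at small DOF
for dof in [3, 10, 30, 100]:
    print(f"DOF = {dof:>3}: P(T > 2) = {t.sf(2, dof):.4f}")
print(f"Normal:    P(Z > 2) = {norm.sf(2):.4f}")
```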
This is the Welch–Satterthwaite formula for DOF, which is commonly used in practice:
\[\text{DOF} = \frac{ \Big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \Big)^2 }{ \frac{ \Big( \frac{s_A^2}{n_A} \Big)^2 }{n_A - 1} + \frac{ \Big( \frac{s_B^2}{n_B} \Big)^2 }{n_B - 1} }\](This DOF formula will be explained in more detail at the end of this post)
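To make the formula concrete, here is a small helper (a sketch; the function name welch_dof is mine) applied to the simulated samples used in the example later in this post:

```python
import numpy as np

def welch_dof(s_a, s_b, n_a, n_b):
    """Welch–Satterthwaite degrees of freedom from sample std devs and sizes."""
    va, vb = s_a**2 / n_a, s_b**2 / n_b  # estimated variance of each sample mean
    return (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

# Using the simulated samples from the example below, the DOF comes out ≈ 8.0
a = [120, 130, 115, 123, 140]
b = [150, 160, 145, 155, 170]
print(welch_dof(np.std(a, ddof=1), np.std(b, ddof=1), len(a), len(b)))
```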
We use the t-score to find a p-value from the t-distribution.
The p-value answers this question: “If the null hypothesis \(H_0\) is true, what is the probability of seeing a result this extreme or more extreme just by chance?”
Mathematically, the p-value is the area under the curve of the t-distribution beyond your t-score. For our one-sided test,
\[\text{p-value} = P(T \geq t)\]
where \(P(\cdot)\) is the probability computed using a t-distribution with the DOF we calculated earlier.
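As a concrete sketch (values taken from the worked example below), SciPy’s survival function t.sf gives this upper-tail probability directly:

```python
from scipy.stats import t

t_score = 4.9736  # t-statistic from Step 3 (see the example below)
dof = 8.0         # Welch–Satterthwaite DOF for that example (≈ 8)

# One-sided p-value for H1: mu_B > mu_A, i.e., P(T >= t_score) under H0
p_one_sided = t.sf(t_score, dof)

# Two-sided p-value, which is what scipy's ttest_ind reports by default
p_two_sided = 2 * t.sf(abs(t_score), dof)

print(f"one-sided p = {p_one_sided:.4f}")   # ≈ 0.0006
print(f"two-sided p = {p_two_sided:.4f}")   # ≈ 0.0011, matching the example below
```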
✅ Step 5: Decide to Reject \(H_0\) or Not
If the p-value is below your significance level (commonly \(\alpha = 0.05\)), reject \(H_0\) and conclude that the difference is statistically significant; otherwise, fail to reject \(H_0\).
Let’s simulate a basic A/B test for the average time on page.
```python
from scipy.stats import ttest_ind

# Simulated time-on-page data (in seconds)
group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]

# Welch's t-test (does not assume equal variances)
# Note: ttest_ind returns a two-sided p-value by default; for the one-sided
# H1 (mu_B > mu_A), you could pass alternative='greater' instead
t_stat, p_value = ttest_ind(group_B, group_A, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
Output:

```
t-statistic: 4.9736
p-value: 0.0011
```
With a p-value of 0.0011, well below the conventional 0.05 threshold, we reject the null hypothesis and conclude that Group B performs significantly better.
✅ Use the t-test when:
- Your sample sizes are small
- The population standard deviations are unknown and must be estimated from the samples
- Your data is approximately normal
- You are comparing the means of two groups

❌ Don’t use the t-test when:
- Your data is wildly non-normal, especially with a tiny sample
- Your samples are large; in that case, the z-test from the previous post is the simpler choice
💡 Real-World Advice:
The t-test is usually safe and practical in A/B testing, product analytics, and experiments, as long as your sample isn’t too tiny and your data isn’t wildly non-normal.
The t-test is a powerful tool for A/B testing when your sample size is small and you’re comparing means. It accounts for the added uncertainty in small datasets, allowing you to make data-informed decisions — even when data is limited.
🚀 The full code for this example is available here.
For further inquiries or collaboration, please contact me at my email.
🧠 Appendix: Understanding the DOF Formula

Recall that each group, A or B, has its own sample size, its own estimated variance of the average time on page, and its own DOF. The DOF formula tells us how much combined uncertainty we have from both A and B.
🎯 Numerator: the square of the estimated variance of the difference between the average times on page of the two groups, i.e., \(\big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \big)^2\).
🔽 Denominator: the combined uncertainty in estimating that variance, with each group’s contribution weighted by its own degrees of freedom, \(n_A - 1\) and \(n_B - 1\).
Therefore, the DOF formula is saying: How stable is our estimate of the variance? The more stable (i.e., less relative uncertainty), the higher the degrees of freedom.
📊 Interpretation

A higher DOF means the t-distribution is closer to the normal distribution, so the penalty for small-sample uncertainty is mild; a lower DOF means heavier tails and therefore larger p-values for the same t-score. With equal sample sizes and similar variances, the Welch–Satterthwaite DOF approaches \(n_A + n_B - 2\); when one group’s variance estimate is much noisier than the other’s, the DOF shrinks toward that group’s \(n - 1\).
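As a tiny numeric illustration (the numbers here are hypothetical), compare a balanced case, where the Welch–Satterthwaite DOF is close to \(n_A + n_B - 2\), with a case where a small, noisy group dominates the uncertainty:

```python
def welch_dof(s_a, s_b, n_a, n_b):
    """Welch–Satterthwaite degrees of freedom (same helper as above)."""
    va, vb = s_a**2 / n_a, s_b**2 / n_b
    return (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

# Balanced groups with similar spread: DOF ≈ n_A + n_B - 2 = 18
print(welch_dof(10.0, 10.0, 10, 10))   # -> 18.0

# A small, noisy group dominates the uncertainty: DOF drops sharply
print(welch_dof(30.0, 10.0, 4, 16))    # -> ≈ 3.2, far below 18
```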