In a previous post, we explored how large samples allow us to perform A/B testing based on the z-test and the normal distribution. But what happens when your sample size is small? Can you still perform A/B testing?
The answer is yes, but you need to switch tools. In this post, we’ll dive into how to perform A/B testing with the t-test, the theory behind it, and why it’s the right approach when your data is limited.
The workflow for A/B testing with the t-test is similar to that for A/B testing with the z-test; you can find the full workflow in my previous post.
Suppose you’re testing a new product layout, and your metric of interest is average time on page. You have:
- Group A (current layout): \(n_A\) users, with observed average time on page \(\bar{\mu}_A\) and sample standard deviation \(s_A\)
- Group B (new layout): \(n_B\) users, with observed average time on page \(\bar{\mu}_B\) and sample standard deviation \(s_B\)

These are the unknown parameters we are trying to make inferences about using sample data:
- \(\mu_A\), \(\mu_B\): the true average times on page under each layout
- \(\sigma_A\), \(\sigma_B\): the true standard deviations of time on page in each group

We use \(\bar{\mu}_A\) and \(\bar{\mu}_B\) to estimate \(\mu_A\) and \(\mu_B\), and we use hypothesis testing to decide whether the difference in the observed average times on page is statistically significant.
🔍 Step 1: Define Hypotheses
We want to test whether the new layout makes no difference, i.e.,
\[H_0: \mu_B = \mu_A\]
or if B is better, i.e.,
\[H_1: \mu_B > \mu_A \,\, \text{(one-sided test)}\]
⚙️ Step 2: Estimated Standard Error (SE) Under \(H_0\)
SE is a measure of the uncertainty, or variability, in the difference between the two observed average times on page \(\bar{\mu}_A\) and \(\bar{\mu}_B\). It tells you how much the observed difference between groups A and B might vary just due to random sampling.
The true SE for the difference in the observed average time on page under \(H_0\) is:
\[\text{SE} = \sqrt{ \text{Var} (\bar{\mu}_B - \bar{\mu}_A ) } = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}\]where \(\text{Var} (\bar{\mu}_B - \bar{\mu}_A ) = \text{Var} (\bar{\mu}_B ) + \text{Var} (\bar{\mu}_A )\) (due to the independence between \(\bar{\mu}_A\) and \(\bar{\mu}_B\)), \(\text{Var} (\bar{\mu}_A ) = \frac{\sigma_A^2}{n_A}\), and \(\text{Var} (\bar{\mu}_B ) = \frac{\sigma_B^2}{n_B}\).
However, this true SE is only theoretical because \(\sigma_A\) and \(\sigma_B\) are unknown.
Instead, the estimated SE for the difference in the observed average times on page is computed using the sample standard deviations \(s_A\) and \(s_B\) as estimates of \(\sigma_A\) and \(\sigma_B\):
\[\text{SE} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\]
🧮 Step 3: T-Statistic Under \(H_0\)
The t-statistic tells us how far the observed difference in average time on page is from zero, using the standard error as the unit, assuming that \(H_0\) is true:
\[t = \frac{\bar{\mu}_B - \bar{\mu}_A}{\text{SE}}\]
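To make Steps 2 and 3 concrete, here is a minimal NumPy sketch that computes the estimated SE and the t-statistic, using the same simulated time-on-page samples as the full example later in this post:

```python
import numpy as np

# Simulated time-on-page data (in seconds), same as the example below
group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])

# Sample means and sample variances (ddof=1 gives the unbiased estimator)
mean_A, mean_B = group_A.mean(), group_B.mean()
var_A, var_B = group_A.var(ddof=1), group_B.var(ddof=1)

# Step 2: estimated standard error of the difference in means
SE = np.sqrt(var_A / len(group_A) + var_B / len(group_B))

# Step 3: t-statistic under H0
t_stat = (mean_B - mean_A) / SE
print(f"SE = {SE:.4f}, t = {t_stat:.4f}")  # SE ≈ 6.1123, t ≈ 4.9736
```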
📉 Step 4: Compute P-Value Under \(H_0\)
Just knowing how many standard errors away the computed t-score is isn’t enough; we want to quantify how likely it is to observe such a result by chance. Therefore, we compute the p-value, which is the probability of obtaining a result as extreme as (or more extreme than) your observed data, assuming that the null hypothesis \(H_0\) is true.
To do this, given the t-score formula above, we use the t-distribution (instead of the normal distribution) to model this probability because we have extra uncertainty due to estimating standard deviations from small samples.
The assumption for using the t-distribution is that the data is approximately normal, so that the standard deviations computed from small samples (e.g., \(s_A\)) are good estimates of the true standard deviations (e.g., \(\sigma_A\)).
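If you want a rough sanity check of this assumption on a small sample, one option is a normality test such as Shapiro–Wilk (a sketch; with tiny samples the test has little power, so treat it as a hint, not proof):

```python
from scipy.stats import shapiro

# Hypothetical small sample of time-on-page values (seconds)
sample = [120, 130, 115, 123, 140]

stat, p = shapiro(sample)
# A large p-value means we cannot reject normality; with tiny samples
# this test has low power, so treat it as a rough check only
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p:.4f}")
```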
Unlike the normal distribution (which has a fixed shape), the t-distribution changes shape depending on the degrees of freedom (DOF). The t-distribution accounts for the extra uncertainty that arises when we estimate the population standard deviation from a small sample. That uncertainty depends on the sample size, and that’s where the DOF comes in.
Therefore, choosing the correct DOF is critical for an accurate p-value.
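A quick numerical illustration of this dependence: the upper-tail probability beyond the same t value shrinks toward the normal-distribution value as the DOF grows (illustrative sketch):

```python
from scipy.stats import t, norm

# Tail probability P(T > 2) approaches the normal value as DOF grows,
# reflecting the t-distribution's heavier tails at small DOF
for dof in [3, 10, 30, 100]:
    print(f"DOF = {dof:>3}: P(T > 2) = {t.sf(2, dof):.4f}")
print(f"Normal:    P(Z > 2) = {norm.sf(2):.4f}")
```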
This is the Welch–Satterthwaite formula for DOF, which is commonly used in practice:
\[\text{DOF} = \frac{ \Big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \Big)^2 }{ \frac{ \Big( \frac{s_A^2}{n_A} \Big)^2 }{n_A - 1} + \frac{ \Big( \frac{s_B^2}{n_B} \Big)^2 }{n_B - 1} }\](This DOF formula will be explained in more detail at the end of this post)
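To make the formula concrete, here is a small helper (a sketch; the function name welch_dof is mine) applied to the simulated samples used in the example later in this post:

```python
import numpy as np

def welch_dof(s_a, s_b, n_a, n_b):
    """Welch–Satterthwaite degrees of freedom from sample std devs and sizes."""
    va, vb = s_a**2 / n_a, s_b**2 / n_b  # estimated variance of each sample mean
    return (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

# Using the simulated samples from the example below, the DOF comes out ≈ 8.0
a = [120, 130, 115, 123, 140]
b = [150, 160, 145, 155, 170]
print(welch_dof(np.std(a, ddof=1), np.std(b, ddof=1), len(a), len(b)))
```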
We use the t-score to find a p-value from the t-distribution.
The p-value answers this question: “If the null hypothesis \(H_0\) is true, what is the probability of seeing a result this extreme or more extreme just by chance?”
Mathematically, the p-value is the area under the curve of the t-distribution beyond your t-score. For our one-sided test,
\[\text{p-value} = P(T \geq t)\]
where \(P(\cdot)\) is the probability computed using a t-distribution with the DOF we calculated earlier.
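As a concrete sketch (values taken from the worked example below), SciPy’s survival function t.sf gives this upper-tail probability directly:

```python
from scipy.stats import t

t_score = 4.9736  # t-statistic from Step 3 (see the example below)
dof = 8.0         # Welch–Satterthwaite DOF for that example (≈ 8)

# One-sided p-value for H1: mu_B > mu_A, i.e., P(T >= t_score) under H0
p_one_sided = t.sf(t_score, dof)

# Two-sided p-value, which is what scipy's ttest_ind reports by default
p_two_sided = 2 * t.sf(abs(t_score), dof)

print(f"one-sided p = {p_one_sided:.4f}")   # ≈ 0.0006
print(f"two-sided p = {p_two_sided:.4f}")   # ≈ 0.0011, matching the example below
```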
✅ Step 5: Decide to Reject \(H_0\) or Not
If the p-value is below your significance level (commonly \(\alpha = 0.05\)), reject \(H_0\) and conclude that the difference is statistically significant; otherwise, fail to reject \(H_0\).
Let’s simulate a basic A/B test for the average time on page.
```python
from scipy.stats import ttest_ind

# Simulated time-on-page data (in seconds)
group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]

# Welch's t-test (does not assume equal variances)
# Note: ttest_ind returns a two-sided p-value by default; for the one-sided
# H1 (mu_B > mu_A), you could pass alternative='greater' instead
t_stat, p_value = ttest_ind(group_B, group_A, equal_var=False)

print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
```
Output:

```
t-statistic: 4.9736
p-value: 0.0011
```
With a p-value of 0.0011, well below the conventional 0.05 threshold, we reject the null hypothesis and conclude that Group B performs significantly better.
✅ Use the t-test when:
- Your sample sizes are small
- The population standard deviations are unknown and must be estimated from the samples
- Your data is approximately normal
- You are comparing the means of two groups

❌ Don’t use the t-test when:
- Your data is wildly non-normal, especially with a tiny sample
- Your samples are large; in that case, the z-test from the previous post is the simpler choice
💡 Real-World Advice:
The t-test is usually safe and practical in A/B testing, product analytics, and experiments, as long as your sample isn’t too tiny and your data isn’t wildly non-normal.
The t-test is a powerful tool for A/B testing when your sample size is small and you’re comparing means. It accounts for the added uncertainty in small datasets, allowing you to make data-informed decisions — even when data is limited.
🚀 The full code for this example is available here.
For further inquiries or collaboration, please contact me at my email.
🧠 Appendix: Understanding the DOF Formula

Recall that each group, A or B, has its own sample size, its own estimated variance of the average time on page, and its own DOF. The DOF formula tells us how much combined uncertainty we have from both A and B.
🎯 Numerator: the square of the estimated variance of the difference between the average times on page of the two groups, i.e., \(\big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \big)^2\).
🔽 Denominator: the combined uncertainty in estimating that variance, with each group’s contribution weighted by its own degrees of freedom, \(n_A - 1\) and \(n_B - 1\).
Therefore, the DOF formula is saying: How stable is our estimate of the variance? The more stable (i.e., less relative uncertainty), the higher the degrees of freedom.
📊 Interpretation

A higher DOF means the t-distribution is closer to the normal distribution, so the penalty for small-sample uncertainty is mild; a lower DOF means heavier tails and therefore larger p-values for the same t-score. With equal sample sizes and similar variances, the Welch–Satterthwaite DOF approaches \(n_A + n_B - 2\); when one group’s variance estimate is much noisier than the other’s, the DOF shrinks toward that group’s \(n - 1\).
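As a tiny numeric illustration (the numbers here are hypothetical), compare a balanced case, where the Welch–Satterthwaite DOF is close to \(n_A + n_B - 2\), with a case where a small, noisy group dominates the uncertainty:

```python
def welch_dof(s_a, s_b, n_a, n_b):
    """Welch–Satterthwaite degrees of freedom (same helper as above)."""
    va, vb = s_a**2 / n_a, s_b**2 / n_b
    return (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

# Balanced groups with similar spread: DOF ≈ n_A + n_B - 2 = 18
print(welch_dof(10.0, 10.0, 10, 10))   # -> 18.0

# A small, noisy group dominates the uncertainty: DOF drops sharply
print(welch_dof(30.0, 10.0, 4, 16))    # -> ≈ 3.2, far below 18
```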