A/B Testing using T-Test: Making Smart Decisions with Limited Data
In a previous post, we explored how large samples allow us to perform A/B testing based on the z-test and the normal distribution. But what happens when your sample size is small? Can you still perform A/B testing?
The answer is yes, but you need to switch tools. In this post, we’ll dive into how to perform A/B testing using the t-test, the theory behind it, and why it’s the right approach when your data is limited.
Workflow for A/B testing with t-Test
The workflow for A/B testing with the t-test is similar to that for A/B testing with the z-test. You can find the workflow in my previous post.
Theory for A/B Testing with t-Test
🎯 Problem Setup
Suppose you’re testing a new product layout, and your metric of interest is average time on page. You have:
- Group A (control): current layout
- Group B (variant): new layout
Suppose:
- \(\mu_A\) and \(\sigma_A\) are the true average time on page and standard deviation in A
- \(\mu_B\) and \(\sigma_B\) are the true average time on page and standard deviation in B
These are unknown parameters we are trying to make inferences about using sample data:
- Group A has \(n_A\) users with average time on page \(\bar{\mu}_A\) and standard deviation \(s_A\)
- Group B has \(n_B\) users with average time on page \(\bar{\mu}_B\) and standard deviation \(s_B\)
We use \(\bar{\mu}_A\) and \(\bar{\mu}_B\) to estimate \(\mu_A\) and \(\mu_B\). We also use hypothesis testing to decide if the difference in the observed average times on page is statistically significant.
🔍 Step 1: Define Hypotheses
- Null hypothesis \(H_0\): No difference in the average time on page, i.e.,
\[H_0: \mu_B = \mu_A\]
- Alternative hypothesis \(H_1\): Variant B performs differently, i.e.,
\[H_1: \mu_B \neq \mu_A \,\, \text{(two-sided test)}\]
or if B is better, i.e.,
\[H_1: \mu_B > \mu_A \,\, \text{(one-sided test)}\]
⚙️ Step 2: Estimated Standard Error (SE) Under \(H_0\)
SE is a measure of the uncertainty, or variability, in the difference between the two observed averages \(\bar{\mu}_A\) and \(\bar{\mu}_B\). It tells you how much the observed difference between groups A and B might vary just due to random sampling.
The true SE for the difference in the observed average time on page under \(H_0\) is:
\[\text{SE} = \sqrt{ \text{Var} (\bar{\mu}_B - \bar{\mu}_A ) } = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}\]
where \(\text{Var} (\bar{\mu}_B - \bar{\mu}_A ) = \text{Var} (\bar{\mu}_B ) + \text{Var} (\bar{\mu}_A )\) (due to the independence of \(\bar{\mu}_A\) and \(\bar{\mu}_B\)), \(\text{Var} (\bar{\mu}_A ) = \frac{\sigma_A^2}{n_A}\), and \(\text{Var} (\bar{\mu}_B ) = \frac{\sigma_B^2}{n_B}\).
However, this true SE is only theoretical because \(\sigma_A\) and \(\sigma_B\) are unknown.
Instead, the estimated SE for the difference in the observed average time on page is computed using the sample standard deviations \(s_A\) and \(s_B\) as estimates of \(\sigma_A\) and \(\sigma_B\):
\[\text{SE} = \sqrt{\frac{s_A^2}{n_A} + \frac{s_B^2}{n_B}}\]
🧮 Step 3: T-Statistic Under \(H_0\)
The t-statistic tells us how far the observed difference in average time on page is from zero, using the standard error as the unit, assuming that \(H_0\) is true:
\[t = \frac{\bar{\mu}_B - \bar{\mu}_A}{\text{SE}}\]
So,
- If \(t\) is close to 0, the observed difference is what we’d expect from random chance
- If \(t\) is far from 0, the difference is larger than what we’d expect from chance, so it might be statistically significant
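Before moving on, it may help to see Steps 2 and 3 computed by hand. Below is a minimal sketch in NumPy, using the same illustrative samples as the full example later in this post:
import numpy as np

# Illustrative time-on-page samples (seconds); same data as the full example below
group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])

n_A, n_B = len(group_A), len(group_B)
s_A, s_B = group_A.std(ddof=1), group_B.std(ddof=1)  # sample standard deviations

# Estimated standard error of the difference in sample means (Step 2)
se = np.sqrt(s_A**2 / n_A + s_B**2 / n_B)

# t-statistic under H0 (Step 3)
t = (group_B.mean() - group_A.mean()) / se
print(f"SE = {se:.4f}, t = {t:.4f}")  # t matches the scipy result later in the post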
📉 Step 4: Compute P-Value Under \(H_0\)
Just knowing how many standard errors away the computed t-score is isn’t enough on its own; we want to quantify how likely it is to observe such a result by chance. Therefore, we need to compute the “p-value”, which is the probability of obtaining a result as extreme as (or more extreme than) your observed data, assuming that the null hypothesis \(H_0\) is true.
To do this, given the t-score formula above, we use the t-distribution (instead of the normal distribution) to model this probability because we have extra uncertainty due to estimating standard deviations from small samples.
The assumption for using the t-distribution is that the data is approximately normal, so that the standard deviations computed from small samples (e.g., \(s_A\)) are good estimates of the true standard deviations (e.g., \(\sigma_A\)).
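If you want to sanity-check this normality assumption, one common option (an illustrative sketch, not part of the original workflow, and itself of limited power on tiny samples) is the Shapiro–Wilk test from scipy:
from scipy.stats import shapiro

# Illustrative time-on-page samples (seconds)
group_A = [120, 130, 115, 123, 140]

# Shapiro-Wilk tests the null hypothesis that the data is normally distributed;
# a small p-value suggests a departure from normality
stat, p = shapiro(group_A)
print(f"Shapiro-Wilk statistic = {stat:.4f}, p-value = {p:.4f}")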
Unlike the normal distribution (which has a fixed shape), the t-distribution changes shape depending on the degrees of freedom (DOF). The t-distribution accounts for the extra uncertainty that arises when we estimate the population standard deviation from a small sample. That uncertainty depends on the sample size, and that is where the DOF comes in.
- Smaller sample size → smaller DOF → more uncertainty → heavier tails.
- Larger sample size → larger DOF → less uncertainty → closer to standard normal distribution.
Therefore, choosing the correct DOF is critical for an accurate p-value.
This is the Welch–Satterthwaite formula for DOF, which is commonly used in practice:
\[\text{DOF} = \frac{ \Big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \Big)^2 }{ \frac{ \Big( \frac{s_A^2}{n_A} \Big)^2 }{n_A - 1} + \frac{ \Big( \frac{s_B^2}{n_B} \Big)^2 }{n_B - 1} }\]
(This DOF formula is explained in more detail at the end of this post.)
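As a quick sketch, the formula translates directly into code (reusing the illustrative samples from above):
import numpy as np

group_A = np.array([120, 130, 115, 123, 140])
group_B = np.array([150, 160, 145, 155, 170])

n_A, n_B = len(group_A), len(group_B)
var_A, var_B = group_A.var(ddof=1), group_B.var(ddof=1)  # sample variances

# Welch-Satterthwaite degrees of freedom
num = (var_A / n_A + var_B / n_B) ** 2
den = (var_A / n_A) ** 2 / (n_A - 1) + (var_B / n_B) ** 2 / (n_B - 1)
dof = num / den
print(f"DOF = {dof:.4f}")  # roughly 8.0 for these samples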
We use the t-score to find a p-value from the t-distribution.
The p-value answers this question: “If the null hypothesis \(H_0\) is true, what is the probability of seeing a result this extreme or more extreme just by chance?”
Mathematically, the p-value is the area under the curve of the t-distribution beyond your t-score.
- For a two-tailed test:
\[p_{value} = 2 \cdot P(T \geq |t|)\]
- For a one-tailed test:
\[p_{value} = P(T \geq t)\]
where \(P(\cdot)\) is the probability computed using a t-distribution with the DOF we calculated earlier, and \(T\) is a random variable following that distribution.
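A minimal sketch of this step, plugging in the t-score and DOF computed above (scipy’s t.sf gives the upper-tail probability of the t-distribution):
from scipy.stats import t as t_dist

t_score = 4.9736  # t-statistic from Step 3
dof = 8.0         # Welch-Satterthwaite DOF from Step 4 (rounded)

# Two-tailed p-value: probability mass beyond |t| in both tails
p_two_tailed = 2 * t_dist.sf(abs(t_score), df=dof)

# One-tailed p-value (H1: mu_B > mu_A): upper tail only
p_one_tailed = t_dist.sf(t_score, df=dof)

print(f"two-tailed p = {p_two_tailed:.4f}, one-tailed p = {p_one_tailed:.4f}")
The two-tailed value reproduces the p-value reported by scipy’s ttest_ind in the example below.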
✅ Step 5: Decide Whether to Reject \(H_0\)
- If \(p_{value} < \alpha\) (commonly \(0.05\)), we reject \(H_0\) and conclude that the difference is statistically significant.
- Otherwise, we fail to reject \(H_0\).
Implementing an Example A/B Test in Python
Let’s simulate a basic A/B test for the average time on page.
from scipy.stats import ttest_ind
# Simulated time-on-page data (in seconds)
group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]
# Welch's t-test (does not assume equal variances)
t_stat, p_value = ttest_ind(group_B, group_A, equal_var=False)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
Output:
t-statistic: 4.9736
p-value: 0.0011
With a p-value of 0.0011, far below 0.05, we reject the null hypothesis and conclude that the difference is statistically significant; since the t-statistic is positive, Group B’s average time on page is higher.
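Note that ttest_ind performs a two-tailed test by default. If your hypothesis is directional (B is better), SciPy 1.6+ accepts an alternative argument; a minimal sketch:
from scipy.stats import ttest_ind

group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]

# One-tailed Welch's t-test: H1 is that group_B's mean is greater than group_A's
t_stat, p_value = ttest_ind(group_B, group_A, equal_var=False, alternative="greater")
print(f"one-tailed p-value: {p_value:.4f}")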
🧪 When and Why to Use the t-Test in A/B Testing
✅ Use the t-test when:
- Comparing Means: Use the t-test when your goal is to compare the average performance of two groups (e.g., average time on page or purchase amount).
- Unknown Population Variance: The population standard deviation is usually unknown, so you need to estimate it from your sample.
- Small to Moderate Sample Sizes: While a common rule of thumb is to use the t-test when the sample size is less than 30 per group, it remains useful even with slightly larger samples when the population variance is unknown.
- Approximately Normal Data: The t-test assumes that the underlying data is approximately normally distributed. This assumption is particularly important when the sample size is very small.
❌ Don’t use the t-test when:
- Comparing Proportions Directly: While a t-test can technically be used in some transformed or regression-based comparisons of proportions, standard A/B testing practice uses a z-test for comparing two proportions (like conversion rates), because proportions follow a binomial distribution: they are not continuous variables like the metrics a t-test assumes, and their variance depends on the proportion itself. For very small samples, Fisher’s exact test is more appropriate.
- Severely Non-Normal Data in Very Small Samples: When the data is heavily skewed or otherwise non-normal and the sample size is extremely small, the t-test may not be reliable, because the sample standard deviation estimates become biased or unstable. In such cases, consider non-parametric alternatives like the Mann-Whitney U test, sketched below.
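For reference, here is a minimal sketch of that non-parametric alternative using scipy (with the same illustrative samples; mannwhitneyu compares rank distributions rather than means):
from scipy.stats import mannwhitneyu

group_A = [120, 130, 115, 123, 140]
group_B = [150, 160, 145, 155, 170]

# Mann-Whitney U test: no normality assumption; compares the groups via ranks
u_stat, p_value = mannwhitneyu(group_B, group_A, alternative="two-sided")
print(f"U statistic: {u_stat}, p-value: {p_value:.4f}")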
💡 Real-World Advice:
The t-test is usually safe and practical in A/B testing, product analytics, and experiments, as long as your sample isn’t too tiny and your data isn’t wildly non-normal.
Summary
The t-test is a powerful tool for A/B testing when your sample size is small and you’re comparing means. It accounts for the added uncertainty in small datasets, allowing you to make data-informed decisions — even when data is limited.
🚀 The code of the example is available here.
For further inquiries or collaboration, please contact me at my email.
DOF Formula Explained:
Recall that each group A or B has its own sample size, estimated variance of the average time on page, and DOF.
- Group A has size \(n_A\), estimated variance \(s_A^2\) and \(n_A-1\) DOF.
- Group B has size \(n_B\), estimated variance \(s_B^2\) and \(n_B-1\) DOF.
The DOF formula tells us: How much combined uncertainty we have from both A and B.
🎯 Numerator: It is the square of the estimated variance of the difference between the average times on page of the two groups, i.e., \(\big( \frac{s_A^2}{n_A} + \frac{s_B^2}{n_B} \big)^2\).
🔽 Denominator: It represents the combined uncertainty in estimating that variance, based on the degrees of freedom of each variance estimate of each group.
Therefore, the DOF formula is saying: How stable is our estimate of the variance? The more stable (i.e., less relative uncertainty), the higher the degrees of freedom.
📊 Interpretation
- If both groups have large samples and similar variances, the ratio gives a large DOF, and the t-distribution looks almost normal.
- If one group has high variance or low sample size, the denominator is bigger → DOF is smaller → heavier tails in the t-distribution.
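To make this concrete, here is a small illustrative sketch (with hypothetical sample sizes and variances, chosen just to show the behavior) comparing the Welch–Satterthwaite DOF in a balanced case versus a case with one small, noisy group:
def welch_dof(var_a: float, n_a: int, var_b: float, n_b: int) -> float:
    """Welch-Satterthwaite degrees of freedom for a two-sample t-test."""
    num = (var_a / n_a + var_b / n_b) ** 2
    den = (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    return num / den

# Balanced: similar variances and sample sizes -> DOF close to n_A + n_B - 2
print(welch_dof(100, 50, 100, 50))  # 98.0

# Unbalanced: one small, high-variance group dominates -> DOF drops sharply
print(welch_dof(100, 50, 400, 5))   # about 4.2, far below 50 + 5 - 2 = 53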