March 25, 2025


Beginner’s guide to hypothesis testing in Python

AB tests, or randomised experiments, are the gold standard for understanding the causal impact of a treatment of interest on an outcome. Being able to evaluate AB test results and draw an inference about the treatment is a useful skill for any data enthusiast. In this post, we will look at practical ways to evaluate the statistical significance of the difference between two independent sample means of continuous data in Python.


In the simplest form of AB test, we have two variants that we want to compare. In one variant, say variant A, we have the default setup that serves as the baseline. The records assigned the default scenario are often referred to as the control group. In the other variant, say variant B, we introduce the treatment of interest. The records assigned the treatment are often referred to as the treatment group. We hypothesise that this treatment may provide a certain benefit over the default setup and want to test whether that hypothesis holds in reality. In AB tests, variants are randomly assigned to records so that both groups are comparable.

Now, let’s imagine we just finished collecting sample data from an AB test. It’s time to evaluate the causal impact of the treatment on the outcome. We can’t simply compare the means of the two groups, because the raw difference only tells us about that particular sample and doesn’t tell us much about the population. To make an inference about the population from the sample data, we will use hypothesis testing.

We will use a combination of a few different tests to analyse the sample data. We will look at two different options.

🔎 Option 1

This is what our option 1 flow looks like:

Option 1

Student’s t-test is a popular test for comparing two unpaired sample means, so we will use it where feasible. However, in order to use Student’s t-test, we will first check whether the data meet the following assumptions.

📍 Assumption of normality
Student’s t-test assumes that the sampling distribution of means for both groups is normally distributed. Let’s clarify what we mean by the sampling distribution of means. Imagine we draw a random sample of size n and record its mean. Then, we take another random sample of size n and record its mean. We do this, say, 10,000 times in total to collect many sample means. If we plot these 10,000 means, we will see the sampling distribution of means.

According to Central Limit Theorem:
◼️ The sampling distribution of means becomes approximately normal when the sample size is around 30 or more, regardless of the distribution of the population.
◼️ For a normally distributed population, the sampling distribution of means will be approximately normal even with a smaller sample size (i.e. fewer than 30).

Let’s look at a simple illustration of this in Python. We will create imaginary population data for two groups:

import numpy as np
import pandas as pd
from scipy.stats import (skewnorm, shapiro, levene, ttest_ind,
                         mannwhitneyu)
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.float_format = "{:.2f}".format
sns.set(style="darkgrid", context="talk", palette="Set2")

# Create an imaginary population for each group
N = 100000
np.random.seed(42)
pop_a = np.random.normal(loc=100, scale=40, size=N)  # normal population
pop_b = skewnorm.rvs(10, size=N) * 50                # right-skewed population

# Plot both population distributions side by side
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
sns.histplot(pop_a, bins=30, kde=True, ax=ax[0])
ax[0].set_title(f"Group A (mean={pop_a.mean():.2f})")
sns.histplot(pop_b, bins=30, kde=True, ax=ax[1])
ax[1].set_title(f"Group B (mean={pop_b.mean():.2f})")
fig.suptitle('Population distribution')
fig.tight_layout()

We can see that the population data for group A is normally distributed, whereas the population data for group B is right-skewed. Now we will plot the sampling distributions of means from both populations, with sample sizes of 2 and 30 respectively:

n_draw = 10000
for n in [2, 30]:
    np.random.seed(42)
    sample_means_a = np.empty(n_draw)
    sample_means_b = np.empty(n_draw)
    # Draw many samples of size n and record each sample's mean
    for i in range(n_draw):
        sample_a = np.random.choice(pop_a, size=n, replace=False)
        sample_means_a[i] = sample_a.mean()

        sample_b = np.random.choice(pop_b, size=n, replace=False)
        sample_means_b[i] = sample_b.mean()

    # Plot the sampling distribution of means for both groups
    fig, ax = plt.subplots(1, 2, figsize=(10, 5))
    sns.histplot(sample_means_a, bins=30, kde=True, ax=ax[0])
    ax[0].set_title(f"Group A (mean={sample_means_a.mean():.2f})")
    sns.histplot(sample_means_b, bins=30, kde=True, ax=ax[1])
    ax[1].set_title(f"Group B (mean={sample_means_b.mean():.2f})")
    fig.suptitle(f"Sampling distribution of means (n={n})")
    fig.tight_layout()

We can see that even with a sample size as small as 2, the sampling distribution of means is normally distributed for population A, because the population is normally distributed to start with. When the sample size is 30, the sampling distributions of means for both groups are approximately normal. We also see that the mean of the sample means in each sampling distribution is very close to the population mean. Here are some great additional resources on the sampling distribution of means and the assumption of normality:
◼️ Distribution of Sample Means
◼️ The Assumption(s) of Normality

This means that if both sample sizes are 30 or above, we will assume this assumption is met. When a sample size is smaller than 30, we will check whether the populations are normally distributed with the Shapiro-Wilk test. If the test suggests that one of the populations is not normally distributed, we will use the Mann-Whitney U test as an alternative to compare the two groups, since it makes no assumption of normality.
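As a quick illustration, here is how these checks could look with two hypothetical small samples drawn from the populations above (the sample size of 20 is an arbitrary choice):

small_a = np.random.choice(pop_a, size=20, replace=False)
small_b = np.random.choice(pop_b, size=20, replace=False)

# Shapiro-Wilk: the null hypothesis is that the data come from a normal distribution
print(f"Shapiro-Wilk p-values: {shapiro(small_a).pvalue:.4f}, {shapiro(small_b).pvalue:.4f}")

# If normality is rejected for either group, fall back to the Mann-Whitney U test
print(f"Mann-Whitney U p-value: {mannwhitneyu(small_a, small_b).pvalue:.4f}")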

📍 Equal variance assumption
Student’s t-test also assumes that both populations have equal variance. We will use Levene’s test to find out whether the two groups have equal variance. If the assumption of normality is met but the equal variance assumption is not met according to Levene’s test, we will use Welch’s t-test as an alternative, since Welch’s t-test doesn’t make an assumption about equal variance.
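As a quick sketch of how these two tests look in scipy.stats (the slices of the populations used here are purely illustrative):

# Levene's test: the null hypothesis is that both groups have equal variance
print(f"Levene p-value: {levene(pop_a[:100], pop_b[:100]).pvalue:.4f}")

# Welch's t-test: equal_var=False removes the equal variance assumption
print(f"Welch p-value: {ttest_ind(pop_a[:100], pop_b[:100], equal_var=False).pvalue:.4f}")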

🔨 Option 2

According to this and this source, we could use Welch’s t-test as the default over Student’s t-test. The following are some of the paraphrased and simplified main reasons the authors of those sources describe:
◼️ Exactly equal variance is very unlikely in reality
◼️ Levene’s test tends to have low power
◼️ Even if the two populations have equal variance, Welch’s t-test is as powerful as Student’s t-test.

Therefore, we could consider a much simpler alternative option:

Option 2

Now, it’s time to translate these options into Python code.

Let’s imagine we have collected the following sample data:

# Collect sample data for the control (a) and treatment (b) groups
n = 100
np.random.seed(42)
grp_a = np.random.normal(loc=40, scale=20, size=n)
grp_b = np.random.normal(loc=60, scale=15, size=n)
df = pd.DataFrame({'var': np.concatenate([grp_a, grp_b]),
                   'grp': ['a']*n + ['b']*n})
print(df.shape)
df.groupby('grp')['var'].describe()

Here’s the distribution of the two samples:

sns.kdeplot(data=df, x='var', hue="grp", fill=True);

Scenario 1: Does the treatment have an impact?

We will assume that we want to test the following hypotheses:

H₀: The treatment has no impact on the outcome (the two population means are equal).
H₁: The treatment has an impact on the outcome (the two population means are different).

The null hypothesis is often the conservative take that the treatment has no effect. We will only reject the null hypothesis if we have sufficient statistical evidence: in other words, no impact until proven impactful. If the means are statistically significantly different, then we can say that the treatment has an impact. This is going to be a two-tailed test. We will use an alpha of 0.05 to evaluate our results.

Let’s create a function to test the difference according to the option 1 flow:
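Here’s a minimal sketch of what such a function could look like, following the flow above (the 30-observation cutoff comes from the Central Limit Theorem discussion; the default alpha of 0.05 and the printed output format are illustrative choices):

def check_mean_significance1(a, b, alpha=0.05, alternative="two-sided"):
    # Normality: rely on the CLT when both samples have 30+ observations,
    # otherwise check each group with the Shapiro-Wilk test
    if min(len(a), len(b)) >= 30:
        normal = True
    else:
        normal = (shapiro(a).pvalue >= alpha) and (shapiro(b).pvalue >= alpha)

    if not normal:
        # Mann-Whitney U test makes no assumption of normality
        test_name = "Mann-Whitney U test"
        stat, p = mannwhitneyu(a, b, alternative=alternative)
    else:
        # Levene's test: check the equal variance assumption
        equal_var = levene(a, b).pvalue >= alpha
        test_name = "Student's t-test" if equal_var else "Welch's t-test"
        stat, p = ttest_ind(a, b, equal_var=equal_var, alternative=alternative)

    print(f"{test_name}: statistic={stat:.4f}, p-value={p:.4f}")
    if p < alpha:
        print("Reject the null hypothesis.")
    else:
        print("Fail to reject the null hypothesis.")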

Awesome, we will use the function to check if the population means are different:

check_mean_significance1(grp_a, grp_b)

Lovely, the p-value is very close to 0 and lower than our alpha, so we reject the null hypothesis and conclude that we have sufficient statistical evidence to suggest that the means of the two groups are different: the treatment has an impact.

Let’s now adapt the code snippet for option 2:
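Assuming option 2 simply goes straight to Welch’s t-test, a simplified sketch could look like this:

def check_mean_significance2(a, b, alpha=0.05, alternative="two-sided"):
    # Welch's t-test by default: no equal variance assumption
    stat, p = ttest_ind(a, b, equal_var=False, alternative=alternative)
    print(f"Welch's t-test: statistic={stat:.4f}, p-value={p:.4f}")
    if p < alpha:
        print("Reject the null hypothesis.")
    else:
        print("Fail to reject the null hypothesis.")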

Time to apply this to our dataset:

check_mean_significance2(grp_a, grp_b)

Awesome, we get the same conclusion in this example; since the equal variance assumption was not met, option 1 also ended up using Welch’s t-test.

Scenario 2: Does the treatment have a positive impact?

In the above scenario, we didn’t care about the direction of the impact. In practice, we often want to know whether a treatment has a positive impact (or a negative impact, depending on the outcome considered). So we will change the hypotheses slightly:

H₀: The mean of the treatment group is less than or equal to the mean of the control group.
H₁: The mean of the treatment group is greater than the mean of the control group.

Now, this becomes a one-tailed test. We will reuse the function, but this time we will change the test from two-tailed to one-tailed with the alternative argument:

check_mean_significance1(grp_a, grp_b, alternative="less")

Since the p-value is lower than the alpha, we reject the null hypothesis and conclude that we have sufficient statistical evidence to suggest that the mean of the treatment group is statistically significantly higher than that of the control group: the treatment has a positive impact on the outcome.

For completeness, let’s look at option 2 as well:

check_mean_significance2(grp_a, grp_b, alternative="less")

Voila, we have reached the end of the post. I hope you have learned practical ways to compare sample means and make an inference about the population. With this skill, we can help inform many important decisions.




