Permutation Test

From Sustainability Methods



Permutation testing, a non-parametric statistical method, offers a robust alternative to traditional parametric hypothesis tests. It is particularly useful when the assumptions required for parametric tests, such as normally distributed data, are not met. By reshuffling the data and observing the outcomes, permutation tests provide an empirical approach to hypothesis testing that adapts to a wide range of data types and distributions. Imagine you have a set of data points and want to find out whether they contain a meaningful pattern or just random noise. Permutation testing helps by mixing up the data and looking at what happens. It is similar to shuffling a deck of cards many times to see how often a particular arrangement turns up purely by chance.

Think of permutation testing as a cousin of another method called 'bootstrap' (please refer to the Wiki entry "Bootstrapping in Python" to learn more). Both are resampling methods, but they work differently and have different goals. The bootstrap draws samples with replacement to understand how well your sample represents a bigger population. Permutation testing, in contrast, shuffles the data without replacement and plays a 'what-if' game: what if there were no specific pattern in the data? It shows what kinds of random patterns can appear when there is no real structure in the data. In other words, permutation is best suited for testing hypotheses, and the bootstrap for estimating confidence intervals.
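The contrast between the two resampling schemes can be illustrated with a minimal sketch. The toy measurements, the seed, and the number of resamples below are assumptions made purely for illustration; they are not part of the example used later in this entry.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only

# Hypothetical toy measurements for two groups
group_a = np.array([4.1, 5.0, 5.6, 6.2])
group_b = np.array([5.9, 6.4, 7.1, 7.8, 8.0])

# Bootstrap: resample WITH replacement within one group to see how
# much a statistic (here the mean) varies from sample to sample.
boot_means = np.array([
    rng.choice(group_a, size=group_a.size, replace=True).mean()
    for _ in range(2000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Permutation: pool both groups and reshuffle WITHOUT replacement,
# which breaks any link between group label and value.
pooled = np.concatenate([group_a, group_b])
shuffled = rng.permutation(pooled)
perm_a, perm_b = shuffled[:group_a.size], shuffled[group_a.size:]

print(f"95% bootstrap interval for the mean of A: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"One permuted mean difference: {perm_b.mean() - perm_a.mean():.2f}")
```

Note that a bootstrap resample can contain the same observation several times, whereas a permutation always contains every pooled observation exactly once.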

Concept of Permutation Testing

At the heart of permutation testing is a simple question: are different groups really different in terms of some statistical measure, or is the difference just a coincidence? To answer this, we start with the assumption (called the null hypothesis) that there is no difference between the groups.

Here is how it works. Suppose you have groups A and B (and maybe C, D, and so on). You mix all their data points together because, according to your starting assumption, they all come from the same distribution. This mixing represents the idea that the specific treatment or condition each group experienced does not really make a difference. You then create new groups of the original sizes from this mixed pool and calculate your statistic (such as the difference in averages) for these new groups. By doing this over and over and seeing how much the new groups differ from each other, you can judge whether the original difference between A and B was real or just a coincidence.

Steps in Permutation Testing in Python

1. Combining Data: Data from the different groups are combined, embodying the null hypothesis of no difference between the groups.
2. Resampling: The combined data are repeatedly shuffled and split, without replacement, into new groups of the original sample sizes.
3. Calculating Statistics: For each permutation, the statistic of interest (e.g., the difference in means) is calculated.
4. Forming Distribution: This process is repeated many times, generating a distribution of the test statistic under the null hypothesis.
5. Comparison and Conclusion: The test statistic calculated from the original data is compared with this distribution. If it lies in the extreme tail, the null hypothesis is rejected, indicating statistical significance.
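The five steps above can be sketched as a single function. This is a minimal NumPy illustration, not a definitive implementation: the function name `permutation_test`, its parameters, and the small data set are hypothetical, and the test shown is one-sided.

```python
import numpy as np

def permutation_test(a, b, n_resamples=5000, seed=0):
    """One-sided permutation test for mean(b) - mean(a) (illustrative)."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    observed = b.mean() - a.mean()

    # Step 1: combine the data under the null hypothesis
    pooled = np.concatenate([a, b])

    perm_stats = np.empty(n_resamples)
    for i in range(n_resamples):
        # Step 2: shuffle and split into groups of the original sizes
        shuffled = rng.permutation(pooled)
        new_a, new_b = shuffled[:a.size], shuffled[a.size:]
        # Step 3: compute the statistic for this permutation
        perm_stats[i] = new_b.mean() - new_a.mean()

    # Step 4: perm_stats now approximates the null distribution
    # Step 5: share of permuted statistics at least as large as observed
    p_value = np.mean(perm_stats >= observed)
    return observed, perm_stats, p_value

# Hypothetical data with a clear shift between the groups
a = [10.1, 9.8, 10.4, 10.0, 9.7, 10.2]
b = [11.0, 11.3, 10.9, 11.4, 11.1, 10.8]
observed, perm_stats, p_value = permutation_test(a, b)
print(f"observed difference = {observed:.3f}, p = {p_value:.4f}")
```

Because every value in `b` is larger than every value in `a`, very few random splits reproduce a difference as large as the observed one, so the estimated p-value is small.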

# Create dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Fix the random seed so the example is reproducible
# (seed 42 is an assumption; it matches the values printed below)
np.random.seed(42)

# Define the sample size for each group
sample_size_a = 30
sample_size_b = 40

# Generate normally distributed data for two groups that differ
# in sample size, mean, and standard deviation
group_a = np.random.normal(loc=50, scale=10, size=sample_size_a)  # Group A with 30 samples
group_b = np.random.normal(loc=55, scale=15, size=sample_size_b)  # Group B with 40 samples

# Create a DataFrame
df = pd.DataFrame({
    "Group": ["A"] * sample_size_a + ["B"] * sample_size_b,
    "Value": np.concatenate([group_a, group_b])
})
df.head()

  Group      Value
0     A  54.967142
1     A  48.617357
2     A  56.476885
3     A  65.230299
4     A  47.658466

# Compare the two groups with a box plot
ax = df.boxplot(by='Group', column='Value')

Figure (Perm test1.png): box plot of Value by Group.

# Observed difference between the group means
mean_a = df[df.Group == 'A'].Value.mean()
mean_b = df[df.Group == 'B'].Value.mean()
abs(mean_b - mean_a)


On average, the values in Group B are about 5.4 higher than those in Group A. The question is whether this difference is within the range of what random chance might produce or whether it is statistically significant. One way to answer this is to apply a permutation test: combine all the values and then repeatedly shuffle and divide them into groups of 30 and 40 (recall that nA = 30 for Group A and nB = 40 for Group B).

To apply a permutation test, we need a function to randomly assign the 70 samples to a group of 30 (Group A) and a group of 40 (Group B).

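# Create permutation function
import random

def perm_fun(x, nA, nB):
    n = nA + nB
    # Randomly pick nB row labels for group B; the rest go to group A
    idx_B = set(random.sample(range(n), nB))
    idx_A = set(range(n)) - idx_B
    # Convert to lists, since recent pandas versions reject sets as indexers
    return x.loc[list(idx_B)].mean() - x.loc[list(idx_A)].mean()

nA = df[df.Group == 'A'].shape[0]
nB = df[df.Group == 'B'].shape[0]
print(perm_fun(df.Value, nA, nB))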


This function works by sampling (without replacement) nB indices and assigning them to group B; the remaining nA indices are assigned to group A. The difference between the two group means is returned. Calling this function R = 1000 times with nA = 30 and nB = 40 yields a distribution of differences in Value that can be plotted as a histogram. In Python, this can be done with matplotlib's hist() method:

perm_diffs = [perm_fun(df.Value, nA, nB) for _ in range(1000)]

fig, ax = plt.subplots(figsize=(5, 5))
ax.hist(perm_diffs, bins=11, rwidth=0.9)
ax.axvline(x = mean_b - mean_a, color='black', lw=2)
ax.set_xlabel('Value differences')

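Figure (Perm test2.png): histogram of the permuted Value differences, with the observed difference marked by a vertical line.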

# Proportion of permuted differences larger than the observed difference
np.mean(np.array(perm_diffs) > mean_b - mean_a)


In the context of a permutation test, this value is the empirical (one-sided) p-value: the proportion of permuted differences that exceed the observed difference. A small value, conventionally below 0.05, suggests that the observed difference in Value between Group A and Group B is not within the range of chance variation and that the two groups therefore differ significantly from each other.
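The proportion above only counts permuted differences in one direction. As a hedged sketch, a two-sided p-value and a common small-sample correction could be computed as follows. The data are simulated stand-ins with the same sizes and parameters as in the example (not the exact values above), and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

# Hypothetical stand-ins for the two groups (same sizes and parameters)
group_a = rng.normal(loc=50, scale=10, size=30)
group_b = rng.normal(loc=55, scale=15, size=40)
observed = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
n_resamples = 1000
perm_diffs = np.empty(n_resamples)
for i in range(n_resamples):
    shuffled = rng.permutation(pooled)
    # First 30 values play the role of group A, remaining 40 of group B
    perm_diffs[i] = shuffled[30:].mean() - shuffled[:30].mean()

# One-sided: how often is a permuted difference at least as large?
p_one_sided = np.mean(perm_diffs >= observed)

# Two-sided: compare absolute differences, so large deviations in
# either direction count as extreme.
n_extreme = np.sum(np.abs(perm_diffs) >= abs(observed))
p_two_sided = n_extreme / n_resamples

# Adding 1 to numerator and denominator counts the observed arrangement
# itself as one permutation, so the estimate can never be exactly zero.
p_two_sided_corrected = (n_extreme + 1) / (n_resamples + 1)

print(p_one_sided, p_two_sided, p_two_sided_corrected)
```

Which variant is appropriate depends on whether the hypothesis specifies a direction of the effect before the data are examined.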

Strengths & Challenges

Strengths of Permutation

  • Good for exploring the role of random variation.
  • Relatively easy to code, interpret, and explain.
  • Data can be numeric or binary.
  • Sample sizes can be the same or different.
  • Does not require normally distributed data.
  • Does not require a large sample size.
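To illustrate that the same procedure applies to binary data, here is a minimal sketch comparing two success rates. The counts, group sizes, and seed are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, for reproducibility only

# Hypothetical binary outcomes (1 = success, 0 = failure)
group_a = np.array([1] * 12 + [0] * 38)  # 12 of 50 succeed (24%)
group_b = np.array([1] * 21 + [0] * 24)  # 21 of 45 succeed (about 47%)

# For 0/1 data the mean is the success rate
observed = group_b.mean() - group_a.mean()

pooled = np.concatenate([group_a, group_b])
perm_diffs = np.empty(2000)
for i in range(perm_diffs.size):
    shuffled = rng.permutation(pooled)
    new_a, new_b = shuffled[:group_a.size], shuffled[group_a.size:]
    perm_diffs[i] = new_b.mean() - new_a.mean()

# One-sided empirical p-value for the difference in success rates
p_value = np.mean(perm_diffs >= observed)
print(f"observed difference in rates = {observed:.3f}, p = {p_value:.4f}")
```

Nothing in the procedure changes compared with the numeric case; only the interpretation of the statistic differs.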

Challenges of Permutation

  • Computationally expensive, since the statistic must be recalculated for every permutation.
  • Assumes that observations are exchangeable under the null hypothesis. If this assumption is violated, for example in time series data where observations are correlated, the test may not be valid.
  • Assumes that the null hypothesis involves some form of equality (e.g., equal means). Permutation tests are not designed to test more complex null hypotheses without adaptation.
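The computational cost can be made concrete with a short sketch. For very small samples, every possible regrouping can be enumerated to obtain an exact p-value; for the group sizes used in this entry, the number of regroupings is far too large, which is why a random subset of permutations is used instead. The small data set below is hypothetical.

```python
from itertools import combinations
from math import comb
from statistics import mean

# Hypothetical small samples
group_a = [12.0, 14.5, 13.1, 11.8]
group_b = [15.2, 16.0, 14.9, 17.3, 15.8]
pooled = group_a + group_b
n, n_b = len(pooled), len(group_b)
observed = mean(group_b) - mean(group_a)

# Enumerate every way of choosing which positions form group B
count_extreme = 0
for idx_b in combinations(range(n), n_b):
    chosen = set(idx_b)
    perm_b = [pooled[i] for i in chosen]
    perm_a = [pooled[i] for i in range(n) if i not in chosen]
    # Small tolerance guards against floating-point rounding
    if mean(perm_b) - mean(perm_a) >= observed - 1e-12:
        count_extreme += 1

n_arrangements = comb(n, n_b)  # 126 for group sizes 4 and 5
exact_p = count_extreme / n_arrangements

print(f"exact one-sided p-value: {count_extreme}/{n_arrangements}")
print(f"arrangements for group sizes 30 and 40: {comb(70, 30):.3e}")
```

With 70 observations split into groups of 30 and 40, full enumeration is not feasible, so the 1000 random permutations used above serve as an approximation of the full permutation distribution.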


References

1. Edgington, E., & Onghena, P. (2007). Randomization Tests (4th ed.). Chapman & Hall/CRC Press.

2. Bruce, P. (2014). Introductory Statistics and Analytics: A Resampling Perspective. Wiley.

3. Bruce, P., & Bruce, A. (2017). Practical Statistics for Data Scientists. O'Reilly Media, Inc. ISBN 9781491952962.

The author of this entry is Matthew Eiampikul.