# Permutation Test with Covariates – A Comprehensive Guide

nonparametricpermutation-test

I would like to perform a permutation test on a dataset with several batches and would like to take those batches into account during testing. Similar to a linear model variable ~ explanatory_variable + batch.

To illustrate what I mean, let's consider the following toy example:

There are two main types of lung cancer: Adenocarcionoma (LUAD) and squamous cell carcinoma (LUSC). The gene NKX2-1 is a known marker for LUAD tumor cells. Let's assume it is expressed 1.5x as much in LUAD compared to LUSC. Let's also assume our dataset consists of two batches that (for whatever reason) happen to have different baseline expressions of NKX2-1. In each dataset, the LUAD and LUSC patients are highly imbalanced:

batch # LUSC # LUAD NKX2-1 expr in LUSC NKX2-1 expr in LUAD
1 30 3 4 6
2 3 30 2 3

Ignoring batch effects, simply comparing the mean between datasets will incorrectly lead to the conclusion NKX2-1 were expressed at a higher level in LUSC. That's why usually I would use a linear model NKX2-1 ~ tumor_type + batch.

Unfortunately, my real data are not gene expression (for which established statistical models exist) and violate the normality assumption, which is why I would like to perform a permutation test instead. Is there an established way to perform a permutation test that takes batches (or other covariates) into account?

Here's some Python code playing with above example:

import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.formula.api as smf
np.random.seed(0)

# for the sake of the example, use a normal distribution.
# The real data violates normality assumption
lusc1 = np.random.normal(loc=4, scale=1, size=30)
lusc2 = np.random.normal(loc=2, scale=1, size=3)

df = pd.DataFrame().assign(
tumor_type=["LUSC"] * 30 + ["LUAD"] * 3 + ["LUSC"] * 3 + ["LUAD"] * 30,
dataset=["batch1"] * 33 + ["batch2"] * 33,
)

# Compute fold-change of LUSC compared to LUAD
np.mean(df.loc[lambda x: x["tumor_type"] == "LUAD", "expr"]) / np.mean(
df.loc[lambda x: x["tumor_type"] == "LUSC", "expr"]
)
# Huh, so the expression is actually less in LUAD!?
0.71248
model = smf.ols("expr ~ C(tumor_type, Treatment('LUSC')) + dataset", data=df)
res = model.fit()
res.summary()

Using a linear model, I correctly get a positive coefficient for tumor_type. But how can I achieve the same with a permutation test?