I would like to perform a permutation test on a dataset with several batches and would like to take those batches into account during testing. Similar to a linear model variable ~ explanatory_variable + batch
.
To illustrate what I mean, let's consider the following toy example:
There are two main types of lung cancer: Adenocarcionoma (LUAD) and squamous cell carcinoma (LUSC). The gene NKX2-1
is a known marker for LUAD tumor cells. Let's assume it is expressed 1.5x as much in LUAD compared to LUSC. Let's also assume our dataset consists of two batches that (for whatever reason) happen to have different baseline expressions of NKX2-1
. In each dataset, the LUAD and LUSC patients are highly imbalanced:
batch | # LUSC | # LUAD | NKX2-1 expr in LUSC | NKX2-1 expr in LUAD |
---|---|---|---|---|
1 | 30 | 3 | 4 | 6 |
2 | 3 | 30 | 2 | 3 |
Ignoring batch effects, simply comparing the mean between datasets will incorrectly lead to the conclusion NKX2-1
were expressed at a higher level in LUSC. That's why usually I would use a linear model NKX2-1 ~ tumor_type + batch
.
Unfortunately, my real data are not gene expression (for which established statistical models exist) and violate the normality assumption, which is why I would like to perform a permutation test instead. Is there an established way to perform a permutation test that takes batches (or other covariates) into account?
Here's some Python code playing with above example:
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.formula.api as smf
np.random.seed(0)
# for the sake of the example, use a normal distribution.
# The real data violates normality assumption
lusc1 = np.random.normal(loc=4, scale=1, size=30)
luad1 = np.random.normal(loc=6, scale=1, size=3)
lusc2 = np.random.normal(loc=2, scale=1, size=3)
luad2 = np.random.normal(loc=3, scale=1, size=30)
df = pd.DataFrame().assign(
expr=np.hstack([lusc1, luad1, lusc2, luad2]),
tumor_type=["LUSC"] * 30 + ["LUAD"] * 3 + ["LUSC"] * 3 + ["LUAD"] * 30,
dataset=["batch1"] * 33 + ["batch2"] * 33,
)
# Compute fold-change of LUSC compared to LUAD
np.mean(df.loc[lambda x: x["tumor_type"] == "LUAD", "expr"]) / np.mean(
df.loc[lambda x: x["tumor_type"] == "LUSC", "expr"]
)
# Huh, so the expression is actually less in LUAD!?
0.71248
model = smf.ols("expr ~ C(tumor_type, Treatment('LUSC')) + dataset", data=df)
res = model.fit()
res.summary()
Using a linear model, I correctly get a positive coefficient for tumor_type
. But how can I achieve the same with a permutation test?
Best Answer
It's possible to use randomization tests with regressions instead of relying on asymptotic distributions. There was recently a paper on this in the Quarterly Journal of Economics (link to ungated version) and the author provides some Stata code. However, Simonsohn subsequently showed that randomization tests did not outperform the appropriate robust method for computing standard errors.