Does a larger sample size increase multi-collinearity between predictors after imputation of missing data?

Tags: data-imputation, machine-learning, missing-data, multicollinearity

I have two datasets that have exactly the same 1701 predictors, but one has 936 subjects and the other has 547 subjects. (The initial rationale for creating these two different datasets was to see whether the conclusions would depend on the sample size and the percentage of missing data.) I then put both datasets through the same machine learning pipeline that comprises the following steps:

  • Scaling
  • Imputation of missing data with K-nearest neighbours
  • Dropping of predictors with zero variance
  • Dropping of predictors with a variance inflation factor (VIF) >= 5.0
  • Encoding of ordinal and categorical predictors
  • Feature selection using recursive feature elimination
  • Machine learning model
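
The question doesn't show any code, but step 4 (the VIF-based dropping) can be sketched roughly as below. This is a minimal NumPy illustration of the technique, not the asker's actual pipeline: `vif` and `drop_high_vif` are hypothetical helper names, and each VIF is computed directly as $1/(1-R_j^2)$ from regressing predictor $j$ on the remaining predictors.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing X[:, j] on the other columns."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid.var() / y.var()
    return 1.0 / max(1.0 - r2, 1e-12)  # guard against division by zero

def drop_high_vif(X, threshold=5.0):
    """Iteratively drop the column with the largest VIF until all VIFs < threshold.

    Returns the indices (into the original X) of the retained columns.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vifs = [vif(X[:, keep], i) for i in range(len(keep))]
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        del keep[worst]
    return keep
```

Dropping one predictor at a time (rather than all high-VIF predictors at once) matters, because removing a single member of a collinear group can bring the VIFs of the remaining members back under the threshold.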

I then noticed that the 4th step – the dropping of collinear predictors using VIF – reduced the number of predictors from 1641 to 954 for the larger dataset (with 936 subjects), but the number of predictors didn't budge at all (1604 to 1604) for the smaller dataset (with 547 subjects). For the latter, I verified that every VIF computed on the intermediate dataset entering that step was indeed < 5.0, so the retention of all predictors wasn't due to a bug.

That brings me to my question:
Does having a larger sample size (more subjects) somehow result in an increase in multi-collinearity amongst a dataset's predictors, after missing data imputation with K-nearest neighbours (or any other imputation method)?

I tried looking for answers on this forum as well as via a Google search, but couldn't find anything relevant. I'd be really grateful for any advice on this matter.
Thank you very much!

=======================

Addendum made after Ian Barnett's answer:

I've experimented with the code kindly written by Ian Barnett, and increasing the sample size doesn't appear to affect the degree of post-imputation collinearity in that particular example, but increasing the percentage of missing data does. (I hope I'm doing this correctly – I noticed that increasing pmiss led to an unexpected increase in the frequencies (the heights of the histogram bars), which made them appear 'saturated', as shown in the images.)
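
To compare the two effects more systematically, the same experiment can be re-implemented so that both n and pmiss are easy to vary. This is my own Python/NumPy sketch of Ian Barnett's R simulation (the helper name `mean_imputed_corr` is hypothetical), under the same assumptions: bivariate normal data with deterministic regression imputation of $X_2$ from $X_1$:

```python
import numpy as np

def mean_imputed_corr(n=100, rho=0.4, pmiss=0.3, B=200, seed=1):
    """Average post-imputation correlation of (X1, X2) over B replications,
    imputing the missing X2 values from a complete-case regression on X1."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    cors = []
    for _ in range(B):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        perm = rng.permutation(n)
        miss, obs = perm[:int(n * pmiss)], perm[int(n * pmiss):]
        # fit X2 ~ X1 on the complete cases (polyfit returns slope, intercept)
        b1, b0 = np.polyfit(x[obs, 0], x[obs, 1], 1)
        xi = x.copy()
        xi[miss, 1] = b0 + b1 * xi[miss, 0]  # deterministic imputation
        cors.append(np.corrcoef(xi[:, 0], xi[:, 1])[0, 1])
    return float(np.mean(cors))
```

Holding pmiss fixed while increasing n leaves the average post-imputation correlation essentially unchanged, whereas increasing pmiss at fixed n raises it noticeably – consistent with what the histograms show.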

[Four histograms of pre- vs post-imputation correlations at increasing values of pmiss]

Best Answer

Imputation can indeed increase multi-collinearity. I've demonstrated this with a simple bivariate normal example below. In this example I generate 100 bivariate normal random variables with correlation $\rho=.4$, then randomly create some missingness in one of the variables ($X_2$) while leaving the other variable ($X_1$) complete. After this I fit a simple imputation regression model that uses only the complete observations to predict $X_2$ from $X_1$. I stored the correlation of $X_1$ and $X_2$ both pre- and post-imputation and repeated this 1000 times to generate the histograms below. You will notice that the complete data has correlations hovering around the true $\rho$, while the imputed data generally has higher correlations.

This is likely what is happening in your case. Increasing sample size does not increase multi-collinearity, but imputation certainly can as I've demonstrated here.
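
A back-of-the-envelope calculation (my own sketch, assuming unit variances, zero means, and a complete-case sample large enough that $\hat\beta_1 \approx \rho$) shows why deterministic regression imputation inflates the correlation, and why the effect depends on the missing fraction $p$ but not on $n$. The imputed values satisfy $\hat X_2 = \hat\beta_0 + \hat\beta_1 X_1$ exactly, so those rows contribute covariance $\hat\beta_1\operatorname{Var}(X_1) \approx \rho$ but variance only $\hat\beta_1^2\operatorname{Var}(X_1) \approx \rho^2$. Pooling a fraction $p$ of imputed rows with $1-p$ complete rows gives

$$\operatorname{cor}(X_1, X_2^{\text{imp}}) \approx \frac{(1-p)\rho + p\rho}{\sqrt{(1-p) + p\rho^2}} = \frac{\rho}{\sqrt{1 - p(1-\rho^2)}} > \rho.$$

For $\rho = .4$ this gives roughly $0.46$ at $p=.3$ and $0.57$ at $p=.6$; the sample size $n$ does not appear in the formula.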



library(mvtnorm)

set.seed(124)
B=1000
rho=.4
n=100
pmiss=.3

impucor=rep(NA,B)
origcor=rep(NA,B)
for(b in 1:B){
  # generate bivariate data
  dat=rmvnorm(n,mean=c(0,0),sigma=matrix(c(1,rho,rho,1),nrow=2))
  datmis=dat
  
  # create missing data in the second variable, X2, while leaving X1 complete
  IDall=sample(1:n)
  IDmiss=IDall[1:floor(n*pmiss)]
  IDcomplete=IDall[(floor(n*pmiss)+1):n]
  datmis[IDmiss,2]=NA
  
  # imputation model
  X1c=datmis[IDcomplete,1]
  X2c=datmis[IDcomplete,2]
  lm.out =lm(X2c~X1c)
  betacoef=coef(lm.out)
  
  # create imputed data matrix by using predictions from imputation regression model
  datimp = datmis
  datimp[IDmiss,2]=betacoef[1]+betacoef[2]*datimp[IDmiss,1]
  # store the correlation between X1 and X2 in the original data as
  # well as in the imputed data after missingness is induced
  impucor[b]=cor(datimp)[1,2]
  origcor[b]=cor(dat)[1,2]
}



plt1=hist(origcor,breaks=30)
plt2=hist(impucor,breaks=30)
plot(plt1,col=rgb(0,0,1,1/4),xlim=c(0,1),main="Correlation of imputed versus original data",xlab="Correlation")
plot(plt2,col=rgb(1,0,0,1/4),add=T)
colors=c(rgb(1,0,0,1/4),rgb(0,0,1,1/4))
legend("topright",c("Imputed","Original"),col=colors,pch=15)

[Histogram: post-imputation (red) vs original (blue) correlations]
