Solved – Finding the best combination of variables for high R-squared values

Tags: r, r-squared, regression

I've spent quite some time trying to figure out how to get the best R-squared value by trying different combinations of variables in a linear regression. I have allele frequency data and data for 14 environmental gradients. The allele frequency values are fixed, but combinations of 2 to 14 of the 14 environmental variables are used.

My aim is to find a combination of the environmental variables that yields a high R-squared value. Here is a simple linear regression call that returns the R-squared value.

> summary(lm(allele ~ compositevalues))$r.squared

"compositevalues" is the sum of the standardized environmental variables. As mentioned above, I want to form combinations of 2 to 14 variables without replacement, i.e. var1+var2, var1+var3, var1+var4, var1+var2+var3, var2+var3+var4, var2+var3, var2+var4, var3+var4, etc., but not var1+var1+var2.

I would appreciate it if you could show me how to write code that generates random combinations of (sums of) the variables and returns the combinations whose R-squared value is > 0.4.

I looked for permutation and resampling functions in R, but couldn't find any that serve my purpose.

Below is a part of my data set.

Location,  allele,         var1,           var2,           var3
site1,     0.230271924,    -0.872093023,   -0.696403914,   -0.398671096
site2,     -1.061563963,   0.944767442,    1.104640692,    -0.398671096
site3,     -0.524508594,   0.339147287,    -1.296752116,   0.431893688
site4,     0.027061785,    2.156007752,    -0.096055712,   0.431893688
site5,     0.186726894,    0.944767442,    1.104640692,    -0.398671096
site6,     -0.118088315,   -0.266472868,   -0.696403914,   -0.398671096
site7,     -1.003503923,   0.339147287,    -1.296752116,   0.431893688
site8,     -1.569589312,   0.339147287,    -1.296752116,   0.431893688
site9,     -1.119624003,   0.944767442,    0.50429249,     -1.22923588
site10,    1.362442702,    -1.477713178,   -0.096055712,   1.262458472
site11,    0.215756914,    0.339147287,    -1.897100318,   1.262458472
site12,    0.665722223,    -1.477713178,   -0.096055712,   1.262458472
site13,    1.086657513,    -1.477713178,   -0.096055712,   1.262458472
site14,    -0.001968235,   0.339147287,    1.704988894,    -2.059800664
site15,    -1.656679372,   0.339147287,    1.104640692,    -2.059800664
site16,    0.433482064,    0.339147287,    1.704988894,    -2.059800664
site17,    -0.814808794,   1.550387597,    -1.296752116,   -0.398671096
site18,    -0.713203724,   1.550387597,    -0.696403914,   -0.398671096

Best Answer

In your case, it is feasible to try out all combinations of sums: 14 variables have 2^14 - 1 = 16383 non-empty subsets (a count that includes the 14 single-variable cases). I wrote a quick and dirty implementation of that. With 14 variables it takes less than a minute to try out all combinations. If you want random combinations instead, you can modify the code to meet your needs.
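As a quick check on that count before the implementation, the number of subsets can be computed with choose(). The names n.vars, total and at.least.two are only illustrative; the second figure is the number of combinations with 2 to 14 variables, which is the range asked about in the question.

```r
n.vars <- 14

# all non-empty subsets: sizes 1 to 14
total <- sum(choose(n.vars, 1:n.vars))

# subsets with at least two variables: sizes 2 to 14
at.least.two <- sum(choose(n.vars, 2:n.vars))

total          # 16383
at.least.two   # 16369
```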

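# placeholder: replace with a numeric matrix holding your 14 environmental variables,
# one row per site (same number of rows as the length of allele)
my.vars <- matrix(NA, ncol=14, nrow=length(allele))
colnames(my.vars) <- paste("var", 1:14, sep="") # add column names "var1" - "var14"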
my.grad.data <- 1:14
sum.vars <- vector()
r.2 <- vector()
comb.mat <- matrix(numeric(0), nrow=14, ncol=0) # initialise the matrix containing all combinations

for ( i in 1:14 ) { # generate and store all possible combination of sums of the 14 variables

  t.mat <- combn(my.grad.data, m=i)

  comb.mat <- cbind(comb.mat, rbind(t.mat, matrix(NA, ncol=dim(t.mat)[2] , nrow=14-i)))
}

for ( j in 1:dim(comb.mat)[2] ) { # calculate and store the R2 for all combinations

  idx <- comb.mat[!is.na(comb.mat[, j]), j] # column indices of this combination, NA padding dropped

  sum.vec <- rowSums(my.vars[, idx, drop=FALSE], na.rm=TRUE) # composite value: row-wise sum of the selected variables

  sum.vars[j] <- paste(colnames(my.vars)[idx], collapse="+") # label such as "var1+var3"

  r.2[j] <- summary(lm(allele ~ sum.vec))$r.squared 
}


result.frame <- data.frame(combination=sum.vars, r2=r.2)

result.frame.sorted <- result.frame[order(r.2, decreasing=TRUE), ]

head(result.frame.sorted, n=10) # the 10 "best" combinations
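
If you prefer random combinations with the R-squared > 0.4 cutoff from the question, one possible modification is sketched below. It is only an illustration: the data are simulated stand-ins so that the snippet runs on its own, and n.sites, n.draws, combos, random.frame and hits are made-up names. Note that it picks the combination size uniformly from 2 to 14 first, so small and large combinations are over-represented compared with drawing uniformly from all subsets.

```r
set.seed(1)                                  # for a reproducible illustration only

# simulated stand-in data (assumption): replace with your own allele vector and variables
n.sites <- 18
my.vars <- matrix(rnorm(n.sites * 14), ncol = 14,
                  dimnames = list(NULL, paste0("var", 1:14)))
allele <- rnorm(n.sites)

n.draws <- 200                               # number of random combinations to try (arbitrary)
combos <- character(n.draws)
r.2 <- numeric(n.draws)

for (k in seq_len(n.draws)) {
  size <- sample(2:14, 1)                    # how many variables go into this combination
  idx <- sort(sample(14, size))              # draw variable indices without replacement
  sum.vec <- rowSums(my.vars[, idx, drop = FALSE])
  combos[k] <- paste(colnames(my.vars)[idx], collapse = "+")
  r.2[k] <- summary(lm(allele ~ sum.vec))$r.squared
}

# drop combinations that were drawn more than once, then keep those above the cutoff
random.frame <- unique(data.frame(combination = combos, r2 = r.2))
hits <- random.frame[random.frame$r2 > 0.4, ]
hits[order(hits$r2, decreasing = TRUE), ]
```

The same cutoff can be applied to the exhaustive results above with result.frame[result.frame$r2 > 0.4, ].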