Solved – How to choose the best combination of covariates in Cox multiple regression

combinatoricscox-modelrregression

I am performing a multivariate Cox regression analysis, and would like to find what combination of those covariates best predict my outcome.

Say I have a list of candidate genes whose expressions showed (1) to be associated with overall survival (OS) (Cox regression), and (2) also associated among themselves (multivariate linear modeling). For example, high levels of gene_1 AND low levels of gene_2 AND low levels of gene_3 are associated with poor prognosis. So, if I am using only these 3 variables and a decent number of patients (e.g. n=200), it is not hard to find those patients with that combination of genes outcome.

However, if for instance I have a list comprising 7 of such candidates, the chances of finding a patient that fits this criterion (now for these 7 hits) is nearly none.

So my question is: is there a way to do some sort of permutation/combinatorial analysis coupled with Cox regression to find the combination of those 7 targets that best associates with OS? Considering that a satisfactory combination of factors is represented say in at least 25% of the patient population.

An example:

> shortlist[1:5,1:10]
           Age    PFS PFS_codex     OS OS_codex   gene_1   gene_2   gene_3   gene_4   gene_5
Sample_1  67.9 117.23         0 115.69        0 9.451046 5.572303 7.260597 8.492154 4.010582
Sample_2    61  69.27         0  72.30        1 9.520935 9.956700 8.370941 6.854242 4.638455
Sample_3  69.1   1.23         1   1.08        1 9.691664 8.713712 8.840432 7.891189 3.707268
Sample_4  72.2  15.27         1  69.63        1 9.490668 9.015255 8.601908 9.584230 4.277126
Sample_5  40.7  61.43         1  78.41        1 9.439942 7.769697 7.337121 7.222432 4.843211   

This shortlist is already with only those candidate that passed a Cox Regression univariate analysis. So now, I run Cox again, with the exception that this time all candidates #from the shortlist are put up together:

multicox <- coxph(Surv(OS, OS_codex) ~ gene_1 + gene_2 + gene_3 + ... + gene_27, data=shortlist)

The results are the following (because of space, only those with significant p-values are listed):

> summary(multicox)
Call:
coxph(formula = Surv(OS, OS_codex) ~ gene_1 + gene_2 + gene_3 + ... + gene_27, data = shortlist)

  n= 196, number of events= 133 

            coef exp(coef) se(coef)      z Pr(>|z|)    
gene_1   0.80167   2.22926  0.22501  3.563 0.000367 ***
gene_8   0.15332   1.16570  0.06417  2.389 0.016888 *  
gene_9   0.76781   2.15505  0.24603  3.121 0.001803 ** 
gene_12 -0.84846   0.42807  0.30868 -2.749 0.005984 ** 
gene_18 -0.60773   0.54459  0.13789 -4.407 1.05e-05 ***
gene_26  0.35992   1.43321  0.14591  2.467 0.013634 *  
gene_27  0.41905   1.52052  0.14972  2.799 0.005127 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

There are 7 candidates showing association to the OS (considering p<0.05).
Based on that, and on the HR (to know whether upregulation or downregulation is associated to OS) of each one of them, I would like to test the hypothesis that the combination of all these candidates are indeed good predictors for poor prognosis.

For that, I would like to artificially create 2 groups of patients: 'high risk' and 'low risk'. The 'high risk' group is the one which the up- or down-regulation for each candidate showed to be associated with OS, respectively. The 'low risk' is the rest. (OBS: here, I am considering up- or down-regulation all values that are above and below the median for each gene, respectively)

First, I store the median values for all the 7 genes:

> a <- data.frame(gene_1=median(shortlist$gene_1), gene_8=median(shortlist$gene_8), gene_9=median(shortlist$gene_9), gene_12=median(shortlist$gene_12), gene_18=median(shortlist$gene_18),             gene_26=median(shortlist$gene_26), gene_27=median(shortlist$gene_27))
> a <- as.numeric(a)
> a
[1] 8.999678 5.681134 5.907599 8.420542 6.158107 3.279144 7.020527

So the classification now comprises the following:

shortlist$risk_class <- ifelse(shortlist$gene_1>a[1] & shortlist$gene_8>a[2] & shortlist$gene_9>a[3] & shortlist$gene_12<a[4] & shortlist$gene_18<a[5] & shortlist$gene_26>a[6] & shortlist$gene_27>a[7], "high_risk", "low_risk")

However, when I do that, there is only one patient that fits in that criteria:

>sum(shortlist$gene_1>a[1]&shortlist$gene_8>a[2]&shortlist$gene_9>a[3]&shortlist$gene_12<a[4]&shortlist$gene_18<a[5]&shortlist$gene_26>a[6]&shortlist$gene_27>a[7])
    [1] 1

So, my question is, would there be a way to test for the best combination of those 7 candidates, with a reasonable number of patients (i.e. >30)? I was thinking about some sort of permutation/combinatorial analysis coupled with Cox regression to find the combination of those targets that best associates with OS?

For example:

gene_1 + gene_9 + gene_18 + gene_26 OR

gene_1 + gene_18 + gene_27 OR

gene_9 + gene_18 + gene_26 + gene_27 OR

gene_18 + gene_27 … and so forth.

And finally, would there be any package in R that can perform it?
Thanks a lot!

Best Answer

Converting continuous predictors to binary

You already have a combination of genes whose expressions on a continuous scale (apparently log-transformed) are associated with outcome. Trying to convert those continuous predictors to binary predictors, as you propose, runs a serious risk of making things worse.

This page is a good introduction to the problems introduced by binning continuous predictors. You are throwing away a lot of information by converting to binary indicators of expression.* That might make sense if the distribution of expression levels among patients clearly showed bimodality indicating two expression groups (for example, estrogen-receptor $\alpha$ expression distinguishes ER-positive from ER-negative breast cancer), but then you would choose a cutoff that distinguished those groups, which would not be the median expression unless the 2 groups were of equal size.

Selecting predictors

The question you ask still applies to selection of continuous predictors.** What you are proposing is known as best-subset selection, extensively discussed on this site for example on this page and computationally feasible when starting with only 7 predictors. Such attempts at automating model selection can be ill-advised, as discussed on this page. At the minimum, you will not be able to use standard p-values and significance testing as you will have used the outcomes to select predictors at several steps of your modeling,*** and predictions are likely to be biased.

You will find that the particular set of genes chosen will depend highly on the particular sample of patients you analyze. You can examine that issue with the gene-selection methods you have already used: try repeating your model-building process thus far on multiple bootstrap samples from your dataset. I suspect that you will not always identify the same 7 candidate genes. That is an inherent problem with any variable-selection method when you have many more predictors than cases and does not necessarily pose problems for properly constructed predictive models. This problem should, however, discourage attempts to rely on the mechanistic importance of any particular set of chosen genes.

Cox regression issues

You also are faced with issues that affect Cox regressions differently from standard linear regressions, as discussed in this answer. First, in Cox regression the power is provided by the number of cases with events (progression or death in your study), not the overall number of cases. The usual rule of thumb for unpenalized Cox regressions is 10-20 events per predictor that you are considering in a model. With only 133 events your 27-predictor model (only about 5 events per predictor) may be overfit substantially, with results unlikely to generalize well. If you try to work with only 30 cases and only half have events, you are unlikely to get beyond 1 or 2 predictors without overfitting.

Second, omitting any predictor related to outcome from a Cox regression model leads to biased coefficient estimates for the included variables. If the omitted predictor is uncorrelated with the included predictors the bias tends to be toward lower-magnitude coefficients, so in practice finding the genes most associated with outcome as individual predictors can still provide a useful screening step. If such an omitted predictor is correlated with an included predictor, however, there is no guarantee about the direction of the bias.

Third, related to the previous paragraph, is the omission of clinical data from the survival analysis. It's possible that the genes you identify will be proxies for clinical variables like sex or disease stage already known to be related to outcome. That's still interesting, but the interpretation of results would be quite different from a situation in which the gene expression adds to clinical information, as is the case for the Oncotype DX® and MammaPrint® gene panels in breast cancer. This issue is critical if patients received different therapies. A gene whose expression is related to outcome following chemoradiation therapy might have no relation to outcome following surgical excision of a tumor.

How to proceed

Identifying small sets of genes whose expression is related to outcome can readily be done with LASSO. It would select a subset of genes that together are strongly related to outcome while penalizing their regression coefficients to minimize overfitting. The R glmnet package provides tools for analyzing Cox models, including cross-validation to choose the optimum penalty factor (based on partial likelihood deviance in Cox models) and thus the number of genes chosen. LASSO can't avoid the dependence of the gene set chosen on the particular sample at hand, and you still should consider incorporating standard clinical variables into your model, but it provides a principled and well accepted approach to this type of problem.


*Note that gene-expression panels for breast cancer prognostication are based on continuous expression measurements.

**The gene expression panels noted above involve 3 to 10 times more than your 7 genes, measured by qPCR (21 genes, Oncotype DX®) or microarray (70 genes, MammaPrint®). The groups behind those tests worked hard to identify small yet useful sets of genes. Thus it is unlikely that restricting yourself to even fewer than 7 genes will provide much prognostic information. There are nevertheless important statistical issues that you raise.

***There is some recent work on correcting for variable selection in standard linear regression (see Chapter 6 of Statistical Learning with Sparsity) but it's not clear to me how well that work applies to Cox regression.

Related Question