Number of variables to include in the GLM and how to interpret the surrogate value analysis

bioinformaticsgeneralized linear model

I am working with genes and I am designing a model that its dependent variable must be diagnosis, whereas the rest of potential variables to include are sex, ethnicity(2 levels), rin(RNA integrity number), age and region(2 different body areas where sampling occured). Rin and age are continuous and I have transformed them to categorical (3 levels each):

# diag sex ethn rin age region
1 0 1 0 1 1 0
6 1 1 1 2 3 1
9 0 0 1 2 1 0
10 0 1 1 2 3 0
11 1 0 0 3 2 0

…….etc

I have around 400 samples, 150 of them as controls, and traits region and sex are highly skewed.

I performed ANOVA testing for nested glm models to define which of the variables are the most important to include and these are ethnicity, sex and age:

**Analysis of Deviance Table**

Model 1: as.factor(diag) ~ as.factor(ethn) + as.factor(sex)

Model 2: as.factor(diag) ~ as.factor(ethn) + as.factor(sex) + as.factor(age)

  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       381     500.60                          
2       379     473.39  2   27.212 **1.233e-06** ***

**Analysis of Deviance Table**

Model 1: as.factor(diag) ~ as.factor(ethn) + as.factor(sex) + as.factor(age)

Model 2: as.factor(diag) ~ as.factor(ethn) + as.factor(sex) + as.factor(age) + as.factor(rin)

  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       379     473.39                     
2       377     470.76  2   2.6258    **0.269**

Because after PCA I suspected batch effects, I performed surrogate value analysis(using https://bioconductor.org/packages/sva/) by trying different combinations of variables. But I notice that the most significant surrogate values arise when I include unimportant variables such as region/rin(sv3, p<0.000001), but not when I tried my original statistically significant variables(sv1, p<0.03):

**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)

**mod** = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)

Number of significant surrogate variables is: 1

**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(rin), data=datTraits)

**mod** = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(rin), data=datTraits)

Number of significant surrogate variables is: 5

**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region)+as.factor(rin), data=datTraits)

**mod** = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region)+as.factor(rin), data=datTraits)

Number of significant surrogate variables is: 1

**mod0** = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)

**mod** = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)

Number of significant surrogate variables is: 5

My goal is to remove from my model the surrogate value that captures as much variance as possible and is not correlated to dependent variable "diag".

Could anyone please tell me what I am doing wrong???

Prida

PS: Below I have added the SV results when I have included the region variable and when not. Since in the first case the surrogate value is positively correlated to the dependent biological variable, shall I keep it?

mod0 = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age)+as.factor(region), data=datTraits)
Response Y3 :

Call:
lm(formula = Y3 ~ datTraits$diag)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.11038 -0.03425 -0.00597  0.03375  0.15983 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -0.009628   0.003157  -3.050  0.00245 ** 
datTraits$diag1  0.026792   0.005266   5.088  5.7e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.04951 on 382 degrees of freedom
Multiple R-squared:  0.06346,   Adjusted R-squared:  0.06101 
F-statistic: 25.88 on 1 and 382 DF,  p-value: 5.695e-07
mod0 = model.matrix(~as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)
mod = model.matrix(~as.factor(diag)+as.factor(ethn)+as.factor(sex)+as.factor(age), data=datTraits)
Call:
lm(formula = svobj$sv ~ datTraits$diag)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.09461 -0.06206  0.02571  0.04066  0.06141 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)  
(Intercept)      0.004097   0.003243   1.263   0.2073  
datTraits$diag1 -0.011401   0.005410  -2.107   0.0357 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05087 on 382 degrees of freedom
Multiple R-squared:  0.01149,   Adjusted R-squared:  0.008904 
F-statistic: 4.441 on 1 and 382 DF,  p-value: 0.03574
```

Best Answer

There are two separate types of analyses that are being used here to try to evaluate genes whose expression levels are specifically associated with the diagnosis of interest. The analyses, however, serve different purposes and get at different issues.

One issue is that gene-expression levels might just be surrogates for some other clinically important characteristics associated with diagnosis, like sex or age. You want your analysis to control for those characteristics so that you focus on genes whose expression levels are associated with the diagnosis itself rather than just with patient characteristics that might be associated with it.

That what the first sets of regressions (logistic GLMs) seem to be doing: trying to identify clinical characteristics associated with the diagnosis and include in the model. They suggest that sex and age and ethnicity are associated with diagnosis. The RNA integrity number (RIN), a measure of the technical quality of the RNA obtained from a tissue sample, and the region, the part of the body from which the tissue sample was taken, aren't associated with the diagnosis. There really is no reason that those two characteristics of a sample should be.

The second problem with gene-expression studies is that there can be several unknown factors affecting gene expression that have nothing to do with the clinical diagnosis. Batch effects, in which there are systematic differences in gene-expression levels among samples assayed at different times, are one such factor.

That's what the second set of analyses, surrogate variable analysis (SVA), tries to account for. SVA is a way to try to control for un-modeled factors like batch effects, effects not included in the known predictors included in the regressions with gene expression as the outcomes.

In SVA, you first do a multivariate multiple regression, with gene-expression values as outcomes and the diagnosis and other clinical factors as predictors. You determine the matrix of residuals between observed and predicted values.

Then you do a singular-value decomposition (SVD) of the matrix of residuals. To identify "significant" un-modeled effects, you use multiple randomizations of the residuals matrix as a null and determine how many singular values of the original residuals matrix differ "significantly" from that null. That's the number of "significant surrogate values" to include.

Finally, you repeat the multivariate multiple regression but include the "significant" singular vectors as additional predictors. The idea is that this will minimize the contribution of genes whose expression levels differ due to the un-modeled factors. The hope is that, after this additional correction, genes with large coefficients associated with the diagnosis predictor will then be more likely to be biologically relevant rather than technical artifacts.

The sets of model matrices near the end of this question seem to be different choices for the ultimate SVA. The first matrix in each set of 2 is a set of predictors other than diagnosis; the second adds in the predictor diagnosis, the main predictor of interest. The number of "significant surrogate values" is reported for each.

With that background, the superiority of a model including RIN and region in terms of minimizing the number of "significant surrogate values" isn't surprising. Neither has anything to do with diagnosis, but both have a lot to do with other sources of variability in gene-expression levels. As a measure of technical sample quality, RIN has a lot to do with the ability to detect gene-expression values reliably in a sample. Similarly, samples from different body regions might have different complements of cell types and associated gene-expression patterns that have nothing directly to do with diagnosis but lead to differences in gene expression levels in a sample.

So if you include RIN and region as predictors in the ultimate model, you are directly modeling major sources of variability in the gene-expression measurements. It doesn't matter that they aren't associated themselves with diagnosis: they still are associated with differences in gene-expression values. Including them thus leaves fewer "unmodeled" factors to be handled by the SVA.

In general, it's often good practice incorporate as many predictors as reasonable, without overfitting, when you are modeling. With this scale of study, a few hundred patients, there probably is no need to do the preliminary logistic GLM at all. You should be able to handle a couple of dozen predictors potentially associated with diagnosis, or otherwise with gene-expression levels, and thus minimize the unmodeled factors to be handled by the SVA.

Two cautions about this approach.

First, you would be better off modeling the continuous predictors flexibly with regression splines rather than categorizing them. See this page among many others.

Second, the SVA will attempt to remove "unmodeled heterogeneity" in the gene-expression values from any source. That means that it's important to get a reasonably correct model to start with.

For example, if gene expression is best modeled on a log scale and you do the analysis on a linear scale, that can show up as "unmodeled heterogeneity." Or if expression of a gene only matters with respect to diagnosis in females rather than in males, you will miss that if you don't include a sex:diagnosis interaction term explicitly in the model. Otherwise that will just be yet more "unmodeled heterogeneity" to be subsumed in the "surrogate values" and lost to your analysis.

If you find some genes whose apparent importance differs greatly between with and without the SVA, it would make sense to investigate more closely what's going on with them.

Related Question