It matters what you mean by prediction. Unfortunately, this term can be somewhat ambiguous, especially since the linear combination of covariates in the regression model is often referred to as a linear predictor.
The typical purpose of a generalized linear model is to estimate the population mean and to perform inference on that mean. This would be the success probability in a Bernoulli model and the mean in a Poisson or gamma model.
The word prediction is best reserved for settings where interest centers on a future sampled observation. Of course, our best point prediction of a future observation is the estimated mean of the population. For a gamma model one would report the estimated mean as the point prediction for a future observation. For a Bernoulli model one would report whichever of the values 0 and 1 has the larger estimated probability, since an individual observation can take on only these discrete values. For a Poisson model one could report the mean rounded to the nearest integer, since the support of the Poisson distribution is the non-negative integers; one could also apply the floor or ceiling function to the mean to produce a point prediction.
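In R, these point predictions might look like the following minimal sketch; `dat`, `y`, and `x` are hypothetical placeholders, with `y` of the appropriate type for each family:

```r
# Gamma: the estimated mean is the point prediction
fit_gamma <- glm(y ~ x, family = Gamma(link = "log"), data = dat)
predict(fit_gamma, newdata = data.frame(x = 2), type = "response")

# Bernoulli: report whichever of 0/1 has the larger estimated probability
fit_bern <- glm(y ~ x, family = binomial, data = dat)
p <- predict(fit_bern, newdata = data.frame(x = 2), type = "response")
as.integer(p > 0.5)

# Poisson: round the estimated mean (or use floor()/ceiling())
fit_pois <- glm(y ~ x, family = poisson, data = dat)
round(predict(fit_pois, newdata = data.frame(x = 2), type = "response"))
```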
One might also be interested in presenting estimated percentiles of the population. It is important that these be presented with tolerance intervals (confidence intervals for population percentiles). Alternatively, one might be interested in quantifying the uncertainty in the point prediction for a single future observation. That requires a prediction interval, which is not the same as the estimated percentiles. Here is a related thread that discusses prediction intervals.
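For a gamma GLM, one crude way to form a 95% prediction interval for a single future observation is to plug the estimated mean and shape into the gamma quantile function. This sketch ignores parameter-estimation uncertainty, so the interval will tend to be a bit too narrow; names are again hypothetical:

```r
fit   <- glm(y ~ x, family = Gamma(link = "log"), data = dat)
mu    <- predict(fit, newdata = data.frame(x = 2), type = "response")
shape <- 1 / summary(fit)$dispersion    # Pearson-based shape estimate
qgamma(c(0.025, 0.975), shape = shape, rate = shape / mu)  # crude 95% PI
```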
Addendum: Splitting the data into training and test sets is for the purpose of validating the out-of-sample predictive ability of a model. My preferred approach is not to split the data into training and test sets. Rather, I suggest bootstrapping (sampling with replacement) $n$ observations from the data set as if it were the population, fitting the model, and constructing a point prediction or interval prediction for a particular prediction target (a single future $y$ [$m=1$ observation] or a future $\bar{y}$ based on $m$ observations). Then bootstrap a sample of size $m$ and tally (i) the discrepancy between the point prediction and the target, and (ii) whether the prediction interval covered the target. Repeat this 10,000 times, plot the histogram of point-prediction errors, and calculate the coverage rate of the prediction intervals. This validates the performance of the model based on its operating characteristics.
Sampling with replacement from your data set treats it as if it were a much larger population. The percentiles of your data set likely do not match the theoretical percentiles of the GLM you posit. This means there is slight model misspecification, so don't be surprised if the prediction intervals do not cover at the nominal level and the histogram of prediction errors shows a small bias (is not centered at zero). You can also perform this type of validation through simulation, by randomly generating observations from the theoretical model that matches your GLM, e.g. gamma or Poisson. There you should find that the prediction intervals perform close to the nominal level and the point prediction is asymptotically unbiased for the target.
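Here is a minimal sketch of that validation loop, assuming an intercept-only gamma GLM, a single future observation ($m = 1$), and the crude gamma-quantile interval from above; `y` is a hypothetical numeric vector holding the observed data:

```r
set.seed(1)
n <- length(y)
B <- 10000                                   # number of bootstrap replications
err <- numeric(B); covered <- logical(B)
for (b in seq_len(B)) {
  yb    <- sample(y, n, replace = TRUE)      # resample the data as the "population"
  fit   <- glm(yb ~ 1, family = Gamma(link = "log"))
  mu    <- unname(fitted(fit)[1])            # point prediction for a future y
  shape <- 1 / summary(fit)$dispersion
  pi95  <- qgamma(c(0.025, 0.975), shape = shape, rate = shape / mu)
  target <- sample(y, 1)                     # a "future" draw from the population
  err[b]     <- mu - target
  covered[b] <- pi95[1] <= target && target <= pi95[2]
}
hist(err, main = "Point-prediction errors")  # ideally centered near zero
mean(covered)                                # compare with the nominal 0.95
```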
This type of approach can also be used to validate point and interval estimation of a population parameter.
There are two separate types of analyses that are being used here to try to evaluate genes whose expression levels are specifically associated with the diagnosis of interest. The analyses, however, serve different purposes and get at different issues.
One issue is that gene-expression levels might just be surrogates for some other clinically important characteristics associated with diagnosis, like sex or age. You want your analysis to control for those characteristics so that you focus on genes whose expression levels are associated with the diagnosis itself rather than just with patient characteristics that might be associated with it.
That's what the first set of regressions (logistic GLMs) seems to be doing: trying to identify clinical characteristics associated with the diagnosis so they can be included in the model. Those regressions suggest that sex, age, and ethnicity are associated with diagnosis. The RNA integrity number (`RIN`), a measure of the technical quality of the RNA obtained from a tissue sample, and the `region`, the part of the body from which the tissue sample was taken, aren't associated with the diagnosis. There really is no reason those two characteristics of a sample should be.
The second problem with gene-expression studies is that there can be several unknown factors affecting gene expression that have nothing to do with the clinical diagnosis. Batch effects, in which there are systematic differences in gene-expression levels among samples assayed at different times, are one such factor.
That's what the second set of analyses, surrogate variable analysis (SVA), tries to account for. SVA is a way to control for un-modeled factors like batch effects: effects not captured by the known predictors in the regressions with gene expression as the outcomes.
In SVA, you first do a multivariate multiple regression, with gene-expression values as outcomes and the diagnosis and other clinical factors as predictors. You determine the matrix of residuals between observed and predicted values.
Then you do a singular-value decomposition (SVD) of the matrix of residuals. To identify "significant" un-modeled effects, you use multiple randomizations of the residuals matrix as a null and determine how many singular values of the original residuals matrix differ "significantly" from that null. That's the number of "significant surrogate values" to include.
Finally, you repeat the multivariate multiple regression but include the "significant" singular vectors as additional predictors. The idea is that this will minimize the contribution of genes whose expression levels differ due to the un-modeled factors. The hope is that, after this additional correction, genes with large coefficients associated with the `diagnosis` predictor will then be more likely to be biologically relevant rather than technical artifacts.
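For orientation, here is a hedged sketch of that workflow with the Bioconductor `sva` package; the expression matrix `edata` (genes in rows, samples in columns) and the columns of `pheno` are hypothetical placeholders for your data:

```r
library(sva)

# Full model (with diagnosis) and null model (without it)
mod  <- model.matrix(~ diagnosis + sex + age + ethnicity + RIN + region, data = pheno)
mod0 <- model.matrix(~ sex + age + ethnicity + RIN + region, data = pheno)

# Permutation-based (Buja-Eyuboglu) estimate of the number of
# "significant" surrogate variables, then the SVA itself
n.sv  <- num.sv(edata, mod, method = "be")
svobj <- sva(edata, mod, mod0, n.sv = n.sv)

# Repeat the gene-wise regressions with the surrogate variables added
modSv <- cbind(mod, svobj$sv)
```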
The sets of model matrices near the end of this question seem to be different choices for the ultimate SVA. The first matrix in each set of 2 is a set of predictors other than `diagnosis`; the second adds in the predictor `diagnosis`, the main predictor of interest. The number of "significant surrogate values" is reported for each.
With that background, the superiority of a model including `RIN` and `region` in terms of minimizing the number of "significant surrogate values" isn't surprising. Neither has anything to do with diagnosis, but both have a lot to do with other sources of variability in gene-expression levels. As a measure of technical sample quality, `RIN` has a lot to do with the ability to detect gene-expression values reliably in a sample. Similarly, samples from different body regions might have different complements of cell types and associated gene-expression patterns that have nothing directly to do with diagnosis but lead to differences in gene-expression levels in a sample.
So if you include `RIN` and `region` as predictors in the ultimate model, you are directly modeling major sources of variability in the gene-expression measurements. It doesn't matter that they aren't themselves associated with `diagnosis`: they still are associated with differences in gene-expression values. Including them thus leaves fewer "unmodeled" factors to be handled by the SVA.
In general, it's often good practice to incorporate as many predictors as reasonable, without overfitting, when you are modeling. With a study of this scale, a few hundred patients, there probably is no need to do the preliminary logistic GLM at all. You should be able to handle a couple of dozen predictors potentially associated with diagnosis, or otherwise with gene-expression levels, and thus minimize the unmodeled factors to be handled by the SVA.
Two cautions about this approach.
First, you would be better off modeling the continuous predictors flexibly with regression splines rather than categorizing them; see this page among many others, and the sketch after these cautions.
Second, the SVA will attempt to remove "unmodeled heterogeneity" in the gene-expression values from any source. That means that it's important to get a reasonably correct model to start with.
For example, if gene expression is best modeled on a log scale and you do the analysis on a linear scale, that can show up as "unmodeled heterogeneity." Or if expression of a gene only matters with respect to `diagnosis` in females rather than in males, you will miss that if you don't include a `sex:diagnosis` interaction term explicitly in the model. Otherwise that will just be yet more "unmodeled heterogeneity" to be subsumed in the "surrogate values" and lost to your analysis.
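As a concrete illustration of both cautions, here is a hedged sketch of a design matrix using natural splines for the continuous predictors and an explicit sex-by-diagnosis interaction; `pheno` and its columns are hypothetical, and `rms::rcs()` is an alternative spline basis:

```r
library(splines)
mod <- model.matrix(~ diagnosis * sex                    # main effects + interaction
                    + ns(age, df = 4) + ns(RIN, df = 4)  # flexible continuous terms
                    + ethnicity + region,
                    data = pheno)
```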
If you find genes whose apparent importance differs greatly between the analyses with and without the SVA, it would make sense to investigate more closely what's going on with them.
Best Answer
This answer elaborates on some discussion in comments on the answer from Nick Cox.
Your situation might be handled by a multi-category extension of binomial regression: ordinal regression. You model the probability of moving from one category to the next in a way that takes advantage of the ordering among the outcome categories.
This UCLA web page illustrates ordinal logistic regression, based on a "proportional odds" (PO) assumption for moving up the scale. I don't know whether that assumption will hold for your data, but the page does show how to evaluate it.
Also, as Frank Harrell points out in Section 13.3.3 of his Regression Modeling Strategies book, a PO model can sometimes work well even if the assumption isn't met. He makes a similar point in this answer to a question about highly skewed data that take only a few values, with clumping at one end.
The `orm()` function in Harrell's `rms` package allows for ordinal regression with link functions other than the logit, and Section 13.4 of his book shows how to implement a "continuation ratio" method that sometimes works better than a PO model. That provides you some flexibility in how to proceed.

With a PO model you can often model, without overfitting, almost as many parameters as you can with linear regression. Section 4.4 of Harrell's book and course notes provides an estimate of the effective sample size that takes the distribution of cases among categories into account. Your sample size of about 200 would be reduced to an effective sample size of about 180 on that basis, so you should be able to estimate about 12 regression coefficients.
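A minimal sketch of that workflow with `rms`; the data frame `dat`, outcome `score`, and predictors are hypothetical stand-ins for your data:

```r
library(rms)
dd <- datadist(dat); options(datadist = "dd")

# Proportional-odds model with the default logit link
fit <- orm(score ~ rcs(age, 4) + sex + treatment, data = dat)
print(fit)

# Other links are available via family=, e.g. probit:
# orm(score ~ rcs(age, 4) + sex + treatment, family = "probit", data = dat)

# Predicted mean and median of the outcome on its original scale
M  <- Mean(fit)
Q  <- Quantile(fit)
lp <- predict(fit, newdata = dat[1:3, ])   # linear predictor
M(lp)        # predicted means
Q(0.5, lp)   # predicted medians
```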