Solved – How to use k-fold cross-validation to determine whether a linear regression model performs significantly better than chance

Tags: cross-validation, hypothesis testing, regression, statistical significance

I have an experiment in which I present a subject with $N$ inputs, collected in a vector $\pmb{x} \in \mathbb{R}^N$. Each input produces a response in ~25,000 separate output variables, so a given output variable $Y_i$ is also a vector, $Y_i \in \mathbb{R}^N$.

Given a function $f$ that maps inputs to features, I need to determine, for each $Y_i, i \in [0, 25000]$, whether a linear regression model can be used to predict $Y_i$ from $f(\pmb{x})$ and, if so, calculate the accuracy of this prediction. Prediction accuracy is defined as Pearson's r between the predicted output $\hat{Y}_i$ and the true output $Y_i$.

The method given in a paper for this is:

For each $Y_i$:

  1. Split $\pmb{x}$ into $\pmb{x}_{train}$ and $\pmb{x}_{test}$.
  2. Use k-fold cross-validation on $\pmb{x}_{train}$ to determine whether the linear regression model predicts $Y_i$ from $f(\pmb{x}_{train})$ significantly better than chance, using a p threshold of 0.01 / 25000 = 4e-7 (to correct for the number of output variables).
  3. If the linear regression model was found to predict better than chance, calculate the prediction accuracy by training on the entire training set and evaluating on the test set (steps 1 and 3 are sketched in code below).
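
To make the shape of this procedure concrete, here is a rough R sketch of steps 1 and 3 for a single output variable, using simulated stand-ins for $f(\pmb{x})$ and $Y_i$ and glmnet for the ridge fit; the significance test of step 2, which is exactly what this question is about, is left as a placeholder:

```r
# Rough sketch of steps 1 and 3 for a single output variable; 'features' and
# 'y' are simulated stand-ins for f(x) and Y_i
library(glmnet)

set.seed(1)
n <- 100; p <- 10
features <- matrix(rnorm(n * p), n, p)            # stand-in for f(x)
y <- as.vector(features %*% rnorm(p) + rnorm(n))  # stand-in for one Y_i

test_idx  <- sample(n, size = round(0.2 * n))     # step 1: hold out a test set
train_idx <- setdiff(seq_len(n), test_idx)

# step 2 would go here: a cross-validated significance test on the training
# data with threshold 0.01 / 25000 = 4e-7 (the part in question)

# step 3: fit on the full training set, score on the held-out test set
fit   <- cv.glmnet(features[train_idx, ], y[train_idx], alpha = 0)  # ridge
y_hat <- predict(fit, newx = features[test_idx, ], s = "lambda.min")
accuracy <- cor(as.vector(y_hat), y[test_idx])    # Pearson's r
```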

My issue is with the details of step 2. I understand k-fold cross-validation, but I don't know what test to apply to its results to determine whether the prediction is better than chance. The paper's exact wording is: "Student’s t test across cross-validated [input]", but I don't know exactly what that means here.

Clarifications

Context: We have a set of output variables $S = \{Y_i : i \in [0, M]\}$. We suspect that there is a subset of these, $Q = \{Y_j \in S : Y_j \text{ can be predicted from } f(\pmb{x})\} \subseteq S$, where each $Y_j$ can be predicted using a linear mapping from $f(\pmb{x})$: in other words, where a better-than-chance correlation exists between $Y_j$ and our predicted values $\hat{Y}_j$.

Once $Q$ is determined, I can calculate the prediction accuracy of its variables by computing Pearson's r between $Y_j$ and $\hat{Y}_j$ for all $Y_j \in Q$. This is in a biological context, and the idea is that we can go on to analyse (for example) the biological reasons why only the elements in $Q$ can be modelled in this way, as opposed to all the elements in $S$.

The linear regression model being used is ridge regression, which is fitted to the training data and then evaluated on the test data. The hyperparameter search is performed by repeating the cross-validation in step 2 over a grid of hyperparameter values.

To use the wording from the paper, it determines Q by "…discarding output variables whose prediction accuracy was not significantly better than chance, $p > 0.01 / M$ (Bonferroni correcting for number of output variables), Student’s t test across cross-validated training inputs". This is too vague for me to understand. I did not think you needed to correct the p threshold for the hyperparameter search in ridge regression.

An idea we had: for each of the (k = 5) folds, compute Pearson's r between the predicted and true values for that fold, ending up with a sample of 5 r values. Then use a one-sample t test to determine whether the mean of these 5 r values differs from 0, using $p = 0.01/M$ as the threshold; if the null hypothesis is rejected, that output variable is included in $Q$ (a sketch of this is given below). However, I'm not sure that this method is valid.
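
Concretely, the mechanics of this idea look something like the following sketch, reusing the simulated features and y from the sketch above (in practice this would run on the training portion only, and a plain lm fit stands in here for the ridge fit):

```r
# Per-fold Pearson's r followed by a one-sample t test (illustrative only)
dat   <- data.frame(y = y, as.data.frame(features))
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))

fold_r <- sapply(1:k, function(j) {
  train <- folds != j
  fit   <- lm(y ~ ., data = dat[train, ])
  y_hat <- predict(fit, newdata = dat[!train, ])
  cor(y_hat, dat$y[!train])               # Pearson's r for this fold
})

# one-sample t test of H0: the mean of the fold-wise r values is zero
tt <- t.test(fold_r, mu = 0)
tt$p.value < 0.01 / 25000                 # Bonferroni-corrected inclusion in Q
```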

Best Answer

As you note in your question, the important thing in this type of analysis is to define clearly what you mean by "better than chance". The paper discussed in your question (linked in the comments) is not clear on exactly how this was done, so my answer gives you a simple method by which this form of cross-validation ought to be done.


Assessing linear regression via leave-one-out cross-validation (LOOCV): A good way to see whether a linear regression is "better than chance", in a predictive sense, is to compare predictions from the linear regression model with your explanatory variables against predictions from a null model containing an intercept term but no explanatory variables. Predictive performance is best assessed using leave-one-out cross-validation (LOOCV), since this form of cross-validation maximises the training data used for each prediction. This method also has the benefit of being able to rely on well-known results for the leave-one-out prediction error in linear regression models (see e.g., here and here).

Prediction errors for regression model: Suppose you have a linear regression model for a dataset with $n$ data points. You want to make predictions for each of the data points, using the remaining data points as your training data in each case. One of the most useful results for this analysis is that the LOOCV prediction error for data point $i$ is:

$$r_{[i]} = \frac{r_i}{1-h_{ii}} = \frac{\hat{\sigma}}{\sqrt{1-h_{ii}}} \cdot t_i,$$

where $r_i$ is the $i$th residual in the model fitted to all the data, $t_i$ is the (internally) studentised residual, $h_{ii}$ is the corresponding leverage of that data point, and $\hat{\sigma}$ is the estimated error standard deviation (so that $\hat{\sigma}^2 = MS_{Res}$). This result means that you only need to fit your linear model once, to the whole dataset, and you can still easily extract the LOOCV prediction errors for each data point. For an overall measure of prediction error it is common to use the PRESS statistic:

$$\text{PRESS}_\text{ model} = \sum_{i=1}^n r_{[i]}^2 = MS_{Res} \cdot \sum_{i=1}^n \frac{t_i^2}{1-h_{ii}}.$$
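
In R, for example, both of these quantities can be extracted from a single fit of the model to all the data; here is a minimal sketch using simulated data (all object names are illustrative):

```r
# LOOCV prediction errors and PRESS from a single lm fit
set.seed(1)
dat <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- 1 + 0.5 * dat$x1 + rnorm(50)

fit <- lm(y ~ x1 + x2, data = dat)

r     <- residuals(fit)        # ordinary residuals r_i
h     <- hatvalues(fit)        # leverages h_ii
r_loo <- r / (1 - h)           # LOOCV prediction errors r_[i]
PRESS_model <- sum(r_loo^2)    # PRESS statistic for the fitted model
```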

Prediction errors for null model: For the null model with an intercept term, but no explanatory variables, you have predictions $\hat{y}_i = \bar{y}$ and corresponding leverage $h_{ii} = 1/n$, so you get LOOCV prediction errors:

$$r_{[i] \text{ null}} = \frac{y_i - \bar{y}}{1-1/n} = \frac{n}{n-1} \cdot (y_i - \bar{y}).$$

For this case you get an overall measure of prediction error:

$$\text{PRESS}_{\text{null}} = \sum_{i=1}^n r_{[i] \text{ null}}^2 = \Big( \frac{n}{n-1} \Big)^2 \sum_{i=1}^n (y_i - \bar{y})^2 = MS_{Tot} \cdot \frac{n^2}{n-1}.$$


Comparison of models under LOOCV: Comparison of the linear regression model with the null model can be undertaken either by comparing the LOOCV prediction errors under the models, or with a hypothesis test on the prediction error in the linear model, under the null hypothesis that there is no relationship between the explanatory variables and the response (i.e., that the null model is correct). If you would like to get a measure of the reduction in prediction errors in the linear model, compared to the null model, you have:

$$\sqrt{\frac{\text{PRESS}_\text{ model}}{\text{PRESS}_\text{ null}}} = \sqrt{\frac{MS_{Res}}{MS_{Tot}} \cdot \frac{n-1}{n} \cdot \frac{1}{n} \sum_{i=1}^n \frac{t_i^2}{1-h_{ii}}}.$$

This ratio gives you the proportionate size of the norm of the vector of prediction errors under your linear model, compared to the null model. If this value is substantially smaller than one, this suggests that the linear model is predicting the out-of-sample values substantially better than the null model (i.e., "better than chance"). This can be augmented with formal hypothesis tests that look at the distribution of the PRESS statistic for the linear model under the null hypothesis that the null model is true.
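
Continuing the illustrative fit above, the null-model errors and the ratio can be computed directly:

```r
# Null-model LOOCV errors and the ratio of prediction-error norms
y_obs <- dat$y
n     <- length(y_obs)

r_loo_null <- (y_obs - mean(y_obs)) * n / (n - 1)
PRESS_null <- sum(r_loo_null^2)

sqrt(PRESS_model / PRESS_null)   # substantially below one => better than chance
```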

If you are using R for the analysis, you can obtain the residuals of a fitted linear regression model with the residuals function, and the leverage values with the influence (or hatvalues) function in the stats package. This gives you all the information you need to calculate the LOOCV errors for your model and the corresponding PRESS statistic. Alternatively, you can obtain the latter type of measure directly from the CV function in the forecast package in R.
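
Continuing the same illustrative fit, the sketch below shows these functions in use; I am assuming here that the CV element returned by the forecast package is the mean of the squared LOOCV errors (i.e., PRESS/n), in which case the last line should reproduce the PRESS statistic computed above:

```r
# Leverages via stats::influence (equivalently hatvalues), with forecast::CV
# as a cross-check; its "CV" element is, to my knowledge, PRESS / n
h_alt <- influence(fit)$hat
all.equal(h_alt, hatvalues(fit))   # TRUE

library(forecast)
CV(fit)["CV"] * nrow(dat)          # should reproduce PRESS_model above
```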

Although the linked paper is unclear on exactly how the predictive performance of the model was tested, it appears that this was done via a hypothesis test using the null distribution of the prediction errors. In the case of LOOCV above, under the null model the studentised residuals would have a T-distribution, so the LOOCV prediction errors would have a scaled T-distribution. Presumably the authors of the paper have undertaken some kind of hypothesis test on the prediction errors using this fact (although they did not use the LOOCV prediction errors that I am using here).
