Many textbooks and papers say that the intercept should not be suppressed. Recently, I used a training dataset to build a linear regression model both with and without an intercept. I was surprised to find that the model without an intercept predicts better than the one with an intercept, in terms of RMSE, on an independent validation dataset. Is prediction accuracy one of the reasons I should use zero-intercept models?
Regression – Why Zero-Intercept Linear Regression Predicts Better
predictive-models, regression
Related Solutions
The following is true when the predictors are continuous variables.
I thought this was the predicted value when $x_1$ and $x_2$ are set to zero.
In the case of categorical (e.g., binary) predictors, the intercept may be interpreted differently, because we need to introduce one or more auxiliary binary variables. Each of these variables represents one of the levels (i.e., one of the unique values in the domain) of the categorical variable. Let me provide an example:
Assume we have a degree variable that can take the values undergraduate and postgraduate, and we want to model salary based on this variable. Then we would model:
$\text{salary} = \beta_0 + \beta_1 \,\text{degree:under} + \beta_2 \, \text{degree:higher}$
Therefore, for a data point representing a postgraduate, we will have $\text{degree:under} = 0$ and $\text{degree:higher}=1$. Note that since $\text{degree:under}$ and $\text{degree:higher}$ are each other's complement, there is no need to keep both of them (doing so only increases the model complexity). For example, we can keep $\text{degree:under}$ and remove the other one:
$\text{salary} = \beta_0 + \beta_1 \,\text{degree:under}$
In this case, $\beta_0$ is the estimated salary when $\text{degree:under} = 0$, i.e., the average salary of the sampled postgraduates ($\text{degree:higher} = 1$).
Therefore, in your case, $0.1222$ is the estimated response when both binary variables are False, or equivalently when both reference variables ($\bar{x}_1$ and $\bar{x}_2$) are True.
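For concreteness, here is a minimal R sketch using simulated salary figures (the numbers and variable names are purely illustrative, not taken from your data), showing that the intercept of a regression on a single dummy variable equals the sample mean of the reference group:

```r
set.seed(1)
# simulated salaries: 40 postgraduates (reference level "higher") and 40 undergraduates
degree <- factor(rep(c("higher", "under"), each = 40), levels = c("higher", "under"))
salary <- ifelse(degree == "under", 45000, 60000) + rnorm(80, sd = 5000)

fit <- lm(salary ~ degree)          # R creates a dummy for the non-reference level automatically
coef(fit)["(Intercept)"]            # the intercept beta_0 ...
mean(salary[degree == "higher"])    # ... equals the mean salary of the reference (higher) group
```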
Hope it helps.
As you note in your question, the important thing to do in this type of analysis is to clearly define what you mean by being "better than chance". The paper discussed in your question (linked in the comments) is not clear on exactly how this was done, and in view of that, my answer is going to give you a simple method by which this form of cross-validation ought to be done.
Assessing linear regression via leave-one-out cross-validation (LOOCV): A good way to see if a linear regression is "better than chance", in a predictive sense, is to make a comparison between predictions from the linear regression model with your explanatory variables, and predictions from a null model containing an intercept term, but no explanatory variables. Testing predictive performance on a train-test split is best done by using leave-one-out cross-validation (LOOCV), since this form of cross-validation maximises the training data used in each prediction. This method also has the benefit of being able to rely on well-known results for predictive error for leave-one-out analysis in linear regression models (see e.g., here and here).
Prediction errors for regression model: Suppose you have a linear regression model for a dataset with $n$ data points. You want to make predictions for each of the data points, using the remaining data points as your training data in each case. One of the most useful results for this analysis is that the LOOCV prediction error for data point $i$ is:
$$r_{[i]} = \frac{r_i}{1-h_{ii}} = \frac{\hat{\sigma}}{\sqrt{1-h_{ii}}} \cdot t_i,$$
where $r_i$ is the $i$th residual in the model using all the data, $t_i$ is the (internally) studentised residual, and $h_{ii}$ is the corresponding leverage of that data point. This result means that you only need to fit your linear model once, to the whole dataset, and you can still easily extract the predictive errors for LOOCV for each data point. For an overall measure of prediction error it is common to use the PRESS statistic:
$$\text{PRESS}_\text{ model} = \sum_{i=1}^n r_{[i]}^2 = MS_{Res} \cdot \sum_{i=1}^n \frac{t_i^2}{1-h_{ii}}.$$
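If you want to compute these quantities yourself, a single fit of the model is enough, since `residuals()` and `hatvalues()` return $r_i$ and $h_{ii}$ directly. A minimal R sketch on simulated data (the variables are illustrative only):

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

fit <- lm(y ~ x)
r   <- residuals(fit)                 # r_i: ordinary residuals
h   <- hatvalues(fit)                 # h_ii: leverages

loocv_errors <- r / (1 - h)           # r_[i]: leave-one-out prediction errors
press_model  <- sum(loocv_errors^2)   # PRESS statistic for the model
press_model

# equivalent form above, using the internally studentised residuals
all.equal(press_model, sigma(fit)^2 * sum(rstandard(fit)^2 / (1 - h)))
```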
Prediction errors for null model: For the null model with an intercept term, but no explanatory variables, you have predictions $\hat{y}_i = \bar{y}$ and corresponding leverage $h_{ii} = 1/n$, so you get LOOCV prediction errors:
$$r_{[i] \text{ null}} = \frac{y_i - \bar{y}}{1-1/n} = \frac{n}{n-1} \cdot (y_i - \bar{y}).$$
For this case you get an overall measure of prediction error:
$$\text{PRESS}_{\text{null}} = \sum_{i=1}^n r_{[i] \text{ null}}^2 = \Big( \frac{n}{n-1} \Big)^2 \sum_{i=1}^n (y_i - \bar{y})^2 = MS_{Tot} \cdot \frac{n^2}{n-1}.$$
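As a quick sanity check of this identity, the following R sketch (on an arbitrary simulated response, for illustration only) computes $\text{PRESS}_{\text{null}}$ both directly and via $MS_{Tot}$:

```r
set.seed(1)
y <- rnorm(30, mean = 10, sd = 2)   # an arbitrary response vector
n <- length(y)

press_null <- sum(((y - mean(y)) / (1 - 1/n))^2)    # direct LOOCV form
ms_tot     <- sum((y - mean(y))^2) / (n - 1)        # total mean square

all.equal(press_null, ms_tot * n^2 / (n - 1))       # TRUE: the two forms agree
```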
Comparison of models under LOOCV: Comparison of the linear regression model with the null model can be undertaken either by comparing the LOOCV prediction errors under the models, or with a hypothesis test on the prediction error in the linear model, under the null hypothesis that there is no relationship between the explanatory variables and the response (i.e., that the null model is correct). If you would like to get a measure of the reduction in prediction errors in the linear model, compared to the null model, you have:
$$\sqrt{\frac{\text{PRESS}_\text{ model}}{\text{PRESS}_\text{ null}}} = \sqrt{\frac{MS_{Res}}{MS_{Tot}} \cdot \frac{n-1}{n} \cdot \frac{1}{n} \sum_{i=1}^n \frac{t_i^2}{1-h_{ii}}}.$$
This ratio gives you the proportionate size of the norm of the vector of prediction errors under your linear model, compared to the null model. If this value is substantially smaller than one, this suggests that the linear model is predicting the out-of-sample values substantially better than the null model (i.e., "better than chance"). This can be augmented with formal hypothesis tests that look at the distribution of the PRESS statistic for the linear model under the null hypothesis that the null model is true.
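Putting the pieces together, here is a minimal R sketch (again on simulated data, so the numbers are only illustrative) that computes both PRESS statistics and the ratio above:

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)          # illustrative data containing a real signal

fit <- lm(y ~ x)
h   <- hatvalues(fit)

press_model <- sum((residuals(fit) / (1 - h))^2)
press_null  <- sum(((y - mean(y)) / (1 - 1/n))^2)

sqrt(press_model / press_null)     # well below 1 here, i.e. predicting better than the null model
```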
If you are using `R` for analysis, you can calculate the residuals in a linear regression model from the outputs in the base package, and you can obtain the leverage values for the data using the `influence` function in the `stats` package. This will give you all the information you need to calculate the LOOCV errors for your model, and the corresponding PRESS statistic. Alternatively, you can calculate the latter measure directly using the `CV` function in the `forecast` package in `R`.
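As a rough sketch of that workflow (the data here are simulated purely for illustration, and my description of what `CV` reports is from memory of the package documentation):

```r
set.seed(1)
x <- rnorm(40); y <- 1 + 2 * x + rnorm(40)   # illustrative data only
fit <- lm(y ~ x)

# leverages via the influence() function in the stats package
h <- influence(fit)$hat                       # same values as hatvalues(fit)
press <- sum((residuals(fit) / (1 - h))^2)
press

# alternatively, the CV() function in the forecast package reports a
# leave-one-out measure for lm objects (the mean squared LOOCV error,
# i.e. PRESS / n) alongside AIC, AICc, BIC and adjusted R^2
# install.packages("forecast")
library(forecast)
CV(fit)
```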
Although the linked paper is unclear on exactly how the predictive performance of the model was tested, it appears that this was done via a hypothesis test using the null distribution of the prediction errors. In the case of LOOCV above, under the null model the studentised residuals would have a T-distribution, so the LOOCV prediction errors would have a scaled T-distribution. Presumably the authors of the paper have undertaken some kind of hypothesis test on the prediction errors using this fact (although they did not use the LOOCV prediction errors that I am using here).
Best Answer
Look carefully at how the RMSE or other statistic is computed when comparing no-intercept models to intercept models. Sometimes the assumptions and calculations differ between the two models, and one may fit worse but look better because its error is being divided by something much larger (for example, $R^2$ for a no-intercept model is typically computed relative to the total sum of squares about zero rather than about the mean).
Without a reproducible example it is difficult to tell what may be contributing.
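One way to rule out that kind of artefact is to compute the validation RMSE for both models by hand, on exactly the same holdout observations and with exactly the same formula. A minimal R sketch with simulated training and validation data (all names and numbers are illustrative):

```r
set.seed(1)
n_train <- 100; n_valid <- 50
x_train <- runif(n_train, 1, 10); y_train <- 5 + 2 * x_train + rnorm(n_train)
x_valid <- runif(n_valid, 1, 10); y_valid <- 5 + 2 * x_valid + rnorm(n_valid)

fit_int   <- lm(y_train ~ x_train)       # model with an intercept
fit_noint <- lm(y_train ~ x_train - 1)   # intercept suppressed

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))

# identical RMSE formula, identical validation observations, for both models
rmse(y_valid, predict(fit_int,   newdata = data.frame(x_train = x_valid)))
rmse(y_valid, predict(fit_noint, newdata = data.frame(x_train = x_valid)))

# the in-sample R^2 values, by contrast, are NOT comparable: for the
# no-intercept model R computes R^2 about zero rather than about the mean
summary(fit_int)$r.squared
summary(fit_noint)$r.squared
```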