Communicating Regression Model Results

diagnostic, model-selection, modeling, predictive-models, regression

I am concerned about how unequipped most people are (both within and without academia) to properly employ standard model building methods such as linear regression and to interpret the results of these models. Both from my own observations and the literature, it is clear that most people are being poorly served by the standard statistical tools they have available.

To improve this situation, I would like to propose a different set of model evaluation output tables (examples shown below in Table 2 and Table 3) to replace the standard output table that is almost universally used today (an example is shown in Table 1). This new output format is robust to human error (both in model specification and in analysis) to a much greater extent than our current format.

I would like to ask people for their critique of this proposed method and whether there is interest in using it.

(Sorry for the length of this, it got much longer than expected)

Approach Overview

First, calculate an "honest $R^2$" using leave-one-out cross-validation (LOOCV) to get an overall measure of the model's fit. Simply put: remove each data point in turn, estimate the model without that data point, use the reduced model to predict the value of the removed data point, and then use these prediction errors to estimate your $R^2$.
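As a minimal sketch of this computation (illustrative code, not the script linked under Implementation below; it assumes a standard formula/data-frame interface):

    # Illustrative sketch: honest R^2 for a linear model via LOOCV.
    honest_r2 <- function(formula, dat) {
      y <- model.response(model.frame(formula, dat))
      preds <- numeric(nrow(dat))
      for (i in seq_len(nrow(dat))) {
        fit <- lm(formula, data = dat[-i, ])          # refit without point i
        preds[i] <- predict(fit, newdata = dat[i, ])  # predict the held-out point
      }
      1 - sum((y - preds)^2) / sum((y - mean(y))^2)   # R^2 from the LOOCV errors
    }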

Then, for each coefficient in the model, remove that parameter from the model and calculate an "honest $R^2$" for that submodel. The effect of adding the coefficient to the model is then the overall model's $R^2$ minus the $R^2$ of the submodel with that coefficient removed.
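Continuing the sketch (again illustrative; handling of the intercept row seen in Table 2 below is omitted for brevity):

    # Illustrative sketch: change in honest R^2 from adding each predictor.
    # Uses honest_r2() from the sketch above.
    coef_honest_r2 <- function(formula, dat) {
      full <- honest_r2(formula, dat)
      deltas <- sapply(attr(terms(formula), "term.labels"), function(v) {
        sub <- update(formula, as.formula(paste(". ~ . -", v)))  # drop predictor v
        full - honest_r2(sub, dat)                               # R^2 gained by adding v
      })
      c("Full Model" = full, deltas)
    }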

Thus, a positive value for this coefficient $R^2$ indicates that the fit improves when the coefficient is added to the model; a negative value indicates that the fit worsens.

Results would be reported as a simple table of honest $R^2$ values (see examples in Table 2 and Table 3 below).

Discussion

This approach is designed to be straightforward and to do three things:

  1. Provide end users with the information they need to interpret the (practical) significance of models and coefficients
  2. Provide end users with this information in a form they are already familiar with
  3. Minimize the effect of user error both in developing the models and interpreting results

I think this approach does all three. The use of the universally understood $R^2$ metric should mean it is readily understandable and can plug right into users' existing mental frameworks.

Second, this approach is robust, both to misspecified models and to poor interpretation. Most models depend on assumptions that users often fail to check. For instance, p-values for coefficients in linear regressions depend on parametric assumptions that are generally violated to a greater or lesser extent. The proposed approach is resilient to such issues (the main issue it remains susceptible to, as with the standard output statistics, is correlation between observations, which could lead to an overestimation of the honest $R^2$).

As regards the interpretation of results, p-values are often misinterpreted (as has been extensively noted). I believe this honest $R^2$ approach is much more resilient to these issues, as it focuses on effect size (which is often what people incorrectly take the p-value to be a proxy for). It emphasizes practical significance in place of statistical significance, a distinction people have a lot of trouble with.

Furthermore, this method allows direct comparison with results generated by other methods (even those that don't produce likelihoods), such as machine-learning techniques like random forests.

One issue with this approach is computational burden. Where LOOCV is computationally infeasible, I would suggest 10- or 5-fold CV. Since these methods should result in higher error (as the model is trained on smaller data sets), the honest $R^2$ values they report would be conservative. They can therefore be used and reported in place of LOOCV, with any risk of mis-comparison being of a conservative nature.
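A k-fold variant is a small change to the LOOCV sketch above (again illustrative; fold assignment is random, so results will vary slightly between runs):

    # Illustrative sketch: k-fold honest R^2 as a cheaper stand-in for LOOCV.
    honest_r2_kfold <- function(formula, dat, k = 10) {
      y <- model.response(model.frame(formula, dat))
      folds <- sample(rep(seq_len(k), length.out = nrow(dat)))  # random fold labels
      preds <- numeric(nrow(dat))
      for (f in seq_len(k)) {
        fit <- lm(formula, data = dat[folds != f, ])  # train on the other k-1 folds
        preds[folds == f] <- predict(fit, newdata = dat[folds == f, ])
      }
      1 - sum((y - preds)^2) / sum((y - mean(y))^2)
    }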

Another issue is that the LOOCV estimate is known to have high variance. I'm not really sure how this could be dealt with, or whether it is a serious problem. One last point: LOOCV is asymptotically equivalent to AIC, so this proposal fits into that paradigm a bit.

Applied Example

Taking a set of housing data ( http://archive.ics.uci.edu/ml/datasets/Housing ), we try to predict the average house value in a suburb from the average house age, number of rooms, pollutant levels, pupil-teacher ratio, and proximity to a highway. We start with a linear regression.
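As a sketch of the baseline fit (assuming the data are loaded into a data frame named housing with the response in a column named MEDV; the predictor names follow Table 1 below):

    # Illustrative: fit the linear model and print the standard output.
    fit <- lm(MEDV ~ AGE + ROOMS + NOX + PUPIL.TEACHER + HIGHWAY, data = housing)
    summary(fit)  # produces output like Table 1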

Below is output similar to what most software currently produces (this specific table was generated by R).

Table 1. Standard regression table.

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     7.76739    4.98881   1.557 0.120112    
AGE            -0.01509    0.01378  -1.096 0.273773    
ROOMS           7.00565    0.41172  17.015  < 2e-16 ***
NOX           -13.31418    3.90262  -3.412 0.000698 ***
PUPIL.TEACHER  -1.11645    0.14799  -7.544 2.17e-13 ***
HIGHWAY        -0.02487    0.04257  -0.584 0.559341    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 5.819 on 500 degrees of freedom
Multiple R-squared: 0.6037, Adjusted R-squared: 0.5997 

The more honest statistics are shown below. Note that these results aren't surprising given what we saw in Table 1: the parameters that were not significant in Table 1 result in worse models, as measured by the change in the honest $R^2$ (again, each coefficient's honest $R^2$ value indicates how the honest $R^2$ changed when that coefficient was added to the model).

Table 2. Proposed "honest" statistics table.

          Item Coefficient Honest.R2  
  -Full Model-              0.593080  
   (Intercept)      7.7674 -0.000998  
           AGE     -0.0151 -0.000409
         ROOMS      7.0056  0.228901
           NOX    -13.3142  0.008123
 PUPIL.TEACHER     -1.1165  0.045506
       HIGHWAY     -0.0249 -0.002018
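With the sketch functions above, a table like Table 2 would come from a call along the lines of the following (illustrative; as noted, the intercept row is not covered by the sketch):

    coef_honest_r2(MEDV ~ AGE + ROOMS + NOX + PUPIL.TEACHER + HIGHWAY, housing)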

Finally, this approach allows us to directly compare more exotic algorithms such as random forests. Unlike Bayes factors, AIC, BIC, and some other methods, likelihoods are not required for the comparison: anything that produces predictions can be compared. Classification algorithms can also fit into this scheme if you use one of the common pseudo-$R^2$ approaches for calculating $R^2$. A sketch of a model-agnostic variant of the earlier code is shown below.
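As before, this is only an illustrative sketch (honest_r2_generic, housing, and MEDV are assumed names), and note that refitting a random forest once per data point is slow:

    # Illustrative, model-agnostic honest R^2: plug in any fitting function
    # whose result works with predict(); shown here with randomForest.
    library(randomForest)

    honest_r2_generic <- function(dat, yname, fit_fun) {
      y <- dat[[yname]]
      preds <- vapply(seq_len(nrow(dat)), function(i) {
        fit <- fit_fun(dat[-i, ])                 # refit without point i
        unname(predict(fit, newdata = dat[i, ]))  # predict the held-out point
      }, numeric(1))
      1 - sum((y - preds)^2) / sum((y - mean(y))^2)
    }

    # Assumed example data and names from above:
    rf_r2 <- honest_r2_generic(housing, "MEDV", function(d)
      randomForest(MEDV ~ AGE + ROOMS + NOX + PUPIL.TEACHER + HIGHWAY, data = d))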

Table 3. Comparison between algorithms.

(In this case Support Vector Machines and Random Forests)

        SVM               |         Random Forest
          Item Honest.R2  |             Item Honest.R2
  -Full Model-  0.71052   |     -Full Model-   0.7450
           AGE  0.00834   |              AGE  -0.0128
         ROOMS  0.34722   |            ROOMS   0.2068
           NOX  0.02338   |              NOX   0.0654
 PUPIL.TEACHER  0.01813   |    PUPIL.TEACHER  -0.0127
       HIGHWAY -0.00381   |          HIGHWAY  -0.0255

Side note: it is interesting to interpret these results. In the linear model, the pupil-teacher ratio has a statistically significant effect on housing prices, while in the random forest model it adds no predictive value. Since the random forest is the most predictive of these models, I would have to conclude that this is evidence that the coefficient does not have a practically significant effect on housing prices. This is a small illustration of the weakness of blindly applying linear models to everything and using them to carry out hypothesis testing.

Implementation

I have uploaded (rough-hewn, unoptimized) code in R to implement this technique at:

http://dl.dropbox.com/u/94002/HonestStats.R

If there is interest, I can optimize this to be much faster for linear models and implement things like k-Fold CV.

Best Answer

In addition to Michelle's answer, predictive ability (as measured by $R^2$) is not relevant to all uses of regression.

In your example, if one is interested in the difference in mean house prices, comparing houses whose NOx level differs by one unit but that have identical values of all the other covariates, then (given some assumptions of linearity) the regression in Table 1 is the right one to do, regardless of its $R^2$.

In my experience, getting people to translate between "what quantity is of interest?" and "what regression do we do?" is much more challenging than quantifying predictive ability.
