Solved – High R-squared although many insignificant coefficients

multiple-regression, r-squared, regression-coefficients

I just ran a regression based on the gravity model, trying to identify the most important factors that determine trade flows. In total I have 18 variables and 363 observations. My plan was to run several regressions, each including only some of the variables, but I first included them all just to have a look, and I got an R-squared of 0.9162 (using robust standard errors). Due to poor data availability for certain variables, only 102 observations remain when all variables are included. The problem is that only 9 coefficients are significant at the 10% level. Can I conclude that only those variables have a significant impact on trade flows? Can I assume the regression is good because the R-squared is high? Or are there tests I can run to check whether these results are reliable (so that I can use them in my research paper)?

I am doing this for the first time and I am a little bit confused.

Best Answer

There's some chance you're overfitting the data with this sample size (the $R^2$ value seems suspiciously high), but more to the point there's nothing inconsistent about a high $R^2$ alongside lots of "insignificant" predictors. This is simply because the coefficient of determination never decreases when you add variables to the model, so if you start with a high $R^2$ you'll keep one as long as you don't drop any variables.

Below is a simple simulation that demonstrates this idea. I generated data according to the model $$ Y_i = x_{1i} + \epsilon_i $$ where $\epsilon_i \sim \text{Normal}(0, \sigma^2)$, and then fit one regression involving only $x_{1i}$ and another that also included nine extra noise predictors $x_{2i}, x_{3i}, \ldots, x_{10i}$ that bore no relation to $Y_i$. We can see that the second model has a larger $R^2$ even though none of the added variables are important.

Model 1:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.934
Method:                 Least Squares   F-statistic:                     1424.
Date:                Sun, 02 Aug 2015   Prob (F-statistic):           1.44e-60
Time:                        22:37:24   Log-Likelihood:                -4.2454
No. Observations:                 100   AIC:                             10.49
Df Residuals:                      99   BIC:                             13.10
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
var_1          0.9865      0.026     37.734      0.000         0.935     1.038
==============================================================================
Omnibus:                        7.358   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.025   Jarque-Bera (JB):                3.027
Skew:                           0.016   Prob(JB):                        0.220
Kurtosis:                       2.148   Cond. No.                         1.00
==============================================================================

Model 2:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.934
Method:                 Least Squares   F-statistic:                     143.2
Date:                Sun, 02 Aug 2015   Prob (F-statistic):           9.08e-51
Time:                        22:37:27   Log-Likelihood:                0.48280
No. Observations:                 100   AIC:                             19.03
Df Residuals:                      90   BIC:                             45.09
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
var_1          0.9817      0.028     35.336      0.000         0.927     1.037
var_2         -0.0011      0.025     -0.043      0.966        -0.052     0.050
var_3          0.0098      0.025      0.393      0.695        -0.040     0.059
var_4          0.0253      0.030      0.856      0.394        -0.033     0.084
var_5          0.0160      0.027      0.596      0.553        -0.037     0.069
var_6         -0.0138      0.028     -0.486      0.628        -0.070     0.043
var_7          0.0100      0.024      0.418      0.677        -0.037     0.057
var_8         -0.0358      0.027     -1.335      0.185        -0.089     0.017
var_9          0.0180      0.026      0.707      0.482        -0.033     0.069
var_10        -0.0574      0.025     -2.288      0.024        -0.107    -0.008
==============================================================================
Omnibus:                        5.760   Durbin-Watson:                   1.815
Prob(Omnibus):                  0.056   Jarque-Bera (JB):                2.903
Skew:                          -0.147   Prob(JB):                        0.234
Kurtosis:                       2.219   Cond. No.                         1.72
==============================================================================

This is one of the reasons why raw $R^2$ is rarely used for model selection. Notice in the output above that the penalized measures tell the real story: the adjusted $R^2$ did not improve at all (0.934 in both models), and the AIC and BIC are worse for the larger model.
