Solved – High R-squared although many insignificant coefficients

multiple-regression, r-squared, regression-coefficients

I just ran a regression based on the gravity model, trying to identify the most important factors that determine trade flows. In total I have 18 variables and 363 observations. My plan was to run several regressions, each including only some of the variables, but I first included them all just to have a look, and I got an R-squared of 0.9162 (using robust standard errors). Due to poor data availability for certain variables, only 102 observations remain when all variables are included. The problem is that only 9 coefficients are significant at the 10% level. Can I conclude that only those variables have a significant impact on trade flows? Can I assume the regression is good because the R-squared is high? Or are there tests I can run to check whether these results are reliable (so that I can use them in my research paper)?

I am doing this for the first time and I am a little bit confused.

Best Answer

There's some chance you're overfitting the data with this sample size (the $R^2$ value seems suspiciously high), but more to the point there's nothing inconsistent about a high $R^2$ alongside lots of "insignificant" predictors. This is simply because the coefficient of determination never decreases when you add variables to the model, so if you start with a high $R^2$ you'll keep one as long as you don't drop any variables.

Below is a simple simulation that demonstrates this idea. I generated data according to the model $$ Y_i = x_{1i} + \epsilon_i $$ where $\epsilon_i \sim \text{Normal}(0, \sigma^2)$, and then fit one regression involving only $x_{1i}$ and another that also included nine extra noise predictors $x_{2i}, x_{3i}, \ldots, x_{10i}$ that bore no relation to $Y_i$. We can see that the second model has a larger $R^2$ even though none of the added variables are important.

Model 1:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.935
Model:                            OLS   Adj. R-squared:                  0.934
Method:                 Least Squares   F-statistic:                     1424.
Date:                Sun, 02 Aug 2015   Prob (F-statistic):           1.44e-60
Time:                        22:37:24   Log-Likelihood:                -4.2454
No. Observations:                 100   AIC:                             10.49
Df Residuals:                      99   BIC:                             13.10
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
var_1          0.9865      0.026     37.734      0.000         0.935     1.038
==============================================================================
Omnibus:                        7.358   Durbin-Watson:                   1.957
Prob(Omnibus):                  0.025   Jarque-Bera (JB):                3.027
Skew:                           0.016   Prob(JB):                        0.220
Kurtosis:                       2.148   Cond. No.                         1.00
==============================================================================

Model 2:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.941
Model:                            OLS   Adj. R-squared:                  0.934
Method:                 Least Squares   F-statistic:                     143.2
Date:                Sun, 02 Aug 2015   Prob (F-statistic):           9.08e-51
Time:                        22:37:27   Log-Likelihood:                0.48280
No. Observations:                 100   AIC:                             19.03
Df Residuals:                      90   BIC:                             45.09
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
var_1          0.9817      0.028     35.336      0.000         0.927     1.037
var_2         -0.0011      0.025     -0.043      0.966        -0.052     0.050
var_3          0.0098      0.025      0.393      0.695        -0.040     0.059
var_4          0.0253      0.030      0.856      0.394        -0.033     0.084
var_5          0.0160      0.027      0.596      0.553        -0.037     0.069
var_6         -0.0138      0.028     -0.486      0.628        -0.070     0.043
var_7          0.0100      0.024      0.418      0.677        -0.037     0.057
var_8         -0.0358      0.027     -1.335      0.185        -0.089     0.017
var_9          0.0180      0.026      0.707      0.482        -0.033     0.069
var_10        -0.0574      0.025     -2.288      0.024        -0.107    -0.008
==============================================================================
Omnibus:                        5.760   Durbin-Watson:                   1.815
Prob(Omnibus):                  0.056   Jarque-Bera (JB):                2.903
Skew:                          -0.147   Prob(JB):                        0.234
Kurtosis:                       2.219   Cond. No.                         1.72
==============================================================================

This is one of the reasons why raw $R^2$ is rarely used for model selection. Notice in the output above that the penalized measures tell the real story: the adjusted $R^2$ did not improve at all (0.934 in both models), and the AIC and BIC are worse for the larger model.
