Solved – The meaning of coefficients in Multiple Linear Regression

least squares, multiple regression, regression coefficients

So I am learning about linear regression. The coefficient is the slope of the fitted function, i.e. how much the dependent variable changes per unit change in the independent variable. So I first fit a linear regression with only one IV. The coefficient is positive (0.2708), so we can infer that the relationship between the IV and DV is positive. I post the results below. We can see that the R-squared is 0.516, so the IV explains only about half of the variance in the DV. I guess this means that the IV is not the best at explaining the DV? Please correct my interpretation if I am wrong.
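For reference, output like this can be produced with Python's statsmodels along the following lines (only a sketch; the DataFrame df and the file name are assumptions, and note there is no sm.add_constant call, which matches the intercept-free output below):

import pandas as pd
import statsmodels.api as sm

# Hypothetical data file; df needs the columns used below
df = pd.read_csv("premier_league_matches.csv")

y = df["total_goal_count"]
X = df[["position_difference"]]  # no sm.add_constant, so the model has no intercept

print(sm.OLS(y, X).fit().summary())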

                            OLS Regression Results                            
==============================================================================
Dep. Variable:       total_goal_count   R-squared:                       0.516
Model:                            OLS   Adj. R-squared:                  0.515
Method:                 Least Squares   F-statistic:                     404.7
Date:                Thu, 02 Aug 2018   Prob (F-statistic):           9.15e-62
Time:                        09:25:41   Log-Likelihood:                -837.54
No. Observations:                 380   AIC:                             1677.
Df Residuals:                     379   BIC:                             1681.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
position_difference     0.2708      0.013     20.118      0.000       0.244       0.297
==============================================================================
Omnibus:                        0.714   Durbin-Watson:                   1.726
Prob(Omnibus):                  0.700   Jarque-Bera (JB):                0.783
Skew:                           0.101   Prob(JB):                        0.676
Kurtosis:                       2.907   Cond. No.                         1.00
==============================================================================

So I add another variable to the regression. The results are as follows:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:       total_goal_count   R-squared:                       0.746
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     554.9
Date:                Thu, 02 Aug 2018   Prob (F-statistic):          3.43e-113
Time:                        09:35:06   Log-Likelihood:                -715.26
No. Observations:                 380   AIC:                             1435.
Df Residuals:                     378   BIC:                             1442.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
position_difference     -0.0024      0.018     -0.135      0.893      -0.037       0.032
avg_total_goal_count     0.0268      0.001     18.479      0.000       0.024       0.030
==============================================================================
Omnibus:                       12.317   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               12.617
Skew:                           0.436   Prob(JB):                      0.00182
Kurtosis:                       3.191   Cond. No.                         22.3
==============================================================================

position_difference now has a negative coefficient when we add another variable; what can we interpret from this? Both attributes are positively correlated with the DV on their own. But what can we learn from their interaction in the regression?

I am studying the effect of different variables on the number of goals scored in Premier League games, if that is of any relevance. position_difference is the difference in league positions between the two teams; the values range from 1 to 19. avg_total_goal_count is the average number of goals scored in the matches involving the two teams.

Best Answer

The coefficient is only one statistic; there are other statistics that are usually more helpful for interpreting a model. In particular, the P-value (P>|t|) and the 95% confidence interval (last two columns) are very helpful.

From the first table, you can see that position_difference has a P-value of effectively zero, which means it is highly relevant for prediction. This is not surprising, since there isn't any other predictor. If you include an intercept term in your model, which I highly recommend (your Df Residuals of 379 on 380 observations shows the current fit has none), this could change and position_difference might become less important. Also notice that the 95% confidence interval is quite narrow, which means there isn't much uncertainty about the coefficient.
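A minimal sketch of how the intercept could be added in statsmodels, assuming the same hypothetical DataFrame df as in the question:

import statsmodels.api as sm

# add_constant prepends a column of ones, which becomes the intercept
# (it appears as a 'const' row in the summary)
X = sm.add_constant(df[["position_difference"]])
print(sm.OLS(df["total_goal_count"], X).fit().summary())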

Looking at the second table, the situation is quite different. Now position_difference has a negative coefficient, but the 95% confidence interval goes from -0.037 to 0.032, which contains both negative and positive values. Therefore, we can't really be sure what the effect of position_difference is. In addition, the P-value is now 0.893, which means position_difference isn't all that important once avg_total_goal_count is in the model; most of the variance is explained by avg_total_goal_count alone.

If you want to check whether there is a significant interaction between position_difference and avg_total_goal_count, you can fit a model with an interaction term; in R that would be y ~ position_difference * avg_total_goal_count. Look at the table of statistics, in particular the P-value: if it's small, the interaction is important and helps explain more variance.
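Since your output comes from statsmodels, here is a sketch of the equivalent model using its formula API (df is again the assumed DataFrame):

import statsmodels.formula.api as smf

# 'a * b' expands to a + b + a:b, i.e. both main effects plus their interaction
model = smf.ols("total_goal_count ~ position_difference * avg_total_goal_count",
                data=df).fit()
print(model.summary())  # check the P-value of the ':' interaction row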

Unfortunately, though, the P-value and coefficient are only reliable indicators of a variable's importance if the assumptions of the linear model hold, in particular that the predictors are linearly related to the dependent variable. For instance, if position_difference has a large effect when it's above 10 but a tiny effect otherwise, it would not behave linearly. There is a relatively easy way to check for non-linear effects: split a continuous variable into a categorical variable with several levels. For instance, you can split position_difference into four intervals (1-5, 6-10, 11-15, 16-19) and use a categorical variable indicating which interval each value falls in. Fitting a model with this newly created variable will give you one coefficient per interval and help you investigate whether the coefficients for different intervals differ vastly.
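A sketch of that binning approach with pandas and the statsmodels formula API (the bin edges, labels, and the new column name are assumptions):

import pandas as pd
import statsmodels.formula.api as smf

# Bin the 1-19 range into the four intervals suggested above
df["pos_diff_bin"] = pd.cut(df["position_difference"],
                            bins=[0, 5, 10, 15, 19],
                            labels=["1-5", "6-10", "11-15", "16-19"])

# C() treats the binned variable as categorical: one coefficient per interval,
# each relative to the baseline level ("1-5")
model = smf.ols("total_goal_count ~ C(pos_diff_bin)", data=df).fit()
print(model.summary())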