So I am learning about linear regression. The coefficient is the slope of the fitted function, i.e. how much the dependent variable changes in response to a change in the independent variable. So I first fit a linear regression with only one IV. The coefficient is positive (0.2708), so we can infer that the relationship between the IV and the DV is positive. I post the results below. We can see that the R-squared is 0.516, so the IV explains only about half of the variance in the DV. I guess this means that the IV is not the best at explaining the DV? Please correct my interpretation if I am wrong.
                            OLS Regression Results
==============================================================================
Dep. Variable:       total_goal_count   R-squared:                       0.516
Model:                            OLS   Adj. R-squared:                  0.515
Method:                 Least Squares   F-statistic:                     404.7
Date:                Thu, 02 Aug 2018   Prob (F-statistic):           9.15e-62
Time:                        09:25:41   Log-Likelihood:                -837.54
No. Observations:                 380   AIC:                             1677.
Df Residuals:                     379   BIC:                             1681.
Df Model:                           1
Covariance Type:            nonrobust
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
position_difference     0.2708      0.013     20.118      0.000       0.244       0.297
==============================================================================
Omnibus:                        0.714   Durbin-Watson:                   1.726
Prob(Omnibus):                  0.700   Jarque-Bera (JB):                0.783
Skew:                           0.101   Prob(JB):                        0.676
Kurtosis:                       2.907   Cond. No.                         1.00
==============================================================================
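For reference, a model like this can be fit along these lines (a minimal sketch; my actual code isn't shown, and the DataFrame name matches is hypothetical):

    import statsmodels.api as sm

    # Hypothetical DataFrame `matches` with one row per game.
    y = matches["total_goal_count"]
    X = matches[["position_difference"]]

    # No constant term is added, so the model has no intercept,
    # consistent with the summary above (Df Model 1, Df Residuals 379,
    # and no `const` row).
    model = sm.OLS(y, X).fit()
    print(model.summary())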
So I add another variable to the regression. The results are as follows:
                            OLS Regression Results
==============================================================================
Dep. Variable:       total_goal_count   R-squared:                       0.746
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     554.9
Date:                Thu, 02 Aug 2018   Prob (F-statistic):          3.43e-113
Time:                        09:35:06   Log-Likelihood:                -715.26
No. Observations:                 380   AIC:                             1435.
Df Residuals:                     378   BIC:                             1442.
Df Model:                           2
Covariance Type:            nonrobust
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
position_difference     -0.0024      0.018     -0.135      0.893      -0.037       0.032
avg_total_goal_count     0.0268      0.001     18.479      0.000       0.024       0.030
==============================================================================
Omnibus:                       12.317   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.002   Jarque-Bera (JB):               12.617
Skew:                           0.436   Prob(JB):                      0.00182
Kurtosis:                       3.191   Cond. No.                         22.3
==============================================================================
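The two-variable fit only changes the design matrix (again a sketch with the same hypothetical matches DataFrame):

    # Same setup as before, with a second regressor added.
    X2 = matches[["position_difference", "avg_total_goal_count"]]
    model2 = sm.OLS(matches["total_goal_count"], X2).fit()
    print(model2.summary())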
position_difference now has a negative coefficient once we add the second variable. What can we interpret from this? Taken individually, both attributes are positively correlated with the DV. But what can we learn from their interaction in the regression?
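(A sketch of how those pairwise correlations can be checked, with the same hypothetical matches DataFrame:)

    # Pearson correlation matrix of the DV and both IVs.
    cols = ["total_goal_count", "position_difference", "avg_total_goal_count"]
    print(matches[cols].corr())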
I am studying the effect of different variables on the number of goals scored in Premier League games, if that is of any relevance. position_difference is the difference in league positions between the two teams; the values range from 1 to 19. avg_total_goal_count is the average number of goals scored in the matches in which the two teams play.
Best Answer
The coefficient is only one statistic; there are other statistics that are usually more helpful for interpreting a model. In particular, the P-value (P>|t|) and the 95% confidence interval (the last two columns) are very helpful. From the first table, you can see that position_difference has a P-value of zero, which means it is highly relevant for prediction. This is not surprising, since there isn't any other predictor. If you include an intercept term in your model, which I highly recommend, this could change, and position_difference might become less important. Also notice that the 95% confidence interval is quite narrow, which means there isn't a lot of uncertainty about the coefficient.
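Since your output looks like statsmodels, note that sm.OLS does not add an intercept on its own; a sketch of one way to include it, reusing the hypothetical matches DataFrame from the question:

    import statsmodels.api as sm

    # add_constant prepends a column of ones, giving the model an intercept.
    X = sm.add_constant(matches[["position_difference"]])
    model = sm.OLS(matches["total_goal_count"], X).fit()
    print(model.summary())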
Looking at the second table, the situation is quite different. Now position_difference has a negative coefficient, but the 95% confidence interval also goes from -0.037 to 0.032, which contains negative as well as positive values. Therefore, we aren't really sure what the effect of position_difference is. In addition, the P-value is now 0.893, which means position_difference isn't all that important and most of the variance is explained by avg_total_goal_count alone.
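(Both statistics can also be read programmatically off a fitted statsmodels results object, e.g. for the hypothetical model2 above:)

    print(model2.pvalues)     # the P>|t| column
    print(model2.conf_int())  # the [0.025, 0.975] columns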
If you want to check whether there is a significant interaction between position_difference and avg_total_goal_count, you can fit a model with an interaction term; in R that would be y ~ position_difference * avg_total_goal_count. Look at the table of statistics, in particular the P-value: if it's small, the interaction is important and helps to explain more variance.
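In statsmodels, the equivalent uses the formula API, where * expands to both main effects plus their interaction just as in R (a sketch, again assuming the hypothetical matches DataFrame):

    import statsmodels.formula.api as smf

    # `a * b` is shorthand for `a + b + a:b` (main effects + interaction).
    model_int = smf.ols(
        "total_goal_count ~ position_difference * avg_total_goal_count",
        data=matches,
    ).fit()
    print(model_int.summary())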
Unfortunately, though, the P-value and coefficient are only reliable indicators of a variable's importance if the assumptions of the linear model hold, in particular that the variables are linearly related to the dependent variable. For instance, if position_difference had a large effect when it is above 10 but a tiny impact otherwise, it would not behave linearly. A relatively easy way to check for non-linear effects is to split a continuous variable into multiple categorical variables. For instance, you can split position_difference into 4 intervals (1-5, 6-10, 11-15, 16-19) and use a categorical variable indicating which interval the value falls into. Fitting a model with this newly created variable will give you multiple coefficients and help you investigate whether the coefficients for the different intervals differ vastly.
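A sketch of that binning approach (the interval edges are my assumption, and matches is again a hypothetical DataFrame):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Bin position_difference into four intervals; pd.cut uses
    # right-closed bins, so (0, 5] covers the values 1-5, and so on.
    matches["pos_diff_bin"] = pd.cut(
        matches["position_difference"],
        bins=[0, 5, 10, 15, 19],
        labels=["1-5", "6-10", "11-15", "16-19"],
    )

    # C() treats the bin as categorical, giving one coefficient per
    # interval (relative to the first interval, which is the baseline).
    model_bins = smf.ols("total_goal_count ~ C(pos_diff_bin)", data=matches).fit()
    print(model_bins.summary())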