Solved – Interpreting the “coefficient” output of the lm function in R

linear model, r, regression

I have created a linear model (which has multiple predictors) using the lm() function and I would like to interpret the "coefficients" that I get when I use the summary() function on the linear model.

Now I want to understand how the coefficients reflect each predictor's influence in the model. Am I right in thinking that a larger coefficient means the corresponding predictor has a greater effect? I'm not sure what else I need to consider, or whether I'm even thinking along the right lines.

Also, am I correct in thinking these "coefficients" are in fact the Beta coefficients?

Best Answer

Let's say we want to predict the median value of a house, expressed in thousands of dollars (medv), based on its age (age), number of rooms (rm), and the crime rate (crim) in its neighborhood. The dataset is called Boston and ships with the MASS package:

> library(MASS)

Original regression model

> print(coef(lm(medv~crim+rm+age, data=Boston)))

 (Intercept)         crim           rm          age 
-23.60556128  -0.21102311   8.03283820  -0.05224283 

So we see that the coefficients for those three predictors are -0.211, 8.03, and -0.05. To calculate the predicted price for any house, we use the full fitted formula, intercept included:

medv = -23.60556128 - 0.21102311*crim + 8.03283820*rm - 0.05224283*age

This means, for example, that each additional room raises the predicted medv by 8.03 thousand dollars, holding the other predictors fixed.
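A quick sketch of this, plugging a hypothetical house (the crim, rm, and age values below are made up for illustration) into the formula by hand and checking that predict() gives the same answer:

```r
library(MASS)  # Boston ships with the recommended MASS package

fit <- lm(medv ~ crim + rm + age, data = Boston)

# A hypothetical house: low crime rate, 6 rooms, age 50
new_house <- data.frame(crim = 0.1, rm = 6, age = 50)

# Manual calculation from the fitted formula, intercept included
b <- coef(fit)
manual <- b[["(Intercept)"]] + b[["crim"]] * 0.1 + b[["rm"]] * 6 + b[["age"]] * 50

# predict() applies exactly the same formula
all.equal(manual, unname(predict(fit, newdata = new_house)))
```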

Now, let's say we measure the price in dollars instead of thousands of dollars. The regression coefficients change completely.

> Boston.2 <- Boston
> Boston.2$medv <- Boston.2$medv*1000

> print(coef(lm(medv~crim+rm+age, data=Boston.2)))
 (Intercept)         crim           rm          age 
-23605.56128   -211.02311   8032.83820    -52.24283 

They are all just scaled versions of the coefficients in the original model. But what if we instead measure, for example, the crime rate per 1000 people?

Then the crim coefficient changes dramatically, while rm and age stay the same.

> Boston.3 <- Boston
> Boston.3$crim <- Boston.3$crim/1000

> print(coef(lm(medv~crim+rm+age, data=Boston.3)))
  (Intercept)          crim            rm           age 
 -23.60556128 -211.02311213    8.03283820   -0.05224283 

Compare this model with the original one. Judging importance by raw coefficient size, the crime rate now seems far more important than the number of rooms (-211.02 vs. only 8.03), which is not realistic.

In the original model, however, the story was quite different: the number of rooms looked far more important. Nothing about the data changed; only the units did.
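This unit effect is purely mechanical: dividing a predictor by a constant multiplies its coefficient by that constant, while the other coefficients are untouched. It is easy to verify on the Boston data itself:

```r
library(MASS)

# Coefficients in the original units
b.orig <- coef(lm(medv ~ crim + rm + age, data = Boston))

# Rescale crim only, then refit
Boston.check <- Boston
Boston.check$crim <- Boston.check$crim / 1000
b.scaled <- coef(lm(medv ~ crim + rm + age, data = Boston.check))

# The crim coefficient is exactly 1000x larger; rm and age are unchanged
all.equal(b.scaled[["crim"]], 1000 * b.orig[["crim"]])
all.equal(b.scaled[["rm"]], b.orig[["rm"]])
all.equal(b.scaled[["age"]], b.orig[["age"]])
```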

To assess the relative importance of the variables, we first standardize them so they are all on a comparable scale. Then it no longer matters which unit the data are recorded in (crim vs. crim/1000): the coefficients come out the same, and we can compare their magnitudes to see which variables matter most for prediction.

We now fit the same standardized model to each of the three differently scaled data sets (Boston, Boston.2, Boston.3). Note that the coefficients are identical:

Standardized regression model on the original data

> print(coef(lm(scale(medv)~scale(crim)+scale(rm)+scale(age), data=Boston)))
  (Intercept)   scale(crim)     scale(rm)    scale(age) 
-3.076923e-16 -1.973583e-01  6.136725e-01 -1.598956e-01 

Regression model with prices in dollars instead of thousands

> print(coef(lm(scale(medv)~scale(crim)+scale(rm)+scale(age), data=Boston.2)))
  (Intercept)   scale(crim)     scale(rm)    scale(age) 
-3.076923e-16 -1.973583e-01  6.136725e-01 -1.598956e-01 

Regression model with crime rate per 1000 people

> print(coef(lm(scale(medv)~scale(crim)+scale(rm)+scale(age), data=Boston.3)))
  (Intercept)   scale(crim)     scale(rm)    scale(age) 
-3.076923e-16 -1.973583e-01  6.136725e-01 -1.598956e-01  

From this we see that the number of rooms is indeed more important than the crime rate, which again matches common sense :).
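The standardized coefficients don't have to be refit from scratch: each one is just the raw coefficient rescaled by sd(x)/sd(y), which is exactly why the units cancel out. A quick check of that identity on the Boston data:

```r
library(MASS)

# Raw and standardized fits of the same model
raw <- coef(lm(medv ~ crim + rm + age, data = Boston))
std <- coef(lm(scale(medv) ~ scale(crim) + scale(rm) + scale(age), data = Boston))

# beta_std = beta_raw * sd(x) / sd(y) for each predictor
manual_std <- c(
  raw[["crim"]] * sd(Boston$crim) / sd(Boston$medv),
  raw[["rm"]]   * sd(Boston$rm)   / sd(Boston$medv),
  raw[["age"]]  * sd(Boston$age)  / sd(Boston$medv)
)

all.equal(manual_std, unname(std[c("scale(crim)", "scale(rm)", "scale(age)")]))
```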