Solved – How to interpret regression function with categorical variable


I am trying to figure out how to interpret a regression function with no intercept and one categorical variable, fitted to survey data. Each participant marks which actions, from a list of 25, they perceive as crimes. The survey also records the participant's age, sex, year in college, and income level.

$$crime = 0.38x_{age} - 10.3I_{female} - 8.01I_{male} + 0.18x_{college} + 0.29x_{income}$$

Are the following interpretations of regression coefficients correct?

  • $\beta_{age}$ interpretation: This coefficient estimates a 0.38 increase in crime score for each additional year of age of the participant, holding other variables constant.
  • $\beta_{female}$ interpretation: This coefficient estimates a 10.3 decrease in crime score if the survey participant is female, holding other variables constant.

Here is the R code that generated my model,

reg_model <- lm(crimes ~ 0 + age + sex + college 
                + income , data = crime_data)

The R formula documentation says: "It can also used to remove the intercept term: when fitting a linear model y ~ x - 1 specifies a line through the origin. A model with no intercept can be also specified as y ~ x + 0 or y ~ 0 + x."

If I try to build the model with an intercept using the following R code,

reg_model <- lm(crimes ~ age + sex + college + 
                income , data = crime_data)

I get the following model,
$$crime = -10.3 + 0.38x_{age} + 2.29I_{male} + 0.18x_{college} + 0.29x_{income}$$

I thought it would be nicer if I could state separately how much being male and being female affects the crime score. If I take the 0 out of the formula, only one sex appears in my model, and I was not sure how to explain the effect of the missing sex on the crime score.


Here is the question I am trying to solve (the exercise was posted as an image, which is not reproduced here).
Link to csv dataset: https://pastebin.com/eJJqUfmr

Source: Freund, R. J., Wilson, W. J., and Mohr, D. L. (2010). Statistical Methods, 3rd Edition. Academic Press. ISBN-13: 978-0123749703. Chapter 8: Multiple Linear Regression, Exercise 11.

Best Answer

Note that your initial model uses level means coding.
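
To see what the two codings look like in practice, here is a minimal sketch that inspects the design matrix R builds for each formula. The data frame `toy` and its values are made up for illustration and are not the survey data; only `age` and `sex` are included to keep the output small.

```r
# Toy data: illustrative assumption, not the survey data.
toy <- data.frame(
  age = c(19, 20, 21, 22),
  sex = factor(c("female", "male", "female", "male"))
)

# No intercept: sex gets one indicator column per level
# (level means coding).
X_means <- model.matrix(~ 0 + age + sex, data = toy)
colnames(X_means)

# With intercept: the first level (female) is absorbed into the
# intercept, and only an indicator for male remains
# (reference level coding).
X_ref <- model.matrix(~ age + sex, data = toy)
colnames(X_ref)
```

In the first matrix every row has a 1 in exactly one of the two sex columns, which is why both sexes get their own coefficient when the intercept is suppressed.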

The answers to your explicit questions are:

  1. Yes, if age goes up by $1$ (year, I assume), the crime score is predicted to increase by $0.38$, holding all other covariates equal.
  2. No, not exactly. The coefficient does not say that being female lowers the crime score by 10.3, holding other variables constant. Instead, it is the predicted crime score for a female when all other variables are exactly equal to $0$ (e.g., the person is in the process of being born): $-10.3$. Whether that's particularly meaningful is a different issue; it's just part of the linear model.
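
To make the second point concrete, write $s = 0.38x_{age} + 0.18x_{college} + 0.29x_{income}$ for the part of the prediction that does not depend on sex (the symbol $s$ is introduced here only for illustration). In the no-intercept model, the predictions for a female and a male with the same age, college year, and income are then

$$\widehat{crime}_{female} = -10.3 + s, \qquad \widehat{crime}_{male} = -8.01 + s$$

so the shared part cancels in the comparison:

$$\widehat{crime}_{male} - \widehat{crime}_{female} = -8.01 - (-10.3) = 2.29$$

Neither sex coefficient is an effect on its own. Only their difference describes how the predicted score changes with sex, holding the other variables constant.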

A concern on this thread (e.g., in the comments) is whether the two listed models are the same. They are: the two are the same model under different parameterizations. Because there is a categorical variable here, suppressing the intercept just replaces the intercept and the difference between the levels with the predicted means for the two individual levels when all other variables are $0$. It may help to read my answer here: How can logistic regression have a factorial predictor and no intercept? Note that the coefficients for age, college, and income are identical between the two models; only the remaining two estimated coefficients differ. In the first, female = $-10.3$ and male = $-8.01$; in the second, the intercept is $-10.3$ and the difference between male and female is $2.29$ (so $-10.3 + 2.29 = -8.01$). These yield the same predicted values for all combinations of predictor values.
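
The equivalence can be checked numerically. The following is a sketch on simulated data: the variable names follow the question, but the data frame `sim`, the sample size, and the coefficients used to generate `crimes` are assumptions made up for illustration, so the estimates will not match the ones above.

```r
# Simulated data: made-up numbers, not the survey data.
set.seed(1)
n <- 50
sim <- data.frame(
  age     = sample(18:30, n, replace = TRUE),
  sex     = factor(sample(c("female", "male"), n, replace = TRUE)),
  college = sample(1:4, n, replace = TRUE),
  income  = sample(1:5, n, replace = TRUE)
)
sim$crimes <- 0.4 * sim$age + 2 * (sim$sex == "male") + rnorm(n)

# The two parameterizations from the question.
fit_no_int <- lm(crimes ~ 0 + age + sex + college + income, data = sim)
fit_int    <- lm(crimes ~ age + sex + college + income, data = sim)

# Both fits give the same fitted values.
all.equal(fitted(fit_no_int), fitted(fit_int))

# The intercept of the second fit is the female coefficient of the first,
# and intercept + male difference is the male coefficient of the first.
b0 <- coef(fit_no_int)
b1 <- coef(fit_int)
all.equal(unname(b0["sexfemale"]), unname(b1["(Intercept)"]))
all.equal(unname(b0["sexmale"]), unname(b1["(Intercept)"] + b1["sexmale"]))
```

Each `all.equal()` call should return `TRUE`, and the coefficients for `age`, `college`, and `income` should agree between `b0` and `b1`.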