Solved – Does including gender as a predictor variable mean I should use a glm function, not an lm function, in R

generalized linear model, linear model, r

I have been modelling a data set that contains several predictor variables, but after extensive research I am even more confused as to whether I should be using the lm() or glm() function for the following:

Model4 <- lm(Height_cm ~ Sleep.hours + Gender + Age + Exercise, data = Data.dat.complete)

where Height_cm is continuous, Sleep.hours is continuous, Gender is male/female, Age is continuous, and Exercise is a count (the number of times someone exercised in a week).

Because I have a binary variable (Gender), my diagnostic plots do not look too good, and I am tempted to use glm() with family = binomial and a logit link. However, after researching for several hours, I am beginning to question this, as I have seen examples of gender being fitted with just lm(), and my other predictors are clearly not binary.

If at all possible, I would like to try and work at least some of this out on my own, but if someone could please point me in the right direction (i.e. where is the flaw in my understanding), it would be very much appreciated.

Thank you.


Diagnostic plots:

[image: standard plot(Model4) diagnostics: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage]

Best Answer

A few points:

  • The short (tl;dr) answer to your question is that the choice between a linear (lm) and a generalized linear (glm) model depends on the response variable (Height_cm), not on anything to do with the predictor variables; a binary predictor such as Gender is handled by lm() automatically (it is simply dummy-coded). Since your response is continuous, you should definitely avoid the standard GLMs (Poisson, binomial/logistic), which are meant for count or proportion data.
  • Your diagnostic plots don't look that bad to me: there is no systematic variation in y as a function of x in the residuals-vs-fitted or scale-location plots, the Q-Q plot is approximately a straight line, and all residuals have Cook's distance < 0.5 (within the innermost contour). I suspect that the 'badness' you're referring to is the non-uniform distribution of the fitted values (the x-axis of the residuals-vs-fitted and scale-location plots). This is presumably happening because there is a big effect of gender (the only binary predictor I see in your data set); you can use plot(Model4, col = as.numeric(Data.dat.complete$Gender)) to check this (see the sketch after this list).
  • In theory, a strictly positive response would be better modelled with a log-transformed response or a Gamma GLM, but when the coefficient of variation is low (standard deviation of the response << its mean, say < 1/3; here it is on the order of 1/8), the implied probability of a negative response is negligible and you probably don't need to worry about it.
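
To make these points concrete, here is a rough sketch in R. It reuses the names from the question (Model4, Data.dat.complete, Height_cm, Gender, etc.), assumes Gender is stored as a factor, and fits the log/Gamma alternatives purely for comparison, not as a recommendation to switch:

# lm() dummy-codes a factor predictor automatically; inspect the design matrix
head(model.matrix(Model4))

# Colour the standard diagnostics by Gender to see whether the two clusters
# of fitted values correspond to the two groups
plot(Model4, col = as.numeric(Data.dat.complete$Gender))

# Coefficient of variation of the response: if it is well below ~1/3,
# a Gaussian lm() is very unlikely to imply negative heights
sd(Data.dat.complete$Height_cm) / mean(Data.dat.complete$Height_cm)

# Strictly positive alternatives, for comparison only
Model4_log   <- lm(log(Height_cm) ~ Sleep.hours + Gender + Age + Exercise,
                   data = Data.dat.complete)
Model4_gamma <- glm(Height_cm ~ Sleep.hours + Gender + Age + Exercise,
                    family = Gamma(link = "log"), data = Data.dat.complete)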

You may find the annotated version of the diagnostics produced by the performance::check_model() function useful (I don't agree with all of its design decisions, e.g. the fifth plot, showing "normality of residuals", is redundant and less revealing than the Q-Q plot, but overall it's helpful).
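
If you want to try it, a minimal sketch (assuming the performance package and its plotting companion, see, are installed) is:

# install.packages(c("performance", "see"))
library(performance)

# All-in-one annotated diagnostic panel for the fitted model
check_model(Model4)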

[image: annotated diagnostic panels from performance::check_model(Model4)]
