I have been modelling a data set that contains several predictor variables, but after extensive research, I am even more confused as to whether I should be using an lm() or glm() function for the following:
Model4 <- lm(Height_cm ~ Sleep.hours + Gender + Age + Exercise, data = Data.dat.complete)
Where height is continuous, sleep.hours is continuous, gender is male/female, age is continuous and exercise is count (the number of times someone exercised in a week).
As I have a binary variable (gender), my diagnostic plots do not look too good, and I am tempted to use a glm() with family=binomial and link=logit. However, after researching for several hours, I am beginning to question this, as I have seen examples of gender being fitted using just lm() and my other predictors are clearly not binary.
If at all possible, I would like to try and work at least some of this out on my own, but if someone could please point me in the right direction (i.e. where is the flaw in my understanding), it would be very much appreciated.
Thank you.
Diagnostic plots:
Best Answer
A few points:
The choice between linear (lm) and generalized linear (glm) models depends on the response variable (Height_cm), not on anything having to do with the predictor variables. Since your response is continuous, you definitely should avoid the standard GLMs (Poisson, binomial/logistic), which are meant for count or proportion data.
If you suspect that Gender is driving the pattern in your diagnostic plots, colour the points by gender with plot(Model4, col = as.numeric(Data.dat.complete$Gender)) to check this.
You may find the annotated version of the diagnostics from the performance::check_model() function useful (I don't agree with all of the design decisions, e.g. the fifth plot showing "normality of residuals" is redundant and less revealing than the Q-Q plot, but overall it's helpful).
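As a rough illustration of the points above, here is a minimal sketch on simulated data. The data frame dat, its effect sizes, and the object names fit_lm, fit_glm and same_fit are all made up for this example and are not taken from the question's data set. It fits the same formula with lm() and with a Gaussian glm(), then colours the residuals-vs-fitted plot by gender.

```r
# Simulated data: every value and effect size here is invented for illustration.
set.seed(1)
n <- 200
dat <- data.frame(
  Sleep.hours = rnorm(n, mean = 7, sd = 1),
  Gender      = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age         = runif(n, min = 18, max = 65),
  Exercise    = rpois(n, lambda = 3)
)
# Continuous response with an (assumed) gender effect plus Gaussian noise.
dat$Height_cm <- 165 + 12 * (dat$Gender == "male") + rnorm(n, sd = 6)

# Continuous response: an ordinary linear model is the natural starting point.
fit_lm <- lm(Height_cm ~ Sleep.hours + Gender + Age + Exercise, data = dat)

# A Gaussian GLM with identity link is the same model as lm(),
# so switching to glm() with this family changes nothing.
fit_glm <- glm(Height_cm ~ Sleep.hours + Gender + Age + Exercise,
               family = gaussian(link = "identity"), data = dat)
same_fit <- isTRUE(all.equal(coef(fit_lm), coef(fit_glm)))

# Residuals vs fitted, coloured by gender. Gender must be a factor here:
# as.numeric() on a character vector would return NA instead of 1/2.
plot(fit_lm, which = 1, col = as.numeric(dat$Gender))
```

In this simulated setup the fitted values fall into two clusters, one per gender, simply because the binary predictor shifts the mean. Seeing the two colours separate cleanly in the plot suggests the clustering comes from the predictor rather than from a problem with the model, though the real data may of course behave differently.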