GLM to normal distribution

generalized linear model, glmm, normal distribution, r, regression

I have a dataset with five variables: body temperature (dependent variable) and air temperature, substrate temperature, precipitation and relative humidity (independent variables). To test whether my independent variables affect body temperature, I thought about using a GLM, but I'm uncertain whether this would be the most appropriate procedure, since body temperature has a normal distribution. Would this really be a barrier to using a GLM in this situation? Would a mixed model be more appropriate?

Best Answer

There seems to be some confusion.

A generalised linear model is indicated when the response variable is a count (or otherwise discrete variable) or, if it is continuous, when the conditional distribution of the response (that is, conditional on the covariates) follows a non-normal distribution. A common example is where the response assumes values only within an interval, e.g. probabilities, which are bounded by [0, 1], are commonly modelled using beta regression, since the beta distribution is defined on the interval [0, 1]. In general, for a generalised linear model we have:

$$ \text{link}\bigg(\mathbb E\big[Y\vert X\big]\bigg) = X\beta $$ where $X$ is the model matrix of fixed effects and $\beta$ is the parameter vector.

So, if the link function is the identity function, and the response distribution is the normal distribution, this is exactly the same as multivariable linear regression:

$$ \mathbb E \big[Y\vert X\big] = X\beta $$
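To make this equivalence concrete, here is a minimal sketch on simulated data. The data frame, variable names and coefficient values are invented for illustration and are not from the question. It fits the same formula with `lm()` and with `glm()` using a Gaussian family and identity link, then compares the coefficient estimates.

```r
# Simulated data: names and true coefficients are made up for illustration
set.seed(1)
n <- 200
mydata <- data.frame(
  air_temp       = rnorm(n, mean = 25, sd = 4),
  substrate_temp = rnorm(n, mean = 27, sd = 5)
)
mydata$body_temp <- 10 + 0.5 * mydata$air_temp + 0.3 * mydata$substrate_temp +
  rnorm(n, sd = 1)

# Ordinary multivariable linear regression
fit_lm <- lm(body_temp ~ air_temp + substrate_temp, data = mydata)

# GLM with normal response distribution and identity link
fit_glm <- glm(body_temp ~ air_temp + substrate_temp,
               family = gaussian(link = "identity"), data = mydata)

# Compare the two sets of coefficient estimates (up to numerical tolerance)
all.equal(coef(fit_lm), coef(fit_glm))
```

The two fits should agree on the coefficients. The practical difference is mostly in what the summary output reports (for example, deviance rather than R-squared).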

In your case, you appear to have a continuous response, so unless there is some underlying theory that suggests a GLM such as a gamma model is indicated, I would start with a multivariable linear regression. In R, we would have:

lm(body_temp ~ air_temp + substrate_temp + precip + rel_hum, data = mydata)

[Of course you may want to allow for nonlinearities by fitting interactions and/or nonlinear terms, as indicated by the underlying theory]
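As a rough sketch of what allowing for nonlinearities could look like, the example below adds a quadratic term and an interaction and compares the result with the main-effects model. The data are simulated, and the choice of which terms to add is an assumption for illustration; in practice it should come from the underlying theory.

```r
# Simulated data: the curvature in air_temp is built in on purpose
set.seed(2)
n <- 200
mydata <- data.frame(
  air_temp = runif(n, 10, 35),
  rel_hum  = runif(n, 30, 95)
)
mydata$body_temp <- 15 + 0.6 * mydata$air_temp - 0.01 * mydata$air_temp^2 +
  0.02 * mydata$rel_hum + rnorm(n, sd = 0.8)

# Main effects only
fit_main <- lm(body_temp ~ air_temp + rel_hum, data = mydata)

# air_temp * rel_hum expands to both main effects plus their interaction;
# I(air_temp^2) adds a quadratic term for air temperature
fit_flex <- lm(body_temp ~ air_temp * rel_hum + I(air_temp^2), data = mydata)

# F-test comparing the two nested models
anova(fit_main, fit_flex)
```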

A mixed model is also mentioned. This would be indicated when there are repeated measures, or some other kind of clustering, where observations within one cluster are more similar to each other than to observations in different clusters. In such a situation we often fit random intercepts to account for this. A model with both fixed effects and random effects is known as a mixed model.
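A minimal sketch of a random-intercept model follows, assuming (hypothetically) that each animal was measured several times. The grouping variable `individual` and the simulated values are not from the question. It uses `lme()` from the nlme package; `lmer()` from lme4 would be a common alternative.

```r
library(nlme)  # recommended package distributed with standard R installations

# Simulated repeated measures: 20 individuals, 10 observations each
set.seed(3)
n_ind <- 20
n_obs <- 10
mydata <- data.frame(
  individual = factor(rep(seq_len(n_ind), each = n_obs)),
  air_temp   = runif(n_ind * n_obs, 10, 35)
)
ind_effect <- rnorm(n_ind, sd = 1.5)  # one random shift per individual
mydata$body_temp <- 20 + 0.4 * mydata$air_temp +
  ind_effect[as.integer(mydata$individual)] +
  rnorm(nrow(mydata), sd = 0.7)

# Fixed effect of air_temp, random intercept for each individual
fit_mixed <- lme(body_temp ~ air_temp, random = ~ 1 | individual,
                 data = mydata)

fixef(fit_mixed)    # fixed-effect estimates
VarCorr(fit_mixed)  # between-individual and residual variance components
```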

Related Solutions

Solved – in r, lognormal, glm, transformations, what should I do

There are a variety of confusions here: most have probably been dealt with at one time or another on CrossValidated ...

- Generally, the distributional assumptions of regression modeling (whether linear, generalized linear, or mixed) refer to the conditional distribution of the response variable: that is, the assumption is that $y \sim \textrm{Dist}(...)$, where the $...$ contains the information from the input variables (approximately the same as "predictor variables", "covariates", or "independent variables").
- Sometimes people also transform the predictor variables, but this is to improve the linearity of the relationship between the input variables and the response. There are very few cases where any important assumptions are made about the distribution of the input variables.
- If you have a continuous response variable, you can probably get away with a linear model (implemented via lm() in base R) or, if you want to include a random effect of site, lmer() from the lme4 package (or lme() from the nlme package).
- First you should plot your data. You should probably start by looking at univariate relationships (plot(biomass~wind,data=mydata)), even though they can miss a lot of higher-order structure.

I would probably try

  fit1 <- lm(biomass~.-location,data=mydata) 
  ## dot in the formula stands for "everything but the response"; 
  ##    -location takes out the location
  plot(fit1)   ## diagnostic plots
  ## maybe, if the Q-Q plot and scale-location plot look funny ...
  library("MASS")
  boxcox(fit1)

first, and then try the equivalent

library("lme4")
fit2 <- lmer(biomass~...,data=mydata)
## here you need to fill in the ... yourself since lme4 doesn't have the same
##  shortcuts

This is just scratching the surface, but might get you started.

Heteroscedasticity problem violating assumptions for lm and glm

As the other answers say, the unweighted lm() model (equivalent to your glm() with a Gaussian family; in the form you used, glm() and lm() are based on the same assumptions) wasn't that bad. Even with the complication of potential heteroscedasticity, the Normal Q-Q plot wasn't very bad at all. You have a reasonably large number of observations, so it's possible to have "statistically significant" violations of normality and heteroscedasticity that aren't of practical importance. See, for example, the discussion of whether normality testing is essentially useless.

In a later version of the question you added some results from a generalized least squares (gls()) model, in this case a weighted least squares model. That's a well-accepted way to deal with heteroscedasticity. You allowed for different variances among all combinations of A and B, with observations weighted inversely to their variance estimates (weights = varIdent(form = ~ 1 | A * B)). If you are worried about the heteroscedasticity in the unweighted models, that would be a good way to go.
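For readers who have not seen this kind of model, here is a hedged sketch of the approach on simulated data. The factors `A` and `B` mirror the names in the question, but the response `y`, the factor levels and the cell standard deviations are invented for illustration.

```r
library(nlme)

# Simulated two-factor design with a different residual SD in each A-by-B cell
set.seed(4)
mydata <- expand.grid(A = factor(c("a1", "a2")),
                      B = factor(c("b1", "b2")),
                      rep = 1:30)
cell    <- interaction(mydata$A, mydata$B)
cell_sd <- c(0.5, 1, 2, 4)  # hypothetical SDs, one per cell
mydata$y <- 5 + (mydata$A == "a2") + 2 * (mydata$B == "b2") +
  rnorm(nrow(mydata), sd = cell_sd[as.integer(cell)])

# Unweighted fit: a single residual variance for all observations
fit_ols <- gls(y ~ A * B, data = mydata)

# Weighted fit: a separate residual variance for each A-by-B combination
fit_wls <- gls(y ~ A * B, data = mydata,
               weights = varIdent(form = ~ 1 | A * B))

# Compare the two variance structures (same fixed effects in both models)
anova(fit_ols, fit_wls)
```

Because both models share the same fixed effects, the comparison reflects only whether the extra variance parameters improve the fit.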

The terminology can be very confusing when you're first starting to learn these methods. In particular, "generalized least squares" sounds a lot like "generalized linear model" even though they can have quite different applications. Further complicating matters, either of those could implement a "generalized additive model" to fit continuous predictors flexibly. Make sure you know what type of "generalized" model you're dealing with.
