Generalized Linear Models – Key Assumptions of Generalized Linear Models

ancova, assumptions, generalized linear model, regression, scatterplot

I have made a generalised linear model with a single response variable (continuous/normally distributed) and 4 explanatory variables (3 of which are factors and the fourth is an integer). I have used a Gaussian error distribution with an identity link function. I am currently checking that the model satisfies the assumptions of the generalised linear model, which are:

  1. independence of Y
  2. correct link function
  3. correct scale of measurement of explanatory variables
  4. no influential observations

My question is: how can I check that the model satisfies these assumptions? The best starting point would seem to be plotting the response variable against each explanatory variable. However, 3 of the explanatory variables are categorical (with 1-4 levels), so what should I be looking for in the plots?

Also, do I need to check for multicollinearity and interactions amongst explanatory variables? If yes, how do I do this with categorical explanatory variables?

Best Answer

I think framing this as a generalized linear model is overkill. What you have is a plain old regression model. More specifically, because you have some categorical explanatory variables and a continuous EV, but no interactions between them, this could also be called a classic ANCOVA.
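For concreteness, here is a minimal sketch of fitting such a model in Python with statsmodels. The data frame and column names (`f1`, `f2`, `f3` for the factors, `x` for the integer covariate, `y` for the response, and the file name) are hypothetical stand-ins for your actual variables, not anything from the question itself.

```python
# Minimal sketch: fit the model described above as an ordinary regression / ANCOVA.
# All column names and the file name are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("your_data.csv")  # assumed to contain columns y, f1, f2, f3, x

# C() tells the formula interface to treat a column as a categorical factor;
# x enters as an ordinary numeric covariate. No interaction terms are included.
fit = smf.ols("y ~ C(f1) + C(f2) + C(f3) + x", data=df).fit()
print(fit.summary())
```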

I would say that #3 is not really an assumption here that you need to worry about. Nor, for that matter, do you really need to worry about #2. Instead, I would replace these with two different assumptions:

2'. Homogeneity of variance
3'. Normality of residuals

Furthermore, #4 is an important thing to check, but I don't really think of it as an assumption per se. Let's think about how these assumptions can be checked.

Independence is often 'checked' first by thinking about what the data stand for and how they were collected. In addition, it can be checked using things like a runs test, the Durbin-Watson test, or by examining the pattern of autocorrelations; you can also look at partial autocorrelations. (Note that these can only be assessed relative to your continuous covariate.)
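As a rough illustration, continuing the hypothetical `fit` from the sketch above, the Durbin-Watson statistic and the residual (partial) autocorrelations can be computed from the fitted model's residuals; the autocorrelation plots are only meaningful if your rows have some natural ordering, such as time or position of collection.

```python
# Sketch: checking independence of the residuals from the hypothetical fit above.
import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

dw = durbin_watson(fit.resid)  # values near 2 suggest little lag-1 autocorrelation
print(f"Durbin-Watson statistic: {dw:.2f}")

# Autocorrelation and partial autocorrelation of the residuals,
# assuming the rows are in a meaningful order (e.g. time of collection).
plot_acf(fit.resid)
plot_pacf(fit.resid)
plt.show()
```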

With primarily categorical explanatory variables, homogeneity of variance can be checked by calculating the variance at each level of your factors. Having computed these, you can use several tests to check whether they are about the same: primarily Levene's test, but also the Brown-Forsythe test. The $F_{max}$ test, also called Hartley's test, is not recommended; if you would like a little more information about that, I discuss it here. (Note that these tests can be applied to your categorical covariates, unlike above.) For a continuous EV, I like to just plot my residuals against the continuous covariate and examine them visually to see if they spread out further to one side or the other.
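One possible way to do this with scipy, again continuing the hypothetical `df` and `fit` from the sketches above (with `f1` standing in for one of your factors and `x` for the continuous covariate):

```python
# Sketch: homogeneity of variance, continuing the hypothetical df and fit above.
import matplotlib.pyplot as plt
from scipy import stats

# Group the residuals by the levels of one factor (here, the hypothetical f1).
resid = fit.resid
groups = [resid[df["f1"] == level] for level in df["f1"].unique()]

# center='mean' gives the classic Levene test; center='median' is the
# more robust Brown-Forsythe variant.
stat, p = stats.levene(*groups, center="median")
print(f"Brown-Forsythe test: W = {stat:.2f}, p = {p:.3f}")

# For the continuous covariate, look for the residuals fanning out to one side.
plt.scatter(df["x"], resid)
plt.axhline(0, color="grey")
plt.xlabel("x"); plt.ylabel("residual")
plt.show()
```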

The normality of the residuals can be assessed via formal tests, such as the Shapiro-Wilk or Kolmogorov-Smirnov tests, but is often best assessed visually via a qq-plot. (Note that this assumption is generally the least important of the set; if it is not met, your beta estimates will still be unbiased, but your p-values will be inaccurate.)
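For example, continuing the same hypothetical fit:

```python
# Sketch: normality of the residuals for the hypothetical fit above.
import matplotlib.pyplot as plt
import statsmodels.api as sm
from scipy import stats

w, p = stats.shapiro(fit.resid)  # Shapiro-Wilk test of the residuals
print(f"Shapiro-Wilk: W = {w:.3f}, p = {p:.3f}")

# A qq-plot is usually more informative than the formal test;
# line="s" draws a standardized reference line through the quantiles.
sm.qqplot(fit.resid, line="s")
plt.show()
```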

There are several ways to assess the influence of your individual observations. It is possible to get numerical values that index this, but my favorite way, if you can do it, is to jackknife your data. That is, you drop each data point in turn and re-fit your model, then examine how much your betas would bounce around if that observation were not part of your dataset. This measure is called dfbeta. It requires a bit of programming, but there are also standard influence measures that software can often compute for you automatically; these include leverage and Cook's distance.
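In statsmodels these diagnostics are available from the fitted model; a minimal sketch, again assuming the hypothetical `fit` from above (the 4/n cutoff used here is just one common rule of thumb, not the only option):

```python
# Sketch: influence diagnostics for the hypothetical OLS fit above.
influence = fit.get_influence()

cooks_d, _ = influence.cooks_distance   # Cook's distance for each observation
leverage = influence.hat_matrix_diag    # leverage (hat values)
dfbetas = influence.dfbetas             # change in each beta when that row is dropped

# One common rule of thumb: flag observations with Cook's distance above 4/n.
threshold = 4 / len(fit.resid)
flagged = [i for i, d in enumerate(cooks_d) if d > threshold]
print("Potentially influential observations:", flagged)
```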

Regarding your question as originally stated, if you want to know more about link functions and the generalized linear model, I discussed that fairly extensively here. Basically, the most important thing to consider in order to select an appropriate link function is the nature of your response distribution; since you believe $Y$ is Gaussian, the identity link is appropriate, and you can just think of this situation using standard ideas about regression models.
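To connect this back to the GLiM framing: a Gaussian GLM with the identity link should reproduce the plain regression fit. A quick check, using the same hypothetical formula and data as in the earlier sketches:

```python
# Sketch: a Gaussian GLM with the identity link gives the same coefficients
# as the plain OLS fit (same hypothetical formula and data as above).
import statsmodels.api as sm
import statsmodels.formula.api as smf

glm_fit = smf.glm("y ~ C(f1) + C(f2) + C(f3) + x", data=df,
                  family=sm.families.Gaussian()).fit()  # identity link is the default

print(glm_fit.params - fit.params)  # differences should be essentially zero
```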

Concerning the "correct scale of measurement of explanatory variables", I take you to be referring to Stevens' levels of measurement (i.e., categorical, ordinal, interval & ratio). The first thing to realize is that regression methods (including GLiMs) do not make assumptions about the explanatory variables; instead, the manner in which you use your explanatory variables in your model reflects your beliefs about them. Furthermore, I tend to think Stevens' levels are overplayed; for a more theoretical treatment of that topic, see here.