Solved – Using GLM on a continuous response variable

generalized linear model, r

Let's say that I am developing a glm on a continuous response variable. I've read a number of tutorials on glm and the estimation that it utilizes. However, I'm a little lost on the specifications that are required for developing a glm in R.

According to the glm help page, family is supposed to specify the residual distribution. However, I find that my error distribution never matches any of the provided "options" (gamma, gaussian, etc.). For example, how would I specify the family if my error distribution looked as follows?

Feel free to provide me with any other information that I seem to be misunderstanding.

[Image: plot of residuals against fitted values, with a blue line]

Best Answer

The family describes the distribution of the response. It's what makes a GLM different from a plain linear model.

In a linear model your responses (y-values) are considered to be Normally distributed with mean aX+b and variance sigma-squared. In a GLM of another family the responses are assumed to come from another distribution, such as Poisson or binomial, whose mean is tied to the aX+b part through a transformation (the link function). If your response values are small counts (classic example: number of cars passing a post every minute on a quiet road) then the linear model is inappropriate because it might end up predicting a negative number of cars. So you use a Poisson regression, where the distribution is non-negative.
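To make the cars example concrete, here is a minimal sketch on simulated data. The variable names, the coefficients used to simulate the counts, and the prediction points are all made up for illustration; nothing here comes from the asker's data.

```r
set.seed(1)

# Hypothetical data: small counts whose mean grows with a predictor x
x    <- seq(0, 5, length.out = 200)
cars <- rpois(length(x), lambda = exp(-1 + 0.6 * x))

# Plain linear model: Normal response, mean = a*x + b
fit_lm  <- lm(cars ~ x)

# Poisson GLM: Poisson response, log(mean) = a*x + b
fit_glm <- glm(cars ~ x, family = poisson)

# Predict at a few points, including one below the observed range
new_x    <- data.frame(x = c(-2, 0, 5))
pred_lm  <- predict(fit_lm,  newdata = new_x)
pred_glm <- predict(fit_glm, newdata = new_x, type = "response")

print(cbind(new_x, lm = pred_lm, poisson = pred_glm))
```

With the log link the Poisson predictions are exp(something), so they cannot go below zero, whereas nothing stops the straight line from the linear model crossing zero.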

This affects how you think about residuals, and the GLM code in R can return various types of residuals; see help(residuals.glm) for details. For example, the size of the residuals in your plot, together with your telling us that the data are in the thousands, makes me think these are 'response' residuals, which are not scaled in the way some of the other residual types are.
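As a rough illustration of the different residual types, the sketch below fits a Poisson GLM to simulated data (the data and coefficients are hypothetical) and extracts three of the types documented in help(residuals.glm).

```r
set.seed(2)

# Hypothetical count data for illustration only
x   <- runif(100, 0, 3)
y   <- rpois(100, lambda = exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson)

# 'response' residuals: observed minus fitted mean, on the scale of y
r_response <- residuals(fit, type = "response")

# 'pearson' residuals: response residuals scaled by the square root of
# the variance function (for Poisson, the fitted mean itself)
r_pearson  <- residuals(fit, type = "pearson")

# 'deviance' residuals: what residuals() returns for a glm by default
r_deviance <- residuals(fit, type = "deviance")

# Compare the spread of each type
print(sapply(list(response = r_response,
                  pearson  = r_pearson,
                  deviance = r_deviance), range))
```

The response residuals grow with the size of the fitted values, while the Pearson and deviance residuals are scaled and tend to stay in a narrower, more comparable range.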

You don't think the blue line in your plot is the 'distribution of your residuals', do you? You should probably have a look at a histogram, a boxplot, or the quantiles of your residuals.
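A minimal sketch of how one might look at the distribution of the residuals directly, again on simulated data (the model and variable names are placeholders, not the asker's):

```r
set.seed(3)

# Hypothetical data and model, for illustration only
x   <- runif(150, 0, 3)
y   <- rpois(150, lambda = exp(0.5 + 0.8 * x))
fit <- glm(y ~ x, family = poisson)

res <- residuals(fit, type = "deviance")

# Numerical summary: selected quantiles of the residuals
q <- quantile(res, probs = c(0.05, 0.25, 0.5, 0.75, 0.95))
print(q)

# Graphical summaries of the same distribution
hist(res, breaks = 20, main = "Deviance residuals", xlab = "Residual")
boxplot(res, horizontal = TRUE, main = "Deviance residuals")
```

These show the shape of the residual distribution, which is a different thing from a trend line drawn through a residuals-versus-fitted scatterplot.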

Fitted vs residual plots like the one you've given us tell us some things: for example, if you see a strong pattern of high, low, high residuals then that's saying there's some more curviness in your data that your model hasn't found (because your residuals aren't independent); if your residuals are all small at low fitted values and then large at high fitted values (but all balanced around zero) then your variance isn't constant and you should think carefully about a transformation of some kind.
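The non-constant-variance pattern can be reproduced with a small simulation. This is only a sketch: the continuous response below is generated so that its spread grows with its mean, and the smoother drawn with lowess() is an assumed stand-in for the kind of line such plots usually carry.

```r
set.seed(4)

# Hypothetical continuous response: mean 2*x, standard deviation
# proportional to the mean (Gamma with fixed shape)
x <- runif(300, 1, 10)
y <- rgamma(300, shape = 5, rate = 5 / (2 * x))

fit <- lm(y ~ x)
f   <- fitted(fit)
r   <- resid(fit)

# Residuals vs fitted: look for a funnel shape opening to the right
plot(f, r, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
lines(lowess(f, r), col = "blue")  # trend of the residuals, not their distribution

# Crude numerical check: residual spread in the lower vs upper half of fitted values
low     <- f <= median(f)
sd_low  <- sd(r[low])
sd_high <- sd(r[!low])
print(c(sd_low = sd_low, sd_high = sd_high))
```

Here the trend line stays near zero, so the mean structure is roughly right, but the spread is clearly larger at high fitted values, which is the signal that the constant-variance assumption does not hold.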
