Solved – How would you explain generalized linear models to people with no statistical background

communication, generalized linear model

I always have a hard time explaining statistical techniques to an audience with no statistical background. If I wanted to explain what a GLM is to such an audience (without throwing out statistical jargon), what would be the best or most effective way?

I usually explain a GLM in three parts: (1) the random component, which is the response variable, (2) the systematic component, which is the linear predictor, and (3) the link function, which is the "key" connecting (1) and (2). Then I would give an example of linear or logistic regression and explain how the link function is selected based on the response variable, which is why it acts as the key connecting the two components.

Best Answer

If the audience really has no statistical background, I think I would try to simplify the explanation quite a bit more. First, I would draw a coordinate plane on the board with a line on it, like so:

[Figure: a straight line drawn on a coordinate plane, y = mx + b]

Everyone at your talk will be familiar with the equation for a simple line, $y = mx + b$, because that's something that is learned in grade school. So I would display that alongside the drawing. However, I would write it backwards, like so:

$mx + b = y$

I would say that this equation is an example of a simple linear regression. I would then explain how you (or a computer) could fit such an equation to a scatter plot of data points, like the one shown in this image:

[Figure: scatter plot of organism size against age, with the fitted regression line and its equation]

I would say that here, we are using the age of the organism that we are studying to predict how big it is, and that the resultant linear regression equation that we get (shown on the image) can be used to predict how big an organism is if we know its age.
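If someone in the audience asks how a computer actually finds that line, a minimal sketch along these lines might help. The ages and sizes are made-up numbers, and `fit_line` and `predict` are names I am inventing for illustration; real software does the same least-squares arithmetic with more care.

```python
# Hypothetical measurements: age (x) and size (y) of ten organisms.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sizes = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9]


def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line m*x + b = y."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x):
    """Use the fitted line to predict y for a new x."""
    return m * x + b


m, b = fit_line(ages, sizes)
print(f"fitted line: {m:.3f} * age + {b:.3f} = size")
print(f"predicted size at age 12: {predict(m, b, 12):.2f}")
```

The only point to get across is that the computer picks the m and b that bring the line as close as possible to the dots, and that the fitted equation can then be used for ages that were never measured.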

Returning to our general equation $mx + b = y$, I would say that the x's are variables that can predict the y's, so we call them predictors. The y's are commonly called responses.

Then I would explain again that this was an example of a simple linear regression equation, and that there are actually more complicated varieties. For example, in a variety called logistic regression, the y's are only allowed to be 1's or 0's. You might want to use this type of model if you are trying to predict a "yes" or "no" answer, like whether or not someone has a disease. Another special variety is something called Poisson regression, which is used to analyse "count" or "event" data (I wouldn't delve further into this unless really necessary).
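For a slightly more technical audience, the yes/no case can be sketched in a few lines. This is only an illustration under my own assumptions: the ages and disease outcomes are invented, `probability` and `fit_logistic` are hypothetical helper names, and the fitting uses Newton's method, which is one textbook way to do it and not necessarily what any particular package does.

```python
import math

# Hypothetical data: age (x) and whether the disease is present (1) or not (0).
ages = [1, 2, 3, 4, 5, 6, 7, 8]
disease = [0, 0, 1, 0, 1, 0, 1, 1]


def probability(m, b, x):
    """Squash the straight line m*x + b into a number between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-(m * x + b)))


def fit_logistic(xs, ys, iterations=25):
    """Fit m and b by Newton's method on the log-likelihood."""
    m, b = 0.0, 0.0
    for _ in range(iterations):
        ps = [probability(m, b, x) for x in xs]
        # Gradient of the log-likelihood with respect to b and m.
        g_b = sum(y - p for y, p in zip(ys, ps))
        g_m = sum(x * (y - p) for x, y, p in zip(xs, ys, ps))
        # Entries of the 2x2 information matrix.
        ws = [p * (1 - p) for p in ps]
        h_bb = sum(ws)
        h_bm = sum(w * x for w, x in zip(ws, xs))
        h_mm = sum(w * x * x for w, x in zip(ws, xs))
        det = h_bb * h_mm - h_bm * h_bm
        # Newton step: solve the 2x2 linear system by hand.
        b += (h_mm * g_b - h_bm * g_m) / det
        m += (h_bb * g_m - h_bm * g_b) / det
    return m, b


m, b = fit_logistic(ages, disease)
for age in (2, 8):
    print(f"age {age}: estimated chance of disease {probability(m, b, age):.2f}")
```

The thing to emphasise is that the model never predicts a bare 0 or 1. It predicts a chance between 0 and 1, which is why a plain straight line will not do here.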

I would then explain that linear regression, logistic regression, and Poisson regression are really all special examples of a more general method, something called a "generalized linear model". The great thing about "generalized linear models" is that they allow us to use "response" data that can take any value (like how big an organism is in linear regression), take only 1's or 0's (like whether or not someone has a disease in logistic regression), or take discrete counts (like number of events in Poisson regression).

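I would then say that in these types of equations, the x's (predictors) are connected to the y's (responses) via something that statisticians call a "link function". The link function is needed when the typical (average) y does not follow a straight line in the x's: it rescales the average y so that the rescaled value does. In ordinary linear regression no rescaling is needed, so the link simply leaves y as it is.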
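If you do end up having to show what a link function is, a small sketch like the following might be enough. It pairs each of the three models mentioned above with the link most commonly used for it (identity, logit, and log); the dictionary `LINKS`, the function `mean_response`, and the values of m and b are my own hypothetical choices for illustration.

```python
import math


def identity(value):
    return value


def logit(p):
    """Map a probability in (0, 1) onto the whole number line."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Map any number back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# For each model: (link, inverse link). The link is applied to the average y.
LINKS = {
    "linear": (identity, identity),
    "logistic": (logit, inverse_logit),
    "poisson": (math.log, math.exp),
}


def mean_response(model, m, b, x):
    """Average y implied by the line m*x + b under the chosen model."""
    _, inverse_link = LINKS[model]
    return inverse_link(m * x + b)


# Hypothetical slope and intercept, reused for all three models.
m, b = 0.5, -1.0
for model in LINKS:
    values = [round(mean_response(model, m, b, x), 3) for x in (0, 2, 4)]
    print(model, values)
```

The same straight line $mx + b$ sits inside all three models. Only the rescaling changes, so that the result is any number, a chance between 0 and 1, or a positive count rate.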

Anyway, those are my two cents on the issue! Maybe my proposed explanation sounds a bit hokey and dumb, but if the purpose of this exercise is just to get the "gist" across to the audience, perhaps an explanation like this isn't too bad. I think it's important that the concept be explained in an intuitive way and that you avoid throwing around words like "random component", "systematic component", "link function", "deterministic", "logit function", etc. If you're talking to people who truly have no statistical background, like a typical biologist or physician, their eyes are just going to glaze over at hearing those words. They don't know what a probability distribution is, they've never heard of a link function, and they don't know what a "logit" function is, etc.

In your explanation to a non-statistical audience, I would also focus on when to use which variety of model. I might talk about how many predictors you are allowed to include on the left-hand side of the equation (I've heard rules of thumb like no more than your sample size divided by ten). It would also be nice to include an example spreadsheet with data and explain to the audience how to use a statistical software package to generate a model. I would then go through the output of that model step by step and try to explain what all the different letters and numbers mean. Biologists are clueless about this stuff and are more interested in learning what test to use when than in actually gaining an understanding of the math behind the GUI of SPSS!
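The rule of thumb about the number of predictors is simple arithmetic, so it could go on a slide as something like the sketch below. The function name `max_predictors` is hypothetical, and the divisor of ten is just the hearsay figure quoted above, not an established standard.

```python
def max_predictors(sample_size, observations_per_predictor=10):
    """Rough upper limit on predictors: sample size divided by ten, rounded down."""
    if sample_size < 0 or observations_per_predictor <= 0:
        raise ValueError("sample size must be non-negative and the divisor positive")
    return sample_size // observations_per_predictor


for n in (9, 50, 120):
    print(f"{n} observations: at most {max_predictors(n)} predictors")
```

With 120 observations, for example, this gives a limit of 12 predictors, and with fewer than 10 observations it gives none at all.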

I would appreciate any comments or suggestions regarding my proposed explanation, particularly if anyone notes errors or thinks of a better way to explain it!