Solved – Incidence Rate Ratio (IRR) in R from linear regression using log-transformed data

incidence-rate-ratio, least squares, poisson distribution

I was wondering whether it makes sense to calculate an IRR from OLS (not Poisson) when the OLS is fitted to log-transformed data.
I have a set of crude death rate data. I'm still debating whether they are count data (they are, after all, based on counts) or continuous data (they are not integers). I've modeled them using Poisson regression, but I'm curious what would happen if I logged the crude rate and then ran a robust linear regression. I'd like to compare the two approaches via the IRR.

Any suggestions are welcome, for example, if I really shouldn't log the crude rate to begin with.
Thanks!

Best Answer

Well, if your numerator is directly interpreted as counts, then both the Poisson regression and the linear regression on the log-transformed outcome will be consistent for the same parameters. The only discrepancy in this case is how the observations are weighted (see the next paragraph). If your outcome is a rate with measured (variable) denominators (such as 1-3 $\mu$g of biopsied tumor, or 1-20 cc of blood), you need an approach that accounts for the resulting differences in weighting. In both linear regression and Poisson regression, this comes about in the form of an offset. I'm curious whether this should be a consideration in your problem.
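As a rough sketch of the comparison being asked about, the following fits both models to simulated data and exponentiates the slope from each. Everything here is an assumption for illustration: the variable names (`deaths`, `pop`, `x`), the simulated rates, and the true rate ratio of `exp(0.3)` are made up and are not from the original question.

```r
# Simulated data: all names and values are illustrative assumptions
set.seed(1)
n      <- 500
x      <- rbinom(n, 1, 0.5)                            # hypothetical binary exposure
pop    <- round(runif(n, 5000, 50000))                 # denominator (population at risk)
deaths <- rpois(n, lambda = pop * exp(-5 + 0.3 * x))   # true rate ratio = exp(0.3)
rate   <- deaths / pop                                 # crude rate

# Poisson GLM on the counts, with log(pop) entering as an offset
fit_pois <- glm(deaths ~ x + offset(log(pop)), family = poisson)

# OLS on the log crude rate (only possible when every rate is > 0)
fit_ols <- lm(log(rate) ~ x)

irr_pois  <- exp(coef(fit_pois)[["x"]])
ratio_ols <- exp(coef(fit_ols)[["x"]])
print(c(poisson = irr_pois, log_ols = ratio_ols, truth = exp(0.3)))
```

Strictly speaking, the exponentiated OLS slope is a ratio of geometric mean rates rather than of arithmetic mean rates, so the two numbers need not agree on real data even though they are close in this simulation.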

In OLS, the mean is independent of the variance (under classical assumptions), so your fitted model minimizes the sum of squared residuals, which will be largely driven by large counts. In the Poisson GLM, large counts are substantially downweighted by inverse-variance reweighting. An inspection of the distribution of the data using one or more scatterplots (depending on the number of adjustment variables) and fitted curves is a very important step. You will certainly need to check for high-leverage / high-influence observations to validate the alternative modeling approaches you've proposed.
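One possible way to do the leverage and influence check is with the base R functions `hatvalues()` and `cooks.distance()`, which work for both `lm` and `glm` fits. This is only a sketch on simulated data; the variable names and parameter values are assumptions.

```r
# Simulated data: all names and values are illustrative assumptions
set.seed(2)
n      <- 200
x      <- runif(n, 0, 2)
pop    <- round(runif(n, 1000, 100000))
deaths <- rpois(n, lambda = pop * exp(-4 + 0.5 * x))
rate   <- deaths / pop

fit_pois <- glm(deaths ~ x + offset(log(pop)), family = poisson)
fit_ols  <- lm(log(rate) ~ x)

# Leverage and Cook's distance for each observation under each model
infl <- data.frame(
  pop       = pop,
  hat_ols   = hatvalues(fit_ols),
  hat_pois  = hatvalues(fit_pois),
  cook_ols  = cooks.distance(fit_ols),
  cook_pois = cooks.distance(fit_pois)
)

# The five observations with the largest influence under either model
worst <- order(-pmax(infl$cook_ols, infl$cook_pois))
print(head(infl[worst, ], 5))
```

Comparing the `pop` column against the two Cook's distance columns gives a quick view of whether small or large denominators dominate each fit.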

Using robust standard errors (one particular form of robust regression) does not assume that the mean is independent of the variance, but the point estimates still come from such a working probability model. So while the robust standard errors will be consistent, your point estimates will be unstable, and your inference will have lower power (than when you can assume a better working probability model for the data).
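To make the robust standard errors concrete, here is a minimal sketch that computes the basic heteroskedasticity-consistent (HC0) sandwich estimator by hand for the log-rate OLS model, using base R only. The data are simulated and the names are assumptions; in practice a package such as `sandwich` offers this and refined variants.

```r
# Simulated data: all names and values are illustrative assumptions
set.seed(3)
n        <- 300
x        <- runif(n, 0, 2)
pop      <- round(runif(n, 5000, 50000))
deaths   <- rpois(n, lambda = pop * exp(-5 + 0.4 * x))
log_rate <- log(deaths / pop)

fit <- lm(log_rate ~ x)

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
X        <- model.matrix(fit)
e        <- residuals(fit)
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)      # each row of X is scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vcov_hc0))
print(cbind(estimate = coef(fit), se_classical, se_robust))

# Approximate 95% interval for the exponentiated slope, using the robust SE
b  <- coef(fit)[["x"]]
ci <- exp(b + c(-1.96, 1.96) * se_robust[["x"]])
print(c(ratio = exp(b), lower = ci[1], upper = ci[2]))
```

Note that only the standard errors change; the point estimates are exactly those of the ordinary `lm` fit.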

Although R warns you about non-integral counts in Poisson GLMs, there are plenty of sane regression models, especially in, say, ecology, where non-integral Poisson outcomes come about, such as plankton concentration in a cubic meter of sampled water from various watersheds, or flow-cytometry-assessed mRNA concentration in biopsied tumor tissue.
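For crude rates specifically, one way to see that a non-integer outcome is harmless is to fit the rate directly with the denominators as prior weights and compare against the usual count-plus-offset model. The two have the same estimating equations, so the coefficients should match. The sketch below uses simulated data with made-up names, and uses `family = quasipoisson` for the rate model, which does not emit the non-integer warning.

```r
# Simulated data: all names and values are illustrative assumptions
set.seed(4)
n      <- 200
x      <- runif(n, 0, 2)
pop    <- round(runif(n, 5000, 50000))
deaths <- rpois(n, lambda = pop * exp(-5 + 0.4 * x))
rate   <- deaths / pop            # non-integer outcome

# Counts with an offset for the denominator
fit_count <- glm(deaths ~ x + offset(log(pop)), family = poisson)

# Non-integer rates, weighted by the denominator
fit_rate <- glm(rate ~ x, family = quasipoisson, weights = pop)

print(cbind(count_offset = coef(fit_count), rate_weighted = coef(fit_rate)))
```

The point estimates agree, but the reported standard errors generally differ because `quasipoisson` estimates a dispersion parameter instead of fixing it at 1.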

Related Solutions

Solved – Interpreting coefficients for Poisson regression

Not to be critical, but this is kind of a strange example. It's not clear that you're really doing time series analysis, nor what the NASDAQ would have to do with the number of games won by some team. If you're interested in saying something about the number of games a team won, I think it would be best to use binary logistic regression, given that you presumably know how many games are played. Poisson regression is most appropriate for talking about counts when the total possible is not constrained well, or at least not known.

How you would interpret your betas depends, in part, on the link used--it is possible to use the identity link, even though the log link is more common (and typically more appropriate). If you are using the log link, you probably wouldn't take the log of your response variable--the link in essence is doing that for you. Let's take an abstract case, you have a Poisson model using the log link as follows:
$$ \hat{y}=\exp(\hat{\beta}_0)\times\exp(\hat{\beta}_1)^x $$ alternatively, $$ \hat{y}=\exp(\hat{\beta}_0+\hat{\beta}_1x) $$

(EDIT: I'm removing the "hats" from the betas in what follows, because they're ugly, but they should still be understood.)

With normal OLS regression, you are predicting the mean of a Gaussian distribution of the response variable conditional on the values of the covariates. In this case, you are predicting the mean of a Poisson distribution of the response variable conditional on the values of the covariates. For OLS, if a given case were 1 unit higher on your covariate, you expect, all things being equal, the mean of that conditional distribution to be $\beta_1$ units higher. Here, if a given case were 1 unit higher, ceteris paribus, you expect the conditional mean to be $e^{\beta_1}$ times as high. For instance, say $\beta_1=2$, then in normal regression it is 2 units higher (i.e., $+2$), and here it is about 7.4 times as high (i.e., $\times 7.4$, since $e^2 \approx 7.39$). In both cases, $\beta_0$ is your intercept; in our equation above, consider the situation when $x=0$, then $\exp(\beta_1)^x=1$, and the right hand side reduces to $\exp(\beta_0)$, which gives you the mean of $y$ when all covariates equal 0.
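A quick way to check this interpretation numerically is to fit a Poisson GLM and compare the ratio of predicted means one unit apart with the exponentiated slope. This is a sketch on simulated data; the coefficients 0.5 and 0.7 and the helper `mean_at()` are assumptions made up for the example.

```r
# Simulated data: all names and values are illustrative assumptions
set.seed(5)
n <- 2000
x <- runif(n, 0, 2)
y <- rpois(n, lambda = exp(0.5 + 0.7 * x))

fit <- glm(y ~ x, family = poisson)   # log link is the default
b   <- coef(fit)

# Hypothetical helper: predicted conditional mean at a given x
mean_at <- function(x0) {
  unname(predict(fit, newdata = data.frame(x = x0), type = "response"))
}

# A one-unit increase in x multiplies the predicted mean by exp(beta_1)
ratio <- mean_at(1) / mean_at(0)
print(c(ratio = ratio, exp_beta1 = exp(b[["x"]])))

# At x = 0 the predicted mean is exp(beta_0)
print(c(mean_at_0 = mean_at(0), exp_beta0 = exp(b[["(Intercept)"]])))
```

The same ratio is obtained between any two values of `x` that are one unit apart, for example `mean_at(2) / mean_at(1)`.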

There are a couple of things that can be confusing about this. First, predicting the mean of a Poisson distribution isn't the same as predicting the mean of a Gaussian. With a normal distribution, the mean is the single most likely value. But with the Poisson, the mean is often an impossible value (e.g., if your predicted mean is 2.7, that's not a count that could exist). In addition, with the normal distribution the mean is unrelated to the level of dispersion (i.e., the SD), but with the Poisson distribution, the variance necessarily equals the mean (although it often doesn't in practice, leading to additional complexities). Finally, those exponentiations make it more complicated; if, instead of a relative change, you wanted to know the exact value, you would have to start at 0 (i.e., $e^{\beta_0}$) and multiply your way up $x$ times. For predicting a specific value, it's easier to solve the expression inside the parentheses in the bottom equation and then exponentiate; this makes the meaning of the beta less clear, but the math easier and reduces the possibility of error.
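Both points can be illustrated in a few lines. The coefficient values and the value of `x` below are hypothetical, chosen only to show that the two ways of computing a prediction agree and that a Poisson mean of 2.7 is not itself a possible count.

```r
# Hypothetical coefficients and covariate value, for illustration only
b0 <- 0.5
b1 <- 0.3
x  <- 4

# "Multiply your way up": start at exp(b0) and multiply by exp(b1), x times
step_by_step <- exp(b0) * exp(b1)^x

# Exponentiate the linear predictor in one go
direct <- exp(b0 + b1 * x)

print(all.equal(step_by_step, direct))   # TRUE

# With a mean of 2.7, the most likely count is an integer, not 2.7
k    <- 0:10
mode <- k[which.max(dpois(k, lambda = 2.7))]
print(mode)                              # 2
```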

Poisson Regression – How to Generate Data Samples from Poisson Regression

The Poisson regression model assumes a Poisson distribution for $Y$ and uses the $\log$ link function. So, for a single explanatory variable $x$, it is assumed that $Y \sim P(\mu)$ (so that $E(Y) = V(Y) = \mu$) and that $\log(\mu) = \beta_0 + \beta_1 x$. Generating data according to that model follows directly. Here is an example which you can adapt to your own scenario.

> # sample size
> n <- 10
> # regression coefficients
> beta0 <- 1
> beta1 <- 0.2
> # generate covariate values
> x <- runif(n=n, min=0, max=1.5)
> # compute mu's
> mu <- exp(beta0 + beta1 * x)
> # generate Y-values
> y <- rpois(n=n, lambda=mu)
> # data set
> data <- data.frame(y=y, x=x)
> data
   y         x
1  4 1.2575652
2  3 0.9213477
3  3 0.8093336
4  4 0.6234518
5  4 0.8801471
6  8 1.2961688
7  2 0.1676094
8  2 1.1278965
9  1 1.1642033
10 4 0.2830910
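As a sanity check on the simulation, one can generate a larger sample with the same coefficients and confirm that a Poisson GLM approximately recovers them. The seed, the sample size of 5000, and the name `sim` are assumptions added for this sketch.

```r
# Same generating model as above, with a larger n and a fixed seed (assumptions)
set.seed(6)
n     <- 5000
beta0 <- 1
beta1 <- 0.2
x     <- runif(n = n, min = 0, max = 1.5)
mu    <- exp(beta0 + beta1 * x)
y     <- rpois(n = n, lambda = mu)
sim   <- data.frame(y = y, x = x)

# Fit the model that generated the data and compare with the true values
fit <- glm(y ~ x, family = poisson, data = sim)
est <- coef(fit)
print(rbind(true = c(beta0, beta1), estimated = est))
```

With only `n <- 10` observations, as in the example above, the estimates would be far noisier.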
