I am currently modelling crash severity data with an inverse Gaussian glm with a log link. I read that the model residual deviance is distributed as $\chi^2_{n-p}$.
Would there be an obvious reason why the residual deviance is so low and far from $n-p$?
```
Call:
glm(formula = cost ~ points + age + metro_area,
    family = inverse.gaussian(link = "log"), data = data[data$cost > 0, ])

Deviance Residuals:
      Min         1Q     Median         3Q        Max
-0.076370  -0.015587  -0.008113  -0.000530   0.023887

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)         9.762717   0.091166 107.087  < 2e-16 ***
points              0.119757   0.022237   5.386 7.48e-08 ***
age                 0.013409   0.001778   7.544 5.21e-14 ***
metro_area noMetro  0.301689   0.050476   5.977 2.40e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for inverse.gaussian family taken to be 8.679275e-05)

    Null deviance: 1.3357  on 6372  degrees of freedom
Residual deviance: 1.3274  on 6369  degrees of freedom
AIC: 143717

Number of Fisher Scoring iterations: 22
```
Best Answer
The residual deviance from an inverse Gaussian glm is proportional to a chi-squared random variable, not equal to it.
Inverse-Gaussian glms and Gaussian glms share the property that the residual deviance would be exactly distributed as $\sigma^2 \chi^2_n$, where $\sigma^2$ is the dispersion parameter, if the link-linear model is correctly specified and the regression coefficients could be estimated perfectly without any sampling error.
After allowing for estimation uncertainty of the regression coefficients, the fitted residual deviance is approximately distributed as $\sigma^2 \chi^2_{n-p}$ where $p$ is the dimension of the design matrix, in your case the number of covariates plus 1 for the intercept giving $p=4$.
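The approximation above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the sample size, coefficients and dispersion are made up, and `rinvgauss_sketch` is a hypothetical helper written here in base R (using the Michael-Schucany-Haas transformation) so that no extra package is needed. It fits the same kind of model repeatedly and compares the residual deviance with $n-p$ before and after dividing by the true dispersion.

```r
set.seed(1)

# Hypothetical base-R inverse Gaussian sampler (Michael-Schucany-Haas method),
# parameterised so that var(y) = dispersion * mu^3, matching R's
# inverse.gaussian family. One draw is returned per element of mu.
rinvgauss_sketch <- function(mu, dispersion) {
  lambda <- 1 / dispersion
  nu <- rnorm(length(mu))^2
  x <- mu + mu^2 * nu / (2 * lambda) -
    mu / (2 * lambda) * sqrt(4 * mu * lambda * nu + mu^2 * nu^2)
  u <- runif(length(mu))
  ifelse(u <= mu / (mu + x), x, mu^2 / x)
}

# Assumed illustrative settings (not taken from the crash data)
n   <- 500          # number of observations
p   <- 2            # intercept plus one covariate
phi <- 0.01         # true dispersion, deliberately far from 1
x   <- runif(n)
mu  <- exp(1 + 0.5 * x)

# Simulate one data set, fit the log-link inverse Gaussian glm,
# and return its residual deviance
sim_deviance <- function() {
  y <- rinvgauss_sketch(mu, phi)
  fit <- glm(y ~ x, family = inverse.gaussian(link = "log"))
  deviance(fit)
}

dev <- replicate(200, sim_deviance())

mean(dev)        # nowhere near n - p; on the scale of phi * (n - p)
mean(dev / phi)  # should be close to n - p = 498
```

Only after dividing by the dispersion does the deviance sit near its degrees of freedom, which is the sense in which the residual deviance is proportional to, rather than equal to, a chi-squared variable.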
The dispersion $\sigma^2$ is analogous to the variance for a normal linear regression. It can take any positive value at all, and there is no reason why it should be near to 1. For your data, the R output shows that the dispersion is estimated to be $$\hat\sigma^2 = 0.00008679275.$$
The dispersion value partly just reflects the scale on which your response variable is measured. If you express your `cost` variable in terms of (i) cents, (ii) dollars, (iii) thousands of dollars or (iv) millions of dollars, then the estimated dispersion will change in inverse proportion. Multiplying all your `cost` values by a constant will divide the dispersion and the residual deviance by the same constant (without changing the coefficients or standard errors or p-values for any of your covariates). Your `cost` values appear to be about 20,000 (= `exp(9.76)`), so it is no surprise that the dispersion is roughly 1/20,000.

For more details about inverse-Gaussian glms and examples in R you could consult the two references below.
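The effect of the measurement scale on the dispersion can be verified directly. The sketch below is an illustration under assumptions: the data are simulated, the covariate `age` and its coefficients are invented, and gamma draws are used only as a convenient source of positive responses (the scaling relationship is algebraic and does not depend on how the data were generated). The convergence tolerance is tightened so that the two fits agree to many digits.

```r
set.seed(2)

# Assumed illustrative data (not the crash data)
n    <- 300
age  <- runif(n, 20, 70)
# Any positive response works for this check; gamma draws are a stand-in
cost <- rgamma(n, shape = 20, rate = 20 / exp(9 + 0.01 * age))
cost_k <- cost / 1000   # the same costs expressed in thousands

# Tight convergence settings so both fits reach the same solution
ctrl <- glm.control(epsilon = 1e-12, maxit = 100)

fit_dollars   <- glm(cost ~ age, family = inverse.gaussian(link = "log"),
                     control = ctrl)
fit_thousands <- glm(cost_k ~ age, family = inverse.gaussian(link = "log"),
                     control = ctrl)

# Dividing the response by 1000 multiplies deviance and dispersion by 1000
deviance(fit_thousands) / deviance(fit_dollars)
summary(fit_thousands)$dispersion / summary(fit_dollars)$dispersion

# The covariate coefficient is unchanged; the intercept shifts by -log(1000)
coef(fit_thousands) - coef(fit_dollars)

# Standard errors are the same in both fits
coef(summary(fit_thousands))[, "Std. Error"] /
  coef(summary(fit_dollars))[, "Std. Error"]
```

Note that with a log link the intercept absorbs the change of units, so only the covariate effects are left untouched by rescaling.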
References
Dunn, P.K., and Smyth, G.K. (2018). Chapter 11: Positive Continuous Data: Gamma and Inverse Gaussian GLMs. In: Generalized Linear Models With Examples in R. Springer Texts in Statistics. Springer, New York, NY. https://link.springer.com/book/10.1007/978-1-4419-0118-7
Giner, G., and Smyth, G.K. (2016). statmod: probability calculations for the inverse Gaussian distribution. R Journal 8(1), 339-351. https://journal.r-project.org/archive/2016/RJ-2016-024/