Solved – inverse gaussian glm residual deviance

deviance, generalized linear model, inverse-gaussian-distribution

I am currently modelling crash severity data with an inverse gaussian glm with a log link. I read that

model residual deviance ~ $\chi^2_{n-p}$

Would there be an obvious reason why the residual deviance is so low and far from $n-p$?

```
Call:
glm(formula = cost ~ points + age + metro_area, 
    family = inverse.gaussian(link = "log"), data = data[data$cost > 0, ])

Deviance Residuals: 
      Min         1Q     Median         3Q        Max  
-0.076370  -0.015587  -0.008113  -0.000530   0.023887  

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                   9.762717   0.091166 107.087  < 2e-16 ***
points                        0.119757   0.022237   5.386 7.48e-08 ***
age                           0.013409   0.001778   7.544 5.21e-14 ***
metro_area noMetro            0.301689   0.050476   5.977 2.40e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for inverse.gaussian family taken to be 8.679275e-05)

    Null deviance: 1.3357  on 6372  degrees of freedom
Residual deviance: 1.3274  on 6369  degrees of freedom
AIC: 143717

Number of Fisher Scoring iterations: 22

```

Best Answer

The residual deviance from an inverse Gaussian glm is proportional to a chi-squared random variable, not equal to one.

Inverse-Gaussian glms and Gaussian glms share the property that the residual deviance would be exactly distributed as $\sigma^2 \chi^2_n$, where $\sigma^2$ is the dispersion parameter, if the link-linear model is correctly specified and the regression coefficients could be estimated perfectly without any sampling error.

After allowing for estimation uncertainty of the regression coefficients, the fitted residual deviance is approximately distributed as $\sigma^2 \chi^2_{n-p}$ where $p$ is the dimension of the design matrix, in your case the number of covariates plus 1 for the intercept giving $p=4$.
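As a quick check of the degrees of freedom against the R output in the question (the null deviance is reported on $n-1 = 6372$ degrees of freedom):

$$n = 6372 + 1 = 6373, \qquad n - p = 6373 - 4 = 6369,$$

which agrees with the 6369 degrees of freedom reported for the residual deviance.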

The dispersion $\sigma^2$ is analogous to the variance for a normal linear regression. It can take any positive value at all, and there is no reason why it should be near to 1. For your data, the R output shows that the dispersion is estimated to be $$\hat\sigma^2 = 0.00008679275.$$
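To see why the dispersion is tied to the units of the response, recall the mean-variance relationship of an inverse Gaussian response with mean $\mu$:

$$\operatorname{Var}(y) = \sigma^2 \mu^3 .$$

If the response is rescaled to $y^* = c\,y$ for a constant $c > 0$ (a symbol introduced here for illustration), then the mean becomes $c\mu$ and

$$\operatorname{Var}(y^*) = c^2 \sigma^2 \mu^3 = \frac{\sigma^2}{c}\,(c\mu)^3 ,$$

so the dispersion on the new scale is $\sigma^2 / c$. Unlike a coefficient of variation, it is not unit-free.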

The dispersion value partly just reflects the scale on which your response variable is measured. If you express your cost variable in terms of (i) cents, (ii) dollars, (iii) thousands of dollars or (iv) millions of dollars, then the estimated dispersion will change in inverse proportion. Multiplying all your cost values by a constant $c$ will divide both the dispersion and the residual deviance by $c$, without changing the coefficients, standard errors or p-values for any of your covariates (with a log link, only the intercept shifts, by $\log c$). Your cost values appear to be about 20,000 (=exp(9.76)), so it is no surprise that the dispersion is roughly 1/20,000.
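The scaling behaviour can be checked numerically. The following is a minimal sketch on simulated data, not the crash data from the question: the helper `rinvgauss_sketch`, the covariate `x1`, the coefficients and the dispersion value 0.05 are all assumptions made for illustration. The helper uses the standard Michael-Schucany-Haas transformation so that the example runs in base R; `statmod::rinvgauss` from the reference below could be used instead.

```r
# Hypothetical helper (not from the original post): draw inverse Gaussian
# variates with mean `mu` and dispersion `phi` (shape = 1 / phi) using the
# Michael-Schucany-Haas transformation method.
rinvgauss_sketch <- function(n, mu, phi) {
  lambda <- 1 / phi
  y <- rnorm(n)^2
  x <- mu + mu^2 * y / (2 * lambda) -
    mu / (2 * lambda) * sqrt(4 * mu * lambda * y + mu^2 * y^2)
  ifelse(runif(n) <= mu / (mu + x), x, mu^2 / x)
}

set.seed(1)
n  <- 2000
x1 <- runif(n)                      # illustrative covariate
mu <- exp(1 + 0.5 * x1)             # log-link mean with made-up coefficients
cost_dollars <- rinvgauss_sketch(n, mu, phi = 0.05)
cost_cents   <- 100 * cost_dollars  # same data, different unit

fit_dollars <- glm(cost_dollars ~ x1, family = inverse.gaussian(link = "log"))
fit_cents   <- glm(cost_cents ~ x1, family = inverse.gaussian(link = "log"))

disp_dollars <- summary(fit_dollars)$dispersion
disp_cents   <- summary(fit_cents)$dispersion

# Both ratios should be close to the scale factor 100
print(c(dispersion_ratio = disp_dollars / disp_cents,
        deviance_ratio   = deviance(fit_dollars) / deviance(fit_cents)))

# Slopes agree; intercepts differ by about log(100)
print(rbind(dollars = coef(fit_dollars), cents = coef(fit_cents)))

# Residual deviance relative to dispersion * (n - p): near 1 when the model fits
print(deviance(fit_dollars) / (disp_dollars * df.residual(fit_dollars)))
```

The last line compares the residual deviance with $\hat\sigma^2 (n-p)$ rather than with $n-p$ itself, which is the comparison that is meaningful for this family.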

For more details about inverse-Gaussian glms and examples in R you could consult the two references below.

References

Dunn P.K., Smyth G.K. (2018) Chapter 11: Positive Continuous Data: Gamma and Inverse Gaussian GLMs. In: Generalized Linear Models With Examples in R. Springer Texts in Statistics. Springer, New York, NY. https://link.springer.com/book/10.1007/978-1-4419-0118-7

Giner, G, and Smyth, GK (2016). statmod: probability calculations for the inverse Gaussian distribution. R Journal 8(1), 339-351. https://journal.r-project.org/archive/2016/RJ-2016-024/