Solved – Which distribution should I use when building a GAM in R

Tags: distributions, f-distribution, generalized-additive-model, poisson-distribution, r

What type of distribution would you consider this data to be? My first thought is an f-distribution, but I suppose it could also be a poisson distribution. I am trying to build a generalized additive model (GAM) that takes time components of an EMS call (things like response time, time at scene, time to hospital, etc.) to see how well specific parts of an EMS run can explain the overall call time.

The gam function in R (using the mgcv package) uses family=" " to identify the distribution of the gam. The f-distribution is not a family type. Does anyone know why not? Should I go with the poisson distribution? That said, I have tried both gaussian and poisson in test gam models. I had a deviance explained = ~70% for the gaussian gam and only deviance explained = ~50% for the poisson. Should I go with whichever distribution gives me the higher deviance explained value? Thanks!


Best Answer

> What type of distribution would you consider this data to be?

It's largely irrelevant what that looks like, since what you have here is the marginal distribution, whereas the GAM models the conditional distribution of the response (conditional on the predictors, that is).
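
To make the marginal-versus-conditional point concrete, here is a minimal simulated sketch in base R. The data, the variable names and the use of a plain linear model in place of a GAM are all assumptions made for illustration. The response is Gaussian given the predictor, yet its marginal distribution is strongly right-skewed because the predictor is.

```r
set.seed(1)
n <- 2000

# Hypothetical data: a right-skewed predictor and a response that is
# exactly Gaussian once we condition on that predictor
x <- rexp(n, rate = 1)
y <- 10 + 5 * x + rnorm(n, sd = 1)

fit <- lm(y ~ x)

# Simple moment-based sample skewness
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

marginal_skew <- skewness(y)           # skewness of the raw response
residual_skew <- skewness(resid(fit))  # skewness after conditioning on x

print(c(marginal = marginal_skew, residual = residual_skew))
```

A histogram of `y` alone would suggest a skewed family, while the residuals show that a Gaussian conditional model is adequate for this simulated case.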

> My first thought is an f-distribution, but I suppose it could also be a poisson distribution.

If it could be one, it could not possibly be the other, since one is continuous and the other is discrete. But again, it doesn't matter for this problem.

> The f-distribution is not a family type. Does anyone know why not?

Because the F distribution is not a member of the exponential family.

If there's a strong need for an F, there are other things you might try (but I see no reason to think that there is any need for an F at all, other than a very superficial similarity of the - irrelevant - shape of the marginal distribution).

> Should I go with the poisson distribution?

I don't see why this would be a reasonable choice. Times aren't discrete. Even if that weren't an issue, I see no reason to expect the variance to equal the mean. Indeed, since you'd expect the spread-mean relationship not to depend on your time unit, I'd plump for variance proportional to the mean squared, i.e. standard deviation proportional to the mean (which suggests Gamma or lognormal as two easy options).
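
The unit argument can be checked numerically. The sketch below uses simulated, hypothetical times (the Gamma parameters are arbitrary) and expresses the same data in seconds and in minutes. The coefficient of variation is unchanged by the change of unit, whereas the variance-to-mean ratio, which a Poisson model fixes at 1, changes by the conversion factor.

```r
set.seed(2)

# Hypothetical call times in seconds (mean about 600 s), then in minutes
secs <- rgamma(5000, shape = 4, rate = 4 / 600)
mins <- secs / 60

cv <- function(v) sd(v) / mean(v)

cv_secs <- cv(secs)
cv_mins <- cv(mins)

# Variance-to-mean ratio: a Poisson model assumes this equals 1
ratio_secs <- var(secs) / mean(secs)
ratio_mins <- var(mins) / mean(mins)

print(c(cv_secs = cv_secs, cv_mins = cv_mins))
print(c(ratio_secs = ratio_secs, ratio_mins = ratio_mins))
```

Whether the data look "Poisson-like" in the variance-equals-mean sense therefore depends on the arbitrary choice of unit, while a variance proportional to the mean squared does not.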

My first choice for times would typically be Gamma, with either a log link or an inverse link (or rarely, perhaps identity), depending on my understanding of the circumstances. In some circumstances I might transform times (to speeds or log-times) and try to model those.
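
As a rough sketch of what a Gamma GAM with a log link might look like in mgcv: the data frame `ems`, its column names and the simulated values below are hypothetical stand-ins for the real EMS data, and `method = "REML"` is one common choice for smoothness selection rather than something the question specifies.

```r
library(mgcv)

set.seed(3)
n <- 500

# Hypothetical EMS data: column names and values are invented for illustration
ems <- data.frame(
  response_min = runif(n, 2, 20),
  scene_min    = runif(n, 5, 40)
)
mu <- exp(3 + 0.03 * ems$response_min + 0.02 * ems$scene_min)

# Gamma response with mean mu and variance mu^2 / 10
ems$total_min <- rgamma(n, shape = 10, rate = 10 / mu)

fit_gamma <- gam(total_min ~ s(response_min) + s(scene_min),
                 family = Gamma(link = "log"),
                 data = ems, method = "REML")

print(summary(fit_gamma))
# gam.check(fit_gamma)  # residual diagnostics; draws plots
```

Swapping in `Gamma(link = "inverse")` or `Gamma(link = "identity")` gives the other links mentioned above; the log link has the practical advantage of keeping fitted means positive.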

There are other possibilities: inverse Gaussian or Tweedie, for example. (Tweedie is not among base R's built-in families, but mgcv provides Tweedie() and tw() for GAMs, and the statmod package provides a tweedie() family for GLMs.)
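
One way to try these alternatives side by side, sketched on simulated data: the data frame `d` and its columns are hypothetical. `inverse.gaussian()` comes with base R and `tw()` with mgcv; `tw()` estimates the Tweedie power parameter as part of the fit.

```r
library(mgcv)

set.seed(5)
n <- 400

# Hypothetical positive, right-skewed times depending smoothly on x
d <- data.frame(x = runif(n, 0, 10))
mu <- exp(2 + 0.3 * sin(d$x))
d$time <- rgamma(n, shape = 8, rate = 8 / mu)

fit_gamma <- gam(time ~ s(x), family = Gamma(link = "log"),
                 data = d, method = "REML")
fit_ig    <- gam(time ~ s(x), family = inverse.gaussian(link = "log"),
                 data = d, method = "REML")
fit_tw    <- gam(time ~ s(x), family = tw(link = "log"),
                 data = d, method = "REML")

# All three models use the same untransformed response, so their AICs
# can be compared with one another
aic_table <- AIC(fit_gamma, fit_ig, fit_tw)
print(aic_table)
```

A lognormal model fitted to log-times is on a different response scale, so its AIC would need a Jacobian adjustment before being compared with these.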

Related Solutions

Solved – Zero inflated Poisson model

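The choice is based on (informed) model comparisons. You are trying to account for over-dispersion and, possibly, "extra" zeros.

Without extra zeros:

Poisson: var(x) ~ mu

Negative binomial: var(x) > mu

With "extra" zeros:

ZIP (zero-inflated Poisson): var(x) ~ mu

ZINB (zero-inflated negative binomial): var(x) > mu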
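
Before fitting anything, the two symptoms in this list can be checked directly on the counts. The sketch below uses simulated, hypothetical data in base R and reads the var(x) statements above as describing the count component, which is an assumption. It reports the variance-to-mean ratio, and the observed share of zeros next to the share a Poisson with the same mean would produce.

```r
set.seed(4)
n  <- 5000
mu <- 3

y_pois <- rpois(n, mu)
y_nb   <- rnbinom(n, mu = mu, size = 1.5)  # variance = mu + mu^2 / size
# Zero-inflated Poisson: a structural zero with probability 0.3
y_zip  <- ifelse(rbinom(n, 1, 0.3) == 1, 0L, rpois(n, mu))

check_counts <- function(y) {
  c(mean          = mean(y),
    var_over_mean = var(y) / mean(y),   # about 1 under a Poisson
    zeros_obs     = mean(y == 0),       # observed share of zeros
    zeros_poisson = exp(-mean(y)))      # share implied by a Poisson
}

checks <- rbind(poisson = check_counts(y_pois),
                negbin  = check_counts(y_nb),
                zip     = check_counts(y_zip))
print(round(checks, 3))
```

Note that extra zeros also push the overall variance above the overall mean, so a large variance-to-mean ratio on its own does not distinguish over-dispersion from zero inflation; comparing the two zero columns helps with that.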

One active package that you can use is pscl (install it with install.packages("pscl")). You can then fit a number of models, such as a hurdle model that uses a negative binomial for the counts and a binomial model for the probability of a zero. This would be written something like:
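```
library(pscl)  # provides hurdle() and zeroinfl()

# Hurdle model: binomial for zero vs. non-zero, negative binomial for positive counts
fit <- hurdle(Admissions ~ Temperature + Humidity, dist = "negbin", data = data)

summary(fit)
```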

Note that the output will have two sets of coefficients: one for the zero-hurdle component and one for the count component. The output also provides an estimate of the theta (dispersion) parameter of the negative binomial.

Or you may want to look at the zero-inflation model:

```
fit1 <- zeroinfl(Admissions ~ Temperature + Humidity, data = data, dist = "negbin", link = "logit")
```

These models can be compared using AIC (also compare them with your Poisson model): AIC(fit, fit1)

Solved – Calculating a risk ratio for specific x values from a GAM model using the mgcv package

This doesn't exactly answer your question, but it might still solve your problem of needing to calculate risk ratios. The epiR package allows you to calculate risk ratios.

I could not get your example to work (see my comment to your question), so here is an example from the package's documentation:

```
library(epiR) # Used for Risk ratio
library(MASS) # Used for data

dat1 <- birthwt; head(dat1)

## Generate a table of cell frequencies. First set the levels of the outcome
## and the exposure so the frequencies in the 2 by 2 table come out in the
## conventional format:
dat1$low <- factor(dat1$low, levels = c(1,0))
dat1$smoke <- factor(dat1$smoke, levels = c(1,0))
dat1$race <- factor(dat1$race, levels = c(1,2,3))
## Generate the 2 by 2 table. Exposure (rows) = smoke. Outcome (columns) = low.
tab1 <- table(dat1$smoke, dat1$low, dnn = c("Smoke", "Low BW"))
print(tab1)
## Compute the incidence risk ratio and other measures of association:
epi.2by2(dat = tab1, method = "cohort.count",
         conf.level = 0.95, units = 100, outcome = "as.columns")
```
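
For reference, the risk ratio itself can be computed by hand from the same 2 by 2 counts. The sketch below uses only MASS for the data; the variable names are made up, and the Wald interval on the log scale is an assumed choice of interval, so it need not match every interval that epiR reports.

```r
library(MASS)  # birthwt data

# Cell counts: exposure = smoke, outcome = low birth weight
exp_pos   <- sum(birthwt$smoke == 1 & birthwt$low == 1)
exp_neg   <- sum(birthwt$smoke == 1 & birthwt$low == 0)
unexp_pos <- sum(birthwt$smoke == 0 & birthwt$low == 1)
unexp_neg <- sum(birthwt$smoke == 0 & birthwt$low == 0)

risk_exp   <- exp_pos / (exp_pos + exp_neg)        # risk among smokers
risk_unexp <- unexp_pos / (unexp_pos + unexp_neg)  # risk among non-smokers
rr <- risk_exp / risk_unexp

# Wald 95% confidence interval computed on the log scale
se_log_rr <- sqrt(1 / exp_pos - 1 / (exp_pos + exp_neg) +
                  1 / unexp_pos - 1 / (unexp_pos + unexp_neg))
ci <- exp(log(rr) + c(-1, 1) * qnorm(0.975) * se_log_rr)

print(c(risk_ratio = rr, lower = ci[1], upper = ci[2]))
```

The same arithmetic applies to risks predicted from a fitted model at two chosen x values, although the standard error would then have to come from the model rather than from cell counts.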
