Predictive Models – Handling Zero Values During Statistical Model Estimation with a Log Link Function

predictive-models, sas

I fitted a generalized additive model (GAM) using SAS PROC GAMPL with a Poisson distribution and a log link function.

The outcome was a measure of repair cost for a large collection of units (4k+). Some of the units have a repair cost of zero, and I expected these to simply drop out of the model estimation because of the log transformation. However, the reported number of observations used does not reflect this.

Are these observations really being used? If so, does anyone know how?

Thanks

Best Answer

If you choose a GAM with a Poisson distribution and log-link, it means that

  • the response is modelled by a Poisson random variable $Y$,
  • the mean $E[Y]$ is related to the nonlinear predictor $\sum f(x_i)$ via the link function, i.e. $$\log\left(E[Y]\right) = x_0 + \sum f(x_i) \Leftrightarrow E[Y] = \exp\left(x_0 + \sum f(x_i) \right)$$

A Poisson random variable can be zero with positive probability, so there is no reason to exclude these observations. In that sense, $y=0$ is no different from $y=1$ or any other value. The log link is applied to the mean $E[Y]$, which is strictly positive, and not to the response variable, so $\log(0)$ is never evaluated.
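To make this concrete, here is a minimal Python sketch (not SAS, and not what PROC GAMPL does internally) of the log-likelihood contribution of a single observation under a Poisson model with log link. The function names and the predictor value `eta = 0.5` are hypothetical, chosen only for illustration. The point is that the logarithm is taken of the mean, never of the observed $y$, so $y=0$ yields a finite contribution like any other count.

```python
import math


def mean_from_predictor(eta: float) -> float:
    """Inverse of the log link: E[Y] = exp(eta), always strictly positive."""
    return math.exp(eta)


def poisson_log_pmf(y: int, mu: float) -> float:
    """Log of the Poisson probability P(Y = y) for a mean mu > 0.

    log P(Y = y) = y * log(mu) - mu - log(y!)
    Only mu is log-transformed; the observed count y is not.
    """
    return y * math.log(mu) - mu - math.lgamma(y + 1)


# Hypothetical value of the predictor x_0 + sum f(x_i) for one observation.
eta = 0.5
mu = mean_from_predictor(eta)

# y = 0 contributes -mu to the log-likelihood, a finite number.
for y in (0, 1, 2):
    print(y, poisson_log_pmf(y, mu))
```

For $y=0$ the formula reduces to $-E[Y]$, which is why zero responses can stay in the estimation.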

Side note: If each observation has a weight $W$ in the sense of an exposure (if you don't specify one, it equals $1$ for all observations), it enters the model log-transformed as an offset: $$E[Y] = W \cdot \exp\left(x_0 + \sum f(x_i) \right) = \exp\left(\log(W) + x_0 + \sum f(x_i) \right).$$ Here, observations with weight zero have to be excluded, because $\log(0)$ is undefined. Such weights are useful when modelling ratios. An example is modelling the claims frequency in insurance, where $Y$ is the number of claims and $W$ the number of insurance contracts.
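The offset identity in the side note can be checked with a short Python sketch. The function name `expected_count` and the numbers used are hypothetical. The sketch only illustrates why a zero weight is a problem while a zero response is not; it makes no claim about how any particular software handles such rows.

```python
import math


def expected_count(weight: float, eta: float) -> float:
    """E[Y] = exp(log(W) + eta), which equals W * exp(eta) for W > 0.

    `weight` plays the role of W (e.g. number of insurance contracts) and
    `eta` the role of x_0 + sum f(x_i). Both names are illustrative.
    """
    if weight <= 0:
        # log(0) is undefined, so such observations cannot carry an offset.
        raise ValueError("weight must be positive to form the offset log(W)")
    offset = math.log(weight)
    return math.exp(offset + eta)


# Hypothetical example: 10 contracts and a predictor value of 0.3.
print(expected_count(10.0, 0.3))
print(10.0 * math.exp(0.3))  # the same value up to rounding, written as W * exp(eta)
```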
