Before looking at QQ plots of the residuals, you should assess the quality of the fit by plotting the residuals against the predictors in the model (and possibly also against other variables you have but did not use). Non-linearity should show up in these plots. If the effect of variable $x$ really is linear, you expect the plot of residuals against $x$ to be "horizontal", without visible structure:

```
*
* *
* *
*
*
--------------------------------------*------------------------------x
*
*
*
* *
*
```

That is, a random horizontal "blob" of points, centered around the line resid=0.

If the effect is non-linear, you expect to see some curvature in this plot. (And please ignore the QQ plots until you have the non-linearities sorted out, using plots like the one above!)
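As a minimal sketch of such a residual-vs-predictor check (the data here are made up purely for illustration), in R:

```
# Hypothetical data: x2 enters the true model non-linearly.
set.seed(1)
d <- data.frame(x1 = runif(100), x2 = runif(100))
d$y <- 1 + 2 * d$x1 + 4 * (d$x2 - 0.5)^2 + rnorm(100, sd = 0.1)

fit <- lm(y ~ x1 + x2, data = d)  # (wrongly) assumes both effects are linear

# Residuals vs each predictor: x1 should give a flat blob,
# x2 should show U-shaped curvature revealing the missed quadratic.
par(mfrow = c(1, 2))
plot(d$x1, resid(fit)); abline(h = 0, lty = 2)
plot(d$x2, resid(fit)); abline(h = 0, lty = 2)
```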

You should also think about possible interactions (usually modelled by product terms), that is, cases where the effect of one variable depends on the level of another. (If all three of your variables have high values at the same time, maybe that indicates a particularly difficult patient? If so, interaction terms could be needed.)
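A small sketch (again with made-up data) of fitting and testing a product-term interaction in R:

```
# Simulate data where the slope of x1 depends on x2.
set.seed(2)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + d$x2 + 2 * d$x1 * d$x2 + rnorm(n)

fit_main <- lm(y ~ x1 + x2, data = d)  # main effects only
fit_int  <- lm(y ~ x1 * x2, data = d)  # shorthand for x1 + x2 + x1:x2

anova(fit_main, fit_int)  # F-test for whether the product term helps
```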

If you go for some non-linear model, do so only after having tried interactions and transformations (did you try `log(Cost)`? did you try some Box-Cox transformations?). Since you have multiple regression, I don't think that `loess` is what you need; you should look at `gam` (generalized additive models; SAS should have that, and in R it is in the package `mgcv`).
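For what it's worth, here is a minimal `mgcv::gam` sketch on simulated data (the variable names and data are illustrative, not from the question):

```
library(mgcv)  # ships with R; gam() fits generalized additive models

set.seed(3)
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- sin(2 * pi * d$x1) + d$x2 + rnorm(n, sd = 0.2)  # x1 non-linear, x2 linear

fit <- gam(y ~ s(x1) + s(x2), data = d)  # s() requests a smooth for each term
summary(fit)           # an edf near 1 for s(x2) suggests a linear effect suffices
plot(fit, pages = 1)   # estimated smooths with confidence bands
```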

Sorry to keep you waiting for over 5 years, but I just came up with a cool little hack that apparently solves this. It is based on iteratively updating a constant offset until convergence. The R code below simulates data from a logistic model with $\operatorname{logit}(\Pr(Y = 1 \mid x_1, x_2)) = -2 + .5 x_1 + .5 x_2$, creating a situation where both regression parameters are equal.

```
set.seed(124)
N <- 500
sim_df <- data.frame(x1 = rnorm(N))
sim_df$x2 <- 1.3 + .5 + sim_df$x1 + rnorm(N)      # x2 correlated with x1
cor(sim_df$x1, sim_df$x2)                         # to keep it interesting
sim_df$z <- with(sim_df, -2 + .5 * x1 + .5 * x2)  # true linear predictor
sim_df$u <- runif(N)
sim_df$y <- with(sim_df, as.numeric(u < plogis(z)))  # Bernoulli draw, inverse logit
table(sim_df$y)
my_model <- glm(y ~ x1 + x2, data = sim_df, family = "binomial")
summary(my_model)
```

Even though the true values of the coefficients are the same, the estimated values are clearly different:

```
Call:
glm(formula = y ~ x1 + x2, family = "binomial", data = sim_df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8887 -0.8124 -0.5640  0.9315  2.5558

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.8947     0.2382  -7.956 1.78e-15 ***
x1            0.3558     0.1557   2.284   0.0224 *
x2            0.5467     0.1109   4.928 8.30e-07 ***
---
```

Let's initialize based on the above model and iterate:

```
beta1 <- coef(my_model)[["x1"]]
prev_beta1 <- 0
while (abs(prev_beta1 - beta1) > .00001) {
  my_model <- glm(y ~ x1 + offset(x2 * beta1), data = sim_df,
                  family = "binomial")
  prev_beta1 <- beta1
  beta1 <- coef(my_model)[["x1"]]
  cat("previous beta1: ", prev_beta1, "new beta1: ", beta1, "\n")
}
```

This is the output:

```
previous beta1: 0.3557504 new beta1: 0.5203537
previous beta1: 0.5203537 new beta1: 0.3778784
previous beta1: 0.3778784 new beta1: 0.5007586
previous beta1: 0.5007586 new beta1: 0.3944498
previous beta1: 0.3944498 new beta1: 0.4861758
previous beta1: 0.4861758 new beta1: 0.406849
previous beta1: 0.406849 new beta1: 0.4753156
```

```
summary(my_model)

Call:
glm(formula = y ~ x1 + offset(x2 * beta1), family = "binomial",
    data = sim_df)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.8289 -0.8224 -0.5777  0.9677  2.4802

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.6990     0.1077  -15.78  < 2e-16 ***
x1            0.4435     0.1239    3.58 0.000343 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 541.75  on 499  degrees of freedom
Residual deviance: 527.63  on 498  degrees of freedom
AIC: 531.63

Number of Fisher Scoring iterations: 4
```

Note that the final model with the offset doesn't list $x_2$ among its coefficients, but you can see below that $x_2$ is still in the model, with the same coefficient value as $x_1$:

```
# compare:
predict(my_model)[1]
# vs
coef(my_model)[1] +
  coef(my_model)[["x1"]] * sim_df[1, "x1"] +
  coef(my_model)[["x1"]] * sim_df[1, "x2"]
```
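As a cross-check of the hack (this comparison is my addition, not part of the original output): the constraint that $x_1$ and $x_2$ share one coefficient can also be imposed directly by regressing on their sum, and the iteration above should land close to that estimate.

```
# Re-simulate the same data, then build the equality constraint into
# the formula: y ~ I(x1 + x2) forces a single shared slope.
set.seed(124)
N <- 500
sim_df <- data.frame(x1 = rnorm(N))
sim_df$x2 <- 1.3 + .5 + sim_df$x1 + rnorm(N)
sim_df$z <- with(sim_df, -2 + .5 * x1 + .5 * x2)
sim_df$u <- runif(N)
sim_df$y <- with(sim_df, as.numeric(u < plogis(z)))

constrained <- glm(y ~ I(x1 + x2), data = sim_df, family = "binomial")
coef(constrained)  # the shared slope; the offset iteration oscillates toward it
```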

## Best Answer

If you choose a GAM with a Poisson distribution and log link, it means the following: a Poisson random variable can be zero with positive probability, i.e. there is no reason to exclude those observations. In that sense, $y = 0$ is not different from $y = 1$ or any other value. There is no log transformation of the response variable.
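A minimal illustration (on simulated data of my own, not the question's) that zero counts are ordinary observations for a log-link Poisson GAM:

```
library(mgcv)

set.seed(4)
n <- 300
d <- data.frame(x = runif(n))
mu <- exp(-1 + sin(2 * pi * d$x))  # log link: E[Y | x] = exp(f(x))
d$y <- rpois(n, mu)                # counts, including plenty of zeros

sum(d$y == 0)  # the zeros stay in the data; nothing is log-transformed

fit <- gam(y ~ s(x), family = poisson(link = "log"), data = d)
summary(fit)
```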

Side note: If there are weights for each observation (if you don't specify them, they are equal to $1$ for all observations), these weights can be absorbed into an offset: $$E[Y] = W \cdot \exp\left(x_0 + \sum f(x_i) \right) = \exp\left(\log(W) + x_0 + \sum f(x_i) \right). $$ Here, observations with weight zero have to be excluded. Weights can be used when modelling ratios. An example is modelling the claims frequency in insurance, where $Y$ would be the number of claims and $W$ the number of insurance contracts.
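A sketch of this weights-as-offset identity in R (the insurance-style data are simulated purely for illustration):

```
# Exposure W enters the Poisson model as the offset log(W),
# so the model effectively targets the rate Y / W.
set.seed(5)
n <- 200
d <- data.frame(x = runif(n), W = sample(1:50, n, replace = TRUE))
d$y <- rpois(n, lambda = d$W * exp(-2 + d$x))  # E[Y] = W * exp(-2 + x)

fit <- glm(y ~ x + offset(log(W)), family = poisson, data = d)
coef(fit)  # should be close to the true values (-2, 1)
```

Observations with `W == 0` would make `log(W)` infinite, which is why they have to be excluded.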