Solved – What’s the substitute for MSE in GAMs

binary data, generalized-additive-model, logistic, mgcv, mse

In the case of linear regression, the mean squared error (MSE) is defined as:

$$
\frac{1}{n-k-1}\sum_{i=1}^{n}{(y_i-\hat{y}_i)^2},
$$

where $n$ is the number of observations, $k$ is the number of independent variables, $y_i$ and $\hat{y}_i$ are the $i$th observed value of the response variable and its prediction, respectively.
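As a quick numerical check of this definition, here is a minimal sketch in R on simulated data. The variable names (`x1`, `x2`, `mse_manual`) are illustrative assumptions, not part of the original question.

```r
# Simulate a small linear regression problem with k = 2 predictors
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)
k   <- length(coef(fit)) - 1   # number of independent variables (intercept excluded)

# MSE as defined above: residual sum of squares divided by n - k - 1
mse_manual <- sum((y - fitted(fit))^2) / (n - k - 1)

# The squared residual standard error reported by R is the same quantity
mse_lm <- sigma(fit)^2

all.equal(mse_manual, mse_lm)
```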

Now, what is the equivalent of MSE in a generalized additive model (GAM), especially for a binary GAM (that is, one with a binary response variable)?

Edit: Please provide the corresponding option of your answer in package mgcv.

Best Answer

Typically, either the model deviance or the Pearson statistic is substituted for the residual sum of squares (RSS). Because the Pearson statistic can lead to heavy under-smoothing, deviance-based variants are preferred (Wood, 2017, p. 261).

The unbiased risk estimator (UBRE) is one way to estimate MSE in GAMs with a known scale parameter $\phi$. For a binary GAM, the UBRE is (using the notation of Wood, 2017)

$$\mathcal{V}_a(\boldsymbol{\lambda}) = D(\hat{\beta}) + 2 \gamma \phi \tau$$

where $D(\hat{\beta})$ is the model deviance at the parameter estimates, $\tau$ is the effective degrees of freedom of the model, and $\boldsymbol{\lambda}$ is the vector of smoothing parameters. For the binomial distribution, $\phi \equiv 1$. The multiplier $\gamma$ is usually 1 but can be increased to put an additional penalty on degrees of freedom (1.4 is commonly used in the smoothing literature, and 1.5 has special justification from the viewpoint of double cross validation). Hence $\mathcal{V}_a(\boldsymbol{\lambda})$ is the UBRE at the current values of the smoothing parameters.

The $D(\hat{\beta})$ component replaces the $\sum_{i=1}^n(y_i - \hat{y}_i)^2$ part of your equation.
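To make the pieces of $\mathcal{V}_a$ concrete, here is a minimal sketch that computes it by hand for a binary GAM fitted with mgcv. The simulated data and the names `m_bin`, `tau`, `gamma_pen` and `V_a` are assumptions for illustration. I am also assuming that the sum of the `edf` component of the fitted object is the model's total effective degrees of freedom $\tau$.

```r
library(mgcv)

# Simulated binary response with a smooth effect of x (illustrative data)
set.seed(42)
n <- 400
x <- runif(n)
y <- rbinom(n, 1, plogis(2 * sin(2 * pi * x)))

# Binomial family: the scale parameter is known, so GCV.Cp uses a UBRE-type score
m_bin <- gam(y ~ s(x), family = binomial, method = "GCV.Cp")

D         <- deviance(m_bin)   # D(beta-hat), the deviance at the estimates
tau       <- sum(m_bin$edf)    # effective degrees of freedom (assumed to equal tau)
gamma_pen <- 1                 # gamma, the extra penalty multiplier
phi       <- 1                 # known scale parameter for the binomial

V_a <- D + 2 * gamma_pen * phi * tau
V_a
```

The score that mgcv itself minimised is stored in `m_bin$gcv.ubre`. It may be reported on a different scale from $\mathcal{V}_a$ as written here, so the two numbers need not be identical.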

If $\phi$ is unknown and has to be estimated, then generalised cross validation (GCV) can be used in place of UBRE. The corresponding GCV score is defined as:

$$\mathcal{V}_g(\boldsymbol{\lambda}) = nD(\hat{\beta}) / (n - \gamma \tau)^2$$

where $n$ is the number of observations.
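The GCV score can be computed by hand in the same way. This is a minimal sketch for a Gaussian GAM, where $\phi$ is estimated; the data and the names `m_gau`, `tau`, `gamma_pen` and `V_g` are illustrative assumptions.

```r
library(mgcv)

# Simulated continuous response with a smooth effect of x (illustrative data)
set.seed(42)
n <- 400
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)

# Gaussian family: the scale parameter is estimated, so GCV.Cp uses GCV
m_gau <- gam(y ~ s(x), family = gaussian, method = "GCV.Cp")

D         <- deviance(m_gau)   # residual sum of squares for the Gaussian family
tau       <- sum(m_gau$edf)    # effective degrees of freedom (assumed to equal tau)
gamma_pen <- 1                 # gamma, the extra penalty multiplier

V_g <- n * D / (n - gamma_pen * tau)^2

# Hand-computed score next to the score stored by mgcv
c(manual = V_g, mgcv = as.numeric(m_gau$gcv.ubre))
```

If the assumptions above hold, the two printed values should be close; treat any agreement as a sanity check rather than a guarantee.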

If using mgcv in R, then the above are automatically used for smoothness selection via the default option method = "GCV.Cp". This criterion automatically selects between UBRE and GCV depending on whether the family implies a known, fixed value of $\phi$ or whether this is estimated from the data.

Alternative approaches are available:

method = "GACV.Cp" uses a related measure, generalised approximate cross validation (GACV).
method = "ML" and method = "REML" treat the smoothing problem as a mixed-effects problem: the smooths are converted into fixed and random effect terms, and the model is estimated and the smoothness selected by maximising the likelihood or restricted likelihood.
Others; see ?gam and the method argument.
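As a rough illustration of how the choice of method can change the selected smoothness, the following sketch fits the same binary GAM under several of the criteria listed above and compares the total effective degrees of freedom. The simulated data and the names `methods`, `fits` and `edf_by_method` are assumptions for illustration only.

```r
library(mgcv)

# Simulated binary response with a smooth effect of x (illustrative data)
set.seed(42)
n <- 400
x <- runif(n)
y <- rbinom(n, 1, plogis(2 * sin(2 * pi * x)))

# Fit the same model under each smoothness selection criterion
methods <- c("GCV.Cp", "GACV.Cp", "ML", "REML")
fits <- lapply(methods, function(m) {
  gam(y ~ s(x), family = binomial, method = m)
})
names(fits) <- methods

# Total effective degrees of freedom chosen by each criterion
edf_by_method <- sapply(fits, function(f) sum(f$edf))
edf_by_method
```

Differences in the effective degrees of freedom reflect how strongly each criterion penalises wiggliness on this particular data set; they are not a general ranking of the methods.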

Wood, S. N. (2017) Generalized Additive Models: An Introduction with R. Second Edition. (Chapman and Hall/CRC).

Related Solutions

Solved – Are LOESS and GAM with one covariate the same

Not really a full answer, but too long for a comment: s sets up a spline, whereas loess does a local regression.

In the gam package (maybe mgcv too, not too familiar with that one) you can also feed a local regression, as in

library(gam)

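set.seed(1234)

# generate data: one noise draw per observation
x <- sort(runif(100))
y <- sin(2*pi*x) + rnorm(100, sd=0.1)

# local regression term inside gam versus base R loess
gam.1 <- gam(y ~ lo(x))
base.r <- loess(y ~ x)

# compare the fitted values of the two fits
summary(base.r$fitted - gam.1$fitted)
plot(base.r$fitted,gam.1$fitted)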

That does not produce the same fitted values either, but maybe you can further play around with the settings of lo and loess.

Solved – Calculating a risk ratio for specific x values from a GAM model using the mgcv package

This doesn't exactly answer your question, but it might still solve your problem of needing to calculate risk ratios. The epiR package allows you to calculate risk ratios.

I could not get your example to work (see my comment to your question), so here is an example from the package's documentation:

library(epiR) # Used for Risk ratio
library(MASS) # Used for data

dat1 <- birthwt; head(dat1)

## Generate a table of cell frequencies. First set the levels of the outcome
## and the exposure so the frequencies in the 2 by 2 table come out in the
## conventional format:
dat1$low <- factor(dat1$low, levels = c(1,0))
dat1$smoke <- factor(dat1$smoke, levels = c(1,0))
dat1$race <- factor(dat1$race, levels = c(1,2,3))
## Generate the 2 by 2 table. Exposure (rows) = smoke. Outcome (columns) = low.
tab1 <- table(dat1$smoke, dat1$low, dnn = c("Smoke", "Low BW"))
print(tab1)
## Compute the incidence risk ratio and other measures of association:
epi.2by2(dat = tab1, method = "cohort.count", 
conf.level = 0.95, units = 100, outcome = "as.columns")
