Generalized Additive Model – GAM with Proportion Independent Variable

data transformationgeneralized-additive-modelmgcvr

I'm using GAM to model a potentially non-linear relationship between two variables (with some controls variables included) in mgcv. My main independent variable is a proportion (without 0's or 1's) that is highly skewed (see image 1 histogram).

Is there justification/reasoning to make an IV normal in GAM?
Should I transform this variable? I tried to sqrt(IV) (see image 2) and log10((IV)/(1-IV)) (see image 3). Functional form of results are different but not drastically depending on which (or none) transformation is used.

r code:

gam(DV ~ s(IV) + CV1 + CV2 +
                   s(CASEID, bs="re"), data = df, method = "REML")

Histograms of IV:

untransformed IV

image 1 - untransformed IV

sqrt(IV)

image 2 - sqrt IV

log10(IV)/(1-IV)

image 3 - log10(IV)/(1-IV)

Best Answer

GAMs make no assumptions at all about the independent (predictor) variables (except, for some types of inference, that they are measured without error); certainly there are no assumptions about the distribution of the predictors, which are taken as observed values rather than as random variables.

Although the flexibility of GAMs can in principle handle most of the the issues that transforming the data would alleviate in a 'vanilla' linear model (i.e. nonlinearity in the response), I can imagine that using a transformation to stretch out part of the predictor space could improve the performance of the model under some circumstances.

you should have a clear idea what your metric is for model improvement (you say the result is 'cleaner'; do you have an objective way of quantifying that, or are you OK with using subjective assessments to guide your modeling?).
if you are worried about overfitting, be aware that trying a bunch of different transformations could be a form of 'data snooping'

Related Solutions

Solved – Simulating Responses from fitted Generalized Additive Model

I'll illustrate with the classic 4 term data set oft used to illustrate GAMs, but will only simulate data from the strongly nonlinear term $f(x_2)$ as it is easy to visualise the process with a single covariate.

library('mgcv')
set.seed(20)

f2 <- function(x) 0.2 * x^11 * (10 * (1 - x))^6 + 10 * (10 * x)^3 * (1 - x)^10
ysim <- function(n = 500, scale = 2) {
  x <- runif(n)
  e <- rnorm(n, 0, scale)
  f <- f2(x)
  y <- f + e
  data.frame(y = y, x2 = x, f2 = f)
}

df <- ysim()
head(df)

Fit the model:

m <- gam(y ~ s(x2), data = df)

Now, simulate 50 new data points from the posterior distribution of the response conditional upon the estimated spline coefficients and smoothness parameters.

set.seed(10)
nsim <- 50
drawx <- function(x, n) runif(n, min = min(x), max = max(x))
newx <- data.frame(x2 = drawx(df[, 'x2'], n = nsim))

Here I'm drawing 50 values uniformly at random over the range of $x_2$.

The simulated values are given by

$$y \sim \mathcal{N}(\hat{y}, \hat{\sigma}^2) $$

where $\hat{y}$ is

$$\hat{y} = \alpha + s(x_2)$$

and $\hat{\sigma}$ is the residual standard error, which is generally estimated from the residuals of a Gaussian model rather than modelled directly (although the latter is certainly possible). We get $\hat{y}$ via predict().

First grab the estimated residual variance from the model

sig2 <- m$sig2

Next predict for the 50 new locations

set.seed(10)
newx <- transform(newx, newy = predict(m, newx, type = "response"))

and then, using these values of $\hat{y}$ in newy, simulate 50 new values from a Gaussian with mean given by newy and standard deviation (sqrt of sig2)

newx <- transform(newx, ysim = rnorm(nsim, mean = newy, sd = sqrt(sig2)))

Notice that each simulated value is a random draw from a Gaussian distribution with mean value equal to the model estimated value and variance $\hat{\sigma}^2$.

We should check what we've done. First, generate more values from the fitted spline so we can see what model was estimated

pred <- with(df, data.frame(x2 = seq(min(x2), max(x2), length = 500)))
pred <- transform(pred, fitted = predict(m, newdata = pred,
                                         type = "response"))

Now we can put this all together

library('ggplot2')
theme_set(theme_bw())

ggplot(df) +
  geom_point(aes(x = x2, y = y), colour = "grey") +
  geom_line(aes(x = x2, y = f2), colour = "forestgreen", size = 1.3) +
  geom_line(aes(x = x2, y = fitted), data = pred, colour = "blue") +
  geom_point(aes(x = x2, y = newy), data = newx, colour = "blue", size = 2) +
  geom_point(aes(x = x2, y = ysim), data = newx, colour = "red",
             size = 2)

The thick green line is the true function $f(x_2)$. The grey dots are the original data to which the GAM was fitted. The blue line is the fitted smooth, the estimate $\hat{f}(x_2)$. The blue points are the fitted values for the 50 locations at which we simulated from the model. The red points are the random draws from the posterior distribution of $y$ conditional upon the estimated model.

This approach extends to multiple covariates — it's just not so easy to visualize what's going on. This also extends to non-Gaussian responses, such as Poisson or binomial-distributed response.

The key points are:

Predict from the model using predict(), to get the expectation of the response
- (this is the mean in the case of the Gaussian, but is the expected count say for the Poisson)
Predict or extract values for any additional parameters needed for the conditional distribution of the response
- (for the Gaussian, as shown here, there is a second parameter $\sigma^2$, which is needed to fully specify the conditional distribution of the response. For the Poisson, for example, there is only the "mean", the expected count.)
Simulate from the conditional distribution of the response
- (here we could use rnorm() directly because the parameters of the distribution estimated via the GLM/GAM were in exactly the same form as that expected by rnorm(). This is also the case for the Poisson and Binomial distributions. For other distributions this may not be case. For example the parameters of a GLM/GAM with a Gamma family fitted via glm() and gam() require some translation before you can plug them into the rgamma() function.

Best Answer

Related Solutions

Solved – Simulating Responses from fitted Generalized Additive Model

Related Question