Generalized Linear Model – How Simulated DHARMa Residuals Are Created

generalized linear modelresidualssimulation

I am currently working with the DHARMa package and I was looking at the explanations of how the residuals are simulated (https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html):

  1. Simulate new response data from the fitted model for each observation.

  2. For each observation, calculate the empirical cumulative density function for the simulated observations, which describes the possible values (and their probability) at the predictor combination of the observed value, assuming the fitted model is correct.

  3. The residual is then defined as the value of the empirical density function at the value of the observed data, so a residual of 0 means that all simulated values are larger than the observed value, and a residual of 0.5 means half of the simulated values are larger than the observed value.

However, I do not fully understand this. For example, how can one simulate new response data from a fitted model? Does it assume the errors are distributed according to the family specified in the glm and create new response data from this? Does it also use the data the model was fitted on? Then in point 2, what do they mean with each observation? Do they mean for each observation in the dataset each new response variable and their explanatory variables? Is the cdf created using the error distribution?

Hopefully someone can explain a bit more elaborately to me how these residuals are created!

Best Answer

One way to try to understand this is to try to make a simple example. But first, let us clear up your doubt in

For example, how can one simulate new response data from a fitted model? Does it assume the errors are distributed according to the family specified in the glm and create new response data from this?

One does exactly as is done in parametric bootstrapping, assuming that the fitted model is the true one, replacing the unknown parameters with the fitted values. So, with a normal (Gaussian) family, we assume normal errors, with a Poisson family we simulate response values from a Poisson distribution, and so on.

A simple example, with code in R. Let us first simulate some data:

set.seed(7*11*13) ### My public seed

x <- rep(1:10, 2)
y <- 1  +  0.5 * x  +  rnorm(x, 0, 2)
mod <- lm(y  ~ x)
intercept <- coef(mod)[1]; slope <- coef(mod)[2]
sigma <- summary(mod)$sigma
### Then we can simulate responses,  as in 
#### parametric bootstrapping:
ystar <- intercept  +  slope*x  +  rnorm(x, 0, sigma)
### Then this would have to be repeated many times!

To repeat, this is really parametric bootstrapping!

Then we can program the tree steps of the algorithm you reference as a simple R function, which will work for linear models (lm) and glm (generalized linear models) objects. The DHARMa package must be more complicated, as it works for many other model types. Making a minimal code can be a good way to understand an algorithm, as usual working code as that in the DHARMa package might be difficult to read, as such code is often dominated by error checks, guards and strange border cases. No such distractions below:

sim_res <- function( model,  B, seed) {
    ystars <- simulate(model, nsim=B, seed=seed)
    # step 1
    y <-  model.response(model.frame(model))
    N <- NROW(ystars)
     my_ecdf <- function(data) {
        function(y) sum( data <= y)/length(data)
     } ### using systen ecdf gives error I do not understand  
    simres <- rep(as.numeric(NA), N)
    for (i in seq_along(simres)) {
        efun <- my_ecdf( ystars[i, ] ) # step 2
        simres[i] <- efun( y[i] )      # step 3
    }
    return( simres )
}     

Hopefully, this code should clarify your doubts:

Then in point 2, what do they mean with each observation?

The loop in the code goes through the dataset, for each observation $y_i, i=1, 2, \dotsc, N$ it takes the $B$ simulated (bootstrapped) responses, stored in row $i$ in object ystars, calculates its ecdf (empirical cumulative distribution function), and then applies this function to the actual observation $y_i$.

Is the cdf created using the error distribution?

The cdf is created using the simulated responses.