Solved – lmer predict with random effects & log transformation

lme4-nlmemixed modelr

I am currently trying to build a mixed effect model using the lme4 package with the in-built lmer() function.

My goal is to predict the number of units sold in a specific site on a specific date based on independent variables such as the retail price or promotions.

Note: I transform the number of units sold in each day into log(units).

At the moment I am reviewing the forecasts of my model looking at the plots. Looking at the generated plots I observed 2 key things:

(1) I had the impression that the sigma of the model is not the correct one.

(2) I had the impression that the model only takes fixed effects into consideration and neglects the random effects which I specify.

Here are the key lines in the code which I assume to be relevant for my questions:

d_data$units_as_log <- log(d_data$units+0.001) # Log Transformation of Units
#Specifying the model formula (very basic model)
model_vec <- c("units_as_log ~ retail_price + (1|article) + (1|site)")   
model <- lmer(model_vec[ii], 
    data=d_data[d_data$date <= endofhistory & d_data$AVAIL == 1,], REML=FALSE)
    # <= endofhistory describes the model interval
# > endofisthistory describes the holdout period, with re.form=NULL i try to include random effects
fcst <- exp(predict(model, newdata=subdata[subdata$date > endofhistory,], 
  re.form = NULL)+summary(model)$sigma^2/2) 
runMSE[ii] <- mean((fcst-actuals)^2, na.rm=TRUE)

Question #1: Is the sigma of summary(model) the correct sigma or do I need to calculate it myself somehow?

I previously read that the sigma function of lme4 was moved to the stats package.

Question #2: Is the transformation from log_units to units with + summary(model)$sigma^2/2 correct?

I discovered several posts on StackExchange which gave different answers, hence my confusion here.

Question #3: Do I already include my specified random effects? Do I need to calculate the results myself by building custom matrices?

I previously discovered posts saying lme4 has no predict() function (state of 2014). Nevertheless, I identified that there is a predict() function in the stats package which can deal with lmer objects, including random effects with re.form = NULL.

Best Answer

It would help a lot to have a reproducible example, but I'll try.

Question #1: Is the sigma of summary(model) the correct sigma or do I need to calculate it myself somehow?

It's correct: sigma(model) and summary(model)$sigma are identical, the former is a little more compact but not different. (Why do you think it's incorrect?)

I previously read that the sigma function of lme4 was moved to the stats package.

This just means that sigma() was established as a generic function in base R, so the generic function is no longer defined in the lme4 package. But sigma(fitted_model) still works the same.

Question #2: Is the transformation from log_units to units with + summary(model)$sigma^2/2 correct? (I discovered several posts on StackExchange which gave different answers, hence my confusion here.)

What are you trying to do? To reverse your transformation y=log(x+0.001) you need x=exp(y)-0.001. The reason that people sometimes add in sigma^2/2 is because analyzing on the log-transformed scale will give a biased estimate of the (arithmetic) mean on the original scale: see e.g. here. My opinion is that this isn't necessarily worth doing unless you're specifically interested in predicting the mean (rather than, say, the median) on the original scale, but it's worth keeping in mind.

Question #3: Do I already include my specified random effects? Do I need to calculate the results myself by building custom matrices?

You shouldn't need to. When in doubt, the documentation of the current version of the package you have installed (?predict.merMod) gives the authoritative answer. Do you have a reason to believe that re.form=NULL is not doing what you want? (Hint: if you have an overly complex model, or too few levels in your grouping variables, you may have estimated random effects variances of zero, in which case it will look like your random effects aren't doing anything ... check VarCorr(fitted_model), and note that variances are sometimes estimated as being very small (e.g. 1e-06) rather than exactly zero ...)

Note that this question might be more appropriate for r-sig-mixed-models@r-project.org than CrossValidated (go to the info page to subscribe)

Related Solutions

Solved – How to estimate variance components with lmer for models with random effects and compare them with lme results

One common way to determine the relative contribution of each factor to a model is to remove the factor and compare the relative likelihood with something like a chi-squared test:

pchisq(logLik(model1) - logLik(model2), 1)

As the way that likelihoods are calculated between functions may be slightly different, I typically will only compare them between the same method.

Solved – lmer with multiply imputed data

You can do this somewhat by hand if by taking advantage of the lapply functionality in R and the list-structure returned by the Amelia multiple imputation package. Here's a quick example script.

library(Amelia)
library(lme4)
library(merTools)
library(plyr) # for collapsing estimates

Amelia is similar to mice so you can just substitute your variables in from the mice call here -- this example is from a project I was working on.

 a.out <- amelia(dat[sub1, varIndex], idvars = "SCH_ID", 
            noms = varIndex[!varIndex %in% c("SCH_ID", "math12")], 
            m = 10)

a.out is the imputation object, now we need to run the model on each imputed dataset. To do this, we use the lapply function in R to repeat a function over list elements. This function applies the function -- which is the model specification -- to each dataset (d) in the list and returns the results in a list of models.

 mods <- lapply(a.out$imputations,
           function(d) lmer((log(wage) ~ gender + age + age_sqr + 
            occupation + degree + private_sector + overtime + 
             (1+gender|faculty), data = d)

Now we create a data.frame from that list, by simulating the values the fixed and random effects using the functions FEsim and REsim from the merTools package

imputeFEs <- ldply(mods, FEsim, nsims = 1000)
imputeREs <- ldply(mods, REsim, nsims = 1000)

The data.frames above include separate estimates for each dataset, now we need to combine them using a collapse like argument collapse

imputeREs <- ddply(imputeREs, .(X1, X2), summarize, mean = mean(mean), 
               median = mean(median), sd = mean(sd), 
               level = level[1])

imputeFEs <- ddply(imputeFEs, .(var), summarize, meanEff = mean(meanEff), 
               medEff = mean(medEff), sdEff = mean(sdEff))

Now we can also extract some statistics on the variance/covariance for the random effects across the imputed values. Here I have written two simple extractor functions to do this.

REsdExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "stddev"))
  return(out)
}

REcorrExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "corre"))
  return(min(unique(out)))
}

And now we can apply them to the models and store them as a vector:

modStats <- cbind(ldply(mods, REsdExtract), ldply(mods, REcorrExtract))

Update

The functions below will get you much closer to the output provided by arm::display by operating on the list of lmer or glmer objects. Hopefully this will be incorporated into the merTools package in the near future:

# Functions to extract standard deviation of random effects from model
REsdExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "stddev"))
  return(out)
}

#slope intercept correlation from model
REcorrExtract <- function(model){
  out <- unlist(lapply(VarCorr(model), attr, "corre"))
  return(min(unique(out)))
}

modelRandEffStats <- function(modList){
  SDs <- ldply(modList, REsdExtract)
  corrs <- ldply(modList, REcorrExtract)
  tmp <- cbind(SDs, corrs)
  names(tmp) <- c("Imp", "Int", "Slope", "id", "Corr")
  out <- data.frame(IntSD_mean = mean(tmp$Int), 
                        SlopeSD_mean = mean(tmp$Slope), 
                    Corr_mean = mean(tmp$Corr), 
                        IntSD_sd = sd(tmp$Int),
                    SlopeSD_sd = sd(tmp$Slope), 
                        Corr_sd = sd(tmp$Corr))
  return(out)
}

modelFixedEff <- function(modList){
  require(broom)
  fixEst <- ldply(modList, tidy, effects = "fixed")
  # Collapse
  out <- ddply(fixEst, .(term), summarize,
               estimate = mean(estimate), 
               std.error = mean(std.error))
  out$statistic <- out$estimate / out$std.error
  return(out)
}

print.merModList <- function(modList, digits = 3){
  len <- length(modList)
  form <- modList[[1]]@call
  print(form)
  cat("\nFixed Effects:\n")
  fedat <- modelFixedEff(modList)
  dimnames(fedat)[[1]] <- fedat$term
  pfround(fedat[-1, -1], digits)
  cat("\nError Terms Random Effect Std. Devs\n")
  cat("and covariances:\n")
  cat("\n")
  ngrps <- length(VarCorr(modmathG[[1]]))
  errorList <- vector(mode = 'list', length = ngrps)
  corrList <- vector(mode = 'list', length = ngrps)
  for(i in 1:ngrps){
    subList <- lapply(modList, function(x) VarCorr(x)[[i]])
    subList <- apply(simplify2array(subList), 1:2, mean)
    errorList[[i]] <- subList
    subList <- lapply(modList, function(x) attr(VarCorr(x)[[i]], "corre"))
    subList <- min(unique(apply(simplify2array(subList), 1:2, function(x) mean(x))))
    corrList[[i]] <- subList
  }
  errorList <- lapply(errorList, function(x) {
    diag(x) <- sqrt(diag(x))
    return(x)
    })

  lapply(errorList, pfround, digits)
  cat("\nError Term Correlations:\n")
  lapply(corrList, pfround, digits)
  residError <- mean(unlist(lapply(modList, function(x) attr(VarCorr(x), "sc"))))
  cat("\nResidual Error =", fround(residError,
                                             digits), "\n")
  cat("\n---Groups\n")
  ngrps <- lapply(modList[[1]]@flist, function(x) length(levels(x)))
  modn <- getME(modList[[1]], "devcomp")$dims["n"]
  cat(sprintf("number of obs: %d, groups: ", modn))
  cat(paste(paste(names(ngrps), ngrps, sep = ", "),
            collapse = "; "))
  cat("\n")
  cat("\nModel Fit Stats")
  mAIC <- mean(unlist(lapply(modList, AIC)))
  cat(sprintf("\nAIC = %g", round(mAIC, 1)))
  moDsigma.hat <- mean(unlist(lapply(modmathG, sigma)))
  cat("\nOverdispersion parameter =", fround(moDsigma.hat,
                                             digits), "\n")
}

Best Answer

Related Solutions

Solved – How to estimate variance components with lmer for models with random effects and compare them with lme results

Solved – lmer with multiply imputed data

Related Question