Your questions are a little bit "big", so I'll start with some general comments and tips.
Some background reading and useful packages
You should probably take a look at some of the tutorial introductions to using mixed models. I would recommend starting with Baayen et al. (2008) -- Baayen is the author of languageR. Barr et al. (2013) discuss some issues with random effects structure, and Ben Bolker is one of the current developers of lme4. Baayen et al. is especially good for your questions because they spend a lot of time discussing the use of bootstrapping / permutation tests (the stuff behind mcp.fnc).
Literature
- Baayen, R. H., D. Davidson, and D. Bates (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59, 390–412.
- Barr, Dale J., R. Levy, C. Scheepers, and H. J. Tily (2013). Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language, 68, 255–278.
- Bolker, Benjamin M., M. E. Brooks, C. J. Clark, S. W. Geange, J. R. Poulsen, M. H. H. Stevens, and J.-S. S. White (2009). Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127–135.
Florian Jaeger also has a bunch of stuff on the practical side of mixed models on his lab's blog.
There are also a number of useful R packages for visualizing and testing mixed models, like lmerTest and effects. The effects package is especially nice because it lets you plot the linear regression fits and confidence intervals going on behind the scenes.
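As a minimal sketch (the model, data, and variable names -- rt, frequency, subject, item, dat -- are all hypothetical), plotting those fitted effects and confidence intervals might look like this:

```r
library(lme4)
library(effects)

# Hypothetical model: reaction time predicted by word frequency,
# with crossed random intercepts for subjects and items.
m <- lmer(rt ~ frequency + (1 | subject) + (1 | item), data = dat)

# One panel per fixed effect, showing the fitted regression line
# and its confidence band.
plot(allEffects(m))
```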
Goodness of fit and comparing models
$p$-values are a really lousy way to compare mixed models (and are generally not a great method for anything). There is a reason why Doug Bates (the original author of lme4) does not feel it is necessary to include them. If you really want to go that way (and I beg you not to), the aforementioned lmerTest provides some additional facilities for calculating and dealing with them.
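As a rough sketch (reusing the hypothetical model and data names from above): lmerTest masks lme4::lmer, and its summary() then reports denominator degrees of freedom and p-values (Satterthwaite's approximation by default).

```r
library(lmerTest)  # loads lme4 and masks lmer()

# Same hypothetical model as before; the fitted object now carries the
# information needed for approximate df and p-values.
m_p <- lmer(rt ~ frequency + (1 | subject) + (1 | item), data = dat)
summary(m_p)  # t-values now come with estimated df and p-values
```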
Better ways of comparing models are the log-likelihood or the various information-theoretic criteria like AIC and BIC. For AIC and BIC, the general rule is "smaller is better", but you have to be careful because they both penalize for more model degrees of freedom and there is no "absolute" good or bad value. To get a nice summary of AIC and log-likelihood for several models, you can use the anova() function, which has been overloaded to accept mer objects. Please note that the $\chi^2$ values given there are only valid for comparisons between nested models; nonetheless, the nicely tabular output is convenient even for other comparisons. For nested models, you get a proper $\chi^2$ test and don't need $p$-values to directly compare two models. The downside is that it's not immediately clear just how good your fit is in absolute terms.
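A minimal sketch of such a comparison, again with hypothetical data and variable names:

```r
library(lme4)

# Fit a null model and a richer nested model with ML (REML = FALSE),
# so the likelihoods are directly comparable.
m0 <- lmer(rt ~ 1         + (1 | subject) + (1 | item), data = dat, REML = FALSE)
m1 <- lmer(rt ~ frequency + (1 | subject) + (1 | item), data = dat, REML = FALSE)

anova(m0, m1)  # AIC, BIC, log-likelihood, and a chi-squared test (nested models only)
AIC(m0, m1)    # smaller is better, other things being equal
```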
For looking at the individual effects (i.e. the stuff you would see in a traditional ANOVA), you should look at the $t$-values for the fixed effects in the models (which are part of the summary(), if I'm not mistaken). Generally, anything with $|t| > 2$ is considered good (more details in Baayen et al.). You can also access the fixed effects directly with the helper function fixef().
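For instance, continuing the hypothetical model from above:

```r
summary(m1)        # fixed-effect estimates, standard errors, and t-values
fixef(m1)          # just the fixed-effect coefficients
coef(summary(m1))  # the fixed-effects table as a matrix, if you want the t column
```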
You should also make sure that none of your fixed effects are too strongly correlated -- multicollinearity violates model assumptions. Florian Jaeger has written a bit on this, as well as a few possible solutions.
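One quick, informal check (with hypothetical predictor names) is to look at the pairwise correlations of the predictors before fitting:

```r
# Build the fixed-effects design matrix and inspect pairwise correlations;
# values near +/-1 signal problematic collinearity.
X <- model.matrix(~ frequency + length + familiarity, data = dat)
cor(X[, -1])  # drop the intercept column
```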
Multiple comparisons
I'll address your question #4 a bit more directly. The answer is basically the same as good practice with traditional ANOVA; unfortunately, this seems to be a spot where there is a great deal of uncertainty for most researchers. (It's the same as traditional ANOVA because both linear mixed-effects models and ANOVA are based on the general linear model, it's just that mixed models have an extra term for the random effects.) If you're assuming that the time windows make a difference and want to compare the effects of time, you should include time in your model. This, incidentally, will also provide a convenient way for you to judge whether time made a difference: is there a main (fixed) effect for time? The downside to going this route is that your model will get a LOT more complex, and the single "super" model with time as a parameter will probably take longer to compute than three smaller models without time as a parameter. Indeed, the classic tutorial example for mixed models is sleepstudy, which uses time as a parameter.
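The sleepstudy model that ships with lme4 looks like this, with time (Days) entering as both a fixed effect and a by-subject random slope:

```r
library(lme4)

# Reaction time as a function of days of sleep deprivation,
# with a random intercept and slope for each subject.
fm <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fm)  # is there a main (fixed) effect of Days?
```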
I couldn't quite tell whether the different characteristics are meant to be predictors or dependent variables. If they're meant to be predictors, you could throw them all into one model and see which one has the largest $t$-value, but this model would be incredibly complex, even if you don't allow interactions, and take a long time to compute. The faster, and potentially easier, way would be to compute a different model for each predictor. I would recommend using a foreach loop to parallelize this across as many cores as you have -- lme4 doesn't yet do its matrix operations in parallel, so you can compute a different model on each core. (See the short introduction here.) Then, you can compare the AIC and BIC of each model to find which predictor is best. (They're not nested in this case, so you can't use the $\chi^2$ statistic.)
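A rough sketch of that approach (data, predictor, and variable names are all hypothetical):

```r
library(lme4)
library(foreach)
library(doParallel)

registerDoParallel(cores = parallel::detectCores())

predictors <- c("frequency", "length", "familiarity")

# Fit one model per predictor, each on its own core.
fits <- foreach(p = predictors, .packages = "lme4") %dopar% {
  f <- reformulate(c(p, "(1 | subject)", "(1 | item)"), response = "rt")
  lmer(f, data = dat, REML = FALSE)
}

names(fits) <- predictors
sapply(fits, AIC)  # smaller is better; the models are not nested, so no chi-squared test
```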
If your characteristics are the dependent variable, then you will have to compute different models anyway, and then you can use the AIC and BIC to compare the results.
I would probably go with your original model, fitted to your full dataset. I generally think of these things as facilitating sensitivity analyses. That is, they point you towards what to check to ensure that you don't have a given result only because of something stupid. In your case, you have some potentially influential points, but if you rerun the model without them, you get substantively the same answer (at least with respect to the aspects that you presumably care about). In other words, use whichever threshold you like -- you are only refitting the model as a check, not as the 'true' version. If you think that other people will be sufficiently concerned about the potential outliers, you could report both model fits. What you would say is along the lines of:
Here are my results. One might be concerned that this picture only emerges due to a couple of unusual, but highly influential, observations. These are the results of the same model, but without those observations. There are no substantive differences.
It is also possible to remove them and use the second model as your primary result. After all, staying with the original dataset amounts to an assumption about which data belong in the model just as much as going with the subset. But people are likely to be very skeptical of your reported results because psychologically it is too easy for someone to convince themselves, without any actual corrupt intent, to go with the set of post-hoc tweaks (such as dropping some observations) that gives them the result they most expected to see. By always going with the full dataset, you preempt that possibility and assure people (say, reviewers) that that isn't what's going on in your project.
Another issue here is that people end up 'chasing the bubble': when you drop some potential outliers and rerun your model, you end up with results that show new, different observations as potential outliers. How many iterations are you supposed to go through? The standard response is that you should stay with your original, full dataset and run a robust regression instead. This, again, can be understood as a sensitivity analysis.
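One way to operationalize such a sensitivity check (a sketch only, reusing the hypothetical model m1 and data dat from above, and assuming no missing data so the hat values line up with the rows of dat):

```r
library(lme4)

h    <- hatvalues(m1)    # leverage of each observation
flag <- h > 2 * mean(h)  # a common rule-of-thumb cutoff for "high leverage"

# Refit without the flagged observations and compare the fixed effects.
m1_sub <- update(m1, data = dat[!flag, ])
cbind(full = fixef(m1), subset = fixef(m1_sub))  # do the conclusions change?
```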
Best Answer
(This started out as a comment but seemed to be getting too long.)
This question may be getting less attention than it would otherwise deserve because it is very broad (among other things, you've asked 5 separate questions here). A few answers:
- Working conditionally on the estimated random effects (re.form=NULL, or the default, in lme4) would make more sense.
- Any influence measure you compute (e.g. via hatvalues.merMod(), see below) will be conditional on the estimated variance-covariance matrices of the random effects; this is different from the question of whether you're conditioning on conditional modes/BLUPs or not. If you don't want to condition on these estimates, you'll have to (1) assume multivariate Normality of the estimates of the variance-covariance parameters (ugh) or (2) do some kind of parametric bootstrapping (double-ugh).
- The influence.ME package does a lot of its work by a semi-brute-force method: refitting the model with each observation (or each group) left out in turn. Note also the difference between influential observations and influential groups, either of which might be of interest.
- The lme4 package does provide a hat matrix (or its diagonal) via ?hatvalues.merMod, so you could use these to compute some standard influence measures.
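A minimal sketch of both routes, assuming a fitted merMod object m with a grouping factor subject (both names hypothetical):

```r
library(lme4)
library(influence.ME)

# Group-level influence: influence() refits the model with each subject
# left out in turn and returns an "estex" object.
infl <- influence(m, group = "subject")
cooks.distance(infl)  # Cook's distance for each subject
dfbetas(infl)         # change in each fixed effect when a subject is dropped

# Observation-level leverage straight from lme4's hat matrix diagonal.
h <- hatvalues(m)
```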
References
- Abad, Ariel Alonso, Saskia Litière, and Geert Molenberghs. "Testing for Misspecification in Generalized Linear Mixed Models." Biostatistics 11, no. 4 (2010): 771–86. doi:10.1093/biostatistics/kxq019.
- Singer, Julio M., Juvencio S. Nobre, and Francisco M. M. Rocha. "Diagnostic and Treatment for Linear Mixed Models," 5486. Hong Kong, 2013. http://2013.isiproceedings.org/Files/CPS203-P28-S.pdf.