Solved – the right way to analyze a nested design in R

I know that there are already a host of questions about nested designs but many of them haven't been answered or come from biological domains which I sometimes find hard to transfer to my domain.

I am currently trying to analyse the data from a psychological study. Participants read four statements and indicated how appropriate they found each on a 5-point Likert-scale. But I'll just call the outcome variable "out" and so the content will not distract from the statistical problem at hand.

There are two IVs, one that was varied on three levels between subjects (let's call it "between" and the levels 1, 2 and 3) and one on two levels within subjects ("within" with its levels a and b). The within subjects factor is realised in such a manner that two of the four statements correspond to one of the levels and the other two to the other.

Correspondingly, for every participant, I now have four "out" data points, two for the a statements and two for the b statements, like this (in long format):

subj    between within  item    out
123     1       a       a1      3
123     1       a       a2      4
123     1       b       b1      1
123     1       b       b2      2
124     2       a       a1      5
124     2       a       a2      4
124     2       b       b1      2
124     2       b       b2      3
125     3       a       a1      1
125     3       a       a2      1
125     3       b       b1      2
125     3       b       b2      3

and so on
What is the right way to analyse the effect and interaction of between and within? I assume that the two different a and the two different b statements do not differ systematically but are just two instances of the same thing. Or do I need an additional factor that tells R which of the four statements it is in order to allow for that error? Anyway, I have tried these options:

m1 <- aov(out~between*within+Error(subj/within))
summary(m1)

m2 <- lm(out~between*within+within/subj)
anova(m2)

m3 <- lmer(out~between*within+(between|subj))
Anova(m3)

They produce different results but I'm not entirely sure about which one is right or what the differences are. Could anyone enlighten me on this? I assume this is about fixed and random effects which always gets my head in a twist. I have read other posts about nested designs and I think this one comes nearest. But unfortunately the question ID being nested groups wasn't answered. Any help would be very much appreciated!

Edit:
I greatly appreciate the answers so far! The comments prompted me to read some more tutorials on mixed effects models as well as answers to similar questions. This enables me to clarify my question: I know that I will need to specify random intercepts at least for subjects and probably also the four different items representing the two levels of the within factors. However, I am unsure about the specific fixed/random effects structure I would have to model, and specifically what the maximal model in my case would look like.
Right now, I have tried some code which roughly looks like this:

m1 <- lmer(out ~ between * within + (between|subj) + (within|subj) + (between|within/item))

This is probably wrong on several levels so any feedback would be greatly appreciated! Just for clarity: The item factor has no intrinsic meaning, but both levels of the within factor is realised by two items each (which every participant sees) and I am unsure whether this is relevant to the model.

Apart from the model specification, I am still getting confused about the concepts nested and crossed and which one applies to my setup. Lastly, most of my models fail to converge but that is probably a matter for another question (and has been discussed here in length.)

Best Answer

As discussed in the comments, here is my Stan model for an Ordered Logistic Regression. It takes long format data (one observation per row) and estimates the following parameters: coefficients for dummy-coded predictors (you can for example use Rs model.matrix) from the matrix preds as fixed effects, and also "maximal" (intercept + slope for all terms plus all interaction) random subject and item effects. It does not estimate or disallow correlated random effects.

It should give you the coefficients for your predictor matrix (in beta) and the cutoff points.

data{
int<lower=2> K;                 // number of outcomes
int<lower=1> n_conds;           // e.g. 32 (5 factors and all interactions)
int<lower=0> n_obs;             // e.g. 12000 (observations)
int A[n_obs];                   // outcomes (e.g. Likert scale rankings, 1-4)
matrix[n_obs,n_conds] preds;    // predictor matrix of 0s and 1s

int<lower=0> Ss;                // e.g. 190 (subjs)
int<lower=0> Is;                // e.g. 2560 (items)

int<lower=0,upper=Ss> S[n_obs]; // subj per observation
int<lower=0,upper=Is> I[n_obs]; // item per observation


}

parameters{

vector[n_conds] r_subj[Ss];     // random effects subject
vector[n_conds] r_item[Is];     // random effects item

real<lower=0,upper=10> sigma_s; // one sigma for all subjs
real<lower=0,upper=10> sigma_i; // one sigma for all items

real<lower=0,upper=10> sigma;   // one sigma for all fixed
ordered[K-1] c;                 // cut points
vector[n_conds] beta;           // fixed effects

}

model {
    sigma ~ cauchy(0,5);        // rather narrow prior for all sigmas
    sigma_s ~ cauchy(0,5);
    sigma_i ~ cauchy(0,5);

    beta ~ normal(0,sigma);     // moderate prior for effects

for (s in 1:Ss)                 // prior for subj rand. eff.
    r_subj[s] ~ normal(0,sigma_s);

for (i in 1:Is)                 // prior for item rand. eff.
    r_item[i] ~ normal(0,sigma_i);

for (n in 1:n_obs)              // loop to estimate using ordered logistic regr.    

    A[n] ~ ordered_logistic(  preds[n] * beta           // fixed

                            + preds[n] * r_subj[S[n]]   // random s
                            + preds[n] * r_item[I[n]]   // random i

                        , c);                           // cut points

After you've specified all values asked for in the data part of the code in a list dat, you can run it, e.g. via fit <- stan(model_code = code, data=dat, chains = 0, iter = 10), and print the values via print(fit,"beta").

I can't promise this works as intended, but I've compared it with simple ordinal regression using clmm and "maximal" linear regression using lme4 and the results are qualitatively very similar!

Best Answer

Related Solutions

Related Question