Here is a possibility: you could add an ordering constraint to the optimization of the purity index (e.g. the Gini index or entropy) in the individual trees of the forest. So:
$$\min \sum_i D_i \quad \text{with} \quad D_i = 1 - \sum_{k} p_{ik}^2$$ $$\text{s.t.} \quad p_{ik} \ge p_{i(k-1)} \ge \dots \ge p_{i0}$$ where $k$ indexes the observation type, $i$ indexes the terminal node, and $p_{ik}$ is the proportion of type $k$ in node $i$. That way your forest should yield results consistent with that ordering as well. I guess you could relax the condition by introducing slack variables: minimize $\sum \zeta_k$ with $p_0 > \zeta_0 > 0$ and $p_0 - \zeta_0 \le p_1 - \zeta_1$, etc., for the other probabilities.
But if your data are correct and the condition really does hold, your forest will yield results consistent with it anyway. If you fit an unconstrained forest with enough trees and you still do not observe $P(A) < P(B) < \dots$, it is quite likely that you are mixing non-comparable data sets or that the condition is simply not true.
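To make the objective concrete, here is a minimal Python sketch. It is not a tree-growing implementation, and the terminal-node proportions are invented for illustration; it only computes the per-node Gini impurity $D_i = 1 - \sum_k p_{ik}^2$ and checks the monotone ordering constraint described above.

```python
def gini(p):
    """Gini impurity of one terminal node: D = 1 - sum_k p_k^2."""
    return 1.0 - sum(pk * pk for pk in p)

def satisfies_ordering(p, tol=0.0):
    """Check the constraint p_0 <= p_1 <= ... <= p_K within a node."""
    return all(p[k] >= p[k - 1] - tol for k in range(1, len(p)))

# Hypothetical class proportions for three terminal nodes, three classes each.
nodes = [
    [0.1, 0.3, 0.6],  # ordered: admissible under the constraint
    [0.2, 0.3, 0.5],  # ordered: admissible
    [0.5, 0.1, 0.4],  # violates p_0 <= p_1
]

total_impurity = sum(gini(p) for p in nodes)
feasible = all(satisfies_ordering(p) for p in nodes)
print(round(total_impurity, 2))  # 0.54 + 0.62 + 0.58 = 1.74
print(feasible)                  # False: the third node breaks the ordering
```

A constrained tree learner would only consider splits whose resulting nodes pass `satisfies_ordering` (or, with slacks, a relaxed version of it), minimizing `total_impurity` subject to that feasibility check.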
As a precursor, the IRT approach to this problem is computationally very demanding due to the higher dimensionality. It may be worthwhile to look into structural equation modeling (SEM) alternatives using the WLSMV estimator for ordinal data, since I imagine fewer issues will arise. Plus, including external covariates is much easier within that framework. Both approaches I describe here are also possible in SEM.
There are two ways that I know of to estimate unidimensional longitudinal IRT models that are not Rasch in nature. The first approach requires a unique latent factor for each time block and a specific residual variation term for each item. A different approach, similar to what one would find in the SEM literature, is a latent growth curve model whereby only a fixed number of factors are estimated (three if the relationship over time is believed to be linear). Fixed loadings are used in this approach, so computationally it may be much more stable; I would tend to prefer the growth curve model for both the smaller dimensionality and the fewer estimated parameters.
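A minimal sketch of the linear growth curve idea (the exact parameterization in the linked scripts may differ): each person's ability at time $t$ is decomposed into an intercept and a slope factor with fixed loadings,

$$\theta_{it} = \theta_{0i} + \lambda_t\,\theta_{1i}, \qquad \lambda_t = t - 1,$$

so that for, e.g., three administrations the loading structure is

$$\Lambda = \begin{pmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \end{pmatrix},$$

and only the means and covariances of $(\theta_{0i}, \theta_{1i})$ need to be estimated rather than one free factor per time point.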
The idea for both approaches is to set up latent time factors indicating how person-level $\theta$ values change over each test administration, and to constrain the loadings across time so that the hyper-parameters (i.e., the latent means and covariances) can be estimated. Item parameters must also be constrained to be invariant across time so that person differences are captured only in the hyper-parameters. This approach can require a huge number of integration dimensions, so you'll need to use something like the dimension reduction algorithm, which is available in mirt under the bfactor() function.
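Schematically, the invariance constraints look as follows (written here in a 2PL-style parameterization; the linked scripts spell out the exact constraints, and the identification choices below are my assumption):

$$P(x_{ijt} = 1 \mid \theta_{it}) = \frac{1}{1 + \exp\{-(a_j \theta_{it} + d_j)\}}, \qquad a_{jt} = a_j,\; d_{jt} = d_j \;\; \forall\, t,$$

with $\boldsymbol{\theta}_i = (\theta_{i1}, \dots, \theta_{iT}) \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are the freely estimated hyper-parameters (typically with $\mu_1 = 0$ and $\sigma^2_1 = 1$ fixed at the first time point for identification). Because the item parameters are held equal across time, any shift in the response distributions must be absorbed by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, which is exactly the longitudinal change of interest.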
Instead of going through a worked example here, which would take a lot of code, I'll simply point to worked versions of these analyses. A word of warning though: these are very computationally demanding and may take more than an hour to converge on your computer, since you have 4 dimensions of integration in the first case and 3 dimensions in the second. Also, if you don't have much RAM you could experience issues when increasing the number of quadpts.
Data simulation script: https://github.com/philchalmers/mirt/blob/gh-pages/data-scripts/Longitudinal-IRT.R
Analysis output: http://philchalmers.github.io/mirt/html/Longitudinal-IRT.html
In the first example, if you save the factor scores using fscores() you'll obtain estimates for each time point of how individual $\theta$ values are changing. In the second example, using the linear growth curve approach, the first column of the factor scores will represent the initial $\theta$ estimates, while the second column will indicate the slope/change occurring on average over time. In the example I set up a constant mean change of .5, so the slope values in fscores() should all be around 0.5 for each individual. Both analyses give roughly the same conclusions but are somewhat different approaches to the problem. If you are familiar with longitudinal models in SEM, these should be fairly natural to interpret.
Best Answer
There is a previous post that discussed including mixed effects for clustered/longitudinal data: How can I include random effects into a randomForest
Here is a good reference for decision tree implementations in R: http://statistical-research.com/a-brief-tour-of-the-trees-and-forests/
Also, you may want to review these slides: http://www2.ims.nus.edu.sg/Programs/014swclass/files/denis.pdf