I think the difference is primarily a philosophical one when choosing Rasch/1PL models (the emphasis on what measurement means is slightly different in that literature, and hence researchers try their best to obtain these special items), and an empirical/design one when deciding between 2PL and 3PL models.
Since the slopes are all equal in 1PL models, determining a person's location amounts to finding the point where respondents have a P = 0.5 chance of answering correctly, simply by choosing the items with the best-matching intercepts to estimate $\theta$. In 2PL and 3PL models it's more complicated due to the unequal slopes and the lower-bound guessing parameters. As a consequence, 2PL and 3PL models often require more advanced adaptive item selection procedures, such as Kullback–Leibler or Fisher information criteria, to select the next best item for homing in on $\theta$; a rough sketch of the Fisher information idea is given below.
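For intuition, here is a minimal sketch of Fisher-information item selection (this is not mirtCAT's actual machinery, which automates all of this; `mod`, `theta_hat`, and `remaining` are hypothetical placeholders for a fitted unidimensional mirt model, a provisional ability estimate, and the indices of the items not yet administered):

```r
library(mirt)

# Pick the remaining item with maximal Fisher information at the
# current provisional theta estimate
next_item <- function(mod, theta_hat, remaining) {
    info <- sapply(remaining, function(i)
        iteminfo(extract.item(mod, i), Theta = matrix(theta_hat)))
    remaining[which.max(info)]
}

# e.g., next_item(mod, theta_hat = 0.3, remaining = c(2, 5, 9))
```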
Speaking purely from a design perspective, if the adaptive test items offer a finite set of response options (i.e., multiple choice, where guessing is possible) then the 3PL seems like the better option, but if the items are more fill-in-the-blank in style (e.g., 2 + 3 = __) then the 1PL and 2PL models would, at least theoretically, be more reasonable.
As a precursor, the IRT approach to this problem is computationally very demanding due to the higher dimensionality. It may be worthwhile to look into structural equation modeling (SEM) alternatives using the WLSMV estimator for ordinal data, since I imagine fewer issues will exist; plus, including external covariates is much easier within that framework. Both approaches I describe here are also possible in SEM.
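For reference, a short sketch of what that SEM alternative might look like in lavaan (the item and data names here are placeholders, not from the original analysis):

```r
library(lavaan)

# One-factor model for ordinal items; declaring the items as ordered
# together with estimator = "WLSMV" triggers the polychoric/WLSMV machinery
model <- ' theta =~ item1 + item2 + item3 + item4 '
fit <- cfa(model, data = dat,
           ordered = c("item1", "item2", "item3", "item4"),
           estimator = "WLSMV")
summary(fit, fit.measures = TRUE)
```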
There are two ways that I know of to estimate unidimensional longitudinal IRT models that are not Rasch in nature. The first approach requires a unique latent factor for each time block and a specific residual variation term for each item. A different approach, similar to what one would find in the SEM literature, is a latent growth curve model whereby only a fixed number of factors are estimated (three if the relationship over time is believed to be linear). Fixed loadings are used in this approach, so it may be computationally much more stable; I would tend to prefer the growth curve model for both the smaller dimensionality and the reduced number of estimated parameters.
The idea for both approaches is to set up latent time factors indicating how person-level $\theta$ values change over each test administration, and to constrain their loadings across time so that their hyper-parameters (i.e., the latent means and covariances) can be estimated. Item parameters must also be constrained to be invariant across time, so that person differences are captured only in the hyper-parameters. Since this approach can require a huge number of integration dimensions, you'll need to use something like the dimension reduction algorithm, which is available in mirt through the bfactor() function.
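To give the flavor of the first approach, here is a hedged sketch of the bfactor() setup (the variable layout is hypothetical; see the linked data-simulation script below for the full, runnable specification). Suppose the same 4 items are administered at 2 time points, stored as 8 columns with items 1-4 at time 1 and items 5-8 at time 2:

```r
library(mirt)

# Pair each item with its own specific factor so that bfactor()'s
# dimension reduction absorbs the item-specific residual variation
specific <- c(1, 2, 3, 4,   # time 1 administrations
              1, 2, 3, 4)   # the same items repeated at time 2

# Two 'general' time factors; free the second latent mean and variance
# (plus the covariance) so change over time lands in the hyper-parameters
time_structure <- mirt.model('
    Time1 = 1-4
    Time2 = 5-8
    COV   = Time1*Time2, Time2*Time2
    MEAN  = Time2')

# mod <- bfactor(dat, model = specific, model2 = time_structure)
# Equality constraints on the repeated item parameters across time are
# still required for invariance; the linked script shows these in full.
```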
Instead of going through a worked example here, which would take a lot of code, I'll simply point to worked versions of these analyses. A word of warning though: these are very computationally demanding and may take more than an hour to converge on your computer, since you have 4 dimensions of integration in the first case and 3 dimensions in the second. And if you don't have much RAM, you could experience issues when increasing the number of quadpts.
Data simulation script: https://github.com/philchalmers/mirt/blob/gh-pages/data-scripts/Longitudinal-IRT.R
Analysis output: http://philchalmers.github.io/mirt/html/Longitudinal-IRT.html
In the first example, if you save the factor scores using fscores() you'll obtain estimates for each time point showing how individual $\theta$ values are changing. In the second example, using the linear growth curve approach, the first column of the factor scores will represent the initial $\theta$ estimates while the second column will indicate the slope/change occurring on average over time. In the example I set up a constant mean change of .5, so the slope values from fscores() should all be around 0.5 for each individual. Both analyses give roughly the same conclusions but are somewhat different approaches to the problem; if you are familiar with longitudinal models in SEM, both should be fairly natural to interpret.
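For instance, a small sketch of extracting the scores (mod_time and mod_growth are placeholder names for the two fitted models from the linked analysis):

```r
library(mirt)

# Approach 1: one theta estimate per latent time factor
fs_time <- fscores(mod_time)
head(fs_time)

# Approach 2: growth curve scores
fs_growth <- fscores(mod_growth)
head(fs_growth)       # column 1 = initial theta, column 2 = slope
mean(fs_growth[, 2])  # should sit near the simulated mean change of 0.5
```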
Best Answer
As I stated in the comments above, missing data can be handled by either the ltm or mirt package when the data are MCAR; a sketch of how to use both on a dataset with missing values is given below. It's also possible to impute missing values given a good estimate of $\theta$ for obtaining things like model and item fit statistics (you should do this several times if the amount of missingness is non-trivial, and it's even better to jitter the $\hat{\theta}$ values as a function of the respective $SE_{\hat{\theta}}$ values for more reasonable imputation results).
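A minimal sketch of both ideas, assuming simulated dichotomous data (the dataset, seed, and amount of missingness below are illustrative, not from the original post):

```r
library(mirt)
library(ltm)

# Simulate 2PL data, then delete responses completely at random
set.seed(1234)
a <- matrix(rlnorm(20, meanlog = 0.2, sdlog = 0.3))
d <- matrix(rnorm(20))
dat <- simdata(a, d, N = 1000, itemtype = '2PL')
dat[sample(length(dat), 2000)] <- NA

# Both packages accept NA responses directly when the data are MCAR
mod_mirt <- mirt(dat, 1, itemtype = '2PL')
mod_ltm  <- ltm(as.data.frame(dat) ~ z1)

# Impute plausible responses given theta-hat, jittered by its standard
# error, then refit to obtain complete-data fit statistics (repeat
# several times when missingness is non-trivial)
fs <- fscores(mod_mirt, full.scores.SE = TRUE)
theta_jit <- matrix(rnorm(nrow(dat), mean = fs[, 'F1'], sd = fs[, 'SE_F1']))
complete <- imputeMissing(mod_mirt, theta_jit)
mod_full <- mirt(complete, 1, itemtype = '2PL')
M2(mod_full)       # overall model fit
itemfit(mod_full)  # item-level fit statistics
```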