Confirmatory Factor Analysis – How Does an Item Intercept in CFA Relate to the Difficulty Parameter in IRT?

Tags: confirmatory-factor, intercept, item-response-theory, regression, structural-equation-modeling

I am having difficulty clarifying in my mind how item intercepts in Confirmatory Factor Analysis or SEM manifest, and how they are graphically represented in an IRT ICC plot. I understand that multi-group CFA for polytomous variables is mathematically equivalent to an IRT graded response model.

Crucially, I would like to know if measurement non-invariance/DIF between two groups in intercepts (but invariant loadings and thresholds) for an item means that (in the case of a depression questionnaire):

a) the presence of that trait/symptom is greater in one group, external to the influence of the latent variable (depression). i.e. older people report more sleep problems anyway because of population characteristics, and so have a higher intercept for the sleep disruption item than younger people, and that this is applied as a constant (but depression still loads on to the item equally between groups).

and/or

b) the regression slope (item loading) is the same, but the "item difficulty" is different. This could relate to the horizontal positioning of an ICC curve with equal discrimination parameters i.e. sleep disruption is equally explained by depression, but it is sensitive to milder depression in older people, and sensitive to more severe depression in younger people. In other words, you need to have much more severe depression as a younger person before it starts to influence responses to the item, but the gradient of influence is equal once it does influence.

I understand that an item intercept is the expected response to an item when the factor score is zero, which matches option a, but I have also heard that the item intercepts are likened to the IRT difficulty parameter, which fits option b.

These two explanations seem conceptually distinct to me. Both could result in one group having a higher item response for a given latent factor score, but for different reasons. The first is caused by a difference in an external constant, and the second by the fact that older people may experience depression-related sleep disruption more readily/in milder cases, and this means the responses to the item would be higher when the factor scores are equal.

I was wondering if someone could please explain which (or both/neither) of these explanations is true?

Best Answer

> how they are graphically represented in an IRT ICC plot

I don't think intercepts would be, because the ICC plots the probability of a particular response category (y-axis) against the latent trait (x-axis). Intercepts are only relevant on the latent-response scale. So if you plot the expected latent response (y-axis) against the latent trait (x-axis) in the case of uniform DIF across 2 groups, you would have 2 parallel lines (equal loadings, different intercepts). You could perhaps call this a latent-response characteristic curve (LRCC), in contrast to an (observed) item characteristic curve.
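
As a sketch of those parallel lines, in notation that is my own assumption rather than anything from your post: let $y^*$ be the latent response, $\eta$ the latent trait, $\lambda$ the loading shared by both groups, and $\nu_g$ the intercept in group $g$. Then

$$E(y^* \mid \eta, g) = \nu_g + \lambda\eta, \qquad E(y^* \mid \eta, g = 1) - E(y^* \mid \eta, g = 2) = \nu_1 - \nu_2 .$$

The difference does not depend on $\eta$, which is why the two lines stay the same vertical distance apart everywhere.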

> I was wondering if someone could please explain which (or both/neither) of these explanations is true?

I think your interpretation (a) is correct: Controlling for the latent trait (i.e., even among people with the same latent-trait values), the average latent response differs across groups. That is uniform DIF: a constant shift in the expected values. In the linear case (for latent responses), that implies the same vertical distance between the two lines at every point on the x-axis. In the nonlinear case (for observed item responses), the "parallel" sigmoidal curves are separated by a vertical distance that varies along the x-axis, even though the curves never intersect (they only converge as the latent trait goes to $\pm \infty$).
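
A minimal numeric sketch of that contrast, assuming a probit link with residual variance 1 and made-up values for the loading, threshold, and intercepts (none of these numbers come from your post):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF from the standard library only."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_above(eta, intercept, loading=1.0, threshold=1.0):
    """P(y* > threshold | eta), where y* = intercept + loading * eta + e, e ~ N(0, 1)."""
    return norm_cdf(intercept + loading * eta - threshold)

# Hypothetical values: same loading and threshold, intercepts 0.5 apart.
NU_HIGH, NU_LOW = 0.5, 0.0

prob_gaps = {}
for eta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    # On the latent-response scale the gap is NU_HIGH - NU_LOW at every eta.
    prob_gaps[eta] = p_above(eta, NU_HIGH) - p_above(eta, NU_LOW)
    print(f"eta = {eta:+.1f}   latent gap = {NU_HIGH - NU_LOW:.2f}   "
          f"probability gap = {prob_gaps[eta]:.4f}")
```

The latent gap is constant by construction, while the probability gap is largest near the threshold and shrinks toward zero in both tails.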

I think your interpretation (b) is also correct in terms of interpreting differences in difficulty. However, the GRM's difficulty parameters ($C-1$ per item) are analogous to CFA's thresholds ($C-1$ per item), not CFA's intercepts (1 per item). This is where it gets hairy because the GRM is generally only interpreted in terms of the observed (not latent) response scale. CFA was augmented with a threshold model as an ad-hoc way to have our continuous cake and discretely (sic) eat it too. So the thresholds are additional measurement parameters, on top of the intercepts and variances already included for continuous indicators (which latent responses are assumed to be).

Answering the question in your post's title will take some background, so please forgive me for covering a lot of concepts that I think you already understand. I just want to organize the concepts so we are on the same page.

Suppose we have 2 groups (it doesn't matter whether latent-trait means differ or not). Suppose that indicators are measured with a 5-point Likert scale, and the first 9 items have equal measurement parameters, but the 10th item has DIF in 1 of the 4 thresholds. (For simplicity, let's assume its loadings and intercepts are equal across groups.) Thus, for the first 9 items, group differences on the latent trait fully explain group differences in the probability of responding (e.g.) "strongly agree" rather than "somewhat agree". That is, there is a point on the latent-trait scale (a threshold, certain level of "difficulty") beyond which subjects stop responding "somewhat" and start responding "strongly".

Now, for the 10th item, suppose that last threshold is higher in group 2. That implies the people in group 2 need to be higher on the latent trait before they pass that threshold, to switch from "somewhat" to "strongly". That is a "more difficult" transition in group 2 than in group 1, so their 4th difficulty parameter is higher in the GRM, just like their 4th threshold is higher in CFA.

Okay, now suppose ALL thresholds are higher in group 2, shifted by the same amount (i.e., group 2's thresholds = group 1's thresholds + a constant). There is no way to statistically distinguish this situation from a situation in which the thresholds are the same in both groups, but group 2's latent-response intercept is shifted lower than group 1, by the same constant. This is exactly the situation we have with binary items (i.e., only 1 threshold): We can estimate the intercept with the threshold fixed to 0 for identification (standard in probit regression), or we can estimate the threshold with the intercept fixed to 0 (standard in CFA). We will get the same estimated absolute value either way, but with opposite signs.
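
Here is a small sketch of that equivalence, assuming a probit link with residual variance 1; the threshold values, the shift, and the helper names are all hypothetical:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF from the standard library only."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def category_probs(eta, intercept, thresholds, loading=1.0):
    """Response-category probabilities when y* = intercept + loading * eta + e,
    e ~ N(0, 1), and y is the category bounded by adjacent thresholds."""
    mean = intercept + loading * eta
    cum = [0.0] + [norm_cdf(t - mean) for t in thresholds] + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]

BASE = [-1.5, -0.5, 0.5, 1.5]   # hypothetical thresholds for a 5-point item
SHIFT = 0.4                     # hypothetical constant

# Raising every threshold by SHIFT vs. lowering the intercept by SHIFT.
max_diff = 0.0
for eta in (-1.0, 0.0, 2.0):
    raised_thresholds = category_probs(eta, 0.0, [t + SHIFT for t in BASE])
    lowered_intercept = category_probs(eta, -SHIFT, BASE)
    for p, q in zip(raised_thresholds, lowered_intercept):
        max_diff = max(max_diff, abs(p - q))
print(f"largest difference in any category probability: {max_diff:.2e}")

# Binary special case: intercept with threshold fixed to 0 (probit style)
# vs. threshold with intercept fixed to 0 (CFA style), opposite signs.
probit_style = category_probs(0.3, intercept=-0.7, thresholds=[0.0])
cfa_style = category_probs(0.3, intercept=0.0, thresholds=[0.7])
print(probit_style, cfa_style)
```

Both parameterizations imply the same response probabilities (up to floating-point error), so the data cannot tell them apart.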

It would be quite a coincidence for each of an item's thresholds to differ across groups by the exact same shift, so let's take for granted that the thresholds are in fact equal, and allow for group differences in the 10th item's intercept. Now, I can finally answer the question in your post's title :-)

This difference in intercepts reflects that the 2 groups agree more/less (on average) with the 10th item, even among members with the same level of the latent trait. That "average agreement" is on the latent response scale, and members of the group with higher average agreement have a greater probability of responding "strongly" rather than "somewhat". The item is just as "difficult", but more people pass that shared threshold in the group with the greater expected value. So DIF in intercepts simply pushes more people past the threshold in one group than another. It is the threshold that is analogous to a difficulty parameter, not the intercept. In case you have not already read it, this paper provides a formula to transform CFA to IRT parameters (or vice versa), at least in the simple binary case, but the same principle generalizes to more thresholds.

https://doi.org/10.1080/10705510701758406
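
The general shape of such a transformation can be sketched as follows. The notation is my own assumption ($\nu$ intercept, $\lambda$ loading, $\tau_c$ the $c$-th threshold, $\sigma_\varepsilon$ the residual SD of the latent response, $\Phi$ the standard normal CDF), and the exact scaling depends on how the model is identified, so treat this as an illustration rather than the paper's formula:

$$P(y > c \mid \eta) = P(y^* > \tau_c \mid \eta) = \Phi\!\left(\frac{\nu + \lambda\eta - \tau_c}{\sigma_\varepsilon}\right) = \Phi\big(a(\eta - b_c)\big), \qquad a = \frac{\lambda}{\sigma_\varepsilon}, \quad b_c = \frac{\tau_c - \nu}{\lambda}.$$

Each threshold $\tau_c$ gets its own difficulty $b_c$, whereas the single intercept $\nu$ enters every $b_c$ in the same way. This is on the normal-ogive (probit) metric; a logistic GRM would need an additional rescaling of $a$.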

Another hypothetical might make it even more concrete: Suppose there are differences both in the 4th threshold and in the intercept. Specifically, the 4th threshold is 0.5 higher in group 1 than 2, so it is that much more "difficult" to respond "strongly" than "somewhat". But the intercept is also 0.5 higher in group 1 than 2. In that case, group 1's latent-response distribution is shifted to the right, just enough to compensate for the greater difficulty. Thus, the probability of responding "strongly" would be the same in both groups, holding the latent trait constant, because just as much of the latent-response distribution exceeds the 4th threshold. However, group 1's latent-response distribution is shifted just as far to the right relative to all other thresholds, so response probabilities for lower categories would still differ between groups (holding the latent trait constant).
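
To put numbers on this hypothetical, here is a sketch assuming a probit link with residual variance 1 and a loading of 1; the baseline thresholds are invented, and only the two 0.5 differences come from the scenario above:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF from the standard library only."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def category_probs(eta, intercept, thresholds, loading=1.0):
    """Response-category probabilities when y* = intercept + loading * eta + e,
    e ~ N(0, 1), and y is the category bounded by adjacent thresholds."""
    mean = intercept + loading * eta
    cum = [0.0] + [norm_cdf(t - mean) for t in thresholds] + [1.0]
    return [hi - lo for lo, hi in zip(cum, cum[1:])]

# Group 2 is the reference; group 1 has a 0.5 higher intercept
# and a 0.5 higher 4th threshold (hypothetical baseline values).
group2 = dict(intercept=0.0, thresholds=[-1.5, -0.5, 0.5, 1.5])
group1 = dict(intercept=0.5, thresholds=[-1.5, -0.5, 0.5, 2.0])

eta = 0.0  # same latent-trait value in both groups
p1 = category_probs(eta, **group1)
p2 = category_probs(eta, **group2)
for k, (a, b) in enumerate(zip(p1, p2), start=1):
    print(f"category {k}: group 1 = {a:.4f}, group 2 = {b:.4f}")
```

The top category ("strongly") has the same probability in both groups, while the lower categories do not.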
