Solved – Detecting patterns of cheating on a multi-question exam

Tags: classification, clustering, psychometrics, r

QUESTION:

I have binary data on exam questions (correct/incorrect). Some individuals might have had prior access to a subset of questions and their correct answers. I don't know who, how many, or which. If there were no cheating, suppose I would model the probability of a correct response for item $i$ as $\text{logit}(\Pr(X_i = 1 \mid z)) = \beta_i + z$, where $\beta_i$ represents question easiness and $z$ is the individual's latent ability. This is a very simple item response model that can be estimated with functions like ltm's rasch() in R. In addition to the estimates $\hat{z}_j$ (where $j$ indexes individuals) of the latent variable, I have access to separate estimates $\hat{q}_j$ of the same latent variable, which were derived from another dataset in which cheating was not possible.
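As a concrete sketch of what estimating this model involves (in Python for illustration, even though the question uses R's `rasch()` from ltm; all data here are simulated, and the crude joint-MLE gradient ascent is only a stand-in for the marginal maximum likelihood that `rasch()` actually uses):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items, n_persons = 20, 500

# Hypothetical true easiness (beta) and ability (z) parameters
beta = rng.normal(0, 1, n_items)
z = rng.normal(0, 1, n_persons)

# Simulate X[i, j] ~ Bernoulli(ilogit(beta_i + z_j))
P = 1 / (1 + np.exp(-(beta[:, None] + z[None, :])))
X = (rng.random((n_items, n_persons)) < P).astype(float)

# Crude joint MLE by damped gradient ascent on the log-likelihood
b_hat = np.zeros(n_items)
z_hat = np.zeros(n_persons)
for _ in range(300):
    resid = X - 1 / (1 + np.exp(-(b_hat[:, None] + z_hat[None, :])))
    b_hat += 2.0 * resid.mean(axis=1)   # easiness updates
    z_hat += 2.0 * resid.mean(axis=0)   # ability updates
    z_hat -= z_hat.mean()               # center abilities for identifiability
```

With many persons per item, the easiness estimates are quite precise; the per-person ability estimates are noisier because each rests on only 20 items.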

The goal is to identify individuals who likely cheated and the items they cheated on. What are some approaches you might take? In addition to the raw data, $\hat{\beta}_i$, $\hat{z}_j$, and $\hat{q}_j$ are all available, although the first two will have some bias due to cheating. Ideally, the solution would come in the form of probabilistic clustering/classification, although this is not necessary. Practical ideas are highly welcomed as are formal approaches.

So far, I have compared the correlation of question scores for pairs of individuals with higher vs. lower $\hat{q}_j - \hat{z}_j$ scores (where $\hat{q}_j - \hat{z}_j$ is a rough index of the probability that they cheated). For example, I sorted individuals by $\hat{q}_j - \hat{z}_j$ and then plotted the correlation of successive pairs of individuals' question scores. I also tried plotting the mean correlation of scores for individuals whose $\hat{q}_j - \hat{z}_j$ values were greater than the $n^{th}$ quantile of $\hat{q}_j - \hat{z}_j$, as a function of $n$. Neither approach revealed any obvious pattern.
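For concreteness, the successive-pairs diagnostic could be computed like this (a Python sketch; `X`, `q_hat`, and `z_hat` are random placeholders for the real score matrix and the two sets of ability estimates):

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_persons = 20, 200

X = rng.integers(0, 2, (n_items, n_persons)).astype(float)  # placeholder scores
q_hat = rng.normal(0, 1, n_persons)  # abilities from the cheating-free dataset
z_hat = rng.normal(0, 1, n_persons)  # abilities from the suspect dataset

# Sort persons by the rough cheating index q_hat - z_hat, largest first
order = np.argsort(q_hat - z_hat)[::-1]
Xs = X[:, order]

# Correlation of each successive pair of sorted persons' score vectors
pair_corr = np.array([np.corrcoef(Xs[:, k], Xs[:, k + 1])[0, 1]
                      for k in range(n_persons - 1)])
```

If a group of cheaters shared the same leaked items, one would hope to see elevated correlations near the start of `pair_corr`.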


UPDATE:

I ended up combining ideas from @SheldonCooper and the helpful Freakonomics paper that @whuber pointed me toward. Other ideas/comments/criticisms welcome.

Let $X_{ij}$ be person $j$'s binary score on question $i$. Estimate the item response model $$\text{logit}(\Pr(X_{ij} = 1 \mid z_j)) = \beta_i + z_j,$$ where $\beta_i$ is the item's easiness parameter and $z_j$ is a latent ability variable. (A more complicated model can be substituted; I'm using a 2PL in my application.) As I mentioned in my original post, I have estimates $\hat{q}_j$ of the ability variable from a separate dataset $\{y_{ij}\}$ (different items, same persons) on which cheating was not possible. Specifically, $\hat{q}_j$ are empirical Bayes estimates from the same item response model as above.

The probability of the observed score $x_{ij}$, conditional on item easiness and person ability, can be written $$p_{ij} = \Pr(X_{ij} = x_{ij} \mid \hat{\beta}_i, \hat{q}_j) = P_{ij}(\hat{\beta}_i, \hat{q}_j)^{x_{ij}} (1 - P_{ij}(\hat{\beta}_i, \hat{q}_j))^{1-x_{ij}},$$ where $P_{ij}(\hat{\beta}_i, \hat{q}_j) = \text{ilogit}(\hat{\beta}_i + \hat{q}_j)$ is the predicted probability of a correct response, and $\text{ilogit}$ is the inverse logit. Then, conditional on item and person characteristics, the joint probability that person $j$ has the observations $x_j$ is $$p_j = \prod_i p_{ij},$$ and similarly, the joint probability that item $i$ has the observations $x_i$ is $$p_i = \prod_j p_{ij}.$$ Persons with the lowest $p_j$ values are those whose observed scores are conditionally least likely; they are possibly cheaters. Items with the lowest $p_i$ values are those which are conditionally least likely; they are the possible leaked/shared items. This approach relies on the assumptions that the models are correct and that person $j$'s scores are uncorrelated conditional on person and item characteristics. A violation of the second assumption isn't problematic though, as long as the degree of correlation does not vary across persons, and the model for $p_{ij}$ could easily be improved (e.g., by adding additional person or item characteristics).
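A minimal numeric sketch of these quantities (Python; $\hat{\beta}_i$, $\hat{q}_j$, and the scores are simulated stand-ins, and log-probabilities are summed rather than multiplying the raw $p_{ij}$, to avoid underflow):

```python
import numpy as np

rng = np.random.default_rng(2)
n_items, n_persons = 25, 300

# Simulated stand-ins for the estimates and the observed score matrix
beta_hat = rng.normal(0, 1, n_items)
q_hat = rng.normal(0, 1, n_persons)
P = 1 / (1 + np.exp(-(beta_hat[:, None] + q_hat[None, :])))  # P_ij = ilogit(.)
X = (rng.random((n_items, n_persons)) < P).astype(float)

# log p_ij, where p_ij = P^x * (1 - P)^(1 - x)
log_p = np.where(X == 1, np.log(P), np.log(1 - P))

log_p_person = log_p.sum(axis=0)  # log p_j: one value per person
log_p_item = log_p.sum(axis=1)    # log p_i: one value per item

# Lowest values = conditionally least likely; flag a few for inspection
suspect_persons = np.argsort(log_p_person)[:10]
suspect_items = np.argsort(log_p_item)[:5]
```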

An additional step I tried is to take the $r\%$ least likely persons (i.e., persons with the lowest $r\%$ of sorted $p_j$ values), compute the mean distance between their observed score vectors $x_j$ (which should be small at low $r$ if the low-$p_j$ persons are cheaters who shared the same items), and plot it for $r = 0.001, 0.002, \ldots, 1.000$. The mean distance increases from $r = 0.001$ to $r = 0.025$, reaches a maximum, and then declines slowly to a minimum at $r = 1$. Not exactly what I was hoping for.
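That curve can be sketched as follows (Python; the score matrix and the $\log p_j$ values are random placeholders, and "distance" is taken to be mean Hamming distance between binary score vectors, which is one reasonable choice but not the only one):

```python
import numpy as np

rng = np.random.default_rng(3)
n_items, n_persons = 25, 200

X = rng.integers(0, 2, (n_items, n_persons))    # placeholder scores
log_p_person = rng.normal(-15, 2, n_persons)    # placeholder log p_j values
order = np.argsort(log_p_person)                # least likely persons first

def mean_hamming(cols):
    """Mean pairwise Hamming distance between the given score columns."""
    k = cols.shape[1]
    diff = (cols[:, :, None] != cols[:, None, :]).mean(axis=0)  # (k, k)
    iu = np.triu_indices(k, 1)                  # upper triangle, no diagonal
    return diff[iu].mean()

rs = np.arange(0.02, 1.01, 0.02)
curve = [mean_hamming(X[:, order[:max(2, int(round(r * n_persons)))]])
         for r in rs]
# curve would then be plotted against rs
```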

Best Answer

Ad hoc approach

I'd assume that $\beta_i$ is reasonably reliable because it was estimated on many students, most of whom did not cheat on question $i$. For each student $j$, sort the questions in order of increasing difficulty, compute $\beta_i + q_j$ (note that $q_j$ is just a constant offset) and threshold it at some reasonable place (e.g. $p(\text{correct}) < 0.6$). This gives a set of questions which the student is unlikely to answer correctly. You can now use hypothesis testing to see whether this is violated, in which case the student probably cheated (assuming, of course, that your model is correct). One caveat is that if there are few such questions, you might not have enough data for the test to be reliable. Also, I don't think it's possible to determine which questions a given student cheated on, because any single hard question could have been answered correctly by chance. But if you assume in addition that many students got access to (and cheated on) the same set of questions, you can compare these across students and see which questions got answered more often than chance.
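One way to operationalize this per-student test (a Python sketch on simulated, cheat-free data; using the mean predicted probability over a student's hard items as the null rate for a binomial test is a simplification, since the exact null is Poisson-binomial, and `binomtest` is SciPy's):

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(4)
n_items, n_persons = 40, 200

beta_hat = rng.normal(0, 1, n_items)     # assumed item easiness estimates
q_hat = rng.normal(0, 1, n_persons)      # assumed ability estimates
P = 1 / (1 + np.exp(-(beta_hat[:, None] + q_hat[None, :])))
X = (rng.random((n_items, n_persons)) < P).astype(int)   # no cheating here

p_values = np.full(n_persons, np.nan)
for j in range(n_persons):
    hard = P[:, j] < 0.6        # items student j is unlikely to get right
    if hard.sum() < 5:          # too few hard items: test unreliable, skip
        continue
    p0 = P[hard, j].mean()      # simplified null success rate
    k, n = int(X[hard, j].sum()), int(hard.sum())
    # One-sided: did the student do better on hard items than predicted?
    p_values[j] = binomtest(k, n, p0, alternative='greater').pvalue
```

Small p-values flag students who beat their predicted performance on hard items; a multiple-testing correction would be needed across students.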

You can do a similar trick with questions. I.e., for each question, sort students by $q_j$, add $\beta_i$ (this is now a constant offset) and threshold the predicted probability at 0.6. This gives you a list of students who shouldn't be able to answer this question correctly: each of them has less than a 60% chance of getting it right by ability alone. Again, do hypothesis testing and see whether this is violated. This only works if most cheaters cheated on the same set of questions (e.g. if a subset of questions 'leaked' before the exam).
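The item-side version is symmetric (again a Python sketch on simulated, cheat-free data, with the same simplified binomial null in place of the exact Poisson-binomial one):

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(5)
n_items, n_persons = 40, 200

beta_hat = rng.normal(0, 1, n_items)
q_hat = rng.normal(0, 1, n_persons)
P = 1 / (1 + np.exp(-(beta_hat[:, None] + q_hat[None, :])))
X = (rng.random((n_items, n_persons)) < P).astype(int)

item_p_values = np.full(n_items, np.nan)
for i in range(n_items):
    weak = P[i, :] < 0.6        # students unlikely to answer item i correctly
    if weak.sum() < 10:         # too few weak students: skip
        continue
    p0 = P[i, weak].mean()      # simplified null success rate
    k, n = int(X[i, weak].sum()), int(weak.sum())
    # One-sided: was item i answered correctly more often than predicted?
    item_p_values[i] = binomtest(k, n, p0, alternative='greater').pvalue
```

Items with small p-values are candidates for having leaked.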

Principled approach

For each student, there is a binary variable $c_j$ with a Bernoulli prior with some suitable probability, indicating whether the student is a cheater. For each question there is a binary variable $l_i$, again with some suitable Bernoulli prior, indicating whether the question was leaked. Then there is a set of binary variables $a_{ij}$, indicating whether student $j$ answered question $i$ correctly. If $c_j = 1$ and $l_i = 1$, then the distribution of $a_{ij}$ is Bernoulli with probability 0.99. Otherwise $a_{ij}$ is Bernoulli with probability $\text{ilogit}(\beta_i + q_j)$. These $a_{ij}$ are the observed variables; $c_j$ and $l_i$ are hidden and must be inferred. You can probably do this with Gibbs sampling. But other approaches might also be feasible, maybe something related to biclustering.
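A toy Gibbs sampler for this model might look like the following (Python; the prior probabilities, the 0.99 cheat-success probability, and the residual-based initialization are all assumptions, and a real application would need convergence diagnostics, since a poor starting point can leave the chain stuck):

```python
import numpy as np
from scipy.special import expit  # inverse logit

rng = np.random.default_rng(6)
n_items, n_persons = 40, 200
pi_c, pi_l = 0.1, 0.3                        # assumed Bernoulli priors

# Simulate from the generative model described above
beta = rng.normal(0, 1, n_items)
q = rng.normal(0, 1, n_persons)
c_true = rng.random(n_persons) < pi_c        # cheater indicators
l_true = rng.random(n_items) < pi_l          # leaked-item indicators
P0 = expit(beta[:, None] + q[None, :])
P = np.where(np.outer(l_true, c_true), 0.99, P0)
A = (rng.random((n_items, n_persons)) < P).astype(int)

# Log-likelihood gain for each a_ij: "cheater on leaked item" vs honest
D = (A * np.log(0.99) + (1 - A) * np.log(0.01)) \
    - (A * np.log(P0) + (1 - A) * np.log(1 - P0))

# Initialize c from a residual heuristic: persons with surprisingly many
# correct answers relative to the honest model
excess = (A - P0).sum(axis=0) / np.sqrt((P0 * (1 - P0)).sum(axis=0))
c = excess > 2
l = np.zeros(n_items, dtype=bool)

n_iter, burn = 500, 100
c_sum, l_sum = np.zeros(n_persons), np.zeros(n_items)
for t in range(n_iter):
    # Sample l_i | c: only currently flagged cheaters carry information
    l = rng.random(n_items) < expit(np.log(pi_l / (1 - pi_l))
                                    + D[:, c].sum(axis=1))
    # Sample c_j | l: only currently flagged leaked items carry information
    c = rng.random(n_persons) < expit(np.log(pi_c / (1 - pi_c))
                                      + D[l, :].sum(axis=0))
    if t >= burn:
        c_sum += c
        l_sum += l

c_post = c_sum / (n_iter - burn)   # approximate posterior P(c_j = 1 | data)
l_post = l_sum / (n_iter - burn)   # approximate posterior P(l_i = 1 | data)
```

Because $c$ and $l$ reinforce each other, block updates (or several restarts from different initializations) are advisable before trusting the posterior summaries.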
