Multiple Choice Exams – Deducing Correct Answers

algorithms, coding-theory, combinatorics, puzzle

I am looking for an algorithmic way to solve the following problem.

Problem

Say we are given a multiple-choice test with 100 questions and 4 answers per question, exactly one of which is correct. Each correctly answered question is worth one point; wrong answers are worth zero points.
Suppose now we are given a database D of many answer sheets and their corresponding scores, e.g.

D:= { ('ABAA…', 80),
('ABAB…', 80),
('ABAC…', 80),
('ABAD…', 81), … }

How can we find out which answers are correct?
I am not looking for something probabilistic, but for answers which are definitely correct.

Some ideas

There are some obvious strategies like:

  • look for someone who reached 100 points; that sheet contains all the correct answers
  • look for pairs of sheets whose answers differ in only one question (in the example database given, we can deduce that the answer to question 4 is "D")
  • look for someone who answered everything wrong; all of that sheet's answers can be ruled out

But what information can we get from other combinations of answers?

Viewing the answer sheets as a metric space, we get
$$
100 - S(t) = d(t, \textrm{correct})
$$
for the Hamming distance $d(\cdot,\cdot)$, the score $S(\cdot)$, and the correct sheet $\textrm{correct}$.
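As a quick sanity check of this relation on a toy instance (the 4-question key "CBAD" below is hypothetical), the score is the number of agreements and the Hamming distance is the number of disagreements, so they sum to the number of questions:

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

correct = "CBAD"   # hypothetical answer key
test = "ABAD"      # one submitted sheet
n = len(correct)

score = sum(t == c for t, c in zip(test, correct))
assert n - score == hamming(test, correct)
print(score, hamming(test, correct))  # 3 1
```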

Maybe someone could give me a reformulation of the problem that yields a more obvious implementation. Any contribution is appreciated.

Edit:

Not considering computational complexity, couldn't I achieve something by intersecting the balls
$$
\bigcap_i B_{d(t_i,\textrm{correct})}(t_i),
$$
with tests $t_i$ and balls $B_d(x) := \{y: d(x,y)\leq d\}$?
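Ignoring complexity, the ball intersection can be realized by brute force on a toy instance. The sketch below is my own illustration (function names and the 4-question database are hypothetical): since the correct sheet lies at *exactly* distance $n - S(t_i)$ from each test $t_i$, we can keep only the candidates on the boundary of every ball, which is a subset of the intersection above.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def consistent_keys(db, n, alphabet="ABCD"):
    """Enumerate all len(alphabet)**n candidate keys and keep those at
    exactly distance n - score from every test. Feasible only for tiny n."""
    keys = []
    for cand in product(alphabet, repeat=n):
        cand = "".join(cand)
        if all(hamming(cand, t) == n - s for t, s in db):
            keys.append(cand)
    return keys

# Toy 4-question database (scores consistent with the key "CBAD").
db = [("ABAA", 2), ("ABAB", 2), ("ABAC", 2), ("ABAD", 3)]
print(consistent_keys(db, 4))  # 9 keys remain, among them "CBAD"
```

Here the intersection is not a single point: question 4 is pinned to "D", but one of the first three answers remains undetermined, which matches Observation 1 in the answer below about the database not always determining the key.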

Best Answer

Update:

Today, as I was looking through some interesting math/programming problems on Project Euler, I noticed Number Mind (problem 185), and it immediately reminded me of this math.SE question; they are practically equivalent. Searching for a solution to the Project Euler problem, I found one (written in Python) on a sister site, codereview.SE. (I haven't actually read it, though.)

What follows is what I originally posted.

Some Observations:

  1. Depending on the database D, we may not be able to determine the correct answers. For example, if question number 100 is very tricky and as a result everyone chooses C or D, even though the correct answer is A, then we cannot determine the correct answer.

  2. Let's fix some notation: let $\mathcal{A}$ be the correct answers and $\hat{\mathcal{A}}$ be some guess of $\mathcal{A}$. (So both are strings, 100 letters long.) Let $t_i$ be a test whose actual score is 85; say $S_{\mathcal{A}}(t_i) = 85$. Further, let's assume that if we rescore $t_i$ according to $\hat{\mathcal{A}}$, we get a new score $S_{\hat{\mathcal{A}}}(t_i) = 90$. Then we have lower and upper bounds for $d(\mathcal{A},\hat{\mathcal{A}})$ = the number of letters for which they differ: $$5=|90-85| \leq d(\mathcal{A},\hat{\mathcal{A}}) \leq (100-85) + (100-90) = 25.$$ The first inequality holds because the test $t_i$ is graded incorrectly by $\hat{\mathcal{A}}$ for at least 5 questions. The second is the triangle inequality; $d(\mathcal{A},\hat{\mathcal{A}}) \leq d(\mathcal{A},t_i) + d(t_i,\hat{\mathcal{A}})$. (Notice that $d(\mathcal{A},t_i) = 100 - S_{\mathcal{A}}(t_i)$ etc.) The second inequality is optimal because we can imagine all 15 wrong answers of $t_i$ being counted as correct (by $\hat{\mathcal{A}}$) and 10 of its correct answers being counted as incorrect (by $\hat{\mathcal{A}}$).
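These two bounds can be computed directly from a single test's scores under the true key and the guessed key. A minimal sketch, using the numbers from the observation above ($n = 100$, true score 85, rescored 90); the function name is my own:

```python
def distance_bounds(n, score_true, score_guess):
    """Bounds on the Hamming distance between the true key A and a guess Â,
    given one test's score under each: n questions, S_A(t) = score_true,
    S_Â(t) = score_guess."""
    lower = abs(score_guess - score_true)            # at least this many misgraded
    upper = (n - score_true) + (n - score_guess)     # triangle inequality via t
    return lower, upper

print(distance_bounds(100, 85, 90))  # (5, 25)
```

With many tests $t_i$, one can take the maximum of the lower bounds and the minimum of the upper bounds to narrow down how far a candidate key can be from the truth.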
