1) Flag all responses with duplicate IP addresses. Create a new variable for this purpose -- say FLAG1, which takes on values of 1 or 0.
2) Choose a threshold for an impossibly fast response time based on common sense (e.g., less than 1 second per question) and the aid of a histogram of response times, then flag people faster than this threshold using another 1/0 variable, FLAG2.
3) "Some respondents clearly randomly clicked through..." -- Apparently you can manually identify some respondents who cheated. Sort the data by response time and look at the fastest 5% or 10% (25 or 50 respondents for your data). Manually examine these respondents and flag any "clearly random" ones using FLAG3.
4) Apply Sheldon's suggestion by creating an inconsistency score -- 1 point for each inconsistency. You can do this by creating a new variable that identifies inconsistencies for each pair of redundant items, and then adding across these variables. You could keep this variable as is, as higher inconsistency scores obviously correspond to higher probabilities of cheating. But a reasonable approach is to flag people who fall above a cut-off chosen by inspecting a histogram -- call this FLAG4.
Anyone who is flagged on all of FLAG1-FLAG4 is highly likely to have cheated, but you can also set flagged people aside for a separate analysis based on any weighting scheme of FLAG1-FLAG4 you want. Given your stated tolerance for false positives, I would eliminate anyone flagged on FLAG1, FLAG2, or FLAG4.
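The four flags above can be sketched in pandas. This is a minimal illustration, not your actual pipeline: the column names (`ip_address`, `total_time_sec`, the reverse-coded pair `item_a`/`item_a_rev`), the toy data, and the cut-offs are all hypothetical stand-ins you would replace with your own.

```python
import pandas as pd

# Hypothetical data and column names -- substitute your own.
df = pd.DataFrame({
    "ip_address":     ["1.1.1.1", "1.1.1.1", "2.2.2.2", "3.3.3.3"],
    "total_time_sec": [40, 35, 600, 900],
    "item_a":         [1, 5, 4, 4],  # redundant pair: item_a / item_a_rev
    "item_a_rev":     [5, 5, 2, 2],  # reverse-coded on a 1-5 scale
})
N_QUESTIONS = 50
MIN_SEC_PER_Q = 1  # common-sense threshold from step 2

# FLAG1: any response sharing an IP address with another response
df["FLAG1"] = df.duplicated("ip_address", keep=False).astype(int)

# FLAG2: impossibly fast total response time
df["FLAG2"] = (df["total_time_sec"] < MIN_SEC_PER_Q * N_QUESTIONS).astype(int)

# FLAG3: select the fastest 5% for manual review -- code only picks candidates;
# you set FLAG3 by hand after inspecting df[fastest]
fastest = df["total_time_sec"].rank(pct=True) <= 0.05
df["FLAG3"] = 0

# FLAG4: inconsistency score across redundant item pairs, then a cut-off.
# Here one reverse-coded pair; a mismatch of more than 1 scale point counts.
inconsistent = ((df["item_a"] - (6 - df["item_a_rev"])).abs() > 1).astype(int)
df["inconsistency_score"] = inconsistent  # in general, sum across all pairs
df["FLAG4"] = (df["inconsistency_score"] >= 1).astype(int)

# Elimination rule from the last paragraph: drop anyone with FLAG1, FLAG2, or FLAG4
clean = df[df[["FLAG1", "FLAG2", "FLAG4"]].sum(axis=1) == 0]
```

With more redundant pairs, you would build one indicator per pair and sum them into `inconsistency_score`, then pick the FLAG4 cut-off from a histogram of that score.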
Welcome to Cross Validated!
As user2974951 points out, a random variable could "...conceptually represent... the subjective randomness that results from incomplete knowledge of a quantity". That is, it 'seems' random to you because you simply don't have enough information about the factors that actually influence or determine the outcome.
So that answers the first part of your question: "Are the values of "reading_proficiency" truly determined by chance?"
Basically - regardless of what truly determines "reading proficiency", you do not have the information to know everything about it. It can take on many different values, with a particular distribution, and you do not know what value it will take on in a particular case because you don't have enough information. So from that standpoint, it's certainly a random variable. We're getting a bit into philosophy here, but essentially: if there's some variable/process that generates different results and you do not know everything about what determines an outcome (meaning you have uncertainty), then it is a random variable. If this point is unclear, please ask in comments and we can discuss more.
Let's move on to the second part of your question: "Are two observations of "reading_proficiency" truly independent of one another?"
This is an interesting question (given your clustering concerns). However, let's be clear - I don't think "independence" is a prerequisite for being a random variable. For example, you can see clustering in almost all distributions - they tend to be clustered around the mean. So whatever process is generating that data must, in some way, result in most data points being quite connected/similar and close to the mean. You don't need sample independence for it to be a random variable.
So all in all, yes, your survey data is a random variable.
Perhaps you're in some way starting to confuse "random variable" with "random sample"? It's with taking "random samples" that you need to think about whether the individual samples are independent, and that's where your clustering concerns might come in.
Best Answer
This is a good situation for a sensitivity analysis. Analyze your data in each of three ways, then compare the results, sharing any rationale you can develop as to which results might be more accurate, or more accurate in certain respects.
You can also investigate the range of ways in which the logicals and illogicals differ, if any. Do the illogicals tend to report higher incomes? To show greater support for certain ideas or programs? To skip more questions? To evince more bias in the sense of straightlining or disproportionately choosing middle responses or extreme responses?
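A quick way to scan for such differences is a group comparison between logicals and illogicals. A minimal sketch, assuming a hypothetical 0/1 flag `illogical` and made-up columns `income` and `n_skipped` standing in for whatever outcomes you want to compare:

```python
import pandas as pd

# Hypothetical data: `illogical` marks the flagged respondents
df = pd.DataFrame({
    "illogical": [0, 0, 0, 1, 1, 1],
    "income":    [40_000, 55_000, 60_000, 80_000, 75_000, 90_000],
    "n_skipped": [0, 1, 0, 3, 2, 4],
})

# Group means for each candidate difference (incomes, skips, etc.)
comparison = df.groupby("illogical")[["income", "n_skipped"]].mean()
print(comparison)
```

The same pattern extends to straightlining or extreme-response measures: compute one score per respondent, add it to the column list, and compare group means (or run formal tests on the differences).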
With about 400 of these illogicals, you have enough data even to assess the relationship between the degree of illogicality and the degree of a given type of bias - something like a dose-response relationship.
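One way to quantify such a dose-response relationship is a rank correlation between the inconsistency count ("dose") and a bias measure ("response"). A sketch with invented numbers, where `pct_straightlined` is a hypothetical per-respondent straightlining measure:

```python
import pandas as pd

# Hypothetical: inconsistency count (dose) vs. a straightlining measure (response)
df = pd.DataFrame({
    "inconsistency_score": [0, 1, 2, 3, 4, 5],
    "pct_straightlined":   [0.05, 0.08, 0.15, 0.20, 0.35, 0.50],
})

# Spearman rank correlation: a monotone dose-response shows up as rho near 1
rho = df["inconsistency_score"].corr(df["pct_straightlined"], method="spearman")
print(round(rho, 2))
```

Spearman is a reasonable default here because it only assumes the relationship is monotone, not linear; with 400 illogicals you could also bin by score and plot the mean bias per bin.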
What you learn from these investigations might be fed back into your plan for dealing with the illogicals when it comes to the main analyses of interest.