Solved – How to identify invalid online survey responses

classification, survey

I have a set of around 500 responses to an online survey that offered an incentive to complete. While most of the data appears to be valid, it's clear that some people were able to get around the (inadequate) browser-cookie-based duplicate-submission protection. Some respondents clearly clicked through the survey at random to receive the incentive, then repeated the process via a couple of methods. My question is: what is the best way to filter out the invalid responses?

The information that I have is limited to:

  • The amount of time it took to complete the survey (time started and ended)
  • The IP address of each respondent
  • The User Agent (browser identifier) of each respondent
  • The survey answers of each respondent (over 100 questions in the survey)

The most obvious sign of an invalid response is a group of responses (when sorted by start time) all from the same IP address, or from similar IPs sharing the same first three octets (e.g., 255.255.255.*), completed in quick succession and in much less time than the overall average.
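For concreteness, here is roughly how I've been spotting these groups so far, as a pandas sketch. The column names `ip`, `time_started`, and `time_ended`, the file name, and the "less than half the mean duration" cut-off are all just placeholders for my setup:

```python
import pandas as pd

df = pd.read_csv("responses.csv", parse_dates=["time_started", "time_ended"])

# Group by the /24 prefix (the first three octets of the IP address).
df["ip_prefix"] = df["ip"].str.rsplit(".", n=1).str[0]
df["duration_s"] = (df["time_ended"] - df["time_started"]).dt.total_seconds()

# Suspicious groups: multiple responses from the same prefix, each completed
# in much less time than the overall average (0.5x is an arbitrary cut-off).
mean_duration = df["duration_s"].mean()
suspicious = (
    df.groupby("ip_prefix")
      .filter(lambda g: len(g) > 1 and (g["duration_s"] < 0.5 * mean_duration).all())
      .sort_values("time_started")
)
print(suspicious[["ip", "time_started", "duration_s"]])
```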

With this information, there must be a thoughtful way to separate the people who exploited the survey for the incentive from the rest of the survey population. I know someone from the community here will have an interesting idea about how to approach this. I'm willing to accept false positives as long as I can be confident that I've removed most of the invalid responses. Thanks for your advice!

Best Answer

1) Flag all responses with duplicate IP addresses. Create a new variable for this purpose -- say FLAG1, which takes on values of 1 or 0.
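A minimal sketch of FLAG1, assuming your data is in a pandas DataFrame `df` with an `ip` column as in the sketch under your question:

```python
# Mark every row whose IP appears more than once (keep=False flags
# all copies, not just the later ones).
df["FLAG1"] = df["ip"].duplicated(keep=False).astype(int)
```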

2) Choose a threshold for an impossibly fast response time based on common sense (e.g., less than 1 second per question) and the aid of a histogram of response times -- flag respondents faster than this threshold with another variable, FLAG2.
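Something like this, continuing with the `duration_s` column from the earlier sketch; the 100-second threshold is just the 1-second-per-question rule of thumb for your 100+ questions, and should be tuned to the histogram:

```python
import matplotlib.pyplot as plt

# Inspect the distribution of completion times before picking a cut-off.
df["duration_s"].hist(bins=50)
plt.xlabel("completion time (s)")
plt.show()

THRESHOLD_S = 100  # ~1 second per question; adjust after seeing the histogram
df["FLAG2"] = (df["duration_s"] < THRESHOLD_S).astype(int)
```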

3) "Some respondents clearly randomly clicked through..." -- Apparently you can manually identify some respondents who cheated. Sort the data by response time and look at the fastest 5% or 10% (25 or 50 respondents for your data). Manually examine these respondents and flag any "clearly random" ones using FLAG3.

4) Apply Sheldon's suggestion by creating an inconsistency score -- 1 point for each inconsistency. You can do this by creating a new variable that identifies inconsistencies for each pair of redundant items, and then adding across these variables. You could keep this variable as is, as higher inconsistency scores obviously correspond to higher probabilities of cheating. But a reasonable approach is to flag people who fall above a cut-off chosen by inspecting a histogram -- call this FLAG4.
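A sketch of the inconsistency score. `REDUNDANT_PAIRS` is a hypothetical list of column pairs that should agree (e.g., an item and its restated twin); what counts as "inconsistent" depends on your items, so the simple inequality below is only illustrative:

```python
import matplotlib.pyplot as plt

REDUNDANT_PAIRS = [("q12", "q57"), ("q30", "q88")]  # placeholder pairs

# One point per inconsistent pair, summed across all pairs.
df["inconsistency"] = 0
for a, b in REDUNDANT_PAIRS:
    df["inconsistency"] += (df[a] != df[b]).astype(int)

# Pick the cut-off by inspecting the distribution of scores.
df["inconsistency"].hist()
plt.show()

CUTOFF = 2  # choose from the histogram
df["FLAG4"] = (df["inconsistency"] > CUTOFF).astype(int)
```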

Anyone flagged on all of FLAG1-4 is highly likely to have cheated, but you can also set aside flagged respondents for a separate analysis based on any weighting of FLAG1-4 you like. Given your tolerance for false positives, I would eliminate anyone flagged on FLAG1, FLAG2, or FLAG4.
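To make the combination rule concrete, a sketch continuing with the flags above (the equal-weight score is just one choice of weighting):

```python
flags = ["FLAG1", "FLAG2", "FLAG3", "FLAG4"]
df["flag_score"] = df[flags].sum(axis=1)  # equal weights; substitute your own

# The simple rule given the stated tolerance for false positives:
clean = df[(df["FLAG1"] == 0) & (df["FLAG2"] == 0) & (df["FLAG4"] == 0)]
suspect = df.drop(clean.index)  # set aside for a separate analysis
```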
