Solved – the difference between data mining and data fishing

data mining, terminology

What is the difference between data mining and data fishing (sometimes referred to as a fishing expedition)? If there is a difference, how can you tell the one from the other? And why would one be more “valuable” for research than the other?

Best Answer

There is plenty of overlap between these two concepts, so there is no sharp distinction. However, I will try to point out what I believe to be the differences.

In terms of statistical analysis, "a fishing expedition" just about always has a negative connotation. The idea is that the researchers started with one question about their data (e.g. "is there a linear relation between these two variables in our data?"). After coming up negative, they "recast their net" with a different question (e.g. "is there a quadratic relation between these two variables?") and so on until they finally found a "statistically significant" relation. The issue is that the researchers made many comparisons and reported only the top hit. Unless they adjusted their p-values for the multiple comparisons, the reported p-value overstates the evidence and the result is not valid.
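
To make the multiple-comparisons problem concrete, here is a minimal sketch of how quickly the chance of a spurious "top hit" grows. It assumes the tests are independent and that every null hypothesis is true (so each p-value is uniform on [0, 1]); real fishing expeditions usually involve correlated tests, so treat the numbers as illustrative. The function names are made up for this example, and Bonferroni is just one common adjustment, not necessarily the one a given study should use.

```python
import random


def familywise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive in m independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


def simulate_fishing(m: int, threshold: float, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of simulated studies whose smallest p-value falls below `threshold`.

    Every null is true here, so each p-value is drawn uniformly from [0, 1).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        best_p = min(rng.random() for _ in range(m))  # the reported "top hit"
        if best_p < threshold:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    alpha, m = 0.05, 20
    print("analytic, unadjusted: ", familywise_error_rate(alpha, m))
    print("simulated, unadjusted:", simulate_fishing(m, alpha))
    print("simulated, Bonferroni:", simulate_fishing(m, bonferroni_threshold(alpha, m)))
```

With 20 looks at the data and an unadjusted 0.05 threshold, a "significant" top hit turns up in roughly 64% of studies even though nothing is there; with the adjusted threshold the rate drops back to about 5%.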

In contrast, with data mining (done correctly) you start from the understanding that you do not know which hypothesis you want to test, but rather that you would like to search your data for interesting relations. As such, you will comb through your data and report the potentially interesting relations you find. It is important to note that this step is really hypothesis generating, rather than confirming. To establish that the interesting relations you found in your data set are not just due to random chance, they should be confirmed in a follow-up study (or, more generally, on independent data).
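
The generate-then-confirm workflow can be sketched as follows. This is only an illustration under stated assumptions: the data are pure noise, "interesting" is taken to mean the largest absolute Pearson correlation, and the independent data are approximated by a random half of the same sample rather than a true follow-up study. The names `pearson_r` and `explore_then_confirm` are hypothetical helpers, not part of any library.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def explore_then_confirm(features, outcome, seed=0):
    """Mine one half of the rows for the strongest relation, then re-check it on the other half.

    `features` is a list of rows (one list of feature values per row).
    Returns (index of chosen feature, r on exploration half, r on confirmation half).
    """
    n = len(outcome)
    rows = list(range(n))
    random.Random(seed).shuffle(rows)
    explore, confirm = rows[: n // 2], rows[n // 2:]

    def r_on(subset, j):
        return pearson_r([features[i][j] for i in subset], [outcome[i] for i in subset])

    # Hypothesis generation: search every feature, keep the most striking one.
    best = max(range(len(features[0])), key=lambda j: abs(r_on(explore, j)))
    # Hypothesis confirmation: test only that one relation on data not used in the search.
    return best, r_on(explore, best), r_on(confirm, best)


if __name__ == "__main__":
    rng = random.Random(1)
    # 200 rows, 30 features, all unrelated to the outcome by construction.
    features = [[rng.gauss(0, 1) for _ in range(30)] for _ in range(200)]
    outcome = [rng.gauss(0, 1) for _ in range(200)]
    best, r_explore, r_confirm = explore_then_confirm(features, outcome)
    print(f"feature {best}: r = {r_explore:.3f} when mined, r = {r_confirm:.3f} when confirmed")
```

Because the confirmation half is touched by exactly one pre-specified test, its p-value needs no multiple-comparison adjustment, which is what separates this from reporting the mined correlation directly.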

The similarity between data fishing and data mining is that in both cases you are inspecting a very large number of hypotheses in your data. If done correctly, data mining is not frowned upon, because it is acknowledged that you are doing this to generate interesting hypotheses to be tested later, whereas data fishing implies that the researcher did not confirm the final hypothesis they inspected in a new data set.
