What is the difference between data mining and data fishing (sometimes referred to as a fishing expedition)? If there is a difference, how can you tell the one from the other? And why would one be more “valuable” for research than the other?
Solved – the difference between data mining and data fishing
data miningterminology
Related Solutions
Jerome Friedman wrote a paper a while back: Data Mining and Statistics: What's the Connection?, which I think you'll find interesting.
Data mining was a largely commercial concern and driven by business needs (coupled with the "need" for vendors to sell software and hardware systems to businesses). One thing Friedman noted was that all the "features" being hyped originated outside of statistics -- from algorithms and methods like neural nets to GUI driven data analysis -- and none of the traditional statistical offerings seemed to be a part of any of these systems (regression, hypothesis testing, etc). "Our core methodology has largely been ignored." It was also sold as user driven along the lines of what you noted: here's my data, here's my "business question", give me an answer.
I think Friedman was trying to provoke. He didn't think data mining had serious intellectual underpinnings where methodology was concerned, but that this would change and statisticians ought to play a part rather than ignoring it.
My own impression is that this has more or less happened. The lines have been blurred. Statisticians now publish in data mining journals. Data miners these days seem to have some sort of statistical training. While data mining packages still don't hype generalized linear models, logistic regression is well known among the analysts -- in addition to clustering and neural nets. Optimal experimental design may not be part of the data mining core, but the software can be coaxed to spit out p-values. Progress!
There is considerable overlap among these, but some distinctions can be made. Of necessity, I will have to over-simplify some things or give short-shrift to others, but I will do my best to give some sense of these areas.
Firstly, Artificial Intelligence is fairly distinct from the rest. AI is the study of how to create intelligent agents. In practice, it is how to program a computer to behave and perform a task as an intelligent agent (say, a person) would. This does not have to involve learning or induction at all, it can just be a way to 'build a better mousetrap'. For example, AI applications have included programs to monitor and control ongoing processes (e.g., increase aspect A if it seems too low). Notice that AI can include darn-near anything that a machine does, so long as it doesn't do it 'stupidly'.
In practice, however, most tasks that require intelligence require an ability to induce new knowledge from experiences. Thus, a large area within AI is machine learning. A computer program is said to learn some task from experience if its performance at the task improves with experience, according to some performance measure. Machine learning involves the study of algorithms that can extract information automatically (i.e., without on-line human guidance). It is certainly the case that some of these procedures include ideas derived directly from, or inspired by, classical statistics, but they don't have to be. Similarly to AI, machine learning is very broad and can include almost everything, so long as there is some inductive component to it. An example of a machine learning algorithm might be a Kalman filter.
Data mining is an area that has taken much of its inspiration and techniques from machine learning (and some, also, from statistics), but is put to different ends. Data mining is carried out by a person, in a specific situation, on a particular data set, with a goal in mind. Typically, this person wants to leverage the power of the various pattern recognition techniques that have been developed in machine learning. Quite often, the data set is massive, complicated, and/or may have special problems (such as there are more variables than observations). Usually, the goal is either to discover / generate some preliminary insights in an area where there really was little knowledge beforehand, or to be able to predict future observations accurately. Moreover, data mining procedures could be either 'unsupervised' (we don't know the answer--discovery) or 'supervised' (we know the answer--prediction). Note that the goal is generally not to develop a more sophisticated understanding of the underlying data generating process. Common data mining techniques would include cluster analyses, classification and regression trees, and neural networks.
I suppose I needn't say much to explain what statistics is on this site, but perhaps I can say a few things. Classical statistics (here I mean both frequentist and Bayesian) is a sub-topic within mathematics. I think of it as largely the intersection of what we know about probability and what we know about optimization. Although mathematical statistics can be studied as simply a Platonic object of inquiry, it is mostly understood as more practical and applied in character than other, more rarefied areas of mathematics. As such (and notably in contrast to data mining above), it is mostly employed towards better understanding some particular data generating process. Thus, it usually starts with a formally specified model, and from this are derived procedures to accurately extract that model from noisy instances (i.e., estimation--by optimizing some loss function) and to be able to distinguish it from other possibilities (i.e., inferences based on known properties of sampling distributions). The prototypical statistical technique is regression.
Best Answer
There is plenty of overlap between these two concepts, so there is not a clear distinction. However, I try to point out what I believe to be the differences.
In terms of statistical analysis, "a fishing expedition" just about always has a negative connotation; the idea being that the researchers started with one question about their data (i.e. "is there a linear relation between these two variables in our data?"). After coming up negative, they "recast their net" with a different question (i.e. "is there a quadratic relation between these two variables?") and so on until they finally find a "statistical significant" relation. Of course, the issue here is that the researcher did many comparisons and reported the top hit. Assuming they did not adjust their p-values for the multiple comparisons, this result will not be valid.
In contrast, with data mining (done correctly) you are starting with the understanding that you do not know which hypothesis you want to test in your data, but rather that you would like to search your data for interesting relations. As such, you will comb through your data and look for potentially interesting relations that will be reported. It is important to note that this step is really hypothesis generating, rather than confirming; to really decisively decide that the interesting relations you found in your data set are not just due to random chance, they should be confirmed in a follow up study (or moreover, independent data).
The similarities between data-fishing and data-mining is that in both cases you are inspecting a very large number of hypotheses from your data. If done correctly, data-mining is not frowned upon because it is acknowledged that you are doing this to generate interesting hypotheses to be tested later, where as data-fishing implies that the researcher did not confirm the final hypothesis they inspected in a new data set.