Solved – the difference between data mining and statistical analysis

Tags: analysis, data mining, terminology

What is the difference between data mining and statistical analysis?

For some background, my statistical education has been, I think, rather traditional. A specific question is posed, research is designed, and data are collected and analyzed to offer some insight into that question. As a result, I've always been skeptical of what I considered "data dredging", i.e. looking for patterns in a large dataset and using these patterns to draw conclusions. I tend to associate the latter with data mining and have always considered it somewhat unprincipled (along with things like algorithmic variable selection routines).

Nonetheless, there is a large and growing literature on data mining. Often, I see this label referring to specific techniques like clustering, tree-based classification, etc. Yet, at least from my perspective, these techniques can be "set loose" on a set of data or used in a structured way to address a question. I'd call the former data mining and the latter statistical analysis.
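To make the "set loose" worry concrete, here is a minimal sketch of data dredging on pure noise. Everything in it is an illustrative assumption rather than anything from a real dataset: the function names `pearson` and `dredge`, the sizes (50 rows, 200 candidate variables), and the cutoff of 0.279, which is approximately the two-sided 5% critical value of a Pearson correlation with 50 observations. Because the outcome and every candidate variable are independent random draws, any variable that clears the cutoff is a false discovery.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def dredge(n_rows=50, n_vars=200, r_crit=0.279, seed=0):
    """Screen n_vars pure-noise variables against a pure-noise outcome.

    r_crit is (approximately) the two-sided 5% cutoff for |r| when
    n_rows is 50; it is an assumption tied to that sample size.
    Returns a list of (variable_index, r) pairs that pass the cutoff.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = []
    for j in range(n_vars):
        # Each candidate is generated independently of the outcome,
        # so there is no real relationship to find.
        candidate = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson(candidate, outcome)
        if abs(r) > r_crit:
            hits.append((j, r))
    return hits


hits = dredge()
print(f"{len(hits)} of 200 unrelated variables pass the 5% cutoff")
```

With a 5% cutoff, roughly one in twenty unrelated variables is expected to pass by chance, so an unstructured scan of 200 candidates will tend to return a handful of "findings". A single pre-specified comparison carries only the nominal 5% risk, which is the practical difference between the two ways of using the same technique.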

I work in academic administration and have been asked to do some "data mining" to identify issues and opportunities. Consistent with my background, my first questions were: what do you want to learn, and what are the things that you think contribute to the issue? From their response, it was clear that the person asking and I had different ideas about the nature and value of data mining.

Best Answer

Jerome Friedman wrote a paper a while back, "Data Mining and Statistics: What's the Connection?", which I think you'll find interesting.

Data mining was a largely commercial concern, driven by business needs (coupled with the "need" for vendors to sell software and hardware systems to businesses). One thing Friedman noted was that all the "features" being hyped originated outside of statistics -- from algorithms and methods like neural nets to GUI-driven data analysis -- and none of the traditional statistical offerings (regression, hypothesis testing, etc.) seemed to be a part of any of these systems. "Our core methodology has largely been ignored." It was also sold as user-driven, along the lines of what you noted: here's my data, here's my "business question", give me an answer.

I think Friedman was trying to provoke. He didn't think data mining had serious intellectual underpinnings where methodology was concerned, but that this would change and statisticians ought to play a part rather than ignoring it.

My own impression is that this has more or less happened. The lines have been blurred. Statisticians now publish in data mining journals. Data miners these days seem to have some sort of statistical training. While data mining packages still don't hype generalized linear models, logistic regression is well known among the analysts -- in addition to clustering and neural nets. Optimal experimental design may not be part of the data mining core, but the software can be coaxed to spit out p-values. Progress!
