Solved – Tips and tricks to get started with statistical modeling

bayesian, exploratory-data-analysis, modeling, references

I work in the field of data mining and have had very little formal schooling in statistics. Lately I have been reading a lot of work that focuses on Bayesian paradigms for learning and mining, which I find very interesting.

My question (in several parts) is: given a problem, is there a general framework by which one can construct a statistical model? What are the first things you do when given a dataset whose underlying process you'd like to model? Are there good books or tutorials out there that explain this process, or is it a matter of experience? Is inference at the forefront of your mind when constructing your model, or do you first aim to describe the data before you worry about how to use it for inference?

Any insight would be greatly appreciated!
Thanks.

Best Answer

In Statistics, like in Data Mining, you start with data and a goal. In statistics there is a lot of focus on inference, that is, answering population-level questions using a sample. In data mining the focus is usually prediction: you create a model from your sample (training data) in order to predict test data.
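
To make that contrast concrete, here is a minimal sketch of the prediction-oriented workflow in Python, using scikit-learn and synthetic data (both my own choices, not anything from the answer itself): fit on a training sample, then judge the model purely by how well it predicts held-out test data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data standing in for a real sample: y depends linearly on X plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Prediction-focused (data mining) workflow: hold out test data and
# evaluate the model by its out-of-sample error, not by p-values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))
```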

The process in statistics is then:

  1. Explore the data using summaries and graphs. Depending on how data-driven the statistician is, some will be more open-minded, looking at the data from all angles, while others (especially social scientists) will look at the data through the lens of the question of interest (e.g., plotting mainly the variables of interest and not others)

  2. Choose an appropriate statistical model family (e.g., linear regression for a continuous Y, logistic regression for a binary Y, or Poisson regression for count data), and perform model selection

  3. Estimate the final model

  4. Test the model assumptions to make sure they are reasonably met (different from testing for predictive accuracy in data mining)

  5. Use the model for inference; this is the main step that differs from data mining. The word "p-value" appears here. (A rough code sketch of steps 1-5 follows this list.)
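
As referenced above, here is a minimal sketch of these steps in Python, using pandas, statsmodels, matplotlib, and synthetic data of my own choosing (none of this comes from the answer itself): explore with summaries and plots, pick and estimate a candidate model, check the residuals against the model's assumptions, and read off confidence intervals and p-values for inference.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Synthetic data standing in for a real dataset: a continuous outcome y
# with one relevant predictor (x1) and one irrelevant one (x2).
rng = np.random.default_rng(1)
df = pd.DataFrame({"x1": rng.normal(size=150), "x2": rng.normal(size=150)})
df["y"] = 2.0 + 1.3 * df["x1"] + rng.normal(scale=0.7, size=150)

# Step 1: explore the data with summaries and graphs.
print(df.describe())
df.plot.scatter(x="x1", y="y")
plt.show()

# Steps 2-3: choose a model family (here, linear regression for a
# continuous y) and estimate the final model.
X = sm.add_constant(df[["x1", "x2"]])
results = sm.OLS(df["y"], X).fit()

# Step 4: check assumptions, e.g. residuals plotted against fitted values
# should show no pattern (roughly constant variance, no curvature).
plt.scatter(results.fittedvalues, results.resid)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Step 5: inference -- coefficient estimates, confidence intervals,
# and the p-values mentioned above.
print(results.summary())
```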

Take a look at any basic stats textbook and you'll find a chapter on exploratory data analysis, followed by chapters on distributions (which help in choosing reasonable approximating models), then inference (confidence intervals and hypothesis tests) and regression models.

I have described the classic statistical process. However, I have many issues with it. The focus on inference has completely dominated the field, while prediction (which is extremely important and useful) has been nearly neglected. Moreover, if you look at how social scientists use statistics for inference, you'll find that they use it quite differently! You can check out more about this here.