How to start with regression analysis? 10 variables; 1M samples

Tags: multiple regression, multivariate analysis, multivariate regression, regression

My statistics knowledge is limited, and it appears that I have a task which would benefit from regression analysis. Please direct me.

I have around 10 variables (A, B, C, …) which might be related to X. Most of the variables, and X itself, are floats. One variable is binary, and another is categorical (on a nominal scale). I have to find the relationship between the variables and X in order to predict X for samples where I don't have its value. I guess that's what regression analysis is, and since I have multiple variables, it would be multiple regression.

First, I need to find which of these variables are actually relevant (related to X). How do I do that? Do I have to compute the correlation of each with X?

Second, how can I determine the relationship, to calculate X?

I have 1 million samples. I can use a random subset for training (1%?) and test the developed method on the rest. I have no preference on the method; it can be linear or non-linear.

I know some Python and Numpy/Scipy if it helps.

I understand that this is a general question, but regression seems to be a huge field, and I have no idea where to start. Any help or direction would be appreciated.

Best Answer

If you need a general introduction to approaches for regression or classification, An Introduction to Statistical Learning is a great choice if you have a bit of mathematical background.

The first 3 chapters cover the essentials of what you need to get started with your linear regressions. Chapter 5 covers ways to validate your model, going beyond the simple training and validation sets you propose in the question. Chapter 6 discusses approaches to selecting among your predictors, which gets to your question about "which of these variables are actually relevant"; this is more than just examining predictors that are individually related to your outcome variable, as @Scortchi pointed out in a comment. The exercises provided for each chapter give the opportunity to test your learning as you go.
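Since the question mentions Python and NumPy, here is a minimal sketch of what the steps above could look like in practice. It is not taken from the book. It fits an ordinary least-squares model with the nominal variable dummy-coded, estimates prediction error by k-fold cross-validation (one of the resampling approaches treated in Chapter 5), and then checks how the error changes when each float predictor is dropped. The data are synthetic and made up for illustration, and the helper names (`design_matrix`, `fit_ols`, `cv_rmse`) are hypothetical. Dropping one predictor at a time is only a crude stand-in for the selection methods in Chapter 6.

```python
import numpy as np

def design_matrix(floats, binary, category, n_levels):
    """Intercept + float columns + binary column + dummy-coded nominal column."""
    n = floats.shape[0]
    # One indicator column per level except level 0, which acts as the baseline.
    dummies = (category[:, None] == np.arange(1, n_levels)).astype(float)
    return np.column_stack([np.ones(n), floats, binary, dummies])

def fit_ols(X, y):
    """Least-squares coefficients beta for the model y ~ X @ beta."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def cv_rmse(X, y, k=5, seed=0):
    """k-fold cross-validated root-mean-square prediction error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    sq_err = 0.0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_ols(X[train], y[train])
        sq_err += np.sum((y[test] - X[test] @ beta) ** 2)
    return float(np.sqrt(sq_err / len(y)))

# Synthetic stand-in data (assumed for illustration, not from the question).
rng = np.random.default_rng(42)
n = 5000
floats = rng.normal(size=(n, 3))        # three float predictors
binary = rng.integers(0, 2, size=n)     # one binary predictor
category = rng.integers(0, 4, size=n)   # one nominal predictor with 4 levels
level_effect = np.array([0.0, 1.0, -2.0, 0.5])
# floats[:, 2] deliberately has no effect on the outcome.
y = (2.0 * floats[:, 0] - 1.0 * floats[:, 1] + 3.0 * binary
     + level_effect[category] + rng.normal(scale=0.5, size=n))

X_full = design_matrix(floats, binary, category, n_levels=4)
rmse_full = cv_rmse(X_full, y)
# Columns 1..3 hold the float predictors; drop each in turn and re-validate.
rmse_without = [cv_rmse(np.delete(X_full, 1 + j, axis=1), y) for j in range(3)]
print("full model CV RMSE:", rmse_full)
print("CV RMSE with each float predictor dropped:", rmse_without)
```

In this toy setup, dropping either of the two predictors that drive the outcome raises the cross-validated error noticeably, while dropping the irrelevant one leaves it essentially unchanged. That is the kind of evidence a simple correlation check against X can miss when predictors are related to each other.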

For a more advanced treatment, you can then graduate to the related book The Elements of Statistical Learning.

I always keep links to both of these readily accessible on my computer.
