Solved – Multiple linear regression on a data set with Python

python, r, regression

I'll preface my question with the fact that I'm just learning about linear regression, so I may be thinking about this wrong.

I have a set of data with one dependent variable and about 10 independent variables, and the data set is growing regularly. It's stored as rows in a database, with 10 columns of independent variables and one column for the dependent variable. You can see my previous question for an example of what I'm trying to do: Variables importance: who can do the most pushups?

The output of a linear regression is a formula, right?

Now I want to write a Python script (I could use R as well, but I'd greatly prefer Python) that takes this data as input and outputs the linear regression formula. Is there a Python method to do this? Do I need to run a regression comparing each independent variable with the dependent variable one at a time, or is there a Python method where I can feed in the data with all 10 independent variables and get a formula out?

Best Answer

The simplest output from a regression is a set of coefficients, but that is not sufficient for a true regression analysis. You say that you could use R; R has a built-in dataset called "anscombe" (after Francis Anscombe, who created the data). Use that dataset and fit a regression of y1 vs. x1, then regressions of y2 vs. x2, y3 vs. x3, and y4 vs. x4.

Compare the coefficients (the formulas) for the four regressions and think about what your conclusions are. Now plot each pair of data and compare the plots. How does the comparison of the plots square with the comparison of the regression models?

You could also look up Anscombe's quartet on Wikipedia or Google, but it is much more informative to do it yourself.
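
If you would rather work through the exercise in Python than R, here is a minimal sketch of one way to do it. It assumes the copy of Anscombe's quartet bundled with the seaborn library (the `dataset`, `x`, and `y` column names are seaborn's, not part of the original answer) and uses statsmodels for the fits:

```python
import seaborn as sns
import statsmodels.formula.api as smf

# Anscombe's quartet as bundled with seaborn: columns 'dataset', 'x', 'y',
# where 'dataset' takes the values 'I', 'II', 'III', 'IV'.
anscombe = sns.load_dataset("anscombe")

for name, group in anscombe.groupby("dataset"):
    fit = smf.ols("y ~ x", data=group).fit()
    # The coefficients (the "formula") come out nearly identical for all four sets...
    print(name, fit.params.round(3).to_dict())

# ...but the scatter plots tell four very different stories.
sns.lmplot(data=anscombe, x="x", y="y", col="dataset", col_wrap=2, ci=None)
```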

A more complete regression analysis will include not only the coefficients but also things like residuals, fitted values, standard errors, confidence intervals, diagnostic plots, etc. (exactly what is needed depends on the specific data, the science, and the questions being asked). The above can be produced with about four lines of code in R; I don't know how much Python code it would take, but there may well be prewritten Python libraries that do the same in far fewer lines than coding it from scratch.
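
On the Python side, one option (my assumption; the answer does not name a library) is statsmodels, which gives roughly the same few-lines workflow as R. A minimal sketch, assuming your table has been exported to a CSV with a response column `y` and predictor columns `x1` through `x10` (hypothetical names):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical: load the table exported from your database.
df = pd.read_csv("pushups.csv")  # columns: y, x1, x2, ..., x10

# Fit all ten predictors at once; the R-style formula lists them explicitly.
model = smf.ols("y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10", data=df)
fit = model.fit()

print(fit.summary())      # coefficients, standard errors, t-stats, confidence intervals
print(fit.params)         # just the coefficients, i.e. the "formula"
print(fit.fittedvalues)   # fitted values
print(fit.resid)          # residuals
```

Here `fit.summary()` prints the coefficients with their standard errors and confidence intervals in a single table, which covers most of the items listed above; diagnostic plots take a few extra lines on top of that.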

Also, unless your predictor variables (the independent variables, though I don't like the independent/dependent terminology) are perfectly orthogonal to each other, you will get different coefficients when you fit them one at a time than when you fit them all together; and for any dataset of real interest, the other important quantities (standard errors, etc.) will also differ between the two approaches.
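
A quick way to see this last point is to simulate two correlated predictors and compare the two fitting strategies. The sketch below is only an illustration (simulated data, with numpy and statsmodels assumed), not part of the original answer:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Two correlated (non-orthogonal) predictors and a response that depends on both.
x1 = rng.normal(size=200)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=200)   # correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=200)

# Fit each predictor on its own (slope is the second parameter after the intercept).
b1_alone = sm.OLS(y, sm.add_constant(x1)).fit().params[1]
b2_alone = sm.OLS(y, sm.add_constant(x2)).fit().params[1]

# Fit both predictors together.
both = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params[1:]

print("one at a time:", round(b1_alone, 2), round(b2_alone, 2))
print("together:     ", both.round(2))
```

The one-at-a-time slopes absorb part of each other's effect, while the joint fit recovers coefficients close to the values used to generate the data.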