Solved – Getting very large coefficients from linear regression

pandas, python, regression

I'm currently looking at rates for a study that vary between 0 and 100, with most of the rates falling between 0 and 1. I am running a linear regression on 70 dummy variables (coded 0-1) with nearly 100,000 observations. When I run the regression, the coefficients I get for each of the dummy variables and the intercept are in the region of 10E10 to 10E13. The predicted values from this regression do come out close to the actual rates (somewhere between 0 and 1 for the most part), but I feel like something is wrong with this analysis.

Is there something I might be missing that explains why the coefficients for each variable are coming out so high? I'm new to actually implementing regression and don't know whether something is wrong or this is just the result I should expect. I'd really appreciate any help with this.

Best Answer

Try dropping the observations that are close to 100 (or otherwise on a scale far above the rest) and see what happens; this will give you a better understanding of the situation. If most of the dependent variable's values are between 0 and 1, a few extreme values may be skewing the coefficients. Those coefficients do seem too high, although it is theoretically possible if large positive coefficients are offset by negative coefficients of the same scale. Also check whether some of those 70 variables are too strongly correlated with each other: with excessive multicollinearity (extremely high correlations among the predictors), the estimates become unstable.
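Both checks can be sketched in a few lines of NumPy. This is only an illustration on made-up data, not the asker's dataset: the 4-level categorical variable, the sample size, the five rates set to 95, and the cutoff of 10 are all assumptions. It also assumes one specific form of multicollinearity that is common with dummy variables, namely including a dummy for every category together with an intercept, so that the dummy columns sum to the intercept column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one categorical variable with k levels, rates mostly in [0, 1].
n, k = 1000, 4
levels = rng.integers(0, k, size=n)
dummies = np.eye(k)[levels]                      # one 0/1 column per level
y = 0.2 + 0.1 * levels + rng.normal(0.0, 0.05, size=n)
y[:5] = 95.0                                     # a few extreme rates (assumed)

# Check 1: collinearity among the predictors.
# Intercept plus ALL k dummies: the dummy columns add up to the intercept column.
X_full = np.column_stack([np.ones(n), dummies])
# Intercept plus k-1 dummies: the first level acts as the baseline.
X_drop = np.column_stack([np.ones(n), dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)        # smaller than the column count
rank_drop = np.linalg.matrix_rank(X_drop)        # equal to the column count
cond_full = np.linalg.cond(X_full)               # enormous (or inf)
cond_drop = np.linalg.cond(X_drop)               # modest

# Check 2: refit without the extreme observations and compare coefficients.
keep = y < 10.0                                  # cutoff is an assumption
coef_all, *_ = np.linalg.lstsq(X_drop, y, rcond=None)
coef_trim, *_ = np.linalg.lstsq(X_drop[keep], y[keep], rcond=None)

print("rank (all dummies):", rank_full, "of", X_full.shape[1], "columns")
print("rank (baseline dropped):", rank_drop, "of", X_drop.shape[1], "columns")
print("condition numbers:", cond_full, cond_drop)
print("coefficients, all rows:    ", coef_all)
print("coefficients, trimmed rows:", coef_trim)
```

If the rank of the design matrix is below its number of columns, or the condition number is huge, the individual coefficients are not pinned down by the data, and a solver can return very large values that cancel each other while the predictions still look reasonable. With pandas, `pd.get_dummies(..., drop_first=True)` is one way to build the dummies with a baseline category left out.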
