Solved – Getting very large coefficients from linear regression

pandas, python, regression

I'm currently looking at rates for a study that vary between 0 and 100, with most of them falling between 0 and 1. I am running a linear regression on 70 dummy variables (coded 0/1) with nearly 100,000 observations. When I run the regression, the coefficients for the dummy variables and the intercept are in the region of 10E10 to 10E13. The predicted values do come out close to the actual rates (between 0 and 1 for the most part), but I feel like something is wrong with this analysis.

Is there something I might be missing that explains why the coefficients are coming out so high? I'm new to actually implementing regression and don't know whether something is wrong or whether this is just the result I should expect. I'd really appreciate any help with this.

Best Answer

Try dropping the observations that are close to 100 (or that otherwise sit on a scale far above the rest) and refit; this will give you a better understanding of the situation. If most values of the dependent variable are between 0 and 1, a few extreme values may be skewing the coefficients. The coefficients do seem too high, although that is theoretically possible if large positive coefficients are offset by negative coefficients of the same scale. Also check whether some of the 70 variables are too strongly correlated: excessive multicollinearity makes the estimates unstable.
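One common source of extreme collinearity with dummy variables is including a dummy for every level of a categorical variable together with an intercept, since the dummy columns then sum to the intercept column. Whether that is what happened in the question is an assumption; the sketch below only shows how to check for it on synthetic data (the sizes, seed, and variable names are made up for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one categorical variable with 5 levels, 1000 rows.
n_rows, n_levels = 1000, 5
levels = rng.integers(0, n_levels, size=n_rows)

# Full one-hot encoding: one 0/1 column per level.
dummies = np.eye(n_levels)[levels]
intercept = np.ones((n_rows, 1))

# Intercept + all dummies: the dummy columns add up to the intercept
# column, so the design matrix is rank deficient.
X_full = np.hstack([intercept, dummies])

# Intercept + all but one dummy: the dropped level becomes the baseline.
X_dropped = np.hstack([intercept, dummies[:, 1:]])

rank_full = np.linalg.matrix_rank(X_full)
rank_dropped = np.linalg.matrix_rank(X_dropped)
cond_full = np.linalg.cond(X_full)
cond_dropped = np.linalg.cond(X_dropped)

print("full:   ", X_full.shape[1], "columns, rank", rank_full, "cond", cond_full)
print("dropped:", X_dropped.shape[1], "columns, rank", rank_dropped, "cond", cond_dropped)
```

A rank below the number of columns, or a very large condition number, means the coefficients are not uniquely determined: a solver can return huge positive and negative values that cancel out in the predictions, which would match predictions that look reasonable despite enormous coefficients.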

Related Solutions

Solved – Very Large Values Predicted for Linear Regression with One Hot Encoding

There is at least one point that seems very suspicious.

Consider the lines

train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)

and

test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)

Specifically, the parts

pandas.get_dummies(train[non_numeric])

and

pandas.get_dummies(test[non_numeric])

Note that the output of get_dummies depends on the values present in each frame. Nothing guarantees that the columns generated for train and for test are the same, so it is hard to guess the effect on the predictions for the test data.

In general, when performing get_dummies, it is better to do it before train/test splits (including cross-validation). This is an unsupervised transformation anyway, so it is not "peeking".
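A minimal sketch of the difference, using small made-up frames (the column names `color` and `x` and their values are hypothetical, not from the question). Encoding each frame separately produces mismatched columns, while encoding the concatenated data and splitting afterwards keeps them identical.

```python
import pandas

# Hypothetical frames; "blue" only appears in test, "green" only in train.
train = pandas.DataFrame({"color": ["red", "green", "red"], "x": [1.0, 2.0, 3.0]})
test = pandas.DataFrame({"color": ["blue", "red"], "x": [4.0, 5.0]})

# Encoding separately: each frame only gets columns for the levels it contains.
train_sep = pandas.get_dummies(train, columns=["color"])
test_sep = pandas.get_dummies(test, columns=["color"])
print(list(train_sep.columns))
print(list(test_sep.columns))

# Encoding jointly: stack the frames, encode once, then split again.
combined = pandas.concat([train, test], keys=["train", "test"])
encoded = pandas.get_dummies(combined, columns=["color"])
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
print(list(train_enc.columns))
print(list(test_enc.columns))
```

If the test data is not available at training time, an alternative is to store the training columns and reindex the encoded test frame to them, filling missing columns with 0.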

Bayesian Regression – Handling Extremely Large Credible Intervals and Standard Deviation in Python

The bug is not in your implementation of Bayesian linear regression but in how you sample the errors in Y.

Aside: You don't cite Pattern Recognition and Machine Learning by Bishop properly.

In Section 3.3 on Bayesian linear regression, Bishop assumes that the noise variance is known and that the noise is Gaussian, so that the targets have a Normal distribution given the predictors. You violate both of these assumptions by sampling error terms from a mixture of uniform(-0.2, 0) and a spike at 0.

This is your error distribution; it's idiosyncratic to say the least.

errors = []
for x in X:
    errors.append(0.2 * np.random.randint(-1, 1) * np.random.rand())

You can notice the issue in your original plot as an unexpected upper bound on the spread of the observations.
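To see why the errors are bounded above, note that `np.random.randint(-1, 1)` excludes the upper endpoint and so only returns -1 or 0. The following vectorized sketch draws from the same distribution as the loop above (the sample size and seed are arbitrary choices) and summarizes it.

```python
import numpy as np

np.random.seed(0)
n = 10_000

# Same distribution as the loop: 0.2 * (-1 or 0) * uniform(0, 1).
signs = np.random.randint(-1, 1, size=n)  # upper bound is exclusive
errors = 0.2 * signs * np.random.rand(n)

share_zero = np.mean(errors == 0)

print("distinct sign values:", np.unique(signs))
print("min error:", errors.min())
print("max error:", errors.max())
print("mean error:", errors.mean())
print("share exactly zero:", share_zero)
```

The errors are never positive and roughly half of them are exactly zero, so the observations can only sit on or below the true regression line.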

And here is how to sample the errors properly.

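import numpy as np

alpha0 = 2.5  # true intercept
alpha1 = 0.8  # true slope
beta = 1  # noise precision (inverse variance)
n = 1000  # number of observations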

# The predictor has uniform(0, 1) distribution.
# Predictors can have any distribution actually.
X = np.random.rand(n, 1)

# The errors have normal distribution.
error = np.random.randn(n, 1) / np.sqrt(beta)
y = alpha0 + alpha1 * X + error  # Equation (3.10) in Bishop

The rest of your code works fine, so I only attach an updated plot of the posterior.
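For readers without the original code, here is a rough sketch of the posterior computation on data simulated as above. It assumes a zero-mean isotropic Gaussian prior on the weights and uses the standard closed form for that case; `prior_precision` and its value are my own hypothetical choice, not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data as above: Gaussian noise with known precision beta.
alpha0, alpha1, beta, n = 2.5, 0.8, 1.0, 1000
X = rng.random((n, 1))
error = rng.standard_normal((n, 1)) / np.sqrt(beta)
y = alpha0 + alpha1 * X + error

# Design matrix with an intercept column.
Phi = np.hstack([np.ones((n, 1)), X])

# Assumed prior: weights ~ Normal(0, I / prior_precision).
prior_precision = 1e-2

# Posterior over the weights is Normal(m_N, S_N) with
#   S_N^{-1} = prior_precision * I + beta * Phi^T Phi
#   m_N      = beta * S_N Phi^T y
S_N_inv = prior_precision * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ y
posterior_sd = np.sqrt(np.diag(S_N))

print("posterior mean:", m_N.ravel())
print("posterior sd:  ", posterior_sd)
```

With Gaussian errors the posterior mean should land near the true intercept and slope, and the posterior standard deviations should be small at this sample size.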
