Solved – Very Large Values Predicted for Linear Regression with One Hot Encoding

categorical-data, python, regression, scikit-learn

I'm trying to run a linear regression in Python to predict house prices from many features, some numeric and some non-numeric. I'm one hot encoding the non-numeric columns, attaching the new numeric columns to the old dataframe, and dropping the non-numeric columns. This is done on both the training data and the test data.

I then take the intersection of the two sets of feature columns (since some encoded columns appear only in the test data). Afterwards, the data goes into a linear regression. The code is the following:

import numpy
import pandas
from sklearn.linear_model import LinearRegression

# One hot encode the non-numeric columns of the training data, then drop
# the original non-numeric columns.
non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)

train = train._get_numeric_data()
train.fillna(0, inplace=True)

# Same treatment for the test data.
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)

test = test._get_numeric_data()
test.fillna(0, inplace=True)

# Use only the features present in both frames; 'SalePrice' drops out of
# the intersection automatically since it only exists in the training data.
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']

lm = LinearRegression()
lm.fit(X, y)

# Predict using the same feature columns the model was fit on.
predictions = numpy.absolute(lm.predict(test[feature_columns]).round(decimals=2))

The issue I'm having is that I get absurdly high sale prices as output, somewhere in the hundreds of millions of dollars (sometimes even in the trillions). Before I tried one hot encoding I got reasonable numbers in the hundreds of thousands of dollars, and I'm having trouble figuring out what changed. I posted this on Stack Overflow and got a response suggesting it might be a collinearity issue, but neither setting the fit_intercept parameter of LinearRegression to False nor setting the drop_first parameter of get_dummies to True fixed it.

Also, if there is a better way to do this I'd be eager to hear about it.

Best Answer

There is at least one point that seems very suspicious.

Consider the lines

train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)

and

test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)

Specifically, the parts

pandas.get_dummies(train[non_numeric])

and

pandas.get_dummies(test[non_numeric])

Note that get_dummies generates one column per distinct value that actually appears in the frame it is given, so the output depends on the data. Nothing guarantees that the training and test data contain the same category values, hence nothing guarantees that the generated columns are the same; taking the intersection afterwards silently drops the mismatched columns, and it's hard to guess the effect on the predictions for the test data.
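As a toy illustration (the column name and values here are only for the example, not taken from your actual data):

import pandas

# Two small frames sharing a column but not all of its category values.
train_toy = pandas.DataFrame({'Street': ['Pave', 'Grvl', 'Pave']})
test_toy = pandas.DataFrame({'Street': ['Pave', 'Dirt']})

print(list(pandas.get_dummies(train_toy)))  # ['Street_Grvl', 'Street_Pave']
print(list(pandas.get_dummies(test_toy)))   # ['Street_Dirt', 'Street_Pave']

Here the intersection would keep only Street_Pave, silently discarding the information in the other dummy columns, and the model's coefficients would be applied to a test matrix that no longer means the same thing.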

In general, when using get_dummies, it is better to apply it before the train/test split (including within cross-validation). It is an unsupervised transformation anyway, so it is not "peeking" at the target.
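A minimal sketch of that approach, assuming train and test are your original dataframes before any encoding (variable names follow your question):

import pandas
from sklearn.linear_model import LinearRegression

# Stack the two frames, encode once so both halves get identical dummy
# columns, then split them back apart via the keys.
combined = pandas.get_dummies(pandas.concat([train, test], keys=['train', 'test']))
combined.fillna(0, inplace=True)

X = combined.loc['train'].drop(columns='SalePrice')
y = combined.loc['train']['SalePrice']

lm = LinearRegression()
lm.fit(X, y)
predictions = lm.predict(combined.loc['test'][X.columns])

Alternatively, scikit-learn's OneHotEncoder with handle_unknown='ignore' can be fitted on the training data alone, which handles categories unseen at training time without any manual column bookkeeping.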
