I'm trying to run a linear regression in Python to determine house prices given many features. Some of these are numeric and some are non-numeric. I'm attempting to one-hot encode the non-numeric columns, attach the new numeric columns to the old dataframe, and drop the non-numeric columns. This is done on both the training data and the test data.
I then took the intersection of the two feature sets (since some encodings were only present in the test data). Afterwards, it goes into a linear regression. The code is the following:
import pandas
from sklearn.linear_model import LinearRegression

non_numeric = list(set(list(train)) - set(list(train._get_numeric_data())))
train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)
train.drop(non_numeric, axis=1, inplace=True)
train = train._get_numeric_data()
train.fillna(0, inplace = True)
non_numeric = list(set(list(test)) - set(list(test._get_numeric_data())))
test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)
test.drop(non_numeric, axis=1, inplace=True)
test = test._get_numeric_data()
test.fillna(0, inplace = True)
feature_columns = list(set(train) & set(test))
#feature_columns.remove('SalePrice')
X = train[feature_columns]
y = train['SalePrice']
lm = LinearRegression()
lm.fit(X, y)
import numpy
predictions = numpy.absolute(lm.predict(test).round(decimals = 2))
The issue I'm having is that I get absurdly high sale prices as output, somewhere in the hundreds of millions of dollars (sometimes even in the trillions). Before I tried one-hot encoding I got reasonable numbers in the hundreds of thousands of dollars, and I'm having trouble figuring out what changed. I posted this on Stack Overflow and got a response suggesting it might be a collinearity issue, but I tried setting the fit_intercept parameter of LinearRegression to False, as well as setting the drop_first parameter of get_dummies to True, and neither helped.
Also, if there is a better way to do this I'd be eager to hear about it.
Best Answer
There is at least one point that seems very suspicious.
Consider the lines

train = pandas.concat([train, pandas.get_dummies(train[non_numeric])], axis=1)

and

test = pandas.concat([test, pandas.get_dummies(test[non_numeric])], axis=1)

Specifically, the parts

pandas.get_dummies(train[non_numeric])

and

pandas.get_dummies(test[non_numeric])
Note that the columns get_dummies generates depend on the values actually present in each dataframe. There is no reason the generated columns must be the same for the train and test sets, and so it's hard to guess the effect on the prediction of the test data.
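A tiny sketch with made-up data illustrates how the generated columns diverge:

```python
import pandas

# Hypothetical example: the same column takes different values in train and test
train_colors = pandas.DataFrame({'color': ['red', 'blue']})
test_colors = pandas.DataFrame({'color': ['blue', 'green']})

train_dummies = pandas.get_dummies(train_colors)
test_dummies = pandas.get_dummies(test_colors)

print(list(train_dummies))  # ['color_blue', 'color_red']
print(list(test_dummies))   # ['color_blue', 'color_green']
# The two encodings disagree: intersecting the columns silently drops
# 'color_red' from the model and 'color_green' from the test data.
```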
In general, when performing get_dummies, it is better to do it before train/test splits (including cross-validation). This is an unsupervised transformation anyway, so it is not "peeking".