Linear Regression – Causes and Solutions for Negative Predicted Values

Tags: linear model, predictor, regression

I'm using linear regression to predict a price, which is obviously positive. I have only one feature, gross_area. I standardized it (z-score) and got values like this:

array([[ 1.        , -0.48311432],
       [ 1.        ,  0.68052306],
       [ 1.        ,  2.1426852 ],
       [ 1.        , -1.17398593],
       [ 1.        , -0.16265712]])

Where the 1 is the constant column for the intercept term.
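
For reference, here is a minimal sketch of how a design matrix like this can be built (the raw gross_area values below are invented for illustration, not my real data):

import numpy as np

gross_area = np.array([80.0, 120.0, 175.0, 55.0, 92.0])  # illustrative values

# z-score standardization: subtract the mean, divide by the standard deviation
z = (gross_area - gross_area.mean()) / gross_area.std()

# prepend a column of ones for the intercept term
X = np.column_stack([np.ones_like(z), z])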
I then estimated the parameters (coefficients) and got this:

array([[ 31780004.85045217],
       [ 27347542.4693376 ]])

Where the first cell is the intercept term and the second cell corresponds to the coefficient found for my feature gross_area.
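
(For reference, estimates like these can be obtained with ordinary least squares, e.g. via the normal equations; the y values in this sketch are illustrative, not my real prices:)

import numpy as np

X = np.array([[1.0, -0.48311432],
              [1.0,  0.68052306],
              [1.0,  2.1426852 ],
              [1.0, -1.17398593],
              [1.0, -0.16265712]])
y = np.array([2.0e7, 4.6e7, 8.5e7, 5.0e6, 2.7e7])  # illustrative prices

# ordinary least squares via the normal equations: theta = (X'X)^(-1) X'y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # [intercept, coefficient for gross_area]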

My problem is the following: when I take, for example, the fourth row and compute the matrix product Xθ to get my prediction, I get this:

In [797]: np.dot(training[4], theta)
Out[797]: array([-325625.35640697])

This is totally wrong, since my dependent variable cannot be negative. It seems that because the standardization produces negative values for my feature, I end up with a negative predicted value for some rows. How is this possible, and how can I fix it?
Thank you.

Here is what the fit looks like graphically:

[scatter plot of the data with the fitted regression line; y = price, x = gross_area]

Best Answer

Linear regression does not respect a lower bound of 0. It's linear, always and everywhere. It may not be appropriate for values that must stay strictly positive but can come close to 0.
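
You can see this directly from the numbers in the question. The fitted line is ŷ = 31780004.85 + 27347542.47·z, which crosses zero at z = −31780004.85 / 27347542.47 ≈ −1.162. Any standardized gross_area below that threshold, such as the fourth row's −1.174, necessarily gives a negative prediction:

import numpy as np

theta = np.array([31780004.85045217, 27347542.4693376])

# the fitted line crosses zero at z = -intercept/slope
print(-theta[0] / theta[1])               # ≈ -1.1621

# the fourth row has z ≈ -1.174, below the threshold
print(theta[0] + theta[1] * -1.17398593)  # ≈ -325625.36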

One way to manage this, particularly in the case of price, is to model the natural log of price instead.
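
A minimal sketch of that approach, using the design matrix from the question and illustrative (strictly positive) prices for y:

import numpy as np

X = np.array([[1.0, -0.48311432],
              [1.0,  0.68052306],
              [1.0,  2.1426852 ],
              [1.0, -1.17398593],
              [1.0, -0.16265712]])
y = np.array([2.0e7, 4.6e7, 8.5e7, 5.0e6, 2.7e7])  # illustrative prices

# fit the linear model on log(price) instead of price
theta, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

# back-transform: exp(...) is always positive, so predictions can never be negative
predictions = np.exp(X @ theta)
print(predictions)

One caveat: exponentiating the fitted log-scale predictions estimates the conditional median of price rather than its mean; if you need the mean, consider a bias correction or a GLM with a log link.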