Solved – Modeling prices with the Hedonic regression

estimation, r, regression, regression coefficients

I'm using hedonic regression to model real estate prices, but I'm having some trouble with my approach.

What I have and what I do

  • my data consists of properties with the following characteristics: price | livingArea | propertyArea | condoFloorNumber | roomCount | elevator | garage | quiet | etc.
  • I run a robust regression without an intercept: lmRob(price ~ . -1)

What I want

  • a model that can predict the price of properties that are not in the data set used for fitting
  • it would also be nice to put some constraints on the coefficients

Problems

  • very often I get implausible values for the coefficients, e.g. bathroomCount = -80000. It's not possible that an additional bathroom lowers the price of a house by 80,000 €.
  • I also tried the function pcls to put some constraints on the coefficients, but this method gave very bad results. In the plot, Y = price and X = livingArea. As you can see, the regression line isn't correct.
    [Figure: price (Y) against livingArea (X) with the fitted regression line]

  • another thought was to turn the regression problem into a maximization or minimization problem, but I didn't manage to do it
  • I also tried different regression methods (lm, lmrob, ltsReg, MARS), but they also give me bad coefficients (although sometimes these bad coefficients produce a good price estimate)
  • I think that the large number of dummy variables hurts the regression a little

Is my approach wrong?

Does anyone have hints or tricks for me? (I'm not a statistician.)

[UPDATE]

price ~ livingArea

This is what the plotted data looks like. LivingArea is the only non-dummy variable.

[UPDATE 2]

y = bX 

     means

y = b_0*X_0 + b_1*X_1 + ... + b_k*X_k

     which is an equation system like this:

y[0] = b_0*X_0[0] + b_1*X_1[0] + ... + b_k*X_k[0]
.
.
.
y[n] = b_0*X_0[n] + b_1*X_1[n] + ... + b_k*X_k[n]

Did I get it right?

If so, isn't it possible to add some inequality constraints to it? For example:

b_0 >= 2000
b_2 <= b_0/2
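
Constraints of this kind turn ordinary least squares into a constrained minimization problem. The following is a minimal sketch only, assuming synthetic data and using Python with SciPy's SLSQP solver in place of the R functions mentioned in the question. The three columns, the "true" coefficients, and the noise level are all made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data for illustration only (not the data from the question).
rng = np.random.default_rng(0)
n = 200
living_area = rng.uniform(30.0, 150.0, n)            # X_0
room_count = rng.integers(1, 6, n).astype(float)     # X_1
garage = rng.integers(0, 2, n).astype(float)         # X_2 (dummy)
X = np.column_stack([living_area, room_count, garage])
y = X @ np.array([1800.0, 4000.0, 3000.0]) + rng.normal(0.0, 10000.0, n)

# Dividing by a constant does not move the minimum; it only keeps the
# objective near 1 so the solver's default tolerances are meaningful.
scale = n * y.var()

def objective(b):
    residuals = y - X @ b
    return residuals @ residuals / scale

def gradient(b):
    return -2.0 * X.T @ (y - X @ b) / scale

# SLSQP expects each inequality constraint in the form fun(b) >= 0.
constraints = [
    {"type": "ineq", "fun": lambda b: b[0] - 2000.0},      # b_0 >= 2000
    {"type": "ineq", "fun": lambda b: b[0] / 2.0 - b[2]},  # b_2 <= b_0 / 2
]

b_start = np.array([2000.0, 0.0, 0.0])  # a starting point that satisfies both constraints
result = minimize(objective, b_start, jac=gradient, method="SLSQP",
                  constraints=constraints,
                  options={"ftol": 1e-12, "maxiter": 500})
b_constrained = result.x
b_unconstrained, *_ = np.linalg.lstsq(X, y, rcond=None)

print("unconstrained:", np.round(b_unconstrained, 1))
print("constrained:  ", np.round(b_constrained, 1))
```

The constrained fit can never have a smaller residual sum of squares than the unconstrained one, so constraints that bind trade some fit on this sample for coefficients that stay in a plausible range.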

[UPDATE 3]

I'm running the regression without an intercept because if all the characteristics of a property are 0, then of course its price is 0. Nobody would pay for an apartment with 0 m².
[Figure: regression lines fitted with an intercept (blue) and without an intercept (green)]
But the regression line fitted with an intercept (blue) looks far better than the one fitted without an intercept (green). I can't understand why that is. And why doesn't the regression line without an intercept start at the point (0,0)?
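
The effect of dropping the intercept can be reproduced on a toy example. This is a sketch with made-up, noise-free numbers (price = 70,000 + 2,000 × area), written in Python for illustration. In R the same comparison would be lm(price ~ livingArea) against lm(price ~ livingArea - 1). A line fitted without an intercept passes through (0,0) by construction, so if the plotted line does not appear to, the axes probably do not start at 0.

```python
# Illustrative data only: prices that grow with area on top of a base value.
areas = [40.0, 60.0, 80.0, 100.0, 120.0]                      # livingArea in m²
prices = [150000.0, 190000.0, 230000.0, 270000.0, 310000.0]   # price in €

def fit_with_intercept(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_through_origin(x, y):
    """Least squares for y = b*x (no intercept); returns b."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def rss(x, y, intercept, slope):
    """Residual sum of squares of the line intercept + slope*x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

a, b = fit_with_intercept(areas, prices)
b_origin = fit_through_origin(areas, prices)

print(f"with intercept:    price = {a:.0f} + {b:.0f} * area, RSS = {rss(areas, prices, a, b):.0f}")
print(f"without intercept: price = {b_origin:.0f} * area, RSS = {rss(areas, prices, 0.0, b_origin):.0f}")
```

Forcing the line through the origin makes the slope steeper than the real price per m², because the slope has to absorb the base value as well. Within the observed range of areas the no-intercept line therefore fits worse, even though it is "right" at 0 m², where there is no data.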

Best Answer

This type of approach clearly can work (and has evidently been used by tax authorities to set property taxes on my house for many years), so there needs to be some investigation of the sources of this difficulty.

Understanding the nature of this data set is very important. If it is to be used for predicting prices of properties not in the data set, you must be very certain that it is adequately representative of the population of properties of interest. It's possible there is some peculiarity in the way this particular sample was collected, so that some particular combinations of co-linear factors are leading to things like the negative coefficients for bathroom numbers. Re-evaluate the sample collection and the data coding, an oft-overlooked source of difficulty. Also, for any PCA-based approaches, the signs of coefficients for principal components depend on the directions of the associated eigenvectors, making it all too easy to create errors when you try to go back to the space of the original factors. Check that, too.

You didn't specify the standard errors of your coefficient estimates, so some of your apparently anomalous coefficients might not be significantly different from 0. For example, a -80K coefficient per bathroom with a standard error of +/- 100K would not really be an issue; that probably just means that the high co-linearity makes it difficult to determine a value per bathroom, given its high association with land area, numbers of bedrooms, and so forth. If that's the case you should retain the coefficient when making predictions, as the apparently anomalous coefficient for bathrooms is probably helping to correct for price over-estimates based on some of its co-linear factors alone.
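
As a rough illustration of how standard errors are obtained, here is a minimal sketch on synthetic data in which the bathroom count closely tracks living area. It uses Python with NumPy rather than R, and all numbers are assumptions chosen for illustration. In R, summary() on a fitted lm object reports the same quantities.

```python
import numpy as np

# Synthetic, deliberately collinear data (illustration only).
rng = np.random.default_rng(1)
n = 60
living_area = rng.uniform(40.0, 200.0, n)
# bathroomCount closely follows living area, so the two are highly collinear.
bathrooms = np.round(living_area / 50.0 + rng.normal(0.0, 0.3, n)).clip(1, None)
price = 2000.0 * living_area + 5000.0 * bathrooms + rng.normal(0.0, 20000.0, n)

X = np.column_stack([np.ones(n), living_area, bathrooms])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

residuals = price - X @ coef
dof = n - X.shape[1]                      # residual degrees of freedom
sigma2 = residuals @ residuals / dof      # estimate of the residual variance
cov = sigma2 * np.linalg.inv(X.T @ X)     # covariance matrix of the estimates
std_err = np.sqrt(np.diag(cov))

for name, b, se in zip(["intercept", "livingArea", "bathroomCount"], coef, std_err):
    print(f"{name:>14}: {b:12.1f}  (SE {se:10.1f}, t = {b / se:6.2f})")
```

A coefficient whose absolute t value is well below about 2 is not clearly distinguishable from 0, whatever its sign.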

You could try to figure out which combinations of factors are leading to these problems. Although stepwise selection of factors is not wise for building a final model, for troubleshooting you might consider starting with a simple model of price-bathroom relations and adding more factors to see which combinations of factors are leading to your problem.
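
One way to sketch this troubleshooting step is to refit the model while adding one factor at a time and watch how the bathroom coefficient moves. The data below is synthetic and the order in which factors are added is an arbitrary assumption. The code is Python with NumPy for illustration, not the R workflow from the question.

```python
import numpy as np

# Synthetic data with correlated factors (illustration only).
rng = np.random.default_rng(2)
n = 100
living_area = rng.uniform(40.0, 200.0, n)
room_count = np.round(living_area / 30.0 + rng.normal(0.0, 0.5, n)).clip(1, None)
bathrooms = np.round(room_count / 2.0 + rng.normal(0.0, 0.3, n)).clip(1, None)
price = (2000.0 * living_area + 3000.0 * room_count
         + 5000.0 * bathrooms + rng.normal(0.0, 20000.0, n))

# Factors in the order they are added; bathroomCount comes first.
factors = {
    "bathroomCount": bathrooms,
    "roomCount": room_count,
    "livingArea": living_area,
}

def bathroom_coefficient(names):
    """Fit price ~ intercept + the named factors; return the bathroom coefficient."""
    X = np.column_stack([np.ones(n)] + [factors[name] for name in names])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coef[1 + names.index("bathroomCount")]   # +1 skips the intercept column

history = []
names = []
for name in factors:                                 # dicts keep insertion order
    names.append(name)
    history.append((list(names), bathroom_coefficient(names)))
    print(f"{' + '.join(names):<40} bathroomCount = {history[-1][1]:12.1f}")
```

A large jump or a sign change when a particular factor enters points to the combination worth inspecting more closely.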

You also should take advantage of information from structured re-sampling of your data set to evaluate these issues. You don't say whether or how you have approached this crucial aspect of model validation. If you have, then cross-validation or bootstrap resampling may have already provided insights into the sources of your difficulty. If you haven't, consult An Introduction to Statistical Learning or similar references to see how to proceed.
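
The two resampling ideas can be sketched as follows, again on synthetic data and in Python with NumPy for illustration. The number of folds (5) and of bootstrap replicates (500) are arbitrary assumptions. Cross-validation estimates how well the model predicts properties it has not seen, and the bootstrap shows how stable an individual coefficient is.

```python
import numpy as np

# Synthetic data (illustration only).
rng = np.random.default_rng(3)
n = 120
living_area = rng.uniform(40.0, 200.0, n)
bathrooms = np.round(living_area / 50.0 + rng.normal(0.0, 0.3, n)).clip(1, None)
price = 2000.0 * living_area + 5000.0 * bathrooms + rng.normal(0.0, 20000.0, n)
X = np.column_stack([np.ones(n), living_area, bathrooms])

def fit(X_train, y_train):
    """Ordinary least squares; returns the coefficient vector."""
    coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
    return coef

# 5-fold cross-validation: each property is predicted by a model that never saw it.
folds = np.array_split(rng.permutation(n), 5)
squared_errors = []
for test_idx in folds:
    train_mask = np.ones(n, dtype=bool)
    train_mask[test_idx] = False
    coef = fit(X[train_mask], price[train_mask])
    squared_errors.extend((price[test_idx] - X[test_idx] @ coef) ** 2)
cv_rmse = float(np.sqrt(np.mean(squared_errors)))

# Bootstrap: refit on rows resampled with replacement and collect the
# bathroom coefficient (column 2) from each refit.
boot = []
for _ in range(500):
    idx = rng.integers(0, n, n)
    boot.append(fit(X[idx], price[idx])[2])
boot = np.array(boot)
share_negative = float(np.mean(boot < 0))

print(f"cross-validated RMSE: {cv_rmse:.0f}")
print(f"bathroom coefficient, 2.5% to 97.5%: {np.percentile(boot, 2.5):.0f} to {np.percentile(boot, 97.5):.0f}")
print(f"share of bootstrap fits with a negative bathroom coefficient: {share_negative:.2f}")
```

If the bootstrap interval for a coefficient comfortably spans 0, its sign in any single fit carries little information.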
