I'm using hedonic regression to model real-estate prices, and I'm having some trouble with my approach.
What I have and what I do
- my data consists of properties with the following characteristics:
price | livingArea | propertyArea | condoFloorNumber | roomCount | elevator | garage | quiet | etc.
- I run a robust regression without intercept
lmRob(price ~ . -1)
What I want
- a model with which I can predict the price of properties that are not in the data set used for fitting
- also it would be nice to have some constraints on the coefficients
Problems
- very often I get implausible values for the coefficients, e.g. bathroomCount = -80000. It's not possible that an additional bathroom lowers the price of a house by 80,000 €.
- I also tried the function pcls to put constraints on the coefficients, but this method gave very bad results. In the plot, Y = price and X = livingArea; as you can see, the regression line isn't correct.
- another thought was to transform the regression problem into a maximization or minimization problem, but I didn't manage to do it
- I also tried different regression methods (lm, lmrob, ltsReg, MARS), but they give me bad coefficients too (although sometimes these bad coefficients still produce a good price estimate)
- I think the large number of dummy variables hurts the regression somewhat
Is my approach wrong?
Does anyone have hints or tricks for me? (I'm not a statistician)
[UPDATE]
This is what the plotted data looks like. LivingArea is the only non-dummy variable.
[UPDATE 2]
y = bX means
y = b_0*X_0 + b_1*X_1 + ... + b_k*X_k,
which is an equation system like this:
y[0] = b_0*X_0[0] + b_1*X_1[0] + ... + b_k*X_k[0]
.
.
.
y[n] = b_0*X_0[n] + b_1*X_1[n] + ... + b_k*X_k[n]
Did I get it right?
If so, isn't it possible to add some inequality constraints to it? For example:
b_0 >= 2000
b_2 <= b_0/2
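Constraints of this kind are linear inequalities in b, so the least-squares fit becomes a constrained minimization. A minimal sketch, assuming simulated data and using base R's constrOptim (not the pcls call mentioned above). The variable names, the units (prices in thousand EUR) and the bounds are illustrative assumptions, not values from the real data set:

```r
# Simulated data; all names and numbers here are illustrative assumptions.
set.seed(1)
n <- 200
livingArea    <- runif(n, 30, 200)                              # m^2
bathroomCount <- pmax(1, round(livingArea / 60 + rnorm(n, sd = 0.5)))
price <- 2.5 * livingArea + 10 * bathroomCount + rnorm(n, sd = 20)  # thousand EUR

X <- cbind(livingArea, bathroomCount)

# Objective: mean squared residual of the no-intercept model y = X b,
# together with its gradient.
mse      <- function(b) mean((price - X %*% b)^2)
mse_grad <- function(b) as.vector(-2 * t(X) %*% (price - X %*% b)) / n

# constrOptim enforces ui %*% b - ci >= 0, one row per constraint:
#   row 1: b_livingArea    >= 2   (i.e. 2000 EUR per m^2)
#   row 2: b_bathroomCount >= 0
# A constraint such as b_2 <= b_0/2 would be the row c(0.5, -1) with ci = 0.
ui <- diag(2)
ci <- c(2, 0)

# The starting point must lie strictly inside the feasible region.
fit   <- constrOptim(theta = c(3, 1), f = mse, grad = mse_grad, ui = ui, ci = ci)
b_hat <- setNames(fit$par, colnames(X))
print(b_hat)
```

The result always satisfies the constraints, but a coefficient pinned at its bound is a sign that the data and the constraint disagree, which is worth investigating rather than hiding.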
[UPDATE 3]
I'm running the regression without an intercept because if all the characteristics of a property are 0, then of course its price is 0. Nobody would pay for an apartment with 0 m².
But the regression line fitted with an intercept (blue) looks far better than the one fitted without an intercept (green). I can't understand why. And why doesn't the regression line without an intercept start at the point (0,0)?
Best Answer
This type of approach clearly can work (and has evidently been used by tax authorities to set property taxes on my house for many years), so there needs to be some investigation of the sources of this difficulty.
Understanding the nature of this data set is very important. If it is to be used for predicting prices of properties not in the data set you must be very certain that it is adequately representative of the population of properties of interest. It's possible there is some peculiarity in the way this particular sample was collected, so that some particular combinations of co-linear factors are leading to things like the negative coefficients for bathroom numbers. Re-evaluate the sample collection and the data coding, an oft-overlooked source of difficulty. Also, for your PCA-based approaches, the signs of coefficients for principal components depend on the directions of the associated eigenvectors, making it all too easy to create errors when you try to go back to the space of the original factors. Check that, too.
You didn't specify the standard errors of your coefficient estimates, so some of your apparently anomalous coefficients might not be significantly different from 0. For example, a -80K coefficient per bathroom with a standard error of +/- 100K would not really be an issue; that probably just means that the high co-linearity makes it difficult to determine a value per bathroom, given its high association with land area, numbers of bedrooms, and so forth. If that's the case you should retain the coefficient when making predictions, as the apparently anomalous coefficient for bathrooms is probably helping to correct for price over-estimates based on some of its co-linear factors alone.
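To look at the standard errors, the coefficient table and confidence intervals of a fitted model are the place to start. A minimal sketch on simulated, deliberately collinear data. The data-generating numbers are assumptions chosen for illustration, and plain lm is used here instead of the robust fit:

```r
# Simulated data; the numbers are illustrative assumptions only.
set.seed(42)
n <- 100
livingArea    <- runif(n, 40, 250)
# Bathroom count is almost a function of living area -> strong collinearity.
bathroomCount <- round(livingArea / 50 + rnorm(n, sd = 0.3))
price <- 2500 * livingArea + 5000 * bathroomCount + rnorm(n, sd = 40000)

fit <- lm(price ~ livingArea + bathroomCount)

# Estimates side by side with their standard errors.
coef_table <- summary(fit)$coefficients
print(coef_table[, c("Estimate", "Std. Error")])

# 95% confidence intervals; an interval that covers 0 means the sign of
# that coefficient is not determined by the data.
print(confint(fit))

# How strongly the two predictors move together.
print(cor(livingArea, bathroomCount))
```

If the interval for bathroomCount is wide and spans zero, a negative point estimate is not evidence that bathrooms reduce prices.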
You could try to figure out which combinations of factors are leading to these problems. Although stepwise selection of factors is not wise for building a final model, for troubleshooting you might consider starting with a simple model of price-bathroom relations and adding more factors to see which combinations of factors are leading to your problem.
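One way to carry out this kind of troubleshooting is to fit a sequence of nested models and watch how the bathroom coefficient and its standard error change as factors are added. A rough sketch on simulated data; the variables and their relationships are assumptions for illustration, and the sequence is a diagnostic, not a model-selection procedure:

```r
# Simulated data with a chain of correlated predictors (assumed for illustration).
set.seed(7)
n <- 150
livingArea    <- runif(n, 40, 250)
roomCount     <- round(livingArea / 30 + rnorm(n, sd = 0.5))
bathroomCount <- round(roomCount / 2.5 + rnorm(n, sd = 0.4))
price <- 2500 * livingArea + 3000 * roomCount + 5000 * bathroomCount +
  rnorm(n, sd = 40000)
d <- data.frame(price, livingArea, roomCount, bathroomCount)

# Start with price ~ bathroomCount alone, then add one factor at a time.
formulas <- list(
  price ~ bathroomCount,
  price ~ bathroomCount + roomCount,
  price ~ bathroomCount + roomCount + livingArea
)

# Track the bathroom coefficient and its standard error across the models.
bath_track <- t(sapply(formulas, function(f) {
  s <- summary(lm(f, data = d))$coefficients
  s["bathroomCount", c("Estimate", "Std. Error")]
}))
rownames(bath_track) <- sapply(formulas, function(f) paste(deparse(f), collapse = ""))
print(bath_track)
```

The step at which the estimate changes sharply or the standard error balloons points to the factor that is collinear with bathroomCount.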
You also should take advantage of information from structured re-sampling of your data set to evaluate these issues. You don't say whether or how you have approached this crucial aspect of model validation. If you have, then cross-validation or bootstrap resampling may have already provided insights into the sources of your difficulty. If you haven't, consult An Introduction to Statistical Learning or similar references to see how to proceed.
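As a starting point, k-fold cross-validation can be written in a few lines of base R. A minimal sketch, assuming simulated data and plain lm; the fold count, variable names and numbers are illustrative assumptions:

```r
# Simulated data; the numbers are illustrative assumptions only.
set.seed(123)
n <- 200
livingArea    <- runif(n, 40, 250)
bathroomCount <- pmax(1, round(livingArea / 60 + rnorm(n, sd = 0.5)))
price <- 2500 * livingArea + 5000 * bathroomCount + rnorm(n, sd = 40000)
d <- data.frame(price, livingArea, bathroomCount)

# Randomly assign each row to one of k folds.
k    <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, predict the held-out fold,
# and record the out-of-sample root mean squared error.
cv_rmse <- sapply(1:k, function(i) {
  train <- d[fold != i, ]
  test  <- d[fold == i, ]
  fit   <- lm(price ~ livingArea + bathroomCount, data = train)
  sqrt(mean((test$price - predict(fit, newdata = test))^2))
})

print(cv_rmse)        # one error estimate per fold
print(mean(cv_rmse))  # overall cross-validated RMSE
```

Out-of-sample error measured this way speaks directly to the goal of pricing properties that are not in the data set, and comparing it across candidate models (with and without an intercept, with and without a suspect factor) is more informative than inspecting coefficient signs alone.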