Solved – Refining a linear regression model for condominium prices

correlation, regression

I'm hoping someone here can help me refine a linear regression model I'm building at work. I am in no way a statistician, but I probably have the most relevant experience in my office (a basic stats course and reasonable proficiency with Excel).

I've been tasked with creating a model to help predict condo prices (the dependent variable) in a particular city. I've collected data from the Multiple Listing Service to use as my independent variables. The data covers condos that sold within the last 6 months or are currently listed, and that are between 0 and 2 years old. It is also limited to 4-storey wood-frame construction.

The independent variables I have used are: square footage, top floor (dummy variable), corner unit (dummy), unit type (1 bed, 2 bed, etc.), exposure (dummy, the direction the unit faces), and material spec (quality of finishings). I have since dropped exposure from the equation because it wasn't statistically significant (its t-statistic was around 0.3 to 0.4). All of the other coefficients have a t-statistic over 2, but two of them are confusing me. The top floor and corner unit coefficients are negative when logically they should be positive. In my experience, top floor and corner units command a premium over lower-level and inside units.

Does anyone have any idea why this could be? I have around 40 samples so far; would expanding my data set to include more samples help fix this? I also understand that real estate prices can be tricky to model because of subjective variables that can't really be accounted for. Anyway, any help would be appreciated, as I am trying to learn about regression as I work on this!

Sincerely,

Rob

Best Answer

Since you are looking for predictive value, you should not necessarily drop a variable (exposure) based on a significance test. There are methods that select variables using criteria aimed more directly at good prediction (generally based on cross-validation or other bootstrap-like techniques). I doubt you will find these in Excel, though. I strongly recommend LASSO, tuned with whatever measure of predictive value you prefer (feel free to ask for more info). Note that most of these techniques are basically forms of linear regression with a twist: they work out which coefficients can be set to zero.
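To make that concrete, here is a minimal sketch of LASSO tuned by 5-fold cross-validation, written in plain Python so the mechanics are visible. Everything in it is an assumption for illustration: the data are synthetic (40 made-up condos, with exposure deliberately given no effect on price), and the column names, the penalty grid `lambdas` and the helper functions are hypothetical. In practice you would more likely use a library routine such as scikit-learn's `LassoCV` rather than hand-rolled coordinate descent.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; anything within [-t, t] becomes exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def standardise(train, other):
    """Scale both sets with the training means/std devs (constant columns -> 0)."""
    n, p = len(train), len(train[0])
    means = [sum(r[j] for r in train) / n for j in range(p)]
    sds = [(sum((r[j] - means[j]) ** 2 for r in train) / n) ** 0.5 or 1.0
           for j in range(p)]
    def scale(rows):
        return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    return scale(train), scale(other)

def lasso_fit(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2n)*sum(resid^2) + lam*sum|b_j|, free intercept."""
    n, p = len(X), len(X[0])
    b0, b = sum(y) / n, [0.0] * p
    resid = [yi - b0 for yi in y]
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * b[j]) for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
        shift = sum(resid) / n          # re-centre the unpenalised intercept
        b0 += shift
        resid = [r - shift for r in resid]
    return b0, b

def cv_mse(X, y, lam, k=5):
    """Mean squared prediction error over k held-out folds."""
    n, total = len(X), 0.0
    for fold in range(k):
        test = set(range(fold, n, k))
        Xtr = [X[i] for i in range(n) if i not in test]
        ytr = [y[i] for i in range(n) if i not in test]
        Xte = [X[i] for i in sorted(test)]
        yte = [y[i] for i in sorted(test)]
        Xtr_s, Xte_s = standardise(Xtr, Xte)
        b0, b = lasso_fit(Xtr_s, ytr, lam)
        for row, target in zip(Xte_s, yte):
            pred = b0 + sum(c * v for c, v in zip(b, row))
            total += (target - pred) ** 2
    return total / n

# Synthetic stand-in for the MLS data (prices in $ thousands); not real figures.
rng = random.Random(0)
names = ["sqft", "top_floor", "corner", "bedrooms", "exposure", "spec"]
X, y = [], []
for _ in range(40):
    sqft = rng.uniform(500, 1200)
    top, corner, exposure = rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 1)
    beds = 1 if sqft < 800 else 2
    spec = rng.randint(1, 3)
    X.append([sqft, top, corner, beds, exposure, spec])
    y.append(50 + 0.3 * sqft + 15 * top + 10 * corner + 20 * spec + rng.gauss(0, 10))

lambdas = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0, 20.0]   # hypothetical penalty grid
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)

X_s, _ = standardise(X, X)
intercept, coefs = lasso_fit(X_s, y, best_lam)
for name, c in zip(names, coefs):
    print(f"{name:10s} {c:8.2f}")   # coefficients are per standard deviation
```

The point of the design is that the penalty is chosen by out-of-sample error rather than by t-statistics: a variable such as exposure stays in if it helps prediction on held-out condos and is shrunk to exactly zero if it does not.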

Your number of observations is not exactly high for your number of covariates, but if collecting more data becomes an option, it would be interesting to add interaction terms (which I understand you have not done yet).
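As a rough illustration of what an interaction term is and why it eats into a small sample, the sketch below builds product columns for one hypothetical condo and counts how many pairwise interactions five covariates would generate. The column names, the example values and the helper `add_interactions` are assumptions for illustration, and unit type is treated as a single column for simplicity.

```python
from math import comb

def add_interactions(row, names, pairs):
    """Append the product of each requested pair of columns to one data row."""
    idx = {name: i for i, name in enumerate(names)}
    new_names, new_vals = list(names), list(row)
    for a, b in pairs:
        new_names.append(f"{a}:{b}")
        new_vals.append(row[idx[a]] * row[idx[b]])
    return new_names, new_vals

names = ["sqft", "top_floor", "corner"]
condo = [850.0, 1, 1]   # hypothetical 850 sq ft top-floor corner unit

ext_names, ext_vals = add_interactions(
    condo, names, [("top_floor", "corner"), ("sqft", "top_floor")]
)
print(dict(zip(ext_names, ext_vals)))

# Five main effects (sqft, top floor, corner, unit type, spec) give this many
# pairwise interactions, all competing for the same ~40 observations.
n_main = 5
n_pairwise = comb(n_main, 2)
print(n_main + n_pairwise, "slopes plus an intercept")
```

The `top_floor:corner` column is 1 only for units that are both top floor and corner, so its coefficient measures any extra premium (or discount) beyond the two separate effects.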

As for why this or that variable ends up in your model, or with which sign: I'd be wary of making strong statements about that with a sample of this size (especially, again, considering the number of covariates).
