Solved – Why does skew and heteroscedasticity lead to bias

distributionsmachine learningr

I'm newbie attempting the Zillow prize competition on Kaggle.
I've read about the reason for using a relative error from Zillow's answer in their FAQ.
(I don't think it's a Zillow-specific question but rather a general question in machine learning)

Q: Why did Zillow pick the log error instead of an absolute error metric such as RMSE?
A : Home sale prices have a right skewed distribution and are also strongly heteroscedastic, so we need to use a relative error metric instead of an absolute metric to ensure valuation models are not biased towards expensive homes. A relative error metric like the percentage error or log ratio error avoids these problems. While we report Zestimate errors in terms of percentages on Zillow.com because we believe that to be a more intuitive metric for consumers, we do not advocate using percentage error to evaluate models in Zillow Prize, as it may lead to biased models The log error is free of this bias problem and when using the natural logarithm, errors close to 1 approximate percentage errors quite closely. See this paper for more on relative errors and why log error should be used instead of percentage error.

I asked about this FAQ explanation here.
So I mostly understand what "right skewed heteroscedastic distribution" means.
But I still don't understand why these things are problems.

Here are the questions.

  1. What does "biased towards expensive homes" mean? Does it mean that models have higher prediction error for expensive homes than cheap homes?

  2. Why do we need to avoid making "biased" models? It's hard to say that models with higher error on expensive homes are bad because there are more cheap houses than expensive homes. I can't tell which is better.

  3. What causes a model to be "biased towards expensive homes" and why does "bias" happen? Is it caused by "right skew" or "heteroscedasticity" or both?

EDIT after jon_simon's answer :

I'll explain my understanding and confusing point. Please correct me!

  1. I understand that absolute error can be disproportionately influenced by the magnitude of Y and that's the reason we should use relative error metric. But what point I don't understand yet is "This problem can be caused by right-skew". (in jon_simon's answer)

  2. Let's assume home sales price problem is left-skewed distribution.
    It will be plotted like this :
    left-skewed distribution
    (I defined cutoff of low and high price to explain more clearly.)

  3. There are more high-priced homes than low-priced homes and high-priced homes are more influence on absolute error metric like RMSE. So, if we use RMSE, the key to get better model is reduce prediction error for expensive homes more. Prediction error for low-priced homes will be relatively less important. So final chosen model should have ability of making lower error in high-priced model and I think this is the meaning of "biased" in my first question.

  4. But I think it's different in right-skewed distribution of home price like this :
    right skewed distribution

It is same that each high-priced home is more influence on RMSE than low-priced one. But the difference is the number of low-priced homes. There are much more low-priced homes than high-priced, and in terms of total error, low-priced homes get bigger influence than left-skewed distribution.
So, I think we don't know whether the final model will be biased towards low-priced homes or high-priced homes.(in right-skewed distribution)
In other words, this is a problem of this sort :
"Is sum(low-priced homes' absolute error) bigger(or smaller) than sum(high-priced homes' absolute error)?"

I'm confused if we can say that right-skewed distribution surely causes 'biased towards expensive home' model.(with RMSE)

Best Answer

In this context "biased" essentially means "disproportionately influenced by". RMSE, a commonly used error metric, sums the total squared error across all predictions. In this case if your model were to predict with 90% accuracy the value of a \$100,000 home and a \$1,000,000 home respectively, you would be penalized 100x more for the 10% prediction error in the \$1,000,000 home, because $$\frac{(1,000,000 - 900,000)^2}{(100,000 - 90,000)^2} = \frac{100,000^2}{10,000^2} = 100$$

Zillow is getting around this problem by using a different type of error metric which penalizes errors proportional to the magnitude of the prediction, so that in the case of the above example both 10% errors would be penalized to the same extent.

This problem can be caused by right-skew alone (presence of very high-priced homes) or heteroscedasticity alone (more variability in the high-priced homes), but the two together make the problem even worse.

This problem is fundamentally related to heteroscedasticity (the higher the price is, the more variability there is), and significantly right-skewed response variables frequently exhibit some amount of heteroscedasticity with respect to the explanatory variable(s).

EDIT in response to Ashtray's edit:

I have altered my response to qualify the statement that right-skew alone can bias the model. If the response variable is right-skewed but the samples are perfectly homoscedastic with respect to the independent variable(s), then this issue with RMSE does not arise, because the variance (and therefore the error) will be the same at all scales. However in almost any real-world scenario where the response variable is dramatically right-skewed, the samples will also exhibit some amount of heteroscadasticity.

As for your question about whether this is guaranteed to irreparably bias a model designed to penalize RMSE, the answer is that it won't, necessarily. However the more important question is, even in the case where you have enough cheap-home data to compensate for the more expensive homes, are you building a model that's actually predicting what you want? By using RMSE, you're implicitly saying that that you care as much about a \$10,000 difference in the price of a \$100,000 home as you do in the price of a \$1,000,000 home, or even a \$100,000,000 home, which is almost certainly not true.

Hopefully I've managed to answer all of your lingering questions.