Solved – Which is the right way to handle imbalanced data in a regression problem

Tags: machine learning, regression, unbalanced-classes, weighted-regression, weighted-sampling

I'm working on a regression problem with imbalanced data, and I would like to know if I'm weighting the errors correctly. I'll try to illustrate the concept with a simple example.

Imagine I'm building a model to predict house prices in New York and Los Angeles. I have many more training examples in NY than in LA, but I want the algorithm to perform equally well in both cities. To further complicate the issue, house prices in NY have a greater variance than those in LA.

Here is an example training dataset:

City    N_rooms  House_Price
NY      4        400
NY      7        1000
NY      5        800
NY      3        300
NY      7        600
NY      2        100
NY      4        500
LA      3        400
LA      5        500
LA      4        500

I have 7 training examples for NY and 3 training examples for LA. My cost function is MSE, namely sum((y_pred - y_true)^2)/10. To make sure that the algorithm performs equally well in both cities, I would need to assign different weights to the prediction errors, namely

sum(w * (y_pred - y_true)^2)/10
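To make the formula concrete, here is a minimal sketch of that weighted cost in plain Python. The function name `weighted_mse`, the predictions, and the "LA counts double" weights are made up for illustration only; they are not a proposed answer to the weighting question.

```python
def weighted_mse(y_true, y_pred, w):
    """Compute sum(w * (y_pred - y_true)^2) / n with one weight per sample."""
    n = len(y_true)
    return sum(wi * (p - t) ** 2 for wi, p, t in zip(w, y_pred, y_true)) / n

# Prices from the table above (7 NY rows followed by 3 LA rows).
y_true = [400, 1000, 800, 300, 600, 100, 500, 400, 500, 500]
# Made-up predictions, only there to have some errors to weight.
y_pred = [450, 900, 800, 350, 700, 100, 500, 300, 500, 600]

# w = 1 for every sample reduces to the ordinary MSE.
plain = weighted_mse(y_true, y_pred, [1.0] * 10)

# Arbitrary illustration: errors on the 3 LA houses count double.
upweighted_la = weighted_mse(y_true, y_pred, [1.0] * 7 + [2.0] * 3)
```

With these numbers the LA errors contribute a larger share of the second loss, which is the effect the weights are meant to control.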

I would like to know which one of the following would be the correct way to define w and/or rescale training data:

  1. Do not use weights (i.e., w=1)
  2. Define w as the inverse frequency of each class in the training set, namely w=1/3 for houses in LA and w=1/7 for houses in NY
  3. Standardize prices in NY and LA separately, namely subtract the average price in NY from the price of every house in NY, then divide the price of every house in NY by the standard deviation of house prices in NY. Similarly, subtract the average price in LA from the price of every house in LA, then divide the price of every house in LA by the standard deviation of house prices in LA. Now train the regression model on the scaled data. To predict actual prices, apply the inverse scaling to the model predictions.
  4. Apply both points 2 and 3.
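
For reference, options 2 and 3 could be sketched as follows on the example data. This is only an illustration of what I mean by each option; the helper name `unscale` is hypothetical, and using the population standard deviation (`pstdev`) is an assumption on my part.

```python
from collections import Counter
from statistics import mean, pstdev

cities = ["NY"] * 7 + ["LA"] * 3
prices = [400, 1000, 800, 300, 600, 100, 500, 400, 500, 500]

# Option 2: weight each sample by 1 / (number of samples from its city).
counts = Counter(cities)
weights = [1.0 / counts[c] for c in cities]

# Option 3: standardize prices within each city.
# Assumption: population standard deviation; the sample version would also work.
stats = {}
for city in counts:
    city_prices = [p for c, p in zip(cities, prices) if c == city]
    stats[city] = (mean(city_prices), pstdev(city_prices))

scaled = [(p - stats[c][0]) / stats[c][1] for c, p in zip(cities, prices)]

def unscale(value, city):
    """Map a value on the standardized scale back to a price for that city."""
    mu, sigma = stats[city]
    return value * sigma + mu

# Round trip: unscaling the scaled prices recovers the original prices.
restored = [unscale(z, c) for z, c in zip(scaled, cities)]
```

Note that with option 3 the city must be known at prediction time, since the inverse scaling uses that city's mean and standard deviation.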

Note: the goal is not only to minimize the overall error, but to build an algorithm that performs equally well in both cities.

Best Answer

If I have understood you correctly, the issue here is that you wish to fit your regression in such a way that it performs equally well on both cities, by which you mean that you want to minimise the weighted sum-of-squares, with weights that ensure equal total weight to the data from each city. If that is correct, then this should be a fairly simple problem, where you can use weighted least-squares estimation. For this type of estimation, you have an $n \times n$ diagonal weighting matrix $\mathbf{w}$, and the coefficient estimator is:

$$\hat{\boldsymbol{\beta}} = (\mathbf{x}^\text{T} \mathbf{w} \mathbf{x})^{-1} (\mathbf{x}^\text{T} \mathbf{w} \mathbf{y}).$$

Now, suppose that you have $n_\text{NY}$ data points from New York and $n_\text{LA}$ data points from Los Angeles (so that $n= n_\text{NY}+n_\text{LA}$). Then you would use weights $w_\text{NY} = 1/n_\text{NY}$ and $w_\text{LA} = 1/n_\text{LA}$ in your weighting matrix, and this would ensure that the two cities are equally weighted in the aggregate. As a result of this weighting, more weight would be given to data points from the city that has been sampled less.
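
As a rough sketch of the estimator with these weights, here is a NumPy version applied to the example data from the question. The model (intercept plus number of rooms) and the helper name `wls` are assumptions made for illustration; the weights are stored as a vector rather than as a full diagonal matrix.

```python
import numpy as np

# Example data from the question: city, number of rooms, price.
city = np.array(["NY"] * 7 + ["LA"] * 3)
rooms = np.array([4, 7, 5, 3, 7, 2, 4, 3, 5, 4], dtype=float)
price = np.array([400, 1000, 800, 300, 600, 100, 500, 400, 500, 500], dtype=float)

# Design matrix: intercept plus number of rooms (a deliberately simple model).
X = np.column_stack([np.ones_like(rooms), rooms])

# Each observation gets weight 1 / (number of observations from its city),
# so each city has total weight 1.
n_ny = np.sum(city == "NY")
n_la = np.sum(city == "LA")
w = np.where(city == "NY", 1.0 / n_ny, 1.0 / n_la)

def wls(X, y, w):
    """Solve (X^T W X) beta = X^T W y, with diagonal W given as a vector w."""
    XtW = X.T * w  # same as X.T @ np.diag(w), without building the n x n matrix
    return np.linalg.solve(XtW @ X, XtW @ y)

beta = wls(X, price, w)  # beta[0] is the intercept, beta[1] the slope on rooms
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit inverse in the formula, which is usually the numerically safer choice.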

Now, I will also deal with your further complication, which is that there is more variance in one city than the other. My suggestion here would be to fit a first-pass model using weighted least-squares, with a weight of unity on one city and a free parameter as the weight on the other city. This will give you an estimate of the relative sizes of the error variance for the two cities. You can then take that estimate and apply it as an additional weight when you do your main weighted analysis (as described above). So, for example, if we let $\hat{\delta} \equiv \hat{\sigma}_\text{NY}^2 / \hat{\sigma}_\text{LA}^2$ denote the estimated relative error variance, then you would use the weightings $w_\text{NY} = 1/n_\text{NY}$ and $w_\text{LA} = \hat{\delta}/n_\text{LA}$ in your weighted analysis. Multiplying the Los Angeles weight by $\hat{\delta}$ makes the relative weights inversely proportional to the estimated error variances, so the noisier city is down-weighted per data point. This should allow you to incorporate both the different error variance of the two cities, and also apply your own weighting to force the analysis to give "equal weight" (after adjustment for error variance) to the two cities.
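
A rough two-pass sketch of this idea in NumPy is given below. Note that it swaps in a simpler stand-in for the first pass: instead of estimating a free weight parameter, it takes the mean squared residual within each city from an unweighted fit as a crude error-variance estimate. The model (intercept plus rooms) and the helper name `wls` are also assumptions, and with only three Los Angeles points the estimate of $\hat{\delta}$ will be very noisy.

```python
import numpy as np

# Example data from the question: city, number of rooms, price.
city = np.array(["NY"] * 7 + ["LA"] * 3)
rooms = np.array([4, 7, 5, 3, 7, 2, 4, 3, 5, 4], dtype=float)
price = np.array([400, 1000, 800, 300, 600, 100, 500, 400, 500, 500], dtype=float)
X = np.column_stack([np.ones_like(rooms), rooms])

is_ny = city == "NY"
n_ny, n_la = is_ny.sum(), (~is_ny).sum()

def wls(X, y, w):
    """Solve (X^T W X) beta = X^T W y, with diagonal W given as a vector w."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

# First pass: unweighted fit, used only to obtain residuals.
beta_first = wls(X, price, np.ones_like(price))
resid = price - X @ beta_first

# Crude per-city error variance: mean squared residual within each city.
# (Stand-in for the free-parameter first pass described above.)
var_ny = np.mean(resid[is_ny] ** 2)
var_la = np.mean(resid[~is_ny] ** 2)
delta_hat = var_ny / var_la

# Second pass: w_NY = 1/n_NY and w_LA = delta_hat/n_LA, as in the text.
w = np.where(is_ny, 1.0 / n_ny, delta_hat / n_la)
beta_final = wls(X, price, w)
```

If `delta_hat` comes out above 1 (New York noisier), each Los Angeles point gets more weight than frequency balancing alone would give it, and less if it comes out below 1.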
