Solved – Dealing with imbalanced/zero-inflated training examples for regression

classification, model-evaluation, regression, unbalanced-classes, zero-inflation

I am trying to predict the rainfall in a desert with a regression model. However, as you might expect, most of my training examples have zeroed labels. I have two questions:

a. What is an appropriate performance measure?

For classification problems, it seems conventional to evaluate the confusion matrix, F1 score or other metrics (e.g. kappa) normalized for imbalanced classes.

What about in a regression setting? A model that predicts a near-constant zero will achieve a very low RMSE/MAE, but that score gives little intuition about how well the model will ultimately predict the amount of rainfall.

b. What is an appropriate model?

It seems that one common strategy with zero-inflated data is to split this into a two-step problem with a binary classification problem for {rain, no rain}, pick my favorite classifier from cross-validation, then split my data set with that classifier to run a separate regression problem conditional on predicted rain.

The main concern I have with this approach is that I have limited data by the regression step (there are very few training examples conditional on predicted rain).

Is there a better approach I can take?

Best Answer

(a) Assess the performance(s) you are interested in. If you are mainly interested in getting the expectation of the response E(y) right, then MAE or RMSE are useful. Similarly, you could assess the conditional expectation E(y | y > 0), i.e., the expected amount of precipitation given that there is precipitation. If you are mostly interested in the probability of any precipitation P(y > 0), you could look at the corresponding misclassification rate or the Brier score. And if you are interested in the entire distribution, scoring rules like the log-likelihood (or log-score) or the CRPS (continuous ranked probability score) would be natural.
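As a rough sketch of how these targets can be scored separately, the base R snippet below simulates zero-inflated rainfall and evaluates a constant-zero forecast. The 10% wet-day rate, the gamma-distributed amounts and all variable names are illustrative assumptions, not taken from the question.

```r
set.seed(42)
n <- 1000
rain <- rbinom(n, 1, 0.1)                      # assumed: about 10% wet days
y <- rain * rgamma(n, shape = 2, rate = 0.5)   # amount; exactly zero on dry days
wet <- y > 0

# Two naive forecasts to compare
pred_zero  <- rep(0, n)           # amount forecast: always zero
p_wet_zero <- rep(0, n)           # probability forecast: "it never rains"
p_wet_clim <- rep(mean(wet), n)   # probability forecast: observed wet frequency

mae   <- function(obs, pred) mean(abs(obs - pred))
rmse  <- function(obs, pred) sqrt(mean((obs - pred)^2))
brier <- function(event, prob) mean((prob - event)^2)

scores <- c(
  mae_all    = mae(y, pred_zero),             # targets E(y)
  rmse_all   = rmse(y, pred_zero),
  mae_wet    = mae(y[wet], pred_zero[wet]),   # targets E(y | y > 0)
  brier_zero = brier(wet, p_wet_zero),        # targets P(y > 0)
  brier_clim = brier(wet, p_wet_clim)
)
print(round(scores, 3))
```

The overall MAE looks small only because dry days dominate; the error restricted to wet days and the Brier score show how uninformative the constant-zero forecast is.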

(b) Instead of a two-step model with binary first step and zero-truncated second step, you could also use a single regression model with a response that is censored at zero. A worked example with precipitation in a weather forecasting context is available in a paper about our crch R package (see https://doi.org/10.32614/RJ-2016-012).
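To make the idea of a response censored at zero concrete, here is a minimal base R sketch that fits a Gaussian model left-censored at zero by maximum likelihood. It does not use the crch package, and the simulated data, the function name `negloglik` and the starting values are assumptions chosen for illustration.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
latent <- -1 + 1.5 * x + rnorm(n)   # assumed latent "rain potential"
y <- pmax(latent, 0)                # observed response, censored at zero

# Negative log-likelihood of a Gaussian regression left-censored at 0.
# par = (intercept, slope, log(sigma)); the log keeps sigma positive.
negloglik <- function(par, x, y) {
  mu <- par[1] + par[2] * x
  sigma <- exp(par[3])
  ll <- ifelse(
    y > 0,
    dnorm(y, mean = mu, sd = sigma, log = TRUE),    # density for observed amounts
    pnorm(0, mean = mu, sd = sigma, log.p = TRUE)   # P(latent <= 0) for zeros
  )
  -sum(ll)
}

fit <- optim(c(0, 0, 0), negloglik, x = x, y = y, method = "BFGS")
est <- c(intercept = fit$par[1], slope = fit$par[2], sigma = exp(fit$par[3]))

# Ordinary least squares on the censored response, for comparison
naive <- coef(lm(y ~ x))

print(round(est, 2))
print(round(naive, 2))
```

Because every observation, zero or not, contributes to a single likelihood, this approach avoids fitting the second step on only the few rainy cases.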

Related Solutions

Solved – Sampling for Imbalanced Data in Regression

Imbalance is not necessarily a problem, but how you get there can be. It is unsound to base your sampling strategy on the target variable. Because this variable incorporates the randomness in your regression model, if you sample based on this you will have big problems doing any kind of inference. I doubt it is possible to "undo" those problems.

You can legitimately over- or under-sample based on the predictor variables. In this case, provided you carefully check that the model assumptions seem valid (homoscedasticity is one that springs to mind as important here, if you have an "ordinary" regression with the usual assumptions), I don't think you need to undo the oversampling when predicting. Your case would then be similar to that of an analyst who has designed an experiment explicitly to have a balanced range of the predictor variables.

Edit - addition - expansion on why it is bad to sample based on Y

In fitting the standard regression model $y=Xb+e$, the $e$ is expected to be normally distributed, have a mean of zero, and be independent and identically distributed. If you choose your sample based on the value of $y$ (which includes a contribution of $e$ as well as of $Xb$), the $e$ will no longer have a mean of zero or be identically distributed. For example, low values of $y$, which might include very low values of $e$, might be less likely to be selected. This ruins any inference based on the usual means of fitting such models. Corrections can be made, similar to those made in econometrics for fitting truncated models, but they are a pain, require additional assumptions, and should only be employed when there is no alternative.

Consider the extreme illustration below. If you truncate your data at an arbitrary value for the response variable, you introduce very significant biases. If you truncate it for an explanatory variable, there is not necessarily a problem. You see that the green line, based on a subset chosen because of their predictor values, is very close to the true fitted line; this cannot be said of the blue line, based only on the blue points.

This extends to the less severe case of under or oversampling (because truncation can be seen as undersampling taken to its logical extreme).


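# generate data (seed fixed so the illustration is reproducible)
set.seed(1)
x <- rnorm(100)
y <- 3 + 2*x + rnorm(100)

# demonstrate
plot(x, y, bty="l")
abline(v=0, col="grey70")             # cutoff for the subset based on x
abline(h=4, col="grey70")             # cutoff for the subset based on y
abline(3, 2, col=1)                   # true line (black)
abline(lm(y~x), col=2)                # fit on all data (red)
abline(lm(y[x>0] ~ x[x>0]), col=3)    # fit on subset chosen by x (green)
abline(lm(y[y>4] ~ x[y>4]), col=4)    # fit on subset chosen by y (blue)
points(x[y>4], y[y>4], pch=19, col=4)
points(x[x>0], y[x>0], pch=1, cex=1.5, col=3)
legend(-2.5,8, legend=c("True line", "Fitted - all data", "Fitted - subset based on x",
    "Fitted - subset based on y"), lty=1, col=1:4, bty="n")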

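For readers who prefer numbers to a plot, the same comparison can be sketched by looking at the fitted slopes directly. The larger sample size, the seed and the name `slopes` are assumptions made here so that the contrast is stable; the model $y = 3 + 2x + e$ and the cutoffs are the ones used in the illustration.

```r
set.seed(123)
x <- rnorm(1000)
y <- 3 + 2 * x + rnorm(1000)   # true slope is 2

slopes <- c(
  all      = coef(lm(y ~ x))[["x"]],                   # all data
  x_subset = coef(lm(y ~ x, subset = x > 0))[["x"]],   # selected on the predictor
  y_subset = coef(lm(y ~ x, subset = y > 4))[["x"]]    # selected on the response
)
print(round(slopes, 2))
```

Selecting on x should leave the slope close to the true value, while selecting on y pulls it noticeably towards zero.
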
Solved – Evaluating a regression model’s performance using training and test sets

As said, the Mean Squared Error (MSE) is typically used. You fit your regression model on your training set and evaluate its performance on a separate test set (a set of inputs x with known outputs y) by calculating the MSE between the known outputs of the test set (y) and the outputs predicted by the model (f(x)) for the same inputs (x).
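A minimal sketch of this procedure in base R, assuming a simulated data set, a 70/30 split and a simple linear model (none of which come from the original question):

```r
set.seed(7)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 1 + 0.5 * dat$x + rnorm(n)   # assumed data-generating process

# Hold out 30% of the rows as a test set
train_idx <- sample(n, size = round(0.7 * n))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Fit on the training set only
fit <- lm(y ~ x, data = train)

# Evaluate on the unseen test set
pred <- predict(fit, newdata = test)
mse <- mean((test$y - pred)^2)
print(mse)
```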

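Alternatively, you can use related metrics: Root Mean Squared Error (the square root of the MSE), Mean Absolute Error (the mean of |y - f(x)|), and Relative Squared Error or Relative Absolute Error (the total squared or absolute error divided by the corresponding total error of a predictor that always outputs the mean of y).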
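These metrics are short enough to write by hand. The helper names and the toy vectors below are hypothetical, chosen only so the arithmetic is easy to check.

```r
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))
# Relative errors compare the model against always predicting mean(obs)
rse  <- function(obs, pred) sum((obs - pred)^2) / sum((obs - mean(obs))^2)
rae  <- function(obs, pred) sum(abs(obs - pred)) / sum(abs(obs - mean(obs)))

obs  <- c(0, 0, 1, 3)       # toy observations
pred <- c(0.5, 0.5, 1, 2)   # toy predictions

metrics <- c(
  rmse = rmse(obs, pred),
  mae  = mae(obs, pred),
  rse  = rse(obs, pred),
  rae  = rae(obs, pred)
)
print(metrics)
```

A relative error below 1 means the model beats the mean-only baseline on that test set.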
