Dealing with imbalanced/zero-inflated training examples for regression

Tags: classification, model-evaluation, regression, unbalanced-classes, zero-inflation

I am trying to predict the rainfall in a desert with a regression model. However, as you might expect, most of my training examples have a label of zero. I have two questions:

a. What is an appropriate performance measure?

For classification problems, it seems conventional to evaluate the confusion matrix, F1 score or other metrics (e.g. kappa) normalized for imbalanced classes.

What about in a regression setting? Any model that predicts a near-constant zero will achieve a very low RMSE/MAE, but that tells me little about how well the model will ultimately predict the amount of rainfall.

b. What is an appropriate model?

It seems that one common strategy with zero-inflated data is to split this into a two-step problem. First, solve a binary classification problem for {rain, no rain} and pick my favorite classifier by cross-validation. Then use that classifier to split my data set and run a separate regression problem conditional on predicted rain.

The main concern I have with this approach is that I have limited data at the regression step (there are very few training examples conditional on predicted rain).

Is there a better approach I can take?

Best Answer

(a) Assess the aspect(s) of performance you are interested in. If you are mainly interested in getting the expectation of the response E(y) right, then MAE or RMSE are useful. Similarly, you could also assess the conditional expectation E(y | y > 0), i.e., the expected amount of precipitation given that there is precipitation. If you are mostly interested in the probability of any precipitation P(y > 0), you could look at the corresponding misclassification rate, the Brier score, etc. And if you are interested in the entire distribution, then scoring rules like the log-likelihood (or log-score) or the CRPS (continuous ranked probability score) would be natural.
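To make the difference between these measures concrete, here is a minimal sketch in plain Python (standard library only). The data are made up for illustration (18 dry days, 2 wet days), and the helper names `conditional_mae` and `brier` are hypothetical, not taken from any package. It scores a forecaster that always predicts zero.

```python
import math

def mae(y, pred):
    """Mean absolute error over all observations."""
    return sum(abs(a - b) for a, b in zip(y, pred)) / len(y)

def rmse(y, pred):
    """Root mean squared error over all observations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y))

def conditional_mae(y, pred):
    """MAE restricted to observations with y > 0 (days with rain)."""
    wet = [(a, b) for a, b in zip(y, pred) if a > 0]
    return sum(abs(a - b) for a, b in wet) / len(wet)

def brier(y, p_rain):
    """Brier score for the event y > 0, given forecast probabilities."""
    return sum((p - (1.0 if a > 0 else 0.0)) ** 2
               for a, p in zip(y, p_rain)) / len(y)

# Made-up data (assumption): rainfall in mm, 90% zeros.
y = [0.0] * 18 + [5.0, 15.0]
zero_pred = [0.0] * len(y)

overall_mae = mae(y, zero_pred)            # averaged over all 20 days
overall_rmse = rmse(y, zero_pred)
wet_mae = conditional_mae(y, zero_pred)    # averaged over the 2 wet days only

# Probability forecasts: "never rains" vs. the observed rain frequency (0.1).
brier_never = brier(y, [0.0] * len(y))
brier_freq = brier(y, [0.1] * len(y))

print(overall_mae, overall_rmse, wet_mae, brier_never, brier_freq)
```

On this toy data the always-zero forecast has an overall MAE of 1 mm but a conditional MAE of 10 mm on wet days, and the constant frequency forecast gets a slightly better Brier score than "never rains". Which of these numbers matters depends on the quantity you actually care about.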

(b) Instead of a two-step model with binary first step and zero-truncated second step, you could also use a single regression model with a response that is censored at zero. A worked example with precipitation in a weather forecasting context is available in a paper about our crch R package (see https://doi.org/10.32614/RJ-2016-012).
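For intuition about what "censored at zero" means, here is a minimal sketch of a censored-Gaussian (Tobit-type) log-likelihood in plain Python. It is not the crch implementation or its API. The Gaussian latent variable, the intercept-only model, the crude grid search, and the toy data are all assumptions made for illustration. The idea is that a latent y* ~ N(mu, sigma^2) is observed as y = max(0, y*), so zeros contribute P(y* <= 0) and positive values contribute the normal density.

```python
import math

def log_phi(z):
    """Log density of the standard normal at z."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * z * z

def log_Phi(z):
    """Log CDF of the standard normal at z (erfc keeps the lower tail accurate)."""
    return math.log(max(0.5 * math.erfc(-z / math.sqrt(2.0)), 1e-300))

def censored_loglik(y, mu, sigma):
    """Log-likelihood of y = max(0, y*) with latent y* ~ N(mu, sigma^2)."""
    total = 0.0
    for yi in y:
        if yi <= 0.0:
            # Censored observation: all we know is that y* <= 0.
            total += log_Phi((0.0 - mu) / sigma)
        else:
            # Uncensored observation: ordinary normal log-density.
            total += log_phi((yi - mu) / sigma) - math.log(sigma)
    return total

# Made-up data (assumption): rainfall in mm, 90% zeros.
y = [0.0] * 18 + [5.0, 15.0]

# Crude grid search over an intercept-only model (illustration only;
# a real fit would use a proper optimizer and covariates).
mu_grid = [m * 0.5 for m in range(-80, 21)]    # -40.0 ... 10.0
sigma_grid = [s * 0.5 for s in range(2, 61)]   # 1.0 ... 30.0
ll_hat, mu_hat, sigma_hat = max(
    (censored_loglik(y, mu, sigma), mu, sigma)
    for mu in mu_grid
    for sigma in sigma_grid
)

# Implied probability of any rain under the fitted latent distribution.
p_rain = 1.0 - math.exp(log_Phi((0.0 - mu_hat) / sigma_hat))
print(mu_hat, sigma_hat, p_rain)
```

Because all observations, dry and wet, enter one likelihood, the location and scale are estimated from the full data set rather than only from the few wet cases. The same log-likelihood can also serve as the log-score mentioned in (a).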