Negative impact from imbalance/skew in predictor variables


I understand that imbalance or skew in the target variable of your training data can negatively impact model performance. Does the same apply to the predictor/independent variables?

$y \sim \beta_0 + \beta_1 x_1 + \beta_2 x_2$

Consider this simple example. I am trying to predict $y$, a categorical variable, from two variables $x_1$ and $x_2$, which are also categorical. If I have an imbalanced set of $y$ values, this could be a bad thing. What if I have an imbalanced set of $x_1$ or $x_2$ values? Could the same issue apply?

Best Answer

Lack of balance is not bad for a saturated model. With two categorical predictors this means having $AB$ parameters, where $A$ is the number of categories in the first variable and $B$ the number in the second. If you include the interaction term between $x_1$ and $x_2$, you should be fine. If this fit is too "noisy", then include the interaction as a random effect in a mixed model, so you get the benefits of partial pooling.
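
To make this concrete, here is a minimal simulation sketch (not from the original answer; it assumes Python with `statsmodels` and simulated data) showing that a saturated logistic fit recovers every cell's log-odds even when one predictor is heavily imbalanced:

```python
# Saturated model on imbalanced categorical predictors: one parameter per
# cell (2 x 2 = AB = 4), so imbalance costs precision, not correctness.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 20_000

# x1 heavily imbalanced (90/10 split), x2 roughly balanced
x1 = rng.choice(["a", "b"], size=n, p=[0.9, 0.1])
x2 = rng.choice(["c", "d"], size=n, p=[0.5, 0.5])

# True success probability differs by cell (the groups genuinely differ)
true_p = {("a", "c"): 0.2, ("a", "d"): 0.5, ("b", "c"): 0.7, ("b", "d"): 0.3}
p = np.array([true_p[(u, v)] for u, v in zip(x1, x2)])
y = rng.binomial(1, p)

df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# "y ~ x1 * x2" expands to main effects plus the interaction: saturated
sat = smf.logit("y ~ x1 * x2", data=df).fit(disp=False)
print(sat.params)  # each cell's log-odds is recovered despite the imbalance
```

The imbalance only shows up in the standard errors: the minority cells ($x_1 = $ "b") are estimated from fewer rows, so their estimates are noisier, which is exactly the situation where partial pooling via a mixed model helps.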

The problem comes in when you are pooling relationships across categories (such as fitting main effects only). If you are pooling across two imbalanced groups, the larger group contributes more to the pooled estimate. This causes problems when the groups are actually different and the imbalance is "unrepresentative", that is, an artifact of the sampling mechanism that produced your data (e.g., quota sampling, or "balancing" by subsampling your data).
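
Here is a hedged sketch of that pooling failure (again hypothetical, simulated data with `statsmodels`): the $x_2$ effect has the opposite sign in the two $x_1$ groups, and the main-effects-only fit is dragged toward the majority group:

```python
# Pooling across imbalanced groups: a main-effects-only fit averages the
# per-group x2 effects, weighted heavily toward the 90% majority group.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20_000
x1 = rng.choice(["a", "b"], size=n, p=[0.9, 0.1])  # 90/10 imbalance
x2 = rng.choice(["c", "d"], size=n)

# x2 -> y effect is +1.0 inside group "a" but -1.0 inside group "b"
logit = np.where(x1 == "a", 0.0, 0.5) + np.where(
    x2 == "d", np.where(x1 == "a", 1.0, -1.0), 0.0
)
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# Main effects only: the pooled x2 coefficient lands near the majority
# group's +1.0, hiding that group "b" has the opposite sign
pooled = smf.logit("y ~ x1 + x2", data=df).fit(disp=False)
print("pooled:", round(pooled.params["x2[T.d]"], 2))

# Per-group fits recover the two opposite effects
for g in ["a", "b"]:
    fit_g = smf.logit("y ~ x2", data=df[df["x1"] == g]).fit(disp=False)
    print(g, round(fit_g.params["x2[T.d]"], 2))
```

Adding the interaction back (`y ~ x1 * x2`) removes the problem, which is the sense in which the saturated model is safe under imbalance.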