Solved – How to account for case weights when generating folds for K-fold cross-validation

cross-validation, sampling, validation, weighted-data, weights

I am currently working on a binary classification problem where each point in the dataset is paired with a case weight. That is, each point is of the form $(w_i, x_i, y_i)$ where:

  • $x_i, y_i$ are the features and outcome of a person of type $i$;

  • $w_i \geq 1$ is the number of people of type $i$ in the general population.

Here, we only care about accuracy on the general population and are aiming to fit a classifier that attains the lowest weighted error $\sum_{i=1}^n{w_i \cdot 1[y_i \neq \hat{y}_i ]}$.
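For concreteness, this is the quantity I evaluate; a minimal sketch (the helper name `weighted_error` is my own, purely for illustration):

```python
import numpy as np

def weighted_error(w, y_true, y_pred):
    """Population-level misclassification cost: the sum of case weights
    over the points the classifier gets wrong."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * (np.asarray(y_true) != np.asarray(y_pred))))
```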

My question is: How do I handle the case weights when generating folds for $K$-fold CV?


I am currently handling this problem as follows: I create an 'expanded' dataset that includes $w_i$ copies of each point in the original dataset $(x_i, y_i)$, then generate folds using the expanded dataset.
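Roughly, my current procedure looks like this (a sketch assuming NumPy arrays `X`, `y`, integer case weights `w`, and scikit-learn's `KFold`; the variable names are just illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical inputs: X (n x p features), y (n labels), w (n integer case weights).
# Expand each record into w_i identical copies, then split the expanded rows.
idx = np.repeat(np.arange(len(w)), w)          # row i appears w_i times
X_exp, y_exp = X[idx], y[idx]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_rows, test_rows in kf.split(X_exp):
    X_train, y_train = X_exp[train_rows], y_exp[train_rows]
    X_test,  y_test  = X_exp[test_rows],  y_exp[test_rows]
    # ...fit on the expanded training copies, evaluate on the expanded test copies...
```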

The issue with this approach is that, for points with large case weights, we are likely to end up with copies of the same point $(x_i, y_i)$ in both the training set and the testing set.

As an example, say my dataset contains a point $(w_i, x_i, y_i)$ where $w_i = 1000$ (which is very large). If I generate folds for 5-fold CV using the expanded dataset, then I could end up with 800 copies of $(x_i, y_i)$ in the training set and 200 copies of $(x_i, y_i)$ in the test set.

Since both the training set and the testing set include the same kinds of points, the training error will end up being very similar to the testing error. As a result, the test error doesn't really reflect out-of-sample performance.

Best Answer

Model validation estimates how well the model-building process provides reliable predictions for other samples from the population. To apply that principle, each CV fold must go through the entire model-building process exactly as you applied it to the full data set.

If you have a set of records that represents a random sample from an underlying population, each record with its associated weight, then there is no reason to consider the weights in constructing the folds. In 10-fold CV, you conduct the same model building process on 90% of cases, with their weights, and evaluate performance on the held-out 10%, with their weights.
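As a sketch of that record-level procedure, assuming scikit-learn, NumPy arrays `X`, `y`, `w` over the original records, and an estimator whose `fit` accepts `sample_weight` (logistic regression here is only a placeholder):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Folds are formed over records; each record's weight travels with it into
# whichever side of the split it lands on, so a weight is never divided
# between training and test.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
fold_errors = []
for train_idx, test_idx in skf.split(X, y):
    clf = LogisticRegression()
    clf.fit(X[train_idx], y[train_idx], sample_weight=w[train_idx])
    y_hat = clf.predict(X[test_idx])
    # Weighted misclassification rate on the held-out records.
    fold_errors.append(
        np.sum(w[test_idx] * (y[test_idx] != y_hat)) / np.sum(w[test_idx])
    )
```

The essential point is only that the folds are drawn over records, not over expanded copies, so no record contributes to both training and evaluation within a fold.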

The way you describe the situation, however, it seems that there may already have been some data aggregation, with each individual record representing a subpopulation of known size in the underlying population ("types of persons"), with a single $(x,y)$ pair for each subpopulation and its weight representing its size. In that case something based on your "expanded data set" or "split weights" might seem to make sense.

With aggregated data, however, that CV approach would ignore any differences in $(x,y)$ values among members of an individual subpopulation, which would seem to be a major source of variance that is not taken into account in your model. It's not at all clear that CV on an expanded data set would adequately test your model-building process unless you know for a fact that all members of each subpopulation have identical $(x,y)$ pairs (in which case this approach seems at first thought to be OK), or you incorporate a reasonable estimate of variability into the expanded data set. But if your original model-building process didn't incorporate within-subpopulation variability, then you aren't validating the model-building approach that you actually used.

If there are differences in $(x,y)$ pairs among individuals that are the same "type of person," your model-building process needs to take that into account.
