Given a data set where each individual data point can be assigned to more than 1 class (a multi-class, multi-label data set), are there any guidelines for calculating oversampling weights, i.e., the probability with which you sample a data point based on the frequencies of the labels within the data set?
This is in the context of multi-label classification; I have a very imbalanced data set.
An obvious answer would be to calculate the weight for each label as the inverse frequency (i.e., 1 / total_number_of_label_appearances), then average the weights for a given data point, though I'm unsure whether there are any better approaches.
Best Answer
Calculating the weight for each label as the inverse frequency, then averaging the weights over the labels of a given data point, can be done like so with pandas in Python:
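A minimal sketch of that approach, assuming the labels are stored as a DataFrame with one 0/1 indicator column per label. The toy data, the column names, and the helper `oversampling_weights` are hypothetical and only illustrate the idea; rows with no labels get weight 0 here, which is one possible choice among several.

```python
import pandas as pd

# Hypothetical toy data: one row per data point, one 0/1 column per label.
df = pd.DataFrame(
    {
        "label_a": [1, 1, 1, 0, 1],
        "label_b": [0, 1, 0, 0, 0],
        "label_c": [0, 0, 1, 1, 0],
    }
)


def oversampling_weights(labels: pd.DataFrame) -> pd.Series:
    """Return per-row sampling probabilities (hypothetical helper).

    Each label's weight is 1 / (number of rows carrying that label);
    a row's weight is the mean of the weights of the labels it carries.
    """
    label_counts = labels.sum(axis=0)
    # Assumes every label appears at least once; otherwise this divides by zero.
    inverse_freq = 1.0 / label_counts

    # Multiply each indicator column by its label weight, then average per row
    # over the labels that are actually present.
    weighted = labels.mul(inverse_freq, axis=1)
    n_labels = labels.sum(axis=1)
    row_weight = weighted.sum(axis=1) / n_labels.where(n_labels > 0)

    # Rows without any label get weight 0 (an assumption of this sketch).
    row_weight = row_weight.fillna(0.0)

    # Normalize so the weights form a probability distribution.
    return row_weight / row_weight.sum()


weights = oversampling_weights(df)

# Draw an oversampled data set with replacement, using the weights as
# sampling probabilities.
resampled = df.sample(n=len(df), replace=True, weights=weights, random_state=0)
```

In this toy data, the row carrying the rare `label_b` ends up with the largest sampling probability, while rows that carry only the common `label_a` get the smallest.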