Machine Learning Sampling – Using Oversampling to Increase Sample Weights

bootstrap, machine-learning, sampling, weighted-data

I am working on a balanced binary classification problem. I don't need to adjust the proportion of either class, as it's already 50/50. However, some of the samples are more valuable than others. For example, even though rows 4 and 15 both have response variable = 0, it is actually more valuable to me to get row 4 correct than to get row 15 correct.

You could say "then your response variable is designed incorrectly", but let's leave that aside for a moment.

Could I use oversampling to increase the prevalence of the "more important" rows in my dataset (by duplicating them), thereby increasing the weight of those samples so that whichever ML algorithm I'm using is forced to value those examples more heavily?
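
For concreteness, here is the kind of thing I have in mind (a minimal sketch assuming a pandas DataFrame with a hypothetical `importance` column giving each row's duplication count):

```python
import pandas as pd

# Hypothetical example data: two features, a balanced target, and an
# "importance" column saying how many copies each row should receive.
df = pd.DataFrame({
    "x1": [0.2, 1.5, 0.7, 2.1],
    "x2": [3.0, 0.4, 1.8, 0.9],
    "y":  [0, 1, 0, 1],
    "importance": [3, 1, 1, 2],  # row 0 appears 3x, row 3 appears 2x
})

# Oversample by repeating each row according to its importance count.
df_oversampled = df.loc[df.index.repeat(df["importance"])].reset_index(drop=True)
```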

Is there any other way of going about increasing the sample weight for certain samples?

Thank you!

Best Answer

Oversampling is one option. You could also pass a sample_weight to the ML algorithm directly (many scikit-learn estimators accept a sample_weight argument in their fit methods), or formulate your loss function explicitly so that errors on the important rows are penalized more heavily.
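
For instance, a minimal sketch using scikit-learn's sample_weight (the feature values and weights here are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.2, 3.0], [1.5, 0.4], [0.7, 1.8], [2.1, 0.9]])
y = np.array([0, 1, 0, 1])

# Higher weight means the fit is penalized more for
# misclassifying that particular row.
weights = np.array([3.0, 1.0, 1.0, 2.0])

clf = LogisticRegression()
clf.fit(X, y, sample_weight=weights)
```

For a model trained with a squared-error or log-loss objective, this is equivalent to multiplying each sample's contribution to the loss by its weight, which is also what duplicating rows achieves (with integer weights), just without inflating the dataset.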
