Solved – How to deal with unbalanced data

Tags: dataset, k-nearest-neighbour, machine-learning, unbalanced-classes

I'm analysing a dataset of 11795 data points with 88 features. About 85% of the points (9973) belong to class 1, 5% (589) belong to class 2, and 10% (1233) belong to class 3.

I'm trying to build a model from this data to predict the class of new data points. If I build my model using this dataset, will it favour class 1? Will it have difficulty detecting the low-frequency classes?

Generally, how does one tackle unbalanced data sets such as this one?

Thank you for any advice =)

P.S.

I'm using k-nearest neighbor and regularized linear regression methods.

Best Answer

How you deal with unbalanced classes depends on the particular classifier you work with. What classifier are you using? In cases like this, a one-vs-class strategy (one binary classifier per class, as in one-vs-all) has been reported to perform better than a naive approach, since each classifier works with a more balanced data set.

There are also a couple of classifier-agnostic strategies, such as stratified sampling and other resampling methods (for example, oversampling the rare classes or undersampling the dominant one).
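To make the sampling idea concrete, here is a minimal sketch in plain Python. It assumes the class counts from the question; the helper names `stratified_split` and `random_oversample` are made up for illustration and do not come from any library. The first keeps each class's share roughly equal in the training and test parts, and the second duplicates rare-class points at random until every class is as large as the biggest one.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly its share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def random_oversample(indices, labels, seed=0):
    """Duplicate rare-class indices at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


# Class counts taken from the question: 9973 / 589 / 1233.
labels = [1] * 9973 + [2] * 589 + [3] * 1233
train_idx, test_idx = stratified_split(labels)
balanced_train = random_oversample(train_idx, labels)

print(Counter(labels[i] for i in test_idx))        # test set keeps the original skew
print(Counter(labels[i] for i in balanced_train))  # training set is now balanced
```

Note that only the training part is oversampled. The test part should keep the original class proportions, otherwise the evaluation no longer reflects the data the model will actually see.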

P.S. You said you are using kNN. One standard approach is to weight each neighbour's vote according to its distance to the sample being classified. There are a few approaches; I am only familiar with this one. The paper is quite nice.
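As a rough illustration of distance weighting, the sketch below lets each of the k nearest neighbours vote with weight 1/distance. The inverse-distance weight is my assumption (other weighting schemes exist), and `knn_predict` plus the toy points are hypothetical, not taken from the paper.

```python
import math
from collections import defaultdict


def knn_predict(train_X, train_y, query, k=5, weighted=True, eps=1e-9):
    """Classify `query` by a vote among its k nearest training points.

    With weighted=True each neighbour votes with weight 1 / distance
    (an assumed weighting scheme); otherwise every neighbour counts once.
    """
    neighbours = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    votes = defaultdict(float)
    for d, y in neighbours:
        votes[y] += 1.0 / (d + eps) if weighted else 1.0
    return max(votes, key=votes.get)


# Toy data: two rare-class points sit right next to the query,
# while the more numerous class-1 points are further away.
train_X = [(0.1, 0.0), (0.0, 0.1), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (3.0, 3.0), (4.0, 4.0)]
train_y = [2, 2, 1, 1, 1, 1, 1]
query = (0.0, 0.0)

plain = knn_predict(train_X, train_y, query, weighted=False)
by_distance = knn_predict(train_X, train_y, query, weighted=True)
print(plain, by_distance)
```

In this toy case the plain majority vote goes to class 1 (three of the five neighbours), while the distance-weighted vote goes to class 2 because its two points are much closer to the query.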

As for linear regression methods, you can regularize the weights to reduce potential overfitting. Again, you could try something like one-vs-all if it applies to the algorithm you are using.
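For what it's worth, one-vs-all with a regularized linear regressor can be sketched as follows. This assumes an L2 (ridge) penalty fitted by plain gradient descent; the function names, the hyperparameters (`lam`, `lr`, `epochs`) and the toy points are all illustrative choices of mine, not a recommendation. Each class gets its own regressor trained on targets +1 (this class) and -1 (everything else), and the prediction is the class whose regressor scores highest.

```python
def fit_ridge(X, t, lam=0.01, lr=0.1, epochs=1000):
    """Minimise mean squared error + lam * ||w||^2 by gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, t):
            err = sum(wj * xj for wj, xj in zip(w, x)) + b - target
            for j in range(d):
                grad_w[j] += 2 * err * x[j] / n
            grad_b += 2 * err / n
        # The penalty applies to the weights only, not to the intercept.
        w = [wj - lr * (gj + 2 * lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def fit_one_vs_all(X, y, **kwargs):
    """Train one regressor per class: target +1 for that class, -1 otherwise."""
    return {
        c: fit_ridge(X, [1.0 if yi == c else -1.0 for yi in y], **kwargs)
        for c in sorted(set(y))
    }


def predict_one_vs_all(models, x):
    """Return the class whose regressor gives the highest score for x."""
    def score(c):
        w, b = models[c]
        return sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(models, key=score)


# Toy, deliberately unbalanced data: 8 / 2 / 3 points in three clusters.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (-0.1, 0.0), (0.0, -0.1), (-0.1, -0.1), (0.1, -0.1),
     (1.0, 0.0), (0.9, 0.1),
     (0.0, 1.0), (0.1, 0.9), (-0.1, 1.0)]
y = [1] * 8 + [2] * 2 + [3] * 3

models = fit_one_vs_all(X, y)
predictions = [predict_one_vs_all(models, q) for q in [(0.0, 0.0), (1.0, 0.05), (0.0, 0.95)]]
print(predictions)
```

On real data you would tune `lam` on a validation set, and it is worth checking the per-class error rates rather than overall accuracy, since a model that always answers class 1 is already about 85% accurate on the data in the question.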