Solved – Sampling a highly imbalanced multi-class response variable

I have a dataset (11000 x 117) with response variable having multiple classes.

Here is a plot of class distribution:

Some of the classes have only 1 sample in the entire dataset, and others have only 2, 3 or 5.

I have tried:

SMOTE: it generates samples (with freq = 100, 11000×117) for each of the classes, but the model performs very poorly on the resampled dataset. Maybe this is because SMOTE generates new synthetic points in the space between existing samples.

downSample & upSample: neither gives good accuracy, as explained below.

Sample/class weights in loss function:

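library(dplyr)  # provides %>% and select()

# Weight each class by (majority class count / class count), rounded up
get_class_weights <- function(y) {
  counter <- funModeling::freq(y, plot = FALSE) %>% select(var, frequency)
  majority <- max(counter$frequency)
  counter$weight <- ceiling(majority / counter$frequency)
  # Named list: class label -> weight
  setNames(as.list(counter$weight), counter$var)
}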

This helped increase training and validation accuracy from 0.18 to 0.28, but that is still not enough.
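As a worked example of the weighting scheme above, here is a minimal base R sketch on made-up labels. The vector y and its counts are invented for illustration, and table() stands in for funModeling::freq, so this is not the exact code from the post.

```r
# Toy labels (hypothetical): "a" is the majority class, "c" is a singleton.
y <- factor(c(rep("a", 50), rep("b", 5), "c"))

counts <- table(y)                          # a = 50, b = 5, c = 1
weights <- ceiling(max(counts) / counts)    # a = 1, b = 10, c = 50

# Named list: class label -> weight
class_weights <- setNames(as.list(as.numeric(weights)), names(counts))

# Total weighted contribution of each class: count * weight
contribution <- as.numeric(counts) * as.numeric(weights)
print(contribution)
```

Each class ends up contributing the same total weight (50 here), but the singleton class still supplies only one distinct example, which may be part of why the gain in accuracy is limited. If such a list is passed to Keras's fit() through its class_weight argument, the names would normally need to be the integer class indices rather than the labels; that mapping is not shown here.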

As you can see, the classes are highly imbalanced. Is there any other way this can be done?

I am using Keras networks in R to train a neural network.

SOLUTION I tried:

After removing the observations belonging to low-frequency classes (< 100 samples), I got this class distribution:

As some imbalance remained, I tried upSample, which resulted in the following distribution:

But even with this balanced distribution, the neural net reaches only 35% accuracy.

What could be the reason for this?
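The two steps described above (dropping classes with fewer than 100 samples, then upsampling the rest) could be sketched in base R roughly as follows. This is an assumption-laden stand-in, not the code actually used: filter_and_upsample, the toy data frame and its column names are hypothetical, and plain sampling with replacement takes the place of the upSample call.

```r
# Keep only classes with at least `min_n` rows, then upsample each remaining
# class (sampling rows with replacement) to the size of the largest class.
filter_and_upsample <- function(df, label, min_n = 100) {
  labels <- as.character(df[[label]])
  counts <- table(labels)
  keep <- names(counts)[counts >= min_n]
  df <- df[labels %in% keep, , drop = FALSE]
  if (nrow(df) == 0) return(df)              # nothing survived the filter

  labels <- as.character(df[[label]])
  target <- max(table(labels))
  groups <- split(seq_len(nrow(df)), labels)
  idx <- unlist(lapply(groups, function(rows) {
    # sample.int on positions avoids sample()'s special case for length-1 input
    extra <- rows[sample.int(length(rows), target - length(rows), replace = TRUE)]
    c(rows, extra)
  }), use.names = FALSE)
  df[idx, , drop = FALSE]
}

# Toy data (hypothetical): class C has only 10 rows and is dropped.
set.seed(42)
toy <- data.frame(x = rnorm(330),
                  flag = c(rep("A", 200), rep("B", 120), rep("C", 10)))
balanced <- filter_and_upsample(toy, "flag", min_n = 100)
print(table(balanced$flag))
```

Note that the extra rows are exact duplicates of existing ones, so the balanced data set contains no information that was not already in the filtered data.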

Best Answer

I don't think this is a problem with sampling; rather, it is a problem with your data. You have many responses with too few instances. From your comment, you have 110 different responses - which are some sort of computer flag - and, from your first plot, it looks like about 1/4 to 1/2 of these have so few instances that it's going to be impossible to estimate anything for them.
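To illustrate why resampling adds nothing for a class with a single instance, here is a simplified SMOTE-style interpolation in base R. It is only a sketch under stated assumptions: smote_like is a hypothetical helper, it interpolates towards a random other point of the class instead of one of the k nearest neighbours, and it is not the code of any SMOTE package.

```r
# Simplified SMOTE-style interpolation for ONE class (illustrative only).
# x: numeric matrix holding the rows of a single minority class.
smote_like <- function(x, n_new) {
  if (nrow(x) < 2) {
    # No second point to interpolate towards: only exact copies are possible.
    return(x[rep(1, n_new), , drop = FALSE])
  }
  out <- matrix(NA_real_, nrow = n_new, ncol = ncol(x))
  for (i in seq_len(n_new)) {
    idx <- sample.int(nrow(x), 2)   # a point and a different point of the class
    u <- runif(1)                   # random position on the segment between them
    out[i, ] <- x[idx[1], ] + u * (x[idx[2], ] - x[idx[1], ])
  }
  out
}

set.seed(1)
one_sample <- matrix(c(0.2, 1.5), nrow = 1)    # class with a single instance
two_samples <- rbind(c(0, 0), c(1, 2))         # class with two instances

from_one <- smote_like(one_sample, 5)   # five identical copies of the same row
from_two <- smote_like(two_samples, 5)  # points on the line between the two rows
```

With one instance, every "new" point is a copy of the original; with two, every new point lies on the segment joining them. Either way the synthetic data only restates what those one or two rows already say.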

I don't think any sampling program is going to solve this; you will have to combine some of the flags or else drop some of them.
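One way to combine flags, sketched in base R, is to lump every rare class into a single catch-all level. The helper lump_rare, the threshold of 100 (borrowed from the question's own cut-off) and the "other" label are all assumptions for illustration; which flags can sensibly be merged is a question about what the flags mean, not something the code can decide.

```r
# Replace every class seen fewer than `min_n` times with one "other" level.
lump_rare <- function(y, min_n = 100, other = "other") {
  y <- as.character(y)
  counts <- table(y)
  rare <- names(counts)[counts < min_n]
  y[y %in% rare] <- other
  factor(y)
}

# Toy flags (hypothetical): f3 and f4 are too rare to model on their own.
flags <- c(rep("f1", 150), rep("f2", 120), rep("f3", 3), "f4")
lumped <- lump_rare(flags, min_n = 100)
print(table(lumped))
```

Dropping the rare flags instead amounts to filtering out the rows where the lumped value equals "other".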
