Solved – Sampling a highly imbalanced multi-class response variable

down-sample, r, sampling, smote, unbalanced-classes

I have a dataset (11000 x 117) whose response variable has multiple classes.

Here is a plot of class distribution:

[image: class distribution plot]

Some of the classes have only 1 sample in the entire dataset, and some have 2, 3, or 5.

I have tried:

SMOTE: it generates samples for each of the classes (freq = 100 per class, giving 11000×117), but the model performs very poorly on the resampled dataset. Maybe that is because SMOTE generates its new points in the plane between existing samples:
[image: SMOTE illustration]

downSample & upSample: these do not give good accuracy, as explained below.

Sample/class weights in loss function:

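library(dplyr)  # provides %>% and select()

# Weight each class by (majority count / class count), rounded up.
get_class_weights <- function(y) {
  counter <- funModeling::freq(y, plot = FALSE) %>% select(var, frequency)
  majority <- max(counter$frequency)
  counter$weight <- ceiling(majority / counter$frequency)
  # Named list: class label -> weight
  setNames(as.list(counter$weight), counter$var)
}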

This helped increase training and validation accuracy from 0.18 to 0.28, but that is still not enough.
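To see what this kind of weighting does on data this skewed, here is a small sketch with made-up labels. It uses base R's table() instead of funModeling so that it runs on its own, and class_weights_base and y_toy are illustrative names, not part of the original code.

```r
# Base-R analogue of the weighting above (function name is illustrative).
class_weights_base <- function(y) {
  counts <- table(y)
  w <- ceiling(max(counts) / counts)   # majority count / class count, rounded up
  setNames(as.list(as.numeric(w)), names(counts))
}

# Made-up labels: one large class, one medium class, one class with a single row.
y_toy <- c(rep("a", 3000), rep("b", 100), "c")
str(class_weights_base(y_toy))   # a = 1, b = 30, c = 3000
```

With a majority class of 3000 rows, the class with a single observation gets a weight of 3000, so that one row carries as much total weight in the loss as the entire majority class.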

As you can see, the classes are highly imbalanced. Is there any other way this can be done?

I am using Keras networks in R to train a neural network.

SOLUTION I tried:

After removing the observations whose classes have low frequency (< 100), I got this class distribution:

[image: class distribution after removing low-frequency classes]

As there was still imbalance, I tried upSample, which resulted in the following distribution:
[image: class distribution after upSample]
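The two steps just described (dropping classes below the cutoff, then upsampling the rest) can be sketched in base R as follows. This is an illustration under assumptions, not the code that produced the plots: drop_rare_classes, upsample_sketch, and the toy data frame are hypothetical names, and the upsampling is plain resampling with replacement.

```r
# Keep only rows whose class has at least min_n observations.
drop_rare_classes <- function(df, label_col, min_n = 100) {
  labels <- as.character(df[[label_col]])
  counts <- table(labels)
  keep <- names(counts)[counts >= min_n]
  df[labels %in% keep, , drop = FALSE]
}

# Resample every remaining class (with replacement) up to the largest class size.
upsample_sketch <- function(df, label_col) {
  idx_by_class <- split(seq_len(nrow(df)), as.character(df[[label_col]]))
  target <- max(lengths(idx_by_class))
  idx <- unlist(lapply(idx_by_class, function(i) {
    extra <- i[sample.int(length(i), target - length(i), replace = TRUE)]
    c(i, extra)   # original rows plus resampled copies
  }), use.names = FALSE)
  df[idx, , drop = FALSE]
}

# Made-up data: classes "a" and "b" survive the cutoff, "c" does not.
set.seed(1)
toy <- data.frame(x = rnorm(405),
                  flag = c(rep("a", 300), rep("b", 100), rep("c", 5)))
kept <- drop_rare_classes(toy, "flag", min_n = 100)
balanced <- upsample_sketch(kept, "flag")
table(balanced$flag)   # both classes now have 300 rows
```

The added rows are exact copies of existing ones, so a balanced bar chart does not mean the network has seen any new information about the smaller classes.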

But even with this distribution, the neural net gives only 35% accuracy.

What could be the reason for this?


Best Answer

I don't think this is a problem with sampling; rather, it is a problem with your data. You have many responses with too few instances. From your comment, you have 110 different responses, which are some sort of computer flag, and, from your first plot, it looks like about 1/4 to 1/2 of these have so few instances that it's going to be impossible to estimate anything for them.
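The SMOTE attempt in the question illustrates the point. A SMOTE-style sampler creates a synthetic row by picking a point on the line segment between a sample and one of its nearest neighbours in the same class. The sketch below is a simplified illustration in base R, not the implementation of any particular package; smote_sketch and its arguments are hypothetical names.

```r
# Simplified SMOTE-style interpolation for ONE class (illustrative only).
# x: numeric matrix holding that class's rows; n_new: synthetic rows wanted;
# k: number of nearest neighbours to choose from.
smote_sketch <- function(x, n_new, k = 5) {
  n <- nrow(x)
  if (n < 2) stop("need at least 2 samples in the class to interpolate")
  k <- min(k, n - 1)
  d <- as.matrix(dist(x))                 # pairwise Euclidean distances
  out <- matrix(NA_real_, n_new, ncol(x))
  for (i in seq_len(n_new)) {
    a <- sample.int(n, 1)                 # random base sample
    ord <- order(d[a, ])
    nn <- ord[ord != a][seq_len(k)]       # its k nearest neighbours, excluding itself
    b <- nn[sample.int(length(nn), 1)]    # pick one neighbour at random
    out[i, ] <- x[a, ] + runif(1) * (x[b, ] - x[a, ])  # point on the segment a-b
  }
  out
}

set.seed(1)
x_small <- rbind(c(0, 0), c(1, 1), c(2, 0))   # a made-up class with 3 samples
smote_sketch(x_small, n_new = 5, k = 2)
```

A class with a single observation has no neighbour to interpolate toward, and a class with 2 or 3 observations can only yield points on the segments between them.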

I don't think any sampling program is going to solve this; you will have to combine some of the flags or else drop some of them.
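One way to combine flags, sketched under the assumption that the rare flags can sensibly be treated as one group (which only someone who knows what the flags mean can judge), is to relabel every flag with fewer than min_n observations as a single catch-all class. lump_rare, min_n, and the "other" label are illustrative choices, not something taken from the question.

```r
# Relabel every class with fewer than min_n observations as one catch-all class.
lump_rare <- function(y, min_n = 100, other = "other") {
  y <- as.character(y)
  counts <- table(y)
  rare <- names(counts)[counts < min_n]
  y[y %in% rare] <- other
  factor(y)
}

# Made-up flags: "a" and "b" are frequent, "c", "d" and "e" are rare.
flags <- c(rep("a", 150), rep("b", 120), rep("c", 3), "d", "e")
table(lump_rare(flags, min_n = 100))   # a = 150, b = 120, other = 5
```

The catch-all class can itself end up small, as it does here, so dropping it afterwards remains an option.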
