Hello, I am using sklearn to train a classifier. I have the following distribution of labels:
label : 0 frequency : 119
label : 1 frequency : 1615
label : 2 frequency : 197
label : 3 frequency : 70
label : 4 frequency : 203
label : 5 frequency : 137
label : 6 frequency : 18
label : 7 frequency : 142
label : 8 frequency : 15
label : 9 frequency : 182
label : 10 frequency : 986
label : 12 frequency : 73
label : 13 frequency : 27
label : 14 frequency : 81
label : 15 frequency : 168
label : 18 frequency : 107
label : 21 frequency : 125
label : 22 frequency : 172
label : 23 frequency : 3870
label : 25 frequency : 2321
label : 26 frequency : 25
label : 27 frequency : 314
label : 28 frequency : 76
label : 29 frequency : 116
One thing that clearly stands out is that I am working with an unbalanced data set: classes 25, 23, 1 and 10 have many more examples than the rest. After training I am getting bad results, as follows:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.61 0.23 0.34 528
2 0.00 0.00 0.00 70
3 0.67 0.06 0.11 32
4 0.00 0.00 0.00 62
5 0.78 0.82 0.80 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.14 0.01 0.02 313
12 0.00 0.00 0.00 30
13 0.31 0.57 0.40 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.41 0.74 0.53 1278
25 0.28 0.39 0.33 758
26 0.50 0.25 0.33 8
27 0.29 0.02 0.03 115
28 1.00 0.61 0.76 23
29 0.00 0.00 0.00 42
avg / total 0.33 0.39 0.32 3683
I am getting many zeros, and the SVC is not able to learn several of the classes. The hyperparameters that I am using are the following:
from sklearn import svm
clf2 = svm.SVC(kernel='linear')
In order to overcome this issue I built a dictionary with a weight for each class, as follows:
weight = {}
for label in uniqLabels:
    # fraction of the data set that belongs to this label
    weight[label] = labels_cluster.count(label) / len(labels_cluster)
for label, w in weight.items():
    print(label, w)
print(weight)
These are the numbers and the output. I am just taking the number of elements with a given label divided by the total number of elements in the label set, so the sum of these numbers is 1:
0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346
Trying again with this dictionary of weights, as follows:
from sklearn import svm
clf2 = svm.SVC(kernel='linear', class_weight=weight)
I got:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.90 0.19 0.31 528
2 0.00 0.00 0.00 70
3 0.00 0.00 0.00 32
4 0.00 0.00 0.00 62
5 0.00 0.00 0.00 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.00 0.00 0.00 313
12 0.00 0.00 0.00 30
13 0.00 0.00 0.00 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.36 0.99 0.52 1278
25 0.46 0.01 0.02 758
26 0.00 0.00 0.00 8
27 0.00 0.00 0.00 115
28 0.00 0.00 0.00 23
29 0.00 0.00 0.00 42
avg / total 0.35 0.37 0.23 3683
Since I am not getting good results, I would really appreciate suggestions on how to automatically adjust the weight of each class and pass that to the SVC. I don't have much experience dealing with unbalanced data.
Best Answer
If you are not getting good results, you should first check that you are using the right classification algorithm (is your data well fit to be classified by a linear SVM?) and that you have enough training data. Practically, that means you might consider visualizing your dataset through PCA or t-SNE to see how "clustered" your classes are, and checking how your classification metrics evolve with the amount of data your classifier is given.
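As an illustration of the visualization idea, here is a minimal sketch of projecting features to 2-D with PCA so you can eyeball how separable the classes are. The arrays X and y below are synthetic stand-ins for your real features and labels:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic placeholder data (replace with your real features/labels)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)

# Project the features down to 2 dimensions
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # one 2-D point per sample

# To inspect class separation, scatter-plot X_2d colored by label, e.g.:
# import matplotlib.pyplot as plt
# for label in np.unique(y):
#     mask = y == label
#     plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=label, s=10)
# plt.legend(); plt.show()
```

If the classes overlap heavily even in a t-SNE projection, no amount of weight tuning will make a linear SVM separate them.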
If you then confirm that investing in tweaking your linear SVM is the right way to approach your problem, you can look at modifying the class weights. Note that what you suggest as weights is probably the opposite of what you want: your weights are proportional to class frequency, so you are giving more weight to the already dominant classes and marginalizing the rare ones further. Said differently, you typically want weights that are inversely proportional to class frequencies. You can compute these manually, or you can let sklearn do it automatically for you by specifying class_weight='balanced'.
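For reference, a minimal sketch of both options: the built-in 'balanced' mode, and computing the same inverse-frequency weights explicitly with compute_class_weight. The data here is synthetic, just to make the snippet self-contained:

```python
import numpy as np
from sklearn import svm
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (placeholder for your real X / labels)
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 200 + [2] * 50)
X = rng.normal(size=(len(y), 5)) + y[:, None]  # shift features per class

# Option 1: let sklearn derive inverse-frequency weights internally
clf_auto = svm.SVC(kernel='linear', class_weight='balanced')
clf_auto.fit(X, y)

# Option 2: compute the same weights explicitly,
# following the formula n_samples / (n_classes * count_of_class)
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
clf_manual = svm.SVC(kernel='linear',
                     class_weight=dict(zip(classes, weights)))
clf_manual.fit(X, y)

# Rare classes get the largest weights
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With 20/200/50 samples per class, the 'balanced' weights come out to 4.5, 0.45 and 1.8 respectively, i.e. the rarest class is weighted most heavily.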