Hello, I am using sklearn to train a classifier. I have the following distribution of labels:
label : 0 frequency : 119
label : 1 frequency : 1615
label : 2 frequency : 197
label : 3 frequency : 70
label : 4 frequency : 203
label : 5 frequency : 137
label : 6 frequency : 18
label : 7 frequency : 142
label : 8 frequency : 15
label : 9 frequency : 182
label : 10 frequency : 986
label : 12 frequency : 73
label : 13 frequency : 27
label : 14 frequency : 81
label : 15 frequency : 168
label : 18 frequency : 107
label : 21 frequency : 125
label : 22 frequency : 172
label : 23 frequency : 3870
label : 25 frequency : 2321
label : 26 frequency : 25
label : 27 frequency : 314
label : 28 frequency : 76
label : 29 frequency : 116
One thing that clearly stands out is that I am working with an unbalanced data set: classes 25, 23, 1 and 10 have many more examples than the rest. After training I am getting bad results, as follows:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.61 0.23 0.34 528
2 0.00 0.00 0.00 70
3 0.67 0.06 0.11 32
4 0.00 0.00 0.00 62
5 0.78 0.82 0.80 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.14 0.01 0.02 313
12 0.00 0.00 0.00 30
13 0.31 0.57 0.40 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.41 0.74 0.53 1278
25 0.28 0.39 0.33 758
26 0.50 0.25 0.33 8
27 0.29 0.02 0.03 115
28 1.00 0.61 0.76 23
29 0.00 0.00 0.00 42
avg / total 0.33 0.39 0.32 3683
I am getting many zeros, and the SVC is not able to learn several of the classes. The hyperparameters that I am using are the following:
from sklearn import svm
clf2 = svm.SVC(kernel='linear')
In order to overcome this issue I built a dictionary with a weight for each class, as follows:
weight = {}
for label in uniqLabels:
    # fraction of the data set that belongs to this label
    weight[label] = labels_cluster.count(label) / len(labels_cluster)
for label, w in weight.items():
    print(label, w)
print(weight)
These are the numbers and the output. I am just taking the number of elements with a given label divided by the total number of elements in the label set, so the sum of these numbers is 1:
0 0.010664037996236221
1 0.14472622994892015
2 0.01765391164082803
3 0.006272963527197778
4 0.018191594228873554
5 0.012277085760372793
6 0.0016130477641365713
7 0.012725154583744062
8 0.0013442064701138096
9 0.01630970517071422
10 0.0883591719688144
12 0.0065418048212205395
13 0.002419571646204857
14 0.007258714938614571
15 0.015055112465274667
18 0.009588672820145173
21 0.011201720584281746
22 0.015413567523971682
23 0.34680526928936284
25 0.20799354780894344
26 0.0022403441168563493
27 0.028138722107715744
28 0.006810646115243301
29 0.01039519670221346
Trying again with this dictionary of weights, as follows:
from sklearn import svm
clf2 = svm.SVC(kernel='linear', class_weight=weight)
I got:
precision recall f1-score support
0 0.00 0.00 0.00 31
1 0.90 0.19 0.31 528
2 0.00 0.00 0.00 70
3 0.00 0.00 0.00 32
4 0.00 0.00 0.00 62
5 0.00 0.00 0.00 39
6 0.00 0.00 0.00 3
7 0.00 0.00 0.00 46
8 0.00 0.00 0.00 5
9 0.00 0.00 0.00 62
10 0.00 0.00 0.00 313
12 0.00 0.00 0.00 30
13 0.00 0.00 0.00 7
14 0.00 0.00 0.00 35
15 0.00 0.00 0.00 56
18 0.00 0.00 0.00 35
21 0.00 0.00 0.00 39
22 0.00 0.00 0.00 66
23 0.36 0.99 0.52 1278
25 0.46 0.01 0.02 758
26 0.00 0.00 0.00 8
27 0.00 0.00 0.00 115
28 0.00 0.00 0.00 23
29 0.00 0.00 0.00 42
avg / total 0.35 0.37 0.23 3683
Since I am not getting good results, I would really appreciate suggestions on how to automatically adjust the weight of each class and pass that to the SVC. I don't have much experience dealing with unbalanced data.
Best Answer
If you are not getting good results, you should first check that you are using the right classification algorithm (is your data well fit to be classified by a linear SVM?) and that you have enough training data. Practically, that means you might consider visualizing your dataset through PCA or t-SNE to see how "clustered" your classes are, and checking how your classification metrics evolve with the amount of data your classifier is given.
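As an illustration of the visualization idea, here is a minimal sketch of projecting features to 2-D with PCA so you can eyeball how separable the classes are. The arrays X and y below are synthetic stand-ins for your real features and labels:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic placeholder data (replace with your real features/labels)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.integers(0, 4, size=200)

# Project the features down to 2 dimensions
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)  # one 2-D point per sample

# To inspect class separation, scatter-plot X_2d colored by label, e.g.:
# import matplotlib.pyplot as plt
# for label in np.unique(y):
#     mask = y == label
#     plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=label, s=10)
# plt.legend(); plt.show()
```

If the classes overlap heavily even in a t-SNE projection, no amount of weight tuning will make a linear SVM separate them.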
If you then confirm that investing in tweaking your linear SVM is the right way to approach your problem, you can look at modifying the class weights. Note that what you suggest as weights is probably the opposite of what you want: your weights are proportional to class frequency, so you are giving more weight to the already dominant classes and marginalizing the rare ones further. Said differently, you typically want weights that are inversely proportional to class frequencies. You can compute these manually, or you can let sklearn do it automatically for you by specifying class_weight='balanced'.
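For reference, a minimal sketch of both options: the built-in 'balanced' mode, and computing the same inverse-frequency weights explicitly with compute_class_weight. The data here is synthetic, just to make the snippet self-contained:

```python
import numpy as np
from sklearn import svm
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data (placeholder for your real X / labels)
rng = np.random.default_rng(0)
y = np.array([0] * 20 + [1] * 200 + [2] * 50)
X = rng.normal(size=(len(y), 5)) + y[:, None]  # shift features per class

# Option 1: let sklearn derive inverse-frequency weights internally
clf_auto = svm.SVC(kernel='linear', class_weight='balanced')
clf_auto.fit(X, y)

# Option 2: compute the same weights explicitly,
# following the formula n_samples / (n_classes * count_of_class)
classes = np.unique(y)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y)
clf_manual = svm.SVC(kernel='linear',
                     class_weight=dict(zip(classes, weights)))
clf_manual.fit(X, y)

# Rare classes get the largest weights
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

With 20/200/50 samples per class, the 'balanced' weights come out to 4.5, 0.45 and 1.8 respectively, i.e. the rarest class is weighted most heavily.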