Random Forest Classifier Accuracy – Why the Performance May Be Low

classification, machine learning, random forest

I'm trying to train a machine learning model that predicts job titles from the skills entered. I used a Random Forest classifier, but the accuracy turns out to be very low. I thought the model was underfitting, so I added more unique rows and more columns to the dataset. However, this made the accuracy drop even further. The prediction results seem correct, so how can I improve the accuracy?

The classification report results:

                                                       precision    recall  f1-score   support

                                  Android developer       0.00      0.00      0.00         2
                              Application architect       0.00      0.00      0.00         0
                   Artificial intelligence engineer       0.00      0.00      0.00         3
                           Automation test engineer       1.00      0.50      0.67         2
                                       Bi developer       0.00      0.00      0.00         0
                                  Big data engineer       0.00      0.00      0.00         2
          Business intelligence development manager       1.00      1.00      1.00         1
            C sharp asp net client server developer       0.00      0.00      0.00         0
                     Cloud infrastructure developer       0.00      0.00      0.00         2
                       Cloud services product owner       0.25      0.50      0.33         2
                           Computer vision engineer       0.00      0.00      0.00         1
                              Cpp software engineer       0.00      0.00      0.00         1
                             Cyber security analyst       0.00      0.00      0.00         2
                                       Data analyst       0.00      0.00      0.00         1
                                     Data architect       0.00      0.00      0.00         1
               Data center virtualization architect       0.00      0.00      0.00         1
                                      Data engineer       0.00      0.00      0.00         1
                                     Data scientist       0.00      0.00      0.00         1
                        Data security administrator       0.00      0.00      0.00         1
                                    Devops engineer       0.00      0.00      0.00         1
Domestic Outsourcing Business Development Executive       0.00      0.00      0.00         1
                     Domestic outsourcing executive       0.00      0.00      0.00         0
                                Electrical engineer       0.00      0.00      0.00         1
                         Embedded software engineer       0.00      0.00      0.00         0
                                 Frontend developer       0.00      0.00      0.00         1
                               Geolocation engineer       0.00      0.00      0.00         1
                                      Gui developer       0.00      0.00      0.00         1
                      Information security engineer       0.00      0.00      0.00         1
                   Information technology architect       0.00      0.00      0.00         0
                Infrastructure production developer       0.00      0.00      0.00         2
                                It business analyst       0.33      1.00      0.50         1
                              It quality consultant       0.00      0.00      0.00         2
                                     Java developer       0.00      0.00      0.00         0
                           Java full stack engineer       0.00      0.00      0.00         2
                                     Jira developer       1.00      1.00      1.00         2
                                     Linux engineer       0.00      0.00      0.00         3
                           Mobile automation tester       0.00      0.00      0.00         1
                       Mysql database administrator       0.00      0.00      0.00         1
                                   Network engineer       1.00      1.00      1.00         1
                         Procurement system manager       0.00      0.00      0.00         2
                                  Product developer       0.00      0.00      0.00         0
                                    Product manager       1.00      1.00      1.00         3
                                    Project manager       0.00      0.00      0.00         4
                     Quality assurance test analyst       0.00      0.00      0.00         1
                                     Sales engineer       0.00      0.00      0.00         2
                                 Sap fico architect       0.00      0.00      0.00         1
                                 Sap technical lead       0.00      0.00      0.00         0
                                   Storage engineer       0.00      0.00      0.00         2
                         Swift messaging specialist       0.00      0.00      0.00         2
                       System support administrator       0.00      0.00      0.00         2
                              Systems test engineer       0.00      0.00      0.00         1
                              Windows administrator       0.00      0.00      0.00         1
                       Windows system administrator       0.00      0.00      0.00         1

                                           accuracy                           0.15        68
                                          macro avg       0.11      0.11      0.10        68
                                       weighted avg       0.14      0.15      0.14        68

Accuracy: 0.14705882352941177

A part of the dataset:

Jobtitle                       Skill1                  Skill2                  Skill3                  Skill4                  Skill5                  Skill6
Automation test engineer       Selenium                Postman                 Ubuntu                  Testcraft               Testing tools           Mongodb
Automation test engineer       Mongodb                 Testing tools           Testcraft               Ubuntu                  Postman                 Selenium
Automation test engineer       Testing tools           Testcraft               Postman                 Mongodb                 Selenium                Ubuntu
Automation test engineer       Postman                 Mongodb                 Testing tools           Selenium                Ubuntu                  Testcraft
Automation test engineer       Testcraft               Ubuntu                  Selenium                Testing tools           Mongodb                 Postman
Automation test engineer       Ubuntu                  Selenium                Mongodb                 Postman                 Testcraft               Testing tools

Information security engineer  Cloud security          System monitoring       Incident response       Systems administration  Security accessment
Information security engineer  Security accessment     Systems administration  System monitoring       Incident response       Cloud security
Information security engineer  System monitoring       Incident response       Security accessment     Cloud security          Systems administration
Information security engineer  Systems administration  Security accessment     Cloud security          System monitoring       Incident response
Information security engineer  Incident response       Cloud security          Systems administration  Security accessment     System monitoring

I also tried tuning the Random Forest hyperparameters. That helped, but as the results above show, it is still not enough.

Best Answer

I see a number of entries that look suspiciously close to fractions like $\frac{1}{2}$, $\frac{1}{3}$, $\frac{2}{3}$ and so on. It seems like you have very little data, so little that some categories only have two or three entries. Your random forest just can't do much with this little data. And neither can any other model.
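To make the "fractions" remark concrete, here is a small sketch that rebuilds the overall accuracy from the report above. The recall and support values are copied from the rows with non-zero recall, and 68 is the total support the report states; the variable and function names are mine and purely illustrative.

```python
from fractions import Fraction

# (recall, support) for every class with non-zero recall, copied from the report.
nonzero_rows = {
    "Automation test engineer": (0.50, 2),
    "Business intelligence development manager": (1.00, 1),
    "Cloud services product owner": (0.50, 2),
    "It business analyst": (1.00, 1),
    "Jira developer": (1.00, 2),
    "Network engineer": (1.00, 1),
    "Product manager": (1.00, 3),
}
total_support = 68  # total number of test samples, as stated in the report

# recall * support is the number of correctly classified samples of that class.
correct = sum(round(recall * support) for recall, support in nonzero_rows.values())
accuracy = correct / total_support
print(f"{correct} correct out of {total_support}: accuracy = {accuracy:.3f}")


def possible_recalls(support):
    """All recall values a class with `support` test samples can take."""
    return sorted({Fraction(k, support) for k in range(support + 1)})


# A class with 2 test samples can only score 0, 1/2 or 1.
print(possible_recalls(2))
```

So the reported accuracy comes down to a handful of correctly classified test samples, and each per-class score is one of only a few possible values. Numbers this coarse say very little about how the model would do on new data.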

See also the thread "How to know that your machine learning problem is hopeless?"

Also note that accuracy, precision, recall, F1 etc. have major problems, especially for "unbalanced" data, but also in general. Every criticism in the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: "Why is accuracy not the best measure for assessing classification models?", "Is accuracy an improper scoring rule in a binary classification setting?" and "Classification probability threshold". Instead, use probabilistic classifications, and evaluate these using proper scoring rules.
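As a rough illustration of the difference, the sketch below scores two sets of predicted class probabilities with two proper scoring rules, the log loss and the multiclass Brier score, next to plain accuracy. The toy labels and probabilities are invented for this example and the helper names are mine. With scikit-learn you would typically get such probabilities from the classifier's `predict_proba` method, and `sklearn.metrics.log_loss` computes the first score for you.

```python
import math


def mean_log_loss(y_true, probs, eps=1e-15):
    """Average negative log-probability assigned to the true class."""
    return -sum(math.log(max(p[y], eps)) for y, p in zip(y_true, probs)) / len(y_true)


def brier_score(y_true, probs):
    """Average squared distance between the probabilities and the one-hot truth."""
    total = 0.0
    for y, p in zip(y_true, probs):
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(y_true)


def accuracy(y_true, probs):
    """Accuracy after turning each probability vector into a hard argmax label."""
    hard = [max(range(len(p)), key=p.__getitem__) for p in probs]
    return sum(h == y for h, y in zip(hard, y_true)) / len(y_true)


# Toy data (made up): 3 classes, 4 samples, two hypothetical models.
y_true = [0, 1, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.05, 0.05, 0.9], [0.7, 0.2, 0.1]]
hesitant = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.3, 0.3, 0.4], [0.4, 0.35, 0.25]]

for name, probs in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(y_true, probs), mean_log_loss(y_true, probs), brier_score(y_true, probs))
```

Both toy models make the same hard predictions and therefore have identical accuracy, yet the scoring rules tell them apart because they look at the probabilities themselves. Lower is better for both scores.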
