I'm trying to train a machine learning model that predicts job titles from the skills entered. I used a Random Forest classifier, but the accuracy is very low. I thought the model was underfitting, so I added more unique rows and more columns to the dataset. However, this made the accuracy drop even further. The prediction results are correct, but how can I improve the accuracy?
The classification report results:
                                                    precision  recall  f1-score  support
Android developer                                        0.00    0.00      0.00        2
Application architect                                    0.00    0.00      0.00        0
Artificial intelligence engineer                         0.00    0.00      0.00        3
Automation test engineer                                 1.00    0.50      0.67        2
Bi developer                                             0.00    0.00      0.00        0
Big data engineer                                        0.00    0.00      0.00        2
Business intelligence development manager                1.00    1.00      1.00        1
C sharp asp net client server developer                  0.00    0.00      0.00        0
Cloud infrastructure developer                           0.00    0.00      0.00        2
Cloud services product owner                             0.25    0.50      0.33        2
Computer vision engineer                                 0.00    0.00      0.00        1
Cpp software engineer                                    0.00    0.00      0.00        1
Cyber security analyst                                   0.00    0.00      0.00        2
Data analyst                                             0.00    0.00      0.00        1
Data architect                                           0.00    0.00      0.00        1
Data center virtualization architect                     0.00    0.00      0.00        1
Data engineer                                            0.00    0.00      0.00        1
Data scientist                                           0.00    0.00      0.00        1
Data security administrator                              0.00    0.00      0.00        1
Devops engineer                                          0.00    0.00      0.00        1
Domestic Outsourcing Business Development Executive      0.00    0.00      0.00        1
Domestic outsourcing executive                           0.00    0.00      0.00        0
Electrical engineer                                      0.00    0.00      0.00        1
Embedded software engineer                               0.00    0.00      0.00        0
Frontend developer                                       0.00    0.00      0.00        1
Geolocation engineer                                     0.00    0.00      0.00        1
Gui developer                                            0.00    0.00      0.00        1
Information security engineer                            0.00    0.00      0.00        1
Information technology architect                         0.00    0.00      0.00        0
Infrastructure production developer                      0.00    0.00      0.00        2
It business analyst                                      0.33    1.00      0.50        1
It quality consultant                                    0.00    0.00      0.00        2
Java developer                                           0.00    0.00      0.00        0
Java full stack engineer                                 0.00    0.00      0.00        2
Jira developer                                           1.00    1.00      1.00        2
Linux engineer                                           0.00    0.00      0.00        3
Mobile automation tester                                 0.00    0.00      0.00        1
Mysql database administrator                             0.00    0.00      0.00        1
Network engineer                                         1.00    1.00      1.00        1
Procurement system manager                               0.00    0.00      0.00        2
Product developer                                        0.00    0.00      0.00        0
Product manager                                          1.00    1.00      1.00        3
Project manager                                          0.00    0.00      0.00        4
Quality assurance test analyst                           0.00    0.00      0.00        1
Sales engineer                                           0.00    0.00      0.00        2
Sap fico architect                                       0.00    0.00      0.00        1
Sap technical lead                                       0.00    0.00      0.00        0
Storage engineer                                         0.00    0.00      0.00        2
Swift messaging specialist                               0.00    0.00      0.00        2
System support administrator                             0.00    0.00      0.00        2
Systems test engineer                                    0.00    0.00      0.00        1
Windows administrator                                    0.00    0.00      0.00        1
Windows system administrator                             0.00    0.00      0.00        1

accuracy                                                                   0.15       68
macro avg                                                0.11    0.11      0.10       68
weighted avg                                             0.14    0.15      0.14       68
Accuracy: 0.14705882352941177
A part of the dataset:
Jobtitle,Skill1,Skill2,Skill3,Skill4,Skill5,Skill6
Automation test engineer,Selenium,Postman,Ubuntu,Testcraft,Testing tools,Mongodb
Automation test engineer,Mongodb,Testing tools,Testcraft,Ubuntu,Postman,Selenium
Automation test engineer,Testing tools,Testcraft,Postman,Mongodb,Selenium,Ubuntu
Automation test engineer,Postman,Mongodb,Testing tools,Selenium,Ubuntu,Testcraft
Automation test engineer,Testcraft,Ubuntu,Selenium,Testing tools,Mongodb,Postman
Automation test engineer,Ubuntu,Selenium,Mongodb,Postman,Testcraft,Testing tools
Information security engineer,Cloud security,System monitoring,Incident response,Systems administration,Security accessment,
Information security engineer,Security accessment,Systems administration,System monitoring,Incident response,Cloud security,
Information security engineer,System monitoring,Incident response,Security accessment,Cloud security,Systems administration,
Information security engineer,Systems administration,Security accessment,Cloud security,System monitoring,Incident response,
Information security engineer,Incident response,Cloud security,Systems administration,Security accessment,System monitoring,
I also tried tuning the Random Forest hyperparameters. It helps, but as you can see from the results, still not enough.
Best Answer
I see a number of entries that look suspiciously close to fractions like $\frac{1}{2}$, $\frac{1}{3}$, $\frac{2}{3}$ and so on. It seems like you have very little data, so little that some categories only have two or three entries. Your random forest just can't do much with this little data. And neither can any other model.
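To make the point about fractions concrete, here is a minimal Python sketch. The helper name `possible_scores` is made up for illustration, and the `supports` list is just the support column copied from the report in the question. The idea it shows: with only n test samples in a class, the recall for that class can only be one of 0/n, 1/n, ..., n/n, so the scores move in very coarse steps.

```python
from collections import Counter
from fractions import Fraction

def possible_scores(n):
    """All values a per-class recall can take when the class has n test samples."""
    return [Fraction(k, n) for k in range(n + 1)]

# Support column copied from the classification report in the question.
supports = [2, 0, 3, 2, 0, 2, 1, 0, 2, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 0,
            1, 0, 1, 1, 1, 1, 0, 2, 1, 2, 0, 2, 2, 3, 1, 1, 1, 2, 0, 3, 4, 1,
            2, 1, 0, 2, 2, 2, 1, 1, 1]

print(len(supports), "classes,", sum(supports), "test samples")
print("classes per support size:", sorted(Counter(supports).items()))

# With 1, 2 or 3 samples there are only a handful of attainable recall values.
for n in (1, 2, 3):
    print(n, "sample(s):", [str(f) for f in possible_scores(n)])
```

Run on the numbers from the question, this counts 53 listed classes sharing 68 test samples, and 25 of those classes have exactly one test sample, so a single prediction flips that class's recall between 0 and 1.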
See also the thread "How to know that your machine learning problem is hopeless?"
Also note that accuracy, precision, recall, F1 etc. have major problems, especially for "unbalanced" data, but also in general. Every criticism in the following threads applies equally to all of these, and indeed to all evaluation metrics that rely on hard classifications: "Why is accuracy not the best measure for assessing classification models?", "Is accuracy an improper scoring rule in a binary classification setting?", and "Classification probability threshold". Instead, use probabilistic classifications, and evaluate these using proper scoring rules.
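As a rough illustration of what evaluating probabilistic classifications can look like, here is a small pure-Python sketch of two common proper scoring rules, the log loss and the multiclass Brier score. The function names, the two job titles used as labels, and the toy probabilities are all made up for the example. With scikit-learn one would presumably take the probabilities from the forest's `predict_proba` and score them with something like `sklearn.metrics.log_loss`, but that is an assumption about your setup.

```python
import math

def log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class (lower is better)."""
    total = 0.0
    for label, p in zip(y_true, probas):
        # Clip so that a predicted probability of 0 does not produce log(0).
        p_true = min(max(p.get(label, 0.0), eps), 1.0)
        total -= math.log(p_true)
    return total / len(y_true)

def brier_score(y_true, probas):
    """Mean squared distance between predicted probabilities and the one-hot truth."""
    total = 0.0
    for label, p in zip(y_true, probas):
        classes = set(p) | {label}
        total += sum((p.get(c, 0.0) - (1.0 if c == label else 0.0)) ** 2
                     for c in classes)
    return total / len(y_true)

def accuracy(y_true, probas):
    """Accuracy after collapsing each prediction to its most probable class."""
    hard = [max(p, key=p.get) for p in probas]
    return sum(h == t for h, t in zip(hard, y_true)) / len(y_true)

# Hypothetical toy data: two test samples, two candidate job titles.
y_true = ["Data analyst", "Data engineer"]
overconfident = [
    {"Data analyst": 0.99, "Data engineer": 0.01},
    {"Data analyst": 0.99, "Data engineer": 0.01},  # confidently wrong
]
hedged = [
    {"Data analyst": 0.60, "Data engineer": 0.40},
    {"Data analyst": 0.55, "Data engineer": 0.45},  # wrong, but only slightly
]

for name, probas in [("overconfident", overconfident), ("hedged", hedged)]:
    print(name, accuracy(y_true, probas),
          round(log_loss(y_true, probas), 3),
          round(brier_score(y_true, probas), 3))
```

Both toy models get one of the two samples right, so accuracy cannot tell them apart. The log loss and the Brier score both penalize the confidently wrong prediction much more heavily, which is the kind of information that hard classifications throw away.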