How does a decision tree classifier work for text sentences?

cart, classification, natural language, python

This image was created with a decision tree classifier trained on 30 sentences from two categories, sports and crime. The texts were converted into vectors using sbert (each sentence is now represented as integers), and every sentence was brought to a length of 112 (short sentences are also extended to 112, so that every sentence has features X0 to X111).

I am trying to figure out how this decision tree works. What do the numbers 345, 196, 5131.5 denote?

Best Answer

Take your variable 31 (one of the 112 variables that make up your encoded sentence). If its value is less than or equal to 345, you follow the left path, labelled 'True'. Otherwise you go right. So variable 31 is the 'best' first split for dividing up your data, and the algorithm found that this variable is best split at the value 345.
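To make "best split" concrete, here is a minimal sketch of how a CART-style tree could pick a threshold for a single variable. It assumes Gini impurity as the criterion (the default in scikit-learn's DecisionTreeClassifier) and tries the midpoint between each pair of neighbouring values. The feature values and labels are invented for illustration, and `gini` and `best_threshold` are hypothetical helper names, not part of any library.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best 'value <= threshold' split."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between two equal values
        threshold = (lo + hi) / 2  # candidate: midpoint of neighbouring values
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        # Impurity of both sides, weighted by how many points land on each side
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Made-up values of one variable for six sentences (not the asker's data)
x31 = [120, 300, 340, 350, 900, 4000]
y = ["sports", "sports", "sports", "crime", "crime", "crime"]

threshold, impurity = best_threshold(x31, y)
print(threshold, impurity)  # prints: 345.0 0.0
```

In this toy data the midpoint between 340 and 350 separates the two classes perfectly, so it wins. Midpoint candidates are also why a tree can show a threshold such as 5131.5 even when every input value is an integer.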

At each split, a decision tree poses a question, and the first line you see in a node is that question: is x31 less than or equal to 345? The split then leads to two follow-up questions:

one for the 'Trues'
one for the 'Falses'

So the 'Falses' get the question: is x34 less than or equal to 5131.5? And so on.

All a decision tree does is partition your data and this graph shows you the questions it asks each data point in order to partition it.
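The partitioning can be sketched in a few lines of plain Python. Only the two questions (x31 <= 345 and x34 <= 5131.5) come from the figure discussed here. Everything else is an assumption made for illustration: the leaf names are placeholders, the 'True' side of the first split is collapsed into a single leaf (the real tree may ask further questions there), and `route` and `make_point` are hypothetical helpers.

```python
# Hypothetical tree: the two questions are from the figure, the leaves are placeholders.
tree = {
    "feature": 31, "threshold": 345,
    "true": {"leaf": "leaf A"},  # simplification: real tree may split further here
    "false": {
        "feature": 34, "threshold": 5131.5,
        "true": {"leaf": "leaf B"},
        "false": {"leaf": "leaf C"},
    },
}

def route(node, x):
    """Ask each question in turn until the data point reaches a leaf."""
    while "leaf" not in node:
        answer = x[node["feature"]] <= node["threshold"]
        node = node["true"] if answer else node["false"]
    return node["leaf"]

def make_point(x31, x34, n_features=112):
    """Build a 112-value vector with only features 31 and 34 set (made-up data)."""
    x = [0] * n_features
    x[31], x[34] = x31, x34
    return x

points = {
    "p1": make_point(100, 0),      # x31 <= 345, goes left
    "p2": make_point(500, 5000),   # x31 > 345, then x34 <= 5131.5
    "p3": make_point(500, 6000),   # x31 > 345, then x34 > 5131.5
}
partition = {name: route(tree, x) for name, x in points.items()}
print(partition)  # prints: {'p1': 'leaf A', 'p2': 'leaf B', 'p3': 'leaf C'}
```

Every data point ends up in exactly one leaf, which is the sense in which the tree partitions the data.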

For text sentences specifically, the process is the same as with any other data. The only difference is that you use word embeddings, or some other way of converting the words to numbers, to represent them. So what does 345 actually mean? On its own, nothing. The full vector of variables for a given data point is meant to represent a sentence and to lie close to the vectors of similar sentences. The individual numbers mean nothing in isolation; they only mean something relative to other numbers.
