Solved – Training a binary classifier (xgboost) using probabilities instead of just 0 and 1 (versus training a multi class classifier or using regression)

boosting, machine learning, python

Sorry if the title seems a little roundabout, but let me explain what I'm trying to do. I'm training an XGBClassifier (in Python) on samples that aren't strictly in class 0 or class 1 but have a little granularity of range: anywhere from [0, 1], [.25, .75], [.5, .5], [.75, .25], [1, 0] for the two classes, where [0, 1] means a sample is 0% class A and 100% class B.

The reason I would rather not use regression is that the training values aren't technically continuous but discrete within the spectrum of 0 to 1, and I'm trying to get the power of multi-class classification within a framework where all the classes are simply different mixtures of purely class A and purely class B. Perhaps regression is still a better option, or using reg:linear as the objective, but that doesn't exactly solve my problem.

For example, measuring sentiment less in terms of "negative or positive" and more as "25% positive, 75% negative", using predict_proba().

I'm trying to figure out the best way to do this. I know the default objective of XGBClassifier is binary:logistic, and I might also try multi:softmax, but I like the idea of using predict_proba() to get a value between 0 and 1 as a measure of where a sample falls on a scale between class A and class B (or really, between 0 and 1), which would be more difficult with 5 separate "classes."
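To be concrete about that last part, here's a minimal sketch (toy data, nothing real) of what I mean by reading a sample's position on the 0-to-1 scale off predict_proba():

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # toy features
y = rng.integers(0, 2, size=100)   # hard 0/1 labels, just for illustration

clf = XGBClassifier(objective="binary:logistic")
clf.fit(X, y)

# predict_proba() returns one row per sample: [P(class 0), P(class 1)];
# the second column is the 0-to-1 score I'm after
scores = clf.predict_proba(X)[:, 1]
```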

(For the following example, I'm using the letters A and B but really mean 0 and 1. It's just less confusing this way.)

My first inclination is to force the classification probabilities by using ratios of A and B in the training set for each sample, essentially sending each one through four times with different labels, but I'm not sure whether there's an easier way, or whether this is doing what I think it is.

For example, if I have a sample that I want to represent as [.5, .5], basically a 50/50 or "neutral" sentiment (so that similar samples I send through later come out around [.5, .5]), I'd train it twice with a label of A and twice with a label of B. Then for something that should be classified as [0, 1], I'd train it four times with a label of B, and for something that is [.75, .25], I'd train it three times with a label of A and once with a label of B.

Here's how I'd train each sample, where "B B B B" means I send the same sample through four times telling the classifier it is B, and so on:

[0.00, 1.00]: B B B B 
[0.25, 0.75]: A B B B 
[0.50, 0.50]: A A B B 
[0.75, 0.25]: A A A B 
[1.00, 0.00]: A A A A
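In code, that scheme would look something like this minimal sketch (expand is a hypothetical helper of mine; I'm assuming the soft label comes in as p, the probability of class B, at 0.25 granularity):

```python
import numpy as np
from xgboost import XGBClassifier

def expand(X, p, n_copies=4):
    """Replicate each row n_copies times, labelling round(p * n_copies)
    of the copies as B (= 1) and the rest as A (= 0)."""
    X_rep, y_rep = [], []
    for x, p_i in zip(X, p):
        n_b = int(round(p_i * n_copies))
        X_rep.extend([x] * n_copies)
        y_rep.extend([0] * (n_copies - n_b) + [1] * n_b)
    return np.array(X_rep), np.array(y_rep)

# toy data: p is the soft label, e.g. p = 0.75 means [0.25, 0.75]
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))
p = np.tile([0.0, 0.25, 0.5, 0.75, 1.0], 2)

X_rep, y_rep = expand(X, p)
clf = XGBClassifier(objective="binary:logistic").fit(X_rep, y_rep)
```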

So, barring this approach being incorrect, is there a better way to go about what I'm trying to do? Like an analog of predict_proba() but for training inputs? Knowing how the algorithm works, I don't think that exists, but then again, I'm here to be schooled.

Is this a bastardization of a binary classifier parading as a regression wannabe? Or is this an alright way to do what I'm trying to do?

Thanks everybody.

Best Answer

Your instinct is correct -- this is still a binary problem. The feature vectors $x$ and labels $y$ have just been "compressed" in your representation. Consider some feature vector $x$ that has associated label $y = (0.25, 0.75)$. This is the exact same as having $$ (X,Y)= \left(\begin{bmatrix} x \\ x \\ x \\ x \\ \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ \end{bmatrix} \right) $$ as parts of your feature and label matrix.

Of course the order isn't important, so you could also write $$ Y= \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ \end{bmatrix} ,$$ or any other ordering, for your labels of a particular $(x,y)$.

If you un-compress your data using this method, it's exactly the same as the ordinary binary case.

Note that $(x,y)$ are just stand-ins for any tuple of feature vectors and labels. There might be another feature vector $z \neq x$ which also has label $y = [0.25, 0.75]$. If we de-compress this and append it to the previous result, we have $$ (X,Y)= \left(\begin{bmatrix} x \\ x \\ x \\ x \\ z \\ z \\ z \\ z \\ \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \right) .$$
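One practical aside, relying on a general property of weighted losses rather than anything specific to this answer: a row duplicated $n$ times contributes to the training objective exactly like one row with weight $n$. So you can skip the duplication entirely by passing each sample through twice, once per label, with weights $(1-p, p)$ where $p$ is the soft label. A minimal sketch with toy data:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))                 # toy features
p = np.tile([0.0, 0.25, 0.5, 0.75, 1.0], 2)  # P(class 1) per row

# two copies of every row: one labelled 0 with weight 1 - p,
# one labelled 1 with weight p
X2 = np.vstack([X, X])
y2 = np.concatenate([np.zeros(len(X)), np.ones(len(X))])
w2 = np.concatenate([1 - p, p])

clf = XGBClassifier(objective="binary:logistic")
clf.fit(X2, y2, sample_weight=w2)
```

This also frees you from the 0.25 granularity, since the weights can take any value in $[0, 1]$.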
