Sorry if the title seems a little roundabout, but let me explain what I'm trying to do. I'm training an XGBClassifier (in Python) on samples that aren't strictly in class 0 or class 1, but have a little granularity of range: anywhere from [0, 1], [.25, .75], [.5, .5], [.75, .25], to [1, 0] across the two classes, where [0, 1] means a sample is 0% class A and 100% class B.
The reason I would rather not use regression is that the training values aren't technically continuous, but discrete points within the spectrum of 0 to 1, and I'm trying to get the power of multi-class classification within a framework where every class is simply a different mixture of pure class A and pure class B. Perhaps regression is still a better option, or using reg:linear as the objective, but that doesn't exactly solve my problem.
For example, I want to measure sentiment less in terms of "negative or positive" and more as "25% positive, 75% negative", using predict_proba().
I'm trying to figure out the best way to do this. I know the default objective of XGBClassifier is binary:logistic, and I might also try multi:softmax, but I like the idea of using predict_proba() to get a value between 0 and 1 as a measure of where a sample falls on the scale between class A and class B (or really, between 0 and 1), which would be more difficult with 5 separate "classes."
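For instance, here's a minimal sketch of what I mean by reading predict_proba() as a position on that scale (the data is made up purely for illustration):

```python
import numpy as np
from xgboost import XGBClassifier

# Made-up data, purely for illustration
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

clf = XGBClassifier(objective="binary:logistic")
clf.fit(X, y)

# predict_proba() returns one row per sample: [P(class A), P(class B)];
# I'd read the second column as where the sample sits between A and B
scores = clf.predict_proba(X)[:, 1]
```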
(For the following example, I'm using the letters A and B but really mean 0 and 1. It's just less confusing this way.)
My first inclination is to force classification probabilities by using ratios of A and B in the training set for each sample, essentially sending each one through four times with different labels, but I'm not sure if there's an easier way or if it's doing what I think it is.
For example, if I have a sample that I want to represent as [.5, .5], basically a 50/50 or "neutral" sentiment (so that similar samples I send through later come out around [.5, .5]), I'd train it twice with a value of A and twice with a value of B. Then for something that should be classified as [0, 1], we train it four times with a value of B, and for something that is [.75, .25], we'd train it three times with a value of A and once with a value of B.
Here's how I'd train each sample then, where "B B B B" means I train the same sample four times, telling the classifier it is B each time:
[0.00, 1.00]: B B B B
[0.25, 0.75]: A B B B
[0.50, 0.50]: A A B B
[0.75, 0.25]: A A A B
[1.00, 0.00]: A A A A
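A minimal sketch of that replication (the helper name and the four-copy expansion are just my framing of the table above):

```python
import numpy as np

def expand_soft_labels(X, soft_labels, copies=4):
    """Replicate each row `copies` times, assigning hard labels
    (0 = A, 1 = B) in proportion to its [P(A), P(B)] soft label."""
    rows, labels = [], []
    for x, (p_a, p_b) in zip(X, soft_labels):
        n_b = int(round(p_b * copies))  # how many copies get labeled B
        for i in range(copies):
            rows.append(x)
            labels.append(1 if i >= copies - n_b else 0)
    return np.array(rows), np.array(labels)

# e.g. a [0.75, 0.25] sample becomes A A A B, i.e. hard labels [0, 0, 0, 1]
```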
So, barring this approach being incorrect, is there a better way to go about what I'm trying to do? Like an analog of predict_proba(), but for training inputs? Knowing how the algorithm works, I don't think that exists, but then again, I'm here to be schooled.
Is this a bastardization of a binary classifier parading as a regression wannabe? Or is this an alright way to do what I'm trying to do?
Thanks everybody.
Best Answer
Your instinct is correct: this is still a binary problem. The feature vectors $x$ and labels $y$ have just been "compressed" in your representation. Consider some feature vector $x$ that has associated label $y = (0.25, 0.75)$. This is exactly the same as having $$ (X,Y)= \left(\begin{bmatrix} x \\ x \\ x \\ x \\ \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ \end{bmatrix} \right) $$ as parts of your feature and label matrix.
Of course the order isn't important, so you could also write $$ Y= \begin{bmatrix} 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ \end{bmatrix} ,$$ or any other ordering, for your labels of a particular $(x,y)$.
If you un-compress your data using this method, it's exactly the same as the ordinary binary case.
Note that $(x,y)$ are just stand-ins for any tuple of feature vectors and labels. There might be another feature vector $z \neq x$ which also has label $y = [0.25, 0.75]$. If we de-compress this and append it to the previous result, we have $$ (X,Y)= \left(\begin{bmatrix} x \\ x \\ x \\ x \\ z \\ z \\ z \\ z \\ \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \right) .$$
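As a concrete sketch (the feature values and variable names here are illustrative, not from the question), the un-compressed data can be fed straight to an ordinary binary XGBClassifier:

```python
import numpy as np
from xgboost import XGBClassifier

# Two distinct feature vectors, both with soft label (0.25, 0.75)
x = np.array([0.1, 0.2, 0.3])
z = np.array([0.9, 0.8, 0.7])

# Un-compress: four copies of each, with one A (0) and three B (1) apiece;
# within each block only the label counts matter, not their order
X = np.vstack([np.tile(x, (4, 1)), np.tile(z, (4, 1))])
y = np.array([0, 1, 1, 1, 1, 1, 0, 1])

clf = XGBClassifier(objective="binary:logistic")
clf.fit(X, y)

# predict_proba()[:, 1] then reads as the position on the A-to-B scale
print(clf.predict_proba(np.vstack([x, z]))[:, 1])
```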