Solved – Why use Platt’s scaling

calibration, cross-validation, logistic

To calibrate a confidence score to a probability in supervised learning (say, to map the confidence from an SVM or from a decision tree trained on oversampled data), one method is Platt's scaling (see, e.g., "Obtaining Calibrated Probabilities from Boosting").

Basically one uses logistic regression to map $[-\infty;\infty]$ to $[0;1]$. The dependent variable is the true label and the predictor is the confidence from the uncalibrated model. What I don't understand is the use of a target variable other than 1 or 0. The method calls for creation of a new "label":

To avoid overfitting to the sigmoid train set, an out-of-sample model is used. If there are $N_+$ positive examples and $N_-$ negative examples in the train set, for each training example Platt Calibration uses target values $y_+$ and $y_-$ (instead of 1 and 0, respectively), where
$$
y_+=\frac{N_++1}{N_++2};\quad\quad y_-=\frac{1}{N_-+2}
$$
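As a quick arithmetic check of the formulas above, here is a small sketch (the counts of 30 positives and 30 negatives are hypothetical, chosen to match the ToothGrowth example later in the post):

```python
# Hypothetical counts: 30 positives and 30 negatives in the train set.
n_pos, n_neg = 30, 30

y_plus = (n_pos + 1) / (n_pos + 2)   # target used in place of 1
y_minus = 1 / (n_neg + 2)            # target used in place of 0

print(y_plus, y_minus)  # 0.96875 0.03125, i.e. 31/32 and 1/32
```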

What I don't understand is how this new target is useful. Isn't logistic regression simply going to treat the dependent variable as a binary label (regardless of what label is given)?

UPDATE:

I found that in SAS, changing the dependent variable from $1/0$ to something else reverted to the same model (using PROC GENMOD); perhaps my error, or perhaps SAS's lack of versatility. In R, however, I was able to change the model. As an example:

data(ToothGrowth)
attach(ToothGrowth)

  # 1/0 coding
dep          <- ifelse(supp == "VC", 1, 0)
OneZeroModel <- glm(dep ~ len, family = binomial)
OneZeroModel
predict(OneZeroModel)

  # Platt coding: 30 positives and 30 negatives, so
  # y+ = (30+1)/(30+2) = 31/32 and y- = 1/(30+2) = 1/32
dep2           <- ifelse(supp == "VC", 31/32, 1/32)
plattCodeModel <- glm(dep2 ~ len, family = binomial)
plattCodeModel
predict(plattCodeModel)

compare <- cbind(predict(OneZeroModel), predict(plattCodeModel))

plot(predict(OneZeroModel), predict(plattCodeModel))

Best Answer

I suggest checking out the Wikipedia page on logistic regression. It states that, in the case of a binary dependent variable, logistic regression maps the predictors to the probability of occurrence of the dependent variable. Without any transformation, the target probability used for training the model is either 1 (if $y$ is positive in the training set) or 0 (if $y$ is negative).

So: instead of using the absolute values 1 for the positive class and 0 for the negative class when fitting $p_i=\frac{1}{1+\exp(Af_i+B)}$ (where $f_i$ is the uncalibrated output of the SVM), Platt suggests using the transformation above, which allows the opposite label to appear with some probability. In this way some regularization is introduced. As the size of the dataset goes to infinity, $y_+$ tends to 1 and $y_-$ tends to 0. For details, see Platt's original paper.
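As a minimal sketch of this fit, assuming synthetic scores in place of real SVM outputs (the data and variable names below are hypothetical, not from the original post), one can minimize the cross-entropy against the softened targets directly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical uncalibrated scores: positives tend to score higher.
f_pos = rng.normal(1.0, 1.0, size=30)
f_neg = rng.normal(-1.0, 1.0, size=30)
f = np.concatenate([f_pos, f_neg])

n_pos, n_neg = len(f_pos), len(f_neg)
# Platt's softened targets in place of hard 1/0 labels.
y_plus = (n_pos + 1.0) / (n_pos + 2.0)   # 31/32 here
y_minus = 1.0 / (n_neg + 2.0)            # 1/32 here
t = np.concatenate([np.full(n_pos, y_plus), np.full(n_neg, y_minus)])

# Fit A, B in p_i = 1 / (1 + exp(A*f_i + B)) by gradient descent on the
# cross-entropy; this objective is convex in (A, B).
A, B = 0.0, 0.0
lr = 0.1
for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(A * f + B))
    # d(NLL)/dA = mean((t - p) * f), d(NLL)/dB = mean(t - p)
    A -= lr * np.mean((t - p) * f)
    B -= lr * np.mean(t - p)

calibrated = 1.0 / (1.0 + np.exp(A * f + B))
```

Because the targets never reach 0 or 1 exactly, the fitted sigmoid is pulled away from producing extreme probabilities, which is the regularization effect described above.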
