Solved – Logistic regression with {-1,+1} labels

Tags: logistic, MATLAB, maximum likelihood

I am trying to implement logistic regression where the label space is {-1,+1} instead of the usual {0,1}. I know that I can recode this as a 0-1 problem, but I wanted to see whether I could derive it from first principles using maximum likelihood (MLE).

The negative log-likelihood I get (to be minimized) is:
$$ l(\theta) = \sum_{i=1}^{m} \log\left(1+\exp\left(-y^{(i)}\theta^{T}x^{(i)}\right)\right) $$
where $\{\dots, (x^{(i)}, y^{(i)}), \dots\}$ are the $m$ training examples ($x$ is an $n$-dimensional vector).

So now I try to find the gradient of this and, per training example (the full gradient sums these terms over the $m$ examples), I get:
$$ \frac{\partial l(\theta)}{\partial \theta_j} = \frac{\mu\, y\, x_j}{1+\mu}, $$
where $j = 1, \dots, n$ indexes the features and $\mu = \exp(-y\,\theta^{T}x)$.
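One way to sanity-check a gradient derivation like this is a finite-difference comparison. Below is a NumPy sketch (Python rather than the post's MATLAB, with synthetic data; not part of the original question) that compares the analytic gradient of $l(\theta)$ with central differences:

```python
# Finite-difference check of the gradient of
#   l(theta) = sum_i log(1 + exp(-y_i * theta^T x_i))
# Variable names mirror the post; the data itself is made up.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 3
X = rng.normal(size=(m, n))          # m examples, n features
y = rng.choice([-1.0, 1.0], size=m)  # {-1, +1} labels
theta = rng.normal(size=n)

def loss(theta):
    return np.sum(np.log1p(np.exp(-y * (X @ theta))))

def grad(theta):
    mu = np.exp(-y * (X @ theta))        # mu = exp(-y * theta^T x) per example
    return -X.T @ (y * mu / (1.0 + mu))  # note the leading minus sign

# Central finite differences along each coordinate axis
eps = 1e-6
fd = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
               for e in np.eye(n)])

# Prints a tiny number when the analytic gradient matches the numeric one
print(np.max(np.abs(grad(theta) - fd)))
```

If the analytic formula were off (for example by a sign), the printed discrepancy would be large instead of near machine precision.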

However, when I pass this gradient to MATLAB's fminunc, my initial weight vector never gets updated. My MATLAB code for the gradient is:

temp1 = exp(-y .* (X*w));           % mu = exp(-y * theta'*x), one entry per example
temp2 = temp1 ./ (1 + temp1) .* y;  % y * mu/(1+mu) per example
grad  = X' * temp2;                 % sum the per-example contributions

Can somebody point out what I am doing wrong here?
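For comparison, the same objective can be handed to a quasi-Newton optimizer outside MATLAB. Here is an illustrative SciPy sketch (Python, synthetic data, illustrative names; not the poster's code) that minimizes the {-1,+1} logistic loss with an analytic gradient:

```python
# Minimize sum_i log(1 + exp(-y_i * w^T x_i)) with BFGS and an
# analytic gradient. All data and names here are synthetic/illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
m, n = 50, 3
X = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.5])
# Noisy labels in {-1, +1} so the data is not linearly separable
y = np.where(X @ w_true + rng.normal(size=m) > 0, 1.0, -1.0)

def loss_and_grad(w):
    mu = np.exp(-y * (X @ w))           # exp(-y * w^T x) per example
    f = np.sum(np.log1p(mu))            # sum of log(1 + mu)
    g = -X.T @ (y * mu / (1.0 + mu))    # gradient of the loss
    return f, g

# jac=True tells minimize the objective returns (value, gradient)
res = minimize(loss_and_grad, np.zeros(n), jac=True, method="BFGS")
print(np.round(res.x, 2))  # recovered weights, roughly aligned with w_true
```

When the returned gradient is consistent with the objective, the optimizer moves away from the initial weight vector; a mismatched (or wrongly signed) gradient is a common reason a solver stalls at the starting point.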

Best Answer

Expanding on Frank Harrell's answer: to derive the likelihood function, you first need to define the probabilistic model of the problem. In logistic regression we model a binary target variable (e.g. male vs. female, survived vs. died, sold vs. not sold), and for such data the Bernoulli distribution is the distribution of choice. Notice that the choice between $\{0, 1\}$ and $\{-1, +1\}$ coding is not part of the definition of the problem; it is just a way of encoding your data, and the labels are arbitrary and can be changed. We use the $\{0, 1\}$ encoding because the model is defined in terms of the Bernoulli distribution, which uses exactly those labels; the main problem in logistic regression is estimating the probability of "success".
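To spell out the connection to the loss in the question: writing the logistic sigmoid as $\sigma(z) = 1/(1+e^{-z})$, setting $P(Y=1 \mid x) = \sigma(\theta^{T}x)$ and using the identity $1-\sigma(z) = \sigma(-z)$ gives, for $y \in \{-1,+1\}$,

$$ P(Y = y \mid x) = \sigma(y\,\theta^{T}x) = \frac{1}{1+\exp(-y\,\theta^{T}x)}, $$

so the negative log-likelihood of $m$ independent examples is $\sum_{i=1}^{m}\log\left(1+\exp(-y^{(i)}\theta^{T}x^{(i)})\right)$, which is exactly the expression being minimized in the question.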

If you insisted on defining the likelihood in terms of a distribution that assigns probability $1-p$ to $-1$ and probability $p$ to $+1$, then you would need to use that distribution in your likelihood function. It would have the following probability mass function:

$$ g(x) = p^{(x+1)/2} (1-p)^ {1-(x+1)/2} $$

which simply reduces to the Bernoulli distribution for $(x+1)/2 \in \{0, 1\}$.
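As a quick numeric check (Python, not part of the original answer; the function name `g` mirrors the pmf above):

```python
# Check that g(x) = p**((x+1)/2) * (1-p)**(1-(x+1)/2) assigns
# probability p to x = +1 and probability 1-p to x = -1, i.e. it is
# the Bernoulli pmf with the label x remapped to k = (x+1)/2 in {0, 1}.
def g(x, p):
    k = (x + 1) / 2          # maps -1 -> 0 and +1 -> 1
    return p**k * (1 - p)**(1 - k)

p = 0.3
print(g(+1, p))              # prints 0.3, i.e. p
print(g(-1, p))              # prints 0.7, i.e. 1 - p
print(g(+1, p) + g(-1, p))   # sums to 1 (up to float rounding)
```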