[Math] Lacking Intuition behind the Logistic Regression Cost and Update Functions

calculus, linear algebra, machine learning, statistics

I am lacking intuition about the logistic regression cost and update functions. For example, in the cost function

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \big[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\big],$$

where

$$h_\theta(x) = \frac{1}{1+e^{-\theta^T x}},$$

why is $\log$ used? Is it just to make computations easier? Could the $\log$ be dropped and everything still work the same? Since likelihood is the inverse of probability, couldn't the inverse of the sigmoid function be used instead?

Also, is there any reason, other than coincidence, that for both logistic and linear regression the derivative of the cost function works out to the error term times $x^{(i)}$?

Best Answer

We are given a data vector $\textbf{x}$ and a class vector $\textbf{y}$. The class vector tells us which of two classes $\{0,1\}$ the data instances belong to.

We want to come up with a function $h_\theta(x_i)$ that helps us estimate the classes $y_i$ as best as we can.

You can think of $h_\theta(x_i)$ as the probability that $y_i=1$, given $x_i$ and $\theta$.

$$P(y_i=1|x_i,\theta)=h_\theta(x_i)$$

Likewise, $1-h_\theta(x_i)$ is the probability that $y_i=0$, given $x_i$ and $\theta$.

$$P(y_i=0|x_i,\theta)=1-h_\theta(x_i)$$

We can combine the two formulas in a clever way using exponents:

$$P(y_i|x_i,\theta)=h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$

(Note that one of the terms is always reduced to 1 because one of the exponents is always zero.)
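For concreteness, here is a minimal NumPy sketch of this setup, assuming $h_\theta(x)$ is the sigmoid of $\theta^T x$ as in the question; the helper names are just illustrative:

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """h_theta(x): estimated probability that y = 1, given x and theta."""
    return sigmoid(np.dot(theta, x))

def instance_probability(theta, x, y):
    """P(y | x, theta) = h^y * (1 - h)^(1 - y); one factor is always 1."""
    h = hypothesis(theta, x)
    return h**y * (1.0 - h)**(1 - y)

# Toy instance: two features plus an intercept term (made-up numbers).
theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 1.5])                # x[0] = 1 acts as the intercept
print(instance_probability(theta, x, y=1))   # equals h_theta(x)
print(instance_probability(theta, x, y=0))   # equals 1 - h_theta(x)
```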

Since all instances are independent, the total probability over all instances $i$ is just the product of all the individual probabilities:

$$P(\textbf{y}|\textbf{x},\theta)=\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}$$


We are hoping to maximize the probability of the output vector $\textbf{y}$, or equivalently, maximize its $\log$.

$$\log\big( P(\textbf{y}|\textbf{x},\theta)\big)=\log\big(\prod_i h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$

$$=\sum_i \log\big(h_\theta(x_i)^{y_i}[1-h_\theta(x_i)]^{1-y_i}\big)$$

$$=\sum_i \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
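As a quick numerical sanity check of this product-to-sum step (the per-instance probabilities below are made up):

```python
import numpy as np

# Hypothetical predictions h_theta(x_i) and labels y_i for four instances.
h = np.array([0.9, 0.2, 0.7, 0.4])
y = np.array([1,   0,   1,   0])

# Log of the product of the per-instance probabilities...
lhs = np.log(np.prod(h**y * (1 - h)**(1 - y)))

# ...equals the sum of the per-instance log terms.
rhs = np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

print(lhs, rhs)   # identical up to floating-point rounding
```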

This is a function we want to maximize, but since gradient descent minimizes, we can just throw a negative sign in front to turn the maximum into a minimum, and divide by the number of instances $m$ for convenience. (Dividing by $m$ makes the cost more or less invariant to the number of instances.)

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \big[y_i\log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\big]$$
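A minimal vectorized sketch of this cost, assuming a design matrix `X` with one row per instance (toy data and names are illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [y_i log(h_i) + (1 - y_i) log(1 - h_i)]."""
    m = len(y)
    h = sigmoid(X @ theta)          # h_theta(x_i) for every instance at once
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Toy data: 3 instances, an intercept column of ones plus two features.
X = np.array([[1.0,  0.5,  1.2],
              [1.0, -0.3,  0.8],
              [1.0,  1.5, -0.4]])
y = np.array([1, 0, 1])
theta = np.zeros(3)
print(cost(theta, X, y))            # log(2) ≈ 0.693 since h = 0.5 everywhere
```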

Why did we bother taking the $\log$? Because it's easier to take the derivative of a sum rather than the derivative of a product (imagine all that product rule!). You'll find this trick is used a lot in machine learning to make differentiation easier.
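This also bears on the second part of the question. Once the product has become a sum, differentiating $J$ term by term gives the standard result

$$\frac{\partial J}{\partial \theta_j} = \frac{1}{m}\sum_i \big(h_\theta(x_i)-y_i\big)\,x_{ij},$$

which has exactly the same form as the linear regression gradient; only the definition of $h_\theta$ differs. A sketch of the corresponding gradient-descent update (same toy data as above, illustrative only):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(theta, X, y):
    """grad_j = (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij, vectorized."""
    m = len(y)
    h = sigmoid(X @ theta)
    return (X.T @ (h - y)) / m

def gradient_descent_step(theta, X, y, alpha=0.1):
    """One update: theta := theta - alpha * grad J(theta)."""
    return theta - alpha * gradient(theta, X, y)

X = np.array([[1.0,  0.5,  1.2],
              [1.0, -0.3,  0.8],
              [1.0,  1.5, -0.4]])
y = np.array([1, 0, 1])
theta = np.zeros(3)
for _ in range(100):
    theta = gradient_descent_step(theta, X, y)
print(theta)   # theta after 100 updates on the toy data
```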


You also asked why we chose $h_\theta(x_i)$ to be the sigmoid function. A couple reasons are:

  1. It is differentiable (unlike the unit step function), so we can use gradient descent with it.
  2. Its domain is the whole real line $\mathbb{R}$ and its range is the open interval $(0,1)$, which seems like a good fit for a binary classification problem with no constraints on the input value (see the short numerical check after this list).
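A short numerical check of both points; the closed-form derivative $\sigma'(z)=\sigma(z)\,(1-\sigma(z))$ used below is a standard identity, not something derived above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10, 10, 9)

# Point 2: outputs stay strictly between 0 and 1 for any real input.
print(sigmoid(z))

# Point 1: the derivative exists everywhere; a central finite difference
# matches the closed form sigmoid(z) * (1 - sigmoid(z)).
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.max(np.abs(numeric - analytic)))   # ~0: the two agree
```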

To me, the use of the sigmoid seems more like an engineered solution than something we arrived at from a mathematical proof. It has nice properties and seems to work.

In neural networks, some people prefer using alternatives to the sigmoid like $\arctan$ and $\tanh$, but I don't think it makes that much of a difference in most cases.
