Solved – Question about Sigmoid Function in Logistic Regression


This is with reference to Andrew Ng's video on Logistic Regression; I just want to confirm a small doubt I have.

I get the basic idea of Logistic Regression that
$z=\theta^Tx$

Where
$\theta$ = the parameters of our model and $x$ = an observation (feature vector) from the dataset.

This $z$ is then used as the input to our sigmoid function, which is

$f(z)= \frac{1}{1+e^{-z}}$

where this function gives us the probability that our "Y-variable" takes the value 1 (the probability of 0 is just the complement).

The part I don't get is this: when $f(z)$ is plotted as a function of $z$, the sigmoid curve is shown to intersect the y-axis at 0.5, implying that

$$ z = 0 \implies f(z) = 0.5, \qquad z > 0 \implies f(z) > 0.5, \qquad z < 0 \implies f(z) < 0.5 $$
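
A quick numeric check of all three cases (a minimal sketch in Python/NumPy, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5 exactly
print(sigmoid(2.0))   # ~0.881 (> 0.5 for z > 0)
print(sigmoid(-2.0))  # ~0.119 (< 0.5 for z < 0)
```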

My doubt is whether this always has to be the case, because it seems perfectly possible that, for the $\theta$ parameters we obtain, the $z$ values we calculate never take on negative values.

Best Answer

When we use the compact form $z=\theta^Tx$, we are assuming that the first element of $x$ is the intercept and is always 1: $x_0 = 1$. You can think of $x$ as a row of the design matrix. The corresponding parameter $\theta_0$ is a bias (intercept) term that can be adjusted up or down, shifting $z$ so that the fitted model is not systematically biased (for instance, so that the average predicted probability can match the observed proportion of $y = 1$). In particular, nothing forces the $z$ values to be all positive or all negative.

To expand on this point, consider this more explicit formula:

$$ z = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_p x_p \tag{1} $$

If we want to write this as a dot product ($a^T b$ is the same as $a \cdot b$) then the intercept term is inconvenient, because every other $\theta_i$ is multiplied by $x_i$ but $\theta_0$ is not. So we simply define $x_0 = 1$, and then we can write (1) as

$$ z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + ... + \theta_p x_p = \theta^T x \tag{2} $$
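
As an illustrative sketch in NumPy (the variable names are my own, not from the answer), prepending the constant $x_0 = 1$ to each row of the design matrix lets equation (2) be computed as a single dot product, and it agrees with the explicit form (1):

```python
import numpy as np

theta = np.array([0.5, 2.0, -1.0])      # [theta_0 (intercept), theta_1, theta_2]
X = np.array([[1.2, 0.7],
              [0.3, 1.5]])              # two observations, p = 2 features

X_design = np.hstack([np.ones((X.shape[0], 1)), X])  # set x_0 = 1 for every row

z_compact = X_design @ theta             # equation (2): z = theta^T x per row
z_explicit = theta[0] + X @ theta[1:]    # equation (1): intercept written out

print(np.allclose(z_compact, z_explicit))  # True
```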

Hiding the intercept inside the design matrix is notationally convenient (and easier to code when implementing an algorithm), but it is important to stay aware that the intercept term is in there and needs special handling in some cases. For example, the L1 and L2 regularization penalties omit $\theta_0$, because penalizing the intercept would bias the model to no real purpose.
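
As a sketch of that special handling (my own example, not from the answer), an L2 (ridge) penalty would typically sum over $\theta_1, \dots, \theta_p$ and skip the intercept stored in `theta[0]`:

```python
import numpy as np

def l2_penalty(theta, lam):
    """L2 (ridge) penalty that deliberately skips the intercept theta[0]."""
    return lam * np.sum(theta[1:] ** 2)

theta = np.array([0.5, 2.0, -1.0])
print(l2_penalty(theta, lam=0.1))  # 0.1 * (2.0**2 + (-1.0)**2) = 0.5
```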
