[Math] Logistic Regression: When can the cost function be non-convex

linear regression, logistic regression, machine learning, regression, statistics

I'm studying the Machine Learning course on Coursera.

When introducing the cost function for Logistic Regression, they said that we shouldn't use the same cost function as the one for Linear Regression since that cost function can have multiple local minima that would cause gradient descent to fail.

However, I'm having a hard time finding an example of an equation which would result in multiple local minima if one uses the Linear Regression cost function on Logistic Regression.

Could someone please provide such an example and give me some intuition about why it occurs?

Side note: I can see other good properties of using the log(h(x)) cost function for logistic regression, that it strongly penalizes being certain when you're actually wrong. But I'm trying to understand the claim that the linear regression's cost function will be non-convex for logistic regression.
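To make that side note concrete, here is a small numerical illustration of my own (the probability value is just an arbitrary choice): when the model is nearly certain of the wrong class, the log loss grows without bound, while the squared error stays bounded by 1.

```python
import math

# Probability the model assigns to the true class y = 1.
# A value near 0 means the model is confidently wrong.
p = 0.001

# Log loss for y = 1: -log(h(x)); blows up as h(x) -> 0
log_loss = -math.log(p)

# Squared error for y = 1: bounded above by (0 - 1)^2 = 1
sq_loss = (p - 1) ** 2

print(log_loss)  # roughly 6.9
print(sq_loss)   # just under 1.0
```

So the log-based cost punishes confident mistakes far more harshly than the squared-error cost ever can.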

Best Answer

My example is probably not the best, but it still shows that the linear regression cost function applied to logistic regression can produce local minima.

Let us consider a training set with 3 examples in 1D: $$\{(-1,2), (-20,-1), (-5,5)\}$$

For the sake of simplicity, let us consider a model without a constant term, with hypothesis $$h_{\theta}(x)=\dfrac{1}{1+e^{\theta \cdot x}}$$ (the sign in the exponent is flipped relative to the usual sigmoid, which just mirrors $\theta$; it keeps the formulas below simple).

If we use the cost function of a linear regression we would get

$$J(\theta)=\dfrac{1}{6}\left(\left(\dfrac{1}{1+e^{-\theta}}-2\right)^2 + \left(\dfrac{1}{1+e^{-20\theta}}+1\right)^2+\left(\dfrac{1}{1+e^{-5\theta}}-5\right)^2\right)$$

Let us now plot this function. You need to zoom in a bit around $\theta = 0$ to see the following graph:

[Plot of $J(\theta)$ near $\theta = 0$, showing a local minimum]

There is clearly a local minimum in which gradient descent could get stuck. This is one of the reasons you should use the standard logistic cost function instead.
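The non-convexity can also be checked numerically without plotting. Here is a sketch (using the training set and cost function defined above) that scans $J(\theta)$ on a grid and counts sign changes of its discrete derivative; a convex function's derivative can change sign at most once, so two or more changes indicate a local minimum and a local maximum.

```python
import numpy as np

# Training set from the example above: (x, y) pairs in 1D
X = np.array([-1.0, -20.0, -5.0])
y = np.array([2.0, -1.0, 5.0])

def h(theta, x):
    # Hypothesis used in the answer: h(x) = 1 / (1 + exp(theta * x))
    return 1.0 / (1.0 + np.exp(theta * x))

def J(theta):
    # Squared-error ("linear regression") cost; np.mean(...)/2 equals
    # the 1/6 factor in the formula above, since there are 3 examples.
    return np.mean((h(theta, X) - y) ** 2) / 2.0

# Evaluate J on a fine grid around theta = 0
thetas = np.linspace(-1.0, 1.0, 20001)
costs = np.array([J(t) for t in thetas])

# Count sign changes of the discrete derivative
dJ = np.diff(costs)
sign_changes = int(np.sum(np.diff(np.sign(dJ)) != 0))
print(sign_changes)  # at least 2 on this interval => J is not convex
```

On this data the cost dips to a local minimum just below $\theta = 0$, rises to a local maximum just above it, and then decreases again, which is exactly the bump you see when zooming in on the plot.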

Besides, I am sure you could find an even nicer graph with more local minima by using a larger number of training examples $n$.