Solved – Gradient Descent and Cost Function trouble

gradient-descent, optimization

I am taking Andrew Ng's ML course. I noticed the following:
When he is talking about gradient descent on $J(\theta_0, \theta_1)$, he "descends" to a point where the value of $J(\theta_0, \theta_1)$ is negative. Here is a screenshot:

[first screenshot: the 3D surface of $J(\theta_0, \theta_1)$ from the lecture]

After the descent:

[second screenshot from the lecture]

This value of $J(\theta_0, \theta_1)$ is clearly negative ($< 0$), and I thought we wanted the output to be as close to 0 as possible. I doubt that Andrew is wrong, so can somebody please explain why he does this? P.S. I am young and don't quite understand 3D graphs yet, which might be contributing to the misunderstanding.

Best Answer

"I thought we wanted the output to be as close to 0 as possible"

There are two functions involved here. There's the output of the neural network which you want to be accurate, and there's the cost function which you want to minimize.

Perhaps it's the output of the neural network that you want to be close to 0. However, this seems unlikely, because a neural network for binary classification has some targets of $1$ and some targets of $0$, so you wouldn't want all the outputs to be zero.
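As a minimal numeric sketch (the labels here are made up, not from the course), an all-zero output on mixed binary targets still leaves a nonzero sum-of-squares cost, so "output close to 0" cannot be the goal:

```python
# Hypothetical binary targets and an all-zero prediction vector.
targets = [1, 0, 1, 0]
all_zero_outputs = [0, 0, 0, 0]

# Sum of squared differences between each output and its target.
cost = sum((y - t) ** 2 for y, t in zip(all_zero_outputs, targets))
print(cost)  # 2 -- it is the cost, not the outputs, that we drive toward its minimum
```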

The cost function used is often $\sum_{i=1}^{m} (y_i - T_i)^2$.

That's the sum of the squared differences between the $i^{th}$ output ($y_i$) and its target ($T_i$). Clearly this cannot be negative, so at a previous stage Andrew Ng might have said we want the cost function to be close to zero. However, other cost functions can be negative. Also, including regularization might add on some quantities which can be negative, and these will allow the cost function to go below zero.
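As a rough sketch (not Andrew Ng's code; the data, model, and learning rate are invented for illustration), gradient descent on a one-parameter model with the squared-error cost above never sees a negative cost, but subtracting any constant from the cost leaves every descent step unchanged while pushing the minimum below zero, so a negative cost value by itself is nothing to worry about:

```python
xs = [1.0, 2.0, 3.0]   # made-up inputs
ts = [2.0, 4.0, 6.0]   # made-up targets (generated by theta = 2)

def cost(theta):
    """Sum of squared errors for the one-parameter model y = theta * x."""
    return sum((theta * x - t) ** 2 for x, t in zip(xs, ts))

def grad(theta):
    """Derivative of cost(theta) with respect to theta."""
    return sum(2 * (theta * x - t) * x for x, t in zip(xs, ts))

theta, lr = 0.0, 0.01
for _ in range(200):
    theta -= lr * grad(theta)   # the usual gradient-descent update

print(theta, cost(theta))   # theta -> 2, cost -> 0: the minimum of this cost is 0
# A hypothetical shifted cost J(theta) - 5 has the same gradient, hence the same
# descent steps, but its minimum value is -5: "minimize" does not mean "reach 0".
print(cost(theta) - 5.0)
```

The same reasoning applies to the $J(\theta_0, \theta_1)$ surface in the lecture: gradient descent only cares about moving downhill, not about where zero happens to sit.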