Solved – How is the quadratic cost function in neural networks "smooth"?

gradient-descent, neural-networks

I've just started reading about neural networks in *Neural Networks and Deep Learning* and came across this passage about using the quadratic cost function rather than optimizing the classification accuracy directly to improve the weights:

Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost?
The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Could someone help me figure out how the quadratic cost function is "smooth" compared to accuracy?

Best Answer

The quadratic cost is, as the name says, quadratic in the error, and a quadratic is differentiable everywhere: $$ \frac{d}{dx}x^2 = 2x, $$ which exists and varies continuously for every $x$. Since the network's output is itself a smooth function of the weights and biases (it is built from smooth activations), the cost composed with the network is also smooth: a small change in any weight produces a small, predictable change in the cost.
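For concreteness, here is a sketch in the book's notation (writing $a = a(x; w, b)$ for the network output on input $x$ and $y(x)$ for the desired output), where the cost being minimized over the training set is

$$
C(w,b) = \frac{1}{2n}\sum_x \lVert y(x) - a(x; w, b)\rVert^2,
\qquad
\Delta C \approx \frac{\partial C}{\partial w}\,\Delta w + \frac{\partial C}{\partial b}\,\Delta b .
$$

Because those partial derivatives exist everywhere, gradient descent can always read off which direction to nudge $w$ and $b$ to reduce $C$.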

But the accuracy on a single example is $f(\hat{y})=\mathbb{I}\!\left(y = \arg\max_i \hat{y}_i\right)$, using the convention that $y$ is an integer label and $\hat{y}$ is the vector of predicted scores. Clearly $f$ is a step function: it stays constant as the scores move around, right up until a different component becomes the maximum, at which point it jumps by a whole unit. Step functions are not smooth; their derivative is zero almost everywhere and undefined at the jumps, so the gradient gives you no information about how to change the weights.
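To see the contrast numerically, here is a minimal sketch (a hypothetical 3-class prediction, not taken from the book): nudging the correct component of $\hat{y}$ upward in small steps changes the quadratic cost by a little at every step, while the accuracy indicator stays flat and then jumps the moment the arg-max flips.

```python
import numpy as np

# Hypothetical 3-class example: integer label and its one-hot target.
y = 1
target = np.eye(3)[y]

def quadratic_cost(y_hat):
    """0.5 * squared Euclidean distance between prediction and one-hot target."""
    return 0.5 * np.sum((y_hat - target) ** 2)

def accuracy(y_hat):
    """Indicator that the arg-max of the prediction equals the label."""
    return float(np.argmax(y_hat) == y)

# Start from a prediction that narrowly gets the label wrong,
# then nudge the correct component upward in small steps.
base = np.array([0.50, 0.48, 0.02])
for eps in [0.00, 0.01, 0.02, 0.03, 0.04]:
    y_hat = base + np.array([0.0, eps, 0.0])
    print(f"eps={eps:.2f}  cost={quadratic_cost(y_hat):.4f}  "
          f"accuracy={accuracy(y_hat):.0f}")
```

Running this shows the cost column shrinking smoothly with every step, while the accuracy column sits at 0 and then snaps to 1 only once the correct component overtakes the largest one. That is exactly the book's point: the quadratic cost gives gradient descent a usable signal at every point, the accuracy does not.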