Solved – Cross-entropy cost function in neural network

error-propagation, neural-networks

I'm looking at the cross-entropy cost function found in this tutorial:

$$C = -\frac{1}{n} \sum_x [y \ln a + (1-y)\ln(1-a)]$$

What exactly are we summing over? The sum is over $x$, of course, but $y$ and $a$ don't change with $x$: all of the $x$'s are inputs into the single $a$.
$a$ is even defined in the paragraph above the equation as a function of the weighted sum over all the $w$'s and $x$'s.

Also, $n$ is defined as the number of inputs into this particular neuron, correct? The tutorial words it as "the total number of items of training data".


Edit:

Am I correct in thinking that

$$C = -\frac{1}{n} \sum_x [y \ln a + (1-y)\ln(1-a)]$$

would be the cost function for the entire network, whereas

$$C = [y \ln a + (1-y)\ln(1-a)]$$

would be the cost for the individual neuron? Shouldn't the sum be over each output neuron?

Best Answer

Here's how I would express the cross-entropy loss: $$\mathcal{L}(X, Y) = -\frac{1}{n} \sum_{i=1}^n \left[ y^{(i)} \ln a(x^{(i)}) + \left(1 - y^{(i)}\right) \ln \left(1 - a(x^{(i)})\right) \right]$$

Here, $X = \left\{x^{(1)},\dots,x^{(n)}\right\}$ is the set of input examples in the training dataset, and $Y=\left\{y^{(1)},\dots,y^{(n)} \right\}$ is the corresponding set of labels for those input examples. The $a(x)$ represents the output of the neural network given input $x$.
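To make the indexing concrete, here is a minimal NumPy sketch of that average (the function name `cross_entropy_loss` and the toy numbers are my own, purely illustrative). The point is that the sum runs over the $n$ training examples, with one label $y^{(i)}$ and one output $a(x^{(i)})$ per example:

```python
import numpy as np

def cross_entropy_loss(a_values, y_values):
    """Average cross-entropy over n training examples.

    a_values: network outputs a(x^(i)), each strictly inside (0, 1)
    y_values: binary labels y^(i), each 0 or 1
    """
    a = np.asarray(a_values, dtype=float)
    y = np.asarray(y_values, dtype=float)
    # One bracketed term per training example, then the average over all n
    per_example = y * np.log(a) + (1 - y) * np.log(1 - a)
    return -np.mean(per_example)

# Three training examples, each with its own output a(x^(i)) and label y^(i)
print(cross_entropy_loss([0.9, 0.2, 0.7], [1, 0, 1]))
```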

Each of the $y^{(i)}$ is either 0 or 1, and the output activation $a(x)$ is typically restricted to the open interval (0, 1) by using a logistic sigmoid. For example, for a one-layer network (which is equivalent to logistic regression), the activation would be given by $$a(x) = \frac{1}{1 + e^{-Wx-b}}$$ where $W$ is a weight matrix and $b$ is a bias vector. For multiple layers, you can expand the activation function to something like $$a(x) = \frac{1}{1 + e^{-Wz(x)-b}} \\ z(x) = \frac{1}{1 + e^{-Vx-c}}$$ where $V$ and $c$ are the weight matrix and bias for the first layer, and $z(x)$ is the activation of the hidden layer in the network.
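If it helps, here is a small NumPy sketch of those two activation functions (the function names and the example shapes are my own assumptions, not anything from the tutorial):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def one_layer_activation(x, W, b):
    # a(x) = sigmoid(Wx + b): equivalent to logistic regression
    return sigmoid(W @ x + b)

def two_layer_activation(x, V, c, W, b):
    z = sigmoid(V @ x + c)     # z(x): hidden-layer activation
    return sigmoid(W @ z + b)  # a(x): output activation

# Illustrative shapes: 4 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
x = rng.normal(size=4)
V, c = rng.normal(size=(3, 4)), rng.normal(size=3)
W, b = rng.normal(size=(1, 3)), rng.normal(size=1)
print(two_layer_activation(x, V, c, W, b))
```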

I've used the $(i)$ superscript to denote examples because I found it effective in Andrew Ng's machine learning course; sometimes people express examples as columns or rows in a matrix, but the idea remains the same.