Here are those I understand so far. Most of these work best when given values between 0 and 1.
Quadratic cost
Also known as mean squared error, this is defined as:
$$C_{MST}(W, B, S^r, E^r) = 0.5\sum\limits_j (a^L_j - E^r_j)^2$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{MST} = (a^L - E^r)$$
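As a quick sanity check, here is a minimal plain-Python sketch of the quadratic cost and its gradient for one sample. The function names and the sample vectors are made up for illustration, and `numeric_grad` is just a finite-difference helper for comparing against the analytic gradient.

```python
def quadratic_cost(a, e):
    """0.5 * sum_j (a_j - e_j)^2 for a single sample."""
    return 0.5 * sum((aj - ej) ** 2 for aj, ej in zip(a, e))


def quadratic_grad(a, e):
    """Gradient with respect to the outputs: a - e, element by element."""
    return [aj - ej for aj, ej in zip(a, e)]


def numeric_grad(cost, a, e, h=1e-6):
    """Central finite-difference estimate of d cost / d a_j (checking only)."""
    grad = []
    for j in range(len(a)):
        up = a[:j] + [a[j] + h] + a[j + 1:]
        down = a[:j] + [a[j] - h] + a[j + 1:]
        grad.append((cost(up, e) - cost(down, e)) / (2 * h))
    return grad


a = [0.8, 0.1, 0.3]  # hypothetical network outputs a^L
e = [1.0, 0.0, 0.0]  # hypothetical expected values E^r

analytic = quadratic_grad(a, e)
numeric = numeric_grad(quadratic_cost, a, e)
print(quadratic_cost(a, e), analytic, numeric)
```

The two gradient vectors should agree up to finite-difference error.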
Cross-entropy cost
Also known as Bernoulli negative log-likelihood and Binary Cross-Entropy, this is defined as:
$$C_{CE}(W, B, S^r, E^r) = -\sum\limits_j [E^r_j \text{ ln } a^L_j + (1 - E^r_j) \text{ ln }(1-a^L_j)]$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{CE} = \frac{(a^L - E^r)}{(1-a^L)(a^L)}$$
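A small sketch of this cost and gradient in plain Python follows; the names are my own and the inputs are hypothetical. It assumes every output lies strictly between 0 and 1, since the logarithms and the denominator $(1-a^L)(a^L)$ are undefined at the endpoints.

```python
import math


def cross_entropy_cost(a, e):
    """-sum_j [e_j ln a_j + (1 - e_j) ln(1 - a_j)]; needs 0 < a_j < 1."""
    return -sum(ej * math.log(aj) + (1 - ej) * math.log(1 - aj)
                for aj, ej in zip(a, e))


def cross_entropy_grad(a, e):
    """(a_j - e_j) / ((1 - a_j) a_j) for each output j."""
    return [(aj - ej) / ((1 - aj) * aj) for aj, ej in zip(a, e)]


# With a target of 1 the gradient simplifies to -1/a, so it grows quickly
# as the output moves toward the wrong end of the interval.
target = [1.0]
for out in (0.5, 0.1, 0.01):
    cost = cross_entropy_cost([out], target)
    grad = cross_entropy_grad([out], target)[0]
    print(out, cost, grad)
```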
Exponential cost
This requires choosing some parameter $\tau$ that you think will give you the behavior you want. Typically you'll just need to play with this until things work well.
$$C_{EXP}(W, B, S^r, E^r) = \tau\text{ }\exp(\frac{1}{\tau} \sum\limits_j (a^L_j - E^r_j)^2)$$
where $\text{exp}(x)$ is simply shorthand for $e^x$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{2}{\tau}(a^L- E^r)C_{EXP}(W, B, S^r, E^r)$$
I could write out $C_{EXP}$ again, but that seems redundant. The point is that the gradient computes a vector and then multiplies it by $C_{EXP}$.
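To get a feel for $\tau$, here is a rough sketch (function names and sample values are illustrative, not from any library). Since $C_{EXP}$ already carries a factor of $\tau$, the gradient works out to $2(a^L - E^r)\exp(\cdot)$, so for large $\tau$ it approaches twice the quadratic-cost gradient.

```python
import math


def exp_cost(a, e, tau):
    """tau * exp((1 / tau) * sum_j (a_j - e_j)^2)."""
    sq = sum((aj - ej) ** 2 for aj, ej in zip(a, e))
    return tau * math.exp(sq / tau)


def exp_grad(a, e, tau):
    """(2 / tau) * (a - e) * C_EXP: a vector scaled by the cost itself."""
    c = exp_cost(a, e, tau)
    return [(2.0 / tau) * (aj - ej) * c for aj, ej in zip(a, e)]


a = [0.8, 0.1, 0.3]  # hypothetical network outputs a^L
e = [1.0, 0.0, 0.0]  # hypothetical expected values E^r

# Small tau exaggerates large errors; large tau flattens the exponential.
for tau in (0.1, 1.0, 1000.0):
    print(tau, exp_cost(a, e, tau), exp_grad(a, e, tau))
```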
Hellinger distance
$$C_{HD}(W, B, S^r, E^r) = \frac{1}{\sqrt{2}}\sum\limits_j(\sqrt{a^L_j}-\sqrt{E^r_j})^2$$
You can find more about this here. This needs to have positive values, and ideally values between $0$ and $1$. The same is true for the following divergences.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{\sqrt{a^L}-\sqrt{E^r}}{\sqrt{2}\sqrt{a^L}}$$
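A minimal sketch of the Hellinger cost and gradient as written above (names and inputs are hypothetical). It assumes strictly positive outputs, because the gradient divides by $\sqrt{a^L}$.

```python
import math


def hellinger_cost(a, e):
    """(1 / sqrt(2)) * sum_j (sqrt(a_j) - sqrt(e_j))^2; needs a_j, e_j >= 0."""
    total = sum((math.sqrt(aj) - math.sqrt(ej)) ** 2 for aj, ej in zip(a, e))
    return total / math.sqrt(2)


def hellinger_grad(a, e):
    """(sqrt(a_j) - sqrt(e_j)) / (sqrt(2) * sqrt(a_j)); needs a_j > 0."""
    return [(math.sqrt(aj) - math.sqrt(ej)) / (math.sqrt(2) * math.sqrt(aj))
            for aj, ej in zip(a, e)]


a = [0.25, 0.64]  # hypothetical outputs, chosen so the square roots are simple
e = [1.0, 0.16]   # hypothetical expected values

print(hellinger_cost(a, e), hellinger_grad(a, e))
```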
Kullback–Leibler divergence
Also known as Information Divergence, Information Gain, Relative entropy, KLIC, or KL Divergence (See here).
Kullback–Leibler divergence is typically denoted
$$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)},$$
where $D_{\mathrm{KL}}(P\|Q)$ is a measure of the information lost when $Q$ is used to approximate $P$. Thus we want to set $P=E^r$ and $Q=a^L$, because we want to measure how much information is lost when we use $a^L_j$ to approximate $E^r_j$. This gives us
$$C_{KL}(W, B, S^r, E^r)=\sum\limits_jE^r_j \log \frac{E^r_j}{a^L_j}$$
The other divergences here use this same idea of setting $P=E^r$ and $Q=a^L$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = -\frac{E^r}{a^L}$$
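Here is a short sketch of the KL cost and this gradient (names and sample vectors are my own). It assumes the usual convention that terms with $E^r_j = 0$ contribute nothing to the sum. One thing it makes visible: taken on its own, this gradient is not zero even when $a^L = E^r$, so the minimum at $a^L = E^r$ only holds when the outputs are constrained to sum to the same total as $E^r$ (for example, by a normalizing output layer).

```python
import math


def kl_cost(a, e):
    """sum_j e_j ln(e_j / a_j), skipping terms with e_j == 0 (assumed convention)."""
    return sum(ej * math.log(ej / aj) for aj, ej in zip(a, e) if ej > 0)


def kl_grad(a, e):
    """-e_j / a_j for each output j; needs a_j > 0."""
    return [-ej / aj for aj, ej in zip(a, e)]


e = [0.7, 0.2, 0.1]  # hypothetical expected distribution E^r
a = [0.5, 0.3, 0.2]  # hypothetical network outputs a^L

print(kl_cost(a, e), kl_grad(a, e))

# At a == e the cost is zero, yet every gradient entry is -1.
print(kl_cost(e, e), kl_grad(e, e))
```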
Generalized Kullback–Leibler divergence
From here.
$$C_{GKL}(W, B, S^r, E^r)=\sum\limits_j E^r_j \log \frac{E^r_j}{a^L_j} -\sum\limits_j(E^r_j) + \sum\limits_j(a^L_j)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{a^L}$$
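A sketch of the generalized version, again with made-up names and inputs, and again assuming terms with $E^r_j = 0$ drop out of the log sum. Unlike the plain KL gradient $-E^r/a^L$, this one vanishes at $a^L = E^r$, and the vectors do not need to sum to 1.

```python
import math


def gkl_cost(a, e):
    """sum_j e_j ln(e_j / a_j) - sum_j e_j + sum_j a_j."""
    log_term = sum(ej * math.log(ej / aj) for aj, ej in zip(a, e) if ej > 0)
    return log_term - sum(e) + sum(a)


def gkl_grad(a, e):
    """(a_j - e_j) / a_j for each output j; needs a_j > 0."""
    return [(aj - ej) / aj for aj, ej in zip(a, e)]


e = [2.0, 0.5]  # hypothetical unnormalized expected values E^r
a = [1.0, 1.0]  # hypothetical network outputs a^L

print(gkl_cost(a, e), gkl_grad(a, e))

# At a == e both the cost and the gradient are zero.
print(gkl_cost(e, e), gkl_grad(e, e))
```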
Itakura–Saito distance
Also from here.
$$C_{IS}(W, B, S^r, E^r)= \sum_j \left(\frac {E^r_j}{a^L_j} - \log \frac{E^r_j}{a^L_j} - 1 \right)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{\left(a^L\right)^2}$$
Where $\left(\left(a^L\right)^2\right)_j = a^L_j \cdot a^L_j$. In other words, $\left( a^L\right) ^2$ is simply equal to squaring each element of $a^L$.
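A brief sketch of the Itakura–Saito cost and gradient (function names and values are illustrative). It assumes strictly positive $a^L_j$ and $E^r_j$, since both the ratio and its logarithm need that. Because the cost depends only on the ratios $E^r_j / a^L_j$, scaling both vectors by the same constant leaves it unchanged, which the last lines check.

```python
import math


def itakura_saito_cost(a, e):
    """sum_j (e_j / a_j - ln(e_j / a_j) - 1); needs a_j > 0 and e_j > 0."""
    return sum(ej / aj - math.log(ej / aj) - 1 for aj, ej in zip(a, e))


def itakura_saito_grad(a, e):
    """(a_j - e_j) / a_j^2 for each output j (element-wise square)."""
    return [(aj - ej) / aj ** 2 for aj, ej in zip(a, e)]


e = [2.0, 0.5]  # hypothetical expected values E^r
a = [1.0, 1.0]  # hypothetical network outputs a^L

print(itakura_saito_cost(a, e), itakura_saito_grad(a, e))

# Scale both vectors by the same constant: the cost should not change.
scaled_a = [10.0 * aj for aj in a]
scaled_e = [10.0 * ej for ej in e]
print(itakura_saito_cost(scaled_a, scaled_e))
```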
Best Answer
Keep in mind why $\Theta_3$ is $4 \times 6$ rather than $4 \times 5$, even though the third layer has only $5$ nodes. It's because each node in the output layer takes the $5$ nodes as input plus an intercept. Remember that $\delta^{(3)}$ is the derivative of the error function with respect to each node in the third layer, prior to activation. One of your six $\delta^{(3)}$ components is the derivative with respect to the intercept, which has no dependence on any earlier part of the network, and thus has no further "backpropagating" to do. It's not even a relevant value to the calculation, because all you want is the derivative with respect to the weights that travel from the intercept to the outputs.
(I know it doesn't make sense to take a derivative with respect to a constant. However, what we're doing is treating the intercept as if it were an extra variable that always happens to have an observed value of 1. It's done that way for convenience, so we can place its weights in the same matrix as the other weights, rather than considering it separately.)
Thus, in the second calculation, you matrix-multiply $\Theta_2^T$ with only the five $\delta^{(3)}$ components that you care about: the ones corresponding to the actual nodes that take weight arguments from earlier in the network.
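The shapes can be traced in a small plain-Python sketch. Several things here are assumptions of mine rather than anything stated in the question: the intercept sits at index 0 of each weight row, the activations are sigmoids (so the derivative is $a(1-a)$), the second layer has 3 nodes, and all the numbers are random or made up.

```python
import random

random.seed(0)


def transpose_times(theta, delta):
    """Compute theta^T * delta for a matrix stored as a list of rows."""
    return [sum(row[k] * d for row, d in zip(theta, delta))
            for k in range(len(theta[0]))]


# Hypothetical sizes: layer 2 has 3 nodes, layer 3 has 5, the output layer has 4.
theta2 = [[random.uniform(-1, 1) for _ in range(3 + 1)] for _ in range(5)]  # 5 x 4
theta3 = [[random.uniform(-1, 1) for _ in range(5 + 1)] for _ in range(4)]  # 4 x 6

delta4 = [0.1, -0.2, 0.05, 0.3]   # hypothetical output-layer errors
a3 = [0.6, 0.4, 0.9, 0.2, 0.7]    # hypothetical layer-3 activations, no intercept

# Theta3^T * delta4 has six entries; index 0 belongs to the intercept.
back3 = transpose_times(theta3, delta4)

# Keep the five entries for real nodes and apply the sigmoid derivative.
delta3 = [b * a * (1 - a) for b, a in zip(back3[1:], a3)]

# Propagate one layer further using only those five components.
back2 = transpose_times(theta2, delta3)

print(len(back3), len(delta3), len(back2))
```

The intercept entry `back3[0]` is simply discarded; nothing earlier in the network feeds into it.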