Here are those I understand so far. Most of these work best when given values between 0 and 1.
Quadratic cost
Also known as mean squared error, this is defined as:
$$C_{MST}(W, B, S^r, E^r) = 0.5\sum\limits_j (a^L_j - E^r_j)^2$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{MST} = (a^L - E^r)$$
Cross-entropy cost
Also known as Bernoulli negative log-likelihood and Binary Cross-Entropy
$$C_{CE}(W, B, S^r, E^r) = -\sum\limits_j [E^r_j \text{ ln } a^L_j + (1 - E^r_j) \text{ ln }(1-a^L_j)]$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C_{CE} = \frac{(a^L - E^r)}{(1-a^L)(a^L)}$$
Exponentional cost
This requires choosing some parameter $\tau$ that you think will give you the behavior you want. Typically you'll just need to play with this until things work good.
$$C_{EXP}(W, B, S^r, E^r) = \tau\text{ }\exp(\frac{1}{\tau} \sum\limits_j (a^L_j - E^r_j)^2)$$
where $\text{exp}(x)$ is simply shorthand for $e^x$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{2}{\tau}(a^L- E^r)C_{EXP}(W, B, S^r, E^r)$$
I could rewrite out $C_{EXP}$, but that seems redundant. Point is the gradient computes a vector and then multiplies it by $C_{EXP}$.
Hellinger distance
$$C_{HD}(W, B, S^r, E^r) = \frac{1}{\sqrt{2}}\sum\limits_j(\sqrt{a^L_j}-\sqrt{E^r_j})^2$$
You can find more about this here. This needs to have positive values, and ideally values between $0$ and $1$. The same is true for the following divergences.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{\sqrt{a^L}-\sqrt{E^r}}{\sqrt{2}\sqrt{a^L}}$$
Kullback–Leibler divergence
Also known as Information Divergence, Information Gain, Relative entropy, KLIC, or KL Divergence (See here).
Kullback–Leibler divergence is typically denoted $$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)}$$,
where $D_{\mathrm{KL}}(P\|Q)$ is a measure of the information lost when $Q$ is used to approximate $P$. Thus we want to set $P=E^i$ and $Q=a^L$, because we want to measure how much information is lost when we use $a^i_j$ to approximate $E^i_j$. This gives us
$$C_{KL}(W, B, S^r, E^r)=\sum\limits_jE^r_j \log \frac{E^r_j}{a^L_j}$$
The other divergences here use this same idea of setting $P=E^i$ and $Q=a^L$.
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = -\frac{E^r}{a^L}$$
Generalized Kullback–Leibler divergence
From here.
$$C_{GKL}(W, B, S^r, E^r)=\sum\limits_j E^r_j \log \frac{E^r_j}{a^L_j} -\sum\limits_j(E^r_j) + \sum\limits_j(a^L_j)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{a^L}$$
Itakura–Saito distance
Also from here.
$$C_{GKL}(W, B, S^r, E^r)= \sum_j \left(\frac {E^r_j}{a^L_j} - \log \frac{E^r_j}{a^L_j} - 1 \right)$$
The gradient of this cost function with respect to the output of a neural network and some sample $r$ is:
$$\nabla_a C = \frac{a^L-E^r}{\left(a^L\right)^2}$$
Where $\left(\left(a^L\right)^2\right)_j = a^L_j \cdot a^L_j$. In other words, $\left( a^L\right) ^2$ is simply equal to squaring each element of $a^L$.
Best Answer
Here's how I would express the cross-entropy loss: $$\mathcal{L}(X, Y) = -\frac{1}{n} \sum_{i=1}^n y^{(i)} \ln a(x^{(i)}) + \left(1 - y^{(i)}\right) \ln \left(1 - a(x^{(i)})\right) $$
Here, $X = \left\{x^{(1)},\dots,x^{(n)}\right\}$ is the set of input examples in the training dataset, and $Y=\left\{y^{(1)},\dots,y^{(n)} \right\}$ is the corresponding set of labels for those input examples. The $a(x)$ represents the output of the neural network given input $x$.
Each of the $y^{(i)}$ is either 0 or 1, and the output activation $a(x)$ is typically restricted to the open interval (0, 1) by using a logistic sigmoid. For example, for a one-layer network (which is equivalent to logistic regression), the activation would be given by $$a(x) = \frac{1}{1 + e^{-Wx-b}}$$ where $W$ is a weight matrix and $b$ is a bias vector. For multiple layers, you can expand the activation function to something like $$a(x) = \frac{1}{1 + e^{-Wz(x)-b}} \\ z(x) = \frac{1}{1 + e^{-Vx-c}}$$ where $V$ and $c$ are the weight matrix and bias for the first layer, and $z(x)$ is the activation of the hidden layer in the network.
I've used the (i) superscript to denote examples because I found it to be quite effective in Andrew Ng's machine learning course; sometimes people express examples as columns or rows in a matrix, but the idea remains the same.