In my opinion, the loss function is the objective function that we want our neural network to optimize its weights against. Hence it is task-specific and also somewhat empirical. Just to be clear, **Multinomial Logistic Loss** and **Cross-Entropy Loss** are the same (please look at http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). The cost function of the **Multinomial Logistic Loss** looks like this
$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right].$
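As a quick numeric sketch of this binary form (using NumPy; the function name and sample values are just illustrative):

```python
import numpy as np

def binary_cross_entropy(y, p):
    # J = -(1/m) * sum[ y*log(p) + (1-y)*log(1-p) ]
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = [1, 0, 1]          # true labels
p = [0.9, 0.1, 0.8]    # predicted probabilities h_theta(x)
loss = binary_cross_entropy(y, p)
```

Confident predictions on the correct class give a small loss; pushing a prediction toward the wrong class makes the loss blow up because of the log.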

It is usually used for classification problems. The **Squared Error** has the form
$\frac 1 {2N} \sum_{i=1}^N \| x^1_i - x^2_i \|_2^2.$

Therefore, it is usually used for regression, e.g., for minimizing reconstruction errors.
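A minimal sketch of that squared-error formula (NumPy; the function name is illustrative), e.g., comparing a reconstruction to its input:

```python
import numpy as np

def squared_error(x1, x2):
    # (1/2N) * sum_i ||x1_i - x2_i||_2^2, with N rows (one example per row)
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    return 0.5 * np.mean(np.sum((x1 - x2) ** 2, axis=1))

err = squared_error([[1.0, 2.0], [3.0, 4.0]],
                    [[1.0, 1.0], [2.0, 4.0]])
```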

EDIT: @MartinThoma The above formula for the multinomial logistic loss is just for the binary case. For the general case, it should be $J(\theta) = -\left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \log P(y^{(i)} = k | x^{(i)} ; \theta) \right]$, where $K$ is the number of categories.
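The indicator $1\{y^{(i)} = k\}$ simply picks out the log-probability of each example's true class, so the general formula reduces to a lookup per row. A small NumPy sketch (names and sample values are illustrative):

```python
import numpy as np

def multiclass_cross_entropy(labels, probs):
    # J = -sum_i sum_k 1{y_i = k} * log P(y_i = k | x_i)
    #   = -sum_i log probs[i, labels[i]]
    probs = np.asarray(probs, dtype=float)
    rows = np.arange(len(labels))
    return -np.sum(np.log(probs[rows, labels]))

labels = [0, 2]                    # true classes, K = 3
probs = [[0.7, 0.2, 0.1],          # predicted distribution per example
         [0.1, 0.2, 0.7]]
loss = multiclass_cross_entropy(labels, probs)
```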

Let us assume that the activation function is the logistic sigmoid, denoted $\sigma()$.

The idea behind cross-entropy (CE) is to optimise the weights $W = [w_1, w_2,...,w_j,...w_k]$ so as to maximise the log probability, or equivalently to minimise the negative log probability.

Here, you want to obtain the derivative of each example's cost $C^n$ with respect to each of the weights in $W$. Thus, you write $\frac{\partial C}{\partial w_j}$, where $C = [C^1, C^2,...,C^n,...,C^m]$. After some math, which I'll skip here but which you can read more about (in case you're interested, here (slide 18 proves useful) and here):

This results in $\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum x_j(\sigma(z)−y)$, where $n$ is the size of your training set.

Here, $z=WX+b$, where $X = [x_{11} \ x_{12}...x_{1j}...x_{1k};\quad ....;\quad x_{n1} \ x_{n2}...x_{nj}...x_{nk}]$ ($X$ is an $n$ by $k$ matrix) and $x_{11}..x_{1k}$ are the features you would have per entry, $W$ are the weights as defined above and $b$ is the bias.

In classification, you would like to use this linear dependency on $z$. However, you want to pass it through a non-linear function such as a sigmoid, here denoted by $\sigma()$ (you can see a proof and read more about it here). $y$ represents the targeted output.

So $w_j$ is the **j-th** weight of the vector above, $x_j$ is the **j-th** input feature of an entry, and $\sigma(z)$ is the sigmoid applied to the linear function $WX+b$.
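As a sanity check on the derivative $\frac{1}{n} \sum x_j(\sigma(z)-y)$, you can compare it against a finite-difference estimate of the cost. A NumPy sketch (all names and values are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(W, b, X, y):
    # mean negative log likelihood of the Bernoulli outputs
    p = sigmoid(X @ W + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def ce_grad_W(W, b, X, y):
    # analytic gradient: (1/n) * sum_i x_ij * (sigmoid(z_i) - y_i)
    p = sigmoid(X @ W + b)
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))               # n = 5 entries, k = 3 features
y = np.array([0., 1., 1., 0., 1.])
W = rng.normal(size=3)
b = 0.1

analytic = ce_grad_W(W, b, X, y)

# central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for j in range(len(W)):
    Wp, Wm = W.copy(), W.copy()
    Wp[j] += eps
    Wm[j] -= eps
    numeric[j] = (ce_loss(Wp, b, X, y) - ce_loss(Wm, b, X, y)) / (2 * eps)
```

The two gradients should agree to several decimal places, which is a standard way to verify a hand-derived backprop formula.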

Hope that makes sense.

## Best Answer

The negative log likelihood (eq.80) is also known as the multiclass cross-entropy (ref: Pattern Recognition and Machine Learning Section 4.3.4), as they are in fact two different interpretations of the same formula.

eq.57 is the negative log likelihood of the Bernoulli distribution, whereas eq.80 is the negative log likelihood of the multinomial distribution with one observation (a multiclass version of Bernoulli).

For binary classification problems, the softmax function outputs two values (between 0 and 1 and summing to 1) to give the prediction of each class, while the sigmoid function outputs one value (between 0 and 1) to give the prediction of one class (so the other class is $1-p$). So eq.80 can't be directly applied to the sigmoid output, though it is essentially the same loss as eq.57.

Also see this answer.

Following is a simple illustration of the connection between (sigmoid + binary cross-entropy) and (softmax + multiclass cross-entropy) for binary classification problems.

Say we take $0.5$ as the split point of the two categories, for sigmoid output it follows,

$$\sigma(wx+b)=0.5$$ $$wx+b=0$$ which is the decision boundary in the feature space.

For softmax output it follows $$\frac{e^{w_1x+b_1}}{e^{w_1x+b_1}+e^{w_2x+b_2}}=0.5$$ $$e^{w_1x+b_1}=e^{w_2x+b_2}$$ $$w_1x+b_1=w_2x+b_2$$ $$(w_1-w_2)x+(b_1-b_2)=0$$ so it remains the same model although there are twice as many parameters.
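This equivalence can also be checked numerically: the two-logit softmax reduces exactly to a sigmoid of the logit difference. A short sketch (NumPy; the scores are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

# scores z1 = w1*x + b1 and z2 = w2*x + b2 for some input x
z1, z2 = 1.3, -0.4

p_softmax = softmax(np.array([z1, z2]))[0]   # two-parameter model
p_sigmoid = sigmoid(z1 - z2)                 # one-parameter model, w = w1 - w2
```

Algebraically, $e^{z_1}/(e^{z_1}+e^{z_2}) = 1/(1+e^{-(z_1-z_2)})$, which is why only the differences $w_1-w_2$ and $b_1-b_2$ matter.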

The following figures show the decision boundaries obtained using these two methods, which are almost identical.