I observed that Caffe (a deep learning framework) uses the **Softmax Loss Layer** `SoftmaxWithLoss` as the output layer in most of its model examples.

As far as I know, the **Softmax Loss layer** is the combination of a **Multinomial Logistic Loss Layer** and a **Softmax Layer**.

The Caffe documentation says:

> Softmax Loss Layer gradient computation is more numerically stable

However, this explanation is not the answer I am looking for: it only compares the combined layer against the separate **Multinomial Logistic Loss Layer** and **Softmax Layer**, rather than comparing the loss itself with other types of loss functions.
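For reference, the stability claim can be seen in a minimal NumPy sketch (function names are my own, not Caffe's): computing softmax and then the log as two separate steps overflows for large logits, while the fused log-softmax applies the log-sum-exp trick first.

```python
import numpy as np

def naive_log_softmax(z):
    # Separate layers: softmax first, then log -- np.exp overflows for large z.
    p = np.exp(z) / np.sum(np.exp(z))
    return np.log(p)

def stable_log_softmax(z):
    # Fused layer: shift by max(z) so the largest exponent is 0 (log-sum-exp trick).
    z = z - np.max(z)
    return z - np.log(np.sum(np.exp(z)))

logits = np.array([1000.0, 0.0, -1000.0])
print(naive_log_softmax(logits))   # [nan -inf -inf] -- exp(1000) overflows
print(stable_log_softmax(logits))  # [0. -1000. -2000.] -- finite and exact
```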

What I would like to know is the **differences/advantages/disadvantages** of these three error functions, **Multinomial Logistic Loss**, **Cross Entropy** (CE), and **Squared Error** (SE), from a supervised-learning perspective. Are there any supporting articles?

## Best Answer

In my opinion, the loss function is the objective function that we want our neural network to optimize its weights against. Therefore, it is task-specific and also somewhat empirical. To be clear:

**Multinomial Logistic Loss** and **Cross Entropy Loss** are the same (see http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). The cost function of **Multinomial Logistic Loss** is

$$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right].$$

It is usually used for classification problems.

**Squared Error** has the form

$$\frac{1}{2N} \sum_{i=1}^N \| x^1_i - x^2_i \|_2^2.$$

Therefore, it is usually used for minimizing reconstruction errors.
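As a concrete illustration (a NumPy sketch with my own variable names, evaluating the two formulas above on toy data):

```python
import numpy as np

def binary_cross_entropy(y, h):
    # J(theta) above: average negative log-likelihood over the m examples.
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def squared_error(x1, x2):
    # (1/2N) * sum_i ||x1_i - x2_i||^2 over N paired vectors.
    return np.sum((x1 - x2) ** 2) / (2 * len(x1))

y = np.array([1.0, 0.0, 1.0])       # binary labels y^(i)
h = np.array([0.9, 0.2, 0.7])       # predicted probabilities h_theta(x^(i))
print(binary_cross_entropy(y, h))   # ~0.228

x1 = np.array([[1.0, 2.0], [3.0, 4.0]])    # e.g. original inputs
x2 = np.array([[1.1, 1.9], [2.8, 4.2]])    # e.g. reconstructions
print(squared_error(x1, x2))        # 0.025
```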

EDIT: @MartinThoma The above formula for multinomial logistic loss is just for the binary case; for the general case, it should be

$$J(\theta) = -\left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \log P(y^{(i)} = k \mid x^{(i)} ; \theta) \right],$$

where $K$ is the number of categories.
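A small NumPy sketch of the general formula (my own naming; the indicator $1\{y^{(i)} = k\}$ simply selects the predicted probability of the true class):

```python
import numpy as np

def multinomial_logistic_loss(probs, labels):
    # Sum over examples of -log P(y^(i) = k | x^(i)); the indicator picks
    # out the probability the model assigns to the true category.
    m = len(labels)
    return -np.sum(np.log(probs[np.arange(m), labels]))

# m = 2 examples, K = 3 categories; rows are predicted class distributions.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
labels = np.array([0, 2])           # true categories y^(i)
print(multinomial_logistic_loss(probs, labels))  # -(log 0.7 + log 0.6) ~ 0.867
```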