# Solved – Multinomial Logistic Loss vs (Cross Entropy vs Square Error)

Tags: entropy, multinomial-distribution, softmax, sums-of-squares

I have observed that Caffe (a deep learning framework) uses the Softmax Loss layer `SoftmaxWithLoss` as the output layer in most of its model samples.

As far as I know, the Softmax Loss layer is the combination of a Multinomial Logistic Loss layer and a Softmax layer.

The Caffe documentation says that

> Softmax Loss Layer gradient computation is more numerically stable

However, this explanation is not the answer I am looking for: it only compares the combined layer against computing the Softmax layer and the Multinomial Logistic Loss layer separately, rather than comparing it with other types of loss functions.
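To illustrate what "more numerically stable" means here, below is a minimal NumPy sketch (function names are my own, not Caffe's) contrasting a naive softmax followed by a log with a combined log-softmax that uses the log-sum-exp shift:

```python
import numpy as np

def softmax_naive(z):
    """Naive softmax: np.exp can overflow for large logits."""
    e = np.exp(z)
    return e / e.sum()

def log_softmax_stable(z):
    """Combined log-softmax: shifting by max(z) keeps exp() in range."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

z = np.array([1000.0, 0.0, -1000.0])  # extreme logits
print(np.log(softmax_naive(z)))   # overflows: contains nan / -inf
print(log_softmax_stable(z))      # all values finite
```

The shift by `max(z)` does not change the softmax output mathematically, but it prevents the overflow that makes the two-layer (softmax, then log) computation blow up.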

What I would like to know is: what are the differences, advantages, and disadvantages of these three error functions, Multinomial Logistic Loss, Cross Entropy (CE), and Square Error (SE), from a supervised learning perspective? Any supporting articles?

In my opinion, the loss function is the objective function that we want our neural network to optimize its weights against. It is therefore task-specific and also somewhat empirical. To be clear, Multinomial Logistic Loss and Cross Entropy loss are the same thing (see http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). The cost function of Multinomial Logistic Loss is $$J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right].$$
It is usually used for classification problems. The Square Error is $$\frac 1 {2N} \sum_{i=1}^N \| x^1_i - x^2_i \|_2^2,$$ where $x^1_i$ and $x^2_i$ are the two vectors being compared (e.g. prediction and target).
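To make the two formulas concrete, here is a small NumPy sketch (the toy values and variable names are mine) that evaluates both the binary logistic loss and the squared error on the same predictions:

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0])   # binary targets y^(i)
h = np.array([0.9, 0.2, 0.7])   # model outputs h_theta(x^(i)) in (0, 1)

# Binary multinomial logistic (cross-entropy) loss, as in the formula above
logistic_loss = -(y * np.log(h) + (1 - y) * np.log(1 - h)).mean()

# Squared error between the same outputs and targets
square_error = 0.5 * ((h - y) ** 2).mean()

print(logistic_loss, square_error)
```

Note how the logistic loss penalizes confident mistakes much more harshly than the squared error does, because of the logarithm: as `h` approaches the wrong extreme, the log term goes to infinity while the squared term stays bounded by 1.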
EDIT: @MartinThoma The formula above is the multinomial logistic loss for the binary case only. In general, it is $$J(\theta) = -\left[ \sum_{i=1}^{m} \sum_{k=1}^{K} 1\left\{y^{(i)} = k\right\} \log P(y^{(i)} = k \mid x^{(i)} ; \theta) \right],$$ where $K$ is the number of categories.
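The indicator $1\{y^{(i)} = k\}$ simply picks out the log-probability of the true class of each example, so the general formula can be sketched in a few lines of NumPy (the toy probabilities below are illustrative):

```python
import numpy as np

# m = 2 examples, K = 3 classes; rows are P(y = k | x^(i); theta), summing to 1
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
y = np.array([0, 2])  # true class index of each example

# J(theta) = -sum_i sum_k 1{y^(i)=k} log P(y^(i)=k | x^(i); theta);
# the indicator selects one log-probability per row
J = -np.log(P[np.arange(len(y)), y]).sum()
print(J)
```

With one-hot targets this reduces exactly to the binary formula above when $K = 2$.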