Solved – Should one-hot output encoding be used in backpropagation

backpropagation, categorical-encoding, neural-networks

I'm going through the process of writing backpropagation for a neural network.
In particular, I'm building an MNIST classifier.

I'm wondering whether it is better to apply the cost function and backpropagation to the non-one-hot output, i.e. to the raw output activations rather than to a one-hot-encoded version of them.

The advantage I see is that this encourages all of the would-be-non-hot values toward 0, rather than letting them sit only marginally below the maximum value. Likewise, it encourages the would-be-hot value toward 1.

An additional one-hot output layer could be applied after training to clamp the values to either 0 or 1.
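
Concretely, I picture that post-training clamp as something like the following (a hypothetical numpy sketch, just to illustrate what I mean):

```python
import numpy as np

def clamp_to_one_hot(y_raw):
    """Applied only after training, at prediction time: turn the raw
    output activation into a hard one-hot vector."""
    y = np.zeros_like(y_raw)
    y[np.argmax(y_raw)] = 1.0
    return y

# clamp_to_one_hot(np.array([0.1, 0.2, 0.7])) -> array([0., 0., 1.])
```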

Andrew Ng's machine learning course wanted me to use one-hot encoding. Do my arguments for using the raw, non-one-hot output carry any weight?

Best Answer

I assume that whatever is backpropagated has to be a probability distribution (the vector sums to 1), i.e. that it has already been through a softmax.
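
For reference, a minimal softmax along those lines (with the usual max-shift for numerical stability) looks like:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```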

It would be interesting to see which works better empirically, but I think the raw output contains more information.

Consider the raw output [0.01, 0.45, 0.54], which we then convert to [0, 0, 1]. Say the truth is [0, 0, 1]. (What we really want the output to be is something like [1e-10, 1e-10, 1 - 2e-10].)

If we use the one-hot [0, 0, 1], we don't change anything: the error is [0, 0, 0].

If we use the raw output, however, we end up changing the weights to increase the output's certainty (e.g., we decrease the first and second components and increase the third), with error [0.01, 0.45, -0.46].
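
Spelling that out in numpy (just restating the example above):

```python
import numpy as np

y_raw = np.array([0.01, 0.45, 0.54])   # raw softmax output
y_true = np.array([0.0, 0.0, 1.0])     # ground truth

y_onehot = np.zeros_like(y_raw)
y_onehot[np.argmax(y_raw)] = 1.0       # the one-hot prediction [0, 0, 1]

print(y_onehot - y_true)  # [0. 0. 0.]          -> no signal to learn from
print(y_raw - y_true)     # [0.01  0.45 -0.46]  -> weights still get adjusted
```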

This might not even be evident if we only evaluated the classifier's accuracy or ROC area, but you might see a difference in calibration (the ability to give accurate probability estimates). I think the raw output would lead to better-calibrated models, because one-hot encoding the output essentially removes the information about numerical probabilities.
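
If you wanted to check that empirically, one rough way (a sketch, not something from the question) is to bin held-out predictions by confidence and compare the average predicted probability with the observed accuracy in each bin:

```python
import numpy as np

def reliability_table(probs, labels, n_bins=10):
    """Crude calibration check: bin predictions by the confidence of the
    predicted class and compare mean predicted probability with observed
    accuracy per bin. A well-calibrated model keeps the two close."""
    conf = probs.max(axis=1)                      # predicted-class probability
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    table = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            table.append((conf[mask].mean(), correct[mask].mean()))
    return table
```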