Solved – Should one-hot output encoding be used in backpropagation

backpropagation, categorical-encoding, neural-networks

I'm going through the process of writing backpropagation for a neural network.
In particular, I'm building an MNIST classifier.

I'm wondering whether it is better to apply the cost function and backpropagation to the non-one-hot output, i.e. to the raw output activations rather than to a one-hot-encoded version of them.

The advantage I see is that this encourages all of the would-be-non-hot values toward 0, rather than letting them sit only marginally below the maximum value. Likewise, it encourages the would-be-hot value toward 1.

An additional one-hot output layer could be applied after training to clamp the values to either 0 or 1.
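
Concretely, I picture that post-training clamp as something like the following (a hypothetical numpy sketch, just to illustrate what I mean):

```python
import numpy as np

def clamp_to_one_hot(y_raw):
    """Applied only after training, at prediction time: turn the raw
    output activation into a hard one-hot vector."""
    y = np.zeros_like(y_raw)
    y[np.argmax(y_raw)] = 1.0
    return y

# clamp_to_one_hot(np.array([0.1, 0.2, 0.7])) -> array([0., 0., 1.])
```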

Andrew Ng's machine learning course wanted me to use one-hot encoding. Do my arguments for using the raw, non-one-hot output carry any weight?

Best Answer

I assume that whatever is backpropagated has to be a probability distribution (the vector sums to 1), i.e. that it has already been through a softmax.
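
For reference, a minimal softmax along those lines (with the usual max-shift for numerical stability) looks like:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```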

It would be interesting to see which works better empirically, but I think the raw output contains more information.

Consider the raw output [0.01, 0.45, 0.54], which we then convert to [0, 0, 1]. Say the truth is [0, 0, 1]. (What we really want the output to be is something like [1e-10, 1e-10, 1 - 2e-10].)

If we use the one-hot [0, 0, 1], we don't change anything: the error is [0, 0, 0].

If we use the raw output, however, we end up changing the weights to increase the output's certainty (e.g., we decrease the first and second components and increase the third), with error [0.01, 0.45, -0.46].
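
Spelling that out in numpy (just restating the example above):

```python
import numpy as np

y_raw = np.array([0.01, 0.45, 0.54])   # raw softmax output
y_true = np.array([0.0, 0.0, 1.0])     # ground truth

y_onehot = np.zeros_like(y_raw)
y_onehot[np.argmax(y_raw)] = 1.0       # the one-hot prediction [0, 0, 1]

print(y_onehot - y_true)  # [0. 0. 0.]          -> no signal to learn from
print(y_raw - y_true)     # [0.01  0.45 -0.46]  -> weights still get adjusted
```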

This might not even be evident if we only evaluated the classifier's accuracy or ROC area, but you might see a difference in calibration (the ability to give accurate probability estimates). I think the raw output would lead to better-calibrated models, because one-hot encoding the output essentially removes the information about numerical probabilities.
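
If you wanted to check that empirically, one rough way (a sketch, not something from the question) is to bin held-out predictions by confidence and compare the average predicted probability with the observed accuracy in each bin:

```python
import numpy as np

def reliability_table(probs, labels, n_bins=10):
    """Crude calibration check: bin predictions by the confidence of the
    predicted class and compare mean predicted probability with observed
    accuracy per bin. A well-calibrated model keeps the two close."""
    conf = probs.max(axis=1)                      # predicted-class probability
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(conf, edges) - 1, 0, n_bins - 1)
    table = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            table.append((conf[mask].mean(), correct[mask].mean()))
    return table
```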