Solved – How are hidden layer weights computed in a multilayer neural network

Tags: machine-learning, neural-networks, weights

More specifically, given a typical neural network with a single hidden layer of units $Z_m$, $m = 1,…,M$ (see the specification/notation below, drawn from p. 392 of The Elements of Statistical Learning, Hastie, Tibshirani & Friedman, 2008), wouldn't the weights for each of the derived features $Z_m$ converge to the same values, since every feature $Z_m$ in the hidden layer is derived from the same collection of inputs $X_p$, $p = 1,…,P$? To my untrained eye it looks like the hidden layer contains redundant features. What am I missing?

For K-class classification, there are K units at the top, with the kth
unit modeling the probability of class k. There are K target
measurements $Y_k$, $k = 1,…,K$, each being coded as a 0−1 variable
for the kth class.

Derived features $Z_m$ are created from linear combinations of the
inputs, and then the target $Y_k$ is modeled as a function of linear
combinations of the $Z_m$,

$Z_m = σ(α_{0m} + α^T_mX),$ where $m = 1,…,M,$

$T_k = β_{0k} + β^T_k Z,$ where $k = 1,…,K,$

$f_k(X) = g_k(T),$ where $k = 1,…,K,$

where $Z = (Z_1,Z_2,…,Z_M),$ and $T = (T_1,T_2,…,T_K).$
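The model above can be sketched directly in code. The following is a minimal forward pass only (no fitting), with hypothetical sizes and randomly initialized parameters, using the logistic function for $σ$ and the softmax for $g_k$:

```python
import numpy as np

rng = np.random.default_rng(0)
P, M, K = 4, 3, 2                    # hypothetical: inputs, hidden units, classes

# Randomly initialized parameters (alphas feed the hidden layer, betas the output)
alpha0 = rng.normal(size=M)          # biases alpha_0m
alpha  = rng.normal(size=(M, P))     # row m holds alpha_m
beta0  = rng.normal(size=K)          # biases beta_0k
beta   = rng.normal(size=(K, M))     # row k holds beta_k

def sigma(v):
    """Activation sigma; the logistic function is a common choice."""
    return 1.0 / (1.0 + np.exp(-v))

def f(X):
    Z = sigma(alpha0 + alpha @ X)    # Z_m = sigma(alpha_0m + alpha_m^T X)
    T = beta0 + beta @ Z             # T_k = beta_0k + beta_k^T Z
    e = np.exp(T - T.max())          # g_k as the softmax (numerically stable form)
    return e / e.sum()               # f_k(X): class probabilities

X = rng.normal(size=P)
probs = f(X)
print(probs)                         # K probabilities summing to 1
```

Note that every $Z_m$ is built from the same inputs $X$; only the weight vectors $\alpha_m$ differ, which is exactly what the question is asking about.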


Best Answer

The weights are initialized with different (and typically random) values. Because of this, hidden units will have different activations, and will contribute differently to the output. This breaks the symmetry that you noticed. Because of the asymmetry, weights will converge to different values.

An example

Say we have a three-layer network. There are $n$ inputs with activations $x = [x_1, ..., x_n]$. There are $n$ hidden units with sigmoidal activations $h = [h_1, ..., h_n] = \tanh(x W + b_h)$. Here, $W$ is a weight matrix and $b_h$ is a vector of biases. There is a single output unit with linear activation $o = h V + b_o$. We want to predict a target output $t$, and we measure the error with the squared loss $L = (t - o)^2$. We update the weights using stochastic gradient descent with learning rate $\alpha$. That is, for each training input and target, we calculate the gradient of the loss function with respect to each parameter, then update each parameter by stepping in the direction opposite its gradient.

Consider the parameter $W_{11}$, the weight from the first input unit to the first hidden unit. The update rule is:

$$W_{11} \leftarrow W_{11} - \alpha \frac{\partial L}{\partial W_{11}}$$

The gradient is:

$$ \frac{\partial L}{\partial W_{11}} = \frac{\partial L}{\partial h_1} \frac{\partial h_1}{\partial W_{11}} = 2 V_1 (t - o) x_1 (h_1^2 - 1) $$

In contrast, consider $W_{12}$, the weight from the first input unit to the second hidden unit. The update rule is:

$$W_{12} \leftarrow W_{12} - \alpha \frac{\partial L}{\partial W_{12}}$$

The gradient is:

$$ \frac{\partial L}{\partial W_{12}} = \frac{\partial L}{\partial h_2} \frac{\partial h_2}{\partial W_{12}} = 2 V_2 (t - o) x_1 (h_2^2 - 1) $$
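Both formulas can be checked numerically. Here is a short sketch (hypothetical sizes and random parameters of my choosing) that evaluates the two analytic gradients and compares them with central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3                                  # hypothetical network size

x  = rng.normal(size=n)                # inputs
W  = rng.normal(size=(n, n))           # W[i, j]: weight from input i to hidden unit j
bh = rng.normal(size=n)
V  = rng.normal(size=n)
bo = rng.normal()
t  = 0.5                               # target

def loss(W):
    h = np.tanh(x @ W + bh)            # hidden activations
    o = h @ V + bo                     # linear output
    return (t - o) ** 2                # squared loss L

h = np.tanh(x @ W + bh)
o = h @ V + bo

# The two analytic gradients from the text
g11 = 2 * V[0] * (t - o) * x[0] * (h[0] ** 2 - 1)   # dL/dW_11
g12 = 2 * V[1] * (t - o) * x[0] * (h[1] ** 2 - 1)   # dL/dW_12

# Central finite differences for comparison
eps = 1e-6
def fd(i, j):
    Wp, Wm = W.copy(), W.copy()
    Wp[i, j] += eps
    Wm[i, j] -= eps
    return (loss(Wp) - loss(Wm)) / (2 * eps)

print(abs(g11 - fd(0, 0)) < 1e-6, abs(g12 - fd(0, 1)) < 1e-6)   # True True
```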

Comparing the two updates: the new value of $W_{11}$ depends on its previous value, on $h_1$ (the activation of the first hidden unit), and on $V_1$ (the weight from the first hidden unit to the output unit). In contrast, the new value of $W_{12}$ depends on its previous value, $h_2$, and $V_2$.

Because the weights were initialized randomly, the previous values of $W_{11}$ and $W_{12}$ will differ, as will $V_1$ and $V_2$. Because the activations depend on the weights, $h_1$ and $h_2$ will also differ. Therefore, the updated values of $W_{11}$ and $W_{12}$ will differ as well.
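To see the whole argument end to end, here is a hedged sketch (toy sizes and variable names of my own) that runs plain SGD on this network twice: once with every hidden weight initialized to the same constant, and once with random initialization. Under the symmetric start, $W_{11}$ and $W_{12}$ receive identical updates at every step and never separate; random initialization breaks the tie.

```python
import numpy as np

def sgd(W, V, bh, bo, x, t, lr=0.01, steps=50):
    """Plain SGD on L = (t - o)^2 for the toy network; returns the final W."""
    for _ in range(steps):
        h = np.tanh(x @ W + bh)                       # hidden activations
        o = h @ V + bo                                # linear output
        dL_do = -2 * (t - o)
        dL_dh = dL_do * V                             # backprop through the output unit
        dL_dW = np.outer(x, dL_dh * (1 - h ** 2))     # chain rule through tanh
        W  = W  - lr * dL_dW
        V  = V  - lr * dL_do * h
        bh = bh - lr * dL_dh * (1 - h ** 2)
        bo = bo - lr * dL_do
    return W

n = 3
x = np.array([0.2, -0.4, 0.7])
t = 1.0

# Symmetric initialization: every hidden unit starts out identical
W_sym = sgd(np.full((n, n), 0.5), np.full(n, 0.5), np.zeros(n), 0.0, x, t)
print(W_sym[0, 0] == W_sym[0, 1])    # True: the columns never separate

# Random initialization breaks the symmetry
rng = np.random.default_rng(2)
W_rnd = sgd(rng.normal(size=(n, n)), rng.normal(size=n), np.zeros(n), 0.0, x, t)
print(W_rnd[0, 0] == W_rnd[0, 1])    # False: the hidden units diverge
```

With the symmetric start, $h_1 = h_2$, $V_1 = V_2$, and the two gradients above coincide exactly at every step, so the units remain redundant forever; this is precisely the failure mode that random initialization avoids.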