I am using a feed-forward NN. I understand the concept, but my question is about the weights. How can you interpret them, i.e. what do they represent, or how can they be understood (beyond being mere function coefficients)? I have found something called the "space of weights", but I am not quite sure what it means.
Solved – Neural network – meaning of weights
neural-networks, weights
Related Solutions
Forward propagation simply multiplies the input by the weights and adds a bias before applying the activation function (sigmoid here) at each node. There is no bias in this question.
$ W^{(1)}*x = z^{(1)} = \begin{bmatrix} \ W_{11}^{(1)} & \ W_{12}^{(1)} \\[0.3em] \ W_{21}^{(1)} & \ W_{22}^{(1)} \end{bmatrix} * \begin{bmatrix} \ x_1 \\[0.3em] \ x_2 \end{bmatrix} = \begin{bmatrix} \ 0.5 & \ 0.1 \\[0.3em] \ 0.25 & 0.75 \end{bmatrix} \begin{bmatrix} \ 1 \\[0.3em] \ 0 \end{bmatrix} = \begin{bmatrix} \ 0.5 \\[0.3em] \ 0.25 \end{bmatrix}$
$ a^{(2)}= sigm(z^{(1)}) = sigm(\begin{bmatrix} \ 0.5 \\[0.3em] \ 0.25 \end{bmatrix}) = \begin{bmatrix} \ 0.6225 \\[0.3em] \ 0.5622 \end{bmatrix} $
$ W^{(2)}*a^{(2)} = z^{(2)} = \begin{bmatrix} \ W_{11}^{(2)} & \ W_{12}^{(2)} \end{bmatrix} * \begin{bmatrix} \ a^{(2)}_1 \\[0.3em] \ a^{(2)}_2 \end{bmatrix} = \begin{bmatrix} \ 0.95*0.6225 + 0.5622*1.0 \end{bmatrix} = 1.1536 $
$ a^{(3)}= sigm(z^{(2)}) = sigm(1.1536) = 0.7602 $
This is your output. Assume that your cost function is
$ C = \frac{1}{2}(a^{(3)} -y )^2$
where $y$ is the expected output, $y = 0.5$, and the output error term is derived as
$ δ^{(3)} = \frac{dC}{dz^{(2)}} = (a^{(3)} -y ).* a^{(3)}.*(1-a^{(3)}) = (0.7602 - 0.5) .*0.7602.*(1-0.7602) = 0.0474$
where '.*' is the element-wise product and $a^{(3)}.*(1-a^{(3)})$ comes from the derivative of the sigmoid. I've assumed the error term is calculated with respect to $z$, not $a$; if it is taken with respect to $a$ instead, the derivation changes a little. Now back-propagate $δ^{(3)}$ to find $δ_{2}^{(2)}$:
$ δ_{2}^{(2)} = \frac{dC}{dz^{(2)}} * \frac{dz^{(2)}}{dz_{2}^{(1)}} = δ^{(3)} * \frac{dz^{(2)}}{dz_{2}^{(1)}} $
let's derive the second term before we continue:
$ \frac{dz^{(2)}}{dz_{2}^{(1)}} = W_{12}^{(2)}.*a_{2}^{(2)}.*(1-a_{2}^{(2)})$ from $ z^{(2)}= W^{(2)}*sigm(z^{(1)}) $
now we can evaluate the previous equation: $ δ_{2}^{(2)} = \frac{dC}{dz^{(2)}} * \frac{dz^{(2)}}{dz_{2}^{(1)}} = δ^{(3)} * W_{12}^{(2)}.*a_{2}^{(2)}.*(1-a_{2}^{(2)}) = 0.0474 * 1.0 * 0.5622 * (1-0.5622) = 0.0117 $
You can see how quickly the error term diminishes during back-propagation when we use a sigmoid activation (or the hyperbolic tangent).
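Both error terms can be checked numerically; the short sketch below reuses the forward-pass values from the example:

```python
import numpy as np

# Values from the forward pass in the example
W2 = np.array([0.95, 1.0])        # output-layer weights
a2 = np.array([0.6225, 0.5622])   # hidden activations
a3 = 0.7602                       # network output
y = 0.5                           # expected output

# Output error term: dC/dz2 = (a3 - y) * sigmoid'(z2)
delta3 = (a3 - y) * a3 * (1 - a3)                # -> 0.0474
# Back-propagate to hidden unit 2 through W_12^(2)
delta2_2 = delta3 * W2[1] * a2[1] * (1 - a2[1])  # -> 0.0117
print(round(delta3, 4), round(delta2_2, 4))
```

Note how each back-propagation step multiplies by another sigmoid derivative, which is at most 0.25; this is why the error term shrinks so fast.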
A is, in fact, a full layer. The output of the layer, $h_t$, is the neuron output, which can be fed into a softmax layer (if you want a classification for time step $t$, for instance) or anything else, such as another LSTM layer if you want to go deeper. The input of this layer is what sets it apart from a regular feedforward network: it takes both the input $x_t$ and the full state of the network at the previous time step (both $h_{t-1}$ and the other variables of the LSTM cell).
Note that $h_t$ is a vector. So, if you want to make an analogy with a regular feedforward network with one hidden layer, A can be thought of as taking the place of all of the neurons in that hidden layer (plus the extra complexity of the recurrent part).
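The recurrence described above can be sketched in a few lines. For brevity this uses a plain tanh RNN cell rather than a full LSTM (an LSTM adds gates and a cell state, but the wiring of $x_t$ and $h_{t-1}$ into the layer is the same idea); the dimensions are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4
W_x = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights

h = np.zeros(n_hidden)                 # initial state h_0
for x_t in rng.normal(size=(5, n_in)): # 5 time steps of input
    # h_t depends on both the current input x_t and the previous state h_{t-1}
    h = np.tanh(W_x @ x_t + W_h @ h)
print(h.shape)                         # h_t is a vector of size n_hidden
```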
Best Answer
Individual weights represent the strength of connections between units. If the weight from unit A to unit B has greater magnitude (all else being equal), it means that A has greater influence over B (i.e. to increase or decrease B's level of activation).
You can also think of the set of incoming weights to a unit as measuring what that unit 'cares about'. This is easiest to see at the first layer. Say we have an image-processing network. Early units receive weighted connections from input pixels. The activation of each unit is a weighted sum of pixel intensity values, passed through an activation function. Because the activation function is monotonic, a given unit's activation will be higher when the input pixels are similar to the incoming weights of that unit (in the sense of having a large dot product). So, you can think of the weights as a set of filter coefficients, defining an image feature. For units in higher layers (in a feedforward network), the inputs aren't from pixels anymore, but from units in lower layers. So, the incoming weights are more like 'preferred input patterns'.
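The dot-product intuition can be made concrete with a tiny made-up example: a hypothetical unit whose incoming weights encode a vertical-edge filter activates more strongly on a patch that resembles the filter than on a uniform patch:

```python
import numpy as np

def sigm(z):
    """Logistic sigmoid activation (monotonic)."""
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([-1.0, 1.0,
              -1.0, 1.0])          # flattened 2x2 vertical-edge "filter"
edge_patch = np.array([0.0, 1.0,
                       0.0, 1.0])  # patch with a vertical edge: large dot product
flat_patch = np.array([0.5, 0.5,
                       0.5, 0.5])  # uniform patch: dot product is zero

# The unit responds more strongly to the input that matches its weights
print(sigm(w @ edge_patch) > sigm(w @ flat_patch))  # True
```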
Not sure about your original source, but if I were talking about 'weight space', I'd be referring to the set of all possible values of all weights in the network.