A major insight into how a neural network can learn to classify something as complex as image data, given just examples and correct answers, came to me while studying Professor Kunihiko Fukushima's work on the neocognitron in the 1980s. Instead of just showing his network a bunch of images and using back-propagation to let it figure things out on its own, he took a different approach: he trained his network layer by layer, and even node by node. He analyzed the performance and operation of each individual node of the network and deliberately modified those parts to make them respond in intended ways.
For instance, he knew he wanted the network to be able to recognize lines, so he trained specific layers and nodes to recognize three-pixel horizontal lines, three-pixel vertical lines, and specific variations of diagonal lines at all angles. By doing this, he knew exactly which parts of the network could be counted on to fire when the desired patterns were present. Then, since each layer is highly connected, the neocognitron as a whole could identify each of the composite parts present in the image no matter where they physically appeared. So whenever a specific line segment existed somewhere in the image, there was always a specific node that would fire.
Keeping this picture in mind, consider linear regression: it simply finds a formula (a line), via the sum of squared errors, that passes most closely through your data. That is easy enough to understand. To fit curved "lines" we can do the same sum-of-squared-error calculation, except now we add features such as $x^2$, $x^3$, or even higher-order polynomial terms. Feed features like these into a logistic regression model and you have a classifier that can find relationships that are not linear in nature. In fact, logistic regression can express relationships that are arbitrarily complex, but you still need to manually choose the correct number and degree of power features to do a good job of predicting the data.
One way to think of a neural network is to view the last layer as a logistic regression classifier and the hidden layers as automatic "feature selectors." This eliminates the work of manually choosing the correct number and power of the input features. The NN thus becomes an automatic power-feature selector and can find any linear or non-linear relationship, or serve as a classifier of arbitrarily complex sets (assuming only that there are enough hidden layers and connections to represent the complexity of the model it needs to learn). In the end, a well-functioning NN is expected to learn not just "the relationship" between the inputs and outputs; instead, we strive for an abstraction, a model that generalizes well.
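As a rough illustration (not from the original answer, and assuming scikit-learn is available), here is a minimal sketch contrasting the two approaches: logistic regression only captures a curved boundary if we hand-pick the polynomial degree, while a small NN learns its own features. The dataset and hyperparameters are illustrative only:

```python
# Minimal sketch (assumes scikit-learn and numpy are installed;
# dataset and hyperparameters are illustrative only).
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Logistic regression: non-linear boundaries only if we hand-pick power features.
poly_logreg = make_pipeline(PolynomialFeatures(degree=3),
                            LogisticRegression(max_iter=1000))
poly_logreg.fit(X, y)

# MLP: the hidden layer learns the "power features" for us.
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
mlp.fit(X, y)

print(poly_logreg.score(X, y), mlp.score(X, y))
```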
As a rule of thumb, a neural network cannot learn anything that a reasonably intelligent human could not theoretically learn, given enough time, from the same data; however,
- it may be able to learn some things no one has figured out yet
- for large problems, a bank of computers processing neural networks can find really good solutions much faster than a team of people (and at a much lower cost)
- once trained, NNs will produce consistent results for the inputs they have been trained on and should generalize well if tweaked properly
- NNs never get bored or distracted
It looks like you are using an ANN for function approximation where your target is a continuous variable. If so, the uniformity of the input data's spread over all the dimensions, and their range and scale, can have a strong impact on how well your ANN works. It has been observed that, even when there is dense coverage of data points over the input space, an ANN can still have a problem making a good prediction. (I have actually had to smooth the input data, e.g. with cubic splines, to improve the uniformity of data-point coverage over the range and scale of the inputs, before the MSE stabilized and decreased substantially.) Therefore, straightforward use of raw input data may not help the situation.

If you were performing classification, you would need to use the softmax function on the output side, but it looks like you are merely performing function approximation. You might also try a linear (identity) function on the output side and see what happens, because the distribution of your target $y$-variable $[y=f(x_1,x_2,\ldots,x_p)]$ can affect the MSE as well. You could also consider an RBF network, or a SOM-ANN, which will reduce the dimensionality of your inputs. Lastly, correlation between input features degrades learning speed, since an ANN will waste time learning the correlation between inputs. This is why many practitioners run PCA on the inputs first and then feed in, e.g., the 10 PCs associated with the greatest eigenvalues, effectively decorrelating the features so they are orthogonal.
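For what it's worth, here is a minimal sketch of that PCA-decorrelation idea (assuming scikit-learn and numpy; the synthetic data and the choice of 10 components are only illustrative, and `MLPRegressor` already applies an identity activation on its output layer):

```python
# Minimal sketch of decorrelating inputs with PCA before an ANN
# (assumes scikit-learn/numpy; the data here is synthetic and illustrative).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 25))
X[:, 1] = 0.9 * X[:, 0] + 0.1 * X[:, 1]      # deliberately correlated features
y = np.sin(X[:, 0]) + 0.5 * X[:, 2] ** 2     # continuous target

# Scale, keep the 10 components with the largest eigenvalues, then fit the ANN.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=10),
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
model.fit(X, y)
print(model.score(X, y))
```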
Update (4/27/2016):
A way to more evenly distribute randomly sampled input points for a function approximation problem using an ANN is to employ Latin Hypercube Sampling (LHS) over $\{x,y\}$ in order to predict $\hat{z}$. To begin, split the range of $x$ and $y$ into $M=10$ uniformly spaced, non-overlapping bins; the result is a $10\times 10$ square grid with 100 elements (cells), call this the ``range grid.'' Next, sample one of the 100 cells, draw a random pair of $\{x,y\}$ values from within the range of the bin walls for $x$ and $y$ of the selected cell, and then block that row and column from further selection. Then sample another cell from a row and column that have not been used yet, and draw another random $\{x,y\}$ pair from within that cell. Continue until every row and every column has been selected once. The 10 resulting samples of $\{x,y\}$ will provide pairs of points with no row or column overlap, which is a good way to feed $\{x,y\}$ to an ANN for an $\{x,y,z\}$ problem, or a multiple-feature problem $\{x_1,x_2,\ldots,x_p\}$.
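Here is a minimal sketch of that 10-bin procedure (assuming numpy; the value ranges for $x$ and $y$ are illustrative, and `lhs_pairs` is just a name chosen for this example):

```python
# Minimal sketch of the 10-bin LHS procedure described above (assumes numpy;
# the value ranges for x and y are illustrative).
import numpy as np

def lhs_pairs(m=10, x_range=(0.0, 1.0), y_range=(0.0, 1.0), seed=0):
    """Draw m {x, y} pairs so that each row and each column of the
    m-by-m "range grid" is used exactly once."""
    rng = np.random.default_rng(seed)
    rows = rng.permutation(m)            # one x-bin per sample, no repeats
    cols = rng.permutation(m)            # one y-bin per sample, no repeats
    x_edges = np.linspace(*x_range, m + 1)
    y_edges = np.linspace(*y_range, m + 1)
    x = rng.uniform(x_edges[rows], x_edges[rows + 1])   # random point inside each x-bin
    y = rng.uniform(y_edges[cols], y_edges[cols + 1])   # random point inside each y-bin
    return np.column_stack([x, y])

print(lhs_pairs())   # 10 {x, y} pairs covering every row and column once
```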
If you want 100 pairs of $\{x,y\}$, you can start with the combination $\{1,2,3,4,5,6,7,8,9,10\}$. Next, identify 10 permutations for this combination to generate a $10 \times 10$ ``row'' matrix $\mathbf{R}$:
$\{3,2,10,4,1,5,7,9,8,6\}$, $\{2,4,3,1,5,6,10,8,9,7\}$,...,$\{9,1,2,10,5,6,7,8,3,4\}$.
which will give 100 integer values for sampling rows.
Next, generate a $10 \times 10$ ``column'' matrix $\mathbf{C}$ using another set of 10 different permutations:
$\{5,8,10,4,9,3,7,1,2,6\}$, $\{3,7,4,1,5,6,10,8,9,2\}$,...,$\{6,9,2,7,5,1,10,8,3,4\}$
which will provide 100 integers for sampling columns.
The first random draw using the above matrices would be from row 3 and column 5 of the original $10 \times 10$ ``range grid'' of 100 bins for $x$ and $y$. This is another form of LHS.
If you need more than 100 $\{x,y\}$ pairs, then just increase the number of permutations used, and don't be stingy as there are 10! permutations.
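And a minimal sketch of the 100-pair version (assuming numpy; random permutations are used here in place of the specific permutations listed above, and the ranges are again illustrative):

```python
# Minimal sketch of the permutation-matrix extension above (assumes numpy).
import numpy as np

m, seed = 10, 0
rng = np.random.default_rng(seed)

# 10 permutations of {0, ..., 9} stacked into the "row" matrix R and the
# "column" matrix C (the text uses 1-based values {1, ..., 10}).
R = np.vstack([rng.permutation(m) for _ in range(m)])
C = np.vstack([rng.permutation(m) for _ in range(m)])

x_edges = np.linspace(0.0, 1.0, m + 1)   # illustrative ranges for x and y
y_edges = np.linspace(0.0, 1.0, m + 1)

# Each (R[i, j], C[i, j]) picks one cell of the range grid; draw one {x, y}
# pair uniformly from inside that cell, giving 100 pairs in total.
rows, cols = R.ravel(), C.ravel()
x = rng.uniform(x_edges[rows], x_edges[rows + 1])
y = rng.uniform(y_edges[cols], y_edges[cols + 1])
samples = np.column_stack([x, y])
print(samples.shape)    # (100, 2)
```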
Best Answer
Forward propagation is simply multiplying the inputs by the weights and adding the bias before applying the activation function (here, the sigmoid) at each node. There is no bias in this question.
$ W^{(1)}*x = z^{(1)} = \begin{bmatrix} \ W_{11}^{(1)} & \ W_{12}^{(1)} \\[0.3em] \ W_{21}^{(1)} & \ W_{22}^{(1)} \end{bmatrix} * \begin{bmatrix} \ x_1 \\[0.3em] \ x_2 \end{bmatrix} = \begin{bmatrix} \ 0.5 & \ 0.1 \\[0.3em] \ 0.25 & 0.75 \end{bmatrix} \begin{bmatrix} \ 1 \\[0.3em] \ 0 \end{bmatrix} = \begin{bmatrix} \ 0.5 \\[0.3em] \ 0.25 \end{bmatrix}$
$ a^{(2)}= sigm(z^{(1)}) = sigm(\begin{bmatrix} \ 0.5 \\[0.3em] \ 0.25 \end{bmatrix}) = \begin{bmatrix} \ 0.6225 \\[0.3em] \ 0.5622 \end{bmatrix} $
$ W^{(2)}*a^{(2)} = z^{(2)} = \begin{bmatrix} \ W_{11}^{(2)} & \ W_{12}^{(2)} \end{bmatrix} * \begin{bmatrix} \ a^{(2)}_1 \\[0.3em] \ a^{(2)}_2 \end{bmatrix} = \begin{bmatrix} \ 0.95*0.6225 + 0.5622*1.0 \end{bmatrix} = 1.1536 $
$ a^{(3)}= sigm(z^{(2)}) = sigm(1.1536) = 0.7602 $
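For reference, here is a minimal numpy sketch that reproduces the forward pass above (the weights are the ones from this worked example):

```python
# Minimal sketch of the forward pass above (assumes numpy);
# the weights are taken from the worked example.
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0])
W1 = np.array([[0.50, 0.10],
               [0.25, 0.75]])
W2 = np.array([[0.95, 1.00]])

z1 = W1 @ x          # [0.5, 0.25]
a2 = sigm(z1)        # [0.6225, 0.5622]
z2 = W2 @ a2         # [1.1536]
a3 = sigm(z2)        # [0.7602]
print(a2, z2, a3)
```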
The value $a^{(3)} = 0.7602$ is your output. Now assume that your cost function is
$ C = \frac{1}{2}(a^{(3)} -y )^2$
where $y$ is the expected output, $y = 0.5$, and the output error term is derived as
$ δ^{(3)} = \frac{dC}{dz^{(2)}} = (a^{(3)} -y ).* a^{(3)}.*(1-a^{(3)}) = (0.7602 - 0.5) .*0.7602.*(1-0.7602) = 0.0474$
where '.*' is the element-wise product and $a^{(3)}.*(1-a^{(3)})$ comes from the derivative of the sigmoid. I have assumed that the error term is calculated with respect to $z$, not $a$; if it were with respect to $a$, the derivation would change a little. Now back-propagate $δ^{(3)}$ to find $δ_{2}^{(2)}$:
$ δ_{2}^{(2)} = \frac{dC}{dz^{(2)}} * \frac{dz^{(2)}}{dz_{2}^{(1)}} = δ^{(3)} * \frac{dz^{(2)}}{dz_{2}^{(1)}} $
Let's derive the second term before we continue:
$ \frac{dz^{(2)}}{dz_{2}^{(1)}} = W_{12}^{(2)}.*a_{2}^{(2)}.*(1-a_{2}^{(2)})$ from $ z^{(2)}= W^{(2)}*sigm(z^{(1)}) $
Now we can evaluate the previous equation: $ δ_{2}^{(2)} = \frac{dC}{dz^{(2)}} * \frac{dz^{(2)}}{dz_{2}^{(1)}} = δ^{(3)} * W_{12}^{(2)}.*a_{2}^{(2)}.*(1-a_{2}^{(2)}) = 0.0474 * 1.0 * 0.5622 * (1-0.5622) = 0.0117 $
You can see how the error term diminishes quickly during back-propagation when we use the sigmoid activation (or the hyperbolic tangent).
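A minimal numpy sketch reproducing the backward-pass numbers above (again with the weights from the worked example; the variable names `delta3` and `delta2_2` are just labels for $δ^{(3)}$ and $δ_{2}^{(2)}$):

```python
# Minimal sketch of the backward pass above (assumes numpy);
# it recomputes the forward-pass values from the worked example.
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.0])
W1 = np.array([[0.50, 0.10],
               [0.25, 0.75]])
W2 = np.array([[0.95, 1.00]])
y = 0.5                                   # expected output

# Forward pass (same numbers as above).
a2 = sigm(W1 @ x)                         # [0.6225, 0.5622]
a3 = sigm(W2 @ a2)                        # [0.7602]

# Output error term: dC/dz2 = (a3 - y) * sigmoid'(z2).
delta3 = (a3 - y) * a3 * (1.0 - a3)                      # ~0.0474

# Back-propagate through W2 to the second hidden node: dC/dz1_2.
delta2_2 = delta3 * W2[0, 1] * a2[1] * (1.0 - a2[1])     # ~0.0117
print(delta3, delta2_2)
```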