A major insight into how a neural network can learn to classify something as complex as image data, given just examples and correct answers, came to me while studying Professor Kunihiko Fukushima's work on the neocognitron in the 1980s. Instead of just showing his network a bunch of images and using back-propagation to let it figure things out on its own, he took a different approach and trained his network layer by layer, and even node by node. He analyzed the performance and operation of each individual node of the network and deliberately modified those parts to make them respond in intended ways.
For instance, he knew he wanted the network to be able to recognize lines, so he trained specific layers and nodes to recognize three-pixel horizontal lines, three-pixel vertical lines, and specific variations of diagonal lines at various angles. By doing this, he knew exactly which parts of the network could be counted on to fire when the desired patterns were present. Then, since each layer is highly connected, the neocognitron as a whole could identify each of the composite parts in an image no matter where they physically appeared: whenever a specific line segment existed somewhere in the image, there would always be a specific node that would fire.
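The idea of a hand-tuned node that fires on a particular pattern can be sketched with a couple of tiny, hand-crafted filters. This is an illustration in the spirit of the description above, not Fukushima's actual procedure; the filter values and threshold are assumptions chosen for the demo.

```python
import numpy as np

# Hand-crafted 3-pixel line detectors (illustrative values):
horizontal = np.array([[0, 0, 0],
                       [1, 1, 1],
                       [0, 0, 0]], dtype=float)
vertical = horizontal.T

def detect(image, kernel, threshold=2.5):
    """Slide the kernel over the image; a 'node fires' wherever the
    correlation with the pattern exceeds the threshold."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1), dtype=bool)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (image[i:i+h, j:j+w] * kernel).sum() >= threshold
    return out

# A three-pixel horizontal line embedded in an otherwise blank image:
img = np.zeros((5, 5))
img[2, 1:4] = 1.0

print(detect(img, horizontal).any())   # True  -- the horizontal detector fires
print(detect(img, vertical).any())     # False -- the vertical detector stays silent
```

Because the same kernel is slid across the whole image, the detector fires wherever the line appears, which is exactly the position-independence described above.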
Keeping this picture ever present, consider linear regression: simply finding a formula (a line) that passes most closely through your data by minimizing the sum of squared errors. That's easy enough to understand. To fit curved "lines" we can do the same weighted-sum calculation, except now we add a few features such as x^2 or x^3, or even higher-order polynomial terms. Squash that weighted sum through a sigmoid to produce class probabilities and you have a logistic regression classifier. This classifier can find relationships that are not linear in nature; in fact, with enough polynomial features, logistic regression can express relationships that are arbitrarily complex. But you still need to manually choose the correct number and power of features to do a good job of predicting the data.
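A minimal sketch of that idea: a decision boundary that is non-linear in x becomes linear once we hand-pick the right power features. The feature set (x, x^2) and the weights below are assumptions chosen to illustrate, not fitted values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, weights, bias):
    """Logistic regression: a weighted sum of hand-chosen power
    features, squashed through a sigmoid into a probability."""
    features = np.stack([x, x**2])      # manually chosen power features
    return sigmoid(weights @ features + bias)

# A boundary like x^2 < 4 (i.e. -2 < x < 2) is non-linear in x,
# but linear in the features [x, x^2]:
weights = np.array([0.0, -1.0])
bias = 4.0
xs = np.array([-3.0, 0.0, 3.0])
print(predict(xs, weights, bias))  # middle point > 0.5, outer points < 0.5
```

The catch, as noted above, is that someone had to decide that x^2 was the right feature to include.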
One way to think of a neural network is to consider the last layer as a logistic regression classifier, with the hidden layers acting as automatic "feature selectors". This eliminates the work of manually choosing the correct number and power of the input features. The NN thus becomes an automatic feature selector that can find any linear or non-linear relationship, or serve as a classifier of arbitrarily complex sets (assuming only that there are enough hidden layers and connections to represent the complexity of the model it needs to learn). In the end, a well-functioning NN is expected to learn not just "the relationship" between the inputs and outputs; what we strive for is an abstraction, a model that generalizes well.
As a rule of thumb, a neural network cannot learn anything that a reasonably intelligent human could not theoretically learn from the same data, given enough time. However,
- it may be able to learn some things no one has figured out yet
- for large problems, a bank of computers processing neural networks can find very good solutions much faster than a team of people (and at much lower cost)
- once trained, NNs produce consistent results on the inputs they've been trained on, and should generalize well if tuned properly
- NNs never get bored or distracted
From the Stanford notes on neural networks:
Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.
ref: http://cs231n.github.io/convolutional-networks/
These notes accompany the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition.
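The output-size arithmetic quoted above is worth writing down as a small helper. This is a sketch of the standard formula (W - F + 2P)/S + 1, with a hypothetical function name.

```python
def conv_output_size(input_size, field, stride, pad):
    """Spatial size of a conv layer's output: (W - F + 2P) / S + 1."""
    size = (input_size - field + 2 * pad) / stride + 1
    if size != int(size):
        raise ValueError("filter does not tile the input evenly")
    return int(size)

print(conv_output_size(227, 11, 4, 0))   # 55, matching the note above
# (224 - 11) / 4 + 1 = 54.25 is not an integer, hence the confusion:
# conv_output_size(224, 11, 4, 0) raises ValueError.
# Karpathy's guess amounts to padding the 224 input by 3 extra pixels:
print(conv_output_size(224 + 3, 11, 4, 0))   # 55 again
```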
And from a forum answer on why ConvNets can accept inputs of varying size:
In a fully connected neural network, the input can't change size because the linear transform in the first layer $Wx+b$ wouldn't work anymore -- the weight matrix $W$ wouldn't be of the correct shape.
However, note that you can apply a convolution to an image of any size without needing to change the parameters in the filter. So there is nothing restricting the size of the input image.
It makes sense that the network can generalize to inputs of different shape -- you are still applying the same convolutional filters to the same feature maps, so why shouldn't the result be the same as before?
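That point can be demonstrated directly: the same 3x3 filter (arbitrary weights here, for illustration) slides over images of any size, and only the output size changes, never the parameter count.

```python
import numpy as np

kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])  # a simple vertical-edge filter, 9 weights

def convolve2d(image, kernel):
    """'Valid' cross-correlation: output shrinks by kernel_size - 1."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out

small = np.ones((8, 8))
large = np.ones((32, 32))
print(convolve2d(small, kernel).shape)   # (6, 6)
print(convolve2d(large, kernel).shape)   # (30, 30) -- same 9 weights, bigger map
```

Contrast this with the fully connected case above: there, $W$ has one column per input pixel, so the input size is baked into the parameters.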