The core idea behind convolutional neural networks is that, in contrast to fully-connected layers, instead of assigning a different weight to each pixel of the picture (or whatever the input is), you have a kernel that is smaller than the input picture and slides across it. As a consequence, we apply the same set of weights to different parts of the picture (so-called weight sharing). By this we hope to detect the same patterns in different parts of the image.
To illustrate this, let's look at a one-dimensional kernel that slides along a vector (say, a sentence):
```
  g(x[0:2] * W + b) = z[0]
  /     |     \
x[0]  x[1]  x[2]  x[3]  x[4]

        g(x[1:3] * W + b) = z[1]
        /     |     \
x[0]  x[1]  x[2]  x[3]  x[4]

              g(x[2:4] * W + b) = z[2]
              /     |     \
x[0]  x[1]  x[2]  x[3]  x[4]
```
As you can see, we have an input vector of length five $\boldsymbol{x} = (x_0,x_1,\dots,x_4)$ and apply the same set of three weights $\boldsymbol{w} = (w_0, w_1, w_2)$ and a bias term $b$. The convolution kernel slides along the vector, applying the same weights to each part of it, and produces an output vector of length three $\boldsymbol{z} = (z_0, z_1, z_2)$, where each $z_i = g(\boldsymbol{x}_{i:i+2} \cdot \boldsymbol{w} + b) = g(x_i w_0 + x_{i+1} w_1 + x_{i+2} w_2 + b)$.
So basically, it applies the same operator, but at a smaller scale, going through the input tensor part by part while sharing the weights.
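If it helps, here is a minimal NumPy sketch of that 1-D convolution; the concrete numbers and the ReLU activation are just made up for illustration:

```python
import numpy as np

def g(a):                      # activation function, e.g. ReLU
    return np.maximum(a, 0)

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0])   # input vector of length 5
w = np.array([0.2, -0.1, 0.4])              # the same 3 shared weights
b = 0.1

# z_i = g(x_{i:i+2} . w + b), applied at every position of the sliding window
z = np.array([g(x[i:i + 3] @ w + b) for i in range(len(x) - len(w) + 1)])
print(z)   # output vector of length 3: (z_0, z_1, z_2)
```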
You may find this tutorial and the recorded lectures by the Stanford CS231n staff helpful.
Your confusion stems from the fact that channels (feature maps) are treated somewhat differently than other dimensions.
Let's say you have a grayscale image as input to the first layer and 32 kernels of shape (3, 3), as per your example. In fact, those kernels have shape (3, 3, 1), where 1 is the number of channels in the input; for an RGB input image it would be 3. The number of channels is simply omitted in the code because it is inferred automatically from the number of channels of the layer's input.
The output of this layer has 32 channels (one per kernel). In the second layer of your example you have 64 kernels of shape (3, 3), but they are in fact (3, 3, 32)! Each of these kernels aggregates information from all input feature maps.
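You can check this yourself; here is a minimal sketch, assuming the example uses Keras and a 28×28 grayscale input (the input size is my assumption, not from your example), which prints the full kernel shapes, channel dimension included:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                           input_shape=(28, 28, 1)),     # grayscale: 1 channel
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
])

for layer in model.layers:
    print(layer.kernel.shape)
# (3, 3, 1, 32)  -> 32 kernels, each of shape (3, 3, 1)
# (3, 3, 32, 64) -> 64 kernels, each of shape (3, 3, 32)
```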
> Typically I would think that with 64 filters and 32 feature maps from the previous layer we would get 64*32 feature maps in the next layer (all features are connected to each filter).
I hope that from the above explanation it is clear that you are not applying each of the 64 kernels on each of the 32 feature maps individually. Instead, each of these 64 kernels is looking at all of the 32 feature maps at the same time, having different weights for each of them.
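To make that concrete, here is a toy NumPy sketch (the 5×5 spatial size is just an assumption for illustration) that spells out the convolution as explicit loops; note that the output has 64 feature maps, not 64*32:

```python
import numpy as np

x = np.random.randn(5, 5, 32)         # input: 32 feature maps from the previous layer
W = np.random.randn(3, 3, 32, 64)     # 64 kernels, each spanning all 32 channels
b = np.zeros(64)

out = np.zeros((3, 3, 64))            # 'valid' convolution: (5-3+1) x (5-3+1)
for i in range(3):
    for j in range(3):
        for k in range(64):
            # each output value sums over the 3x3 window AND all 32 channels
            out[i, j, k] = np.sum(x[i:i + 3, j:j + 3, :] * W[:, :, :, k]) + b[k]

print(out.shape)   # (3, 3, 64) -- one feature map per kernel
```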
Best Answer
To understand local connectivity, first think about feeding an image into a regular fully connected neural network. Each input (pixel value) is connected to every neuron in the first layer, so each neuron in the first layer receives input from EVERY part of the image.
With a convolutional network, each neuron only receives input from a small local group of pixels in the input image. This is what is meant by "local connectivity": all of the inputs that go into a given neuron are actually close to each other in the image.
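A small sketch of the difference, using an assumed 28×28 grayscale image and a 3×3 kernel (both sizes are hypothetical):

```python
import numpy as np

image = np.random.randn(28, 28)            # toy grayscale image

# Fully connected: one neuron has a weight for EVERY pixel.
w_fc = np.random.randn(28 * 28)
fc_neuron = image.ravel() @ w_fc            # uses all 784 pixels

# Convolutional: one output unit only sees a local 3x3 patch.
w_conv = np.random.randn(3, 3)
conv_unit = np.sum(image[0:3, 0:3] * w_conv)   # uses just 9 neighbouring pixels

print(w_fc.size, w_conv.size)   # 784 weights per FC neuron vs 9 per kernel
```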
For your second question: yes, both the fully connected layers and the convolutional layers are trained with backpropagation. You take the errors (gradients) after propagating them back through the fully connected layers and continue the propagation through the convolutional layers using those.
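As a minimal sketch of this (assuming TensorFlow/Keras and toy shapes of my own choosing), a single backward pass produces gradients for the dense weights and the convolutional kernel alike:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])

x = tf.random.normal((4, 28, 28, 1))                        # dummy batch of images
y = tf.random.uniform((4,), maxval=10, dtype=tf.int32)      # dummy labels

with tf.GradientTape() as tape:
    logits = model(x)
    loss = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(y, logits, from_logits=True))

# One backpropagation pass: gradients flow from the dense layer back into the conv kernel.
grads = tape.gradient(loss, model.trainable_variables)
print([g.shape for g in grads])
```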