It sounds to me like you're on the right track, but maybe I can help clarify.
Single output
Let's imagine a traditional neural network layer with $n$ input units and 1 output (let's also assume no bias). This layer has a vector of weights $w\in\mathbb{R}^n$ that can be learned using various methods (backprop, genetic algorithms, etc.), but we'll ignore the learning and just focus on the forward propagation.
The layer takes an input $x\in\mathbb{R}^n$ and maps it to an activation $a\in\mathbb{R}$ by computing the dot product of $x$ with $w$ and then applying a nonlinearity $\sigma$: $$ a = \sigma(x\cdot w) $$
Here, the elements of $w$ specify how much to weight the corresponding elements of $x$ to compute the overall activation of the output unit. You could even think of this like a "convolution" where the input signal ($x$) is the same length as the filter ($w$).
In a convolutional setting, there are more values in $x$ than in $w$; suppose now our input $x\in\mathbb{R}^m$ for $m>n$. We can compute the activation of the output unit in this setting by computing the dot product of $w$ with contiguous subsets of $x$: $$\begin{eqnarray*} a_1 &=& \sigma(x_{1:n} \cdot w) \\ a_2 &=& \sigma(x_{2:n+1} \cdot w) \\ a_3 &=& \sigma(x_{3:n+2} \cdot w) \\ \dots \\ a_{m-n+1} &=& \sigma(x_{m-n+1:m} \cdot w) \end{eqnarray*}$$
(Here I'm repeating the same annoying confusion between cross-correlation and convolution that many neural nets authors make; if we were to make these proper convolutions, we'd flip the elements of $w$. I'm also assuming a "valid" convolution which only retains computed elements where the input signal and the filter overlap completely, i.e., without any padding.)
You already put this in your question basically, but I'm trying to walk through the connection with vanilla neural network layers using the dot product to make a point. The main difference with vanilla network layers is that if the input vector is longer than the weight vector, a convolution turns the output of the network layer into a vector -- in convolution networks, it's vectors all the way down! This output vector is called a "feature map" for the output unit in this layer.
Multiple outputs
Ok, so let's imagine that we add a new output to our network layer, so that it has $n$ inputs and 2 outputs. There will be a vector $w^1\in\mathbb{R}^n$ for the first output, and a vector $w^2\in\mathbb{R}^n$ for the second output. (I'm using superscripts to denote layer outputs.)
For a vanilla layer, these are normally stacked together into a matrix $W = [w^1 w^2]$ where the individual weight vectors are the columns of the matrix. Then when computing the output of this layer, we compute $$\begin{eqnarray*} a^1 &=& \sigma(x \cdot w^1) \\ a^2 &=& \sigma(x \cdot w^2) \end{eqnarray*}$$ or in shorter matrix notation, $$a = [a^1 a^2] = \sigma(x \cdot W)$$ where the nonlinearity is applied elementwise.
In the convolutional case, the outputs of our layer are still associated with the same parameter vectors $w^1$ and $w^2$. Just like in the single-output case, the convolution layer generates vector-valued outputs for each layer output, so there's $a^1 = [a^1_1 a^1_2 \dots a^1_{m-n+1}]$ and $a^2 = [a^2_1 a^2_2 \dots a^2_{m-n+1}]$ (again assuming "valid" convolutions). These filter maps, one for each layer output, are commonly stacked together into a matrix $A = [a^1 a^2]$.
If you think of it, the input in the convolutional case could also be thought of as a matrix, containing just one column ("one input channel"). So we could write the transformation for this layer as $$A = \sigma(X * W)$$ where the "convolution" is actually a cross-correlation and happens only along the columns of $X$ and $W$.
These notation shortcuts are actually quite helpful, because now it's easy to see that to add another output to the layer, we just add another column of weights to $W$.
Hopefully that's helpful!
Receptive field refers to the pixels in the input image which contribute to a feature in any layer of a network.
Layer 1: Each point in the feature map comes from 3x3 pixels from input image, so RF is 3
Layer 3: Each point in the feature map comes from 3x3 patch of feature map
from layer 1 which in turn map to 3x3 pixels of input image, if you map it back you realize it maps to 5x5 patch of the input image.
Similarly for layer 5 you get a 7x7 patch.
Best Answer
Here is what Lasagne does, it should answer your two questions:
This RNN tutorial does the same thing (minus the gain):
So I assume it's correct (I hope so since this is the code I use).
In TensorFlow: