Edit: As @Toke Faurby correctly pointed out, the default implementation in tensorflow actually uses an element-wise dropout. What I described earlier applies to a specific variant of dropout in CNNs, called spatial dropout:
In a CNN, each neuron produces one feature map. Since spatial dropout works per-neuron, dropping a neuron means that the corresponding feature map is dropped, i.e. every position in that map gets the same value (usually 0). So each feature map is either fully dropped or not dropped at all.
Pooling usually operates separately on each feature map, so it should not make any difference if you apply dropout before or after pooling. At least this is the case for pooling operations like maxpooling or averaging.
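To make the distinction concrete, here is a minimal sketch (assuming TensorFlow/Keras; the tensor shapes are purely illustrative) contrasting element-wise dropout with spatial dropout on a small batch of feature maps:

```python
import tensorflow as tf

x = tf.ones((1, 4, 4, 3))  # batch of 1, 4x4 feature maps, 3 channels

elementwise = tf.keras.layers.Dropout(0.5)        # drops individual positions
spatial = tf.keras.layers.SpatialDropout2D(0.5)   # drops whole feature maps (channels)

# With element-wise dropout, zeros are scattered within each feature map;
# with spatial dropout, each channel is either entirely zero or entirely kept.
print(elementwise(x, training=True)[0, :, :, 0])
print(spatial(x, training=True)[0, :, :, :])
```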
Edit: However, if you actually use element-wise dropout (which seems to be the default in TensorFlow), it does make a difference whether you apply dropout before or after pooling, although neither order is necessarily wrong. Consider the average pooling operation: if you apply dropout before pooling, you effectively scale the resulting neuron activations by (1 - dropout_probability), but most neurons will be non-zero (in general). If you apply dropout after average pooling, you generally end up with a fraction of (1 - dropout_probability) non-zero "unscaled" neuron activations and a fraction of dropout_probability zero neurons. Both seem viable to me; neither is outright wrong.
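A small numeric sketch of the two orderings (again assuming Keras layers; note that Keras uses inverted dropout, which rescales the kept activations by 1 / (1 - dropout_probability) during training instead of scaling at test time):

```python
import tensorflow as tf

x = tf.ones((1, 4, 4, 1))
drop = tf.keras.layers.Dropout(0.5)
pool = tf.keras.layers.AveragePooling2D(pool_size=2)

# Dropout before pooling: each pooled output averages a window in which some
# positions were zeroed, so most outputs are non-zero.
before = pool(drop(x, training=True))

# Dropout after pooling: roughly a fraction dropout_probability of the pooled
# outputs are exactly zero, the rest keep their (rescaled) value.
after = drop(pool(x), training=True)

print(before[0, :, :, 0])
print(after[0, :, :, 0])
```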
Yes, that is correct: in that case the input is mapped to the output via a single weight matrix (10 x 10) and a bias vector (10 x 1).
If you choose a sigmoid activation function, then the network you are describing is equivalent to logistic regression.
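As a sketch of what that looks like (assuming Keras; the 10-dimensional input and output sizes are taken from the description above):

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(10,))
# One (10 x 10) weight matrix, one bias of length 10, sigmoid activation:
# y = sigmoid(W x + b), i.e. ten logistic-regression units sharing the input.
outputs = tf.keras.layers.Dense(10, activation="sigmoid")(inputs)
model = tf.keras.Model(inputs, outputs)

model.summary()  # 10*10 weights + 10 biases = 110 parameters
```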
Best Answer
In the original paper that proposed dropout layers, by Hinton et al. (2012), dropout (with p=0.5) was used on each of the fully connected (dense) layers before the output; it was not used on the convolutional layers. This became the most commonly used configuration.
More recent research has shown some value in applying dropout also to convolutional layers, although at much lower levels: p=0.1 or 0.2. Dropout was used after the activation function of each convolutional layer: CONV->RELU->DROP.
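A minimal sketch of that placement (assuming Keras; the filter counts, layer sizes, and input shape are illustrative and not taken from the cited papers):

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, 3, activation="relu"),   # CONV -> RELU
    layers.Dropout(0.1),                       # -> DROP, low rate for conv layers
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.Dropout(0.2),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                       # classic p=0.5 on the dense layer
    layers.Dense(10, activation="softmax"),
])
model.summary()
```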