Edit: As @Toke Faurby correctly pointed out, the default dropout implementation in TensorFlow actually works element-wise. What I described earlier applies to a specific variant of dropout in CNNs, called spatial dropout:
In a CNN, each neuron produces one feature map. Since spatial dropout works per-neuron, dropping a neuron means that the corresponding feature map is dropped - i.e. every position takes the same value (usually 0). So each feature map is either fully dropped or not dropped at all.
Pooling usually operates separately on each feature map, so it should make no difference whether you apply spatial dropout before or after pooling. At least this is the case for pooling operations like max pooling or average pooling.
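This can be checked directly with a small NumPy sketch (the function names and the fixed drop mask below are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def spatial_dropout(x, mask):
    """Zero out whole feature maps. x: (channels, h, w); mask: (channels,) of 0/1."""
    return x * mask[:, None, None]

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling, applied per feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

x = rng.standard_normal((4, 8, 8))      # 4 feature maps of size 8x8
mask = np.array([1.0, 0.0, 1.0, 0.0])   # drop feature maps 1 and 3 entirely

# Because pooling acts on each feature map independently, dropping a whole
# map commutes with pooling: the two orders give identical results.
pool_then_drop = spatial_dropout(max_pool_2x2(x), mask)
drop_then_pool = max_pool_2x2(spatial_dropout(x, mask))
assert np.allclose(pool_then_drop, drop_then_pool)
```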
Edit: However, if you actually use element-wise dropout (which seems to be the default in TensorFlow), it does make a difference whether you apply dropout before or after pooling. Still, there is not necessarily a wrong way of doing it. Consider the average pooling operation: if you apply dropout before pooling, you effectively scale the resulting neuron activations by `1.0 - dropout_probability`, but most neurons will be non-zero (in general). If you apply dropout after average pooling, you generally end up with a fraction of `(1.0 - dropout_probability)` non-zero "unscaled" neuron activations and a fraction of `dropout_probability` zero neurons. Both seem viable to me; neither is outright wrong.
I don't know a lot about statistical genomics but I can give you a few suggestions.
Be wary of spurious correlations; they are a very common problem in statistical genomics. I suggest you keep a set of data separate from the rest and never use it to train or validate your architectures until you have selected the very final one. In other words, build different networks (different numbers of layers, different numbers of hidden units, etc.) without using the "reserved data", and choose the one with the smallest $k$-fold cross-validation error. Then, once you have fixed all the hyperparameters of your neural network, test it on the separate data set. At the cost of some accuracy (your training set will be smaller) you gain some protection against the risk of mistaking noise for signal. Since the number of alternatives can be prohibitive, you can use automated machine learning frameworks which help you explore the space of possible networks, such as auto-sklearn and TPOT.
Especially if you use these automated tools you should not let them see the separate data set. You're basically using a black-box to define your architecture, and you may want some kind of insurance against overfitting.
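The protocol above can be sketched in a few lines of NumPy (the toy data, sizes, and helper name are mine; model fitting is omitted):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for a genomics data set (samples x features).
X = rng.standard_normal((100, 20))
y = rng.integers(0, 2, size=100)

# 1) Reserve a test set that is never touched during model selection.
n_test = 20
perm = rng.permutation(len(X))
test_idx, dev_idx = perm[:n_test], perm[n_test:]
X_dev, y_dev = X[dev_idx], y[dev_idx]

# 2) Compare candidate architectures via k-fold CV on the development data only.
def kfold_indices(n, k):
    folds = np.array_split(np.arange(n), k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

for train, val in kfold_indices(len(X_dev), k=5):
    pass  # fit a candidate model on X_dev[train], score it on X_dev[val]

# 3) Only after all hyperparameters are fixed: evaluate once on X[test_idx].
```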
Also, more often than not, when deep learning achieves the same accuracy as linear regression on a large training set, it's a sign that you're doing something wrong. Read the What's going on? section here, and see also here for some common errors. Unfortunately, most of the material is geared towards classification - that's where deep learning is used the most today.
Finally, in case the final goal of this genomic study is precision medicine, you may want to have a look at the Deep Review - it's a work in progress, but it may contain useful material for you.
Best Answer
You mostly just have to figure this out through trial and error (and metrics on your validation data). But in, say, a classification CNN, the majority of the parameters are typically concentrated in the last few layers, so it makes sense to use more dropout there: there's more need for regularization where there are more parameters.
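A minimal NumPy sketch of that idea, with a hypothetical per-layer dropout schedule (the rates and helper name are illustrative, not a recommendation; inverted dropout is used so no rescaling is needed at test time):

```python
import numpy as np

rng = np.random.default_rng(0)

def inverted_dropout(x, rate, rng):
    # Inverted dropout: zero a fraction `rate` of activations and scale the
    # survivors by 1 / (1 - rate), so the expected activation is unchanged.
    keep = (rng.random(x.shape) >= rate).astype(x.dtype)
    return x * keep / (1.0 - rate)

# Hypothetical schedule: light dropout in early layers, heavier dropout in
# the later, parameter-dense layers (e.g. the fully connected head of a CNN).
rates = [0.1, 0.25, 0.5]

x = rng.standard_normal((32, 64))  # a batch of activations
for rate in rates:
    x = np.maximum(x, 0.0)             # ReLU stand-in for a layer's output
    x = inverted_dropout(x, rate, rng) # stronger regularization per layer
```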