From the Stanford CS231n notes on neural networks:
Real-world example. The Krizhevsky et al. architecture that won the ImageNet challenge in 2012 accepted images of size [227x227x3]. On the first Convolutional Layer, it used neurons with receptive field size F=11, stride S=4 and no zero padding P=0. Since (227 - 11)/4 + 1 = 55, and since the Conv layer had a depth of K=96, the Conv layer output volume had size [55x55x96]. Each of the 55*55*96 neurons in this volume was connected to a region of size [11x11x3] in the input volume. Moreover, all 96 neurons in each depth column are connected to the same [11x11x3] region of the input, but of course with different weights. As a fun aside, if you read the actual paper it claims that the input images were 224x224, which is surely incorrect because (224 - 11)/4 + 1 is quite clearly not an integer. This has confused many people in the history of ConvNets and little is known about what happened. My own best guess is that Alex used zero-padding of 3 extra pixels that he does not mention in the paper.
ref: http://cs231n.github.io/convolutional-networks/
These notes accompany the Stanford CS class CS231n: Convolutional Neural Networks for Visual Recognition.
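To make the arithmetic in the quoted example easy to check, here is a minimal Python sketch of the standard output-size formula $(W - F + 2P)/S + 1$ (the helper name `conv_output_size` is mine, not from the notes):

```python
def conv_output_size(W, F, S, P):
    # Spatial output size of a conv layer: input width W, receptive field F,
    # stride S, zero-padding P (per side).
    return (W - F + 2 * P) / S + 1

print(conv_output_size(227, 11, 4, 0))  # 55.0 -> the [55x55x96] output volume
print(conv_output_size(224, 11, 4, 0))  # 54.25 -> not an integer, hence the confusion
```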
I think the initialization standard deviation should be roughly $\sqrt{\frac{1.55}{n_{in}}}$
The He et al. 2015 formula was derived for ReLU units. The key idea is that the variance of $f(y)$, with $y = Wx + b$, should be roughly equal to the variance of $y$. Let's first go over the case of a ReLU activation, and see if we can amend it for ELU units.
In the paper they show that:
$$
Var[y_l] = n_l Var[w_l] \mathbb{E}[x^2_l]
$$
They express the last expectation $\mathbb{E}[x^2_l]$ in terms of $Var[y_{l-1}]$. For ReLUs we have $\mathbb{E}[x^2_l] = \frac{1}{2} Var[y_{l-1}]$: since $x_l = \max(0, y_{l-1})$ and $y_{l-1}$ is symmetric around zero, the ReLU zeroes out half the distribution on average and so keeps half of the second moment. Thus we can write
$$
Var[y_l] = n_l Var[w_l] \frac{1}{2} Var[y_{l-1}]
$$
We apply this to all layers, taking the product over $l$, all the way to the first layer. This gives:
$$
Var[y_L] = Var[y_1] \prod_{l=2}^L \frac{1}{2} n_l Var[w_l]
$$
Now this is stable only when $\frac{1}{2} n_l Var[w_l]$ is close to $1$. So they set it to $1$ and find $Var[w_l] = \frac{2}{n_l}$, i.e. weights drawn with standard deviation $\sqrt{\frac{2}{n_l}}$.
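As a quick sanity check, here is a minimal simulation (a sketch, not from the paper; layer width and depth are arbitrary) showing that $Var[w_l] = \frac{2}{n_l}$ keeps activations from exploding or vanishing through many ReLU layers:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512                                   # fan-in n_l of every layer (arbitrary)
x = rng.standard_normal((n, 2000))        # unit-variance inputs

for _ in range(50):
    W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # He et al. 2015 init
    x = np.maximum(0.0, W @ x)                          # ReLU
print(np.mean(x**2))  # stays O(1); with std 1/sqrt(n) it would shrink toward 0
```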
Now for ELU units, the only thing we have to change is the expression of $\mathbb{E}[x^2_l]$ in terms of $Var[y_{l-1}]$. Sadly, this is not as straightforward for ELU units as for ReLU units, since it involves calculating $\mathbb{E}[(e^{\mathcal{N}} - 1)^2]$ over only the negative values of $\mathcal{N}$. This is not a pretty formula, and I don't even know if there is a good closed-form solution, so let's sample to get an approximation. We want $Var[y_l]$ to be roughly equal to $1$ (most inputs have variance $1$, batch norm makes layer outputs variance $1$, etc.). Thus we can sample from a standard normal distribution, apply the ELU function with $\alpha = 1$, square, and take the mean. This gives $\approx 0.645$; its inverse is $\approx 1.55$.
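Here is that sampling experiment as a short NumPy sketch; it also computes the ReLU constant for comparison:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000_000)       # y ~ N(0, 1)

elu = np.where(z > 0, z, np.expm1(z))     # ELU with alpha = 1
print(np.mean(elu**2))                    # ~0.645
print(1.0 / np.mean(elu**2))              # ~1.55
print(np.mean(np.maximum(z, 0.0)**2))     # ~0.5, the ReLU constant from above
```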
Thus, following the same logic, we can set $Var[w_l] = \frac{1.55}{n_l}$ (i.e. initialize the weights with standard deviation $\sqrt{\frac{1.55}{n_l}}$) to get a variance that does not grow or shrink in magnitude.
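Repeating the earlier sanity check with ELU activations and the proposed $\frac{1.55}{n_l}$ variance (again just a sketch with arbitrary width and depth):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512
x = rng.standard_normal((n, 2000))

for _ in range(50):
    W = rng.standard_normal((n, n)) * np.sqrt(1.55 / n)     # proposed ELU init
    y = W @ x
    x = np.where(y > 0, y, np.expm1(np.minimum(y, 0.0)))    # ELU, alpha = 1
print(np.var(W @ x))   # Var[y_l] stays ~1 instead of drifting off
```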
I reckon that would be the optimal value for the ELU function. The constant $0.645$ sits between the ReLU value ($\frac{1}{2}$, which is lower because the values that ReLU maps to $0$ are mapped to negative values by ELU) and the value for the identity function (which is just $1$).
Take care: if $Var[y_{l-1}]$ differs from $1$, the optimal constant is also different. As this variance tends to $0$, ELU behaves more and more like the identity function, so the constant tends to $1$. As the variance becomes very large, the constant tends towards the original ReLU value of $0.5$, because the negative part of ELU is bounded by $\alpha$ and contributes less and less.
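The dependence on the input scale is easy to see numerically (a sketch; the clamp inside `expm1` just avoids overflow warnings at large $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)

for sigma in [0.01, 0.1, 1.0, 10.0, 100.0]:
    y = sigma * z
    elu = np.where(y > 0, y, np.expm1(np.minimum(y, 0.0)))
    print(sigma, np.mean(elu**2) / sigma**2)   # ~1 for small sigma, ~0.5 for large
```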
Edit: I did the theoretical analysis of the second moment of $\text{ELU}(x)$ when $x$ is normally distributed with standard deviation $\sigma$. It involves some moments of the log-normal distribution and not-so-pretty integrals. The eventual answer for $\mathbb{E}[\text{ELU}(x)^2]$ is $\frac{1}{2}\sigma^2$ (the contribution of the linear part) plus
$$
b - 2a + \frac{1}{2}
$$
where (for $\sigma > 0$)
$$
a = \frac{1}{2} e^{\frac{\sigma^2}{2}} \text{erfc}\left(\frac{\sigma}{\sqrt{2}}\right), \qquad
b = \frac{1}{2} e^{2\sigma^2} \text{erfc}\left(\sqrt{2}\,\sigma\right)
$$
Unfortunately this is not easily solvable for $\sigma$. You can, however, plug in $\sigma = 1$ and recover the estimate $\approx 0.645$ I gave above, which is pretty cool.
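For completeness, a short script checking this closed form against the Monte Carlo estimate (using SciPy's `erfc`):

```python
import numpy as np
from scipy.special import erfc

def elu_second_moment(sigma):
    # E[ELU(x)^2] for x ~ N(0, sigma^2), alpha = 1: linear part + negative part
    a = 0.5 * np.exp(sigma**2 / 2) * erfc(sigma / np.sqrt(2))
    b = 0.5 * np.exp(2 * sigma**2) * erfc(np.sqrt(2) * sigma)
    return 0.5 * sigma**2 + (b - 2 * a + 0.5)

print(elu_second_moment(1.0))   # ~0.645, matching the sampled estimate
```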
Best Answer
In this case the number of weights per neuron should be $5 \cdot 5 \cdot 3 = 75$, one for each input connection.
I found Glorot initialization especially useful for convolutional layers. Often a uniform distribution over the interval $\left[-\frac{c}{\sqrt{n_{in}+n_{out}}}, \frac{c}{\sqrt{n_{in}+n_{out}}}\right]$ works as well; Glorot's uniform variant uses $c = \sqrt{6}$.
It is implemented as an option in almost all neural network libraries; Keras, for example, provides it in its initializers module as `glorot_uniform` and `glorot_normal`.
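A minimal NumPy sketch of the Glorot uniform rule (the function name is mine; Keras's `glorot_uniform` implements the same limit):

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    # U[-limit, limit] with limit = sqrt(6 / (n_in + n_out)),
    # so that Var[w] = limit^2 / 3 = 2 / (n_in + n_out)
    if rng is None:
        rng = np.random.default_rng(0)
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = glorot_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))   # both ~0.0052
```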