What's the recommended weight initialization strategy when using the ELU activation function?

deep-learning, neural-networks, weights

For deep neural networks using ReLU neurons, the recommended connection weight initialization strategy is to draw each weight uniformly at random from $[-r, +r]$ with:

$r = \sqrt{\dfrac{12}{\text{fan-in} + \text{fan-out}}}$

where fan-in and fan-out are the number of connections going into and out of the layer being initialized. This is called "He initialization" (paper).
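
For concreteness, here is a minimal NumPy sketch of that uniform variant (the layer sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_uniform(fan_in, fan_out):
    """Draw a (fan_in, fan_out) weight matrix uniformly from [-r, +r]
    with r = sqrt(12 / (fan_in + fan_out))."""
    r = np.sqrt(12.0 / (fan_in + fan_out))
    return rng.uniform(-r, r, size=(fan_in, fan_out))

W = he_uniform(256, 128)
# Var of U(-r, r) is r^2 / 3 = 4 / (fan_in + fan_out), so the std here should be ~0.102
print(W.std(), np.sqrt(4.0 / (256 + 128)))
```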

My question is: what's the recommended weight initialization strategy when using ELU neurons (paper)?

Since ELUs look a lot like ReLUs, I'm tempted to use the same logic, but I'm not sure it's the optimal strategy.

Note

There is a fairly similar question but this one is more specifically about the ELU activation function (which is not covered by the answers to the other question).

Best Answer

I think the weights should be initialized with a standard deviation of roughly $\sqrt{\frac{1.55}{n_{in}}}$

The He et al. 2015 formula was derived for ReLU units. The key idea is that the variance of $f(y)$, with $y = Wx + b$, should be roughly equal to the variance of $y$, so that activations neither blow up nor die out as depth grows. Let's first go over the ReLU case and then see if we can amend the argument for ELU units.

In the paper they show that: $$ Var[y_l] = n_l Var[w_l] \mathbb{E}[x^2_l] $$ They then express the last expectation $\mathbb{E}[x^2_l]$ in terms of $Var[y_{l-1}]$. For ReLUs we have $\mathbb{E}[x^2_l] = \frac{1}{2} Var[y_{l-1}]$, simply because ReLUs set half the values of $x$ to $0$ on average. Thus we can write
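
As a quick sanity check of that step, one can sample (a small sketch, assuming $y \sim \mathcal{N}(0, 1)$):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.standard_normal(1_000_000)   # y ~ N(0, 1), so Var[y] = 1
x = np.maximum(y, 0.0)               # ReLU
print(np.mean(x**2))                 # ~0.5, i.e. (1/2) * Var[y]
```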

$$ Var[y_l] = n_l Var[w_l] \frac{1}{2} Var[y_{l-1}] $$ We apply this to all layers, taking the product over $l$ all the way back to the first layer. This gives: $$ Var[y_L] = Var[y_1] \prod_{l=2}^{L} \frac{1}{2} n_l Var[w_l] $$ Now this is stable only when $\frac{1}{2} n_l Var[w_l]$ is close to $1$. So they set it to $1$ and find $Var[w_l] = \frac{2}{n_l}$.
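
Here is a small simulation of the stability claim: with $Var[w_l] = \frac{2}{n_l}$ the post-ReLU second moment stays roughly constant through a deep stack of layers (width, depth, and batch size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50
x = rng.standard_normal((n, 1024))                      # unit-variance input batch
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # Var[w] = 2 / n  (He init)
    x = np.maximum(W @ x, 0.0)                          # ReLU
print(np.mean(x**2))                                    # stays around 1 instead of exploding or vanishing
```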

Now for ELU units, the only thing we have to change is the expression of $\mathbb{E}[x^2_l]$ in terms of $Var[y_{l-1}]$. Sadly, this is not as straightforward for ELU units as for ReLU units, because it involves computing $\mathbb{E}[(e^{\mathcal{N}} - 1)^2]$ over only the negative values of $\mathcal{N}$. This is not a pretty formula, and I don't even know if there is a nice closed-form solution, so let's sample to get an approximation. We want $Var[y_l]$ to be roughly equal to $1$ (most inputs have variance $1$, batch norm makes layer outputs have variance $1$, etc.). Thus we can sample from a standard normal distribution, apply the ELU function with $\alpha = 1$, square, and take the mean. This gives $\approx 0.645$. The inverse of this is $\approx 1.55$.
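
The sampling step is straightforward to reproduce (a sketch assuming $\alpha = 1$ and a standard normal input):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(10_000_000)
elu = np.where(z > 0, z, np.exp(z) - 1.0)   # ELU with alpha = 1
m2 = np.mean(elu**2)
print(m2, 1.0 / m2)                         # ~0.645 and ~1.55
```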

Thus, following the same logic, we can set $Var[w_l] = \frac{1.55}{n_l}$, i.e. initialize the weights with a standard deviation of $\sqrt{\frac{1.55}{n_l}}$, to get a variance that doesn't increase in magnitude.
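
A sketch of the resulting recipe, analogous to the ReLU simulation above (the widths are again arbitrary; weights are drawn with standard deviation $\sqrt{1.55 / n}$):

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(y):
    return np.where(y > 0, y, np.exp(y) - 1.0)  # alpha = 1

n, depth = 512, 50
x = rng.standard_normal((n, 1024))                        # unit-variance input batch
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(1.55 / n), size=(n, n))   # Var[w] = 1.55 / n
    y = W @ x
    x = elu(y)
print(np.var(y))   # the pre-activation variance settles around 1
```

For comparison, re-running the same loop with np.sqrt(2.0 / n) makes the pre-activation variance slowly drift upward with depth, since $\mathbb{E}[\text{ELU}(y)^2]$ is always larger than $\frac{1}{2} Var[y]$.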

I reckon that would be the optimal value for the ELU function. It sits between the constant for the ReLU function ($\frac{1}{2}$, which is lower than $0.645$ because the values that ReLU maps to $0$ now get mapped to some negative value) and what you would have for any function with mean $0$ (which is just $1$).

Take care that if $Var[y_{l-1}]$ is different, the optimal constant is also different. When this variance tends to $0$, the ELU behaves more and more like the identity, so the constant tends to $1$. When the variance becomes very large, the constant tends towards the original ReLU value, i.e. $0.5$.
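
This dependence is easy to see by sweeping the input standard deviation and estimating $\mathbb{E}[\text{ELU}(y)^2] / Var[y]$ by sampling (a small sketch; the grid of $\sigma$ values is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(2_000_000)
for sigma in (0.1, 0.5, 1.0, 2.0, 5.0, 20.0):
    y = sigma * z                                # y ~ N(0, sigma^2)
    elu = np.where(y > 0, y, np.exp(y) - 1.0)
    print(sigma, np.mean(elu**2) / sigma**2)     # tends to 1 as sigma -> 0 and to 0.5 as sigma grows
```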

Edit: I did the theoretical analysis of the variance of $\text{ELU}(x)$ when $x$ is normally distributed. It involves some derivations around the log-normal distribution and some not-so-pretty integrals. The eventual answer for the variance is $0.5 \sigma$ (the contribution of the linear part) plus $$ a - 2b^2 + (2b - 1)^2 $$ where $$ a = \frac{1}{2} e^{\frac{\sigma^2}{2}} \left(\operatorname{erfc}\left(\frac{\sigma}{\sqrt{2}}\right) + \sqrt{\frac{1}{\sigma^2}}\, \sigma - 1\right)\\ b = \frac{1}{2} e^{2\sigma^2} \left(\operatorname{erfc}\left(\sqrt{2}\, \sigma\right) + \sqrt{\frac{1}{\sigma^2}}\, \sigma - 1\right)\\ $$ Unfortunately this is not easily solvable for $\sigma$, but you can fill in a value for $\sigma$ and recover the estimate I gave above, which is pretty cool.
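
Plugging in $\sigma = 1$ is easy to do with the standard-library math.erfc; a small sketch that evaluates the $a$/$b$ expression above and lands close to the sampled constant:

```python
import math

sigma = 1.0
a = 0.5 * math.exp(sigma**2 / 2) * (math.erfc(sigma / math.sqrt(2)) + math.sqrt(1 / sigma**2) * sigma - 1)
b = 0.5 * math.exp(2 * sigma**2) * (math.erfc(math.sqrt(2) * sigma) + math.sqrt(1 / sigma**2) * sigma - 1)
print(a - 2 * b**2 + (2 * b - 1) ** 2)   # ~0.646, close to the sampled ~0.645
```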
