Solved – Softmax weights initialization

deep learning, machine learning, neural networks, sigmoid-curve, softmax

I am new to deep learning and neural networks, and I need to know whether there is a good weight initialization method to use when the activation function is Softmax, as there is for Tanh, ReLU and Sigmoid. Related answer.

Best Answer

For:

  • ReLU and variants like PReLU, RReLU and ELU: use He initialization (uniform or normal)
  • SELU: use LeCun initialization (normal) (see this paper)
  • Default (including Sigmoid, Tanh, Softmax, or no activation): use Xavier initialization (uniform or normal), also called Glorot initialization. This is the default in Keras and most other deep learning libraries.
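
As an illustration of these pairings in Keras, here is a minimal sketch; the layer sizes and the 784-dimensional input are arbitrary assumptions, not part of the answer itself:

from tensorflow import keras
from tensorflow.keras.layers import Dense

model = keras.Sequential([
    Dense(256, activation="relu", kernel_initializer="he_normal", input_shape=(784,)),  # He init for ReLU
    Dense(128, activation="selu", kernel_initializer="lecun_normal"),                   # LeCun init for SELU
    Dense(10, activation="softmax", kernel_initializer="glorot_uniform"),               # Glorot init (Keras default)
])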

When initializing the weights with a normal distribution, all of these methods use mean 0 and variance σ² = scale/fan_avg or σ² = scale/fan_in, where fan_in is the layer's number of inputs, fan_out is the layer's number of outputs (= number of neurons), and fan_avg is the average of the two, ½(fan_in + fan_out). Specifically:

  • Xavier: σ²=1/fan_avg
  • He: σ²=2/fan_in
  • LeCun: σ²=1/fan_in
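
As a quick numeric sketch of these formulas, using an arbitrary example layer with 784 inputs and 256 neurons:

fan_in, fan_out = 784, 256          # example layer: 784 inputs, 256 neurons
fan_avg = (fan_in + fan_out) / 2    # 520.0

xavier_var = 1 / fan_avg            # ≈ 0.0019
he_var     = 2 / fan_in             # ≈ 0.0026
lecun_var  = 1 / fan_in             # ≈ 0.0013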

When initializing the weights with a uniform distribution, all these methods simply use the range [-limit, limit] where limit = sqrt(3 · σ²), since a uniform distribution over [-limit, limit] has variance limit²/3.
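
Continuing the same sketch with the same arbitrary fan values, the corresponding uniform limits work out to:

import math

fan_in, fan_out = 784, 256
fan_avg = (fan_in + fan_out) / 2

xavier_limit = math.sqrt(3 * (1 / fan_avg))   # = sqrt(6 / (fan_in + fan_out)) ≈ 0.076
he_limit     = math.sqrt(3 * (2 / fan_in))    # = sqrt(6 / fan_in) ≈ 0.087
lecun_limit  = math.sqrt(3 * (1 / fan_in))    # = sqrt(3 / fan_in) ≈ 0.062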

If you have consecutive ReLU layers with very different sizes, you may prefer using fan_avg rather than fan_in. In Keras, you can use something like this:

from tensorflow import keras
from tensorflow.keras.layers import Dense

init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='normal')
layer = Dense(10, activation="relu", kernel_initializer=init)
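
Note that scale=2. with mode='fan_avg' gives σ² = 2/fan_avg, i.e. the He variance computed over the average fan instead of fan_in; the built-in 'he_normal' initializer is the fan_in version of the same idea.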