Solved – Softmax weights initialization

deep learning, machine learning, neural networks, sigmoid-curve, softmax

I am new to deep learning and neural networks, and I need to know whether there is a good weight initialization method to use when the activation function is Softmax, as there is for Tanh, ReLU and Sigmoid. Related answer.

Best Answer

For:

  • ReLU and variants like PReLU, RReLU and ELU: use He initialization (uniform or normal)
  • SELU: use LeCun initialization (normal) (see this paper)
  • Default (including Sigmoid, Tanh, Softmax, or no activation): use Xavier initialization (uniform or normal), also called Glorot initialization. This is the default in Keras and most other deep learning libraries.
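
As an illustration of these pairings in Keras, here is a minimal sketch; the layer sizes and the 784-dimensional input are arbitrary assumptions, not part of the answer itself:

from tensorflow import keras
from tensorflow.keras.layers import Dense

model = keras.Sequential([
    Dense(256, activation="relu", kernel_initializer="he_normal", input_shape=(784,)),  # He init for ReLU
    Dense(128, activation="selu", kernel_initializer="lecun_normal"),                   # LeCun init for SELU
    Dense(10, activation="softmax", kernel_initializer="glorot_uniform"),               # Glorot init (Keras default)
])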

When initializing the weights with a normal distribution, all of these methods use mean 0 and variance σ² = scale/fan_avg or σ² = scale/fan_in, where fan_in is the layer's number of inputs, fan_out is the layer's number of outputs (= number of neurons), and fan_avg is the average of the two, ½(fan_in + fan_out). Specifically:

  • Xavier: σ²=1/fan_avg
  • He: σ²=2/fan_in
  • LeCun: σ²=1/fan_in
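
As a quick numeric sketch of these formulas, using an arbitrary example layer with 784 inputs and 256 neurons:

fan_in, fan_out = 784, 256          # example layer: 784 inputs, 256 neurons
fan_avg = (fan_in + fan_out) / 2    # 520.0

xavier_var = 1 / fan_avg            # ≈ 0.0019
he_var     = 2 / fan_in             # ≈ 0.0026
lecun_var  = 1 / fan_in             # ≈ 0.0013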

When initializing the weights with a uniform distribution, all these methods simply use the range [-limit, limit] where limit = sqrt(3 · σ²), since a uniform distribution over [-limit, limit] has variance limit²/3.
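
Continuing the same sketch with the same arbitrary fan values, the corresponding uniform limits work out to:

import math

fan_in, fan_out = 784, 256
fan_avg = (fan_in + fan_out) / 2

xavier_limit = math.sqrt(3 * (1 / fan_avg))   # = sqrt(6 / (fan_in + fan_out)) ≈ 0.076
he_limit     = math.sqrt(3 * (2 / fan_in))    # = sqrt(6 / fan_in) ≈ 0.087
lecun_limit  = math.sqrt(3 * (1 / fan_in))    # = sqrt(3 / fan_in) ≈ 0.062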

If you have consecutive ReLU layers with very different sizes, you may prefer using fan_avg rather than fan_in. In Keras, you can use something like this:

from tensorflow import keras
from tensorflow.keras.layers import Dense

init = keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='normal')
layer = Dense(10, activation="relu", kernel_initializer=init)
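
Note that scale=2. with mode='fan_avg' gives σ² = 2/fan_avg, i.e. the He variance computed over the average fan instead of fan_in; the built-in 'he_normal' initializer is the fan_in version of the same idea.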