Solved – ReLU vs Sigmoid vs Softmax as hidden layer neurons

conv-neural-network, machine-learning, neural-networks, sigmoid-curve, tensorflow

I was playing with a simple neural network with only one hidden layer, in TensorFlow, and I tried different activations for the hidden layer:

  • ReLU
  • Sigmoid
  • Softmax (well, softmax is usually used in the last layer...)

ReLU gives the best training accuracy and validation accuracy, and I am not sure how to explain this.

We know that ReLU has good qualities, such as sparsity and no vanishing gradient, but

Q: Is a ReLU neuron in general better than sigmoid/softmax neurons?
Should we almost always use ReLU neurons in NNs (or even CNNs)?

I thought a more complex neuron would give a better result, at least better training accuracy, if overfitting is the concern.

Thanks
PS: The code is basically from "Udacity Machine Learning, Assignment 2", which is recognition of notMNIST using a simple 1-hidden-layer NN.

batch_size = 128
graph = tf.Graph()
with graph.as_default():
  # Input data. 
  tf_train_dataset = tf.placeholder(tf.float32, shape=(batch_size, image_size * image_size))
  tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
  tf_valid_dataset = tf.constant(valid_dataset)
  tf_test_dataset = tf.constant(test_dataset)

  # hidden layer
  hidden_nodes = 1024
  hidden_weights = tf.Variable( tf.truncated_normal([image_size * image_size, hidden_nodes]) )
  hidden_biases = tf.Variable( tf.zeros([hidden_nodes]))
  hidden_layer = tf.nn.relu( tf.matmul( tf_train_dataset, hidden_weights) + hidden_biases)

  # Variables.
  weights = tf.Variable( tf.truncated_normal([hidden_nodes, num_labels])) 
  biases = tf.Variable(tf.zeros([num_labels]))

  # Training computation.
  logits = tf.matmul(hidden_layer, weights) + biases
  loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(labels=tf_train_labels, logits=logits) )

  # Optimizer.
  optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

  # Predictions for the training, validation, and test data.
  train_prediction = tf.nn.softmax(logits)
  valid_relu = tf.nn.relu( tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
  valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, weights) + biases) 

  test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
  test_prediction = tf.nn.softmax(tf.matmul(test_relu, weights) + biases)
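
For reference, the only part that changes between the three experiments is the hidden-layer activation. Here is a minimal sketch of that swap (the build_hidden_layer helper and its activation argument are just illustrative names, not part of the assignment code):

import tensorflow as tf

def build_hidden_layer(inputs, weights, biases, activation="relu"):
  # The pre-activation is the same in all three experiments;
  # only the non-linearity applied to it differs.
  pre_activation = tf.matmul(inputs, weights) + biases
  if activation == "relu":
    return tf.nn.relu(pre_activation)
  if activation == "sigmoid":
    return tf.nn.sigmoid(pre_activation)
  if activation == "softmax":
    # Unusual as a hidden-layer activation, but included for the comparison.
    return tf.nn.softmax(pre_activation)
  raise ValueError("unknown activation: %s" % activation)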

Best Answer

In addition to @Bhagyesh_Vikani:

  • ReLU behaves close to a linear unit.
  • ReLU is like a switch for linearity: if you don't need it, you "switch" it off; if you need it, you "switch" it on. Thus we get the benefits of linearity but reserve ourselves the option of not using it altogether.
  • The derivative is 1 when the unit is active, and the second derivative is 0 almost everywhere. It is a very simple function, which makes optimisation much easier.
  • The gradient is large whenever you want it to be and never saturates (see the sketch after this list).
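
To make the last two points concrete, here is a small NumPy sketch (illustrative, not from the original answer) comparing the gradients of ReLU and sigmoid: the sigmoid gradient is at most 0.25 and collapses for large |x|, while the ReLU gradient is exactly 1 wherever the unit is active.

import numpy as np

x = np.array([-10.0, -1.0, 0.5, 1.0, 10.0])

relu_grad = (x > 0).astype(float)         # 1 where the unit is "switched on", else 0
sigmoid = 1.0 / (1.0 + np.exp(-x))
sigmoid_grad = sigmoid * (1.0 - sigmoid)  # <= 0.25, and ~0 for large |x| (saturation)

print(relu_grad)     # [0. 0. 1. 1. 1.]
print(sigmoid_grad)  # approx. [4.5e-05 0.197 0.235 0.197 4.5e-05]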

There are also generalisations of rectified linear units. Rectified linear units and their generalisations are based on the principle that linear models are easier to optimise.
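
As one concrete example of such a generalisation, a leaky ReLU keeps a small slope for negative inputs instead of switching off completely. A minimal sketch (the 0.01 slope is just a commonly used default, not something specified here):

import tensorflow as tf

def leaky_relu(x, alpha=0.01):
  # Like ReLU, but keeps a small non-zero gradient when the unit is "off".
  return tf.maximum(alpha * x, x)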

Both sigmoid and softmax are discouraged for vanilla feedforward implementations (chapter 6 of Ian Goodfellow's Deep Learning). They are more useful for recurrent networks and probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions.

If you have a simple NN (as in the question), ReLU is your first preference.