Solved – Non-linearity before final Softmax layer in a convolutional neural network

conv-neural-network, deep-learning, neural-networks, nonlinear, softmax

I'm studying and trying to implement convolutional neural networks, but I suppose this question applies to multilayer perceptrons in general.

The output neurons in my network represent the activation of each class: the most active neuron corresponds to the predicted class for a given input. To use a cross-entropy cost for training, I am adding a softmax layer at the end of the network, so that each neuron's activation value can be interpreted as a probability.
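For concreteness, here is a minimal NumPy sketch of the setup I have in mind; the logits are made-up numbers, not the output of a real network:

```python
import numpy as np

def softmax(z):
    """Turn raw output activations (logits) into probabilities that sum to 1."""
    z = z - np.max(z)              # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(probs, target_class):
    """Negative log-probability assigned to the correct class."""
    return -np.log(probs[target_class])

logits = np.array([2.0, 0.5, -1.0])      # hypothetical last-layer activations
probs = softmax(logits)
print(probs, probs.sum())                # probabilities summing to 1
print(cross_entropy(probs, target_class=0))
```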

My question is: should the neurons in the output layer apply a non-linear function to their input? My intuition is that it is not necessary:

  • if the input to the $i$-th output neuron is the dot product $x^T\theta_i$ between a vector $x$ (coming from the previous layer) and the weights $\theta_i$ for that neuron,
  • and if I employ a monotonic non-linear function like the sigmoid or the ReLU
  • then the largest activation will still correspond to the largest $x^T\theta_i$, so from this point of view the non-linear function would not change the prediction (see the sketch after this list).
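As a quick sanity check of this reasoning, here is a minimal NumPy sketch (with arbitrary, made-up pre-activation values) showing that a strictly monotonic function such as the sigmoid leaves the argmax, and hence the predicted class, unchanged:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary pre-activations x^T theta_i, one per output neuron.
pre_activations = np.array([1.3, -0.4, 2.1, 0.0])

print(np.argmax(pre_activations))            # 2
print(np.argmax(sigmoid(pre_activations)))   # still 2: sigmoid is strictly increasing
```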

Is something wrong with this interpretation? Are there some training factors that I'm overlooking which make the output non-linearity necessary?

And if I'm right, would anything change if, instead of the sigmoid, I use the ReLU function $\max(0, x^T\theta_i)$, which is not strictly monotonic?
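To make the ReLU concern concrete: since $\max(0, \cdot)$ is only non-decreasing, it can map several different negative pre-activations to the same value, so the argmax is no longer guaranteed to be preserved. A small made-up example:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Made-up case where every pre-activation x^T theta_i is negative.
pre_activations = np.array([-0.2, -1.5, -0.7])

print(np.argmax(pre_activations))        # 0: the least negative score wins
print(relu(pre_activations))             # [0. 0. 0.]: all classes are tied
print(np.argmax(relu(pre_activations)))  # 0, but only because argmax breaks ties
                                         # by picking the first index
```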

EDIT

With reference to Karel's answer, which basically was "it depends", here is a more detailed description of my network and of my doubt:

Suppose I have $N$ hidden layers, and my output layer is just a softmax layer over a set of neurons representing classes (so my expected output is the probability that the input data belongs to each class). Assuming the first $N-1$ layers have non-linear neurons, what is the difference between using non-linear vs. linear neurons in the $N$-th layer, i.e. the layer whose activations are fed to the softmax?
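In code, the two variants I am asking about would look roughly like the following PyTorch sketch (the layer sizes and the single hidden block are arbitrary, just for illustration; my actual network is convolutional):

```python
import torch
import torch.nn as nn

in_dim, hidden, n_classes = 784, 128, 10      # arbitrary sizes for illustration

# Variant A: a non-linearity is applied to the activations fed to the softmax.
model_nonlinear = nn.Sequential(
    nn.Linear(in_dim, hidden), nn.ReLU(),     # the first N-1 non-linear layers
    nn.Linear(hidden, n_classes), nn.ReLU(),  # class scores passed through a ReLU
)

# Variant B: the activations fed to the softmax are a purely linear function
# of the previous layer.
model_linear = nn.Sequential(
    nn.Linear(in_dim, hidden), nn.ReLU(),     # the first N-1 non-linear layers
    nn.Linear(hidden, n_classes),             # raw class scores
)

# nn.CrossEntropyLoss applies the softmax (as log-softmax) internally,
# so both models output raw scores rather than probabilities.
criterion = nn.CrossEntropyLoss()
x = torch.randn(4, in_dim)                    # dummy batch
targets = torch.randint(0, n_classes, (4,))
print(criterion(model_nonlinear(x), targets).item())
print(criterion(model_linear(x), targets).item())
```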


Best Answer

You should not use a non-linearity for the last layer before the softmax classification. The ReLU non-linearity (now used almost exclusively) would in this case simply throw away information without adding any benefit. You can look at the Caffe implementation of the well-known AlexNet for a reference of what is done in practice.
