I think the short answer here is that it's not a good idea to use ReLU activations on the output layer in combination with a cross-entropy loss. Read on for details!
The cross-entropy is a "cost" function that measures the dissimilarity between two probability distributions. If your neural network's output does not satisfy the criteria for representing a probability distribution, the cross-entropy will behave erratically.
What are these criteria? Each category in your distribution must be represented by a probability value such that
- each probability value is between 0 and 1
- the sum of all probability values equals 1.
Most often, when using a cross-entropy loss in a neural-network context, the output layer of the network is activated using a softmax (or the logistic sigmoid, which is a special case of the softmax for just two classes) $$ s(\vec{z}) = \frac{\exp(\vec{z})}{\sum_i\exp(z_i)}, $$ which forces the output of the network to satisfy these two representation criteria. In particular, the softmax ensures that each output of the network lies in the open interval (0, 1), which in turn ensures that you don't hit undefined mathematical quantities such as $\log(0)$ or $\frac{1}{1-z}$ for $z=1$.
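To make this concrete, here is a minimal NumPy sketch (my own illustration, not taken from any particular framework) showing that the softmax output satisfies both criteria:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, -1.0, 0.5])    # arbitrary real-valued network outputs
p = softmax(logits)

print(p)          # every entry lies strictly in (0, 1)
print(p.sum())    # entries sum to 1, so p is a valid probability distribution
```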
Using a ReLU activation on the output layer together with a cross-entropy loss is therefore problematic: ReLU outputs are unbounded above and can be exactly zero, so in general they neither lie in (0, 1) nor sum to 1, whereas the cross-entropy requires its inputs to be interpretable as probabilities.
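Here is a small sketch, with made-up numbers, of what goes wrong if you feed ReLU outputs directly into the cross-entropy:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def cross_entropy(p_true, p_pred):
    # Standard cross-entropy; assumes p_pred is a valid probability distribution.
    return -np.sum(p_true * np.log(p_pred))

logits = np.array([2.0, -1.0, 0.5])
relu_out = relu(logits)               # [2.0, 0.0, 0.5] -- not in (0, 1), does not sum to 1
target = np.array([0.0, 1.0, 0.0])    # one-hot label for the second class

# log(0) is -inf, so the loss is infinite (NumPy also emits a divide-by-zero warning),
# and the gradient is equally ill-behaved.
print(cross_entropy(target, relu_out))
```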
Here is a paper that explains the issue; I'm quoting parts of it below to make things clearer.
> The rectifier activation function allows a network to easily obtain sparse representations. For example, after uniform initialization of the weights, around 50% of hidden units' continuous output values are real zeros, and this fraction can easily increase with sparsity-inducing regularization.
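You can check that roughly-50% figure with a quick experiment; the layer sizes and initialization below are arbitrary choices of mine, just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One hidden layer with zero-centred uniform weights and centred inputs.
W = rng.uniform(-0.1, 0.1, size=(256, 512))
x = rng.standard_normal(256)

pre_activation = x @ W
hidden = np.maximum(0.0, pre_activation)   # ReLU

# With zero-centred weights and inputs, roughly half the pre-activations are negative,
# so roughly half the ReLU outputs are exact zeros.
print(np.mean(hidden == 0.0))              # ~0.5
```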
So the rectifier activation function induces a sparsity effect in the network. Here are some advantages of sparsity, from the same paper:
> **Information disentangling.** One of the claimed objectives of deep learning algorithms (Bengio, 2009) is to disentangle the factors explaining the variations in the data. A dense representation is highly entangled because almost any change in the input modifies most of the entries in the representation vector. Instead, if a representation is both sparse and robust to small input changes, the set of non-zero features is almost always roughly conserved by small changes of the input.
>
> **Efficient variable-size representation.** Different inputs may contain different amounts of information and would be more conveniently represented using a variable-size data-structure, which is common in computer representations of information. Varying the number of active neurons allows a model to control the effective dimensionality of the representation for a given input and the required precision.
>
> **Linear separability.** Sparse representations are also more likely to be linearly separable, or more easily separable with less non-linear machinery, simply because the information is represented in a high-dimensional space. Besides, this can reflect the original data format. In text-related applications for instance, the original raw data is already very sparse.
>
> **Distributed but sparse.** Dense distributed representations are the richest representations, being potentially exponentially more efficient than purely local ones (Bengio, 2009). Sparse representations' efficiency is still exponentially greater, with the power of the exponent being the number of non-zero features. They may represent a good trade-off with respect to the above criteria.
It also answers the question you've asked:
> One may hypothesize that the hard saturation at 0 may hurt optimization by blocking gradient back-propagation. To evaluate the potential impact of this effect we also investigate the softplus activation: $\text{softplus}(x) = \log(1 + e^x)$ (Dugas et al., 2001), a smooth version of the rectifying non-linearity. We lose the exact sparsity, but may hope to gain easier training. However, experimental results tend to contradict that hypothesis, suggesting that hard zeros can actually help supervised training. We hypothesize that the hard non-linearities do not hurt so long as the gradient can propagate along some paths, i.e., that some of the hidden units in each layer are non-zero. With the credit and blame assigned to these ON units rather than distributed more evenly, we hypothesize that optimization is easier.
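For reference, here is a small sketch contrasting the two non-linearities mentioned in the quote; softplus is strictly positive everywhere, so it never produces the exact zeros that give ReLU representations their sparsity:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + e^x), a smooth approximation of the rectifier.
    return np.log1p(np.exp(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(relu(x))      # exact zeros for all negative inputs
print(softplus(x))  # ~[0.0067, 0.3133, 0.6931, 1.3133, 5.0067] -- strictly positive everywhere
```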
You can read the paper Deep Sparse Rectifier Neural Networks for more detail.