Let us assume that the activation function is the logistic (sigmoid) function, denoted $\sigma()$.
The idea behind cross-entropy (CE) is to optimise the weights $W = [w_1, w_2, \dots, w_j, \dots, w_k]$ so as to maximise the log probability, or equivalently to minimise the negative log probability.
Here, you want the derivative of the cost $C$ with respect to each of the weights in $W$, i.e. $\frac{\partial C}{\partial w_j}$, where the per-entry costs are $C = [C^1, C^2, \dots, C^n, \dots, C^m]$. After some math, which I'll skip here but which you can read more about in case you're interested here (slide 18 proves useful) and here:
This results in $\frac{\partial C}{\partial w_j} = \frac{1}{n} \sum_x x_j(\sigma(z) - y)$, where $n$ is the size of your training set.
Here, $z = WX + b$, where
$$X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1j} & \dots & x_{1k} \\ \vdots & & & & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nj} & \dots & x_{nk} \end{bmatrix}$$
($X$ is an $n$ by $k$ matrix), $x_{11} \dots x_{1k}$ are the features you would have per entry, $W$ are the weights as defined above, and $b$ is the bias.
In classification, you would like to use this linear dependency on $z$, but run it through a non-linear function such as the sigmoid, denoted here by $\sigma()$ (you can see a proof and read more about it here). $y$ represents the target output.
So $w_j$ is the $j$-th weight of the vector above, $x_j$ is the $j$-th feature of an entry, and $\sigma(z)$ is the sigmoid applied to the linear function $WX + b$.
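To make the gradient concrete, here is a minimal numeric sketch that checks the analytic expression $\frac{1}{n} \sum_x x_j(\sigma(z) - y)$ against a finite-difference approximation. The data ($X$, $W$, $b$, $y$) is made up; only the agreement between the two gradients matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5, 3                      # n entries, k features (arbitrary sizes)
X = rng.normal(size=(n, k))      # n-by-k feature matrix, as in the answer
W = rng.normal(size=k)           # weight vector
b = 0.1                          # bias
y = rng.integers(0, 2, size=n)   # binary targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(W):
    p = sigmoid(X @ W + b)
    # Cross-entropy averaged over the n entries
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Analytic gradient from the answer: (1/n) * sum over entries of x_j (sigma(z) - y)
grad_analytic = X.T @ (sigmoid(X @ W + b) - y) / n

# Central finite-difference estimate of dC/dw_j
eps = 1e-6
grad_numeric = np.array([
    (cost(W + eps * e) - cost(W - eps * e)) / (2 * eps)
    for e in np.eye(k)
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```

The two agree, which is the "simple derivative" property that makes CE pair so nicely with the sigmoid.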
Hope that makes sense.
- Regression: linear (because values are unbounded)
- Classification: softmax (simple sigmoid works too but softmax works better)
Use a plain sigmoid only if your output admits multiple "true" answers at once, for instance a network that checks for the presence of various objects in an image. In other words, the outputs do not form a probability distribution (they need not sum to 1).
Best Answer
Typically, the softmax function is used for multi-class problems and a single logistic function for binary classification. The reason is that the output nonlinearity and the loss "match": the derivative becomes very simple, a property of generalized linear models.
On a side note, tanh and the logistic sigmoid are related linearly: tanh is just the logistic scaled and translated from the $[0, 1]$ to the $[-1, 1]$ interval, via $\tanh(x) = 2\sigma(2x) - 1$.
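The scaled-and-translated relationship is easy to verify numerically; here is a one-liner check of the identity $\tanh(x) = 2\sigma(2x) - 1$ on a grid of points.

```python
import numpy as np

x = np.linspace(-5, 5, 101)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))

# tanh is the logistic with input scaled by 2, output scaled to [-1, 1]
match = np.allclose(np.tanh(x), 2 * sigma(2 * x) - 1)
print(match)  # True
```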