I'll start making a list here of the ones I've learned so far. As @marcodena said, pros and cons are more difficult because it's mostly just heuristics learned from trying these things, but I figure at least having a list of what they are can't hurt.
First, I'll define notation explicitly so there is no confusion:
Notation
This notation is from Neilsen's book.
A Feedforward Neural Network is a many layers of neurons connected together. It takes in an input, then that input "trickles" through the network and the neural network returns an output vector.
More formally, call $a^i_j$ the activation (aka output) of the $j^{th}$ neuron in the $i^{th}$ layer, where $a^1_j$ is the $j^{th}$ element in the input vector.
Then we can relate the next layer's input to it's previous via the following relation:
$$a^i_j = \sigma\bigg(\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j\bigg)$$
where
- $\sigma$ is the activation function,
- $w^i_{jk}$ is the weight from the $k^{th}$ neuron in the $(i-1)^{th}$ layer to the $j^{th}$ neuron in the $i^{th}$ layer,
- $b^i_j$ is the bias of the $j^{th}$ neuron in the $i^{th}$ layer, and
- $a^i_j$ represents the activation value of the $j^{th}$ neuron in the $i^{th}$ layer.
Sometimes we write $z^i_j$ to represent $\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j$, in other words, the activation value of a neuron before applying the activation function.
For more concise notation we can write
$$a^i = \sigma(w^i \times a^{i-1} + b^i)$$
To use this formula to compute the output of a feedforward network for some input $I \in \mathbb{R}^n$, set $a^1 = I$, then compute $a^2, a^3, \ldots, a^m$, where $m$ is the number of layers.
Activation Functions
(in the following, we will write $\exp(x)$ instead of $e^x$ for readability)
Identity
Also known as a linear activation function.
$$a^i_j = \sigma(z^i_j) = z^i_j$$
Step
$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < 0 \\ 1 & \text{if } z^i_j > 0 \end{cases}$$
Piecewise Linear
Choose some $x_{\min}$ and $x_{\max}$, which is our "range". Everything less than than this range will be 0, and everything greater than this range will be 1. Anything else is linearly-interpolated between. Formally:
$$a^i_j = \sigma(z^i_j) = \begin{cases} 0 & \text{if } z^i_j < x_{\min} \\ m z^i_j+b & \text{if } x_{\min} \leq z^i_j \leq x_{\max} \\ 1 & \text{if } z^i_j > x_{\max} \end{cases}$$
Where
$$m = \frac{1}{x_{\max}-x_{\min}}$$
and
$$b = -m x_{\min} = 1 - m x_{\max}$$
Sigmoid
$$a^i_j = \sigma(z^i_j) = \frac{1}{1+\exp(-z^i_j)}$$
Complementary log-log
$$a^i_j = \sigma(z^i_j) = 1 − \exp\!\big(−\exp(z^i_j)\big)$$
Bipolar
$$a^i_j = \sigma(z^i_j) = \begin{cases} -1 & \text{if } z^i_j < 0 \\ \ \ \ 1 & \text{if } z^i_j > 0 \end{cases}$$
Bipolar Sigmoid
$$a^i_j = \sigma(z^i_j) = \frac{1-\exp(-z^i_j)}{1+\exp(-z^i_j)}$$
Tanh
$$a^i_j = \sigma(z^i_j) = \tanh(z^i_j)$$
LeCun's Tanh
See Efficient Backprop.
$$a^i_j = \sigma(z^i_j) = 1.7159 \tanh\!\left( \frac{2}{3} z^i_j\right)$$
Scaled:
Hard Tanh
$$a^i_j = \sigma(z^i_j) = \max\!\big(-1, \min(1, z^i_j)\big)$$
Absolute
$$a^i_j = \sigma(z^i_j) = \mid z^i_j \mid$$
Rectifier
Also known as Rectified Linear Unit (ReLU), Max, or the Ramp Function.
$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)$$
Modifications of ReLU
These are some activation functions that I have been playing with that seem to have very good performance for MNIST for mysterious reasons.
$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)+\cos(z^i_j)$$
Scaled:
$$a^i_j = \sigma(z^i_j) = \max(0, z^i_j)+\sin(z^i_j)$$
Scaled:
Smooth Rectifier
Also known as Smooth Rectified Linear Unit, Smooth Max, or Soft plus
$$a^i_j = \sigma(z^i_j) = \log\!\big(1+\exp(z^i_j)\big)$$
Logit
$$a^i_j = \sigma(z^i_j) = \log\!\bigg(\frac{z^i_j}{(1 − z^i_j)}\bigg)$$
Scaled:
Probit
$$a^i_j = \sigma(z^i_j) = \sqrt{2}\,\text{erf}^{-1}(2z^i_j-1)$$.
Where $\text{erf}$ is the Error Function. It can't be described via elementary functions, but you can find ways of approximating it's inverse at that Wikipedia page and here.
Alternatively, it can be expressed as
$$a^i_j = \sigma(z^i_j) = \phi(z^i_j)$$.
Where $\phi $is the Cumulative distribution function (CDF). See here for means of approximating this.
Scaled:
Cosine
See Random Kitchen Sinks.
$$a^i_j = \sigma(z^i_j) = \cos(z^i_j)$$.
Softmax
Also known as the Normalized Exponential.
$$a^i_j = \frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}$$
This one is a little weird because the output of a single neuron is dependent on the other neurons in that layer. It also does get difficult to compute, as $z^i_j$ may be a very high value, in which case $\exp(z^i_j)$ will probably overflow. Likewise, if $z^i_j$ is a very low value, it will underflow and become $0$.
To combat this, we will instead compute $\log(a^i_j)$. This gives us:
$$\log(a^i_j) = \log\left(\frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}\right)$$
$$\log(a^i_j) = z^i_j - \log(\sum\limits_k \exp(z^i_k))$$
Here we need to use the log-sum-exp trick:
Let's say we are computing:
$$\log(e^2 + e^9 + e^{11} + e^{-7} + e^{-2} + e^5)$$
We will first sort our exponentials by magnitude for convenience:
$$\log(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7})$$
Then, since $e^{11}$ is our highest, we multiply by $\frac{e^{-11}}{e^{-11}}$:
$$\log(\frac{e^{-11}}{e^{-11}}(e^{11} + e^9 + e^5 + e^2 + e^{-2} + e^{-7}))$$
$$\log(\frac{1}{e^{-11}}(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}))$$
$$\log(e^{11}(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18}))$$
$$\log(e^{11}) + \log(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18})$$
$$ 11 + \log(e^{0} + e^{-2} + e^{-6} + e^{-9} + e^{-13} + e^{-18})$$
We can then compute the expression on the right and take the log of it. It's okay to do this because that sum is very small with respect to $\log(e^{11})$, so any underflow to 0 wouldn't have been significant enough to make a difference anyway. Overflow can't happen in the expression on the right because we are guaranteed that after multiplying by $e^{-11}$, all the powers will be $\leq 0$.
Formally, we call $m=\max(z^i_1, z^i_2, z^i_3, ...)$. Then:
$$\log\!(\sum\limits_k \exp(z^i_k)) = m + \log(\sum\limits_k \exp(z^i_k - m))$$
Our softmax function then becomes:
$$a^i_j = \exp(\log(a^i_j))=\exp\!\left( z^i_j - m - \log(\sum\limits_k \exp(z^i_k - m))\right)$$
Also as a sidenote, the derivative of the softmax function is:
$$\frac{d \sigma(z^i_j)}{d z^i_j}=\sigma^{\prime}(z^i_j)= \sigma(z^i_j)(1 - \sigma(z^i_j))$$
Maxout
This one is also a little tricky. Essentially the idea is that we break up each neuron in our maxout layer into lots of sub-neurons, each of which have their own weights and biases. Then the input to a neuron goes to each of it's sub-neurons instead, and each sub-neuron simply outputs their $z$'s (without applying any activation function). The $a^i_j$ of that neuron is then the max of all its sub-neuron's outputs.
Formally, in a single neuron, say we have $n$ sub-neurons. Then
$$a^i_j = \max\limits_{k \in [1,n]} s^i_{jk}$$
where
$$s^i_{jk} = a^{i-1} \bullet w^i_{jk} + b^i_{jk}$$
($\bullet$ is the dot product)
To help us think about this, consider the weight matrix $W^i$ for the $i^{\text{th}}$ layer of a neural network that is using, say, a sigmoid activation function. $W^i$ is a 2D matrix, where each column $W^i_j$ is a vector for neuron $j$ containing a weight for every neuron in the the previous layer $i-1$.
If we're going to have sub-neurons, we're going to need a 2D weight matrix for each neuron, since each sub-neuron will need a vector containing a weight for every neuron in the previous layer. This means that $W^i$ is now a 3D weight matrix, where each $W^i_j$ is the 2D weight matrix for a single neuron $j$. And then $W^i_{jk}$ is a vector for sub-neuron $k$ in neuron $j$ that contains a weight for every neuron in the previous layer $i-1$.
Likewise, in a neural network that is again using, say, a sigmoid activation function, $b^i$ is a vector with a bias $b^i_j$ for each neuron $j$ in layer $i$.
To do this with sub-neurons, we need a 2D bias matrix $b^i$ for each layer $i$, where $b^i_j$ is the vector with a bias for $b^i_{jk}$ each subneuron $k$ in the $j^{\text{th}}$ neuron.
Having a weight matrix $w^i_j$ and a bias vector $b^i_j$ for each neuron then makes the above expressions very clear, it's simply applying each sub-neuron's weights $w^i_{jk}$ to the outputs $a^{i-1}$ from layer $i-1$, then applying their biases $b^i_{jk}$ and taking the max of them.
Radial Basis Function Networks
Radial Basis Function Networks are a modification of Feedforward Neural Networks, where instead of using
$$a^i_j=\sigma\bigg(\sum\limits_k (w^i_{jk} \cdot a^{i-1}_k) + b^i_j\bigg)$$
we have one weight $w^i_{jk}$ per node $k$ in the previous layer (as normal), and also one mean vector $\mu^i_{jk}$ and one standard deviation vector $\sigma^i_{jk}$ for each node in the previous layer.
Then we call our activation function $\rho$ to avoid getting it confused with the standard deviation vectors $\sigma^i_{jk}$. Now to compute $a^i_j$ we first need to compute one $z^i_{jk}$ for each node in the previous layer. One option is to use Euclidean distance:
$$z^i_{jk}=\sqrt{\Vert(a^{i-1}-\mu^i_{jk}\Vert}=\sqrt{\sum\limits_\ell (a^{i-1}_\ell - \mu^i_{jk\ell})^2}$$
Where $\mu^i_{jk\ell}$ is the $\ell^\text{th}$ element of $\mu^i_{jk}$. This one does not use the $\sigma^i_{jk}$. Alternatively there is Mahalanobis distance, which supposedly performs better:
$$z^i_{jk}=\sqrt{(a^{i-1}-\mu^i_{jk})^T \Sigma^i_{jk} (a^{i-1}-\mu^i_{jk})}$$
where $\Sigma^i_{jk}$ is the covariance matrix, defined as:
$$\Sigma^i_{jk} = \text{diag}(\sigma^i_{jk})$$
In other words, $\Sigma^i_{jk}$ is the diagonal matrix with $\sigma^i_{jk}$ as it's diagonal elements. We define $a^{i-1}$ and $\mu^i_{jk}$ as column vectors here because that is the notation that is normally used.
These are really just saying that Mahalanobis distance is defined as
$$z^i_{jk}=\sqrt{\sum\limits_\ell \frac{(a^{i-1}_{\ell} - \mu^i_{jk\ell})^2}{\sigma^i_{jk\ell}}}$$
Where $\sigma^i_{jk\ell}$ is the $\ell^\text{th}$ element of $\sigma^i_{jk}$. Note that $\sigma^i_{jk\ell}$ must always be positive, but this is a typical requirement for standard deviation so this isn't that surprising.
If desired, Mahalanobis distance is general enough that the covariance matrix $\Sigma^i_{jk}$ can be defined as other matrices. For example, if the covariance matrix is the identity matrix, our Mahalanobis distance reduces to the Euclidean distance. $\Sigma^i_{jk} = \text{diag}(\sigma^i_{jk})$ is pretty common though, and is known as normalized Euclidean distance.
Either way, once our distance function has been chosen, we can compute $a^i_j$ via
$$a^i_j=\sum\limits_k w^i_{jk}\rho(z^i_{jk})$$
In these networks they choose to multiply by weights after applying the activation function for reasons.
This describes how to make a multi-layer Radial Basis Function network, however, usually there is only one of these neurons, and its output is the output of the network. It's drawn as multiple neurons because each mean vector $\mu^i_{jk}$ and each standard deviation vector $\sigma^i_{jk}$ of that single neuron is considered a one "neuron" and then after all of these outputs there is another layer that takes the sum of those computed values times the weights, just like $a^i_j$ above. Splitting it into two layers with a "summing" vector at the end seems odd to me, but it's what they do.
Also see here.
Radial Basis Function Network Activation Functions
Gaussian
$$\rho(z^i_{jk}) = \exp\!\big(-\frac{1}{2} (z^i_{jk})^2\big)$$
Multiquadratic
Choose some point $(x, y)$. Then we compute the distance from $(z^i_j, 0)$ to $(x, y)$:
$$\rho(z^i_{jk}) = \sqrt{(z^i_{jk}-x)^2 + y^2}$$
This is from Wikipedia. It isn't bounded, and can be any positive value, though I am wondering if there is a way to normalize it.
When $y=0$, this is equivalent to absolute (with a horizontal shift $x$).
Inverse Multiquadratic
Same as quadratic, except flipped:
$$\rho(z^i_{jk}) = \frac{1}{\sqrt{(z^i_{jk}-x)^2 + y^2}}$$
*Graphics from intmath's Graphs using SVG.
Best Answer
Quoting myself from this answer to a different question:
This explains why this sigmoid is used in logistic regression.
Regarding neural networks, this blog post explains how different nonlinearities including the logit / softmax and the probit used in neural networks can be given a statistical interpretation and thereby a motivation. The underlying idea is that a multi-layered neural network can be regarded as a hierarchy of generalized linear models; according to this, activation functions are link functions, which in turn correspond to different distributional assumptions.