Does it refer to the input or the output of the activation function?
The literature seems to be inconsistent. A few examples:
Activations = Input of the activation function
- Deep Learning Book, Goodfellow et al., Pages 208, 209
$a^{(k)} = b^{(k)} + W^{(k)}h^{(k-1)}$ […] the activations $a^{(k)}$
- Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe et al.
We want to preserve the information in the network, by normalizing the activations […] Note that, since we normalize $Wu+b$, the bias $b$ can be ignored
- http://cs231n.github.io/neural-networks-1/ (describing ReLU)
this one is a common choice and simply thresholds all activations that are below zero to zero
Activations = Output of the activation function
- Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1, Courbariaux et al. speaks of pre-activations and activations
- http://cs231n.github.io/neural-networks-1/
```python
h1 = f(np.dot(W1, x) + b1) # calculate first hidden layer activations
h2 = f(np.dot(W2, h1) + b2) # calculate second hidden layer activations
```
Best Answer
The simplest form of a neural network is the Multi-Layer Perceptron (MLP), which at its most basic consists of just three layers.
An input layer, represented by a matrix $X \in \mathbb{R}^{N\times d}$ where $N$ is the number of training examples and $d$ is the number of features.
A hidden layer, which usually applies a ReLU or a logistic sigmoid function. Hidden layer $i$ could apply the ReLU function, represented by $$h_i(x) = \text{ReLU}(x) = \max(x, 0)$$ In other words, if the input to the ReLU function is negative, the function outputs $0$; if the input $x$ is positive, the function outputs $x$.
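As a quick illustration (a minimal sketch using NumPy; the helper name `relu` is my own), ReLU simply clips negative inputs to zero element-wise:

```python
import numpy as np

def relu(x):
    # Element-wise max(x, 0): negative entries become 0, positive entries pass through
    return np.maximum(x, 0)

print(relu(np.array([-2.0, 0.0, 3.0])))  # -> [0. 0. 3.]
```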
The hidden layer feeds into the output layer, which is just another function. This function could be a squared-error function (in the context of regression) or softmax (in the case of multiclass classification). The MLP is complete once you add the weight and bias matrices, but we don't need them for now.
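For the multiclass case mentioned above, softmax can be sketched as follows (my own helper, not from the original answer; shifting by the maximum is a standard trick for numerical stability):

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # the result is a probability vector that sums to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([1.0, 2.0, 3.0]))
# p sums to 1, and the largest input gets the largest probability
```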
The activation function is just what the name suggests: a function. In the example above, the activation function for the hidden layer is the ReLU function; the activation function for the output layer is squared error or softmax.
When someone in Machine Learning uses the word $\text{activations}$, they are almost always referring to the output of the activation function. The possible activations in the hidden layer in the example above could only be $0$ or a positive value.
Note that the hidden activations (the output of the hidden layer) can become input to other activation functions (in this case, the output layer's activation function). A pre-activation is the input to an activation function.
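To make the distinction concrete, here is a minimal sketch (the shapes and the names `W`, `b`, `x` are arbitrary, chosen just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)          # input features
W = rng.standard_normal((3, 4))     # weights for a hidden layer of size 3
b = np.zeros(3)

z = W @ x + b            # pre-activation: the input to the activation function
a = np.maximum(z, 0)     # activation: the output of the activation function (ReLU)
```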
On a final note, I come from a statistics background, a much older and more developed field in which the notation is largely standardized. In machine learning, the notation and nomenclature are still evolving, so I would not be surprised to see some authors use terms differently. Context is your best friend when reading machine learning texts.