Solved – Exact definition of Maxout

machine-learning, neural-networks

I've been trying to figure out what exactly is meant by the "Maxout" activation function in neural networks. There is this question, this paper, and even a section in the Deep Learning book by Bengio et al., but each gives only a little bit of information, and the book has a big TODO next to it.

I will be using the notation described here for clarity, so I don't have to retype it and bloat the question. Briefly, $a^i_j=\sigma(z^i_j)=\sigma(\sum\limits_k a^{i-1}_kw^i_{jk}+b^i_j)$; in other words, a neuron has a single bias and a single weight for each input, and it sums the inputs times the weights, adds the bias, and applies the activation function to get the output (aka activation) value.
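To make the notation concrete, here is a minimal numpy sketch of such a standard layer (the function and variable names are my own):

```python
import numpy as np

def sigma(z):
    # an example activation function (logistic sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def standard_layer(a_prev, W, b):
    # a_prev: (n_prev,)   activations of layer i-1
    # W:      (n, n_prev) one weight per (neuron, input) pair
    # b:      (n,)        one bias per neuron
    # returns (n,) where a[j] = sigma(sum_k a_prev[k] * W[j, k] + b[j])
    return sigma(W @ a_prev + b)
```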

So far I know that Maxout is an activation function that "outputs the max of its inputs". What does that mean? Here are some interpretations I could come up with:

  1. $a^i_j=\max\limits_k (a^{i-1}_k)$, also known as max-pooling.
  2. $a^i_j=\max\limits_k (a^{i-1}_kw^i_{jk})+b^i_j$, simply replacing the sum that is normally done with a max.
  3. $a^i_j=\max\limits_k (a^{i-1}_kw^i_{jk}+b^i_{jk})$, where each neuron now has one bias value for each input, instead of a single bias value applied after summing all inputs. This would make backpropagation different, but still possible.
  4. Each $z^i_j$ is computed as normal, and each neuron has a single bias and a weight for each input. However, similar to softmax ($a^i_j = \frac{\exp(z^i_j)}{\sum\limits_k \exp(z^i_k)}$), this takes the maximum of all $z$'s in its current layer. Formally, $a^i_j=\max\limits_k z^i_k$.

Are any of these correct? Or is it something different?

Best Answer

None of the above; maxout networks don't follow the architecture you assumed.

From the beginning of the "Description of maxout" section in the paper you linked, which defines maxout:

Given an input $x \in \mathbb{R}^d$ ($x$ may be $v$, or may be a hidden layer’s state), a maxout hidden layer implements the function

$$h_i = \max_{j \in [1, k]} z_{ij}$$

where $z_{ij} = x^T W_{ij} + b_{ij}$, and $W \in \mathbb{R}^{d \times m \times k}$ and $b \in \mathbb{R}^{m \times k}$ are learned parameters.

So each of the $m$ units computes $k$ different affine functions of the previous layer's activations, and outputs the max of those $k$ values. Imagine each layer being connected to the previous layer with $k$ different-colored sets of connections, and taking the max over the colors.
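In code, one forward pass through such a maxout layer might look like the following numpy sketch (shapes follow the paper's definition; the function name and example sizes are mine):

```python
import numpy as np

def maxout_layer(x, W, b):
    # x: (d,)       input (or the previous hidden layer's state)
    # W: (d, m, k)  k affine maps for each of the m output units
    # b: (m, k)
    # returns h: (m,) with h[i] = max_j (x @ W[:, i, j] + b[i, j])
    z = np.einsum('d,dmk->mk', x, W) + b  # z[i, j] = x^T W[:, i, j] + b[i, j]
    return z.max(axis=1)                  # max over the k pieces; no further nonlinearity

# example: d = 4 inputs, m = 3 maxout units, k = 2 pieces per unit
rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 3, 2))
b = rng.normal(size=(3, 2))
h = maxout_layer(x, W, b)  # shape (3,)
```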

Alternatively, you can think of a maxout unit as actually being two layers: each of the previous layer's units is connected to each of $k$ units with the identity activation function, and then a single unit connects those $k$ linear units with a max-pooling activation.
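That two-layer view gives an equivalent implementation: an ordinary linear layer with $m \cdot k$ identity-activation units, followed by a max-pool over non-overlapping groups of $k$. A sketch, with hypothetical names:

```python
import numpy as np

def maxout_as_two_layers(x, W_flat, b_flat, k):
    # W_flat: (m*k, d), b_flat: (m*k,) -- an ordinary linear layer with m*k units
    # (identity activation), where rows i*k .. i*k+k-1 belong to maxout unit i
    z = W_flat @ x + b_flat              # first "layer": plain affine, no nonlinearity
    m = z.shape[0] // k
    return z.reshape(m, k).max(axis=1)   # second "layer": max-pool within each group of k
```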

This means that the unit, viewed as a function from $\mathbb R^d$ to $\mathbb R$, is the pointwise maximum of $k$ affine functions, i.e. a convex piecewise-linear function. The paper's Figure 1 gives some examples of what such a function might look like:

[Figure 1 from the maxout paper: examples of piecewise-linear functions implemented by a maxout unit, with each affine piece drawn as a dashed line.]

Each of the dashed lines represents one of the affine pieces $W^T x + b$. A max of affine functions can represent any convex piecewise-linear function exactly and, with enough pieces, approximate any convex function arbitrarily well, which is pretty nice.
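For instance, with a one-dimensional input and $k = 2$ pieces, hand-picked weights recover two familiar activation shapes (a small illustrative sketch, not code from the paper):

```python
import numpy as np

x = np.linspace(-2, 2, 5)  # [-2, -1, 0, 1, 2]

# ReLU: max(x, 0) -- two pieces with (w, b) = (1, 0) and (0, 0)
relu = np.maximum(1.0 * x + 0.0, 0.0 * x + 0.0)   # [0, 0, 0, 1, 2]

# absolute value: max(x, -x) -- two pieces with (w, b) = (1, 0) and (-1, 0)
abs_val = np.maximum(1.0 * x, -1.0 * x)           # [2, 1, 0, 1, 2]
```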