Solved – the correct way of calculating Rectified Linear (ReLU) and MaxOut functions

machine-learning, neural-networks, perceptron

If I have an artificial neuron with 2 inputs:

input 1 = 0.7 & weight = 0.7
input 2 = 0.3 & weight = 0.3

If I use a Rectified Linear Unit (ReLU) as the activation function ($f(x)=\max(0,x)$), is the output of this neuron:

Outcome 1 (I assume this is correct): ReLU $= \max(0,\ 0.7 \cdot 0.7 + 0.3 \cdot 0.3) = 0.58$

or

Outcome 2: ReLU $= \max(0,\ 0.7 \cdot 0.7,\ 0.3 \cdot 0.3) = 0.49$

In other words: is the max taken over the summation or over the individual inputs? Since I see ReLU as a drop-in replacement for sigmoid/tanh, I expect the first one is correct. Both readings are sketched in code below.
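To make both readings concrete, here is a minimal Python sketch of the two computations, using the numbers above (plain Python, no libraries assumed):

```python
# Minimal sketch of the two candidate computations (plain Python).
inputs  = [0.7, 0.3]
weights = [0.7, 0.3]

weighted = [i * w for i, w in zip(inputs, weights)]   # [0.49, 0.09]

outcome_1 = max(0.0, sum(weighted))   # max over the weighted sum   -> ~0.58
outcome_2 = max(0.0, *weighted)       # max over individual inputs  -> ~0.49

print(outcome_1, outcome_2)
```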

Question 1: is outcome 1 or outcome 2 the correct way of calculating ReLU?

I started to hesitate because of maxout, which is defined in the maxout paper as:

$ h_i = \max_{j \in [1, k]} z_{ij} $ with $z_{ij} = x^T W_{ij} + b_{ij}$, $ W \in \mathbb{R}^{d \times m \times k} $ and $b \in \mathbb{R}^{m \times k}$

Based on this definition, I expect the output of this neuron, if maxout is used as the activation, to be $\max(0.7 \cdot 0.7,\ 0.3 \cdot 0.3) = 0.49$.

Question 2: is this the correct way of calculating maxout for one layer with one neuron and these two inputs/weights?
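For concreteness, here is a minimal sketch of how I would compute this under that reading of the definition, treating the two weighted inputs as the $k = 2$ values $z_{1j}$ with zero biases (this is just my interpretation, which is exactly what I am asking about):

```python
# Sketch of this reading: the two weighted inputs are the k = 2 values z_1j,
# biases are zero, and maxout takes the max over them (no 0 in the max).
inputs  = [0.7, 0.3]
weights = [0.7, 0.3]

z = [i * w for i, w in zip(inputs, weights)]   # [0.49, 0.09]
maxout_value = max(z)                          # ~0.49

print(maxout_value)
```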

I hesitate about the maxout/ReLU implementations because of the following phrase in the same paper:

"The only difference between maxout and max pooling over a set of rectified linear units is that maxout does not include a 0 in the max."

This suggests (to me) that ReLU and MaxOut give the same result whenever the maximum value is $> 0$, which would mean outcome 2 is correct, whereas I initially (and still) assume outcome 1 is correct.

Best Answer

If I have an artificial neuron with 2 inputs:

input 1 = 0.7 & weight = 0.7
input 2 = 0.3 & weight = 0.3

Let's make things clear. Consider the non-spatial case, where the activation function's input is just a 1D vector $\vec{x} = (x_1, ..., x_d)$ (where do your weights come from?). This is usually the output of a Linear or Batch Normalization layer, but it can be whatever you want.

Typical activation functions are $\mathbb{R} \rightarrow \mathbb{R}$: they are applied to each input component individually. For example, ReLU:

$ \mathrm{ReLU}_i(x_i) = \max(0, x_i) $; note that no weights are involved.

Applied to an input vector: $ ReLU((1, -2, 3, -4, 5)) = (1, 0, 3, 0, 5) $
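As a quick sketch of this element-wise behaviour (NumPy; the function name is just illustrative):

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x_i) per component; no weights involved."""
    return np.maximum(0.0, x)

x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])
print(relu(x))   # [1. 0. 3. 0. 5.]
```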

But each MaxOut unit is an $\mathbb{R}^d \rightarrow \mathbb{R}$ function:

$ \mathrm{MaxOut}_s(\vec{x}) = \max(\vec{w}_{s1} \cdot \vec{x} + b_{s1},\ \ldots,\ \vec{w}_{sk} \cdot \vec{x} + b_{sk}) $

$ \mathrm{MaxOut}(\vec{x}) = (\mathrm{MaxOut}_1(\vec{x}), \ldots, \mathrm{MaxOut}_m(\vec{x})) $

where the number of affine pieces $k$ in each MaxOut unit and the number $m$ of such units are up to you, and $\vec{w}_{sl} \in \mathbb{R}^d$, $b_{sl} \in \mathbb{R}$ are learned.
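A minimal NumPy sketch of such a layer, with illustrative shapes and names (in practice $W$ and $b$ are learned; here they are just random):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout layer: x in R^d, W in R^(m, k, d), b in R^(m, k).
    Each of the m output units takes the max over its k affine pieces."""
    z = W @ x + b          # shape (m, k): z[s, l] = w_sl . x + b_sl
    return z.max(axis=1)   # shape (m,):   one value per maxout unit

d, m, k = 5, 3, 4                      # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(size=(m, k, d))         # learned in practice; random here
b = rng.normal(size=(m, k))
x = rng.normal(size=d)

print(maxout(x, W, b).shape)           # (3,) -- one output per maxout unit
```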