Solved – Should normalization match the activation function?

neural networks, normalization

I'm new to neural networks and I think I now have a good grasp of the fundamentals, but I have a question relating to normalization and activation functions.

Some sources say to normalize to between -1 and 1, and others say between 0 and 1. I also see many people recommending the ReLU activation function for its performance benefits.

I assume the data should be normalized to suit the chosen activation function? I.e. if using ReLU, the data should be normalized to between 0 and 1, since ReLU maps anything < 0 to 0; so if I had normalized to between -1 and 1, a big chunk of the data would immediately become 0? And if normalizing to between -1 and 1, I assume that would be better suited to tanh?
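
For example, here's a toy NumPy sketch of what I mean by a big chunk of the data becoming 0:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-1.0, 1.0, 11)   # data normalized to [-1, 1]
print(relu(x))                   # the entire negative half is clipped to 0
```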

Also, since ReLU is linear for positive inputs, could the data be normalized to a wider range, say 0 to 5? Would that be advisable?

Many thanks.

Best Answer

The way the weights of hidden layers are usually initialized makes them expect input data that is standardized, i.e. with mean 0 and variance 1, rather than data squashed into a range chosen to match the activation function. Beyond that, Batch Normalization exists, so you don't have to worry about the distribution of the activations and how it changes through your subsequent layers.
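
As a minimal sketch of what that means in practice (assuming NumPy and PyTorch, with made-up shapes and layer sizes): standardize each input feature to zero mean and unit variance, and let BatchNorm layers take care of the activation distributions inside the network.

```python
import numpy as np
import torch
import torch.nn as nn

# Standardize each input feature to zero mean and unit variance
# (rather than rescaling it into [0, 1] or [-1, 1]).
X = np.random.rand(1000, 20) * 5.0              # hypothetical raw data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# A small ReLU network; BatchNorm1d re-standardizes the activations in
# each hidden layer, so the input range doesn't need to be matched to
# the activation function.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

out = model(torch.tensor(X_std, dtype=torch.float32))
print(out.shape)  # torch.Size([1000, 1])
```

The point is that the preprocessing matches what the weight initialization expects (standardized features), not the output range of the activation function.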