Solved – Should normalization match the activation function?

neural networks, normalization

I'm new to neural networks and I think I now have a good grasp of the fundamentals, but I have a question relating to normalization and activation functions.

Some sources say to normalize to between -1 and 1, and others say between 0 and 1. I also see many people recommending the ReLU activation function for its performance benefits.

I assume the data should be normalized to suit the chosen activation function? I.e. if using ReLU, the data should be normalized to between 0 and 1, since ReLU maps anything < 0 to 0; so if I had normalized to between -1 and 1, a big chunk of the data would immediately become 0? And if normalizing to between -1 and 1, I assume that would be better suited to tanh?
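
For example, here's a toy NumPy sketch of what I mean by a big chunk of the data becoming 0:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-1.0, 1.0, 11)   # data normalized to [-1, 1]
print(relu(x))                   # the entire negative half is clipped to 0
```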

Also, since ReLU is linear for positive inputs, could the data be normalized to a wider range, say 0 to 5? Would that be advisable?

Many thanks.

Best Answer

The way the weights of hidden layers are usually initialized makes them expect input data that is standardized, i.e. with mean 0 and variance 1, rather than data squashed into a range chosen to match the activation function. Beyond that, Batch Normalization exists, so you don't have to worry about the distribution of the activations and how it changes through your subsequent layers.
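
As a minimal sketch of what that means in practice (assuming NumPy and PyTorch, with made-up shapes and layer sizes): standardize each input feature to zero mean and unit variance, and let BatchNorm layers take care of the activation distributions inside the network.

```python
import numpy as np
import torch
import torch.nn as nn

# Standardize each input feature to zero mean and unit variance
# (rather than rescaling it into [0, 1] or [-1, 1]).
X = np.random.rand(1000, 20) * 5.0              # hypothetical raw data
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# A small ReLU network; BatchNorm1d re-standardizes the activations in
# each hidden layer, so the input range doesn't need to be matched to
# the activation function.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

out = model(torch.tensor(X_std, dtype=torch.float32))
print(out.shape)  # torch.Size([1000, 1])
```

The point is that the preprocessing matches what the weight initialization expects (standardized features), not the output range of the activation function.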