Solved – Worse accuracy with input normalization (NNs)

conv-neural-network, data preprocessing, machine learning, neural networks, normalization

I am training a neural network for audio classification. My inputs are "1-channel images" of size 60x130x1.

Surprisingly, I always get better accuracy when training the model with the original data instead of with the normalized input data (mean = 0, variance = 1).

This is how I normalize it:

import numpy as np

# Per-feature mean and std computed on the training set only
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)

# Standardize every split with the training-set statistics
X_train = (X_train - mean) / std
X_test = (X_test - mean) / std
X_val = (X_val - mean) / std
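One thing worth noting from the statistics reported in the edit below: the per-feature std spans roughly 0.02 to 39, so strict per-feature standardization multiplies the quietest time-frequency bins by about 50 while dividing the loudest by about 40. A minimal sketch of two alternatives that could be compared against this (the floor value 1e-3 and the _g / _f variable names are my own illustrative choices, not part of the question):

import numpy as np

# Option A: global standardization -- one scalar mean/std for the whole
# spectrogram, which preserves the relative scale between bins.
global_mean = X_train.mean()
global_std = X_train.std()
X_train_g = (X_train - global_mean) / global_std
X_val_g = (X_val - global_mean) / global_std
X_test_g = (X_test - global_mean) / global_std

# Option B: per-feature standardization with a floor on the std, so that
# near-constant, low-energy bins are not blown up by a huge factor.
mean = X_train.mean(axis=0)
std = np.maximum(X_train.std(axis=0), 1e-3)   # the floor is a hyperparameter
X_train_f = (X_train - mean) / std
X_val_f = (X_val - mean) / std
X_test_f = (X_test - mean) / std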

--------------------------------- EDIT 1 ---------------------------------

Some relevant values of my training data are:

Min and max values (across training examples): 0.0, 1954.4

Min and max values of the mean (across training examples): 0.0023, 6.7611

Min and max values of the std (across training examples): 0.0204, 39.0361
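For reference, a minimal numpy sketch of how these summaries can be reproduced (assuming X_train is the training array of shape (N, 60, 130, 1) described above):

import numpy as np

# Per-feature statistics across training examples, then their ranges
mean = np.mean(X_train, axis=0)
std = np.std(X_train, axis=0)

print("data min/max:", X_train.min(), X_train.max())
print("mean min/max:", mean.min(), mean.max())
print("std  min/max:", std.min(), std.max())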


--------------------------------- EDIT 1 ---------------------------------

Does this make any sense, or should normalized inputs always give better results? (The purple line corresponds to the normalized data.)

[Plots: validation accuracy per epoch; training accuracy per epoch / per minibatch; training cost per epoch.]

Best Answer

Why are you applying normalisation? Is it because you believe it to be a necessary step, or because you have determined, based on your data, that it is appropriate?

Mean centring and scaling to unit variance are commonly useful, but not universally so, so you should think about the properties of your data.

Mean centring is rarely unhelpful, but it may be less useful for highly skewed populations, where subtracting the mean does not account for a large proportion of the variation in the dataset. Median or mode centring are less common alternatives, but they may work if they account for more of the total variation than the mean does.
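As a purely illustrative check (synthetic lognormal data, not from the original answer), one can compare how much squared and absolute deviation remains after centring a skewed sample on its mean versus its median:

import numpy as np

# Compare residual squared / absolute deviation after centring a
# right-skewed sample on its mean vs. its median (illustrative data).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)   # heavily right-skewed

for name, centre in [("mean", x.mean()), ("median", np.median(x))]:
    sq = np.mean((x - centre) ** 2)    # the mean always minimises this
    ab = np.mean(np.abs(x - centre))   # the median always minimises this
    print(f"{name:>6}: mean squared dev = {sq:.3f}, mean abs dev = {ab:.3f}")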

Unit variance scaling is less useful if the data are all on the same dynamic range and the noise is correlated with magnitude. In such scenarios, scaling to unit variance magnifies the apparent variance of a low-amplitude signal, which should really remain lower than the variance of a high-amplitude signal.
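To make that concrete, here is a small synthetic sketch (my own example, not the poster's data) contrasting per-feature unit-variance scaling with a single global scale for two features that live on the same physical scale:

import numpy as np

# Two features on the same physical scale: one weak, one strong signal.
rng = np.random.default_rng(1)
low = 0.05 * rng.standard_normal(10_000)    # weak signal, std ~ 0.05
high = 5.0 * rng.standard_normal(10_000)    # strong signal, std ~ 5.0
X = np.column_stack([low, high])

# Per-feature unit-variance scaling gives both columns std = 1, i.e. the
# weak column is magnified ~100x relative to the strong one.
X_unit = X / X.std(axis=0)
print("per-feature scaled stds:", X_unit.std(axis=0))

# A single global scale keeps their relative amplitudes intact.
X_global = X / X.std()
print("globally scaled stds:   ", X_global.std(axis=0))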

I realise this link is specifically about PCA, but it discusses when unit scaling is and isn't helpful, and the lessons generalise. Note: to help interpret the link, bear in mind that using variance-scaled data makes the first step of PCA a correlation matrix, whereas non-scaled data gives a covariance matrix. PCA on correlation or covariance?
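That equivalence is easy to verify numerically; here is a small sketch with synthetic data (my own illustration, not part of the linked answer):

import numpy as np

# The covariance matrix of variance-scaled (standardised) data equals the
# correlation matrix of the original data, which is why unit scaling before
# PCA amounts to PCA on the correlation matrix.
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 3)) @ np.array([[1.0, 0.0, 0.0],
                                              [0.5, 2.0, 0.0],
                                              [0.2, 0.3, 5.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_scaled = np.cov(Z, rowvar=False)
corr_of_raw = np.corrcoef(X, rowvar=False)
print(np.allclose(cov_of_scaled, corr_of_raw))  # True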