Solved – Normalizing (mean and std) a 3D array

conv-neural-network, mean, neural-networks, normalization, standard-deviation

I am having trouble visualizing how to normalize a 3D matrix. I am trying to use spectrograms of sound files for a sound classification task with neural networks.
This is what a spectrogram looks like:
[image: example spectrogram]

Forgetting about the axes and scales, it is just a 2D array, like a greyscale image would be: numbers filling up a grid of dimensions 60×130, let's say.
So if I have 100 sound files, the array dimensions would be 100×60×130.

From the CS231n course notes I've seen that it is recommended to normalize the input to the network, and one of the usual ways of doing it is by subtracting the mean (zero-centering) and dividing by the standard deviation of the data. I see this clearly for 2D data, and its effect is easily appreciable here:

where I just plotted some (x,y) values
[image: scatter plot of the (x, y) values]

And I did that by subtracting from the x values the mean across the x values, and from the y values the mean across the y values, and then dividing the zero-centered x values by std(x) and the zero-centered y values by std(y).
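In NumPy terms, this is what I mean for the 2D case (just a sketch; the data array here is made up for illustration):

    import numpy as np

    # Some made-up (x, y) values, shape (N, 2)
    data = np.random.randn(200, 2) * [3.0, 0.5] + [10.0, -2.0]

    x_centered = data[:, 0] - data[:, 0].mean()   # subtract the mean across the x values
    y_centered = data[:, 1] - data[:, 1].mean()   # subtract the mean across the y values

    x_norm = x_centered / data[:, 0].std()        # divide by std(x)
    y_norm = y_centered / data[:, 1].std()        # divide by std(y)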

Now, I am completely unable to extend this procedure to the 3D case. If I take the 0-axis (0-indexed as in Python, i.e. the axis running over the sound files) and compute the mean across that dimension, I get an array of dimensions 60×130. Should I then subtract each of those 60×130 values from the corresponding entry of each input? That is, for each new 1×60×130 example, subtract (element-wise) the obtained "mean-across-examples" 60×130 array?

And the same for the std: should I compute the std across the 0-axis (number of examples) and then divide each value of the input 1×60×130 array by the corresponding value of this 60×130 std array?
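In NumPy terms, a sketch of what I am describing (assuming all examples are stacked in an array X of shape (100, 60, 130); the names are just for illustration):

    import numpy as np

    # X holds all spectrograms: (n_examples, 60, 130)
    X = np.random.rand(100, 60, 130)

    # Per-element statistics computed across the examples axis (axis 0);
    # both have shape (60, 130)
    mean = X.mean(axis=0)
    std = X.std(axis=0)

    # Element-wise zero-centering and scaling; broadcasting applies the
    # (60, 130) statistics to every example (a small epsilon could be
    # added to std to avoid division by zero for constant elements)
    X_normalized = (X - mean) / std

    # The same stored statistics would be applied to a new 60x130 example:
    # new_normalized = (new_example - mean) / std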

Or am I completely wrong?

Thanks a lot if you read this far; I didn't manage to make the question any more concise.

Best Answer

I'm not an expert on neural networks, but here are my two cents. You may get a clearer picture by simplifying your feature set: just unravel each 2D frame into a 1D feature vector. According to your description you then get a flat array of 60×130 = 7800 features for each sample. As far as I understood, you want to predict some target value, or label, using the information contained in these 7800 features.

If features differ strongly in scale, gradient descent for the network may become very slow. For example, values in the upper-left corner of your 2D feature space may vary by tens of dB while features in the lower corner may only vary by fractions (0.01) of a dB.

Before standardizing, maybe calculate the variance of each feature and plot it. If the feature variances do not differ widely (i.e. all are within one or two orders of magnitude), I would not worry and would skip standardization. If they do differ a lot, the way to go is to standardize each feature separately: for each feature, calculate its mean and SD across the samples, then subtract that mean and divide by that SD in every sample.
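A minimal sketch of this, assuming the spectrograms are stacked in an array of shape (100, 60, 130) (the names and the plotting step are just illustrative):

    import numpy as np
    import matplotlib.pyplot as plt

    X = np.random.rand(100, 60, 130)      # 100 samples, one 60x130 spectrogram each
    X_flat = X.reshape(X.shape[0], -1)    # unravel to (100, 7800): one row per sample

    # Inspect how much the per-feature variances differ
    feature_var = X_flat.var(axis=0)      # shape (7800,)
    plt.hist(np.log10(feature_var), bins=50)
    plt.xlabel("log10(feature variance)")
    plt.show()

    # If the variances span several orders of magnitude, standardize each
    # feature separately: subtract its mean and divide by its SD
    mean = X_flat.mean(axis=0)
    std = X_flat.std(axis=0)
    X_standardized = (X_flat - mean) / std

Note that standardizing each feature of the flattened array like this gives the same numbers as the element-wise normalization over the examples axis described in the question; the reshape just makes the per-feature view explicit.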
