Solved – Do I apply normalization per entire dataset, per input vector or per feature

machine-learning, normalization, standardization

One of the ways to standardize input data for Neural network training is:

\begin{equation}
X = \frac{X - \text{mean}(X)}{\text{std}(X)}
\end{equation}
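As a quick numeric sanity check of the formula, applied to a single feature vector (the variable names are just for illustration):

```python
import numpy as np

x = np.array([100.0, 150.0])      # one feature observed over two cases
x_std = (x - x.mean()) / x.std()  # mean = 125, std = 25
print(x_std)                      # → [-1.  1.]
```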

However, if I have $n$ training examples, each with $m$ features:
\begin{matrix}
x_{11} & \ldots & x_{1m}\\
\vdots & \ddots & \vdots \\
x_{n1} & \ldots & x_{nm}
\end{matrix}

On what level do I apply this "mean to zero, variance to one" standardization?

Which of the following should I apply, and what is the explanation? (I bet the answer is: standardize across the entire space!?)

  • {Standardize across the entire space} Calculate one mean/std for the entire matrix and subtract/divide element-wise in every cell.

  • {Standardize on row/input-case level} Calculate the mean/std of each row and subtract/divide element-wise for each feature in that row.

  • {Standardize on column/feature level} Calculate the mean/std of each column (feature) and subtract/divide element-wise for all cells in that column.

This makes a difference if my input features have different scales:

\begin{matrix}
Feature 1: & Feature 2: & Feature 3:\\
case1: 100 & 0.1 & 5\\
case2: 150 & 0.9 & 2\\
\end{matrix}

Normalization across entire matrix will result in:

\begin{matrix}Feature 1: & Feature 2: & Feature 3:\\
Case1: 0.95363115 & -0.71773292 & -0.6357541 \\
Case2: 1.79014971 & -0.70434862 & -0.68594522 \\
\end{matrix}

Normalization across each entire row (per case) will result in:

\begin{matrix}Feature 1: & Feature 2: & Feature 3:\\
Case1: 1.41287463 & -0.75971915 & -0.65315549\\
Case2: 1.41418448 & -0.71494618 & -0.69923831\\
\end{matrix}

Normalization across each entire column (per feature) will result in:

\begin{matrix}Feature 1: & Feature 2: & Feature 3:\\
Case1: -1 & -1 & 1\\
Case2: 1 & 1 & -1\\
\end{matrix}

Python code to compute the examples:

import numpy as np

data = np.array([[100, 0.1, 5],
                 [150, 0.9, 2]])
print(data)

a = data.copy()  # work on a copy so `data` stays unchanged
a -= a.mean()    # one mean/std computed over all cells
a /= a.std()
print("Normalize over the entire matrix")
print(a)

a = data.copy()
a -= a.mean(axis=1, keepdims=True)  # one mean/std per row,
a /= a.std(axis=1, keepdims=True)   # keepdims keeps the shape broadcastable
print("Normalize per input vector")
print(a)

a = data.copy()
a -= a.mean(axis=0, keepdims=True)  # one mean/std per column
a /= a.std(axis=0, keepdims=True)
print("Normalize per feature")
print(a)

Best Answer

Rather than thinking of the problem in abstract terms, imagine a real-life example. You want to predict job satisfaction (1-5) from features such as age (in years), work experience (in years), monthly salary (in thousands of $) and gender (0 or 1). What would be the point of subtracting the global mean (heavily influenced by monthly salary) from gender, or from the number of years of work experience? Global or row-wise normalization does not make any sense in most cases. With column-wise normalization, on the other hand, you end up with each column having mean zero and standard deviation one -- each feature is centered at zero and equally scaled.
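To make the column-wise choice concrete: in practice the per-feature mean/std are computed on the training set only and then reused for validation/test data. A minimal sketch (the `FeatureScaler` class is hypothetical, mimicking the fit/transform pattern of scikit-learn's `StandardScaler`):

```python
import numpy as np

class FeatureScaler:
    """Hypothetical per-feature (column-wise) standardizer."""

    def fit(self, X):
        # Statistics come from the training data only.
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        # Reuse the stored training statistics on any data.
        return (X - self.mean_) / self.std_

X_train = np.array([[100.0, 0.1, 5.0],
                    [150.0, 0.9, 2.0]])
scaler = FeatureScaler().fit(X_train)
print(scaler.transform(X_train))  # each column now has mean 0, std 1
```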