Solved – Why center the data during feature scaling for neural networks

machine learning, standardization

Basically, feature scaling is done to transform the data to the same scale so that the gradients are not biased towards features with larger values.

Now, we do this by centering the data and dividing by the standard deviation –

import numpy as np

x = np.array([-20, -10, 20, 50])
y = (x - x.mean()) / x.std()  # subtract the mean, then divide by the std
print(y)

I get the scaled values as –

[-1.09544512 -0.73029674  0.36514837  1.46059349]

But if we only want to scale the values, why subtract the mean? We could simply divide by the standard deviation, i.e.

import numpy as np

x = np.array([-20, -10, 20, 50])
y = x / x.std()  # scale only, without centering
print(y)

And I get the scaled values as –

[-0.73029674 -0.36514837  0.73029674  1.82574186]

What purpose does centering the data serve?

Best Answer

We subtract the mean because we aim not only to scale the data, but to normalise it, so that it is also centred at zero.
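A quick way to see the difference is to check the mean and standard deviation of both transforms (a minimal numpy sketch, reusing the array from the question):

import numpy as np

x = np.array([-20, -10, 20, 50])

# Full standardisation: centre, then scale.
z = (x - x.mean()) / x.std()
print(z.mean(), z.std())   # ~0.0, 1.0 -- centred at zero with unit spread

# Scale-only version: unit spread, but the mean is untouched.
s = x / x.std()
print(s.mean(), s.std())   # ~0.365, 1.0 -- still shifted away from zero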

I think a good intuition for why we need centering can be obtained by considering batch normalisation and sigmoid activation functions.

[Figure: plot of the sigmoid activation function]

If you look at the sigmoid activation function, notice that the largest gradients occur in the middle of the sigmoid, where it most closely approximates a linear function. If you feed in a lot of very large or very small values, you saturate the activation function and get something that basically approximates a flat line. This effectively slows the convergence of the algorithm, because the optimisation first has to find an appropriate scaling of the parameters before the gradients stop being saturated; once they are not saturated, learning can proceed faster.
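To make the saturation concrete, the sigmoid's derivative is σ'(z) = σ(z)(1 − σ(z)), which peaks at z = 0 and decays quickly away from it. A minimal numpy sketch:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of the sigmoid

for z in [0.0, 1.0, 2.0, 5.0, 10.0]:
    print(f"z={z:5.1f}  grad={sigmoid_grad(z):.6f}")

# z=  0.0  grad=0.250000   <- largest gradient, at the centre
# z=  1.0  grad=0.196612
# z=  2.0  grad=0.104994
# z=  5.0  grad=0.006648   <- already close to saturated
# z= 10.0  grad=0.000045   <- effectively a flat line

Centering pulls the bulk of the inputs towards z = 0, i.e. into the region where the gradient is largest.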

In essence, centering the data means that only the outliers saturate, which can be desirable because we often want outliers to carry relatively less importance, as the sketch below shows.
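A rough sketch of this effect (the feature values below are made up purely for illustration):

import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Hypothetical feature: 19 well-behaved points plus one large outlier.
x = np.concatenate([np.linspace(-3, 3, 19), [100.0]])
z = (x - x.mean()) / x.std()
g = sigmoid_grad(z)

print(g[:-1].min())  # ~0.24  -- the bulk stays in the high-gradient region
print(g[-1])         # ~0.013 -- only the outlier is pushed into saturation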
