This is a question about the gradient noise, i.e. the variance of the gradient estimate, for stochastic gradient descent versus mini-batch gradient descent (with batch size $B \geq 1$).
How can I prove that the variance of the gradient for mini-batch gradient descent is reduced by a factor of $B$ compared to stochastic gradient descent?
Any links to papers would be helpful too.
Best Answer
You can prove this by induction on the batch size. The base case $t = 1$ is trivial, since $E((X_1 - E(X))^2) = E(X^2) - E(X)^2$. Now assume the claim holds for $t = m - 1$, i.e. $E((\frac{\sum_{i=1}^{m-1}X_i}{m-1}-E(X))^2)=\frac{E(X^2) - E(X)^2}{m-1}$. Then for $t = m$, one has
$$
\begin{aligned}
E\left(\left(\frac{\sum_{i=1}^m X_i}{m}-E(X)\right)^2\right)
&= E\left(\left(\frac{\sum_{i=1}^{m-1}X_i-(m-1)E(X) + (X_m - E(X))}{m}\right)^2\right) \\
&= \frac{1}{m^2}\Big(E\big(\big(\textstyle\sum_{i=1}^{m-1}X_i-(m-1)E(X)\big)^2\big) + E\big((X_m-E(X))^2\big) \\
&\qquad\quad + 2\,E\big(\big(\textstyle\sum_{i=1}^{m-1}X_i-(m-1)E(X)\big)(X_m - E(X))\big)\Big) \\
&= \frac{1}{m^2}\big((m-1)E(X^2) - (m-1)E(X)^2 + E(X^2)-E(X)^2\big) \\
&= \frac{E(X^2) - E(X)^2}{m}.
\end{aligned}
$$
The third equality uses the inductive hypothesis for the first term, and the fact that $X_1, \ldots, X_m$ are drawn independently for the cross term: independence lets the expectation factor, $E((\sum_{i=1}^{m-1}X_i-(m-1)E(X))(X_m - E(X))) = E(\sum_{i=1}^{m-1}X_i-(m-1)E(X))\,E(X_m - E(X)) = 0$. You may replace $m$ with $B$ in your formula, giving the factor-of-$B$ variance reduction.
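You can also check the $\mathrm{Var}(X)/B$ scaling numerically. The sketch below (a hypothetical simulation, not from the original answer) treats each per-example gradient as an i.i.d. scalar draw, averages it over mini-batches of size $B$, and compares the empirical variance of the mini-batch mean to $\mathrm{Var}(X)/B$:

```python
import numpy as np

# Per-example "gradients" are i.i.d. draws with a known variance;
# the mini-batch gradient is their average over B samples.
rng = np.random.default_rng(0)
var_x = 4.0          # Var(X) of a single per-example gradient
n_batches = 200_000  # number of mini-batches to simulate

empirical = {}
for B in (1, 4, 16, 64):
    # each row is one mini-batch of B i.i.d. per-example gradients
    batches = rng.normal(loc=1.0, scale=np.sqrt(var_x), size=(n_batches, B))
    batch_means = batches.mean(axis=1)   # the mini-batch gradient estimate
    empirical[B] = batch_means.var()
    print(f"B={B:3d}  empirical={empirical[B]:.4f}  predicted={var_x / B:.4f}")
```

For $B = 1$ this is plain SGD and the empirical variance matches $\mathrm{Var}(X)$; each fourfold increase in $B$ cuts the variance by about four, as the induction argument predicts.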