This is a question about the gradient noise, i.e. the variance of the gradient estimate, for stochastic gradient descent versus mini-batch gradient descent (with batch size $B \geq 1$).
How can I prove that the variance of the gradient for mini-batch gradient descent is reduced by a factor of $B$ compared to stochastic gradient descent?
Any links to papers would be helpful too.
Best Answer
You can prove this by induction on the batch size. The base case $t = 1$ is trivial, since $E((X_1 - E(X))^2) = E(X^2) - E(X)^2$. Now assume the claim holds for $t = m - 1$, i.e. $E((\frac{\sum_{i=1}^{m-1}X_i}{m-1}-E(X))^2)=\frac{E(X^2) - E(X)^2}{m-1}$. Then for $t = m$, one has
$$
\begin{aligned}
E\left(\left(\frac{\sum_{i=1}^m X_i}{m}-E(X)\right)^2\right)
&= E\left(\left(\frac{\sum_{i=1}^{m-1}X_i-(m-1)E(X) + (X_m - E(X))}{m}\right)^2\right) \\
&= \frac{1}{m^2}\Big(E\big(\big(\textstyle\sum_{i=1}^{m-1}X_i-(m-1)E(X)\big)^2\big) + E\big((X_m-E(X))^2\big) \\
&\qquad\quad + 2\,E\big(\big(\textstyle\sum_{i=1}^{m-1}X_i-(m-1)E(X)\big)(X_m - E(X))\big)\Big) \\
&= \frac{1}{m^2}\big((m-1)E(X^2) - (m-1)E(X)^2 + E(X^2)-E(X)^2\big) \\
&= \frac{E(X^2) - E(X)^2}{m}.
\end{aligned}
$$
The third equality uses the inductive hypothesis for the first term, and the fact that $X_1, \ldots, X_m$ are drawn independently for the cross term: independence lets the expectation factor, $E((\sum_{i=1}^{m-1}X_i-(m-1)E(X))(X_m - E(X))) = E(\sum_{i=1}^{m-1}X_i-(m-1)E(X))\,E(X_m - E(X)) = 0$. You may replace $m$ with $B$ in your formula, giving the factor-of-$B$ variance reduction.
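You can also check the $\mathrm{Var}(X)/B$ scaling numerically. The sketch below (a hypothetical simulation, not from the original answer) treats each per-example gradient as an i.i.d. scalar draw, averages it over mini-batches of size $B$, and compares the empirical variance of the mini-batch mean to $\mathrm{Var}(X)/B$:

```python
import numpy as np

# Per-example "gradients" are i.i.d. draws with a known variance;
# the mini-batch gradient is their average over B samples.
rng = np.random.default_rng(0)
var_x = 4.0          # Var(X) of a single per-example gradient
n_batches = 200_000  # number of mini-batches to simulate

empirical = {}
for B in (1, 4, 16, 64):
    # each row is one mini-batch of B i.i.d. per-example gradients
    batches = rng.normal(loc=1.0, scale=np.sqrt(var_x), size=(n_batches, B))
    batch_means = batches.mean(axis=1)   # the mini-batch gradient estimate
    empirical[B] = batch_means.var()
    print(f"B={B:3d}  empirical={empirical[B]:.4f}  predicted={var_x / B:.4f}")
```

For $B = 1$ this is plain SGD and the empirical variance matches $\mathrm{Var}(X)$; each fourfold increase in $B$ cuts the variance by about four, as the induction argument predicts.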