For this simple problem, argue that the gradient descent update is $x_t = (1-2\eta) x_{t-1}$. Then, argue that
$$
x_t = (1-2\eta)^t x_0.
$$
Finally, check whether this sequence converges for every $\eta$, or whether you need some condition on $\eta$ for that to happen.
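If you want a quick numerical check, here is a short Python sketch (this assumes the objective is $f(x) = x^2$, so that $f'(x) = 2x$ and one gradient step gives $x_t = x_{t-1} - 2\eta x_{t-1}$; the step sizes are illustrative):

```python
# Numerical check of x_t = (1 - 2*eta)**t * x_0, assuming f(x) = x**2 so that f'(x) = 2*x.
x0 = 1.0
for eta in [0.1, 0.5, 0.9, 1.1]:        # illustrative step sizes
    x = x0
    for t in range(50):
        x = x - eta * 2 * x             # one gradient descent step
    closed_form = (1 - 2 * eta) ** 50 * x0
    print(f"eta={eta}: iterate={x:.3e}, closed form={closed_form:.3e}")
```

The two columns agree, and the iterates shrink toward $0$ only when $|1 - 2\eta| < 1$.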
I think that "batch gradient descent" is just another name for "gradient descent". And "mini-batch gradient descent" is just another name for the mini-batch version of stochastic gradient descent (which I would call mini-batch SGD).
So below I'll explain the meaning of gradient descent, stochastic gradient descent (with a mini-batch size of $1$), and mini-batch stochastic gradient descent.
Suppose we're minimizing
$$
\tag{1} f(x) = \frac{1}{N} \sum_{i=1}^N f_i(x).
$$
Let's assume the functions $f_i: \mathbb R^n \to \mathbb R$ are differentiable.
The gradient descent iteration is
$$
x^{k+1} = x^k - t \nabla f(x^k).
$$
Here $t$ is the step size, also known as the learning rate, for our optimization algorithm.
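To make the iteration concrete, here is a minimal Python sketch (the quadratic objective and the step size $t = 0.1$ are my own illustrative choices, not part of the setup above):

```python
import numpy as np

def gradient_descent(grad_f, x0, t=0.1, num_iters=100):
    """Plain (batch) gradient descent: x^{k+1} = x^k - t * grad f(x^k)."""
    x = x0
    for _ in range(num_iters):
        x = x - t * grad_f(x)
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x; the minimizer is the origin.
x_star = gradient_descent(lambda x: 2 * x, x0=np.array([3.0, -4.0]))
```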
In stochastic gradient descent (with a mini-batch size of $1$), each time we update $x$ we compute the gradient using only one of the terms, selected at random, from the big sum (1). So in SGD we update $x$ as follows:
$$
x^{k+1} = x^k - t \nabla f_i(x^k),
$$
where $i$ is selected at random from $\{1,2, \ldots, N \}$.
The index $i$ can be selected either with or without replacement. I believe sampling without replacement is more common and tends to work a little better.
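Here is a sketch of that update in Python (the per-term losses $f_i(x) = (x - a_i)^2$ and the constants are made up; this version samples with replacement for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)                  # hypothetical data; f_i(x) = (x - a[i])**2
x, t = 0.0, 0.1

for k in range(1000):
    i = rng.integers(len(a))              # pick one index at random (with replacement)
    x = x - t * 2 * (x - a[i])            # step along the gradient of the single term f_i
```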
In the mini-batch version of stochastic gradient descent, with a mini-batch size of $M$, each time we update $x$ we compute the gradient using only $M$ of the terms, selected at random, from the big sum (1).
So the minibatch SGD update is
$$
x^{k+1} = x^k - t \left( \frac{1}{M} \sum_{i \in S_k} \nabla f_i(x^k) \right),
$$
where $S_k$ is a randomly selected $M$-element subset of $\{1, 2, \ldots, N \}$.
Usually the subset $S_k$ is not allowed to intersect any of the previous subsets $S_1, \ldots, S_{k-1}$ until we have completely exhausted the set of possible indices $S = \{1,2, \ldots, N \}$, at which point we make another full sweep through $S$, then another, and so on. Each full sweep through $S$ is called one "epoch".
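A sketch of that epoch-based loop (same made-up per-term losses as in the earlier sketch; $M$, $t$, and the number of epochs are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=100)                          # hypothetical data; f_i(x) = (x - a[i])**2
N, M, t = len(a), 10, 0.1
x = 0.0

for epoch in range(20):
    perm = rng.permutation(N)                     # reshuffle the indices once per epoch
    for start in range(0, N, M):
        S_k = perm[start:start + M]               # disjoint mini-batches within an epoch
        grad = np.mean([2 * (x - a[i]) for i in S_k])
        x = x - t * grad
```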
When using stochastic gradient descent, people often use a diminishing step size strategy in order to guarantee convergence. With a fixed step size, SGD does not converge to the exact minimizer, although in practice it often finds a good approximate minimizer of (1).
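For example, one commonly used family of diminishing step sizes (the constants here are illustrative) is
$$
t_k = \frac{a}{b + k}, \qquad a, b > 0,
$$
which satisfies the classical conditions $\sum_k t_k = \infty$ and $\sum_k t_k^2 < \infty$ that standard SGD convergence guarantees rely on.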
Best Answer
I am not sure how much you know, but basically SGD is trying to find the best $w$ by minimizing $Q$. Generally, $Q$ is an error function; thus, by following the (negative) gradient direction in the space of values of $w$, we move towards the $w$ that minimizes the error. More concretely, if you consider linear regression or perceptron classification, $w$ specifies the weight parameters of the model and $Q(w)$ is a measure of the error that the model makes on the data.
To make sure you understand, normal gradient descent is written
$$
w \leftarrow w - \eta \nabla Q(w),
$$
where the error objective and its gradient are
$$
Q(w) = \frac{1}{n}\sum_i Q_i(w)\;\;\;\;\implies\;\;\;\; \nabla Q(w) = \frac{1}{n}\sum_i \nabla Q_i(w).
$$
Let's break it down: rewriting the update as an iteration, $w_{t+1}=w_t - \eta \nabla Q(w_t)$, makes it a bit easier to see. Basically, $w$ is the machine learning model (i.e. its parameters, which determine its behaviour). So, given some $w_t$, we add a small vector to it (or take a small step, if you prefer) to move in the direction of the optimal $w_*$ (which specifies the model with the least $Q$). The size of the step is determined by the parameter $\eta\in\mathbb{R}$, and the direction is determined by the gradient of $Q$. (Note that when the gradient is zero, all the partial derivatives of $Q$ with respect to $w$ have vanished, meaning we have reached a stationary point, which we hope is a minimum. If you are confused as to why the negative gradient is the direction of steepest descent, see here.)
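To make this concrete, here is a small Python sketch for the least-squares linear regression case (my own choice of example: $Q_i(w) = (x_i^\top w - y_i)^2$, with made-up data and $\eta$):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                 # hypothetical features
y = X @ np.array([1.0, -2.0, 0.5])            # hypothetical targets (noiseless for simplicity)

w, eta = np.zeros(3), 0.1
for _ in range(500):
    grad_Q = 2 * X.T @ (X @ w - y) / len(y)   # grad Q(w) for Q(w) = (1/n) * sum_i (x_i.w - y_i)^2
    w = w - eta * grad_Q                      # w <- w - eta * grad Q(w)
```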
Now for the stochastic part: do you see a possible problem with the regular method? Imagine $n$ is enormous, and/or each $\nabla Q_i(w)$ is expensive to compute (e.g. in deep neural networks). Then evaluating the full gradient $\nabla Q(w)$, which sums over all $n$ terms, may be very costly.
The common solution is to estimate $\nabla Q$ instead of computing it exactly. The simplest method is to pick a random $j$ and use $\nabla Q \approx \nabla Q_j$, or (often better) to average over a mini-batch; note that this estimate is unbiased, since $\mathbb{E}_j[\nabla Q_j]=\nabla Q$.
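A quick numerical illustration of that unbiasedness (the quadratic terms $Q_j(w) = \lVert w - a_j\rVert^2$ and the data are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(1000, 3))                 # hypothetical data; Q_j(w) = ||w - a[j]||^2
w = np.array([1.0, 2.0, 3.0])

full_grad = 2 * (w - a.mean(axis=0))           # exact grad Q(w)
draws = rng.integers(len(a), size=10_000)      # many random single-term choices of j
estimate = 2 * (w - a[draws]).mean(axis=0)     # average of grad Q_j(w) over those draws
print(full_grad, estimate)                     # the two vectors should be close
```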
This has another benefit beyond computational efficiency: the noise in the gradient estimate can help the iterates escape local minima of the error function, in which "real" gradient descent could get stuck.
See also here.