One option is to consider the error dynamics of gradient descent. Define the error
$$ e_k = x_k - x^*, $$
where $x^*$ is the extremum. Since $x_{k+1} = x_k - \alpha \nabla f(x_k)$, the error obeys $e_{k+1} = e_k - \alpha \nabla f(x_k)$.
Take the candidate function $V(e_k) = V_k := e_k^2$.
Then,
$$ V_{k+1} - V_k = \alpha^2 ( \nabla f(x_k) )^2 - 2 \alpha e_k \nabla f(x_k) $$
If it can be shown that $V_{k+1} - V_k < 0$ for all $e_k \ne 0$, then $V$ is called a Lyapunov function for the system, and in turn $e_k$ converges asymptotically to the equilibrium $e = 0$, i.e. $x = x^*$.
Observe that $e_k < 0 \implies \nabla f(x_k) < 0$ and $e_k > 0 \implies \nabla f(x_k) > 0$ (this holds, e.g., for a strictly convex scalar $f$ with minimizer $x^*$). So, taking $\alpha > 0$, the condition
$$ \alpha_k < \frac{2|e_k|}{|\nabla f(x_k)|} \quad (\text{with } \alpha_k := 0 \text{ when } e_k = 0) $$
is a possible choice of an adaptive gain $\alpha_k$ which renders the equilibrium asymptotically stable.
If only an approximate gradient is available, it suffices for the argument that its sign coincide with the sign of $e_k$. Otherwise, one must identify the region where this sign condition holds; the error can then only be shown to converge to the complement of that region, which plays the role of the attractor. Indeed, if in some region the gradient changes sign arbitrarily, then no stability can be shown there.
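As a concrete numerical sketch of the argument above (the test function $f(x) = (x-2)^4$ and the 40% safety factor are my own choices for illustration, not part of the answer):

```python
# Numerical check of the Lyapunov argument on a toy scalar problem.
# Assumption (mine): f(x) = (x - 2)**4, so x* = 2 and
# grad f(x) = 4*(x - 2)**3, which has the same sign as e = x - x*.

def grad_f(x):
    return 4.0 * (x - 2.0) ** 3

x_star = 2.0
x = 5.0
for k in range(60):
    e = x - x_star
    g = grad_f(x)
    if e == 0.0:
        alpha = 0.0                       # convention: alpha_k := 0 when e_k = 0
    else:
        # any alpha_k strictly below 2|e_k|/|grad f(x_k)| works;
        # take 40% of the bound as a safety margin
        alpha = 0.4 * 2.0 * abs(e) / abs(g)
    V_before = e ** 2
    x = x - alpha * g
    V_after = (x - x_star) ** 2
    assert e == 0.0 or V_after < V_before  # V_{k+1} - V_k < 0

print(abs(x - x_star))
```

With this choice, $\alpha_k \nabla f(x_k) = 0.8\, e_k$, so the error contracts by a factor of $0.2$ per step, and $V_k$ decreases strictly until the equilibrium is reached.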
$$\frac{\partial f }{\partial x_1} =2x_1 - x_2 -1 $$
$$\frac{\partial f }{\partial x_2} =-x_1 + 2yx_2 -1 $$
So stationary points (optima or saddle points) can occur only where both of these vanish:
$$2x_1 - x_2 =1 $$
$$-x_1 + 2yx_2 =1 $$
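For a concrete parameter value (say $y = 1$, my choice for illustration), the stationary point can be found by solving this $2\times 2$ linear system:

```python
import numpy as np

# Solve  2*x1 -    x2 = 1
#         -x1 + 2y*x2 = 1   at the sample parameter value y = 1 (my choice).
y = 1.0
A = np.array([[2.0, -1.0],
              [-1.0, 2.0 * y]])
b = np.array([1.0, 1.0])
x1, x2 = np.linalg.solve(A, b)
# Closed form: x1 = (1 + 2y)/(4y - 1), x2 = 3/(4y - 1); both equal 1 at y = 1.

# Verify both partial derivatives vanish there:
assert abs(2*x1 - x2 - 1) < 1e-12
assert abs(-x1 + 2*y*x2 - 1) < 1e-12
```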
Computing the Hessian (the matrix of second derivatives):
$$H= \begin{pmatrix}2 &-1\\-1 &2y\end{pmatrix},$$
we see that the determinant of the Hessian, $4y-1$, is zero when $y=\frac{1}{4}$ and positive for $y>\frac{1}{4}$ (and since $H_{11} = 2 > 0$, the Hessian is then positive definite).
Thus, for $y>\frac{1}{4}$, a minimum exists. For $y<\frac{1}{4}$, the Hessian is indefinite, so the function has a saddle point and no minimum. For $y=\frac{1}{4}$, the test is inconclusive and higher-order tests are required; see https://mathworld.wolfram.com/SecondDerivativeTest.html.
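The eigenvalue test can be checked numerically for values of $y$ on both sides of $\frac{1}{4}$ (the sample values $y = 1$, $y = 0$ are my own):

```python
import numpy as np

def hessian(y):
    return np.array([[2.0, -1.0],
                     [-1.0, 2.0 * y]])

# y > 1/4: both eigenvalues positive -> positive definite -> minimum
eig_min = np.linalg.eigvalsh(hessian(1.0))
assert np.all(eig_min > 0)

# y < 1/4: eigenvalues of opposite sign -> indefinite -> saddle point
eig_saddle = np.linalg.eigvalsh(hessian(0.0))
assert eig_saddle[0] < 0 < eig_saddle[1]

# y = 1/4: det H = 4y - 1 = 0, so the second-derivative test is inconclusive
assert abs(np.linalg.det(hessian(0.25))) < 1e-12
```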
Let's assume we are talking about stochastic gradient descent where we update the weights based on a single example (not a minibatch), out of a total data set of size $N$.
The total error over the whole set is given by: $$ L(w) = \frac{1}{N}\sum\limits_{n=1}^{N} L_n(w) $$ Then, at every step, a random sample point $n\sim U$ is chosen, and we update the weights via: $$ w \leftarrow w - \gamma\nabla L_n(w) $$ where $U$ means uniform over the data set. Now we want to know whether $\mathbb{E}_{n\sim U}[\nabla L_n(w)]=\nabla L(w)$.
We show this as follows: \begin{align} \mathbb{E}_{n\sim U}[\nabla L_n(w)] &= \nabla\; \mathbb{E}_{n\sim U}[ L_n(w)] \\ &= \nabla \sum\limits_{i=1}^N P(n=i) L_i(w)\\ &= \nabla \frac{1}{N}\sum\limits_{i=1}^{N} L_i(w)\\ &= \nabla L(w) \end{align}
The first step, interchanging the gradient and the expectation, is the only delicate one, although in this discrete case it is immediate because the expectation is a finite sum; in general, the interchange is justified when $L$ is sufficiently smooth and bounded (which it typically is).
The other steps just use the definition of discrete expectation (and the argument carries over to continuous sample spaces under similar regularity assumptions).
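The identity $\mathbb{E}_{n\sim U}[\nabla L_n(w)] = \nabla L(w)$ can be checked numerically for a simple least-squares loss (the data and the loss $L_n(w) = \frac{1}{2}(x_n^\top w - t_n)^2$ are my own toy example, not from the answer):

```python
import numpy as np

# Toy check that E_{n~U}[grad L_n(w)] = grad L(w) for
# L_n(w) = 0.5 * (x_n . w - t_n)**2 (my example).
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
t = rng.normal(size=N)
w = rng.normal(size=d)

def grad_Ln(n, w):
    # Gradient of the single-sample loss: (x_n . w - t_n) * x_n
    return (X[n] @ w - t[n]) * X[n]

# Full-batch gradient of L(w) = (1/N) sum_n L_n(w)
grad_L = (X @ w - t) @ X / N

# Expectation over n ~ Uniform{1,...,N} is the average of per-sample gradients
expected_grad = np.mean([grad_Ln(n, w) for n in range(N)], axis=0)

assert np.allclose(expected_grad, grad_L)
```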