Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed form solution.
Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg\min_{\beta} \| X \beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
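For concreteness, here is a minimal numpy sketch of the closed form on a small dense problem (the sizes and variable names are purely illustrative):

```python
import numpy as np

# Small, dense illustrative problem; sizes and names are arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
beta_true = rng.normal(size=10)
y = X @ beta_true + 0.1 * rng.normal(size=1000)

# Closed form beta_hat = (X^T X)^{-1} X^T y, computed by solving the
# normal equations rather than explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same problem via an SVD-based routine,
# which is numerically safer when X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```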
Now, imagine that $X$ is a very large but sparse matrix. For example, $X$ might have 100,000 columns and 1,000,000 rows, but only 0.001% of the entries in $X$ are nonzero. There are specialized data structures for storing only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes). This would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or more commonly a Cholesky factor) would also tend to have mostly nonzero entries.
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.
This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
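To make that concrete, here is a small sketch at a reduced scale (names and sizes are illustrative) using SciPy's LSQR, one such iterative solver; it only ever applies $X$ and $X^{T}$ to vectors, so $X^{T}X$ is never formed:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

# A (much smaller) stand-in for the huge sparse design matrix described above.
rng = np.random.default_rng(0)
m, n = 100_000, 1_000                      # illustrative sizes
X = sparse.random(m, n, density=1e-4, format="csr", random_state=0)
beta_true = rng.normal(size=n)
y = X @ beta_true + 0.01 * rng.normal(size=m)

# LSQR is a Krylov-subspace iterative method: each iteration needs only the
# products X @ v and X.T @ u, so storage stays on the order of X, y, and beta.
beta_hat = lsqr(X, y, atol=1e-10, btol=1e-10)[0]
```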
You have a couple of mistakes in your updates. I think you're generally confusing the value of the current weights with the difference between the current weights and the previous weights. You have $\Delta$ symbols scattered around where there shouldn't be any, and += where you should have =.
Perceptron:
$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} + \eta_t (y^{(i)} - \hat{y}^{(i)}) \pmb{x}^{(i)}$,
where $\hat{y}^{(i)} = \text{sign} ({\pmb{w}^{(t)}}^\top\pmb{x}^{(i)})$ is the model's prediction on the $i^{th}$ training example.
This can be viewed as a stochastic subgradient descent method on the following "perceptron loss" function*:
Perceptron loss:
$L_{\pmb{w}}(y^{(i)}) = \max(0, -y^{(i)} \pmb{w}^\top\pmb{x}^{(i)})$.
$\partial L_{\pmb{w}}(y^{(i)}) = \begin{cases}
\{ 0 \}, & \text{ if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} > 0 \\
\{ -y^{(i)} \pmb{x}^{(i)} \}, & \text{ if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} < 0 \\
[-1, 0] \times y^{(i)} \pmb{x}^{(i)}, & \text{ if } \pmb{w}^\top\pmb{x}^{(i)} = 0
\end{cases}$.
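In code, a per-example stochastic subgradient step on this loss looks like the following minimal numpy sketch (assuming labels in $\{-1, +1\}$; the function names are mine, not from your post):

```python
import numpy as np

def perceptron_subgradient(w, x, y):
    """One subgradient of the perceptron loss max(0, -y * w.x) at w.

    At a tie (w.x == 0) this picks -y * x, i.e. the perceptron convention
    of always taking a step when the margin is not strictly positive.
    """
    if y * np.dot(w, x) > 0:
        return np.zeros_like(w)   # correct with positive margin: no update
    return -y * x                 # mistake (or tie): push w toward y * x

def sgd_step(w, x, y, eta):
    """w^(t+1) = w^(t) - eta_t * (subgradient at w^(t)).

    With labels in {-1, +1} this matches the perceptron update
    w + eta * (y - sign(w.x)) * x up to a factor of 2 absorbed into eta.
    """
    return w - eta * perceptron_subgradient(w, x, y)
```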
Since the perceptron already is a form of SGD, I'm not sure why the SGD update should be different from the perceptron update. The way you've written the SGD step, with non-thresholded values, you suffer a loss even if you predict an answer too correctly. That's bad.
Your batch gradient step is wrong because you're using "+=" where you should be using "=": the current weights get added back in once for each training instance. In other words, the way you've written it,
$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} + \sum_{i=1}^n \{\pmb{w}^{(t)} - \eta_t \partial L_{\pmb{w}^{(t)}}(y^{(i)}) \}$.
What it should be is:
$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} - \eta_t \sum_{i=1}^n {\partial L_{\pmb{w}^{(t)}}(y^{(i)}) }$.
Also, in order for the algorithm to converge on any data set, you should decrease your learning rate on a schedule, e.g. $\eta_t = \frac{\eta_0}{\sqrt{t}}$.
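Putting the corrected batch step and the decaying learning rate together, a sketch (again assuming labels in $\{-1, +1\}$ and treating ties as mistakes) might look like:

```python
import numpy as np

def batch_subgradient_descent(X, y, eta0=1.0, n_iters=100):
    """Batch subgradient descent on the perceptron loss with eta_t = eta0 / sqrt(t)."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iters + 1):
        eta_t = eta0 / np.sqrt(t)
        # Sum one subgradient per training example: -y_i * x_i where the
        # margin y_i * w.x_i is not positive, and 0 elsewhere.
        margins = y * (X @ w)
        mistakes = margins <= 0
        grad = -(y[mistakes, None] * X[mistakes]).sum(axis=0)
        w = w - eta_t * grad      # note '=' here, not '+='
    return w
```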
* The perceptron algorithm is not exactly the same as SSGD on the perceptron loss. Usually in SSGD, in the case of a tie ($\pmb{w}^\top\pmb{x}^{(i)} = 0$), $\partial L = [-1, 0] \times y^{(i)} \pmb{x}^{(i)}$, so $\pmb{0} \in \partial L$ and you would be allowed to not take a step. Accordingly, perceptron loss can be minimized at $\pmb{w} = \pmb{0}$, which is useless. But in the perceptron algorithm, you are required to break ties, and to use the subgradient direction $-y^{(i)} \pmb{x}^{(i)} \in \partial L$ if you choose the wrong answer.
So they're not exactly the same, but if you work from the assumption that the perceptron algorithm is SGD for some loss function, and reverse engineer the loss function, perceptron loss is what you end up with.
The typical assumptions about the validity of linear regression don't really have much to do with how you optimize the model; they're about whether the learned model will be "right." That is: