You have a couple of mistakes in your updates. I think you're generally confusing the value of the current weights with the difference between the current weights and the previous weights: you have $\Delta$ symbols scattered around where there shouldn't be any, and += where you should have =.
Perceptron:
$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} + \eta_t (y^{(i)} - \hat{y}^{(i)}) \pmb{x}^{(i)}$,
where $\hat{y}^{(i)} = \text{sign} ({\pmb{w}^\top\pmb{x}^{(i)}})$ is the model's prediction on the $i^{th}$ training example.
This can be viewed as a stochastic subgradient descent method on the following "perceptron loss" function*:
Perceptron loss:
$L_{\pmb{w}}(y^{(i)}) = \max(0, -y^{(i)} \pmb{w}^\top\pmb{x}^{(i)})$.
$\partial L_{\pmb{w}}(y^{(i)}) = \begin{cases}
\{ \pmb{0} \}, & \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} > 0 \\
\{ -y^{(i)} \pmb{x}^{(i)} \}, & \text{if } y^{(i)} \pmb{w}^\top\pmb{x}^{(i)} < 0 \\
[-1, 0] \times y^{(i)} \pmb{x}^{(i)}, & \text{if } \pmb{w}^\top\pmb{x}^{(i)} = 0
\end{cases}$.
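To make the correspondence concrete, here is a minimal sketch of the perceptron update and the perceptron loss in NumPy (the function names `perceptron_step` and `perceptron_loss` are mine, not from any library):

```python
import numpy as np

def perceptron_step(w, x, y, eta=1.0):
    """One perceptron update: w + eta * (y - sign(w.x)) * x.

    With labels y in {-1, +1}, (y - y_hat) is 0 on a correct prediction
    and +/-2 on a mistake, so this reduces to the classic
    "add eta * y * x on a mistake" rule (up to the factor of 2).
    """
    y_hat = np.sign(w @ x)
    return w + eta * (y - y_hat) * x

def perceptron_loss(w, x, y):
    """max(0, -y * w.x): zero iff the example is classified correctly."""
    return max(0.0, -y * (w @ x))
```

Note that `perceptron_loss(np.zeros(d), x, y)` is always zero, which is exactly the degenerate minimizer discussed in the footnote below.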
Since the perceptron already is a form of SGD, I'm not sure why the SGD update should be different from the perceptron update. The way you've written the SGD step, with non-thresholded values, you suffer a loss if you predict an answer "too correctly". That's bad.
Your batch gradient step is wrong because you're using "+=" when you should be using "=". As written, the current weights get added once for each training instance. In other words, what you've written is
$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} + \sum_{i=1}^n \{\pmb{w}^{(t)} - \eta_t \partial L_{\pmb{w}^{(t)}}(y^{(i)}) \}$.
What it should be is:
$\pmb{w}^{(t+1)} = \pmb{w}^{(t)} - \eta_t \sum_{i=1}^n {\partial L_{\pmb{w}^{(t)}}(y^{(i)}) }$.
Also, in order for the algorithm to converge on any data set, you should decrease your learning rate on a schedule, e.g. $\eta_t = \frac{\eta_0}{\sqrt{t}}$.
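Here is a sketch of the corrected batch update with a decaying learning rate, under the same perceptron loss as above (the helper names `subgrad`, `batch_step`, and `train` are mine):

```python
import numpy as np

def subgrad(w, x, y):
    """A subgradient of the perceptron loss max(0, -y * w.x) at (x, y)."""
    if y * (w @ x) > 0:
        return np.zeros_like(w)
    return -y * x  # valid subgradient on a mistake, and a tie-breaking choice at w.x = 0

def batch_step(w, X, Y, eta_t):
    """w_{t+1} = w_t - eta_t * sum_i subgrad_i  (note '=', not '+=')."""
    g = sum(subgrad(w, x, y) for x, y in zip(X, Y))
    return w - eta_t * g

def train(X, Y, eta0=1.0, steps=100):
    w = np.zeros(X.shape[1])
    for t in range(1, steps + 1):
        w = batch_step(w, X, Y, eta0 / np.sqrt(t))  # eta_t = eta0 / sqrt(t)
    return w
```

On a linearly separable set the subgradients all become zero once every example is classified correctly, so the iterates stop moving.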
* The perceptron algorithm is not exactly the same as SSGD on the perceptron loss. Usually in SSGD, in the case of a tie ($\pmb{w}^\top\pmb{x}^{(i)} = 0$), $\partial L= [-1, 0] \times y^{(i)} \pmb{x}^{(i)}$, so $\pmb{0} \in \partial L$, so you would be allowed to not take a step. Accordingly, perceptron loss can be minimized at $\pmb{w} = \pmb{0}$, which is useless. But in the perceptron algorithm, you are required to break ties, and use the subgradient direction $-y^{(i)} \pmb{x}^{(i)} \in \partial L$ if you choose the wrong answer.
So they're not exactly the same, but if you work from the assumption that the perceptron algorithm is SGD for some loss function, and reverse engineer the loss function, perceptron loss is what you end up with.
Using gradient descent, we optimize (minimize) the cost function
$$J(\mathbf{w}) = \sum_{i} \frac{1}{2}(y_i - \hat{y_i})^2 \quad \quad y_i,\hat{y_i} \in \mathbb{R}$$
If you minimize the mean squared error, then it's different from logistic regression. Logistic regression is normally associated with the cross entropy loss; here is an introduction page from the scikit-learn library.
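A quick numerical illustration of why the two losses behave differently when the output goes through a sigmoid (this sketch and its function names are mine): with cross entropy the gradient with respect to the pre-activation score $z$ is $p - y$, while with squared error it picks up an extra factor $p(1-p)$ and vanishes for confident wrong predictions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(y, p):
    return 0.5 * (y - p) ** 2

def cross_entropy(y, p):
    """Negative Bernoulli log likelihood, the usual logistic regression loss."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z = 5.0                           # confidently positive score
y = 0                             # but the true label is 0
p = sigmoid(z)
grad_ce = p - y                   # d cross_entropy / dz
grad_mse = (p - y) * p * (1 - p)  # d mse / dz: extra p(1-p) factor
# grad_ce stays close to 1 while grad_mse is tiny: the squared-error
# gradient vanishes exactly when the model is confidently wrong.
```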
(I'll assume multilayer perceptrons are the same thing called neural networks.)
If you used the cross entropy loss (with regularization) for a single-layer neural network, then it's going to be the same model (log-linear model) as logistic regression.
If you use a multi-layer network instead, it can be thought of as logistic regression with parametric nonlinear basis functions.
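A minimal sketch of that view (the function name `mlp_forward` and the choice of `tanh` for the hidden layer are mine): the hidden layer computes the parametric basis functions $\phi(\pmb{x})$, and the output layer is ordinary logistic regression on $\phi(\pmb{x})$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, w2, b2):
    """One-hidden-layer MLP = logistic regression on learned features."""
    phi = np.tanh(W1 @ x + b1)     # parametric nonlinear basis functions
    return sigmoid(w2 @ phi + b2)  # logistic regression on phi(x)
```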
However, in multilayer perceptrons, the sigmoid activation function is used to return a probability, not an on/off signal, in contrast to logistic regression and a single-layer perceptron.
The output of both logistic regression and neural networks with a sigmoid activation function can be interpreted as probabilities, since the cross entropy loss is actually the negative log likelihood defined through the Bernoulli distribution.
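This identity is easy to check numerically (a small sketch; the function names are mine): the cross entropy of a label $y \in \{0, 1\}$ against a predicted probability $p$ equals $-\log P(y \mid p)$ under a Bernoulli distribution with parameter $p$.

```python
import numpy as np

def bernoulli_nll(y, p):
    """-log P(y | p) for y in {0, 1} under Bernoulli(p)."""
    return -np.log(p if y == 1 else 1 - p)

def cross_entropy(y, p):
    """The usual binary cross entropy; identical to the Bernoulli NLL."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))
```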
The difference is that the first formula is the derivative of just the output of a perceptron, while the second is the derivative of the non-linear activation of the perceptron.
When stacking perceptron layers (MLP, multi-layer perceptron), you have to add some non-linearity to the output of each layer; otherwise the whole process is linear (and can be modeled with a single layer).
So the output of the perceptron (or, more accurately, the input of the next layer) becomes $\sigma(\pmb{w}^\top\pmb{x})$, where $\sigma$ is the non-linear activation, and the derivative is as in your second formula.
If you are not using a non-linear activation (single layer), the output is $\pmb{w}^\top\pmb{x}$, and the derivative is as in your first formula.
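The two derivatives can be checked against each other numerically (a sketch; the names `grad_linear` and `grad_sigmoid` are mine, and "first/second formula" refers to the formulas asked about above): without an activation, $\frac{d}{d\pmb{w}}\,\pmb{w}^\top\pmb{x} = \pmb{x}$; with a sigmoid, the chain rule adds a factor $\sigma(z)(1-\sigma(z))$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_linear(w, x):
    """d(w.x)/dw = x  (no activation)."""
    return x

def grad_sigmoid(w, x):
    """d sigmoid(w.x)/dw = sigmoid(z) * (1 - sigmoid(z)) * x."""
    s = sigmoid(w @ x)
    return s * (1 - s) * x
```

A central finite difference on `sigmoid(w @ x)` agrees with `grad_sigmoid` component by component, which is a handy sanity check when implementing backpropagation by hand.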