Using gradient descent, we optimize (minimize) the cost function
$$J(\mathbf{w}) = \sum_{i} \frac{1}{2}(y_i - \hat{y_i})^2 \quad \quad y_i,\hat{y_i} \in \mathbb{R}$$
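For concreteness, this cost can be minimized with plain gradient descent. Below is a minimal NumPy sketch; note that the linear model $\hat{y} = Xw$ is my assumption for illustration, since the text above does not specify the model.

```python
import numpy as np

# Minimal sketch: gradient descent on J(w) = sum_i 0.5*(y_i - y_hat_i)^2,
# assuming (my assumption) a linear model y_hat = X @ w.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w = np.zeros(3)
lr = 0.005                      # learning rate
for _ in range(2000):
    grad = X.T @ (X @ w - y)    # gradient of sum_i 0.5*(y_i - y_hat_i)^2
    w -= lr * grad
# w now approximates w_true
```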
Minimizing the mean squared error gives you a different model from logistic regression. Logistic regression is normally associated with the cross-entropy loss; the scikit-learn documentation has an introductory page on it.
(I'll assume "multilayer perceptron" and "neural network" refer to the same thing.)
If you use the cross-entropy loss (with regularization) for a single-layer neural network, then it is the same model (a log-linear model) as logistic regression.
If you use a multi-layer network instead, it can be thought of as logistic regression with parametric nonlinear basis functions.
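This view can be sketched in code (all names here are illustrative, not from the answer): the output unit of a one-hidden-layer network is logistic regression applied to learned, nonlinear features of the input.

```python
import numpy as np

# Illustrative sketch only: a one-hidden-layer network's output unit is
# logistic regression on the learned basis h = tanh(W1 @ x).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
W1 = rng.normal(size=(4, 3))  # parameters of the nonlinear basis functions
w2 = rng.normal(size=4)       # logistic-regression weights on that basis

x = rng.normal(size=3)
h = np.tanh(W1 @ x)           # parametric nonlinear basis functions of x
p = sigmoid(w2 @ h)           # logistic regression on h; p lies in (0, 1)
```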
However, in multilayer perceptrons the sigmoid activation function returns a probability, not the on/off signal produced by a single-layer perceptron's step function.
The output of both logistic regression and a neural network with a sigmoid activation function can be interpreted as a probability, since the cross-entropy loss is the negative log-likelihood under a Bernoulli distribution.
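A quick numerical check of that claim (a sketch with made-up labels and logits): the cross-entropy loss coincides with the Bernoulli negative log-likelihood.

```python
import numpy as np

# Sketch: cross-entropy equals the negative log-likelihood of the labels
# under a Bernoulli distribution with parameter p = sigmoid(z).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([1, 0, 1, 1])            # made-up binary labels
z = np.array([2.0, -1.0, 0.5, 3.0])   # made-up logits
p = sigmoid(z)

cross_entropy = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Bernoulli likelihood of each label: p if y = 1, (1 - p) if y = 0
bernoulli_nll = -np.sum(np.log(np.where(y == 1, p, 1 - p)))
# cross_entropy and bernoulli_nll are the same number
```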
I believe that the key to answering this question is to point out that the element-wise multiplication is actually shorthand and therefore when you derive the equations you never actually use it.
The actual operation is not an element-wise multiplication but instead a standard matrix multiplication of a gradient with a Jacobian, always.
In the case of the nonlinearity, the Jacobian of its vector output with respect to its vector input happens to be a diagonal matrix. The gradient multiplied by this matrix is therefore equal to the gradient of the loss with respect to the output of the nonlinearity, element-wise multiplied by a vector containing the partial derivatives of the nonlinearity with respect to its input; but this equality follows from the Jacobian being diagonal. You must pass through the Jacobian step to get to the element-wise multiplication, which might explain your confusion.
In math, we have some nonlinearity $s$, a loss $L$, and an input to the nonlinearity $x \in \mathbb{R}^{n \times 1}$ (this could be any tensor). The output of the nonlinearity has the same dimension, $s(x) \in \mathbb{R}^{n \times 1}$---as @Logan says, the activation function is applied element-wise.
We want $$\nabla_{x}L=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L$$
Where $\dfrac{\partial s(x)}{\partial x}$ is the Jacobian of $s$. Expanding this Jacobian, we get
$$\begin{bmatrix}
\dfrac{\partial{s(x_{1})}}{\partial{x_1}} & \dots & \dfrac{\partial{s(x_{1})}}{\partial{x_{n}}} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial{s(x_{n})}}{\partial{x_{1}}} & \dots & \dfrac{\partial{s(x_{n})}}{\partial{x_{n}}}
\end{bmatrix}$$
Since $s$ is applied element-wise, $\partial s(x_i)/\partial x_j = 0$ whenever $i \neq j$, so this matrix is zero everywhere except on the diagonal. We can make a vector of all its diagonal elements $$Diag\left(\dfrac{\partial s(x)}{\partial x}\right)$$
And then use the element-wise operator.
$$\nabla_{x}L
=\left({\dfrac{\partial s(x)}{\partial x}}\right)^T\nabla_{s(x)}L
=Diag\left(\dfrac{\partial s(x)}{\partial x}\right) \circ \nabla_{s(x)}L$$
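This identity is easy to verify numerically. Here is a sketch for $s = \sigma$ (the sigmoid), with a random vector standing in for the upstream gradient $\nabla_{s(x)}L$.

```python
import numpy as np

# Numerical check: multiplying the upstream gradient by the (diagonal)
# Jacobian of the nonlinearity equals an element-wise product with the
# vector of its diagonal entries.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
x = rng.normal(size=5)
grad_sx = rng.normal(size=5)        # stands in for \nabla_{s(x)} L

ds = sigmoid(x) * (1 - sigmoid(x))  # diagonal entries of the Jacobian
J = np.diag(ds)                     # the full (diagonal) Jacobian

via_jacobian = J.T @ grad_sx        # (\partial s(x)/\partial x)^T grad
via_elementwise = ds * grad_sx      # Diag(...) \circ grad
# both vectors are identical
```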
Best Answer
Here is my attempt.
$$J(x) = -\frac{1}{m}\sum_{i = 1}^{m} \big[b_i \ln(h_i) + (1 - b_i)\ln(1 - h_i)\big]$$
where $h_i = \sigma(x^Ta_i)$. Let $A = [a_1^T, \dots, a_m^T]^T$. Assuming $\ln, \sigma, \frac{1}{\cdot}$ work element-wise on vectors, $\odot$ is element-wise multiplication, and $\mathbb{1}$ is a vector of $1$s, we have
$$J(x) = -\frac{1}{m}\big[b^T\ln(\sigma(Ax)) + (\mathbb{1} - b)^T\ln(\mathbb{1} - \sigma(Ax))\big]$$
Now
$$\frac{\partial J(x)}{\partial x} = -\frac{1}{m}\Big[\frac{\partial}{\partial x}b^T\ln(\sigma(Ax)) + \frac{\partial}{\partial x}(\mathbb{1} - b)^T\ln(\mathbb{1} - \sigma(Ax))\Big] \\ = -\frac{1}{m}\Big[\frac{\partial \ln(\sigma(Ax))}{\partial x}b + \frac{\partial \ln(\mathbb{1} - \sigma(Ax))}{\partial x}(\mathbb{1} - b)\Big] \\ = -\frac{1}{m}\Big[\frac{\partial \sigma(Ax)}{\partial x} \big(\frac{1}{\sigma(Ax)}\odot b\big) + \frac{\partial (\mathbb{1} - \sigma(Ax))}{\partial x}\big(\frac{1}{\mathbb{1} - \sigma(Ax)}\odot (\mathbb{1} - b)\big)\Big] \\ = -\frac{1}{m}\Big[\frac{\partial Ax}{\partial x} \big(\sigma(Ax) \odot (\mathbb{1} - \sigma(Ax)) \odot \frac{1}{\sigma(Ax)}\odot b\big) - \frac{\partial Ax}{\partial x}\big(\sigma(Ax) \odot (\mathbb{1} - \sigma(Ax)) \odot \frac{1}{\mathbb{1} - \sigma(Ax)}\odot (\mathbb{1} - b)\big)\Big] \\ = -\frac{1}{m}\Big[A^T \big((\mathbb{1} - \sigma(Ax)) \odot b\big) - A^T\big(\sigma(Ax) \odot (\mathbb{1} - b)\big)\Big] \\ = -\frac{1}{m}\Big[A^T \big(b - \sigma(Ax) \odot b - \sigma(Ax) + \sigma(Ax) \odot b\big)\Big] \\ = -\frac{1}{m}\Big[A^T \big(b - \sigma(Ax)\big)\Big] \\ = \frac{1}{m}\Big[A^T \big(\sigma(Ax) - b\big)\Big] $$
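The final expression can be sanity-checked against a finite-difference gradient of $J$; the sketch below reuses the names from the derivation ($A$, $b$, $x$, $\sigma$) with random data.

```python
import numpy as np

# Sanity check: compare A^T (sigma(Ax) - b) / m against a central
# finite-difference gradient of J (names follow the derivation above).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def J(x, A, b):
    h = sigmoid(A @ x)
    m = len(b)
    return -np.sum(b * np.log(h) + (1 - b) * np.log(1 - h)) / m

rng = np.random.default_rng(3)
m, n = 20, 4
A = rng.normal(size=(m, n))
b = rng.integers(0, 2, size=m).astype(float)
x = rng.normal(size=n)

analytic = A.T @ (sigmoid(A @ x) - b) / m

eps = 1e-6
numeric = np.zeros(n)
for j in range(n):
    e = np.zeros(n)
    e[j] = eps
    numeric[j] = (J(x + e, A, b) - J(x - e, A, b)) / (2 * eps)
# analytic and numeric agree to finite-difference precision
```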