Solved – From the Perceptron rule to Gradient Descent: How are Perceptrons with a sigmoid activation function different from Logistic Regression

Tags: classification, gradient descent, logistic, neural networks, perceptron

Essentially, my question is this: in multilayer perceptrons, perceptrons are used with a sigmoid activation function, so that in the update rule $\hat{y}$ is calculated as

$$\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$$

How does this "sigmoid" Perceptron differ from a logistic regression then?

I would say that a single-layer sigmoid perceptron is equivalent to logistic regression in the sense that both use $\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$ in the update rule. Also, both threshold $\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$ to make the prediction. However, in multilayer perceptrons, the sigmoid activation function is used to return a probability, not an on/off signal, in contrast to logistic regression and a single-layer perceptron.

I think the usage of the term "perceptron" may be a little ambiguous, so let me provide some background based on my current understanding of single-layer perceptrons:

Classic perceptron rule

First, the classic perceptron by F. Rosenblatt, where we have a step function:

$$\Delta w_k = \eta(y_{i} - \hat{y_i})x_{ik} \quad\quad y_{i}, \hat{y_i} \in \{-1,1\}$$

to update the weights

$$w_k := w_k + \Delta w_k \quad \quad (k \in \{1, \dots, d\})$$

So that $\hat{y}$ is calculated as

$$\hat{y} = \operatorname{sign}(\mathbf{w}^T\mathbf{x}_i) = \operatorname{sign}(w_0 + w_1x_{i1} + \dots + w_dx_{id})$$
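
To make this concrete, here is a minimal sketch in Python/NumPy of how I understand the rule; the data `X`, the labels `y`, the learning rate `eta`, and the epoch count are hypothetical placeholders, not part of any particular library:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=10):
    """Classic Rosenblatt perceptron: online updates with a step activation.
    X is an (n_samples, d) feature matrix, y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1] + 1)                 # w[0] is the bias weight w_0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            net = w[0] + w[1:] @ x_i             # w^T x_i including the bias
            y_hat = 1.0 if net >= 0.0 else -1.0  # step ("sign") activation
            delta = eta * (y_i - y_hat)          # zero when the sample is classified correctly
            w[0]  += delta                       # bias update (x_{i0} = 1)
            w[1:] += delta * x_i                 # Delta w_k = eta * (y_i - y_hat_i) * x_{ik}
    return w
```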

Gradient Descent

Using gradient descent, we optimize (minimize) the cost function

$$J(\mathbf{w}) = \sum_{i} \frac{1}{2}(y_i - \hat{y_i})^2 \quad \quad y_i,\hat{y_i} \in \mathbb{R}$$

where we have "real" numbers, so I see this as basically analogous to linear regression, with the difference that our classification output is thresholded.

Here, we take a step in the negative direction of the gradient when we update the weights

$$\Delta w_k = - \eta \frac{\partial J}{\partial w_k} = - \eta \sum_i (y_i - \hat{y_i})(- x_{ik}) = \eta \sum_i (y_i - \hat{y_i})x_{ik}$$

But here, we have $\hat{y} = \mathbf{w}^T\mathbf{x}_i$ instead of $\hat{y} = \operatorname{sign}(\mathbf{w}^T\mathbf{x}_i)$

$$w_k := w_k + \Delta w_k \quad \quad (k \in \{1, \dots, d\})$$

Also, we calculate the sum of squared errors over a complete pass through the entire training dataset (batch learning mode), in contrast to the classic perceptron rule, which updates the weights as new training samples arrive (analogous to stochastic gradient descent, i.e., online learning).
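
Again as a minimal sketch of my understanding (hypothetical `X`, `y`, `eta`, and epoch count), the batch version looks like this in Python/NumPy:

```python
import numpy as np

def train_linear_gd(X, y, eta=0.01, epochs=50):
    """Batch gradient descent on J(w) = sum_i 0.5 * (y_i - w^T x_i)^2
    (linear activation; one weight update per full pass over the training set)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of 1s for the bias w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y_hat = Xb @ w                             # y_hat = w^T x_i, no sign/step here
        w += eta * Xb.T @ (y - y_hat)              # Delta w_k = eta * sum_i (y_i - y_hat_i) x_{ik}
    return w
```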

Sigmoid activation function

Now, here is my question:

In multilayer perceptrons, perceptrons are used with a sigmoid activation function, so that in the update rule $\hat{y}$ is calculated as

$$\hat{y} = \frac{1}{1+\exp(-\mathbf{w}^T\mathbf{x}_i)}$$

How does this "sigmoid" Perceptron differ from a logistic regression then?

Best Answer

Using gradient descent, we optimize (minimize) the cost function

$$J(\mathbf{w}) = \sum_{i} \frac{1}{2}(y_i - \hat{y_i})^2 \quad \quad y_i,\hat{y_i} \in \mathbb{R}$$

If you minimize the mean squared error, then it's different from logistic regression. Logistic regression is normally associated with the cross-entropy loss; there is an introductory page on it in the scikit-learn documentation.
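
To make the distinction concrete, here is a small sketch, assuming a single hypothetical sample with net input $z = \mathbf{w}^T\mathbf{x}_i$ and target $y \in \{0,1\}$, comparing the two costs and their gradients with respect to the net input:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical single sample: net input z = w^T x_i and target y in {0, 1}
z, y = 2.0, 1.0
y_hat = sigmoid(z)

mse  = 0.5 * (y - y_hat) ** 2                               # squared-error cost on the sigmoid output
xent = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))   # cross-entropy cost

# gradients of each cost with respect to the net input z:
grad_mse  = (y_hat - y) * y_hat * (1 - y_hat)  # extra sigmoid'(z) factor; vanishes when the unit saturates
grad_xent = (y_hat - y)                        # the sigmoid'(z) factor cancels with cross-entropy

print(mse, xent, grad_mse, grad_xent)
```

Notice that with the squared error the gradient picks up an extra $\hat{y}(1-\hat{y})$ factor from the chain rule, which is why the two training procedures behave differently even though the model has the same functional form.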


(I'll assume "multilayer perceptron" refers to the same thing as "neural network".)

If you use the cross-entropy loss (with regularization) for a single-layer neural network, then it is going to be the same model (a log-linear model) as logistic regression. If you use a multi-layer network instead, it can be thought of as logistic regression with parametric nonlinear basis functions.
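
As a sketch of that equivalence (with hypothetical toy data and a plain gradient-descent loop, not any particular library's implementation), a single sigmoid unit trained with the cross-entropy loss is exactly the unregularized logistic regression model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg_gd(X, y, eta=0.1, epochs=1000):
    """A single sigmoid unit trained with the cross-entropy loss by batch
    gradient descent, i.e. (unregularized) logistic regression."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y_hat = sigmoid(Xb @ w)                    # predicted P(y_i = 1 | x_i)
        w -= eta * Xb.T @ (y_hat - y)              # gradient of the cross-entropy loss
    return w

# hypothetical toy data with labels in {0, 1}
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = train_logreg_gd(X, y)
probs = sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w)
print(probs)                                       # threshold at 0.5 to get class labels
```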


However, in multilayer perceptrons, the sigmoid activation function is used to return a probability, not an on/off signal, in contrast to logistic regression and a single-layer perceptron.

The output of both logistic regression and neural networks with a sigmoid activation function can be interpreted as a probability, since the cross-entropy loss is actually the negative log-likelihood defined through the Bernoulli distribution.
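
To spell that out: for a single sample with $y_i \in \{0, 1\}$ and $\hat{y}_i = P(y_i = 1 \mid \mathbf{x}_i)$, the Bernoulli likelihood is $\hat{y}_i^{\,y_i}(1-\hat{y}_i)^{1-y_i}$, so its negative log is

$$-\log P(y_i \mid \mathbf{x}_i) = -\left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],$$

which, summed over the training samples, is exactly the cross-entropy cost that logistic regression minimizes.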