Your explanation of Bayesian logistic regression looks pretty reasonable. A few remarks:
1. Using different priors for $a$ and $b$. Can I suggest that we use different variance hyperparameters $\sigma^2_a$ and $\sigma^2_b$ for $\mathbf a$ and $b$? Since $b$ is the intercept, you probably want to pick quite a flat prior for $b$, by picking a large value for $\sigma^2_b$.
It would also make notation simpler later on if we incorporate $b$ into $\mathbf a$, as an extra component. In other words, $\mathbf a_{\rm mine} = (\mathbf a_{\rm yours}, b) \in \mathbb R^{p+1}$. This is valid, provided that we also add an additional constant component to each of the feature vectors $\mathbf x_i$, i.e. $\mathbf x_{i, \rm mine} = (\mathbf x_{i, \rm yours}, 1) \in \mathbb R^{p + 1}$.
The prior variance for $\mathbf a$ is then a $(p+1)\times(p+1)$ diagonal matrix $\mathbf \Sigma$, whose first $p$ diagonal entries are $\sigma_a^2$ and whose final entry is $\sigma_b^2$.
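A minimal NumPy sketch of this setup (the sizes, the placeholder data, and the variance values are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 3                                   # number of original features (hypothetical)
sigma2_a, sigma2_b = 1.0, 100.0         # large sigma2_b gives a flat prior on the intercept

X = rng.standard_normal((10, p))        # placeholder feature matrix, one row per x_i
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])   # x_i -> (x_i, 1), shape (N, p+1)

Sigma = np.diag([sigma2_a] * p + [sigma2_b])       # prior covariance, shape (p+1, p+1)
```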
2. The $\mathbf x_i$'s don't need to be random variables. The presentation might be clearer if we abandon the probability distributions for the $\mathbf x_i$'s, as they are unimportant for the discussion. What is important is that the model is governed by the parameter vector $\mathbf a \in \mathbb R^{p+1}$, and a priori, this parameter vector is normally distributed:
$$ P(\mathbf a) = \mathcal N(\mathbf a | \mathbf 0, \mathbf \Sigma ). $$
Given the value of $\mathbf a$, and given the feature vectors $\mathbf x_i$, the labels $y_i$ are independent of each other, and follow Bernoulli distributions of the form:
$$ P(y_i | \mathbf x_i, \mathbf a) = \begin{cases} s(\mathbf a .\mathbf x_i) & {\rm if \ }y_i = 1 \\ 1 - s(\mathbf a . \mathbf x_i ) & {\rm if \ } y_i = 0\end{cases},$$
where $s(u) := e^u / (1 + e^u)$ is the logistic function.
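For numerical work it is better to evaluate the log of these Bernoulli probabilities directly. A small sketch (the helper names are mine), using the identity $\log(1 - s(u)) = \log s(-u)$:

```python
import numpy as np

def log_sigmoid(u):
    # log s(u) = -log(1 + e^{-u}), computed without overflow
    return -np.logaddexp(0.0, -u)

def bernoulli_loglik(y, u):
    # log P(y | x, a) with u = a . x; uses log(1 - s(u)) = log s(-u)
    return y * log_sigmoid(u) + (1 - y) * log_sigmoid(-u)
```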
The joint distribution of the parameters and the labels is then
$$ P(y_1, \dots, y_N, \mathbf a | \mathbf x_1, \dots, \mathbf x_N ) = P(\mathbf a)\prod_{i=1}^N P(y_i | \mathbf x_i , \mathbf a).$$
And by the way, "joint probability distribution" is probably a more appropriate phrase than "PDF".
3. The expression for the log of the posterior distribution. Your expression for the log of the posterior distribution seems to assume that all the $y_i$ are equal to $1$! I think it should be:
\begin{multline} \log P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \\ = {\rm const} - \frac 1 2 \mathbf a^T \mathbf \Sigma^{-1} \mathbf a + \sum_{i}\left(y_i \log s (\mathbf a . \mathbf x_i) + (1 - y_i) \log (1 - s(\mathbf a . \mathbf x_i))\right)\end{multline}
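A direct NumPy translation of this expression (up to the constant; `X` holds the augmented feature vectors as rows):

```python
import numpy as np

def log_posterior(a, X, y, Sigma_inv):
    # -(1/2) a^T Sigma^{-1} a + sum_i [ y_i log s(u_i) + (1 - y_i) log(1 - s(u_i)) ]
    u = X @ a                          # u_i = a . x_i
    return (-0.5 * a @ Sigma_inv @ a
            + np.sum(y * (-np.logaddexp(0.0, -u)) + (1 - y) * (-np.logaddexp(0.0, u))))
```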
4. Information to extract from the model: going beyond point estimates. Given the training set $(\mathbf x_1, y_1), \dots, (\mathbf x_N, y_N)$, the posterior distribution for the parameters is
$$ P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \propto P(y_1, \dots, y_N | \mathbf x_1, \dots, \mathbf x_N, \mathbf a )\, P(\mathbf a), $$
by Bayes' theorem. So one of our objectives must surely be to understand this posterior distribution.
One perfectly valid question to ask is: what value of $\mathbf a$ maximises the posterior probability? In other words, the task is to find
$$ \mathbf a_{\rm MAP} = {\rm argmax}_{\mathbf a} P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) .$$
This is what you discussed in your final paragraph: you take logs, and ignore the terms that are independent of $\mathbf a$, and optimise the log probability using gradient descent or Newton-Raphson or some similar iterative optimisation technique. (The subscript ${\rm MAP}$ stands for maximum a posteriori, by the way.)
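For illustration, here is a sketch of the MAP computation using `scipy.optimize.minimize` (a quasi-Newton method; the function and variable names are mine):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post(a, X, y, Sigma_inv):
    # (1/2) a^T Sigma^{-1} a - sum_i [ y_i u_i - log(1 + e^{u_i}) ]
    u = X @ a
    return 0.5 * a @ Sigma_inv @ a + np.sum(np.logaddexp(0.0, u) - y * u)

def neg_log_post_grad(a, X, y, Sigma_inv):
    s = 1.0 / (1.0 + np.exp(-(X @ a)))      # s(u_i)
    return Sigma_inv @ a - X.T @ (y - s)

# (using X_aug, Sigma from the earlier sketch and a 0/1 label vector y)
# a_map = minimize(neg_log_post, np.zeros(X_aug.shape[1]),
#                  args=(X_aug, y, np.linalg.inv(Sigma)),
#                  jac=neg_log_post_grad, method="BFGS").x
```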
But this only gives us a point estimate for $\mathbf a$. We don't need the Bayesian approach to do this at all. In fact, if we follow the standard non-Bayesian approach with L2-regularisation, we will arrive at exactly the same answer. So if this is all we're getting, then there is very little point in adopting the Bayesian approach! (Obviously that was a personal opinion - you may disagree!)
In my opinion, the real magic in the Bayesian approach is that we can get more than just a point estimate for $\mathbf a$! The distribution $P(\mathbf a| y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) $ also tells us about the uncertainties in this estimate. For certain features, the posterior distribution for the corresponding component of $\mathbf a$ will be quite narrow, indicating that we are very sure about the coefficients that the model should assign to these features. For other features, it will be the other way round, indicating that we are less sure about what coefficient to assign them.
The posterior distribution $P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) $ is quite a complicated function, however. To get to grips with it, a common technique (known as the Laplace approximation) is to Taylor expand the log of this distribution to quadratic order:
$$ \log P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N) \approx {\rm const } - \frac 1 2 (\mathbf a - \mathbf a_{\rm MAP})^T \mathbf H (\mathbf a - \mathbf a_{\rm MAP}),$$
where $\mathbf H$ is the negative of the Hessian matrix of $\log P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N)$, evaluated at $\mathbf a_{\rm MAP}$. (There is no linear term because the gradient vanishes at the maximum, and $\mathbf H$ is positive definite there.)
Exponentiating this, we see that $\mathbf a$ is approximately normally distributed, with mean $\mathbf a_{\rm MAP}$ and covariance matrix $\mathbf H^{-1}$. Thus we have obtained not just a point estimate for $\mathbf a$, but also some estimate of the uncertainty.
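For this particular model, $\mathbf H$ has a well-known closed form, $\mathbf H = \mathbf\Sigma^{-1} + \sum_i s_i(1 - s_i)\,\mathbf x_i \mathbf x_i^T$ with $s_i = s(\mathbf a_{\rm MAP} . \mathbf x_i)$ (cf. Bishop, Section 4.5). A sketch:

```python
import numpy as np

def laplace_covariance(a_map, X, Sigma_inv):
    # H = negative Hessian of the log posterior at a_MAP:
    #   H = Sigma^{-1} + sum_i s_i (1 - s_i) x_i x_i^T  =  Sigma^{-1} + X^T W X
    s = 1.0 / (1.0 + np.exp(-(X @ a_map)))
    W = s * (1.0 - s)                     # Bernoulli variances s_i (1 - s_i)
    H = Sigma_inv + X.T @ (W[:, None] * X)
    return np.linalg.inv(H)               # approximate posterior covariance

# marginal uncertainty of each coefficient: np.sqrt(np.diag(laplace_covariance(...)))
```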
So if I were explaining this to people, I would probably try to put the characterisation of the posterior distribution centre stage. It just so happens that computing $\mathbf a_{\rm MAP}$ is part of this, but I wouldn't make that the end goal. (Again, this is obviously a personal opinion...)
4a. Making Bayesian predictions.
Now, suppose we receive a new datapoint, with feature vector $\mathbf x_{\rm new}$. Our task is to predict the probability of the corresponding label $y_{\rm new}$ being $1$. According to the Bayesian approach, the prediction should be:
$$ P(y_{\rm new} = 1 | \mathbf x_{\rm new}, \mathbf x_1, \dots, \mathbf x_N, y_1, \dots, y_N) = \int d\mathbf a \ P(y_{\rm new} = 1 | \mathbf x_{\rm new}, \mathbf a)\, P(\mathbf a | y_1, \dots, y_N, \mathbf x_1, \dots, \mathbf x_N). $$
I think this part of the story - how to predict stuff (!!!) - is a really important thing to cover in your explanation of Bayesian logistic regression. If anything, I would probably mention this before any of the stuff about estimating the posterior distribution - in fact, I would use the desire to make predictions as the motivation for wanting to estimate the posterior distribution!
And I want to emphasise that we're not just using the point estimate $\mathbf a_{\rm MAP}$ to make the prediction. Instead, we're effectively pooling the predictions we get from all possible values for $\mathbf a$, but assigning higher weight to the predictions coming from choices of $\mathbf a$ that are deemed more likely given the training data. The use of a Bayesian approach, rather than just a simple frequentist/optimisation approach, is what makes this possible.
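The integral has no closed form, but it is easy to approximate, e.g. by Monte Carlo over the Laplace Gaussian (Bishop also gives a probit-based approximation). A sketch, assuming `a_map` and `cov` come from the Laplace step above:

```python
import numpy as np

def predict_bayes(x_new, a_map, cov, n_samples=5000, seed=0):
    # Monte-Carlo estimate of  integral of s(a . x_new) N(a | a_MAP, H^{-1}) da :
    # pool the predictions of many plausible a's instead of using a_MAP alone
    rng = np.random.default_rng(seed)
    A = rng.multivariate_normal(a_map, cov, size=n_samples)   # (n_samples, p+1)
    return np.mean(1.0 / (1.0 + np.exp(-(A @ x_new))))
```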
A lot of my notation is taken from Bishop's "Pattern Recognition and Machine Learning", Sections 4.3-4.5, and the book is freely available online.
$
\def\bbR#1{{\mathbb R}^{#1}}
\def\a{\alpha}\def\b{\beta}\def\g{\gamma}\def\t{\theta}
\def\l{\lambda}\def\s{\sigma}\def\e{\varepsilon}
\def\n{\nabla}\def\o{{\tt1}}\def\p{\partial}
\def\E{R}\def\F{{\cal F}}
\def\L{\left}\def\R{\right}
\def\LR#1{\L(#1\R)}
\def\BR#1{\Big(#1\Big)}
\def\vec#1{\operatorname{vec}\LR{#1}}
\def\diag#1{\operatorname{diag}\LR{#1}}
\def\Diag#1{\operatorname{Diag}\LR{#1}}
\def\trace#1{\operatorname{Tr}\LR{#1}}
\def\qiq{\quad\implies\quad}
\def\grad#1#2{\frac{\p #1}{\p #2}}
\def\hess#1#2#3{\frac{\p^2 #1}{\p #2\,\p #3}}
\def\c#1{\color{red}{#1}}
\def\m#1{\left[\begin{array}{r}#1\end{array}\right]}
$It will be convenient to use $\{\e_k,\o,J,I\}$ to denote the standard basis vectors,
the all-ones vector, the all-ones matrix, and the identity matrix, respectively,
and $\{\odot,\oslash\}$ to denote elementwise multiplication and division.
Further, the indexed column vectors can be collected into matrices, e.g.
$$\eqalign{
X &= \m{x_\o&\ x_2&\ldots&\ x_N} &\qiq x_i = X\e_i \\
Y &= \m{y_\o&\ y_2&\ldots&\ y_N} &\qiq y_j = Y\e_j \\
B &= \m{\b_\o&\b_2&\ldots&\b_K} &\qiq \b_k = B\e_k \\
}$$
The various functions we need can be defined for a vector argument $u$, together with their differentials.
$$\eqalign{
e &= \exp(u) &\qiq de = e\odot du \\
s &= \operatorname{Softmax}(u) \\
&= \frac{e}{\o:e} = e \oslash Je &\qiq ds = \BR{\Diag{s}-ss^T}\,du \\
\l &= \operatorname{log-sum-exp}(u) \\
&= \log(\o:e) &\qiq d\l = s:du \\
}$$
$$\eqalign{
\log(s) &= \log(e)-\log(Je) \;&=\; u -\l\o
\qquad\qquad\qquad\qquad\qquad\qquad \\
\ell(y,s) &= -y:\log(s) \;&=\; \l\,(y:\o) - y:u \;=\; \l - y:u \\
}$$
where the last step uses $y:\o=1$ (a one-hot label sums to one),
and a colon denotes the trace/Frobenius product, i.e.
$$\eqalign{
A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\
A:A &= \big\|A\big\|^2_F \\
}$$
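As a numerical aside, here is a stable NumPy sketch of these vector functions and the loss (the max-shift is a standard trick: it cancels exactly in the softmax ratio, and shifts the log-sum-exp by a known amount):

```python
import numpy as np

def log_sum_exp(u):
    # lse(u) = log(1 : e^u), shifted by max(u) so that exp never overflows
    m = u.max()
    return m + np.log(np.exp(u - m).sum())

def softmax(u):
    e = np.exp(u - u.max())            # the shift cancels in e / (1 : e)
    return e / e.sum()

def cross_entropy(y, u):
    # ell(y, s) = lse(u) - y : u, for a one-hot label y
    return log_sum_exp(u) - y @ u

u = np.array([1.0, -2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])
assert np.isclose(cross_entropy(y, u), -np.log(softmax(u)[1]))
```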
The properties of the underlying trace function allow the terms in a
Frobenius product to be rearranged in many different ways, e.g.
$$\eqalign{
A:B &= B:A \\
A:B &= A^T:B^T \\
C:AB &= CB^T:A = A^TC:B \\
}$$
Furthermore, it commutes with elementwise multiplication and division
$$\eqalign{
A:(B\odot C)
&= (A\odot B):C
&= \sum_{i=1}^m\sum_{j=1}^n A_{ij} B_{ij} C_{ij} \\
}$$
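These rearrangement rules are easy to sanity-check numerically; a small sketch with hypothetical shapes:

```python
import numpy as np

def frob(A, B):
    # A : B = sum_ij A_ij B_ij = Tr(A^T B)
    return np.sum(A * B)

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 5))
C = rng.standard_normal((4, 5))

assert np.isclose(frob(C, A @ B), frob(C @ B.T, A))    # C : AB = CB^T : A
assert np.isclose(frob(C, A @ B), frob(A.T @ C, B))    # C : AB = A^T C : B

P, Q, R = (rng.standard_normal((4, 5)) for _ in range(3))
assert np.isclose(frob(P, Q * R), frob(P * Q, R))      # P : (Q o R) = (P o Q) : R
```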
The Kronecker product $(\otimes)$ can also be used to good effect since
$$\eqalign{
M_i &= \LR{I\otimes x_i^T} \\
M_i\b &= \LR{I\otimes x_i^T} \vec{B} = \vec{\e_i^TX^TB} = {B^TX\e_i} \\
y_i &\approx \operatorname{Softmax}(M_i\b)
= \exp\LR{B^TX\e_i}\oslash J\exp\LR{B^TX\e_i} \\
Y &\approx \exp\LR{B^TX}\oslash J\exp\LR{B^TX} \;\doteq\; S \\
}$$
The last equation is the Matrix Model that must be optimized for $B$
with respect to the cross-entropy loss function.
Note that we now have matrix analogs of all of the vector quantities
$$\eqalign{
u &\to U=B^TX \\
e &\to E=\exp(U) \\
s &\to S=\operatorname{Softmax}(U) &= E\oslash JE \\
&&= E\cdot\c{\Diag{E^T\o}^{-1}} \\
&&= E\c{D^{-1}} \\
y &\to Y,\quad Y^T\o=\o,\;&JY=J,\quad Y:J=N \\
}$$
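A small NumPy sketch of the Matrix Model (sizes hypothetical; the columns of `S` are the softmax outputs):

```python
import numpy as np

rng = np.random.default_rng(2)
p, K, N = 4, 3, 6                      # hypothetical sizes
X = rng.standard_normal((p, N))        # columns x_i
B = rng.standard_normal((p, K))        # columns beta_k

U = B.T @ X                            # (K, N)
E = np.exp(U - U.max(axis=0))          # columnwise shift for stability; cancels below
S = E / E.sum(axis=0)                  # S = E / JE : columnwise softmax
assert np.allclose(S.sum(axis=0), 1.0) # each column of S is a probability vector
```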
Calculate the gradient of the loss function.
$$\eqalign{
{\ell} &= -\BR{Y:\log(S)} \\
&= Y:\log(JE) - Y:U \\
\\
d{\ell} &= Y:d\log(JE) - Y:dU \\
&= Y:(J\,dE\oslash JE) - Y:dB^TX \\
&= (Y\oslash JE):(J\,dE) - Y^T:X^TdB \\
&= \LR{Y{D^{-1}}}:\BR{J\,dE} - XY^T:dB \\
&= JY{D^{-1}}:dE - XY^T:dB \\
&= JD^{-1}:(E\odot dU) - XY^T:dB \\
&= (E\odot JD^{-1}):dB^TX - XY^T:dB \\
&= \LR{ED^{-1}}^T:X^TdB - XY^T:dB \\
&= X\LR{ED^{-1}}^T:dB - XY^T:dB \\
\\
\grad{\ell}{B}
&= X\LR{ED^{-1} - Y}^T \\
&= X\LR{S - Y}^T \;\doteq\; G \\
}$$
This $G$ matrix can be used in any standard gradient descent algorithm.
The Hessian will be a fourth-order tensor and much uglier.
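Here is a sketch that implements $G = X(S-Y)^T$ and checks one entry against a central finite difference of $\ell$ (all sizes and data hypothetical):

```python
import numpy as np

def loss(B, X, Y):
    # ell = sum_i ( lse(u_i) - y_i : u_i )  with u_i the columns of U = B^T X
    U = B.T @ X
    m = U.max(axis=0)
    lse = m + np.log(np.exp(U - m).sum(axis=0))
    return np.sum(lse) - np.sum(Y * U)

def grad(B, X, Y):
    U = B.T @ X
    E = np.exp(U - U.max(axis=0))
    S = E / E.sum(axis=0)
    return X @ (S - Y).T                # G = X (S - Y)^T

rng = np.random.default_rng(3)
p, K, N = 3, 4, 5
X = rng.standard_normal((p, N))
Y = np.eye(K)[:, rng.integers(K, size=N)]   # one-hot label columns
B = rng.standard_normal((p, K))

eps = 1e-6
dB = np.zeros_like(B); dB[1, 2] = eps
fd = (loss(B + dB, X, Y) - loss(B - dB, X, Y)) / (2 * eps)
assert np.isclose(fd, grad(B, X, Y)[1, 2], atol=1e-5)
```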
Update
Here is an attempt at a Hessian calculation (with flattening via vectorization).
First, recall some identities involving the $\Diag{\cdot}$ operator
$$\eqalign{
&E\odot JP = E\cdot\Diag{P^T\o} \\
&H = \Diag{h} \iff h = H\o \\
&\Diag{H\o} = \Diag{h} = H \qiq
&\Diag{H^n\o} = H^n \\
}$$
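A quick numerical check of the first identity (shapes hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
E = rng.standard_normal((3, 4))
P = rng.standard_normal((3, 4))
J = np.ones((3, 3))

# E o JP = E . Diag(P^T 1): each column of E is scaled by that column-sum of P
lhs = E * (J @ P)
rhs = E @ np.diag(P.T @ np.ones(3))
assert np.allclose(lhs, rhs)
```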
Calculate the differential of $G$.
$$\eqalign{
G &= X\LR{ED^{-1}-Y}^T \\
dG &= X\LR{dE\;D^{-1}}^T - X\LR{E\,dD\,D^{-2}}^T \\
&= XD^{-1}\,dE^T - XD^{-2}\,dD\,E^T \\
&= XD^{-1}\LR{E^T\odot dU^T} - XD^{-2}\Diag{dE^T\o}\,E^T \\
&= XD^{-1}\LR{\c{E^T\odot X^TdB}}
- XD^{-2}\Diag{(\c{E^T\odot X^TdB})\o}\,E^T \\
}$$
Note the following vec expansion of the $\c{\rm red}$ term
$$\eqalign{
&\E = \Diag{\vec{E^T}} \\
&\vec{\c{E^T\odot X^TdB}} = \E\cdot\vec{X^TdB} = \E \LR{I\otimes X^T} d\b \\
}$$
and use this to vectorize the differential relationship and recover the Hessian.
$$\eqalign{
dg &= \vec{dG} \\
&= \LR{I\otimes XD^{-1}} \vec{\c{E^T\odot X^TdB}}
- \LR{E\boxtimes XD^{-2}} \LR{\c{E^T\odot X^TdB}}\o \\
&= \BR{\LR{I\otimes XD^{-1}}
- \o^T\otimes\LR{E\boxtimes XD^{-2}}}
\;\vec{\c{E^T\odot X^TdB}} \\
&= \BR{\LR{I\otimes XD^{-1}}
- \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T} d\b \\
\grad{g}{\b}
&= \BR{\LR{I\otimes XD^{-1}}
- \o^T\otimes\LR{E\boxtimes XD^{-2}}}\,\E\LR{I\otimes X^T}
\;=\; \hess{\ell}{\b\,}{\b^T} \\
}$$
where $\{\boxtimes\}$ denotes the Khatri-Rao product which
can be expanded in terms of Hadamard and Kronecker products
$$\eqalign{
B\boxtimes A &= (B\otimes\o_m)\odot(\o_p\otimes A)
\;\in \bbR{(pm)\times n} \\
\o_p &\in \bbR{p\times\o} \qquad
A \in \bbR{m\times n} \qquad
B \in \bbR{p\times n} \\\\
}$$
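This expansion translates directly into NumPy; a sketch (the function name `khatri_rao` is mine, and SciPy ships an equivalent `scipy.linalg.khatri_rao` with the same column-wise convention):

```python
import numpy as np

def khatri_rao(B, A):
    # B x A = (B kron 1_m) o (1_p kron A); column j is B[:, j] kron A[:, j]
    m, p = A.shape[0], B.shape[0]
    return np.kron(B, np.ones((m, 1))) * np.kron(np.ones((p, 1)), A)

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 6))     # m x n
B = rng.standard_normal((3, 6))     # p x n
KR = khatri_rao(B, A)               # (pm) x n
assert np.allclose(KR[:, 2], np.kron(B[:, 2], A[:, 2]))
```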
Note that the gradient can also be vectorized,
if that is preferable
$$\vec{G} \;=\; g \;=\; \grad{\ell}{\b}$$
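To sanity-check the result, one can use an equivalent per-sample form of the same Hessian, $\sum_i \BR{\Diag{s_i} - s_is_i^T}\otimes x_ix_i^T$, which follows from $ds = \BR{\Diag{s}-ss^T}\,du$ and $du_i = \LR{I\otimes x_i^T}d\b$. A sketch with a finite-difference check of $\partial g/\partial\b$ (sizes hypothetical):

```python
import numpy as np

def softmax_cols(U):
    E = np.exp(U - U.max(axis=0))
    return E / E.sum(axis=0)

def grad_vec(B, X, Y):
    # g = vec(G) with G = X (S - Y)^T, columns stacked (Fortran order)
    return (X @ (softmax_cols(B.T @ X) - Y).T).flatten(order="F")

def hessian(B, X):
    # per-sample form: sum_i (Diag(s_i) - s_i s_i^T) kron (x_i x_i^T)
    p, K = B.shape
    S = softmax_cols(B.T @ X)
    H = np.zeros((p * K, p * K))
    for i in range(X.shape[1]):
        s, x = S[:, i], X[:, i]
        H += np.kron(np.diag(s) - np.outer(s, s), np.outer(x, x))
    return H

rng = np.random.default_rng(5)
p, K, N = 3, 4, 5
X = rng.standard_normal((p, N))
Y = np.eye(K)[:, rng.integers(K, size=N)]
B = rng.standard_normal((p, K))

b, eps, j = B.flatten(order="F"), 1e-6, 2
db = np.zeros_like(b); db[j] = eps
fd = (grad_vec((b + db).reshape(p, K, order="F"), X, Y)
      - grad_vec((b - db).reshape(p, K, order="F"), X, Y)) / (2 * eps)
assert np.allclose(fd, hessian(B, X)[:, j], atol=1e-4)
```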
Best Answer
The decision boundary follows from your last sentence
(correcting $p(-1|y)$ to $p(-1|x)$): it is the set $$\left\{x\in\mathbb{R}^d:p(+1|x) = p(-1|x)\right\}.$$ Expanding the condition, \begin{align*} \frac{1}{1+\exp(-\theta^T\Phi(x))} &= \frac{1}{1+\exp(\theta^T\Phi(x))} \\ 1+\exp(-\theta^T\Phi(x)) &= 1+\exp(\theta^T\Phi(x)) \\ -\theta^T\Phi(x) &= \theta^T\Phi(x)\\ \Rightarrow \theta^T\Phi(x) &= 0. \end{align*} And so the decision boundary is given by the set, $$\{x\in\mathbb{R}^d:\theta^T\Phi(x) = 0\}.$$
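As a tiny numerical illustration: any $x$ with $\theta^T\Phi(x) = 0$ gets $p(+1|x) = \tfrac12$ (the values below are hypothetical):

```python
import numpy as np

# With Phi(x) = (1, x1, x2), the boundary theta^T Phi(x) = 0 is the line
# theta0 + theta1*x1 + theta2*x2 = 0; points on it get p(+1|x) = 0.5.
theta = np.array([1.0, -2.0, 0.5])     # hypothetical parameters
phi_x = np.array([1.0, 1.0, 2.0])      # Phi(x) with theta^T Phi(x) = 1 - 2 + 1 = 0
p_plus = 1.0 / (1.0 + np.exp(-theta @ phi_x))
assert np.isclose(p_plus, 0.5)
```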