PRML Bishop equation 3.15 – Maximum likelihood and least squares

linear algebra, machine learning, maximum likelihood

I have a doubt about the derivation of the normal equations for maximum likelihood and least squares.
Excuse me, and do let me know if I shouldn't have asked this as a new question.

This question concerns Section 3.1.1, page 148, of Pattern Recognition and Machine Learning by Christopher Bishop.

My doubt relates to point (b) in @LCS11's explanation.

Reiterating, this concerns the derivation starting from the log of the Gaussian conditional probability:

Step 1: $\ln p(\mathbf t \mid \mathbf x, \mathbf w, \beta) = \sum_{n=1}^N \ln \mathcal N\big(t_n \mid \mathbf w^T \boldsymbol\phi(\mathbf x_n), \beta^{-1}\big)$

Step 2, expanding RHS:
$\begin{align*} \frac{N}{2} \ln \beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\sum_{n=1}^N\{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}^2 \end{align*}$
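(To get Step 2, I expanded a single term of the sum using the univariate Gaussian density with mean $\mathbf w^T\boldsymbol\phi(\mathbf x_n)$ and variance $\beta^{-1}$,
$$\ln \mathcal N\big(t_n \mid \mathbf w^T\boldsymbol\phi(\mathbf x_n), \beta^{-1}\big) = \tfrac12\ln\beta - \tfrac12\ln 2\pi - \tfrac{\beta}{2}\{t_n - \mathbf w^T\boldsymbol\phi(\mathbf x_n)\}^2,$$
and then summed over $n$.)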

Taking the gradient of Step 2 with respect to $\mathbf w$ and setting it equal to $0$:

Step 3:
$\begin{align*} \sum_{n=1}^N\{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}\,\boldsymbol\phi(\mathbf x_n)^T = 0 \end{align*}$
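(For the record, the way I obtained Step 3: the gradient of a single squared term, written as a row vector to match the book's convention, is
$$\nabla_{\mathbf w}\{t_n - \mathbf w^T\boldsymbol\phi(\mathbf x_n)\}^2 = -2\{t_n - \mathbf w^T\boldsymbol\phi(\mathbf x_n)\}\,\boldsymbol\phi(\mathbf x_n)^T,$$
so the gradient of Step 2 is $\beta\sum_{n=1}^N\{t_n - \mathbf w^T\boldsymbol\phi(\mathbf x_n)\}\boldsymbol\phi(\mathbf x_n)^T$; setting this to zero and dividing by $\beta$ gives Step 3.)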

Step 4: $\begin{align*} \sum_{n=1}^N t_n \boldsymbol\phi(\mathbf x_n)^T = \mathbf w^T\Big(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T\Big) \end{align*}$

Rearranging to solve for $\mathbf w$:

Step 5: $\begin{align*} \mathbf w^T\Big(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T\Big) = \sum_{n=1}^N t_n \boldsymbol\phi(\mathbf x_n)^T \end{align*}$

Hereafter, I fail to understand how the author moves on to give:

$\begin{align*} \mathbf w_{ML} = (\boldsymbol\Phi^T \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf t \end{align*}$

Question: please help me, as I couldn't understand how, after Step 5,

$\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T = \boldsymbol\Phi^T \boldsymbol\Phi$ and $\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, t_n = \boldsymbol\Phi^T \mathbf t$,

where $\boldsymbol\Phi \in \mathbb R^{N \times M}$ is the design matrix (shown below):

$$\begin{bmatrix}
\phi_0(\mathbf x_1)&\phi_1(\mathbf x_1)&\dots&\phi_{M-1}(\mathbf x_1)\\
\phi_0(\mathbf x_2)&\phi_1(\mathbf x_2)&\dots&\phi_{M-1}(\mathbf x_2)\\
\vdots &\vdots & \ddots & \vdots\\
\phi_0(\mathbf x_N)&\phi_1(\mathbf x_N)&\dots&\phi_{M-1}(\mathbf x_N)
\end{bmatrix}$$
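Numerically both identities do seem to hold; below is a small NumPy sketch with made-up toy data (sizes and values are arbitrary, purely for checking) that I used to convince myself, so my confusion is only about the algebra:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3                       # made-up sizes: 5 data points, 3 basis functions
Phi = rng.normal(size=(N, M))     # row n plays the role of phi(x_n)^T
t = rng.normal(size=N)            # made-up targets t_1, ..., t_N

# sum_n phi(x_n) phi(x_n)^T, built explicitly as a sum of outer products
S = sum(np.outer(Phi[n], Phi[n]) for n in range(N))
print(np.allclose(S, Phi.T @ Phi))   # True, i.e. equals Phi^T Phi

# sum_n phi(x_n) t_n, built explicitly
v = sum(Phi[n] * t[n] for n in range(N))
print(np.allclose(v, Phi.T @ t))     # True, i.e. equals Phi^T t
```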

Related Question:
The author obtains the following after taking the gradient with respect to $\mathbf w$:

$\begin{align*} \nabla \ln p(\mathbf t \mid \mathbf w, \beta) = \beta \sum_{n=1}^N \{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}\,\boldsymbol\phi(\mathbf x_n)^T \end{align*}$

We equate the above gradient to $0$ and rearrange:

$\begin{align*} \mathbf w^T\Big(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T\Big) = \sum_{n=1}^N t_n \boldsymbol\phi(\mathbf x_n)^T \end{align*}$

I convert this to vector form and get (please correct me if my understanding is wrong):

$\begin{align*} \mathbf w^T(\boldsymbol\Phi \boldsymbol\Phi^T) = \mathbf t\,\boldsymbol\Phi^T \end{align*}$

Applying the transpose to both sides gives:

$\begin{align*} (\boldsymbol\Phi \boldsymbol\Phi^T)\, \mathbf w = \boldsymbol\Phi\, \mathbf t^T \end{align*}$

Pre-multiplying both sides by $(\boldsymbol\Phi \boldsymbol\Phi^T)^{-1}$:

$\begin{align*} \mathbf w_{ML} = (\boldsymbol\Phi \boldsymbol\Phi^T)^{-1} \boldsymbol\Phi\, \mathbf t^T \end{align*}$
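Checking dimensions numerically only deepens my confusion: with made-up toy sizes (NumPy sketch below), $\boldsymbol\Phi\boldsymbol\Phi^T$ is $N\times N$, while the book's $\boldsymbol\Phi^T\boldsymbol\Phi$ is $M\times M$, the size of $\mathbf w$:

```python
import numpy as np

N, M = 5, 3                                         # made-up sizes
Phi = np.random.default_rng(0).normal(size=(N, M))  # stand-in design matrix

print((Phi @ Phi.T).shape)   # (5, 5): N x N
print((Phi.T @ Phi).shape)   # (3, 3): M x M, matching w in R^M
```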

I'm not sure how the author arrives at the same result as explained by @Syd Amerikaner. Please guide me and point out where I am going off-track.

I've read Strang and Meyer, but it's been some time, and the rustiness shows, which is why I'm asking such an elementary question.

Best Answer

Let $$\boldsymbol t = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix},\qquad\text{and}\qquad \boldsymbol\Phi = \begin{pmatrix} \boldsymbol\phi(\boldsymbol x_1)^\top \\ \boldsymbol\phi(\boldsymbol x_2)^\top \\ \vdots \\ \boldsymbol\phi(\boldsymbol x_N)^\top \end{pmatrix}.$$ Note that $$\boldsymbol\phi(\boldsymbol x_n) = \begin{pmatrix} \phi_0(\boldsymbol x_n) \\ \phi_1(\boldsymbol x_n) \\ \vdots \\ \phi_{M-1}(\boldsymbol x_n) \end{pmatrix}.$$
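In code, $\boldsymbol\Phi$ is just these rows stacked on top of each other. A minimal NumPy sketch, assuming for illustration the polynomial basis $\phi_j(x) = x^j$ (a made-up choice; any basis functions work the same way):

```python
import numpy as np

def phi(x, M=4):
    """Hypothetical basis vector phi(x) = (x^0, x^1, ..., x^(M-1))."""
    return np.array([x**j for j in range(M)])

x = np.array([0.1, 0.5, 0.9, 1.3, 2.0])   # toy inputs x_1, ..., x_N
Phi = np.stack([phi(xn) for xn in x])     # row n is phi(x_n)^T, so Phi has shape (N, M)
print(Phi.shape)                          # (5, 4)
```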

The log-likelihood function is given by
$$\log\mathcal L(\boldsymbol w) = \frac N2 \log(\beta) -\frac N2 \log(2\pi) - \frac \beta 2 \sum_{n=1}^N\big(t_n - \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)\big)^2.$$
Since $\boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)$ is a scalar (i.e. an element of $\mathbb R$), it holds that $\boldsymbol\phi(\boldsymbol x_n)^\top\boldsymbol w = \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)$. Thus,
\begin{align*}\log\mathcal L(\boldsymbol w) &= \frac N2 \log(\beta) -\frac N2 \log(2\pi) - \frac \beta 2 \sum_{n=1}^N\big(t_n - \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)\big)^2 \\ &= \frac N2 \log(\beta) -\frac N2 \log(2\pi) - \frac \beta 2 \sum_{n=1}^N\big(t_n - \boldsymbol\phi(\boldsymbol x_n)^\top\boldsymbol w\big)^2. \end{align*}

Note that $\sum_{n=1}^N\big(t_n - \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)\big)^2 = (\boldsymbol t - \boldsymbol\Phi\boldsymbol w)^\top(\boldsymbol t - \boldsymbol\Phi\boldsymbol w)$, since for any vector $\boldsymbol z\in\mathbb R^N$ it holds that $\boldsymbol z^\top \boldsymbol z = \sum_{n=1}^N z_n^2$; it is a basic linear algebra exercise to show this equality. Next,
\begin{align*} (\boldsymbol t - \boldsymbol\Phi\boldsymbol w)^\top(\boldsymbol t - \boldsymbol\Phi\boldsymbol w) &= \boldsymbol t^\top\boldsymbol t - \boldsymbol t^\top\boldsymbol\Phi\boldsymbol w - (\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol t + (\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol\Phi\boldsymbol w \\ &= \boldsymbol t^\top\boldsymbol t - 2\boldsymbol t^\top\boldsymbol\Phi\boldsymbol w + \boldsymbol w^\top\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w, \end{align*}
using again the fact that $(\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol t$ is a scalar and thus $(\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol t = \boldsymbol t^\top\boldsymbol\Phi\boldsymbol w$.

Since the log-likelihood is differentiable (and concave in $\boldsymbol w$, because $\boldsymbol\Phi^\top\boldsymbol\Phi$ is positive semi-definite), setting the derivative to zero yields the maximum. Recall the derivative rules
\begin{align*} \frac{\partial}{\partial\boldsymbol w}C &= \boldsymbol 0, \\ \frac{\partial}{\partial\boldsymbol w}\big(\boldsymbol t^\top\boldsymbol\Phi\boldsymbol w\big) &= \boldsymbol\Phi^\top\boldsymbol t,\\ \frac{\partial}{\partial\boldsymbol w}\big(\boldsymbol w^\top\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w\big) &= 2\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w, \end{align*}
and the linearity of the derivative operator to obtain
$$\frac{\partial}{\partial\boldsymbol w}\big(\log\mathcal L(\boldsymbol w)\big) = \boldsymbol 0 - \frac\beta 2\Big(\boldsymbol 0 - 2\boldsymbol\Phi^\top\boldsymbol t + 2\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w\Big) = \beta\boldsymbol\Phi^\top\boldsymbol t - \beta\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w.$$
Setting the derivative to zero yields
$$\boldsymbol 0 = \beta\boldsymbol\Phi^\top\boldsymbol t - \beta\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w,$$
or
$$\beta\boldsymbol\Phi^\top\boldsymbol t = \beta\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w.$$
Dividing both sides by $\beta$ and premultiplying both sides by $(\boldsymbol\Phi^\top\boldsymbol\Phi)^{-1}$ yields
$$\boldsymbol w = (\boldsymbol\Phi^\top\boldsymbol\Phi)^{-1}\boldsymbol\Phi^\top\boldsymbol t.$$
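If it helps, here is a small NumPy sketch on made-up toy data (again with a hypothetical polynomial basis) checking that the closed form $(\boldsymbol\Phi^\top\boldsymbol\Phi)^{-1}\boldsymbol\Phi^\top\boldsymbol t$ agrees with a generic least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 2.0, 20)
t = np.sin(x) + 0.1 * rng.normal(size=x.size)      # made-up noisy targets

Phi = np.stack([x**j for j in range(4)], axis=1)   # design matrix with phi_j(x) = x^j

# Closed form w_ML = (Phi^T Phi)^{-1} Phi^T t, via solve() rather than an explicit inverse
w_ml = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Generic least-squares solver for comparison
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(np.allclose(w_ml, w_lstsq))                  # True
```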
