This is standard material from variational inference, which may be worth studying in more depth since it shows up well beyond VAEs (e.g., in variational Bayesian NNs). Luckily we can derive it all fairly simply with some basic probability theory.
The idea is that the true posterior $p_\theta(z|x_i)$ is too difficult to compute directly (i.e., via Bayes' rule), so we instead approximate it with $q_\phi(z|x_i)$ and optimize the variational parameters $\phi$ to make the approximation as close as possible. Mathematically, we have that
$$ p_\theta(z|x) = \frac{p_\theta(x|z) p_\theta(z)}{p_\theta(x)} = \frac{p_\theta(x|z) p_\theta(z)}{\int p_\theta(x|z) p_\theta(z) dz} \approx q_\phi(z|x) $$
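To make the approximation concrete: in the VAE, $q_\phi(z|x)$ is typically a diagonal Gaussian whose mean and (log-)variance are produced by an "encoder" network with parameters $\phi$. Here is a minimal NumPy sketch of that idea, with a single linear layer standing in for the encoder (the layer and the dimensions are made up purely for illustration):
```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: D-dimensional data, d-dimensional latent.
D, d = 5, 2

# "Encoder" parameters phi: a single linear map producing (mu, log variance).
# In a real VAE this would be a neural network.
W_mu, b_mu = rng.normal(size=(d, D)), np.zeros(d)
W_lv, b_lv = rng.normal(size=(d, D)), np.zeros(d)

def q_params(x):
    """Amortized variational posterior q_phi(z|x) = N(mu(x), diag(sigma(x)^2))."""
    mu = W_mu @ x + b_mu
    log_var = W_lv @ x + b_lv
    return mu, log_var

def sample_q(x, n_samples=1):
    """Draw z ~ q_phi(z|x) via the reparameterization trick."""
    mu, log_var = q_params(x)
    eps = rng.normal(size=(n_samples, d))
    return mu + np.exp(0.5 * log_var) * eps

x = rng.normal(size=D)
z = sample_q(x, n_samples=3)
print(z.shape)  # (3, 2): three samples from q_phi(z|x)
```
The point is just that, rather than computing the intractable normalizer $\int p_\theta(x|z) p_\theta(z)\,dz$, we learn $\phi$ so that this cheap distribution behaves like the true posterior.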
It turns out that the log-marginal likelihood can be expressed via:
$$
\log p_\theta(x) = \mathcal{D}_\text{KL}\left[ q_\phi(z|x)\mid\mid p_\theta(z|x) \right] + \mathcal{L}(\theta,\phi\mid x) \tag{1}
$$
where $\mathcal{L}$ is the evidence lower bound (ELBO), which may be written
\begin{align}
\mathcal{L}(\theta,\phi\mid x)
&= \mathbb{E}_{q_\phi(z|x)}\left[ -\log q_\phi(z|x) + \log p_\theta(x,z) \right] \tag{a} \\
&= \int q_\phi(z|x)\left[ -\log q_\phi(z|x) + \log p_\theta(x|z) + \log p_\theta(z) \right]dz \\
&= \int q_\phi(z|x) \log\left(\frac{p_\theta(z)}{q_\phi(z|x)}\right) dz
+ \int q_\phi(z|x) \log p_\theta(x|z)\, dz \\
&=
-\mathcal{D}_\text{KL}\left[ q_\phi(z|x) \mid\mid p_\theta(z) \right]
+
\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right]
\tag{b}
\end{align}
where (a) is the form in eq (2) in the paper and (b) is the form in eq (3) in the paper. We used the fact that
$$ \mathcal{D}_\text{KL}\left[ p_\xi(y) \mid\mid p_\eta(y) \right] = \int \log\left(\frac{p_\xi(y)}{p_\eta(y)}\right) p_\xi(y) \,dy = \mathbb{E}_{p_\xi(y)}\left[ \log\frac{p_\xi(y)}{p_\eta(y)} \right] $$
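To see how form (b) is used in practice (this is an illustrative sketch, not the paper's exact setup): when $q_\phi(z|x)$ is a diagonal Gaussian and the prior $p_\theta(z)$ is standard normal, the KL term has a simple closed form, while the expected log-likelihood is estimated by Monte Carlo with samples from $q_\phi$. The linear Gaussian "decoder" below is a stand-in, just to make the two terms explicit:
```python
import numpy as np

rng = np.random.default_rng(1)

def kl_diag_gaussian_vs_std_normal(mu, log_var):
    """KL[ N(mu, diag(exp(log_var))) || N(0, I) ], closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

def log_p_x_given_z(x, z, W_dec):
    """A stand-in decoder: p_theta(x|z) = N(x | W_dec z, I)."""
    mean = W_dec @ z
    return -0.5 * np.sum((x - mean)**2) - 0.5 * len(x) * np.log(2 * np.pi)

# Toy setup: data x, variational parameters (mu, log_var), decoder weights.
D, d = 5, 2
x = rng.normal(size=D)
mu, log_var = rng.normal(size=d), 0.1 * rng.normal(size=d)
W_dec = rng.normal(size=(D, d))

# Monte Carlo estimate of E_q[log p(x|z)] using reparameterized samples.
n_samples = 1000
eps = rng.normal(size=(n_samples, d))
zs = mu + np.exp(0.5 * log_var) * eps
recon = np.mean([log_p_x_given_z(x, z, W_dec) for z in zs])

# ELBO in form (b): -KL[q || p(z)] + E_q[log p(x|z)].
elbo = -kl_diag_gaussian_vs_std_normal(mu, log_var) + recon
print("ELBO (form (b)) ~", elbo)
```
I believe the closed-form KL used here is the same expression given in the appendix of the paper.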
Ok, but we still need to derive equation (1):
\begin{align}
\log p_\theta(x)
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x) \right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{p_\theta(x,z)}{p_\theta(z|x)} \frac{q_\phi(z|x)}{q_\phi(z|x)} \right) \right] \\
&= \underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{q_\phi(z|x)}{p_\theta(z|x)} \right) \right] }_{ \mathcal{D}_\text{KL}[q_\phi(z|x)\mid\mid p_\theta(z|x)]}
+ \underbrace{
\mathbb{E}_{q_\phi(z|x)}\left[ \log\left( \frac{p_\theta(x,z)}{q_\phi(z|x)} \right) \right]}_{\text{Exactly (a) above}} \\
&= \mathcal{D}_\text{KL}[q_\phi(z|x)\mid\mid p_\theta(z|x)]
+ \mathcal{L}(\theta,\phi\mid x)
\end{align}
where the first step (taking the expectation) used the fact that there is no dependence on $z$ in the marginal and the second used the identity $p_\theta(x,z) = p_\theta(z|x) p_\theta(x)$. This gives us equation (1) from the paper.
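If you want to convince yourself of identity (1) numerically, it's easy to do with a small discrete latent, where every quantity is exactly computable. The sketch below builds an arbitrary joint $p(x,z)$ over a few latent states, picks an arbitrary $q(z|x)$, and checks that $\log p(x) = \mathcal{D}_\text{KL}[q\mid\mid p(z|x)] + \mathcal{L}$ (the model and numbers are made up):
```python
import numpy as np

rng = np.random.default_rng(2)

K = 4                                    # number of latent states z
p_z = rng.dirichlet(np.ones(K))          # prior p(z)
p_x_given_z = rng.uniform(0.1, 1.0, K)   # likelihood p(x|z) for one fixed x

p_xz = p_x_given_z * p_z                 # joint p(x, z)
p_x = p_xz.sum()                         # marginal p(x)
post = p_xz / p_x                        # true posterior p(z|x)

q = rng.dirichlet(np.ones(K))            # arbitrary approximate posterior q(z|x)

kl_q_post = np.sum(q * np.log(q / post))         # D_KL[q || p(z|x)]
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))    # E_q[log p(x,z) - log q], form (a)

print(np.log(p_x), kl_q_post + elbo)     # the two numbers agree
assert np.isclose(np.log(p_x), kl_q_post + elbo)
assert elbo <= np.log(p_x)               # equation (2), since the KL is non-negative
```
Setting $q$ equal to the true posterior makes the KL term vanish, in which case the ELBO equals $\log p(x)$ exactly.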
Next we derive the "lower-bounding" part (equation 2 in the paper).
Firstly, though, I want to say that because the KL-divergence is non-negative, we must immediately have equation (2),
$$ \log p_\theta(x) \geq \mathcal{L}(\theta,\phi\mid x), $$
trivially from equation (1).
However, it's more common to see people get to (2) as follows.
Recall Jensen's inequality, which states that for a concave function $f$ (like log), we get that $f(\mathbb{E}[y]) \geq \mathbb{E}[f(y)]$, so we see that
$$ \log p(x)
= \log\int p(x,y) \frac{q(y)}{q(y)} dy
= \log \mathbb{E}_{q(y)}\left[ \frac{p(x,y)}{q(y)} \right]
\geq \mathbb{E}_{q(y)}\left[ \log\left( \frac{p(x,y)}{q(y)} \right) \right]
\tag{JI}
$$
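As a quick numerical sanity check of (JI) before we use it: for any positive random variable $Y$, $\log\mathbb{E}[Y]\geq\mathbb{E}[\log Y]$. For instance (arbitrary choice of distribution):
```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # a positive random variable

lhs = np.log(np.mean(y))   # log E[Y]
rhs = np.mean(np.log(y))   # E[log Y]
print(lhs, rhs)            # lhs > rhs, as Jensen's inequality requires
assert lhs >= rhs
```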
Using this, we get
\begin{align}
\log p_\theta(x) &= \log \int p_\theta(x,z) dz \\
&= \log\int q_\phi(z|x)\frac{p_\theta(x,z)}{q_\phi(z|x) } dz \\
&= \log \mathbb{E}_{q_\phi(z|x)}\left[\frac{p_\theta(x,z)}{q_\phi(z|x) }\right] \\
&\geq \mathbb{E}_{q_\phi(z|x)}\left[ \log \left(\frac{p_\theta(x,z)}{q_\phi(z|x) }\right)\right] \qquad\text{from (JI)}\\
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x,z) - \log q_\phi(z|x) \right]
\end{align}
which is exactly equation (a).
As to your comments:
"How is this the same as the marginal likelihood? I've been looking at this equation for quite some time and I can't reason through it like I can with the standard marginal likelihood."
As noted in the derivation, it can be interpreted as approximating the true posterior with a variational distribution. The reasoning is then that we decompose into two terms: one basically optimizes the likelihood using the approximate distribution and the other forces the approximation to match the true posterior.
"The only thing I have been able to deduce from it (I think) is that the two terms on the RHS of the equation will be negatively correlated because as the divergence gets smaller, the likelihood of them both goes up and vice versa, correct?"
I am not sure that is the case. I think it is possible to choose $q_\phi$ such that the first term gets much worse while the second improves, but I am not certain of that.
"[For the second equation,] Why is this equal to the joint likelihood of $\theta$ and $\phi$? Why is the input to $p_\theta$ different (joint) than that of $q_\phi$, which is conditional?"
Well perhaps the derivation cleared this up for you, but $x$ is fixed beforehand on both sides of the equation, whereas the expectation is taken over $z\sim q_\phi(z|x)$ on the RHS. So both $x$ and $z$ are well-defined.
One thing to notice is that since the KL-divergence is non-negative, the ELBO $\mathcal{L}$ is always a lower bound on the log marginal likelihood, so maximizing it guarantees we also "push up" the true marginal log-likelihood. Another thing is that when the variational approximate posterior $q_\phi$ is perfect (meaning it matches the true posterior exactly, so the KL vanishes), we get $\log p_\theta(x) = \mathcal{L}(\theta,\phi\mid x)$, i.e., the ELBO is exactly the log marginal likelihood. In other words, we are simultaneously (1) optimizing $q_\phi$ to match the true posterior and (2) optimizing the marginal likelihood. It happens that as you do (1) (improving $q_\phi$), (2) also becomes easier (since the ELBO $\mathcal{L}$ approaches the true log marginal likelihood)!
An even more common interpretation comes from form (b): we choose a prior $p_\theta(z)$ over the latent variables and force the approximate posterior to stay close to it, while also maximizing the conditional likelihood $\log p_\theta(x|z)$ (usually some kind of reconstruction term). Hence we get the VAE as a kind of regularized stochastic autoencoder.
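To connect this to code, here is what a bare-bones VAE training step might look like (a sketch in PyTorch; the layer sizes, Bernoulli decoder, and fake batch are all made up for illustration). The loss is the negative of form (b): a reconstruction term plus the closed-form KL between the Gaussian encoder and the standard-normal prior.
```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """A minimal VAE: Gaussian encoder q_phi(z|x), Bernoulli decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=20, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Tanh())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        logits = self.dec(z)                                      # Bernoulli logits for p(x|z)
        return logits, mu, logvar

def neg_elbo(x, logits, mu, logvar):
    # Reconstruction term: -E_q[log p(x|z)], one-sample Monte Carlo estimate.
    recon = nn.functional.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL[q_phi(z|x) || N(0, I)] in closed form.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
    return recon + kl

# One illustrative optimization step on a fake binarized batch.
model = TinyVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784).round()        # placeholder batch of "binarized images"
logits, mu, logvar = model(x)
loss = neg_elbo(x, logits, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```
Minimizing this loss is the same as maximizing the ELBO, so the reconstruction term plays the autoencoder role and the KL term is the regularizer pulling $q_\phi$ toward the prior.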
Best Answer
I think what's happening is: $x_i\in\mathbb{R}^D \;\forall i$ are the data points, $t_i\in \mathbb{R}$ is the target for $x_i$, the basis functions are $\phi_j:\mathbb{R}^D\rightarrow\mathbb{R}$ (so that $\phi : \mathbb{R}^D\rightarrow \mathbb{R}^M$), and the weights are given by $w\in\mathbb{R}^M$. The data generating mechanism is $$ t = y(x,w) + \epsilon = \sum_{j=0}^{M-1} w_j\phi_j(x) + \epsilon = w^T\phi(x) + \epsilon, $$ meaning $t\sim\mathcal{N}(t\mid y(x,w),\beta^{-1})$. See equation 3.3 in the book.
Next the author looks at the log-likelihood of the data under the model (i.e., $p(t|w,\beta)$), and computes the gradient with respect to the weights. We should see that $\nabla\ln p(t|w,\beta)\in\mathbb{R}^M$, since that is how many weights there are. Our dataset is of size $N$; i.e., $X=\{x_1,\ldots,x_N\}$ and $t=\{t_1,\ldots,t_N\}$. We get: $$ \nabla\mathcal{E}:=\nabla \ln p(t|w,\beta) =\sum_{n=1}^N \underbrace{\left[ t_n - w^T\phi(x_n) \right]}_{\mathbb{R}}\underbrace{\phi(x_n)^T}_{\mathbb{R}^M}\in\mathbb{R}^M $$ since summing over the vectors doesn't change their dimensionality. Indeed, $\nabla\mathcal{E}\in\mathbb{R}^M$, meaning the gradient for the $u$th weight is given by the $u$th component of $\nabla\mathcal{E}$.
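A quick NumPy sketch (with made-up Gaussian basis functions, since the book leaves $\phi$ generic) confirms both the formula and the dimensionality; the summed gradient has exactly $M$ components, one per weight:
```python
import numpy as np

rng = np.random.default_rng(4)

N, D, M = 100, 3, 6                       # data points, input dim, basis functions
X = rng.normal(size=(N, D))
w_true = rng.normal(size=M)

# Made-up Gaussian basis functions phi_j: R^D -> R (centers chosen arbitrarily).
centers = rng.normal(size=(M, D))
def phi(x):
    return np.exp(-0.5 * np.sum((x - centers) ** 2, axis=1))   # shape (M,)

Phi = np.array([phi(x) for x in X])                   # design matrix, shape (N, M)
t = Phi @ w_true + rng.normal(scale=0.1, size=N)      # t = w^T phi(x) + noise

w = np.zeros(M)
grad = np.sum((t - Phi @ w)[:, None] * Phi, axis=0)   # sum_n [t_n - w^T phi(x_n)] phi(x_n)
print(grad.shape)                                     # (M,): one entry per weight
```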
Regarding your question: note that $\phi_n$ appears nowhere in the above equation. There is only $\phi(x_n)$, which is $M$-dimensional (it maps $x_n\in\mathbb{R}^D$ to an $M$-dimensional vector), and so it has the same dimensionality as the weight vector.
Edit for comments:
In ML, the distinction between row and column vectors tends to be overlooked, partly because we operate in (what are assumed to be) Euclidean spaces, where it doesn't matter very much. I see papers switch between them or (more often) just treat the gradient as a vector, ignoring the row vs column distinction altogether.
However, it is slightly more common, I feel, to consider it a row vector, because the derivative of a scalar-valued function $f:\mathbb{R}^M\rightarrow\mathbb{R}$ is a linear map $\mathbb{R}^M\rightarrow\mathbb{R}$, which is naturally written as a row vector (the "numerator layout" convention, under which the chain rule is plain matrix multiplication).
There is plenty of debate on this, though; e.g., [1], [2], [3], [4], [5], [6], [7], [8].
(Aside: mathematically, resolving this requires distinguishing between the differential and the gradient (contravariance vs covariance), and this is not something people care about in ML usually since the metric tensor is Euclidean. See the last few refs for more.)
I don't know that the author says this explicitly, but I guess he implicitly considers the gradient a row vector (at least in this case). I'm pretty sure that in other places in the book the gradient ends up being a column vector, so maybe it's a typo?
TL;DR: in ML, the literature is very lenient regarding $\mathbb{R}^{1\times M}$ vs $\mathbb{R}^{M\times 1}$ vs $\mathbb{R}^{M}$, so I wouldn't worry about it.
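For what it's worth, here is a tiny NumPy illustration of why it rarely matters in practice: the same $M$ numbers can be stored with shape $(M,)$, $(1,M)$, or $(M,1)$, and reshaping between them changes nothing about the values.
```python
import numpy as np

g = np.arange(4.0)          # a "gradient" with M = 4 components, shape (4,)
row = g.reshape(1, -1)      # treat it as a row vector,    shape (1, 4)
col = g.reshape(-1, 1)      # treat it as a column vector, shape (4, 1)

print(g.shape, row.shape, col.shape)           # (4,) (1, 4) (4, 1)
print(np.allclose(row.ravel(), col.ravel()))   # True: same numbers either way
```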