Exercise 1.1 from Introduction of Pattern Recognition and Machine Learning by Christopher Bishop

Tags: machine-learning, pattern-recognition, sequences-and-series

So I'm stuck on exercise 1.1 because I'm getting confused with the i's and j's, I think.
Also, I have done Maths and Further Maths A level in the UK, so I have an okay maths background, but I had to study some multivariable calculus for this book. Is there anything else that I need to cover? Thanks
Question:

Exercise 1.1: Consider the sum-of-squares error function $$E(w)=\frac12\sum_{n=1}^N\{y(x_n,w)-t_n\}^2$$ in which the function $y(x,w)$ is given by the polynomial $$y(x,w)=\sum_{j=0}^M w_jx^j.$$ Show that the coefficients $w=\{w_i\}$ that minimize this error function are given by the solution to the following set of linear equations: $$\sum_{j=0}^M A_{ij}w_j=T_i,\qquad\text{where}\qquad A_{ij}=\sum_{n=1}^N(x_n)^{i+j},\qquad T_i=\sum_{n=1}^N(x_n)^i\,t_n.$$ Here a suffix $i$ or $j$ denotes the index of a component, whereas $(x)^i$ denotes $x$ raised to the power of $i$.

Also, this may be useful:
"We can solve the curve fitting problem by choosing the value of w for which
E(w) is as small as possible. Because the error function is a quadratic function of
the coefficients w, its derivatives with respect to the coefficients will be linear in the
elements of w, and so the minimization of the error function has a unique solution,
denoted by w*, which can be found in closed form. The resulting polynomial is
given by the function y(x, w*)."

Best Answer

The main things going on in this question are multivariate calculus and linear algebra. With a solid understanding of those, most of the math behind machine learning techniques falls into place, give or take the statistical view that some people have on the subject.

I agree the indices aren't great, but the problem is solvable with careful bookkeeping. Take every step slowly, and take small steps. What the problem asks you to do is differentiate the error function with respect to the weights. The set of equations it asks for is precisely what you find when you set that gradient equal to $0$.

To make that explicit, consider just the partial with respect to $w_1$: $$\begin{aligned}\frac{\partial E}{\partial w_1}&=\frac{\partial}{\partial w_1}\left[\frac12\sum_{n=1}^N(y(x_n,w)-t_n)^2\right]\\&=\frac12\sum_{n=1}^N\frac{\partial}{\partial w_1}(y(x_n,w)-t_n)^2\\&=\frac12\sum_{n=1}^N2(y(x_n,w)-t_n)\frac{\partial y}{\partial w_1}(x_n, w)\\&=\frac12\sum_{n=1}^N2(y(x_n,w)-t_n)x_n^1\\&=\sum_{n=1}^N(y(x_n,w)-t_n)x_n^1\\&=\sum_{n=1}^N\left(\sum_{j=0}^Mw_jx_n^j-t_n\right)x_n^1\end{aligned}$$
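If you want to sanity-check the index bookkeeping numerically, here is a minimal sketch (the toy data, the NumPy usage, and the model order are my own assumptions, not part of the exercise) that compares the analytic partial $\partial E/\partial w_1$ derived above with a finite-difference estimate:

```python
import numpy as np

# Toy data, made up purely for illustration: N points, polynomial of order M
rng = np.random.default_rng(0)
N, M = 10, 3
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)
w = rng.standard_normal(M + 1)

def y(xn, w):
    # y(x, w) = sum_{j=0}^{M} w_j x^j
    return sum(w[j] * xn ** j for j in range(len(w)))

def E(w):
    # E(w) = 1/2 sum_{n=1}^{N} (y(x_n, w) - t_n)^2
    return 0.5 * sum((y(xn, w) - tn) ** 2 for xn, tn in zip(x, t))

# Last line of the derivation: dE/dw_1 = sum_n (y(x_n, w) - t_n) * x_n^1
analytic = sum((y(xn, w) - tn) * xn ** 1 for xn, tn in zip(x, t))

# Central finite-difference estimate of the same partial derivative
eps = 1e-6
w_plus, w_minus = w.copy(), w.copy()
w_plus[1] += eps
w_minus[1] -= eps
numeric = (E(w_plus) - E(w_minus)) / (2 * eps)

print(analytic, numeric)  # the two numbers should agree to several decimal places
```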

With a little cajoling, setting that partial equal to $0$ gives you the second ($i=1$) equation of the linear system the textbook asks for. What happens when you repeat the exercise for an arbitrary $w_i$ instead of just $w_1$?
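(If you want something to check your result against: carrying the same steps through for a general index $i$ and then swapping the order of the two sums should give $$\sum_{n=1}^N\left(\sum_{j=0}^Mw_jx_n^j-t_n\right)x_n^i=0\quad\Longleftrightarrow\quad\sum_{j=0}^M\left(\sum_{n=1}^Nx_n^{i+j}\right)w_j=\sum_{n=1}^Nx_n^i\,t_n,$$ which is exactly the system $\sum_{j=0}^MA_{ij}w_j=T_i$ from the exercise statement.)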

Since this point is rarely brought up, I want to point out that the indices are indeed a large part of the confusion in a problem like this. Especially once you want to simplify things or speed up calculations, keeping track of all that bookkeeping is a nightmare. The same set of equations pops out, though, when we take derivatives with respect to vectors or matrices instead. Define $$X=\begin{pmatrix}x_1^0&\cdots&x_1^M\\\vdots&\ddots&\vdots\\x_N^0&\cdots&x_N^M\end{pmatrix},\qquad t=\begin{pmatrix}t_1\\\vdots\\t_N\end{pmatrix},\qquad w=\begin{pmatrix}w_0\\\vdots\\w_M\end{pmatrix},$$ so that row $n$ of the $N\times(M+1)$ design matrix $X$ holds the powers $x_n^0,\dots,x_n^M$. Note that your error function can then be succinctly expressed as $$E(w)=\frac12(Xw-t)^T(Xw-t).$$
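As a quick numerical check that the matrix form really is the same error function (again a sketch with made-up data; `np.vander` just builds the powers-of-$x$ matrix defined above):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 10, 3
x = rng.uniform(-1.0, 1.0, size=N)
t = rng.standard_normal(N)
w = rng.standard_normal(M + 1)

# Design matrix: row n contains x_n^0, x_n^1, ..., x_n^M
X = np.vander(x, M + 1, increasing=True)

# Index form: E(w) = 1/2 sum_n (sum_j w_j x_n^j - t_n)^2
E_sum = 0.5 * sum((sum(w[j] * x[n] ** j for j in range(M + 1)) - t[n]) ** 2 for n in range(N))

# Matrix form: E(w) = 1/2 (Xw - t)^T (Xw - t)
E_mat = 0.5 * (X @ w - t) @ (X @ w - t)

print(np.isclose(E_sum, E_mat))  # True
```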

To take the derivative, we proceed almost as if we were performing standard 1D calculus, and skipping to the solution, we find $$\frac{\partial E}{\partial w}=X^TXw-X^Tt.$$
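For completeness, one way to fill in the skipped step: expanding the quadratic gives $$E(w)=\frac12\left(w^TX^TXw-2t^TXw+t^Tt\right),$$ and applying the standard identities $\frac{\partial}{\partial w}\left(w^TAw\right)=2Aw$ (for symmetric $A$) and $\frac{\partial}{\partial w}\left(b^Tw\right)=b$ with $A=X^TX$ and $b=X^Tt$ yields the gradient above.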

Setting this equal to $0$, we recover the equations the author asks for: $$X^TXw=X^Tt.$$ Better yet, in the event that $X$ is square and invertible (which can unfortunately only happen when $N=M+1$, and in that case holds whenever the $x_n$ are distinct), the problem simplifies even further to $$Xw=t,$$ a simplification we would almost certainly have missed while wading through indices.
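Putting it all together, here is a minimal sketch (NumPy and the toy data are my own choices, not from the book) that solves the normal equations directly and checks the result against a library least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 10, 3
x = rng.uniform(-1.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

# N x (M+1) design matrix of powers of x
X = np.vander(x, M + 1, increasing=True)

# Solve the normal equations X^T X w = X^T t
w_normal = np.linalg.solve(X.T @ X, X.T @ t)

# Compare with NumPy's least-squares solution of X w = t
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True, up to numerical precision
```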