Derive the gradient of the second-order Taylor approximation

multivariable-calculus, optimization

I am a second-year student in applied physics and electrical engineering, and I am preparing for an exam in optimization. I am trying to understand the derivation/calculation of the step direction $\mathbf{d}$, which is part of Newton's method in higher dimensions.

In my textbook, the derivation starts as follows:

The second-order Taylor approximation of the function $f(\mathbf{x})$ around the point $\mathbf{x}^{(k)}$ can be written as \begin{equation}
f_{2}(\mathbf{x}) = f(\mathbf{x}^{(k)}) + \nabla f(\mathbf{x}^{(k)})^T(\mathbf{x} - \mathbf{x}^{(k)}) + \frac{1}{2}(\mathbf{x} - \mathbf{x}^{(k)})^T\nabla^2 f(\mathbf{x}^{(k)})(\mathbf{x} - \mathbf{x}^{(k)})
\end{equation}

Taking the gradient of $f_{2}(\mathbf{x})$ we get

$$ \nabla f_{2}(\mathbf{x}) = \nabla f(\mathbf{x}^{(k)}) + \nabla^2 f(\mathbf{x}^{(k)})(\mathbf{x} - \mathbf{x}^{(k)}) $$

Now, if we set $\nabla f_{2}(\mathbf{x}) = 0$, multiply both sides by the inverse of the matrix $\nabla^2 f(\mathbf{x}^{(k)})$, and solve for $\mathbf{d} := \mathbf{x} - \mathbf{x}^{(k)}$, we get the correct answer.
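For reference, assuming the Hessian $\nabla^2 f(\mathbf{x}^{(k)})$ is invertible, the step this leads to is
$$ \mathbf{d} = -\left[\nabla^2 f(\mathbf{x}^{(k)})\right]^{-1}\nabla f(\mathbf{x}^{(k)}). $$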

The only part of the derivation I have trouble understanding is the step where we take the gradient of $f_{2}(\mathbf{x})$; there is no explanation in the textbook. I would appreciate it if you could explain just enough for me to be able to calculate this gradient. I have taken one introductory course in linear algebra and Calculus 1-3 with some vector analysis.

Best Answer

Many of the subexpressions in $f_2(\mathbf x)$ are constants. I suggest taking it term by term and expanding by coordinates. The Einstein summation convention is useful for reducing clutter during these manipulations, but if you’re not familiar with it, that’s not a big deal.

The first term is easy. It's a constant, so its gradient vanishes. In the second term, $\nabla f(\mathbf x^{(k)})$ is a constant vector. Let's call it $\mathbf v$ and rewrite this term as $\mathbf v^T\mathbf x-\mathbf v^T\mathbf x^{(k)}$. The second piece of this is constant, so it drops out. Expanding the first piece by coordinates, $\mathbf v^T\mathbf x = \sum_i v_ix_i$, from which ${\partial\over\partial x_i}\left(\mathbf v^T\mathbf x\right)$ obviously equals $v_i$, so its gradient is simply $\mathbf v$. Therefore, the gradient of the second term of $f_2(\mathbf x)$ is $\nabla f(\mathbf x^{(k)})$.
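In case that coordinate step is not obvious, written out with the Kronecker delta ($\delta_{ij}=1$ if $i=j$ and $0$ otherwise) it reads
$$\frac{\partial}{\partial x_j}\left(\mathbf v^T\mathbf x\right)=\frac{\partial}{\partial x_j}\sum_i v_i x_i=\sum_i v_i\,\delta_{ij}=v_j,$$
and collecting these partials for all $j$ gives back the vector $\mathbf v$.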

Moving to the next term, $\nabla^2 f(\mathbf x^{(k)})$ is some constant matrix $A$. Setting $\mathbf y=\mathbf x-\mathbf x^{(k)}$, this term expands into $\frac12\sum_{i,j}a_{ij}y_iy_j$. I'll leave computing the partial derivatives and reassembling the result into a vector/matrix expression to you. Remember that $A$ is symmetric.
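If you want something to check your work against once you have tried it yourself: differentiating with respect to a single coordinate $x_m$ (note $\partial y_i/\partial x_m=\delta_{im}$, since $\mathbf x^{(k)}$ is constant), the product rule gives
$$\frac{\partial}{\partial x_m}\,\frac12\sum_{i,j}a_{ij}y_iy_j = \frac12\Bigl(\sum_j a_{mj}y_j + \sum_i a_{im}y_i\Bigr) = \sum_j a_{mj}y_j,$$
where the last equality uses the symmetry $a_{im}=a_{mi}$. Reassembling over all $m$ yields $A\mathbf y=\nabla^2 f(\mathbf x^{(k)})(\mathbf x-\mathbf x^{(k)})$, which is exactly the last term of the textbook's gradient.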
