[Math] Computing the derivative of a quadratic form and matrix chain rule

derivatives, matrices, multivariable-calculus, reference-request

I'm working on using the Generalized Method of Moments to analyze some yogurt purchase data, and in the course of trying to implement the standard Hansen method (i.e. not an empirical likelihood method), I need to compute first and second derivatives of the following function:

$$Q(\theta) = \biggl[\frac{1}{N}\sum_{i=1}^{N}\psi(Z_{i},\theta)\biggr]^{T}C\biggl[\frac{1}{N}\sum_{i=1}^{N}\psi(Z_{i},\theta)\biggr].$$

Here, $\psi(Z_{i},\theta)$ is a vector function (in my case a 9-by-1 column vector of the moment conditions; you can just think of each component as a function of the scalar parameter $\theta$ if you wish). The $Z_{i}$ are the individual purchase data. $C$ is a weight matrix derived from the model assumptions, but you can treat it as just the identity matrix of suitable size if you want; it shouldn't matter as it isn't a function of $\theta$.

If I let $$F(\theta) = \biggl[\frac{1}{N}\sum_{i=1}^{N}\psi(Z_{i},\theta)\biggr]$$ for simplicity, then the place I am getting stuck is computing the first and second derivatives of $Q$ w.r.t. $\theta$. This is a scalar-valued objective function of a single variable, so everything involved should work out to be a scalar.

Based on the Wikipedia article on Matrix Calculus, here is what I have tried thus far. $$\frac{dQ}{d\theta} = \frac{dQ}{dF}\cdot{}\frac{dF}{d\theta} = \biggl[ F(\theta)^{T}(C+C^{T})\biggr]\cdot{}\biggl[\frac{d}{d\theta}F(\theta)\biggr]$$
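As a sanity check on this first-derivative formula, here is a small numerical sketch. The moment function $\psi$ below is a fabricated smooth stand-in (not the actual yogurt-data moments), chosen only so that $F$ is a 9-by-1 vector function of a scalar $\theta$; the analytic derivative $F^{T}(C+C^{T})F'$ is compared against a central finite difference of $Q$.

```python
import numpy as np

# Fabricated stand-in for the averaged moment conditions F(theta):
# any smooth map from a scalar theta to a 9-vector will do for the check.
rng = np.random.default_rng(0)
a, b = rng.normal(size=9), rng.normal(size=9)

def F(theta):
    return a * theta + b * np.sin(theta)

def dF(theta):
    return a + b * np.cos(theta)

C = rng.normal(size=(9, 9))  # arbitrary constant weight matrix

def Q(theta):
    f = F(theta)
    return f @ C @ f          # scalar quadratic form F^T C F

def dQ(theta):
    # candidate formula: F^T (C + C^T) F'
    return F(theta) @ (C + C.T) @ dF(theta)

theta0, h = 0.7, 1e-6
fd = (Q(theta0 + h) - Q(theta0 - h)) / (2 * h)  # central difference
```

If the formula were missing a transpose or a factor, `fd` and `dQ(theta0)` would disagree for a generic (non-symmetric) `C`.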

Next I want to take the derivative again, so I use the matrix chain and product rules. In my case, it happens that the final factor, $\frac{d}{d\theta}F(\theta)$, is no longer a function of $\theta$ (its components are constants), so its derivative is zero and we only need to worry about the first factor of the product.

$$\frac{d}{d\theta}\biggl[ F(\theta)^{T}(C+C^{T})\biggr]\cdot{}\biggl[\frac{d}{d\theta}F(\theta)\biggr].$$

As far as I can tell, this just results in the following:
$$ \frac{d}{d\theta}\biggl[ F(\theta)^{T}(C+C^{T})\biggr]\cdot{}\biggl[\frac{d}{d\theta}F(\theta)\biggr] = \biggl(\frac{d}{d\theta}F(\theta) \biggr)^{T}(C+C^{T})\biggl(\frac{d}{d\theta}F(\theta) \biggr).$$
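The second-derivative formula can be checked the same way. Taking $F$ linear in $\theta$ (so that $F''=0$ holds exactly, as assumed above), the claimed value $(F')^{T}(C+C^{T})(F')$ should match a second finite difference of $Q$; again all the specific functions below are fabricated for the check.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=9), rng.normal(size=9)
C = rng.normal(size=(9, 9))   # arbitrary constant weight matrix

def F(theta):                 # linear in theta, so F'' = 0
    return a * theta + b

def Q(theta):
    f = F(theta)
    return f @ C @ f

# Claimed second derivative: (F')^T (C + C^T) F', with F' = a here
d2Q_analytic = a @ (C + C.T) @ a

theta0, h = 0.3, 1e-4
d2Q_fd = (Q(theta0 + h) - 2 * Q(theta0) + Q(theta0 - h)) / h**2
```

Since $F$ is linear, $Q$ is exactly quadratic in $\theta$ and the second difference agrees with the formula up to floating-point roundoff.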

This gives a nice formula, but when I use these first- and second-derivative results to program up Newton's method to find the value of $\theta$ that minimizes the quadratic form, the method does not converge, and I am concerned that this is because I have calculated the derivatives incorrectly (a missing transpose, or something like that).
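For what it's worth, with the two formulas above, a minimal Newton iteration on $Q'$ can be sketched as follows. This again uses a fabricated linear $F$ and a symmetric positive-definite weight matrix (so that $Q$ has a unique minimizer and $Q''>0$); it is a toy check of the update rule, not the actual GMM estimation.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=9), rng.normal(size=9)
# Symmetric positive-definite weight so Q is strictly convex in theta
M = rng.normal(size=(9, 9))
C = M @ M.T + 9.0 * np.eye(9)

def F(theta):   return a * theta + b   # linear: F'' = 0
def dF(theta):  return a

def dQ(theta):  return F(theta) @ (C + C.T) @ dF(theta)
def d2Q(theta): return dF(theta) @ (C + C.T) @ dF(theta)

theta = 10.0                           # deliberately poor start
for _ in range(5):
    theta -= dQ(theta) / d2Q(theta)    # Newton step on Q'
```

Because $Q$ is exactly quadratic here, a single Newton step already lands on the stationary point; on the real problem, non-convergence with correct derivatives usually points at the starting value or at $Q''$ changing sign.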

Additionally, links to good, clearly written references that explain the logic behind matrix calculus, especially when and why transpositions occur, would be appreciated. Almost all references I could find in 30+ minutes of Googling were absolutely inscrutable and tended to just state results with no exposition at all.

Best Answer

Letting primes denote derivatives with respect to $\theta$ and writing $A=C+C^T$, I concur with your calculations:
$$\begin{aligned} Q'' = \frac{d^2Q}{d\theta^2} &= \frac{d}{d\theta}\left(F^T A\,F'\right) && \text{chain rule, } \frac{\partial\,\mathbf{x}^T\mathbf{C}\mathbf{x}}{\partial\mathbf{x}}=\mathbf{x}^T\left(\mathbf{C}+\mathbf{C}^T\right) \\ &= \left(\frac{d}{d\theta}F^T A\right)F' && \text{product rule, assumption } F''=0 \\ &= \left(F'\right)^T A\,F' && \text{product rule, assumption } A'=0 \text{ (or chain rule, } \tfrac{\partial\,\mathbf{x}^T\mathbf{A}}{\partial\mathbf{x}}=\mathbf{A}\text{)} \end{aligned}$$
(The identities used here are collected at http://en.wikipedia.org/wiki/Matrix_calculus#Identities.)

The logic is all based on summation: the transposes are just what is forced on us by the way matrix products are defined. For vectors $\vec{x},\vec{y}\in\mathbb{R}^n\cong\mathbb{R}^{n\times1}$ identified with $n\times1$ column matrices,
$$\vec{x}^T\cdot\vec{y} = \begin{bmatrix}x_1\\\vdots\\x_n\end{bmatrix}^T \cdot \begin{bmatrix}y_1\\\vdots\\y_n\end{bmatrix} = \begin{bmatrix}x_1&\cdots&x_n\end{bmatrix} \cdot \begin{bmatrix}y_1\\\vdots\\y_n\end{bmatrix} = \sum_{i=1}^n x_i y_i.$$
Tensor notation goes further and dispenses with the summation signs altogether, introducing a convention of superscript-subscript pairing for cancelling indices: $x_iy^i$ (or $x^iy_i$) for the sum above.
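The row-times-column picture above is easy to see concretely; the vectors below are arbitrary toy values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# x^T y as a 1x3 row matrix times a 3x1 column matrix...
as_matrices = (x.reshape(1, 3) @ y.reshape(3, 1)).item()
# ...is the same scalar as the explicit summation sum_i x_i y_i
as_sum = sum(xi * yi for xi, yi in zip(x, y))
```

The transpose on $\vec{x}$ is exactly what turns a column into a row so that the product contracts to a single number.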
But since $Q$ is a single real quantity, the ordinary product rule applies to its derivative, and the rules of matrix multiplication then let us write the result in the forms that appear in multifarious sources. A good reference would be a book on real analysis, vector calculus/analysis, tensor calculus/analysis, or differential geometry. For a (real analysis) example, Rudin's Principles of Mathematical Analysis, but there are probably also many other good recommendations.

So the second derivative of a quadratic form in $\mathbf{x}(\theta)$ with constant matrix $C$, where $\mathbf{x}''=\mathbf{0}$, is another quadratic form, in $\mathbf{x}'$, with matrix $A=C+C^T$. Good to know.

As to the convergence, are the conditions for Newton's method met? Is $Q'\ne0$ on a suitable interval $I$ of $\theta$? Do you know the value of $$M=\sup_{\theta\in I}\frac12\left|\frac{Q''}{Q'}\right|,$$ your constant for quadratic convergence (i.e. so that $|\epsilon_{n+1}|\le M\epsilon_n^2$)? And, lastly, do you have good initial estimates?
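A crude way to inspect these conditions in practice is to sample $Q'$ and $Q''$ on a grid over the interval $I$ and estimate the constant $M$ from the formula above. The scalar objective below is a made-up stand-in chosen so the supremum is known in closed form:

```python
import numpy as np

# Hypothetical smooth scalar objective and its derivatives
def dQ(theta):  return 2.0 * (theta - 2.0)
def d2Q(theta): return 2.0 + 0.0 * theta   # constant second derivative

# Interval I = [3, 5], chosen bounded away from the zero of Q' at theta = 2
I = np.linspace(3.0, 5.0, 1001)
assert np.all(np.abs(dQ(I)) > 0)           # Q' != 0 on I

# Grid estimate of M = sup_{theta in I} (1/2)|Q''/Q'|
M = 0.5 * np.max(np.abs(d2Q(I) / dQ(I)))
```

Here the supremum is attained at the left endpoint $\theta=3$, where $\tfrac12|Q''/Q'| = \tfrac12|2/2| = \tfrac12$; a large $M$, or a grid point where $Q'$ nearly vanishes, is a warning sign for the iteration.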

Also, what is the geometry (i.e. eigenvalues/-vectors, Witt index) of $C$ & $A$?
