Partial derivative of a matrix-vector product with respect to a component of the vector

calculus, linear algebra, matrix-calculus, partial derivative

Given $RSS(\beta) = y^Ty - y^TX\beta - \beta^TX^Ty + \beta^TX^TX\beta$, I wish to compute $\frac{\partial RSS}{\partial \beta_j}$.

I haven't stumbled on any definitions for matrix vector derivatives that are partial with respect to a component of the vector, so I tried to derive it myself. But then I ran into the fact that I can derive it two different ways and get two different answers:

  1. First take the simpler case $g(\beta) = y^TX\beta$; what would $\frac{\partial g}{\partial \beta_j}$ be? The row vector $y^TX$ is dotted with $\beta$, but since we are only taking the derivative with respect to $\beta_j$, we want the element of $y^TX$ that $\beta_j$ is multiplied against. This is just $(y^TX)_j$, a scalar. Given that $RSS(\beta)$ returns a scalar, we expect $\frac{\partial RSS}{\partial \beta_j}$ to be a scalar as well, so this bodes well for our ultimate goal.

  2. In single-variable calculus it's common to look at a linearization of a function centered around a point. We often write $L(x) = f(x_0) + f'(x_0)(x-x_0)$. If I try to generalize this notion to $\frac{\partial g}{\partial \beta_j}$ I get $L(\beta) = y^TX\beta_0 + P(\beta_0)(\beta - \beta_0)$, where $P(\beta_0)$ is a stand-in for the partial derivative we are trying to derive. We know $(\beta - \beta_0)$ is a column vector, and we know $y^TX\beta_0$ is 1×1. But then we need $P(\beta_0)$ to be a row vector, and we already decided it was a scalar or a 1×1 matrix. If it were a row vector, a value of $\lbrack 0 \ldots 0\ (y^TX)_j\ 0 \ldots 0 \rbrack$ would make sense. But if it's a row vector, then the terms of $\frac{\partial RSS}{\partial \beta_j}$ won't be scalars, and we are expecting a scalar.

Is $\frac{\partial RSS}{\partial \beta_j}$ well defined? How do I reconcile these two views?
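One quick way to arbitrate between the two derivations is a finite-difference check; below is a minimal numerical sketch (the data, shapes, and index `j` are arbitrary made-up values). The difference quotient of $g$ in the $\beta_j$ direction comes out as a single number equal to $(y^TX)_j$, and the quotient for $RSS$ is likewise a single number.

```python
import numpy as np

# Minimal sketch: the data, shapes, and index j are arbitrary.
rng = np.random.default_rng(0)
n, p = 6, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
beta = rng.normal(size=p)

def g(b):
    return y @ X @ b        # y^T X b, a scalar

def rss(b):
    r = y - X @ b           # RSS(b) = (y - Xb)^T (y - Xb)
    return r @ r            #        = y^T y - 2 y^T X b + b^T X^T X b

j, h = 1, 1e-6
e_j = np.zeros(p)
e_j[j] = 1.0

# Difference quotients in the beta_j coordinate direction.
print((g(beta + h * e_j) - g(beta)) / h, (y @ X)[j])   # scalars, and they agree
print((rss(beta + h * e_j) - rss(beta)) / h)           # also a single number
```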

Best Answer

The simple function used in cases 1 and 2 can be simplified further, $$g = y^TX\beta = v^T\beta, \qquad v \doteq X^Ty$$ since $(y,X)$ are constants.

The derivative of a vector $(\beta)$ with respect to its $j$th component $(\beta_j)$ yields the $j$th standard basis vector (written horizontally to save space) $$\frac{\partial\beta}{\partial\beta_j} = \big[\matrix{0&\ldots&0&1&0&\ldots&0}\big]^T\;\equiv\;e_j$$ Dotting it with another vector returns that vector's $j$th component, $\;v_j=v^Te_j,\;$ which concludes case ${\tt1}$.
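In the original notation this is just ordinary componentwise partial differentiation; the display below only expands the line above, with $v^T = y^TX$: $$\frac{\partial g}{\partial\beta_j} \;=\; \frac{\partial}{\partial\beta_j}\sum_k v_k\beta_k \;=\; v_j \;=\; (y^TX)_j$$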

For case ${\tt2}$, you started out with a scalar function of a scalar argument $(x)$ $$L(x) = f(x_0) + f'(x_0)(x-x_0) $$ but then switched to a vector argument $(\beta)$.

You retained the symbols $\{f,f'\}$, but they are now very different mathematical objects. The following should clarify their new meanings. $$\eqalign{ L(\beta) &= v^T\beta_0 + v^T(\beta-\beta_0) \;\doteq\; v^T\beta \\ &= f(\beta_0) + f'(\beta_0)(\beta-\beta_0) \\ f(\beta_0) &= v^T\beta_0 \\ f'(\beta_0) &= v^T \;\doteq\; P(\beta_0) \\ }$$ So $P(\beta_0)$ is seen to be a row vector, not a scalar or a ${\tt1}\times{\tt1}$ matrix.
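With $P(\beta_0)=v^T$ in hand, the same $e_j$ device settles the quantity the question actually asks for; a sketch of the remaining step, using only the product rule and the results above: $$\eqalign{ \frac{\partial RSS}{\partial\beta_j} &= 0 - y^TXe_j - e_j^TX^Ty + e_j^TX^TX\beta + \beta^TX^TXe_j \\ &= -2\,(X^Ty)_j + 2\,(X^TX\beta)_j \;=\; 2\,\big(X^T(X\beta-y)\big)_j \\ }$$ Each term is a ${\tt1}\times{\tt1}$ product, so $\frac{\partial RSS}{\partial\beta_j}$ is a well-defined scalar for every $j$, consistent with case ${\tt1}$.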