[Math] Minimize Expected Squared Prediction Error (EPE)


I have difficulty understanding how, when minimizing the expected squared prediction error

$$\operatorname{EPE}(\beta)=\int (y-x^T \beta)^2 \Pr(dx, dy),$$

one arrives at the solution

$$\operatorname{E}[yx]-\operatorname{E}[xx^{T}\beta]=0.$$

From A Solution Manual and Notes for the Text: The Elements of Statistical Learning (page 2), I noticed formula (2):

$$\frac{\partial \operatorname{EPE}}{\partial \beta}=\int 2(y-x^T\beta)(-1) x \Pr(dx, dy).$$

I understand that the chain rule is used here to obtain the derivative, but why is the last factor $x$ rather than $x^T$? Since $x$ is a column vector, and the Jacobian of a scalar with respect to a column vector should be a row vector, I would expect

$$\frac{\partial x^T\beta}{\partial \beta}=x^T.$$

What is missing in my analysis?

According to the Jacobian convention, shouldn't $\frac{\partial x^T\beta}{\partial \beta}$ be a row vector rather than a column vector?

$$\frac{\partial \operatorname{EPE}}{\partial \beta}= \left (\frac{\partial \operatorname{EPE}}{\partial \beta_{1}}, \frac{\partial \operatorname{EPE}}{\partial \beta_{2}}, \dots, \frac{\partial \operatorname{EPE}}{\partial \beta_{N}} \right )$$

Best Answer

Suppose $x^T=\begin{pmatrix} x_1 & x_2 \end{pmatrix}$ and $\beta=\begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$.

Then $x^T\beta = x_1\beta_1 + x_2\beta_2$.

The derivative with respect to $\beta$, written as a column vector (the gradient, i.e. the denominator-layout convention used in formula (2) of the solution manual), is

$$\frac{\partial (x^T\beta)}{\partial \beta}=\begin{pmatrix} \frac{\partial (x^T\beta)}{\partial \beta_1} \\ \frac{\partial (x^T\beta)}{\partial \beta_2} \end{pmatrix}=\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}=x.$$

Your row-vector Jacobian $x^T$ is not wrong; it is the same derivative written in the numerator-layout convention. The two layouts are transposes of each other, and setting either one to zero leads to the same equations.
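The same componentwise argument works for a $p$-dimensional $x$, and plugging the gradient back into the EPE fills in the step from formula (2) to the condition in the question. This is just a sketch of the missing algebra, assuming differentiation under the integral sign is allowed:

$$\frac{\partial \operatorname{EPE}}{\partial \beta}=\int 2(y-x^T\beta)(-1)\,x \Pr(dx, dy) = -2\left(\operatorname{E}[yx]-\operatorname{E}[xx^{T}\beta]\right),$$

and setting this to zero (dropping the factor $-2$) gives exactly $\operatorname{E}[yx]-\operatorname{E}[xx^{T}\beta]=0$, i.e. $\beta=\operatorname{E}[xx^T]^{-1}\operatorname{E}[xy]$ whenever $\operatorname{E}[xx^T]$ is invertible.

If it helps to see this numerically, here is a small NumPy check (my own sketch, not from the book or the solution manual; the data set is simulated and the helper names `epe_hat`, `grad_analytic`, `grad_numeric` are mine). It verifies that the column-vector gradient from formula (2), with expectations replaced by sample averages, matches a finite-difference gradient, and that setting it to zero reproduces the sample version of $\operatorname{E}[xx^T]\beta=\operatorname{E}[xy]$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data; sample averages below stand in for the population
# expectations E[xx^T] and E[xy].
n, p = 1000, 3
X = rng.normal(size=(n, p))                 # rows are draws of x^T
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)      # y = x^T beta + noise

def epe_hat(beta):
    """Empirical analogue of EPE(beta): average squared prediction error."""
    r = y - X @ beta
    return np.mean(r ** 2)

def grad_analytic(beta):
    """Gradient from formula (2) with sample averages:
    (1/n) * sum_i 2 * (y_i - x_i^T beta) * (-1) * x_i, a length-p vector."""
    return -2.0 * X.T @ (y - X @ beta) / n

def grad_numeric(beta, eps=1e-6):
    """Central finite differences of epe_hat, one coordinate at a time."""
    g = np.zeros_like(beta)
    for j in range(len(beta)):
        e = np.zeros_like(beta)
        e[j] = eps
        g[j] = (epe_hat(beta + e) - epe_hat(beta - e)) / (2 * eps)
    return g

beta0 = np.array([0.3, 0.1, -0.4])
print(grad_analytic(beta0))                 # analytic column-vector gradient
print(grad_numeric(beta0))                  # agrees up to finite-difference error

# Zero gradient <=> sample version of E[xx^T] beta = E[xy]:
beta_hat = np.linalg.solve(X.T @ X / n, X.T @ y / n)
print(beta_hat)                             # close to beta_true
```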