Solved – Transpose or lack of transpose in the $\hat y=X\hat \beta$ regression equation

Tags: notation, regression

$$\large \hat y = X\,\hat\beta \tag 1$$

seems to be (?) the most commonly encountered expression of the ordinary least squares projection. The actual values of the "dependent" variable differ from the estimated $\hat y$ values, so that…

$$\large y = X\,\hat\beta + \varepsilon$$

This is the expression that appears in Wikipedia. In it, $X$ is the model matrix, typically an $m\times n$ matrix of $m$ observations (or subjects) and $n$ variables; $\hat \beta$ is an $n \times 1$ vector of coefficients; and $y$ and $\hat y$ are the observed and predicted values, respectively, of the "dependent" variable, each an $m \times 1$ vector. $\varepsilon$ is the $m \times 1$ vector of errors.
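To make the dimensions concrete, here is a minimal NumPy sketch of expression $(1)$ with made-up data (the variable names are purely illustrative): the model matrix carries an intercept column of ones, $\hat\beta$ is obtained by least squares, and the fitted values are the matrix product $X\hat\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 5, 3                          # m observations, n columns (incl. intercept)
X = np.column_stack([np.ones(m),     # intercept column of ones
                     rng.normal(size=(m, n - 1))])
y = rng.normal(size=m)               # observed "dependent" variable, m x 1

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # estimated coefficients, n x 1

y_hat = X @ beta_hat                 # expression (1): fitted values, m x 1
eps = y - y_hat                      # residuals, so that y = X beta_hat + eps
```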

However, in the book The Elements of Statistical Learning (Second Edition) by T. Hastie, R. Tibshirani and J. Friedman, on page 11, this is expressed as:

$$\large \hat Y = X^T \,\hat\beta \tag 2$$

where $X^T$ denotes vector or matrix transpose ($X$ being a column vector). Here we are modeling a single output, so $\hat Y$ is a scalar; in general $\hat Y$ can be a $K$–vector, in which case $\beta$ would be a $p \times K$ matrix of coefficients. In the $(p + 1)$-dimensional input–output space, $(X,\hat Y)$ represents a hyperplane. If the constant is included in $X$, then the hyperplane includes the origin and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point $(0,\hat\beta_0)$. From now on we assume that the intercept is included in $\hat\beta$.
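For concreteness (my own expansion of the book's equation, not from the text): with $p = 2$ inputs and the intercept absorbed into $\hat\beta$, the single-observation input is the column vector $X = (1,\, x_1,\, x_2)^T$, so

$$\hat Y = X^T\hat\beta = \begin{pmatrix}1 & x_1 & x_2\end{pmatrix}\begin{pmatrix}\hat\beta_0\\ \hat\beta_1\\ \hat\beta_2\end{pmatrix} = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2,$$

a single number.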

The $X^T$ notation seems most intuitive with one single vector, but it is not so straightforward when $X$ is a model matrix. It also seems as though the authors are allowing for the possibility of multivariate regression.

In any event, the paragraph is far from clear to me (including the hat-less $\beta$), and I would like to ask for a "connecting" explanatory answer clarifying how expressions $(1)$ and $(2)$ are equivalent (if they are).

Best Answer

In matrix notation $$\hat{Y} = X \hat{\beta} $$

where $\hat{Y}$ is the fitted $m \times 1$ response vector, $X$ is the $m \times n$ model matrix, and $\hat{\beta}$ is the estimated $n \times 1$ coefficient vector. Each column of $X$ is a predictor, and each row contains the observed predictor values for one observation.

If we write $X = (x_1 \; x_2 \; \ldots \; x_m)^T$, then each $x_i$ is the ($n \times 1$) vector of observed predictors for the $i$th observation, and each $x_i^T$ is a row of $X$. Then, from the matrix notation, for each observation $i$ we have

$$\hat{y}_i = x_i^T \hat{\beta}.$$

As amoeba pointed out, Hastie et al. use the notation $X$ in place of my $x_i$ here, which is different from the $X$ notation in the first equation.
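A short NumPy sketch (toy data; names are illustrative) showing that the two readings agree: the $i$th entry of $X\hat{\beta}$, as in equation $(1)$, equals $x_i^T\hat{\beta}$, the single-observation form of equation $(2)$, where $x_i^T$ is the $i$th row of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n - 1))])  # model matrix with intercept
y = rng.normal(size=m)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat_matrix = X @ beta_hat                  # all fitted values at once: y_hat = X beta_hat
y_hat_rowwise = np.array([X[i] @ beta_hat    # one observation at a time: y_hat_i = x_i^T beta_hat
                          for i in range(m)])

assert np.allclose(y_hat_matrix, y_hat_rowwise)
```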
