Solved – How does a vector of variables represent a hyperplane

references, regression, statistical-learning

I am reading Elements of Statistical Learning and on page 12 (section 2.3) a linear model is notated as:

$$\widehat{Y} = X^{T} \widehat{\beta}$$

…where $X^{T}$ is the transpose of a column vector of the predictors / independent variables / inputs. (It states earlier "all vectors are assumed to be column vectors" so wouldn't this make $X^{T}$ a row vector and $\widehat{\beta}$ a column vector?)

Included in $X$ is a "$1$" to be multiplied with the corresponding coefficient giving the (constant) intercept.
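
For concreteness, here is a tiny NumPy sketch of that prediction for a single input, with made-up coefficients and predictor values (none of these numbers come from the book):

```python
import numpy as np

# Hypothetical fitted coefficients; beta_hat[0] is the intercept beta_0.
beta_hat = np.array([1.5, 0.8, -0.3])

# One observation of two predictors, with a leading 1 for the intercept.
x = np.array([1.0, 2.0, 4.0])   # [1, x1, x2]

# Y_hat = X^T beta_hat is just an inner product.
y_hat = x @ beta_hat            # 1.5 + 0.8*2.0 + (-0.3)*4.0 = 1.9
print(y_hat)
```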

It goes on to say:

In the $(p + 1)$-dimensional input–output space, $(X,\ \widehat{Y})$ represents a hyperplane. If the constant is included in $X$, then the hyperplane includes the origin
and is a subspace; if not, it is an affine set cutting the $Y$-axis at the point
$(0,\ \widehat{\beta_0})$.

Does "$(X,\ \widehat{Y})$" describe a vector formed by the concatenation of the predictors, the intercept's "$1$" and $\widehat{Y}$? And why does including a "$1$" in $X$ force the hyperplane to pass through the origin, surely that "$1$" is to be multiplied by $\widehat{\beta_0}$?

I am failing to understand the book; any help / advice / links to resources would be very much appreciated.

Best Answer

Let $N$ be the number of observations and $K$ the number of explanatory variables.

$X$ is actually an $N\!\times\!K$ matrix. Only when we look at a single observation do we usually write $x_i^T$ - a row vector containing the explanatory variables of that particular observation, which is multiplied (as an inner product) with the $K\!\times\!1$ column vector $\beta$. Furthermore, $Y$ is an $N\!\times\!1$ column vector holding all observations $Y_n$.
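
If it helps to see the shapes, here is a minimal NumPy sketch with invented sizes ($N=5$, $K=3$) that just confirms the dimensions described above; the variable names are mine, not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3                      # invented sizes: 5 observations, 3 explanatory variables

X = rng.normal(size=(N, K))      # N x K matrix of regressors
beta = rng.normal(size=(K, 1))   # K x 1 column vector of coefficients
Y = X @ beta                     # N x 1 column vector

x_i = X[0, :].reshape(1, K)      # x_i^T: a 1 x K row vector for one observation
y_i = x_i @ beta                 # 1 x 1: the inner product x_i^T beta

print(X.shape, beta.shape, Y.shape, y_i.shape)   # (5, 3) (3, 1) (5, 1) (1, 1)
```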

Now, a two-dimensional hyperplane is spanned by the vector $Y$ and one(!) column vector of $X$. Remember that $X$ is an $N\!\times\!K$ matrix, so each explanatory variable is represented by exactly one column vector of the matrix $X$. If we have only one explanatory variable, no intercept and $Y$, all data points are situated along the two-dimensional plane spanned by $Y$ and $X$.

For a multiple regression, how many dimensions in total does the hyperplane between $Y$ and the matrix $X$ have? Answer: Since we have $K$ column vectors of explanatory variables in $X$, we must have a $K\!+\!1$ dimensional hyperplane.
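
One way to see this numerically (my own illustration, with made-up numbers): the fitted values $X\beta$ are by construction a linear combination of the $K$ columns of $X$, so appending them to $X$ as an extra column does not increase the rank.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
beta = np.array([[2.0], [-1.0], [0.5]])   # made-up coefficients
Y_hat = X @ beta                          # fitted values: a combination of X's columns

# Appending Y_hat leaves the rank at K, i.e. the fit lives
# entirely in the space spanned by the columns of X.
print(np.linalg.matrix_rank(X))                      # 3
print(np.linalg.matrix_rank(np.hstack([X, Y_hat])))  # still 3
```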

Usually, in a matrix setting, the regression requires a constant intercept for the slope coefficients to be unbiased and the analysis to be reasonable. To accommodate this, we force one column of the matrix $X$ to consist only of "$1$s". In this case, the coefficient $\beta_1$ is multiplied by a constant ($1$) for every observation rather than by a random explanatory variable. The coefficient $\beta_1$ therefore represents the expected value of $Y$ given that $x_{1i}$ is held fixed at $1$ and all other variables are zero. Hence the $(K\!+\!1)$-dimensional hyperplane is reduced by one dimension to a $K$-dimensional subspace, and $\beta_1$ corresponds to the "intercept" of this $K$-dimensional plane.
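
A small sketch of this trick with simulated data (all numbers invented): prepend a column of ones to $X$, fit by least squares, and the coefficient on that column comes out as the intercept.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
x2 = rng.normal(size=N)
y = 3.0 + 2.0 * x2 + rng.normal(scale=0.1, size=N)   # true intercept 3, true slope 2

X = np.column_stack([np.ones(N), x2])                # first column is all 1s
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)     # least-squares fit

print(beta_hat)   # roughly [3.0, 2.0]: beta_1 plays the role of the intercept
```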

In matrix settings it's always advisable to have a look at the simple case of two dimensions, to see if we can find an intuition for our results. Here, the easiest way is to think of the simple regression with two explanatory variables: $$ y_i=\beta_1x_{1i} + \beta_2x_{2i} +u_i $$ or alternatively expressed in matrix algebra: $Y=X\beta +u$, where $X$ is an $N\!\times\!2$ matrix.

$\langle Y,X\rangle$ spans a 3-dimensional hyperplane.

Now if we force all $x_{1i}$ to be $1$, we obtain: $$ y_i=\beta_{1} + \beta_2x_{2i} + u_i $$ which is our usual simple regression that can be represented in a two-dimensional $X,\ Y$ plot. Note that $\langle Y,X\rangle$ is now reduced to a line in two dimensions - a subset of the original 3-dimensional hyperplane. The coefficient $\beta_1$ corresponds to the intercept of the line, i.e. its value where it cuts the $y$-axis at $x_{2i}=0$.

It can further be shown that the fitted line passes through $\langle 0,\beta_1\rangle$ when the constant is included: setting $x_{2i}=0$ leaves exactly $\beta_1$. If we leave out the constant, the regression hyperplane $\hat y=X\beta$ always passes trivially through $\langle 0,0\rangle$ - no doubt. The algebra generalizes to multiple dimensions, as will be seen later when deriving $\beta$: $$ (X'X)\beta=X'y \implies (X'X)\beta-X'y=0 \implies X'(y-X\beta)=0. $$ Since $X$ has full rank by assumption, $(X'X)$ is invertible and $\beta=(X'X)^{-1}X'y$; the last expression above says that the residual vector $y-X\beta$ is orthogonal to every column of $X$. In particular, when a column of $1$s is included, the residuals sum to zero, so the fitted hyperplane passes through the point of sample means.
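
A quick numerical check of these orthogonality conditions, using simulated data (the variable names and numbers are my own):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
x2 = rng.normal(size=N)
y = 3.0 + 2.0 * x2 + rng.normal(size=N)

X = np.column_stack([np.ones(N), x2])          # includes the constant column
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations (X'X) beta = X'y
resid = y - X @ beta_hat

print(X.T @ resid)   # ~ [0, 0]: residuals are orthogonal to every column of X
print(resid.sum())   # ~ 0: the column of 1s forces the residuals to sum to zero
print(beta_hat[0])   # the fitted line cuts the y-axis at (0, beta_1)
```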

(Edit: I just realized that, for your second question, this is exactly the opposite of what you have written regarding inclusion or exclusion of the constant. However, I have already laid out the solution here, and I stand to be corrected if I am wrong on that one.)

I know the matrix representation of a regression can be quite confusing at the beginning, but it eventually simplifies things a lot once the derivations get more complex. Hope this helps a bit.