Solved – Showing that the OLS estimator is scale equivariant


I don't have a formal definition of scale equivariance, but here's what An Introduction to Statistical Learning says about it on p. 217:

The standard least squares coefficients… are scale equivariant: multiplying $X_j$ by a constant $c$ simply leads to a scaling of the least squares coefficient estimates by a factor of $1/c$.

For simplicity, let's assume the general linear model $\mathbf{y} = \mathbf{X}\boldsymbol\beta + \boldsymbol\epsilon$, where $\mathbf{y} \in \mathbb{R}^N$, $\mathbf{X}$ is an $N \times (p+1)$ matrix (where $p+1 < N$) with all entries in $\mathbb{R}$, $\boldsymbol\beta \in \mathbb{R}^{p+1}$, and $\boldsymbol\epsilon$ is an $N$-dimensional vector of real-valued random variables with $\mathbb{E}[\boldsymbol\epsilon] = \mathbf{0}_{N \times 1}$.

From OLS estimation, we know that if $\mathbf{X}$ has full (column) rank,
$$\hat{\boldsymbol\beta}_{\mathbf{X}} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}\text{.}$$
Suppose we multiplied a column of $\mathbf{X}$, say $\mathbf{x}_k$ for some $k \in \{1, 2, \dots, p+1\}$, by a constant $c \neq 0$. This is equivalent to forming the product
\begin{equation}
\mathbf{X}\underbrace{\begin{bmatrix}
1 & \\
& 1 \\
& & \ddots \\
& & & 1 \\
& & & & c\\
& & & & & 1 \\
& & & & & &\ddots \\
& & & & & & & 1
\end{bmatrix}}_{\mathbf{S}} =
\begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & c\mathbf{x}_{k} & \cdots & \mathbf{x}_{p+1}\end{bmatrix} \equiv \tilde{\mathbf{X}}
\end{equation}
where all other entries of the matrix $\mathbf{S}$ above are $0$, and $c$ is in the $k$th entry of the diagonal of $\mathbf{S}$. Then, $\tilde{\mathbf X}$ has full (column) rank as well, and the resulting OLS estimator using $\tilde{\mathbf X}$ as the new design matrix is
$$\hat{\boldsymbol\beta}_{\tilde{\mathbf{X}}} = \left(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}}\right)^{-1}\tilde{\mathbf{X}}^{T}\mathbf{y}\text{.}$$
After some work, one can show that
$$\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \begin{bmatrix}
\mathbf{x}_1^{T}\mathbf{x}_1 & \mathbf{x}_1^{T}\mathbf{x}_2 & \cdots & c\mathbf{x}_1^{T}\mathbf{x}_k & \cdots & \mathbf{x}_1^{T}\mathbf{x}_{p+1} \\
\mathbf{x}_2^{T}\mathbf{x}_1 & \mathbf{x}_2^{T}\mathbf{x}_2 & \cdots & c\mathbf{x}_2^{T}\mathbf{x}_k & \cdots & \mathbf{x}_2^{T}\mathbf{x}_{p+1} \\
\vdots & \vdots & \ddots & \vdots & \cdots & \vdots \\
c\mathbf{x}_k^{T}\mathbf{x}_1 & c\mathbf{x}_k^{T}\mathbf{x}_2 & \cdots & c^2\mathbf{x}_k^{T}\mathbf{x}_k & \cdots & c\mathbf{x}_k^{T}\mathbf{x}_{p+1} \\
\vdots & \vdots & \vdots & \vdots & \cdots & \vdots \\
\mathbf{x}_{p+1}^{T}\mathbf{x}_1 & \mathbf{x}_{p+1}^{T}\mathbf{x}_2 & \cdots & c\mathbf{x}_{p+1}^{T}\mathbf{x}_{k} & \cdots & \mathbf{x}_{p+1}^{T}\mathbf{x}_{p+1} \\
\end{bmatrix}$$
and
$$\tilde{\mathbf{X}}^{T}\mathbf{y} = \begin{bmatrix}
\mathbf{x}_1^{T}\mathbf{y} \\
\mathbf{x}_2^{T}\mathbf{y} \\
\vdots \\
c\mathbf{x}_k^{T}\mathbf{y} \\
\vdots \\
\mathbf{x}_{p+1}^{T}\mathbf{y}
\end{bmatrix}$$
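(In compact form, since $\mathbf{S}$ is symmetric, these two computations say that $\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}} = \mathbf{S}\mathbf{X}^{T}\mathbf{X}\mathbf{S}$ and $\tilde{\mathbf{X}}^{T}\mathbf{y} = \mathbf{S}\mathbf{X}^{T}\mathbf{y}$.)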
How do I go from here to show the claim quoted above (i.e., that $\hat{\beta}_{\tilde{\mathbf{X}},k} = \dfrac{1}{c}\hat{\beta}_{\mathbf{X},k}$, with the remaining coefficients unchanged)? It's not clear to me how to compute $(\tilde{\mathbf{X}}^{T}\tilde{\mathbf{X}})^{-1}$.

Best Answer

Since the assertion in the quotation is a collection of statements about rescaling the columns of $X$, you might as well prove them all at once. Indeed, it takes no more work to prove a generalization of the assertion:

When $X$ is right-multiplied by an invertible matrix $A$, then the new coefficient estimate $\hat\beta_A$ is equal to $\hat \beta$ left-multiplied by $A^{-1}$.

The only algebraic facts you need are the easily proven, well-known ones that $(AB)^\prime=B^\prime A^\prime$ for any conformable matrices $A$ and $B$ and that $(AB)^{-1}=B^{-1}A^{-1}$ for invertible matrices $A$ and $B$. (A subtler version of the latter is needed when working with generalized inverses: for invertible $A$ and $B$ and any $X$, $(AXB)^{-} = B^{-1}X^{-}A^{-1}$.)


Proof by algebra: $$\hat\beta_A = \left((XA)^\prime (XA)\right)^{-}(XA)^\prime y = A^{-1}(X^\prime X)^{-} (A^\prime)^{-1}A^\prime X^\prime y = A^{-1}(X^\prime X)^{-}X^\prime y = A^{-1}\hat \beta,$$

QED. (In order for this proof to be fully general, the $^-$ superscript refers to a generalized inverse.)
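If it helps to see the identity numerically, here is a minimal NumPy sketch (the simulated data, the sizes, and the use of a least-squares solver are my choices for illustration, not part of the proof); it checks $\hat\beta_A = A^{-1}\hat\beta$ for a random invertible $A$, using a full-rank design so that ordinary inverses suffice:

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 4                          # illustrative sizes (my choice)
X = rng.normal(size=(n, p))           # full-column-rank design, almost surely
y = rng.normal(size=n)
A = rng.normal(size=(p, p))           # random reparameterization; invertible almost surely

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]        # OLS fit on X
beta_hat_A = np.linalg.lstsq(X @ A, y, rcond=None)[0]  # OLS fit on XA

# Claim from the proof: beta_hat_A = A^{-1} beta_hat
print(np.allclose(beta_hat_A, np.linalg.solve(A, beta_hat)))  # expect True
```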


Proof by geometry:

Given bases $E_p$ and $E_n$ of $\mathbb{R}^p$ and $\mathbb{R}^n$, respectively, $X$ represents a linear transformation from $\mathbb{R}^p$ to $\mathbb{R}^n$. Right-multiplication of $X$ by $A$ can be considered as leaving this transformation fixed but changing $E_p$ to $AE_p$ (that is, to the columns of $A$). Under that change of basis, the representation of any vector $\hat\beta\in\mathbb{R}^p$ must change via left-multiplication by $A^{-1}$, QED.

(This proof works, unmodified, even when $X^\prime X$ is not invertible.)


The quotation specifically refers to the case of diagonal matrices $A$ with $A_{ii}=1$ for $i\ne j$ and $A_{jj}=c$.
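Spelling out that special case: the inverse of such an $A$ is again diagonal, with $1/c$ in the $j$th position and $1$ elsewhere, so
$$\hat\beta_A = A^{-1}\hat\beta = \left(\hat\beta_1, \dots, \hat\beta_{j-1}, \tfrac{1}{c}\hat\beta_j, \hat\beta_{j+1}, \dots, \hat\beta_p\right)^\prime.$$
Only the $j$th coefficient changes, and it is scaled by $1/c$, which is exactly the claim in the quotation.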


Connection with least squares

The objective here is to use first principles to obtain the result, with the principle being that of least squares: estimating coefficients that minimize the sum of squares of residuals.

Again, proving a (huge) generalization is no more difficult and is rather revealing. Suppose $\phi:V^p\to W^n$ is any map (linear or not) of real vector spaces and suppose $Q$ is any real-valued function on $W^n$. Let $U\subset V^p$ be the (possibly empty) set of points $v$ for which $Q(\phi(v))$ is minimized.

Result: $U$, which is determined solely by $Q$ and $\phi$, does not depend on any choice of basis $E_p$ used to represent vectors in $V^p$.

Proof: QED.

There's nothing to prove!

Application of the result: Let $F$ be a positive semidefinite quadratic form on $\mathbb{R}^n$, let $y\in\mathbb{R}^n$, and suppose $\phi$ is a linear map represented by $X$ when bases of $V^p=\mathbb{R}^p$ and $W^n=\mathbb{R}^n$ are chosen. Define $Q(x)=F(y-x)$. Choose a basis of $\mathbb{R}^p$ and suppose $\hat\beta$ is the representation of some $v\in U$ in that basis. This is least squares: $x=X\hat\beta$ minimizes the squared distance $F(y-x)$. Because $X$ is a linear map, changing the basis of $\mathbb{R}^p$ corresponds to right-multiplying $X$ by some invertible matrix $A$. That will left-multiply $\hat\beta$ by $A^{-1}$, QED.
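To make the geometric point concrete, here is a small numerical sketch (again with simulated data and sizes of my own choosing, and with NumPy's least-squares solver standing in for the abstract minimization): the fitted values and the attained minimum do not depend on the parameterization of the column space; only the coordinates of the minimizer do.

```python
import numpy as np

rng = np.random.default_rng(1)

n, p = 30, 3
X = rng.normal(size=(n, p))
y = rng.normal(size=n)
A = rng.normal(size=(p, p))           # invertible change of basis, almost surely

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
beta_hat_A = np.linalg.lstsq(X @ A, y, rcond=None)[0]

# The fitted values -- the point of the column space closest to y --
# do not depend on the parameterization:
print(np.allclose(X @ beta_hat, (X @ A) @ beta_hat_A))   # expect True

# Neither does the attained minimum (the residual sum of squares):
rss = np.sum((y - X @ beta_hat) ** 2)
rss_A = np.sum((y - X @ A @ beta_hat_A) ** 2)
print(np.isclose(rss, rss_A))                             # expect True
```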
