Solved – Identifiability of the linear regression model: necessary and sufficient condition

identifiability, linear model, regression

Let $\{(x_i, y_i),\ 1\le i\le n\}$ be the paired values of the observations and responses, respectively. Let us fit the linear regression model $y_i=b_0+b_1 x_i+\epsilon_i$, where the $\epsilon_i\sim\mathcal{N}(0,\sigma^2)$ are i.i.d.

I'd like to find a necessary and sufficient condition for the model above to be identifiable.

Let me explain, just in case: let $\theta=(b_0, b_1, \sigma^2)$ be a vector of unknown parameters, and let $\varphi=(a_0, a_1, \nu^2)$ be another such vector. Assume that they give rise to the same distribution of $y=(y_1, y_2, \ldots, y_n)$, that is,

$$\sum_{i=1}^{n} \frac{(y_i-b_0-b_1 x_i)^2}{\sigma^2}=\sum_{i=1}^{n} \frac{(y_i-a_0-a_1 x_i)^2}{\nu^2} \quad \forall\, (y_1,y_2,\ldots,y_n)\in \mathbb{R}^{n}.$$

I need a condition on $(x_1,x_2,\ldots,x_n)$ so that this equality implies $\theta=\varphi$.
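
(For context, the sum on each side is the exponent of the corresponding joint density of $y$:

$$f_\theta(y)=(2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{n}\frac{(y_i-b_0-b_1 x_i)^2}{\sigma^2}\right),$$

and identifiability means that $f_\theta=f_\varphi$ on all of $\mathbb{R}^n$ forces $\theta=\varphi$.)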

Question 1: I'm new to the whole identifiability definition, but this is exactly what we need to check, right? If not, please correct me!

Question 2: What is the condition on $(x_1,x_2,…x_n)$ in order for the model to be identifiable?

Best Answer

Your "assume also" clause equates two quadratic forms in $\mathbb{R}^n$ (with $\mathrm{y}=(y_1,y_2,\ldots,y_n)$ the variable). Since any quadratic form is completely determined by its values at $1+n+\binom{n+1}{2}$ distinct points, their agreement at all points of $\mathbb{R}^n$ is far more than needed to conclude the two forms are identical, whence their coefficients must be the same.

The coefficients of $y_1^2$ are $1/\sigma^2$ and $1/\nu^2$, whence $\sigma=\pm \nu$. We always stipulate that $\sigma$ and $\nu$ are nonnegative, implying $\sigma=\nu$. (The "real" parameter should be considered to be $\sigma^2$ or $1/\sigma^2$ rather than $\sigma$ itself.)

The coefficients of the linear terms in $y_i$ are $-2(b_0+b_1 x_i)/\sigma^2$ and $-2(a_0+a_1 x_i)/\nu^2$; because $\sigma^2=\nu^2$, this forces $b_0+b_1 x_i = a_0 + a_1 x_i$ for every $i$. Letting $\mathrm{1} = (1,1,\ldots, 1)$ and $\mathrm{x} = (x_1, x_2, \ldots, x_n)$, we conclude

$$(a_0 - b_0)\mathrm{1} + (a_1 - b_1)\mathrm{x} = \mathrm{0}.$$

Thus either

  1. $\mathrm{1}$ and $\mathrm{x}$ are linearly independent, which by definition implies both $a_0 = b_0$ and $a_1 = b_1$, or

  2. $\mathrm{1}$ and $\mathrm{x}$ are linearly dependent, which means $x_1 = x_2 = \cdots = x_n = x$, say. In that case

    • If $x \ne 0$, then $a_0 - b_0 = (b_1 - a_1)\, x$, which determines any one of $(a_0, a_1, b_0, b_1)$ in terms of the other three, or
    • Otherwise ($x = 0$), $a_0=b_0$ while $a_1$ and $b_1$ can take any values.

In case (1) all parameters are uniquely determined: the model is identifiable. In case (2) $\sigma^2$ is identifiable no matter what, but of the regression coefficients only the linear combination $b_0 + b_1 x$ (the common mean of the $y_i$) can be identified.

Evidently, linear independence of $\mathrm{x}$ and $\mathrm{1}$ (equivalently, that the $x_i$ are not all equal) is both necessary and sufficient for identifiability.
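
If it helps to see this numerically, here is a minimal sketch (assuming NumPy is available; the data and parameter values are made up purely for illustration). When all the $x_i$ coincide, two different coefficient vectors sharing the same value of $b_0+b_1 x$ produce exactly the same likelihood; once the $x_i$ vary, they do not:

```python
import numpy as np

def log_likelihood(y, x, b0, b1, sigma2):
    """Gaussian log-likelihood of y_i = b0 + b1*x_i + eps_i, eps_i ~ N(0, sigma2)."""
    resid = y - b0 - b1 * x
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(resid ** 2) / sigma2

rng = np.random.default_rng(0)
y = rng.normal(size=5)

# Case (2): every x_i equals the same value x = 2, so only b0 + 2*b1 matters.
x_const = np.full(5, 2.0)
# Two distinct parameter vectors with equal b0 + b1*x: 1 + 3*2 = 4 + 1.5*2 = 7.
print(np.isclose(log_likelihood(y, x_const, 1.0, 3.0, 1.0),
                 log_likelihood(y, x_const, 4.0, 1.5, 1.0)))   # True: not identifiable

# Case (1): the x_i vary, so 1 and x are linearly independent.
x_var = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(np.isclose(log_likelihood(y, x_var, 1.0, 3.0, 1.0),
                 log_likelihood(y, x_var, 4.0, 1.5, 1.0)))     # False: distinguishable
```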

This criterion generalizes easily to multiple regression, where the ordinary least squares model is identifiable if and only if the design matrix $X$ (whose columns are formed from $\mathrm{1}$, $\mathrm{x}$, and any other variables, in any order) has full column rank: that is, there is no linear dependence among its columns.
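
In practice one can test that criterion directly by computing the rank of the design matrix. A small sketch (again assuming NumPy, with made-up data):

```python
import numpy as np

def is_identifiable(X):
    """Coefficients are identifiable iff the design matrix has full column rank."""
    return np.linalg.matrix_rank(X) == X.shape[1]

x = np.array([1.0, 2.0, 3.0, 4.0])
X_varying = np.column_stack([np.ones_like(x), x])                       # 1 and x independent
X_constant = np.column_stack([np.ones_like(x), np.full_like(x, 2.0)])   # x is a multiple of 1

print(is_identifiable(X_varying))   # True
print(is_identifiable(X_constant))  # False: the columns are linearly dependent
```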