Solved – Simple linear regression with a random predictor

inference, interpretation, modeling, regression, variance

We understand a simple linear regression (SLR) model as
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
with the $\varepsilon_i$ i.i.d. with equal variance.

Suppose I have two instruments measuring a common quantity, say, the density of samples of different liquids, and assume that we do not know the exact density of each liquid (nor is there any way to measure it accurately).

My objective is to construct a model that will summarise the difference between the two instruments.

Here is how I go about it (for simplicity's sake, all models are SLR models):

Let the readings from instrument 1 and instrument 2 when measuring the $i$-th liquid be $y_i$ and $w_i$ respectively, and the actual density of the $i$-th liquid be $x_i$. Then we have:
$$y_i=\alpha_1+\beta_1 x_i + \varepsilon_{1i} \\
w_i=\alpha_2+\beta_2 x_i + \varepsilon_{2i}$$
where $\varepsilon_{1i}$ and $\varepsilon_{2i}$ are each i.i.d. with equal variance across $i$.

Rearranging the second equation, we have:
$$x_i= \frac{w_i- \alpha_2- \varepsilon_{2i}}{\beta_2 } $$
Plugging into the first equation,
$$ \begin{align*} y_i &=\alpha_1+\beta_1 \frac{w_i- \alpha_2- \varepsilon_{2i}}{\beta_2 } + \varepsilon_{1i} \\
&= \alpha_1 - \frac{\beta_1}{\beta_2} \alpha_2 + \frac{\beta_1}{\beta_2} w_i + \varepsilon_{1i}-\frac{\beta_1}{\beta_2}\varepsilon_{2i}\\
&= \alpha + \beta w_i + \varepsilon_i
\end{align*}
$$

where $\beta = \frac{\beta_1}{\beta_2}$, $\alpha = \alpha_1 - \beta \alpha_2$, and $\varepsilon_i = \varepsilon_{1i}-\beta \varepsilon_{2i}$; the $\varepsilon_i$ are consequently i.i.d. with equal variance across $i$.

I would like to ask,

  1. Is this a valid SLR model for my objective? Do I just treat the readings from instrument 2 as an 'independent variable' and proceed as in a normal SLR analysis?

  2. If so, can I interpret the slope ($\beta$) as the ratio of the individual slopes of the two instruments?

  3. Can I then test the hypothesis that the two instruments are similar by testing $\alpha = 0$ and $\beta = 1$?

  4. Suppose there are $N$ liquids of interest in total. All $N$ liquids have been measured by instrument 1; however, we only have time/money to measure $n$ ($n < N$) liquids using instrument 2. How should I select the $n$ liquids from the $N$ so as to minimise the variance of the estimates of $\beta$ and $\alpha$?

Best Answer

The answer to 1 is no, which makes the answers to all the others not applicable.

Let me start with your last equation:
$$y_i = \alpha + \beta w_i + \varepsilon_i$$

Now, let's assume that your earlier equations for $y$ and $w$ are valid classical linear regression models, so that $Cov(x,\varepsilon_1)=0$ and $Cov(x,\varepsilon_2)=0$.

Now let's calculate $Cov(w,\varepsilon)$ in order to verify whether your new equation is part of a valid classical linear regression model (recall that we need this covariance to be zero):
$$\begin{align*}
Cov(w,\varepsilon) &= Cov\left(w,\ \varepsilon_1-\frac{\beta_1}{\beta_2}\varepsilon_2\right) \\
&= Cov(w,\varepsilon_1) - \frac{\beta_1}{\beta_2}Cov(w,\varepsilon_2) \\
&= Cov(\varepsilon_2,\varepsilon_1) - \frac{\beta_1}{\beta_2}V(\varepsilon_2),
\end{align*}$$
where the last step substitutes $w = \alpha_2 + \beta_2 x + \varepsilon_2$ and uses the two zero covariances above.

The second term is not zero unless $\beta_1=0$, and that would make the example pretty silly. Even the first term is not likely to be zero in most physical applications. For that term to be zero, you would have to make the additional assumption that the errors made by the two instruments were completely uncorrelated. You could get wildly lucky (in a stopped-clock-is-right-twice-a-day kind of sense) and the two terms could magically cancel out, but there is no systematic tendency of the two terms to cancel out.
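As a quick numerical sanity check of that covariance, here is a short R sketch; the variable names are mine, and it assumes $\alpha_2=0$, $\beta_1=\beta_2=1$, unit error variances, $V(x)=1$, and $Cov(\varepsilon_1,\varepsilon_2)=0.5$, the same values used in the Monte Carlo at the end of this answer:

# Numerical check of Cov(w, eps) under assumed parameter values
set.seed(1)
n   <- 1e6
x   <- rnorm(n)
e_1 <- rnorm(n)
e_2 <- 0.5*e_1 + sqrt(1 - 0.5^2)*rnorm(n)  # Cov(e_1, e_2) = 0.5, both sd 1
w   <- x + e_2                             # w = alpha_2 + beta_2*x + e_2
eps <- e_1 - e_2                           # eps = e_1 - (beta_1/beta_2)*e_2
cov(w, eps)                                # theory: 0.5 - 1 = -0.5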

The bias in estimating $\beta$ will be:
$$\frac{Cov(\varepsilon_2,\varepsilon_1) - \frac{\beta_1}{\beta_2}V(\varepsilon_2)}{V(w)}$$
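Plugging in numbers, that formula can be evaluated directly; the sketch below assumes $V(x)=1$ and $Cov(x,\varepsilon_2)=0$, with parameter values and variable names of my choosing, matched to the Monte Carlo below:

# Theoretical bias = (Cov(e1,e2) - (beta_1/beta_2)*V(e2)) / V(w)
beta_1   <- 1
beta_2   <- 1
var_x    <- 1     # V(x)
var_e2   <- 1     # V(epsilon_2)
cov_e1e2 <- 0.5   # Cov(epsilon_1, epsilon_2)
var_w <- beta_2^2 * var_x + var_e2               # since Cov(x, epsilon_2) = 0
(cov_e1e2 - (beta_1/beta_2) * var_e2) / var_w    # -0.25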

Below, I attach a bit of R code implementing a toy Monte Carlo to demonstrate the effect. The theoretical bias in the Monte Carlo is $-0.25$, and the slope estimate we get from the simulation is too low by about $0.23$, so it demonstrates the point pretty well.

In general, even if you can't see how to evaluate the bias analytically in an example like this, you can always run a little Monte Carlo to see what is going on. This is one of the great things about statistical software languages: Monte Carlo simulations are amazingly powerful tools for giving you feedback on whether your ideas are really good or really not.

# This program written in response to a Cross Validated question
# http://stats.stackexchange.com/questions/74527/simple-linear-regression-with-a-random-predictor

# The program is a toy Monte Carlo.
# It generates a "true" but unobservable-to-the-analyst physical state x.
# Then it generates two measurements of that state from different instruments.
# Then it regresses one measurement on the other.

set.seed(12344321)

# True state, 1000 runs of the experiment
x <- rnorm(1000)

# Set the various parameters of the Monte Carlo
# Play with these for fun and profit:

alpha_1 <- 0
alpha_2 <- 0
beta_1  <- 1
beta_2  <- 1
stddev_e1 <- 1
stddev_e2 <- 1
corr_e1e2 <- 0.5

# Fallible measurements
e_1 <- stddev_e1*rnorm(1000)
e_2 <- stddev_e2*(corr_e1e2*e_1+sqrt(1-corr_e1e2^2)*rnorm(1000))
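# (With stddev_e1 = 1 as set above, e_2 has standard deviation stddev_e2
#  and correlation corr_e1e2 with e_1.)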
y <- alpha_1 + beta_1*x + e_1
w <- alpha_2 + beta_2*x + e_2

var(data.frame(e_1,e_2))
var(data.frame(x,w,y))

lm(y~x)
lm(w~x)
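# Note: the two regressions above use the true state x, which the analyst
# cannot observe; they are shown only as a benchmark.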

# By the bias formula in the answer, this regression should have a bias of
# -0.25 = (0.5-1*1)/2.  That is, the coefficient should not be close to 1,
# the correct value of beta_1/beta_2.  Instead, it should be close 
# to 0.75 = 1 - 0.25.

lm(y~w)