[Math] Linear Regression Coefficients W/ X, Y swapped

linear-algebra, st.statistics

Let's say I have a linear regression model of the form $ y = B_x x + I_x + \epsilon $, where $B_x$ is the beta coefficient of the $x$ term, $I_x$ is the intercept term, and $\epsilon$ is additive, normally distributed noise. If I have a dataset and perform linear regression, I get a value for $B_x$, which indicates the slope of the relationship.

If I swap the roles of the $x$ and $y$ data, and try to fit a model of $x = B_y y + I_y + \epsilon$, I would expect intuitively that $B_y = \frac{1}{B_x}$. A simple geometric argument can be made to show that swapping the roles of $x$ and $y$ shouldn't change the position of the regression line w.r.t. any data point, and from here it seems like simple algebra that if $y = Bx + I$ then $x = \frac{1}{B} y - \frac{I}{B}$.

Where is this reasoning wrong? Can someone explain to me why $B_x \neq \frac{1}{B_y}$, preferably without resorting to tons of linear algebra or direct derivation from the normal equation?
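For concreteness, here is a quick numerical check of the discrepancy (a sketch with NumPy; the true slope $2$, intercept $1$, and noise scale are made-up illustrative values). The product $B_x \cdot B_y$ comes out equal to the squared sample correlation, not $1$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 + noise (coefficients chosen arbitrarily)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

# Slope of the y-on-x regression
B_x = np.polyfit(x, y, 1)[0]
# Slope of the x-on-y regression (roles swapped)
B_y = np.polyfit(y, x, 1)[0]

# The product equals the squared correlation coefficient, which is < 1
r = np.corrcoef(x, y)[0, 1]
print(B_x * B_y, r**2)
```

Unless the data lie exactly on a line ($r^2 = 1$), the two slopes are not reciprocals of each other.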

Best Answer

Well, I think Mike McCoy's answer is "the right answer," but here's another way of thinking about it: the linear regression is looking for an approximation (up to the error $\epsilon$) for $y$ as a function of $x$. That is, we're given a non-noisy $x$ value, and from it we're computing a $y$ value, possibly with some noise. This situation is not symmetric in the variables -- in particular, flipping $x$ and $y$ means that the error is now in the independent variable, while our dependent variable is measured exactly.

One could, of course, find the equation of the line that minimizes the sum of the squares of the (perpendicular) distances from the data points. My guess is that the reason that this isn't done is related to my first paragraph and "physical" interpretations in which one of the variables is treated as dependent on the other.
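The perpendicular-distance fit described above (total least squares, also called orthogonal regression) can be sketched by taking the principal direction of the centered data; note that this treatment really is symmetric in the variables, so swapping $x$ and $y$ does give reciprocal slopes. The data below are synthetic, for illustration only:

```python
import numpy as np

def tls_slope(x, y):
    """Slope of the line minimizing the sum of squared
    perpendicular distances, via the top principal direction."""
    X = np.column_stack([x - x.mean(), y - y.mean()])
    # Right-singular vector belonging to the largest singular value
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    dx, dy = Vt[0]
    return dy / dx

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 1.5 * x + rng.normal(scale=0.5, size=300)

# Swapping the roles of x and y flips the slope to its reciprocal
s_xy = tls_slope(x, y)
s_yx = tls_slope(y, x)
print(s_xy * s_yx)  # product is 1 (up to floating point)
```

This is exactly the symmetry the question's intuition expects, which ordinary least squares does not have.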

Incidentally, it's not hard to think up silly examples for which $B_x$ and $B_y$ don't satisfy anything remotely like $B_x \cdot B_y = 1$. The first one that pops to mind is to consider the least-squares line for the points $\{(0, 1), (1, 0), (-1, 0), (0, -1)\}$. (Or fudge the positions of those points slightly to make it a shade less artificial.)
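For this symmetric configuration both least-squares slopes come out to zero (the cross term $\sum x_i y_i$ vanishes), so $B_x \cdot B_y = 0$, nowhere near $1$. A quick check with NumPy:

```python
import numpy as np

pts = np.array([(0, 1), (1, 0), (-1, 0), (0, -1)], dtype=float)
x, y = pts[:, 0], pts[:, 1]

B_x = np.polyfit(x, y, 1)[0]  # slope of the y-on-x fit
B_y = np.polyfit(y, x, 1)[0]  # slope of the x-on-y fit
print(B_x, B_y)  # both are 0 (up to floating point)
```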

Another possible reason that the perpendicular-distances method is nonstandard is that it doesn't guarantee a unique solution -- see, for instance, the silly example in the preceding paragraph.

(N.B.: I don't actually know anything about statistics.)
