Regression – Understanding Random Variables and Non-Random Variables in Regression Models

Tags: expected-value, least-squares, random-variable, regression, regression-coefficients

I've already seen this question, but it didn't help.

So I'm going over regression models (mainly simple linear regression) in my statistics textbook, and there's a lot of confusion about what actually is a random variable and what isn't. Namely, at one point they treat some term as a random variable and later it's a constant. Or something is initially a constant, but then we somehow calculate its expected value.

Anyway, we first define the regression function as $f(X) = E(Y|X)$, after which we immediately move on to simple linear regression specifically.

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be our sample. The model that we wish to apply is
$$Y_i = \beta_0 + \beta_1X_i + \epsilon_i$$
where the sequence of random variables $\{\epsilon_i\}$ satisfies the following:

  1. $E(\epsilon_i) = 0 $ for $i=1, 2, …, n$
  2. $E(\epsilon_i\epsilon_j) = 0$ for all $i \neq j$
  3. $D(\epsilon_i)=\sigma^2 < \infty$

The problem with this textbook is that everything is very vague, and it's written as if it were a reminder for someone who already knows the material rather than a textbook for learning it from scratch.

Later on we derive the estimates $\hat{\beta_0}$ and $\hat{\beta_1}$ of the coefficients by minimizing the sum of squares (setting its partial derivatives to zero), and we obtain:

$$\hat{\beta_1} = \frac{\sum_{i=1}^n(X_i - \bar{X_n})(Y_i-\bar{Y_n})}{\sum_{i=1}^n(X_i-\bar{X_n})^2}$$
$$\hat{\beta_0} = \bar{Y_n} - \hat{\beta_1}\bar{X_n}$$
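
To convince myself that I at least have the mechanics right, here is a quick sketch I put together (my own, not from the textbook) that simulates data from the model above with arbitrarily chosen parameter values and computes $\hat{\beta_0}$ and $\hat{\beta_1}$ from these formulas; the uniform distribution for the $X_i$'s and the specific numbers are just assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary "true" values, chosen only for illustration.
beta0, beta1, sigma, n = 2.0, 0.5, 1.0, 100

X = rng.uniform(0, 10, size=n)       # sample of X values (here drawn at random)
eps = rng.normal(0, sigma, size=n)   # errors: mean 0, variance sigma^2
Y = beta0 + beta1 * X + eps          # the model Y_i = beta0 + beta1 * X_i + eps_i

# Least-squares estimates, exactly as in the formulas above
Xbar, Ybar = X.mean(), Y.mean()
beta1_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
beta0_hat = Ybar - beta1_hat * Xbar

# Cross-check against numpy's built-in least-squares fit: returns [slope, intercept]
print(beta1_hat, beta0_hat)
print(np.polyfit(X, Y, deg=1))
```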

Now we wish to find the expected value of $\hat{\beta_1}$. We transform it into the following form:
$$\hat{\beta_1} = \sum_{i=1}^n{Y_i\frac{(X_i - \bar{X_n})}{nS^2_{X}}}$$
where $S^2_{X}$ is $\frac{1}{n}\sum_{i=1}^n(X_i – \bar{X_n})^2$.
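
One algebra step behind this re-expression that the book leaves implicit (I worked it out to check): since $\bar{Y_n}\sum_{i=1}^n(X_i - \bar{X_n}) = 0$,

$$\sum_{i=1}^n(X_i - \bar{X_n})(Y_i - \bar{Y_n}) = \sum_{i=1}^n(X_i - \bar{X_n})Y_i,$$

and dividing by $\sum_{i=1}^n(X_i - \bar{X_n})^2 = nS^2_X$ gives the weighted-sum form above.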

And now when we start finding the expected value it looks something like this:

$$E(\hat{\beta_1}) = \sum_{i=1}^n{E(Y_i)\frac{X_i - \bar{X_n}}{nS^2_{X}}} = \sum_{i=1}^n{(\beta_0 + \beta_1X_i)\frac{X_i-\bar{X_n}}{nS^2_{X}}} = \dots$$

Meaning, everything in the sum except $Y_i$ is treated as a constant. That's one of the parts I don't understand. In some other sources where I've tried to find an answer to this question, I've seen the following sentence:

"Only the $e_i$'s are random variables."

This doesn't sit right with me, probably because I got to regression after studying hypothesis testing and other parts of statistical inference for a while, where we always treated 'almost everything' as a random variable; the sample (in this case the $(X_i, Y_i)$ pairs) was also random. How come here, suddenly, the part containing $X_i$ and $\bar{X_n}$ gets pulled out of $E(\cdot)$ as if it were just a constant?

Some sources also mention that the $X_i, Y_i$'s are indeed random variables but are treated as 'fixed', which still doesn't help me understand it, because it sounds very informal.

Now I'll try and summarize my question(s) somehow.

  1. Do we treat $(X_i, Y_i)$'s as random variables?
  2. Do we treat $\beta_0$ and $\beta_1$ as random variables?
  3. Do we treat $\hat{\beta_0}$ and $\hat{\beta_1}$ as random variables?
  4. What can have an expected value and what can't (what gets treated as a constant when finding expected values) and why?

Best Answer

This post is an honest response to a common problem in the textbook presentation of regression, namely, the issue of what is random or fixed. Regression textbooks typically blithely state that the $X$ variables are fixed and go on their merry way, when in practice this assumption eliminates most of the interesting regression applications.

Rather than assume the $X$ variables are fixed, a better route to understanding regression analysis is to take a conditional distribution approach, one where the $X$'s are assumed random throughout, and then the case of fixed $X$ (which occurs only in very narrow experimental designs, and at that only when the experiment is performed without error) is subsumed as a special case where the distributions are degenerate.

What the OP is missing is the link from random $X$ to fixed realizations of $X$ ($X=x$), which all starts from the

Law of Total Expectation: Assume $U$ and $V$ are random, with finite expectation. Let $E(U | V=v) = \mu(v)$. Then $E(U) = E\{\mu(V)\}$.

This "Law" (which is actually a mathematical theorem) allows you to prove unbiasedness of the estimate $\hat \beta $ in two steps: (i) by first showing that it is unbiased, conditional on the $X$ data, and (ii) by using the Law of Total Expectation to then show that it is unbiased when averaged over all possible realizations of the $X$ data. (The average of 11,11, 11, 11, 11, 11, ... is 11, e.g.).

Answers to the OP:

Q1. Do we treat $(X_i,Y_i)$'s as random variables?

A1. Yes. They are random in the sense of the model, which describes the way that potentially observable values of such data might appear. Of course the actual observed data, $(x_i, y_i)$, are not random: they are fixed values, one of many possible realizations of the potentially observable random variables $(X_i, Y_i)$. In rare cases the $X$ data are fixed, but that is covered as a special case of randomness, so it is easier and safer just to assume randomness always.

Q2. Do we treat $\beta_0$ and $\beta_1$ as random variables?

A2. This is somewhat off topic from the OP, but still a very important question. From the scientist's conceptualization of reality, these are ordinarily fixed values. That is, the scientist assumes that there is a rigid structure responsible for the production of all of the $(Y_i | X_i = x_i)$ data values, and these $\beta_0, \beta_1$ values are part of that rigid structure.

Now, the parameters $\beta_0, \beta_1$ are uncertain in the scientist's mind (which is why he or she is collecting data in the first place!), so the scientist may choose to view them, mentally, as "random." The scientist has some ideas about the possible values of these parameters based on logic, subject-matter considerations, and past data, and these ideas form the scientist's "prior distribution." The scientist may then update this prior using current data to obtain his or her posterior. That, in a nutshell, is what Bayesian statistics is all about.

But again, that issue is a little off topic from the OP, so let's consider everything conditional on the scientist's conceptualization that there is a rigid structure, and that these $\beta_0, \beta_1$ values are fixed in reality. In other words, all of my replies other than this one assume that the $\beta$'s are fixed.

Q3. Do we treat $\hat \beta_0$ and $\hat \beta_1$ as random variables?

A3. Here is another place where typical regression teaching sources are slippery. In some cases they refer to the estimates $\hat \beta_0$ and $\hat \beta_1$ as functions of the (fixed) data that have been collected, and in others they refer to them as functions of the (random) potentially observable data, but they use the same symbols $\hat \beta_0$ and $\hat \beta_1$ in either case. Often you just have to understand from context which is which.

Whenever you see $E(\hat \beta)$, you can assume that $\hat \beta$ is a function of the random data, i.e., that $\hat \beta$ is a function of the $(X_i, Y_i)$.

Whenever you see the value of $\hat \beta$ reported, e.g., following a computer printout of results from a regression analysis, you can assume that $\hat \beta$ is a function of the fixed data sample, i.e., that $\hat \beta$ is a function of the $(x_i, y_i)$.

Q4. What can have an expected value and what can't (what gets treated as a constant when finding expected values) and why?

A4. Anything can have an expectation. Some things are more interesting than others, though. Anything that is fixed (like a $\hat \beta$ that is a function of the observed $(x_i, y_i)$ sample) has an expectation that is just equal to that value. For example, if you observe from your computer printout that $\hat \beta_1 = 0.23$, then $E(\hat \beta_1) = 0.23$. But that is not interesting.

What is more interesting is the following question: over all possible potential realizations of $(X_i, Y_i)$ from this data-generating process, is the estimator $\hat \beta_1$ neither systematically too large, nor systematically too small, in an average sense, when compared to the structural parameter $\beta_1$? The expression $E(\hat \beta_1) = \beta_1$ tells you that the answer to that question is a comforting "yes."

And in that expression $E(\hat \beta_1) = \beta_1$, it is implicit that $ \hat \beta_1$ is a function of the potentially observable $(X_i, Y_i)$ data, not the sample $(x_i, y_i)$ data.
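
If it helps to see that numerically, here is a small simulation sketch (with assumed parameter values; the numbers are mine, not part of the answer above) in which a fresh realization of the potentially observable data is drawn on every replication, $X$'s included. The average of $\hat \beta_1$ across realizations settles near the structural value $\beta_1 = 0.5$, which is exactly what $E(\hat \beta_1) = \beta_1$ promises.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed structural parameters, fixed "in reality" for this illustration.
beta0, beta1, sigma, n = 2.0, 0.5, 1.0, 50
n_reps = 20_000

slopes = np.empty(n_reps)
for r in range(n_reps):
    # One potential realization of the data-generating process (X random too).
    X = rng.uniform(0, 10, size=n)
    Y = beta0 + beta1 * X + rng.normal(0, sigma, size=n)
    Xbar = X.mean()
    slopes[r] = np.sum((X - Xbar) * (Y - Y.mean())) / np.sum((X - Xbar) ** 2)

# Average over all simulated realizations: close to beta1 = 0.5.
print(slopes.mean())
```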
