Suppose I have a random sample $\lbrace x_n ,y_n \rbrace_{n=1}^N$.

Suppose $$y_n = \beta_0 + \beta_1 x_n + \varepsilon_n$$

and $$\hat{y}_n = \hat{\beta}_0 +\hat{\beta}_1 x_n$$

What is the difference between $\beta_1$ and $\hat{\beta}_1$?

# Solved – the difference between $\beta_1$ and $\hat{\beta}_1$

#### Related Solutions



Do you have a good reason for doubling (or duplicating) the data? It doesn't make much statistical sense, but it is still interesting to see what happens algebraically. In matrix form your linear model is $$ \DeclareMathOperator{\V}{\mathbb{V}} Y = X \beta + E, $$ the least squares estimator is $\hat{\beta}_{\text{ols}} = (X^T X)^{-1} X^T Y$, and its variance matrix is $\V \hat{\beta}_{\text{ols}} = \sigma^2 (X^T X)^{-1}$. "Doubling the data" means that $Y$ is replaced by $\begin{pmatrix} Y \\ Y \end{pmatrix}$ and $X$ is replaced by $\begin{pmatrix} X \\ X \end{pmatrix}$. The ordinary least squares estimator then becomes $$ \begin{aligned} \left(\begin{pmatrix} X \\ X \end{pmatrix}^T \begin{pmatrix} X \\ X \end{pmatrix} \right)^{-1} \begin{pmatrix} X \\ X \end{pmatrix}^T \begin{pmatrix} Y \\ Y \end{pmatrix} &= (X^T X + X^T X)^{-1} (X^T Y + X^T Y) \\ &= (2 X^T X)^{-1} \, 2 X^T Y \\ &= \hat{\beta}_{\text{ols}} \end{aligned} $$ so the computed estimator doesn't change at all. But the computed variance matrix becomes wrong: using the same kind of algebra as above, we get the variance matrix $\frac{\sigma^2}{2}(X^T X)^{-1}$, half of the correct value. As a consequence, confidence intervals shrink by a factor of $\frac{1}{\sqrt{2}}$.

The reason is that we have calculated as if we still had iid data, which is untrue: each pair of duplicated values obviously has a correlation equal to $1.0$. If we take this into account and use weighted least squares correctly, we will find the correct variance matrix.

From this, further consequences of the doubling are easy to work out as an exercise; for instance, the value of R-squared will not change.
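The algebra above can be checked numerically. The following is a minimal sketch on simulated data; the sample size, the coefficients, and the helper `ols` are all hypothetical choices made for illustration, not part of the original answer. It compares the estimate, the matrix $(X^T X)^{-1}$ (which scales the naive variance matrix), and R-squared before and after stacking the data on itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: N observations, an intercept and one regressor.
N = 50
x = rng.normal(size=N)
X = np.column_stack([np.ones(N), x])
y = 1.0 + 2.0 * x + rng.normal(size=N)


def ols(X, y):
    """Return the OLS coefficients, (X^T X)^{-1}, and R-squared."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    resid = y - X @ beta_hat
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta_hat, XtX_inv, r2


# "Doubling the data": stack X on top of X and Y on top of Y.
X2 = np.vstack([X, X])
y2 = np.concatenate([y, y])

beta_hat, XtX_inv, r2 = ols(X, y)
beta_hat2, XtX_inv2, r2_2 = ols(X2, y2)

print("estimate, original data:", beta_hat)
print("estimate, doubled data: ", beta_hat2)
print("(X^T X)^{-1} halves:    ", np.allclose(XtX_inv2, XtX_inv / 2))
print("R-squared before/after: ", r2, r2_2)
```

For a known $\sigma^2$, the naive variance matrix is $\sigma^2$ times the printed inverse, so halving the inverse halves the reported variance even though no new information was added.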

This post is an honest response to a common problem in the textbook presentation of regression, namely, the issue of what is random or fixed. Regression textbooks typically blithely state that the $X$ variables are fixed and go on their merry way, when in practice this assumption eliminates most of the interesting regression applications.

Rather than assume the $X$ variables are fixed, a better route to understanding regression analysis is to take a conditional distribution approach, one where the $X$'s are assumed random throughout, and then the case of fixed $X$ (which occurs only in very narrow experimental designs, and at that only when the experiment is performed without error) is subsumed as a special case where the distributions are degenerate.

What the OP is missing is the link from random $X$ to fixed realizations of $X$ ($X=x$), which all starts from the

Law of Total Expectation: Assume $U$ and $V$ are random, with finite expectation. Let $E(U | V=v) = \mu(v)$. Then $E(U) = E\{\mu(V)\}$.

This "Law" (which is actually a mathematical theorem) allows you to prove unbiasedness of the estimator $\hat \beta$ in two steps: (i) by first showing that it is unbiased conditional on the $X$ data, and (ii) by then using the Law of Total Expectation to show that it is unbiased when averaged over all possible realizations of the $X$ data. (For example, the average of 11, 11, 11, 11, 11, 11, ... is 11.)
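Written out for the linear model, the two steps look as follows. This is a sketch under assumptions that are not stated explicitly above: the model is $Y = X\beta + \varepsilon$, the matrix $X^T X$ is invertible, and the errors satisfy $E(\varepsilon \mid X = x) = 0$ for every $x$.

$$
\begin{aligned}
\hat\beta &= (X^T X)^{-1} X^T Y = \beta + (X^T X)^{-1} X^T \varepsilon, \\
\text{(i)}\quad E(\hat\beta \mid X = x) &= \beta + (x^T x)^{-1} x^T \, E(\varepsilon \mid X = x) = \beta =: \mu(x), \\
\text{(ii)}\quad E(\hat\beta) &= E\{\mu(X)\} = E(\beta) = \beta .
\end{aligned}
$$

Because $\mu(x)$ is the same constant $\beta$ for every realization $x$, averaging over $X$ changes nothing, which is the point of the "average of 11, 11, 11, ..." remark.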

Answers to the OP:

Q1. Do we treat $(X_i,Y_i)$'s as random variables?

A1. Yes. They are random in the sense of the model, which describes the way that *potentially observable* values of such data might appear. Of course, the actual observed data, $(x_i, y_i)$, are not random. Instead, they are fixed values, one of many possible realizations of the potentially observable random variables $(X_i, Y_i)$. In rare cases, the $X$ data are fixed, but this is covered as a special case of randomness, so it is easier and safer just to assume randomness always.

Q2. Do we treat $\beta_0$ and $\beta_1$ as random variables?

A2. This is somewhat off topic from the OP, but still a very important question. From the scientist's conceptualization of reality, these are ordinarily fixed values. That is, the scientist assumes that there is a rigid structure responsible for the production of all of the $(Y_i | X_i = x_i)$ data values, and these $\beta_0, \beta_1$ values are part of that rigid structure.

Now, the parameters $\beta_0, \beta_1$ are uncertain in the scientist's mind (which is why he or she is collecting data in the first place!), so the scientist may choose to view them, mentally, as "random." The scientist has some ideas about the possible values of these parameters based on logic, subject matter considerations, and past data, and these ideas form the scientist's "prior distribution." The scientist then may update this prior using current data to obtain her/his posterior. That, in a nutshell, is what Bayesian statistics is all about.

But again, that issue is a little off topic from the OP, so let's consider everything conditional on the scientist's conceptualization that there is a rigid structure, and that these $\beta_0, \beta_1$ values are fixed in reality. In other words, all of my replies other than this one assume that the $\beta$'s are fixed.

Q3. Do we treat $\hat \beta_0$ and $\hat \beta_1$ as random variables?

A3. Here is another place where typical regression teaching sources are slippery. In some cases, they refer to the estimates $\hat \beta_0$ and $\hat \beta_1$ as functions of the (fixed) data that has been collected, and sometimes they refer to them as functions of the (random) potentially observable data, but use the same symbols $\hat \beta_0$ and $\hat \beta_1$ in either case. Often, you just have to understand from context which is which.

Whenever you see $E(\hat \beta)$, you can assume that $\hat \beta$ is a function of the random data, i.e., that $\hat \beta$ is a function of the $(X_i, Y_i)$.

Whenever you see the value of $\hat \beta$ reported, e.g., following a computer printout of results from a regression analysis, you can assume that $\hat \beta$ is a function of the fixed data sample, i.e., that $\hat \beta$ is a function of the $(x_i, y_i)$.

Q4. What can have an expected value and what can't (what gets treated as a constant when finding expected values) and why?

A4. Anything can have an expectation. Some things are more interesting than others, though. Anything that is fixed (like a $\hat \beta$ that is a function of the observed $(x_i, y_i)$ sample) has an expectation that is just equal to that value. For example, if you observe from your computer printout that $\hat \beta_1 = 0.23$, then $E(\hat \beta_1) = 0.23$. But that is not interesting.

What is more interesting is the following question: over all possible potential realizations of $(X_i, Y_i)$ from this data-generating process, is the estimator $\hat \beta_1$ neither systematically too large, nor systematically too small, in an average sense, when compared to the structural parameter $\beta_1$? The expression $E(\hat \beta_1) = \beta_1$ tells you that the answer to that question is a comforting "yes."

And in that expression $E(\hat \beta_1) = \beta_1$, it is implicit that $ \hat \beta_1$ is a function of the potentially observable $(X_i, Y_i)$ data, not the sample $(x_i, y_i)$ data.
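The distinction between the estimator (a function of the random $(X_i, Y_i)$) and a single estimate (a function of one observed sample) can be illustrated by simulation. The sketch below assumes a hypothetical data-generating process with $\beta_0 = 1$, $\beta_1 = 0.5$, normally distributed $X$, and normal errors; none of these values come from the original post. A fresh $X$ is drawn in every replication, so $X$ is random throughout.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical structural parameters, held fixed across all realizations.
beta0, beta1 = 1.0, 0.5
N = 30          # sample size of each realization
reps = 20000    # number of potential realizations of the data


def slope_estimate(x, y):
    """OLS slope for simple linear regression."""
    xc = x - x.mean()
    return xc @ (y - y.mean()) / (xc @ xc)


estimates = np.empty(reps)
for r in range(reps):
    # A new draw of X in every realization: X is random, not fixed.
    x = rng.normal(loc=2.0, scale=1.5, size=N)
    y = beta0 + beta1 * x + rng.normal(scale=1.0, size=N)
    estimates[r] = slope_estimate(x, y)

print("one realized estimate:    ", estimates[0])
print("average over realizations:", estimates.mean())
print("true beta1:               ", beta1)
```

Any single entry of `estimates` plays the role of the number on a computer printout, and it will generally differ from $\beta_1$. The average over many realizations is a Monte Carlo approximation of $E(\hat \beta_1)$ and should land close to $\beta_1$.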

## Best Answer

$\beta_1$ is an idea: it is the slope of the population regression line, which is never observed directly in practice. If the Gauss-Markov assumptions hold and, in addition, the errors are normally distributed, then on any vertical "slice" at a fixed value of $x$ the values of the dependent variable fall above and below that line in a normal (Gaussian) distribution. $\hat \beta_1$ is the estimate of $\beta_1$ based on the sample.
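For the simple regression in the question, and assuming the estimate is obtained by ordinary least squares (the question itself does not name the estimator), $\hat \beta_1$ is computed entirely from the sample, with $\bar x$ and $\bar y$ denoting the sample means:

$$
\hat\beta_1 = \frac{\sum_{n=1}^N (x_n - \bar x)(y_n - \bar y)}{\sum_{n=1}^N (x_n - \bar x)^2},
\qquad
\hat\beta_0 = \bar y - \hat\beta_1 \bar x .
$$

No such formula exists for $\beta_1$: it is a property of the population, and the data only allow it to be estimated.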

The idea is that you are working with a sample from a population. Your sample forms a data cloud, if you will. One of the dimensions corresponds to the dependent variable, and you try to fit the line that minimizes the sum of squared residuals. In OLS, this amounts to projecting the dependent variable onto the column space of the model matrix. The resulting estimates of the population parameters are denoted with the $\hat \beta$ symbol. The more data points you have, the more accurate the estimated coefficients $\hat \beta_i$ tend to be, and the better they approximate the idealized population coefficients $\beta_i$.

Here is the difference in slopes ($\beta$ versus $\hat \beta$) between the "population" in blue, and the sample in isolated black dots:

The regression line is dotted and in black, whereas the synthetically perfect "population" line is in solid blue. The abundance of points provides a visual sense of the normality of the distribution of the residuals.
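The gap between the population slope and the sample slope, and the way it narrows with more data, can also be seen numerically. The following sketch is not the code behind the figure described above; the population coefficients, the noise level, and the helper `fit_slope` are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical population coefficients (the "solid blue line").
beta0, beta1 = 3.0, 1.5


def fit_slope(n):
    """Draw a sample of size n from the population model and return the OLS slope."""
    x = rng.uniform(0.0, 10.0, size=n)
    y = beta0 + beta1 * x + rng.normal(scale=4.0, size=n)
    slope, intercept = np.polyfit(x, y, deg=1)  # highest degree first
    return slope


spread = {}
for n in (10, 100, 1000, 10000):
    # 500 independent samples of size n, each giving its own beta1-hat.
    slopes = np.array([fit_slope(n) for _ in range(500)])
    spread[n] = slopes.std()
    print(f"N={n:>5}  mean slope={slopes.mean():.3f}  sd of slopes={spread[n]:.3f}")
```

Each fitted slope is one $\hat \beta_1$; the true $\beta_1$ stays at the same value in every run. The standard deviation of the fitted slopes shrinks as $N$ grows, which is the sense in which more data gives a more accurate estimate.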