Solved – Understanding the meaning of the parameters in the linear regression model

estimation, least squares, linear model, multiple regression, regression coefficients

When I first learned multiple linear regression, I remember the interpretation of a regression coefficient being: the marginal contribution of a specific predictor.

Now I am rethinking this interpretation, and I have created an example to ask myself: is this interpretation always correct?

Suppose I have two dummy variables in my data set, $X_i$ and $Z_i$. To make it concrete, they are defined as follows:
\begin{equation*}
X_i=\begin{cases}
1, & \text{if person $i$ went to school}\\
0, & \text{otherwise}
\end{cases},~~~Z_i=\begin{cases}
1, & \text{if person $i$ drinks milk}\\
0, & \text{otherwise}
\end{cases}
\end{equation*}
Let $Y_i$ be the response variable, say the wage of person $i$. There are 4 combinations of the values of $(X_i,Z_i)$. If we could FORCE everyone in the population to have a specific value of $(X_i,Z_i)$, we would obtain a wage distribution, denoted by the random variable $Y^z_{xi}$ (superscript for the $Z$ value, subscript for the $X$ value). So we have $Y^1_{1i}$, $Y^1_{0i}$, $Y^0_{0i}$ and $Y^0_{1i}$. Note that, without loss of generality, we can always write
\begin{equation*}
Y^1_{1i}=\mu^1_1+\epsilon^1_{1i},~Y^1_{0i}=\mu^1_0+\epsilon^1_{0i},~Y^0_{1i}=\mu^0_1+\epsilon^0_{1i},~Y^0_{0i}=\mu^0_0+\epsilon^0_{0i}
\end{equation*}
where $\mu^z_x$ is the mean wage if everyone in the population were forced to have $X_i=x$ and $Z_i=z$ (and each $\epsilon$ term has mean zero). After rearrangement, it is easy to write $Y_i$, the wage observed in the real world, as
\begin{align*}
Y_i&=\mu^0_0+X_i(\mu^0_1-\mu^0_0)+Z_i(\mu^1_0-\mu^0_0)+X_iZ_i(\mu^1_1-\mu^0_1-\mu^1_0+\mu^0_0)\\
&\quad+\underbrace{\epsilon^0_{0i}+X_i(\epsilon^0_{1i}-\epsilon^0_{0i})+Z_i(\epsilon^1_{0i}-\epsilon^0_{0i})+X_iZ_i(\epsilon^1_{1i}-\epsilon^0_{1i}-\epsilon^1_{0i}+\epsilon^0_{0i})}_{\epsilon_i}
\end{align*}
More compactly,
\begin{equation}
Y_i=a+bX_i+cZ_i+dX_iZ_i+\epsilon_i,\label{eq:2D}
\end{equation}
where $a=\mu^0_0$, $b=\mu^0_1-\mu^0_0$, $c=\mu^1_0-\mu^0_0$, and $d=\mu^1_1-\mu^0_1-\mu^1_0+\mu^0_0$.
Note that this equation describes the TRUTH of the world (without making ANY assumption), and all of these coefficients have a real meaning; e.g., $b$ is the increment in mean wage if the population drinks no milk but changes from nobody attending school to everybody attending school, i.e., the effect of schooling when $Z_i=0$ (no milk). Whether we can consistently estimate these coefficients is another story.
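The decomposition above can be checked numerically. Here is a minimal sketch, with made-up cell means and a made-up joint distribution for the dummies (all numbers are my own, purely illustrative), which simulates the four-cell world and verifies that OLS on the saturated basis $(1, X, Z, XZ)$ recovers $a,b,c,d$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical cell means, keyed as (z, x): superscript Z, subscript X.
mu = {(0, 0): 10.0, (0, 1): 14.0, (1, 0): 11.0, (1, 1): 18.0}

# Correlated dummies: milk drinkers are (arbitrarily) more likely to have schooling.
Z = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.3 + 0.4 * Z, n)

# Observed wage: each person's cell mean, plus mean-zero noise.
mean = (mu[0, 0] * (1 - X) * (1 - Z) + mu[0, 1] * X * (1 - Z)
        + mu[1, 0] * (1 - X) * Z + mu[1, 1] * X * Z)
Y = mean + rng.normal(0, 1, n)

# Saturated regression on (1, X, Z, XZ).
W = np.column_stack([np.ones(n), X, Z, X * Z])
a_hat, b_hat, c_hat, d_hat = np.linalg.lstsq(W, Y, rcond=None)[0]

# True coefficients implied by the cell means.
a, b = mu[0, 0], mu[0, 1] - mu[0, 0]
c = mu[1, 0] - mu[0, 0]
d = mu[1, 1] - mu[0, 1] - mu[1, 0] + mu[0, 0]
```

Because the model is saturated in the two dummies, the four OLS coefficients reproduce the four cell means exactly in large samples, regardless of how correlated $X$ and $Z$ are.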

Now, what if, when you see the data set $\{(Y_i,X_i,Z_i)\}_{i=1}^n$, you build the following model
\begin{equation}
Y_i=a'+b'X_i+c'Z_i+\epsilon'_i,
\end{equation}
i.e., you did not include the term $X_iZ_i$ in your model, which means you dump this term into the error term. Then here are my questions:

(1) If you don't include the term $X_iZ_i$, is it always true that the error term will be correlated with the predictors? Intuitively, it seems so, since how could the random variables $X$ and $XZ$ be uncorrelated? This would imply that OLS cannot consistently estimate $a',b',c'$. If, for some reason I don't know, $\epsilon'_i$ is uncorrelated with $X$ and $Z$, then the omitted variable $XZ$ does not affect the OLS, so in this case $a'=a, b'=b, c'=c$; but then what is the correct interpretation of these coefficients? Is $b'$ still "the increment in mean wage if the population drinks no milk but changes from nobody attending school to everybody attending school"?
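One way to probe this question numerically (with an illustrative truth of my own, $a=10$, $b=4$, $c=1$, $d=3$): generate data with a genuine interaction, fit the short model without $X_iZ_i$, and compare the residuals with the included and the omitted regressors:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Made-up truth with a genuine interaction (d = 3).
X = rng.binomial(1, 0.5, n)
Z = rng.binomial(1, 0.5, n)
Y = 10 + 4 * X + 1 * Z + 3 * X * Z + rng.normal(0, 1, n)

# Short model: regress Y on (1, X, Z) only, omitting the interaction.
W = np.column_stack([np.ones(n), X, Z])
coef = np.linalg.lstsq(W, Y, rcond=None)[0]
resid = Y - W @ coef

corr_X = np.corrcoef(resid, X)[0, 1]        # ~0: orthogonal by construction
corr_XZ = np.corrcoef(resid, X * Z)[0, 1]   # clearly nonzero
```

This illustrates the distinction at stake: the OLS *residual* of the short model is mechanically uncorrelated with $X$ and $Z$, but the omitted term $dX_iZ_i$, which the short model's *error* absorbs, remains correlated with the residual (and with the predictors) whenever $d\neq 0$.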

(2) Somewhat related to (1): suppose initially I only have the data $\{(Y_i,X_i)\}_{i=1}^n$, and I build the model $Y_i=a'' + b'' X_i+\epsilon''_i$, where $a''=\mathbb{E}[Y_{0i}]$, i.e., the mean wage of the population if nobody goes to school, regardless of whether they drink milk, and $b''=\mathbb{E}[Y_{1i}-Y_{0i}]$. Now suppose I later observe whether each individual drinks milk. I project the error term $\epsilon''_i$ onto $Z_i$, i.e., $\epsilon''_i=c'' Z_i+\nu_i$, and then run $Y_i=a'' + b'' X_i + c'' Z_i + \nu_i$. What is the meaning of $a''$ and $b''$ in the new model? Are they the same as in the old model (both in theoretical value and in interpretation)?
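This two-step procedure can also be probed numerically. Under an illustrative data-generating process of my own (correlated dummies, no interaction), projecting the first-stage residual onto $Z_i$ does not reproduce the coefficients of the joint regression of $Y$ on $(X,Z)$:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Illustrative truth: Y = 10 + 4 X + 1 Z + noise, with X and Z correlated.
Z = rng.binomial(1, 0.5, n)
X = rng.binomial(1, 0.3 + 0.4 * Z, n)
Y = 10 + 4 * X + 1 * Z + rng.normal(0, 1, n)

def ols(design, y):
    # OLS coefficients for an explicit design matrix
    return np.linalg.lstsq(design, y, rcond=None)[0]

# Step 1: short regression of Y on X alone.
a2, b2 = ols(np.column_stack([np.ones(n), X]), Y)
resid = Y - (a2 + b2 * X)

# Step 2: project the step-1 residual onto Z.
c2 = ols(np.column_stack([np.ones(n), Z]), resid)[1]

# Joint regression of Y on X and Z, for comparison.
aj, bj, cj = ols(np.column_stack([np.ones(n), X, Z]), Y)
```

Because $X$ and $Z$ are correlated here, the short-regression $b''$ absorbs part of $Z$'s contribution, and the projected $c''$ differs from the joint-regression coefficient on $Z$; the two procedures coincide only when $X$ and $Z$ are orthogonal (the Frisch–Waugh–Lovell logic).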

(3) For any data set (with a response variable $Y$ and a vector of predictors $W$), we can always write $Y_i=\mathbb{E}[Y_i|W_i]+e_i$, where the residual is defined as $e_i\triangleq Y_i-\mathbb{E}[Y_i|W_i]$ and is, by construction, uncorrelated with $W_i$. In this post the predictors are all dummies, $W_i=(X_i,Z_i)$, so $\mathbb{E}[Y_i|W_i]$ is exactly linear in $(X_i,Z_i,X_iZ_i)$, i.e., we can write $\mathbb{E}[Y_i|X_i,Z_i]=\alpha+\beta X_i+\gamma Z_i+\delta X_iZ_i$.

Note that $\alpha,\beta,\gamma,\delta$ need have nothing to do with the aforementioned $a,b,c,d$. These Greek letters have a completely different interpretation: e.g., $\beta$ is the change in $\mathbb{E}[Y_i|X_i,Z_i]$ per unit change in $X$ when $Z=0$, a pure description of the relationship between the predictors and the conditional expectation, which has nothing to do with a counterfactual effect like the interpretation of $b$. So in this case, if we omit the variable $X_iZ_i$ and model the conditional expectation as $\mathbb{E}[Y_i|X_i,Z_i]=\alpha'+\beta'X_i+\gamma'Z_i$, then $\beta'$ just describes the association of a unit change in $X$ with $Y_i$ (with no qualifier "when $Z=0$"). In empirical studies, which interpretation of the parameters should we adopt, the $\beta$ one or the $b$ one? Since $e_i$ is the residual (NOT the $\epsilon_i$ above) and is uncorrelated with $W_i$ by definition, OLS is always consistent in this sense: the OLS estimator converges to the $\beta$'s (the Greek letters), which do not necessarily equal the $b$'s (the Latin letters).
I recall that when I learned econometrics, the consistency of the OLS estimator took up a big chunk of the lectures, so I assume the correct interpretation of the coefficients in a linear regression model should be the $b$ one rather than the $\beta$ one, since under the $\beta$ interpretation the OLS estimator would always be consistent and there would be nothing to lecture about. Did I miss something important?

Best Answer

To correct your understanding a bit: your interpretation of the model coefficients as the marginal contribution of a predictor is not one I've heard before, and I don't think it's right. "Marginal" suggests that it is not conditional. In an OLS model with two factors, $X$ and $Z$, the two respective coefficients $b$ and $c$ certainly differ from the results obtained by regressing $Y$ on $X$ and on $Z$ separately to obtain coefficients $b_m$ and $c_m$. In the two-factor model, the interpretation of $b$ is "the expected mean difference in $Y$ comparing groups differing by one unit in $X$ but having the same value of $Z$", whereas the marginal model would drop the "having the same value of $Z$" clause. The idea of "forcing" people one unit higher is a causal interpretation; with observational data, causal analysis is a type of counterfactual reasoning, sort of rewinding time.

The model you've written is a heteroscedastic one, and the rearrangement requires terms in $(1-X)$ and $(1-Z)$ to pick out $\mu_{ij}$ at the specific covariate values $X=i$, $Z=j$. So I'd recommend writing it as:

$$E[Y|X,Z] = \mu_{00}(1-X)(1-Z) + \mu_{01}(1-X)Z + \mu_{10}X(1-Z) + \mu_{11}XZ$$

Omitting the interaction term from this model produces fitted values that are a complex combination of the $\mu$s, depending on the correlation of $X$ and $Z$; but they can be calculated by hand, and they are estimated consistently.
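The "calculated by hand" claim can be made concrete. With hypothetical cell probabilities and cell means (my own numbers, chosen so the interaction contrast is nonzero), the population linear projection of $Y$ on $(1,X,Z)$ solves the normal equations $E[WW^\top]\beta = E[WY]$ cell by cell, and large-sample OLS on the interaction-free model converges to exactly those projection coefficients:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical cell probabilities and cell means, keyed as (x, z).
p = {(0, 0): 0.3, (1, 0): 0.2, (0, 1): 0.1, (1, 1): 0.4}
mu = {(0, 0): 10.0, (1, 0): 14.0, (0, 1): 11.0, (1, 1): 18.0}

# Population projection of Y on W = (1, X, Z): solve E[W W'] beta = E[W Y].
G = np.zeros((3, 3))
m = np.zeros(3)
for (x, z), pr in p.items():
    w = np.array([1.0, x, z])
    G += pr * np.outer(w, w)
    m += pr * w * mu[x, z]
beta_proj = np.linalg.solve(G, m)          # the "by hand" answer

# Large-sample OLS on the no-interaction model converges to beta_proj.
cells = np.array(list(p))                  # rows are (x, z)
means = np.array([mu[x, z] for x, z in p])
idx = rng.choice(4, 300_000, p=list(p.values()))
X, Z = cells[idx, 0], cells[idx, 1]
Y = means[idx] + rng.normal(0, 1, len(idx))
beta_hat = np.linalg.lstsq(
    np.column_stack([np.ones(len(idx)), X, Z]), Y, rcond=None)[0]
```

With these made-up numbers the projection slope on $X$ comes out near 5.2: perfectly well defined and consistently estimated, but not the within-stratum contrast $\mu_{10}-\mu_{00}=4$ that the full model's $b$ would deliver.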

  1. Yes, it's true that omitting certain predictors from a model (what we call "model misspecification") will result in errors that are correlated with the observed predictors whenever the omitted and observed predictors are correlated. This is something we never actually observe, because the error is a different object from the residuals: if I fit the OLS model, the residuals will be orthogonal to every predictor in the model.

  2. My first paragraph addresses the difference in interpretation between the unadjusted model and the adjusted one. The interpretation will not be the same. The actual values of the coefficients will not be the same unless $X$ and $Z$ are uncorrelated, and even when they are, I would not interpret the two models' results as the same.

  3. In your description of the results you omitted an interaction term. In regression analysis we often treat exploring interaction terms as a type of post hoc analysis. What you might imagine is that when the interaction is present (suppose the object of inference is the relationship between $X$ and $Y$ controlling for a binary $Z$), there is one slope relating $X$ to $Y$ when $Z=0$ and another when $Z=1$; by omitting the interaction term, the estimated slope relating $X$ to $Y$ is just a weighted average over the two $Z$ levels. In observational studies (what you are probably calling "empirical" studies), you mustn't use causal language in interpreting any of the findings, including words like "impact", "change", and "increase".
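The weighted-average point can be illustrated with a small simulation (a hypothetical setup of my own: a continuous $X$, a binary $Z$, and different slopes in the two strata):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Hypothetical truth: slope of X is 1 when Z = 0 and 3 when Z = 1.
Z = rng.binomial(1, 0.4, n)
X = rng.normal(0, 1, n)
Y = 2 + 1 * X + 0.5 * Z + 2 * X * Z + rng.normal(0, 1, n)

def slope(x, y):
    # OLS slope of y on x (with intercept)
    return np.polyfit(x, y, 1)[0]

s0 = slope(X[Z == 0], Y[Z == 0])   # stratum slope when Z = 0
s1 = slope(X[Z == 1], Y[Z == 1])   # stratum slope when Z = 1

# No-interaction model: its X coefficient averages the two stratum slopes.
pooled = np.linalg.lstsq(np.column_stack([np.ones(n), X, Z]), Y, rcond=None)[0][1]
```

The pooled coefficient always lands between the two stratum slopes, weighted by each stratum's share of the variance in $X$ (here roughly $0.6\,s_0 + 0.4\,s_1$, since the strata have equal $X$ variance and 60/40 sizes).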