Can someone explain what identification means in the context of an OLS model? I have a fair grasp of the derivation using either the method of moments or by minimizing the sum of squared residuals, but I am failing to grasp which part of this process corresponds to identification. Also, how does identification differ from estimation of the parameters?
Solved – parameter identification in the context of OLS
econometrics, identifiability, least squares, method of moments, regression
Related Solutions
I first encountered the ANOVA when I was a Master's student at Oxford in 1978. Modern approaches, by teaching continuous and categorical variables together in the multiple regression model, make it difficult for younger statisticians to understand what is going on. So it can be helpful to go back to simpler times.
In its original form, the ANOVA is an exercise in arithmetic whereby you break up the total sum of squares into pieces associated with treatments, blocks, interactions, whatever. In a balanced setting, sums of squares with an intuitive meaning (like the block and treatment sums of squares, SSB and SST) add up to the adjusted total sum of squares. All of this works thanks to Cochran's Theorem. Using Cochran, you can work out the expected values of these terms under the usual null hypotheses, and the F statistics flow from there.
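For a concrete illustration (a standard identity, not part of the original answer): in a balanced one-way layout with $a$ treatments and $n$ observations per treatment, the decomposition is
$$\sum_{i=1}^{a}\sum_{j=1}^{n}\left(y_{ij}-\bar y_{\cdot\cdot}\right)^2 = n\sum_{i=1}^{a}\left(\bar y_{i\cdot}-\bar y_{\cdot\cdot}\right)^2 + \sum_{i=1}^{a}\sum_{j=1}^{n}\left(y_{ij}-\bar y_{i\cdot}\right)^2$$
that is, total SS = treatment SS + residual SS; with blocks or interactions, further terms split off in the same way.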
As a bonus, once you start thinking about Cochran and sums of squares, it makes sense to go on slicing and dicing your treatment sums of squares using orthogonal contrasts. Every entry in the ANOVA table should have an interpretation of interest to the statistician and yield a testable hypothesis.
I recently wrote an answer where the difference between method-of-moments (MOM) and maximum likelihood (ML) estimation arose. The question turned on estimating random effects models. At this point, the traditional ANOVA approach totally parts company with maximum likelihood estimation, and the estimates of the effects are no longer the same. When the design is unbalanced, you don't get the same F statistics either.
Back in the day, when statisticians wanted to compute random effects from split-plot or repeated measures designs, the random effects variance was computed from the mean squares of the ANOVA table. So if the plot effect has variance $\sigma^2_p$ and the residual variance is $\sigma^2$, the expected value of the mean square ("expected mean square", EMS) for plots might be $\sigma^2 + n\sigma_p^2$, with $n$ the number of splits in the plot. You set the mean square equal to its expectation and solve for $\hat{\sigma}_p^2$. The ANOVA yields a method of moments estimator for the random effect variance. Nowadays we tend to solve such problems with mixed effects models, and the variance components are obtained through maximum likelihood estimation or REML.
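To spell out that method-of-moments step (using the notation above, with the residual mean square estimating $\sigma^2$):
$$\mathrm{MS}_{\text{plots}} = \sigma^2 + n\,\sigma_p^2 \quad\Longrightarrow\quad \hat{\sigma}_p^2 = \frac{\mathrm{MS}_{\text{plots}} - \mathrm{MS}_{\text{residual}}}{n}.$$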
The ANOVA as such is not a method of moments procedure. It turns on splitting the sum of squares (or more generally, a quadratic form of the response) into components that yield meaningful hypotheses. It depends strongly on normality since we want the sums of squares to have chi-squared distributions for the F tests to work.
The maximum likelihood framework is more general and applies to situations like generalized linear models where sums of squares do not apply. Some software (like R) invites confusion by attaching the name anova to likelihood ratio tests with asymptotic chi-squared distributions. One can justify use of the term "anova", but strictly speaking, the theory behind it is different.
Random variables $\{u_i; i=1,...,n\}$ are said to be "homoskedastic" when
$$\text{Var}(u_i) = \text{constant},\;\; \forall i$$
This property can coexist with conditional heteroskedasticity:
$$\text{Var}(u_i \mid \mathbf x_i) = h(\mathbf x_i)$$
This is because, by the Law of Total Variance, we have
$$\text{Var}(u_i) = E\big[\text{Var}(u_i \mid \mathbf x_i)\big] + \text{Var}\big[E(u_i\mid \mathbf x_i)\big]$$
$$= E[h(\mathbf x_i)]+\text{Var}\big[E(u_i\mid \mathbf x_i)\big]$$
The second term is a moment of the distribution of the random variable $E(u_i\mid \mathbf x_i)$, and so is a constant (irrespective of whether $E(u_i\mid \mathbf x_i)=0$ or not). The first term is a constant over $i$ if $E[h(\mathbf x_i)]$ does not vary with $i$, which holds in particular when the $\mathbf x_i$'s share the same marginal distribution.
In other words, if the regressors are "first-order stationary" in the sense of having identical marginal distributions, then we can have conditional heteroskedasticity and unconditional homoskedasticity at the same time.
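For a concrete illustration (my own construction, not part of the original answer): take a single regressor $x_i$ and let $u_i = x_i\,\epsilon_i$, where the $\epsilon_i$'s are i.i.d. with zero mean and unit variance, independent of the $x_i$'s. Then
$$\text{Var}(u_i \mid x_i) = x_i^2$$
which varies with $x_i$ (conditional heteroskedasticity), while by the Law of Total Variance
$$\text{Var}(u_i) = E[x_i^2]$$
which is the same constant for every $i$ as long as the $x_i$'s share the same marginal distribution (unconditional homoskedasticity).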
As regards the relation/contrast between "homoskedasticity" and "exogeneity", first of all, in fairness to the textbook you mention, the authors actually write on page 194:
"This last condition is called the orthogonality condition. If this condition is satisfied, then the explanatory variables are said to be exogenous (or sometimes ‘weakly’ exogenous, to distinguish this type of exogeneity, which is related to consistent estimation, from other types of exogeneity related to forecasting and structural breaks)."
So they do point out that the more accurate term for the property is "orthogonality", while the concept of "exogeneity" comes in "weak", "strong", and "strict" variants, each reflecting a different assumption.
Now, as regards the essence of the question: conditional homoskedasticity states that
$$\text{Var}(u_i \mid \mathbf x_i) = E(u_i^2\mid \mathbf x_i) - \left[E(u_i\mid \mathbf x_i)\right]^2 = \text{constant}$$
So it is a statement about whether moments of the distribution followed by the $u_i$'s are affected by the presence of $\mathbf x_i$ (or, in an informal informational approach, whether knowing $\mathbf x_i$ changes the variation we anticipate in $u_i$, as summarized by the variance, compared to when we don't know $\mathbf x_i$). Keep in mind that this "variation" relates to second moments.
On the other hand, the orthogonality property states that $$E(\mathbf x_i\cdot u_i)=0$$
This is a statement about the first moment of a specific function of $\mathbf x_i$ and $u_i$, namely, their product. In a regression setting, where we assume that $E(u_i)=0$, we have that
$$E(\mathbf x_i\cdot u_i)=0 \implies \text{Cov}(\mathbf x_i, u_i)=E(\mathbf x_i\cdot u_i)-E(\mathbf x_i)\,E(u_i)=0$$
So this is a property about whether $\mathbf x_i$ and $u_i$ tend to co-vary.
So, in general informal terms, conditional homoskedasticity and orthogonality both state that "the $\mathbf x_i$'s do not tell us something about the $u_i$'s" - but this "something" is a different "something" in each case, and the two are usefully and meaningfully distinguished.
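To see that the two restrictions really are distinct (again my own illustration, not from the original answer): the construction $u_i = x_i\,\epsilon_i$ above satisfies orthogonality, since $E(x_i u_i)=E(x_i^2)\,E(\epsilon_i)=0$, yet it is conditionally heteroskedastic. Conversely, with $E(x_i)=0$, the error $u_i = x_i + \epsilon_i$ is conditionally homoskedastic, $\text{Var}(u_i\mid x_i)=1$, yet it violates orthogonality because $E(x_i u_i)=E(x_i^2)>0$.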
Best Answer
Thank you for all the responses. It has been more than a year since I asked this question, and I am now able to provide one answer to it. The answer below illustrates the issue of identification in the context of treatment evaluation, where the parameter of interest is the causal effect of treatment receipt on some outcome of interest. Such evaluation problems arise frequently when you are trying to estimate the efficacy of a drug by comparing health outcomes of those who received treatment against a control group. The problem is also frequently encountered in the social sciences, where you might be interested in estimating the treatment effect (causal effect) of some policy intervention (e.g., subsidizing healthcare for a certain group of people) on various outcomes such as income, mortality, etc.
Setup:
For simplicity, let the true underlying process be given by the following linear relationship: $$y=\alpha_0+\alpha_1 t + \bf{Z}\pmb{\beta}+\varepsilon$$ where $y$ is the outcome, $t$ is a binary indicator for receipt of some treatment of interest, and $\bf{Z}$ is a vector of all relevant factors that affect the outcome $y$. Further, let $\varepsilon$ be a normally distributed mean-zero noise term (this noise is assumed to be truly non-deterministic since we have assumed $\bf{Z}$ contains every relevant determinant of $y$, so that $E[\varepsilon \mid t, \textbf{Z}]=0$). It follows that $$\alpha_1=E[y | t=1, \textbf{Z}]-E[y | t=0, \textbf{Z}]$$ Since $\bf{Z}$ includes all relevant factors that determine the outcome $y$, we interpret $\alpha_1$ as the causal effect of treatment receipt ($t=1$) on the outcome $y$.
Empirical Application:
Suppose we are given data on $y$, $t$, and $\bf{Z'}$, where $\bf{Z'}$ is a subset of $\bf{Z}$. In other words, $\bf{Z'}$ contains only some of the relevant variables that determine the outcome $y$. Let $\bf{\tilde{Z}}$ be the unobserved components of $\bf{Z}$. We want to estimate the causal effect of receiving treatment, i.e., $\alpha_1$, in our empirical exercise. However, given our inability to observe the full vector $\bf{Z}$, the best we can do with OLS is to estimate the following: \begin{align} y&=\hat{\alpha}_0+\hat{\alpha}_1 t + \bf{Z'}\hat{\pmb{\beta}}+\nu \end{align} where the error term $\nu = \varepsilon+\bf{\tilde{Z}}\pmb{\tilde{\beta}}$ absorbs the effect of $\bf{\tilde{Z}}$ and is thus no longer pure random noise.
Now notice that \begin{align} E[y | t=1, \textbf{Z}']-E[y | t=0, \textbf{Z}'] &=\hat{\alpha}_1 + \left(E[\nu | t=1, \textbf{Z}']-E[\nu | t=0, \textbf{Z}']\right) \\ & = \hat{\alpha}_1 + \pmb{\tilde{\beta}}\underbrace{\left(E[\bf{\tilde{Z}} | t=1, \textbf{Z}']-E[\bf{\tilde{Z}} | t=0, \textbf{Z}']\right)}_{bias} \\ \implies &\hat{\alpha}_1= \left(E[y | t=1, \textbf{Z}']-E[y | t=0, \textbf{Z}']\right) - \text{bias} \end{align} Therefore, $\hat{\alpha}_1$ captures both the mean difference in outcomes associated with treatment status and a bias stemming from heterogeneity in $\bf{\tilde{Z}}$ with respect to treatment status.
A simple OLS doesn't allow us to adjust for such biases. Therefore, $\hat{\alpha}_1$ does not recover the causal effect of $t$ on $y$: we have failed to identify our parameter of interest, namely the causal effect of $t$ on $y$. The OLS estimate $\hat{\alpha}_1$ can only be interpreted as an estimate of the association between $t$ and $y$ after adjusting for $\bf{Z}'$.
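To see this mechanically, here is a minimal simulation sketch (my own illustration, not part of the original argument; the variable names and parameter values are arbitrary assumptions): an unobserved confounder drives both treatment assignment and the outcome, and omitting it biases the OLS coefficient on $t$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observed control Z' and unobserved confounder Z_tilde (both affect y).
z_obs = rng.normal(size=n)
z_tilde = rng.normal(size=n)

# Treatment assignment depends on the *unobserved* confounder,
# so t is correlated with the error term of the short regression.
t = (0.8 * z_tilde + rng.normal(size=n) > 0).astype(float)

# True data-generating process: alpha_1 = 1 is the causal effect of t.
alpha_0, alpha_1, beta_obs, beta_tilde = 0.5, 1.0, 2.0, 2.0
y = alpha_0 + alpha_1 * t + beta_obs * z_obs + beta_tilde * z_tilde + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients for a design matrix X (intercept included as a column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# "Long" regression: controls for everything, recovers alpha_1 close to 1.
X_long = np.column_stack([np.ones(n), t, z_obs, z_tilde])
# "Short" regression: Z_tilde omitted, so the coefficient on t is biased upward.
X_short = np.column_stack([np.ones(n), t, z_obs])

print("alpha_1 with Z_tilde included:", ols(X_long, y)[1])
print("alpha_1 with Z_tilde omitted: ", ols(X_short, y)[1])
```

With $\bf{\tilde{Z}}$ omitted, the coefficient on $t$ picks up part of the effect of the unobserved confounder, which is exactly the bias term derived above.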
Conclusion:
The above explanation began by defining the parameter of interest as the causal effect of $t$ on $y$. It then illustrated how identification of this parameter can be compromised by omitted variable bias. There are a number of empirical strategies that can be employed to correct for such biases. The most obvious would be to collect data on the missing variables $\bf{\tilde{Z}}$; however, this is often infeasible. Another option would be to randomize the assignment of $t$, so that $t$ is independent of $\bf{\tilde{Z}}$ by construction. This works well if you have the time and resources to design your own intervention $t$ and collect outcome data on the subjects of your study. If you are stuck with observational data where assignment cannot be randomized, you will need to look for empirical strategies that provide a source of exogenous variation in the assignment of $t$. Social scientists refer to such approaches as quasi-experimental methods.