Solved – Feasible Generalized Least Square in R

generalized linear model, heteroscedasticity, multiple regression, r

I am studying the factors influencing the annual salary of employees at an undisclosed bank. The regression model that I have decided to employ is as follows:

\begin{equation}
Y_{k}=\beta_{1}+\beta_{2}E_{k}+\beta_{3}D_{gk}+\beta_{4}D_{mk}+\beta_{5}D_{2k}+\beta_{6}D_{3k}+\varepsilon_{k}
\end{equation}
where $Y_{k}$ is the logarithm of annual salary, $E$ is the number of years of education, $D_{g}$ is a gender dummy, $D_{m}$ is a minority dummy, and where

\begin{equation}
D_{2}=\begin{cases} 1 &\text{Custodial job} \\ 0 & \text{Otherwise} \end{cases}
\end{equation}
and
\begin{equation}
D_{3}=\begin{cases} 1 &\text{Management job} \\ 0 & \text{Otherwise} \end{cases}
\end{equation}
As you know, whenever one deals with GLS, $\Omega$ will almost surely be unknown and thus has to be estimated. In general there are $\frac{n(n+1)}{2}$ parameters to estimate, which makes it practically impossible to come up with a viable estimate from only $n$ observations. This is usually counteracted by imposing some structure on $\Omega$.

In my case, I would like to assume that the disturbance term $\varepsilon_{k}$ in the above regression model has variance $\sigma_{i}^{2}$ for $i=1,2,3$, according to whether the $k$-th employee has a job in category 1, 2, or 3, respectively. Now, we may introduce the transformations $\gamma_{1}=\log (\sigma_{1}^{2})$, $\gamma_{2}=\log(\sigma_{2}^{2}/\sigma_{1}^{2})$, and $\gamma_{3}=\log(\sigma_{3}^{2}/\sigma_{1}^{2})$ so as to formulate the following model for the variances:

\begin{equation}
\sigma_{k}^{2}= \exp \{ \gamma_{1}+\gamma_{2}D_{2k}+\gamma_{3}D_{3k} \}
\end{equation}
Since $\hat{\beta}_{\mathrm{OLS}}$ is a consistent estimator of $\beta$ even under heteroscedasticity, we have that $\hat{\beta}_{\mathrm{OLS}} \xrightarrow{p}\beta$ as the number of observations increases. We may therefore argue that $e_{k}^{2} \approx \sigma_{k}^{2}$, and so we can estimate the variance model from information that we already possess.
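To make steps (3) and (4) below concrete: taking logarithms of the variance model and replacing $\sigma_{k}^{2}$ with $e_{k}^{2}$ gives an auxiliary regression in quantities we can compute, and its fitted values then define the estimated covariance matrix,

\begin{equation}
\log e_{k}^{2} \approx \gamma_{1}+\gamma_{2}D_{2k}+\gamma_{3}D_{3k}+v_{k}, \qquad \hat{\Omega}=\operatorname{diag}\bigl(\exp\{\hat{\gamma}_{1}+\hat{\gamma}_{2}D_{2k}+\hat{\gamma}_{3}D_{3k}\}\bigr).
\end{equation}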

Summary of procedure

(1) Calculate the OLS estimate.

(2) Calculate the OLS residual $\textbf{e}=\textbf{Y}-\textbf{X}\hat{\beta}$

(3) Calculate the OLS estimate of $\gamma$ from $e_{k}^{2}=f_{\gamma}(Z_{k})+\overline{\varepsilon}_{k}$.

(4) Calculate the FGLS estimate as the GLS estimate with $\hat{\Omega}=\Omega(\hat{\gamma})$ in place of $\Omega$ (see the sketch just after this list).
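Here is a minimal sketch of steps (1)-(4) in base R. The data frame bank and its column names (logsalary, educ, gender, minority, and the dummies d2, d3) are made up for illustration.

#==========================================================
# FGLS by hand (sketch; data frame and column names assumed)
#==========================================================
# (1) OLS estimate
fitOLS  = lm(logsalary ~ educ + gender + minority + d2 + d3, data = bank)

# (2) OLS residuals
e       = resid(fitOLS)

# (3) estimate gamma by regressing log(e^2) on the job-category dummies
fitVar  = lm(log(e^2) ~ d2 + d3, data = bank)
s2hat   = exp(fitted(fitVar))

# (4) FGLS = WLS with inverse-variance weights
fitFGLS = lm(logsalary ~ educ + gender + minority + d2 + d3,
             data = bank, weights = 1/s2hat)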

What I would like to know is whether or not one can perform this estimation using a known function in R, say gls? If the answer is yes, then how exactly should I write the call so that my heteroscedasticity assumption is taken into account? Thanks for taking the time! Have a great day!

Best Answer

Estimating Regression Models with Multiplicative Heteroscedasticity

The model that you have described is discussed in Harvey (1976).

Let me rewrite the model $$ \begin{align} \mathbb{E}(Y_i \mid \mathbf{X}_i, \mathbf{Z}_i) &= \mathbf{X}_i'\boldsymbol{\beta} \\ \mathbb{V}(Y_i \mid \mathbf{X}_i, \mathbf{Z}_i) &\equiv \sigma^2_i \\ &= \exp(\mathbf{Z}_i'\boldsymbol{\alpha}) \\ \end{align} $$ Note that it is possible that $\mathbf{X}_i$ and $\mathbf{Z}_i$ have common elements.

Rewriting the conditional mean equation equivalently as $$ \begin{align} Y_i &= \mathbf{X}_i'\boldsymbol{\beta} + \varepsilon_i \\ \mathbb{E}(\varepsilon_i \mid \mathbf{X}_i, \mathbf{Z}_i) &= 0 \end{align} $$

Two-step estimation

The main aim of Harvey (1976) is to provide (efficient) estimates of $\boldsymbol{\alpha}$, rather than to use that estimate to compute a WLS estimator. However, once $\hat{\boldsymbol{\alpha}}$ has been computed, it can be used to form the weights for a WLS (i.e. FGLS) estimator.
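Concretely, with $\hat{\Omega} = \operatorname{diag}\left(\exp(\mathbf{Z}_i'\hat{\boldsymbol{\alpha}})\right)$, the feasible GLS estimator is the usual formula
$$
\hat{\boldsymbol{\beta}}_{\text{FGLS}} = \left(\mathbf{X}'\hat{\Omega}^{-1}\mathbf{X}\right)^{-1}\mathbf{X}'\hat{\Omega}^{-1}\mathbf{Y},
$$
which is the same as running OLS after dividing $Y_i$ and each element of $\mathbf{X}_i$ by $\hat{\sigma}_i = \exp(\mathbf{Z}_i'\hat{\boldsymbol{\alpha}}/2)$.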

The two-step estimator that you have described follows the procedure:
1. Compute the OLS estimator $\hat{\boldsymbol{\beta}}$ and the squared OLS residuals $\hat{\varepsilon}_i^2$,
2. Compute the estimates $\hat{\boldsymbol{\alpha}}$ from the regression
$$ \log \hat{\varepsilon}_i^2 = \mathbf{Z}_i'\boldsymbol{\alpha} + \nu_i $$
and then weight by $\hat{\sigma}_i^2 = \exp(\mathbf{Z}_i'\hat{\boldsymbol{\alpha}})$ in a final WLS step. (Under normality, $\mathbb{E}[\log \hat{\varepsilon}_i^2 \mid \mathbf{Z}_i]$ differs from $\mathbf{Z}_i'\boldsymbol{\alpha}$ by a constant of roughly $-1.27$, but since this only rescales all the weights by a common factor it does not affect the WLS estimate of $\boldsymbol{\beta}$.)

Maximum likelihood estimation

The MLE is claimed to be up to 60% more efficient than the two-step estimator above. There are some other apparent advantages: the MLE remains a pseudo-maximum-likelihood estimator under distributional misspecification, and there is no need to run a separate WLS step after computing $\hat{\boldsymbol{\alpha}}$. However, this does not seem to be borne out by my calculations below.

Under conditional normality of the response, the likelihood is very simple to write down and optimize

$$ \begin{align} \log L_i(\boldsymbol{\beta}, \boldsymbol{\alpha}) &= -\frac{1}{2}(\log2 \pi + \mathbf{Z}_i'\boldsymbol{\alpha})\\ &\qquad -\frac{1}{2}\left(\dfrac{(Y_i - \mathbf{X}_i'\boldsymbol{\beta})^2}{\exp(\mathbf{Z}_i'\boldsymbol{\alpha})}\right) \end{align} $$

Two-step estimation: R

To implement this, first let us simulate some heteroskedastic data using the model given, and estimate it using OLS.

#==========================================================
# simulate the heteroskedastic data
#==========================================================
iN = 1000
iK1 = 7
iK2 = 4

mX = cbind(1, matrix(rnorm(iN*iK1), nrow = iN, ncol = iK1))
mZ = cbind(1, matrix(rnorm(iN*iK2), nrow = iN, ncol = iK2))

vBeta = rnorm(1 + iK1)
vAlpha = rnorm(1 + iK2)

vY = rnorm(iN, mean = mX %*% vBeta, sd = sqrt(exp(mZ %*% vAlpha)))

#==========================================================
# fit the data using OLS
#==========================================================
vBetaOLS = coef(lmHetMean <- lm.fit(y = vY, x = mX))

Next, we can get the results using the two-step procedure:

#==========================================================
# two-step estimation
#==========================================================
residHet = resid(lmHetMean)
vVarEst = exp(fitted(lmHetVar <- lm.fit(y = log(residHet^2), x = mZ)))
vSdEst = sqrt(vVarEst)  # weight by standard deviations, not variances

vBetaTS = coef(lm.fit(y = vY/vSdEst, x = mX/vSdEst))
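As a quick cross-check, the same WLS step can also be done with lm() and its weights argument, which takes inverse-variance weights; this should reproduce vBetaTS up to numerical noise.

# the same FGLS/WLS estimate via lm(); weights are inverse variances
vBetaTS2 = coef(lm(vY ~ mX - 1, weights = 1/vVarEst))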

Maximum likelihood estimation: R

#==========================================================
# likelihood function
#==========================================================
fnLogLik = function(vParam, vY, mX, mZ) {
  # split the parameter vector into mean (beta) and variance (alpha) parts
  vBeta = vParam[1:ncol(mX)]
  vAlpha = vParam[(ncol(mX)+1):(ncol(mX)+ncol(mZ))]

  # negative Gaussian log-likelihood with variance exp(Z'alpha)
  negLogLik = sum(0.5*(log(2*pi) + mZ %*% vAlpha + 
                         (vY - mX %*% vBeta)^2/exp(mZ %*% vAlpha)))
  return(negLogLik)
}

# test the function
# debugonce(fnLogLik)
fnLogLik(c(vBeta, vAlpha), vY, mX, mZ)

#==========================================================
# MLE
#==========================================================
# random starting values; with 13 parameters the default Nelder-Mead
# search in optim() may not get close to the optimum from an arbitrary start
vParam0 = rnorm(ncol(mX) + ncol(mZ))
optimHet = optim(vParam0, fnLogLik, vY = vY, mX = mX, mZ = mZ)
vBetaML = optimHet$par

#==========================================================
# collect all the results
#==========================================================
cbind(vBeta, vBetaOLS, vBetaTS, vBetaML = vBetaML[1:ncol(mX)])

I am not entirely sure why the ML results are a bit farther off than even the OLS ones, and I am not ruling out a coding mistake; a likely culprit is the optimization rather than the estimator, since the default Nelder-Mead search in optim, started from random values for all 13 parameters, may simply not have converged.

But as you can see, the 2-step estimator seems to do better than the OLS estimator.
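One thing worth trying before blaming the estimator is to restart the optimizer from the two-step estimates (a consistent starting point) with a gradient-based method and a higher iteration limit. This is only a sketch; I have not verified that it rescues the ML numbers above.

#==========================================================
# retry the MLE from better starting values (sketch)
#==========================================================
vAlphaTS = coef(lmHetVar)                 # alpha estimates from the two-step fit
vParam1  = c(vBetaTS, vAlphaTS)           # start from the two-step estimates
optimHet2 = optim(vParam1, fnLogLik, vY = vY, mX = mX, mZ = mZ,
                  method = "BFGS", control = list(maxit = 1000))
vBetaML2 = optimHet2$par[1:ncol(mX)]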
