Solved – Using and understanding the zelig package in R

Tags: maximum-likelihood, r, simulation

Is anyone here familiar with the R package Zelig?

I have a data frame like this:

IQ   AGE
80   50
100  18
90   25

etc.

What I need to do is build a model of IQ given AGE. I am running these commands:

# Fit a least-squares model of IQ on AGE
z.out <- zelig(IQ ~ AGE, data = df, model = "ls")
# Set up the what-if question: given AGE = 110, what would the IQ be?
x.out <- setx(z.out, AGE = 110)
# Run 1,000,000 simulation draws and report the IQ with a 95% confidence interval
s.out <- sim(z.out, x.out, num = 1000000, level = 95)

I have a hard time understanding from what pool of data the sim() function draws its numbers. I read through the docs, but they are written for Ph.D. students, if not more advanced readers. I have asked the Zelig creators this question multiple times, but they keep directing me to the docs, which I have already read several times with no luck. However, one of the people who work on Zelig sent me this email:

Suppose that you fit
$$\text{IQ} = a + \text{Age} * b + e$$
Then you get a table of regression coefficients where a=50, b=2, and their standard errors are something like $\text{s.e.}(a)=\sqrt{10}$ and $\text{s.e.}(b)=1$. These are all hypothetical examples.
In maximum likelihood estimation, this regression output is another way of saying that $a$ and $b$ are distributed bivariate normal with means $[50,2]$ and there's a variance-covariance matrix that looks something like this (all numbers are made up):
$$\begin{pmatrix}
10 & \operatorname{cov}(a,b) \\
\operatorname{cov}(a,b) & 1
\end{pmatrix}$$
So, the variance of $a$ is 10, the variance of $b$ is 1, and their covariance is $cov(a,b)$. It won't be shown in your regression table, but Zelig remembers it for you. Let's pretend it's 3.
This variance-covariance matrix is the inverse of the Hessian I mentioned earlier. Don't worry about it. For this example, you need only remember that $\text{mean}(a,b) = [50,2]$ and $\operatorname{cov}(a,b)=\begin{pmatrix}10&3\\3&1\end{pmatrix}$.
In addition, suppose that the error term $e$ is distributed with mean 0 and s.d.=1.
Now, one way to predict what IQ you might get for somebody aged 88, based on this regression table, is exactly what you would expect: you simply calculate 50 + 88 * 2 = 226. This is your point estimate. The 95% confidence interval around this point estimate is a function of the standard errors of the coefficient estimates of $a=50$ and $b=2$, and the exact formula for that is in any econometrics textbook.
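For reference, the textbook interval the email alludes to follows from the variance of a linear combination of the coefficients (this considers only coefficient uncertainty, matching the email's framing, and uses the made-up numbers above):

$$\operatorname{Var}(\widehat{\text{IQ}}) = \operatorname{Var}(a) + \text{Age}^2\,\operatorname{Var}(b) + 2\,\text{Age}\cdot\operatorname{cov}(a,b) = 10 + 88^2 \cdot 1 + 2 \cdot 88 \cdot 3 = 8282,$$

so the standard error is $\sqrt{8282} \approx 91$ and the 95% interval is roughly $226 \pm 1.96 \times 91$.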
Simulation makes it unnecessary to dig up that textbook. Instead, for 1000 rounds, sim() will come up with 1000 different pairs of $(a,b)$ estimates drawn from the bivariate normal with mean $=[50,2]$ and cov $=\begin{pmatrix}10&3\\3&1\end{pmatrix}$. One such pair might be $(47,1.5)$; another might be $(52,3)$; yet another might be $(10,5)$.
Whatever they are, sim() plugs them into the formula and gives you 1000 different estimates for the IQ. Their average is your point estimate. If you stack them from lowest to highest, the ends of the 95% confidence interval are the 25th value from the bottom and the 25th from the top (the 2.5th and 97.5th percentiles). That's it. That's all that sim() does.
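The recipe in the email can be sketched end to end in a few lines. This is only an illustration using the made-up numbers from the example above, not Zelig's actual implementation; it is written in stdlib-only Python, with a hand-rolled Cholesky factor turning two independent standard normals into a correlated $(a,b)$ pair:

```python
import math
import random

random.seed(0)

# Hypothetical values from the email: means and covariance of (a, b).
mean_a, mean_b = 50.0, 2.0
var_a, var_b, cov_ab = 10.0, 1.0, 3.0
age = 88          # the counterfactual value (what setx() would fix)
n_draws = 1000    # number of simulation rounds

# Cholesky factor L of [[var_a, cov_ab], [cov_ab, var_b]]:
# if z1, z2 are independent standard normals, then
# (mean_a + l11*z1, mean_b + l21*z1 + l22*z2) has the desired covariance.
l11 = math.sqrt(var_a)
l21 = cov_ab / l11
l22 = math.sqrt(var_b - l21 ** 2)

preds = []
for _ in range(n_draws):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a = mean_a + l11 * z1              # simulated intercept
    b = mean_b + l21 * z1 + l22 * z2   # simulated slope
    preds.append(a + b * age)          # predicted IQ for this (a, b) pair

preds.sort()
point_estimate = sum(preds) / n_draws
ci_low, ci_high = preds[24], preds[-25]  # 25th value from each end
print(point_estimate, ci_low, ci_high)
```

The point estimate lands near $50 + 2 \cdot 88 = 226$, and the interval endpoints are simply order statistics of the simulated predictions; no closed-form variance formula is needed.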

Given the above explanation, can anybody tell me in lay terms what numbers sim() is picking? How are the numbers in that pool generated? I would greatly appreciate it if anyone could shed some light on this.

Best Answer

The function sim() draws numbers at random from a bivariate normal distribution with the specified mean vector and variance-covariance matrix in order to construct confidence intervals around the parameter estimates.

The mean vector and the covariance matrix are determined by the ML estimates of the model parameters, so the draws are simulations of plausible parameter values. Recall that the ML estimates are just the most likely values given the data; because of sampling error, the true values need not coincide with the maximum of the likelihood function -- this is the motivation for constructing confidence sets: to define a region with a high probability of containing the true values.

To understand this topic more fully, you should familiarize yourself with probability density functions in general, then the normal distribution, and then the bivariate normal specifically. The latter extends the normal distribution from the real line to the plane, and so is useful when two random variables can vary together.
