Solved – Using and understanding the zelig package in R

Tags: maximum-likelihood, r, simulation

Is anyone here familiar with the R package Zelig?

I have a data frame like this:

IQ   AGE
80   50
100  18
90   25

etc.

What I need to do is build a model of IQ given AGE. I am running these commands:

# Fit a least-squares model of IQ on AGE
z.out <- zelig(IQ ~ AGE, data = df, model = "ls")
# Set up the what-if question: given AGE = 110, what would the IQ be?
x.out <- setx(z.out, AGE = 110)
# Run 1,000,000 simulation draws and report the IQ with a 95% confidence interval
s.out <- sim(z.out, x.out, num = 1000000, level = 95)

I have a hard time understanding from what pool of data the sim() function draws its numbers. I read through the docs, but they are written for Ph.D. students, if not more advanced readers. I have asked the Zelig creators this question multiple times, but they keep directing me to the docs, which I have already read several times with no luck. However, one of the people who work on Zelig sent me this email:

Suppose that you fit
$$\text{IQ} = a + \text{Age} * b + e$$
Then you get a table of regression coefficients where a=50, b=2, and their standard errors are something like $\text{s.e.}(a)=\sqrt{10}$ and $\text{s.e.}(b)=1$. These are all hypothetical examples.
In maximum likelihood estimation, this regression output is another way of saying that $a$ and $b$ are distributed bivariate normal with means $[50,2]$ and there's a variance-covariance matrix that looks something like this (all numbers are made up):
$$\begin{pmatrix}
10 & \operatorname{cov}(a,b) \\
\operatorname{cov}(a,b) & 1
\end{pmatrix}$$
So, the variance of $a$ is 10, the variance of $b$ is 1, and their covariance is $cov(a,b)$. It won't be shown in your regression table, but Zelig remembers it for you. Let's pretend it's 3.
This variance-covariance matrix is the inverse of the Hessian I mentioned earlier. Don't worry about it. For this example, you need only remember that $\text{mean}(a,b) = [50,2]$ and $\operatorname{cov}(a,b)=\begin{pmatrix}10&3\\3&1\end{pmatrix}$.
In addition, suppose that the error term $e$ is distributed with mean 0 and s.d.=1.
Now, one way to predict what IQ you might get for somebody aged 88, based on this regression table, is exactly what you would expect: you simply calculate 50 + 88 * 2 = 226. This is your point estimate. The 95% confidence interval around this point estimate is a function of the standard errors of the coefficient estimates of $a=50$ and $b=2$, and the exact formula for that is in any econometrics textbook.
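For reference, the textbook interval the email alludes to follows from the variance of a linear combination of the coefficients (this considers only coefficient uncertainty, matching the email's framing, and uses the made-up numbers above):

$$\operatorname{Var}(\widehat{\text{IQ}}) = \operatorname{Var}(a) + \text{Age}^2\,\operatorname{Var}(b) + 2\,\text{Age}\cdot\operatorname{cov}(a,b) = 10 + 88^2 \cdot 1 + 2 \cdot 88 \cdot 3 = 8282,$$

so the standard error is $\sqrt{8282} \approx 91$ and the 95% interval is roughly $226 \pm 1.96 \times 91$.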
Simulation makes it unnecessary to dig up that textbook. Instead, for 1000 rounds, sim() will come up with 1000 different pairs of $(a,b)$ estimates drawn from the bivariate normal with mean $=[50,2]$ and cov $=\begin{pmatrix}10&3\\3&1\end{pmatrix}$. One such pair might be $(47,1.5)$; another might be $(52,3)$; yet another might be $(10,5)$.
Whatever they are, sim() plugs them into the formula and gives you 1000 different estimates for the IQ. Their average is your point estimate. If you stack them from lowest to highest, the ends of the 95% confidence interval are the 25th value from the bottom and the 25th from the top (the 2.5th and 97.5th percentiles). That's it. That's all that sim() does.
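The recipe in the email can be sketched end to end in a few lines. This is only an illustration using the made-up numbers from the example above, not Zelig's actual implementation; it is written in stdlib-only Python, with a hand-rolled Cholesky factor turning two independent standard normals into a correlated $(a,b)$ pair:

```python
import math
import random

random.seed(0)

# Hypothetical values from the email: means and covariance of (a, b).
mean_a, mean_b = 50.0, 2.0
var_a, var_b, cov_ab = 10.0, 1.0, 3.0
age = 88          # the counterfactual value (what setx() would fix)
n_draws = 1000    # number of simulation rounds

# Cholesky factor L of [[var_a, cov_ab], [cov_ab, var_b]]:
# if z1, z2 are independent standard normals, then
# (mean_a + l11*z1, mean_b + l21*z1 + l22*z2) has the desired covariance.
l11 = math.sqrt(var_a)
l21 = cov_ab / l11
l22 = math.sqrt(var_b - l21 ** 2)

preds = []
for _ in range(n_draws):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    a = mean_a + l11 * z1              # simulated intercept
    b = mean_b + l21 * z1 + l22 * z2   # simulated slope
    preds.append(a + b * age)          # predicted IQ for this (a, b) pair

preds.sort()
point_estimate = sum(preds) / n_draws
ci_low, ci_high = preds[24], preds[-25]  # 25th value from each end
print(point_estimate, ci_low, ci_high)
```

The point estimate lands near $50 + 2 \cdot 88 = 226$, and the interval endpoints are simply order statistics of the simulated predictions; no closed-form variance formula is needed.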

Given the above explanation, can anybody tell me in lay terms what numbers sim() is picking? How are the numbers in that pool generated? I would greatly appreciate it if anyone could shed some light on this.

Best Answer

The function sim() draws numbers at random from a bivariate normal distribution with the specified mean vector and variance-covariance matrix in order to construct confidence intervals around the parameter estimates.

The mean vector and the covariance matrix are determined by the ML estimates of the model parameters, so the draws are simulations of plausible parameter values. Recall that the ML estimates are just the most likely values given the data; because of sampling error, the true values need not coincide with the maximum of the likelihood function -- this is the motivation for constructing confidence sets: to define a region with a high probability of containing the true values.

To understand this topic more fully, you should familiarize yourself with probability density functions in general, then the normal distribution, and then the bivariate normal specifically. The latter extends the normal distribution from the real line to the plane, and so is useful when two random variables can vary together.
