Solved – Understanding simulation in the secretary problem

secretary-problemself-studysimulation

I'm working on the following problem:

When applying for a secretarial position, $n$ candidates line up in random order for interviews, where the manager after each interview has to decide to take or decline that candidate. His assessment is qualitative in the sense that at interview $k\geq 2$, he observes only $Y_{k,n}$, the indicator that candidate $k$ is better than candidates $1,\ldots,k−1,$ and his goal is to find a strategy whose probability of selecting the best candidate is close to the maximal value $p_n^*$. It is easy to see that the optimal strategy is to select candidate $$\tau_{k,n}\stackrel{def}{=}\inf\{l\geq k: Y_{l,n}\} = 1\}\land n $$ for some $k\in\{2,\ldots,n\}$. Let $k_n^*$ denote the optimal such $k$. Plot simulation estimates of $k_n^*$ and $p_n^*$ for $n = 5,\ldots, 50,$ and compare with the known asymptotics $k_n^*\sim n/e, p_n^*\sim 1/e$.

Question: Why would you choose to simulate $k_n^*$? And if you decided to do so; how? The fact that you could simulate $k_n^*$ indicates that $k_n^*$ has some kind of distribution. I don't get why since $k_n^*$ seems (I'm apparently mistaken) like a deterministic value to me.

The secretary problem is described here. For an arbitrary $k$, the probability that the best applicant is selected is equal to $$P(k) = \sum_{i = 1}^n P(\text{applicant $i$ is selected$\cap$applicant $i$ is the best})$$ If we fast forward through some derivations we find that $$P(k) = \sum_{i=r}^n\dfrac{r-1}{i -1}\times \dfrac{1}{n} = \dfrac{r-1}{n}\sum_{i = r}^n\dfrac{1}{i-1}$$ This means that $k_n^*$ can be found by just checking for which $k$ the value $P(k)$ is maximal.

Edit: The only thing I can think of right now is that you might want to simulate per case which $k$ would give you the optimal candidate. But then again, we would know which $k$ that would be; if the optimal candidate were on the $m$-th place, we would want to skip the first $m-1$ candidates, so I don't see what simulation would bring to the table.

Best Answer

Why Simulate?

Quick research into the problem will reveal that the optimal algorithm to solve this problem is to rank the first $\frac{1}{e}$ applicants, not hire them, and then choose the first candidate who is better than your previous ones. For example, if there are ten candidates each with unique rankings, you might observe the following sequence.

$$2 \quad 3 \quad 6 \quad 5 \quad 4 \quad 8 \quad 9 \quad 10 \quad 1 \quad 7$$

We observe that in the first $\lfloor \frac{1}{e} \rfloor$ candidates, or first 3 candidates, the max score was a six. So, we check the fourth candidate and observe a 5, so we don't hire. Then we observe a 4, so we don't hire. Then we observe a 9, so he hire them and halt the interview process. In this case, we didn't select the optimal candidate, but we were close.

But what if we don't know this algorithm and we want to determine which $k$ is best? It isn't immediately intuitive that $k \approx \frac{1}{e}$ is the best number of candidates to skip and so simulating can assist us in finding for which $k$ our probabilility of picking the best candidate is maximized. In the general case, simulations aid in intuition. If we can guess an optimal solution, or perhaps even prove an optimal solution through mathematics, then simulation helps to verify our results.

How would you simulate?

This is perhaps my favorite part. I will be using R to write this simulation and go through the steps on how to do this simulation.

We will start with a single simulation for arbitrary $k$, where we want to create 1000 candidates each with distinct rankings.

x <- sample(1:1000, 1000)

Now that we have our sample, we want to find the best candidate among the first $k$ candidates, which can be done with

init_best <- max(x[1:k])

Next, we begin by looping through the remaining candidates until we find one who is better than the best of our first $k$ candidates. Note that we only simuluate up to $n-1$ candidates for $k$, because it doesn't make sense to skip every candidate as that always results in no hire.

for (i in (k+1):999) {
    if (x[i] > init_best) {
        candidate_score <- x[i]
        candidate_num <- i
        break
    }
}

So, from here, we have recorded the candidates score. We can quickly verify if this candidate is the best candidate by checking if their ranking is the max ranking. Since we have 1000 candidates, we do this with

if (candidate_score == 1000)
    success = success + 1

That is, we record that we successfully chose the best candidate and increment some value that keeps track of that. Using these, we can write a few loops to run this simulation several times which is shown below

sims <- 10000
p <- c()
p[1] <- 0
cand_ave <- c()
cand_ave[1] <- 0
for (k in 2:999) {
  success <- 0
  for (n in 1:sims) {
    x <- sample(1:1000, 1000)
    init_best <- max(x[1:k])
    candidate_score = -1
    for (i in (k+1):1000) {
      if (x[i] > init_best) {
        candidate_score <- x[i]
        break
      }
    }
    if (candidate_score == 1000)
      success <- success + 1
  }
  p[k] <- success/sims
  cat(k, " complete \n")
}
plot(1:999, p, type = "l", xlab = "Candidates Skipped",
     ylab = "Probability of selecting Best Candidate",
     main = "The Secretary Problem")

So, we create a collection of probabilities so that later on, we can plot these values. We also initialize candidate_score at -1 before each loop so as to signify the case where we end up not hiring anyone. After running this simulation 10000 times for each $k$, the results are as follows

Within this simulation, we find that the $k$ value(s) that maximizes $p$ is $k = 332$ or $k = 374$ with $p = .3789$.

We know that the asymptotic $k_n^*$ and $p_n^*$ values are $\dfrac{n}{e}$ and $\dfrac{1}{e}$ respectively and these results are fairly close to those values which would be in this case $$k_{1000} \approx 369 \quad \quad p_{1000} \approx .369$$

So the simulation approximately confirms the asymptotic values.

Best Answer

Why Simulate?

How would you simulate?

Related Solutions

Solved – Bootstrapping unbalanced clustered data (non-parametric bootstrap)

Related Question