This confusion between bootstrap procedures and Monte Carlo procedures keeps recurring, so perhaps this is as good a place as any to address it. (The examples of R code may help with the homework, too.)
Consider this implementation of the bootstrap in R:
boot <- function(x, t) {
  # Exact bootstrap of the statistic t on the data x.
  # The sample size must lie between 2 and 7 inclusive.
  n <- length(x)
  if (n < 2 || n > 7) {
    stop("Sample size must lie between 2 and 7; use an approximate method otherwise.")
  }
  p <- c(n, 1:(n-1))       # A cyclic permutation of the array dimensions
  a <- rep(x, n^(n-1))     # Replicate the data to fill n^n cells
  dim(a) <- rep(n, n)      # Shape them into an n-dimensional cube
  y <- as.vector(a)
  while (n > 1) {          # Each permutation contributes one column of resample values
    n <- n-1
    a <- aperm(a, p)
    y <- cbind(as.vector(a), y)
  }
  apply(y, 1, t)           # Apply the statistic to every row (one row per resample)
}
A quick look will confirm that this is a deterministic calculation: no random values are generated or used. (I will leave the details of its inner workings for interested readers to figure out for themselves.)
The arguments to boot are a batch of numeric data in the array x and a reference t to a function (one that can be applied to arrays exactly like x) returning a single numeric value; in other words, t is a statistic. boot generates all possible samples with replacement from x and applies t to each of them, thereby producing one number for each such sample: that's the bootstrap in a nutshell. The return value is an array representing the exact bootstrap distribution of t for the sample x.
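For readers who would like a more transparent (if less elegant) version of the same enumeration, it can be sketched with expand.grid. This is my own illustrative code, not part of boot, and the name exact_boot is made up for the sketch:
# Transparent but memory-hungry: enumerate all n^n index combinations directly.
# exact_boot is an illustrative helper, not part of the boot function above.
exact_boot <- function(x, t) {
  idx <- expand.grid(rep(list(seq_along(x)), length(x)))  # every resample's indices
  apply(idx, 1, function(i) t(x[i]))                      # statistic on each resample
}
exact_boot(c(1, 3), mean)  # the same four values as boot, possibly in another order
It produces the same exact bootstrap distribution as boot, only with the resamples possibly listed in a different order.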
As a tiny example, let's bootstrap the mean for the sample x = c(1,3):
> boot(c(1,3), mean)
[1] 1 2 2 3
There are indeed four possible samples of size $2$ with replacement from $(1,3)$; namely, $(1,1)$, $(1,3)$, $(3,1)$, and $(3,3)$. boot generates them all (in the order just listed) and applies t to each of them. In this case t computes the mean, and those means turn out to be $1$, $2$, $2$, and $3$, respectively, as shown in the output.
Where you go from here depends on how you want to use the bootstrap. The full information about the bootstrap is contained in this output array, so it's usually a good idea to display it. Here is an example where the standard deviation is bootstrapped from the sample $(1,3,3,4,7)$:
hist(boot(c(1,3,3,4,7), sd))
Now we are prepared to talk about Monte Carlo simulation. Suppose, say, we were going to bootstrap a 95% upper confidence limit (UCL) on the SD from a sample of size $5$ by using the upper 95th percentile of its bootstrap distribution. What properties would this procedure have? One way to find out is to suppose the sample were obtained randomly from, say, a uniform distribution. (The application will often indicate what a reasonable distributional assumption may be; here, I arbitrarily chose one that is simple for computation but not easy to deal with analytically.) We can simulate what happens by taking such a sample and computing the UCL:
> set.seed(17)
> quantile(boot(runif(5, min=0, max=10), sd), .95)[1]
95%
3.835870
The result for this particular random sample is 3.83587. This is definite: were you to call boot again with the same set of data, the answer would be exactly the same. But how might the answer change with different random samples? Find out by repeating this process a hundred times and drawing a histogram of the results:
> boot.sd <- replicate(100, quantile(boot(runif(5, min=0, max=10), sd), .95)[1])
> hist(boot.sd)
Were we to do another set of simulations, the random draws would come out different, creating a (slightly) different histogram--but not very different from this one. We can use it with some confidence to understand how the bootstrap UCL of the SD is working. For reference, notice that the standard deviation of a uniform distribution (spanning the range from $0$ to $10$ as specified here) equals $10/\sqrt{12} \approx 2.887$. As one would hope for any UCL worth its salt, the majority (three-quarters, or 0.75) of the values in the histogram exceed this:
> length(boot.sd[boot.sd >= 10/sqrt(12)]) / length(boot.sd)
[1] 0.75
But that's nowhere near the nominal 95% we specified (and were hoping for)! This is one value of simulation: it compares our hopes to what is really going on. (Why the discrepancy? I believe it is because bootstrapping an SD does not work well with really small samples.)
Review
Bootstrap statistics are conceptually just the same as any other statistic, like a mean or standard deviation; they just tend to take a long time to compute. (See the warning message in the boot code!)
Monte-Carlo simulation can be useful for studying how a bootstrap statistic varies due to randomness in obtaining samples. The variation observed in such simulation is due to variation in the samples, not variation in the bootstrap.
Because bootstrap statistics can take a lot of computation (up to $n^n$ calculations for samples of size $n$), it is convenient to approximate the bootstrap distribution. This is usually done by creating a "black box" program to obtain one value at random from the true bootstrap distribution and calling that program repeatedly. The collective output approximates the exact distribution. The approximation can vary due to randomness in the black box--but that variation is an artifact of the approximation procedure. It is not (conceptually) inherent in the bootstrap procedure itself.
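Such a "black box" can be sketched in a few lines of R using the usual resampling idiom; the helper name approx_boot is mine, for illustration only:
# Approximate bootstrap: draw B resamples at random instead of
# enumerating all n^n of them (approx_boot is an illustrative name).
approx_boot <- function(x, t, B = 1000) {
  replicate(B, t(sample(x, length(x), replace = TRUE)))
}
set.seed(17)
hist(approx_boot(c(1, 3, 3, 4, 7), sd))  # approximates hist(boot(c(1,3,3,4,7), sd))
Each call to sample draws one value from the true bootstrap distribution; replicating it B times approximates the exact distribution, with the approximation error shrinking as B grows.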
Best Answer
Your "toy" problem (opportunity) arises naturally in real life when companies need to make sufficient capacity available to deal with possible extraordinary demand. I have been involved with a number of communications and power companies in this regard; hence the historical and ongoing evolution of AUTOBOX to meet critical planning/forecasting requirements, including incorporating the uncertainty in user-specified predictor series that need to be forecasted and used in a SARMAX model: https://autobox.com/pdfs/SARMAX.pdf
At the heart of the issue is a forecasting problem. Your approach implicitly assumed 1100 independent values with a constant mean and some (many) one-time pulses. In general these 1100 observations may be serially related, so the correct forecasting model may be something other than white noise after the spikes/pulses have been removed.
You say: "Let's say during my play, at about attempt 1100, I wanted to know how much longer I need to play in order to beat the game. How could I estimate when one of my performance spikes would go over a certain threshold, in this case 60 seconds?"
I say: this is unanswerable as posed, because you have not specified a level of confidence. What is answerable is "what is the probability of exceeding a specific threshold value" for any future period (trial number). To answer it, one needs to predict the probability density function for each period in the future and examine it to determine the probability of exceeding the threshold value. Essentially, you select the level of confidence, obtain the forecast for that period, compare it to your aforementioned critical value (say 60), and determine whether the threshold value has been crossed at that level of confidence.
You say: "My intuition would say to filter out the spikes as noise and then model the resulting series."
I say: you need to filter out the spikes and then model the resulting (adjusted) series to obtain a prediction based on the evidenced recursive relationships (the signal), yielding an adequate noise series. A distribution of possible values (allowing for spikes) can then be produced for each forecasted period in the future.
You say: "Then at each period, make a random draw from the distribution of the noise that I filtered away to simulate that noise. Then, use Monte Carlo simulation to see where the density of passing that threshold is high and report a range subjectively from that Monte Carlo simulation."
I say: then at each period, make a random draw from the probability density function predicted for that future period, based on the deterministically adjusted series. Then review these Monte Carlo simulations to see where the density of passing the threshold lies, and report that probability.
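As a minimal illustration of that last step in R (this is my own sketch, not AUTOBOX output; the Normal predictive distribution and its mean and sd are hypothetical placeholders, not estimates from the actual 1100 observations):
# Hypothetical predictive draws for one future period; the mean and sd
# are made-up placeholders, not fitted values from the data.
set.seed(1)
draws <- rnorm(1000, mean = 45, sd = 8)
mean(draws > 60)  # estimated probability of exceeding the 60-second threshold
In practice the draws would come from the model's predicted density for each future period, so the exceedance probability can change from period to period.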
Your approach used all 1100 observations as the basis for the simulation, assuming that the distribution of the 1100 had one and only one mean. I say that after adjusting for the spikes, observations 1-389 had one mean and observations 390-1100 had a significantly different mean; thus only the last 701 values should be used. The two means differed by 1.8868 (see the coefficient for the level/step shift below).
With that said, I now report the results of using AUTOBOX to analyze your 1100 observations.
Your 1100 observations yielded an ARIMA model (a slight adjustment for memory) along with a level shift and a number of spikes. Here are the actuals, fit, and forecast for the next 50 periods (trials), showing 95% prediction limits for the forecasting horizon 1101-1150.
The identified model is here and here. The residual plot is here, showing the effect of memory, a constant, a level shift, and numerous spikes/pulses, and suggesting an adequate extraction of the noise.
The forecasting equation is then used to obtain 1000 simulations for each future period, explicitly allowing for spikes/pulses to be present while incorporating the changing uncertainty as we go further into the future (not really important for your data, as you have no trends, heavy autoregressive memory, or seasonal pulses). Here are the histograms of the 1000 Monte Carlo simulations for period 1101, period 1102, and period 1150.
I would rate your intuition as "very high", and your professor will be overjoyed by your findings. Still, you didn't consider the possible time-series forecasting complications, possible spikes in the future, and the need to incorporate uncertainties in possible user-specified predictor series. There were few time-series complications here, as the lag-3 effect (0.0994) is possibly/probably spurious and certainly small. Additionally, you ignored the shift in the mean as you got better with more experience after 390 tries. That would have been a bias in your approach, as you adjusted only for the one-time anomalies (spikes) and ignored the statistically significant sequential "spikes" (read: the level/step shift) starting at period 391. N.B. The level/step shift is now "visually obvious" after being pointed out by analytics with "sharper eyes".
Finally, here is a picture of the 1000 simulations for forecast period 1150.