Solved – Difference between Monte Carlo based analysis and a hypothesis test
hypothesis-testing, monte-carlo, resampling
Is it reasonable to use Monte Carlo methods to resample a dataset of weekly rainfall amounts to statistically test for a difference between two time series? That is: randomly pull ~30 paired observations for two gauges, conduct a test of differences (e.g., a sign test), repeat 1000 times, and count the number of times the p-value is less than 0.05. Power analysis would suggest that if 80 percent of the p-values are less than 0.05, the test has adequate power to detect a true difference. However, I don't think this is the same as accepting the null unless more than 80 percent of the results have a p-value less than 0.05, is it? The datasets have large sample sizes and the observations are serially correlated, so I am trying to remove that correlation and test for differences at reasonable sample sizes. Perhaps there is a better/simpler method?
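(For concreteness, a minimal sketch of the procedure just described, with hypothetical vectors gauge1 and gauge2 standing in for the paired weekly series:)
set.seed(1)
gauge1 <- rnorm(500, 20, 5)  # hypothetical paired weekly rainfall series
gauge2 <- rnorm(500, 22, 5)
reject <- replicate(1000, {
  idx <- sample(length(gauge1), 30)  # random subsample of ~30 paired weeks
  d <- gauge1[idx] - gauge2[idx]
  binom.test(sum(d > 0), sum(d != 0))$p.value < 0.05  # sign test on the differences
})
mean(reject)  # proportion of the 1000 subsamples with p < 0.05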
Related Solutions
This is a Monte Carlo simulation, though I doubt it's doing what you want, and it really is of no practical use. What you are comparing is 10,000 single-observation studies, determining in what proportion of them one observed value is higher than the other. So it's probably better to conceptualize your code as the following, less efficient, code:
runs <- 10000
result <- logical(runs)
for (i in 1:runs) {
  # draw one observation from each of two identical populations
  a <- rnorm(1, mean = 0)
  b <- rnorm(1, mean = 0)
  result[i] <- a > b
}
mc.p.value <- mean(result)  # proportion of draws in which a exceeded b
The above code, when the distributions are equal, should show that $a$ is higher than $b$ 50% of the time, but this type of result is essentially useless in practice because you would only have $N=2$, and statistical inferences are not applicable (i.e., there is no variance within each group available to quantify sampling uncertainty).
What you are missing is comparing two summary statistics of interest to determine their sampling properties, which is generally the topic of statistical inference and requires at least a minimum number of data points to quantify some form of sampling uncertainty.
As it stands, this is a standard independent $t$-test set-up. So you could compare a number of things, such as how often the first group mean is higher than the second, how often a $t$-test gives a $p < \alpha$ result (or, analogously, how often the $|t|$ ratio exceeds the cut-off), and so on.
For example, if $n = 20$ for each group and the distributions are equal in the population, then
runs <- 10000
n <- 20       # observations per group
alpha <- .05
result.p <- result.mean <- numeric(runs)
for (i in 1:runs) {
  # two samples of size n from the same population
  a <- rnorm(n, mean = 0)
  b <- rnorm(n, mean = 0)
  result.mean[i] <- mean(a) > mean(b)
  result.p[i] <- t.test(a, b)$p.value
}
mc.p.value <- mean(result.p < alpha)   # empirical Type I error rate
mc.a_gt_b.value <- mean(result.mean)   # how often mean(a) exceeds mean(b)
Playing around with the other parameters, such as setting the first group's mean to 1, changes the nature of the simulation, turning it from the Type I error simulation above into a power study.
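For instance, a minimal sketch of that power variant, changing only the first group's mean:
runs <- 10000
n <- 20
alpha <- .05
result.p <- numeric(runs)
for (i in 1:runs) {
  a <- rnorm(n, mean = 1)  # first group mean moved from 0 to 1
  b <- rnorm(n, mean = 0)
  result.p[i] <- t.test(a, b)$p.value
}
mean(result.p < alpha)  # now estimates power rather than the Type I error rate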
I agree with @AndyW that a key consideration in a power and sample size determination is deciding how big a difference between the two means would be of practical importance.
A more difficult ingredient in such a determination is to estimate the variances of the responses. Maybe you can get a clue about the variances by looking at prior studies using similar methods.
You also need to know the significance level of the test you will use: often 5%, sometimes in medical studies 1% or smaller. And you need to have a good idea what power the test needs to have. Often people want at least 80% or 90% probability of detecting an effect of the desired size, if real.
Do not be dismayed that some of the necessary inputs into a power and sample size determination are inevitably guesses. Such a determination based on reasonable guesses is almost always much better than none at all.
Suppose you want 85% power for a two-sample t test at the 5% level to detect a difference in means that is 5 units, when the standard deviation of the observations may be 10 units. Many statistical software programs have 'power and sample size' procedures for balanced studies (equal sample sizes in the two groups).
Below is printout from a recent release of Minitab that includes the situation I described in the previous paragraph.
Power and Sample Size
2-Sample t Test
Testing mean 1 = mean 2 (versus ≠)
Calculating power for mean 1 = mean 2 + difference
α = 0.05 Assumed standard deviation = 10
                Sample  Target
    Difference    Size   Power  Actual Power
             5      64    0.80      0.801460
             5      73    0.85      0.850968
             5      86    0.90      0.903230
Power and sample size procedures are available in many statistical software programs for many of the most common procedures.
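In R, for instance, the built-in power.t.test function (in the stats package) handles this balanced two-sample case; a quick check of the 85%-power row above:
# two-sided, two-sample t test: detect a difference of 5 units when
# sigma = 10, at the 5% level, with 85% power
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.85,
             type = "two.sample", alternative = "two.sided")
# reports n of roughly 73 per group, matching the Minitab target above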
Some procedures require simulation. For example, if you know ahead of time that one of the two groups has a larger variance, so that you'll need a Welch t test, you may need simulation. Also, most software assumes equal sample sizes in the two groups; if financial constraints require one group to be smaller than the other, you'd probably need simulation as well.
The simulation in R below addresses the situation where one of the two groups in a two-sample t test has SD $9$ and the other has SD $11.$ With $70$ observations in each group the power is about 83%, even using a Welch test to accommodate the slightly different group standard deviations.
set.seed(2022)
# 100,000 simulated Welch t tests: n = 70 per group, true means 50 and 55,
# group SDs 9 and 11 (t.test uses the Welch correction by default)
pv <- replicate(10^5,
                t.test(rnorm(70, 50, 9),
                       rnorm(70, 55, 11))$p.val)
mean(pv <= 0.05)
[1] 0.83177   # approximate power
Best Answer
You are essentially describing a bootstrap approach to the problem. Indeed, resampling rows from the data you have collected gives you a robust method of calculating a confidence interval for a desired effect, whether a mean difference, a rank-based statistic, or a p-value resulting from such a test.
As you know, you can perform inference by inspecting whether a 95% confidence interval for an effect contains the null-hypothesized value. In this case, you are basically concerned with whether the mean difference between your two time series is 0.
It turns out that calculating mean differences with bootstrapped data, creating a 95% confidence interval, and inspecting whether this interval includes 0 or not is an asymptotically equivalent approach to conducting the t-test (which you have called a hypothesis test in your problem description).
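As a minimal sketch in base R (with hypothetical vectors gauge1 and gauge2 holding the paired weekly values), a percentile bootstrap for the mean difference looks like this:
set.seed(1)
gauge1 <- rnorm(100, mean = 20, sd = 5)  # hypothetical paired weekly series
gauge2 <- rnorm(100, mean = 22, sd = 5)
B <- 10000
boot.diff <- replicate(B, {
  idx <- sample(length(gauge1), replace = TRUE)  # resample paired rows
  mean(gauge1[idx] - gauge2[idx])                # mean difference in the resample
})
quantile(boot.diff, c(.025, .975))  # 95% percentile CI; check whether it covers 0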
The bootstrap approach has desirable properties: in particular, it makes essentially no distributional assumptions, at the cost of somewhat poorer efficiency and some bias in very small samples.
So depending on the sample size, I would consider the bootstrap if you have more than, say, 50 observations in each time series. Otherwise, a regular t-test will probably be more efficient and is a reasonable test.