Solved – Multiple testing in correlation analysis over time periods

correlation, multiple-comparisons, time-series

I have one variable measured once per time interval (say, once per year), and another variable measured periodically (say, once per day). The periodic measurements are autocorrelated.

I am interested in finding statistically significant correlations between the first variable and aggregates of the second variable over time periods. For example, the second variable could be aggregated as the arithmetic mean of observations over weeks 3-4. This setting is relevant, for instance, in tree growth analysis, where the first variable could be the annual ring-width increment and the second the daily temperatures.

I was wondering how to account for multiple testing in this situation. I am aware of previous discussions focusing on permutation testing:

Look and you shall find (a correlation);
Permutation test for multiple correlation test statistics.

As I understand (How to choose the test statistic for permutation test?), in permutation testing one wants to break the relationship between the potentially correlated variables, in my case between the yearly observations and the aggregates of the daily observations. Breaking that relation would not simulate independent trials with respect to the different time windows of aggregation. If I instead break the autocorrelation in the daily observations, then the very concept of correlation over a time window is lost. Any advice?

Here is a toy example in R. I measure correlations between the first variable and an aggregate of the second variable over a time window. I would like to test which of the correlations in C are significant.

# initialize
t <- 7      # length of each time series
n <- 100    # number of observations
level <- 10 # magnitude of the random starting point

# generate input data: n random walks of length t
X <- matrix(NA, n, t)
for (j in 1:n)
{
  x <- rnorm(1) * level        # starting point of the series
  for (i in 1:(t - 1))
  {
    x <- c(x, x[i] + rnorm(1)) # random-walk step
  }
  X[j, ] <- x                  # collect the series
}

# define target variable: y is driven by the mean of columns 3:4 plus noise,
# so the true signal sits in the window [3, 4]
y <- rowMeans(X[, 3:4]) + rnorm(n)

# compute correlations of y with the mean of X over every window [start, finish]
C <- matrix(0, t, t)
for (start in 1:t)
{
  for (finish in start:t)
  {
    # drop = FALSE keeps a single column as a matrix, so rowMeans covers both cases
    xnow <- rowMeans(X[, start:finish, drop = FALSE])
    C[start, finish] <- cor(xnow, y)
  }
}

# resulting correlations (upper triangle: row = window start, column = window end)
print(C)

Best Answer

In the scenario depicted in your example, you could permute the $Y$'s. But note that this only works because the $Y$'s themselves have no temporal dependence on each other once you condition on the average of the $X$.
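For a single, pre-specified window this is straightforward. A minimal sketch on the toy data from the question (the window choice [3, 4] and the 999 permutations are illustrative, not part of the original post):

# permutation p-value for one fixed window, here columns 3:4 of X
x34 <- rowMeans(X[, 3:4])                        # window-mean feature
r_obs <- cor(x34, y)                             # observed correlation
r_null <- replicate(999, cor(x34, sample(y)))    # permuting y breaks the X-y link
(1 + sum(abs(r_null) >= abs(r_obs))) / (1 + 999) # two-sided p-value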

In general, with time series, you need to make assumptions about how the correlations manifest in $X$ and $Y$, because it is these assumptions that tell you what the "independent units" are to resample. If $Y$ were an AR(1) process (so it has a memory of its previous value, beyond what the $X$ dictate), then simple permutation of $Y$ would not work. If you think this is the case, I would suggest consulting a book on bootstrapping and permutation; there are well-developed methods for bootstrapping time series that I am not familiar enough with to comment on here. One good book is "Bootstrap Methods and Their Application" by A. C. Davison and D. V. Hinkley.
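As one pointer in that direction, here is a minimal sketch of a moving-block bootstrap via boot::tsboot (the boot package accompanies that book); the AR coefficient 0.7, the block length l = 5, and R = 999 are arbitrary illustrative choices:

library(boot)   # companion package to Davison & Hinkley

# an illustrative AR(1) series, for which plain permutation would be invalid
y_ar1 <- arima.sim(model = list(ar = 0.7), n = 100)

# statistic recomputed on each bootstrap replicate: the lag-1 autocorrelation
acf1 <- function(ts) acf(ts, lag.max = 1, plot = FALSE)$acf[2]

# resampling contiguous blocks of length 5 preserves short-range dependence
b <- tsboot(y_ar1, statistic = acf1, R = 999, l = 5, sim = "fixed")
b$t0      # statistic on the original series
sd(b$t)   # bootstrap standard error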

In any case, if you believe that your $Y$ really are independent realizations (at least given some feature in your model), I would recommend a blocked procedure, where you loop through your features for a fixed permutation of $Y$:

  1. Form your features $X_1, X_2, \dotsc, X_m$.
  2. For $p = 1, \dotsc, P$:

    a. Permute $Y$.

    b. For each feature $X_1, \dotsc, X_m$, calculate your favorite measure of association, e.g., $R^2$, Kendall's $\tau$, whatever.

  3. Tabulate your association statistics. If you want to control the probability of falsely finding any one of the features to be associated (the family-wise error rate), then within each permutation take the maximum association across features, and compare each observed statistic against the permutation distribution of that maximum; a sketch follows this list.
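A minimal sketch of this max-statistic procedure on the toy example from the question (the helper window_means, the number of permutations P = 999, and the name p_fwer are illustrative choices, not part of the original post):

# build one column per window-mean feature of X
window_means <- function(X) {
  t <- ncol(X)
  feats <- list()
  for (start in 1:t)
    for (finish in start:t)
      feats[[paste(start, finish, sep = "-")]] <-
        rowMeans(X[, start:finish, drop = FALSE])
  do.call(cbind, feats)
}

W <- window_means(X)        # n x m matrix, one column per window
r_obs <- abs(cor(W, y))     # observed |correlation| for each window

# null distribution of the maximum |correlation| across all windows:
# permuting y as a whole breaks the X-y link for every window at once
P <- 999
max_null <- replicate(P, max(abs(cor(W, sample(y)))))

# family-wise adjusted p-value per window
p_fwer <- sapply(r_obs, function(r) (1 + sum(max_null >= r)) / (1 + P))
names(p_fwer) <- colnames(W)
sort(p_fwer)[1:5]   # windows overlapping columns 3:4 should rank highest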

Methodological Comment

Also, I should comment that if your goal really is to identify time periods that are significantly correlated with the target variable, there are other approaches besides this one that you should research. In particular, smoothing splines and wavelet bases for $X$ are two areas I would investigate. There is a whole host of literature under the rubric of "functional linear regression."
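To make the spline idea concrete, a rough sketch of the flavor of these approaches (the basis size df = 4 is an arbitrary illustrative choice): compress each daily series onto a small B-spline basis over the time index and regress $y$ on the resulting scores, rather than testing every window separately.

library(splines)

# B-spline basis evaluated at the time points 1..t
B <- bs(1:ncol(X), df = 4)

# per-series basis scores: each row of X is summarized by 4 coefficients
scores <- X %*% B

# one regression on smooth summaries instead of many window-wise tests
summary(lm(y ~ scores))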
