Solved – Should I use an average ECDF

permutation-testrsamplingstatistical significance

This relates to a previous question of mine which didn't gain many responses, perhaps because it wasn't very clear nd well written. I hope this time I will be more accurate and get your much appreciated assistance.

I am analyzing results of a biological experiment. The results given as a single value ( non-negative integer) per genomic position. I am interested in valleys, or local minimam over this series of values.

I wish to control the false positives rate and get the significance for each local minima. I can shuffle the raw data which was used to produce the data.

So what I do is to shuffle the raw data, create the new series of values,search for all local minima and keep their values.

Now, I have something like this:

data_set  local_minima_values
=============================
true_data 4 9 1 27 12 0 0 2 5 32 0 1 5 70 2
sim_1 14 25 94 59 32
sim_2 52 0 14 74 82 12 54
...

Note the number of local minima naturally varies between simulations.

So, my idea was to calculate an ECDF for each simulation and then combine those ECDFs into a single "average ECDF" which represents the null hypothesis. Then, I can assign a p-value for each local minima from the true data, and see how significant ('surprising') it is.

My questions are:

  1. Does this make sense?
  2. How do I create an average ECDF? I can't just merge the values from all simulation together and get and ECDF for this merged set, since the number of minima found in each simulation differs, and I think all simulations should have the same contribution to the average ECDF, or am I wrong?
  3. How should I take the number of simulations (shuffles) into account?

Thanks,

Dave

p.s. I'm working with R.

Best Answer

To average the ECDFs, I'd do something like:

impute_resolution = 1e3
values_to_impute = seq(
    min(my_data$true_data)
    , max(my_data$true_data)
    , length.out = impute_resoluton
)

ecdfs = matrix(NA,nrow=length(values_to_impute))

for(i in 1:(ncol(my_data)-1)){ #assumes column 1 is true_data
    this_ecdf = ecdf(my_data[,i+1])
    ecdfs[i,] = this_ecdf(values_to_impute)
}

mean_ecdf = colMeans(ecdfs)
plot(
    x = values_to_impute
    , y = mean_ecdf
    , type = 'l'
)