This relates to a previous question of mine which didn't get many responses, perhaps because it wasn't very clear and well written. I hope this time I will be more precise and get your much appreciated assistance.
I am analyzing the results of a biological experiment. The results are given as a single value (a non-negative integer) per genomic position. I am interested in valleys, or local minima, in this series of values.
I wish to control the false positive rate and get the significance of each local minimum. I can shuffle the raw data which was used to produce the series.
So what I do is shuffle the raw data, create the new series of values, search for all local minima and keep their values.
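For concreteness, the minima-finding step could look roughly like the sketch below. The helper name `find_local_minima` and the example series are made up for illustration, and it assumes a local minimum is a strict interior one (lower than both neighbours), so flat stretches of equal values are not reported.

```r
# Hypothetical helper: indices of strict interior local minima of a numeric vector.
# Assumption: a minimum must be strictly lower than both neighbours (plateaus are skipped).
find_local_minima <- function(x) {
  if (length(x) < 3) return(integer(0))
  d <- diff(x)
  # position i + 1 is a minimum when the series falls into it and rises out of it
  which(d[-length(d)] < 0 & d[-1] > 0) + 1L
}

values <- c(5, 3, 8, 2, 2, 7, 1, 4)  # made-up example series
idx <- find_local_minima(values)
minima_values <- values[idx]         # the values kept for each (real or shuffled) series
```

For count data with many ties, the plateau rule probably needs a deliberate choice, since it changes how many minima each series yields.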
Now, I have something like this:
data_set    local_minima_values
===============================
true_data   4 9 1 27 12 0 0 2 5 32 0 1 5 70 2
sim_1       14 25 94 59 32
sim_2       52 0 14 74 82 12 54
...
Note the number of local minima naturally varies between simulations.
So, my idea was to calculate an ECDF for each simulation and then combine those ECDFs into a single "average ECDF" which represents the null hypothesis. Then I can assign a p-value to each local minimum from the true data and see how significant ('surprising') it is.
My questions are:
- Does this make sense?
- How do I create an average ECDF? I can't just merge the values from all simulations together and compute an ECDF for this merged set, since the number of minima found in each simulation differs, and I think all simulations should contribute equally to the average ECDF. Or am I wrong?
- How should I take the number of simulations (shuffles) into account?
Thanks,
Dave
p.s. I'm working with R.
Best Answer
To average the ECDFs, I'd do something like:
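A minimal sketch of one way to do this in R, assuming the simulated minima are stored as a list with one numeric vector per shuffle. The names `sims`, `avg_ecdf`, `true_minima` and `p_lower` are illustrative, the numbers are the example values from the question, and treating small values as the surprising direction (a lower-tail p-value) is an assumption based on the interest in valleys.

```r
# One numeric vector of local-minimum values per shuffled data set
sims <- list(
  sim_1 = c(14, 25, 94, 59, 32),
  sim_2 = c(52, 0, 14, 74, 82, 12, 54)
)

# One ECDF per simulation (stats::ecdf returns a step function)
ecdfs <- lapply(sims, ecdf)

# Average ECDF: evaluate every per-simulation ECDF at q and take the mean,
# so each simulation gets weight 1 / length(sims) however many minima it has
avg_ecdf <- function(q) {
  Reduce(`+`, lapply(ecdfs, function(f) f(q))) / length(ecdfs)
}

true_minima <- c(4, 9, 1, 27, 12, 0, 0, 2, 5, 32, 0, 1, 5, 70, 2)

# Lower-tail value: average fraction of null minima that are <= the observed one
p_lower <- avg_ecdf(true_minima)
```

Pooling all simulated values into a single `ecdf()` call would instead weight each simulation by its number of minima, which is the difference the question asks about. With only a few shuffles the averaged ECDF is coarse, so the smallest attainable p-values are limited by how many simulations are run.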