Solved – Should I use an average ECDF

permutation-test, r, sampling, statistical-significance

This relates to a previous question of mine which didn't get many responses, perhaps because it wasn't very clear and well written. I hope this time I will be more accurate and get your much appreciated assistance.

I am analyzing the results of a biological experiment. The results are given as a single value (a non-negative integer) per genomic position. I am interested in valleys, or local minima, over this series of values.

I wish to control the false positive rate and get the significance of each local minimum. I can shuffle the raw data which was used to produce the series.

So what I do is shuffle the raw data, create the new series of values, search for all local minima and keep their values.

Now, I have something like this:

data_set  local_minima_values
=============================
true_data 4 9 1 27 12 0 0 2 5 32 0 1 5 70 2
sim_1 14 25 94 59 32
sim_2 52 0 14 74 82 12 54
...

Note the number of local minima naturally varies between simulations.

So, my idea was to calculate an ECDF for each simulation and then combine those ECDFs into a single "average ECDF" which represents the null hypothesis. Then, I can assign a p-value to each local minimum from the true data, and see how significant ('surprising') it is.

My questions are:

Does this make sense?
How do I create an average ECDF? I can't just merge the values from all simulations together and get an ECDF for this merged set, since the number of minima found in each simulation differs, and I think all simulations should have the same contribution to the average ECDF, or am I wrong?
How should I take the number of simulations (shuffles) into account?

Thanks,

Dave

p.s. I'm working with R.

Best Answer

To average the ECDFs, I'd do something like:

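impute_resolution = 1e3
values_to_impute = seq(
    min(my_data$true_data, na.rm = TRUE)
    , max(my_data$true_data, na.rm = TRUE)
    , length.out = impute_resolution
)

# one row per simulation, one column per evaluation point
n_sims = ncol(my_data) - 1 #assumes column 1 is true_data
ecdfs = matrix(NA, nrow = n_sims, ncol = length(values_to_impute))

for(i in 1:n_sims){
    this_ecdf = ecdf(my_data[,i+1]) #NA padding, if any, is ignored by ecdf()
    ecdfs[i,] = this_ecdf(values_to_impute)
}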

mean_ecdf = colMeans(ecdfs)
plot(
    x = values_to_impute
    , y = mean_ecdf
    , type = 'l'
)
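As a rough sketch of how the averaged ECDF could then be used for the original question, the snippet below takes the minima from the question's table, builds one ECDF per simulation from a list (so simulations may have different numbers of minima), averages them with one equal vote per simulation, and reads off a lower-tail p-value for each true minimum. The names `sims`, `true_minima`, `sim_ecdfs` and `mean_ecdf_at` are made up for illustration, only the two simulations shown in the question are used, and no small-sample correction is applied.

```r
# Local minima from the question's example table (two simulations only)
true_minima <- c(4, 9, 1, 27, 12, 0, 0, 2, 5, 32, 0, 1, 5, 70, 2)
sims <- list(
  sim_1 = c(14, 25, 94, 59, 32),
  sim_2 = c(52, 0, 14, 74, 82, 12, 54)
)

# One ECDF per simulation; a list copes with unequal numbers of minima
sim_ecdfs <- lapply(sims, ecdf)

# Average the ECDFs at the points x, giving each simulation equal weight
# (hypothetical helper, not part of the answer above)
mean_ecdf_at <- function(x) {
  Reduce(`+`, lapply(sim_ecdfs, function(f) f(x))) / length(sim_ecdfs)
}

# Lower-tail p-value: averaged null probability of a minimum this low or lower
p_values <- mean_ecdf_at(true_minima)

# For comparison: pooling all simulated minima weights each simulation
# by the number of minima it produced
pooled_p <- ecdf(unlist(sims))(true_minima)

print(round(cbind(true_minima, p_values, pooled_p), 3))
```

With only a handful of simulations the averaged ECDF is a coarse step function, so the attainable p-values are coarse too; more shuffles give a finer null distribution.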

Related Solutions

Solved – How to look for valleys in a graph

You could use some sort of Monte Carlo approach, using for instance the moving average of your data.

Take a moving average of the data, using a window of a reasonable size (how wide is up to you).

Troughs in your data will (of course) be characterized by a lower average, so now you need to find some "threshold" to define "low".

To do that you randomly swap the values of your data (e.g. using sample()) and recalculate the moving average for your swapped data.

Repeat this last step a reasonably large number of times (>5000) and store all the averages of these trials. So essentially you will have a matrix with 5000 rows, one per trial, each one containing the moving average for that trial.

At this point, for each column you pick the 5% (or 1% or whatever you want) quantile, that is, the value below which only 5% of the means of the randomized data lie.

You now have a "confidence limit" (I'm not sure if that is the correct statistical term) to compare your original data with. If you find a part of your data that is lower than this limit then you can call that a trough.

Of course, bear in mind that neither this nor any other mathematical method can give you any indication of biological significance, although I'm sure you're well aware of that.

EDIT - an example

require(ares) # for the ma (moving average) function

# Some data with peaks and troughs
values <- cos(0.12 * 1:100) + 0.3 * rnorm(100) 
plot(values, t="l")

# Calculate the moving average with a window of 10 points 
mov.avg <- ma(values, 1, 10, FALSE)

numSwaps <- 1000    
mov.avg.swp <- matrix(0, nrow=numSwaps, ncol=length(mov.avg))

# The swapping may take a while, so we display a progress bar 
prog <- txtProgressBar(0, numSwaps, style=3)

for (i in 1:numSwaps)
{
    # Swap the data
    val.swp <- sample(values)
    # Calculate the moving average
    mov.avg.swp[i,] <- ma(val.swp, 1, 10, FALSE)
    setTxtProgressBar(prog, i)
}

# Now find the 1% and 5% quantiles for each column
limits.1 <- apply(mov.avg.swp, 2, quantile, 0.01, na.rm=TRUE)
limits.5 <- apply(mov.avg.swp, 2, quantile, 0.05, na.rm=TRUE)

# Plot the limits
points(limits.5, t="l", col="orange", lwd=2)
points(limits.1, t="l", col="red", lwd=2)

This will just allow you to find the regions graphically, but you can easily find them in code using something along the lines of which(values < limits.5), since a trough is a region that falls below the limit.
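In case the `ares` package is not available, here is a minimal base-R sketch of the same shuffle-and-compare idea. It swaps in `stats::filter` for the `ma` function (so the window alignment may differ slightly from the example above), and it compares the observed moving average, rather than the raw values, against the per-position 5% limit. The helper `moving_average` and the variable names are made up for this illustration.

```r
set.seed(1)  # only to make this sketch reproducible

# Same style of test data as in the example above
values <- cos(0.12 * 1:100) + 0.3 * rnorm(100)

# Centred moving average over k points; positions near the ends are NA
moving_average <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

window <- 10
mov_avg <- moving_average(values, window)

# Moving averages of shuffled data: one row per shuffle, one column per position
n_swaps <- 1000
mov_avg_swp <- t(replicate(n_swaps, moving_average(sample(values), window)))

# Per-position 5% quantile of the shuffled moving averages
limits_5 <- apply(mov_avg_swp, 2, quantile, probs = 0.05, na.rm = TRUE)

# Positions where the observed moving average falls below the limit
# (which() silently skips the NA positions at the ends)
trough_idx <- which(mov_avg < limits_5)
print(trough_idx)
```

Consecutive indices in `trough_idx` can then be grouped into regions, for example with `split(trough_idx, cumsum(c(1, diff(trough_idx) != 1)))`.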

Solved – Significance of average correlation coefficient

A better approach to analysing this data is to use a mixed-model (a.k.a. mixed effects model, hierarchical model) with subject as a random effect (random intercept or random intercept + slope). To summarize a different answer of mine:

This is essentially a regression that models a single overall relationship while allowing that relationship to differ between groups (the human subjects). This approach benefits from partial pooling and uses your data more efficiently.
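As an illustration only, a random-intercept model of this kind might be fitted as below. The data are simulated, the column names `subject`, `x` and `y` are placeholders, and `nlme` (a recommended package that ships with standard R installations) is just one possible choice; `lme4` would be another.

```r
library(nlme)

set.seed(42)  # only to make this sketch reproducible

# Simulated data: 12 subjects, 15 observations each, true overall slope 0.5
n_subjects <- 12
n_obs <- 15
d <- data.frame(
  subject = factor(rep(seq_len(n_subjects), each = n_obs)),
  x = rnorm(n_subjects * n_obs)
)
subject_shift <- rnorm(n_subjects, sd = 1)  # per-subject intercept offsets
d$y <- 0.5 * d$x + subject_shift[as.integer(d$subject)] +
  rnorm(nrow(d), sd = 0.5)

# One overall relationship y ~ x, with a random intercept per subject
fit <- lme(y ~ x, random = ~ 1 | subject, data = d)

# Estimated overall slope, and the fixed-effects table with its test
slope <- fixef(fit)[["x"]]
print(summary(fit)$tTable)
```

Changing the random part to `random = ~ x | subject` would additionally let the slope vary by subject.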
