Solved – How to get p-value by using ecdf and bootstrapping

bootstrap, empirical-cumulative-distr-fn, r, statistical-significance

First of all, forgive me for my ignorance about this concept. I might be asking a basic question, but what I have read in "Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R", "How to calculate cumulative distribution in R?" and "Distribution empirical in R" did not guide me to my aim.

I have 248 genes, each with an enrichment measure. By sorting these enrichment measures, we were able to assign an enrichment score to specific intervals of the data. I also have protein complexes that contain these genes. We thought that by adding up the enrichment scores of its genes we could generate an enrichment score for a protein complex. For statistical validation, I would like to randomly generate 1000 protein complexes (for each unique protein complex, I will change the number of genes to match its size) and compare those random scores with our score.

 In descending order
 Top +25 --- +2
  26- 51 --- +1
  51-150 ---  0
 151-200 --- -1
 200-248 --- -2
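The scoring scheme above can be sketched in R as follows. This is only an illustration: the `enrichment` vector is simulated, the variable names are hypothetical, and because the listed intervals overlap at ranks 51 and 200, the bin edges 1-25, 26-50, 51-150, 151-200 and 201-248 are an assumption.

```r
# Simulated enrichment measures for 248 genes (stand-in data, not real values)
set.seed(1)
enrichment <- rnorm(248)
names(enrichment) <- paste0("gene", seq_along(enrichment))

# Rank 1 = highest enrichment, i.e. descending order
gene_rank <- rank(-enrichment, ties.method = "first")

# Assumed bin edges: (0,25], (25,50], (50,150], (150,200], (200,248]
score_levels <- c(2, 1, 0, -1, -2)
bin <- cut(gene_rank, breaks = c(0, 25, 50, 150, 200, 248), labels = FALSE)

# Per-gene enrichment score, named by gene
gene_score <- score_levels[bin]
names(gene_score) <- names(enrichment)

# Number of genes that received each score
table(gene_score)
```

A complex score is then just `sum(gene_score[members])`, where `members` is a character vector of the gene names in that complex.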

For the comparison, I thought I could fit a nonparametric distribution to my random scores, then plug in our protein complex score to obtain a p-value.

My questions: (Both of the following functions are in R)

  1. For nonparametric fitting, approxfun(density(random_generated_sums)) will give me a distribution function. Can I calculate the probability of my protein complex score by passing it to that function?

  2. I asked my advisor the same question and he suggested using ecdf(x). But since it is cumulative, I get values such as 0.01 or 0.998, so should I use 1-ecdf(x) when p>0.5? Will this give a "p-value"?
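To make the difference between the two functions concrete, here is a small sketch on simulated data. The gene scores, the complex size of 6 and the observed score of 4 are all hypothetical. It shows that approxfun(density(...)) returns a density height rather than a probability, so a tail probability has to come from the area under the curve, whereas ecdf returns a cumulative proportion directly.

```r
set.seed(7)
# Assumed per-gene scores (25, 25, 100, 50 and 48 genes per bin) and
# 1000 random complexes of 6 genes each, as stand-in data
gene_score <- rep(c(2, 1, 0, -1, -2), times = c(25, 25, 100, 50, 48))
random_generated_sums <- replicate(1000, sum(sample(gene_score, 6)))
observed <- 4  # hypothetical complex score

# Kernel density estimate and its interpolating function
dens <- density(random_generated_sums)
dens_fun <- approxfun(dens, yleft = 0, yright = 0)

# This is a density HEIGHT at the observed score, not a probability
height <- dens_fun(observed)

# Upper-tail area under the density estimate (Riemann sum on its regular grid)
in_tail <- dens$x >= observed
tail_kde <- sum(dens$y[in_tail]) * diff(dens$x[1:2])

# The ECDF gives a cumulative proportion directly: P(random sum <= observed)
cum_ecdf <- ecdf(random_generated_sums)(observed)

c(height = height, tail_kde = tail_kde, cum_ecdf = cum_ecdf)
```

Because the sums are integers, the smoothed tail area and the ECDF tail will not agree exactly. The ECDF avoids the bandwidth choice altogether.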

I have read about bootstrapping on both CrossValidated and R stackexchange. In those threads, bootstrapping was done by setting up a hypothesis as follows:

 Generate 1000 random protein complexes of 6 genes
 For each one, if the random sum is greater than the protein complex sum,
 add 1 to count
 At the end, the p-value is count / total number of iterations (1000)
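The counting procedure above might look like this in R. It is a minimal sketch: the gene scores, the observed sum of 4 and all variable names are hypothetical. Drawing genes without replacement makes this closer to a Monte Carlo permutation test than to a bootstrap in the strict sense, and the +1 in numerator and denominator is a common convention (an addition to the pseudocode) that keeps the estimated p-value from being exactly zero.

```r
set.seed(42)
# Assumed per-gene scores: 25, 25, 100, 50 and 48 genes in the five bins
gene_score <- rep(c(2, 1, 0, -1, -2), times = c(25, 25, 100, 50, 48))

complex_size <- 6   # number of genes in the complex being tested
observed_sum <- 4   # hypothetical score of the real complex
n_iter <- 1000      # number of random complexes

# Score of each random complex: sum of the scores of 6 distinct random genes
random_sums <- replicate(
  n_iter,
  sum(sample(gene_score, complex_size, replace = FALSE))
)

# Count random complexes scoring at least as high as the observed one
count <- sum(random_sums >= observed_sum)

# One-sided empirical p-value with the +1 correction
p_upper <- (count + 1) / (n_iter + 1)
p_upper
```

Using `>=` rather than `>` counts ties as evidence for the null, which is the conservative choice when the sums are integers and ties are frequent.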

I am very confused by this question, so any guidance will be appreciated.

Best Answer

For question 2, yes.

When you are doing a two-tailed test using ecdf(distribution)(your score), if p>0.5 you need to take 1-p manually to get the tail probability; a two-sided p-value is then conventionally twice the smaller of p and 1-p.

But if you need a one-tailed test, as in an enrichment test where you are trying to test whether the observed expression level is significantly higher than under the null hypothesis, just use 1-p.
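As a rough sketch of both cases, assuming simulated random sums and a hypothetical observed score of 4: with integer-valued sums, 1 - ecdf(x) gives P(random > observed) and so leaves out ties, which is why the upper tail below is computed directly with `>=`.

```r
set.seed(123)
# Simulated null distribution: 1000 random complexes of 6 genes (assumed scores)
gene_score <- rep(c(2, 1, 0, -1, -2), times = c(25, 25, 100, 50, 48))
random_sums <- replicate(1000, sum(sample(gene_score, 6)))
observed_sum <- 4  # hypothetical observed complex score

Fn <- ecdf(random_sums)

# Lower tail: P(random <= observed), straight from the ECDF
p_lower <- Fn(observed_sum)

# Upper tail: P(random >= observed); includes ties, unlike 1 - Fn(observed_sum)
p_upper <- mean(random_sums >= observed_sum)

# Two-sided: twice the smaller tail, capped at 1
p_two_sided <- min(1, 2 * min(p_lower, p_upper))

c(lower = p_lower, upper = p_upper, two_sided = p_two_sided)
```

For an enrichment question ("is the complex score higher than expected by chance?"), `p_upper` is the value to report.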
