First of all, please forgive my ignorance about this concept. This may be a basic question, but what I have read in "Exploratory Data Analysis: 2 Ways of Plotting Empirical Cumulative Distribution Functions in R", "How to calculate cumulative distribution in R?" and "Distribution empirical in R" did not get me to my goal.
I have 248 genes, each with an enrichment measure. By sorting these measures we were able to assign an enrichment score to specific rank intervals of the data. Separately, I have a protein complex that contains some of these genes. We thought that adding up the enrichment scores of its genes would give a protein complex enrichment score. For statistical validation, I would like to randomly generate 1000 protein complexes (for every unique protein complex, I will change the number of genes accordingly) and compare those random scores with our score.
In descending order (rank interval --- score):
Top 25 --- +2
26- 51 --- +1
51-150 --- 0
151-200 --- -1
200-248 --- -2
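As a rough illustration, the rank-to-score mapping and the complex score could be computed along these lines. This is only a sketch: the enrichment values, gene names and complex membership are made up, and because ranks 51 and 200 appear in two intervals above, the code assumes they fall in the 0 and -1 groups respectively.

```r
# Hypothetical data: 248 genes with made-up enrichment measures.
set.seed(1)
enrichment <- setNames(rnorm(248), paste0("gene", 1:248))

# Rank genes in descending order of enrichment (rank 1 = most enriched).
gene_rank <- rank(-enrichment, ties.method = "first")

# Map ranks to scores. Assumed boundaries:
# 1-25 -> +2, 26-50 -> +1, 51-150 -> 0, 151-200 -> -1, 201-248 -> -2.
breaks <- c(1, 26, 51, 151, 201)
gene_score <- c(2, 1, 0, -1, -2)[findInterval(gene_rank, breaks)]
names(gene_score) <- names(enrichment)

# Score of a hypothetical 6-gene complex = sum of its members' scores.
complex_genes <- c("gene3", "gene17", "gene42", "gene100", "gene201", "gene230")
complex_score <- sum(gene_score[complex_genes])
```

Using findInterval keeps the interval boundaries in one vector, so a different reading of the table only requires changing breaks.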
For the comparison, I thought I could fit a nonparametric distribution to my random scores, then plug in our protein complex score and obtain a p-value.
My questions (both of the following functions are in R):
1. For nonparametric fitting, approxfun(density(random_genarated_sums)) will give me a distribution function. Can I calculate the probability of my protein complex score by passing it to that function?
2. I asked my advisor the same question and he suggested using ecdf(x). But since it is cumulative, I get values such as 0.01 or 0.998, so should I use 1-ecdf(x) when p>0.5? Will this give a "p-value"?
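One point worth keeping in mind for the first approach: approxfun(density(...)) returns an interpolated density curve, so evaluating it at a score gives the height of the curve, not a probability. A tail probability needs the area under that curve. Below is a minimal sketch with made-up data (random_sums and observed are hypothetical stand-ins), using a simple trapezoid rule rather than any particular recommended method. Real scores here are integer sums, so the kernel smoothing is only an approximation.

```r
set.seed(1)
random_sums <- rnorm(1000)  # hypothetical stand-in for the random scores
observed <- 2               # hypothetical observed complex score

# Kernel density estimate, interpolated into a function (zero outside the grid).
d <- density(random_sums)
pdf_hat <- approxfun(d$x, d$y, yleft = 0, yright = 0)

# Upper-tail probability = area under the estimated density above `observed`,
# approximated with the trapezoid rule on a fine grid.
grid <- seq(observed, max(d$x), length.out = 2001)
h <- pdf_hat(grid)
upper_tail <- sum(diff(grid) * (head(h, -1) + tail(h, -1)) / 2)
```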
I have read about bootstrapping on both CrossValidated and the R Stack Exchange sites. In those threads, bootstrapping was done by setting up a hypothesis test as follows:
Generate 1000 random protein complexes of 6 genes
If random sum > protein complex sum
+1 to count
At the end, the p-value is count / total number of iterations (1000)
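The counting procedure above could look roughly like this in R. It is a sketch under stated assumptions: gene_score is a made-up score vector that follows the rank table (with the boundary ranks split as 25/25/100/50/48), and observed_sum is a hypothetical score for the real complex. It counts random sums that are at least as large as the observed one (>= rather than the strict > in the steps above), which is the usual choice when scores are discrete.

```r
set.seed(42)
# Hypothetical per-gene scores: 25 genes at +2, 25 at +1, 100 at 0, 50 at -1, 48 at -2.
gene_score <- rep(c(2, 1, 0, -1, -2), times = c(25, 25, 100, 50, 48))

observed_sum <- 7   # hypothetical score of the real 6-gene complex
n_genes <- 6        # size of the complex being tested
n_iter <- 1000      # number of random complexes

# Draw n_genes genes without replacement and sum their scores, n_iter times.
random_sums <- replicate(n_iter, sum(sample(gene_score, n_genes)))

# Count random complexes scoring at least as high as the observed one.
count <- sum(random_sums >= observed_sum)
p_raw <- count / n_iter

# A commonly used correction that avoids reporting a p-value of exactly 0.
p_value <- (count + 1) / (n_iter + 1)
```

Since genes are drawn without replacement from a fixed pool, this is closer to a permutation (random sampling) test than to a bootstrap in the strict sense.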
I am very confused with this question so any guidance will be appreciated.
Best Answer
For question 2, yes.
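When you are doing a two-tailed test using ecdf(distribution)(your score), if p>0.5 you need to manually take 1-p to get the tail probability on the side your score falls on. A common convention then doubles that smaller tail to report a two-sided p-value.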
But if you need a one-tailed test, as in an enrichment test where you check whether the observed expression level is significantly higher than expected under the null hypothesis, just use 1-p.
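A minimal sketch of both cases with ecdf, assuming a made-up null distribution (gene_score, random_sums and observed_sum are hypothetical, not the asker's data):

```r
set.seed(7)
# Hypothetical null distribution: sums of 6 randomly drawn gene scores.
gene_score <- rep(c(2, 1, 0, -1, -2), times = c(25, 25, 100, 50, 48))
random_sums <- replicate(1000, sum(sample(gene_score, 6)))
observed_sum <- 4   # hypothetical observed complex score

Fn <- ecdf(random_sums)

# ecdf value: fraction of random sums <= observed.
p_lower <- Fn(observed_sum)

# One-tailed (upper) test: fraction of random sums strictly greater than observed.
p_upper <- 1 - Fn(observed_sum)

# Two-tailed version under the doubling convention, capped at 1.
p_two <- min(1, 2 * min(p_lower, p_upper))
```

Note that 1 - Fn(x) excludes random sums equal to x. With integer scores, mean(random_sums >= observed_sum) includes those ties and is slightly more conservative.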