Comparing two sets of data for similarity

correlation, genetics, similarities, standard deviation

I have two independent sets of gene expression data (biological replicates) that were generated at the same time. Each gene is sampled dozens of times on the machine (technical replicates). Data are filtered for transcripts having expression values above background (p-values less than 0.05). I then export everything for analysis in Excel. However, the raw expression values can vary between the two samples. The majority seem to be rather close, but many are far enough apart to make me question their reliability.

What would be the best way to compare the two values ($X_1$ and $X_2$) such that I can create a new filtering criterion to remove these potentially less reliably sampled transcripts?

I only have two datasets, and I'm uncertain as to whether I can use standard deviations in any filtering. My general thinking is that if there is less than a 20% difference between the two values I can keep those that pass. But I don't know what that would mean. Take these genes as an example:

$$
\begin{array}{ccc}
& X_1 & X_2 \\
\mbox{Strap} & 5554.15 & 5262.48 \\
\mbox{Cops8} & 1762.63 & 2317.22 \\
\end{array}
$$

As you can see, the Strap values are pretty close, while the Cops8 values differ quite a bit.
Does a "% difference" criterion make sense? Something like this:

$$
\%\mbox{diff} = \frac{\left|X_1 - X_2\right|}{(X_1+X_2)/2}
$$

If I calculate a coefficient of variation I get a much different value. I don't want to exclude more genes than I have to, if I can avoid it.
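To make the comparison concrete, here is a minimal sketch that computes both the % difference defined above and a coefficient of variation for the two example genes. It assumes the CV is the sample standard deviation (n − 1 denominator, which is what Excel's `STDEV` uses) divided by the mean; the function names and the 20% cutoff constant are only illustrative. With just two values, the sample SD equals $|X_1 - X_2|/\sqrt{2}$, so under that assumption the CV is simply the % difference divided by $\sqrt{2}$.

```python
import math
import statistics

def percent_diff(x1, x2):
    """Absolute difference relative to the mean of the two values."""
    return abs(x1 - x2) / ((x1 + x2) / 2)

def coeff_var(x1, x2):
    """Sample standard deviation (n - 1 denominator) divided by the mean."""
    return statistics.stdev([x1, x2]) / statistics.mean([x1, x2])

# Example values from the question.
genes = {
    "Strap": (5554.15, 5262.48),
    "Cops8": (1762.63, 2317.22),
}

THRESHOLD = 0.20  # the 20% cutoff proposed in the question (illustrative)

results = {}
for name, (x1, x2) in genes.items():
    pdiff = percent_diff(x1, x2)
    cv = coeff_var(x1, x2)
    keep = pdiff < THRESHOLD
    results[name] = (pdiff, cv, keep)
    print(f"{name}: %diff = {pdiff:.2%}, CV = {cv:.2%}, keep = {keep}")
    # For exactly two values, CV * sqrt(2) reproduces %diff.
    assert math.isclose(cv * math.sqrt(2), pdiff, rel_tol=1e-9)
```

For these two genes the % difference comes out at roughly 5.4% for Strap and 27.2% for Cops8, so a 20% cutoff would keep the first and drop the second. Because the two measures differ only by a constant factor in the two-replicate case, they rank genes identically; a 20% cutoff on % difference corresponds to a cutoff of about 14% on this CV.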

Thanks for any advice!

Best Answer

How many genes are sampled in either data set you've obtained? When the number of genes is large, it shouldn't be surprising that a gene may be highly differentially expressed in one replicate but not in another. This is a consequence of multiple testing. When you filter genes at a $p=0.05$ significance level, any given gene that is not truly expressed above background still has a 5% chance of passing the filter (a type I error). Across several dozen or several hundred genes, the chance of including at least one such erroneous gene grows quickly.
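As a rough illustration of how fast this grows, the sketch below computes the probability of at least one type I error among $m$ tests, $1 - (1 - \alpha)^m$. It assumes the tests are independent and that every null hypothesis is true, which is a simplification for real expression data; the function name is just illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests, each run at significance level alpha,
    assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests

for m in (1, 10, 50, 100, 500):
    print(f"{m:>4} tests: P(at least one type I error) = "
          f"{familywise_error_rate(m):.4f}")
```

Under these assumptions, the probability is already above 90% at 50 tests, so with dozens or hundreds of genes at least one false positive is close to certain.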

If you don't want to exclude more genes than you have to, why not just include them all?
