Solved – Comparing two sets of data for similarity

Tags: correlation, genetics, similarities, standard deviation

I have two independent sets of gene expression data (biological replicates) that were generated at the same time. Each gene is sampled dozens of times on the machine (technical replicates). Data are filtered for transcripts having expression values above background (p-values less than 0.05). I then export everything for analysis in Excel. However, the raw expression values can vary between the two samples. The majority seem to be rather close, but many are far enough apart to make me question their reliability.

What would be the best way to compare the two values ($X_1$ and $X_2$) so that I can create a new filtering criterion to remove these potentially less reliably sampled transcripts?

I only have two datasets, and I'm uncertain as to whether I can use standard deviations in any filtering. My general thinking is that if there is less than a 20% difference between the two values I can keep those that pass. But I don't know what that would mean. Take these genes as an example:

$$
\begin{array}{ccc}
& X_1 & X_2 \\
\mbox{Strap} & 5554.15 & 5262.48 \\
\mbox{Cops8} & 1762.63 & 2317.22 \\
\end{array}
$$

As you can see, the Strap values are pretty close, while the Cops8 values are quite a bit different.
Does a "% difference" criterion make sense? Something like this:

$$
\%\mbox{diff} = \frac{\left|X_1 - X_2\right|}{(X_1+X_2)/2}
$$

If I calculate a coefficient of variation I get a much different value. I don't want to exclude more genes than I have to, if I can avoid it.
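With only two values per gene, the two measures are tightly linked: the sample standard deviation of two numbers is $|X_1 - X_2|/\sqrt{2}$, so the coefficient of variation works out to the % difference divided by $\sqrt{2}$. The following is a minimal R sketch on the two example genes. The helper names `pct_diff` and `cv` are illustrative only, and the CV is assumed to use the sample standard deviation as returned by `sd`.

```r
# Example values from the table above
x1 <- c(Strap = 5554.15, Cops8 = 1762.63)
x2 <- c(Strap = 5262.48, Cops8 = 2317.22)

# Percent difference relative to the mean of the two values (illustrative helper)
pct_diff <- function(a, b) abs(a - b) / ((a + b) / 2)

# Coefficient of variation per gene, using the sample SD (n - 1 denominator)
cv <- function(a, b) mapply(function(u, v) sd(c(u, v)) / mean(c(u, v)), a, b)

pd  <- pct_diff(x1, x2)
cvs <- cv(x1, x2)

print(round(pd, 3))   # Strap about 0.054, Cops8 about 0.272
print(round(cvs, 3))  # Strap about 0.038, Cops8 about 0.192

# With exactly two values, CV equals the % difference divided by sqrt(2)
print(all.equal(cvs, pd / sqrt(2)))
```

Under that assumption, a 20% cutoff on the % difference would be the same filter as a cutoff of roughly 14% on the CV, so the two criteria rank genes identically and differ only in scale.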

Thanks for any advice!

Best Answer

How many genes are sampled in each data set you've obtained? When the number of genes is large, it shouldn't be surprising that a gene may be highly differentially expressed in one replicate but not in another. This is a consequence of multiple testing. When you filter genes according to a $p=0.05$ statistical significance level, you have a 5% chance of making a type I error for any given gene. Across several dozen or several hundred genes, the chance of including at least one erroneous gene increases rapidly.
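To get a feel for how fast this grows, here is a minimal R sketch. It assumes the tests are independent and that every gene is truly at background, which is a simplification; the function name `fwer` and the gene counts are hypothetical.

```r
# Probability of at least one type I error among n independent tests at level alpha
fwer <- function(n, alpha = 0.05) 1 - (1 - alpha)^n

n_genes <- c(1, 10, 50, 100, 500)  # hypothetical numbers of tested genes
print(data.frame(n_genes = n_genes,
                 p_at_least_one = round(fwer(n_genes), 3)))
```

Under these assumptions the probability is already about 0.40 with 10 genes and above 0.99 with 100.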

If you don't want to exclude more genes than you have to, why not just include them all?

Related Solutions

Solved – Calculating the probability of gene list overlap between an RNA seq and a ChIP-chip data set

You are close, with your use of dhyper and phyper, but I don't understand where 0:2 and -1:2 are coming from.

The p-value you want is the probability of getting 100 or more white balls in a sample of size 400 from an urn with 3000 white balls and 12000 black balls. Here are four ways to calculate it.

sum(dhyper(100:400, 3000, 12000, 400))
1 - sum(dhyper(0:99, 3000, 12000, 400))
phyper(99, 3000, 12000, 400, lower.tail=FALSE)
1-phyper(99, 3000, 12000, 400)

These give 0.0078.

dhyper(x, m, n, k) gives the probability of drawing exactly x. In the first line, we sum up the probabilities for 100 – 400; in the second line, we take 1 minus the sum of the probabilities of 0 – 99.

phyper(x, m, n, k) gives the probability of getting x or fewer, so phyper(x, m, n, k) is the same as sum(dhyper(0:x, m, n, k)).

The lower.tail=FALSE is a bit confusing. phyper(x, m, n, k, lower.tail=FALSE) is the same as 1-phyper(x, m, n, k), and so is the probability of x+1 or more. [I never remember this and so always have to double check.]
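Since the off-by-one is easy to get wrong, the relationships above can be checked numerically. This is a small sketch using the numbers from this example; the variable names are arbitrary.

```r
m <- 3000; n <- 12000; k <- 400; x <- 99  # values from the example above

lower <- phyper(x, m, n, k)                      # P(X <= 99)
upper <- phyper(x, m, n, k, lower.tail = FALSE)  # P(X >= 100)

# phyper is the cumulative sum of dhyper
print(all.equal(lower, sum(dhyper(0:x, m, n, k))))

# lower.tail = FALSE gives the complement, i.e. x + 1 or more
print(all.equal(upper, 1 - lower))
print(all.equal(upper, sum(dhyper((x + 1):k, m, n, k))))
```

Each comparison should print `TRUE`.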

At that stattrek.com site, you want to look at the last row, "Cumulative Probability: P(X $\ge$ 100)," rather than the first row "Hypergeometric Probability: P(X = 100)."

Any particular number that you draw is going to have small probability (in fact, max(dhyper(0:400, 3000, 12000, 400)) gives $\sim$0.050). Getting 101 or 102 or any larger number is even more interesting than 100, and the p-value is the probability, if the null hypothesis were true, of getting a result as interesting as, or more interesting than, what was observed.

Here's a picture of the hypergeometric distribution in this case. You can see that it's centered at 80 (20% of 400) and that 100 is pretty far out in the right tail.
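A plot along these lines can be redrawn with base R graphics. This is a minimal sketch rather than the original figure; the axis limits, labels, and the dashed marker at 100 are arbitrary choices.

```r
x <- 0:400
p <- dhyper(x, 3000, 12000, 400)

# Spike plot of the probability mass function, zoomed on the region with visible mass
plot(x, p, type = "h", xlim = c(40, 130),
     xlab = "Number of white balls drawn", ylab = "Probability")
abline(v = 100, col = "red", lty = 2)  # the observed value of 100

# Mean of the distribution: 400 * 3000 / 15000 = 80
print(sum(x * p))
```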

Solved – Is a heat-map of gene expression more informative if Z-scores are used instead of actual expression measurement values

What the reviewer may be referring to is the bottom legend of your figure. It goes from 1 to 12, with 4 right in the middle, which is discomforting. This makes your absolute log expression values difficult to interpret, because when a gene goes from bright green to black, its expression level is multiplied by 16, but when it goes from black to bright red, it is multiplied by 256. In short, I don't think your figure could be "more informative", but the information could be more intuitive.

As explained by @fosgen, Z-scores are centered and normalized, so the user can interpret a color as $x$ standard deviations from the mean and have an intuitive idea of the relative variation of that value.

Like @fosgen, I think you should go for standardization by gene (standardization by cell type does not make sense to me in that context). Black will be the average expression across different cell types (set to 0) and the color distribution will be symmetrical on both sides.
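Standardization by gene can be sketched in a few lines of R. The matrix `expr` below is made-up data standing in for log2 expression values (genes in rows, cell types in columns), not the data from the question.

```r
# Hypothetical log2 expression matrix: rows are genes, columns are cell types
set.seed(1)
expr <- matrix(rnorm(5 * 4, mean = 6, sd = 2), nrow = 5,
               dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

# scale() standardizes columns, so transpose to standardize each gene (row)
z <- t(scale(t(expr)))

print(round(rowMeans(z), 10))      # each gene is centered at 0
print(round(apply(z, 1, sd), 10))  # and has unit standard deviation
```

Base R's `heatmap` applies this kind of row scaling itself when called with `scale = "row"`, which is its default, so computing `z` by hand is mainly useful when the plotting function does not scale for you.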

Showing the (relative) gene-wise variation of expression is standard in the field, but you might have specific reasons to show the (absolute) log2 microarray measurements, in which case you can explain those reasons to the reviewers. But I would still straighten the color gradient to ease interpretation.
