Statistical Methods – Which Method Can Compare Values of the Same Variables from Different Views?

differences, friedman-test, multivariate-analysis, repeated-measures, wilcoxon-signed-rank

I have a biological data set on which I need to run a statistical analysis.

The data was created on a mass spectrometer and contains the intensity values for around one hundred lipids.

The analysis was done on five different types of blood (Plasma, Serum, EDTA Blood, DBS Finger and DBS Venous) from four persons at three different time points. All five types were gathered from each person at each time point.

Each sample contains the categorical variables person (0, 1, 2 or 3), date (date1, date2 or date3) and the type of liquid used, plus around one hundred float values for the different detected lipids. The higher the number, the more of the lipid is present.

The general motivation behind the experiment is to test a new standard operating procedure for reliably detecting as many lipids as possible in a repeatable way. To add further value to this small test study, we want to analyse the data beyond the sheer number of detected lipids.
The first idea was to measure differences between the types used, e.g. EDTA should consistently yield higher values than dried blood samples. At the moment the goal is only to determine whether differences between the types are present (which there are), not to perform a non-inferiority analysis.

Second is the question of repeatability: are the values for the lipids within one blood type mostly constant across the time points? This could be problematic, because we only have three repetitions.

For the first question I compared all samples of two types with a Wilcoxon signed-rank test.
For the second question I used a Friedman test to compare the different time points within one blood type. Both tests were run on the values of each lipid individually, and the number of times H0 was rejected was counted.
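
In code, my approach looks roughly like this (a minimal R sketch; the data frame df, its column names, and the chosen types are placeholders for the actual data):

    # Lipid columns are everything except the design variables
    lipids <- setdiff(names(df), c("person", "date", "type"))

    # Question 1: Wilcoxon signed-rank test between two types,
    # pairing samples by person and date (12 pairs per lipid)
    a <- df[df$type == "EDTA", ];  a <- a[order(a$person, a$date), ]
    b <- df[df$type == "Serum", ]; b <- b[order(b$person, b$date), ]
    p_wilcox <- sapply(lipids, function(l)
      wilcox.test(a[[l]], b[[l]], paired = TRUE)$p.value)

    # Question 2: Friedman test across the three time points within
    # one blood type, with person as the blocking factor
    d <- df[df$type == "EDTA", ]
    p_friedman <- sapply(lipids, function(l)
      friedman.test(d[[l]], groups = factor(d$date),
                    blocks = factor(d$person))$p.value)

    # Count rejections at p < 0.05
    sum(p_wilcox < 0.05); sum(p_friedman < 0.05)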

Is that a statistically correct way to test it? My background is machine learning, not statistics, so I do not want to do something completely wrong.

Best Answer

The biggest problem that you face with your approach is multiple comparisons. For the Wilcoxon tests you have 10 pairwise comparisons among the blood-sample types and 100 lipid types, for 1000 comparisons. Even if there are no real differences you would expect 50 "significant" differences at the usual p = 0.05 cutoff. For the Friedman tests you have 5 blood-sample types and 100 lipid types, 500 comparisons (even ignoring the 3 post-hoc pairwise analyses among time points you might perform following each test) and 25 expected "significant" differences even if all null hypotheses hold. Just counting "the number of times H0 got rejected" isn't quite enough to handle that.

To deal with the multiple-comparisons problem you need to decide whether you want to control the family-wise error rate (FWER, the probability of making any false-positive findings) or the false-discovery rate (FDR, the fraction of "significant" findings that probably aren't correct). The Holm-Bonferroni method to control FWER and the Benjamini-Hochberg method to control FDR both work with lists of p-values, as you presumably have obtained from your multiple tests.
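
As a concrete illustration, both corrections are a single call to p.adjust in R (here p_values stands in for the vector of p-values collected from your per-lipid tests):

    # p_values: hypothetical numeric vector of p-values from the tests
    p_holm <- p.adjust(p_values, method = "holm")  # controls FWER
    p_bh   <- p.adjust(p_values, method = "BH")    # controls FDR

    # Count findings at an adjusted threshold of 0.05
    sum(p_holm < 0.05)
    sum(p_bh < 0.05)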

A further problem is that your approach isn't estimating the magnitudes of the differences. In particular, as you don't seem to have technical replicates on the same sample types collected from the same individual at the same time, the differences among time points necessarily include both the analytical-method variance and any time-associated differences. Also, you are treating the between-sample-type and between-time analyses separately, and not directly accounting for the 4 individuals at all. That might not be using your data (about 6,000 observations in total) most efficiently.

I'd recommend trying to analyze these data with some form of regression, treating the time points simply as replicates rather than modeling them individually as your Friedman test did (unless you expect systematic differences from time point to time point, as you might with changes over time following some therapeutic intervention). Then at least you will have an estimate of the sum of the analytical-method and time-associated variances, based on pooling information from many different measurements.

You might then model the lipid values themselves in a regression to evaluate the differences among blood-sample types, while accounting for the 4 individuals and treating the 3 time points as replicates. Errors in biochemical measurements are often proportional to the actual values, so working on a logarithmic scale might put the residuals of different (log-transformed) lipid measurements on similar scales, allowing a single regression to evaluate all 100 lipids at once.* The 100 lipids could be treated as repeated measures on the 4 individuals, with the sample types as predictors. The log scale converts absolute differences into percentage/proportional differences, so the regression would directly capture the kind of overall percentage differences in lipid levels that you expect, say between EDTA-treated samples and dried blood, while allowing for further differences among lipid/sample-type combinations.
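
As a minimal sketch of what such a regression could look like, assuming the data are reshaped to long format (one row per person/type/time-point/lipid; all names here are placeholders) and using the lme4 package for a mixed model:

    library(lme4)

    # Reshape the hypothetical wide data frame `df` to long format
    lipids <- setdiff(names(df), c("person", "date", "type"))
    long <- do.call(rbind, lapply(lipids, function(l) data.frame(
      person = factor(df$person),
      type   = factor(df$type),
      lipid  = l,
      value  = df[[l]])))

    # Log-scale model: fixed type effects (overall log-fold differences),
    # random intercepts for person and lipid; the three time points
    # enter simply as replicates
    fit <- lmer(log(value) ~ type + (1 | person) + (1 | lipid),
                data = long)
    summary(fit)

    # Lipid-specific deviations from the overall type effect could be
    # added, e.g. with random slopes:
    #   log(value) ~ type + (1 | person) + (type | lipid)

Exponentiating a fixed-effect coefficient for type then gives an estimated proportional (fold) difference relative to the reference type.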

If a simple log transformation isn't adequate, it should be possible to adapt bioinformatic methods developed for things like microarrays and RNA-seq data. The limma package in Bioconductor comes to mind, although I don't know if that's the most appropriate choice here. Those methods directly model variance as a function of expression levels, to allow for simultaneous analysis of up to thousands of gene-expression values as a function of treatments with small numbers of replicates. You would simply treat the blood-sample types as "treatments" and see which of your lipid values had significantly different "responses" to the "treatments."
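
If limma turns out to fit, the analysis might look roughly like this (a sketch under assumptions: expr is a 100 x 60 matrix of log-scale lipid intensities with lipids in rows and samples in columns, samples is a data frame describing each column, and the level names are hypothetical):

    library(limma)

    # Design matrix: one column per blood-sample type, plus person
    design <- model.matrix(~ 0 + type + person, data = samples)

    fit <- lmFit(expr, design)

    # Example contrast: EDTA vs. dried blood spot (finger); the names
    # must match the columns of `design`
    contr <- makeContrasts(typeEDTA - typeDBS_Finger, levels = design)
    fit2  <- eBayes(contrasts.fit(fit, contr))

    topTable(fit2)  # moderated t-statistics with FDR-adjusted p-values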

Finally, note that Bioconductor and R provide many tools for analyzing mass-spec metabolic data. See this recent review. Although many of those packages are designed for analyses more at the raw-data scale and mapping mass-spec data onto particular biochemical species, it seems that some of them might already be designed to handle this type of quantitative, regression-type approach on reduced mass-spec data. Furthermore, I suspect that your different sample types might be leading to different mass-spec fragmentation patterns, something that would need analysis closer to the raw-data level. So look carefully at what's already been developed for similar studies.


*You might instead use a generalized linear model (GLM) with a log link. A prior log transformation means that the regression estimates geometric means (means of the logs), whereas the GLM with a log link models the (log of the) mean values directly.
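
In code, the two options in this footnote differ only slightly (hypothetical long-format data long with columns value, type and person):

    # Option 1: log-transform first; coefficients describe the mean of
    # log(value), i.e. the geometric mean on the original scale
    fit_log <- lm(log(value) ~ type + person, data = long)

    # Option 2: GLM with a log link; models the log of the mean value
    # directly. A Gamma family is one common choice when errors are
    # proportional to the mean.
    fit_glm <- glm(value ~ type + person,
                   family = Gamma(link = "log"), data = long)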
