I am trying to understand some chemical concentration data I have measured. I am taking the log of the ratio of two concentrations. The ratio itself comes from oscillating time-series data and is (local maximum in concentration):(local minimum in concentration). I then take the log of this ratio. When I plot the resulting distribution, it appears to my eyes to be roughly log-normal, with the bulk of the probability at around 0.75 (using log base 2) and a long right-skewed tail (despite what the plot suggests, there is no mass below zero, because local max > local min).
I would like to find outliers in this distribution and also find mean+SD cutoffs to threshold it (I am comparing it to a similar dataset from a treatment condition, and I want to characterize the common variation in this distribution as a control). Is it best to transform this to a log-normal distribution to do something like this? Should I check other distributions as well? Any relatively straightforward suggestions would be appreciated.
Here is the actual sorted data that was used to plot the kernel density seen above. I have many other realizations of distributions of log ratios similar to this one:
{0.34, 0.35, 0.38, 0.42, 0.45, 0.47, 0.47, 0.53, 0.54, 0.56, 0.59,
0.6, 0.61, 0.61, 0.62, 0.65, 0.71, 0.72, 0.8, 0.84, 0.9, 0.92, 0.95,
0.96, 0.96, 1.68, 1.81, 2.03, 2.03, 3.19, 3.19, 3.37, 3.79, 4.65,
4.75}
Best Answer
Thanks for posting the data. Here is a slightly tougher check on whether the data are lognormal: a normal quantile plot of the logged ratios.
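For anyone who wants to reproduce that kind of check, here is a minimal sketch in Python using only the standard library. It computes the coordinates of a normal quantile plot for the natural logs of the posted values. The plotting positions (i - 0.5)/n are one common convention and an assumption on my part, not necessarily what was used for the plot; the helper names normal_quantile_points and pearson_r are made up for illustration.

```python
import math
from statistics import NormalDist, mean

# Sorted values posted in the question (n = 35).
data = [
    0.34, 0.35, 0.38, 0.42, 0.45, 0.47, 0.47, 0.53, 0.54, 0.56, 0.59,
    0.6, 0.61, 0.61, 0.62, 0.65, 0.71, 0.72, 0.8, 0.84, 0.9, 0.92, 0.95,
    0.96, 0.96, 1.68, 1.81, 2.03, 2.03, 3.19, 3.19, 3.37, 3.79, 4.65,
    4.75,
]


def normal_quantile_points(values):
    """Return (standard normal quantile, ordered natural log) pairs."""
    logged = sorted(math.log(v) for v in values)
    n = len(logged)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Assumed plotting positions: (i - 0.5) / n for i = 1..n.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), y)
        for i, y in enumerate(logged, start=1)
    ]


def pearson_r(pairs):
    """Pearson correlation between the two coordinates of the pairs."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


points = normal_quantile_points(data)
r = pearson_r(points)
print(f"n = {len(points)}, correlation on the quantile plot = {r:.3f}")
```

If the logged values were close to normal, the points would fall near a straight line and the correlation would be close to 1. The correlation is only a rough summary; looking at the plotted points is more informative, especially for spotting groups or gaps.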
Considerable caution is indicated:
The sample size is 35. Easy to say, but that is a small sample for this kind of exercise.
The grouping is suggestive, or may be just a quirk as can be expected in any sample of this size. Certainly you should check whether there is anything distinctive underlying the 10 highest values.
The fit is middling, but I didn't search through other distributions to try to find a better fit.
I don't see why mixtures are expected to be mixtures of normals. It's my impression that mixtures of normals are the most common kind of mixture fitted to data, but that is a different point.
I used natural logarithms, but using log base 2 would just change the axis labels and nothing fundamental.
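Two of the points above can be checked directly. This is a rough sketch in standard-library Python, with variable names that are mine: it confirms that base-2 logs are just natural logs divided by ln(2), and it finds the largest gap between adjacent sorted logged values, which is one crude way to see where the apparent grouping splits.

```python
import math

# Sorted values posted in the question (n = 35).
data = [
    0.34, 0.35, 0.38, 0.42, 0.45, 0.47, 0.47, 0.53, 0.54, 0.56, 0.59,
    0.6, 0.61, 0.61, 0.62, 0.65, 0.71, 0.72, 0.8, 0.84, 0.9, 0.92, 0.95,
    0.96, 0.96, 1.68, 1.81, 2.03, 2.03, 3.19, 3.19, 3.37, 3.79, 4.65,
    4.75,
]

ln_vals = [math.log(v) for v in data]
log2_vals = [math.log2(v) for v in data]

# log2(x) = ln(x) / ln(2), so changing base only rescales the axis.
scale = 1.0 / math.log(2)
max_diff = max(abs(b - scale * a) for a, b in zip(ln_vals, log2_vals))
print(f"largest discrepancy after rescaling: {max_diff:.2e}")

# Largest gap between neighbouring sorted logged values.
ordered = sorted(ln_vals)
gaps = [(ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1)]
gap, idx = max(gaps)
n_above = len(ordered) - (idx + 1)
print(f"largest gap = {gap:.3f}, values above it: {n_above}")
```

For the posted data the largest gap falls between 0.96 and 1.68. A gap like this is only a descriptive hint, not evidence of two groups, given the sample size.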