Solved – How to normalize data of unknown distribution

Tags: distributions, histogram, normal-distribution, normalization

I am trying to find the most appropriate characteristic distribution for repeated-measurement data of a certain type.

Essentially, in my branch of geology, we often use radiometric dating of minerals from samples (chunks of rock) in order to find out how long ago an event happened (the rock cooled below a threshold temperature). Typically, several (3-10) measurements will be made from each sample, and then the mean $\mu$ and standard deviation $\sigma$ are taken. This is geology, so the cooling ages of the samples can range from $10^5$ to $10^9$ years, depending on the situation.

However, I have reason to believe that the measurements are not Gaussian: 'outliers', either declared arbitrarily or identified through some criterion such as Peirce's criterion [Ross, 2003] or Dixon's Q-test [Dean and Dixon, 1951], are fairly common (say, 1 in 30), and these are almost always older, indicating that these measurements are characteristically skewed right. There are well-understood reasons for this having to do with mineralogical impurities.
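For concreteness, Dixon's Q-test for a suspect high value is simple to sketch: it compares the gap between the largest value and its nearest neighbor to the full range of the sample. A minimal version, assuming a small sample (the 3-10 replicates described above) and the commonly tabulated 95%-confidence critical values from Dean and Dixon (1951):

```python
import numpy as np

# Critical values of Dixon's Q at 95% confidence for n = 3..10
# (Dean & Dixon, 1951).
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def dixon_q_high(values):
    """Q-test on the largest value: returns (Q, rejected?)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    q = (x[-1] - x[-2]) / (x[-1] - x[0])  # gap / range
    return q, q > Q_CRIT_95[n]

# A suspiciously old age in an otherwise tight sample is flagged:
q, rejected = dixon_q_high([4.00e8, 4.05e8, 4.10e8, 9.00e8])
```

Here `dixon_q_high` is a hypothetical helper name; the point is only that a single old age dominates the gap-to-range ratio and gets rejected, which is exactly the routine outlier-culling the question wants to avoid.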

[Figure: mean vs. median sample age. Red line indicates mean = median. Note older means caused by skewed measurements.]

Therefore, if I can find a better distribution, one that incorporates fat tails and skew, I think that we can construct more meaningful location and scale parameters, and not have to dispense with outliers so quickly. That is, if it can be shown that these types of measurements are lognormal, or log-Laplacian, or whatever, then more appropriate maximum-likelihood estimates of location and scale can be used than $\mu$ and $\sigma$, which are non-robust and possibly biased in the case of systematically right-skewed data.

I am wondering what the best way to do this is. So far, I have a database with about 600 samples, and 2-10 (or so) replicate measurements per sample. I have tried normalizing the samples by dividing each by its mean or median, and then looking at histograms of the normalized data. This produces reasonable results, and seems to indicate that the data are roughly log-Laplacian in character:

[Figure: histograms of the normalized data]

However, I'm not sure if this is the appropriate way of going about it, or if there are caveats that I am unaware of that may be biasing my results so they look like this. Does anyone have experience with this sort of thing, and know of best practices?

Best Answer

Have you considered taking the mean of the (3-10) measurements from each sample? Could you then work with the resulting distribution of sample means, which will approximate a t-distribution and, for larger $n$, the normal distribution?
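In practice that suggestion amounts to summarizing each sample by its mean and attaching a t-based interval. A minimal sketch with hypothetical ages (not real data), using `scipy.stats.t.interval`:

```python
import numpy as np
from scipy import stats

# One hypothetical sample of n = 5 replicate ages (years).
ages = np.array([4.02e8, 4.11e8, 4.05e8, 4.30e8, 4.08e8])

n = len(ages)
mean = ages.mean()
sem = ages.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# 95% confidence interval for the sample mean, t with n-1 df.
ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
```

Note the tension with the question, though: the mean and the t-interval are exactly the non-robust statistics the asker worries about, since a single right-skewed outlier pulls both the mean and the interval upward.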
