Solved – Generate data with skewed distribution and known percentiles, mean and median

distributions, r

I'm trying this one again as my imprecise use of language unfortunately confused my last attempt at this question…

I am trying to recreate some data from a publication. The study was on levels of a chemical in a foodstuff, and I would like to recreate data that will produce the same mean, median and percentiles:

  • p5 = 0.006 mg/kg
  • median = 0.02 mg/kg
  • mean = 3.153 mg/kg
  • p95 = 1.525 mg/kg
  • min > 0 mg/kg
  • max = 900 mg/kg
  • n = 2521
  • 59.4% were below 0.005 mg/kg

Is it possible to recreate data to suit these values? Once I can recreate the data, I would like to right-censor it at a range of values (e.g. 500, 100, 50, 10) and see what effect that has on the mean.

There is no graph, and it's not possible to get the original data.

Thank you for your help, and I hope I've framed the problem more accurately this time!

Best Answer

Whatever your sense of how difficult this is, and of how much guesswork is needed about information not given, you are exactly right.

The minimum in the previous version of the question (no longer visible) was stated to be 0, and is now finessed to be >0. (In what follows, I take units of mg/kg as implied, unless I flag that I am working with natural logarithms. In a similar spirit, I show more decimal places than the inputs and methods can justify, in case anyone wants to check the results with their own favourite software.) In fact, for measured concentrations that could be very small, we can guess that there is some minimum reportable or detectable amount, so the major problem is not at that end.

Empirical maxima for highly skewed distributions are just that: empirical maxima, which will vary enormously from sample to sample even if the underlying distribution is well defined and consistent.
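
As a rough illustration of that point, here is a minimal Python sketch that draws several samples of the same size as the study (n = 2521) from one fixed, strongly skewed distribution and compares their maxima. The lognormal shape and the parameters `mu=-6.0` and `sigma=3.0` are hypothetical, chosen only for illustration and not estimated from the publication; `sample_maxima` is a helper name invented here.

```python
import random


def sample_maxima(mu, sigma, n=2521, repeats=10, seed=1):
    """Return the maximum of each of `repeats` independent lognormal samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        max(rng.lognormvariate(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]


# Hypothetical parameters: they only serve to give a long right tail.
maxima = sample_maxima(mu=-6.0, sigma=3.0)

print("sample maxima, sorted:", [round(m, 1) for m in sorted(maxima)])
print("largest / smallest maximum:", round(max(maxima) / min(maxima), 1))
```

Every sample comes from the same distribution, so any spread in the printed maxima is sampling variation alone.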

Perhaps the biggest difficulty here is that there is absolutely no guarantee that any brand-name distribution (e.g. lognormal) will apply to satisfy the preferences of the investigator. Indeed, in this kind of problem the starting-point is probably a guess that some overall distribution is mixed with one or more rogue distributions (e.g. reflecting people, machines, plants with a serious contamination problem) in what is being observed.

A few sample calculations underline the difficulty. Suppose we take the 5%, 50% and 95% points on trust, work on a natural logarithmic scale, and make the very wild guess that the distribution is lognormal. Those results then point to a lognormal with mean -6.215 (the natural log of 0.002, the value used for the median in the appended calculations) and SD 1.683 (from the spread between the 5% and 95% points), both on the log scale. With those as a benchmark, ln(900) is 7.734 SD above the mean. That's not impossible, but it implies that we are playing a wild guessing game.
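
For anyone who prefers to check this outside Stata, here is a minimal Python sketch of the same arithmetic, using only the standard library. It assumes a lognormal and uses the same inputs as the appended Stata calculations, including 0.002 for the median; the variable names are my own.

```python
from math import log
from statistics import NormalDist

# Inputs in mg/kg, as used in the appended Stata calculations.
p5, p95 = 0.006, 1.525
median = 0.002   # the value the Stata log uses for the median
maximum = 900.0

z95 = NormalDist().inv_cdf(0.95)  # standard normal 95% point, about 1.645

# If ln(x) is normal, the 5% and 95% points lie z95 SDs either side of
# the log-scale mean, so their distance apart is 2 * z95 * sigma.
sigma = (log(p95) - log(p5)) / (2 * z95)

# The median of a lognormal is exp(mu), so mu is the log of the median.
mu = log(median)

# Distance of the reported maximum from the mean, in log-scale SDs.
z_max = (log(maximum) - mu) / sigma

print(f"sigma = {sigma:.4f}")
print(f"mu    = {mu:.4f}")
print(f"z_max = {z_max:.3f}")
```

The last figure differs slightly in the third decimal place from the Stata result because the Stata line divides by the rounded SD 1.683 rather than the unrounded value.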

Conversely, and again if the distribution is lognormal, the ratio mean/median itself implies an SD of 3.837 on the log scale (for a lognormal, mean/median = exp(SD^2 / 2)), which is much higher than the earlier guess. A factor of 2 inconsistency is not surprising, but not comforting either.
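
The same check can be sketched in Python, again assuming a lognormal and the inputs used in the appended Stata calculations (mean 3.153, median 0.002). The variable names are my own.

```python
from math import log, sqrt

mean = 3.153     # reported mean, mg/kg
median = 0.002   # the value the Stata log uses for the median

# Lognormal: mean = exp(mu + sigma^2 / 2) and median = exp(mu),
# so mean / median = exp(sigma^2 / 2) and sigma = sqrt(2 * ln(mean / median)).
sigma_from_ratio = sqrt(2 * log(mean / median))

sigma_from_percentiles = 1.683  # SD implied by the 5% and 95% points

print(f"sigma from mean/median:  {sigma_from_ratio:.4f}")
print(f"ratio of the two SDs:    {sigma_from_ratio / sigma_from_percentiles:.2f}")
```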

The backdrop here is that the lognormal is a distribution that is capable of being very skewed indeed.

In short, my summary, which does not go appreciably further than the stated information, is:

  • This is a very difficult problem.

  • The summary statistics alone point to an extremely skewed distribution.

  • Back-of-the-envelope calculations don't rule out a lognormal, but they hint that the right tail is so stretched out that the overall distribution may be something much more skewed than that. We don't have enough information to decide between assumptions that would imply quite different inferences.
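
To make the last point concrete, here is a minimal Python sketch of the experiment the question describes, run under each of the two inconsistent lognormal guesses. Everything in it is an assumption: the lognormal shape, the log-scale mean ln(0.002) from the appended calculations, and the reading of "right censor" as recording any value above the cap as the cap itself. The helper `capped_means` is a name invented here, and nothing guarantees that either simulated sample reproduces the published statistics.

```python
import random
from math import log
from statistics import fmean


def capped_means(mu, sigma, caps, n=2521, seed=1):
    """Simulate n lognormal values; return the mean before and after capping."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = [rng.lognormvariate(mu, sigma) for _ in range(n)]
    results = {"none": fmean(sample)}
    for cap in caps:
        # Assumed censoring rule: values above the cap are recorded as the cap.
        results[cap] = fmean([min(x, cap) for x in sample])
    return results


caps = [500, 100, 50, 10]
mu = log(0.002)  # log of the median value used in the appended calculations

for sigma in (1.683, 3.837):  # the two SD guesses from the calculations
    print(f"sigma = {sigma}")
    for cap, value in capped_means(mu, sigma, caps).items():
        print(f"  cap {cap}: mean = {value:.4f}")
```

Comparing the two sets of output shows how strongly the effect of censoring on the mean depends on which guess at the SD is adopted.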

Notes: My calculations, using Stata as a calculator, are appended.

. di invnormal(.05)
-1.6448536

. di invnormal(.95)
1.6448536

. di (ln(1.525) - ln(0.006)) / (2 * invnormal(.95))
1.6834295

. di ln(0.002)
-6.2146081

. di (ln(900) - ln(0.002)) / 1.683
7.7344046

. di sqrt(2 * ln(3.153/0.002))
3.8374373