Solved – Taking the mean of a data set with a skewed distribution

distributions, mean

I'm running experiments that record the time my algorithm takes to solve a set of problem instances on a particular benchmark. Each problem has an associated difficulty in the range [1, n]. Ideally these would be evenly distributed across the difficulty spectrum, but this is not the case: the problem sample I have is skewed toward the easier end of the spectrum.

To account for this, I have grouped the problems by difficulty, e.g. [1-10], [11-20], … , [n-9, n]. Each interval contains at least 10 problems (usually more; 50+ is not uncommon), and I take the average time required to solve all problems in each interval. This gives me a clearer picture of how my algorithm performs on both easy and hard problems, with the caveat that the data is somewhat less reliable at the harder end of the spectrum.
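In code, the grouping looks roughly like this (a minimal sketch assuming pandas; the data, column names, and the difficulty range used for the bins are made up for illustration):

```python
import pandas as pd

# Hypothetical per-problem results: one row per solved problem.
df = pd.DataFrame({
    "difficulty": [3, 7, 9, 12, 15, 18, 24, 31, 42, 55],
    "time_sec":   [0.2, 0.3, 0.25, 0.8, 1.1, 0.9, 2.5, 4.0, 7.3, 12.1],
})

# Bin problems into difficulty intervals [1-10], [11-20], ... (here assuming n = 60).
df["interval"] = pd.cut(df["difficulty"], bins=range(0, 61, 10))

# Mean solve time per difficulty interval.
interval_means = df.groupby("interval", observed=True)["time_sec"].mean()
print(interval_means)

# Overall mean vs. mean of the interval means (the "average of averages").
print("overall mean:", df["time_sec"].mean())
print("mean of interval means:", interval_means.mean())
```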

First question: is this okay or are there some gotchas I haven't accounted for?

Next: For comparative purposes, I need to summarise performance on each benchmark as a single number. I am loath to simply take the average across all problems, as that figure is dominated by the many easy problems. Which brings me to…

Second question: can I take an average of all interval averages instead?

Best Answer

You have a hierarchy of measurements: the first level is multiple time measurements on problem number $i$ ($1\leq i \leq n$); the second level is multiple problems in the same difficulty group.

Level 1. The measured times follow a distribution. This distribution may be normal (if the run time is influenced by a large number of more or less independent factors), exponential (if the algorithm waits for a random event to occur), or something more complicated (e.g. multimodal, where the run time depends strongly on initial decisions). The average is useful for the normal and exponential cases, but may not be useful in the complicated cases without a large number of runs on the same problem. To determine the distribution of run times it may help to (a) pick a couple of problems and measure the run time with a large number of repetitions, and (b) think through the mechanism and details of the algorithm. You may find that a few repetitions are generally enough, or that many repetitions are needed and that the median may be a better statistic than the mean.
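One rough way to do (a) is to repeat the solver on a single problem and compare mean, median, and spread. A sketch, where `solve` and `problem` stand in for whatever your benchmark actually calls:

```python
import statistics
import time

def summarise_runtimes(solve, problem, repetitions=30):
    """Run `solve` on one problem many times and summarise the run-time distribution.

    `solve` and `problem` are placeholders for whatever the benchmark actually runs.
    """
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        solve(problem)
        times.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),
        "min": min(times),
        "max": max(times),
    }

# A large gap between mean and median, or a stdev comparable to the mean,
# suggests a skewed or multimodal distribution; in that case a few repetitions
# per problem will not be enough, and the median may serve better than the mean.
```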

Level 2. The difficulties within a difficulty group are not identical, but you consider them similar. The differences in run times between groups may be small or large. If the times within a difficulty group are close to each other compared with the differences between adjacent groups, it is not very important to find a perfect summary measure for each group; the mean will probably do. If, however, the differences between groups are small, you probably want the best possible summary measure for each group. In that case, again, the distribution of problem times within a difficulty group decides which method to use.
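A sketch of that comparison, again assuming pandas and made-up data: compute the per-interval spread and set it against the differences between adjacent interval means:

```python
import pandas as pd

# Hypothetical per-problem results, binned into difficulty intervals as in the question.
df = pd.DataFrame({
    "difficulty": [3, 7, 9, 12, 15, 18, 24, 27, 33, 38],
    "time_sec":   [0.2, 0.3, 0.25, 0.8, 1.1, 0.9, 2.5, 2.2, 4.0, 4.4],
})
df["interval"] = pd.cut(df["difficulty"], bins=range(0, 41, 10))

# Within-interval spread vs. between-interval differences.
summary = df.groupby("interval", observed=True)["time_sec"].agg(["count", "mean", "median", "std"])
print(summary)

# If the jumps between adjacent interval means dwarf the within-interval std,
# the exact choice of summary statistic for a group matters little.
print(summary["mean"].diff())
```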

I generally advise against using “a single number”, because expressing the level of uncertainty is usually almost as important as finding the most likely value.
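For example, each interval (or the benchmark as a whole) could be reported as a mean together with a bootstrap confidence interval rather than a bare number. A sketch with made-up run times:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_with_bootstrap_ci(times, n_boot=10_000, level=0.95):
    """Mean run time with a percentile-bootstrap confidence interval."""
    times = np.asarray(times, dtype=float)
    boot_means = rng.choice(times, size=(n_boot, times.size), replace=True).mean(axis=1)
    lo, hi = np.quantile(boot_means, [(1 - level) / 2, (1 + level) / 2])
    return times.mean(), lo, hi

# Made-up run times for one difficulty interval.
sample = [0.8, 1.1, 0.9, 1.4, 0.7, 2.5, 1.0, 0.95]
mean, lo, hi = mean_with_bootstrap_ci(sample)
print(f"mean {mean:.2f}s, 95% CI [{lo:.2f}s, {hi:.2f}s]")
```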
