Measure the uniformity of a distribution over weekdays

distributions, measurement, probability, random variable, uniform distribution

I have a similar problem to the question asked here:

How does one measure the non-uniformity of a distribution?

I have a set of probability distributions over the days of the week. I want to measure how close each distribution is to (1/7,1/7,…,1/7).

At the moment I am using an answer from the above question: an L2 norm, which has value 1 when the distribution puts all its mass on a single day and is minimised for (1/7,1/7,…,1/7). I linearly scale this so it lies between 0 and 1, then flip it so that 0 means perfectly non-uniform and 1 means perfectly uniform.

This works pretty well, but I have one issue with it: it treats each weekday as an equal, independent dimension in 7-dimensional space, so it doesn't account for the nearness of days. In other words, it gives the same score to (1/2,1/2,0,0,0,0,0) and (1/2,0,0,1/2,0,0,0), even though in some sense the latter is more "spread out" and uniform and should ideally get a higher score. There is also the added complication that the ordering of days is circular.
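For concreteness, here is a minimal sketch of the scaled and flipped L2 score described above, assuming NumPy. The function name `l2_uniformity` is only illustrative. It shows that the two example distributions get exactly the same score.

```python
import numpy as np

def l2_uniformity(p):
    """Scaled, flipped L2 norm: 1 for the uniform distribution, 0 for a point mass."""
    p = np.asarray(p, dtype=float)
    norm = np.linalg.norm(p)            # equals 1 when all mass is on one day
    lowest = 1.0 / np.sqrt(p.size)      # L2 norm of the uniform distribution
    scaled = (norm - lowest) / (1.0 - lowest)   # linearly scaled to [0, 1]
    return 1.0 - scaled                 # flipped so that 1 means uniform

adjacent = [0.5, 0.5, 0, 0, 0, 0, 0]    # mass on two neighbouring days
spread = [0.5, 0, 0, 0.5, 0, 0, 0]      # mass on two days further apart

# Both calls return the same value, because the L2 norm ignores the day order.
print(l2_uniformity(adjacent), l2_uniformity(spread))
```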

How can I alter this heuristic to account for the nearness of days?

Best Answer

The earth mover distance, also known as the Wasserstein metric, measures the distance between two histograms. Essentially, it considers one histogram as a number of piles of dirt and then assesses how much dirt one needs to move and how far (!) to turn this histogram into the other. You would measure the distance between your distribution and a uniform one over the days of the week.

This of course accounts for the nearness of days - it's easier to move "dirt" from Monday to Tuesday than from Monday to Thursday, so (1/2,0,0,1/2,0,0,0) would have a lower earth mover distance from the uniform distribution than a histogram that is concentrated on Monday and Tuesday.
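As a rough illustration of the point above, here is a sketch in Python with NumPy (not the R package mentioned later). It assumes the days are equally spaced one unit apart and the week is treated as a line from Monday to Sunday, in which case the earth mover distance reduces to the sum of absolute differences between the two cumulative distributions. The function name `linear_emd_to_uniform` is my own.

```python
import numpy as np

def linear_emd_to_uniform(p):
    """Earth mover distance from p to the uniform distribution on a line of bins."""
    p = np.asarray(p, dtype=float)
    u = np.full(p.size, 1.0 / p.size)
    # The running surplus of p over u is the amount of dirt that has to be
    # carried across each boundary between neighbouring days.
    carried = np.cumsum(p - u)
    return np.abs(carried).sum()

adjacent = [0.5, 0.5, 0, 0, 0, 0, 0]    # Monday and Tuesday
spread = [0.5, 0, 0, 0.5, 0, 0, 0]      # Monday and Thursday

# Roughly 2.5 for the first and 1.5 for the second, up to rounding.
print(linear_emd_to_uniform(adjacent), linear_emd_to_uniform(spread))
```

The more spread-out histogram ends up closer to uniform, which is the behaviour the question asks for.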

What this does not do is consider the "circularity" of the week, i.e., that Saturday and Sunday are as close together as are Sunday and Monday. For that, you would need to look for an earth mover distance defined on circular probability mass distributions. This should be doable using a suitable optimization approach.
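One possible sketch of such a circular distance, again in Python with NumPy and under the assumption of equally spaced days: for one-dimensional circular histograms, a commonly cited closed form subtracts the median from the cumulative differences before summing their absolute values. Subtracting a constant corresponds to sending the same amount of dirt all the way around the circle, and the median is the constant that minimises the total. Treat this as an illustration rather than a vetted implementation. The name `circular_emd_to_uniform` is hypothetical.

```python
import numpy as np

def circular_emd_to_uniform(p):
    """Earth mover distance from p to uniform when the bins form a circle."""
    p = np.asarray(p, dtype=float)
    u = np.full(p.size, 1.0 / p.size)
    carried = np.cumsum(p - u)          # flow across each boundary on a line
    # A constant shift adds the same flow across every boundary, i.e. dirt sent
    # around the circle; the median minimises the sum of absolute flows.
    return np.abs(carried - np.median(carried)).sum()

adjacent = [0.5, 0.5, 0, 0, 0, 0, 0]
spread = [0.5, 0, 0, 0.5, 0, 0, 0]

print(circular_emd_to_uniform(adjacent), circular_emd_to_uniform(spread))
# Rotating the week does not change the result:
print(circular_emd_to_uniform(np.roll(adjacent, 3)))
```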


EDIT: In R, the emdist package calculates earth mover distances between histograms.

You can address the "circularity" issue in a fairly simple (though ad-hoc) way.

  • Calculate an earth mover distance $d_1$ between your distribution and a uniform distribution on Monday through Sunday.
  • Calculate a distance $d_2$ against a uniform distribution on Tuesday through Monday.
  • Calculate a distance $d_3$ against a uniform distribution on Wednesday through Tuesday.
  • ...
  • Finally, use the mean of $d_1, \dots, d_7$ as the final distance.

This takes care of the circularity at the expense of six additional calculations.
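The steps above could be sketched as follows in Python with NumPy (not R). Since the uniform distribution looks the same whichever day the week starts on, the sketch rotates the histogram instead and averages the seven linear distances. It assumes unit spacing between days, and both function names are made up for illustration.

```python
import numpy as np

def linear_emd_to_uniform(p):
    """Earth mover distance to uniform, treating the bins as a line."""
    p = np.asarray(p, dtype=float)
    u = np.full(p.size, 1.0 / p.size)
    return np.abs(np.cumsum(p - u)).sum()

def mean_rotated_emd(p):
    """Mean of the linear distances over every possible starting day."""
    p = np.asarray(p, dtype=float)
    distances = [linear_emd_to_uniform(np.roll(p, k)) for k in range(p.size)]
    return float(np.mean(distances))

adjacent = [0.5, 0.5, 0, 0, 0, 0, 0]
spread = [0.5, 0, 0, 0.5, 0, 0, 0]

# The spread-out histogram gets the smaller averaged distance.
print(mean_rotated_emd(adjacent), mean_rotated_emd(spread))
```

Because every rotation is included, the result no longer depends on which day is treated as the start of the week.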

2nd EDIT: this is not the circular earth mover distance as such. For that, you'd need to look through the literature on circular earth mover distances. If the best way to move dirt between days involves moving it two days from Saturday to Monday, this will show up in five out of the seven $d_i$, but not in the remaining two (where the dirt will need to be moved five days).

However, I'd still consider this a potentially useful way to at least consider the circularity in some manner - certainly better than just using a single histogram and defining the week as going from Sunday to Saturday or in some other arbitrary manner. Plus, while some links above turn up implementations for the circular earth mover distance, I'm not aware of one for R, which is probably the most-used language here.