Solved – Compare skewness of many distributions with few observations

distributions, entropy, gini, skewness

I have a dataset with page view data for about 500,000 users, divided into two groups. Each user can visit up to 5 pages, each as many or as few times as they want. So for each user, I have the distribution of number of visits to each page. I would like to compare the 'average skew' in distribution between the two groups. Roughly, users in one group are more likely to have distributions that look like this {3,0,0,0,0} and users in the other group are more likely to have distributions like this {1,1,0,1,0}. How can I compare the degree of skewness between the two groups? I thought of using the average Gini coefficient or entropy for each group, but each user has so few observations. How can I do this?

Best Answer

Seems like all you need is a reasonable score that quantifies how much disparity there is between site visits. Since you need to compute 500000 such scores, something simple seems best.

  1. Maybe your first thought is the best one - the Gini index.
  2. Here's another simple one: After ordering the counts $y_1\le y_2\le\ldots\le y_5$, compute $$\frac{y_5-y_3+1}{y_3-y_1+1}$$ So for (3,0,0,0,0) the score is $\frac{3+1}{0+1}=4$, and for (1,1,1,0,0) it is $\frac{0+1}{1+1}=0.5$. The idea is that $y_5-y_3$ is the difference between the max and the median, and $y_3-y_1$ is the difference between the median and the min, so you're comparing the two halves of the distribution. $1$ is added to each so you never end up dividing by $0$. Like the Gini index, this is a function of the order statistics.
  3. Another simple measure of disparity is the SD of the logs. For strictly positive data, $SD(\log ay)=SD(\log y)$ for any $a > 0$, so it is scale-invariant: it measures relative variation in the data. However, your counts include zeros, so you'd have to add some constant before logging to avoid taking the log of $0$. In your examples, the sample SDs of $\log(y_i+1)$ (natural log) are $0.62$ and $0.38$ respectively.
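
For what it's worth, here is a rough sketch of the three scores in Python with NumPy. The function names are my own, the Gini version is the plain one without a small-sample correction, and returning 0 for a user with no visits at all is just an assumption you may want to change.

```python
import numpy as np

def gini(counts):
    """Gini index from the order statistics (no small-sample correction)."""
    y = np.sort(np.asarray(counts, dtype=float))
    n = y.size
    total = y.sum()
    if total == 0:
        return 0.0  # assumption: a user with no visits gets score 0
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * y) / (n * total) - (n + 1) / n)

def half_range_ratio(counts):
    """(max - median + 1) / (median - min + 1), written for exactly 5 counts."""
    y = np.sort(np.asarray(counts, dtype=float))
    return float((y[4] - y[2] + 1) / (y[2] - y[0] + 1))

def sd_log1p(counts):
    """Sample SD (ddof=1) of log(y + 1), using the natural log."""
    y = np.asarray(counts, dtype=float)
    return float(np.std(np.log1p(y), ddof=1))

for user in [(3, 0, 0, 0, 0), (1, 1, 1, 0, 0)]:
    print(user, gini(user), half_range_ratio(user), round(sd_log1p(user), 2))
```

With one score per user, comparing the two groups comes down to comparing the two sets of scores, for example by their means.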

Note that options 2 and 3 are not scale-invariant, due to adding $1$ (or something) before dividing or logging, whereas the Gini index can be computed without any adjustment for the zeros, and is scale-invariant. So the choice might be based on that. Is (10,0,0,0,0) really the same as (3,0,0,0,0), in terms of your behavioral model? And is (4,4,4,0,0) the same as (1,1,1,0,0)?
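
A quick way to see the difference is to score the pairs of examples just mentioned. This is only a sketch: the helper names are made up, and the Gini index is again the uncorrected version.

```python
import numpy as np

def gini(counts):
    """Uncorrected Gini index; assumes at least one nonzero count."""
    y = np.sort(np.asarray(counts, dtype=float))
    n = y.size
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * y) / (n * y.sum()) - (n + 1) / n)

def half_range_ratio(counts):
    """Option 2 for exactly 5 counts: (max - median + 1) / (median - min + 1)."""
    y = np.sort(np.asarray(counts, dtype=float))
    return float((y[4] - y[2] + 1) / (y[2] - y[0] + 1))

pairs = [
    ((3, 0, 0, 0, 0), (10, 0, 0, 0, 0)),
    ((1, 1, 1, 0, 0), (4, 4, 4, 0, 0)),
]
for small, large in pairs:
    # The Gini index agrees within each pair; the ratio score does not.
    print("gini :", gini(small), gini(large))
    print("ratio:", half_range_ratio(small), half_range_ratio(large))
```

The Gini index treats each pair as identical, while the ratio in option 2 moves from 4 to 11 for the first pair and from 0.5 to 0.2 for the second, so which behaviour you want depends on how you answer those two questions.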
