I have a dataset with page view data for about 500,000 users, divided into two groups. Each user can visit up to 5 pages, each as many or as few times as they want. So for each user, I have the distribution of the number of visits across the pages. I would like to compare the 'average skew' of these distributions between the two groups. Roughly, users in one group are more likely to have distributions that look like {3,0,0,0,0}, and users in the other group are more likely to have distributions like {1,1,0,1,0}. I thought of using the average Gini coefficient or entropy for each group, but each user has so few observations. How can I compare the degree of skewness between the two groups?
Solved – Compare skewness of many distributions with few observations
distributions, entropy, gini, skewness
Best Answer
Seems like all you need is a reasonable score that quantifies how much disparity there is between site visits. Since you need to compute 500000 such scores, something simple seems best.
Note that options 2 and 3 are not scale-invariant, because they add $1$ (or some other constant) before dividing or taking logs, whereas the Gini index can be computed without any adjustment for the zeros and is scale-invariant. So the choice might be based on that. Is (10,0,0,0,0) really the same as (3,0,0,0,0), in terms of your behavioral model? And is (4,4,4,0,0) the same as (1,1,1,0,0)?
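To make the scale-invariance point concrete, here is a minimal sketch of a per-user Gini score averaged within a group. It assumes each user is a list of 5 non-negative visit counts and uses the mean-absolute-difference form of the Gini index. The names `gini` and `mean_gini` are made up for illustration, and returning NaN for a user with zero total visits is my own assumption, not something from the question.

```python
import math
from itertools import product


def gini(counts):
    """Gini index of one user's visit counts.

    Uses G = sum_i sum_j |x_i - x_j| / (2 * n * sum(x)).
    No adjustment for zeros is needed.
    """
    n = len(counts)
    total = sum(counts)
    if total == 0:
        # Assumption: a user with no visits at all has no distribution to score.
        return float("nan")
    abs_diff_sum = sum(abs(a - b) for a, b in product(counts, repeat=2))
    return abs_diff_sum / (2 * n * total)


def mean_gini(users):
    """Average Gini score over a group, skipping users with no visits."""
    scores = [g for g in (gini(u) for u in users) if not math.isnan(g)]
    return sum(scores) / len(scores)


# Scaling the counts leaves the score unchanged.
print(gini([3, 0, 0, 0, 0]), gini([10, 0, 0, 0, 0]))  # 0.8 0.8
print(gini([1, 1, 1, 0, 0]), gini([4, 4, 4, 0, 0]))   # 0.4 0.4
print(gini([1, 1, 0, 1, 0]))                          # 0.4
```

With 5 pages the score ranges from 0 (visits spread evenly) to 0.8 (all visits on one page), so the two example users from the question score 0.8 and 0.4. Comparing `mean_gini` between the two groups is then an ordinary two-sample comparison. If (10,0,0,0,0) should count as more extreme than (3,0,0,0,0), a scale-invariant score like this one is the wrong choice.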