Solved – Is taking the median of a set of percentages statistically sound

Tags: median, percentage, population, sampling

I have several sets of data. Unfortunately, the data comes to me in a "summary" form. My job is to consolidate the several data sources into one general summary. I'm currently using the median to summarise the data, but I don't know if this is statistically sound. Here's a description of my problem:

There are $N_P$ samples of varying sizes, all drawn from a single population. Neither the sample sizes nor the standard deviations are known. Each sample can be divided into $N_Q$ disjoint groups (or qualities). From each sample, the only data that is known is what percentage of the sample falls within each group (or category). For example, sample $A$ contains $x\%$ of $a$, $y\%$ of $b$ and $z\%$ of $c$.

The different samples are not disjoint, so a single item might be in several of the samples, but I don't know how much overlap there is. There are 5-8 different samples with 5-7 categories. A smaller example table follows.

            cat. a    cat. b    cat. c    
sample A    47.34%    30.05%    11.92%
sample B    41.60%    29.90%    11.90%
sample C    47.74%    29.67%    12.69%
--------    ------    ------    ------
median      47.34%    29.90%    11.92%

Now is it statistically sound to create this "median" summary, which takes each group from the different samples and finds the median? Maybe I should be using the mean? The problem I'm seeing is the "median sample" usually sums to less than 100%, even though the percentages from each sample sum to 100%. Should this matter?

Sample sizes: 100k - 100m
Population size: ~1 billion
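
The observation that the per-category medians need not sum to 100% can be reproduced with a small sketch. The proportions below are hypothetical and chosen only for illustration; they are not the data from the table above. The per-category mean of rows that each sum to 1 always sums to 1, whereas the per-category median carries no such guarantee.

```python
from statistics import mean, median

# Hypothetical proportions (NOT the asker's data): each row is one sample
# and sums to 1 across three categories.
samples = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
]

# Transpose so that each tuple holds one category's values across samples.
columns = list(zip(*samples))

median_summary = [median(col) for col in columns]
mean_summary = [mean(col) for col in columns]

print("sum of per-category medians:", round(sum(median_summary), 4))
print("sum of per-category means:  ", round(sum(mean_summary), 4))
```

Here every category has a median of 0.3, so the "median sample" sums to 0.9, while the means sum to 1.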

Best Answer

What you are doing does not make sense if your goal is to describe what proportion of the entire population (sample A + sample B + sample C) is in categories a, b, and c. Consider the following contingency table, shown first as counts and then as row proportions:

Counts              Row proportions
   a   b   c           a     b     c
A  8   1   1        A  .8    .1    .1
B  7   2   1        B  .7    .2    .1
C  1  13  16        C  .03   .43   .53

Then, for example, the median of the category a proportions is 0.7 and the mean is 0.51, but only 16/50 = 0.32 of all the observations are in column a. Likewise, the median of the category c proportions is 0.1, but 18/50 = 0.36 of the observations are in column c. Does the "median summary" you propose tell you anything meaningful in a situation such as this one? Unless you have the marginal counts of either the samples or the categories, or you are willing to make some assumptions about them, I don't think there is a whole lot you can do in this case.
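
The arithmetic above can be checked with a short Python sketch. It uses only the counts from the contingency table; the variable and function names are illustrative choices, not part of the original answer. The "pooled" figure is the share of all observations in a category, which is the same as a mean of the row proportions weighted by sample size, and that is exactly the quantity that cannot be computed when the sample sizes are unknown.

```python
from statistics import mean, median

# Counts from the contingency table above: rows are samples A, B, C;
# columns are categories a, b, c.
counts = {
    "A": [8, 1, 1],
    "B": [7, 2, 1],
    "C": [1, 13, 16],
}
categories = ["a", "b", "c"]

def row_proportions(row):
    """Convert one sample's counts into proportions of that sample."""
    total = sum(row)
    return [x / total for x in row]

def column(table, j):
    """Collect the j-th entry of every row in a dict of rows."""
    return [row[j] for row in table.values()]

props = {name: row_proportions(row) for name, row in counts.items()}
grand_total = sum(sum(row) for row in counts.values())

summary = {}
for j, cat in enumerate(categories):
    summary[cat] = {
        "median": median(column(props, j)),   # unweighted median of proportions
        "mean": mean(column(props, j)),       # unweighted mean of proportions
        "pooled": sum(column(counts, j)) / grand_total,  # needs the counts
    }

for cat, s in summary.items():
    print(f"{cat}: median={s['median']:.2f}  mean={s['mean']:.2f}  pooled={s['pooled']:.2f}")
```

For category a this gives a median of 0.70, a mean of 0.51 and a pooled proportion of 0.32, matching the figures in the text.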

Do you have any specific goals in mind? Also, how many categories and samples do you have?

Edit: Your sample/population phrasing is slightly confusing. It's better to say you "have 3 samples, each of which can be subdivided into 3 categories a, b, and c." The phrase "sample population" is troublesome, as is your reference to two different "populations."