How to best represent median proportions using stacked bar charts

data visualization, descriptive statistics, mean, median, r

I have a large dataset with several responses about an organization's labour costs. I would like to represent this data graphically. Typically what we do is provide a stacked bar chart (using R) of the percentages. However, there are some severe outliers in the data, and the mean becomes quite skewed by these outliers; obviously, the median is preferred. However, this becomes far more difficult to represent in a stacked bar chart. Here is an example of the data:

               Mean       Median
 Cost A    51.39647   49.5389570
 Cost B    15.07913    4.2676470
 Cost C     5.98685    0.2238874
 Cost D    27.53755   16.9540230
---------------------------------
 SUM      100.00000   70.9845144

As you can see, the means differ substantially from the medians, most of all for Cost B and Cost C. What I had originally done was make a stacked bar chart of the means; however, I am not satisfied with this, as I don't believe it's a good representation of the data. I would like to represent the medians.

I have thought about normalizing the median data, so that, for example, Cost A would become 49.54/70.98, but that would probably be even less representative than reporting the means.
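To make the concern concrete, here is a minimal R sketch of that rescaling, using only the medians from the table above. The variable names are just for illustration.

```r
# Medians copied from the table in the question (percent of labour costs).
medians <- c("Cost A" = 49.5389570, "Cost B" = 4.2676470,
             "Cost C" = 0.2238874, "Cost D" = 16.9540230)

# Rescale so that the four shares add up to 100.
normalized <- 100 * medians / sum(medians)

print(round(normalized, 2))
print(sum(normalized))
```

The rescaled share for Cost A comes out at roughly 70%, which is well above both its mean and its median, so the rescaled numbers no longer describe any typical company.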

Using the median, it would be easy to say "The average company spends 4% of its labour costs on Cost B." This works well for reporting one statistic. However, I want to report the statistic as a fraction of all the costs for the average company. Then I can no longer say what I said above, since the medians don't sum to 100. I need a good way to report, "The benchmark spending on Cost A is w%, on Cost B is x%, on Cost C is y%, and on Cost D is z%", so that sum(w,x,y,z)=100 and so that it is well representative of the population.
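A small made-up example shows why this happens. The three companies below are hypothetical, not taken from my data. Each row sums to 100, so the column means must also sum to 100, but nothing forces the column medians to do so.

```r
# Three hypothetical companies (rows), each splitting 100% across four costs.
shares <- rbind(c(50, 10, 0, 40),
                c(60,  0, 5, 35),
                c(10, 80, 0, 10))
colnames(shares) <- paste("Cost", LETTERS[1:4])

col_means   <- colMeans(shares)
col_medians <- apply(shares, 2, median)

print(sum(col_means))    # 100
print(sum(col_medians))  # 95
```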

My question is, does anybody know a good way to report this data, preferably in a stacked bar chart showing how the average company divides its spending among these four costs? The big idea is to get an understanding of how the "average" company divides its spending between these four costs.

Best Answer

The companies that are "severe outliers" w.r.t. one type of cost are probably also outliers w.r.t. the other cost types. Wouldn't it be easier to remove such companies completely from the analysis? Perhaps you can present two charts, one with and one without the outliers.
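A rough sketch of the two-chart idea in base R might look like the following. The data here are simulated purely for illustration, and the 1.5 × IQR rule used to flag outliers is my assumption, not something taken from your data. Substitute whatever rule makes sense for your companies. Because whole companies are dropped, the remaining mean shares still sum to 100.

```r
set.seed(1)  # synthetic data, for illustration only
n <- 200
raw <- matrix(rexp(n * 4, rate = 1 / c(50, 5, 1, 20)), ncol = 4, byrow = TRUE)
shares <- 100 * raw / rowSums(raw)   # each row (company) sums to 100
colnames(shares) <- paste("Cost", LETTERS[1:4])

# Assumed outlier rule: outside the 1.5 * IQR fences for that cost type.
is_outlier <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

flagged  <- apply(shares, 2, is_outlier)  # n x 4 logical matrix
flag_row <- rowSums(flagged) > 0          # company is an outlier on any cost

means_all     <- colMeans(shares)
means_trimmed <- colMeans(shares[!flag_row, , drop = FALSE])

# One stacked bar per version; both bars reach exactly 100.
barplot(cbind(All = means_all, "Outliers removed" = means_trimmed),
        legend.text = TRUE, ylab = "Mean share of labour costs (%)")
```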

Perhaps an even better solution is to look more carefully through your data. Is it the case that there are distinct cost structures in your data set, so that you can partition the companies accordingly? This might lead you to discover the underlying reasons behind the different cost structures (company size, age, revenue, etc.). If so, you should probably present a separate chart for each distinct group of companies.
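If you do not already have a grouping variable, one crude way to look for distinct cost structures is to cluster the companies on their cost shares. The sketch below uses `kmeans` from base R on simulated data. The two groups, the choice of two clusters, and clustering on raw percentages are all assumptions made for illustration. Each cluster centre is a mean of rows that sum to 100, so each group's bar also sums to 100.

```r
set.seed(42)  # synthetic data, for illustration only

# Helper (hypothetical): simulate n companies around a given cost profile.
make_shares <- function(n, centre) {
  raw <- matrix(rexp(n * 4, rate = 1 / centre), ncol = 4, byrow = TRUE)
  100 * raw / rowSums(raw)
}

shares <- rbind(make_shares(100, c(60, 5, 1, 34)),
                make_shares(100, c(20, 40, 10, 30)))
colnames(shares) <- paste("Cost", LETTERS[1:4])

# Assumed number of groups: 2. In practice this needs to be justified.
fit <- kmeans(shares, centers = 2, nstart = 10)
profiles <- fit$centers   # one row of mean shares per group

# One stacked bar per group of companies.
barplot(t(profiles), names.arg = paste("Group", 1:2),
        legend.text = TRUE, ylab = "Mean share of labour costs (%)")
```

Once groups are found this way, it is worth checking whether they line up with something interpretable such as company size, age, or revenue before presenting them.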
