Solved – Plot the probability mass function

Tags: data visualization, discrete data, matplotlib, python, scipy

I am trying to plot the probability mass function of a sample of a discrete metric.

If it were continuous, I know that with pandas it would be as simple as calling:

sample.plot(kind="density")

But I'm afraid that this is not enough (or not right) for my sample. Is there a function within matplotlib, scipy, numpy, etc. that I could use for plotting it?

Best Answer

There are two parts to your question - how to display discrete data (a data visualization issue) and how to do it in Python (a "what function do I call" issue).

I will deal with the first one.

With discrete distributions, there are a number of possible ways to display data.

Leaving aside direct implementation issues for the present, I see three main competitors:

  1. the empirical cdf.

  2. a sample probability function.

    These are quite suitable for count data, for example.

  3. a barplot.

    This is quite suitable for ordered categories. If you order the bars from largest to smallest (or in some other meaningful-to-your-needs fashion), it's also suitable for unordered categories.
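As a rough illustration of the three displays listed above, here is a minimal Python sketch. It is not taken from the original answer: the helper names (`empirical_pmf`, `empirical_cdf`, `plot_discrete`) and the example sample are hypothetical, and it assumes the sample is a plain sequence of discrete values (for a pandas Series, pass `sample.tolist()`).

```python
from collections import Counter
from itertools import accumulate


def empirical_pmf(sample):
    """Return (values, probabilities), with values sorted ascending."""
    counts = Counter(sample)
    n = len(sample)
    values = sorted(counts)
    probs = [counts[v] / n for v in values]
    return values, probs


def empirical_cdf(probs):
    """Running sum of the PMF: the ECDF height at each observed value."""
    return list(accumulate(probs))


def plot_discrete(sample):
    """Draw the three displays side by side and return the figure."""
    # Imported here so the helpers above also work without matplotlib.
    import matplotlib.pyplot as plt

    values, probs = empirical_pmf(sample)
    cdf = empirical_cdf(probs)

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3.5))

    # 1. Empirical CDF: a step function that jumps at each observed value.
    ax1.step(values, cdf, where="post")
    ax1.set_title("Empirical CDF")

    # 2. Sample probability function: one vertical line per value,
    #    with height equal to the relative frequency.
    ax2.vlines(values, 0, probs)
    ax2.plot(values, probs, "o")
    ax2.set_title("Sample probability function")

    # 3. Bar plot: values are treated as category labels, one bar each.
    ax3.bar([str(v) for v in values], probs)
    ax3.set_title("Bar plot")

    return fig


# Hypothetical example data (counts of some event).
sample = [0, 1, 1, 2, 2, 2, 3, 5]
values, probs = empirical_pmf(sample)
# values → [0, 1, 2, 3, 5]
# probs  → [0.125, 0.25, 0.375, 0.125, 0.125]
```

Calling `plot_discrete(sample)` followed by `matplotlib.pyplot.show()` should display the figure. If the data is already in pandas, `sample.value_counts(normalize=True).sort_index()` gives the same relative frequencies and can be plotted with `.plot(kind="bar")`.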

There are numerous other possibilities. However, I don't think a histogram is generally suitable for discrete data, especially not one where the bins are automatically chosen. The first problem is that a histogram density estimate uses area rather than height to convey relative probabilities, so it fairly directly conveys an impression of continuity. The second issue is with bin-width -- you need to choose it carefully or you may be doing things like having alternating bins either combining two categories or one, or perhaps having a smaller or larger gap between two categories than between the others (often an end-category):

[Figure: histogram with a smaller gap between the 0 and 1 categories than between the others]

As we can see, the gaps are not of constant width, which throws off the impression the plot conveys.
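The bin-width problem can be seen without drawing anything. The sketch below is only an illustration under stated assumptions: `categories_per_bin` is a hypothetical helper, and it assumes equal-width bins spanning the minimum to the maximum, half-open on the right except for the last bin (a common histogram convention). It reports which distinct values end up in each bin.

```python
def categories_per_bin(values, n_bins):
    """List the distinct values that fall into each equal-width bin.

    Bins span min(values)..max(values). Assumes at least two distinct values.
    """
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(set(values)):
        # Half-open bins [left, right); the maximum is placed in the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return bins


# Hypothetical integer-valued sample covering the values 0 through 6.
sample = [0, 1, 1, 2, 3, 3, 3, 4, 5, 6]
print(categories_per_bin(sample, n_bins=4))
# → [[0, 1], [2], [3, 4], [5, 6]]
```

With four bins over seven categories, some bars merge two categories while another holds only one, so bar heights are not comparable across categories.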

As for how to do this in Python once you have chosen a display, that would probably make a good, more specific question, though probably one that is more on topic elsewhere. Worded right, it might fit better on StackOverflow, but you should check their help for what's on topic. With careful phrasing it might survive here, or it might work on Superuser.