Distributions Visualization – Methods to Visualize Distribution with Many and Extreme Outliers Using Box Plots

boxplot distributions python

I have values with extreme outliers and want to visualize them. But a box plot doesn't seem like a good choice for my data, as the plot produced by the MWE below shows.


Most of the values are less than 50,000, but some of them are over 1 million.

What type of graph/figure is a good choice for data like this?

Here is an MWE that creates such data:

#!/usr/bin/env python3
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0)
sns.set_theme()

# 40.000 values from 500 to 10.000
vals = np.random.choice(range(500, 10000), 40000, replace=True)
# 2.000 values from 20.000 to 100.000
vals = np.append(vals, np.random.choice(range(20000, 100000), 2000, replace=True))
# 300 values (extreme outliers) from 1 to 4 million
vals = np.append(vals, np.random.choice(range(1000000, 4000000), 300, replace=True))

sns.boxplot(vals)
plt.show()

EDIT: The example data is not contrived. This distribution is very close to a real data set.

EDIT 2: The values are currency amounts (in €), namely costs. And of course I will dive deeper into the data to find out why some persons cause so much higher costs than others.

Best Answer

I think your example data is a bit contrived, but all you need to consider is constructing a histogram or nonparametric density estimate of the log of the "extreme" data. (If your data contains zero or negative values, the log is undefined and something else will need to be used.)

I don't know Python, but I assume there must be standard functions to produce such displays. In R the (essentially) equivalent commands would be the following:

# 40.000 values from 500 to 10.000
vals1 <- runif(40000, 500, 10000)
# 2.000 values from 20.000 to 100.000
vals2 <- runif(2000, 20000, 100000)
# 300 values (extreme outliers) from 1 to 4 million
vals3 <- runif(300, 1000000, 4000000)
vals <- c(vals1, vals2, vals3)

# Histogram
hist(log(vals), breaks="Freedman-Diaconis", xlim=c(6,16), ylim=c(0,1), freq=FALSE, 
  axes=FALSE)
axis(1, 2*c(3:8), pos=0)
axis(2, c(0:10)/10, las=1)

# Nonparametric density estimate
lines(density(log(vals)), col="red", lwd=3)

Histogram and nonparametric density estimate of log of the data
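
For the Python side of the question, a rough equivalent of the R histogram could look like the sketch below. It assumes only NumPy and Matplotlib; the seeded generator and the variable names are my own choices rather than anything from the original post, and the data is regenerated so the block runs on its own. `bins="fd"` asks NumPy for Freedman-Diaconis bin edges, which mirrors the `breaks` argument used above.

```python
import matplotlib.pyplot as plt
import numpy as np

# Regenerate data shaped like the question's MWE (seeded generator is an assumption).
rng = np.random.default_rng(0)
vals = np.concatenate([
    rng.integers(500, 10_000, size=40_000),          # bulk of the values
    rng.integers(20_000, 100_000, size=2_000),       # moderate outliers
    rng.integers(1_000_000, 4_000_000, size=300),    # extreme outliers
])

# Natural log, as in the R code; only valid for strictly positive values.
log_vals = np.log(vals)

# Freedman-Diaconis bin edges, then a histogram normalised to unit area.
edges = np.histogram_bin_edges(log_vals, bins="fd")
density, _ = np.histogram(log_vals, bins=edges, density=True)

fig, ax = plt.subplots()
ax.hist(log_vals, bins=edges, density=True)
ax.set_xlabel("log(value)")
ax.set_ylabel("density")
```

Call `plt.show()` to display the figure. If a density curve like the red line in the R plot is wanted as well, seaborn's `histplot` accepts `kde=True` and could be used on `log_vals` instead.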

Even a box plot looks somewhat better on the log scale (though you still lose insight into the distribution, since with this many data points a more complete description of the data is possible):

boxplot(log(vals))

boxplot of log of data
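
A Python counterpart of `boxplot(log(vals))`, again only as a sketch under the same assumptions (NumPy and Matplotlib, regenerated data, my own variable names). Since the values are costs, it also converts the log-scale quartiles back to € with `np.exp`, using Matplotlib's `boxplot_stats` helper to get the numbers the box is drawn from.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.cbook import boxplot_stats

# Same synthetic data as in the question's MWE (seeded generator is an assumption).
rng = np.random.default_rng(0)
vals = np.concatenate([
    rng.integers(500, 10_000, size=40_000),
    rng.integers(20_000, 100_000, size=2_000),
    rng.integers(1_000_000, 4_000_000, size=300),
])
log_vals = np.log(vals)  # requires strictly positive values

# Box plot of the log-transformed values.
fig, ax = plt.subplots()
ax.boxplot(log_vals)
ax.set_ylabel("log(cost in €)")

# Quartiles are computed on the log scale; exponentiate to read them in €.
stats = boxplot_stats(log_vals)[0]
for key in ("q1", "med", "q3"):
    print(f"{key}: log = {stats[key]:.2f}, back-transformed = {np.exp(stats[key]):.0f} €")
```

Note that the whiskers and the points flagged as outliers are determined on the log scale here, so they will not match those of the box plot on the raw values.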