Describing / fitting a highly skewed distribution

curve fitting, data visualization, distributions

I've got a data set of 84,529 entries, each giving the number of times a particular item is cited in a database. The set is extremely skewed, ranging from entries with 0 citations to one with over 45,000 citations. The median is 16 citations, and it can be shown that a relatively small number of entries account for the vast majority of citations. The data set is here, and a log(x+1) histogram is shown below.

[Figure: log(x+1) histogram of the citation counts]
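For reference, the concentration claim can be checked with something like the sketch below. The counts are made up for illustration (they are not the real data set), and top_share is a hypothetical helper, not part of any library.

```python
def top_share(counts, fraction):
    """Share of all citations held by the most-cited `fraction` of entries."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(counts, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # always keep at least one entry
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


# Made-up citation counts, NOT the real data set.
toy_counts = [0, 0, 1, 2, 3, 5, 8, 16, 40, 925]
share = top_share(toy_counts, 0.10)
print(f"top 10% of entries hold {share:.1%} of citations")
```

Running the same function over the real counts for a few fractions (say 1%, 5%, 10%) gives a compact summary of how concentrated the citations are.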

My perhaps stupid question: is there a standard way to describe this kind of distribution? Naively, I figured it might be a generalised Pareto distribution, given the dominance of a small number of terms. Toying with log-log plots didn't convince me of this, but I don't know enough about them to assert anything confidently.

Best Answer

You should consider fitting several candidate distributions and comparing them. One handy Python library for comparing candidate distributions against empirical data is powerlaw. It comes with a paper that explains how to use it. For example:

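import pandas as pd
import powerlaw

# One citation count per line, no header row.
data = pd.read_csv("sampledata.txt", header=None)

# Fit the tail of the data; xmin is chosen automatically.
# Counts are integers, so discrete=True is also worth trying.
fit = powerlaw.Fit(data[0].values)

# Empirical complementary CDF, with each candidate fit overlaid.
ax = fit.plot_ccdf(marker="o", ls="", ms=2, color="k")
fit.power_law.plot_ccdf(ax=ax, label="power law")
fit.truncated_power_law.plot_ccdf(ax=ax, label="truncated power law")
fit.lognormal.plot_ccdf(ax=ax, label="lognormal")
ax.legend()
ax.set_ylabel(r"$P(X\geq x)$")
ax.set_xlabel(r"$x$")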

[Figure: empirical CCDF of the data with the fitted power law, truncated power law, and lognormal]