Solved – What fat-tailed distributions can I fit to the data?

Tags: distributions, nonparametric

I have 490 data points, which are very unlikely to be i.i.d. Below is a summary in millions of dollars. My goal is to fit a distribution whose 99.9th quantile captures the 70.22M maximum.

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.00854 0.01135 0.01588 0.18370 0.02997 70.22000

Lognormal, log-gamma, generalized Pareto, and the two-parameter g-and-h distribution have all failed so far. The generalized Pareto has outperformed the others, with a 99.9th quantile of ~0.65M.

I would like to report my current approach to this problem:
I declare the max of 70.22 to be an outlier; the rest fits a GPD reasonably well, passing the Kolmogorov–Smirnov (KS) and Anderson–Darling (AD) tests so far. A sketch of this approach is given below.
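For concreteness, here is a minimal sketch of that approach in R, assuming the `evd` package for the GPD fit; the data, threshold choice, and parameter values below are illustrative stand-ins, not the actual losses.

```r
## Sketch: set the maximum aside, fit a GPD to threshold excesses, and
## check the fit with a KS test. Requires the 'evd' package.
library(evd)

set.seed(1)
x <- c(rlnorm(489, meanlog = -4, sdlog = 0.5), 70.22)  # stand-in for the real losses
x_trim <- x[x < max(x)]            # treat the 70.22 maximum as an outlier

u   <- quantile(x_trim, 0.90)      # illustrative threshold choice
fit <- fpot(x_trim, threshold = u) # ML fit of the GPD to excesses over u
sc  <- fit$estimate["scale"]
sh  <- fit$estimate["shape"]

## KS test of the excesses against the fitted GPD (p-values are optimistic
## because the parameters were estimated from the same data)
excess <- x_trim[x_trim > u] - u
ks.test(excess, pgpd, scale = sc, shape = sh)

## Implied 99.9th quantile of the trimmed loss distribution
p_u <- mean(x_trim <= u)
u + qgpd((0.999 - p_u) / (1 - p_u), scale = sc, shape = sh)
```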

Best Answer

This does not necessarily answer your question but hopefully it will help you develop your question and the goals of your analysis.

Many individuals, when challenged with estimating a distribution for their data, think that the best/only way to do so is by going "shopping" among a finite number of parametric distributions and seeing which one fits best (e.g. log-normal, gamma, Pareto, etc.).

While distribution shopping sometimes has its benefits, it's often not the best way to go. This is especially true when you have a large sample size, cannot assume your data are iid, or are not particularly interested in estimating population parameters via maximum likelihood.

Below are two alternative, non-parametric techniques that estimate the distribution of your data. They are very easy to implement. These techniques are often applied to non-iid data*.

I also comment more on non-iid assumptions below, and how you may be able to improve your question and get an even better answer by providing more context.

*(This last sentence is sort of a rabbit hole because the non-parametric estimators below are derived with iid assumptions. However, even non-identical dependent data can still converge on a stationary distribution...help from another user here would be appreciated)

If All You Care About Are Quantiles...

Use an empirical distribution function:

Given your data $x_1,x_2,...,x_n$ you can estimate the quantile of any value $t$, denoted $\hat F_n(t)$, with the following equation: $$ \hat F_n(t) = \frac{1}{n+1} \sum_{i=1}^n \mathbf{1}_{x_i \le t} $$ where $\mathbf{1}_{x_i \le t}$ is an indicator which equals 1 if $x_i \le t$ is true and zero otherwise. In your case $n=490$, and the quantile of your max value, $70.22$, is approximately $99.8$%.
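As a quick sketch of this in R (the sample below is a simulated stand-in; in your case `x` would be the 490 observations):

```r
## Empirical distribution function with the n+1 denominator used above
F_hat <- function(t, x) sum(x <= t) / (length(x) + 1)

## Simulated stand-in for the data
set.seed(1)
x <- rgamma(490, shape = 2, rate = 0.1)

## Estimated quantile of the sample maximum: 490/491 ~ 0.998
F_hat(max(x), x)
```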

You can also use the empirical quantiles to estimate the probability that $X$ will be in any interval i.e. $$ Pr(t_1< X \leq t_2) = \hat F(t_2)-\hat F(t_1) $$
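Reusing `F_hat` and `x` from the snippet above, the interval probability is just a difference of two such estimates:

```r
## Estimated P(t1 < X <= t2)
interval_prob <- function(t1, t2, x) F_hat(t2, x) - F_hat(t1, x)

## e.g. the estimated probability of falling between the median and the 3rd quartile
interval_prob(median(x), quantile(x, 0.75), x)
```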

To show this works, I generate a sample of size 490 from a Gamma distribution with parameters $\alpha=2$ and $\beta=0.1$ and empirically estimate the quantiles. I then plot my estimates against the true quantiles (which are calculated using the Gamma distribution).

[Figure: empirical quantile estimates vs. true Gamma quantiles]

As you can see they are very close, not perfect, but close. The estimates get better with increased sample size.
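Here is one way such a comparison could be reproduced in R (the rate parameterization of the Gamma, and the use of a Q-Q-style plot, are my assumptions):

```r
## Compare empirical quantiles with the true Gamma quantiles
set.seed(1)
n <- 490
x <- rgamma(n, shape = 2, rate = 0.1)

p      <- (1:n) / (n + 1)                 # plotting positions matching F_hat above
emp_q  <- sort(x)                         # empirical quantiles
true_q <- qgamma(p, shape = 2, rate = 0.1)

plot(true_q, emp_q,
     xlab = "True Gamma quantiles", ylab = "Empirical quantiles",
     main = "Empirical vs. true quantiles (n = 490)")
abline(0, 1, lty = 2)                     # 45-degree reference line
```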

If You Want a Density Estimate, $\hat f(x)$...

You can use Kernel Density Estimation (KDE)

This technique is related to mixture distributions (it can be interpreted as a mixture). The density estimate $\hat{f}_h(t)$ can be calculated as: $$ \hat{f}_h(t) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{t-x_i}{h}\Big) $$ KDE requires you to choose a kernel function $K$ (a common choice is the normal pdf) and a bandwidth $h$. Generally the choice of $h$ is the one that affects the KDE most drastically. Discussion about choosing $h$ and $K$ is provided in the link above and the references cited therein.
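As a sketch of that formula, here is a direct implementation with a normal kernel, checked against base R's `density()`; the rule-of-thumb bandwidth is just an illustrative choice:

```r
## Direct implementation of the KDE formula with a normal kernel K
kde <- function(t, x, h) sapply(t, function(ti) mean(dnorm((ti - x) / h)) / h)

set.seed(1)
x <- rgamma(490, shape = 2, rate = 0.1)   # same simulated sample as above
h <- bw.nrd0(x)                           # rule-of-thumb bandwidth

grid <- seq(min(x), max(x), length.out = 200)
plot(grid, kde(grid, x, h), type = "l",
     xlab = "t", ylab = "density", main = "Kernel density estimate")
lines(density(x, bw = h), lty = 2)                          # built-in KDE, same bandwidth
lines(grid, dgamma(grid, shape = 2, rate = 0.1), lty = 3)   # true Gamma density
```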

Below I also provide a plot of the KDE against the true density using the same sample as above.

[Figure: kernel density estimate vs. true Gamma density]

Important Note on non-iid

I understand that you want to use something robust to the iid assumption because, after all, how do you know for sure that the data are actually iid? However, it is important that you present some reasoning for why your data may be (1) not independently distributed and (2) not identically distributed. For example:

"my observations are household income from multiple geographic regions. I know observations from similar regions tend to be correlated (not-independent) and the variance of household income within each region can vary (not-identical). Unfortunately I do not know which geographical region each observation comes from (the data are unlabeled)"

Providing this context would give someone enough information to formulate another model (like a mixture) that could potentially outperform my suggestions above. Articulating the context of your problem also helps you develop a better understanding of your data and the goals of your analysis, e.g. (continuing with the above example):

"Do I really just want a distribution of all the data mixed together? or perhaps, would it be more informative to also group the data into clusters and summarize them seperately?"
