Solved – Sampling from heavy vs light tailed distribution

cauchy distributionheavy-tailedt-distribution

I am having some issue understanding the behavior of such distributions when generating random numbers.
I was under the impression that heavy tailed distributions have "heavier" tails, so there is more probability to observe higher values, whereas lighter tailed distributions have values more concentrated in the body of the distribution. Is this correct?
I tried to sample from a Cauchy distribution (heavy distribution) and from a t-distribution (light) and plot the histogram. I am confused because I expected exactly the opposite of what I get. Here some example in R (the same results can be replicated with any statistical software)

set.seed(999)

heavy_data <- rcauchy(1000)
light_data <- rt(1000, 10)

hist(heavy_data)
hist(light_data)

cauchy

t-student

It looks like that from the cauchy distributions, all the observations are in the body with almost anything in the tails, whereas for the t-distributions we have a wider spread of data, so in the body as well as in the tail.

Could anyone clarify this?

Best Answer

Cauchy. The reason for the the strange histogram from Cauchy data is precisely because you are getting many extreme values in the tails--too sparse and too extreme to show well on your histogram. A data summary or boxplot might be more useful to visualize what's going on.

set.seed(999)
x = rcauchy(10000)
summary(x)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-5649.323    -0.970     0.021    -0.037     1.005  2944.847 
x.trnk = x[abs(x) < 200]  # omit a few extreme values for hist
length(x.trnk)
[1] 9971

par(mfrow=c(2,1))
dcauchy(0)
[1] 0.3183099   # Height needed for density plot in histogram
 hist(x.trnk, prob=T, br=100, ylim=c(0,.32), col="skyblue2")
  curve(dcauchy(x), add=T, col="red", n=10001)
 boxplot(x.trnk, horizontal=T, col="skyblue2", pch=20)
par(mfrow=c(1,1))

enter image description here

The standard Cauchy distribution (no parameters specified) is the same as Student's t distribution with DF = 1. The density function integrates to $1,$ as appropriate, but its tails are so heavy that the integral for its 'mean' diverges, so its mean doesn't exist. One speaks of its median as the center of the distribution.

Student's t, DF = 10. There is nothing particularly unusual about Student's t distribution with DF = 10. Its tails are somewhat heavier than for standard normal, but not so much heavier that it's hard to make a useful histogram (no truncation needed). And its mean is $\mu=0.$

y = rt(10000, 10)
summary(y)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-5.988219 -0.698855 -0.006711 -0.005902  0.685740  6.481538 
dt(0,10)
[1] 0.3891084
par(mfrow=c(2,1))
hist(y, prob=T, br=30, ylim=c(0,.4), col="skyblue2")
 curve(dt(x,10), add=T, col="red", n=10001)
boxplot(y, horizontal=T, col="skyblue2", pch=20)
par(mfrow=c(1,1))

enter image description here

The distribution $\mathsf{T}(10)$ is sufficiently heavy-tailed that samples from it as large as $n=10\,000$ tend to show many boxplot outliers---as seen above. In a simulation of $100\,000$ samples of size $10\,000,$ almost every sample showed at least one outlier and the average number of outliers per sample was more than 180. [This simulation runs slowly because each sample of $10,000$ needs to be sorted in order to determine its outliers.]

set.seed(2020)
nr.out = replicate(10^5, length(boxplot.stats(rt(10000,10))$out))
mean(nr.out)
[1] 188.5043
mean(nr.out>0)
[1] 1
Related Question