Solved – Generating distributions with with given variance, skewness, and kurtosis

generative-modelskurtosisnormal distributionskewness

I would like to generate distributions that are as close to normal as possible, except for the deviations shows below. The options I've located have properties that are related to the families of distributions used to generate the deviations. This plagues every simulation study I've seen. I want something more random.

I would like to generate 62,500 distributions with sample size n=80 (or 77-83) that cover the following combinations: SD = 1, 1.5, 2, 3, 4; skewness = -2, -1, 0, 1, 2, excess kurtosis = -2, 0, 2, 4, 6 (So each unique combination is replicated 500 times.) The values given can vary by up to 15% (SD = 1.15 is okay).

I know how to generate normal distributions with an approximate mean and SD in R, which will have an approximate skewness and kurtosis of 0. I will reject results that don't fall in the +/- 15% guideline.

I know there are an infinite number of ways to cause skewness and kurtosis, and distributions with the same values for each can look very different. The goal is to have no discernible pattern in the differences (the family-wise characteristics of the modifying function).

This will be used to examine the statistical properties of distributions as skewness and/or kurtosis increases – such as Type 1 error robustness when doing ANOVA.

Best Answer

One way to do this is to start with discrete distributions, then modify them by adding continuous noise to get continuous distributions, if continuous distributions are desired. The nice thing about discrete distributions is that it is very easy to manipulate them to get various values of skewness, kurtosis, etc.

The following code only deals with skewness and kurtosis. To change the standard deviation parameter, all that is needed is to multiply the data values by a scale factor. (For example, multiplying $x$ by 2 increases the standard deviation twofold.)

Here is code to calculate the skewness and kurtosis of discrete distributions whose values are in "x" and whose associated probabilities are in "p."

skew <-function(x,p) {
 k = length(x)
 m = sum(x*p)
 v = sum( (x-m)^2 *p)
 m3 = sum( (x-m)^3 *p) 
 sk = m3/v^1.5
 return(sk)
}

kurt <-function(x,p) {
 k = length(x)
 m = sum(x*p)
 v = sum( (x-m)^2 *p)
 m4 = sum( (x-m)^4 *p) 
 k = m4/v^2
 return(k)
}

With this code it is possible to generate all kinds of skewness and kurtosis values by playing with the "x" and "p." For example, a flat-topped leptokurtic distribution can be generated as follows:

#Example 1: Flat-topped leptokurtic distribution
x = c(1:4,10)
p = c(.24,.24,.24,.24,.04)

skew(x,p)
kurt(x,p)
plot(x,p, type="h", lwd=2, ylim = c(0, max(p)*1.2))

The skewness of this distribution is 2.24, the kurtosis is 9.80, and its graph is as follows:

Flat-topped leptokurtic

If a data set is needed, you can sample from the distribution as follows:

set.seed(12345)
n=10000
x.sample = sample(x, n, replace=T, p)

If continuous data is needed you can jitter or add noise:

x.sample = x.sample + .2*rnorm(n)

The skewness, kurtosis, and distributional shape properties of the smoothed sample are similar to those of the discrete distribution, as shown by the following code:

 library(moments)
 skewness(x.sample)
 kurtosis(x.sample)
 hist(x.sample, breaks=30, main = "Flat-topped but Leptokurtic")

The sample skewness and kurtosis are 2.19 and 9.74, and the histogram looks as follows: Flat-topped leptokurtic histo

As another example, you can easily create an example of data that are "peaked" but platykurtic, as follows:

# Example 2: Peaked platykurtic distribution
x = 1:9
p = c(rep(.08,4), .36, rep(.08,4))
skew(x,p)
kurt(x,p)
plot(x,p, type="h", lwd=2, ylim = c(0, max(p)*1.2))
xs = sample(x, n, replace=T, p) + .2*rnorm(n)
skewness(xs)
kurtosis(xs)
hist(xs, breaks=30, main="Peaked but Platykurtic")

The skewness and kurtosis of the discrete distribution are 0 and 2.46 (<3 implies platykurtic), and the smoothed data sample have similar values. The histogram of the continuously smoothed data set illustrates the peakedness (despite being platykurtic) clearly:

Peaked but platykurtic histogram

A more difficult problem is to start with skewness and kurtosis values, and have the computer automatically select x and p to give those values. The optimization routines in R can help here, but there are difficulties in that there may be infinitely many solutions, or no solutions at all as whuber noted in a comment.