[Math] How to find the probability density function of a set of random variables

density function, probability distributions, statistics

If I have a set of random variables $X_1$, $X_2$, $X_3$, …, how do I find their probability density function?

All I know is that $0 \leq X_i \leq 1$ and that they are continuous.

I have tried several solutions:

  • I tried sorting them first, then gradually working out the CDF, and from that the PDF.
  • I also tried binning them to turn them into categorical data, then getting a PDF by calculating the probability of each category.

However, I need the solution to be a function of the $X_i$, and this function has to be differentiable; neither sorting nor binning is a differentiable computation.

Does such a solution even exist?

Best Answer

There is no way to be sure what distribution gives rise to your data. First, there is no assurance that your data fit any 'named' distribution. Second, even if you guess the correct parametric distribution family, you still have to use the data to estimate the parameters. Here are several approaches that might be useful.

First, you might see whether a member of the beta family of distributions is a reasonable fit to your data. These distributions have support $(0, 1)$ and two shape parameters $\alpha > 0$ and $\beta > 0.$ Roughly speaking, these determine the shape of the density curve near 0 and near 1, respectively. It is possible to estimate $\alpha$ and $\beta$ from data. One way is to pick parameters that match the sample mean and variance (method of moments estimation). See the Wikipedia article on 'beta distribution' for particulars.
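For concreteness, the method-of-moments estimates have a closed form: with sample mean $m$ and variance $v,$ solving $\mu = \alpha/(\alpha+\beta)$ and $\sigma^2 = \alpha\beta/[(\alpha+\beta)^2(\alpha+\beta+1)]$ gives $\hat\alpha = m\,s$ and $\hat\beta = (1-m)\,s,$ where $s = m(1-m)/v - 1.$ Here is a minimal R sketch of that computation (an addition, not part of the original answer; x is simulated here, but any vector of observations in $(0,1)$ will do):

x <- rbeta(1000, 5, 10)          # example data; replace with your own
m <- mean(x);  v <- var(x)       # sample mean and variance
s.hat <- m*(1 - m)/v - 1         # equals alpha + beta under the beta model
alpha.hat <- m*s.hat             # method-of-moments estimate of alpha
beta.hat  <- (1 - m)*s.hat       # method-of-moments estimate of beta
c(alpha.hat, beta.hat)           # should be near 5 and 10 for this example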

Below is a histogram of a sample of 1000 observations simulated from the distribution $\mathrm{Beta}(\alpha = 5, \beta = 10),$ along with the density function (solid blue curve) of that particular distribution. (I used R statistical software and show the R code. R is available free of charge at www.r-project.org for Windows, Mac, and Linux.)

x = rbeta(1000, 5, 10)             # generate fake data
mean(x);  sd(x)                    # sample mean and SD
## 0.3320269
## 0.1163734
hist(x, prob=T, col="skyblue")
curve(dbeta(x, 5, 10), lwd=2, col="blue", add=T)   # pop density curve
lines(density(x), lwd=2, lty="dotted", col="red")  # density est from data

[Figure: histogram of the simulated sample with the Beta(5, 10) population density (solid blue) and a kernel density estimate (dotted red).]

It takes a lot of data to get really close estimates of the parameters. For example, here the sample mean is 0.3320 while the population mean is 0.3333. I will let you check how closely the sample and population variances match.
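(If you want to make that check, here is a quick sketch using the standard beta moment formulas; it is an addition, not part of the original answer.)

a <- 5;  b <- 10
a/(a + b)                              # population mean = 1/3
sqrt(a*b/((a + b)^2*(a + b + 1)))      # population SD, to compare with sd(x)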

A second method is to use a density estimate of your data. (See Wikipedia on 'density estimation' or google 'KDE' for 'kernel density estimator'.) The last line of code above puts the dotted red density estimate onto the plot. The function density(x) produces $(x,y)$ coordinates, which may be of use if a digitized approximation to the density is what you need.
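To make that concrete, here is a small sketch (an addition, not the original code) showing how to pull out those coordinates and turn the estimate into an ordinary R function that can be evaluated anywhere on $(0, 1)$:

d <- density(x, from=0, to=1)      # kernel density estimate restricted to (0, 1)
head(cbind(d$x, d$y))              # (x, y) coordinates of the estimated density
f.hat <- approxfun(d$x, d$y)       # linear interpolation of those coordinates
f.hat(0.25)                        # approximate density at 0.25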

By sorting the data, it seems to me that you are constructing an empirical cumulative distribution function (ECDF). ECDFs tend to match theoretical CDFs more closely than histograms or density estimates match theoretical density functions, partly because information is lost when data are sorted into bins to make a histogram. For a continuous distribution, you could try taking differences of the ECDF over small intervals to approximate the PDF, but I think density estimation is easier to use.
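As an illustration of that finite-difference idea (a sketch added here, assuming x is the simulated sample from above):

Fn <- ecdf(x)                          # empirical CDF as an R step function
grid <- seq(0, 1, by=0.05)             # endpoints of small intervals
slope <- diff(Fn(grid))/diff(grid)     # difference quotients approximate the PDF
cbind(head(grid, -1) + 0.025, slope)   # (interval midpoint, approximate density)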

Below is the ECDF of the data generated above (heavy black 'stairstep', increasing by $1/n$ at each sorted data point), along with the population CDF (thin blue curve). Black tick marks along the horizontal axis show the locations of the individual observations.

plot(ecdf(x), lwd=3)                        # empirical CDF of the data
curve(pbeta(x, 5, 10), col="blue", add=T)   # population CDF of Beta(5, 10)
rug(x)                                      # tick marks at individual observations

[Figure: ECDF of the simulated sample (heavy black step function) with the Beta(5, 10) population CDF (thin blue curve) and a rug of tick marks showing the observations.]

You do not say why you want to know the PDF for your data. By googling some of the terminology I have used here, you may be able to find a solution that matches your goals better than anything in my example.
