Probability – How to Create a Density Function from Data?

probability

I have read a lot about density functions, but what I am missing is how to create a density function if you have continuous values in data. For example, I have data with negative and positive values:

$$\text{Data} = (-20,30,21.4,2.3,-4.5).$$

My goal is to create a function that I could use in further calculations to find the probability that a value falls within a certain range. For example, what is the probability that a value lies between $-5$ and $-4.5$? Can someone please guide me step by step?

Best Answer

You are not specific about how many observations you have, what population they come from, and what you may know about the population. So I will give a brief example based on fifty observations. If what I say is not enough, then please edit your Question to be more specific, and maybe someone can give you more relevant help.

Suppose you have the following 50 observations, which I have sorted from smallest to largest. (Numbers in brackets [ ] are the indexes of the first number in each row.)

x
 [1] -18.3 -13.2  -7.1  -4.8  -3.4  -1.6  -0.5   2.3   3.0   3.4
[11]   3.6   3.6   4.6   9.0  11.2  12.0  13.7  16.2  16.2  17.2
[21]  17.4  18.6  18.8  19.1  20.0  20.7  22.0  22.2  22.3  22.4
[31]  22.8  25.2  25.7  26.6  27.1  27.6  29.5  32.1  32.4  34.7
[41]  35.4  35.4  35.8  36.8  39.5  40.2  52.6  53.0  53.1  54.2

1. Rough count. Only five values out of 50 lie in the interval $(-10, 0],$ so as a very rough guess based on little data, you might say that about $5/50$ths or 10% of the data lie in that interval.
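As a minimal sketch of this rough count in R, the fifty sorted observations listed above are typed in by hand and the fraction falling in $(-10, 0]$ is computed directly. The name `in_range` is only an illustrative choice.

```r
# The fifty observations from the listing above, entered by hand
x <- c(-18.3, -13.2, -7.1, -4.8, -3.4, -1.6, -0.5, 2.3, 3.0, 3.4,
       3.6, 3.6, 4.6, 9.0, 11.2, 12.0, 13.7, 16.2, 16.2, 17.2,
       17.4, 18.6, 18.8, 19.1, 20.0, 20.7, 22.0, 22.2, 22.3, 22.4,
       22.8, 25.2, 25.7, 26.6, 27.1, 27.6, 29.5, 32.1, 32.4, 34.7,
       35.4, 35.4, 35.8, 36.8, 39.5, 40.2, 52.6, 53.0, 53.1, 54.2)

# TRUE for each observation in the half-open interval (-10, 0]
in_range <- x > -10 & x <= 0

sum(in_range)    # number of observations in the interval: 5
mean(in_range)   # proportion of observations in the interval: 0.1
```

The same two lines work for any other interval, although with only fifty observations a narrow interval will contain very few points and the estimate will be crude.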

2. Histogram. You could make a histogram of the data. This is one way to get a rough idea of what the density function might look like. Here is a 'density histogram' of the fifty observations.

The vertical 'density scale' is arranged so that the total area of the bars is $1.$ Because exactly five of fifty observations lie in $(-10,0],$ the area of the bar above that interval is $5/50 = 0.1;$ its base is ten units long and its height is 0.01, so its area is $10 \times 0.01 = 0.1.$
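The bar heights and areas described here can be checked without drawing anything. The sketch below assumes bins of width ten running from $-20$ to $60$, which is consistent with the ten-unit base mentioned above but is stated explicitly here rather than left to R's default binning.

```r
# The fifty observations from the listing above, entered by hand
x <- c(-18.3, -13.2, -7.1, -4.8, -3.4, -1.6, -0.5, 2.3, 3.0, 3.4,
       3.6, 3.6, 4.6, 9.0, 11.2, 12.0, 13.7, 16.2, 16.2, 17.2,
       17.4, 18.6, 18.8, 19.1, 20.0, 20.7, 22.0, 22.2, 22.3, 22.4,
       22.8, 25.2, 25.7, 26.6, 27.1, 27.6, 29.5, 32.1, 32.4, 34.7,
       35.4, 35.4, 35.8, 36.8, 39.5, 40.2, 52.6, 53.0, 53.1, 54.2)

# Compute the histogram without plotting; bins are (a, b] by default
# Assumption: break points every 10 units from -20 to 60
h <- hist(x, breaks = seq(-20, 60, by = 10), plot = FALSE)

# Area of each bar = height on the density scale * bin width
bar_area <- h$density * diff(h$breaks)

h$counts[2]     # observations in (-10, 0]: 5
h$density[2]    # height of that bar: 0.01
bar_area[2]     # area of that bar: 0.1
sum(bar_area)   # total area of all bars: 1
```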

3. Normal assumption. If you believe the population from which the sample was taken had a normal distribution, then you might estimate the population mean $\mu$ as $\hat \mu = \bar X = 19.81$ and the population standard deviation $\sigma$ as $\hat \sigma = S = 17.15,$ where $\bar X$ and $S$ are the 'sample mean' and 'sample standard deviation', respectively. Superimposing the density curve for the distribution $\mathsf{Norm}(19.81, 17.15),$ as a blue curve, we have the following figure.

If you believe the sample comes from a normal population, you can use what is known about normal distributions to find that the distribution $\mathsf{Norm}(19.81, 17.15)$ puts about 8.3% of its probability in the interval $(-10, 0].$ [You might use software to find this probability or 'standardize' and use printed normal tables.]
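A short sketch of that calculation in R, done both directly and by standardizing first. It uses the unrounded sample mean and standard deviation, so the result may differ from the rounded figures in the last decimal places. The names `m`, `s`, `p_direct` and `p_std` are illustrative.

```r
# The fifty observations from the listing above, entered by hand
x <- c(-18.3, -13.2, -7.1, -4.8, -3.4, -1.6, -0.5, 2.3, 3.0, 3.4,
       3.6, 3.6, 4.6, 9.0, 11.2, 12.0, 13.7, 16.2, 16.2, 17.2,
       17.4, 18.6, 18.8, 19.1, 20.0, 20.7, 22.0, 22.2, 22.3, 22.4,
       22.8, 25.2, 25.7, 26.6, 27.1, 27.6, 29.5, 32.1, 32.4, 34.7,
       35.4, 35.4, 35.8, 36.8, 39.5, 40.2, 52.6, 53.0, 53.1, 54.2)

m <- mean(x)   # estimate of the population mean
s <- sd(x)     # estimate of the population standard deviation

# Direct route: difference of the fitted normal CDF at the endpoints
p_direct <- pnorm(0, mean = m, sd = s) - pnorm(-10, mean = m, sd = s)

# 'Standardize' route: convert endpoints to z-scores, then use the
# standard normal CDF (what a printed normal table gives you)
z <- (c(-10, 0) - m) / s
p_std <- diff(pnorm(z))

p_direct   # about 0.083
p_std      # same value as p_direct
```

Replacing the endpoints $-10$ and $0$ with any other pair gives the probability for that interval under the same normal assumption.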

4. Density estimator. Some modern computer programs can piece together curves of various shapes so as to approximate the density function of the population from which a sample was chosen. (A curve pieced together in this way is sometimes called a 'spline'.) One widely used method is 'kernel density estimation' (KDE), which places a small smooth curve (a 'kernel') at each observation and averages them. The red curve in the figure below shows a KDE based on our sample of fifty. You could integrate this KDE over $(-10,0]$ to see what percentage of the probability under the estimated density curve lies in that interval.
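One way to get a probability out of a KDE, sketched under the assumption that R's default settings for `density()` are used (Gaussian kernel, default bandwidth). A Gaussian KDE is an equal-weight mixture of normal distributions centered at the observations with standard deviation equal to the bandwidth, so the area over an interval can be computed exactly from normal CDFs. A trapezoid-rule sum over a grid is shown as a cross-check. The names `p_exact` and `p_trap` are illustrative.

```r
# The fifty observations from the listing above, entered by hand
x <- c(-18.3, -13.2, -7.1, -4.8, -3.4, -1.6, -0.5, 2.3, 3.0, 3.4,
       3.6, 3.6, 4.6, 9.0, 11.2, 12.0, 13.7, 16.2, 16.2, 17.2,
       17.4, 18.6, 18.8, 19.1, 20.0, 20.7, 22.0, 22.2, 22.3, 22.4,
       22.8, 25.2, 25.7, 26.6, 27.1, 27.6, 29.5, 32.1, 32.4, 34.7,
       35.4, 35.4, 35.8, 36.8, 39.5, 40.2, 52.6, 53.0, 53.1, 54.2)

kde <- density(x)   # Gaussian kernel and default bandwidth
bw  <- kde$bw       # bandwidth = standard deviation of each kernel

# Exact area under the Gaussian KDE over (-10, 0]:
# average, over observations, of the Norm(x_i, bw) probability
p_exact <- mean(pnorm(0, mean = x, sd = bw) - pnorm(-10, mean = x, sd = bw))

# Cross-check: evaluate the KDE on a fine grid over the interval
# and add up trapezoids
grid   <- density(x, bw = bw, from = -10, to = 0, n = 513)
p_trap <- sum(diff(grid$x) * (head(grid$y, -1) + tail(grid$y, -1)) / 2)

p_exact
p_trap   # should agree closely with p_exact
```

The exact formula only applies to a Gaussian kernel. For other kernels, the grid-based sum is the more general route.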

Notes: (a) For more information you can search on terminology I have put in 'single quotes'.

(b) Part 3 depends on making a particular assumption, whereas parts 2 and 4 assume only that data were sampled at random from a continuous distribution.

(c) I simulated the fifty observations as a random sample from $\mathsf{Norm}(\mu = 20,\, \sigma = 15).$ It happens in this case that the estimated normal distribution and the KDE are both remarkably good estimates of that normal distribution. Samples of size as small as fifty do not always give such nice results.

(d) Computations for the above estimates and figures were done in R software. In case it is of interest, some of the R code is shown below:

set.seed(1005); x = sort(round(rnorm(50, 20, 15),1))  # sd = 15, as in Note (c)
hist(x, prob=T, col="skyblue2", ylim=c(0,.025))
  abline(h = seq(0,.025, by= .005), col="green2")
sum(x > -10 & x <= 0)
[1] 5
mean(x)
[1] 19.806
sd(x)
[1] 17.14959
curve(dnorm(x, 19.81, 17.15), add=T, col="blue")
diff(pnorm(c(-10,0), 19.81, 17.15))
[1] 0.08293586
lines(density(x), type="l", col="red")

Related Solutions

R Probability KDE – Estimating the Density of a Zero-Inflated Parameter in R

The density is infinite at zero because it includes a discrete spike. You need to estimate the spike using the proportion of zeros, and then estimate the positive part of the density assuming it is smooth. An ordinary KDE will cause problems at the left-hand end because it will put some weight on negative values. One useful approach is to take logs of the positive values, estimate the density of the logs using KDE, and then transform back. See Wand, Marron & Ruppert (JASA 1991) for a reference.

The following R function will do the transformed density:

logdensity <- function (x, bw = "SJ") 
{
    y <- log(x)
    g <- density(y, bw = bw, n = 1001)
    xgrid <- exp(g$x)
    g$y <- c(0, g$y/xgrid)
    g$x <- c(0, xgrid)
    return(g)
}

Then the following will give the plot you want:

set.seed(1)
x <- c(rlnorm(100),rep(0,50))
hist(x,probability=TRUE,breaks = 25)
fit <- logdensity(x[x>0]) # Only take density of positive part
lines(fit$x,fit$y*mean(x>0),col="red") # Scale density by proportion positive
abline(v=0,col="blue") # Add spike at zero.


Solved – How to turn a function into a probability density whilst maintaining the shape of the function

If you have a non-negative integrable function $f$ with domain $D$ such that

$$ k = \int_{D} f(x) dx < \infty $$

then $f(x)/k$ is a probability density on $D$, provided $k > 0$. The value $k$ is known as the normalizing constant.

Edit: In your example you said that $f(x) = \frac{a}{bx} + c$ for known constants $a,b,c$. In that case, the indefinite integral is simple to compute and the normalizing constant would be

$$ k = \left[ \frac{a \log(x) }{b} + cx \right]_{D}$$

If $D$ is an interval $(A,B)$ with $0 < A < B$, then this simplifies to

$$ k = \frac{a}{b} \cdot \log \left( \frac{B}{A} \right) + c(B-A). $$

Therefore

$$ g(x) = \frac{\frac{a}{bx} + c}{\frac{a}{b} \cdot \log \left( \frac{B}{A} \right) + c(B-A)} $$

is a probability density on $(A,B)$.
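A quick numerical check of this normalization in R. The constants $a = 2$, $b = 1$, $c = 0.5$ and the interval $(1, 4)$ are made-up values chosen only for illustration. They keep $f$ positive on the interval. The constant $c$ is stored as `c0` to avoid masking R's `c()` function.

```r
# Hypothetical constants and interval, chosen only for illustration
a <- 2; b <- 1; c0 <- 0.5
A <- 1; B <- 4

# Unnormalized function f(x) = a/(b*x) + c
f <- function(x) a / (b * x) + c0

# Normalizing constant from the closed-form expression
k_closed <- (a / b) * log(B / A) + c0 * (B - A)

# The same constant by numerical integration, as a cross-check
k_num <- integrate(f, lower = A, upper = B)$value

# Normalized density g(x) = f(x) / k
g <- function(x) f(x) / k_closed

k_closed                                   # closed-form k
k_num                                      # should match k_closed
integrate(g, lower = A, upper = B)$value   # should be 1
```

When no closed form is available, `k_num` can be used in place of `k_closed` to normalize $f$.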
