Estimate the maximum of a probability density function using sample points

physicsprobability distributionspythonstatistics

I have an experiment that shoots particles at a wall. It hits some regions with a higher probability than others. I don't know what the underlying probability density function is. I assume it's some smooth curve and I would like to know its maximum value.

I can get an estimate by firing lots of particles and recording where they land. Then make a histogram of the data and calculate the maximum value from that. The problem is I don't know what the optimum number of bins to use for the histogram is. Additionally, I don't know how accurate my estimate is. Do you know if there is a formula or method I can use to calculate the optimum number of bins for a given number of samples? Do you know if we can approximate error bars for our estimate?

I have attached some Python code to help explain what I am doing. In this example, the underlying probability density function is the normal distribution, with mean, $\mu=0$, and variance $\sigma^2=1$. In this example, I know the exact solution is $1/\sqrt{2\pi\sigma^2}$, however, in general, I don't know the formula for the probability density function so we need to estimate its maximum.

import numpy as np
import matplotlib.pyplot as plt

mu = 0
sigma = 1
len_particles_array = 5
num_particles_array = 10**np.arange(1, len_particles_array + 1)

len_bin_array = 16
num_bins_array = 2**np.arange(len_bin_array)

max_number_density_array = np.zeros((len_particles_array, len_bin_array))

analytic_solution = 1 / np.sqrt(2 * np.pi * sigma**2)

for i, num_particles in enumerate(num_particles_array):
    particle_positions = np.random.normal(mu, sigma, num_particles)
    for j, num_bins in enumerate(num_bins_array):
        number_density, bins = np.histogram(particle_positions, num_bins, density=True)
        max_number_density_array[i, j] = np.max(number_density)

fig = plt.figure()
ax = fig.add_subplot(111)

for i, num_particles in enumerate(num_particles_array):
    ax.loglog(num_bins_array, max_number_density_array[i, :], \
              label = '# particles = ' + str(num_particles))

xlims = ax.get_xlim()
ax.loglog(xlims, [analytic_solution, analytic_solution], 'k--', \
          label = r'Exact solution = $1\ /\ \sqrt{2\pi\sigma^2}$')
ax.set_xlim(xlims)

ax.set_xlabel('Number of bins')
ax.set_ylabel('Estimate for the maximum of the probability density function')
ax.legend()

plt.savefig('Estimate for the maximum of the probability density function.png')

plt.show()

Here is the outputted figure. It shows the estimate is dependent on the number of bins I use for the histogram and the number of sample particles I use.
enter image description here

Best Answer

To be clear, it seems you don't want the mode, which is the $x$-value at which the density $f(x)$ reaches its maximum; you want $\max_{x \in R}f(x),$ where $R$ is the real line. Using the standard normal distribution as an example, you want $\varphi(0) = 1/\sqrt{2\pi} = 0.3989 \approx 0.4,$ where $\varphi$ denotes the standard normal density function.

In R software, $\varphi$ is dnorm; default parameters are $\mu=0, \sigma=1:$

dnorm(0)
[1] 0.3989423

If, you have the density function, you can often use methods of calculus to find the maximum value of a a function.

If you have data generated from a distribution and do not know the density function of the distribution, you might come close by trying histograms, as you suggest. That will work best if you have a large sample. A difficulty is that choice of bins (intervals) for a histogram is arbitrary, so different histograms will typically have various tallest bars of various heights.

Let's look at a sample of size $n = 1000$ from the distribution $\mathsf{Norm}(\mu=0, \sigma=2),$ for which the desired max is known to be $0.1995,$ to four places.

dnorm(0, 0, 2)
[1] 0.1994711    # max height of std norm density

set.seed(2021)
x = rnorm(1000)  # sample of 1000

Specifically, if you have data (1000 observations), but you don't know from what continuous distribution they were sampled, then you might look at histograms similar to the ones shown below.

I have cheated by plotting the density curve, which we're pretending is unknown, on each histogram. You can see that some histograms work better than others for guessing the maximum height of the density curve. [R code (tediously repetitive) for the figure is shown at the end.]

enter image description here

However, kernel density estimators (KDEs) provide a somewhat more useful path. A KDE is the density of a mixture of distributions from a suitable family, that approximates the population density using a sample from that distribution. (For more on KDEs, see Wikipedia. For R documentation on density, see this. Also, there are some pages on this site that discuss the convergence of KDEs to their target PDFs, with varying degrees of formality; type KDE after the 'magnifying glass' at the top of the page to get a list. (All give insights into KDEs, but I don't see that any deal directly with your question of finding the maximum.)

In R, a KDE consists of 512 $(x,y)$ pairs that can be connected by short line segments to show an estimate of the desired density curve. There are many possible variations of the KDEs in R, but I have chosen to use the default KDE, which I have found works very well in most cases. Here is text output from the density procedure in R.

density(x)

Call:
        density.default(x = x)

Data: x (1000 obs.);    Bandwidth 'bw' = 0.4608

       x                 y            
 Min.   :-7.7861   Min.   :9.830e-06  
 1st Qu.:-3.7491   1st Qu.:2.834e-03  
 Median : 0.2879   Median :3.034e-02  
 Mean   : 0.2879   Mean   :6.187e-02  
 3rd Qu.: 4.3248   3rd Qu.:1.242e-01  
 Max.   : 8.3618   Max.   :1.951e-01  

Notice that the maximum height (y-coordinate) of the KDE) is $0.1951,$ which is very close to the actual maximum height $0.1995$ of the density function of $\mathsf{Norm}(0, 2).$

Here is the default histogram of the values in x, along with this actual density function (black) along with its KDE (red) based on my sample of size $n = 1000.$

enter image description here

hist(x, prob=T, col="skyblue2")
 curve(dnorm(x,0,2), add=T)
 lines(density(x), col="red")

Notes: (1) If I wanted to know the 'mode' (the x-value at which the y-value is maximized, I could do a grid search for it, as below. The result is that the mode of the KDE is a $0.209 \approx 0.$

vert = density(x)$y; max.vert=max(vert); max.vert
[1] 0.1950648
horz = density(x)$x
mean(horz[vert==max.vert])
[1] 0.2088506

(2) Comparisons: (a) If we have $n=1000$ observations x known to have been randomly sampled from some normal distribution, then we should be able to estimate the maximum height by using dnorm in R, with the sample mean $\bar X$ estimating $\mu$ and the sample SD $S$ estimating $\sigma.$ For the sample x above, this method gives the maximum height as $0.1957.$

(b) This is not much different from the result $0.1951$ above from the KDE. Above, the KDE method uses normal data, but does not explicitly assume normality.

(c) Also, from above we know that a histogram with about 20 bins (lower left in first figure) has its tallest bar near the max. This information can be displayed by using densities from a non-printing histogram (argument plot=F); the result is $2.0.$ This method does not assume normality, but may require some trial and error with histograms that have various numbers of bins. (In a real application, you wouldn't have the actual density curve in the background to guide you.)

dnorm(0, mean(x), sd(x))
[1] 0.1956927    # (a) Known normal, estimate mean and SD

max(density(x)$y)
[1] 0.1950648    # (b) Default KDE in R

max(hist(x, br=20, plot=F)$density)
[1] 0.2          # (c) Carefully chosen histogram

# R code for first figure
par(mfrow=c(2,2))
 hist(x, prob=T, br=5, ylim=c(0,.2), col="skyblue2")
  curve(dnorm(x,0,2), ylim=c(0,.2), add=T)
  abline(h=.2, col="red")
 hist(x, prob=T, br=10, col="skyblue2")
  curve(dnorm(x,0,2), add=T)
  abline(h=.2, col="red")
 hist(x, prob=T, br=20, col="skyblue2")
  abline(h=.2, col="red")
  curve(dnorm(x,0,2), add=T)
 hist(x, prob=T, br=50, col="skyblue2")
  abline(h=.2, col="red")
  curve(dnorm(x,0,2), add=T)
par(mfrow=c(1,1))
Related Question