[Math] How to find the mode of a continuous distribution from a sample

probability statistics

First, my background is not math.

My objective is to find the value that occurs most frequently in a data sample OR the value that is most likely.

Let's say my sample is [1,5,6,6,7,10]. Finding the mode for this sample is simple (the mode is 6).

But if, say, I change the sample to [1,5,6,7,10], I don't know how to find the mode. The result I want is 6, since 6 is the most probable value.

Problem is, I don't even know what to google (tried for hours), and even when I do find something that MAY be the answer (kernel density estimation, continuous probability distribution), I don't understand what the hell they're talking about.

The actual data set consists of hundreds of floating-point values saved in Excel. I would appreciate it if someone could demo this in Excel.

Best Answer

For the record, here are some general solution sketches that also work for high-dimensional distributions (probably too complex for the asker, though; some sort of kernel density estimation is much simpler and reasonably good):

  • Train an f-GAN with reverse KL divergence (which is mode-seeking), without giving any random input to the generator (i.e. force it to be deterministic), so that its single output is pushed towards a high-density point.

  • Train an f-GAN with reverse KL divergence, move the input distribution to the generator towards a Dirac delta function as training progresses, and add a gradient penalty to the generator loss function.

  • Train a (differentiable) generative model that can tractably evaluate an approximation of the pdf at any point (I believe that e.g. a VAE, a flow-based model, or an autoregressive model would do). Then use some type of optimization (some flavor of gradient ascent can be used if model inference is differentiable) to find a maximum of that approximation.
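For a one-dimensional sample like the asker's, the last idea can be sketched with a Gaussian kernel density estimate standing in for the generative model: it gives a pdf approximation that can be evaluated anywhere, and the mean-shift iteration (a step-size-free form of gradient ascent on that estimate) climbs to a local maximum. This is only a minimal sketch under stated assumptions: the names `gaussian_kde` and `estimate_mode` are made up for illustration, and the default bandwidth is the normal-reference rule of thumb, which is one common choice rather than the only one. The result depends noticeably on the bandwidth.

```python
import math
import statistics


def gaussian_kde(sample, bandwidth):
    """Return a function x -> Gaussian kernel density estimate at x."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


def estimate_mode(sample, bandwidth=None, tol=1e-9, max_iter=1000):
    """Estimate the mode as the highest local maximum of the KDE.

    Hypothetical helper: runs mean-shift from every data point and keeps
    the end point with the largest estimated density.
    """
    n = len(sample)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb for the bandwidth.
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive (is the sample constant?)")

    density = gaussian_kde(sample, bandwidth)
    best_x, best_density = None, -1.0
    for start in sample:
        x = float(start)
        for _ in range(max_iter):
            # Mean-shift update: kernel-weighted average of the data.
            weights = [
                math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
            ]
            new_x = sum(w * xi for w, xi in zip(weights, sample)) / sum(weights)
            converged = abs(new_x - x) < tol
            x = new_x
            if converged:
                break
        d = density(x)
        if d > best_density:
            best_x, best_density = x, d
    return best_x


if __name__ == "__main__":
    print(estimate_mode([1, 5, 6, 7, 10]))
```

With the default bandwidth the estimate for [1,5,6,7,10] lands close to 6 but not exactly on it, because the point at 10 pulls the peak slightly to the right; a smaller `bandwidth` makes the estimate follow the cluster 5, 6, 7 more tightly.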
