Gaussian Mixture Distribution – Creating a Probability Density Function from a Gaussian Mixture Model in Python

distributions, gaussian mixture distribution, python, time series

I have some daily time series (27 right now, but there will be over 200 once I get more data) for electricity consumption. For each hour I want to know what the probability density function looks like.
What I have done so far is extract the hourly values from each series, so that I know which values occur in each hour.

E.g.

am_12 = [4.4, 1.1, 3.0, 4.7, 4.6, 5.9, 4.5, 5.9, 19.2, 3.5, 2.8, 2.9, 12.4, 14.0, 0.5, 52.7, 1.1, 8.8, 2.7, 1.9, 9.3, 6.3, 10.7, 9.6, 4.7, 8.2, 10.1]
len(am_12)  # 27

So far, so good. I have my values, I know the mean is 7.98, and I know the variance is 95.587.
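
A minimal sketch of how those numbers can be reproduced, assuming NumPy (the quoted variance is the population variance, i.e. ddof=0):

import numpy as np

print(np.mean(am_12))  # ~7.98
print(np.var(am_12))   # ~95.587 (population variance; ddof=0 is NumPy's default)
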
This leaves me with these questions:

  • How can I now create the PDF of these samples? So far I have only figured out how to count them all, as in the histogram below (a minimal plotting sketch follows this list):

[histogram of raw counts for the 12 am values]

  • Once I have the PDF, how can I fit a Gaussian Mixture Model to it?
  • I guess I would have to go with 2-3 Gaussians for this example. Is there some criterion that tells me how many Gaussians are best, or do I have to judge it by eye?
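
A minimal sketch of the counting approach mentioned above, assuming matplotlib; density=True would turn the raw counts into a normalized histogram, i.e. an empirical PDF:

import matplotlib.pyplot as plt

plt.hist(am_12, bins=10)                  # raw counts, as in the plot above
# plt.hist(am_12, bins=10, density=True)  # normalized so the bars integrate to 1
plt.xlabel("hourly consumption")
plt.ylabel("count")
plt.show()
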

I am a bit confused about this topic, so any pointer in the right direction is highly appreciated!

Best Answer

Fitting Gaussian mixture models can be done quite straightforwardly with classes from scikit-learn. Here are some of your options:

  • If you want to choose the number of components in the mixture yourself (you mentioned "2-3 Gaussians"), simply use sklearn.mixture.GaussianMixture; see the sketch after this list.
  • If you want to do some model selection before settling on the "right" number of components, look into applying an information criterion such as the BIC to your problem, as described e.g. here (in section 2.1.1.2).
  • Finally, you could go even more Bayesian by using sklearn.mixture.BayesianGaussianMixture, a class that tries to figure out everything automatically (but sometimes gets it wrong).
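
As a rough sketch of the first two options (assuming scikit-learn and NumPy, and reusing the 12 am values from the question), fitting mixtures and picking the number of components by BIC could look like this:

import numpy as np
from sklearn.mixture import GaussianMixture

x = np.array([4.4, 1.1, 3.0, 4.7, 4.6, 5.9, 4.5, 5.9, 19.2, 3.5, 2.8, 2.9,
              12.4, 14.0, 0.5, 52.7, 1.1, 8.8, 2.7, 1.9, 9.3, 6.3, 10.7,
              9.6, 4.7, 8.2, 10.1]).reshape(-1, 1)  # scikit-learn expects 2-D input

# Fit mixtures with 1-4 components and keep the one with the lowest BIC.
best_gmm, best_bic = None, np.inf
for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, n_init=10, random_state=0).fit(x)
    bic = gmm.bic(x)
    if bic < best_bic:
        best_gmm, best_bic = gmm, bic

print(best_gmm.n_components, best_gmm.means_.ravel(), best_gmm.covariances_.ravel())

# The fitted mixture is itself a density: score_samples returns the log-density,
# so exponentiating it gives the PDF on a grid of values.
grid = np.linspace(0, 60, 200).reshape(-1, 1)
pdf = np.exp(best_gmm.score_samples(grid))
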

If you decide to let scikit-learn figure out the number of components for you, you might run into problems because of the small number (27) of data points: it could happen that scikit-learn suggests 27 components...

Having said that, maybe fitting a Gaussian mixture is not really the best option for your final goal. For example, Gaussians have infinite support, while in your case all values are nonnegative, so fitting e.g. lognormal distributions might be preferable. Often, however, a histogram or some kernel density estimate (KDE) is just fine. Scikit-learn provides those as well, see e.g. here, or here.
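
For instance, a lognormal fit (via SciPy) and a Gaussian KDE (via scikit-learn's KernelDensity) might be sketched as follows; the bandwidth of 2.0 is an arbitrary choice and could be tuned, e.g. by cross-validation:

import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

x = np.array([4.4, 1.1, 3.0, 4.7, 4.6, 5.9, 4.5, 5.9, 19.2, 3.5, 2.8, 2.9,
              12.4, 14.0, 0.5, 52.7, 1.1, 8.8, 2.7, 1.9, 9.3, 6.3, 10.7,
              9.6, 4.7, 8.2, 10.1])
grid = np.linspace(0.1, 60, 200)

# Lognormal fit: respects the nonnegativity of the consumption values.
shape, loc, scale = stats.lognorm.fit(x, floc=0)  # fix the location parameter at 0
lognorm_pdf = stats.lognorm.pdf(grid, shape, loc=loc, scale=scale)

# Gaussian kernel density estimate; the bandwidth is chosen by hand here.
kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(x.reshape(-1, 1))
kde_pdf = np.exp(kde.score_samples(grid.reshape(-1, 1)))
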
