I have some daily time series (27 right now, but over 200 once I get more data) for electricity consumption. For each hour of the day I want to know what the probability density function of the values looks like.
What I have done so far is extract the hourly values from each series, so that I know which values are present in each hour.
E.g.
am_12 = [4.4, 1.1, 3.0, 4.7, 4.6, 5.9, 4.5, 5.9, 19.2, 3.5, 2.8, 2.9, 12.4, 14.0, 0.5, 52.7, 1.1, 8.8, 2.7, 1.9, 9.3, 6.3, 10.7, 9.6, 4.7, 8.2, 10.1]
len(am_12) == 27
So far so good. I have my values, I know the mean is 7.98 and the (population) variance is 95.587.
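For reference, a minimal NumPy sketch that reproduces these numbers (numpy.var defaults to the population variance, ddof=0, which matches the 95.587 above):

import numpy as np

# Hourly samples for 12 am across the 27 daily series
am_12 = np.array([4.4, 1.1, 3.0, 4.7, 4.6, 5.9, 4.5, 5.9, 19.2, 3.5, 2.8,
                  2.9, 12.4, 14.0, 0.5, 52.7, 1.1, 8.8, 2.7, 1.9, 9.3, 6.3,
                  10.7, 9.6, 4.7, 8.2, 10.1])

print(am_12.mean())  # ~7.98
print(am_12.var())   # ~95.587 (population variance, ddof=0)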
This leaves me with these questions:
- How can I now create the PDF of these samples? So far I have only figured out how to count them all (e.g. in a histogram).
- Once I have the PDF, how can I fit a Gaussian Mixture Model to it?
- I guess I would have to go with 2-3 Gaussians for this example. Is there some criterion that tells me how many Gaussians are best, or do I have to judge it by eye?
I am a bit confused about this topic, so any pointer in the right direction is highly appreciated!
Best Answer
Fitting Gaussian mixture models can be done quite straightforwardly with scikit-learn, e.g. with the sklearn.mixture.GaussianMixture class.
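A minimal sketch, assuming the am_12 list from your question and an (arbitrary) choice of 2 components:

import numpy as np
from sklearn.mixture import GaussianMixture

# am_12: the 27 hourly samples from the question
X = np.array(am_12).reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel())

# Evaluate the fitted mixture density on a grid
# (score_samples returns the log-density, hence the exp)
grid = np.linspace(0, 60, 200).reshape(-1, 1)
pdf = np.exp(gmm.score_samples(grid))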
If you decide to have the number of components figured out automatically, you might run into problems because of the small number (27) of data points: it could happen that the procedure suggests as many as 27 components... One common approach is an information criterion such as BIC, sketched below.
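A rough sketch, again assuming am_12 from the question; with so few points, treat the selected number of components with caution:

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array(am_12).reshape(-1, 1)
bics = []
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))  # lower BIC is better
best_k = int(np.argmin(bics)) + 1
print(bics, best_k)

sklearn.mixture.BayesianGaussianMixture may also be worth a look, since it can effectively switch off superfluous components on its own.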
Having said that, maybe fitting a Gaussian mixture is not really the best option for your final goal. For example, Gaussians have infinite support, while in your case all values are nonnegative. Fitting e.g. a lognormal distribution might be preferable. Often, though, a histogram (numpy.histogram) or some KDE is just fine; scikit-learn provides KDE via sklearn.neighbors.KernelDensity. A sketch of both alternatives follows below.
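A minimal sketch, assuming am_12 from the question; the KDE bandwidth of 2.0 is an arbitrary choice you would want to tune (e.g. by cross-validation):

import numpy as np
from scipy import stats
from sklearn.neighbors import KernelDensity

x = np.array(am_12)
grid = np.linspace(0.1, 60, 200)

# Lognormal fit; floc=0 pins the support to the nonnegative axis
shape, loc, scale = stats.lognorm.fit(x, floc=0)
lognorm_pdf = stats.lognorm.pdf(grid, shape, loc=loc, scale=scale)

# Gaussian KDE; score_samples returns the log-density, hence the exp
kde = KernelDensity(kernel="gaussian", bandwidth=2.0).fit(x.reshape(-1, 1))
kde_pdf = np.exp(kde.score_samples(grid.reshape(-1, 1)))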