I have time-series data containing 1440 observations and the plot of the data is
I want to fit the Gaussian Mixture Models (GMM) to the above plot, and for the same I am using Mclust function of mclust package. Finally, I want a fit somewhat like this:
On using Mclust function, I do get following statistics
mclus_data <- Mclust(givendataseries)
> summary(mclus_data)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust E (univariate, equal variance) model with 8 components:
log.likelihood n df BIC ICL
9525.438 1440 16 18934.52 18183.67
Clustering table:
1 2 3 4 5 6 7 8
1262 0 0 0 0 13 114 51
In the above statistic, I can not understand following:
- Significance of
log.likelihood
,BIC
andICL
. I can understand what each of them is, but what their magnitude/value refers to? - It shows there are 8 clusters, but why cluster no.
2,3,4,5
has0
values? What does this mean? - From the plot it is clear that there must be two Guassians, but why
Mclust
function shows there are 8 Guassians?
Update:
Actually, I want to do model based clustering of time series data. But currently I want to fit the distribution to my raw data, as shown in Figure 1 on page no. 3 of this paper. For your quick reference, mentioned figure in said paper is
Best Answer
There is a misunderstanding in your question that needs a correction. Time-series model is not univariate since you have two variables: actual values and time. To provide an example let's take a time-series data, say
woolyrnq
data fromforecast
R library (plotted below).Now, if you use univariate Mclust to find clusters it will ignore the time component and find two clusters.
We can also plot the density of fitted clusters:
If you look at the x-axis of this plot, you'll learn that it is related to values of your data (y-axis on the first plot), not to time. Now, if we color the point-values of the time series by cluster assignments, it will be more clear:
The method discovered clusters of "high" and "low" values, independent of time. The same applies to the eight clusters discovered by Mclust with your data - they ignore the time, so are unrelated to the peaks marked by you on the second plot in your question.
If you want to use Mclust for such data, you need to use a bivariate model including time. For example, with the
woolyrnq
data you can obtain two such clustersOr illustrated as 2-dimmensional density plot:
As you can see, now the clusters relate to the "higher" wool production in Australia up to the early 1970' and "lower" production afterwards. Notice that this is a bivariate, rather than univariate, model. The plot from the paper that you refer to is a marginalized version of such multidimensional density plot and can be easily obtained by extracting
mean
andvariance
objects fromparameters
inMclust
object (example below).The plot above, if expanded a little bit, could be also a very good example of why using such method is not really the best way to go with time series, what would get obvious if you look at the plot below.
As you can see, if you made predictions from such mixture model, you'll conclude that there were literally no wool production in Australia before 1850 and there would be no such production in ninety years from now. Time series are not really Gaussian shaped, so such methods should be used with caution.
R note: In the example provided
ts
object was used, where information about time units was available by thetime
method. However if you are not using ats
object, than you have to use additional variable that describes the time with appropriate time units.