The assumption on the distribution of data in Gaussian mixture models

distributions, gaussian mixture distribution, machine learning, normal distribution, unsupervised learning

I am reading about Gaussian mixture models from this slide

https://www.ics.uci.edu/~smyth/courses/cs274/notes/EMnotes.pdf

However, I am super confused at the very first line.

It says:

We have a dataset of some data $x_i$

Each data point is assumed to be generated i.i.d. from an underlying
distribution. We assume that the underlying distribution is a mixture
of Gaussian distributions.
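
For reference, this assumption is usually written as below. The symbols $K$, $\pi_k$, $\mu_k$, $\Sigma_k$ and $N$ are my own notation and may not match the slides exactly.

$$p(x_i \mid \theta) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1$$

With the i.i.d. assumption, the log-likelihood of the $N$ data points is

$$\ell(\theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$$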

I do not understand why we assume that the underlying distribution of the data is a mixture of Gaussian distributions.

This seems to me to be completely false.

The data distribution could be anything. We are only fitting a mixture-of-Gaussians model to whatever that underlying distribution is. We are maximizing the log-likelihood using EM to approximate that distribution with the GMM.

Why do people assume that the data themselves are generated through Gaussians?

Is my interpretation correct?

Best Answer

Actually, the GMM does assume that the underlying data are generated from a mixture of Gaussians. By accepting and using the model, you automatically take on that assumption. What you are really believing is that the GMM will be able to represent your data well enough. Almost every algorithm comes with assumptions that you accept when you use it, e.g. Naive Bayes assumes that the features are conditionally independent given the class. Remember that almost all models are wrong.
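
To make this concrete, here is a minimal sketch (mine, not from the slides) that fits a two-component 1-D GMM with EM to data drawn from a uniform distribution, which is clearly not a mixture of Gaussians. The function name `fit_gmm_1d`, the initialisation scheme, the variance floor and the sample size are all assumptions chosen for illustration. EM still runs and the log-likelihood never decreases; the model assumption simply determines which family the fit is taken from.

```python
import math
import random


def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, k=2, n_iter=50, seed=0):
    """Plain EM for a 1-D Gaussian mixture (illustrative sketch only)."""
    rng = random.Random(seed)
    n = len(data)
    mean = sum(data) / n
    var0 = sum((x - mean) ** 2 for x in data) / n

    # Assumed initialisation: k random data points as means,
    # the overall sample variance for every component, equal weights.
    mus = rng.sample(data, k)
    variances = [var0] * k
    weights = [1.0 / k] * k
    history = []  # log-likelihood of the parameters at the start of each iteration

    for _ in range(n_iter):
        # E-step: responsibilities r[i][j] = P(component j | x_i).
        resp = []
        log_lik = 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, mus, variances)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        history.append(log_lik)

        # M-step: weighted maximum-likelihood updates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Small floor to avoid a collapsed component (an assumption of this sketch).
            variances[j] = max(var_j, 1e-6)

    return weights, mus, variances, history


# Data that is deliberately NOT Gaussian: uniform on [0, 1].
data_rng = random.Random(1)
data = [data_rng.uniform(0.0, 1.0) for _ in range(500)]

weights, mus, variances, history = fit_gmm_1d(data)
print("weights:", weights)
print("means:", mus)
print("variances:", variances)
print("log-likelihood, first and last iteration:", history[0], history[-1])
```

The fitted mixture is a usable approximation of the uniform density, but it is still a wrong model of how the data were generated, which is exactly the sense in which the assumption is "accepted" rather than true.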
