Gaussian Mixture Distribution – Understanding the Meaning of Gaussian Mixture Weights


Other posts with similar titles do not actually ask what their titles suggest, so I ask here:

What is the meaning of the weights in a Gaussian Mixture Model (GMM)? Does the GMM weight individual mixture components more heavily based on their frequency of observations (components with higher probability getting more weight)? Their number of observations (the component or class with the most points)? The peakedness of a component's distribution? What do the final weights estimated by the GMM seek to accomplish, exactly, in terms of an optimization objective function?

For example, the weights found by the GMM for the two-component data and clusters shown below are [0.48659547, 0.51340453]. What do these weights mean exactly, and what about the second dataset/component made the GMM give it a higher weight?

[Figure: plot of the simulated two-component data and the fitted GMM clusters]

Best Answer

The simple answer is that the weights estimated by the GMM seek to estimate the true weights of the underlying mixture. Sticking to the one-dimensional case, a GMM has $K$ components, where each component is a different normal distribution. A classic example is the height of humans: if you look at the overall density, it has two peaks (it is bimodal), but if you restrict to each gender, each looks like a normal distribution. So you can think of there being a hidden indicator for gender, and, conditional on that indicator, height follows a normal distribution. That is exactly what a GMM models, and you can think of the weights as the probabilities of belonging to each of the $K$ components. In this example, the weights would just be the probabilities of being male and female.
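
To make that generative story concrete, here is a minimal sketch in Python (the height parameters below are made up purely for illustration): a draw from a two-component GMM first picks the hidden component with probability equal to its weight, then draws from that component's normal distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixing weights: the probability that a random person falls in each component
# (numbers are made up for illustration).
weights = np.array([0.5, 0.5])
means   = np.array([162.0, 176.0])   # component means (cm)
sds     = np.array([7.0, 7.5])       # component standard deviations (cm)

n = 10_000
# First draw the hidden component (the "gender indicator") ...
component = rng.choice(2, size=n, p=weights)
# ... then draw the height from that component's normal distribution.
heights = rng.normal(means[component], sds[component])

# The fraction of draws from each component recovers the weights.
print(np.bincount(component) / n)   # approximately [0.5, 0.5]
```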

Now, with a GMM, you may not observe who belongs to which gender, so you need to use your data to simultaneously learn about the two distributions and learn which distribution each observation belongs to. This is typically done through expectation maximization (EM), where you start by assuming the weights are uniform, so they are all $1/K$ (or $1/2$ in our example). Then you proceed with the EM steps, and in theory the weights converge to the true weights. Intuitively, for each observation $i$ and component $k$ you estimate the probability that observation $i$ belongs to component $k$; denote this $p_{ik}$. The weight for component $k$ is then defined as $\frac{1}{n}\sum_{i=1}^n p_{ik}$, which can be thought of as the sample probability that a random observation belongs to component $k$, which is exactly what the weight represents.
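
As a rough illustration of that weight update (not the full EM algorithm, just the responsibility computation and the weight step given some current parameter guesses), a sketch in Python might look like this:

```python
import numpy as np
from scipy.stats import norm

def update_weights(x, weights, means, sds):
    """One responsibility/weight update for a 1-D GMM (illustrative sketch)."""
    # E-step: p_ik is proportional to weight_k * N(x_i | mean_k, sd_k)
    dens = np.column_stack([w * norm.pdf(x, m, s)
                            for w, m, s in zip(weights, means, sds)])
    resp = dens / dens.sum(axis=1, keepdims=True)   # each row sums to 1
    # Weight update: average responsibility per component, i.e. (1/n) * sum_i p_ik
    new_weights = resp.mean(axis=0)
    return new_weights, resp
```

The full EM algorithm would also update the means and variances from the same responsibilities and iterate until convergence.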

Intuition for the assignment of weights (and, more generally, for the EM procedure)

To answer your comment (and updated post), the weights are the estimated probabilities of a draw belonging to each respective normal distribution (I don't know the ordering, but it means that a random draw from your sample has about a 48.7% chance of coming from one component and a 51.3% chance of coming from the other... note that they sum to one!).

As for how that is calculated, it's hard to provide much more than either intuition or the full-blown calculations of the EM procedure, which you can easily find by googling, but I'll give it a shot. Let's focus on your example. You start by specifying 2 components, and the EM process starts by assuming that each normal is equally likely and that the variances of both normals are equal to the variance of the entire sample. Then you randomly pick one observation to serve as the mean of one component, and another (distinct!) observation to serve as the mean of the other. In your example, call the dark blue one component 1 and the turquoise one component 2. Since the true means are different, and since you randomly choose different observations as the initial mean estimates, by definition one of the two mean estimates will be closer to one of the two unknown true means.

Given those initial specifications, you calculate the probability of each observation belonging to each of the two components. For example, looking at your plot, a point far to the right is more likely to belong to the component whose initial mean is further to the right than to the other one. Based on these probabilities and the observed values, you then update the weights, means, and variances of both components. Note that component 2 will quickly acquire a higher variance, since all those spread-out values on the far right will be assigned to it. It may not yet pick up the points on the far left, but as you keep iterating, the variance of component 1 will shrink while the variance of component 2 grows. At some point the variance of component 2 will be large enough that the points on the far left are no longer assigned to component 1: although they are closer to component 1's mean, they are not consistent with its tighter spread, so they start favoring component 2. I'm only talking about means and variances to illustrate; the assignment step also relies heavily on the assumption that the components are normal.

Doing this over and over slowly assigns points to the correct components, and as that happens, the probability weights update accordingly. You keep going until nothing changes anymore, and the iterative process is done.
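
If you just want to see the end result of that iterative procedure rather than code it by hand, a library implementation such as scikit-learn's GaussianMixture runs EM for you and exposes the fitted weights directly. Here is a sketch with simulated data standing in for yours; the exact numbers will differ from your [0.4866, 0.5134].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Simulated stand-in for the two-component data in the question:
# 500 points from a tight component, 520 from a more spread-out one.
x = np.concatenate([rng.normal(0.0, 1.0, 500),
                    rng.normal(4.0, 2.0, 520)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(x)
print(gmm.weights_)        # roughly [500/1020, 520/1020]; component order may differ
print(gmm.means_.ravel())  # roughly [0.0, 4.0]
```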
