I have suggested, in comments, that an "outlier" in this situation might be defined as a member of a "small" cluster centered at an "extreme" value. The meanings of the quoted terms need to be quantified, but apparently they can be: "small" would be a cluster of fewer than 10 values, and "extreme" can be determined as outlying relative to the set of component means in the mixture model. In this case, outliers can be found with simple post-processing of any reasonable cluster analysis of the data.
Choices have to be made in fine-tuning this approach. They depend on the nature of the data and therefore cannot be completely specified in a general answer like this. Instead, let's analyze some data. I use R because of its popularity on this site and its succinctness (even compared to Python).
First, create some data as described in the question:
set.seed(17) # For reproducible results
centers <- rnorm(100, mean=100, sd=20)
x <- c(centers + rnorm(100*100, mean=0, sd=1),
       rnorm(100, mean=250, sd=1),
       rnorm(9, mean=300, sd=1))
This command specifies 102 components: 100 of them are situated like 100 independent draws from a Normal(100, 20) distribution (and will therefore tend to lie between 50 and 150); one is centered at 250, and one is centered at 300. It then draws 100 values independently from each component (using a common standard deviation of 1), except that from the last component, centered at 300, it draws only 9 values. According to the characterization of outliers, the 100 values centered at 250 do not constitute outliers: they should be viewed as a component of the mixture, albeit one situated far from the others. The cluster of nine values near 300, however, consists entirely of outliers. We need to detect these, but no others.
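As a quick sanity check (a small addition to the original analysis), we can confirm the size of the simulated dataset:

```r
length(x)  # 100*100 + 100 + 9 = 10109 values in total
```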
Most omnibus univariate outlier-detection procedures would either not detect any of these 109 highest values or would indicate all 109 are outliers.
Suppose we have a good sense of the standard deviations of the components (obtained from prior information or from exploring the data). Use this to construct a kernel density estimate of the mixture:
d <- density(x, bw=1, n=1000)
plot(d, main="Kernel density")
The (almost invisible) blip at the extreme right qualifies as a set of outliers: its small area (less than 10/10109 ≈ 0.001 of the total) indicates it consists of just a few values, and its situation at one extreme of the x-axis earns it the appellation of "outlier" rather than "inlier." Checking these things is straightforward:
x0 <- d$x[d$y > 1000/length(x) * dnorm(5)]
gaps <- tail(x0, -1) - head(x0, -1)
hist(gaps, main="Gap Counts")
The density estimate d is represented by a 1D grid of 1000 bins. These commands retain all bins in which the density is sufficiently large. For "large" I chose a very small value, to make sure that even the density of a single isolated value is picked up, but not so small that obviously separated components are merged.
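To see why this threshold works, note that with bw=1 a single isolated value contributes a kernel peak of height dnorm(0)/length(x) to the density estimate, which comfortably exceeds the cutoff. This check is an illustrative addition:

```r
# Peak density contributed by one isolated point (bw = 1):
dnorm(0) / length(x)          # roughly 4e-05
# The retention threshold used above:
1000 / length(x) * dnorm(5)   # roughly 1.5e-07, far smaller
```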
Evidently the gap distribution has two high outliers (which can be detected automatically using any simple procedure, even an ad hoc one). One characterization is that they both exceed 25 (in this example). Let's find the values associated with them:
large.gaps <- gaps > 25
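As an example of an ad hoc automatic rule (my addition, not the original criterion), one could flag any gap exceeding, say, three standard deviations above the mean gap; with these data it should single out the same two large gaps as the fixed cutoff of 25:

```r
# Hypothetical automatic rule: flag gaps far above the typical gap size.
large.gaps.auto <- gaps > mean(gaps) + 3 * sd(gaps)
which(large.gaps.auto)   # indices of the unusually large gaps
```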
ranges <- rbind(tail(x0, -1)[large.gaps],
                c(tail(head(x0, -1)[large.gaps], -1), max(x)))
The output is
[,1] [,2]
[1,] 243.9937 295.7732
[2,] 256.3758 300.9340
Within the range of data (from 25 to 301) these gaps determine two potential outlying ranges, one from 244 to 256 (column 1) and another from 296 to 301 (column 2). Let's see how many values lie within these ranges:
lapply(apply(ranges, 2, function(r){x[r[1] <= x & x <= r[2]]}), length)
The result is
[[1]]
[1] 100
[[2]]
[1] 9
The 100 is too large to be unusual: that's one of the components of the mixture. But the 9 is small enough. It remains to see whether any of these components might be considered outlying (as opposed to inlying):
apply(ranges, 2, mean)
The result is
[1] 250.1848 298.3536
The center of the 100-point cluster is at 250 and the center of the 9-point cluster is at 298, far enough from the rest of the data to constitute a cluster of outliers. We conclude there are nine outliers. Specifically, these are the values determined by column 2 of ranges:
x[ranges[1,2] <= x & x <= ranges[2,2]]
In order, they are
299.0379 300.0376 300.2696 300.3892 300.4250 300.5659 300.7018 300.8436 300.9340
First, a GMM is a particular model for clustering, in which you try to find an optimal labelling of your $n$ observations. With $k$ possible classes, there are $k^n$ possible labellings of your training data. This already becomes huge for moderate values of $k$ and $n$.
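To get a feel for the scale (an illustrative aside), even small values of $k$ and $n$ give astronomically many labellings:

```r
k <- 3; n <- 100
k^n   # about 5e47 possible labellings
```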
Second, the functional you are trying to minimize is not convex, which, together with the size of your problem, makes the optimization very hard. I do know that k-means (of which a GMM can be seen as a soft version) is NP-hard, but I am not aware of whether this has been proved for GMMs as well.
To see that the problem is not convex, consider the one dimensional case:
$$
L = \log \left(e^{-({x}/{\sigma_{1}})^2} + e^{-({x}/{\sigma_{2}})^2}\right)
$$
and check that you cannot guarantee that $\frac{d^2L}{dx^2} > 0$ for all $x$.
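A quick numerical check (an illustration I am adding, with $\sigma_1 = 1$ and $\sigma_2 = 3$ chosen arbitrarily) confirms the second derivative is negative near the origin, so $L$ is not convex:

```r
# L(x) = log(exp(-(x/s1)^2) + exp(-(x/s2)^2)) with arbitrary sigma values
L <- function(x, s1=1, s2=3) log(exp(-(x/s1)^2) + exp(-(x/s2)^2))
# Central-difference approximation to the second derivative
h <- 1e-4
d2L <- function(x) (L(x + h) - 2*L(x) + L(x - h)) / h^2
d2L(0)   # negative: L is concave at 0, hence not convex everywhere
```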
Having a non-convex problem means that you can get stuck in local minima. In general, you do not have the strong warranties you have in convex optimization, and searching for a solution is also much harder.
In scenario 1, there are two bivariate Normal distributions. Here I show two such probability density functions (PDFs) superimposed in a pseudo-3D plot. One has a mean near $(0,0)$ (at the left) and the other has a mean near $(3,3)$.
Samples are drawn independently from each. I took the same number ($300$) so that we wouldn't have to compensate for different sample sizes in evaluating these data.
Point symbols distinguish the two samples. The gray/white background is the best discriminator: points in gray are more likely to arise from the second distribution than the first. (The discriminator is elliptical, not linear, because these distributions have slightly different covariance matrices.)
In scenario 2 we will look at two comparable datasets produced using mixture distributions. There are two mixtures. Each one is determined by ten distinct Normal distributions. They all have different covariance matrices (which I do not show) and different means. Here are the locations of their means (which I have termed "nuclei"):
To draw a set of independent observations from a mixture, you first pick one of its components at random and then draw a value from that component. The PDF of a mixture is a weighted sum of the PDFs of the components, with the weights being the chance of selecting each component in that first stage. Here are the PDFs of the two mixtures. I drew them with a little extra transparency so you can see them better in the middle, where they overlap:
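The two-stage procedure can be sketched as follows (an illustrative addition: `mu` is a hypothetical 10 × 2 matrix of component means, with equal weights and unit covariances standing in for the ones used in the figures):

```r
set.seed(1)
mu <- matrix(rnorm(20, sd=3), ncol=2)        # hypothetical component means
n <- 300
comp <- sample(nrow(mu), n, replace=TRUE)    # stage 1: pick a component for each draw
samples <- mu[comp, ] + matrix(rnorm(2*n), ncol=2)  # stage 2: draw from that component
```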
To make the two scenarios easier to compare, the means and covariance matrices of these two PDFs were chosen to closely match the corresponding means and covariances of the two bivariate Normal PDFs used in scenario 1.
To emulate scenario 2 (the mixture distributions), I drew samples of 300 independent values from each of the two mixtures by selecting one of its components with probability $1/10$ and then independently drawing a value from the selected component. Because the selection of components is random, the number of draws from each component was not always exactly $30 = 300 \times 1/10$, but it was usually close to that. Here is the result:
The black dots show the ten component means for each of the two distributions. Clustered around each black dot are approximately 30 samples. However, there is much intermingling of values, so it is impossible from this figure to determine which samples were drawn from which component.
The background in that last figure is the best discriminator for these two mixture distributions. It is complicated because the distributions are complicated; obviously it is not just a line or smooth curve, such as appeared in scenario 1.
I believe the entire point of this comparison lies in our option, as analysts, to choose which model we want to use to analyze either one of these two datasets. Because we would not in practice know which model is appropriate, we could try using a mixture model for the data in scenario 1 and we could equally well try using a Normal model for the data in scenario 2. We would likely be fairly successful in any case due to the relatively low overlap (between blue and red sample points). Nevertheless, the different (equally valid) models can produce distinctly different discriminators (especially in areas where data are sparse).