Solved – What is meant by non-Gaussianity in independent component analysis (ICA)?

factor analysis, independent component analysis, normal distribution, self-study, unsupervised learning

What is meant by non-Gaussianity in ICA? Why do we use it in ICA? Why is non-Gaussianity an important and essential principle in ICA estimation?

The following is a statement I found in a research paper, but I am not able to understand it. Please explain with the help of an example and mathematical equations.

Separation of independent signals from their mixtures can be
accomplished by making the linear signal transformation as
non-Gaussian as possible. Non-Gaussianity is an important and
essential principle in ICA estimation.

Best Answer

First we look at the central limit theorem, which is concerned with the tendency of the mean (or sum) of independently drawn variables from an arbitrary distribution with finite variance to follow a Gaussian distribution. This matters because in real-world samples we are often observing data that is in fact a composite of many underlying factors, and the central limit theorem tells us that a linear combination of independent variables tends to be closer to Gaussian than the variables themselves.

Aggregates of non-independent variables can retain non-Gaussian distributions because the distributions are linked, but if the variables are independent then their combination will tend towards Gaussian (just as the sum of multiple independent fair dice tends towards a normal distribution).
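The dice example can be checked numerically. The following is a small simulation sketch, not part of the original answer: the helper `excess_kurtosis`, the sample size and the seed are my own choices. It uses excess kurtosis (zero for a Gaussian) as a rough gauge of how far the sum of `n_dice` fair dice is from normal; a single die has excess kurtosis of about -1.27.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess kurtosis; it is 0 for a Gaussian variable."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
n_samples = 200_000

kurt_by_n = {}
for n_dice in (1, 2, 5, 20):
    # Each row is one experiment: roll n_dice independent fair dice and add them.
    rolls = rng.integers(1, 7, size=(n_samples, n_dice))
    kurt_by_n[n_dice] = excess_kurtosis(rolls.sum(axis=1))
    print(n_dice, round(kurt_by_n[n_dice], 3))
```

The printed values should shrink towards zero as more dice are added, which is the "tends towards Gaussian" behaviour described above.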

What we want to achieve with ICA is to separate out the independent variables that underlie the observed data, i.e. to reverse the effect of the central limit theorem. Since a linear combination of independent non-Gaussian variables is more Gaussian than the original variables, the projections of the data that are least Gaussian are the ones that recover the underlying variables, so a measure of non-Gaussianity is what lets us identify them.

Thus ICA is built on the assumption of non-Gaussianity in the latent factors to tease them apart. If more than one underlying factor is Gaussian, those factors will not be separated by ICA, since the separation is based on deviation from normality. Basically, two independent Gaussian variables of equal variance have a circularly symmetric joint density that looks the same after any rotation, so there is no single solution.
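The rotation argument can be written out in a few lines. This is a sketch under the assumption that the two Gaussian factors have already been scaled to unit variance; the symbols $\mathbf s$, $R$ and $\mathbf y$ are introduced here only for illustration. Let $\mathbf s \sim \mathcal N(\mathbf 0, I)$ and let $R$ be any rotation matrix, so that $R R^{\mathrm T} = I$. For $\mathbf y = R\mathbf s$:

$$ \operatorname{Cov}(\mathbf y) = R\,\operatorname{Cov}(\mathbf s)\,R^{\mathrm T} = R R^{\mathrm T} = I, \qquad p(\mathbf y) = \frac{1}{2\pi}\exp\left(-\tfrac12 \|\mathbf y\|^2\right) $$

The density depends on $\mathbf y$ only through $\|\mathbf y\| = \|\mathbf s\|$, so the components of $\mathbf y$ are again independent standard Gaussians for every $R$. No statistic of the data can prefer one rotation over another, which is why ICA cannot pin down the original factors in this case.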

https://web.archive.org/web/20210303213322/fourier.eng.hmc.edu/e161/lectures/ica/node3.html
http://wwwf.imperial.ac.uk/~nsjones/TalkSlides/HyvarinenSlides.pdf

Related Solutions

Solved – the relationship between independent component analysis and factor analysis

[Figure: PCA and ICA basis vectors drawn over the data]

FA, PCA, and ICA are all 'related', inasmuch as all three of them seek basis vectors onto which the data is projected, such that you maximize insert-criteria-here. Think of the basis vectors as just encapsulating linear combinations.

For example, let's say your data matrix $\mathbf Z$ is a $2 \times N$ matrix, that is, you have two random variables and $N$ observations of each. Then let's say you found a basis vector $\mathbf w = \begin{bmatrix}0.1 \\-4 \end{bmatrix}$. When you extract (the first) signal (call it the vector $\mathbf y$), it is done like so:

$$ \mathbf{y} = \mathbf{w}^{\mathrm T}\mathbf{Z} $$

This just means "multiply the first row of your data by 0.1, and subtract 4 times the second row of your data". This gives $\mathbf y$, which is of course a $1 \times N$ vector with the property that it maximizes your insert-criteria-here.
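In NumPy terms, the projection is a single matrix product. A minimal sketch, where the random data in `Z` and the choice `N = 6` are placeholders of mine and only `w` comes from the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6                                  # number of observations (arbitrary)
Z = rng.standard_normal((2, N))        # placeholder 2 x N data matrix
w = np.array([0.1, -4.0])              # the basis vector from the example

# y = w^T Z: one number per observation (NumPy returns it as a length-N array).
y = w @ Z

# Same thing spelled out row by row.
y_by_hand = 0.1 * Z[0] - 4.0 * Z[1]
print(np.allclose(y, y_by_hand))
```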

So what are those criteria?

Second-Order Criteria:

In PCA, you are finding basis vectors that 'best explain' the variance of your data. The first (i.e. highest-ranked) basis vector is the one along which the projected data has the largest variance. The second one has the same criterion, but must be orthogonal to the first, and so on. (It turns out those basis vectors for PCA are nothing but the eigenvectors of your data's covariance matrix.)
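The eigenvector remark can be sketched directly with NumPy. The data below is synthetic and the matrix used to correlate it is an arbitrary choice of mine; the point is only that projecting onto the eigenvectors of the covariance matrix gives uncorrelated components whose variances are the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5_000
# Synthetic correlated 2 x N data (the mixing numbers are arbitrary).
Z = np.array([[2.0, 0.0], [1.2, 0.5]]) @ rng.standard_normal((2, N))

Zc = Z - Z.mean(axis=1, keepdims=True)      # center each variable
C = np.cov(Zc)                              # 2 x 2 covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # re-sort: largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = eigvecs.T @ Zc                     # data projected onto the PCA basis
print(np.round(np.cov(scores), 6))          # diagonal, with eigvals on the diagonal
```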

FA differs from PCA in that FA is generative, whereas PCA is not. I have seen FA described as 'PCA with noise', where the 'noise' terms are called 'specific factors'. All the same, the overall conclusion is that PCA and FA are based on second-order statistics (covariance), and nothing above.

Higher Order Criteria:

In ICA, you are again finding basis vectors, but this time you want basis vectors that give a result such that the resulting vector is one of the independent components of the original data. You can do this by maximizing the absolute value of normalized kurtosis, a 4th-order statistic. That is, you project your (typically whitened) data onto some basis vector and measure the kurtosis of the result. You change your basis vector a little (usually through gradient ascent), then measure the kurtosis again, and so on. Eventually you will happen upon a basis vector that gives you a result with the highest possible absolute kurtosis, and this is your independent component.
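Here is a rough sketch of that "project, measure kurtosis, nudge, repeat" loop. To keep it short I replace gradient ascent with a brute-force scan over directions in 2-D, which is my simplification and not how practical ICA codes work. The sources (one uniform, one Laplace), the mixing matrix `A`, the seed and the whitening step are likewise assumptions made for the illustration.

```python
import numpy as np

def excess_kurtosis(x):
    """Sample excess (normalized) kurtosis; 0 for a Gaussian."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(1)
n = 50_000
# Two independent non-Gaussian sources (assumed): uniform and Laplace.
S = np.vstack([rng.uniform(-1, 1, n), rng.laplace(0, 1, n)])
A = np.array([[1.0, 0.6], [0.4, 1.0]])       # hypothetical mixing matrix
Z = A @ S                                    # observed 2 x n data

# Whiten: center, then rescale so the covariance becomes the identity.
Zc = Z - Z.mean(axis=1, keepdims=True)
eigvals, eigvecs = np.linalg.eigh(np.cov(Zc))
Zw = np.diag(eigvals ** -0.5) @ eigvecs.T @ Zc

# Try many unit vectors w = (cos a, sin a) and keep the one whose
# projection has the largest absolute kurtosis.
angles = np.linspace(0.0, np.pi, 720, endpoint=False)
scores = [abs(excess_kurtosis(np.cos(a) * Zw[0] + np.sin(a) * Zw[1]))
          for a in angles]
best = angles[int(np.argmax(scores))]
w = np.array([np.cos(best), np.sin(best)])
y = w @ Zw                                   # extracted component

print(abs(np.corrcoef(y, S[1])[0, 1]))       # correlation with the Laplace source
```

Because the Laplace source has the larger absolute excess kurtosis (3 versus 1.2 for the uniform one), the global maximum should land on it; the other source then lies along the orthogonal direction.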

The top diagram above can help you visualize it. You can clearly see how the ICA vectors correspond to the axes of the data (independent of each other), whereas the PCA vectors try to find directions where variance is maximized (somewhat like a resultant).

If in the top diagram the PCA vectors look like they almost correspond to the ICA vectors, that is just coincidental. Here is another instance on different data and mixing matrix where they are very different. ;-)

[Figure: PCA and ICA basis vectors for different data and a different mixing matrix]

Independent Component Analysis – How to Make Sense of ICA for Data Processing

Here's my attempt.

Background

Consider the following two cases.

1. You are a private eye at a party. Suddenly, you see one of your old clients talking to someone, and you can hear some of the words but not quite, because you also hear someone else who's next to him, participating in an unrelated discussion about sports. You don't want to come closer: he'll spot you. You decide to take your partner's phone (he's busy convincing the bartender that non-alcoholic beer is great) and plant it about 10 meters away from you. That phone records the old client's talk as well as the interfering sports guy. You take your own phone and start recording as well, from where you're standing. After about 15 minutes you go home with two recordings: one from your position, and the other from about 10 meters away. Both recordings contain your old client and Mr. Sporty, but on each recording, one of the speakers is of a slightly different volume relative to the other (and this relative volume stays constant during the entire talk for each recording, because fortunately none of the participants moved around the room).
2. You take a picture of a cute Labrador Retriever dog you see outside the window. You check the image, and unfortunately you see a reflection from the window that's between you and the dog. You can't open the window (it's one of those, yes) and you can't go outside because you're afraid he'll run away. So you take (for some unclear reason) another image, from a slightly different position. You still see the reflection and the dog, but they are in different positions now, since you're taking the picture from a different place. Also note that the position changed uniformly for each pixel in the image, because the window is flat and not concave/convex.

The question is, in both cases, how to restore the conversation (in 1.) or the image of the dog (in 2.), given the two recordings or images that contain the same two "sources" but with slightly different relative contributions from each. Surely my educated grandchild can make sense of this!

Intuitive solution

How can we, at least in principle, get back the image of the dog from a mixture? Each pixel contains a value that is the sum of two values! Indeed, if each pixel were given on its own, without any other pixels, that pessimistic intuition would be correct: we would not be able to guess the relative contributions of the dog and the reflection to that pixel.

However, we are given a set of pixels (or points in time in the case of the recording) that we know share the same mixing relation. For example, if in the first image the dog is always twice as strong as the reflection, and in the second image it is just the opposite, then we might be able to get the correct contributions after all. And then, we can come up with the correct way to subtract the two images at hand so that the reflection is exactly cancelled! [Mathematically, this means finding the inverse mixture matrix.]
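To make the "twice as strong" example concrete, here is a toy sketch in which the mixing weights are known. The tiny random arrays stand in for the dog and reflection images, and all names are mine; in the real problem the weights are unknown, which is what the rest of the answer deals with.

```python
import numpy as np

rng = np.random.default_rng(0)
dog = rng.random((4, 4))            # stand-in for the dog image
reflection = rng.random((4, 4))     # stand-in for the window reflection

# First photo: dog twice as strong as the reflection; second photo: the opposite.
photo_1 = 2.0 * dog + 1.0 * reflection
photo_2 = 1.0 * dog + 2.0 * reflection

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])          # mixture matrix for this toy case
B = np.linalg.inv(A)                # inverse mixture matrix: [[2, -1], [-1, 2]] / 3

# Each source is a weighted difference of the two photos.
dog_hat = B[0, 0] * photo_1 + B[0, 1] * photo_2
reflection_hat = B[1, 0] * photo_1 + B[1, 1] * photo_2

print(np.allclose(dog_hat, dog), np.allclose(reflection_hat, reflection))
```

So with these weights, `(2 * photo_1 - photo_2) / 3` cancels the reflection exactly.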

Diving into details

Let's say you have a mixture of two signals,

$$ \begin{aligned} Y_1 &= a_{11}S_1 + a_{12}S_2 \\ Y_2 &= a_{21}S_1 + a_{22}S_2 \end{aligned} $$

and let's say you would like to obtain back $S_1$ as a function of the two mixtures, $Y_1,Y_2$. And let's also assume that you want a linear combination: $S_1=b_{11} Y_1 + b_{12} Y_2$. So, all you need to do is to find the best vector $(b_{11},b_{12})$ and there you have it. Similarly for $S_2$ and $(b_{21},b_{22})$.

But how can you find it for general signals? They may look similar, have similar statistics, etc. So let's assume they're independent. That's reasonable if you have an interfering signal, such as noise, or, if the two signals are images, the interfering signal may be a reflection of something else (and you took two images from different angles).

Now, we know that $Y_1$ and $Y_2$ are dependent. Since we may not recover $S_1,S_2$ exactly, denote our estimation for these signals as $X_1,X_2$, respectively.

How can we make $X_1,X_2$ be as close as possible to $S_1,S_2$? Since we know the latter are independent, we might want to make $X_1,X_2$ as independent as possible, by jiggling the values of $b_{ij}$. After all, if the matrix $\{a_{ij}\}$ is invertible, we can find some matrix $\{b_{ij}\}$ that inverts the mixing operation (and if it's not invertible, we can get close), and if we make $X_1,X_2$ independent, there is a good chance we recover our $S_i$ signals.
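For two signals, the inverting matrix can be written in closed form. This is just the standard $2\times 2$ inverse, with the names $A=\{a_{ij}\}$ and $B=\{b_{ij}\}$ introduced here for convenience and assuming $a_{11}a_{22}-a_{12}a_{21}\neq 0$:

$$ B = A^{-1} = \frac{1}{a_{11}a_{22}-a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} $$

As a check on the first row:

$$ b_{11}Y_1 + b_{12}Y_2 = \frac{a_{22}\,(a_{11}S_1 + a_{12}S_2) - a_{12}\,(a_{21}S_1 + a_{22}S_2)}{a_{11}a_{22}-a_{12}a_{21}} = S_1 $$

The catch is that the $a_{ij}$ are not known, so this formula cannot be used directly; it only shows that a suitable $\{b_{ij}\}$ exists.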

If you are convinced we need to find such $\{b_{ij}\}$ that makes $X_1,X_2$ independent, we now need to ask how to do that.

So first consider this: if we sum up several independent, non-Gaussian signals, we make the sum "more Gaussian" than the components. Why? Because of the central limit theorem; you can also think about the density of the sum of two independent variables, which is the convolution of their densities. If we sum several independent Bernoulli variables, the distribution of the sum will resemble a Gaussian shape more and more. Will it be a true Gaussian? Probably not (no pun intended), but we can measure the Gaussianity of a signal by how closely it resembles a Gaussian distribution. For instance, we can measure its excess kurtosis. If its absolute value is really high, the signal is probably less Gaussian than one with the same variance but with excess kurtosis close to zero.
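The convolution remark can be tried out exactly, without sampling. In this sketch (the success probability `p = 0.3` and the helper name are my own choices), the distribution of a sum of independent Bernoulli variables is built by repeatedly convolving probability mass functions, and the excess kurtosis is computed from the resulting distribution.

```python
import numpy as np

def excess_kurtosis_from_pmf(pmf):
    """Excess kurtosis of a distribution on 0, 1, ..., len(pmf) - 1."""
    k = np.arange(len(pmf))
    mean = np.sum(k * pmf)
    var = np.sum((k - mean) ** 2 * pmf)
    m4 = np.sum((k - mean) ** 4 * pmf)
    return m4 / var ** 2 - 3.0

p = 0.3                                # assumed success probability
bernoulli = np.array([1 - p, p])       # P(0), P(1)

pmf = np.array([1.0])                  # empty sum: all mass at 0
kurt = {}
for n in range(1, 51):
    # Adding one more independent Bernoulli variable = one more convolution.
    pmf = np.convolve(pmf, bernoulli)
    kurt[n] = excess_kurtosis_from_pmf(pmf)

for n in (1, 2, 10, 50):
    print(n, round(kurt[n], 4))
```

The values agree with the known binomial formula $(1-6p(1-p))/(np(1-p))$, which decays like $1/n$: the sum never becomes exactly Gaussian, but its excess kurtosis heads to zero.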

Therefore, to find the unmixing weights $\{b_{ij}\}$, we might formulate an optimization problem that, at each iteration, makes each of $X_1,X_2$ slightly less Gaussian. Mind that they may not be truly Gaussian at any stage; we just want to reduce the Gaussianity. Hopefully, if we don't get stuck at a local optimum, we finally get the inverse mixing matrix $\{b_{ij}\}$ and get our independent signals back.

Of course, this adds another assumption - the two signals need to be non-Gaussian to begin with.
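To tie the pieces together, here is a compact end-to-end sketch of such an iterative scheme. It is one possible choice and not necessarily what any given paper or library does: I use the kurtosis-based fixed-point update known from FastICA, $w \leftarrow E[z\,(w^{\mathrm T}z)^3] - 3w$, on whitened data, with deflation for the second component. The sources, the mixing matrix `A`, the seed and the function name `one_unit` are all assumptions for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
# Stand-ins for the two speakers: independent and non-Gaussian (assumed).
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.5], [0.7, 1.0]])      # hypothetical mixing matrix {a_ij}
Y = A @ S                                   # the two recordings Y_1, Y_2

# Center and whiten the recordings.
Yc = Y - Y.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(Yc))
Z = np.diag(d ** -0.5) @ E.T @ Yc

def one_unit(Z, w, others, n_iter=200, tol=1e-10):
    """Kurtosis-based fixed-point iteration for one unmixing direction."""
    w = w / np.linalg.norm(w)
    for _ in range(n_iter):
        w_new = (Z * (w @ Z) ** 3).mean(axis=1) - 3.0 * w
        for v in others:                    # deflation: stay orthogonal to found directions
            w_new = w_new - (w_new @ v) * v
        w_new = w_new / np.linalg.norm(w_new)
        converged = abs(abs(w_new @ w) - 1.0) < tol   # sign flips are harmless
        w = w_new
        if converged:
            break
    return w

w1 = one_unit(Z, rng.standard_normal(2), [])
w2 = one_unit(Z, rng.standard_normal(2), [w1])
X = np.vstack([w1, w2]) @ Z                 # estimates X_1, X_2

# Absolute correlations between estimates (rows) and true sources (columns).
C = np.abs(np.corrcoef(np.vstack([X, S]))[:2, 2:])
print(np.round(C, 3))
```

Each row of the printed matrix should have one entry close to 1, meaning every estimate matches one source. Note that the order and the sign (and scale) of the recovered signals are arbitrary, which is an inherent ambiguity of ICA.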