Intuitively speaking, the key to estimating the ICA model is nongaussianity. Actually, without nongaussianity the estimation is not possible at all, as mentioned in Sec. 3.3. This is at the same time probably the main reason for
the rather late resurgence of ICA research: In most of classical statistical theory, random variables are assumed to have gaussian distributions, thus precluding any methods related to ICA.
The Central Limit Theorem, a classical result in probability theory, tells that the distribution of a sum of independent random variables tends toward a gaussian distribution, under certain conditions. Thus, a sum of two independent random variables usually has a distribution that is closer to gaussian than any of the two original random variables.
Let us now assume that the data vector $x$ is distributed according to the ICA data model in Eq. 4, i.e. it is a mixture of independent components. For simplicity, let us assume in his section that all the independent components have identical distributions. To estimate one of the independent components, we consider a linear combination of the $x_i$ (see eq. 6); let us denote this by $y = w^T x = \sum_i w_ix_i$, where $w$ is a vector to be determined. If
$w$ were one of the rows of the inverse of $A$, this linear combination would actually equal one of the independent components. The question is now: How could we use the Central Limit Theorem to determine $w$ so that it would equal one of the rows of the inverse of $A$? In practice, we cannot determine such a $w$ exactly, because we have no knowledge of matrix $A$, but we can find an estimator that gives a good approximation.
To see how this leads to the basic principle of ICA estimation, let us make a change of variables, defining $z = A^Tw$. Then we have $y = w^T x = w^TAs = z^T
s$. $y$ is thus a linear combination of $s_i$, with weights given by $z_i$. Since a sum of even two independent random variables is more Gaussian than the original variables, $z^Ts$ is more Gaussian than any of the $s_i$ and becomes least Gaussian when it in fact equals one of the $s_i$. In this case, obviously
only one of the elements $z_i$ of $z$ is nonzero. (Note that the $s_i$ were here assumed to have identical distributions.)
Therefore, we could take as $w$ a vector that maximizes the non-Gaussianity of $w^T x$. Such a vector would necessarily correspond (in the transformed coordinate system) to a $z$ which has only one nonzero component. This
means that $w^T x = z^Ts$ equals one of the independent components!
Maximizing the non-Gaussianity of $w^T x$ thus gives us one of the independent components. In fact, the optimization landscape for nongaussianity in the $n$-dimensional space of vectors $w$ has $2n$ local maxima, two for each independent component, corresponding to $s_i$ and $−si$ (recall that the independent components can be estimated only up to a multiplicative sign). To find several independent components, we need to find all these local maxima. This is not difficult, because the different independent components are uncorrelated: We can always constrain the search to the space that gives estimates uncorrelated with the previous ones. This corresponds to orthogonalization in a suitably transformed (i.e. whitened) space.
Our approach here is rather heuristic, but it will be seen in the next section and Sec. 4.3 that it has a perfectly
rigorous justification.
Best Answer
Well, what intuition can there be? For a bivariate normal distribution (for $X$ and $Y$, say), uncorrelated means independence of $X$ and $Y$, while for the quite similar bivariate t distribution, with say 100 degrees of freedom, independence do not follow from correlation zero. Plotted this two distributions will look quite similar. For both distributions all contours of the joint density function are elliptical.
The only intuition that I can see is algebraic, the joint density for the bivariate normal is a constant times an exponential function. The argument of the exponential function is a quadratic polynomial in $x,y$. When the correlation is zero, this polynomial will not include any cross terms $xy$, only pure quadratic terms $x^2, y^2$. Then the property of the exponential function that $$ \exp(-x^2 -y^2) = \exp(-x^2)\cdot \exp(-y^2) $$ kicks in (of course the actual terms will be more complicated, but that does not change the idea). If you try to do the same with the bivariate t distribution, everything is the same, except---the quadatic polynomial sits inside the argument of another function without that nice separation property of the exponential! Thats the only intuition that I can see.