Dimensionality Reduction – PCA with Polynomial Kernel vs Single Layer Autoencoder

autoencoders, dimensionality reduction, neural networks, pca, polynomial

What is the relationship between PCA with a polynomial kernel and a single-layer autoencoder?

What if it is a deep autoencoder?

Best Answer

Assume your polynomial kernel is given, for $d \in \mathbb N$ and $c \geq 0$, by $$K(x,y) = \langle \phi(x)|\phi(y)\rangle = (\langle x|y\rangle + c)^d$$
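For concreteness, this kernel is a one-liner in NumPy (a minimal sketch; the function name and defaults are my own):

```python
import numpy as np

def poly_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel K(x, y) = (<x, y> + c)^d."""
    return (np.dot(x, y) + c) ** d
```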

The functional form of the kernel-PCA reconstruction (in feature space) is given by: $$\phi(x)' = WW^T\phi(x)\approx\phi(x)$$

In PCA, $W$ usually has fewer columns than rows (its columns are the leading principal directions), so $WW^T$ is rank-deficient and the representation $W^T\phi(x)$ is lower dimensional than the original data.
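A minimal NumPy sketch of this reconstruction for ordinary (linear) PCA, where $\phi$ is the identity; all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))       # 200 samples, 5 features
Xc = X - X.mean(axis=0)             # PCA assumes centred data

k = 2                               # target dimensionality
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k].T                        # (5, k): top-k principal directions

Z = Xc @ W                          # low-dimensional codes  W^T x
X_rec = Z @ W.T                     # reconstruction  W W^T x
print(np.linalg.norm(Xc - X_rec))   # small if k directions capture most variance
```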

On the other hand, an autoencoder with activation function $f$ in the hidden layer and identity activation in the output layer has the following functional form:

$$x' = W_2^Tf(W_1^Tx) \approx x$$
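To make the shapes and the functional form explicit, here is a sketch of this forward pass in NumPy with random (untrained) weights; in practice $W_1$ and $W_2$ would be fit by minimizing the reconstruction error:

```python
import numpy as np

def autoencoder(x, W1, W2, f=np.tanh):
    """Single hidden layer autoencoder: x' = W2^T f(W1^T x)."""
    return W2.T @ f(W1.T @ x)

p, k = 5, 2                          # input dim, bottleneck dim
rng = np.random.default_rng(0)
W1 = rng.normal(size=(p, k))         # encoder weights (untrained)
W2 = rng.normal(size=(k, p))         # decoder weights (untrained)

x = rng.normal(size=p)
x_rec = autoencoder(x, W1, W2)       # same shape as x, but not a good
print(x_rec.shape)                   # reconstruction until the weights are fit
```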

For common activation functions, it can be seen that, at least in their functional forms, kernel PCA and the autoencoder will never coincide for non-trivial polynomial degree $d > 1$.

A deep autoencoder would not match kernel PCA in functional form either, though it could approximate the same manifold, since deep neural networks are universal approximators.


A simple example can be sketched with $d=2$.

$$ \begin{align} K(x,y) &= (\langle x|y\rangle + c)^2 = \langle x|y\rangle^2 + 2 \langle x|y\rangle c + c^2 \\ &= \left(\sum_i^p x_iy_i\right)^2 + 2\left(\sum_i^p x_iy_ic\right)+c^2\\ &= \left(\sum_i^p x_iy_i\left(\sum_j^p x_jy_j\right)\right) + 2\left(\sum_i^p x_iy_ic\right)+c^2\\ &= \left(\sum_i^p\left(\sum_j^p x_iy_ix_jy_j\right)\right) + 2\left(\sum_i^p x_iy_ic\right)+c^2\\ &= \sum_i^p x_i^2y_i^2 + \sum_{i=2}^p\sum_{j=1}^{i-1} \left(\sqrt2x_ix_j\right)\left(\sqrt2y_iy_j\right) + \sum_i^p \left(\sqrt{2c}x_i\right)\left(\sqrt{2c}y_i\right)+ c\cdot c\\ \end{align}$$

We can see from this expansion that

$$\phi(x) = \left(x_1^2, \dots,x_p^2, \dots, \sqrt2x_ix_j, \dots, \sqrt{2c}x_1, \dots, \sqrt{2c}x_p, c\right)$$

In other words, the kernel gives you the original terms $x_i$ (scaled by $\sqrt{2c}$), plus all powers and interactions up to order $d$. This is not what you will typically see in an autoencoder, which is based on non-linear transformations of linear projections only. Interactions can be emulated by an autoencoder, but they are not enforced.
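We can verify numerically that this $\phi$ reproduces the kernel, i.e. that $\langle\phi(x)|\phi(y)\rangle = K(x,y)$ for $d=2$ (a sketch; function and variable names are my own):

```python
import numpy as np

def phi(x, c=1.0):
    """Explicit feature map for the d = 2 polynomial kernel."""
    squares = x ** 2
    cross = [np.sqrt(2) * x[i] * x[j]              # sqrt(2) x_i x_j, j < i
             for i in range(1, len(x)) for j in range(i)]
    linear = np.sqrt(2 * c) * x                    # sqrt(2c) x_i
    return np.concatenate([squares, cross, linear, [c]])

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
c = 1.0
print(np.isclose(phi(x, c) @ phi(y, c), (x @ y + c) ** 2))  # True
```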

Using the multinomial theorem, this result can be generalized to any other choice of $d$.
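One way to sketch that generalization in code (assuming $c \geq 0$) is to absorb the constant into an extra coordinate $\sqrt c$ and enumerate the degree-$d$ monomials with their multinomial coefficients; the names below are illustrative:

```python
import numpy as np
from itertools import combinations_with_replacement
from math import factorial, prod, sqrt

def phi_d(x, c=1.0, d=3):
    """Explicit feature map for K(x, y) = (<x, y> + c)^d, any d."""
    z = np.append(x, sqrt(c))        # extra coordinate absorbs the constant c
    feats = []
    for combo in combinations_with_replacement(range(len(z)), d):
        counts = [combo.count(i) for i in range(len(z))]
        coef = factorial(d) / prod(factorial(k) for k in counts)
        feats.append(sqrt(coef) * prod(z[i] for i in combo))
    return np.array(feats)

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=4)
print(np.isclose(phi_d(x) @ phi_d(y), (x @ y + 1.0) ** 3))  # True
```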