In the article Relative Information Loss in the PCA, the authors make, at some point (in the introductory section), the following statement:
In case the orthogonal matrix is not known a priori, but has to be estimated from a set of input data vectors collected in the matrix $\underline{X}$, the PCA becomes a nonlinear operation:
$$\underline{Y} = \underline{w}(\underline{X})\underline{X}$$
Here, $\underline{w}$ is a matrix-valued function which computes the orthogonal matrix required for rotating the data (e.g., using the QR algorithm).
This contrasts with the usual description of PCA, which is almost always presented as a linear transformation.
I designed a toy experiment to check the linearity (additivity property): $f(a + b) = f(a) + f(b)$.
import numpy
from sklearn.decomposition import PCA

if __name__ == '__main__':
    numpy.random.seed(42)
    m = 100
    d = 3
    X = numpy.random.normal(size=(m, d))
    # Center the data
    X -= numpy.mean(X, axis=0)
    pca = PCA(n_components=d)
    pca.fit(X)
    Y = pca.transform(X)
    # Check additivity: pca(a + b) = pca(a) + pca(b)
    for i in range(m):
        for j in range(m):
            diff = pca.transform([X[i] + X[j]]) - (Y[i] + Y[j])
            assert numpy.allclose(diff, numpy.zeros(d))
The expression $f(a + b) - (f(a) + f(b))$, where $f = \mathrm{PCA}$, evaluates to the zero vector in every case, so the transform (PCA) appears to be linear.
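(Additivity alone does not establish linearity; a linear map must also be homogeneous, $f(c\,a) = c\,f(a)$. A minimal complementary sketch, reusing the same setup as the script above; note that sklearn's `transform` subtracts the fitted mean, which is numerically zero here because the data was centered first:)

```python
import numpy as np
from sklearn.decomposition import PCA

np.random.seed(0)
X = np.random.normal(size=(100, 3))
X -= X.mean(axis=0)  # centered, so pca.mean_ is ~0 and transform acts linearly

pca = PCA(n_components=3).fit(X)
Y = pca.transform(X)

# Homogeneity check: pca(c * a) = c * pca(a) for a few scalars c
for c in (-2.0, 0.5, 10.0):
    assert np.allclose(pca.transform(c * X), c * Y)
```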
What am I missing then, that PCA is considered non-linear when the orthogonal matrix (the matrix of the principal components) is estimated from $X$ (see the quote above)?
Best Answer
I think the confusion comes down to what exactly is meant to be linear or non-linear here.
Using the notation of your quote, operation $w(X)$ maps a data matrix $X$ into a projector $P_k$ on the first $k$ principal axes of $X$. Let us be completely clear about the notation here; for simplicity let us fix $k=1$ and assume that $X$ is centered. Then $X\in\mathbb R^{n\times p}$ and $P\in \mathbb P^p \subset \mathbb R^{p\times p}$, where by $\mathbb P^p$ I mean the space of all matrices of the form $P=\mathbf{uu}^\top$ with $\mathbf u\in \mathbb R^p$ and $\|\mathbf u\|=1$.
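(As an illustrative sketch, not part of the original answer: the rank-one projector $P = \mathbf{uu}^\top$ can be built from sklearn's fitted first principal axis, and its defining properties checked numerically.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X -= X.mean(axis=0)  # center, as assumed in the answer

pca = PCA(n_components=1).fit(X)
u = pca.components_[0]   # unit vector u, the first principal axis
P = np.outer(u, u)       # P = u u^T, projector onto span(u)

assert np.allclose(P @ P, P)       # idempotent: P^2 = P
assert np.allclose(P.T, P)         # symmetric (orthogonal projector)
assert np.isclose(np.trace(P), 1)  # trace = rank = 1
```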
Now:
- The quote talks about the $w(\cdot)$ function; it transforms a data matrix into a projection operator. It is non-linear.
- Your script investigates the $P(\cdot)$ function; it transforms a high-dimensional vector into a low-dimensional PCA projection, given a fixed dataset. It is linear.
So $w(\cdot)$ is a non-linear mapping into linear projections.
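(A minimal sketch of the non-linearity of $w(\cdot)$, assuming sklearn's PCA and two arbitrary centered datasets of the same shape with different dominant directions: additivity fails, already because a sum of orthogonal matrices is generally not orthogonal.)

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Two centered datasets whose principal axes point in different directions
X1 = rng.normal(size=(100, 3)) @ np.diag([3.0, 1.0, 0.3])
X2 = rng.normal(size=(100, 3)) @ np.diag([0.3, 1.0, 3.0])
X1 -= X1.mean(axis=0)
X2 -= X2.mean(axis=0)

W1 = PCA().fit(X1).components_        # w(X1): orthogonal, rows unit-norm
W2 = PCA().fit(X2).components_        # w(X2): orthogonal, rows unit-norm
W12 = PCA().fit(X1 + X2).components_  # w(X1 + X2): also orthogonal

# Additivity fails: w(X1 + X2) != w(X1) + w(X2)
# (rows of W1 + W2 do not even have unit norm, while rows of W12 do)
assert not np.allclose(W12, W1 + W2)
```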
No contradiction.