I am interested in using sparse PCA in Python, and I found the scikit-learn implementation. However, I think this Python implementation solves a different problem than the original sparse PCA algorithm proposed in this paper and implemented in the R package elasticnet. For example, consider the following example regarding the explained variance of the sparse PCs:
import numpy as np
from sklearn.datasets import load_boston
boston = load_boston()
from sklearn.decomposition import SparsePCA
x = boston.data # Load boston dataset
x = x[:, [0, 2, 4, 5, 6, 7, 10, 11, 12]] # select non-categorical variables
spca = SparsePCA(n_components=5, alpha=1e-3, ridge_alpha=1e-6, normalize_components=False) # Solve sparse pca
spca.fit(x)
t_spca = spca.transform(x)
p_spca = spca.components_.T
# Calculate the explained variance as explained in sparse pca original paper
t_spca_qr = np.linalg.qr(t_spca) # QR decomposition of modified PCs
q = t_spca_qr[0]
r = t_spca_qr[1]
# compute adjusted variance
variance = []
for i in range(5):
    variance.append(np.square(r[i][i]))
variance = np.array(variance)
# compute variance_ratio
total_variance_in_x = np.matrix.trace(np.cov(x.T)) # Variance in the original dataset
explained_variance_ratio = np.cumsum(variance / total_variance_in_x)
array([ 0.00010743, 0.00021485, 0.00032228, 0.0004297 , 0.00053713])
The explained variance calculation using sparse PCA is based on the original paper cited above, page 273. So as you can see, this explained variance ratio is pretty small.
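To make sure the QR-based adjusted-variance computation itself is right, it can be wrapped in a small helper and sanity-checked on ordinary PCA, where the score columns are uncorrelated and the adjusted variances should coincide with the usual per-component variances. A minimal sketch (the function name and the synthetic data are mine, not from scikit-learn or the paper):

```python
import numpy as np
from sklearn.decomposition import PCA

def adjusted_variance(t):
    """Adjusted variance of (possibly correlated) score columns via QR,
    following the construction in Zou, Hastie & Tibshirani: the adjusted
    variance of component j is R[j, j]**2 from the QR decomposition of T."""
    r = np.linalg.qr(t)[1]
    return np.diag(r) ** 2

# Sanity check on plain PCA: score columns are orthogonal, so R[j, j]**2
# equals the sum of squares of score column j, i.e. (n - 1) times
# sklearn's explained_variance_.
rng = np.random.RandomState(0)
x = rng.randn(200, 6) @ rng.randn(6, 6)  # correlated synthetic features
pca = PCA(n_components=3).fit(x)
t = pca.transform(x)
adj = adjusted_variance(t)
assert np.allclose(adj / (x.shape[0] - 1), pca.explained_variance_)
```

If the helper passes this check, any strange numbers for the sparse PCs come from the scores themselves, not from the adjusted-variance formula.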
My questions are:
- Am I doing something wrong on the calculations of the explained variance?
- Is there any place where the mathematical formulation for the sparse pca implemented in python can be found?
- What does the normalize_components parameter in the Python SPCA implementation do? Why does it raise a deprecation warning when set to False?
EDIT (based on the answer)
Based on the answers, I tested the results by setting normalize_components=True, which also avoids the deprecation warning.
spca = SparsePCA(n_components=5, alpha=1e-3, ridge_alpha=1e-6, normalize_components=True) # Solve sparse pca
spca.fit(x)
t_spca = spca.transform(x)
p_spca = spca.components_.T
t_spca_qr = np.linalg.qr(t_spca)
r = t_spca_qr[1]
# compute adjusted variance
variance = []
for i in range(5):
    variance.append(np.square(r[i][i]))
variance = np.array(variance)
This yields the following results for the variance:
variance
array([ 4255042.12587337, 386089.3106474 , 31883.68815345, 15333.57500443, 9781.36298748])
However, those numbers are much larger than the sample size n=505 and than the total variance of the original matrix x, total_variance_in_x=9308.784, so dividing the entries of the variance array by 505 or by 9308.784 still does not give me meaningful results for the explained variance.
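Part of the scale mismatch is simply sums of squares versus variances: r[i][i]**2 from the QR is a sum of squares over all n samples, while total_variance_in_x comes from np.cov, which is an unbiased per-sample variance (divisor n - 1). The two scales differ by a factor of n - 1 on centred data. A minimal sketch of that relationship on synthetic data (not the Boston data):

```python
import numpy as np

rng = np.random.RandomState(1)
n = 505
t = rng.randn(n, 3)
t = t - t.mean(axis=0)  # centred columns, like PCA/SPCA scores

sum_sq = (t ** 2).sum(axis=0)           # the scale of r[i][i]**2 from a QR
sample_var = np.var(t, axis=0, ddof=1)  # the scale of np.cov and total_variance_in_x
# A sum of squares over n centred samples is (n - 1) times the unbiased variance.
assert np.allclose(sum_sq / (n - 1), sample_var)
```

So before comparing the QR-based variance array against total_variance_in_x, it has to be divided by n - 1 (or n, to a close approximation), which is exactly the division by 505 suggested in the answer below.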
Best Answer
1. Your code seems to be in line with the cited paper, and the inconsistency is an artefact of the non-standard definition of principal components used in scikit-learn versions <0.22 (this is mentioned in the deprecation message). You can get meaningful results by setting normalize_components=True and dividing the variance array by the number of samples, 505. This gives you explained variance ratios like 0.90514782, 0.98727812, 0.99406053, 0.99732234, 0.99940307.
2. and 3. The most immediate way is to check the source files of the sklearn.decomposition module on your computer.
Details:
The code of SparsePCA, as in scikit-learn=0.21.3, has an unexpected artefact: it returns a transformation of the inputs such that its $QR$ decomposition has an $R$ diagonal with non-zero values in $\{-1, 1\}$ (rounding to 5 digits after the comma, easy to check), and therefore it is not reflective of the preserved variance, as it should be according to Zou, Hastie and Tibshirani.
normalize_components is key to getting meaningful variance ratios. This argument is only present in scikit-learn=0.21.3 and absent in scikit-learn=0.22. Set to False, it induces a normalization of the outputs that hides the information about the explained variance. One can see the details of this normalization in the source code. The deprecation warning says that this normalization is not in line with the common definition of principal components and is removed in version 0.22. If you are still using an old version like me, this normalization can be reversed as mentioned in the brief version of the answer.
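The artefact can be illustrated without installing the old scikit-learn version. Under my reading of the answer, the deprecated normalization amounts to rescaling the score columns to unit norm; once the columns have unit norm, the $R$ diagonal of the QR collapses to $\pm 1$ and carries no variance information, and undoing the rescaling restores it. A hedged sketch on synthetic data, using orthogonal SVD scores as a stand-in for the sparse PCs:

```python
import numpy as np

rng = np.random.RandomState(2)
x = rng.randn(100, 4) @ rng.randn(4, 4)
x = x - x.mean(axis=0)

# Orthogonal scores with variance information, as a stand-in for SPCA scores.
u, s, vt = np.linalg.svd(x, full_matrices=False)
t = u * s

# Rescale each column to unit norm, mimicking (under my assumptions) the
# normalization applied by the old normalize_components=False behaviour.
t_norm = t / np.linalg.norm(t, axis=0)

# All variance information is gone: the R diagonal is just +/-1.
r_norm = np.linalg.qr(t_norm)[1]
assert np.allclose(np.abs(np.diag(r_norm)), 1.0)

# Reversing the normalization (keeping the original column scales) restores it:
# R[j, j]**2 recovers the squared singular values, i.e. the preserved variance.
r = np.linalg.qr(t)[1]
assert np.allclose(np.diag(r) ** 2, s ** 2)
```

This matches the tiny ratios in the question: with unit-norm columns, each r[i][i]**2 is 1, and 1 / 9308.784 is roughly the 0.0001 increment seen in the cumulative sum.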