Is there an incremental dimensionality reduction algorithm that can handle batch size less than number of components to be reduced

dimensionality-reduction, online-algorithms, pca, scikit-learn

I have a large dataset of patient data by hour. For example, given the shape as (hours, features), patient 1 data shape could be (30, 76) and patient 2 data shape could be (5, 76). I want to do incremental learning of the dataset because all the data does not fit inside the memory.

Before feeding the data to a classifier, I am using IncrementalPCA from scikit-learn as a dimensionality reduction method. If I set IncrementalPCA's n_components to, let's say, 20, patient 1's data can be reduced. However, for patient 2, an exception is raised: "n_components=20 must be less or equal to the batch number of samples 5".

I did not set n_components to less than or equal to 5 because, even though that would make the PCA work, it would reduce patient 1's data too aggressively, which is not ideal. Setting n_components to None sets n_components to min(n_samples, n_features) according to the scikit-learn documentation, which I think is not ideal either. Correct me if I'm wrong.

A minimal example is as follows:

import numpy as np
from sklearn.decomposition import IncrementalPCA

patient_1 = np.random.randint(0, 5, (30, 76))
patient_2 = np.random.randint(0, 5, (5, 76))

pca = IncrementalPCA(n_components=20)

# ====== Successful PCA reduction ======
pca.partial_fit(patient_1)
red_data = pca.transform(patient_1)
print(f"red_data.shape {red_data.shape}")

# ====== Failed PCA reduction ======
# Raises "n_components=20 must be less or equal to the batch number of samples 5"
try:
    pca.partial_fit(patient_2)
    red_data = pca.transform(patient_2)
except ValueError as v:
    print(v)

Question:

Is there an incremental dimensionality reduction algorithm (ideally with a Python implementation) that can reduce the feature dimensions without PCA's restriction that n_components must be less than or equal to the batch number of samples?

Best Answer

There is no such algorithm, because what you are asking for is impossible. Here is why:

What PCA does is the following. Let's say, just for simplicity, that you don't have 76 features but only three. Then the data of patient 1 would have shape $(30, 3)$. You can interpret each row as a point in three-dimensional space $\mathbb{R}^3$, i.e. the data of patient 1 becomes a point cloud of 30 points in $\mathbb{R}^3$.

Now, let's say you call PCA on this data with n_components = 2. What PCA then tries to do is position an n_components-dimensional (i.e. 2-dimensional) plane through your data such that the sum of squared distances of all 30 points to this plane is minimal.

For example, if your points $(x, y, z)$ have arbitrary values for $x$ and $y$, but $z$ is always near zero, PCA will probably return the 2-dimensional plane that is the $(x,y)$-plane.
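This can be checked numerically. The following sketch (with made-up data: x and y drawn uniformly, z just small noise) shows that both fitted components end up lying almost exactly in the $(x,y)$-plane, i.e. their z-entries are near zero:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 30 points: x and y vary freely, z stays near zero
points = np.column_stack([
    rng.uniform(-10, 10, 30),   # x
    rng.uniform(-10, 10, 30),   # y
    rng.normal(0, 0.01, 30),    # z ~ 0
])

pca = PCA(n_components=2).fit(points)
# Both principal components lie (almost) in the (x, y)-plane,
# so their z-entries are close to zero.
print(pca.components_)
```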

But now imagine you don't have 30 data points, but only two, and that you ask PCA for a fit with n_components=2. That is, you want PCA to find the 2D plane that optimally approximates your two points. But there are infinitely many such 2D planes!

A plane through those two points can be rotated about the axis defined by them, and every such rotation yields a plane with zero (i.e. minimal) distance to both points. So it is impossible for PCA to give you a single optimal 2D plane. Your problem is of the same type, only with 76 dimensions instead of just three.
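You can see this degeneracy directly in the singular values (which is what PCA computes under the hood). With two arbitrary points, the centered point cloud has rank 1, so only one singular value is non-zero and a second principal direction is simply not determined:

```python
import numpy as np

# Two arbitrary points in R^3
points = np.array([[1.0, 2.0, 3.0],
                   [4.0, 0.0, 1.0]])

# Center the cloud, as PCA does, then look at the singular values
centered = points - points.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

# Only the first singular value is non-zero: the point cloud has
# rank 1, so no second principal direction can be singled out.
print(singular_values)
```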

The reason PCA refuses to arbitrarily select a plane from all those optimal planes is that this is almost never what the user wants, and is almost always due to a typo or misunderstanding.

So, I would recommend rethinking your strategy. I don't know your data, but I could imagine it would be beneficial to run PCA not for each patient separately, but once, on (a subsample of) the combined data from all patients. Then you would always have enough data for PCA, and you would also avoid the problem of interpreting different PCA components for different patients.
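A minimal sketch of that idea, staying with IncrementalPCA from the question (the `patients` list here is a hypothetical stand-in for the real data): pool rows from consecutive patients into a buffer and only call partial_fit once the buffer holds at least n_components rows. Note that only partial_fit has the batch-size restriction; transform does not, so each patient can still be transformed individually afterwards.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the real data: patients with varying hour counts
patients = [rng.integers(0, 5, (n, 76)) for n in (30, 5, 12, 3, 50)]

n_components = 20
pca = IncrementalPCA(n_components=n_components)

# Pool rows across patients so that every partial_fit batch has at
# least n_components rows (any leftover rows at the end could be
# folded into the last batch).
buffer = np.empty((0, 76))
for data in patients:
    buffer = np.vstack([buffer, data])
    if len(buffer) >= n_components:
        pca.partial_fit(buffer)
        buffer = np.empty((0, 76))

# transform has no minimum batch size, so even the 3-hour patient works
reduced = [pca.transform(data) for data in patients]
print([r.shape for r in reduced])
```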

But I could also imagine that it would make sense to replace PCA with some sensible summary variables: e.g., compute for each patient various feature means and variances, or quantiles, or whatever you think makes the most sense.
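As a sketch of that alternative (the choice of statistics here is arbitrary and just for illustration), each patient's variable-length (hours, features) array collapses into one fixed-length vector, regardless of how many hours were recorded:

```python
import numpy as np

def summarize(patient_data):
    """Collapse an (hours, features) array into one fixed-length vector
    of per-feature summary statistics (pick whatever statistics suit
    the data; mean/std/median are just examples)."""
    return np.concatenate([
        patient_data.mean(axis=0),
        patient_data.std(axis=0),
        np.median(patient_data, axis=0),
    ])

rng = np.random.default_rng(0)
patient_1 = rng.integers(0, 5, (30, 76))
patient_2 = rng.integers(0, 5, (5, 76))

# Both patients map to the same fixed length: 3 statistics x 76 features
print(summarize(patient_1).shape, summarize(patient_2).shape)
```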