Solved – Why do we use PCA to speed up learning algorithms when we could just reduce the number of features

machine-learning, pca

In a machine learning course, I learned that one common use of PCA (Principal Component Analysis) is to speed up other machine learning algorithms. For example, imagine you are training a logistic regression model. If you have a training set $(x^{(i)}, y^{(i)})$ for $i = 1, \dots, n$ and the dimension of your vector $x$ is very large (say $a$ dimensions), you can use PCA to get a smaller, $k$-dimensional feature vector $z$. Then you can train your logistic regression model on the training set $(z^{(i)}, y^{(i)})$ for $i = 1, \dots, n$. Training this model will be faster because your feature vector has fewer dimensions.
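Here is a minimal sketch of that workflow using scikit-learn; the synthetic dataset and the choice of $k = 50$ are illustrative assumptions, not values from the course example.

```python
# Minimal sketch of the workflow described above (scikit-learn).
# The synthetic dataset and k = 50 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# n examples x^(i) with a = 500 features each, plus labels y^(i)
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)

# Project the a-dimensional x vectors to k-dimensional z vectors,
# then train logistic regression on (z^(i), y^(i)).
k = 50
model = make_pipeline(PCA(n_components=k), LogisticRegression(max_iter=1000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```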

However, I don't understand why you can't just reduce the dimension of your feature vector to $k$ by choosing $k$ of your features at random and eliminating the rest.

The $z$ vectors are linear combinations of your $a$ original features. Since the $z$ vectors are confined to a $k$-dimensional surface, you can write the $a - k$ eliminated feature values as a linear function of the $k$ remaining feature values, and thus all the $z$'s can be formed by linear combinations of your $k$ features. So shouldn't a model trained on a training set with eliminated features have the same power as a model trained on a training set whose dimension was reduced by PCA? Does it just depend on the type of model and whether it relies on some sort of linear combination?

Best Answer

Let's say you initially have $p$ features, but this is too many, so you want to fit your model on $d < p$ features. You could choose $d$ of your features and drop the rest. If $X$ is our feature matrix, this corresponds to using $XD$, where $D \in \{0,1\}^{p \times d}$ picks out exactly the columns of $X$ that we want to include. But this ignores all information in the other columns, so why not consider a more general dimension reduction $XV$, where $V \in \mathbb R^{p \times d}$? This is exactly what PCA does: we find the matrix $V$ such that $XV$ contains as much of the information in $X$ as possible.

Not all linear combinations are created equal. Unless our $X$ matrix is of such low rank that a random set of $d$ columns can (with high probability) span the column space of all $p$ columns, we will certainly not be able to do just as well as with all $p$ features. Some information will be lost, and so it behooves us to lose as little information as possible. With PCA, the "information" that we're trying to avoid losing is the variation in the data.
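As a rough illustration of this point (my own sketch, not part of the original answer), one can compare how much of the variance in $X$ is retained by $d$ randomly chosen columns versus the first $d$ principal components. PCA is guaranteed to do at least as well, since it gives the best rank-$d$ linear reconstruction of $X$; the synthetic data below is only an assumed example.

```python
# Illustrative comparison: variance retained by d random columns (X D)
# versus the first d principal components (X V). Data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, _ = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)
Xc = X - X.mean(axis=0)   # center, as PCA does
d = 10

# PCA keeps the d directions of maximal variance.
pca = PCA(n_components=d).fit(Xc)
print("variance kept by PCA:", pca.explained_variance_ratio_.sum())

# Random column selection keeps d raw columns; measure the fraction of
# total variance they can reconstruct by regressing X on those columns.
cols = rng.choice(Xc.shape[1], size=d, replace=False)
recon = LinearRegression().fit(Xc[:, cols], Xc).predict(Xc[:, cols])
kept = 1 - ((Xc - recon) ** 2).sum() / (Xc ** 2).sum()
print("variance kept by", d, "random columns:", kept)
```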

As for why we restrict ourselves to linear transformations of the predictors, the whole point in this use case is computation time. If we could afford to do fancy non-linear dimension reduction on $X$, we could probably just fit the model on all of $X$ too. So PCA sits perfectly at the intersection of fast-to-compute and effective.
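If it helps, here is a rough timing sketch of the speed argument (assumed synthetic data and hyperparameters; actual timings depend on the data, the solver, and $k$):

```python
# Rough, illustrative timing sketch: logistic regression on all p features
# versus on k PCA components. Numbers will vary by machine and dataset.
import time
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=2000,
                           n_informative=30, random_state=0)

t0 = time.perf_counter()
LogisticRegression(max_iter=1000).fit(X, y)
print("all features:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
Z = PCA(n_components=50).fit_transform(X)   # linear projection to k = 50
LogisticRegression(max_iter=1000).fit(Z, y)
print("PCA + fit:   ", time.perf_counter() - t0, "s")
```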
