Solved – Does PCA preserve observations (rows) of the data

pca

Say I have a data matrix of size $N \times P$ where $N$ is the number of samples and $P$ is the number of features. Now, if I do principal component analysis, I get another data matrix of size $N \times K$ where $K$ was chosen according to some criteria. My question: if I pick a row (sample) from the $\text{PCA}$ matrix, does it still point to the same sample as in the original data matrix?

In my study the data on each row is from one subject, so I want to know if the correspondence still exists if I use $\text{PCA}$ for feature selection. (I think this is correct but better safe than sorry…!)

Best Answer

As others said in the comments, yes, it does preserve row order. If you have a standardized data matrix $\mathbf D$ and a rotation matrix $\mathbf \Omega$, you get your rotated samples $\mathbf R$ simply by computing:

$$\mathbf D \mathbf \Omega = \mathbf R$$

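As you can see from the matrix multiplication, row order is preserved: row $i$ of $\mathbf R$ is row $i$ of $\mathbf D$ multiplied by $\mathbf \Omega$, so it depends only on sample $i$.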

$\mathbf \Omega$ is a square matrix with $P$ rows and columns. If you want to reduce the number of PCs all you have to do is keep only the first $K$ columns of $\mathbf \Omega$.
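The row correspondence can be checked numerically. The following is a minimal NumPy sketch, not part of the original answer: the random data, the sizes `N`, `P`, `K`, and all variable names are assumptions made for illustration. It builds $\mathbf \Omega$ from the eigenvectors of the covariance matrix of $\mathbf D$, keeps the first $K$ columns, and compares single rows and permuted rows.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, K = 6, 4, 2  # hypothetical sizes: samples, features, kept components

# Hypothetical data: N samples (rows) by P features (columns).
X = rng.normal(size=(N, P))

# Standardize the columns to obtain D.
D = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Rotation matrix Omega: eigenvectors of the covariance matrix of D,
# sorted by decreasing eigenvalue.
C = D.T @ D / (N - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
Omega = eigvecs[:, order]

# Keep the first K columns of Omega to get an N x K matrix of rotated samples.
R = D @ Omega[:, :K]

# Row i of R is computed from row i of D alone.
i = 3
row_i_alone = D[i] @ Omega[:, :K]

# Permuting the rows of D permutes the rows of R in exactly the same way.
perm = rng.permutation(N)
R_perm = D[perm] @ Omega[:, :K]
```

Here `R[i]` matches `row_i_alone`, and `R_perm` matches `R[perm]`, so each row of the reduced matrix still belongs to the same subject as the corresponding row of the original data.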

Related Solutions

Solved – How to transform test set to the PCA space of the training set, if the features in train and test are different

What I did was discard all the words that appear in the test set but not in the training set, and reorder the columns so that they are in the same order in both matrices.

This is shown in the following Python code. x_train is a pandas DataFrame containing the training text, and x_test is a pandas DataFrame containing the test text.

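import numpy as np
import pandas as pd
import textmining as txtm  # assumption: txtm refers to the textmining package
from sklearn.feature_extraction.text import TfidfTransformer

# Train: build the term-document matrix (TDM) from the training documents
tdm = txtm.TermDocumentMatrix()
for doc in x_train:
    tdm.add_doc(doc)
# Push the TDM data to a list of lists, then make that an ndarray, which then becomes a DataFrame.
tdm_rows = []
for row in tdm.rows(cutoff=3):  # cutoff=3 keeps only words that appear in 3 or more documents
    tdm_rows.append(row)
tdm_array = np.array(tdm_rows[1:])  # the remaining rows hold the per-document counts
tdm_terms = tdm_rows[0]  # the first row holds the terms
TDM_df_train = pd.DataFrame(tdm_array, columns=tdm_terms)
# Sort the columns alphabetically (reindex_axis is not available in newer pandas versions)
TDM_df_train = TDM_df_train.reindex(columns=sorted(TDM_df_train.columns))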
# Test: build the TDM for the test documents in the same way
tdm = txtm.TermDocumentMatrix()
for doc in x_test:
    tdm.add_doc(doc)
# Push the TDM data to a list of lists, then make that an ndarray, which then becomes a DataFrame.
tdm_rows = []
for row in tdm.rows(cutoff=3):  # cutoff=3 keeps only words that appear in 3 or more documents
    tdm_rows.append(row)
tdm_array = np.array(tdm_rows[1:])
tdm_terms = tdm_rows[0]
TDM_df_test = pd.DataFrame(tdm_array, columns=tdm_terms)
# Align the test columns with the training columns: reindex drops words that are
# not in TDM_df_train and adds training words missing from the test set as zeros.
TDM_df_test = TDM_df_test.reindex(columns=sorted(TDM_df_train.columns), fill_value=0)

tfidf = TfidfTransformer()
tfidfRedTrain = tfidf.fit_transform(TDM_df_train.values)
# Reuse the IDF weights learned on the training set instead of refitting on the test set
tfidfRedTest = tfidf.transform(TDM_df_test.values)

The last lines just turn the TDMs to TFIDF matrices.

This works perfectly well with small data sets, but now I have a new problem. Can this be done more efficiently? With another dataset of 2000 documents it doesn't work: it just keeps running and never finishes. Does anybody know a way to do this in Scala, Spark, Hadoop, or something else that is faster? Or can anyone recommend where to do it?
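One possible way to avoid the dense DataFrames is sketched below, assuming scikit-learn is acceptable as a substitute for the textmining-based code above. This is a swapped-in technique, not the method used in the post: `CountVectorizer` learns the vocabulary on the training text only, so the test matrix automatically has the same columns in the same order, and `TruncatedSVD` works directly on sparse matrices. Note that `TruncatedSVD` does not center the data, so it is closely related to PCA but not identical to it. The toy corpora, `min_df=2`, and `n_components=2` are illustrative assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy corpora standing in for x_train and x_test.
x_train = [
    "pca keeps the row order",
    "the gram matrix is small",
    "pca uses the covariance matrix",
    "the row order is kept",
]
x_test = ["pca and the gram matrix", "unseen words keep the row order"]

# Learn the vocabulary on the training text only. min_df=2 keeps words that
# occur in at least 2 training documents (value chosen for this toy corpus).
vectorizer = CountVectorizer(min_df=2)
counts_train = vectorizer.fit_transform(x_train)  # sparse, documents x terms
counts_test = vectorizer.transform(x_test)        # same columns; unseen words are dropped

# Fit the IDF weights on the training counts and reuse them for the test counts.
tfidf = TfidfTransformer()
tfidf_train = tfidf.fit_transform(counts_train)
tfidf_test = tfidf.transform(counts_test)

# Fit the low-dimensional projection on the training set and apply it to the test set.
svd = TruncatedSVD(n_components=2, random_state=0)
scores_train = svd.fit_transform(tfidf_train)
scores_test = svd.transform(tfidf_test)
```

Because every step is fitted on the training set and only applied to the test set, the test documents end up in the same reduced space as the training documents, and the matrices stay sparse until the final projection.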

Dimensionality in PCA – Is PCA Still Done via Eigendecomposition of Covariance Matrix with High Dimensionality?

The covariance matrix is of size $D\times D$ and is given by $$\mathbf C = \frac{1}{N-1}\mathbf X_0^\top \mathbf X^\phantom\top_0,$$ where $\mathbf X_0$ is the centered $N\times D$ data matrix.

The matrix you are talking about is of course not a covariance matrix; it is called the Gram matrix and is of size $N\times N$: $$\mathbf G = \frac{1}{N-1}\mathbf X^\phantom\top_0 \mathbf X_0^\top.$$

Principal component analysis (PCA) can be implemented via eigendecomposition of either of these matrices. These are just two different ways to compute the same thing.

The easiest and most useful way to see this is to use the singular value decomposition of the centered data matrix $\mathbf X_0 = \mathbf {USV}^\top$. Plugging this into the expressions for $\mathbf C$ and $\mathbf G$, and using $\mathbf U^\top \mathbf U = \mathbf I$ and $\mathbf V^\top \mathbf V = \mathbf I$, we get: \begin{align}\mathbf C&=\mathbf V\frac{\mathbf S^2}{N-1}\mathbf V^\top\\\mathbf G&=\mathbf U\frac{\mathbf S^2}{N-1}\mathbf U^\top.\end{align}

Eigenvectors $\mathbf V$ of the covariance matrix are principal directions. Projections of the data on these eigenvectors are principal components; these projections are given by $\mathbf {US}$. Principal components scaled to unit length are given by $\mathbf U$. As you see, eigenvectors of the Gram matrix are exactly these scaled principal components. And the nonzero eigenvalues of $\mathbf C$ and $\mathbf G$ coincide.

The reason why you might see it recommended to use the Gram matrix if $N<D$ is that it is smaller than the covariance matrix, and hence faster to compute and faster to eigendecompose. In fact, if your dimensionality $D$ is too high, there is no way you can even store the covariance matrix in memory, so operating on the Gram matrix is the only way to do PCA. But for manageable $D$ you can still use eigendecomposition of the covariance matrix if you prefer, even if $N<D$.
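The equivalence of the two routes can be illustrated with a small NumPy sketch. The random data, the sizes `N` and `D`, and the variable names are assumptions made for illustration. It eigendecomposes the smaller Gram matrix, recovers the principal components $\mathbf{US}$ and the principal directions $\mathbf V$ from it, and computes the leading eigenvalues of the covariance matrix for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 8  # fewer samples than features, so G (N x N) is smaller than C (D x D)

X = rng.normal(size=(N, D))
X0 = X - X.mean(axis=0)  # centered data matrix

C = X0.T @ X0 / (N - 1)  # covariance matrix, D x D
G = X0 @ X0.T / (N - 1)  # Gram matrix, N x N

# Eigendecompose the smaller matrix G; eigh returns eigenvalues in ascending order.
g_vals, g_vecs = np.linalg.eigh(G)
order = np.argsort(g_vals)[::-1]
g_vals, U = g_vals[order], g_vecs[:, order]

# Centering leaves at most N - 1 nonzero eigenvalues; keep only those.
k = N - 1
g_vals, U = g_vals[:k], U[:, :k]

# Principal components are U S, with singular values S = sqrt((N - 1) * eigenvalues).
S = np.sqrt((N - 1) * g_vals)
scores = U * S

# Principal directions follow from X0 = U S V^T, i.e. V = X0^T U S^{-1}.
V = X0.T @ U / S

# Leading eigenvalues of the covariance matrix, for comparison.
c_vals = np.sort(np.linalg.eigvalsh(C))[::-1][:k]
```

In this sketch `g_vals` agrees with `c_vals`, the columns of `V` are eigenvectors of `C`, and `X0 @ V` reproduces `scores`, without ever eigendecomposing the $D\times D$ matrix.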

See also: Relationship between eigenvectors of $\frac{1}{N}XX^\top$ and $\frac{1}{N}X^\top X$ in the context of PCA
