Solved – How to transform the test set to the PCA space of the training set when the features in train and test are different

classification, pca, text mining

I'm working on a text classification project, and I want to reduce the dimensionality of the tf-idf matrix with Principal Component Analysis (PCA) and then train my model on the result, which is pretty straightforward. But once I do that with my training set, how do I transform my test set into the same space the training set was mapped to? This would be simple if I were working with data where the features are fixed (just multiply the test data matrix by the component matrix), but in text analysis the features change every time a document is added.
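For reference, the fixed-feature case works as in the minimal sketch below, assuming scikit-learn and two toy arrays with identical columns (all names here are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))  # toy data: 100 samples, 20 fixed features
X_test = rng.normal(size=(30, 20))    # same 20 features, so the projection carries over

pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train)  # learn the components on the training set only
X_test_pca = pca.transform(X_test)        # project the test set onto those same components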

So my question is: how do I do this for text classification? Or should I try another approach?

Best Answer

What I did was discard all the words that appear in the test set but not in the training set, and rearrange the columns so that their order is the same in both matrices.

It can be seen in the following Python code: x_train is a pandas DataFrame containing the training text, and x_test is a pandas DataFrame containing the test text. (The imports are added for completeness; txtm is assumed to be the textmining package.)

import numpy as np
import pandas as pd
import textmining as txtm
from sklearn.feature_extraction.text import TfidfTransformer

# Train
tdm = txtm.TermDocumentMatrix()
for doc in x_train:
    tdm.add_doc(doc)
# Push the TDM data to a list of lists, then make that an ndarray, which then becomes a DataFrame.
tdm_rows = []
for row in tdm.rows(cutoff=3):  # cutoff=3 means only words that appear in at least 3 documents are kept
    tdm_rows.append(row)
tdm_array = np.array(tdm_rows[1:])  # rows after the first are the per-document counts
tdm_terms = tdm_rows[0]             # the first row is the list of terms
TDM_df_train = pd.DataFrame(tdm_array, columns=tdm_terms)
TDM_df_train = TDM_df_train.reindex(sorted(TDM_df_train.columns), axis=1)  # sort the columns alphabetically
# Test
tdm = txtm.TermDocumentMatrix()
for doc in x_test:
    tdm.add_doc(doc)
# Same conversion as above: list of lists, then ndarray, then DataFrame.
tdm_rows = []
for row in tdm.rows(cutoff=3):  # cutoff=3 means only words that appear in at least 3 documents are kept
    tdm_rows.append(row)
tdm_array = np.array(tdm_rows[1:])
tdm_terms = tdm_rows[0]
TDM_df_test = pd.DataFrame(tdm_array, columns=tdm_terms)
# Remove from TDM_df_test the words that aren't in TDM_df_train
for col in list(TDM_df_test.columns):
    if col not in TDM_df_train.columns:
        del TDM_df_test[col]
# Reindex to the (sorted) training vocabulary so both matrices share the same columns;
# training words that never occur in the test set become all-zero columns.
TDM_df_test = TDM_df_test.reindex(sorted(TDM_df_train.columns), axis=1, fill_value=0)

tfidf = TfidfTransformer()
tfidfRedTrain = tfidf.fit_transform(TDM_df_train.values)
tfidfRedTest = tfidf.transform(TDM_df_test.values)  # transform only: reuse the idf weights fitted on the training set

The last lines turn the TDMs into tf-idf matrices. Note that the transformer is fitted on the training matrix only and merely applied to the test matrix, so both use the same idf weights.
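To complete the pipeline from the question, the reduction step follows the same fit-on-train, transform-on-test pattern. A minimal sketch, assuming scikit-learn's TruncatedSVD (which, unlike PCA, accepts the sparse matrices produced above; n_components=100 is an arbitrary choice):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
x_train_reduced = svd.fit_transform(tfidfRedTrain)  # learn the projection on the training tf-idf matrix
x_test_reduced = svd.transform(tfidfRedTest)        # project the test tf-idf matrix into the same space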

This works perfectly fine with small data sets, but now I have a new problem: can this be done more efficiently? With another dataset of 2000 documents it just keeps running and never finishes. Does anybody know a way to do this in Scala, Spark, Hadoop, or something else that is faster? Or can you recommend where to do it?
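One likely bottleneck is building the dense DataFrames and deleting columns one by one. Before switching platforms, a faster single-machine route is worth trying: scikit-learn's TfidfVectorizer fixes the vocabulary on the training set, and its transform() drops unseen test words automatically, with no manual column alignment. A sketch under those assumptions (min_df=3 is meant to mirror the cutoff=3 above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer(min_df=3)           # ignore words seen in fewer than 3 documents
tfidf_train = vectorizer.fit_transform(x_train)  # the vocabulary is fixed here
tfidf_test = vectorizer.transform(x_test)        # test-only words are silently dropped

svd = TruncatedSVD(n_components=100)
x_train_reduced = svd.fit_transform(tfidf_train)
x_test_reduced = svd.transform(tfidf_test)

Everything stays sparse until the SVD, which is usually what makes the 2000-document case tractable.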