Solved – How to transform the test set to the PCA space of the training set when the features in train and test are different

classification, pca, text mining

I'm working on a text classification project, and I want to reduce the dimensionality of the tf-idf matrix with Principal Component Analysis (PCA) and then train my model on the result, which is pretty straightforward. But once I do that with my training set, how do I transform my test set into the same space the training set was mapped to? This would be simple if I were working with data where the features are fixed (just multiply the test data matrix by the component matrix), but in text analysis the features change every time a document is added.
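For reference, the fixed-feature case works as in the minimal sketch below, assuming scikit-learn and two toy arrays with identical columns (all names here are illustrative):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))  # toy data: 100 samples, 20 fixed features
X_test = rng.normal(size=(30, 20))    # same 20 features, so the projection carries over

pca = PCA(n_components=5)
X_train_pca = pca.fit_transform(X_train)  # learn the components on the training set only
X_test_pca = pca.transform(X_test)        # project the test set onto those same components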

So my question is: how do I do this for text classification? Or should I try another approach?

Best Answer

What I did was discard all the words that appear in the test set but not in the training set, and rearrange the columns so that their order is the same in both matrices.

It can be seen in the following Python code: x_train is a pandas DataFrame containing the training text, and x_test is a pandas DataFrame containing the test text. (The imports are added for completeness; txtm is assumed to be the textmining package.)

import numpy as np
import pandas as pd
import textmining as txtm
from sklearn.feature_extraction.text import TfidfTransformer

# Train
tdm = txtm.TermDocumentMatrix()
for doc in x_train:
    tdm.add_doc(doc)
# Push the TDM data to a list of lists, then make that an ndarray, which then becomes a DataFrame.
tdm_rows = []
for row in tdm.rows(cutoff=3):  # cutoff=3 means only words that appear in at least 3 documents are kept
    tdm_rows.append(row)
tdm_array = np.array(tdm_rows[1:])  # rows after the first are the per-document counts
tdm_terms = tdm_rows[0]             # the first row is the list of terms
TDM_df_train = pd.DataFrame(tdm_array, columns=tdm_terms)
TDM_df_train = TDM_df_train.reindex(sorted(TDM_df_train.columns), axis=1)  # sort the columns alphabetically
# Test
tdm = txtm.TermDocumentMatrix()
for doc in x_test:
    tdm.add_doc(doc)
# Same conversion as above: list of lists, then ndarray, then DataFrame.
tdm_rows = []
for row in tdm.rows(cutoff=3):  # cutoff=3 means only words that appear in at least 3 documents are kept
    tdm_rows.append(row)
tdm_array = np.array(tdm_rows[1:])
tdm_terms = tdm_rows[0]
TDM_df_test = pd.DataFrame(tdm_array, columns=tdm_terms)
# Remove from TDM_df_test the words that aren't in TDM_df_train
for col in list(TDM_df_test.columns):
    if col not in TDM_df_train.columns:
        del TDM_df_test[col]
# Reindex to the (sorted) training vocabulary so both matrices share the same columns;
# training words that never occur in the test set become all-zero columns.
TDM_df_test = TDM_df_test.reindex(sorted(TDM_df_train.columns), axis=1, fill_value=0)

tfidf = TfidfTransformer()
tfidfRedTrain = tfidf.fit_transform(TDM_df_train.values)
tfidfRedTest = tfidf.transform(TDM_df_test.values)  # transform only: reuse the idf weights fitted on the training set

The last lines turn the TDMs into tf-idf matrices. Note that the transformer is fitted on the training matrix only and merely applied to the test matrix, so both use the same idf weights.
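To complete the pipeline from the question, the reduction step follows the same fit-on-train, transform-on-test pattern. A minimal sketch, assuming scikit-learn's TruncatedSVD (which, unlike PCA, accepts the sparse matrices produced above; n_components=100 is an arbitrary choice):

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100)
x_train_reduced = svd.fit_transform(tfidfRedTrain)  # learn the projection on the training tf-idf matrix
x_test_reduced = svd.transform(tfidfRedTest)        # project the test tf-idf matrix into the same space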

This works perfectly fine with small data sets, but now I have a new problem: can this be done more efficiently? With another dataset of 2000 documents it just keeps running and never finishes. Does anybody know a way to do this in Scala, Spark, Hadoop, or something else that is faster? Or can you recommend where to do it?
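One likely bottleneck is building the dense DataFrames and deleting columns one by one. Before switching platforms, a faster single-machine route is worth trying: scikit-learn's TfidfVectorizer fixes the vocabulary on the training set, and its transform() drops unseen test words automatically, with no manual column alignment. A sketch under those assumptions (min_df=3 is meant to mirror the cutoff=3 above):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

vectorizer = TfidfVectorizer(min_df=3)           # ignore words seen in fewer than 3 documents
tfidf_train = vectorizer.fit_transform(x_train)  # the vocabulary is fixed here
tfidf_test = vectorizer.transform(x_test)        # test-only words are silently dropped

svd = TruncatedSVD(n_components=100)
x_train_reduced = svd.fit_transform(tfidf_train)
x_test_reduced = svd.transform(tfidf_test)

Everything stays sparse until the SVD, which is usually what makes the 2000-document case tractable.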