I'm working on a text classification project, and I want to reduce the dimensionality of the tf-idf matrix with Principal Component Analysis (PCA) and then train my model on the result, which is straightforward. But once I do that with my training set, how do I transform my test set into the same space the training set was mapped to? This would be simple if I were working with data whose features are fixed (just multiply the test data matrix by the component matrix), but in text analysis the features change every time a document is added.
So my question is: how can I do this for text classification? Or should I try another approach?
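For the fixed-feature case mentioned above, the projection really is just a matter of reusing the components learned on the training set. A minimal sketch with scikit-learn (toy random data, not from the original question):

```python
# Sketch of the fixed-feature case: fit PCA on the training data,
# then reuse the learned components to project the test data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))  # 100 training samples, 20 fixed features
X_test = rng.normal(size=(10, 20))    # test samples with the SAME 20 features

pca = PCA(n_components=5)
X_train_reduced = pca.fit_transform(X_train)  # learn components on train only
X_test_reduced = pca.transform(X_test)        # project test into the same space

# transform() is equivalent to centering and multiplying by the components:
manual = (X_test - pca.mean_) @ pca.components_.T
```

Here `pca.transform` does exactly the "multiply by the component matrix" step (after centering with the training mean), so train and test land in the same reduced space.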
Best Answer
What I did was discard all the words that appear in the test set but not in the training set, and rearrange everything so the column order is the same in both matrices.
It can be seen in the following Python code. x_train is a pandas DataFrame containing the training text, and x_test is a pandas DataFrame containing the test text.
The last lines just convert the term-document matrices (TDMs) into TF-IDF matrices.
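The same idea can be expressed more compactly (a sketch, not the poster's original code): fitting a `TfidfVectorizer` on the training text fixes the vocabulary, and `transform()` on the test text then automatically drops unseen words and keeps the columns in the same order.

```python
# Fit the vocabulary on the training set; transform() then discards any
# test-set words that were never seen in training and aligns the columns.
from sklearn.feature_extraction.text import TfidfVectorizer

x_train = ["the cat sat on the mat", "dogs chase cats"]
x_test = ["the purple cat barked"]  # "purple" and "barked" are unseen words

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(x_train)  # learns vocabulary from train
tfidf_test = vectorizer.transform(x_test)        # unseen words are discarded
```

Both resulting matrices share the same columns in the same order, so a PCA (or similar) fitted on `tfidf_train` can be applied directly to `tfidf_test`.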
This works perfectly fine on small data sets, but now I have a new problem: can this be done more efficiently? On another dataset I have, with 2000 documents, it doesn't work; it just keeps training and never finishes. Does anybody know a way to do this in Scala, Spark, Hadoop, or something faster, or can you recommend where to do it?
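One likely culprit (an assumption on my part, not stated in the post) is densifying the tf-idf matrix before PCA, since scikit-learn's `PCA` requires dense input. A common alternative is `TruncatedSVD`, which works directly on sparse tf-idf matrices (this use is often called latent semantic analysis), so the data never has to be densified:

```python
# Sketch: reduce a sparse tf-idf matrix without ever densifying it,
# using TruncatedSVD instead of PCA (toy documents for illustration).
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs_train = ["spark and hadoop", "text classification with python",
              "sparse matrices stay sparse"]
docs_test = ["python text mining"]

vectorizer = TfidfVectorizer()
tfidf_train = vectorizer.fit_transform(docs_train)  # sparse CSR matrix
tfidf_test = vectorizer.transform(docs_test)

svd = TruncatedSVD(n_components=2)             # LSA-style reduction
train_reduced = svd.fit_transform(tfidf_train)  # fit on the training set only
test_reduced = svd.transform(tfidf_test)        # project test into same space
```

Whether this removes the bottleneck on the 2000-document set depends on where the time was actually going, but it avoids the dense-matrix blow-up that often causes PCA on text to stall.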