I have a question about the RTextTools package. From reading textbooks and Wikipedia, I understood text classification to be all about building a classifier. In RTextTools, however, you define the training and test documents in a single command. For example:
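# Draw a random sample of 100 rows from rows 1 to 3100
data <- data[sample(1:3100,size=100,replace=FALSE),]
# Build a TF-IDF weighted document-term matrix from the Title and Subject columns
matrix <- create_matrix(cbind(data$Title,data$Subject), language="english",
                        removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
# Rows 1 to 75 are used for training, rows 76 to 100 for testing
corpus <- create_corpus(matrix,data$Topic.Code,trainSize=1:75, testSize=76:100,virgin=FALSE)
# Train a maximum entropy model and an SVM on the training rows
models <- train_models(corpus, algorithms=c("MAXENT","SVM"))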
We create a corpus from one dataset, where the first 75 documents are used for training and the remaining 25 for testing.
But when it comes to applying text classification theory in practice, you want an already trained classifier that can be applied to different datasets. So far the only solution I have managed to find is to train with the same documents each time and then swap in different test data. That sounds very inconvenient for large texts. So my question is: how do I create a predefined classifier and then apply it to different datasets in R? It would be preferable to use the algorithms from RTextTools.
Best Answer
According to the documentation, you need the function classify_models (excerpt from the documentation page):
Here is the reproducible example:
The last line should produce an error, since the new data contains some terms that were not present in the data used to build the models. To circumvent this, it is better to use create_matrix on all the data you have, and then train the model on a sample of the result of create_matrix instead of on the original data.
This solution is a bit quirky, since the convenience function classify_models requires a corpus, in which the train and test data sets must be explicitly defined. Since the goal is to use the classifier, all data is test data. In this solution I sidestepped the problem by selecting only one observation for training and all the others for testing. To avoid this, it is possible to use the predict method directly:
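A minimal sketch of the train-once, predict-later idea, written with e1071 directly rather than with RTextTools (to my understanding RTextTools relies on e1071 for its SVM, but treat that as an assumption). It is not the code from the original answer. The toy documents, the labels, and the helpers tokenize and term_matrix are all hypothetical, and a plain term-count matrix stands in for create_matrix. The matrix is built over all documents at once, as suggested above, so the training rows and the new rows share the same columns:

```r
library(e1071)  # provides svm() and its predict() method

# Hypothetical labelled documents, used once for training.
train_docs   <- c("stocks fall on rate fears", "markets rally as rates drop",
                  "team wins cup final", "striker scores in final match",
                  "bank shares fall", "coach praises team after match")
train_labels <- factor(c("finance", "finance", "sport",
                         "sport", "finance", "sport"))

# Hypothetical new, unlabelled documents to classify later.
new_docs <- c("rates rise and stocks fall", "team loses final match")

# Split a document into lower-case word tokens.
tokenize <- function(doc) strsplit(tolower(doc), "[^a-z]+")[[1]]

# Count how often each vocabulary term occurs in each document.
# Every matrix built with the same vocabulary has identical columns.
term_matrix <- function(docs, vocab) {
  counts <- vapply(
    docs,
    function(d) as.numeric(table(factor(tokenize(d), levels = vocab))),
    numeric(length(vocab)),
    USE.NAMES = FALSE
  )
  m <- t(counts)            # one row per document, one column per term
  colnames(m) <- vocab
  m
}

# Build ONE matrix over all documents so the column set is shared.
all_docs  <- c(train_docs, new_docs)
vocab     <- sort(unique(unlist(lapply(all_docs, tokenize))))
X         <- term_matrix(all_docs, vocab)
train_idx <- seq_along(train_docs)
new_idx   <- length(train_docs) + seq_along(new_docs)

# Train once on the training rows. scale = FALSE because terms that occur
# only in the new documents are constant (all zero) in the training rows.
model <- svm(x = X[train_idx, , drop = FALSE], y = train_labels,
             kernel = "linear", scale = FALSE)

# Apply the stored model to the new rows; no train/test split is declared.
pred <- predict(model, X[new_idx, , drop = FALSE])
print(pred)
```

Because predict only needs the fitted model and a matrix with matching columns, the same model object can be reused for any later batch of documents, provided that batch is mapped onto the same vocabulary.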