Solved – Obtaining final classifier in RTextTools

classification, r, text mining

I have a question concerning the RTextTools package. When I was reading textbooks and Wikipedia, I got the impression that text classification is all about defining a classifier. In RTextTools, however, you define the training and testing documents in one command. For example:

library(RTextTools)

# 'data' is assumed to be a data frame of documents loaded earlier
data <- data[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data$Title, data$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize=1:75, testSize=76:100,
    virgin=FALSE)
models <- train_models(corpus, algorithms=c("MAXENT","SVM"))

We create a corpus from a single dataset, where the first 75 documents are used for training and the rest for testing.
But when it comes to applying text classification in practice, you want an already-trained classifier that can be applied to different datasets. So far, the only solution I have found is to train on the same documents every time and then swap in different test data, which becomes very inconvenient with large collections of text. So my question is: how can I create a classifier once and then apply it to different datasets in R? It would be preferable to use the algorithms from RTextTools.

Best Answer

According to the documentation, you need the classify_models function (excerpt from the documentation page):

Description

Uses a trained model from the train_models function to classify new data.

Usage

classify_models(corpus, models, ...)

Arguments

corpus: Class of type matrix_container-class generated by the create_corpus function.

models: List of models to be used for classification, generated by train_models.

...: Other parameters to be passed on to classify_model.

Here is a reproducible example:

### Train a model (taken from the `classify_model` man page)

library(RTextTools)    
set.seed(123)
alldata <- read_data(system.file("data/NYTimes.csv.gz",package="RTextTools"),type="csv")
smpl <- sample(1:3100,size=100)
data <- alldata[smpl,]
matrix <- create_matrix(cbind(data$Title, data$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize=1:75, testSize=76:100,
    virgin=FALSE)

models <- train_models(corpus, algorithms=c("MAXENT","SVM"))

### Create new data that you want to classify:

newdata <- alldata[sample((1:3100)[-smpl],size=101,replace=FALSE),]

newmatrix <- create_matrix(cbind(newdata$Title, newdata$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)

newcorpus <- create_corpus(newmatrix, newdata$Topic.Code, trainSize=1:1, testSize=2:101,
    virgin=FALSE)


newresults1 <- classify_model(newcorpus,models[[1]])
newresults2 <- classify_model(newcorpus,models[[2]])

The last line should produce an error, because the new data contains terms that were not present in the data used to build the models. To circumvent this, it is better to run create_matrix on all the data you have and then train the model on a sample of the resulting matrix, instead of sampling the original data first.
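For example, here is a rough sketch of that approach, reusing data and newdata from above (combined, fullmatrix, fullcorpus, fullmodels and fullresults are just illustrative names; adjust the index ranges to your own row counts):

combined <- rbind(data, newdata)   # 100 training rows followed by the 101 new rows
fullmatrix <- create_matrix(cbind(combined$Title, combined$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
fullcorpus <- create_corpus(fullmatrix, combined$Topic.Code, trainSize=1:100,
    testSize=101:201, virgin=FALSE)
fullmodels <- train_models(fullcorpus, algorithms=c("MAXENT","SVM"))
fullresults <- classify_models(fullcorpus, fullmodels)

Because the matrix is built on all documents at once, the test rows cannot introduce terms the models have never seen.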

This solution is a bit quirky, since the convenience function classify_models demands a corpus in which training and test sets are explicitly defined, whereas here the goal is to use an already-trained classifier, so all of the data is really test data. In the example above I sidestepped the problem by assigning a single observation to the training set and everything else to the test set. To avoid the workaround altogether, you can use the predict method directly:

predict(models[[1]],as.compressed.matrix(newmatrix))
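Note that this bypasses the container bookkeeping, so the result comes back in whatever format the underlying package (the maxent package, in the case of models[[1]]) returns, rather than the standardized label/probability data frame produced by classify_model. The vocabulary caveat from above still applies: newmatrix must contain the same terms as the matrix the model was trained on. For the SVM model the analogous call would presumably be predict(models[[2]], as.compressed.matrix(newmatrix)), since e1071 also accepts SparseM matrices, but I have not tested that.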