Solved – Obtaining final classifier in RTextTools

classification, r, text mining

I have a question concerning the RTextTools package. When I was reading textbooks and Wikipedia, I got the impression that text classification is all about defining a classifier. In RTextTools, however, you define the training and testing documents in one command. For example:

library(RTextTools)

# 'data' is assumed to be a data frame of documents loaded earlier
data <- data[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data$Title, data$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize=1:75, testSize=76:100,
    virgin=FALSE)
models <- train_models(corpus, algorithms=c("MAXENT","SVM"))

We create a corpus from a single dataset, where the first 75 documents are used for training and the rest for testing.
But when it comes to applying text classification in practice, you want an already-trained classifier that can be applied to different datasets. So far, the only solution I have found is to train on the same documents every time and then swap in different test data, which becomes very inconvenient with large collections of text. So my question is: how can I create a classifier once and then apply it to different datasets in R? It would be preferable to use the algorithms from RTextTools.

Best Answer

According to the documentation, you need the classify_models function (excerpt from the documentation page):

Description

Uses a trained model from the train_models function to classify new data.

Usage

classify_models(corpus, models, ...)

Arguments

corpus: Class of type matrix_container-class generated by the create_corpus function.

models: List of models to be used for classification, generated by train_models.

...: Other parameters to be passed on to classify_model.

Here is a reproducible example:

### Train a model (taken from the `classify_model` man page)

library(RTextTools)    
set.seed(123)
alldata <- read_data(system.file("data/NYTimes.csv.gz",package="RTextTools"),type="csv")
smpl <- sample(1:3100,size=100)
data <- alldata[smpl,]
matrix <- create_matrix(cbind(data$Title, data$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize=1:75, testSize=76:100,
    virgin=FALSE)

models <- train_models(corpus, algorithms=c("MAXENT","SVM"))

### Create new data that you want to classify:

newdata <- alldata[sample((1:3100)[-smpl],size=101,replace=FALSE),]

newmatrix <- create_matrix(cbind(newdata$Title, newdata$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)

newcorpus <- create_corpus(newmatrix, newdata$Topic.Code, trainSize=1:1, testSize=2:101,
    virgin=FALSE)


newresults1 <- classify_model(newcorpus,models[[1]])
newresults2 <- classify_model(newcorpus,models[[2]])

The last line should produce an error, because the new data contains terms that were not present in the data used to build the models. To circumvent this, it is better to run create_matrix on all the data you have and then train the model on a sample of the resulting matrix, instead of sampling the original data first.
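For example, here is a rough sketch of that approach, reusing data and newdata from above (combined, fullmatrix, fullcorpus, fullmodels and fullresults are just illustrative names; adjust the index ranges to your own row counts):

combined <- rbind(data, newdata)   # 100 training rows followed by the 101 new rows
fullmatrix <- create_matrix(cbind(combined$Title, combined$Subject), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=weightTfIdf)
fullcorpus <- create_corpus(fullmatrix, combined$Topic.Code, trainSize=1:100,
    testSize=101:201, virgin=FALSE)
fullmodels <- train_models(fullcorpus, algorithms=c("MAXENT","SVM"))
fullresults <- classify_models(fullcorpus, fullmodels)

Because the matrix is built on all documents at once, the test rows cannot introduce terms the models have never seen.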

This solution is a bit quirky, since the convenience function classify_models demands a corpus in which training and test sets are explicitly defined, whereas here the goal is to use an already-trained classifier, so all of the data is really test data. In the example above I sidestepped the problem by assigning a single observation to the training set and everything else to the test set. To avoid the workaround altogether, you can use the predict method directly:

predict(models[[1]],as.compressed.matrix(newmatrix))
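Note that this bypasses the container bookkeeping, so the result comes back in whatever format the underlying package (the maxent package, in the case of models[[1]]) returns, rather than the standardized label/probability data frame produced by classify_model. The vocabulary caveat from above still applies: newmatrix must contain the same terms as the matrix the model was trained on. For the SVM model the analogous call would presumably be predict(models[[2]], as.compressed.matrix(newmatrix)), since e1071 also accepts SparseM matrices, but I have not tested that.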