According to the documentation, you need the function `classify_models` (excerpt from its documentation page):
Description

Uses a trained model from the train_models function to classify new data.

Usage

classify_models(corpus, models, ...)

Arguments

- `corpus`: Class of type matrix_container-class, generated by the `create_corpus` function.
- `models`: List of models to be used for classification, generated by `train_models`.
- `...`: Other parameters to be passed on to `classify_model`.
Here is a reproducible example:
```r
### Train model (taken out of the `classify_model` man page)
library(RTextTools)
set.seed(123)
alldata <- read_data(system.file("data/NYTimes.csv.gz", package = "RTextTools"),
                     type = "csv")
smpl <- sample(1:3100, size = 100)
data <- alldata[smpl, ]
matrix <- create_matrix(cbind(data$Title, data$Subject), language = "english",
                        removeNumbers = TRUE, stemWords = FALSE,
                        weighting = weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize = 1:75,
                        testSize = 76:100, virgin = FALSE)
models <- train_models(corpus, algorithms = c("MAXENT", "SVM"))

### Create the new data which you want to classify:
newdata <- alldata[sample((1:3100)[-smpl], size = 101, replace = FALSE), ]
newmatrix <- create_matrix(cbind(newdata$Title, newdata$Subject),
                           language = "english", removeNumbers = TRUE,
                           stemWords = FALSE, weighting = weightTfIdf)
newcorpus <- create_corpus(newmatrix, newdata$Topic.Code, trainSize = 1:1,
                           testSize = 2:101, virgin = FALSE)
newresults1 <- classify_model(newcorpus, models[[1]])
newresults2 <- classify_model(newcorpus, models[[2]])
```
The last line should produce the error, since the new data contains some terms that were not present in the data used to build the models. To circumvent this, it is better to run `create_matrix` on all the data you have, and then train the models on a sample of the result of `create_matrix`, instead of on the original data.
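A minimal sketch of that approach, reusing `alldata` and `smpl` from the example above (the names `fullmatrix`, `fullcorpus` and `fullmodels` are mine, and I am assuming `create_corpus` accepts arbitrary index vectors for `trainSize`/`testSize`, not only contiguous ranges):

```r
## Build the document-term matrix on ALL available documents first, so
## the vocabulary covers every term the models can ever encounter.
fullmatrix <- create_matrix(cbind(alldata$Title, alldata$Subject),
                            language = "english", removeNumbers = TRUE,
                            stemWords = FALSE, weighting = weightTfIdf)

## Split by row index: train on 75 of the sampled rows, classify the rest.
## No unseen-term error can occur, since every term is already a column.
fullcorpus <- create_corpus(fullmatrix, alldata$Topic.Code,
                            trainSize = smpl[1:75],
                            testSize = (1:3100)[-smpl[1:75]],
                            virgin = FALSE)
fullmodels <- train_models(fullcorpus, algorithms = c("MAXENT", "SVM"))
fullresults <- classify_models(fullcorpus, fullmodels)
```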
This solution is a bit quirky, since the convenience function `classify_models` demands a `corpus` in which the train and test data sets are explicitly defined. Since the goal is to use the classifier, all of the data is test data. In this solution I sidestepped the problem by selecting only one observation for the train set and all the others for the test set. To avoid this workaround entirely, it is possible to use the `predict` method directly:
```r
predict(models[[1]], as.compressed.matrix(newmatrix))
```
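For the MAXENT model, `predict` should return a matrix whose first column is the winning label, with the per-class probabilities in the remaining columns (an assumption based on the underlying maxent package; inspect `str()` of the result on your version to be sure):

```r
## Predict directly with the maxent model, bypassing create_corpus.
raw <- predict(models[[1]], as.compressed.matrix(newmatrix))

## Assumption: first column = most probable label, the remaining
## columns = one probability per class (maxent's return format).
newlabels <- raw[, 1]
head(newlabels)
```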
To create a good model, it has to be built on training data with the same "structure" as the data the model will later be applied to. This is the one boring assumption which underlies all classification models.
So by using a balanced data set while the real world is not balanced, you have already introduced a bias. There are cases where this is not a problem: imagine perfectly separable (non-linear) classes, where a model built on a balanced data set containing all the border-relevant points will still work perfectly on a skewed sample. But classifying documents is often a game of probabilities, and hence class skew is more problematic.
My suggestions:
- Build the model on the imbalanced set, with the same class proportions as in production. If you have to sample for this, then perform multiple runs across different samples during validation to improve generalization.
- The "bias" towards the negative class in an imbalanced set originates from the-best-guess-is-majority-class-if-everything-else-is-equal, something which Naive Bayes is sensitive to (especially when a lot of (irrelevant) features are involved). Use a different classifier which can capture inter-feature/word-dependencies to reduce this. I'd try Gradient Boosting with trees as described in chapter 10 "Boosting and Additive Trees" of The elements of statistical learning.
- You are currently using plain precision/recall as your metric. Based on your production requirements, estimate whether a false positive is as bad as a false negative, and adjust the metric accordingly, e.g. with an F-beta score; see the sketch after this list.
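For the last point, here is a minimal sketch of such an adjusted metric in plain R (the function `fbeta` and the label vectors are hypothetical, not part of RTextTools). The F-beta score weights recall beta times as heavily as precision, so beta > 1 makes false negatives costlier and beta < 1 makes false positives costlier:

```r
## F-beta score for one positive class; beta encodes the relative cost
## of a false negative versus a false positive.
fbeta <- function(truth, pred, positive, beta = 1) {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

## e.g., if missing a relevant document is much worse than a false alarm:
## fbeta(truth = actual_labels, pred = predicted_labels,
##       positive = "relevant", beta = 2)
```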
Best Answer
After several searches, the best way to do this is to calculate the tf-idf for the training data. Then, to validate your model, compute the tf-idf for the test data using the vocabulary from the training data.
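In RTextTools terms, the `originalMatrix` argument of `create_matrix` is meant for exactly this: it builds the new matrix against the vocabulary of an existing one, so terms the training data never saw are dropped. A sketch reusing `data` and `newdata` from the example above (I am relying on the man page's description of `originalMatrix` here, not on re-testing it):

```r
## Build the training matrix as usual ...
trainmatrix <- create_matrix(cbind(data$Title, data$Subject),
                             language = "english", removeNumbers = TRUE,
                             stemWords = FALSE, weighting = weightTfIdf)

## ... then force the test matrix onto the SAME vocabulary by passing
## the training matrix as originalMatrix.
testmatrix <- create_matrix(cbind(newdata$Title, newdata$Subject),
                            originalMatrix = trainmatrix,
                            language = "english", removeNumbers = TRUE,
                            stemWords = FALSE, weighting = weightTfIdf)
```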