According to the documentation, you need the function `classify_models` (excerpt from its documentation page):
Description

Uses a trained model from the train_models function to classify new data.

Usage

classify_models(corpus, models, ...)

Arguments

- `corpus`: Class of type matrix_container-class, generated by the `create_corpus` function.
- `models`: List of models to be used for classification, generated by `train_models`.
- `...`: Other parameters to be passed on to `classify_model`.
Here is a reproducible example:
```r
### Train model (taken out of the `classify_model` man page)
library(RTextTools)
set.seed(123)
alldata <- read_data(system.file("data/NYTimes.csv.gz", package = "RTextTools"),
                     type = "csv")
smpl <- sample(1:3100, size = 100)
data <- alldata[smpl, ]
matrix <- create_matrix(cbind(data$Title, data$Subject), language = "english",
                        removeNumbers = TRUE, stemWords = FALSE,
                        weighting = weightTfIdf)
corpus <- create_corpus(matrix, data$Topic.Code, trainSize = 1:75,
                        testSize = 76:100, virgin = FALSE)
models <- train_models(corpus, algorithms = c("MAXENT", "SVM"))

### Create the new data which you want to classify:
newdata <- alldata[sample((1:3100)[-smpl], size = 101, replace = FALSE), ]
newmatrix <- create_matrix(cbind(newdata$Title, newdata$Subject),
                           language = "english", removeNumbers = TRUE,
                           stemWords = FALSE, weighting = weightTfIdf)
newcorpus <- create_corpus(newmatrix, newdata$Topic.Code, trainSize = 1:1,
                           testSize = 2:101, virgin = FALSE)
newresults1 <- classify_model(newcorpus, models[[1]])
newresults2 <- classify_model(newcorpus, models[[2]])
```
The last line should produce the error, since the new data contains some terms that were not present in the data used to build the models. To circumvent this, it is better to run `create_matrix` on all the data you have, and then train the models on a sample of the result of `create_matrix`, instead of on the original data.
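A minimal sketch of that approach, reusing `alldata` and `smpl` from the example above (the names `fullmatrix`, `fullcorpus` and `fullmodels` are mine, and I am assuming `create_corpus` accepts arbitrary index vectors for `trainSize`/`testSize`, not only contiguous ranges):

```r
## Build the document-term matrix on ALL available documents first, so
## the vocabulary covers every term the models can ever encounter.
fullmatrix <- create_matrix(cbind(alldata$Title, alldata$Subject),
                            language = "english", removeNumbers = TRUE,
                            stemWords = FALSE, weighting = weightTfIdf)

## Split by row index: train on 75 of the sampled rows, classify the rest.
## No unseen-term error can occur, since every term is already a column.
fullcorpus <- create_corpus(fullmatrix, alldata$Topic.Code,
                            trainSize = smpl[1:75],
                            testSize = (1:3100)[-smpl[1:75]],
                            virgin = FALSE)
fullmodels <- train_models(fullcorpus, algorithms = c("MAXENT", "SVM"))
fullresults <- classify_models(fullcorpus, fullmodels)
```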
This solution is a bit quirky, since the convenience function `classify_models` demands a `corpus` in which the train and test data sets are explicitly defined. Since the goal is to use the classifier, all of the data is test data. In this solution I sidestepped the problem by selecting only one observation for the train set and all the others for the test set. To avoid this workaround entirely, it is possible to use the `predict` method directly:
```r
predict(models[[1]], as.compressed.matrix(newmatrix))
```
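For the MAXENT model, `predict` should return a matrix whose first column is the winning label, with the per-class probabilities in the remaining columns (an assumption based on the underlying maxent package; inspect `str()` of the result on your version to be sure):

```r
## Predict directly with the maxent model, bypassing create_corpus.
raw <- predict(models[[1]], as.compressed.matrix(newmatrix))

## Assumption: first column = most probable label, the remaining
## columns = one probability per class (maxent's return format).
newlabels <- raw[, 1]
head(newlabels)
```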
To create a good model, it has to be built on training data with the same "structure" as the data the model will later be applied to. This is the one boring assumption which underlies all classification models.
So by using a balanced data set while the real world is not balanced, you have already introduced a bias. There are cases where this is not a problem: imagine perfectly separable (non-linear) classes, where a model built on a balanced data set containing all the border-relevant points will still work perfectly on a skewed sample. But classifying documents is often a game of probabilities, and hence class skew is more problematic.
My suggestions:
- Build the model on the imbalanced set, with the same class proportions as in production. If you have to sample for this, then perform multiple runs across different samples during validation to improve generalization.
- The "bias" towards the negative class in an imbalanced set originates from the-best-guess-is-majority-class-if-everything-else-is-equal, something which Naive Bayes is sensitive to (especially when a lot of (irrelevant) features are involved). Use a different classifier which can capture inter-feature/word-dependencies to reduce this. I'd try Gradient Boosting with trees as described in chapter 10 "Boosting and Additive Trees" of The elements of statistical learning.
- You are currently using plain precision/recall as your metric. Based on your production requirements, estimate whether a false positive is as bad as a false negative, and adjust the metric accordingly, e.g. with an F-beta score; see the sketch after this list.
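For the last point, here is a minimal sketch of such an adjusted metric in plain R (the function `fbeta` and the label vectors are hypothetical, not part of RTextTools). The F-beta score weights recall beta times as heavily as precision, so beta > 1 makes false negatives costlier and beta < 1 makes false positives costlier:

```r
## F-beta score for one positive class; beta encodes the relative cost
## of a false negative versus a false positive.
fbeta <- function(truth, pred, positive, beta = 1) {
  tp <- sum(pred == positive & truth == positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

## e.g., if missing a relevant document is much worse than a false alarm:
## fbeta(truth = actual_labels, pred = predicted_labels,
##       positive = "relevant", beta = 2)
```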
Best Answer
After several searches, the best way to do this is to calculate the tf-idf for the training data. Then, to validate your model, compute the tf-idf for the test data using the vocabulary from the training data.
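In RTextTools terms, the `originalMatrix` argument of `create_matrix` is meant for exactly this: it builds the new matrix against the vocabulary of an existing one, so terms the training data never saw are dropped. A sketch reusing `data` and `newdata` from the example above (I am relying on the man page's description of `originalMatrix` here, not on re-testing it):

```r
## Build the training matrix as usual ...
trainmatrix <- create_matrix(cbind(data$Title, data$Subject),
                             language = "english", removeNumbers = TRUE,
                             stemWords = FALSE, weighting = weightTfIdf)

## ... then force the test matrix onto the SAME vocabulary by passing
## the training matrix as originalMatrix.
testmatrix <- create_matrix(cbind(newdata$Title, newdata$Subject),
                            originalMatrix = trainmatrix,
                            language = "english", removeNumbers = TRUE,
                            stemWords = FALSE, weighting = weightTfIdf)
```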