Solved – Random Forest – Unbalanced Dataset for Training & Test

classification, r, random-forest, sampling, unbalanced-classes

I am building a classification model (using R and random forests) for a website where only 2% of visitors convert. Given the behaviour and attributes of a visitor, I want to predict the probability of conversion.

The data collection process was pretty complicated, and in the end I had 3 months of data: 3,166 cases of conversions and 10,849 non-conversions. Normally, I know that the training and test data should have the same proportion of classes.

However, I wanted to use most of the "converted" data to train the model, so I randomly set aside 100 cases of conversions and 4,000 cases of non-conversions, giving a 1:40 ratio. This is my test data.

For training, I took the remaining data, which has a ratio of approximately 1:2.
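
Roughly, the split looks like the sketch below (the object name alldata is just a placeholder for my full data frame; trdata and tsdata are the names used in the code further down):

    ## hold out 100 conversions and 4000 non-conversions as the test set
    set.seed(1)
    conv    = alldata[alldata$Converted == 1,]
    nonconv = alldata[alldata$Converted == 0,]

    i_conv = sample(nrow(conv), 100)
    i_nc   = sample(nrow(nonconv), 4000)

    tsdata = rbind(conv[i_conv,], nonconv[i_nc,])     ## test set, ratio 1:40
    trdata = rbind(conv[-i_conv,], nonconv[-i_nc,])   ## training set, ratio ~1:2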

After training the random forest, I get decent results on the test data in terms of sensitivity and specificity, but precision is low, around 7-8%.

What would be the possible repercussions of my approach? I wanted to get this sorted before I begin fine-tuning my model.

I did not do up-sampling/down-sampling or synthetic data generation, because those techniques would likewise end up balancing the data used to train the model, while the test data would still reflect the real-world scenario.

Any advice would be much appreciated.

EDIT 1: After the responses by Fernando & DarXider, I tried the following two things:

a.) Sampled 200 cases of the positive class together with all of the negative class and trained a model, repeating until the positive class was exhausted. Each model then predicted on the test data, their votes were counted, and the final probabilities were calculated.

b.) Similar to the above, except that 200 cases of the negative class were sampled together with all of the positive class.

However, the problem remains when testing: in case (a) most trees vote for the negative class, and vice versa in case (b).

I will try the other suggestions, and maybe other techniques, and see what happens. I have already started the process of getting more data. Fingers crossed!

The code snippet for (a) is below. In case I have made any errors, please do tell. I know it's a little inefficient, but I am still learning 🙂

    library(randomForest)
    library(caret)                        ## for confusionMatrix()

    tr_conv = trdata[trdata$Converted == 1,]   ## converted training cases
    tr_nc   = trdata[trdata$Converted == 0,]   ## non-converted training cases

    numr = nrow(tr_conv)          ## number of converted rows not yet used
    min_size = 200                ## sample size per model
    temp = NA                     ## placeholder; vote counts get cbind-ed to it

    while(numr > 0){

        a = ifelse((numr - min_size) < min_size, numr, min_size)  ## take the remainder if it is small
        rm(.Random.seed, envir = globalenv())  ## reset the seed each time
        k = sample(x = 1:numr, size = a)
        tr_conv1 = tr_conv[k,]             ## this sample will be trained on
        tr_conv  = tr_conv[-k,]
        numr = nrow(tr_conv)

        ## combine with the non-converted training data
        comb1 = rbind.data.frame(tr_conv1, tr_nc)
        r1 = randomForest(x = comb1[,1:24], y = as.factor(comb1$Converted),
                          ntree = 2000, mtry = 6, strata = as.factor(comb1$Converted),
                          norm.votes = FALSE)   ## factor response => classification
        pd = predict(object = r1, newdata = tsdata[,1:24],
                     type = "vote", norm.votes = FALSE)  ## tsdata is the 4100 test cases
        vt1 = data.frame(pd)
        vt1 = data.frame(vt1[,-1])         ## keep only the "yes" vote counts
        temp = cbind(temp, vt1)
    }

    temp1 = temp[,-1]                      ## drop the NA placeholder column
    temp1$Yes = rowSums(temp1)             ## total number of yes votes
    tvc = 2000*(ncol(temp1) - 1)           ## total votes cast = ntree * number of models
    temp1$No = tvc - temp1$Yes             ## total number of no votes
    myvotes = temp1[, c("Yes", "No")]      ## select by name, not hard-coded column numbers
    myvotes$yes_prob = myvotes$Yes/tvc             ## yes probability
    myvotes$no_prob  = 1 - myvotes$yes_prob        ## no probability
    threshold = 0.5
    myvotes$prediction = ifelse(myvotes$yes_prob > threshold, 1, 0)
    j = confusionMatrix(data = factor(myvotes$prediction, levels = c(0, 1)),
                        reference = factor(tsdata$Converted, levels = c(0, 1)),
                        positive = "1", dnn = c("pred", "actual"))

Best Answer

If I understood correctly, the real-world scenario has a converted to non-converted ratio of about 1:49, which is close to the 1:40 of your test data; so far, so good! However, the data you use for training your model has a ratio of about 1:2. Therein lies your problem: it causes lots of false positives and, thus, poor precision when you apply your trained model to the test data.
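
To see why, here is a quick back-of-the-envelope calculation with made-up (but plausible) numbers: suppose your model reaches a sensitivity of 0.85 and a specificity of 0.75 on the 100 vs. 4,000 test set. Then

$$\text{precision} = \frac{TP}{TP + FP} = \frac{0.85 \times 100}{0.85 \times 100 + 0.25 \times 4000} = \frac{85}{1085} \approx 7.8\%,$$

which is right in the range you report: at a 1:40 prevalence, even a modest false-positive rate produces far more false positives than there are true positives.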

I suggest that you have a look at the paper Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure by Saerens et al., Neural Computation 14, 21–41 (2002). It proposes a method to adjust the posterior probabilities $p(y|x)$ -- obtained by applying your trained model to the test data, where $x$ represents your data and $y = 0, 1$ is the class label -- for a scenario where the class proportions, $p(y = 0)$ and $p(y = 1)$, differ between the training and test sets.
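
When the new priors are known (here, roughly 2% converted in production versus about 1:2 in your training set), the correction is a simple re-weighting of the posteriors. A minimal R sketch, assuming you already have a vector of predicted conversion probabilities (the function and argument names below are mine, not from the paper):

    ## re-weight posterior probabilities for a change in class priors
    ## (the known-priors case; Saerens et al. also give an EM procedure
    ##  for when the deployment priors are unknown)
    adjust_priors = function(p_pos, prior_train, prior_new) {
      w_pos = prior_new / prior_train               ## weight for the positive class
      w_neg = (1 - prior_new) / (1 - prior_train)   ## weight for the negative class
      (w_pos * p_pos) / (w_pos * p_pos + w_neg * (1 - p_pos))
    }

    ## e.g. a predicted probability of 0.5 under ~1:2 training priors
    ## shrinks to roughly 0.04 at a 2% real-world conversion rate
    adjust_priors(0.5, prior_train = 1/3, prior_new = 0.02)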

You could also try building an ensemble of classifiers in the spirit of the BalanceCascade method (see Exploratory Undersampling for Class-Imbalance Learning by Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 39, No. 2, April 2009):

  1. For the first classifier, randomly sample a subset of the minority class from the training set that gives a ratio of approximately 1:40 when combined with the entire majority class.
  2. Train a classifier using this data (using, e.g., the argument class_weight = 'balanced' if you are using scikit-learn).
  3. Set aside the selected minority samples from the pool of minority-class training data, then perform the next subsampling to arrive at the second dataset.
  4. Use this set to train your second classifier.
  5. Repeat until you have exhausted the entirety of the minority class members.

Finally, you will have a set of classifiers, trained on non-overlapping subsets of the minority class. You can then combine their predictions (e.g., using majority vote or weighted average) to arrive at the final prediction.
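
A rough sketch of this loop in R, reusing your trdata/tsdata naming (the subset size, ntree and the use of classwt below are illustrative choices, not prescriptions; randomForest's classwt or sampsize/strata arguments play roughly the role of scikit-learn's class_weight):

    library(randomForest)

    pos = trdata[trdata$Converted == 1,]    ## minority (converted)
    neg = trdata[trdata$Converted == 0,]    ## majority (non-converted)

    n_sub = round(nrow(neg) / 40)           ## minority subset size giving roughly 1:40
    probs = list()
    i = 1
    while (nrow(pos) > 0) {
      take = sample(nrow(pos), min(n_sub, nrow(pos)))
      sub  = rbind(pos[take,], neg)
      sub$Converted = factor(sub$Converted)
      fit  = randomForest(Converted ~ ., data = sub, ntree = 500,
                          classwt = c(1, 40))   ## in the order of levels(sub$Converted), i.e. "0", "1"
      probs[[i]] = predict(fit, newdata = tsdata, type = "prob")[, "1"]
      pos = pos[-take,]                     ## set the used minority samples aside
      i = i + 1
    }

    ## combine the ensemble by averaging the predicted conversion probabilities
    final_prob = Reduce(`+`, probs) / length(probs)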

Having said this, the cascade-and-ensemble approach might not be effective in your case, since each classifier will see only around 200 minority-class examples; that is a small number and could hurt generalization, but YMMV, and it might be worth trying.
