Solved – XGBoost making bad predictions

machine-learning, python

Alright, I am training an XGBoost model that should give the probability of two questions being duplicates (having the same intent). My training dataset is 70 MB; each row has a question pair and a boolean saying whether the questions are duplicates (1 – duplicate, 0 – not); the full set has 40% duplicates, and my train split (x_train) has 37% duplicates. Here are the params that I use:

import xgboost as xgb

params = {}
params['objective'] = 'binary:logistic'  # predict probabilities for a binary target
params['eval_metric'] = 'logloss'
params['eta'] = 0.02                     # learning rate
params['max_depth'] = 4

d_train = xgb.DMatrix(x_train, label=y_train)
d_test = xgb.DMatrix(x_test, label=y_test)

# evaluate on both sets after every boosting round
watchlist = [(d_train, 'train'), (d_test, 'test')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=50)

The problem is: my logloss is constantly increasing, and early_stopping_rounds only works when it is decreasing. In fact, when I ask the model to predict anything, it gives the same prediction of 0.17 for every row.
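For what it's worth, here is a quick check (a sketch, assuming numpy is available) showing that the predictions really are collapsing to a single constant:

import numpy as np

# if every prediction is (nearly) identical, the model has learned nothing
preds = bst.predict(d_test)
print(np.min(preds), np.mean(preds), np.max(preds))
print('distinct predictions:', len(np.unique(np.round(preds, 4))))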

Edit 1:

My xgboost version is 0.40.
I am comparing the questions by how many words they have in common, with this formula (a code sketch of both features follows the example rows below):

(number of words that appear in both questions × 2) / (sum of the word counts of the two questions)

and by how rare the words in the questions are (I determine the rarity of a word with this formula: (number of times the word appears in the dataset) / (total number of words)).
Here are some examples of rows in my x_train set:

[0.15384615384615385, 0.0075902783102047081], Duplicate

[0.42857142857142855, 1.4780520495923148], Not Duplicate

[0.16666666666666666, 0.11852563542667481], Duplicate

[0.3333333333333333, 0.68822955503676098], Not Duplicate

[0.3333333333333333, 0.13004646236067308], Not Duplicate
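Here is a minimal sketch of how I compute these two features. Assumptions to note: whitespace tokenization; all_words is a list of every word in the dataset; word_match and rarity_score are illustrative names; how the per-word rarity is aggregated over a pair is my own choice (a sum), since the post above only defines the per-word formula.

from collections import Counter

# assumed: all_words is a flat list of every word in the dataset
word_counts = Counter(all_words)
total_words = len(all_words)

def word_match(q1, q2):
    # (number of words that appear in both questions * 2) / (sum of word counts)
    w1, w2 = q1.lower().split(), q2.lower().split()
    shared = len(set(w1) & set(w2))
    return 2.0 * shared / (len(w1) + len(w2))

def rarity_score(q1, q2):
    # per-word rarity: (occurrences in dataset) / (total words),
    # summed over all words in the pair (aggregation is an assumption)
    words = q1.lower().split() + q2.lower().split()
    return sum(word_counts[w] / total_words for w in words)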

Best Answer

I assume you're working on the Kaggle.com Quora Question Pairs problem.

In this problem, question1 and question2, the question titles to be compared, are stored as text strings. These strings are very unlikely to be duplicated exactly in the remainder of the dataset. XGBoost will interpret them as (more or less) two factor columns whose levels are almost all unique and therefore carry no useful information: no general knowledge of the problem can be extracted from any individual row, nor from any combination of rows.

The center of the problem is how to generate useful metrics from the question texts so that the questions can be compared. Features such as the number or percentage of words shared between question 1 and question 2, or the Levenshtein distance between them, may be extremely useful for predicting whether the two questions are duplicates.
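For example, a plain dynamic-programming Levenshtein distance (a sketch; in practice a dedicated library would be faster) can be turned into a normalized similarity feature like this:

def levenshtein(a, b):
    # classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(q1, q2):
    # normalize to [0, 1] so long and short question pairs are comparable
    longest = max(len(q1), len(q2)) or 1
    return 1.0 - levenshtein(q1, q2) / longest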

Basically, XGBoost has no idea what you want it to do with the text data in those factor columns; you have to add value by providing data that XGBoost can use.

By the way, the prediction of 0.17 is merely the prior of the training dataset, that is, the proportion of duplicates in that data. It is effectively saying that the algorithm has no information to go by except the counts of positive- and negative-class examples.
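A quick way to see why the base rate is the best constant guess (a sketch using numpy; the 0.37 below is the positive rate you reported for your train split, so substitute whatever your actual rate is):

import numpy as np

# log loss of predicting the same probability p for every example,
# when a fraction `rate` of the labels are positive
def constant_logloss(p, rate):
    return -(rate * np.log(p) + (1 - rate) * np.log(1 - p))

rate = 0.37
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmin(constant_logloss(grid, rate))]
print(best)  # ~0.37: the minimizing constant is the positive rate itself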