Solved – XGBoost making bad predictions

machine-learning, python

Alright, I am training an XGBoost model that should give the probability of two questions being duplicates (having the same intent). My training dataset is 70 MB; each row has a question pair and a boolean saying whether the questions are duplicates (1 – duplicate, 0 – not); the full set has 40% duplicates, and my train split (x_train) has 37% duplicates. Here are the params that I use:

import xgboost as xgb

params = {}
params['objective'] = 'binary:logistic'  # predict probabilities for a binary target
params['eval_metric'] = 'logloss'
params['eta'] = 0.02                     # learning rate
params['max_depth'] = 4

d_train = xgb.DMatrix(x_train, label=y_train)
d_test = xgb.DMatrix(x_test, label=y_test)

# evaluate on both sets after every boosting round
watchlist = [(d_train, 'train'), (d_test, 'test')]
bst = xgb.train(params, d_train, 400, watchlist, early_stopping_rounds=50)

The problem is: my logloss is constantly increasing, and early_stopping_rounds only works when it is decreasing. In fact, when I ask the model to predict anything, it gives the same prediction of 0.17 for every row.
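For what it's worth, here is a quick check (a sketch, assuming numpy is available) showing that the predictions really are collapsing to a single constant:

import numpy as np

# if every prediction is (nearly) identical, the model has learned nothing
preds = bst.predict(d_test)
print(np.min(preds), np.mean(preds), np.max(preds))
print('distinct predictions:', len(np.unique(np.round(preds, 4))))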

Edit 1:

My xgboost version is 0.40.
I am comparing the questions by how many words they have in common, with this formula (a code sketch of both features follows the example rows below):

(number of words that appear in both questions × 2) / (sum of the word counts of the two questions)

and by how rare the words in the questions are (I determine the rarity of a word with this formula: (number of times the word appears in the dataset) / (total number of words)).
Here are some examples of rows in my x_train set:

[0.15384615384615385, 0.0075902783102047081], Duplicate

[0.42857142857142855, 1.4780520495923148], Not Duplicate

[0.16666666666666666, 0.11852563542667481], Duplicate

[0.3333333333333333, 0.68822955503676098], Not Duplicate

[0.3333333333333333, 0.13004646236067308], Not Duplicate
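Here is a minimal sketch of how I compute these two features. Assumptions to note: whitespace tokenization; all_words is a list of every word in the dataset; word_match and rarity_score are illustrative names; how the per-word rarity is aggregated over a pair is my own choice (a sum), since the post above only defines the per-word formula.

from collections import Counter

# assumed: all_words is a flat list of every word in the dataset
word_counts = Counter(all_words)
total_words = len(all_words)

def word_match(q1, q2):
    # (number of words that appear in both questions * 2) / (sum of word counts)
    w1, w2 = q1.lower().split(), q2.lower().split()
    shared = len(set(w1) & set(w2))
    return 2.0 * shared / (len(w1) + len(w2))

def rarity_score(q1, q2):
    # per-word rarity: (occurrences in dataset) / (total words),
    # summed over all words in the pair (aggregation is an assumption)
    words = q1.lower().split() + q2.lower().split()
    return sum(word_counts[w] / total_words for w in words)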

Best Answer

I assume you're working on the Kaggle.com Quora Question Pairs problem.

In this problem, question1 and question2, the question titles to be compared, are stored as text strings. These strings are very unlikely to be duplicated exactly in the remainder of the dataset. XGBoost will interpret them as (more or less) two factor columns whose levels are almost all unique and therefore carry no useful information: no general knowledge of the problem can be extracted from any individual row, nor from any combination of rows.

The center of the problem is how to generate useful metrics from the question texts so that the questions can be compared. Features such as the number or percentage of words shared between question 1 and question 2, or the Levenshtein distance between them, may be extremely useful for predicting whether the two questions are duplicates.
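For example, a plain dynamic-programming Levenshtein distance (a sketch; in practice a dedicated library would be faster) can be turned into a normalized similarity feature like this:

def levenshtein(a, b):
    # classic dynamic-programming edit distance between two strings
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def edit_similarity(q1, q2):
    # normalize to [0, 1] so long and short question pairs are comparable
    longest = max(len(q1), len(q2)) or 1
    return 1.0 - levenshtein(q1, q2) / longest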

Basically, XGBoost has no idea what you want it to do with the text data in those factor columns; you have to add value by providing data that XGBoost can use.

By the way, the prediction of 0.17 is merely the prior of the training dataset, that is, the proportion of duplicates in that data. It is effectively saying that the algorithm has no information to go by except the counts of positive- and negative-class examples.
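A quick way to see why the base rate is the best constant guess (a sketch using numpy; the 0.37 below is the positive rate you reported for your train split, so substitute whatever your actual rate is):

import numpy as np

# log loss of predicting the same probability p for every example,
# when a fraction `rate` of the labels are positive
def constant_logloss(p, rate):
    return -(rate * np.log(p) + (1 - rate) * np.log(1 - p))

rate = 0.37
grid = np.linspace(0.01, 0.99, 99)
best = grid[np.argmin(constant_logloss(grid, rate))]
print(best)  # ~0.37: the minimizing constant is the positive rate itself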