Solved – DQN with XGBoost

boostingreinforcement learning

Normally a DQN, uses a neuronal network to estimate the Q-Value. I have framed my problem as a regression problem before and have observed that XGBoost does outperform a NN.

Is it possible to replace the NN in a DQN with XGboost? I haven't found any hint/documentation about this at all. So anything would help me.

Best Answer

In Q learning, it is possible to use practically any regression model that can be updated incrementally.

In fitted Q learning, any regression model can be used, including tree-based approaches, see e.g. page 70 of this book: https://orbi.uliege.be/bitstream/2268/27963/1/book-FA-RL-DP.pdf

However, we can hardly speak about Deep Q learning that relates to deep neural networks by nature, i.e. including automated feature extraction from non-trivial objects (images, texts).

Related Solutions

Solved – XGBoost (Extreme Gradient Boosting) or Elastic Net More Robust to Outliers

1 Yes boosted trees would more easily fit unknown non-linear effects or interactions than regularized linear regression. However, as soon as you are aware of some specific non-linearity you could simply transform data to linearity and continue to use a linaer learner.
2 That depends on how you train the models. If you're new to boosted trees, check out some tutorials on how to avoid overfitting. I cannot see from your plots what kind of cross-validation was used. Use a thorough outer cross-validation and perhaps compare the results to a random forest model. RF models are much easier to handle, and default settings are often near optimal. A crude thumb rule; if your RF performs better than boosted trees(measured by outer cross validation), you either have chosen sub optimal training paramters for your boosted trees model or your data is quite noisy.

XGBoost – Why Overfitting Occurs and Acceptance Criteria

Is overfitting so bad that you should not pick a model that does overfit, even though its test error is smaller? No. But you should have a justification for choosing it.

This behavior is not restricted to XGBoost. It is a common thread among all machine learning techniques; finding the right tradeoff between underfitting and overfitting. The formal definition is the Bias-variance tradeoff (Wikipedia).

The bias-variance tradeoff

The following is a simplification of the Bias-variance tradeoff, to help justify the choice of your model.

We say that a model has a high bias if it is not able to fully use the information in the data. It is too reliant on general information, such as the most frequent case, the mean of the response, or few powerful features. Bias can come from wrong assumptions, for exemple assuming that the variables are Normally distributed or that the model is linear.
We say that a model has high variance if it is using too much information from the the data. It relies on information that is revelant only in the training set that has been presented to it, which does not generalize well enough. Typically, the model will change a lot if you change the training set, hence the "high variance" name.

Those definition are very similar to the definitions of underfitting and overfitting. However, those definition are often too simplified to be opposites, as in

The model is underfitting if both the training and test error are high. This means that the model is too simple.
The model is overfitting if the test error is higher than the training error. This means that the model is too complex.

Those simplifications are of course helpful, as they help choosing the right complexity of the model. But they overlook an important point, the fact that (almost) every model has both a bias and a variance component. The underfitting/overfitting description tell you that you have too much bias/too much variance, but you (almost) always have both.

If you want more information about the bias-variance tradeoff, they are a lot of helpful visualisation and good ressource available through google. Every machine learning textbook will have a section on the bias-variance tradeoff, here are a few

An introduction to statistical learning and Elements of statistical learning (available here).
Pattern Recognition and Machine Learning, by Christopher Bishop.
Machine Learning: A Probabilistic Perspective, by Kevin Murphy.

Also, a nice blog post that helped me grasp is Scott Fortmann-Roe's Understanding the Bias-Variance Tradeoff.

Application to your problem

So you have two models,

$$ \begin{array}{lrrl} & \text{Train MAE} & \text{Test MAE} &\\ \text{MARS} & \sim4.0 & \sim4.0 & \text{Low variance, higher bias},\\ \text{XGBoost} & \sim0.3 & \sim2.4 & \text{Higher variance, lower bias},\\ \end{array} $$

and you need to pick one. To do so, you need to define what is a better model. The parameters that should be included in your decisions are the complexity and the performance of the model.

How many "units" of complexity are you willing to exchange for one "unit" of performance?
- More complexity is associated with higher variance. If you want your model to generalize well on a dataset that is a little bit different than the one you have trained on, you should aim for less complexity.
- If you want a model that you can understand easily, you can do so at the cost of performance by reducing the complexity of the model.
- If you are aiming for the best performance on a dataset that you know comes from the same generative process than your training set, you can manipulate complexity in order to optimize your test error and use this as a metric. This happens when your training set is randomly sampled from a larger set, and your model will be applied on this set. This is the case in most Kaggle competitions, for exemple.

The goal here is not to find a model that "does not overfit". It is to find the model that has the best bias-variance tradeoff. In this case, I would argue that the reduction in bias accomplished by the XGBoost model is good enough to justify the increase in variance.

What can you do

However, you can probably do better by tuning the hyperparameters.

Increasing the number of rounds and reducing the learning rate is a possibility. Something that is "weird" about gradient boosting is that running it well past the point where the training error has hit zero seems to still improve the test error (as discussed here: Is Deeper Better Only When Shallow Is Good?). You can try to train your model a little bit longer on your dataset once you have set the other parameters,
The depth of the trees you grow is a very good place to start. You have to note that for every one unit of depth, you double the number of leafs to be constructed. If you were to grow trees of size two instead of size 16, it would take $1/2^{14}$ of the time! You should try growing more smaller trees. The reason why is that the depth of the tree should represent the degree of feature interaction. This may be jargon, but if your features have a degree of interaction of 3 (Roughly: A combination of 4 features is not more powerful than a combination of 3 of those feature + the fourth), then growing trees of size larger than 3 is detrimental. Two trees of depth three will have more generalization power than one tree of depth four. This is a rather complicated concept and I will not go into it right now, but you can check this collection of papers for a start. Also, note that deep trees lead to high variance!
Using subsampling, known as bagging, is great to reduce variance. If your individual trees have a high variance, bagging will average the trees and the average has less variance than individual trees. If, after tuning the depth of your trees, you still encounter high variance, try to increase subsampling (that is, reduce the fraction of data used). Subsampling of the feature space also achieves this goal.

Best Answer

Related Solutions

Solved – XGBoost (Extreme Gradient Boosting) or Elastic Net More Robust to Outliers

XGBoost – Why Overfitting Occurs and Acceptance Criteria

Related Question