MATLAB: Finding optimal regression tree using hyperparameter optimization

hyperparameter optimization, machine learning, regression trees

I am calculating propensity scores using fitrensemble. I am interested in finding the model with the lowest test RMSE (I will use the resulting model to predict outcomes in a very large second dataset). I am currently running hyperparameter optimization with the code below:
% Optimize for model
rng default
propensity_final = fitrensemble(X,Y, ...
    'Learner',templateTree('Surrogate','on'), ...
    'Weights',W, ...
    'OptimizeHyperparameters',{'Method','NumLearningCycles','MaxNumSplits','LearnRate'}, ...
    'HyperparameterOptimizationOptions',struct('Repartition',true, ...
        'AcquisitionFunctionName','expected-improvement-plus'));
loss_final = kfoldLoss(crossval(propensity_final,'KFold',10));
However, I find that when I do not optimize over Method, and instead run one of the calls below, the cross-validation error is lower.
% Bagged
propensity1_bag = fitrensemble(X,Y, ...
    'Method','Bag', ...
    'Learner',templateTree('Surrogate','on'), ...
    'Weights',W, ...
    'OptimizeHyperparameters',{'NumLearningCycles','MaxNumSplits'}, ...
    'HyperparameterOptimizationOptions',struct('Repartition',true, ...
        'AcquisitionFunctionName','expected-improvement-plus'));
loss1_bag = kfoldLoss(crossval(propensity1_bag,'KFold',10));
% LSBoost
propensity1_boost = fitrensemble(X,Y, ...
    'Method','LSBoost', ...
    'Learner',templateTree('Surrogate','on'), ...
    'Weights',W, ...
    'OptimizeHyperparameters',{'NumLearningCycles','MaxNumSplits','LearnRate'}, ...
    'HyperparameterOptimizationOptions',struct('Repartition',true, ...
        'AcquisitionFunctionName','expected-improvement-plus'));
loss1_boost = kfoldLoss(crossval(propensity1_boost,'KFold',10));
What is the objective (best so far and estimated) that the function tries to minimize? And why are loss1_boost and loss1_bag lower than loss_final? How do I know which model to use?
Thank you!

Best Answer

My guess is that your first run was worse because it was not run for enough iterations. The default MaxObjectiveEvaluations is 30, but since your first optimization searches a larger space (it includes the categorical Method variable), you should probably multiply that several times over. Using 'Repartition',true also calls for more iterations, because repartitioning adds noise to the objective evaluations. Try at least 100 iterations, and more as time permits. You can pass MaxObjectiveEvaluations inside HyperparameterOptimizationOptions.
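For concreteness, here is a sketch of your first call with a larger budget. The only new piece is the MaxObjectiveEvaluations field; 100 is just the suggestion above, not a tuned value:
% Same optimization as before, but with a larger evaluation budget
rng default
propensity_final = fitrensemble(X,Y, ...
    'Learner',templateTree('Surrogate','on'), ...
    'Weights',W, ...
    'OptimizeHyperparameters',{'Method','NumLearningCycles','MaxNumSplits','LearnRate'}, ...
    'HyperparameterOptimizationOptions',struct('Repartition',true, ...
        'AcquisitionFunctionName','expected-improvement-plus', ...
        'MaxObjectiveEvaluations',100)); % more iterations for the bigger search space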
The objective being minimized for regression is log(1 + MSE), where the MSE is computed on the validation set; by default that is 5-fold cross-validation. That's mentioned near the bottom of the OptimizeHyperparameters section on this doc page: http://www.mathworks.com/help/stats/fitrensemble.html#input_argument_d0e360201 Your final calls to kfoldLoss return plain MSE, so those numbers will differ from the objective values shown during optimization.
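If you want to compare the two on the same scale, a minimal sketch (assuming the log(1 + MSE) relation above, and using the HyperparameterOptimizationResults property that the fitted model carries after an OptimizeHyperparameters run):
results = propensity_final.HyperparameterOptimizationResults; % BayesianOptimization object
objBest = results.MinObjective;    % best observed objective from the optimization
mseFromObj = exp(objBest) - 1;     % undo log(1 + MSE) to recover the cross-validated MSE
mse10fold = kfoldLoss(crossval(propensity_final,'KFold',10)); % your own 10-fold MSE estimate
% The two MSEs use different partitions (5-fold during optimization vs your
% 10-fold call), so expect them to be close but not identical.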
In any case, you should use the model with the lowest cross-validated MSE, no matter how you found it.
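For example, using the loss variables you already computed (a sketch; the variable names are the ones from your question):
losses = [loss_final, loss1_bag, loss1_boost];                 % 10-fold MSE for each candidate
models = {propensity_final, propensity1_bag, propensity1_boost};
[bestLoss, idx] = min(losses);                                 % lowest cross-validated MSE
best_model = models{idx};   % use this ensemble to predict in the second dataset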