Solved – Optimal feature selection for MAPE criteria with RandomForest cross-validation

cross-validation, random forest, scikit-learn

I am trying to find the set of features that minimizes the cross-validated MAPE (mean absolute percentage error) of a random forest.

I tried forward selection with the univariate linear regression test (f_regression in sklearn). For each k, I select the k best features with SelectKBest and calculate the MAPE of a model trained on them:

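import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression

for i in range(1, len(X.columns)):
    # Keep the i features with the highest univariate F-scores
    selektor = SelectKBest(f_regression, k=i).fit(X, y)
    clf = RandomForestRegressor(n_estimators=10, max_depth=None)
    pred = clf.fit(selektor.transform(X), y).predict(selektor.transform(T))
    MAPE = np.mean(np.abs((pred - y_real) / y_real))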

I also tried backward selection with the RandomForest feature_importances_ attribute. I start with the full set; in each iteration I fit the model, calculate the MAPE, and then remove the feature with the lowest importance:

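from operator import itemgetter
from sklearn.ensemble import RandomForestRegressor

while X.shape[1] > 1:
    clf = RandomForestRegressor(n_estimators=n, max_depth=None)
    pred = clf.fit(X, y).predict(T)
    imp = dict(zip(X.columns, clf.feature_importances_))
    # Drop the least important feature from the training and test sets (both assumed to be DataFrames)
    todrop = min(imp.items(), key=itemgetter(1))[0]
    X, T = X.drop(columns=todrop), T.drop(columns=todrop)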

The first technique returns decent results, but it is not optimal because the k best features are selected by a linear regression score rather than by MAPE. The second technique returns noisy results (the MAPE doesn't converge).

I am looking for a feature selection technique that I can use in sklearn and that measures MAPE directly. My full data set has 150 features and 80,000 observations.

Thank you in advance for any suggestion…

Best Answer

I see two possibilities:

  1. You could conceivably use the MAPE directly as an objective function. The approach in Zheng (2011, International Journal of Machine Learning and Cybernetics) could perhaps be adapted to yield a smooth approximation to the MAPE, so you get a nice gradient.

  2. You could try to get full predictive densities from your model and then derive the one-number summary of that distribution that minimizes the expected MAPE.
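To make the second possibility concrete: if you have a sample of strictly positive draws from the predictive distribution for one observation, the point forecast that minimizes the average absolute percentage error over that sample is a weighted median with weights 1/y. The sketch below is only an illustration; the function names and the `draws` values are made up, and how you obtain the draws (for instance from the individual trees of a forest) is left open.

```python
def mape_optimal_point(samples):
    """Return the sample value p that minimizes mean(|y - p| / y).

    For strictly positive samples this is the weighted median with weights 1/y.
    """
    ys = sorted(samples)
    if not ys or ys[0] <= 0:
        raise ValueError("samples must be non-empty and strictly positive")
    weights = [1.0 / y for y in ys]
    half = sum(weights) / 2.0
    cumulative = 0.0
    for y, w in zip(ys, weights):
        cumulative += w
        # First point where the cumulative weight reaches half the total
        if cumulative >= half:
            return y


def expected_mape(p, samples):
    """Average absolute percentage error of point forecast p over the samples."""
    return sum(abs(y - p) / y for y in samples) / len(samples)


# Hypothetical draws from a right-skewed predictive distribution
draws = [1.0, 2.0, 4.0, 8.0, 100.0]
best = mape_optimal_point(draws)
plain_median = sorted(draws)[len(draws) // 2]

print("MAPE-optimal point:", best, "expected MAPE:", expected_mape(best, draws))
print("plain median:", plain_median, "expected MAPE:", expected_mape(plain_median, draws))
```

Because small outcomes get large weights, the MAPE-optimal forecast sits below the ordinary median whenever the distribution is spread out, which is one reason a model tuned for squared error can look poor under MAPE.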

Admittedly, neither of these addresses feature selection. You may want to take a look at this for a few ideas on this aspect.
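For the selection step itself, one option that scores candidate subsets on MAPE directly, as the question asks, is scikit-learn's SequentialFeatureSelector combined with the "neg_mean_absolute_percentage_error" scorer. This is a minimal sketch on synthetic data, not a tuned recipe: the data, the tree count, the number of folds, and the number of features to keep are all assumptions, and both the selector and the scorer need a reasonably recent scikit-learn (0.24 or later, as far as I know).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in data (an assumption): 6 features, only the first two matter.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 1.0, size=(200, 6))
# Targets stay well away from zero so that MAPE is well defined.
y_demo = 10.0 + 5.0 * X_demo[:, 0] + 3.0 * X_demo[:, 1] + rng.normal(0.0, 0.1, 200)

forest = RandomForestRegressor(n_estimators=10, random_state=0)

# Greedy forward selection; each candidate subset is scored by cross-validated MAPE.
selector = SequentialFeatureSelector(
    forest,
    n_features_to_select=2,
    direction="forward",
    scoring="neg_mean_absolute_percentage_error",
    cv=3,
)
selector.fit(X_demo, y_demo)

selected = np.flatnonzero(selector.get_support())
print("selected feature indices:", selected)
```

With 150 features and 80,000 observations this gets expensive, since every step refits the forest once per remaining candidate and per fold, so subsampling the rows or pre-filtering the features first may be necessary.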
