Solved – How to combine linear and non-linear models

linear model, machine learning, modeling, nonlinear, random forest

My data basically consists of two sets of features:

  • $F_{nl}$: Non-linear, self-made features such as a few scores and counts (10 features)
  • $F_{l}$: Linear features generated by TF-IDF vectorization of the text (30000 features)

The classification task is binary.
I'm currently using standard accuracy as the performance measure, although I should modify it slightly in the future to penalize false negatives, since they are more expensive than false positives for this task.
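
One option for weighting false negatives more heavily, given here only as a sketch and not as the metric this task ended up using, is the $F_\beta$ score with $\beta > 1$, which counts recall as more important than precision. The function name `f_beta` and the confusion-matrix counts are illustrative assumptions.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts; beta > 1 penalizes false negatives more."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Two hypothetical classifiers with the same number of errors (30 each),
# so plain accuracy cannot tell them apart on the same data set.
many_misses = f_beta(tp=80, fp=10, fn=20)  # more false negatives
few_misses = f_beta(tp=90, fp=20, fn=10)   # more false positives

# With beta = 2 the classifier with fewer false negatives scores higher.
print(many_misses < few_misses)
```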

I have trained two different models on each combination of the features {$F_l$}, {$F_{nl}$} and {$F_l, F_{nl}$}, which yielded the following results:

  • RandomForest trained on $F_{l}$

    • Specificity: 0.46
    • Sensitivity: 0.91
    • Miss-Rate: 0.09
    • Fall-Out: 0.54
    • Accuracy: 0.73
    • AUC-ROC: 0.80
  • RandomForest trained on $F_{nl}$

    • Specificity: 0.76
    • Sensitivity: 0.82
    • Miss-Rate: 0.18
    • Fall-Out: 0.24
    • Accuracy: 0.80
    • AUC-ROC: 0.88
  • RandomForest trained on $F_{l} + F_{nl}$

    • Specificity: 0.78
    • Sensitivity: 0.84
    • Miss-Rate: 0.16
    • Fall-Out: 0.22
    • Accuracy: 0.81
    • AUC-ROC: 0.89
  • Pegasos-SVM trained on $F_{l}$

    • Specificity: 0.72
    • Sensitivity: 0.83
    • Miss-Rate: 0.17
    • Fall-Out: 0.22
    • Accuracy: 0.79
    • AUC-ROC: 0.85
  • Pegasos-SVM trained on $F_{nl}$

    • Specificity: 0.33
    • Sensitivity: 0.69
    • Miss-Rate: 0.31
    • Fall-Out: 0.67
    • Accuracy: 0.55
    • AUC-ROC: 0.52
  • Pegasos-SVM trained on $F_{l} + F_{nl}$

    • Specificity: 0.33
    • Sensitivity: 0.69
    • Miss-Rate: 0.31
    • Fall-Out: 0.67
    • Accuracy: 0.55
    • AUC-ROC: 0.52
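
For reference, the rates listed above can all be derived from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `rates` and the counts are made up for illustration and are not taken from the experiments.

```python
def rates(tp, fn, fp, tn):
    """Compute the rates reported above from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "miss_rate": fn / (tp + fn),    # false negative rate = 1 - sensitivity
        "fall_out": fp / (tn + fp),     # false positive rate = 1 - specificity
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts: 100 positives and 100 negatives.
for name, value in rates(tp=90, fn=10, fp=30, tn=70).items():
    print(f"{name}: {value:.2f}")
```

Under these definitions, sensitivity and miss-rate sum to 1, and so do specificity and fall-out. The Pegasos-SVM entry for $F_{l}$ (specificity 0.72, fall-out 0.22) does not satisfy this, so one of those two values is probably a typo.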

For the Pegasos-SVM I used the sofia-ml framework from Google.
The RandomForest is from scikit-learn (sklearn).

As you can see, the RandomForest classifier performs well on the nonlinear features and gains little when the linear features are added. It does markedly worse when trained only on the linear features (accuracy 0.73, specificity 0.46).
The Pegasos-SVM classifier, by contrast, performs well on the linear features and very badly on the nonlinear features or the combination of both (which is not surprising, since it is meant for linearly separable classification tasks).

So the actual question is: is it possible to combine the nonlinear RandomForest model (trained on $F_{nl}$) with the linear Pegasos-SVM model (trained on $F_l$), and if so, what is the best practice for doing it?

Best Answer

Without knowing exactly what you did, the following is partly speculation, but:

  1. With a nonlinear kernel, an SVM generates its own nonlinear features internally. Your results may be telling you that the features it inferred are better than the hand-coded ones, $F_{nl}$. However, an SVM can overfit, and by augmenting the data with redundant features, i.e. attaching $F_{nl}$, you are inviting it to do so.
  2. You should check what your code is doing for the last two result entries. Getting exactly the same numbers is a yellow card for me, especially when the last entry is somewhat unexpected.
  3. Random Forest doesn't usually overfit. So you typically won't do any worse by adding sensible additional features.
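
Point 3 suggests one way to combine the two models, although the answer does not spell it out: stacking. The linear model is trained on $F_l$ and its decision score is passed to the forest as one extra feature next to $F_{nl}$. The following is a minimal sketch under several assumptions. scikit-learn's `SGDClassifier(loss="hinge")` stands in for the sofia-ml Pegasos-SVM, `X_l` and `X_nl` are hypothetical names for the two feature matrices, `X_nl` is assumed to be a dense array, and the hyperparameters are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_predict


def fit_stacked(X_l, X_nl, y, seed=0):
    """Fit a linear SVM on X_l, then a forest on X_nl plus the SVM score."""
    linear = SGDClassifier(loss="hinge", random_state=seed)
    # Out-of-fold scores: each row is scored by a model that did not see it,
    # so the forest is not trained on over-optimistic linear scores.
    oof_score = cross_val_predict(
        linear, X_l, y, cv=5, method="decision_function"
    )
    linear.fit(X_l, y)  # refit on all rows for use at prediction time
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(np.column_stack([X_nl, oof_score]), y)
    return linear, forest


def predict_stacked(linear, forest, X_l, X_nl):
    """Predict labels using the linear score as an extra forest feature."""
    score = linear.decision_function(X_l)
    return forest.predict(np.column_stack([X_nl, score]))
```

This compresses the 30000 TF-IDF columns into a single score, so the ten hand-made features are not drowned out when the forest samples candidate features for its splits. Whether it actually beats the individual models on this data has to be checked on a held-out set.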