Solved – Machine learning/random forests with noisy response data

Tags: machine-learning, measurement-error, random-forest, regression, weighted-data

Machine learning techniques like random forests seem to assume that the responses in the training set are known perfectly. In regression applications specifically, one often needs to account for measurement error (which may be well characterized) in the responses, or in the predictors for that matter. In classical regression this can be addressed with inverse-variance weights, but I'm not sure how that translates to machine learning techniques.

I can imagine doing Monte Carlo, say 1000 random forests each with 1000 trees, in which I sample the response variables from their (known) distributions, and then aggregating the results. But that seems rather brute force.
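A minimal sketch of that brute-force idea, assuming scikit-learn's `RandomForestRegressor` and Gaussian measurement error with known per-point standard deviations (all data, names, and the number of draws here are illustrative):

```python
# Monte Carlo over response noise: repeatedly redraw the responses from
# their (known) error distributions, fit a forest to each draw, and
# aggregate the predictions across draws.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy data with per-point response uncertainty (illustrative values).
X = rng.uniform(0, 10, size=(200, 3))
y_true = X[:, 0] + np.sin(X[:, 1])
sigma = 0.2 + 0.1 * rng.random(200)       # known measurement std devs
y_obs = y_true + rng.normal(0, sigma)     # observed (noisy) responses

X_new = rng.uniform(0, 10, size=(20, 3))  # points to predict at

n_draws = 20                              # far fewer than 1000, for speed
preds = np.empty((n_draws, len(X_new)))
for i in range(n_draws):
    y_draw = rng.normal(y_obs, sigma)     # sample responses ~ N(y_obs, sigma)
    rf = RandomForestRegressor(n_estimators=50, random_state=i)
    rf.fit(X, y_draw)
    preds[i] = rf.predict(X_new)

point_est = preds.mean(axis=0)                       # aggregated prediction
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)   # crude spread across draws
```

The percentile band only reflects the propagated response noise (plus refit variability), not full predictive uncertainty, but it gives a first-pass interval from the ensemble of forests.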

Are there machine learning techniques in which you can explicitly account for uncertainty in your variables? Ideally, I'd also like the machine learning prediction to include these uncertainties (i.e., predict a confidence interval).

Best Answer

You mention weighting points in linear models as a method of incorporating uncertainty. This is possible to do in random forests as well.

In an unweighted RF, a random bootstrap subsample of points is drawn for each tree, and the regression trees (jointly comprising the forest) are fit independently to these subsamples. Weights are incorporated into the random forest by altering the probabilities with which points are selected for each random subsample. A point with a higher weight will then appear in a greater proportion of trees, and consequently have a larger influence on the random forest's predictions.
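One way to sketch this weighted bootstrap (the `weighted_forest` helper and the inverse-variance weighting are my own illustration, not part of the answer; it builds the forest by hand from scikit-learn decision trees):

```python
# Weighted random forest: each tree's bootstrap sample is drawn with
# selection probability proportional to the point weights, so low-noise
# points appear in more trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def weighted_forest(X, y, weights, n_trees=100, seed=0):
    """Fit n_trees regression trees to weighted bootstrap samples."""
    rng = np.random.default_rng(seed)
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # normalize to selection probabilities
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.choice(n, size=n, replace=True, p=p)   # weighted bootstrap
        tree = DecisionTreeRegressor(random_state=int(rng.integers(1 << 31)))
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def forest_predict(trees, X_new):
    """Average the per-tree predictions, as a plain RF does."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)

# Usage with inverse-variance weights (an assumption for illustration):
rng = np.random.default_rng(1)
X = rng.uniform(size=(150, 2))
sigma = 0.1 + 0.4 * rng.random(150)      # known response std devs
y = X[:, 0] + rng.normal(0, sigma)
trees = weighted_forest(X, y, weights=1.0 / sigma**2, n_trees=50)
pred = forest_predict(trees, X[:5])
```

In practice, scikit-learn's `RandomForestRegressor.fit` also accepts a `sample_weight` argument; that weights points in the split criterion rather than in the bootstrap draw itself, a related but not identical mechanism.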