Solved – Biased Data in Machine Learning

I am working on a Machine Learning project with data that is already (heavily) biased by data selection.

Let's assume you have a set of hard-coded rules.
How do you build a machine learning model to replace them when the only data it can use has already been filtered by those rules?

To make things clear, I guess the best example would be Credit Risk Assessment: the task is to filter out all clients that are likely to fail to make a payment.

Now, the only (labeled) data you have are from clients that have been accepted by the set of rules, because you only see whether someone pays after accepting them (obviously). You don't know how good the set of rules is or how much it affects the paid to not-paid distribution.
Additionally, you have unlabeled data from the clients that have been declined, again because of the set of rules. So you don't know what would have happened with those clients if they had been accepted.

E.g., one of the rules could be: "If age of client < 18 years, then do not accept"

The classifier has no way to learn how to handle clients that have been filtered out by these rules. How is the classifier supposed to learn patterns here?

Ignoring this problem would lead to the model being exposed to data it has never encountered before.
Basically, I want to estimate the value of f(x) when x is outside [a, b] here.

Best Answer

You are right to be concerned - even the best models can fail spectacularly if the distribution of out-of-sample data differs significantly from the distribution of the data that the model was trained/tested on.
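Since you do have unlabelled data for the declined clients, one rough way to see where the two populations differ is to check how many declined clients fall outside the value range the model will see during training. The sketch below is only an illustration: the feature names, the toy records, and the helper fraction_outside_support are all hypothetical, and a range check is a crude stand-in for a proper comparison of distributions.

```python
def fraction_outside_support(accepted, declined):
    """For each feature, return the share of declined rows whose value
    falls outside the [min, max] range seen among accepted rows."""
    result = {}
    for feature in accepted[0]:
        seen = [row[feature] for row in accepted]
        low, high = min(seen), max(seen)
        outside = sum(1 for row in declined if not low <= row[feature] <= high)
        result[feature] = outside / len(declined)
    return result


# Hypothetical toy data: every accepted client passed the "age >= 18" rule.
accepted = [
    {"age": 25, "income": 30000},
    {"age": 41, "income": 52000},
    {"age": 19, "income": 18000},
    {"age": 63, "income": 47000},
]
declined = [
    {"age": 16, "income": 20000},
    {"age": 17, "income": 25000},
    {"age": 30, "income": 9000},
    {"age": 45, "income": 31000},
]

shift = fraction_outside_support(accepted, declined)
print(shift)
```

Features with a large share of declined clients outside the accepted range are the ones where the model would be extrapolating, which is the "x outside [a, b]" situation described in the question.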

I think the best you can do is train a model on the labelled data that you have, but try to keep the model interpretable. That probably means limiting yourself to simple models. Then you could reason about how the rules learnt by your model might interact with the prior rules you had, in an attempt to estimate how well your model might work on the unfiltered population.

For example, suppose your model finds that, in your labelled dataset, the younger the client is, the more likely they were to default. That trend points in the same direction as the prior rule, so the model would assign an even higher risk to clients under 18. Then it may be reasonable to assume that your model will work well if you removed the prior filter of "If age of client < 18 years, then do not accept".
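A minimal sketch of that kind of check, assuming a single "age" feature and synthetic data in which default risk falls with age (the data generator, the coefficients, and the helper names are all made up for illustration, not taken from a real portfolio). It fits a one-feature logistic regression by plain gradient descent on clients aged 18 or older only, then looks at the sign of the age coefficient and at what the model would predict for a 17-year-old.

```python
import math
import random

random.seed(0)


def simulate_client():
    """Hypothetical data generator: default risk falls as age rises."""
    age = random.uniform(18, 70)  # only clients who passed the age rule
    p_default = 1 / (1 + math.exp(0.08 * (age - 30)))
    defaulted = 1 if random.random() < p_default else 0
    return age, defaulted


clients = [simulate_client() for _ in range(2000)]

# Standardise age so plain gradient descent converges quickly.
ages = [a for a, _ in clients]
mean_age = sum(ages) / len(ages)
std_age = math.sqrt(sum((a - mean_age) ** 2 for a in ages) / len(ages))
data = [((a - mean_age) / std_age, y) for a, y in clients]

# One-feature logistic regression fitted by batch gradient descent.
weight, bias = 0.0, 0.0
learning_rate = 0.5
for _ in range(500):
    grad_w = grad_b = 0.0
    for x, y in data:
        pred = 1 / (1 + math.exp(-(weight * x + bias)))
        grad_w += (pred - y) * x
        grad_b += pred - y
    weight -= learning_rate * grad_w / len(data)
    bias -= learning_rate * grad_b / len(data)


def predict_default(age):
    """Predicted default probability for a client of the given age."""
    z = weight * (age - mean_age) / std_age + bias
    return 1 / (1 + math.exp(-z))


print(f"age coefficient (standardised): {weight:.2f}")
print(f"P(default | age 17) = {predict_default(17):.2f}")
print(f"P(default | age 40) = {predict_default(40):.2f}")
```

A negative age coefficient means the model treats younger clients as riskier, which agrees with the direction of the old rule. Note that the prediction for age 17 is still an extrapolation: the check shows the model and the rule are consistent, not that the prediction is accurate for clients the model never saw.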

