I'm trying to fit a dense neural network based on tabular data input, where the outputs are two separate classification vectors, with one cross-entropy loss function for each.
Example: given a few input features, for a customer that visits a travel website with the intention of buying a train ticket, the model would predict both the destination of travel and the traveling class (1st class or 2nd class) that the customer is likely to buy.
Problem: it seems as if internally, the network was divided in two at some point in the hidden layers, and each sub-network got specialised in predicting one output vector, ignoring the other. This leads to an overall acceptable accuracy for each output, but the consistency between the two outputs leaves to be desired.
For example, for a given entry, the network would predict "London" and "1st Class", because independently, each output makes sense according to the input features, but there isn't a single training point where London and 1st class can be found together, simply because there isn't a 1st class option when travelling to London. The network seems to be completely devoid of any concern for the consistency between the two.
Example, if the passenger is an accountant, 35yo and departs from Brussels, the training set gives a clear winner for destination: London
and, separately, also for class: 1st
, and so this is what the network will tend to predict, despite this combination being totally absent.
Would there be any way to amend the network and/or the organisation of the loss functions so that the consistency between the two outputs would be taken into account, and the network would avoid combination of outputs that can't be found in the training set, and favor those that are?
More generally, what would be some good approaches to tackle this issue? Note that I would like to avoid resorting to manual rules down the line, if that is possible.
Best Answer
Another method would be to build two neural networks. The first NN is trained to predict the destination. For the second NN, include the destination predicted by the first NN as an input feature and train the network to predict the class. The second network should then learn to only predict classes that are options for the predicted destination.
Edited in response to @Jivan's comment.
There are more complex methods of multi-label classification, but I'd keep it simple if possible, and try either @Dikran's or my approach first. They are both standard ways of implementing multi-label classification (see this Medium post). Dikran's method is a Label Powerset and mine is a Classifier Chain. As you've pointed out, there are pros and cons to both these methods. If neither of these produce a good enough result, you could try a variation of the classifier chain, where you build one network to predict one label from the union of destinations and classes. Then train two further networks, one that predicts the destination given a predicted class and the other that predicts the class given a predicted destination. At inference time, you would use the first network to predict either a class or destination, then the appropriate second network predict the other label.