Neural Networks – Consistency Between Two Outputs and Loss Functions

Tags: loss-functions, machine-learning, neural-networks

I'm trying to fit a dense neural network based on tabular data input, where the outputs are two separate classification vectors, with one cross-entropy loss function for each.

Example: given a few input features, for a customer that visits a travel website with the intention of buying a train ticket, the model would predict both the destination of travel and the traveling class (1st class or 2nd class) that the customer is likely to buy.

Problem: it seems as if, internally, the network was divided in two at some point in the hidden layers, and each sub-network became specialised in predicting one output vector while ignoring the other. This leads to acceptable accuracy for each output taken individually, but the consistency between the two outputs leaves something to be desired.

[Illustration: the neural network splitting into two specialised sub-networks]

For example, for a given entry, the network might predict "London" and "1st Class" because, independently, each output makes sense given the input features, yet there isn't a single training point where London and 1st class appear together, simply because there is no 1st class option when travelling to London. The network seems completely devoid of any concern for the consistency between the two outputs.

For example, if the passenger is a 35-year-old accountant departing from Brussels, the training set gives a clear winner for the destination (London) and, separately, for the class (1st), and so this is what the network will tend to predict, even though this combination is entirely absent from the training data.


Would there be any way to amend the network and/or the organisation of the loss functions so that the consistency between the two outputs is taken into account, with the network avoiding combinations of outputs that never occur in the training set and favouring those that do?

More generally, what would be some good approaches to tackle this issue? Note that I would like to avoid resorting to manual rules down the line, if that is possible.

Best Answer

Another method would be to build two neural networks. The first NN is trained to predict the destination. For the second NN, include the destination predicted by the first NN as an input feature and train the network to predict the class. The second network should then learn to only predict classes that are options for the predicted destination.
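A minimal sketch of this two-network chain, assuming Keras, a tabular feature matrix `X`, and integer-encoded labels `y_dest` and `y_class` (all variable names and layer sizes here are illustrative, not taken from the question):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_features, n_destinations, n_classes = 10, 20, 2  # illustrative sizes

# First network: input features -> destination
dest_model = keras.Sequential([
    layers.Input(shape=(n_features,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_destinations, activation="softmax"),
])
dest_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Second network: input features + predicted destination probabilities -> class
class_model = keras.Sequential([
    layers.Input(shape=(n_features + n_destinations,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
class_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

def fit_chain(X, y_dest, y_class, epochs=10):
    dest_model.fit(X, y_dest, epochs=epochs, verbose=0)
    # Condition the class network on the destination network's output,
    # as suggested above.
    dest_probs = dest_model.predict(X, verbose=0)
    class_model.fit(np.hstack([X, dest_probs]), y_class,
                    epochs=epochs, verbose=0)

def predict_chain(X):
    dest_probs = dest_model.predict(X, verbose=0)
    class_probs = class_model.predict(np.hstack([X, dest_probs]), verbose=0)
    return dest_probs.argmax(axis=1), class_probs.argmax(axis=1)
```

Because the class network only ever sees destination information drawn from real (destination, class) pairs during training, it should learn which classes are actually available for each destination.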

Edited in response to @Jivan's comment.

There are more complex methods of multi-label classification, but I'd keep it simple if possible and try either @Dikran's approach or mine first. They are both standard ways of implementing multi-label classification (see this Medium post): Dikran's method is a Label Powerset and mine is a Classifier Chain. As you've pointed out, there are pros and cons to both methods. If neither produces a good enough result, you could try a variation of the classifier chain: build one network to predict a single label from the union of destinations and classes, then train two further networks, one that predicts the destination given a predicted class and another that predicts the class given a predicted destination. At inference time, you would use the first network to predict either a class or a destination, then use the appropriate second network to predict the other label. A sketch of the Label Powerset idea follows below.
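For comparison, here is a minimal sketch of the Label Powerset idea in the same hypothetical Keras setting: every (destination, class) pair observed in training becomes one joint label, so the model structurally cannot predict a combination that never occurs in the training set.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def make_powerset_labels(y_dest, y_class):
    """Map each observed (destination, class) pair to a single joint label."""
    pairs = sorted({(d, c) for d, c in zip(y_dest, y_class)})  # observed combos only
    pair_to_joint = {pair: i for i, pair in enumerate(pairs)}
    y_joint = np.array([pair_to_joint[(d, c)] for d, c in zip(y_dest, y_class)])
    return y_joint, pairs  # pairs[i] decodes joint label i back to (dest, class)

def build_powerset_model(n_features, n_joint_labels):
    model = keras.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(64, activation="relu"),
        layers.Dense(n_joint_labels, activation="softmax"),  # one unit per observed pair
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

# Usage sketch (X, y_dest, y_class, X_new are assumed to exist):
# y_joint, pairs = make_powerset_labels(y_dest, y_class)
# model = build_powerset_model(X.shape[1], len(pairs))
# model.fit(X, y_joint, epochs=10)
# dest, cls = pairs[int(model.predict(X_new).argmax(axis=1)[0])]
```

The trade-off is the usual one for powerset methods: the number of output units grows with the number of observed combinations, and rare pairs get very few training examples.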
