Solved – Which statistical classification algorithm can predict true/false for a sequence of inputs

classificationmachine learningmodeling

Given a sequence of inputs, I need to determine whether this sequence has a certain desired property. The property can only be true or false, that is, there are only two possible classes that a sequence can belong to.

The exact relationship between the sequence and the property is unclear, but I believe it is very consistent and should lend itself to statistical classification. I have a large number of cases to train the classifier on, although it might be slightly noisy, in the sense that there's a slight probability that a sequence is assigned the wrong class in this training set.

Example training data:

Sequence 1: (7 5 21 3 3) -> true
Sequence 2: (21 7 5 1) -> true
Sequence 3: (12 21 7 5 11 1) -> false
Sequence 4: (21 5 7 1) -> false
...

In rough terms, the property is determined by the set of values in the sequence (e.g. the presence of an "11" means that the property will almost certainly be false), as well as the order of the values (e.g. "21 7 5" significantly increases the chance that the property is true).

After training, I should be able to give the classifier a previously unseen sequence, like (1 21 7 5 3), and it should output its confidence that the property is true.
Is there a well-known algorithm for training a classifier with this kind of inputs/outputs?

I have considered the naive Bayesian classifier (which is not really adaptable to the fact that the order matters, at least not without severely breaking the assumption that the inputs are independent). I've also investigated the hidden Markov model approach, which appears to be inapplicable because only a single output is available, instead of one output per input. What did I miss?

Best Answer

You could try probabilistic approaches similar to the naive Bayes classifier but with weaker assumptions. For example, instead of making the strong independence assumption, make a Markov assumption:

$$ p(x \mid c) = p(x_0 \mid c)\prod_t p(x_t \mid x_{t - 1}, c) $$

$c$ is your class label, $x$ is your sequence. You need to estimate two conditional distributions, one for $c = 1$ and one for $c = 0$.

By Bayes' rule:

$$ p(c = 1 \mid x) = \frac{p(x \mid c = 1) p(c = 1)}{p(x \mid c = 1) p(c = 1) + p(x \mid c = 0) p(c = 0)}. $$

Which distributions to pick for $p(x_t \mid x_{t - 1}, c)$ depends on which other assumptions you can make about the sequences and how much data you have available.

For example, you could use:

$$ p(x_t \mid x_{t - 1}, c) = \frac{\pi(x_t, x_{t - 1}, c)}{\sum_i \pi(x_i, x_{t - 1}, c)} $$

With distributions like this, if there are 21 different numbers occurring in your sequences, you would have to estimate $21 \cdot 21 \cdot 2 = 882$ parameters $\pi(x_t, x_t, c)$ plus $21 \cdot 2 = 42$ parameters for $p(x_0 \mid c)$ plus $2$ parameters for $p(c)$.

If the assumptions of your model are not met, it can help to fine-tune the parameters directly with respect to the classification performance, for example by minimizing the average log-loss

$$ -\frac{1}{\#\mathcal{D}} \sum_{(x, c) \in \mathcal{D}} \log p(c \mid x) $$

using gradient-descent.

Related Question