Solved – How to improve classification performance based on multiple known classification results

classification, machine learning, multi-class

I am working on a classification problem that may contain an unknown number of data classes, typically 5-50 classes per sample. I have several classification algorithms, each of which gives me a classification output for a given sample. However, these outputs almost never agree with each other completely.

When I looked into these classification results, I noticed that for any given sample there is always some algorithm whose output is better on some subset of the data. This suggests that I should be able to obtain a better overall classification by combining these existing outputs properly. The problem is that I do not know how to do this.

So far, I have 500 manually annotated samples and the corresponding classification results for this training data. Is there a simple way to train a system that automatically forms a new, combined classification output for me?
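One standard way to combine several classification outputs over the same points, without needing the label values to agree across algorithms, is evidence-accumulation (consensus) clustering: count how often each pair of points is placed in the same class, then cluster that co-association matrix. This is only a sketch of the idea (not your specific algorithms), using NumPy and SciPy:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus_labels(label_sets, n_clusters):
    """Combine several labelings of the same m points into one consensus
    labeling via a co-association matrix (evidence accumulation)."""
    label_sets = np.asarray(label_sets)          # shape (n_algorithms, m)
    n_algs, m = label_sets.shape
    # co[i, j] = fraction of algorithms that put points i and j together
    co = np.zeros((m, m))
    for labels in label_sets:
        co += (labels[:, None] == labels[None, :])
    co /= n_algs
    # treat 1 - co as a distance and cluster the points hierarchically
    dist = squareform(1.0 - co, checks=False)
    Z = linkage(dist, method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# three labelings of six points; the third disagrees on point 2
labels = consensus_labels(
    [[0, 0, 0, 1, 1, 1],
     [0, 0, 0, 1, 1, 1],
     [0, 0, 1, 1, 1, 1]], n_clusters=2)
```

The consensus groups the first three points together because two of the three algorithms do, even though the label values the algorithms use differ. The number of clusters is an input here; selecting it is the separate model-selection problem discussed in the answer below.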

Another thing that bothers me is that I am completely lost when trying to formulate my task mathematically. I know the following description is problematic; please help me correct it.

Suppose we have a training sample composed of $m$ 2D data points $\Omega =\{d_1,d_2,\cdots,d_m\}$ with a corresponding known label set $C_{\rm gt} = \left\{l(d_i|{\rm gt})\,\big|\,d_i\in\Omega\right\}$, where $l(d_i|{\rm{gt}})=k$ means that the label of data point $d_i$ is the $k$th class in the ground truth (gt). Assume I have $n$ classification algorithms $A_1(\cdot),\cdots,A_n(\cdot)$, each of which takes the entire dataset as input and generates a classification result $C_j = \left\{l(d_i|A_j)\,\big|\,d_i\in\Omega\right\}$, the collection of labels assigned to all data points by algorithm $A_j$. My objective is to find a function $A_{\rm new}=f(A_1,\cdots,A_n)$ such that
$$A_{\rm new} = \arg\min_{A = f(A_1,\cdots,A_n)}\sum_{d_i\in\Omega}\left\|l(d_i|A)-l(d_i|{\rm gt})\right\|.$$

This is a formulation I learnt in signal estimation, but in this problem a label is quite different from a random variable:
1) it is discrete rather than continuous
2) even if an algorithm assigns a label $k'$ to a data point marked as class $k$ in the ground truth, the assignment is not necessarily wrong: what we really care about is whether the points of the $k$th ground-truth class are grouped together as one single class in the output, not which label value that class receives.
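The permutation issue in point 2 is exactly what clustering-comparison metrics such as the adjusted Rand index are built for: they score how well two partitions agree, ignoring the label names. A self-contained sketch of the index (the same quantity is available as `adjusted_rand_score` in scikit-learn):

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index: 1.0 for identical partitions (up to a
    renaming of the labels), near 0.0 for chance-level agreement."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    # contingency table between the two partitions
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    table = np.zeros((ia.max() + 1, ib.max() + 1))
    np.add.at(table, (ia, ib), 1)
    comb2 = lambda x: x * (x - 1) / 2.0      # "choose 2" for each count
    sum_ij = comb2(table).sum()
    sum_a = comb2(table.sum(axis=1)).sum()
    sum_b = comb2(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(a.size)
    max_index = (sum_a + sum_b) / 2.0
    return (sum_ij - expected) / (max_index - expected)

# same partition under different label names scores 1.0
score = adjusted_rand_index([0, 0, 1, 1, 2, 2], [5, 5, 3, 3, 9, 9])
```

Replacing the $\|\cdot\|$ objective above with such a partition-based score (maximized instead of minimized) would make the formulation invariant to how the output names its classes.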

Finally, I have no idea how to construct a function $f(\cdot)$.

Best Answer

If you have some labeled data, you can probably perform what is known as "semi-supervised clustering". That is, you find the "optimal" clustering based on the complete data, using both the labeled and unlabeled points. The EM algorithm is often used for this type of problem, depending on what type of model you use. One flexible approach is to fit a normal (Gaussian) mixture model. You can fit it for a varying number of classes (since, as you note, you do not know the true number of classes) and use an information criterion or a cross-validation approach to select the number of classes.
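The model-selection step can be sketched as follows, assuming scikit-learn is available: fit a Gaussian mixture for each candidate number of classes on synthetic 2-D data and keep the model with the lowest BIC.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# synthetic 2-D sample: three well-separated clusters of 50 points each
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(50, 2))
    for c in ([0, 0], [4, 0], [0, 4])
])

# fit mixtures for a range of class counts and pick the lowest BIC
models = [GaussianMixture(n_components=k, random_state=0).fit(X)
          for k in range(1, 8)]
best = min(models, key=lambda m: m.bic(X))
print(best.n_components)   # expect 3 for this well-separated sample
```

On real data with 5-50 overlapping classes the BIC curve will be flatter, so it is worth inspecting the scores for the whole range of $k$ rather than trusting the single minimum.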