Stratified split for instance segmentation problem

Tags: discrete optimization, machine learning, optimization

Instance segmentation is a problem where you have a set of images $X = \{x_i\}_{i=1}^{N_{imgs}}$, and each image comes with associated labels. Labels are polygons that give the accurate location of the objects of a given class within an image. Let's say we have the classes "sheep", "dog", "cat" and "horse".

*[Image: an example scene with 1 dog and 3 sheep]*

In the image above we have 1 dog and 3 sheep. We can describe this with a vector $\mathbf{v} = [3, 1, 0, 0]^T$ which holds the class counts for this image (every class instance also has an associated polygon label, but that is not important for the problem at hand). Over the whole dataset $X$ we have a certain distribution of class counts, which we obtain by summing the $\mathbf{v}$ vectors of all images and normalizing. We want to split $X$ into training and test subsets such that the test subset consists of roughly $M$ samples (it does not need to be exact) and the distributions of class counts in both subsets are as similar to the original as possible. In particular, if we have only 2 instances of a horse in $X$ (in separate images), 1 image should go into the training subset and 1 into the test subset. Keeping the same distribution between training and test sets is in general an important assumption in machine learning, especially when minority classes exist in the dataset. A small sketch of building these count vectors is given below; the precise mathematical formulation of the problem follows.
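As an illustration only, here is a minimal sketch of how such count vectors could be built; the class order and the label format (one class name per annotated polygon) are assumptions for the example, not part of the problem statement:

```python
import numpy as np

# assumed class order; any fixed order works as long as it is consistent
CLASSES = ["sheep", "dog", "cat", "horse"]

def count_vector(instance_classes):
    """Class-count vector v for one image.

    `instance_classes` is a hypothetical label format: one class
    name per annotated polygon in the image.
    """
    v = np.zeros(len(CLASSES), dtype=int)
    for name in instance_classes:
        v[CLASSES.index(name)] += 1
    return v

# the example image: 3 sheep and 1 dog -> v = [3, 1, 0, 0]
v = count_vector(["sheep", "sheep", "sheep", "dog"])
```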

Let $V = \{\mathbf{v}_i\}_{i=1}^{N}$, $N \in \mathbb{N}$, be a set of vectors s.t. $\mathbf{v}_i \in (\mathbb{N} \cup \{0\})^k$, $k \in \mathbb{N}$. Let $\mathbf{p}_V = \frac{ \sum_{\mathbf{v} \in V} \mathbf{v} }{ \| \sum_{\mathbf{v} \in V} \mathbf{v} \| }$, $r \in (0,1)$ and $M = \lfloor r \cdot N \rfloor$.

We want to find $S \subset V$ s.t. $|S| = M$ and $S$ maximizes $L_{Total}(S) = L(S) + L(V \setminus S)$, where $L(S) = \cos(\mathbf{p}_S, \mathbf{p}_{V})$, $L(V \setminus S) = \cos(\mathbf{p}_{V \setminus S}, \mathbf{p}_{V})$, $\mathbf{p}_S = \frac{ \sum_{\mathbf{v} \in S} \mathbf{v} }{ \| \sum_{\mathbf{v} \in S} \mathbf{v} \| }$ and $\mathbf{p}_{V \setminus S} = \frac{ \sum_{\mathbf{v} \in V \setminus S} \mathbf{v} }{ \| \sum_{\mathbf{v} \in V \setminus S} \mathbf{v} \| }$.
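Since $\mathbf{p}_S$, $\mathbf{p}_{V \setminus S}$ and $\mathbf{p}_V$ are all unit vectors, each cosine similarity reduces to a dot product. A minimal NumPy sketch (my own illustration, not part of the question) that evaluates $L_{Total}$ for a candidate split:

```python
import numpy as np

def _normalize(u):
    return u / np.linalg.norm(u)

def l_total(V, mask):
    """Evaluate L_Total for a candidate split.

    V    : (N, k) array whose rows are the class-count vectors v_i
    mask : boolean vector, mask[i] = True iff v_i is in S (the test set)
    """
    p_V = _normalize(V.sum(axis=0))
    p_S = _normalize(V[mask].sum(axis=0))
    p_rest = _normalize(V[~mask].sum(axis=0))
    # for unit vectors, cosine similarity is just the dot product
    return p_S @ p_V + p_rest @ p_V

# usage on synthetic counts: a candidate S of size M = 20
rng = np.random.default_rng(0)
V = rng.integers(0, 4, size=(100, 4))
mask = np.zeros(100, dtype=bool)
mask[:20] = True
print(l_total(V, mask))
```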

Best Answer

You can solve the problem via mixed integer nonlinear programming (MINLP) as follows. For $i\in \{1,\dots,N\}$, let binary decision variable $x_i$ indicate whether $v_i\in S$. The problem is to maximize $\cos(p_S,p_V)+\cos(p_{V\setminus S},p_V)$ subject to \begin{align} p_S &= \frac{\sum_i x_i v_i}{\|\sum_i x_i v_i\|} \\ p_{V\setminus S} &= \frac{\sum_i (1-x_i) v_i}{\|\sum_i (1-x_i) v_i\|} \\ \sum_i x_i &= M \end{align} If you are willing to change the measure of similarity, there are also linear (MILP) approaches.
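One possible way to instantiate this formulation is a Pyomo model handed to an MINLP solver such as Bonmin (assumed to be installed separately; the model itself is a sketch, not a definitive implementation). Because $p_V$ is a fixed unit vector, each cosine term reduces to a dot product divided by a norm:

```python
import numpy as np
import pyomo.environ as pyo

def build_split_model(V, M):
    """MINLP model for the split; V is an (N, k) array of count vectors."""
    N, k = V.shape
    total = V.sum(axis=0).astype(float)
    p_V = total / np.linalg.norm(total)      # fixed unit vector p_V

    m = pyo.ConcreteModel()
    m.I = pyo.RangeSet(0, N - 1)
    m.x = pyo.Var(m.I, domain=pyo.Binary)    # x_i = 1  <=>  v_i in S

    # component sums of the two sides of the split
    a = [sum(m.x[i] * float(V[i, j]) for i in m.I) for j in range(k)]
    b = [total[j] - a[j] for j in range(k)]

    # cos(p_S, p_V) = (a . p_V) / ||a|| since p_V has unit norm; the
    # division assumes the size constraint keeps both sums nonzero
    cos_S = sum(a[j] * p_V[j] for j in range(k)) \
        / pyo.sqrt(sum(a[j] ** 2 for j in range(k)))
    cos_rest = sum(b[j] * p_V[j] for j in range(k)) \
        / pyo.sqrt(sum(b[j] ** 2 for j in range(k)))

    m.obj = pyo.Objective(expr=cos_S + cos_rest, sense=pyo.maximize)
    m.size = pyo.Constraint(expr=sum(m.x[i] for i in m.I) == M)
    return m

# usage: requires an MINLP solver, e.g. Bonmin, on the PATH
V = np.random.default_rng(0).integers(0, 4, size=(30, 4))
m = build_split_model(V, M=6)
pyo.SolverFactory("bonmin").solve(m)
test_indices = [i for i in m.I if pyo.value(m.x[i]) > 0.5]
```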
