Solved – Machine Learning: Stratified Test-train-validation split for images with multiple classes and examples per image

analysis, cross-validation, image processing, machine learning, scikit-learn

I have a dataset of 300 images, each containing a variable number of flowers. Each flower belongs to one of 3 classes. My goal is to develop a prediction algorithm that classifies each flower based on its individual appearance.

I want to get:

  1. A stratified 50:25:25 training:validation:test split of the flowers.
  2. However, I want to do this while guaranteeing that all flowers within the same image are assigned together to the same set, whether training, validation or test.

The second condition must be satisfied because I want to be able to say that image X is either a training image, a validation image or a testing image. Unless all flowers in image X fall into the same set, I cannot say this about image X.

I couldn't find anything in sklearn that can do this. Any hints?

Best Answer

It should be fairly simple to do the division by hand. The trick is that you do not need an exact 50/25/25 split to obtain valid estimates of uncertainty.

Basically, you will have to reshape the dataset into a long rectangular array in which each row corresponds to a single flower and carries the ID of the image it came from. You may also want to create a flowers-per-image variable. Randomly permute the order of the images (not the individual rows, so that all flowers from one image stay together) and tally the cumulative number of flowers image by image. Divide the array at the image boundary nearest to 50% of the total number of flowers, and again at the boundary nearest to 75%. The resulting pieces will approximately be the 50/25/25 datasets you are after.
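The procedure above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `grouped_split`, the image IDs, the class names and the randomly generated data are all hypothetical. Note also that it only reports the class counts per split; nothing in it enforces stratification by class.

```python
import random
from collections import Counter


def grouped_split(image_ids, fractions=(0.5, 0.25, 0.25), seed=0):
    """Map each image ID to "train", "val" or "test".

    image_ids holds one entry per flower (the image that flower came from).
    Whole images are assigned, so flower counts only approximate `fractions`.
    """
    flowers_per_image = Counter(image_ids)
    images = sorted(flowers_per_image)    # fixed starting order (IDs must be sortable)
    random.Random(seed).shuffle(images)   # random permutation of whole images

    total = sum(flowers_per_image.values())
    cut_train = fractions[0] * total
    cut_val = (fractions[0] + fractions[1]) * total

    assignment, running = {}, 0
    for image in images:
        n = flowers_per_image[image]
        # Comparing the midpoint of this image's cumulative range with the
        # target puts each cut at the image boundary nearest to that target.
        midpoint = running + n / 2
        running += n
        if midpoint <= cut_train:
            assignment[image] = "train"
        elif midpoint <= cut_val:
            assignment[image] = "val"
        else:
            assignment[image] = "test"
    return assignment


# Hypothetical data: 300 images with 1-10 flowers each, 3 made-up classes.
rng = random.Random(42)
flowers = [
    (f"img_{i:03d}", rng.choice(["class_a", "class_b", "class_c"]))
    for i in range(300)
    for _ in range(rng.randint(1, 10))
]

assignment = grouped_split([image_id for image_id, _ in flowers])

# Inspect how close the split is, overall and per class.
split_sizes = Counter(assignment[image_id] for image_id, _ in flowers)
class_by_split = Counter((assignment[image_id], label) for image_id, label in flowers)
print(split_sizes)
for key in sorted(class_by_split):
    print(key, class_by_split[key])
```

If the per-class counts in some split look badly unbalanced, one simple option is to rerun with a different `seed` and keep the permutation whose class proportions are closest to the overall ones.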

A question, however, is whether the total number of flowers per image might be considered a feature to predict the flower type. Some grow in clusters whereas others may be a single bud per stem. That's useful information.
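If you want to try that, the count can be attached to every flower row before modelling. A minimal sketch, assuming the long format is a list of `(image_id, label)` pairs (the IDs and labels below are made up):

```python
from collections import Counter

# Hypothetical long-format rows: one (image_id, flower_class) pair per flower.
rows = [
    ("img_a", "rose"),
    ("img_a", "rose"),
    ("img_a", "tulip"),
    ("img_b", "daisy"),
]

# Number of flowers in each image.
flowers_per_image = Counter(image_id for image_id, _ in rows)

# Append the count as an extra column on every flower row.
rows_with_count = [
    (image_id, label, flowers_per_image[image_id]) for image_id, label in rows
]
print(rows_with_count)
```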