Solved – Neural Network: output representation (output layer)

conv-neural-network, deep-learning, image-processing, machine-learning, neural-networks

I have more of a high-level question. I’m trying to design a neural network for flower detection. The images are 440x440x3 (RGB) and are taken by a drone, so the flowers appear as white dots (radius up to 10 px).

I have already annotated over 2000 images, but I’m having trouble with the data representation for the output layer. For now, each flower is represented by the x-y coordinates of its centre, and each image contains a different number of flowers. This is not a standard classification/regression task, which is why I’m a bit lost.

I planned to use a CNN but I don’t know how to define the output. For now, the input X is:
a 440 x 440 x 3 x N matrix (height x width x channels x sample index).
This is pretty standard and I already have an input layer.

However, it gets tricky when I try to design the output layer. I was thinking of representing the output Y as:
a 440 x 440 x 1 x N matrix of binary images, where a 1 marks a flower centre (0 otherwise), according to the coordinates in the annotated data.
However, this approach seems hard to implement. Has anyone heard of such an approach?
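To make it concrete, this is roughly how I imagined building Y from my annotations (a rough numpy sketch; the annotation format here, a list of (x, y) centres per image, is just a placeholder for however the labels are stored):

```python
import numpy as np

def build_target_masks(annotations, height=440, width=440):
    """annotations: list of length N, each entry a list of (x, y) flower centres."""
    n = len(annotations)
    y = np.zeros((height, width, 1, n), dtype=np.uint8)
    for i, centres in enumerate(annotations):
        for (cx, cy) in centres:
            # row index = y coordinate, column index = x coordinate
            y[int(cy), int(cx), 0, i] = 1
    return y

# Example: two images, the first with two flowers, the second with one.
Y = build_target_masks([[(10, 20), (300, 415)], [(123, 45)]])
print(Y.shape)               # (440, 440, 1, 2)
print(Y[:, :, 0, 0].sum())   # 2
```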

My other thought was to keep the flowers annotated as before (x-y coordinates) and have two outputs, one for the x and one for the y coordinate. These would act as probabilities over each coordinate, and if both were acceptably high we would predict a flower at that location.

Are there any other ways to represent the output for this particular task? Any help would be greatly appreciated 🙂

Best Answer

Your approach implies an explosion in the number of parameters of the output layer, which does not scale well with the size of the image or the number of objects, much like a fully connected MLP compared with a convolutional network; you also lose position invariance. The way this is usually approached is to learn and output the enclosing box of each object, that is, the (x, y) coordinates of the top-left corner plus the width and height of the box.
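Since your flowers are roughly circular, converting your centre annotations into such boxes is straightforward; a rough sketch (the fixed 10 px radius and the clipping to the image border are my assumptions):

```python
def centre_to_box(cx, cy, radius=10, img_size=440):
    """Convert a flower centre to (x, y, w, h): top-left corner plus width/height.
    The 10 px radius is just the upper bound mentioned in the question;
    boxes are clipped to the image bounds."""
    x0 = max(cx - radius, 0)
    y0 = max(cy - radius, 0)
    x1 = min(cx + radius, img_size)
    y1 = min(cy + radius, img_size)
    return (x0, y0, x1 - x0, y1 - y0)

print(centre_to_box(5, 200))    # (0, 190, 15, 20) -- clipped at the left edge
print(centre_to_box(220, 220))  # (210, 210, 20, 20)
```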

But to be fast, one needs a more complex pipeline: a first stage that extracts candidate regions likely to contain objects, and a classifier that then decides whether each region actually corresponds to an object and, if so, which one.

The current state-of-the-art approach is Faster R-CNN, which you can also find on GitHub. The fundamental idea is that the features are learned in a first stage (learning these features is computationally very expensive). Two networks are then trained on top of these shared features: one that learns which regions contain objects (the region proposal network) and a second one that detects the object in a given region. This gives a high degree of parameter sharing, which improves runtime significantly.
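If you don't want to build the pipeline yourself, there are ready-made implementations. Here is a minimal training/inference sketch assuming the torchvision detection API, with a single "flower" class plus background; note that torchvision expects boxes as (x1, y1, x2, y2) corners rather than (x, y, w, h):

```python
import torch
import torchvision

# Sketch only: one foreground class ("flower") plus background.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=False, num_classes=2
)

# One 440x440 RGB image and its targets; boxes are absolute pixel corners.
images = [torch.rand(3, 440, 440)]
targets = [{
    "boxes": torch.tensor([[210.0, 210.0, 230.0, 230.0]]),  # one flower box
    "labels": torch.tensor([1], dtype=torch.int64),          # class 1 = flower
}]

model.train()
loss_dict = model(images, targets)   # dict of RPN + detection-head losses
loss = sum(loss_dict.values())       # what you would backpropagate

model.eval()
with torch.no_grad():
    predictions = model(images)      # list of {"boxes", "labels", "scores"}
print(predictions[0]["boxes"].shape)
```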
