Solved – multiple output layers in TensorFlow

conv-neural-network, deep-learning, image-processing, tensorflow

I have TensorFlow code that uses a CNN model to detect text. The model contains 9 convolutional layers, each followed by a ReLU activation, and 4 max-pooling layers with a window size of 2 and a stride of 2.

The image size is 224×224×3, and the label for each image is (x, y, width, height), where x and y are the coordinates (location) of the text in the image, and width and height are the size of the bounding box.

Now, how can I use (x, y, w, h) in dense regression layers? Should I use 4 dense layers (one for each coordinate and size value)? If so, is it correct to use 4 loss functions (MSE), one per dense layer? And should I also use a separate optimizer for each one?

Or is there a way to use just one dense layer?

Best Answer

Should I use 4 dense layers (one for each coordinate and size value)? If so, is it correct to use 4 loss functions (MSE), one per dense layer? And should I also use a separate optimizer for each one?

Yes, you will need to calculate four losses and combine them (for example with tf.reduce_sum or tf.reduce_mean) into a single final loss, then pass that loss to the optimizer. A single optimizer is sufficient.
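To make "combine them into one loss" concrete, here is a minimal sketch in plain Python (not TensorFlow, so it runs without dependencies). It computes one MSE per output head and reduces the four values to a single scalar, which is the only quantity the single optimizer needs. The head names `"x"`, `"y"`, `"w"`, `"h"` and the helper names are hypothetical, chosen for illustration.

```python
def mse(y_true, y_pred):
    """Mean squared error over a batch of scalar targets/predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def combined_loss(targets, preds, reduce="mean"):
    """Fold one MSE per head into a single scalar loss.

    targets / preds: dicts mapping a (hypothetical) head name to a list of
    values for the batch. reduce="mean" mirrors tf.reduce_mean over the four
    losses; anything else mirrors tf.reduce_sum.
    """
    per_head = [mse(targets[k], preds[k]) for k in ("x", "y", "w", "h")]
    total = sum(per_head)
    return total / len(per_head) if reduce == "mean" else total


# A batch of two examples, values already scaled to [0, 1].
targets = {"x": [0.2, 0.4], "y": [0.1, 0.3], "w": [0.5, 0.5], "h": [0.25, 0.5]}
preds = {"x": [0.3, 0.4], "y": [0.1, 0.1], "w": [0.5, 0.7], "h": [0.25, 0.5]}
loss = combined_loss(targets, preds)  # one scalar to hand to one optimizer
```

Sum versus mean only rescales the loss (and hence the effective learning rate) by a factor of 4. With the mean, the result is exactly the MSE over all 4·N values, which is also what a single 4-unit output trained with MSE would compute. In Keras, a multi-output model compiled with one loss per output likewise minimizes their (optionally `loss_weights`-weighted) sum with one optimizer.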

You can also do this with a single dense layer: use a dense layer with units=4 and a sigmoid activation function, which scales each of the four outputs into the range [0, 1].
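In Keras this head would be `tf.keras.layers.Dense(4, activation="sigmoid")`. The sketch below shows, in plain Python, what that layer computes in the forward pass: `sigmoid(W · features + b)` with one row of weights per output (x, y, w, h). The feature count of 8 and the random weights are hypothetical stand-ins for the flattened output of the CNN.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function; output lies in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dense_sigmoid(features, weights, biases):
    """Forward pass of one dense layer with sigmoid activation.

    weights: `units` rows, each with len(features) entries.
    biases:  `units` values.
    Returns one value in (0, 1) per unit.
    """
    return [
        sigmoid(sum(w * f for w, f in zip(row, features)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical sizes: 8 flattened CNN features feeding a 4-unit (x, y, w, h) head.
random.seed(0)
n_features, units = 8, 4
features = [random.uniform(-1.0, 1.0) for _ in range(n_features)]
weights = [[random.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(units)]
biases = [0.0] * units
box = dense_sigmoid(features, weights, biases)  # [x, y, w, h] as fractions
```

Because all four values come out of one layer, one MSE over the 4-vector replaces the four separate losses.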

The fractional output tells you what fraction of the image to consider. For example, x = 0.2 means the x coordinate is ~0.2·W, and h = 0.5 means the height of the bounding box is ~0.5·H (where H and W are the height and width of the image, which must be constant for the model). For this to work, the training labels must be scaled the same way: divide x and width by W, and y and height by H.
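A minimal sketch of that scaling in both directions, assuming the fixed 224×224 input from the question: pixel labels are normalized before training, and the sigmoid outputs are multiplied back by W and H at inference time. The function names are hypothetical.

```python
IMG_W, IMG_H = 224, 224  # fixed input size from the question


def normalize_box(x, y, w, h, img_w=IMG_W, img_h=IMG_H):
    """Pixel-space label -> fractions in [0, 1], used as training targets."""
    return x / img_w, y / img_h, w / img_w, h / img_h


def denormalize_box(fx, fy, fw, fh, img_w=IMG_W, img_h=IMG_H):
    """Sigmoid outputs (fractions) -> approximate pixel position and size."""
    return fx * img_w, fy * img_h, fw * img_w, fh * img_h


# x = 0.2 maps to about 0.2 * 224 = 44.8 px; h = 0.5 maps to 112 px.
pixels = denormalize_box(0.2, 0.1, 0.3, 0.5)
```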

Hope this helps.
