I have code in TensorFlow that uses a CNN model to detect text. The model contains 9 conv layers followed by ReLU activations and 4 max-pooling layers with window size and stride equal to 2.
The image size is 224*224*3, and the label for each image is (x, y, width, height), where x and y are the text coordinates (location) in the image, and width and height are the bounding box size.
Now, how can I use (x, y, w, h) in Dense regression layers? Should I use 4 Dense layers (one layer for each coordinate and size value)? If so, would it be correct to use 4 loss functions (MSE), one for each Dense layer?
Also, should I use a separate optimizer for each one?
Or is there another way that uses just one Dense layer?
Best Answer
Yes, with four dense layers you will need to calculate four losses and combine them (with tf.reduce_sum or tf.reduce_mean) to build a final loss function, which you then pass to the optimizer. A single optimizer will suffice.
You can also do this with one dense layer. Use a dense layer with units=4 and a sigmoid activation function to scale the values to [0, 1].
Each output is then a fraction of the image size. For example, x=0.2 means that the x coordinate is ~0.2*W, and h=0.5 means that the height of the bounding box is ~0.5*H. (H and W are the height and width of the image and must be constant for the model.) The labels need to be scaled the same way, dividing x and width by W and y and height by H, so that they are comparable with the sigmoid outputs.
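The following is a minimal sketch of the two points above, written in plain Python rather than TensorFlow so that it runs without any dependencies. It assumes the 224*224 image size from the question. The helper names (normalize_box, denormalize_box, mse) and the sample labels and logits are hypothetical and only for illustration. It shows how pixel labels map to [0, 1] fractions and back, and that averaging four per-coordinate MSE losses gives the same value as one MSE over a 4-unit output.

```python
import math

IMG_W, IMG_H = 224, 224  # image size taken from the question

def normalize_box(box, img_w=IMG_W, img_h=IMG_H):
    """Scale a pixel-space (x, y, w, h) label into [0, 1] fractions."""
    x, y, w, h = box
    return (x / img_w, y / img_h, w / img_w, h / img_h)

def denormalize_box(frac, img_w=IMG_W, img_h=IMG_H):
    """Map fractional (x, y, w, h) outputs back to pixel units."""
    x, y, w, h = frac
    return (x * img_w, y * img_h, w * img_w, h * img_h)

def sigmoid(z):
    """Squash a raw output into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def mse(pred, target):
    """Mean squared error over two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical batch: two pixel-space labels and two rows of raw outputs.
labels_px = [(56.0, 112.0, 100.0, 40.0), (20.0, 30.0, 60.0, 25.0)]
logits = [(-1.0, 0.0, -0.2, -1.5), (-2.0, -1.8, -1.0, -2.2)]

targets = [normalize_box(b) for b in labels_px]
preds = [tuple(sigmoid(z) for z in row) for row in logits]

# Option A: one MSE per coordinate (as with four heads), then averaged.
per_coord = [
    mse([p[i] for p in preds], [t[i] for t in targets]) for i in range(4)
]
loss_four_heads = sum(per_coord) / 4

# Option B: a single MSE over all values of the 4-unit output.
loss_single_head = mse(
    [v for p in preds for v in p],
    [v for t in targets for v in t],
)

print(loss_four_heads, loss_single_head)
print(denormalize_box(preds[0]))
```

In TensorFlow, Option B corresponds to the single dense layer with units=4 and a sigmoid activation described above, trained with one MSE loss and one optimizer. If the four losses are combined with tf.reduce_sum instead of tf.reduce_mean, the result differs only by a constant factor of 4.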
Hope this helps.