Solved – multiple output layer in tensorflow

conv-neural-network, deep learning, image processing, tensorflow

I have TensorFlow code that uses a CNN model to detect text. The model contains 9 conv layers followed by ReLU activations and 4 max-pooling layers with window size and stride equal to 2.

The image size is 224*224*3, and the label for each image is (x, y, width, height), where x and y are the text coordinates (location) in the image, and width and height are the bounding box size.

Now, how can I use (x, y, w, h) in dense regression layers? Should I use 4 dense layers (one layer for each coordinate and size value)? If so, is it correct to use 4 loss functions (MSE), one for each dense layer?
Also, should I use a separate optimizer for each one?

Or is there another way that uses just one dense layer?

Best Answer

Should I use 4 dense layers (one layer for each coordinate and size value)? If so, is it correct to use 4 loss functions (MSE), one for each dense layer? Also, should I use a separate optimizer for each one?

Yes, that works: calculate the four losses, combine them into a single scalar (with tf.reduce_sum or tf.reduce_mean) to build the final loss, and pass that to the optimizer. A single optimizer is enough.
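To make the combining step concrete, here is a minimal, framework-free sketch in plain Python. The batch values are made up for illustration, and `mse` is a hypothetical helper, not a TensorFlow function. It computes one MSE per coordinate, reduces the four losses with a sum and with a mean, and shows that the mean equals the MSE of a single 4-value output.

```python
# Framework-free sketch: four per-coordinate MSE losses combined into one scalar.
# All numbers below are made-up illustration data, not results from a real model.

def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Hypothetical batch of 2 boxes, each (x, y, w, h) as fractions of the image.
preds = [(0.20, 0.30, 0.50, 0.10), (0.60, 0.40, 0.25, 0.20)]
targets = [(0.25, 0.30, 0.40, 0.10), (0.50, 0.45, 0.25, 0.30)]

# One loss per output "head": zip(*...) regroups the batch by coordinate.
per_coord = [mse(p, t) for p, t in zip(zip(*preds), zip(*targets))]

total_loss = sum(per_coord)              # analogous to reducing with a sum
mean_loss = total_loss / len(per_coord)  # analogous to reducing with a mean

# MSE over one flat 4-value output per box gives the same number as mean_loss.
flat_preds = [v for box in preds for v in box]
flat_targets = [v for box in targets for v in box]
single_head_loss = mse(flat_preds, flat_targets)

print(per_coord, total_loss, mean_loss, single_head_loss)
```

Since the mean of the four per-coordinate losses matches the loss of one 4-value output, the two designs optimise the same quantity up to a constant factor. The sum variant only rescales the gradient by 4.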

You can also do this with a single dense layer: use a dense layer with units=4 and a sigmoid activation function, which scales each output to a value between 0 and 1.

The fractional output gives the position or size as a fraction of the image. For example, x=0.2 means that the x coordinate is ~0.2*W, and h=0.5 means that the height of the bounding box is ~0.5*H. (H and W are the height and width of the image and must be constant for the model.)
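The fraction-to-pixel mapping can be sketched in plain Python as follows. The 224x224 size comes from the question. The helper names `to_pixels` and `to_fractions` are hypothetical, and normalising the labels to the same 0 to 1 range before computing the loss is an assumption that follows from using a sigmoid output.

```python
import math

IMG_W, IMG_H = 224, 224  # fixed input size taken from the question

def sigmoid(z):
    """Logistic function; maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def to_pixels(raw_outputs, img_w=IMG_W, img_h=IMG_H):
    """Map 4 raw (pre-activation) outputs to a pixel-space box (x, y, w, h)."""
    fx, fy, fw, fh = (sigmoid(z) for z in raw_outputs)
    return (fx * img_w, fy * img_h, fw * img_w, fh * img_h)

def to_fractions(box, img_w=IMG_W, img_h=IMG_H):
    """Normalise a pixel-space label so it is comparable to sigmoid outputs."""
    x, y, w, h = box
    return (x / img_w, y / img_h, w / img_w, h / img_h)

# Raw outputs of zero give sigmoid(0) = 0.5, i.e. the middle of the image.
print(to_pixels((0.0, 0.0, 0.0, 0.0)))
# A hypothetical pixel-space label converted to fractions for training.
print(to_fractions((56, 112, 112, 28)))
```

Training on the normalised labels and converting predictions back with `to_pixels` keeps the loss on the same scale for all four values.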

Hope this helps.

Related Solutions

Solved – Fractional output dimensions of “sliding-windows” (convolutions, pooling etc) in neural networks

The fractional part comes from the stride operation. Without stride (stride = 1), the output size should be output_no_stride = input + 2*pad - filter + 1 = 224. With stride, the conventional formula to use is output_with_stride = floor((input + 2*pad - filter) / stride) + 1 = 112.

In many programming languages, the default behavior of integer division is "round toward zero", so the floor operation can be omitted when the numerator and denominator are positive integers. (Ref: Caffe's convolution implementation, cuDNN docs)
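As a small sketch of the conventional formula, the function below uses Python's `//` operator, which floors and therefore matches the floor for non-negative operands. The text does not state the filter size or padding, so filter=3 and pad=1 are assumptions chosen only because they reproduce the 224 and 112 quoted above for a 224 input. `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(input_size, filter_size, pad, stride=1):
    """Conventional output size: floor((input + 2*pad - filter) / stride) + 1."""
    # Integer division floors here because both operands are non-negative.
    return (input_size + 2 * pad - filter_size) // stride + 1

# Assumed values (not given in the text): filter=3, pad=1 on a 224 input.
output_no_stride = conv_output_size(224, filter_size=3, pad=1, stride=1)
output_with_stride = conv_output_size(224, filter_size=3, pad=1, stride=2)
print(output_no_stride, output_with_stride)
```

Here (224 + 2 - 3) / 2 = 111.5, and it is the floor that removes the fractional part before the final + 1.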

Comparing the output dimension with and without stride

output_with_stride = floor((input + 2*pad - filter) / stride) + 1
                   = floor((output_no_stride - 1) / stride) + 1
                   = ceil(output_no_stride / stride)
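The last step of this derivation, floor((n - 1) / stride) + 1 = ceil(n / stride) for positive integers, can be checked by brute force. This is only a sanity check over a small range of values, not a proof, and the helper names are hypothetical.

```python
def strided_size(n, stride):
    """floor((n - 1) / stride) + 1, using integer division."""
    return (n - 1) // stride + 1

def ceil_div(n, stride):
    """ceil(n / stride) for positive integers, without floating point."""
    return -(-n // stride)

# Collect every (n, stride) pair in the tested range where the two sides differ.
mismatches = [
    (n, s)
    for n in range(1, 300)
    for s in range(1, 10)
    if strided_size(n, s) != ceil_div(n, s)
]
print(mismatches)
```

An empty list means the two expressions agree everywhere in the tested range.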

Caffe's pooling is a bit more complicated: it first replaces the floor with a ceiling, then decreases the size by one if the last pooling window does not start strictly inside the image, as shown in the code.

  pooled_height_ = static_cast<int>(ceil(static_cast<float>(
      height_ + 2 * pad_h_ - kernel_h_) / stride_h_)) + 1;
  pooled_width_ = static_cast<int>(ceil(static_cast<float>(
      width_ + 2 * pad_w_ - kernel_w_) / stride_w_)) + 1;
  if (pad_h_ || pad_w_) {
    // If we have padding, ensure that the last pooling starts strictly
    // inside the image (instead of at the padding); otherwise clip the last.
    if ((pooled_height_ - 1) * stride_h_ >= height_ + pad_h_) {
      --pooled_height_;
    }
    if ((pooled_width_ - 1) * stride_w_ >= width_ + pad_w_) {
      --pooled_width_;
    }
    CHECK_LT((pooled_height_ - 1) * stride_h_, height_ + pad_h_);
    CHECK_LT((pooled_width_ - 1) * stride_w_, width_ + pad_w_);
  }

I think the result is mostly aligned with the conventional formula except when the last pooling is entirely outside the original input.
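To compare the two rules numerically, here is a one-dimensional Python port of the Caffe snippet above next to the conventional formula. It is a simplified sketch: the original checks `pad_h_ || pad_w_` jointly for both dimensions, while this version looks at a single dimension, and it uses integer arithmetic for the ceiling instead of a float cast. The function names are hypothetical.

```python
def caffe_pool_size(size, kernel, stride, pad=0):
    """1-D port of the Caffe pooling output-size logic quoted above."""
    # ceil((size + 2*pad - kernel) / stride) + 1, computed with integers.
    out = -(-(size + 2 * pad - kernel) // stride) + 1
    # With padding, drop the last window if it would start in the padding.
    if pad and (out - 1) * stride >= size + pad:
        out -= 1
    return out

def conventional_size(size, kernel, stride, pad=0):
    """floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# (size, kernel, stride, pad) cases chosen for illustration.
for args in [(224, 2, 2, 0), (5, 2, 2, 1), (224, 3, 2, 0)]:
    print(args, caffe_pool_size(*args), conventional_size(*args))
```

In this sketch the two functions agree for (224, 2, 2, 0) and for the padded case (5, 2, 2, 1), where the clipping branch fires. They differ by one for (224, 3, 2, 0), because the ceiling keeps a final window that only partially covers the input.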
