Solved – MobileNets object keypoints localization with Keras

deep learning, keras, machine learning, neural networks, tensorflow

I'm trying to use MobileNets to localize a rectangular object in an image. I want to construct a model that takes an image as input and outputs the keypoint coordinates of the rectangle's four corners (8 values in total).

The rectangular object in each image is fairly distinctive, and I would guess it should be easy to generalize. For example (my problem is very similar to this): 900 images of the Chicago Bulls' court, each with its 8 corner coordinates. The center logo is always facing the camera, but the photo can be taken from any angle/rotation and from close range or at a distance (so the apparent court dimensions and rotation vary).

The model would be able to give me the 8 points:

[upper_left_x, upper_left_y, upper_right_x, upper_right_y, bottom_left_x, bottom_left_y, bottom_right_x, bottom_right_y]

I could then apply the points manually (I don't want the network to do this part):

[image: example of the problem]

I initialize with a base model:

from tensorflow.keras.applications import MobileNet

model = MobileNet(weights=None,              # can also be 'imagenet'
                  include_top=False,         # the top is for classification
                  input_shape=(224, 224, 3))

and for the output layers I apply something very similar to a facial-keypoints model:

from tensorflow.keras.layers import Flatten, Dense

x = model.output
x = Flatten()(x)
x = Dense(500, activation="relu")(x)
x = Dense(90, activation="relu")(x)
predictions = Dense(8)(x)   # linear output: the 8 keypoint coordinates
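
Tying the head into a trainable model and compiling it with the MSE loss mentioned in the edit below looks roughly like this (a sketch, not my exact training script; the names and learning rate are just examples):

from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Attach the regression head above to the MobileNet base.
full_model = Model(inputs=model.input, outputs=predictions)

# Plain regression on the 8 coordinates: MSE loss, Adam optimizer
# (the learning rate is just an example value).
full_model.compile(optimizer=Adam(learning_rate=1e-4), loss="mse")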

Other Setup and Problems

I've tried the Adam and SGD optimizers, initialized the model with ImageNet weights (with layers frozen and unfrozen), changed learning rates, restructured and resized the output layers in many ways, tried different/custom model implementations, validated my dataset (X, y) processing, and trained for hundreds to thousands of epochs.
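
(For reference, the frozen-base variant was along these lines; a sketch using the names from the snippets above, not my exact script:)

# Freeze the pretrained base so only the new head is trained; set the
# flags back to True (or unfreeze only the last blocks) to fine-tune,
# and recompile the model after changing them.
for layer in model.layers:
    layer.trainable = False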

I do not get anywhere near as accurate results as the facial-keypoints example. In fact, the predictions seem completely random, other than the model reliably outputting a rectangle with the points in the correct order. Every model seems to learn only the shape, with no regard for the correct dimensions or rotation. I've experimented with the facial-keypoints model as well; even there the accuracy is variable and fairly low, but it appears to perform well because faces have very similar proportions and virtually no rotation (if the facial dimensions don't change, the output shape typically fits).

No matter the structure, the lowest validation loss I can achieve is around 0.0xxx, and the accuracy is completely off (in terms of correct dimensions/rotation). I've accounted for overfitting, but I think the structure of my model is the real issue.

Is MobileNet the wrong architecture? Should the output layers be structured differently?

I would appreciate any help, thanks.

Edit:
loss is MSE; the keypoints are normalized to [-1, 1]
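
For concreteness, this maps pixel coordinates into [-1, 1] along these lines (an illustrative helper, assuming 224x224 input images):

import numpy as np

def normalize_keypoints(points_px, img_w=224, img_h=224):
    # [x1, y1, ..., x4, y4] in pixels -> values in [-1, 1]
    pts = np.asarray(points_px, dtype=np.float32).reshape(4, 2)
    pts[:, 0] = 2.0 * pts[:, 0] / img_w - 1.0
    pts[:, 1] = 2.0 * pts[:, 1] / img_h - 1.0
    return pts.reshape(-1)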

Best Answer

Every model seems to learn only the shape, with no regard for the correct dimensions or rotation

It is a known problem that CNNs aren't particularly good at handling such transformations (more precisely, their output is not invariant/equivariant under such transformations). The way to handle this in practice is to augment your data (in Keras you can do this with the image preprocessing tools, e.g. ImageDataGenerator).
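
Note that for keypoint regression the targets have to be transformed together with the image (ImageDataGenerator only transforms the images, not the regression targets). Here is a rough sketch of a rotation augmentation, assuming the keypoints are normalized to [-1, 1] around the image centre; the sign conventions depend on your axes, so verify the result visually on a few samples:

import numpy as np
from tensorflow.keras.preprocessing.image import apply_affine_transform

def random_rotation(image, keypoints, max_deg=30.0):
    # Rotate the image and its normalized keypoints by the same random angle.
    angle = np.random.uniform(-max_deg, max_deg)
    image = apply_affine_transform(image, theta=angle,
                                   row_axis=0, col_axis=1, channel_axis=2,
                                   fill_mode="nearest")
    # Rotate the (x, y) pairs about the image centre. Image rows grow
    # downwards, so the angle's sign may need flipping for your convention.
    t = np.deg2rad(angle)
    rot = np.array([[np.cos(t), -np.sin(t)],
                    [np.sin(t),  np.cos(t)]], dtype=np.float32)
    pts = np.asarray(keypoints, dtype=np.float32).reshape(4, 2)
    return image, (pts @ rot.T).reshape(-1)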

Is MobileNet the wrong architecture? Should the output layers be structured differently?

You didn't mention which layers you use. Did you try lower layers? Higher layers are often adapted to specific high-level features; remember that on ImageNet your model learned to distinguish cars from cats, for example, so the last layers' weights may not be good at extracting the features relevant to your task.
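
For example, you could attach the regression head to an earlier feature map instead of the last one. Layer names differ between Keras versions, so inspect model.summary() first; the layer name below is only a hypothetical example:

from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Flatten, Dense
from tensorflow.keras.models import Model

base = MobileNet(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.summary()   # inspect the layer names; they vary between versions

# Branch the head off an intermediate block (replace the layer name with
# one taken from the summary above).
mid = base.get_layer("conv_pw_11_relu").output
x = Flatten()(mid)
x = Dense(500, activation="relu")(x)
predictions = Dense(8)(x)
model = Model(inputs=base.input, outputs=predictions)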

All that being said, I don't think there is a definitive answer. Note that pretrained models are trained on huge datasets, so maybe you simply don't have enough data to train a deep network for your task (even with transfer learning). By the way, do you actually need a neural net for this? Your task doesn't seem particularly complex; maybe it can be tackled with classical computer vision tools (which most likely won't require any training, as in this example).
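
For instance, if the court markings have reasonably strong edges, something along these lines with OpenCV may already find the four corners (the preprocessing and thresholds are guesses you would have to tune):

import cv2

def find_rectangle_corners(image_bgr):
    # Edge map -> contours -> largest contour with a 4-point polygon fit.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(cv2.GaussianBlur(gray, (5, 5), 0), 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)   # OpenCV 4 signature
    for cnt in sorted(contours, key=cv2.contourArea, reverse=True):
        approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
        if len(approx) == 4:
            return approx.reshape(4, 2)   # the four corner points
    return None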
