Solved – Computer vision algorithm that maps the positions of objects in 3D onto a 2D image

computer vision, machine learning, neural networks, references

Here is what I want to achieve: the input is a frame from a soccer broadcast:
[image: broadcast camera frame of a soccer match]

Then, I want my model to return a 2D model with the following information:

[image: 2D top-down model of the pitch with player positions]

The players' positions on the field are given, and the relative distances between players are preserved as well.

Here is my plan to achieve this:

First, detect the players using a standard ML algorithm (there are plenty of resources to study)

However, the major challenge is: how do I figure out a player's corresponding position in the 2D model, given the input image? Just as important, the part of the field visible in the camera image must be well represented in the 2D model (with correct proportions, so that the relative distances between players are preserved).

My intention is to model 2D soccer matches based on broadcast video (a 2D model that reflects a real broadcast soccer video, so that a sequence of events from the broadcast camera can be modeled in 2D from a bird's-eye view). The broadcast camera pans from one side of the pitch to the other. (For now, I exclude frames that show the audience or close-ups of single players.)

The lines in the input image that could be used as references are not fixed from frame to frame, which makes it difficult to determine a player's position.

(I'm looking for published papers so that I can implement the algorithms)

Best Answer

I do not know of a publication in this area.

In my opinion it is a computer vision problem comprising several smaller problems. You need a model of the pitch, the ability to segment and track the players, and a way to keep track of where the camera is pointing.

Ideally, the camera is calibrated, so you have a mapping from pixels to meters. The problem is then to recognize what part of the field the camera is covering (which requires landmark detection), to detect the players and project their positions onto the ground plane, and finally to apply the projective transformation defined by the camera, which rectifies the pitch into a 2D top-down view of the field.

The question is then:

  • how to define what the landmarks are, and how to find them in the image,
  • how to detect and keep track of the players.

For the first problem there are a number of approaches for fitting ellipses and other sorts of primitives. See for example "Robust Pose Estimation from a Planar Target" (Schweighofer et al.). This would additionally allow you to recalibrate or realign your system, if necessary.

You may be able to detect the corners with a standard corner detector such as Harris (at the right scale). See this lecture for more details: https://ags.cs.uni-kl.de/fileadmin/inf_ags/opt-ss14/OPT_SS2014_lec02.pdf

For the second, I would expect a standard HOG-based approach to be able to detect the players (see "Histograms of Oriented Gradients for Human Detection", Dalal and Triggs). At least on the sort of images like the one you posted.

Keeping track of the players can get really tricky when they overlap each other. I am not aware of a robust approach to that tracking problem. You may get good results from one of the trackers implemented in OpenCV (https://www.learnopencv.com/object-tracking-using-opencv-cpp-python/)
