1 Introduction

Automatic video editing makes small events available to a much larger audience. Indeed, many events cannot be broadcast because of the fixed costs of production (crew and equipment). With automatic video editing, i.e. the automatic selection of the best viewing angle in a multi-camera system, a live video stream of where the action takes place can be provided to the spectator. CitizenCam, a French company offering multi-camera automatic recording solutions, wants to stream every type of event on the web to the greatest number of people. To achieve this goal, CitizenCam chose to reduce costs by automating recording and broadcasting with IP cameras. Gathering statistical knowledge about the scene is required to understand the action and perform camera selection. The specific context of this study is indoor sports broadcasting, especially basketball games. In this article we are interested in tracking players in order to detect key events such as free throws or counter-attacks, and also to obtain statistics on each player. The dataset is available in [1] and includes different views of a basketball game, from the side and from above. While watching this footage, we can identify two types of challenges that influence the precision of the tracking.

First, some of these challenges are due to the nature of basketball:

  • Occlusion: During tracking, players can be hidden for some time, and it can be hard to recover from this. Two sub-cases can be identified: first, when two players cross each other; second, when one player hides another during a static phase.

  • Rotation: Appearance models are complex to use because the appearance of a player changes depending on how he rotates and where he is located; seeing a player from the side is not the same as seeing him from above.

  • Acceleration: Some tracking algorithms use the difference between two frames to determine the next position, but if there is a sudden and unexpected change of direction, it can be difficult to keep tracking accurately.

  • Groups: Given the structure of basketball, most uncertain situations involve just two players, except in a few cases: the beginning of the game, injuries, celebrations and the end of the game. In those moments, many similar-looking players stand next to each other and it is hard to follow them.

Second, some issues are directly related to the video capture. The footage is recorded with several fish-eye (wide-angle) cameras. If we focus on the view from above, which is the one we use and which is recorded with a 180\(^\circ \) security camera, we can observe two issues.

  • Cropped Image: A correction has already been applied to make the video watchable, but it cropped the image, so we need to determine when a player leaves the frame and when he comes back.

  • Distortion: The distortion is not completely corrected, so the apparent size of a player changes depending on his position. For instance, a player in the center appears much bigger than one in a corner, and this needs to be corrected.

Having exposed these different challenges, it appears clearly that a lot of information is contained in the nature of the game. Because of these major constraints, it makes sense to develop a specific solution instead of using generic algorithms, in order to make the tracking smarter, i.e. better and faster.

2 Available Techniques

In order to evaluate the results of our solution, it is necessary to compare it to different algorithms. Because we are working with OpenCV, several tracking algorithms are already implemented: Boosting, KCF, MedianFlow, Multiple Instance Learning (MIL) and Tracking-Learning-Detection (TLD). Since the MedianFlow tracker [5] and the MIL tracker [6] are not adapted to our application (random displacements, quick rotations), we focus on the other trackers. For a more exhaustive comparison, please see the work of Janku et al. [8].

  • Boosting Tracker: Based on the AdaBoost algorithm, which uses the surrounding background as negative examples to find the most discriminative features of the tracked object. Because it is appearance-based, changes of the player such as rotations or lighting variations are normally well handled [2].

  • Kernelized Correlation Filter (KCF) Tracker: The main goal of a tracker is to distinguish the target from the environment. This algorithm translates and scales different patches in order to find the best one. To reduce the computational cost, improvements exploit the fact that the matrix under study is circulant [3] and that a correlation filter can be applied [4].

  • Tracking-Learning-Detection (TLD) Tracker: The main approach here is to detect an object in one frame, then check whether the object is present in the following frames. Depending on the result, the tracker is updated differently [7]. This algorithm is supposed to handle rapid motions and partial occlusions.

After this review, we can say that KCF, Boosting and TLD are suitable for our study. We will compare the results of our solution to these algorithms.
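For reference, all three trackers can be driven through OpenCV's common tracking API. The sketch below is a minimal usage example, not our method: the video file name and initial bounding box are placeholders, and in opencv-contrib-python 4.5.1 and later the Boosting and TLD constructors live in the cv2.legacy submodule.

```python
import cv2

def make_tracker(name):
    # Boosting and TLD moved to cv2.legacy in opencv-contrib-python >= 4.5.1
    ns = cv2.legacy if hasattr(cv2, "legacy") else cv2
    factories = {
        "KCF": ns.TrackerKCF_create,
        "Boosting": ns.TrackerBoosting_create,
        "TLD": ns.TrackerTLD_create,
    }
    return factories[name]()

video = cv2.VideoCapture("above_view.mp4")  # placeholder file name
ok, frame = video.read()
tracker = make_tracker("KCF")
tracker.init(frame, (100, 150, 40, 40))  # placeholder (x, y, w, h) of one player

while ok:
    ok, frame = video.read()
    if not ok:
        break
    found, box = tracker.update(frame)  # found is False once the target is lost
    if found:
        x, y, w, h = map(int, box)
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```

Note that one such tracker object is needed per tracked player, a point we come back to when discussing scalability in Sect. 4.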

3 Implementation

Our solution is implemented in Python and consists of a few processing steps.

On each frame, we perform several actions:

  1. A background subtraction model is used to remove the background. Because the camera is fixed, this is quite efficient.

  2. The subtraction is cleaned with closing and opening transformations in order to remove blur and retrieve clean player silhouettes.

  3. The Suzuki algorithm [9] is used to find the different players.

Then, the detected objects are linked from frame to frame based on criteria such as surface, distance between centers, speed vector direction and main color. But in order to get good results, the distortion must first be corrected (or at least limited); otherwise the algorithm would have to link two players whose appearances differ far too much between the center and a corner of the image.

3.1 Distortion

The proprietary software associated with the camera had already been used to retrieve the current view. Using the raw footage is not an option because the camera uses a panamorphic lens; the distortion is not radial and there is no easy way to correct it without access to the main characteristics of the camera. We therefore need to work on the corrected image. A first attempt was to estimate the distortion model by assuming it was radial: after solving the equations, we obtained a good result at the center of the picture, but the border areas were completely deformed (Fig. 1a) and unusable.

Because of the poor results of this approach, and because undistorting the video slowed the algorithm down too much, we decided to use a manually created pixel/mm ratio which varies across the frame depending on the position. The main advantage of this correction is that it becomes possible to set parameters in millimeters rather than in pixels. The projection of the function we found can be seen in Fig. 1b.
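The exact form of the ratio function is not given here; the sketch below illustrates the idea under the assumption of a hand-tuned grid of mm-per-pixel values interpolated bilinearly. All grid values are hypothetical and would in practice be measured from known court dimensions at several image positions.

```python
import numpy as np

# Hypothetical hand-tuned grid of mm-per-pixel ratios sampled across the
# image; values are larger near the corners, where the residual fisheye
# distortion compresses the court.
RATIO_GRID = np.array([
    [14.0, 11.0, 10.0, 11.0, 14.0],
    [11.0,  8.5,  7.5,  8.5, 11.0],
    [14.0, 11.0, 10.0, 11.0, 14.0],
])

def mm_per_pixel(x, y, width, height):
    """Bilinearly interpolate the hand-made ratio grid at pixel (x, y)."""
    gy = y / height * (RATIO_GRID.shape[0] - 1)
    gx = x / width * (RATIO_GRID.shape[1] - 1)
    y0, x0 = int(gy), int(gx)
    y1 = min(y0 + 1, RATIO_GRID.shape[0] - 1)
    x1 = min(x0 + 1, RATIO_GRID.shape[1] - 1)
    fy, fx = gy - y0, gx - x0
    top = RATIO_GRID[y0, x0] * (1 - fx) + RATIO_GRID[y0, x1] * fx
    bottom = RATIO_GRID[y1, x0] * (1 - fx) + RATIO_GRID[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

def pixel_distance_to_mm(p1, p2, width, height):
    """Convert a pixel distance to millimeters, using the ratio at the midpoint."""
    mid_x, mid_y = (p1[0] + p2[0]) / 2, (p1[1] + p2[1]) / 2
    dist_px = np.hypot(p2[0] - p1[0], p2[1] - p1[1])
    return dist_px * mm_per_pixel(mid_x, mid_y, width, height)
```

This correction is what allows thresholds such as the 3-meter limit of Sect. 3.3 to be expressed directly in millimeters.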

Fig. 1. Distortion

3.2 Background Subtraction and Contours Extraction

To detect players, a background subtraction [14] is used. Because the camera is fixed, the history used to compute this subtraction should be as long as possible. When applying it, some blur can still be seen (see Fig. 2a), particularly around the lines of the field. Moreover, some players are divided into chunks, so some cleaning is necessary. Three morphological transformations are applied to the image: first, a small opening (\(2\,\times \,2\) pixels, see Fig. 2c) to remove the blur, then a closing to unify the chunks (\(9\,\times \,9\) pixels, see Fig. 2d), then a final opening to clean the contours of the objects (\(3\,\times \,3\), see Fig. 2e). The final result can be seen in Fig. 2b. The last processing step is detecting players: based on the subtraction, a contour detection algorithm is applied. We use the Suzuki algorithm [9], which follows the contours in order to determine the whole shape of the tracked object. The bounding boxes of the contours are then filtered to discard nested ones. Finally, detections that are too small (based on a surface parameter) are removed. A sketch of this pipeline is given below.
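The sketch assumes OpenCV's MOG2 subtractor as a stand-in for the method of [14]; the history length and surface threshold are illustrative values. cv2.findContours implements the Suzuki border-following algorithm [9], and the OpenCV ≥ 4 return signature is used.

```python
import cv2
import numpy as np

# Long history, since the camera is fixed (value is illustrative)
subtractor = cv2.createBackgroundSubtractorMOG2(history=5000, detectShadows=False)
MIN_AREA = 200  # hypothetical surface threshold, in pixels

def detect_players(frame):
    mask = subtractor.apply(frame)
    # 1) small opening to remove blur, 2) closing to unify the chunks,
    # 3) final opening to clean the contours of the objects
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((2, 2), np.uint8))
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((9, 9), np.uint8))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    # RETR_EXTERNAL keeps only outer contours, discarding nested ones
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) >= MIN_AREA]
```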

Fig. 2. Successive steps of the subtraction

3.3 Score Estimation

To sum up the previous steps, different boxes, which normally correspond to players, are now available on each frame. The main challenge is to link these boxes across frames in order to retrieve whole trajectories. Terminology: the target is the player tracked in the previous frame; the players are all the boxes detected in the current frame. On each frame, we compute an association score between the target and the players. Under certain conditions, the player with the best score is associated with the target. The whole decision tree can be found in Fig. 3.
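The exact combination of the sub-scores follows the decision tree of Fig. 3, which is not reproduced here; the sketch below only illustrates the general shape of the association step, with a plain average of the available sub-scores as a stand-in for that tree.

```python
def associate(target, players, score_fns, threshold=0.6):
    """Return the detected player best matching the target, or None.

    score_fns are callables returning a score in [0, 1], or None when the
    cue is irrelevant (e.g. speed vectors too slow to be informative).
    """
    best_player, best_score = None, threshold
    for player in players:
        scores = [fn(target, player) for fn in score_fns]
        scores = [s for s in scores if s is not None]
        if not scores:
            continue
        score = sum(scores) / len(scores)  # stand-in for the decision tree
        if score > best_score:
            best_player, best_score = player, score
    return best_player
```

The individual sub-scores (distance, speed vector, color) are described in the following paragraphs.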

Fig. 3. Structure of the score

Distance Evaluation. The distance is critical here because the target cannot move faster than a certain limit. Based on the available information, we can compute for each player a distance score to the target, defined in formula (1), where distance is the Euclidean distance between the player and the target, max is the maximum distance allowed to link a player with a target, and min is the distance below which the score is maximal. The max value was set to 3 meters in our experiments, and we keep only the players within 1.2 meters of the target (\(score>0.6\)) for the association of players to targets.

$$\begin{aligned} score= \left\{ \begin{matrix} 1 &{} if\ distance \le min,\\ 0 &{} if\ distance \ge max,\\ \frac{max-distance}{max-min} &{} else \end{matrix}\right. \end{aligned}$$
(1)
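In code, with min set to 0 (an assumed value, not stated above), the formula reads as follows; with max = 3 m, a score above 0.6 indeed corresponds to a distance below 1.2 m.

```python
DIST_MIN = 0.0     # mm; below this the score saturates at 1 (assumed value)
DIST_MAX = 3000.0  # mm; beyond this a player cannot be linked to the target

def distance_score(distance_mm):
    """Formula (1): 1 when very close, 0 beyond max, linear in between."""
    if distance_mm <= DIST_MIN:
        return 1.0
    if distance_mm >= DIST_MAX:
        return 0.0
    return (DIST_MAX - distance_mm) / (DIST_MAX - DIST_MIN)
```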

Speed Vector Projection. Based on the previous data, occlusions can be resolved under some conditions. For instance, the speed vector can help determine when two players are crossing each other, but this information is relevant only if the two vectors are not collinear and if at least one of the speed vector magnitudes is not too low. Currently the score is computed as a function of the angle between the two vectors (see Fig. 4).
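The text only states that the score depends on the angle between the two vectors; a cosine-based mapping, shown below with a hypothetical magnitude threshold, is one natural realization.

```python
import numpy as np

MIN_SPEED = 0.5  # hypothetical magnitude threshold, in mm per frame

def speed_score(v_target, v_player):
    """Angle-based score, or None when speed carries no information."""
    n1, n2 = np.linalg.norm(v_target), np.linalg.norm(v_player)
    if max(n1, n2) < MIN_SPEED:
        return None  # both vectors too slow: the angle is meaningless
    cos_angle = np.dot(v_target, v_player) / (n1 * n2 + 1e-9)
    # 1 when the vectors point the same way, 0 when opposite
    return (1.0 + cos_angle) / 2.0
```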

Fig. 4. Projection of the speed vectors

Color Comparison. When two players are too close, it is not possible to determine exactly where each one stands, but their positions can be determined when they separate from each other. To increase accuracy, and because the two teams wear different outfits, color gives us precious information. The first idea was to use conditional statements in the RGB (Red Green Blue) color space, but the main issue is that this space is not perceptually linear and is quite dependent on lighting. That is why we changed the color space. The LAB color space proves quite effective: L stands for lightness, a for the green-red axis and b for the blue-yellow axis [10]. Because one team wears red, it suffices to compute the ratio between red pixels (i.e. pixels whose a value is higher than a limit which depends on the LAB implementation) and the total number of pixels. To determine who is who, we compare the ratios recorded before losing the players with the ratios after their separation. One issue with this implementation is that computing the color information requires a lot of resources, and the moment when this information will be needed is unknown. To make the program faster, computing it over only a few frames is enough.
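A minimal sketch of this ratio with OpenCV is given below. Note that for 8-bit images OpenCV shifts the a channel so that 128 is the neutral point; the exact red threshold is an assumption to be tuned to the footage.

```python
import cv2

A_RED_THRESHOLD = 150  # hypothetical: a > 150 (128 is neutral) counts as red

def red_ratio(player_patch_bgr):
    """Fraction of red pixels in a player's bounding box, in LAB space."""
    lab = cv2.cvtColor(player_patch_bgr, cv2.COLOR_BGR2LAB)
    a_channel = lab[:, :, 1]
    return float((a_channel > A_RED_THRESHOLD).sum()) / a_channel.size
```

Storing these ratios just before an occlusion and comparing them with the ratios of the boxes that separate afterwards is what restores the identities.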

4 Results

For all the considered algorithms, if the tracking fails, there is no recovery system. Because all of them fail at some point, a global metric was not a meaningful way to show how well they performed. It makes more sense to determine whether they are able to solve the different issues exposed in the first part of this article. We manually annotated the position of each player in different footage extracts corresponding to specific situations (between 5 and 10 extracts for each case, see Fig. 5). Using this ground truth, we computed an accuracy score [15] based on the number of situations solved. The results can be found in Table 1. The processing time per image (P. T. I.) expresses the average time (in milliseconds) needed to detect and track a player in an image.

Fig. 5. Problematic situations

Table 1. Accuracy of the different algorithms

To sum up the results, two algorithms perform well in our study: the KCF tracker and the proposed method. Both are able to follow the players most of the time, except in the corners where the distortion is too strong. Concerning the algorithms that do not perform well, the bad results of Boosting can be explained by observing that the tracker often gets stuck on the court lines; because this descriptive feature barely moves, it is plausible that the algorithm gets fooled. Concerning TLD, the main issue we observe is that the player is sometimes lost for a few frames and the focus is placed on another player randomly chosen in the image, although it often goes back to the tracked player after a few frames. Smoothing the trajectory to remove these jumps could improve the tracking. In any case, these two algorithms are much slower than KCF.

The results become really interesting when it comes to scalability. The main issue with the KCF tracker is that we need to instantiate a tracking object for each player we want to follow, which slows down the whole process. On the contrary, most of the processing time of our solution is taken by the background subtraction. This means that tracking new players only requires additional resources to link boxes, which is fast. Our solution is therefore more suitable for tracking several players. Finally, the result window of the algorithm can be seen in Fig. 6.

Fig. 6. Detection process

5 Conclusion and Perspectives

In this study we exposed the challenges raised by automatic video editing applied to basketball. We saw that most of these issues fool generic algorithms, but that they can be solved using specific information such as the color, positions or speed of the players. Implementing a dedicated solution makes the algorithm much better in terms of accuracy, and especially in terms of speed. The long calibration time required by the algorithm can be explained by several factors: the quality of the video, the similarity between the outfits and the background, or the distortion. These promising results let us expect that improving the model, implementing the multi-camera system or modifying the subtraction step will make the solution even better. With the final trajectories of all the players, it should be easier to determine specific actions like counter-attacks and free throws, in order to finally perform the automatic video editing.

Even if we see promising results, the implemented algorithm still faces some issues, like the other algorithms. Most of them are related to the similarity between the background color and the outfit of one team. For instance, if two players with the same colors cross each other at low speed, a player can be lost. To solve these problems, we suggest a few improvements that could increase the accuracy.

Fig. 7. General issues

First, the tracking is single-target only, which means that information from other concurrent tracks cannot be used. For instance, Fig. 7a shows the trajectories of two players. If we try to follow player 2 and color cannot help us, the algorithm will struggle to find the relevant player when they cross each other, because \(|\overrightarrow{v_{2}}|\) is too low to give information (the player could go backward). But if player 1 is tracked at the same time, then, because \(|\overrightarrow{v_{1}}|\) is significant, it is possible to follow him and to deduce player 2 by elimination. Combining this information with a system of rules could help solve this issue. Moreover, a criterion on distance could lower the complexity of the multi-tracking task.

The algorithm can also be fooled when the players are in the configuration of Fig. 7b: the speed is high enough for the angle of the speed vector to be taken into account, yet the tracking will probably fail. One answer could be to improve the background subtraction step as shown in [11].

As seen in the dataset, more videos than just the above view are available. A multi-camera model, as implemented in [12] or [13], could help fix most of the issues listed above.