Multiple object tracking for video-based sports analysis
Julius Gudauskas and Žygimantas Matusevičius
Kaunas University of Technology, Studentų g. 50, Kaunas, Lithuania


             Abstract
             Multiple object tracking (MOT) is a challenging task in computer vision. Many algorithms
             have been proposed to track multiple targets for video surveillance, team-sport analysis, or
             human–computer interaction. Recent studies have already indicated that multiple object
             tracking could provide valuable information in team sports analysis. Therefore, in this paper,
             we investigate object tracking techniques for paralympic team sport – goalball. Different
             tracking methods have been implemented and compared, evaluating prediction accuracy and
             performance speed in players and the ball tracking.

             Keywords
             Multiple object tracking, MOT, SOT, CNN, ONNX, Goalball, Boosting, CSR-DCF, KCF,
             MOSSE, TLD.

1    Introduction
    As computer computing capabilities increase, image processing technologies are becoming
increasingly important. Image processing involves many processes, but the main goal is to detect
objects and identify their movement's nature. Object recognition and tracking of its dynamics serve in
many areas of life, such as robotics [1], smart cities [2], [3], medicine [4], [5], or human-computer
interface [6]. Besides those already mentioned herein, we would like to address further how better to
integrate computer vision technologies into paralympic sports applications. Recently, there has been a
growing interest in developing intelligent systems for detecting and tracking player's movements to
analyse sports players' performance during games or training sessions and improve their performance.
The information derived from such analysis is valuable to sports experts and coaches since it allows
them to understand major mistakes better, identify trainees' weaknesses, and modify training and
strategic plans accordingly. Although the number of sports innovations has been increasing and
technological solutions are being implemented even by archaic sports, for example, video assistant
referee (VAR) system in football, sports for the disabled remain a side-line - for instance, goalball one
of the most popular sports among the blind. In order to raise or at least maintain the competition in this
sport, it is necessary to implement novel technological solutions in the process of training and
tournaments. Artificial intelligent based technologies should be integrated in order to acquire
multidimensional information. According to the experts, the essential task is to monitor the ball's and
players' movements. However, tracking ball and players over a large playing area is a challenging
problem for several reasons. First, the players move quickly and have large variations in their
silhouettes. Goalball is a team sport; therefore, multiple player tracking must be performed, but it may
be complicated when players are spatially close together. Second, the dynamic nature of ball
appearance, movement, and continuously changing background make the detection and tracking
processes even more challenging [7]. Besides, the ball size is relatively small compared to other objects
in a frame, and it can be overlapped or covered by other objects. This work aims to integrate image
processing techniques into goalball game video analysis for real-time detection and tracking of multi-
players and the ball. Different tracking methods have been implemented and compared in terms of
detection precision and speed.


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
2    Overview
  Some object tracking strategies have been implemented, but the best solution has not yet been found.
Next, we examine the existing object tracking techniques and object tracking solutions in sports.

2.1 Related works
   SAP develops solutions for video-based sports analytics. For the football world championship 2014
in Brazil, SAP with German Football Association successfully developed Match Insights analytical
solution. It was decided to integrate it with Panasonic video and tracking software [8] to improve the
solution.
   VisualiZation in real-time (Vizrt) provides content creation, control, and delivery tools for the digital
media business. The company's products include software for designing real-time 3D graphics and
maps, envisioning sports analyses, controlling media assets, and obtaining single workflow solutions
for the digital broadcast trade [9], [10].
   PITCHf/x data set is a free source granted by Major League Baseball Advanced Media (MLBAM)
and Sportvision. Brooks Baseball [11] performs methodical innovations to this data to increase its worth
and usability. They manually analyze the Pitch Info by using many parameters of each pitch's trajectory
and approve the parameters against some other sources such as video proof and direct interaction with
on-field personnel (e.g., pitching coaches, catchers, and the pitchers themselves). The trajectory data's
default values are somewhat altered to align them more nearly with the actual values.
   Sportradar [11], a Swiss corporation, concentrates on accumulating and examining data related to
sports results by cooperating with bookmakers, widespread football associations, and global football
associations. Their performing projects include collecting, processing, monitoring, and selling sports
data, appearing in a different collection of sports-related live data and digital content.

2.2 Multiple object tracking based on single object tracking
     Multiple object tracking (MOT) is one of the most challenging tasks in computer vision. A reliable
and universal solution to this problem is not yet known - often, several objects are tracked using a single
object tracking (SOT) method. With this tracking method, each object is tracked separately and
independently of the other objects. The article [12] proposed a powerful real-time tracking method
Boosting, that considers the tracking problem as a binary classification problem between object and
background. Most existing approaches build a representation of the targeted object before the tracking
function begins and therefore utilize an established representation to handle appearance adjustments
during tracking. However, this method does both - adjusting to the variations in appearance during
tracking and selecting suitable features that can learn any object and discriminate it from the
surrounding background. In Discriminative Correlation Filter with Channel and Spatial Reliability
(CSR-DCF), the reliability map adapts the filter support to the object suitable for tracking, overcoming
both the circular shift problems and enabling an arbitrary search range and the rectangular shape
assumption's limitations [13]. The CSR-DCF has the highest performance on standard benchmarks –
OTB100, VOT2015, and VOT2016 while running in real-time on a single CPU. Despite using basic
features like histogram of oriented gradient (HOG) and Colornames, the CSR-DCF performs parallel
with trackers that apply computationally complex deep Convolutional Networks but is noticeably faster.
In [14], originators demonstrated that it is possible to analytically model natural image translations,
showing that the resulting info and kernel matrices become circulant under some conditions. The DFT's
diagonalization presents a general blueprint called Kernelized Correlation Filter (KCF) for creating fast
algorithms that deal with translations. This blueprint has been applied to linear and kernel ridge
regression, obtaining the highest development trackers that work at hundreds of FPS and can be
implemented with a few code lines. The visual tracking problem, which is traditionally solved using
heavyweight classifiers, complex appearance models, and stochastic search methods, can be replaced
by effective and more straightforward Minimum Output Sum of Squared Error (MOSSE) correlation
filters [15]. However, there are several ways how this tracker can be improved. For example, if the

   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
target's appearance is relatively steady, drifting could be eased by occasionally recentring the filter
based on the initial frame. Also, the tracker can be extended to estimate scale and rotation changes by
filtering the tracking window's log-polar transform after an update. In paper [7], authors studied the
problem of tracking an object in a video stream, where the object changes appearance frequently moving
in and out of the camera view. They designed a new Tracking, Learning, and Detection (TDL)
framework. Many challenges have to be addressed to get a more trustworthy and general system based
on TLD. For example, TLD does not perform well in the case of full out-of-plane rotation. In that case,
the Median-Flow tracker drifts away from the target and can be re-initialized if the object comes back
with an appearance seen/learned before. The current implementation of TLD trains only the detector,
and the tracker stays fixed. As a result, the tracker always makes identical errors, and currently, it tracks
a single object. Multi-target tracking opens engrossing questions about how to train the models and
share features to scale jointly.

2.3 Multiple object tracking based on object detection and position
    forecasting
   We can also rely on recognition-based solutions to solve the problem of tracking multiple moving
objects. These algorithms' idea is to detect the tracked objects in each analyzed frame and classify them
into sets of moving objects. This problem is usually framed as a data linking task, but several obstacles
can lead to poor tracking accuracy. To identify tracked objects in a frame can be applied various neural
network-based or non-neural network-based algorithms. Such classical methods as the Viola-Jones
algorithm work in real-time by analyzing the image's pixels [16]. Although the algorithm is quite
primitive, it has pretty high accuracy and real-time speed. This algorithm can be taught to detect classes
of different objects (applied to different subtasks such as pedestrian and car), but due to the algorithm's
favourable properties, this algorithm is usually applied in face recognition. [18] The object detection
process can be established using HOG [14], scale-invariant feature transformation [16], Haar cascade
classifiers [16], etc. These algorithms are used to determine low-level feature information. More
complex tasks usually require obtaining higher-level information, and that is possible using deep
learning techniques. A convolutional neural network (CNN) a class of deep neural networks, most
commonly applied for image recognition tasks [16]. You Only Look Once (YOLO) is a deep learning
algorithm for object detection, which is most fast and accurate than most other algorithms [16]. By
dividing the input image into areas and predicting the boundary box's coordinates and the class's
probability for each region, it converts object detection problems into regression issues to achieve end-
to-end detection. YOLO can work well for multiple objects where each object is associated with one
grid cell. However, in the case of overlap, in which one grid cell contains two different objects' centre
points, we can use anchor boxes to allow one grid cell to detect multiple objects. The common challenge
complicates the multiple objects tracking and detrimental to the result – frequent occlusions, similar
appearance, interactions between multiple tracked objects, the unstable appearance of the object in the
video, etc.


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
3    Proposal
   In the following part of the article, we will provide proposals for multiple object tracking. The
presented algorithms are designed to solve the players tracking problem in targeted video.

3.1 MOT using SOT
    Firstly, multiple object tracking (MOT) is developed by employing independent single object
trackers (SOT). MOT model has a list of tracked objects, and each of objects has its tracker, id,
and rectangle object, which stores the metrics of the tracked object: x and y coordinates, height and
width of the region of interest (ROI) (see Figure 1).


                                    Figure 1: SOT based MOT model

   The process of object tracking using Unified Modeling Language (UML) notation is provided in
Figure 2. First, a video file is selected, and MOT initialization is performed. After the initialization of
the model, the objects to be tracked are marked. This process is performed manually. Finally, it is
possible to start processing video frames, where each frame is used to update the MOT model.


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
  Figure 2: SOT based MOT        Figure 3: Update MOT process          Figure 4: Update tracking object
           process                                                                 process

    During the MOT model update (see Figure 3), the frame is used to update each tracked object.
Because objects are tracked entirely independently, this process can be parallelized. After updating
tracked objects, MOT removes from the list those objects that have not been successfully updated for a
certain period of time - it is assumed that the object has been lost. During the tracked object update
process (see Figure 4), the object tracker is updated. If this operation is performed successfully, then
the rectangle object is updated, and the failure counter is restored; otherwise, the fail counter increased.

3.2 MOT using CNN object detection
   An alternative way to track multiple moving objects is to use constant object detection and detected
object classification. To solve this problem, a method of object recognition that provides the highest
possible accuracy, as well as a method of classifying objects according to the previous coordinates of
the presence of each moving object, is required. Convolutional Neural Network (CNN) allows forming
a multi-layered model, which can provide an advantage in analyzing more than one feature without
compromising speed. For CNN model training, 117 different shots of a goalball match have been used.
All the training data has been marked with the bounding box required for prediction (see Figure 5).


                                  Figure 5: Preparing a frame for training


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
    To evaluate the performance of the CNN model, three accuracy characteristics have been used:
precision, recall, and the mean average precision (mAP):
    The trained model can be exported and applied locally. Depending on the technology used, the model
format can also be selected in different ways. The Open Neural Network Exchange (ONNX) format
model is used for this study. ONNX provides definitions of an extensible computation graph model,
built-in operators, and standard data types focused on inferencing (evaluation). The model was
constructed using eight layers with the input image in BGR format. Trained CNN model provides
composed of bounding boxes, class labels, and confidence levels (see Figure 6). Each player is detected
multiple times with a different probability. To remove unwanted redundancies, a filter is used that
leaves only the bounding box satisfying the marginal probability. Recognition of players is not enough
to track them. There is a classification problem in how to assign a bounding box to a particular player.


                                   Figure 6: CNN predictions results

    At the beginning of the analysis, the players being followed are marked. Having the start coordinate
of each player, we can go through all the CNN results and assign each player the best bounding box.
This step is repeated for each iteration of the refinement. Once the player was not detected using CNN
(it usually appears when the player intersects), we keep the old coordinate and move to the next frame.

4   Experiment
    The experiment was done using five different single object tracking methods: Boosting, CSR-DCF,
KCF, MOSSE, TLD, and one multiple object tracking using CNN object detection. For testing, it was
used three different goalball videos up to 1 minute long each. The experiment goal is to track six
different players on the playfield marked from 1 to 6 (see Figure 7).


                                   Figure 7: Tracked objects labels


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
   The two most essential parameters in the evaluation of algorithms are:
      • How accurately the player is tracked;
      • How quickly the video is analyzed.
    To evaluate player tracking accuracy, the number of frames in which a player was detected and
tracked was processed. Because videos of specific durations and frames per second ratio were used for
testing, we can evaluate each player's accuracy being tracked. However, not all algorithms can
determine whether a player's position has been adjusted or not, so this metric is only valid for specific
algorithms.
    We have analyzed results of the KCF algorithm at the tracking in more details (see Figure 8). The
algorithm provides the best tracking results for the third and the sixth players (in some cases, the first).
The best results were obtained using the “Male BRA vs. SWE” video stream: the total accuracy of all
tracked players is 89%. From the results, it cannot be said that the algorithm gives a stable and similar
result under all conditions because it depends on many factors: the noise in the video material,
bystanders, the exchange of players, the angle of the frames. It can be justified comparing the “Male
BRA vs. SWE” and “Male LTU vs. USA” videos where a clear difference is visible. Using the same
algorithm, total accuracy dropped from 89% to 81%. Supervising the algorithm revealed that players
overlap more often in the less accurate video than in the greater accuracy provided one. Also, the tracked
player is more often abandoned when making very sudden movements.


                      Figure 8: KFC algorithm results with different video streams

   Another critical factor in the evaluation of algorithms is speed. Each algorithm is based on a different
computational strategy, in which the speed may depend on different factors. After performing an
experiment with each tracking algorithm and analyzing three videos for testing (see Figure 9), it was
noticed that the MOSSE algorithm copes with the task even several times faster than the other
algorithms. The slowest algorithm that the experiment was done is TLD.


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
                                  Figure 9: Algorithms performance

   Another experiment - to use CNN for object detection and tracking. The model was trained using
Microsoft Azure Custom Vision Service. The training process was performed using 117 different shots
from the goalball game videos. In each frame, the players on the playing field and the ball were marked.
The obtained results after training the convolutional neural network are provided in the Table 1.

Table 1
Model training results
         Parameter             Value                               Explanation
  Over all precision of tags   96.9%        It measures how many of the predictions that the model
                                                           made were actually correct
    Over all recall of tags    82.4%         It measures how well the model can find all the positive
                                                                 predicted boxes
   Over all Mean Average       66.8%     It is calculated by taking the mean average precision over all
      Precision (mAP)                      classes and overall IoU thresholds, depending on different
                                                         detection challenges that exists

   From the results, we can conclude that the model quite accurately predicted the players in the frames.
Of these guesses, an average of 82% is accurate bounding boxes belonging to the hypothesized object.
When it comes to recognizing the ball in the frame, the model performs worse. While the average is
89% of the shots, the ball is guessed; only an average of 39% is guessed in the proper right place. This
may be because the ball is relatively small in the frame, and it is partially blocked by the players and
sometimes merges with the background.
   Additional experiments have been carried out to evaluate how the model performs in real test cases
using video stream. The main task of the tracking algorithm is to solve the predicted bounding box
classification. The prediction was made using CNN, and bounding box classification was performed
using an algorithm that depends on previously detected player coordinates. The same data set, including
three video streams, has been used in the experiment. Tracker provides quite a stable accuracy for each
video stream (see Table 2).

Table 2
CNN based object tracking average results with different video streams
                        Video                                Average
                                                            accuracy
  Paralympic Games 2016 Goalball Male LTU vs USA             88.12%
  Paralympic Games 2016 Goalball Male BRA vs SWE             89.96%
 Paralympic Games 2016 Goalball Female USA vs BRA            90.26%


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
   Since the result determines in which part of the frame each player was detected and classified (see
Figure 10), it is difficult to evaluate whether the classifier classifies the players with their belonging
bounding boxes. The accuracy of the analysis can only be confirmed by the person supervising the
analysis.


              Figure 10: CNN based player tracking results with different video streams

5    Conclusions
    In this paper, the research of object tracking techniques used for real-time goalball video analysis
has been performed. Our proposal's novelty is that we have adapted single object tracking algorithms
to solve multiple object tracking task. We also applied multiple object tracking models to the analysis
of sports video material. We carried out different experiments to evaluate two multiple objects tracking
task approaches: by employing a single object tracker (including Boosting, CSR-DCF, KCF, MOSSE,
TLD); and CNN for multiple object tracking. For the first approach, we evaluate the method's
performance in terms of the number of frames and speed. Experiments have shown that only the KCF
algorithm can determine the adjustments of a player's position. MOSSE algorithm outperforms other
algorithms in terms of speed and is three times faster than KCF and 9,8 times faster than TLD. CNN
results are promising for players' position prediction, and accuracy varies from 88.12% to 90,26%; the
accuracy was measured by calculating the total number of frames where each player was predicted and
classified. However, CNN has shown poor performance for ball predictions providing 39% average
accuracy of ball position. An interesting direction for further research would be to combine neural
networks-based object detection and single object tracking in order to get better tracking results.

6    Acknowledgments
   We want to express our very great appreciation to Dr. Agnė Paulauskaitė-Tarasevičienė for her
insights and advice.


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
7    References

    [1]      E. Martinez-Martin ir A. P. d. Pobil, „Object Detection and Recognition for
         AssistiveRobots,“ Robotics & automation magazine, t. 24, pp. 123 - 138, 2017.
    [2]      M. S. Adam, M. H. Anisi ir IhsanAli, „Object tracking sensor networks in smart cities:
         Taxonomy, architecture, applications, research challenges and future directions,“ Future
         Generation Computer Systems, t. 107, pp. 909 - 923, 2020, June.
    [3]      F. Joy ir V. V. Kumar, „A review on multiple object detection and tracking in smart city
         video analytics,“ Research gate, 2018, January.
    [4]      M. Li, „Detecting, segmenting and tracking bio-medical objects,“ Scolars Mine Doctoral
         Dissertations, 2016.
    [5]      Y. Wang, B. Georgescu, T. Chen, W. Wu, P. Wang, X. Lu, R. Ionasec, Y. Zheng ir D.
         Comaniciu, „Learning-Based Detection and Tracking in Medical Imaging: A Probabilistic
         Approach,“ M. González Hidalgo et al. (eds.), Deformation Models, Lecture Notes in
         Computational Vision and Biomechanics 7, pp. 209 - 235, 2013.
    [6]      R. Azad, B. Azad, N. B. Khalifa ir S. Jamali, „Real-time human-computer interaction based
         on face and hand gesture recognition,“ International Journal in Foundations of Computer
         Science & Technology (IJFCST), t. 4, nr. 4, pp. 37 - 48, 2014, July.
    [7]      Z. Kalal, K. Mikolajczyk ir J. Matas, „Tracking-learning-detection. Pattern Analysis and
         Machine Intelligence,“ IEEE Transactions, pp. 1409 - 1422, 2012.
    [8]      „SAP and Panasonic Launch Joint Initiative for Video-Based Sports Analytics Solutions,“
         SAP News, 2014, September 12.
    [9]      M. Danelljan, G. Hager, F. S. Khan ir M. Felsberg, „Accurate scale estimation for robust
         visual tracking,“ roc. British Machine Vision Conference, %1 t. iš %21, 2, 4, 8, pp. 1 - 11,
         2014.
    [10]     Danelljan, G. Hager, F. S. Khan ir M. Felsberg, „Learning spatially regularized correlation
         filters for visual tracking. Pages 4310 - 4318,“ įtraukta IEEE International Conference on
         Computer Vision, Santiago, Chile, 2015, December 7-13.
    [11]     M. Danelljan, F. S. Khan, M. Felsberg ir J. v. d. Weijer, „Adaptive color attributes for real-
         time visual tracking. Pages 1090–1097,“ įtraukta IEEE Conference on Computer Vision and
         Pattern Recognition, Columbus, OH, USA, 2014, June 23 - 28.
    [12]     Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov ir D. Tao, „Multi-store tracker (muster):
         A cognitive psychology inspired approach to object tracking,“ Comp. Vis. Patt. Recognition,
         pp. 749 - 758, June 2015.
    [13]     H. Grabner, M. Grabner ir H. Bischof, „Real-time tracking via on-line boosting,“ BMVC,
         t. 1, p. 6, 2006.
    [14]     A. Lukezic, T. Voj'ir, L. C. Zajc, J. Matas ir M. Kristan, „Discriminative correlation filter
         tracker with channel and spatial reliability,“ International Journal of Computer Vision, 2018.
    [15]     J. F. Henriques, R. Caseiro, P. Martins ir J. Batista, „Exploiting the circulant structure of
         tracking-by-detection with kernels,“ In proceedings of the European Conference on Computer
         Vision, 2012.
    [16]     D. S. Bolme, J. R. Beveridge, B. A. Draper ir M. L. Yui, „Visual object tracking using
         adaptive correlation filters,“ įtraukta Computer Vision and Pattern Recognition (CVPR), 2010.
    [17]     S. K. S. Anjali B Guptha, „Multiple Face Detection and Tracking using Viola-Jones
         Algorithm,“ International Research Journal of Engineering and Technology (IRJET), t. 07,
         nr. 04, 2020.
    [18]     R. Padilla, S. L. Netto ir E. A. B. d. Silva, „A Survey on Performance Metrics for Object-
         Detection Algorithms,“ 2020.


   Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).