Customer Traffic Distribution Analysis Based on Video Information

Tatyana Martynenko a, Tatyana Vasyaeva a, Aida Velieva a and Yuriy Skobtsov b
a Donetsk National Technical University, Donetsk, Ukraine
b Saint Petersburg State University of Aerospace Instrumentation, St. Petersburg, Russia

Abstract
The task of constructing a map of customer movement through a store by means of video analytics is considered. The problem is reduced to detecting objects in a video stream and subsequently tracking them. Pre-trained convolutional neural networks (CNNs) are proposed for object detection, and the joint use of the pre-trained MobileNet and SSD networks (MobileNet-SSD) is justified experimentally. For object tracking, experiments were performed with the algorithms built into the OpenCV library: GOTURN, CSRT, KCF, BOOST, TLD, MOSSE, MedianFlow, and MIL. Based on the multiple object tracking accuracy (MOTA), the MedianFlow tracker was selected. Experiments on a set of video sequences containing various adverse conditions confirmed the effectiveness of the selected solutions.

Keywords
Computer vision, video analytics, deep learning, convolutional neural networks, detection, customer flow, tracking, conversion rate

1. Introduction

Today the retail sector faces fierce competition and rising consumer demands for service quality. An impressive array of IT solutions has been offered to maximize the effectiveness of marketing and sales; one of them is video analytics [1]. In the retail sector video analytics provides significant competitive advantages: it allows one to evaluate such important parameters as the number of visitors, the conversion rate, and the attention paid to particular products. It can be used to gather useful information about customers and later apply it to stimulate customer activity, as well as to optimize the trading process through timely and effective personnel management. Video analytics based on computer vision methods enables continuous automated data collection, analyzing the sequence of images coming from video cameras in real time or from archival records without additional staff. According to published studies [2], video analysis and computer vision technologies reduce the number of people leaving the store without buying anything by 10% and the loss of store profits by 20%, while sales of individual products can be increased by 15-25% when their location is changed according to the detected "hot zones".

Thus, one of the most promising approaches to analyzing customer behavior in a retail store is technology based on video analysis. It makes it possible to gather customer traffic statistics quickly and effectively, to create a portrait of the target audience, and to study customer activity.

Russian Advances in Artificial Intelligence: selected contributions to the Russian Conference on Artificial Intelligence (RCAI 2020), October 10-16, 2020, Moscow, Russia
EMAIL: tatyana.v.martynenko@gmail.com (T. Martynenko); vasyaeva@gmail.com (T. Vasyaeva); velievaaida9@gmail.com (A. Velieva); ya_skobtsov@list.ru (Y. Skobtsov)
ORCID: 0000-0002-1483-8483 (T. Martynenko); 0000-0001-9362-2279 (T. Vasyaeva); 0000-0002-5362-6256 (A. Velieva); 0000-0002-7677-2010 (Y. Skobtsov)
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Problem Statement

The basic aim of any retail chain is to maximize profit.
This is achieved by increasing sales (by attracting more customers) and by reducing costs (including reducing staff numbers without degrading service). One of the key concepts of video analytics in retail is customer traffic. According to [3], customer traffic (customer flow) is the route that most customers in the store follow. The owner of a point of sale needs an up-to-date picture of the store's attendance and of the movement of visitors inside the sales area, since this information is used to build a strategy for attracting and retaining customers, which in turn rests on:
• optimization of staff scheduling (adjusting the number of staff to the intensity of customer traffic in different periods of time);
• sales conversion [4] for selected departments or for the store as a whole; conversion is the ratio of the number of transactions (purchases) to the number of visitors to the point of sale;
• increasing sales of unpopular products by placing them in the so-called "hot zones", i.e., the most visited places in the point of sale;
• placing advertisements and promotions in the departments that attract the most interest from visitors;
• changing the product layout based on the map of customer movement through the shop.

The main sources of data for analysis are video cameras located above the entrance and exit, over the supermarket departments, and at the cash registers. A generalized plan of the sales area is shown in Fig. 1. The use of video analytics involves the automation of four main functions [1]: detection, tracking, recognition, and forecasting. Tracking-by-detection is used to analyze the distribution of customer traffic [5]. This approach makes it possible to use high-precision object detection methods without placing a large computational load on the system, because already detected objects are tracked instead of being detected repeatedly.

Figure 1: Plan of the sales area with video cameras.

The object of this research is the process of detecting and tracking customers based on video information. The aim of this work is to construct a map of customer movement around the store from video information using modern detection and tracking methods, thereby enabling effective managerial decisions in the retail industry.

3. Research Objective

3.1. Detection of Video Sequence Objects

Detection in a video stream should be understood as finding objects of pre-defined classes (people, vehicles, furniture, animals, and so on) and determining the label and the coordinates of each object's location [6]. An object's location can be represented in different ways, for example as the set of pixels that belong to the object [7] or as the coordinates of a rectangle that bounds the object [8]. In this research the detection algorithm outputs a set of bounding rectangles (bounding boxes).

The input of the developed system is a video stream S represented as a sequence of frames I_1, I_2, ..., I_k, ..., I_N:

    I_k = \{ I_k(x, y),\; 0 \le x < width,\; 0 \le y < height \}, \quad k = \overline{1, N},   (1)

where width is the width of a frame, height is its height, I_k(x, y) is the color feature vector of a pixel, N is the number of frames, and k is the frame number. Let C = {c_1, c_2, ..., c_i, ..., c_M} be the set of object classes. The task is to detect the objects X = {x_1, x_2, ..., x_c} and then select those belonging to a given class (human figures):

    P_{c_i} = \{ P_{c_i,1}, P_{c_i,2}, \dots, P_{c_i,n_c} \} \subseteq X,   (2)

where P_{c_i} is the set of detected objects belonging to class c_i and n_c is the number of detections. The bounding boxes characterizing the locations of the objects in a frame can be represented as

    \{ Out_k \mid I_k \in S \}, \quad Out_k = \{ P_i \}, \quad i = \overline{1, n_k},   (3)

where Out_k is the set of bounding boxes and n_k is the number of selected objects in frame k. Thus, for each frame the detection algorithm outputs a set of bounding boxes corresponding to the objects of interest (human figures):

    I_k = \{ Out_1(x, y, w, h, c), Out_2(x, y, w, h, c), \dots, Out_n(x, y, w, h, c) \},   (4)

where (x, y) are the coordinates of each object in the image, w and h are the width and height of the object, and c is the class associated with each bounding box.
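As an illustration, a detection in the form of (4) can be stored as a simple record; this is a minimal sketch, and the field and function names are our own choices rather than part of the formal model:

```python
from typing import List, NamedTuple

class Detection(NamedTuple):
    """One bounding box Out_i(x, y, w, h, c) from formula (4)."""
    x: int        # top-left corner, pixels
    y: int
    w: int        # width of the box
    h: int        # height of the box
    c: int        # class index c_i from the set C
    score: float  # detector confidence (not part of (4); kept for filtering)

def persons_only(frame_detections: List[Detection], person_class: int) -> List[Detection]:
    """Select P_{c_i} from X, formula (2): keep only human figures."""
    return [d for d in frame_detections if d.c == person_class]
```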
3.2. Tracking of Video Sequence Objects

Tracking of moving objects is the construction of the trajectories of target objects over time by localizing their positions in the input sequence of frames. An object's movement trajectory is the sequence of its positions:

    T = \{ Out_s(x, y, w, h, c), Out_{s+1}(x, y, w, h, c), \dots, Out_{s+l-1}(x, y, w, h, c) \},   (5)

where s is the number of the first frame in which the object was detected, l is the number of consecutive frames in which the object is observed, x and y are the location coordinates, w is the width, h is the height, and c is the class number of the object in the video image.

To evaluate tracker accuracy, the MOTA (multiple object tracking accuracy) criterion is typically used [9]. Mathematically, the MOTA criterion is described by the formula

    MOTA = 1 - \frac{\sum_t (m_t + fp_t + mme_t)}{\sum_t g_t}, \quad MOTA \to \max,   (6)

where m_t is the number of misses (objects present but not detected) at time t, fp_t is the number of false positives, mme_t is the number of mismatches (identity switches), and g_t is the number of people present at time t.
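Given per-frame error counts against ground truth, formula (6) is straightforward to compute. The sketch below is a minimal illustration; the input format (one tuple of counts per frame) is our own assumption:

```python
from typing import Iterable, Tuple

def mota(per_frame: Iterable[Tuple[int, int, int, int]]) -> float:
    """Formula (6): MOTA = 1 - sum_t(m_t + fp_t + mme_t) / sum_t(g_t)."""
    frames = list(per_frame)  # (m_t, fp_t, mme_t, g_t) for every frame t
    errors = sum(m + fp + mme for m, fp, mme, _ in frames)
    present = sum(g for _, _, _, g in frames)  # total ground-truth objects
    return 1.0 - errors / present

# Example: three frames with two people each; one miss and one identity
# switch overall give MOTA = 1 - 2/6 ≈ 0.67.
print(mota([(1, 0, 0, 2), (0, 0, 1, 2), (0, 0, 0, 2)]))
```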
4. Analysis of Convolutional Neural Networks in Object Detection Tasks

Detection is a fundamental and one of the most difficult tasks of computer vision. Deep learning methods [6], in particular artificial neural networks [10], have become a powerful tool for solving it. Algorithms based on convolutional neural networks (CNNs) show the best quality in object detection tasks. The CNN is a special neural network architecture proposed by Yann LeCun and is the principal one used in computer vision [11]. A distinctive feature of CNNs is that they detect objects in video images with an accuracy exceeding that of other detection methods. The classical CNN [11] has a hierarchical architecture (Fig. 2) and usually includes convolution layers, pooling layers, and fully connected (dense) layers.

Figure 2: The classical CNN architecture.

Algorithms based on CNNs differ in the configuration and selection of the architecture parameters (the number and type of layers, the values of the weights, the number of neurons in each layer), the training parameters, the data the network is trained on, and the methods used to process the input features. Nowadays there are a large number of pre-trained convolutional networks, i.e., networks already trained on a large data set within a large-scale object detection task.

Detectors based on convolutional neural networks can be divided into several groups: two-stage, one-stage, and base architectures. The idea of two-stage algorithms (Region Proposal Networks, RPN) is first to search for a set of regions, each of which is assumed to contain an object; at the second stage each proposal is processed and the object in it is classified. Two-stage detectors form the R-CNN architecture series: R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, Mask R-CNN. Detectors of the R-CNN family are very accurate, but their great problem is low speed, which on average is only about 5 FPS on a GPU [12, 13].

One-stage algorithms predict the bounding boxes and the class membership probabilities for all objects at once over the whole image. With this approach a single pass of one convolutional network detects many objects together with their probabilities of belonging to the specified classes, which significantly increases the speed of operation compared to RPN-based detectors. The main representatives of this group are YOLO and SSD. In general, one-stage detectors are less accurate than two-stage detectors but are superior in speed [8, 14, 15].

Base architectures are classification convolutional networks; among the most popular are MobileNet, VGG, GoogLeNet, and ResNet [16].

To select a detector for the task of analyzing the distribution of customer flows, a comparative study was performed. The characteristics of the most popular pre-trained neural network models are summarized in Table 1. All models reviewed in Table 1 were trained on datasets that include the class "Person".

Table 1
Comparative study of the pre-trained neural network models

Model           Authors               mAP    FPS   Dataset
Faster R-CNN    Ren et al. [13]       73.2   5     Pascal VOC
YOLOv3          Redmon et al. [8]     57.9   20    COCO
YOLOv3-Tiny     Gong et al. [14]      33.1   220   COCO
SSD300          Liu et al. [15]       74.3   46    Pascal VOC
SSD512          Liu et al. [15]       76.8   19    Pascal VOC
MobileNetV3     Howard et al. [16]    22     31    COCO
MobileNet-SSD   Howard et al. [17]    35     56    COCO

The analysis of Table 1 shows that the models of the R-CNN family have high accuracy (mAP) but a very low detection rate on a video stream (FPS). YOLO, in its turn, provides a high FPS value but has a lower mAP than the other models. The MobileNet-SSD model offers the best trade-off between FPS and mAP; a minimal detection sketch with this model is given below.
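The following sketch shows how a pre-trained Caffe MobileNet-SSD model can be run through the OpenCV DNN module to obtain the person bounding boxes of formula (4). It is a minimal illustration under stated assumptions: the file names, the 300×300 input size, the 0.007843 scale and 127.5 mean, and the person class index 15 are the settings published with the widely distributed open MobileNet-SSD Caffe model, and the confidence threshold is our own choice.

```python
import cv2
import numpy as np

# Pascal VOC label set of the open MobileNet-SSD Caffe model; index 15 is "person".
PERSON_CLASS_ID = 15
CONF_THRESHOLD = 0.4  # our own choice; tune on the target video

net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt",
                               "MobileNetSSD_deploy.caffemodel")

def detect_persons(frame):
    """Return a list of (x, y, w, h) person boxes for one frame, as in (4)."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)),
                                 0.007843, (300, 300), 127.5)
    net.setInput(blob)
    detections = net.forward()  # shape: 1 x 1 x N x 7
    boxes = []
    for i in range(detections.shape[2]):
        cls = int(detections[0, 0, i, 1])
        conf = detections[0, 0, i, 2]
        if cls == PERSON_CLASS_ID and conf > CONF_THRESHOLD:
            x1, y1, x2, y2 = (detections[0, 0, i, 3:7] *
                              np.array([w, h, w, h])).astype(int)
            boxes.append((x1, y1, x2 - x1, y2 - y1))
    return boxes
```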
5. Analysis of Object Tracking Methods in Video Surveillance Systems

Video surveillance systems use various methods and algorithms to track an object's trajectory. According to [18] they are classified as follows:
• dense optical flow estimates the motion vectors of all points in the video image; examples of this class of algorithms are Farneback, Horn-Schunck, and SimpleFlow;
• sparse optical flow tracks the locations of only a few characteristic points (representing corners or edges of the object); an example is the Lucas-Kanade algorithm;
• Mean-Shift and CamShift find the locations of the maxima of a density function;
• Kalman filtering predicts the probable position of a previously found object in a new frame from the history of its previous positions; the currently best online tracker, DeepSORT, is built on this basis;
• single object trackers (SOT) assume that a rectangle selects the object in the first frame, which is then tracked in subsequent frames;
• multiple object trackers (MOT) track many objects in each frame;
• tracking algorithms built into the OpenCV library (Boosting, MIL, KCF, CSRT, MedianFlow, MOSSE, GOTURN, TLD).

The analysis of object tracking methods has shown [19] that the problem of providing continuous tracking is particularly acute at the moment. Most existing tracking systems either do not support this functionality or try to work around it by selecting a camera angle where the probability of occlusion is minimal. One of the main requirements of our task is the ability to track many objects in a video sequence; when choosing a method, one must also aim for maximum performance and reliability. In this paper the trackers of the OpenCV library were subjected to experimental study.

6. Customer Traffic Distribution Analysis Based on Video Information

The scheme for obtaining information on the movement of customers based on video analysis is shown in Fig. 3. At the first stage the detection algorithm (detector) runs. Detection is performed on the key frames of the video stream F_1, F_{1+step}, F_{1+2*step}, ... in order to check whether new objects have appeared in the frame and whether objects lost during the tracking stage have reappeared, i.e., to correct the tracker's operation. For each detected object the system creates or updates a tracker with the new bounding box coordinates.

Figure 3: Generalized workflow for generating information on the movement of customers.

Since the detector achieves high-quality detections but detection algorithms are laborious and resource-intensive, the detection phase is launched only once every N frames (the key-frame step). Each detected object is assigned a unique identifier. Identifiers are generated for all detections in the first frame, so their number equals the number of detections, and they are then propagated through subsequent frames: a detection in a later frame either takes over the identifier from the previous frame (for an object already known to the system) or receives a newly generated identifier (for a new object). When a new person appears in the frame, an object is created and an identifier is generated for it. For each detected object the system creates an object tracker that follows the object as it moves through the frame; the tracker works faster and more efficiently than the object detector. The system continues to track the objects of interest until frame N is reached, after which the detector is re-initialized.

Associating detections with trajectories can be considered as the construction of a matrix of scores (energies) for linking the current set of trajectories with the new detections. The linking process involves two lists of bounding boxes: the tracking list (frame t-1) and the detection list (frame t). For each pair of a track and a detection the IoU (intersection over union) is computed — the ratio of the area of intersection of the two rectangles to the area of their union — and the results are recorded in the score matrix. According to this matrix, a correspondence is established between detections and tracks. The third step is to join the detections into trajectories; in other words, a so-called optimal assignment is sought in which each detection either joins a current trajectory or gives rise to a new one (see the sketch below).
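A minimal sketch of this association step: IoU over boxes in the (x, y, w, h) form of formula (4) and a simple greedy assignment over the score matrix. The greedy strategy and the IoU threshold are our own simplifying assumptions; an optimal assignment could instead be obtained with, for example, the Hungarian algorithm.

```python
def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, threshold=0.3):
    """Greedy matching on the IoU score matrix.

    Returns (matches, new_detections): pairs (track index, detection index)
    and the indices of detections that start new trajectories.
    """
    scores = [(iou(t, d), ti, di)
              for ti, t in enumerate(tracks)
              for di, d in enumerate(detections)]
    matches, used_t, used_d = [], set(), set()
    for s, ti, di in sorted(scores, reverse=True):  # best pairs first
        if s < threshold:
            break
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    new = [di for di in range(len(detections)) if di not in used_d]
    return matches, new
```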
If neither the tracker nor the detector returns a bounding box, the object is considered lost. As a result of identifying the objects (customers) and analyzing their movement trajectories, the time spent by each visitor in each sales area (customer area) is determined. To facilitate the analysis of the total time spent, it was proposed to normalize the obtained time values:

    V'_i = \frac{V_i - \min(V_i)}{\max(V_i) - \min(V_i)},   (7)

where V_i is the residence time of the objects of interest in sales area i. The result of this work is a heat map reflecting the areas of interest of customers, of the form shown in Fig. 4. The cell at the intersection of a date and a department contains the value V'_i; higher values of the coefficient V'_i correspond to higher customer attendance.

Figure 4: Analysis of the distribution of customer flow by department.
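A minimal sketch of how the cell values of the heat map in Fig. 4 can be obtained from raw residence times, following formula (7); the per-department dictionary format is our own assumption:

```python
def heatmap_row(residence_times):
    """Min-max normalize residence times per department, formula (7).

    residence_times: dict department -> total residence time V_i (seconds).
    Returns dict department -> V'_i in [0, 1].
    """
    lo, hi = min(residence_times.values()), max(residence_times.values())
    span = (hi - lo) or 1  # guard against all values being equal
    return {dept: (v - lo) / span for dept, v in residence_times.items()}

# Example: one day of observations for four departments.
print(heatmap_row({"dairy": 420, "bakery": 180, "produce": 600, "deli": 180}))
# {'dairy': 0.571..., 'bakery': 0.0, 'produce': 1.0, 'deli': 0.0}
```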
7. Experiments

A software implementation of the subsystem was developed in the Python programming language using the OpenCV computer vision library and the Caffe deep learning framework. OpenCV [20] is an open-source library for computer vision and machine learning; it contains about 2500 algorithms, and its main aim is to speed up video image processing. Caffe [21] is a framework that supports convolutional neural networks; its advantages are simplicity of use and the availability of pre-trained models. The experimental study of the developed video analysis subsystem was carried out on the following computer configuration: Windows 10 Pro, Intel(R) Celeron CPU N3050 @ 1.60 GHz, 2 GB RAM.

The first step is choosing a detector. The models in Table 1 are designed to detect objects in still images, but they are also applicable to a video stream, since object detection in a video stream is reduced to processing a sequence of key frames. The results of the experiments on standard datasets are given in Table 2.

Table 2
The results of the experiments on standard datasets

Model           mAP (all classes)   mAP (Person)   FPS
Faster R-CNN    69                  78             10
YOLOv3          76                  78             15
YOLOv3-Tiny     70                  67             56
SSD300          80                  79             35
SSD512          74                  87             45
MobileNetV3     80                  70             55
MobileNet-SSD   89                  95             56

The analysis of Table 2 shows that the MobileNet-SSD model has the best mAP value for the class "Person" together with a high speed (FPS). The speed of the detector is important for the tracker: the lower the detector speed, the higher the risk of losing an object of interest during tracking. Thus, the MobileNet-SSD model was selected for the video analysis subsystem.

The tracker was chosen experimentally using four test sets. The test datasets are complete video sequences obtained from shopping center surveillance cameras. The datasets contain adverse conditions: varying lighting, objects overlapping each other, complete disappearance of an object for some time, and changes in object size. The presence of such conditions makes it possible to evaluate the robustness of an algorithm in various abnormal situations. The characteristics of the test datasets are shown in Table 3.

Table 3
The characteristics of the test datasets

Characteristics          Test dataset 1   Test dataset 2   Test dataset 3   Test dataset 4
Quantity of frames       2782             654              900              525
Camera movement          No               Yes              Yes              No
Lighting                 Good             Bad              Good             Bad
Overlapping              Yes              No               Yes              Yes
Complete disappearance   Yes              Yes              No               No
Changing object size     Yes              Yes              Yes              No

Video analysis of a shopping center involves processing a large amount of video data, so the speed of the tracker is also important. Table 4 compares the speed of the eight trackers on the four video sequences in terms of frames per second. According to the experiments, the fastest tracker is MOSSE, with an average speed of 388.38 FPS. Another fast tracker is KCF, with an average of 34 FPS. The TLD and MIL trackers showed the worst results.

Table 4
Comparative study of tracker speed (FPS)

Tracker      Test dataset 1   Test dataset 2   Test dataset 3   Test dataset 4
GOTURN       15               16               12               13
CSRT         13               23               7                34
KCF          33               41               29               33
BOOST        7                12               7                13
TLD          0.5              1.8              0.4              0.2
MOSSE        220.5            823              320              190
MedianFlow   12               35               11               45
MIL          17               12               22               19

The values of the MOTA coefficient for the eight trackers on the four video sequences are presented in Table 5.

Table 5
The values of the MOTA coefficient

Tracker      Test dataset 1   Test dataset 2   Test dataset 3   Test dataset 4
GOTURN       20%              55%              35%              45%
CSRT         75%              65%              74%              65%
KCF          45%              25%              69%              20%
BOOST        65%              55%              75%              55%
TLD          20%              20%              35%              35%
MOSSE        56%              35%              40%              50%
MedianFlow   40%              80%              80%              20%
MIL          65%              75%              85%              50%

From this study it can be concluded that the BOOST tracker is very slow (averaging 9.75 FPS) and often loses the object being tracked. The MIL and KCF trackers showed good speed (17.5 and 34 FPS respectively). The TLD tracker generates many false positives, which makes it unusable. The MOSSE tracker provides the highest speed of all the considered trackers (388.38 FPS), and the CSRT tracker provides fairly high tracking accuracy while falling short in speed. The MedianFlow tracker performed well with regard to both speed and accuracy; a sketch of how the compared trackers can be instantiated is given below.
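For reference, a minimal sketch of how the eight compared trackers can be created and driven with OpenCV in Python. The factory names follow the OpenCV 3.x/early 4.x contrib API (from OpenCV 4.5 on, most of them live in the cv2.legacy namespace); GOTURN additionally requires its Caffe model files to be present.

```python
import cv2

# OpenCV 3.x / early 4.x names; from 4.5 on, most live in cv2.legacy
TRACKER_FACTORIES = {
    "GOTURN": cv2.TrackerGOTURN_create,  # needs goturn.prototxt/.caffemodel
    "CSRT": cv2.TrackerCSRT_create,
    "KCF": cv2.TrackerKCF_create,
    "BOOST": cv2.TrackerBoosting_create,
    "TLD": cv2.TrackerTLD_create,
    "MOSSE": cv2.TrackerMOSSE_create,
    "MedianFlow": cv2.TrackerMedianFlow_create,
    "MIL": cv2.TrackerMIL_create,
}

def track(video_path, first_box, name="MedianFlow"):
    """Initialize the named tracker on the first frame and follow one object."""
    cap = cv2.VideoCapture(video_path)
    ok, frame = cap.read()
    tracker = TRACKER_FACTORIES[name]()
    tracker.init(frame, first_box)          # first_box = (x, y, w, h)
    trajectory = [first_box]                # formula (5): sequence of positions
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        found, box = tracker.update(frame)  # returns (success, (x, y, w, h))
        if found:
            trajectory.append(tuple(int(v) for v in box))
    cap.release()
    return trajectory
```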
8. Results and Discussion

In this work the task of compiling a map of customer movement through the store was solved. Methods for detecting and tracking customers based on video information were analyzed. A subsystem for analyzing customer flows using video surveillance was developed and its software implementation completed. At the first stage the detection problem is solved using a CNN; the joint use of MobileNet and SSD is well substantiated. The MedianFlow tracker was selected, having shown high speed and accuracy. The developed set of solutions makes it possible to monitor the movement of customers and to identify their areas of interest, with the goal of more effective personnel management and product display.

References

[1] Connell, J., Fan, Q., Gabbur, P., Haas, N., Pankanti, S., Trinh, H.: Retail Video Analytics: An Overview and Survey. Proceedings of SPIE - The International Society for Optical Engineering, vol. 8663, no. 1, pp. 86630X. (2013). DOI: 10.1117/12.2008899.
[2] Hernandez, M., Nalbach, O., Werth, D.: How Computer Vision Provides Physical Retail with a Better View on Customers. IEEE 21st Conference on Business Informatics, Moscow, Russia, vol. 1, pp. 462-471. (2019). DOI: 10.1109/CBI.2019.00060.
[3] Ma, N.L., Choy, M.: Improving Customer's Flow Through Data Analytics. Advances and Trends in Artificial Intelligence. From Theory to Practice. Springer, Cham, vol. 11606, pp. 279-286. (2019). DOI: 10.1007/978-3-030-22999-3_25.
[4] Perdikaki, O., Kesavan, S., Swaminathan, J.: Effect of Traffic on Sales and Conversion Rates of Retail Stores. Manufacturing & Service Operations Management, vol. 14, no. 1, pp. 145-162. (2011). DOI: 10.1287/msom.1110.0356.
[5] Chen, S., Xu, Y., Zhou, X., Li, F.: Deep Learning for Multiple Object Tracking: A Survey. IET Computer Vision, vol. 13, pp. 61-88. (2019). DOI: 10.1016/j.neucom.2019.11.023.
[6] Zhao, Z., Zheng, P., Xu, S., Wu, X.: Object Detection with Deep Learning: A Review. IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11. (2019).
[7] Martynenko, T., Privalov, M., Sekirin, A.: Evolutional Approach to Image Processing on the Example of Microsections. Biologically Inspired Cognitive Architectures (BICA) for Young Scientists, Advances in Intelligent Systems and Computing, Springer, vol. 449, pp. 141-150. (2016). DOI: 10.1007/978-3-319-32554-5_19.
[8] Redmon, J., Farhadi, A.: YOLOv3: An Incremental Improvement. arXiv:1804.02767. (2018).
[9] Bernardin, K., Stiefelhagen, R.: Evaluating Multiple Object Tracking Performance: The CLEAR MOT Metrics. EURASIP Journal on Image and Video Processing. (2008). DOI: 10.1155/2008/246309.
[10] Szegedy, C., Toshev, A., Erhan, D.: Deep Neural Networks for Object Detection. Proceedings of the 26th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA, vol. 2, pp. 2553-2561. (2013).
[11] LeCun, Y., Bengio, Y.: Convolutional Networks for Images, Speech, and Time-Series. In: Arbib, M.A. (ed.) The Handbook of Brain Theory and Neural Networks. MIT Press. (1995).
[12] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '14), pp. 580-587. (2014). DOI: 10.1109/CVPR.2014.81.
[13] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, pp. 1137-1149. (2017). DOI: 10.1109/TPAMI.2016.2577031.
[14] Gong, H., Li, H., Xu, K., Zhang, Y.: Object Detection Based on Improved YOLOv3-tiny. Chinese Automation Congress (CAC), Hangzhou, China, pp. 3240-3245. (2019). DOI: 10.1109/CAC48633.2019.8996750.
[15] Liu, W., Anguelov, D., Erhan, D., Szegedy, C.: SSD: Single Shot MultiBox Detector. Lecture Notes in Computer Science, Springer International Publishing, pp. 21-37. (2016). DOI: 10.1007/978-3-319-46448-0_2.
[16] Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V.: Searching for MobileNetV3. arXiv:1905.02244. (2019).
[17] Howard, A., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861. (2017).
[18] Mohammad, H., Michael, G., Jonathan, S.: Combination of Mean Shift of Colour Signature and Optical Flow for Tracking During Foreground and Background Occlusion. Image and Video Technology, Lecture Notes in Computer Science, Springer, Cham, vol. 9431, pp. 1-12. (2016). DOI: 10.1007/978-3-319-29451-3.
[19] Parekh, H., Thakore, D., Jaliya, U.: A Survey on Object Detection and Tracking Methods. International Journal of Innovative Research in Computer and Communication Engineering, vol. 2, pp. 2970-2978. (2014).
[20] OpenCV (Open Source Computer Vision Library), https://opencv.org/, last accessed 2020/04/26.
[21] Caffe: Deep Learning Framework, https://caffe.berkeleyvision.org/, last accessed 2020/04/26.