<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Visual Objects Tracking on Road Sequences Using Information about Scene Perspective Transform</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolay Nemcev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nickolay Kozyrev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ITMO University</institution>
          ,
          <addr-line>Saint Petersburg, 197101, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper studies existing approaches and methods used in the task of object tracking on video, which is one of the most important tasks facing both visual data analysis systems as a whole and road traffic control systems mounted directly on moving participants of the scene (including self-driving vehicles). The proposed approach is used for road scene perspective transform estimation and search area localization, and works in conjunction with a convolutional neural network for object tracking. The proposed approach helps significantly increase tracking efficiency (on average 10 %, up to 20 % for certain object classes) on a subset of road scene videos shot from a moving vehicle and can be used in practice in environment perception modules mounted directly on vehicles.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Data Processing</kwd>
        <kwd>Visual Object Tracking</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Perspective Transform</kwd>
        <kwd>Vanishing Point</kwd>
        <kwd>RANSAC</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the automotive industry, computer vision algorithms are used to solve various problems: for
example, object and lane detection, velocity and free-space estimation, functions for
understanding the environment, and motion planning for autonomous vehicles.</p>
      <p>The task of tracking an object between two frames of a video sequence can be represented as
a search for the position of the object B(t) at some frame t, given the known state of the object
B(t − 1) on the previous frame of the sequence, specified by a rectangular bounding box.</p>
      <p>Object tracking technology is widely used in systems for road environment understanding,
in the perception and motion planning modules of unmanned vehicles. Extensive use of the
technology imposes additional requirements on it. These requirements are related to
real-time data processing under changing weather and illumination conditions and to the specific
nature of the movement of tracked objects, which is characterized by high movement speed,
frequent occlusions, and significant frame-to-frame changes in object size caused both by the
motion of the scene objects themselves and by the motion of the camera.</p>
      <p>In general, real-time object tracking algorithms can be divided, according to the method of
obtaining and describing the model of the tracked object, into two types: classical
algorithms and algorithms based on the principles of machine learning.</p>
      <p>
        Classical algorithms include the basic template search algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Such algorithms
estimate the position of an object on the next frame by searching for the area most similar to the
object template (the object image from the previous frame) according to the minimum sum of
absolute differences (SAD) criterion or the maximum of the correlation coefficient. Algorithms
based on contour tracking use as the template not the entire pixel field of an object but its shape
and boundaries [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Also worth noting are approaches
based on extracting the object's key points and subsequently matching them
against the key points of the search area on the next frame [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Key point estimation can
be performed using different approaches described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The task of
tracking objects on video can also be solved via the related task of object motion estimation
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
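      <p>To make the baseline concrete, the following minimal sketch (not taken from the cited works)
implements template search by the SAD criterion over a rectangular search area; the array layout
and the exhaustive scan are illustrative assumptions.</p>
      <preformat>
import numpy as np

def sad_template_search(frame_gray, template, search_tl, search_br):
    """Exhaustive template search by the minimum sum of absolute
    differences (SAD) criterion inside a rectangular search area.
    frame_gray and template are 2-D uint8 arrays; search_tl/search_br
    are (x, y) corners of the search area. Returns the (x, y) top-left
    corner of the best match. A sketch, not an optimized tracker."""
    th, tw = template.shape
    x0, y0 = search_tl
    x1, y1 = search_br
    best_pos, best_sad = None, np.inf
    for y in range(y0, y1 - th + 1):
        for x in range(x0, x1 - tw + 1):
            patch = frame_gray[y:y + th, x:x + tw].astype(np.int32)
            sad = np.abs(patch - template.astype(np.int32)).sum()
            if sad &lt; best_sad:
                best_sad, best_pos = sad, (x, y)
    return best_pos
      </preformat>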
      <p>
        The advantages of classical algorithms include the ability to work without a preliminary
training stage for the tracking module, low computational complexity, and the high speed of the
baseline approaches. The disadvantages of classical algorithms include sensitivity to changes in
scene illumination and problems with object tracking in scenes with a non-static background. It
should be noted that the above problems are inherent in the baseline algorithms of this class,
and there are algorithms based on the classical principles of computer vision that are free of
these shortcomings. However, such approaches are usually computationally complex and
unable to work in real time [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], let alone when inter-machine exchange over
a network must also be organized [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Algorithms based on the principles of machine learning use various neural network
architectures [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Also, algorithms can use other methods of machine learning, for example,
RandomForest [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. These machine learning principles allow extracting a set of the tracked
object's features, which are later used to search for the object's position on the next frame of the
sequence. Some of these approaches search for the position of the object on the next frame by
examining candidate regions within a certain area [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Other approaches solve the problem of tracking
the object as a one-shot detection task [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The need for preliminary training of the feature
extraction modules is a hallmark of algorithms that use machine learning methods [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>ML-based (machine learning based) algorithms are more resistant to changes in scene
parameters and extract features of partially overlapped scene objects more robustly, which
makes them more applicable in object tracking modules mounted on moving vehicles.</p>
      <p>
        It should also be noted that modifications of the Kalman filter [18] are often used when
solving the task of tracking objects on video. These modifications serve both to filter the
trajectories of objects and to predict the position of the object on the next frame from the
history of its motion [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
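      <p>For illustration, a minimal constant-velocity Kalman prediction step for a bounding box
center is sketched below; it is a linear simplification of the unscented filter [18] mentioned
above, and the noise parameters are assumed values.</p>
      <preformat>
import numpy as np

def kalman_predict(x, P, dt=1.0, q=1e-2):
    """Constant-velocity Kalman prediction for a bounding-box center.
    State x = [cx, cy, vx, vy]; dt is the frame period. A linear
    simplification of the UKF cited in the text, for illustration only."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Q = q * np.eye(4)            # process noise covariance (assumed)
    x_pred = F @ x               # predicted state
    P_pred = F @ P @ F.T + Q     # predicted covariance
    return x_pred, P_pred
      </preformat>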
      <p>
        The proposed approach combines a method for assessing the parameters of the perspective
transformation of the scene, used to refine the search region at the next frame of the sequence,
and a modified convolutional Siamese network for the object position estimation within the
given search region [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The proposed method for refining the parameters of the
search region is needed to compensate for the displacement and resizing of objects
moving longitudinally to the camera (in this case, the movement of these objects is produced
both by their own motion and by the displacement of the camera mounted on the vehicle).
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The general scheme of the proposed approach</title>
      <p>
        Conventionally, the task of tracking an object between two frames can be represented as a
search for the state of the object B(t) on some frame t, based on the known state of the object
B(t − 1) on the previous frame of the sequence, specified by a rectangular bounding box. The
proposed approach can be divided into two separate modules: a module for defining the parameters
of the object search region on the next frame, used to calculate the assumed position and scale
of the object, and a modified convolutional neural network that searches for the position of the
object in the given region of interest [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The general diagram of the approach is given in Figure 1.
      </p>
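      <p>The interaction of the two modules can be summarized by the following hypothetical driver
loop; the two callables stand in for the components detailed in Sections 3 and 4 and are passed
in explicitly, since their exact interfaces are not given in the text.</p>
      <preformat>
def track_sequence(frames, init_box, refine_search_region, siamese_track):
    """Driver loop combining the two modules of the approach: a search
    region refinement step (Section 3) and a Siamese-network position
    estimation step (Section 4), both supplied as callables."""
    boxes = [init_box]                      # B(0), given on the first frame
    for t in range(1, len(frames)):
        region = refine_search_region(frames[t - 1], frames[t], boxes[-1])
        boxes.append(siamese_track(frames[t], boxes[-1], region))  # B(t)
    return boxes
      </preformat>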
    </sec>
    <sec id="sec-3">
      <title>3. Method for refining search region parameters</title>
      <p>
        For estimation of the search region parameters (the area center c = (x, y) and the scale s) on
frame t, used by the tracker for object position estimation, a procedure is applied that is based on
the random sample consensus method [18] (RANSAC) and on estimation of the parameters
of the scene perspective transform by vanishing point search. At the first step of the perspective
transform estimation, object boundaries are found using the Canny edge detector [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
and a set of linear object boundary segments whose length exceeds 3 pixels is extracted by
the Hough transform. Each segment s_i = (c_i, α_i, l_i) is described by the combination of the
position of its center c_i, the slope of the segment α_i, and its length l_i; segments
whose angle of inclination to the vertical axis of the frame did not fall within the range from 10 to
70 degrees were removed.
      </p>
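      <p>A possible implementation of this step with OpenCV is sketched below; the Canny thresholds
and Hough parameters are illustrative assumptions, while the 3-pixel length and the 10 to 70
degree angle filter follow the text.</p>
      <preformat>
import cv2
import numpy as np

def extract_segments(frame_gray, min_len=3, ang_lo=10.0, ang_hi=70.0):
    """Segments s_i = (c_i, a_i, l_i) from Canny edges via the
    probabilistic Hough transform; segments whose inclination to the
    vertical frame axis lies outside [ang_lo, ang_hi] degrees are
    discarded. Thresholds are illustrative."""
    edges = cv2.Canny(frame_gray, 100, 200)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=30,
                            minLineLength=min_len, maxLineGap=2)
    segments = []
    if lines is None:
        return segments
    for x1, y1, x2, y2 in lines[:, 0]:
        length = np.hypot(x2 - x1, y2 - y1)
        # angle to the vertical axis of the frame, in degrees
        ang = np.degrees(np.arctan2(abs(x2 - x1), abs(y2 - y1)))
        if ang_lo &lt;= ang &lt;= ang_hi:
            center = ((x1 + x2) / 2.0, (y1 + y2) / 2.0)
            slope = np.arctan2(y2 - y1, x2 - x1)
            segments.append((center, slope, length, (x1, y1, x2, y2)))
    return segments
      </preformat>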
      <p>The structure of the RANSAC algorithm can be described in two stages. At the first stage, a set of
hypotheses is selected; in this case a hypothesis is a vanishing point model VP(s_1, s_2),
defined as the intersection of two random segments s_1 and s_2 obtained at the previous stage.
Then the votes for each model are counted, and the model with the most votes is the output
of the algorithm (the target vanishing point).</p>
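      <p>A hypothesis can be formed from two segments via homogeneous line intersection, as in the
following sketch (an endpoint representation of segments is assumed):</p>
      <preformat>
import numpy as np

def vp_hypothesis(seg1, seg2):
    """Vanishing point hypothesis VP(s1, s2): intersection of the lines
    through two segments, each given by endpoints (x1, y1, x2, y2).
    Cross products of homogeneous coordinates give the intersection;
    returns None for (near-)parallel segments."""
    def line(seg):
        x1, y1, x2, y2 = seg
        return np.cross([x1, y1, 1.0], [x2, y2, 1.0])
    p = np.cross(line(seg1), line(seg2))
    if abs(p[2]) &lt; 1e-9:      # parallel lines: no finite intersection
        return None
    return p[0] / p[2], p[1] / p[2]
      </preformat>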
      <p>To count the votes for some hypothesis VP(s_1, s_2), we iterate over all available segments s_i
and calculate the weight of each vote using the following expression:

w(s_i, VP(s_1, s_2)) = { (1 − e^(−γ·cos²θ)) · κ · l(s_i), if θ ≤ 5°; 0, otherwise }   (1)

where θ is the smaller angle between the voting segment and the line connecting the hypothetical
vanishing point to the center of the given segment, γ is the parameter describing the dependence
of the vote weight on the level of angle similarity, and κ is the coefficient describing the influence
of the segment length on the weight of its vote.</p>
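      <p>The vote counting of equation (1) can be sketched as follows; the values of γ and κ are
illustrative, and the exponential form follows the reconstruction above:</p>
      <preformat>
import numpy as np

def vote_weight(seg_center, seg_slope, seg_len, vp,
                gamma=5.0, kappa=1.0, theta_max_deg=5.0):
    """Vote weight of a segment for a vanishing point hypothesis,
    following equation (1); gamma and kappa values are assumed."""
    cx, cy = seg_center
    # direction from the segment center to the hypothesized VP
    to_vp = np.arctan2(vp[1] - cy, vp[0] - cx)
    # smaller angle between the segment line and that direction
    theta = abs(seg_slope - to_vp) % np.pi
    theta = min(theta, np.pi - theta)
    if np.degrees(theta) &gt; theta_max_deg:
        return 0.0
    return (1.0 - np.exp(-gamma * np.cos(theta) ** 2)) * kappa * seg_len
      </preformat>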
      <p>The model with the most votes is the approximate position of the vanishing point, describing
the parameters of the perspective transformation of the scene. After finding this model, the
point refinement procedure is performed according to the approach described in [19].</p>
      <p>
        Knowing the parameters (corner coordinates) of the bounding box of the object B(t − 1) on
frame t − 1 and the coordinates of the vanishing point VP = (x_vp, y_vp), we can construct a set of
estimated search region parameters (position and scale) based on the hypothesis of the
longitudinal motion of objects [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Hypothetical search regions are selected by shifting (taking
into account the perspective transform parameters) the bounding box of the object B(t − 1)
along the line connecting the center of the given object and the vanishing point on frame t,
so that the object scale coefficient (the ratio of the area of the supposed bounding box to the size
of the object's box on frame t − 1) varies in the range from 0.75 to 1.25 with step 0.1. An
illustration is shown in Figure 2.
      </p>
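      <p>A sketch of candidate region generation is given below; boxes are assumed to be
(cx, cy, w, h), and the rule that the box center moves along the ray from the vanishing point
proportionally to the linear scale is one plausible reading of the shift described above.</p>
      <preformat>
import numpy as np

def candidate_regions(prev_box, vp, scales=np.arange(0.75, 1.26, 0.1)):
    """Hypothetical search regions: shift B(t-1) along the line from the
    vanishing point through its center, scaling its area by each factor
    in [0.75, 1.25] with step 0.1. The shift rule is an assumption."""
    cx, cy, w, h = prev_box
    vx, vy = vp
    regions = []
    for s in scales:
        k = np.sqrt(s)                 # area scale to linear scale
        # under longitudinal motion, the box center moves along the ray
        # from the VP through the object center as the object rescales
        ncx = vx + (cx - vx) * k
        ncy = vy + (cy - vy) * k
        regions.append((ncx, ncy, w * k, h * k))
    return regions
      </preformat>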
      <p>At the next step, the most appropriate region B′(t) is selected from the set of assumed bounding
boxes based on the criterion of the maximum correlation level with the area of frame t − 1
corresponding to the image of the object B(t − 1).</p>
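      <p>This selection step might look as follows with OpenCV's normalized correlation coefficient;
boundary handling is omitted for brevity.</p>
      <preformat>
import cv2

def select_best_region(frame, prev_object_patch, regions):
    """Pick B'(t): the candidate region whose image correlates best with
    the object image from frame t-1. Regions are (cx, cy, w, h) in
    pixels; a sketch without boundary checks."""
    best, best_score = None, -2.0
    for cx, cy, w, h in regions:
        x0, y0 = int(cx - w / 2), int(cy - h / 2)
        patch = frame[y0:y0 + int(h), x0:x0 + int(w)]
        patch = cv2.resize(patch, (prev_object_patch.shape[1],
                                   prev_object_patch.shape[0]))
        # equal-size inputs: matchTemplate returns a 1x1 score map
        score = cv2.matchTemplate(patch, prev_object_patch,
                                  cv2.TM_CCOEFF_NORMED)[0, 0]
        if score &gt; best_score:
            best, best_score = (cx, cy, w, h), score
    return best
      </preformat>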
    </sec>
    <sec id="sec-4">
      <title>4. Neural network model</title>
      <p>
        After determining the hypothetical search region B′(t), the position of the object on
frame t is searched for using a Siamese neural network for object tracking. The architecture
is almost identical to the network described in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The main difference of this network
architecture is that, in addition to the object template and the search area specified by the
object position on the previous frame B(t − 1), a hypothetical search region is also supplied to
the network input. The hypothetical search region is described by the bounding box center of B′(t)
and the size change factor s. The network solves the problem of object tracking as detection by
template, operates in parallel on both search areas, and describes the obtained results
by the rectangular bounding boxes B(t, B(t − 1), 1) and B(t, B′(t), s) and the
detection probabilities P(t, B(t − 1), 1) and P(t, B′(t), s) corresponding to the search
area described by the previous object position and to the hypothetical search area, respectively.
The resulting bounding box B(t) is calculated according to the following expression:

B(t) = { B(t, B′(t), s), if P(t, B′(t), s) − 0.1·|1 − s| &gt; P(t, B(t − 1), 1); B(t, B(t − 1), 1), otherwise }   (2)
      </p>
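      <p>The selection rule of equation (2) amounts to a few lines of code, sketched here with
hypothetical argument names:</p>
      <preformat>
def fuse_outputs(box_hypo, p_hypo, s, box_prev, p_prev):
    """Resulting box B(t) per equation (2): prefer the detection in the
    hypothetical region unless its probability, penalized by
    0.1*|1 - s| for the scale change, loses to the detection in the
    previous-position search area."""
    if p_hypo - 0.1 * abs(1.0 - s) &gt; p_prev:
        return box_hypo      # B(t, B'(t), s)
    return box_prev          # B(t, B(t-1), 1)
      </preformat>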
    </sec>
    <sec id="sec-5">
      <title>5. Assessment of the effectiveness of the proposed approach</title>
      <p>
        At this stage, a comparative analysis of the proposed approach and the classical
implementation of the Siam-RPN tracker [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which became the basis of the proposed approach, was
carried out according to the mean overlap criterion (EAO, expected average overlap), calculated
in compliance with the procedure described in [19]:

Φ̂ = (1 / (N_hi − N_lo)) · Σ_{N_s = N_lo}^{N_hi} Φ̂_{N_s}   (3)
      </p>
      <p>In equation (3), N_lo is the minimum length and N_hi is the maximum length of the sequences
of frames on which the tracked object is present, and Φ̂_{N_s} is calculated according to the
following formula:

Φ̂_{N_s} = (1 / N_s) · Σ_{i=1}^{N_s} Φ_i   (4)</p>
      <p>
        In equation (4), Φ̂_{N_s} is the average overlap for sequences of length N_s, and Φ_i is the
coefficient of overlap between the predicted position of the object and its true position on frame i
(IoU, intersection over union [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]). A subset of the BDD100K dataset [20], consisting of 61 road scene sequences
shot from a moving car in various weather conditions and at different times of day, was used as
the test dataset. Tracking was performed for each video object from its first appearance to the
end of the video. In this case, the initial state (position of the bounding box) of the object was
taken directly from the annotations of the dataset. The results of the effectiveness analysis of the
proposed approach for different object classes are given in Table 1.
      </p>
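      <p>For reference, the following sketch computes IoU and the EAO of equations (3) and (4);
averaging, for each length N_s, over all sequences long enough to contribute is an assumption
about the evaluation protocol.</p>
      <preformat>
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def eao(per_sequence_ious, n_lo, n_hi):
    """EAO per equations (3)-(4): for each sequence length N_s in
    [n_lo, n_hi], average the mean overlap of the first N_s frames over
    the contributing sequences, then average across lengths."""
    phi = []
    for n_s in range(n_lo, n_hi + 1):
        vals = [np.mean(seq[:n_s]) for seq in per_sequence_ious
                if len(seq) &gt;= n_s]
        if vals:
            phi.append(np.mean(vals))
    return float(np.mean(phi))
      </preformat>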
      <p>
        The obtained results show that using the proposed approach makes it possible to significantly
increase the efficiency of object tracking compared to the classical implementation of Siam-RPN
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. It should be noted that the proposed approach operates with the same parameters (weights)
of the neural network as the classical implementation. At the same time, using the Kalman
filter [18] to predict the position of the object on the next frame (and to select the corresponding
search region) does not give a noticeable increase in tracking quality (Siam-RPN [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] + UKF
[18] and Proposed + UKF [18]). This is primarily due to the small length of the video sequences
used, over which the filter often does not have enough time to form a model of the object's
movement.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>
        The approach described in this article for tracking objects on video is based on a method for
refining the parameters of the search region and a modified neural network for tracking
objects. The proposed approach to refining the search region parameters is based on
estimating the perspective transformation of the scene by searching for the vanishing
point; it is used to compensate for the displacement and scaling of objects caused by their
longitudinal movement, and it allows significantly increasing the efficiency of the neural network
for tracking objects (on average 10%, up to 20% for some object classes) on a subset of video
sequences of road scenes shot from a moving camera. The modified network performs the object
search simultaneously in two search areas using the same object template. It should be noted that
using the search region refinement module slightly increases the computational complexity of the
tracking process and its duration. However, the information about the perspective transformation
may be used by other unmanned vehicle modules, such as the road marking detection and tracking
module. The modified neural network also imposes higher requirements on the computational
capabilities of the graphics accelerator used (primarily its memory). However, the relative
simplicity of the original Siam-RPN architecture [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] allows the proposed approach for tracking objects to work
in real time on devices mounted directly on moving unmanned vehicles.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><surname>Gonzalez</surname> <given-names>R. C.</given-names></string-name>, <string-name><surname>Woods</surname> <given-names>R. E.</given-names></string-name>, <string-name><surname>Eddins</surname> <given-names>S. L.</given-names></string-name>
          ,
          <source>Digital image processing using MATLAB, Pearson Education India</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><surname>Isard</surname> <given-names>M.</given-names></string-name>, <string-name><surname>Blake</surname> <given-names>A.</given-names></string-name>
          ,
          <article-title>Contour tracking by stochastic propagation of conditional density</article-title>
          ,
          <source>in: European conference on computer vision</source>
          , Springer, Berlin, Heidelberg,
          <year>1996</year>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><surname>Yang</surname> <given-names>C.</given-names></string-name>, <string-name><surname>Duraiswami</surname> <given-names>R.</given-names></string-name>, <string-name><surname>Davis</surname> <given-names>L.</given-names></string-name>
          ,
          <article-title>Efficient mean-shift tracking via a new similarity measure</article-title>
          ,
          <source>in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05)</source>
          , IEEE,
          <year>2005</year>
          , pp.
          <fpage>176</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><surname>Lienhart</surname> <given-names>R.</given-names></string-name>, <string-name><surname>Maydt</surname> <given-names>J.</given-names></string-name>
          ,
          <article-title>An extended set of Haar-like features for rapid object detection</article-title>
          ,
          <source>in: Proceedings. international conference on image processing</source>
          , IEEE,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><surname>Zhou</surname> <given-names>H.</given-names></string-name>, <string-name><surname>Yuan</surname> <given-names>Y.</given-names></string-name>, <string-name><surname>Shi</surname> <given-names>C.</given-names></string-name>
          ,
          <article-title>Object tracking using SIFT features and mean shift</article-title>
          ,
          <source>in: Computer Vision and Image Understanding</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>352</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><surname>Rublee</surname> <given-names>E.</given-names></string-name>
          et al.,
          <article-title>ORB: An efficient alternative to SIFT or SURF</article-title>
          ,
          <source>in: 2011 International conference on computer vision</source>
          , IEEE,
          <year>2011</year>
          , pp.
          <fpage>2564</fpage>
          -
          <lpage>2571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>B. T.</surname>
          </string-name>
          ,
          <article-title>Pedestrian detection and tracking using temporal differencing and HOG features</article-title>
          , in: Computers &amp; Electrical Engineering,
          <year>2014</year>
          , pp.
          <fpage>1072</fpage>
          -
          <lpage>1079</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><surname>Bouguet</surname> <given-names>J.-Y.</given-names></string-name>
          ,
          <article-title>Pyramidal implementation of the affine Lucas Kanade feature tracker: description of the algorithm</article-title>
          , in: Intel Corporation,
          <year>2001</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Fiaz</surname> <given-names>M.</given-names></string-name>
          et al.,
          <article-title>Handcrafted and deep trackers: Recent visual object tracking approaches and trends</article-title>
          ,
          <source>in: ACM Computing Surveys (CSUR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Bogatyrev</surname> <given-names>S. V.</given-names></string-name>, <string-name><surname>Bogatyrev</surname> <given-names>A. V.</given-names></string-name>, <string-name><surname>Bogatyrev</surname> <given-names>V. A.</given-names></string-name>
          ,
          <article-title>Multipath redundant transmission with packet segmentation</article-title>
          ,
          <source>in: 2019 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><surname>Arustamov</surname> <given-names>S. A.</given-names></string-name>, <string-name><surname>Bogatyrev</surname> <given-names>V. A.</given-names></string-name>, <string-name><surname>Polyakov</surname> <given-names>V. I.</given-names></string-name>
          ,
          <article-title>Back up data transmission in real-time duplicated computer systems</article-title>
          ,
          <source>in: Advances in Intelligent Systems and Computing</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><surname>Bogatyrev</surname> <given-names>V. A.</given-names></string-name>
          ,
          <article-title>Exchange of duplicated computing complexes in fault-tolerant systems</article-title>
          ,
          <source>in: Automatic Control and Computer Sciences</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>268</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><surname>Held</surname> <given-names>D.</given-names></string-name>, <string-name><surname>Thrun</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Savarese</surname> <given-names>S.</given-names></string-name>
          ,
          <article-title>Learning to track at 100 fps with deep regression networks</article-title>
          ,
          <source>in: European Conference on Computer Vision</source>
          , Springer, Cham,
          <year>2016</year>
          , pp.
          <fpage>749</fpage>
          -
          <lpage>765</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><surname>Ning</surname> <given-names>G.</given-names></string-name>
          et al.,
          <article-title>Spatially supervised recurrent convolutional neural networks for visual object tracking</article-title>
          ,
          <source>in: 2017 IEEE International Symposium on Circuits and Systems (ISCAS)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><surname>Liaw</surname> <given-names>A.</given-names></string-name>, <string-name><surname>Wiener</surname> <given-names>M.</given-names></string-name>
          ,
          <article-title>Classification and regression by randomForest</article-title>
          , in: R news,
          <year>2002</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><surname>Kalal</surname> <given-names>Z.</given-names></string-name>, <string-name><surname>Mikolajczyk</surname> <given-names>K.</given-names></string-name>, <string-name><surname>Matas</surname> <given-names>J.</given-names></string-name>
          ,
          <article-title>Tracking-learning-detection</article-title>
          ,
          <source>in: IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1409</fpage>
          -
          <lpage>1422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><surname>Li</surname> <given-names>B.</given-names></string-name>
          et al.,
          <article-title>High-performance visual tracking with siamese region proposal network</article-title>
          ,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>8971</fpage>
          -
          <lpage>8980</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><surname>Wan</surname> <given-names>E. A.</given-names></string-name>, <string-name><surname>Van Der Merwe</surname> <given-names>R.</given-names></string-name>
          ,
          <article-title>The unscented Kalman filter for nonlinear estimation</article-title>
          ,
          <source>in: Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium</source>
          ,
          <year>2000</year>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><surname>Chaudhury</surname> <given-names>K.</given-names></string-name>, <string-name><surname>DiVerdi</surname> <given-names>S.</given-names></string-name>, <string-name><surname>Ioffe</surname> <given-names>S.</given-names></string-name>
          ,
          <article-title>Auto-rectification of user photos</article-title>
          ,
          <source>in: 2014 IEEE International Conference on Image Processing (ICIP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>3479</fpage>
          -
          <lpage>3483</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><surname>Yu</surname> <given-names>F.</given-names></string-name>
          et al.,
          <article-title>BDD100K: A diverse driving video database with scalable annotation tooling</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>