<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>EdgeLabel: A Video Annotation Method for Moving Camera using Edge Devices</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hai-Thien To</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Khac-Hoai Nam Bui</string-name>
          <email>hoainam.bk2012@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chi-Luan Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Transport and Technology</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Viettel Cyberspace Center</institution>
          ,
          <addr-line>Hanoi</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <fpage>91</fpage>
      <lpage>99</lpage>
      <abstract>
<p>Currently, AI edge devices are becoming the trend in real-time object detection. However, one of the disadvantages of current object detection models is the limited set of recognized objects and the lack of additional data for existing objects. Therefore, labeling data is an important task. This study proposes a new method for labeling object detection data in videos collected from moving cameras using edge devices. Specifically, our method is able to collect sharp frames containing new objects and objects that are mis-detected during the real-time running of AI edge devices. The application of this solution supports locating new objects and suggests adding data to existing data in a frame/image, which can save a lot of time and effort when labeling video data.</p>
      </abstract>
      <kwd-group>
        <kwd>Video labeling • Object detection • Edge devices • Moving camera</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        Recently, video understanding has received increasing attention thanks to the availability
of several large-scale video datasets [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, annotating large-scale video
datasets is cost-intensive and cumbersome, which suggests the need for a
semi-automatic annotation method to improve the process [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Specifically, the core
idea is to use an automatic preprocessing step based on a neural network to
roughly annotate each image before human review and revision [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Nonetheless, video annotation, especially for data from moving cameras, remains a challenging
issue because of variations in viewpoint, scale, and appearance within the video.
Furthermore, as Deep Neural Networks (DNNs) grow in complexity, executing
object detection methods on edge devices becomes a challenging problem as well [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>This paper presents EdgeLabel, a novel framework for labeling object
detection data from moving cameras by executing object detection methods on edge devices.
Specifically, EdgeLabel obtains better results than using pretrained object
detectors alone and focuses on incorporating object detection methods on edge devices
to enable real-time processing. Fig. 1 illustrates the primary process of
EdgeLabel. Notably, our method does not attempt to provide a better, general-purpose
object detector. Instead, we capitalize on the redundancy and specificity of
individual video streams, which empirically performs better on each specific video.
The main contributions of this study are twofold:</p>
      <p>We propose a novel semi-automatic annotation method for videos
collected from moving cameras, which is time-efficient and performs
better on the specific video.</p>
      <p>The object detection method is executed on edge devices to enable
real-time processing of streaming data.</p>
      <p>The rest of this paper is organized as follows: In Section 2, we briefly review
image labeling tools and object detection methods. Section 3 presents the main
process of the EdgeLabel framework. Section 4 demonstrates preliminary results
of the proposed framework on specific video data. Section 5 presents the conclusion
and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Literature Review</title>
      <sec id="sec-2-1">
        <title>Image Labeling Tools</title>
        <p>
          Current models can label hundreds of thousands of objects, but the number of
labeled objects cannot compare with the number of objects in daily life. Therefore, we
always need to add new data to artificial intelligence (AI) models related
to object detection. Several well-known tools that support data labeling are
labelImg [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], labelme, and so on (see, e.g., https://github.com/tzutalin/labelImg).
The inputs of these tools are images or videos.
In the case of images, the objects in the image are zoned and labeled.
However, with video input, the labeling process has to follow several steps, as
shown in Fig. 2. Specifically, the input video is extracted into many frames. For
instance, to label a 1-hour video at 24 FPS, we have to spend
time considering 86,400 frames. However, not all frames contain new objects, so it
is not necessary to spend a lot of time checking all frames. Therefore, we need
a solution that retrieves frames containing new data or suggests frames
containing objects that may be misdetected. In this regard, Poorgholi et al. introduce
t-EVA, a method to speed up the annotation process by placing the same
actions from different videos into a two-dimensional space based on feature
similarity with t-SNE [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Furthermore, the authors in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] proposed an automatic annotation method
based on deep learning preprocessing, which includes two stages: i) improving the
results of CNN processing by combining them with the object detection outcome; and ii)
using a sliding window to deal with a large number of outliers. Zhou et
al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] combine two complementary sources of knowledge, bounding box
merging and model distillation, to generate detections on the unlabeled portion.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Object Detection Method</title>
        <p>
          In this study, we define a new object as an object that is not detected, or is
misrepresented, by neural network models. Specifically, modern object detection
methods integrate feature generation, proposal, and classification into a single
pipeline and fall into two categories: i) single-stage architectures (e.g., YOLO
[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and SSD [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]), which directly predict the coordinates and class of bounding
boxes; and ii) two-stage detectors (e.g., Faster R-CNN [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]), which refine the proposals produced
by a region proposal network. Technically, compared with two-stage methods,
single-stage models are less accurate but also less expensive, which makes them suitable for
detection and tracking problems [
          <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
          ].
        </p>
        <p>
          Since we focus on executing the object detection process on edge devices,
we fine-tune a compact single-stage object detector (i.e., MobileNet-SSD [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]),
which is trained on Microsoft COCO [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to generate labels.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>EdgeLabel: Overview and Features</title>
      <p>EdgeLabel is responsible for identifying and collecting useful frames while the
edge device is running other AI models. Specifically, Alg. 1 illustrates the main
process of the EdgeLabel framework. In particular, the framework includes three main
processes, namely frame processing, the object detection method, and post-processing,
which are sequentially described as follows:</p>
      <sec id="sec-3-1">
        <title>Frame Processing</title>
        <p>This process focuses on selecting the input frames. In the streaming data of the input video,
there will be many blurred frames. Therefore, we adopt the Laplacian method for
selecting quality images based on their focus level. The pseudo code of this process
is illustrated in lines 3-9 of Alg. 1.</p>
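        <p>The blurred-frame filter above can be sketched as follows (a minimal NumPy version; the kernel is the standard 3x3 Laplacian, and the sharpness threshold of 100 is an assumed value, not one reported in this paper):</p>

```python
import numpy as np

def focus_measure(gray):
    """Variance of the 3x3 Laplacian response of a grayscale frame;
    blurred frames have few edges and therefore a low variance."""
    g = gray.astype(np.float64)
    # Laplacian of each interior pixel: sum of the four neighbours
    # minus four times the centre (kernel [[0,1,0],[1,-4,1],[0,1,0]])
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return lap.var()

def is_good_frame(gray, threshold=100.0):
    """Keep the frame only when the focus measure reaches the threshold."""
    return focus_measure(gray) >= threshold
```

        <p>Frames whose focus measure does not reach the threshold are skipped, corresponding to the good-frame flag in line 9 of Alg. 1.</p>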
      </sec>
      <sec id="sec-3-2">
        <title>Object Detection Method</title>
        <p>
          The selected input frames are put into an object detection model. In this study,
we employ SSD MobileNetV2 trained on the COCO dataset using the TensorFlow API
for object detection. Furthermore, the object detection model is quantized
to run in real time with the best performance on edge devices [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. After
applying the object detection model, all object labels and the locations of
misidentified objects (low obj-score) are temporarily stored. This process is illustrated in lines
10-11 of Alg. 1.
        </p>
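        <p>The low-score filtering in lines 10-11 can be sketched as follows (the 0.5 value of the fixed threshold and the list-based interface are illustrative assumptions, not values specified in this paper):</p>

```python
def split_detections(object_labels, points, obj_scores, fixed_threshold=0.5):
    """Split the detector output into confident detections and
    misidentified objects whose obj-score falls under the fixed
    threshold; the latter are stored for later human review."""
    confident, misidentified = [], []
    for label, box, score in zip(object_labels, points, obj_scores):
        if score >= fixed_threshold:
            confident.append((label, box, score))
        else:
            misidentified.append((label, box, score))
    return confident, misidentified
```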
        <p>Furthermore, one of the main problems in this process is that there are
many objects that cannot be detected by object detection models. However, they have
certain shapes, such as triangles, squares, rectangles, and circles. Therefore, we
adopt contour extraction, a well-known image processing algorithm, to extract meaningful
objects. Sequentially, their locations and shape names are stored for the next process,
as expressed in line 12 of the algorithm.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Post Processing</title>
        <p>This process is provided for selecting the frames used for the annotation, which is
expressed in lines 13-26 of the algorithm. Specifically, after detecting the image
quality, the misidentified objects, and the meaningful objects of the frame in the
previous processes, the next step is to check whether the selected frame is a new frame or
not by calculating the structural similarity between two frames. Depending on the
speed of the moving camera, we can select a corresponding threshold and the two frames
that are used for the structural similarity comparison. The process includes three
steps as follows: i) first, we define a compare-area inside the frame for
comparing two frames; ii) then, we calculate the similarities using image processing
algorithms; iii) at most five consecutive frames with the best image
quality are stored, including the information of misidentified objects and meaningful objects.
The output is XML files, which follow the formats of LabelImg or LabelMe for
the labeling.</p>
        <p>Algorithm 1 EdgeLabel framework for labeling video from a moving camera
1: cap = CaptureVideo() ▷ Initialize the real-time camera
2: frame-count = 0
3: check-similarity-background = False
4: while cap.isOpened() do ▷ Check whether the real-time camera is open
5: frame = cap.read()
6: ListFrame = [] ▷ Initialize a list to store useful frames
7: BBox = [] ▷ Initialize a list to store the location of each object in useful frames
8: Pre-label = [] ▷ Initialize a list to store the pre-label of each object
9: good-frame = evaluate the image quality of the frame ▷ Image-quality flag of each frame
10: object-label, points, obj-score = ObjectDetectionModel(frame) ▷ Collect misidentified objects of the object detection model
11: misidentified-object = objects whose obj-score &lt; fixed-threshold ▷ Besides the objects detected by the model, we also evaluate whether the frame contains more new objects
12: Detect the meaningful objects of the input frame (objects shaped as triangles, squares, rectangles, or circles) using contour algorithms.
13: new-background = evaluate whether the background of the frame is new
14: count-down = 5 (get 5 frames)
15: if new-background and count-down &gt; 0 then
16: if good-frame == True then
17: save the frame to a path
18: ListFrame.append(good-frame)
19: BBox.append(location of misidentified-object)
20: Pre-label.append(object-label)
21: BBox.append(location of meaningful-object)
22: Pre-label.append(shape-name of meaningful-object)
23: Store ListFrame, BBox, Pre-label to an XML file with the frame path in the DB
24: end if
25: count-down = count-down - 1
26: end if
27: ▷ Load the frame and XML file into LabelImg or Labelme; using these tools, we can re-label new objects easily
28: end while</p>
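        <p>The structural-similarity check on the compare-area can be sketched with a single-window SSIM in NumPy (a simplification of the usual windowed SSIM; the 85% threshold matches the demonstration in Section 4, and the constants follow the standard SSIM definition):</p>

```python
import numpy as np

def global_ssim(a, b, data_range=255.0):
    """Structural similarity of two grayscale compare-areas computed
    over a single window (standard SSIM uses a sliding window)."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2.0 * mu_a * mu_b + c1) * (2.0 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)
    return num / den

def is_new_background(prev_area, curr_area, threshold=0.85):
    """Treat the frame as a new background when similarity to the
    previous compare-area does not reach the threshold."""
    return not (global_ssim(prev_area, curr_area) >= threshold)
```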
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Demonstration and Results Analysis</title>
      <p>We test our framework with an input video that is collected using the Coral
Camera, a 5MP camera designed for use with the Coral Dev Board.
Fig. 3 demonstrates the interface of our test. Specifically, EdgeLabel includes
three main processes for labeling video data: i) the input data are preprocessed to
remove blurred frames; ii) the selected frames are put into an object detection
model; notably, in this study, we employ SSD MobileNetV2 on edge devices
to provide real-time processing; iii) lastly, post-processing is provided
to extract the frames that include misidentified objects for the annotation.
Sequentially, the outputs are XML files, which are input into labeling tools such
as LabelImg or LabelMe for the annotation, as shown in Fig. 4.</p>
      <p>Tab. 1 shows the results of our framework for an input video whose
length equals 3m30s, where the speed of the moving camera is around 3 km/h.
In particular, we test the input video with various values of the distance between two
frames used to calculate the similarity and of the threshold used to extract frames for labeling.
Accordingly, with a frame distance of 50 and a threshold of
85%, there are 69 frames including 533 mis-detected objects for the annotation. Note that
the value of the frame distance depends on the speed of the moving camera in order to obtain
appropriate results.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future work</title>
      <p>In this paper, we present EdgeLabel, a semi-automatic video annotation method
which focuses on video inputs collected from moving cameras. Specifically,
the framework includes three main processes: i) frame processing for selecting
quality frames and removing redundant objects; ii) an object detection method
with a lightweight model (i.e., MobileNetV2 with SSD) for identifying
misidentified and unknown objects; and iii) post-processing for selecting the frames
used for the annotation. In particular, EdgeLabel enables real-time labeling
by executing the object detection model on edge devices. Regarding the
future work of this study, we intend to exploit the capability of the proposed
framework by providing a large-scale labeled dataset of moving cameras for object
detection. Furthermore, more studies on quantization approaches
for object detection methods on edge devices to improve performance are also
taken into account.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>This research is funded by University of Transport Technology (UTT) under
grant number —TT—2021-06.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnander</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>de Gevigney</surname>
            ,
            <given-names>F.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahlberg</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Felsberg</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Semi-automatic annotation of objects in visual-thermal video</article-title>
          .
          <source>In: Proceeding of the IEEE/CVF International Conference on Computer Vision Workshops</source>
          ,
          <string-name>
            <surname>ICCV Workshops</surname>
          </string-name>
          <year>2019</year>
          . pp.
          <fpage>2242</fpage>
          <lpage>2251</lpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2019</year>
          ). https://doi.org/10.1109/ICCVW.
          <year>2019</year>
          .00277
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A multi-class multi-movement vehicle counting framework for traffic analysis in complex areas using CCTV systems</article-title>
          .
          <source>Energies</source>
          <volume>13</volume>
          (
          <issue>8</issue>
          ),
          <year>2036</year>
          (
          <year>2020</year>
          ). https://doi.org/10.3390/en13082036, https://www.mdpi.com/1996- 1073/13/8/2036
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>A vehicle counts by class framework using distinguished regions tracking at multiple intersections</article-title>
          .
          <source>In: Proceeding of the IEEE/CVF Conference on Computer Vision</source>
          and Pattern Recognition,
          <source>CVPR Workshops</source>
          <year>2020</year>
          . pp.
          <fpage>2466</fpage>
          <lpage>2474</lpage>
          . Computer Vision Foundation / IEEE (
          <year>2020</year>
          ). https://doi.org/10.1109/CVPRW50498.
          <year>2020</year>
          .00297
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
          </string-name>
          , H.:
          <article-title>A semi-automatic annotation technology for traffic scene image labeling based on deep learning preprocessing</article-title>
          .
          <source>In: Proceeding of the IEEE International Conference on Computational Science and Engineering</source>
          ,
          <string-name>
            <surname>CSE</surname>
          </string-name>
          <year>2017</year>
          . pp.
          <fpage>315</fpage>
          <lpage>320</lpage>
          . IEEE Computer Society (
          <year>2017</year>
          ). https://doi.org/10.1109/CSE-EUC.
          <year>2017</year>
          .63
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Dollár,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.L.</surname>
          </string-name>
          :
          <article-title>Microsoft COCO: common objects in context</article-title>
          .
          <source>In: Proceeding of 13th European Conference on Computer Vision - ECCV 2014. Lecture Notes in Computer Science</source>
          , vol.
          <volume>8693</volume>
          , pp.
          <fpage>740</fpage>
          <lpage>755</lpage>
          . Springer (
          <year>2014</year>
          ). https://doi.org/10.1007/978-3-
          <fpage>319</fpage>
          -10602-1_
          <fpage>48</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berg</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>SSD: single shot multibox detector</article-title>
          .
          <source>In: Proceeding of the 14th European Conference on Computer Vision</source>
          ,
          <source>ECCV 2016. Lecture Notes in Computer Science</source>
          , vol.
          <volume>9905</volume>
          , pp.
          <fpage>21</fpage>
          <lpage>37</lpage>
          . Springer (
          <year>2016</year>
          ). https://doi.org/10.1007/978-3-
          <fpage>319</fpage>
          -46448-
          <issue>0</issue>
          _
          <fpage>2</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Marchisio</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanif</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khalid</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plastiras</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kyrkou</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theocharides</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafique</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Deep learning for edge computing: Current trends, cross-layer optimizations, and open research challenges</article-title>
          .
          <source>In: Proceeding of the IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2019</source>
          . pp.
          <fpage>553</fpage>
          <lpage>559</lpage>
          .
          <string-name>
            <surname>IEEE</surname>
          </string-name>
          (
          <year>2019</year>
          ). https://doi.org/10.1109/ISVLSI.
          <year>2019</year>
          .00105
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Poorgholi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kayhan</surname>
            ,
            <given-names>O.S.</given-names>
          </string-name>
          , van Gemert,
          <string-name>
            <surname>J.C.</surname>
          </string-name>
          <article-title>: T-EVA: Time-efficient t-SNE video annotation</article-title>
          .
          <source>In: Proceeding of the International Workshops and Challenges Pattern Recognition, ICPR 2021. Lecture Notes in Computer Science</source>
          , vol.
          <volume>12664</volume>
          , pp.
          <fpage>153</fpage>
          <lpage>169</lpage>
          . Springer (
          <year>2020</year>
          ). https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -68799-1_
          <fpage>12</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Real</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mazzocchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>YouTube-BoundingBoxes: A large high-precision human-annotated data set for object detection in video</article-title>
          .
          <source>In: Proceeding of the IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <string-name>
            <surname>CVPR</surname>
          </string-name>
          <year>2017</year>
          . pp.
          <fpage>7464</fpage>
          <lpage>7473</lpage>
          . IEEE Computer Society (
          <year>2017</year>
          ). https://doi.org/10.1109/CVPR.
          <year>2017</year>
          .789
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Divvala</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>You only look once: Unified, real-time object detection</article-title>
          .
          <source>In: Proceeding of the IEEE Conference on Computer Vision</source>
          and Pattern Recognition,
          <string-name>
            <surname>CVPR</surname>
          </string-name>
          <year>2016</year>
          . pp.
          <fpage>779</fpage>
          <lpage>788</lpage>
          . IEEE Computer Society (
          <year>2016</year>
          ). https://doi.org/10.1109/CVPR.
          <year>2016</year>
          .91
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <string-name>
            <surname>Faster R-CNN</surname>
          </string-name>
          <article-title>: towards real-time object detection with region proposal networks</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>39</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1137</fpage>
          <lpage>1149</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1109/TPAMI.
          <year>2016</year>
          .2577031
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Russell</surname>
            ,
            <given-names>B.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murphy</surname>
            ,
            <given-names>K.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freeman</surname>
            ,
            <given-names>W.T.</given-names>
          </string-name>
          :
          <article-title>Labelme: A database and web-based tool for image annotation</article-title>
          .
          <source>Int. J. Comput. Vis</source>
          .
          <volume>77</volume>
          (
          <issue>1-3</issue>
          ),
          <fpage>157</fpage>
          -
          <lpage>173</lpage>
          (
          <year>2008</year>
          ). https://doi.org/10.1007/s11263-007-0090-8
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Sandler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhmoginov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          .
          <source>In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018</source>
          . pp.
          <fpage>4510</fpage>
          -
          <lpage>4520</lpage>
          . Computer Vision Foundation / IEEE Computer Society (
          <year>2018</year>
          ). https://doi.org/10.1109/CVPR.2018.00474
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>To</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>K.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Real-time social distancing alert system using pose estimation on smart edge devices</article-title>
          .
          <source>In: Proceedings of the 13th Asian Conference on Intelligent Information and Database Systems, ACIIDS 2021</source>
          .
          <source>Communications in Computer and Information Science</source>
          , vol.
          <volume>1371</volume>
          , pp.
          <fpage>291</fpage>
          -
          <lpage>300</lpage>
          . Springer (
          <year>2021</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dulloor</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andersen</surname>
            ,
            <given-names>D.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaminsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>EDF: ensemble, distill, and fuse for easy video labeling</article-title>
          . CoRR abs/1812.03626 (
          <year>2018</year>
          ), http://arxiv.org/abs/1812.03626
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>