<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identity Documents Recognition and Detection using Semantic Segmentation with Convolutional Neural Network</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mykola Kozlenko</string-name>
          <xref ref-type="aff" rid="aff1">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Sendetskyi</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oleksiy Simkiv</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nazar Savchenko</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andy Bosyi</string-name>
          <xref ref-type="aff" rid="aff0">1</xref>
        </contrib>
        <aff id="aff0">
          <label>1</label>
          <institution>MindCraft AI LLC</institution>,
          <addr-line>19 Lisna str., Lviv, 79010</addr-line>,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>2</label>
          <institution>Vasyl Stefanyk Precarpathian National University</institution>,
          <addr-line>57 Shevchenko str., Ivano-Frankivsk, 76018</addr-line>,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>234</fpage>
      <lpage>242</lpage>
      <abstract>
        <p>Object recognition and detection are well-studied problems with a developed set of almost standard solutions. Identity document recognition, classification, detection, and localization are tasks required in a number of applications, particularly in physical access control security systems at critical infrastructure premises. In this paper, we propose a new original architecture of a model based on an artificial convolutional neural network and the semantic segmentation approach for the recognition and detection of identity documents in images. The challenge with the processing of such images is the limited computational performance and the limited amount of memory when such an application runs on industrial one-board microcomputer hardware. The aim of this research is to prove the feasibility of the proposed technique and to obtain quality metrics. The methodology of the research is to evaluate the deep learning detection model trained on the mobile identity document video dataset. The dataset contains five hundred video clips for fifty different identity document types. The numerical results from simulations are used to evaluate the quality metrics. We present the results as accuracy versus the threshold of the intersection over union (IoU) value. The paper reports an accuracy above 0.75 for an IoU threshold value of 0.8. In addition, we assessed the size of the model and proved the feasibility of running it on industrial one-board microcomputer or smartphone hardware.</p>
      </abstract>
      <kwd-group>
        <kwd>Identity document</kwd>
        <kwd>object detection</kwd>
        <kwd>semantic segmentation</kwd>
        <kwd>document recognition</kwd>
        <kwd>document classification</kwd>
        <kwd>deep learning</kwd>
        <kwd>neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Almost every organization today uses access control security systems. Usually, employees use
special access cards, but this poses a problem for guests or people who visit a facility for the first time
and do not have an access card. In this case, the person can be identified
using the data of any official identity document. Identification can be done by detecting a
document in an image from a camera or scanner, followed by extraction of the text information.</p>
      <p>
        Object recognition and detection are well-studied problems with a developed set of almost
standard solutions. Identity document recognition, classification, detection, and localization are very
popular tasks in the computer vision area and are required in many security applications [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Nowadays, there are several classical approaches to object detection: the Viola-Jones object detection
framework based on Haar features [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], scale-invariant feature transform [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], histograms of oriented
gradients [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], etc. Also, object detection algorithms are implemented in popular frameworks and
libraries such as OpenCV and many others. There are many deep learning-based approaches as well
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this paper, we propose a new neural network (NN) architecture and investigate the
performance of the semantic segmentation-based approach for identity documents detection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In recent years, many successful approaches to object detection using deep learning have been
proposed. The R-CNN solution was proposed first in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Reference [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] presents the Fast R-CNN. The
Faster R-CNN is reported in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The following approaches are also well-known and widely used.
The Single Shot MultiBox Detector (SSD) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] approach is based on a feed-forward convolutional network
that produces a collection of bounding boxes and scores for the presence of object class instances.
One of the most popular object detectors is the You Only Look Once (YOLO) detector [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. YOLO
sees the entire image during training and test time, so it implicitly encodes contextual information
about classes [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. It outperforms all other detection methods, including R-CNN. There are also some
other well-known methods: Single-Shot Refinement Neural Network for Object Detection
(RefineDet) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], Retina-Net [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Deformable convolutional networks [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], and others. Reference
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is devoted to identity document recognition in a video stream. The paper [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] studies the problem
of image classification of identity documents composed of few textual information fields and complex
backgrounds. The proposed approach simultaneously locates the document and recognizes the class.
Paper [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] discusses the problem of simultaneous document type recognition and projective distortion
parameter estimation for the images of identity documents. The problem of face detection on identity
documents under unconstrained environments was studied in [17]. In [18], an original neural network
architecture for the semantic image segmentation task is proposed; it contains layers
calculating direct and transposed integral Fast Hough Transform operators.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>In this research, we use the Mobile Identity Document Video dataset (MIDV-500) [19]. It consists
of 500 video clips for 50 different identity document types with ground truth. The dataset contains
data on 17 types of ID cards, 14 types of passports, 13 types of driving licenses, and 6 other identity
documents of various countries. Each captured frame had the same resolution of 1920 by 1080 pixels.
There are the following cases in the dataset: the document lies on the table with homogeneous
background, the document lies on various keyboards, the document is held by a hand, the document is
partially hidden, the background is stuffed with unrelated objects. The total counts of train and test
samples are 10,500 and 4,500, respectively. Some instances of images are presented in Fig. 1. Fig. 1
also shows the detection results obtained using OpenCV (light green boundaries). There are several
images in which this approach works well, such as the bottom-right picture in the figure. For most of
the images, however, the conditions are too diverse: a simple image processing algorithm cannot cover
all the variety of colors, lighting, shadows, blur, and other differences. We converted the data in our
dataset into the following structure (refer to Fig. 2), where: the ‘path’ is the path to an image within
the dataset, the ‘x0’, ‘y0’, ‘x1’, ‘y1’, ‘x2’, ‘y2’, ‘x3’, ‘y3’ are the ground truth coordinates of
quadrilateral vertices of the document image, the ‘part’ is a number specifying a part of the dataset,
the ‘group’ is the background used in the image.</p>
      <p>The idea of the data import is simple: iterate over all the images, resize them, draw the ground
truth in a blank image, and store them in the corresponding variables. Then, we can simply return a
batch of a certain size.</p>
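      <p>A minimal sketch of such a loader, assuming the images are listed in a CSV file with the columns described above (the file name, the working resolution, and the helper names are illustrative assumptions, not the authors' exact code):</p>
      <preformat>
import cv2
import numpy as np
import pandas as pd

IMG_SIZE = 256  # illustrative working resolution

def load_dataset(csv_path):
    # Each row: path, x0..y3 (quadrilateral vertices), part, group (Fig. 2).
    df = pd.read_csv(csv_path)
    images, masks = [], []
    for _, row in df.iterrows():
        img = cv2.imread(row["path"])
        h, w = img.shape[:2]
        img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
        # Scale the ground-truth quadrilateral to the resized image.
        quad = np.array([[row["x0"], row["y0"]], [row["x1"], row["y1"]],
                         [row["x2"], row["y2"]], [row["x3"], row["y3"]]],
                        dtype=np.float32)
        quad[:, 0] *= IMG_SIZE / w
        quad[:, 1] *= IMG_SIZE / h
        # Draw the ground truth in a blank image: 1 inside the document.
        mask = np.zeros((IMG_SIZE, IMG_SIZE), dtype=np.float32)
        cv2.fillPoly(mask, [quad.astype(np.int32)], 1.0)
        images.append(img.astype(np.float32) / 255.0)
        masks.append(mask[..., np.newaxis])
    return np.stack(images), np.stack(masks)
      </preformat>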
    </sec>
    <sec id="sec-4">
      <title>4. Method and Model Design</title>
      <p>The proposed architecture of the artificial convolutional neural network (CNN) is presented in Fig.
3. The idea behind this model is as follows: we downsample the input image to a size of 8x8 while
learning features about most of the regions. Then, we pass those features to a few dense layers
that decide whether there is an identity document in the image and, if so, where it is
located. Finally, we combine that decision with the features calculated in the downsampling part.</p>
      <p>All the concatenate layers implement a kind of skip connection in the CNN. Despite the decision
layers inside the model, it is still a semantic segmentation network that produces a probability map
defining whether each pixel belongs to an identity document or not.</p>
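      <p>A minimal Keras sketch of this downsample-decide-upsample pattern with concatenate skip connections (the layer widths, depths, and exact wiring here are our own illustrative assumptions; the published configuration is the one shown in Fig. 3):</p>
      <preformat>
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_size=256):
    inputs = keras.Input(shape=(input_size, input_size, 3))
    # Encoder: downsample to an 8x8 feature map, keeping skips.
    x, skips = inputs, []
    for filters in (8, 16, 16, 32, 32):      # 256 -> 8 after five poolings
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    # Dense "decision" layers operating on the 8x8 features.
    d = layers.Flatten()(x)
    d = layers.Dense(64, activation="relu")(d)
    d = layers.Dense(8 * 8 * 8, activation="relu")(d)
    d = layers.Reshape((8, 8, 8))(d)
    x = layers.Concatenate()([x, d])
    # Decoder: upsample back, concatenating encoder features (skips).
    for filters, skip in zip((32, 32, 16, 16, 8), reversed(skips)):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    # Per-pixel probability of belonging to an identity document.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return keras.Model(inputs, outputs)
      </preformat>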
      <p>For the architecture details, data dimensionality, hyper-parameters, and the numbers of neurons in the
layers, refer to Fig. 3. The optimizer is the Keras built-in Adam. The learning rate is 0.001. The number
of training epochs is 60. The loss function is binary cross-entropy. The metrics are the following:
accuracy, precision, recall. There are a total of 198,273 trainable model parameters. The size of the
model is 832 KiB. It appears to be small enough to run on a smartphone or one-board microcomputer.</p>
      <p>We used the TensorFlow [20] and Keras [21] frameworks in our work. The Tensorboard was used
for the visualization of training scalars and neural network structures.</p>
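      <p>With the hyper-parameters stated above, compilation and training might look as in the following sketch (the data variables come from the loader sketched in Section 3, the batch size of 32 is the one reported in Section 5, and the log directory name is an assumption):</p>
      <preformat>
import tensorflow as tf

model = build_model()
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall")],
)
# The Tensorboard callback logs the training scalars and the model graph.
tensorboard = tf.keras.callbacks.TensorBoard(log_dir="logs")
model.fit(train_images, train_masks,
          validation_data=(test_images, test_masks),
          batch_size=32, epochs=60, callbacks=[tensorboard])
      </preformat>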
    </sec>
    <sec id="sec-5">
      <title>5. Training and Evaluation</title>
      <p>Training of the model was performed using a conventional server with an Intel(R) Core(TM)
i7-9700K CPU, 3.60 GHz, and 64 GiB of RAM. The training procedure takes approximately 9 ms per
one sample, 290 ms per step (batch), 95 seconds per epoch. The number of samples per gradient
update (the batch size) is 32. The training and validation loss, accuracy, precision, and recall versus
epoch number are presented in Fig. 4 and 5. Values are taken at the end of each epoch.</p>
      <p>We evaluated the model with a post-predict procedure. The test set went through the
prediction method; after that, the predictions were compared to the ground truth and the confusion matrix
was derived. The following class-wise metrics were obtained from the confusion matrix: accuracy,
true positive rate (TPR, recall), positive predictive value (PPV, precision), etc.</p>
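      <p>A sketch of how these class-wise metrics follow from the pixel-level confusion matrix (the 0.5 binarization threshold matches the post-processing described in Section 6):</p>
      <preformat>
import numpy as np

pred = model.predict(test_images) > 0.5   # binarize the probability maps
truth = test_masks > 0.5
# Pixel-wise confusion matrix over the whole test set.
tp = np.sum(np.logical_and(pred, truth))
fp = np.sum(np.logical_and(pred, ~truth))
fn = np.sum(np.logical_and(~pred, truth))
tn = np.sum(np.logical_and(~pred, ~truth))
accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)       # true positive rate (TPR)
precision = tp / (tp + fp)    # positive predictive value (PPV)
      </preformat>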
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>The plot of accuracy versus Intersection over Union (IoU) threshold value is presented in Fig. 6.
We achieved an accuracy value of 0.77 for an IoU threshold value of 0.8 on the test set. That is much
better in comparison with the simple OpenCV-based approach (accuracy value of 0.32 for this
dataset). An example of the model input, ground truth, and prediction is presented in Fig. 7.</p>
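      <p>For reference, the IoU between a predicted mask and the ground truth, and the resulting accuracy at a given threshold, can be computed as in this sketch (standard definitions, written out for clarity):</p>
      <preformat>
import numpy as np

def iou(pred_mask, true_mask):
    # Intersection over Union of two binary masks.
    inter = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return inter / union if union else 0.0

def accuracy_at_iou(pred_masks, true_masks, threshold=0.8):
    # Fraction of test images whose IoU exceeds the threshold (Fig. 6).
    scores = [iou(p, t) for p, t in zip(pred_masks, true_masks)]
    return float(np.mean([s >= threshold for s in scores]))
      </preformat>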
      <p>After the NN makes its prediction on the resized version of a given image, we threshold the result
at 0.5 and search it for all the contours. After smoothing each contour, we check whether it has four
edges and occupies at least the minimum allowed area. If so, we check whether it is the biggest among
such contours. The selected contour is rescaled to the size of the input image, and the rectangle is
extracted using OpenCV tools. The result is shown in Fig. 8.</p>
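      <p>A sketch of this post-processing with OpenCV (the smoothing tolerance, the minimum-area fraction, the output rectangle size, and the vertex ordering are illustrative assumptions):</p>
      <preformat>
import cv2
import numpy as np

def extract_document(prob_map, input_image, min_area_frac=0.05):
    # Threshold the probability map at 0.5 and search for contours.
    mask = (prob_map[..., 0] > 0.5).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    best = None
    for c in contours:
        # Smooth the contour; keep quadrilaterals of sufficient area.
        approx = cv2.approxPolyDP(c, 0.02 * cv2.arcLength(c, True), True)
        if len(approx) == 4 and cv2.contourArea(approx) >= min_area_frac * mask.size:
            if best is None or cv2.contourArea(approx) > cv2.contourArea(best):
                best = approx
    if best is None:
        return None
    # Rescale the selected contour to the input image size.
    h, w = input_image.shape[:2]
    quad = best.reshape(4, 2).astype(np.float32)
    quad[:, 0] *= w / mask.shape[1]
    quad[:, 1] *= h / mask.shape[0]
    # Extract the rectangle (assumes the vertex order matches dst).
    dst = np.array([[0, 0], [639, 0], [639, 399], [0, 399]], dtype=np.float32)
    m = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(input_image, m, (640, 400))
      </preformat>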
      <p>Time complexity is one of the most important issues related to real-time data processing. We
estimated the run time of the detection by measuring the processing time of one image on the target
hardware platform. The average processing time of one image is 8 ms, so it is possible to perform
real-time object detection on the hardware platform mentioned above. We provide detailed Python 3
code of the working prototype in [22].</p>
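      <p>The per-image run time can be measured with a simple wall-clock loop such as the sketch below (warm-up and batching effects are ignored here):</p>
      <preformat>
import time
import numpy as np

def seconds_per_image(model, images, repeats=100):
    start = time.perf_counter()
    for i in range(repeats):
        model.predict(images[i % len(images)][np.newaxis], verbose=0)
    return (time.perf_counter() - start) / repeats
      </preformat>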
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion and discussion</title>
      <p>The overall purpose of the study was to prove the feasibility of efficient identity document
detection using a convolutional neural network of the proposed architecture. Our main finding
suggests that the use of the proposed CNN has an acceptable outcome. CNN layers, acting as feature
extractors, and dense layers are computational structures that are easy to implement on modern
hardware platforms such as smartphones, microcontrollers, and industrial one-board microcomputers.
They can also be implemented easily using modern software frameworks. So, it is possible to build
different applications and services using this approach. As stated above, the accuracy of the method is
high enough. An important advantage of the proposed method is the ability to be continually retrained
on new data, which makes it easy to adapt to new conditions and image properties.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and further research</title>
      <p>A concern about the study is that only one dataset was used. Other data might
have different properties; therefore, there is a need to evaluate the model on other data. In addition,
the issue of hyperparameter tuning is still to be studied. The limitations of the study are not fatal and will be
addressed in our future research. Also, we are planning to apply this semantic segmentation-based
deep learning approach to process one-dimensional [23] and three-dimensional LiDAR data.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Acknowledgment</title>
      <p>The authors gratefully acknowledge the contributions of scientists of the MindCraft AI LLC and
the Department of Information Technology of the Vasyl Stefanyk Precarpathian National University
for scientific guidance given in discussions and technical assistance helped in the actual research.
10. Disclosures</p>
      <p>The authors declare that there is no conflict of interest.
11. References
[17] S. Bakkali, M. M. Luqman, Z. Ming, J. Burie, Face Detection in Camera Captured Images of
Identity Documents Under Challenging Conditions, in: International Conference on Document
Analysis and Recognition Workshops, ICDARW, Sydney, Australia, 2019, pp. 55–60,
doi:10.1109/ICDARW.2019.30065.
[18] A. Sheshku, D. Nikolaev, V. L. Arlazaro, Houghencoder: Neural Network Architecture for
Document Image Semantic Segmentation, in: IEEE International Conference on Image
Processing, ICIP, Abu Dhabi, United Arab Emirates, 2020, pp. 1946–1950,
doi:10.1109/ICIP40778.2020.9191182.
[19] V. Arlazarov, K. Bulatov, T. Chernov, V. Arlazarov, MIDV-500: a dataset for identity document
analysis and recognition on mobile devices in video stream, Computer Optics 43.5 (2019) 818–
824. doi:10.18287/2412-6179-2019-43-5-818-824.
[20] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,</p>
      <p>M. Isard, Tensorflow: a system for large-scale machine learning, OSDI 16 (2016) 265–283.
[21] F. Chollet, Keras, 2015. URL: https://keras.io
[22] A. Simkiv, Practical Guide to Semantic Segmentation, 2020. URL:
https://towardsdatascience.com/practical-guide-to-semantic-segmentation-7c55b540489c
[23] M. Kozlenko, I. Lazarovych, V. Tkachuk, V. Vialkova, Software Demodulation of Weak Radio
Signals using Convolutional Neural Network, in: IEEE 7th International Conference on Energy
Smart Systems, ESS, Kyiv, Ukraine, 2020, pp. 339–342, doi:10.1109/ESS50319.2020.9160035.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] S. Dasiopoulou, V. Mezaris, I. Kompatsiaris, V. Papastathis, M. G. Strintzis, Knowledge-assisted semantic video object detection, IEEE Transactions on Circuits and Systems for Video Technology 15.10 (2005). doi:10.1109/TCSVT.2005.854238.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] D. Peleshko, K. Soroka, Research of usage of Haar-like features and AdaBoost algorithm in Viola-Jones method of object detection, in: 12th International Conference on the Experience of Designing and Application of CAD Systems in Microelectronics, CADSM, IEEE, 2013, pp. 284–286.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] W. Cheung, G. Hamarneh, n-SIFT: n-Dimensional Scale Invariant Feature Transform, IEEE Transactions on Image Processing 18.9 (2009) 2012–2021. doi:10.1109/TIP.2009.2024578.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] N. Dalal, B. Triggs, Histograms of Oriented Gradients for Human Detection, in: International Conference on Computer Vision &amp; Pattern Recognition, CVPR '05, San Diego, United States, 2005, pp. 886–893.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, in: IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, 2014, pp. 580–587, doi:10.1109/CVPR.2014.81.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] R. Girshick, Fast R-CNN, in: IEEE International Conference on Computer Vision, ICCV, Santiago, 2015, pp. 1440–1448, doi:10.1109/ICCV.2015.169.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 39.6 (2017) 1137–1149. doi:10.1109/TPAMI.2016.2577031.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, SSD: Single Shot MultiBox Detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), Computer Vision - ECCV, 2016, pp. 21–37. doi:10.1007/978-3-319-46448-0_2.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, in: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Las Vegas, NV, 2016, pp. 779–788, doi:10.1109/CVPR.2016.91.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] S. Zhang, L. Wen, X. Bian, Z. Lei, S. Z. Li, Single-Shot Refinement Neural Network for Object Detection, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018, pp. 4203–4212, doi:10.1109/CVPR.2018.00442.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal Loss for Dense Object Detection, in: IEEE International Conference on Computer Vision, ICCV, Venice, 2017, pp. 2999–3007, doi:10.1109/ICCV.2017.324.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] X. Zhu, H. Hu, S. Lin, J. Dai, Deformable ConvNets V2: More Deformable, Better Results, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 2019, pp. 9300–9308, doi:10.1109/CVPR.2019.00953.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] K. Bulatov, V. V. Arlazarov, T. Chernov, O. Slavin, D. Nikolaev, Smart IDReader: Document Recognition in Video Stream, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR, Kyoto, 2017, pp. 39–44, doi:10.1109/ICDAR.2017.347.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. M. Awal, N. Ghanmi, R. Sicre, T. Furon, Complex Document Classification and Localization Application on Identity Document Images, in: 14th IAPR International Conference on Document Analysis and Recognition, ICDAR, Kyoto, 2017, pp. 426–431, doi:10.1109/ICDAR.2017.77.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] N. Skoryukina, V. Arlazarov, D. Nikolaev, Fast Method of ID Documents Location and Type Identification for Mobile and Server Application, in: International Conference on Document Analysis and Recognition, ICDAR, Sydney, Australia, 2019, pp. 850–857, doi:10.1109/ICDAR.2019.00141.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>