<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Document Understanding: Problems and Technological Solutions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kristina Arnaoudova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maria Nisheva</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AI Lab at IBS Bulgaria</institution>
          ,
          <addr-line>4 Pimen Zografski Str., Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski</institution>
          ,
          <addr-line>5 James Bourchier Blvd., 1164 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institute of Mathematics and Informatics, Bulgarian Academy of Sciences</institution>
          ,
          <addr-line>acad. G. Bonchev Str., Block 8, 1113 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <fpage>148</fpage>
      <lpage>158</lpage>
      <abstract>
        <p>The paper analyzes the most significant issues related to information extraction from documents and discusses some theoretical models suitable for recognition. It briefly presents the main components of a hybrid approach that combines a domain ontology with deep learning techniques to recognize and classify the document structure efficiently. The approach uses a symbolic logic inference model to exploit knowledge about the document semantics. The document's segments correspond to the conceptual purpose of the document and are recognized by applying a modern object detection architecture. An appropriate domain ontology addresses the issues related to the semantic completeness of the document's data. Several experiments have been carried out, and the results have been analyzed in terms of the applicability of various modern technologies to the implementation of a document understanding system.</p>
      </abstract>
      <kwd-group>
        <kwd>document processing</kwd>
        <kwd>document layout understanding</kwd>
        <kwd>ontology</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automatic information extraction from documents is not a new problem, and manual
document processing remains a major cost driver in organizations. Meaningful results
have been achieved in text recognition using a variety of Artificial Intelligence methods.
The traditional way to digitize documents is optical character recognition (OCR),
which recognizes text and is a proven and workable approach. OCR, though,
does not retain the formatting elements of the template, and its precision in recognizing
symbols drops in the presence of tables and other items.
Modern versions of optical recognition also use neural networks,
with radical improvements in performance. At the same time, however, the
problem of text documents formatted in a specific way is substantial and not
entirely resolved. When tables or other graphical elements are used, recognition
libraries not only fail to retain the formatting, but the presence of such elements can
also dramatically worsen the results. Even when the format is known in advance, small
deviations from it can be challenging.</p>
      <p>Moreover, a hierarchical structure requires additional knowledge about
the relationships among the image components. Processing the formatting
elements may need a detailed definition of the template in terms of relations
between particular regions of the document. Convolutional neural networks
are revolutionary in recognition but are still not designed for interpreting spatial
relations.</p>
      <p>Therefore, we consider a hybrid approach, which uses a domain ontology to
describe the document sections and their relations, well suited to filling that
gap. The particular approach we propose is a top-down one and envisages three
fundamental phases: concept recognition, template recognition, and semantic
understanding. It enables inference that complements the annotated image segments
and formatting elements, considering that recognizing the formatting
elements of the document is key to successful text extraction using OCR.</p>
    </sec>
    <sec id="sec-2">
      <title>Main components and phases</title>
      <sec id="sec-2-1">
        <title>2.1 Conceptual knowledge – domain ontology</title>
        <p>An ontology [1] is a formalization of a conceptual representation of a domain.
Human recognition capabilities can be seen as a combination of visual perception
and inference based on complex sampled data. Computer vision is inspired by
human perception and performs the classification task by learning basic recognition from
repeated examples. The idea of using ontologies for image interpretation is not
new; it involves classification that follows the definitions of the concepts, which should
be provided in the way humans would give them [2]. Some of the first efforts in this direction are
discussed in [3] and [4].</p>
        <p>Ontologies enable the application of symbolic logic inference mechanisms for
classification. The ontological approach manages and integrates different
sources and is a widely adopted way of structuring unstructured
data. The document content, as an unstructured body of information, can be
formalized using appropriate ontologies. On the other hand, the document is an
image, and the data may be represented in a hybrid manner that uses the best of
both approaches: typical computer vision and knowledge-based inference. The
significance of this hybrid approach lies in the potential reduction of the
cost of generating recognition models, because less training data has to be annotated.
The ontology’s global definition can be reused without significant changes.
Therefore, our attempt to fully understand the document follows the idea
of a hybrid approach, using the ontology as the primary source of domain knowledge
together with deep learning, which makes it possible to:
<list list-type="bullet">
<list-item><p>map the formatting elements,</p></list-item>
<list-item><p>perform template classification,</p></list-item>
<list-item><p>semantically validate the document content.</p></list-item>
</list></p>
        <p>Using an ontology with mapped concepts could significantly reduce the
resources needed for document type classification. Following the hybrid approach, the
defined classes are mapped to the semantic elements of the documents, which enables
additional inference steps for image template recognition. The
ontology defines both the concepts and the formatting elements. For example, an ontology
of documents could describe concepts such as person, organization, document, issuer,
entitled, etc., and formatting elements such as title, table, paragraph, box, n-dimensional
grid, point, coordinates, etc.</p>
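        <p>As an illustration only, the split between semantic concepts and formatting elements can be sketched as a small in-memory structure. The class name Concept, the helpers build_ontology and is_a, the umbrella concepts "agent" and "formatting_element", and the parent links chosen here are all hypothetical; they are assumptions made for the example and do not reproduce the ontology developed for the experiments.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Concept:
    """One node of a toy ontology: a name, a kind and an optional parent."""
    name: str
    kind: str                      # "semantic" or "formatting" (assumed split)
    parent: Optional[str] = None   # name of the more general concept, if any


def build_ontology() -> Dict[str, Concept]:
    """Build a tiny, hypothetical document ontology keyed by concept name."""
    concepts = [
        # Semantic concepts named in the text; the hierarchy is an assumption.
        Concept("agent", "semantic"),
        Concept("person", "semantic", parent="agent"),
        Concept("organization", "semantic", parent="agent"),
        Concept("issuer", "semantic", parent="organization"),
        Concept("entitled", "semantic", parent="person"),
        Concept("document", "semantic"),
        # Formatting elements named in the text.
        Concept("formatting_element", "formatting"),
        Concept("title", "formatting", parent="formatting_element"),
        Concept("table", "formatting", parent="formatting_element"),
        Concept("paragraph", "formatting", parent="formatting_element"),
        Concept("box", "formatting", parent="formatting_element"),
    ]
    return {c.name: c for c in concepts}


def is_a(ontology: Dict[str, Concept], name: str, ancestor: str) -> bool:
    """Follow parent links upward to test whether `name` is subsumed by `ancestor`."""
    current = ontology.get(name)
    while current is not None:
        if current.name == ancestor:
            return True
        current = ontology.get(current.parent) if current.parent else None
    return False
```
]]></preformat>
        <p>With such a structure, a query like is_a(ontology, "issuer", "agent") walks the parent links, which is the simplest form of the subsumption reasoning that a real ontology language and reasoner would provide.</p>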
        <p>We have validated the concepts on a dataset of sick leave documents. Examples
of semantic concepts from the experimental domain ontology that is
under development for this research are shown in Table 1 and Fig. 1.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Template classification</title>
        <p>An essential step in the successful implementation of information extraction is
the definition and recognition of the template. The template is understood as a
model that groups similar primitive formatting patterns. Recognizing it is an essential step
in processing the image and is carried out with image processing techniques. Depending
on the identified formatting elements, various image-processing techniques suited
to the type of formatting element are applied as a preprocessing step in the training
and inference stages. The formatting pattern can be classified
using the spatial relations described in the ontology as metadata. The template
is identified through the spatial relationships between concepts, where the spatial
relations within a document refer to the relative positions of the concepts on the document
grid. Recognizing the template thus amounts to classifying the common document
format pattern described in the ontology. Fig. 2 shows the main
steps of the process.</p>
        <p>The final document template is the combination of the formatting elements
and the detected objects of the document type. Knowing the document type may
significantly improve the confidence score of object recognition, because type-specific
processing can be applied. The classification is based on metadata that includes the concepts and
their spatial relations, given by their relative grid positions on the document image.</p>
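        <p>A minimal sketch of classification by grid-relative position, under stated assumptions: detected regions are given as pixel boxes, the page is divided into a fixed grid, and a template is described by a list of required relations. The function names grid_cell, spatial_relations and classify_template, the 4 x 4 grid, the relation names "above" and "left_of", and the template name used in the usage note are hypothetical and not taken from the described system.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
Cell = Tuple[int, int]                    # (row, col) on the document grid
Relation = Tuple[str, str, str]           # (concept, relation, concept)


def grid_cell(box: Box, page_w: float, page_h: float,
              rows: int = 4, cols: int = 4) -> Cell:
    """Return the grid cell that contains the centre of the box."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    col = min(int(cx / page_w * cols), cols - 1)
    row = min(int(cy / page_h * rows), rows - 1)
    return row, col


def spatial_relations(cells: Dict[str, Cell]) -> List[Relation]:
    """Derive pairwise relations from the grid positions of the concepts."""
    relations: List[Relation] = []
    names = sorted(cells)
    for a in names:
        for b in names:
            if a == b:
                continue
            (row_a, col_a), (row_b, col_b) = cells[a], cells[b]
            if row_a < row_b:
                relations.append((a, "above", b))
            if row_a == row_b and col_a < col_b:
                relations.append((a, "left_of", b))
    return relations


def classify_template(relations: List[Relation],
                      templates: Dict[str, List[Relation]]) -> str:
    """Return the first template whose required relations are all present."""
    found = set(relations)
    for name, required in templates.items():
        if set(required) <= found:
            return name
    return "unknown"
```
]]></preformat>
        <p>For example, a page with a title in the top row and a person region to the left of an issuer region in the row below satisfies a hypothetical template that requires ("title", "above", "person") and ("person", "left_of", "issuer"). In the described approach such relations would be stored in the ontology as metadata instead of a Python dictionary.</p>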
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Semantic understanding</title>
        <p>The convolutional neural network object detectors identify the regions of interest,
and the OCR extracts the text successfully. The template mapping uses the
recognized formatting elements embedded in the regions. The understanding,
however, comes with the inferred knowledge about the concepts of the documents.
Our hybrid approach adds a processing step that checks the semantic truthfulness of the
extracted data and explains it, using a symbolic logic model and inference
algorithms over different types of information, in particular data from the detector
and conceptual knowledge. The metadata is used for semantic validation of the
recognized objects against the ontology axioms and rules. The expected effect
is to leverage the recognition by the deep learning algorithm and to generate an
explanation of the results in terms of the template used. The concepts recognized
by the object detector are processed by linking them to existing axioms of such
basic concepts to form other, more complex, defined concepts. An example in
the context of the sick leave document is validating whether the recommended
regime is a logically valid result for the given diagnosis.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Convolutional neural networks</title>
        <p>
Convolutional neural networks (CNN) originate from the work of Kunihiko
Fukushima (Neocognitron) in 1980 [5] and Yann LeCun et al. (LeNet-5) in 1998
[6]. The name CNN comes from one of the most important operations in the
network, the convolution. The convolution is performed on the input
data with a kernel (filter) to produce a feature map. It is executed by sliding
the filter over the input. At every location, the filter and the underlying input patch
are multiplied element-wise, and the sum of the products is written to the feature map. Convolutional neural
nets have revolutionized speech recognition and object detection. CNN leverages three
important principles: sparse connectivity, shared parameters, and equivariance
to translation. Sparse connectivity means that each output depends only on a small input patch, which reduces the number of parameters;
parameter sharing means that the same parameter is used for more than one function in the model, which makes the
network considerably more efficient in memory and computation. Equivariance to translation
means that if the input is translated to some extent, the output changes in the same
way. Among the very successful methods for object detection that are in the focus
of AI, one can mention Faster R-CNN, RPN, Mask R-CNN, and FCN.</p>
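        <p>The sliding-kernel operation described above can be sketched in a few lines of plain Python. This is a didactic sketch without padding or stride, not the implementation used in any of the cited frameworks; the function name conv2d_valid is an assumption of the example. As in most deep learning libraries, the kernel is not flipped, so strictly speaking the code computes a cross-correlation.</p>
        <preformat><![CDATA[
```python
from typing import List

Matrix = List[List[float]]


def conv2d_valid(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide `kernel` over `image` and return the resulting feature map.

    At each location the kernel and the image patch under it are multiplied
    element-wise and the products are summed ("valid" mode: no padding).
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map: Matrix = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            acc = 0.0
            for di in range(k_rows):
                for dj in range(k_cols):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        feature_map.append(row)
    return feature_map


# A 1 x 2 kernel that responds to a dark-to-bright vertical edge.
EDGE_KERNEL: Matrix = [[-1.0, 1.0]]

# The same edge at two horizontal positions, one pixel apart.
edge_at_2: Matrix = [[0, 0, 1, 1]] * 3
edge_at_3: Matrix = [[0, 0, 0, 1]] * 3

response_a = conv2d_valid(edge_at_2, EDGE_KERNEL)   # peak in column 1
response_b = conv2d_valid(edge_at_3, EDGE_KERNEL)   # peak in column 2
```
]]></preformat>
        <p>Shifting the edge by one pixel shifts the peak of the response by one position, which is the translation behaviour mentioned in the text. The same two kernel weights are reused at every location, which illustrates parameter sharing.</p>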
        <p>In our experiments, we successfully apply a convolutional neural network
to automatic document processing, detecting the main concepts
and the elements embedded in them. The segments of the
document image are annotated according to the main classes,
which represent the basic concepts of the document’s domain. We have experimented with several convolutional
models, among them Mask R-CNN.</p>
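        <p>The semantic validation step of Section 2.3, checking whether the recommended regime is consistent with the diagnosis, can be sketched as a rule lookup over the extracted fields. The sketch below is only an illustration: the table ALLOWED_REGIMES, its diagnosis and regime values, the field names, and the function validate_sick_leave are hypothetical placeholders with no medical meaning, and a plain dictionary stands in for the ontology axioms and rules.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Set

# Hypothetical rule base: diagnosis -> regimes considered consistent with it.
# The entries are placeholders for illustration only.
ALLOWED_REGIMES: Dict[str, Set[str]] = {
    "fracture": {"home", "hospital"},
    "influenza": {"home"},
}

REQUIRED_FIELDS = ("person", "diagnosis", "regime")


def validate_sick_leave(fields: Dict[str, str]) -> List[str]:
    """Return human-readable violations; an empty list means consistent."""
    problems: List[str] = []
    # Completeness check: every required concept must have been extracted.
    for name in REQUIRED_FIELDS:
        if not fields.get(name):
            problems.append(f"missing field: {name}")
    # Consistency check: the regime must be allowed for the diagnosis.
    diagnosis = fields.get("diagnosis")
    regime = fields.get("regime")
    if diagnosis and regime:
        allowed = ALLOWED_REGIMES.get(diagnosis)
        if allowed is None:
            problems.append(f"unknown diagnosis: {diagnosis}")
        elif regime not in allowed:
            problems.append(
                f"regime '{regime}' is not valid for diagnosis '{diagnosis}'"
            )
    return problems
```
]]></preformat>
        <p>The returned messages double as a simple explanation of why an extraction result was rejected, in the spirit of the explanation generation mentioned above.</p>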
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Object detection</title>
      <p>An object recognition algorithm identifies which objects are present in an image.
It takes the entire image as input and outputs the class labels and class probabilities
of the objects present in that image. For example, a class label could be “cat,” and the
associated class probability could be 97%. Object detection methods additionally output
bounding boxes with coordinates, that is, they predict where in the image
each object is located.</p>
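      <p>The output of a detector, as described above, can be modelled as a list of records holding a label, a probability, and a bounding box. The following sketch assumes this shape; the names Detection and keep_confident and the 0.5 threshold are illustrative choices, not part of any particular framework.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    """One detected object: class label, class probability and location."""
    label: str
    score: float                               # class probability in [0, 1]
    box: Tuple[float, float, float, float]     # (x_min, y_min, x_max, y_max)


def keep_confident(detections: List[Detection],
                   threshold: float = 0.5) -> List[Detection]:
    """Drop low-confidence detections and sort the rest by descending score."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)
```
]]></preformat>
      <p>A recognition-only model would fill in just label and score; the box field is what an object detection method adds.</p>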
      <p>A convolutional neural network for object detection classifies the objects and
finds their location within the document in terms of detected regions. Annotations
may differ according to the required semantics. We have tried various
approaches to conceptualizing, through annotation, the image area segmentation used for recognition.
We have chosen to label regions following the predefined
concepts, such as person, experts, purpose, subject, etc., as well as the formatting
elements, such as box, table, grid, etc. The recognized regions represent complex
concepts within the document, each a data structure comprising different attributes.
In the course of our experiments, we encountered quite a few challenges, among them
fonts, scale, quality, small objects, and the number of recognized objects.
To overcome the various limitations, we have applied different image processing
techniques. One of the main challenges encountered is the recognition of small
symbols and overlapping elements. Another obstacle is the limit that the learning algorithm
imposes on the number of recognized objects. The results are substantial, though
the task of document understanding needs both objects and the relations among them.
Modeling such relations is not one of the designed purposes of CNN; for this reason, we single out the use
of ontology-engineered concepts and reasoning as an integrated step of the document
understanding solution.</p>
      <p>In the experiments, we have used IBM PowerAI Vision, a software system that
implements the most modern convolutional neural network architectures for computer vision.
We have conducted experiments with different architectures,
among them Mask R-CNN as implemented by Detectron (FAIR).</p>
      <sec id="sec-3-1">
        <title>3.1 Detectron FAIR</title>
        <p>
          Detectron, a software system developed by Facebook AI Research (FAIR), implements advanced algorithms for
detecting objects, including the Mask R-CNN architecture [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Some results from
Mask R-CNN are shown in Fig. 3.
        </p>
        <sec id="sec-3-1-1">
          <title>Mask R-CNN</title>
          <p>Mask R-CNN efficiently detects objects in an image while simultaneously
generating a high-quality segmentation mask for each instance. The architecture
of Mask R-CNN extends Faster R-CNN by adding a branch for predicting an
object mask in parallel with the existing branch for bounding box recognition.
Mask R-CNN is a deep neural network aimed at solving instance segmentation
problems in computer vision. In other words, it can separate different objects and
return their bounding boxes, classes, and masks. Mask R-CNN works in two stages.
First, it generates proposals for the regions of the input image where there might
be an object. Second, based on these proposals, it predicts the class of the object,
refines the bounding box, and generates a pixel-level mask of the object.</p>
          <p>Our experience shows that Mask R-CNN is the most efficient
document-processing model among those we tried. The achieved segmentation performance is 96%
accuracy.</p>
          <p>We have significantly reduced the implementation time by using IBM Visual
Insights (previously IBM PowerAI Vision), a software system for computer vision
and deep learning that implements the most modern CNN architectures, among
them Detectron.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>FCN – Fully convolutional network</title>
          <p>Mask R-CNN is a region-based convolutional neural network that has been
extended with a fully convolutional network (FCN) branch [8]. An FCN is built only
from locally connected layers, such as convolution, pooling, and upsampling.
No dense layer is used in this kind of architecture, which reduces the number of
parameters and the computation time. In addition, the network can work regardless of
the original image size, without requiring a fixed number of units at any stage,
given that all connections are local. Fig. 4 presents the Mask R-CNN architecture.</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>R-CNN: Regions with CNN features</title>
          <p>R-CNN denotes a group of architectures built from several different components, each of
which is a network. The main components are region proposal and the classification
of the proposed regions. R-CNN [9] is one of the first architectures used for object detection.
R-CNN first tries to construct a set of proposed regions. A proposed region is
an area of the image that has a high probability of containing an object.
To construct the proposed regions, external region proposal methods such as Selective
Search are used.</p>
          <p>Selective Search Algorithm:
<list list-type="order">
<list-item><p>Generate an initial sub-segmentation, producing many candidate regions.</p></list-item>
<list-item><p>Use a greedy algorithm to recursively combine similar regions into larger ones.</p></list-item>
<list-item><p>Use the generated regions to produce the final candidate region proposals.</p></list-item>
</list></p>
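          <p>The three steps above can be sketched as a greedy, bottom-up grouping. This is a strongly simplified illustration and not the published Selective Search method: the initial regions are given instead of being computed by a segmentation algorithm, and similarity is reduced to the difference of a single mean-intensity value, whereas the original method combines colour, texture, size, and fill measures. The names Region, touches, merge and propose_regions are hypothetical.</p>
          <preformat><![CDATA[
```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Region:
    box: Box
    intensity: float   # mean grey level; a stand-in for richer features
    size: int          # number of pixels in the region


def touches(a: Box, b: Box) -> bool:
    """True if the two boxes overlap or share a border."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def merge(a: Region, b: Region) -> Region:
    """Union of two regions with a size-weighted mean intensity."""
    box = (min(a.box[0], b.box[0]), min(a.box[1], b.box[1]),
           max(a.box[2], b.box[2]), max(a.box[3], b.box[3]))
    size = a.size + b.size
    intensity = (a.intensity * a.size + b.intensity * b.size) / size
    return Region(box, intensity, size)


def propose_regions(initial: List[Region]) -> List[Box]:
    """Greedily merge the most similar neighbouring regions.

    Every region that exists at some point of the grouping, initial or
    merged, contributes its bounding box as a candidate proposal.
    """
    regions = list(initial)
    proposals = [r.box for r in regions]          # step 1: initial regions
    while len(regions) > 1:
        pairs = [(a, b) for a, b in combinations(regions, 2)
                 if touches(a.box, b.box)]
        if not pairs:
            break
        # Step 2: pick the neighbouring pair with the most similar intensity.
        a, b = min(pairs, key=lambda p: abs(p[0].intensity - p[1].intensity))
        regions.remove(a)
        regions.remove(b)
        merged = merge(a, b)
        regions.append(merged)
        proposals.append(merged.box)
    return proposals                               # step 3: all candidates
```
]]></preformat>
          <p>Because merged regions are kept as proposals at every level of the grouping, the result contains candidates at several scales, from the small initial segments up to the box covering all of them.</p>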
          <p>In R-CNN, the input image is fed to a region proposal algorithm such as selective
search, and the CNN is run on each proposed region. The output of the CNN is
given to SVMs, which classify the detected objects. Therefore, if there are
2000 proposed regions, the CNN has to be run 2000 times.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>Fast R-CNN [10]</title>
          <p>This model is an improvement of R-CNN and is reported to be about 25x faster.
To avoid the overhead of running the CNN many times as in R-CNN, the input
image is fed to the CNN only once, which yields a feature map of the whole image;
the regions proposed by selective search are then mapped onto this feature map and classified.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>Faster R-CNN [11]</title>
          <p>Faster R-CNN is an improvement of Fast R-CNN in which a Region Proposal
Network (RPN) is used as the generator of proposed regions instead of selective search; Fig. 4
shows R-CNN, Fast R-CNN, and Faster R-CNN. Faster R-CNN is thus a combination
of Fast R-CNN and RPN. The first stage is to compute the features in the
feature map and then to suggest coordinates of the presumed locations of objects.
The proposed regions are given to a classifier for object classification. Faster
R-CNN is a highly effective approach to object detection and is reported to be about 250x
faster than R-CNN.</p>
        </sec>
        <sec id="sec-3-1-6">
          <title>RPN (Region Proposal Network)</title>
          <p>RPN provides a time-efficient way of generating region proposals (regions
of interest). It is more efficient than the selective search used in R-CNN and Fast
R-CNN. RPN ranks region boxes, called anchors, and proposes the ones most
likely to contain objects. Each RPN has a classifier and a regressor. To generate
proposals for the regions where objects are located, a small network is slid over the
convolutional feature map output by the last convolutional layer. The
anchors are centered at the position of the sliding window. They have different scales and
aspect ratios, and during training an anchor counts as positive when it has a significant Intersection-over-Union
(IoU) overlap with a ground truth box. RPN is itself a network that needs to be trained and
has its own loss function.</p>
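          <p>Anchor generation and the overlap criterion can be sketched as follows. The code is a simplified illustration for a single sliding-window position, not the RPN itself: the function names make_anchors, iou and label_anchors, the scales and aspect ratios, and the 0.7 and 0.3 thresholds are assumptions of the example (the thresholds follow values commonly used with Faster R-CNN).</p>
          <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def make_anchors(cx: float, cy: float,
                 scales: Sequence[float] = (64.0, 128.0),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Anchors of several scales and aspect ratios (w / h) centred at (cx, cy)."""
    anchors: List[Box] = []
    for scale in scales:
        for ratio in ratios:
            width = scale * ratio ** 0.5
            height = scale / ratio ** 0.5      # keeps the area at scale ** 2
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


def iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def label_anchors(anchors: List[Box], ground_truth: Box,
                  positive: float = 0.7, negative: float = 0.3) -> List[int]:
    """Label anchors for training: 1 = object, 0 = background, -1 = ignored."""
    labels: List[int] = []
    for anchor in anchors:
        overlap = iou(anchor, ground_truth)
        if overlap >= positive:
            labels.append(1)
        elif overlap < negative:
            labels.append(0)
        else:
            labels.append(-1)
    return labels
```
]]></preformat>
          <p>In a trained RPN, the classifier scores each anchor as object or background and the regressor adjusts its coordinates; labels such as the ones computed here are what the loss function compares those predictions against.</p>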
          <p>From the group of architectures discussed in this paper, we have experimented
with Faster R-CNN in the context of the sick leave document type. In our experiments, it was not very
accurate in determining the coordinates of the objects and could not recognize small
objects with sufficient accuracy.</p>
        </sec>
        <sec id="sec-3-1-7">
          <title>YOLO – You only look once [12]</title>
          <p>YOLO is one of the fastest CNNs and works in real time. It is highly efficient because
it does not repeatedly process segments of the picture when looking for different objects (see Fig. 6).
The algorithm is optimized for speed rather than accuracy. It is not able
to reliably recognize many small objects or fully overlapping elements. Some
elements in the sick leave document sample are too small for this type of network.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>An end-to-end implementation of effective, comprehensive document
information extraction can combine different inference mechanisms,
integrating several approaches. The fundamental recognition and localization
of different concepts are based on modern object detectors, and further
understanding is grounded in logically inferred relationships obtained through ontological
engineering. In our experiments, we organized the recognition process in
semantic blocks. The significant stages in the architecture are the ontology layer, object
detection, and a post-processing layer based on the template classification. We use
knowledge inferred from the ontology axioms and rules, and the
system undertakes different actions accordingly. The result of the object detection is reliable
and sufficiently accurate. Still, the extracted information should be additionally
processed or confirmed semantically, and the proposed hybrid approach
can be effectively applied for this purpose.</p>
      <p>Our experiments were conducted with a limited set of data; with such a small
amount of data, the classes cannot be assumed to be balanced. With more accumulated data and
continuous training, the object detection accuracy and the variety of the concepts
are expected to improve. Recognition of document elements is a widely needed task
that runs into difficulties in recognizing formatting. The experiments show that a
document can be successfully read automatically using a hybrid approach based
on a suitable domain ontology and deep learning object detection
neural networks. The next steps are to develop the ontology semantics further
and to improve feature extraction with maximum preservation of spatial
information, so that small objects can be recognized.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The AI laboratory at IBS Bulgaria has supported the presented research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
           Gruber,
          <string-name>
            <surname>T.</surname>
          </string-name>
          :
          <article-title>Toward Principles for the Design of Ontologies Used for Knowledge Sharing</article-title>
          .
          <source>International Journal of Human-Computer Studies</source>
          , Vol.
          <volume>43</volume>
          (
          <year>1995</year>
          ), pp.
          <fpage>907</fpage>
          -
          <lpage>928</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
           Porello,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Cristani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Ferrario</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          :
          <article-title>Integrating Ontologies and Computer Vision for Classification of Objects in Images</article-title>
          .
          <source>In Proceedings of the Workshop on Neural - Cognitive Integration (NCI @ KI</source>
          <year>2015</year>
          ),
          <source>PICS Publications of the Institute of Cognitive Science</source>
          Vol.
          <volume>3</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
           Straccia,
          <string-name>
            <given-names>U.</given-names>
            ,
            <surname>Visco</surname>
          </string-name>
          , G.:
          <article-title>Dlmedia: an ontology mediated multimedia information retrieval system</article-title>
          .
          <source>In Proceedings of the 2007 International Workshop on Description Logics DL2007</source>
          (Brixen-Bressanone, Italy, 8-10 June
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
           Town,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Ontological inference for image and video analysis</article-title>
          .
          <source>Mach. Vis. Appl.</source>
          ,
          <volume>17</volume>
          (
          <issue>2</issue>
          ):
          <fpage>94</fpage>
          -
          <lpage>115</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
           Fukushima,
          <string-name>
            <surname>K.</surname>
          </string-name>
          :
          <article-title>Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position</article-title>
          .
          <source>Biological Cybernetics</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>193</fpage>
          -
          <lpage>202</lpage>
          (
          <year>1980</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
           LeCun,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Haffner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Bottou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Object Recognition with Gradient-based Learning</article-title>
          .
          <source>LNCS</source>
          , Vol.
          <volume>1681</volume>
          , pp
          <fpage>319</fpage>
          -
          <lpage>345</lpage>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkioxari</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
          </string-name>
          , R.:
          <article-title>Mask R-CNN</article-title>
          .
          <source>arXiv:1703.06870</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title> Long</article-title>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Shelhamer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <surname>T.</surname>
          </string-name>
          :
          <article-title>Fully Convolutional Networks for Semantic Segmentation</article-title>
          .
          <source>arXiv:1411.4038v2 [cs.CV] 8 Mar 2015</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
           Girshick,
          <string-name>
            <given-names>R.</given-names>
            ,
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ,
            <surname>Malik</surname>
          </string-name>
          , J.:
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>2014 IEEE Conference on Computer Vision and Pattern Recognition</source>
          (Columbus, OH,
          <year>2014</year>
          ), pp.
          <fpage>580</fpage>
          -
          <lpage>587</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Fast R-CNN</article-title>
          .
          <source>2015 IEEE International Conference on Computer Vision</source>
          (Santiago,
          <year>2015</year>
          ), pp.
          <fpage>1440</fpage>
          -
          <lpage>1448</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>arXiv:1506.01497</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Divvala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>You only look once: Unified, real-time object detection</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition</source>
          (Las Vegas, NV,
          <year>2016</year>
          ), pp.
          <fpage>779</fpage>
          -
          <lpage>788</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>