<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Worm-like image descriptor for signboard classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksei Samarin</string-name>
          <email>aleksei.samarin@vk.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentin Malykh</string-name>
          <email>valentin.malykh@phystech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kazan Federal University</institution>
          ,
          <addr-line>Kazan</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>VK Research, Saint-Petersburg State University</institution>
          ,
          <addr-line>Saint-Petersburg</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We introduce a special image descriptor that is well suited for the classification of images containing various inscriptions. In order to demonstrate the effectiveness of the proposed solution, we evaluate a system based on the introduced descriptor on the problem of grouping commercial building facade photographs according to the type of services provided. Our system achieved state-of-the-art performance (0.28 in averaged F1) compared with classical CNN-based methods and a composite baseline.</p>
      </abstract>
      <kwd-group kwd-group-type="author">
        <title>Index Terms</title>
        <kwd>image descriptor</kwd>
        <kwd>image classification</kwd>
        <kwd>signboard recognition</kwd>
        <kwd>visual characteristics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Currently, in the field of applied marketing, problems related
to advertising sign recognition are highly relevant [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
      </p>
      <p>One of these issues is the classification of photographs of
advertising posters by the type of services provided. The problem
is quite difficult because of unique fonts and colors, different
label sizes, and varying shooting conditions.</p>
      <p>
        A decision on whether a photograph of an advertising sign
belongs to one or another category can be based both on the textual
information located on the sign and on pure visual features
extracted from the photograph [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. To date, many methods have demonstrated their effectiveness
in general image classification [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], although the classification of photographs of general
objects differs considerably from the classification of photographs
of advertising posters. One important feature of advertising signs
that distinguishes them from heterogeneous objects (like the ones
depicted in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]) and significantly complicates their classification is the
absence of convex elements. This feature lowers the efficiency of
heterogeneous image classification methods when they are used to
distribute signboard photographs into groups [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Another important feature of photographs of advertising
posters is the presence of textual information. In some cases, the
text shown on the signboard may contain information of key
importance for classifying an image. There are many document image
classification methods based on prior text recognition [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. Such methods demonstrate high effectiveness for scanned
document classification. However, factors such as the possible lack
of sufficient information in the text of an advertising sign, and
the difficulty of optical text recognition under the variable
angles, fonts, signage styles, and lighting (Fig. 1) that are
typical for photographs of advertising signs, make purely
text-based signboard photograph classification approaches
insufficient. We should also mention methods that use combined
classifiers retrieving textual information as well as pure visual
features in order to achieve better performance in signboard
photograph classification, e.g. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These methods also suffer from poor OCR quality. To overcome
this deficiency, we developed a new solution that avoids explicit
text retrieval and replaces it with the extraction of special
visual features from image patches containing text.
      </p>
      <p>
        We propose a neural network method based on the extraction of
several types of general visual features combined with the analysis
of a special image descriptor. This method is more efficient than
methods that use only visual information, methods based only on the
analysis of text recognized during photo processing, and combined
methods that use visual features together with explicit text
information retrieval [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
    </sec>
    <sec id="sec-1-1">
      <title>II. PROBLEM STATEMENT</title>
      <p>In this work we investigate the effectiveness of a special
image descriptor for the problem of classifying advertising sign
photographs by the type of services provided. The problem can be
formulated as follows: an input photograph Q containing a signboard
should be assigned to one of the classes C = {C_i}, where
i ∈ [0, N].</p>
      <p>
        In addition to the formal statement of the problem, we adopt
the following restrictions. Images are captured by a camera fixed
on a car moving along the roadway [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]; hence: a) they may contain visual defects such as sun glare
and noise, including defects that greatly impede optical text
recognition; b) angle, framing, lighting, and colour balance are
unknown and can vary significantly from shot to shot; c) the
relative size and placement of the signboard in a snapshot can also
vary greatly (Fig. 1).
      </p>
    </sec>
    <sec id="sec-1-2">
      <title>III. PROPOSED METHOD</title>
      <p>The proposed system contains several modules: a visual
features extraction module, a text detection module, and a special
descriptor module for image regions containing text. The general
architecture of our solution is presented in Fig. 2.</p>
      <p>
        The visual features extraction module is CNN-based, since such
feature extractors are effective for classifying images of
heterogeneous objects [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. However, classifiers based only on CNN-extracted features do
not achieve high performance in signboard photograph
classification [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The reason for this phenomenon is the specificity of
advertising posters in terms of general image features. In order to
improve the efficiency of our solution, we introduce an additional
feature type obtained from the areas of an input image that contain
text; we call it the worm-like descriptor. The post-processed
output of the evaluated worm-like descriptor is concatenated with
the CNN features obtained from the whole original image. The result
of the concatenation is projected onto a space of dimension 4
(according to the number of classes). Then we apply the SoftMax
function to the obtained vector and interpret the resulting values
as the probabilities of the target classes.
      </p>
    </sec>
    <sec id="sec-2">
      <title>A. General image features extractor</title>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we use MobileNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] as a general image
descriptor. The features are extracted from the whole input image
in order to retrieve significant information from the background,
which helps to establish the type of services provided. The
MobileNet model itself is a sequence of convolutional and fully
connected layers with residual connections [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
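      <p>A minimal sketch of such an extractor, using torchvision's
MobileNetV2 as a readily available stand-in for MobileNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] (the global pooling into a flat descriptor is our assumption):</p>
      <preformat>
# Whole-image feature extraction with a MobileNet backbone (sketch).
import torch
from torchvision import models

backbone = models.mobilenet_v2(weights="DEFAULT")
backbone.eval()

def extract_features(image_batch):
    # image_batch: float tensor (B, 3, 224, 224), ImageNet-normalized.
    with torch.no_grad():
        fmap = backbone.features(image_batch)  # (B, 1280, 7, 7)
        pooled = fmap.mean(dim=[2, 3])         # global average pooling
    return pooled                              # (B, 1280) image descriptor
      </preformat>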
    </sec>
    <sec id="sec-3">
      <title>B. Text detector</title>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we use EAST text detector [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] to localize text positions in the scene. EAST is a fast
multi-channel CNN-based architecture resistant to varying shooting
angles.
      </p>
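      <p>As an illustration, EAST can be run through OpenCV's dnn
module; a hedged sketch (the frozen-graph file name and output layer
names follow the common OpenCV example and are assumptions here):</p>
      <preformat>
# Running the EAST detector via OpenCV (sketch).
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")  # assumed model file
image = cv2.imread("signboard.jpg")
blob = cv2.dnn.blobFromImage(image, 1.0, (320, 320),
                             (123.68, 116.78, 103.94),
                             swapRB=True, crop=False)
net.setInput(blob)
# Text score map and rotated-box geometry produced by the EAST graph.
scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                "feature_fusion/concat_3"])
      </preformat>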
    </sec>
    <sec id="sec-4">
      <title>C. Worm-like image descriptor</title>
      <p>We introduce a special type of descriptor for images
containing textual information. We aim to develop a method that
has low computational complexity and can be computed in
parallel.</p>
      <p>The most important feature of this descriptor is based
on the idea of obtaining the maximal information from the
mutual arrangement of regions with the maximum brightness
variation. The second feature of poster images that we exploit is
the repeating nature of the characters. Thus, local differences
between the appearance of the first and last characters of a word
can be described independently and in the same terms. Using these
considerations, we construct a picture descriptor as the trace of a
certain number of agents (we call them worms) moving from given
initial positions on the picture in directions that maximize the
brightness variance at each step. Sample agent traces are presented
in Fig. 3.</p>
      <p>It should be noted that each worm has a predefined priority
movement direction; this prevents the main direction of movement
from being displaced toward contours that are not related to symbol
images (for example, poster borders). Summing up the above, the
main component of the image descriptor is the trace of an
agent:</p>
      <p>T_v(x_0, y_0) = (m_v^1(x_0, y_0), ..., m_v^N(x_0, y_0)),</p>
      <p>where T_v(x_0, y_0) is the trace of a worm with priority
movement direction v and initial position (x_0, y_0), and
m_v^i(x_0, y_0) stands for the movement direction (one of
{up, down, left, right}) at step number i. We select each movement
according to the following expression:</p>
      <p>m_v^{i+1} = argmax_{m ∈ M} (Var[I[x, y]] + c(m, v)),</p>
      <p>where (x, y) ∈ P((x_i, y_i), s) and
v ∈ {up, down, left, right}. N is the number of traced steps.
P((x_i, y_i), s) stands for the set of coordinates that can be
reached in one step of size s pixels from position (x_i, y_i), and
M stands for the set of possible step types
({start, finish, up, down, left, right}). Thus we evaluate general
descriptors for each movement direction:</p>
      <p>T^up = (T_up(x_0, H), ..., T_up(x_A, H)),</p>
      <p>T^down = (T_down(x_0, 0), ..., T_down(x_A, 0)),</p>
      <p>T^left = (T_left(W, y_0), ..., T_left(W, y_B)),</p>
      <p>T^right = (T_right(0, y_0), ..., T_right(0, y_B)),</p>
      <p>where A and B stand for the numbers of horizontal and vertical
worms, and W and H denote the input image width and height.
Finally, we merge the descriptors from all four directions into the
resulting image descriptor:
T = (T^up, T^down, T^left, T^right).</p>
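      <p>The step rule above can be sketched directly; a minimal NumPy
illustration of a single worm trace (the window size, step size s,
and the concrete form of the prior c(m, v) are our assumptions, not
values fixed in the text):</p>
      <preformat>
# One worm trace T_v(x0, y0): at every step, move where the local
# brightness variance plus the priority-direction prior is largest.
import numpy as np

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def local_variance(img, x, y, win=3):
    # Brightness variance in a small window centred at (x, y).
    patch = img[max(y - win, 0):y + win + 1, max(x - win, 0):x + win + 1]
    return float(patch.var())

def worm_trace(img, x0, y0, v, steps=64, s=2, bonus=5.0):
    # bonus plays the role of c(m, v): a constant reward for moving
    # along the worm's priority direction v.
    h, w = img.shape
    x, y, trace = x0, y0, []
    for _ in range(steps):
        best, best_score = None, -np.inf
        for m, (dx, dy) in MOVES.items():
            nx, ny = x + dx * s, y + dy * s
            if nx in range(w) and ny in range(h):  # stay inside the image
                score = local_variance(img, nx, ny)
                if m == v:
                    score += bonus
                if score > best_score:
                    best, best_score = m, score
        if best is None:
            break  # corner case: no admissible move
        dx, dy = MOVES[best]
        x, y = x + dx * s, y + dy * s
        trace.append(best)
    return trace  # the moves m_v^1, ..., m_v^N
      </preformat>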
      <p>
        From the above description, it is easy to construct an
algorithm that computes the worm-like descriptor in O(w + h) steps,
where w and h stand for the input image width and height
correspondingly, whereas many implementations of basic image
descriptors (HOG [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and LBP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) require
O(w · h) steps. It should also be noted that the procedure for
calculating our descriptor can be parallelized with little effort.
      </p>
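      <p>Since every trace depends only on the image and its own
starting position, the full descriptor T can be assembled in
parallel; a small sketch using the standard library and the
worm_trace helper from the sketch above (the worm counts A and B
are arbitrary here):</p>
      <preformat>
# Evaluating all boundary worms independently with a thread pool.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def full_descriptor(img, a=16, b=16, steps=64):
    h, w = img.shape
    xs = np.linspace(0, w - 1, a, dtype=int)
    ys = np.linspace(0, h - 1, b, dtype=int)
    starts = ([(int(x), h - 1, "up") for x in xs] +
              [(int(x), 0, "down") for x in xs] +
              [(w - 1, int(y), "left") for y in ys] +
              [(0, int(y), "right") for y in ys])
    with ThreadPoolExecutor() as pool:
        traces = list(pool.map(
            lambda t: worm_trace(img, t[0], t[1], t[2], steps=steps),
            starts))
    return traces  # concatenation (T^up, T^down, T^left, T^right)
      </preformat>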
      <sec id="sec-4-1">
        <title>IV. EXPERIMENTS</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>A. Dataset</title>
      <p>
        We trained the proposed classifier on a dataset, presented
in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The dataset contains 357 advertising signs photographs
that are taken using a camera fixed on a car. All of the images
were obtained under different lighting conditions and camera
angles. Signboards contain textual information decorated with
different font styles and colors. We also use the additional
markup presented in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. All photographs from the dataset
were split into 4 classes according to the type of services
provided (hotels, shops, restaurants, and “other”). All of the
listed classes contain approximately the same number of samples.
      </p>
    </sec>
    <sec id="sec-6">
      <title>B. Baselines</title>
      <p>
        We compare performance of our solution with a model
proposed in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. That model is based on a classifier that merges
visual and textual features to enrich the image embedding.
The main scheme of this baseline is similar to ours; the
difference is in the image descriptor used: the authors involve
OCR to produce noisy text from the detected text regions and then
embed this text with a special character-level vector model.
      </p>
      <p>
        We also provide results of comparison with other combined
methods, where text region descriptor is either LBP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] or
HOG [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] based. These experimental models are based on the same
architecture as ours and differ only in the type of descriptor
used. We chose these two baselines,
since they have proven their effectiveness in image
classification problems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The local binary pattern (LBP) descriptor uses a binary
string representation to capture the spatial relationship between
local neighboring pixels [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The histogram of oriented gradients (HOG) descriptor is based on
histograms of pixel gradient orientations computed over image
blocks [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
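      <p>For reference, both baseline descriptors are available in
scikit-image; a hedged sketch (parameter values are common
defaults, not necessarily those used in the experiments):</p>
      <preformat>
# LBP [12] and HOG [7] descriptors for a grayscale text region (sketch).
import numpy as np
from skimage.feature import hog, local_binary_pattern

def lbp_descriptor(region, points=8, radius=1):
    codes = local_binary_pattern(region, points, radius, method="uniform")
    # Histogram of binary-pattern codes over the region
    # (uniform LBP with P points yields P + 2 distinct codes).
    hist, _ = np.histogram(codes, bins=points + 2, density=True)
    return hist

def hog_descriptor(region):
    # Histograms of gradient orientations over image blocks.
    return hog(region, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
      </preformat>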
      <sec id="sec-6-1">
        <title>V. RESULTS</title>
        <p>
          As a quality measure we use the F1 metric, following [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. It
is formulated as follows:
        </p>
        <p>Precision = TP / (TP + FP),</p>
        <p>Recall = TP / (TP + FN),</p>
        <p>F1 = (2 · Precision · Recall) / (Precision + Recall),</p>
        <p>where TP is the number of objects correctly marked by the
model as belonging to the class; FP is the number of objects
incorrectly marked by the model as belonging to the class; and FN
is the number of objects incorrectly marked by the model as not
belonging to the class. The F1 metric defined above describes the
quality for a single class only. In order to get the final F1
metric over all classes we average the per-class scores. Each
configuration was trained ten times. The results of the comparison
with the baselines (F1 score mean and variance) are given in
Tab. I.</p>
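        <p>The averaging above corresponds to the macro-averaged F1;
for instance, scikit-learn computes the same quantity, which can
serve as a quick check (the labels below are toy values):</p>
        <preformat>
# Averaged (macro) F1 over the four classes, as defined above.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 3, 0, 1]   # toy ground-truth class labels
y_pred = [0, 1, 2, 2, 0, 3]   # toy model predictions
macro_f1 = f1_score(y_true, y_pred, average="macro")
        </preformat>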
        <p>As one can see, the LBP-based model outperforms the previous
best model, while the proposed worm-like descriptor model shows
even better performance in this task.</p>
      </sec>
      <sec id="sec-6-2">
        <title>CONCLUSION</title>
        <p>We propose a special image descriptor for implicit semantic
information extraction from photographs of signboards. We also show
the effectiveness of a combined model configured to use the
introduced worm-like descriptor in the context of the advertising
sign photograph classification problem. The introduced model
demonstrates better efficiency than methods based on only visual
features or on combined visual and explicit textual features. In
the problem of signboard photograph classification our model
achieves a new state-of-the-art result (0.28 in averaged F1 score
against 0.24 for the previous best model). In addition to its
efficiency in solving the problem under consideration, the proposed
method is more lightweight than its analogues and contains modules
that are portable to mobile devices. Among the disadvantages of the
proposed method we note the overall heaviness of the modular
architecture, which does not allow real-time usage on mobile
devices. Based on the obtained results, further research could
focus on more optimal strategies for obtaining the traces of
worm-like agents and on optimizing the performance of the whole
model.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Ashutosh</given-names>
            <surname>Bachchan</surname>
          </string-name>
          , Apurba Gorai, and
          <string-name>
            <given-names>Phalguni</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>Automatic license plate recognition using local binary pattern and histogram matching</article-title>
          . pages
          <fpage>22</fpage>
          -
          <lpage>34</lpage>
          ,
          <year>07 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ayan</given-names>
            <surname>Bhunia</surname>
          </string-name>
          , Shuvozit Ghose, Partha Roy, and
          <string-name>
            <given-names>Subrahmanyam</given-names>
            <surname>Murala</surname>
          </string-name>
          .
          <article-title>A novel feature descriptor for image retrieval by combining modified color histogram and diagonally symmetric co-occurrence texture pattern</article-title>
          .
          <source>Pattern Analysis and Applications</source>
          ,
          <volume>03</volume>
          <fpage>2019</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kai</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          .
          <source>In 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          ,
          <year>June 2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tamarafinide</given-names>
            <surname>Dittimi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Ching</given-names>
            <surname>Suen</surname>
          </string-name>
          .
          <article-title>Modified hog descriptor-based banknote recognition system</article-title>
          .
          <source>Advances in Science, Technology and Engineering Systems Journal</source>
          ,
          <volume>3</volume>
          ,
          10
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Andrew G.</given-names>
            <surname>Howard</surname>
          </string-name>
          , Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and
          <string-name>
            <given-names>Hartwig</given-names>
            <surname>Adam</surname>
          </string-name>
          .
          <article-title>Mobilenets: Efficient convolutional neural networks for mobile vision applications</article-title>
          . 04
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren, and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          ,
          <year>June 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Chunde</given-names>
            <surname>Huang</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jiaxiang</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>A fast hog descriptor using lookup table and integral image</article-title>
          .
          03
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Maire</surname>
          </string-name>
          , Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and
          <string-name>
            <given-names>C. Lawrence</given-names>
            <surname>Zitnick</surname>
          </string-name>
          .
          <article-title>Microsoft coco: Common objects in context</article-title>
          . In David Fleet,
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Pajdla</surname>
          </string-name>
          , Bernt Schiele, and Tinne Tuytelaars, editors,
          <source>Computer Vision - ECCV</source>
          <year>2014</year>
          , pages
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          , Cham,
          <year>2014</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tianyi</given-names>
            <surname>Liu</surname>
          </string-name>
          , Shuangsang Fang,
          <string-name>
            <given-names>Yuehui</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peng</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jun</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Implementation of training convolutional neural networks</article-title>
          .
          <source>CoRR, abs/1506.01195</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Valentin</given-names>
            <surname>Malykh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Aleksei</given-names>
            <surname>Samarin</surname>
          </string-name>
          .
          <article-title>Combined advertising sign classifier</article-title>
          .
          <source>In Analysis of Images, Social Networks and Texts</source>
          , pages
          <fpage>179</fpage>
          -
          <lpage>185</lpage>
          , Cham,
          <year>2019</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <article-title>An overview of the tesseract ocr engine</article-title>
          .
          <source>In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>629</fpage>
          -
          <lpage>633</lpage>
          , Sep.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Junding</given-names>
            <surname>Sun</surname>
          </string-name>
          , Shisong Zhu, and
          <string-name>
            <given-names>Xiaosheng</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Image retrieval based on an improved cs-lbp descriptor</article-title>
          . pages
          <fpage>115</fpage>
          -
          <lpage>117</lpage>
          ,
          <year>05 2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Wei Liu, Yangqing Jia,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>June 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Zhi</given-names>
            <surname>Tian</surname>
          </string-name>
          , Weilin Huang, Tong He, Pan He, and
          <string-name>
            <given-names>Yu</given-names>
            <surname>Qiao</surname>
          </string-name>
          .
          <article-title>Detecting text in natural image with connectionist text proposal network</article-title>
          .
          In Bastian Leibe
          , Jiri Matas, Nicu Sebe, and Max Welling, editors,
          <source>Computer Vision - ECCV</source>
          <year>2016</year>
          , pages
          <fpage>56</fpage>
          -
          <lpage>72</lpage>
          , Cham,
          <year>2016</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Kai</given-names>
            <surname>Wang</surname>
          </string-name>
          , Boris Babenko, and
          <string-name>
            <given-names>Serge</given-names>
            <surname>Belongie</surname>
          </string-name>
          .
          <article-title>End-to-end scene text recognition</article-title>
          .
          <source>In Proceedings of the 2011 International Conference on Computer Vision</source>
          , ICCV '
          <volume>11</volume>
          , pages
          <fpage>1457</fpage>
          -
          <lpage>1464</lpage>
          , Washington, DC, USA,
          <year>2011</year>
          . IEEE Computer Society.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Alok</given-names>
            <surname>Watve</surname>
          </string-name>
          and
          <string-name>
            <given-names>Shamik</given-names>
            <surname>Sural</surname>
          </string-name>
          .
          <article-title>Soccer video processing for the detection of advertisement billboards</article-title>
          .
          <source>Pattern Recogn. Lett.</source>
          ,
          <volume>29</volume>
          (
          <issue>7</issue>
          ):
          <fpage>994</fpage>
          -
          <lpage>1006</lpage>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Jiang</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kevin</given-names>
            <surname>McGuinness</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Noel E.</given-names>
            <surname>O'Connor</surname>
          </string-name>
          .
          <article-title>A text recognition and retrieval system for e-business image management</article-title>
          .
          In Klaus Schoeffmann
          ,
          <string-name>
            <surname>Thanarat H. Chalidabhongse</surname>
          </string-name>
          , Chong Wah Ngo, Supavadee Aramvith,
          <string-name>
            <surname>Noel E. O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yo-Sung</given-names>
            <surname>Ho</surname>
          </string-name>
          , Moncef Gabbouj, and Ahmed Elgammal, editors,
          <source>MultiMedia Modeling</source>
          , pages
          <fpage>23</fpage>
          -
          <lpage>35</lpage>
          , Cham,
          <year>2018</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>He</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <article-title>East: An efficient and accurate scene text detector</article-title>
          .
          <source>In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>2642</fpage>
          -
          <lpage>2651</lpage>
          ,
          <year>July 2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>