<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Device using Structural Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vasyl Tereshchenko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaroslav Tereshchenko</string-name>
          <email>vtereshch@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska Street, Kyiv, 01601</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>0</volume>
      <fpage>1</fpage>
      <lpage>03</lpage>
      <abstract>
        <p>This work develops efficient algorithms for detecting text in images obtained from a mobile device's camera. The peculiarity of such algorithms is strict limits on memory consumption and execution time (in our case: 5 MB, ~1 s) while maintaining detection quality at the level provided by cloud-based services. To achieve this efficiency under the specified constraints, we propose a mixed model for detecting text in images that involves preprocessing, detection, and post-processing.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Text recognition in images with arbitrary backgrounds is one of the actively developed
problems of computer vision.</p>
      <p>
        Although methods for recognizing text have been evolving for several decades, they still leave
much room for improvement, in particular for complex and heterogeneous scenes where a priori
criteria for distinguishing a text from its
background are unclear [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1-7</xref>
        ]. Such problems are inherent, for instance, in applications developed for
mobile devices, where there are tight limits on runtime performance and memory footprint. In
turn, this motivates the search for new optimal pipelines that balance text detection against its
further processing (e.g., handwriting, print, style, and language recognition). It is worth noting that
localizing text blocks is not an easy task under a variety of conditions: lighting, the arrangement of
the text relative to the camera, the presence of non-textual symbolic artifacts and graphic information
alongside the text, and other distortions of the text. Examples of text areas are inscriptions
on billboards, buildings, institutions, road signs, and traffic participants (pedestrians and cars),
as well as text on a board, lecture notes, and other textual information (Fig. 1).
      </p>
      <p>There are many approaches to solving this problem: methods based on correlation, contour and
texture segmentation, the discrete Fourier transform, wavelet transforms, and neural networks. All
popular approaches conventionally fall into the following main classes: structural, based on
machine learning (CNN, convolutional neural networks), and combined (or mixed). Structural
approaches include methods that take into account the geometry and topology of image elements. In
particular, these methods are based on image processing, the construction of data
structures, and the use of image decomposition. For example, the most frequently used data
structures are lists and trees (k-d trees, BSP trees, BD trees, BBD trees, QBD trees, VP trees,
concatenable queues).</p>
      <p>
        We also use decomposition tools such as triangulation (Delaunay triangulation), the Voronoi diagram,
orthogonal recursive partitioning, and so on. Structural methods allow noise to be reduced significantly at
the preprocessing stage, which substantially improves the quality and speed of detecting the desired elements in
the image. To date, there are many structural algorithms for automatic segmentation, which in general
can be divided into two groups: division into homogeneous regions [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref8 ref9">8-12</xref>
        ] and extraction of
contours [
        <xref ref-type="bibr" rid="ref10 ref13 ref14 ref15">10,13-15</xref>
        ]. There are many methods, but none of them are universal enough.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] observe that letters and words in an image, as a rule, have a constant
stroke width (or one that varies within a certain narrow range). Therefore, in their view,
the SWT (Stroke Width Transform) algorithm is promising for identifying such objects. The stroke
width can be used not only as one of the features for classifying regions, but also as a
feature when combining regions into words. Character contours within the described approach can be
found, for example, with the Canny edge detector. However, it should be borne in mind
that the SWT algorithm requires additional computational resources to combat errors and
some other specific effects.
      </p>
      <p>
        The next approach uses convolutional neural networks (CNNs) to detect text in
an image. In particular, in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] the authors describe the automatic generation of features to be used
for recognition. They propose creating such features by means of machine learning, taking as the training
dataset artificial 8x8-pixel images that contain fragments of text characters. To search real
images, it is enough to compute the learned features in the desired areas of the image. In papers
[
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19 ref20 ref21 ref22 ref23">16-23</xref>
        ], the authors suggest using a convolutional neural
network as the classifier for detecting text in an image; compared with classical neural networks, CNNs have the following advantages:
the ability to account for spatial structure, reduced architecture and training complexity, and
resistance to distortion of symbols. In our view, methods that use semantic knowledge about objects
and machine learning algorithms to improve segmentation results [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] are quite promising.
The most suitable, in our opinion, are the SSD and YOLO networks. These networks give good
text detection results using modest resources. However, we consider it inappropriate to use
neural networks alone, because we would then have to process a huge amount of redundant and unnecessary
information, which wastes time and resources and thus restricts their use on
mobile devices.
      </p>
      <p>
        One of the main problems to be solved when implementing the proposed method is that
the quality of the classification stage (including the use of convolutional neural networks) depends
significantly on the volume and representativeness of the training sample. Today, specially curated image
collections, such as the ICDAR database [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], can serve as datasets.
It should be noted that the number and type of images included in the training sample
determine the learning speed and the accuracy of classification.
      </p>
      <p>
        Modern approaches to detecting text in a natural image usually consist of
the sequential application of algorithms, where the result of the previous one is passed to the input of
the next. Everything begins with the detection of low-level characters or strokes, after which the
following steps are usually performed: filtering out non-text components, constructing text lines, and
verifying them. Such a number of steps complicates the algorithm itself and reduces
its reliability and flexibility. The accuracy of this approach depends on the accuracy of the symbol
detection methods, whether connected-components or sliding-window. These methods generally
examine low-level features obtained using SWT [
        <xref ref-type="bibr" rid="ref24 ref3">3, 24</xref>
        ], MSER [
        <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
        ] or HOG
[
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] to distinguish text from the background. However, this does not guarantee the reliability of the
approach, since individual strokes or symbols are identified without context. For example, it is easier
for a person to recognize a sequence of characters than a single letter, especially when it is indistinct.
These restrictions cause false detections at the letter detection stage, which in turn
complicates the elimination of these errors. Moreover, errors accumulate with each
stage of the algorithm [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]. In our work, we focus on one of the most efficient structural methods of
text recognition in a natural image, SWT [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In particular, we propose an optimization and
adaptation of this method for developing applications for mobile devices.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The Peculiarities of the SWT (Stroke-Width Transform) Method in Text Detection Problems on the Image</title>
      <p>
        The main idea of the SWT method is to detect strokes of equal width on a binarized
image (for example, one obtained using the Canny algorithm [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]) that are likely candidates for
characters or letters of the text in a natural image. The method differs from other approaches in that it
does not look for separate per-pixel features, such as gradient or color. In addition, the SWT method
does not use any language-related filtering mechanisms, such as statistics of gradient directions in the
candidate window tied to a particular alphabet. This allows the approach to be used for
multilingual text detection. Also, the method does not focus on the exact determination of the stroke
width, but rather on detecting the tendency that the stroke is an element of text within a certain
contour width. The next step of the algorithm is to group pixels into a list of candidate
letters. Two adjacent pixels can be grouped if they have the same contour width. Then filtering is performed
based on standard deviation, size, and other structural features. Finally,
words are grouped into blocks by clustering methods.
      </p>
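      <p>The grouping and filtering steps just described can be sketched as follows. This is a hedged, minimal illustration with a toy stroke-width map and an assumed rejection rule (stroke-width standard deviation above half the mean); it is not the authors' implementation.</p>

```python
import numpy as np
from scipy import ndimage

# Toy stroke-width map: nonzero entries are stroke pixels with their
# estimated width; zeros are background.
swt = np.array([
    [0, 2, 2, 0, 0],
    [0, 2, 2, 0, 2],
    [0, 0, 0, 0, 9],
])

# Group adjacent stroke pixels (4-connectivity), as in the SWT grouping step.
labels, n = ndimage.label(swt > 0)

candidates = []
for i in range(1, n + 1):
    widths = swt[labels == i]
    # Filter by stroke-width spread: reject a component whose standard
    # deviation exceeds half its mean width (the threshold is an assumption).
    if widths.std() * 2 > widths.mean():
        continue
    candidates.append(i)
```

      <p>Here only the first component survives: its widths are uniform, while the second mixes widths 2 and 9 and is discarded by the variance filter.</p>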
      <p>In most cases, SWT runs faster than other algorithms. However, it can spend a lot
of time on areas where there is no text, due to noise or plots that also have a uniform contour.
Another drawback of SWT is its unreliability in cases where the stroke width changes substantially,
as can happen in handwritten texts or in certain printed font types. Therefore, we offer an
algorithm that optimizes the procedure for detecting stroke thickness and grouping letters into text blocks
based on the Voronoi diagram. It greatly speeds up the work of SWT itself and thus can
be applied to detect text in images on mobile devices and special robotic devices.</p>
      <p>
        The use of the Voronoi diagram is not new for recognition tasks; in particular, many
works are devoted to text localization in images. Thus, N. C. Kha and N. Masaki [
        <xref ref-type="bibr" rid="ref31">31, 32</xref>
        ]
offer a method for segmenting characters on handwritten text pages in Japanese. Using
the Voronoi diagram, the authors separate text elements to improve the quality of SWT. However, this
modification does not touch the main procedure of the SWT operator: finding strokes of equal
width as candidates for characters or letters of the text. In [33], the authors propose a new
approach to the skeletonization of text images using the Voronoi diagram.
      </p>
    </sec>
    <sec id="sec-4">
      <title>3. Model and Methods for Solving the Problem</title>
      <p>
        In methods for text localization in images under constraints, much attention is paid to the
execution time and the memory required to store and run the algorithms. Text recognition is
one of the topical problems of computer vision. Although methods for
recognizing text have been developed for several decades, the problem is far from solved for real images,
which have complex and heterogeneous backgrounds and no clear criteria for
distinguishing text from background [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6 ref7">1-7</xref>
        ]. The problem is particularly acute in the development of
applications for mobile and special robotic devices, where there are severe limitations on runtime and
memory. This, in turn, leads to the development of new approaches and models for
text detection in images and further processing (handwriting, print, style, and language
recognition). For text localization under constraints, we pay attention to the runtime and memory
needed to store and run the algorithms. Restrictions may arise, for example, from adapting the
algorithms to mobile devices under Android. Therefore, to achieve algorithm efficiency within
the given limits, we propose a mixed model of text localization in an image that involves
preprocessing, detection, and post-processing. For detection, we choose a shortened convolutional
neural network (CNN), which satisfies the specified restrictions. However, with a reduced SSD,
recognition quality may deteriorate. To maintain high-quality recognition, we introduce a preprocessing
step. At the preprocessing stage, the input RGB image is processed to enhance and expand its feature
space: removing unnecessary information (noise), changing the contrast, brightness, and color palette,
and other image processing procedures. At this stage, we can use structural methods (MSER, SURF,
the gradient method, Delaunay triangulation, the Voronoi diagram, data structures, and NNs), and in
particular SWT. However, given the time constraints, the classical SWT spends too much time
searching for strokes of the same width, so optimization is required. In this work, we offer
such an SWT optimization using the Voronoi diagram, which significantly accelerates and improves the
quality of the work (determining the contours of the strokes and their width), and also
expands the feature space for further recognition.
      </p>
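      <p>As a small illustration of the preprocessing stage, the following hedged sketch performs one of the operations named above, a linear contrast stretch of a grayscale image to the full 0-255 range; the function name and the choice of a linear stretch are ours, not the paper's.</p>

```python
import numpy as np

def stretch_contrast(img):
    # Linearly map the image's [min, max] intensity range onto [0, 255].
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # A flat image carries no contrast to stretch.
        return np.zeros_like(img, dtype=np.uint8)
    return ((img - lo) / (hi - lo) * 255).astype(np.uint8)

out = stretch_contrast(np.array([[50, 100], [150, 200]]))
```

      <p>The darkest pixel maps to 0 and the brightest to 255, widening the dynamic range before feature extraction.</p>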
      <p>The SSD is the most suitable CNN for our conditions. We have modified the SSD architecture
(optimizing the convolutional depth, the size of the input image and feature maps, and the number of channels of
the input image). This allowed us to satisfy the specified limits (5 MB, 1-2 seconds). We conducted
several tests with an SSD detector based on an Inception v3 feature extractor with a reduced number of
neurons in each layer. Detection accuracy dropped by only 5%, while the number of operations decreased
10 times. In addition, the analysis showed that the most promising model for
localizing text in images on mobile devices is a combined model, which we choose
as the basis. To achieve the maximum quality of text localization at minimal cost in
computing resources, we propose to act according to the following scheme, Fig. 2.</p>
      <p>Thus, we choose the following sequence of steps for the text localization algorithm on the natural
image:</p>
      <p>1) Expand the range of features. The additional feature in our case is the SWT. Computing the SWT
depends on the quality of the edge detection algorithm, whose key element is the
threshold. To handle this, we compute the SWT at different threshold values and keep the maximally stable
strokes (those that do not change across the maximum number of thresholds,
Fig. 3).
2) Then, we feed the RGB image and the generated SWT to the input of the modified SSD NN. We
feed the SWT not only to the first input layer, but also to other layers, providing features at different scales
(Fig. 4).</p>
      <p>3) At the post-processing stage, the detected text blocks are fed to the input of the filtering
block. At this stage, false blocks are discarded, and separated parts of words
are combined into words and sentences.</p>
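      <p>Step 1 above can be sketched as follows: run the stroke-width computation at several thresholds and keep the pixels whose width estimate repeats across the most thresholds. The sketch below is a simplified stand-in (it compares each map against the first rather than taking a true per-pixel mode), using toy 2x2 maps in place of real SWT outputs.</p>

```python
import numpy as np

def stable_stroke_mask(swt_maps, min_votes):
    """swt_maps: list of HxW stroke-width maps, one per edge threshold."""
    stack = np.stack(swt_maps)          # shape (T, H, W)
    ref = stack[0]
    # A pixel "votes" for stability whenever its width matches the first map;
    # background pixels (width 0) never qualify.
    votes = (stack == ref).sum(axis=0) * (ref > 0)
    return votes >= min_votes

maps = [np.array([[2, 0], [2, 5]]),
        np.array([[2, 0], [0, 5]]),
        np.array([[2, 1], [0, 4]])]
mask = stable_stroke_mask(maps, min_votes=3)
```

      <p>Only the top-left pixel keeps the same width at all three thresholds, so only it is retained as a maximally stable stroke pixel.</p>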
    </sec>
    <sec id="sec-5">
      <title>4. Description of the Problem Solving Results</title>
      <p>Already at the first stage of developing the localization algorithm (a
prototype without optimization), we obtained promising results.</p>
      <p>• Localization time on a Core i7 CPU: 0.4-0.7 s; on ARM: 1-1.5 s.
• Memory: 1.9 MB; accuracy: 0.61.</p>
      <p>• NN training was performed on a new dataset containing 17,000 photos (1-2 days of training time).</p>
    </sec>
    <sec id="sec-6">
      <title>4.1. SWT Optimization Method</title>
      <p>To improve the quality and speed of recognition at the preprocessing stage, we propose an
optimization of the SWT algorithm based on the Voronoi diagram and other procedures (Fig. 5).</p>
      <p>
        In particular, to optimize SWT, we first apply one of the methods of binarization (for example, the
Canny method [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]), which allows finding outlines based on their hierarchy (Fig. 6, a).
      </p>
      <p>We remove extra points from the resulting contours (Fig. 6, b). We then apply the Voronoi
diagram to the resulting set of contour points.</p>
      <p>
        For this purpose, we can use one of the known efficient algorithms, for example, Fortune's
algorithm [52] with complexity O(n log n). In the process of constructing the Voronoi diagram,
we represent it as an extended doubly connected edge list (DCEL) [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ], which stores, for each edge, the distance
between the pair of points that the edge separates. This allows contours of equal
width (letters) to be identified. Given the Voronoi diagram of the input points in DCEL form, after
binarization and noise removal, it is possible to localize the strokes of letters of equal
thickness (or within a certain contour width) (Fig. 7, a). In this case, the search is carried out
not over the whole Voronoi diagram, but only in the direction of equal distances, that is, along the
letter skeleton (Fig. 7, b). Compared to the usual SWT, which traverses the pixels of the
contour, we move along the edges that separate the points forming the contour of the letter.
Due to this, the number of operations needed to identify a letter is reduced. By combining the extracted
components of a letter, we obtain its silhouette.
      </p>
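      <p>The idea of reading stroke width off Voronoi edges can be illustrated with SciPy's Voronoi construction (the full DCEL bookkeeping and skeleton filtering are omitted; the two-row toy contour and the same-side filter are our assumptions, not the paper's pipeline):</p>

```python
import numpy as np
from scipy.spatial import Voronoi

# Toy contour of a horizontal stroke: two rows of boundary points about
# 3 units apart (the bottom row is offset to avoid degenerate geometry).
top = [(x, 0.0) for x in range(6)]
bottom = [(x + 0.3, 3.0) for x in range(6)]
pts = np.array(top + bottom)

vor = Voronoi(pts)
widths = []
# Each Voronoi ridge separates two generating contour points; for ridges
# crossing the stroke interior, that pair's distance approximates the
# local stroke width.
for p, q in vor.ridge_points:
    if pts[p][1] != pts[q][1]:          # skip same-side (same row) pairs
        widths.append(np.linalg.norm(pts[p] - pts[q]))

stroke_width = float(np.median(widths))
```

      <p>The median over the interior ridges recovers a width close to 3 without tracing per-pixel rays, which is the speed advantage described above.</p>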
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] it is assumed that letters and words in an image, as a rule, have a constant stroke width.
The proposed SWT optimization, however, is not rigidly tied to the contour width; it allows the algorithm
to work within an averaged width. Therefore, according to the authors, the SWT
algorithm is promising for detecting such objects. The stroke width can be used not only as one of the features for
classifying regions, but also as a feature when combining regions into words.
      </p>
    </sec>
    <sec id="sec-7">
      <title>4.2. General Description of the Configured SSD</title>
      <p>
        To solve the detection problem, we developed a prototype algorithm based on a modified
convolutional neural network SSD, which in its original form occupied 40 MB. The neural network
was modified and reduced by removing extra layers and neurons; in its final form it
occupies 1.9 MB. It has a smaller initial number of classes, and some layers were added to
the Inception channel to improve the recognition of large objects. Despite this reduction of the
NN, the quality of text recognition in images is retained. The Inception block contains three layers for
forming image features, corresponding to 1x1, 3x3, and 5x5 kernels. To ensure the rapid
operation of the detection algorithm, we optimized the Inception blocks. This optimization involves
adding Batch Normalization and Slice layers, as well as reducing the number of neurons based on
statistics that determine the usefulness of a neuron for a particular layer. The operating time of
the modified network is 0.4-0.7 s (on an Everest i7000 CPU, 0.2-0.8 s). A training base was also created from
labeled test images, newly added photos from a phone, and the COCO dataset, for a total
of 17,000 pictures. The ability to feed more than 3 channels to the NN is
provided through the Aggregate Channel Features (ACF) [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ].
      </p>
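      <p>The reported ~10x drop in operations from thinning layers is consistent with simple multiply-add accounting: convolution cost scales with the product of input and output channel counts. The layer sizes below are illustrative toy numbers, not the network's actual configuration.</p>

```python
def conv_madds(h, w, c_in, c_out, k):
    # Multiply-adds for one k x k convolution over an h x w feature map.
    return h * w * c_in * c_out * k * k

# A toy three-layer stack at full width vs. one-third width.
full = sum(conv_madds(64, 64, c_in, c_out, 3)
           for c_in, c_out in [(3, 96), (96, 96), (96, 96)])
slim = sum(conv_madds(64, 64, c_in, c_out, 3)
           for c_in, c_out in [(3, 32), (32, 32), (32, 32)])
ratio = full / slim   # roughly 9x for these toy numbers
```

      <p>Because the cost is roughly quadratic in width, cutting channels by about a third yields close to an order-of-magnitude reduction, matching the scale of the measurement above.</p>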
    </sec>
    <sec id="sec-8">
      <title>4.3. Comparative Characteristics of the SSD</title>
      <p>We conducted a series of experiments to optimize the number of Inception blocks and assess their effect
on recognition. For this purpose, statistics of localized bounding boxes were collected. Each
bounding box was assigned to one of the clusters responsible for a certain detector size.
Also, compared to the original NN, the reduced network has Batch Norm and Slice layers.
Table 1 shows the number of neurons in each layer for the original and reduced NNs.</p>
    </sec>
    <sec id="sec-9">
      <title>4.4. Post-processing (Filtering)</title>
      <p>To solve this problem, a prototype algorithm was developed on the basis of a small neural network
that recognizes words of the same style (Fig. 8). This neural network is used at the stage of
searching for words that were not recognized by the main SSD neural network but that match
in style a selected high-scoring word.</p>
      <p>We labeled a large number of negatives in order to improve style recognition. Below are
a few images (Fig. 9) in which the labeled version of the image is shown on the left and the result of
the neural network is on the right; the detected text is classified as handwritten
(red text blocks) or printed (green).</p>
      <p>Regarding the SWT modification, it is worth noting that we developed a modified algorithm for
constructing the Voronoi diagram based on Fortune's idea, adapted specifically
for the SWT modification. Figure 10 shows graphs comparing the algorithms for
constructing the Voronoi diagram on the N data points obtained after image binarization.</p>
      <p>The blue color shows the execution time of the developed algorithm, and the orange color shows the
running time of the library algorithm from OpenCV. The graphs show that the developed algorithm is
much faster than the library algorithm on large input sets.</p>
    </sec>
    <sec id="sec-10">
      <title>4.5. Experimental Data</title>
      <p>The comparison of the proposed approaches was conducted on the COCO-Text dataset (Table 2).
The accuracy of the proposed model by class is given in Table 3.</p>
    </sec>
    <sec id="sec-11">
      <title>5. Conclusions</title>
      <p>1. A new hybrid text detection approach for mobile devices is proposed, combining
structural methods for feature extraction with their application to SSD training. To ensure
the rapid operation of the detection algorithm, the Inception blocks were optimized, which included adding
Batch Normalization and Slice layers, as well as reducing the number of neurons based on statistical data
that determine the usefulness of a neuron for a particular layer.</p>
      <p>2. A modified SWT method is proposed, which uses the Voronoi diagram to find the width of text symbols
(strokes). This approach considerably speeds up the work of SWT, because after filtering the
edges of the Voronoi diagram we work with a much smaller amount of data.
3. To combine text blocks into groups (lines), a small style-classification NN is used, which allows
taking into account not only the connection with the directly adjoining text block, but also with its
environment. To generate the graph, for the first time, the Delaunay triangulation of the
text-block centroids was used.</p>
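      <p>The centroid graph mentioned in point 3 can be sketched with SciPy's Delaunay triangulation; the toy centroids below are ours, and extracting edges from the simplices is one straightforward way to obtain the neighbor graph:</p>

```python
import numpy as np
from scipy.spatial import Delaunay

# Toy centroids of four detected text blocks.
centroids = np.array([[0, 0], [10, 0], [20, 1], [5, 30]])

tri = Delaunay(centroids)
edges = set()
# Each triangle contributes its three sides; dedupe by ordering endpoints.
for simplex in tri.simplices:
    for i in range(3):
        a, b = simplex[i], simplex[(i + 1) % 3]
        edges.add((min(a, b), max(a, b)))
```

      <p>Each block is thereby linked to all blocks surrounding it, not just its nearest neighbor, which is the context the style classifier exploits.</p>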
    </sec>
    <sec id="sec-12">
      <title>6. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Shafait</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Uchida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>A Hierarchical Visual Saliency Model for Character Detection in Natural Scenes</article-title>
          . LNCS,
          <volume>8357</volume>
          ,
          <fpage>18</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Coates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Carpenter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Case</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Suresh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Text Detection and Character Recognition in Scene Images with Unsupervised Feature Learning</article-title>
          .
          <source>In Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR)</source>
          , IEEE, Beijing, China , pp.
          <fpage>440</fpage>
          -
          <lpage>445</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Epshtein</surname>
          </string-name>
          , E. Ofek,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wexler</surname>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>Detecting Text in Natural Scenes with Stroke Width Transform</article-title>
          .
          <source>In Proceedings of 23rd IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , IEEE, San Francisco, vol.
          <volume>5</volume>
          , pp.
          <fpage>2963</fpage>
          -
          <lpage>2970</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kunishige</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yaokai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Uchida</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Scenery Character Detection with Environmental Context [Text]</article-title>
          .
          <source>In Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR)</source>
          , IEEE, Beijing, China, pp.
          <fpage>1049</fpage>
          -
          <lpage>1053</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Uchida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shigeyoshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kunishige</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yaokai</surname>
          </string-name>
          .
          <article-title>Keypoint-Based Approach Toward Scenery Character Detection</article-title>
          .
          <source>In Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR)</source>
          , IEEE, Beijing, China , pp.
          <fpage>819</fpage>
          -
          <lpage>823</lpage>
          , (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lao</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Dot Text Detection Based on FAST Points</article-title>
          .
          <source>In Proceedings of 11th International Conference on Document Analysis and Recognition (ICDAR) IEEE</source>
          , Beijing, China, pp.
          <fpage>435</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Accurate text localization in images based on SVM output scores</article-title>
          .
          <source>Image and Vision Computing</source>
          ,
          <volume>27</volume>
          ,
          <fpage>1295</fpage>
          -
          <lpage>1301</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.S.</given-names>
            <surname>Manjunath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shin</surname>
          </string-name>
          (
          <year>1999</year>
          ).
          <article-title>Color Image Segmentation</article-title>
          .
          <source>In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition</source>
          , Fort Collins, USA, V.2, pp.
          <fpage>446</fpage>
          -
          <lpage>451</lpage>
          .
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hojjatoleslami</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Kittler</surname>
          </string-name>
          (
          <year>1998</year>
          ).
          <article-title>Region Growing: A New Approach</article-title>
          .
          <source>IEEE Trans. on Image Processing</source>
          <volume>7</volume>
          (
          <issue>7</issue>
          ),
          <fpage>1079</fpage>
          -
          <lpage>1084</lpage>
          .
          <ext-link ext-link-type="uri" xlink:href="http://cgm.computergraphics.ru/content/view/147">http://cgm.computergraphics.ru/content/view/147</ext-link>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Koepfler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.M.</given-names>
            <surname>Morel</surname>
          </string-name>
          (
          <year>1994</year>
          ).
          <article-title>A Multiscale Algorithm for Image Segmentation by Variational Method</article-title>
          .
          <source>SIAM Journal on Numerical Analysis</source>
          <volume>31</volume>
          (
          <issue>1</issue>
          ),
          <fpage>282</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.L.</given-names>
            <surname>Marroquin</surname>
          </string-name>
          .
          <year>1985</year>
          .
          <article-title>Probabilistic Solution of Inverse Problems</article-title>
          . Tech. Rep., Massachusetts Institute of Technology.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J. M. S.</given-names>
            <surname>Prewitt</surname>
          </string-name>
          .
          <year>1970</year>
          .
          <article-title>Object Enhancement and Extraction</article-title>
          .
          <source>Picture Processing and Psychopictorics</source>
          <volume>10</volume>
          ,
          <fpage>15</fpage>
          -
          <lpage>19</lpage>
          .
          <string-name>
            <given-names>B.</given-names>
            <surname>Lipkin</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosenfeld</surname>
          </string-name>
          (eds.), Academic Press, New York.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Haralick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Shapiro</surname>
          </string-name>
          (
          <year>1985</year>
          ).
          <article-title>Image Segmentation Techniques</article-title>
          .
          <source>Computer Vision, Graphics, and Image Processing</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ),
          <fpage>100</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Canny</surname>
          </string-name>
          (
          <year>1986</year>
          ).
          <article-title>A Computational Approach to Edge Detection</article-title>
          .
          <source>IEEE Trans. on Pattern Analysis and Machine Intelligence</source>
          <volume>8</volume>
          (
          <issue>6</issue>
          ),
          <fpage>679</fpage>
          -
          <lpage>698</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Robust scene text detection with convolution neural network induced MSER trees</article-title>
          .
          <source>LNCS</source>
          ,
          <volume>8692</volume>
          ,
          <fpage>497</fpage>
          -
          <lpage>511</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Delakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Garcia</surname>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Text detection with convolutional neural networks</article-title>
          .
          <source>In Proceedings of International Conference on Computer Vision Theory and Applications</source>
          , January, Funchal, Madeira, Portugal, pp.
          <fpage>290</fpage>
          -
          <lpage>294</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lao</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Dot Text Detection Based on FAST Points</article-title>
          .
          <source>In Proceedings of International Conference on Document Analysis and Recognition</source>
          , IEEE, Beijing, China, pp.
          <fpage>435</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Shekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Smitha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Palaiahnakote</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Discrete Wavelet Transform and Gradient Difference Based Approach for Text Localization in Videos</article-title>
          .
          <source>In Proceedings of 5th International Conference on Signal and Image Processing</source>
          (ICSIP), IEEE, Bangalore, India, pp.
          <fpage>280</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Enachescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cr. D.</given-names>
            <surname>Miron</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Handwritten Digits Recognition Using Neural Computing</article-title>
          .
          <source>Scientific Bulletin of the Petru Maior University of Tirgu Mures</source>
          , 6 (XXIII),
          <fpage>17</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>P.</given-names>
            <surname>Arbelaez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hariharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bourdev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Malik</surname>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Semantic segmentation using regions and parts</article-title>
          .
          <source>In Proceedings of Conference on Computer Vision and Pattern Recognition</source>
          , IEEE, Providence, USA, pp.
          <fpage>3378</fpage>
          -
          <lpage>3385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Antoshchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sotov</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Correlation-extreme method for text area localization on images</article-title>
          .
          <source>In Proceedings of First International Conference on Data Stream Mining &amp; Processing (DSMP)</source>
          , IEEE, Lviv, Ukraine, pp.
          <fpage>173</fpage>
          -
          <lpage>176</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Text localization in natural images using Stroke Feature Transform and Text Covariance Descriptors</article-title>
          .
          <source>In Proceedings of International Conference on Computer Vision</source>
          (ICCV), IEEE, December, Sydney, Australia, pp.
          <fpage>1241</fpage>
          -
          <lpage>1248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>X.-C.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-W.</given-names>
            <surname>Hao</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Robust Text Detection in Natural Scene Images</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>36</volume>
          (
          <issue>5</issue>
          ),
          <fpage>970</fpage>
          -
          <lpage>983</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>L.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Real-time Lexicon-free Scene Text Localization and Recognition</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>38</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1872</fpage>
          -
          <lpage>1885</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Tan</surname>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Text Flow: A Unified Text Detection System in Natural Scene Images</article-title>
          .
          <source>In Proceedings of International Conference on Computer Vision</source>
          (ICCV), December, Santiago, Chile, pp.
          <fpage>4651</fpage>
          -
          <lpage>4659</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Schroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grzeszczuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Girod</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>Robust Text Detection in Natural Images with Edge-enhanced Maximally Stable Extremal Regions</article-title>
          .
          <source>In Proceedings of 18th IEEE International Conference on Image Processing</source>
          , IEEE, Brussels, Belgium, pp.
          <fpage>2601</fpage>
          -
          <lpage>2604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>F.</given-names>
            <surname>Preparata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Shamos</surname>
          </string-name>
          (
          <year>1985</year>
          ).
          <source>Computational Geometry: An Introduction</source>
          . Springer-Verlag, Berlin.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.Z.</given-names>
            <surname>Li</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Aggregate channel features for multi-view face detection</article-title>
          .
          <source>In Proceedings of International Joint Conference on Biometrics, IEEE</source>
          , Clearwater, USA, pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fortune</surname>
          </string-name>
          (
          <year>1987</year>
          ).
          <article-title>A sweep line algorithm for Voronoi diagrams</article-title>
          .
          <source>Algorithmica</source>
          <volume>2</volume>
          ,
          <fpage>153</fpage>
          -
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>N.C.</given-names>
            <surname>Kha</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Masaki</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Enhanced Character Segmentation for Format-Free Japanese Text Recognition</article-title>
          .
          <source>In Proceedings of 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</source>
          , Shenzhen, P.R. China, pp.
          <fpage>138</fpage>
          -
          <lpage>143</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nakagawa</surname>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Text-Line and Character Segmentation for Offline Recognition of Handwritten Japanese Text</article-title>
          .
          <source>IEICE technical report</source>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Lee</surname>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Stroke Width Based Skeletonization for Text Images</article-title>
          .
          <source>Journal of Computing Science and Engineering</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>149</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>