<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Text detection in natural scenes with multilingual text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikhail Zarechensky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Analytical Information Systems, Saint Petersburg State University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Proceedings of the Tenth Spring Researcher's Colloquium on Database and Information Systems</institution>
          ,
          <addr-line>Veliky Novgorod, Russia, 2014</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Scientific supervisor: Ph.D. Natalia Vassilieva</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <abstract>
        <p>Detecting text in natural scenes is an important prerequisite for text recognition and other image analysis tasks. Most text detection methods for scene images rely on a priori knowledge of the language of the text. As a rule, such algorithms are evaluated on datasets that contain only scenes with English text. This paper discusses known text detection algorithms and investigates their invariance to the language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recent advances in digital technology make it possible to take
pictures with a wide range of mobile devices. As a
result, the number of photos taken by users is increasing
every day. At the same time, we often have no
annotations for images except those made by the device. Text in
an image provides important information about the semantics
of the image. Annotated images can be used in various
applications, such as content-based image retrieval,
automatic navigation, and automatic translation. Often the
language of the text in an image is not known in
advance, or a single image contains text areas in
different languages. How to detect and recognize text
in scene images effectively is an open research question.
Text detection is an important prerequisite for text
recognition. In this paper we explore the problem of text
detection.</p>
      <p>In this paper, we discuss several known text detection
algorithms and investigate their invariance to the
language. The quality of a text detection algorithm depends
greatly on the shooting conditions and on image noise,
but in this paper we focus on the language invariance
of the algorithms under good conditions.
First, we identify the main steps common to these
algorithms. Second, we provide a theoretical estimate
of language invariance for every step of the algorithms.
Third, we perform experiments with two algorithms on
different datasets to confirm the theoretical results.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>To recognize text in an image, the text first has to be
robustly detected. Unlike text detection for document
images, text detection for scenes is still a challenging task
because of the large variety of text appearance in images.
Text in scenes can vary in font style, size, and distortion;
its contrast can vary with lighting conditions. The image
as a whole can also vary greatly: we have to take into account
low resolution, low contrast, and heterogeneous backgrounds. This
variety gives rise to various approaches to text detection.</p>
      <p>Existing methods for scene text detection can be
broadly categorized into three groups: texture-based
methods, region-based methods and hybrid methods.</p>
      <p>Texture-based methods extract textural features of an
image and then use machine learning techniques to
identify text regions. It is common to extract textural features
of image sub-regions using a sliding window and then
classify every sub-region as text or non-text. These
methods tend to be slow, because an image has to be
processed at several scales. Another problem is the
construction of a whole text area from the coordinates of the image
sub-regions classified as text. Image quality also greatly
affects these methods. Therefore, these approaches are
difficult to use on mobile devices.</p>
      <p>Region-based methods commonly use connected
component labeling to extract components that serve as
character candidates. Next, various heuristics are applied
to filter out non-character components. The remaining
character candidates are grouped together to form text areas.
Usually components are grouped based on their
geometric properties. After character candidates are grouped
into text, there may be additional checks to remove false
positives. This approach is the most common. The
performance of these methods depends mostly on the heuristics
used to filter out regions that do not contain text.</p>
      <p>Hybrid methods use a region detector to detect text
candidates and then segment the image to extract character
candidates. After character candidates are extracted,
various heuristics are applied to eliminate non-characters,
as in the connected-component-based methods. Lastly,
character candidates are grouped into text.</p>
      <p>
        In this paper we consider only connected-component-based
methods. According to the results of the ICDAR 2013
competition, this approach proved to be
more effective than the others. We picked the methods
proposed by Yin et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Gomez et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Chen et
al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for further consideration. These algorithms achieve
good results on the ICDAR datasets and use different
approaches to detect text. All of these methods use the
MSER algorithm to extract character candidates,
so it is important to describe this algorithm in detail.
      </p>
      <sec id="sec-2-1">
        <title>The MSER algorithm</title>
        <p>
          The Maximally Stable Extremal Regions (MSER) algorithm
[
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is used for detecting character candidates in many
state-of-the-art text detection algorithms [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>The input of the MSER algorithm is a grayscale image
<inline-formula><tex-math>I</tex-math></inline-formula>. The output of the algorithm is a sequence of images
<inline-formula><tex-math>(I_t)_{t=0}^{255}</tex-math></inline-formula>, which is created as follows. The input image <inline-formula><tex-math>I</tex-math></inline-formula> is
successively binarized with a threshold <inline-formula><tex-math>t</tex-math></inline-formula> iterating from 0
to 255. The first image in the sequence, <inline-formula><tex-math>I_0</tex-math></inline-formula>, is completely
black. In the following images of the sequence white areas
appear and grow, and the last image, <inline-formula><tex-math>I_{255}</tex-math></inline-formula>, is completely
white. Some implementations of this algorithm
construct the sequence in the opposite order, so that the first
image in the sequence is white and the last image is
completely black. White areas in the sequence are called
extremal regions. For every extremal region we can
determine for how many successive images in the sequence
the region stays the same. Thus, by selecting a
threshold value <inline-formula><tex-math>R</tex-math></inline-formula>, we can choose the regions that stay exactly
the same in at least <inline-formula><tex-math>R</tex-math></inline-formula> successive images of the sequence.
Such regions are called Maximally Stable Extremal
Regions.</p>
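        <p>The construction described above can be sketched in a few lines of Python. This is only an illustration of the exact-equality rule stated in the text, not the efficient implementation of Matas et al. The function names <monospace>components</monospace> and <monospace>stable_regions</monospace>, the use of 4-connectivity, and the convention that a pixel is white when its intensity is below the threshold are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import deque


def components(image, t):
    """White 4-connected components of the image binarized at threshold t.

    Assumption: a pixel is white when its intensity is below t,
    so t = 0 yields a completely black image.
    """
    h, w = len(image), len(image[0])
    seen = set()
    regions = []
    for y in range(h):
        for x in range(w):
            if image[y][x] >= t or (y, x) in seen:
                continue
            # Breadth-first flood fill of one white component.
            queue, region = deque([(y, x)]), set()
            seen.add((y, x))
            while queue:
                cy, cx = queue.popleft()
                region.add((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen and image[ny][nx] < t):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            regions.append(frozenset(region))
    return regions


def stable_regions(image, R):
    """Regions that stay exactly the same in at least R successive images."""
    run = {}        # region -> number of successive thresholds it has survived
    stable = set()
    for t in range(256):
        current = components(image, t)
        # A region that changed or vanished is dropped, which resets its count.
        run = {region: run.get(region, 0) + 1 for region in current}
        stable.update(region for region, n in run.items() if n >= R)
    return stable


# A dark three-pixel stroke (intensity 10) on a bright background (intensity 200).
image = [
    [200, 200, 200, 200, 200],
    [200, 10, 10, 10, 200],
    [200, 200, 200, 200, 200],
]
stroke_only = stable_regions(image, 100)
```
]]></preformat>
        <p>In this toy image the stroke stays unchanged over a long range of thresholds, so it is the only region returned for a large value of R, while the region covering the whole image appears only for the highest thresholds.</p>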
        <p>An advantage of the MSER algorithm is that it is
well suited to finding text character candidates. The
MSER algorithm is invariant to affine transformations, it can
be applied to low-quality images, and it has an efficient
implementation. For example, the original
implementation proposed by Matas et al. has a complexity of
<inline-formula><tex-math>O(n \log \log n)</tex-math></inline-formula>, where <inline-formula><tex-math>n</tex-math></inline-formula> is the number of pixels in the
image. In particular, it is important that this algorithm is
invariant to the language of the text in images.</p>
        <p>A disadvantage of MSER is that it detects many
false positives, that is, regions that do not contain characters.
Therefore, it is necessary to apply additional checks to
eliminate non-text regions. MSER is also quite
sensitive to image blur. In a blurred image, some
character regions may not be separable from each other.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Overview of text detection algorithms</title>
        <p>
          The algorithm proposed by Yin et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] was presented
at ICDAR 2013 and won first place in the “Multi-script
Robust Reading Competition in ICDAR 2013” [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It
uses an approach based on the MSER algorithm to find
character glyphs.
        </p>
        <p>
          As we mentioned before, it is possible to detect
character regions with the original MSER algorithm even when
an image is of poor quality. However, in this case the
number of false positive character regions can be large. To
solve this problem, the algorithm by Yin et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
additionally performs parent-children elimination on the
MSER tree, which improves the accuracy of finding character
regions. The main idea is to eliminate regions with a very
small or very large aspect ratio. That is, if at some point
during the execution of the MSER algorithm an extremal region
violates the aspect ratio limits, this region is removed from
the character candidates and is not processed further.
        </p>
        <p>The next step of the algorithm is to group characters
in order to construct text candidates. Character
candidates are clustered into text candidates by the single-link
clustering algorithm. The parameters of the clustering
algorithm, a distance function and a threshold, are learned
simultaneously using an algorithm called “self-training
distance metric learning”, which is also proposed by
Yin et al. The parameters depend on the following
features: spatial distance, differences in width
and height, top and bottom alignment, color difference, and
stroke width difference. At the final step, text candidates
are labeled by a classifier as text or non-text areas.
The following features are used to train the classifier:
smoothness, average stroke width, stroke width
variation, height, width, and aspect ratio.</p>
        <p>
          The second algorithm which we selected for
analysis is the algorithm proposed by Gomez et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. This
algorithm was presented at ICDAR 2013 and took
second place in the “Multi-script Robust Reading Competition
in ICDAR 2013” [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>This algorithm is a region-based algorithm and it uses the
MSER algorithm at the first step for detecting text
characters. The regions produced by the MSER are then
filtered by the following features: size, aspect ratio, stroke
width variance, and number of holes.</p>
        <p>
          Next, a number of possible grouping hypotheses are
created. The hypotheses differ from one another in the
image features they use. Then these groups are analyzed based on
Gestalt theory, formalized in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and only the most
meaningful ones are kept. To construct the groups the
following features are used: geometrical features,
intensity and color means of the region, intensity and color
means of the outer boundary, stroke width, gradient
magnitude mean on the border.
        </p>
        <p>To construct text candidates the single-link clustering
algorithm is used with similar features.</p>
        <p>At the final step a classifier is used in order to filter
non-text candidates. To train the classifier the following
features are used: stroke width, area, perimeter, number
and area of holes.</p>
        <p>
          Finally, let us discuss the algorithm proposed by Chen et
al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. It uses a combination of the MSER algorithm and the Canny
edge detector [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] for detecting text candidates. When the
image is blurred, this combination copes well
because adjacent symbols are separated by the Canny
detector. This is achieved by removing the MSER pixels
outside the boundary formed by the Canny edges.
        </p>
        <p>To filter out non-text regions the following features
are used: size, aspect ratio, number of holes, and stroke
width.</p>
        <p>To construct text candidates the single-link
clustering algorithm is used; its main parameters are
spatial distance, width and height, and aspect ratio. There is an
additional check after the text candidates are built: a text
line is rejected if a significant portion of its objects are
repetitive.</p>
        <p>
          At the final step text lines are split into individual
words using Otsu’s method [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
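        <p>As an illustration of this last step, the sketch below applies Otsu's criterion (maximizing the between-class variance of two classes) to the horizontal gaps between neighboring characters of one text line. Applying the method to the gaps, the box format <monospace>(x_left, x_right)</monospace>, and the function names are assumptions made for this example; the text above only states that Otsu's method is used.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def otsu_threshold(values):
    """Threshold that maximizes the between-class variance of two classes."""
    data = sorted(values)
    n = len(data)
    total = sum(data)
    best_score, best_threshold = -1.0, data[0]
    running = 0.0
    for k in range(1, n):
        running += data[k - 1]
        if data[k] == data[k - 1]:
            continue  # a split is only possible between distinct values
        w0, w1 = k / n, (n - k) / n                      # class weights
        m0, m1 = running / k, (total - running) / (n - k)  # class means
        score = w0 * w1 * (m0 - m1) ** 2
        if score > best_score:
            best_score = score
            best_threshold = (data[k - 1] + data[k]) / 2
    return best_threshold


def split_into_words(boxes):
    """Split one text line into words.

    boxes: (x_left, x_right) extents of the characters, sorted left to right.
    """
    if len(boxes) < 2:
        return [list(boxes)]
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    threshold = otsu_threshold(gaps)
    words, current = [], [boxes[0]]
    for box, gap in zip(boxes[1:], gaps):
        if gap > threshold:      # a wide gap starts a new word
            words.append(current)
            current = []
        current.append(box)
    words.append(current)
    return words


# Five characters; the gap between the third and the fourth is much wider.
line = [(0, 10), (12, 22), (24, 34), (50, 60), (62, 72)]
words = split_into_words(line)
```
]]></preformat>
        <p>If all gaps are equal, no split point exists and the whole line is returned as a single word.</p>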
      </sec>
      <sec id="sec-2-3">
        <title>The main common steps of text detection algorithms</title>
        <sec id="sec-2-3-1">
          <p>In general, by analyzing some of the most efficient
algorithms on the ICDAR datasets, we can distinguish the
main steps that are common to all of them:</p>
          <list list-type="order">
            <list-item><p>Region decomposition: extraction of text character candidates.</p></list-item>
            <list-item><p>Filtering of regions using different heuristics to eliminate non-text candidates.</p></list-item>
            <list-item><p>Text line formation.</p></list-item>
          </list>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Analysis of the main steps of the algorithms</title>
      <p>In this section we discuss the main steps of the methods
and provide a theoretical estimation of their language
invariance.</p>
      <sec id="sec-3-1">
        <title>Character candidates extraction</title>
        <p>As presented above, it is common to use the MSER algorithm
for region decomposition. The MSER
algorithm depends only on the intensity of the image. Since
text in an image tends to have uniform intensity, at least
within each symbol, the result of the algorithm is independent
of the language.</p>
        <p>We can conclude that the MSER algorithm is equally
applicable to region decomposition for images
containing only one language and for images with
multilingual text. Since the Canny edge detector does not depend on
the language, it follows that the modified MSER proposed
by Chen et al. is also invariant to the language.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Filtering of regions</title>
        <p>Let us review every feature used for region filtering.</p>
        <sec id="sec-3-2-1">
          <title>Aspect ratio</title>
          <p>Most letters of the English alphabet have an aspect ratio
close to 1, so this feature can be useful for
filtering out false character candidates. To cope with
elongated letters such as 'i' or 'l', the lower threshold should
be small enough. On the one hand, this feature can
be used for many languages: even if a letter
has a very small aspect ratio and is filtered out, the
absence of this letter will not prevent the rest of the
word from being grouped at the grouping stage.</p>
          <p>On the other hand, when an entire word is not split
into characters, text detection may become
difficult. There are languages in which every word
is written as a connected whole. For instance, in Hindi
the letters of a word are joined by a continuous line. In
this case it is difficult to apply this feature sensibly,
because words may be very long. Thus this
feature has limitations and may not be usable for all
languages.</p>
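          <p>The effect can be illustrated with a minimal Python sketch. The box format <monospace>(x, y, width, height)</monospace>, the limits 0.1 and 3.0, and the example boxes are hypothetical values chosen for illustration; they are not taken from any of the discussed algorithms.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def aspect_ratio(box):
    """Width-to-height ratio of a bounding box given as (x, y, width, height)."""
    _, _, width, height = box
    return width / height


def keep_by_aspect_ratio(boxes, low=0.1, high=3.0):
    """Keep boxes whose aspect ratio lies in [low, high].

    Both limits are illustrative assumptions.
    """
    return [box for box in boxes if low <= aspect_ratio(box) <= high]


letter_o = (0, 0, 20, 22)        # roughly square glyph
letter_i = (30, 0, 4, 22)        # elongated glyph, kept thanks to the small lower limit
joined_word = (40, 0, 150, 24)   # whole word extracted as one connected region

kept = keep_by_aspect_ratio([letter_o, letter_i, joined_word])
```
]]></preformat>
          <p>With these limits the two Latin glyphs pass the filter, while the connected word is rejected as a character candidate, which is the difficulty described above for scripts such as Hindi.</p>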
        </sec>
        <sec id="sec-3-2-2">
          <title>Region height</title>
          <p>Irrespective of the language, the height of the characters
in one word is always about the same. Therefore, this type
of filter is invariant to the language.</p>
        </sec>
        <sec>
          <title>Number of holes</title>
          <p>The number of holes in English characters and in
hieroglyphs (for example, Chinese characters) may differ. Therefore, this
feature requires additional configuration for
different languages.</p>
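          <p>One common way to count holes, sketched below under our own assumptions, is to count the background areas of a binary glyph that do not touch the image border. The string-based glyph format ('#' for ink, '.' for background) and the function name <monospace>count_holes</monospace> are hypothetical and serve only as an illustration.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def count_holes(glyph):
    """Count background areas that are fully enclosed by ink.

    glyph is a list of equal-length strings; '#' is ink, '.' is background.
    Background pixels are connected through their 4 neighbors.
    """
    h, w = len(glyph), len(glyph[0])
    seen = set()

    def flood(start):
        """Fill one background component; report whether it touches the border."""
        stack, touches_border = [start], False
        seen.add(start)
        while stack:
            y, x = stack.pop()
            if y in (0, h - 1) or x in (0, w - 1):
                touches_border = True
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < h and 0 <= nx < w
                        and glyph[ny][nx] == '.' and (ny, nx) not in seen):
                    seen.add((ny, nx))
                    stack.append((ny, nx))
        return touches_border

    holes = 0
    for y in range(h):
        for x in range(w):
            if glyph[y][x] == '.' and (y, x) not in seen:
                if not flood((y, x)):
                    holes += 1
    return holes


latin_o = ["####",
           "#..#",
           "####"]

latin_l = ["#..",
           "#..",
           "###"]

# A grid-shaped glyph, similar in structure to the Chinese character 田.
grid = ["#####",
        "#.#.#",
        "#####",
        "#.#.#",
        "#####"]
```
]]></preformat>
          <p>A filter tuned to the hole counts typical of Latin letters would therefore need a different limit for glyphs such as the grid-shaped example.</p>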
        </sec>
        <sec id="sec-3-2-3">
          <title>Stroke width</title>
          <p>
            This feature is very important, as shown in the
work of Epshtein et al. [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ]. However, the proposed
implementation has a limitation for elements that
have non-parallel edges. This limitation is
significant for languages such as Arabic.
In addition, Arabic writing tends
to have more variation in stroke width; thus, to
achieve maximum efficiency, this feature must be
configured separately for different languages.
          </p>
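          <p>The stroke width transform of Epshtein et al. is not reproduced here. As a crude stand-in, assumed purely for illustration, the sketch below measures the lengths of horizontal ink runs in a binary glyph and reports their coefficient of variation. The glyph format and the function names are hypothetical, and horizontal runs are a meaningful proxy only for roughly vertical strokes.</p>
          <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean, pstdev


def horizontal_runs(glyph):
    """Lengths of maximal horizontal runs of ink ('#') in every row."""
    runs = []
    for row in glyph:
        runs.extend(len(run) for run in row.split('.') if run)
    return runs


def stroke_width_variation(glyph):
    """Coefficient of variation (standard deviation / mean) of the run lengths."""
    runs = horizontal_runs(glyph)
    return pstdev(runs) / mean(runs)


# A vertical bar of constant width.
uniform_stroke = ["..##..",
                  "..##..",
                  "..##..",
                  "..##.."]

# A stroke whose width grows from top to bottom.
tapered_stroke = ["#.....",
                  "##....",
                  "####..",
                  "######"]

uniform_variation = stroke_width_variation(uniform_stroke)
tapered_variation = stroke_width_variation(tapered_stroke)
```
]]></preformat>
          <p>A fixed limit on such a variation measure that suits uniform strokes would reject the tapered stroke, which is why the limit has to be configured per language.</p>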
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Text line formation</title>
        <p>Typically, one of two approaches is used to construct
text lines: methods based on machine learning, or methods
that pair connected components using rules.
Algorithms that use rules for pairing regions are quite
stable across languages, because the main criteria
for combining regions are spatial distance and lower and
upper alignment. These features are invariant to the
language, so all the pros and cons of this approach hold
irrespective of the language.</p>
        <p>At the same time, as shown above, the algorithms that use
machine learning techniques rely on single-link
clustering, in which the distance between two clusters is
defined as the distance between their two closest
members. Usually, the problem is to determine the
distance function. For example, in the algorithm
proposed by Yin et al. the distance function is a weighted
sum of features, where the weight of each feature is
determined by machine learning techniques. This
approach requires a good training set. Another problem
related to machine learning algorithms is overfitting,
especially if a training set has to be built for many
languages.</p>
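        <p>A minimal sketch of single-link clustering with a weighted-sum distance is given below. The feature names, the weights, and the threshold are hand-picked assumptions for illustration, whereas in the algorithm of Yin et al. they are learned from training data.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def distance(a, b, weights):
    """Weighted sum of absolute feature differences between two candidates."""
    return sum(w * abs(a[name] - b[name]) for name, w in weights.items())


def single_link(candidates, weights, threshold):
    """Group candidates whose chain of pairwise distances stays within threshold.

    Linking every pair that is closer than the threshold and taking the
    connected groups gives the same result as merging the two closest
    clusters repeatedly until the smallest distance exceeds the threshold.
    """
    parent = list(range(len(candidates)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            if distance(candidates[i], candidates[j], weights) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(candidates)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Three characters on one line and one character far away on another line.
chars = [
    {"x": 10, "top": 0, "bottom": 20},
    {"x": 22, "top": 0, "bottom": 20},
    {"x": 34, "top": 0, "bottom": 20},
    {"x": 200, "top": 50, "bottom": 70},
]
weights = {"x": 0.05, "top": 1.0, "bottom": 1.0}   # hypothetical weights
groups = single_link(chars, weights, threshold=1.0)
```
]]></preformat>
        <p>The first and the third character are farther apart than the threshold, yet they end up in the same group because each of them is close to the second one; this chaining is characteristic of single-link clustering.</p>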
        <p>To see why algorithms that group characters need
additional configuration, it is enough to consider
Chinese characters that consist of several parts. In this case,
not only characters must be grouped but also the
different elements of the same
character. Therefore, the weights of some features, such as
upper and lower alignment, character width and height,
and relative size, should not be large. Otherwise the
probability of an error of the second kind (a false negative) increases,
because parts of a character will be interpreted by the
algorithm as individual characters.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Empirical analysis</title>
      <p>To confirm the theoretical estimates provided in the
previous section, we perform a series of experiments
with two of the methods described in Section 2.</p>
      <sec id="sec-4-1">
        <title>Description of the experiments</title>
        <p>
          For the experiments two algorithms were selected: the
one proposed by Yin et al. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and the algorithm by
Chen et al. [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Recall that, to filter out non-text
regions, the first algorithm uses a machine learning
technique and the second one uses a rule-based approach.
For evaluation we used a similar approach and the same
quality measures as in the evaluation scheme of the ICDAR
2013 competition. The following quality measures are
used: precision, recall and f-measure. They are defined
as follows:
        </p>
        <disp-formula>
          <tex-math>\mathrm{recall} = \frac{\sum_{i=1}^{|G|} \mathrm{match}_G(G_i)}{|G|}, \quad \mathrm{precision} = \frac{\sum_{j=1}^{|D|} \mathrm{match}_D(D_j)}{|D|}, \quad f = \frac{2 \cdot \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}},</tex-math>
        </disp-formula>
        <p>where G is the set of ground-truth rectangles and D is the set of
estimated rectangles. The matching functions are defined
as follows:</p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{match}_G(G_i) = \max_{j=1..|D|} \frac{2 \cdot \mathrm{area}(G_i \cap D_j)}{\mathrm{area}(G_i) + \mathrm{area}(D_j)}</tex-math>
        </disp-formula>
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathrm{match}_D(D_j) = \max_{i=1..|G|} \frac{2 \cdot \mathrm{area}(D_j \cap G_i)}{\mathrm{area}(D_j) + \mathrm{area}(G_i)}</tex-math>
        </disp-formula>
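        <p>The measures defined above can be computed as in the following sketch. The rectangle format <monospace>(x1, y1, x2, y2)</monospace>, the function names, and the convention of returning zero for empty sets are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def area(rect):
    """Area of an axis-aligned rectangle (x1, y1, x2, y2); zero if it is empty."""
    x1, y1, x2, y2 = rect
    return max(0, x2 - x1) * max(0, y2 - y1)


def intersection(a, b):
    """Intersection rectangle of a and b (it may be empty)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def match(rect, others):
    """Best value of 2 * area(rect and other) / (area(rect) + area(other))."""
    return max(
        (2 * area(intersection(rect, other)) / (area(rect) + area(other))
         for other in others),
        default=0.0,
    )


def evaluate(ground_truth, detected):
    """Return (recall, precision, f) for two lists of rectangles."""
    recall = (sum(match(g, detected) for g in ground_truth) / len(ground_truth)
              if ground_truth else 0.0)
    precision = (sum(match(d, ground_truth) for d in detected) / len(detected)
                 if detected else 0.0)
    total = recall + precision
    f = 2 * recall * precision / total if total else 0.0
    return recall, precision, f


# Two ground-truth rectangles, only the first of which is detected exactly.
G = [(0, 0, 10, 10), (20, 0, 30, 10)]
D = [(0, 0, 10, 10)]
recall, precision, f = evaluate(G, D)
```
]]></preformat>
        <p>In this example every detection is correct, but one ground-truth rectangle is missed, so the precision is perfect while the recall is halved.</p>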
      </sec>
      <sec id="sec-4-2">
        <title>Description of test data</title>
        <p>In the first group of tests we ran the selected algorithms
on the following datasets: MSRA-TD500, ICDAR 2011, and
ICDAR 2013. The ICDAR 2011 dataset contains images with
text in English only. The MSRA-TD500 dataset contains
images with text in English and Chinese. The ICDAR 2013
dataset contains images with multilingual text, including
Indo-Aryan languages and Chinese writing.</p>
        <p>We also created a new dataset that contains
images with multilingual text. In this dataset we included
images that are poorly suited to text detection with
heuristics such as those in the algorithms of Yin et al. and Chen
et al. The majority of the images in this dataset contain text
in Hindi or Arabic with strong variability of stroke
width, or text in Chinese where characters consist of
several parts.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Analysis of the experimental results</title>
        <p>The results of every test are presented in the following
tables.</p>
        <p>The experimental results show a considerable
difference between the results on the ICDAR 2011 dataset,
which contains only images with English text, and the results on all
the other datasets. The minimum recall is reached on
our special dataset, as expected.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this work the following results were obtained:</p>
      <list list-type="bullet">
        <list-item><p>The most efficient text detection algorithms are discussed.</p></list-item>
        <list-item><p>The main common steps of text detection algorithms are identified.</p></list-item>
        <list-item><p>Every step of the text detection algorithms is analyzed theoretically for invariance to the language.</p></list-item>
        <list-item><p>A series of experiments is carried out and evaluated.</p></list-item>
      </list>
      <sec id="sec-5-1">
        <p>Our work shows that the existing set of
features may depend strongly on the language. By
changing the settings of the rules used in the algorithms,
one can improve text detection results for some
predefined languages.</p>
        <p>As a possible continuation of this work, we plan to
implement a complete algorithm that solves the problem
of text detection irrespective of the language. The analysis
presented in this paper helps to identify the problematic parts
of the existing algorithms. The created dataset and the
experimental results will make it possible to better evaluate the
results of this new algorithm.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Canny</surname>
          </string-name>
          .
          <article-title>A computational approach to edge detection</article-title>
          .
          <source>IEEE Trans. Pattern Analysis and Machine Intelligence</source>
          ,
          <volume>8</volume>
          :
          <fpage>679</fpage>
          -
          <lpage>698</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.S.</given-names>
            <surname>Tsai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Schroth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grzeszczuk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Girod</surname>
          </string-name>
          .
          <article-title>Robust text detection in natural images with edge-enhanced maximally stable extremal regions</article-title>
          .
          <source>IEEE International Conference on Image Processing</source>
          ,
          <year>Sep 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Desolneux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moisan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Morel</surname>
          </string-name>
          .
          <article-title>A grouping principle and four applications</article-title>
          .
          <source>IEEE Trans. PAMI</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Epshtein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ofek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wexler</surname>
          </string-name>
          .
          <article-title>Detecting text in natural scenes with stroke width transform</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gomez</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          .
          <article-title>Multi-script text extraction from natural scenes</article-title>
          .
          <source>ICDAR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.N. Anil</given-names>
            <surname>Prasad</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.G.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          .
          <article-title>Multi-script robust reading competition in ICDAR 2013</article-title>
          .
          <source>In ACM Proc. International Workshop on Multilingual OCR (MOCR 2013)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Chum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Urban</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Pajdla</surname>
          </string-name>
          .
          <article-title>Robust wide baseline stereo from maximally stable extremal regions</article-title>
          .
          <source>In British Machine Vision Conference</source>
          , volume
          <volume>1</volume>
          , pages
          <fpage>384</fpage>
          -
          <lpage>393</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Milyaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Barinova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Novikova</surname>
          </string-name>
          .
          <article-title>Image binarization for end-to-end text understanding in natural images</article-title>
          .
          <source>ICDAR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Matas</surname>
          </string-name>
          .
          <article-title>Real-time scene text localization and recognition</article-title>
          .
          <source>IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>N.</given-names>
            <surname>Otsu</surname>
          </string-name>
          .
          <article-title>A threshold selection method from gray-level histograms</article-title>
          .
          <source>IEEE Transactions on Systems, Man and Cybernetics</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>62</fpage>
          -
          <lpage>66</lpage>
          ,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>I. Zeki</given-names>
            <surname>Yalniz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Gray</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          .
          <article-title>Adaptive exploration of text regions in natural scene images</article-title>
          .
          <source>ICDAR</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>X.-C.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Huang</surname>
          </string-name>
          .
          <article-title>Robust text detection in natural scene images</article-title>
          .
          <source>CoRR, abs/1301.2628</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>