Development and Evaluation of a Text Recognition Framework using Synthetic Data

Daniel Steininger 1, Andreas Zweng 1, Csaba Beleznai 1 and Thomas Netousek 2

1 AIT Austrian Institute of Technology GmbH, Vienna, Austria, email: daniel.steininger.fl@ait.ac.at
2 eMedia Monitor GmbH, Vienna, Austria

Abstract. Text recognition is an intricate Computer Vision task. The main complexity arises from the fact that text as a character sequence spans a very large space of possible appearances, induced by combinatorially vast character orderings, diverse font styles, weights, colors and backgrounds. In order to encode a rich representation of these variations and to generate an informative model by statistical learning, image data balanced along all dimensions of variation are needed. In this paper we present a synthetic text pattern generation framework and its use for localizing text lines and recognizing characters in individual frames of broadcast videos. The paper demonstrates that detection and recognition accuracies can be significantly enhanced by employing synthetic text image patches for training an Aggregate Channel Features (ACF) detector and a Convolutional Neural Network (CNN) character recognizer for the text recognition task. Moreover, an efficient annotation tool is presented for ground truth data generation from videos, enabling evaluation experiments on large-scale (several thousands of frames) video datasets. A quantitative evaluation of the detection functionality and qualitative experiments for character recognition are presented, exhibiting promising results on challenging (low-resolution, compression artifacts) real-world test data.

1 INTRODUCTION

End-to-end text recognition, extracting textual information from digital images, has been a key pattern recognition research topic for multiple decades. Recognition in constrained scenarios (high resolution and contrast, known scale) such as optical character recognition (OCR) in scanned documents has matured to practically relevant systems, while unconstrained, "in the wild" scenarios still represent a substantial challenge. Video text recognition in broadcast videos - the main focus of this paper - lies in between in terms of complexity, since overlay text is typically free from geometric deformations, but still subject to many variations (style, color, background).

In this paper we present the use of synthetic data to generate rich statistical models for the text detection and character recognition tasks. The text detection and localization step generates a set of text line candidates (delineated by a bounding box) and associated text probabilities, whereas character recognition yields class label estimates - based on a pool of previously trained classes - for each of the characters forming a given text line.

In recent years it has been shown within the context of various visual object recognition tasks that vast amounts of artificial training data can be generated automatically, resulting in an improvement of the classification accuracy [8], [14], [12]. By means of synthetic data, not only is the ground truth automatically generated along with the data, but the training data can also be adjusted to a targeted task. In particular, for video text recognition we often know properties such as the expected font type, the font color, the orientation and, to some extent, the background. This prior knowledge can thus be considered directly when generating the training data.

The recent highly successful deep Convolutional Neural Network learning paradigm exhibits an exceptional generalization capability, but at the same time possesses a high descriptive complexity, implying that large amounts of training data are required to establish meaningful distributed representations at the deeper layers of the network. Since manually annotating word regions is highly time-consuming, existing datasets are insufficient for reflecting the high variability of real data. Therefore, generating synthetic text data is an essential requirement. In our end-to-end text recognition chain (see Figure 1) we first employ the highly run-time-efficient ACF detector [6] (detection and localization step), followed by CNN-based character recognition and segmentation stages. All stages are trained on large quantities of synthetic data.

Figure 1. End-to-end processing chain of the text recognition framework.
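As a conceptual illustration of this two-stage chain, the following Python sketch shows how detection and recognition could be composed; the parameter names and interfaces are placeholders for the ACF and CNN stages described in Section 2, not the authors' actual implementation.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected text line

def process_frame(frame: object,
                  detect_lines: Callable[[object], List[Tuple[Box, float]]],
                  read_line: Callable[[object, Box], str]) -> List[Tuple[Box, float, str]]:
    """Run the two-stage chain of Figure 1: localize text lines, then read each one."""
    results = []
    for box, score in detect_lines(frame):   # stage 1: ACF text detection (Section 2.3)
        text = read_line(frame, box)          # stage 2: CNN sliding-window OCR (Section 2.4)
        results.append((box, score, text))
    return results
```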
The first promising application of Convolutional Neural Networks (CNNs) for recognizing handwritten numbers was shown in [11]. Although problem size and data variability were limited due to the available hardware at that time, the end-to-end pipeline for training and testing, as well as the compositional power of multi-layer neural networks, fundamentally influenced many areas of machine learning in later years. Based on this precursor work, the results of [10] showed the potential of deep architectures in combination with the general-purpose computing capabilities of GPUs, which facilitated training on larger datasets and tackling more challenging tasks.

Recent approaches with classifiers based on CNNs have shown a significant impact on the accuracy of text recognition systems [3], [2], [12]. The absence of hand-crafted features, the implicit learning of prior knowledge and deployment on more powerful hardware make them the most promising approach for Optical Character Recognition. The approach in [4] tunes the classification accuracy with synthetic data specialized on specific font categories, whereas [8] presents a sophisticated framework for synthetic text generation, which can easily be adapted to other languages without any human labeling effort.

The paper is organized as follows: Section 2 describes the overall employed methodology, the synthetic data generation scheme and the individual algorithmic steps to train, test and evaluate the text detection and recognition stages. Section 3 presents and discusses the obtained results for text localization and recognition. Finally, the paper is concluded in Section 4.

2 METHODOLOGY

In the following section we describe a text synthesis tool and its use in the context of training text detection and character recognition. Furthermore, we present an annotation tool enabling key-frame-based manual annotation of broadcast videos and opening the way for large-scale evaluation of video text detection and recognition.

2.1 General overview

Establishing large datasets for the development of a video text recognition system is essential for two reasons. (i) Learning representations of text appearance: as with other complex vision systems, a good generalization capability (modeling quality for previously unseen data) is crucial, since high recall is essential for covering all relevant broadcast video content. In order to capture the visual appearance of vast amounts of text, we built a text image synthesis tool capable of generating unlimited amounts of training data that match the appearance of overlay text in videos well. (ii) Large-scale characterization and evaluation: the desired improvements in recognition accuracy call for datasets which represent a broad range of variation of the class to be recognized (for text: font styles, weight, size, spacing, background) and of the class which should be ignored (background, also containing difficult patterns such as repetitive textures, clutter and high-frequency noise). To accomplish this we (i) built a video annotation tool targeting large-scale annotation and (ii) constructed an evaluation framework following established standards (ICDAR [9]) in the text recognition community.

2.2 Synthetic data generation

Figure 2. Illustration providing an overview of the characteristics of the generated synthetic datasets. Top: for text detection and localization, patches of character triplets are created. Center: individual characters are trained on synthetic single-character instances. Bottom: to detect gaps between characters we use an additional "gap" class, which consists of two-character patch instances centered on their gap location.

Figure 3. Reference lines governing a text line's geometric properties. The right side of the image shows artificial text image patches aligned to these reference lines during the synthesis step.

We developed a text image synthesis tool in Matlab which creates patches of local text patterns within a normalized geometric reference frame defined by predefined baseline, ascender and descender lines (see Figure 3), including artificially induced variations such as font type and weight, text background, spatial offset, scaling and deformations. The synthesis tool employs a prior on the bigram (adjacent character) frequencies of the English language; thus it recreates plausible rather than random character adjacencies. Generated text samples are shown in Figure 2 for the different tasks addressed in this paper.
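A minimal sketch of how such a bigram prior could drive the sampling of plausible character triplets is given below; the frequency table and weights are purely illustrative, not the values used by the authors' Matlab tool.

```python
import random

# Illustrative bigram weights (not the authors' table): higher weight = more frequent pair.
BIGRAM_WEIGHTS = {
    ("t", "h"): 35, ("h", "e"): 30, ("i", "n"): 24,
    ("e", "r"): 20, ("a", "n"): 19, ("q", "z"): 1,
}

def sample_triplet(rng: random.Random) -> str:
    """Sample a plausible three-character string by chaining two weighted bigram draws."""
    pairs = list(BIGRAM_WEIGHTS)
    weights = [BIGRAM_WEIGHTS[p] for p in pairs]
    first, second = rng.choices(pairs, weights=weights)[0]
    # Choose a third character conditioned on the second one, falling back to any pair.
    followers = [p for p in pairs if p[0] == second] or pairs
    third = rng.choices(followers, weights=[BIGRAM_WEIGHTS[p] for p in followers])[0][1]
    return first + second + third

print(sample_triplet(random.Random(0)))  # prints a plausible triplet such as "the"
```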
The synthesis code employs the native Matlab listfonts command, which retrieves all available system fonts. These fonts can be used to render a text string via the text command, and using a screen capture script from the MathWorks FileExchange [1] we convert the rendered text into an image. This image is subsequently cropped around the desired set of characters, using the predefined text reference lines (Figure 3) for geometric normalization. Moreover, additional geometric transforms are applicable: spatial offsets along the x- and y-directions and a rotation within a predefined angular range. An additional image containing no text can be used as a background overlay to introduce structure and texture behind the text characters. This structured background targets enhanced invariance of text detection and recognition in the presence of clutter, while the geometric transformations increase invariance with respect to deviations from an ideal character position and pose. The introduced perturbations typically result in an increased recall rate, which is necessary if all relevant text lines are to be found and correctly read.
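The authors' implementation renders and captures these patches in Matlab; purely for illustration, the following Python/PIL sketch reproduces the same idea (background crop, font rendering, small offsets and a rotation). The font path, patch size and perturbation ranges are assumptions, not values taken from the paper.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synth_patch(text: str, font_path: str, background: Image.Image,
                size=(96, 32), rng=random.Random()) -> Image.Image:
    """Render a text snippet over a background crop with small geometric perturbations."""
    # Random crop of the background image introduces structure behind the characters
    # (assumes the background image is larger than the patch).
    bx = rng.randint(0, background.width - size[0])
    by = rng.randint(0, background.height - size[1])
    patch = background.crop((bx, by, bx + size[0], by + size[1])).convert("RGB")

    # Render the characters; font size and vertical placement loosely mimic the
    # baseline/ascender/descender reference lines of Figure 3.
    font = ImageFont.truetype(font_path, size=int(0.7 * size[1]))
    layer = Image.new("RGBA", size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(layer)
    dx, dy = rng.randint(-3, 3), rng.randint(-2, 2)              # spatial offset (x, y)
    draw.text((4 + dx, 2 + dy), text, font=font, fill=(255, 255, 255, 255))
    layer = layer.rotate(rng.uniform(-2.0, 2.0), expand=False)   # small rotation
    patch.paste(layer, (0, 0), layer)
    return patch
```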
2.3 Synthetic data aided text detection

2.3.1 Training and testing

We generated 35000 text patch samples (a subset is shown in Figure 2, top) which we used to train the ACF (Aggregate Channel Features) detector [5], [6], which we have ported to C++. The synthetically trained detector surpasses the one trained on 5000 manually cropped text patches from videos, mostly due to the facts that (i) more data with greater variation is employed, and (ii) nuisance factors such as compression artifacts, color bleeding and low-resolution effects are absent. A qualitative and quantitative (precision-recall, DetEval [13]) evaluation of these detectors has been performed, and the results show a significant improvement over the detector trained on manually cropped real-world data.

Figure 4. Two difficult (repetitive structures) images demonstrating the classifier accuracy improvement accomplished by using a large synthetic dataset. Yellow rectangles show the raw output of the ACF detector trained using real-world (a) and synthetic (b) data.

Figure 5. Text detection performance improvement using a large synthetic dataset (SYNTH = 35000 samples) vs. a classifier (REAL) trained on 5000 manually cropped real image samples.

The ACF detector employs features (intensity channel, gradient magnitude and orientations) - the so-called channel features - extracted at multiple image resolutions. A specific advantage of the ACF multi-resolution feature computation is that not all resolution levels have to be computed; features at certain scale levels can be directly extrapolated from features at nearby scales at no significant expense in detection accuracy. This feature approximation trick yields an overall detection framework with an excellent ratio of run-time performance to recognition accuracy.
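To make the extrapolation step concrete, the sketch below shows the power-law channel scaling underlying fast feature pyramids [5] for a single gradient-magnitude channel; the exponent value is merely illustrative, and the actual detector computes several channels in the authors' C++ port.

```python
import numpy as np
from scipy.ndimage import zoom

def gradient_magnitude(gray: np.ndarray) -> np.ndarray:
    """One of the ACF channels: gradient magnitude of a grayscale image."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gx, gy)

def approx_channel_at_scale(channel: np.ndarray, s: float, lam: float = 0.11) -> np.ndarray:
    """Approximate a channel at relative scale s from the channel computed at scale 1,
    following the power-law relation of fast feature pyramids [5]:
    C_s ~= s**(-lam) * resample(C, s). The exponent lam is channel dependent;
    0.11 is only an illustrative choice, not a value from the paper."""
    resampled = zoom(channel, s, order=1)
    return (s ** -lam) * resampled
```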
Figure 5 quantifies the accuracy improvement obtained for text detection. As can be seen from the ROC (Receiver Operating Characteristic) curves obtained for manually annotated (REAL) and synthetic (SYNTH) training data, the use of synthetic data improves the detection rate (true positive rate) by about 5 percent at a false positive rate of 0.1. A detailed specification of this evaluation experiment can be found in the Results section.

2.4 Synthetic data aided text recognition (OCR)

Synthetic data for optical character recognition (OCR) makes it possible to enhance the recognition rate by increasing the amount of training data when needed. In this paper, a model is trained on synthetic data using Convolutional Neural Networks, and recognition is performed using a sliding-window approach within the bounding box previously found by the text localization step. The following sections describe our approaches for training and for recognizing text in images.

2.4.1 Training

Training a model using Convolutional Neural Networks raises the question of how much training data to use and how the training data is distributed among the object classes. Even though some characters have a higher probability of appearance than others, each class should contain the same number of samples for training in order to obtain a balanced training set. In this paper, we used 6000 samples per class for the training stage, where each character has separate classes for upper case and lower case, resulting in 62 classes (52 character classes and 10 digit classes) and therefore 372,000 input images. The network layer architecture is the same as LeNet-5 [11], which is shown in Figure 6.

Figure 6. Architecture of the Convolutional Neural Network for Optical Character Recognition.

For training, the Adagrad adaptive learning rate method was used, which was originally proposed by Duchi et al. [7]. Training our model had to cope with several sources of difficulty, such as different font types, background variability, font weight (bold, italic or normal) and inverted colors for half of the training set (black on white and white on black). Training took around 5 hours with an evaluation set of 1000 samples per character and achieved a recognition rate of 95 percent on an independent test set.
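As a concrete starting point, a LeNet-5-style classifier with 62 output classes trained with Adagrad could look like the PyTorch sketch below; the paper does not specify the framework, input resolution or hyperparameters, so the 32x32 grayscale input, learning rate and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CharNet(nn.Module):
    """LeNet-5-style character classifier for 62 classes (assumed 32x32 grayscale input)."""
    def __init__(self, num_classes: int = 62):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 32 -> 28 -> 14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # 14 -> 10 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CharNet()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # Adagrad as in [7]
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step over a mini-batch of normalized character patches."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```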
The DetEval evaluation measures take the to a binary executable), which can perform key-frame based manual various bounding box correspondence types into account and derive annotation by assigning bounding box representations to text lines in BB-level precision and recall values based on a parameter-tunable discrete video frames (see Figure 10). Due to the key-frame concept GT-to-BB association scheme. Using a fixed association tuning pa- these annotations are propagated across time and can be updated and rameter we can generate an ROC curve based characterization (such terminated at any time instance of the video. In this way also ani- mated text lines can be annotated. The framework is able to gener- ate specific, from videos derived datasets according to the evaluation criteria: varying spatial and temporal resolutions and predefined an- notation data schemes (txt, xml, yaml). We have annotated multiple hours of broadcast videos downloaded from YouTube. As Figure 8 displays the annotated set contains 5 dif- ferent English-language TV channels of various lengths, altogether Figure 8. The composition of our generated YouTube dataset containing 4065 annotated frames in total, originating from five different broadcast pro- Figure 9. Different correspondence types between ground truth (green) and grams. detection results (red dashed) bounding boxes. Figure 10. Screenshot of the video annotation tool, capable to quickly annotate large amounts of video text by setting text bounding boxes for selected key-frames and interpolating this information between key-frames. as in Figure 5), whereas by allowing a variable association parame- reduce the amount of False Positives further. ter, characteristic DetEval performance graphs can be computed (not Text recognition: During optical character recognition, a text line shown). detected in the previous localization step is employed as input. The proposed sliding window approach performs the classification of the local analysis window content and assigns a label to it, matching one 3 RESULTS of the 62 character classes or the gap class. Due to the tightly cou- In order to perform the evaluation of the presented text detection pled spatial analysis and fine-grained classification, a separate spatial stages, following steps were carried out. character segmentation step is avoided. Typically, the character seg- Text detection and localization: The 4065 annotated images were mentation step (in terms of foreground-background segmentation) is used to evaluate the ACF detector [5], in one case trained on 5000 the most sensitive stage in an end-to-end text recognition framework, manually annotated text data (denoted as the REAL detector), in the where small resolution and degraded character appearance are prob- second case on 35000 fully synthetic data (denoted as the SYNTH able to lead to segmentation failures. Classification results are shown detector). The DetEval evaluation criterion was used with a fixed in Figure 11, where individual class-specific confidences are shown BB-to-BB (BB - bounding box) association parameter, while vary- as dark peaks (confidence heat map is inverted for better visibility) in ing the classifier sensitivity. A qualitative comparison is displayed in the individual rows of character classes. The bottom part of the fig- Figure 4 and the resulting ROC curve is shown in 5. 
2.5 Annotation and evaluation

We built an easy-to-use software tool (in Matlab, but also compiled to a binary executable) which performs key-frame-based manual annotation by assigning bounding box representations to text lines in discrete video frames (see Figure 10). Due to the key-frame concept, these annotations are propagated across time and can be updated and terminated at any time instance of the video. In this way, animated text lines can also be annotated. The framework is able to generate specific datasets derived from the videos according to the evaluation criteria: varying spatial and temporal resolutions and predefined annotation data schemes (txt, xml, yaml).

Figure 10. Screenshot of the video annotation tool, capable of quickly annotating large amounts of video text by setting text bounding boxes for selected key-frames and interpolating this information between key-frames.

We have annotated multiple hours of broadcast videos downloaded from YouTube. As Figure 8 shows, the annotated set contains five different English-language TV channels of various lengths, altogether consisting of 4065 annotated frames. The annotation contains bounding box coordinates (relating to single text lines) and ground-truth ASCII text for each bounding box. The dataset contains many complex scenes where clutter, high-frequency and repetitive patterns occur. This annotated dataset forms our evaluation dataset, on which the experiments assessing the text detection accuracy have been performed.

Figure 8. The composition of our generated YouTube dataset containing 4065 annotated frames in total, originating from five different broadcast programs.

Conventional bounding-box-overlap-based evaluation measures exhibit a deficiency in the definition of matching between detected and ground truth (GT) bounding boxes (BBs). Namely, in most cases there is no one-to-one correspondence: several detected BBs may correspond to a single GT-BB, or a single detected BB may match multiple GT-BBs. The first case is denoted as splitting, the latter as merging; these situations are depicted in Figure 9. Such ambiguous matchings render overlap-based evaluation results ambiguous: a 50 percent overlap with the GT might imply either that half of the GT-BBs were detected, or that half of the area of a single GT-BB was detected. We adopted the ICDAR 2015 evaluation scheme [9], relying on the DetEval framework [13]. The DetEval evaluation measures take the various bounding box correspondence types into account and derive BB-level precision and recall values based on a parameter-tunable GT-to-BB association scheme. Using a fixed association tuning parameter we can generate an ROC-curve-based characterization (as in Figure 5), whereas allowing a variable association parameter yields characteristic DetEval performance graphs (not shown).

Figure 9. Different correspondence types between ground truth (green) and detection result (red dashed) bounding boxes.
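As a simplified illustration of split/merge-tolerant matching (not the actual DetEval algorithm [13]), the sketch below counts a ground-truth box as recalled when the detections jointly cover enough of its area, and a detection as correct when ground-truth boxes jointly cover enough of it; the coverage threshold tau is an arbitrary illustrative value.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def inter_area(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def coverage_pr(gts: List[Box], dets: List[Box], tau: float = 0.8) -> Tuple[float, float]:
    """Area-coverage precision/recall tolerant to split and merge cases.
    (Overlapping boxes are not deduplicated here, so coverage is slightly optimistic.)"""
    recalled = sum(1 for g in gts
                   if area(g) > 0 and sum(inter_area(g, d) for d in dets) / area(g) >= tau)
    correct = sum(1 for d in dets
                  if area(d) > 0 and sum(inter_area(d, g) for g in gts) / area(d) >= tau)
    recall = recalled / len(gts) if gts else 1.0
    precision = correct / len(dets) if dets else 1.0
    return precision, recall
```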
3 RESULTS

In order to evaluate the presented text detection stages, the following steps were carried out.

Text detection and localization: The 4065 annotated images were used to evaluate the ACF detector [5], in one case trained on 5000 manually annotated text samples (denoted as the REAL detector), in the second case on 35000 fully synthetic samples (denoted as the SYNTH detector). The DetEval evaluation criterion was used with a fixed BB-to-BB (BB - bounding box) association parameter, while varying the classifier sensitivity. A qualitative comparison is displayed in Figure 4 and the resulting ROC curve is shown in Figure 5. As the plot illustrates, synthetic data brings a significant improvement in terms of the detection rate (True Positive Rate) at a given false alarm rate (False Positive Rate). Training the ACF detector with even more synthetic data (50000 samples) did not improve the results, indicating that (i) the additional data does not introduce further appearance or structural information about the modeled class and (ii) the representation capability of the employed features and learning strategy is limited. Due to these limitations, repetitive structures and clutter still represent a problem (as seen by the non-vanishing number of False Positives), but synthetic data improves the rate of false alarms. The ACF detector is fast, detecting and localizing text at about 7 frames per second for an image frame with a resolution of 1024x768 pixels on a modern PC. A subsequent, more stringent and at the same time computationally more demanding analysis step, such as an OCR step, is able to reduce the number of False Positives further.

Text recognition: During optical character recognition, a text line detected in the previous localization step is employed as input. The proposed sliding-window approach classifies the content of the local analysis window and assigns a label to it, matching one of the 62 character classes or the gap class. Due to the tightly coupled spatial analysis and fine-grained classification, a separate spatial character segmentation step is avoided. Typically, the character segmentation step (in terms of foreground-background segmentation) is the most sensitive stage in an end-to-end text recognition framework, where small resolution and degraded character appearance are likely to lead to segmentation failures. Classification results are shown in Figure 11, where individual class-specific confidences are shown as dark peaks (the confidence heat map is inverted for better visibility) in the individual rows of character classes. The bottom part of the figure displays the maximum-confidence classifier response at the character locations, thus forming the OCR output. As can be seen, certain recognition errors still occur: the lower-case character "o" is confused with the upper-case character "Q". The reason for this problem is the presence of upper-case and lower-case letters as well as letters with a descender within a single recognition region (text row), because the sliding-window height is determined by the tallest character in this region and the bottom-most point by the lowest descender of all characters in the region. Given these conditions and the fact that each training image is geometrically normalized with respect to its contained character (with border pixels) during training, the appearance can vary greatly between the training images and a given recognition window. At the current state of our recognition step, this geometric scaling discrepancy is not taken into consideration. However, the synthetic data generation tool can easily be extended to also include samples with such scaling variations, and the Convolutional Neural Network is capable of accommodating these additional variations. This extra representational effort will likely lead to a greatly enhanced recognition performance.

Figure 11. Recognition results shown in detail for a given text line. Individual character classes are shown as separate rows (see the small class labels at the left border). Individual dark peaks in the respective rows indicate a high confidence for a given class. Peaks highlighted in red display the maximum classifier response at a given character center, resulting in the corresponding OCR output at the bottom of the figure.

4 CONCLUSION

In this paper we presented an applied example of using large amounts of synthetic data for generating representative statistical models for the text detection and character recognition tasks. Furthermore, we showed that annotating and synthesizing text is not an overly complicated task, and that simple software tools can generate vast amounts of data with broad appearance characteristics. The rich variability encompassed by the synthetic data yields improved accuracy for the text detection task and enables character recognition without an explicit segmentation step.

Future work will mainly involve the improvement of the sliding-window-based recognition approach. Since Convolutional Neural Networks provide a large capacity to accommodate appearance, scale and other variations for a large number of classes, we plan to enrich the training set with even more variations. We plan to train each character in different text configurations, such as variable padding, thus improving the classifier's invariance with respect to position and geometric scaling. This is highly relevant for the typical case in which a text line consists of geometrically varying characters, such as upper-case, lower-case and characters with ascenders and descenders.

ACKNOWLEDGEMENTS

The work was partially supported by the Vision+ project under the COMET program of the Austrian Research Promotion Agency (FFG) and the research initiative 'Intelligent Vision Austria' with funding from the Austrian Federal Ministry of Science, Research and Economy and the Austrian Institute of Technology.

REFERENCES

[1] MathWorks FileExchange. https://de.mathworks.com/matlabcentral/fileexchange/?s_tid=gn_mlc_fx. Accessed: 2016-09-20.
[2] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven, 'PhotoOCR: Reading text in uncontrolled conditions', in Proceedings of the IEEE International Conference on Computer Vision, pp. 785–792, (2013).
[3] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J. Wu, and Andrew Y. Ng, 'Text detection and character recognition in scene images with unsupervised feature learning', in 2011 International Conference on Document Analysis and Recognition, pp. 440–445. IEEE, (2011).
[4] Teófilo Emídio de Campos, Bodla Rakesh Babu, and Manik Varma, 'Character recognition in natural images', in VISAPP (2), pp. 273–280, (2009).
[5] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona, 'Fast feature pyramids for object detection', IEEE Trans. Pattern Anal. Mach. Intell., 36(8), 1532–1545, (August 2014).
[6] Piotr Dollár, Serge Belongie, and Pietro Perona, 'The fastest pedestrian detector in the west', in BMVC, (2010).
[7] John Duchi, Elad Hazan, and Yoram Singer, 'Adaptive subgradient methods for online learning and stochastic optimization', J. Mach. Learn. Res., 12, 2121–2159, (July 2011).
[8] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, 'Reading text in the wild with convolutional neural networks', International Journal of Computer Vision, 116(1), 1–20, (2016).
[9] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman K. Ghosh, Andrew D. Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny, 'ICDAR 2015 competition on robust reading', in ICDAR, pp. 1156–1160. IEEE Computer Society, (2015).
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, 'ImageNet classification with deep convolutional neural networks', in Advances in Neural Information Processing Systems, pp. 1097–1105, (2012).
[11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, 'Gradient-based learning applied to document recognition', Proceedings of the IEEE, 86(11), 2278–2324, (1998).
[12] Tao Wang, David J. Wu, Adam Coates, and Andrew Y. Ng, 'End-to-end text recognition with convolutional neural networks', in Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 3304–3308. IEEE, (2012).
[13] Christian Wolf and Jean-Michel Jolion, 'Object count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms', International Journal of Document Analysis and Recognition, 8(4), 280–296, (April 2006).
[14] Gökhan Yildirim, Radhakrishna Achanta, and Sabine Süsstrunk, 'Text recognition in natural images using multiclass Hough forests', in Proc. VISAPP, (2013).