Development and Evaluation of a Text Recognition Framework using Synthetic Data

Daniel Steininger 1, Andreas Zweng 1, Csaba Beleznai 1 and Thomas Netousek 2

1 AIT Austrian Institute of Technology GmbH, Vienna, Austria, email: daniel.steininger.fl@ait.ac.at
2 eMedia Monitor GmbH, Vienna, Austria

Abstract. Text recognition is an intricate Computer Vision task. The main complexity arises from the fact that text as a character sequence spans a very large space of possible appearances, induced by combinatorially vast character orderings, diverse font styles, weights, colors and backgrounds. In order to encode a rich representation of these variations and to generate an informative model by statistical learning, image data balanced along all dimensions of variation are needed. In this paper we present a synthetic text pattern generation framework and its use for localizing text lines and recognizing characters in individual frames of broadcast videos. The paper demonstrates that detection and recognition accuracies can be significantly enhanced by employing synthetic text image patches for training an Aggregate Channel Features (ACF) detector and a Convolutional Neural Network (CNN) character recognizer for the text recognition task. Moreover, an efficient annotation tool is presented for ground truth data generation from videos, enabling evaluation experiments on large-scale (several thousands of frames) video datasets. A quantitative evaluation of the detection functionality and qualitative experiments for character recognition are presented, exhibiting promising results on challenging (low-resolution, compression artifacts) real-world test data.

1 INTRODUCTION

End-to-end text recognition, extracting textual information from digital images, has been a key pattern recognition research topic for multiple decades. Recognition in constrained scenarios (high resolution and contrast, known scale) such as optical character recognition (OCR) in scanned documents has matured to practically relevant systems, while unconstrained, "in the wild" scenarios still represent a substantial challenge. Video text recognition in broadcast videos - the main focus of this paper - lies in between in terms of complexity, since overlay text is typically free from geometric deformations, but still subject to many variations (style, color, background).

In this paper we present the use of synthetic data to generate rich statistical models for the text detection and character recognition tasks. The text detection and localization step generates a set of text line candidates (delineated by a bounding box) and associated text probabilities, whereas character recognition yields class label estimates - based on a pool of previously trained classes - for each of the characters forming a given text line.

In recent years it has been shown within the context of various visual object recognition tasks that vast amounts of artificial training data can be generated automatically, resulting in an improvement of the classification accuracy [8], [14], [12]. By means of synthetic data, not only is the ground truth automatically generated along with the data, but the training data can also be adjusted to a targeted task. In particular, for video text recognition we often know properties such as the expected font type, the font color, the orientation and, to some extent, the background. This prior knowledge can thus be considered directly when generating the training data.

The recent highly successful deep Convolutional Neural Network learning paradigm exhibits an exceptional generalization capability, but at the same time possesses a high descriptive complexity, implying that large amounts of training data are required to establish meaningful distributed representations at the deeper layers of the network. Since manually annotating word regions is highly time-consuming, existing datasets are insufficient for reflecting the high variability of real data. Therefore, generating synthetic text data is an essential requirement. In our end-to-end text recognition chain (see Figure 1) we first employ the highly run-time-efficient ACF detector [6] (detection and localization step), followed by CNN-based character recognition and segmentation stages. All stages are trained on large quantities of synthetic data.

Figure 1. End-to-end processing chain of the text recognition framework.
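As a conceptual illustration of this two-stage chain, the following Python sketch shows how detection and recognition could be composed; the parameter names and interfaces are placeholders for the ACF and CNN stages described in Section 2, not the authors' actual implementation.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected text line

def process_frame(frame: object,
                  detect_lines: Callable[[object], List[Tuple[Box, float]]],
                  read_line: Callable[[object, Box], str]) -> List[Tuple[Box, float, str]]:
    """Run the two-stage chain of Figure 1: localize text lines, then read each one."""
    results = []
    for box, score in detect_lines(frame):   # stage 1: ACF text detection (Section 2.3)
        text = read_line(frame, box)          # stage 2: CNN sliding-window OCR (Section 2.4)
        results.append((box, score, text))
    return results
```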
The first promising application of Convolutional Neural Networks (CNNs) for recognizing handwritten numbers was shown in [11]. Although problem size and data variability were limited due to the available hardware at that time, the end-to-end pipeline for training and testing, as well as the compositional power of multi-layer neural networks, fundamentally influenced many areas of machine learning in later years. Based on this precursor work, the results of [10] showed the potential of deep architectures in combination with the general-purpose computing capabilities of GPUs, which facilitated training on larger datasets and tackling more challenging tasks.

Recent approaches with classifiers based on CNNs have shown a significant impact on the accuracy of text recognition systems [3], [2], [12]. The absence of hand-crafted features, the implicit learning of prior knowledge and deployment on more powerful hardware make them the most promising approach for Optical Character Recognition. The approach in [4] tunes the classification accuracy with synthetic data specialized on specific font categories, whereas [8] presents a sophisticated framework for synthetic text generation, which can easily be adapted to other languages without any human labeling effort.

The paper is organized as follows: Section 2 describes the overall employed methodology, the synthetic data generation scheme and the individual algorithmic steps to train, test and evaluate the text detection and recognition stages. Section 3 presents and discusses the obtained results for text localization and recognition. Finally, the paper is concluded in Section 4.

2 METHODOLOGY

In the following section we describe a text synthesis tool and its use in the context of training text detection and character recognition. Furthermore, we present an annotation tool enabling key-frame-based manual annotation of broadcast videos and opening the way for large-scale evaluation of video text detection and recognition.

2.1 General overview

Establishing large datasets for the development of a video text recognition system is essential for two reasons. (i) Learning representations of text appearance: as with other complex vision systems, a good generalization capability (modeling quality for previously unseen data) is crucial, since high recall is essential for covering all relevant broadcast video content. In order to capture the visual appearance of vast amounts of text, we built a text image synthesis tool capable of generating unlimited amounts of training data that match the appearance of overlay text in videos well. (ii) Large-scale characterization and evaluation: the desired improvements in recognition accuracy call for datasets which represent a broad range of variation of the class to be recognized (for text: font styles, weight, size, spacing, background) and of the class which should be ignored (background, also containing difficult patterns such as repetitive textures, clutter and high-frequency noise). To accomplish this we (i) built a video annotation tool targeting large-scale annotation and (ii) constructed an evaluation framework following established standards (ICDAR [9]) in the text recognition community.

2.2 Synthetic data generation

Figure 2. Illustration providing an overview of the characteristics of the generated synthetic datasets. Top: for text detection and localization, patches of character triplets are created. Center: individual characters are trained on synthetic single-character instances. Bottom: to detect gaps between characters we use an additional "gap" class, which consists of two-character patch instances centered on their gap location.

Figure 3. Reference lines governing a text line's geometric properties. The right side of the image shows artificial text image patches aligned to these reference lines during the synthesis step.

We developed a text image synthesis tool in Matlab which creates patches of local text patterns within a normalized geometric reference frame defined by predefined baseline, ascender and descender lines (see Figure 3), including artificially induced variations such as font type and weight, text background, spatial offset, scaling and deformations. The synthesis tool employs a prior on the bigram (adjacent character) frequencies of the English language; thus it recreates plausible rather than random character adjacencies. Generated text samples are shown in Figure 2 for the different tasks addressed in this paper.
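A minimal sketch of how such a bigram prior could drive the sampling of plausible character triplets is given below; the frequency table and weights are purely illustrative, not the values used by the authors' Matlab tool.

```python
import random

# Illustrative bigram weights (not the authors' table): higher weight = more frequent pair.
BIGRAM_WEIGHTS = {
    ("t", "h"): 35, ("h", "e"): 30, ("i", "n"): 24,
    ("e", "r"): 20, ("a", "n"): 19, ("q", "z"): 1,
}

def sample_triplet(rng: random.Random) -> str:
    """Sample a plausible three-character string by chaining two weighted bigram draws."""
    pairs = list(BIGRAM_WEIGHTS)
    weights = [BIGRAM_WEIGHTS[p] for p in pairs]
    first, second = rng.choices(pairs, weights=weights)[0]
    # Choose a third character conditioned on the second one, falling back to any pair.
    followers = [p for p in pairs if p[0] == second] or pairs
    third = rng.choices(followers, weights=[BIGRAM_WEIGHTS[p] for p in followers])[0][1]
    return first + second + third

print(sample_triplet(random.Random(0)))  # prints a plausible triplet such as "the"
```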
The synthesis code employs the native Matlab listfonts command, which retrieves all available system fonts. These fonts can be used to render a text string via the text command, and using a screen capture script from the MathWorks FileExchange [1] we convert the rendered text into an image. This image is subsequently cropped around the desired set of characters, using the predefined text reference lines (Figure 3) for geometric normalization. Moreover, additional geometric transforms are applicable: spatial offsets along the x- and y-directions and a rotation within a predefined angular range. An additional image containing no text can be used as a background overlay to introduce structure and texture behind the text characters. This structured background targets enhanced invariance of text detection and recognition in the presence of clutter, while the geometric transformations increase invariance with respect to deviations from an ideal character position and pose. The introduced perturbations typically result in an increased recall rate, which is necessary if all relevant text lines are to be found and correctly read.
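The authors' implementation renders and captures these patches in Matlab; purely for illustration, the following Python/PIL sketch reproduces the same idea (background crop, font rendering, small offsets and a rotation). The font path, patch size and perturbation ranges are assumptions, not values taken from the paper.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def synth_patch(text: str, font_path: str, background: Image.Image,
                size=(96, 32), rng=random.Random()) -> Image.Image:
    """Render a text snippet over a background crop with small geometric perturbations."""
    # Random crop of the background image introduces structure behind the characters
    # (assumes the background image is larger than the patch).
    bx = rng.randint(0, background.width - size[0])
    by = rng.randint(0, background.height - size[1])
    patch = background.crop((bx, by, bx + size[0], by + size[1])).convert("RGB")

    # Render the characters; font size and vertical placement loosely mimic the
    # baseline/ascender/descender reference lines of Figure 3.
    font = ImageFont.truetype(font_path, size=int(0.7 * size[1]))
    layer = Image.new("RGBA", size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(layer)
    dx, dy = rng.randint(-3, 3), rng.randint(-2, 2)              # spatial offset (x, y)
    draw.text((4 + dx, 2 + dy), text, font=font, fill=(255, 255, 255, 255))
    layer = layer.rotate(rng.uniform(-2.0, 2.0), expand=False)   # small rotation
    patch.paste(layer, (0, 0), layer)
    return patch
```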
2.3 Synthetic data aided text detection

2.3.1 Training and testing

We generated 35000 text patch samples (a subset is shown in Figure 2, top) which we used to train the ACF (Aggregate Channel Features) detector [5], [6], which we have ported to C++. The synthetically trained detector surpasses the one trained on 5000 manually cropped text patches from videos, mostly due to the facts that (i) more data with greater variation is employed, and (ii) nuisance factors such as compression artifacts, color bleeding and low-resolution effects are absent. A qualitative and quantitative (precision-recall, DetEval [13]) evaluation of these detectors has been performed, and the results show a significant improvement over the detector trained on manually cropped real-world data.

Figure 4. Two difficult (repetitive structures) images demonstrating the classifier accuracy improvement accomplished by using a large synthetic dataset. Yellow rectangles show the raw output of the ACF detector trained using real-world (a) and synthetic (b) data.

Figure 5. Text detection performance improvement using a large synthetic dataset (SYNTH = 35000 samples) vs. a classifier (REAL) trained on 5000 manually cropped real image samples.

The ACF detector employs features (intensity channel, gradient magnitude and orientations) - the so-called channel features - extracted at multiple image resolutions. A specific advantage of the ACF multi-resolution feature computation is that not all resolution levels have to be computed; features at certain scale levels can be directly extrapolated from features at nearby scales at no significant expense in detection accuracy. This feature approximation trick yields an overall detection framework with an excellent ratio of run-time performance to recognition accuracy.
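To make the extrapolation step concrete, the sketch below shows the power-law channel scaling underlying fast feature pyramids [5] for a single gradient-magnitude channel; the exponent value is merely illustrative, and the actual detector computes several channels in the authors' C++ port.

```python
import numpy as np
from scipy.ndimage import zoom

def gradient_magnitude(gray: np.ndarray) -> np.ndarray:
    """One of the ACF channels: gradient magnitude of a grayscale image."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.hypot(gx, gy)

def approx_channel_at_scale(channel: np.ndarray, s: float, lam: float = 0.11) -> np.ndarray:
    """Approximate a channel at relative scale s from the channel computed at scale 1,
    following the power-law relation of fast feature pyramids [5]:
    C_s ~= s**(-lam) * resample(C, s). The exponent lam is channel dependent;
    0.11 is only an illustrative choice, not a value from the paper."""
    resampled = zoom(channel, s, order=1)
    return (s ** -lam) * resampled
```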
Figure 5 quantifies the accuracy improvement obtained for text detection. As can be seen from the ROC (Receiver Operating Characteristic) curves obtained for manually annotated (REAL) and synthetic (SYNTH) training data, the use of synthetic data improves the detection rate (true positive rate) by about 5 percent at a false positive rate of 0.1. A detailed specification of this evaluation experiment can be found in the Results section.

2.4 Synthetic data aided text recognition (OCR)

Synthetic data for optical character recognition (OCR) makes it possible to enhance the recognition rate by increasing the amount of training data when needed. In this paper, a model is trained on synthetic data using Convolutional Neural Networks, and recognition is performed using a sliding-window approach within the bounding box previously found by the text localization step. The following sections describe our approaches for training and for recognizing text in images.

2.4.1 Training

Training a model using Convolutional Neural Networks raises the question of how much training data to use and how the training data is distributed among the object classes. Even though some characters have a higher probability of appearance than others, each class should contain the same number of samples for training in order to obtain a balanced training set. In this paper, we used 6000 samples per class for the training stage, where each character has separate classes for upper case and lower case, resulting in 62 classes (52 character classes and 10 digit classes) and therefore 372,000 input images. The network layer architecture is the same as LeNet-5 [11], which is shown in Figure 6.

Figure 6. Architecture of the Convolutional Neural Network for Optical Character Recognition.

For training, the Adagrad adaptive learning rate method was used, which was originally proposed by Duchi et al. [7]. Training our model had to cope with several sources of difficulty, such as different font types, background variability, font weight (bold, italic or normal) and inverted colors for half of the training set (black on white and white on black). Training took around 5 hours with an evaluation set of 1000 samples per character and achieved a recognition rate of 95 percent on an independent test set.
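As a concrete starting point, a LeNet-5-style classifier with 62 output classes trained with Adagrad could look like the PyTorch sketch below; the paper does not specify the framework, input resolution or hyperparameters, so the 32x32 grayscale input, learning rate and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class CharNet(nn.Module):
    """LeNet-5-style character classifier for 62 classes (assumed 32x32 grayscale input)."""
    def __init__(self, num_classes: int = 62):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),   # 32 -> 28 -> 14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(), nn.AvgPool2d(2),  # 14 -> 10 -> 5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = CharNet()
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)  # Adagrad as in [7]
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step over a mini-batch of normalized character patches."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```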
The DetEval evaluation measures take the to a binary executable), which can perform key-frame based manual various bounding box correspondence types into account and derive annotation by assigning bounding box representations to text lines in BB-level precision and recall values based on a parameter-tunable discrete video frames (see Figure 10). Due to the key-frame concept GT-to-BB association scheme. Using a fixed association tuning pa- these annotations are propagated across time and can be updated and rameter we can generate an ROC curve based characterization (such terminated at any time instance of the video. In this way also ani- mated text lines can be annotated. The framework is able to gener- ate specific, from videos derived datasets according to the evaluation criteria: varying spatial and temporal resolutions and predefined an- notation data schemes (txt, xml, yaml). We have annotated multiple hours of broadcast videos downloaded from YouTube. As Figure 8 displays the annotated set contains 5 dif- ferent English-language TV channels of various lengths, altogether Figure 8. The composition of our generated YouTube dataset containing 4065 annotated frames in total, originating from five different broadcast pro- Figure 9. Different correspondence types between ground truth (green) and grams. detection results (red dashed) bounding boxes. Figure 10. Screenshot of the video annotation tool, capable to quickly annotate large amounts of video text by setting text bounding boxes for selected key-frames and interpolating this information between key-frames. as in Figure 5), whereas by allowing a variable association parame- reduce the amount of False Positives further. ter, characteristic DetEval performance graphs can be computed (not Text recognition: During optical character recognition, a text line shown). detected in the previous localization step is employed as input. The proposed sliding window approach performs the classification of the local analysis window content and assigns a label to it, matching one 3 RESULTS of the 62 character classes or the gap class. Due to the tightly cou- In order to perform the evaluation of the presented text detection pled spatial analysis and fine-grained classification, a separate spatial stages, following steps were carried out. character segmentation step is avoided. Typically, the character seg- Text detection and localization: The 4065 annotated images were mentation step (in terms of foreground-background segmentation) is used to evaluate the ACF detector [5], in one case trained on 5000 the most sensitive stage in an end-to-end text recognition framework, manually annotated text data (denoted as the REAL detector), in the where small resolution and degraded character appearance are prob- second case on 35000 fully synthetic data (denoted as the SYNTH able to lead to segmentation failures. Classification results are shown detector). The DetEval evaluation criterion was used with a fixed in Figure 11, where individual class-specific confidences are shown BB-to-BB (BB - bounding box) association parameter, while vary- as dark peaks (confidence heat map is inverted for better visibility) in ing the classifier sensitivity. A qualitative comparison is displayed in the individual rows of character classes. The bottom part of the fig- Figure 4 and the resulting ROC curve is shown in 5. 
2.5 Annotation and evaluation

We built an easy-to-use software tool (in Matlab, but also compiled to a binary executable) which performs key-frame-based manual annotation by assigning bounding box representations to text lines in discrete video frames (see Figure 10). Due to the key-frame concept, these annotations are propagated across time and can be updated and terminated at any time instance of the video. In this way, animated text lines can also be annotated. The framework is able to generate specific datasets derived from the videos according to the evaluation criteria: varying spatial and temporal resolutions and predefined annotation data schemes (txt, xml, yaml).

Figure 10. Screenshot of the video annotation tool, capable of quickly annotating large amounts of video text by setting text bounding boxes for selected key-frames and interpolating this information between key-frames.

We have annotated multiple hours of broadcast videos downloaded from YouTube. As Figure 8 shows, the annotated set contains five different English-language TV channels of various lengths, altogether consisting of 4065 annotated frames. The annotation contains bounding box coordinates (relating to single text lines) and ground-truth ASCII text for each bounding box. The dataset contains many complex scenes where clutter, high-frequency and repetitive patterns occur. This annotated dataset forms our evaluation dataset, on which the experiments assessing the text detection accuracy have been performed.

Figure 8. The composition of our generated YouTube dataset containing 4065 annotated frames in total, originating from five different broadcast programs.

Conventional bounding-box-overlap-based evaluation measures exhibit a deficiency in the definition of matching between detected and ground truth (GT) bounding boxes (BBs). Namely, in most cases there is no one-to-one correspondence: several detected BBs may correspond to a single GT-BB, or a single detected BB may match multiple GT-BBs. The first case is denoted as splitting, the latter as merging; these situations are depicted in Figure 9. Such ambiguous matchings render overlap-based evaluation results ambiguous: a 50 percent overlap with the GT might imply either that half of the GT-BBs were detected, or that half of the area of a single GT-BB was detected. We adopted the ICDAR 2015 evaluation scheme [9], relying on the DetEval framework [13]. The DetEval evaluation measures take the various bounding box correspondence types into account and derive BB-level precision and recall values based on a parameter-tunable GT-to-BB association scheme. Using a fixed association tuning parameter we can generate an ROC-curve-based characterization (as in Figure 5), whereas allowing a variable association parameter yields characteristic DetEval performance graphs (not shown).

Figure 9. Different correspondence types between ground truth (green) and detection result (red dashed) bounding boxes.
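As a simplified illustration of split/merge-tolerant matching (not the actual DetEval algorithm [13]), the sketch below counts a ground-truth box as recalled when the detections jointly cover enough of its area, and a detection as correct when ground-truth boxes jointly cover enough of it; the coverage threshold tau is an arbitrary illustrative value.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def inter_area(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def coverage_pr(gts: List[Box], dets: List[Box], tau: float = 0.8) -> Tuple[float, float]:
    """Area-coverage precision/recall tolerant to split and merge cases.
    (Overlapping boxes are not deduplicated here, so coverage is slightly optimistic.)"""
    recalled = sum(1 for g in gts
                   if area(g) > 0 and sum(inter_area(g, d) for d in dets) / area(g) >= tau)
    correct = sum(1 for d in dets
                  if area(d) > 0 and sum(inter_area(d, g) for g in gts) / area(d) >= tau)
    recall = recalled / len(gts) if gts else 1.0
    precision = correct / len(dets) if dets else 1.0
    return precision, recall
```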
3 RESULTS

In order to evaluate the presented text detection stages, the following steps were carried out.

Text detection and localization: The 4065 annotated images were used to evaluate the ACF detector [5], in one case trained on 5000 manually annotated text samples (denoted as the REAL detector), in the second case on 35000 fully synthetic samples (denoted as the SYNTH detector). The DetEval evaluation criterion was used with a fixed BB-to-BB (BB - bounding box) association parameter, while varying the classifier sensitivity. A qualitative comparison is displayed in Figure 4 and the resulting ROC curve is shown in Figure 5. As the plot illustrates, synthetic data brings a significant improvement in terms of the detection rate (True Positive Rate) at a given false alarm rate (False Positive Rate). Training the ACF detector with even more synthetic data (50000 samples) did not improve the results, indicating that (i) the additional data does not introduce further appearance or structural information about the modeled class and (ii) the representation capability of the employed features and learning strategy is limited. Due to these limitations, repetitive structures and clutter still represent a problem (as seen by the non-vanishing number of False Positives), but synthetic data improves the rate of false alarms. The ACF detector is fast, detecting and localizing text at about 7 frames per second for an image frame with a resolution of 1024x768 pixels on a modern PC. A subsequent, more stringent and at the same time computationally more demanding analysis step, such as an OCR step, is able to reduce the number of False Positives further.

Text recognition: During optical character recognition, a text line detected in the previous localization step is employed as input. The proposed sliding-window approach classifies the content of the local analysis window and assigns a label to it, matching one of the 62 character classes or the gap class. Due to the tightly coupled spatial analysis and fine-grained classification, a separate spatial character segmentation step is avoided. Typically, the character segmentation step (in terms of foreground-background segmentation) is the most sensitive stage in an end-to-end text recognition framework, where small resolution and degraded character appearance are likely to lead to segmentation failures. Classification results are shown in Figure 11, where individual class-specific confidences are shown as dark peaks (the confidence heat map is inverted for better visibility) in the individual rows of character classes. The bottom part of the figure displays the maximum-confidence classifier response at the character locations, thus forming the OCR output. As can be seen, certain recognition errors still occur: the lower-case character "o" is confused with the upper-case character "Q". The reason for this problem is the presence of upper-case and lower-case letters as well as letters with a descender within a single recognition region (text row), because the sliding-window height is determined by the tallest character in this region and the bottom-most point by the lowest descender of all characters in the region. Given these conditions and the fact that each training image is geometrically normalized with respect to its contained character (with border pixels) during training, the appearance can vary greatly between the training images and a given recognition window. At the current state of our recognition step, this geometric scaling discrepancy is not taken into consideration. However, the synthetic data generation tool can easily be extended to also include samples with such scaling variations, and the Convolutional Neural Network is capable of accommodating these additional variations. This extra representational effort will likely lead to a greatly enhanced recognition performance.

Figure 11. Recognition results shown in detail for a given text line. Individual character classes are shown as separate rows (see the small class labels at the left border). Individual dark peaks in the respective rows indicate a high confidence for a given class. Peaks highlighted in red display the maximum classifier response at a given character center, resulting in the corresponding OCR output at the bottom of the figure.

4 CONCLUSION

In this paper we presented an applied example of using large amounts of synthetic data for generating representative statistical models for the text detection and character recognition tasks. Furthermore, we showed that annotating and synthesizing text is not an overly complicated task, and that simple software tools can generate vast amounts of data with broad appearance characteristics. The rich variability encompassed by the synthetic data yields improved accuracy for the text detection task and enables character recognition without an explicit segmentation step.

Future work will mainly involve the improvement of the sliding-window-based recognition approach. Since Convolutional Neural Networks provide a large capacity to accommodate appearance, scale and other variations for a large number of classes, we plan to enrich the training set with even more variations. We plan to train each character in different text configurations, such as variable padding, thus improving the classifier's invariance with respect to position and geometric scaling. This is highly relevant for the typical case in which a text line consists of geometrically varying characters, such as upper-case, lower-case and characters with ascenders and descenders.

ACKNOWLEDGEMENTS

The work was partially supported by the Vision+ project under the COMET program of the Austrian Research Promotion Agency (FFG) and the research initiative 'Intelligent Vision Austria' with funding from the Austrian Federal Ministry of Science, Research and Economy and the Austrian Institute of Technology.

REFERENCES

[1] MathWorks FileExchange. https://de.mathworks.com/matlabcentral/fileexchange/?s_tid=gn_mlc_fx. Accessed: 2016-09-20.
[2] Alessandro Bissacco, Mark Cummins, Yuval Netzer, and Hartmut Neven, 'PhotoOCR: Reading text in uncontrolled conditions', in Proceedings of the IEEE International Conference on Computer Vision, pp. 785–792, (2013).
[3] Adam Coates, Blake Carpenter, Carl Case, Sanjeev Satheesh, Bipin Suresh, Tao Wang, David J. Wu, and Andrew Y. Ng, 'Text detection and character recognition in scene images with unsupervised feature learning', in 2011 International Conference on Document Analysis and Recognition, pp. 440–445. IEEE, (2011).
[4] Teófilo Emídio de Campos, Bodla Rakesh Babu, and Manik Varma, 'Character recognition in natural images', in VISAPP (2), pp. 273–280, (2009).
[5] Piotr Dollár, Ron Appel, Serge Belongie, and Pietro Perona, 'Fast feature pyramids for object detection', IEEE Trans. Pattern Anal. Mach. Intell., 36(8), 1532–1545, (August 2014).
[6] Piotr Dollár, Serge Belongie, and Pietro Perona, 'The fastest pedestrian detector in the west', in BMVC, (2010).
[7] John Duchi, Elad Hazan, and Yoram Singer, 'Adaptive subgradient methods for online learning and stochastic optimization', J. Mach. Learn. Res., 12, 2121–2159, (July 2011).
[8] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, 'Reading text in the wild with convolutional neural networks', International Journal of Computer Vision, 116(1), 1–20, (2016).
[9] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman K. Ghosh, Andrew D. Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny, 'ICDAR 2015 competition on robust reading', in ICDAR, pp. 1156–1160. IEEE Computer Society, (2015).
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton, 'ImageNet classification with deep convolutional neural networks', in Advances in Neural Information Processing Systems, pp. 1097–1105, (2012).
[11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, 'Gradient-based learning applied to document recognition', Proceedings of the IEEE, 86(11), 2278–2324, (1998).
[12] Tao Wang, David J. Wu, Adam Coates, and Andrew Y. Ng, 'End-to-end text recognition with convolutional neural networks', in Pattern Recognition (ICPR), 2012 21st International Conference on, pp. 3304–3308. IEEE, (2012).
[13] Christian Wolf and Jean-Michel Jolion, 'Object count/Area Graphs for the Evaluation of Object Detection and Segmentation Algorithms', International Journal of Document Analysis and Recognition, 8(4), 280–296, (April 2006).
[14] Gökhan Yildirim, Radhakrishna Achanta, and Sabine Süsstrunk, 'Text recognition in natural images using multiclass Hough forests', in Proc. VISAPP, (2013).