Deep Convolutional Neural Network for Recognizing the Images of Text Documents

Vladimir Golovko1,2[0000-0003-2615-289X], Aliaksandr Kroshchanka1[0000-0003-3285-3545], Egor Mikhno1[0000-0002-7667-7486], Myroslav Komar3[0000-0001-6541-0359], Anatoliy Sachenko3,4[0000-0002-0907-3682], Sergei Bezobrazov1[0000-0001-6436-2922], Inna Shylinska3[0000-0002-0700-793X]

1 Brest State Technical University, Brest, Belarus
2 Państwowa Szkoła Wyższa im. Papieża Jana Pawła II, Biala Podlaska, Poland
3 Ternopil National Economic University, Ternopil, Ukraine
4 Kazimierz Pulaski University of Technology and Humanities, Radom, Poland

vladimir.golovko@gmail.com1,2, mko@tneu.edu.ua3, sachenkoa@yahoo.com3,4

Abstract. A comparative analysis of various methods and architectures used to solve the object detection problem is carried out. The analysis shows that so-called single-pass neural network architectures provide high-quality solutions to this problem. A neural network algorithm for labeling images in text documents is developed on the basis of image preprocessing, which simplifies the localization of individual parts of a document and the subsequent recognition of the localized blocks by a deep convolutional neural network. The resulting algorithm provides high-quality localization and an acceptable level of subsequent classification.

Keywords: Object Detection, Deep Convolutional Neural Network, Labeling Images, Image Preprocessing, Text Image.

1 Introduction

Deep learning is a highly effective machine learning technique that has been successfully applied to many problems of artificial intelligence, such as object detection, natural language processing, and data visualization. Different techniques for training deep neural networks exist [1-4]. In this paper we address object detection in text images using deep learning techniques. To detect objects in an image, it is necessary to select separate blocks of the image that belong to certain predefined classes.
A model that performs such an operation receives an image at the input and returns the coordinates and dimensions of rectangular areas containing the objects being searched for, as well as the probability that the contained object belongs to a certain class. Solving this problem is one of the challenges in computer science. With this functionality it becomes possible to analyze photos and video streams in real time, placing labels on certain objects and performing predefined processing operations. At the same time, the object detection problem should be distinguished from the semantic segmentation problem, which is essentially the classification of each image pixel. The object detection task can be logically divided into two subtasks: localization of the object and its classification. Many existing approaches to detecting objects in images combine these two distinct stages in a single neural network, which performs both tasks simultaneously and produces the final result at the output (Fig. 1). This makes it possible to solve the problem much faster and to obtain information about all detected objects without processing them sequentially. However, the classical approaches should not be abandoned completely, as for some tasks such methods provide better results.

Fig. 1. General view of the neural network model used to solve the problem of detecting objects in an image [5]

The quality of the solution to the object localization problem is assessed by computing the IoU (Intersection over Union) metric [6]. In our case, the problem was to detect objects in documents represented by images. For such a case, object detection reduces to the task of labeling an electronic document, i.e., highlighting its components. An example of an analyzed document from the Doxima7000 sample provided by CIB Software [7] is shown in Fig. 2.
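The input/output contract described above can be made concrete with a small record type. The field names and example values below are purely illustrative assumptions, not part of any particular detection framework:

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One detected object: a rectangular area plus a class hypothesis."""
    x: int        # left coordinate of the box
    y: int        # top coordinate of the box
    width: int
    height: int
    label: str    # predicted class, e.g. "Text" or "Logo"
    score: float  # probability that the box contains an object of this class

# A detector returns a list of such records for a single input image.
detections = [
    Detection(x=40, y=12, width=120, height=30, label="Logo", score=0.97),
    Detection(x=40, y=80, width=400, height=200, label="Text", score=0.91),
]
```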
It can be seen in the presented image that the document consists of certain logical units (in particular, a company logo, a table, text, bank data, an address, etc.) that can be detected and further processed (for example, text blocks can be recognized and converted into a format that is easier to analyze and interpret by computer). The Doxima7000 data set consists of approximately 7000 documents in German. These documents fall into three main categories: invoices, receipts and business cards. The distribution of object classes across the documents varies greatly: the most numerous are objects of the “Text” class, while objects of the “Sign” and “Contacts” classes occur least often. The comparative distribution of objects of different classes over the first 100 documents of the data set is shown in Fig. 3.

Fig. 2. Fragment of a document from the Doxima7000 data set

Fig. 3. Objects distribution in the Doxima7000 data set

This data set was not originally intended for training and testing neural network detection models, and therefore it does not include document labeling. Thus, the labeling was done manually, by highlighting the corresponding block of a document and assigning it to a certain predefined class. For training and testing the classical neural network detection models, data sets of 145 and 36 documents, respectively, were used. However, for the proposed neural network model, labeling and feeding whole documents was not compliant with the model architecture. Therefore, we used a training set of 2500 images of individual document blocks for neural network learning. This allowed us to ensure a sample balanced with respect to the main classes of blocks.

2 Application of Conventional Architectures

2.1 Faster R-CNN

The first neural network model we used for the detection of individual document blocks was Faster R-CNN [8]. This model consists of three parts (Fig. 4).
The first part is a ResNet-50 (or ResNet-101) classifier pre-trained on the COCO data set [9]. The second part is the region proposal network (RPN) that generates candidate regions. Finally, the third part is the detector, represented by additional fully connected layers that output the coordinates of the rectangular areas containing the desired objects and a class label for each such area. The speed and efficiency of the analysis are significantly influenced by the RPN, which takes as input the feature maps produced by the preceding convolutional layers. Owing to this, generating candidate regions is faster than processing the original full-size image.

Fig. 4. Faster R-CNN structure [8]

2.2 SSD (Single Shot Detector)

The SSD model [10], like YOLO [11], belongs to the category of single-pass methods that solve the object detection problem within a single network. A schematic representation of the architecture is shown in Fig. 5.

Fig. 5. SSD architecture diagram

The main features of this model are:
• It accumulates information about object positions and scales through successive convolutional layers; each of these layers detects objects of a specific size (Fig. 6).
• A pre-trained network (VGG or ResNet) is used as the base element and is converted to a fully convolutional network (FCN).
• Non-maximum suppression is used to decrease the number of detected boxes during network operation.
• Each feature map cell creates a group of default boxes (or anchors) differing in scale and aspect ratio.
• The model is trained so that each anchor correctly predicts its class and offset.

Fig. 6. Localization of objects in feature maps of different sizes [10]

The results of applying the SSD architecture to this task are shown in Fig. 7.
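Of the features listed above, non-maximum suppression is easy to state precisely. The greedy variant below is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corner tuples; it is not the exact implementation used inside SSD:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box, drop every remaining box that
    overlaps it by more than the threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, *order = order
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two strongly overlapping detections of the same logo collapse into the single higher-scoring one, while a distant box survives.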
Fig. 7. Example of document labeling produced by the SSD model

3 The Proposed Neural Network Approach to Labeling the Text Images

In addition to reviewing and studying standard solutions to the object detection problem, we propose an original approach based on the R-CNN method that includes two processing stages [12, 13]. At the first stage, the regions of interest are selected, which is preferable when working with text data. At the second stage, the regions are classified using a classical convolutional network. Let us describe each processing stage.

I. The following operations are applied to the original image:
• Median filter – to remove noise in the original document associated with non-ideal conditions of scanning, printing, etc.
• Box filter – a linear filter used to create a blur effect (necessary for suppressing small details and highlighting regions with the same type of content).
• Threshold function – to form continuous regions.
• Selection of contour areas and localization of the text blocks.

The first three of these operations are illustrated in Fig. 8.

Fig. 8. Stages of pre-processing of the original image for localization of the text blocks

After performing the above operations, we obtain a set of rectangular areas containing text blocks (Fig. 9).

Fig. 9. Results of object localization

II. Training the convolutional neural network. A convolutional neural network is trained to recognize the blocks obtained at the first stage. For this purpose we created a training sample consisting of about 2500 images divided into 7 classes (logos, bar and QR codes, signatures, etc.). The neural network selected as the working model is shown in Fig. 10.

Fig. 10. Convolutional neural network for the classification of the text blocks

After completing the training, the obtained test results are presented in Table 1.
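The four preprocessing steps listed above can be sketched in pure Python. In practice a library such as OpenCV would supply these operations (medianBlur, blur, threshold, findContours); the kernel sizes and the 4-connectivity used below are assumptions made for illustration only:

```python
def median_filter(img, k=3):
    """Step 1: each pixel becomes the median of its k x k neighborhood (removes impulse noise)."""
    h, w, r = len(img), len(img[0]), k // 2
    def med(y, x):
        vals = sorted(img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in range(-r, r + 1) for dx in range(-r, r + 1))
        return vals[len(vals) // 2]
    return [[med(y, x) for x in range(w)] for y in range(h)]

def box_filter(img, k=3):
    """Step 2: each pixel becomes the mean of its k x k neighborhood (blurs, merging nearby details)."""
    h, w, r = len(img), len(img[0]), k // 2
    def mean(y, x):
        vals = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
        return sum(vals) / len(vals)
    return [[mean(y, x) for x in range(w)] for y in range(h)]

def threshold(img, t):
    """Step 3: binarize - pixels darker than t become foreground (1), the rest background (0)."""
    return [[1 if v < t else 0 for v in row] for row in img]

def bounding_boxes(mask):
    """Step 4: group 4-connected foreground pixels and return one (x1, y1, x2, y2) box per group."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                stack, x1, y1, x2, y2 = [(sy, sx)], sx, sy, sx, sy
                seen[sy][sx] = True
                while stack:
                    cy, cx = stack.pop()
                    x1, y1, x2, y2 = min(x1, cx), min(y1, cy), max(x2, cx), max(y2, cy)
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((x1, y1, x2, y2))
    return boxes
```

Chaining the four functions on a grayscale image (darker pixels are content) yields the rectangular text-block candidates that are then passed to the classifier.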
Table 1. Test results (correctly recognized objects, %)

Logos     Bar codes     Signatures     Other
97.65     100           97.76          99.29

Thus, rather high efficiency indicators were obtained for the trained classifier. Results of identifying certain types of documents are shown in Fig. 11. The obtained results can also be used by neural network immune detectors for the identification and classification of computer attacks and for detecting malicious applications in Android OS [14].

Fig. 11. Results of the object detection system

4 Analysis of Results

In order to analyze and compare the different neural network approaches, we used the standard mAP (mean average precision) metric [15], which in the context of this work evaluates the quality of detection of document parts. Sometimes mAP is used with modifications computed for various values of IoU (Intersection over Union, the Jaccard index). IoU is calculated in the following way (Fig. 12):

IoU = (S_ground_truth ∩ S_box) / (S_ground_truth ∪ S_box),    (1)

where S_ground_truth determines the area of the reference block used for labeling the testing set, and S_box is the area of the detected block.

Fig. 12. Calculating the IoU metric

For the object detection task, the number of true-positive (TP) detection results is defined as the number of rectangular blocks for which the value of IoU exceeds some threshold (usually a threshold of 0.5 is chosen). TP results are counted per reference (ground-truth) block: if a reference block has several detections, only the one with the largest value of IoU is counted, and the remaining detections are considered false positives (FP). Averaging the precision over all recall values gives the AP:

AP = (1/N) Σ_{i=1..N} TP_i / (TP_i + FP_i),    (2)

where N is the number of equally spaced recall values. The value of mAP is obtained from AP by averaging over all classes of objects. The results of object detection in text images for the different approaches are shown in Table 2.
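For reference, formulas (1) and (2) translate directly into code. The corner-tuple box format and the per-class TP/FP counts below are illustrative assumptions:

```python
def iou(gt, box):
    """Eq. (1): area of intersection divided by area of union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(gt[2], box[2]) - max(gt[0], box[0]))
    ih = max(0, min(gt[3], box[3]) - max(gt[1], box[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(gt) + area(box) - inter
    return inter / union if union else 0.0

def average_precision(tp, fp):
    """Eq. (2): precision TP_i / (TP_i + FP_i) averaged over N equally spaced recall values."""
    return sum(t / (t + f) for t, f in zip(tp, fp)) / len(tp)

def mean_average_precision(per_class_counts):
    """mAP: the AP averaged over all object classes; input is one (tp, fp) pair of lists per class."""
    aps = [average_precision(tp, fp) for tp, fp in per_class_counts]
    return sum(aps) / len(aps)
```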
Table 2. Comparison of different approaches

Model                  mAP      FPS
Faster R-CNN           0.9184   10
SSD (Inception v2)     0.8321   23
Preprocessing + CNN    0.8955   21

As can be seen from Table 2, the proposed approach gives the best result in terms of frames per second (FPS) with an acceptable accuracy of object detection.

5 Conclusion

A neural network algorithm for labeling images in text documents based on image preprocessing was developed. The algorithm simplifies the localization of individual parts of a document and the subsequent recognition of the localized blocks using a deep convolutional neural network. The resulting algorithm provides high-quality localization and an acceptable level of subsequent classification. In addition, a comparative analysis of various methods and architectures used to solve the object detection problem was carried out; it shows that so-called single-pass neural network architectures provide high-quality solutions to this problem.

References

1. LeCun, Y., Bengio, Y., Hinton, G. Deep learning, Nature, 521(7553), 436–444. (2015).
2. Hinton, G., Salakhutdinov, R. Reducing the dimensionality of data with neural networks, Science, 313(5786), 504–507. (2006).
3. Bengio, Y. Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2(1), 1–127. (2009).
4. LeCun, Y., Haffner, P., Bottou, L., Bengio, Y. Object recognition with gradient-based learning, Shape, Contour and Grouping in Computer Vision, 1681, 319–345. (1999).
5. Object Localization and Detection, https://leonardoaraujosantos.gitbooks.io/artificial-inteligence/content/object_localization_and_detection.html, last accessed 2019/03/07.
6. Intersection over Union (IoU) for object detection, https://www.pyimagesearch.com/2016/11/07/intersection-over-union-iou-for-object-detection, last accessed 2019/03/07.
7. CIB software, https://cib.by, last accessed 2019/03/07.
8. Ren, S., He, K., Girshick, R., Sun, J.
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, https://arxiv.org/pdf/1506.01497.pdf, last accessed 2019/03/07.
9. Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., Perona, P., Ramanan, D., Zitnick, C. L., Dollár, P. Microsoft COCO: Common Objects in Context, https://arxiv.org/pdf/1405.0312.pdf, last accessed 2019/03/07.
10. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., Berg, A. C. SSD: Single Shot MultiBox Detector, https://arxiv.org/pdf/1512.02325.pdf, last accessed 2019/03/07.
11. Redmon, J., Divvala, S., Girshick, R., Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection, https://arxiv.org/pdf/1506.02640.pdf, last accessed 2019/03/07.
12. Kroshchenko, A., Golovko, V., Bezobrazov, S., Mikhno, E., Khatskevich, M., Mikhnyaev, A., Brich, A. Deep training for detecting of objects at images of documents, Vesnyk Brest State Technical University, 5(107), 2–9. (2017) (In Russian).
13. Golovko, V., Mikhno, E., Brich, A., Sachenko, A. A Shallow Convolutional Neural Network for Accurate Handwritten Digits Classification, Communications in Computer and Information Science, 673, 77–85. (2017).
14. Komar, M., Sachenko, A., Bezobrazov, S., Golovko, V. Intelligent Cyber Defense System Using Artificial Neural Network and Immune System Techniques, Communications in Computer and Information Science, 783, 36–55. (2017).
15. Oksuz, K., Cam, B. C., Akbas, E., Kalkan, S. Localization Recall Precision (LRP): A New Performance Metric for Object Detection, https://arxiv.org/pdf/1807.01696.pdf, last accessed 2019/03/07.