=Paper=
{{Paper
|id=Vol-2677/paper8
|storemode=property
|title=An approach how to automate labeling data for the training ANN models for page layout analysis
|pdfUrl=https://ceur-ws.org/Vol-2677/paper8.pdf
|volume=Vol-2677
|authors=Andrey A. Mikhailov
|dblpUrl=https://dblp.org/rec/conf/itams/Mikhailov20
}}
==An approach how to automate labeling data for the training ANN models for page layout analysis==
An Approach How to Automate Labeling Data for the Training ANN Models for Page Layout Analysis Andrey Mikhailov1 Matrosov Institute for System Dynamics and Control Theory of SB RAS, 134 Lermontov st., Irkutsk, Russia mikhailov@icc.ru, WWW home page: http://idstu.irk.ru Abstract. Object detection and recognition is an important task in many document analysis applications. It is a difficult problem due to different page layouts and representation formats. Recently the deep learning in computer vision has significantly boosted the data-driven image-based approaches for page layout analysis. In this paper, we con- sider open formats of electronic documents to generate training datasets. Formats of these documents should contain markup allowing obtaining information about page layout regions. It will allow us to generate a training dataset automatically for training ANN models of page layout analysis. Keywords: document layout analysis · PDF accessibility · ANN models · artificial intelligence 1 Introduction Arbitrary documents are a common way of presenting information on the web. The big volume and structure of such documents make them a valuable source in data science and business intelligence applications. However, as a rule, they haven’t included semantics for machine interpretation of their content as consid- ered by their author. The information accumulated in them is often unstructured and not standardized. The analysis of these data requires transformation to a structured representation with a given formal model. In document analysis and recognition, this task commonly named as document layout analysis. In recent years, approaches for page layout analysis based on deep neural networks for object detection and classification have been actively developing. This is ev- idenced by the results of one of the main scientific conferences on document analysis - ICDAR 1 . Since 2001, this conference has hosted the RDCL document Copyright c 2020 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 1 https://icdar2019.org layout analysis competitions (2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019). For the competition are published datasets. For example, in 2019 a dataset was published that includes only 4, 500 examples. This amount of data is not enough for high-quality training of deep neural networks, since their modern architectures have many free parameters and are very sensitive to the volume and quality of data. Modern layout analysis systems based on ANN models are focused mainly on a small count of the same document types. This is due to the fact that either open-source or hand-tagged datasets were used to develop page layout analysis ANN models. In this paper, we propose an idea to automate the process of labeling datasets. For this, it is proposed to develop methods for automatic data labeling for training deep neural for page layout analysis. Which should reduce the process of developing layout analysis systems for new types of documents, and improve the quality of the analysis. 2 Related Works Document images are often generated from physical documents by digitization, using scanners or various generation programs (printers). Many documents, such as newspapers, magazines and brochures, have very complex layouts due to the placement of pictures, headings and captions, complex backgrounds, artistic text formatting, etc. A person uses a lot of additional clues such as context, conventions, language information. Automatic analysis of an arbitrary document with a complex lay- out is an extremely difficult task and goes beyond the capabilities of modern document layout analysis systems. In the scientific literature, a large number of methods for analyzing the layout of documents have been proposed. According to article [10], they can be divided into three groups: methods of classification based on areas [17, 13]; classification methods based on pixel analysis [12, 11]; analysis of connected components [6, 15, 3]. With the increasing efficiency and popularity of convolutional neural networks, their field of application is constantly expand- ing. Since 2014, the first attempts to use artificial neural networks to solve the problem of analyzing the layout of documents have been known [9, 8, 2, 16]. These works have demonstrated their effectiveness in comparison with classical approaches, which is confirmed by the results of the 2017 competition at the IC- DAR conference [4]. On the other hand, the 2019 competition showed that on a variety of data, with a large number of classes (10), the combination of classical methods [5] is most effective compared to deep neural networks. This is due to the lack of a sufficient amount of diverse tagged data with a large number of classes. While for special cases, neural networks work much more efficiently [7]. It should be noted that to solve the problem of analyzing the arrangement of documents in these works, either neural networks of the R-CNN architecture or author’s developments are used. For training neural networks, open datasets of labeled data are usually used; in rare cases, the authors of the articles indicate that they have labeled their own training set. These samples rarely reach 20,000 copies and are often not publicly available. The author is not aware at the mo- ment of open datasets large enough to train neural networks for document layout analysis. It should be noted that it was the creation of such datasets as ImageNet that made it possible to obtain outstanding results using convolutional neural networks for natural image recognition. 3 An Idea The Internet contains a large number of original LaTex documents. One of the most well-known resources is arXiv 2 . arXiv is a free distribution service and an open-access archive for more than 1,7 million scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. PDF documents can be generated from these documents using special compiler programs. PDF is a page orientated graphic format. It simply puts images and glyphs at various coordinates on a page. Since 2006, PDF includes special tags for support reading order and logical order. With reading order, the characters on the page are understood to have a linear sequence of appearance. Logical order allows introducing concepts such as tables, lists, and headings, as well as provide alternate text for images, descriptive text for links and form fields, and so on. Traditionally, there are three ways to obtain PDF document from LaTex. – LaTeX source file converted to a DVI file, which could then be converted to PostScript with dvips. This, in turn, can be converted to a PDF file by ps2pdf 3 tool. latex dvips ps2pdf text.tex -------> text.dvi -------> text.ps --------> text.pdf – The step with conversion to PostScript can be skipped. latex dvipdfm text.tex -------> text.dvi -------> text.pdf – Directly from the LaTeX source to PDF file by pdflatex program. pdflatex text.tex --------> text.pdf The first two ways are not allows for tagging PDF. Because the DVI format does not allow saving additional tags. For the direct compilation of LaTex into PDF there is a special LaTex package - accessibility 4 . 2 https://arxiv.org/ 3 https://www.ps2pdf.com/ 4 https://github.com/AndyClifton/accessibility Accessibility [14] was written as a proof-of-concept showing how to improve the structure and tagging of PDF files generated from LaTeX. These features make PDF documents machine-readable and thus enable document readers to automatically process and present the document. Andy Clifton took on main- tenance of the package in May 2019 with permission and support from Babett Schalitz. This package is predominantly targeted at documents produced using the KOMA-Script document classes [1]. + Accessibility pdflatex PDFBox + BBoxes Fig. 1. Labeling PDFs process The idea of automating data labeling is shown in the figure [img˙concept]. We propose to use the Accessibility package for generating tagged PDF docu- ments. This package is easy to use. In order to get a tagged document, the only short preamble is needed to add to the document and compiled using pdflatex tool. The next step is to extract the tagged information from the tagged doc- ument. PDFBox allows to extract content from documents and render PDF to image. We propose to use this tool to generate training dataset from tagged PDFs. 4 Conclusion In this paper, we presented an idea how to automate dataset labeling from LaTex documents. The main idea is use special LaTex package Accessibility. This package allows adding tags to produced PDF documents. To extract information about layout from tagged PDFs we suggest to use PDFBox library. We expect that the explained principles can be used for designing software for page layout analysis. Acknowledgment The research was supported by the Program of the Fundamental Research of the Siberian Branch of the Russian Academy of Sciences, project num. IV.38.1.2 (reg. num. AAAA-A17-117032210079-1). Results are achieved using the Centre of collective usage Integrated information network of Irkutsk scientific educational complex. References [1] AndyClifton. The Accessibility LaTeX package. 2019 (accessed August 21, 2020). url: https://github.com/AndyClifton/accessibility. [2] Dario Augusto Borges Oliveira and Matheus Palhares Viana. “Fast CNN- based document layout analysis”. In: Proceedings of the IEEE Interna- tional Conference on Computer Vision Workshops. 2017, pp. 1173–1180. [3] Syed Saqib Bukhari, Mayce Ibrahim Ali Al Azawi, Faisal Shafait, and Thomas M Breuel. “Document image segmentation using discriminative learning over connected components”. In: Proceedings of the 9th IAPR International Workshop on Document Analysis Systems. 2010, pp. 183– 190. [4] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. “Icdar2017 competition on recognition of documents with complex layouts- rdcl2017”. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE. 2017, pp. 1404–1410. [5] Christian Clausner, Apostolos Antonacopoulos, and Stefan Pletschacher. “ICDAR2019 competition on recognition of documents with complex layouts- rdcl2019”. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE. 2019, pp. 1521–1526. [6] Lloyd A. Fletcher and Rangachar Kasturi. “A robust algorithm for text string separation from mixed text/graphics images”. In: IEEE transactions on pattern analysis and machine intelligence 10.6 (1988), pp. 910–918. [7] Liangcai Gao, Yilun Huang, Hervé Déjean, Jean-Luc Meunier, Qinqin Yan, Yu Fang, Florian Kleber, and Eva Lang. “Icdar 2019 competition on table detection and recognition (ctdar)”. In: 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE. 2019, pp. 1510–1515. [8] Adam W Harley, Alex Ufkes, and Konstantinos G Derpanis. “Evaluation of deep convolutional nets for document image classification and retrieval”. In: 2015 13th International Conference on Document Analysis and Recog- nition (ICDAR). IEEE. 2015, pp. 991–995. [9] Le Kang, Jayant Kumar, Peng Ye, Yi Li, and David Doermann. “Convo- lutional neural networks for document image classification”. In: 2014 22nd International Conference on Pattern Recognition. IEEE. 2014, pp. 3168– 3172. [10] Viet Phuong Le, Nibal Nayef, Muriel Visani, Jean-Marc Ogier, and Cao De Tran. “Text and non-text segmentation based on connected component features”. In: 2015 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE. 2015, pp. 1096–1100. [11] Michael A Moll and Henry S Baird. “Segmentation-based retrieval of doc- ument images from diverse collections”. In: Document Recognition and Retrieval XV . Vol. 6815. International Society for Optics and Photonics. 2008, p. 68150L. [12] Michael A Moll, Henry S Baird, and Chang An. “Truthing for pixel- accurate segmentation”. In: 2008 The Eighth IAPR International Work- shop on Document Analysis Systems. IEEE. 2008, pp. 379–385. [13] Oleg Okun, David Dœrmann, and Matti Pietikainen. Page segmentation and zone classification: the state of the art. Tech. rep. OULU UNIV (FIN- LAND) DEPT OF ELECTRICAL ENGINEERING, 1999. [14] Babett Schalitz. Accessibility-erhöhung von latex-dokumenten. Diplomar- beit, Fakultät Informatik . Tech. rep. Technische Universität Dresden, July 2007. [15] Karl Tombre, Salvatore Tabbone, Loıc Pélissier, Bart Lamiroy, and Philippe Dosch. “Text/graphics separation revisited”. In: International Workshop on Document Analysis Systems. Springer. 2002, pp. 200–211. [16] Nicole Vincent and Jean-Marc Ogier. “Shall deep learning be the manda- tory future of document analysis problems?” In: Pattern Recognition 86 (2019), pp. 281–289. [17] Kwan Y. Wong, Richard G. Casey, and Friedrich M. Wahl. “Document analysis system”. In: IBM journal of research and development 26.6 (1982), pp. 647–656.