=Paper=
{{Paper
|id=Vol-2984/paper15
|storemode=property
|title=PyTabby: a Docreader's module for extracting text and tables from PDF with a text layer (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2984/paper15.pdf
|volume=Vol-2984
|authors=Andrey A. Mikhailov,Alexey O. Shigarov,Ilya S. Kozlov
|dblpUrl=https://dblp.org/rec/conf/itams/MikhailovSK21
}}
==PyTabby: a Docreader's module for extracting text and tables from PDF with a text layer (short paper)==
Andrey A. Mikhailov (1), Alexey Shigarov (1) and Ilya S. Kozlov (2)

(1) Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of the Russian Academy of Sciences, Irkutsk, 664033, Russian Federation
(2) Ivannikov Institute for System Programming of the Russian Academy of Sciences, 25 Alexander Solzhenitsyn St., Moscow, 109004, Russian Federation

Information Technologies: Algorithms, Models, Systems (ITAMS), September 14, 2021, Irkutsk

Contacts: mikhailov@icc.ru (A. A. Mikhailov); shigarov@icc.ru (A. Shigarov), http://td.icc.ru; kozlov-ilya@ispras.ru (I. S. Kozlov)
ORCID: 0000-0003-4057-4511 (A. A. Mikhailov); 0000-0001-5572-5349 (A. Shigarov); 0000-0002-0145-1159 (I. S. Kozlov)

Abstract. This paper presents a complete solution for extracting textual information and tables from PDF documents with a text layer. The solution consists of two parts: PyTabby, a tool for extracting text and tables from PDF documents with a complex background and layout, and a Python wrapper module for the Docreader tool. PyTabby extracts text and tables from the low-level representation of the PDF format. This makes it possible to use additional information that is unavailable in scanned documents and to improve quality and performance compared with Optical Character Recognition (OCR) methods. The solution is incorporated into the Docreader tool to parse PDF files with a text layer and is used as a part of the TALISMAN technology for social analytics.

Keywords: document structure analysis, PDF documents, document analysis

===1. Introduction===

In the digital world, web content is a key asset: it provides useful information and business opportunities. The large number of such documents on the Internet and their properties make them a valuable resource in data science and business intelligence applications. However, electronic documents are usually not accompanied by the semantics necessary for machine interpretation of their content as intended by the author. The information accumulated in them is often unstructured and not standardized, so the analysis of these documents requires a preliminary transformation into a structured representation with a formal model.

The Ivannikov Institute for System Programming of the Russian Academy of Sciences is developing a social media analysis technology called TALISMAN (https://www.ispras.ru/en/technologies/talisman). Unlike most existing solutions for social analytics, the TALISMAN technology was originally aimed at working with large amounts of data. It employs the most promising open solutions from the Big Data technology stack, such as Apache Spark, GraphX, and MLlib.

Figure 1: The Docreader scheme of work.

Part of TALISMAN is Docreader (https://education.at.ispras.ru/dedoc), a universal and open system for bringing documents to a single format. It automatically extracts the logical structure, tables, and meta-information. The content of a document is presented in the form of a tree encoding headers and lists of various levels of nesting. Docreader can be integrated as a separate component into systems for analyzing the structure and content of documents. It can extract the logical structure from PDF documents both with and without a text layer; however, for table detection and recognition it relies on OCR [1] in both cases.
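For illustration only, a logical document structure of this kind can be modeled as a tree of typed nodes. The class and field names below are hypothetical and do not reflect Docreader's actual API.

<pre>
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocNode:
    """One node of a logical document structure tree (hypothetical model)."""
    node_type: str              # e.g. "header", "list", "list_item", "paragraph", "table"
    text: str = ""
    level: int = 0              # nesting level of headers and lists
    children: List["DocNode"] = field(default_factory=list)

# A tiny document: a header followed by a paragraph and a two-item list
doc = DocNode("header", "1. Introduction", level=1, children=[
    DocNode("paragraph", "Motivation of the work."),
    DocNode("list", level=1, children=[
        DocNode("list_item", "first item"),
        DocNode("list_item", "second item"),
    ]),
])
</pre>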
===2. Related research===

In the past two decades, several methods and tools for PDF table extraction have been proposed; some of them are discussed in recent surveys [2, 3, 4, 5]. Ramel et al. [6] consider two techniques for detecting and recognizing tables in documents in an exchange format such as PDF. The first technique is based on the analysis of ruling lines; the second analyzes the arrangement of text components. Hassan et al. [7] expand these ideas to PDF table extraction. In the TableSeer project, Liu et al. [8] propose methods for detecting tables in PDF documents and extracting their metadata (headers). They use text arrangement, fonts, whitespace, and keywords (e.g. "Table", "Figure"). Oro et al. [9] present PDF-TREX, a heuristic method in which tables are built from content elements in a bottom-up way. Yildiz et al. [10] propose a heuristic method for PDF table extraction that uses pdftohtml (http://pdftohtml.sourceforge.net) to generate its input. However, this tool occasionally makes mistakes when combining text chunks that are located too close to each other, so the input can be corrupted. Nurminen [11] describes in his thesis comprehensive PDF table detection and structure recognition algorithms that demonstrated high recall and precision at the ICDAR 2013 Table Competition [12]. Some of them are implemented in Tabula (http://tabula.technology), a tool for extracting tabular data from PDF. Rastan et al. [13] consider a framework for end-to-end table processing, including the task of table structure recognition. Moreover, Rastan et al. [14] suggest an ad-hoc document analysis that leads to better table extraction; their wrapper is able to detect features such as page columns, bullets, and numbering. Perez-Arriaga et al. [15] combine layout heuristics with a supervised machine learning method based on k-nearest neighbors to extract tables from untagged PDF documents. Their system, TAO, promises to be an efficient, comprehensive and robust solution for both stages, table detection and cell structure recognition, and it does not depend on fixed patterns or layouts of tables or documents.

===3. The proposed method===

The process of PDF table and textual information extraction involves the following phases [16]:

1. data preparation, to recover text blocks presented as words and ruling lines from the instructions of a source PDF document;
2. text line and paragraph extraction, to recover text blocks presented as lines and paragraphs;
3. table detection, to recover a bounding box of each table located on a page;
4. table structure recognition, to recover the cell structure of a detected table.

We propose to use a heuristic-based page layout analysis to recover text blocks such as paragraphs, titles, footnotes, table cells, etc. These additional data allow us to correct some errors in table detection. To build text blocks, we use data that are available in untagged PDF documents, including character positions, fonts, rulings, and cursor traces. Since such documents do not contain word structures, we propose a simple algorithm for combining adjacent character positions into words, as sketched below. Moreover, we adapt and extend the existing algorithms of the T-Recs system [17, 18] for combining neighboring words into text blocks. Originally, the T-Recs algorithms were designed for document images; in contrast, our adaptation uses additional heuristics based on PDF-specific data.
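The following is a minimal sketch of this kind of character-to-word grouping. The input representation (character boxes with coordinates and font sizes) and the gap threshold are illustrative assumptions, not the exact PyTabby implementation.

<pre>
from dataclasses import dataclass

@dataclass
class CharBox:
    """A single character recovered from PDF text-printing instructions."""
    text: str
    x0: float        # left edge
    x1: float        # right edge
    font_size: float

def group_chars_into_words(chars, gap_ratio=0.3):
    """Combine horizontally adjacent characters of one text line into words.

    A new word starts when the horizontal gap between two neighboring
    characters exceeds gap_ratio * font_size.
    """
    chars = sorted(chars, key=lambda c: c.x0)
    words, current = [], []
    for ch in chars:
        if current and (ch.x0 - current[-1].x1) > gap_ratio * ch.font_size:
            words.append("".join(c.text for c in current))
            current = []
        current.append(ch)
    if current:
        words.append("".join(c.text for c in current))
    return words

# Example: the characters of "PDF table" given as separate boxes
line = [
    CharBox("P", 0, 5, 10), CharBox("D", 5, 10, 10), CharBox("F", 10, 15, 10),
    CharBox("t", 20, 24, 10), CharBox("a", 24, 28, 10), CharBox("b", 28, 32, 10),
    CharBox("l", 32, 35, 10), CharBox("e", 35, 39, 10),
]
print(group_chars_into_words(line))  # ['PDF', 'table']
</pre>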
====3.1. Table detection====

At this step, we detect only fully and partially bordered tables. The main idea is to find table boxes on a page using vertical and horizontal ruling lines. This can be done in two ways: by analyzing an image of the page or by analyzing the PDF instructions. Both approaches have their disadvantages. In the first case, the image of the document contains redundant information such as text, pictures, and forms, which makes it difficult to pick out the vertical and horizontal lines in the document image. The PDF format, in turn, allows one to select separately all the instructions that form the outline of a table (drawing lines, rectangles, etc.). However, PDF printers often use non-standard approaches to output graphics. For example, the color of a line can be the same as the background, so that the line is not visible. Such a ruling can still be processed automatically, but if the colors of the line and the background differ only slightly, they cannot be distinguished visually, and their programmatic separation is also difficult.

In this work, we use a combined approach. At the first stage, all instructions that output textual and graphic information other than ruling lines are removed from the PDF document, as shown in Figures 2 and 3. As a result, the document consists only of horizontal and vertical lines. After such processing, an image is generated from the PDF document page by page. Using a connected-components algorithm, such an image can be used to detect the contours of tables, as sketched below.

Figure 2: Original PDF page.
Figure 3: Cleared PDF page.
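A minimal sketch of the detection stage described above, assuming the cleared page has already been rendered to a grayscale image. The use of OpenCV and the specific thresholds are illustrative assumptions, not necessarily what PyTabby itself uses.

<pre>
import cv2
import numpy as np

def detect_table_boxes(cleared_page: np.ndarray, min_area: int = 1000):
    """Find bounding boxes of bordered tables on a cleared page image.

    cleared_page: grayscale image that contains only ruling lines (text and
    other graphics are assumed to have been removed before rendering).
    Returns a list of (x, y, w, h) boxes.
    """
    # Binarize so that ruling lines become white on a black background.
    _, binary = cv2.threshold(cleared_page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Close small gaps between line segments so that a table border
    # forms a single connected component.
    kernel = np.ones((3, 3), np.uint8)
    closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
    # Label connected components and keep the sufficiently large ones.
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(closed)
    boxes = []
    for label in range(1, n_labels):  # label 0 is the background
        x, y, w, h, area = stats[label]
        if area >= min_area:
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
</pre>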
====3.2. Table structure recognition====

At this step, we construct the rows and columns that constitute the arrangement of cells. The system provides an algorithm for slicing a table space into rows and columns based on the analysis of connected text blocks. To generate columns, we first exclude every multi-column text block, i.e. a block located in more than one column. We assume that a text block is multi-column when its horizontal projection intersects the projections of two or more text blocks located in the same line. Each column is then formed as an intersection of the horizontal projections of one-column text blocks, as sketched below. Similarly, rows are constructed from the vertical projections of one-row text blocks. At this step, we also recover empty cells. Some of them can be erroneous, i.e. absent from the source table; the system provides ad-hoc heuristics to dispose of such erroneous empty cells.
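A simplified sketch of the projection-based column construction described above. The text block representation is an assumption made for illustration, and overlapping projections are merged into column intervals, which simplifies the intersection-based rule used in the actual method.

<pre>
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class TextBlock:
    x0: float     # left edge of the block's horizontal projection
    x1: float     # right edge of the block's horizontal projection
    line_id: int  # index of the text line the block belongs to

def find_columns(blocks):
    """Build column intervals from the horizontal projections of one-column blocks."""
    by_line = defaultdict(list)
    for b in blocks:
        by_line[b.line_id].append(b)

    def is_multi_column(block):
        # A block spans several columns if, in some other text line,
        # two or more blocks overlap it horizontally.
        for line_id, line_blocks in by_line.items():
            if line_id == block.line_id:
                continue
            overlaps = sum(1 for other in line_blocks
                           if block.x0 < other.x1 and other.x0 < block.x1)
            if overlaps >= 2:
                return True
        return False

    singles = [b for b in blocks if not is_multi_column(b)]
    # Merge overlapping projections of one-column blocks into column intervals.
    spans = sorted((b.x0, b.x1) for b in singles)
    columns = []
    for x0, x1 in spans:
        if columns and x0 <= columns[-1][1]:
            columns[-1][1] = max(columns[-1][1], x1)
        else:
            columns.append([x0, x1])
    return [tuple(c) for c in columns]
</pre>

Rows can be built in the same way from vertical projections by swapping the coordinate axis.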
===4. Conclusion===

In this paper, we present PyTabby, a tool for table and textual information extraction from PDF documents with a text layer. It extends our previous work on table structure recognition [19, 20]. The system exploits a set of customizable ad-hoc heuristics for table detection and cell structure reconstruction based on features of the text and ruling lines present in PDF documents. Most of these features, such as horizontal and vertical distances, fonts, and rulings, are well known and used in existing methods. Additionally, we propose to exploit the appearance of text-printing instructions in PDF files and the positions of the drawing cursor.

===Acknowledgments===

The results were obtained within the framework of the State Assignment of the Ministry of Education and Science of the Russian Federation for the project "Methods and technologies of cloud-based service-oriented platform for collecting, storing and processing large volumes of multi-format interdisciplinary data and knowledge based upon the use of artificial intelligence, model-guided approach and machine learning" (State Registration No. 121030500071-2).

===References===

[1] C. C. Tappert, C. Y. Suen, T. Wakahara, The state of the art in online handwriting recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 787–808.
[2] B. Coüasnon, A. Lemaitre, Handbook of Document Image Processing and Recognition, 2014, pp. 647–677. doi:10.1007/978-0-85729-859-1_20.
[3] J. Hu, Y. Liu, Analysis of Documents Born Digital, 2014, pp. 775–804. doi:10.1007/978-0-85729-859-1_26.
[4] S. Khusro, A. Latif, I. Ullah, On methods and tools of table detection, extraction and annotation in PDF documents, J. Inf. Sci. 41 (2015) 41–57. doi:10.1177/0165551514551903.
[5] A. S. Corrêa, P.-O. Zander, Unleashing tabular content to open data: A survey on PDF table extraction methods and tools, in: Proc. 18th Int. Conf. on Digital Government Research, 2017, pp. 54–63. doi:10.1145/3085228.3085278.
[6] J. Y. Ramel, M. Crucianu, N. Vincent, C. Faure, Detection, extraction and representation of tables, in: Proc. 7th Int. Conf. on Document Analysis and Recognition, 2003, pp. 374–378, vol. 1. doi:10.1109/ICDAR.2003.1227692.
[7] T. Hassan, R. Baumgartner, Table recognition and understanding from PDF files, in: Proc. 9th Int. Conf. on Document Analysis and Recognition, Vol. 02, 2007, pp. 1143–1147. URL: http://dl.acm.org/citation.cfm?id=1304596.1304833.
[8] Y. Liu, K. Bai, P. Mitra, C. L. Giles, TableSeer: Automatic table metadata extraction and searching in digital libraries, in: Proc. 7th ACM/IEEE Joint Conf. on Digital Libraries, 2007, pp. 91–100. doi:10.1145/1255175.1255193.
[9] E. Oro, M. Ruffolo, PDF-TREX: An approach for recognizing and extracting tables from PDF documents, in: Proc. 10th Int. Conf. on Document Analysis and Recognition, 2009, pp. 906–910. doi:10.1109/ICDAR.2009.12.
[10] B. Yildiz, K. Kaiser, S. Miksch, pdf2table: A method to extract table information from PDF files, in: Proc. 2nd Indian Int. Conf. on Artificial Intelligence, Pune, India, 2005, pp. 1773–1785.
[11] A. Nurminen, Algorithmic extraction of data in tables in PDF documents, Master's thesis, Tampere University of Technology, Tampere, Finland, 2013.
[12] M. Göbel, T. Hassan, E. Oro, G. Orsi, ICDAR 2013 table competition, in: Proc. 12th Int. Conf. on Document Analysis and Recognition, 2013, pp. 1449–1453. doi:10.1109/ICDAR.2013.292.
[13] R. Rastan, H.-Y. Paik, J. Shepherd, TEXUS: A task-based approach for table extraction and understanding, in: Proc. 2015 ACM Symposium on Document Engineering, 2015, pp. 25–34. doi:10.1145/2682571.2797069.
[14] R. Rastan, H.-Y. Paik, J. Shepherd, A PDF wrapper for table processing, in: Proc. 2016 ACM Symposium on Document Engineering, 2016, pp. 115–118. doi:10.1145/2960811.2967162.
[15] M. O. Perez-Arriaga, T. Estrada, S. Abad-Mota, TAO: System for table detection and extraction from PDF documents, in: Proc. 29th Int. Florida Artificial Intelligence Research Society Conference, 2016, pp. 591–596.
[16] A. Shigarov, A. Altaev, A. Mikhailov, V. Paramonov, E. Cherkashin, TabbyPDF: Web-based system for PDF table extraction, in: R. Damaševičius, G. Vasiljevienė (Eds.), Information and Software Technologies, Springer International Publishing, Cham, 2018, pp. 257–269.
[17] T. Kieninger, A. Dengel, The T-Recs table recognition and analysis system, in: Document Analysis Systems: Theory and Practice, 1999, pp. 255–270. doi:10.1007/3-540-48172-9_21.
[18] T. Kieninger, A. Dengel, Applying the T-Recs table recognition system to the business letter domain, in: Proc. 6th Int. Conf. on Document Analysis and Recognition, 2001, pp. 518–522. doi:10.1109/ICDAR.2001.953843.
[19] A. Shigarov, A. Mikhailov, V. Khristyuk, V. Paramonov, Software development for rule-based spreadsheet data extraction and transformation, 2019, pp. 1132–1137. doi:10.23919/MIPRO.2019.8756829.
[20] A. Shigarov, I. Cherepanov, E. Cherkashin, N. Dorodnykh, V. Khristyuk, A. Mikhailov, V. Paramonov, E. Rozhkow, A. Yurin, Towards end-to-end transformation of arbitrary tables from untagged portable documents (PDF) to linked data, volume 2463, 2019, pp. 1–12.