<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>PyTabby: a Docreader module for extracting text and tables from PDF with a text layer</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrey A. Mikhailov</string-name>
          <email>mikhailov@icc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexey Shigarov</string-name>
          <email>shigarov@icc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilya S. Kozlov</string-name>
          <email>kozlov-ilya@ispras.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information Technologies: Algorithms, Models, Systems (ITAMS)</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ivannikov Institute for System Programming of Russian Academy of Sciences</institution>
          ,
          <addr-line>25 Alexander Solzhenitsyn St., Moscow</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a complete solution for extracting textual information and tables from PDF documents with a text layer. The presented solution consists of two parts: PyTabby, a tool for extracting text and tables from PDF files with a complex background and layout, and a Python wrapper module for the Docreader tool. PyTabby extracts text and tables from the low-level representation of the PDF format. It exploits additional information that is absent in scanned documents and improves quality and performance compared with Optical Character Recognition (OCR) methods. The presented solution is incorporated into the Docreader tool to parse PDF files with a text layer and is used as a part of the TALISMAN technology for social analytics.</p>
      </abstract>
      <kwd-group>
        <kwd>document structure analysis</kwd>
        <kwd>PDF documents</kwd>
        <kwd>document analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the digital world, web content is a major asset: it carries useful information that creates
business opportunities. The large number of such documents and their properties make them a
valuable resource in data science and business intelligence applications. Usually, electronic
documents are not accompanied by the semantics necessary for machine interpretation of their
content as intended by their author. The information accumulated in them is often unstructured
and not standardized. Analysis of these documents requires their preliminary transformation
into a structured representation with a formal model.</p>
      <p>[Figure: Docreader pipeline, showing the process manager and the logical structure extraction components.]</p>
      <p>Part of TALISMAN is Docreader (https://education.at.ispras.ru/dedoc), a universal and open
system for converting documents to a single format. It automatically extracts the logical structure,
tables, and meta-information. The content of a document is represented as a tree encoding headers
and lists with various levels of nesting. Docreader can be integrated as a separate component into
systems for analyzing the structure and content of documents.</p>
      <p>
        Docreader can extract the logical structure from PDF files both with and without a text layer.
However, for table detection and recognition, it uses OCR [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] in both cases.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related research</title>
      <p>
        In the past two decades, several methods and tools for PDF table extraction have been proposed.
Some of them are discussed in the recent surveys [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4, 5</xref>
        ]. Ramel et al. [6] consider two
techniques for detecting and recognizing tables from documents in an exchange format like PDF.
The first technique is based on the analysis of ruling lines. The second analyzes the arrangement
of text components. Hassan et al. [7] expand these ideas for PDF table extraction. In the
project TableSeer, Liu et al. [8] propose methods for detecting tables in PDF documents and
extracting metadata (headers).
      </p>
      <sec id="sec-2-1">
        <p>They use text arrangement, fonts, whitespace, and keywords
(e. g. “Table”, “Figure”). Oro et al. [9] present PDF-TREX, a heuristic method where the PDF
table extraction is realized as building from content elements to tables in a bottom-up way.</p>
        <p>Yildiz et al. [10] propose a heuristic method for PDF table extraction that uses pdftohtml
(http://pdftohtml.sourceforge.net) to generate its input. However, this tool occasionally makes
mistakes when combining text chunks that are located too close to each other, so the input can
be corrupted. Nurminen [11] in his thesis describes comprehensive PDF table detection and
structure recognition algorithms that have demonstrated high recall and precision in the
“ICDAR 2013 Table Competition” [12]. Some of them are implemented in Tabula
(http://tabula.technology), a tool for extracting tabular data from PDF. Rastan et al. [13]
consider a framework for end-to-end table processing including the task of table structure
recognition. Moreover, Rastan et al. [14] suggest using an ad-hoc document analysis leading to
better table extraction. Their wrapper is able to detect features such as page columns, bullets,
and numbering. Perez-Arriaga et al. [15] combine layout heuristics with a supervised machine
learning method based on k-nearest neighbors to extract tables from untagged PDF documents.
Their system, TAO, promises to be an efficient, comprehensive, and robust solution for both
stages, table detection and cell structure recognition, and it does not depend on fixed patterns
or layouts of tables or documents.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The proposed method</title>
      <p>The process of PDF table and textual information extraction involves the following phases [16]:
1. data preparation, to recover text blocks presented as words and ruling lines from
instructions of a source PDF document;
2. text line and paragraph extraction, to recover text blocks presented as lines and
paragraphs;
3. table detection, to recover a bounding box of each table located on a page;
4. table structure recognition, to recover a cell structure of a detected table.</p>
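      <p>The four phases above can be sketched as a minimal Python pipeline. The function names, stub bodies, and page representation below are illustrative assumptions made for this sketch, not PyTabby's actual API:</p>
      <preformat>
```python
# Hypothetical skeleton of the four phases listed above; the function
# names, stub bodies, and page representation are illustrative
# assumptions, not PyTabby's actual API.

def prepare_data(page):
    """Phase 1: recover words and ruling lines from PDF instructions."""
    return page["words"], page["rulings"]

def build_paragraphs(words):
    """Phase 2: group words into text lines and paragraphs (stubbed)."""
    return [" ".join(words)]

def detect_tables(rulings):
    """Phase 3: one bounding box per ruling-line contour (stubbed)."""
    return [r["bbox"] for r in rulings]

def recognize_structure(box):
    """Phase 4: slice a detected table box into a cell grid (stubbed)."""
    return {"bbox": box, "cells": []}

def extract(page):
    """Run the four phases in order on a single PDF page."""
    words, rulings = prepare_data(page)
    paragraphs = build_paragraphs(words)
    tables = [recognize_structure(box) for box in detect_tables(rulings)]
    return paragraphs, tables
```
      </preformat>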
      <p>We propose to use a heuristic-based page layout analysis to recover text blocks such as
paragraphs, titles, footnotes, table cells, etc. These additional data allow us to correct some
errors of the presented table detection.</p>
      <p>To build text blocks, we use data that are available in untagged PDF documents, including
character positions, fonts, rulings, and cursor traces. Since such documents do not contain
word structures, we propose a simple algorithm for combining adjacent character positions into
words. Moreover, we adapt and extend the existing algorithms of the T-Recs system [17, 18] for
combining neighboring words into text blocks. Originally, the T-Recs algorithms were designed
for document images. In contrast, our adaptation uses additional heuristics based on
PDF-specific data.</p>
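      <p>The character-to-word grouping step can be illustrated as follows. This is a minimal sketch, not the actual PyTabby implementation; the character layout and the gap threshold (a fraction of the character width, standing in for real font metrics) are assumptions:</p>
      <preformat>
```python
from dataclasses import dataclass

@dataclass
class Char:
    text: str
    x0: float  # left edge of the glyph's bounding box
    x1: float  # right edge
    y: float   # baseline

def chars_to_words(chars, gap_factor=0.3):
    """Merge horizontally adjacent character positions into words.

    Two characters belong to the same word when they share a baseline
    and the horizontal gap between them is small relative to the
    character width (a crude stand-in for font metrics).
    """
    words, current = [], []
    for ch in sorted(chars, key=lambda c: (c.y, c.x0)):
        if current:
            prev = current[-1]
            same_line = abs(prev.y - ch.y) < 1e-6
            max_gap = gap_factor * max(prev.x1 - prev.x0, ch.x1 - ch.x0)
            if same_line and ch.x0 - prev.x1 <= max_gap:
                current.append(ch)
                continue
            words.append("".join(c.text for c in current))
            current = []
        current.append(ch)
    if current:
        words.append("".join(c.text for c in current))
    return words
```
      </preformat>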
      <sec id="sec-3-1">
        <sec id="sec-3-1-1">
          <title>3.1. Table detection</title>
          <p>At this step, we detect only fully and partially bordered tables. The main idea is to find
table boxes on the page using ruling lines (vertical and horizontal). This can be done in two
ways: based on the image or on the PDF instruction analysis. Both approaches have their
disadvantages. In the first case, the image of the document contains redundant information,
such as text, pictures, forms, etc. This information makes it difficult to highlight vertical and
horizontal lines in the document image. The PDF format allows separate selection of all the
instructions with which the outline of the table is formed: drawing lines, rectangles, etc.
However, PDF printers often use non-standard approaches to output graphics. For example,
the color of a line can be the same as the background, so that the line is visually invisible. Such
a line can still be processed automatically, but if the colors of the line and the background
differ only slightly, it is impossible to distinguish them, and, hence, their programmatic
separation is also difficult.</p>
          <p>In this work, we use a combined approach. At the first stage, all instructions for outputting
text and graphic information are removed from the PDF document, as shown in Fig. 3.1. As a
result, the document consists only of horizontal and vertical lines. After such processing, an
image is generated from the PDF document page by page. Using a connected-components
algorithm, this image can be used to detect the contours of the tables.</p>
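          <p>The contour-detection step can be illustrated with a small self-contained sketch: a binary mask of ruling-line pixels is scanned with a breadth-first search over connected components, and the bounding box of each component becomes a candidate table outline. The mask format and 4-connectivity are assumptions made for this illustration; this is not the actual PyTabby code:</p>
          <preformat>
```python
from collections import deque

def table_boxes(mask):
    """Bounding boxes of connected components in a binary image.

    `mask` is a list of rows of 0/1 pixels, where 1 marks a pixel that
    belongs to a (horizontal or vertical) ruling line.  Each connected
    component of ruling pixels yields one candidate table outline.
    """
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # BFS over 4-connected ruling pixels.
                q = deque([(sy, sx)])
                seen[sy][sx] = True
                x0, y0, x1, y1 = sx, sy, sx, sy
                while q:
                    y, x = q.popleft()
                    x0, x1 = min(x0, x), max(x1, x)
                    y0, y1 = min(y0, y), max(y1, y)
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w \
                                and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            q.append((ny, nx))
                boxes.append((x0, y0, x1, y1))
    return boxes
```
          </preformat>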
        </sec>
        <sec id="sec-3-1-2">
          <title>3.2. Table structure recognition</title>
          <p>At this step, we construct rows and columns that constitute an arrangement of cells. The
system provides an algorithm for slicing a table space into rows and columns based on the
analysis of connected text blocks. To generate columns, we first exclude each multi-column
text block, i.e. a block located in more than one column. We assume that a text block is
multi-column when its horizontal projection intersects the projections of two or more text
blocks located in the same line. Each column is considered as an intersection of horizontal
projections of one-column text blocks. Similarly, rows are constructed from vertical projections
of one-row text blocks. At this step, we also recover empty cells. Some of them can be
erroneous, i.e. absent in the source table. The system provides ad-hoc heuristics to dispose of
erroneous empty cells.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we present PyTabby, a tool for table and text extraction from PDF documents
with a text layer. It extends our previous work on table structure recognition [19, 20]. The
system exploits a set of customizable ad-hoc heuristics for table detection and cell structure
reconstruction based on features of text and ruling lines present in PDF documents. Most of
them, such as horizontal and vertical distances, fonts, and rulings, are well known and used in
the existing methods. Additionally, we propose to exploit the order of appearance of
text-printing instructions in PDF files and the positions of the drawing cursor.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The results were obtained within the framework of the State Assignment of the Ministry of
Education and Science of the Russian Federation for the project “Methods and technologies of
cloud-based service-oriented platform for collecting, storing and processing large volumes of
multi-format interdisciplinary data and knowledge based upon the use of artificial intelligence,
model-guided approach and machine learning” (State Registration No. 121030500071-2).</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[5] A. S. Corrêa, P.-O. Zander, Unleashing tabular content to open data: A survey on PDF table
extraction methods and tools, in: Proc. 18th Int. Conf. on Digital Government Research, 2017,
pp. 54–63. URL: http://doi.acm.org/10.1145/3085228.3085278. doi:10.1145/3085228.3085278.</p>
      <p>[6] J. Y. Ramel, M. Crucianu, N. Vincent, C. Faure, Detection, extraction and representation of
tables, in: Proc. 7th Int. Conf. on Document Analysis and Recognition, 2003, pp. 374–378,
vol. 1. doi:10.1109/ICDAR.2003.1227692.</p>
      <p>[7] T. Hassan, R. Baumgartner, Table recognition and understanding from PDF files, in: Proc.
9th Int. Conf. on Document Analysis and Recognition, Vol. 02, 2007, pp. 1143–1147. URL:
http://dl.acm.org/citation.cfm?id=1304596.1304833.</p>
      <p>[8] Y. Liu, K. Bai, P. Mitra, C. L. Giles, TableSeer: Automatic table metadata extraction and
searching in digital libraries, in: Proc. 7th ACM/IEEE Joint Conf. on Digital Libraries, 2007,
pp. 91–100. URL: http://doi.acm.org/10.1145/1255175.1255193. doi:10.1145/1255175.1255193.</p>
      <p>[9] E. Oro, M. Ruffolo, PDF-TREX: An approach for recognizing and extracting tables from
PDF documents, in: Proc. 10th Int. Conf. on Document Analysis and Recognition, 2009,
pp. 906–910. doi:10.1109/ICDAR.2009.12.</p>
      <p>[10] B. Yildiz, K. Kaiser, S. Miksch, pdf2table: A method to extract table information from
PDF files, in: Proc. 2nd Indian Int. Conf. on Artificial Intelligence, Pune, India, 2005, pp.
1773–1785.</p>
      <p>[11] A. Nurminen, Algorithmic extraction of data in tables in PDF documents, Master’s thesis,
Tampere University of Technology, Tampere, Finland, 2013.</p>
      <p>[12] M. Göbel, T. Hassan, E. Oro, G. Orsi, ICDAR 2013 table competition, in: Proc. 12th Int.
Conf. on Document Analysis and Recognition, 2013, pp. 1449–1453. doi:10.1109/ICDAR.2013.292.</p>
      <p>[13] R. Rastan, H.-Y. Paik, J. Shepherd, TEXUS: A task-based approach for table extraction and
understanding, in: Proc. 2015 ACM Symposium on Document Engineering, 2015, pp. 25–34.
URL: http://doi.acm.org/10.1145/2682571.2797069. doi:10.1145/2682571.2797069.</p>
      <p>[14] R. Rastan, H.-Y. Paik, J. Shepherd, A PDF wrapper for table processing, in: Proc. 2016
ACM Symposium on Document Engineering, 2016, pp. 115–118. URL:
http://doi.acm.org/10.1145/2960811.2967162. doi:10.1145/2960811.2967162.</p>
      <p>[15] M. O. Perez-Arriaga, T. Estrada, S. Abad-Mota, TAO: System for table detection and
extraction from PDF documents, in: Proc. 29th Int. Florida Artificial Intelligence Research
Society Conference, 2016, pp. 591–596.</p>
      <p>[16] A. Shigarov, A. Altaev, A. Mikhailov, V. Paramonov, E. Cherkashin, TabbyPDF: Web-based
system for PDF table extraction, in: R. Damaševičius, G. Vasiljevienė (Eds.), Information
and Software Technologies, Springer International Publishing, Cham, 2018, pp. 257–269.</p>
      <p>[17] T. Kieninger, A. Dengel, The T-Recs table recognition and analysis system, in: Document
Analysis Systems: Theory and Practice, 1999, pp. 255–270. doi:10.1007/3-540-48172-9_21.</p>
      <p>[18] T. Kieninger, A. Dengel, Applying the T-Recs table recognition system to the business letter
domain, in: Proc. 6th Int. Conf. on Document Analysis and Recognition, 2001, pp. 518–522.
doi:10.1109/ICDAR.2001.953843.</p>
      <p>[19] A. Shigarov, A. Mikhailov, V. Khristyuk, V. Paramonov, Software development for
rule-based spreadsheet data extraction and transformation, 2019, pp. 1132–1137.
doi:10.23919/MIPRO.2019.8756829.</p>
      <p>[20] A. Shigarov, I. Cherepanov, E. Cherkashin, N. Dorodnykh, V. Khristyuk, A. Mikhailov,
V. Paramonov, E. Rozhkow, A. Yurin, Towards end-to-end transformation of arbitrary
tables from untagged portable documents (PDF) to linked data, volume 2463, 2019, pp. 1–12.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Tappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. Y.</given-names>
            <surname>Suen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wakahara</surname>
          </string-name>
          ,
          <article-title>The state of the art in online handwriting recognition</article-title>
          ,
          <source>IEEE Transactions on pattern analysis and machine intelligence</source>
          <volume>12</volume>
          (
          <year>1990</year>
          )
          <fpage>787</fpage>
          -
          <lpage>808</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Coüasnon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lemaitre</surname>
          </string-name>
          ,
          <source>Handbook of Document Image Processing and Recognition</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>647</fpage>
          -
          <lpage>677</lpage>
          . URL: http://dx.doi.org/10.1007/978-0-85729-859-1_20. doi:10.1007/978-0-85729-859-1_20.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <source>Analysis of Documents Born Digital</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>775</fpage>
          -
          <lpage>804</lpage>
          . URL: http://dx.doi.org/10.1007/978-0-85729-859-1_26. doi:10.1007/978-0-85729-859-1_26.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Khusro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Latif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ullah</surname>
          </string-name>
          ,
          <article-title>On methods and tools of table detection, extraction and annotation in PDF documents</article-title>
          ,
          <source>J. Inf. Sci</source>
          .
          <volume>41</volume>
          (
          <year>2015</year>
          )
          <fpage>41</fpage>
          -
          <lpage>57</lpage>
          . URL: http://dx.doi.org/10.1177/0165551514551903. doi:10.1177/0165551514551903.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>