<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Table extraction, analysis, and interpretation: the current state of the TabbyDOC project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexey Shigarov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Dorodnykh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Mikhailov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viacheslav Paramonov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Yurin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of the Russian Academy of Sciences</institution>
          ,
          <addr-line>134 Lermontov St, Irkutsk, 664033, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The freely available tabular data represented in various digital formats, such as print-oriented documents, spreadsheets, and web pages, are a valuable source to populate knowledge graphs. However, difficulties that inevitably arise with the extraction and integration of tabular data often hinder their intensive use in practice. The TabbyDOC project aims at elaborating a theoretical basis and developing open software for data extraction from arbitrary tables. Previously, it was devoted to the following issues: (i) table extraction from print-oriented documents, and (ii) data transformation from spreadsheet tables to relational and linked data. This paper summarizes the project's results on the following tasks: (i) automation of fine-tuning artificial neural networks for table detection in document images, (ii) synthesis of programs for spreadsheet data transformation driven by user-defined rules of table analysis and interpretation, and (iii) generation of RDF-triples from entities extracted from relational tables.</p>
      </abstract>
      <kwd-group>
        <kwd>table understanding</kwd>
        <kwd>table extraction</kwd>
        <kwd>table analysis</kwd>
        <kwd>table interpretation</kwd>
        <kwd>spreadsheet data extraction</kwd>
        <kwd>data integration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A large volume of tabular data represented in various digital formats, such as print-oriented
documents (e.g. PDF), spreadsheets (e.g. Excel), flat file databases (e.g. CSV), and web pages
(e.g. HTML), is freely available on the Web. Such data can be a valuable source for populating
knowledge graphs [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. However, difficulties that inevitably arise with the extraction and
integration of tabular data often hinder their intensive use in practice. Typically, such data are not
accompanied by the explicit semantics necessary for machine interpretation of their content
as conceived by their author. Since such data are unstructured or semi-structured, they first
have to be transformed into a structured representation with a formal model.
      </p>
      <p>
        The general-purpose tools for document conversion, text mining, or web scraping typically
do not take into account the relational nature of tabular data, whereas table-specific tools enable
a much more effective implementation of tabular data processing [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. They shorten the software development time by hiding inessential details and focusing on
table specifics. This is especially important in cases where it is necessary to develop software
for the mass processing of tabular data in a short time and with a lack of resources.
      </p>
      <p>[Figure 1: The TabbyDOC pipeline from unstructured and semi-structured data to structured and linked data.]</p>
      <p>The TabbyDOC project aims at elaborating a theoretical basis and developing open software for
data extraction from tables (Fig. 1). It covers the following tasks of table understanding:
(i) table extraction (i.e. detection of table bounding boxes and recognition of their cells in
print-oriented documents), (ii) table analysis (i.e. extraction of interrelated functional data items
from recognized tables), and (iii) table interpretation (i.e. mapping extracted data items to an
external vocabulary).</p>
      <p>The rest of the paper summarizes TabbyDOC’s results as follows: (Section 2) automation
of fine-tuning artificial neural networks for table detection in document images, (Section 3)
cleansing of table structure (erroneously split header cells), (Section 4) synthesis of programs for
spreadsheet data transformation driven by user-defined rules of table analysis and interpretation,
and (Section 5) generation of RDF-triples from entities extracted from relational tables.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Table extraction</title>
      <p>
        Schreiber et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] first discovered that deep learning (DL) based “object detection” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] in natural
scene images can be successfully applied to table detection in document images via
transfer learning. Their promising approach is based on fine-tuning pretrained artificial neural
networks (ANN) and the well-known “Faster R-CNN” architecture [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, this process
includes many routine manipulations to prepare training data. We addressed the issue of
shortening expert efforts by unifying existing collections of ground-truth data as well as by
transforming and augmenting samples.
      </p>
      <p>
        To simplify the development of ANN-models for table detection, we designed a workflow [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
(Fig. 2) that covers the following steps: (i) unifying samples (document images) from diverse
annotated collections to the “Pascal VOC”1 format; (ii) blurring and augmenting image samples via
affine transformation (Fig. 2); (iii) converting samples from “Pascal VOC” to the TFRecord2 format;
(iv) training an ANN-model by using TensorFlow3, the open source platform for deep learning; (v)
evaluating the target ANN-model on a competitive dataset.
1 http://host.robots.ox.ac.uk/pascal/voc
      </p>
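      <p>Step (i) can be illustrated with a minimal sketch that serializes a table bounding box in the style of a “Pascal VOC” annotation; the helper name and its fields below are our illustration, not part of the DL4TD scripts.</p>
      <preformat>
```python
import xml.etree.ElementTree as ET

def to_pascal_voc(filename, width, height, boxes):
    """Serialize table bounding boxes as a Pascal VOC-style annotation.

    `boxes` is a list of (xmin, ymin, xmax, ymax) tuples in pixels.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for (xmin, ymin, xmax, ymax) in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = "table"
        bnd = ET.SubElement(obj, "bndbox")
        ET.SubElement(bnd, "xmin").text = str(xmin)
        ET.SubElement(bnd, "ymin").text = str(ymin)
        ET.SubElement(bnd, "xmax").text = str(xmax)
        ET.SubElement(bnd, "ymax").text = str(ymax)
    return ET.tostring(root, encoding="unicode")

# One page image with a single table box (hypothetical coordinates).
xml_ann = to_pascal_voc("page_001.png", 1240, 1754, [(100, 200, 1100, 600)])
```
      </preformat>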
      <p>
        The workflow was automated by DL4TD4 [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a set of Python scripts. This allowed us to
prepare about 19K annotated samples (before augmentation) that were collected from
five freely available datasets, namely, UNLV5, Marmot6, “ICDAR2017 POD” [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], SciTSR7 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ],
and “ICDAR2019 cTDaR”8 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We also converted them to the “Pascal VOC” format. The
proposed solution reduced the routine manipulations, i.e. the expert efforts for trying
various training options.
      </p>
      <p>
        The automation was used to choose training options. As a result, more than 50 ANN-models
of admissible quality were created. We selected the ANN-model with the best accuracy
among all that we trained. It was incorporated into TabbyPDF9 [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ], our tool for table
extraction from untagged PDF documents. This solution was evaluated with the “ICDAR 2013
Table Competition”10 methodology [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and dataset [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The precision reached only 0.8651,
while the recall was 0.9795.
      </p>
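      <p>For reference, precision and recall can be combined into a single F1 measure as their harmonic mean; the value below is derived from the two numbers above, not reported in the original evaluation:</p>
      <preformat>
```python
def f1(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Combining the measured precision (0.8651) and recall (0.9795).
score = f1(0.8651, 0.9795)
```
      </preformat>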
      <p>It should be noted that false positives among the predictions made by the selected ANN-model
substantially degraded the precision. We proposed to verify predictions in order to reduce the false
positives by probing the text arrangement inside the bounding box of a candidate table. The
idea is the following: (i) to segment the text of a table candidate into blocks, each of which
represents a paragraph or cell, (ii) to compose a graph from the text blocks, and (iii) to analyze
the graph and to probe features indicating whether the candidate is probably a table or not.
2 https://www.tensorflow.org/tutorials/load_data/tfrecord
3 https://www.tensorflow.org
4 https://github.com/tabbydoc/dl4td
5 http://tc11.cvc.uab.es/datasets/DFKI-TGT-2010_1
6 https://www.icst.pku.edu.cn/cpdp/sjzy
7 https://github.com/Academic-Hammer/SciTSR
8 https://github.com/cndplab-founder/ICDAR2019_cTDaR
9 https://github.com/tabbydoc/tabbypdf2
10 https://www.tamirhassan.com/html/competition.html</p>
      <p>[Figure 2: The DL4TD workflow: origin datasets are unified into a “Pascal VOC” dataset; image transformation and data augmentation produce transformed and augmented datasets; training data generation yields a TFRecord training dataset; training produces an ANN-model whose performance evaluation gives a score.]</p>
      <p>To implement the approach mentioned above, we adapted the T-Recs [16, 17] algorithms
for “clustering of word blocks” in a plain-text document by adding PDF-specific constraints.
The main changes were the following: (i) composing text blocks by using the rendering order
and formatting of text and graphics, and (ii) calculating interline spacing. We also supplemented
the T-Recs algorithms with a new one for eliminating blocks erroneously glued due to subscript
and superscript fonts. All the algorithms were implemented as part of TabbyPDF.</p>
      <p>As a result, the adapted algorithms enabled representing table candidates as graphs of
connected text blocks. We selected a set of 32 features for classifying predictions and trained a
binary classifier based on Random Forest. The hyperparameters were tuned by using
Randomized Search. To train a model with satisfactory accuracy, we first conducted
a number of experiments with varying training datasets. As a result, we prepared the reference
dataset as follows. Our ANN-model predicted table candidates in PDF documents. Two
experts manually assigned one of two labels, either “Table” or “Not table”, to each of these
table candidates. The graphs were generated for both positive and negative samples. The target
classifier was trained on this dataset.</p>
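      <p>The flavor of such graph-based features can be sketched as follows; the features below are hypothetical stand-ins of our own (the real classifier uses 32 features that are not enumerated here):</p>
      <preformat>
```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    x: float  # left edge of the block's bounding box
    y: float  # top edge
    w: float  # width
    h: float  # height

def column_aligned(a, b, tol=2.0):
    # Blocks whose left edges (almost) coincide likely belong to one column.
    return not (abs(a.x - b.x) > tol)

def graph_features(blocks):
    """Compute a few illustrative features of a table candidate's block graph."""
    n = len(blocks)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if column_aligned(blocks[i], blocks[j])]
    aligned_ratio = len(edges) / max(1, n * (n - 1) // 2)
    mean_width = sum(b.w for b in blocks) / max(1, n)
    return {"blocks": n, "aligned_pairs": len(edges),
            "aligned_ratio": aligned_ratio, "mean_width": mean_width}

# Two blocks stacked in one column plus one block to the right.
feats = graph_features([TextBlock(10, 10, 50, 12), TextBlock(10, 30, 48, 12),
                        TextBlock(80, 10, 40, 12)])
```
      </preformat>
      <p>Feature vectors of this kind, computed for expert-labelled candidates, are what a binary classifier such as Random Forest consumes.</p>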
      <p>The verification allowed us to increase the precision up to 0.9703 on the competition part of
“ICDAR 2013 Table Competition”, which is 10% higher than the initial measurement (0.8651).
The approach showed that the graph-based table verification phase can significantly improve
results obtained from the deep learning-based table prediction phase. More details can be found
in our previous paper [18].</p>
    </sec>
    <sec id="sec-3">
      <title>3. Table cleansing</title>
      <p>The table analysis stage assumes that each physical (machine-readable) cell corresponds to one
logical (human-readable) cell (Fig. 3). However, spreadsheets actually do not guarantee the
fulfillment of this assumption, so it is often violated in real-world tables. We addressed this
issue with HeadRecog, a rule-based algorithm for correcting the physical structure of cells in
column headers by using visual borders [19, 20]. The algorithm matches physical cells with
logical ones that are highlighted by visual borders (Fig. 3). HeadRecog was implemented
as a part of TabbyXL11, our software platform for rule-based transformation of spreadsheet data
from arbitrary to relational tables [21]. It was experimentally demonstrated that the correctness
of the cell structure significantly affects the effectiveness of table analysis and interpretation by
reducing the number of errors.</p>
      <p>It should be highlighted that the existing methods for cell structure recognition are mainly
image-based. They focus on a lower-level representation of documents such as bitmap images.
Unlike them, our solution deals with the high-level representation of spreadsheets. This helps
avoid the data loss that inevitably accompanies converting to a lower-level format required for
the use of the image-based methods.</p>
      <p>We developed and tested the software tool (HeadRecog) for correcting the physical structure
of headers in spreadsheet tables according to their visual clues. This tool implements the algorithms
previously developed in the project. We used the assumption of K. Broman [22] about
filling header cells of a table, which states that empty cells are used for decoration only. If a cell is
empty, but the corresponding column-decorator cannot be found, this cell should be merged
with one of the non-empty neighboring cells. The decision whether the merging is possible is based
on the analysis of the cells' mutual disposition, their styles, and visual borders.</p>
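      <p>The merging idea can be sketched in a simplified form: empty physical cells in a header row are folded into the logical cell on their left. The real HeadRecog algorithm additionally checks styles and visual borders before deciding that a merge is safe; the data model below is our own simplification.</p>
      <preformat>
```python
def merge_empty_header_cells(header):
    """Merge empty header cells with their left non-empty neighbor.

    `header` is a list of rows, each row a list of cell strings; '' marks an
    empty physical cell. Returns rows of (text, span) logical cells.
    """
    merged = []
    for row in header:
        spans = []
        for cell in row:
            if cell == "" and spans:
                spans[-1][1] += 1   # widen the previous logical cell
            else:
                spans.append([cell, 1])
        merged.append([(text, span) for text, span in spans])
    return merged

# "Items" physically occupies one cell followed by two empty ones.
logical = merge_empty_header_cells([["Items", "", "", "Total"]])
```
      </preformat>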
      <p>The efficiency of the implemented solution (HeadRecog) was demonstrated on real-life tables
from the open-access corpus SAUS (“The 2010 Statistical Abstract of the United States”). We
prepared SAUS200 [23], a subset of 200 tables randomly selected by G. Nagy and published
by Z. Chen and M. Cafarella12. It should be noted that we made minor improvements
to these tables: we deleted hidden columns that contained service (markup) data. To
automatically test the implemented software, a set of ground-truth tables corresponding to the
subset of SAUS200 was also prepared. Each of the ground-truth tables has a correct structure of
header cells.</p>
      <p>To evaluate the performance of the automatic cell correction, a utility was implemented. It
allows one to compare the header parts of two tables and calculate the difference in their cell
structure and content. The total number of cells in the headers of the original SAUS200 tables is
8028, compared to 3768 in the headers of the ground-truth tables. The use of HeadRecog brings the
number of cells to 3795. Comparing the cells adjusted automatically and manually, we
obtained a complete match in 93.2% of the cases.</p>
      <p>11 https://github.com/tabbydoc/tabbyxl
12 http://dbgroup.eecs.umich.edu/project/sheets/datasets.html</p>
    </sec>
    <sec id="sec-4">
      <title>4. Table analysis</title>
      <p>Other advancements in TabbyXL [21, 24, 25] concerned the development of CRL, our
domain-specific language (DSL) for specifying production rules for the analysis and interpretation of
tables [26, 27, 28]. CRL enables the definition of queries (conditions) and operations (actions)
that are necessary to develop programs for spreadsheet data transformation from an arbitrary
to a canonical form (Fig. 5). CRL-rules map a physical structure of cells (properties of layout,
formatting, and content) to a logical structure (interrelated functional data items such as entries,
labels, and categories). In comparison with general-purpose rule languages (e.g. Drools13,
Jess14, or RuleML15), the advanced version of the language enables the expression of rulesets
without any instructions for management of the working memory (such as updates of modified
facts, or blocks on rule re-activation). This provides a syntactically simplified declaration of
the right-hand side of CRL-rules. End-users can focus more on the logic of table analysis
and interpretation than on the logic of rule management and execution.</p>
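      <p>As a rough illustration of this condition-action style (written in Python rather than CRL syntax, over a hypothetical cell model of our own), a rule can be modelled as a predicate over a cell paired with an action that assigns a functional role:</p>
      <preformat>
```python
# A condition-action rule in the spirit of table analysis rules: the
# condition queries a cell's layout/content properties, the action assigns
# a functional role (entry, label, ...). This models the style only, not CRL.

def make_rule(condition, action):
    def fire(cell, roles):
        if condition(cell):
            action(cell, roles)
    return fire

# Hypothetical cell model: dicts with row/column indices and a text value.
numeric_is_entry = make_rule(
    lambda c: c["text"].replace(".", "", 1).isdigit(),
    lambda c, roles: roles.setdefault("entries", []).append(c["text"]),
)
top_row_is_label = make_rule(
    lambda c: c["row"] == 0 and not c["text"].replace(".", "", 1).isdigit(),
    lambda c, roles: roles.setdefault("labels", []).append(c["text"]),
)

roles = {}
cells = [{"row": 0, "col": 0, "text": "Year"},
         {"row": 1, "col": 0, "text": "1995"}]
for cell in cells:
    for rule in (numeric_is_entry, top_row_is_label):
        rule(cell, roles)
```
      </preformat>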
      <p>We developed an interpreter of CRL-rules. It provides translation of CRL-rulesets (declarative
programs) into Java source code (imperative programs). The generated source code is ready
for compilation and building of executable programs for domain-specific spreadsheet data
extraction and transformation. Many of the existing solutions with similar goals use predefined
table models embedded into their internal algorithms. Such systems usually support only a few
widespread layout types of tables. Unlike them, our software platform defines a general-purpose
table model that does not restrict layout types. It allows expressing user-defined layout, style,
and text features of arbitrary tables in external CRL-rules. In comparison with competing tools,
we support not only widespread layout types of arbitrary tables, but also some specific types.
The empirical results show that our software platform can be successfully used in the development
of programs for spreadsheet data extraction and transformation.</p>
      <p>Additionally, we developed a generator of Maven projects that build executable applications
for the transformation of spreadsheet tables to a canonical form. The generated applications
provide the basic functionality of spreadsheet data canonicalization and can be applied without
additional programmer efforts. This can be useful in diverting software development towards
rule-driven data extraction and transformation from spreadsheet tables. The implemented
tools were integrated into the final release of TabbyXL16. Its source code was published in
open access, including the accompanying wiki documentation17 and software demos in the formats of
a Docker container18 and a CodeOcean capsule19.</p>
      <p>An illustrative example of using the updated TabbyXL for extracting data from real-life
statistical tables was prepared. The example includes: the original SAUS200 tables; CRL-rules
for table analysis and interpretation that enable the generation of an executable Java application
for converting these tables to a canonical form; ground-truth data for automatic performance
evaluation (of the generated Java application); as well as a step-by-step description of the
workflow. The accompanying material was published as an open-access archive 20.
13 https://www.drools.org
14 https://jess.sandia.gov
15 http://ruleml.org
16 https://github.com/tabbydoc/tabbyxl/releases/tag/v1.1.1
17 https://github.com/tabbydoc/tabbyxl/wiki
18 https://hub.docker.com/r/tabbydoc/tabbyxl
19 https://codeocean.com/capsule/5326436/tree/v1</p>
      <p>[Figure 5: An example of CRL-rules for transforming tables (on the left) to a canonical form (on the right).]</p>
      <p>The performance evaluation was conducted for three cases of preprocessing of the SAUS200
dataset, namely: (i) original tables; (ii) tables corrected automatically using the HeadRecog
algorithms proposed by us; (iii) tables manually corrected by three experts. For the original data,
the following indicators were achieved: the F1-score of data extraction (occurrences and labels)
reached 84%, and the F1-score of relationship extraction (such as “entry-label” and “label-label”)
reached 72.4%. The automatic structure correction improved the F1-score of data extraction to
85.5% and the F1-score of relationship extraction to 82.2%. With the expert correction, they
reached 96.3% and 93.7%, respectively. All materials and steps to reproduce this experiment are
available in the dataset [23].</p>
    </sec>
    <sec id="sec-5">
      <title>5. Table interpretation</title>
      <p>We developed TabbyLD21 [29], a tool for the semantic interpretation of tabular data by using a
cross-domain knowledge graph, namely, DBpedia22. It implements the following functions: (i)
Data cleansing transforms origin tabular data to a canonical form which is suitable to be used in
SPARQL23 queries to look up the candidate KG-entities (Fig. 5 a, b). (ii) Column type classification
assigns one of two types, either literal (containing numbers, dates, etc.) or categorical (containing
entity mentions), to each column. (iii) Entity linking matches cell values with KG-entities of
DBpedia. (iv) Table annotation enriches canonicalized tables with links to DBpedia’s resources
(Fig. 5 c). (v) Linked data generation represents annotated data as RDF24-triples.</p>
      <p>[Figure 5: An example of TabbyLD processing: (a) a canonicalized table with mentions such as “Norway” and “Denmark”, years, and amounts in million dollars; (b) cell values linked to DBpedia resources (dbr:Norway, dbr:Denmark, dbr:United_States_dollar); (c) annotation with the DBpedia ontology and XML Schema.]</p>
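      <p>A candidate lookup of this kind can be sketched as a SPARQL query built from a cell mention; the exact queries TabbyLD issues are not given in the paper, so the query shape below (matching by rdfs:label) is an illustrative assumption:</p>
      <preformat>
```python
def candidate_lookup_query(mention, limit=10):
    """Build a SPARQL query that looks up candidate entities whose
    rdfs:label contains the cell mention (illustrative, not TabbyLD's
    actual query)."""
    return (
        "SELECT DISTINCT ?entity WHERE { "
        "?entity rdfs:label ?label . "
        'FILTER(CONTAINS(LCASE(STR(?label)), LCASE("' + mention + '"))) '
        "} LIMIT " + str(limit)
    )

query = candidate_lookup_query("Norway")
```
      </preformat>
      <p>Such a query would be sent to a SPARQL endpoint (e.g. DBpedia's), and the returned entities become the disambiguation candidates for the mention.</p>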
      <p>One of the important issues studied in our work was named entity disambiguation: two
or more candidate KG-entities looked up from a knowledge graph (KG) match
the same mention (surface form) from a table. To resolve such conflicts, we use string
similarity based on edit distance, heading similarity, as well as the consistency and context of the
candidate KG-entities. In comparison with other approaches, we employ an additional metric estimated from
relationships between the candidate KG-entities and a recognized KG-class. The final decision
is made by an aggregation of the metrics based on a weighted linear combination (linear convolution).</p>
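      <p>The aggregation step can be sketched as follows: each candidate receives several similarity scores that are combined into a weighted linear sum. The weights and the set of metrics below are illustrative, not the ones used by TabbyLD:</p>
      <preformat>
```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def string_similarity(mention, label):
    # Normalize the edit distance into [0, 1]; 1 means identical strings.
    longest = max(len(mention), len(label), 1)
    return 1 - edit_distance(mention.lower(), label.lower()) / longest

def aggregate(scores, weights):
    """Final disambiguation score as a weighted linear combination of the
    per-metric scores (string similarity, heading similarity, context, ...)."""
    return sum(w * s for w, s in zip(weights, scores))

# Hypothetical heading-similarity (0.8) and context (0.5) scores.
best = aggregate([string_similarity("Norway", "Norway"), 0.8, 0.5],
                 [0.5, 0.3, 0.2])
```
      </preformat>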
      <p>The proposed solution was evaluated on the “T2Dv2 Gold Standard” dataset25 that includes
779 reference tables from the “Web Data Commons” corpus. We used a subset of 237 randomly
selected tables and 150 negative samples from T2Dv2. The precision for entity linking was
about 73%, and the recall reached about 50%. We also evaluated it with Troy20026, a dataset of
200 statistical tables. In this case, the accuracy for entity linking was 64%. Note that a statistical
table is a cross-tabulation containing at least 2 variables (2 subjects in terms of [30]). To the
best of our knowledge, the existing proposals are limited to one-subject tables (i.e. relations in
3NF) [31]. Our work [32] pioneered an attempt to semantically interpret such tables.
23 https://www.w3.org/TR/rdf-sparql-query
24 https://www.w3.org/TR/rdf-concepts
25 http://webdatacommons.org/webtables/goldstandardV2.html
26 http://tc11.cvc.uab.es/datasets/Troy_200_1</p>
      <p>The use of the proposed solution in domain-specific ontology engineering was demonstrated
by an example from industrial safety inspection (ISI) [33, 34, 35]. We developed an extension
for PKBD27 [36], a knowledge base management system. It allows for the construction of an
ontology in the OWL28 format from RDF-triples, i.e. fragments of interrelated entities extracted
from tables. A dataset29 of 161 tables extracted from ISI reports was used as a source. As a result,
a target ISI ontology including 25 KG-entities, 196 KG-properties, and 21 KG-relationships at
the terminological level was automatically generated.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The TabbyDOC project contributed to the theoretical basis and software tools for table extraction,
cleansing, analysis, and interpretation. The results rely on contemporary techniques of
deep learning, rule-based and generative programming, linked open data, and table
understanding. The obtained results opened new opportunities for the intellectualization of software
engineering in data extraction from unstructured and semi-structured sources. In particular,
we proposed to develop methods and tools for the synthesis of tabular data transformation
software based on table analysis and interpretation rules. We expect that this can expand the
theoretical knowledge on the integration of heterogeneous tabular data. The developed software
can be used in data science and business intelligence.</p>
      <p>Further work implies automatic understanding of web tables tagged with HTML markup. To
extract data from spreadsheet tables, we used end-user programming as the main approach
for the development of user-defined rules. This allowed us to support specific tricks of table
layout, formatting, and content. However, scaling such solutions is too challenging when
ambiguous tricks are applied within source tables, whereas it is important that a solution
intended for the Web be easily scalable. This is possible for predefined types of web tables;
classifying them makes it possible to select type-specific algorithms of analysis and
interpretation. Thus, the approach we have used heretofore is suitable for spreadsheet sources, but
not for the Web. Further work will aim to fill this gap by developing a scalable solution for
web tables.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was supported by the Russian Science Foundation (Grant No. 18-71-10001).
27 http://knowledge-core.ru
28 https://www.w3.org/OWL
29 https://data.mendeley.com/datasets/8zdymg4y96/1</p>
      <p>[16] T. G. Kieninger, Table structure recognition based on robust block segmentation, in:
Document Recognition V, 1998, pp. 22–32. doi:10.1117/12.304642.
[17] T. Kieninger, A. Dengel, The t-recs table recognition and analysis system, in: Document
Analysis Systems: Theory and Practice, volume 1655 LNCS, 1999, pp. 255–270. doi:10.
1007/3-540-48172-9_21.
[18] A. Mikhailov, A. Shigarov, E. Rozhkov, I. Cherepanov, On graph-based verification for pdf
table detection, in: 2020 Ivannikov ISPRAS Open Conference (ISPRAS), 2020, pp. 91–95.
doi:10.1109/ISPRAS51486.2020.00020.
[19] V. Paramonov, A. Shigarov, V. Vetrova, A. Mikhailov, Heuristic algorithm for
recovering a physical structure of spreadsheet header, in: Information Systems Architecture
and Technology: Proc. 40th Anniversary Int. Conf. on Information Systems
Architecture and Technology – ISAT 2019, volume 1050 AISC, 2020, pp. 140–149. doi:10.1007/
978-3-030-30440-9_14.
[20] V. Paramonov, A. Shigarov, V. Vetrova, Table header correction algorithm based on
heuristics for improving spreadsheet data extraction, in: Information and Software Technologies,
volume 1283 CCIS, 2020, pp. 147–158. doi:10.1007/978-3-030-59506-7_13.
[21] A. Shigarov, V. Khristyuk, A. Mikhailov, TabbyXL: software platform for rule-based
spreadsheet data extraction and transformation, SoftwareX 10 (2019) 100270. doi:10.
1016/j.softx.2019.100270.
[22] K. W. Broman, K. H. Woo, Data organization in spreadsheets, The American Statistician
72 (2018) 2–10. doi:10.1080/00031305.2017.1375989.
[23] A. Shigarov, V. Paramonov, V. Khristyuk, Spreadsheet data extraction from real-world
tables of saus (the 2010 statistical abstract of the united states): case study, 2021. doi:10.
6084/m9.figshare.14371055.v2.
[24] A. Shigarov, V. Khristyuk, A. Mikhailov, V. Paramonov, Tabbyxl: Rule-based spreadsheet
data extraction and transformation, in: Information and Software Technologies, volume
1078 CCIS, 2019, pp. 59–75. doi:10.1007/978-3-030-30275-7_6.
[25] A. Shigarov, V. Khristyuk, A. Mikhailov, V. Paramonov, Software development for
rule-based spreadsheet data extraction and transformation, in: Proc. 42nd Int. Conv. on
Information and Communication Technology, Electronics and Microelectronics, 2019, pp.
1132–1137. doi:10.23919/mipro.2019.8756829.
[26] A. Shigarov, Rule-based table analysis and interpretation, in: Information and Software
Technologies, volume 538 CCIS, 2015, pp. 175–186. doi:10.1007/978-3-319-24770-0_
16.
[27] A. Shigarov, V. Paramonov, P. Belykh, A. Bondarev, Rule-based canonicalization of arbitrary
tables in spreadsheets, in: Information and Software Technologies, volume 639 CCIS, 2016,
pp. 78–91. doi:10.1007/978-3-319-46254-7_7.
[28] A. Shigarov, A. Mikhailov, Rule-based spreadsheet data transformation from arbitrary to
relational tables, Inform. Syst. 71 (2017) 123–136. doi:10.1016/j.is.2017.08.004.
[29] N. Dorodnykh, A. Yurin, Tabbyld: a tool for semantic interpretation of spreadsheets
data, in: Modelling and Development of Intelligent Systems, volume 1341 CCIS, 2021, pp.
315–333. doi:10.1007/978-3-030-68527-0_20.
[30] K. Braunschweig, M. Thiele, W. Lehner, From web tables to concepts: a semantic
normalization approach, in: Conceptual Modeling, volume 9381 LNCS, 2015, pp. 247–260.
doi:10.1007/978-3-319-25264-3_18.
[31] S. Zhang, K. Balog, Web table extraction, retrieval, and augmentation: a survey, ACM
Trans. Intell. Syst. Technol. 11 (2020). doi:10.1145/3372117.
[32] N. Dorodnykh, A. Yurin, Towards a universal approach for semantic interpretation of
spreadsheets data, in: Proc. 24th S. on International Database Engineering and Applications,
2020. doi:10.1145/3410566.3410609.
[33] N. Dorodnykh, A. Yurin, Towards ontology engineering based on transformation of
conceptual models and spreadsheet data: a case study, in: Intelligent Systems
Applications in Software Engineering, volume 1046 AISC, 2019, pp. 233–247. doi:10.1007/
978-3-030-30329-7_22.
[34] A. Y. Yurin, N. O. Dorodnykh, Experimental evaluation of a spreadsheets transformation
in the context of domain model engineering, in: 2020 Ural Symposium on Biomedical
Engineering, Radioelectronics and Information Technology (USBEREIT), 2020, pp. 0388–
0391. doi:10.1109/USBEREIT48449.2020.9117674.
[35] N. O. Dorodnykh, A. Yurin, A. Shigarov, Conceptual model engineering for industrial safety
inspection based on spreadsheet data analysis, in: Modelling and Development of
Intelligent Systems, volume 1126 CCIS, 2020, pp. 51–65. doi:10.1007/978-3-030-39237-6_
4.
[36] A. Yurin, N. Dorodnykh, Personal knowledge base designer: software for expert systems
prototyping, SoftwareX 11 (2020) 100411. doi:10.1016/j.softx.2020.100411.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Martinez-Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lopez-Arevalo</surname>
          </string-name>
          ,
          <article-title>Information extraction meets the semantic web: a survey</article-title>
          ,
          <source>Semantic Web</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>255</fpage>
          -
          <lpage>335</lpage>
          . doi:10.3233/SW-180333.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          , E. Blomqvist,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          , C. d'Amato, G. de Melo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Rashid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schmelzeisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sequeda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Staab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <article-title>Knowledge graphs</article-title>
          ,
          <year>2021</year>
          . arXiv:2003.02320.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Milosevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gregson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Nenadic</surname>
          </string-name>
          ,
          <article-title>A framework for information extraction from tables in biomedical literature</article-title>
          ,
          <source>Int. J. on Document Analysis and Recognition</source>
          <volume>22</volume>
          (
          <year>2019</year>
          )
          <fpage>55</fpage>
          -
          <lpage>78</lpage>
          . doi:10.1007/s10032-019-00317-0.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Roldán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jiménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Corchuelo</surname>
          </string-name>
          ,
          <article-title>Tomate: A heuristic-based approach to extract data from HTML tables</article-title>
          ,
          <source>Information Sciences</source>
          (
          <year>2021</year>
          ). doi:10.1016/j.ins.2021.04.087.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dengel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <article-title>DeepDeSRT: deep learning for detection and structure recognition of tables in document images</article-title>
          ,
          <source>in: Proc. 14th IAPR Int. Conf. on Document Analysis and Recognition</source>
          , volume
          <volume>1</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>1162</fpage>
          -
          <lpage>1167</lpage>
          . doi:10.1109/ICDAR.2017.192.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.-Q.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Object detection with deep learning: a review</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>30</volume>
          (
          <year>2019</year>
          )
          <fpage>3212</fpage>
          -
          <lpage>3232</lpage>
          . doi:10.1109/TNNLS.2018.2876865.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          ,
          <source>in: Proc. 28th Int. Conf. Neural Information Processing Systems - Volume 1</source>
          , MIT Press, Cambridge, MA, USA,
          <year>2015</year>
          , pp.
          <fpage>91</fpage>
          -
          <lpage>99</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>I.</given-names>
            <surname>Cherepanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shigarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Paramonov</surname>
          </string-name>
          ,
          <article-title>On automated workflow for fine-tuning deep neural network models for table detection in document images</article-title>
          ,
          <source>in: 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1130</fpage>
          -
          <lpage>1133</lpage>
          . doi:10.23919/MIPRO48935.2020.9245241.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>ICDAR2017 competition on page object detection</article-title>
          ,
          <source>in: 2017 14th IAPR Int. Conf. Document Analysis and Recognition (ICDAR)</source>
          , volume
          <volume>01</volume>
          ,
          <year>2017</year>
          , pp.
          <fpage>1417</fpage>
          -
          <lpage>1422</lpage>
          . doi:10.1109/ICDAR.2017.231.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.-L.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <article-title>Complicated table structure recognition</article-title>
          ,
          <year>2019</year>
          . arXiv:1908.04729.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Déjean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Meunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kleber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <article-title>ICDAR 2019 competition on table detection and recognition (cTDaR)</article-title>
          ,
          <source>in: Proc. 15th Int. Conf. Document Analysis and Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1510</fpage>
          -
          <lpage>1515</lpage>
          . doi:10.1109/ICDAR.2019.00243.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shigarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Altaev</surname>
          </string-name>
          ,
          <article-title>Configurable table structure recognition in untagged PDF documents</article-title>
          ,
          <source>in: Proc. ACM Symposium on Document Engineering</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>122</lpage>
          . doi:10.1145/2960811.2967152.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Shigarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Altaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikhailov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Paramonov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Cherkashin</surname>
          </string-name>
          ,
          <article-title>TabbyPDF: web-based system for PDF table extraction</article-title>
          ,
          <source>in: Information and Software Technologies</source>
          , volume
          <volume>920</volume>
          CCIS,
          <year>2018</year>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>269</lpage>
          . doi:10.1007/978-3-319-99972-2_20.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Göbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Oro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Orsi</surname>
          </string-name>
          ,
          <article-title>A methodology for evaluating algorithms for table understanding in PDF documents</article-title>
          ,
          <source>in: Proc. ACM Symposium on Document Engineering</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>48</lpage>
          . doi:10.1145/2361354.2361365.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Göbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Oro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Orsi</surname>
          </string-name>
          ,
          <article-title>ICDAR 2013 table competition</article-title>
          ,
          <source>in: Proc. 12th Int. Conf. on Document Analysis and Recognition</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1449</fpage>
          -
          <lpage>1453</lpage>
          . doi:10.1109/ICDAR.2013.292.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>