<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards End-to-End Transformation of Arbitrary Tables from Untagged Portable Documents (PDF) to Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexey Shigarov</string-name>
          <email>shigarov@icc.ru</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Igor Cherepanov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evgeniy Cherkashin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Dorodnykh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasiliy Khristyuk</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrey Mikhailov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viacheslav Paramonov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egor Rozhkow</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandr Yurin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Mathematics, Economics and Informatics, Irkutsk State University</institution>
          ,
          <addr-line>Gagarin Blvd. 20, Irkutsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of Russian Academy of Sciences</institution>
          ,
          <addr-line>Lermontov St. 134, Irkutsk</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
      <abstract>
        <p>This paper addresses the problem of end-to-end table transformation from untagged portable documents (PDF) to linked data. It covers table extraction from documents, reconstruction of the logical table structure, conceptualization of the natural-language content of tables, and linking of the extracted data with external vocabularies. We consider several promising approaches: deep-learning-based table detection, heuristic-based table structure recognition, rule-based table analysis, and knowledge-based table interpretation. They can serve as a basis for developing a consistent solution to this problem. Our application experience confirms that such solutions are in demand for populating databases and generating ontologies with tabular data extracted from weakly and semi-structured documents.</p>
      </abstract>
      <kwd-group>
        <kwd>Data transformation</kwd>
        <kwd>Document analysis</kwd>
        <kwd>Table extraction</kwd>
        <kwd>Table analysis</kwd>
        <kwd>Table interpretation</kwd>
        <kwd>Linked data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>PDF (Portable Document Format) is a very popular way to represent non-editable documents. Many PDF documents are machine-readable but remain untagged (i.e. there are no tags identifying layout items such as paragraphs, columns, or tables). Such documents often contain tables with arbitrary layouts (e.g. cross-tabulations, invoices, and data sets). These tables can be a valuable source for various applications of text and data analysis. However, the difficulties that inevitably arise in extracting and integrating tabular data from untagged PDF documents often hinder their intensive use in practice.</p>
      <p>[Fig. 1. Recoverable stage labels: tabular untagged PDF documents; arbitrary spreadsheet tables; relational tables.]</p>
      <p>Untagged PDF documents provide no metadata describing table locations and cell positions, or the functional roles and the internal and external relationships of data items. In other words, they lack the explicit semantics that third-party software applications need to interpret the tabular data they contain. The generic approach to making tabular data interpretable is to extract them from the documents and represent them as linked data. This transformation requires recovering the metadata missing from the original documents and linking the extracted data with external vocabularies. The process can be considered as a chain of tasks, as shown in Fig. 1, where each step enriches the tabular data with explicit semantics.</p>
      <p>We present some promising approaches to developing a consistent solution for the end-to-end table transformation from weakly and semi-structured documents to linked data. They cover deep-learning-based table detection (identifying positions and content of tables), heuristic-based table structure recognition (identifying positions and content of cells), rule-based table analysis (recovering functional data items and their relationships), and knowledge-based table interpretation (linking recovered data items with external vocabularies). Our application experience confirms that such solutions are in demand for populating databases and generating ontologies with tabular data extracted from weakly and semi-structured documents.</p>
      <p>The first task (table extraction) aims at extracting tables from a document and representing them in a spreadsheet-like format. This task consists of two consecutive sub-tasks: table detection and table structure recognition. Table detection determines the position of a table in a document, i.e. either the geometric position of the bounding box of the table or the set of all text chunks placed inside this bounding box. Table structure recognition recovers the positions and content of the cells in a table: a detected table should be separated into cells addressed by rows and columns. The recovered cells of the extracted table can be represented as arbitrary tables in a spreadsheet format (e.g. Excel).</p>
      <p>Many table extraction methods are traditionally heuristics-based [62, 40, 43, 42]. Some of them combine heuristics and machine-learning approaches [4, 22, 41]. The current trend involves deep learning techniques: binary classification based on convolutional neural networks [29], fine-tuned object detection models [47, 26, 2, 58], and semantic segmentation [30, 33]. Some of the existing methods show high accuracy on the competition datasets (UNLV [48], Marmot<sup>4</sup>, ICDAR 2013 [27], and ICDAR 2017 [25]). However, the complexity of a document layout often limits the applicability of these methods in practice. For example, a long table placed on several consecutive pages cannot be extracted by these techniques as a whole object. We believe that research on table extraction is far from complete.</p>
      <p>We propose to combine deep learning and heuristic-based techniques to develop a comprehensive solution for this task. As contemporary works [47, 26, 2, 58] show, deep neural network models provide good performance for table detection. The generic approach to developing such models is based on fine-tuning pre-trained models for object detection in images. We adopted this approach to create and use our own model for table detection.</p>
      <p>To examine this approach, we utilized the well-known architecture "Faster R-CNN" [44] with the pre-trained convolutional neural network model (ResNet101) for feature extraction. We also used the distance transformation described in [26] for pre-processing images. The region proposal network model was trained on three existing datasets of documents (UNLV, Marmot, and ICDAR 2017) containing 2800 samples. Additionally, the training data were augmented by affine transformations, which allowed us to improve the accuracy of table detection by 5%. The performance evaluation of the developed model showed a high recall (0.979) and precision (0.865) on the ICDAR 2013 competition dataset [27]. It should be noted that verifying the obtained predictions can decrease false positives and improve precision.</p>
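      <p>To make the reported detection metrics concrete, the following minimal sketch scores predicted table bounding boxes against reference boxes. It assumes a simple greedy matching by intersection over union (IoU) with a threshold of 0.5; this is an illustrative choice and not necessarily the protocol of the ICDAR 2013 competition. The names iou, evaluate, and f1_score, as well as the box coordinates in the usage note, are hypothetical.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

# (x1, y1, x2, y2) in hypothetical page coordinates
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def evaluate(predicted: List[Box], truth: List[Box], threshold: float = 0.5):
    """Greedily match each predicted box to the best unmatched reference box."""
    matched = set()
    true_positives = 0
    for box in predicted:
        best_index, best_iou = -1, 0.0
        for index, reference in enumerate(truth):
            if index in matched:
                continue
            score = iou(box, reference)
            if score > best_iou:
                best_index, best_iou = index, score
        if best_index >= 0 and best_iou >= threshold:
            matched.add(best_index)
            true_positives += 1
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall, f1_score(precision, recall)
```
]]></preformat>
      <p>Under these assumptions, a false positive lowers precision without affecting recall, which is why verifying predictions can help. For the precision (0.865) and recall (0.979) reported above, f1_score gives approximately 0.918.</p>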
      <p>At the stage of cell structure recognition, we propose to use our heuristic-based bottom-up method [55, 54]. Its main idea consists in building text blocks from words placed inside the bounding box of a detected table. Each text block is the whole textual content of one cell in this table. Our method exploits a set of customizable ad-hoc heuristics for table detection and cell structure reconstruction based on features of text and ruling lines presented in PDF documents, including the following: horizontal and vertical distances, fonts, the order of appearance of text printing instructions in PDF files, and positions of the drawing cursor. Additionally, ad-hoc heuristics can be handcrafted and tuned for various domain-specific PDF documents. The table is subdivided into rows and columns using the analysis of connected components (text blocks). The proposed method reaches a high recall (0.923) and precision (0.950) on the ICDAR 2013 competition dataset [27].</p>
      <p><sup>4</sup> http://www.icst.pku.edu.cn/cpdp/sjzy/index.htm</p>
      <p>The second task (table analysis) is to transform a spreadsheet table from an arbitrary to a relational form. This task relies on recovering metadata on the logical structure and content of an arbitrary table, i.e. data items presented in cells, including their functional roles and internal relationships. The recovered semantics enables representing extracted data as relational tables in a spreadsheet-like format (e.g. CSV). In this form, each column of a relational table corresponds to a category. However, the columns typically remain anonymous, i.e. there is no description of the presented categories. The categories (concepts describing extracted data items) often need to be identified for the purposes of data analysis.</p>
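      <p>As an illustration of the target relational form, the sketch below unpivots a small cross-tabulation into records whose columns each hold one category, and writes them as CSV without a header row, mirroring the anonymous columns mentioned above. The table content and the function names unpivot and to_csv are hypothetical, and the sketch assumes the simplest layout (one row of column labels, one column of row labels), which is far narrower than the arbitrary layouts the paper targets.</p>
      <preformat><![CDATA[
```python
import csv
import io
from typing import List, Tuple


def unpivot(table: List[list]) -> List[Tuple[str, str, object]]:
    """Turn a cross-tab (first row and first column hold labels) into records."""
    column_labels = table[0][1:]
    records = []
    for row in table[1:]:
        row_label = row[0]
        for column_label, value in zip(column_labels, row[1:]):
            records.append((row_label, column_label, value))
    return records


def to_csv(records: List[Tuple[str, str, object]]) -> str:
    """Serialize records as CSV; no header row, so columns stay anonymous."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical arbitrary table: rows are items, columns are years.
crosstab = [
    ["", "2004", "2005"],
    ["A", 10, 12],
    ["B", 7, 9],
]
relational = unpivot(crosstab)
```
]]></preformat>
      <p>Each resulting record has three anonymous columns (row category, column category, entry); naming those categories is the subject of the later interpretation step.</p>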
      <p>There are several recent studies dealing with the issues of spreadsheet data extraction and transformation, including the following: layout features [35, 11, 16], "code smells" and formulas [31, 15, 5, 34], programming by examples [6, 59, 32], data models [1, 12, 13], linked open data [45, 71], domain-specific [64, 9, 60] and rule-based architectures [57, 67, 68]. Typically, the related solutions (e.g. [17, 19, 10]) rely on a predefined table structure. They support only a few widespread layout types of tables with typical functional cell regions. This limits their applicability for specific cases.</p>
      <p>Our approach consists in using rules for table analysis [53]. The rules map
explicit features (layout, style, and text of cells) of an arbitrary table to its
implicit semantic relationships (linked functional data items such as entries,
labels, and categories). The approach expects that one ruleset provides functional
and structural analysis for tables with the same features. Such rulesets can be
executed by a rule engine [53, 57] or be translated to executable programs in a
general-purpose language [50, 51].</p>
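      <p>The idea of rules that map explicit cell features to functional data items can be sketched as a list of condition and action pairs applied to the cells of a table. This is a hedged illustration in plain Python, not CRL and not the authors' rule engine; the Cell class, the RULES list, and the layout assumption (labels in the first row and first column, numeric entries elsewhere) are all hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Cell:
    row: int
    col: int
    text: str
    role: str = "unknown"          # functional role: "label" or "entry"
    labels: List[str] = field(default_factory=list)


Condition = Callable[[Cell, List[Cell]], bool]
Action = Callable[[Cell, List[Cell]], None]


def is_number(text: str) -> bool:
    try:
        float(text)
        return True
    except ValueError:
        return False


def set_role(role: str) -> Action:
    def action(cell: Cell, table: List[Cell]) -> None:
        cell.role = role
    return action


def attach_labels(cell: Cell, table: List[Cell]) -> None:
    """Relate an entry to the labels heading its column and its row."""
    cell.labels = [
        other.text for other in table
        if other.role == "label"
        and ((other.col == cell.col and other.row == 0)
             or (other.row == cell.row and other.col == 0))
    ]


# One ruleset for tables sharing the assumed layout features.
RULES: List[Tuple[Condition, Action]] = [
    (lambda c, t: c.row == 0 and c.text != "", set_role("label")),
    (lambda c, t: c.col == 0 and c.row > 0, set_role("label")),
    (lambda c, t: c.row > 0 and c.col > 0 and is_number(c.text), set_role("entry")),
    (lambda c, t: c.role == "entry", attach_labels),
]


def analyze(grid: List[List[str]]) -> List[Cell]:
    """Apply each rule, in order, to every cell whose condition holds."""
    table = [Cell(r, c, text) for r, row in enumerate(grid) for c, text in enumerate(row)]
    for condition, action in RULES:
        for cell in table:
            if condition(cell, table):
                action(cell, table)
    return table
```
]]></preformat>
      <p>Calling analyze on a grid with the header row "", "2004", "2005" and the data row "A", "10", "12" marks "10" and "12" as entries and relates each of them to its column label and row label. Supporting a different layout in this sketch means replacing the ruleset rather than the loop that executes it.</p>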
      <p>To implement our approach, we developed our own domain-specific language of table analysis and interpretation rules, CRL [52, 56, 57]. This language defines the queries (conditions) and operations (actions) that are necessary to develop programs for spreadsheet data transformation from an arbitrary to a relational form. CRL rules, expressed as productions, map the physical structure of cells to the logical structure of data items. In comparison with general-purpose rule languages (such as Drools<sup>5</sup>, Jess<sup>6</sup>, RuleML<sup>7</sup>), our language enables expressing rulesets without any instructions for managing the working memory (such as updates of modified facts, or blocks on rule re-activation). This simplifies the syntax of the right-hand side of CRL rules. CRL allows end-users to focus more on the logic of table analysis and interpretation than on the logic of rule management and execution.</p>
      <p>The interpreter of CRL rules translates CRL rulesets (declarative programs) to Java source code (imperative programs) [50, 51]. The generated source code is ready for compilation and building of executable programs for domain-specific spreadsheet data extraction and transformation. The novelty of the software platform consists in providing two rule-based ways to implement workflows of spreadsheet data extraction and transformation. In the first case, a ruleset for table analysis and interpretation is expressed in a general-purpose rule language and executed by a JSR-94<sup>8</sup> compatible rule engine. In the second case, our interpreter translates a ruleset expressed in CRL to Java source code that is compiled and executed on the Java platform.</p>
      <p><sup>5</sup> https://www.drools.org</p>
      <p><sup>6</sup> https://www.jessrules.com</p>
      <p><sup>7</sup> http://ruleml.org</p>
      <p>The existing solutions with similar goals (e.g. [17, 19, 10]) typically use predefined table models embedded into their internal algorithms. Unlike them, we define a general-purpose table model that does not restrict layout types. Instead of the common approach in which functions (roles) are associated with cells, in our model functions are determined for data items (entries and labels) originating from cells. The model supports a table layout where one cell contains two or more data items. This allows expressing user-defined layout, style, and text features of arbitrary tables in external rules to support both widespread and specific table types.</p>
      <p>The third task (table interpretation) serves to link the extracted tabular data to external vocabularies. Each column and each data item of a relational table should be associated with a concept (class, object, or property) of a general-purpose or domain-specific ontology (Fig. 4). The linked open data (LOD<sup>9</sup>) cloud, including global taxonomies (e.g. DBpedia<sup>10</sup>, Wikidata<sup>11</sup>, or YAGO<sup>12</sup>), can be utilized as external vocabularies for these purposes. The tabular data enriched by links to external vocabularies are represented in RDF/OWL (Resource Description Framework<sup>13</sup> / Web Ontology Language<sup>14</sup>) formats. The integration of the extracted data with the LOD cloud simplifies their further analysis.</p>
      <p>The methods intended for the issues of the table interpretation are mainly knowledge-based. They try to bind text in tables with some external concepts, using the following techniques: extraction ontologies [20], data frames [61], knowledge (classes and relations) automatically collected from the Web [63], natural language processing, including the named entity linking [7, 66, 71] and the word embedding [18], as well as various general-purpose ontologies of Linked Open Data [36, 39, 49, 14, 37, 46, 45] or proprietary global taxonomy, ProBase [65]. There are also several studies that propose to use contextual information that surrounds tables [8, 28, 70, 69]. The issues of converting data presented in spreadsheets or web tables to RDF/OWL formats are considered in the papers [3, 38, 23, 24, 21].</p>
      <p>[Fig. 4. Recoverable labels: Number (Q11563); Year (Q577); instance of (P31); 2004 (Q2014); data units "million dollar" and "percentage of gni" for the years 2004 and 2005.]</p>
      <p><sup>8</sup> https://www.jcp.org/ja/jsr</p>
      <p><sup>9</sup> https://lod-cloud.net</p>
      <p><sup>10</sup> https://wiki.dbpedia.org</p>
      <p><sup>11</sup> https://www.wikidata.org</p>
      <p><sup>12</sup> https://www.mpi-inf.mpg.de/departments/databases-and-informationsystems/research/yago-naga/yago</p>
      <p><sup>13</sup> https://www.w3.org/RDF</p>
      <p><sup>14</sup> https://www.w3.org/OWL</p>
      <p>We suggest an approach to the semantic table interpretation based on
combining natural language processing techniques and external vocabularies. First
of all, the extracted data items should be separated into two types: numeric and
non-numeric via named entity recognition. Then non-numeric data items are
linked with concepts (classes, objects, and properties) of an external vocabulary
by using the semantic similarity.</p>
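      <p>The two steps described above (separating numeric from non-numeric data items, then linking the non-numeric ones to vocabulary concepts by similarity) can be sketched as follows. The sketch deliberately swaps in simpler techniques: a regular expression stands in for named entity recognition, and character-level string similarity from the Python standard library stands in for semantic similarity. The toy vocabulary reuses the identifiers visible in Fig. 4, written as Wikidata entity URIs on the assumption that they denote Wikidata items; the threshold value is arbitrary.</p>
      <preformat><![CDATA[
```python
import re
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Stand-in for named entity recognition: integers and simple decimals.
NUMERIC = re.compile(r"^[+-]?\d+([.,]\d+)?$")

# Toy external vocabulary (label -> concept URI); an assumption for illustration.
VOCABULARY: Dict[str, str] = {
    "year": "http://www.wikidata.org/entity/Q577",
    "number": "http://www.wikidata.org/entity/Q11563",
}


def split_items(items: List[str]) -> Tuple[List[str], List[str]]:
    """Separate data items into numeric and non-numeric ones."""
    numeric, textual = [], []
    for item in items:
        (numeric if NUMERIC.match(item.strip()) else textual).append(item)
    return numeric, textual


def link(item: str, threshold: float = 0.8) -> Optional[str]:
    """Return the URI of the most similar vocabulary label, if similar enough."""
    best_uri, best_score = None, 0.0
    for label, uri in VOCABULARY.items():
        score = SequenceMatcher(None, item.strip().lower(), label).ratio()
        if score > best_score:
            best_uri, best_score = uri, score
    return best_uri if best_score >= threshold else None
```
]]></preformat>
      <p>A data item that matches no label closely enough is left unlinked (the function returns None) rather than being forced onto the nearest concept.</p>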
      <p>To examine this approach, we used DBpedia, but it can be extended with other vocabularies in the future. Our implementation of the approach supports the following functionality: (i) cleansing and formatting tabular data in accordance with DBpedia naming conventions; (ii) creating queries to DBpedia in the SPARQL language; and (iii) linking extracted data items of a table with DBpedia classes, objects, and properties.</p>
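      <p>A minimal sketch of steps (i) and (ii) is given below: it normalizes a cell text and builds a SPARQL query string that looks up concepts by their label. The cleansing rules (collapsing whitespace, capitalizing the first letter, underscores in resource names) are our assumption about the relevant conventions, and the query shape and the names clean_label, to_resource_name, and build_query are hypothetical; the paper does not specify the queries its implementation generates. The code only constructs the query text and sends nothing over the network.</p>
      <preformat><![CDATA[
```python
import re


def clean_label(text: str) -> str:
    """Collapse whitespace and capitalize the first letter."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed[:1].upper() + collapsed[1:]


def to_resource_name(label: str) -> str:
    """Form a resource-style name with underscores instead of spaces."""
    return clean_label(label).replace(" ", "_")


def build_query(label: str, language: str = "en", limit: int = 10) -> str:
    """Build a SPARQL query that finds concepts carrying the given label."""
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = clean_label(label).replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?concept WHERE {\n"
        f'  ?concept rdfs:label "{escaped}"@{language} .\n'
        "}\n"
        f"LIMIT {limit}"
    )
```
]]></preformat>
      <p>The resulting string could then be submitted to a SPARQL endpoint such as the public DBpedia one; exact label matching is the simplest possible lookup and would normally be combined with a similarity measure as described earlier.</p>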
      <p>We developed a prototype of the software tool for generating linked data in
RDF/OWL formats from extracted tabular data. It is designed to be a part of
an end-to-end process of the table understanding implemented by our software
platform. In this environment, it can be applied for the semantic interpretation
of tables with an arbitrary layout.
</p>
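      <p>To illustrate what linked data generated from a table entry may look like, the sketch below serializes a few triples in the N-Triples syntax, one of the standard RDF serializations. It is not the authors' tool. The first triple restates the statement visible in Fig. 4 (2004, Q2014, is an instance of Year, Q577) using Wikidata-style URIs, which is our assumption; the example.org URIs and the value 12.5 are hypothetical placeholders.</p>
      <preformat><![CDATA[
```python
from typing import Iterable, Optional, Tuple

XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def iri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value: str, datatype: Optional[str] = None) -> str:
    """Quote a literal, escaping backslashes and quotes; add a datatype if given."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    quoted = '"' + escaped + '"'
    if datatype:
        return quoted + "^^" + iri(datatype)
    return quoted


def to_ntriples(triples: Iterable[Tuple[str, str, str]]) -> str:
    """Serialize (subject, predicate, object) terms, one triple per line."""
    return "".join(f"{s} {p} {o} .\n" for s, p, o in triples)


entry = iri("http://example.org/table/entry/1")  # hypothetical entry URI
triples = [
    # From Fig. 4: 2004 (Q2014) instance of (P31) Year (Q577).
    (iri("http://www.wikidata.org/entity/Q2014"),
     iri("http://www.wikidata.org/prop/direct/P31"),
     iri("http://www.wikidata.org/entity/Q577")),
    # Hypothetical entry linked to that year and carrying a numeric value.
    (entry, iri("http://example.org/vocab/year"),
     iri("http://www.wikidata.org/entity/Q2014")),
    (entry, iri("http://example.org/vocab/value"), literal("12.5", XSD_DECIMAL)),
]
document = to_ntriples(triples)
```
]]></preformat>
      <p>Numeric entries are kept as typed literals, while labels resolved during interpretation become links to vocabulary concepts.</p>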
    </sec>
    <sec id="sec-2">
      <title>Conclusions</title>
      <p>The presented approaches can form a basis for designing the methodology and software for creating systems that extract data from arbitrary tables contained in weakly structured (e.g. PDF) and semi-structured documents (e.g. spreadsheets). The approaches correspond to state-of-the-art studies in the area of information extraction. They rely on modern techniques of deep learning, rule-based and generative programming, linked open data, and table understanding.</p>
      <p>Our approach to table extraction can be quickly adapted to various domain-specific PDF documents (such as financial statements, business credit assessments, material safety data sheets, etc.) with rich tabular content. Additional ad-hoc heuristics can be handcrafted and tuned for the target domain. This does not require preparing training datasets, which can be a costly process.</p>
      <p>The proposed approach to table analysis involves rule-based and generative programming. It includes a novel formal language for expressing table analysis and interpretation rules. The main advantage of our approach to the semantic table interpretation is that it can be applied to tables with an arbitrary layout. In comparison to competing solutions, we support not only widespread layout types of arbitrary tables but also specific ones.</p>
      <p>We expect that the principles explained here can be used for designing software for end-to-end tabular transformation in scientific and industrial data-intensive applications. In particular, we used them to develop software for filling a data warehouse with socio-economic data on Mongolian provinces, for populating the database of a web-based statistical atlas from tabular data of government statistical reports, and for generating ontologies from data of arbitrary tables used in industrial safety inspection.</p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <p>This work was nancially supported by the Russian Scienti c Foundation (Grant
No. 18-71-10001).
7. Bhagavatula, C.S., Noraset, T., Downey, D.: Tabel: Entity linking in web tables.</p>
      <p>In: Arenas, M., Corcho, O., Simperl, E., Strohmaier, M., d'Aquin, M., Srinivas, K.,
Groth, P., Dumontier, M., He in, J., Thirunarayan, K., Thirunarayan, K., Staab,
S. (eds.) The Semantic Web - ISWC 2015. pp. 425{441 (2015)
8. Braunschweig, K.: Recovering the Semantics of Tabular Web Data. Ph.D. thesis,</p>
      <p>
        Technischen Universitat Dresden, Dresden, Germany (2015)
9. Cao, T.D., Manolescu, I., Tannier, X.: Extracting linked data from statistic
spreadsheets. In: Proc. Int. Workshop Semantic Big Data. pp. 5:1{5:5 (2017).
https://doi.org/10.1145/3066911.3066914
10. Chen, Z.: Information Extraction on Para-Relational Data. Ph.D. thesis, University
of Michigan, US (2016)
11. Chen, Z., Dadiomov, S., Wesley, R., Xiao, G., Cory, D., Cafarella, M., Mackinlay,
J.: Spreadsheet property detection with rule-assisted active learning. In: Proc.
ACM on Conf. on Information and Knowledge Management. pp. 999{1008 (2017).
https://doi.org/10.1145/3132847.3132882
12. Cunha, J., Fernandes, J.P., Mendes, J., Saraiva, J.: Embedding, evolution, and
validation of model-driven spreadsheets. IEEE Transactions on Software Engineering
41(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 241{263 (2015). https://doi.org/10.1109/TSE.2014.2361141
13. Cunha, J., Erwig, M., Mendes, J., Saraiva, J.: Model inference for spreadsheets.
Autom Softw Eng 23(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 361{392 (2016). https://doi.org/10.1007/s10515-014-0167-x
14. Deng, D., Jiang, Y., Li, G., Li, J., Yu, C.: Scalable column concept determination
for web tables using large knowledge bases. Proc. VLDB Endow. 6(13), 1606{1617
(2013). https://doi.org/10.14778/2536258.2536271
15. Dou, W., Xu, C., Cheung, S.C., Wei, J.: CACheck: Detecting and repairing cell
arrays in spreadsheets. IEEE Transactions on Software Engineering 43(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 226{251
(2017). https://doi.org/10.1109/TSE.2016.2584059
16. Dou, W., Han, S., Xu, L., Zhang, D., Wei, J.: Expandable group identi cation
in spreadsheets. In: Proc. 33rd ACM/IEEE Int. Conf. on Automated Software
Engineering. pp. 498{508 (2018). https://doi.org/10.1145/3238147.3238222
17. Eberius, J., Werner, C., Thiele, M., Braunschweig, K., Dannecker, L., Lehner, W.:
DeExcelerator: a framework for extracting relational data from partially
structured documents. In: Proc. 22nd ACM Int. Conf. on Information &amp; Knowledge
Management. pp. 2477{2480 (2013). https://doi.org/10.1145/2505515.2508210
18. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.:
Matching web tables with knowledge base entities: From entity lookups to entity
embeddings. In: d'Amato, C., Fernandez, M., Tamma, V., Lecue, F., Cudre-Mauroux,
P., Sequeda, J., Lange, C., He in, J. (eds.) The Semantic Web { ISWC 2017. pp.
260{277 (2017)
19. Embley, D.W., Krishnamoorthy, M.S., Nagy, G., Seth, S.: Converting
heterogeneous statistical tables on the web to searchable databases. Int. J. Document
Analysis and Recognition 19(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), 119{138 (2016).
https://doi.org/10.1007/s10032016-0259-1
20. Embley, D., Tao, C., Liddle, S.: Automating the extraction of data from
HTML tables with unknown structure. Data Knowl. Eng. 54(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ), 3{28 (2005).
https://doi.org/10.1016/j.datak.2004.10.004
21. Ermilov, I., Ngomo, A.C.N.: TAIPAN: Automatic Property Mapping for Tabular
      </p>
      <p>Data, pp. 163{179 (2016). https://doi.org/10.1007/978-3-319-49004-5 11
22. Fan, M., Kim, D.S.: Table region detection on large-scale PDF les without labeled
data. CoRR abs/1506.08891 (2015), http://arxiv.org/abs/1506.08891
23. Fiorelli, M., Lorenzetti, T., Pazienza, M.T., Stellato, A., Turbati, A.: Sheet2RDF: a
exible and dynamic spreadsheet import&amp;lifting framework for RDF. In: Proc. 28th
Int. Conf. Industrial, Engineering and Other Applications of Applied Intelligent
Systems. pp. 131{140 (2015). https://doi.org/10.1007/978-3-319-19066-2 13
24. Galkin, M., Mouromtsev, D., Auer, S.: Identifying web tables: Supporting a
neglected type of content on the web. In: Proc. 6th Int. Conf. Knowledge Engineering
and Semantic Web. pp. 48{62 (2015). https://doi.org/10.1007/978-3-319-24543-0 4
25. Gao, L., Yi, X., Jiang, Z., Hao, L., Tang, Z.: Icdar2017 competition on page object
detection. In: 14th IAPR Int. Conf. on Document Analysis and Recognition. vol. 01,
pp. 1417{1422 (2017). https://doi.org/10.1109/ICDAR.2017.231
26. Gilani, A., Qasim, S.R., Malik, I., Shafait, F.: Table detection using deep learning.</p>
      <p>
        In: 14th IAPR Int. Conf. on Document Analysis and Recognition. vol. 01, pp.
771{776 (2017). https://doi.org/10.1109/ICDAR.2017.131
27. Gobel, M., Hassan, T., Oro, E., Orsi, G.: Icdar 2013 table competition. In:
12th Int. Conf. on Document Analysis and Recognition. pp. 1449{1453 (2013).
https://doi.org/10.1109/ICDAR.2013.292
28. Govindaraju, V., Zhang, C., Re, C.: Understanding tables in context using standard
NLP toolkits. In: Proc. 51st Annual Meeting of the Association for Computational
Linguistics. vol. 2: Short Papers, pp. 658{664 (2013)
29. Hao, L., Gao, L., Yi, X., Tang, Z.: A table detection method for pdf documents
based on convolutional neural networks. In: 12th IAPR Workshop on Document
Analysis Systems. pp. 287{292 (2016). https://doi.org/10.1109/DAS.2016.23
30. He, D., Cohen, S., Price, B., Kifer, D., Giles, C.L.: Multi-scale multi-task
fcn for semantic page segmentation and table detection. In: 14th IAPR Int.
Conf. on Document Analysis and Recognition. vol. 01, pp. 254{261 (2017).
https://doi.org/10.1109/ICDAR.2017.50
31. Hermans, F., Pinzger, M., Deursen, A.: Detecting and refactoring code
smells in spreadsheet formulas. Empirical Softw. Engg. 20(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), 549{575 (2015).
https://doi.org/10.1007/s10664-013-9296-2
32. Jin, Z., Anderson, M.R., Cafarella, M., Jagadish, H.V.: Foofah: Transforming data
by example. In: Proc. ACM Int. Conf. Management of Data. pp. 683{698 (2017).
https://doi.org/10.1145/3035918.3064034
33. Kavasidis, I., Palazzo, S., Spampinato, C., Pino, C., Giordano, D.,
Giuffrida, D., Messina, P.: A saliency-based convolutional neural network for
table and chart detection in digitized documents. CoRR abs/1804.06236 (2018),
http://arxiv.org/abs/1804.06236
34. Koch, P., Hofer, B., Wotawa, F.: On the re nement of spreadsheet smells by means
of structure information. Journal of Systems and Software 147, 64{85 (2019).
https://doi.org/10.1016/j.jss.2018.09.092
35. Koci, E., Thiele, M., Romero, O., Lehner, W.: Table identi cation and
reconstruction in spreadsheets. In: Proc. 29th Int. Conf. Advanced Information Systems
Engineering. pp. 527{541 (2017). https://doi.org/10.1007/978-3-319-59536-8 33
36. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables
using entities, types and relationships. Proc. VLDB Endow. 3(
        <xref ref-type="bibr" rid="ref1 ref2">1-2</xref>
        ), 1338{1347
(2010). https://doi.org/10.14778/1920841.1921005
37. Mun~oz, E., Hogan, A., Mileo, A.: Using linked data to mine RDF from wikipedia's
tables. In: Proc. 7th ACM Int. Conf. Web Search and Data Mining. pp. 533{542
(2014). https://doi.org/10.1145/2556195.2556266
38. Mulwad, V., Finin, T., Joshi, A.: A Domain Independent Framework for Extracting
Linked Semantic Data from Tables, pp. 16{33 (2012).
https://doi.org/10.1007/9783-642-34213-4 2
39. Mulwad, V., Finin, T., Syed, Z., Joshi, A.: Using linked data to interpret tables.
      </p>
      <p>
        In: Proc. 1st Int. Conf. Consuming Linked Data. vol. 665, pp. 109{120 (2010),
http://dl.acm.org/citation.cfm?id=2878947.2878957
40. Perez-Arriaga, M., Estrada, T., Abad-Mota, S.: Tao: System
for table detection and extraction from pdf documents (2016),
https://www.aaai.org/ocs/index.php/FLAIRS/FLAIRS16/paper/view/12916
41. Rashid, S.F., Akmal, A., Adnan, M., Aslam, A.A., Dengel, A.: Table
recognition in heterogeneous documents using machine learning. In: 14th IAPR Int.
Conf. on Document Analysis and Recognition. vol. 01, pp. 777{782 (2017).
https://doi.org/10.1109/ICDAR.2017.132
42. Rastan, R., Paik, H.Y., Shepherd, J.: Texus: A uni ed framework for extracting and
understanding tables in pdf documents. Information Processing &amp; Management
56(
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), 895{918 (2019). https://doi.org/https://doi.org/10.1016/j.ipm.2019.01.008
43. Rastan, R., Paik, H.Y., Shepherd, J., Ryu, S.H., Beheshti, A.: Texus: Table
extraction system for pdf documents. In: Wang, J., Cong, G., Chen, J., Qi, J. (eds.)
Databases Theory and Applications. pp. 345{349 (2018)
44. Ren, S., He, K., Girshick, R.B., Sun, J.: Faster R-CNN: towards real-time
object detection with region proposal networks. CoRR abs/1506.01497 (2015),
http://arxiv.org/abs/1506.01497
45. Ritze, D., Bizer, C.: Matching web tables to dbpedia - A feature utility study.
      </p>
      <p>
        In: Proc. 20th Int. Conf. on Extending Database Technology. pp. 210{221 (2017).
https://doi.org/10.5441/002/edbt.2017.20
46. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In:
5th Int. Conf. on Web Intelligence, Mining and Semantics. pp. 10:1{10:6 (2015).
https://doi.org/10.1145/2797115.2797118
47. Schreiber, S., Agne, S., Wolf, I., Dengel, A., Ahmed, S.: Deepdesrt: Deep learning
for detection and structure recognition of tables in document images. In: 14th
IAPR Int. Conf. on Document Analysis and Recognition. vol. 01, pp. 1162{1167
(2017). https://doi.org/10.1109/ICDAR.2017.192
48. Shahab, A., Shafait, F., Kieninger, T., Dengel, A.: An open approach
towards the benchmarking of table structure recognition systems. In: 9th
IAPR Int. Workshop on Document Analysis Systems. pp. 113{120 (2010).
https://doi.org/10.1145/1815330.1815345
49. Shen, W., Wang, J., Luo, P., Wang, M.: LIEGE: Link entities in web lists with
knowledge base. In: 18th ACM SIGKDD Int. Conf. on Knowledge Discovery and
Data Mining. pp. 1424{1432 (2012). https://doi.org/10.1145/2339530.2339753
50. Shigarov, A., Khristyuk, V., Mikhailov, A.: Tabbyxl: Software platform for
rulebased spreadsheet data extraction and transformation. SoftwareX 10, 100270
(2019). https://doi.org/https://doi.org/10.1016/j.softx.2019.100270
51. Shigarov, A., Khristyuk, V., Mikhailov, A., Paramonov, V.: Software
development for rule-based spreadsheet data extraction and
transformation. In: 42nd International Convention on Information and
Communication Technology, Electronics and Microelectronics. pp. 1132{1137 (2019).
https://doi.org/10.23919/MIPRO.2019.8756829
52. Shigarov, A.: Rule-based table analysis and interpretation. In: Proc. 21st
Int. Conf. Information and Software Technologies. pp. 175{186 (2015).
https://doi.org/10.1007/978-3-319-24770-0 16
53. Shigarov, A.: Table understanding using a rule engine. Expert Systems with
Applications 42(
        <xref ref-type="bibr" rid="ref2">2</xref>
        ), 929{937 (2015). https://doi.org/10.1016/j.eswa.2014.08.045
54. Shigarov, A., Altaev, A., Mikhailov, A., Paramonov, V., Cherkashin, E.: Tabbypdf:
Web-based system for pdf table extraction. In: 24th Int. Conf. on Information and
Software Technologies. pp. 257{269 (2018)
55. Shigarov, A., Mikhailov, A., Altaev, A.: Con gurable table structure recognition in
untagged PDF documents. In: Proc. ACM Symposium on Document Engineering.
pp. 119{122 (2016). https://doi.org/10.1145/2960811.2967152
56. Shigarov, A., Paramonov, V., Belykh, P., Bondarev, A.: Rule-based
canonicalization of arbitrary tables in spreadsheets. In: Proc. 22nd Int. Conf. Information
and Software Technologies. pp. 78{91 (2016).
https://doi.org/10.1007/978-3-31946254-7 7
57. Shigarov, A.O., Mikhailov, A.A.: Rule-based spreadsheet data transformation
from arbitrary to relational tables. Information Systems 71, 123{136 (2017).
https://doi.org/10.1016/j.is.2017.08.004
58. Siddiqui, S.A., Malik, M.I., Agne, S., Dengel, A., Ahmed, S.: DeCNT: Deep
deformable CNN for table detection. IEEE Access 6, 74151–74161 (2018).
https://doi.org/10.1109/ACCESS.2018.2880211
59. Singh, R., Gulwani, S.: Transforming spreadsheet data types using examples.
SIGPLAN Not. 51(1), 343–356 (2016). https://doi.org/10.1145/2914770.2837668
      </p>
      <p>
60. Swidan, A., Hermans, F.: Semi-automatic extraction of cross-table data from a
set of spreadsheets. In: Barbosa, S., Markopoulos, P., Paternò, F., Stumpf, S.,
Valtolina, S. (eds.) End-User Development. pp. 84–99 (2017)
61. Tijerino, Y., Embley, D., Lonsdale, D., Ding, Y., Nagy, G.: Towards ontology
generation from tables. World Wide Web: Internet and Web Information Systems 8(3),
261–285 (2005). https://doi.org/10.1007/s11280-005-0360-8
62. Tran, D.N., Tran, T.A., Oh, A., Kim, S.H., Na, I.S.: Table
detection from document image using vertical arrangement of text
blocks. International Journal of Contents 11(4), 77–85 (2015).
https://doi.org/10.5392/IJoC.2015.11.4.077
63. Venetis, P., Halevy, A., Madhavan, J., Pasca, M., Shen, W., Wu, F., Miao, G., Wu,
C.: Recovering semantics of tables on the web. Proc. VLDB Endow. 4(9), 528–538
(2011). https://doi.org/10.14778/2002938.2002939
64. de Vos, M., Wielemaker, J., Rijgersberg, H., Schreiber, G., Wielinga, B., Top,
J.: Combining information on structure and content to automatically annotate
natural science spreadsheets. Int. J. Human-Computer Studies 103, 63–76 (2017).
https://doi.org/10.1016/j.ijhcs.2017.02.006
65. Wang, J., Wang, H., Wang, Z., Zhu, K.Q.: Understanding tables on the
web. In: Proc. 31st Int. Conf. Conceptual Modeling. pp. 141–155 (2012).
https://doi.org/10.1007/978-3-642-34002-4_11
66. Wu, T., Yan, S., Piao, Z., Xu, L., Wang, R., Qi, G.: Entity linking in web tables
with multiple linked knowledge bases. In: Li, Y.F., Hu, W., Dong, J.S., Antoniou,
G., Wang, Z., Sun, J., Liu, Y. (eds.) Semantic Technology. pp. 239–253 (2016)
67. Yang, S., Guo, J., Wei, R.: Semantic interoperability with heterogeneous
information systems on the internet through automatic tabular document exchange.
Information Systems 69, 195–217 (2017). https://doi.org/10.1016/j.is.2016.10.010
      </p>
      <p>68. Yang, S., Wei, R., Shigarov, A.: Semantic interoperability for electronic
business through a novel cross-context semantic document exchange
approach. In: Proc. ACM Symposium on Doc. Eng. pp. 28:1–28:10 (2018).
https://doi.org/10.1145/3209280.3209523
69. Yoshida, M., Matsumoto, K., Kita, K.: Table topic models for hidden unit
estimation. In: Proc. 12th Asia Information Retrieval Societies Conference. pp. 302–307
(2016). https://doi.org/10.1007/978-3-319-48051-0_23
70. Zhang, Z.: Towards Efficient and Effective Semantic Table Interpretation, pp. 487–502
(2014). https://doi.org/10.1007/978-3-319-11964-9_31</p>
      <p>
        71. Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+.
        Semantic Web 8(6), 921–957 (2017). https://doi.org/10.3233/SW-160242
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Amalfitano, D.,
          <string-name>
            <surname>Fasolino</surname>
            ,
            <given-names>A.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tramontana</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Simone</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Di Mare</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A reverse engineering process for inferring data models from spreadsheet-based information systems: An automotive industrial experience</article-title>
          . In:
          <source>Data Management Technologies and Applications</source>
          . pp.
          <fpage>136</fpage>
          –
          <lpage>153</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Arif</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shafait</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Table detection in document images using foreground and background features</article-title>
          . In:
          <source>Digital Image Computing: Techniques and Applications</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1109/DICTA.2018.8615795
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>van Assem</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rijgersberg</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wigham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Top</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Converting and annotating quantitative data tables</article-title>
          . In:
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mika</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>J.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horrocks</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glimm</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.)
          <source>The Semantic Web – ISWC 2010</source>
          . pp.
          <fpage>16</fpage>
          –
          <lpage>31</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bansal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harit</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roy</surname>
            ,
            <given-names>S.D.</given-names>
          </string-name>
          :
          <article-title>Table extraction from document images using fixed point model</article-title>
          . In:
          <source>Indian Conf. on Computer Vision Graphics and Image Processing</source>
          . pp.
          <fpage>67:1</fpage>
          –
          <lpage>67:8</lpage>
          . ICVGIP '14 (
          <year>2014</year>
          ). https://doi.org/10.1145/2683483.2683550
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Barowy</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berger</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zorn</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>ExceLint: Automatically finding spreadsheet formula errors</article-title>
          .
          <source>Proc. ACM Program. Lang.</source>
          <volume>2</volume>
          (
          <issue>OOPSLA</issue>
          ),
          <fpage>148:1</fpage>
          –
          <lpage>148:26</lpage>
          (
          <year>2018</year>
          ). https://doi.org/10.1145/3276518
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Barowy</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulwani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hart</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zorn</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>FlashRelate: Extracting relational data from semi-structured spreadsheets using examples</article-title>
          .
          <source>SIGPLAN Not.</source>
          <volume>50</volume>
          (
          <issue>6</issue>
          ),
          <fpage>218</fpage>
          –
          <lpage>228</lpage>
          (
          <year>2015</year>
          ). https://doi.org/10.1145/2813885.2737952
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>