Towards Lightweight Representation of the
        Table Semantics for the Cross-Context
                Information Exchange

 Alexey Shigarov¹, Vasiliy Khristyuk¹, Evgeniy Cherkashin¹, and Shuo Yang²

      ¹ Matrosov Institute for System Dynamics and Control Theory of SB RAS,
        134 Lermontov st., Irkutsk, Russia, 664033
        shigarov@icc.ru,
        WWW home page: http://td.icc.ru
      ² School of Computer Science and Cyber Engineering, Guangzhou University,
        230 Wai Huan Xi Road, Guangzhou, China, 510006
        yangshuo@gzhu.edu.cn


          Abstract. This paper addresses the representation of table semantics
          for cross-context information exchange. The tables we consider have
          an arbitrary cell structure represented in a machine-readable format.
          For example, such tables can be contained in electronic documents, such
          as spreadsheets or web pages, and typically they are not accompanied by
          semantics enabling their automatic interpretation. Despite the existing
          variety of formalisms for representing table semantics, most of them
          are fairly inefficient in terms of the user effort required for
          semantic annotation. We outline a new approach to the lightweight
          representation of table semantics. We stay on the interpretation level,
          which provides the inference of the semantics of the atomic data items
          of a table from a description of data groups expressed by the syntax of
          the table. We expect that a future implementation of our approach can
          reduce the complexity and volume of the table semantics required for
          cross-context information exchange, as well as the user effort spent on
          annotating tabular data.

          Keywords: Table semantics · Table understanding · Table interpretation ·
          Semantic interoperability · Information exchange · Spreadsheet.


1     Introduction


   Nowadays, the volume of electronic documents participating in information
exchange and transmission continues to increase significantly. The
interpretation of a document depends on its context (historical, national, domain,
organizational, etc.). For example, if a table in a financial report is titled
“FY2020”, then, in Russia, this might mean data for a fiscal year starting
on January 1, 2020; however, in the United States, we should infer a different
yearly period starting on October 1, 2019. This complicates document
processing and understanding within the information exchange.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

    The cross-context exchange of electronic documents implies that a document
is represented in one context and transmitted for processing in another
context [13, 16]. The ability of computer programs to exchange information with
unambiguous and shared meaning is referred to as semantic interoperability [2].
In particular, semantic interoperability should be provided when the
source context differs from the target one. The use of semantic markup makes
documents readable and interpretable not only by humans but also by computer
programs. Such markup plays a key role in enabling semantic interoperability
for information exchange.
    In recent years, many studies have been devoted to issues of semantic
interoperability (e. g. [6, 12, 14, 15]). However, the problem of the cross-context
exchange of tabular documents remains open in general. Arbitrary tables with an
explicit syntactic structure of cells are one of the main ways to present data in
electronic documents. For example, they can be contained in a spreadsheet, a
rich text document, or the hypertext of a web page. Typically, such tables are not
accompanied by the explicit semantics needed for their automatic interpretation.
The interpretation of such documents is complicated by the variety of forms
for representing the table syntax and semantics. The current challenges are the
extraction of the semantic components from documents, the context
neutralization, and the unambiguous interpretation of transmitted data, as well as
the representation of the semantic structure of tables.
    We outline a new 3-Level Table Object Model (3L-TOM) for the lightweight
description of tables on the interpretation level. It is assumed that the model
will enable the inference of atomic data items from categories, i. e. groups of data
items. We expect that our approach can reduce the complexity and volume of
the semantics required for the annotation and interpretation of arbitrary tables
in the cross-context exchange of electronic documents.


2   Problem Statement

The representation of the table semantics for the cross-context exchange of elec-
tronic documents is a complicated problem. Such representation should primarily
enable both the semantic annotation of arbitrary tables and the interpretation
of tabular data in a target context. Additionally, it has to reduce the complexity
and volume of the table semantics transmitted in the information exchange.
    We begin with a discussion of the concept of an arbitrary table.
Hinterberger [3] defines a table both as a data structure to organize the tuples of
a relation and as an arrangement of data in rows and columns. On the one hand, a
table can represent relational data in a grid of cells (e. g. contingency tables used
in statistics). On the other hand, a table can be used as a means of visual
communication to arrange data items, even when there are no relationships between
them (e. g. the grid layout used in web pages).
    We refer to an arbitrary table presented in a document as a way of visual
communication for arranging interrelated data items in a grid of cells. The data
items represented in an arbitrary table are divided into two functional types:
(i) entries and (ii) labels. The entries are values of data, while the labels are
considered as keys or attributes for addressing the values of data conceptually.
In comparison with a relational table, where its tuples contain only values but
its schema distinguishes attributes, an arbitrary table represents both values
and attributes in one syntax structure, the grid of cells. A relational table typi-
cally describes instances of one conceptual entity, while an arbitrary table often
includes labels of several conceptual entities.
    Arbitrary tables are mainly intended to be understood by humans.
Typically, they lack the explicit semantics needed for computer programs to interpret
them as intended by their authors or as required by an application. For example,
a table presented in a spreadsheet can be read and understood by humans as
relational information, but for computers such a table is just a grid of cells
hidden in another grid of cells on a sheet. The syntax of arbitrary tables allows
semantics to be expressed implicitly via various functional arrangements of data
items in cells, as well as via the formatting of both cells and text. In general,
the semantic annotation of an arbitrary table requires execution of the main
steps of table understanding.
    Now, consider two approaches to representation of table semantics that we
refer to as heavy and lightweight representation, respectively. The first approach
describes semantics heavily on the level of atomic data items. Alternatively, the
second one is a lightweight description of data groups, such as categories. The
volume of the heavy representation grows linearly with the number of
atomic data items, while the volume of the lightweight representation depends
only on the number of groups, such as categories. Even complex multidimensional
tables rarely contain more than a dozen categories, which can be
interpreted as separate groups. For example, suppose that we need to describe
the semantics of a table containing 10,000 data items, of which 1,000 are labels
of one category (L1 ), 1,000 are labels of another category (L2 ), and 8,000 are
entries of the third category (E). Moreover, each entry of E is addressed by two
labels: one of L1 , the other of L2 . In this case, a heavy representation should
describe 10,000 objects and 16,000 “entry-label” relationships between them.
On the other hand, such a table contains only 3 groups (L1 , L2 , and E). There-
fore, lightweight semantics is limited to describing only 3 objects and 2 relations
between them.
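
The arithmetic of this example can be checked with a short sketch. The function names below are hypothetical and serve only to restate the counting argument; they are not part of any of the models discussed.

```python
def heavy_volume(label_group_sizes, entry_count):
    """Objects and entry-label relationships described item by item."""
    objects = sum(label_group_sizes) + entry_count
    # Each entry is addressed by one label from every label group.
    relationships = entry_count * len(label_group_sizes)
    return objects, relationships


def lightweight_volume(label_group_count, entry_group_count=1):
    """Objects and relations described group by group."""
    objects = label_group_count + entry_group_count
    # One relation per (entry group, label group) pair.
    relations = label_group_count * entry_group_count
    return objects, relations


heavy = heavy_volume(label_group_sizes=[1000, 1000], entry_count=8000)
light = lightweight_volume(label_group_count=2)
print(heavy)  # (10000, 16000)
print(light)  # (3, 2)
```

The heavy count depends on the number of data items, whereas the lightweight count depends only on the number of groups.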
    The existing formalisms for representing tabular data, such as Wang’s model
[10, 11], Hurst’s model [4], the 2-Level Table Object Model (2L-TOM) [8, 9], and
the Relational Data Model (RDM) [1, 5], are fairly inefficient for the semantic
annotation of arbitrary tables in terms of user effort. The first three models
[4, 8–11] can describe the semantics of arbitrary tables only in a heavy
manner, which requires mapping an arbitrary table to its atomic data
items. The RDM [1, 5] allows a table to be specified via a conceptual schema.
The volume of such a representation depends on the number of table columns, so it
can be considered a lightweight description. However, this model strongly
restricts the syntactic structure by the definition of relational tables [5], where each
column corresponds to a labeled attribute and each row is a tuple of values.
Many tables presented in electronic documents have a more complex
syntactic structure. Therefore, the RDM does not apply to the semantic annotation
of arbitrary tables in the general case.
    We believe that it is possible to develop a model for the lightweight represen-
tation of the table semantics on the interpretation level, which can be utilized
for a substantial range of arbitrary tables. In comparison with the existing for-
malisms, the model we aim to build should enable the inference of atomic data
items from their categories.


3   Solution Outline

We outline a new 3-Level Table Object Model (3L-TOM) for representing the
structure of an arbitrary table on the following three levels: (i) syntactic, (ii)
semantic, and (iii) interpretation. The 3L-TOM extends the 2L-TOM
by adding a third level, a lightweight representation of categories for table
interpretation. The 2L-TOM was introduced in [8, 9] and implemented in
software [7] to provide the syntax and semantics of arbitrary tables.
    The first level of the 2L-TOM describes syntactic objects of a table, such
as a layout, formatting, and text of cells. This level should comply with the
capabilities and limitations of contemporary table formats such as Excel and
HTML. The second level defines data items of two functional types: (i) entries
and (ii) labels. All semantic objects are separated into two or more groups, at
least two of which correspond to the different functional types. Each entry
can be associated with one label of each group. Labels of the same group can be
associated with each other by parent-child relationships. Both entries and labels
are typically read and converted from the text of some cells. To simplify the
presentation of the model, we use the following assumption: the properties of
syntactic objects (layout, formatting, and text) are attributed to the semantic
objects produced from the corresponding cells. For example, if a label is
created by reading the text of a cell at the address A1:B2 of a source
spreadsheet, then we say that this label is located at the address A1:B2. Note
that the semantic object does not have syntactic properties directly, but they
can be inferred from the associated syntactic objects. The semantic level can be
made context-independent by neutralizing the context of values read from the
syntactic level.
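
The two levels described above can be sketched as plain data structures. This is a minimal illustration under our own assumptions: the class and field names (Cell, Label, Entry, add_label) are hypothetical and do not reproduce the 2L-TOM implementation referenced in [7].

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Cell:
    """Syntactic level: a cell with its address and text."""
    address: str
    text: str


@dataclass
class Label:
    """Semantic level: a label that belongs to one group."""
    value: str
    group: str
    cell: Cell
    parent: Optional["Label"] = None  # parent-child link within the group

    @property
    def address(self) -> str:
        # Syntactic properties are inferred from the associated cell.
        return self.cell.address


@dataclass
class Entry:
    """Semantic level: an entry with at most one label per group."""
    value: str
    cell: Cell
    labels: Dict[str, Label] = field(default_factory=dict)

    def add_label(self, label: Label) -> None:
        if label.group in self.labels:
            raise ValueError(f"entry already has a label of group {label.group}")
        self.labels[label.group] = label


year = Label("2020", group="L1", cell=Cell("A1:B2", "2020"))
entry = Entry("12", cell=Cell("C3", "12"))
entry.add_label(year)
print(year.address)  # A1:B2
```

Keeping the address on the cell rather than on the label mirrors the note that a semantic object has no syntactic properties of its own.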
    The third level of the 3L-TOM interprets a table by specifying semantic
groups of data items. A semantic group is a set of data items which belong
to the same functional type and the same category of an external vocabulary
(e. g. DBpedia³, YAGO⁴, and Wikidata⁵). Each semantic group determines a
set of operations for generating its data items from syntactic objects of a table
and context. Each pair of groups, where one is a set of entries and another is a
set of labels, determines a set of operations for coupling their data items by the
entry-label relationships. A group of labels can also determine a set of operations
for coupling its labels by the label-label relationships. The interpretation level
provides the automatic inference of the semantics from the syntax of a table. It
can also serve to validate both the semantics and the syntax of a table.
    While the syntactic and semantic levels can be implemented by the 2L-TOM,
the interpretation level is to be based on descriptions of semantic groups
of data items, which that model does not provide. To represent the interpretation
level of the 3L-TOM, we propose the design of a novel language
for a lightweight description of table semantics, hereinafter TSDL (Table
Semantic Description Language). This language aims at reducing the volume and
the complexity of the semantic annotation of tables through a lightweight description
of entire groups, instead of a heavy description of atomic data items.
    The design of TSDL is based on predefined operations for the context neu-
tralization of tabular data. They serve to cleanse data items read from table
cells, as well as to free them from the context-dependency. A data item is read
from the textual content of one or more cells. Its value can be modified through
various transformations such as string processing, type conversion, aggregation,
etc. A transformation pipeline can be composed of several operations. It is
assumed that one or more such pipelines generate all data items
of a group in a context-independent form.
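
As an illustration only, such a pipeline can be sketched as a composition of small functions. TSDL is not yet defined, so every name below (pipeline, strip_prefix, fiscal_year_start, the context codes "RU" and "US") is a hypothetical stand-in; the fiscal year rules are the ones from the “FY2020” example in the introduction.

```python
from datetime import date
from functools import reduce


def pipeline(*operations):
    """Compose operations left to right into one transformation."""
    return lambda value: reduce(lambda acc, op: op(acc), operations, value)


def strip_prefix(prefix):
    """String processing: trim whitespace and drop a leading prefix."""
    def op(text):
        text = text.strip()
        return text[len(prefix):] if text.startswith(prefix) else text
    return op


# Assumed context rules: (start month, year offset) of the fiscal year.
FISCAL_YEAR_RULES = {"RU": (1, 0), "US": (10, -1)}


def fiscal_year_start(context):
    """Context neutralization: map a fiscal year number to its start date."""
    month, offset = FISCAL_YEAR_RULES[context]
    return lambda year: date(year + offset, month, 1)


neutralize_us = pipeline(strip_prefix("FY"), int, fiscal_year_start("US"))
neutralize_ru = pipeline(strip_prefix("FY"), int, fiscal_year_start("RU"))

print(neutralize_us(" FY2020 "))  # 2019-10-01
print(neutralize_ru(" FY2020 "))  # 2020-01-01
```

The same cell text yields different context-independent values depending on the source context, which is exactly what the neutralization step has to make explicit.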
    Another kind of the predefined operations that the design of TSDL is in-
tended to include is linking data items with each other by inner relationships of
the table structure. These operations correspond to some general features of the
table layout recommended by typographical standards and observed in many
documents. Our approach divides these operations into two types per two kinds
of inner relationships: entry-label and label-label. The first type of operations
uses methods for the inference of entry-label relationships from a pair of groups
where one contains entries while another consists of labels. The second type
determines methods for the inference of parent-child relationships between the
labels of the same group.

 ³ https://dbpedia.org
 ⁴ https://yago-knowledge.org
 ⁵ https://www.wikidata.org

    The operations for the entry-label linking are based on the following general
features of the table layout:

 – BY ROW / BY COLUMN / BY CELL, an entry is associated with a label
   when they originate from cells placed in the same row, column, or cell,
   respectively.
 – BY INDEX, an entry is associated with a label when they originate
   from cells read in the same order.
 – BY ADDRESS, an entry is associated with a label that originates from a cell at
   a specified address.
 – BY SINGLE, all entries are associated with a single label of a semantic
   group.
Note that these operations can be parametrized to specify some control
information, for example, a direction for seeking the cell of the label relative to the
cell of the entry, a shift in the reading order of cells, or a cell address.
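
A few of the entry-label operations listed above can be sketched as ordinary functions over positioned data items. The sketch assumes a simplified item with a single row and column index; the names Item, by_row, by_column, by_index, and by_single are hypothetical, and the small table is invented for illustration.

```python
from typing import NamedTuple


class Item(NamedTuple):
    """A data item with the position of the cell it originates from."""
    value: str
    row: int
    col: int


def by_row(entries, labels):
    """Pair each entry with the labels located in the same row."""
    return [(e, l) for e in entries for l in labels if e.row == l.row]


def by_column(entries, labels):
    """Pair each entry with the labels located in the same column."""
    return [(e, l) for e in entries for l in labels if e.col == l.col]


def by_index(entries, labels):
    """Pair entries and labels that were read in the same order."""
    return list(zip(entries, labels))


def by_single(entries, label):
    """Pair every entry with one label."""
    return [(e, label) for e in entries]


# Illustrative table: years in the head row, products in the stub column.
head = [Item("2019", 0, 1), Item("2020", 0, 2)]
stub = [Item("Apples", 1, 0), Item("Pears", 2, 0)]
body = [Item("10", 1, 1), Item("12", 1, 2), Item("7", 2, 1), Item("9", 2, 2)]

links = by_column(body, head) + by_row(body, stub)
labels_of_12 = sorted(l.value for e, l in links if e.value == "12")
print(labels_of_12)  # ['2020', 'Apples']
```

Two group-level calls are enough to link all four entries, regardless of how many rows and columns the body contains.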
    The operations for the label-label linking rely on the following general
features of the table layout:
 – BY NESTING, child and parent labels originate from cells located in adjacent
   rows (columns), and the child cell is nested in the parent cell by columns
   (rows).
 – BY INDENTATION, child and parent labels originate from cells located in
   the same column, and the text of the child label is indented relative to the
   text of the parent label.
 – BY EMPHASIZING, child and parent labels originate from cells located in
   the same column, and the text of the labels is distinguished by different font
   formatting.
 – BY ALIGNMENT, child and parent labels originate from cells located in
   the same column, and the text of the labels is distinguished by different
   alignment.
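
As one possible reading of BY INDENTATION, parent-child links can be inferred from the leading whitespace of label texts taken top to bottom from one column. The function name by_indentation and the sample labels are hypothetical; a real implementation would also have to handle indentation expressed through cell formatting rather than spaces.

```python
def by_indentation(cell_texts):
    """Map each indented label to its nearest less-indented predecessor."""
    parents = {}
    stack = []  # (label, indent) for the currently open ancestors
    for text in cell_texts:
        label = text.strip()
        indent = len(text) - len(text.lstrip())
        # Close ancestors that are at the same or a deeper indentation level.
        while stack and stack[-1][1] >= indent:
            stack.pop()
        if stack:
            parents[label] = stack[-1][0]
        stack.append((label, indent))
    return parents


stub_column = ["Europe", "  France", "  Italy", "Asia", "  Japan"]
print(by_indentation(stub_column))
# {'France': 'Europe', 'Italy': 'Europe', 'Japan': 'Asia'}
```

The same stack-based scan would apply to BY EMPHASIZING or BY ALIGNMENT if the indentation width were replaced by a rank derived from font or alignment.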
    We expect that some additional features of the table layout can be identified
over time. Therefore, the lightweight description should be extensible by new
operations for linking groups. The language might be designed as functional. In
this case, the operations of the context neutralization and the linking of semantic
groups can be expressed as function calls.


4   Conclusion
In the future, we expect that implementation of the approach proposed might
reduce the complexity and volume of the table semantics required for the cross-
context information exchange, as well as the user efforts aimed at annotating
tabular data.
    The 3L-TOM could enable the visual annotation
of a table on the level of data groups described on both the syntactic and semantic
sides. Typically, all data items of one group originate from cells located in
one functional region (adjacent rows and/or columns). For example, pivot tables
often place all labels of one category into either one row of the head or one column
of the stub, while all entries are placed into body cells. This layout feature can
be used to visually annotate semantic groups by selecting the corresponding
functional regions of cells. We expect that such annotation allows the
semantic markup to be created with minimal effort from end users.
    Another consequence arising from the implementation of 3L-TOM is the
possibility to recover semantics, the 2nd level of an instance of 3L-TOM, from
syntax, the 1st level of the instance of 3L-TOM. Interpretation of the TSDL
descriptions enables either the inference of atomic data items of a table when
they are absent or their validation when they are present in the instance of 3L-
TOM. This may significantly reduce the volume of tabular data transmitted from
one context to another. One lightweight description of semantics can be applied
to a set of tables with the same layout but with different content. Potentially, the
validation can prevent unacceptable modifications of the table layout and content
in data collection tasks, as well as protect tabular data against potential
damage in document exchange tasks.
    Moreover, atomic data items restored by using the TSDL descriptions can be
represented as linked data. Such a format complies with the Linked Data⁶
principles, which allow semantic objects to be published in the form of hypertext
while linking them to elements of other documents and objects. It
is possible to construct rules for mapping table semantics to an RDF⁷
graph, with the syntax made concrete via standards like RDFa/XML⁸ and
Turtle⁹. The popular common-sense knowledge graphs (e. g. DBpedia, YAGO,
and Wikidata), which can be used to describe table semantics, support RDF
as a standard knowledge representation format. This allows integration of the
linked data generated from tables with open external vocabularies in a common
format. The proposed translation of table semantics into RDF should simplify the
utilization of such data in target applications, since RDF is supported by
the majority of ontological modeling tools and is, de facto, the most widely used
format for representing linked data and ontologies.
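
A mapping rule of this kind might look as follows. The sketch emits N-Triples lines, which are also valid Turtle, using only string formatting; the namespace http://example.org/table/ and the property names are placeholders invented for illustration, not a vocabulary proposed by this paper.

```python
BASE = "http://example.org/table/"  # hypothetical namespace


def iri(local_name):
    """Wrap a local name into an absolute IRI reference."""
    return f"<{BASE}{local_name}>"


def literal(value):
    """Serialize a value as a plain string literal with basic escaping."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def entry_to_ntriples(entry_id, value, labels):
    """Emit one triple for the entry value and one per associated label."""
    subject = iri(f"entry/{entry_id}")
    lines = [f"{subject} {iri('value')} {literal(value)} ."]
    for group, label in sorted(labels.items()):
        lines.append(f"{subject} {iri('label/' + group)} {literal(label)} .")
    return lines


triples = entry_to_ntriples("C3", 12, {"year": "2020", "product": "Apples"})
print("\n".join(triples))
```

In practice the placeholder properties would be replaced by terms from an external vocabulary, and the label literals by links to the corresponding resources.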
    Summarizing the above, our approach can be implemented in the future
by the development of the following tools: (i) a model representing tables on
three levels (syntactic, semantic, and interpretation), (ii) a formal language for
a lightweight description of the table semantics, (iii) a visual annotator for
document tables to get a lightweight description, (iv) a validator of the ta-
ble syntax and semantics, (v) a generator of linked data from tables by their
lightweight description. We believe that these results will be useful in appli-
cations of cross-context exchange of tabular documents in various fields (e-
government, e-healthcare, e-commerce, etc.).


5   Acknowledgment

This work was supported by the Basic Research Program of the Siberian Branch
of the Russian Academy of Sciences, Project IV.38.1.2, Registration No. AAAA-
A17-117032210079-1.

 ⁶ https://www.w3.org/wiki/LinkedData
 ⁷ https://www.w3.org/RDF
 ⁸ https://www.w3.org/TR/rdfa-core
 ⁹ https://www.w3.org/TR/turtle

References
 1. Embley, D.W.: Relational model. In: Encyclopedia of Database Systems, pp. 3149–
    3154 (2018). https://doi.org/10.1007/978-1-4614-8265-9_306
 2. Heiler, S.: Semantic interoperability. ACM Comput. Surv. 27(2), 271–273 (1995).
    https://doi.org/10.1145/210376.210392
 3. Hinterberger, H.: Table. In: Encyclopedia of Database Systems, pp. 3873–3874
    (2018). https://doi.org/10.1007/978-1-4614-8265-9_1373
 4. Hurst, M.: Towards a theory of tables. Int. J. Doc. Anal. Recog. 8(2-3), 123–131
    (2006). https://doi.org/10.1007/s10032-006-0016-y
 5. Johnston, T.: Chapter 3 - the relational paradigm: mathematics. In: Bitemporal
    Data, pp. 35–41 (2014). https://doi.org/10.1016/B978-0-12-408067-6.00003-6
 6. Qin, P., Guo, J.: A novel machine natural language mediation for semantic docu-
    ment exchange in smart city. Future Generation Computer Systems 102, 810–826
    (2020). https://doi.org/10.1016/j.future.2019.07.028
 7. Shigarov, A., Khristyuk, V., Mikhailov, A.: TabbyXL: Software platform for rule-
    based spreadsheet data extraction and transformation. SoftwareX 10, 100270
    (2019). https://doi.org/10.1016/j.softx.2019.100270
 8. Shigarov, A., Khristyuk, V., Mikhailov, A., Paramonov, V.: TabbyXL: Rule-based
    spreadsheet data extraction and transformation. In: Information and Software
    Technologies. vol. 1078 CCIS, pp. 59–75 (2019).
    https://doi.org/10.1007/978-3-030-30275-7_6
 9. Shigarov, A.O., Mikhailov, A.A.: Rule-based spreadsheet data transformation
    from arbitrary to relational tables. Information Systems 71, 123–136 (2017).
    https://doi.org/10.1016/j.is.2017.08.004
10. Wang, X.: Tabular abstraction, editing, and formatting. Ph.D. thesis, University
    of Waterloo, Waterloo, Ontario, Canada (1996)
11. Wang, X., Wood, D.: A conceptual model for tables. In: Principles of Digital
    Document Processing. vol. 1481 LNCS, pp. 10–23 (1998).
    https://doi.org/10.1007/3-540-49654-8_2
12. Yang, S., Wei, R.: Tabdoc approach: An information fusion method to implement
    semantic interoperability between IoT devices and users. IEEE Internet of Things
    Journal 6(2), 1972–1986 (2019). https://doi.org/10.1109/JIOT.2018.2871274
13. Yang, S., Wei, R.: Semantic interoperability through a novel cross-context tabular
    document representation approach for smart cities. IEEE Access 8, 70676–70692
    (2020). https://doi.org/10.1109/ACCESS.2020.2986485
14. Yang, S., Wei, R., Guo, J., Xu, L.: Semantic inference on clinical docu-
    ments: combining machine learning algorithms with an inference engine for
    effective clinical diagnosis and treatment. IEEE Access 5, 3529–3546 (2017).
    https://doi.org/10.1109/ACCESS.2017.2672975
15. Yang, S., Guo, J., Wei, R.: Semantic interoperability with heterogeneous infor-
    mation systems on the internet through automatic tabular document exchange.
    Information Systems 69, 195–217 (2017). https://doi.org/10.1016/j.is.2016.10.010
16. Yang, S., Wei, R., Shigarov, A.: Semantic interoperability for elec-
    tronic business through a novel cross-context semantic document ex-
    change approach. In: Proc. ACM S. on Doc. Eng. pp. 28:1–28:10 (2018).
    https://doi.org/10.1145/3209280.3209523