<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Capturing Contextual Semantic Information About Statements in Web Tables</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felipe Quecole</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Romão Martines</string-name>
          <email>roh.martinesg@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>José M. Giménez-García</string-name>
          <email>jose.gimenez.garcia@univ-st-etienne.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harsh Thakkar</string-name>
          <email>thakkar@cs.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal University of São Carlos - UFSCar</institution>
          ,
          <addr-line>São Carlos</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ Lyon, UJM-Saint-Etienne, CNRS, Laboratoire Hubert Curien UMR 5516</institution>
          ,
          <addr-line>F-42023 Saint-Etienne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Bonn</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Data published on the Web is growing every year. However, most of this data does not have a semantic representation. Web tables are an example of structured data on the Web that has no clear semantics. While there is an emerging research effort in lifting tabular data into Semantic Web formats, most of the work focuses on entity recognition in tables with simple structure. In this work we explore how to capture the semantics of complex tables and transform them into knowledge graphs. These complex tables include contextual information about statements, such as time or provenance. Hence, we need to use contextualized knowledge graphs to represent the information of the tables. We explore how this contextual information is represented in tables, relate it to previous classifications of web tables, and show how to encode it in RDF using different approaches. Finally, we present a prototype tool that converts web tables from Wikipedia into RDF, trying to cover all existing approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Tables</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>RDF</kwd>
        <kwd>Property Graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Data is being published on the Web at an ever-increasing speed. However, most of this
data lacks semantics, which makes it difficult to use it to generate value. Knowledge graphs
are a well-known representation to encode data semantics. The Semantic Web provides
standards to represent interoperable knowledge graphs where each resource can be
unequivocally referenced. Tools to generate semantic data from structured web data
(especially tables) have been gaining traction in recent years. Most approaches focus on
entity recognition and disambiguation, in order to automatically extract the information
and transform it to RDF. However, to the best of our knowledge, existing approaches
tackle only simple tables with no additional information about the statements that can
be extracted. More complex tables exist that provide statements in different contexts
(e.g., according to different sources, or valid at different time periods). In order to
encode this contextual information (or statement metadata), we need to identify those
contexts and represent the information accordingly using contextualized knowledge
graphs. In this work we focus on transforming tables into RDF, where contexts are
represented by reifying the statements using the main existing approaches.</p>
      <p>Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      <p>The rest of the paper is organized as follows: Section 2 provides
background information; Section 3 presents an overview of how data is usually represented in
web tables, the challenges of representing this data in RDF, and how recent research deals
with them; Section 4 discusses the proposed approach to transform data from web
tables to RDF; finally, Section 5 draws some conclusions and outlines possible lines of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>In this section we introduce the necessary background information about RDF, existing
reification approaches, and tools to automatically convert structured data to RDF.</p>
      <p><bold>RDF</bold></p>
      <p>RDF is the data model used in the Semantic Web. It represents statements as triples
&lt;Subject, Predicate, Object&gt;. The subject identifies the resource being described,
the predicate is the property applied to it, and the object is the concrete value for this
property. Triples can share subject and/or object, hence creating an interconnected
graph of (possibly heterogeneous) statements. Formal definitions of RDF triple and
RDF graph are given in Definitions 1 and 2.</p>
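      <p>Definition 1 (RDF triple). Assume infinite, mutually disjoint sets I (IRI
references), B (blank nodes), and L (literals). An RDF triple is a tuple
<inline-formula><tex-math><![CDATA[(s, p, o) \in (I \cup B) \times I \times (I \cup B \cup L)]]></tex-math></inline-formula>,
where s is the subject, p is the predicate and o is the object.</p>
      <p>Definition 2 (RDF graph). An RDF graph G is a set of RDF triples
<inline-formula><tex-math><![CDATA[\{(s, p, o)\}]]></tex-math></inline-formula>.
It can be represented as a directed labeled graph
<inline-formula><tex-math><![CDATA[s \xrightarrow{p} o]]></tex-math></inline-formula>.</p>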
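      <p>As an informal illustration of Definitions 1 and 2, the following minimal Python sketch models the three kinds of terms and checks the position constraints of a triple. The class names (IRI, BlankNode, Literal) and the helper is_rdf_triple are hypothetical names chosen for this example and do not belong to any RDF library; the DBpedia identifiers are used only as sample terms.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


def is_rdf_triple(s, p, o) -> bool:
    """Check that (s, p, o) is in (I ∪ B) × I × (I ∪ B ∪ L)."""
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


# An RDF graph is simply a set of triples (Definition 2).
earth = IRI("http://dbpedia.org/resource/Earth")
population = IRI("http://dbpedia.org/ontology/populationTotal")
graph = {(earth, population, Literal("2544000000"))}

# Every element of the graph must satisfy Definition 1.
assert all(is_rdf_triple(s, p, o) for (s, p, o) in graph)
```
      </preformat>
      <p>Because the term classes are frozen dataclasses, triples are hashable and a graph can be stored directly as a Python set, which mirrors the set-based definition.</p>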
      <sec id="sec-2-1">
        <title>Annotating RDF with contextual information</title>
        <p>As seen in the previous section, RDF statements represent binary relations between two
resources (the subject and the object). This model is not well suited to represent
additional contextual information about the statements themselves (such as time of
validity, provenance, or confidence). Current approaches to represent this kind of
information reify the statement into a new resource, which can then be used as the subject
or object of new statements that represent the context. Below we describe the
five main existing approaches. Figure 1 illustrates each of them.</p>
        <p>In RDF Reification [6, Sec. 4], a resource is used to denote a statement, and
additional information can then be attached to that resource. A quad of the form (s, p, o, i), where i is the
statement identifier, is described by the triples (i, rdf:subject, s), (i, rdf:predicate, p) and
(i, rdf:object, o).</p>
        <p>[Figure 1 panels: (a) Standard Reification; (b) Named Graphs; (c) N-ary Relations; (d) Singleton Properties; (e) NdFluents]</p>
        <p>
          Named Graphs [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] consider sets of pairs of the form (G, n), where G is an RDF
graph and n is a URI (Uniform Resource Identifier) that names it. A quad (s, p, o, i) can then be
written directly, for example in N-Quads.
        </p>
        <p>
          In N-ary Relations [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a resource is used to describe a relationship: the subject is
involved in a relationship, which in turn has its own identifiers
and qualifiers. Here, a quad of the form (s, p, o, i) is decomposed into (s, p<sub>s</sub>, i),
(i, p<sub>v</sub>, o), (p<sub>v</sub>, :value, p) and (p<sub>s</sub>, :statement, s).
        </p>
        <p>
          Singleton Properties [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] create a property that is used for a single
statement only. To represent a quad (s, p, o, i) we need the triples (s, i, o) and (i, :singletonPropertyOf, p).
        </p>
        <p>
          NdFluents [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] creates contextual versions s' and o' of the subject and the object, and links them
to the original resources and to the context c using the triples (s', :contextualPartOf, s), (o', :contextualPartOf, o),
(s', :contextualExtent, c) and (o', :contextualExtent, c).
        </p>
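        <p>To make the decompositions above concrete, the following minimal Python sketch expands a quad (s, p, o, i) into triples for two of the approaches, Standard Reification and Singleton Properties. Terms are written as plain prefixed strings for brevity; the function names and the statement identifier wp:stmt1 are hypothetical, and the property name singletonPropertyOf is assumed to follow the Singleton Properties proposal.</p>
        <preformat>
```python
def standard_reification(s, p, o, i):
    """Describe quad (s, p, o, i) with three triples about the statement resource i."""
    return {
        (i, "rdf:subject", s),
        (i, "rdf:predicate", p),
        (i, "rdf:object", o),
    }


def singleton_property(s, p, o, i):
    """Use i as a property that occurs in exactly one statement."""
    return {
        (s, i, o),
        (i, ":singletonPropertyOf", p),
    }


# Hypothetical statement identifier; the other terms come from Listing 1.1.
quad = ("dbr:Earth", "dbo:populationTotal", "2544000000", "wp:stmt1")

reified = standard_reification(*quad)
singleton = singleton_property(*quad)

# In both cases the context is attached to the identifier i.
context = {("wp:stmt1", "time:intervalDuring", "wp:year1950")}
annotated = reified | context
```
        </preformat>
        <p>In both encodings the identifier i becomes an ordinary resource, so any number of contextual triples can be added with i as their subject.</p>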
      </sec>
      <sec id="sec-2-2">
        <title>RDF generation tools</title>
        <p>
          In order to transform a data source into RDF, a common approach is to use a
mapping language to represent how the data from one source has to be transformed
into triples. Several tools exist to transform heterogeneous data formats into RDF,
most of them tackling a single data model or format. In this section we focus on the
two most prominent mapping languages: RML [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and SPARQL-Generate [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Our
approach will make use of both in different steps of the process.
        </p>
        <p>
          RML [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] stands for "RDF Mapping Language". It is an extension of R2RML
(RDB to RDF Mapping Language)<sup>4</sup>. While R2RML expresses
customized mappings from relational databases to RDF datasets, RML also supports
other structured formats, such as CSV, TSV, XML and JSON. R2RML mappings
reference the columns of relational tables by name, and use properties such as rr:subjectMap,
rr:predicateObjectMap, rr:predicateMap and rr:objectMap. Each of these
properties refers to a column or a URI, and the triples are created according
to the properties and their respective referenced column(s). RML extends the R2RML
vocabulary with more general terms, of which the R2RML terms become
sub-properties; for example, rr:logicalTable and rr:tableName become
sub-properties of rml:logicalSource and rml:sourceName, respectively. In our work, we further
extend RML so that the mapping document carries the information needed to extract data
from HTML tables: in which column the subject can be found,
which type of table (and thus which reification method) applies, which
CSS class should be used to select the specific table from the page, and so on.
        </p>
        <p>
          <sup>4</sup> https://www.w3.org/TR/r2rml
        </p>
        <p>
          SPARQL-Generate [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] extends SPARQL 1.1 to be able to extract information
from heterogeneous data sources. SPARQL-Generate includes three new clauses:
          <list list-type="bullet">
            <list-item><p>source clause: used to bind variables to documents;</p></list-item>
            <list-item><p>iterator clause: used to extract bits of information from the documents;</p></list-item>
            <list-item><p>generate clause: extends the existing construct clause of SPARQL 1.1, allowing
modularization of queries and factorization of the RDF generation.</p></list-item>
          </list>
        </p>
        <p>
          The first two clauses (source, with its binding functions, and iterator) allow
SPARQL-Generate to support various data formats and navigate through them.
        </p>
        <p>
          According to Crestan et al. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], web tables can be categorized as layout tables (used
for presentation purposes, not really containing any knowledge) and relational tables.
Relational tables encode implicit semantics of the data, and can be further divided
according to their structure into: vertical listing: tables that list in each row one or
more attributes for a series of similar entities located in one column (the subject
column); horizontal listing: similar to vertical listings, but presenting their
subjects in one row; attribute/value: a specific case of vertical and
horizontal listings that does not contain the subjects in the table; matrix:
tables that have the same value type for each cell at the junction of a row and a
column; calendar: a specific case of the matrix type, differing only in its semantics; and
enumeration: tables that list a series of objects that have the same ontological relation.
        </p>
        <p>
          Muñoz et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] identify three types of tables in Wikipedia: toc, infobox, and
wikitable. The first corresponds to layout tables: a toc (table of contents)
presents the topics of the article. The second and the third
correspond to relational tables. Infoboxes have a clear horizontal listing structure where
the subject, predicate and object can be identified in each row, and form
the basis of the data extracted to create DBpedia [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Wikitables are used to embed tables
with semantic content in a Wikipedia article, but their structure is highly variable.
        </p>
        <p>
          While solutions for transforming data in tables to RDF have been proposed, most
of them focus on challenges such as identifying the subject column, interpreting the
implicit structure of the table, entity recognition and disambiguation, and mapping values
in the table to classes and properties in a knowledge base [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In addition, they only
tackle vertical and horizontal listings with simple structure. In this work, we tackle
more complex tables, where contextual information (such as date or provenance) needs to be expressed about the
extracted triples. This contextual information is usually
encoded in the tables in one of the following two ways: (1) in horizontal and vertical
listings, by grouping columns by the context<sup>5</sup>; (2) in matrix tables, by using row and
column headers as identifiers of the context<sup>6</sup>.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <p>The transformation from tables to knowledge graphs needs to consider the different
types of tables presented in the previous section. For tables without contextual
metadata about the statements the process is relatively simple: each cell in the subject
column is mapped to the subject of a triple and each other cell of the same row to an object,
using a property that depends on the column of the object. However, for tables that
contain contextual information it is necessary to capture the context of the triples. RDF,
as mentioned in Section 2.1, only supports binary relations. In order to capture the
context of the triples it is necessary to resort to a reification approach (see Section 2.2).
Take Table 1<sup>7</sup> as an example. We want to extract information not only about the
population estimates, but also about the corresponding year and the agency responsible for
that estimation. This table is an example of a matrix table, where contexts are
indicated by the headers of rows and columns. Listing 1.1 shows the expected output
for the cell at row 1 and column 2, including all the contextual metadata.</p>
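      <p>The handling of a matrix table can be sketched in a few lines of Python. The sketch below assumes that row headers carry the year and column headers carry the source, as in the example discussed above, and it uses Standard Reification; the function name matrix_to_reified and the stmt: identifier scheme are hypothetical, and terms are written as prefixed strings rather than full IRIs.</p>
      <preformat>
```python
def matrix_to_reified(subject, predicate, row_headers, col_headers, cells):
    """Emit reified triples for each cell, using headers as statement contexts."""
    triples = []
    for r, year in enumerate(row_headers):
        for c, source in enumerate(col_headers):
            value = cells[r][c]
            if value is None:
                # Empty cells produce no statement.
                continue
            stmt = f"stmt:{year}:{source}"  # hypothetical identifier scheme
            triples += [
                (stmt, "rdf:subject", subject),
                (stmt, "rdf:predicate", predicate),
                (stmt, "rdf:object", value),
                (stmt, "time:intervalDuring", f"wp:year{year}"),
                (stmt, "prov:agent", f"wp:{source}"),
            ]
    return triples


# Single cell taken from Listing 1.1 (year 1950, source Maddison).
triples = matrix_to_reified(
    "dbr:Earth",
    "dbo:populationTotal",
    row_headers=["1950"],
    col_headers=["Maddison"],
    cells=[[2544000000]],
)
```
      </preformat>
      <p>Each non-empty cell yields five triples: three that reify the statement and two that attach the temporal and provenance contexts to it.</p>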
      <p>In addition, the approach needs to read the web page and extract the information.
However, the HTML structure of the table can be arbitrary, and this is one of the
challenges to face in this approach. Hence, it is necessary to include a preliminary step
to pre-process the table. For this prototype, we decided to obtain some of the necessary
information from the user. The preprocessing step produces as output a modified
version of the table with additional information: indexes for columns and rows, the
datatype of the value in each cell, the category of the table, and the groups of columns. This
information is then used by an RDF conversion module. Note that this approach could
be extended to include other kinds of knowledge graphs, such as property graphs,
by adding a new conversion module. A schema of this process is shown in Figure 2.</p>
      <p><sup>5</sup> See https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_mortality_rate, where the same data is given twice but with different
sources.</p>
      <p><sup>6</sup> See Table 1.</p>
      <p><sup>7</sup> Taken from https://en.wikipedia.org/wiki/World_population_estimates</p>
      <preformat>
wp:year1950 a time:DateTimeDescription, time:Interval ;
    time:year "1950"^^xsd:gYear .

wp:Maddison a ex:Provenance ;
    prov:wasGeneratedBy [
        a event:Event, prov:Activity ;
        event:time [
            a time:Interval ;
            time:hasDateTimeDescription [
                a time:DateTimeDescription ;
                time:year "2008"^^xsd:gYear ] ] ] .

&lt;http://purl.org/az/worldpop#earth:year1950:Maddison&gt;
    rdf:object 2544000000 ;
    rdf:predicate dbo:populationTotal ;
    rdf:subject dbr:Earth ;
    time:intervalDuring wp:year1950 ;
    prov:agent wp:Maddison .
</preformat>
      <p>Listing 1.1: Expected output example</p>
      <p>
        The input taken by the preprocessing module is written in RDF using RML [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
We extend the vocabulary with the following terms:
        <list list-type="bullet">
          <list-item><p>CSSselector: indicates the CSS selector for the target table in the web page;</p></list-item>
          <list-item><p>TablePosition: index of the target table, given the CSS selector;</p></list-item>
          <list-item><p>Reification: indicates to which category the table belongs;</p></list-item>
          <list-item><p>SubjectIndex: indicates the column that holds the subject of the triple;</p></list-item>
          <list-item><p>HeaderRow: (when columns are grouped by context) indicates in which row the
headers (that will be used as predicates) are;</p></list-item>
          <list-item><p>ColumnPredicate: index of the column that is part of the predicate.</p></list-item>
        </list>
      </p>
      <p>
        The RDF conversion module makes use of SPARQL-Generate [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], using its XPath
function to iterate over the elements of the table. The user input described above
(except for the first three terms, which are used in the preprocessing step) is used
to compose the SPARQL-Generate query. The values provided by the user dictate the
role of each column of the HTML table, that is, whether a column is the subject, part
of the predicate, or just the object of the triples (with the header being the predicate).
      </p>
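      <p>The role that the column settings play can be illustrated with a small Python sketch for the simplest case, a vertical listing without contextual information. It is only an approximation of the idea and not the SPARQL-Generate query composed by the tool; the function name listing_to_triples and the sample row are hypothetical, and the parameter subject_index stands in for the SubjectIndex term.</p>
      <preformat>
```python
def listing_to_triples(header, rows, subject_index=0):
    """Map each row of a vertical listing to (subject, predicate, object) triples."""
    triples = []
    for row in rows:
        subject = row[subject_index]
        for col, value in enumerate(row):
            if col == subject_index:
                # The subject column does not produce a triple of its own.
                continue
            # The column header plays the role of the predicate.
            triples.append((subject, header[col], value))
    return triples


header = ["country", "capital", "continent"]
rows = [["France", "Paris", "Europe"]]
triples = listing_to_triples(header, rows, subject_index=0)
```
      </preformat>
      <p>Changing subject_index selects a different subject column without touching the rest of the conversion, which is the kind of decision the mapping document delegates to the user.</p>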
      <p>The prototype tool is publicly available<sup>8</sup> under the Apache-2.0 license.</p>
      <p><sup>8</sup> https://github.com/felipequecole/table2rdf</p>
      <p>Transforming web tables into knowledge graphs while capturing their semantics and
contextual information is a challenging task for various reasons. On the side of the
knowledge graph representation, it can be necessary to use reification techniques in
order to encode the context. On the side of the table, the HTML structure can be
arbitrary, and the contents of the table can be difficult to identify. We propose a
two-step process. The first step takes additional information and pre-processes the
table, generating an enriched version of the table with the information needed by
the second step, such as the category of the table or how to extract the contextual
metadata about the statements. The second step reads the output of the preprocessor
and transforms the data into a knowledge graph. We have implemented a tool that
obtains part of the necessary information from the user (falling back to default values in
case some information is not given) in the first step, and uses an RDF conversion module
as the second step. Note that other approaches focusing on different challenges, such as
entity disambiguation or subject column identification, could be incorporated in the
preprocessing step. Likewise, new modules can be added to replace the RDF
transformation with a conversion to another kind of knowledge graph, such as property graphs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ives</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>DBpedia: A Nucleus for a Web of Open Data</article-title>
          .
          <source>ISWC+ASWC</source>
          (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Carroll</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stickler</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Named graphs</article-title>
          .
          <source>J. Web Sem</source>
          . (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Crestan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Web-scale table census and classification</article-title>
          .
          <source>Proceedings of the fourth ACM international conference on Web search and data mining</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Dimou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vander Sande</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Colpaert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verborgh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannens</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Walle</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>RML: A Generic Language for Integrated RDF Mappings of Heterogeneous Data</article-title>
          .
          <source>LDOW</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Giménez-García</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maret</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>NdFluents: An Ontology for Annotated Statements with Inference Preservation</article-title>
          .
          <source>ESWC</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Lassila</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swick</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          :
          <article-title>Resource Description Framework (RDF) Model and Syntax Specification</article-title>
          .
          <source>W3C Recommendation</source>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Lefrançois</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zimmermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bakerally</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>A SPARQL Extension for Generating RDF from Heterogeneous Formats</article-title>
          .
          <source>ESWC</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Martinez-Rodriguez</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez-Arevalo</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Information Extraction meets the Semantic Web: A Survey</article-title>
          .
          <source>Semantic Web Journal</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Muñoz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mileo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Using linked data to mine RDF from wikipedia's tables</article-title>
          .
          <source>Proceedings of the 7th ACM international conference on Web search and data mining</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bodenreider</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Don't like RDF Reification?: Making Statements about Statements Using Singleton Property</article-title>
          .
          <source>WWW</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rector</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hayes</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Welty</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Defining N-ary Relations on the Semantic Web</article-title>
          .
          <source>W3C Working Group Note</source>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>