<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Inferencing in the Large: Characterizing Semantic Integration of Open Tabular Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asha Subramanian</string-name>
          <email>asha.subramanian@iiitb.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>International Institute of Information Technology</institution>
          ,
<addr-line>26/C, Electronics City, Hosur Road, Electronic City, Bengaluru, Karnataka 560100</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Tables are a natural and ubiquitous way of representing related information. Actionable insight is usually gleaned from tabular datasets through data mining techniques assisted by domain experts. These techniques, however, do not harness the semantics or the contextual reference underlying the datasets. Tabular datasets, especially those created as part of open data initiatives, often contain information about entities fragmented across several datasets, implicitly connected through some semantics that gives them a contextual reference. Our work deals with harnessing this context (the Thematic Framework) within which the datasets can be reasoned about further. This thesis aims at creating algorithmic support for a human to semantically integrate a collection of tabular data using ontologies from publicly available knowledge bases in Linked Open Data. The overall objective of our work, called “Inferencing in the Large”, is to go further than this: to enrich the mapped ontology with inferencing rules and generate enriched RDF (the Schematic Framework), enabling the use of semantic reasoners.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>Linked Open Data</kwd>
        <kwd>Context Abduction</kwd>
        <kwd>Ontology</kwd>
        <kwd>Graph Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>Recent open data initiatives by various governments have resulted in a number
of freely available tabular datasets containing actionable knowledge that could be
relevant to several stakeholders. However, these published datasets are often created
independently, with no overarching purpose or schematic structure. Indeed, there may be
no overarching thematic structure – the datasets need not be about any one particular
topic or theme. As a result, valuable knowledge remains fragmented across the datasets.
While typical analytics efforts use data mining or machine learning algorithms to
exploit data patterns and generate inferences, they fail to harness the implicit meaning of
the data to make meaningful inferences in a semantic context.</p>
      <p>There is a pressing need for semantic integration of such arbitrarily structured data.</p>
      <p>Given a collection of tables, it is a daunting task even to determine what the set
of tables are collectively about (that is, a topic or theme for the collection), let alone
establish an overarching schematic framework.</p>
      <p>We call this problem “Inferencing in the Large” (in the wilderness): in order
to extract meaning from a collection of data, we first need to establish a framework
within which meaning can be interpreted.</p>
      <p>
        In a previous project called Sandesh (short for Semantic Data Mesh), we had
proposed a knowledge representation framework based on Kripke Semantics called
Many Worlds on a Frame (MWF) to integrate disparate datasets [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this model,
aggregated knowledge was represented in the form of several semantic “worlds” – each
of which represented a schematic framework within which data was organized. Given
a set of tabular data, and a hand-crafted set of seed worlds and their (type and location)
relationships, the Sandesh toolkit reorganized the tabular data into data elements within
the schematic frameworks of one or more worlds. MWF was meant to address the
absence of a schematic and thematic framework in open datasets. However, the Sandesh
framework still requires significant human effort in organizing the seed set of worlds
and their schematic structures.
      </p>
      <p>In this work, we aim to automate this process further by using ontologies from
the Linked Open Data cloud to explain a collection of tables. We first determine the
Thematic Framework, the dominant concept(s) that the tables are collectively about,
and then determine the Schematic Framework, the entities and properties that each of
the row values and column headers relate to in the context determined by the theme.
Such matched ontologies can be enriched in two ways: (a) they can be augmented
with inference rules and new assertions to enable semantic reasoning within them, and
(b) they can be interrelated to one another to form a global frame that can support
reasoning across them.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Determining a meaningful context, extracting relevant ontologies and generating
enriched RDF tuples from structured, unstructured and semi-structured data using
appropriate ontologies from LOD cloud are all active research areas [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We divide this broad literature into the following groups:
– Identifying and Relating Concepts and Entities from Content</p>
      <p>
        Tools such as Open Calais (http://viewer.opencalais.com/), FRED [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], Apache Stanbol (https://stanbol.apache.org/overview.html) and FOX, the
Federated knOwledge eXtraction Framework (http://aksw.org/Projects/FOX.html), work on
unstructured content, extract concepts and entities such as places, events, people
and organisations, and relate them to universally known entities from knowledge bases
such as DBPedia, Freebase and Geonames. While Open Calais and FRED are the most
advanced of these tools, with capabilities to extract context and related entities,
the ontology/metadata they use internally is proprietary, in the sense that the
disambiguated entities refer to an internal Calais or FRED URI/id. Our objective
is to extract concepts for the identified context that can be related to an openly
available knowledge base from the Linked Open Data Cloud, without using any
proprietary vocabulary. In the context of datasets in the LOD cloud, Lalithsena et
al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] use an interesting technique to identify domains for such datasets, with
an aim to annotate/categorize the datasets appropriately. They rely on the Freebase
knowledge base to identify topic domains for LOD.
– Extraction of RDF tuples from CSVs
      </p>
      <p>
        Several research efforts have addressed extraction of RDF tuples from CSV files.
Some prominent tools in this area include RDF Converter, Virtuoso Sponger,
Open Refine and RDF123 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, the generated RDF tuples are mostly still
raw data without any contextual reference. These RDF tuples need to be
semantically linked to knowledge sources, such as the ones constituting the Linked Open
Data Cloud or other formal ontologies, to extract meaningful inferences from the
data.
– State of the Art: Understanding Semantics of Tables and Generating Enriched RDF
Some of the most recent and relevant work comparable to our research is that of
Mulwad [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Mulwad’s work is quite comprehensive in determining the
meaning of a table, and uses a parameterized graphical model to represent a table.
Their core module performs joint inferencing over row values, column headers and
relations between columns to infer the meaning of the table, using a
semantic message passing scheme that incorporates semantics into the messages. The
graph is parameterized on three variables: 1) one to determine the classes a
column should map to, 2) a second to determine the appropriate relations between the
data values in a row, and 3) a third to determine the relations between the column
headers. The joint inferencing module is an iterative algorithm that converges when
the model variables agree on the allotted LOD classes/entities for the column headers,
relations between columns and row values. They also generate enriched RDF
encapsulating the meaning of the table.
      </p>
      <p>While these efforts are attractive and generate quality linked data that keeps the
intended meaning of the data in mind, they still work on a single table and are largely
driven by data values. In our challenge, we are looking for the ontology, or collection
of classes from related ontologies, that best fits a set of tables, with each table
contributing a set of its columns to the identified ontology(ies). We expect a set of
tables to have different utilitarian views depending upon the desired context. Our
research aims to provide this semantic framework, wherein a set of tables is mapped to
a domain from LOD: the columns from the various tables are linked to relevant properties
of the domain classes that the tables are collectively about, and the data values and
relations between columns in the tables are instantiations of the domain classes and
their properties. Our overall objective from this research is also, given a set of
tables, to generate enriched RDF data that can be further exploited by semantic reasoners.</p>
      <p>RDF Converter: http://www.w3.org/wiki/ConverterToRdf
Virtuoso Sponger: http://virtuoso.openlinksw.com/dataspace/doc/dav/wiki/Main/VirtSponger
Open Refine: https://github.com/OpenRefine
Linked Data Design Issues: http://www.w3.org/DesignIssues/LinkedData.html</p>
    </sec>
    <sec id="sec-3">
      <title>Research Questions and Hypothesis</title>
      <p>My thesis attempts to answer the overall research objective: to semantically integrate a
collection of tabular datasets by inferring a Thematic and a Schematic Framework. The
following research questions detail this objective.
1. Given a collection of arbitrary tabular datasets, is it possible to determine what the
collection of tables is about? Can we relate this inferred theme in terms of known
concept(s) from Linked Open Data (LOD) or a custom knowledge base? We call
this the Thematic Framework. We envisage the Thematic Framework to contain
concept/domain classes from LOD that best describe the collection of datasets.
2. For the identified Thematic Framework, can we relate the column headers in the
various tables as properties of the identified concept classes from LOD, and the data
values as instances or entities of those concept classes? We call this the Schematic
Framework, consisting of the T-Box (class and property definitions) and the A-Box
(class and property assertions). We envisage the Schematic Framework to contain:
a) properties of the dominant classes that are most relevant for the columns in the
datasets, in line with the identified Thematic Framework; b) A-Box instantiations for
all the data values in the datasets, using the dominant classes and their properties; and
c) T-Box definitions for the hierarchy of dominant classes and their properties, derived
from the respective vocabularies in LOD.
3. Finally, can the enriched RDF generated using the Thematic and Schematic
Frameworks discussed above be processed by a reasoner such as Apache Jena to perform
semantic inferences?
We hypothesise that for tabular datasets with data values that can be linked to LOD or
some available custom knowledge base, it is possible to infer dominant concepts that
relate to concept classes from known ontologies, and further to map the column headers and
data values of the tables to properties and entities of those concept classes respectively.
The definitions of properties and classes described explicitly in an ontology (DBPedia,
Yago and others from the Linked Open Data Cloud), and those implicitly derived from the
instance assertions, together with additional evidence from column headers, can be
combined with graphical modelling techniques to achieve the research objective. Table 1
shows a sample Thematic and Schematic Framework output for a set of two input data
files (Table a, Table b) that have information on some Indian states and their capitals,
and rivers and their state of origin.</p>
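<p>To make the envisaged output concrete, the following is a minimal illustrative sketch (not our toolkit) of A-Box triples for the states-and-capitals example, assuming hypothetically that dbo:State, dbo:City and dbo:capital were the abduced DBPedia classes and property:</p>

```python
# Hypothetical sketch: A-Box assertions for a states-and-capitals table,
# assuming dbo:State, dbo:City and dbo:capital were abduced from DBPedia.
rows = [("Karnataka", "Bangalore"), ("Kerala", "Thiruvananthapuram")]

abox = []  # (subject, predicate, object) triples, written as CURIEs
for state, capital in rows:
    s, c = f"dbr:{state}", f"dbr:{capital}"
    abox.append((s, "rdf:type", "dbo:State"))   # A-Box class assertion
    abox.append((c, "rdf:type", "dbo:City"))    # A-Box class assertion
    abox.append((s, "dbo:capital", c))          # A-Box property assertion

ntriples = "\n".join(f"{s} {p} {o} ." for s, p, o in abox)
```

<p>The corresponding T-Box definitions (class hierarchy, property domains and ranges) would be drawn from the DBPedia vocabulary itself rather than generated from the table.</p>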
    </sec>
    <sec id="sec-4">
      <title>Proposal</title>
      <p>
        Our first goal is to find dominant concept classes that relate maximally to a given set
of tables. This paper showcases preliminary results towards this first goal. We propose
two approaches for identifying the concept class(es), and combine the two to obtain
an overall scoring. In the bottom-up approach, entities are searched from LOD to obtain
classes that maximally subsume the data values in a column. We also use the bottom-up
technique to mine properties for the columns that best relate to the relation between the
data values in a pair of columns. In the top-down approach, we rely completely on the
column header literals, or other information in the table description, to arrive at candidate
properties and their respective domain classes. For columns containing arbitrary literals,
only the top-down technique is applicable. We assume that the literals used to label the
column headers are relevant to the data contained in the respective columns, as otherwise
it would be practically impossible, even for humans, to ascertain what the data is about. We
combine results from the top-down and bottom-up techniques and create a consolidated
graph linking columns from tables to their respective candidate classes (derived from the
bottom-up technique) using the edge label cc (denoting a candidate class link), and
candidate properties for the columns (derived from the top-down and bottom-up
techniques) to their respective domain classes using the edge label d (denoting a link
to a domain class). We use DBPedia to generate the preliminary list of domain classes
for the columns, call it the Hypothesis Set, and expand the search for candidate
properties to all the equivalent classes from LOD (determined by the owl:equivalentClass
property for each domain class of the candidate property). This way we can identify
dominant concept classes for a given set of tables across LOD. We use two abduction
reasoning heuristics, namely a) Consistency and b) Minimality, to arrive at the
dominant class(es) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
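<p>A minimal sketch of the consolidated graph described above, using plain Python data structures; all class names, properties and support values here are illustrative, not taken from our experiments:</p>

```python
# Columns link to candidate classes via "cc" edges with support f_k
# (bottom-up), and candidate properties link to their domain classes via
# "d" edges (top-down). Names and supports are illustrative.
cc_edges = [
    # (column, candidate class, support f_k)
    ("tableA.col1", "dbo:State", 0.9),
    ("tableB.col2", "dbo:State", 0.7),
]
d_edges = [
    # (column, candidate property, domain class of that property)
    ("tableA.col2", "dbo:capital", "dbo:State"),
]

# Hypothesis Set: every class that received at least one cc or d edge.
hypothesis_set = {cls for _, cls, _ in cc_edges} | {cls for _, _, cls in d_edges}
```

<p>In the full pipeline, the Hypothesis Set would additionally be expanded with owl:equivalentClass links before scoring.</p>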
      <p>Our scoring model to determine the dominant classes is as follows:
1. Candidate Class Support (CCS) defines how well a class ω fits as a candidate
class for columns across all the CSV files:

ccs(ω) = Σ_{ck ∈ cscols(ω)} fk    (1)

2. Domain Class Support (DCS) defines how well a class ω corresponds to
candidate properties for columns across all the CSV files:

dcs(ω) = |dscols(ω)| / |cols|    (2)</p>
      <p>Here, ω represents a member of the Hypothesis Set. Each such class will have
incoming edges representing one of the following: a) candidate class for some column
ck, with its corresponding support fk, represented by an incoming cc link, and/or b)
domain class for some property p, represented by an incoming d link. Σ fk is the
sum of the support from each column connected to class ω with a cc link. cscols(ω)
is the set of nodes of type column that have a path leading to ω with a cc link.
Similarly, dscols(ω) is the set of nodes of type column having a path to ω with a d link.
cols is the set of all nodes of type column.</p>
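<p>The CCS and DCS scores of Eqs. (1) and (2) can be sketched over a small dictionary-based stand-in for the consolidated graph; the column names and supports are illustrative, not experimental values:</p>

```python
# cc_edges: (column, candidate class, support f_k); d_edges: (column, domain class).
cc_edges = [("tA.c1", "dbo:State", 0.9), ("tB.c2", "dbo:State", 0.7)]
d_edges = [("tA.c2", "dbo:State")]
cols = ["tA.c1", "tA.c2", "tB.c1", "tB.c2"]  # all column nodes in the graph

def ccs(omega):
    # Eq. (1): sum of supports f_k over columns linked to omega by a cc edge.
    return sum(f for _, cls, f in cc_edges if cls == omega)

def dcs(omega):
    # Eq. (2): |dscols(omega)| / |cols|.
    dscols = [c for c, cls in d_edges if cls == omega]
    return len(dscols) / len(cols)
```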
      <p>ccs and dcs calibrate the prolific nature of the class across all the CSV files. In
addition to the above scores, a “universality score” is associated with a class, describing
how prolific the class is across different tables. This score, called Tabular Support (TS),
is defined as:

ts(ω) = |tabs(ω)| / |tabs|    (3)

Here, tabs(ω) is the set of nodes labeled “table” in the graph that have a path to ω via
any of the labeled edges, and tabs is the set of all nodes labeled “table” in the graph.</p>
      <p>The class score vector for class ω is a vector of the ccs and dcs scores:

csv(ω) = [ccs(ω), dcs(ω)]    (4)</p>
      <p>The overall score representing the suitability of a class ω as a domain class is defined
as:

Score(ω) = ‖csv(ω)‖₂ · H[csv(ω)] · ts(ω)    (5)

Here, ‖csv(ω)‖₂ represents the L2 norm of the class score vector, and H[csv(ω)]
represents the entropy of the class score vector, given by:

H[csv(ω)] = − Σ_{i ∈ {ccs,dcs}} pᵢ(ω) log pᵢ(ω)</p>
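<p>The overall score defined above can be sketched as follows; we assume here that the probabilities pᵢ in the entropy term are the normalised components of the class score vector, which is our reading of the definition, and the input values are illustrative:</p>

```python
import math

def entropy(vec):
    # Entropy over the normalised components of the class score vector.
    total = sum(vec)
    ps = [v / total for v in vec if v > 0]
    return -sum(p * math.log(p) for p in ps)

def score(ccs_val, dcs_val, ts_val):
    csv = [ccs_val, dcs_val]                 # class score vector
    l2 = math.sqrt(sum(v * v for v in csv))  # L2 norm of the vector
    return l2 * entropy(csv) * ts_val        # overall class score

# A class supported in both tables of a two-table collection: ts = 2/2.
s = score(1.6, 0.25, ts_val=2 / 2)
```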
      <p>From the overall scores for each entry in the Hypothesis Set, we use a user-defined
threshold to select the dominant concept(s) for the collection of tables. Our approach
uses data-driven techniques together with additional evidence from column headers, and looks
for convergence to domain classes in the context determined by the data. This is one of the
differences from the state-of-the-art techniques discussed in Section 2.</p>
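<p>Concretely, selecting the dominant concept(s) amounts to thresholding the overall scores of the Hypothesis Set; a small sketch with hypothetical scores (not values from our experiments):</p>

```python
# Overall scores for a hypothetical Hypothesis Set; the threshold is
# user-defined (0.75 is the cutoff used in our preliminary results).
overall_scores = {"dbo:State": 0.82, "dbo:Place": 0.41, "dbo:Agent": 0.12}
threshold = 0.75

dominant = {cls for cls, s in overall_scores.items() if s >= threshold}
```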
    </sec>
    <sec id="sec-5">
      <title>Preliminary Results</title>
      <p>As of this writing, we have considered a variety of tabular datasets, ranging from
hand-crafted ones to publicly available CSV datasets, including those from data.gov.in.</p>
      <p>Table 2 shows the preliminary results from our scoring model to identify dominant
concept classes from a collection of tables, using a cutoff threshold of 0.75.
The first goal of our research objective, namely Thematic Framework extraction, has as of
now been verified with satisfactory results on tables where the dominant concepts
are about persons, places, organisations or some other identifiable concept defined in LOD.
The main challenge is the ability to identify LOD entities/resources from the data
accurately, especially when the information content (entropy) in the data from the columns is
low. Additionally, a column header may capture the essence of the properties of a
domain class using words that have only a similar word-sense. We would like to test and
refine the algorithm on a variety of tables where the data values are a combination of
known LOD entities and arbitrary values. Additionally, our proposal faces challenges
in converging to any dominant theme when the dataset is about a complex concept, such
as rice prices on a particular date in various districts of India, or real estate sales
transactions in a particular locality. Such instances occur when we do not
have appropriate classes/ontologies in LOD that relate to the dataset at hand, or the data
values in the tables do not map to any entity in the LOD. In such cases, we would like
to explore the use of SKOS (Simple Knowledge Organization System) categories and
Yago concept classes, together with WordNet (to address the problem of similar words
that capture the essence of the column header literals), as they seem to relate closely
to the purpose/context of the data. Additionally, we would like to incorporate human
input/custom ontologies to validate the suggestions on concepts/properties returned by
the algorithms. The next step from here is to expand the Thematic Framework to the
corresponding Schematic Framework (T-Box and A-Box assertions), devise an
appropriate scoring algorithm for abduced properties in line with the Thematic Framework,
and finally generate enriched RDF.</p>
    </sec>
    <sec id="sec-6">
      <title>Evaluation Plan</title>
      <p>We intend to use evaluation methods measuring a) Coverage, b) Accuracy and c)
Applicability. Coverage will measure the percentage of tables/columns mapped to LOD classes;
the goal is to map as many columns as possible, across all the datasets, to relevant properties
from LOD, in line with the Thematic Framework abduced for the datasets. Accuracy will compare
the scores of the dominant concept(s) in the Thematic Framework, and the scores of the
properties suggested by the Schematic Framework, with the actual concepts/properties
suggested by human evaluators. Applicability will measure the relevance of the newly
abduced A-Box instantiations and newly inferred LOD properties for relations between
column headers and the dominant concepts for a collection of datasets.</p>
      <p>Acknowledgments. I am deeply grateful to my thesis advisor Prof. Srinath Srinivasa
for his support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Srinivasa</surname>
            , <given-names>Srinath</given-names>
          </string-name>
          , Sweety V. Agrawal, Chinmay Jog, and Jayati Deshmukh.
          <article-title>Characterizing Utilitarian Aggregation of Open Knowledge</article-title>
          . In:
          <source>Proceedings of the 1st IKDD Conference on Data Sciences</source>
          , pp.
          <fpage>6:1</fpage>
          -
          <lpage>6:11</lpage>
          . ACM, New York,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Presutti</surname>
            , Valentina,
            <given-names>Francesco</given-names>
          </string-name>
          <string-name>
            <surname>Draicchio</surname>
            , and
            <given-names>Aldo</given-names>
          </string-name>
          <string-name>
            <surname>Gangemi</surname>
          </string-name>
          .
          <article-title>Knowledge extraction based on discourse representation theory and linguistic frames</article-title>
          .
          <source>In Knowledge Engineering and Knowledge Management</source>
          , pp.
          <fpage>114</fpage>
          -
          <lpage>129</lpage>
          . Springer Berlin Heidelberg,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Han, Lushan, Tim Finin, Cynthia Parr, Joel Sachs, and Anupam Joshi.
          <article-title>RDF123:from Spreadsheets to RDF</article-title>
          . Springer Berlin Heidelberg,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Mulwad</surname>
          </string-name>
          , Varish Vyankatesh.
          <article-title>TABEL - A Domain Independent and Extensible Framework for Inferring the Semantics of Tables</article-title>
          .
          <source>PhD diss</source>
          ., University of Maryland,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Buitelaar</surname>
            , Paul, Philipp Cimiano, and
            <given-names>Bernardo</given-names>
          </string-name>
          <string-name>
            <surname>Magnini</surname>
          </string-name>
          .
          <article-title>Ontology learning from text: An overview</article-title>
          . Vol.
          <volume>123</volume>
          .
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lau</surname>
          </string-name>
          ,
          <string-name>
            <surname>Raymond</surname>
            <given-names>YK</given-names>
          </string-name>
          , Jin Xing Hao,
          <string-name>
            <given-names>Maolin</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Xujuan</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <article-title>Towards context-sensitive domain ontology extraction</article-title>
          .
          <source>In System Sciences</source>
          ,
          <year>2007</year>
          .
          <source>HICSS</source>
          <year>2007</year>
          . 40th Annual Hawaii International Conference on, pp.
          <fpage>60</fpage>
          -
          <lpage>60</lpage>
          . IEEE,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Gerber, Daniel, Sebastian Hellmann, Lorenz Buhmann, Tommaso Soru, Ricardo Usbeck, and
          <string-name>
            <surname>Axel-Cyrille Ngonga Ngomo</surname>
          </string-name>
          .
          <article-title>Real-time RDF extraction from unstructured data streams</article-title>
          .
          <source>In The Semantic Web - ISWC</source>
          <year>2013</year>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>150</lpage>
          . Springer Berlin Heidelberg,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Augenstein</surname>
            , Isabelle,
            <given-names>Sebastian</given-names>
          </string-name>
          <string-name>
            <surname>Pado</surname>
            , and
            <given-names>Sebastian</given-names>
          </string-name>
          <string-name>
            <surname>Rudolph</surname>
          </string-name>
          .
          <article-title>Lodifier: Generating linked data from unstructured text</article-title>
          .
          <source>In The Semantic Web: Research and Applications</source>
          , pp.
          <fpage>210</fpage>
          -
          <lpage>224</lpage>
          . Springer Berlin Heidelberg,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lalithsena</surname>
            , Sarasi, Pascal Hitzler, Amit Sheth, and
            <given-names>Paril</given-names>
          </string-name>
          <string-name>
            <surname>Jain</surname>
          </string-name>
          .
          <article-title>Automatic domain identification for linked open data</article-title>
          .
          <source>In Web Intelligence (WI) and Intelligent Agent Technologies (IAT)</source>
          ,
          <year>2013</year>
          IEEE/WIC/ACM International Joint Conferences on, vol.
          <volume>1</volume>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>212</lpage>
          . IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Subramanian</surname>
            , <given-names>Asha</given-names>
          </string-name>
          , Srinath Srinivasa, Pavan Kumar, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vignesh</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Semantic Integration of Structured Data Powered by Linked Open Data</article-title>
          .
          <source>In Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics (WIMS '15) ACM</source>
          , New York, NY, USA.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>