<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Enriching Linked Open Data via Open Information Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonis Koukourikos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vangelis Karkaletsis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Vouros</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos” Agia Paraskevi Attikis</institution>
          ,
          <addr-line>P.O.Box 60228, 15310 Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Piraeus, Department of Digital Systems.</institution>
          <addr-line>80, Karaoli and Dimitriou Str, Piraeus, 18534</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The descriptions of various entities in Linked Data repositories are subject to constant renewals and modifications, with respect to both the descriptions of concepts and relations and the entities realizing their instantiations. Thus, the underlying ontologies have to be updated accordingly in order to reflect these changes. This paper presents a system for examining the possibilities of discovering new relations, and updating/verifying existing ones, for entities described in Linked Data repositories by using Open Information Extraction techniques applied to web content. The process aims at the enrichment of the examined datasets and the expansion of the ontologies with newly discovered concepts and relations. Towards this target, the paper discusses the intricacies, pitfalls and challenges present.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Open Data</kwd>
        <kwd>Open Information Extraction</kwd>
        <kwd>Ontology Population</kwd>
        <kwd>Ontology Enrichment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The Linked Data initiative aims to provide a set of guidelines and best practices for publishing
structured data and associating it with other resources. The Linking Open Data community project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims
to publish open data sets as RDF triples and establish RDF links between objects from different data
sets.
      </p>
      <p>The steadily increasing number of datasets involved in the project1 and the structured nature of
the included information provide a foundation for establishing fast, easy and customizable access to
substantial knowledge resources. However, it is important to ensure that the information provided by
such repositories is constantly updated and expanded. It is necessary to examine the adequacy of the
concepts and relations used for describing any entity, as well as to assess the validity of the RDF
triples.</p>
      <p>A possibly promising approach is to exploit the rich and constantly updated content available in the
World Wide Web in order to build methods for tackling the aforementioned issues.</p>
      <p>The present paper describes an experimental system towards:
1. discovering new concepts for describing an entity in a LOD repository, thus enriching the
underlying ontology of the dataset, by retrieving and analyzing relevant web content;
2. assessing the validity of a property currently present in the LOD repository and discovering new
values for that property;
3. adding new instantiations of known properties for a LOD object, i.e. associations of a known type
between a known entity and an entity not currently present in the repository.</p>
      <p>The paper is structured as follows. First, we provide a brief description of the technologies related to
our approach. A presentation of the overall system and the role of each of its components are given in
section 3. Preliminary results from experiments with the proposed system are presented in section 4.</p>
      <p>We conclude with some interesting remarks from the experiments and describe future enhancements
and expansions of the system.</p>
      <p>1 http://stats.lod2.eu/</p>
    </sec>
    <sec id="sec-2">
      <title>Related Technologies</title>
      <p>
        An overview of Linked Open Data can be found at [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Any open dataset that follows the principles
originally set for Linked Data can be said to belong to the LOD universe. The notion of interlinking is
central to the initiative. However, it is still common for interlinks between datasets
to exist mainly at the instance level.
      </p>
      <p>
        Ontology alignment is a key-technology for the purpose of interlinking different datasets, by
discovering equivalent and/or semantically similar classes and properties between the ontologies of distinct
repositories. Correspondences can then drive further associations between data in distinct repositories.
Some examples of recent alignment systems are SAMBO [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], ASMOV [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and RIMOM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
ontology alignment process may also be instance-based: in this regard, in relation to LOD, approaches
aim to connect datasets at the conceptual level using LOD information as it is. The BLOOMS ontology
alignment system [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is one such system. It uses information existing in the Wikipedia hierarchy in order
to bootstrap the alignment of two input ontologies.
      </p>
      <p>
        In order to address the expansion of ontologies (either at the conceptual or at the assertional level),
we use Open Information Extraction (OIE) methods for retrieving information from the broad spectrum
of sources available in the Web. The Open Information Extraction paradigm [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] focuses on obtaining
relations between argument pairs from unstructured web text, with the best possible accuracy,
without introducing distinct training sets or taking into account domain-specific information. Hybrid
approaches, which map OIE results to existing ontologies and subsequently use the mapping to
improve the results of the extraction process, are also of particular interest [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Overall System Architecture</title>
      <p>The following figure summarizes the architecture of the experimental system and indicates the
individual components.</p>
      <p>[Figure 1: System architecture, comprising the Web Crawler, the Linguistic Processing Module, the OIE Module, the LOD Access Module and the Relation Checker, which produces the LOD-linked relation set.]</p>
      <p>The whole process realized by the system naturally divides into four stages. The first step is to obtain
the raw textual information from the web. The second step is to apply light-weight linguistic
processing in order to enrich the text with additional information, such as identified named entities and
co-references, which is useful for the extraction process. The actual information extraction takes place at the
next stage. The extracted information, in the form of a set of triples representing binary relations
between entities, is then compared by the relation checker module to the relations obtained from the LOD
repositories. A subset of the extracted relations, considered relevant to entities from the repository, is
finally produced by the system.</p>
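      <p>The four stages can be sketched, at a very high level, as a simple pipeline. All function bodies below are hypothetical placeholders standing in for the actual components described in the next section:</p>

```python
# High-level sketch of the four-stage pipeline (placeholder implementations;
# the real components are a crawler, an NLP module, an OIE module and a
# relation checker backed by the LOD repository).

def crawl(entity_names):
    # Stage 1: retrieve raw textual content from the web for each entity.
    return ["Ardecan released the album Ouverture." for _ in entity_names]

def linguistic_processing(documents):
    # Stage 2: light-weight enrichment (named entities, co-reference).
    return documents  # placeholder: the real module annotates the text

def extract_relations(documents):
    # Stage 3: OIE yields (Argument, Relation, Argument) triples.
    return [("Ardecan", "released", "Ouverture") for _ in documents]

def check_against_lod(triples, known_entities):
    # Stage 4: keep only triples that touch a known LOD entity.
    return [t for t in triples
            if t[0] in known_entities or t[2] in known_entities]

triples = extract_relations(linguistic_processing(crawl(["Ardecan"])))
relevant = check_against_lod(triples, known_entities={"Ardecan"})
```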
      <p>The following section presents the specific technologies that are used for realizing the different
components of the system, implementing each of the distinct phases. It also presents specific issues
related to setting up the experiments performed.</p>
      <p>Linked Open Data Resources: The collection used was the Jamendo dataset, available from
DBTune2, part of the Linking Open Data in the Semantic Web project.</p>
      <p>The Jamendo dataset contains representations for musicians whose work is available through the
Jamendo site, a host of Creative Commons licensed music. The dataset includes 3505 band names and
5786 records. The dataset is linked to the GeoNames3 dataset for providing the place of origin for an
artist.</p>
      <p>Collecting the corpus: To drive the crawler, we extracted band names from the Jamendo dataset
and performed web searches based on each of these names. From the list of artist names, we randomly
selected 200, using a simple random number generator to pick indices into the complete list. The
selected names were given as search queries to a module implementing the Bing API. The top 50
results for each query were taken into account. The corresponding pages were retrieved and their textual
content was stored. In order to eliminate irrelevant web page content such as menus,
advertisements and tables of contents, we implemented a heuristic boilerplate removal module based on the
boilerpipe library4.</p>
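      <p>The sampling and collection steps can be sketched as follows. The search and page-fetching calls are stubbed out, since the actual module wraps the Bing API and the boilerpipe library; the fixed seed exists only to make this sketch reproducible:</p>

```python
import random

def sample_artists(artist_names, k=200, seed=42):
    # Pick k distinct indices into the complete list with a simple RNG,
    # mirroring the random selection of 200 band names described above.
    rng = random.Random(seed)  # fixed seed for reproducibility of the sketch
    indices = rng.sample(range(len(artist_names)), k)
    return [artist_names[i] for i in indices]

def collect_corpus(artist_names, search, fetch_text, results_per_query=50):
    # search() stands in for the Bing API wrapper; fetch_text() stands in for
    # page retrieval plus boilerpipe-based boilerplate removal.
    corpus = {}
    for name in sample_artists(artist_names, k=min(200, len(artist_names))):
        urls = search(name)[:results_per_query]
        corpus[name] = [fetch_text(u) for u in urls]
    return corpus
```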
      <p>Linguistic processing: The first run of the system revealed a significant number of non-resolvable
relations due to the use of pronouns or generic phrasing (e.g. “they”, “he”, “the band” etc.). It was
therefore deemed necessary to repeat the tests after incorporating a co-reference resolution mechanism. For our
first attempts, we used the relevant OpenNLP5 functionalities. Furthermore, the named entity recognition
module of the OpenNLP library was re-trained and applied to the corpus.</p>
      <p>
        Information Extraction: The information extraction module used was the REVERB system
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a second-generation OIE system which expands the ideas implemented in previous OIE systems
by exploiting generic syntactic and lexical constraints in order to reduce the introduction of invalid
relations into the extracted set. As mentioned, the relations are triples of the form (Argument, Relation,
Argument), where an argument is an entity related to another via the indicated relation. For this
test, we used the REVERB system with the default settings provided by its creators.
      </p>
      <p>From the set of 10,000 pages in the corpus (50 pages for each of 200 artists), REVERB returned a set
of 717,140 relations. Some of the relations were obviously invalid, since they associated entities with
incoherent lexicalizations. We discarded relations in which any of the constituents contained HTML
tagging or dynamic elements (JavaScript snippets). Further examination of the relation set will
probably reveal additional heuristics for rejecting relations. After this heuristic-based rejection process, a set
of 506,420 relations was considered for further processing.</p>
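      <p>The heuristic rejection step can be sketched as a simple filter over the extracted triples. The marker list below is illustrative, not the full rule set applied in the experiments:</p>

```python
# Sketch of the heuristic rejection of malformed REVERB triples: a triple is
# discarded if any constituent carries HTML tagging or JavaScript residue.
LT, GT = chr(60), chr(62)  # the HTML angle-bracket characters

def is_incoherent(text):
    # Illustrative markers of markup or script residue in a lexicalization.
    t = text.lower()
    markers = (LT, GT, "javascript:", "function(", "document.")
    return any(m in t for m in markers)

def filter_triples(triples):
    # Keep only triples whose three constituents all look like clean text.
    return [t for t in triples if not any(is_incoherent(part) for part in t)]
```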
      <p>Association with the LOD dataset: The next step of the experiment was the association of the results
from the information extraction with the data available from the Jamendo dataset. The results were
classified into the following generic classes:
- Relations in which both arguments were found in the LOD dataset
- Relations in which a single argument was found in the LOD dataset
- Relations that did not associate objects from the LOD dataset
Relations belonging to the first class can either provide an alternate phrasing for a known relation, or
introduce a different relation between the known entities (i.e. entities already in the dataset). We used
WordNet6 to examine whether the lexicalization of the relation in the LOD repository and the phrase found
by the extraction system shared one or more senses. The lookup is done by simple string matching, after
trivial normalizations such as removing underscores, punctuation and capitalization. If senses are shared, the
relation is considered already known; otherwise, the relation is marked as a possibly new one.</p>
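      <p>The classification and the sense-overlap check can be sketched as below. The tiny sense table is a stand-in for the WordNet lookup, whose real sense inventories are far richer:</p>

```python
import string

def normalize(phrase):
    # Trivial normalization before string matching: drop underscores,
    # punctuation and capitalization, as described above.
    phrase = phrase.replace("_", " ").lower()
    return phrase.translate(str.maketrans("", "", string.punctuation)).strip()

def classify(triple, known_entities):
    # Partition extracted triples by how many arguments the LOD dataset knows.
    hits = sum(1 for arg in (triple[0], triple[2]) if arg in known_entities)
    return ("irrelevant", "single argument", "both arguments")[hits]

def same_relation(lod_label, extracted_phrase, senses):
    # senses: a toy mapping from normalized phrases to sense-id sets,
    # standing in for WordNet. Shared sense means "already known".
    a = senses.get(normalize(lod_label), set())
    b = senses.get(normalize(extracted_phrase), set())
    return bool(a and b and a.intersection(b))
```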
      <sec id="sec-3-1">
        <p>2 http://dbtune.org/
3 http://www.geonames.org/
4 http://code.google.com/p/boilerpipe
5 http://opennlp.sourceforge.net/projects.html
6 http://wordnet.princeton.edu/</p>
        <p>Each instance-triple of the second class associates a known entity with an unknown entity, a fact
that could also lead to the introduction of an additional concept for describing the unknown entity and
of new relations for describing how entities are related.</p>
        <p>Instances of the third class were not considered relevant to the purposes of the experiment.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The majority of the relations obtained by the information extraction process were not directly relevant
to the entities in the Jamendo dataset. 290,714 extracted relations (58%) do not concern any entity in the LOD
repository (artist name, record name or track name), while in 17,219 (3%) of the extracted relations both
arguments are known entities. The remaining 198,487 relations (39%) indicate a relation with only one
known entity, which may appear as either the first or the second argument of the extracted triple.</p>
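      <p>The three counts are internally consistent with the 506,420 relations retained after the heuristic filtering, as a quick check confirms:</p>

```python
# Sanity check on the reported relation counts: the three classes should
# partition the 506,420 relations that survived the rejection step.
counts = {"irrelevant": 290714, "single argument": 198487, "both arguments": 17219}
total = sum(counts.values())

# Rounded shares of the total (the irrelevant class sits at about 57-58%).
shares = {k: round(100.0 * v / total) for k, v in counts.items()}
```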
      <p>An examination of the test corpus indicated that most irrelevant relations were related to the author
of the article or provided general remarks peripheral to the domain considered: for example, generic
opinions of the writer, comments on the music industry, social associations of the artist, etc.</p>
      <p>Some of the entities’ properties in the Jamendo RDF dataset were also retrieved by the OIE process.
Associations between artist names and their works were the most frequent relations. Such relations are
lexicalized either in a straightforward manner that is directly resolved via WordNet (e.g. “maker”, “creator”)
or in context-specific phrasings (“is responsible for”, “delivered”). From a brief manual examination of
the web corpus, we observed that certain information included in the repository was present in the pages
but was not provided in a free-form linguistic style, i.e. in the core text of an article. Rather, the
creations of an artist were provided in structured lists or tables, as were the cities in which the
musician would perform concerts. It is thus important to consider extraction techniques based not only on
sentential analysis but also on specific presentation styles using structured elements (tables or lists), in
combination with information from the context.</p>
      <p>The extraction process identified entities which are not included in the Jamendo dataset. These
entities may also lead to the inclusion of new concepts/relations in the corresponding ontologies. For
instance, the most prevalent was the relation of membership in a band, where a named entity (the
musician) was declared as a member, either directly or via a specific role. Another important
relation refers to the future releases of records by a band, an element that could also be included in an
updated dataset.</p>
      <p>With respect to the interlinking with the GeoNames database, it should be mentioned that a
significant number of entities were associated with different geographical names. Such cases were
mostly related to tours/concerts and location changes of the artists.</p>
      <p>From the NER-enhanced OIE process, we can construct a relation graph for a named entity, with the
edges denoting the relations in which the entity participates. This results in a graph of related entities.
The dataset used for the experiments limits the generalizing possibilities for the graph, as the
information included in the examined web sources does not emphasize aspects indirectly associated with
the music domain. For example, we are not able to find the population or other known inhabitants of a
city by using only the relations extracted by the system.</p>
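      <p>Such a relation graph can be sketched as a simple labelled adjacency structure built from the extracted triples:</p>

```python
from collections import defaultdict

def build_relation_graph(triples):
    # Map each entity to its outgoing (relation, target) edges, so the
    # neighbourhood of a named entity can be read off directly.
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

def neighbours(graph, entity):
    # Entities reachable from the given one in a single hop.
    return {obj for _, obj in graph.get(entity, [])}
```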
      <p>Regarding the available information, it is important to compare the produced relation graphs with
corresponding RDF graphs derived from the LOD cloud in order to:
- Deduce the validity of the extracted information
- Disambiguate named entities based on the LOD information and produce distinct graphs for
different entities of the same name</p>
      <sec id="sec-4-4">
        <title>An Example Relation Graph</title>
        <p>[Figure: An example relation graph for the artist Ardecan, with “visited”, “released” and “recording” edges towards the entities London, Ouverture and Ulyss.]</p>
        <p>While the extraction system discovered an association between the artist and the track, i.e. the
relation (Ardecan, recording, Ulyss), the corresponding node for the artist is not correctly associated with
the album record (Ouverture) to which the extracted track belongs. From the simple linguistic analysis
that we have applied so far, it is not possible to distinguish the role of “Ouverture” and “Ulyss” with
respect to the ontology behind the Jamendo dataset.</p>
        <p>In addition, the association between the artist and a geographical location stated by the
“visited” relation is irrelevant to the LOD schema. We expect that this issue can be easily handled with
some simple linguistic analysis. For example, there is no sense in WordNet that includes both
“based” and “visited”, so a cross-check against the WordNet database will reduce the possibility of such
relations being considered similar or equivalent.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>The purpose of our experiments is to examine the possible correlation of linked open data, open
information extraction and ontology evolution.</p>
      <p>At this stage, we focused on the possibilities for expanding domain-specific ontologies by using
unstructured open-world textual data and associating it with possibly incomplete specifications. The
initial experiments indicated the presence of multi-faceted information for the entities described in the
LOD repositories. The concepts commonly associated with an entity are broader than the ones
currently used for its description in the repository; thus, the underlying ontology could be expanded in order to
include the additional concepts. Furthermore, using linguistic techniques and given an adequate set of
external ontologies, we could associate a newly discovered relation with a property in an existing
ontology. However, this can be arbitrarily intricate. Ontology alignment methods can play a major role
towards these goals.</p>
      <p>The information extraction process itself should be augmented with techniques that take into
account non-sentential web content, as it seems to provide a wealth of information that leads to valid
relations.</p>
      <p>Our immediate next step is to further analyze the presented results and apply statistical measures in
order to deduce whether an extracted relation is actually relevant to the domain and to a specific entity,
based on the number of occurrences, the association with multiple entities, etc. We also aim to refine
the linguistic methods applied to the corpus in order to improve the results of the information extraction.
These steps shall provide the basis for a methodology for determining graph similarity
measures between the relation graph and the LOD-derived RDF graph (or a subset of the latter).</p>
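      <p>One simple starting point for such a graph similarity measure, under the assumption that both graphs are reduced to plain edge sets, is a Jaccard-style overlap; the intended measures would additionally weight relation labels and occurrence counts rather than require exact-match edges:</p>

```python
def edge_jaccard(relation_edges, rdf_edges):
    # relation_edges, rdf_edges: sets of (subject, relation, object) triples.
    # A crude baseline comparing an extracted relation graph against an
    # RDF graph derived from the LOD cloud.
    if not relation_edges and not rdf_edges:
        return 1.0
    inter = len(relation_edges.intersection(rdf_edges))
    union = len(relation_edges.union(rdf_edges))
    return inter / union
```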
      <p>In the long term, it is important to focus on the relation between the major technologies involved in the
system: Open Information Extraction, Linked Open Data and Ontology Enrichment. A gradually
improving ontology could be used to assist the information extraction module in eliminating
relations irrelevant to the domain of the LOD repository. The data in such a repository could
also be updated in accordance with the results of the information extraction. Our goal is thus to combine
these ideas in a constantly updated system, where the results of the components in a specific run are
exploited by subsequent runs in order to increase the efficiency and accuracy of the other components
in future executions of the process.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The research leading to these results has received funding from the European Union Seventh
Framework Programme under grant agreement n° 288513 (NOMAD - Policy Formulation through
non-moderated crowdsourcing).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData.
          <article-title>The W3C SWEO community projects home page</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Linked data - the story so far</article-title>
          .
          <source>International Journal On Semantic Web and Information Systems</source>
          <volume>5</volume>
          (
          <issue>3</issue>
          ),
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Lambrix</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
          </string-name>
          , H.:
          <article-title>SAMBO - a system for aligning and merging biomedical ontologies</article-title>
          .
          <source>Journal of Web Semantics</source>
          , vol.
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <fpage>196</fpage>
          -
          <lpage>206</lpage>
          , (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jean-Mary</surname>
            ,
            <given-names>Y. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shironoshita</surname>
            ,
            <given-names>E. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kabuka</surname>
            ,
            <given-names>M. R.</given-names>
          </string-name>
          :
          <article-title>Ontology matching with semantic verification</article-title>
          .
          <source>Journal of Web Semantics</source>
          , vol.
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>235</fpage>
          -
          <lpage>251</lpage>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Rimom: A dynamic multistrategy ontology alignment framework</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          , vol.
          <volume>21</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1218</fpage>
          -
          <lpage>1232</lpage>
          , (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jain</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hitzler</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sheth</surname>
            <given-names>A.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verma</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeh</surname>
            <given-names>P.Z.</given-names>
          </string-name>
          :
          <article-title>Ontology Alignment for Linked Open Data</article-title>
          .
          <source>In proc. of the 9th International Semantic Web Conference (ISWC2010)</source>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Banko</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Broadhead</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Open information extraction from the web</article-title>
          .
          <source>International Joint Conference on Artificial Intelligence</source>
          , (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            <given-names>D.S.</given-names>
          </string-name>
          :
          <article-title>Open information extraction using Wikipedia. In proc. of the 48th Annual Meeting of the Association for Computational Linguistics</article-title>
          , ACL '
          <volume>10</volume>
          ,
          <fpage>118</fpage>
          -
          <lpage>127</lpage>
          , Morristown, NJ, USA, (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Soderland</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roof</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qin</surname>
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mausam</surname>
          </string-name>
          , Etzioni O.:
          <article-title>Adapting open information extraction to domainspecific relations</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ),
          <fpage>93</fpage>
          -
          <lpage>102</lpage>
          , (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Etzioni</surname>
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fader</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Christensen</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            <given-names>S</given-names>
          </string-name>
          , Mausam: Open Information Extraction: the Second Generation.
          <source>International Joint Conference on Artificial Intelligence</source>
          , (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>