<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DBpediaSameAs: an Approach to Tackle Heterogeneity in DBpedia Identifiers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andre Valdestilhas</string-name>
          <email>valdestilhas@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natanael Arndt</string-name>
          <email>arndt@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Kontokostas</string-name>
          <email>kontokostas@informatik.uni-leipzig.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AKSW, Department of Computer Science</institution>
          ,
          <addr-line>Augustusplatz 10, D-04109 Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The DBpedia dataset contains multiple URIs, both within the dataset and from other datasets, that are connected by (transitive) owl:sameAs relations and thus refer to the same concepts. This heterogeneity of identifiers makes it complicated for users and agents to find the unique identifier that should preferably be used. We introduce the concept of the DBpedia Unique Identifier (DUI) and a dataset of linksets relating URIs to DUIs. To improve the quality of our dataset, we developed a mechanism that allows users to rate and suggest links. As a proof of concept, an implementation with a graphical web user interface is provided for accessing the linkset and rating the links. The DBpedia sameAs service is available at http://dbpsa.aksw.org/SameAsService.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Furthermore, DBpedia has more than one URI representing the same resource within the dataset. For example, dbpedia:Brassil<sup>4</sup> has at least the following equivalents within DBpedia: dbpedia:Republica_Federativa_do_Brasil, dbpedia:ISO_3166-1:BR and dbpedia:Brazil, all of which redirect to dbpedia:Brazil. Thus, a problem to consider is how to resolve any of the equivalents directly to the final URI, e.g. http://dbpedia.org/resource/Brazil, without any redundancies.</p>
      <p>Also, according to Halpin et al. [1] and Wood et al. [4], sameas.org has collected millions of triples with owl:sameAs relations. It would be important to promote reciprocal owl:sameAs confirmation mechanisms and to develop effective trust mechanisms to assure the quality of owl:sameAs relations.</p>
      <p>To tackle the identifier heterogeneity problem, we make the following contributions:</p>
      <p>We describe an approach for mitigating the identifier heterogeneity problem and implement a prototype in which the user can evaluate existing links as well as suggest new links to be rated.</p>
      <p>We provide the ability to generate statistics about good and bad links, which makes quality control for the links to DBpedia possible.</p>
      <p>We define the DBpedia Unique Identifier (DUI): instead of several transient owl:sameAs DBpedia URIs for the same final address, it is now possible to have a single unique URI from DBpedia. A DUI goes directly to the final address instead of requiring several possible intermediate results to be processed. For example, with a URI from Freebase, 17 redundant URIs from DBpedia were avoided; had one used a service such as sameAs.org, 1141 URIs would be avoided.</p>
      <p>The rest of the paper is organized as follows: Section 2 presents the proposed approach for tackling the identifier heterogeneity problem, Section 3 evaluates our work, Section 4 discusses related work, and finally Section 5 concludes the paper and outlines future work.</p>
      <p><sup>4</sup> Throughout the paper we use the following namespace definitions: owl: http://www.w3.org/2002/07/owl#, dbpedia: http://dbpedia.org/resource/.</p>
    </sec>
    <sec id="sec-2">
      <title>2. REPRESENTATION OF THE IDEA</title>
      <p>This section explains our main idea, including its implementation and descriptions.</p>
      <p>Before continuing, we adopt the following definitions.</p>
      <p>Normalization of the URI: By normalizing URIs we mean eliminating redundancies.</p>
      <p>DBpedia unique identifier: The DBpedia Unique Identifier (DUI) is a unique URI that identifies a resource in the DBpedia repository and is the result of our normalization.</p>
      <p>The idea started with a stand-alone service on the web: the user provides a URI as a parameter and, instead of several transient URIs with the owl:sameAs property, receives a single DUI from our service.</p>
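      <p>The normalization defined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the service: the function name resolve_dui, the dictionary representation of the links, and the particular hop order in the sample data are hypothetical; only the URIs themselves are taken from the text.</p>
      <preformat><![CDATA[
```python
from typing import Dict


def resolve_dui(uri: str, same_as: Dict[str, str]) -> str:
    """Follow sameAs/redirect hops until no further hop is known."""
    seen = {uri}
    current = uri
    while current in same_as:
        nxt = same_as[current]
        if nxt in seen:  # guard against cycles in the link data
            break
        seen.add(nxt)
        current = nxt
    return current


DBR = "http://dbpedia.org/resource/"

# Hypothetical hop order; the text only states that all of these
# URIs end up at dbpedia:Brazil.
links = {
    "http://rdf.freebase.com/ns/m.015fr": DBR + "Brassil",
    DBR + "Brassil": DBR + "Republica_Federativa_do_Brasil",
    DBR + "Republica_Federativa_do_Brasil": DBR + "Brazil",
    DBR + "ISO_3166-1:BR": DBR + "Brazil",
}

print(resolve_dui("http://rdf.freebase.com/ns/m.015fr", links))
```
]]></preformat>
      <p>A URI that has no outgoing hop is returned unchanged, so a URI that is already a DUI maps to itself.</p>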
    </sec>
    <sec id="sec-3">
      <title>2.1 The work-flow</title>
      <p>The workflow for requesting the DUI of a given resource is represented in Fig. 1. First, the user provides a URI from some source, e.g. Freebase. Then, instead of possibly several resulting URIs with the property owl:sameAs, our system returns a DUI. The user then has the possibility to rate, verify, and validate the link, or to suggest a different one. The ratings in turn let us compute statistics about the quality of the links.</p>
      <p>A service was also implemented in which the user provides a URI and the API returns the DBpedia identifier, i.e. a URI that is in an owl:sameAs relation with the URI provided.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Methodology</title>
      <p>This section describes, in four steps, the technique and how the idea was developed, from the phase of importing links into a relational database to the development of the web service and a GUI. (1) The files with triples that contain owl:sameAs links were downloaded. (2) All triples were imported into a relational database<sup>5</sup>, because we will use some characteristics of a relational database, i.e. for comparison with a voting system, in future work. (3) A service on the web was implemented, where the user enters a URI and receives a DUI. (4) To provide an interface for accessing this service, a web system was created that receives a URI as input, returns a DBpedia identifier as output, and allows the user to rate the resulting link and make suggestions about it.</p>
      <p><sup>5</sup> http://tinyurl.com/creatdb</p>
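      <p>Steps (1) to (3) above can be illustrated with a small, self-contained sketch. It assumes SQLite as the relational database and a table named same_as with columns subject and object; both are choices made only for this example, since the text does not name the database system or its schema. The parser handles only URI-to-URI owl:sameAs lines in N-Triples syntax, and the direction of the sample triple (DBpedia resource as subject) is likewise an assumption.</p>
      <preformat><![CDATA[
```python
import sqlite3

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def parse_same_as(line):
    """Return (subject, object) for a URI-only owl:sameAs line, else None."""
    parts = line.split()
    if len(parts) != 4 or parts[3] != ".":
        return None
    if parts[1] != "<" + OWL_SAME_AS + ">":
        return None
    return parts[0].strip("<>"), parts[2].strip("<>")


def import_links(conn, lines):
    """Create the (hypothetical) same_as table and bulk-insert parsed links."""
    conn.execute("CREATE TABLE IF NOT EXISTS same_as (subject TEXT, object TEXT)")
    rows = [pair for pair in map(parse_same_as, lines) if pair is not None]
    conn.executemany("INSERT INTO same_as VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def dbpedia_uri_for(conn, external_uri):
    """Look up the DBpedia subject linked to an external URI, or None."""
    row = conn.execute(
        "SELECT subject FROM same_as WHERE object = ? LIMIT 1", (external_uri,)
    ).fetchone()
    return row[0] if row else None


sample = [
    "<http://dbpedia.org/resource/Brazil> "
    "<http://www.w3.org/2002/07/owl#sameAs> "
    "<http://rdf.freebase.com/ns/m.015fr> .",
    "# lines that are not owl:sameAs triples are skipped",
]

conn = sqlite3.connect(":memory:")
imported = import_links(conn, sample)
print(imported, dbpedia_uri_for(conn, "http://rdf.freebase.com/ns/m.015fr"))
```
]]></preformat>
      <p>An in-memory database keeps the sketch runnable anywhere; a real import of tens of millions of triples would more plausibly use a file-backed or server database with an index on the lookup column.</p>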
      <p>The DBpedia Link Repository uses the DBpediaSameAs service to tackle the heterogeneity and obtain the appropriate DUI. The user is then redirected to the DBpedia Link Rate interface, which provides feedback to the DBpedia Link Repository and thereby improves the quality of the DBpedia endpoint.</p>
    </sec>
    <sec id="sec-5">
      <title>3. EVALUATION</title>
      <p>The aim of this qualitative<sup>6</sup> evaluation was to verify the behavior of the DBpediaSameAs service and of the Graphical User Interface (GUI) that makes it possible to verify and rate the links.</p>
      <p>Three evaluation criteria were chosen: (1) Normalization on DBpedia URIs: we evaluated whether DBpediaSameAs can provide a normalization on DBpedia URIs. (2) Rate the links: we evaluated whether DBpediaSameAs can provide a way to rate the links. (3) DBpediaSameAs as a service: we evaluated whether DBpediaSameAs can provide a stand-alone service on the web that brings the normalization on DBpedia URIs.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Normalization on DBpedia URIs</title>
      <p>The criterion used in this evaluation is solely the ability to tackle the heterogeneity that was observed, as a redundancy problem, during the search for co-references between different data sets.</p>
      <p>When a URI from Freebase was used to obtain a DBpedia URI, we observed that at least 3 URIs were returned, all of which lead to the same final address.</p>
      <p>As an example of a real case, executed on our public server, with a URI from Freebase:</p>
      <preformat>$ curl http://dbpsa.aksw.org/SameAsService/SameAsServlet?uris=http%3A%2F%2Frdf.freebase.com%2Fns%2Fm.015fr
returns: http://dbpedia.org/resource/Brazil</preformat>
      <p>In this case, instead of 17 URIs from DBpedia that lead to the same final address, our approach drives the user directly to the final address.</p>
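      <p>The same request URL can be assembled programmatically. The sketch below only builds the URL and does not contact the service; the endpoint and the uris parameter are taken from the curl example, while the function name build_request_url is hypothetical.</p>
      <preformat><![CDATA[
```python
from urllib.parse import quote

SERVICE = "http://dbpsa.aksw.org/SameAsService/SameAsServlet"


def build_request_url(uri: str) -> str:
    """Percent-encode the input URI and attach it as the 'uris' parameter."""
    return SERVICE + "?uris=" + quote(uri, safe="")


print(build_request_url("http://rdf.freebase.com/ns/m.015fr"))
```
]]></preformat>
      <p>Passing safe="" makes quote encode the colon and slashes of the input URI as well, which matches the encoded form used in the command above.</p>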
      <p>Figure 4 illustrates the transitive and redirect URIs and shows that, with this approach, instead of several URIs the user obtains only one from DBpediaSameAs, thus providing a normalization on DBpedia URIs.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Rate the links</title>
      <p>To support link rating, a GUI was implemented that allows users to give feedback and suggestions, thereby improving the quality of the links. Rating is a quite simple process: the GUI just asks the user to rate the link with +1 if the link meets the user's expectations, or -1 if the link is wrong or some type of spam. The GUI was developed using concepts from prefix.cc<sup>7</sup> and the work of Zaveri et al. [5], such as our rating system (+1 and -1) and the standard of the web documents. Some improvements and customizations were also provided, such as the suggestions and the possibility to check the link. Figure 3 shows the moment when the user clicked on -1, indicating dislike of the link, and was asked to suggest a new URI. The field for suggesting a new link appears only when the user is not satisfied with the current link: after clicking on -1, the system asks for an optional suggestion.</p>
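      <p>The rating behaviour described above can be modelled compactly. The following is a sketch under assumptions rather than the GUI's actual data model: the class LinkRating, its field names, and the suggested URI in the usage lines are all hypothetical. It records +1 and -1 votes and accepts an optional suggestion only together with a -1 vote.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LinkRating:
    uri: str  # the URI the user supplied
    dui: str  # the DUI returned for it
    up: int = 0
    down: int = 0
    suggestions: List[str] = field(default_factory=list)

    def rate(self, vote: int, suggestion: Optional[str] = None) -> None:
        """Record a +1 or -1 vote; a suggestion is kept only with -1."""
        if vote == 1:
            self.up += 1
        elif vote == -1:
            self.down += 1
            if suggestion:
                self.suggestions.append(suggestion)
        else:
            raise ValueError("vote must be +1 or -1")

    def score(self) -> int:
        """Net rating: positive votes minus negative votes."""
        return self.up - self.down


rating = LinkRating(
    uri="http://rdf.freebase.com/ns/m.015fr",
    dui="http://dbpedia.org/resource/Brazil",
)
rating.rate(+1)
rating.rate(-1, suggestion="http://dbpedia.org/resource/Brasil")  # made-up suggestion
rating.rate(-1)
print(rating.score(), rating.suggestions)
```
]]></preformat>
      <p>Keeping the two counters separate, rather than storing only the net score, preserves the information needed for statistics about good and bad links.</p>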
    </sec>
    <sec id="sec-8">
      <title>3.3 Results</title>
      <p>The results of this work can also be expressed in numbers obtained while importing triples into the relational database, together with some results from the sameAs.org web site. A total of 62,531,487 triples were imported into our database; the whole operation took 2,220 seconds, i.e. about 28,167 triples were imported per second. The source code used to obtain the results is available in our GitHub repository<sup>8</sup>.</p>
      <p><sup>7</sup> http://prefix.cc</p>
      <p><sup>8</sup> https://github.com/firmao/dbpedia-links/blob/master/CreateDB.sh</p>
      <p><bold>3.3.1 Transitive and Redirect Links</bold></p>
      <p>Transitive and redirect links are redundancies in DBpedia: links that use the owl:sameAs property and are supposed to lead to the same place. Such a link redirects to another link, providing a transition between links, hence the name transitive. Instead of using these transitive links that point to the same final destination URI, the final destination URI is used directly. Figure 4 illustrates this.</p>
      <p>We discovered and treated 6,473,988 triples with transitive and redirect links out of 62,531,487 imported links among 142 domains inside DBpedia. Thus, 10.35% of the links can be avoided in some cases.</p>
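      <p>The two derived figures in this section, the import rate and the share of avoidable links, follow directly from the reported counts. The short check below recomputes them; the variable names are chosen only for illustration.</p>
      <preformat><![CDATA[
```python
imported_triples = 62_531_487
import_seconds = 2_220
redundant_triples = 6_473_988

# Whole triples imported per second over the full import run.
triples_per_second = imported_triples // import_seconds

# Share of imported links that are transitive or redirect links.
avoidable_percent = round(100 * redundant_triples / imported_triples, 2)

print(triples_per_second, avoidable_percent)
```
]]></preformat>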
    </sec>
    <sec id="sec-9">
      <title>3.4 Discussion</title>
      <p>DBpediaSameAs was evaluated with respect to its normalization of URIs, its link rating, and its use as a stand-alone service on the web. As a result of the normalization, a DUI is obtained in order to tackle the heterogeneity. In other words, instead of several URIs, e.g. from sameAs.org, one DUI is obtained. The link rating functionality further allows the quality of the dataset to be improved.</p>
      <p>Besides the GUI of DBpediaSameAs, a stand-alone service on the web was also developed. It provides the functionality to get a DUI without a GUI, for agents or people who do not need to use DBpediaSameAs in graphical mode, allowing its use as an off-the-shelf component.</p>
    </sec>
    <sec id="sec-10">
      <title>4. RELATED WORK</title>
      <p>Zaveri et al. [5] elaborate a data quality assessment methodology for DBpedia, which comprises a manual and a semi-automatic process. That work reinforces the concept of data quality used in our work; in our case the process is mostly manual, and it also enables us to improve the DBpedia data quality.</p>
      <p>Hassan et al. [2] present a two-stage experiment for the creation of gold standards that act as benchmarks for several interlinking algorithms. The aspects similar to our work are the validation of links and a process dubbed manual validation, where the user, i.e. the validator or evaluator, specifies whether a link generated by an interlinking tool is correct or incorrect. The results of the link validation process are used to learn presumably better link specifications and thus achieve high quality. That work also proposes an experiment to investigate the effect of user intervention in dataset interlinking on small knowledge bases.</p>
    </sec>
    <sec id="sec-11">
      <title>4.1 A related problem with sameAs.org</title>
      <p>sameAs.org is a service that is a leading source of co-reference data on the Semantic Web. As an example, we accessed the sameAs.org web site with a URI from Freebase that should bring information about the country Brazil. We used the URI http://rdf.freebase.com/ns/m.015fr as the parameter to the service and received more than 1140 URIs in return, as shown in Fig. 5, leaving the user in doubt about which one is the correct one.</p>
      <p>Our work is not an alternative to the sameAs.org web site but adds possibilities. For instance, we noticed that sameAs.org does not provide a way to rate links; with such a rating, it is possible to improve the quality of the data and make things easier for the user.</p>
    </sec>
    <sec id="sec-12">
      <title>5. CONCLUSION AND FUTURE WORKS</title>
      <p>We provided an approach to tackle heterogeneity by working with the owl:sameAs redundancies that were observed while researching co-references between different data sets. The approach provides a unique DBpedia identifier and gives users the chance to rate the resulting links and make suggestions.</p>
      <p>A proof of concept was implemented as a web system in order to present and validate our idea and every concept of this work. The source code is available<sup>9</sup>.</p>
      <p>Our results show that there are benefits when a considerable number of owl:sameAs redundancies can be avoided. Rating the links and allowing users to make link suggestions brings more quality to the repository, and the stand-alone service on the web allows DBpediaSameAs to be used in a command-line environment as well, as an off-the-shelf component.</p>
      <p>For the future we plan to: (1) study the results of link rating, which requires a period of usage of the DBpediaSameAs service in order to gather sufficient results for a proper analysis; (2) carry out an implementation case with more members of the DBpedia community and study how the approach behaves when deployed with the DBpedia community.</p>
    </sec>
    <sec id="sec-13">
      <title>6. ACKNOWLEDGMENT</title>
      <p>We would like to acknowledge the National Council for Scientific and Technological Development (CNPq)<sup>10</sup> and Universität Leipzig for their support. Special thanks to Markus Ackermann for essential help with the deployment of the application and for very good suggestions. Additionally, research activities of this paper were funded by grants from the EU's 7th &amp; H2020 Programmes for projects ALIGNED (GA 644055), GeoKnow (GA 318159) and LIDER (GA 610782).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>H.</given-names>
            <surname>Halpin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            <surname>McCusker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>McGuinness</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Thompson</surname>
          </string-name>
          .
          <article-title>When owl:sameAs isn't the same: An analysis of identity in linked data</article-title>
          . In P. F. Patel-Schneider, Y. Pan, P. Hitzler, P. Mika, L. Zhang, J. Z. Pan, I. Horrocks, and B. Glimm, editors,
          <source>International Semantic Web Conference (1)</source>
          , volume
          <volume>6496</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>305</fpage>
          -
          <lpage>320</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          .
          <article-title>Interlinking: Performance assessment of user evaluation vs. supervised learning approaches</article-title>
          . In
          <source>24th International World Wide Web Conference (WWW 2015): workshop: Linked Data on the Web (LDOW2015)</source>
          , Florence, Italy, May 18 to 22, 2015, Proceedings,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          , P. van Kleef,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Semantic Web Journal</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaidman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ruth</surname>
          </string-name>
          , and M. Hausenblas, editors.
          <source>Linked Data: Structured data on the Web</source>
          . Manning,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sherif</surname>
          </string-name>
          , L. Buhmann, M. Morsey,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>User-driven quality evaluation of DBpedia</article-title>
          . In
          <source>Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13</source>
          , pages
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          , New York, NY, USA,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>