<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Correcting Range Violation Errors in DBpedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piyawat Lertvittayakumjorn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichiseg@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <addr-line>London SW7 2AZ</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SOKENDAI (The Graduate University for Advanced Studies)</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>A range violation error is the problem that an object of a knowledge graph triple does not have a type required by the range of the triple's predicate. This paper aims to correct these erroneous triples in DBpedia by finding correct objects with the required type to replace the incorrect objects. Our approach is based on graph analysis and keyword matching. It also exploits information from the incorrect objects because, despite their incorrectness, they contain useful clues for finding the correct objects. Experimental results show that our proposed approach outperforms various baseline methods, including entity search (e.g., DBpedia Lookup) and knowledge graph completion (TransE and AMIE+).</p>
      </abstract>
      <kwd-group>
        <kwd>DBpedia</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Data Quality</kwd>
        <kwd>Error Correction</kwd>
        <kwd>Range Violation Error</kwd>
<kwd>Knowledge Graph Refinement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>DBpedia is a large knowledge graph extracted from structured data in Wikipedia.
Thanks to its large scale and multi-disciplinary character, DBpedia has become
a nucleus of the Linked Open Data (LOD) project, to which numerous linked
data resources connect. However, DBpedia is not free of errors, for many reasons
such as human errors and inconsistencies within Wikipedia, which is maintained
by thousands of contributors. One major type of error is the problem that an
object of a triple does not have a type required by the range of the triple's
predicate [2]. We call this error type a range violation error (RVE). Currently,
18.7% of DBpedia triples whose predicate is a mapping-based object property
with a defined range suffer from this kind of error. For example, the triple
&lt;dbr:Sedo, dbo:locationCountry, dbr:Cologne&gt; in DBpedia is erroneous
because the predicate dbo:locationCountry requires an object with the type
dbo:Country, which dbr:Cologne lacks since Cologne is a city,</p>
    </sec>
    <sec id="sec-2">
<title>Prefixes: dbr = http://dbpedia.org/resource/, dbo = http://dbpedia.org/ontology/</title>
<p>not a country. This inconsistency could undermine the effectiveness of any
application using DBpedia. To correct this error, the object dbr:Cologne should
be replaced by dbr:Germany, the country where Cologne is located.</p>
<p>Actually, there are several strategies to fix an RVE, depending on its root
cause, such as adding the missing type to the object or refining the range
indicated in the ontology. Nevertheless, in this work, we aim to solve the RVE
automatically by finding a correct object to replace the incorrect object for
each of the erroneous triples. This makes a significant impact on the research
field because (1) it complements existing works on knowledge graph refinement
which follow other strategies [4-6] and (2), based on our investigation, fixing by
replacing objects is applicable to more than 62.8% of all RVEs in DBpedia.</p>
<p>Formally, our problem formulation is: "Given an erroneous triple t = ⟨s, p, o⟩
in DBpedia where (1) p is a DBpedia object property with the range rp and (2)
o is an incorrect object which has at least one DBpedia ontology class (dbo) as
its type but does not have rp as its type, find a semantically correct object o′
that has the type rp and best replaces the incorrect object o in ⟨s, p, o⟩."</p>
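The range-violation condition in the formulation above can be sketched as a small check, using toy stand-ins for DBpedia's type assertions and declared ranges (the `RANGES` and `TYPES` dictionaries are illustrative, not DBpedia's actual API):

```python
# Toy data: predicate -> declared range, entity -> dbo types (illustrative only).
RANGES = {"dbo:locationCountry": "dbo:Country"}
TYPES = {
    "dbr:Cologne": {"dbo:City", "dbo:Place"},
    "dbr:Germany": {"dbo:Country", "dbo:Place"},
}

def has_range_violation(s, p, o):
    """A triple <s, p, o> violates the range constraint when the object o
    lacks the type declared as the range of the predicate p."""
    required = RANGES.get(p)
    if required is None:          # no declared range -> nothing to violate
        return False
    return required not in TYPES.get(o, set())

# The running example from the text: Cologne is a city, not a country.
assert has_range_violation("dbr:Sedo", "dbo:locationCountry", "dbr:Cologne")
assert not has_range_violation("dbr:Sedo", "dbo:locationCountry", "dbr:Germany")
```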
      <sec id="sec-2-1">
        <title>Our Approach</title>
<p>It is clear that only objects with the type rp could be the correct answer, so
the complete search space of this problem (Sp) is the set of all entities with the
type rp. However, Sp is enormous for some properties p. Therefore, our approach (i)
constructs a reduced search space (St) that contains only the entities related to
the erroneous triple t and then (ii) calculates scores for the entities in St.</p>
        <sec id="sec-2-1-1">
          <title>Constructing a Reduced Search Space</title>
<p>St is constructed as the union of three portions – St,1, St,2, and St,3. First,
we define that p′ is a related property of p if and only if there exists at least
one pair (x, y), y ∈ Sp, such that both ⟨x, p, y⟩ and ⟨x, p′, y⟩ are in DBpedia. After
that, we create St,1, storing all entities in Sp that are linked to s by at least one
related property of p. In some cases, s may have p′-links to objects (entities or
literals) that lack the type rp, but these objects may still give us hints about the
correct object. So, if the conditional probability P(⟨x, p, y⟩ | ⟨x, p′, y⟩, y ∈ Sp) is larger
than a threshold (set at 0.9 in the experiment), we transform the objects into
clue texts stored in Ct, in addition to the label of the incorrect object o, which
is always a clue text in any case. For each clue text c ∈ Ct, we tokenize c into a
set of keywords Kc. Then we create St,2, storing all entities in Sp whose abstract
contains at least one keyword from any c ∈ Ct. Last but not least, we create St,3,
which collects all entities in Sp that connect immediately to the incorrect object
o in either direction. Finally, we merge the three portions (St,1, St,2, St,3) to form St.</p>
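The three-portion construction above can be sketched as follows, assuming the knowledge graph is given as a set of (subject, predicate, object) tuples and that the related properties and clue keywords have already been derived; all function and variable names are illustrative, not the authors' actual code:

```python
def build_search_space(kg, t, S_p, related_props, clue_keywords, abstracts):
    """Return S_t = S_t,1 | S_t,2 | S_t,3 for the erroneous triple t = (s, p, o)."""
    s, p, o = t
    # S_t,1: entities of the required type linked to s by a related property of p
    St1 = {y for (x, q, y) in kg if x == s and q in related_props and y in S_p}
    # S_t,2: entities whose abstract contains at least one clue keyword
    St2 = {e for e in S_p
           if any(w in abstracts.get(e, "") for w in clue_keywords)}
    # S_t,3: entities directly connected to the incorrect object o, either direction
    St3 = ({x for (x, q, y) in kg if y == o and x in S_p}
           | {y for (x, q, y) in kg if x == o and y in S_p})
    return St1 | St2 | St3
```

Keeping the three portions separate before the final union is also useful later: the tie-breaking step counts how many portions cover each candidate.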
        </sec>
        <sec id="sec-2-1-2">
          <title>Calculating Scores</title>
<p>We propose two scoring methods to evaluate the likelihood that a candidate object
e ∈ St is the correct object of the triple t.</p>
<p>Method 1: Graph Method. Intuitively, the correct object o′ should be strongly
related to o compared to other entities in St, and the more related two entities
are, the more objects connect them. Therefore, our scoring function based
on graph analysis is g(e) = |A(o, e)| + b(e), where e ∈ St, A(o, e) is the set of
entities that have direct links to both o and e regardless of the links' direction,
and b(e) = 1 if e links immediately to the incorrect object o; otherwise, b(e) = 0.
Method 2: Keyword Method. We develop the keyword method to find e ∈ St
which the clue texts in Ct support. The scoring function of this method is
m(e) = Σ_{c ∈ Ct} (|{w ∈ Kc : w is in abs(e) ∧ cap(w) ∧ w is in prof(o)}| + 1) / (|Kc| + 1) + r(e).
The score of an entity e is calculated by aggregating scores of e with respect
to each clue text c ∈ Ct. The score with respect to c reflects the proportion of
keywords in c that are found in the abstract of e. To ensure that the keywords
refer to a named entity which is the replacement of o, we count only keywords w
that begin with a capital letter (cap(w)) and are related to o (w is in the profile
of o). Additionally, the term r(e) is a bonus point which is 1 only if (1)
there exists p′ such that ⟨s, p′, e⟩ is in DBpedia and P(⟨x, p, y⟩ | ⟨x, p′, y⟩, y ∈ Sp)
exceeds the threshold (0.9), and (2) e is in the profile of o.
Satisfying these conditions is equivalent to the match of one clue text where the
clue is an entity in the search space St.</p>
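A minimal sketch of the two scoring functions, assuming the graph is a set of (subject, predicate, object) tuples and that the abstract, profile, and bonus term r(e) are supplied precomputed; the helper names are illustrative, not the authors' implementation:

```python
def neighbors(kg, v):
    """Entities directly linked to v, ignoring link direction."""
    return ({x for (x, q, y) in kg if y == v}
            | {y for (x, q, y) in kg if x == v})

def graph_score(kg, o, e):
    """Method 1: g(e) = |A(o, e)| + b(e)."""
    shared = neighbors(kg, o) & neighbors(kg, e)   # A(o, e)
    bonus = 1 if e in neighbors(kg, o) else 0      # b(e)
    return len(shared) + bonus

def keyword_score(clue_texts, abstract_e, profile_o, bonus_r):
    """Method 2: m(e), summing a smoothed keyword-match fraction per clue text."""
    total = 0.0
    for c in clue_texts:
        Kc = set(c.split())                        # naive tokenization of clue c
        hits = {w for w in Kc
                if w in abstract_e                 # w appears in abs(e)
                and w[:1].isupper()                # cap(w)
                and w in profile_o}                # w is in prof(o)
        total += (len(hits) + 1) / (len(Kc) + 1)
    return total + bonus_r                         # r(e), supplied by the caller
```

The add-one smoothing in both numerator and denominator keeps candidates with zero matching keywords from collapsing to a zero score for every clue text.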
<p>Ranking Candidate Objects. After calculating the scores (using g or m),
we can rank the candidate objects and select the one with the highest score to
replace o in t. In case of equal scores, we use two further criteria to prioritize
candidate objects – (1) the number of portions of St that cover the candidate,
i.e., |{i : e ∈ St,i}|, and (2) the number of p in-links to the candidate.</p>
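The score-then-tie-break ordering above can be sketched with a lexicographic sort key, assuming `score`, `portions`, and `p_inlinks` are callables supplying the three values per candidate (illustrative names, not the authors' code):

```python
def rank_candidates(candidates, score, portions, p_inlinks):
    """Rank candidates by score (descending); break ties first by the number
    of portions of S_t covering the candidate, then by the number of p in-links."""
    return sorted(candidates,
                  key=lambda e: (score(e), portions(e), p_inlinks(e)),
                  reverse=True)

# Usage with toy per-candidate statistics; dict.get serves as the lookup callable.
scores = {"a": 2.0, "b": 2.0, "c": 1.0}
portions = {"a": 1, "b": 3, "c": 2}
inlinks = {"a": 5, "b": 0, "c": 9}
ranked = rank_candidates(["a", "b", "c"], scores.get, portions.get, inlinks.get)
# "b" beats "a" despite equal scores because it is covered by more portions.
```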
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Experiment</title>
<p>We tested our approach on four manually created datasets from four
object properties – dbo:locationCountry, dbo:formerTeam, dbo:employer, and
dbo:birthPlace. Each dataset contains one hundred erroneous triples
⟨s, p, o⟩ of a particular property together with their correct objects o′. We
compare our approach to four baseline methods. Two are normally used for
entity search (DBpedia Lookup and dbo:wikiPageDisambiguates), finding
entities that have the required type rp and could be the correct object o′ for a
given query o. The other two baseline methods were originally devised for
knowledge graph completion (TransE [1] and AMIE+ [3]), finding the correct object
given the subject s and the property p.</p>
<p>The results are presented in Table 1. M, @1, and @10 stand for three
evaluation metrics – mean reciprocal rank (MRR), HITS@1, and HITS@10,
respectively. All of them were calculated using the ranks of the correct objects provided</p>
      </sec>
    </sec>
    <sec id="sec-3">
<title>DBpedia Lookup: http://wiki.dbpedia.org/projects/dbpedia-lookup</title>
<p>by each method. It is noticeable that both of our methods outperform the baseline
methods on all datasets. However, the graph method is more effective when an
incorrect object corresponds to only one entity in Sp (as in the locationCountry and
employer datasets), because there are relatively many objects connecting this
entity pair. Conversely, if an incorrect object is related to more than one entity
in Sp, information from the graph structure alone is not sufficient to find the
correct object. The keyword method, in contrast, is more effective in such cases
(e.g., the formerTeam dataset) since it also exploits the information from s and p.</p>
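The three evaluation metrics can be sketched as follows, assuming `ranks` holds the 1-based rank of the correct object for each test triple (illustrative helper names, not the authors' evaluation code):

```python
def mrr(ranks):
    """Mean reciprocal rank over the 1-based ranks of the correct objects."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(k, ranks):
    """HITS@k: fraction of triples whose correct object is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# e.g. ranks = [1, 3, 12]: HITS@1 counts only the first triple,
# HITS@10 the first two, and MRR rewards the high ranks most.
```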
      <sec id="sec-3-1">
        <title>Conclusion</title>
<p>This paper aims to fix range violation errors in DBpedia automatically by finding
correct objects to replace the incorrect objects. We developed an algorithm to
construct a small search space of candidate objects and two scoring methods to
evaluate the candidates. By exploiting information from all components of the
erroneous triples, our proposed approach is effective at finding the correct objects,
as demonstrated in the experiment. For future work, we plan to apply this idea
to fix similar errors in other knowledge graphs such as Wikidata and NELL.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>1. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: 26th NIPS, pp. 2787-2795 (2013)</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>2. Dimou, A., Kontokostas, D., Freudenberg, M., Verborgh, R., Lehmann, J., Mannens, E., Hellmann, S., Van de Walle, R.: Assessing and refining mappings to RDF to improve dataset quality. In: 14th ISWC, pp. 133-149 (2015)</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>3. Galarraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in ontological knowledge bases with AMIE+. The VLDB Journal 24(6), 707-730 (2015)</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>4. Paulheim, H.: Data-driven joint debugging of the DBpedia mappings and ontology. In: 14th ESWC (2017)</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>5. Paulheim, H., Bizer, C.: Improving the quality of linked data using statistical distributions. IJSWIS 10(2), 63-86 (2014)</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>6. Tonon, A., Catasta, M., Demartini, G., Cudre-Mauroux, P.: Fixing the domain and range of properties in linked data by context disambiguation. In: LDOW (2015)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>