<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Correcting Range Violation Errors in DBpedia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Piyawat Lertvittayakumjorn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natthawut Kertkeidkachorn</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ryutaro Ichise</string-name>
          <email>ichiseg@nii.ac.jp</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing, Imperial College London</institution>
          ,
          <addr-line>London SW7 2AZ</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Institute of Informatics</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SOKENDAI (The Graduate University for Advanced Studies)</institution>
          ,
          <addr-line>Tokyo 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
<p>A range violation error is the problem that an object of a knowledge graph triple does not have a type required by the range of the triple's predicate. This paper aims to correct these erroneous triples in DBpedia by finding correct objects with the required type to replace the incorrect objects. Our approach is based on graph analysis and keyword matching. It also exploits information from the incorrect objects because, despite their incorrectness, they contain useful clues for finding the correct objects. Experimental results show that our proposed approach outperforms various baseline methods, including entity search (e.g., DBpedia Lookup) and knowledge graph completion (TransE and AMIE+).</p>
      </abstract>
      <kwd-group>
        <kwd>DBpedia</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Data Quality</kwd>
        <kwd>Error Correction</kwd>
        <kwd>Range Violation Error</kwd>
<kwd>Knowledge Graph Refinement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
<p>DBpedia is a large knowledge graph extracted from structured data in Wikipedia.
Thanks to its large scale and multi-disciplinary character, DBpedia has become
a nucleus of the Linked Open Data (LOD) project, to which numerous linked
data resources connect. However, DBpedia is not free of errors, for many reasons
such as human errors and inconsistencies within Wikipedia, which is maintained
by thousands of contributors. One major type of error is the problem that an
object of a triple does not have a type required by the range of the triple's
predicate [2]. We call this error type a range violation error (RVE). Currently,
18.7% of DBpedia triples whose predicate is a mapping-based object property
with a defined range suffer from this kind of error. For example, the triple
&lt;dbr:Sedo, dbo:locationCountry, dbr:Cologne&gt; in DBpedia is erroneous
because the predicate dbo:locationCountry requires an object with the type
dbo:Country, which dbr:Cologne lacks since Cologne is a city,</p>
    </sec>
    <sec id="sec-2">
<title>Prefixes: dbr = http://dbpedia.org/resource/, dbo = http://dbpedia.org/ontology/</title>
<p>not a country. This inconsistency could undermine the effectiveness of any
application using DBpedia. To correct this error, the object dbr:Cologne should
be replaced by dbr:Germany, the country where Cologne is located.</p>
<p>Actually, there are several strategies to fix an RVE, depending on its root
cause, such as adding the missing type to the object or refining the range
indicated in the ontology. Nevertheless, in this work, we aim to solve the RVE
automatically by finding a correct object to replace the incorrect object for
each of the erroneous triples. This makes a significant impact on the research
field because (1) it complements existing works on knowledge graph refinement
which follow other strategies [4-6] and (2), based on our investigation, fixing by
replacing objects is applicable to more than 62.8% of all RVEs in DBpedia.</p>
<p>Formally, our problem formulation is: "Given an erroneous triple t = ⟨s, p, o⟩
in DBpedia where (1) p is a DBpedia object property with the range rp and (2)
o is an incorrect object which has at least one DBpedia ontology class (dbo) as
its type but does not have rp as its type, find a semantically correct object o′
that has the type rp and best replaces the incorrect object o in ⟨s, p, o⟩."</p>
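The range-violation condition in the formulation above can be sketched as a small check, using toy stand-ins for DBpedia's type assertions and declared ranges (the `RANGES` and `TYPES` dictionaries are illustrative, not DBpedia's actual API):

```python
# Toy data: predicate -> declared range, entity -> dbo types (illustrative only).
RANGES = {"dbo:locationCountry": "dbo:Country"}
TYPES = {
    "dbr:Cologne": {"dbo:City", "dbo:Place"},
    "dbr:Germany": {"dbo:Country", "dbo:Place"},
}

def has_range_violation(s, p, o):
    """A triple <s, p, o> violates the range constraint when the object o
    lacks the type declared as the range of the predicate p."""
    required = RANGES.get(p)
    if required is None:          # no declared range -> nothing to violate
        return False
    return required not in TYPES.get(o, set())

# The running example from the text: Cologne is a city, not a country.
assert has_range_violation("dbr:Sedo", "dbo:locationCountry", "dbr:Cologne")
assert not has_range_violation("dbr:Sedo", "dbo:locationCountry", "dbr:Germany")
```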
      <sec id="sec-2-1">
        <title>Our Approach</title>
<p>It is clear that only objects with the type rp could be the correct answer, so
the complete search space of this problem (Sp) is the set of all entities with the
type rp. However, Sp is enormous for some properties p. Therefore, our approach (i)
constructs a reduced search space (St) that contains only the entities related to
the erroneous triple t and then (ii) calculates scores for the entities in St.</p>
        <sec id="sec-2-1-1">
          <title>Constructing a Reduced Search Space</title>
<p>St is constructed as the union of three portions – St,1, St,2, and St,3. First,
we define that p′ is a related property of p if and only if there exists at least
one pair (x, y), y ∈ Sp, such that both ⟨x, p, y⟩ and ⟨x, p′, y⟩ are in DBpedia. After
that, we create St,1, storing all entities in Sp that are linked to s by at least one
related property of p. In some cases, s may have p′-links to objects (entities or
literals) that lack the type rp, but these objects may still give us hints about the
correct object. So, if the conditional probability P(⟨x, p, y⟩ | ⟨x, p′, y⟩, y ∈ Sp) is larger
than a threshold (set at 0.9 in the experiment), we transform the objects into
clue texts stored in Ct, in addition to the label of the incorrect object o, which
is always a clue text in any case. For each clue text c ∈ Ct, we tokenize c into a
set of keywords Kc. Then we create St,2, storing all entities in Sp whose abstract
contains at least one keyword from any c ∈ Ct. Last but not least, we create St,3,
which collects all entities in Sp that connect immediately to the incorrect object
o in either direction. Finally, we merge the three portions (St,1, St,2, St,3) to form St.</p>
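The three-portion construction above can be sketched as follows, assuming the knowledge graph is given as a set of (subject, predicate, object) tuples and that the related properties and clue keywords have already been derived; all function and variable names are illustrative, not the authors' actual code:

```python
def build_search_space(kg, t, S_p, related_props, clue_keywords, abstracts):
    """Return S_t = S_t,1 | S_t,2 | S_t,3 for the erroneous triple t = (s, p, o)."""
    s, p, o = t
    # S_t,1: entities of the required type linked to s by a related property of p
    St1 = {y for (x, q, y) in kg if x == s and q in related_props and y in S_p}
    # S_t,2: entities whose abstract contains at least one clue keyword
    St2 = {e for e in S_p
           if any(w in abstracts.get(e, "") for w in clue_keywords)}
    # S_t,3: entities directly connected to the incorrect object o, either direction
    St3 = ({x for (x, q, y) in kg if y == o and x in S_p}
           | {y for (x, q, y) in kg if x == o and y in S_p})
    return St1 | St2 | St3
```

Keeping the three portions separate before the final union is also useful later: the tie-breaking step counts how many portions cover each candidate.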
        </sec>
        <sec id="sec-2-1-2">
          <title>Calculating Scores</title>
<p>We propose two scoring methods to evaluate the likelihood that a candidate object
e ∈ St is the correct object of the triple t.</p>
<p>Method 1: Graph Method. Intuitively, the correct object o′ should be strongly
related to o compared to other entities in St, and the more related two entities
are, the more objects connect them. Therefore, our scoring function based
on graph analysis is g(e) = |A(o, e)| + b(e), where e ∈ St, A(o, e) is the set of
entities that have direct links to both o and e regardless of the links' direction,
and b(e) = 1 if e links immediately to the incorrect object o; otherwise, b(e) = 0.
Method 2: Keyword Method. We develop the keyword method to find e ∈ St
which the clue texts in Ct support. The scoring function of this method is
m(e) = Σ_{c ∈ Ct} (|{w ∈ Kc : w is in abs(e) ∧ cap(w) ∧ w is in prof(o)}| + 1) / (|Kc| + 1) + r(e).
The score of an entity e is calculated by aggregating scores of e with respect
to each clue text c ∈ Ct. The score with respect to c reflects the proportion of
keywords in c that are found in the abstract of e. To ensure that the keywords
refer to a named entity which is the replacement of o, we count only keywords w
that begin with a capital letter (cap(w)) and are related to o (w is in the profile
of o). Additionally, the term r(e) is a bonus point which is 1 only if (1)
there exists p′ such that ⟨s, p′, e⟩ is in DBpedia and P(⟨x, p, y⟩ | ⟨x, p′, y⟩, y ∈ Sp)
exceeds the threshold (0.9), and (2) e is in the profile of o.
Satisfying these conditions is equivalent to the match of one clue text where the
clue is an entity in the search space St.</p>
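A minimal sketch of the two scoring functions, assuming the graph is a set of (subject, predicate, object) tuples and that the abstract, profile, and bonus term r(e) are supplied precomputed; the helper names are illustrative, not the authors' implementation:

```python
def neighbors(kg, v):
    """Entities directly linked to v, ignoring link direction."""
    return ({x for (x, q, y) in kg if y == v}
            | {y for (x, q, y) in kg if x == v})

def graph_score(kg, o, e):
    """Method 1: g(e) = |A(o, e)| + b(e)."""
    shared = neighbors(kg, o) & neighbors(kg, e)   # A(o, e)
    bonus = 1 if e in neighbors(kg, o) else 0      # b(e)
    return len(shared) + bonus

def keyword_score(clue_texts, abstract_e, profile_o, bonus_r):
    """Method 2: m(e), summing a smoothed keyword-match fraction per clue text."""
    total = 0.0
    for c in clue_texts:
        Kc = set(c.split())                        # naive tokenization of clue c
        hits = {w for w in Kc
                if w in abstract_e                 # w appears in abs(e)
                and w[:1].isupper()                # cap(w)
                and w in profile_o}                # w is in prof(o)
        total += (len(hits) + 1) / (len(Kc) + 1)
    return total + bonus_r                         # r(e), supplied by the caller
```

The add-one smoothing in both numerator and denominator keeps candidates with zero matching keywords from collapsing to a zero score for every clue text.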
<p>Ranking Candidate Objects. After calculating the scores (using g or m),
we can rank the candidate objects and select the one with the highest score to
replace o in t. In case of equal scores, we use two further criteria to prioritize
candidate objects – (1) the number of portions of St that cover the candidate,
i.e., |{i : e ∈ St,i}|, and (2) the number of p in-links to the candidate.</p>
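The score-then-tie-break ordering above can be sketched with a lexicographic sort key, assuming `score`, `portions`, and `p_inlinks` are callables supplying the three values per candidate (illustrative names, not the authors' code):

```python
def rank_candidates(candidates, score, portions, p_inlinks):
    """Rank candidates by score (descending); break ties first by the number
    of portions of S_t covering the candidate, then by the number of p in-links."""
    return sorted(candidates,
                  key=lambda e: (score(e), portions(e), p_inlinks(e)),
                  reverse=True)

# Usage with toy per-candidate statistics; dict.get serves as the lookup callable.
scores = {"a": 2.0, "b": 2.0, "c": 1.0}
portions = {"a": 1, "b": 3, "c": 2}
inlinks = {"a": 5, "b": 0, "c": 9}
ranked = rank_candidates(["a", "b", "c"], scores.get, portions.get, inlinks.get)
# "b" beats "a" despite equal scores because it is covered by more portions.
```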
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Experiment</title>
<p>We tested our approach on four manually created datasets from four
object properties – dbo:locationCountry, dbo:formerTeam, dbo:employer, and
dbo:birthPlace. Each dataset contains one hundred erroneous triples
⟨s, p, o⟩ of a particular property together with their correct objects o′. We
compare our approach to four baseline methods. Two are normally used for
entity search (DBpedia Lookup and dbo:wikiPageDisambiguates), finding
entities that have the required type rp and could be the correct object o′ for a
given query o. The other two baseline methods were originally devised for
knowledge graph completion (TransE [1] and AMIE+ [3]), finding the correct object
given the subject s and the property p.</p>
<p>The results are presented in Table 1. M, @1, and @10 stand for three
evaluation metrics – mean reciprocal rank (MRR), HITS@1, and HITS@10,
respectively. All of them were calculated using the ranks of the correct objects provided</p>
      </sec>
    </sec>
    <sec id="sec-3">
<title>DBpedia Lookup: http://wiki.dbpedia.org/projects/dbpedia-lookup</title>
<p>by each method. It is noticeable that both of our methods outperform the baseline
methods on all datasets. However, the graph method is more effective when an
incorrect object corresponds to only one entity in Sp (as in the locationCountry and
employer datasets), because there are relatively many objects connecting this
entity pair. Conversely, if an incorrect object is related to more than one entity
in Sp, information from the graph structure alone is not sufficient to find the
correct object. The keyword method, in contrast, is more effective in such cases
(e.g., the formerTeam dataset) since it also exploits the information from s and p.</p>
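The three evaluation metrics can be sketched as follows, assuming `ranks` holds the 1-based rank of the correct object for each test triple (illustrative helper names, not the authors' evaluation code):

```python
def mrr(ranks):
    """Mean reciprocal rank over the 1-based ranks of the correct objects."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(k, ranks):
    """HITS@k: fraction of triples whose correct object is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# e.g. ranks = [1, 3, 12]: HITS@1 counts only the first triple,
# HITS@10 the first two, and MRR rewards the high ranks most.
```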
      <sec id="sec-3-1">
        <title>Conclusion</title>
<p>This paper aims to fix range violation errors in DBpedia automatically by finding
correct objects to replace the incorrect objects. We developed an algorithm to
construct a small search space of candidate objects and two scoring methods to
evaluate the candidates. By exploiting information from all components of the
erroneous triples, our proposed approach is effective at finding the correct objects,
as demonstrated in the experiment. For future work, we plan to apply this idea
to fix similar errors in other knowledge graphs such as Wikidata and NELL.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>1. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: 26th NIPS, pp. 2787-2795 (2013)</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>2. Dimou, A., Kontokostas, D., Freudenberg, M., Verborgh, R., Lehmann, J., Mannens, E., Hellmann, S., Van de Walle, R.: Assessing and refining mappings to RDF to improve dataset quality. In: 14th ISWC, pp. 133-149 (2015)</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>3. Galarraga, L., Teflioudi, C., Hose, K., Suchanek, F.M.: Fast rule mining in ontological knowledge bases with AMIE+. The VLDB Journal 24(6), 707-730 (2015)</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>4. Paulheim, H.: Data-driven joint debugging of the DBpedia mappings and ontology. In: 14th ESWC (2017)</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>5. Paulheim, H., Bizer, C.: Improving the quality of linked data using statistical distributions. IJSWIS 10(2), 63-86 (2014)</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>6. Tonon, A., Catasta, M., Demartini, G., Cudre-Mauroux, P.: Fixing the domain and range of properties in linked data by context disambiguation. In: LDOW (2015)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>