An Approach to Correction of Erroneous Links in Knowledge Graphs

André Melo, University of Mannheim, andre@informatik.uni-mannheim.de
Heiko Paulheim, University of Mannheim, heiko@informatik.uni-mannheim.de

K-CAP2017 Workshops and Tutorials Proceedings, 2017. ©Copyright held by the owner/author(s).

ABSTRACT
Enriching and cleaning datasets are tasks relevant to most large-scale knowledge graphs, which are known to be noisy and incomplete. In this paper, we propose an approach to fix wrong facts which originate from confusion between entities. This is a common source of errors in Wikipedia, from which DBpedia and YAGO are extracted, and in open information extraction systems, e.g. NELL. Our approach generates candidate subjects and objects to fix triples by exploiting disambiguation links and approximate string matching. We run our approach on DBpedia and NELL and perform a manual evaluation of the generated corrections.

CCS CONCEPTS
• Information systems → Data cleaning; • Computing methodologies → Semantic networks; Statistical relational learning; Machine learning;

KEYWORDS
Knowledge graph completion, error detection, data quality

1 INTRODUCTION
Knowledge graphs are known to be both often incomplete and incorrect. Several link prediction and error detection methods have been proposed; however, few of them explicitly focus on error correction or address the problem of choosing which absent facts should be added to the knowledge graph.

The problem is that the number of possible relation assertions grows quadratically with the number of instances: $n_c = n_i^2 n_r - n_f$, where $n_i$ is the number of instances, $n_r$ the number of relations, and $n_f$ the number of existing facts in the graph. For large datasets such as DBpedia, Wikidata and YAGO, computing the confidence score of all these facts is challenging. While pruning possible facts which violate ontology constraints, especially domain and range restrictions of relations, can significantly reduce the search space, the problem remains very challenging: in DBpedia (2016-10), $n_c \approx 4.4 \times 10^{17}$ facts, and filtering the triples which violate domain and range restrictions only reduces this to $n_c \approx 2.8 \times 10^{17}$, which is still far too large to compute confidence scores for all candidates.
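As a quick back-of-the-envelope check, the following Python snippet evaluates $n_c$ for illustrative counts; the numbers below are placeholders of roughly the right order of magnitude, not the exact DBpedia statistics.

```python
# Size of the candidate fact space: n_c = n_i^2 * n_r - n_f.
# Illustrative placeholder counts, not exact DBpedia figures.
n_i = 20_000_000   # instances
n_r = 1_100        # relations
n_f = 400_000_000  # existing facts

n_c = n_i ** 2 * n_r - n_f
print(f"candidate facts: {n_c:.1e}")  # on the order of 10^17
```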
A promising approach for enriching a KG with some of its missing facts is the correction of erroneous facts, since some of the wrong facts which exist in a KG can be corrected. Such errors originate from some problem in the knowledge acquisition process, or in the source data. A good example of the latter are errors in Wikipedia, which serves as the main source of DBpedia and YAGO: if any of the links in an infobox are wrong, the extraction process generates a wrong fact. NELL, on the other hand, has text as its main source of information, and in many cases the source text contains correct information which is not extracted correctly.

In both cases it is common that an instance is confused with another one of a similar name (i.e., label or IRI). For example, the fact formerTeam(Alan_Ricard, Buffalo_Bill) is an erroneous fact from DBpedia which originates from a typo in Wikipedia: when referring to the NFL team Buffalo_Bills, the s was missing; therefore, the NFL team was confused with the character Buffalo_Bill. In NELL, the entity insect_raccoon exists because of problems when extracting the fact that raccoons prey on insects, and it is confused with mammal_raccoon.

Some relation assertion error detection approaches, such as PaTyBRED [4] and SDValidate [13], rely on type information. Since erroneous type assertions are also a common problem, correct relation assertions involving an instance with incorrect or incomplete types may be wrongly identified as erroneous. Therefore, combining such methods with type prediction [6, 12] is beneficial to rule out cases where the error is detected due to a missing or incorrect type of the subject or object rather than due to an erroneous relation assertion.

It is thus relevant to carefully analyze detected errors, identify the source of each error, and, if possible, correct them. In this paper, we propose CoCKG (Correction of Confusions in Knowledge Graphs), an automatic correction approach which resolves relation assertion errors caused by instance confusion. The approach relies on error detection methods as well as type predictors to assess the confidence of the corrected facts. It uses approximate string matching and exploits both searching for entities with similar IRIs and Wikipedia disambiguation pages (if available) to find candidate instances for correcting the facts.

2 RELATED WORK
In the knowledge graph context, there are mainly error detection and link prediction approaches. Both are closely related to our problem: while error detection deletes wrong triples, link prediction aims at adding new triples. In both cases, the approaches learn a KG model which is capable of assigning confidence values to triples.

KG embedding models [1, 10, 14] are currently the best performing approaches in the link prediction task. Other models rely on directly observed features, e.g., the path ranking algorithm (PRA) [3]. A more detailed description of link prediction methods can be found in [9]. It is important to note that none of the link prediction approaches mentioned address the problem of covering the candidate triple space (of size $n_c$, as discussed in the introduction). Our approach, on the other hand, exploits the assumption that erroneous facts often have a corresponding correct fact in order to reduce that space. Error detection approaches, such as SDValidate and PaTyBRED, focus on the detection of already existing erroneous triples. It has been shown that state-of-the-art embeddings perform worse than PaTyBRED in the error detection task [5]. A survey covering various KG refinement methods can be found in [11].

Rule-based systems, such as AMIE [2], cannot assign scores to arbitrary triples. However, they could be used to restrict the $n_c$ search space by identifying high-confidence soft rules and using the missing facts of instances where a rule does not hold as candidates. Combining them with the previously mentioned KG models would be an interesting line of research; however, it is out of the scope of this paper.

Wang et al. [15] studied the problem of erroneous links in Wikipedia, which is also the source of many errors in DBpedia. They model the Wikipedia links as a weighted directed mono-relational graph and propose the LinkRank algorithm, which is similar to PageRank, but instead of ranking the nodes (entities), it ranks the links. They use LinkRank to generate candidates for link correction and use textual features from article descriptions to learn an SVM classifier that detects errors and chooses the best candidate for correction. While this is a closely related problem, whose solution can help mitigate the problem studied in this paper, their method cannot be directly applied to arbitrary knowledge graphs. Our approach takes advantage of the multi-relational nature of KGs, entity types, ontological information, and the graph structure.

3 PROPOSED APPROACH
Our approach consists of first running an error detection algorithm (PaTyBRED in the case of this paper) and selecting the top-k facts most likely to be wrong. In the next step, each detected error is heuristically verified, with the help of a type predictor tp, to be an actual relation assertion error and not one caused by missing type assertions on the subject or object. In the final step, candidate entities are retrieved, and if any of the candidates significantly improves the likelihood of the triple being right, we replace it with that candidate. The function correct_triples in Algorithm 1 gives an overview of how CoCKG works. The parameter K is the set of all triples in the knowledge graph, Terr is the set of triple and confidence pairs generated by the error detection model ed, tp is the type predictor, mc is the minimum confidence threshold, and mcg the minimum confidence gain threshold, i.e. the minimum ratio of the new and old triple scores. In the next subsections we discuss the individual parts in more detail.

Algorithm 1 Knowledge base correction process
 1: function correct_triples(K, Terr, ed, tp, mc, mcg)
 2:   Tcorr ← ∅
 3:   for (t, score_t) ∈ Terr do
 4:     (s, p, o) ← t
 5:     s_tp ← predict_types(tp, s)
 6:     o_tp ← predict_types(tp, o)
 7:     if ¬(conf_nt(ed, t, s, s_tp) ∨ conf_nt(ed, t, o, o_tp)) then
 8:       s_cand ← get_candidates(s)
 9:       o_cand ← get_candidates(o)
10:       Tcand ← {(s_i, p, o) | s_i ∈ s_cand} ∪ {(s, p, o_i) | o_i ∈ o_cand}
11:       Tcand ← Tcand − K
12:       c_best, max_conf ← nil, conf
13:       for c ∈ Tcand do
14:         if s ∈ domain(p) ∧ o ∈ range(p) then
15:           score_c ← conf(ed, c)
16:           if score_c ≥ mc ∧ score_c / score_t ≥ mcg then
17:             c_best, max_conf ← c, score_c
18:           end if
19:         end if
20:       end for
21:       if c_best ≠ nil then
22:         Tcorr ← Tcorr ∪ {(c_best, t)}
23:       end if
24:     end if
25:   end for
26:   return Tcorr
27: end function
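For concreteness, the following Python sketch mirrors the control flow of Algorithm 1. The error detector ed, type predictor tp, candidate retrieval, ontology checks and conf_nt are passed in as callables; their names and signatures are our own placeholders, and we read lines 12–17 as keeping the highest-scoring candidate that satisfies both thresholds.

```python
# A minimal sketch of Algorithm 1. All injected callables (ed, tp,
# get_candidates, in_domain, in_range, conf_nt) are placeholders, not a
# real API; ed.conf(triple) is assumed to return a confidence in [0, 1].
def correct_triples(K, T_err, ed, tp, mc, mcg,
                    get_candidates, in_domain, in_range, conf_nt):
    T_corr = []
    for t, score_t in T_err:
        s, p, o = t
        # Rule out errors explained by missing or wrong types (Section 3.1).
        s_types, o_types = tp.predict_types(s), tp.predict_types(o)
        if conf_nt(ed, t, s, s_types) or conf_nt(ed, t, o, o_types):
            continue
        # Candidate triples: replace subject OR object, never both (Section 3.3).
        cands = {(si, p, o) for si in get_candidates(s)}
        cands |= {(s, p, oi) for oi in get_candidates(o)}
        cands -= K  # drop triples already present in the graph
        c_best, best_score = None, None
        for c in cands:
            cs, _, co = c
            if not (in_domain(cs, p) and in_range(co, p)):
                continue  # prune domain/range violations
            score_c = ed.conf(c)
            # Keep the best candidate above both thresholds.
            if score_c >= mc and score_c / score_t >= mcg:
                if best_score is None or score_c > best_score:
                    c_best, best_score = c, score_c
        if c_best is not None:
            T_corr.append((c_best, t))  # (corrected triple, wrong triple)
    return T_corr
```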
3.1 Type Prediction
After selecting the k triples most likely to be wrong, we first check whether their confidence is low because of missing or wrong instance types on the subject or object. In order to do that, we run a type predictor tp on the subject and object instances. In this paper, we use as tp a multilabel random forest classifier based on qualified links (i.e., ingoing links paired with the subject type and outgoing links paired with the object type), as described in [6]. If the set of predicted types of the subject differs from the actual types, we change the type features used by ed and compute a new confidence for the triple (cf. conf_nt in Algorithm 1). If the new score satisfies mc and mcg, we conclude that the error was in the subject type assertions. The same is done for the object, and if in neither case the confidence thresholds are satisfied, we proceed to the next part, where we try to substitute the subject and object with their respective lists of candidates.

Combining the type prediction process with the error detection also has the advantage that the newly predicted types can be validated on the triples containing the instance whose types were predicted. This can support or contradict the type predictor, possibly revealing wrongly predicted types by identifying triples whose score is lowered by the new types.
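A rough sketch of such a type predictor is given below, using scikit-learn. The qualified-link feature construction (one binary indicator per relation/neighbor-type pair on ingoing and outgoing links) is a simplification of the setup in [6], and all function names are illustrative.

```python
# Simplified sketch of the type predictor tp: a multilabel random forest
# over qualified-link features, in the spirit of [6]. Not the exact setup.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

def qualified_link_features(entity, triples, types):
    """One binary feature per (direction, relation, neighbor type) triple."""
    feats = {}
    for s, p, o in triples:
        if s == entity:
            for t in types.get(o, ()):
                feats[f"out:{p}:{t}"] = 1.0
        if o == entity:
            for t in types.get(s, ()):
                feats[f"in:{p}:{t}"] = 1.0
    return feats

def train_type_predictor(entities, triples, types):
    vec, mlb = DictVectorizer(), MultiLabelBinarizer()
    X = vec.fit_transform([qualified_link_features(e, triples, types)
                           for e in entities])
    Y = mlb.fit_transform([types.get(e, ()) for e in entities])
    clf = RandomForestClassifier(n_estimators=100).fit(X, Y)
    return clf, vec, mlb  # map predicted columns back to labels via mlb.classes_
```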
3.2 Retrieving Candidates
One simple way to find candidate entities to resolve entity confusions is to use disambiguation links. However, disambiguation pages are only available for Wikipedia-based knowledge graphs; they are not available for every entity (e.g., Ronaldo has no disambiguation page); and in some cases they miss important entities (e.g., the page Bluebird_(disambiguation) misses the entity Bluebird_(horse), hence we cannot correct the fact grandsire(Miss_Potential, Bluebird)). We therefore require an additional source of candidates.

Since in our experiments we consider DBpedia and NELL, which have informative IRIs (in the case of DBpedia extracted from the corresponding Wikipedia page), we search for candidate entities which have similar IRIs. Alternatively, this could also be done with entity labels, which would be useful in KGs with uninformative IRIs (e.g., Wikidata and Freebase). For simplicity, in this paper we refer to the informative part of an IRI as the "name" of the entity.

Retrieving all instances with similar names can be a complicated task. This kind of problem is known as approximate string matching, and it has been widely researched [8, 16]. For our method we use an approximate string matching approach based on [7]. First, we remove the IRI's prefix and work with the suffix as the entity's name. We then tokenize the names and construct a deletions dictionary, into which every token is added together with all of its variants obtained by up to d_max character deletions, where d_max is a maximum edit distance threshold. This dictionary maps each string to the list of all tokens which can be turned into that string with at most d_max deletions. Only pairs of tokens which share a common deletion string can have an edit distance of at most d_max. We also keep a tokens dictionary, which maps each token to the list of entities whose names contain it. With these two dictionaries, given a token and a d_max, we can easily obtain all entities containing a string approximately similar to that token up to the maximum edit distance.

When searching for entities similar to a given entity, we perform queries for every token of the entity's name and require that all tokens are matched. That is, for a certain entity to be considered similar, it has to contain tokens similar to all tokens of the queried entity. A retrieved entity may have more tokens than the queried entity, but not fewer. The idea is that, in general, when entering an entity's name manually (e.g., in Wikipedia), it is common to underspecify the entity, but highly unlikely to overspecify it; e.g., it is more likely that Ronaldo is wrongly used instead of Cristiano_Ronaldo than the other way around. Furthermore, this requirement reduces the number of matched entities.

We also apply special treatment to DBpedia and NELL entity names because of peculiarities in their IRI structures. In DBpedia, it is common to have information between parentheses to help disambiguate entities, which we consider unnecessary since the entity types are already used in the error detection method. In NELL, the first token is always the type of the entity; therefore, for similar reasons, we ignore it.
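The following minimal sketch illustrates the two dictionaries, assuming underscore-tokenized names and a small d_max. Tokens retrieved through a shared deletion variant are only candidates: an exact edit distance check would still be applied afterwards. All function names are illustrative.

```python
# Sketch of the deletion-based index (in the spirit of [7]): only tokens
# sharing a common deletion variant can be within edit distance d_max,
# so shared variants yield candidates to verify with the true distance.
from collections import defaultdict
from itertools import combinations

def deletion_variants(token, d_max):
    """All strings obtained from token by deleting up to d_max characters."""
    variants = {token}
    for d in range(1, d_max + 1):
        for cut in combinations(range(len(token)), d):
            variants.add("".join(c for i, c in enumerate(token) if i not in cut))
    return variants

def build_index(entity_names, d_max=1):
    deletions = defaultdict(set)       # variant string -> tokens
    token_entities = defaultdict(set)  # token -> entities containing it
    for entity, name in entity_names.items():
        for token in name.lower().split("_"):
            token_entities[token].add(entity)
            for v in deletion_variants(token, d_max):
                deletions[v].add(token)
    return deletions, token_entities

def match_token(query, deletions, token_entities, d_max=1):
    """Entities containing a token that shares a deletion variant with query."""
    tokens = set()
    for v in deletion_variants(query, d_max):
        tokens |= deletions.get(v, set())
    return set().union(*(token_entities[t] for t in tokens)) if tokens else set()
```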
3.3 Correcting Wrong Facts
At this point, for each assertion identified as erroneous, we have a list of candidate instances for the subject and object, obtained from the disambiguation links and approximate string matching. We then compute a custom similarity measure $s(e_1, e_2)$ between an entity $e_1$ and a candidate $e_2$, where each entity $e_i$ is represented by the set of its tokens. The measure we propose consists of two components: the first is the sum of the Levenshtein distances $d_L$ of all matched token pairs, and the second considers the number of unmatched tokens, to capture a difference in specificity. The set of approximately matched token pairs is denoted $\mu(e_1, e_2)$, and the constant $c$ is the weight of the second component:

$$ s(e_1, e_2) = \sum_{(t_1, t_2) \in \mu(e_1, e_2)} d_L(t_1, t_2) + c \, \frac{|e_1| - |\mu(e_1, e_2)|}{|e_1|} \qquad (1) $$

This measure is used to sort the retrieved candidates, to prune them in case there are too many, and to break ties when deciding which of the top-scoring candidates should be chosen.

In case the relation has domain or range restrictions, we remove the candidates which violate them. Then, for each of the candidates, we generate triples by substituting the subject and the object with each of the instances in their respective candidate lists (first substituting the subject only, then the object only). That is, the total number of candidate triples is the sum of the sizes of the subject and object candidate lists. We do not create candidate triples by substituting both the subject and the object at the same time because, although possible, we assume the simultaneous confusion of both instances to be highly unlikely.¹ This also keeps the number of candidate triples linear instead of quadratic.

¹ For that to happen in the case of DBpedia, a Wikipedia user would have to go to the wrong article page and insert a wrong link in the infobox.

We then remove the candidate triples which already exist in the KG, compute the confidence of all remaining candidate triples, and select the one with the highest confidence, provided that mc and mcg are satisfied. Our method then outputs a list of triple pairs, each containing the detected wrong triple and the predicted corrected triple.
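A small sketch of Equation (1) follows. Since the construction of the matched-pair set µ(e1, e2) is not fully specified above, the sketch greedily pairs each token of e1 with its nearest unmatched token of e2 within a maximum edit distance; this pairing strategy is our assumption. The default c = 1.5 is the value used in Section 4.

```python
# Sketch of the similarity measure of Eq. (1); lower values mean more
# similar. The greedy construction of the matched pairs is an assumption.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(e1_tokens, e2_tokens, c=1.5, d_max=1):
    """Sum of Levenshtein distances over matched token pairs, plus a
    specificity penalty c * (|e1| - |mu|) / |e1| for unmatched tokens."""
    unmatched = set(e2_tokens)
    n_matched, dist_sum = 0, 0
    for t1 in e1_tokens:
        best = min(unmatched, key=lambda t2: levenshtein(t1, t2), default=None)
        if best is not None and levenshtein(t1, best) <= d_max:
            unmatched.discard(best)
            n_matched += 1
            dist_sum += levenshtein(t1, best)
    return dist_sum + c * (len(e1_tokens) - n_matched) / len(e1_tokens)
```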
4 EXPERIMENTS
In our experiments, we run CoCKG on DBpedia (2016-10) and NELL (08m-690) and manually evaluate the triples corrected by our approach. We run PaTyBRED on both datasets and select the top 1% of facts most likely to be errors to be processed by our correction method. We classify each corrected fact into four categories:

(1) WC: wrong fact turned into a correct fact
(2) WW: wrong fact turned into another wrong fact
(3) CW: correct fact turned into a wrong fact
(4) CC: correct fact turned into another correct fact

Our approach was run with mc = 0.75, mcg = 2, and the entity similarity measure with c = 1.5.² This resulted in 24,973 corrections on DBpedia and 616 corrections on NELL. It also detected that 873 (569) errors were caused by wrong types in DBpedia (NELL). Since manually evaluating all these corrections would be impossible, we randomly selected 100 corrections from each dataset for the evaluation. The results of our manual evaluation are shown in Figure 1.

² The parameter values were selected based on heuristics and may not be optimal.

[Figure 1: Manual evaluation on DBpedia and NELL, respectively: distribution of the 100 sampled corrections per dataset over the categories WC, WW, CW and CC.]

The proportion of facts successfully corrected (case 1) was rather low. While our approach can potentially improve the results by tweaking the parameters, and possibly by using ensembles of different type predictors and error detectors, it currently cannot be used as a fully automatic approach. However, we believe that combining our approach with active learning is a promising direction which, with the help of specialists, could significantly improve the results. When evaluating relations individually, we notice that some of them achieve good results; e.g., the relations sire, damsire, grandsire and subsequentWork reach more than 90% successful corrections (case 1). The results are good for these relations because horses are often named after other entities and artists often have albums named after themselves, which makes such confusions likely.

One of the problems of our approach is that, since it relies on PaTyBRED, which cannot find many relevant path features on DBpedia and NELL [5], it is difficult to distinguish between candidate entities of the same type. For example, in NELL, the entity person_paul as object of the book_writer relation is always corrected to writer_paul_feval.

The decision to generate candidate triples by corrupting either the subject or the object seems to have worked well for DBpedia, where we could not find a triple in which both subject and object were wrong. In NELL, on the other hand, such cases were observed a few times, e.g., ismultipleof(musicinstrument_herd, musicinstrument_buffalo), whose object was corrected to mammal_buffalo while the subject remained wrong.

Also, our assumption that confusions tend to use a more general IRI instead of a more specific one, implemented by requiring all tokens of the queried entity to be matched, does not always hold. One example from DBpedia which contradicts this assumption is language(Paadatha_Thenikkal, Tamil_cinema), whose corrected object would be Tamil_language and could not be retrieved by our approach. While this can be a problem, dropping the assumption also means that more candidate entities would be retrieved, increasing the number of unrelated candidates and resulting in more candidate triples to be tested, and possibly in more occurrences of cases 2 and 3. Further experiments would have to be conducted in order to evaluate the effects of such a change.
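Putting the pieces together, a hypothetical end-to-end run with the parameter values above could look as follows. All objects and helpers (patybred, tp, kg, get_candidates, conf_nt) are placeholders tying back to the earlier sketches, not an actual API.

```python
# Hypothetical composition of the sketched components with the Section 4
# parameters (mc = 0.75, mcg = 2, c = 1.5 inside the similarity measure).
ranked = patybred.rank_triples(kg)   # (triple, score), most suspicious first
k = int(0.01 * len(ranked))          # top 1% as in Section 4
T_err = ranked[:k]

corrections = correct_triples(
    K=kg.triples, T_err=T_err, ed=patybred, tp=tp,
    mc=0.75, mcg=2.0,
    get_candidates=get_candidates,   # disambiguation links + deletion index
    in_domain=kg.in_domain, in_range=kg.in_range,
    conf_nt=conf_nt,
)
for corrected, wrong in corrections:
    print(f"{wrong} -> {corrected}")
```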
5 CONCLUSION
In this paper, we proposed CoCKG, an approach for correcting erroneous facts which originate from entity confusions. The experiments show that CoCKG is capable of correcting wrong triples with confused instances, with an estimated precision of 21% of the produced corrections in DBpedia and 14% in NELL. The low precision values obtained do not allow this process, as of now, to be used for fully automatic KG enrichment. Nevertheless, it works as a proof of concept and can be useful, e.g., for generating correction suggestions on which a user would ultimately decide.

In the future, it would be interesting to adapt this method to support active learning. Since guaranteeing the quality of the newly generated facts is crucial, having input from the user to clarify borderline cases and improve the overall results would be highly valuable. Furthermore, using an ensemble of different KG models with different characteristics, e.g. KG embeddings, instead of a single model may increase the robustness of the system. Finally, it would be worth adding textual features from entity descriptions to help determine whether a pair of entities is related.
ACKNOWLEDGMENTS
The work presented in this paper has been partly supported by the Ministry of Science, Research and the Arts Baden-Württemberg in the project SyKo²W² (Synthesis of Completion and Correction of Knowledge Graphs on the Web).

REFERENCES
[1] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26.
[2] Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. 2013. AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases. In Proceedings of the 22nd International World Wide Web Conference (WWW '13). International World Wide Web Conferences Steering Committee / ACM, 413–422. http://dl.acm.org/citation.cfm?id=2488425
[3] Ni Lao and William W. Cohen. 2010. Relational Retrieval Using a Combination of Path-constrained Random Walks. Machine Learning 81, 1 (Oct. 2010), 53–67. https://doi.org/10.1007/s10994-010-5205-8
[4] André Melo and Heiko Paulheim. 2017. Detection of Relation Assertion Errors in Knowledge Graphs. In Proceedings of the 9th International Conference on Knowledge Capture (K-CAP 2017). ACM.
[5] André Melo and Heiko Paulheim. 2017. Local and Global Feature Selection for Multilabel Classification with Binary Relevance. Artificial Intelligence Review (2017), 1–28. https://doi.org/10.1007/s10462-017-9556-4
[6] André Melo, Heiko Paulheim, and Johanna Völker. 2016. Type Prediction in RDF Knowledge Bases Using Hierarchical Multilabel Classification. In Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics (WIMS '16). ACM, Article 14, 10 pages. https://doi.org/10.1145/2912845.2912861
[7] Stoyan Mihov and Klaus U. Schulz. 2004. Fast Approximate Search in Large Dictionaries. Computational Linguistics 30, 4 (Dec. 2004), 451–477. https://doi.org/10.1162/0891201042544938
[8] Gonzalo Navarro. 2001. A Guided Tour to Approximate String Matching. ACM Computing Surveys 33, 1 (March 2001), 31–88. https://doi.org/10.1145/375360.375365
[9] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE 104, 1 (2016), 11–33. https://doi.org/10.1109/JPROC.2015.2483592
[10] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). http://www.icml-2011.org/papers/438_icmlpaper.pdf
[11] Heiko Paulheim. 2017. Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. Semantic Web 8, 3 (2017), 489–508. https://doi.org/10.3233/SW-160218
[12] Heiko Paulheim and Christian Bizer. 2013. Type Inference on Noisy RDF Data. Springer Berlin Heidelberg, 510–525. https://doi.org/10.1007/978-3-642-41335-3_32
[13] Heiko Paulheim and Christian Bizer. 2014. Improving the Quality of Linked Data Using Statistical Distributions. International Journal on Semantic Web and Information Systems 10, 2 (April 2014), 63–86. https://doi.org/10.4018/ijswis.2014040104
[14] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. CoRR abs/1606.06357 (2016). http://arxiv.org/abs/1606.06357
[15] Chengyu Wang, Rong Zhang, Xiaofeng He, and Aoying Zhou. 2016. Error Link Detection and Correction in Wikipedia. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, 307–316. https://doi.org/10.1145/2983323.2983705
[16] Zhenglu Yang, Jianjun Yu, and Masaru Kitsuregawa. 2010. Fast Algorithms for Top-k Approximate String Matching. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI '10). AAAI Press, 1467–1473. http://dl.acm.org/citation.cfm?id=2898607.2898841