An Approach to Correction of Erroneous Links in Knowledge Graphs

André Melo, University of Mannheim, andre@informatik.uni-mannheim.de
Heiko Paulheim, University of Mannheim, heiko@informatik.uni-mannheim.de

K-CAP2017 Workshops and Tutorials Proceedings, 2017. © Copyright held by the owner/author(s).
ABSTRACT
Enriching and cleaning datasets are tasks relevant to most large-scale knowledge graphs, which are known to be noisy and incomplete. In this paper, we propose an approach to fix wrong facts which originate from confusion between entities. This is a common source of errors in Wikipedia, from which DBpedia and YAGO are extracted, and in open information extraction systems, e.g. NELL. Our approach generates candidate subjects and objects to fix triples by exploiting disambiguation links and approximate string matching. We run our approach on DBpedia and NELL and perform a manual evaluation of the generated corrections.

CCS CONCEPTS
• Information systems → Data cleaning; • Computing methodologies → Semantic networks; Statistical relational learning; Machine learning;

KEYWORDS
Knowledge graph completion, error detection, data quality

1 INTRODUCTION
Knowledge graphs are known to be both often incomplete and incorrect. Several link prediction and error detection methods have been proposed; however, few of them explicitly focus on error correction or address the problem of choosing which absent facts should be added to the knowledge graph.

The problem is that the number of possible relation assertions grows quadratically with the number of instances, $n_c = n_i^2 n_r - n_f$, where $n_i$ is the number of instances, $n_r$ the number of relations, and $n_f$ the number of existing facts in the graph. For large datasets such as DBpedia, Wikidata and YAGO, computing the confidence score of all these facts is challenging. While pruning possible facts which violate ontology constraints, especially domain and range restrictions of relations, can significantly reduce the search space, the problem is still very challenging. To illustrate the size of the search space, in DBpedia (2016-10) $n_c \approx 4.4 \times 10^{17}$ facts; when filtering those triples which violate the domain and range restrictions, the number is reduced to $n_c \approx 2.8 \times 10^{17}$, which is still too large to compute confidence for all those candidates.
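As a rough sanity check of this order of magnitude, the following minimal Python sketch evaluates the formula above; the instance, relation and fact counts are illustrative assumptions in the rough range of DBpedia (2016-10), not the exact statistics used in the paper.

    # Illustrative evaluation of n_c = n_i^2 * n_r - n_f.
    # The counts below are assumptions roughly in the range of
    # DBpedia (2016-10), not the exact statistics from the paper.
    n_i = 6_600_000   # number of instances (assumed)
    n_r = 10_000      # number of relations (assumed)
    n_f = 4.0e8       # number of existing facts (assumed)

    n_c = n_i**2 * n_r - n_f
    print(f"candidate facts: {n_c:.1e}")  # on the order of 1e17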
A promising approach for enriching a KG with some of its missing facts is the correction of erroneous facts: some of the wrong facts which exist in a KG can be corrected. Such an error originates from some problem in the knowledge acquisition process, or in the source data. A good example of the latter are errors in Wikipedia, which serves as the main source of DBpedia and YAGO. If any of the links in the infobox are wrong, the extraction process generates a wrong fact. NELL, on the other hand, has text as its main source of information, and in many cases the source text contains correct information which nevertheless cannot be extracted correctly.

In both cases it is common that an instance is confused with another one of a similar name (i.e., label or IRI). For example, the fact formerTeam(Alan_Ricard, Buffalo_Bill) is an erroneous fact from DBpedia which originates from a typo in Wikipedia: when referring to the NFL team Buffalo_Bills, the s was missing; therefore, the NFL team was confused with the character Buffalo_Bill. In NELL, the entity insect_raccoon exists because of problems when extracting the fact that raccoons prey on insects, and is confused with mammal_raccoon.

Some relation assertion error detection approaches, such as PaTyBRED [4] and SDValidate [13], rely on type information. Since erroneous type assertions are also a common problem, correct relation assertions involving an instance with incorrect or incomplete types might be wrongly identified as erroneous. Therefore, combining such methods with type prediction [6, 12] is beneficial to rule out cases where the error is detected due to a missing or incorrect type of the subject or object rather than due to an erroneous relation assertion.

It is thus relevant to make a careful analysis of detected errors, identify the source of each error, and if possible correct them. In this paper, we propose CoCKG (Correction of Confusions in Knowledge Graphs), an automatic correction approach which resolves relation assertion errors caused by instance confusion. The approach relies on error detection methods as well as type predictors to assess the confidence of the corrected facts. It uses approximate string matching and exploits both searching for entities with similar IRIs and Wikipedia disambiguation pages (if available) to find candidate instances for correcting the facts.

2 RELATED WORK
In the knowledge graph context, there are mainly error detection and link prediction approaches. Both are closely related to our problem: while error detection deletes wrong triples, link prediction aims at adding new triples. In both cases, the approaches learn a KG model which is capable of assigning confidence values to triples.

KG embedding models [1, 10, 14] are currently the best performing approaches in the link prediction task. Other models rely on directly observed features, e.g., the path ranking algorithm (PRA) [3]. A more detailed description of link prediction methods can be found in [9]. It is important to note that none of the link prediction approaches mentioned address the problem of covering the candidate triple space (of size $n_c$ as discussed in the introduction). Our approach, on the other hand, exploits the assumption that erroneous facts often have a corresponding correct fact in order to reduce that space.
Error detection approaches, such as SDValidate and PaTyBRED, focus on detecting already existing erroneous triples. It has been shown that state-of-the-art embeddings perform worse than PaTyBRED in the error detection task [5]. A survey covering various KG refinement methods can be found in [11].

Rule-based systems, such as AMIE [2], cannot assign scores to arbitrary triples. However, they could be used to restrict the $n_c$ search space by identifying high-confidence soft rules and using the missing facts from instances where a rule does not hold as candidates. Combining them with the previously mentioned KG models would be an interesting line of research; however, it is out of the scope of this paper.

Wang et al. [15] studied the problem of erroneous links in Wikipedia, which is also the source of many errors in DBpedia. They model the Wikipedia links as a weighted directed mono-relational graph, and propose the LinkRank algorithm, which is similar to PageRank, but instead of ranking the nodes (entities), it ranks the links. They use LinkRank to generate candidates for link correction and use textual features from the descriptions of articles to learn an SVM classifier that can detect errors and choose the best candidate for correction. While this is a closely related problem, which can help mitigate the problem studied in this paper, their method cannot be directly applied on arbitrary knowledge graphs. Our approach takes advantage of the multi-relational nature of KGs, entity types, ontological information and the graph structure.

3 PROPOSED APPROACH
Our approach consists of first running an error detection algorithm (PaTyBRED in the case of this paper) and selecting the top-k facts most likely to be wrong. In the next step, the error is heuristically verified to be an actual relation assertion error, and not caused by missing type assertions on the subject or object, with a type predictor tp. In the final step, candidate entities are retrieved, and if any of the candidates significantly improves the likelihood of the triple being right, we replace it by that candidate. The function correct_triples in Algorithm 1 gives an overview of how CoCKG works. The parameter K is the set of all triples in the knowledge graph, T_err is the set of triple and confidence pairs generated by the error detection model (ed), tp is the type predictor, mc is the minimum confidence threshold, and mcg the minimum confidence gain threshold, i.e. the ratio of the new and old triple scores. In the next subsections we discuss the other parts in more detail.

Algorithm 1 Knowledge base correction process
 1: function correct_triples(K, T_err, ed, tp, mc, mcg)
 2:   T_corr ← ∅
 3:   for (t, score_t) ∈ T_err do
 4:     (s, p, o) ← t
 5:     s_tp ← predict_types(tp, s)
 6:     o_tp ← predict_types(tp, o)
 7:     if ¬(conf_nt(ed, t, s, s_tp) ∨ conf_nt(ed, t, o, o_tp)) then
 8:       s_cand ← get_candidates(s)
 9:       o_cand ← get_candidates(o)
10:       T_cand ← {(s_i, p, o) | s_i ∈ s_cand} ∪ {(s, p, o_i) | o_i ∈ o_cand}
11:       T_cand ← T_cand − K
12:       c_best, maxconf ← nil, 0
13:       for c ∈ T_cand do
14:         if subj(c) ∈ domain(p) ∧ obj(c) ∈ range(p) then
15:           score_c ← conf(ed, c)
16:           if score_c ≥ mc ∧ score_c/score_t ≥ mcg ∧ score_c > maxconf then
17:             c_best, maxconf ← c, score_c
18:           end if
19:         end if
20:       end for
21:       if c_best ≠ nil then
22:         T_corr ← T_corr ∪ {(c_best, t)}
23:       end if
24:     end if
25:   end for
26:   return T_corr
27: end function
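As a minimal illustration, the following Python sketch mirrors the loop of Algorithm 1. The objects ed and tp and the helper functions are hypothetical interfaces standing in for the error detector, the type predictor, and the candidate retrieval described in the following subsections; they are not the authors' implementation.

    # Minimal sketch of Algorithm 1 under assumed interfaces: `kg` is a set
    # of (s, p, o) triples, `ed` an error detector, `tp` a type predictor.
    def correct_triples(kg, t_err, ed, tp, mc, mcg, get_candidates):
        corrections = []
        for triple, score_t in t_err:
            s, p, o = triple
            # Rule out errors explained by missing/wrong types (cf. Sec. 3.1).
            if ed.conf_with_types(triple, tp.predict(s), tp.predict(o)) >= mc:
                continue
            # Candidates replace the subject OR the object, never both (Sec. 3.3).
            candidates = [(si, p, o) for si in get_candidates(s)]
            candidates += [(s, p, oi) for oi in get_candidates(o)]
            best, best_score = None, 0.0
            for c in candidates:
                if c in kg:  # skip triples that already exist
                    continue
                if not (ed.in_domain(c[0], p) and ed.in_range(c[2], p)):
                    continue
                score_c = ed.conf(c)
                # Accept only confident candidates with sufficient gain.
                gain = score_c / max(score_t, 1e-9)
                if score_c >= mc and gain >= mcg and score_c > best_score:
                    best, best_score = c, score_c
            if best is not None:
                corrections.append((best, triple))
        return corrections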
3.1 Type Prediction
After selecting the k triples most likely to be wrong, we first check whether their confidence is low because of missing or wrong instance types (of the subject or object). In order to do that, we run a type predictor tp on the subject and object instances. In this paper, we use as tp a multilabel random forest classifier based on qualified links (i.e., ingoing links paired with subject type and outgoing links paired with object type), as described in [6]. If the set of predicted types of the subject is different from the actual types, we change the type features used by ed and compute a new confidence for the triple (cf. conf_nt). If the new score satisfies mc and mcg, then we conclude that the error was in the subject type assertions. The same is done for the object, and if in neither case the confidence thresholds are satisfied, we proceed to the next part, where we try to substitute the subject and object with their respective lists of candidates.

Combining the type prediction process with the error detection also has the advantage that the newly predicted types can be validated on triples containing the instance whose types were predicted. This can help support or contradict the type predictor, possibly detecting wrongly predicted types by identifying triples where the score is lowered with the new types.
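A compact sketch of such a multilabel type predictor, assuming scikit-learn; the feature and label encoding below is a simplified stand-in for the qualified-link setup described in [6], not the exact pipeline.

    # Sketch of a multilabel random forest type predictor over
    # qualified-link features (simplified stand-in for [6]).
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    def train_type_predictor(entity_features, entity_types):
        """entity_features: per entity, a set of qualified links, e.g.
        {("in", "formerTeam", "dbo:Person"), ("out", "birthPlace", "dbo:City")};
        entity_types: per entity, the set of asserted types."""
        feat_bin, type_bin = MultiLabelBinarizer(), MultiLabelBinarizer()
        X = feat_bin.fit_transform(entity_features)
        Y = type_bin.fit_transform(entity_types)
        clf = RandomForestClassifier(n_estimators=100)
        clf.fit(X, Y)  # random forests handle multilabel targets natively
        return clf, feat_bin, type_bin

    def predict_types(clf, feat_bin, type_bin, features):
        Y = clf.predict(feat_bin.transform([features]))
        return set(type_bin.inverse_transform(Y)[0])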
3.2 Retrieving Candidates
One simple way to find candidate entities to resolve entity confusions is to use disambiguation links. However, disambiguation pages are only available for Wikipedia-based knowledge graphs; furthermore, they are not available for every entity (e.g., Ronaldo has no disambiguation page), and in some cases the disambiguation pages miss important entities (e.g., the page Bluebird_(disambiguation) misses the entity Bluebird_(horse), hence we cannot correct the fact grandsire(Miss_Potential, Bluebird)). We therefore require an additional source of candidates.

Since in our experiments we consider DBpedia and NELL, which have informative IRIs (in the case of DBpedia extracted from the corresponding Wikipedia page), we search for candidate entities which have similar IRIs. Alternatively, this could also be done with entity labels, which would be useful in KGs with uninformative IRIs (e.g., Wikidata and Freebase). For simplicity, in this paper, we refer to the informative part of an IRI as the “name” of the entity.
Retrieving all the instances with similar names can be a complicated task. This kind of problem is known as approximate string matching, and it has been widely researched [8, 16]. For our method we use an approximate string matching approach based on [7]. First, we remove the IRI’s prefix and work with the suffix as the entity’s name. We then tokenize the names and construct a deletions dictionary, where every token is added together with all its possible deletions of up to a maximum edit distance d_max. This dictionary contains strings as keys and, as values, lists of all tokens which can be turned into the key string with up to d_max deletions. Only pairs of tokens which share a common deletion string can have an edit distance less than or equal to d_max. We also have a tokens dictionary which has tokens as keys and lists of entities containing a given token as values. With that, given a token and a d_max, we can easily obtain all the entities which contain a string approximately similar to that token up to the maximum edit distance.

When searching for entities similar to a given entity, we perform queries for every token of the entity’s name and we require that all tokens are matched. That is, for a certain entity to be considered similar, it has to contain tokens similar to all the tokens of the queried entity. A retrieved entity may have more tokens than the queried entity, but not fewer. The idea is that, in general, when entering an entity’s name manually (e.g., in Wikipedia), it is common to underspecify the entity, but highly unlikely to overspecify it. E.g., it is more likely that Ronaldo is wrongly used instead of Cristiano_Ronaldo than the other way around. Furthermore, this requirement reduces the number of matched entities.
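A minimal sketch of this deletions-dictionary index, in the spirit of [7]; the names, data layout and tokenization below are illustrative assumptions, not the authors' implementation, and the final edit-distance verification of matched token pairs is omitted for brevity.

    # Sketch of the deletions dictionary for approximate token matching.
    from collections import defaultdict
    from itertools import combinations

    def deletions(token, d_max):
        """All strings obtainable from `token` by deleting up to d_max chars."""
        results = {token}
        for k in range(1, min(d_max, len(token)) + 1):
            for idx in combinations(range(len(token)), k):
                results.add("".join(ch for i, ch in enumerate(token)
                                    if i not in idx))
        return results

    def build_index(entity_names, d_max=1):
        deletion_dict = defaultdict(set)  # deletion string -> tokens
        token_dict = defaultdict(set)     # token -> entities containing it
        for entity, name in entity_names.items():
            for token in name.lower().split("_"):
                token_dict[token].add(entity)
                for d in deletions(token, d_max):
                    deletion_dict[d].add(token)
        return deletion_dict, token_dict

    def similar_entities(name, deletion_dict, token_dict, d_max=1):
        """Entities containing a similar token for EVERY query token."""
        result = None
        for token in name.lower().split("_"):
            matches = set()
            for d in deletions(token, d_max):
                for cand_token in deletion_dict.get(d, ()):
                    matches |= token_dict[cand_token]
            result = matches if result is None else result & matches
        return result or set()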
We also apply special treatment to DBpedia and NELL entity names because of peculiarities in their IRI structures. In DBpedia it is common to have disambiguating information between parentheses, which we consider unnecessary since the entity types are already used in the error detection method. In NELL the first token is always the type of the entity; therefore, for similar reasons, we ignore it.
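A small sketch of this name normalization; the regex and prefix handling are illustrative assumptions.

    import re

    def normalize_dbpedia_name(iri):
        """Strip the IRI prefix and any parenthesized disambiguation suffix,
        e.g. .../Bluebird_(horse) -> "Bluebird"."""
        name = iri.rsplit("/", 1)[-1]
        return re.sub(r"_\([^)]*\)$", "", name)

    def normalize_nell_name(entity):
        """Drop the leading type token, e.g. "mammal_buffalo" -> "buffalo"."""
        return entity.split("_", 1)[-1]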
                                                                                                             method. We classify each corrected fact in four different categories:
3.3 Correcting Wrong Facts
At this point, for each assertion identified as erroneous, we have lists of candidate instances for the subject and the object, obtained from the disambiguation links and approximate string matching. We then compute a custom similarity measure s(e_1, e_2) between an entity e_1 and a candidate e_2. Each entity e_i is represented as the set of its tokens. The measure we propose consists of two components: the first is the sum of the Levenshtein distances d_L of all matched token pairs, and the second considers the number of unmatched tokens, to capture a difference in specificity. The set of approximately matched token pairs is denoted by μ(e_1, e_2), and the constant c is the weight of the second component. This measure is used to sort the retrieved candidates, to prune them in case there are too many, and to break ties when deciding which of the top-scoring candidates should be chosen.

$$s(e_1, e_2) = \sum_{(t_1, t_2) \in \mu(e_1, e_2)} d_L(t_1, t_2) + c \, \frac{|e_1| - |\mu(e_1, e_2)|}{|e_1|} \qquad (1)$$
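A direct Python transcription of Equation (1); the greedy construction of the matched-pair set μ and the d_max cutoff are assumptions, since the paper does not specify how μ is computed.

    # Sketch of the similarity measure in Eq. (1); lower is more similar.
    def levenshtein(a, b):
        """Standard dynamic-programming edit distance d_L."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def similarity(e1_tokens, e2_tokens, c=1.5, d_max=1):
        """Sum of d_L over greedily matched pairs (mu), plus a penalty
        weighted by c for the unmatched tokens of e1."""
        unmatched = set(e2_tokens)
        matched, dist_sum = 0, 0
        for t1 in e1_tokens:
            best = min(unmatched, key=lambda t2: levenshtein(t1, t2),
                       default=None)
            if best is not None and levenshtein(t1, best) <= d_max:
                unmatched.discard(best)
                matched += 1
                dist_sum += levenshtein(t1, best)
        return dist_sum + c * (len(e1_tokens) - matched) / len(e1_tokens)

For instance, similarity({"ronaldo"}, {"cristiano", "ronaldo"}) is 0, reflecting the underspecification assumption, while the reverse direction incurs the specificity penalty c/2.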
                                                                                                             some of them achieve good results. E.g., the relations sire, damsire,
In case the relation has domain or range restrictions, we remove the candidates which violate these restrictions. Then, for each of the candidates, we generate triples by substituting the subject and the object with each of the instances in their candidate lists (first substituting the subject only, then the object only). That is, the total number of candidate triples is the sum of the sizes of the subject and object candidate lists. We do not create candidate triples by substituting both the subject and the object at the same time because, although possible, we assume the simultaneous confusion of both instances to be highly unlikely (for that to happen in the case of DBpedia, a Wikipedia user would have to go to the wrong article page and insert a wrong link in the infobox). This also keeps the number of candidate triples linear instead of quadratic.

We then remove the candidate triples which already exist in the KG. We compute the confidence of all candidate triples and select the one with the highest confidence, given that mc and mcg are satisfied. Our method then outputs a list of triple pairs, each containing the detected wrong triple and the predicted corrected triple.

4 EXPERIMENTS
In our experiments we run CoCKG on DBpedia (2016-10) and NELL (08m-690), then manually evaluate the triples corrected by our approach. We run PaTyBRED on both datasets and select the top 1% of facts most likely to be errors to be processed by our correction method. We classify each corrected fact into four categories:
(1) WC: wrong fact turned into correct
(2) WW: wrong fact turned into another wrong fact
(3) CW: correct fact turned into wrong fact
(4) CC: correct fact turned into another correct fact

Our approach was run with mc = 0.75, mcg = 2, and the entity similarity measure with c = 1.5 (the parameter values were selected based on heuristics and may not be optimal). This resulted in 24,973 corrections on DBpedia and 616 corrections on NELL. It also detected that 873 (569) errors were caused by wrong types in DBpedia (NELL). Since manually evaluating all these corrections would be impossible, we randomly selected 100 corrections from each dataset for the evaluation.

The results of our manual evaluation are shown in Figure 1.

[Figure 1: Manual evaluation on DBpedia and NELL, respectively. Two charts showing, for each dataset, the distribution of the 100 evaluated corrections over the categories WC, WW, CW and CC, with WC at 21 for DBpedia and 14 for NELL.]

The proportion of facts successfully corrected (case 1) was rather low. While our approach can potentially improve the results by tweaking the parameters, and possibly by using ensembles of different type predictors and error detectors, it currently cannot be used as a fully automatic approach. However, we believe that combining our approach with active learning is a promising direction which, with the help of specialists, could significantly improve results.

When evaluating some relations individually, we notice that some of them achieve good results, e.g., the relations sire, damsire, grandsire and subsequentWork reach more than 90% of successful corrections (case 1). The results are good for these relations because horses are often named after other entities and artists often have albums named after themselves, which makes such confusions likely.
One of the problems of our approach is that, since it relies on PaTyBRED, which cannot find many relevant path features on DBpedia and NELL [5], it is difficult to distinguish between candidate entities of the same type. For example, in NELL, the entity person_paul as the object of the book_writer relation is always corrected to writer_paul_feval.

The decision to generate candidate triples by corrupting either the subject or the object seems to have worked well for DBpedia, where we could not find a triple in which both subject and object were wrong. On the other hand, in NELL such cases were observed a few times, e.g., ismultipleof(musicinstrument_herd, musicinstrument_buffalo), whose object was corrected to mammal_buffalo but whose subject remained wrong.

Also, our assumption that confusions tend to use a more general IRI instead of a more specific one, which requires all tokens of the queried entity to be matched, does not always hold. One example in DBpedia which contradicts this assumption is language(Paadatha_Thenikkal, Tamil_cinema), whose corrected object would be Tamil_language and could not be retrieved by our approach. While this can be a problem, dropping this assumption also means that more candidate entities will be retrieved, increasing the number of unrelated candidates and resulting in more candidate triples which need to be tested, and possibly in more occurrences of cases 2 and 3. Further experiments would have to be conducted in order to evaluate the effects of such a change.
5 CONCLUSION
In this paper we proposed CoCKG, an approach for correcting erroneous facts originating from entity confusions. The experiments show that CoCKG is capable of correcting wrong triples with confused instances, with an estimated precision of 21% of the produced corrections in DBpedia and 14% in NELL. The low precision values obtained do not allow this process, as of now, to be used for fully automatic KG enrichment. Nevertheless, it works as a proof of concept and can be useful, e.g., for generating suggestions on which a user would ultimately decide whether to execute them.

In the future it would be interesting to adapt this method to support active learning. Since guaranteeing the quality of the newly generated facts is crucial, having input from the user to clarify borderline cases and improve the overall results would be highly valuable. Furthermore, using an ensemble of different KG models with different characteristics, e.g. KG embeddings, instead of a single model may potentially increase the robustness of the system. Finally, it would be worth adding textual features from entity descriptions to help determine whether a pair of entities is related.

ACKNOWLEDGMENTS
The work presented in this paper has been partly supported by the Ministry of Science, Research and the Arts Baden-Württemberg in the project SyKo2W2 (Synthesis of Completion and Correction of Knowledge Graphs on the Web).

REFERENCES
[1] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating Embeddings for Modeling Multi-relational Data. In Advances in Neural Information Processing Systems 26.
[2] Luis Antonio Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. 2013. AMIE: Association Rule Mining under Incomplete Evidence in Ontological Knowledge Bases. In Proceedings of the 22nd International World Wide Web Conference (WWW '13). ACM, 413–422. http://dl.acm.org/citation.cfm?id=2488425
[3] Ni Lao and William W. Cohen. 2010. Relational Retrieval Using a Combination of Path-constrained Random Walks. Machine Learning 81, 1 (Oct. 2010), 53–67. https://doi.org/10.1007/s10994-010-5205-8
[4] André Melo and Heiko Paulheim. 2017. Detection of Relation Assertion Errors in Knowledge Graphs. In Proceedings of the 9th International Conference on Knowledge Capture (K-CAP 2017). ACM.
[5] André Melo and Heiko Paulheim. 2017. Local and Global Feature Selection for Multilabel Classification with Binary Relevance. Artificial Intelligence Review (2017), 1–28. https://doi.org/10.1007/s10462-017-9556-4
[6] André Melo, Heiko Paulheim, and Johanna Völker. 2016. Type Prediction in RDF Knowledge Bases Using Hierarchical Multilabel Classification. In Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics (WIMS '16). ACM, New York, NY, USA, Article 14, 10 pages. https://doi.org/10.1145/2912845.2912861
[7] Stoyan Mihov and Klaus U. Schulz. 2004. Fast Approximate Search in Large Dictionaries. Computational Linguistics 30, 4 (Dec. 2004), 451–477. https://doi.org/10.1162/0891201042544938
[8] Gonzalo Navarro. 2001. A Guided Tour to Approximate String Matching. ACM Computing Surveys 33, 1 (March 2001), 31–88. https://doi.org/10.1145/375360.375365
[9] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. 2016. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE 104, 1 (2016), 11–33. https://doi.org/10.1109/JPROC.2015.2483592
[10] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. 2011. A Three-Way Model for Collective Learning on Multi-Relational Data. In Proceedings of the 28th International Conference on Machine Learning (ICML-11). http://www.icml-2011.org/papers/438_icmlpaper.pdf
[11] Heiko Paulheim. 2017. Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. Semantic Web 8, 3 (2017), 489–508. https://doi.org/10.3233/SW-160218
[12] Heiko Paulheim and Christian Bizer. 2013. Type Inference on Noisy RDF Data. Springer Berlin Heidelberg, Berlin, Heidelberg, 510–525. https://doi.org/10.1007/978-3-642-41335-3_32
[13] Heiko Paulheim and Christian Bizer. 2014. Improving the Quality of Linked Data Using Statistical Distributions. International Journal on Semantic Web and Information Systems 10, 2 (April 2014), 63–86. https://doi.org/10.4018/ijswis.2014040104
[14] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex Embeddings for Simple Link Prediction. CoRR abs/1606.06357 (2016). http://arxiv.org/abs/1606.06357
[15] Chengyu Wang, Rong Zhang, Xiaofeng He, and Aoying Zhou. 2016. Error Link Detection and Correction in Wikipedia. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 307–316. https://doi.org/10.1145/2983323.2983705
[16] Zhenglu Yang, Jianjun Yu, and Masaru Kitsuregawa. 2010. Fast Algorithms for Top-k Approximate String Matching. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI'10). AAAI Press, 1467–1473. http://dl.acm.org/citation.cfm?id=2898607.2898841