=Paper=
{{Paper
|id=Vol-2378/shortAT1
|storemode=property
|title=Linkex: A Tool for Link Key Discovery Based on Pattern Structures
|pdfUrl=https://ceur-ws.org/Vol-2378/shortAT1.pdf
|volume=Vol-2378
|authors=Nacira Abbas,Jérôme David,Amedeo Napoli
|dblpUrl=https://dblp.org/rec/conf/icfca/AbbasDN19
}}
==Linkex: A Tool for Link Key Discovery Based on Pattern Structures==
Linkex: A Tool for Link Key Discovery Based on Pattern Structures Nacira Abbas1 , Jérôme David2 , and Amedeo Napoli1 1Université de Lorraine, CNRS, Inria, Loria, F-54000 Nancy, France Nacira.Abbas@inria.fr, Amedeo.Napoli@loria.fr 2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France Jerome.David@inria.fr Abstract. Links constitute the core of Linked Data philosophy. With the high growth of data published in the web, many frameworks have been proposed to deal with the link discovery problem, and particularly the identity links. Finding such kinds of links between different RDF data sets is a critical task. In this po- sition paper, we focus on link key which consists of sets of pairs of properties identifying the same entities across heterogeneous datasets. We also propose to formalize the problem of link key discovery using Pattern Structures (PS), the generalization of Formal Concept Analysis dealing with non binary datasets. Af- ter providing the proper definitions of link keys and setting the problem in terms of PS, we show that the intents of the pattern concepts correspond to link keys and their extents to sets of identity links generated by their intents. Finally, we discuss an implementation of this framework and we show the applicability and the scalability of the proposed method. Keywords: Link key· Pattern Structures· Linked Data· RDF. 1 Introduction Data interlinking is the task of finding the same entities described in different RDF datasets. To perfom this task, it has been proposed to use link keys [1] that extend the notion of a key used in databases. Link keys are rules allowing to infer identity links between RDF datasets. In practice, a link key is a set of pairs of properties from two datasets that generate same-as links. An example of a link key is: {hauteur, creatori}{htitre, titlei} linkkey hLivre, Booki stating that whenever an instance of the class Livre has the same values for property auteur as an instance of class Book has for property creator and they share at least one value for their property titre and title, then they denote the same entity. An algorithm to extract link key candidates is proposed in [1]. However, these candidates are not yet link keys. In order to select the best link key candidate to apply as a link key, quality measures have been proposed but this is out of the scope of this paper. Copyright c 2019 for this paper by its authors. Copying permitted for private and academic purposes. 2 N. Abbas et al. The question of using Formal Concept Analysis (FCA) to extract such kind of keys has naturally arisen, given that the set of link key candidates is an ordered structure. A first formalization is given in [2] then developed in [3]. Yet, to apply FCA techniques, scaling (binarizing) data is applied, which may not be efficient, particularly when we have to deal with a large number of entries like RDF datasets. In this position paper, instead of scaling, we propose to work directly on link key expressions which are the syntactic formulations of link keys. To do that, we use Pattern Structures (PS) [4], a generalization of FCA to deal with non binary data. We show that the intents of resulting pattern concepts correspond to link key candidates and their extents to sets of identity links generated by their intents. 2 Notations and definitions An RDF dataset is a set of triples: hsub ject, property, ob jecti ∈ (U ∪ B) × U × (U ∪ B ∪ L), where U is a set of IRIs (In- ternationalized Resource Identifier), B a set of blank nodes and L a set of literals, i.e., values depending on datatypes. In this work, we consider only objects which are lit- erals. To avoid confusion with PS objects, we refer to ”object” in an RDF triple as 0 ”rdf object”. Let D, D be two RDF datasets, we aim to discover link keys between 0 0 0 two specific classes C, C from D, D respectively. C = {s|hs,rdf:type,Ci ∈ D} and C = 0 0 0 0 0 {s |hs ,rdf:type,C i ∈ D } are the sets of instances of the classes C and C respectively. 0 0 0 0 0 0 0 0 P = {p|∃s, o, hs, p, oi ∈ D} and P = {p |∃s , o , hs , p , o i ∈ D } are the sets of proper- 0 ties from D and D respectively. Unlike databases, the properties in RDF are not func- tional i.e. for one property, an instance may have more than one value i.e. rdf object. Here, p(s) = {o|∃s, p, hs, p, oi ∈ D} is the set of the values of an instance s for the prop- erty p. Figure 1 represents an example of two datasets represented by instances of the classes Person and Inhabitant respectively. The goal here is to find a link key which identifies the same entities described in both classes. One may expect to identify that z1 is the same entity as i1 i.e. the link hz1 , i1 i as well as the links hz2 , i2 i and hz3 , i3 i. o1 : Person o2 : Inhabitant Dupont e o2:name o1:lastNam o1 : z1 o2 : i1 o1:firstNam o2:given e Thomas e o2:given o1:firstNam o1 : z2 o2 : i2 o1:lastNam e o2:name Dubois e o2:name o1:lastNam o1 : z3 o2 : i3 o2:name o1:fi Durand rstN iven ame o1:g Lisa Fig. 1. Example of two datasets representing respectively instances of classes Person and Inhab- itant. A link key is a statement of the form: 0 0 0 0 0 0 {hp1 , p1 i, .., hpk , pk i}{hp1 , p1 i, .., hpk , pk i, .., hpn , pn i} linkkey hC,C i 0 0 0 meaning that for all pairs of instances hs, s i ∈ C ×C , if s and s have the same values, Linkex: A Tool for Link Key Discovery Based on Pattern Structures 3 0 i.e. rdf objects, for each pairs of properties p1,..,k and p1,..,k respectively, and share at 0 least one value for each pair of properties pk+1,..,n and pk+1,..,n respectively, then they 0 represent the same entity i.e. (hs, owl:sameAs, s i). 0 Definition 1 (Link key expression). Let Eq and In be sets of pairs of properties hp, p i ∈ 0 0 P × P where Eq ⊆ In. K = hEq, In, hC,C ii is called a link key expression over two 0 classes C,C . Notice that Eq and In may be empty. Since Eq ⊆ In, we write K = Eq, In \ Eq for short. Which means that when a pair of properties belongs to the Eq part, we don’t represent it in the In part. As example of link key expression over the two classes Person and Inhabitant, we have: / {hlastName, namei, h f irstName, giveni} where Eq is empty. 0, 0 0 Definition 2 (Satisfaction of a link key expression). We say that a link hs, s i ∈ C ×C 0 satisfies the link key expression K = Eq, In and we write K |= hs, s i iff: 0 0 0 0 0 0 ∀hp, p i ∈ Eq=⇒p(s) = p (s ) 6= 0/ and ∀hp, p i ∈ In=⇒p(s) ∩ p (s ) 6= 0/ Notice that a link may satisfies more than one link key expression. 0 0 0 The set L(K) = {hs, s i ∈ C × C |K |= hs, s i} is the set of all links satisfying K and called the link set generated by K. Definition 3 (meet, join, subsumption of link key expressions). Let K1 = Eq1 , In1 and K2 = Eq2 , In2 two link key expressions. The meet u and join t of K1 and K2 are defined respectively as follow: K1 u K2 = Eq1 ∩ Eq2 , In1 ∩ In2 and K1 t K2 = Eq1 ∪ Eq2 , In1 ∪ In2 We say K2 subsumes K1 and we write K1 v K2 iff K1 u K2 = K1 where v is a partial order between link key expressions. 3 Link key extraction with Pattern Structure We propose here to formalize the problem of link key candidate extraction using PS [4] which is a generalization of FCA [5] to deal with non binary data. Definition 4 (Pattern Structure for link key extraction). A Pattern Structure for link 0 0 keys extraction is a triple (C × C , (D, u, t), δ ) where, C × C is the set a pair of in- 0 0 0 stances hs, s i ∈ C ×C . D is a set of link key expressions over the two classes C and C . 0 0 δ is a mapping associating each pair of instances hs, s i ∈ C ×C to its description K, 0 defined as the maximal link key expression satisfied by hs, s i i.e. the join of the link key 0 0 0 expressions satisfied by hs, s i. Formally, δ (hs, s i) = t{K|K |= hs, s i}. (D, u, t) is a lattice. The meet, join and subsumption of two link key expressions are defined as in Definition 3. From the previous example of datasets represented in Fig- ure 1, we obtained the Pattern Structure in Table 1. Here for example, to find δ (hz1 , i1 i), we have computed the join of all link key exepressions satisfied by the link hz1 , i1 i and we obtained the description δ (hz1 , i1 i) = {hlastName, namei, h f irstName, giveni}, {}. Notice that we omit pairs of instances for which descriptions correspond to the empty link key expression K = 0, / 0. / 4 N. Abbas et al. 0 Derivation operators . are defined as follows for A ⊆ C ×C and K ∈ D: 0 0 0 A = uhs,s0 i∈A δ (hs, s i) and K = {hs, s i |K v δ (hs, s i)} (A, K) is a pattern concept if A = K and K = A. We say that a pattern concept (A, K) is subsumed by another pattern concept (A1 , K1 ) if A ⊆ A1 (or equivalently if K1 v K). We aim here to extract link key candidates using the PS defined above. Definition 5 (Link key candidate). A link key expression K is a link key candidate 0 over two classes C,C if L(K) 6= 0/ and there is no link key expression H such as H v K that generates the same link set as K i.e. L(K) 6= L(H) One can see that, if (A, K) is a pattern concept, then its intent K represents a link key candidate and its extent is the link set generated by this link key candidate. Figure 2 represents the pattern concept lattice generated from the PS in Table 1. For example, the pattern concept:{hz1 , i1 i, hz2 , i2 i, hz3 , i3 i}, {h f irstName, giveni}, {hlastName, namei}, identify the link key candidate in the intent, which generates the link set in the extent. This link key candidate can be applied as a link key. In fact, it generates the expected links. For example the instances z3 and i3 represent the same person since both of them had the same first name and share at least one family name. The class Inhabitant gives, for example, the birth name and the married name while the class Person gives just one of these two. Table 1. The pattern structure computed from RDF datasets represented in Figure 1 Eq In {hlastName, namei, hz1 , i1 i {} h f irstName, giveni} hz1 , i2 i {h f irstName, giveni} {} hz2 , i1 i {h f irstName, giveni} {} ‘ {hlastName, namei, hz2 , i2 i {} h f irstName, giveni} hz2 , i3 i 0/ {hlastName, namei} hz3 , i2 i {hlastName, namei} {} Fig. 2. The resulting lattice from the pattern hz3 , i3 i {h f irstName, giveni} {hlastName, namei} structure in Table 1. 4 Tool Description We have implemented a prototype tool called Linkex 1 , which aims to extract link key candidates using PS. It starts from two RDF data sets and computes a PS, then generates a pattern concept lattice using a modified version of AddIntent algorithm [6]. The intent of each resulting pattern concept corresponds to a link key candidate and its extent to the set of identity links generated by this link key candidate. When the resulting pattern concept lattice is small, Linkex gives the possibility to visualize link key candidates in a nice way i.e. organized in a lattice represented by a Hasse diagram. For example the Figure 2, which represents a pattern concept lattice, is a direct output of this tool. Linkex offers also the possibility to assess the quality of the extracted link keys candidates using the measures proposed in [1]. 1 https://gitlab.inria.fr/moex/linkex Linkex: A Tool for Link Key Discovery Based on Pattern Structures 5 To demonstrate the applicability and the scalability of the proposed tool, we have experimented with two bibliographical datasets. The first one2 is produced by BnF, the ”Bibliothèque nationale de France”. The second one3 is produced by ABES, ”Agence Bibliographique de l’Enseignement Supérieur”. BnF and ABES, have provided us with two samples of data for the experiments. The first sample is an extraction of authors in- stances (and their books instances) that have a combination firstname, name which is in the top 1000 most frequent homonyms. The second sample is an extraction of all authors instances (and their book instances) that have a name starting with letter ’A’. We aim here to find link key candidates between instances of the classes representing authors contained in both datasets. As showed in Table 2, we got short running times comparing to the size of processed data. Notice that, the processing time is correlated to the size of datasets. Yet, the most interesting result here is the number of the obtained pattern con- cepts i.e. the number of link key candidates represented by the column #PsConcepts in Table 2. For example, in sample 1, from 1,564,495 pattern structure descriptions, we got only 155 link key candidates. As a direct consequence of this, is that an expert could easily navigate these link key candidates and evaluate them. Thus, we can say that the proposed method implemented as Linkex is applicable and scalable. l Table 2. Experimentation results Samples #BnF #ABES #PsEntries #PsConcepts Processing time . Sample 1 8,162 15,421 1,564,495 155 2m10 Sample 2 18,637 142,571 12,348,012 186 6m50 5 Conclusions In this work, we formalize the problem of extracting link key candidates using PS and we propose a tool allowing to automatically build a pattern concept lattice where each pattern concept intent is a candidate link key. We aim to extend this work to consider in- terdependent classes i.e when rdf objects are instances of other classes and we plan also to relax equality constraints used to compare literals by considering similarity measures instead. Acknowledgments This work has been supported by the ANR project Elker (ANR-17-CE23-0007-01) and the BnF in the context of the agreement between Inria and Ministère de la culture. References 1. Atencia, M., David, J., Euzenat, J.: Data interlinking through robust linkkey extraction. In: ECAI. pp. 15–20 (2014) 2 https://data.bnf.fr 3 https://www.idref.fr 6 N. Abbas et al. 2. Atencia, M., David, J., Euzenat, J.: What can fca do for database linkkey extraction? In: 3rd ECAI workshop on What can FCA do for Artificial Intelligence?(FCA4AI). pp. 85–92. No commercial editor. (2014) 3. Atencia, M., David, J., Euzenat, J., Napoli, A., Vizzini, J.: Link key candidate extraction with relational concept analysis. Discrete Applied Mathematics (2019) 4. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: ICCS. pp. 129–142. Springer (2001) 5. Ganter, B., Wille, R.: Formal concept analysis: mathematical foundations. Springer Science & Business Media (2012) 6. Van Der Merwe, D., Obiedkov, S., Kourie, D.: Addintent: A new incremental algorithm for constructing concept lattices. In: ICFCA. pp. 372–385. Springer (2004)