=Paper= {{Paper |id=Vol-2378/shortAT1 |storemode=property |title=Linkex: A Tool for Link Key Discovery Based on Pattern Structures |pdfUrl=https://ceur-ws.org/Vol-2378/shortAT1.pdf |volume=Vol-2378 |authors=Nacira Abbas,Jérôme David,Amedeo Napoli |dblpUrl=https://dblp.org/rec/conf/icfca/AbbasDN19 }} ==Linkex: A Tool for Link Key Discovery Based on Pattern Structures== https://ceur-ws.org/Vol-2378/shortAT1.pdf
Linkex: A Tool for Link Key Discovery Based on Pattern
                      Structures

                   Nacira Abbas1 , Jérôme David2 , and Amedeo Napoli1
              1Université de Lorraine, CNRS, Inria, Loria, F-54000 Nancy, France
                   Nacira.Abbas@inria.fr, Amedeo.Napoli@loria.fr
      2 Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France

                                  Jerome.David@inria.fr



        Abstract. Links constitute the core of Linked Data philosophy. With the high
        growth of data published in the web, many frameworks have been proposed to
        deal with the link discovery problem, and particularly the identity links. Finding
        such kinds of links between different RDF data sets is a critical task. In this po-
        sition paper, we focus on link key which consists of sets of pairs of properties
        identifying the same entities across heterogeneous datasets. We also propose to
        formalize the problem of link key discovery using Pattern Structures (PS), the
        generalization of Formal Concept Analysis dealing with non binary datasets. Af-
        ter providing the proper definitions of link keys and setting the problem in terms
        of PS, we show that the intents of the pattern concepts correspond to link keys
        and their extents to sets of identity links generated by their intents. Finally, we
        discuss an implementation of this framework and we show the applicability and
        the scalability of the proposed method.

        Keywords: Link key· Pattern Structures· Linked Data· RDF.


1     Introduction


     Data interlinking is the task of finding the same entities described in different RDF
datasets. To perfom this task, it has been proposed to use link keys [1] that extend the
notion of a key used in databases. Link keys are rules allowing to infer identity links
between RDF datasets. In practice, a link key is a set of pairs of properties from two
datasets that generate same-as links. An example of a link key is:
{hauteur, creatori}{htitre, titlei} linkkey hLivre, Booki stating that whenever an instance
of the class Livre has the same values for property auteur as an instance of class Book
has for property creator and they share at least one value for their property titre and
title, then they denote the same entity. An algorithm to extract link key candidates is
proposed in [1]. However, these candidates are not yet link keys. In order to select the
best link key candidate to apply as a link key, quality measures have been proposed but
this is out of the scope of this paper.
    Copyright c 2019 for this paper by its authors. Copying permitted for private and academic
    purposes.
2        N. Abbas et al.

     The question of using Formal Concept Analysis (FCA) to extract such kind of keys
has naturally arisen, given that the set of link key candidates is an ordered structure. A
first formalization is given in [2] then developed in [3]. Yet, to apply FCA techniques,
scaling (binarizing) data is applied, which may not be efficient, particularly when we
have to deal with a large number of entries like RDF datasets. In this position paper,
instead of scaling, we propose to work directly on link key expressions which are the
syntactic formulations of link keys. To do that, we use Pattern Structures (PS) [4], a
generalization of FCA to deal with non binary data. We show that the intents of resulting
pattern concepts correspond to link key candidates and their extents to sets of identity
links generated by their intents.


2    Notations and definitions

An RDF dataset is a set of triples:
hsub ject, property, ob jecti ∈ (U ∪ B) × U × (U ∪ B ∪ L), where U is a set of IRIs (In-
ternationalized Resource Identifier), B a set of blank nodes and L a set of literals, i.e.,
values depending on datatypes. In this work, we consider only objects which are lit-
erals. To avoid confusion with PS objects, we refer to ”object” in an RDF triple as
                        0
”rdf object”. Let D, D be two RDF datasets, we aim to discover link keys between
                            0           0                                                         0
two specific classes C, C from D, D respectively. C = {s|hs,rdf:type,Ci ∈ D} and C =
   0   0          0       0                                                         0
{s |hs ,rdf:type,C i ∈ D } are the sets of instances of the classes C and C respectively.
                                     0        0    0  0   0   0  0      0
P = {p|∃s, o, hs, p, oi ∈ D} and P = {p |∃s , o , hs , p , o i ∈ D } are the sets of proper-
                    0
ties from D and D respectively. Unlike databases, the properties in RDF are not func-
tional i.e. for one property, an instance may have more than one value i.e. rdf object.
Here, p(s) = {o|∃s, p, hs, p, oi ∈ D} is the set of the values of an instance s for the prop-
erty p. Figure 1 represents an example of two datasets represented by instances of the
classes Person and Inhabitant respectively. The goal here is to find a link key which
identifies the same entities described in both classes. One may expect to identify that z1
is the same entity as i1 i.e. the link hz1 , i1 i as well as the links hz2 , i2 i and hz3 , i3 i.
                   o1 : Person                                         o2 : Inhabitant
                                                 Dupont
                                             e             o2:name
                                 o1:lastNam
                     o1 : z1                                               o2 : i1
                                 o1:firstNam               o2:given
                                             e
                                                 Thomas
                                             e             o2:given
                                 o1:firstNam
                     o1 : z2                                               o2 : i2
                                 o1:lastNam
                                            e              o2:name
                                                 Dubois
                                             e             o2:name
                                 o1:lastNam
                     o1 : z3                                               o2 : i3
                                                           o2:name
                                 o1:fi           Durand
                                      rstN                      iven
                                          ame              o1:g

                                                 Lisa


Fig. 1. Example of two datasets representing respectively instances of classes Person and Inhab-
itant.
   A link key is a statement of the form:
                0               0           0               0               0                0
        {hp1 , p1 i, .., hpk , pk i}{hp1 , p1 i, .., hpk , pk i, .., hpn , pn i} linkkey hC,C i
                                                    0              0             0
meaning that for all pairs of instances hs, s i ∈ C ×C , if s and s have the same values,
                  Linkex: A Tool for Link Key Discovery Based on Pattern Structures            3

                                                             0
i.e. rdf objects, for each pairs of properties p1,..,k and p1,..,k respectively, and share at
                                                                 0
least one value for each pair of properties pk+1,..,n and pk+1,..,n respectively, then they
                                                    0
represent the same entity i.e. (hs, owl:sameAs, s i).
                                                                                           0
Definition 1 (Link key expression). Let Eq and In be sets of pairs of properties hp, p i ∈
      0                                  0
P × P where Eq ⊆ In. K = hEq, In, hC,C ii is called a link key expression over two
            0
classes C,C . Notice that Eq and In may be empty.
Since Eq ⊆ In, we write K = Eq, In \ Eq for short. Which means that when a pair of
properties belongs to the Eq part, we don’t represent it in the In part. As example of
link key expression over the two classes Person and Inhabitant, we have:
/ {hlastName, namei, h f irstName, giveni} where Eq is empty.
0,
                                                                                  0            0
Definition 2 (Satisfaction of a link key expression). We say that a link hs, s i ∈ C ×C
                                                                    0
satisfies the link key expression K = Eq, In and we write K |= hs, s i iff:
               0                   0 0               0                      0 0
        ∀hp, p i ∈ Eq=⇒p(s) = p (s ) 6= 0/ and ∀hp, p i ∈ In=⇒p(s) ∩ p (s ) 6= 0/
Notice that a link may satisfies more than one link key expression.
                       0          0         0
The set L(K) = {hs, s i ∈ C × C |K |= hs, s i} is the set of all links satisfying K and
called the link set generated by K.
Definition 3 (meet, join, subsumption of link key expressions). Let K1 = Eq1 , In1
and K2 = Eq2 , In2 two link key expressions. The meet u and join t of K1 and K2 are
defined respectively as follow:
          K1 u K2 = Eq1 ∩ Eq2 , In1 ∩ In2 and K1 t K2 = Eq1 ∪ Eq2 , In1 ∪ In2
    We say K2 subsumes K1 and we write K1 v K2 iff K1 u K2 = K1 where v is a partial
order between link key expressions.


3   Link key extraction with Pattern Structure
We propose here to formalize the problem of link key candidate extraction using PS [4]
which is a generalization of FCA [5] to deal with non binary data.
Definition 4 (Pattern Structure for link key extraction). A Pattern Structure for link
                                     0                            0
keys extraction is a triple (C × C , (D, u, t), δ ) where, C × C is the set a pair of in-
             0         0                                                                    0
stances hs, s i ∈ C ×C . D is a set of link key expressions over the two classes C and C .
                                                           0          0
δ is a mapping associating each pair of instances hs, s i ∈ C ×C to its description K,
                                                              0
defined as the maximal link key expression satisfied by hs, s i i.e. the join of the link key
                              0                     0                   0
expressions satisfied by hs, s i. Formally, δ (hs, s i) = t{K|K |= hs, s i}.
(D, u, t) is a lattice. The meet, join and subsumption of two link key expressions are
defined as in Definition 3. From the previous example of datasets represented in Fig-
ure 1, we obtained the Pattern Structure in Table 1. Here for example, to find δ (hz1 , i1 i),
we have computed the join of all link key exepressions satisfied by the link hz1 , i1 i and
we obtained the description δ (hz1 , i1 i) = {hlastName, namei, h f irstName, giveni}, {}.
    Notice that we omit pairs of instances for which descriptions correspond to the
empty link key expression K = 0, / 0.
                                    /
4              N. Abbas et al.

                                                                                     0
      Derivation operators . are defined as follows for A ⊆ C ×C and K ∈ D:
                                   0                           0               0
           A = uhs,s0 i∈A δ (hs, s i)   and       K  = {hs, s i |K v δ (hs, s i)}
    (A, K) is a pattern concept if A = K and K  = A. We say that a pattern concept
(A, K) is subsumed by another pattern concept (A1 , K1 ) if A ⊆ A1 (or equivalently if
K1 v K). We aim here to extract link key candidates using the PS defined above.
Definition 5 (Link key candidate). A link key expression K is a link key candidate
                     0
over two classes C,C if L(K) 6= 0/ and there is no link key expression H such as H v K
that generates the same link set as K i.e. L(K) 6= L(H)
One can see that, if (A, K) is a pattern concept, then its intent K represents a link key
candidate and its extent is the link set generated by this link key candidate. Figure 2
represents the pattern concept lattice generated from the PS in Table 1. For example, the
pattern concept:{hz1 , i1 i, hz2 , i2 i, hz3 , i3 i}, {h f irstName, giveni}, {hlastName, namei},
identify the link key candidate in the intent, which generates the link set in the extent.
This link key candidate can be applied as a link key. In fact, it generates the expected
links. For example the instances z3 and i3 represent the same person since both of them
had the same first name and share at least one family name. The class Inhabitant gives,
for example, the birth name and the married name while the class Person gives just one
of these two.
Table 1. The pattern structure computed
from RDF datasets represented in Figure 1

                     Eq                 In
           {hlastName, namei,
hz1 , i1 i                              {}
           h f irstName, giveni}
hz1 , i2 i {h f irstName, giveni}       {}
hz2 , i1 i {h f irstName, giveni}       {}
                                                       ‘
             {hlastName, namei,
hz2 , i2 i                              {}
             h f irstName, giveni}
hz2 , i3 i            0/          {hlastName, namei}
hz3 , i2 i {hlastName, namei}             {}
                                                           Fig. 2. The resulting lattice from the pattern
hz3 , i3 i {h f irstName, giveni} {hlastName, namei}       structure in Table 1.


4      Tool Description
We have implemented a prototype tool called Linkex 1 , which aims to extract link
key candidates using PS. It starts from two RDF data sets and computes a PS, then
generates a pattern concept lattice using a modified version of AddIntent algorithm
[6]. The intent of each resulting pattern concept corresponds to a link key candidate
and its extent to the set of identity links generated by this link key candidate. When the
resulting pattern concept lattice is small, Linkex gives the possibility to visualize link
key candidates in a nice way i.e. organized in a lattice represented by a Hasse diagram.
For example the Figure 2, which represents a pattern concept lattice, is a direct output
of this tool. Linkex offers also the possibility to assess the quality of the extracted link
keys candidates using the measures proposed in [1].
 1 https://gitlab.inria.fr/moex/linkex
                    Linkex: A Tool for Link Key Discovery Based on Pattern Structures         5

    To demonstrate the applicability and the scalability of the proposed tool, we have
experimented with two bibliographical datasets. The first one2 is produced by BnF, the
”Bibliothèque nationale de France”. The second one3 is produced by ABES, ”Agence
Bibliographique de l’Enseignement Supérieur”. BnF and ABES, have provided us with
two samples of data for the experiments. The first sample is an extraction of authors in-
stances (and their books instances) that have a combination firstname, name which is in
the top 1000 most frequent homonyms. The second sample is an extraction of all authors
instances (and their book instances) that have a name starting with letter ’A’. We aim
here to find link key candidates between instances of the classes representing authors
contained in both datasets. As showed in Table 2, we got short running times comparing
to the size of processed data. Notice that, the processing time is correlated to the size of
datasets. Yet, the most interesting result here is the number of the obtained pattern con-
cepts i.e. the number of link key candidates represented by the column #PsConcepts
in Table 2. For example, in sample 1, from 1,564,495 pattern structure descriptions, we
got only 155 link key candidates. As a direct consequence of this, is that an expert could
easily navigate these link key candidates and evaluate them. Thus, we can say that the
proposed method implemented as Linkex is applicable and scalable. l
                                 Table 2. Experimentation results

                      Samples #BnF #ABES #PsEntries #PsConcepts Processing time
                    . Sample 1 8,162 15,421 1,564,495    155         2m10
                      Sample 2 18,637 142,571 12,348,012 186         6m50



5    Conclusions

In this work, we formalize the problem of extracting link key candidates using PS and
we propose a tool allowing to automatically build a pattern concept lattice where each
pattern concept intent is a candidate link key. We aim to extend this work to consider in-
terdependent classes i.e when rdf objects are instances of other classes and we plan also
to relax equality constraints used to compare literals by considering similarity measures
instead.


Acknowledgments

This work has been supported by the ANR project Elker (ANR-17-CE23-0007-01) and
the BnF in the context of the agreement between Inria and Ministère de la culture.


References

1. Atencia, M., David, J., Euzenat, J.: Data interlinking through robust linkkey extraction. In:
   ECAI. pp. 15–20 (2014)
 2 https://data.bnf.fr
 3 https://www.idref.fr
6        N. Abbas et al.

2. Atencia, M., David, J., Euzenat, J.: What can fca do for database linkkey extraction? In: 3rd
   ECAI workshop on What can FCA do for Artificial Intelligence?(FCA4AI). pp. 85–92. No
   commercial editor. (2014)
3. Atencia, M., David, J., Euzenat, J., Napoli, A., Vizzini, J.: Link key candidate extraction with
   relational concept analysis. Discrete Applied Mathematics (2019)
4. Ganter, B., Kuznetsov, S.O.: Pattern structures and their projections. In: ICCS. pp. 129–142.
   Springer (2001)
5. Ganter, B., Wille, R.: Formal concept analysis: mathematical foundations. Springer Science
   & Business Media (2012)
6. Van Der Merwe, D., Obiedkov, S., Kourie, D.: Addintent: A new incremental algorithm for
   constructing concept lattices. In: ICFCA. pp. 372–385. Springer (2004)