Non-Redundant Link Keys in RDF Data:
                Preliminary Steps

    Nacira Abbas1 , Alexandre Bazin1 , Jérôme David2 , and Amedeo Napoli1
        1
         Université de Lorraine, CNRS, Inria, Loria, F-54000 Nancy, France
  Nacira.Abbas@inria.fr, Alexandre.Bazin@loria.fr, Amedeo.Napoli@loria.fr
 2
   Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble,
                         France Jerome.David@inria.fr


      Abstract. A link key between two RDF datasets D1 and D2 is a set of
      pairs of properties allowing to identify pairs of individuals, say x1 in D1
      and x2 in D2 , which can be materialized as a x1 owl:sameAs x2 identity
      link. There exist several ways to mine such link keys but no one takes
      into account the fact that owl:sameAs is an equivalence relation, which
      leads to the discovery of non-redundant link keys. Accordingly, in this
      paper, we present the link key discovery based on Pattern Structures
      (PS). PS output a pattern concept lattice where every concept has an
      extent representing a set of pairs of individuals and an intent representing
      the related link key candidate. Then, we discuss the equivalence relation
      induced by a link key and we introduce the notion of non-redundant link
      key candidate.

      Keywords: Linked Data · RDF · Link Key · Formal Concept Analysis
      · Pattern Structures.


1    Introduction
In this paper, we are interested in data interlinking which goal is to discover
identity links across two RDF datasets over the web of data [5,8]. The same real
world entity can be represented in two RDF datasets by different subjects in RDF
triples (subject,property,value) (instead of “object” usually used in RDF data
we will use “value”). It is important to be able to detect such identities, for exam-
ple using rules expressing sufficient conditions for two subjects to be identical.
A link key takes the form of two sets of pairs of properties associated with a pair
of classes. The pairs of properties express sufficient conditions for two subjects,
from the associated pair of classes, to be the same. An example of a link key is
 ({(designation, title)}, {(designation, title), (creator, author)}, (Book, Novel))
which states that whenever an instance a of the class Book has the same (non
empty) values for the property designation as an instance b of the class Novel
for the property title (universal quantification), and that a and b share at least
one value for the properties creator and author (existential quantification), then
a and b denote the same entity, i.e., an owl:sameAs relation can be established
between a and b.

Copyright c 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
    A link key can be understood as a “closed set” in the sense that it is maximal
w.r.t. the set of pairs of individuals to which it applies. This was firstly discussed
in [2] and then extended in [3]. Hence the question of relying on Formal Concept
Analysis (FCA [7]) to discover link keys is straightforward as FCA is based on
a closure operator. Then, given two RDF datasets, FCA is applied in [3] to a
binary table where rows correspond to pairs of individuals and columns to pairs
of properties. The intent of a concept is a link key candidate which should be
validated thanks to suitable quality measures. The extent of the concept is the
set of identity links between individuals. Furthermore, a generalization of the
former approach proposed in [1] is based on pattern structures [6] and takes into
account different pairs of classes at the same time in the discovery of link keys.
    Link key candidates over two RDF datasets have to generate different and
maximal link sets. However it appears that two different link key candidates may
generate the same link set. This means that there exists some redundancy be-
tween the two link key candidates, that they should be considered as equivalent
and merged. This can be achieved by looking at owl:sameAs which is an equiv-
alence relation stating that two individual should be identified. The owl:sameAs
relation generates partitions among pairs of individuals that can be used to
detect redundant link key candidates and thus reduce their number, i.e., two
candidates relying on the same partition are declared as redundant and thus
merged.
    In this paper, we present the discovery of link key candidates within the
framework of pattern structure. Then, we introduce the notion of non-redundant
link key candidate based on the equivalence relation induced by a link key can-
didate. Finally, we discuss how these candidates can be merged to reduce the
search space of link keys.


2     Basics and Notations
2.1   RDF data
In this work, we deal with RDF datasets which are defined as follows:
Definition 1 (RDF dataset).
   Let U be a set of IRIs (Internationalized Resource Identifier), B a set of
blank nodes and L a set of literals. An RDF dataset is a set of triples (s, p, v) ∈
(U ∪ B) × U × (U ∪ B ∪ L).
Given a dataset D, we denote by:
 – I(D) = {s | ∃p, v (s, p, v) ∈ D} the set of individual identifiers,
 – P (D) = {p | ∃s, v (s, p, v) ∈ D} the set of property identifiers,
 – C(D) = {c | ∃s (s, rdf:type, c) ∈ D} the set of class identifiers. A triple
   (s, rdf:type, c) means that the subject s is an instance of the class c.
 – I(c) = {s | (s, rdf:type, c) ∈ D} the set of instances of c ∈ C(D),
 – p(s) = {v | (s, p, v) ∈ D} is the set of values (or “RDF objects”) related to s
   through p.
     An identity link is an RDF triple (a, owl:sameAs, b) stating that the IRIs a
and b refer to the same real-world entity. Fig. 1 represents two RDF datasets
D1 and D2 , where P (D1 ) = {p1 , p2 , p3 , p4 } and P (D2 ) = {q1 , q2 , q3 , q4 }. Then
C(D1 ) = {c1 } and C(D2 ) = {c2 } with I(c1 ) = {a1 , a2 , a3 , a4 , a5 } and I(c2 ) =
{b1 , b2 , b3 , b4 , b5 }. For example, the set of values of b3 for the property q2 is
q2 (b3 ) = {v8 , v9 }.


                                           v1
                       c1                            q1         c2
                                  p1       v2
                                  p2                 q2
                       a1         p2       v3                  b1
                                                     q1
                                           v4        q2
                                  p1
                       a2         p2       v5        q2        b2
                                  p1                 q1
                                           v6

                       a3         p1       v7        q1        b3
                                  p2                 q2
                                           v8        q2

                       a4                  v9                  b4
                                  p3                 q3
                                           v10
                                  p4
                       a5                            q4        b5
                                                     q4
                                  p4       v11
                                  p3       v12       q3


                                           v13

                             D1                           D2


Fig. 1. Example of two RDF datasets. On the left-hand side, the dataset D1 is pop-
ulated with instances of the class c1 , and on the right-hand side the dataset D2 is
populated with instances of the class c2 .


2.2   Link Keys
Link keys are logical constructors allowing to deduce identity links (owl:sameAs).
In this paper we will call link key expression the syntactic formulation of a link
key, when we do not know whether the expression is a valid link key. For example,
the link key expression ({}, {(nbrOfPages, nbrOfBeds)}, (Book, Hospital)) will not
satisfy the link key semantics since an instance of class Book and an instance
of class Hospital cannot represent the same entity since Book and Hospital are
disjoint classes.
Definition 2 (Link key expression, link key candidate). Let D1 and D2
be two RDF datasets, k = (Eq, In, (c1 , c2 )) is a link key expression (over D1
and D2 ) iff In ⊆ P (D1 ) × P (D2 ), Eq ⊆ In, c1 ∈ C(D1 ) and c2 ∈ C(D2 ).
     The set of links L(k) (directly) generated by k is the set of pairs of instances
 (a, b) ∈ I(c1 ) × I(c2 ) satisfying:
 (i) for all (p, q) ∈ Eq, p(a) = q(b) and p(a) 6= ∅,
(ii) for all (p, q) ∈ In \ Eq, p(a) ∩ q(b) 6= ∅.
     A link key expression k1 = (Eq1 , In1 , (c1 , c2 )) is a link key candidate if:
(iii) L(k1 ) 6= ∅,
(iv) k1 is maximal i.e. there does not exist another link key expression k2 =
      (Eq2 , In2 , (c1 , c2 )) such that Eq1 ⊂ Eq2 , In1 ⊂ In2 , and L(k1 ) = L(k2 ).
    The number of link key expressions may be exponential w.r.t. the number of
 properties. Then link key discovery algorithms only consider link key candidates
 which are link key expressions generating at least one link and that are maximal
 w.r.t. the set of links they generate.


 3    Link Key Discovery
 Here after we assume that all link key expressions are defined on the same pair
 of datasets D1 and D2 w.r.t. one pair of classes, yielding link key expressions
 of the form k = (Eq, In, (c1 , c2 )). In the following, we show how link keys may
 be discovered within the formalism of pattern structures (see details in [1]) and
 then we discuss the notion of non-redundant link keys.
 Example 1. Let us consider the pattern structure (G, (E, u), δ) displayed in Ta-
 ble 1. Here we skip the details for building this table and the related PS lattice
 which can be found in [1].
     The rows termed “PS objects” correspond to the set of objects G of the
 pattern structure and include pairs of related instances. The set of descriptions
 (E, u) includes all possible pairs of properties preceded either by ∀ or ∃. The
 mapping δ relates a pair of instances (a, b) ∈ I(c1 ) × I(c2 ) to a description as
 follows: (i) δ(a, b) includes ∀(p, q) whenever p(a) = q(b) and p(a) 6= ∅, (ii) δ(a, b)
 includes ∃(p, q) whenever p(a) ∩ q(b) 6= ∅. Then the descriptions correspond to
 link key expressions (Eq, In) w.r.t. the pairs of classes (c1 , c2 ). It should be
 noticed that it is possible to simultaneously work with several pairs of classes as
 explained in [1].
     We have that δ(a1 , b1 ) = {∃(p1 , q1 ), ∃(p2 , q2 )} because p1 (a1 ) ∩ q1 (b1 ) 6= ∅
 and p2 (a1 ) ∩ q2 (b1 ) 6= ∅ while δ(a2 , b1 ) = {∃(p1 , q1 )} because p1 (a2 ) ∩ q1 (b1 ) 6= ∅.
 Then δ(a1 , b1 ) u δ(a2 , b1 ) = {∃(p1 , q1 )} and thus δ(a2 , b1 ) v δ(a1 , b1 ). This can
 be read in the pattern concept lattice where the pattern concept pc5 is subsumed
 by the pattern concept pc4 , i.e., the extent of pc5 {(a1 , b1 ), (a2 , b2 ), (a3 , b3 )} is
 included in the extent of pc4 {(a1 , b1 ), (a2 , b1 ), (a2 , b2 ), (a3 , b3 )}, while the intent
 {∃(p1 , q1 )} of pc4 is included in the intent of pc5 , {∃(p1 , q1 ), ∃(p2 , q2 )}.
     The set of all pattern concepts is organized within the pattern concept lattice
 lkps-lattice displayed in Fig. 2. Moreover, all potential link key candidates are
 lying in the intents of the pattern concepts in the lattice. The corresponding set
 of link key candidates is denoted by lkc.
                                  PS objects (g)       descriptions (δ(g))
                                     (a1 , b1 )      {∃(p1 , q1 ), ∃(p2 , q2 )}
                                     (a1 , b2 )              {∃(p2 , q2 )}
                                     (a2 , b1 )              {∃(p1 , q1 )}
                                     (a2 , b2 )      {∃(p1 , q1 ), ∃(p2 , q2 )}
                                     (a3 , b3 )  {∀(p1 , q1 ), ∃(p1 , q1 ), ∃(p2 , q2 )}
                                     (a4 , b4 )      {∀(p3 , q3 ), ∃(p3 , q3 )}
                                     (a4 , b5 )      {∀(p4 , q4 ), ∃(p4 , q4 )}
                                     (a5 , b4 )      {∀(p4 , q4 ), ∃(p4 , q4 )}
                                     (a5 , b5 )      {∀(p3 , q3 ), ∃(p3 , q3 )}

Table 1. The pattern structure related to link key discovery over c1 and c2 introduced
in Fig. 1.


                                                                    pc0

                                                            I(c1 ) × I(c2 )

                                                                     ∅


                                                                                     pc3                                pc4
                pc1                               pc2
                                                                           {(a1 , b1 ), (a1 , b2 ),          {(a1 , b1 ), (a2 , b1 ),
       {(a4 , b5 ), (a5 , b4 )}          {(a4 , b4 ), (a5 , b5 )}
                                                                           (a2 , b2 ), (a3 , b3 )}           (a2 , b2 ), (a3 , b3 )}
   k1 = {∀(p4 , q4 ), ∃(p4 , q4 )}   k2 = {∀(p3 , q3 ), ∃(p3 , q3 )}
                                                                            k3 = {∃(p2 , q2 )}                   k4 = {∃(p1 , q1 )}


                                                                                         pc5

                                                                          {(a1 , b1 ), (a2 , b2 ), (a3 , b3 )}

                                                                          k5 = {∃(p1 , q1 ), ∃(p2 , q2 )}


                                                                                         pc6

                                                                                     {(a3 , b3 )}

                                                                          k6 = {∀(p1 , q1 ), ∃(p1 , q1 ),
                                                                                 ∃(p2 , q2 )}


                                                                    pc7

                                                                     ∅
                                                          P (D1 ) × P (D2 )


Fig. 2. The pattern concept intents in the pattern concept lattice lkps-lattice include
the complete set of link key candidates.


    Let us consider the so-called lkps-lattice and pc = (L(k), k) a pattern con-
cept, where the extent L(k) corresponds to the set of links generated by k,
and the intent k corresponds to a link key candidate. Let I denotes the set
of instances I = I(c1 ) ∪ I(c2 ) and the binary relation 'k ⊆ I × I such as
(a, b) ∈ L(k) → a 'k b. The interpretation of a 'k b is: "k states that there
exists a owl:sameAs relation between a and b". Actually 'k is an equivalence re-
lation based on the fact that owl:sameAs itself is an equivalence relation. We say
that k induces the equivalence relation 'k over I. Moreover 'k forms a partition
over I where each element of this partition is an equivalence class. In fact the 'k
equivalence relation will help us to build more concise set of link key candidates
since it allows to identify non-redundant link key candidates termed nr-lkc. A
link key candidate k1 is a nr-lkc in lkc if there is no other candidate k2 in lkc
such that 'k1 and 'k2 form the same partition. Otherwise, k1 is redundant.
     In Fig. 2, it can be observed that 'k3 and 'k4 form the same partition,
namely {(a1 , b1 , a2 , b2 ), (a3 , b3 )} (it should be noticed that singletons are omitted
for the sake of readability). Then the link key candidates k3 and k4 are redundant.
By contrast, k1 is a nr-lkc because there is no other candidate k in lkc such
that 'k1 and 'k form the same partition.
     Let us briefly explain how 'k3 and 'k4 are inducing the same partition,
namely {(a1 , b1 , a2 , b2 ), (a3 , b3 )}. The extent of k3 in lkps-lattice is given by
{(a1 , b1 ), (a1 , b2 ), (a2 , b2 ), (a3 , b3 )}. By transitivity and symmetry of owl:sameAs,
we have that (a1 , b2 ) and (b2 , a2 ) yields (a1 , a2 ), then (a2 , a1 ) and (a1 , b1 ) yields
(a2 , b1 ), and finally (b1 , a2 ) and (a2 , b2 ) yields (b1 , b2 ) and the complete graph
between (a1 , a2 , b1 , b2 ). The same thing applies when we consider k4 instead of
k3 . This intuitively shows how 'k3 and 'k4 are inducing the same partition.
     One main straightforward application of identifying nr-lkc is the ability to
reduce the search space of link keys since the set of nr-lkc is included in lkc.
Indeed, this can be seen as a refinement where redundant link key candidates
inducing the same partition are merged. For example, since 'k3 and 'k4 form
the same partition, then, k3 and k4 can be merged into a nr-lkc k34 = {k3 , k4 }.
Among the perspectives is to consolidate the theory and practice of link key
discovery based on partition pattern structures initially introduced for mining
functional dependencies in [4].


References
1. Abbas, N., David, J., Napoli, A.: Discovery of link keys in RDF data based on
   pattern structures: Preliminary steps. In: Proceedings of ICFCA. CEUR Workshop
   Proceedings, vol. 2668, pp. 235–246. CEUR-WS.org (2020)
2. Atencia, M., David, J., Euzenat, J.: Data interlinking through robust linkkey ex-
   traction. In: Proceedings of ECAI. pp. 15–20 (2014)
3. Atencia, M., David, J., Euzenat, J., Napoli, A., Vizzini, J.: Link key candidate
   extraction with relational concept analysis. Discrete applied mathematics 273, 2–
   20 (2020)
4. Baixeries, J., Kaytoue, M., Napoli, A.: Characterizing functional dependencies in
   formal concept analysis with pattern structures. Annals of Mathematics and Arti-
   ficial Intelligence 72, 129–149 (2014)
5. Ferrara, A., Nikolov, A., Scharffe, F.: Data Linking for the Semantic Web. Interna-
   tional Journal of Semantic Web and Information Systems 7(3), 46–76 (2011)
6. Ganter, B., Kuznetsov, S.O.: Pattern Structures and Their Projections. In: Proceed-
   ings of the International Conference on Conceptual Structures (ICCS). pp. 129–142.
   LNCS 2120, Springer (2001)
7. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations.
   Springer (1999)
8. Nentwig, M., Hartung, M., Ngonga Ngomo, A.C., Rahm, E.: A survey
   of current link discovery frameworks. Semantic Web 8(3), 419–436 (2017).
   https://doi.org/10.3233/SW-150210