
Non-Redundant Link Keys in RDF Data: Preliminary Steps

Nacira Abbas

Nacira.Abbas@inria.fr 1

Alexandre Bazin

Alexandre.Bazin@loria.fr 1

Jérôme David

Jerome.David@inria.fr 0

Amedeo Napoli

Amedeo.Napoli@loria.fr 1

0 Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France

1 Université de Lorraine, CNRS, Inria, Loria, F-54000 Nancy, France

A link key between two RDF datasets D1 and D2 is a set of pairs of properties that allows identifying pairs of individuals, say x1 in D1 and x2 in D2, which can be materialized as an x1 owl:sameAs x2 identity link. There exist several ways to mine such link keys, but none of them takes into account the fact that owl:sameAs is an equivalence relation. Taking this fact into account leads to the discovery of non-redundant link keys. Accordingly, in this paper, we present link key discovery based on Pattern Structures (PS). PS output a pattern concept lattice where every concept has an extent representing a set of pairs of individuals and an intent representing the related link key candidate. Then, we discuss the equivalence relation induced by a link key and we introduce the notion of non-redundant link key candidate.

Keywords: Linked Data · RDF · Link Key · Formal Concept Analysis · Pattern Structures

In this paper, we are interested in data interlinking, whose goal is to discover identity links across two RDF datasets over the web of data [5, 8]. The same real-world entity can be represented in two RDF datasets by different subjects in RDF triples (subject, property, value) (instead of “object”, usually used in RDF data, we will use “value”). It is important to be able to detect such identities, for example using rules expressing sufficient conditions for two subjects to be identical. A link key takes the form of two sets of pairs of properties associated with a pair of classes. The pairs of properties express sufficient conditions for two subjects, from the associated pair of classes, to be the same. An example of a link key is ({(designation, title)}, {(designation, title), (creator, author)}, (Book, Novel)), which states that whenever an instance a of the class Book has the same (non-empty) values for the property designation as an instance b of the class Novel has for the property title (universal quantification), and a and b share at least one value for the properties creator and author (existential quantification), then a and b denote the same entity, i.e., an owl:sameAs relation can be established between a and b.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

A link key can be understood as a “closed set” in the sense that it is maximal w.r.t. the set of pairs of individuals to which it applies. This was first discussed in [2] and then extended in [3]. Relying on Formal Concept Analysis (FCA [7]) to discover link keys is therefore natural, as FCA is based on a closure operator. Given two RDF datasets, FCA is applied in [3] to a binary table where rows correspond to pairs of individuals and columns to pairs of properties. The intent of a concept is a link key candidate, which should be validated with suitable quality measures. The extent of the concept is the set of identity links between individuals. Furthermore, a generalization of this approach, proposed in [1], is based on pattern structures [6] and takes into account different pairs of classes at the same time in the discovery of link keys.

Link key candidates over two RDF datasets have to generate different and maximal link sets. However, it appears that two different link key candidates may generate the same link set once the symmetry and transitivity of owl:sameAs are taken into account. This means that there exists some redundancy between the two link key candidates, and that they should be considered as equivalent and merged. This can be achieved by looking at owl:sameAs, which is an equivalence relation stating that two individuals should be identified. The owl:sameAs relation generates partitions among pairs of individuals that can be used to detect redundant link key candidates and thus reduce their number, i.e., two candidates relying on the same partition are declared redundant and thus merged.

In this paper, we present the discovery of link key candidates within the framework of pattern structures. Then, we introduce the notion of non-redundant link key candidate based on the equivalence relation induced by a link key candidate. Finally, we discuss how these candidates can be merged to reduce the search space of link keys.

2 Basics and Notations

2.1 RDF data

In this work, we deal with RDF datasets which are defined as follows:

Definition 1 (RDF dataset). Let U be a set of IRIs (Internationalized Resource Identifiers), B a set of blank nodes and L a set of literals. An RDF dataset is a set of triples (s, p, v) ∈ (U ∪ B) × U × (U ∪ B ∪ L).

Given a dataset D, we denote by:

– I(D) = {s | ∃p, v (s, p, v) ∈ D} the set of individual identifiers,
– P(D) = {p | ∃s, v (s, p, v) ∈ D} the set of property identifiers,
– C(D) = {c | ∃s (s, rdf:type, c) ∈ D} the set of class identifiers, where a triple (s, rdf:type, c) means that the subject s is an instance of the class c,
– I(c) = {s | (s, rdf:type, c) ∈ D} the set of instances of c ∈ C(D),
– p(s) = {v | (s, p, v) ∈ D} the set of values (or “RDF objects”) related to s through p.
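The notations above can be illustrated by a minimal Python sketch in which a dataset is simply a set of (s, p, v) triples. The toy triples and the function names are assumptions made for illustration only; they are not the content of Fig. 1.

```python
RDF_TYPE = "rdf:type"

# Hypothetical toy dataset (not the data of Fig. 1).
D = {
    ("a1", RDF_TYPE, "c1"),
    ("a2", RDF_TYPE, "c1"),
    ("a1", "p1", "v1"),
    ("a1", "p2", "v2"),
    ("a2", "p1", "v1"),
    ("a2", "p1", "v3"),
}

def individuals(dataset):
    """I(D): every subject occurring in some triple."""
    return {s for (s, _, _) in dataset}

def properties(dataset):
    """P(D): every property occurring in some triple.
    Read literally, this also contains rdf:type."""
    return {p for (_, p, _) in dataset}

def classes(dataset):
    """C(D): every c such that some (s, rdf:type, c) is in D."""
    return {v for (_, p, v) in dataset if p == RDF_TYPE}

def instances(dataset, c):
    """I(c): subjects typed with class c."""
    return {s for (s, p, v) in dataset if p == RDF_TYPE and v == c}

def values(dataset, p, s):
    """p(s): values related to subject s through property p."""
    return {v for (s2, p2, v) in dataset if s2 == s and p2 == p}
```

For instance, values(D, "p1", "a2") collects the two values attached to a2 through p1 in this toy dataset.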

An identity link is an RDF triple (a, owl:sameAs, b) stating that the IRIs a and b refer to the same real-world entity. Fig. 1 represents two RDF datasets D1 and D2, where P (D1) = {p1, p2, p3, p4} and P (D2) = {q1, q2, q3, q4}. Then C(D1) = {c1} and C(D2) = {c2} with I(c1) = {a1, a2, a3, a4, a5} and I(c2) = {b1, b2, b3, b4, b5}. For example, the set of values of b3 for the property q2 is q2(b3) = {v8, v9}.

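Fig. 1. The RDF datasets D1 (instances a1, a2, a3, a4, a5 of class c1) and D2 (instances b1, b2, b3, b4, b5 of class c2).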

Definition 2 (Link key expression, link key candidate). Let D1 and D2 be two RDF datasets, k = (Eq, In, (c1, c2)) is a link key expression (over D1 and D2) iff In ⊆ P (D1) × P (D2), Eq ⊆ In, c1 ∈ C(D1) and c2 ∈ C(D2).

The set of links L(k) (directly) generated by k is the set of pairs of instances (a, b) ∈ I(c1) × I(c2) satisfying: (i) for all (p, q) ∈ Eq, p(a) = q(b) and p(a) ≠ ∅, (ii) for all (p, q) ∈ In \ Eq, p(a) ∩ q(b) ≠ ∅.

A link key expression k1 = (Eq1, In1, (c1, c2)) is a link key candidate if: (iii) L(k1) ≠ ∅, (iv) k1 is maximal, i.e., there does not exist another link key expression k2 = (Eq2, In2, (c1, c2)) such that Eq1 ⊂ Eq2, In1 ⊂ In2, and L(k1) = L(k2).
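Conditions (i) and (ii) of Definition 2 can be checked directly. The following Python sketch computes L(k) under the assumption that value sets are stored in dictionaries indexed by (instance, property); the function name, the data layout and all values except q2(b3) = {v8, v9} are hypothetical.

```python
def link_set(vals1, vals2, inst1, inst2, eq, in_):
    """Return L(k) for k = (eq, in_, (c1, c2)), assuming eq is a subset of in_.

    vals1[(a, p)] plays the role of p(a) in D1, vals2[(b, q)] of q(b) in D2;
    inst1 and inst2 play the roles of I(c1) and I(c2).
    """
    def holds(a, b):
        for (p, q) in in_:
            pa = vals1.get((a, p), set())
            qb = vals2.get((b, q), set())
            if (p, q) in eq:
                # Condition (i): equal and non-empty value sets.
                if not pa or pa != qb:
                    return False
            elif not (pa & qb):
                # Condition (ii): at least one shared value.
                return False
        return True

    return {(a, b) for a in inst1 for b in inst2 if holds(a, b)}

# Hypothetical values, except q2(b3) = {v8, v9} which is given in the text.
vals1 = {("a1", "p1"): {"v1"}, ("a1", "p2"): {"v2"},
         ("a3", "p1"): {"v7"}, ("a3", "p2"): {"v8"}}
vals2 = {("b1", "q1"): {"v1", "v3"}, ("b1", "q2"): {"v2"},
         ("b3", "q1"): {"v7"}, ("b3", "q2"): {"v8", "v9"}}

in_pairs = {("p1", "q1"), ("p2", "q2")}
strict = link_set(vals1, vals2, {"a1", "a3"}, {"b1", "b3"},
                  {("p1", "q1")}, in_pairs)
loose = link_set(vals1, vals2, {"a1", "a3"}, {"b1", "b3"}, set(), in_pairs)
```

With these toy values, requiring equality on (p1, q1) keeps only the pair (a3, b3), whereas the purely existential expression also links a1 and b1.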

The number of link key expressions may be exponential w.r.t. the number of properties. Therefore, link key discovery algorithms only consider link key candidates, i.e., link key expressions that generate at least one link and that are maximal w.r.t. the set of links they generate.

3 Link Key Discovery

Hereafter, we assume that all link key expressions are defined on the same pair of datasets D1 and D2 w.r.t. one pair of classes, yielding link key expressions of the form k = (Eq, In, (c1, c2)). In the following, we show how link keys may be discovered within the formalism of pattern structures (see details in [1]) and then we discuss the notion of non-redundant link keys.

Example 1. Let us consider the pattern structure (G, (E, ⊓), δ) displayed in Table 1. Here we skip the details for building this table and the related PS lattice, which can be found in [1].

The rows termed “PS objects” correspond to the set of objects G of the pattern structure and include pairs of related instances. The set of descriptions (E, ⊓) includes all possible pairs of properties preceded either by ∀ or ∃. The mapping δ relates a pair of instances (a, b) ∈ I(c1) × I(c2) to a description as follows: (i) δ(a, b) includes ∀(p, q) whenever p(a) = q(b) and p(a) ≠ ∅, (ii) δ(a, b) includes ∃(p, q) whenever p(a) ∩ q(b) ≠ ∅. Then the descriptions correspond to link key expressions (Eq, In) w.r.t. the pair of classes (c1, c2). It should be noticed that it is possible to simultaneously work with several pairs of classes, as explained in [1].
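The mapping δ can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name describe, the dictionary layout and the values (other than q2(b3) = {v8, v9}) are hypothetical, and "A" and "E" are used as ASCII stand-ins for ∀ and ∃.

```python
def describe(a, b, vals1, vals2, props1, props2):
    """Sketch of delta(a, b): a set of ("A", p, q) and ("E", p, q) items."""
    description = set()
    for p in props1:
        for q in props2:
            pa = vals1.get((a, p), set())
            qb = vals2.get((b, q), set())
            if pa & qb:
                # (ii) the value sets share at least one value
                description.add(("E", p, q))
            if pa and pa == qb:
                # (i) the value sets are equal and non-empty
                description.add(("A", p, q))
    return description

# Hypothetical values chosen for illustration; only q2(b3) comes from the text.
vals1 = {("a3", "p1"): {"v7"}, ("a3", "p2"): {"v8"}}
vals2 = {("b3", "q1"): {"v7"}, ("b3", "q2"): {"v8", "v9"}}

delta_a3_b3 = describe("a3", "b3", vals1, vals2, ["p1", "p2"], ["q1", "q2"])
```

With these assumed values, the result contains ∀(p1, q1), ∃(p1, q1) and ∃(p2, q2), which has the same shape as the description of (a3, b3) in Table 1. Note that every ∀(p, q) item comes with the corresponding ∃(p, q) item, since equal non-empty sets necessarily intersect.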

We have that δ(a1, b1) = {∃(p1, q1), ∃(p2, q2)} because p1(a1) ∩ q1(b1) ≠ ∅ and p2(a1) ∩ q2(b1) ≠ ∅, while δ(a2, b1) = {∃(p1, q1)} because p1(a2) ∩ q1(b1) ≠ ∅. Then δ(a1, b1) ⊓ δ(a2, b1) = {∃(p1, q1)} and thus δ(a2, b1) ⊑ δ(a1, b1). This can be read in the pattern concept lattice where the pattern concept pc5 is subsumed by the pattern concept pc4, i.e., the extent of pc5, {(a1, b1), (a2, b2), (a3, b3)}, is included in the extent of pc4, {(a1, b1), (a2, b1), (a2, b2), (a3, b3)}, while the intent {∃(p1, q1)} of pc4 is included in the intent of pc5, {∃(p1, q1), ∃(p2, q2)}.
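The computation above can be replayed in Python, assuming that the similarity operation ⊓ is plain set intersection on descriptions (which is consistent with this example; the general construction is detailed in [1]). The strings "E(p1,q1)" and "E(p2,q2)" are ASCII stand-ins for ∃(p1, q1) and ∃(p2, q2), and the function names are hypothetical.

```python
def meet(d1, d2):
    """Similarity of two descriptions, assumed here to be set intersection."""
    return d1 & d2

def subsumed(d1, d2):
    """d1 is subsumed by d2 iff meet(d1, d2) == d1."""
    return meet(d1, d2) == d1

delta_a1_b1 = frozenset({"E(p1,q1)", "E(p2,q2)"})
delta_a2_b1 = frozenset({"E(p1,q1)"})

common = meet(delta_a1_b1, delta_a2_b1)
```

Here common equals δ(a2, b1), so subsumed(delta_a2_b1, delta_a1_b1) holds while the converse does not.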

The set of all pattern concepts is organized within the pattern concept lattice, called lkps-lattice, displayed in Fig. 2. Moreover, all potential link key candidates lie in the intents of the pattern concepts in the lattice. The corresponding set of link key candidates is denoted by lkc.

Table 1: the pattern structure (G, (E, ⊓), δ).

PS objects (g)    descriptions (δ(g))
(a1, b1)          {∃(p1, q1), ∃(p2, q2)}
(a1, b2)          {∃(p2, q2)}
(a2, b1)          {∃(p1, q1)}
(a2, b2)          {∃(p1, q1), ∃(p2, q2)}
(a3, b3)          {∀(p1, q1), ∃(p1, q1), ∃(p2, q2)}
(a4, b4)          {∀(p3, q3), ∃(p3, q3)}
(a4, b5)          {∀(p4, q4), ∃(p4, q4)}
(a5, b4)          {∀(p4, q4), ∃(p4, q4)}
(a5, b5)          {∀(p3, q3), ∃(p3, q3)}
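The rows of Table 1 are enough to recover the intents and extents of the pattern concepts with a non-empty extent. The following Python sketch does so by closing the object descriptions under intersection; it is a naive illustration, not the algorithm of [1], and it assumes that ⊓ is set intersection. "A(...)" and "E(...)" are ASCII stand-ins for ∀ and ∃, and the function names are hypothetical.

```python
TABLE_1 = {
    ("a1", "b1"): frozenset({"E(p1,q1)", "E(p2,q2)"}),
    ("a1", "b2"): frozenset({"E(p2,q2)"}),
    ("a2", "b1"): frozenset({"E(p1,q1)"}),
    ("a2", "b2"): frozenset({"E(p1,q1)", "E(p2,q2)"}),
    ("a3", "b3"): frozenset({"A(p1,q1)", "E(p1,q1)", "E(p2,q2)"}),
    ("a4", "b4"): frozenset({"A(p3,q3)", "E(p3,q3)"}),
    ("a4", "b5"): frozenset({"A(p4,q4)", "E(p4,q4)"}),
    ("a5", "b4"): frozenset({"A(p4,q4)", "E(p4,q4)"}),
    ("a5", "b5"): frozenset({"A(p3,q3)", "E(p3,q3)"}),
}

def extent(intent):
    """Pairs of instances whose description contains every item of intent."""
    return {g for g, d in TABLE_1.items() if intent <= d}

def intents():
    """Close the object descriptions under pairwise intersection."""
    closed = set(TABLE_1.values())
    changed = True
    while changed:
        changed = False
        for x in list(closed):
            for y in list(closed):
                if (x & y) not in closed:
                    closed.add(x & y)
                    changed = True
    return closed

ext_pc4 = extent(frozenset({"E(p1,q1)"}))
ext_pc5 = extent(frozenset({"E(p1,q1)", "E(p2,q2)"}))
```

The two extents computed at the end are those given for pc4 and pc5 in Example 1, and ext_pc5 is included in ext_pc4 as expected.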

Let us consider the so-called lkps-lattice and pc = (L(k), k) a pattern concept, where the extent L(k) corresponds to the set of links generated by k, and the intent k corresponds to a link key candidate. Let I = I(c1) ∪ I(c2) denote the set of instances and let ≃k ⊆ I × I be the binary relation such that (a, b) ∈ L(k) → a ≃k b. The interpretation of a ≃k b is: "k states that there exists an owl:sameAs relation between a and b". Since owl:sameAs is itself an equivalence relation, ≃k is an equivalence relation as well, i.e., it is closed under reflexivity, symmetry and transitivity. We say that k induces the equivalence relation ≃k over I. Moreover, ≃k forms a partition of I where each element of this partition is an equivalence class. The equivalence relation ≃k helps us build a more concise set of link key candidates, since it allows us to identify non-redundant link key candidates, termed nr-lkc. A link key candidate k1 is an nr-lkc in lkc if there is no other candidate k2 in lkc such that ≃k1 and ≃k2 form the same partition. Otherwise, k1 is redundant.

In Fig. 2, it can be observed that ≃k3 and ≃k4 form the same partition, namely {(a1, b1, a2, b2), (a3, b3)} (singletons are omitted for the sake of readability). Hence the link key candidates k3 and k4 are redundant. By contrast, k1 is an nr-lkc because there is no other candidate k in lkc such that ≃k1 and ≃k form the same partition.

Let us briefly explain how ≃k3 and ≃k4 induce the same partition, namely {(a1, b1, a2, b2), (a3, b3)}. The extent of k3 in lkps-lattice is given by {(a1, b1), (a1, b2), (a2, b2), (a3, b3)}. By transitivity and symmetry of owl:sameAs, (a1, b2) and (b2, a2) yield (a1, a2), then (a2, a1) and (a1, b1) yield (a2, b1), and finally (b1, a2) and (a2, b2) yield (b1, b2), which gives the complete graph over {a1, a2, b1, b2}. The same reasoning applies when we consider k4 instead of k3. This intuitively shows how ≃k3 and ≃k4 induce the same partition.
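The closure by symmetry and transitivity described above amounts to computing connected components of the link graph. A minimal Python sketch using a union-find structure is given below; the function name is hypothetical, the extent of k3 is taken from the text and the extent of k4 from Example 1 (where it appears as the extent of pc4).

```python
def partition(links):
    """Equivalence classes induced by a set of links (singletons omitted)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in links:
        parent[find(a)] = find(b)  # merge the classes of a and b

    blocks = {}
    for x in list(parent):
        blocks.setdefault(find(x), set()).add(x)
    return {frozenset(block) for block in blocks.values()}

extent_k3 = {("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b3")}
extent_k4 = {("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a3", "b3")}

partition_k3 = partition(extent_k3)
partition_k4 = partition(extent_k4)
```

Both calls return the two classes {a1, a2, b1, b2} and {a3, b3}, although the two extents differ. Only instances occurring in at least one link are considered, so singleton classes are omitted, as in the text.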

One straightforward application of identifying nr-lkc is the ability to reduce the search space of link keys, since the set of nr-lkc is included in lkc. Indeed, this can be seen as a refinement where redundant link key candidates inducing the same partition are merged. For example, since ≃k3 and ≃k4 form the same partition, k3 and k4 can be merged into an nr-lkc k34 = {k3, k4}. One perspective is to consolidate the theory and practice of link key discovery based on partition pattern structures, initially introduced for mining functional dependencies in [4].
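Merging redundant candidates can be sketched as grouping candidates by the partition they induce. The Python code below is a naive illustration under assumptions: the function names are hypothetical, the link sets of k3 and k4 are those of the example, and the label "k5" is an assumed name for the candidate whose extent is that of pc5.

```python
def partition(links):
    """Blocks obtained by repeatedly merging blocks that share an instance."""
    blocks = []
    for a, b in links:
        merged = {a, b}
        rest = []
        for block in blocks:
            if block & merged:
                merged |= block
            else:
                rest.append(block)
        blocks = rest + [merged]
    return frozenset(frozenset(block) for block in blocks)

def merge_redundant(candidates):
    """Group candidate labels by induced partition; each group is one nr-lkc."""
    groups = {}
    for label, links in candidates.items():
        groups.setdefault(partition(links), []).append(label)
    return sorted(sorted(group) for group in groups.values())

candidates = {
    "k3": {("a1", "b1"), ("a1", "b2"), ("a2", "b2"), ("a3", "b3")},
    "k4": {("a1", "b1"), ("a2", "b1"), ("a2", "b2"), ("a3", "b3")},
    "k5": {("a1", "b1"), ("a2", "b2"), ("a3", "b3")},  # label assumed
}

merged = merge_redundant(candidates)
```

On this input, k3 and k4 fall in the same group, which corresponds to k34 = {k3, k4}, while the third candidate stays alone because it induces a finer partition.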

1. Abbas, N., David, J., Napoli, A.: Discovery of link keys in RDF data based on pattern structures: Preliminary steps. In: Proceedings of ICFCA. CEUR Workshop Proceedings, vol. 2668, pp. 235-246. CEUR-WS.org (2020)

2. Atencia, M., David, J., Euzenat, J.: Data interlinking through robust linkkey extraction. In: Proceedings of ECAI. pp. 15-20 (2014)

3. Atencia, M., David, J., Euzenat, J., Napoli, A., Vizzini, J.: Link key candidate extraction with relational concept analysis. Discrete Applied Mathematics 273, 2-20 (2020)

4. Baixeries, J., Kaytoue, M., Napoli, A.: Characterizing functional dependencies in formal concept analysis with pattern structures. Annals of Mathematics and Artificial Intelligence 72, 129-149 (2014)

5. Ferrara, A., Nikolov, A., Scharffe, F.: Data Linking for the Semantic Web. International Journal of Semantic Web and Information Systems 7(3), 46-76 (2011)

6. Ganter, B., Kuznetsov, S.O.: Pattern Structures and Their Projections. In: Proceedings of the International Conference on Conceptual Structures (ICCS). pp. 129-142. LNCS 2120, Springer (2001)

7. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer (1999)

8. Nentwig, M., Hartung, M., Ngonga Ngomo, A.C., Rahm, E.: A survey of current link discovery frameworks. Semantic Web 8(3), 419-436 (2017). https://doi.org/10.3233/SW-150210