A Study of the Discovery and Redundancy of Link Keys Between Two RDF Datasets Based on Partition Pattern Structures Nacira Abbas1 , Alexandre Bazin2 , Jérôme David3 and Amedeo Napoli1,∗ 1 Université de Lorraine, CNRS, Inria, Loria, F-54000 Nancy, France 2 Université de Montpellier, CNRS, LIRMM, F-34095 Montpellier, France 3 Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France Abstract A link key between two RDF datasets 𝐷1 and 𝐷2 is a set of pairs of properties allowing to identify pairs of individuals 𝑥1 and 𝑥2 through an identity link such as x 1 owl ∶ sameAs x 2 . In this paper, relying on and extending previous work, we introduce an original formalization of link key discovery based on the framework of Partition Pattern Structures (pps). Our objective is to study and evaluate the redundancy of link keys based on the fact that owl:sameAs is an equivalence relation. In the pps concept lattice, every concept has an extent representing a link key candidate and an intent representing a partition of instances into sets of equivalent instances. Experiments show three main results. Firstly redundancy of link keys is not so significant in real-world datasets. Nevertheless, the link key discovery approach based on pps returns a reduced number of non redundant link key candidates when compared to a standard approach. Moreover, the pps-based approach is efficient and returns link keys of high quality. 1. Introduction In this paper, we are interested in data interlinking whose objective is to discover identity links across two RDF datasets over the web of data [1, 2, 3]. The same real world entity can be represented in two RDF datasets by different subjects in RDF triples having the form (subject, property, object) . Then, for cleaning data and providing data of better quality, it is meaningful to detect such identities. There are both numerical and logical approaches for discovering these identities. For example, interlinking methods have been implemented in systems such as LIMES [4] and SILK [5]. These systems use link specifications, i.e. rules that declare whether two IRIs should be linked. Link specifications can also be specified by users or learned from data [6, 7, 8]. In particular, link keys which are under investigation in this paper are special kinds of rules allowing to infer identity links between two RDF datasets. More formally, a link key is composed of two sets of pairs of properties ({(𝑝𝑖 , 𝑞𝑖 )}𝑖 , {(𝑝𝑗′ , 𝑞𝑗′ )}𝑗 ) associated with a pair of Published in Pablo Cordero, Ondrej Kridlo (Eds.): The 16𝑡ℎ International Conference on Concept Lattices and Their Applications, CLA 2022, Tallinn, Estonia, June 20–22, 2022, Proceedings, pp. 175–189. ∗ Corresponding author. Envelope-Open Nacira.Abbas@inria.fr (N. Abbas); Alexandre.Bazin@umontpellier.fr (A. Bazin); Jerome.David@inria.fr (J. David); Amedeo.Napoli@loria.fr (A. Napoli) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) classes (𝑐1 , 𝑐2 ). Then, whenever an instance 𝑎1 of class 𝑐1 has the same (non empty) set of values as an instance 𝑏1 of class 𝑐2 , i.e. 𝑝𝑖 (𝑎1 ) = 𝑞𝑖 (𝑏1 ) for all pairs of properties in the first set (universal quantification), and shares at least one value for all pairs of properties in the second set (existential quantification), i.e. 𝑝𝑗′ (𝑎1 ) ∩ 𝑞𝑗′ (𝑏1 ) ≠ ∅, then 𝑎1 and 𝑏1 denote the same entity, i.e., an owl:sameAs relation can be established between 𝑎1 and 𝑏1 . More concretely, the expression k=({(designation,title)},{(designation,title),(creator,author)}, (Book,Novel)) states that whenever an instance 𝑎 of class Book has the same non empty values for the property designation as an instance 𝑏 of the class Novel for title , and that 𝑎 and 𝑏 share at least one value for the properties creator and author , then 𝑎 and 𝑏 denote the same entity a owl:sameAs link can be established between 𝑎 and 𝑏. Link keys are not provided with the datasets and algorithms are designed for automatically discovering such link keys [9, 10, 11]. Given two RDF datasets, these algorithms reduce the search space and focus on the discovery of “link key candidates” instead of checking every combination of pairs of properties and pairs of classes. The notion of a link key candidate –made precise below– involves maximality and closure. Following this line, natural links were established in [10, 11] between link key discovery and Formal Concept Analysis (FCA [12]). Indeed, FCA appeared as a suitable framework for the discovery of link key candidates, which are then evaluated thanks to appropriate quality measures [9]. Given two RDF datasets, FCA is applied in [10] to a binary table where rows correspond to pairs of individuals and columns to pairs of properties. The intent of a resulting concept corresponds to a link key candidate, which remains to be validated thanks to suitable quality measures. The extent of the concept includes the potential identity links between individuals. A generalization of the former approach is proposed in [11], which is based on pattern structures [13] and which takes into account different pairs of classes at the same time in the discovery of link keys. Actually, “good” link key candidates over two RDF datasets have to generate different and maximal link sets, i.e., a mapping between individuals which is “close to a bijection”. However it appears that two different link key candidates may generate the same link set when the link set is considered as a partition w.r.t. the owl:sameAs equivalence relation. This means that these two link key candidates present a certain redundancy, and then they can be considered as equivalent and merged in a way to be defined. This redundancy can be detected thanks to the properties of owl:sameAs as an equivalence relation, i.e. reflexivity, symmetry, and transitivity. Then, the owl:sameAs relation generates partitions among pairs of individuals that can be used for reducing the number of potential link key candidates. Indeed, two candidates relying on the same partition are considered as redundant and can be merged. However, we do not have a concrete idea of the importance of such redundancy and we should find a way to measure it. This is one objective of this paper to try to materialize this redundancy and to measure its importance. For doing so, taking inspiration from the work carried out in [14] on the discovery of functional dependencies, we provide a formalization of link key discovery based on “partition pattern structures” (pps), which allow us to take into account sets of equivalent individuals w.r.t. owl:sameAs as partitions. Then, a pattern concept represents a link key candidate and the related partition induced by the candidate. This approach is able to retrieve all link key candidates as a set of “non redundant” link keys. Moreover, this link key discovery process based on pps is operational and original1 . Actually, this is the first time that the characteristics of owl:sameAs as an equivalence relation are considered, and that the related partitions of pairs of individuals are directly used for defining link key candidates. Thanks to pps, we are able to define redundancy of link key candidates and we also introduce a new measure based on the size of partitions for evaluating the quality of the discovered candidates. Finally, the experiments proposed in the last part of this paper provide three main results. Firstly, the redundancy of link keys in real-world datasets appears to be not so significant. Nevertheless, the current link key discovery approach based on pps is efficient and returns a reduced number of non redundant link key candidates. Moreover, this pps-based approach returns link keys of very high quality when compared to competitors. The summary of the paper is as follows. Section 2 presents some basics about link keys. Then the discovery of non-redundant link keys based on pattern structures and partition pattern structures is made precise in Section 3. A running example illustrates all these constructions. New quality measures related to non redundant link keys are defined in Section 4. Finally, experiments in Section 5 show the capability and efficiency of the pps-based approach in link key discovery. 2. Preliminaries This section introduces the basic definitions relative to data interlinking with link keys. We recall what an RDF dataset is and then we introduce two forms of link keys, namely link key expressions and link key candidates. 2.1. RDF Data Definition 1 (RDF Dataset). Let 𝑈 denote a set of IRIs, i.e., “Internationalized Resource Identifiers”, 𝐵 a set of blank nodes, i.e., “anonymous resources”, and 𝐿 a set of literals, i.e., “string values”. An RDF dataset is a set of triples (𝑠, 𝑝, 𝑜) ∈ (𝑈 ∪ 𝐵) × 𝑈 × (𝑈 ∪ 𝐵 ∪ 𝐿). Let 𝐷 be an RDF dataset, 𝑆(𝐷) = {𝑠 | ∃ 𝑝, 𝑜 (𝑠, 𝑝, 𝑜) ∈ 𝐷} denotes the set of individual identifiers, 𝑃(𝐷) = {𝑝 | ∃ 𝑠, 𝑜 (𝑠, 𝑝, 𝑜) ∈ 𝐷} the set of property identifiers, and 𝐶(𝐷) = {𝑐 | ∃ 𝑠 (𝑠, rdf:type , 𝑐) ∈ 𝐷} the set of class identifiers. Moreover, 𝐼 (𝑐) = {𝑠 | ∃ 𝑠 (𝑠, rdf:type , 𝑐) ∈ 𝐷} denotes the set of instances of 𝑐 ∈ 𝐶(𝐷) while 𝑝(𝑠) = {𝑜 | (𝑠, 𝑝, 𝑜) ∈ 𝐷} denotes the set of objects –or values– associated with 𝑠 through property 𝑝. An identity link is an RDF triple of the form (𝑎, owl:sameAs , 𝑏) stating that the IRIs 𝑎 and 𝑏 are referring to the same entity, alternatively 𝑎 and 𝑏 are denoting the same individual. Example 1. In Figure 1, the RDF datasets 𝐷1 and 𝐷2 include 𝑃(𝐷1 ) = {𝑝1 , 𝑝2 , 𝑝3 , 𝑝4 } and 𝑃(𝐷2 ) = {𝑞1 , 𝑞2 , 𝑞3 , 𝑞4 } as sets of property identifiers, and 𝐶(𝐷1 ) = {𝑐1 } and 𝐶(𝐷2 ) = {𝑐2 } as class identifiers. In addition, 𝐼 (𝑐1 ) = {𝑎1 , 𝑎2 , 𝑎3 , 𝑎4 , 𝑎5 } and 𝐼 (𝑐2 ) = {𝑏1 , 𝑏2 , 𝑏3 , 𝑏4 , 𝑏5 } denote respectively the sets of instances of class 𝑐1 and class 𝑐2 . Finally, considering subject 𝑏3 and property 𝑞2 , the related set of objects or the “value” of 𝑏3 for property 𝑞2 is 𝑞2 (𝑏3 ) = {𝑜8 , 𝑜9 }. 1 The present paper extends a short preliminary version published in [15]. 𝑜1 𝑐1 𝑞1 𝑐2 𝑝1 𝑜2 𝑝2 𝑞2 𝑎1 𝑝2 𝑜3 𝑏1 𝑞1 𝑜4 𝑞2 𝑝1 𝑎2 𝑝2 𝑜5 𝑞2 𝑏2 𝑝1 𝑞1 𝑜6 𝑎3 𝑝1 𝑜7 𝑞1 𝑏3 𝑝2 𝑞2 𝑜8 𝑞2 𝑎4 𝑜9 𝑏4 𝑝3 𝑞3 𝑜10 𝑝4 𝑎5 𝑞4 𝑏5 𝑞4 𝑝4 𝑜11 𝑝3 𝑜12 𝑞3 𝑜13 𝐷1 𝐷2 Figure 1: An example of two related RDF datasets. On the left-hand side, the dataset 𝐷1 is populated with instances of class 𝑐1 while on the right-hand side the dataset 𝐷2 is populated with instances of class 𝑐2 . 2.2. Link Keys Link keys are logical constructions allowing to infer identity links between instances of two classes lying in two RDF datasets. In the following we introduce two syntactic definitions of link keys, namely “link key expressions” and “link key candidates”. A definition of a link key and its semantics using Description Logics interpretation is proposed in [16]. However, in this paper we will stick to these two syntactic definitions for the sake of simplicity. Intuitively, a link key expression 𝑘 = (𝐸𝑞, 𝐼 𝑛, (𝑐1 , 𝑐2 )) is composed of two sets of pairs of properties, i.e., 𝐸𝑞 and 𝐼 𝑛, where 𝐸𝑞 is based on equality and 𝐼 𝑛 on non empty intersection, while links are generated between instances of classes 𝑐1 and 𝑐2 . More formally we have: Definition 2 (Link key expression). Let 𝐷1 and 𝐷2 be two RDF datasets, 𝑘 = (𝐸𝑞, 𝐼 𝑛, (𝑐1 , 𝑐2 )) is a link key expression over 𝐷1 and 𝐷2 iff 𝐼 𝑛 ⊆ 𝑃(𝐷1 ) × 𝑃(𝐷2 ), 𝐸𝑞 ⊆ 𝐼 𝑛, 𝑐1 ∈ 𝐶(𝐷1 ) and 𝑐2 ∈ 𝐶(𝐷2 ). The set of links generated by 𝑘 is denoted by 𝐿(𝑘) and includes a set of pairs of instances (𝑎, 𝑏) ∈ 𝐼 (𝑐1 ) × 𝐼 (𝑐2 ) satisfying: (i) for all (𝑝, 𝑞) ∈ 𝐸𝑞, 𝑝(𝑎) = 𝑞(𝑏) and 𝑝(𝑎) ≠ ∅, (ii) for all (𝑝, 𝑞) ∈ 𝐼 𝑛 ⧵ 𝐸𝑞, 𝑝(𝑎) ∩ 𝑞(𝑏) ≠ ∅. The number of link key expressions may be exponential w.r.t. the number of properties. To reduce the search space, algorithms for link key discovery only consider “link key candidates”, i.e., link key expressions which generate at least one link and which are “maximal” among the set of link key expressions. PS objects (g) descriptions (𝛿(𝑔)) (𝑎1 , 𝑏1 ) {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} (𝑎1 , 𝑏2 ) {∃(𝑝2 , 𝑞2 )} (𝑎2 , 𝑏1 ) {∃(𝑝1 , 𝑞1 )} (𝑎2 , 𝑏2 ) {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} (𝑎3 , 𝑏3 ) {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} (𝑎4 , 𝑏4 ) {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )} (𝑎4 , 𝑏5 ) {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )} (𝑎5 , 𝑏4 ) {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )} (𝑎5 , 𝑏5 ) {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )} Table 1 The pattern structure related to datasets 𝐷1 and 𝐷2 given in Fig. 1. Definition 3 (Link key candidate). A link key expression 𝑘1 = (𝐸𝑞1 , 𝐼 𝑛1 , (𝑐1 , 𝑐2 )) is a link key candidate if: (i) 𝐿(𝑘1 ) ≠ ∅, (ii) there does not exist another link key expression 𝑘2 = (𝐸𝑞2 , 𝐼 𝑛2 , (𝑐1 , 𝑐2 )) over 𝐷1 and 𝐷2 such that 𝐸𝑞1 ⊂ 𝐸𝑞2 , 𝐼 𝑛1 ⊂ 𝐼 𝑛2 , and 𝐿(𝑘1 ) = 𝐿(𝑘2 ). For example, consider the expressions 𝑘1 and 𝑘2 : (i) 𝑘1 = ({(𝑝1 , 𝑞1 )}, {(𝑝1 , 𝑞1 ), (𝑝2 , 𝑞2 )}, (𝑐1 , 𝑐2 )), (ii) 𝑘2 = ({(𝑝1 , 𝑞1 )}, {(𝑝1 , 𝑞1 )}, (𝑐1 , 𝑐2 )). The related link sets are 𝐿(𝑘1 ) = 𝐿(𝑘2 ) = {(𝑎3 , 𝑏3 )}. Then 𝑘1 and 𝑘2 are both link key expressions but only 𝑘1 is a link key candidate as it is maximal on the link set {(𝑎3 , 𝑏3 )} while 𝑘2 is not. The set of link key expressions is denoted by lke, the set of link key candidates by lkc, and we have that lkc ⊆ lke. The definition of link key candidates is based on the idea of maximality which involves a certain form of closure. This gave rise to a number of papers studying the potential relations existing between Formal Concept Analysis (FCA [12]) and the discovery of link keys [9, 10]. Following the same line, we make precise in the next section two ways of discovering link key candidates based on two extensions of FCA, pattern structures [13, 17] and partition pattern structures [14]. 3. From Link Keys to Non Redundant Link Keys 3.1. Discovering Link Keys with Pattern Structures In this section we shortly recall how pattern structures can be used in the discovery of link keys (details can be read in [11, 18]). For the sake of simplicity and readability, we rely on a motivating example. Then we show how “Partition Pattern Structures” [14, 19] allow to discover the so-called “non-redundant link key candidates” denoted as nrlkc. Example 2. Let us consider the pattern structure lkps = (𝐺, (𝐸, ⊓), 𝛿) displayed in Table 1. All details for building lkps and the associated pattern concept lattice are given in [11]. In the rows (“PS objects”), 𝐺 includes pairs of related instances, i.e., (𝑎𝑖 , 𝑏𝑗 ) with 𝑎𝑖 ∈ 𝐼 (𝑐1 ) and 𝑏𝑗 ∈ 𝐼 (𝑐2 ), which correspond to the objects of lkps. The set of potential descriptions (𝐸, ⊓) includes all possible pairs of properties preceded either by ∀ or ∃. For the sake of simplicity, the ⊓ operator corresponds here to set intersection. 𝑝𝑐0 𝐼 (𝑐1 ) × 𝐼 (𝑐2 ) ∅ 𝑝𝑐3 𝑝𝑐4 𝑝𝑐1 𝑝𝑐2 {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 ), {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏1 ), {(𝑎4 , 𝑏5 ), (𝑎5 , 𝑏4 )} {(𝑎4 , 𝑏4 ), (𝑎5 , 𝑏5 )} (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑘1 = {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )} 𝑘2 = {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )} 𝑘3 = {∃(𝑝2 , 𝑞2 )} 𝑘4 = {∃(𝑝1 , 𝑞1 )} 𝑝𝑐5 {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑘5 = {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} 𝑝𝑐6 {(𝑎3 , 𝑏3 )} 𝑘6 = {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} 𝑝𝑐7 ∅ 𝑃(𝐷1 ) × 𝑃(𝐷2 ) Figure 2: The pattern concept lattice related to Table 1. Every intent in a pattern concept corresponds to a link key candidate. The mapping 𝛿 relates a pair of instances (𝑎, 𝑏) ∈ 𝐼 (𝑐1 ) × 𝐼 (𝑐2 ) to a description as follows: (i) 𝛿(𝑎, 𝑏) includes ∀(𝑝, 𝑞) whenever 𝑝(𝑎) = 𝑞(𝑏) and 𝑝(𝑎) ≠ ∅, (ii) 𝛿(𝑎, 𝑏) includes ∃(𝑝, 𝑞) whenever 𝑝(𝑎) ∩ 𝑞(𝑏) ≠ ∅. Actually, such descriptions correspond to link key expressions w.r.t. the pairs of classes (𝑐1 , 𝑐2 ). In this work, we only consider the pair of classes 𝑐1 and 𝑐2 , while dealing with several pairs of classes is explained in [11]. Based on Fig. 1, the description of 𝛿(𝑎1 , 𝑏1 ) is given by {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} because 𝑝1 (𝑎1 ) ∩ 𝑞1 (𝑏1 ) ≠ ∅ and 𝑝2 (𝑎1 ) ∩ 𝑞2 (𝑏1 ) ≠ ∅, while 𝛿(𝑎2 , 𝑏1 ) = {∃(𝑝1 , 𝑞1 )} because 𝑝1 (𝑎2 ) ∩ 𝑞1 (𝑏1 ) ≠ ∅. Then, 𝛿(𝑎1 , 𝑏1 ) ⊓ 𝛿(𝑎2 , 𝑏1 ) = {∃(𝑝1 , 𝑞1 )} and thus 𝛿(𝑎2 , 𝑏1 ) ⊑ 𝛿(𝑎1 , 𝑏1 ). This can be read in the pattern concept lattice displayed in Fig.2 where the pattern concept 𝑝𝑐5 is subsumed by the pattern concept 𝑝𝑐4 , i.e., the intent {∃(𝑝1 , 𝑞1 )} of 𝑝𝑐4 is included in the intent {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} of 𝑝𝑐5 , while the extent {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} of 𝑝𝑐5 is included in the extent of 𝑝𝑐4 which is {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}. In this way, the set of all pattern concepts is organized within the pattern concept lattice in Fig. 2. Moreover, all link key candidates are lying in the intents of the pattern concepts. 3.2. Discovering Non Redundant Link Keys with Partition Pattern Structures 3.2.1. Motivation: sameAs is an Equivalence Relation. Having a careful look at Fig. 2, one can verify that more compact link keys could be designed. For example, in the extent of 𝑝𝑐3 , {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}, the three first pairs have a non empty intersection when considered in a particular order, i.e., 𝑎1 is shared by {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 )} and 𝑏2 is shared by {(𝑎1 , 𝑏2 ), (𝑎2 , 𝑏2 )}. Actually there exists a potential owl:sameAs –or simply sameAs – relation between the elements in such pairs. Being an equivalence relation, sameAs generates an equivalence class. Hence, we can “merge” some of these pairs in such an equivalence class. For example (𝑎1 , 𝑏1 ) and (𝑎1 , 𝑏2 ) can be merged, i.e., a 1 sameAs b 1 and a 1 sameAs b 2 yield b 1 sameAs b 2 (symmetry and transitivity of sameAs ). Moreover (𝑎1 , 𝑏2 ) and (𝑎2 , 𝑏2 ) can be merged because a 1 sameAs b 2 and a 2 sameAs b 2 yield a 1 sameAs a 2 . Finally, from a 1 sameAs a 2 and a 1 sameAs b 1 it comes a 2 sameAs b 1 . Then a 1 , a 2 , b 1 , b 2 , are in the same equivalence class w.r.t. sameAs . Two important facts should be noticed: (i) the link a 2 sameAs b 1 , absent in 𝑝𝑐3 but present in 𝑝𝑐4 , is inferred from the extent of 𝑝𝑐3 thanks to the properties of sameAs while in the same way a 1 sameAs b 2 could be inferred in 𝑝𝑐4 , (ii) the equivalence class {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 )} which can be built in 𝑝𝑐3 and as well in 𝑝𝑐4 , allows us to merge the link keys in 𝑝𝑐3 and 𝑝𝑐4 because they have now the “same extent”. Relying on this observation, we define an equivalence relation over the extents of pattern concepts as follows. Let us consider two RDF datasets 𝐷1 and 𝐷2 , a link key 𝑘, two classes 𝑐1 and 𝑐2 , and the set of instances 𝐼 = 𝐼 (𝑐1 ) ∪ 𝐼 (𝑐2 ). Given 𝑥0 and 𝑥𝑛 ∈ 𝐼, there exists a chain of sameAs re- lations between 𝑥0 and 𝑥𝑛 iff there exists a sequence of elements 𝑥1 , 𝑥2 , … , 𝑥𝑛−1 such that x 0 sameAs x 1 , x 1 sameAs x 2 , … , x n −1 sameAs x n holds, where the symmetry of sameAs may be used. Then, we define a relation between 𝑥0 and 𝑥𝑛 in 𝐼 w.r.t. the link key 𝑘, denoted as 𝑥0 ≃𝑘 𝑥𝑛 , iff there exists a chain of sameAs relations between 𝑥0 and 𝑥𝑛 . It can be checked that ≃𝑘 is an equivalence relation as sameAs itself is an equivalence relation. For example, 𝐸𝑥𝑡(𝑝𝑐3 )/≃∃(𝑝2 ,𝑞2 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} where 𝐸𝑥𝑡(𝑝𝑐3 ) denotes the extent of 𝑝𝑐3 . In the same way, 𝐸𝑥𝑡(𝑝𝑐4 )/≃∃(𝑝1 ,𝑞1 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}. Then the two link keys ∃(𝑝1 , 𝑞1 ) and ∃(𝑝2 , 𝑞2 ) can be “identified” or “merged” because they have the same equivalence classes. As a consequence, one may obtain more comparable sets of linked elements and thus minimize the number of possible link key candidates. Moreover, it can be noticed that an equivalence class determines a partition within the set of instances under study (singletons if any are omitted here). 3.2.2. The design of a PPS for “Non Redundant Link Keys”. Hereafter, one main objective is to build a specific pattern structure where concepts are related to unique equivalence classes and yield “non redundant link key candidates” denoted as nrlkc. Here “non redundant” means any two elements in nrlkc are associated with different equivalence classes. In addition, an equivalence class corresponds to a partition of the set of instances. Accordingly, we define a “partition pattern structure” (pps) based on descriptions which are partitions and used to discover non redundant link key candidates. Indeed, in such a pps, objects in the rows correspond to pairs of properties and descriptions correspond to partitions. An example of pps is proposed in [14] where pps are introduced for mining functional dependencies. Accordingly, we define a partition pattern structure over classes 𝑐1 and 𝑐2 as a triple (lkc, (𝑃𝑎𝑟𝑡(𝐼 ), ⊓𝑝𝑎𝑟𝑡 ), 𝛿) such as: • lkc is the set of all link key candidates w.r.t. classes 𝑐1 and 𝑐2 which are given by a pattern concept lattice (as displayed in Fig. 2). pps objects (𝑘𝑖 ) Descriptions (partitions, 𝛿(𝑘𝑖 )) 𝑘1 = {∃(𝑝4 , 𝑞4 ), ∀(𝑝4 , 𝑞4 )} {(𝑎4 , 𝑏5 ), (𝑎5 , 𝑏4 )} 𝑘2 = {∃(𝑝3 , 𝑞3 ), ∀(𝑝3 , 𝑞3 )} {(𝑎4 , 𝑏4 ), (𝑎5 , 𝑏5 )} 𝑘3 = {∃(𝑝2 , 𝑞2 )} {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑘4 = {∃(𝑝1 , 𝑞1 )} {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑘5 = {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑘6 = {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} {(𝑎3 , 𝑏3 )} Table 2 The partition pattern structure related to the pattern structure in Table 1 over the classes 𝑐1 and 𝑐2 in datasets 𝐷1 and 𝐷2 . • (𝑃𝑎𝑟𝑡(𝐼 ), ⊓𝑝𝑎𝑟𝑡 ) is the set of potential descriptions or partitions and ⊓𝑝𝑎𝑟𝑡 is the “meet” of two partitions. • 𝛿 maps a link key candidate 𝑘 to the partition 𝐼 / ≃𝑘 , which is the quotient set of 𝐼 w.r.t. ≃𝑘 . As it can be seen in Table 2, pps objects are the link key candidates already computed and lying in the intents of the pattern concept lattice. This means that link key candidates will compose pps-concept extents, while partitions involving elements are declared in the descriptions and thus will compose pps-concept intents. Moreover, partitions reduced to a singleton are omitted in the descriptions of the pps objects. Let us examine a concrete example of pps based on the pattern structure given in Ta- ble 1. For 𝑝𝑐3 , we have 𝐸𝑥𝑡(𝑝𝑐3 ) = {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} and 𝐼 𝑛𝑡(𝑝𝑐3 ) = {∃(𝑝2 , 𝑞2 )}. The related partition, 𝐸𝑥𝑡(𝑝𝑐3 )/≃∃(𝑝2 ,𝑞2 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}, where singletons are omit- ted, is associated with row 𝑘3 ({∃(𝑝2 , 𝑞2 )}) in Table 2 . In the same way, considering 𝑝𝑐4 and {∃(𝑝1 , 𝑞1 )}, 𝐸𝑥𝑡(𝑝𝑐4 )/≃∃(𝑝1 ,𝑞1 ) is associated with row 𝑘4 ({∃(𝑝1 , 𝑞1 )}) in Table 2. Moreover, 𝐸𝑥𝑡(𝑝𝑐4 )/≃∃(𝑝1 ,𝑞1 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} and is equal to 𝐸𝑥𝑡(𝑝𝑐3 )/≃∃(𝑝2 ,𝑞2 ) . As already ob- served, 𝑘3 and 𝑘4 are inducing the same partition and can be “merged”. The other pps objects, namely 𝑘1 , 𝑘2 , 𝑘5 , and 𝑘6 , are obtained from the corresponding pattern concepts in Fig. 2 in the same way. Finally, the pps over classes 𝑐1 and 𝑐2 related to the datasets 𝐷1 and 𝐷2 is displayed in Table 2. The meet –or similarity– operation used to compare the rows in the pps corresponds to the meet of two partitions 𝑝𝑎𝑟𝑡1 and 𝑝𝑎𝑟𝑡2 over 𝐼 and is denoted by 𝑝𝑎𝑟𝑡1 ⊓𝑝𝑎𝑟𝑡 𝑝𝑎𝑟𝑡2 . This meet operation is classically defined as the intersection of the respective equivalence classes, also known as the “coarsest common refinement of the two partitions” (see [14]). For example the meet 𝛿(𝑘4 ) ⊓𝑝𝑎𝑟𝑡 𝛿(𝑘5 ) is given by {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} ⊓𝑝𝑎𝑟𝑡 {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} which yields the partition {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}. In particular, this means that 𝛿(𝑘4 ) ⊓ 𝛿(𝑘5 ) = 𝛿(𝑘5 ) and thus that 𝛿(𝑘5 ) ⊑ 𝛿(𝑘4 ). Based on this meet operation, the pps concept lattice is constructed and shown in Fig. 3. The ex- tent of a pps concept includes the 𝐸𝑞 and the 𝐼 𝑛 parts of a link key candidate (𝐸𝑞, 𝐼 𝑛, (𝑐1 , 𝑐2 )) in lkc. For example, the intent of 𝑝𝑝𝑠𝑐6 is {(𝑎3 , 𝑏3 )} and its extent is 𝑘6 , i.e., {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}. As all elements in the extents of the pps concept lattice, 𝑘6 is an element of nrlkc, i.e., a non redundant link key candidate. Moreover, 𝑘6 corresponds to a closed set and thus verifies maxi- mality, and it is non redundant w.r.t. the partition in the associated intent, i.e., no other link key candidate induces the same partition. 3.2.3. Discussion: From LKC to NRLKC. We now discuss what is gained in relying on pattern structures and then on partition pattern structures. Actually, the discovery of link key candidates (lkc) is based on the construction of the pattern concept lattice. Then the construction of the pps concept lattice yielding the non redundant link key candidates should be considered as a “refinement” where equivalent link key candidates, i.e., link key candidates inducing the same partition, are merged. For example, both concepts 𝑝𝑐3 and 𝑝𝑐4 in Fig. 2 are merged into a single concept 𝑝𝑝𝑠𝑐34 in Fig. 3. The resulting pps concept lattice is of smaller size and more concise as link key candidates in the concept extents are non redundant. The relations existing between lkc and nrlkc can be characterized as follows. By construction, nrlkc ⊆ lkc as the discovery of non redundant link keys is based on lkc, i.e., the link key candidates lying in the intents of concepts in the pattern concept lattice. Then, the candidates which have the same equivalence classes w.r.t. the sameAs relation are merged to produce the non redundant link key candidates in nrlkc. Thus, it comes that nrlkc ⊆ lkc and that |nrlkc | ≤ |lkc | (where |X| denotes the cardinality of set X). In other words, a non redundant link key candidate is always a link key candidate while the converse is not true. The definition of nrlkc allows us to introduce a new quality measure for evaluating and validating non redundant link key candidates. In the next section we make precise this new quality measure and we discuss the benefits related to the discovery of non redundant link key candidates. 4. Quality Measures for Non Redundant Link Key Candidates Not all link key candidates are eligible to be final link keys. Some candidates can be too general or related to noise in the data and thus they may generate owl:sameAs links which are not correct or have a bad quality. For example the link key candidate ({}, {(country,country )}, (Person,Human )) states that two individuals (subjects) sharing the same country represent the same person, and this is obviously not true. The candidate ({}, {(name,title )}, (Person,Book )) may be discovered by chance and will generate non correct links as well. An ideal link key candidate should be “correct” and “complete”, and specific “quality measures” are defined for assessing these properties. In a supervised setting, when a set of reference links is available, the correctness of a link key candidate is measured thanks to “precision” and the completeness thanks to “recall”. In an unsupervised setting as this is the case here, the measures of “coverage” and “discriminability” are defined over the set of links directly generated by a link key candidate [9]. The global quality of a link key candidate is then estimated thanks to the harmonic mean of these two measures. |𝜋 (𝐿) ∪ 𝜋2 (𝐿)| The coverage of a set 𝐿 of links over 𝑐1 and 𝑐2 is 𝑐𝑜𝑣(𝐿, 𝑐1 , 𝑐2 ) = 1 where 𝜋1 (𝐿) = |𝐼 (𝑐1 ) ∪ 𝐼 (𝑐2 )| {𝑎|(𝑎, 𝑏) ∈ 𝐿} and 𝜋2 (𝐿) = {𝑏|(𝑎, 𝑏) ∈ 𝐿}. 𝑚𝑖𝑛(|𝜋1 (𝐿)|, |𝜋2 (𝐿)|) The discriminability of 𝐿 is 𝑑𝑖𝑠(𝐿) = . |𝐿| 𝑝𝑝𝑠𝑐0 Link Key Candidates Singleton Partition 𝑝𝑝𝑠𝑐1 𝑝𝑝𝑠𝑐2 𝑝𝑝𝑠𝑐6 𝑘1 {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )} 𝑘2 {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )} 𝑘6 {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} {(𝑎4 , 𝑏5 ), (𝑎5 , 𝑏4 )} {(𝑎4 , 𝑏4 ), (𝑎5 , 𝑏5 )} {(𝑎3 , 𝑏3 )} 𝑝𝑝𝑠𝑐5 𝑘5 {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑝𝑝𝑠𝑐34 𝑘3 , 𝑘4 {∃(𝑝1 , 𝑞1 )}, {∃(𝑝2 , 𝑞2 )} {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} 𝑝𝑐7 ∅ 𝐼 Figure 3: The pps concept lattice over the partition pattern structure and the pattern concept lattice given in Fig. 2. Coverage and discriminability evaluate how close a set of links is to a total, respectively bijective, mapping. Coverage is maximum when all instances of the considered classes are linked to at least another instance, while discriminability is maximum when instances are linked to at most one instance. In the present setting, we are interested in the “redundancy” of link key candidates, and we would like to evaluate the quality of “non redundant” candidates. A link key candidate 𝑘 is said to be “redundant” in 𝐻 ⊆ lkc if there exists another link key candidate ℎ ∈ 𝐻 such that 𝐼 /≃𝑘 = 𝐼 /≃ℎ , i.e., 𝑘 generates the same partition as ℎ. This is the case in the running example for the link key candidates 𝑘3 and 𝑘4 . One main consequence of detecting redundancy is to reduce the number of candidates by merging candidates inducing the same partition. Then, since ≃𝑘3 and ≃𝑘4 induce the same partition, 𝑘3 and 𝑘4 are merged into a non redundant link key candidate 𝑘34 = {𝑘3 , 𝑘4 } interpreted as “𝑘3 or 𝑘4 ”. Accordingly, we introduce the new quality measure 𝑝𝑆𝑖𝑧𝑒(𝑘) that is compliant with the semantics of sameAs and defined w.r.t. the partition associated with the equivalence relation ≃𝑘 : 𝑝𝑆𝑖𝑧𝑒(𝑘) = | {𝑥 ∈ 𝐼 /≃𝑘 | |𝑥| > 1} | where 𝐼 = 𝐼 (𝑐1 ) ∪ 𝐼 (𝑐2 ) and 𝐼 /≃𝑘 is the quotient set of 𝐼 w.r.t. ≃𝑘 . For example, 𝑝𝑆𝑖𝑧𝑒(𝑘3 ) = 𝑝𝑆𝑖𝑧𝑒(𝑘4 ) = |{(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}| = 2. The lower bound of 𝑝𝑆𝑖𝑧𝑒(𝑘) is 1 since 𝑘 generates at least one link. The upper bound is 𝑚𝑖𝑛(|𝐼 (𝑐1 )|, |𝐼 (𝑐2 )|) and this is achieved when 𝑘 determines a 1-1 mapping. Moreover, pSize can be normalized into 𝑛𝑝𝑆𝑖𝑧𝑒(𝑘) = 𝑝𝑆𝑖𝑧𝑒(𝑘)/|𝐼 / ≃𝑘 |. Then 𝑛𝑝𝑆𝑖𝑧𝑒(𝑘3 ) = 2/6 as 𝐼 / ≃𝑘3 = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 ), 𝑎4 , 𝑏4 , 𝑎5 , 𝑏5 }, where this time singletons are counted. The normalized measure 𝑛𝑝𝑆𝑖𝑧𝑒(𝑘) evaluates how close are 𝑐1 and 𝑐2 w.r.t. ≃𝑘 , and is maximal when all instances are linked. However, this is not always satisfactory, especially when the cardinalities of the two classes significantly differ (i.e., classes are not balanced). Then, it is more accurate to maximize the measure as soon as all instances of one class are linked as fol- lows: 𝑠𝑠𝑝𝑐(𝑘) = 𝑝𝑆𝑖𝑧𝑒(𝑘)/𝑚𝑖𝑛(|𝐼 (𝑐1 )/≃𝑘 |, |𝐼 (𝑐2 )/≃𝑘 |) (also known as the “Szymkiewicz–Simpson partition coefficient”). For example, 𝑠𝑠𝑝𝑐(𝑘3 ) = 2/4 with: 𝑠𝑠𝑝𝑐(𝑘3 ) = |{(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}|/𝑚𝑖𝑛(|{(𝑎1 , 𝑎2 ), 𝑎3 , 𝑎4 , 𝑎5 }|, |{(𝑏1 , 𝑏2 ), 𝑏3 , 𝑏4 , 𝑏5 }|). 5. Experiments 5.1. Datasets and experimental settings We run experiments on DB-Yago datasets provided in [20]. These datasets are used by the approaches to which we compare our results in the last series of experiments. They have been rewritten into Terse RDF Triple Language (Turtle) which is a syntax for expressing RDF data [21]. They are derived from DBpedia and Yago and organized into nine tasks. One particular task consists in finding links between instances of a class in DBpedia and instances of a class in Yago, e.g., db:Actor and yago:Actor . A set of reference links (i.e., owl:sameAs links) denoted by 𝐿𝑟𝑒𝑓 is provided for each task. The statistics of DB-Yago datasets are given in Table 3. Experiments were run on a MacBook Pro 2018 with Intel Core i7-8850H@2,6 GHz, 16GB of RAM. The link key candidates are provided by a tool called Linkex2 in which we have implemented the lkps algorithm i.e. the link key discovery algorithm based on pattern structures proposed in [11]. A basic text normalization consisting in removing diacritics, tokenizing and sorting the resulting bag of tokens is performed. 5.2. Redundancy of Link Key Candidates is not so Significant In Table 3, column |lkc| represents the number of link key candidates discovered by lkps algorithm while column |nrlkc| represents the number of non redundant link key candidates. In most of the tasks, we can observe that |nrlkc| is equal to |lkc|, except for the tasks Actor and Film where |nrlkc| is lower than |lkc| by 1% and by 5% respectively. This means that, in general, link key candidates produce different partitions and only a few are redundant. By contrast, if redundancy does not significantly reduce the number of link key candidates, it gives a good idea of the compactness of partitions related to link key candidates. Nevertheless, even if redundancy is not so significant, we decided to compare the quality of the resulting sets of candidates measured with 𝑝𝑆𝑖𝑧𝑒 and the results of competitors, namely keys and conditional keys as they are studied in [20]. 5.3. Non Redundant Candidates, Classical Keys and Conditional Keys The 𝑝𝑆𝑖𝑧𝑒 measure is used to evaluate and select the best link key candidates, –in this experiment the link key candidate ranked first– in every task listed in Table 3. Then, we compare recall, precision, and F-measure of the links generated by the best link key candidates selected by 2 Linkex is available online at https://gitlab.inria.fr/moex/linkex. Interlinking task datasets |triples| |subjects| |properties| |lkc| |nrlkc| time db:Actor 94 606 5 807 16 Actor 2 198 2 177 56s yago:Actor 1 029 580 108 415 16 db:Album 594 144 85 002 5 Album 44 44 28s yago:Album 762 238 136 848 5 db:Book 247 372 29 846 7 Book 82 82 39s yago:Book 185 032 41 849 7 db:Film 1 369 600 82 099 9 Film 18 718 17 643 34m yago:Film 1 067 084 123 822 9 db:Mountain 135 442 16 397 5 Mountain 39 39 5m yago:Mountain 233 562 32 874 5 db:Museum 15 940 1 826 7 Museum 48 48 9s yago:Museum 163 342 21 050 7 db:Organization 4 487 205 183 665 17 Organization 1 425 1 425 57m yago:Organization 4 410 854 430 071 17 db:Scientist 128 360 18 409 10 Scientist 862 862 2m yago:Scientist 671 266 92 828 18 db:University 241 838 10 352 9 University 213 213 56s yago:University 263 624 23 334 9 Table 3 DB-Yago datasets statistics, where |triples| denotes the number of triples in each dataset, |subjects| the number of instances, and |properties| the number of properties. 𝑝𝑆𝑖𝑧𝑒 against interlinking approaches providing classical keys and conditional keys as reported in [20]. In Figure 4 we can observe that: (i) recall of link keys is significantly better than recall of classical and conditional keys, (ii) precision of classical and conditional keys is slightly better than precision of link keys in most of the tasks, (iii) F-measure is much higher for selected link key candidates than for classical and conditional keys. Link keys have a better recall because they are more flexible than classical keys, i.e., a link key is not necessarily a pair of keys and an instance of class 𝑐1 may be linked to many instances of class 𝑐2 . For summarizing, it can be concluded that considering the best link key candidates selected by 𝑝𝑆𝑖𝑧𝑒 will ensure a higher interlinking quality compared to classical and conditional keys. We also observe that the best link key according to discriminability and coverage is the same that the best non redundant link key selected thanks to 𝑝𝑆𝑖𝑧𝑒. Actually, the Actor dataset makes an exception: the link key candidate selected with 𝑝𝑆𝑖𝑧𝑒 obtains an F-measure of 0.95 contrasting the score of 0.34 obtained with coverage and discriminability. In addition, contrasting key-based approaches, link key discovery does not require any prior knowledge such as property or class alignments. (a)recall (b)precision (c)F-measure 0.57 0.93 0.71 classicalkeys University 0.22 0.99 University 0.26 9⋅10−2 0.84 0.69 0. 9 9 0.1 6 0.76 conditional keys Scientist 0.16 0.99 Scientist 0.28 link keys 5⋅10−2 0.56 0.82 0. 9 8 0. 11 0.67 Organization −20.14 0.98 Organization −2 0.24 1⋅10 0.79 0.94 1 2⋅10 0.86 Museum 0.25 1 Museum 0.4 0.12 0.79 0.94 1 0. 21 0.86 Mountain 0.28 0.99 Mountain 0.44 0 0.61 0.91 1 0 0.74 Film 0.38 0.96 Film 0.54 4⋅10−2 0.67 0. 0.95 9 9 8⋅10 −2 0.79 Book 0.−211 0.99 Book −2 0.23 3⋅10 0.59 0.83 1 6⋅10 0.69 Album 0.15 0.99 Album 0.26 0 0.91 10.98 0 0.95 Actor 0.57 0.99 Actor 0.73 0.27 0.99 0.43 Figure 4: Interlinking performances of classical keys, conditional keys as given in [20] and link keys. Actually a recall of 0 stands for a very small recall close to 0. 6. Synthesis and Conclusion This paper introduces a formalization of link key discovery based on partition pattern structures (pps). This approach allows to discover the set nrlkc of link key candidates which are not redundant w.r.t. the owl:sameAs equivalence relation, while still being correct and complete. In addition, new appropriate quality measures are proposed. Practically, we observe in experiments that the redundancy of link key candidates in public datasets is rather rare. Nevertheless, the experiments and the associated results show that link key based on pps obtain better F-measure values than two other key-based approaches to data-interlinking. Among the perspectives, a first one is to consolidate the theory and practice of link key discovery based on pps introduced and detailed in this paper. A second and very important direction of investigation is related to the discovery of “fuzzy link keys”. We make the hypothesis that the redundancy of link key candidates is rather rare because we are using the crisp equality operator when we are building the link key candidates. Instead, in considering the discovery of link key candidates as depending on a similarity relation, i.e., reflexive and symmetric –also called a tolerance relation– rather than on an equality, then we could propose a formalization of fuzzy link key candidates in the spirit of approximate-matching dependencies as they are studied in [19]. We could also expect much more differences between crisp and fuzzy link key candidates. In addition, a reduction of the size of the concept lattice related to nrlkc could also be investigated thanks to similarities between partitions. References [1] A. Ferrara, A. Nikolov, F. Scharffe, Data Linking for the Semantic Web, International Journal of Semantic Web and Information Systems 7 (2011) 46–76. [2] P. Christen, Data Matching – Concepts and Techniques for Record Linkage, Entity Resolu- tion, and Duplicate Detection, Springer, 2012. [3] M. Nentwig, M. Hartung, A.-C. Ngonga Ngomo, E. Rahm, A survey of current Link Discovery frameworks, Semantic Web Journal 8 (2017) 419–436. [4] A. N. Ngomo, S. Auer, LIMES – A Time-Efficient Approach for Large-Scale Link Discovery on the Web of Data, in: T. Walsh (Ed.), Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), IJCAI/AAAI, 2011, pp. 2312–2317. [5] J. Volz, C. Bizer, M. Gaedke, G. Kobilarov, Silk – A Link Discovery Framework for the Web of Data, in: C. Bizer, T. Heath, T. Berners-Lee, K. Idehen (Eds.), Proceedings of the WWW2009 Workshop on Linked Data on the Web (LDOW), CEUR Workshop Proceedings 538, 2009. [6] R. Isele, C. Bizer, Active learning of expressive linkage rules using genetic programming, Journal of web semantics 23 (2013) 2–15. [7] A. N. Ngomo, K. Lyko, EAGLE: Efficient Active Learning of Link Specifications Using Genetic Programming, in: E. Simperl, P. Cimiano, A. Polleres, Ó. Corcho, V. Presutti (Eds.), Proceedings of the 9th Extended Semantic Web Conference (ESWC, Lecture Notes in Computer Science 7295, Springer, 2012, pp. 149–163. [8] A. N. Ngomo, K. Lyko, Unsupervised learning of link specifications: deterministic vs. non-deterministic, in: P. Shvaiko, J. Euzenat, K. Srinivas, M. Mao, E. Jiménez-Ruiz (Eds.), Proceedings of the 8th International Workshop on Ontology Matching (at ISWC), CEUR Workshop Proceedings 1111, 2013, pp. 25–36. [9] M. Atencia, J. David, J. Euzenat, Data interlinking through robust linkkey extraction, in: Proceedings of ECAI, 2014, pp. 15–20. [10] M. Atencia, J. David, J. Euzenat, A. Napoli, J. Vizzini, Link key candidate extraction with relational concept analysis, Discrete applied mathematics 273 (2020) 2–20. [11] N. Abbas, J. David, A. Napoli, Discovery of Link Keys in RDF Data Based on Pattern Structures: Preliminary Steps, in: F. J. Valverde-Albacete, M. Trnecka (Eds.), Proceedings of the International Conference on Concept Lattices and Their Applications (CLA), CEUR Workshop Proceedings 2668, 2020, pp. 235–246. [12] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer, 1999. [13] B. Ganter, S. O. Kuznetsov, Pattern Structures and Their Projections, in: H. S. Delugach, G. Stumme (Eds.), Proceedings of the International Conference on Conceptual Structures (ICCS), Lecture Notes in Computer Science 2120, Springer, 2001, pp. 129–142. [14] J. Baixeries, M. Kaytoue, A. Napoli, Characterizing functional dependencies in formal concept analysis with pattern structures, Annals of Mathematics and Artificial Intelligence 72 (2014) 129–149. [15] N. Abbas, A. Bazin, J. David, A. Napoli, Non-Redundant Link Keys in RDF Data: Preliminary Steps, in: Proceedings of the FCA4AI Workshop at IJCAI, CEUR Workshop Proceedings 2972, 2021, pp. 125–130. [16] M. Atencia, J. David, J. Euzenat, On the relation between keys and link keys for data interlinking, Semantic Web Journal 12 (2021) 547–567. [17] M. Kaytoue, S. O. Kuznetsov, A. Napoli, S. Duplessis, Mining Gene Expression Data with Pattern Structures in Formal Concept Analysis, Information Sciences 181 (2011) 1989–2001. [18] N. Abbas, A. Bazin, J. David, A. Napoli, Sandwich: An Algorithm for Discovering Relevant Link Keys in an LKPS Concept Lattice, in: A. Braud, A. Buzmakov, T. Hanika, F. L. Ber (Eds.), Proceedings of the 16th International Conference on Formal Concept Analysis (ICFCA), Lecture Notes in Computer Science 12733, Springer, 2021, pp. 243–251. [19] J. Baixeries, V. Codocedo, M. Kaytoue, A. Napoli, Characterizing Approximate-Matching Dependencies in Formal Concept Analysis with Pattern Structures, Discrete Applied Mathematics 249 (2018) 18–27. [20] D. Symeonidou, L. Galárraga, N. Pernelle, F. Saïs, F. M. Suchanek, VICKEY: Mining Conditional Keys on Knowledge Bases, in: Proceedings of 16th International Semantic Web Conference (ISWC), LNCS 10587, Springer, 2017, pp. 661–677. [21] D. Beckett, T. Berners-Lee, E. Prud’hommeaux, G. Carothers, RDF 1.1 Turtle, W3C Recom- mendation, W3C, 2014. https://www.w3.org/TR/turtle/.