A Study of the Discovery and Redundancy of Link
Keys Between Two RDF Datasets Based on Partition
Pattern Structures
Nacira Abbas1 , Alexandre Bazin2 , Jérôme David3 and Amedeo Napoli1,∗
1
  Université de Lorraine, CNRS, Inria, Loria, F-54000 Nancy, France
2
  Université de Montpellier, CNRS, LIRMM, F-34095 Montpellier, France
3
  Université Grenoble Alpes, Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France


                                         Abstract
                                         A link key between two RDF datasets 𝐷1 and 𝐷2 is a set of pairs of properties allowing to identify pairs
                                         of individuals 𝑥1 and 𝑥2 through an identity link such as x 1 owl ∶ sameAs x 2 . In this paper, relying on
                                         and extending previous work, we introduce an original formalization of link key discovery based on the
                                         framework of Partition Pattern Structures (pps). Our objective is to study and evaluate the redundancy
                                         of link keys based on the fact that owl:sameAs is an equivalence relation. In the pps concept lattice,
                                         every concept has an extent representing a link key candidate and an intent representing a partition of
                                         instances into sets of equivalent instances. Experiments show three main results. Firstly redundancy of
                                         link keys is not so significant in real-world datasets. Nevertheless, the link key discovery approach based
                                         on pps returns a reduced number of non redundant link key candidates when compared to a standard
                                         approach. Moreover, the pps-based approach is efficient and returns link keys of high quality.


1. Introduction
In this paper, we are interested in data interlinking whose objective is to discover identity
links across two RDF datasets over the web of data [1, 2, 3]. The same real world entity
can be represented in two RDF datasets by different subjects in RDF triples having the form
(subject, property, object) . Then, for cleaning data and providing data of better quality,
it is meaningful to detect such identities. There are both numerical and logical approaches
for discovering these identities. For example, interlinking methods have been implemented in
systems such as LIMES [4] and SILK [5]. These systems use link specifications, i.e. rules that
declare whether two IRIs should be linked. Link specifications can also be specified by users or
learned from data [6, 7, 8].
   In particular, link keys which are under investigation in this paper are special kinds of
rules allowing to infer identity links between two RDF datasets. More formally, a link key
is composed of two sets of pairs of properties ({(𝑝𝑖 , 𝑞𝑖 )}𝑖 , {(𝑝𝑗′ , 𝑞𝑗′ )}𝑗 ) associated with a pair of

Published in Pablo Cordero, Ondrej Kridlo (Eds.): The 16𝑡ℎ International Conference on Concept Lattices and Their
Applications, CLA 2022, Tallinn, Estonia, June 20–22, 2022, Proceedings, pp. 175–189.
∗
    Corresponding author.
Envelope-Open Nacira.Abbas@inria.fr (N. Abbas); Alexandre.Bazin@umontpellier.fr (A. Bazin); Jerome.David@inria.fr
(J. David); Amedeo.Napoli@loria.fr (A. Napoli)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
classes (𝑐1 , 𝑐2 ). Then, whenever an instance 𝑎1 of class 𝑐1 has the same (non empty) set of
values as an instance 𝑏1 of class 𝑐2 , i.e. 𝑝𝑖 (𝑎1 ) = 𝑞𝑖 (𝑏1 ) for all pairs of properties in the first set
(universal quantification), and shares at least one value for all pairs of properties in the second
set (existential quantification), i.e. 𝑝𝑗′ (𝑎1 ) ∩ 𝑞𝑗′ (𝑏1 ) ≠ ∅, then 𝑎1 and 𝑏1 denote the same entity, i.e.,
an owl:sameAs relation can be established between 𝑎1 and 𝑏1 . More concretely, the expression
k=({(designation,title)},{(designation,title),(creator,author)}, (Book,Novel))
states that whenever an instance 𝑎 of class Book has the same non empty values for the property
designation as an instance 𝑏 of the class Novel for title , and that 𝑎 and 𝑏 share at least one
value for the properties creator and author , then 𝑎 and 𝑏 denote the same entity a owl:sameAs
link can be established between 𝑎 and 𝑏.
    Link keys are not provided with the datasets and algorithms are designed for automatically
discovering such link keys [9, 10, 11]. Given two RDF datasets, these algorithms reduce the
search space and focus on the discovery of “link key candidates” instead of checking every
combination of pairs of properties and pairs of classes. The notion of a link key candidate
–made precise below– involves maximality and closure. Following this line, natural links were
established in [10, 11] between link key discovery and Formal Concept Analysis (FCA [12]).
Indeed, FCA appeared as a suitable framework for the discovery of link key candidates, which
are then evaluated thanks to appropriate quality measures [9].
    Given two RDF datasets, FCA is applied in [10] to a binary table where rows correspond
to pairs of individuals and columns to pairs of properties. The intent of a resulting concept
corresponds to a link key candidate, which remains to be validated thanks to suitable quality
measures. The extent of the concept includes the potential identity links between individuals. A
generalization of the former approach is proposed in [11], which is based on pattern structures
[13] and which takes into account different pairs of classes at the same time in the discovery of
link keys.
    Actually, “good” link key candidates over two RDF datasets have to generate different and
maximal link sets, i.e., a mapping between individuals which is “close to a bijection”. However
it appears that two different link key candidates may generate the same link set when the link
set is considered as a partition w.r.t. the owl:sameAs equivalence relation. This means that
these two link key candidates present a certain redundancy, and then they can be considered as
equivalent and merged in a way to be defined. This redundancy can be detected thanks to the
properties of owl:sameAs as an equivalence relation, i.e. reflexivity, symmetry, and transitivity.
Then, the owl:sameAs relation generates partitions among pairs of individuals that can be used
for reducing the number of potential link key candidates. Indeed, two candidates relying on
the same partition are considered as redundant and can be merged. However, we do not have
a concrete idea of the importance of such redundancy and we should find a way to measure
it. This is one objective of this paper to try to materialize this redundancy and to measure its
importance.
    For doing so, taking inspiration from the work carried out in [14] on the discovery of
functional dependencies, we provide a formalization of link key discovery based on “partition
pattern structures” (pps), which allow us to take into account sets of equivalent individuals
w.r.t. owl:sameAs as partitions. Then, a pattern concept represents a link key candidate and
the related partition induced by the candidate. This approach is able to retrieve all link key
candidates as a set of “non redundant” link keys. Moreover, this link key discovery process
based on pps is operational and original1 . Actually, this is the first time that the characteristics
of owl:sameAs as an equivalence relation are considered, and that the related partitions of pairs
of individuals are directly used for defining link key candidates. Thanks to pps, we are able to
define redundancy of link key candidates and we also introduce a new measure based on the size
of partitions for evaluating the quality of the discovered candidates. Finally, the experiments
proposed in the last part of this paper provide three main results. Firstly, the redundancy of link
keys in real-world datasets appears to be not so significant. Nevertheless, the current link key
discovery approach based on pps is efficient and returns a reduced number of non redundant
link key candidates. Moreover, this pps-based approach returns link keys of very high quality
when compared to competitors.
   The summary of the paper is as follows. Section 2 presents some basics about link keys. Then
the discovery of non-redundant link keys based on pattern structures and partition pattern
structures is made precise in Section 3. A running example illustrates all these constructions.
New quality measures related to non redundant link keys are defined in Section 4. Finally,
experiments in Section 5 show the capability and efficiency of the pps-based approach in link
key discovery.


2. Preliminaries
This section introduces the basic definitions relative to data interlinking with link keys. We
recall what an RDF dataset is and then we introduce two forms of link keys, namely link key
expressions and link key candidates.

2.1. RDF Data
Definition 1 (RDF Dataset). Let 𝑈 denote a set of IRIs, i.e., “Internationalized Resource Identifiers”,
𝐵 a set of blank nodes, i.e., “anonymous resources”, and 𝐿 a set of literals, i.e., “string values”.
  An RDF dataset is a set of triples (𝑠, 𝑝, 𝑜) ∈ (𝑈 ∪ 𝐵) × 𝑈 × (𝑈 ∪ 𝐵 ∪ 𝐿).

   Let 𝐷 be an RDF dataset, 𝑆(𝐷) = {𝑠 | ∃ 𝑝, 𝑜 (𝑠, 𝑝, 𝑜) ∈ 𝐷} denotes the set of individual identifiers,
𝑃(𝐷) = {𝑝 | ∃ 𝑠, 𝑜 (𝑠, 𝑝, 𝑜) ∈ 𝐷} the set of property identifiers, and 𝐶(𝐷) = {𝑐 | ∃ 𝑠 (𝑠, rdf:type , 𝑐) ∈
𝐷} the set of class identifiers.
   Moreover, 𝐼 (𝑐) = {𝑠 | ∃ 𝑠 (𝑠, rdf:type , 𝑐) ∈ 𝐷} denotes the set of instances of 𝑐 ∈ 𝐶(𝐷) while
𝑝(𝑠) = {𝑜 | (𝑠, 𝑝, 𝑜) ∈ 𝐷} denotes the set of objects –or values– associated with 𝑠 through property
𝑝.
   An identity link is an RDF triple of the form (𝑎, owl:sameAs , 𝑏) stating that the IRIs 𝑎 and 𝑏
are referring to the same entity, alternatively 𝑎 and 𝑏 are denoting the same individual.
Example 1. In Figure 1, the RDF datasets 𝐷1 and 𝐷2 include 𝑃(𝐷1 ) = {𝑝1 , 𝑝2 , 𝑝3 , 𝑝4 } and 𝑃(𝐷2 ) =
{𝑞1 , 𝑞2 , 𝑞3 , 𝑞4 } as sets of property identifiers, and 𝐶(𝐷1 ) = {𝑐1 } and 𝐶(𝐷2 ) = {𝑐2 } as class identifiers.
In addition, 𝐼 (𝑐1 ) = {𝑎1 , 𝑎2 , 𝑎3 , 𝑎4 , 𝑎5 } and 𝐼 (𝑐2 ) = {𝑏1 , 𝑏2 , 𝑏3 , 𝑏4 , 𝑏5 } denote respectively the sets of
instances of class 𝑐1 and class 𝑐2 . Finally, considering subject 𝑏3 and property 𝑞2 , the related set
of objects or the “value” of 𝑏3 for property 𝑞2 is 𝑞2 (𝑏3 ) = {𝑜8 , 𝑜9 }.

1
    The present paper extends a short preliminary version published in [15].
                                                 𝑜1
                               𝑐1                         𝑞1         𝑐2
                                         𝑝1      𝑜2
                                         𝑝2               𝑞2
                              𝑎1         𝑝2      𝑜3                  𝑏1
                                                          𝑞1
                                                 𝑜4       𝑞2
                                         𝑝1
                              𝑎2         𝑝2      𝑜5       𝑞2         𝑏2
                                         𝑝1               𝑞1
                                                 𝑜6
                              𝑎3         𝑝1      𝑜7       𝑞1         𝑏3
                                         𝑝2               𝑞2
                                                 𝑜8       𝑞2
                              𝑎4                 𝑜9                  𝑏4
                                         𝑝3                𝑞3
                                                 𝑜10
                                         𝑝4
                              𝑎5                           𝑞4        𝑏5
                                                           𝑞4
                                         𝑝4      𝑜11
                                         𝑝3      𝑜12       𝑞3

                                                 𝑜13

                                    𝐷1                          𝐷2

Figure 1: An example of two related RDF datasets. On the left-hand side, the dataset 𝐷1 is populated
with instances of class 𝑐1 while on the right-hand side the dataset 𝐷2 is populated with instances of
class 𝑐2 .


2.2. Link Keys
Link keys are logical constructions allowing to infer identity links between instances of two
classes lying in two RDF datasets. In the following we introduce two syntactic definitions of
link keys, namely “link key expressions” and “link key candidates”. A definition of a link key
and its semantics using Description Logics interpretation is proposed in [16]. However, in this
paper we will stick to these two syntactic definitions for the sake of simplicity.
   Intuitively, a link key expression 𝑘 = (𝐸𝑞, 𝐼 𝑛, (𝑐1 , 𝑐2 )) is composed of two sets of pairs of
properties, i.e., 𝐸𝑞 and 𝐼 𝑛, where 𝐸𝑞 is based on equality and 𝐼 𝑛 on non empty intersection, while
links are generated between instances of classes 𝑐1 and 𝑐2 . More formally we have:

Definition 2 (Link key expression). Let 𝐷1 and 𝐷2 be two RDF datasets, 𝑘 = (𝐸𝑞, 𝐼 𝑛, (𝑐1 , 𝑐2 )) is a
link key expression over 𝐷1 and 𝐷2 iff 𝐼 𝑛 ⊆ 𝑃(𝐷1 ) × 𝑃(𝐷2 ), 𝐸𝑞 ⊆ 𝐼 𝑛, 𝑐1 ∈ 𝐶(𝐷1 ) and 𝑐2 ∈ 𝐶(𝐷2 ).
   The set of links generated by 𝑘 is denoted by 𝐿(𝑘) and includes a set of pairs of instances
(𝑎, 𝑏) ∈ 𝐼 (𝑐1 ) × 𝐼 (𝑐2 ) satisfying:
   (i) for all (𝑝, 𝑞) ∈ 𝐸𝑞, 𝑝(𝑎) = 𝑞(𝑏) and 𝑝(𝑎) ≠ ∅,
   (ii) for all (𝑝, 𝑞) ∈ 𝐼 𝑛 ⧵ 𝐸𝑞, 𝑝(𝑎) ∩ 𝑞(𝑏) ≠ ∅.

   The number of link key expressions may be exponential w.r.t. the number of properties. To
reduce the search space, algorithms for link key discovery only consider “link key candidates”,
i.e., link key expressions which generate at least one link and which are “maximal” among the
set of link key expressions.
                                    PS objects (g)       descriptions (𝛿(𝑔))
                                       (𝑎1 , 𝑏1 )        {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}
                                       (𝑎1 , 𝑏2 )                {∃(𝑝2 , 𝑞2 )}
                                       (𝑎2 , 𝑏1 )                {∃(𝑝1 , 𝑞1 )}
                                       (𝑎2 , 𝑏2 )        {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}
                                       (𝑎3 , 𝑏3 )    {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}
                                       (𝑎4 , 𝑏4 )        {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )}
                                       (𝑎4 , 𝑏5 )        {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )}
                                       (𝑎5 , 𝑏4 )        {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )}
                                       (𝑎5 , 𝑏5 )        {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )}

Table 1
The pattern structure related to datasets 𝐷1 and 𝐷2 given in Fig. 1.


Definition 3 (Link key candidate). A link key expression 𝑘1 = (𝐸𝑞1 , 𝐼 𝑛1 , (𝑐1 , 𝑐2 )) is a link key
candidate if:
  (i) 𝐿(𝑘1 ) ≠ ∅,
  (ii) there does not exist another link key expression 𝑘2 = (𝐸𝑞2 , 𝐼 𝑛2 , (𝑐1 , 𝑐2 )) over 𝐷1 and 𝐷2 such
that 𝐸𝑞1 ⊂ 𝐸𝑞2 , 𝐼 𝑛1 ⊂ 𝐼 𝑛2 , and 𝐿(𝑘1 ) = 𝐿(𝑘2 ).
For example, consider the expressions 𝑘1 and 𝑘2 :
   (i) 𝑘1 = ({(𝑝1 , 𝑞1 )}, {(𝑝1 , 𝑞1 ), (𝑝2 , 𝑞2 )}, (𝑐1 , 𝑐2 )),
   (ii) 𝑘2 = ({(𝑝1 , 𝑞1 )}, {(𝑝1 , 𝑞1 )}, (𝑐1 , 𝑐2 )).
The related link sets are 𝐿(𝑘1 ) = 𝐿(𝑘2 ) = {(𝑎3 , 𝑏3 )}. Then 𝑘1 and 𝑘2 are both link key expressions
but only 𝑘1 is a link key candidate as it is maximal on the link set {(𝑎3 , 𝑏3 )} while 𝑘2 is not.
   The set of link key expressions is denoted by lke, the set of link key candidates by lkc, and
we have that lkc ⊆ lke. The definition of link key candidates is based on the idea of maximality
which involves a certain form of closure. This gave rise to a number of papers studying the
potential relations existing between Formal Concept Analysis (FCA [12]) and the discovery of
link keys [9, 10]. Following the same line, we make precise in the next section two ways of
discovering link key candidates based on two extensions of FCA, pattern structures [13, 17] and
partition pattern structures [14].


3. From Link Keys to Non Redundant Link Keys
3.1. Discovering Link Keys with Pattern Structures
In this section we shortly recall how pattern structures can be used in the discovery of link
keys (details can be read in [11, 18]). For the sake of simplicity and readability, we rely on a
motivating example. Then we show how “Partition Pattern Structures” [14, 19] allow to discover
the so-called “non-redundant link key candidates” denoted as nrlkc.
Example 2. Let us consider the pattern structure lkps = (𝐺, (𝐸, ⊓), 𝛿) displayed in Table 1. All
details for building lkps and the associated pattern concept lattice are given in [11]. In the
rows (“PS objects”), 𝐺 includes pairs of related instances, i.e., (𝑎𝑖 , 𝑏𝑗 ) with 𝑎𝑖 ∈ 𝐼 (𝑐1 ) and 𝑏𝑗 ∈ 𝐼 (𝑐2 ),
which correspond to the objects of lkps. The set of potential descriptions (𝐸, ⊓) includes all
possible pairs of properties preceded either by ∀ or ∃. For the sake of simplicity, the ⊓ operator
corresponds here to set intersection.
                                                                                        𝑝𝑐0
                                                                                  𝐼 (𝑐1 ) × 𝐼 (𝑐2 )
                                                                                         ∅

                                                                                                               𝑝𝑐3                                    𝑝𝑐4
                             𝑝𝑐1                               𝑝𝑐2
                                                                                                      {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 ),               {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏1 ),
                    {(𝑎4 , 𝑏5 ), (𝑎5 , 𝑏4 )}          {(𝑎4 , 𝑏4 ), (𝑎5 , 𝑏5 )}
                                                                                                      (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}                (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}
               𝑘1 = {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )}   𝑘2 = {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )}
                                                                                                      𝑘3 = {∃(𝑝2 , 𝑞2 )}                     𝑘4 = {∃(𝑝1 , 𝑞1 )}

                                                                                                                     𝑝𝑐5
                                                                                                      {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}
                                                                                                      𝑘5 = {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}

                                                                                                                     𝑝𝑐6
                                                                                                                 {(𝑎3 , 𝑏3 )}
                                                                                                      𝑘6 = {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ),
                                                                                                             ∃(𝑝2 , 𝑞2 )}


                                                                                        𝑝𝑐7
                                                                                         ∅
                                                                                 𝑃(𝐷1 ) × 𝑃(𝐷2 )

Figure 2: The pattern concept lattice related to Table 1. Every intent in a pattern concept corresponds
to a link key candidate.


   The mapping 𝛿 relates a pair of instances (𝑎, 𝑏) ∈ 𝐼 (𝑐1 ) × 𝐼 (𝑐2 ) to a description as follows: (i)
𝛿(𝑎, 𝑏) includes ∀(𝑝, 𝑞) whenever 𝑝(𝑎) = 𝑞(𝑏) and 𝑝(𝑎) ≠ ∅, (ii) 𝛿(𝑎, 𝑏) includes ∃(𝑝, 𝑞) whenever
𝑝(𝑎) ∩ 𝑞(𝑏) ≠ ∅. Actually, such descriptions correspond to link key expressions w.r.t. the pairs
of classes (𝑐1 , 𝑐2 ). In this work, we only consider the pair of classes 𝑐1 and 𝑐2 , while dealing with
several pairs of classes is explained in [11].
   Based on Fig. 1, the description of 𝛿(𝑎1 , 𝑏1 ) is given by {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )} because 𝑝1 (𝑎1 ) ∩
𝑞1 (𝑏1 ) ≠ ∅ and 𝑝2 (𝑎1 ) ∩ 𝑞2 (𝑏1 ) ≠ ∅, while 𝛿(𝑎2 , 𝑏1 ) = {∃(𝑝1 , 𝑞1 )} because 𝑝1 (𝑎2 ) ∩ 𝑞1 (𝑏1 ) ≠ ∅.
Then, 𝛿(𝑎1 , 𝑏1 ) ⊓ 𝛿(𝑎2 , 𝑏1 ) = {∃(𝑝1 , 𝑞1 )} and thus 𝛿(𝑎2 , 𝑏1 ) ⊑ 𝛿(𝑎1 , 𝑏1 ). This can be read in the
pattern concept lattice displayed in Fig.2 where the pattern concept 𝑝𝑐5 is subsumed by the
pattern concept 𝑝𝑐4 , i.e., the intent {∃(𝑝1 , 𝑞1 )} of 𝑝𝑐4 is included in the intent {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}
of 𝑝𝑐5 , while the extent {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} of 𝑝𝑐5 is included in the extent of 𝑝𝑐4 which is
{(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}.
   In this way, the set of all pattern concepts is organized within the pattern concept lattice in
Fig. 2. Moreover, all link key candidates are lying in the intents of the pattern concepts.

3.2. Discovering Non Redundant Link Keys with Partition Pattern Structures
3.2.1. Motivation: sameAs is an Equivalence Relation.


   Having a careful look at Fig. 2, one can verify that more compact link keys could be designed.
For example, in the extent of 𝑝𝑐3 , {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}, the three first pairs have a non
empty intersection when considered in a particular order, i.e., 𝑎1 is shared by {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 )} and
𝑏2 is shared by {(𝑎1 , 𝑏2 ), (𝑎2 , 𝑏2 )}. Actually there exists a potential owl:sameAs –or simply sameAs –
relation between the elements in such pairs. Being an equivalence relation, sameAs generates an
equivalence class. Hence, we can “merge” some of these pairs in such an equivalence class. For
example (𝑎1 , 𝑏1 ) and (𝑎1 , 𝑏2 ) can be merged, i.e., a 1 sameAs b 1 and a 1 sameAs b 2 yield b 1 sameAs b 2
(symmetry and transitivity of sameAs ). Moreover (𝑎1 , 𝑏2 ) and (𝑎2 , 𝑏2 ) can be merged because
a 1 sameAs b 2 and a 2 sameAs b 2 yield a 1 sameAs a 2 . Finally, from a 1 sameAs a 2 and a 1 sameAs b 1 it
comes a 2 sameAs b 1 . Then a 1 , a 2 , b 1 , b 2 , are in the same equivalence class w.r.t. sameAs .
    Two important facts should be noticed: (i) the link a 2 sameAs b 1 , absent in 𝑝𝑐3 but present
in 𝑝𝑐4 , is inferred from the extent of 𝑝𝑐3 thanks to the properties of sameAs while in the same
way a 1 sameAs b 2 could be inferred in 𝑝𝑐4 , (ii) the equivalence class {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 )} which can
be built in 𝑝𝑐3 and as well in 𝑝𝑐4 , allows us to merge the link keys in 𝑝𝑐3 and 𝑝𝑐4 because they
have now the “same extent”. Relying on this observation, we define an equivalence relation
over the extents of pattern concepts as follows.
    Let us consider two RDF datasets 𝐷1 and 𝐷2 , a link key 𝑘, two classes 𝑐1 and 𝑐2 , and the
set of instances 𝐼 = 𝐼 (𝑐1 ) ∪ 𝐼 (𝑐2 ). Given 𝑥0 and 𝑥𝑛 ∈ 𝐼, there exists a chain of sameAs re-
lations between 𝑥0 and 𝑥𝑛 iff there exists a sequence of elements 𝑥1 , 𝑥2 , … , 𝑥𝑛−1 such that
x 0 sameAs x 1 , x 1 sameAs x 2 , … , x n −1 sameAs x n holds, where the symmetry of sameAs may be
used. Then, we define a relation between 𝑥0 and 𝑥𝑛 in 𝐼 w.r.t. the link key 𝑘, denoted as
𝑥0 ≃𝑘 𝑥𝑛 , iff there exists a chain of sameAs relations between 𝑥0 and 𝑥𝑛 . It can be checked
that ≃𝑘 is an equivalence relation as sameAs itself is an equivalence relation. For example,
𝐸𝑥𝑡(𝑝𝑐3 )/≃∃(𝑝2 ,𝑞2 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} where 𝐸𝑥𝑡(𝑝𝑐3 ) denotes the extent of 𝑝𝑐3 . In the
same way, 𝐸𝑥𝑡(𝑝𝑐4 )/≃∃(𝑝1 ,𝑞1 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}. Then the two link keys ∃(𝑝1 , 𝑞1 ) and
∃(𝑝2 , 𝑞2 ) can be “identified” or “merged” because they have the same equivalence classes.
    As a consequence, one may obtain more comparable sets of linked elements and thus minimize
the number of possible link key candidates. Moreover, it can be noticed that an equivalence class
determines a partition within the set of instances under study (singletons if any are omitted
here).

3.2.2. The design of a PPS for “Non Redundant Link Keys”.


   Hereafter, one main objective is to build a specific pattern structure where concepts are
related to unique equivalence classes and yield “non redundant link key candidates” denoted as
nrlkc. Here “non redundant” means any two elements in nrlkc are associated with different
equivalence classes. In addition, an equivalence class corresponds to a partition of the set of
instances. Accordingly, we define a “partition pattern structure” (pps) based on descriptions
which are partitions and used to discover non redundant link key candidates. Indeed, in such
a pps, objects in the rows correspond to pairs of properties and descriptions correspond to
partitions. An example of pps is proposed in [14] where pps are introduced for mining functional
dependencies.
   Accordingly, we define a partition pattern structure over classes 𝑐1 and 𝑐2 as a triple
(lkc, (𝑃𝑎𝑟𝑡(𝐼 ), ⊓𝑝𝑎𝑟𝑡 ), 𝛿) such as:

    • lkc is the set of all link key candidates w.r.t. classes 𝑐1 and 𝑐2 which are given by a pattern
      concept lattice (as displayed in Fig. 2).
                          pps objects (𝑘𝑖 )                              Descriptions (partitions, 𝛿(𝑘𝑖 ))
                          𝑘1 = {∃(𝑝4 , 𝑞4 ), ∀(𝑝4 , 𝑞4 )}                         {(𝑎4 , 𝑏5 ), (𝑎5 , 𝑏4 )}
                          𝑘2 = {∃(𝑝3 , 𝑞3 ), ∀(𝑝3 , 𝑞3 )}                         {(𝑎4 , 𝑏4 ), (𝑎5 , 𝑏5 )}
                          𝑘3 = {∃(𝑝2 , 𝑞2 )}                                  {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}
                          𝑘4 = {∃(𝑝1 , 𝑞1 )}                                  {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}
                          𝑘5 = {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}                    {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}
                          𝑘6 = {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}                  {(𝑎3 , 𝑏3 )}

Table 2
The partition pattern structure related to the pattern structure in Table 1 over the classes 𝑐1 and 𝑐2 in
datasets 𝐷1 and 𝐷2 .


     • (𝑃𝑎𝑟𝑡(𝐼 ), ⊓𝑝𝑎𝑟𝑡 ) is the set of potential descriptions or partitions and ⊓𝑝𝑎𝑟𝑡 is the “meet” of
       two partitions.
     • 𝛿 maps a link key candidate 𝑘 to the partition 𝐼 / ≃𝑘 , which is the quotient set of 𝐼 w.r.t. ≃𝑘 .

   As it can be seen in Table 2, pps objects are the link key candidates already computed and lying
in the intents of the pattern concept lattice. This means that link key candidates will compose
pps-concept extents, while partitions involving elements are declared in the descriptions and
thus will compose pps-concept intents. Moreover, partitions reduced to a singleton are omitted
in the descriptions of the pps objects.
   Let us examine a concrete example of pps based on the pattern structure given in Ta-
ble 1. For 𝑝𝑐3 , we have 𝐸𝑥𝑡(𝑝𝑐3 ) = {(𝑎1 , 𝑏1 ), (𝑎1 , 𝑏2 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} and 𝐼 𝑛𝑡(𝑝𝑐3 ) = {∃(𝑝2 , 𝑞2 )}.
The related partition, 𝐸𝑥𝑡(𝑝𝑐3 )/≃∃(𝑝2 ,𝑞2 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}, where singletons are omit-
ted, is associated with row 𝑘3 ({∃(𝑝2 , 𝑞2 )}) in Table 2 . In the same way, considering 𝑝𝑐4
and {∃(𝑝1 , 𝑞1 )}, 𝐸𝑥𝑡(𝑝𝑐4 )/≃∃(𝑝1 ,𝑞1 ) is associated with row 𝑘4 ({∃(𝑝1 , 𝑞1 )}) in Table 2. Moreover,
𝐸𝑥𝑡(𝑝𝑐4 )/≃∃(𝑝1 ,𝑞1 ) = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} and is equal to 𝐸𝑥𝑡(𝑝𝑐3 )/≃∃(𝑝2 ,𝑞2 ) . As already ob-
served, 𝑘3 and 𝑘4 are inducing the same partition and can be “merged”. The other pps objects,
namely 𝑘1 , 𝑘2 , 𝑘5 , and 𝑘6 , are obtained from the corresponding pattern concepts in Fig. 2 in the
same way. Finally, the pps over classes 𝑐1 and 𝑐2 related to the datasets 𝐷1 and 𝐷2 is displayed
in Table 2.
   The meet –or similarity– operation used to compare the rows in the pps corresponds to the
meet of two partitions 𝑝𝑎𝑟𝑡1 and 𝑝𝑎𝑟𝑡2 over 𝐼 and is denoted by 𝑝𝑎𝑟𝑡1 ⊓𝑝𝑎𝑟𝑡 𝑝𝑎𝑟𝑡2 . This meet
operation is classically defined as the intersection of the respective equivalence classes, also
known as the “coarsest common refinement of the two partitions” (see [14]). For example the
meet 𝛿(𝑘4 ) ⊓𝑝𝑎𝑟𝑡 𝛿(𝑘5 ) is given by {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} ⊓𝑝𝑎𝑟𝑡 {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )} which
yields the partition {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}. In particular, this means that 𝛿(𝑘4 ) ⊓ 𝛿(𝑘5 ) = 𝛿(𝑘5 )
and thus that 𝛿(𝑘5 ) ⊑ 𝛿(𝑘4 ).
   Based on this meet operation, the pps concept lattice is constructed and shown in Fig. 3. The ex-
tent of a pps concept includes the 𝐸𝑞 and the 𝐼 𝑛 parts of a link key candidate (𝐸𝑞, 𝐼 𝑛, (𝑐1 , 𝑐2 )) in lkc.
For example, the intent of 𝑝𝑝𝑠𝑐6 is {(𝑎3 , 𝑏3 )} and its extent is 𝑘6 , i.e., {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}.
As all elements in the extents of the pps concept lattice, 𝑘6 is an element of nrlkc, i.e., a non
redundant link key candidate. Moreover, 𝑘6 corresponds to a closed set and thus verifies maxi-
mality, and it is non redundant w.r.t. the partition in the associated intent, i.e., no other link key
candidate induces the same partition.
3.2.3. Discussion: From LKC to NRLKC.


   We now discuss what is gained in relying on pattern structures and then on partition pattern
structures. Actually, the discovery of link key candidates (lkc) is based on the construction of
the pattern concept lattice. Then the construction of the pps concept lattice yielding the non
redundant link key candidates should be considered as a “refinement” where equivalent link key
candidates, i.e., link key candidates inducing the same partition, are merged. For example, both
concepts 𝑝𝑐3 and 𝑝𝑐4 in Fig. 2 are merged into a single concept 𝑝𝑝𝑠𝑐34 in Fig. 3. The resulting
pps concept lattice is of smaller size and more concise as link key candidates in the concept
extents are non redundant.
   The relations existing between lkc and nrlkc can be characterized as follows. By construction,
nrlkc ⊆ lkc as the discovery of non redundant link keys is based on lkc, i.e., the link key
candidates lying in the intents of concepts in the pattern concept lattice. Then, the candidates
which have the same equivalence classes w.r.t. the sameAs relation are merged to produce the
non redundant link key candidates in nrlkc. Thus, it comes that nrlkc ⊆ lkc and that |nrlkc |
≤ |lkc | (where |X| denotes the cardinality of set X). In other words, a non redundant link key
candidate is always a link key candidate while the converse is not true.
   The definition of nrlkc allows us to introduce a new quality measure for evaluating and
validating non redundant link key candidates. In the next section we make precise this new
quality measure and we discuss the benefits related to the discovery of non redundant link key
candidates.


4. Quality Measures for Non Redundant Link Key Candidates
Not all link key candidates are eligible to be final link keys. Some candidates can be too general or
related to noise in the data and thus they may generate owl:sameAs links which are not correct or
have a bad quality. For example the link key candidate ({}, {(country,country )}, (Person,Human ))
states that two individuals (subjects) sharing the same country represent the same person, and
this is obviously not true. The candidate ({}, {(name,title )}, (Person,Book )) may be discovered
by chance and will generate non correct links as well.
   An ideal link key candidate should be “correct” and “complete”, and specific “quality measures”
are defined for assessing these properties. In a supervised setting, when a set of reference links
is available, the correctness of a link key candidate is measured thanks to “precision” and the
completeness thanks to “recall”. In an unsupervised setting as this is the case here, the measures
of “coverage” and “discriminability” are defined over the set of links directly generated by a link
key candidate [9]. The global quality of a link key candidate is then estimated thanks to the
harmonic mean of these two measures.
                                                                        |𝜋 (𝐿) ∪ 𝜋2 (𝐿)|
   The coverage of a set 𝐿 of links over 𝑐1 and 𝑐2 is 𝑐𝑜𝑣(𝐿, 𝑐1 , 𝑐2 ) = 1                   where 𝜋1 (𝐿) =
                                                                         |𝐼 (𝑐1 ) ∪ 𝐼 (𝑐2 )|
{𝑎|(𝑎, 𝑏) ∈ 𝐿} and 𝜋2 (𝐿) = {𝑏|(𝑎, 𝑏) ∈ 𝐿}.
                                            𝑚𝑖𝑛(|𝜋1 (𝐿)|, |𝜋2 (𝐿)|)
   The discriminability of 𝐿 is 𝑑𝑖𝑠(𝐿) =                            .
                                                     |𝐿|
                                                            𝑝𝑝𝑠𝑐0
                                                     Link Key Candidates
                                                     Singleton Partition


                               𝑝𝑝𝑠𝑐1                                𝑝𝑝𝑠𝑐2                                                                𝑝𝑝𝑠𝑐6
                     𝑘1 {∀(𝑝4 , 𝑞4 ), ∃(𝑝4 , 𝑞4 )}        𝑘2 {∀(𝑝3 , 𝑞3 ), ∃(𝑝3 , 𝑞3 )}                                 𝑘6 {∀(𝑝1 , 𝑞1 ), ∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}
                       {(𝑎4 , 𝑏5 ), (𝑎5 , 𝑏4 )}             {(𝑎4 , 𝑏4 ), (𝑎5 , 𝑏5 )}                                                  {(𝑎3 , 𝑏3 )}


                                                                                                        𝑝𝑝𝑠𝑐5
                                                                                            𝑘5 {∃(𝑝1 , 𝑞1 ), ∃(𝑝2 , 𝑞2 )}
                                                                                          {(𝑎1 , 𝑏1 ), (𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}


                                                                                              𝑝𝑝𝑠𝑐34
                                                                               𝑘3 , 𝑘4 {∃(𝑝1 , 𝑞1 )}, {∃(𝑝2 , 𝑞2 )}
                                                                                 {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}


                                                              𝑝𝑐7
                                                               ∅
                                                               𝐼

Figure 3: The pps concept lattice over the partition pattern structure and the pattern concept lattice
given in Fig. 2.


   Coverage and discriminability evaluate how close a set of links is to a total, respectively
bijective, mapping. Coverage is maximum when all instances of the considered classes are
linked to at least another instance, while discriminability is maximum when instances are linked
to at most one instance.
   In the present setting, we are interested in the “redundancy” of link key candidates, and we
would like to evaluate the quality of “non redundant” candidates. A link key candidate 𝑘 is
said to be “redundant” in 𝐻 ⊆ lkc if there exists another link key candidate ℎ ∈ 𝐻 such that
𝐼 /≃𝑘 = 𝐼 /≃ℎ , i.e., 𝑘 generates the same partition as ℎ. This is the case in the running example
for the link key candidates 𝑘3 and 𝑘4 . One main consequence of detecting redundancy is to
reduce the number of candidates by merging candidates inducing the same partition. Then,
since ≃𝑘3 and ≃𝑘4 induce the same partition, 𝑘3 and 𝑘4 are merged into a non redundant link
key candidate 𝑘34 = {𝑘3 , 𝑘4 } interpreted as “𝑘3 or 𝑘4 ”.
   Accordingly, we introduce the new quality measure 𝑝𝑆𝑖𝑧𝑒(𝑘) that is compliant with the
semantics of sameAs and defined w.r.t. the partition associated with the equivalence relation ≃𝑘 :
   𝑝𝑆𝑖𝑧𝑒(𝑘) = | {𝑥 ∈ 𝐼 /≃𝑘 | |𝑥| > 1} | where 𝐼 = 𝐼 (𝑐1 ) ∪ 𝐼 (𝑐2 ) and 𝐼 /≃𝑘 is the quotient set of 𝐼 w.r.t.
≃𝑘 .
For example, 𝑝𝑆𝑖𝑧𝑒(𝑘3 ) = 𝑝𝑆𝑖𝑧𝑒(𝑘4 ) = |{(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}| = 2. The lower bound of 𝑝𝑆𝑖𝑧𝑒(𝑘)
is 1 since 𝑘 generates at least one link. The upper bound is 𝑚𝑖𝑛(|𝐼 (𝑐1 )|, |𝐼 (𝑐2 )|) and this is
achieved when 𝑘 determines a 1-1 mapping. Moreover, pSize can be normalized into 𝑛𝑝𝑆𝑖𝑧𝑒(𝑘) =
𝑝𝑆𝑖𝑧𝑒(𝑘)/|𝐼 / ≃𝑘 |. Then 𝑛𝑝𝑆𝑖𝑧𝑒(𝑘3 ) = 2/6 as 𝐼 / ≃𝑘3 = {(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 ), 𝑎4 , 𝑏4 , 𝑎5 , 𝑏5 }, where
this time singletons are counted.
  The normalized measure 𝑛𝑝𝑆𝑖𝑧𝑒(𝑘) evaluates how close are 𝑐1 and 𝑐2 w.r.t. ≃𝑘 , and is maximal
when all instances are linked. However, this is not always satisfactory, especially when the
cardinalities of the two classes significantly differ (i.e., classes are not balanced). Then, it is
more accurate to maximize the measure as soon as all instances of one class are linked as fol-
lows: 𝑠𝑠𝑝𝑐(𝑘) = 𝑝𝑆𝑖𝑧𝑒(𝑘)/𝑚𝑖𝑛(|𝐼 (𝑐1 )/≃𝑘 |, |𝐼 (𝑐2 )/≃𝑘 |) (also known as the “Szymkiewicz–Simpson
partition coefficient”). For example, 𝑠𝑠𝑝𝑐(𝑘3 ) = 2/4 with:
𝑠𝑠𝑝𝑐(𝑘3 ) = |{(𝑎1 , 𝑏1 , 𝑎2 , 𝑏2 ), (𝑎3 , 𝑏3 )}|/𝑚𝑖𝑛(|{(𝑎1 , 𝑎2 ), 𝑎3 , 𝑎4 , 𝑎5 }|, |{(𝑏1 , 𝑏2 ), 𝑏3 , 𝑏4 , 𝑏5 }|).


5. Experiments
5.1. Datasets and experimental settings
We run experiments on DB-Yago datasets provided in [20]. These datasets are used by the
approaches to which we compare our results in the last series of experiments. They have been
rewritten into Terse RDF Triple Language (Turtle) which is a syntax for expressing RDF data
[21]. They are derived from DBpedia and Yago and organized into nine tasks. One particular
task consists in finding links between instances of a class in DBpedia and instances of a class in
Yago, e.g., db:Actor and yago:Actor . A set of reference links (i.e., owl:sameAs links) denoted
by 𝐿𝑟𝑒𝑓 is provided for each task. The statistics of DB-Yago datasets are given in Table 3.
   Experiments were run on a MacBook Pro 2018 with Intel Core i7-8850H@2,6 GHz, 16GB
of RAM. The link key candidates are provided by a tool called Linkex2 in which we have
implemented the lkps algorithm i.e. the link key discovery algorithm based on pattern structures
proposed in [11]. A basic text normalization consisting in removing diacritics, tokenizing and
sorting the resulting bag of tokens is performed.

5.2. Redundancy of Link Key Candidates is not so Significant
In Table 3, column |lkc| represents the number of link key candidates discovered by lkps
algorithm while column |nrlkc| represents the number of non redundant link key candidates.
In most of the tasks, we can observe that |nrlkc| is equal to |lkc|, except for the tasks Actor
and Film where |nrlkc| is lower than |lkc| by 1% and by 5% respectively. This means that,
in general, link key candidates produce different partitions and only a few are redundant. By
contrast, if redundancy does not significantly reduce the number of link key candidates, it gives
a good idea of the compactness of partitions related to link key candidates.
   Nevertheless, even if redundancy is not so significant, we decided to compare the quality
of the resulting sets of candidates measured with 𝑝𝑆𝑖𝑧𝑒 and the results of competitors, namely
keys and conditional keys as they are studied in [20].

5.3. Non Redundant Candidates, Classical Keys and Conditional Keys
The 𝑝𝑆𝑖𝑧𝑒 measure is used to evaluate and select the best link key candidates, –in this experiment
the link key candidate ranked first– in every task listed in Table 3. Then, we compare recall,
precision, and F-measure of the links generated by the best link key candidates selected by
2
    Linkex is available online at https://gitlab.inria.fr/moex/linkex.
 Interlinking task    datasets               |triples|   |subjects|   |properties|   |lkc|    |nrlkc|   time
                      db:Actor                94 606        5 807         16
       Actor                                                                         2 198     2 177     56s
                      yago:Actor            1 029 580     108 415         16
                      db:Album               594 144       85 002          5
      Album                                                                           44        44       28s
                      yago:Album             762 238      136 848          5
                      db:Book                247 372       29 846          7
       Book                                                                           82        82       39s
                      yago:Book              185 032       41 849          7
                      db:Film               1 369 600      82 099          9
        Film                                                                         18 718   17 643    34m
                      yago:Film             1 067 084     123 822          9
                      db:Mountain            135 442       16 397          5
     Mountain                                                                         39        39       5m
                      yago:Mountain          233 562       32 874          5
                      db:Museum               15 940        1 826          7
     Museum                                                                           48        48       9s
                      yago:Museum            163 342       21 050          7
                      db:Organization       4 487 205     183 665         17
   Organization                                                                      1 425     1 425    57m
                      yago:Organization     4 410 854     430 071         17
                      db:Scientist           128 360       18 409         10
     Scientist                                                                        862      862       2m
                      yago:Scientist         671 266       92 828         18
                      db:University          241 838       10 352          9
    University                                                                        213      213       56s
                      yago:University        263 624       23 334          9


Table 3
DB-Yago datasets statistics, where |triples| denotes the number of triples in each dataset, |subjects| the
number of instances, and |properties| the number of properties.


𝑝𝑆𝑖𝑧𝑒 against interlinking approaches providing classical keys and conditional keys as reported
in [20].
   In Figure 4 we can observe that: (i) recall of link keys is significantly better than recall of
classical and conditional keys, (ii) precision of classical and conditional keys is slightly better
than precision of link keys in most of the tasks, (iii) F-measure is much higher for selected link
key candidates than for classical and conditional keys.
   Link keys have a better recall because they are more flexible than classical keys, i.e., a link
key is not necessarily a pair of keys and an instance of class 𝑐1 may be linked to many instances
of class 𝑐2 .
   For summarizing, it can be concluded that considering the best link key candidates selected
by 𝑝𝑆𝑖𝑧𝑒 will ensure a higher interlinking quality compared to classical and conditional keys. We
also observe that the best link key according to discriminability and coverage is the same that
the best non redundant link key selected thanks to 𝑝𝑆𝑖𝑧𝑒. Actually, the Actor dataset makes an
exception: the link key candidate selected with 𝑝𝑆𝑖𝑧𝑒 obtains an F-measure of 0.95 contrasting
the score of 0.34 obtained with coverage and discriminability. In addition, contrasting key-based
approaches, link key discovery does not require any prior knowledge such as property or class
alignments.
                                               (a)recall                     (b)precision                                                        (c)F-measure

                                                     0.57                                              0.93                                                     0.71                classicalkeys
                  University            0.22                                                               0.99 University                  0.26
                                  9⋅10−2                          0.84                      0.69           0. 9 9                    0.1 6                         0.76             conditional keys
                    Scientist        0.16                                                                  0.99    Scientist                 0.28                                   link keys
                                 5⋅10−2              0.56                                        0.82     0. 9 8                 0. 11                        0.67
                Organization −20.14                                                                       0.98 Organization        −2 0.24
                              1⋅10                             0.79                                    0.94 1                 2⋅10                                      0.86
                   Museum                 0.25                                                              1     Museum                          0.4
                                   0.12                        0.79                                    0.94 1                           0. 21                           0.86
                  Mountain                 0.28                                                            0.99 Mountain                            0.44
                              0                        0.61                                           0.91  1                0                                    0.74
                        Film                    0.38                                                     0.96          Film                              0.54
                                4⋅10−2                    0.67                                             0.
                                                                                                        0.95  9 9               8⋅10  −2
                                                                                                                                                                    0.79
                       Book 0.−211                                                                         0.99       Book           −2 0.23
                                3⋅10                  0.59                                        0.83      1                  6⋅10                            0.69
                      Album          0.15                                                                  0.99      Album                  0.26
                              0                                       0.91                                10.98              0                                               0.95
                       Actor                         0.57                                                  0.99       Actor                                      0.73
                                           0.27                                                            0.99                                    0.43
Figure 4: Interlinking performances of classical keys, conditional keys as given in [20] and link keys.
Actually a recall of 0 stands for a very small recall close to 0.


6. Synthesis and Conclusion
This paper introduces a formalization of link key discovery based on partition pattern structures
(pps). This approach allows to discover the set nrlkc of link key candidates which are not
redundant w.r.t. the owl:sameAs equivalence relation, while still being correct and complete. In
addition, new appropriate quality measures are proposed.
   Practically, we observe in experiments that the redundancy of link key candidates in public
datasets is rather rare. Nevertheless, the experiments and the associated results show that
link key based on pps obtain better F-measure values than two other key-based approaches to
data-interlinking.
   Among the perspectives, a first one is to consolidate the theory and practice of link key
discovery based on pps introduced and detailed in this paper. A second and very important
direction of investigation is related to the discovery of “fuzzy link keys”. We make the hypothesis
that the redundancy of link key candidates is rather rare because we are using the crisp equality
operator when we are building the link key candidates. Instead, in considering the discovery of
link key candidates as depending on a similarity relation, i.e., reflexive and symmetric –also
called a tolerance relation– rather than on an equality, then we could propose a formalization
of fuzzy link key candidates in the spirit of approximate-matching dependencies as they are
studied in [19]. We could also expect much more differences between crisp and fuzzy link key
candidates. In addition, a reduction of the size of the concept lattice related to nrlkc could also
be investigated thanks to similarities between partitions.


References
 [1] A. Ferrara, A. Nikolov, F. Scharffe, Data Linking for the Semantic Web, International
     Journal of Semantic Web and Information Systems 7 (2011) 46–76.
 [2] P. Christen, Data Matching – Concepts and Techniques for Record Linkage, Entity Resolu-
     tion, and Duplicate Detection, Springer, 2012.
 [3] M. Nentwig, M. Hartung, A.-C. Ngonga Ngomo, E. Rahm, A survey of current Link
     Discovery frameworks, Semantic Web Journal 8 (2017) 419–436.
 [4] A. N. Ngomo, S. Auer, LIMES – A Time-Efficient Approach for Large-Scale Link Discovery
     on the Web of Data, in: T. Walsh (Ed.), Proceedings of the 22nd International Joint
     Conference on Artificial Intelligence (IJCAI), IJCAI/AAAI, 2011, pp. 2312–2317.
 [5] J. Volz, C. Bizer, M. Gaedke, G. Kobilarov, Silk – A Link Discovery Framework for the
     Web of Data, in: C. Bizer, T. Heath, T. Berners-Lee, K. Idehen (Eds.), Proceedings of the
     WWW2009 Workshop on Linked Data on the Web (LDOW), CEUR Workshop Proceedings
     538, 2009.
 [6] R. Isele, C. Bizer, Active learning of expressive linkage rules using genetic programming,
     Journal of web semantics 23 (2013) 2–15.
 [7] A. N. Ngomo, K. Lyko, EAGLE: Efficient Active Learning of Link Specifications Using
     Genetic Programming, in: E. Simperl, P. Cimiano, A. Polleres, Ó. Corcho, V. Presutti
     (Eds.), Proceedings of the 9th Extended Semantic Web Conference (ESWC, Lecture Notes
     in Computer Science 7295, Springer, 2012, pp. 149–163.
 [8] A. N. Ngomo, K. Lyko, Unsupervised learning of link specifications: deterministic vs.
     non-deterministic, in: P. Shvaiko, J. Euzenat, K. Srinivas, M. Mao, E. Jiménez-Ruiz (Eds.),
     Proceedings of the 8th International Workshop on Ontology Matching (at ISWC), CEUR
     Workshop Proceedings 1111, 2013, pp. 25–36.
 [9] M. Atencia, J. David, J. Euzenat, Data interlinking through robust linkkey extraction, in:
     Proceedings of ECAI, 2014, pp. 15–20.
[10] M. Atencia, J. David, J. Euzenat, A. Napoli, J. Vizzini, Link key candidate extraction with
     relational concept analysis, Discrete applied mathematics 273 (2020) 2–20.
[11] N. Abbas, J. David, A. Napoli, Discovery of Link Keys in RDF Data Based on Pattern
     Structures: Preliminary Steps, in: F. J. Valverde-Albacete, M. Trnecka (Eds.), Proceedings
     of the International Conference on Concept Lattices and Their Applications (CLA), CEUR
     Workshop Proceedings 2668, 2020, pp. 235–246.
[12] B. Ganter, R. Wille, Formal Concept Analysis: Mathematical Foundations, Springer, 1999.
[13] B. Ganter, S. O. Kuznetsov, Pattern Structures and Their Projections, in: H. S. Delugach,
     G. Stumme (Eds.), Proceedings of the International Conference on Conceptual Structures
     (ICCS), Lecture Notes in Computer Science 2120, Springer, 2001, pp. 129–142.
[14] J. Baixeries, M. Kaytoue, A. Napoli, Characterizing functional dependencies in formal
     concept analysis with pattern structures, Annals of Mathematics and Artificial Intelligence
     72 (2014) 129–149.
[15] N. Abbas, A. Bazin, J. David, A. Napoli, Non-Redundant Link Keys in RDF Data: Preliminary
     Steps, in: Proceedings of the FCA4AI Workshop at IJCAI, CEUR Workshop Proceedings
     2972, 2021, pp. 125–130.
[16] M. Atencia, J. David, J. Euzenat, On the relation between keys and link keys for data
     interlinking, Semantic Web Journal 12 (2021) 547–567.
[17] M. Kaytoue, S. O. Kuznetsov, A. Napoli, S. Duplessis, Mining Gene Expression Data with
     Pattern Structures in Formal Concept Analysis, Information Sciences 181 (2011) 1989–2001.
[18] N. Abbas, A. Bazin, J. David, A. Napoli, Sandwich: An Algorithm for Discovering Relevant
     Link Keys in an LKPS Concept Lattice, in: A. Braud, A. Buzmakov, T. Hanika, F. L. Ber
     (Eds.), Proceedings of the 16th International Conference on Formal Concept Analysis
     (ICFCA), Lecture Notes in Computer Science 12733, Springer, 2021, pp. 243–251.
[19] J. Baixeries, V. Codocedo, M. Kaytoue, A. Napoli, Characterizing Approximate-Matching
     Dependencies in Formal Concept Analysis with Pattern Structures, Discrete Applied
     Mathematics 249 (2018) 18–27.
[20] D. Symeonidou, L. Galárraga, N. Pernelle, F. Saïs, F. M. Suchanek, VICKEY: Mining
     Conditional Keys on Knowledge Bases, in: Proceedings of 16th International Semantic
     Web Conference (ISWC), LNCS 10587, Springer, 2017, pp. 661–677.
[21] D. Beckett, T. Berners-Lee, E. Prud’hommeaux, G. Carothers, RDF 1.1 Turtle, W3C Recom-
     mendation, W3C, 2014. https://www.w3.org/TR/turtle/.