<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Non-Redundant Link Keys in RDF Data: Preliminary Steps</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nacira Abbas</string-name>
          <email>Nacira.Abbas@inria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Bazin</string-name>
          <email>Alexandre.Bazin@loria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jérôme David</string-name>
          <email>Jerome.David@inria.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amedeo Napoli</string-name>
          <email>Amedeo.Napoli@loria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Université Grenoble Alpes</institution>
          ,
          <addr-line>Inria, CNRS, Grenoble INP, LIG, F-38000 Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Lorraine</institution>
          ,
          <addr-line>CNRS, Inria, Loria, F-54000 Nancy</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A link key between two RDF datasets D1 and D2 is a set of pairs of properties allowing to identify pairs of individuals, say x1 in D1 and x2 in D2, which can be materialized as a x1 owl:sameAs x2 identity link. There exist several ways to mine such link keys but no one takes into account the fact that owl:sameAs is an equivalence relation, which leads to the discovery of non-redundant link keys. Accordingly, in this paper, we present the link key discovery based on Pattern Structures (PS). PS output a pattern concept lattice where every concept has an extent representing a set of pairs of individuals and an intent representing the related link key candidate. Then, we discuss the equivalence relation induced by a link key and we introduce the notion of non-redundant link key candidate.</p>
      </abstract>
      <kwd-group>
        <kwd>Linked Data</kwd>
        <kwd>RDF</kwd>
        <kwd>Link Key</kwd>
        <kwd>Formal Concept Analysis</kwd>
        <kwd>Pattern Structures</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        In this paper, we are interested in data interlinking which goal is to discover
identity links across two RDF datasets over the web of data [
        <xref ref-type="bibr" rid="ref5 ref8">5,8</xref>
        ]. The same real
world entity can be represented in two RDF datasets by different subjects in RDF
triples (subject,property,value) (instead of “object” usually used in RDF data
we will use “value”). It is important to be able to detect such identities, for
example using rules expressing sufficient conditions for two subjects to be identical.
A link key takes the form of two sets of pairs of properties associated with a pair
of classes. The pairs of properties express sufficient conditions for two subjects,
from the associated pair of classes, to be the same. An example of a link key is
({(designation, title)}, {(designation, title), (creator, author)}, (Book, Novel))
which states that whenever an instance a of the class Book has the same (non
empty) values for the property designation as an instance b of the class Novel
for the property title (universal quantification), and that a and b share at least
one value for the properties creator and author (existential quantification), then
a and b denote the same entity, i.e., an owl:sameAs relation can be established
between a and b.
      </p>
      <p>Copyright c 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        A link key can be understood as a “closed set” in the sense that it is maximal
w.r.t. the set of pairs of individuals to which it applies. This was firstly discussed
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and then extended in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Hence the question of relying on Formal Concept
Analysis (FCA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) to discover link keys is straightforward as FCA is based on
a closure operator. Then, given two RDF datasets, FCA is applied in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to a
binary table where rows correspond to pairs of individuals and columns to pairs
of properties. The intent of a concept is a link key candidate which should be
validated thanks to suitable quality measures. The extent of the concept is the
set of identity links between individuals. Furthermore, a generalization of the
former approach proposed in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is based on pattern structures [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and takes into
account different pairs of classes at the same time in the discovery of link keys.
      </p>
      <p>Link key candidates over two RDF datasets have to generate different and
maximal link sets. However it appears that two different link key candidates may
generate the same link set. This means that there exists some redundancy
between the two link key candidates, that they should be considered as equivalent
and merged. This can be achieved by looking at owl:sameAs which is an
equivalence relation stating that two individual should be identified. The owl:sameAs
relation generates partitions among pairs of individuals that can be used to
detect redundant link key candidates and thus reduce their number, i.e., two
candidates relying on the same partition are declared as redundant and thus
merged.</p>
      <p>In this paper, we present the discovery of link key candidates within the
framework of pattern structure. Then, we introduce the notion of non-redundant
link key candidate based on the equivalence relation induced by a link key
candidate. Finally, we discuss how these candidates can be merged to reduce the
search space of link keys.
2
2.1</p>
    </sec>
    <sec id="sec-2">
      <title>Basics and Notations</title>
      <sec id="sec-2-1">
        <title>RDF data</title>
        <p>In this work, we deal with RDF datasets which are defined as follows:</p>
      </sec>
      <sec id="sec-2-2">
        <title>Definition 1 (RDF dataset).</title>
        <p>Let U be a set of IRIs (Internationalized Resource Identifier), B a set of
blank nodes and L a set of literals. An RDF dataset is a set of triples (s, p, v) ∈
(U ∪ B) × U × (U ∪ B ∪ L).</p>
        <p>Given a dataset D, we denote by:
– I(D) = {s | ∃p, v (s, p, v) ∈ D} the set of individual identifiers,
– P (D) = {p | ∃s, v (s, p, v) ∈ D} the set of property identifiers,
– C(D) = {c | ∃s (s, rdf:type, c) ∈ D} the set of class identifiers. A triple
(s, rdf:type, c) means that the subject s is an instance of the class c.
– I(c) = {s | (s, rdf:type, c) ∈ D} the set of instances of c ∈ C(D),
– p(s) = {v | (s, p, v) ∈ D} is the set of values (or “RDF objects”) related to s
through p.</p>
        <p>An identity link is an RDF triple (a, owl:sameAs, b) stating that the IRIs a
and b refer to the same real-world entity. Fig. 1 represents two RDF datasets
D1 and D2, where P (D1) = {p1, p2, p3, p4} and P (D2) = {q1, q2, q3, q4}. Then
C(D1) = {c1} and C(D2) = {c2} with I(c1) = {a1, a2, a3, a4, a5} and I(c2) =
{b1, b2, b3, b4, b5}. For example, the set of values of b3 for the property q2 is
q2(b3) = {v8, v9}.</p>
        <p>c1
a1
a2
a3
a4
a5</p>
        <p>Definition 2 (Link key expression, link key candidate). Let D1 and D2
be two RDF datasets, k = (Eq, In, (c1, c2)) is a link key expression (over D1
and D2) iff In ⊆ P (D1) × P (D2), Eq ⊆ In, c1 ∈ C(D1) and c2 ∈ C(D2).</p>
        <p>The set of links L(k) (directly) generated by k is the set of pairs of instances
(a, b) ∈ I(c1) × I(c2) satisfying:
(i) for all (p, q) ∈ Eq, p(a) = q(b) and p(a) 6= ∅,
(ii) for all (p, q) ∈ In \ Eq, p(a) ∩ q(b) 6= ∅.</p>
        <p>A link key expression k1 = (Eq1, In1, (c1, c2)) is a link key candidate if:
(iii) L(k1) 6= ∅,
(iv) k1 is maximal i.e. there does not exist another link key expression k2 =
(Eq2, In2, (c1, c2)) such that Eq1 ⊂ Eq2, In1 ⊂ In2, and L(k1) = L(k2).</p>
        <p>The number of link key expressions may be exponential w.r.t. the number of
properties. Then link key discovery algorithms only consider link key candidates
which are link key expressions generating at least one link and that are maximal
w.r.t. the set of links they generate.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Link Key Discovery</title>
      <p>
        Here after we assume that all link key expressions are defined on the same pair
of datasets D1 and D2 w.r.t. one pair of classes, yielding link key expressions
of the form k = (Eq, In, (c1, c2)). In the following, we show how link keys may
be discovered within the formalism of pattern structures (see details in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]) and
then we discuss the notion of non-redundant link keys.
      </p>
      <p>
        Example 1. Let us consider the pattern structure (G, (E, u), δ) displayed in
Table 1. Here we skip the details for building this table and the related PS lattice
which can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The rows termed “PS objects” correspond to the set of objects G of the
pattern structure and include pairs of related instances. The set of descriptions
(E, u) includes all possible pairs of properties preceded either by ∀ or ∃. The
mapping δ relates a pair of instances (a, b) ∈ I(c1) × I(c2) to a description as
follows: (i) δ(a, b) includes ∀(p, q) whenever p(a) = q(b) and p(a) 6= ∅, (ii) δ(a, b)
includes ∃(p, q) whenever p(a) ∩ q(b) 6= ∅. Then the descriptions correspond to
link key expressions (Eq, In) w.r.t. the pairs of classes (c1, c2). It should be
noticed that it is possible to simultaneously work with several pairs of classes as
explained in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>We have that δ(a1, b1) = {∃(p1, q1), ∃(p2, q2)} because p1(a1) ∩ q1(b1) 6= ∅
and p2(a1) ∩ q2(b1) 6= ∅ while δ(a2, b1) = {∃(p1, q1)} because p1(a2) ∩ q1(b1) 6= ∅.
Then δ(a1, b1) u δ(a2, b1) = {∃(p1, q1)} and thus δ(a2, b1) v δ(a1, b1). This can
be read in the pattern concept lattice where the pattern concept pc5 is subsumed
by the pattern concept pc4, i.e., the extent of pc5 {(a1, b1), (a2, b2), (a3, b3)} is
included in the extent of pc4 {(a1, b1), (a2, b1), (a2, b2), (a3, b3)}, while the intent
{∃(p1, q1)} of pc4 is included in the intent of pc5, {∃(p1, q1), ∃(p2, q2)}.</p>
      <p>The set of all pattern concepts is organized within the pattern concept lattice
lkps-lattice displayed in Fig. 2. Moreover, all potential link key candidates are
lying in the intents of the pattern concepts in the lattice. The corresponding set
of link key candidates is denoted by lkc.</p>
      <p>PS objects (g)
(a1, b1)
(a1, b2)
(a2, b1)
(a2, b2)
(a3, b3)
(a4, b4)
(a4, b5)
(a5, b4)
(a5, b5)
descriptions (δ(g))
{∃(p1, q1), ∃(p2, q2)}
{∃(p2, q2)}
{∃(p1, q1)}
{∃(p1, q1), ∃(p2, q2)}
{∀(p1, q1), ∃(p1, q1), ∃(p2, q2)}
{∀(p3, q3), ∃(p3, q3)}
{∀(p4, q4), ∃(p4, q4)}
{∀(p4, q4), ∃(p4, q4)}
{∀(p3, q3), ∃(p3, q3)}</p>
      <p>Let us consider the so-called lkps-lattice and pc = (L(k), k) a pattern
concept, where the extent L(k) corresponds to the set of links generated by k,
and the intent k corresponds to a link key candidate. Let I denotes the set
of instances I = I(c1) ∪ I(c2) and the binary relation 'k⊆ I × I such as
(a, b) ∈ L(k) → a 'k b. The interpretation of a 'k b is: "k states that there
exists a owl:sameAs relation between a and b". Actually 'k is an equivalence
relation based on the fact that owl:sameAs itself is an equivalence relation. We say
that k induces the equivalence relation 'k over I. Moreover 'k forms a partition
over I where each element of this partition is an equivalence class. In fact the 'k
equivalence relation will help us to build more concise set of link key candidates
since it allows to identify non-redundant link key candidates termed nr-lkc. A
link key candidate k1 is a nr-lkc in lkc if there is no other candidate k2 in lkc
such that 'k1 and 'k2 form the same partition. Otherwise, k1 is redundant.</p>
      <p>In Fig. 2, it can be observed that 'k3 and 'k4 form the same partition,
namely {(a1, b1, a2, b2), (a3, b3)} (it should be noticed that singletons are omitted
for the sake of readability). Then the link key candidates k3 and k4 are redundant.
By contrast, k1 is a nr-lkc because there is no other candidate k in lkc such
that 'k1 and 'k form the same partition.</p>
      <p>Let us briefly explain how 'k3 and 'k4 are inducing the same partition,
namely {(a1, b1, a2, b2), (a3, b3)}. The extent of k3 in lkps-lattice is given by
{(a1, b1), (a1, b2), (a2, b2), (a3, b3)}. By transitivity and symmetry of owl:sameAs,
we have that (a1, b2) and (b2, a2) yields (a1, a2), then (a2, a1) and (a1, b1) yields
(a2, b1), and finally (b1, a2) and (a2, b2) yields (b1, b2) and the complete graph
between (a1, a2, b1, b2). The same thing applies when we consider k4 instead of
k3. This intuitively shows how 'k3 and 'k4 are inducing the same partition.</p>
      <p>
        One main straightforward application of identifying nr-lkc is the ability to
reduce the search space of link keys since the set of nr-lkc is included in lkc.
Indeed, this can be seen as a refinement where redundant link key candidates
inducing the same partition are merged. For example, since 'k3 and 'k4 form
the same partition, then, k3 and k4 can be merged into a nr-lkc k34 = {k3, k4}.
Among the perspectives is to consolidate the theory and practice of link key
discovery based on partition pattern structures initially introduced for mining
functional dependencies in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abbas</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>David</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Discovery of link keys in RDF data based on pattern structures: Preliminary steps</article-title>
          .
          <source>In: Proceedings of ICFCA. CEUR Workshop Proceedings</source>
          , vol.
          <volume>2668</volume>
          , pp.
          <fpage>235</fpage>
          -
          <lpage>246</lpage>
          . CEUR-WS.org (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Atencia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>David</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Data interlinking through robust linkkey extraction</article-title>
          .
          <source>In: Proceedings of ECAI</source>
          . pp.
          <fpage>15</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Atencia</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>David</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Euzenat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vizzini</surname>
          </string-name>
          , J.:
          <article-title>Link key candidate extraction with relational concept analysis</article-title>
          .
          <source>Discrete applied mathematics 273</source>
          ,
          <fpage>2</fpage>
          -
          <lpage>20</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Characterizing functional dependencies in formal concept analysis with pattern structures</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>72</volume>
          ,
          <fpage>129</fpage>
          -
          <lpage>149</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ferrara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scharffe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Data Linking for the Semantic Web</article-title>
          .
          <source>International Journal of Semantic Web and Information Systems</source>
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>46</fpage>
          -
          <lpage>76</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>Pattern Structures and Their Projections</article-title>
          .
          <source>In: Proceedings of the International Conference on Conceptual Structures (ICCS)</source>
          . pp.
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          . LNCS 2120, Springer (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis: Mathematical Foundations</source>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Nentwig</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hartung</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Ngonga</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.C.</given-names>
            ,
            <surname>Rahm</surname>
          </string-name>
          , E.:
          <article-title>A survey of current link discovery frameworks</article-title>
          .
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <issue>3</issue>
          ),
          <fpage>419</fpage>
          -
          <lpage>436</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.3233/SW-150210
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>