<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What can FCA do for database linkkey extraction? (problem paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Atencia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jérôme David</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jérôme Euzenat</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>INRIA</institution>
          ,
          <addr-line>Grenoble</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Univ. Grenoble Alpes &amp;</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Links between heterogeneous data sets may be found by using a generalisation of keys in databases, called linkkeys, which apply across data sets. This paper considers the question of characterising such keys in terms of formal concept analysis. This question is natural because the space of candidate keys is an ordered structure obtained by reduction of the space of keys and that of data set partitions. Classical techniques for generating functional dependencies in formal concept analysis indeed apply for finding candidate keys. They can be adapted in order to find database candidate linkkeys. The question of their extensibility to the RDF context would be worth investigating.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        For that purpose, we have defined linkkeys [
        <xref ref-type="bibr" rid="ref2 ref5">5, 2</xref>
        ] and we would like to formulate
the linkkey extraction problem in the framework of formal concept analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>We first present this problem in the context of database candidate key extraction
where one looks for sets of attributes and the sets of equality statements that they
generate. We formulate this problem as the computation of a concept lattice. Then we turn
to an adaptation of linkkeys to databases and show that the previous technique cannot
be used for extracting the expected linkkeys. Instead we propose an adaptation.
1</p>
    </sec>
    <sec id="sec-2">
      <title>Candidate keys in databases</title>
      <p>A relation D = hA; T i is a set of tuples T characterised by a set of attributes A. A key
is a subset of the attributes whose values identify a unique tuple.</p>
      <p>Definition (key) Given a database relation D = hA; T i, a key is a subset of the
attributes K A, such that 8t; t0 2 T ; (8p 2 K; p(t) = p(t0)) ) t t0.</p>
      <p>Classically, keys are defined from functional dependencies. A set of attributes A is
functionally dependent from another K, if equality of the attributes of K determines
equality for the attributes of A. If the equality between tuples is the same thing as the
equality for all attribute values, then a key is simply those sets of attributes of which A
is functionally dependent.</p>
      <p>However, we have not used the equality between tuple (=) but a particular
relation. The reason is that we do not want to find keys for the database with =, but with an
unknown relation which is to be discovered.</p>
      <p>The statements t t0 are those equality statements that are generated by the key.
The relation must contain = (t = t0 ) t t0) and be an equivalence relation (this is
by definition if it is the smallest relation satisfying the key).</p>
      <p>From a key K of a relation hA; T i, it is easy to obtain these statements through
the function : 2A ! 2T T such that (K) = ft t0j8p 2 K; p(t) = p(t0)g. is
anti-monotonic (8K; K0 A; K K0 ) (K) (K0)).</p>
      <p>We define candidate key extraction as the task of finding the minimal sets of
attributes which generate a partition of the set of tuples.</p>
      <p>Definition (candidate key) Given a database relation D, a candidate key is a key such
that none of its proper subsets generate the same partition. (D) is the set of candidate
keys.</p>
      <p>Those candidate keys which generate the singletons(T ) partition are called normal
candidate keys and their set noted ^(D) = fK 2 (D)j8(t t0) 2 (K); t = t0g.</p>
      <p>The problem of candidate key extraction is formulated in the following way:
Problem: Given a database relation D, find (D).</p>
      <p>This problem is usually not considered in databases. Either keys are given and used
for finding equivalent tuples and reducing the table, or the table is assumed without
redundancy and keys are extracted. In this latter case, the problem is the extraction of
normal candidate keys.</p>
      <p>
        Using lattices is common place for extracting functional dependencies [
        <xref ref-type="bibr" rid="ref4 ref9">9, 4</xref>
        ] and the
link to extract functional dependencies with formal concept analysis has already been
considered [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and further refined [
        <xref ref-type="bibr" rid="ref10 ref3">10, 3</xref>
        ].
      </p>
      <p>In fact, this link can be fully exploited for extracting candidate keys instead of
finding functional dependencies.</p>
      <p>It consists of defining1 a formal context enc(hA; T i) = hP2(T ); A; Ii such that:
8p 2 A; 8ht; t0i 2 P2(T ),</p>
      <p>ht; t0iIp iff p(t) = p(t0)
fag
d a1
c</p>
      <p>d
fy; a; tg
A0 = fw; y; a; o; tg
d</p>
      <p>e
b; c
fa; tg
d</p>
      <p>As presented in Table 2, this may generate several candidate keys for the same
concept (fo; tg and fwg for the maximal partition in the library dataset; in the bookstore
dataset, the concept of extent 4 5 6 has two candidate keys flng and flg; f ng and
the maximal partition has three candidate keys fidg, ftt; lng and ftt; f n; lgg).</p>
      <p>This answers positively to our first question: it is possible to extract keys, i.e.,
generating (D) from data with some help from formal concept analysis.
1 For an arbitraty total strict order &lt; on T , P2(T ) = fht; t0i 2 T 2 j t &lt; t0g.
2 RE = fX 2 Ej8X0 2 E; :(X0RX))g.</p>
      <p>intent
fid; f n; tt; ln; lgg
ff n; ln; lgg
ftt; lgg
ff n; ttg
flgg
ff ng
fttg
?
fw; y; a; t; og
fy; a; og
fy; a; tg
fy; ag
fa; tg
fag
?
potential keys</p>
      <p>id. . . ,
fnlnlg, fnlg, fnln, lgln, ln
ttlg
fntt
lg
fn
tt
?
w, . . . ot, yot, aot, yaot, wyaot
o, ao, at, yo, yao
yt, yat
y, ya
t
a
?</p>
      <p>candidate keys
fidg, ftt; lng, ftt; f n; lgg
flng, ff n; lgg
ftt; lgg
ff n; ttg
flgg
ff ng
fttg
?
fo; tg; fwg
fog
fy; tg
fyg
ftg
fag
?
f3
fa1
extent
?
5
f4
8g
eg
2</p>
    </sec>
    <sec id="sec-3">
      <title>Database linkkey extraction</title>
      <p>Consider that, instead of one relation, we are faced with two relations from two different
databases which may contain tuples corresponding to the same individual.</p>
      <p>We assume that candidate attribute pairs are already available through an alignment
A which expresses equivalences between attributes of both relations. In this example,
A = fhlastname; authori; htitle; origi; hid; widig. Our goal is to find those which will
identify the same individuals (tuples) in both databases.
2.1</p>
      <sec id="sec-3-1">
        <title>Linkkeys for databases</title>
        <p>
          Linkkeys [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have been introduced for generating equality, a.k.a. sameAs, links between
RDF datasets. We present a simplified notion of linkkey which is defined over relations.
Definition (Linkkey) Given two relations D = hA; T i and D0 = hA0; T 0i and an
alignment A A A0. LK A is a linkkey between D and D0 iff 8t; t0 2 T ; T 0; (8hp; p0i 2
LK; p(t) = p(t0)) ) t t0. The set of linkkeys between D and D0 with respect to A
is denoted A(D; D0).
        </p>
        <p>This definition may be rendered independent from A by assuming A = A A0, so
any attribute of one relation may be matched to any other.
2.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Strong linkkey extraction</title>
        <p>One way to deal with this problem is to start with keys: either candidate keys or normal
candidate keys. For that purpose, we define (D)=A as the operation which replaces,
in all candidate keys of D, each occurrence of an attribute in a correspondence of A by
this correspondence3.</p>
        <p>A first kind of linkkeys that may be extracted are those which are normal
candidate keys in their respective relations. They are called strong linkkeys and may be
obtained by selecting normal candidate keys that contain only attributes mentioned in
the alignment (replacing the attribute by the correspondence) and to intersect them, i.e.,
^(D)=A\ ^(D0i)=A. Strong linkkeys have the advantage of identifying tuples matching
across relations without generating any links within the initial relations.</p>
        <p>In the example of Table 1, there is one such strong linkkey: fhid; widig. Indeed, the
normal candidate keys for the bookstore relation are fidg, ftitle; lastnameg, or ftitle; firstname; langg
and, for the library relation they are fwidg and forig; translatorg. Since, translator has
no equivalent in the bookstore relation (through A), only fhid; widig can be used.
Unfortunately, it does not identify any equality statement as this happens very often with
databases surrogates (this may have been worse if both relations used integers as
identifiers: identifying false positives).</p>
        <p>This scheme may be relaxed by trying to extract linkkeys from all candidate keys.
In this way one would simply use (D)=A \ (D0)=A. In our example, this does not
bring further linkkeys.
2.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Candidate linkkey extraction</title>
        <p>The technique proposed above, does indeed generate linkkeys, but does not generate all
of them: linkkeys may rely on sets of attributes which are not candidate keys. Indeed,
one interesting linkkey for the relations above is fhlastname; authori; htitle; origig.</p>
        <p>Surprisingly, it does not use a normal candidate key of the library relation and not
even a candidate key of the bookstore relation as fauthor; origg generates the same links
as forigg in this relation. However, when applied to the elements of T T 0 this linkkey
generates non ambiguous links, i.e., links which do not entail new links within a relation
(this would have been different if a tuple hyear = 1822; author = Quincey; orig =
Confessions; translator = Baudelairei were present in the library relation).</p>
        <p>Such linkkeys may be found by the same type of technique as before. It consists
of defining a formal context enc(hA; T i; hA0; T 0i; A) = hT T 0; A; I i such that:
8hp; p0i 2 A; 8ht; t0i 2 T T 0,</p>
        <p>ht; t0iI hp; p0i iff p(t) = p0(t0)
is redefined to deal with subsets of alignments and generate assertions on T [
T 0. But, in order for to remain an equivalence relation it will be necessary to close
on T [ T 0 and not only on T T 0. Indeed if two tuples of T are found equal to a tuple
of T 0, then by transitivity, they should be equal as well.</p>
        <p>Again, candidate linkkeys are the minimal elements of the intent which generate
exactly the corresponding set of links. A(D; D0) = Sc2F CA(enc(D;D0;A)) fK
intent(c)j (K) = (intent(c))g.
3 This assumes that the alignment is one-to-one. This assumption is necessary for this subsection
of the paper only.</p>
        <p>3
4
5
6
fhtt; oig
3
5
b; 4
d; 6
c;
e
fhf n; ai; htt; oig</p>
        <p>A = fhf n; ai; htt; oi; hid; wig</p>
        <p>This technique, applied to the example of Table 1, generates the lattice of Figure 2. It
can be argued that the candidate linkkeys fhf n; ai; htt; oig and fhid; wig are better than
the others because they do not generate other statements within the relations. Indeed,
fhtt; oig generates 6 7 8, and fhf n; aig generates a1 a2 b, c d e and
4 5 6.
3</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conlusion and further work</title>
      <p>We introduced, in the context of the relational model, the notions of candidate keys and
linkkeys and we discussed potential links with formal concept analysis. These are only a
few elements of a wider program. Problems were expressed in the relational framework
because they are simpler. Our ambition is to provide an integrated way to generate
links across RDF data sets using keys and it may be worth investigating if the proposed
formal concept analysis framework can be extended to full RDF data interlinking.</p>
      <p>
        Plunging this in the context of RDF requires further developments:
– considering that values do not have to be syntactically equal but may be found
equal with respect to some theory: this may be a simple set of equality statement
(“étudiant”=“student”) or may depend on RDF Schemas or OWL ontologies;
– considering several tables depending on each others together (this is related to
Relational Concept Analysis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and could use the notion of foreign keys);
– considering that RDF attributes are not functional and hence yield a more general
type of keys [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Once this is integrated within a common theoretical framework, a full solution
requires work before and after running formal concept analysis:
– Before, it is necessary to use ontology/database matching [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and to proceed to
value normalisation.
– After, it is necessary to select among these potential or candidate keys those which
are the more accurate [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the ANR projects Qualinca (12-CORD-0012
for Manuel Atencia), and Lindicle (12-IS02-0002 for all three authors), and by grant
TIN2011-28084 (for Manuel Atencia and Jérôme David) of the Ministry of Science and
Innovation of Spain, co-funded by the European Regional Development Fund (ERDF).</p>
      <p>Thanks to the anonymous reviewers for helping clarifying the paper.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Atencia</surname>
          </string-name>
          , Michel Chein, Madalina Croitoru, Michel Chein, Jérôme David, Michel Leclère, Nathalie Pernelle, Fatiha Saïs, François Scharffe, and
          <string-name>
            <given-names>Danai</given-names>
            <surname>Symeonidou</surname>
          </string-name>
          .
          <article-title>Defining key semantics for the RDF datasets: experiments and evaluations</article-title>
          .
          <source>In Proc. 21st International Conference on Conceptual Structures (ICCS)</source>
          ,
          <source>Iasi (RO)</source>
          , pages
          <fpage>65</fpage>
          -
          <lpage>78</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Manuel</given-names>
            <surname>Atencia</surname>
          </string-name>
          , Jérôme David, and
          <string-name>
            <given-names>Jérôme</given-names>
            <surname>Euzenat</surname>
          </string-name>
          .
          <article-title>Data interlinking through robust linkkey extraction</article-title>
          .
          <source>In Proc. 21st european conference on artificial intelligence (ECAI)</source>
          ,
          <source>Praha (CZ)</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Jaume</given-names>
            <surname>Baixeries</surname>
          </string-name>
          , Mehdi Kaytoue, and
          <string-name>
            <given-names>Amedeo</given-names>
            <surname>Napoli</surname>
          </string-name>
          .
          <article-title>Characterizing functional dependencies in formal concept analysis with pattern structures</article-title>
          .
          <source>Annals of mathematics and artificial intelligence</source>
          ,
          <year>2014</year>
          . To appear.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>János</given-names>
            <surname>Demetrovics</surname>
          </string-name>
          , Leonid Libkin, and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Muchnik</surname>
          </string-name>
          .
          <article-title>Functional dependencies in relational databases: A lattice point of view</article-title>
          .
          <source>Discrete Applied Mathematics</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>155</fpage>
          -
          <lpage>185</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Jérôme</given-names>
            <surname>Euzenat</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Shvaiko</surname>
          </string-name>
          . Ontology matching. Springer, Heidelberg (DE),
          <source>2nd edition</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Bernhard</given-names>
            <surname>Ganter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Rudolf</given-names>
            <surname>Wille</surname>
          </string-name>
          .
          <source>Formal concept analysis: mathematical foundations</source>
          . Springer, Berlin, New York, Paris,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Mohamed</given-names>
            <surname>Rouane</surname>
          </string-name>
          <string-name>
            <surname>Hacene</surname>
          </string-name>
          , Marianne Huchard, Amedeo Napoli, and
          <string-name>
            <given-names>Petko</given-names>
            <surname>Valtchev</surname>
          </string-name>
          .
          <article-title>Relational concept analysis: mining concept lattices from multi-relational data</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          ,
          <volume>67</volume>
          (
          <issue>1</issue>
          ):
          <fpage>81</fpage>
          -
          <lpage>108</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Tom</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked Data: Evolving the Web into a Global Data Space</article-title>
          . Morgan &amp; Claypool,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Mark</given-names>
            <surname>Levene</surname>
          </string-name>
          .
          <article-title>A lattice view of functional dependencies in incomplete relations</article-title>
          .
          <source>Acta cybernetica</source>
          ,
          <volume>12</volume>
          (
          <issue>2</issue>
          ):
          <fpage>181</fpage>
          -
          <lpage>207</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stéphane</surname>
            <given-names>Lopes</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jean-Marc Petit</surname>
            , and
            <given-names>Lotfi</given-names>
          </string-name>
          <string-name>
            <surname>Lakhal</surname>
          </string-name>
          .
          <article-title>Functional and approximate dependency mining: database and FCA points of view</article-title>
          .
          <source>Journal of Experimental &amp; Theoretical Artificial Intelligence</source>
          ,
          <volume>14</volume>
          (
          <issue>2-3</issue>
          ):
          <fpage>93</fpage>
          -
          <lpage>114</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>