=Paper= {{Paper |id=Vol-1257/paper10 |storemode=property |title=What can FCA do for Database Linkkey Extraction? |pdfUrl=https://ceur-ws.org/Vol-1257/paper10.pdf |volume=Vol-1257 |dblpUrl=https://dblp.org/rec/conf/ecai/AtenciaDE14a }} ==What can FCA do for Database Linkkey Extraction?== https://ceur-ws.org/Vol-1257/paper10.pdf
    What can FCA do for database linkkey extraction?
                  (problem paper)

              Manuel Atencia1,2 , Jérôme David1,2 , and Jérôme Euzenat2,1
                              1
                                  Univ. Grenoble Alpes & 2 INRIA,
                                         Grenoble, France
                                        http://exmo.inria.fr



       Abstract. Links between heterogeneous data sets may be found by using a gen-
       eralisation of keys in databases, called linkkeys, which apply across data sets.
       This paper considers the question of characterising such keys in terms of formal
       concept analysis. This question is natural because the space of candidate keys is
       an ordered structure obtained by reduction of the space of keys and that of data set
       partitions. Classical techniques for generating functional dependencies in formal
       concept analysis indeed apply for finding candidate keys. They can be adapted in
       order to find database candidate linkkeys. The question of their extensibility to
       the RDF context would be worth investigating.


    We aim at finding correspondences between properties of two RDF datasets which
allows for identifying items denoting the same individuals. This is particularly useful
when dealing with linked data [8] for finding equality links between data sets.
    Because the RDF setting raises many additional problems, we restrict ourselves here
to databases. The problem is illustrated by the two (small) book relations of Table 1
(from [5], p.116). We would like to characterise a way to identify items on the same
line while not (wrongly) identifying any other pair of items.


                 bookstore relation                       library relation
    id firstname      title     lastname lang year author      orig      translator wid
    id      fn          tt          ln     lg   y    a           o            t      w
                                              1845 Poe        Raven Baudelaire a1
                                              1845 Poe        Raven       Mallarmé a2
     3      E.     Gold bug        Poe    en 1843 Poe       Gold Bug Baudelaire b
     4      T.     On murder Quincey en 1827 Quincey On murder Schwob c
     5      T.       Kant        Quincey en 1827 Quincey       Kant        Schwob d
     6      T.    Confessions Quincey en 1822 Quincey Confessions Musset             e
     7     J.-J.  Confessions Rousseau fr
     8      T.    Confessions Aquinus fr

Table 1. Two relations with, on the same lines, those tuples that represents the same individual
(the line after attribute names are abbreviations).


    For that purpose, we have defined linkkeys [5, 2] and we would like to formulate
the linkkey extraction problem in the framework of formal concept analysis [6].
    We first present this problem in the context of database candidate key extraction
where one looks for sets of attributes and the sets of equality statements that they gen-
erate. We formulate this problem as the computation of a concept lattice. Then we turn
to an adaptation of linkkeys to databases and show that the previous technique cannot
be used for extracting the expected linkkeys. Instead we propose an adaptation.


1   Candidate keys in databases

A relation D = hA, T i is a set of tuples T characterised by a set of attributes A. A key
is a subset of the attributes whose values identify a unique tuple.
Definition (key) Given a database relation D = hA, T i, a key is a subset of the at-
tributes K ⊆ A, such that ∀t, t0 ∈ T , (∀p ∈ K, p(t) = p(t0 )) ⇒ t ≈ t0 .
     Classically, keys are defined from functional dependencies. A set of attributes A is
functionally dependent from another K, if equality of the attributes of K determines
equality for the attributes of A. If the equality between tuples is the same thing as the
equality for all attribute values, then a key is simply those sets of attributes of which A
is functionally dependent.
     However, we have not used the equality between tuple (=) but a particular ≈ rela-
tion. The reason is that we do not want to find keys for the database with =, but with an
unknown relation ≈ which is to be discovered.
     The statements t ≈ t0 are those equality statements that are generated by the key.
The ≈ relation must contain = (t = t0 ⇒ t ≈ t0 ) and be an equivalence relation (this is
by definition if it is the smallest relation satisfying the key).
     From a key K of a relation hA, T i, it is easy to obtain these statements through
the function γ : 2A → 2T ×T such that γ(K) = {t ≈ t0 |∀p ∈ K, p(t) = p(t0 )}. γ is
anti-monotonic (∀K, K 0 ⊆ A, K ⊆ K 0 ⇒ γ(K) ⊇ γ(K 0 )).
     We define candidate key extraction as the task of finding the minimal sets of at-
tributes which generate a partition of the set of tuples.
Definition (candidate key) Given a database relation D, a candidate key is a key such
that none of its proper subsets generate the same partition. κ(D) is the set of candidate
keys.
     Those candidate keys which generate the singletons(T ) partition are called normal
candidate keys and their set noted κ̂(D) = {K ∈ κ(D)|∀(t ≈ t0 ) ∈ γ(K), t = t0 }.
     The problem of candidate key extraction is formulated in the following way:
Problem: Given a database relation D, find κ(D).
     This problem is usually not considered in databases. Either keys are given and used
for finding equivalent tuples and reducing the table, or the table is assumed without
redundancy and keys are extracted. In this latter case, the problem is the extraction of
normal candidate keys.
     Using lattices is common place for extracting functional dependencies [9, 4] and the
link to extract functional dependencies with formal concept analysis has already been
considered [6] and further refined [10, 3].
     In fact, this link can be fully exploited for extracting candidate keys instead of find-
ing functional dependencies.
   It consists of defining1 a formal context enc(hA, T i) = hP2 (T ), A, Ii such that:
∀p ∈ A, ∀ht, t0 i ∈ P2 (T ),

                                       ht, t0 iIp iff p(t) = p(t0 )

    The (formal) concepts of this encoding, that we denote by the set F CA(enc(D)),
associate a set of attributes to a set of pairs of tuples. These pairs of tuples are tuples that
cannot be distinguished by the values of the attributes, i.e., our ≈ assertions. The candi-
date keys are the minimal S elements of the intent which generate exactly the correspond-
ing partition2 . κ(D) = c∈F CA(enc(D)) µ⊆ {K ⊆ intent(c)|γ(K) = γ(intent(c))}.
    For any key K ∈ κ(D), γ(c) is the reflexive, transitive and symmetric closure of
the extent of its concept.
    If this method is applied to the data sources of Table 1, the result is displayed in
Figure 1.


                                  T ×T                                               T0×T0
                                    ∅                                                  ∅


 3 ≈ 4 ≈ 5 ≈ 6, 7 ≈ 8 4 ≈ 5 ≈ 6 ≈ 8              6≈7≈8                      a1 ≈ a2 ≈ b, c ≈ d ≈ e
        {lg}              {f n}                   {tt}                                 {a}


         4≈5≈6                     7≈8             6≈8                a1 ≈ a2 , c ≈ d      a1 ≈ b, c ≈ d
        {lg, ln, f n}             {lg, tt}        {tt, f n}                 {a, y}               {a, t}


                                                                a1 ≈ a2               c≈d
                         A = {id, f n, tt, ln, lg}              {y, a, o}            {y, a, t}



                                                                   A0 = {w, y, a, o, t}



 Fig. 1. Example of the key concept lattice for the bookstore (left) and library (right) relations.


    As presented in Table 2, this may generate several candidate keys for the same
concept ({o, t} and {w} for the maximal partition in the library dataset; in the bookstore
dataset, the concept of extent 4 ≈ 5 ≈ 6 has two candidate keys {ln} and {lg, f n} and
the maximal partition has three candidate keys {id}, {tt, ln} and {tt, f n, lg}).
    This answers positively to our first question: it is possible to extract keys, i.e., gen-
erating κ(D) from data with some help from formal concept analysis.
 1
     For an arbitraty total strict order < on T , P2 (T ) = {ht, t0 i ∈ T 2 | t < t0 }.
 2
     µR E = {X ∈ E|∀X 0 ∈ E, ¬(X 0 RX))}.
       intent                potential keys                candidate keys                    extent
{id, f n, tt, ln, lg}             id. . . ,          {id}, {tt, ln}, {tt, f n, lg}             ∅
   {f n, ln, lg}      fnlnlg, fnlg, fnln, lgln, ln         {ln}, {f n, lg}               {4 ≈ 5 ≈ 6}
      {tt, lg}                      ttlg                       {tt, lg}                    {7 ≈ 8}
     {f n, tt}                      fntt                      {f n, tt}                    {6 ≈ 8}
        {lg}                         lg                          {lg}               {3 ≈ 4 ≈ 5 ≈ 6, 7 ≈ 8}
       {f n}                         fn                          {f n}                 {4 ≈ 5 ≈ 6 ≈ 8}
        {tt}                         tt                          {tt}                    {6 ≈ 7 ≈ 8}
         ∅                           ∅                             ∅                        T ×T
  {w, y, a, t, o} w, . . . ot, yot, aot, yaot, wyaot         {o, t}, {w}                       ∅
     {y, a, o}             o, ao, at, yo, yao                     {o}                     {a1 ≈ a2 }
     {y, a, t}                   yt, yat                        {y, t}                      {c ≈ d}
       {y, a}                      y, ya                          {y}                  {a1 ≈ a2 , c ≈ d}
       {a, t}                         t                           {t}                   {a1 ≈ b, c ≈ d}
        {a}                           a                           {a}              {a1 ≈ a2 ≈ b, c ≈ d ≈ e}
         ∅                           ∅                             ∅                       T0×T0

Table 2. The list of concepts extracted from the bookstore (top) and library (bottom) relations.
All intent should be completed by their subsets.




2     Database linkkey extraction

Consider that, instead of one relation, we are faced with two relations from two different
databases which may contain tuples corresponding to the same individual.
    We assume that candidate attribute pairs are already available through an alignment
A which expresses equivalences between attributes of both relations. In this example,
A = {hlastname, authori, htitle, origi, hid, widi}. Our goal is to find those which will
identify the same individuals (tuples) in both databases.


2.1   Linkkeys for databases

Linkkeys [5] have been introduced for generating equality, a.k.a. sameAs, links between
RDF datasets. We present a simplified notion of linkkey which is defined over relations.
Definition (Linkkey) Given two relations D = hA, T i and D0 = hA0 , T 0 i and an align-
ment A ⊆ A×A0 . LK ⊆ A is a linkkey between D and D0 iff ∀t, t0 ∈ T , T 0 ; (∀hp, p0 i ∈
LK, p(t) = p(t0 )) ⇒ t ≈ t0 . The set of linkkeys between D and D0 with respect to A
is denoted κA (D, D0 ).
    This definition may be rendered independent from A by assuming A = A × A0 , so
any attribute of one relation may be matched to any other.


2.2   Strong linkkey extraction

One way to deal with this problem is to start with keys: either candidate keys or normal
candidate keys. For that purpose, we define κ(D)/A as the operation which replaces,
in all candidate keys of D, each occurrence of an attribute in a correspondence of A by
this correspondence3 .
     A first kind of linkkeys that may be extracted are those which are normal can-
didate keys in their respective relations. They are called strong linkkeys and may be
obtained by selecting normal candidate keys that contain only attributes mentioned in
the alignment (replacing the attribute by the correspondence) and to intersect them, i.e.,
κ̂(D)/A∩κ̂(D0 i)/A. Strong linkkeys have the advantage of identifying tuples matching
across relations without generating any links within the initial relations.
     In the example of Table 1, there is one such strong linkkey: {hid, widi}. Indeed, the
normal candidate keys for the bookstore relation are {id}, {title, lastname}, or {title, firstname, lang}
and, for the library relation they are {wid} and {orig, translator}. Since, translator has
no equivalent in the bookstore relation (through A), only {hid, widi} can be used. Un-
fortunately, it does not identify any equality statement as this happens very often with
databases surrogates (this may have been worse if both relations used integers as iden-
tifiers: identifying false positives).
     This scheme may be relaxed by trying to extract linkkeys from all candidate keys.
In this way one would simply use κ(D)/A ∩ κ(D0 )/A. In our example, this does not
bring further linkkeys.


2.3     Candidate linkkey extraction

The technique proposed above, does indeed generate linkkeys, but does not generate all
of them: linkkeys may rely on sets of attributes which are not candidate keys. Indeed,
one interesting linkkey for the relations above is {hlastname, authori, htitle, origi}.
    Surprisingly, it does not use a normal candidate key of the library relation and not
even a candidate key of the bookstore relation as {author, orig} generates the same links
as {orig} in this relation. However, when applied to the elements of T × T 0 this linkkey
generates non ambiguous links, i.e., links which do not entail new links within a relation
(this would have been different if a tuple hyear = 1822, author = Quincey, orig =
Confessions, translator = Baudelairei were present in the library relation).
    Such linkkeys may be found by the same type of technique as before. It consists
of defining a formal context enc(hA, T i, hA0 , T 0 i, A) = hT × T 0 , A, Ii such that:
∀hp, p0 i ∈ A, ∀ht, t0 i ∈ T × T 0 ,

                                  ht, t0 iIhp, p0 i iff p(t) = p0 (t0 )

      γ is redefined to deal with subsets of alignments and generate ≈ assertions on T ∪
T 0 . But, in order for ≈ to remain an equivalence relation it will be necessary to close ≈
on T ∪ T 0 and not only on T × T 0 . Indeed if two tuples of T are found equal to a tuple
of T 0 , then by transitivity, they should be equal as well.
      Again, candidate linkkeys are the minimal elements   S of the intent which generate
exactly the corresponding set of links. κA (D, D0 ) = c∈F CA(enc(D,D0 ,A)) µ⊆ {K ⊆
intent(c)|γ(K) = γ(intent(c))}.
 3
     This assumes that the alignment is one-to-one. This assumption is necessary for this subsection
     of the paper only.
                                                   T ×T0
                                                     ∅


                3 ≈ a1,    3 ≈ a2,      3 ≈ b,
                4 ≈ c,     4 ≈ d,       4 ≈ e,            3 ≈ b,     4 ≈ c,     5 ≈ d,
                5 ≈ c,     5 ≈ d,       5 ≈ e,            6 ≈ e,     7 ≈ e,     8≈e
                6 ≈ c,     6 ≈ d,       6≈e                         {htt, oi}
                          {hf n, ai}



                                                 3 ≈ b, 4 ≈ c,
                                                 5 ≈ d, 6 ≈ e
                                              {hf n, ai, htt, oi}




                                       A = {hf n, ai, htt, oi, hid, wi}



                   Fig. 2. Linkkey concept lattice for the relations of Table 1.


    This technique, applied to the example of Table 1, generates the lattice of Figure 2. It
can be argued that the candidate linkkeys {hf n, ai, htt, oi} and {hid, wi} are better than
the others because they do not generate other statements within the relations. Indeed,
{htt, oi} generates 6 ≈ 7 ≈ 8, and {hf n, ai} generates a1 ≈ a2 ≈ b, c ≈ d ≈ e and
4 ≈ 5 ≈ 6.


3     Conlusion and further work
We introduced, in the context of the relational model, the notions of candidate keys and
linkkeys and we discussed potential links with formal concept analysis. These are only a
few elements of a wider program. Problems were expressed in the relational framework
because they are simpler. Our ambition is to provide an integrated way to generate
links across RDF data sets using keys and it may be worth investigating if the proposed
formal concept analysis framework can be extended to full RDF data interlinking.
    Plunging this in the context of RDF requires further developments:
    – considering that values do not have to be syntactically equal but may be found
      equal with respect to some theory: this may be a simple set of equality statement
      (“étudiant”=“student”) or may depend on RDF Schemas or OWL ontologies;
    – considering several tables depending on each others together (this is related to Re-
      lational Concept Analysis [7] and could use the notion of foreign keys);
 – considering that RDF attributes are not functional and hence yield a more general
   type of keys [1].
    Once this is integrated within a common theoretical framework, a full solution re-
quires work before and after running formal concept analysis:
 – Before, it is necessary to use ontology/database matching [5] and to proceed to
   value normalisation.
 – After, it is necessary to select among these potential or candidate keys those which
   are the more accurate [2].


Acknowledgements
This work has been partially supported by the ANR projects Qualinca (12-CORD-0012
for Manuel Atencia), and Lindicle (12-IS02-0002 for all three authors), and by grant
TIN2011-28084 (for Manuel Atencia and Jérôme David) of the Ministry of Science and
Innovation of Spain, co-funded by the European Regional Development Fund (ERDF).
    Thanks to the anonymous reviewers for helping clarifying the paper.


References
 1. Manuel Atencia, Michel Chein, Madalina Croitoru, Michel Chein, Jérôme David, Michel
    Leclère, Nathalie Pernelle, Fatiha Saïs, François Scharffe, and Danai Symeonidou. Defining
    key semantics for the RDF datasets: experiments and evaluations. In Proc. 21st International
    Conference on Conceptual Structures (ICCS), Iasi (RO), pages 65–78, 2014.
 2. Manuel Atencia, Jérôme David, and Jérôme Euzenat. Data interlinking through robust
    linkkey extraction. In Proc. 21st european conference on artificial intelligence (ECAI),
    Praha (CZ), 2014.
 3. Jaume Baixeries, Mehdi Kaytoue, and Amedeo Napoli. Characterizing functional dependen-
    cies in formal concept analysis with pattern structures. Annals of mathematics and artificial
    intelligence, 2014. To appear.
 4. János Demetrovics, Leonid Libkin, and Ilya Muchnik. Functional dependencies in relational
    databases: A lattice point of view. Discrete Applied Mathematics, 40(2):155–185, 1992.
 5. Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer, Heidelberg (DE), 2nd
    edition, 2013.
 6. Bernhard Ganter and Rudolf Wille. Formal concept analysis: mathematical foundations.
    Springer, Berlin, New York, Paris, 1999.
 7. Mohamed Rouane Hacene, Marianne Huchard, Amedeo Napoli, and Petko Valtchev. Rela-
    tional concept analysis: mining concept lattices from multi-relational data. Annals of Math-
    ematics and Artificial Intelligence, 67(1):81–108, 2013.
 8. Tom Heath and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space.
    Morgan & Claypool, 2011.
 9. Mark Levene. A lattice view of functional dependencies in incomplete relations. Acta cy-
    bernetica, 12(2):181–207, 1995.
10. Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. Functional and approximate dependency
    mining: database and FCA points of view. Journal of Experimental & Theoretical Artificial
    Intelligence, 14(2-3):93–114, 2002.