
What can FCA do for database linkkey extraction? (problem paper)

Manuel Atencia (0, 1), Jérôme David (0, 1), Jérôme Euzenat (0, 1)

0: INRIA, Grenoble, France
1: Univ. Grenoble Alpes &

Links between heterogeneous data sets may be found by using a generalisation of keys in databases, called linkkeys, which apply across data sets. This paper considers the question of characterising such keys in terms of formal concept analysis. This question is natural because the space of candidate keys is an ordered structure obtained by reduction of the space of keys and that of data set partitions. Classical techniques for generating functional dependencies in formal concept analysis indeed apply for finding candidate keys. They can be adapted in order to find database candidate linkkeys. The question of their extensibility to the RDF context would be worth investigating.

For that purpose, we have defined linkkeys [5, 2] and we would like to formulate the linkkey extraction problem in the framework of formal concept analysis [6].

We first present this problem in the context of database candidate key extraction where one looks for sets of attributes and the sets of equality statements that they generate. We formulate this problem as the computation of a concept lattice. Then we turn to an adaptation of linkkeys to databases and show that the previous technique cannot be used for extracting the expected linkkeys. Instead we propose an adaptation.

1 Candidate keys in databases

A relation $D = \langle A, T \rangle$ is a set of tuples $T$ characterised by a set of attributes $A$. A key is a subset of the attributes whose values identify a unique tuple.

Definition (key) Given a database relation $D = \langle A, T \rangle$, a key is a subset of the attributes $K \subseteq A$, such that $\forall t, t' \in T, (\forall p \in K, p(t) = p(t')) \Rightarrow t \approx t'$.

Classically, keys are defined from functional dependencies. A set of attributes $A$ is functionally dependent on another set $K$ if equality of the attributes of $K$ determines equality of the attributes of $A$. If equality between tuples is the same thing as equality of all attribute values, then keys are simply those sets of attributes on which $A$ is functionally dependent.

However, we have not used the equality between tuples ($=$) but a particular relation $\approx$. The reason is that we do not want to find keys for the database with $=$, but with an unknown relation which is to be discovered.

The statements $t \approx t'$ are those equality statements that are generated by the key. The relation $\approx$ must contain $=$ ($t = t' \Rightarrow t \approx t'$) and be an equivalence relation (this is by definition if it is the smallest relation satisfying the key).

From a key $K$ of a relation $\langle A, T \rangle$, it is easy to obtain these statements through the function $\gamma : 2^A \to 2^{T \times T}$ such that $\gamma(K) = \{t \approx t' \mid \forall p \in K, p(t) = p(t')\}$. $\gamma$ is anti-monotonic ($\forall K, K' \subseteq A, K \subseteq K' \Rightarrow \gamma(K) \supseteq \gamma(K')$).
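The function that maps a set of attributes to the equality statements it generates can be sketched in a few lines. This is only an illustration: the relation `TUPLES` and the function name `gamma` are assumptions made for the example, not data or code from the paper. Statements are stored as pairs (t, t') with t < t', the reflexive and symmetric ones being implied.

```python
from itertools import combinations

# Hypothetical toy relation (NOT the paper's Table 1): tuple id -> attribute values.
TUPLES = {
    1: {"title": "Faust", "lang": "de"},
    2: {"title": "Faust", "lang": "fr"},
    3: {"title": "Emma", "lang": "fr"},
}


def gamma(K, tuples):
    """Return the pairs (t, t') with t < t' that agree on every attribute of K."""
    return {
        (t, u)
        for t, u in combinations(sorted(tuples), 2)
        if all(tuples[t][p] == tuples[u][p] for p in K)
    }


small = gamma({"title"}, TUPLES)          # statements generated by {title}
large = gamma({"title", "lang"}, TUPLES)  # statements generated by a superset
print(small, large)
assert large <= small  # anti-monotonicity: more attributes, fewer statements
```

With the empty set of attributes every pair of tuples is generated, which is the coarsest partition.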

We define candidate key extraction as the task of finding the minimal sets of attributes which generate a partition of the set of tuples.

Definition (candidate key) Given a database relation $D$, a candidate key is a key such that none of its proper subsets generates the same partition. $\kappa(D)$ is the set of candidate keys.

Those candidate keys which generate the singletons($T$) partition are called normal candidate keys and their set is noted $\hat{\kappa}(D) = \{K \in \kappa(D) \mid \forall (t \approx t') \in \gamma(K), t = t'\}$.

The problem of candidate key extraction is formulated in the following way: Problem: Given a database relation $D$, find $\kappa(D)$.
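The problem statement can be made concrete with a brute-force sketch that enumerates all attribute subsets. It is exponential in the number of attributes and is not the lattice-based technique discussed in this paper; the relation `TUPLES` and all function names are assumptions made for the illustration. Because the generated statements shrink when attributes are added, it is enough to compare a set with the subsets obtained by removing a single attribute.

```python
from itertools import combinations

# Hypothetical toy relation (NOT the paper's Table 1).
TUPLES = {
    1: {"id": "b1", "title": "Faust", "lang": "de"},
    2: {"id": "b2", "title": "Faust", "lang": "fr"},
    3: {"id": "b3", "title": "Emma", "lang": "fr"},
}
ATTRS = ("id", "title", "lang")


def gamma(K, tuples):
    """Pairs (t, t') with t < t' that agree on every attribute of K."""
    return frozenset(
        (t, u)
        for t, u in combinations(sorted(tuples), 2)
        if all(tuples[t][p] == tuples[u][p] for p in K)
    )


def candidate_keys(attrs, tuples):
    """Attribute sets none of whose proper subsets generate the same statements."""
    result = []
    for size in range(len(attrs) + 1):
        for K in combinations(attrs, size):
            generated = gamma(K, tuples)
            # Removing any single attribute must change the generated statements.
            if all(gamma(set(K) - {p}, tuples) != generated for p in K):
                result.append(frozenset(K))
    return result


def normal_candidate_keys(attrs, tuples):
    """Candidate keys that generate no statement between distinct tuples."""
    return [K for K in candidate_keys(attrs, tuples) if not gamma(K, tuples)]


print(candidate_keys(ATTRS, TUPLES))
print(normal_candidate_keys(ATTRS, TUPLES))
```

On this toy relation the normal candidate keys are the set containing only `id` and the set containing `title` and `lang`, while `title` alone and `lang` alone are candidate keys for coarser partitions.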

This problem is usually not considered in databases. Either keys are given and used for finding equivalent tuples and reducing the table, or the table is assumed without redundancy and keys are extracted. In this latter case, the problem is the extraction of normal candidate keys.

Using lattices is commonplace for extracting functional dependencies [9, 4], and the use of formal concept analysis for extracting functional dependencies has already been considered [6] and further refined [10, 3].

In fact, this link can be fully exploited for extracting candidate keys instead of finding functional dependencies.

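It consists of defining (see footnote 1) a formal context $enc(\langle A, T \rangle) = \langle P_2(T), A, I \rangle$ such that: $\forall p \in A, \forall \langle t, t' \rangle \in P_2(T)$,

$$\langle t, t' \rangle \, I \, p \text{ iff } p(t) = p(t')$$

[Figure: concept lattice for the example relations. Only fragments of the node labels are recoverable: $\{a\}$, $\{a, t\}$, $\{y, a, t\}$ and $A' = \{w, y, a, o, t\}$.]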
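A minimal sketch of this encoding is given here: objects are pairs of tuples, attributes are the relation's attributes, and a pair is related to an attribute when both tuples agree on it. The concepts are computed naively by closing every attribute subset, which is only workable for tiny examples. The data and all names (`encode`, `extent`, `intent`, `concepts`) are assumptions made for the illustration.

```python
from itertools import combinations

# Hypothetical toy relation (NOT the paper's Table 1).
TUPLES = {
    1: {"id": "b1", "title": "Faust", "lang": "de"},
    2: {"id": "b2", "title": "Faust", "lang": "fr"},
    3: {"id": "b3", "title": "Emma", "lang": "fr"},
}
ATTRS = ("id", "title", "lang")


def encode(attrs, tuples):
    """Formal context: each pair (t, t') with t < t' -> attributes on which they agree."""
    return {
        (t, u): {p for p in attrs if tuples[t][p] == tuples[u][p]}
        for t, u in combinations(sorted(tuples), 2)
    }


def extent(K, context):
    """Pairs of tuples that agree on all attributes of K."""
    return frozenset(pair for pair, agree in context.items() if set(K) <= agree)


def intent(pairs, context, attrs):
    """Attributes shared by all the given pairs (all attributes if there is no pair)."""
    common = set(attrs)
    for pair in pairs:
        common &= context[pair]
    return frozenset(common)


def concepts(attrs, context):
    """Map each concept extent to its intent, by closing every attribute subset."""
    found = {}
    for size in range(len(attrs) + 1):
        for K in combinations(attrs, size):
            ext = extent(K, context)
            found[ext] = intent(ext, context, attrs)
    return found


CONTEXT = encode(ATTRS, TUPLES)
for ext, itt in concepts(ATTRS, CONTEXT).items():
    print(sorted(ext), sorted(itt))
```

In this toy context the concept with the empty extent has the full attribute set as intent, and both the set containing only `id` and the set containing `title` and `lang` are minimal attribute sets with that extent.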

As presented in Table 2, this may generate several candidate keys for the same concept ($\{o, t\}$ and $\{w\}$ for the maximal partition in the library dataset; in the bookstore dataset, the concept of extent $4 \approx 5 \approx 6$ has two candidate keys $\{ln\}$ and $\{lg, fn\}$, and the maximal partition has three candidate keys $\{id\}$, $\{tt, ln\}$ and $\{tt, fn, lg\}$).

This answers our first question positively: it is possible to extract keys, i.e., to generate $\kappa(D)$, from data with some help from formal concept analysis.

Footnote 1: For an arbitrary total strict order $<$ on $T$, $P_2(T) = \{\langle t, t' \rangle \in T^2 \mid t < t'\}$.

Footnote 2: $R_E = \{X \in E \mid \forall X' \in E, \neg(X' R X)\}$.

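Table 2: intents, potential keys and candidate keys of the concepts for the bookstore relation (first eight rows) and the library relation (last seven rows). The extent column of the table is not recoverable here.

| intent | potential keys | candidate keys |
|---|---|---|
| {id, fn, tt, ln, lg} | id, ... | {id}, {tt, ln}, {tt, fn, lg} |
| {fn, ln, lg} | fnlnlg, fnlg, fnln, lgln, ln | {ln}, {fn, lg} |
| {tt, lg} | ttlg | {tt, lg} |
| {fn, tt} | fntt | {fn, tt} |
| {lg} | lg | {lg} |
| {fn} | fn | {fn} |
| {tt} | tt | {tt} |
| ∅ | ∅ | ∅ |
| {w, y, a, t, o} | w, ..., ot, yot, aot, yaot, wyaot | {o, t}, {w} |
| {y, a, o} | o, ao, at, yo, yao | {o} |
| {y, a, t} | yt, yat | {y, t} |
| {y, a} | y, ya | {y} |
| {a, t} | t | {t} |
| {a} | a | {a} |
| ∅ | ∅ | ∅ |

2 Database linkkey extraction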

Consider that, instead of one relation, we are faced with two relations from two different databases which may contain tuples corresponding to the same individual.

We assume that candidate attribute pairs are already available through an alignment $\mathcal{A}$ which expresses equivalences between attributes of both relations. In this example, $\mathcal{A} = \{\langle lastname, author \rangle, \langle title, orig \rangle, \langle id, wid \rangle\}$. Our goal is to find those which will identify the same individuals (tuples) in both databases.

2.1 Linkkeys for databases

Linkkeys [5] have been introduced for generating equality, a.k.a. sameAs, links between RDF datasets. We present a simplified notion of linkkey which is defined over relations.

Definition (Linkkey) Given two relations $D = \langle A, T \rangle$ and $D' = \langle A', T' \rangle$ and an alignment $\mathcal{A} \subseteq A \times A'$, $LK \subseteq \mathcal{A}$ is a linkkey between $D$ and $D'$ iff $\forall t \in T, \forall t' \in T', (\forall \langle p, p' \rangle \in LK, p(t) = p'(t')) \Rightarrow t \approx t'$. The set of linkkeys between $D$ and $D'$ with respect to $\mathcal{A}$ is denoted $\kappa_{\mathcal{A}}(D, D')$.
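The links that a set of correspondences generates between two relations can be sketched as follows. The two relations, their values and the function name `links` are assumptions made for the illustration and do not reproduce the paper's example.

```python
# Hypothetical toy relations (NOT the paper's Table 1).
BOOKSTORE = {
    1: {"id": "1", "title": "Faust", "lastname": "Goethe"},
    2: {"id": "2", "title": "Emma", "lastname": "Austen"},
}
LIBRARY = {
    "a": {"wid": "1", "orig": "Faust", "author": "Goethe"},
    "b": {"wid": "9", "orig": "Emma", "author": "Austen"},
    "c": {"wid": "2", "orig": "Persuasion", "author": "Austen"},
}


def links(LK, left, right):
    """Pairs (t, t') of the product of both relations agreeing on every correspondence of LK."""
    return {
        (t, u)
        for t in left
        for u in right
        if all(left[t][p] == right[u][q] for p, q in LK)
    }


by_work = links({("title", "orig"), ("lastname", "author")}, BOOKSTORE, LIBRARY)
by_author = links({("lastname", "author")}, BOOKSTORE, LIBRARY)
by_id = links({("id", "wid")}, BOOKSTORE, LIBRARY)
print(sorted(by_work), sorted(by_author), sorted(by_id))
```

In this toy data the identifier correspondence links bookstore tuple 2 to library tuple c although they describe different works, which is the kind of false positive that unrelated surrogate identifiers can produce.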

This definition may be rendered independent from the alignment $\mathcal{A}$ by assuming $\mathcal{A} = A \times A'$, so that any attribute of one relation may be matched to any attribute of the other.

2.2 Strong linkkey extraction

One way to deal with this problem is to start with keys: either candidate keys or normal candidate keys. For that purpose, we define $\kappa(D)/\mathcal{A}$ as the operation which replaces, in all candidate keys of $D$, each occurrence of an attribute in a correspondence of the alignment $\mathcal{A}$ by this correspondence (see footnote 3).

A first kind of linkkeys that may be extracted are those which are normal candidate keys in their respective relations. They are called strong linkkeys and may be obtained by selecting the normal candidate keys that contain only attributes mentioned in the alignment (replacing each attribute by its correspondence) and intersecting them, i.e., $\hat{\kappa}(D)/\mathcal{A} \cap \hat{\kappa}(D')/\mathcal{A}$. Strong linkkeys have the advantage of identifying tuples matching across relations without generating any links within the initial relations.

In the example of Table 1, there is one such strong linkkey: $\{\langle id, wid \rangle\}$. Indeed, the normal candidate keys for the bookstore relation are $\{id\}$, $\{title, lastname\}$ and $\{title, firstname, lang\}$ and, for the library relation, they are $\{wid\}$ and $\{orig, translator\}$. Since translator has no equivalent in the bookstore relation (through the alignment), only $\{\langle id, wid \rangle\}$ can be used. Unfortunately, it does not identify any equality statement, as happens very often with database surrogates (this could have been worse if both relations had used integers as identifiers: it would have identified false positives).
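The selection, replacement and intersection steps can be sketched as follows, assuming a one-to-one alignment. The two relations are invented for the illustration (they only imitate the shape of the example), and the names `normal_candidate_keys` and `lift` are hypothetical.

```python
from itertools import combinations

# Hypothetical toy relations and alignment (NOT the paper's Table 1).
LEFT = {
    1: {"id": 1, "title": "Faust", "lastname": "Goethe"},
    2: {"id": 2, "title": "Egmont", "lastname": "Goethe"},
    3: {"id": 3, "title": "Faust", "lastname": "Marlowe"},
}
RIGHT = {
    "a": {"wid": "w1", "orig": "Faust", "author": "Goethe", "translator": "Nerval"},
    "b": {"wid": "w2", "orig": "Faust", "author": "Goethe", "translator": "Blaze"},
    "c": {"wid": "w3", "orig": "Egmont", "author": "Goethe", "translator": "Nerval"},
}
ALIGNMENT = {("id", "wid"), ("title", "orig"), ("lastname", "author")}


def gamma(K, tuples):
    """Pairs of distinct tuples that agree on every attribute of K."""
    return {
        (t, u)
        for t, u in combinations(sorted(tuples), 2)
        if all(tuples[t][p] == tuples[u][p] for p in K)
    }


def normal_candidate_keys(tuples):
    """Minimal attribute sets that generate no statement between distinct tuples."""
    attrs = sorted(next(iter(tuples.values())))
    keys = set()
    for size in range(1, len(attrs) + 1):  # smaller sets first, so minimality is easy
        for K in combinations(attrs, size):
            if gamma(K, tuples) or any(k <= set(K) for k in keys):
                continue
            keys.add(frozenset(K))
    return keys


def lift(keys, alignment, side):
    """Keep keys using only aligned attributes; replace attributes by correspondences."""
    by_attr = {pair[side]: pair for pair in alignment}
    return {
        frozenset(by_attr[p] for p in K)
        for K in keys
        if all(p in by_attr for p in K)
    }


strong = lift(normal_candidate_keys(LEFT), ALIGNMENT, 0) & lift(
    normal_candidate_keys(RIGHT), ALIGNMENT, 1
)
print(strong)
```

As in the example discussed above, the key of the second relation that uses `translator` is discarded because `translator` is not aligned, so only the identifier correspondence survives the intersection.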

This scheme may be relaxed by trying to extract linkkeys from all candidate keys. In this way one would simply use $\kappa(D)/\mathcal{A} \cap \kappa(D')/\mathcal{A}$. In our example, this does not bring further linkkeys.

2.3 Candidate linkkey extraction

The technique proposed above does indeed generate linkkeys, but does not generate all of them: linkkeys may rely on sets of attributes which are not candidate keys. Indeed, one interesting linkkey for the relations above is $\{\langle lastname, author \rangle, \langle title, orig \rangle\}$.

Surprisingly, it does not use a normal candidate key of the library relation and not even a candidate key of the bookstore relation, as $\{author, orig\}$ generates the same links as $\{orig\}$ in this relation. However, when applied to the elements of $T \times T'$, this linkkey generates non-ambiguous links, i.e., links which do not entail new links within a relation (this would have been different if a tuple $\langle year = 1822, author = Quincey, orig = Confessions, translator = Baudelaire \rangle$ were present in the library relation).

Such linkkeys may be found by the same type of technique as before. It consists of defining a formal context $enc(\langle A, T \rangle, \langle A', T' \rangle, \mathcal{A}) = \langle T \times T', \mathcal{A}, I \rangle$ such that: $\forall \langle p, p' \rangle \in \mathcal{A}, \forall \langle t, t' \rangle \in T \times T'$,

$$\langle t, t' \rangle \, I \, \langle p, p' \rangle \text{ iff } p(t) = p'(t')$$

$\gamma$ is redefined to deal with subsets of alignments and generate assertions on $T \cup T'$. But, in order for $\approx$ to remain an equivalence relation, it will be necessary to close on $T \cup T'$ and not only on $T \times T'$. Indeed, if two tuples of $T$ are found equal to a tuple of $T'$, then by transitivity, they should be equal as well.
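The closure step can be illustrated with a small union-find over the tuples of both relations. This is a sketch under assumptions: the data, the tagging of tuples by the relation they come from, and the names `links` and `close` are all introduced for the example.

```python
# Hypothetical toy relations (NOT the paper's Table 1).
LEFT = {
    1: {"lastname": "Goethe"},
    2: {"lastname": "Goethe"},
    3: {"lastname": "Austen"},
}
RIGHT = {
    "a": {"author": "Goethe"},
    "b": {"author": "Austen"},
}


def links(LK, left, right):
    """Cross-relation pairs (t, t') agreeing on every correspondence of LK."""
    return {
        (t, u)
        for t in left
        for u in right
        if all(left[t][p] == right[u][q] for p, q in LK)
    }


def close(link_set):
    """Equivalence classes over the union of both tuple sets induced by the links."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for t, u in link_set:
        # Tag each tuple with its relation so that both id spaces stay disjoint.
        parent[find(("T", t))] = find(("T'", u))

    classes = {}
    for x in list(parent):
        classes.setdefault(find(x), set()).add(x)
    return {frozenset(c) for c in classes.values()}


CLASSES = close(links({("lastname", "author")}, LEFT, RIGHT))
print(CLASSES)
```

Here tuples 1 and 2 of the first relation are both linked to tuple a of the second one, so they end up in the same class: the linkkey entails an equality statement within a single relation.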

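Again, candidate linkkeys are the minimal elements of the intent which generate exactly the corresponding set of links: $\kappa_{\mathcal{A}}(D, D') = \bigcup_{c \in FCA(enc(D, D', \mathcal{A}))} \min_{\subseteq} \{K \subseteq intent(c) \mid \gamma(K) = \gamma(intent(c))\}$, where $\min_{\subseteq}$ keeps the elements that are minimal for set inclusion.

Footnote 3: This assumes that the alignment is one-to-one. This assumption is necessary for this subsection of the paper only.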
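Candidate linkkeys can be enumerated by brute force on small inputs: a set of correspondences is kept when removing any single correspondence changes the generated cross-relation links. This sketch does not build the concept lattice and is exponential in the size of the alignment; the data and the names `links` and `candidate_linkkeys` are assumptions made for the illustration.

```python
from itertools import combinations

# Hypothetical toy relations and alignment (NOT the paper's Table 1).
LEFT = {
    1: {"id": "1", "title": "Faust", "lastname": "Goethe"},
    2: {"id": "2", "title": "Egmont", "lastname": "Goethe"},
    3: {"id": "3", "title": "Faust", "lastname": "Marlowe"},
}
RIGHT = {
    "a": {"wid": "1", "orig": "Faust", "author": "Goethe"},
    "b": {"wid": "7", "orig": "Egmont", "author": "Goethe"},
}
ALIGNMENT = {("id", "wid"), ("title", "orig"), ("lastname", "author")}


def links(LK, left, right):
    """Cross-relation pairs (t, t') agreeing on every correspondence of LK."""
    return frozenset(
        (t, u)
        for t in left
        for u in right
        if all(left[t][p] == right[u][q] for p, q in LK)
    )


def candidate_linkkeys(alignment, left, right):
    """Map each minimal set of correspondences to the links it generates."""
    pairs = sorted(alignment)
    result = {}
    for size in range(len(pairs) + 1):
        for K in combinations(pairs, size):
            generated = links(K, left, right)
            # Links only grow when a correspondence is removed, so checking
            # the subsets with one correspondence less is sufficient.
            if all(links(set(K) - {c}, left, right) != generated for c in K):
                result[frozenset(K)] = generated
    return result


CANDIDATES = candidate_linkkeys(ALIGNMENT, LEFT, RIGHT)
for key, generated in CANDIDATES.items():
    print(sorted(key), sorted(generated))
```

On this toy data the pair of correspondences on title and last name is a candidate linkkey although neither attribute alone identifies tuples, whereas any set adding the identifier correspondence to another one is discarded because the identifier alone already generates the same links.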

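[Figure 2: lattice of candidate linkkeys for the example of Table 1, with $\mathcal{A} = \{\langle fn, a \rangle, \langle tt, o \rangle, \langle id, w \rangle\}$. Only fragments of the node labels are recoverable, among them the intents $\{\langle tt, o \rangle\}$ and $\{\langle fn, a \rangle, \langle tt, o \rangle\}$.]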

This technique, applied to the example of Table 1, generates the lattice of Figure 2. It can be argued that the candidate linkkeys $\{\langle fn, a \rangle, \langle tt, o \rangle\}$ and $\{\langle id, w \rangle\}$ are better than the others because they do not generate other statements within the relations. Indeed, $\{\langle tt, o \rangle\}$ generates $6 \approx 7 \approx 8$, and $\{\langle fn, a \rangle\}$ generates $a1 \approx a2 \approx b$, $c \approx d \approx e$ and $4 \approx 5 \approx 6$.

3 Conclusion and further work

We introduced, in the context of the relational model, the notions of candidate keys and linkkeys and we discussed potential links with formal concept analysis. These are only a few elements of a wider program. Problems were expressed in the relational framework because they are simpler. Our ambition is to provide an integrated way to generate links across RDF data sets using keys and it may be worth investigating if the proposed formal concept analysis framework can be extended to full RDF data interlinking.

Plunging this in the context of RDF requires further developments:
– considering that values do not have to be syntactically equal but may be found equal with respect to some theory: this may be a simple set of equality statements (“étudiant”=“student”) or may depend on RDF Schemas or OWL ontologies;
– considering several tables depending on each other together (this is related to Relational Concept Analysis [7] and could use the notion of foreign keys);
– considering that RDF attributes are not functional and hence yield a more general type of keys [1].

Once this is integrated within a common theoretical framework, a full solution requires work before and after running formal concept analysis:
– Before, it is necessary to use ontology/database matching [5] and to proceed to value normalisation.
– After, it is necessary to select among these potential or candidate keys those which are the most accurate [2].

Acknowledgements

This work has been partially supported by the ANR projects Qualinca (12-CORD-0012 for Manuel Atencia), and Lindicle (12-IS02-0002 for all three authors), and by grant TIN2011-28084 (for Manuel Atencia and Jérôme David) of the Ministry of Science and Innovation of Spain, co-funded by the European Regional Development Fund (ERDF).

Thanks to the anonymous reviewers for helping to clarify the paper.

References

1. Manuel Atencia, Michel Chein, Madalina Croitoru, Jérôme David, Michel Leclère, Nathalie Pernelle, Fatiha Saïs, François Scharffe, and Danai Symeonidou. Defining key semantics for the RDF datasets: experiments and evaluations. In Proc. 21st International Conference on Conceptual Structures (ICCS), Iasi (RO), pages 65-78, 2014.

2. Manuel Atencia, Jérôme David, and Jérôme Euzenat. Data interlinking through robust linkkey extraction. In Proc. 21st European Conference on Artificial Intelligence (ECAI), Praha (CZ), 2014.

3. Jaume Baixeries, Mehdi Kaytoue, and Amedeo Napoli. Characterizing functional dependencies in formal concept analysis with pattern structures. Annals of Mathematics and Artificial Intelligence, 2014. To appear.

4. János Demetrovics, Leonid Libkin, and Ilya Muchnik. Functional dependencies in relational databases: A lattice point of view. Discrete Applied Mathematics, 40(2):155-185, 1992.

5. Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer, Heidelberg (DE), 2nd edition, 2013.

6. Bernhard Ganter and Rudolf Wille. Formal concept analysis: mathematical foundations. Springer, Berlin, New York, Paris, 1999.

7. Mohamed Rouane Hacene, Marianne Huchard, Amedeo Napoli, and Petko Valtchev. Relational concept analysis: mining concept lattices from multi-relational data. Annals of Mathematics and Artificial Intelligence, 67(1):81-108, 2013.

8. Tom Heath and Christian Bizer. Linked Data: Evolving the Web into a Global Data Space. Morgan & Claypool, 2011.

9. Mark Levene. A lattice view of functional dependencies in incomplete relations. Acta Cybernetica, 12(2):181-207, 1995.

10. Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. Functional and approximate dependency mining: database and FCA points of view. Journal of Experimental & Theoretical Artificial Intelligence, 14(2-3):93-114, 2002.