Characterizing Covers of Functional Dependencies using FCA

Victor Codocedo¹, Jaume Baixeries², Mehdi Kaytoue³, and Amedeo Napoli⁴

1 Universidad Técnica Federico Santa María. Campus San Joaquín, Santiago de Chile.
2 Universitat Politècnica de Catalunya. 08032, Barcelona. Catalonia.
3 Infologic, 99 Avenue de Lyon, F-26500, Bourg-lès-Valence, France
4 LORIA (CNRS - Inria Nancy Grand Est - Université de Lorraine), B.P. 239, F-54506, Vandœuvre-lès-Nancy.
Corresponding author: victor.codocedo@inf.utfsm.cl

© paper author(s), 2018. Proceedings volume published and copyrighted by its editors. Paper published in Dmitry I. Ignatov, Lhouari Nourine (Eds.): CLA 2018, pp. 279–290, Department of Computer Science, Palacký University Olomouc, 2018. Copying permitted only for private and academic purposes.

Abstract. Functional dependencies (FDs) can be used for various important operations on data, for instance, checking the consistency and the quality of a database (including databases that contain complex data). Consequently, a generic framework that allows mining a sound, complete, non-redundant and yet compact set of FDs is an important tool for many different applications. There are different definitions of such sets of FDs (usually called covers). In this paper, we present the characterization of two different kinds of covers for FDs in terms of pattern structures. The convenience of such a characterization is that it allows for an easy implementation of efficient mining algorithms which can later be easily adapted to other kinds of similar dependencies. Finally, we present empirical evidence that the proposed approach can perform better than a state-of-the-art FD miner on large databases.

1 Introduction

Functional Dependencies (FDs) are a keystone of the relational database model, since they allow checking the consistency and maintaining the quality of a database [8, 10, 9], as well as guiding its design [20]. In addition, they have been used to study information integration in the Web of data with varying degrees of quality [24, 25], and to check data completeness in DBpedia [1]. Therefore, the computation of a succinct representation of a set of FDs (usually referred to as a cover) is of interest to various fields of knowledge discovery and representation, especially if this computation can easily be extended to other kinds of dependencies (e.g. relaxed versions of FDs [5]).

The computation of FD covers is a popular topic in the database literature. As a reference, in [21], seven algorithms to mine FDs and compute their covers are reviewed and grouped into three families: lattice traversal algorithms, difference/agree sets algorithms, and dependency induction algorithms. The characterization of FDs with FCA and pattern structures is also presented in [2], and a generalization to other types of FDs is given in [6]. For a detailed review on the characterization of FDs with FCA, see [3].

In this paper, we characterize the representations of FD covers using pattern structures, an extension of FCA dealing with complex object representations [18]. On the one hand, we adapt the definition of a canonical direct basis of implications with proper premises [4, 22] to the formalism of pattern structures, in order to prove that this basis is equivalent to a reduced non-redundant set of FDs, better known as the canonical cover (Section 3.1).
We show that the canonical cover can be characterized using the arrow relation (↙) between an attribute and a pattern defined over a partition pattern structure (PPS). On the other hand, we discuss the relation between the stem base of implications [12] and the minimal cover of dependencies, a sound, complete, non-redundant set of FDs that has minimum cardinality w.r.t. any other equivalent cover (Section 3.2). We show that the latter can be characterized by the pseudo-extents of a PPS. Finally, in Section 4 we empirically compare these two ways of computing FD covers with the algorithm TANE [16]. TANE is one of the most efficient FD mining algorithms and, according to [21], it is the basis of the family of "lattice traversal algorithms"; it serves as the baseline to validate our approach against a state-of-the-art FD miner.

2 Theoretical Background

Let U be an ordered set of attributes, and let Dom be a set of values (a domain). For the sake of simplicity, we assume that Dom is a numerical set. A tuple t is a function t : U → Dom, and a table T is a set of tuples. Usually a table is represented as a matrix, as in Table 1, where the set of tuples (or objects) is T = {t1, t2, ..., t7} with attributes U = {a, b, c, d, e}. We use table, dataset and set of tuples as equivalent terms.

We overload the functional notation of a tuple in such a way that, given a tuple t ∈ T and X ⊆ U, t(X) is the tuple with the values of t in the ordered attributes xi ∈ X, defined as t(X) = ⟨t(x1), t(x2), ..., t(xn)⟩. For example, we have that t2({a, c}) = ⟨t2(a), t2(c)⟩ = ⟨2, 1⟩. In this article, the set notation is dropped: instead of {a, b} we write ab.

2.1 Functional Dependencies and their Covers

Definition 1 ([23]). Let T be a set of tuples, and X, Y ⊆ U. A functional dependency (FD) X → Y holds in T if:

    ∀t, t′ ∈ T : t(X) = t′(X) ⟹ t(Y) = t′(Y)

For instance, the functional dependency d → e holds in T (Table 1), whereas the functional dependency e → d does not hold, since t4(e) = t5(e) but t4(d) ≠ t5(d).

For a given set of functional dependencies F, we can use the three Armstrong axioms (reflexivity, augmentation and transitivity) to derive a larger set of FDs [20]. We call F* the set of FDs derived from F by reflexivity and augmentation, and F⁺ the set of FDs derived by reflexivity, augmentation and transitivity. Two sets of FDs F and H are said to be equivalent, F ≡ H, if and only if F⁺ = H⁺.

Let F be a set of FDs from a database T. F is said to be sound if all FDs in F hold in T. In addition, F is said to be complete if all FDs that hold in T can be derived from F. Let X → Y be any FD in F; then F is said to be non-redundant if F ∖ {X → Y} ≢ F, and non-redundant w.r.t. augmentation iff (F ∖ {X → Y})* ≠ F*.

A set F is said to be left-reduced if for all X → Y ∈ F and Z ⊊ X we have that (F ∖ {X → Y}) ∪ {Z → Y} ≢ F. Dually, it is said to be right-reduced if for all X → Y ∈ F and Z ⊊ Y we have that (F ∖ {X → Y}) ∪ {X → Z} ≢ F. F is said to be reduced if it is simultaneously left- and right-reduced.

Let F be a reduced set of FDs; then G = {X → y | X → Y ∈ F, y ∈ Y} is the splitting of F, and G ≡ F. Let F be a reduced set; its splitting is called a canonical cover. A canonical cover is a left-reduced set of FDs, non-redundant w.r.t. augmentation, whose members have a single element in their right-hand side (a split set) [19].
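To make Definition 1 concrete, the following Python sketch (our own illustration, not the implementation evaluated in Section 4) encodes Table 1 as a list of dictionaries and checks whether an FD X → Y holds by inspecting all pairs of tuples; the name `holds` and the encoding are assumptions made for this example only.

```python
from itertools import combinations

U = "abcde"
rows = [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
        (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]   # tuples t1..t7 of Table 1
T = [dict(zip(U, r)) for r in rows]

def holds(X, Y, table):
    """Definition 1: X -> Y holds iff any two tuples agreeing on X also agree on Y."""
    for t, u in combinations(table, 2):
        if all(t[x] == u[x] for x in X) and any(t[y] != u[y] for y in Y):
            return False
    return True

print(holds("d", "e", T))   # True:  d -> e holds in Table 1
print(holds("e", "d", T))   # False: t4(e) = t5(e) but t4(d) != t5(d)
```

The same small table is reused as a running example in the sketches below.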
A different definition presents the canonical cover in a saturated version, requiring uniqueness of the left-hand sides [17], thereby losing the single-element condition on the right-hand side. For the sake of compatibility with FCA implication covers, we favor the first definition. Notice that canonical covers can be redundant w.r.t. transitivity. For example, the canonical cover of Table 1 is {c → b, c → e, d → e, bd → c, be → c}, where bd → c is redundant w.r.t. transitivity, since bd → bde → c.

Finally, a set F is said to be a minimum cover if it has as few FDs as any other equivalent set of FDs. For example, the minimum cover of Table 1 is {c → be, d → e, be → c}. Notice firstly that the minimum cover is not restricted to be reduced, so it is not presented with split sets. Secondly, the minimum cover contains exactly one fewer FD than the canonical cover, namely bd → c.

2.2 Formal Concept Analysis, Implication Systems and FDs

For the sake of brevity we do not provide a description of the Formal Concept Analysis (FCA) framework. The notation used in this article follows [13], where K = (G, M, I) is a formal context with objects G, attributes M and incidence relation I, and with formal concepts (A, B) such that A′ = B and B′ = A.

Implications are relations established between attribute sets of a formal context K. Implications are analogous to FDs and can be used to characterize them [2]. Because of this, we denote an implication in the same way as an FD. Implication systems (sets of implications) can also characterize FD covers [4]. An implication X → Y holds in K, for X, Y ⊆ M, if Y ⊆ X″.

Let T be a set of tuples and U a set of attributes in a table (such as the one in Table 1). We define Pair(T) = {P ⊆ T : |P| = 2}, the set of all subsets of T with exactly two elements, and the incidence relation I such that ((ti, tj), x) ∈ I ⟺ ti(x) = tj(x), for all x ∈ U and ti, tj ∈ T. It can be shown that an FD X → Y holds in the database if and only if X → Y is an implication of the formal context K = (Pair(T), U, I) [13]. K is called the binary codification of table T. For example, Table 3 contains the binary codification of Table 1. In Table 3 we can observe the implication d → e, which can be verified as an FD in Table 1.

The previous statement entails that FDs and implications in K are in one-to-one correspondence. Moreover, the definition of a canonical cover of FDs corresponds to that of a canonical-direct unitary basis of implications, as shown in [4], where the equivalent left-minimal basis is described as containing minimal functional dependencies, i.e. those in a canonical cover.
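The binary codification can also be illustrated with a small sketch. The code below (again our own illustration; `intent` and `closure` are hypothetical helper names, not functions of the fca library used in Section 4) rebuilds the encoding of Table 1 from the previous sketch, constructs the pair context (Pair(T), U, I), and verifies an implication X → Y by checking Y ⊆ X″.

```python
from itertools import combinations

U = "abcde"
rows = [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
        (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]   # tuples t1..t7 of Table 1
T = [dict(zip(U, r)) for r in rows]

# Objects of the pair context: all 2-element subsets of T (here, pairs of indices i < j).
pairs = list(combinations(range(len(T)), 2))

def intent(pair):
    """Attributes on which both tuples of the pair agree (one row of Table 3)."""
    i, j = pair
    return {x for x in U if T[i][x] == T[j][x]}

def closure(X):
    """X'' in the pair context: attributes shared by every pair that agrees on X."""
    Xset = set(X)
    result = set(U)
    for p in pairs:
        if Xset <= intent(p):
            result &= intent(p)
    return result

# An implication X -> Y holds in K iff Y is contained in X''.
print(closure("d"))                # closure of {d} is {d, e}, hence d -> e holds
print(set("d") <= closure("e"))    # False: e -> d does not hold
```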
Table 1: Example of a table T.

       a  b  c  d  e
  t1   1  1  1  1  1
  t2   2  1  1  1  1
  t3   3  1  2  2  2
  t4   3  2  3  2  2
  t5   3  1  2  3  2
  t6   1  1  2  2  2
  t7   1  1  2  4  2

Table 3: Binary codification of Table 1.

            a  b  c  d  e
  (t1,t2)      ×  ×  ×  ×
  (t1,t3)      ×
  (t1,t4)
  (t1,t5)      ×
  (t1,t6)   ×  ×
  (t1,t7)   ×  ×
  (t2,t3)      ×
  (t2,t4)
  (t2,t5)      ×
  (t2,t6)      ×
  (t2,t7)      ×
  (t3,t4)   ×        ×  ×
  (t3,t5)   ×  ×  ×     ×
  (t3,t6)      ×  ×  ×  ×
  (t3,t7)      ×  ×     ×
  (t4,t5)   ×           ×
  (t4,t6)            ×  ×
  (t4,t7)               ×
  (t5,t6)      ×  ×     ×
  (t5,t7)      ×  ×     ×
  (t6,t7)   ×  ×  ×     ×

Table 2: Representation context (× marks incidence; ↙ marks the down-arrow relation).

                                        a  b  c  d  e
  δ(a)                                  ×
  δ(b)                                     ×
  δ(e)                                              ×
  δ(a) ⊓ δ(b)                           ×  ×  ↙     ↙
  δ(a) ⊓ δ(e)                           ×           ×
  δ(d) ⊓ δ(e)                                    ×  ×
  δ(a) ⊓ δ(d) ⊓ δ(e)                    ×  ↙  ↙  ×  ×
  δ(b) ⊓ δ(c) ⊓ δ(e)                       ×  ×     ×
  δ(a) ⊓ δ(b) ⊓ δ(c) ⊓ δ(e)             ×  ×  ×  ↙  ×
  δ(b) ⊓ δ(c) ⊓ δ(d) ⊓ δ(e)             ↙  ×  ×  ×  ×
  δ(a) ⊓ δ(b) ⊓ δ(c) ⊓ δ(d) ⊓ δ(e)      ×  ×  ×  ×  ×

2.3 Partition Pattern Structures

A Partition Pattern Structure (PPS) is a type of pattern structure [14] that deals, as the name suggests, with object representations in the form of set partitions. PPSs have been shown to be useful for mining biclusters [7] and, more importantly, relations between partition pattern concepts have been used to characterize FDs of different kinds [3]. The formalization of a database T with attributes U as a PPS is as follows.

A partition d of T is a set d ⊆ ℘(T) (powerset of T) of disjoint non-empty subsets of T such that for any two different elements Ki, Kj ∈ d we have Ki ∩ Kj = ∅, and ⋃_{K ∈ d} K = T. Let D be the set of all possible partitions of T; then any two partitions d1, d2 ∈ D can be ordered by a coarser/finer relation denoted d1 ⊑ d2 (d1 is finer than d2) iff (∀ Ki ∈ d1)(∃ Kj ∈ d2) such that Ki ⊆ Kj. The similarity operator is defined as d1 ⊓ d2 = {Ki ∩ Kj | Ki ∈ d1, Kj ∈ d2}. From this, it follows that (D, ⊑) is a complete lattice with supremum and infimum defined respectively as ⊤ = {T} (a single block containing all tuples) and ⊥ = {{t} | t ∈ T}.

The set of attributes U is mapped onto D through a function δ which, for a given attribute x ∈ U, yields a partition δ(x) ∈ D representing the equivalence relation over T induced by the values of attribute x. With this, we can configure the PPS (U, (D, ⊑), δ) with derivation operators X^□ = ⨅_{x ∈ X} δ(x), denoting the equivalence relation implied by the attributes in X, and d^□ = {x ∈ U | d ⊑ δ(x)}, denoting all attributes whose associated equivalence relations are refined by d. (X, d) is a partition pattern concept when X^□ = d and d^□ = X.

Theorem 1 (Proposition 2 in [3]). Let (Pair(T), U, I) be the binary codification of a table within a database, and (U, (D, ⊑), δ) its PPS representation:

    (W, X) is a concept of (Pair(T), U, I) ⟺ (X, d) is a concept of (U, (D, ⊑), δ)

The proof of Theorem 1 can be found in [3]. Theorem 1 presents an important property of the PPS: X ⊆ U is an extent in (U, (D, ⊑), δ) if and only if it is also an intent in (Pair(T), U, I) (the relation between the set of tuple pairs W ⊆ Pair(T) and the pattern d ∈ D can be formalized as well, but it is of no interest to our development). Theorem 1 is very important since it entails that the lattices derived from (Pair(T), U, I) and (U, (D, ⊑), δ) are isomorphic. Consequently, an implication X → Y in (Pair(T), U, I) can be found as a relation between extents in (U, (D, ⊑), δ) (an extent implication) such that Y ⊆ X^□□.
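The operators of the PPS can be illustrated directly on Table 1. The sketch below is our own (partitions are encoded as frozensets of frozensets of tuple indices; all names are ours, not from the fca library): it implements the similarity operator ⊓, the refinement order ⊑ and the two derivation operators (·)^□, and recovers, for instance, the closure {c}^□□ = {b, c, e}.

```python
from functools import reduce

U = "abcde"
rows = [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
        (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]   # tuples t1..t7 of Table 1
T = [dict(zip(U, r)) for r in rows]

def delta(x):
    """delta(x): partition of the tuple indices induced by the values of attribute x."""
    blocks = {}
    for i, t in enumerate(T):
        blocks.setdefault(t[x], set()).add(i)
    return frozenset(frozenset(b) for b in blocks.values())

def meet(d1, d2):
    """Similarity d1 ⊓ d2: non-empty pairwise intersections of blocks."""
    return frozenset(K1 & K2 for K1 in d1 for K2 in d2 if K1 & K2)

def finer(d1, d2):
    """d1 ⊑ d2: every block of d1 is contained in some block of d2."""
    return all(any(K1 <= K2 for K2 in d2) for K1 in d1)

def box_attrs(X):
    """X^□: the meet of delta(x) for x in X."""
    return reduce(meet, (delta(x) for x in X))

def box_pattern(d):
    """d^□: attributes whose partition is refined by d (i.e. coarser than d)."""
    return {x for x in U if finer(d, delta(x))}

print(box_pattern(box_attrs("c")))    # {'b', 'c', 'e'}: the closure of {c}
print(box_pattern(box_attrs("ab")))   # {'a', 'b'}: {a, b} is already closed
```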
3 Covers and Pattern Structures

In this section we present two different kinds of covers for FDs, canonical covers (Section 3.1) and minimal covers (Section 3.2), as well as their characterization in terms of pattern structures. This section uses existing, well-known results in FCA, which are reviewed here for the sake of readability. Section 3.1 presents how a canonical cover of FDs can be computed using a PPS, according to the results in [4, 22]. Section 3.2 introduces a novel characterization of the minimum cover of FDs by means of pseudo-extents of a PPS [13]. The interest of these results is not limited to computing FD covers, but extends to generalizations of FDs [3, 6].

3.1 Characterizing a Canonical Cover of FDs

The characterization of a canonical cover of FDs using FCA is straightforward. In a nutshell, a canonical cover of FDs is analogous to a canonical direct unitary basis of implications [4], as presented in Section 2.2. In this section we recall some of these ideas and show how they can be simply adapted to the framework of PPSs.

Firstly, let (U, (D, ⊑), δ) be a PPS. We define 𝔇 = {d ∈ D | d^□□ = d} as the set of all closed partition patterns in D. The formal context (U, 𝔇, J) with (d, x) ∈ J ⟺ d ⊑ δ(x) is called a representation context of (U, (D, ⊑), δ), and their corresponding concept lattices are isomorphic [18]. For the sake of readability of the following definitions, we use the transposed representation context (𝔇, U, J) instead of (U, 𝔇, J) (Table 2). By transitivity of equivalence, it is clear that (𝔇, U, J) is isomorphic to (Pair(T), U, I) as defined in Section 2.2, and as such, implications in (𝔇, U, J) correspond to FDs.

For example, Table 2 (ignoring the elements ↙) contains the representation context (𝔇, U, J) of the PPS derived from Table 1. Notice that the objects are closed intersections of attribute representations; e.g. δ(d) does not appear by itself, since δ(d)^□□ = δ(d) ⊓ δ(e). With this, the canonical direct basis of implications in (𝔇, U, J) (and thus the canonical cover of FDs) is determined by the set of proper premises of the elements of U.

Theorem 2 (Corollary 1 in [22]). X ⊆ U ∖ {y} is a premise of y ∈ U iff X is a hypergraph transversal of H_{y↙} defined as:

    H_{y↙} = {(U ∖ {y}) ∖ d′ | d ∈ 𝔇, d ↙ y}

The set of all proper premises of y is the transversal hypergraph Tr(H_{y↙}), i.e. the set of its minimal transversals.

A detailed description of the development of Theorem 2 can be found in [22]. Providing a formal definition of hypergraph transversals is out of the scope of this article; we briefly mention that this formalization can also be stated in terms of set collections (instead of hypergraphs) and minimal hitting sets (instead of minimal hypergraph transversals) [11]. This problem is also analogous to the vertex cover problem [12].

Theorem 2 provides a formal description of the proper premises of a given attribute y ∈ U, which in turn yield the canonical cover of functional dependencies. However, this approach requires the creation of the representation context, which is an intermediate step in the overall calculation. Actually, by analyzing the arrow relation between d and y, we can observe that the representation context is not necessary. Consider that in (𝔇, U, J), d′ = {x ∈ U | (d, x) ∈ J} = {x ∈ U | d ⊑ δ(x)}, and thus d′ is equivalent to d^□ for any d ∈ 𝔇. Moreover, in (U, (D, ⊑), δ), d^□ ⊊ h^□ ⟺ h ⊏ d, since h = h^□□ and d = d^□□. With this, we can rewrite the arrow definition as follows:

    d ↙ y ⟺ (d, y) ∉ J and, whenever d′ ⊊ h′, (h, y) ∈ J
          ⟺ d ⋢ δ(y) and, whenever d^□ ⊊ h^□, h ⊑ δ(y)
          ⟺ d ⋢ δ(y) and, whenever h ⊏ d, h ⊑ δ(y)

The last result shows that d ↙ y in (𝔇, U, J) can be defined directly over the PPS.
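As a brute-force illustration of this last equivalence (our own sketch, not the algorithm of [22]), the code below reuses U, delta, finer, box_attrs and box_pattern from the sketch in Section 2.3 (it is runnable only together with that sketch), enumerates the closed patterns 𝔇, and tests d ↙ y directly on partitions.

```python
from itertools import combinations

def closed_patterns():
    """The set of closed patterns, indexed by their intents d^□."""
    intents = {}
    for r in range(1, len(U) + 1):
        for X in combinations(U, r):
            d = box_attrs(X)
            intents[frozenset(box_pattern(d))] = d
    return list(intents.values())

D_closed = closed_patterns()

def down_arrow(d, y):
    """d ↙ y over the PPS: d is not finer than delta(y), and every closed h
    strictly finer than d satisfies h ⊑ delta(y)."""
    if finer(d, delta(y)):
        return False
    return all(finer(h, delta(y))
               for h in D_closed
               if finer(h, d) and h != d)

# Attributes toward which delta(a) ⊓ delta(b) points with a down arrow (cf. Table 2):
d_ab = box_attrs("ab")
print([y for y in U if down_arrow(d_ab, y)])   # ['c', 'e']
```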
Intuitively, this definition corresponds to y ↗ d in (U, 𝔇, J) and thus in (U, (D, ⊑), δ). With these elements we can finally propose a characterization of the canonical cover of functional dependencies in (U, (D, ⊑), δ) as follows.

Corollary 1. Let (U, (D, ⊑), δ) be a partition pattern structure and let Tr(H) denote the transversal hypergraph of H. Then, with

    L_cc = {X → y | y ∈ U, X ∈ Tr(H_{y↗})}
    H_{y↗} = {(U ∖ {y}) ∖ d^□ | d ∈ 𝔇, y ↗ d}
    y ↗ d ⟺ d ⋢ δ(y) and, whenever h ⊏ d, h ⊑ δ(y)

L_cc is a canonical cover of functional dependencies.

For the running example, let us calculate the proper premises of attribute c using the arrow relations in Table 2. We have c ↗ (δ(a) ⊓ δ(b)) and c ↗ (δ(a) ⊓ δ(d) ⊓ δ(e)). With this, we have the hypergraph H_{c↗} = {{d, e}, {b}}, for which the transversal hypergraph is Tr(H_{c↗}) = {{b, d}, {b, e}}. Correspondingly, we have the FDs bd → c and be → c, which are included in the canonical cover.

3.2 Characterizing a Minimal Cover of FDs

We introduce a novel characterization of a minimal cover of FDs by means of pseudo-intents, and its generalization using pseudo-extents of a PPS. The stem base of implications, or Duquenne-Guigues basis [15], is a sound, complete and non-redundant basis which also has minimum cardinality among the equivalent sets of implications of a given formal context. We show how it can be used to characterize a minimal cover of FDs in a rather simple manner. Prior to introducing the stem base, let us define pseudo-closed sets [13].

Definition 2 (Pseudo-closed sets). Let P ↦ P″ be a closure operator over a set M. Then P is a pseudo-closed set if and only if:

    P ≠ P″                                                        (1)
    Q ⊊ P and Q is a pseudo-closed set ⟹ Q″ ⊆ P                   (2)

Given a formal context (G, M, I), pseudo-closed sets A ⊆ G are called pseudo-extents, while pseudo-closed sets B ⊆ M are called pseudo-intents. The stem base of implications, or Duquenne-Guigues basis, can be defined as follows.

Theorem 3 ([13]) (Duquenne-Guigues Basis). The set of implications

    L = {P → P″ | P is a pseudo-intent}                           (3)

is sound, complete and non-redundant.

Theorem 4 ([13]). Every complete set of implications contains an implication X → Y with X″ = P″ for every pseudo-intent P of (G, M, I).

Theorem 4 entails that the stem base of implications has minimum cardinality with respect to any equivalent set of implications of (G, M, I). With this, and the previous observation that FDs are in one-to-one correspondence with implications in (Pair(T), U, I), we can derive the following corollary.

Corollary 2. Let (Pair(T), U, I) be the binary codification of a table with tuples T and attributes U. The set of FDs L = {P → P″ | P is a pseudo-intent} is a minimal cover.

Corollary 2 provides a novel characterization of the minimal cover of FDs through pseudo-intents of a formal context. Given the existing relation between (Pair(T), U, I) and (U, (D, ⊑), δ), we can generalize the characterization of the minimal cover over the latter. Observe that in the PPS (U, (D, ⊑), δ), we maintain the notion of pseudo-extents for a pseudo-closed set X ⊆ U with the closure operator X ↦ X^□□.

[Figures 1–3 appear here: Fig. 1 shows the datasets and the number of rows and columns they contain; Fig. 2 shows the performance comparison when increasing tuples (FCA vs TANE); Fig. 3 shows the performance comparison when increasing attributes (FCA vs TANE).]
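Before stating the result formally, the following naive sketch (for illustration only; it is not necessarily the procedure used by MCM, and it again reuses U, delta, box_attrs and box_pattern from the sketch in Section 2.3) enumerates pseudo-extents of the PPS by checking the two conditions of Definition 2 over subsets of U in order of increasing size, and recovers the minimal cover of Table 1. The exhaustive enumeration is exponential in |U| and only meant for tiny examples.

```python
from itertools import combinations

def pps_closure(X):
    """X^□□ in the PPS; the empty set maps to the attributes constant over T."""
    if not X:
        return frozenset(x for x in U if len(delta(x)) == 1)
    return frozenset(box_pattern(box_attrs(X)))

def pseudo_extents():
    """Naive enumeration of pseudo-extents (Definition 2), by increasing size."""
    pseudo = []
    for r in range(0, len(U) + 1):
        for P in map(frozenset, combinations(U, r)):
            if P == pps_closure(P):
                continue                                  # condition (1) fails: P is closed
            if all(pps_closure(Q) <= P for Q in pseudo if Q < P):
                pseudo.append(P)                          # condition (2) holds
    return pseudo

# The Duquenne-Guigues basis over extents gives the minimal cover of FDs of Table 1.
for P in pseudo_extents():
    print(set(P), "->", set(pps_closure(P) - P))
# {'c'} -> {'b', 'e'},  {'d'} -> {'e'},  {'b', 'e'} -> {'c'}
```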
Proposition 1. Let (U, (D, ⊑), δ) be the PPS representation of a database. Then the set of functional dependencies L_mc defined below is a minimal cover:

    L_mc = {X → X^□□ | X is a pseudo-extent}                      (4)

The proof of Proposition 1 follows from Theorem 1 and the fact that, for a set X ⊆ U, the closure operator X ↦ X″ is exactly equivalent to X ↦ X^□□; consequently, the set of pseudo-intents of (Pair(T), U, I) is the same as the set of pseudo-extents of (U, (D, ⊑), δ). Thus, because of Corollary 2, Proposition 1 holds.

Table 4: Dataset details, execution times in seconds, and number of mined rules for CCM (Canonical Cover Miner), MCM (Minimal Cover Miner) and TANE. CC: Canonical Cover, MC: Minimal Cover. Datasets in boldface represent those in which FCA performed better than TANE.

                                            CCM                   MCM                   TANE
  Dataset             # Tuples  # Attrs    # CC Deps  Time [s]   # MC Deps  Time [s]   # CC Deps  Time [s]
  Mushroom            8124      22         3605       23887      1509       12684      -          -
  Adult               48842     14         78         90.41      42         71.55      78         123.13
  Credit              690       15         1099       3.07       253        1.82       1099       2.54
  PGLW                17995     6          5          0.67       2          0.35       5          0.48
  PGLW (2xA)          17995     12         38         1.81       15         1.18       38         7.45
  Forest Fires        516       13         442        0.46       138        0.31       442        0.49
  Forest Fires (2xT)  1032      13         442        1.27       138        0.78       442        2.34
  ncvoter             1000      19         775        2.47       193        1.63       775        2.07
  Diagnostics         120       8          37         0.08       17         0.06       37         0.06
  Abalone             4177      8          137        0.41       40         0.29       137        0.32
  CMC                 1473      9          1          0.56       1          0.49       1          0.52
  Hughes              401       12         3          0.12       3          0.17       3          0.06
  Servo               167       4          1          0.05       1          0.03       1          0.02
  Caulkins            1685      12         227        0.66       67         0.53       227        0.95

4 Experimental Evaluation

In this section we present a brief experimental comparison of both introduced approaches versus TANE [16], a state-of-the-art FD miner. TANE is a highly optimized apriori-based algorithm that generates a canonical cover of FDs. The goal of this evaluation is to study the comparative benefits of using FCA versus a traditional approach such as TANE. TANE was re-implemented for our experiments⁵. For the sake of repeatability, the code used to run the experiments has been made available on GitHub. Both the minimal cover miner (MCM)⁶ and the canonical cover miner (CCM)⁷ were implemented using the Python FCA library fca, available on PyPI⁸.

Experiments were performed over 12 datasets extracted from the UCI Machine Learning repository⁹, the JASA Data Archive¹⁰ and Metanome's repeatability Web page¹¹. Details on the number of rows and columns of each dataset are provided in the first two columns of Table 4. In addition to these datasets, we created synthetic versions by multiplying the rows or the columns of a given dataset. Experiments were performed on a virtual machine with 4 cores running at 2.7 GHz and equipped with 32 GB of RAM.

⁵ https://github.com/codocedo/tane/tree/cla18
⁶ https://github.com/codocedo/fd_miner/tree/cla18
⁷ https://github.com/codocedo/uis_miner/tree/cla18
⁸ https://pypi.org/project/fca/
⁹ https://archive.ics.uci.edu/ml/index.php
¹⁰ http://lib.stat.cmu.edu/jasadata/
¹¹ https://hpi.de/naumann/projects/repeatability/data-profiling/fds.html

4.1 Results & Discussion

Table 4 presents the main results of applying our approaches and TANE on each dataset to mine the minimal cover and the canonical cover, respectively. The table contains the execution times for each approach and the number of dependencies mined. Datasets in boldface represent those for which CCM or MCM performed substantially better than TANE. For the Mushroom dataset, TANE was not able to obtain results before running out of memory, so no information is provided in the table. Table 4 also reports on two synthetic datasets: PGLW (2xA) contains two horizontal copies of the PGLW dataset, resulting in twice as many attributes, and Forest Fires (2xT) contains two vertical copies of Forest Fires, resulting in twice as many tuples.
All canonical covers mined by TANE were reduced to a minimal cover to verify the consistency of our approach.

Out of the 12 datasets, CCM and MCM performed better on the largest ones (both in rows and columns). This is better illustrated in Figure 1, where datasets are represented as points in a number-of-rows vs. number-of-columns space. Circles represent datasets for which CCM or MCM performed better, while diamonds represent those where TANE did. Notice that the X axis is in logarithmic scale. The figure shows that most of the datasets where TANE performs better lie in the lower-left region of the figure, representing small datasets. FCA-based approaches perform better on datasets in all other regions, including the upper-right region, which contains datasets with many rows and columns.

The synthetic datasets in Table 4 give evidence that FCA scales better when duplicating the dataset. When duplicating attributes the difference is particularly dramatic, since TANE becomes over 13 times slower while our approach becomes only about 3 times slower. To study this further, we created two sets of synthetic datasets. The first set (the vertical set) was created by incrementally multiplying the Diagnostics dataset (with 8 attributes and 120 tuples) vertically, generating 19 versions with 240, 360, 480, ... tuples, up to a dataset containing 2400 tuples. The second set (the horizontal set) was created with the same idea but in a horizontal manner, generating 19 versions with 16, 24, 32, ... attributes, up to 160 attributes. Since most of the datasets of the second set were too big for TANE, they were truncated to 40 tuples.

Figure 2 depicts the execution times of CCM, MCM and TANE on the vertical set, i.e. when increasing the number of tuples. We can observe that all three approaches scale linearly w.r.t. the number of tuples, even if CCM and MCM seem to have a more stable behavior. Vertical multiplication of a dataset yields the same number of FDs as the original dataset, since the relations between attributes remain unchanged. Thus, we can assume that other algorithms based on TANE could achieve a performance similar to CCM or MCM, provided some optimizations.

On the other hand, this does not seem to be the case for the horizontal set. Figure 3 shows that CCM and MCM remain very stable when varying the number of attributes, while TANE's execution time grows exponentially. Indeed, this great difference in performance is due to the way in which we use FCA to find FDs, which differs from TANE's strategy. Using FCA, we calculate closures, which quickly group attribute copies and avoid unnecessary intersections. TANE, instead, computes each attribute combination, which explains the exponential growth of its computation time. We stress that this is not simply an extreme case from which our approach takes advantage, but rather a very good illustration of the benefits of using a closure operator to navigate the space of FDs. Closures enable CCM and MCM to avoid unnecessary computations not only when there are redundant attributes, but also whenever possible in the lattice of the powerset of attributes.
Finally, we do not discuss the differences between the CCM and MCM strategies, as these are detailed in [22].

5 Conclusions

We have presented a new characterization of a minimal cover of functional dependencies (FDs) by means of the stem base (or Duquenne-Guigues basis) of a partition pattern structure. We have presented the mechanisms through which this characterization can be exploited to efficiently mine the minimal cover. Furthermore, we have described the algorithms that implement these mechanisms and shown empirical evidence that our characterization performs better than a state-of-the-art FD miner, namely TANE, on larger databases containing many rows and columns.

Acknowledgments. This research work has been supported by the SGR2014-890 (MACDA) project of the Generalitat de Catalunya, and MINECO project APCOM (TIN2014-57226-P).

References

1. Alam, M., Buzmakov, A., Codocedo, V., Napoli, A.: Mining definitions from RDF annotations using formal concept analysis. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015. pp. 823–829. AAAI Press (2015)
2. Baixeries, J., Kaytoue, M., Napoli, A.: Computing Functional Dependencies with Pattern Structures. In: Concept Lattices and their Applications. pp. 175–186 (2012)
3. Baixeries, J., Kaytoue, M., Napoli, A.: Characterizing Functional Dependencies in Formal Concept Analysis with Pattern Structures. Annals of Mathematics and Artificial Intelligence 72(1–2), 129–149 (Oct 2014)
4. Bertet, K., Monjardet, B.: The multiple facets of the canonical direct unit implicational basis. Theoretical Computer Science 411(22-24), 2155–2166 (2010)
5. Caruccio, L., Deufemia, V., Polese, G.: Relaxed Functional Dependencies – A Survey of Approaches. IEEE Transactions on Knowledge and Data Engineering 28(1), 147–165 (2016)
6. Codocedo, V., Baixeries, J., Kaytoue, M., Napoli, A.: Characterization of Order-like Dependencies with Formal Concept Analysis. In: Concept Lattices and Their Applications. vol. 1624, pp. 123–134 (2016)
7. Codocedo, V., Napoli, A.: Lattice-based biclustering using Partition Pattern Structures. In: Proceedings of ECAI 2014. Frontiers in Artificial Intelligence and Applications, vol. 263, pp. 213–218. IOS Press (2014)
8. Fan, W.: Dependencies revisited for improving data quality. In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada. pp. 159–170 (2008)
9. Fan, W.: Data quality: From theory to practice. SIGMOD Record 44(3), 7–18 (2015)
10. Fan, W., Geerts, F.: Foundations of Data Quality Management. Synthesis Lectures on Data Management, Morgan & Claypool Publishers (2012)
11. Gainer-Dewar, A., Vera-Licona, P.: The minimal hitting set generation problem: Algorithms and computation. SIAM Journal on Discrete Mathematics 31(1), 63–100 (2017)
12. Ganter, B., Obiedkov, S.: Conceptual Exploration. Springer, Berlin (2016)
13. Ganter, B., Wille, R.: Formal Concept Analysis. Springer, Berlin (1999)
14. Ganter, B., Kuznetsov, S.O.: Pattern Structures and their projections. In: Delugach, H.S., Stumme, G. (eds.) Conceptual Structures: Broadening the Base. LNCS, vol. 2120, pp. 129–142. Springer (2001)
15. Guigues, J.L., Duquenne, V.: Familles minimales d'implications informatives résultant d'un tableau de données binaires.
    Mathématiques et Sciences Humaines 95, 5–18 (1986)
16. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. Computer Journal 42(2), 100–111 (1999)
17. Korth, H.F., Silberschatz, A.: Database System Concepts. McGraw-Hill, Inc., New York, NY, USA (1986)
18. Kuznetsov, S.O.: Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research. In: Formal Concept Analysis, Foundations and Applications. pp. 196–225. Lecture Notes in Computer Science 3626, Springer (2005)
19. Maier, D.: Theory of Relational Databases. Computer Science Press (1983)
20. Mannila, H., Räihä, K.J.: The Design of Relational Databases. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1992)
21. Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.P., Schönberg, M., Zwiener, J., Naumann, F.: Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment 8(10), 1082–1093 (2015)
22. Ryssel, U., Distel, F., Borchmann, D.: Fast algorithms for implication bases and attribute exploration using proper premises. Annals of Mathematics and Artificial Intelligence 70(1), 25–53 (2014)
23. Ullman, J.: Principles of Database Systems and Knowledge-Based Systems, volumes 1–2. Computer Science Press, Rockville (MD), USA (1989)
24. Yu, Y., Heflin, J.: Extending functional dependency to detect abnormal data in RDF graphs. In: The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I. pp. 794–809 (2011)
25. Yu, Y., Li, Y., Heflin, J.: Detecting abnormal semantic web data using semantic dependency. In: Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC 2011), Palo Alto, CA, USA, September 18-21, 2011. pp. 154–157. IEEE Computer Society (2011)