=Paper=
{{Paper
|id=Vol-2123/paper23
|storemode=property
|title=Characterizing Covers of Functional Dependencies using FCA
|pdfUrl=https://ceur-ws.org/Vol-2123/paper23.pdf
|volume=Vol-2123
|authors=Víctor Codocedo,Jaume Baixeries,Mehdi Kaytoue,Amedeo Napoli
|dblpUrl=https://dblp.org/rec/conf/cla/CodocedoBKN18
}}
==Characterizing Covers of Functional Dependencies using FCA ==
Víctor Codocedo¹, Jaume Baixeries², Mehdi Kaytoue³, and Amedeo Napoli⁴

¹ Universidad Técnica Federico Santa María, Campus San Joaquín, Santiago de Chile
² Universitat Politècnica de Catalunya, 08032 Barcelona, Catalonia
³ Infologic, 99 Avenue de Lyon, F-26500 Bourg-lès-Valence, France
⁴ LORIA (CNRS - Inria Nancy Grand Est - Université de Lorraine), B.P. 239, F-54506 Vandœuvre-lès-Nancy

Corresponding author: victor.codocedo@inf.utfsm.cl
Abstract. Functional dependencies (FDs) can be used for various important operations on data, for instance, checking the consistency and the quality of a database (including databases that contain complex data). Consequently, a generic framework that allows mining a sound, complete, non-redundant and yet compact set of FDs is an important tool for many different applications. There are different definitions of such sets of FDs (usually called covers). In this paper, we present the characterization of two different kinds of covers for FDs in terms of pattern structures. The convenience of such a characterization is that it allows for an easy implementation of efficient mining algorithms which can later be easily adapted to other kinds of similar dependencies. Finally, we present empirical evidence that the proposed approach can perform better than a state-of-the-art FD miner on large databases.
1 Introduction
Functional Dependencies (FDs) are a keystone of the relational database model, since they allow checking the consistency and maintaining the quality of a database [8, 10, 9], and guiding its design [20]. In addition, they have been used to study information integration in the Web of data with varying degrees of quality [24, 25], or to check data completeness in DBpedia [1]. Therefore, the computation of a succinct representation of a set of FDs (usually referred to as a cover) is of interest to various fields of knowledge discovery and representation, especially if this computation can easily be extended to other kinds of dependencies (e.g. relaxed versions of FDs [5]).
The computation of FD covers is a popular topic in the database literature. As a reference, in [21], seven algorithms to mine FDs and compute their covers are reviewed and grouped into three families: lattice traversal algorithms, difference/agree sets algorithms, and dependency induction algorithms. The characterization of FDs with FCA and pattern structures is also presented in [2] and
c paper author(s), 2018. Proceedings volume published and copyrighted by its editors.
Paper published in Dmitry I. Ignatov, Lhouari Nourine (Eds.): CLA 2018, pp.
279–290, Department of Computer Science, Palacký University Olomouc, 2018.
Copying permitted only for private and academic purposes.
a generalization to other types of FDs is given in [6]. For a detailed review of the characterization of FDs with FCA, see [3].
In this paper, we characterize the representations of FD covers using pattern structures, an extension of FCA that deals with complex object representations [18]. On the one hand, we adapt the definition of the canonical direct basis of implications with proper premises [4, 22] to the formalism of pattern structures, in order to prove that this basis is equivalent to a reduced non-redundant set of FDs, better known as the canonical cover (Section 3.1). We show that the canonical cover can be characterized using the arrow relation (↙) between an attribute and a pattern defined over a partition pattern structure (PPS). On the other hand, we discuss the relation between the stem base of implications [12] and the minimal cover of dependencies, a sound, complete, non-redundant set of FDs that has minimum cardinality w.r.t. any other equivalent cover (Section 3.2). We show that the latter can be characterized by the pseudo-extents of a PPS.
Finally, in Section 4 we empirically compare these two ways of computing FD covers with the algorithm TANE [16]. This algorithm is one of the most efficient FD mining algorithms and, according to [21], is the base for the family of "lattice traversal algorithms", serving as the baseline to validate our approach against a state-of-the-art FD miner.
2 Theoretical Background
Let U be an ordered set of attributes, and let Dom be a set of values (a domain). For the sake of simplicity, we assume that Dom is a numerical set. A tuple t is a function t : U → Dom, and a table T is a set of tuples. Usually a table is represented as a matrix, as in Table 1, where the set of tuples (or objects) is T = {t1, t2, ..., t7} with attributes U = {a, b, c, d, e}. We use table, dataset and set of tuples as equivalent terms. We overload the functional notation of a tuple in such a way that, given a tuple t ∈ T and X ⊆ U, t(X) is the tuple with the values of t in the ordered attributes xi ∈ X, defined as t(X) = ⟨t(x1), t(x2), ..., t(xn)⟩. For example, t2({a, c}) = ⟨t2(a), t2(c)⟩ = ⟨2, 1⟩. In this article, the set notation is dropped: instead of {a, b} we write ab.
2.1 Functional Dependencies and their Covers
Definition 1 ([23]). Let T be a set of tuples, and X, Y ⊆ U. A functional dependency (FD) X → Y holds in T if:

∀t, t′ ∈ T : t(X) = t′(X) ⟹ t(Y) = t′(Y)

For instance, the functional dependency d → e holds in T (Table 1), whereas the functional dependency e → d does not hold, since t4(e) = t5(e) but t4(d) ≠ t5(d).
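Definition 1 suggests a direct check: group the tuples by their values on X and verify that each group agrees on Y. The following is a small sketch of ours (not code from the paper; `holds` is an assumed name), with Table 1 hard-coded:

```python
# Table 1, one dict per tuple t1..t7.
T = [dict(zip('abcde', row)) for row in
     [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
      (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]]

def holds(T, X, Y):
    """Return True iff the FD X -> Y holds in the set of tuples T."""
    groups = {}
    for t in T:
        key = tuple(t[x] for x in X)
        val = tuple(t[y] for y in Y)
        # setdefault stores the first Y-projection seen for this X-projection;
        # any later disagreement is a counterexample to the FD.
        if groups.setdefault(key, val) != val:
            return False
    return True

print(holds(T, ['d'], ['e']), holds(T, ['e'], ['d']))  # True False
```

The second call fails exactly on the pair t4, t5 mentioned above.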
For a given set of functional dependencies F, we can use the three Armstrong axioms (reflexivity, augmentation and transitivity) to derive a larger set of FDs [20]. We will call F* the set of FDs derived from F by reflexivity and augmentation, and F⁺ the set of FDs derived by reflexivity, augmentation and transitivity. Two sets of FDs F and H are said to be equivalent, F ≡ H, if and only if F⁺ = H⁺.
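Membership in F⁺ is usually decided with the classical attribute-closure algorithm, which repeatedly fires FDs whose left-hand side is already covered. A minimal sketch of ours (`closure` is an assumed name; the FD set below is a set of FDs that hold in Table 1):

```python
def closure(X, fds):
    """Attributes derivable from X given FDs as (lhs, rhs) frozenset pairs."""
    closed = set(X)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # fire an FD whose left-hand side is already in the closure
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

# Some FDs that hold in Table 1: c -> be, d -> e, be -> c.
F = [(frozenset('c'), frozenset('be')),
     (frozenset('d'), frozenset('e')),
     (frozenset('be'), frozenset('c'))]

# bd -> c follows by transitivity (bd -> bde -> c):
print(sorted(closure({'b', 'd'}, F)))  # ['b', 'c', 'd', 'e']
```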
Let F be a set of FDs from a database T. F is said to be sound if all FDs in F hold in T. In addition, F is said to be complete if all FDs that hold in T can be derived from F. Let X → Y be any FD in F; then F is said to be non-redundant if F \ {X → Y} ≢ F, and non-redundant w.r.t. augmentation iff (F \ {X → Y})* ≠ F*.
A set F is said to be left-reduced if for all X → Y ∈ F and Z ⊊ X we have that (F \ {X → Y}) ∪ {Z → Y} ≢ F. Dually, it is said to be right-reduced if for all X → Y ∈ F and Z ⊊ Y we have that (F \ {X → Y}) ∪ {X → Z} ≢ F. F is said to be reduced if it is simultaneously left- and right-reduced.
Let F be a reduced set of FDs; then G = {X → y | X → Y ∈ F, y ∈ Y} is the splitting of F, and G ≡ F. The splitting of a reduced set is called a canonical cover. A canonical cover is a left-reduced set of FDs, non-redundant w.r.t. augmentation, in which every FD has a single element in its right-hand side (a split set) [19]. A different definition presents the canonical cover in a saturated version requiring uniqueness of the left-hand sides [17], losing the single-element condition on the right-hand side. For the sake of compatibility with FCA implication covers we favor the first definition. Notice that canonical covers can be redundant w.r.t. transitivity. For example, the canonical cover of Table 1 contains {c → b, c → e, d → e, bd → c, be → c}, where bd → c is redundant w.r.t. transitivity, as bd → bde → c.
Finally, a set F is said to be a minimum cover if it has as few FDs as any other equivalent set of FDs. For example, the minimum cover of Table 1 contains the FDs {c → be, d → e, be → c}. Notice that the minimum cover is not required to be reduced, so it is not presented with split sets. Notice also that, once split, the minimum cover contains exactly one fewer FD than the canonical cover, namely bd → c.
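The redundancy removal just described can be made mechanical: drop every FD whose right-hand side already follows from the remaining ones, then merge FDs that share a left-hand side. The following sketch (ours, not the paper's algorithm) recovers the minimum cover of Table 1 from its canonical cover; in general, computing a minimum cover needs additional care with equivalent left-hand sides.

```python
def closure(X, fds):
    """Attributes derivable from X under the FDs (Armstrong closure)."""
    closed, changed = set(X), True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed

# Canonical cover of Table 1.
canonical = [(frozenset('c'), frozenset('b')),
             (frozenset('c'), frozenset('e')),
             (frozenset('d'), frozenset('e')),
             (frozenset('bd'), frozenset('c')),
             (frozenset('be'), frozenset('c'))]

# Drop every FD whose right-hand side follows from the rest (bd -> c goes).
cover = list(canonical)
for fd in canonical:
    rest = [g for g in cover if g != fd]
    if fd[1] <= closure(fd[0], rest):
        cover = rest

# Merge FDs sharing a left-hand side to present the minimum cover.
merged = {}
for lhs, rhs in cover:
    merged[lhs] = merged.get(lhs, frozenset()) | rhs
print(merged)  # c -> be, d -> e, be -> c
```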
2.2 Formal Concept Analysis, Implication Systems and FDs
For the sake of brevity we do not provide a full description of the Formal Concept Analysis (FCA) framework. The notation used in this article follows [13], where K = (G, M, I) is a formal context with objects G, attributes M and incidence relation I, and formal concepts are pairs (A, B) with A′ = B and B′ = A.
Implications are relations established between attribute sets of a formal context K. Implications are analogous to FDs, and they can be used to characterize them [2]. Because of this, we will notate an implication similarly to an FD. Implication systems (sets of implications) can also characterize FD covers [4].
An implication X → Y holds in K for X, Y ⊆ M if Y ⊆ X″. Let T be a set of tuples and U a set of attributes in a table (such as the one in Table 1). We define the set Pair(T), the set of all subsets of T with exactly two elements, and the incidence set I such that ((ti, tj), x) ∈ I ⟺ ti[x] = tj[x], for all x ∈ U and all ti, tj ∈ T. It can be shown that an FD X → Y holds in the database if and only if X → Y is an implication of the formal context K = (Pair(T), U, I) [13].
K is called the binary codification of table T. For example, Table 3 contains the binary codification of Table 1. In Table 3 we can observe the implication d → e, which can be verified as an FD in Table 1.
The previous statement entails that FDs and implications in K are in 1-1 correspondence. Moreover, the corresponding definition of a canonical cover of FDs is equivalent to that of a canonical-direct unitary basis of implications, as shown in [4], where the equivalent left-minimal basis is described as containing minimal functional dependencies, i.e. those in a canonical cover.
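The binary codification is easy to sketch (our own code, not the paper's; `implication_holds` is an assumed name): objects are pairs of tuple indices, and each pair is incident to the attributes on which the two tuples agree, which reproduces the crosses of Table 3.

```python
from itertools import combinations

# Table 1, one dict per tuple t1..t7 (index 0 is t1).
T = [dict(zip('abcde', row)) for row in
     [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
      (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]]

# I maps each pair of tuple indices to the attributes they agree on.
I = {(i, j): {x for x in 'abcde' if T[i][x] == T[j][x]}
     for i, j in combinations(range(len(T)), 2)}

def implication_holds(X, Y):
    """X -> Y holds iff every pair agreeing on all of X also agrees on Y."""
    return all(set(Y) <= agree for agree in I.values() if set(X) <= agree)

print(implication_holds('d', 'e'), implication_holds('e', 'd'))  # True False
```

The row for (t1, t2), for instance, comes out as {b, c, d, e}, matching Table 3.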
Table 1: Example of a table T

        a  b  c  d  e
   t1   1  1  1  1  1
   t2   2  1  1  1  1
   t3   3  1  2  2  2
   t4   3  2  3  2  2
   t5   3  1  2  3  2
   t6   1  1  2  2  2
   t7   1  1  2  4  2

Table 2: Representation context

                                      a  b  c  d  e
δ(a)                                  ×
δ(b)                                     ×
δ(e)                                              ×
δ(a) ⊓ δ(b)                           ×  ×  ↙     ↙
δ(a) ⊓ δ(e)                           ×           ×
δ(d) ⊓ δ(e)                                    ×  ×
δ(a) ⊓ δ(d) ⊓ δ(e)                    ×  ↙  ↙  ×  ×
δ(b) ⊓ δ(c) ⊓ δ(e)                       ×  ×     ×
δ(a) ⊓ δ(b) ⊓ δ(c) ⊓ δ(e)             ×  ×  ×  ↙  ×
δ(b) ⊓ δ(c) ⊓ δ(d) ⊓ δ(e)             ↙  ×  ×  ×  ×
δ(a) ⊓ δ(b) ⊓ δ(c) ⊓ δ(d) ⊓ δ(e)      ×  ×  ×  ×  ×

Table 3: Binary codification of Table 1

             a  b  c  d  e
(t1, t2)        ×  ×  ×  ×
(t1, t3)        ×
(t1, t4)
(t1, t5)        ×
(t1, t6)     ×  ×
(t1, t7)     ×  ×
(t2, t3)        ×
(t2, t4)
(t2, t5)        ×
(t2, t6)        ×
(t2, t7)        ×
(t3, t4)     ×        ×  ×
(t3, t5)     ×  ×  ×     ×
(t3, t6)        ×  ×  ×  ×
(t3, t7)        ×  ×     ×
(t4, t5)     ×           ×
(t4, t6)              ×  ×
(t4, t7)                 ×
(t5, t6)        ×  ×     ×
(t5, t7)        ×  ×     ×
(t6, t7)     ×  ×  ×     ×
2.3 Partition Pattern Structures
A Partition Pattern Structure (PPS) is a type of pattern structure [14] that deals with, as the name suggests, object representations in the form of set partitions. PPSs have been shown to be useful for mining biclusters [7] and, more importantly, relations between partition pattern concepts have been used to characterize FDs of different kinds [3].
The formalization of a database T with attributes U as a PPS is as follows. A partition d of T is a set d ⊆ ℘(T) (the powerset of T) of disjoint non-empty subsets of T, such that for any two different elements Ki, Kj ∈ d we have Ki ∩ Kj = ∅, and ⋃_{K∈d} K = T. Let D be the set of all possible partitions of T; then any two partitions d1, d2 ∈ D can be ordered by a coarser/finer relation, denoted d1 ⊑ d2 (d1 is finer than d2) iff (∀Ki ∈ d1)(∃Kj ∈ d2) such that Ki ⊆ Kj. The similarity operator is defined as d1 ⊓ d2 = {Ki ∩ Kj | Ki ∈ d1, Kj ∈ d2}. From this, it follows that (D, ⊑) is a complete lattice with supremum ⊤ = {T} (a single class containing all tuples) and infimum ⊥ = {{t} | t ∈ T}.
The set of attributes U is mapped onto D through a function δ which, for a given attribute x ∈ U, yields a partition δ(x) ∈ D representing the equivalence classes over T induced by the values of attribute x. With this, we can configure the PPS (U, (D, ⊑), δ) with derivation operators X∘ = ⊓_{x∈X} δ(x), denoting the equivalence relations implied by the attributes in X, and d∘ = {x ∈ U | δ(x) ⊑ d}, denoting all attributes with associated equivalence relations finer than d. (X, d) is a partition pattern concept when X∘ = d and d∘ = X.
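These operations can be sketched in a few lines (a toy illustration of ours, not the paper's implementation; `delta`, `meet` and `finer` are our own names), with Table 1 hard-coded and tuples identified by their index:

```python
# Table 1, one dict per tuple t1..t7 (index 0 is t1).
T = [dict(zip('abcde', row)) for row in
     [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
      (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]]

def delta(x):
    """Partition of tuple indices induced by the values of attribute x."""
    classes = {}
    for i, t in enumerate(T):
        classes.setdefault(t[x], set()).add(i)
    return {frozenset(c) for c in classes.values()}

def meet(d1, d2):
    """Similarity d1 ⊓ d2: non-empty pairwise intersections of classes."""
    return {k1 & k2 for k1 in d1 for k2 in d2 if k1 & k2}

def finer(d1, d2):
    """d1 ⊑ d2: every class of d1 is contained in some class of d2."""
    return all(any(k1 <= k2 for k2 in d2) for k1 in d1)

# delta(d) is finer than delta(e), witnessing the FD d -> e; the converse
# fails, matching the fact that e -> d does not hold in Table 1.
print(finer(delta('d'), delta('e')), finer(delta('e'), delta('d')))  # True False
```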
Theorem 1 (Proposition 2 in [3]). Let (Pair(T), U, I) be the binary codification of a table within a database, and (U, (D, ⊑), δ) its PPS representation:

(W, X) ∈ (Pair(T), U, I) ⟺ (X, d) ∈ (U, (D, ⊑), δ)

The proof of Theorem 1 can be found in [3]. Theorem 1 presents an important property of the PPS: X ⊆ U is an extent in (U, (D, ⊑), δ) if and only if it is also an intent in (Pair(T), U, I) (the relation between the set of tuple pairs W ⊆ Pair(T) and the partition d ∈ D can be formalized as well, but is of no interest to our development). Theorem 1 is very important since it entails that the lattices derived from (Pair(T), U, I) and (U, (D, ⊑), δ) are isomorphic. Consequently, an implication X → Y in (Pair(T), U, I) can be found as a relation between extents in (U, (D, ⊑), δ) (an extent implication) such that Y ⊆ X∘∘.
3 Covers and Pattern Structures
In this section we present two different kinds of covers for FDs, canonical covers (Section 3.1) and minimal covers (Section 3.2), as well as their characterization in terms of pattern structures. This section uses existing, well-known results in FCA, which are reviewed here for the sake of readability.
Section 3.1 presents how a canonical cover of FDs can be computed using a PPS, according to the results in [4, 22]. Section 3.2 introduces a novel characterization of the minimum cover of FDs by means of pseudo-extents of a PPS [13]. The interest of these results is not limited to computing FD covers; they also extend to generalizations of FDs [3, 6].
3.1 Characterizing a Canonical Cover of FDs
The characterization of a canonical cover of FDs using FCA is straightforward. In a nutshell, a canonical cover of FDs is analogous to a canonical direct unitary basis of implications [4], as presented in Section 2.2. In this section we recall some of these ideas and show how they can be simply adapted to the framework of PPS.
Firstly, let (U, (D, ⊑), δ) be a PPS. We define D̂ = {d ∈ D | d∘∘ = d} as the set of all closed partition patterns in D. The formal context (U, D̂, J) with (d, x) ∈ J ⟺ d ⊑ δ(x) is called a representation context of (U, (D, ⊑), δ), and their corresponding concept lattices are isomorphic [18]. For the sake of readability of the following definitions, we use the transposed representation context (D̂, U, J) instead of (U, D̂, J) (Table 2). By transitivity of equivalence, it is clear that (D̂, U, J) is isomorphic to (Pair(T), U, I) as defined in Section 2.3 and, as such, implications in (D̂, U, J) correspond to FDs. For example, Table 2 (ignoring the arrow entries ↙) contains the representation context (D̂, U, J) of the PPS derived from Table 1. Notice that its objects are closed intersections of object representations; e.g. δ(d) does not appear since δ(d)∘∘ = δ(d) ⊓ δ(e). With this, the canonical direct basis of implications in (D̂, U, J) (and thus the canonical cover of FDs) is determined by the set of proper premises of the elements of U.
Theorem 2 (Corollary 1 in [22]). X ⊆ U \ {y} is a premise of y ∈ U iff X is a hypergraph transversal of H_y↙, defined as:

H_y↙ = {(U \ {y}) \ d′ | d ∈ D̂, d ↙ y}

The set of all proper premises of y is the set of minimal hypergraph transversals Tr(H_y↙).
A detailed description of the development of Theorem 2 can be found in [22]. Providing a formal definition of hypergraph transversals is out of the scope of this article; however, we briefly mention that this formalization can also be made considering set collections (instead of hypergraphs) and minimal hitting sets (instead of minimal hypergraph transversals) [11]. This problem is also analogous to the vertex cover problem [12].
Theorem 2 provides a formal description of the proper premises of a given attribute y ∈ U, which in turn yield the canonical cover of functional dependencies. However, this approach requires the creation of the representation context, which is an intermediate step in the overall calculation process. Actually, by analyzing the arrow relation between d and y, we can observe that the representation context is not necessary. Consider that in (D̂, U, J), d′ = {x ∈ U | (d, x) ∈ J} = {x ∈ U | d ⊑ δ(x)}, and thus d′ coincides with d∘ for any d ∈ D̂. Moreover, in (U, (D, ⊑), δ), d∘ ⊊ h∘ ⟺ h ⊏ d, since h = h∘∘ and d = d∘∘. With this, we can rewrite the arrow definition as follows:

d ↙ y ⟺ (d, y) ∉ J and, if d′ ⊊ h′, then (h, y) ∈ J
      ⟺ d ⋢ δ(y) and, if d∘ ⊊ h∘, then h ⊑ δ(y)
      ⟺ d ⋢ δ(y) and, if h ⊏ d, then h ⊑ δ(y)

The last result shows that d ↙ y in (D̂, U, J) can be defined directly over the PPS. Intuitively, this definition corresponds to y ↗ d in (U, D̂, J) and thus in (U, (D, ⊑), δ). With these elements we can finally propose a characterization of the canonical cover of functional dependencies in (U, (D, ⊑), δ) as follows.
Corollary 1. Let (U, (D, ⊑), δ) be a partition pattern structure and let Tr(H) denote the set of minimal transversals of a hypergraph H. Then, with

L_cc = {X → y | y ∈ U, X ∈ Tr(H_y↗)}
H_y↗ = {(U \ {y}) \ d′ | d ∈ D̂, y ↗ d}
y ↗ d ⟺ d ⋢ δ(y) and, if h ⊏ d, then h ⊑ δ(y)

L_cc is a canonical cover of functional dependencies.
For the running example, let us calculate the proper premises of attribute c using the arrow relations in Table 2. We have c ↗ (δ(a) ⊓ δ(b)) and c ↗ (δ(a) ⊓ δ(d) ⊓ δ(e)). With this, we have the hypergraph H_c↗ = {{d, e}, {b}}, for which the set of minimal transversals is Tr(H_c↗) = {{b, d}, {b, e}}. Correspondingly, we have the FDs bd → c and be → c, which are included in the canonical cover.
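This last step can be reproduced with a brute-force sketch (ours; a far cry from the optimized algorithms of [22], and `minimal_transversals` is our own name): enumerate candidate attribute subsets by increasing size and keep the inclusion-minimal ones that hit every edge of H_c↗.

```python
from itertools import combinations

def minimal_transversals(edges, universe):
    """All inclusion-minimal subsets of `universe` hitting every edge."""
    hits = []
    for r in range(1, len(universe) + 1):
        for cand in combinations(sorted(universe), r):
            s = set(cand)
            # keep s if it hits every edge and contains no smaller transversal
            if all(s & e for e in edges) and not any(h <= s for h in hits):
                hits.append(s)
    return hits

# H_c for attribute c in the running example (the universe is U \ {c}):
H_c = [{'d', 'e'}, {'b'}]
print(minimal_transversals(H_c, {'a', 'b', 'd', 'e'}))
# the two transversals {b, d} and {b, e} yield the FDs bd -> c and be -> c
```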
3.2 Characterizing a Minimal Cover of FDs
We introduce a novel characterization of a minimal cover of FDs by means of pseudo-intents, and its generalization using pseudo-extents of a PPS.
The stem base of implications, or Duquenne-Guigues basis [15], is a sound, complete and non-redundant basis which also has minimum cardinality among all equivalent sets of implications of a given formal context. We show how this can be used to characterize a minimal cover of FDs in a rather simple manner. Prior to introducing the stem base, let us define pseudo-closed sets [13].
Definition 2 (Pseudo-closed sets). Let P ↦ P″ be a closure operator over a set M. Then P is a pseudo-closed set if and only if:

P ≠ P″   (1)
Q ⊊ P and Q is a pseudo-closed set ⟹ Q″ ⊆ P   (2)
Given a formal context (G, M, I), pseudo-closed sets A ⊆ G are called pseudo-extents, while pseudo-closed sets B ⊆ M are called pseudo-intents. The stem base of implications, or Duquenne-Guigues basis, can be defined as follows:

Theorem 3 (Duquenne-Guigues Basis [13]). The set of implications

L = {P → P″ | P is a pseudo-intent}   (3)

is sound, complete and non-redundant.

Theorem 4 ([13]). Every complete set of implications contains an implication X → Y with X″ = P″ for every pseudo-intent P of (G, M, I).
Theorem 4 entails that the stem base of implications has minimum cardinality with respect to any equivalent set of implications of (G, M, I). With this, and the previous observation that FDs are in 1-1 correspondence with implications in (Pair(T), U, I), we can derive the following corollary.

Corollary 2. Let (Pair(T), U, I) be the binary codification of a table with tuples T and attributes U. The set of FDs L = {P → P″ | P is a pseudo-intent} is a minimal cover.
Corollary 2 provides a novel characterization of the minimal cover of FDs through the pseudo-intents of a formal context. Given the existing relation between (Pair(T), U, I) and (U, (D, ⊑), δ), we can generalize the characterization of the minimal cover over the latter. Observe that in the PPS (U, (D, ⊑), δ), we maintain the notion of pseudo-extents for pseudo-closed sets X ⊆ U with the closure operator X ↦ X∘∘.
Fig. 1: Datasets and the number of rows and columns they contain.
Fig. 2: Performance comparison when increasing tuples: FCA vs TANE.
Fig. 3: Performance comparison when increasing attributes: FCA vs TANE.
Proposition 1. Let (U, (D, ⊑), δ) be the PPS representation of a database. Then the set of functional dependencies L_mc defined below is a minimal cover:

L_mc = {X → X∘∘ | X is a pseudo-extent}   (4)

The proof of Proposition 1 follows from Theorem 1 and the fact that for a set X ⊆ U the closure operator X ↦ X″ is exactly equivalent to X ↦ X∘∘; consequently, the set of pseudo-intents in (Pair(T), U, I) is the same as the set of pseudo-extents in (U, (D, ⊑), δ). Thus, because of Corollary 2, Proposition 1 holds.
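Proposition 1 relies on the closure X ↦ X∘∘, which is easy to sketch over Table 1 (our own helper names; an illustration, not the paper's miner): X∘ is the meet of the attribute partitions, and X∘∘ collects every attribute whose partition is refined by that meet.

```python
from functools import reduce

# Table 1, one dict per tuple t1..t7 (index 0 is t1).
T = [dict(zip('abcde', row)) for row in
     [(1, 1, 1, 1, 1), (2, 1, 1, 1, 1), (3, 1, 2, 2, 2), (3, 2, 3, 2, 2),
      (3, 1, 2, 3, 2), (1, 1, 2, 2, 2), (1, 1, 2, 4, 2)]]

def delta(x):
    """Partition of tuple indices induced by the values of attribute x."""
    classes = {}
    for i, t in enumerate(T):
        classes.setdefault(t[x], set()).add(i)
    return {frozenset(c) for c in classes.values()}

def meet(d1, d2):
    """Similarity d1 ⊓ d2: non-empty pairwise intersections of classes."""
    return {k1 & k2 for k1 in d1 for k2 in d2 if k1 & k2}

def refines(d1, d2):
    """d1 ⊑ d2: every class of d1 is contained in some class of d2."""
    return all(any(k1 <= k2 for k2 in d2) for k1 in d1)

def pps_closure(X):
    d = reduce(meet, (delta(x) for x in X))            # X∘
    return {y for y in 'abcde' if refines(d, delta(y))}  # X∘∘

print(sorted(pps_closure({'b', 'd'})))  # ['b', 'c', 'd', 'e']: bd -> c
```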
Table 4: Dataset details, execution times in seconds, and number of mined rules for CCM (Canonical Cover Miner), MCM (Minimal Cover Miner) and TANE. CC: Canonical Cover, MC: Minimal Cover. Datasets in boldface represent those in which FCA performed better than TANE.

                                         CCM                MCM                TANE
Dataset             #Tuples  #Attrs   #CC Deps  Time[s]  #MC Deps  Time[s]  #CC Deps  Time[s]
Mushroom               8124      22       3605    23887      1509    12684         -        -
Adult                 48842      14         78    90.41        42    71.55        78   123.13
Credit                  690      15       1099     3.07       253     1.82      1099     2.54
PGLW                  17995       6          5     0.67         2     0.35         5     0.48
PGLW (2xA)            17995      12         38     1.81        15     1.18        38     7.45
Forest Fires            516      13        442     0.46       138     0.31       442     0.49
Forest Fires (2xT)     1032      13        442     1.27       138     0.78       442     2.34
ncvoter                1000      19        775     2.47       193     1.63       775     2.07
Diagnostics             120       8         37     0.08        17     0.06        37     0.06
Abalone                4177       8        137     0.41        40     0.29       137     0.32
CMC                    1473       9          1     0.56         1     0.49         1     0.52
Hughes                  401      12          3     0.12         3     0.17         3     0.06
Servo                   167       4          1     0.05         1     0.03         1     0.02
Caulkins               1685      12        227     0.66        67     0.53       227     0.95
4 Experimental Evaluation
In this section we present a brief experimental comparison of both introduced
approaches versus TANE [16], a state-of-the-art FD miner. TANE is a highly
optimized apriori -based algorithm that generates a canonical cover of FDs. The
goal of this evaluation is to study the comparative benefits of using FCA versus
a traditional approach such as TANE. TANE was re-implemented for our ex-
periments5 . For the sake of repeatibility, the code used to run the experiments
was made available at GitHub. Both the minimal cover miner (MCM) 6 , and
the canonical cover miner (CCM)7 were implemented using Python’s pipy FCA
library fca8 .
Experiments were performed over 12 datasets extracted from the UCI Ma-
chine Learning repository9 , the JASA Data Archive10 and Metanome’s repeata-
bility Web page11 . Details on the number of rows and columns for each dataset
are provided in the first two columns of Table 4. In addition to these datasets,
we created synthetic versions by multiplying the rows or the columns of a given
dataset. Experiments were performed on a virtual machine with 4 cores running
at 2.7 Ghz and equipped with 32 GB of RAM memory.
4.1 Results & Discussion
Table 4 presents the main results of applying our approach and TANE on each
dataset to mine the Minimal Cover and the Canonical Cover, respectively. The
table contains the execution times for each approach and the number of depen-
dencies mined. Datasets in boldface represent those for which CCM or MCM
performed substantially better than TANE. For the Mushroom dataset, TANE
was not able to obtain results before running out of memory, thereby no infor-
mation is provided in the table. Table 4 also reports in two synthetic datasets,
namely PGLW (2xA) which contains two horizontal copies of the PGLW dataset
resulting in twice as many attributes. Forest Fires (2xT) contains two vertical
copies of Forest Fires resulting in twice as many tuples. All Canonical Covers
mined by TANE have been reduced to a Minimal Cover to verify the consistency
of our approach.
Out of the 12 datasets, CCM and MCM performed better in the largest
(both in rows and columns). This is better illustrated in Figure 1 where datasets
are represented as points in a number of rows-number of columns space. Circles
represent datasets for which CCM or MCM performed better while diamonds,
where TANE did. Notice that the X axis is provided in logarithmic scale. The
figure shows that most of the datasets where TANE performs better are in
⁵ https://github.com/codocedo/tane/tree/cla18
⁶ https://github.com/codocedo/fd_miner/tree/cla18
⁷ https://github.com/codocedo/uis_miner/tree/cla18
⁸ https://pypi.org/project/fca/
⁹ https://archive.ics.uci.edu/ml/index.php
¹⁰ http://lib.stat.cmu.edu/jasadata/
¹¹ https://hpi.de/naumann/projects/repeatability/data-profiling/fds.html
the lower-left region of the figure, representing small datasets. FCA-based approaches perform better on datasets in all other regions, including the upper-right, which contains datasets with many rows and columns.
The synthetic datasets in Table 4 show evidence that FCA scales better when duplicating the dataset. When duplicating attributes the difference is particularly dramatic, since TANE becomes over 13 times slower while our approach becomes only about 3 times slower. To study this further, we created two sets of synthetic datasets. The first set (the vertical set) was created by incrementally multiplying the Diagnostics dataset (with 8 attributes and 120 tuples) vertically, generating 19 versions with 240, 360, 480 tuples, and so on, up to a dataset containing 2400 tuples. The second set (the horizontal set) was created using the same idea but in a horizontal manner, generating 19 versions with 16, 24, 32, and up to 160 attributes. Since most of the datasets of the second set were too big for TANE, they were truncated to 40 tuples.
Figure 2 depicts the execution times of CCM, MCM and TANE on the vertical set, i.e. when increasing the number of tuples. We can observe that all three approaches scale linearly w.r.t. the number of tuples, even if CCM and MCM seem to have a more stable behavior. Vertical multiplication of a dataset yields the same number of FDs as the original, since the relations between attributes remain unchanged. Thus, we can assume that other algorithms based on TANE could achieve a performance similar to CCM or MCM given some optimizations.
On the other hand, this does not seem to be the case for the horizontal set. Figure 3 shows that CCM and MCM remain very stable when varying the number of attributes, while TANE's execution time grows exponentially. Indeed, this great difference in performance is due to the way in which we use FCA to find FDs, which differs from TANE's strategy. Using FCA we calculate closures, which quickly group attribute copies, avoiding unnecessary intersections. Instead, TANE computes each attribute combination, which produces the exponential growth in computation time. We stress that this is not simply an extreme case from which our approach takes advantage, but actually a very good illustration of the benefits of using a closure operator to navigate the space of FDs. Closures enable CCM and MCM to avoid unnecessary computations not only when there are redundant attributes, but also wherever possible in the lattice of the powerset of attributes. Finally, we do not discuss the differences between the CCM and MCM strategies, as these are detailed in [22].
5 Conclusions
We have presented a new characterization of a minimal cover of functional dependencies (FDs) by means of the stem base (or Duquenne-Guigues basis) of a partition pattern structure. We have presented the mechanisms through which this characterization can be exploited to efficiently mine the minimal cover. Furthermore, we have described the algorithms that implement these mechanisms and shown empirical evidence that our characterization performs better than a state-of-the-art FD miner, namely TANE, on larger databases containing many rows and columns.
Acknowledgments. This research work has been supported by the SGR2014-890
(MACDA) project of the Generalitat de Catalunya, and MINECO project APCOM
(TIN2014-57226-P).
References
1. Alam, M., Buzmakov, A., Codocedo, V., Napoli, A.: Mining definitions from RDF annotations using formal concept analysis. In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015. pp. 823–829. AAAI Press (2015)
2. Baixeries, J., Kaytoue, M., Napoli, A.: Computing Functional Dependencies with Pattern Structures. In: Concept Lattices and their Applications. pp. 175–186 (2012)
3. Baixeries, J., Kaytoue, M., Napoli, A.: Characterizing Functional Dependencies in Formal Concept Analysis with Pattern Structures. Annals of Mathematics and Artificial Intelligence 72(1–2), 129–149 (Oct 2014)
4. Bertet, K., Monjardet, B.: The multiple facets of the canonical direct unit implicational basis. Theoretical Computer Science 411(22-24), 2155–2166 (2010)
5. Caruccio, L., Deufemia, V., Polese, G.: Relaxed Functional Dependencies – A Survey of Approaches. IEEE Transactions on Knowledge and Data Engineering 28(1), 147–165 (2016)
6. Codocedo, V., Baixeries, J., Kaytoue, M., Napoli, A.: Characterization of Order-like Dependencies with Formal Concept Analysis. In: Concept Lattices and Their Applications. vol. 1624, pp. 123–134 (2016)
7. Codocedo, V., Napoli, A.: Lattice-based biclustering using Partition Pattern Structures. In: Proceedings of ECAI 2014. Frontiers in Artificial Intelligence and Applications, vol. 263, pp. 213–218. IOS Press (2014)
8. Fan, W.: Dependencies revisited for improving data quality. In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, June 9-11, 2008, Vancouver, BC, Canada. pp. 159–170 (2008)
9. Fan, W.: Data quality: From theory to practice. SIGMOD Record 44(3), 7–18 (2015)
10. Fan, W., Geerts, F.: Foundations of Data Quality Management. Synthesis Lectures on Data Management, Morgan & Claypool Publishers (2012)
11. Gainer-Dewar, A., Vera-Licona, P.: The minimal hitting set generation problem: Algorithms and computation. SIAM Journal on Discrete Mathematics 31(1), 63–100 (2017)
12. Ganter, B., Obiedkov, S.: Conceptual Exploration. Springer, Berlin (2016)
13. Ganter, B., Wille, R.: Formal Concept Analysis. Springer, Berlin (1999)
14. Ganter, B., Kuznetsov, S.O.: Pattern Structures and their projections. In: Delugach, H.S., Stumme, G. (eds.) Conceptual Structures: Broadening the Base. LNCS, vol. 2120, pp. 129–142. Springer (2001)
15. Guigues, J.L., Duquenne, V.: Familles minimales d'implications informatives résultant d'un tableau de données binaires. Mathématiques et Sciences Humaines 95, 5–18 (1986)
16. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies. Computer Journal 42(2), 100–111 (1999)
17. Korth, H.F., Silberschatz, A.: Database System Concepts. McGraw-Hill, Inc., New York, NY, USA (1986)
18. Kuznetsov, S.O.: Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research. In: Formal Concept Analysis, Foundations and Applications. pp. 196–225. Lecture Notes in Computer Science 3626, Springer (2005)
19. Maier, D.: Theory of Relational Databases. Computer Science Press (1983)
20. Mannila, H., Räihä, K.J.: The Design of Relational Databases. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (1992)
21. Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.P., Schönberg, M., Zwiener, J., Naumann, F.: Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms. Proceedings of the VLDB Endowment 8(10), 1082–1093 (2015)
22. Ryssel, U., Distel, F., Borchmann, D.: Fast algorithms for implication bases and attribute exploration using proper premises. Annals of Mathematics and Artificial Intelligence 70(1), 25–53 (2014)
23. Ullman, J.: Principles of Database Systems and Knowledge-Based Systems, volumes 1–2. Computer Science Press, Rockville (MD), USA (1989)
24. Yu, Y., Heflin, J.: Extending functional dependency to detect abnormal data in RDF graphs. In: The Semantic Web – ISWC 2011 – 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011, Proceedings, Part I. pp. 794–809 (2011)
25. Yu, Y., Li, Y., Heflin, J.: Detecting abnormal semantic web data using semantic dependency. In: Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC 2011), Palo Alto, CA, USA, September 18-21, 2011. pp. 154–157. IEEE Computer Society (2011)