<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Characterizing Covers of Functional Dependencies using FCA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victor Codocedo</string-name>
          <email>victor.codocedo@inf.utfsm.cl</email>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaume Baixeries</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehdi Kaytoue</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amedeo Napoli</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>290, Department of Computer Science, Palacky University Olomouc</institution>
          ,
          <addr-line>2018. Copying permitted only for private and academic purposes</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Infologic</institution>
          ,
          <addr-line>99 Avenue de Lyon, F-26500, Bourg-l`es-Valence</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>LORIA (CNRS - Inria Nancy Grand Est - Universit ́e de Lorraine)</institution>
          ,
          <addr-line>B.P. 239, F-54506, Vandoeuvre-l`es-Nancy</addr-line>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Universidad T ́ecnica Federico Santa Mar ́ıa. Campus San Joaqu ́ın</institution>
          ,
          <addr-line>Santiago de</addr-line>
          <country country="CL">Chile</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Universitat Polit`ecnica de Catalunya.</institution>
          <addr-line>08032, Barcelona. Catalonia</addr-line>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>c paper author(s), 2018. Proceedings volume published and copyrighted by its editors. Paper published in Dmitry I. Ignatov</institution>
          ,
          <addr-line>Lhouari Nourine (Eds.): CLA 2018, pp. 279</addr-line>
        </aff>
      </contrib-group>
      <fpage>279</fpage>
      <lpage>290</lpage>
      <abstract>
        <p>Functional dependencies (FDs) can be used for various important operations on data, for instance, checking the consistency and the quality of a database (including databases that contain complex data). Consequently, a generic framework that allows mining a sound, complete, non-redundant and yet compact set of FDs is an important tool for many different applications. There are different definitions of such sets of FDs (usually called covers). In this paper, we present the characterization of two different kinds of covers for FDs in terms of pattern structures. The convenience of such a characterization is that it allows for an easy implementation of efficient mining algorithms which can later be easily adapted to other kinds of similar dependencies. Finally, we present empirical evidence that the proposed approach can perform better than a state-of-the-art FD miner in large databases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Functional Dependencies (FDs) are a keystone of the relational database model,
since they allow checking the consistency, maintaining the quality of a database
[
        <xref ref-type="bibr" rid="ref10 ref8 ref9">8, 10, 9</xref>
        ], and guiding its design [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In addition, they have been used to study
information integration in the Web of data with varying degrees of quality [
        <xref ref-type="bibr" rid="ref24 ref25">24,
25</xref>
        ], or to check the data completeness in DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, the
computation of a succinct representation of a set of FDs (usually refered to as a cover), is
of interest to various fields of knowledge discovery and representation, specially,
if this computation can easily be extended to other kinds of dependencies (e.g.
relaxed versions of FDs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
      </p>
      <p>
        The computation of FD covers is a popular topic in the database literature.
As a reference, in [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], seven algorithms to mine FDs and compute their covers
are reviewed and grouped into three families: lattice transversal algorithms,
difference/agree sets algorithms, and dependency induction algorithms. The
characterization of FDs with FCA and pattern structures is also presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
a generalization to other types of FDs is given in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. For a detailed review on
the characterization of FDs and FCA, see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In this paper, we characterize the representations of FD covers using pattern
structures, an extension of FCA dealing with complex object representations [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        On the one hand, we adapt the definition of a canonical direct basis of
implications with proper premises [
        <xref ref-type="bibr" rid="ref22 ref4">4, 22</xref>
        ] to the formalism of pattern structures,
in order to prove that this basis is equivalent to a reduced non-redundant set
of FDs, better known as the canonical cover (Section 3.1). We show that the
canonical cover can be characterized using the arrow relation (Ö) between an
attribute and a pattern defined over a partition pattern structures (PPS). On the
other hand, we discuss on the relation between the Stem Base of implications [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
with the Minimal Cover of dependencies, a sound, complete, non-redundant set
of FDs that has minimum cardinality w.r.t. any other equivalent cover
(Section 3.2). We show that the latter can be characterized by pseudo-extents of a
PPS.
      </p>
      <p>
        Finally, in Section 4 we empirically compare these two ways of computing FD
covers with the algorithm TANE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This algorithm is one of the most efficient
FD mining algorithms and according to [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], it is the base for the family of
“lattice transversal algorithms” serving as the baseline to validate our approach
with a state-of-the-art FD miner.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Theoretical Background</title>
      <p>Let U be an ordered set of attributes, and let Dom be a set of values (a domain).
For the sake of simplicity, we assume that Dom is a numerical set. A tuple t
is a function t : U Ñ Dom, and a table T is a set of tuples. Usually a table is
represented as a matrix, as in Table 1, where the set of tuples (or objects) is T “
t t1, t2, . . . , t7 u with attributes U “ t a, b, c, d, e u. We use table, dataset, set of
tuples as equivalent terms. We overload the functional notation of a tuple in such
a way that, given a tuple t P T , we say that tpX Ď U ) is a tuple with the values
of t in the ordered attributes xi P X defined as tpXq “ xtpx1q, tpx2q, . . . , tpxnqy.
For example, we have that t2pt a, c uq “ xt2paq, t2pcqy “ x2, 1y. In this article,
the set notation is dropped: instead of t a, b u we use ab.
2.1</p>
      <sec id="sec-2-1">
        <title>Functional Dependencies and their Covers</title>
        <sec id="sec-2-1-1">
          <title>Definition 1 ([23]). Let T be a set of tuples, and X, Y Ď U . A functional</title>
          <p>dependency (FD) X Ñ Y holds in T if:</p>
          <p>For instance, the functional dependency d Ñ e holds in T (Table 1), whereas
the functional dependency e Ñ d does not hold since t4peq “ t5peq but t4pdq ‰
t5pdq.</p>
          <p>
            For a given set of functional dependencies F , we can use the three
Armstrong’s axioms (reflexivity, augmentation and transitivity) to derive a larger
set of FDs [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ]. We will call F ˚ the set of FDs derived from F by
reflexivity and augmentation, and F ` the set of FDs derived by reflexivity,
augmentation and transitivity. Two sets of FDs F and H are said to be equivalent
F ” H ðñ F ` “ H`.
          </p>
          <p>Let F be a set of FDs from a database T , F is said to be sound if all FDs
in F hold in T . In addition, F is said to be complete if all FDs that hold in
T can be derived from F . Let X Ñ Y be any FD in F , then F is said to be
non-redundant if F ztX Ñ Y u ı F , and non-redundant w.r.t. augmentation iff
pF ztX Ñ Y uq˚ ‰ F ˚</p>
          <p>A set F is said to be left-reduced if for all X Ñ Y P F and Z Ĺ X we have
that pF ztX Ñ Y uq Y tZ Ñ Y u ı F . Dually, it is said to be right-reduced if for
all X Ñ Y P F and Z Ĺ Y we have that pF ztX Ñ Y uq Y tX Ñ Zu ı F . F is
said to be reduced if it is simultaneously left and right-reduced.</p>
          <p>
            Let F be a reduced set of FDs, then G “ tX Ñ y | X Ñ Y P F, y P Y u
is the splitting of F and G ” F . Let F be a reduced set, its splitting is called
a canonical cover. A canonical cover is a left-reduced, non-redundant w.r.t.
augmentation set of FDs with a single element in their right hand side (a split
set) [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. A different definition presents the canonical cover in a saturated
version requiring uniqueness in their left hand side [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] losing the single element
condition in the right hand side. For the sake of compatibility with FCA
implication covers we will favor the first definition. Notice that canonical covers
can be redundant w.r.t. transitivity. For example the canonical cover of Table 1
contains tc Ñ b, c Ñ e, d Ñ e, bd Ñ c, be Ñ cu where bd Ñ c would be
redundant w.r.t. transitivity as bd Ñ bde Ñ c.
          </p>
          <p>Finally, a set F is said to be a minimum cover if it has as few FDs as any
other equivalent set of FDs. For example, the minimum cover of Table 1 contains
FDs tc Ñ be, d Ñ e, be Ñ cu. Notice that the minimum cover is not restricted
to be reduced, so it is not presented with split sets. Secondly, the minimum cover
contains exactly one fewer FD than the canonical cover, namely bd Ñ c.
2.2</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Formal Concept Analysis, Implication Systems and FDs</title>
        <p>
          For the sake of brevity we do not provide a description of the Formal Concept
Analysis (FCA) framework. The notation used in this article follows [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] where
K “ pG, M, Iq is a formal context of objects G, attributes M and incidence
relation I, with formal concepts pA, Bq where A1 “ B and B1 “ A.
        </p>
        <p>
          Implications are relations established between attribute sets from a formal
context K. Implications are analogous to FDs and they can be used to
characterize them [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Because of this, we will notate an implication similarly to an FD.
Implication systems (sets of implications) can also characterize FD covers [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          An implication X Ñ Y holds in K for X, Y Ď M if Y Ď X2. Let T be a
set of tuples and U , a set of attributes in a table (such as the one in Table 1).
We define the set P airpT q “ `T2 ˘ set of all subsets T with exactly two elements,
and the incidence set I such that ppti, tj q, xq P I ðñ tirxs “ tj rxs, @x P U ,
@ti, tj P T . It can be shown that an FD X Ñ Y holds in the database if and
only if X Ñ Y is an implication of the formal context K “ pP airpT q, U , Iq [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
K is called the binary codification of table T . For example, Table 3 contains the
binary codification of Table 1. In Table 3 we can observe the implication d Ñ e
which can be verified as an FD in Table 1.
        </p>
        <p>
          The previous statement entails that FDs and implications in K are in 1-1
correspondence. Moreover, the corresponding definition of a canonical cover of
FDs is equivalent to that of a canonical-direct unitary basis of implications as
shown in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] where the equivalent left-minimal basis is described as containing
minimal functional dependencies, i.e. those in a canonical cover.
δpaq
δpbq ˆ
δpeq ˆ
δpaq [ δpbq ˆ ˆ Ö Ö
δpaq [ δpeq ˆ ˆ
δpdq [ δpeq ˆ ˆ
δpaq [ δpdq [ δpeq ˆ Ö Ö ˆ ˆ
δpbq [ δpcq [ δpeq ˆ ˆ ˆ
δpaq [ δpbq [ δpcq [ δpeq ˆ ˆ ˆ Ö ˆ
δpbq [ δpcq [ δpdq [ δpeq Ö ˆ ˆ ˆ ˆ
δpaq [ δpbq [ δpcq [ δpdq [ δpeq ˆ ˆ ˆ ˆ ˆ
A Partition Pattern Structure (PPS) is a type of pattern structure [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] that deals
with, as the name suggests, object representations in the form of set partitions.
PPS have shown to be useful for mining biclusters [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and, more importantly,
relations between partition pattern concepts have been used to characterize FDs
of different kinds [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>The formalization of a database T with attributes U as a PPS is as follows. A
partition d of T is a set d Ď ℘pT q (powerset of T ) of disjoint non-empty subsets of
T such that for any two different elements Ki, Kj P d we have that Ki X Kj “ H
and ŤKPd K “ T . Let D be the set of all possible partitions of T , then any two
partitions d1, d2 P D can be ordered by a coarser/finer relation denoted d1 Ď d2
(d1 is finer than d2) iff p@ Ki P d1qpD Kj P d2q such that Ki Ď Kj . The
similarity operator is defined as d1 [ d2 “ tKi X Kj | Ki P d1, Kj P d2u. From
this, it follows that pD, Ďq is a complete lattice with supremum and infimum
defined respectively as J “ ttT uu and K “ tttu | t P T u.</p>
        <p>The set of attributes U is mapped onto D through a function δ which for a
given attribute x P U yields a partition δpxq P D representing the equivalence
relations over T for values of attribute x. With this, we can configure the PPS
pU , pD, Ďq, δq with derivation operators X˝ “ d δpxq denoting the equivalence
xPX
relations implied by attributes in X, and d˝ “ tx P U | δpxq Ď du denoting all
attributes with associated equivalence relations finer than d. pX, dq is a partition
pattern concept when X˝ “ d and d˝ “ X.</p>
        <sec id="sec-2-2-1">
          <title>Theorem 1 (Proposition 2 in [3]). Let pP airpT q, U , Iq be the binary codifi</title>
          <p>cation of a table within a database, and pU , pD, Ďq, δq its PPS representation:
pW, Xq P pP airpT q, U , Iq ðñ pX, dq P pU , pD, Ďq, δq</p>
          <p>
            The proof of Theorem 1 can be found in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. Theorem 1 presents an important
property of the PPS that states that X Ď U is an extent in pU , pD, Ďq, δq if and
only if it is also an intent in pP airpT q, U , Iq (the relation between the set of tuple
pairs W Ď P airpT q and the partition d P D can be formalized as well but it is
of no interest to our development). Theorem 1 is very important since it entails
that the lattices derived from pP airpT q, U , Iq and pU , pD, Ďq, δq are isomorphic.
Consequently, implication X Ñ Y in pP airpT q, U , Iq can be found as a relation
between extents in pU , pD, Ďq, δq (extent implication) such that Y Ď X˝˝.
3
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Covers and Pattern Structures</title>
      <p>In this section we present two different kinds of covers for FDs: canonical covers
(Section 3.1) and minimal covers (Section 3.2), as well as their characterization
in terms of Pattern Structures. This section uses existing well-known results in
FCA, which are reviewed here for the sake of readability.</p>
      <p>
        Section 3.1 presents how a canonical cover for FDs can be computed using
PPS, according to the results in [
        <xref ref-type="bibr" rid="ref22 ref4">4, 22</xref>
        ]. Section 3.2 introduces a novel
characterization of the minimum cover of FDs by means of pseudo-extents of PPS [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
The interest of these results is not limited to computing FD covers, but also for
generalizations of FDs [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ].
3.1
      </p>
      <sec id="sec-3-1">
        <title>Characterizing a Canonical Cover of FDs</title>
        <p>
          The characterization of a canonical cover of FDs using FCA is straightforward.
In a nutshell, a canonical cover of FDs is analogous to a canonical direct unitary
basis of implications [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as presented in Section 2.2. In this section we recall some
of these ideas and show how they can be simply adapted to the framework of
PPS.
        </p>
        <p>
          Firstly, let pU , pD, Ďq, δq be a PPS, we define D “ td P D | d˝˝ “ du as
the set of all closed partition patterns in D. The formal context pU , D, J q with
pd, xq P J ðñ d Ď δpxq is called a representation context of pU , pD, Ďq, δq
and their corresponding concept lattices are isomorphic [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. For the sake of
readability of the following definitions, we define the representation context as
pD, U , J q instead of pU , D, J q (Table 2). By transitivity of equivalence, it is clear
that pD, U , J q is isomorphic to pP airpT q, U , Iq as defined in Section 2.3 and as
such, implications in pD, U , J q correspond to FDs. For example, Table 2 (not
considering elements Ö) contains the representation context pD, U , J q of the
PPS derived from Table 1. Notice that objects are closed intersections of object
representations, e.g. δpdq does not appear since δpdq˝˝ “ δpdq [ δpeq. With this,
the canonical direct basis of implications in pD, U , J q (and thus, canonical cover
of FDs) is determined by the set of proper premises of elements in U .
        </p>
        <sec id="sec-3-1-1">
          <title>Theorem 2 (Corollary 1 in [22]). X Ď U ztyu is a premise of y P U iff X is</title>
          <p>a hypergraph transversal of HyÖ defined as :</p>
          <p>HyÖ “ tppU ztyuqqzd1 | d P D, d Ö yu
The set of all proper premises of y is the minimum hypergraph transversal
T rpHyÖq.</p>
          <p>
            A detailed description on the development of Theorem 2 can be found in [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ].
Providing a formal definition of hypergraph transversals is out of the scope
of this article, however we briefly mention that this formalization can also be
made considering set collections (instead of hypergraphs) and minimum hitting
sets (instead of minimum hypergraph transversals) [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This problem is also
analogous to the vertex cover problem [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
          <p>Theorem 2 provides a formal description for the proper premises of a given
attribute y P U that in turn yields the canonical cover of functional dependencies.
However this approach requires the creation of the representation context which
is a middle step in the overall calculation process. Actually, by analyzing the
arrow relation between d and y we can observe that the representation context
is not necessary. Consider that in pD, U , J q, d1 “ tx P U | pd, xq P J p ðñ d Ď
δpxqqu and thus d1 is equivalent to d˝ for any d P D. Moreover, in pU , pD, Ďq, δq,
d˝ Ĺ h˝ ðñ h Ĺ d since h “ h˝˝ and d “ d˝˝. With this, we can rewrite the
arrow definition as follows.</p>
          <p>d Ö y ðñ pd, yq R J and if d1 Ĺ h1 then ph, yq P J
ðñ d Ę δpyq and if d˝ Ĺ h˝ then h Ď δpyq
ðñ d Ę δpyq and if h Ĺ d then h Ď δpyq
The last result shows that d Ö y in pD, U , J q can be defined directly over the
PPS. Intuitively, this definition corresponds to y Õ d in pU , D, J q and thus, in
pU , pD, Ďq, δq. With these elements we can finally propose a characterization for
the canonical cover of functional dependencies in pU , pD, Ďq, δq as follows.
Corollary 1. Let pU , pD, Ďq, δq be a partition pattern structure and T rpHq
denote the hypergraph transversal of H, then with</p>
          <p>Lcc “ tX Ñ y | y P U , X P T rpHyÕqu</p>
          <p>HyÕ “ tppU ztyuqqzd1 | d P D, y Õ du
y Õ d ðñ d Ę δpyq and if h Ĺ d then h Ď δpyq
Lcc is a canonical cover of functional dependencies.</p>
          <p>For the running example, let us calculate the proper premises of attribute
c using the arrow relations in Table 3. We have c Õ pδpaq [ δpbqq and c Õ
pδpaq [ δpdq [ δpeqq. With this, we have the hypergraph HcÕ “ ttd, eu, tbuu
for which the minimum transversal hypergraph is T rpHcÕq “ ttb, du, tb, euu.
Correspondingly, we have the FDs bd Ñ c and be Ñ c which are included in the
canonical cover.
3.2</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>Characterizing a Minimal Cover of FDs</title>
        <p>We introduce a novel characterization of a minimal cover of FDs by means of
pseudo-intents, and its generalization using pseudo-extents of a PPS.</p>
        <p>
          The stem base of implications, or Duquenne-Guigues basis [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], is a sound,
complete and non-redundant basis which also has minimum cardinality among
the sets of implications for a given formal context. We show how this can be
used to characterize a minimal cover of FDs in a rather simple manner. Prior to
introducing the stem base, let us define pseudo-closed sets [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>Definition 2. (Pseudo-closed sets) Let P ÞÑ P 2 be a closure operator over a
set M, then P is a pseudo-closed set if and only if:</p>
        <p>P ‰ P 2
Q Ĺ P and Q is a pseudo-closed set ùñ
Q2 Ď P</p>
        <p>
          Given a formal context pG, M, Iq, pseudo-closed sets A Ď G are called
pseudo-extents, while pseudo-closed sets B Ď M are called pseudo-intents. A
stem base of implications, or Duquenne-Guigues basis, can be defined as follows:
Theorem 3 ([
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). (Duquenne-Guigues Basis) The set of implications:
        </p>
        <p>L “ tP Ñ P 2 | P is a pseudo-intentu
is sound, complete and non-redundant.</p>
        <p>
          Theorem 4 ([
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). Every complete set of implications contains an implication
X Ñ Y with X2 “ P 2 for every pseudo-intent P of pG, M, Iq
        </p>
        <p>Theorem 4 entails that the stem base of implications has minimum cardinality
with respect to any equivalent set of implications of pG, M, Iq. With this and
the previous observation that FDs are in 1-1 correspondence with implications
in pP airpT q, U , Iq, we can derive the following corollary.</p>
        <p>Corollary 2. Let pP airpT q, U , Iq be the binary codification of a table with tuples
T and attributes U , the set of FDs L “ tP Ñ P 2 | P is a pseudo-intentu is a
minimal cover.</p>
        <p>Corollary 2 provides a novel characterization of the minimal cover of FDs
through pseudo-intents of a formal context. Given the existing relation between
pP airpT q, U , Iq and pU , pD, Ďq, δq, we can generalize the characterization of the
minimal cover over the latter. Observe that in the PPS pU , pD, Ďq, δq, we
maintain the notion of pseudo-extents for a pseudo-closed set X Ď U with X ÞÑ X˝˝.
(1)
(2)
(3)
Proposition 1. Let pU , pD, Ďq, δq be the PPS representation of a database, then
the set of functional dependencies Lmc defined below is a minimal cover.</p>
        <p>Lmc “ tX Ñ X˝˝ | X is a pseudo-extentu
(4)</p>
        <p>The proof of Proposition 1 follows from Theorem 1 and the fact that for a
set X P U the closure operator X ÞÑ X2 is exactly equivalent to X ÞÑ X˝˝ and
consequently, the set of pseudo-intents in pP airpT q, U , Iq is the same as the set
of pseudo-extents in pU , pD, Ďq, δq. Thus, because of Corollary 2, Proposition 1
holds.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <p>
        In this section we present a brief experimental comparison of both introduced
approaches versus TANE [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], a state-of-the-art FD miner. TANE is a highly
optimized apriori -based algorithm that generates a canonical cover of FDs. The
goal of this evaluation is to study the comparative benefits of using FCA versus
a traditional approach such as TANE. TANE was re-implemented for our
experiments5. For the sake of repeatibility, the code used to run the experiments
was made available at GitHub. Both the minimal cover miner (MCM) 6, and
the canonical cover miner (CCM)7 were implemented using Python’s pipy FCA
library fca8.
      </p>
      <p>Experiments were performed over 12 datasets extracted from the UCI
Machine Learning repository9, the JASA Data Archive10 and Metanome’s
repeatability Web page11. Details on the number of rows and columns for each dataset
are provided in the first two columns of Table 4. In addition to these datasets,
we created synthetic versions by multiplying the rows or the columns of a given
dataset. Experiments were performed on a virtual machine with 4 cores running
at 2.7 Ghz and equipped with 32 GB of RAM memory.
4.1</p>
      <sec id="sec-4-1">
        <title>Results &amp; Discussion</title>
        <p>5 https://github.com/codocedo/tane/tree/cla18
6 https://github.com/codocedo/fd_miner/tree/cla18
7 https://github.com/codocedo/uis_miner/tree/cla18
8 https://pypi.org/project/fca/
9 https://archive.ics.uci.edu/ml/index.php
10 http://lib.stat.cmu.edu/jasadata/
11 https://hpi.de/naumann/projects/repeatability/data-profiling/fds.html
the lower-left region of the figure, representing small datasets. FCA-based
approaches perform better in datasets in all other regions, including the upper-right
which contains datasets with many rows and columns.</p>
        <p>Synthetic datasets in Table 4 show evidence that FCA scales better when
duplicating the dataset. When duplicating attributes the difference is
particularly dramatic since TANE is over 13 times slower while our approach, only
3. To study this further, we created two sets of synthetic datasets. The first set
(vertical set) was created by incrementally multiplying vertically the Diagnostics
datasets (with 8 attributes and 120 tuples) generating 19 versions of 240, 360,
480 tuples, up to a dataset containing 2400 tuples. The second set (horizontal
set) was created using the same idea but in a horizontal manner generating 19
versions of 16, 24, 32, up to 160 attributes. Since most of the datasets of the
second set were too big for TANE, they were trunked to 40 tuples.</p>
        <p>Figure 2 depicts the increasing time for CCM, MCM and TANE on the
vertical set, i.e. when increasing the number of tuples. We can observe that all
three approaches scale linearly w.r.t. the number of tuples, even when CCM and
MCM seem to have a more stable behavior. Vertical multiplication of datasets
yield the same number of FDs than the original set, since the relation between
attributes remains unchanged. Thus, we can assume that other algorithms based
on TANE could achieve a similar performance to CCM or MCM provided some
optimizations.</p>
        <p>
          On the other hand, this do not seem to be the case for the horizontal set.
Figure 3 shows that CCM and MCM remain very stable when varying the
number of attributes, while TANE’s execution time grows exponentially. Indeed,
this great difference in performance is due to the way in which we use FCA to
find FDs which differs from TANE’s strategy. Using FCA we calculate closures
which quickly group attribute copies avoiding unnecessary intersections. Instead,
TANE computes each attribute combination rendering the exponential growth
in the computation time. We stress that this is not simply an extreme case from
which our approach takes advantage, but actually a very good illustration of the
benefits of using a closure operator to navigate the space of FDs. Closures
enable CCM and MCM to avoid unnecessary computations not only when we have
redundant attributes, but also whenever possible in the lattice of the powerset
of attributes. Finally, we do not discuss on the differences between CCM and
MCM strategies as these are detailed in [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
5
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have presented a new characterization of a minimal cover of functional
dependencies (FDs) by means of the stem base (or Duquenne-Guigues basis) of a
partition pattern structure. We have presented the mechanisms through which
this characterization can be exploited to efficiently mine the minimal cover.
Furthermore, we have described the algorithms that implement these mechanisms
and show empirical evidence that our characterization performs better than a
state-of-the-art FD miner, namely TANE, in larger databases containing many
rows and columns.</p>
      <p>Acknowledgments. This research work has been supported by the SGR2014-890
(MACDA) project of the Generalitat de Catalunya, and MINECO project APCOM
(TIN2014-57226-P).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alam</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buzmakov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Codocedo</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Mining definitions from RDF annotations using formal concept analysis</article-title>
          .
          <source>In: Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI</source>
          <year>2015</year>
          ,
          <string-name>
            <given-names>Buenos</given-names>
            <surname>Aires</surname>
          </string-name>
          , Argentina,
          <source>July 25-31</source>
          ,
          <year>2015</year>
          . pp.
          <fpage>823</fpage>
          -
          <lpage>829</lpage>
          . AAAI Press (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Computing Functional Dependencies with Pattern Structures</article-title>
          .
          <source>In: Concept Lattices and their Applications</source>
          . pp.
          <fpage>175</fpage>
          -
          <lpage>186</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Characterizing Functional Dependencies in Formal Concept Analysis with Pattern Structures</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>72</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>129</fpage>
          -
          <lpage>149</lpage>
          (
          <year>Oct 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bertet</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monjardet</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>The multiple facets of the canonical direct unit implicational basis</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>411</volume>
          (
          <fpage>22</fpage>
          -
          <lpage>24</lpage>
          ),
          <fpage>2155</fpage>
          -
          <lpage>2166</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
          </string-name>
          , G.:
          <article-title>Relaxed Functional Dependencies - A Survey of Approaches</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>28</volume>
          (
          <issue>1</issue>
          ),
          <fpage>147</fpage>
          -
          <lpage>165</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Codocedo</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baixeries</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaytoue</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Characterization of Orderlike Dependencies with Formal Concept Analysis</article-title>
          .
          <source>In: Concept Lattices and Their Applications</source>
          . vol.
          <volume>1624</volume>
          , pp.
          <fpage>123</fpage>
          -
          <lpage>134</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Codocedo</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Lattice-based biclustering using Partition Pattern Structures</article-title>
          .
          <source>In: Proceedings of ECAI 2014. Frontiers in Artificial Intelligence and Applications</source>
          , vol.
          <volume>263</volume>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>218</lpage>
          . IOS Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fan</surname>
          </string-name>
          , W.:
          <article-title>Dependencies revisited for improving data quality</article-title>
          .
          <source>In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS</source>
          <year>2008</year>
          , June 9-11,
          <year>2008</year>
          , Vancouver, BC, Canada. pp.
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Data quality: From theory to practice</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>44</volume>
          (
          <issue>3</issue>
          ),
          <fpage>7</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geerts</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Foundations of Data Quality Management</article-title>
          .
          <source>Synthesis Lectures on Data Management</source>
          , Morgan &amp; Claypool Publishers (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Gainer-Dewar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vera-Licona</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>The minimal hitting set generation problem: Algorithms and computation</article-title>
          .
          <source>SIAM Journal on Discrete Mathematics</source>
          <volume>31</volume>
          (
          <issue>1</issue>
          ),
          <fpage>63</fpage>
          -
          <lpage>100</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Obiedkov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Conceptual Exploration. Springer, Berlin (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wille</surname>
          </string-name>
          , R.:
          <source>Formal Concept Analysis</source>
          . Springer, Berlin (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ganter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>Pattern Structures and their projections</article-title>
          . In: Delugach,
          <string-name>
            <given-names>H.S.</given-names>
            ,
            <surname>Stumme</surname>
          </string-name>
          ,
          <string-name>
            <surname>G</surname>
          </string-name>
          . (eds.)
          <article-title>Conceptual Structures: Broadening the Base</article-title>
          .
          <source>LNCS</source>
          , vol.
          <volume>2120</volume>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>142</lpage>
          . Springer (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Guigues</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duquenne</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Familles minimales d'implications informatives r´esultant d'un tableau de donn´ees binaires</article-title>
          .
          <source>Math´ematiques et Sciences Humaines</source>
          <volume>95</volume>
          ,
          <fpage>5</fpage>
          -
          <lpage>18</lpage>
          (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Huhtala</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , Ka¨rkka¨inen, J.,
          <string-name>
            <surname>Porkka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
          </string-name>
          , H.:
          <article-title>TANE: An Efficient Algorithm for Discovering Functional and Approximate Dependencies</article-title>
          .
          <source>Computer Journal</source>
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <fpage>100</fpage>
          -
          <lpage>111</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Korth</surname>
            ,
            <given-names>H.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Silberschatz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Database System Concepts</article-title>
          .
          <source>McGraw-Hill</source>
          , Inc., New York, NY, USA (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Kuznetsov</surname>
            ,
            <given-names>S.O.</given-names>
          </string-name>
          :
          <article-title>Galois Connections in Data Analysis: Contributions from the Soviet Era and Modern Russian Research</article-title>
          . In: Formal Concept Analysis,
          <source>Foundations and Applications</source>
          . pp.
          <fpage>196</fpage>
          -
          <lpage>225</lpage>
          . Lecture Notes in Computer Science 3626, Springer (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Maier</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Theory of Relational Databases</article-title>
          . Computer Science Press (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Mannila</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <article-title>Ra¨ih¨a, K</article-title>
          .J.:
          <article-title>The Design of Relational Databases</article-title>
          .
          <string-name>
            <surname>Addison-Wesley Longman</surname>
          </string-name>
          Publishing Co., Inc., Boston, MA, USA (
          <year>1992</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrlich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marten</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubert</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudolph</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          , Scho¨nberg,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Zwiener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Functional Dependency Discovery: An Experimental Evaluation of Seven Algorithms</article-title>
          .
          <source>Proceedings of VLDB Endowment</source>
          <volume>8</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1082</fpage>
          -
          <lpage>1093</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Ryssel</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Distel</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borchmann</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Fast algorithms for implication bases and attribute exploration using proper premises</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>70</volume>
          (
          <issue>1</issue>
          ),
          <fpage>25</fpage>
          -
          <lpage>53</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Ullman</surname>
          </string-name>
          , J.:
          <article-title>Principles of Database Systems and Knowledge-Based Systems, volumes 1-2</article-title>
          . Computer Science Press,
          <article-title>Rockville (MD), USA (</article-title>
          <year>1989</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
          </string-name>
          , J.:
          <article-title>Extending functional dependency to detect abnormal data in RDF graphs</article-title>
          .
          <source>In: The Semantic Web - ISWC 2011 - 10th International Semantic Web Conference</source>
          , Bonn, Germany,
          <source>October 23-27</source>
          ,
          <year>2011</year>
          , Proceedings, Part I. pp.
          <fpage>794</fpage>
          -
          <lpage>809</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heflin</surname>
          </string-name>
          , J.:
          <article-title>Detecting abnormal semantic web data using semantic dependency</article-title>
          .
          <source>In: Proceedings of the 5th IEEE International Conference on Semantic Computing (ICSC</source>
          <year>2011</year>
          ), Palo Alto, CA, USA, September
          <volume>18</volume>
          -
          <issue>21</issue>
          ,
          <year>2011</year>
          . pp.
          <fpage>154</fpage>
          -
          <lpage>157</lpage>
          . IEEE Computer Society (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>