<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A genetic algorithm to discover relaxed functional dependencies from data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Caruccio</string-name>
          <email>lcaruccio@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Deufemia</string-name>
          <email>deufemia@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Polese</string-name>
          <email>gpolese@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Salerno, Department of Computer Science</institution>
          ,
          <addr-line>via Giovanni Paolo II n.132, 84084 Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Approximate functional dependencies are used in many emerging application domains, such as the identi cation of data inconsistencies or patterns of semantically related data, query rewriting, and so forth. They can approximate the canonical de nition of functional dependency (fd) by relaxing on the data comparison (i.e., by considering data similarity rather than equality), on the extent (i.e., by admitting the possibility that the dependency holds on a subset of data), or both. Approximate fds are di cult to be identi ed at design time like it happens with fds. In this paper, we propose a genetic algorithm to discover approximate fds from data. An empirical evaluation demonstrates the e ectiveness of the algorithm.</p>
      </abstract>
      <kwd-group>
        <kwd>Functional Dependency</kwd>
        <kwd>Genetic Algorithm</kwd>
        <kwd>Discovery from data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        In the last decades, functional dependencies have been extended to deal with
new problems, such as data cleansing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], record matching [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], query rewriting
upon schema evolutions [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or to express constraints for complex data types,
such as multimedia [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. To this end, the canonical de nition of fd has been
extended either by using approximate paradigms rather than equality to compare
attribute values, or by admitting subsets of data on which the fd does not hold.
In the literature such extensions are referred to as relaxations, and these new
dependencies are indicated with the term relaxed functional dependencies (rfds)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The relaxation degree is expressed by means of thresholds, indicating either
the required similarity degree for the rfd to hold, or the minimum percentage
of data on which the rfd must hold.
      </p>
      <p>
        While fds were originally speci ed at database design time, as properties
of a schema that should hold on every instance of it, in order to reduce the
designer e ort and to dynamically adapt the rfds to the evolution of the
application domain, it is necessary to automatically discover them from data. This is
made possible also thanks to the availability of big data collections, and to the
contributions of several research areas, such as machine learning and knowledge
discovery [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
      </p>
      <p>
        Most of the discovery algorithms described in the literature are intended for
fds [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], whereas few rfd de nitions are equipped with algorithms for
discovering them from data [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. In this paper, we propose a genetic algorithm (GA)
to identify a broad class of rfds, including both relaxation criteria. In general,
GAs are e cient for global searches in large search spaces, for which
deterministic methods are not suitable. They perform operations inspired to natural species
evolutions, such as natural selection, crossover, and mutation. In particular, by
using these operations, the proposed algorithm iteratively generates new
candidate rfds, but only a subset of them, determined by means of a tness function,
survives to the evolution process. The tness function exploits the support and
con dence quality measures, mainly used for evaluating association rules. An
empirical evaluation demonstrates the e ectiveness of the algorithm.
      </p>
      <p>The paper is organized as follows. Section 2 reviews the rfd discovery
algorithms existing in the literature. Section 3 provides some background de nitions
about rfds and formulates the problem of rfd discovery. Section 4 presents
the proposed genetic algorithm named GA-rfd for discovering rfds from data,
whose performance are reported in Section 5. Finally, summary and concluding
remarks are included in Section 6.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In the literature there are two main categories of methods to automatically
discover fds from data, namely top-down and bottom-up methods. The formers
exploit an attribute lattice to generate candidate fds, which are successively
tested to verify their validity. The whole process is made e cient by exploiting
valid fds to prune the search space for new candidate fds. Examples of
topdown approaches include the algorithms TANE [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], FD Mine [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], FUN [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ],
and DFD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. On the other hand, bottom-up methods derive candidate fds
from two attribute subsets, namely agree-sets and di erence-sets, which are built
by comparing the values of attributes for all possible combinations of pairs of
tuples. Examples of bottom-up approaches include the algorithms DepMiner [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ],
FastFD [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], and FDep [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Recently, an hybrid algorithm has been proposed
in order to obtain better performance in all cases [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        Although a recent survey listed thirdy-four di erent types of rfds [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], very
few of them are equipped with algorithms for discovering them from data [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ].
These are reviewed in the following.
      </p>
      <p>
        Approximate functional dependencies (afds) are fds holding for `most' rather
than `all' the tuples of a relation r [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The amount of tuples satisfying the afds
can be calculated by means of several measures [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], among which the g3 error
measure is the most frequently used [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The latter represents the minimum
fraction of tuples that must be removed from a relation instance r in order to
make a candidate afd valid. Most approaches for afd discovery use a small
portion of tuples (sampling) s ⊂ r to decide whether an afd holds on r [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The
method proposed in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] exploits the error measure of super keys to determine
the approximate satisfaction of afds.
      </p>
      <p>
        Like afds, conditional functional dependencies (cfds) are fds holding for
a subset of tuples, but in this case the subset is identi ed by specifying
conditions [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Among the approaches for discoverying cfds from data, the algorithm
proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] exploits an attribute lattice derived from partitions of attribute
values, in order to generate candidate cfds. Alternatively, the greedy algorithm
proposed in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] tries to derive cfds by nding close-to-optimal tableau, where
the closeness is measured by means of support and con dence. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] the
authors adapt three algorithms used for fd discovery, namely FD Miner, TANE,
and FastFD, to the discovery of cfd.
      </p>
      <p>
        Matching dependencies (mds) are rfdc de ned in terms of similarity
predicates to accommodate errors, and di erent representations in unreliable data
sources [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. They have been mainly proposed for object identi cation. The most
recent algorithm for md discovery evaluates the utility of candidate mds for a
given database instance, determining the corresponding similarity threshold
pattern. The utility is measured by means of support and con dence mds, whereas
thresholds are determined by analyzing the statistical distribution of data.
      </p>
      <p>
        Di erential dependencies (dds) are rfdc based on di erential constraints on
attribute values [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Among the algorithms proposed for their discovery, the
one proposed in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] exploits reduction algorithms, which detect dds by rst
xing the RHS di erential function for each attribute in r, and then nding the
set of di erential functions that left-reduce the LHS. The performances of the
algorithm are improved by means of pruning strategies based on the
subsumption order of di erential functions, implication of dds, and instance exclusion.
Another approach for dd discovery tries to reduce the problem search space
by assuming a user-speci ed distance threshold as upper limit for the distance
intervals of LHS [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. The algorithm is based on a distance-based subspace
clustering model, and exploits pruning strategies to e ciently discover dds when
high threshold values are speci ed.
      </p>
      <p>
        As opposed to these discovery algorithms, which focus each on a speci c rfd,
the discovery algorithm proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] aims to identify the entire class of rfds
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] relaxing on the tuple comparison method. In particular, the approach relies
on lattice-based algorithms conceived for fd discovery, which are feeded with
similar subsets of tuples derived from previously computed di erential matrices.
      </p>
      <p>The algorithm proposed in this paper further extends the class of discovered
rfds, by also including those relaxing on the extent.
3
3.1</p>
    </sec>
    <sec id="sec-3">
      <title>Discovery Relaxed Functional Dependencies from Data</title>
      <sec id="sec-3-1">
        <title>Relaxed Functional Dependencies</title>
        <p>Before de ning rfds we need to introduce the concept of similarity constraint,
which expresses the similarity of two values on a speci c domain, and it is
represented through a function . In particular, given two attributes A and B on a
domain D, (A; B) evaluates the similarity of A and B. As an example, can
be de ned in terms of a similarity metric ≈, like for instance the edit distance,
such that a≈b is true if a and b are \close" enough w.r.t. a prede ned threshold.</p>
        <p>Given a relational schema R de ned over a xed set of attributes, and R1 =
(A1; : : : ; Ak) and R2 = (B1; : : : ; Bm) two relation schemas of R, an rfd ' on R
is denoted by
≥
Dc1 × Dc2 ∶ (X1; X2) 1 ÐÐ→ (Y1; Y2) 2
(1)
where
k m
{ Dc1 ×Dc2 = (t1; t2) ∈ dom(R1)×dom(R2) S  i⋀1 c1i (t1[Ai]) ⋀  j⋀1 c2j (t2[Bj])
= =
where c1 = (c11 ; : : : ; c1k ) and c2 = (c21 ; : : : ; c2m ), with c1i and c2j predicates
on dom(Ai) and dom(Bj), respectively, that lter the tuples on which '
applies;
{ X1; Y1 ⊆ attr(R1), and X2; Y2 ⊆ attr(R2), with X1 ∩ Y1 = ∅ and X2 ∩ Y2 = ∅;
{ 1 ( 2, resp.) is a set of constraints [X1; X2] ( [Y1; Y2], resp.) on attributes
X1 and X2 (Y1 and Y2, resp.). For any pair of tuples (t1, t2)∈ Dc1 × Dc2 , the
constraint [X1; X2] ( [Y1; Y2], resp.) indicates true, if the similarity of t1
and t2 on attributes X1 and X2 (Y1 and Y2, resp.) agrees with the constraint
speci ed by [X1; X2] ( [Y1; Y2], resp.).
{ is a coverage measure de ned on Dc1 × Dc2 , which quanti es the satis
ability degree of ' on r, and can be de ned as a function ∶ dom(X)×dom(Y ) →
R measuring the amount of tuples in r violating or satisfying '.
{ is a threshold indicating the lower bound (or upper bound in case the
comparison operator is ≤) for the result of the coverage measure.</p>
        <p>Given r1 ⊆ Dc1 and r2 ⊆ Dc2 two relation instances on (R1; R2), (r1; r2)
satis es the rfd ', denoted by (r1; r2) ⊧ ', if and only if: ∀ (t1; t2) ∈ (r1; r2), if
[X1; X2] indicates true for each constraint ∈ 1, then almost always [Y1; Y2]
indicates true for each constraint ∈ 2. Here, almost always means that
( X1 (r1) X2 (r2); Y1 (r1) Y2 (r2)) ≥ .</p>
        <p>For rfds de ned on single relation schemas (i.e., R1 = R2), if X1 = X2 and
Y1 = Y2, then we will simplify the rfd notation given in (1) by using Dc, X,
and Y to refer to the Cartesian product Dc1 × Dc2 , and to the pairs (X, X)
and (Y , Y ), respectively. Moreover, if the rfd has to be satis ed by all tuples
in r, then the symbol err(0) is shown on the arrow. Such coverage measure
corresponds to the expression (X; Y ) = 0, where (X; Y ) measures the number
of tuples violating the rfd. As an example, the canonical fd can also be written
err(0)
as: Dtrue ∶ Xeq ÐÐÐÐ→ Yeq, where true is a sequence of tautologies, hence
Dtrue = dom(R1), whereas eq is the equality constraint.</p>
        <p>As an example, in a database of body measures it is likely to have a
similar shoe size for people having a similar height. Thus, the following rfd might
hold Dtrue ∶ Height≈ ÐÐerÐrÐ(0→) Shoe Size≈, where ≈ is a di erential function. On
the other hand, there might be exceptions to this dependencies, since few
people might have close heights but more di erent shoe sizes. This can be modeled
by introducing a di erent coverage measure into the rfd, making it approximate
(Height;ShoeSize)≤0:02
Dtrue ∶ Height≈ ÐÐÐÐÐÐÐÐÐÐÐÐÐÐ→ Shoe Sizeeq, where (Height; ShoeSize)
≤ 0:02 indicates that at most for 2% of tuples the rfd does not hold.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>The RFD discovery problem</title>
        <p>
          Given a relation r, the discovery of rfds is the problem of nding a minimal
cover set of valid rfds that hold in r. This problem introduces further complexity
to the already complex task of fd discovery from data, due to the signi cantly
larger search space, which derives from the several similarity/distance functions
that may be de ned over each attribute [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Consequently, it is necessary to
devise tractable algorithms capable of extracting rfds with meaningful similarity
constraints.
        </p>
        <p>In this paper, we analyze the problem of rfd discovery from data given a set
of thresholds for both attribute comparison and coverage measure. Although this
reduces the overall complexity of the problem, it still remains more complex than
the fds discovery one. In fact, as said in Section 3.1, an rfd X → Y holds on a
relation r i whenever the distance between two tuples t1 and t2 in r is below
a threshold value A on each attribute A belonging to X, then their distance is
below a threshold value B on each attribute B belonging to Y , with a degree of
certainty greater than a threshold value . For this reason, we have to analyze the
similarity between portions of tuple pairs rather than their equality. This does
not permit to exploit the equivalence classes of attribute values, usually used
for fd discovery. Indeed, an attribute value might be similar to other attribute
values that are not similar to each other.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Identifying Relaxed Functional Dependencies with a</title>
    </sec>
    <sec id="sec-5">
      <title>Genetic Algorithm</title>
      <p>The proposed algorithm discovers rfds by comparing tuples pairwise, based on
the similarity of subsets of their attributes. In particular, pairs of tuples with a
similarity higher than a threshold are represented with the value 1, and 0
otherwise. In this way, we reduce the rfd discovery problem to a search one, which
is solved through a GA. The latter implements a simulation in which a
population of candidate solutions to an optimization problem evolves towards better
solutions. The process usually starts from a population of randomly generated
individuals (Initialization), which evolves by stochastically selecting multiple
individuals from the current population (based on a tness function), and modi ed
(recombined and possibly randomly mutated) to form a new population. The
algorithm terminates when either a maximum number of generations has been
produced, or a satisfactory tness level has been reached for the population. In
the rst case, a satisfactory solution may or may not have been reached.
4.1</p>
      <p>GA-RFD
In this section, we explain the GA to discover rfds, which is named GA-rfd. In
particular, in the following we rst explain how the discovery problem has been
encoded for GA processing, then we de ne the tness function to evaluate the
satisfactory degree of individuals, and nally, we present the GA-rfd algorithm.
Encoding. Given the thresholds for attribute comparison and coverage measure,
the problem of rfd discovery reduces to nd all possible dependencies X → Y
satisfying the following rule: tuples that are similar on the LHS must correspond
to those that are similar on the RHS. Such a rule must hold for almost all set
of tuples that are similar on the LHS attributes, according to the threshold
speci ed for the coverage measure.</p>
      <p>In order to accomplish this with a GA, the dataset has to be modeled in a way
that highlights similarity between tuple pairs. For this reason, in the proposed
methodology we transform an input dataset in terms of a di erence dataset. The
latter contains the input dataset attributes as columns, the input dataset tuple
pairs as rows, and the similarity between tuple pairs as values. In particular,
given an attribute Xi and a tuple pair (t1; t2), the di erence dataset will have
the value 1 if t1[Xi] is similar to t2[Xi] according to the threshold associated
to Xi, and the value 0 otherwise.</p>
      <p>Example 1. Let us consider the sample dataset shown in Table 1a, and the
similarity thresholds: 1 for the attributes Height and Shoe Size, and 10 for the
attribute Weight. By using the absolute di erence as a distance function, we can
derive the di erence dataset shown in Table 1b.</p>
      <p>In order to construct individuals in the population we use an array of integers
representing the attribute indices in the di erence dataset. In particular, each
item i of the array corresponds to a speci c attribute Xi in the dataset. In
particular, since all the rfds can be reduced w.l.o.g. to a format with a single
attribute on the RHS, each individual of size k will represent a candidate rfd,
where the k − 1 attributes on the LHS correspond to the rst k − 1 items in the
array, and the last one corresponds to the attribute on the RHS. It is worth to
notice that the algorithm can be iterated with several values for k in order to
consider rfds with di erent sizes on the LHS.</p>
      <p>Example 2. The individual [3; 1] for the dataset in Table 1a will represent
the candidate rfd: Shoe Size → Height.</p>
      <p>
        Fitness Function. In order to discover rfds we exploit the concepts of support
and con dence commonly used in the context of association rule mining. In
particular, several studies have demonstrated the relationship existing between
fds and association rules [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Formally, let X and Y be two sets of items, the
support of a set X, denoted with sup(X), represents the ratio of the transactions
in the dataset containing X, whereas the con dence represents the ratio sup(X ∪
Y )~sup(X).
      </p>
      <p>In our context, sup(X) represents the ratio of tuple pairs that are similar on
all attributes in X. In this way, we can use the con dence as the tness function
to determine if a candidate rfd X → Y is valid: conf (X → Y ) = suspupp(pX(X∪Y) ) :
This tness function returns 1 if and only if whenever tuple pairs are similar on
attributes in X, then they are similar also on Y . In order to validate a candidate
rfd the proposed algorithm associates the threshold speci ed for the coverage
measure to the tness function.</p>
      <p>Example 3. If we consider the individual [3; 1] of Example 2, representing
the candidate rfd: Shoe Size → Height, the associated support and con dence
are: sup(Shoe Size) = 1231 , sup({Shoe Size; Height}) = 261 , conf (Shoe Size →
Height) = 261 × 1231 = 163 = 0:46. This candidate rfd will not be considered as valid,
unless the coverage threshold speci ed in input is less than 0:46.</p>
      <p>Algorithm 1 The general GA-rfd algorithm
1: procedure GA-RFD(di erence dataset D, extent threshold , int maxgen, int</p>
      <p>pop size, int k, double p1, double p2)
2: Set pop, final pop, new pop
3: int gen ← 0
4: final pop ← initialize(pop size, k, numCols(D))
5: while gen ≤ maxgen do
6: if (allOf(final pop, )) then
7: break
8: end if
9: new pop ← select(final pop, )
10: pop ← ∅
11: pop ← crossover(new pop, k, p1)
12: pop ← mutate(pop, k, numCols(D), p2)
13: final pop ← new pop ∪ pop
14: g ← g + 1
15: end while
16: return final pop
17: end procedure
Algorithm. The pseudo-code of the main procedure of GA-rfd is shown in
Algorithm 1. It rst produces the initial population through the initialize
procedure, and then, within a cycle, it selects only the individuals that reach the
tness function objective (select procedure), makes crossover on selected
individuals according to a xed probability p1 (crossover procedure), mutates
some of them according to a xed probability p2 (mutate procedure). The new
population is evaluated in the next iteration of the cycle. GA-rfd terminates if
and only if either the population is composed of individuals, each one
representing a valid rfd according to the -threshold for the coverage measure, which
determines the tness objective; or the number of iterations exceeds the speci ed
maximum number.</p>
      <p>The initialize procedure shown in Algorithm 2 creates a set of individuals
representing the initial population, according to the size pop size speci ed as
input. The used random values are ranged in the number of columns of the
dataset, since each individual element will represent a column index.</p>
      <p>The select procedure shown in Algorithm 3 analyzes the individuals of
the population pop and constructs a new population (selected pop) by including
individuals with a probability related to the tness function objective .</p>
      <p>Algorithm 2 The initialize procedure
1: procedure initialize(int pop size, int k, int numCols)
2: Set pop
3: for i ≤ pop size do
4: List x
5: for j ≤ k do
6: add randInt(1,numCols) to x
7: end for
8: add x to pop
9: end for
10: return pop
11: end procedure
Algorithm 3 The select procedure
1: procedure select(Set pop, double )
2: Set selected pop → {}
3: double fitness, confidence
4: for each c Î pop do
5: confidence = supspu(pcp:({cX:X∪Y) })
6: if confidence ≥ then
7: fitness = 1
8: else
9: fitness = confidence~
10: end if
11: if rand() &lt; fitness then
12: selected pop = selected pop ∪ {c}
13: end if
14: end for
15: return selected pop
16: end procedure</p>
      <p>The crossover procedure shown in Algorithm 4 creates a new population of
individuals (new pop) by crossing individual pairs. We use a one-point crossover
to recombine individuals; the crosspoint is selected randomly. In this procedure
we use the function swap, which changes a sublist of two individuals by swapping
one to each other.</p>
      <p>The mutate procedure shown in Algorithm 5 creates a new population of
individuals (new pop) by mutating one value in each individual. The individual
element that has to be changed (gene) and the new value (new gene) are selected
randomly. In this procedure we use the function swapV alue, which changes a
value in a individual by swapping it with the new value. The mutate procedure
permits to detect candidate rfds that do not directly depend on the initial
population. It has associated a low probability, which limits the randomness of
the whole methodology.</p>
      <p>Algorithm 4 The crossover procedure
1: procedure crossover(Set pop, int k, double p)
2: Set new pop → {}
3: int crosspoint
4: for each c1,c2 Î pop with c1 ≠ c2 do
5: if rand() &lt; p then
6: crosspoint = randInt(1,k)
7: for i ≤ crosspoint do
8: swap(c1,c2, i)
9: end for
10: add c1 to new pop
11: add c2 to new pop
12: end if
13: end for
14: return new pop
15: end procedure
Algorithm 5 The mutate procedure
1: procedure crossover(Set pop, int k, int numCols, double p)
2: Set new pop
3: int gene, new gene
4: for each c Î pop do
5: if rand() &lt; p then
6: gene = randInt(1,k)
7: new gene = randInt(1,numCols)
8: swapValue(c, gene, new gene)
9: end if
10: add c to new pop
11: end for
12: return new pop
13: end procedure
5</p>
    </sec>
    <sec id="sec-6">
      <title>Empirical Evaluation</title>
      <p>In this section we describe the performed empirical evaluation to verify the
e ectiveness of the proposed GA-rfd algorithm. In particular, we rst evaluate
the performance of the algorithm when used to discover fds, then we evaluated
the rfd discovery performance.</p>
      <p>
        The evaluation has been performed on a version of GA-rfd implemented
in the Python language, and on a machine with an AMD Opteron (TM) CPU,
8 GB RAM, and Linux Ubuntu 16.10 OS. Moreover, we used two real-world
datasets, also employed for evaluating several fd discovery algorithms [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], and
the sample dataset introduced in Example 1. The probability parameters for the
crossover and mutate procedures have been set to 0.85 and 0.3, respectively.
Evaluation by discovering FDs. This experiment compares the performances of
GA-rfd with those of the fd discovery algorithms with the aim of evaluating
the e ectiveness of GA-rfd with respect to well-known results, as in the case of
iris and bridges datasets [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Here, we do not evaluate the time performances,
since fd and rfd discovery problems are not comparable. In order to reduce the
problem to discover canonical fds we assigned the value 0 to all tuple comparison
thresholds and the value 1 to the coverage threshold. Table 2 reports the fds
discovered on the analyzed datasets. For the iris and bridge datasets GA-rfd
achieves the same results obtained by other fd discovery algorithms [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ].
Evaluation by discovering RFDs. In order to verify the e ectiveness of GA-rfd,
we dirtied some values in the datasets used in the previous experiment with the
aim of verifying whether the number of canonical fds also changes. In other
words, we simulated the presence of dirtiness in data, and analyzed GA-rfd
overtakes errors in dependency discovery. For this reason, we evaluated how
the number of rfds changes by varying either the tuple comparison thresholds
or the coverage measure threshold. It is worth to notice that, we have used
this methodology in order to highlight the usefulness of the proposed algorithm
w.r.t. the results shown in Table 2. In fact, in real contexts we use GA-rfd with
di erent thresholds, those indicating similar tuples, and the one specifying the
satis ability degree.
      </p>
      <p>Table 3 reports for each dataset: (i) the number of errors introduced in the
dataset, (ii) the number fds of discovered on the dataset with dirty data, (iii)
the number of discovered rfds by associating a threshold ≤1 to one attribute in
order to manage similarities in the value comparison1, (iv) the number of rfds
discovered with a coverage threshold of 0:9.</p>
      <p>We can observe that the errors introduced in sample-dataset2 and iris2 have
reduced the number of discovered fds. On the contrary, for bridge2 such a
number increases, since a minimal fd 1 has been violated by errors and three new
fds, whose LHS has more attributes than LHS( 1), have been discovered. The
relaxation on attribute comparison has allowed to capture the errors introduced
in the datasets. However, for bridge2 a new dependency has been discovered
1In particular, we have associated the threshold 1 to the rst attribute in the
sampledataset2, and to the fth attribute in both iris2 and bridges2. The other attributes have
associated threshold 0.
beyond those reported in 2. Regarding the relaxation of tuple coverage, for
sample-dataset2 the threshold is too narrow to capture the introduced error;
for iris2 we obtained less but more general dependencies than those reported in
2; for bridge2 many more dependencies were discovered, due to the fact that the
introduced errors is not signi cant compared to the size of the dataset, and to
the number of dependencies holding in it, as reported in 2.
6</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>In this paper we have proposed GA-rfd, a GA for discovering rfds from data.
It analyzes tuple pairs similarity through a di erence dataset, and validates rfds
by calculating the con dence on each candidate rfd. A preliminary evaluation
of the algorithm has been carried out to assess the e ectiveness of the approach.</p>
      <p>
        In the future, we would like to further improve this approach in order to
automatically discover the rfds and the threshold ranges of validity, without
requesting their speci cation to the user. Furthermore, we would like to investigate
the discovery of GA-rfd in the context of user interaction logs, especially in web
applications, aiming to mine user intent [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To this end, mashup repositories are
a further interesting application domain, since they are precious sources of data
concerning mashup component usage, which can be useful to opportunely advise
the development of new mashups [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abedjan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schulze</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>DFD: E cient functional dependency discovery</article-title>
          .
          <source>In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management</source>
          . pp.
          <volume>949</volume>
          {
          <fpage>958</fpage>
          . CIKM '
          <volume>14</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bohannon</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geerts</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kementsietsidis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Conditional functional dependencies for data cleaning</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on Data Engineering</source>
          . pp.
          <volume>746</volume>
          {
          <fpage>755</fpage>
          . ICDE '
          <volume>07</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
          </string-name>
          , G.:
          <article-title>Relaxed functional dependencies { A survey of approaches</article-title>
          .
          <source>IEEE TKDE 28(1)</source>
          ,
          <volume>147</volume>
          {
          <fpage>165</fpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
          </string-name>
          , G.:
          <article-title>Understanding user intent on the web through interaction mining</article-title>
          .
          <source>Journal of Visual Languages &amp; Computing</source>
          <volume>31</volume>
          ,
          <string-name>
            <surname>Part</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <volume>230</volume>
          {
          <fpage>236</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
          </string-name>
          , G.:
          <article-title>On the discovery of relaxed functional dependencies</article-title>
          .
          <source>In: Proceedings of the 20th International Database Engineering &amp; Applications Symposium</source>
          . pp.
          <volume>53</volume>
          {
          <fpage>61</fpage>
          . IDEAS '
          <volume>16</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Caruccio</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tortora</surname>
          </string-name>
          , G.:
          <article-title>Synchronization of queries and views upon schema evolutions: A survey</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS) 41(2)</source>
          ,
          <volume>9</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <issue>7</issue>
          .
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deufemia</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polese</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vacca</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A normalization framework for multimedia databases</article-title>
          .
          <source>IEEE TKDE</source>
          <volume>19</volume>
          (
          <issue>12</issue>
          ),
          <volume>1666</volume>
          {
          <fpage>1679</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>Discovering data quality rules</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <volume>1166</volume>
          {
          <fpage>1177</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Ma, S.:
          <article-title>Dynamic constraints for record matching</article-title>
          .
          <source>The VLDB Journal 20</source>
          ,
          <volume>495</volume>
          {
          <fpage>520</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geerts</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakshmanan</surname>
            ,
            <given-names>L.V.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Discovering conditional functional dependencies</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on Data Engineering, ICDE'09</source>
          . pp.
          <volume>1231</volume>
          {
          <issue>1234</issue>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Flach</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Savnik</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Database dependency discovery: A machine learning approach</article-title>
          .
          <source>AI Commun</source>
          .
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <volume>139</volume>
          {
          <fpage>160</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Giannella</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
          </string-name>
          , E.:
          <article-title>On approximation measures for functional dependencies</article-title>
          .
          <source>Inform. Syst</source>
          .
          <volume>29</volume>
          (
          <issue>6</issue>
          ),
          <volume>483</volume>
          {
          <fpage>507</fpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Golab</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlo</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korn</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>On generating near-optimal tableaux for conditional functional dependencies</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <volume>376</volume>
          {
          <fpage>390</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Huhtala</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          , Karkkainen, J.,
          <string-name>
            <surname>Porkka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toivonen</surname>
          </string-name>
          , H.:
          <article-title>TANE: An e cient algorithm for discovering functional and approximate dependencies</article-title>
          .
          <source>The Computer Journal</source>
          <volume>42</volume>
          (
          <issue>2</issue>
          ),
          <volume>100</volume>
          {
          <fpage>111</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>King</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oil</surname>
          </string-name>
          , J.:
          <article-title>Discovery of functional and approximate functional dependencies in relational databases</article-title>
          .
          <source>J. Applied Math. and Decision Sciences</source>
          <volume>7</volume>
          (
          <issue>1</issue>
          ),
          <volume>49</volume>
          {
          <fpage>59</fpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kivinen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mannila</surname>
          </string-name>
          , H.:
          <article-title>Approximate inference of functional dependencies from relations</article-title>
          .
          <source>Theor. Comput. Sci</source>
          .
          <volume>149</volume>
          (
          <issue>1</issue>
          ),
          <volume>129</volume>
          {
          <fpage>149</fpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Kwashie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Mining di erential dependencies: A subspace clustering approach</article-title>
          . In: Wang,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Sharaf</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.A</surname>
          </string-name>
          . (eds.)
          <source>Proceedings of Australasian Database Conference</source>
          . pp.
          <volume>50</volume>
          {
          <fpage>61</fpage>
          . ADC '
          <volume>14</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Liu,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Discover dependencies from data - A review</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>24</volume>
          (
          <issue>2</issue>
          ),
          <volume>251</volume>
          {
          <fpage>264</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Lopes</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petit</surname>
            ,
            <given-names>J.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lakhal</surname>
          </string-name>
          , L.:
          <article-title>E cient discovery of functional dependencies and armstrong relations</article-title>
          .
          <source>In: Proceedings of the 7th International Conference on Extending Database Technology</source>
          . pp.
          <volume>350</volume>
          {
          <fpage>364</fpage>
          . EDBT '
          <volume>00</volume>
          (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Mario Andres</surname>
            Paredes-Valverde,
            <given-names>Giner</given-names>
          </string-name>
          <string-name>
            <surname>Alor-Hernandez</surname>
            ,
            <given-names>A.R.G.R.V.G.E.J.D.:</given-names>
          </string-name>
          <article-title>A systematic review of tools, languages, and methodologies for mashup development</article-title>
          .
          <source>Software Practice &amp; Experience</source>
          <volume>45</volume>
          (
          <issue>13</issue>
          ),
          <volume>365397</volume>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Novelli</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cicchetti</surname>
          </string-name>
          , R.:
          <article-title>Fun: An e cient algorithm for mining functional and embedded dependencies</article-title>
          .
          <source>In: Proceedings of 8th International Conference Database Theory</source>
          , pp.
          <volume>189</volume>
          {
          <fpage>203</fpage>
          . ICDT '
          <volume>01</volume>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Finke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zwiener</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Data pro ling with Metanome</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>8</volume>
          (
          <issue>12</issue>
          ),
          <year>1860</year>
          {
          <year>1863</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ehrlich</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marten</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neubert</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudolph</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          , Schonberg,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Zwiener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Naumann</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          :
          <article-title>Functional dependency discovery: An experimental evaluation of seven algorithms</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>8</volume>
          (
          <issue>10</issue>
          ),
          <volume>1082</volume>
          {
          <fpage>1093</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Papenbrock</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumann</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A hybrid approach to functional dependency discovery</article-title>
          .
          <source>In: Proceedings of the 2016 International Conference on Management of Data</source>
          . pp.
          <volume>821</volume>
          {
          <fpage>833</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Di erential dependencies: Reasoning and discovery</article-title>
          .
          <source>ACM Transactions on Database Systems</source>
          <volume>36</volume>
          ,
          <issue>16</issue>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Wyss</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giannella</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Robertson</surname>
          </string-name>
          , E.:
          <article-title>FastFDs: A heuristic-driven, depth- rst algorithm for mining functional dependencies from relation instances</article-title>
          .
          <source>In: Procs of Intl Conf. on Data Warehousing and Knowl. Disc</source>
          . pp.
          <volume>101</volume>
          {
          <fpage>110</fpage>
          . DaWaK '
          <volume>01</volume>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamilton</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Butz</surname>
            ,
            <given-names>C.J.</given-names>
          </string-name>
          : FD Mine:
          <article-title>Discovering functional dependencies in a database using equivalences</article-title>
          .
          <source>In: Proceedings of IEEE International Conference on Data Mining</source>
          . pp.
          <volume>729</volume>
          {
          <fpage>732</fpage>
          . ICDM '
          <volume>02</volume>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>