-

A genetic algorithm to discover relaxed functional dependencies from data

Loredana Caruccio

lcaruccio@unisa.it 0

Vincenzo Deufemia

deufemia@unisa.it 0

Giuseppe Polese

gpolese@unisa.it 0 0 University of Salerno, Department of Computer Science , via Giovanni Paolo II n.132, 84084 Fisciano (SA) , Italy

Approximate functional dependencies are used in many emerging application domains, such as the identi cation of data inconsistencies or patterns of semantically related data, query rewriting, and so forth. They can approximate the canonical de nition of functional dependency (fd) by relaxing on the data comparison (i.e., by considering data similarity rather than equality), on the extent (i.e., by admitting the possibility that the dependency holds on a subset of data), or both. Approximate fds are di cult to be identi ed at design time like it happens with fds. In this paper, we propose a genetic algorithm to discover approximate fds from data. An empirical evaluation demonstrates the e ectiveness of the algorithm.

Functional Dependency Genetic Algorithm Discovery from data

In the last decades, functional dependencies have been extended to deal with new problems, such as data cleansing [ 2 ], record matching [ 9 ], query rewriting upon schema evolutions [ 6 ], or to express constraints for complex data types, such as multimedia [ 7 ]. To this end, the canonical de nition of fd has been extended either by using approximate paradigms rather than equality to compare attribute values, or by admitting subsets of data on which the fd does not hold. In the literature such extensions are referred to as relaxations, and these new dependencies are indicated with the term relaxed functional dependencies (rfds) [ 3 ]. The relaxation degree is expressed by means of thresholds, indicating either the required similarity degree for the rfd to hold, or the minimum percentage of data on which the rfd must hold.

While fds were originally speci ed at database design time, as properties of a schema that should hold on every instance of it, in order to reduce the designer e ort and to dynamically adapt the rfds to the evolution of the application domain, it is necessary to automatically discover them from data. This is made possible also thanks to the availability of big data collections, and to the contributions of several research areas, such as machine learning and knowledge discovery [ 18 ].

Most of the discovery algorithms described in the literature are intended for fds [ 23 ], whereas few rfd de nitions are equipped with algorithms for discovering them from data [ 18 ]. In this paper, we propose a genetic algorithm (GA) to identify a broad class of rfds, including both relaxation criteria. In general, GAs are e cient for global searches in large search spaces, for which deterministic methods are not suitable. They perform operations inspired to natural species evolutions, such as natural selection, crossover, and mutation. In particular, by using these operations, the proposed algorithm iteratively generates new candidate rfds, but only a subset of them, determined by means of a tness function, survives to the evolution process. The tness function exploits the support and con dence quality measures, mainly used for evaluating association rules. An empirical evaluation demonstrates the e ectiveness of the algorithm.

The paper is organized as follows. Section 2 reviews the rfd discovery algorithms existing in the literature. Section 3 provides some background de nitions about rfds and formulates the problem of rfd discovery. Section 4 presents the proposed genetic algorithm named GA-rfd for discovering rfds from data, whose performance are reported in Section 5. Finally, summary and concluding remarks are included in Section 6. 2

Related Work

In the literature there are two main categories of methods to automatically discover fds from data, namely top-down and bottom-up methods. The formers exploit an attribute lattice to generate candidate fds, which are successively tested to verify their validity. The whole process is made e cient by exploiting valid fds to prune the search space for new candidate fds. Examples of topdown approaches include the algorithms TANE [ 14 ], FD Mine [ 27 ], FUN [ 21 ], and DFD [ 1 ]. On the other hand, bottom-up methods derive candidate fds from two attribute subsets, namely agree-sets and di erence-sets, which are built by comparing the values of attributes for all possible combinations of pairs of tuples. Examples of bottom-up approaches include the algorithms DepMiner [ 19 ], FastFD [ 26 ], and FDep [ 11 ]. Recently, an hybrid algorithm has been proposed in order to obtain better performance in all cases [ 24 ].

Although a recent survey listed thirdy-four di erent types of rfds [ 3 ], very few of them are equipped with algorithms for discovering them from data [ 18 ]. These are reviewed in the following.

Approximate functional dependencies (afds) are fds holding for `most' rather than `all' the tuples of a relation r [ 16 ]. The amount of tuples satisfying the afds can be calculated by means of several measures [ 12 ], among which the g3 error measure is the most frequently used [ 16 ]. The latter represents the minimum fraction of tuples that must be removed from a relation instance r in order to make a candidate afd valid. Most approaches for afd discovery use a small portion of tuples (sampling) s ⊂ r to decide whether an afd holds on r [ 16 ]. The method proposed in [ 15 ] exploits the error measure of super keys to determine the approximate satisfaction of afds.

Like afds, conditional functional dependencies (cfds) are fds holding for a subset of tuples, but in this case the subset is identi ed by specifying conditions [ 2 ]. Among the approaches for discoverying cfds from data, the algorithm proposed in [ 8 ] exploits an attribute lattice derived from partitions of attribute values, in order to generate candidate cfds. Alternatively, the greedy algorithm proposed in [ 13 ] tries to derive cfds by nding close-to-optimal tableau, where the closeness is measured by means of support and con dence. In [ 10 ] the authors adapt three algorithms used for fd discovery, namely FD Miner, TANE, and FastFD, to the discovery of cfd.

Matching dependencies (mds) are rfdc de ned in terms of similarity predicates to accommodate errors, and di erent representations in unreliable data sources [ 9 ]. They have been mainly proposed for object identi cation. The most recent algorithm for md discovery evaluates the utility of candidate mds for a given database instance, determining the corresponding similarity threshold pattern. The utility is measured by means of support and con dence mds, whereas thresholds are determined by analyzing the statistical distribution of data.

Di erential dependencies (dds) are rfdc based on di erential constraints on attribute values [ 25 ]. Among the algorithms proposed for their discovery, the one proposed in [ 25 ] exploits reduction algorithms, which detect dds by rst xing the RHS di erential function for each attribute in r, and then nding the set of di erential functions that left-reduce the LHS. The performances of the algorithm are improved by means of pruning strategies based on the subsumption order of di erential functions, implication of dds, and instance exclusion. Another approach for dd discovery tries to reduce the problem search space by assuming a user-speci ed distance threshold as upper limit for the distance intervals of LHS [ 17 ]. The algorithm is based on a distance-based subspace clustering model, and exploits pruning strategies to e ciently discover dds when high threshold values are speci ed.

As opposed to these discovery algorithms, which focus each on a speci c rfd, the discovery algorithm proposed in [ 5 ] aims to identify the entire class of rfds [ 3 ] relaxing on the tuple comparison method. In particular, the approach relies on lattice-based algorithms conceived for fd discovery, which are feeded with similar subsets of tuples derived from previously computed di erential matrices.

The algorithm proposed in this paper further extends the class of discovered rfds, by also including those relaxing on the extent. 3 3.1

Discovery Relaxed Functional Dependencies from Data Relaxed Functional Dependencies

Before de ning rfds we need to introduce the concept of similarity constraint, which expresses the similarity of two values on a speci c domain, and it is represented through a function . In particular, given two attributes A and B on a domain D, (A; B) evaluates the similarity of A and B. As an example, can be de ned in terms of a similarity metric ≈, like for instance the edit distance, such that a≈b is true if a and b are \close" enough w.r.t. a prede ned threshold.

Given a relational schema R de ned over a xed set of attributes, and R1 = (A1; : : : ; Ak) and R2 = (B1; : : : ; Bm) two relation schemas of R, an rfd ' on R is denoted by ≥ Dc1 × Dc2 ∶ (X1; X2) 1 ÐÐ→ (Y1; Y2) 2 (1) where k m { Dc1 ×Dc2 = (t1; t2) ∈ dom(R1)×dom(R2) S i⋀1 c1i (t1[Ai]) ⋀ j⋀1 c2j (t2[Bj]) = = where c1 = (c11 ; : : : ; c1k ) and c2 = (c21 ; : : : ; c2m ), with c1i and c2j predicates on dom(Ai) and dom(Bj), respectively, that lter the tuples on which ' applies; { X1; Y1 ⊆ attr(R1), and X2; Y2 ⊆ attr(R2), with X1 ∩ Y1 = ∅ and X2 ∩ Y2 = ∅; { 1 ( 2, resp.) is a set of constraints [X1; X2] ( [Y1; Y2], resp.) on attributes X1 and X2 (Y1 and Y2, resp.). For any pair of tuples (t1, t2)∈ Dc1 × Dc2 , the constraint [X1; X2] ( [Y1; Y2], resp.) indicates true, if the similarity of t1 and t2 on attributes X1 and X2 (Y1 and Y2, resp.) agrees with the constraint speci ed by [X1; X2] ( [Y1; Y2], resp.). { is a coverage measure de ned on Dc1 × Dc2 , which quanti es the satis ability degree of ' on r, and can be de ned as a function ∶ dom(X)×dom(Y ) → R measuring the amount of tuples in r violating or satisfying '. { is a threshold indicating the lower bound (or upper bound in case the comparison operator is ≤) for the result of the coverage measure.

Given r1 ⊆ Dc1 and r2 ⊆ Dc2 two relation instances on (R1; R2), (r1; r2) satis es the rfd ', denoted by (r1; r2) ⊧ ', if and only if: ∀ (t1; t2) ∈ (r1; r2), if [X1; X2] indicates true for each constraint ∈ 1, then almost always [Y1; Y2] indicates true for each constraint ∈ 2. Here, almost always means that ( X1 (r1) X2 (r2); Y1 (r1) Y2 (r2)) ≥ .

For rfds de ned on single relation schemas (i.e., R1 = R2), if X1 = X2 and Y1 = Y2, then we will simplify the rfd notation given in (1) by using Dc, X, and Y to refer to the Cartesian product Dc1 × Dc2 , and to the pairs (X, X) and (Y , Y ), respectively. Moreover, if the rfd has to be satis ed by all tuples in r, then the symbol err(0) is shown on the arrow. Such coverage measure corresponds to the expression (X; Y ) = 0, where (X; Y ) measures the number of tuples violating the rfd. As an example, the canonical fd can also be written err(0) as: Dtrue ∶ Xeq ÐÐÐÐ→ Yeq, where true is a sequence of tautologies, hence Dtrue = dom(R1), whereas eq is the equality constraint.

As an example, in a database of body measures it is likely to have a similar shoe size for people having a similar height. Thus, the following rfd might hold Dtrue ∶ Height≈ ÐÐerÐrÐ(0→) Shoe Size≈, where ≈ is a di erential function. On the other hand, there might be exceptions to this dependencies, since few people might have close heights but more di erent shoe sizes. This can be modeled by introducing a di erent coverage measure into the rfd, making it approximate (Height;ShoeSize)≤0:02 Dtrue ∶ Height≈ ÐÐÐÐÐÐÐÐÐÐÐÐÐÐ→ Shoe Sizeeq, where (Height; ShoeSize) ≤ 0:02 indicates that at most for 2% of tuples the rfd does not hold. 3.2

The RFD discovery problem

Given a relation r, the discovery of rfds is the problem of nding a minimal cover set of valid rfds that hold in r. This problem introduces further complexity to the already complex task of fd discovery from data, due to the signi cantly larger search space, which derives from the several similarity/distance functions that may be de ned over each attribute [ 17 ]. Consequently, it is necessary to devise tractable algorithms capable of extracting rfds with meaningful similarity constraints.

In this paper, we analyze the problem of rfd discovery from data given a set of thresholds for both attribute comparison and coverage measure. Although this reduces the overall complexity of the problem, it still remains more complex than the fds discovery one. In fact, as said in Section 3.1, an rfd X → Y holds on a relation r i whenever the distance between two tuples t1 and t2 in r is below a threshold value A on each attribute A belonging to X, then their distance is below a threshold value B on each attribute B belonging to Y , with a degree of certainty greater than a threshold value . For this reason, we have to analyze the similarity between portions of tuple pairs rather than their equality. This does not permit to exploit the equivalence classes of attribute values, usually used for fd discovery. Indeed, an attribute value might be similar to other attribute values that are not similar to each other. 4

Identifying Relaxed Functional Dependencies with a Genetic Algorithm

The proposed algorithm discovers rfds by comparing tuples pairwise, based on the similarity of subsets of their attributes. In particular, pairs of tuples with a similarity higher than a threshold are represented with the value 1, and 0 otherwise. In this way, we reduce the rfd discovery problem to a search one, which is solved through a GA. The latter implements a simulation in which a population of candidate solutions to an optimization problem evolves towards better solutions. The process usually starts from a population of randomly generated individuals (Initialization), which evolves by stochastically selecting multiple individuals from the current population (based on a tness function), and modi ed (recombined and possibly randomly mutated) to form a new population. The algorithm terminates when either a maximum number of generations has been produced, or a satisfactory tness level has been reached for the population. In the rst case, a satisfactory solution may or may not have been reached. 4.1

GA-RFD In this section, we explain the GA to discover rfds, which is named GA-rfd. In particular, in the following we rst explain how the discovery problem has been encoded for GA processing, then we de ne the tness function to evaluate the satisfactory degree of individuals, and nally, we present the GA-rfd algorithm. Encoding. Given the thresholds for attribute comparison and coverage measure, the problem of rfd discovery reduces to nd all possible dependencies X → Y satisfying the following rule: tuples that are similar on the LHS must correspond to those that are similar on the RHS. Such a rule must hold for almost all set of tuples that are similar on the LHS attributes, according to the threshold speci ed for the coverage measure.

In order to accomplish this with a GA, the dataset has to be modeled in a way that highlights similarity between tuple pairs. For this reason, in the proposed methodology we transform an input dataset in terms of a di erence dataset. The latter contains the input dataset attributes as columns, the input dataset tuple pairs as rows, and the similarity between tuple pairs as values. In particular, given an attribute Xi and a tuple pair (t1; t2), the di erence dataset will have the value 1 if t1[Xi] is similar to t2[Xi] according to the threshold associated to Xi, and the value 0 otherwise.

Example 1. Let us consider the sample dataset shown in Table 1a, and the similarity thresholds: 1 for the attributes Height and Shoe Size, and 10 for the attribute Weight. By using the absolute di erence as a distance function, we can derive the di erence dataset shown in Table 1b.

In order to construct individuals in the population we use an array of integers representing the attribute indices in the di erence dataset. In particular, each item i of the array corresponds to a speci c attribute Xi in the dataset. In particular, since all the rfds can be reduced w.l.o.g. to a format with a single attribute on the RHS, each individual of size k will represent a candidate rfd, where the k − 1 attributes on the LHS correspond to the rst k − 1 items in the array, and the last one corresponds to the attribute on the RHS. It is worth to notice that the algorithm can be iterated with several values for k in order to consider rfds with di erent sizes on the LHS.

Example 2. The individual [3; 1] for the dataset in Table 1a will represent the candidate rfd: Shoe Size → Height.

Fitness Function. In order to discover rfds we exploit the concepts of support and con dence commonly used in the context of association rule mining. In particular, several studies have demonstrated the relationship existing between fds and association rules [ 14 ]. Formally, let X and Y be two sets of items, the support of a set X, denoted with sup(X), represents the ratio of the transactions in the dataset containing X, whereas the con dence represents the ratio sup(X ∪ Y )~sup(X).

In our context, sup(X) represents the ratio of tuple pairs that are similar on all attributes in X. In this way, we can use the con dence as the tness function to determine if a candidate rfd X → Y is valid: conf (X → Y ) = suspupp(pX(X∪Y) ) : This tness function returns 1 if and only if whenever tuple pairs are similar on attributes in X, then they are similar also on Y . In order to validate a candidate rfd the proposed algorithm associates the threshold speci ed for the coverage measure to the tness function.

Example 3. If we consider the individual [3; 1] of Example 2, representing the candidate rfd: Shoe Size → Height, the associated support and con dence are: sup(Shoe Size) = 1231 , sup({Shoe Size; Height}) = 261 , conf (Shoe Size → Height) = 261 × 1231 = 163 = 0:46. This candidate rfd will not be considered as valid, unless the coverage threshold speci ed in input is less than 0:46.

Algorithm 1 The general GA-rfd algorithm 1: procedure GA-RFD(di erence dataset D, extent threshold , int maxgen, int

pop size, int k, double p1, double p2) 2: Set pop, final pop, new pop 3: int gen ← 0 4: final pop ← initialize(pop size, k, numCols(D)) 5: while gen ≤ maxgen do 6: if (allOf(final pop, )) then 7: break 8: end if 9: new pop ← select(final pop, ) 10: pop ← ∅ 11: pop ← crossover(new pop, k, p1) 12: pop ← mutate(pop, k, numCols(D), p2) 13: final pop ← new pop ∪ pop 14: g ← g + 1 15: end while 16: return final pop 17: end procedure Algorithm. The pseudo-code of the main procedure of GA-rfd is shown in Algorithm 1. It rst produces the initial population through the initialize procedure, and then, within a cycle, it selects only the individuals that reach the tness function objective (select procedure), makes crossover on selected individuals according to a xed probability p1 (crossover procedure), mutates some of them according to a xed probability p2 (mutate procedure). The new population is evaluated in the next iteration of the cycle. GA-rfd terminates if and only if either the population is composed of individuals, each one representing a valid rfd according to the -threshold for the coverage measure, which determines the tness objective; or the number of iterations exceeds the speci ed maximum number.

The initialize procedure shown in Algorithm 2 creates a set of individuals representing the initial population, according to the size pop size speci ed as input. The used random values are ranged in the number of columns of the dataset, since each individual element will represent a column index.

The select procedure shown in Algorithm 3 analyzes the individuals of the population pop and constructs a new population (selected pop) by including individuals with a probability related to the tness function objective .

Algorithm 2 The initialize procedure 1: procedure initialize(int pop size, int k, int numCols) 2: Set pop 3: for i ≤ pop size do 4: List x 5: for j ≤ k do 6: add randInt(1,numCols) to x 7: end for 8: add x to pop 9: end for 10: return pop 11: end procedure Algorithm 3 The select procedure 1: procedure select(Set pop, double ) 2: Set selected pop → {} 3: double fitness, confidence 4: for each c Î pop do 5: confidence = supspu(pcp:({cX:X∪Y) }) 6: if confidence ≥ then 7: fitness = 1 8: else 9: fitness = confidence~ 10: end if 11: if rand() < fitness then 12: selected pop = selected pop ∪ {c} 13: end if 14: end for 15: return selected pop 16: end procedure

The crossover procedure shown in Algorithm 4 creates a new population of individuals (new pop) by crossing individual pairs. We use a one-point crossover to recombine individuals; the crosspoint is selected randomly. In this procedure we use the function swap, which changes a sublist of two individuals by swapping one to each other.

The mutate procedure shown in Algorithm 5 creates a new population of individuals (new pop) by mutating one value in each individual. The individual element that has to be changed (gene) and the new value (new gene) are selected randomly. In this procedure we use the function swapV alue, which changes a value in a individual by swapping it with the new value. The mutate procedure permits to detect candidate rfds that do not directly depend on the initial population. It has associated a low probability, which limits the randomness of the whole methodology.

Algorithm 4 The crossover procedure 1: procedure crossover(Set pop, int k, double p) 2: Set new pop → {} 3: int crosspoint 4: for each c1,c2 Î pop with c1 ≠ c2 do 5: if rand() < p then 6: crosspoint = randInt(1,k) 7: for i ≤ crosspoint do 8: swap(c1,c2, i) 9: end for 10: add c1 to new pop 11: add c2 to new pop 12: end if 13: end for 14: return new pop 15: end procedure Algorithm 5 The mutate procedure 1: procedure crossover(Set pop, int k, int numCols, double p) 2: Set new pop 3: int gene, new gene 4: for each c Î pop do 5: if rand() < p then 6: gene = randInt(1,k) 7: new gene = randInt(1,numCols) 8: swapValue(c, gene, new gene) 9: end if 10: add c to new pop 11: end for 12: return new pop 13: end procedure 5

Empirical Evaluation

In this section we describe the performed empirical evaluation to verify the e ectiveness of the proposed GA-rfd algorithm. In particular, we rst evaluate the performance of the algorithm when used to discover fds, then we evaluated the rfd discovery performance.

The evaluation has been performed on a version of GA-rfd implemented in the Python language, and on a machine with an AMD Opteron (TM) CPU, 8 GB RAM, and Linux Ubuntu 16.10 OS. Moreover, we used two real-world datasets, also employed for evaluating several fd discovery algorithms [ 22 ], and the sample dataset introduced in Example 1. The probability parameters for the crossover and mutate procedures have been set to 0.85 and 0.3, respectively. Evaluation by discovering FDs. This experiment compares the performances of GA-rfd with those of the fd discovery algorithms with the aim of evaluating the e ectiveness of GA-rfd with respect to well-known results, as in the case of iris and bridges datasets [ 23 ]. Here, we do not evaluate the time performances, since fd and rfd discovery problems are not comparable. In order to reduce the problem to discover canonical fds we assigned the value 0 to all tuple comparison thresholds and the value 1 to the coverage threshold. Table 2 reports the fds discovered on the analyzed datasets. For the iris and bridge datasets GA-rfd achieves the same results obtained by other fd discovery algorithms [ 23 ]. Evaluation by discovering RFDs. In order to verify the e ectiveness of GA-rfd, we dirtied some values in the datasets used in the previous experiment with the aim of verifying whether the number of canonical fds also changes. In other words, we simulated the presence of dirtiness in data, and analyzed GA-rfd overtakes errors in dependency discovery. For this reason, we evaluated how the number of rfds changes by varying either the tuple comparison thresholds or the coverage measure threshold. It is worth to notice that, we have used this methodology in order to highlight the usefulness of the proposed algorithm w.r.t. the results shown in Table 2. In fact, in real contexts we use GA-rfd with di erent thresholds, those indicating similar tuples, and the one specifying the satis ability degree.

Table 3 reports for each dataset: (i) the number of errors introduced in the dataset, (ii) the number fds of discovered on the dataset with dirty data, (iii) the number of discovered rfds by associating a threshold ≤1 to one attribute in order to manage similarities in the value comparison1, (iv) the number of rfds discovered with a coverage threshold of 0:9.

We can observe that the errors introduced in sample-dataset2 and iris2 have reduced the number of discovered fds. On the contrary, for bridge2 such a number increases, since a minimal fd 1 has been violated by errors and three new fds, whose LHS has more attributes than LHS( 1), have been discovered. The relaxation on attribute comparison has allowed to capture the errors introduced in the datasets. However, for bridge2 a new dependency has been discovered 1In particular, we have associated the threshold 1 to the rst attribute in the sampledataset2, and to the fth attribute in both iris2 and bridges2. The other attributes have associated threshold 0. beyond those reported in 2. Regarding the relaxation of tuple coverage, for sample-dataset2 the threshold is too narrow to capture the introduced error; for iris2 we obtained less but more general dependencies than those reported in 2; for bridge2 many more dependencies were discovered, due to the fact that the introduced errors is not signi cant compared to the size of the dataset, and to the number of dependencies holding in it, as reported in 2. 6

Conclusion and Future Work

In this paper we have proposed GA-rfd, a GA for discovering rfds from data. It analyzes tuple pairs similarity through a di erence dataset, and validates rfds by calculating the con dence on each candidate rfd. A preliminary evaluation of the algorithm has been carried out to assess the e ectiveness of the approach.

In the future, we would like to further improve this approach in order to automatically discover the rfds and the threshold ranges of validity, without requesting their speci cation to the user. Furthermore, we would like to investigate the discovery of GA-rfd in the context of user interaction logs, especially in web applications, aiming to mine user intent [ 4 ]. To this end, mashup repositories are a further interesting application domain, since they are precious sources of data concerning mashup component usage, which can be useful to opportunely advise the development of new mashups [ 20 ].

1. Abedjan , Z. , Schulze , P. , Naumann , F. : DFD: E cient functional dependency discovery . In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management . pp. 949 { 958 . CIKM ' 14 ( 2014 )

2. Bohannon , P. , Fan , W. , Geerts , F. , Jia , X. , Kementsietsidis , A. : Conditional functional dependencies for data cleaning . In: Proceedings of the 25th International Conference on Data Engineering . pp. 746 { 755 . ICDE ' 07 ( 2007 )

3. Caruccio , L. , Deufemia , V. , Polese , G.: Relaxed functional dependencies { A survey of approaches . IEEE TKDE 28(1) , 147 { 165 ( 2016 )

4. Caruccio , L. , Deufemia , V. , Polese , G.: Understanding user intent on the web through interaction mining . Journal of Visual Languages & Computing 31 , Part

, 230 { 236 ( 2015 )

5. Caruccio , L. , Deufemia , V. , Polese , G.: On the discovery of relaxed functional dependencies . In: Proceedings of the 20th International Database Engineering & Applications Symposium . pp. 53 { 61 . IDEAS ' 16 ( 2016 )

6. Caruccio , L. , Polese , G. , Tortora , G.: Synchronization of queries and views upon schema evolutions: A survey . ACM Transactions on Database Systems (TODS) 41(2) , 9 ( 2016 )

7 . Chang , S.K. , Deufemia , V. , Polese , G. , Vacca , M.: A normalization framework for multimedia databases . IEEE TKDE 19 ( 12 ), 1666 { 1679 ( 2007 )

8. Chiang , F. , Miller , R.J. : Discovering data quality rules . Proceedings of the VLDB Endowment 1 ( 1 ), 1166 { 1177 ( 2008 )

9. Fan , W. , Gao , H. , Jia , X. , Li , J. , Ma, S.: Dynamic constraints for record matching . The VLDB Journal 20 , 495 { 520 ( 2011 )

10. Fan , W. , Geerts , F. , Lakshmanan , L.V.S. , Xiong , M. : Discovering conditional functional dependencies . In: Proceedings of the 25th International Conference on Data Engineering, ICDE'09 . pp. 1231 { 1234 ( 2009 )

11. Flach , P.A. , Savnik , I. : Database dependency discovery: A machine learning approach . AI Commun . 12 ( 3 ), 139 { 160 ( 1999 )

12. Giannella , C. , Robertson , E.: On approximation measures for functional dependencies . Inform. Syst . 29 ( 6 ), 483 { 507 ( 2004 )

13. Golab , L. , Karlo , H. , Korn , F. , Srivastava , D. , Yu , B. : On generating near-optimal tableaux for conditional functional dependencies . Proceedings of the VLDB Endowment 1 ( 1 ), 376 { 390 ( 2008 )

14. Huhtala , Y. , Karkkainen, J., Porkka , P. , Toivonen , H.: TANE: An e cient algorithm for discovering functional and approximate dependencies . The Computer Journal 42 ( 2 ), 100 { 111 ( 1999 )

15. King , R. , Oil , J.: Discovery of functional and approximate functional dependencies in relational databases . J. Applied Math. and Decision Sciences 7 ( 1 ), 49 { 59 ( 2003 )

16. Kivinen , J. , Mannila , H.: Approximate inference of functional dependencies from relations . Theor. Comput. Sci . 149 ( 1 ), 129 { 149 ( 1995 )

17. Kwashie , S. , Liu , J. , Li , J. , Ye , F. : Mining di erential dependencies: A subspace clustering approach . In: Wang, H. , Sharaf , M.A . (eds.) Proceedings of Australasian Database Conference . pp. 50 { 61 . ADC ' 14 ( 2014 )

18. Liu , J. , Li , J. , Liu, C. , Chen , Y. : Discover dependencies from data - A review . IEEE Transactions on Knowledge and Data Engineering 24 ( 2 ), 251 { 264 ( 2012 )

19. Lopes , S. , Petit , J.M. , Lakhal , L.: E cient discovery of functional dependencies and armstrong relations . In: Proceedings of the 7th International Conference on Extending Database Technology . pp. 350 { 364 . EDBT ' 00 ( 2000 )

20. Mario Andres Paredes-Valverde, Giner

Alor-Hernandez , A.R.G.R.V.G.E.J.D.:

A systematic review of tools, languages, and methodologies for mashup development . Software Practice & Experience 45 ( 13 ), 365397 ( 2015 )

21. Novelli , N. , Cicchetti , R.: Fun: An e cient algorithm for mining functional and embedded dependencies . In: Proceedings of 8th International Conference Database Theory , pp. 189 { 203 . ICDT ' 01 ( 2001 )

22. Papenbrock , T. , Bergmann , T. , Finke , M. , Zwiener , J. , Naumann , F. : Data pro ling with Metanome . Proceedings of the VLDB Endowment 8 ( 12 ), 1860 { 1863 ( 2015 )

23. Papenbrock , T. , Ehrlich , J. , Marten , J. , Neubert , T. , Rudolph , J.P. , Schonberg, M. , Zwiener , J. , Naumann , F. : Functional dependency discovery: An experimental evaluation of seven algorithms . Proceedings of the VLDB Endowment 8 ( 10 ), 1082 { 1093 ( 2015 )

24. Papenbrock , T. , Naumann , F. : A hybrid approach to functional dependency discovery . In: Proceedings of the 2016 International Conference on Management of Data . pp. 821 { 833 . ACM ( 2016 )

25. Song , S. , Chen , L. : Di erential dependencies: Reasoning and discovery . ACM Transactions on Database Systems 36 , 16 ( 2011 )

26. Wyss , C. , Giannella , C. , Robertson , E.: FastFDs: A heuristic-driven, depth- rst algorithm for mining functional dependencies from relation instances . In: Procs of Intl Conf. on Data Warehousing and Knowl. Disc . pp. 101 { 110 . DaWaK ' 01 ( 2001 )

27. Yao , H. , Hamilton , H.J. , Butz , C.J. : FD Mine: Discovering functional dependencies in a database using equivalences . In: Proceedings of IEEE International Conference on Data Mining . pp. 729 { 732 . ICDM ' 02 ( 2002 )