=Paper=
{{Paper
|id=Vol-2037/paper22
|storemode=property
|title=A Genetic Algorithm to Discover Relaxed Functional Dependencies from Data
|pdfUrl=https://ceur-ws.org/Vol-2037/paper_22.pdf
|volume=Vol-2037
|authors=Loredana Caruccio,Vincenzo Deufemia,Giuseppe Polese
|dblpUrl=https://dblp.org/rec/conf/sebd/CaruccioDP17
}}
==A Genetic Algorithm to Discover Relaxed Functional Dependencies from Data==
A genetic algorithm to discover relaxed functional dependencies from data Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese University of Salerno, Department of Computer Science, via Giovanni Paolo II n.132, 84084 Fisciano (SA), Italy {lcaruccio,deufemia,gpolese}@unisa.it Abstract. Approximate functional dependencies are used in many emerg- ing application domains, such as the identification of data inconsistencies or patterns of semantically related data, query rewriting, and so forth. They can approximate the canonical definition of functional dependency (fd) by relaxing on the data comparison (i.e., by considering data similar- ity rather than equality), on the extent (i.e., by admitting the possibility that the dependency holds on a subset of data), or both. Approximate fds are difficult to be identified at design time like it happens with fds. In this paper, we propose a genetic algorithm to discover approximate fds from data. An empirical evaluation demonstrates the effectiveness of the algorithm. Keywords: Functional Dependency, Genetic Algorithm, Discovery from data 1 Introduction In the last decades, functional dependencies have been extended to deal with new problems, such as data cleansing [2], record matching [9], query rewriting upon schema evolutions [6], or to express constraints for complex data types, such as multimedia [7]. To this end, the canonical definition of fd has been extended either by using approximate paradigms rather than equality to compare attribute values, or by admitting subsets of data on which the fd does not hold. In the literature such extensions are referred to as relaxations, and these new dependencies are indicated with the term relaxed functional dependencies (rfds) [3]. The relaxation degree is expressed by means of thresholds, indicating either the required similarity degree for the rfd to hold, or the minimum percentage of data on which the rfd must hold. While fds were originally specified at database design time, as properties of a schema that should hold on every instance of it, in order to reduce the designer effort and to dynamically adapt the rfds to the evolution of the appli- cation domain, it is necessary to automatically discover them from data. This is made possible also thanks to the availability of big data collections, and to the contributions of several research areas, such as machine learning and knowledge discovery [18]. Most of the discovery algorithms described in the literature are intended for fds [23], whereas few rfd definitions are equipped with algorithms for discov- ering them from data [18]. In this paper, we propose a genetic algorithm (GA) to identify a broad class of rfds, including both relaxation criteria. In general, GAs are efficient for global searches in large search spaces, for which determinis- tic methods are not suitable. They perform operations inspired to natural species evolutions, such as natural selection, crossover, and mutation. In particular, by using these operations, the proposed algorithm iteratively generates new candi- date rfds, but only a subset of them, determined by means of a fitness function, survives to the evolution process. The fitness function exploits the support and confidence quality measures, mainly used for evaluating association rules. An empirical evaluation demonstrates the effectiveness of the algorithm. The paper is organized as follows. Section 2 reviews the rfd discovery algo- rithms existing in the literature. Section 3 provides some background definitions about rfds and formulates the problem of rfd discovery. Section 4 presents the proposed genetic algorithm named GA-rfd for discovering rfds from data, whose performance are reported in Section 5. Finally, summary and concluding remarks are included in Section 6. 2 Related Work In the literature there are two main categories of methods to automatically discover fds from data, namely top-down and bottom-up methods. The formers exploit an attribute lattice to generate candidate fds, which are successively tested to verify their validity. The whole process is made efficient by exploiting valid fds to prune the search space for new candidate fds. Examples of top- down approaches include the algorithms TANE [14], FD Mine [27], FUN [21], and DFD [1]. On the other hand, bottom-up methods derive candidate fds from two attribute subsets, namely agree-sets and difference-sets, which are built by comparing the values of attributes for all possible combinations of pairs of tuples. Examples of bottom-up approaches include the algorithms DepMiner [19], FastFD [26], and FDep [11]. Recently, an hybrid algorithm has been proposed in order to obtain better performance in all cases [24]. Although a recent survey listed thirdy-four different types of rfds [3], very few of them are equipped with algorithms for discovering them from data [18]. These are reviewed in the following. Approximate functional dependencies (afds) are fds holding for ‘most’ rather than ‘all’ the tuples of a relation r [16]. The amount of tuples satisfying the afds can be calculated by means of several measures [12], among which the g3 error measure is the most frequently used [16]. The latter represents the minimum fraction of tuples that must be removed from a relation instance r in order to make a candidate afd valid. Most approaches for afd discovery use a small portion of tuples (sampling) s ⊂ r to decide whether an afd holds on r [16]. The method proposed in [15] exploits the error measure of super keys to determine the approximate satisfaction of afds. Like afds, conditional functional dependencies (cfds) are fds holding for a subset of tuples, but in this case the subset is identified by specifying condi- tions [2]. Among the approaches for discoverying cfds from data, the algorithm proposed in [8] exploits an attribute lattice derived from partitions of attribute values, in order to generate candidate cfds. Alternatively, the greedy algorithm proposed in [13] tries to derive cfds by finding close-to-optimal tableau, where the closeness is measured by means of support and confidence. In [10] the au- thors adapt three algorithms used for fd discovery, namely FD Miner, TANE, and FastFD, to the discovery of cfd. Matching dependencies (mds) are rfdc defined in terms of similarity pred- icates to accommodate errors, and different representations in unreliable data sources [9]. They have been mainly proposed for object identification. The most recent algorithm for md discovery evaluates the utility of candidate mds for a given database instance, determining the corresponding similarity threshold pat- tern. The utility is measured by means of support and confidence mds, whereas thresholds are determined by analyzing the statistical distribution of data. Differential dependencies (dds) are rfdc based on differential constraints on attribute values [25]. Among the algorithms proposed for their discovery, the one proposed in [25] exploits reduction algorithms, which detect dds by first fixing the RHS differential function for each attribute in r, and then finding the set of differential functions that left-reduce the LHS. The performances of the algorithm are improved by means of pruning strategies based on the subsump- tion order of differential functions, implication of dds, and instance exclusion. Another approach for dd discovery tries to reduce the problem search space by assuming a user-specified distance threshold as upper limit for the distance intervals of LHS [17]. The algorithm is based on a distance-based subspace clus- tering model, and exploits pruning strategies to efficiently discover dds when high threshold values are specified. As opposed to these discovery algorithms, which focus each on a specific rfd, the discovery algorithm proposed in [5] aims to identify the entire class of rfds [3] relaxing on the tuple comparison method. In particular, the approach relies on lattice-based algorithms conceived for fd discovery, which are feeded with similar subsets of tuples derived from previously computed differential matrices. The algorithm proposed in this paper further extends the class of discovered rfds, by also including those relaxing on the extent. 3 Discovery Relaxed Functional Dependencies from Data 3.1 Relaxed Functional Dependencies Before defining rfds we need to introduce the concept of similarity constraint, which expresses the similarity of two values on a specific domain, and it is rep- resented through a function φ. In particular, given two attributes A and B on a domain D, φ(A, B) evaluates the similarity of A and B. As an example, φ can be defined in terms of a similarity metric ≈, like for instance the edit distance, such that a≈b is true if a and b are “close” enough w.r.t. a predefined threshold. Given a relational schema R defined over a fixed set of attributes, and R1 = (A1 , . . . , Ak ) and R2 = (B1 , . . . , Bm ) two relation schemas of R, an rfd ϕ on R is denoted by Ψ ≥ Dc1 × Dc2 ∶ (X1 , X2 )Φ1 ÐÐ→ (Y1 , Y2 )Φ2 (1) where k m – Dc1 ×Dc2 = {(t1 , t2 ) ∈ dom(R1 )×dom(R2 ) ∣ ( ⋀ c1i (t1 [Ai ])) ⋀ ( ⋀ c2j (t2 [Bj ]))} i=1 j=1 where c1 = (c11 , . . . , c1k ) and c2 = (c21 , . . . , c2m ), with c1i and c2j predicates on dom(Ai ) and dom(Bj ), respectively, that filter the tuples on which ϕ applies; – X1 , Y1 ⊆ attr(R1 ), and X2 , Y2 ⊆ attr(R2 ), with X1 ∩ Y1 = ∅ and X2 ∩ Y2 = ∅; – Φ1 (Φ2 , resp.) is a set of constraints φ[X1 , X2 ] (φ[Y1 , Y2 ], resp.) on attributes X1 and X2 (Y1 and Y2 , resp.). For any pair of tuples (t1 , t2 )∈ Dc1 × Dc2 , the constraint φ[X1 , X2 ] (φ[Y1 , Y2 ], resp.) indicates true, if the similarity of t1 and t2 on attributes X1 and X2 (Y1 and Y2 , resp.) agrees with the constraint specified by φ[X1 , X2 ] (φ[Y1 , Y2 ], resp.). – Ψ is a coverage measure defined on Dc1 ×Dc2 , which quantifies the satisfiabil- ity degree of ϕ on r, and can be defined as a function Ψ ∶ dom(X)×dom(Y ) → R measuring the amount of tuples in r violating or satisfying ϕ. – is a threshold indicating the lower bound (or upper bound in case the comparison operator is ≤) for the result of the coverage measure. Given r1 ⊆ Dc1 and r2 ⊆ Dc2 two relation instances on (R1 , R2 ), (r1 , r2 ) satisfies the rfd ϕ, denoted by (r1 , r2 ) ⊧ ϕ, if and only if: ∀ (t1 , t2 ) ∈ (r1 , r2 ), if φ[X1 , X2 ] indicates true for each constraint φ ∈ Φ1 , then almost always φ[Y1 , Y2 ] indicates true for each constraint φ ∈ Φ2 . Here, almost always means that Ψ (πX1 (r1 )πX2 (r2 ), πY1 (r1 )πY2 (r2 )) ≥ . For rfds defined on single relation schemas (i.e., R1 = R2 ), if X1 = X2 and Y1 = Y2 , then we will simplify the rfd notation given in (1) by using Dc , X, and Y to refer to the Cartesian product Dc1 × Dc2 , and to the pairs (X, X) and (Y , Y ), respectively. Moreover, if the rfd has to be satisfied by all tuples in r, then the symbol Ψerr(0) is shown on the arrow. Such coverage measure corresponds to the expression ψ(X, Y ) = 0, where ψ(X, Y ) measures the number of tuples violating the rfd. As an example, the canonical fd can also be written Ψerr(0) as: Dtrue ∶ Xeq ÐÐÐÐ→ Yeq , where true is a sequence of tautologies, hence Dtrue = dom(R1 ), whereas eq is the equality constraint. As an example, in a database of body measures it is likely to have a simi- lar shoe size for people having a similar height. Thus, the following rfd might Ψerr(0) hold Dtrue ∶ Height≈ ÐÐÐÐ→ Shoe Size≈ , where ≈ is a differential function. On the other hand, there might be exceptions to this dependencies, since few peo- ple might have close heights but more different shoe sizes. This can be modeled by introducing a different coverage measure into the rfd, making it approximate ψ(Height,ShoeSize)≤0.02 Dtrue ∶ Height≈ ÐÐÐÐÐÐÐÐÐÐÐÐÐÐ→ Shoe Sizeeq , where ψ(Height, ShoeSize) ≤ 0.02 indicates that at most for 2% of tuples the rfd does not hold. 3.2 The RFD discovery problem Given a relation r, the discovery of rfds is the problem of finding a minimal cover set of valid rfds that hold in r. This problem introduces further complexity to the already complex task of fd discovery from data, due to the significantly larger search space, which derives from the several similarity/distance functions that may be defined over each attribute [17]. Consequently, it is necessary to devise tractable algorithms capable of extracting rfds with meaningful similarity constraints. In this paper, we analyze the problem of rfd discovery from data given a set of thresholds for both attribute comparison and coverage measure. Although this reduces the overall complexity of the problem, it still remains more complex than the fds discovery one. In fact, as said in Section 3.1, an rfd X → Y holds on a relation r iff whenever the distance between two tuples t1 and t2 in r is below a threshold value αA on each attribute A belonging to X, then their distance is below a threshold value αB on each attribute B belonging to Y , with a degree of certainty greater than a threshold value . For this reason, we have to analyze the similarity between portions of tuple pairs rather than their equality. This does not permit to exploit the equivalence classes of attribute values, usually used for fd discovery. Indeed, an attribute value might be similar to other attribute values that are not similar to each other. 4 Identifying Relaxed Functional Dependencies with a Genetic Algorithm The proposed algorithm discovers rfds by comparing tuples pairwise, based on the similarity of subsets of their attributes. In particular, pairs of tuples with a similarity higher than a threshold are represented with the value 1, and 0 oth- erwise. In this way, we reduce the rfd discovery problem to a search one, which is solved through a GA. The latter implements a simulation in which a popu- lation of candidate solutions to an optimization problem evolves towards better solutions. The process usually starts from a population of randomly generated individuals (Initialization), which evolves by stochastically selecting multiple in- dividuals from the current population (based on a fitness function), and modified (recombined and possibly randomly mutated) to form a new population. The al- gorithm terminates when either a maximum number of generations has been produced, or a satisfactory fitness level has been reached for the population. In the first case, a satisfactory solution may or may not have been reached. 4.1 GA-RFD In this section, we explain the GA to discover rfds, which is named GA-rfd. In particular, in the following we first explain how the discovery problem has been encoded for GA processing, then we define the fitness function to evaluate the satisfactory degree of individuals, and finally, we present the GA-rfd algorithm. Encoding. Given the thresholds for attribute comparison and coverage measure, the problem of rfd discovery reduces to find all possible dependencies X → Y satisfying the following rule: tuples that are similar on the LHS must correspond to those that are similar on the RHS. Such a rule must hold for almost all set of tuples that are similar on the LHS attributes, according to the threshold specified for the coverage measure. In order to accomplish this with a GA, the dataset has to be modeled in a way that highlights similarity between tuple pairs. For this reason, in the proposed methodology we transform an input dataset in terms of a difference dataset. The latter contains the input dataset attributes as columns, the input dataset tuple pairs as rows, and the similarity between tuple pairs as values. In particular, given an attribute Xi and a tuple pair (t1 , t2 ), the difference dataset will have the value 1 if t1 [Xi ] is similar to t2 [Xi ] according to the threshold associated to Xi , and the value 0 otherwise. Example 1. Let us consider the sample dataset shown in Table 1a, and the similarity thresholds: 1 for the attributes Height and Shoe Size, and 10 for the attribute Weight. By using the absolute difference as a distance function, we can derive the difference dataset shown in Table 1b. Table 1 A dataset of body measures. Pair ID Height Weight Shoe Size (1,2) 1 1 1 (1,3) 1 1 1 (1,4) 1 1 1 (1,5) 0 0 1 (1,6) 0 1 0 (1,7) 0 1 1 Tuple ID Height Weight Shoe size (2,3) 1 1 1 1 175 70 40 (2,4) 1 1 1 2 175 75 39 (2,5) 0 1 0 3 175 69 40 (2,6) 0 1 0 4 176 71 40 (2,7) 0 0 1 5 178 81 41 (3,4) 1 1 1 6 169 73 37 (3,5) 0 0 1 7 170 62 39 (3,6) 0 1 0 (a) A sample dataset. (3,7) 0 1 1 (4,5) 0 1 1 (4,6) 0 1 0 (4,7) 0 1 1 (5,6) 0 1 0 (5,7) 0 0 0 (6,7) 1 0 0 (b) The difference dataset. In order to construct individuals in the population we use an array of integers representing the attribute indices in the difference dataset. In particular, each item i of the array corresponds to a specific attribute Xi in the dataset. In particular, since all the rfds can be reduced w.l.o.g. to a format with a single attribute on the RHS, each individual of size k will represent a candidate rfd, where the k − 1 attributes on the LHS correspond to the first k − 1 items in the array, and the last one corresponds to the attribute on the RHS. It is worth to notice that the algorithm can be iterated with several values for k in order to consider rfds with different sizes on the LHS. Example 2. The individual [3, 1] for the dataset in Table 1a will represent the candidate rfd: Shoe Size → Height. Fitness Function. In order to discover rfds we exploit the concepts of support and confidence commonly used in the context of association rule mining. In particular, several studies have demonstrated the relationship existing between fds and association rules [14]. Formally, let X and Y be two sets of items, the support of a set X, denoted with sup(X), represents the ratio of the transactions in the dataset containing X, whereas the confidence represents the ratio sup(X ∪ Y )/sup(X). In our context, sup(X) represents the ratio of tuple pairs that are similar on all attributes in X. In this way, we can use the confidence as the fitness function to determine if a candidate rfd X → Y is valid: conf (X → Y ) = supp(X∪Y supp(X) ) . This fitness function returns 1 if and only if whenever tuple pairs are similar on attributes in X, then they are similar also on Y . In order to validate a candidate rfd the proposed algorithm associates the threshold specified for the coverage measure to the fitness function. Example 3. If we consider the individual [3, 1] of Example 2, representing the candidate rfd: Shoe Size → Height, the associated support and confidence are: sup(Shoe Size) = 21 13 , sup({Shoe Size, Height}) = 216 , conf (Shoe Size → Height) = 21 × 13 = 13 = 0.46. This candidate rfd will not be considered as valid, 6 21 6 unless the coverage threshold specified in input is less than 0.46. Algorithm 1 The general GA-rfd algorithm 1: procedure GA-RFD(difference dataset D, extent threshold β, int maxgen, int pop size, int k, double p1 , double p2 ) 2: Set pop, f inal pop, new pop 3: int gen ← 0 4: f inal pop ← initialize(pop size, k, numCols(D)) 5: while gen ≤ maxgen do 6: if (allOf(f inal pop, β)) then 7: break 8: end if 9: new pop ← select(f inal pop, β) 10: pop ← ∅ 11: pop ← crossover(new pop, k, p1 ) 12: pop ← mutate(pop, k, numCols(D), p2 ) 13: f inal pop ← new pop ∪ pop 14: g ← g+1 15: end while 16: return f inal pop 17: end procedure Algorithm. The pseudo-code of the main procedure of GA-rfd is shown in Al- gorithm 1. It first produces the initial population through the initialize pro- cedure, and then, within a cycle, it selects only the individuals that reach the fitness function objective (select procedure), makes crossover on selected in- dividuals according to a fixed probability p1 (crossover procedure), mutates some of them according to a fixed probability p2 (mutate procedure). The new population is evaluated in the next iteration of the cycle. GA-rfd terminates if and only if either the population is composed of individuals, each one represent- ing a valid rfd according to the β-threshold for the coverage measure, which determines the fitness objective; or the number of iterations exceeds the specified maximum number. The initialize procedure shown in Algorithm 2 creates a set of individuals representing the initial population, according to the size pop size specified as input. The used random values are ranged in the number of columns of the dataset, since each individual element will represent a column index. The select procedure shown in Algorithm 3 analyzes the individuals of the population pop and constructs a new population (selected pop) by including individuals with a probability related to the fitness function objective β. Algorithm 2 The initialize procedure 1: procedure initialize(int pop size, int k, int numCols) 2: Set pop 3: for i ≤ pop size do 4: List x 5: for j ≤ k do 6: add randInt(1,numCols) to x 7: end for 8: add x to pop 9: end for 10: return pop 11: end procedure Algorithm 3 The select procedure 1: procedure select(Set pop, double β) 2: Set selected pop → {} 3: double f itness, conf idence 4: for each c ∈ pop do 5: conf idence = supp(c.{X∪Y supp(c.X) }) 6: if conf idence ≥ β then 7: f itness = 1 8: else 9: f itness = conf idence/β 10: end if 11: if rand() < f itness then 12: selected pop = selected pop ∪ {c} 13: end if 14: end for 15: return selected pop 16: end procedure The crossover procedure shown in Algorithm 4 creates a new population of individuals (new pop) by crossing individual pairs. We use a one-point crossover to recombine individuals; the crosspoint is selected randomly. In this procedure we use the function swap, which changes a sublist of two individuals by swapping one to each other. The mutate procedure shown in Algorithm 5 creates a new population of individuals (new pop) by mutating one value in each individual. The individual element that has to be changed (gene) and the new value (new gene) are selected randomly. In this procedure we use the function swapV alue, which changes a value in a individual by swapping it with the new value. The mutate procedure permits to detect candidate rfds that do not directly depend on the initial population. It has associated a low probability, which limits the randomness of the whole methodology. Algorithm 4 The crossover procedure 1: procedure crossover(Set pop, int k, double p) 2: Set new pop → {} 3: int crosspoint 4: for each c1 ,c2 ∈ pop with c1 ≠ c2 do 5: if rand() < p then 6: crosspoint = randInt(1,k) 7: for i ≤ crosspoint do 8: swap(c1 ,c2 , i) 9: end for 10: add c1 to new pop 11: add c2 to new pop 12: end if 13: end for 14: return new pop 15: end procedure Algorithm 5 The mutate procedure 1: procedure crossover(Set pop, int k, int numCols, double p) 2: Set new pop 3: int gene, new gene 4: for each c ∈ pop do 5: if rand() < p then 6: gene = randInt(1,k) 7: new gene = randInt(1,numCols) 8: swapValue(c, gene, new gene) 9: end if 10: add c to new pop 11: end for 12: return new pop 13: end procedure 5 Empirical Evaluation In this section we describe the performed empirical evaluation to verify the effectiveness of the proposed GA-rfd algorithm. In particular, we first evaluate the performance of the algorithm when used to discover fds, then we evaluated the rfd discovery performance. The evaluation has been performed on a version of GA-rfd implemented in the Python language, and on a machine with an AMD Opteron (TM) CPU, 8 GB RAM, and Linux Ubuntu 16.10 OS. Moreover, we used two real-world datasets, also employed for evaluating several fd discovery algorithms [22], and the sample dataset introduced in Example 1. The probability parameters for the crossover and mutate procedures have been set to 0.85 and 0.3, respectively. Evaluation by discovering FDs. This experiment compares the performances of GA-rfd with those of the fd discovery algorithms with the aim of evaluating the effectiveness of GA-rfd with respect to well-known results, as in the case of iris and bridges datasets [23]. Here, we do not evaluate the time performances, since fd and rfd discovery problems are not comparable. In order to reduce the problem to discover canonical fds we assigned the value 0 to all tuple comparison thresholds and the value 1 to the coverage threshold. Table 2 reports the fds discovered on the analyzed datasets. For the iris and bridge datasets GA-rfd achieves the same results obtained by other fd discovery algorithms [23]. Table 2 Results obtained by GA-rfd algorithm to discovery fds. Dataset #Columns #Rows Size(KB) #FDs sample-dataset 3 7 1 2 iris 5 150 5 4 bridges 13 108 6 142 Evaluation by discovering RFDs. In order to verify the effectiveness of GA-rfd, we dirtied some values in the datasets used in the previous experiment with the aim of verifying whether the number of canonical fds also changes. In other words, we simulated the presence of dirtiness in data, and analyzed GA-rfd overtakes errors in dependency discovery. For this reason, we evaluated how the number of rfds changes by varying either the tuple comparison thresholds or the coverage measure threshold. It is worth to notice that, we have used this methodology in order to highlight the usefulness of the proposed algorithm w.r.t. the results shown in Table 2. In fact, in real contexts we use GA-rfd with different thresholds, those indicating similar tuples, and the one specifying the satisfiability degree. Table 3 reports for each dataset: (i) the number of errors introduced in the dataset, (ii) the number fds of discovered on the dataset with dirty data, (iii) the number of discovered rfds by associating a threshold ≤1 to one attribute in order to manage similarities in the value comparison1 , (iv) the number of rfds discovered with a coverage threshold of 0.9. Table 3 Results obtained by using a threshold 1 on the comparison of attribute values, and the results obtained using a coverage threshold of 0.9. RFDs RFDs Datasets n. errors FDs by relaxing on by relaxing on tuple comparison extent sample-dataset2 1 1 2 1 iris2 2 3 4 3 bridges2 2 144 143 254 We can observe that the errors introduced in sample-dataset2 and iris2 have reduced the number of discovered fds. On the contrary, for bridge2 such a num- ber increases, since a minimal fd φ1 has been violated by errors and three new fds, whose LHS has more attributes than LHS(φ1 ), have been discovered. The relaxation on attribute comparison has allowed to capture the errors introduced in the datasets. However, for bridge2 a new dependency has been discovered 1 In particular, we have associated the threshold 1 to the first attribute in the sample- dataset2, and to the fifth attribute in both iris2 and bridges2. The other attributes have associated threshold 0. beyond those reported in 2. Regarding the relaxation of tuple coverage, for sample-dataset2 the threshold is too narrow to capture the introduced error; for iris2 we obtained less but more general dependencies than those reported in 2; for bridge2 many more dependencies were discovered, due to the fact that the introduced errors is not significant compared to the size of the dataset, and to the number of dependencies holding in it, as reported in 2. 6 Conclusion and Future Work In this paper we have proposed GA-rfd, a GA for discovering rfds from data. It analyzes tuple pairs similarity through a difference dataset, and validates rfds by calculating the confidence on each candidate rfd. A preliminary evaluation of the algorithm has been carried out to assess the effectiveness of the approach. In the future, we would like to further improve this approach in order to automatically discover the rfds and the threshold ranges of validity, without re- questing their specification to the user. Furthermore, we would like to investigate the discovery of GA-rfd in the context of user interaction logs, especially in web applications, aiming to mine user intent [4]. To this end, mashup repositories are a further interesting application domain, since they are precious sources of data concerning mashup component usage, which can be useful to opportunely advise the development of new mashups [20]. References 1. Abedjan, Z., Schulze, P., Naumann, F.: DFD: Efficient functional dependency dis- covery. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. pp. 949–958. CIKM ’14 (2014) 2. Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A.: Conditional func- tional dependencies for data cleaning. In: Proceedings of the 25th International Conference on Data Engineering. pp. 746–755. ICDE ’07 (2007) 3. Caruccio, L., Deufemia, V., Polese, G.: Relaxed functional dependencies – A survey of approaches. IEEE TKDE 28(1), 147–165 (2016) 4. Caruccio, L., Deufemia, V., Polese, G.: Understanding user intent on the web through interaction mining. Journal of Visual Languages & Computing 31, Part B, 230 – 236 (2015) 5. Caruccio, L., Deufemia, V., Polese, G.: On the discovery of relaxed functional dependencies. In: Proceedings of the 20th International Database Engineering & Applications Symposium. pp. 53–61. IDEAS ’16 (2016) 6. Caruccio, L., Polese, G., Tortora, G.: Synchronization of queries and views upon schema evolutions: A survey. ACM Transactions on Database Systems (TODS) 41(2), 9 (2016) 7. Chang, S.K., Deufemia, V., Polese, G., Vacca, M.: A normalization framework for multimedia databases. IEEE TKDE 19(12), 1666–1679 (2007) 8. Chiang, F., Miller, R.J.: Discovering data quality rules. Proceedings of the VLDB Endowment 1(1), 1166–1177 (2008) 9. Fan, W., Gao, H., Jia, X., Li, J., Ma, S.: Dynamic constraints for record matching. The VLDB Journal 20, 495–520 (2011) 10. Fan, W., Geerts, F., Lakshmanan, L.V.S., Xiong, M.: Discovering conditional func- tional dependencies. In: Proceedings of the 25th International Conference on Data Engineering, ICDE’09. pp. 1231–1234 (2009) 11. Flach, P.A., Savnik, I.: Database dependency discovery: A machine learning ap- proach. AI Commun. 12(3), 139–160 (1999) 12. Giannella, C., Robertson, E.: On approximation measures for functional depen- dencies. Inform. Syst. 29(6), 483–507 (2004) 13. Golab, L., Karloff, H., Korn, F., Srivastava, D., Yu, B.: On generating near-optimal tableaux for conditional functional dependencies. Proceedings of the VLDB En- dowment 1(1), 376–390 (2008) 14. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: An efficient algo- rithm for discovering functional and approximate dependencies. The Computer Journal 42(2), 100–111 (1999) 15. King, R., Oil, J.: Discovery of functional and approximate functional dependencies in relational databases. J. Applied Math. and Decision Sciences 7(1), 49–59 (2003) 16. Kivinen, J., Mannila, H.: Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149(1), 129–149 (1995) 17. Kwashie, S., Liu, J., Li, J., Ye, F.: Mining differential dependencies: A subspace clustering approach. In: Wang, H., Sharaf, M.A. (eds.) Proceedings of Australasian Database Conference. pp. 50–61. ADC ’14 (2014) 18. Liu, J., Li, J., Liu, C., Chen, Y.: Discover dependencies from data - A review. IEEE Transactions on Knowledge and Data Engineering 24(2), 251–264 (2012) 19. Lopes, S., Petit, J.M., Lakhal, L.: Efficient discovery of functional dependencies and armstrong relations. In: Proceedings of the 7th International Conference on Extending Database Technology. pp. 350–364. EDBT ’00 (2000) 20. Mario Andrés Paredes-Valverde, Giner Alor-Hernández, A.R.G.R.V.G.E.J.D.: A systematic review of tools, languages, and methodologies for mashup development. Software Practice & Experience 45(13), 365397 (2015) 21. Novelli, N., Cicchetti, R.: Fun: An efficient algorithm for mining functional and embedded dependencies. In: Proceedings of 8th International Conference Database Theory, pp. 189–203. ICDT ’01 (2001) 22. Papenbrock, T., Bergmann, T., Finke, M., Zwiener, J., Naumann, F.: Data profiling with Metanome. Proceedings of the VLDB Endowment 8(12), 1860–1863 (2015) 23. Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.P., Schönberg, M., Zwiener, J., Naumann, F.: Functional dependency discovery: An experimental evaluation of seven algorithms. Proceedings of the VLDB Endowment 8(10), 1082– 1093 (2015) 24. Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency dis- covery. In: Proceedings of the 2016 International Conference on Management of Data. pp. 821–833. ACM (2016) 25. Song, S., Chen, L.: Differential dependencies: Reasoning and discovery. ACM Transactions on Database Systems 36, 16 (2011) 26. Wyss, C., Giannella, C., Robertson, E.: FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In: Procs of Intl Conf. on Data Warehousing and Knowl. Disc. pp. 101–110. DaWaK ’01 (2001) 27. Yao, H., Hamilton, H.J., Butz, C.J.: FD Mine: Discovering functional dependencies in a database using equivalences. In: Proceedings of IEEE International Conference on Data Mining. pp. 729–732. ICDM ’02 (2002)