Incremental discovery of functional dependencies with a bit-vector algorithm

Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia, and Giuseppe Polese
University of Salerno, Department of Computer Science
via Giovanni Paolo II n.132, 84084 Fisciano (SA), Italy
{lcaruccio,scirillo,deufemia,gpolese}@unisa.it

Abstract. Functional dependencies (fds) were conceived in the early '70s, and were mainly used to verify database design and assess data quality. Nowadays they are automatically discovered from data, since they can be exploited for many different purposes, such as query relaxation, data cleansing, and record matching. In the context of big data, the speed at which new data is being created demands new efficient algorithms for fd discovery. In this paper, we propose an incremental fd discovery approach, which is able to update the set of holding fds upon the insertion of new tuples into the data instance, without having to restart the discovery process from scratch. It exploits a bit-vector representation of fds, and an upward/downward search strategy aiming to reduce the overall search space. Experimental results show that the proposed algorithm achieves considerably better time performance with respect to re-executing the discovery algorithm from scratch.

Keywords: Functional Dependency · Discovery Algorithm · Instance Updating · Incremental Discovery.

1 Introduction

With the advent of Big Data, both industry and research communities have manifested a tremendous interest in technologies capable of extracting information and correlations from data. One way to represent such relationships is to use functional dependencies (fds), which capture relationships among database columns and can be used for several advanced database operations, such as query optimisation, data cleansing, and data integration [12].
While fds were originally specified at database design time, as properties of a schema that should hold on every instance of it, the need has emerged to automatically discover them from data, in order to reduce the design effort and to support their evolution within application domains. This is made possible also thanks to the availability of big data collections, and to the contributions of several research areas, such as machine learning and knowledge discovery [9].

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.

The problem of discovering fds is extremely complex, since the number of holding fds for a given database instance can be exponential in the number of database columns. In fact, most of the discovery algorithms described in the literature aim to provide solutions in which the search space is reduced by exploiting the theoretical properties of fds. However, while the availability of big data collections stimulates the discovery of meaningful fds from data, the evolution of database instances requires that the fds are updated accordingly. In fact, one of the characteristics describing big data is velocity, which includes the fact that data are continuously produced.

In this paper, we propose an incremental discovery approach for fds, which updates the complete set of holding fds according to the newly added tuples. The approach exploits a bit-vector representation of fds, and an upward/downward search strategy aiming to prune the search space. Experimental results achieved on twenty-eight datasets show that the proposed approach achieves better time performance with respect to re-executing the discovery algorithm from scratch.

The paper is organised as follows.
Section 2 reviews the fd discovery algorithms existing in the literature. Section 3 provides some background definitions about fds and formulates the problem of fd discovery. Section 4 presents the proposed methodology for the incremental discovery of fds from data, whose evaluation is reported in Section 5. Finally, summary and concluding remarks are included in Section 6.

2 Related Work

The fd discovery problem dates back to the '80s, when the first discovery algorithms were defined [11], and many of the latest proposals are based on theoretical foundations introduced within older solutions [7, 8]. It is an extremely complex problem, since the number of potential fds can be exponential, and their detection requires analysing a huge number of column combinations [1].

In the literature there are two main categories of methods to automatically discover fds from data, namely column-based and row-based methods. The former exploit an attribute lattice to generate candidate fds, which are successively tested to verify their validity. The whole process is made efficient by exploiting valid fds to prune the search space for new candidates. Examples of such top-down approaches include the algorithms TANE [8], FD Mine [18], and DFD [2]. Row-based methods derive candidate fds from two kinds of attribute subsets, namely agree-sets and difference-sets, which are built by comparing the values of attributes for all possible combinations of tuple pairs. Examples of such bottom-up approaches include the algorithms DepMiner [10], FastFD [17], and FDep [6].

Experimental results show that column-based algorithms usually outperform row-based ones on datasets with many rows and few columns, whereas on datasets with few rows and many columns the row-based algorithms usually perform better [13]. Recently, a hybrid algorithm has been proposed in order to obtain better performance in all cases [14].
It combines row- and column-efficient discovery techniques by managing two separate phases: one in which it calculates fds on a randomly selected small subset of records (column-efficiency), and another in which it validates the discovered fds on the entire dataset.

One of the first theoretical proposals of an incremental algorithm for fd discovery dates back to 2001 [16]. It exploits the concepts of tuple partitions and monotonicity of fds to avoid re-scanning the database. Another proposal exploits the concept of functional independency in order to keep the set of fds updated over time [3]. Finally, in order to discover and maintain functional dependencies in dynamic datasets, the DynFD algorithm has been proposed [15]. It continuously adapts the validation structures of fds in order to evolve them with batches of inserts, updates, and deletes. Some of the methodologies surveyed above (i.e., [16, 3]) are not implemented, whereas others need to store the data structures produced during the discovery process. To this end, the advantage of the proposed methodology is that it relies only on the list of discovered fds, so it can be applied to any column-based fd discovery algorithm.

3 The Discovery Problem

Before discussing the problem of discovering fds, let us first recall the definition of the canonical fd. Let R be a relational database schema, defined over a set of attributes attr(R), derived as the union of the attributes of the relation schemas composing R, assuming w.l.o.g. that they all have unique names. For each attribute A ∈ attr(R), its domain is denoted by dom(A). Moreover, given an instance r of R and a tuple t ∈ r, we use t[A] to denote the projection of t onto A; similarly, for a set X of attributes in attr(R), t[X] denotes the projection of t onto X.
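To make the projection notation concrete, the following minimal sketch (the tuple values and helper names are illustrative, not from the paper) models tuples as dictionaries; `satisfies` performs the standard quadratic pairwise check that an fd X → Y holds on an instance r:

```python
from itertools import combinations

def project(t, X):
    # Projection t[X]: the values of tuple t on the attribute list X
    return tuple(t[a] for a in X)

def satisfies(r, X, Y):
    # Brute-force check: X -> Y holds on r iff every pair of tuples
    # agreeing on X also agrees on Y (quadratic in |r|)
    return all(project(t1, Y) == project(t2, Y)
               for t1, t2 in combinations(r, 2)
               if project(t1, X) == project(t2, X))

r = [
    {"A": "Andorra", "B": 1, "C": "French"},
    {"A": "Andorra", "B": 1, "C": "French"},
    {"A": "Cipro",   "B": 1, "C": "Greek"},
]
print(satisfies(r, ["A"], ["C"]))  # True: equal A-values imply equal C-values
print(satisfies(r, ["B"], ["C"]))  # False: two tuples agree on B but not on C
```

Discovery algorithms avoid exactly this pairwise scan; the partition-based validation recalled later makes the check far cheaper.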
An fd over R is a statement X → Y (X implies Y) with X, Y ⊆ attr(R), such that, given an instance r over R, X → Y is satisfied in r if and only if for every pair of tuples (t1, t2) in r, whenever t1[X] = t2[X], then t1[Y] = t2[Y]. X and Y are also named the Left-Hand Side (LHS) and Right-Hand Side (RHS) of the fd, respectively.

The goal of dependency discovery is to find meaningful dependencies holding among the columns of a dataset. They represent domain knowledge and can be used to assess database design and data quality [9]. Moreover, discovering dependencies from data makes it possible to capture the evolution of real-world domains.

Discovering fds over a relation instance r entails considering all the possible column combinations, and for each of them all the possible partitions forming the LHS and RHS of candidate fds. Without loss of generality, we can consider only candidates with a single attribute on the RHS. Even so, the fd discovery problem has an extremely large search space. In fact, given a relation r with M attributes and n tuples, we need to consider all the possible combinations of 2 to M attributes, counting each of them as many times as the number of attributes in it, in order to account for the number of different candidates with a single RHS attribute. This complexity is synthesised by the following formula:

\sum_{k=2}^{M} \binom{M}{k} k    (1)

For example, with M = 5 this already amounts to 75 candidate fds. Since this formula counts only the candidate fds that could potentially be checked, discovery algorithms need to tackle several other issues, such as the validation of each candidate, whose complexity is linear in the number of tuples. The general fd discovery problem is described by the following definition.

Definition 1. Given a relation instance r of a relation schema R, an fd discovery algorithm has to find the minimal cover set of the fds holding in r, i.e., of the fds satisfying the property that tuples equal on the LHS must be equal also on the RHS.
According to this definition, an fd discovery algorithm has to identify the minimal cover of the fds holding on a relation instance r. The minimal cover will contain all valid fds, even those with low support. Thus, a holding fd does not appear in the minimal cover P only when it can be inferred from other fds in P. For these reasons, given the set P of all minimal fds holding on r, computed at time τ, it is necessary to update P whenever one or more tuples have been added between time τ and time τ′, since new tuples can violate one or more fds of P. Details on how we deal with this problem are discussed in the next section.

4 Incremental Discovery of Functional Dependencies

This section presents the structure of the proposed incremental approach for fd discovery by introducing the data representation and the search strategies. Data structures, pruning, and validation processes of fds will, therefore, be described.

4.1 Incremental Methodology

fd discovery algorithms operating on static datasets need to be re-executed upon the insertion of new tuples. This is a computationally expensive operation, especially for database instances with a large number of rows and columns. The main problem with database instance evolution is that fds found at time τ can be invalidated at time τ′. For this reason, it is necessary to implement incremental approaches that reduce the time for searching and validating fds. In our study, we propose an incremental discovery approach that takes as input the fds associated with a relation instance r, represents fds using bitsets, and performs an upward and downward search strategy.

Figure 1 shows the binary representation of an fd over k attributes. In particular, each dependency X → A requires two bitsets: the first contains all the LHS attributes, while the second contains the RHS attribute of the fd. Each location of a bitset B represents an attribute of the relation instance r.
In particular, given an fd X → A on r, if BX[i] = 1 then the i-th attribute of r belongs to X. The size of each bitset corresponds to the number of attributes of the considered instance r. In this way, it is possible to represent dependencies with hundreds of attributes in a compact and lightweight fashion.

In our methodology, all binary fds holding on an instance at time τ are stored in a linked ordered hash map. This data structure ensures that information is quickly saved and retrieved. The map uses as key the bit-vector representation of an fd, and as value the next dependency, according to an ordering criterion based on the LHS cardinality. Moreover, two fast arrays link into the map in order to reduce the insertion times of the fds. This data organisation makes it possible to perform a level-wise discovery based on a lattice search strategy. Furthermore, it guarantees an immediate pruning of dependencies that are not minimal.

As said above, inserting new tuples into a relation instance can affect the validity of the fds. In particular, the main effects that can be produced by new tuples are:

– Invalidation of functional dependencies. Let X → A be a minimal fd with X, A ⊆ attr(R) at time τ, such that for each pair of tuples (t1^τ, t2^τ) belonging to the instance r, whenever t1^τ[X] = t2^τ[X] then t1^τ[A] = t2^τ[A]. If at time τ + 1 there exists a new tuple t3^{τ+1} such that:

t1^τ[X] = t3^{τ+1}[X] ∧ t1^τ[A] ≠ t3^{τ+1}[A]

or two new tuples t4^{τ+1}, t5^{τ+1}, which verify the following property:

t4^{τ+1}[X] = t5^{τ+1}[X] ∧ t4^{τ+1}[A] ≠ t5^{τ+1}[A]

then the fd X → A is no longer valid.

– Confirmation of minimality. If a minimal functional dependency is valid at time τ and has not been invalidated by any tuple at time τ + 1, then this dependency is minimal also at time τ + 1.

Based on these considerations, the methodology starts from the functional dependency extracted at time τ with the smallest LHS cardinality, and from the attribute partitions of the relation instance.
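A possible realisation of this representation (a sketch under our own naming conventions, not the authors' Java implementation) encodes each side of an fd as an integer bitmask over the attribute positions, and checks the two invalidation conditions above against a batch of new tuples:

```python
ATTRS = ["A", "B", "C", "D"]  # attribute order fixes the bit positions

def to_bitset(attrs):
    # Set bit i when the i-th attribute of the relation belongs to the set
    bits = 0
    for a in attrs:
        bits |= 1 << ATTRS.index(a)
    return bits

def attrs_of(bits):
    # Decode a bitset back into attribute names
    return [a for i, a in enumerate(ATTRS) if bits >> i & 1]

# The fd BC -> D is stored as a pair of bitsets (LHS, RHS)
fd = (to_bitset(["B", "C"]), to_bitset(["D"]))

def invalidated(fd, old_tuples, new_tuples):
    # A new tuple invalidates X -> A if it agrees on X with some other
    # tuple (old or new) while disagreeing on A
    X, A = attrs_of(fd[0]), attrs_of(fd[1])
    for t_new in new_tuples:
        for t in old_tuples + new_tuples:
            if t is not t_new and all(t[a] == t_new[a] for a in X) \
               and any(t[a] != t_new[a] for a in A):
                return True
    return False

old = [{"A": "Andorra",    "B": 1, "C": "French", "D": "French"},
       {"A": "San Marino", "B": 2, "C": "French", "D": "Italian"}]
new = [{"A": "San Marino", "B": 2, "C": "French", "D": "French"}]
print(invalidated(fd, old, new))  # True: tuples agree on BC but differ on D
```

Bitwise AND/OR on such masks then yields LHS subset and superset tests in constant time, which is what makes the upward/downward navigation of the lattice cheap.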
In the pre-processing phase, the fds holding at time τ are mapped into the linked ordered hash map, and the partitions are updated according to the set of new tuples. The use of tuple partitions avoids accessing the data during the discovery, and allows the fds to be validated through the refinement property [8].

Definition 2. Refinement. A functional dependency X → A holds on r iff |πX| = |π(X∪A)|, where πX and π(X∪A) are the sets of equivalence classes, i.e., the partitions of r on X and on X ∪ A, respectively.

Fig. 1. Bit-vector representation of a functional dependency.

Algorithm 1 Incremental Discovery Algorithm
INPUT: a set Στ of valid and minimal fds at time τ
OUTPUT: a set Στ+1 of new valid and minimal fds at time τ + 1
1: for all X → A ∈ Στ do
2:   if REFINEMENT(X → A) is not valid then
3:     Ll+1 ← NEXTLEVEL(X → A)
4:     for all Xl+1 ∈ Ll+1 do
5:       if not INFERENCE(Xl+1 → A) then
6:         Στ ← Στ ∪ {Xl+1 → A}
7:   else
8:     Ll−1 ← PREVLEVEL(X → A)
9:     for all Xl−1 ∈ Ll−1 do
10:      if INFERENCE(Xl−1 → A) then
11:        Στ ← Στ \ {X → A}
12: Στ+1 ← Στ

If a minimal fd X → A is invalidated at time τ + 1, then new candidate dependencies need to be analysed. In particular, it is necessary to validate all non-trivial fds XB → A for each B ∈ attr(R), with B ∉ X ∪ A. This means that a higher level in the lattice search strategy is considered. In fact, only candidates of higher levels w.r.t. the fds holding at time τ can hold at time τ + 1. Moreover, it is not necessary to verify the dependencies X \ B → A for each B ∈ X, since such fds were not valid at time τ, and hence cannot be valid at time τ + 1. However, it is necessary to guarantee the minimality of XB → A. In fact, such a dependency can be considered minimal at time τ + 1 if and only if it cannot be inferred from other fds holding at time τ + 1.

Definition 3. Inference. A functional dependency XB → A is inferred from another dependency XB \ Z → A, with Z ⊆ XB, iff XB \ Z → A holds at time τ + 1 and Z ≠ ∅.
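The refinement property of Definition 2 can be checked without rescanning raw data once partitions are maintained; a minimal sketch (hypothetical helper names, not the authors' code) groups tuple ids by their projection and compares the class counts:

```python
from collections import defaultdict

def partition(r, X):
    # pi_X: the equivalence classes of tuple ids that agree on X
    classes = defaultdict(list)
    for i, t in enumerate(r):
        classes[tuple(t[a] for a in X)].append(i)
    return list(classes.values())

def refines(r, X, A):
    # X -> A holds iff |pi_X| == |pi_(X u A)| (Definition 2)
    return len(partition(r, X)) == len(partition(r, X + A))

r = [{"A": "Andorra", "B": 1, "D": "French"},
     {"A": "Andorra", "B": 2, "D": "Russian"},
     {"A": "Cipro",   "B": 1, "D": "French"}]
print(partition(r, ["A"]))       # [[0, 1], [2]]
print(refines(r, ["B"], ["D"]))  # True: pi_B and pi_BD both have 2 classes
```

Adding A to X can only split classes, never merge them, so equal class counts mean no class of πX was split, i.e., tuples equal on X are also equal on A.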
The inference checking is performed in the proposed search process by exploiting the previously introduced linked ordered hash map. In fact, it is necessary to verify that there does not exist an fd, among those already validated at time τ + 1, from which XB → A can be inferred. In particular, in order to speed up the inference checking process, we used a hash map that links each possible RHS of an fd with all the minimal fds already validated at time τ + 1. This allows the number of fds to be checked to be pruned.

The complete discovery process defined according to the proposed methodology is described by Algorithm 1. In particular, for each fd X → A holding at time τ (line 1), it verifies whether X → A also holds at time τ + 1 through the REFINEMENT function (line 2). Next, if X → A is not valid, the algorithm generates new candidate fds at a higher level, excluding those that can be inferred (lines 3-6). Instead, if X → A is valid, it is kept in the result set only if it cannot be inferred from other holding fds at a lower level (lines 8-11).

In general, the proposed approach can drastically reduce the execution time for two main reasons:

1. the bit-vector representation of fds makes it possible to quickly derive new configurations of fds by computing the new LHSs in terms of attribute supersets or subsets. In other words, it optimises the downward/upward search strategies;
2. the ordered linked hash map greatly prunes the search process, since it guarantees a fast minimality test.

Example 1. Table 1 shows a database instance r in which three new tuples have been added at time τ + 1.
Starting from the partitions of the instance r at time τ, that is, πA = {[0, 1], [3, 4], [2]}, πB = {[3, 4], [0], [1, 2]}, πC = {[0, 2], [4], [1, 3]}, πD = {[0, 3], [4], [2], [1]}, the incremental algorithm computes the updated partitions, that is, π′A = {[0, 1], [3, 4, 7], [2, 5, 6]}, π′B = {[3, 4], [0], [6, 7], [1, 2, 5]}, π′C = {[0, 2, 5, 7], [4], [1, 3, 6]}, π′D = {[0, 3, 5], [7], [4], [2], [6], [1]}, and loads the minimal fds holding at time τ. The latter are represented in terms of bit-vectors and inserted into the ordered linked map.

Figure 2 shows the linked ordered hash map obtained at time τ + 1 for the new instance of Table 1. It visualises all the considered candidate fds. In particular, each bit-vector in Figure 2 represents: 1) a minimal fd holding at time τ + 1, 2) a candidate fd that has been invalidated at time τ + 1, or 3) a candidate fd that is not minimal at time τ + 1. Initially, the dependency with the lowest LHS cardinality is selected, by using a fast array to directly link the first dependency with a given LHS cardinality. In Figure 2 this corresponds to (0110) → (0001). Using the refinement property, this fd is removed because |πX| ≠ |π(X∪A)|. Starting from this, the algorithm calculates the next candidates, which consider LHS supersets. For each of them the inference checking is applied, and only those that cannot be inferred are added to the map. Therefore, in the example, the fds (1110) → (0001) and (0111) → (1000) are added to the ordered map, also updating the links that preserve the map ordering. The execution proceeds until all fds are explored. In particular, the fd (0111) → (1000) is removed because it is not minimal with respect to (0101) → (1000), which has been previously validated. Finally, the algorithm returns the new minimal set of fds holding on the instance at time τ + 1, thus avoiding the re-execution of the discovery algorithm from scratch.

Table 1. An example of a relation instance updated at time τ + 1.
Row Id  A           B  C        D
0       Andorra     1  French   French
1       Andorra     2  English  Russian
2       San Marino  2  French   Italian
3       Cipro       4  English  French
4       Cipro       4  Greek    Spanish
+ 5     San Marino  2  French   French
+ 6     San Marino  3  English  German
+ 7     Cipro       3  French   Arabic

Fig. 2. The Linked Ordered Hash Map related to Example 1.

5 Evaluation

In the following, we present experimental results concerning the performance of the proposed approach in discovering fds. In particular, the performed tests show how much the proposed approach can improve time performance with respect to the re-execution of the fd discovery algorithm.

Implementation details. The algorithm has been developed in Java 11.0.2. In particular, in order to improve the performance of the algorithm and avoid the re-calculation of partitions, we also introduced a methodology for caching partitions, since the latter are widely used for the validation of candidate fds.

Hardware and datasets. The tests have been performed on a Mac with an 8-core Intel Xeon processor at 3.20 GHz and 64GB of RAM. Moreover, the Java memory heap size has been set to 40GB, to permit a proper execution on datasets with a high number of tuples/attributes. We evaluated the proposed approach on several real-world datasets, previously used for testing fd discovery algorithms [13]. Statistics on the characteristics of the considered datasets are shown in Table 2. Such datasets are composed of one relation, since fd discovery algorithms always consider de-normalised databases.

Table 2. Characteristics of the used datasets and discovery results.

Dataset              Tuples  Attributes  %Inserts  #fds  TANE time(ms)  Incremental time(ms)
Abalone              4176    9           50%       137   308            183
Adult                32561   15          50%       78    43033          1085
Balance-scale        624     5           50%       1     130            6
Breast-cancer-wisc.  699     11          50%       46    316            284
Breast-cancer        285     10          50%       1     230            132
Bridges              108     13          50%       142   218            195
Bupa                 344     7           50%       25    134            19
CalIt                10081   3           50%       1     176            30
Cars                 407     9           50%       67    162            48
Car data             1727    7           50%       1     197            7
Chess                1999    7           50%       6     184            7
Citations            1000    9           50%       76    203            182
Citeseer             10K     7           50%       10    456            116
Citeseer             20K     6           50%       4     721            816
Cmc                  1472    10          50%       1     477            15
DBLP                 20K     8           50%       28    364            172
Echocardiogram       132     13          50%       538   191            363
Ecoli                335     9           50%       54    139            24
Haberman             305     4           50%       0     110            1
Hayes-roth           132     6           50%       5     119            2
Iris                 149     5           50%       4     116            6
Letter               20K     17          50%       61    177087         6203
Mammography          960     6           50%       0     150            1
Nursery              12960   9           50%       1     1065           45
Servo                166     5           50%       1     115            3
Tae                  150     6           50%       2     122            14
Tax                  100K    15          50%       364   29368          28598
Wine                 178     14          50%       1374  209            131

Evaluation process. We carried out different tests by using as a starting point the minimal fds extracted through the TANE algorithm [8]. All the tests were performed on datasets split into two parts. The first part represents the relation instance at time τ, and it has been given as input to TANE. The complete dataset represents the relation instance at time τ + 1, analysed by the proposed incremental discovery algorithm. Moreover, we re-executed the TANE algorithm on the complete dataset. This allowed us to analyse the resulting fds, and to compare the execution times of our approach w.r.t. a complete re-execution of TANE. Furthermore, we performed two other experiments: 1) by varying the number of tuples inserted at time τ + 1; 2) by varying the number of rows/columns of the dataset.

Analysis of the results. The first test has been conducted by simulating the insertion of 50% of the tuples of the complete dataset. The results are reported in Table 2, where a comparison between the execution times of TANE and of the incremental approach is shown. From the results we can notice that the proposed approach is in general more efficient, despite the variability of the number of rows/columns.
This is mainly due to the pruning strategies introduced by the incremental approach. However, there are a few cases in which the execution times are equivalent to those of TANE. This happens when there is a huge number of holding fds at time τ.

The second test aimed to evaluate the relationship between the execution times and the sizes of the datasets. In particular, we selected datasets by varying the number of tuples inserted at time τ + 1 in the range of 5%-40% w.r.t. the complete dataset. Results are shown in Figure 3.

Fig. 3. A comparison between the execution times of TANE and of our proposal by varying the percentage of inserted tuples at time τ + 1: (a) Citations, (b) Citeseer, (c) Letter, (d) Nursery.

In particular, the results for nursery (Figure 3(d)) and citeseer (Figure 3(b)) show an extreme reduction of the execution times when considering the incremental approach. This is probably due to the fact that the number of minimal fds at time τ was small. Moreover, although this number is higher for citations and letter, the execution times of our approach are still lower than those of TANE (Figures 3(a) and 3(c)).

The last test allowed us to evaluate the discovery time of the incremental algorithm on datasets with a fixed number of tuples but a variable number of columns (first session), and with a fixed number of attributes but a variable number of tuples within the range [2000 − 20000] with step 2000 (second session). For these experiments the dblp and citeseer datasets have been selected. Table 3 reports the characteristics of the datasets and the results of the experiments with a variable number of attributes, while the results of the second session are shown in Figure 4. By analysing Table 3, we can notice that the execution times of the proposed approach are always lower than the execution times of TANE.
Moreover, we noticed that for our incremental approach the dblp dataset is more critical than citeseer when the number of attributes increases. Also in this case, this is probably due to the higher number of fds discovered at time τ + 1. Different results have been obtained when considering the variation in the number of tuples (Figure 4). In this case, although non-monotonic, the time trend grows faster for TANE.

Table 3. A comparison between the execution times of TANE and of our proposal by varying the number of attributes.

Dataset   Tuples  Attributes  %Inserts  TANE time(ms)  Incremental time(ms)
DBLP      20K     2           3%        220            32
DBLP      20K     3           3%        277            78
DBLP      20K     4           3%        293            126
DBLP      20K     5           3%        366            162
DBLP      20K     6           3%        418            232
DBLP      20K     7           3%        490            245
Citeseer  20K     2           3%        175            2
Citeseer  20K     3           3%        203            2
Citeseer  20K     4           3%        360            3
Citeseer  20K     5           3%        376            45
Citeseer  20K     6           3%        435            73
Citeseer  20K     7           3%        456            89

Fig. 4. A comparison between the execution times of TANE and of our proposal by varying the number of tuples: (a) Citeseer, (b) DBLP.

6 Final Remarks

In this paper we have proposed an incremental approach for discovering fds, which makes it possible to update the set of holding fds without exploring the complete search space. A bit-vector representation of the fds allowed us to optimise this process. Experimental results show that the proposed approach considerably reduces the execution times with respect to a re-execution from scratch. In the future, we would like to further improve this approach in order to automatically update fds also upon other database instance modifications, such as the deletion and the updating of tuples. Another interesting issue concerns the possibility of also updating Relaxed Functional Dependencies (rfds) [4], for which the discovery process is even more complex [5].

References

1. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. The VLDB Journal 24(4), 557–581 (2015)
2.
Abedjan, Z., Schulze, P., Naumann, F.: DFD: Efficient functional dependency discovery. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management. pp. 949–958. CIKM '14 (2014)
3. Bell, S.: Discovery and maintenance of functional dependencies by independencies. In: KDD. pp. 27–32 (1995)
4. Caruccio, L., Deufemia, V., Polese, G.: Relaxed functional dependencies – A survey of approaches. IEEE TKDE 28(1), 147–165 (2016)
5. Caruccio, L., Deufemia, V., Polese, G.: On the discovery of relaxed functional dependencies. In: Proceedings of the 20th International Database Engineering & Applications Symposium. pp. 53–61. IDEAS '16 (2016)
6. Flach, P.A., Savnik, I.: Database dependency discovery: A machine learning approach. AI Commun. 12(3), 139–160 (1999)
7. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: Efficient discovery of functional and approximate dependencies using partitions. In: ICDE. pp. 392–401 (1998)
8. Huhtala, Y., Kärkkäinen, J., Porkka, P., Toivonen, H.: TANE: An efficient algorithm for discovering functional and approximate dependencies. The Computer Journal 42(2), 100–111 (1999)
9. Liu, J., Li, J., Liu, C., Chen, Y.: Discover dependencies from data - A review. IEEE Transactions on Knowledge and Data Engineering 24(2), 251–264 (2012)
10. Lopes, S., Petit, J.M., Lakhal, L.: Efficient discovery of functional dependencies and armstrong relations. In: Proceedings of the 7th International Conference on Extending Database Technology. pp. 350–364. EDBT '00 (2000)
11. Mannila, H., Räihä, K.J.: Dependency inference. In: Proceedings of the 13th International Conference on Very Large Data Bases. pp. 155–158. VLDB '87 (1987)
12. Naumann, F.: Data profiling revisited. ACM SIGMOD Record 42(4), 40–49 (2014)
13. Papenbrock, T., Ehrlich, J., Marten, J., Neubert, T., Rudolph, J.P., Schönberg, M., Zwiener, J., Naumann, F.: Functional dependency discovery: An experimental evaluation of seven algorithms.
Proceedings of the VLDB Endowment 8(10), 1082–1093 (2015)
14. Papenbrock, T., Naumann, F.: A hybrid approach to functional dependency discovery. In: Proceedings of the 2016 International Conference on Management of Data. pp. 821–833. ACM (2016)
15. Schirmer, P., Papenbrock, T., Kruse, S., Hempfing, D., Meyer, T., Neuschäfer-Rube, D., Naumann, F.: DynFD: Functional dependency discovery in dynamic datasets. In: To appear in EDBT (2019)
16. Wang, S.L., Shen, J.W., Hong, T.P.: Incremental discovery of functional dependencies using partitions. In: Proceedings of the Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569). vol. 3, pp. 1322–1326. IEEE (2001)
17. Wyss, C., Giannella, C., Robertson, E.: FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In: Proceedings of the International Conference on Data Warehousing and Knowledge Discovery. pp. 101–110. DaWaK '01 (2001)
18. Yao, H., Hamilton, H.J., Butz, C.J.: FD Mine: Discovering functional dependencies in a database using equivalences. In: Proceedings of the IEEE International Conference on Data Mining. pp. 729–732. ICDM '02 (2002)