<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient Validation of Functional Dependencies during Incremental Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Caruccio</string-name>
          <email>lcaruccio@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Cirillo</string-name>
          <email>scirillo@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Deufemia</string-name>
          <email>deufemia@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Polese</string-name>
          <email>gpolese@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Data Profiling, Functional Dependency, Incremental Discovery</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Salerno</institution>
          ,
          <addr-line>via Giovanni Paolo II n.132, 84084 Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>The discovery of functional dependencies (FDs) from data is facing novel challenges, also due to the necessity of monitoring datasets that evolve over time. In these scenarios, incremental FD discovery algorithms have to efficiently verify which of the previously discovered FDs still hold on the updated dataset, and also infer new valid FDs. This requires the definition of search strategies and validation methods able to analyze only the portion of the dataset affected by new changes. In this paper we propose a new validation method, which can be used in combination with different search strategies, that exploits regular expressions and compressed data structures to efficiently verify whether a candidate FD holds on an updated version of the input dataset. Experimental results demonstrate the effectiveness of the proposed method on real-world datasets adapted for incremental scenarios, also compared with a baseline incremental FD discovery algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Traditional validation methods, such as the partition-based ones, exploit the possibility to
organize data into partitions according to candidate attribute sets, which will consistently
be redefined by following a level-by-level search strategy. In this way, the validation can be
performed by testing properties of the partitions involved in each candidate FD [5]. Instead,
row-based discovery algorithms directly derive candidate FDs by comparing attribute values for all
possible combinations of tuple pairs, so implicitly inferring the validation of the FDs [4, 8].</p>
      <p>In this paper, we propose an incremental FD discovery algorithm, named REXY (RegeX-based
incremental discoverY), which includes a new validation method exploiting regular expressions
(RegExs) to extract the subset of data affecting discovery results. It employs a compressed data
representation, which limits the memory load and optimizes the data analysis. The incremental
search strategy of REXY relies on an upward/downward candidate generation method based
on the set of previously holding FDs, without the need of exploring the complete search space.
We experimentally evaluated REXY on 22 real-world datasets. In particular, we evaluated the
effectiveness of the algorithm in its incremental identification of FDs, using as performance
benchmark the incremental algorithm described in [9]. The results demonstrate that REXY
improves the time performances of a baseline algorithm by several orders of magnitude.</p>
      <p>The remainder of the paper is organized as follows. Section 2 reviews the main discovery
strategies and algorithms existing in the literature. Section 3 introduces the incremental
discovery problem. Section 4 describes the proposed validation method and its integration in
an incremental search strategy, whereas the whole discovery algorithm is reported in Section 5.
Experimental results are discussed in Section 6, and conclusions are included in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In the literature, several algorithms for the discovery of FDs from data have been defined. They
are mainly devoted to the analysis of data from scratch, yielding specific methodologies to
explore and prune the search space. In particular, column-based and row-based are the
two main strategies they follow.</p>
      <p>Column-based strategies model the search space as an attribute lattice, which permits to
consider candidate dependencies at each level in terms of the lattice's edges, enabling the
representation of the Left-Hand-Side (LHS) and the Right-Hand-Side (RHS) of an FD. After the
generation of FD candidates at each level, it is necessary to validate each of them, entailing
the evaluation of value combinations or making the FD representations efficient, e.g., through data
partitions [3, 5, 10]. By following a level-by-level search strategy, column-based strategies
exploit the possibility to progressively organize data, and to test the validity of a candidate
FD by considering how data are collected into partitions. On the contrary, row-based strategies
directly derive candidate FDs by analyzing all possible combinations of tuple pairs, aiming to
derive two attribute subsets in terms of agree-sets or difference-sets, and entailing the automatic
derivation of all holding FDs [4, 8]. In this case, the validation of an FD is implicitly inferred by
the overall analysis over the set of data. Finally, in order to improve the efficiency of discovery
algorithms, hybrid strategies have also been proposed [11, 12]. The latter calculate FDs on a
randomly selected small subset of records (row-based), and validate the discovered FDs on the
entire dataset, by focusing on a small subset of the search space determined by FD candidates
and validation results (column-based).</p>
      <p>In the literature, there exist several FD discovery algorithms for incremental scenarios. In
particular, one of the first algorithms exploits the concepts of tuple partitions and monotonicity
of FDs to avoid re-scanning the database [7]. Another proposal is based on the concept
of functional independency, through which the algorithm maintains the set of FDs updated over
time [13]. Finally, the DynFD algorithm is able to discover and maintain FDs in dynamic datasets
[6]. In particular, it extends the aforementioned hybrid strategy, by continuously adapting the
FD validation structures according to batches of insert, update, and delete operations. With
respect to these incremental approaches, REXY uses a new validation method that focuses the
verification of candidate FDs only on the subset of data affected by the changes. This can
improve the efficiency of FD discovery in incremental scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Incremental discovery of Functional Dependencies</title>
      <p>Functional Dependencies (FDs) express relationships between attributes of a database relation.
An FD X → Y states that the values of the attribute set Y are uniquely determined by the values of
X. More formally, given a relation schema R, an FD over R is a statement X → Y (X determines
Y), with X, Y ⊆ attr(R), such that, given an instance r over R, X → Y is satisfied in r if and only if
for every pair of tuples (t1, t2) in r, whenever t1[X] = t2[X], then t1[Y] = t2[Y]. X and Y are also
named the Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the FD, respectively. An FD is
said to be non-trivial if and only if Y ⊈ X, and minimal if and only if there is no attribute A ∈ X
such that X \ A → Y is also an FD holding on r. For the sake of simplicity and w.l.o.g., in the rest of
the paper we assume that Y consists of a single attribute.</p>
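      <p>The definition above translates directly into a check that can be run on an instance: X → Y holds if no two tuples agree on X while disagreeing on Y. A minimal sketch of such a check (illustrative only, not the algorithm proposed in this paper; attribute positions and values are hypothetical):</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FdCheck {
    // Returns true iff the FD X -> Y holds on the instance r:
    // tuples agreeing on the X attributes must also agree on the Y attributes.
    static boolean holds(List<String[]> r, int[] x, int[] y) {
        Map<String, String> seen = new HashMap<>();
        for (String[] t : r) {
            String lhs = project(t, x);
            String rhs = project(t, y);
            String prev = seen.putIfAbsent(lhs, rhs);
            if (prev != null && !prev.equals(rhs)) return false; // violation found
        }
        return true;
    }

    // Concatenates the values of the given attribute positions into a key.
    static String project(String[] t, int[] attrs) {
        StringBuilder sb = new StringBuilder();
        for (int a : attrs) sb.append(t[a]).append(',');
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> r = Arrays.asList(
            new String[]{"E22", "M", "1883"},
            new String[]{"E26", "M", "1887"},
            new String[]{"E22", "M", "1883"});
        System.out.println(holds(r, new int[]{0}, new int[]{2})); // attr0 -> attr2 holds
        System.out.println(holds(r, new int[]{1}, new int[]{2})); // attr1 -> attr2 violated
    }
}
```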
      <p>In general, the FD discovery problem aims at finding the set of all minimal FDs holding on a
relation instance r. It entails searching for a partition of tuples sharing the same values on the RHS
attribute whenever they share the same values on the LHS attributes [14]. To do this, it is possible
to first generate FD candidates, and then verify their validity and minimality, yielding a
column-based strategy.</p>
      <p>Column-based strategies model the search space as a graph representation of a lattice, which
contains a collection of attribute sets, where Level 0 maps the empty set, Level 1 the singleton sets,
one for each attribute, Level 2 the pair sets, one for each possible combination of two attributes,
and so forth. Finally, the last level, namely Level M, will contain a single set with all the attributes
from R. This permits to consider candidate FDs at each level in terms of edges. For instance, the
edge highlighted in blue in Figure 1 defines a candidate FD.</p>
      <p>The validation process of column-based strategies performs simple computations on the
cardinalities of the partitions calculated over attribute combinations, which correspond to the
lattice nodes. This process is particularly efficient when it is possible to follow a level-by-level
search strategy from Level 1 to Level M, and to gradually construct partitions over ever-increasing
attribute set sizes. This cannot be guaranteed when it is necessary to explore the search
space from different starting points, such as a set of previously holding FDs.</p>
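      <p>The partition-based criterion mentioned above can be sketched as follows: a candidate X → A holds if and only if the partition induced by X has the same number of equivalence classes as the one induced by X ∪ {A}, as exploited, e.g., by TANE [5]. A simplified illustration, not the cited implementations:</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionCheck {
    // Number of equivalence classes of the partition induced by the given attributes.
    static int partitionSize(List<String[]> r, int[] attrs) {
        Set<String> classes = new HashSet<>();
        for (String[] t : r) {
            StringBuilder key = new StringBuilder();
            for (int a : attrs) key.append(t[a]).append('|');
            classes.add(key.toString());
        }
        return classes.size();
    }

    // Partition criterion: X -> A holds iff |pi_X| == |pi_{X union {A}}|.
    static boolean holds(List<String[]> r, int[] x, int a) {
        int[] xa = Arrays.copyOf(x, x.length + 1);
        xa[x.length] = a;
        return partitionSize(r, x) == partitionSize(r, xa);
    }

    public static void main(String[] args) {
        List<String[]> r = Arrays.asList(
            new String[]{"1", "2", "1"},
            new String[]{"1", "2", "1"},
            new String[]{"2", "3", "1"});
        System.out.println(holds(r, new int[]{0}, 1)); // true: attr0 determines attr1
        System.out.println(holds(r, new int[]{2}, 0)); // false: attr2 constant, attr0 varies
    }
}
```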
      <p>Incremental scenarios require updating a set of minimal FDs Σ_t, discovered over data collected
until time t, according to the set of tuple modifications Δr_t+1 at time t+1. This yields the necessity
to (re)-consider all FDs in Σ_t, and possibly to generate new FD candidates according to validation
results.</p>
      <p>[Figure 1: the attribute lattice for five attributes A, B, C, D, E, from the singleton sets at Level 1 up to the set ABCDE at the top level.]</p>
      <p>In fact, the addition of a new tuple can entail the invalidation of a previously holding FD,
whereas the deletion of a tuple can entail that a previously holding FD is no longer minimal. More
formally, let X → A be an FD holding at time t, and t̄ a tuple yielding a dataset modification;
then we need to consider the following effects:
• Invalidation: when a tuple t̄ is added, it could produce the invalidation of X → A at
time t + 1. Consequently, one or more FD candidates on the next lattice level having the
same RHS and a superset of its LHS could be validated. Thus, it is necessary to consider
all possible FD candidates XB → A such that B ∉ X and B ≠ A.
• Minimality: when a tuple t̄ is removed, X → A could be no longer minimal at time
t + 1. Consequently, one or more FD candidates on the previous lattice level having the
same RHS and a subset of its LHS could be validated, yielding a minimal FD. Thus, it is
necessary to consider all possible FD candidates X ∖ B → A such that B ∈ X.</p>
      <p>Notice that a tuple update can always be managed as the deletion of the stored tuple and the
subsequent addition of the tuple containing the modified values.</p>
      <p>Invalidation and minimality checks determine how FD candidates should be generated and
managed throughout the search space, according to previously holding FDs and validation
results. This entails moving the focus onto different parts of the search space, by concentrating the
attention on the subset of data characterized by the specific candidate FDs under analysis. Thus, we
propose a new validation method based on RegExs, which checks whether modified data entail
some violations, and/or verifies the minimality of a candidate FD.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The RegEx-based validation method</title>
      <p>In this section, we first describe the compressed data structures used to optimize the time and
space usage of the discovery algorithm. Then, we introduce the new RegEx-based validation
method, and how it can be adopted within an incremental FD discovery process.
</p>
      <p>[Table 1: a snippet of the Bridges dataset (a) and its compressed representation after the ID mapping phase (b), together with the corresponding RegExHashMap entries.]</p>
      <sec id="sec-4-1">
        <title>4.1. Compressed data structures</title>
        <p>Before introducing the new validation method, it is necessary to provide details about its data
structures. In particular, in order to minimize the memory load, we represent a dataset by
using lightweight references to its tuples without losing significance. To do this, we map each
attribute value to a unique numeric ID, which allows us to efficiently identify data changes. In
particular, whenever new tuples are added to the dataset, their values are mapped into their
corresponding numerical values. This representation leads to a fast tuple comparison during
the FD discovery process, and permits to create RegExs in a simple way, avoiding character
encoding issues.</p>
        <p>Example 1. Let us consider the snippet of the Bridges dataset in Table 1(a) containing 13
attributes. As shown in Table 1(b), each value has been parsed by using a unique ID for the
values in each of the attributes. Let us now consider the insertion of two different tuples t7 and
t8 in the existing dataset; then each of their values is parsed according to the already existing
values of the dataset, and mapped to the corresponding numerical tuples. As we can see, after
the parsing step, the tuple t8 is equal to t5. Thus, it is not necessary to store the tuple t8 in the
dataset. On the other hand, the tuple t7 is stored in D since it represents a combination of
existing and new values for the dataset.</p>
        <p>Although this data representation allows us to quickly retrieve information, it is necessary to
define a new structure to support the validation step of candidate FDs. To this end, we exploit
a hash map, namely RegExHashMap, whose keys are unique ID combinations, and whose
associated values correspond to the number of their occurrences in the dataset. In this way, it is
possible to avoid duplicate entries and to quickly perform insertion and/or removal operations.</p>
        <p>Example 2. Let us consider Example 1, where the mapped tuples t7 and t8 should be inserted
in Table 1(b). Thus, for t7 the ID "7,2,7,7,2,6,2,1,1,2,2,1,2" is inserted into the RegExHashMap
as key, whose associated value is set to 1, since no other tuples in Table 1(b) contain the same
values of t7. Instead, for t8 the key "5,3,5,5,2,5,2,1,1,2,3,2,2" is already included in the
RegExHashMap, so the associated value is set to 2, since tuple t8 is equal to t5, yielding
two occurrences of the same key. Further details about the search strategy will be discussed in
the following section.</p>
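        <p>A minimal sketch of how the two structures described above could be realized, assuming a plain HashMap as backing store and per-attribute dictionaries assigning progressive IDs (class and method names are illustrative, not REXY's actual implementation):</p>

```java
import java.util.HashMap;
import java.util.Map;

public class CompressedStore {
    // Per-attribute dictionaries: each distinct value gets a unique numeric ID.
    private final Map<String, Integer>[] dictionaries;
    // RegExHashMap: comma-separated ID combination -> number of occurrences.
    private final Map<String, Integer> regExHashMap = new HashMap<>();

    @SuppressWarnings("unchecked")
    CompressedStore(int nAttributes) {
        dictionaries = new Map[nAttributes];
        for (int i = 0; i < nAttributes; i++) dictionaries[i] = new HashMap<>();
    }

    // Maps a tuple to its ID key and counts the occurrence.
    String insert(String[] tuple) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < tuple.length; i++) {
            Map<String, Integer> dict = dictionaries[i];
            int id = dict.computeIfAbsent(tuple[i], v -> dict.size() + 1);
            key.append(id);
            if (i < tuple.length - 1) key.append(',');
        }
        String k = key.toString();
        regExHashMap.merge(k, 1, Integer::sum); // duplicate tuples share one entry
        return k;
    }

    int count(String key) { return regExHashMap.getOrDefault(key, 0); }

    public static void main(String[] args) {
        CompressedStore store = new CompressedStore(3);
        String k1 = store.insert(new String[]{"E22", "M", "Steel"});
        String k2 = store.insert(new String[]{"E26", "M", "Steel"});
        String k3 = store.insert(new String[]{"E22", "M", "Steel"}); // duplicate of the first
        System.out.println(k1);              // 1,1,1
        System.out.println(k2);              // 2,1,1
        System.out.println(store.count(k3)); // 2
    }
}
```

<p>As in Example 2, a repeated tuple does not create a new entry: only its occurrence counter is incremented.</p>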
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Validation approach</title>
        <p>An efficient discovery methodology should permit to quickly validate candidate FDs on the set
of data under analysis, and ensure high adaptability to possible data changes. The proposed
validation method relies on RegExs and exploits the RegExHashMap for fast retrieval operations.
In particular, let φ : X → A be a candidate FD over a dataset D. The proposed approach aims
to create a RegEx ρ_φ to validate φ over D. For the sake of clarity, we describe the creation of ρ_φ by
considering a single new tuple t̄. However, this strategy can be easily adapted to consider a large
number of new tuples by chaining different RegExs.</p>
        <p>The validation algorithm considers the projection of the attributes in X and A onto the tuple
t̄ to select the value combinations that must be considered in the validation process. More
specifically, to validate φ, it is necessary to create a RegEx for the attribute set X = X1, X2, …, Xk
by considering t̄[X1], t̄[X2], …, t̄[Xk], in the following way:
ρ_LHS = t̄[X1][,]([0-9]*[,])* t̄[X2][,]([0-9]*[,])* … ([0-9]*[,])* t̄[Xk]
ρ_RHS = (?!.*t̄[A]).+
ρ_φ = ([0-9]*[,])* ρ_LHS ([,][0-9]*)* ρ_RHS ([,][0-9]*)*</p>
        <p>We can notice that ρ_LHS contains the comma character as separator, whereas each value could
be followed by zero or more numeric characters, i.e., [0−9]*, representing all the possible
IDs for the attributes that are not in X. In fact, one of the strengths of this approach is to avoid
considering specific values for attributes that must not be considered during the validation step.
Similarly to the LHS, it is necessary to define a RegEx for the RHS, i.e., ρ_RHS. To this end, we
can define ρ_RHS as the negative look ahead of t̄[A] (i.e., all values not equal to t̄[A]), so enabling
to consider only the tuples in which t̄[A] does not appear. The combination of ρ_LHS and ρ_RHS
allows us to define a new RegEx ρ_φ to validate the candidate φ after the insertion of t̄. In this
way, it is possible to verify if there exists at least one key in the RegExHashMap satisfying ρ_φ.</p>
        <p>[Figure 2: the RegEx ρ_φ built for the tuple t7, e.g., [0-9.]*[,][2][,][0-9.]*[,][7][,][0-9.]*[,][0-9.]*[,][2], combined with the negative look ahead (?!.*2).+, where fixed IDs are used for the LHS attributes and generic [0-9.]* patterns for the remaining attributes.]</p>
        <p>Example 3. Let us consider the snippet dataset in Table 1(b), a candidate FD φ with three
attributes on its LHS, and the new tuple t7 defined in Example 1. The validation algorithm should verify whether this new
tuple leads to the invalidation of φ, according to the strategy defined above. To this end, it is
necessary to define the RegEx ρ_φ to validate φ, as shown in Figure 2. We can observe that the
validation approach considers the new values of t7 for the three LHS attributes,
while it considers a generic numeric value (i.e., [0−9]*) for the remaining attributes.
As a consequence, the RegEx ρ_LHS has been defined as shown in Figure 2. We can notice that
ρ_LHS allows the validation strategy to check whether there already exists at least one tuple in the dataset
with the same values for the LHS attributes. Moreover, it is necessary to define the RegEx
ρ_RHS for the RHS attribute by exploiting the negative look ahead of its value in t7, in order to check whether
there exists at least one key in the RegExHashMap with the same values for the LHS attributes, but a
different value for the RHS.</p>
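        <p>The construction of ρ_φ and its check against the RegExHashMap keys can be sketched as follows, under the simplifying assumptions of single-digit IDs and a per-position negative look ahead in place of the full pattern of Figure 2 (tuples and attribute positions are illustrative):</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RegExValidation {
    // Builds a validation RegEx for the candidate FD lhs -> rhs and new tuple t:
    // LHS positions are fixed to t's IDs, uninvolved positions match any ID,
    // and the RHS position must hold an ID different from t[rhs]
    // (single-digit IDs assumed for simplicity).
    static String buildRegex(int[] lhs, int rhs, String[] t) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < t.length; j++) {
            final int pos = j;
            boolean inLhs = java.util.Arrays.stream(lhs).anyMatch(a -> a == pos);
            if (inLhs) sb.append('[').append(t[j]).append(']');               // fixed LHS value
            else if (j == rhs) sb.append("(?!").append(t[j]).append(")[0-9]+"); // negative look ahead
            else sb.append("[0-9]*");                                          // any ID
            if (j < t.length - 1) sb.append("[,]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // RegExHashMap keys of the already stored tuples.
        Map<String, Integer> regExHashMap = new HashMap<>();
        regExHashMap.put("1,4,2,3", 1);
        regExHashMap.put("2,4,2,4", 2);
        // New tuple with IDs [1, 5, 2, 4]; candidate FD: attr0 -> attr3.
        String[] t = {"1", "5", "2", "4"};
        Pattern p = Pattern.compile(buildRegex(new int[]{0}, 3, t));
        boolean invalid = regExHashMap.keySet().stream().anyMatch(k -> p.matcher(k).matches());
        System.out.println(invalid); // "1,4,2,3" agrees on attr0 but differs on attr3 -> FD invalidated
    }
}
```

<p>A matching key means that a stored tuple agrees with the new tuple on the LHS but disagrees on the RHS, so the candidate is invalidated.</p>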
      </sec>
      <sec id="sec-4-3">
        <title>4.3. An incremental search strategy</title>
        <p>The proposed validation method can be used within any discovery strategy that, by working
incrementally, can properly define the FD candidates to be validated. In particular, the algorithm
REXY uses a version of the algorithm described in [15], adapted for incrementally managing any
kind of data change. In particular, given a dataset D at time t, REXY considers the set of FDs
holding at time t, denoted as Σ_t, as starting points, and collects them in a hash map ordered by
LHS cardinality [15]. Then, it follows a bottom-up search strategy by considering Σ_t as the set
of candidate FDs, and by validating each of them according to the validation method described
above. Consequently, given a candidate FD φ, REXY performs a downward discovery step when
a tuple is removed or φ is a novel candidate (i.e., φ ∉ Σ_t). It consists in the generalization of
the candidate φ, performed by iteratively removing one attribute from its LHS, until REXY meets
a valid FD in the lower levels. Conversely, when a new tuple is added and φ is not valid, an
upward discovery step is accomplished in order to find the possible new holding FDs. It consists
in the specialization of φ, performed by adding a new attribute to its LHS.</p>
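        <p>The upward/downward steps described above can be sketched as follows (a simplified illustration of candidate generation with single-character attribute names; not REXY's actual implementation):</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CandidateGeneration {
    // Upward step: specialize X -> a by adding one attribute B (B not in X, B != a).
    static List<Set<Character>> specialize(Set<Character> x, char a, String schema) {
        List<Set<Character>> out = new ArrayList<>();
        for (char b : schema.toCharArray()) {
            if (b != a && !x.contains(b)) {
                Set<Character> xs = new LinkedHashSet<>(x);
                xs.add(b);
                out.add(xs);
            }
        }
        return out;
    }

    // Downward step: generalize X -> a by removing one attribute from X.
    static List<Set<Character>> generalize(Set<Character> x) {
        List<Set<Character>> out = new ArrayList<>();
        for (char b : x) {
            Set<Character> xs = new LinkedHashSet<>(x);
            xs.remove(b);
            out.add(xs);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Character> x = new LinkedHashSet<>(List.of('A', 'B'));
        System.out.println(specialize(x, 'E', "ABCDE")); // [[A, B, C], [A, B, D]]
        System.out.println(generalize(x));               // [[B], [A]]
    }
}
```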
        <p>The usage of an ordered hash map into a bottom-up search strategy permits to reuse the
analysis performed at the previous levels, and forward the validation of newly generated
candidate FDs to the next level, yielding an optimization in the pruning of the search space.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. The REXY Algorithm</title>
      <p>The main procedure of REXY is shown in Algorithm 1. Given a set of minimal FDs Σ_t valid at
time t, the set of tuples Δr_t+1 updating r at time t + 1, and an instance of the RegExHashMap
Γ_t at time t, the main procedure of REXY starts the discovery process by considering Σ_t as the
set of candidate FDs. In particular, REXY performs a discovery step in ascending order of the
LHS cardinalities of Σ_t (line 3) and extracts all the candidate FDs from Σ_t with a specific LHS
cardinality (line 4). Then, for each of them, it checks if the candidate φ : X → A is still valid
after the update of the dataset, according to the validation approach defined in Section 4.2
(line 6). If the analyzed FD is not valid at time t + 1, the procedure first removes it from the set
of the analyzed candidates (line 7), and then generates new candidate FDs at a higher lattice
level (line 8), discarding those that can be inferred from other FDs already validated at time
t + 1 (lines 9-10). On the other hand, if the analyzed FD is still valid at time t + 1, REXY tries
to find whether there exist other FDs at a lower lattice level (line 12) that have become valid after the
update operations (lines 13-18). At the end of each iteration, the procedure removes all the FDs
that are not minimal at time t + 1 w.r.t. the FDs already validated in the previous iterations
(line 19). Finally, the RegExHashMap is updated with the new tuples (line 20), and the procedure
returns the new set of minimal and valid FDs at time t + 1 (line 21).</p>
      <p>The validation procedure of REXY is shown at the bottom of Algorithm 1. Given a candidate
FD φ : X → A at time t + 1, an instance of the RegExHashMap Γ_t at time t, and the set of the
new tuples Δr_t+1 at time t + 1, the procedure creates a new RegEx for each new tuple in Δr_t+1,
in order to check the validity of the candidates on the updated dataset. In particular, the
procedure starts by analyzing each new tuple in Δr_t+1 (line 24). Then, for each of them, REXY
defines a new RegEx according to the approach defined in Section 4.2. More specifically, for
each attribute B in attr(R), if B does not belong to the attributes of φ, the procedure adds to ρ_φ a
generic value (lines 28-30). On the other hand, if B belongs to the attributes on the LHS, then
the procedure adds to the final RegEx ρ_φ the corresponding value of the analyzed tuple t̄[B]
(line 32). Otherwise, if none of the previous cases is verified, the attribute B belongs to the RHS,
so the procedure adds to ρ_φ the negative look ahead of its value (line 33). After the creation of
the RegEx ρ_φ, REXY checks whether there exists at least one tuple in the RegExHashMap that matches this
RegEx. If so, it means that φ is not valid on the dataset at time t + 1, since there exists at least
one pair of tuples that invalidates φ (line 35). Otherwise, if the previous case is never verified for
any new tuple, it means that the candidate is valid at time t + 1 (line 36).</p>
      <p>It is worth noticing that, while the implementation of REXY considers several code optimizations
and takes advantage of the defined data structures, for the sake of clarity, the pseudo-code of
Algorithm 1 does not show such optimizations. They mainly guarantee that each possible candidate
FD is validated at most once.</p>
      <p>Algorithm 1: Main Procedures of REXY
Input: A set Σ_t of minimal FDs at time t; the updated tuples Δr_t+1 for the dataset r at
time t + 1; an instance of the RegExHashMap Γ_t at time t.
Output: The set of new minimal FDs at time t + 1, i.e., Σ_t+1.
1  Function MAIN_PROCEDURE:
2      Σ_t+1 ← ∅
3      for i in [1, …, |attr(R)|] do
4          Σ_i ← Σ_t.CARDINALITY_FILTER(i)
5          for each X → A ∈ Σ_i do
6              if VALIDATION_REGEX(X → A, Γ_t, Δr_t+1) is False then
7                  Σ_i ← Σ_i \ {X → A}
8                  Σ_s ← SPECIALIZE_CANDIDATE(X → A)
9                  for each {Z → A} ∈ Σ_s do
10                     if IS_MINIMAL({Z → A}, Σ_t+1) is True then
11                         Σ_i+1 ← Σ_i+1 ∪ {Z → A}
12             else
13                 Σ_g ← GENERALIZE_CANDIDATE(X → A)
14                 for each {Z → A} ∈ Σ_g do
15                     if VALIDATION_REGEX({Z → A}, Γ_t, Δr_t+1) is False then
16                         Σ_g ← Σ_g \ {Z → A}
17                         Σ_g ← Σ_g ∪ GENERALIZE_CANDIDATE({Z → A})
18                 Σ_i ← Σ_i ∪ Σ_g
19         if |Σ_i| &gt; 0 then Σ_t+1 ← Σ_t+1 ∪ MINIMALITY_CHECK(Σ_t+1, Σ_i)
20     Γ_t+1 ← Γ_t ∪ Δr_t+1    // Updating of the RegExHashMap
21     return Σ_t+1
22 Function VALIDATION_REGEX(X → A, Γ_t, Δr_t+1):
23     ATTRS ← attr(R)
24     for each t̄ ∈ Δr_t+1 do
25         ρ_φ ← ∅
26         for j in [0, …, |ATTRS| − 1] do
27             B ← ATTRS[j]
28             if B ∉ X ∧ B ≠ A then
29                 if j == 0 then ρ_φ ← ρ_φ ∪ ([0−9]*[,])*
30                 else ρ_φ ← ρ_φ ∪ ([,][0−9]*)*
31             else
32                 if B ∈ X then ρ_φ ← ρ_φ ∪ [t̄[B]]
33                 else ρ_φ ← ρ_φ ∪ (?!.*[t̄[A]]).+
34             if j &lt; |ATTRS| − 1 then ρ_φ ← ρ_φ ∪ [,]
35         if Γ_t.EXIST_MATCHING(ρ_φ) is True then return False
36     return True</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental results</title>
      <p>REXY has been developed in Java 15. The execution tests have been performed on an iMac
with an Intel Xeon processor at 3.40 GHz, 36-core, and 128GB of RAM. Furthermore, in order
to ensure proper executions of REXY on the considered datasets, the Java memory heap size
allocation has been set to 100GB. Table 2 lists the 22 considered datasets (Iris, Balance-scale,
Bupa, Appendicitis, Chess, Abalone, Nursey, Tic-tac-toe, Glass, Cmc, Yeast, Breast-cancer,
Fraud-detection, Poker-hand, Echocardiogram, Bridges, Australian, Adult, Tax, Lymphography,
Hepatitis, and Parkinson), together with the number of rows and columns of each of them.
In our experimental session, we simulate a scenario of
continuous insertions of tuples by considering one tuple at a time. This allowed us to deeply
analyze the performances of the validation process on each candidate FD. We also report the
number of resulting FDs and the number of validations
performed at the end of the last execution of REXY on each dataset.</p>
      <p>The results highlight that the average execution time is almost always less than 10 seconds,
except for the Lymphography, Hepatitis, and Parkinson datasets. This is probably due to the
large number of FDs invalidated during the discovery process, possibly leading to a larger number
of new candidates to be validated. Concerning the validation process, its average time
is almost always less than 25 milliseconds per FD, and the maximum time almost never
exceeds 40 milliseconds, except for the Fraud-detection, Tax, Hepatitis, and Parkinson datasets.
Despite these time peaks, the average times demonstrate that such values can be considered
outliers. In general, although execution times increase when the number of attributes increases,
on average they remain low for validating each candidate FD.</p>
      <p>We performed a comparative evaluation of REXY with respect to the incremental FD discovery
algorithm presented in [9] (baseline). In particular, both algorithms follow the same search
strategy, but differ in how they organize data, also yielding completely different validation
methods. Thus, the comparison allowed us to highlight the improvements in terms of execution
times of the proposed validation strategy with respect to the traditional partition-based one used
in [9]. Figure 3 shows the average time performances of both algorithms on the 22 datasets. We can
notice that REXY outperforms the baseline algorithm by several orders of magnitude, ranging
from 2 to 7, except for the Parkinson dataset, for which the baseline takes advantage of a
caching strategy that allows it to reuse the computation performed in the previous iterations.</p>
      <p>The datasets are available on: https://github.com/DastLab/TestDataset</p>
      <p>In general, partition-based validation strategies, such as the one included in [9], are
particularly efficient when a complete discovery process has to be performed, since partitions can be
incrementally computed by following a level-by-level bottom-up search strategy. However, the
incremental scenario requires the discovery process to browse the search space starting from
the previously holding FDs (i.e., not all FD candidates have to be considered at each level). Thus,
the proposed validation method is able to perform the verification without the necessity
to compute sparse partitions, yielding faster execution times, as can be noted in the discussed
experimental results.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper we presented REXY, an incremental algorithm for FD discovery. It exploits a new
RegEx-based validation method to efficiently select the tuples affected by dataset updates. REXY
follows an incremental strategy to explore the search space based on the previously holding FDs.
Experimental results over real-world datasets show that REXY can efficiently update the set of
holding FDs, without the necessity to execute discovery algorithms from scratch. In the future, we
would like to include REXY in a visual monitoring tool [16], and extend it for updating Relaxed
Functional Dependencies [17] in incremental scenarios.
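To illustrate the idea behind Regex-based selection of affected tuples, the following is a deliberately simplified Python sketch, not REXY's actual data structures: rows are encoded as delimiter-separated strings, a regular expression built from an updated tuple's LHS values selects only the matching rows, and the candidate FD remains valid after an insert iff those rows agree on the RHS value. It assumes, for simplicity, that the LHS attributes form a prefix of the encoded row.

```python
import re

# Rows stored as delimiter-separated strings (compressed encodings in practice).
rows = ["a|1|x", "a|1|x", "b|2|y", "c|3|z"]

def affected_rhs_values(encoded_rows, lhs_values, rhs_index):
    """Collect the RHS values of all rows matching the given LHS values."""
    pattern = re.compile("^" + re.escape("|".join(lhs_values)) + r"\|")
    return {r.split("|")[rhs_index] for r in encoded_rows if pattern.match(r)}

def still_valid_after_insert(encoded_rows, new_row, lhs_idx, rhs_idx):
    """An insert violates X -> Y iff an existing row with the same X-values
    has a different Y-value; only the matching rows need to be inspected."""
    fields = new_row.split("|")
    lhs_values = [fields[i] for i in lhs_idx]
    seen = affected_rhs_values(encoded_rows, lhs_values, rhs_idx)
    return seen <= {fields[rhs_idx]}

still_valid_after_insert(rows, "a|1|x", [0, 1], 2)  # same Y for same X -> True
still_valid_after_insert(rows, "a|1|w", [0, 1], 2)  # conflicting Y -> False
```

The key property is that validation touches only the rows selected by the pattern, rather than recomputing partitions over the entire relation.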
</p>
      <p>[3] Z. Abedjan, P. Schulze, F. Naumann, DFD: Efficient functional dependency discovery,
in: Proc. of the 23rd ACM International Conference on Information and Knowledge
Management, 2014, pp. 949–958.
[4] P. A. Flach, I. Savnik, Database dependency discovery: A machine learning approach, AI
Communications 12 (1999) 139–160.
[5] Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen, TANE: An efficient algorithm for
discovering functional and approximate dependencies, The Computer Journal 42 (1999)
100–111.
[6] P. Schirmer, T. Papenbrock, S. Kruse, D. Hempfing, T. Meyer, D. Neuschäfer-Rube, F.
Naumann, DynFD: Functional dependency discovery in dynamic datasets, in: Proc. of the 22nd
International Conference on Extending Database Technology, 2019, pp. 253–264.
[7] S.-L. Wang, J.-W. Shen, T.-P. Hong, Incremental discovery of functional dependencies
using partitions, in: Proc. of the Joint 9th IFSA World Congress and 20th NAFIPS International
Conference, volume 3, 2001, pp. 1322–1326.
[8] C. Wyss, C. Giannella, E. Robertson, FastFDs: A heuristic-driven, depth-first algorithm
for mining functional dependencies from relation instances, in: Proc. of the International
Conference on Data Warehousing and Knowledge Discovery, 2001, pp. 101–110.
[9] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Incremental discovery of functional
dependencies with a bit-vector algorithm, in: M. Mecella, G. Amato, C. Gennaro (Eds.),
Proceedings of the 27th Italian Symposium on Advanced Database Systems, volume 2400
of CEUR Workshop Proceedings, CEUR-WS.org, 2019.
[10] H. Yao, H. J. Hamilton, C. J. Butz, FD_Mine: Discovering functional dependencies in a
database using equivalences, in: Proc. of the IEEE International Conference on Data Mining,
2002, pp. 729–732.
[11] T. Papenbrock, F. Naumann, A hybrid approach to functional dependency discovery, in:
Proc. of the 2016 International Conference on Management of Data, 2016, pp. 821–833.
[12] Z. Wei, S. Link, Discovery and ranking of functional dependencies, in: Proc. of the IEEE
35th International Conference on Data Engineering, IEEE, 2019, pp. 1526–1537.
[13] S. Bell, Discovery and maintenance of functional dependencies by independencies, in:
Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, 1995,
pp. 27–32.
[14] L. Caruccio, V. Deufemia, G. Polese, Mining relaxed functional dependencies from data,
Data Mining and Knowledge Discovery 34 (2020) 443–477.
[15] L. Caruccio, S. Cirillo, Incremental discovery of imprecise functional dependencies, J. Data
and Information Quality 12 (2020).
[16] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Visualizing dependencies during
incremental discovery processes, in: A. Poulovassilis et al. (Eds.), Proceedings of the
Workshops of the EDBT/ICDT 2020 Joint Conference, volume 2578 of CEUR Workshop
Proceedings, CEUR-WS.org, 2020.
[17] L. Caruccio, V. Deufemia, F. Naumann, G. Polese, Discovering relaxed functional
dependencies based on multi-attribute dominance, IEEE Transactions on Knowledge and Data
Engineering (2021). To appear.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Holistic data cleaning: Putting violations into context</article-title>
          ,
          <source>in: Proc. of IEEE International Conference on Data Engineering</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>458</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Le Guilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Petit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-M.</given-names>
            <surname>Scuturici</surname>
          </string-name>
          ,
          <article-title>Evaluating classification feasibility using functional dependencies</article-title>
          ,
          <source>in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XLIV</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>