Efficient Validation of Functional Dependencies
during Incremental Discovery
Loredana Caruccio, Stefano Cirillo, Vincenzo Deufemia and Giuseppe Polese
Department of Computer Science, University of Salerno, via Giovanni Paolo II n.132, 84084 Fisciano (SA), Italy


Abstract
The discovery of functional dependencies (FDs) from data is facing novel challenges, also due to the necessity of monitoring datasets that evolve over time. In these scenarios, incremental FD discovery algorithms have to efficiently verify which of the previously discovered FDs still hold on the updated dataset, and also infer new valid FDs. This requires the definition of search strategies and validation methods able to analyze only the portion of the dataset affected by new changes. In this paper we propose a new validation method, which can be used in combination with different search strategies, and which exploits regular expressions and compressed data structures to efficiently verify whether a candidate FD holds on an updated version of the input dataset. Experimental results demonstrate the effectiveness of the proposed method on real-world datasets adapted for incremental scenarios, also in comparison with a baseline incremental FD discovery algorithm.

Keywords
Data Profiling, Functional Dependency, Incremental Discovery




1. Introduction
The usefulness of functional dependencies (FDs) has been widely demonstrated due to their
application in several contexts, such as for cleaning data [1], evaluating the feasibility of
classification models [2], and so forth. Although in the last decades many algorithms have been
proposed for the automatic discovery of FDs from data, e.g., [3, 4, 5], they lack the capability of
managing dynamic data. In this context, the set of discovered FDs should be kept consistent with
possible data changes over time, e.g., by means of insert, update, and delete operations.
As a consequence, incremental FD discovery algorithms have to adopt discovery strategies that
limit the search of candidate FDs to specific parts of the search space according to previously
discovered FDs.
   Most of the incremental FD discovery algorithms existing in literature extend traditional
FD search strategies with mechanisms that exploit the knowledge generated by the previous
executions [6, 7]. However, none of these algorithms introduces optimizations in its validation
method aimed at analyzing only the data changes.


SEBD 2021: The 29th Italian Symposium on Advanced Database Systems, September 5-9, 2021, Pizzo Calabro (VV), Italy
lcaruccio@unisa.it (L. Caruccio); scirillo@unisa.it (S. Cirillo); deufemia@unisa.it (V. Deufemia);
gpolese@unisa.it (G. Polese)
ORCID: 0000-0002-6711-3590 (L. Caruccio); 0000-0003-0201-2753 (S. Cirillo); 0000-0002-6711-3590 (V. Deufemia);
0000-0002-8496-2658 (G. Polese)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
   Traditional validation methods, such as partition-based ones, organize data into partitions
according to candidate attribute sets, which are progressively refined by following a
level-by-level search strategy. In this way, validation can be performed by testing properties
of the partitions involved in each candidate FD [5]. Instead, row-based discovery algorithms
directly derive candidate FDs by comparing attribute values for all possible pairs of tuples,
implicitly inferring the validity of the FDs [4, 8].
   In this paper, we propose an incremental FD discovery algorithm, named REXY (RegeX-based
incremental discoverY), which includes a new validation method exploiting regular expressions
(RegExs) to extract the subset of data affecting discovery results. It employs a compressed data
representation, which limits the memory load and optimizes the data analysis. The incremental
search strategy of REXY relies on an upward/downward candidate generation method based
on the set of previously holding FDs, without the need to explore the complete search space.
We experimentally evaluated REXY on 22 real-world datasets. In particular, we evaluated the
effectiveness of the algorithm in its incremental identification of FDs, using the incremental
algorithm described in [9] as a performance benchmark. The results demonstrate that REXY
improves the time performances of a baseline algorithm by several orders of magnitude.
   The remainder of the paper is organized as follows. Section 2 reviews the main discovery
strategies and algorithms existing in the literature. Section 3 introduces the incremental
discovery problem. Section 4 describes the proposed validation method and its integration in
an incremental search strategy, whereas the whole discovery algorithm is reported in Section 5.
Experimental results are discussed in Section 6, and conclusions are included in Section 7.


2. Related Work
In the literature several algorithms for the discovery of FDs from data have been defined. They
are mainly devoted to the analysis of data from scratch, yielding specific methodologies to
explore and prune the search space. In particular, column-based and row-based represent the
two main strategies they follow.
   Column-based strategies model the search space as an attribute lattice, which permits to
consider candidate dependencies at each level in terms of the lattice's edges, enabling the
representation of the Left-Hand-Side (LHS) and the Right-Hand-Side (RHS) of an FD. After the
generation of FD candidates at each level, it is necessary to validate each of them, entailing
the evaluation of value combinations or the use of efficient FD representations, such as data
partitions [3, 5, 10]. By following a level-by-level search strategy, column-based strategies
can progressively organize data, and test the validity of a candidate
FD by considering how data are grouped into partitions. On the contrary, row-based strategies
directly derive candidate FDs by analyzing all possible pairs of tuples, aiming to
derive two attribute subsets in terms of agree-sets or difference-sets, and entailing the automatic
derivation of all holding FDs [4, 8]. In this case, the validity of an FD is implicitly inferred from
the overall analysis of the data. Finally, in order to improve the efficiency of discovery
algorithms, hybrid strategies have also been proposed [11, 12]. The latter calculate FDs on a
randomly selected small subset of records (row-based), and validate the discovered FDs on the
entire dataset, by focusing on a small subset of the search space determined by FD candidates
and validation results (column-based).
   In the literature, there exist several FD discovery algorithms for incremental scenarios. In
particular, one of the first algorithms exploits the concepts of tuple partitions and monotonicity
of FDs to avoid the re-scanning of the database [7]. Another proposal is based on the concept
of functional independency, through which the algorithm maintains the set of FDs updated over
time [13]. Finally, the DynFD algorithm is able to discover and maintain FDs in dynamic datasets
[6]. In particular, it extends an aforementioned hybrid strategy, by continuously adapting the
FD validation structures according to a batch of insert, update, and delete data operations. With
respect to these incremental approaches, REXY uses a new validation method that focuses the
verification of candidate FDs only on the subset of data affected by the data changes. This can
improve the efficiency of FD discovery in incremental scenarios.


3. Incremental discovery of Functional Dependencies
Functional Dependencies (FDs) express relationships between attributes of a database relation.
An FD 𝑋 β†’ π‘Œ states that the values of an attribute set π‘Œ are uniquely determined by the values of
𝑋. More formally, given a relation schema 𝑅, an FD over 𝑅 is a statement 𝑋 β†’ π‘Œ (𝑋 determines
π‘Œ), with 𝑋 , π‘Œ βŠ† π‘Žπ‘‘π‘‘π‘Ÿ(𝑅), such that, given an instance π‘Ÿ over 𝑅, 𝑋 β†’ π‘Œ is satisfied in π‘Ÿ if and only if
for every pair of tuples (𝑑1 , 𝑑2 ) in π‘Ÿ, whenever 𝑑1 [𝑋 ] = 𝑑2 [𝑋 ], then 𝑑1 [π‘Œ ] = 𝑑2 [π‘Œ ]. 𝑋 and π‘Œ are also
named Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of an FD, respectively. An FD is
said to be non-trivial if and only if π‘Œ ⊈ 𝑋, and minimal if and only if there is no attribute 𝐡 ∈ 𝑋
such that 𝑋 \{𝐡} β†’ π‘Œ is also an FD holding on π‘Ÿ. For the sake of simplicity and w.l.o.g., in the rest of
the paper we assume that π‘Œ consists of a single attribute.
   In general, the FD discovery problem aims at finding the set of all minimal FDs holding on a
relation instance π‘Ÿ. It entails searching for partitions of tuples sharing the same values on the RHS
attribute whenever they share the same values on the LHS attributes [14]. To do this, it is possible
to first generate FD candidates, and then verify their validity and minimality, yielding a
column-based strategy.
   Column-based strategies model the search space as a graph representation of a lattice, which
contains a collection of attribute sets, where Level 0 maps the empty set, Level 1 singleton sets,
one for each attribute, Level 2 the pair sets, one for each possible combination of two attributes,
and so forth. Finally, the last level, namely Level M, will contain a single set of all the attributes
from 𝑅. This lattice permits to consider candidate FDs at each level in terms of edges. For instance,
the edge connecting the nodes 𝐡 and 𝐡𝐷 in Figure 1 defines the candidate FD 𝐡 β†’ 𝐷.
   The validation process of column-based strategies performs simple computations on the
cardinalities of the partitions calculated over attribute combinations, which correspond to the
lattice nodes. This process is particularly efficient when it is possible to follow a level-by-level
search strategy from Level 1 to Level M, and to gradually construct partitions over ever-increasing
attribute set sizes. This cannot be guaranteed when it is necessary to explore the search
space from different starting points, such as a set of FDs.
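   For illustration, the following minimal Java sketch (ours, not REXY's code) shows such a
cardinality-based check in the spirit of TANE [5]: a candidate 𝑋 β†’ 𝐴 holds if and only if the
partition over 𝑋 has as many equivalence classes as the partition over 𝑋 βˆͺ {𝐴}.

    import java.util.*;

    public class PartitionCheck {
        // Number of equivalence classes of the partition of tuples over the given attribute indices.
        static int classes(List<String[]> tuples, int[] attrs) {
            Set<List<String>> groups = new HashSet<>();
            for (String[] t : tuples) {
                List<String> key = new ArrayList<>();
                for (int a : attrs) key.add(t[a]);   // projection on attrs
                groups.add(key);
            }
            return groups.size();
        }

        // X -> A holds iff refining the partition over X by A adds no new equivalence class.
        static boolean holds(List<String[]> tuples, int[] lhs, int rhs) {
            int[] lhsAndRhs = Arrays.copyOf(lhs, lhs.length + 1);
            lhsAndRhs[lhs.length] = rhs;
            return classes(tuples, lhs) == classes(tuples, lhsAndRhs);
        }
    }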
   Incremental scenarios require to update the set of minimal FDs Σ𝜏, discovered over the data
collected until time 𝜏, according to the set of tuple modifications 𝐷𝜏 +1 at time 𝜏 + 1. This yields
the necessity to (re)consider all FDs in Σ𝜏, and possibly to generate new FD candidates according
to validation
[Figure 1 depicts the lattice for {A,B,C,D,E}: Level 1 contains the singletons Aβ€“E, Level 2 the
ten attribute pairs, Level 3 the triples, Level 4 the quadruples, and Level 5 the full set ABCDE.]

Figure 1: The lattice search space representation for the attribute set {A,B,C,D,E}.


results. In fact, the addition of a new tuple can entail the invalidation of a previously holding FD,
whereas the deletion of a tuple can entail that a previously holding FD is no longer minimal. More
formally, let 𝑋 β†’ 𝐴 be an FD holding at time 𝜏, and 𝑑 a tuple yielding a dataset modification;
then we need to consider the following effects:

    β€’ Invalidation: when a tuple 𝑑 is added, it could produce the invalidation of 𝑋 β†’ 𝐴 at
      time 𝜏 + 1. Consequently, one or more FD candidates on the next lattice level having the
      same RHS and a superset of its LHS could be validated. Thus, it is necessary to consider
      all possible FD candidates 𝑋 𝐡 β†’ 𝐴 such that 𝐡 βˆ‰ 𝑋 and 𝐡 β‰  𝐴.
    β€’ Minimality: when a tuple 𝑑 is removed, 𝑋 β†’ 𝐴 could be no longer minimal at time
      𝜏 + 1. Consequently, one or more FD candidates on the previous lattice level having the
      same RHS and a subset of its LHS could be validated, yielding a minimal FD. Thus, it is
      necessary to consider all possible FD candidates 𝑋 β§΅ 𝐡 β†’ 𝐴 such that 𝐡 ∈ 𝑋.

   Notice that a tuple update can always be managed as the deletion of the stored tuple and the
subsequent addition of the tuple containing the modified values.
   Invalidation and minimality checks determine how FD candidates should be generated and
managed throughout the search space, according to previously holding FDs and validation
results. This entails moving the focus to different parts of the search space, concentrating on
the subset of data characterized by the specific candidate FDs under analysis. Thus, we
propose a new validation method based on RegExs, which checks whether modified data entail
violations, and/or verifies the minimality of a candidate FD.


4. The RegEx-based validation method
In this section, we first describe the compressed data structures used to optimize the time and
space usage of the discovery algorithm. Then, we introduce the new RegEx-based validation
method, and how it can be adopted within an incremental FD discovery process.
        Identif  River  Location  Erected  Purpose  Length  Lanes  Clear-G  T-Or-D   Material  Span    Rel-L  Type
        (A)      (B)    (C)       (D)      (E)      (F)     (G)    (H)      (I)      (J)       (K)     (L)    (M)
   𝑑1   E22      A      24        1876     Highway  1200    4      G        Through  Wood      Short   S      Wood
   𝑑2   E26      M      12        1883     RR       1150    2      G        Through  Steel     Medium  S      Simple-T
   𝑑3   E28      M      3         1884     Highway  1000    2      G        Through  Steel     Medium  S      Arch
   𝑑4   E31      M      8         1887     RR       1161    2      G        Through  Steel     Medium  S      Simple-T
   𝑑5   E34      O      41        1888     RR       4558    2      G        Through  Steel     Long    F      Simple-T
   𝑑6   E35      A      27        1890     Highway  1000    2      G        Through  Steel     Medium  F      Simple-T

                                  (a) Before the mapping step.

        A  B  C  D  E  F  G  H  I  J  K  L  M
   𝑑1   1  1  1  1  1  1  1  1  1  1  1  1  1
   𝑑2   2  2  2  2  2  2  2  1  1  2  2  1  2
   𝑑3   3  2  3  3  1  3  2  1  1  2  2  1  3
   𝑑4   4  2  4  4  2  4  2  1  1  2  2  1  2
   𝑑5   5  3  5  5  2  5  2  1  1  2  3  2  2
   𝑑6   6  1  6  6  1  3  2  1  1  2  2  2  2

                                  (b) After the mapping step.

Table 1: Snippet of the Bridges dataset used to illustrate the validation and discovery strategies.


4.1. Compressed data structures
Before introducing the new validation method, it is necessary to provide details about its data
structures. In particular, in order to minimize the memory load, we represent a dataset by
using lightweight references to its tuples without losing significance. To do this, we map each
attribute value to a unique numeric ID, which allows us to efficiently identify data changes. In
particular, whenever new tuples are added to the dataset, their values are mapped to the
corresponding numeric IDs. This representation leads to fast tuple comparisons during
the FD discovery process, and permits to create RegExs in a simple way, avoiding character
encoding issues.

Example 1. Let us consider the snippet of the Bridges dataset in Table 1(a), containing 13
attributes. As shown in Table 1(b), after the mapping phase, each value has been mapped to a
unique ID within each attribute. Let us now consider the insertion of the following two tuples
in the Bridges dataset:

   𝑑7   E37   M   18   1891   RR   1350   2   G   Through   Steel   Medium   S   Simple-T
   𝑑8   E34   O   41   1888   RR   4558   2   G   Through   Steel   Long     F   Simple-T

Each of their values is parsed according to the values already present in the dataset, and the
two tuples are mapped as follows:

   𝑑7   7   2   7   7   2   6   2   1   1   2   2   1   2
   𝑑8   5   3   5   5   2   5   2   1   1   2   3   2   2

   Although this data representation allows us to quickly retrieve information, it is necessary to
define a new structure to support the validation step of candidate FDs. To this end, we exploit
a hash map, named RegExHashMap, whose keys are unique ID combinations, and whose
associated values correspond to the number of their occurrences in the dataset. In this way, it is
possible to avoid duplicate entries and to quickly perform insertion and/or removal operations.

Example 2. Let us consider Example 1, where the mapped tuples 𝑑7 and 𝑑8 should be inserted in
Table 1(b). For 𝑑7 the ID combination β€œ7, 2, 7, 7, 2, 6, 2, 1, 1, 2, 2, 1, 2” is inserted into the
RegExHashMap as a key, whose associated value is set to 1, since no other tuple in Table 1(b)
contains the same values as 𝑑7. Instead, for 𝑑8 the key β€œ5, 3, 5, 5, 2, 5, 2, 1, 1, 2, 3, 2, 2” is
already included in the RegExHashMap, so its associated value is set to 2, since tuple 𝑑8 is
equal to 𝑑5, yielding two occurrences of the same key.
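   To make these structures concrete, the following Java sketch shows one possible implementation
of the value-to-ID mapping and of the RegExHashMap. The name RegExHashMap comes from the
method described above, while the surrounding class and method names are illustrative
assumptions of ours, not REXY's actual API.

    import java.util.*;

    // Illustrative sketch: maps raw attribute values to compact per-attribute numeric IDs,
    // and counts the occurrences of each mapped tuple (the RegExHashMap).
    public class CompressedDataset {
        private final List<Map<String, Integer>> dictionaries;               // one dictionary per attribute
        private final Map<String, Integer> regExHashMap = new HashMap<>();   // key -> number of occurrences

        public CompressedDataset(int numAttributes) {
            dictionaries = new ArrayList<>();
            for (int i = 0; i < numAttributes; i++) dictionaries.add(new HashMap<>());
        }

        // Maps a raw tuple to its ID representation, e.g., t7 -> "7,2,7,7,2,6,2,1,1,2,2,1,2".
        public String map(String[] tuple) {
            StringJoiner key = new StringJoiner(",");
            for (int i = 0; i < tuple.length; i++) {
                Map<String, Integer> dict = dictionaries.get(i);
                // the first time a value appears in attribute i, it receives the next free ID
                key.add(String.valueOf(dict.computeIfAbsent(tuple[i], v -> dict.size() + 1)));
            }
            return key.toString();
        }

        // Inserts a tuple: duplicates only increment the counter (cf. t8 = t5 in Example 2).
        public void insert(String[] tuple) {
            regExHashMap.merge(map(tuple), 1, Integer::sum);
        }
    }

With this scheme, inserting 𝑑8 of Example 2 simply increments the counter of the key
β€œ5,3,5,5,2,5,2,1,1,2,3,2,2” to 2, avoiding a duplicate entry.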

4.2. Validation approach
An efficient discovery methodology should permit to quickly validate candidate FDs on the set
of data under analysis, and ensure high adaptability to possible data changes. The proposed
validation method relies on RegExs and exploits the RegExHashMap for fast retrieval operations.
In particular, let πœ‘ ∢ 𝑋 β†’ 𝐴 be a candidate FD over a dataset 𝐷. The proposed approach aims
to create a RegEx 𝜌 to validate πœ‘ over 𝐷. For the sake of clarity, we describe the creation of 𝜌 by
considering a single new tuple 𝑑. However, this strategy can easily be adapted to a large
number of new tuples by chaining different RegExs.
   The validation algorithm considers the projection of the attributes in 𝑋 and 𝐴 onto the tuple
𝑑 to select the value combinations that must be considered in the validation process. More
specifically, to validate πœ‘, it is necessary to create a RegEx for the attribute set 𝑋 = 𝑋1 , 𝑋2 , … , π‘‹π‘˜
by considering 𝑑[𝑋1 ], 𝑑[𝑋2 ], … , 𝑑[π‘‹π‘˜ ], in the following way:
𝜌𝐿𝐻𝑆 = t[𝑋1 ][,]([0-9]*[,])* t[𝑋2 ][,]([0-9]*[,])* … ([0-9]*[,])* t[π‘‹π‘˜ ]
πœŒπ‘…π»π‘† = (?!.*t[𝐴]).+
πœŒπœ‘ = ([0-9]*[,])* 𝜌𝐿𝐻𝑆 ([,][0-9]*)* πœŒπ‘…π»π‘† ([,][0-9]*)*

   We can notice that 𝜌𝐿𝐻𝑆 contains the comma character as a separator, whereas each value can
be preceded or followed by zero or more numeric characters, i.e., [0-9]*, representing all the
possible IDs of the attributes not involved in πœ‘. In fact, one of the strengths of this approach is
that it avoids considering specific values for the attributes that are irrelevant to the validation
step. Similarly to the LHS, it is necessary to define a RegEx for the RHS, i.e., πœŒπ‘…π»π‘†. To this end,
we define πœŒπ‘…π»π‘† as the negative lookahead of 𝑑[𝐴] (i.e., all values not equal to 𝑑[𝐴]), so as to
consider only the tuples in which 𝑑[𝐴] does not appear. The combination of 𝜌𝐿𝐻𝑆 and πœŒπ‘…π»π‘†
allows us to define a new RegEx πœŒπœ‘ to validate the candidate πœ‘ after the insertion of 𝑑. In this
way, it is possible to verify whether there exists at least one key in the RegExHashMap satisfying πœŒπœ‘ .
[Figure 2 illustrates the creation of the RegEx for the mapped tuple 𝑑7 and the candidate FD
{𝐡, 𝐷, 𝐺} β†’ 𝐽:
𝜌𝐿𝐻𝑆 = [0-9.]*[,][2][,][0-9.]*[,][7][,][0-9.]*[,][0-9.]*[,][2]
πœŒπ‘…π»π‘† = (?!.*2).+                (negative lookahead)
πœŒπœ‘ = 𝜌𝐿𝐻𝑆 [,][0-9.]*[,][0-9.]*[,] πœŒπ‘…π»π‘† ([,][0-9]*)* ]
Figure 2: An example of RegEx creation for Bridges dataset.


Example 3. Let us consider the snippet dataset in Table 1(b), a candidate FD πœ‘ ∢ {𝐡, 𝐷, 𝐺} β†’ 𝐽,
and the new tuple 𝑑7 defined in Example 1. The validation algorithm should verify whether this
new tuple leads to the invalidation of πœ‘, according to the strategy defined above. To this end, it
is necessary to define the RegEx πœŒπœ‘ shown in Figure 2. We can observe that the validation
approach considers the new values 𝑑7 [𝐡], 𝑑7 [𝐷], and 𝑑7 [𝐺] for the attributes 𝐡, 𝐷, and 𝐺, while
it considers a generic numeric value (i.e., [0-9]*) for the remaining attributes. As a consequence,
the RegEx 𝜌𝐿𝐻𝑆 is defined as shown in Figure 2. We can notice that 𝜌𝐿𝐻𝑆 allows the validation
strategy to check whether there already exists at least one tuple in the dataset with the same
values for the attributes 𝐡, 𝐷, and 𝐺. Moreover, the RegEx πœŒπ‘…π»π‘† for the attribute 𝐽 exploits the
negative lookahead of 𝑑7 [𝐽 ], in order to check whether there exists at least one key in the
RegExHashMap with the same values for the attributes 𝐡, 𝐷, and 𝐺, but a different value for 𝐽.
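   As a sketch of how πœŒπœ‘ can be built and matched in practice, the following Java fragment
reproduces Example 3 with java.util.regex. The 0-based attribute indices and the helper name
violationRegex are our own assumptions, and the LHS IDs are inserted literally (via
Pattern.quote) instead of the character classes of Figure 2, so that multi-digit IDs are also
handled.

    import java.util.*;
    import java.util.regex.Pattern;

    public class RegExValidation {
        // Builds a RegEx matching any RegExHashMap key that agrees with tuple t on the
        // LHS attributes but differs on the RHS attribute, i.e., a witness invalidating X -> A.
        static Pattern violationRegex(String[] t, Set<Integer> lhs, int rhs) {
            StringJoiner rho = new StringJoiner("[,]");
            for (int i = 0; i < t.length; i++) {
                if (lhs.contains(i))
                    rho.add(Pattern.quote(t[i]));                                // fixed ID on LHS attributes
                else if (i == rhs)
                    rho.add("(?!" + Pattern.quote(t[i]) + "(?:[,]|$))[0-9]+");   // any ID but t[A]
                else
                    rho.add("[0-9]*");                                           // wildcard elsewhere
            }
            return Pattern.compile(rho.toString());
        }

        public static void main(String[] args) {
            // Candidate FD {B, D, G} -> J (indices 1, 3, 6 -> 9) and the mapped tuple t7 of Example 1
            String[] t7 = "7,2,7,7,2,6,2,1,1,2,2,1,2".split(",");
            Pattern rho = violationRegex(t7, Set.of(1, 3, 6), 9);
            // t2 agrees with t7 on B and G but not on D, so it does not witness a violation
            System.out.println(rho.matcher("2,2,2,2,2,2,2,1,1,2,2,1,2").matches());   // false
        }
    }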


4.3. An incremental search strategy
The proposed validation method can be used within any discovery strategy that, by working
incrementally, can properly define the FD candidates to be validated. In particular, REXY uses a
version of the algorithm described in [15], adapted to incrementally manage any kind of data
change. In particular, given a dataset 𝐷 at time 𝜏, REXY considers the set of FDs holding at
time 𝜏, denoted as Σ𝜏 , as starting points, and collects them in a hash map ordered by LHS
cardinality [15]. Then, it follows a bottom-up search strategy by considering Σ𝜏 as the set of
candidate FDs, and by validating each of them according to the validation method described
above. Consequently, given a candidate FD πœ‘, REXY performs a downward discovery step when
a tuple is removed or πœ‘ is a novel candidate (i.e., πœ‘ βˆ‰ Σ𝜏 ). It consists in the generalization of
the candidate πœ‘, performed by iteratively removing one attribute from its LHS, until REXY
meets a valid FD in the lower levels. Conversely, when a new tuple is added and πœ‘ is not valid,
an upward discovery step is accomplished in order to find the possible new holding FDs. It
consists in the specialization of πœ‘, performed by adding a new attribute to its LHS (a sketch of
both steps is shown below).
   The usage of an ordered hash map within a bottom-up search strategy permits to reuse the
analysis performed at the previous levels, and to forward the validation of newly generated
candidate FDs to the next level, yielding an optimization in the pruning of the search space.
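   The upward and downward steps can be sketched in Java as follows, with LHS attribute sets
encoded as bit sets; the method names mirror SPECIALIZE_CANDIDATE and GENERALIZE_CANDIDATE
of Algorithm 1 (Section 5), while the bit-set representation is an assumption of ours rather than
REXY's actual implementation.

    import java.util.*;

    public class CandidateGeneration {
        // Upward step: all X βˆͺ {B} -> A with B βˆ‰ X and B β‰  A (used after an invalidation).
        static List<BitSet> specialize(BitSet lhs, int rhs, int numAttributes) {
            List<BitSet> result = new ArrayList<>();
            for (int b = 0; b < numAttributes; b++) {
                if (b == rhs || lhs.get(b)) continue;
                BitSet extended = (BitSet) lhs.clone();
                extended.set(b);                       // add one attribute to the LHS
                result.add(extended);
            }
            return result;
        }

        // Downward step: all X \ {B} -> A with B ∈ X (used for the minimality check).
        static List<BitSet> generalize(BitSet lhs) {
            List<BitSet> result = new ArrayList<>();
            for (int b = lhs.nextSetBit(0); b >= 0; b = lhs.nextSetBit(b + 1)) {
                BitSet reduced = (BitSet) lhs.clone();
                reduced.clear(b);                      // remove one attribute from the LHS
                result.add(reduced);
            }
            return result;
        }
    }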


5. The REXY Algorithm
The main procedure of REXY is shown in Algorithm 1. Given a set of minimal FDs Σ𝜏 valid at
time 𝜏, the set of tuples 𝐷𝜏 +1 updating 𝐷 at time 𝜏 + 1, and an instance of the RegExHashMap
Ξ“πœ at time 𝜏, the main procedure of REXY starts the discovery process by considering Σ𝜏 as the
set of candidate FDs. In particular, REXY performs a discovery step in ascending order by the
LHS cardinalities of Σ𝜏 (line 3) and extracts all the candidate FDs from Σ𝜏 with a specific LHS
cardinality (line 4). Then, for each of them, it checks if the candidate πœ‘ ∢ 𝑋 β†’ 𝐴 is still valid
after the updating of the dataset according to the validation approach defined in Section 4.2
(line 6). If the analyzed FD is not valid at time 𝜏 + 1, the procedure first removes it from the set
of the analyzed candidates (line 7), and then generates new candidate FDs at a higher lattice
level (line 8), by discarding those that can be inferred from other FDs already validated at time
𝜏 + 1 (lines 9-10). On the other hand, if the analyzed FD is still valid at time 𝜏 + 1, REXY tries
to find if there exist other FDs at a lower lattice level (line 12) that have been validated after
update operations (lines 13-18). At the end of each iteration, the procedure removes all the FDs
that are not minimal at time 𝜏 + 1 w.r.t. the FDs already validated in the previous iterations
(line 19). Finally, RegExHashMap is updated with the new tuples (line 20), and the procedure
returns a new set of minimal and valid FDs at time 𝜏 + 1 (line 21).
   The validation procedure of REXY is shown at the bottom of Algorithm 1. Given a candidate
FD πœ‘ ∢ 𝑋 β†’ 𝐴 at time 𝜏 + 1, an instance of the RegExHashMap Ξ“πœ at time 𝜏, and the set of
new tuples 𝐷𝜏 +1 at time 𝜏 + 1, the procedure creates a new RegEx for each new tuple in 𝐷𝜏 +1 ,
in order to check the validity of the candidate on the updated dataset. In particular, the
procedure starts by analyzing each new tuple in 𝐷𝜏 +1 (line 24). Then, for each of them, REXY
defines a new RegEx according to the approach defined in Section 4.2. More specifically, for
each attribute 𝐡 in π‘Žπ‘‘π‘‘π‘Ÿ(𝑅), if 𝐡 does not belong to the attributes of πœ‘, the procedure adds to πœŒπœ‘ a
generic value (lines 28-30). On the other hand, if 𝐡 belongs to the attributes on the LHS, then
the procedure adds to the final RegEx πœŒπœ‘ the corresponding value of the analyzed tuple, i.e., 𝑑[𝐡]
(line 32). Otherwise, if none of the previous cases holds, the attribute 𝐡 belongs to the RHS,
so the procedure adds to πœŒπœ‘ the negative lookahead of its value (line 33). After the creation of
the RegEx πœŒπœ‘ , REXY checks whether there exists at least one key in the RegExHashMap that
matches this RegEx. If so, πœ‘ is not valid on the dataset at time 𝜏 + 1, since there exists at least
one pair of tuples that invalidates πœ‘ (line 35). Otherwise, if no match is found for any new
tuple, the candidate is valid at time 𝜏 + 1 (line 36).
   It is worth noticing that, while the implementation of REXY includes several code optimizations
and takes advantage of the defined data structures, for the sake of clarity, the pseudo-code of
Algorithm 1 does not show such optimizations. They mainly guarantee that each possible candidate
FD is validated at most once.
  Algorithm 1: Main Procedures of REXY
   Input: A set Σ𝜏 of minimal FDs at time 𝜏; updated tuples 𝐷𝜏 +1 for the dataset 𝐷 at time
          𝜏 + 1; an instance of the RegExHashMap Ξ“πœ at time 𝜏
   Output: The set of new minimal FDs at time 𝜏 + 1, i.e., Σ𝜏 +1
 1 Function MAIN_PROCEDURE:
 2    Σ𝜏 +1 ← βˆ…
 3    for 𝑐 in [ 1, … , |π‘Žπ‘‘π‘‘π‘Ÿ(𝑅)| ] do
 4        Σ𝑐 ← Σ𝜏 .CARDINALITY_FILTER(𝑐)
 5        for each 𝑋 β†’ 𝐴 ∈ Σ𝑐 do
 6            if VALIDATION_REGEX(𝑋 β†’ 𝐴, Ξ“πœ , 𝐷𝜏 +1 ) is False then
 7                Σ𝑐 ← Σ𝑐 \ {𝑋 β†’ 𝐴}
 8                Σ𝑠 ← SPECIALIZE_CANDIDATE(𝑋 β†’ 𝐴)
 9                for each {𝑍 β†’ 𝐴} ∈ Σ𝑠 do
10                    if IS_MINIMAL({𝑍 β†’ 𝐴}, Σ𝜏 ) is True then
11                        Σ𝜏 ← Σ𝜏 βˆͺ {𝑍 β†’ 𝐴}
12            else
13                Σ𝑔 ← GENERALIZE_CANDIDATE(𝑋 β†’ 𝐴)
14                for each {π‘Š β†’ 𝐴} ∈ Σ𝑔 do
15                    if VALIDATION_REGEX({π‘Š β†’ 𝐴}, Ξ“πœ , 𝐷𝜏 +1 ) is False then
16                        Σ𝑔 ← Σ𝑔 \ {π‘Š β†’ 𝐴}
17                    Σ𝑔 ← Σ𝑔 βˆͺ GENERALIZE_CANDIDATE({π‘Š β†’ 𝐴})
18                Σ𝑐 ← Σ𝑐 βˆͺ Σ𝑔
19        if |Σ𝑐 | > 0 then Σ𝜏 +1 ← Σ𝜏 +1 βˆͺ MINIMALITY_CHECK(Σ𝜏 +1 , Σ𝑐 )
20    Ξ“πœ +1 ← Ξ“πœ βˆͺ 𝐷𝜏 +1                                // updating of the RegExHashMap
21    return Σ𝜏 +1
22 Function VALIDATION_REGEX(𝑋 β†’ 𝐴, Ξ“πœ , 𝐷𝜏 +1 ):
23    πœŒπœ‘ ← βˆ…
24    for each 𝑑 ∈ 𝐷𝜏 +1 do
25        ATTRS ← π‘Žπ‘‘π‘‘π‘Ÿ(𝑅)
26        for 𝑖 in [ 0, … , |ATTRS| βˆ’ 1 ] do
27            𝐡 ← ATTRS[𝑖]
28            if 𝐡 βˆ‰ 𝑋 ∧ 𝐡 β‰  𝐴 then
29                if 𝑖 == 0 then πœŒπœ‘ ← πœŒπœ‘ βˆͺ ([0βˆ’9]βˆ— [, ])βˆ—
30                else πœŒπœ‘ ← πœŒπœ‘ βˆͺ ([, ][0βˆ’9]βˆ— )βˆ—
31            else
32                if 𝐡 ∈ 𝑋 then πœŒπœ‘ ← πœŒπœ‘ βˆͺ 𝑑[𝐡]
33                else if 𝐡 == 𝐴 then πœŒπœ‘ ← πœŒπœ‘ βˆͺ (?!.βˆ— 𝑑[𝐡]).+
34                if 𝑖 < |ATTRS| βˆ’ 1 then πœŒπœ‘ ← πœŒπœ‘ βˆͺ [, ]
35        if Ξ“πœ .EXIST_MATCHING(πœŒπœ‘ ) is True then return False
36    return True


6. Experimental results
REXY has been developed in Java 15. The execution tests have been performed on an iMac
with an Intel Xeon processor at 3.40 GHz, 36-core, and 128GB of RAM. Furthermore, in order
   #  Dataset          Cols    Rows    FDs  Exec (Avg)  Validations  Valid. (Min)  Valid. (Max)  Valid. (Avg)  Memory
                        [#]     [#]    [#]  [ms]        [#]          [ms]          [ms]          [ms]          [MB]
   1  Iris                5     150      4          1          766         11            12            12          95
   2  Balance-scale       5     625      1          1         1330          8            38            13         100
   3  Bupa                6     345     16         18         6979         11            36            14         120
   4  Appendicitis        7     107     72         39         7890          9            37            23         116
   5  Chess               7    2000      6          2        17142         13            13            13          39
   6  Abalone             9    4148    137        335       705498         12            33            18         114
   7  Nursey              9   12960      1         82        59137          5            18            16         121
   8  Tic-tac-toe         9     958      9         28        14613          5             6             5         122
   9  Glass              10     214    123        140        28138         14            21            18         117
  10  Cmc                10    1473      1         13        17832         20            31            24         138
  11  Yeast              10    1484     50         12        95531         11            11            11          46
  12  Breast-cancer      11     699     46        175        56269          8            15            11          66
  13  Fraud-detection    11   28000     48       1482      1723242          5           230           207         414
  14  Poker-hand         11  264000      1         19      2401240          9            12            10         156
  15  Echocardiogram     13     132    538        160        65932          8            13             8          59
  16  Bridges            13     108    142        153        34017         11            31            11          94
  17  Australian         14     690    535        374       501588         10            26            12         124
  18  Adult              15   32560     60        281      3747386          8            23            12         803
  19  Tax                15    6848    310       7038      2378734         10           532           287         202
  20  Lymphography       19     147   2730      71844      5156312         14            34            20        4921
  21  Hepatitis          20     155   8250     194553      4528681         11           170           126       48878
  22  Parkinson          24     195   1724     524195       874789         16         16470         10492         127

           Table 2: Details of the considered real-world datasets and REXY performances.


to ensure proper executions of REXY on the considered datasets, the Java memory heap size
allocation has been set to 100GB.
   Table 2 summarizes the time performances of REXY on 22 real-world datasets varying in
the number of rows and columnsΒΉ. In our experimental session, we simulate a scenario of
continuous insertions of tuples by considering one tuple at a time. This allowed us to deeply
analyze the performances of the validation process on each candidate FD. More specifically,
Table 2 reports the minimum, the maximum, and the average times of the validation process
for each dataset. We also report the number of resulting FDs and the number of validations
performed at the end of the last execution of REXY on each dataset.
   The results highlight that the average execution time is almost always less than 10 seconds,
except for the Lymphography, Hepatitis, and Parkinson datasets. This is probably due to the
large number of FDs invalidated during the discovery process, possibly leading to a larger number
of new candidates to be validated. Concerning the validation process, its average time is
almost always less than 25 milliseconds per FD, and the maximum time almost never
exceeds 40 milliseconds, except for the Fraud-detection, Tax, Hepatitis, and Parkinson datasets.
Despite these time peaks, the average times demonstrate that such values can be considered
outliers. In general, although execution times increase with the number of attributes,
on average the time to validate each candidate FD remains low.
   We performed a comparative evaluation of REXY with respect to the incremental FD discovery
algorithm presented in [9] (baseline). In particular, both algorithms follow the same search
strategy, but differ in how they organize data, also yielding completely different validation
methods. Thus, the comparison allowed us to highlight the improvements, in terms of execution
times, of the proposed validation strategy with respect to the traditional partition-based one used in

   ¹ The datasets are available at: https://github.com/DastLab/TestDataset
[Figure 3 plots, for each dataset ID 1–22, the average validation times in milliseconds of REXY
and the baseline, on a logarithmic scale ranging from 10⁻⁡ to 10Β² ms.]

Figure 3: Validation times of REXY and the baseline algorithm proposed in [9].


[9]. Figure 3 shows the average time performances of both algorithms on the 22 datasets. We can
notice that REXY outperforms the baseline algorithm by several orders of magnitude, ranging
from 2 to 7, except for the Parkinson dataset, for which the baseline takes advantage of a
caching strategy that reuses the computations performed in previous iterations.
   In general, partition-based validation strategies, such as the one included in [9], are particu-
larly efficient when a complete discovery process has to be performed, since partitions can be
incrementally computed following a level-by-level bottom-up search strategy. However, the
incremental scenario requires the discovery process to browse the search space starting from
the previously holding FDs (i.e., not all FD candidates have to be considered at each level). Thus,
the proposed validation method is able to perform the verification without the necessity of
computing sparse partitions, yielding the faster execution times observed in the experimental
results.


7. Conclusion
In this paper we presented REXY, an incremental algorithm for FD discovery. It exploits a new
RegEx-based validation method to efficiently select the tuples affected by dataset updates. REXY
follows an incremental strategy that explores the search space based on the previously holding FDs.
Experimental results on real-world datasets show that REXY can efficiently update the set of
holding FDs, without the need to execute discovery algorithms from scratch. In the future, we
would like to include REXY in a visual monitoring tool [16], and to extend it for updating Relaxed
Functional Dependencies [17] in incremental scenarios.


References
       [1] X. Chu, I. F. Ilyas, P. Papotti, Holistic data cleaning: Putting violations into context, in:
           Proc. of IEEE International Conference on Data Engineering, 2013, pp. 458–469.
       [2] M. Le Guilly, J.-M. Petit, V.-M. Scuturici, Evaluating classification feasibility using func-
     tional dependencies, in: Transactions on Large-Scale Data-and Knowledge-Centered
     Systems XLIV, Springer, 2020, pp. 132–159.
 [3] Z. Abedjan, P. Schulze, F. Naumann, DFD: Efficient functional dependency discovery,
     in: Proc. of the 23rd ACM International Conference on Information and Knowledge
     Management, 2014, pp. 949–958.
 [4] P. A. Flach, I. Savnik, Database dependency discovery: A machine learning approach, AI
      Communications 12 (1999) 139–160.
 [5] Y. Huhtala, J. KΓ€rkkΓ€inen, P. Porkka, H. Toivonen, TANE: An efficient algorithm for
     discovering functional and approximate dependencies, The Computer Journal 42 (1999)
     100–111.
 [6] P. Schirmer, T. Papenbrock, S. Kruse, D. Hempfing, T. Meyer, D. NeuschΓ€fer-Rube, F. Nau-
     mann, DynFD: Functional dependency discovery in dynamic datasets., in: Proc. of 22nd
     International Conference on Extending Database Technology, 2019, pp. 253–264.
 [7] S.-L. Wang, J.-W. Shen, T.-P. Hong, Incremental discovery of functional dependencies
     using partitions, in: Proc. of Joint 9th IFSA World Congress and 20th NAFIPS International
     Conference, volume 3, 2001, pp. 1322–1326.
 [8] C. Wyss, C. Giannella, E. Robertson, FastFDs: A heuristic-driven, depth-first algorithm
     for mining functional dependencies from relation instances, in: Proc. of International
     Conference on Data Warehousing and Knowledge Discovery, 2001, pp. 101–110.
 [9] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Incremental discovery of functional de-
     pendencies with a bit-vector algorithm, in: M. Mecella, G. Amato, C. Gennaro (Eds.),
     Proceedings of the 27th Italian Symposium on Advanced Database Systems, volume 2400
     of CEUR Workshop Proceedings, CEUR-WS.org, 2019.
[10] H. Yao, H. J. Hamilton, C. J. Butz, FD_Mine: Discovering functional dependencies in a
     database using equivalences, in: Proc. of IEEE International Conference on Data Mining,
     2002, pp. 729–732.
[11] T. Papenbrock, F. Naumann, A hybrid approach to functional dependency discovery, in:
     Proc. of the 2016 International Conference on Management of Data, 2016, pp. 821–833.
[12] Z. Wei, S. Link, Discovery and ranking of functional dependencies, in: Proc. of IEEE 35th
     International Conference on Data Engineering, IEEE, 2019, pp. 1526–1537.
[13] S. Bell, Discovery and maintenance of functional dependencies by independencies, in:
     Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, 1995,
     pp. 27–32.
[14] L. Caruccio, V. Deufemia, G. Polese, Mining relaxed functional dependencies from data,
     Data Mining and Knowledge Discovery 34 (2020) 443–477.
[15] L. Caruccio, S. Cirillo, Incremental discovery of imprecise functional dependencies, J. Data
     and Information Quality 12 (2020).
[16] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Visualizing dependencies during
     incremental discovery processes, in: A. Poulovassilis et al. (Ed.), Proceedings of the
     Workshops of the EDBT/ICDT 2020 Joint Conference, volume 2578 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2020.
[17] L. Caruccio, V. Deufemia, F. Naumann, G. Polese, Discovering relaxed functional depen-
     dencies based on multi-attribute dominance, IEEE Transactions on Knowledge and Data
     Engineering (2021) To appear.