<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Efficient Validation of Functional Dependencies during Incremental Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Loredana Caruccio</string-name>
          <email>lcaruccio@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Cirillo</string-name>
          <email>scirillo@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Deufemia</string-name>
          <email>deufemia@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Polese</string-name>
          <email>gpolese@unisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Data Profiling, Functional Dependency, Incremental Discovery</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Salerno</institution>
          ,
          <addr-line>via Giovanni Paolo II n.132, 84084 Fisciano (SA)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>5</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>The discovery of functional dependencies (FDs) from data is facing novel challenges, also due to the necessity of monitoring datasets that evolve over time. In these scenarios, incremental FD discovery algorithms have to efficiently verify which of the previously discovered FDs still hold on the updated dataset, and also infer new valid FDs. This requires the definition of search strategies and validation methods able to analyze only the portion of the dataset affected by new changes. In this paper we propose a new validation method, which can be used in combination with different search strategies, that exploits regular expressions and compressed data structures to efficiently verify whether a candidate FD holds on an updated version of the input dataset. Experimental results demonstrate the effectiveness of the proposed method on real-world datasets adapted for incremental scenarios, also compared with a baseline incremental FD discovery algorithm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Traditional validation methods, such as the partition-based ones, exploit the possibility to
organize data into partitions according to candidate attribute sets, which will consistently
be redefined by following a level-by-level search strategy. In this way, the validation can be
performed by testing properties of the partitions involved in each candidate FD [5]. Instead,
row-based discovery algorithms directly derive candidate FDs by comparing attribute values for all
possible combinations of tuple pairs, so implicitly inferring the validation of the FDs [4, 8].</p>
      <p>In this paper, we propose an incremental FD discovery algorithm, named REXY (RegeX-based
incremental discoverY), which includes a new validation method exploiting regular expressions
(RegExs) to extract the subset of data affecting discovery results. It employs a compressed data
representation, which limits the memory load and optimizes the data analysis. The incremental
search strategy of REXY relies on an upward/downward candidate generation method based
on the set of previously holding FDs, without the need of exploring the complete search space.
We experimentally evaluated REXY on 22 real-world datasets. In particular, we evaluated the
effectiveness of the algorithm in its incremental identification of FDs, using as performance
benchmark the incremental algorithm described in [9]. The results demonstrate that REXY
improves the time performances of a baseline algorithm by several orders of magnitude.</p>
      <p>The remainder of the paper is organized as follows. Section 2 reviews the main discovery
strategies and algorithms existing in the literature. Section 3 introduces the incremental
discovery problem. Section 4 describes the proposed validation method and its integration in
an incremental search strategy, whereas the whole discovery algorithm is reported in Section 5.
Experimental results are discussed in Section 6, and conclusions are included in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In the literature, several algorithms for the discovery of FDs from data have been defined. They
are mainly devoted to the analysis of data from scratch, yielding specific methodologies to
explore and prune the search space. In particular, column-based and row-based are the
two main strategies they follow.</p>
      <p>Column-based strategies model the search space as an attribute lattice, which permits to
consider candidate dependencies at each level in terms of the lattice's edges, enabling the
representation of the Left-Hand-Side (LHS) and the Right-Hand-Side (RHS) of an FD. After the
generation of FD candidates at each level, it is necessary to validate each of them, entailing
the evaluation of value combinations or making the FD representations efficient, e.g., through data
partitions [3, 5, 10]. By following a level-by-level search strategy, column-based strategies
exploit the possibility to progressively organize data, and to test the validity of a candidate
FD by considering how data are collected into partitions. On the contrary, row-based strategies
directly derive candidate FDs by analyzing all possible combinations of tuple pairs, aiming to
derive two attribute subsets in terms of agree-sets or difference-sets, and entailing the automatic
derivation of all holding FDs [4, 8]. In this case, the validation of an FD is implicitly inferred by
the overall analysis over the set of data. Finally, in order to improve the efficiency of discovery
algorithms, hybrid strategies have also been proposed [11, 12]. The latter calculate FDs on a
randomly selected small subset of records (row-based), and validate the discovered FDs on the
entire dataset, by focusing on a small subset of the search space determined by FD candidates
and validation results (column-based).</p>
      <p>In the literature, there exist several FD discovery algorithms for incremental scenarios. In
particular, one of the first algorithms exploits the concepts of tuple partitions and monotonicity
of FDs to avoid re-scanning the database [7]. Another proposal is based on the concept
of functional independency, through which the algorithm maintains the set of FDs updated over
time [13]. Finally, the DynFD algorithm is able to discover and maintain FDs in dynamic datasets
[6]. In particular, it extends the aforementioned hybrid strategy, by continuously adapting the
FD validation structures according to batches of insert, update, and delete operations. With
respect to these incremental approaches, REXY uses a new validation method that focuses the
verification of candidate FDs only on the subset of data affected by the changes. This can
improve the efficiency of FD discovery in incremental scenarios.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Incremental discovery of Functional Dependencies</title>
      <p>Functional Dependencies (FDs) express relationships between attributes of a database relation.
An FD X → Y states that the values of the attribute set Y are uniquely determined by the values of
X. More formally, given a relation schema R, an FD over R is a statement X → Y (X determines
Y), with X, Y ⊆ attr(R), such that, given an instance r over R, X → Y is satisfied in r if and only if
for every pair of tuples (t1, t2) in r, whenever t1[X] = t2[X], then t1[Y] = t2[Y]. X and Y are also
named the Left-Hand-Side (LHS) and Right-Hand-Side (RHS) of the FD, respectively. An FD is
said to be non-trivial if and only if Y ⊈ X, and minimal if and only if there is no attribute A ∈ X
such that X \ A → Y is also an FD holding on r. For the sake of simplicity and w.l.o.g., in the rest of
the paper we assume that Y consists of a single attribute.</p>
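      <p>The definition above translates directly into a check that can be run on an instance: X → Y holds if no two tuples agree on X while disagreeing on Y. A minimal sketch of such a check (illustrative only, not the algorithm proposed in this paper; attribute positions and values are hypothetical):</p>

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FdCheck {
    // Returns true iff the FD X -> Y holds on the instance r:
    // tuples agreeing on the X attributes must also agree on the Y attributes.
    static boolean holds(List<String[]> r, int[] x, int[] y) {
        Map<String, String> seen = new HashMap<>();
        for (String[] t : r) {
            String lhs = project(t, x);
            String rhs = project(t, y);
            String prev = seen.putIfAbsent(lhs, rhs);
            if (prev != null && !prev.equals(rhs)) return false; // violation found
        }
        return true;
    }

    // Concatenates the values of the given attribute positions into a key.
    static String project(String[] t, int[] attrs) {
        StringBuilder sb = new StringBuilder();
        for (int a : attrs) sb.append(t[a]).append(',');
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> r = Arrays.asList(
            new String[]{"E22", "M", "1883"},
            new String[]{"E26", "M", "1887"},
            new String[]{"E22", "M", "1883"});
        System.out.println(holds(r, new int[]{0}, new int[]{2})); // attr0 -> attr2 holds
        System.out.println(holds(r, new int[]{1}, new int[]{2})); // attr1 -> attr2 violated
    }
}
```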
      <p>In general, the FD discovery problem aims at finding the set of all minimal FDs holding on a
relation instance r. It entails searching for a partition of tuples sharing the same values on the RHS
attribute whenever they share the same values on the LHS attributes [14]. To do this, it is possible
to first generate FD candidates, and then verify their validity and minimality, yielding a
column-based strategy.</p>
      <p>Column-based strategies model the search space as a graph representation of a lattice, which
contains a collection of attribute sets, where Level 0 maps the empty set, Level 1 the singleton sets,
one for each attribute, Level 2 the pair sets, one for each possible combination of two attributes,
and so forth. Finally, the last level, namely Level M, will contain a single set with all the attributes
from R. This permits to consider candidate FDs at each level in terms of edges. For instance, the
edge highlighted in blue in Figure 1 defines a candidate FD.</p>
      <p>The validation process of column-based strategies performs simple computations on the
cardinalities of the partitions calculated over attribute combinations, which correspond to the
lattice nodes. This process is particularly efficient when it is possible to follow a level-by-level
search strategy from Level 1 to Level M, and to gradually construct partitions over ever-increasing
attribute set sizes. This cannot be guaranteed when it is necessary to explore the search
space from different starting points, such as a set of previously holding FDs.</p>
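      <p>The partition-based criterion mentioned above can be sketched as follows: a candidate X → A holds if and only if the partition induced by X has the same number of equivalence classes as the one induced by X ∪ {A}, as exploited, e.g., by TANE [5]. A simplified illustration, not the cited implementations:</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionCheck {
    // Number of equivalence classes of the partition induced by the given attributes.
    static int partitionSize(List<String[]> r, int[] attrs) {
        Set<String> classes = new HashSet<>();
        for (String[] t : r) {
            StringBuilder key = new StringBuilder();
            for (int a : attrs) key.append(t[a]).append('|');
            classes.add(key.toString());
        }
        return classes.size();
    }

    // Partition criterion: X -> A holds iff |pi_X| == |pi_{X union {A}}|.
    static boolean holds(List<String[]> r, int[] x, int a) {
        int[] xa = Arrays.copyOf(x, x.length + 1);
        xa[x.length] = a;
        return partitionSize(r, x) == partitionSize(r, xa);
    }

    public static void main(String[] args) {
        List<String[]> r = Arrays.asList(
            new String[]{"1", "2", "1"},
            new String[]{"1", "2", "1"},
            new String[]{"2", "3", "1"});
        System.out.println(holds(r, new int[]{0}, 1)); // true: attr0 determines attr1
        System.out.println(holds(r, new int[]{2}, 0)); // false: attr2 constant, attr0 varies
    }
}
```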
      <p>Incremental scenarios require updating a set of minimal FDs Σ_t, discovered over data collected
until time t, according to the set of tuple modifications Δr_t+1 at time t+1. This yields the necessity
to (re)-consider all FDs in Σ_t, and possibly to generate new FD candidates according to validation
results.</p>
      <p>[Figure 1: the attribute lattice for five attributes A, B, C, D, E, from the singleton sets at Level 1 up to the set ABCDE at the top level.]</p>
      <p>In fact, the addition of a new tuple can entail the invalidation of a previously holding FD,
whereas the deletion of a tuple can entail that a previously holding FD is no longer minimal. More
formally, let X → A be an FD holding at time t, and t̄ a tuple yielding a dataset modification;
then we need to consider the following effects:
• Invalidation: when a tuple t̄ is added, it could produce the invalidation of X → A at
time t + 1. Consequently, one or more FD candidates on the next lattice level having the
same RHS and a superset of its LHS could be validated. Thus, it is necessary to consider
all possible FD candidates XB → A such that B ∉ X and B ≠ A.
• Minimality: when a tuple t̄ is removed, X → A could be no longer minimal at time
t + 1. Consequently, one or more FD candidates on the previous lattice level having the
same RHS and a subset of its LHS could be validated, yielding a minimal FD. Thus, it is
necessary to consider all possible FD candidates X ∖ B → A such that B ∈ X.</p>
      <p>Notice that a tuple update can always be managed as the deletion of the stored tuple and the
subsequent addition of the tuple containing the modified values.</p>
      <p>Invalidation and minimality checks determine how FD candidates should be generated and
managed throughout the search space, according to previously holding FDs and validation
results. This entails moving the focus onto different parts of the search space, by concentrating the
attention on the subset of data characterized by the specific candidate FDs under analysis. Thus, we
propose a new validation method based on RegExs, which checks whether modified data entail
some violations, and/or verifies the minimality of a candidate FD.</p>
    </sec>
    <sec id="sec-4">
      <title>4. The RegEx-based validation method</title>
      <p>In this section, we first describe the compressed data structures used to optimize the time and
space usage of the discovery algorithm. Then, we introduce the new RegEx-based validation
method, and how it can be adopted within an incremental FD discovery process.
</p>
      <p>[Table 1: a snippet of the Bridges dataset (a) and its compressed representation after the ID mapping phase (b), together with the corresponding RegExHashMap entries.]</p>
      <sec id="sec-4-1">
        <title>4.1. Compressed data structures</title>
        <p>Before introducing the new validation method, it is necessary to provide details about its data
structures. In particular, in order to minimize the memory load, we represent a dataset by
using lightweight references to its tuples without losing significance. To do this, we map each
attribute value to a unique numeric ID, which allows us to efficiently identify data changes. In
particular, whenever new tuples are added to the dataset, their values are mapped into their
corresponding numerical values. This representation leads to a fast tuple comparison during
the FD discovery process, and permits to create RegExs in a simple way, avoiding character
encoding issues.</p>
        <p>Example 1. Let us consider the snippet of the Bridges dataset in Table 1(a) containing 13
attributes. As shown in Table 1(b), each value has been parsed by using a unique ID for the
values in each of the attributes. Let us now consider the insertion of two different tuples t7 and
t8 in the existing dataset; then each of their values is parsed according to the already existing
values of the dataset, and mapped to the corresponding numerical tuples. As we can see, after
the parsing step, the tuple t8 is equal to t5. Thus, it is not necessary to store the tuple t8 in the
dataset. On the other hand, the tuple t7 is stored in D since it represents a combination of
existing and new values for the dataset.</p>
        <p>Although this data representation allows us to quickly retrieve information, it is necessary to
define a new structure to support the validation step of candidate FDs. To this end, we exploit
a hash map, namely RegExHashMap, whose keys are unique ID combinations, and whose
associated values correspond to the number of their occurrences in the dataset. In this way, it is
possible to avoid duplicate entries and to quickly perform insertion and/or removal operations.</p>
        <p>Example 2. Let us consider Example 1, where the mapped tuples t7 and t8 should be inserted
in Table 1(b). Thus, for t7 the ID "7,2,7,7,2,6,2,1,1,2,2,1,2" is inserted into the RegExHashMap
as key, whose associated value is set to 1, since no other tuples in Table 1(b) contain the same
values of t7. Instead, for t8 the key "5,3,5,5,2,5,2,1,1,2,3,2,2" is already included in the
RegExHashMap, so the associated value is set to 2, since tuple t8 is equal to t5, yielding
two occurrences of the same key. Further details about the search strategy will be discussed in
the following section.</p>
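        <p>A minimal sketch of how the two structures described above could be realized, assuming a plain HashMap as backing store and per-attribute dictionaries assigning progressive IDs (class and method names are illustrative, not REXY's actual implementation):</p>

```java
import java.util.HashMap;
import java.util.Map;

public class CompressedStore {
    // Per-attribute dictionaries: each distinct value gets a unique numeric ID.
    private final Map<String, Integer>[] dictionaries;
    // RegExHashMap: comma-separated ID combination -> number of occurrences.
    private final Map<String, Integer> regExHashMap = new HashMap<>();

    @SuppressWarnings("unchecked")
    CompressedStore(int nAttributes) {
        dictionaries = new Map[nAttributes];
        for (int i = 0; i < nAttributes; i++) dictionaries[i] = new HashMap<>();
    }

    // Maps a tuple to its ID key and counts the occurrence.
    String insert(String[] tuple) {
        StringBuilder key = new StringBuilder();
        for (int i = 0; i < tuple.length; i++) {
            Map<String, Integer> dict = dictionaries[i];
            int id = dict.computeIfAbsent(tuple[i], v -> dict.size() + 1);
            key.append(id);
            if (i < tuple.length - 1) key.append(',');
        }
        String k = key.toString();
        regExHashMap.merge(k, 1, Integer::sum); // duplicate tuples share one entry
        return k;
    }

    int count(String key) { return regExHashMap.getOrDefault(key, 0); }

    public static void main(String[] args) {
        CompressedStore store = new CompressedStore(3);
        String k1 = store.insert(new String[]{"E22", "M", "Steel"});
        String k2 = store.insert(new String[]{"E26", "M", "Steel"});
        String k3 = store.insert(new String[]{"E22", "M", "Steel"}); // duplicate of the first
        System.out.println(k1);              // 1,1,1
        System.out.println(k2);              // 2,1,1
        System.out.println(store.count(k3)); // 2
    }
}
```

<p>As in Example 2, a repeated tuple does not create a new entry: only its occurrence counter is incremented.</p>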
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Validation approach</title>
        <p>An efficient discovery methodology should permit to quickly validate candidate FDs on the set
of data under analysis, and ensure high adaptability to possible data changes. The proposed
validation method relies on RegExs and exploits the RegExHashMap for fast retrieval operations.
In particular, let φ : X → A be a candidate FD over a dataset D. The proposed approach aims
to create a RegEx ρ_φ to validate φ over D. For the sake of clarity, we describe the creation of ρ_φ by
considering a single new tuple t̄. However, this strategy can be easily adapted to consider a large
number of new tuples by chaining different RegExs.</p>
        <p>The validation algorithm considers the projection of the attributes in X and A onto the tuple
t̄ to select the value combinations that must be considered in the validation process. More
specifically, to validate φ, it is necessary to create a RegEx for the attribute set X = X1, X2, …, Xk
by considering t̄[X1], t̄[X2], …, t̄[Xk], in the following way:
ρ_LHS = t̄[X1][,]([0-9]*[,])* t̄[X2][,]([0-9]*[,])* … ([0-9]*[,])* t̄[Xk]
ρ_RHS = (?!.*t̄[A]).+
ρ_φ = ([0-9]*[,])* ρ_LHS ([,][0-9]*)* ρ_RHS ([,][0-9]*)*</p>
        <p>We can notice that ρ_LHS contains the comma character as separator, whereas each value could
be followed by zero or more numeric characters, i.e., [0−9]*, representing all the possible
IDs for the attributes that are not in X. In fact, one of the strengths of this approach is to avoid
considering specific values for attributes that must not be considered during the validation step.
Similarly to the LHS, it is necessary to define a RegEx for the RHS, i.e., ρ_RHS. To this end, we
can define ρ_RHS as the negative look ahead of t̄[A] (i.e., all values not equal to t̄[A]), so enabling
to consider only the tuples in which t̄[A] does not appear. The combination of ρ_LHS and ρ_RHS
allows us to define a new RegEx ρ_φ to validate the candidate φ after the insertion of t̄. In this
way, it is possible to verify if there exists at least one key in the RegExHashMap satisfying ρ_φ.</p>
        <p>[Figure 2: the RegEx ρ_φ built for the tuple t7, e.g., [0-9.]*[,][2][,][0-9.]*[,][7][,][0-9.]*[,][0-9.]*[,][2], combined with the negative look ahead (?!.*2).+, where fixed IDs are used for the LHS attributes and generic [0-9.]* patterns for the remaining attributes.]</p>
        <p>Example 3. Let us consider the snippet dataset in Table 1(b), a candidate FD φ with three
attributes on its LHS, and the new tuple t7 defined in Example 1. The validation algorithm should verify whether this new
tuple leads to the invalidation of φ, according to the strategy defined above. To this end, it is
necessary to define the RegEx ρ_φ to validate φ, as shown in Figure 2. We can observe that the
validation approach considers the new values of t7 for the three LHS attributes,
while it considers a generic numeric value (i.e., [0−9]*) for the remaining attributes.
As a consequence, the RegEx ρ_LHS has been defined as shown in Figure 2. We can notice that
ρ_LHS allows the validation strategy to check whether there already exists at least one tuple in the dataset
with the same values for the LHS attributes. Moreover, it is necessary to define the RegEx
ρ_RHS for the RHS attribute by exploiting the negative look ahead of its value in t7, in order to check whether
there exists at least one key in the RegExHashMap with the same values for the LHS attributes, but a
different value for the RHS.</p>
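        <p>The construction of ρ_φ and its check against the RegExHashMap keys can be sketched as follows, under the simplifying assumptions of single-digit IDs and a per-position negative look ahead in place of the full pattern of Figure 2 (tuples and attribute positions are illustrative):</p>

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class RegExValidation {
    // Builds a validation RegEx for the candidate FD lhs -> rhs and new tuple t:
    // LHS positions are fixed to t's IDs, uninvolved positions match any ID,
    // and the RHS position must hold an ID different from t[rhs]
    // (single-digit IDs assumed for simplicity).
    static String buildRegex(int[] lhs, int rhs, String[] t) {
        StringBuilder sb = new StringBuilder();
        for (int j = 0; j < t.length; j++) {
            final int pos = j;
            boolean inLhs = java.util.Arrays.stream(lhs).anyMatch(a -> a == pos);
            if (inLhs) sb.append('[').append(t[j]).append(']');               // fixed LHS value
            else if (j == rhs) sb.append("(?!").append(t[j]).append(")[0-9]+"); // negative look ahead
            else sb.append("[0-9]*");                                          // any ID
            if (j < t.length - 1) sb.append("[,]");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // RegExHashMap keys of the already stored tuples.
        Map<String, Integer> regExHashMap = new HashMap<>();
        regExHashMap.put("1,4,2,3", 1);
        regExHashMap.put("2,4,2,4", 2);
        // New tuple with IDs [1, 5, 2, 4]; candidate FD: attr0 -> attr3.
        String[] t = {"1", "5", "2", "4"};
        Pattern p = Pattern.compile(buildRegex(new int[]{0}, 3, t));
        boolean invalid = regExHashMap.keySet().stream().anyMatch(k -> p.matcher(k).matches());
        System.out.println(invalid); // "1,4,2,3" agrees on attr0 but differs on attr3 -> FD invalidated
    }
}
```

<p>A matching key means that a stored tuple agrees with the new tuple on the LHS but disagrees on the RHS, so the candidate is invalidated.</p>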
      </sec>
      <sec id="sec-4-3">
        <title>4.3. An incremental search strategy</title>
        <p>The proposed validation method can be used within any discovery strategy that, by working
incrementally, can properly define the FD candidates to be validated. In particular, the algorithm
REXY uses a version of the algorithm described in [15], adapted for incrementally managing any
kind of data change. In particular, given a dataset D at time t, REXY considers the set of FDs
holding at time t, denoted as Σ_t, as starting points, and collects them in a hash map ordered by
LHS cardinality [15]. Then, it follows a bottom-up search strategy by considering Σ_t as the set
of candidate FDs, and by validating each of them according to the validation method described
above. Consequently, given a candidate FD φ, REXY performs a downward discovery step when
a tuple is removed or φ is a novel candidate (i.e., φ ∉ Σ_t). It consists in the generalization of
the candidate φ, performed by iteratively removing one attribute from its LHS, until REXY meets
a valid FD in the lower levels. Conversely, when a new tuple is added and φ is not valid, an
upward discovery step is accomplished in order to find the possible new holding FDs. It consists
in the specialization of φ, performed by adding a new attribute to its LHS.</p>
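        <p>The upward/downward steps described above can be sketched as follows (a simplified illustration of candidate generation with single-character attribute names; not REXY's actual implementation):</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CandidateGeneration {
    // Upward step: specialize X -> a by adding one attribute B (B not in X, B != a).
    static List<Set<Character>> specialize(Set<Character> x, char a, String schema) {
        List<Set<Character>> out = new ArrayList<>();
        for (char b : schema.toCharArray()) {
            if (b != a && !x.contains(b)) {
                Set<Character> xs = new LinkedHashSet<>(x);
                xs.add(b);
                out.add(xs);
            }
        }
        return out;
    }

    // Downward step: generalize X -> a by removing one attribute from X.
    static List<Set<Character>> generalize(Set<Character> x) {
        List<Set<Character>> out = new ArrayList<>();
        for (char b : x) {
            Set<Character> xs = new LinkedHashSet<>(x);
            xs.remove(b);
            out.add(xs);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Character> x = new LinkedHashSet<>(List.of('A', 'B'));
        System.out.println(specialize(x, 'E', "ABCDE")); // [[A, B, C], [A, B, D]]
        System.out.println(generalize(x));               // [[B], [A]]
    }
}
```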
        <p>The usage of an ordered hash map into a bottom-up search strategy permits to reuse the
analysis performed at the previous levels, and forward the validation of newly generated
candidate FDs to the next level, yielding an optimization in the pruning of the search space.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. The REXY Algorithm</title>
      <p>The main procedure of REXY is shown in Algorithm 1. Given a set of minimal FDs Σ_t valid at
time t, the set of tuples Δr_t+1 updating r at time t + 1, and an instance of the RegExHashMap
Γ_t at time t, the main procedure of REXY starts the discovery process by considering Σ_t as the
set of candidate FDs. In particular, REXY performs a discovery step in ascending order of the
LHS cardinalities of Σ_t (line 3) and extracts all the candidate FDs from Σ_t with a specific LHS
cardinality (line 4). Then, for each of them, it checks if the candidate φ : X → A is still valid
after the update of the dataset, according to the validation approach defined in Section 4.2
(line 6). If the analyzed FD is not valid at time t + 1, the procedure first removes it from the set
of the analyzed candidates (line 7), and then generates new candidate FDs at a higher lattice
level (line 8), discarding those that can be inferred from other FDs already validated at time
t + 1 (lines 9-10). On the other hand, if the analyzed FD is still valid at time t + 1, REXY tries
to find whether there exist other FDs at a lower lattice level (line 12) that have become valid after the
update operations (lines 13-18). At the end of each iteration, the procedure removes all the FDs
that are not minimal at time t + 1 w.r.t. the FDs already validated in the previous iterations
(line 19). Finally, the RegExHashMap is updated with the new tuples (line 20), and the procedure
returns the new set of minimal and valid FDs at time t + 1 (line 21).</p>
      <p>The validation procedure of REXY is shown at the bottom of Algorithm 1. Given a candidate
FD φ : X → A at time t + 1, an instance of the RegExHashMap Γ_t at time t, and the set of the
new tuples Δr_t+1 at time t + 1, the procedure creates a new RegEx for each new tuple in Δr_t+1,
in order to check the validity of the candidates on the updated dataset. In particular, the
procedure starts by analyzing each new tuple in Δr_t+1 (line 24). Then, for each of them, REXY
defines a new RegEx according to the approach defined in Section 4.2. More specifically, for
each attribute B in attr(R), if B does not belong to the attributes of φ, the procedure adds to ρ_φ a
generic value (lines 28-30). On the other hand, if B belongs to the attributes on the LHS, then
the procedure adds to the final RegEx ρ_φ the corresponding value of the analyzed tuple t̄[B]
(line 32). Otherwise, if none of the previous cases is verified, the attribute B belongs to the RHS,
so the procedure adds to ρ_φ the negative look ahead of its value (line 33). After the creation of
the RegEx ρ_φ, REXY checks whether there exists at least one tuple in the RegExHashMap that matches this
RegEx. If so, it means that φ is not valid on the dataset at time t + 1, since there exists at least
one pair of tuples that invalidates φ (line 35). Otherwise, if the previous case is never verified for
any new tuple, it means that the candidate is valid at time t + 1 (line 36).</p>
      <p>It is worth noticing that, while the implementation of REXY considers several code optimizations
and takes advantage of the defined data structures, for the sake of clarity, the pseudo-code of
Algorithm 1 does not show such optimizations. They mainly guarantee that each possible candidate
FD is validated at most once.</p>
      <p>Algorithm 1: Main Procedures of REXY
Input: A set Σ_t of minimal FDs at time t; the updated tuples Δr_t+1 for the dataset r at
time t + 1; an instance of the RegExHashMap Γ_t at time t.
Output: The set of new minimal FDs at time t + 1, i.e., Σ_t+1.
1  Function MAIN_PROCEDURE:
2      Σ_t+1 ← ∅
3      for i in [1, …, |attr(R)|] do
4          Σ_i ← Σ_t.CARDINALITY_FILTER(i)
5          for each X → A ∈ Σ_i do
6              if VALIDATION_REGEX(X → A, Γ_t, Δr_t+1) is False then
7                  Σ_i ← Σ_i \ {X → A}
8                  Σ_s ← SPECIALIZE_CANDIDATE(X → A)
9                  for each {Z → A} ∈ Σ_s do
10                     if IS_MINIMAL({Z → A}, Σ_t+1) is True then
11                         Σ_i+1 ← Σ_i+1 ∪ {Z → A}
12             else
13                 Σ_g ← GENERALIZE_CANDIDATE(X → A)
14                 for each {Z → A} ∈ Σ_g do
15                     if VALIDATION_REGEX({Z → A}, Γ_t, Δr_t+1) is False then
16                         Σ_g ← Σ_g \ {Z → A}
17                         Σ_g ← Σ_g ∪ GENERALIZE_CANDIDATE({Z → A})
18                 Σ_i ← Σ_i ∪ Σ_g
19         if |Σ_i| &gt; 0 then Σ_t+1 ← Σ_t+1 ∪ MINIMALITY_CHECK(Σ_t+1, Σ_i)
20     Γ_t+1 ← Γ_t ∪ Δr_t+1    // Updating of the RegExHashMap
21     return Σ_t+1
22 Function VALIDATION_REGEX(X → A, Γ_t, Δr_t+1):
23     ATTRS ← attr(R)
24     for each t̄ ∈ Δr_t+1 do
25         ρ_φ ← ∅
26         for j in [0, …, |ATTRS| − 1] do
27             B ← ATTRS[j]
28             if B ∉ X ∧ B ≠ A then
29                 if j == 0 then ρ_φ ← ρ_φ ∪ ([0−9]*[,])*
30                 else ρ_φ ← ρ_φ ∪ ([,][0−9]*)*
31             else
32                 if B ∈ X then ρ_φ ← ρ_φ ∪ [t̄[B]]
33                 else ρ_φ ← ρ_φ ∪ (?!.*[t̄[A]]).+
34             if j &lt; |ATTRS| − 1 then ρ_φ ← ρ_φ ∪ [,]
35         if Γ_t.EXIST_MATCHING(ρ_φ) is True then return False
36     return True</p>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental results</title>
      <p>REXY has been developed in Java 15. The execution tests have been performed on an iMac
with an Intel Xeon processor at 3.40 GHz, 36-core, and 128GB of RAM. Furthermore, in order
to ensure proper executions of REXY on the considered datasets, the Java memory heap size
allocation has been set to 100GB. Table 2 lists the 22 considered datasets (Iris, Balance-scale,
Bupa, Appendicitis, Chess, Abalone, Nursey, Tic-tac-toe, Glass, Cmc, Yeast, Breast-cancer,
Fraud-detection, Poker-hand, Echocardiogram, Bridges, Australian, Adult, Tax, Lymphography,
Hepatitis, and Parkinson), together with the number of rows and columns of each of them.
In our experimental session, we simulate a scenario of
continuous insertions of tuples by considering one tuple at a time. This allowed us to deeply
analyze the performances of the validation process on each candidate FD. We also report the
number of resulting FDs and the number of validations
performed at the end of the last execution of REXY on each dataset.</p>
      <p>The results highlight that the average execution time is almost always less than 10 seconds,
except for the Lymphography, Hepatitis, and Parkinson datasets. This is probably due to the
large number of FDs invalidated during the discovery process, possibly leading to a larger number
of new candidates to be validated. Concerning the validation process, its average time
is almost always less than 25 milliseconds per FD, and the maximum time almost never
exceeds 40 milliseconds, except for the Fraud-detection, Tax, Hepatitis, and Parkinson datasets.
Despite these time peaks, the average times demonstrate that such values can be considered
outliers. In general, although execution times increase when the number of attributes increases,
on average they remain low for validating each candidate FD.</p>
      <p>We performed a comparative evaluation of REXY with respect to the incremental FD discovery
algorithm presented in [9] (baseline). In particular, both algorithms follow the same search
strategy, but differ in how they organize data, also yielding completely different validation
methods. Thus, the comparison allowed us to highlight the improvements in terms of execution
times of the proposed validation strategy with respect to the traditional partition-based one used
in [9]. Figure 3 shows the average time performances of both algorithms on the 22 datasets. We can
notice that REXY outperforms the baseline algorithm by several orders of magnitude, ranging
from 2 to 7, except for the Parkinson dataset, for which the baseline takes advantage of a
caching strategy that allows it to reuse the computation performed in the previous iterations.</p>
      <p>The datasets are available on: https://github.com/DastLab/TestDataset</p>
      <p>In general, partition-based validation strategies, such as the one included in [9], are
particularly efficient when a complete discovery process has to be performed, since partitions can be
incrementally computed by following a level-by-level bottom-up search strategy. However, the
incremental scenario requires the discovery process to browse the search space starting from
the previously holding FDs (i.e., not all FD candidates have to be considered at each level). Thus,
the proposed validation method is able to perform the verification without the necessity
to compute sparse partitions, yielding faster execution times, as can be noted in the discussed
experimental results.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper we presented REXY, an incremental algorithm for FD discovery. It exploits a new
RegEx-based validation method to efficiently select the tuples affected by dataset updates. REXY
follows an incremental strategy to explore the search space based on the previously holding FDs.
Experimental results over real-world datasets show that REXY can efficiently update the set of
holding FDs, without the necessity to execute discovery algorithms from scratch. In the future, we
would like to include REXY in a visual monitoring tool [16], and extend it for updating Relaxed
Functional Dependencies [17] in incremental scenarios.
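To illustrate the idea behind Regex-based selection of affected tuples, the following is a deliberately simplified Python sketch, not REXY's actual data structures: rows are encoded as delimiter-separated strings, a regular expression built from an updated tuple's LHS values selects only the matching rows, and the candidate FD remains valid after an insert iff those rows agree on the RHS value. It assumes, for simplicity, that the LHS attributes form a prefix of the encoded row.

```python
import re

# Rows stored as delimiter-separated strings (compressed encodings in practice).
rows = ["a|1|x", "a|1|x", "b|2|y", "c|3|z"]

def affected_rhs_values(encoded_rows, lhs_values, rhs_index):
    """Collect the RHS values of all rows matching the given LHS values."""
    pattern = re.compile("^" + re.escape("|".join(lhs_values)) + r"\|")
    return {r.split("|")[rhs_index] for r in encoded_rows if pattern.match(r)}

def still_valid_after_insert(encoded_rows, new_row, lhs_idx, rhs_idx):
    """An insert violates X -> Y iff an existing row with the same X-values
    has a different Y-value; only the matching rows need to be inspected."""
    fields = new_row.split("|")
    lhs_values = [fields[i] for i in lhs_idx]
    seen = affected_rhs_values(encoded_rows, lhs_values, rhs_idx)
    return seen <= {fields[rhs_idx]}

still_valid_after_insert(rows, "a|1|x", [0, 1], 2)  # same Y for same X -> True
still_valid_after_insert(rows, "a|1|w", [0, 1], 2)  # conflicting Y -> False
```

The key property is that validation touches only the rows selected by the pattern, rather than recomputing partitions over the entire relation.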
</p>
      <p>[3] Z. Abedjan, P. Schulze, F. Naumann, DFD: Efficient functional dependency discovery,
in: Proc. of the 23rd ACM International Conference on Information and Knowledge
Management, 2014, pp. 949–958.
[4] P. A. Flach, I. Savnik, Database dependency discovery: A machine learning approach, AI
Communications 12 (1999) 139–160.
[5] Y. Huhtala, J. Kärkkäinen, P. Porkka, H. Toivonen, TANE: An efficient algorithm for
discovering functional and approximate dependencies, The Computer Journal 42 (1999)
100–111.
[6] P. Schirmer, T. Papenbrock, S. Kruse, D. Hempfing, T. Meyer, D. Neuschäfer-Rube, F.
Naumann, DynFD: Functional dependency discovery in dynamic datasets, in: Proc. of the 22nd
International Conference on Extending Database Technology, 2019, pp. 253–264.
[7] S.-L. Wang, J.-W. Shen, T.-P. Hong, Incremental discovery of functional dependencies
using partitions, in: Proc. of the Joint 9th IFSA World Congress and 20th NAFIPS International
Conference, volume 3, 2001, pp. 1322–1326.
[8] C. Wyss, C. Giannella, E. Robertson, FastFDs: A heuristic-driven, depth-first algorithm
for mining functional dependencies from relation instances, in: Proc. of the International
Conference on Data Warehousing and Knowledge Discovery, 2001, pp. 101–110.
[9] L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Incremental discovery of functional
dependencies with a bit-vector algorithm, in: M. Mecella, G. Amato, C. Gennaro (Eds.),
Proceedings of the 27th Italian Symposium on Advanced Database Systems, volume 2400
of CEUR Workshop Proceedings, CEUR-WS.org, 2019.
[10] H. Yao, H. J. Hamilton, C. J. Butz, FD_Mine: Discovering functional dependencies in a
database using equivalences, in: Proc. of the IEEE International Conference on Data Mining,
2002, pp. 729–732.
[11] T. Papenbrock, F. Naumann, A hybrid approach to functional dependency discovery, in:
Proc. of the 2016 International Conference on Management of Data, 2016, pp. 821–833.
[12] Z. Wei, S. Link, Discovery and ranking of functional dependencies, in: Proc. of the IEEE
35th International Conference on Data Engineering, IEEE, 2019, pp. 1526–1537.
[13] S. Bell, Discovery and maintenance of functional dependencies by independencies, in:
Proc. of the 1st International Conference on Knowledge Discovery and Data Mining, 1995,
pp. 27–32.
[14] L. Caruccio, V. Deufemia, G. Polese, Mining relaxed functional dependencies from data,
Data Mining and Knowledge Discovery 34 (2020) 443–477.
[15] L. Caruccio, S. Cirillo, Incremental discovery of imprecise functional dependencies, J. Data
and Information Quality 12 (2020).
[16] B. Breve, L. Caruccio, S. Cirillo, V. Deufemia, G. Polese, Visualizing dependencies during
incremental discovery processes, in: A. Poulovassilis et al. (Eds.), Proceedings of the
Workshops of the EDBT/ICDT 2020 Joint Conference, volume 2578 of CEUR Workshop
Proceedings, CEUR-WS.org, 2020.
[17] L. Caruccio, V. Deufemia, F. Naumann, G. Polese, Discovering relaxed functional
dependencies based on multi-attribute dominance, IEEE Transactions on Knowledge and Data
Engineering (2021). To appear.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. F.</given-names>
            <surname>Ilyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Papotti</surname>
          </string-name>
          ,
          <article-title>Holistic data cleaning: Putting violations into context</article-title>
          ,
          <source>in: Proc. of IEEE International Conference on Data Engineering</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>458</fpage>
          -
          <lpage>469</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Le Guilly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Petit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.-M.</given-names>
            <surname>Scuturici</surname>
          </string-name>
          ,
          <article-title>Evaluating classification feasibility using functional dependencies</article-title>
          ,
          <source>in: Transactions on Large-Scale Data- and Knowledge-Centered Systems XLIV</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>159</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>