=Paper=
{{Paper
|id=Vol-2214/paper1
|storemode=property
|title=An Algorithm for Verifying Approximate Pure Evolving Functional Dependencies
|pdfUrl=https://ceur-ws.org/Vol-2214/paper1.pdf
|volume=Vol-2214
|authors=Pietro Sala
|dblpUrl=https://dblp.org/rec/conf/cilc/Sala18
}}
==An Algorithm for Verifying Approximate Pure Evolving Functional Dependencies==
An algorithm for verifying Approximate Pure Evolving Functional Dependencies‹ Pietro Sala Department of Computer Science, University of Verona, Italy pietro.sala@univr.it Abstract. Fuctional dependencies (FDs) form the foundations of data- base theory since the beginning of time, given they may represent con- straints such as “the employee role determines his monthly salary”. If we consider temporal databases where each tuple is timestamped with a valid time (V T ) attribute, finer constraints may be imposed on the evo- lution of the data such as “the employee role before a promotion and his role afterwards determine his new monthly salary”. Such constraints are called Pure Evolving Functional Dependencies (PEFDs) according to the classification introduced in [3]. By adding approximation to such rules they may be used for extracting knowledge in place of imposing con- straints. Then we may want to discover that “the employee role before a promotion and his role afterwards generally determine his new monthly salary”. This kind of approximate dependencies are called Approximate Pure Evolving Functional Dependencies (APEFDs). Verifying whether or not a given APEFD holds over a database instance is an NP-Complete problem [4]. Despite the discouraging complexity, in this work we pro- pose an algorithms that solve in an efficient way this problem by trying to divide-and-conquer and prune the search space as much as possible. Keywords: Temporal Databases · Approximate Functional Dependen- cies · Data Mining. 1 Temporally Mixed Functional Dependencies According to [3], in the following we study a family of temporal functional de- pendencies [2, 5, 6] i.e., the Pure Evolving Functional Dependencies, called PEFD. From now on our temporal functional dependencies will be given over a temporal schema R “ U Y tV T u, where U is a set of atemporal attributes and V T is a special attribute denoting the valid time of each tuple. Hereinafter we assume tuples timestamped with natural numbers (i.e, DompV T q “ N). Let J Ď U be a non-empty subset of U . We define the set W as W “ U zJ and set W , which is basically a renaming of attributes in W . Formally, for each attribute A P W , we have A P W (i.e., W “ tA : A P W u). Given an instance r of R, we define an instance τJr of schema Rev “ JW W tV T, V T u as follows: ‹ This work has not been previously published and it is an extension of the results published in [4]. $ ˇ ¨ ˛, ’ ’ ˇ ˇ rptq^rpt1 q^trJs“t1 rJs“urJs^urW s“trW s^ / & ˚ ‹/ ˇ ˚ urW s“t1 rW s^trV T s“urV T s^t1 rV T s“urV T s ‹ . τJr “ u ˇˇ Dt,t1 ˚ ˚ ‹ ‹ ’ ’ ˇ ˚ ˝ ^trV T săt1 rV T s^ ‹/ ‚/ % ˇ - @t2 pprpt2 q^trV T săt2 rV T sqÑt1 rV T sďt2 rV T sq Schema Rev is called the evolution schema of R. We will denote by τJR the view on R that is built by expression τJr for every instance r of R. View τJR joins two tuples t1 and t2 that agree on the values of the attributes in J (i.e. t1 rJs “ t2 rJs) and t1 rV T s ă t2 rV T s. Moreover, such tuples are joined if does not exist a tuple t P r with trJs “ t1 rJs and t1 rV T s ă trV T s ă t2 rV T s (i.e., there exists a tuple that holds at some point in between the valid times of such tuples). For application purposes, it is correct correct to consider in a evolution schema only those pair of consecutive tuples whose the difference between V T and V T respects some given bound. Given a parameter k P N Y t`8u, tuples of τJr are filtered by means of the selection ∆k pτJr q “ σtrV T s´trV T sďk pτJr q (notice that ∆`8 pτJr q “ τJr ). ∆k pτJr q forces to consider only those tuples belonging to τJr having a temporal distance within the given threshold k. In the following, given a tuple t P τJr , we denote its temporal distance trV T s ´ trV T s with ∆ptq. A Pure Temporally Evolving Functional Dependency [3] over the temporal schema R “ U Y tV T u, PEFD for short, is an expression of the form r∆k pτJR qsXY Ñ Z. We have that X Ď W and Y , Z Ď W with X ‰ H and |Z| “ 1 (Z contains a single attribute). We say that an instance r of R fulfills a PEFD r∆k pτJR qsXY Ñ Z, written r |ù r∆k pτJR qsXY Ñ Z, if and only if for each pair of tuples t, t1 P ∆k pτJr q we have trXs “ t1 rXs ^ trY s “ t1 rY s Ñ trZs “ t1 rZs. Now add approximation to PEFD in a very standard way [7, 2]. First, we provide measurement G to deal with PEFD as follows: Gpr∆k pτJR qsXY Ñ Z, rq “ |r| ´ maxt|s| : s Ď r, s |ù r∆k pτJR qsXY Ñ Zu. By means of G we can define the relative scaled measurement g for TM-FD as follows: Gpr∆k pτJR qsXY Ñ Z, rq gpr∆k pτJR qsXY Ñ Z, rq “ . |r| Now we are ready to define the Approximate Pure Temporally Evolving Func- tional Dependency (APEFD for short). Formally, given a real number 0 ď ď 1, we say that an instance r of R satisfies the APEFD r∆k pτJR qsX Y Ñ Z, written r |ù r∆k pτJR qsXY Ñ Z, if and only if gpr∆k pτJR qsXY Ñ Z, rq ď . In Section 2 we provide an algorithm for checking an APEFD against an instance r, we call this problem Check-APEFD: Problem 1. (Check-APEFD). Given a temporal schema R, a PEFD r∆k pτJR qs XY Ñ Z on R, an instance r of R, and a real number 0 ď ď 1 determine whether or not r |ù r∆k pτJR qsXY Ñ Z. # Name P hys CT Durpminutesq VT 1 M cM urphy Sayer self „15 1 Jan 2016 2 M cM urphy Sayer f amily „5 5 Jan 2016 3 M cM urphy M aguire f amily „15 10 Jan 2016 4 M cM urphy M aguire self „15 15 Jan 2016 5 M cM urphy M aguire self „5 20 Jan 2016 r“ 6 M cM urphy M aguire self „5 25 Jan 2016 7 Lowe Sayer f amily „15 7 Jan 2016 8 Lowe Sayer f amily „15 15 Jan 2016 9 Lowe M aguire self „15 22 Jan 2016 10 Lowe Sayer self „5 28 Jan 2016 11 Lowe M aguire self „15 3 Feb 2016 12 Lowe Sayer self „5 6 Feb 2016 Fig. 1. An instance r of schema Contact that stores the phone contacts about two psychiatric cases. Attribute # represents the tuple number and it is used only for referencing tuples in the text (i.e., # does not belong to the schema Contact). r τName # Name P hys CT Dur VT P hys CT Dur VT 1, 2 M cM urphy Sayer self „15 1 Jan 2016 Sayer f amily „5 5 Jan 2016 2, 3 M cM urphy Sayer f amily „5 5 Jan 2016 M aguire f amily „15 10 Jan 2016 3, 4 M cM urphy M aguire f amily „15 10 Jan 2016 M aguire self „15 15 Jan 2016 4, 5 M cM urphy M aguire self „15 15 Jan 2016 M aguire self „5 20 Jan 2016 5, 6 M cM urphy M aguire self „5 20 Jan 2016 M aguire self „5 25 Jan 2016 7, 8 Lowe Sayer f amily „15 7 Jan 2016 Sayer f amily „15 15 Jan 2016 8, 9 Lowe Sayer f amily „15 15 Jan 2016 M aguire self „15 22 Jan 2016 9, 10 Lowe M aguire self „15 22 Jan 2016 Sayer self „5 28 Jan 2016 10, 11 Lowe Sayer self „5 28 Jan 2016 M aguire self „15 3 Feb 2016 11, 12 Lowe M aguire self „15 3 Feb 2016 Sayer self „5 6 Feb 2016 r Fig. 2. The evolution expression τN ame . Now we consider a scenario, borrowed from the clinical domain, in order to provide examples of how PEFDs and APEFDs work. In particular, this scenario is taken from psychiatric case register. Let us consider the temporal schema Contact “ tN ame, P hys, CT, Duru Y tV T u. Such a schema stores values about a phone-call service provided to psychiatric patients. This service is intended for monitoring and helping psychiatric patients, who are not hospitalized. Whenever a patient feels the need to talk to a physician, he can call the service. Data about calls are collected according to schema Contact. For the sake of simplicity, temporal attribute V T identifies the day when the call has being received, not the exact time the call has begun which is usually its semantics in real world applications. In addition, the service may be used by people somehow related to patients, as, for instance, relatives afraid of current conditions of a patient. More precisely, attribute N ame identifies patients, P hys identifies physicians, CT (Contact Type) identifies the physical person who is doing the call (e.g., value ’self’ stand for the patient himself, ’family’ for a relative), and Dur stores information about total duration of calls (value „ n means approximately n r r minutes). An instance r of R is provided in Figure 1. Instance τN ame , and τJ in general, may be seen as the output of a two-phase procedure. First, table Contact is partitioned into subsets of tuples, one for each value of N ame. Then, each tuple is joined with its immediate successor in its partition, w.r.t. V T r values. The whole relation τN ame is provided in Figure 2. In the following we will use t for referencing tuples of r and u for referencing tuples of τJr . Moreover, in the following each tuple u in τJr will be identified by the pair of indexes of the tuples in r that generate u. For instance, the first tuple of τJr in Figure 2 will be denoted by u1,2 since it is generated by the join of tuples t1 and t2 in r. Going back to our example, it is worth noting that tuples t2 and t7 are r not joined in τN ame , even if t7 rV T s “ t2 rV T s ` 2 and there is no tuple t with trV T s “ t7 rV T s ` 1. This is due to the fact that t7 rNames ‰ t2 rNames r r forbids the join in τN ame . Moreover, t1 and t3 are not joined in τName . In- deed, the presence of tuple t2 with t1 rNames “ t2 rNames “ t3 rNames and r t1 rV T s ă t2 rV T s ă t3 rV T s forbids the join in τName . Figure 3 graphically depicts how pairs of tuples pt1 , t2 q, pt2 , t3 q, pt3 , t4 q, pt4 , t5 q, pt5 , t6 q and pt7 , t8 q, pt8 , t9 q, pt9 , r t10 q, pt10 , t11 q, pt11 , t12 q are joined in τName . In both scenarios depicted in Figure 3, nodes represent tuples and are labeled by the corresponding tuple number. Values for attribute Dur are reported above each node. Values of P hys and CT attributes are reported below every node, respectively. Every edge pti , tj q is labeled by value ∆pui,j q “ tj rV T s ´ ti rV T s (i.e., the temporal distance between r two tuples). Basically, each tuple u P τN ame corresponds to an edge in Figure 3 while we have a node for each tuple in r. Contact In our example, we have that r |ù r∆5 pτName qsP hys, P hys Ñ CT . How- Contact ever, r * r∆6 pτName qsP hys, P hys Ñ CT , because of pairs pt2 , t3 q and pt10 , t11 q. More precisely, we have that t2 rNames “ t3 rNames “ M cM urphy, t10 rNames “ t11 rNames “ Lowe, t2 rP hyss “ t10 rP hyss “ Sayer, t3 rP hyss “ t11 rP hyss “ M aguire, but t3 rCT s ‰ t11 rCT s (i.e., t3 rCT s “ f amily, and t11 rCT s “ self ). M cM urphy 10 „15 4 days „5 5 days „15 5 days „15 5 days „5 5 days „5 t1 t2 t3 t4 t5 t6 Sayer Sayer M aguire M aguire M aguire M aguire self f amily f amily self self self Lowe „15 „15 „15 „5 „15 „5 8 days 7 days 6 days 6 days 3 days t7 t8 t9 t10 t11 t12 Sayer Sayer M aguire Sayer M aguire Sayer f amily f amily self self self self r Fig. 3. A graph-based representation of τN ame . In other words, we have that the set of tuples tu2,3 , u10,11 u does not satisfy the FD P hysP hys Ñ CT . Let us observe that the two proposed PEFDs differ only for the maximum temporal distance allowed. In particular, tuple u10,11 is responsible for r |zù Contact Contact r∆6 pτName qsP hys, P hys Ñ CT and it does not belong to ∆5 pτName q because Contact of ∆pu10,11 q ą 5 then we have r |ù r∆5 pτName qsP hys, P hys Ñ CT . This allows us to point out a general property of PEFDs, such a property holds for APEFDs, too. Given a PEFD r∆k pτJR qsXY Ñ Z we have that for every instance r of R such that r |ù r∆k pτJR qsXY Ñ Z then for every h ď k we have r |ù r∆h pτJR qsXY Ñ Z. R In our example, if we consider APEFD r∆6 pτName qsP hys, P hys Ñ CT with 1 R “ 12 , we have that r |ù r∆6 pτName qs P hys, P hys Ñ CT . It is worth noting 1 that, by considering relation r “ rztt3 u, this dependency would hold without the need of approximation (i.e., if tuple t3 is deleted from relation r). More pre- r1 r cisely, we have τName “ τName ztu2,3 , u3,4 uYtu2,4 u. Notice that tuple u2,4 was not r originally in τName because of tuple t3 . Figure 3 depicts this new scenario, by re- placing edges pt2 , t3 q and pt3 , t4 q with the dashed edge pt2 , t4 q. Moreover, we have R R that r1 |ù r∆`8 pτName qsP hys, P hys Ñ CT . Thus, r |ù r∆`8 pτName qsP hys, 1 P hys Ñ CT with “ 12 . 2 An algorithm for checking APEFDs As we proved in [4], the problem Check-APEFD given an APEFD rτJR , swpkqsXY Ñ Z and an instance r of R is NP-Complete in |r|. This is not an uncommon situation in the realm of temporal functional dependencies (see [8, 1, 9, 7] for further examples). Then, in principle, there is no asimptoptycally better algorithm than explore 1 the whole set of of possible subsets r1 of r with |r|´|r |r| | ď . However, in the following, we provide an algorithm that make use of heuristics for pruning the search space in order to achieve the tractability for many cases. Such algorithm is general and it may be applied under no assumptions on the input instance r, it makes use of two optimization techniques. The first op- timization technique consists of trying, whenever is possible, to split the current subset of r in two subsets on which the problem may be solved independently (i.e., choices in one subset do not affect choices in the other one and viceversa). The latter optimization technique consists of checking if the current partial so- lution may not lead to an optimal solution (i.e., a solution r1 where |r1 | is the maximum possible number of tuples that may be kept) if this happen the subtree is pruned immediately (i.e., we are looking only for optimal solutions). Our algorithm relies on the concept of color that we will explain through an example in the following. Given APEFD rτJR , swpkqsXY Ñ Z and an instance r of R, let us suppose that we are solving problem Check-APEFD on such instance with a simple guess-and-check procedure that make use of two, initially empty, subset r` (the tuples to be kept in the solution) and r´ (the tuples to be deleted in the solution) of r. At each step the procedure guess a tuple t in rzpr` Yr´ q and decides non-deterministically (guessimg phase) either to update r` to r` Y ttu (i.e., t is kept in the current partial solution) to update r´ to r´ Y ttu (i.e, t is deleted in the current partial solution). When r “ r` Y r´ (check phase), the procedure returns Y ES if r` |ù rτJR , swpkqsXY Ñ Z and |r´ | ď ¨ |r|, otherwise it returns N O. From now on, for us a partial solution is a triple pr, r` , r´ q such that pr` Y r´ q Ď r and r` X r´ “ H, if r “ r` Y r´ we simply say that pr` Y r´ q is a solution. A solution is consistent pr, r` , r´ q if and only if r` |ù rτJR , swpkqsXY Ñ Z. Given two partial solution pr, r` ´ ` ´ 1 , r1 q and pr, r2 , r2 q we say that pr, r2 , r2 q extends pr, r1 , r1 q if and only if r1 Ď r2 and r1 Ď r´ ` ´ ` ´ ` ` ´ 2. Is there a way to check if we are generating an inconsistent solution possibly without guessing all the tuples in r? Violations of the latter constraint (i.e., |r´ | ď ¨ |r|) are fairly simple to detect during the guessing phase, it suffices to check after each insertion in r´ if r´ exceeds ¨ |r|, if it is the case the procedure may return N O immediately without guessing any further. Violations of the first constraint (i.e., r` |ù rτJR , swpkqsXY Ñ Z) during the guessing phase are trickier to detect, then we need the following additional definitions. From now on when two tuples t, t1 share the same value for the attribute J (i.e., trJs “ t1 rJs) we will say that they are in the same J-group and we will say that trJs is the value of the J-group containing t and t1 . For the sake of brevity, for a given j P DompJq we will use j-group for denoting the J-group with value j. An ordered pair, written o-pair, is a pair pt, t1 q P r ˆ r such that t and t1 are in the same J-group and trV T s ă t1 rV T s. Given an o-pair t, t1 P r we say that the pair pt, t1 q is an edge if and only if 0 ă t1 rV T s ´ trV T s ď k. Given triple pr, r` , r´ q, an o-pair pt, t1 q P r is active if and only if t, t1 P r` and for every tuple t in the same J-group of t if trV T s ă t ă t1 rV T s we have t P r´ (i.e., t and t1 are selected in the current partial solution and all the tuples between t and t1 in the same J-group are deleted). Given two valid times vt, vt1 P DompV T q and a value j P DompJq we say that vt and vt1 are consecutive in j-group if and only if there exists an active o-pair pt, t1 q with trV T s “ vt, t1 rV T s “ vt1 , and trJs “ j. Notice that we may have two distinct values j, j 1 P DompJq and two distinct valid times vt, vt1 P DompV T q which are consecutive in j-group and not consecutive in j 1 -group. Moreover, we may have edges pt, t1 q that are not active and active pairs pt, t1 q that are not edges, such is the case of active pairs pt, t1 q with t1 rV T s´trV T s ą k. A color is a tuple c on the schema C “ XY Z, two colors px, y, zq and px1 , y 1 , z 1 q are conflicting if and only if x “ x1 , y “ y 1 and z ‰ z 1 . Given an o-pair pt, t1 q, denoted by cpt, t1 q is the tuple cpt, t1 q “ ptrXs, trY s, t1 rZsq. Two o-pairs pt, t1 q, pt2 , t3 q are conflicting if and only if cpt, t1 q and cpt2 , t3 q are conflicting. The following result it is easy to prove and its proof is left as exercise for the reader. Theorem 1. Given an APEFD rτJR , swpkqsXY Ñ Z, an instance r of R, and a partial solution pr, r` , r´ q, if there exists two active edges pt, t1 q and pt2 , t3 q then every solution pr, r` f , r´ f q that follows pr, r` , r´ q satisfy r` f z|ù rτJR , swpkqsXY Ñ Z (i.e., is inconsistent). The above theorem guarantees that from a partial solution pr, r` , r´ q that feature at least two conflicting edges we cannot reach a solution pr, r` f , r´ f q that satisfy the first constraint and in such a case we may return immedi- ately N O without going any further from pr, r` , r´ q. The colours of a par- tial solution pr, r` , r´ q is the set colourspr, r` , r´ q “ tptrXs, t1 rY s, t1 rZsq : pt, t1 q is an active edge in pr, r` , r´ qu. It is easy to see that the hypothesis of Theorem 1 applies if and only if colourspr, r` , r´ q contains at least two conflict- ing colours. Then, by means of colours, our above guess-and-check procedure may be improved by adding the control on the size of r´ and by keeping up- dated the current set of colours colourspr, r` , r´ q. Once an insertion of a tuple in either r` or r´ introduce a color c that is conflicting with at least one colour in pr, r` , r´ q the procedure answers N O immediately. An example of how the procedure works is given in Figure 4 where we have an instance of 5 tuples with “ 0.2 (i.e., we may delete at most one tuple) and swpkq “ 6 (all the tuples are in the same window). The execution depicted in Figure 4 guesses the values of the tuples from the oldest (t1 ) to the newest one (t2 ) according to the value of V T . First it tries to put the current tuple t in r` if no violation arises it continues, if some violation arises it tries to insert the tuple t in r´ if no violation arises it continues otherwise it goes back to the previous choice (i.e., backtracking). Every internal node is labelled with the current tuple which will be guessed next, every leaf is labelled either with Y ES (i.e., the current branch is a solution) or N O (i.e, a violation has arisen), the current set of colours is reported within the node. Nodes are numbered according to their order of appearance. We have that the root is n1 followed by the introduction of the nodes n1 . . . n4 in this precise order. If we introduce t4 in the partial solution associated to n4 we violate the first constraint. Since in n4 adding t4 in r´ does not generate any violation, node n5 is created as child of n4 . However, node px1 ,y1 ,z2 q px1 ,y1 ,z2 q px2 ,y1 ,z2 q px1 ,y1 ,z2 q px2 ,y1 ,z2 q px1 ,y1 ,z2 q x1 |y1 |z1 px1 ,y1 ,z1 q x2 |y1 |z1 px2 ,y1 ,z2 q x1 |y1 |z2 px1 ,y1 ,z2 q x2 |y1 |z2 px2 ,y1 ,z2 q x1 |y1 |z2 n1 t1 ,tu t1 Pr` n2 t2 ,tu # J X Y Z VT t2 Pr` 1 j1 x1 y1 z1 1 n3 t3 ,tpx1 ,y1 ,z1 qu 2 j1 x2 y1 z1 2 t3 Pr` t3 Pr´ 3 j1 x1 y1 z2 3 " * n4 t4 , px1 ,y1 ,z1 q, n6 t4 ,tpx1 ,y1 ,z1 qu 4 j1 x2 y1 z2 4 px2 ,y1 ,z2 q 5 j1 x1 y1 z2 5 t4 Pr` t4 Pr´ t4 Pr` NO n5 conflict between " px1 ,y1 ,z1 q, * " * px1 ,y1 ,z1 q and px1 ,y1 ,z2 q t5 , n7 t5 , px1 ,y1 ,z1 q, px2 ,y1 ,z2 q px2 ,y1 ,z2 q t5 Pr` t5 Pr´ t5 Pr` NO NO Y ES conflict between |r´ |ą1 r“r` Yr´ , |r´ | ď 1, " px1 ,y1 ,z1 q and px1 ,y1 ,z2 q * px1 ,y1 ,z1 q, colourspr, r` , r´ q “ px2 ,y1 ,z2 q Fig. 4. An example of how the use of colors improve a guess and check procedure for solving the problem Check-APEFD. n5 cannot be extended without introducing a violation in the above constraints because if put t5 in r` we introduce a conflicting color, while if we put t5 in r´ we exceed the maximum number of allowed deletions. We backtrack to n4 and we have that even for it all the possible choices have been explored and thus we backtrack to n3 where the choice of adding t3 to r´ is attempted generating the node n6 . From n6 we put t5 in r` without violating any constraint and thus we 0.2 have that tt1 , t2 , t4 , t5 u |ù rτJR , 6sXY Ñ Z. Let us consider deeper in the description of the first algorithm. On an higher lever the algorithm operates like the procedure above apart from some trivial technicalities. However, two more heuristics are introduced in order to possibly stop early during the exploration of a branch in the tree of computation. The main procedure of the algorithm is reported in Figure 7 with auxiliary procedures reported in Figure 5 and in Figure 6. The algorithm is implemented by the function T upleW iseM in that takes 4 arguments. The first argument is Gr which is a pre-processing of r based on the APEFD rτJR , swpkqsXY Ñ Z that has to be checked. More precisely, Gr is an instance of the schema J, X, Y, Z, V T, count, with Dompcountq “ N. We have that t P Gr if and only if there exists t1 P r for which pt1 rJs, t1 rXs, t1 rY s, t1 rZsq “ ptrJs, trXs, trY s, trZsq and trcounts “ |tt1 P r : pt1 rJs, t1 rXs, t1 rY s, t1 rZsq “ ptrJs, trXs, trY s, trZsqu|, that is, we count how procedure InBetweenpGr , t1 , t2 q returns the subset of Gr consisting of all the tuples in the same J-group comment: of t1 and t2 that feature valid times between t1 rV T s and t2 rV T s. return tt P Gr : t1 rV T s ă trV T s ă t2 rV T s ^ trJs “ t1 rJsu procedure EdgeConflict?ppt1 , t2 q, pt11 , t12 qq comment: returns true if and only if the two input edges feature conflicting colors. if t1 rXs “ t11 rXs ^ t2 rY s “ t12 rY s ^ t2 rZs ‰ t12 rZs then return true else return false procedure ColorConflict?ppt1 , t2 q, Cq returns true if and only if the color of the edge pt1 , t2 q is conflicting comment: with at least one color in the set of colors C. if Dcpc P C ^ t1 rXs “ crXs ^ t2 rY s “ crY s ^ t2 rZs ‰ crZsq then return true else return false procedure E!pGr , k, G` ´ r , Gr q returns the set of all and only active edges in the current partial comment: solution. " * t1 rJs “ t2 rJs ^ 0 ă t2 rV T s ´ t1 rV T s ď k^ return pt1 , t2 q : tt1 , t2 u Ď G` ´ r ^ InBetweenpGr , t1 , t2 q Ď Gr q procedure E?pGr , k, G` ´ r , Gr , Cq returns the set of all and only pending edges in the current partial comment: solution. $ , ’ ’ t1 rJs “ t2 rJs ^ 0 ă t2 rV T s ´ t1 rV T s ď k ^ tt1 , t2 u X G´ r “ H^ / / ^ ColorConf lict?ppt1 , t2 q, Cq^ & . return pt1 , t2 q : ` ’ ’ ^InBetweenpGr , t1 , t2 q X Gr “ H^ / / ^ptt1 , t2 u Y InBetweenpGr , t1 , t2 qq Ę pG` ´ r Y Gr q % - procedure Reach?pt1 , t2 , N odes, Edgesq returns true if and only if there exists a path from t1 to t2 in the graph comment: pN odes, Edgesq. It is a function that checks wether or not there exits a path between two nodes in a graph. if Dptt1 . . . tm u Ď N odesqpt1 “ t1 ^ tm “ t2 ^ @p1 ď i ă mqppti , ti`1 q P Edgesqq then return true else return false Fig. 5. Auxiliary procedures used by procedures presented in Figures 6 and 7. procedure GroupIndependent?pj, Gr , k, G` ´ r , Gr , Cq returns true if and only if in the current partial solution pending edges comment: involving tuples belonging to the J-group with value j are not conflicting with the pending edges introduced by the others J-groups. Ej Ð tpt1 , t2 q P E?pGr , k, G` ´ r , Gr , Cq : t1 rJs “ ju ` ´ E j Ð E?pGr , k, Gr , Gr , CqzEj if Dt1 Dt2 Dt1 Dt2 ppt1 , t2 q P Ej ^ pt, t2 q P E j ^ EdgeConf lict?ppt1 , t2 q, pt1 , t2 qqq then return false else return true procedure MacConsistentSubsetpEq returns the maximal subset E 1 of E such that d for every edge comment: pt11 , t12 q P E 1 and for every edge pt1 , t2 q P E we have that cpt11 , t12 q and cpt1 , t2 q are not conflicting. E1 Ð H " if @t1 @t2 ppt1 , t2 q P E Ñ EdgeConf lict?ppt1 , t2 q, pt1 , t2 qqq for each pt1 , t2 q P E do then E 1 Ð E 1 Y tpt1 , t2 qu return E 1 procedure ReplacePath?pt, t, Gr , k, G` ´ r , Gr , Cq returns true if and only if in the current partial solution the consecutive comment: valid times trV T s and trV T s in trJs-group may be safely replaced. if Dt1 pt1 P Gr ^ t1 rJs “ trJs ^ trV T s ă t1 rV T s ă trV T sq then return false Ns Ð tt1 P G` r : t1 rJs “ trJs ^ t1 rV T s “ trV T su Ne Ð tt1 P G` r : t1 rJs “ trJs ^ t1 rV T s “ trV T su Nm Ð tt1 P G´ r : t1 rJs “ trJs ^ trV T s ă t1 rV T s ă trV T su N Ð Ns Y Ne Y Nm Ẽ Ð tpt1 , t2 q : t1 P N ^ t2 P N ^ t1 rV T s ă t2 rV T s ^ pt1 R Ns _ t2 R Ne qu P airs Ð tpt1 , t2 q : t1 P N ^ t2 P N ^ t1 rV T s ` k ă t2 rV T s ^ pt1 R Ns _ t2 R Ne qu Ẽok Ð pM axConsistentSubsetpE?pGr , k, G` ´ r , Gr , Cqq X Ẽq Y P airs N otSaf e Ð H for each $ pt1 , t2 q P ẼzẼok ` ´ ` ´ &for each " pt1 , t2 q P pE?pGr , k, Gr , Gr , Cq Y E!pGr , k, Gr , Gr q Y Ẽq do if EdgeConf lict?ppt1 , t2 q, pt1 , t2 qq % do then N otSaf e Ð N otSaf e Y tpt1 , t2 qu Ẽ Ð ẼzN otSafˆe ˙ t1 , t2 P Nm ^t1 P Ns ^t2 P Ne^ if Dt1 Dt2 @t1 @t2 ^pt1 , t1 q, pt2 , t2 q P Ẽ ^Reach?pt1 , t2 , N, Ẽq then return true return false Fig. 6. Auxiliary procedures used by procedure T upleW iseM in (Figure 7). Algorithm 2.1: TupleWiseMin(Gr , k, G` ´ r “ H, Gr “ H, C “ H) procedure MaximalPaths?pGr , k, G` ´ r , Gr , Cq returns true if and only if in the current partial solution for every comment: J-group j-group every pair of consecutive valid times vt and vt1 in j-group cannot be safely replaced. if Dt1 Dt2 ppt1 , t2 q P E!pGr , k, G` ´ ` ´ r , Gr q ^ ReplaceP ath?pt1 , t2 , k, Gr , Gr Gr , Cqq then return false else return true procedure IsConsistent?pGr , k, G` ´ r , Gr , Cq returns true if and only if the current partial solution features comment: consistent colors in C and for every J-group j-group every pair of consecutive valid times vt and vt1 in j-group cannot be safely replaced. if Dt1 Dt2 ppt1 , t2 q P E!pGr , k, G` ´ r , Gr q ^ ColorConf lict?ppt1 , t2 q, Cqq then return false if M aximalP aths?pGr , k, G` ´ r , Gr , Cq then return true else return false main returns the minimum value m “ min ||Gr zG1r || comment: XY Zcount for G1r in tG2r Ď Gr : G1r |ù rτGr sXY Ñ Zu. if Gr “ H then return ||G´ r || if Djpj P$DompJq ^ GroupIndependent?pj, Gr , k, G` ´ r , Gr , Cqq &Gr Ð ttˆP Gr : trJs “ ju ˙ then T upleW iseCheckpGr , k, G` ´ r X Gr , Gr X Gr , Cq` %return ` ´ T upleW iseCheckpGr zGr , k, Gr zGr , Gr zGr , Cq let t P Gr ` ´ if IsConsistent?pG " 1 r zttu, k, Gr Y ttu, Gr , Cq C Ð C Y tpt1 rXs, t2 rY s, t2 rZsq : pt1 , t2 q P E!pGr zttu, k, G` ´ r Y ttu, Gr qu then ` ´ 1 mt Ð T upleW iseCheckpGr zttu, k, Gr Y ttu, Gr , C q else mt Ð `8 ` ´ if IsConsistent?pG " 1 r zttu, k, Gr , Gr Y ttu, Cq C Ð C Y tpt1 rXs, t2 rY s, t2 rZsq : pt1 , t2 q P E!pGr zttu, k, G` ´ r , Gr Y ttuqu then ` ´ 1 mzt Ð T upleW iseCheckpGr zttu, k, Gr , Gr Y ttu, C q else mzt Ð `8 return minpmt , mzt q Fig. 7. The main procedure for checking APEFDs in a tuple-wise fashion. Notice that we use a compact notation for the recursive procedure which is initially called as T upleW iseM inpGr , kq where when G` ´ r , Gr , and C are ommitted in the procedure call they get their respective default values specified in the procedure declaration (i.e., H for each of them in this case). many tuples in r share the same values for attributes J, X, Y, and Z. The input parameter k is the length for the sliding window grouping swpkq. The sets G` r and G´ r , originally initialized to H, represent the tuples of Gr that are either kept or deleted in the current solution respectively. On instances r of schema R satisfying count P R we denote ř with ||r|| the sum on the count attribute for the tuples in r (i.e., ||r|| “ tPr trcounts). Finally, C is a set of colors which is initially set to H. A color c is a tuple on the schema X, Y, Z. As we will see, C keeps track via colours of the constraints introduced so far in the construction of the solution. The procedure T upleW iseM in returns the minimum number of tuples that has to be deleted from r in order to obtain an instance r1 such that r1 |ù rτJR , swpkqsXY Ñ Z. Then if such minimum is less or equal than ¨ |r| we can conclude r |ù rτJR , swpkqsXY Ñ Z else we have r |ù z rτJR , swpkqsXY Ñ Z. Given Gr , G` ´ 1 r , Gr , and a set of colours C we say that an edge pt, t q P Gr ˆ Gr is pending if and only if the following conditions hold: 1. t, t1 R G´ 1 r and ptrXs, t rY s, zq R C for every z P DompZq; 2. for every t with t rJs “ trJs and trV T s ă t2 rV T s ă t1 rV T s we have t2 R G` 2 2 r ; 3. there exists t2 P pGr Y tt, t1 uqzpG´ r Y G` r q with t2 rJs “ trJs and trV T s ď t2 rV T s ď t1 rV T s. Informally speaking, a pending edge is an edge that is not active in the current partial solution but it may become active during the computation and, if it happens, it introduces a new color in C. In our algorithm, pending edges for the current partial solution are retrieved by the procedure E? while active edges are retrieved by the procedure E!. The procedure T upleW iseM in (Figure 7) works as follows. If G` ´ r YGr “ Gr it means that we have obtained a solution without violating any constraint and thus we can return ||G´ ` ´ r || (i.e., the number of deleted tuples). If Gr Y Gr ‰ Gr ` ´ the algorithm guesses a tuple t P Gr zpGr Y Gr q and proceed as follows. First, it checks if inserting t into G` r does not cause any violation in the constraints, if so it stores in mt the value of the recursive call to T upleW iseM in where t belongs to G` r and C has been updated accordingly. Notice that by inserting a tuple t in G` r the algorithm is asserting that t belongs to the current partial solution, while by inserting t in G´ r the algorithm is asserting that t does not belong to the current partial solution. If a constraint is violated the algorithm stores in mt the value `8, which means that t may not be kept in the current solution. Then, it checks if inserting t into G´ r does not cause any violation in the constraints, if so it stores in mzt the value of the recursive call to T upleW iseM in where G´ r and C are updated accordingly. If a constraint is violated the algorithm stores in mzt the value `8, which means that t must be kept in the current partial solution. In procedure T upleW iseM in the only way in which a constraint may be violated is that, after the insertion a tuple t in G` ´ 1 2 r (resp. Gr ), an edge pt , t q 1 2 2 turns out to be active and its color pt rXs, t rY s, t rZsq turns out to be conflicting with at least one color in C. The first one allows us to restrict the search space by splitting the problem into independent sub-problems in a divide-and-conquer fashion. Let us suppose that at a certain step of our computation there exists a value j P ttrJs : t P Gr u 1 for which for each pair of conflicting pending edges pt, t1 q and pt, t q we have that 1 1 either all t, t1 , t, and t belong to j-group or all t, t1 , t, and t do not belong to j- group (such condition is verified by sub-procedure GroupIndependent? reported in Figure 6.). Let Gr “ tt : trJs “ ju, if every edge involving tuples in j-group is not conflicting with every edge that may be introduced outside j-group then we can split the problem into the two sub-problems pGr , k, G` ´ r X Gr , Gr X Gr q and pGr zGr , k, G` r zGr , G´ r zGr q. Such problems are independent and may be resolved in a separate way. The resulting value for the solution is the sum of the values returned by T upleW iseM in applied to both the two sub-problems. Let H “ |G´ ` ´ ´ ` ´ r zpGr Y Gr q| and h “ |tt P pGr zpGr Y Gr qq : trJs “ ju|, notice that, in this case, the upper bound of the complexity at the current step of computation drops from Op2H q to Op2H´h ` 2h q. The second optimization allows us to prune a sub-tree of computation even before a contradiction arises. The second optimization implements a criteria to detect, in many cases, if every possible solution that may be built starting from the current partial one turns to be not minimal. Suppose that there exists an active o-pair pt, t1 q in a partial solution pGr , G` ´ r , Gr q such that there exists t P Gr in the same J-group of t with trV T s ă trV T s ă t1 rV T s. By definition of active 1 o-pair, we have that t belongs to G´ r as well as every tuple t in the same J- 1 group of t with trV T s ă t rV T s ă t1 rV T s, here the additional condition is that there exists at least one of such tuples. Given a partial solution pGr , G` ´ r , Gr q ` ´ ` ´ we denote with colorspGr , Gr , Gr q the set colorspGr , Gr , Gr q “ tpx, y, zq : there exists an active edge pt, t1 q with cpt, t1 q “ px, y, zqu. Let us define the set of colors pendingpGr , G` ´ r , Gr q “ tpx, y, zq : there exists a pending edge pt, t q 1 1 with cpt, t q “ px, y, zqu which collects all and only the colors that may be introduced later on in the current computation. A color px, y, zq is safe in pvt, vt1 , jq if and only if one of the following three conditions hold: 1. px, y, zq P colorspGr , G` ´ r , Gr q; 2. every color px, y, z q in pendingpGr , G` 1 ´ 1 r , Gr q satisfies z “ z (i.e., px, y, zq is a pending color and there is no pending color that is conflicting with px, y, zq); 3. the color is not conflicting with any color in colorspGr , G` ´ r , Gr q Y pendingp 1 2 Gr , G` ´ ` ´ r , Gr q and do not exist two tuples t, t P pGr Y Gr q X tt P Gr : 2 2 1 t rJs “ j ^ trV T s ď t ď t1 rV T su such that pt, t q is an edge and the color 1 1 ptrXs, t rY s, t rZsq is conflicting with px, y, zq Notice that the above three conditions imply that if a color is safe in pvt, vt1 , jq then it is neither in conflict with a color colorspGr , G` ´ r , Gr q nor with a color in ` ´ pendingpGr , Gr , Gr q, however this is just a necessary but not sufficient condi- tion. Given a partial solution pGr , G` ´ 1 1 r , Gr q and the triple pvt, vt , jq a pvt, vt , jq- ` replace DAG is a DAG pV, Eq where V “ tt P Gr : trV T s “ vt ^ trJs “ ju Y tt P G´ 1 ` 1 r : vt ă trV T s ă vt ^ trJs “ ju Y tt P Gr : trV T s “ vt ^ trJs “ ju and " * ptrV T s ‰ vt _ t1 rV T s ‰ vt1 q ^ trV T s ă t1 rV T s^ pt, t1 q P V ˆ V : cpt, t1 q is safe in pvt, vt1 , jq E“ Y tpt, t1 q P V ˆ V : ptrV T s ‰ vt _ t1 rV T s ‰ vt1 q ^ t1 rV T s ´ trV T s ą ku A node t P V is a starting node (resp. ending node) if and only if vt ă trV T s ă vt1 and for every t1 P V with t1 rV T s “ vt (resp. t1 rV T s “ vt1 ) we have pt1 , tq P E (resp. pt, t1 q P E). A replace path is pvt, vt1 , jq-replace DAG pV, Eq is any path t1 . . . tm in pV, Eq for whicht t1 is a starting node and tm is an ending node. We say that vt and vt1 in j can be safely replaced if and only if there exists a replace path in the pvt, vt1 , jq-replace DAG pV, Eq. Using the above definitions of replace DAGs/paths we can provide the fol- lowing result. Theorem 2. Given a partial solution pGr , G` ´ r , Gr q if there exists a group j with two consecutive valid times vt and vt such that vt and vt1 can be safely replaced 1 in j then every consistent solution that follows pGr , G` ´ r , Gr q is not optimal. The proof of the theorem is very simple and left as exercise to the reader. Let us suppose that t1 . . . tm is a replace path in the pvt, vt1 , jq-replace DAG, it must exists by hypothesis and by definition we have t1 , . . . , tm P G´ r , it suf- ` ´ fices to take any consistent solution pGr , Gr , Gr q that follows pGr , G` ´ r , Gr q ` ´ and that pGr , Gr Y tt1 , . . . , tm u, Gr ztt1 , . . . , tm uq is still a consistent solution, non-optimality immediately follows. We take advantage of Theorem 2 by prun- ing every computation rooted in a partial solution pGr , G` ´ r , Gr q that features 1 a J-group j-group and two consecutive valid times vt and vt in j-group such that vt and vt1 can be safely replaced in j. Notice that verify wether or not such condition applies may be performed in polynomial time. In the procedure T upleW iseM in, this optimization is realized by the joint work of sub-procedures M aximalP aths? and ReplaceP ath? reported in Figure 7 and 6 respectively. 3 Conclusion In this work we propose an algorithm for solving the problem of verifying whether or not a given APEFD EF holds over a given instance r of a temporal schema. Such problem is known to be NP-Complete in the size of r (i.e., data-complexity). We expect our algrithm to perform considerably better w.r.t. standard branch and bound procedures since it considerably prune the search space in most cases. We are building a prototype in order to measure the effective improvement w.r.t. frameworks that solve generic NP-Complete problems (i.e., Integer Linear Pro- grammming tools, SAT solvers, and logic programming), obviously this implies that we have to encode our problem in such formalisms. References 1. Combi, C., Mantovani, M., Sabaini, A., Sala, P., Amaddeo, F., Moretti, U., Pozzi, G.: Mining approximate temporal functional dependencies with pure tem- poral grouping in clinical databases. Comp. in Bio. and Med. 62, 306–324 (2015). https://doi.org/10.1016/j.compbiomed.2014.08.004, https://doi.org/10.1016/j. compbiomed.2014.08.004 2. Combi, C., Mantovani, M., Sala, P.: Discovering quantitative temporal functional dependencies on clinical data. In: 2017 IEEE International Conference on Health- care Informatics, ICHI 2017, Park City, UT, USA, August 23-26, 2017. pp. 248–257. IEEE (2017). https://doi.org/10.1109/ICHI.2017.80, https://doi.org/10.1109/ ICHI.2017.80 3. Combi, C., Montanari, A., Sala, P.: A uniform framework for temporal functional dependencies with multiple granularities. In: Pfoser, D., Tao, Y., Mouratidis, K., Nascimento, M.A., Mokbel, M.F., Shekhar, S., Huang, Y. (eds.) Advances in Spatial and Temporal Databases - 12th International Symposium, SSTD 2011, Minneapolis, MN, USA, August 24-26, 2011, Proceedings. Lecture Notes in Computer Science, vol. 6849, pp. 404–421. Springer (2011). https://doi.org/10.1007/978-3-642-22922- 0 24, https://doi.org/10.1007/978-3-642-22922-0$\_$24 4. Combi, C., Rizzi, R., Sala, P.: The price of evolution in temporal databases. In: Grandi, F., Lange, M., Lomuscio, A. (eds.) 22nd International Sympo- sium on Temporal Representation and Reasoning, TIME 2015, Kassel, Ger- many, September 23-25, 2015. pp. 47–58. IEEE Computer Society (2015). https://doi.org/10.1109/TIME.2015.24, https://doi.org/10.1109/TIME.2015.24 5. Combi, C., Sala, P.: Temporal functional dependencies based on interval relations. In: Combi, C., Leucker, M., Wolter, F. (eds.) Eighteenth In- ternational Symposium on Temporal Representation and Reasoning, TIME 2011, Lübeck , Germany, September 12-14, 2011. pp. 23–30. IEEE (2011). https://doi.org/10.1109/TIME.2011.15, https://doi.org/10.1109/TIME.2011.15 6. Combi, C., Sala, P.: Interval-based temporal functional dependencies: spec- ification and verification. Ann. Math. Artif. Intell. 71(1-3), 85–130 (2014). https://doi.org/10.1007/s10472-013-9387-1, https://doi.org/10.1007/ s10472-013-9387-1 7. Combi, C., Sala, P.: Mining approximate interval-based temporal dependen- cies. Acta Inf. 53(6-8), 547–585 (2016). https://doi.org/10.1007/s00236-015-0246-x, https://doi.org/10.1007/s00236-015-0246-x 8. Sala, P.: Approximate interval-based temporal dependencies: The complex- ity landscape. In: Cesta, A., Combi, C., Laroussinie, F. (eds.) 21st Interna- tional Symposium on Temporal Representation and Reasoning, TIME 2014, Verona, Italy, September 8-10, 2014. pp. 69–78. IEEE Computer Society (2014). https://doi.org/10.1109/TIME.2014.20, https://doi.org/10.1109/TIME.2014.20 9. Sala, P., Combi, C., Cuccato, M., Galvani, A., Sabaini, A.: A framework for mining evolution rules and its application to the clinical domain. In: Balakrishnan, P., Sri- vatsava, J., Fu, W., Harabagiu, S.M., Wang, F. (eds.) 2015 International Conference on Healthcare Informatics, ICHI 2015, Dallas, TX, USA, October 21-23, 2015. pp. 293–302. IEEE Computer Society (2015). https://doi.org/10.1109/ICHI.2015.42, https://doi.org/10.1109/ICHI.2015.42