=Paper= {{Paper |id=Vol-2369/short11 |storemode=property |title=Bring Order to Data |pdfUrl=https://ceur-ws.org/Vol-2369/short11.pdf |volume=Vol-2369 |authors=Heidar Davoudi,Parke Godfrey,Lukasz Golab,Mehdi Kargar,Divesh Srivastava,Jaroslaw Szlichta |dblpUrl=https://dblp.org/rec/conf/amw/DavoudiGGKSS19 }} ==Bring Order to Data== https://ceur-ws.org/Vol-2369/short11.pdf
                            Bringing Order to Data

           Heidar Davoudi1 , Parke Godfrey2 , Lukasz Golab1 , Mehdi Kargar3 ,
                      Divesh Srivastava4 , and Jaroslaw Szlichta5
                   1
                 University of Waterloo, Canada, 2 York University, Canada,
               3
              Ted Rogers School of Management, Ryerson University, Canada,
     4
       AT&T Labs-Research, USA, 5 University of Ontario Institute of Technology, Canada,
    hdavoudi@uwaterloo.ca, godfrey@yorku.ca, lgolab@uwaterloo.ca,
     kargar@ryerson.ca, divesh@research.att.com,szlichta@uoit.ca

        Abstract. Integrity constraints (ICs) are widely used in business intelligence to
        express and enforce application semantics. However, finding ICs manually is time
        consuming, requires the involvement of domain experts, and is prone to human
        error. Thus, techniques have been proposed to automatically find a variety of ICs.
        We propose an algorithm to automatically discover order dependencies (ODs).
        Prior work on OD discovery has factorial complexity, is not complete, and is not
        concise. We propose an algorithm that finds a complete set of ODs with exponen-
        tial worst-case time complexity in the number of attributes and linear complexity
        in the number of tuples. We experimentally show that our algorithm is orders of
        magnitude faster than the prior state-of-the-art.

        Keywords: Integrity constraints · Order Dependency · Axiomatization.

1     Introduction
Ordered attributes, such as timestamps and numbers, are common in business data. Or-
der Dependencies (ODs) capture monotonic relationships among such attributes. For in-
stance, Table 1 shows employee tax records in which tax is calculated as a percentage
(perc) of salary (sal). The OD sal orders perc holds: if we sort the table by salary,
it is also sorted by percentage. Similarly, sal orders grp, subg: if we sort the table
by salary, it is also sorted by group with ties broken by subgroup. With interest in data
analytics at an all-time high, ODs can improve the consistency dimension of data qual-
ity and query optimization [4, 7–9]. ODs can describe business rules (data profiling);
and their violations can point out possible data errors. Furthermore, query optimizers
can use ODs to eliminate costly operators such as joins and sorts: ordered streams be-
tween query operators can exploit available indexes, enable pipelining, and eliminate
intermediate partitioning steps. Finally, ODs subsume Function Dependencies (FDs) as
any FD can be mapped to an equivalent OD by prefixing the left-hand-side attributes
onto the right-hand-side [6].
     It is time consuming to specify integrity constraints manually, motivating the need
for their automatic discovery. However, since ODs are naturally expressed with lists
rather than sets of attributes (as in the example above), existing solutions have factorial
worst-case time complexity in the number of attributes [4]. We describe a more effi-
cient algorithm to discover ODs from data. First, we show that ODs can be expressed
with sets of attributes via a polynomial mapping into an equivalent set-based canoni-
cal form. Then, we introduce sound and complete axioms for set-based canonical ODs,
2           H. Davoudi, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, J. Szlichta,
#     ID    yr   pos   bin   sal   perc   tax    grp   subg                            {}
t1    1     16   sec   1     5K    20%    1K     A     III
t2    2     16   dev   2     8K    25%    2K     C     II                 {A}          {B}          {C}
t3    3     16   dir   3     10K   30%    3K     D     I
t4    1     15   sec   1     4K    20%    0.8K   A     III             {A,B}         {A,C}          {B,C}
t5    2     15   dev   2     6K    25%    1.5K   C     I
t6    3     15   dir   3     8K    25%    2K     C     II
                                                                                     {A,B,C}


            Table 1: Employee salary data                     Fig. 1: A set lattice for attributes A, B, C.

which lead to optimizations of the OD discovery algorithm by avoiding redundant com-
putation. This allows us to design a fast and effective OD discovery algorithm that has
exponential worst-case complexity, O(2|R| ), in the number of attributes |R|, and linear
complexity in the number of tuples. We note that this short paper is a summary of our
published results in [5, 6, 10].
2     Order Dependency Discovery
2.1       Preliminaries
Order dependencies (ODs) describe relationships among lexicographical orderings of
sets of tuples, as in the SQL order by statement. Let X = [A | T] be a list of attributes,
where the attribute A is the head of the list, and the list T is the tail. For two tuples s and
t, we write s X t iff sA < tA or (sA = tA and (T = [ ] or s T t)). Given two lists of
attributes, X and Y, X 7→ Y denotes an OD, read as X orders Y. Table r satisfies X 7→ Y
iff, for all s, t ∈ r, s X t implies s Y t. Moreover, X and Y are order compatible,
denoted as X ∼ Y iff XY ↔ YX. (For example, month and week of the year in the
calendar are order compatible.)
     We say that two tuples, s and t, are equivalent with respect to an attribute set X if
sX = tX . Any attribute set X partitions tuples into equivalence classes [3]. We denote
the equivalence class of a tuple t ∈ r with respect to a given set X by E(tX ); i.e., E(tX )
= {s ∈ r | sX = tX }. A partition of r over X is the set of equivalence classes, ΠX
= {E(tX ) | t ∈ r}. For instance, in Table 1, E(t1{year} ) = E(t2{year} ) = E(t3{year} ) =
{t1, t2, t3} and Πyear = {{t1, t2, t3}, {t4, t5, t6}}.
2.2       Canonical Mapping and Axioms
Expressing ODs in a natural way relies on lists of attributes; e.g., in Table 1, sal 7→
grp, subg is not the same as sal 7→ subg, grp. In contrast, the order of attributes in
an FD does not matter. However, the list representation leads to high complexity when
discovering ODs [4]. Thus, we provide a polynomial mapping of list-based ODs into
equivalent set-based canonical ODs [5, 6, 10]. The mapping allows us to develop an
efficient OD discovery algorithm that traverses a much smaller set-containment lattice
rather than the list-containment lattice used in [4].
     The mapping presented in Theorem 1 (below) converts a list-based OD into canon-
ical set-based ODs of two types. First, an attribute A is a constant within each equiva-
lence class with respect to a set of attributes X , denoted as X : [ ] 7→ A, if X0 7→ X0 A for
any permutation X0 of X (note that X functionally determines Y iff X 7→ XY, for any
list X over the attributes of X and any list Y over the attributes of Y [6]). Second, two
attributes, A and B, are order-compatible within each equivalence class with respect to
the set of attributes X , denoted as X : A ∼ B, if X0 A ∼ X0 B. The set X is called a con-
text. For example, in Table 1, bin is a constant in the context of pos, written as {pos}:
                                                               Bringing Order to Data         3
 1. Reflexivity          4. Strengthen 6. Augmentation-I           8. Chain
  X : [ ] 7→ A, ∀A ∈ X      X :[ ]  →
                                    7     A       X :[ ] →
                                                         7     A      X : A ∼ B1
 2. Identity                X A:[     ] →
                                        7  B      ZX   :[  ] →
                                                             7   A    ∀i∈[1,n−1] ,X : Bi ∼ Bi+1
                            X :[ ] 7→ B      7. Augmentation-II       X : Bn ∼ C
         X: A ∼ A                                                     ∀i∈[1,n] ,X Bi : A ∼ C
                         5. Propagate             X: A ∼ B
 3. Commutativity
                            X :[ ] 7→ A           ZX : A ∼ B          X: A ∼ C
         X: A ∼ B
                            X: A ∼ B
         X: B ∼ A
                      Fig. 2: Set-based axiomatization for canonical ODs.

[ ] 7→ bin since E(t1{pos} ) |= [ ] 7→ bin, E(t2{pos} ) |= [ ] 7→ bin and E(t3{pos} ) |=
[ ] 7→ bin. Also, {year}: bin ∼ salary because E(t1{year} ) |= bin ∼ salary and
E(t4{year} ) |= bin ∼ salary.
Theorem 1. X 7→ Y iff ∀j, X :[ ] 7→ Yj and ∀i, j, {X1 , .., Xi−1 , Y1 , .., Yj−1 }: Xi ∼ Yj .
Example 1. The OD [AB] 7→ [CD] is mapped into the following set of canonical ODs:
{A, B}:[ ] 7→ C, {A, B}:[ ] 7→ D, {}: A ∼ C, {A}: B ∼ C, {C}: A ∼ D, {A, C}: B ∼ D.

    We present a sound and complete set-based axiomatization for ODs in Fig. 2 [6].
The set-based axioms allow us to design effective pruning rules for our OD discovery
algorithm. For example, OD X :[ ] 7→ A is trivial if A ∈ X by Reflexivity (see also
Example 2).

Theorem 2. The axiomatization for canonical ODs in Fig. 2 is sound and complete.

2.3   Discovery Algorithm
Given the mapping of a list-based OD into equivalent set-based ODs, we present an
algorithm, named FASTOD, that efficiently discovers a complete and minimal set of
set-based ODs over a given relation instance. In contrast, the OD discovery algorithm
from [4] traverses a lattice of all possible lists of attributes, which leads to factorial time
complexity. FASTOD starts the search from singleton sets of attributes and works its
way to larger attribute sets through a set-containment lattice (as in Figure 1), level by
level (l = 0, 1, . . . ). When the algorithm is processing an attribute set X , it verifies ODs
of the form X \ A:[ ] 7→ A (let X \ A be shorthand for X \ {A}, where A ∈ X ) and
X \ {A, B}: A ∼ B, where A, B ∈ X and A 6= B. Furthermore, an OD X \ A:[ ] 7→ A
should be minimal, that is, @ Y ⊂ X such that Y \ A:[ ] 7→ A is valid.
     The algorithm maintains information about minimal ODs, in the form of X \A:[ ] 7→
A, in the candidate set Cc+ (X ) [3] (as ODs subsume FDs [7, 9]), where Cc+ (X ) = {A ∈
R | ∀B∈X X \ {A, B}:[ ] 7→ B does not hold}. Similarly, it stores information about
minimal ODs in the form of X \ {A, B}: A ∼ B, in the candidate set Cs+ (X ), where
Cs+ (X ) = {{A, B} ∈ X 2 | A 6= B and ∀C∈X X \ {A, B, C}: A ∼ B does not hold,
and ∀C∈X X \ {A, B, C}:[ ] 7→ C does not hold}. The following lemmas can be used to
prune the search space:
Lemma 1. An OD X \ A:[ ] 7→ A, where A ∈ X , is minimal iff ∀B∈X A ∈ Cc+ (X \ B).

Lemma 2. An OD X \ {A, B}: A ∼ B, where A, B ∈ X and A 6= B, is minimal iff
∀C∈X \{A,B} {A, B} ∈ Cs+ (X \ C), and A ∈ Cc+ (X \ B) and B ∈ Cc+ (X \ A).
4       H. Davoudi, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, J. Szlichta,

                                                                  / X B∈X Cc+ (X \
                                                                      T
According to Lemma 1, we do not need to check X \ A:[ ] 7→ A if A ∈
B), and based on Lemma 2, we do not need to consider X \ {A, B}: A ∼ B if A ∈      /
Cc+ (X \ B) or B ∈/ Cc+ (X \ A). Moreover, according to Lemma 3, we can delete nodes
from the lattice under the following conditions:
Lemma 3. Deleting node X from the lth lattice level, where l ≥ 2, has no effect on the
output set of minimal ODs if Cc+ (X ) = {} and Cs+ (X ) = {}.

Example 2. Let A:[ ] 7→ B, B:[ ] 7→ A and {}: A ∼ B. Since Cc+ ({A, B}) = {} and
Cs+ ({A, B}) as well as l = 2, by the pruning levels rule (Lemma 3), the node {A, B}
is deleted and the node {A, B, C} is not considered (see Figure 1). This is justified as
{AB}:[ ] 7→ C is not minimal by the Strengthen axiom, {AC}:[ ] 7→ B is not minimal
by Augmentation–I, {BC}:[ ] 7→ A is not minimal by Augmentation–I, {C}: A ∼ B is
not minimal by Augmentation–II, {A}: B ∼ C is not minimal by Propagate, and {B}:
A ∼ C is not minimal by Propagate.
Note that while we provide theoretical guarantees for FASTOD to find a complete set of
ODs, the ORDER algorithm [4] is not complete.

Theorem 3. The FASTOD algorithm computes a complete and minimal set of ODs.

    In [10], we extend our algorithm to discover bidirectional ODs, which allow a mix
of ascending and descending (desc) orders. For example, a student with an alphabeti-
cally lower letter grade has a higher percentage grade than another student. We develop
additional pruning rules and show that efficiency of discovery of bidirectional ODs re-
mains the same as for one-directional ODs.
    We experimentally compare FASTOD with previous approaches ( ORDER [4]). Our
algorithm can be orders of magnitude faster. For instance, on the flight dataset from
http://metanome.de with 1K tuples and 20 attributes, FASTOD finishes the com-
putation in less than 1 second, whereas ORDER did not terminate after 5 hours. More-
over, FASTOD’s candidate sets do not increase in size during the execution of the algo-
rithm (unlike ORDER) because of the concise candidate representation (e.g., many ODs
that are considered minimal by ORDER are found to be redundant by our algorithm).
    Finally, in [2], we show that a recent approach to OD discovery called OCDDIS-
COVER in Consonni et al. [1] is incorrect. We show that their claim of completeness of
OD discovery is not true. Built upon their incorrect claim, OCDDISCOVER’s pruning
rules are overly aggressive, and prune parts of the search space that contain legitimate
ODs. This is the reason their approach appears to be “faster” in practice than our FAS-
TOD discovery algorithm despite being significantly worse in asymptotic complexity.
Finally, we show that Consonni et al. [1] misinterpret our set-based canonical form for
ODs, leading to an incorrect claim that our FASTOD implementation has an error.

3   Conclusions
We presented an efficient algorithm for discovering ODs. The technical innovation that
made our algorithm possible is a novel mapping into a set-based canonical form and
an axiomatization for set-based canonical ODs. In future work, we plan to study condi-
tional ODs that hold over portions of data.
                                                                  Bringing Order to Data          5

References
 1. Consonni, C., Sottovia, P., Montresor, A., Velegrakis, Y.: Discovering Order Dependencies
    through Order Compatibility. In: EDBT. pp. 409–420 (2019)
 2. Godfrey, P., Golab, L., Kargar, M., Srivastava, D., Szlichta, J.: Errata note: Discovering
    order dependencies through order compatibility. Technical Report, 5 pages, available at
    https://arxiv.org/abs/1905.02010 (2019)
 3. Huhtala, Y., Karkkainen, J., Porkka, P., Toivonen, H.: Efficient Discovery of Functional and
    Approximate Dependencies Using Partitions. In: ICDE. pp. 392–401 (1998)
 4. Langer, P., Naumann, F.: Efficient order dependency detection. VLDB J. 25(2), 223–241
    (2016)
 5. Mihaylov, A., Godfrey, P., Golab, L., Kargar, M., Srivastava, D., Szlichta, J.: FASTOD:
    bringing order to data. In: 2018 IEEE 34th International Conference on Data Engineering
    (ICDE). pp. 1561–1564. IEEE (2018)
 6. Szlichta, J., Godfrey, P., Golab, L., Kargar, M., Srivastava, D.: Effective and Complete
    Discovery of Order Dependencies via Set-based Axiomatization. PVLDB 10(7), 721–732
    (2017)
 7. Szlichta, J., Godfrey, P., Gryz, J.: Fundamentals of Order Dependencies. PVLDB, 5(11):
    1220-1231 (2012)
 8. Szlichta, J., Godfrey, P., Gryz, J., Ma, W., Qiu, W., Zuzarte, C.: Business-Intelligence Queries
    with Order Dependencies in DB2. In: EDBT, 750-761 (2014)
 9. Szlichta, J., Godfrey, P., Gryz, J., Zuzarte, C.: Expressiveness and Complexity of Order De-
    pendencies. PVLDB 6(14): 1858-1869 (2013)
10. Szlichta, J., Godfrey, P., Golab, L., Kargar, M., Srivastava, D.: Effective and complete dis-
    covery of bidirectional order dependencies via set-based axioms. The VLDB Journal 27(4),
    573–591 (2018)