Data Expiration and Aggregate Queries David Toman D.R.Cheriton School of Computer Science University of Waterloo, Canada david@uwaterloo.ca Abstract. We investigate the space requirements for summaries needed for maintaining exact answers to aggregate queries over histories of re- lational databases. We show that, in general, a super-logarithmic lower bound (in the length of the history) on space needed to maintain a sum- mary of the history in order to be able to maintain answers to counting queries. We also develop a natural restriction on the use of aggregation that allows for a logarithmic upper bound and in turn maintaining the summary of the history using counters. 1 Introduction Data Expiration—the process of removing no longer useful data from database histories while preserving answers to a set of temporal queries—is an essential component of any data warehousing solution that aims on storing and querying information collected over long periods of time. The approaches to data expira- tion can be evaluated on how well they can remove unnecessary data: the ability of a particular approach to remove data can be quantified in terms of the size of the residual data, the data that needs to be retained in order to answer queries. The challenge lies in preserving query answers not only at a particular point in time but also for any further extensions of the history: data that may not seem useful at present may become necessary when additional information arrives in the future. Aggregate queries over histories are often used to summarize the past in a succinct way. Hence, computing aggregates, such as sum or count, over histories of databases (or data streams) has been a focus of considerable research [5], in particular on maintaining the aggregates over time using as little space as possible. In this paper we investigate the trade-offs of maintaining exact aggregates over database histories. We show that, contrary to common belief that aggre- gates efficiently summarize (and compress) such histories, the space needed for maintaining exact aggregates is actually larger than that needed for maintain- ing exact answers to first order queries and, in particular, is not bounded by a logarithmic function in the length of the history. This in turn means that exact aggregates cannot be maintained by using counters as conjectured in [8]. The contributions of this paper are as follows: 1 – We show that unrestricted use √ of counting in temporal queries leads neces- sarily to a lower bound of Ω( n) in the length of a database history; and – We develop restrictions to the use of the counting aggregate that guarantee that the size of the retained data is bounded by O(log n) in the length of the history. Note that the later restriction still allows for more queries than restricting our- selves to counting quantifiers: for counting quantifiers an O(1) upper bound in the length of the history, the same upper bound that has been shown for tem- poral relational calculus [8] can be easily achieved by translation to FOL. lso, the later restriction allows unrestricted use of aggregates in queries defined by First-order temporal logic. The remainder of the paper is organized as follows: Section 2 introduces database histories and temporal queries with aggregation. Section 3 gives a lower bound on the size of the residual data when unrestricted aggregation is allowed. Section 4 develops a restriction on the use of the aggregation that is sufficient to obtain a logarithmic upper bound on the size of the residual data. Section 5 links the results presented in this paper to existing results. Sections:concl concludes with a outline of future directions and open questions. 2 Database Histories and Aggregate Queries A database history records the evolution of a database over time. We assume that time is modeled by positive but not necessarily consecutive integers and we that model the individual consecutive states of the evolution of the database as relational structures. In this setting a finite history of a database is simply a time (integer) indexed sequence of relational database instances: Definition 1 (Database History) Let ρ be a relational signature. A database history H (or a history for short) is an integer-indexed sequence of databases H = (D0 , D1 , . . . , Dn ) where Di is a standard relational database over ρ. We call Di the ith state of H. The data domain domD of a history H is the union of all data values that appear in any relation in Di at any time instant; the temporal domain domT is the set of all time instants that appear as indices in the history H. For a history H we define MaxT(H) to be the maximal (latest) time instant in domT . 2 A history H can be extended by adding a database Dj , j > MaxT(H) to the end of the sequence. This process can be repeated arbitrarily many times. Let H 0 be a sequence of all states successively added to H. We call H 0 a suffix of H and write H; H 0 for the extension of H by H 0 . For the purposes of this paper we restrict our attention only to the point-based active domain semantics: the only data values and time instants that exist are those present in the history or are generated by the aggregates. However, we could similarly use, e.g., an interval based encoding for the temporal domain 2 [3] and insist on using consecutive non-overlapping intervals without significant changes in the technical development. Note also, that the only update operation defined for database histories in our framework is the extension of the history with a new state. This way our his- tories can be viewed as transaction-time temporal databases. In particular, this arrangement disallows retroactive modifications of the data: retroactive updates negatively impact effectiveness of data expiration [8]. 2.1 Aggregate Queries We use the standard syntax for range-restricted first-order queries extended with a sum aggregate to query database histories. Definition 2 (Queries) We use the following grammar rule to specify first- order queries with the sum aggregate: Q ::= R(t, x) | ∃x.Q | ∃t.Q | Aggxz=Σe . Q | Q ∧ Q | Q ∧ ψ | Q ∧ ¬Q | Q ∨ Q where R is a relational symbol, x is a tuple of variables, t is a temporal variable, and ψ is of the form x = y for data variables and t = s or t < s for temporal vari- ables. Aggz=Σe x . Q denotes the sum aggregate operator and e a linear expression. Constants can also be used in the formulas ψ without affecting the result. We require the queries to obey the standard syntactic safety rules: variables in a condition ψ must appear free in the accompanying query and free variables of subqueries involved in disjunction or negation must match. We also assume that the quantified variables have unique names different from all other variables in the query. The semantics of the queries is defined using the usual satisfaction relation |= that links histories (D) and substitutions (θ) with queries; the only difference is in the case of base relations: for an R(x) ∈ ρ we define D, θ |= R(t, x) if xθ ∈ R(Dtθ ). In other words, the atomic predicates are evaluated at the point of the history specified by their first argument. The Aggz=Σe x . Q aggregate operator assigns the variable z the sum of the values of the expression e evaluated with respect to the answer substitutions to Q grouped by the assignments to the variables x. Without loss of generality we assume that the valuations θ always map variables to values of the appropriate domain and are restricted to the free variables of the particular query. 2 We use Cntzx . Q to stand for Aggz=Σ1 x . Q and to represent the count aggregate operator. 2.2 Data Expiration For historical queries, it is not desirable and often not practical or even feasible to store the entire history in computer storage. Therefore, we devise an expiration 3 operator [8–10] to remember only those parts of the history that are necessary to subsequent answering to queries and extensions of the history. Formally: Definition 3 (Expiration Operator) Let Q be a query over a history H. An expiration operator E for Q is a triple (∅, ∆, Γ ) that satisfies the property Q(H) = Γ (E(H)), where E(H) is the actual residual data needed to represent the summary of the history we retain in the system and is is defined by: E(H) = ∆(Dk , ∆(Dk−1 , ∆(. . . ∆(D0 , ∅) . . .))), for every H = (D0 , D1 , . . . , Dk ); ∅ in this definition is a constant initial summary, ∆ is a map from summaries and states to summaries, and Γ is a function mapping summaries to answers.. In addition, we require that the triple (∅, ∆, Γ ) can be effectively constructed from Q. The first two components of the expiration operator for Q define a self-maintain- able materialized view1 of H: the ∅ component tells us what the contents of this view is in the beginning and the ∆ component tells us how to update this view when more data arrives in S. The last component, Γ , reproduces the answers to Q while only accessing the information in the view. Note that the definition does not specify what data model the view uses nor what query languages are used for the three components of the operator. 2.3 Properties of Expiration Operators Intuitively, we have replaced the complete prefix of H with E(H). Thus our aim is to minimize the size of E(H) in terms of: 1. the length of the history H, |domT |; 2. the number of distinct values in H, |domD |; and 3. the size of Q. The dependency on the length of the history is the most critical factor. In prac- tice, we would like to have E(H) ∈ O(log|domT |), i.e., the size of the residual data is bounded by logarithm of the length of the history—that means that we may need to store, e.g., a counter depending on the length of the history. The size of the residual data may, however, also depend on the size of the active domain for the data elements, |domD (t)| and the size of the query |Q|. This is quite intuitive as the more distinct values are used in the history the more space the residual data is likely take: in most cases we will have to store at least all the uninterpreted constants that appear in H. Similarly, more complex queries are likely to require more data to be retained. It is easy to see that for queries whose results contain (valuations of) temporal variables can be posed over histories cannot possibly be amenable to data expiry 1 Not necessarily relational view. 4 as we may need the information about possibly all the data instants at which the particular history has been updated: this immediately yields a linear lower bound on the size of the residual data with respect to |domT |. Hence, for the remainder of the paper we only consider a fixed (number of) queries whose answers only contain data values and possibly aggregate values. 3 Lower Bound for Counting Consider a history over the schema ρ = {p} (a single propositional letter) of the form H = (∅, {p}, . . . , {p}, ∅, {p}, . . . , {p}, ∅, . . . ∅, {p}, . . . , {p}, ∅) | {z } | {z } | {z } m1 m2 mk where ∅ stands for an empty database instance (i.e., ¬p holds), {p} for a database instance in which p holds (e.g., H |= P (i) for 0 < i ≤ m1 ), and 1 < mi for all 0 < i ≤ k such that mi 6= mj for all 0 < i < j ≤ k. Now consider the query asking the question “what are the lengths of consecutive p-segments in H”: ∃t1 , t2 .t1 < t2 ∧ Cntz{t1 ,t2 } . ( ¬P (t1 ) ∧ ¬P (t2 )∧ (∀t.t1 < t < t2 → P (t)) ∧ t1 < t < t2 ) It is easy to see that Q(H) = {m1 , m2 , . . . , mk }. To obtain this answer the residual data has to contain at least k k ! X Y log mi = log mi ≥ log 2k = k i=1 i=1 √ √ Hence for k = n and m1 = i + 1 we have |H| ∈ O(n) but we need at least bits. Ω( n) bits to represent the residual data for √ H. Note also that the need for at least Ω( n) bits is not necessarily linked to the size of the answer to the query: consider a similar query that asks whether there is a contiguous block of q’s of equal length to a block of p’s. The later query is boolean, but to be able to provide an answer to this query for any extension of H we need to represent √ at least the counts m1 , . . . , mk in the residual history and hence we need Ω( n) space. 4 Non-splitting Aggregates The super-logarithmic lower bound presented in section 3 relies crucially on the ability of our query language to count time instants relatively to other time instants and this way to generate a large number of distinct aggregate values independently of the size of the data domain domD . Indeed, the lower bound presented uses only time-dependent propositions. To avoid this problem, we restrict the use of the aggregation operator Aggz=Σy x . Q as follows: 5 Definition 4 (Non-splitting Aggregate) Let Q be a query and t be all tem- poral variables free in Q. We call an aggregate operator Aggxz=Σy . Q non-splitting if t ⊆ x or t ∩ x = ∅. We call a query Q non-splitting if all occurrences of aggregate operators in Q are non-splitting. We extend the approach to data expiration for first-order queries [8] to non- splitting aggregate queries. The technique is based on partial evaluation: we treat relations in the known part of the history H and in all its possible extensions H 0 as characteristic formulas based on equality and order constraints as follows: Definition 5 (Abstract Substitutions and Formulas) Let H be a history and x and t a data and a temporal variables, respectively, and • 6∈ domD ∪domT a new symbol; this symbol is used to denote all the values outside of the (current) active data and temporal domains. We define abstract substitutions to be the formulas  x = a a ∈ domD [xa ] ≡ z = ka z is a result of aggregation  ∀a ∈ dom .x 6= a a = • D for a data variable where ka are distinct unknown values (we need at most |Q| log|Q| |domD | of such values) and  t t=s s ∈ domT [s ] ≡ t > MaxT(domT ) s = • for a temporal variable. We allow composite abstract substitutions to denote a (finite) conjunction of the above formulas (e.g., [xy ab ] denotes the conjunction of [xa ] and [yb ]). 2 The approach is based on specializing a given aggregate query Q with respect to the known part of the history H while keeping the size of the result of the partial evaluation bounded by a function depending only on |domD | and log |domT |. A naive partial evaluation fails as, in the cases of quantification over the temporal domain (and similarly for aggregation over time), the naive replacement of an existential quantifier by a disjunction over all possible time instants would violate this requirement. Hence we devise a equivalence relation ∼H Q that, for an existential subquery, classifies the possible abstract substitutions to equivalence classes in which all elements behave the same with respect to the extensions of the history (i.e., if one is present in an answer after an arbitrary extension of the history, all of them are). This equivalence relation allows choosing a single representative and this way to keep the size of the partially evaluated formula within the required bounds. The following definition introduces simultaneously both the partial evalua- tion operation and the equivalence relation: 6 Definition 6 (Query Specialization and Substitution Equivalence) Let H be a history. We simultaneously define a function PEH that maps a query Q to a set of residual queries indexed by abstract substitutions of the form Qi [xa ] for x the set of free variables of Q and a the corresponding set of abstract values (of the appropriate type), and an equivalence relation ∼H Q on abstract substi- tutions. The function PEH and the relation ∼H Q are defined inductively on the structure of Q as follows: R(t, x): For atomic formulas, the result of partial evaluation with respect to H yields |x| PEH (Q) = {true[tx tx sa ] : R(s, a) ∈ D} ∪ {R(t, x)[•a ] : a ∈ (domD ∪ {•}) }, 0 00 and the equivalence relation [xa1 ] ∼H x Q [a2 ] to hold whenever Q = Q = true 0 tx 00 tx or a1 = a2 and s1 = s2 for Q [s1 a1 ], Q [s2 a2 ] ∈ PEH (Q); Q1 ∧ F : For a selection, we simply apply the selection on the abstract substi- tutions: PEH (Q) = {(Q01 ∧ F )[xa ] : [xa ], Q01 ∈ PEH (Q1 ), |= [xa ] ∧ F }. We set [xa1 ] ∼H x x H x x x Q [a2 ] whenever [a1 ] ∼Q1 [a2 ] and [a1 ] ∧ F and [a2 ] ∧ F are sat- isfiable; Q1 ∧ Q2 : For a conjunction (join) we combine the abstract substitutions and the residual queries as follows: PEH (Q) = {Q01 ∧ Q02 [xy 0 x 0 y x y ab ] : Q1 [a ] ∈ PEH (Q1 ), Q2 [b ] ∈ PEH (Q2 ), |= [a ] ∧ [b ]}. The equivalence relation [xy H xy x H x a1 b1 ] ∼Q [a2 b2 ] is defined whenever [a1 ] ∼Q1 [a2 ] y H y x y x y and [b1 ] ∼Q2 [b2 ] where both [a1 ] ∧ [b1 ] and [a2 ] ∧ [b2 ] are satisfiable such that Q01 [xa1 ], Q001 [xa2 ] ∈ PEH (Q1 ) and Q02 [xb1 ], Q002 [xb2 ] ∈ PEH (Q2 ); ∃y.Q1 : For an existential subformula quantifying over a data variable we get _ PEH (Q) = {(∃y. Q01 )[xa ] : ∃b.Q00 [yx ba ] ∈ PEH (Q1 )} Q01 [yx ba ]∈PEH (Q1 ) To define the equivalence relation among the abstract substitutions, let S1 = {b : Q01 [yx 0 0 yx 0 x ba1 ] ∈ PEH (Q )} and S2 = {b : Q2 [ba2 ] ∈ PEH (Q )}. Then [a1 ] ∼Q H yx [xa2 ] whenever for every b ∈ S1 there is c ∈ S2 such that [ba1 ] ∼H yx Q0 [ca2 ] and vice versa. ∃t.Q1 : For a temporal existential quantifier we use the equivalence relation as follows: Let saj be a representative of each equivalence class with respect 0 xt to ∼H Q1 of all s such that Q1 [as ] ∈ PEH (Q1 ) (e.g., the smallest value in temporal order). We define _ PEH (Q) = {(∃y. Q01 )[xa ] : ∃s.Q00 [xt as ] ∈ PEH (Q1 )} Q01 [xt asa ]∈PEH (Q1 ) j The definition of the equivalence among the resulting abstract substitutions is as above. 7 Aggxz=Σy . Q1 : First, consider the case x ∩ t = ∅ where t are all free temporal variables in Q1 . Let sabj be a representative of each equivalence class with respect to ∼H Q1 of all s such that Q01 [txy ab sab ] ∈ PEH (Q1 ) and kj the cardinality of the class. We define   z=Σkjab ·y _ Q01    xz PEH (Q) = {Aggx .   [aka ]}, Q01 [txy ab ]∈PEH (Q1 ) s ab j where the value of the aggregate is set to ka , a (unique) unknown value determined by a. Since x is free of temporal variables and the result of the aggregate is functionally determined by the valuation of x, it is sufficient to define the ∼Q H relation to be the diagonal relation on the abstract substitutions. For the case t ⊆ x we have   _ PEH (Q) = {Aggz=Σy x . Q01  [xz aka ]}. Q01 [xy ab ]∈PEH (Q1 ) We define the equivalence relation [xz H xz a1 k1 ] ∼Q [a2 k2 ] to hold whenever for a pair of a1 and a2 we can find a matching of the b1 and b2 values of y in the abstract answers to Q1 such that [xy H xy a1 b1 ] ∼Q [a2 b2 ] holds. Q1 ∧ ¬Q2 : for set difference we define PEH (Q) = {Q01 ∧ ¬Q02 [xa ] : Q01 [xa ] ∈ PEH (Q1 ), Q02 [xa ] ∈ PEH (Q2 )} ∪ {Q01 [xa ] : Q01 [xa ] ∈ PEH (Q1 ), Q02 [xa ] 6∈ PEH (Q2 )},  and [xa1 ] ∼H x x H x x H x Q [a2 ] ⇐⇒ [a1 ] ∼Q1 [a2 ] ∧ [a1 ] ∼Q2 [a2 ] , assuming that abstract substitutions not present in PEH (Qi ) are related by ∼H Qi . Q1 ∨ Q2 : similarly, for union we have: PEH (Q) = {Q01 ∨ Q02 [xa ] : Q01 ∈ PEH (Q1 )[xa ], Q02 [xa ] ∈ PEH (Q2 )} ∪ {Q01 [xa ] : Q01 [xa ] ∈ PEH (Q1 ), Q02 [xa ] 6∈ PEH (Q2 )} ∪ {Q02 [xa ] : Q01 [xa ] 6∈ PEH (Q1 ), Q02 [xa ] ∈ PEH (Q2 )}  and [xa1 ] ∼H x Q [a2 ] ⇐⇒ [xa1 ] ∼H x x H x Q1 [a2 ] ∧ [a1 ] ∼Q2 [a2 ] , again assuming that abstract substitutions not present in PEH (Qi ) are related by ∼H Qi . It is easy to prove by induction on the definitions of PEH and ∼H Q that – ∼H Q is an equivalence relation with index bounded by a function of only |domD | and |Q|; and – the formula PEH (Q) is bounded in depth by |Q| and with a maximal fan-out (in its syntactic representation) bounded by a function of |domD | and |Q|. 8 The later claim is follows from the former as the disjunctions present in the existential quantifier and in the aggregation depend on the index of ∼H and not on the size of domT (as they would in a naively partially evaluated Q). As shown in [8] this function is bounded by stack of exponents of height |Q| with a matching lower bound in the case of first-order logic. The additional logarithmic factor log |domT | comes from the fact that in PEH (Q) we need to store the (partial) sums kjab for the aggregates. 4.1 Partial Evaluation based Expiration Operator The result of PEH (Q) can be used directly to encode the the history of the history H as follows; to show the second equivalence we extend the PE operator to be able to handle constants in a natural way (omitted in this paper for sake of simplicity): Q(H) = PEH (Q)(∅) PEH,H 0 (Q) ≡ PEH 0 (PEH (Q)) Hence we can simply define the expiration operator to be the triple hPE∅ (Q), λD.λH PE{D} (H), λH.H(∅)i. This is sufficient to prove our claims. Theorem 7 Let Q be a non-splitting aggregate query. Then hPE∅ (Q), λD.λH PE{D} (H), λH.H(∅)i is a log |domT |-bounded expiration operator for H with respect to Q. However, in practice, we can extract a subset of H based on collecting the time instants chosen as representatives of the equivalence classes of ∼Q H [8]. However, we also need to assign the partial counts/sums to these representatives; this can be achieved using a solution to a set of linear equations generated from the cardinalities of the equivalence classes (but is beyond the scope of this paper). 5 Related Work This work has been inspired by Chomicki’s work on bounded history encoding for checking temporal integrity constraints [4]. However, the technique presented in this paper was originally developed for first-order queries [8] and is based on partial evaluation [6]. It is worth mentioning that Chomicki’s method for past temporal logic achieves a polynomial upper bound with respect to domD while a similar bound for the method presented in this paper is non elementary: this cannot come as a surprise as first order logic is non-elementarily more succinct than temporal logic (even in the propositional setting). A parallel stream of research investigates the construction of synopses— summaries of data streams—for the purpose of answering streaming queries [1, 2, 11, 12]. Many of the issues in that setting are common with the approaches to data expiration [10]. However, due to difficulties of maintaining synopses for ex- act computation of aggregates, a considerable research focused on approximate algorithms [7]. 9 6 Conclusion We have investigated the space requirements for summaries of database histo- ries needed to maintain exact answers to aggregate queries: the results show that maintaining general aggregates such as count and sum leads to consider- able increase in storage requirements over maintaining summaries for first-order queries. We have also developed a syntactic restriction on the use of aggregates that allows the summaries to be bounded by log of the length of the history and in turn implemented using a few additional counters added to a summary that is independent of the length of the history. Future work should provide a matching upper bound for summaries with respect to general aggregate queries (or to improve on the lower bound presented in this paper). Also we plan to consider alternatives to the naive generation of representatives for the equivalence classes used in the partial evaluation-based technique in order to extract a residual history from the original history H, possibly adorned with values of partial sums and counts. References 1. Arvind Arasu, Brian Babcock, Shivnath Babu, Jon McAlister, and Jennifer Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams. In ACM Symposium on Principles of Database Systems, pages 221–232, 2002. 2. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and Issues in Data Stream Systems. In ACM Symposium on Principles of Database Systems, pages 1–16, 2002. 3. J. Chomicki and D. Toman. Temporal Databases. In M. Fischer, D. Gabbay, and L. Villa, editors, Handbook of Temporal Reasoning in Artificial Intelligence, pages 429–467. Elsevier Foundations of Artificial Intelligence, 2005. 4. Jan Chomicki. Efficient Checking of Temporal Integrity Constraints Using Bounded History Encoding. ACM Transactions on Database Systems, 20(2):149– 186, 1995. 5. Graham Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. Journal of Algorithms, 55:29–38, 2004. 6. N.D. Jones, C.K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice Hall International, 1993. 7. S. Muthukrishnan. Data Streams: Algorithms and applications. Now Publishers Inc., 2005. 8. David Toman. Expiration of Historical Databases. In International Symposium on Temporal Representation and Reasoning, pages 128–135. IEEE Press, 2001. 9. David Toman. Logical Data Expiration. In Jan Chomicki, Gunter Saake, and Ron van der Meyden, editors, Logics for Emerging Applications of Databases, chapter 7, pages 203–238. Springer, 2003. 10. David Toman. On Construction of Holistic Synopses under the Duplicate Semantics of Streaming Queries. In International Symposium on Temporal Representation and Reasoning, pages 150–163. IEEE Press, 2007. 11. Jun Yang and Jennifer Widom. Maintaining temporal views over non-temporal information sources for data warehousing. In Proceedings of EDBT 1998, pages 389–403, 1998. 12. Jun Yang and Jennifer Widom. Temporal view self-maintenance. In Proceedings of EDBT 2000, pages 395–412, 2000. 10