
Data Expiration and Aggregate Queries

David Toman

david@uwaterloo.ca
D.R. Cheriton School of Computer Science, University of Waterloo, Canada

We investigate the space requirements for summaries needed for maintaining exact answers to aggregate queries over histories of relational databases. We show that, in general, there is a super-logarithmic lower bound (in the length of the history) on the space needed to maintain a summary of the history that suffices to maintain answers to counting queries. We also develop a natural restriction on the use of aggregation that allows for a logarithmic upper bound and, in turn, for maintaining the summary of the history using counters.

1 Introduction

Data expiration, the process of removing no longer useful data from database histories while preserving answers to a set of temporal queries, is an essential component of any data warehousing solution that aims at storing and querying information collected over long periods of time. Approaches to data expiration can be evaluated on how well they remove unnecessary data: the ability of a particular approach to remove data can be quantified in terms of the size of the residual data, that is, the data that needs to be retained in order to answer queries. The challenge lies in preserving query answers not only at a particular point in time but also for any further extensions of the history: data that may not seem useful at present may become necessary when additional information arrives in the future.

Aggregate queries over histories are often used to summarize the past in a succinct way. Hence, computing aggregates, such as sum or count, over histories of databases (or data streams) has been a focus of considerable research [5], in particular on maintaining the aggregates over time using as little space as possible.

In this paper we investigate the trade-offs of maintaining exact aggregates over database histories. We show that, contrary to the common belief that aggregates efficiently summarize (and compress) such histories, the space needed for maintaining exact aggregates is actually larger than that needed for maintaining exact answers to first-order queries and, in particular, is not bounded by a logarithmic function in the length of the history. This in turn means that exact aggregates cannot be maintained by using counters as conjectured in [8]. The contributions of this paper are as follows:

- We show that unrestricted use of counting in temporal queries necessarily leads to a lower bound of $\Omega(\sqrt{n})$ in the length $n$ of a database history; and
- We develop restrictions on the use of the counting aggregate that guarantee that the size of the retained data is bounded by $O(\log n)$ in the length of the history.

Note that the latter restriction still allows for more queries than restricting ourselves to counting quantifiers: for counting quantifiers, an $O(1)$ upper bound in the length of the history, the same upper bound that has been shown for temporal relational calculus [8], can be easily achieved by translation to FOL. Also, the latter restriction allows unrestricted use of aggregates in queries defined by first-order temporal logic.

The remainder of the paper is organized as follows: Section 2 introduces database histories and temporal queries with aggregation. Section 3 gives a lower bound on the size of the residual data when unrestricted aggregation is allowed. Section 4 develops a restriction on the use of aggregation that is sufficient to obtain a logarithmic upper bound on the size of the residual data. Section 5 links the results presented in this paper to existing results. Section 6 concludes with an outline of future directions and open questions.

2 Database Histories and Aggregate Queries

A database history records the evolution of a database over time. We assume that time is modeled by positive but not necessarily consecutive integers and that the individual consecutive states of the evolution of the database are modeled as relational structures. In this setting a finite history of a database is simply a time (integer) indexed sequence of relational database instances:

Definition 1 (Database History) Let $\rho$ be a relational signature. A database history $H$ (or a history for short) is an integer-indexed sequence of databases $H = (D_0, D_1, \ldots, D_n)$ where $D_i$ is a standard relational database over $\rho$. We call $D_i$ the $i$th state of $H$.

The data domain $\mathrm{dom}_D$ of a history $H$ is the union of all data values that appear in any relation in $D_i$ at any time instant; the temporal domain $\mathrm{dom}_T$ is the set of all time instants that appear as indices in the history $H$. For a history $H$ we define $\mathrm{MaxT}(H)$ to be the maximal (latest) time instant in $\mathrm{dom}_T$.

A history $H$ can be extended by adding a database $D_j$, $j > \mathrm{MaxT}(H)$, to the end of the sequence. This process can be repeated arbitrarily many times. Let $H'$ be a sequence of all states successively added to $H$. We call $H'$ a suffix of $H$ and write $H; H'$ for the extension of $H$ by $H'$.
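As an illustration only, the notions of a history, its two domains, and its extension can be sketched in Python. The class name `History`, the encoding of a state as a dictionary from relation names to sets of tuples, and the value -1 returned for the empty history are assumptions made for this sketch and are not part of the formal development.

```python
from dataclasses import dataclass, field

# A database state maps relation names to sets of tuples (an assumed encoding).
State = dict[str, set[tuple]]


@dataclass
class History:
    """A finite history: database states indexed by increasing time instants."""
    states: dict[int, State] = field(default_factory=dict)

    def max_t(self) -> int:
        """MaxT(H): the latest time instant (-1 for the empty history, by assumption)."""
        return max(self.states, default=-1)

    def dom_t(self) -> set[int]:
        """Temporal domain: all time instants used as indices."""
        return set(self.states)

    def dom_d(self) -> set:
        """Data domain: all values appearing in any relation at any instant."""
        return {v for db in self.states.values()
                for rel in db.values() for tup in rel for v in tup}

    def extend(self, t: int, db: State) -> None:
        """Append a state; only instants later than MaxT(H) are allowed."""
        if t <= self.max_t():
            raise ValueError("retroactive updates are not allowed")
        self.states[t] = db


h = History()
h.extend(0, {"R": {("a",)}})
h.extend(3, {"R": {("a",), ("b",)}})   # instants need not be consecutive
```

The `extend` method is the only way to modify the history in this sketch, mirroring the fact that extension is the only update operation considered.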

For the purposes of this paper we restrict our attention only to the point-based active domain semantics: the only data values and time instants that exist are those present in the history or generated by the aggregates. However, we could similarly use, e.g., an interval-based encoding for the temporal domain [3] and insist on using consecutive non-overlapping intervals without significant changes in the technical development.

Note also that the only update operation defined for database histories in our framework is the extension of the history with a new state. This way our histories can be viewed as transaction-time temporal databases. In particular, this arrangement disallows retroactive modifications of the data: retroactive updates negatively impact the effectiveness of data expiration [8].

2.1 Aggregate Queries

We use the standard syntax for range-restricted first-order queries extended with a sum aggregate to query database histories.

Definition 2 (Queries) We use the following grammar rule to specify first-order queries with the sum aggregate:

$$Q ::= R(t, x) \mid \exists x.Q \mid \exists t.Q \mid \mathrm{Agg}^{z=e}_{x} : Q \mid Q \wedge Q \mid Q \wedge F \mid Q \wedge \neg Q \mid Q \vee Q$$

where $R$ is a relational symbol, $x$ is a tuple of variables, $t$ is a temporal variable, and $F$ is of the form $x = y$ for data variables and $t = s$ or $t < s$ for temporal variables. $\mathrm{Agg}^{z=e}_{x} : Q$ denotes the sum aggregate operator and $e$ a linear expression. Constants can also be used in the formulas without affecting the result. We require the queries to obey the standard syntactic safety rules: variables in a condition must appear free in the accompanying query, and free variables of subqueries involved in disjunction or negation must match. We also assume that the quantified variables have unique names different from all other variables in the query.

The semantics of the queries is defined using the usual satisfaction relation $\models$ that links histories ($D$) and substitutions ($\theta$) with queries; the only difference is in the case of base relations: for $R(x) \in \rho$ we define $D, \theta \models R(t, x)$ if $\theta(x) \in R(D_{\theta(t)})$. In other words, the atomic predicates are evaluated at the point of the history specified by their first argument.

The $\mathrm{Agg}^{z=e}_{x} : Q$ aggregate operator assigns the variable $z$ the sum of the values of the expression $e$ evaluated with respect to the answer substitutions to $Q$, grouped by the assignments to the variables $x$.
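The grouping semantics just described can be sketched as follows. This is a minimal illustration, assuming that the answers to the subquery are given as a list of distinct substitutions (Python dictionaries); the function names `agg` and `cnt` and the sample data are hypothetical. The count aggregate is obtained as the special case in which the summed expression is the constant 1.

```python
from collections import defaultdict


def agg(answers, group_vars, expr):
    """Sum `expr` over the answers of Q, grouped by the values of `group_vars`.

    `answers` is a list of substitutions (dicts from variable names to values).
    Returns a dict mapping each group (a tuple of values) to the aggregate z.
    """
    sums = defaultdict(int)
    for theta in answers:
        key = tuple(theta[v] for v in group_vars)
        sums[key] += expr(theta)
    return dict(sums)


def cnt(answers, group_vars):
    """Counting is summation of the constant expression e = 1."""
    return agg(answers, group_vars, lambda theta: 1)


# Hypothetical answers to some query Q with free variables x, y, and t.
answers = [
    {"x": "a", "y": 2, "t": 1},
    {"x": "a", "y": 5, "t": 2},
    {"x": "b", "y": 7, "t": 2},
]
totals = agg(answers, ["x"], lambda theta: theta["y"])
counts = cnt(answers, ["x"])
```

Here `totals` holds one sum per value of the grouping variable, and `counts` the number of answer substitutions in each group.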

Without loss of generality we assume that the valuations always map variables to values of the appropriate domain and are restricted to the free variables of the particular query.

We use $\mathrm{Cnt}^{z}_{x} : Q$ to stand for $\mathrm{Agg}^{z=1}_{x} : Q$, which represents the count aggregate operator.

2.2 Data Expiration

For historical queries, it is not desirable, and often not practical or even feasible, to store the entire history in computer storage. Therefore, we devise an expiration operator [8-10] that remembers only those parts of the history that are necessary for subsequently answering queries over extensions of the history. Formally:

Definition 3 (Expiration Operator) Let $Q$ be a query over a history $H$. An expiration operator $E$ for $Q$ is a triple $(\emptyset, \Delta, \Gamma)$ that satisfies the property

$$Q(H) = \Gamma(E(H)),$$

where $E(H)$ is the actual residual data needed to represent the summary of the history we retain in the system and is defined by

$$E(H) = \Delta(D_k, \Delta(D_{k-1}, \Delta(\ldots \Delta(D_0, \emptyset) \ldots)))$$

for every $H = (D_0, D_1, \ldots, D_k)$; $\emptyset$ in this definition is a constant initial summary, $\Delta$ is a map from summaries and states to summaries, and $\Gamma$ is a function mapping summaries to answers. In addition, we require that the triple $(\emptyset, \Delta, \Gamma)$ can be effectively constructed from $Q$.
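To make the shape of the definition concrete, the following sketch instantiates the three components (initial summary, update map, answer map) for one deliberately simple query, "in how many states does p hold?". The names `EMPTY`, `delta`, `gamma`, and `expire`, as well as the encoding of a state as a set of propositional letters, are assumptions of this illustration; it is not the construction developed later in the paper.

```python
from functools import reduce

# Query (illustrative): "in how many states does p hold?"
# A state is modelled as the set of propositional letters that hold in it.

EMPTY = 0                                  # constant initial summary


def delta(state, summary):
    """Update map: fold one new state into the summary."""
    return summary + (1 if "p" in state else 0)


def gamma(summary):
    """Answer map: read the query answer off the summary."""
    return summary


def expire(history):
    """E(H) = delta(D_k, delta(D_{k-1}, ... delta(D_0, EMPTY) ...))."""
    return reduce(lambda summary, state: delta(state, summary), history, EMPTY)


def query(history):
    """Reference evaluation of the query over the full history."""
    return sum(1 for state in history if "p" in state)


H = [set(), {"p"}, {"p"}, set(), {"p"}]
summary = expire(H)
# Extending the history only needs the summary, not H itself.
extended = delta({"p"}, summary)
```

For this query the summary is a single counter, so its size is logarithmic in the length of the history; the lower bound of Section 3 shows that such a compact summary does not exist for every counting query.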

The first two components of the expiration operator for $Q$ define a self-maintainable materialized view¹ of $H$: the $\emptyset$ component tells us what the contents of this view are in the beginning and the $\Delta$ component tells us how to update this view when more data arrives. The last component, $\Gamma$, reproduces the answers to $Q$ while only accessing the information in the view. Note that the definition does not specify what data model the view uses nor what query languages are used for the three components of the operator.

2.3 Properties of Expiration Operators

Intuitively, we have replaced the complete prefix of $H$ with $E(H)$. Thus our aim is to minimize the size of $E(H)$ in terms of:

1. the length of the history $H$, $|\mathrm{dom}_T|$;
2. the number of distinct values in $H$, $|\mathrm{dom}_D|$; and
3. the size of $Q$.

The dependency on the length of the history is the most critical factor. In practice, we would like to have $|E(H)| \in O(\log |\mathrm{dom}_T|)$, i.e., the size of the residual data is bounded by the logarithm of the length of the history; this means that we may need to store, e.g., a counter depending on the length of the history.

The size of the residual data may, however, also depend on the size of the active domain for the data elements, $|\mathrm{dom}_D|$, and the size of the query, $|Q|$. This is quite intuitive: the more distinct values are used in the history, the more space the residual data is likely to take, as in most cases we will have to store at least all the uninterpreted constants that appear in $H$. Similarly, more complex queries are likely to require more data to be retained.

It is easy to see that queries whose results contain (valuations of) temporal variables cannot possibly be amenable to data expiration, as we may need the information about possibly all the time instants at which the particular history has been updated: this immediately yields a linear lower bound on the size of the residual data with respect to $|\mathrm{dom}_T|$. Hence, for the remainder of the paper we only consider a fixed (number of) queries whose answers only contain data values and possibly aggregate values.

¹ Not necessarily a relational view.

3 Lower Bound for Counting

Consider a history over the schema $\rho = \{p\}$ (a single propositional letter) of the form

$$H = (\emptyset, \underbrace{\{p\}, \ldots, \{p\}}_{m_1}, \emptyset, \underbrace{\{p\}, \ldots, \{p\}}_{m_2}, \emptyset, \ldots, \emptyset, \underbrace{\{p\}, \ldots, \{p\}}_{m_k}, \emptyset)$$

where $\emptyset$ stands for an empty database instance (i.e., $\neg p$ holds), $\{p\}$ for a database instance in which $p$ holds (e.g., $H \models P(i)$ for $0 < i \le m_1$), and $1 < m_i$ for all $0 < i \le k$ such that $m_i \neq m_j$ for all $0 < i < j \le k$.

Now consider the query asking the question "what are the lengths of consecutive $p$-segments in $H$":

$$\exists t_1, t_2 . t_1 < t_2 \wedge \mathrm{Cnt}^{z}_{\{t_1, t_2\}} : (\neg P(t_1) \wedge \neg P(t_2) \wedge (\forall t . t_1 < t < t_2 \to P(t)) \wedge t_1 < t < t_2)$$

It is easy to see that $Q(H) = \{m_1, m_2, \ldots, m_k\}$. To obtain this answer the residual data has to contain at least

$$\sum_{i=1}^{k} \log m_i = \log \prod_{i=1}^{k} m_i \ge \log 2^k = k$$

bits. Hence for $k = \sqrt{n}$ and $m_i = i + 1$ we have $|H| \in O(n)$ but we need at least $\Omega(\sqrt{n})$ bits to represent the residual data for $H$.
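The counting argument can be checked on a small instance. The sketch below builds the history with $k$ blocks of lengths $m_i = i + 1$, evaluates the segment-length query directly, and compares the history length and the bit count with $k$. The encoding of a state as a Boolean ("does p hold?") and the function names are assumptions of this illustration.

```python
import math


def build_history(k):
    """History with k blocks of p; block i has length m_i = i + 1."""
    states = [False]                       # leading empty state
    for i in range(1, k + 1):
        states.extend([True] * (i + 1))    # m_i consecutive states with p
        states.append(False)               # separating empty state
    return states


def segment_lengths(states):
    """Lengths of maximal runs of p enclosed by states where p does not hold."""
    lengths, run = set(), 0
    for holds in states:
        if holds:
            run += 1
        else:
            if run:
                lengths.add(run)
            run = 0
    return lengths


k = 8
H = build_history(k)
answer = segment_lengths(H)
bits = sum(math.log2(m) for m in answer)   # log of the product of the m_i
```

The length of `H` is $1 + \sum_{i=1}^{k} (i + 2)$, which grows quadratically in $k$, so a history of length $n$ accommodates on the order of $\sqrt{n}$ distinct block lengths, while `bits` is at least $k$.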

Note also that the need for at least $\Omega(\sqrt{n})$ bits is not necessarily linked to the size of the answer to the query: consider a similar query that asks whether there is a contiguous block of $q$'s of equal length to a block of $p$'s. The latter query is boolean, but to be able to provide an answer to this query for any extension of $H$ we need to represent at least the counts $m_1, \ldots, m_k$ in the residual history and hence we need $\Omega(\sqrt{n})$ space.

4 Non-splitting Aggregates

The super-logarithmic lower bound presented in Section 3 relies crucially on the ability of our query language to count time instants relative to other time instants and in this way to generate a large number of distinct aggregate values independently of the size of the data domain $\mathrm{dom}_D$. Indeed, the lower bound presented uses only time-dependent propositions. To avoid this problem, we restrict the use of the aggregation operator $\mathrm{Agg}^{z=y}_{x} : Q$ as follows:

Definition 4 (Non-splitting Aggregate) Let $Q$ be a query and $t$ be all temporal variables free in $Q$. We call an aggregate operator $\mathrm{Agg}^{z=y}_{x} : Q$ non-splitting if $t \subseteq x$ or $t \cap x = \emptyset$.

We call a query Q non-splitting if all occurrences of aggregate operators in Q are non-splitting.
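The non-splitting condition is a simple test on two sets of variables. The sketch below is an illustration under the assumption that the free temporal variables of the aggregated subquery and the grouping variables are given explicitly as sets of names; the function name `is_non_splitting` is hypothetical.

```python
def is_non_splitting(free_temporal, grouping):
    """Non-splitting test: t is a subset of x, or t and x are disjoint.

    `free_temporal` are the temporal variables free in the aggregated
    subquery; `grouping` are the grouping variables x of the aggregate.
    """
    t, x = set(free_temporal), set(grouping)
    return t <= x or t.isdisjoint(x)


# The counting query of Section 3 groups by t1, t2 but also counts over t.
lower_bound_query = is_non_splitting({"t1", "t2", "t"}, {"t1", "t2"})
# Grouping by data variables only: time is aggregated away entirely.
data_grouping = is_non_splitting({"t"}, {"x"})
# Grouping by all free temporal variables (and possibly data variables).
time_grouping = is_non_splitting({"t"}, {"t", "x"})
```

The query used for the lower bound fails the test because its free temporal variables are split between grouping variables ($t_1$, $t_2$) and a counted variable ($t$).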

We extend the approach to data expiration for first-order queries [8] to non-splitting aggregate queries. The technique is based on partial evaluation: we treat relations in the known part of the history $H$ and in all its possible extensions $H'$ as characteristic formulas based on equality and order constraints as follows:

Definition 5 (Abstract Substitutions and Formulas) Let $H$ be a history, $x$ and $t$ a data and a temporal variable, respectively, and $\bullet \notin \mathrm{dom}_D \cup \mathrm{dom}_T$ a new symbol; this symbol is used to denote all the values outside of the (current) active data and temporal domains. We define abstract substitutions to be the formulas

$$[^{a}_{x}] \equiv \begin{cases} x = a & a \in \mathrm{dom}_D \\ z = k_a & z \text{ is a result of aggregation} \\ \forall a \in \mathrm{dom}_D . x \neq a & a = \bullet \end{cases}$$

for a data variable, where $k_a$ are distinct unknown values (we need at most $|Q| \log |Q| \, |\mathrm{dom}_D|$ of such values), and

$$[^{s}_{t}] \equiv \begin{cases} t = s & s \in \mathrm{dom}_T \\ t > \mathrm{MaxT}(\mathrm{dom}_T) & s = \bullet \end{cases}$$

for a temporal variable. We allow composite abstract substitutions to denote a (finite) conjunction of the above formulas (e.g., $[^{a}_{x}\,^{b}_{y}]$ denotes the conjunction of $[^{a}_{x}]$ and $[^{b}_{y}]$).

The approach is based on specializing a given aggregate query $Q$ with respect to the known part of the history $H$ while keeping the size of the result of the partial evaluation bounded by a function depending only on $|\mathrm{dom}_D|$ and $\log |\mathrm{dom}_T|$. A naive partial evaluation fails as, in the cases of quantification over the temporal domain (and similarly for aggregation over time), the naive replacement of an existential quantifier by a disjunction over all possible time instants would violate this requirement. Hence we devise an equivalence relation $\sim^{H}_{Q}$ that, for an existential subquery, classifies the possible abstract substitutions into equivalence classes in which all elements behave the same with respect to the extensions of the history (i.e., if one is present in an answer after an arbitrary extension of the history, all of them are). This equivalence relation allows choosing a single representative and in this way keeping the size of the partially evaluated formula within the required bounds.
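The role of representatives can be illustrated with a toy sketch. It assumes that equivalence of time instants is decided by a user-supplied `signature` function, which is only a stand-in for the inductively defined equivalence relation of the paper; the function name `representatives` and the sample data are hypothetical. For each class the sketch keeps the smallest instant and the size of the class, which is the information an aggregate over time needs.

```python
from collections import defaultdict


def representatives(instants, signature):
    """Group time instants into classes and keep (representative, size).

    `signature` is a stand-in for the equivalence relation: two instants
    are treated as equivalent when their signatures coincide.
    """
    classes = defaultdict(list)
    for s in instants:
        classes[signature(s)].append(s)
    # Keep the smallest instant of each class together with the class size.
    return {sig: (min(members), len(members)) for sig, members in classes.items()}


# Toy history over a single proposition p: instant -> does p hold?
p_holds = {0: False, 1: True, 2: True, 3: False, 4: True}
reps = representatives(p_holds, lambda s: p_holds[s])
```

In this toy case two representatives with their counts replace five time instants; the number of classes depends on the signature, not on the number of instants.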

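The following definition introduces simultaneously both the partial evaluation operation and the equivalence relation:

Definition 6 (Query Specialization and Substitution Equivalence) Let $H$ be a history. We simultaneously define a function $\mathrm{PE}_H$ that maps a query $Q$ to a set of residual queries indexed by abstract substitutions of the form $Q'[^{a}_{x}]$ for $x$ the set of free variables of $Q$ and $a$ the corresponding set of abstract values (of the appropriate type), and an equivalence relation $\sim^{H}_{Q}$ on abstract substitutions. The function $\mathrm{PE}_H$ and the relation $\sim^{H}_{Q}$ are defined inductively on the structure of $Q$ as follows:

$R(t, x)$: For atomic formulas, the result of partial evaluation with respect to $H$ yields

$$\mathrm{PE}_H(Q) = \{\mathrm{true}[^{s}_{t}\,^{a}_{x}] : R(s, a) \in D\} \cup \{R(t, x)[^{\bullet}_{t}\,^{a}_{x}] : a \in (\mathrm{dom}_D \cup \{\bullet\})^{|x|}\},$$

and the equivalence relation $[^{a_1}_{x}] \sim^{H}_{Q} [^{a_2}_{x}]$ to hold whenever $Q' = Q'' = \mathrm{true}$, or $a_1 = a_2$ and $s_1 = s_2$, for $Q'[^{s_1}_{t}\,^{a_1}_{x}], Q''[^{s_2}_{t}\,^{a_2}_{x}] \in \mathrm{PE}_H(Q)$.

$Q_1 \wedge F$: For a selection, we simply apply the selection on the abstract substitutions:

$$\mathrm{PE}_H(Q) = \{(Q_1' \wedge F)[^{a}_{x}] : Q_1'[^{a}_{x}] \in \mathrm{PE}_H(Q_1),\ \models [^{a}_{x}] \wedge F\},$$

and $[^{a_1}_{x}] \sim^{H}_{Q} [^{a_2}_{x}]$ whenever $[^{a_1}_{x}] \sim^{H}_{Q_1} [^{a_2}_{x}]$ and $[^{a_1}_{x}] \wedge F$ and $[^{a_2}_{x}] \wedge F$ are satisfiable.

$Q_1 \wedge Q_2$: For a conjunction (join) we combine the abstract substitutions and the residual queries as follows:

$$\mathrm{PE}_H(Q) = \{Q_1' \wedge Q_2'[^{a}_{x}\,^{b}_{y}] : Q_1'[^{a}_{x}] \in \mathrm{PE}_H(Q_1),\ Q_2'[^{b}_{y}] \in \mathrm{PE}_H(Q_2),\ \models [^{a}_{x}] \wedge [^{b}_{y}]\}.$$

The equivalence relation $[^{a_1}_{x}\,^{b_1}_{y}] \sim^{H}_{Q} [^{a_2}_{x}\,^{b_2}_{y}]$ is defined whenever $[^{a_1}_{x}] \sim^{H}_{Q_1} [^{a_2}_{x}]$ and $[^{b_1}_{y}] \sim^{H}_{Q_2} [^{b_2}_{y}]$, where both $[^{a_1}_{x}] \wedge [^{b_1}_{y}]$ and $[^{a_2}_{x}] \wedge [^{b_2}_{y}]$ are satisfiable, such that $Q_1'[^{a_1}_{x}], Q_1''[^{a_2}_{x}] \in \mathrm{PE}_H(Q_1)$ and $Q_2'[^{b_1}_{y}], Q_2''[^{b_2}_{y}] \in \mathrm{PE}_H(Q_2)$.

$\exists y.Q_1$: For an existential subformula quantifying over a data variable we get

$$\mathrm{PE}_H(Q) = \Big\{\Big(\exists y. \bigvee_{Q_1'[^{b}_{y}\,^{a}_{x}] \in \mathrm{PE}_H(Q_1)} Q_1'\Big)[^{a}_{x}] : \exists b . Q''[^{b}_{y}\,^{a}_{x}] \in \mathrm{PE}_H(Q_1)\Big\}.$$

To define the equivalence relation among the abstract substitutions, let $S_1 = \{b : Q_1'[^{b}_{y}\,^{a_1}_{x}] \in \mathrm{PE}_H(Q_1)\}$ and $S_2 = \{b : Q_2'[^{b}_{y}\,^{a_2}_{x}] \in \mathrm{PE}_H(Q_1)\}$. Then $[^{a_1}_{x}] \sim^{H}_{Q} [^{a_2}_{x}]$ whenever for every $b \in S_1$ there is $c \in S_2$ such that $[^{b}_{y}\,^{a_1}_{x}] \sim^{H}_{Q_1} [^{c}_{y}\,^{a_2}_{x}]$ and vice versa.

$\exists t.Q_1$: For a temporal existential quantifier we use the equivalence relation as follows: Let $s^{a}_{j}$ be a representative of each equivalence class with respect to $\sim^{H}_{Q_1}$ of all $s$ such that $Q_1'[^{a}_{x}\,^{s}_{t}] \in \mathrm{PE}_H(Q_1)$ (e.g., the smallest value in temporal order). We define

$$\mathrm{PE}_H(Q) = \Big\{\Big(\exists t. \bigvee_{Q_1'[^{a}_{x}\,^{s^{a}_{j}}_{t}] \in \mathrm{PE}_H(Q_1)} Q_1'\Big)[^{a}_{x}] : \exists s . Q''[^{a}_{x}\,^{s}_{t}] \in \mathrm{PE}_H(Q_1)\Big\}.$$

The definition of the equivalence among the resulting abstract substitutions is as above.

$\mathrm{Agg}^{z=y}_{x} : Q_1$: First, consider the case $x \cap t = \emptyset$ where $t$ are all free temporal variables in $Q_1$. Let $s^{ab}_{j}$ be a representative of each equivalence class with respect to $\sim^{H}_{Q_1}$ of all $s$ such that $Q_1'[^{s}_{t}\,^{a}_{x}\,^{b}_{y}] \in \mathrm{PE}_H(Q_1)$ and $k^{ab}_{j}$ the cardinality of the class. We define

$$\mathrm{PE}_H(Q) = \Big\{\mathrm{Agg}^{z = k^{ab}_{j} y}_{x} : \Big(\bigvee_{Q_1'[^{s^{ab}_{j}}_{t}\,^{a}_{x}\,^{b}_{y}] \in \mathrm{PE}_H(Q_1)} Q_1'\Big)[^{a}_{x}\,^{k_a}_{z}]\Big\}$$

where the value of the aggregate is set to $k_a$, a (unique) unknown value determined by $a$. Since $x$ is free of temporal variables and the result of the aggregate is functionally determined by the valuation of $x$, it is sufficient to define the $\sim^{H}_{Q}$ relation to be the diagonal relation on the abstract substitutions.

For the case $t \subseteq x$ the aggregate is applied to the disjunction

$$\bigvee_{Q_1'[^{a}_{x}\,^{b}_{y}] \in \mathrm{PE}_H(Q_1)} Q_1'$$

of the residual queries. We define the equivalence relation $[^{a_1}_{x}\,^{k_1}_{z}] \sim^{H}_{Q} [^{a_2}_{x}\,^{k_2}_{z}]$ to hold whenever for a pair of $a_1$ and $a_2$ we can find a matching of the $b_1$ and $b_2$ values of $y$ in the abstract answers to $Q_1$ such that $[^{a_1}_{x}\,^{b_1}_{y}] \sim^{H}_{Q_1} [^{a_2}_{x}\,^{b_2}_{y}]$ holds.

$Q_1 \wedge \neg Q_2$: For set difference we define

$$\mathrm{PE}_H(Q) = \{Q_1' \wedge \neg Q_2'[^{a}_{x}] : Q_1'[^{a}_{x}] \in \mathrm{PE}_H(Q_1),\ Q_2'[^{a}_{x}] \in \mathrm{PE}_H(Q_2)\} \cup \{Q_1'[^{a}_{x}] : Q_1'[^{a}_{x}] \in \mathrm{PE}_H(Q_1),\ Q_2'[^{a}_{x}] \notin \mathrm{PE}_H(Q_2)\},$$

and $[^{a_1}_{x}] \sim^{H}_{Q} [^{a_2}_{x}] \iff [^{a_1}_{x}] \sim^{H}_{Q_1} [^{a_2}_{x}] \wedge [^{a_1}_{x}] \sim^{H}_{Q_2} [^{a_2}_{x}]$, assuming that abstract substitutions not present in $\mathrm{PE}_H(Q_i)$ are related by $\sim^{H}_{Q_i}$.

$Q_1 \vee Q_2$: Similarly, for union we have

$$\mathrm{PE}_H(Q) = \{Q_1' \vee Q_2'[^{a}_{x}] : Q_1'[^{a}_{x}] \in \mathrm{PE}_H(Q_1),\ Q_2'[^{a}_{x}] \in \mathrm{PE}_H(Q_2)\} \cup \{Q_1'[^{a}_{x}] : Q_1'[^{a}_{x}] \in \mathrm{PE}_H(Q_1),\ Q_2'[^{a}_{x}] \notin \mathrm{PE}_H(Q_2)\} \cup \{Q_2'[^{a}_{x}] : Q_1'[^{a}_{x}] \notin \mathrm{PE}_H(Q_1),\ Q_2'[^{a}_{x}] \in \mathrm{PE}_H(Q_2)\},$$

and $[^{a_1}_{x}] \sim^{H}_{Q} [^{a_2}_{x}] \iff [^{a_1}_{x}] \sim^{H}_{Q_1} [^{a_2}_{x}] \wedge [^{a_1}_{x}] \sim^{H}_{Q_2} [^{a_2}_{x}]$, again assuming that abstract substitutions not present in $\mathrm{PE}_H(Q_i)$ are related by $\sim^{H}_{Q_i}$.

It is easy to prove by induction on the definitions of $\mathrm{PE}_H$ and $\sim^{H}_{Q}$ that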

- $\sim^{H}_{Q}$ is an equivalence relation with index bounded by a function of only $|\mathrm{dom}_D|$ and $|Q|$; and
- the formula $\mathrm{PE}_H(Q)$ is bounded in depth by $|Q|$ and with a maximal fan-out (in its syntactic representation) bounded by a function of $|\mathrm{dom}_D|$ and $|Q|$.

The latter claim follows from the former as the disjunctions present in the existential quantifier and in the aggregation depend on the index of $\sim^{H}_{Q}$ and not on the size of $\mathrm{dom}_T$ (as they would in a naively partially evaluated $Q$). As shown in [8] this function is bounded by a stack of exponents of height $|Q|$ with a matching lower bound in the case of first-order logic. The additional logarithmic factor $\log |\mathrm{dom}_T|$ comes from the fact that in $\mathrm{PE}_H(Q)$ we need to store the (partial) sums $k^{ab}_{j}$ for the aggregates.

4.1 Partial Evaluation based Expiration Operator

The result of $\mathrm{PE}_H(Q)$ can be used directly to encode the history $H$ as follows:

$$Q(H) = \mathrm{PE}_H(Q)(\emptyset)$$

$$\mathrm{PE}_{H;H'}(Q) \equiv \mathrm{PE}_{H'}(\mathrm{PE}_H(Q));$$

to show the second equivalence we extend the PE operator to be able to handle constants in a natural way (omitted in this paper for the sake of simplicity). Hence we can simply define the expiration operator to be the triple

$$\langle \mathrm{PE}_{\emptyset}(Q),\ \lambda D.\lambda H.\mathrm{PE}_{\{D\}}(H),\ \lambda H.H(\emptyset) \rangle.$$

This is sufficient to prove our claims.

Theorem 7 Let $Q$ be a non-splitting aggregate query. Then

$$\langle \mathrm{PE}_{\emptyset}(Q),\ \lambda D.\lambda H.\mathrm{PE}_{\{D\}}(H),\ \lambda H.H(\emptyset) \rangle$$

is a $\log |\mathrm{dom}_T|$-bounded expiration operator for $H$ with respect to $Q$.

In practice, however, we can extract a subset of $H$ based on collecting the time instants chosen as representatives of the equivalence classes of $\sim^{H}_{Q}$ [8]. We then also need to assign the partial counts/sums to these representatives; this can be achieved using a solution to a set of linear equations generated from the cardinalities of the equivalence classes (but is beyond the scope of this paper).

5 Related Work

This work has been inspired by Chomicki's work on bounded history encoding for checking temporal integrity constraints [4]. However, the technique presented in this paper was originally developed for first-order queries [8] and is based on partial evaluation [6]. It is worth mentioning that Chomicki's method for past temporal logic achieves a polynomial upper bound with respect to $\mathrm{dom}_D$ while a similar bound for the method presented in this paper is non-elementary: this should not come as a surprise, as first-order logic is non-elementarily more succinct than temporal logic (even in the propositional setting).

A parallel stream of research investigates the construction of synopses (summaries of data streams) for the purpose of answering streaming queries [1, 2, 11, 12]. Many of the issues in that setting are common with the approaches to data expiration [10]. However, due to the difficulties of maintaining synopses for exact computation of aggregates, considerable research has focused on approximate algorithms [7].

6 Conclusion

We have investigated the space requirements for summaries of database histories needed to maintain exact answers to aggregate queries: the results show that maintaining general aggregates such as count and sum leads to a considerable increase in storage requirements over maintaining summaries for first-order queries. We have also developed a syntactic restriction on the use of aggregates that allows the summaries to be bounded by the logarithm of the length of the history and, in turn, to be implemented using a few additional counters added to a summary that is independent of the length of the history.

Future work should provide a matching upper bound for summaries with respect to general aggregate queries (or improve on the lower bound presented in this paper). We also plan to consider alternatives to the naive generation of representatives for the equivalence classes used in the partial evaluation-based technique in order to extract a residual history from the original history $H$, possibly adorned with values of partial sums and counts.

References

1. Arvind Arasu, Brian Babcock, Shivnath Babu, Jon McAlister, and Jennifer Widom. Characterizing Memory Requirements for Queries over Continuous Data Streams. In ACM Symposium on Principles of Database Systems, pages 221-232, 2002.
2. Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and Issues in Data Stream Systems. In ACM Symposium on Principles of Database Systems, pages 1-16, 2002.
3. J. Chomicki and D. Toman. Temporal Databases. In M. Fischer, D. Gabbay, and L. Villa, editors, Handbook of Temporal Reasoning in Artificial Intelligence, pages 429-467. Elsevier Foundations of Artificial Intelligence, 2005.
4. Jan Chomicki. Efficient Checking of Temporal Integrity Constraints Using Bounded History Encoding. ACM Transactions on Database Systems, 20(2):149-186, 1995.
5. Graham Cormode and S. Muthukrishnan. An improved data stream summary: The Count-Min sketch and its applications. Journal of Algorithms, 55:29-38, 2004.
6. N.D. Jones, C.K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice Hall International, 1993.
7. S. Muthukrishnan. Data Streams: Algorithms and applications. Now Publishers Inc., 2005.
8. David Toman. Expiration of Historical Databases. In International Symposium on Temporal Representation and Reasoning, pages 128-135. IEEE Press, 2001.
9. David Toman. Logical Data Expiration. In Jan Chomicki, Gunter Saake, and Ron van der Meyden, editors, Logics for Emerging Applications of Databases, chapter 7, pages 203-238. Springer, 2003.
10. David Toman. On Construction of Holistic Synopses under the Duplicate Semantics of Streaming Queries. In International Symposium on Temporal Representation and Reasoning, pages 150-163. IEEE Press, 2007.
11. Jun Yang and Jennifer Widom. Maintaining temporal views over non-temporal information sources for data warehousing. In Proceedings of EDBT 1998, pages 389-403, 1998.
12. Jun Yang and Jennifer Widom. Temporal view self-maintenance. In Proceedings of EDBT 2000, pages 395-412, 2000.