-

Polystore Query Rewriting: The Challenges of Variety

Yannis Papakonstantinou yannis@cs.ucsd.edu

Numerous databases marketed as SQL-on-Hadoop, NewSQL [ 16 ] and NoSQL have emerged to catalyze Big Data applications. These databases generally support the 3Vs [ 7 ]. (i) Volume: amount of data (ii) Velocity: speed of data in and out (iii) Variety: semi-structured and heterogeneous data. As a result of di ering use cases and design considerations around the Variety requirement, these new databases have adopted semi-structured data models that vary among each other. Their query languages have even more variations. Some variations are due to super cial syntactic di erences. Some variations arise from the data model di erences. Other variations are genuine di erences in query capabilities. Yet another kind of variations involves subtly di erent semantics for seemingly similar query functionalities. E.g., equality may have subtle and unexpected meanings in the presence of missing attributes in NoSQL databases.

Even in a single organization, it is common to nd multiple databases that exhibit high variety. Often applications require integrated access to those databases. It is di cult to write optimized software that retrieves data from multiple such databases, given the di erent data models, di erent query syntaxes and the (often subtly) di erent query semantics. This problem has been recognized for many decades in the database community. It is now accentuated, as a plethora of di erent and specialized databases nds its place in the enterprise. For example, the problem happens whenever an enterprise adopts a fast and scalable NoSQL database to capture its users' activity on its web site (web log) and then builds applications that need integrated access to the web log data stored in the NoSQL database and also to data in its existing SQL databases.

REWRITING

Mediator systems had been proposed in the 90s in order to provide integrated query access to multiple heterogeneous databases, including databases with di erent query capabilities. Polystores provide similar functionality [ 6 ]. UCSD's Supported by NSF DC 0910820, NSF III 1018961, NSF IIS 1237174, Informatica and Couchbase grants. FORWARD Middleware is a mediator. Figure 1 shows an example of how the FORWARD middleware evaluates queries over di erent databases with varying capabilities. An organization owns a PostgreSQL database and a MongoDB database, where the PostgreSQL database contains a sensors table and MongoDB contains a measurements array of JSON objects. Conceptually, the FORWARD Middleware presents to its clients the virtual SQL++ views V1 and V2 of these databases. (We describe the SQL-backwardscompatible SQL++ below.) The sensors table of the Postgres database is presented in the view V1 as a bag (bags are denoted by {{...}}) of JSON objects, since both tuples and JSON objects are sets of attribute/value pairs. Notice, the virtual view V2 of MongoDB is identical to its native data representation. Since the views are virtual, the FORWARD Middleware does not have a copy of the source data.

Suppose the client issues the federated query Q, which nds the average temperature reported by any reliably functioning sensor in a speci c lat-long bounding box, where a sensor is deemed reliable only if none of its measurements are outside the range 40 F to 140 F. The query can be rewritten into the following plan, which includes PostgreSQL and MongoDB subqueries that are e cient and compatible with the limited query capabilities of MongoDB.1 Plan 1: (qQ1 s:id!n@id pQ2) s:id!1@id pQ3

The plan rst issues to PostgreSQL the query Q1, which nds the ids (s.id) of the sensors in the bounding box. Then, for each id, the parameter-passing semijoin operator issues to MongoDB repeated queries Q2, each time instantiating the parameter @id of Q2 with another id from the result of Q1. The queries Q2 test whether the identi ed sensor is reliable. If the sensor quali es, the particular sensor from Q1 appears in the output of the semijoin. Finally, the parameter-passing join issues to MongoDB repeated queries Q3 that nd the average of the temperature measurements for each quali ed sensor. Notice that if MongoDB had the capability to support nested queries, it would have been possible to issue a single MongoDB query for each id as opposed to two queries. Therefore the rewriting is aware of source capabilities. Finally, the coord to state() function, which inputs coordinates and outputs the name of the corresponding state, is executed in the middleware level.

Plan 1 is one of the many possible plans that one may consider. For example, another plan, say Plan 2, could have issued a query Q1' that fetches to the middleware all the sensors, regardless of whether they are in the bounding box 1In the interest of simplicity we simplify some aspects of the plan. will not become a general purpose mediator, but rather will become federators specializing to a few particular sources.

Client SELECT coord_to_state(s.lat, s.lng), AVG(m.temp) AS avg_temp FROM sensors AS s JOIN measurements AS m ON s.id = m.sid WHERE (s.lat>32.6 AND s.lat<32.9 AND s.lng>-117.0 AND s.lng<-117.3) AND NOT EXISTS (SELECT 1 FROM measurements AS me

WHERE me.sid=s.id AND (me.temp>140 OR me.temp<-40)) GROUP BY s.id, s.lat, s.lng Q FORWARD Middleware

SQL++ Query Processor SQL++ Virtual Database

V2 V1

Q1 SELECT s.id,s.lat,s.lng FROM sensors WHERE ( s.lat>32.6 AND s.lat<32.9 AND s.lng>-117.0 AND s.lng<-117.3) or not. Consequently, the bounding box selection would happen at the middleware level and then Plan 2 could proceed with the parameter-passing semijoin and join of Plan 1. However, one can easily tell that Plan 2 is not competitive since the capabilities of Postgres guarantee that one is always better o executing this kind of selection conditions at the Postgres source. Even in the absence of statistics and a cost estimator, a mediator should characterize Plan 2 as non-competitive and worthless of further consideration.

Nevertheless, in general, there are multiple competitive plans, i.e., plans that one needs source statistics and a cost estimator to tell which one is the expected best. For example, if it were the case that the vast majority of sensors are unreliable (i.e., if only a few sensors passed the conditions > 40 and < 140), then it could made sense to execute a Plan 3 that rst issues a MongoDB query to nd the unreliable sensors, then checks on Postgres whether they are in the wanted bounding box and then nds the average tempratures. One needs statistics and a cost estimator to tell which one of Plan 1 or Plan 3 is better. Therefore they are both competitive.

We de ne the rewriting problem as follows: Given a client query Q, nd all competitive candidate plans that are nontrivially di erent. By requiring non-trivial di erence, we eliminate the possibility of having multiple candidates that are super cial variations of each other. For example, supercially di erent plans can be created by, say, creating two versions of a plan, where the two versions use di erent, yet result-equivalent and performance-equivalent source queries.

In this paper we assume that the optimal plan is found in a sequence of two modules. The rst module is the rewriter: It inputs the client query and it outputs candidate competitive rewritings. The second module is the cost optimizer: It inputs the competitive candidate rewritings, assigns a cost to each one of them and chooses the plan with the minimum cost. In contrast, many query processors generally mix rewriting and cost optimization, hence pruning the search space and/or e ciently exploring the space. (By \space" we refer to the space of candidate rewritings.) The simpler, modular architecture of this paper splits the rewriter from the cost optimizer, primarily for presentation reasons: We show that rewriting poses challenging problems, even if not mixed with cost estimation and pruning.2

The rewriting problem is now more relevant and more challenging than it was in the past, since we now face unprecented variety in the capabilities of sources. This paper presents the past techniques and architectures for query rewriting, argues that continued work is needed in order to become practically viable in the present environment and provides corresponding directions. Notice that \practical viability" means being able to develop mediators where the number of lines of code (LOC) grows ideally sublinearly or, in the worst case, linearly in the number of involved sources. Given the number of possible di erent sources, any superlinear growth of LOC all but guarantees that such system 2Furthermore, it is an open question, to be judged in realworld deployments, whether pruning of the search space is a signi cant feature in mediators and polystores. In relational database optimizers, where rewriting, estimation and pruning are mixed, n-way joins e ectively prohibited architectures where all equivalent plans (join orders) would be explicit produced, as their number would generally be exponential in n. In contrast, the number of competitive plans may rarely be huge, therefore demoting the importance of mixing rewriting and optimization. sensors: {{ {id:1, lat:32.8, lng:-117.1}, {id:2, lat:32.7, lng:-117.2} }}

SQL Wrapper measurements: [ {sid:1, temp:200, msg:"calib. err."}, {sid:2, temp:70.1}, {sid:2, temp:70.2} ]

MongoDB Wrapper sensors: id lat lng 1 32.8 -117.1 2 32.7 -117.2 measurements: [ {sid: 1, temp: 200, msg:"calib. err."}, {sid: 2, temp: 70.1}, {sid: 2, temp: 70.2} ]

MongoDB

DESCRIBING SOURCE CAPABILITIES

The prior example made clear that knowledge of the query capabilities of the sources is essential in the rewriting problem. To further isolate the point that requires capabilities awareness, during the rewriting process, we further modularize the rewriting module into two sub-modules. The split decouples rewriting aspects that pertain to fundamental capability di erences (e.g., can a source do nested conditions or not?) from rewriting aspects that are a translation across super cial syntactic di erences (e.g., a source can do nested conditions but it expresses them in its own, non-SQL dialect).

Architecturally, we enable this split by formally specifying the syntax and semantics of SQL++ [ 12 ], which is a unifying semi-structured data model and query language that is designed to encompass the data model and query language capabilities of NoSQL, NewSQL and SQL-on-Hadoop databases. The reader is referred to [ 11 ] for its full details.3 The virtual views shown in Figure 1 follow the SQL++ data model. Furthermore, we assume that each virtual view can be queried with SQL++. However, in light of the limited ca3 The SQL++ semantics stands on the shoulders of the extensive past work from the database R&D community in non-relational data models and query languages: OQL [ 2 ], the nested relational model and query languages [ 9, 15, 1 ] and XQuery (and other XML-based query languages) [ 14, 5, 4 ]. SQL++ is an extension to SQL and is backwardscompatible with SQL. This choice was made in order to facilitate the SQL-aware audience. However, choosing SQL compatibility is not an important property of the rewriting problem or of mediators. The issues discussed here will emerge even using a di erent, non-SQL-compatible language.

Q2 db.measurements .aggregate( {$match: {$and: [ {sid: @id}, {$or: [ {temp: {$gt: 140}}, {temp: {$lt: -40}} ]}]}}, {$limit: 1}) PostgreSQL

Q3 db.measurements .aggregate( {$match:{sid:@id}}, {$group: { _id: "$sid", avg: {$avg:"$temp"} }}) pabilities of the sources, the more precise statement is that each virtual view can answer a subset of SQL++ queries, according to the source capabilities. For example, the virtual view of MongoDB can be queried with SQL++ but such SQL++ cannot have nested conditions. The important differentiating aspect of SQL++ is that its syntax includes con guration ags. For each database system, we set the ags to di erent values, typically to describe whether the database system supports the corresponding part of the language. For example, a con guration ag represents whether the nested conditions part of the syntax is allowed. When we describe a Postgres database this ag will be true. In a MongoDB it will be false. While [ 12 ] describes the ags by laying them on the SQL syntax, the actual FORWARD uses the respective ags on the algebra level.

There are also additional ags that describe subtle semantic di erences. For example, equality in SQL and equality in MongoDB have subtly di erent semantics in the case of nulls and absent attributes. Appropriate con guration ags classify these di erences among the sources. In the case of the equality problem, appropriate rewriting rules will then translate an equality under SQL's con guraton ag, into an equality under MongoDB's ag.

By appropriate choices of con guration options, the SQL++ semantics morph into the semantics of other SQL+JSON databases. The short paper [ 12 ] shows how the SQL standard and four well known (Cassandra CQL, MongoDB, Couchbase N1QL, AsterixDB AQL) semi-structured database query languages, are explained as particular settings of the con guration options. While SQL++ does not support the exact syntax of any of these four databases, it can be morphed by the con guration options to support equivalent queries.4 To further facilitate the readers' understanding of SQL++ and the e ect of the various con guration options, we provide a web-accessible reference implementation of SQL++ at http://forward.ucsd.edu/sqlpp.

DECOMPOSI3. NAIVE QUERY PLAN TION WILL NOT WORK

A simple and easy-to-implement algebra-based architecture adds the capabilities-based rewriting as a very last step in otherwise conventional rewriting [ 8 ]. In our running example, this would mean that the rewriter rst employs the usual palette of bene cial rewritings, such as pushing conditions down, in order to produce an optimized algebraic expression that is still source-agnostic, in the sense that it does not specify what SQL++ queries must be sent to the SQL++ views of the underlying sources. Then a query decomposer, which is aware of the query capabilities, inspects the source-agnostic plan and nds its biggest subexpressions that can be executed in a single source. Then it formulates corresponding SQL++ queries.

Such architecture is a relatively obvious and simple incremental improvement over conventional query processors. The bad news is that in the era of high variety there cannot be a uniform set of conventions for the format of the sourceagnostic plans. Ideally the source agnostic plans would have 4 An earlier, extended version [ 11 ] shows how an additional six databases correspond to particular settings of the conguration options. We expect that some of the results listed in the feature matrices describing con guration options will change in the next years as the space evolves rapidly. a format that would allow decomposers to always nd the optimal subexpressions/subqueries. Such ideal source agnostic plans are generally impossible in the presence of high variety.

We exemplify next a very common, albeit not the only one, problem pattern that argues for the non-existence of \ideal" source agnostic plans. Consider two algebraic operators f and g that can be swapped, i.e., f (g(:) = g(f (:). Assume that the source-agnostic plan for a client query q is Expession A1: o1(f (g(e2)))

By virtue of the equivalence of f (g(:)) and g(f (:)), one can see that the rewriter could have also produced the sourceagnostic Expession A2 o1(g(f (e2))) instead of Expession A1. In the absence of knowledge of the capabilities of the unerlying sources, it is impossible to choose between A1 and A2. The inability to choose can eventually lead to suboptimal plans, because the decomposers will detect suboptimal subexpressions. For example, in the discussed problem pattern, consider two alternate source capability situations. In both situations e2 should be executed at a source s, while o1 can only be executed in the middleware. In the rst situation, g can also be executed at the source s, along with o2, while f cannot be executed at s. In the second situation, it is vice versa: f can be executed at s, along with o2, while g cannot. It is easy to see that the rst situation produces an optimal plan if the source-agnostic plan is the Expession A1. The second situation produces an optimal plan when the source agnostic plan is the Expession A2.

In e ect, there is not a one-size- ts-all source agnostic plan. In the absence thereof, one may argue for a seemingly simple x: First produce all possible source agnostic plans. For example, whenever faced with operators that can be swapped produce a plan for each swapping. Then let decomposers work on each source agnostic plan. Unfortunately, this is not a scalable solution since the number of source agnostic plans would be huge.

Instead, FORWARD supports a capability-aware rewriting stage that is aware of the capabilities ags and performs rewritings with the goal of maximizing the subexpressions that can be pushed to a source. According to this architecture, it does not matter whether the source agnostic phase produces Expession A1 or Expession A2. Let us, for example, assume that it produces Expession A1, while g cannot be pushed to the source but f can. The capability-aware rewriting phase will inspect operators above e2 and see if they can be pushed down along with e2. Since f is such an operator, the rewriting will transform f (g(e2)) into g(f (e2)), hence preparing the ground for the decomposer. 4.

MODELING VARIETY BY INFINITELY MANY PARAMETERIZED VIEWS

A number of works tackled the problem of answering queries using sources with limited query capabilities as if it were an extension of the well-studied problem of answering queries using views. Namely, [ 13, 10, 17, 18 ] introduced notations that precisely described the (in general, in nite) set of queries that the sources express. All works focused on conjunctive queries, including in semistructured settings as well [ 13 ]. At a high level of abstraction, they all enabled the description of the (potentially in nite) set of views by an appropriately expanded notation of Datalog, where the (in nitely many) views correspond to the (in nitely many) views that can be produced by expanding the intensional database predicates of the Datalog program. For example, consider the following capabilities description: 1: view(doc) :- document(id, doc), id=? 2: view(doc) :- document(id, doc), conditions(id) 3: conditions(id) :- contains(id, w), w=? 4: conditions(id) :- conditions(id), contains(id, w), w=?

By expanding the containsconditions(id) of Line 2 with Line 4 and then with Line 3 we derive that one supported (parameterized) view is view(doc) :- document(id, doc), contains(id, w1), w1=?, contains(id, w2), w2=?

By replacing the parameters ? with constants we have conventional views/queries. Other language extensions [ 18 ] enable one to describe the capabilities of a source without even knowing its schema. These works often produced completeness guarantees, i.e., they would nd all competitive candidate rewritings, even under heavy variety.

While theoretically strong, these works did not turn out as-is into industrial processors, despite their authors founding two mediator companies in the early 00s (Nimble, subsequently acquired by Actuate; and Enosys Software, subsequently acquired by BEA Aqualogic). A key reason for their non-adoption is the relatively low variety requirements of the early 00s. Hence it was su cient to follow variants of Garlic's [ 8 ] algebra-based architecture (brie y outlined as the naive approach in Section 3) rather than delving in a radically di erent query processor architecture. Furthermore, the completeness of the in nite views-based solutions came at the price of focusing on limited query languages (essentially conjunctive), which is a non-starter for a usable processor.

Nevertheless, their ability to describe the sets of supported queries and the fact they o ered completeness guarantee, invites drawing ideas from them in the current era of high variability. How can we bring such descriptive power into an algebraic setting?

OTHER ISSUES

We do not discuss cost optimization and polystore design [ 6 ], while they are also important problems. It is possible that a designer explicitly chose two (or more) di erent systems based on the application's needs. Furthermore, such designer may have explicitly placed the same data at more than one sources, hence creating rewriting opportunities where one plan may use one source (for such data), while another plan may use another source (for the same data). It is apparent that the presence of the same data set at multiple sources, further increases the rewriting opportunities.

[1]

Abiteboul ,

P. C.

Fischer , and H.-J. Schek, editors. Nested Relations and Complex Objects , volume 361 of Lecture Notes in Computer Science. Springer, 1989 .

[2]

Bancilhon ,

Cluet , and

Delobel . A query language for the O2 object-oriented database system . In DBPL , pages 122 { 138 , 1989 .

[3]

Behm ,

V. R.

Borkar ,

M. J.

Carey ,

Grover ,

Li ,

Onose ,

Vernica ,

Deutsch ,

Papakonstantinou , and

V. J.

Tsotras . ASTERIX: towards a scalable, semistructured data platform for evolving-world models . Distributed and Parallel Databases , 29 ( 3 ): 185 { 216 , 2011 .

[4]

Bonifati and

Ceri . Comparative analysis of ve XML query languages . SIGMOD Record , 29 ( 1 ): 68 { 79 , 2000 .

[5]

Deutsch ,

M. F.

Fernandez ,

Florescu ,

A. Y.

Levy , and

Suciu . A query language for XML . Computer Networks , 31 ( 11 -16): 1155 { 1169 , 1999 .

[6]

Duggan ,

A. J.

Elmore ,

Stonebraker ,

Balazinska ,

Howe ,

Kepner ,

Madden ,

Maier ,

Mattson , and

S. B.

Zdonik . The bigdawg polystore system . SIGMOD Record , 44 ( 2 ): 11 { 16 , 2015 .

[7]

Dumbill . What is Big Data?, 2012 . http://strata.oreilly.com/ 2012 /01/ what-is -big-data . html.

[8]

L. M.

Haas ,

Kossmann ,

E. L.

Wimmers , and

Yang . Optimizing queries across diverse data sources . In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29 , 1997 , Athens, Greece, pages 276 { 285 , 1997 .

[9]

Jaeschke and H.-J. Schek . Remarks on the algebra of non rst normal form relations . In PODS , pages 124 { 138 . ACM, 1982 .

[10]

A. Y.

Levy ,

Rajaraman , and

J. D.

Ullman . Answering queries using limited external processors . In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5 , 1996 , Montreal, Canada, pages 227 { 237 , 1996 .

[11]

K. W.

Ong ,

Papakonstantinou , and

Vernoux . SQL+ +: An expressiveness benchmark of SQL-on-Hadoop, NoSQL and NewSQL databases, 2014 . http://db.ucsd.edu/pubsFileFolder/375.pdf.

[12]

K. W.

Ong ,

Papakonstantinou , and

Vernoux . The SQL+ + query language: Con gurable, unifying and semi-structured . CoRR, abs/1405.3631 , 2015 . http://arxiv.org/abs/1405.3631.

[13]

Papakonstantinou ,

Gupta , and

L. M.

Haas . Capabilities-based query rewriting in mediator systems . In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, December 18-20 , 1996 ,

Miami

Beach , Florida, USA, pages 170 { 181 , 1996 .

[14]

Robie ,

Chamberlin ,

Dyck , and J. Snelson. XQuery 3 .0: An

XML

query language , W3C candidate recommendation , 2013 . http: //www.w3.org/TR/2013/PR-xquery- 30 -20131022/.

[15]

M. A.

Roth ,

H. F.

Korth , and

Silberschatz . Extended algebra and calculus for nested relational databases . ACM Trans. Database Syst ., 13 ( 4 ): 389 { 417 , 1988 .

[16]

Stonebraker . New opportunities for New SQL . Commun. ACM , 55 ( 11 ): 10 { 11 , 2012 .

[17]

Vassalos and

Papakonstantinou . Describing and using query capabilities of heterogeneous sources . In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29 , 1997 , Athens, Greece, pages 256 { 265 , 1997 .

[18]

Vassalos and

Papakonstantinou . Expressive capabilities description languages and query rewriting algorithms . J. Log. Program. , 43 ( 1 ): 75 { 122 , 2000 .