<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Polystore Query Rewriting: The Challenges of Variety</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yannis Papakonstantinou yannis@cs.ucsd.edu</string-name>
        </contrib>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Numerous databases marketed as SQL-on-Hadoop,
NewSQL [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and NoSQL have emerged to catalyze Big
Data applications. These databases generally support the
3Vs [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. (i) Volume: amount of data (ii) Velocity: speed
of data in and out (iii) Variety: semi-structured and
heterogeneous data. As a result of di ering use cases and
design considerations around the Variety requirement, these
new databases have adopted semi-structured data models
that vary among each other. Their query languages have
even more variations. Some variations are due to super cial
syntactic di erences. Some variations arise from the data
model di erences. Other variations are genuine di erences
in query capabilities. Yet another kind of variations involves
subtly di erent semantics for seemingly similar query
functionalities. E.g., equality may have subtle and unexpected
meanings in the presence of missing attributes in NoSQL
databases.
      </p>
      <p>Even in a single organization, it is common to nd
multiple databases that exhibit high variety. Often applications
require integrated access to those databases. It is di cult
to write optimized software that retrieves data from
multiple such databases, given the di erent data models, di
erent query syntaxes and the (often subtly) di erent query
semantics. This problem has been recognized for many
decades in the database community. It is now accentuated,
as a plethora of di erent and specialized databases nds its
place in the enterprise. For example, the problem happens
whenever an enterprise adopts a fast and scalable NoSQL
database to capture its users' activity on its web site (web
log) and then builds applications that need integrated access
to the web log data stored in the NoSQL database and also
to data in its existing SQL databases.</p>
    </sec>
    <sec id="sec-2">
      <title>REWRITING</title>
      <p>
        Mediator systems had been proposed in the 90s in order
to provide integrated query access to multiple heterogeneous
databases, including databases with di erent query
capabilities. Polystores provide similar functionality [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. UCSD's
Supported by NSF DC 0910820, NSF III 1018961, NSF IIS
1237174, Informatica and Couchbase grants.
FORWARD Middleware is a mediator. Figure 1 shows
an example of how the FORWARD middleware evaluates
queries over di erent databases with varying capabilities.
An organization owns a PostgreSQL database and a
MongoDB database, where the PostgreSQL database contains a
sensors table and MongoDB contains a measurements
array of JSON objects. Conceptually, the FORWARD
Middleware presents to its clients the virtual SQL++ views V1 and
V2 of these databases. (We describe the
SQL-backwardscompatible SQL++ below.) The sensors table of the
Postgres database is presented in the view V1 as a bag (bags are
denoted by {{...}}) of JSON objects, since both tuples and
JSON objects are sets of attribute/value pairs. Notice, the
virtual view V2 of MongoDB is identical to its native data
representation. Since the views are virtual, the FORWARD
Middleware does not have a copy of the source data.
      </p>
      <p>Suppose the client issues the federated query Q, which
nds the average temperature reported by any reliably
functioning sensor in a speci c lat-long bounding box, where a
sensor is deemed reliable only if none of its measurements are
outside the range 40 F to 140 F. The query can be
rewritten into the following plan, which includes PostgreSQL and
MongoDB subqueries that are e cient and compatible with
the limited query capabilities of MongoDB.1
Plan 1: (qQ1 s:id!n@id pQ2) s:id!1@id pQ3</p>
      <p>The plan rst issues to PostgreSQL the query Q1, which
nds the ids (s.id) of the sensors in the bounding box.
Then, for each id, the parameter-passing semijoin operator
issues to MongoDB repeated queries Q2, each time
instantiating the parameter @id of Q2 with another id from the
result of Q1. The queries Q2 test whether the identi ed
sensor is reliable. If the sensor quali es, the particular sensor
from Q1 appears in the output of the semijoin. Finally, the
parameter-passing join issues to MongoDB repeated queries
Q3 that nd the average of the temperature measurements
for each quali ed sensor. Notice that if MongoDB had the
capability to support nested queries, it would have been
possible to issue a single MongoDB query for each id as opposed
to two queries. Therefore the rewriting is aware of source
capabilities. Finally, the coord to state() function, which
inputs coordinates and outputs the name of the
corresponding state, is executed in the middleware level.</p>
      <p>Plan 1 is one of the many possible plans that one may
consider. For example, another plan, say Plan 2, could have
issued a query Q1' that fetches to the middleware all the
sensors, regardless of whether they are in the bounding box
1In the interest of simplicity we simplify some aspects of the
plan.
will not become a general purpose mediator, but rather will
become federators specializing to a few particular sources.</p>
      <p>Client
SELECT coord_to_state(s.lat, s.lng), AVG(m.temp) AS avg_temp
FROM sensors AS s JOIN measurements AS m ON s.id = m.sid
WHERE (s.lat&gt;32.6 AND s.lat&lt;32.9 AND s.lng&gt;-117.0 AND s.lng&lt;-117.3)
AND NOT EXISTS (SELECT 1 FROM measurements AS me</p>
      <p>WHERE me.sid=s.id AND (me.temp&gt;140 OR me.temp&lt;-40))
GROUP BY s.id, s.lat, s.lng
Q
FORWARD Middleware</p>
      <p>SQL++ Query Processor
SQL++ Virtual Database</p>
      <p>V2
V1</p>
      <p>Q1
SELECT
s.id,s.lat,s.lng
FROM sensors
WHERE (
s.lat&gt;32.6 AND
s.lat&lt;32.9 AND
s.lng&gt;-117.0 AND
s.lng&lt;-117.3)
or not. Consequently, the bounding box selection would
happen at the middleware level and then Plan 2 could
proceed with the parameter-passing semijoin and join of Plan 1.
However, one can easily tell that Plan 2 is not competitive
since the capabilities of Postgres guarantee that one is
always better o executing this kind of selection conditions at
the Postgres source. Even in the absence of statistics and
a cost estimator, a mediator should characterize Plan 2 as
non-competitive and worthless of further consideration.</p>
      <p>Nevertheless, in general, there are multiple competitive
plans, i.e., plans that one needs source statistics and a cost
estimator to tell which one is the expected best. For
example, if it were the case that the vast majority of sensors are
unreliable (i.e., if only a few sensors passed the conditions
&gt; 40 and &lt; 140), then it could made sense to execute a
Plan 3 that rst issues a MongoDB query to nd the
unreliable sensors, then checks on Postgres whether they are in
the wanted bounding box and then nds the average
tempratures. One needs statistics and a cost estimator to tell
which one of Plan 1 or Plan 3 is better. Therefore they are
both competitive.</p>
      <p>We de ne the rewriting problem as follows: Given a client
query Q, nd all competitive candidate plans that are
nontrivially di erent. By requiring non-trivial di erence, we
eliminate the possibility of having multiple candidates that
are super cial variations of each other. For example,
supercially di erent plans can be created by, say, creating two
versions of a plan, where the two versions use di erent, yet
result-equivalent and performance-equivalent source queries.</p>
      <p>In this paper we assume that the optimal plan is found in
a sequence of two modules. The rst module is the rewriter:
It inputs the client query and it outputs candidate
competitive rewritings. The second module is the cost optimizer: It
inputs the competitive candidate rewritings, assigns a cost
to each one of them and chooses the plan with the
minimum cost. In contrast, many query processors generally mix
rewriting and cost optimization, hence pruning the search
space and/or e ciently exploring the space. (By \space" we
refer to the space of candidate rewritings.) The simpler,
modular architecture of this paper splits the rewriter from
the cost optimizer, primarily for presentation reasons: We
show that rewriting poses challenging problems, even if not
mixed with cost estimation and pruning.2</p>
      <p>The rewriting problem is now more relevant and more
challenging than it was in the past, since we now face
unprecented variety in the capabilities of sources. This
paper presents the past techniques and architectures for query
rewriting, argues that continued work is needed in order to
become practically viable in the present environment and
provides corresponding directions. Notice that \practical
viability" means being able to develop mediators where the
number of lines of code (LOC) grows ideally sublinearly or,
in the worst case, linearly in the number of involved sources.
Given the number of possible di erent sources, any
superlinear growth of LOC all but guarantees that such system
2Furthermore, it is an open question, to be judged in
realworld deployments, whether pruning of the search space is
a signi cant feature in mediators and polystores. In
relational database optimizers, where rewriting, estimation and
pruning are mixed, n-way joins e ectively prohibited
architectures where all equivalent plans (join orders) would be
explicit produced, as their number would generally be
exponential in n. In contrast, the number of competitive plans
may rarely be huge, therefore demoting the importance of
mixing rewriting and optimization.
sensors: {{
{id:1, lat:32.8, lng:-117.1},
{id:2, lat:32.7, lng:-117.2}
}}</p>
      <p>SQL Wrapper
measurements: [
{sid:1, temp:200,
msg:"calib. err."},
{sid:2, temp:70.1},
{sid:2, temp:70.2} ]</p>
      <p>MongoDB Wrapper
sensors:
id lat lng
1 32.8 -117.1
2 32.7 -117.2
measurements: [
{sid: 1, temp: 200,
msg:"calib. err."},
{sid: 2, temp: 70.1},
{sid: 2, temp: 70.2} ]</p>
      <p>MongoDB</p>
    </sec>
    <sec id="sec-3">
      <title>DESCRIBING SOURCE CAPABILITIES</title>
      <p>The prior example made clear that knowledge of the query
capabilities of the sources is essential in the rewriting
problem. To further isolate the point that requires capabilities
awareness, during the rewriting process, we further
modularize the rewriting module into two sub-modules. The split
decouples rewriting aspects that pertain to fundamental
capability di erences (e.g., can a source do nested conditions
or not?) from rewriting aspects that are a translation across
super cial syntactic di erences (e.g., a source can do nested
conditions but it expresses them in its own, non-SQL
dialect).</p>
      <p>
        Architecturally, we enable this split by formally specifying
the syntax and semantics of SQL++ [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which is a
unifying semi-structured data model and query language that
is designed to encompass the data model and query
language capabilities of NoSQL, NewSQL and SQL-on-Hadoop
databases. The reader is referred to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for its full details.3
The virtual views shown in Figure 1 follow the SQL++ data
model. Furthermore, we assume that each virtual view can
be queried with SQL++. However, in light of the limited
ca3 The SQL++ semantics stands on the shoulders of the
extensive past work from the database R&amp;D community in
non-relational data models and query languages: OQL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
the nested relational model and query languages [
        <xref ref-type="bibr" rid="ref1 ref15 ref9">9, 15, 1</xref>
        ]
and XQuery (and other XML-based query languages) [
        <xref ref-type="bibr" rid="ref14 ref4 ref5">14,
5, 4</xref>
        ]. SQL++ is an extension to SQL and is
backwardscompatible with SQL. This choice was made in order to
facilitate the SQL-aware audience. However, choosing SQL
compatibility is not an important property of the rewriting
problem or of mediators. The issues discussed here will emerge
even using a di erent, non-SQL-compatible language.
      </p>
      <p>Q2
db.measurements
.aggregate( {$match:
{$and: [ {sid: @id},
{$or: [
{temp: {$gt: 140}},
{temp: {$lt: -40}}
]}]}},
{$limit: 1})
PostgreSQL</p>
      <p>
        Q3
db.measurements
.aggregate(
{$match:{sid:@id}},
{$group: {
_id: "$sid",
avg: {$avg:"$temp"}
}})
pabilities of the sources, the more precise statement is that
each virtual view can answer a subset of SQL++ queries,
according to the source capabilities. For example, the
virtual view of MongoDB can be queried with SQL++ but such
SQL++ cannot have nested conditions. The important
differentiating aspect of SQL++ is that its syntax includes
con guration ags. For each database system, we set the
ags to di erent values, typically to describe whether the
database system supports the corresponding part of the
language. For example, a con guration ag represents whether
the nested conditions part of the syntax is allowed. When
we describe a Postgres database this ag will be true. In a
MongoDB it will be false. While [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] describes the ags by
laying them on the SQL syntax, the actual FORWARD uses
the respective ags on the algebra level.
      </p>
      <p>There are also additional ags that describe subtle
semantic di erences. For example, equality in SQL and equality
in MongoDB have subtly di erent semantics in the case of
nulls and absent attributes. Appropriate con guration ags
classify these di erences among the sources. In the case of
the equality problem, appropriate rewriting rules will then
translate an equality under SQL's con guraton ag, into an
equality under MongoDB's ag.</p>
      <p>
        By appropriate choices of con guration options, the
SQL++ semantics morph into the semantics of other
SQL+JSON databases. The short paper [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] shows
how the SQL standard and four well known
(Cassandra CQL, MongoDB, Couchbase N1QL, AsterixDB AQL)
semi-structured database query languages, are explained
as particular settings of the con guration options. While
SQL++ does not support the exact syntax of any of
these four databases, it can be morphed by the con
guration options to support equivalent queries.4 To
further facilitate the readers' understanding of SQL++ and
the e ect of the various con guration options, we provide
a web-accessible reference implementation of SQL++ at
http://forward.ucsd.edu/sqlpp.
      </p>
    </sec>
    <sec id="sec-4">
      <title>DECOMPOSI3.</title>
    </sec>
    <sec id="sec-5">
      <title>NAIVE QUERY PLAN</title>
    </sec>
    <sec id="sec-6">
      <title>TION WILL NOT WORK</title>
      <p>
        A simple and easy-to-implement algebra-based
architecture adds the capabilities-based rewriting as a very last step
in otherwise conventional rewriting [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In our running
example, this would mean that the rewriter rst employs the
usual palette of bene cial rewritings, such as pushing
conditions down, in order to produce an optimized algebraic
expression that is still source-agnostic, in the sense that it
does not specify what SQL++ queries must be sent to the
SQL++ views of the underlying sources. Then a query
decomposer, which is aware of the query capabilities, inspects
the source-agnostic plan and nds its biggest subexpressions
that can be executed in a single source. Then it formulates
corresponding SQL++ queries.
      </p>
      <p>
        Such architecture is a relatively obvious and simple
incremental improvement over conventional query processors.
The bad news is that in the era of high variety there cannot
be a uniform set of conventions for the format of the
sourceagnostic plans. Ideally the source agnostic plans would have
4 An earlier, extended version [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] shows how an additional
six databases correspond to particular settings of the
conguration options. We expect that some of the results listed
in the feature matrices describing con guration options will
change in the next years as the space evolves rapidly.
a format that would allow decomposers to always nd the
optimal subexpressions/subqueries. Such ideal source
agnostic plans are generally impossible in the presence of high
variety.
      </p>
      <p>We exemplify next a very common, albeit not the only one,
problem pattern that argues for the non-existence of \ideal"
source agnostic plans. Consider two algebraic operators f
and g that can be swapped, i.e., f (g(:) = g(f (:). Assume
that the source-agnostic plan for a client query q is
Expession A1: o1(f (g(e2)))</p>
      <p>By virtue of the equivalence of f (g(:)) and g(f (:)), one can
see that the rewriter could have also produced the
sourceagnostic Expession A2 o1(g(f (e2))) instead of Expession A1.
In the absence of knowledge of the capabilities of the
unerlying sources, it is impossible to choose between A1 and
A2. The inability to choose can eventually lead to
suboptimal plans, because the decomposers will detect suboptimal
subexpressions. For example, in the discussed problem
pattern, consider two alternate source capability situations. In
both situations e2 should be executed at a source s, while
o1 can only be executed in the middleware. In the rst
situation, g can also be executed at the source s, along with
o2, while f cannot be executed at s. In the second
situation, it is vice versa: f can be executed at s, along with
o2, while g cannot. It is easy to see that the rst situation
produces an optimal plan if the source-agnostic plan is the
Expession A1. The second situation produces an optimal
plan when the source agnostic plan is the Expession A2.</p>
      <p>In e ect, there is not a one-size- ts-all source agnostic
plan. In the absence thereof, one may argue for a
seemingly simple x: First produce all possible source agnostic
plans. For example, whenever faced with operators that can
be swapped produce a plan for each swapping. Then let
decomposers work on each source agnostic plan.
Unfortunately, this is not a scalable solution since the number of
source agnostic plans would be huge.</p>
      <p>Instead, FORWARD supports a capability-aware
rewriting stage that is aware of the capabilities ags and performs
rewritings with the goal of maximizing the subexpressions
that can be pushed to a source. According to this
architecture, it does not matter whether the source agnostic phase
produces Expession A1 or Expession A2. Let us, for
example, assume that it produces Expession A1, while g cannot
be pushed to the source but f can. The capability-aware
rewriting phase will inspect operators above e2 and see if
they can be pushed down along with e2. Since f is such an
operator, the rewriting will transform f (g(e2)) into g(f (e2)),
hence preparing the ground for the decomposer.
4.</p>
    </sec>
    <sec id="sec-7">
      <title>MODELING VARIETY BY INFINITELY</title>
    </sec>
    <sec id="sec-8">
      <title>MANY PARAMETERIZED VIEWS</title>
      <p>
        A number of works tackled the problem of answering
queries using sources with limited query capabilities as if
it were an extension of the well-studied problem of
answering queries using views. Namely, [
        <xref ref-type="bibr" rid="ref10 ref13 ref17 ref18">13, 10, 17, 18</xref>
        ] introduced
notations that precisely described the (in general, in nite)
set of queries that the sources express. All works focused
on conjunctive queries, including in semistructured settings
as well [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. At a high level of abstraction, they all enabled
the description of the (potentially in nite) set of views by
an appropriately expanded notation of Datalog, where the
(in nitely many) views correspond to the (in nitely many)
views that can be produced by expanding the intensional
database predicates of the Datalog program. For example,
consider the following capabilities description:
1: view(doc) :- document(id, doc), id=?
2: view(doc) :- document(id, doc),
conditions(id)
3: conditions(id) :- contains(id, w), w=?
4: conditions(id) :- conditions(id),
contains(id, w), w=?
      </p>
      <p>By expanding the containsconditions(id) of Line 2
with Line 4 and then with Line 3 we derive that one
supported (parameterized) view is
view(doc) :- document(id, doc), contains(id, w1),
w1=?, contains(id, w2), w2=?</p>
      <p>
        By replacing the parameters ? with constants we have
conventional views/queries. Other language extensions [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
enable one to describe the capabilities of a source without
even knowing its schema. These works often produced
completeness guarantees, i.e., they would nd all competitive
candidate rewritings, even under heavy variety.
      </p>
      <p>
        While theoretically strong, these works did not turn out
as-is into industrial processors, despite their authors
founding two mediator companies in the early 00s (Nimble,
subsequently acquired by Actuate; and Enosys Software,
subsequently acquired by BEA Aqualogic). A key reason for
their non-adoption is the relatively low variety requirements
of the early 00s. Hence it was su cient to follow variants
of Garlic's [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] algebra-based architecture (brie y outlined
as the naive approach in Section 3) rather than delving in
a radically di erent query processor architecture.
Furthermore, the completeness of the in nite views-based solutions
came at the price of focusing on limited query languages
(essentially conjunctive), which is a non-starter for a usable
processor.
      </p>
      <p>Nevertheless, their ability to describe the sets of supported
queries and the fact they o ered completeness guarantee,
invites drawing ideas from them in the current era of high
variability. How can we bring such descriptive power into
an algebraic setting?</p>
    </sec>
    <sec id="sec-9">
      <title>OTHER ISSUES</title>
      <p>
        We do not discuss cost optimization and polystore
design [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], while they are also important problems. It is
possible that a designer explicitly chose two (or more) di
erent systems based on the application's needs. Furthermore,
such designer may have explicitly placed the same data at
more than one sources, hence creating rewriting
opportunities where one plan may use one source (for such data),
while another plan may use another source (for the same
data). It is apparent that the presence of the same data set
at multiple sources, further increases the rewriting
opportunities.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Fischer</surname>
          </string-name>
          , and H.-J. Schek, editors.
          <source>Nested Relations and Complex Objects</source>
          , volume
          <volume>361</volume>
          of Lecture Notes in Computer Science. Springer,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Bancilhon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cluet</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Delobel</surname>
          </string-name>
          .
          <article-title>A query language for the O2 object-oriented database system</article-title>
          .
          <source>In DBPL</source>
          , pages
          <volume>122</volume>
          {
          <fpage>138</fpage>
          ,
          <year>1989</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Behm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. R.</given-names>
            <surname>Borkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Carey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grover</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Onose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vernica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Tsotras</surname>
          </string-name>
          . ASTERIX:
          <article-title>towards a scalable, semistructured data platform for evolving-world models</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ):
          <volume>185</volume>
          {
          <fpage>216</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ceri</surname>
          </string-name>
          .
          <article-title>Comparative analysis of ve XML query languages</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>29</volume>
          (
          <issue>1</issue>
          ):
          <volume>68</volume>
          {
          <fpage>79</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Florescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Levy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Suciu</surname>
          </string-name>
          .
          <article-title>A query language for XML</article-title>
          .
          <source>Computer Networks</source>
          ,
          <volume>31</volume>
          (
          <fpage>11</fpage>
          -16):
          <volume>1155</volume>
          {
          <fpage>1169</fpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Duggan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Elmore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Balazinska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Howe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kepner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Madden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mattson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Zdonik</surname>
          </string-name>
          .
          <article-title>The bigdawg polystore system</article-title>
          .
          <source>SIGMOD Record</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ):
          <volume>11</volume>
          {
          <fpage>16</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dumbill</surname>
          </string-name>
          . What is Big Data?,
          <year>2012</year>
          . http://strata.oreilly.com/
          <year>2012</year>
          /01/ what-is
          <article-title>-big-data</article-title>
          .
          <source>html.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Haas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kossmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Wimmers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>Optimizing queries across diverse data sources</article-title>
          .
          <source>In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29</source>
          ,
          <year>1997</year>
          , Athens, Greece, pages
          <volume>276</volume>
          {
          <fpage>285</fpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jaeschke and H.-J. Schek</surname>
          </string-name>
          .
          <article-title>Remarks on the algebra of non rst normal form relations</article-title>
          .
          <source>In PODS</source>
          , pages
          <volume>124</volume>
          {
          <fpage>138</fpage>
          . ACM,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Ullman</surname>
          </string-name>
          .
          <article-title>Answering queries using limited external processors</article-title>
          .
          <source>In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5</source>
          ,
          <year>1996</year>
          , Montreal, Canada, pages
          <volume>227</volume>
          {
          <fpage>237</fpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vernoux</surname>
          </string-name>
          . SQL+
          <article-title>+: An expressiveness benchmark of SQL-on-Hadoop, NoSQL</article-title>
          and NewSQL databases,
          <year>2014</year>
          . http://db.ucsd.edu/pubsFileFolder/375.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K. W.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Vernoux</surname>
          </string-name>
          . The SQL+
          <article-title>+ query language: Con gurable, unifying and semi-structured</article-title>
          .
          <source>CoRR, abs/1405.3631</source>
          ,
          <year>2015</year>
          . http://arxiv.org/abs/1405.3631.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Haas</surname>
          </string-name>
          .
          <article-title>Capabilities-based query rewriting in mediator systems</article-title>
          .
          <source>In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems, December 18-20</source>
          ,
          <year>1996</year>
          ,
          <string-name>
            <given-names>Miami</given-names>
            <surname>Beach</surname>
          </string-name>
          , Florida, USA, pages
          <volume>170</volume>
          {
          <fpage>181</fpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Robie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chamberlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dyck</surname>
          </string-name>
          , and J.
          <source>Snelson. XQuery 3</source>
          .0:
          <string-name>
            <surname>An</surname>
            <given-names>XML</given-names>
          </string-name>
          <article-title>query language</article-title>
          ,
          <source>W3C candidate recommendation</source>
          ,
          <year>2013</year>
          . http: //www.w3.org/TR/2013/PR-xquery-
          <volume>30</volume>
          -20131022/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. F.</given-names>
            <surname>Korth</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Silberschatz</surname>
          </string-name>
          .
          <article-title>Extended algebra and calculus for nested relational databases</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .,
          <volume>13</volume>
          (
          <issue>4</issue>
          ):
          <volume>389</volume>
          {
          <fpage>417</fpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          .
          <article-title>New opportunities for New SQL</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>55</volume>
          (
          <issue>11</issue>
          ):
          <volume>10</volume>
          {
          <fpage>11</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vassalos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          .
          <article-title>Describing and using query capabilities of heterogeneous sources</article-title>
          .
          <source>In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29</source>
          ,
          <year>1997</year>
          , Athens, Greece, pages
          <volume>256</volume>
          {
          <fpage>265</fpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vassalos</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Papakonstantinou</surname>
          </string-name>
          .
          <article-title>Expressive capabilities description languages and query rewriting algorithms</article-title>
          .
          <source>J. Log. Program.</source>
          ,
          <volume>43</volume>
          (
          <issue>1</issue>
          ):
          <volume>75</volume>
          {
          <fpage>122</fpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>