<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Execution Strategies for Compute Intensive Queries in Particle Physics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maximilian Berens</string-name>
          <email>maximilian.berens@tu-dortmund.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Dortmund University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
<p>Data analysis in many scientific domains, such as high energy physics, pushes the available resources of even big research collaborations to their limit, because it not only requires huge amounts of data but also compute-intensive calculations. Current approaches require increased storage and computing resources, lack sufficient flexibility and are still far away from interactivity. In this paper we present ideas to cope with challenges posed by compute-intensive analyses of high-volume data in a many-user, resource-restricted environment. The extreme selectivity of user queries (inherent in high energy physics analyses, for instance) permits us to reduce the computational effort to a small portion, provided that irrelevant data can be discarded by more efficient means. Our focus lies on providing execution strategies for analysis queries, guided by the expertise of the user and executed as a scalable, scan-based preselection (introduced in DeLorean [8]). Future research will encompass the development of a compact data summary that facilitates fast, columnar scan queries, as well as a domain-specific language that provides us with the required information to generate them.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
The ability to accumulate and use increasing amounts of
data in many science domains opens enticing prospects of
answering long-standing open questions of humanity. High energy
physics presents itself as a prominent example of such a domain,
where scientific conclusions are drawn from statistical
evidence that is gained by analysing huge quantities of data.
Analysis tools at this scale have to cope not only with the
sheer volume of data, but also with the complexity of the
involved computations, the limited availability of resources,
as well as the number and variety of user requests. CERN's
Large Hadron Collider (LHC), close to
Geneva, Switzerland, houses multiple experiments, each one
dedicated to different questions in particle physics. One is
the LHC beauty experiment (LHCb), where various aspects
of the differences between matter and antimatter are
studied. A common LHCb physics analysis concerns itself with
a particular (type of) decay or particle. In order to
observe them, protons are accelerated to very high energies
and brought to collision. Over the course of a year, up to
40 million collisions are measured every second, prefiltered
and stored. Before performing the actual analysis, specific
recordings of these collisions, termed events, have to be
selected from a global event store. Decays of interest are usually
very rare. For example, B⁰ₛ → μ⁺μ⁻ was found only 3 times
in 10 billion events [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. A significant portion of a physics
analysis consists of isolating particular types of events by
carefully defining query predicates.
      </p>
      <p>Detector measurements, in their initial, raw state, are not
immediately useful and require computationally expensive
and decay-specific reconstruction into physically meaningful
entities. Filtering events for user analyses is done by
enforcing predicates on these high-level features.</p>
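      <p>To make this concrete, the following is a minimal, hypothetical sketch of such a predicate-based filter. reconstruct() is a placeholder for the expensive LHCb reconstruction, and the attribute names and cut values are invented for illustration:</p>
      <preformat>
# Hypothetical sketch of predicate-based event filtering; reconstruct()
# stands in for the expensive, decay-specific reconstruction step.
def select_events(raw_events, reconstruct):
    for raw in raw_events:
        candidate = reconstruct(raw)      # compute-intensive step
        if candidate is None:             # decay could not be built
            continue
        # predicates on high-level features (illustrative cut values):
        # an invariant-mass window and a transverse-momentum threshold
        if 5200.0 < candidate["mass"] < 5500.0 and candidate["pt"] > 1000.0:
            yield raw
      </preformat>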
      <p>The overall productivity of many scientific projects is
limited by the amount of data that can be processed and stored.
At the LHCb, this is because the measurements cannot
be retained at the same rate as they are taken [12]. Still,
around 20 petabytes of "raw event" data go to tape
storage every year. Adding the long reconstruction time
per event into the equation, naive on-the-fly computations
over all available events for individual user requests are
intractable. Offering tools to quickly query the available data
without qualitative drawbacks, while optimally utilizing the
limited available resources, is essential for the success of a
collaboration such as the LHCb, and impacts the pace of
scientific discovery in general.</p>
      <p>A major upgrade of the detector will start recording new
data in 2021, increasing the data volume (both rate and size
of events [12]) even further and underlining the necessity of
new ideas.</p>
      <p>The remainder of this paper is structured as follows. First,
we give a brief overview of the concepts currently in place at
the LHCb, their drawbacks and the consequent, general directions
of our research. Preliminary and other related work are
covered in section 4. Section 5 discusses open challenges
that we are going to address in the upcoming years. Finally,
section 6 summarizes these ideas in a roadmap.</p>
    </sec>
    <sec id="sec-2">
      <title>2. DATA PROCESSING AT THE LHCB</title>
      <p>In order to prevent analysts from having to query and reconstruct the
whole set of available data for every query, a centrally
scheduled reconstruction-and-preselection procedure, called
stripping, is performed. Several hundred "stripping lines",
predefined filter queries, reduce the volume of data to an
acceptable size by materializing their results in separate
subsets (all lines combined drop 95% of all event data; a single line
must not exceed 0.05% of the overall data volume on average).
A stripping line typically tries to reconstruct a certain
decay and filters events in the process. The criteria to
select events tend to be "vague" (in their predicates), because
multiple analyses are supposed to be done on the result of a
single line or a small subset thereof. The stripping is initially
applied during data acquisition and has to be redone later,
when changes in the reconstruction software or overall user
demand require it (waiting times of several months for a new
stripping version are not unusual). Generally, this approach is
problematic for multiple reasons:</p>
      <p>In addition to the raw-event data necessary for
reconstruction, materialized results occupy scarce disk
capacity.</p>
      <p>Results have to conform with user requirements, which
are generally more specific/strict. This usually
requires users to "restrip" a selected subset with a
customized configuration.</p>
      <p>Stripping line predicates are designed with respect to
strict limits on available resources. This can conflict
with physically motivated predicate parametrization
and negatively impact the quality of conclusions, as
important data might be kept out of the analyst's reach.
Even small changes in predicates (i.e., "loosening" of
inequalities or "shifting" of ranges) are not directly
implementable, because the stripping is rarely redone.
Predefined selections are unlikely to cover unforeseen
analysis use cases and thus require the definition of a
completely new stripping line.</p>
      <p>Approaches that restrict the queryable data set by
predefined criteria are bound to lack flexibility, because
assumptions can change and individual users have different
requirements. In addition, the stripping still requires long
job waiting periods and additional resources.</p>
    </sec>
    <sec id="sec-3">
      <title>3. OUR OBJECTIVES</title>
      <p>The computationally expensive reconstruction is trivially
parallelizable, because events are completely independent
and small in size (on the order of hundreds of kilobytes). However, relying
on data parallelism alone guarantees neither general
scalability nor sufficient resource utilization. New solutions
need to scale up by leveraging available hardware features.
In addition, they need to scale out in order to avoid
bottlenecks caused by resource contention in large-scale multi-user
environments.</p>
      <p>For the purpose of pushing analytics closer towards
interactivity, we identify the following goals:</p>
      <p>A significant reduction of the data volume that is
involved in individual user queries.</p>
      <p>Limit the execution of expensive computations to the
result set and thus minimize the overall compute load.</p>
      <p>Enable efficient usage of modern hardware features
(e.g., deep cache hierarchies, advanced processor
instructions and multi-core architectures); a vectorized
columnar scan in the sketch after this list illustrates the idea.</p>
      <p>Reuse of information by caching results and
intermediate computations for upcoming queries.</p>
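      <p>As a minimal sketch of the hardware aspect, the following evaluates a preselection over synopsis columns with vectorized (SIMD-friendly) operations instead of a per-event loop; the column names, cut values and data are invented for illustration:</p>
      <preformat>
import numpy as np

# Illustrative only: a columnar scan over synopsis attributes. NumPy
# processes whole columns at once, which maps well onto deep cache
# hierarchies and vector instructions.
n = 1_000_000
synopsis = {
    "mass": np.random.uniform(4000.0, 6000.0, n).astype(np.float32),
    "pt":   np.random.uniform(0.0, 5000.0, n).astype(np.float32),
}

# one boolean mask per predicate, combined without a Python-level loop
mask = (synopsis["mass"] > 5200.0) & (synopsis["mass"] < 5500.0) \
     & (synopsis["pt"] > 1000.0)

surviving_ids = np.nonzero(mask)[0]   # event indices to reconstruct later
print(f"kept {surviving_ids.size} of {n} events")
      </preformat>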
      <p>
        A scan-intensive preselection, based on a columnar,
compacted representation of data entities (i.e., particle-collision
events) will enable us to implement these objectives (see section 4
for preliminary work). Given the small expected result
cardinality of individual requests, incorporating users and their
domain expertise more closely into the process potentially offers
significant advantages. However, precisely translating the
user's intent into selective preselection queries requires a
suitable interface. In contrast to more general query
languages, such as SQL, a specialized domain-specific language
(DSL) offers the required expressivity and easier
incorporation of domain-specific optimizations. Furthermore, a DSL
can support efficient distributed caching strategies, as
previous results might be used to answer a (potentially wide)
range of upcoming queries [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Caching increases data
locality and spreads the workload in non-uniform data access
scenarios. We will further elaborate on this topic in the
related work section. To this end, the development of a
suitable query interface for physics analytics will be a major
topic in our upcoming research.
      </p>
      <p>Another important aspect of our approach will be the
construction of a compacted representation or synopsis.
Assuming that we are able to filter data based on this
representation alone, large portions can be discarded efficiently (via
scanning) and without involving expensive computations on
irrelevant data. Executing the computationally expensive
(reconstruction) pipeline only on the prefiltered,
intermediate result gives the same (final) result as applying the
pipeline to the whole data set directly. Of course, this
requirement prohibits the synopsis-based preselection from rejecting
any true result tuple. We provide more details on synopsis
requirements in section 5.1.</p>
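      <p>A minimal sketch of this guarantee, under the assumption that the synopsis predicate is a relaxation of (is implied by) the expensive predicate; both predicates below are toy stand-ins:</p>
      <preformat>
# Two-phase execution: a cheap synopsis scan followed by the expensive
# pipeline on the survivors. If expensive(e) implies cheap(e) for every
# event e (no false negatives), both plans return the same result.
def two_phase(events, cheap, expensive):
    survivors = [e for e in events if cheap(e)]      # fast, scan-based
    return [e for e in survivors if expensive(e)]    # costly, but rare

events = list(range(100))
cheap = lambda e: e % 2 == 0        # relaxed predicate: keeps a superset
expensive = lambda e: e % 10 == 0   # "true" predicate (stand-in for the
                                    # reconstruction-based selection)
assert two_phase(events, cheap, expensive) == \
       [e for e in events if expensive(e)]
      </preformat>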
    </sec>
    <sec id="sec-4">
      <title>4. RELATED WORK</title>
      <p>
        A first approach to address the problem via preselection,
named DeLorean [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], was developed in our group and is
going to be the entry point of this work. For details and a
preliminary evaluation of a proof of concept, see Kußmann
et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In the following, we give a brief description.
      </p>
      <p>The idea is to separate a query into a compute- and a
data-intensive part. Attributes that are commonly used in
queries and selective (i.e., attributes that are expected to be
involved in selective predicates) are precalculated during data
acquisition and collected as columns in a single table, resulting in a
much smaller representation (see fig. 1a). In order to avoid
evaluations of the expensive reconstruction, a fast scan of
this compact synopsis is supposed to discard a large
number of tuples (events), reducing the (query-specific)
computational effort to a sufficiently small superset of the true
result (fig. 1b).</p>
      <p>The synopsis lookup itself is efficiently expressed in
relational terms, replacing reconstruction operations with
scan-intensive SQL operators.</p>
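      <p>The following sketch expresses such a preselection as a plain relational scan; sqlite3 serves here only as a stand-in for a scalable engine such as Apache Drill, and the synopsis schema and values are invented for illustration:</p>
      <preformat>
import sqlite3

# Toy synopsis table: one row per event, columns are precalculated,
# commonly used attributes (names and values are made up).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE synopsis (event_id INTEGER, mass REAL, pt REAL)")
con.executemany("INSERT INTO synopsis VALUES (?, ?, ?)",
                [(1, 5300.0, 1500.0), (2, 4100.0, 300.0), (3, 5450.0, 2500.0)])

# Preselection: a fast, scan-intensive query replaces the expensive
# reconstruction for the bulk of the data; only the returned event ids
# are later fetched from the raw event store and reconstructed.
surviving = con.execute(
    "SELECT event_id FROM synopsis "
    "WHERE mass BETWEEN 5200.0 AND 5500.0 AND pt > 1000.0").fetchall()
print([row[0] for row in surviving])   # -> [1, 3]
      </preformat>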
      <p>[Figure 1: (a) Preprocessing: during data taking, synopsis attributes are extracted from the raw event store into a columnar synopsis. (b) Query processing: the preselection accesses the synopsis; surviving events proceed to the user analysis.]</p>
      <p>In a second step, surviving events are retrieved (the LHCb
data format allows tree-based seeking of single raw events via
identifier [<xref ref-type="bibr" rid="ref8">8</xref>]) and fed
into the stripping software, specifically configured for
individual requests, yielding the final result.</p>
      <p>
        The new synopsis lookup can be implemented by modern
cloud execution platforms, such as Apache Drill [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], making
use of their scalability and scan-beneficial columnar
storage layout [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Constructing a synopsis in a columnar layout
provides benefits such as reduced data volume, because only
relevant attributes have to be loaded and scanned.
Furthermore, columns offer improved compressibility and thus also
a tradeoff between processor load and bandwidth [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Geometric data summarization techniques in general are
an essential tool in many domains, having applications in
large-scale machine learning and databases, for instance.
These summaries can be roughly classified into two types,
coresets and sketches [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Similar to DeLorean, a summary
acts as a "proxy" for the complete data set, and algorithms
executed on this smaller set give an approximation of the
result. However, these summaries pursue a reduction of the
number of tuples, via density estimation or clustering, for
instance. In contrast, DeLorean, as it is now, applies a
dimensionality reduction, where less relevant attributes are
simply projected out. We conjecture that both fields,
dimensionality reduction as well as data summaries, can have
interesting applications in our research.
      </p>
      <p>A related concept, motivated similarly to DeLorean,
is that of vector approximation files (or VA-files) by Weber
et al. [13]. VA-files partition the data space into rectangular
cells and enumerate them with unique bit-strings, offering
scan-based preselection on a compact representation. Some
cases of nearest-neighbor search, for instance, can be decided
on this compacted representation alone.</p>
      <p>
        The scale-up vs. scale-out topic is discussed by Essertel
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The Flare project, an extension of the
distributed data analysis platform Apache Spark, tries to
maximize the potential of individual machines by means of
native code compilation. However, the existing LHCb software
stack, implemented in C++, is not reasonably migratable to
another platform, given the extensiveness of the code base
alone. To this end, any approach that wants to retain
at least some practical applicability to the LHCb event
retrieval problem has to make the integration of existing
software stacks possible.
      </p>
      <p>
        Generally, this can be done by providing efficient
evaluation strategies. Stepping outside of the stripping
perspective and positioning the "retrieval" problem in a
database context, the general schema to obtain events from the
global (raw-) event store E can be illustrated in terms of SQL
by a conjunction of expensive predicates (or "user-defined
functions") pᵢ:
      </p>
      <p>SELECT * FROM E WHERE p₁ AND p₂ AND … AND pₙ</p>
      <p>Actual physics analyses involve various types of
predicates. They are defined in terms of properties of
reconstructed decays or particles, instead of the plain (raw-event)
attributes that are typically expected in database queries.</p>
      <p>Clearly, most of the effort arises within those functions
that reach outside of the scope of relational algebra (SQL).
These "black boxes" are inherently hard to deal with by means of
"traditional" database technology.</p>
      <p>
        Hellerstein and Stonebraker [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] try to correctly take into
account the cost of expensive predicates when optimizing
query plans via operator reordering. User-defined functions
as well as sub-query predicates are sometimes incorrectly
assessed as "zero-time operations" by database optimizers.
Due to the simplistic relational structure of the above-mentioned
expression, general-purpose query plan optimization
techniques are unlikely to offer improvements. In contrast,
our approach is going to reduce the total number of
evaluations when answering user queries.
      </p>
      <p>
        Joglekar et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] try to reduce the number of explicit
evaluations of expensive predicate functions. In exchange
for a decrease in query accuracy, correlations between a function's
result and the value of a variable can be exploited to decide
whether the evaluation of the predicate is required or skippable.
However, the required correlation estimation is only feasible
for low-cardinality attributes, which rarely occur in the physics
context.
      </p>
      <p>
        Declarative languages, such as SQL or XQuery, were
developed to offer enhanced expressivity, enable specific
optimizations and enjoy widespread usage. The importance
of the declarative property of languages for big data analytics
is supported by the works of Fegaras [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (MRQL) and
Alexandrov et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (Emma).
      </p>
      <p>Just as SQL has roots in relational algebra (and tuple relational
calculus), MRQL and Emma have monoid homomorphisms
and monad comprehensions (respectively) as their formal
foundation. However, these languages address a (still) very
broad domain of queries and databases, omitting possible
optimization potential that cannot be detected in a more
general context. In our work, we are going to identify query
patterns that are specific to the domain of interest (LHCb
event analysis, in this case). Utilizing formal systems, such
as the ones used by SQL, MRQL or Emma, provides the
associated tools and insights as well as an environment to
reason about queries and specify transformation rules for
optimization.</p>
      <p>Given that every user has different requirements and is
interested in different types of data, a specialized execution
strategy can optimize individual requests and thereby
further improve overall performance.</p>
      <p>
        Building a new and customized DSL is costly and requires
knowledge both in the application domain and in programming
language design [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. DSL-compiler frameworks, such as
Delite [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], improve the creation of new languages by providing
abstract means to integrate domain-specific optimizations,
implicit parallelism and (native) code generation. The
insights and tools provided by this type of framework can
prove useful to retrieve information from user requests,
generate synopsis queries and improve the overall productivity
of analysts. We suggest additional ideas in this direction in
section 5.3.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the authors provide a formal definition and query
processing strategies for semantic caching. Semantic
caching, in contrast to page-based caching, utilizes a semantic
representation of the cache content in order to find items
that can totally (or partially) answer a given query. We
believe that this idea can be advantageous in our setting,
as queries in the LHCb context share considerable
"overlap". This stems from the fact that certain (sub-)decays
are "contained" in multiple decays, requiring the same
computations. In fact, the concept of shared computations is
already (manually) implemented in the current LHCb
software stack. Detecting and exploiting this overlap
automatically will therefore be beneficial.
      </p>
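      <p>A toy sketch of the idea, representing each cached result by the conjunctive range predicate that produced it; a query is cache-answerable when its region lies inside a cached region. The data layout and names are invented:</p>
      <preformat>
# Semantic cache: entries are (region, result) pairs, where a region maps
# attribute names to (lo, hi) bounds of a conjunctive range predicate.
cache = []

def answerable(cached_region, query_region):
    # every bound of the cached predicate must be implied by the query,
    # i.e. the query asks for a subset of what is already cached
    return all(attr in query_region
               and lo <= query_region[attr][0]
               and query_region[attr][1] <= hi
               for attr, (lo, hi) in cached_region.items())

def answer(query_region, scan_store):
    for region, result in cache:
        if answerable(region, query_region):      # hit: local re-filter only
            return [e for e in result
                    if all(lo <= e[a] <= hi
                           for a, (lo, hi) in query_region.items())]
    result = scan_store(query_region)             # miss: full synopsis scan
    cache.append((query_region, result))
    return result
      </preformat>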
    </sec>
    <sec id="sec-5">
      <title>5. ADDRESSING OPEN CHALLENGES</title>
      <p>
        The precise requirements on the synopsis content are yet
to be defined. So far, predicates were handpicked according
to their selectivity for a selected sample query. The involved
attributes were included in the synopsis and all values were
determined by an initial stripping pass. First benchmarks
are promising [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], but we need to generalize the findings
to provide performance indications for a range of (unseen)
queries. Also, inherent challenges of distributed many-user,
high-volume data processing need to be addressed.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5.1 Creating the synopsis</title>
      <p>To offer performance and correctness, even for new
queries, some general qualities of the synopsis can be declared:</p>
      <p>Applicability - The synopsis has to contain attributes
that are relevant for upcoming user queries, otherwise
we do not gain any advantage.</p>
      <p>Correctness - To prevent "false negatives", no event
that is actually relevant for a user request should be
rejected by the synopsis scan.</p>
      <p>Selectivity - To more than amortize the additional cost
of a synopsis scan, a sufficiently large number of events
needs to be rejected, preventing their expensive
reconstruction.</p>
      <p>Note that events that are selected but in fact uninteresting
("false positives") are permissible, although undesired. They
are expected to be rejected by the second step and merely
deteriorate selectivity and therefore performance, which is less
crucial than correctness.</p>
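      <p>One way the correctness requirement could become operational: if a synopsis column were stored at reduced precision (an assumption on our part, to keep the summary compact), the scan predicate has to be widened by the maximal representation error. A minimal sketch with an invented quantization grid:</p>
      <preformat>
import math

# Assumption for this sketch: the synopsis stores an attribute quantized
# to a fixed grid. The preselection bounds must then be widened by the
# quantization error so that no true result ("false negative") is lost.
STEP = 10.0   # illustrative grid width of the synopsis column

def quantize(x):
    return STEP * math.floor(x / STEP)

def widened_bounds(lo, hi):
    # any x in [lo, hi] has quantize(x) in (lo - STEP, hi]
    return lo - STEP, hi

lo, hi = 5200.0, 5500.0
qlo, qhi = widened_bounds(lo, hi)
x = 5203.7                        # a true-result value near the lower cut
assert lo <= x <= hi
assert qlo <= quantize(x) <= qhi  # survives the synopsis scan, as required
      </preformat>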
      <p>To serve an adequate range of user request topics, a
sufficiently broad selection of synopsis attributes has to cover the
most frequent analyses. As a first approach, the stripping
line formulations, even though inherently "vague",
involve many commonly used attributes, because the complete
set is designed to cover a wide range of analysis subjects at
the LHCb. A stripping line is formulated in multiple
consecutive steps that define specific reconstruction procedures.
Those steps also include filter statements on properties
determined during this procedure. Also, there is considerable
overlap between the lines, as several steps are frequently
shared. Many particles and even some decays are involved
in multiple analyses and usually involve the same or only
slightly different predicates. Investigating criteria
to assess attributes and their (combined) selective power for
upcoming queries is important, as it represents the next logical
step towards a systematic creation of an event synopsis.</p>
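      <p>As a first-cut sketch of such a systematic extraction, one could count how often each attribute occurs in predicates across all stripping lines and take the most common ones as synopsis candidates; the line definitions below are invented stand-ins for the real stripping configuration:</p>
      <preformat>
from collections import Counter

# Invented stand-ins for stripping lines and the attributes their
# predicates reference; the real configuration is far larger.
stripping_lines = {
    "line_a": ["mass", "pt", "vertex_chi2"],
    "line_b": ["mass", "pt", "ip_chi2"],
    "line_c": ["pt", "vertex_chi2", "doca"],
}

# count attribute usage across all lines; frequent attributes are
# candidates for inclusion in the synopsis
usage = Counter(attr for attrs in stripping_lines.values() for attr in attrs)
synopsis_candidates = [attr for attr, n in usage.most_common() if n >= 2]
print(synopsis_candidates)   # -> ['pt', 'mass', 'vertex_chi2']
      </preformat>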
      <p>A successful preselection strategy has to offer a
selectivity comparable to stripping lines. Given such a selective
synopsis, the overall number of events can be reduced to a
manageable portion, enabling the user to perform the
reconstruction in a reasonable amount of time. Note that a
"local restripping" is already done by analysts in practice on
manually selected stripping line results in order to refine the
event selection according to individual user requirements.</p>
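      <p>A back-of-the-envelope cost model illustrates the required amortization; all numbers below are illustrative assumptions, not measurements:</p>
      <preformat>
# Illustrative cost model for the amortization argument above.
n_events    = 1e9     # events in the store
scan_cost   = 1e-7    # seconds per event for the synopsis scan
reco_cost   = 1e-2    # seconds per event for full reconstruction
selectivity = 1e-4    # fraction of events surviving the preselection

naive    = n_events * reco_cost                                # reconstruct all
twophase = n_events * scan_cost + n_events * selectivity * reco_cost

print(f"naive: {naive:.0f}s, two-phase: {twophase:.0f}s")
# The two-phase plan wins whenever
#   scan_cost + selectivity * reco_cost < reco_cost,
# i.e. for any selectivity well below 1 once scan_cost << reco_cost.
      </preformat>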
    </sec>
    <sec id="sec-7">
      <title>5.2 Result Caching</title>
      <p>Depending on the size of the synopsis, distributing
redundant copies (or only the relevant columns to cover a single
topic) enables the execution of the preselection
independently for different work groups, possibly even single users.
This way, we are able to shift the "expensive query
predicate" problem further towards a data-serving problem, where
only the actually interesting (but unprocessed) events have to
be efficiently handed over to many users.</p>
      <p>However, the selected data might be resident at
different sites that are geographically dispersed (as is the case
for CERN/LHCb) and/or busy, introducing latencies.
Non-uniform data access (over the data sites) increases
contention in both network and local resources. Note that this
issue also arises in settings where queries do not involve
(network-)communication-intensive algorithms, such as the
LHCb data analysis.</p>
      <p>Adequate caching mechanisms can greatly improve the
ability to serve data by adaptively holding frequently
requested (sets of) events, reducing data transfer volumes
and serving latencies. Also, with knowledge about the data
(-dependencies), events could be cached speculatively. For
example, decays that appear to be very similar to the
desired decay are sometimes explicitly fetched to exclude them
properly from the analysis.</p>
    </sec>
    <sec id="sec-8">
      <title>5.3 Query Specification Interface for Physics Analyses</title>
      <p>Developing a dedicated interface for LHCb physics
analysis queries, such as a DSL, offers several benefits for this
project. In addition to the general advantage of improved
ease-of-use for analysts, such a language can have
performance-critical implications by guiding query plan
optimization:</p>
      <p>Declarative formulation in higher-level semantics
enables users to specify their intent while relieving them
from being familiar with implementation details.</p>
      <p>Automatic generation of preselection queries that can
be evaluated on the synopsis. This enables the user to
specify queries without knowing the synopsis schema.</p>
      <p>Identification of new synopsis attributes by
determining overlap or "similarity" between queries. This
information could serve as a foundation to adaptively
add or remove synopsis attributes, based on common
query "topics". The performance/selectivity of upcoming
queries could be improved by iteratively replacing or
adding information to the synopsis.</p>
      <p>Selectivity approximation of user requests with
precalculated statistics, such as value distributions and
correlations of synopsis attributes. Offering a mechanism
to estimate the reduction rate of a query beforehand
allows quick rejection of infeasible queries, solely based
on statistics and, more importantly, without starting
actual computations or data transfer requests (see the
sketch after this list).</p>
      <p>
        Improve (opportunistic) caching. Events that only
narrowly fail to match predicates (and other
"similarities", such as the ones mentioned above) could
become relevant and therefore prove beneficial to have in
a cache, especially in an interactive test-and-see query
refinement scenario. Also, result sets that are supersets of
other results and are contained in a cache can be used
to answer particular (upcoming) queries (see
cache answerability [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]).
      </p>
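      <p>The selectivity-approximation point of the list above could, for instance, be prototyped with precomputed histograms of synopsis attributes (assuming independence when combining several predicates); the data and numbers below are invented:</p>
      <preformat>
import numpy as np

# Offline pass: build a histogram of a synopsis attribute once.
values = np.random.normal(5300.0, 150.0, 1_000_000)
hist, edges = np.histogram(values, bins=256)
total = hist.sum()

def estimate_selectivity(lo, hi):
    # fraction of events whose attribute falls into [lo, hi], read off
    # the histogram instead of scanning the data (bins fully inside the
    # range only; a coarse but cheap estimate)
    inside = (edges[:-1] >= lo) & (edges[1:] <= hi)
    return hist[inside].sum() / total

sel = estimate_selectivity(5200.0, 5500.0)
print(f"estimated reduction: keep ~{sel:.1%} of events")
# A query whose estimate is far too unselective can be rejected (or
# refined) before any computation or data transfer starts.
      </preformat>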
      <p>
        Multiple of these aspects rely on the ability to analyse the
structure of queries. Automatic, non-trivial optimizations,
such as filter push-downs, require the system to reason about
(possibly higher-order) operations and their arguments, which
is hard to do with application programming interfaces (APIs).
Having a well-defined DSL, backed by a suitable formal
foundation, allows the definition of abstract transformation
rules and eases cost-based optimization [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Also, answering
formally motivated questions, such as cache answerability,
requires the means to infer implication and satisfiability
properties of queries [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Furthermore, disconnecting the
interface from the execution platform offers the ability to
keep the same interface if the backend gets replaced. This
is particularly useful for large projects that are expected to
operate over many years.
      </p>
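      <p>To illustrate the direction, a deliberately tiny sketch of a declarative cut specification from which a synopsis preselection query is generated; the Cut class, attribute names, and generated schema are all invented, and a real DSL would be far richer:</p>
      <preformat>
from dataclasses import dataclass

@dataclass
class Cut:
    """One declarative range predicate on a physics attribute."""
    attribute: str
    lo: float
    hi: float

def to_synopsis_sql(cuts):
    """Generate the scan query; the user never sees the synopsis schema."""
    conds = " AND ".join(
        f"{c.attribute} BETWEEN {c.lo} AND {c.hi}" for c in cuts)
    return f"SELECT event_id FROM synopsis WHERE {conds}"

query = [Cut("mass", 5200.0, 5500.0), Cut("pt", 1000.0, 1e9)]
print(to_synopsis_sql(query))
# -> SELECT event_id FROM synopsis
#    WHERE mass BETWEEN 5200.0 AND 5500.0 AND pt BETWEEN 1000.0 AND 1e+09
      </preformat>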
    </sec>
    <sec id="sec-10">
      <title>6. ROADMAP</title>
      <p>
        A synopsis that supports efficient, scan-based event
selection for new queries promises major performance benefits.
But many open challenges exist. Currently, it is not
obvious what kind of information should form the synopsis and
how the choice will impact performance in a more general
sense. Therefore, we first need to confirm the ideas
introduced in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] by developing means to systematically extract
"frequent" synopsis attributes. Getting a better
understanding of the implications of attribute choice and their impact
on performance and applicability is necessary to support a
sufficiently wide range of accurate queries. Also,
alternatives to "carefully selected reconstruction attributes" as the
synopsis's constituents should be explored.
      </p>
      <p>The idea to incorporate domain-specific knowledge for
optimization lends itself to interesting concepts in data
processing, where general-purpose techniques fail to offer
significant gains. Developing new and universal schemes for
applying domain knowledge for performance allows the transfer
of our findings to other situations. Extracting information
from user queries in order to derive an execution strategy is
our first step in this direction.</p>
      <p>After having a precise and selective mechanism in place,
the "data serving" aspect of the problem will offer multiple
incentives for future research.</p>
    </sec>
    <sec id="sec-11">
      <title>7. ACKNOWLEDGMENTS</title>
      <p>This work has been supported by the German Ministry of
Education and Research (BMBF), project Industrial Data
Science, and by Deutsche Forschungsgemeinschaft (DFG),
Collaborative Research Center SFB 876, project C5.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Drill -</surname>
          </string-name>
          Schema
          <article-title>-free SQL for Hadoop, NoSQL and Cloud Storage</article-title>
          . https://drill.apache.org/. Accessed:
          <fpage>2019</fpage>
          -03-03.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Alexandrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katsifodimos</surname>
          </string-name>
          , F. Schuler, L.
          <string-name>
            <surname>Thamsen</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Kao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Herb</surname>
            , and
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Markl</surname>
          </string-name>
          .
          <article-title>Implicit parallelism through deep language embedding</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15</source>
          , pages
          <fpage>47</fpage>
          {
          <fpage>61</fpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Essertel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Y.</given-names>
            <surname>Tahboub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Decker</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. J. Brown</surname>
            , K. Olukotun, and
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Rompf</surname>
          </string-name>
          . Flare:
          <article-title>Native compilation for heterogeneous workloads in apache spark</article-title>
          .
          <source>CoRR, abs/1703.08219</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Fegaras</surname>
          </string-name>
          .
          <article-title>An algebra for distributed big data analytics</article-title>
          .
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Hellerstein</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Stonebraker</surname>
          </string-name>
          .
          <article-title>Predicate migration: Optimizing queries with expensive predicates</article-title>
          .
          <source>pages 267{276</source>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Joglekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parameswaran</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Re</surname>
          </string-name>
          .
          <article-title>Exploiting correlations for expensive predicate evaluation</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15</source>
          , pages
          <fpage>1183</fpage>
          {
          <fpage>1198</fpage>
          , New York, NY, USA,
          <year>2015</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khachatryan</surname>
          </string-name>
          et al.
          <article-title>Observation of the rare Bs0 ! + decay from the combined analysis of CMS and LHCb data</article-title>
          .
          <source>Nature</source>
          ,
          <volume>522</volume>
          :
          <fpage>68</fpage>
          {
          <fpage>72</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ku mann</surname>
          </string-name>
          , M. Berens,
          <string-name>
            <given-names>U.</given-names>
            <surname>Eitschberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kilic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lindemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Niet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schellenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stevens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wishahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Spaan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Teubner</surname>
          </string-name>
          .
          <article-title>Delorean: A storage layer to analyze physical data at scale</article-title>
          . In B.
          <string-name>
            <surname>Mitschang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Nicklas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Leymann</surname>
            , H. Schoning, M. Herschel,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Teubner</surname>
            , T. Harder,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Kopp</surname>
          </string-name>
          , and M. Wieland, editors,
          <source>Datenbanksysteme fur Business</source>
          ,
          <source>Technologie und Web (BTW</source>
          <year>2017</year>
          ), pages
          <fpage>413</fpage>
          {
          <fpage>422</fpage>
          . Gesellschaft fur Informatik,
          <source>Bonn</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Phillips</surname>
          </string-name>
          .
          <article-title>Coresets and sketches</article-title>
          .
          <source>CoRR, abs/1601.00617</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. H.</given-names>
            <surname>Dunham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kumar</surname>
          </string-name>
          .
          <article-title>Semantic caching and query processing</article-title>
          .
          <source>IEEE Trans. on Knowl. and Data Eng</source>
          .,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <volume>192</volume>
          {
          <fpage>210</fpage>
          ,
          <string-name>
            <surname>Jan</surname>
          </string-name>
          .
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>A. K. Sujeeth</surname>
          </string-name>
          ,
          <string-name>
            <surname>K. J. Brown</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Rompf</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Cha</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Odersky</surname>
            , and
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Olukotun</surname>
          </string-name>
          .
          <article-title>Delite: A compiler architecture for performance-oriented embedded domain-speci c languages</article-title>
          .
          <source>ACM Trans. Embed. Comput. Syst.</source>
          ,
          <volume>13</volume>
          (4s):
          <volume>134</volume>
          :1{
          <fpage>134</fpage>
          :
          <fpage>25</fpage>
          ,
          <string-name>
            <surname>Apr</surname>
          </string-name>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>