     Execution Strategies for Compute Intensive Queries in
                        Particle Physics

                                                        Maximilian Berens
                                                       TU Dortmund University
                                           maximilian.berens@tu-dortmund.de


ABSTRACT
Data analysis in many scientific domains, such as high energy physics, pushes the available resources of even big research collaborations to their limit, because it not only requires huge amounts of data but also compute intensive calculations. Current approaches require increased storage and computing resources, lack sufficient flexibility and are still far away from interactivity. In this paper we present ideas to cope with the challenges posed by compute-intensive analyses of high-volume data in a many-user, resource-restricted environment. The extreme selectivity of user queries (which is inherent in high energy physics analyses, for instance) permits us to reduce the computational effort to a small portion, when irrelevant data can be discarded by more efficient means. Our focus lies on providing execution strategies for analysis queries, guided by the expertise of the user and executed as a scalable, scan-based preselection (introduced in DeLorean [8]). Future research will encompass the development of a compact data summary that facilitates fast, columnar scan queries, as well as a domain specific language that provides us with the required information to generate them.

Keywords
Approximate Query Processing, Expensive Predicates, Big Data, Resource-Constrained Data Analysis, Domain Specific Language, High Energy Physics

1. INTRODUCTION
The ability to accumulate and use increasing amounts of data in many science domains opens enticing prospects of answering open questions of humanity. High energy physics presents itself as a prominent example of such a domain, where scientific conclusions are drawn from statistical evidence that is gained by analysing huge quantities of data. Analysis tools at this scale have to cope not only with the sheer volume of data, but also with the complexity of the involved computations, the limited availability of resources, as well as the number and variety of user requests. CERN's Large Hadron Collider (LHC), close to Geneva, Switzerland, houses multiple experiments, each one dedicated to different questions in particle physics. One is the LHC beauty experiment (LHCb), where various aspects of the differences between matter and antimatter are studied. A common LHCb physics analysis concerns itself with a particular (type of) decay or particle. In order to observe them, protons are accelerated to very high energies and brought to collision. Over the course of a year, up to 40 million collisions are measured every second, prefiltered and stored. Before performing the actual analysis, specific recordings of these collisions, termed events, have to be selected from a global event store. Decays of interest are usually very rare. For example, B0s → µ+µ− was found only 3 times in 10 billion events [7]. A significant portion of a physics analysis consists of isolating particular types of events by carefully defining query predicates.

Detector measurements, in their initial, raw state, are not immediately useful and require computationally expensive and decay-specific reconstruction into physically meaningful entities. Filtering events for user analyses is done by enforcing predicates on these high-level features.

The overall productivity of many scientific projects is limited by the amount of data that can be processed and stored. At the LHCb, this is because the measurements cannot be retained at the same rate as they are taken [12]. Still, around 20 petabytes of "raw event" data go to tape storage every year. Adding the long reconstruction time per event into the equation, naive on-the-fly computations over all available events for individual user requests are intractable. Offering tools to quickly query the available data without qualitative drawbacks, while optimally utilizing the limited available resources, is essential for the success of a collaboration such as the LHCb and impacts the pace of scientific discovery in general.

A major upgrade of the detector will start recording new data in 2021, increasing the data volume (both rate and size of events [12]) even further and signifying the necessity of new ideas.

The remainder of this paper is structured as follows. First, we give a brief overview of the concepts currently in place at the LHCb, their drawbacks and the consequent, general directions of our research. Preliminary and other related work are covered in section 4. Section 5 discusses open challenges that we are going to address in the upcoming years. Finally, section 6 summarizes these ideas in a roadmap.

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany.
Copyright is held by the author/owner(s).
2. DATA PROCESSING AT THE LHCB
In order to prevent analysts from querying and reconstructing the whole set of available data for every query, a centrally scheduled reconstruction-and-preselection procedure, called stripping, is performed. Several hundred "stripping lines", predefined filter queries, reduce the volume of data to an acceptable size by materializing their results in separate subsets.¹ A stripping line typically tries to reconstruct a certain decay and filters events in the process. The criteria to select events tend to be "vague" (in their predicates), because multiple analyses are supposed to be done on the result of a single line or a small subset thereof. The stripping is initially applied during data acquisition and has to be redone later, when changes in the reconstruction software or overall user demand require it.² Generally, this approach is problematic for multiple reasons:

  • In addition to the raw-event data necessary for reconstruction, materialized results occupy scarce disc capacity.

  • Results have to conform with user requirements, which are generally more specific/strict. This usually requires users to "restrip" a selected subset with a customized configuration.

  • Stripping line predicates are designed with respect to strict limits on available resources. This can conflict with physically motivated predicate parametrization and negatively impact the quality of conclusions, as important data might be kept out of the analyst's reach. Even small changes in predicates (e.g., "loosening" of inequalities or "shifting" of ranges) are not directly implementable, because the stripping is rarely redone.

  • Predefined selections are unlikely to cover unforeseen analysis use cases and thus require the definition of a completely new stripping line.

Approaches that restrict the queryable data set by predefined criteria are bound to lack flexibility, because assumptions can change and individual users have different requirements. In addition, the stripping still requires long job waiting periods and additional resources.

¹ All lines combined drop 95% of all event data; a single line must not exceed 0.05% of the overall data volume (on average).
² Waiting times of several months for a new stripping version are not unusual.

3. OUR OBJECTIVES
The computationally expensive reconstruction is trivially parallelizable, because events are completely independent and small in size (∼hundreds of kilobytes). However, relying on data parallelism alone guarantees neither general scalability nor sufficient resource utilization. New solutions need to scale up by leveraging available hardware features. In addition, they need to scale out in order to avoid bottlenecks caused by resource contention in large-scale multi-user environments.

For the purpose of pushing analytics closer towards interactivity, we identify the following goals:

  • A significant reduction of the data volume that is involved in individual user queries.

  • Limiting the execution of expensive computations to the result set, thus minimizing the overall compute load.

  • Enabling efficient usage of modern hardware features (e.g., deep cache hierarchies, advanced processor instructions and multi-core architectures).

  • Reusing information by caching results and intermediate computations for upcoming queries.

A scan-intensive preselection, based on a columnar, compacted representation of data entities (i.e., particle-collision events), will enable us to implement these objectives (see section 4 for preliminary work). Given the small expected result cardinality of individual requests, incorporating users and their domain expertise closer into the process potentially offers significant advantages. However, precisely translating the user's intent into selective preselection queries requires a suitable interface. In contrast to more general query languages, such as SQL, a specialized domain specific language (DSL) offers the required expressivity and easier incorporation of domain specific optimizations. Furthermore, a DSL can support efficient distributed caching strategies, as previous results might be used to answer a (potentially wide) range of upcoming queries [10]. Caching increases data locality and spreads the workload in non-uniform data access scenarios. We will further elaborate on this topic in the related works section. To this end, the development of a suitable query interface for physics analytics will be a major topic in our upcoming research.

Another important aspect of our approach will be the construction of a compacted representation or synopsis. Assuming that we are able to filter data based on this representation alone, large portions can be discarded efficiently (via scanning) and without involving expensive computations on irrelevant data. Executing the computationally expensive (reconstruction) pipeline only on the pre-filtered, intermediate result gives the same (final) result as applying the pipeline to the whole data set directly. Of course, this requirement prohibits the synopsis-based preselection from rejecting any true result tuple. We provide more details on synopsis requirements in section 5.1.

4. RELATED WORKS
A first approach to address the problem via preselection, named DeLorean [8], was developed in our group and is going to be the entry point of this work. For details and a preliminary evaluation of a proof of concept, see Kußmann et al. [8]. In the following, we give a brief description.

The idea is to separate a query into a compute-intensive and a data-intensive part. Attributes that are commonly used in queries and selective (i.e., attributes that are expected to be involved in selective predicates) are precalculated during data acquisition and collected as columns in a single table, resulting in a much smaller representation (see fig. 1a). In order to avoid evaluations of the expensive reconstruction, a fast scan of this compact synopsis is supposed to discard a large number of tuples (events), reducing the (query-specific) computational effort to a sufficiently small superset of the true result (fig. 1b).

The synopsis lookup itself is efficiently expressed in relational terms, replacing reconstruction operations with scan-intensive "SQL" operators.
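To illustrate the preprocessing side of this idea (fig. 1a), the following Python sketch builds a columnar synopsis from a raw event store. It is a minimal illustration, not the actual DeLorean implementation; the reconstruction routine and the attribute names (pt_sum, vertex_chi2) are hypothetical placeholders for hand-picked, selective quantities.

    def reconstruct(raw_event):
        # Stand-in for the expensive, decay-specific reconstruction; here the
        # derived quantities are simply read off the raw event.
        return {"pt_sum": raw_event["pt_sum"],
                "vertex_chi2": raw_event["vertex_chi2"]}

    SYNOPSIS_ATTRIBUTES = ["pt_sum", "vertex_chi2"]  # hand-picked, selective attributes

    def build_synopsis(raw_event_store):
        """Collect the selective attributes of every event as columns of one table."""
        synopsis = {"event_id": []}
        synopsis.update({attr: [] for attr in SYNOPSIS_ATTRIBUTES})
        for event_id, raw_event in enumerate(raw_event_store):
            derived = reconstruct(raw_event)  # computed once, at data-taking time
            synopsis["event_id"].append(event_id)
            for attr in SYNOPSIS_ATTRIBUTES:
                synopsis[attr].append(derived[attr])
        return synopsis

    raw_event_store = [{"pt_sum": 1.2, "vertex_chi2": 3.0},
                       {"pt_sum": 7.5, "vertex_chi2": 0.8}]
    synopsis = build_synopsis(raw_event_store)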
[Figure 1, schematic: (a) Preprocessing — data-taking → raw event store → extract → columnar synopsis. (b) Query processing — ① synopsis access, ② scan, ③ fetch relevant events → user analysis.]

Figure 1: The DeLorean storage layer. At data acquisition time, DeLorean extracts a compact synopsis of the events. At
analysis time, the synopsis is scanned and only a relevant subset is fetched and reconstructed for user analyses.
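The scan step of fig. 1b can be sketched in a similarly simplified way: only the columns referenced by the (cheap) predicates are touched, and only the identifiers of surviving events are passed on. Again, the attribute names, values and predicate bounds are hypothetical and merely stand in for an analyst's selection.

    # Columnar synopsis as produced at data-taking time (hypothetical values).
    synopsis = {
        "event_id":    [0, 1, 2, 3, 4],
        "pt_sum":      [1.2, 7.5, 0.3, 9.1, 4.4],
        "vertex_chi2": [3.0, 0.8, 5.6, 0.4, 2.1],
    }

    def preselect(synopsis, predicates):
        """Scan only the referenced columns; return ids of events that survive."""
        survivors = []
        for row, event_id in enumerate(synopsis["event_id"]):
            if all(pred(synopsis[col][row]) for col, pred in predicates.items()):
                survivors.append(event_id)
        return survivors

    # Cheap, scan-friendly bounds standing in for an analyst's selection; they
    # must not reject any event that the full analysis would accept.
    candidate_ids = preselect(synopsis, {
        "pt_sum":      lambda v: v > 4.0,
        "vertex_chi2": lambda v: v < 3.0,
    })
    # Only the events in candidate_ids are fetched from the raw event store and
    # handed to the expensive, request-specific reconstruction.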


In a second step, surviving events are retrieved³ and fed into the stripping software, specifically configured for individual requests, yielding the final result.

³ The LHCb data format allows tree-based seeking of single raw events via identifier [8].

The new synopsis lookup can be implemented on modern cloud execution platforms, such as Apache Drill [1], making use of their scalability and scan-beneficial columnar storage layout [8]. Constructing the synopsis in a columnar layout provides benefits such as reduced data volume, because only relevant attributes have to be loaded and scanned. Furthermore, columns offer improved compressibility and thus also a tradeoff between processor load and bandwidth [8].

Geometric data summarization techniques in general are an essential tool in many domains, having applications in large-scale machine learning and databases, for instance. These summaries can be roughly classified into two types, coresets and sketches [9]. Similar to DeLorean, a summary acts as a "proxy" for the complete data set, and algorithms executed on this smaller set give an approximation of the result. However, these summaries pursue a reduction of the number of tuples, via density estimation or clustering, for instance. In contrast, DeLorean, as it is now, applies a dimensionality reduction, where less relevant attributes are simply projected out. We conjecture that both fields, dimensionality reduction as well as data summaries, can have interesting applications in our research.

A related concept with a motivation similar to DeLorean is that of vector approximation files (VA-files) by Weber et al. [13]. VA-files partition the data space into rectangular cells and enumerate them with unique bit-strings, offering a scan-based preselection on a compact representation. Some cases of nearest neighbor search, for instance, can be decided on this compacted representation alone.

The scale-up vs. scale-out topic is discussed by Essertel et al. [3]. The Flare project, an extension of the distributed data analysis platform Apache Spark, tries to maximize the potential of individual machines by means of native code compilation. However, the existing LHCb software stack, implemented in C++, is not reasonably migratable to another platform, given the extensiveness of the code base alone. To this end, any approach that aims at practical applicability to the LHCb event retrieval problem has to allow the integration of existing software stacks.

Generally, this can be done by providing efficient evaluation strategies. Stepping outside of the stripping perspective and positioning the "retrieval" problem in a database context, the general schema to obtain events from the global (raw) event store E can be illustrated in terms of SQL, involving a conjunction of expensive predicates (or "user defined functions") p1, ..., pn:

    SELECT * FROM E WHERE p1 AND p2 AND ... AND pn

Actual physics analyses involve various types of predicates. They are defined in terms of properties of reconstructed decays or particles, instead of the plain (raw-event) attributes that are typically expected in database queries.

Clearly, most of the effort arises within those functions, which reach outside of the scope of relational algebra (SQL). These "black boxes" are inherently hard to deal with by means of "traditional" database technology.
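To make the intended rewrite explicit: under the assumption that a cheap, scan-friendly condition on a precomputed synopsis attribute is a necessary condition of the expensive predicate conjunction (i.e., it never rejects a true result event), evaluating the expensive predicates only on the preselected candidates yields the same result while requiring fewer expensive evaluations. The following Python sketch, with made-up data, attribute names and predicates, illustrates this equivalence.

    # Made-up events and predicates, purely for illustration.
    EVENTS = [{"id": i, "pt_sum": p} for i, p in enumerate([1.2, 7.5, 0.3, 9.1, 4.4])]
    calls = 0

    def expensive_predicate(event):
        global calls
        calls += 1                      # stands in for costly reconstruction work
        return event["pt_sum"] > 5.0

    def naive(events):
        return [e["id"] for e in events if expensive_predicate(e)]

    def two_step(events):
        candidates = [e for e in events if e["pt_sum"] > 4.0]   # cheap synopsis scan
        return [e["id"] for e in candidates if expensive_predicate(e)]

    calls = 0
    full_result = naive(EVENTS)
    naive_calls = calls
    calls = 0
    preselected_result = two_step(EVENTS)
    two_step_calls = calls
    # Same answer, fewer expensive evaluations -- but only because the scan
    # condition (pt_sum > 4.0) is implied by the expensive one (pt_sum > 5.0).
    assert full_result == preselected_result and two_step_calls < naive_calls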
Hellerstein and Stonebraker [5] try to correctly take into account the cost of expensive predicates when optimizing query plans via operator reordering. User defined functions as well as sub-query predicates are sometimes incorrectly assessed as "zero-time operations" by database optimizers. Due to the simplistic relational structure of the above mentioned expression, general purpose query plan optimization techniques are unlikely to offer improvements. In contrast, our approach is going to reduce the total number of evaluations required to answer user queries.

Joglekar et al. [6] try to reduce the number of explicit evaluations of expensive predicate functions. In exchange for a decrease in query accuracy, correlations between a function's result and the value of a variable can be exploited to decide whether the evaluation of the predicate is required or skippable. However, the required correlation estimation is only feasible for low-cardinality attributes, which rarely occur in the physics context.

Declarative languages, such as SQL or XQuery, were developed to offer enhanced expressivity, enable specific optimizations and enjoy widespread usage. The importance of the declarative property of languages for big data analytics is supported by the works of Fegaras [4] (MRQL) and Alexandrov et al. [2] (Emma).

Just as SQL has its roots in relational algebra (and tuple relational calculus), MRQL and Emma have monoid homomorphisms and monad comprehensions (respectively) as their formal foundation. However, these languages address a (still) very broad domain of queries and databases, omitting optimization potential that cannot be detected in such a general context. In our work, we are going to identify query patterns that are specific to the domain of interest (LHCb event analysis, in this case). Utilizing formal systems, such as the ones used by SQL, MRQL or Emma, provides the associated tools and insights as well as an environment to reason about queries and specify transformation rules for optimization.

Given that every user has different requirements and is interested in different types of data, a specialized execution strategy can optimize individual requests and thereby further improve overall performance.

Building a new and customized DSL is costly and requires knowledge both in the application domain and in programming language design [11]. DSL-compiler frameworks, such as Delite [11], improve the creation of new languages by providing abstract means to integrate domain specific optimizations, implicit parallelism and (native) code generation. The insights and tools provided by this type of framework can prove useful to retrieve information from user requests, generate synopsis queries and improve the overall productivity of analysts. We suggest additional ideas in this direction in section 5.3.

In [10], the authors provide a formal definition and query processing strategies for semantic caching. Semantic caching, in contrast to page-based caching, utilizes a semantic representation of the cache content in order to find items that can totally (or partially) answer a given query. We believe that this idea can be advantageous in our setting, as queries in the LHCb context share considerable "overlap". This stems from the fact that certain (sub-)decays are "contained" in multiple decays, requiring the same computations. In fact, the concept of shared computations is already (manually) implemented in the current LHCb software stack. Detecting and exploiting this overlap automatically will therefore be beneficial.

5. ADDRESSING OPEN CHALLENGES
The precise requirements on the synopsis content are yet to be defined. So far, predicates were handpicked according to their selectivity for a selected sample query. The involved attributes were included in the synopsis and all values determined by an initial stripping pass. First benchmarks are promising [8], but we need to generalize the findings to provide performance indications for a range of (unseen) queries. Also, inherent challenges of distributed many-user, high-volume data processing need to be addressed.

5.1 Creating the Synopsis
To offer performance and correctness, even for new queries, some general qualities of the synopsis can be declared:

  • Applicability - The synopsis has to contain attributes that are relevant for upcoming user queries, otherwise we do not gain any advantage.

  • Correctness - To prevent "false negatives", no event that is actually relevant for a user request should be rejected by the synopsis scan.

  • Selectivity - To more-than-amortize the additional cost of a synopsis scan, a sufficiently large number of events needs to be rejected, preventing their expensive reconstruction.

Note that events that are selected but in fact uninteresting ("false positives") are permissible, although undesired. They are expected to be rejected by the second step and merely deteriorate selectivity and therefore performance, which is less crucial than correctness.

To serve an adequate range of user request topics, a sufficiently broad selection of synopsis attributes has to cover the most frequent analyses. As a first approach, the stripping line formulations, even though being inherently "vague", involve many commonly used attributes, because the complete set is designed to cover a wide range of analysis subjects at the LHCb. A stripping line is formulated in multiple consecutive steps that define specific reconstruction procedures. Those steps also include filter statements on properties determined during this procedure. Also, there is considerable overlap between the lines, as several steps are frequently shared. Many particles and even some decays are involved in multiple analyses and usually involve the same or just slightly different predicates. Investigating criteria to assess attributes and their (combined) selective power for upcoming queries is therefore important, as it represents the next logical step towards a systematic creation of an event synopsis.

A successful preselection strategy has to offer a selectivity comparable to stripping lines. Given such a selective synopsis, the overall number of events can be reduced to a manageable portion, enabling the user to perform the reconstruction in a reasonable amount of time. Note that a "local restripping" is already done by analysts in practice on manually selected stripping line results in order to refine the event selection according to individual user requirements.

5.2 Result Caching
Depending on the size of the synopsis, distributing redundant copies (or only the relevant columns to cover a single topic) enables the execution of the preselection independently for different work groups, possibly even single users. This way, we are able to shift the "expensive query predicate" problem further towards a data serving problem, where only the actually interesting (but unprocessed) events have to be efficiently handed over to many users.

However, the selected data might be resident at different sites that are geographically dispersed (as is the case for CERN/LHCb) and/or busy, introducing latencies. Non-uniform data access (over the data sites) increases contention in both network and local resources. Note that this issue also arises in settings where queries do not involve (network-)communication-intensive algorithms, such as the LHCb data analysis.

Adequate caching mechanisms can greatly improve the ability to serve data by adaptively holding frequently requested (sets of) events, greatly reducing data transfer volumes and serving latencies. Also, with knowledge about the data (and its dependencies), events could be cached speculatively. For example, decays that appear to be very similar to the desired decay are sometimes explicitly fetched in order to exclude them properly from the analysis.
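A simple way to picture such a cache lookup is to describe both cached results and incoming queries by per-attribute intervals over synopsis columns and to test region containment. The Python sketch below is only a toy approximation of the formal machinery in [10]; the attribute names and bounds are hypothetical.

    def contains(cached, query):
        """True if the cached predicate region is a superset of the query region,
        so the cached events can answer the query after a local refilter."""
        unbounded = (float("-inf"), float("inf"))
        for attr in set(cached) | set(query):
            c_lo, c_hi = cached.get(attr, unbounded)
            q_lo, q_hi = query.get(attr, unbounded)
            if not (c_lo <= q_lo and q_hi <= c_hi):
                return False
        return True

    cache = [
        {"pt_sum": (4.0, 100.0), "vertex_chi2": (0.0, 5.0)},   # earlier, looser query
    ]
    new_query = {"pt_sum": (5.0, 100.0), "vertex_chi2": (0.0, 3.0)}

    if any(contains(entry, new_query) for entry in cache):
        print("answerable from cache: refilter the cached events locally")
    else:
        print("cache miss: run the synopsis scan and fetch from the event store")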
5.3 Query Specification Interface for Physics Analyses
Developing a dedicated interface for LHCb physics analysis queries, such as a DSL, offers several benefits for this project. In addition to the general advantage of improved ease-of-use for analysts, such a language can have performance-critical implications by guiding query plan optimization:

  • Declarative formulation in higher-level semantics enables users to specify their intent while relieving them from being familiar with implementation details.

  • Automatic generation of preselection queries that can be evaluated on the synopsis. This enables the user to specify queries without knowing the synopsis schema.

  • Identification of new synopsis attributes by determining overlap or "similarity" between queries. This information could serve as a foundation to adaptively add or remove synopsis attributes, based on common query "topics". The performance/selectivity of upcoming queries could be improved by iteratively replacing or adding information in the synopsis.

  • Selectivity approximation of user requests with precalculated statistics, such as value distributions and correlations of synopsis attributes. Offering a mechanism to estimate the reduction rate of a query beforehand allows quick rejection of infeasible queries, solely based on statistics and, more importantly, without starting actual computations or data transfer requests.

  • Improved (opportunistic) caching. Events that only closely fail to match predicates (and other "similarities", such as the ones mentioned above) could become relevant and therefore prove beneficial to have in a cache, especially in an interactive test-and-see query refinement scenario. Also, result sets that are supersets of other results and are contained in a cache can be used to answer particular (upcoming) queries (see cache-answerability [10]).

Multiple of these aspects rely on the ability to analyse the structure of queries. Automatic, non-trivial optimizations, such as filter push-downs, require the system to reason about (possibly higher-order) operations and their arguments, which is hard to do with application programming interfaces (APIs). Having a well-defined DSL, backed by a suitable, formal foundation, allows the definition of abstract transformation rules and eases cost-based optimization [4]. Also, answering formally motivated questions, such as cache-answerability, requires the means to infer implication and satisfiability properties of queries [10]. Furthermore, disconnecting the interface from the execution platform offers the ability to keep the same interface if the backend gets replaced. This is particularly useful for large projects that are expected to operate over many years.

6. ROADMAP
A synopsis that supports efficient, scan-based event selection for new queries promises major performance benefits. But many open challenges exist. Currently, it is not obvious what kind of information should form the synopsis and how the choice will impact performance in a more general sense. Therefore, we first need to confirm the ideas introduced in [8] by developing means to systematically extract "frequent" synopsis attributes. Getting a better understanding of the implications of attribute choice and their impact on performance and applicability is necessary to support a sufficiently wide range of accurate queries. Also, alternatives to "carefully selected reconstruction attributes" as the synopsis's constituents should be explored.

The idea to incorporate domain specific knowledge for optimization lends itself to interesting concepts in data processing, where general purpose techniques fail to offer significant gains. Developing new and universal schemes of applying domain knowledge for performance allows the transfer of our findings to other situations. Extracting information from user queries in order to derive an execution strategy is our first step in this direction.

After having a precise and selective mechanism in place, the "data serving" aspect of the problem will offer multiple incentives for future research.

7. ACKNOWLEDGMENTS
This work has been supported by the German Ministry of Education and Research (BMBF), project Industrial Data Science, and by Deutsche Forschungsgemeinschaft (DFG), Collaborative Research Center SFB 876, project C5.

8. REFERENCES
[1] Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage. https://drill.apache.org/. Accessed: 2019-03-03.
[2] A. Alexandrov, A. Kunft, A. Katsifodimos, F. Schüler, L. Thamsen, O. Kao, T. Herb, and V. Markl. Implicit parallelism through deep language embedding. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 47–61, New York, NY, USA, 2015. ACM.
[3] G. M. Essertel, R. Y. Tahboub, J. M. Decker, K. J. Brown, K. Olukotun, and T. Rompf. Flare: Native compilation for heterogeneous workloads in Apache Spark. CoRR, abs/1703.08219, 2017.
[4] L. Fegaras. An algebra for distributed big data analytics. 2017.
[5] J. M. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates. pages 267–276, 1993.
[6] M. Joglekar, H. Garcia-Molina, A. Parameswaran, and C. Re. Exploiting correlations for expensive predicate evaluation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1183–1198, New York, NY, USA, 2015. ACM.
[7] V. Khachatryan et al. Observation of the rare Bs0 → µ+µ− decay from the combined analysis of CMS and LHCb data. Nature, 522:68–72, 2015.
[8] M. Kußmann, M. Berens, U. Eitschberger, A. Kilic, T. Lindemann, F. Meier, R. Niet, M. Schellenberg, H. Stevens, J. Wishahi, B. Spaan, and J. Teubner. DeLorean: A storage layer to analyze physical data at scale. In B. Mitschang, D. Nicklas, F. Leymann, H. Schöning, M. Herschel, J. Teubner, T. Härder, O. Kopp, and M. Wieland, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2017), pages 413–422. Gesellschaft für Informatik, Bonn, 2017.
[9] J. M. Phillips. Coresets and sketches. CoRR, abs/1601.00617, 2016.
[10] Q. Ren, M. H. Dunham, and V. Kumar. Semantic caching and query processing. IEEE Trans. on Knowl. and Data Eng., 15(1):192–210, Jan. 2003.
[11] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM Trans. Embed. Comput. Syst., 13(4s):134:1–134:25, Apr. 2014.
[12] The LHCb Collaboration. Upgrade Software and Computing. Technical Report CERN-LHCC-2018-007, LHCB-TDR-017, CERN, Geneva, Mar. 2018.
[13] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.