Execution Strategies for Compute Intensive Queries in Particle Physics

Maximilian Berens
TU Dortmund University
maximilian.berens@tu-dortmund.de

31st GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 11.06.2019 - 14.06.2019, Saarburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT
Data analysis in many scientific domains, such as high energy physics, pushes the available resources of even big research collaborations to their limit, because it not only requires huge amounts of data but also compute intensive calculations. Current approaches require increased storage and computing resources, lack sufficient flexibility and are still far away from interactivity. In this paper we present ideas to cope with the challenges posed by compute-intensive analyses of high-volume data in a many-user and resource-restricted environment. The extreme selectivity of user queries (inherent in high energy physics analyses, for instance) permits us to reduce the computational effort to a small portion when irrelevant data can be discarded by more efficient means. Our focus lies on providing execution strategies for analysis queries, guided by the expertise of the user and executed as a scalable, scan-based preselection (introduced in DeLorean [8]). Future research will encompass the development of a compact data summary that facilitates fast, columnar scan queries, as well as a domain specific language that provides us with the required information to generate them.

Keywords
Approximate Query Processing, Expensive Predicates, Big Data, Resource-Constrained Data Analysis, Domain Specific Language, High Energy Physics

1. INTRODUCTION
The ability to accumulate and use increasing amounts of data in many science domains opens enticing prospects of answering open questions of humanity. High energy physics presents itself as a prominent example of such a domain, where scientific conclusions are drawn from statistical evidence that is gained by analysing huge quantities of data. Analysis tools at this scale have to cope not only with the sheer volume of data, but also with the complexity of the involved computations, the limited availability of resources, as well as the number and variety of user requests.

CERN's Large Hadron Collider (LHC), close to Geneva, Switzerland, houses multiple experiments, each one dedicated to different questions in particle physics. One is the LHC beauty experiment (LHCb), where various aspects of the differences between matter and antimatter are studied. A common LHCb physics analysis concerns itself with a particular (type of) decay or particle. In order to observe them, protons are accelerated to very high energies and brought to collision. Over the course of a year, up to 40 million collisions are measured every second, prefiltered and stored. Before performing the actual analysis, specific recordings of these collisions, termed events, have to be selected from a global event store. Decays of interest are usually very rare: for example, B_s^0 → µ+ µ− was found only 3 times in 10 billion events [7]. A significant portion of a physics analysis consists of isolating particular types of events by carefully defining query predicates.

Detector measurements, in their initial, raw state, are not immediately useful and require computationally expensive and decay-specific reconstruction into physically meaningful entities. Filtering events for user analyses is done by enforcing predicates on these high level features.

The overall productivity of many scientific projects is limited by the amount of data that can be processed and stored. At the LHCb, this is because the measurements can not be retained at the same rate as they are taken [12]. Still, around 20 petabyte of "raw event" data goes to tape storage every year. Adding the long reconstruction time per event into the equation, naive on-the-fly computations over all available events for individual user requests are intractable.
Offering tools to quickly query the available data without qualitative drawbacks, and with optimal utilization of the limited available resources, is essential for the success of a collaboration such as the LHCb and impacts the pace of scientific discovery in general.

A major upgrade of the detector will start recording new data in 2021, increasing the data volume (both rate and size of events [12]) even further and signifying the necessity of new ideas.

The remainder of this paper is structured as follows. First, we give a brief overview of the concepts currently in place at the LHCb, their drawbacks and the consequent, general directions of our research. Preliminary and other related work are covered in section 4. Section 5 discusses open challenges that we are going to address in the upcoming years. Finally, section 6 summarizes these ideas in a roadmap.

2. DATA PROCESSING AT THE LHCB
In order to prevent analysts from querying and reconstructing the whole set of available data for every request, a centrally scheduled reconstruction-and-preselection procedure, called stripping, is performed. Several hundred "stripping lines", predefined filter queries, reduce the volume of data to an acceptable size by materializing their results in separate subsets (all lines combined drop 95% of all event data; a single line must not exceed 0.05% of the overall data volume on average). A stripping line typically tries to reconstruct a certain decay and filters events in the process. The criteria to select events tend to be "vague" (in their predicates), because multiple analyses are supposed to be done on the result of a single line or a small subset thereof. The stripping is initially applied during data acquisition and has to be redone later, when changes in the reconstruction software or overall user demand require it (waiting times of several months for a new stripping version are not unusual). Generally, this approach is problematic for multiple reasons:

• In addition to the raw-event data necessary for reconstruction, the materialized results occupy scarce disc capacity.

• Results have to conform with user requirements, which are generally more specific and strict. This usually requires users to "restrip" a selected subset with a customized configuration.

• Stripping line predicates are designed with respect to strict limits on the available resources. This can conflict with physically motivated predicate parametrization and negatively impact the quality of conclusions, as important data might be kept out of the analysts' reach. Even small changes in predicates (i.e., "loosening" of inequalities or "shifting" of ranges) are not directly implementable, because the stripping is rarely redone.

• Predefined selections are unlikely to cover unforeseen analysis use cases and thus require the definition of a completely new stripping line.

Approaches that restrict the queryable data set by predefined criteria are bound to lack flexibility, because assumptions can change and individual users have different requirements. In addition, the stripping still requires long job waiting periods and additional resources.
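To make the reconstruct-then-filter pattern of a stripping line more concrete, the following minimal sketch (plain Python with hypothetical attribute names and cut values; it does not describe the actual LHCb selection framework) keeps an event if any candidate produced by the expensive, decay-specific reconstruction survives a set of deliberately loose cuts:

    # "candidates" stands for the output of the expensive, decay-specific
    # reconstruction of a single event (hypothetical attribute names).
    def stripping_line_passes(candidates):
        # Deliberately loose ("vague") cuts, so that several analyses can reuse the result.
        return any(
            c["invariant_mass_mev"] > 4500.0
            and c["transverse_momentum_mev"] > 500.0
            and c["vertex_chi2"] < 25.0
            for c in candidates
        )

    # Example: one event with a single reconstructed candidate.
    event_candidates = [{"invariant_mass_mev": 5366.0,
                         "transverse_momentum_mev": 1200.0,
                         "vertex_chi2": 3.1}]
    print(stripping_line_passes(event_candidates))   # True -> event is materialized

The expensive part is the reconstruction that produces the candidates in the first place; the cuts themselves are cheap, which is exactly what the scan-based preselection discussed below exploits.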
3. OUR OBJECTIVES
The computationally expensive reconstruction is trivially parallelizable, because events are completely independent and small in size (∼hundreds of kilobytes). However, relying on data parallelism alone guarantees neither general scalability nor sufficient resource utilization. New solutions need to scale up by leveraging available hardware features. In addition, they need to scale out in order to avoid bottlenecks caused by resource contention in large-scale multi-user environments. For the purpose of pushing analytics closer towards interactivity, we identify the following goals:

• A significant reduction of the data volume that is involved in individual user queries.

• Limiting the execution of expensive computations to the result set, and thus minimizing the overall compute load.

• Enabling efficient usage of modern hardware features (i.e., deep cache hierarchies, advanced processor instructions and multi-core architectures).

• Reusing information by caching results and intermediate computations for upcoming queries.

A scan intensive preselection, based on a columnar, compacted representation of the data entities (i.e., particle-collision events), will enable us to implement these objectives (see section 4 for preliminary work). Given the small expected result cardinality of individual requests, incorporating users and their domain expertise more closely into the process potentially offers significant advantages. However, precisely translating the user's intent into selective preselection queries requires a suitable interface. In contrast to more general query languages, such as SQL, a specialized domain specific language (DSL) offers the required expressivity and an easier incorporation of domain specific optimizations. Furthermore, a DSL can support efficient distributed caching strategies, as previous results might be used to answer a (potentially wide) range of upcoming queries [10]. Caching increases data locality and spreads the workload in non-uniform data access scenarios. We will further elaborate on this topic in the related works section. To this end, the development of a suitable query interface for physics analytics will be a major topic in our upcoming research.

Another important aspect of our approach will be the construction of a compacted representation, or synopsis. Assuming that we are able to filter data based on this representation alone, large portions can be discarded efficiently (via scanning) and without involving expensive computations on irrelevant data. Executing the computationally expensive (reconstruction) pipeline only on the pre-filtered, intermediate result then gives the same (final) result as applying the pipeline to the whole data set directly; the sketch below illustrates this two-phase scheme. Of course, this requirement prohibits the synopsis-based preselection from rejecting any true result tuple. We provide more details on synopsis requirements in section 5.1.
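A minimal sketch of the intended two-phase execution (plain Python with hypothetical field names; the synopsis column and cut are illustrative assumptions, not measured LHCb quantities): a cheap scan over precomputed synopsis columns discards most events, and the expensive reconstruction runs only on the surviving candidates:

    def preselect(synopsis, min_pt):
        # Relaxed predicate on a precomputed column: if the highest candidate pT of an
        # event is already below the cut, no candidate of that event can pass the exact cut.
        return [row["event_id"] for row in synopsis if row["max_candidate_pt"] >= min_pt]

    def run_query(synopsis, raw_event_store, reconstruct_and_test, min_pt):
        candidate_ids = preselect(synopsis, min_pt)                   # cheap, scan-based phase
        survivors = (raw_event_store[eid] for eid in candidate_ids)   # fetch only the candidates
        return [ev for ev in survivors if reconstruct_and_test(ev)]   # expensive phase, run rarely

Here reconstruct_and_test stands for the decay-specific reconstruction plus the user's exact selection; because any event with a candidate passing the exact pT cut also has max_candidate_pt above the threshold, the final result is identical to running the expensive test over the whole store.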
4. RELATED WORKS
A first approach to address the problem via preselection, named DeLorean [8], was developed in our group and is going to be the entry point of this work. For details and a preliminary evaluation of a proof of concept, see Kußmann et al. [8]. In the following, we give a brief description.

The idea is to separate a query into a compute- and a data intensive part. Attributes that are commonly used in queries and selective (i.e., attributes that are expected to be involved in selective predicates) are precalculated during data acquisition and collected as columns in a single table, resulting in a much smaller representation (see fig. 1a). In order to avoid evaluations of the expensive reconstruction, a fast scan of this compact synopsis is supposed to discard a large number of tuples (events), reducing the (query specific) computational effort to a sufficiently small superset of the true result (fig. 1b). The synopsis lookup itself is efficiently expressed in relational terms, replacing reconstruction operations with scan intensive "SQL" operators.

[Figure 1: The DeLorean storage layer. (a) Preprocessing: at data acquisition time, DeLorean extracts a compact, columnar synopsis of the events. (b) Query processing: at analysis time, the synopsis is scanned and only a relevant subset is fetched from the raw event store and reconstructed for user analyses.]

In a second step, the surviving events are retrieved (the LHCb data format allows tree-based seeking of single raw events via identifier [8]) and fed into the stripping software, specifically configured for individual requests, yielding the final result.

The new synopsis lookup can be implemented by modern cloud execution platforms, such as Apache Drill [1], making use of their scalability and scan-beneficial columnar storage layout [8]. Constructing the synopsis in a columnar layout provides benefits such as a reduced data volume, because only relevant attributes have to be loaded and scanned. Furthermore, columns offer improved compressibility and thus also a tradeoff between processor load and bandwidth [8].
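To illustrate what such a relational synopsis lookup could look like (the table and column names below are hypothetical and not the actual DeLorean schema), the preselection can be phrased as an ordinary SQL query over the synopsis, and only the returned event identifiers are then fetched from the raw event store:

    # Hypothetical synopsis layout: synopsis(event_id, n_muon_candidates, max_dimuon_mass, max_pt).
    # Any scan-oriented SQL engine (e.g., Apache Drill over a columnar file format) could evaluate
    # the query; it is only assembled here and not tied to a specific driver.
    PRESELECTION_SQL = """
        SELECT event_id
        FROM   synopsis
        WHERE  n_muon_candidates >= 2
           AND max_dimuon_mass   >= 4500.0   -- relaxed versions of the user's exact cuts
           AND max_pt            >= 500.0
    """

    def fetch_candidates(run_sql, raw_event_store):
        # run_sql: any callable that evaluates SQL over the synopsis and yields (event_id,) rows;
        # raw_event_store.read() stands for the identifier-based seek into the raw event store.
        return [raw_event_store.read(event_id) for (event_id,) in run_sql(PRESELECTION_SQL)]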
Geometric data summarization techniques in general are an essential tool in many domains, having applications in large scale machine learning and databases, for instance. These summaries can be roughly classified into two types, coresets and sketches [9]. Similar to DeLorean, a summary acts as a "proxy" for the complete data set, and algorithms executed on this smaller set give an approximation of the result. However, these summaries pursue a reduction of the number of tuples, via density estimation or clustering, for instance. In contrast, DeLorean, as it is now, applies a dimensionality reduction, where less relevant attributes are simply projected out. We conjecture that both fields, dimensionality reduction as well as data summaries, can have interesting applications in our research.

A related and similarly motivated concept is that of vector approximation files (VA files) by Weber et al. [13]. VA files partition the data space into rectangular cells and enumerate them with unique bit-strings, offering a scan based preselection on a compact representation. Some cases of nearest neighbor search, for instance, can be decided on this compacted representation alone.

The scale-up vs. scale-out topic is discussed by Essertel et al. [3]. The Flare project, an extension of the distributed data analysis platform Apache Spark, tries to maximize the potential of individual machines by means of native code compilation. However, the existing LHCb software stack, implemented in C++, is not reasonably migratable to another platform, given the extensiveness of the code base alone. To this end, any approach that aims for at least some practical applicability to the LHCb event retrieval problem has to make the integration of existing software stacks possible.

Generally, this can be done by providing efficient evaluation strategies. Stepping outside of the stripping perspective and positioning the "retrieval" problem in a database context, the general schema to obtain events from the global (raw-) event store E can be illustrated in terms of SQL as a conjunction of expensive predicates (or "user defined functions") p_i:

    SELECT * FROM E WHERE p_1 AND p_2 AND ... AND p_n

Actual physics analyses involve various types of predicates. They are defined in terms of properties of reconstructed decays or particles, instead of the plain (raw-event) attributes that are typically expected in database queries. Clearly, most of the effort arises within those functions, which reach outside of the scope of relational algebra (SQL). These "black boxes" are inherently hard to deal with by means of "traditional" database technology.

Hellerstein and Stonebraker [5] try to correctly take into account the cost of expensive predicates when optimizing query plans via operator reordering. User defined functions as well as sub-query predicates are sometimes incorrectly assessed as "zero-time operations" by database optimizers. Due to the simplistic relational structure of the above mentioned expression, however, general purpose query plan optimization techniques are unlikely to offer improvements in our setting. In contrast, our approach is going to reduce the total number of evaluations when answering user queries.

Joglekar et al. [6] try to reduce the number of explicit evaluations of expensive predicate functions. In exchange for a decrease in query accuracy, correlations between a function's result and the value of a variable can be exploited to decide whether the evaluation of the predicate is required or can be skipped. However, the required correlation estimation is only feasible for low cardinality attributes, which rarely occur in the physics context.

Declarative languages, such as SQL or XQuery, were developed to offer enhanced expressivity, enable specific optimizations and enjoy widespread usage. The importance of the declarative property of big data analytics-centric languages is supported by the works of Fegaras [4] (MRQL) and Alexandrov et al. [2] (Emma). As SQL has its roots in relational algebra (and the tuple relational calculus), MRQL and Emma have monoid homomorphisms and monad comprehensions (respectively) as their formal foundation. However, these languages address a (still) very broad domain of queries and databases, omitting optimization potential that cannot be detected in a more general context. In our work, we are going to identify query patterns that are specific to the domain of interest (LHCb event analysis, in this case). Utilizing formal systems, such as the ones used by SQL, MRQL or Emma, provides the associated tools and insights as well as an environment to reason about queries and specify transformation rules for optimization.
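To give a flavour of the cost-based reasoning that such formal treatments of expensive predicates enable, the following sketch orders the conjuncts of a filter by the classic rank heuristic from the predicate migration framework [5], rank = (selectivity - 1) / cost per tuple, so that cheap and selective predicates are evaluated first and evaluation stops at the first failure. The selectivities and costs are purely illustrative placeholders:

    # Order independent filter predicates by rank = (selectivity - 1) / cost, ascending,
    # and short-circuit on the first failing predicate. Numbers are illustrative only.
    def order_by_rank(predicates):
        # predicates: list of (test_function, selectivity, cost_per_tuple)
        return sorted(predicates, key=lambda p: (p[1] - 1.0) / p[2])

    def evaluate(event, predicates):
        return all(test(event) for test, _, _ in order_by_rank(predicates))

    predicates = [
        (lambda e: e["n_tracks"] >= 2,        0.60, 0.001),  # cheap, mildly selective cut
        (lambda e: e["dimuon_mass"] > 4500.0, 0.01, 5.0),    # expensive, reconstruction-based cut
    ]
    print(evaluate({"n_tracks": 3, "dimuon_mass": 5366.0}, predicates))   # True

In our setting the ordering alone is not enough, because even the cheapest reconstruction-based predicate is expensive; the synopsis-based preselection instead avoids most of these evaluations altogether.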
Given that every user has different requirements and is interested in different types of data, a specialized execution strategy can optimize individual requests and thereby further improve overall performance.

Building a new and customized DSL is costly and requires knowledge both in the application domain and in programming language design [11]. DSL-compiler frameworks, such as Delite [11], improve the creation of new languages by providing abstract means to integrate domain specific optimizations, implicit parallelism and (native) code generation. The insights and tools provided by this type of framework can prove useful to retrieve information from user requests, generate synopsis queries and improve the overall productivity of analysts. We suggest additional ideas in this direction in section 5.3.

In [10], the authors provide a formal definition and query processing strategies for semantic caching. Semantic caching, in contrast to page-based caching, utilizes a semantic representation of the cache content in order to find items that can totally (or partially) answer a given query. We believe that this idea can be advantageous in our setting, as queries in the LHCb context share considerable "overlap". This stems from the fact that certain (sub-)decays are "contained" in multiple decays, requiring the same computations. In fact, the concept of shared computations is already (manually) implemented in the current LHCb software stack. Detecting and exploiting this overlap automatically will therefore be beneficial.
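A minimal sketch of the semantic-caching idea, restricted to conjunctions of numeric range predicates (a simplifying assumption; [10] treats the general case): a cached result can fully answer a new query if every range of the new query is contained in the corresponding cached range, so the answer can be produced by re-filtering the cached events instead of touching the remote event store:

    # Query = mapping from attribute name to an inclusive (low, high) range;
    # cache entry = (query, events). Only conjunctive range predicates are covered here.
    def contains(cached_query, new_query):
        return all(attr in cached_query
                   and cached_query[attr][0] <= lo and hi <= cached_query[attr][1]
                   for attr, (lo, hi) in new_query.items())

    def answer_from_cache(cache, new_query):
        for cached_query, cached_events in cache:
            if contains(cached_query, new_query):
                # Re-filter the cached events; no access to the event store is needed.
                return [e for e in cached_events
                        if all(lo <= e[a] <= hi for a, (lo, hi) in new_query.items())]
        return None   # cache miss: fall back to synopsis scan and reconstruction

    cache = [({"mass": (4000.0, 6000.0)}, [{"mass": 5366.0}, {"mass": 4200.0}])]
    print(answer_from_cache(cache, {"mass": (5000.0, 5500.0)}))   # [{'mass': 5366.0}]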
5. ADDRESSING OPEN CHALLENGES
The precise requirements on the synopsis content are yet to be defined. So far, predicates were handpicked according to their selectivity for a selected sample query. The involved attributes were included in the synopsis, and all values were determined by an initial stripping pass. First benchmarks are promising [8], but we need to generalize the findings to provide performance indications for a range of (unseen) queries. Also, the inherent challenges of distributed, many-user, high-volume data processing need to be addressed.

5.1 Creating the synopsis
To offer performance and correctness, even for new queries, some general qualities of the synopsis can be declared:

• Applicability - The synopsis has to contain attributes that are relevant for upcoming user queries, otherwise we do not gain any advantage.

• Correctness - To prevent "false negatives", no event that is actually relevant for a user request may be rejected by the synopsis scan.

• Selectivity - To more than amortize the additional cost of a synopsis scan, a sufficiently large number of events needs to be rejected, preventing their expensive reconstruction.

Note that events that are selected but in fact uninteresting ("false positives") are permissible, although undesired. They are expected to be rejected by the second step and merely deteriorate selectivity and therefore performance, which is less crucial than correctness.

To serve an adequate range of user request topics, a sufficiently broad selection of synopsis attributes has to cover the most frequent analyses. As a first approach, the stripping line formulations, even though being inherently "vague", involve many commonly used attributes, because the complete set of lines is designed to cover a wide range of analysis subjects at the LHCb. A stripping line is formulated in multiple consecutive steps that define specific reconstruction procedures. Those steps also include filter statements on properties determined during this procedure. Also, there is a considerable overlap between the lines, as several steps are frequently shared. Many particles and even some decays are involved in multiple analyses and usually involve the same or just slightly different predicates. Investigating criteria to assess attributes and their (combined) selective power for upcoming queries is currently important, as it represents the next logical step towards a systematic creation of an event synopsis.

A successful preselection strategy has to offer a selectivity comparable to stripping lines. Given such a selective synopsis, the overall number of events can be reduced to a manageable portion, enabling the user to perform the reconstruction in a reasonable amount of time. Note that a "local restripping" is already done by analysts in practice on manually selected stripping line results in order to refine the event selection according to individual user requirements.
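The qualities declared at the beginning of this subsection can be checked empirically on labelled samples. The sketch below (a simplification with a hypothetical data layout, not a finished attribute-selection procedure) takes a sample of events for which the exact, reconstruction-based outcome of a reference query is known and reports whether a candidate synopsis predicate loses any relevant event (correctness) and which fraction of events it discards (selectivity):

    # sample: list of (synopsis_row, is_relevant) pairs, where is_relevant is the outcome
    # of the exact, reconstruction-based reference query (hypothetical layout).
    def evaluate_synopsis_predicate(sample, synopsis_predicate):
        false_negatives = sum(1 for row, relevant in sample
                              if relevant and not synopsis_predicate(row))
        rejected = sum(1 for row, _ in sample if not synopsis_predicate(row))
        return {
            "correct": false_negatives == 0,              # no relevant event may be rejected
            "rejected_fraction": rejected / len(sample),  # share of events discarded by the scan
        }

    sample = [({"max_pt": 1200.0}, True), ({"max_pt": 300.0}, False), ({"max_pt": 80.0}, False)]
    print(evaluate_synopsis_predicate(sample, lambda row: row["max_pt"] >= 500.0))
    # {'correct': True, 'rejected_fraction': 0.666...}

Repeating such an evaluation over the predicates of many past analyses is one conceivable way to rank candidate attributes by their combined selective power.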
5.2 Result Caching
Depending on the size of the synopsis, distributing redundant copies (or only the relevant columns covering a single topic) enables the execution of the preselection independently for different work groups, possibly even single users. This way, we are able to shift the "expensive query predicate" problem further towards a data serving problem, where only the actually interesting (but unprocessed) events have to be efficiently handed over to many users.

However, the selected data might be resident at different sites that are geographically dispersed (as is the case for CERN/LHCb) and/or busy, introducing latencies. Non-uniform data access (over the data sites) increases contention in both network and local resources. Note that this issue also arises in settings where queries do not involve (network-) communication-intensive algorithms, such as the LHCb data analysis.

Adequate caching mechanisms can greatly improve the ability to serve data by adaptively holding frequently requested (sets of) events, greatly reducing data transfer volumes and serving latencies. Also, with knowledge about the data and its dependencies, events could be cached speculatively. For example, decays that appear to be very similar to the desired decay are sometimes explicitly fetched to exclude them properly from the analysis.

5.3 Query Specification Interface for Physics Analyses
Developing a dedicated interface for LHCb physics analysis queries, such as a DSL, offers several benefits for this project. In addition to the general advantage of improved ease-of-use for analysts, such a language can have performance critical implications by guiding query plan optimization:

• Declarative formulation in higher-level semantics enables users to specify their intent while relieving them from being familiar with implementation details.

• Automatic generation of preselection queries that can be evaluated on the synopsis. This enables users to specify queries without knowing the synopsis schema (a small sketch at the end of this section illustrates the idea).

• Identification of new synopsis attributes by determining overlap or "similarity" between queries. This information could serve as a foundation to adaptively add or remove synopsis attributes, based on common query "topics". The performance and selectivity for upcoming queries could be improved by iteratively replacing or adding information in the synopsis.

• Selectivity approximation of user requests with precalculated statistics, such as value distributions and correlations of synopsis attributes. Offering a mechanism to estimate the reduction rate of a query beforehand allows the quick rejection of infeasible queries, based solely on statistics and, more importantly, without starting actual computations or data transfer requests.

• Improved (opportunistic) caching. Events that only closely fail to match predicates (and other "similarities", such as the ones mentioned above) could become relevant and therefore prove beneficial to have in a cache, especially in an interactive test-and-see query refinement scenario. Also, result sets that are supersets of other results and are contained in a cache can be used to answer particular (upcoming) queries (see cache-answerability [10]).

Multiple of these aspects rely on the ability to analyse the structure of queries. Automatic, non-trivial optimizations, such as filter push-downs, require the system to reason about (possibly higher order) operations and their arguments, which is hard to do with application programming interfaces (APIs). Having a well-defined DSL, backed by a suitable formal foundation, allows the definition of abstract transformation rules and eases cost-based optimization [4]. Also, answering formally motivated questions, such as cache-answerability, requires the means to infer implication and satisfiability properties of queries [10]. Furthermore, disconnecting the interface from the execution platform offers the ability to keep the same interface if the backend gets replaced. This is particularly useful for large projects that are expected to operate over many years.
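As a hedged illustration of the automatic generation of preselection queries (the decay notation, attribute names and the mapping to synopsis columns are invented for this example and do not describe an existing LHCb language), a small embedded DSL could let the analyst state a decay and exact per-candidate cuts, from which the system derives a relaxed preselection query against the synopsis without the analyst ever seeing its schema:

    # Hypothetical embedded mini-DSL: the analyst declares a decay and lower-bound cuts;
    # the system maps them to per-event maximum columns of the synopsis (illustrative names).
    SYNOPSIS_COLUMN = {"pt": "max_candidate_pt", "mass": "max_dimuon_mass"}

    class DecayQuery:
        def __init__(self, decay):
            # The decay string would select which reconstruction to run in the second stage.
            self.decay, self.cuts = decay, []          # cuts: list of (attribute, lower_bound)

        def where_min(self, attribute, value):
            self.cuts.append((attribute, value))
            return self

        def to_preselection_sql(self):
            # Lower bounds on per-candidate attributes relax to lower bounds on the per-event
            # maxima stored in the synopsis, so no relevant event is lost by the scan.
            conds = " AND ".join(f"{SYNOPSIS_COLUMN[a]} >= {v}" for a, v in self.cuts)
            return f"SELECT event_id FROM synopsis WHERE {conds}"

    query = DecayQuery("B_s0 -> mu+ mu-").where_min("pt", 500.0).where_min("mass", 4500.0)
    print(query.to_preselection_sql())
    # SELECT event_id FROM synopsis WHERE max_candidate_pt >= 500.0 AND max_dimuon_mass >= 4500.0

Because the query is then available as a structured object rather than opaque code, the same representation could also feed the overlap detection, selectivity estimation and caching ideas listed above.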
6. ROADMAP
A synopsis that supports efficient, scan based event selection for new queries promises major performance benefits, but many open challenges exist. Currently, it is not obvious what kind of information should form the synopsis and how this choice will impact performance in a more general sense. Therefore, we first need to confirm the ideas introduced in [8] by developing means to systematically extract "frequent" synopsis attributes. Getting a better understanding of the implications of the attribute choice and its impact on performance and applicability is necessary to support a sufficiently wide range of accurate queries. Also, alternatives to "carefully selected reconstruction attributes" as the synopsis's constituents should be explored.

The idea to incorporate domain specific knowledge for optimization lends itself to interesting concepts in data processing, where general purpose techniques fail to offer significant gains. Developing new and universal schemes of applying domain knowledge for performance allows the transfer of our findings to other situations. Extracting information from user queries in order to derive an execution strategy is our first step in this direction.

After having a precise and selective mechanism in place, the "data serving" aspect of the problem will offer multiple incentives for future research.

7. ACKNOWLEDGMENTS
This work has been supported by the German Ministry of Education and Research (BMBF), project Industrial Data Science, and by Deutsche Forschungsgemeinschaft (DFG), Collaborative Research Center SFB 876, project C5.

8. REFERENCES
[1] Apache Drill - Schema-free SQL for Hadoop, NoSQL and Cloud Storage. https://drill.apache.org/. Accessed: 2019-03-03.
[2] A. Alexandrov, A. Kunft, A. Katsifodimos, F. Schüler, L. Thamsen, O. Kao, T. Herb, and V. Markl. Implicit parallelism through deep language embedding. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 47–61, New York, NY, USA, 2015. ACM.
[3] G. M. Essertel, R. Y. Tahboub, J. M. Decker, K. J. Brown, K. Olukotun, and T. Rompf. Flare: Native compilation for heterogeneous workloads in Apache Spark. CoRR, abs/1703.08219, 2017.
[4] L. Fegaras. An algebra for distributed big data analytics. 2017.
[5] J. M. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates. pages 267–276, 1993.
[6] M. Joglekar, H. Garcia-Molina, A. Parameswaran, and C. Re. Exploiting correlations for expensive predicate evaluation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1183–1198, New York, NY, USA, 2015. ACM.
[7] V. Khachatryan et al. Observation of the rare B_s^0 → µ+ µ− decay from the combined analysis of CMS and LHCb data. Nature, 522:68–72, 2015.
[8] M. Kußmann, M. Berens, U. Eitschberger, A. Kilic, T. Lindemann, F. Meier, R. Niet, M. Schellenberg, H. Stevens, J. Wishahi, B. Spaan, and J. Teubner. DeLorean: A storage layer to analyze physical data at scale. In B. Mitschang, D. Nicklas, F. Leymann, H. Schöning, M. Herschel, J. Teubner, T. Härder, O. Kopp, and M. Wieland, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2017), pages 413–422. Gesellschaft für Informatik, Bonn, 2017.
[9] J. M. Phillips. Coresets and sketches. CoRR, abs/1601.00617, 2016.
[10] Q. Ren, M. H. Dunham, and V. Kumar. Semantic caching and query processing. IEEE Trans. on Knowl. and Data Eng., 15(1):192–210, Jan. 2003.
[11] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM Trans. Embed. Comput. Syst., 13(4s):134:1–134:25, Apr. 2014.
[12] The LHCb Collaboration. Upgrade Software and Computing. Technical Report CERN-LHCC-2018-007, LHCB-TDR-017, CERN, Geneva, Mar 2018.
[13] R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases, VLDB '98, pages 194–205, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.