Checking Plausibility in Exploratory Data Analysis Hermann Stolte supervised by Prof. Matthias Weidlich1 and Dr. Elisa Pueschel2 1 Humboldt-Universität zu Berlin, 2 Deutsches Elektronen-Synchrotron DESY hermann.stolte@hu-berlin.de ABSTRACT Exploratory data analysis is widespread across many scientific do- Model Construction mains. It relies on complex pipelines and computational models for data processing, that are commonly designed collaboratively t Plausibility I Pipeline Constraint by scientists with diverse backgrounds for a variety of software o o Model stacks and computation environments. Here, a major challenge is the uncertainty about the correctness of analysis results, due to the t ca b Pipeline high complexity of both, the actual data and the implemented anal- ysis steps, and the continuous reuse and adaptation of data analysis o f Plausibility Pipeline o o (annotated) y Constraint De nition Integration D pipelines in different application settings. This PhD project inves- tigates how the design, adaptation, and evaluation of exploratory data analysis pipelines can be supported through automated plau- User a sibility assessment. To this end, we outline the requirements, our Violation Analysis C3 Constraint Pipeline approach, and initial results for models and methods to enable C4 Validation Execution plausibility checking in the context of exploratory data analysis. Figure 1: An illustration of the steps for plausibility analysis of scientific data processing pipelines. originate not only from the restricted scope of a single research 1 INTRODUCTION group or laboratory, but from external collaborators with limited or Today’s large-scale research projects in domains such as Materials no involvement in the subsequent use of the updated code and data. Science, Astrophysics, or Remote Sensing, to name just a few exam- In the absence of established routines for version management of ples, often involve exploratory data analysis. Here, data from mul- datasets and data processing pipelines, such a setting can lead to tiple distributed sources is integrated and analyzed collaboratively hard-to-find bugs, especially when time-pressure is involved. by scientists with diverse backgrounds, from various disciplines, The overall uncertainty about the correctness of analysis results and from different organizations. Key to the process are complex makes the development, maintenance, and evaluation of pipelines pipelines for scientific data processing, sometimes referred to as difficult. Support is needed for users to assess the plausibility of the scientific workflows [10], which may be designed for a variety of results obtained by data processing pipelines. Therefore, this PhD software stacks and computation environments [13]. project is dedicated to answer the following research question: Main challenges here arise from the complexity of both, datasets How to support the design, adaptation, and evaluation and analysis steps, that makes results difficult to interpret and error- of exploratory data processing pipelines through auto- prone. With data being collected by multiple internal or external mated plausibility analyses? stakeholders each using their own methods in varying environ- We aim to answer this research question by providing the foun- ments, bias and noise need to be accounted for. For complex re- dations for automated plausibility analysis, as illustrated in Fig. 1. search questions, interdisciplinary teams collaborate to combine (1) Construct a meta-model for plausibility constraints domain and technical expertise, e.g., for creating computational (2) Support users in defining plausibility constraints models. Researchers from different backgrounds have an individual (3) Integrate constraints into a given data processing pipeline set of hidden assumptions about data and models, which can easily (4) Validate constraints during pipeline execution cause miscommunication and thus introduce errors in the design (5) Enable users to identify root causes of constraint validations of the data processing pipelines. An important aspect of exploratory data analysis is that the The next section illustrates the need for automated plausibility specific research questions evolve over time. As a consequence, assessment by a specific application case from the field of Astro- pipelines and datasets are also subject to frequent changes [3]. physics. §3 gives an overview of relevant related work. In §4, we While program code for complex analysis is error-prone in general, outline our solution approach, before we conclude in §5. the evolution of pipelines and data is inherent to exploratory data analysis and amplifies the resulting challenges. Moreover, changes 2 BACKGROUND fi In Astrophysics research, extreme acceleration processes (e.g., around Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Den- black holes, exploding or merging stars) are of high scientific inter- mark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). est [4]. Here, energy is emitted as radiation across the whole electro- magnetic spectrum, including gamma radiation. When gamma-rays Aggregating event High (manual or IACT Filtering by Filtered Delayed Event IACT counts over time, IACT Light Quality Local automatic) Camera Pre-Selection IACT High-Quality Reconstruction Event List grouped by energy Curve IACT Light Observation Observation Images Cuts Event List Post-Processing range Curve Archive Scheduling Observing Is there any spectral Condition C4 upturn or is the spectrum Metadata Is there any activity that unphysically hard? Alert IACT Alert cannot be associated Decision C2 Steam with a known source? Log Entry Temporal Multi- Raw MW Integration and Filtering Wavelength Data Source Light Curve Is the multi-wavelength Association (manual) C3 light curve after source Decision-making association realistic? Alert External for public alert Stream Light-Curves Has another pipeline stopped from (e.g. radio, updating and sent out the Flare C1 Preprocessing Flare Near Real-Time external optical, x-ray, same timestamps/values? Detection Flare Detection Observation and Training Training Detection Inference on instruments gama-ray) Model Training Significance Data Curation Dataset Model Observed Data Figure 2: A pipeline for near real-time flare detection in IACT and multi-wavelength data with four plausibility checks. hit the Earth’s atmosphere, they cause faint light showers, which the combined expertise and interdisciplinary collaboration of astro- are observable using ground-based telescopes. With so called Imag- physicists, computer scientists, and further engineers, is required. ing Atmospheric Cherenkov Telescopes (IACT) [8], the physics Against this background, the design of data processing pipelines is behind acceleration processes can be studied, e.g., by analyzing an error-prone process, which continuously raises the question of temporal and spectral changes over time. how to assess the plausibility of the produced data. However, the respective data analysis is challenging. Background For the pipeline in Fig. 2, for instance, the plausibility of inter- noise from cosmic rays needs to be filtered out and data can only mediate data can be assessed based on the following constraints: be captured when specific observing conditions are met (e.g., given C1 It may happen that an external pipeline to deliver and filter source visibility in the sky, no clouds, no bright moonlight). Also, light curves from external instruments is faulty and sends reconstructing the properties of gamma ray photons from images identical measurements repetitively, which is implausible. of light showers introduces systematic bias (e.g., related to the visi- C2 In MW data, source activity can usually be observed by mul- bility of light showers at low energies and different zenith angles). tiple instruments. A source with high activity being detected To study acceleration processes effectively, IACT data is com- only by a single instrument is therefore suspicious. bined with data from other instruments (e.g., other gamma-ray, C3 MW instruments can have a limited spatial resolution, so optical, x-ray or radar observatories), which is referred to as multi- that observed activity may be associated with several source wavelength (MW) data. The integration and joint analysis of such candidates. An erroneous source association may be discov- data is particularly challenging when research questions demand ered based on an implausible multi-wavelength light curve. a near real-time analysis of transient phenomena, i.e., events that C4 Based on the current physical understanding of gamma-ray may be observable for only a few minutes. emission from blazars, certain features of an IACT spectrum, For instance, consider the use case of near real-time IACT and such as a spectral upturn, may be unexpected. multi-wavelength blazar1 flare detection. Fig. 2 illustrates a pipeline Note that violations of plausibility constraints are not necessarily for detecting flaring states and scheduling observations of blazars related to errors in the data processing pipeline. Rather, they may based on IACT and MW data. Here, data from different distributed also hint at unexpected and, therefore, particularly interesting phys- sources need to be jointly analyzed to detect transient flaring states ical phenomena. In any case, a plausibility assessment is beneficial of blazars within hours to minutes. When a flaring state of interest is and supports the design, adaptation, and evaluation of the pipeline. detected, the local observation schedule can be updated accordingly, and other observatories will be alerted. Challenges for the analysis of IACT and MW data are imposed 3 RELATED WORK by the properties of the data, mainly its heterogeneity, sparsity, and Scientific workflows. Our ideas are related to the field of scientific inherent bias and noise. Moreover, the research questions tend to workflows and workflow engines [10]. They have been researched evolve, as for example, the definition of an interesting flare may be for decades to provide accessibility and reproducibility. To this revised, which then leads to changes in the data processing pipeline. end, workflow engines (e.g., Apache Taverna, Galaxy, KNIME, and In addition, the setting in which the analysis is conducted imposes Pegasus) support for the design and execution of pipelines by large challenges on the technical side, as a variety of software stacks catalogs of operators, graphical user interfaces, and techniques for and computational environments is used, from machine learning data integration. Yet, plausibility analysis as envisioned here is not frameworks, through batch processing systems and middleware part of their functionality. for stream processing, to transient alert brokers. To cope with the Analysis of provenance data. Some scientific workflow engines challenges induced by the data, the analysis, and the infrastructure, collect provenance data [7] as a source for analyzing data lineage, 1 A blazar is an astronomical object of particular scientific interest. When pointed to i.e., to explain how a certain data item was created. Various mod- earth, it can be observed as a highly variable source of multi-wavelength radiation. els for provenance data have been defined in the literature, most 2 prominently PROV [2]. Moreover, there are tools available to ex- (2) Optionally, the location of external data referenced in the plore and analyze provenance data, e.g., VisTrails [6], or to use it definition of the constraint. for distributed debugging, e.g., with BugDoc [12]. We argue that (3) A constraint function that maps from the domains of the models for provenance data are a useful starting point for automatic entities (and external data, if required) to a probability dis- plausibility analysis of data processing pipelines. tribution, thereby expressing the plausibility of data items. According to this model, C4 from Fig. 2 is captured as a function Program verification. Pipelines can be seen as programs, so that that maps from the domain of an IACT light curve to a probabil- inspiration for plausibility analysis may be taken from the field of ity distribution. Specifically, the constraint function assigns high software engineering, especially software verification. Here, work probabilities when expected spectral features are present. on automatic test case generation [1] or invariant mining [11], which aims at identifying state-based expressions known to hold true, are particularly promising. Considering system faults to be a 4.2 Constraint Definition special case of implausible pipeline executions, and system invari- The effort needed to define constraints shall be minimized, since ants to be a special case of plausible pipeline behavior, inspiration it is the crucial step in which a user has to invest time to benefit can be drawn from approaches for discovering both. from automated plausibility checking down the line. Ideally, con- straints could be found and suggested to a user for review. Research questions here are hence (1) How to support the user in the manual 4 CHECKING PLAUSIBILITY definition of plausibility constraints, and (2) how to mine plausibility This section outlines our approach to automated plausibility analy- constraints automatically and assess their usefulness? sis. For each step outlined in Fig. 1, we discuss the requirements, To this end, we observe in our use case that constraints are closely our approach and preliminary results of how to address them. related to physical models and laws. For example, the constraint C4 requires defining a function that maps features of a source’s IACT 4.1 Model light curve to a probability of plausibility. The latter is based on the source type, so that we aim to leverage physical models and laws Automated plausibility assessment requires a model for making for mining and defining constraints. statements about data. The model needs to enable the specification We further intend to investigate whether textual documentation of dependencies between data that is processed in a pipeline, while (e.g., instrument specifications) can serve as a basis for constraint incorporating the following considerations: mining. This requires mapping the data entities of a provenance Single or multi data item. Constraints may not only consider graph to named entities in natural language text, i.e., using tech- single data items, but the relation of multiple data items. For niques for named entity recognition and relation extraction. Once instance, considering our use case, a change in the rate of causal relations between entities have been discovered, they can be reconstructed events by an IACT telescope array is consid- employed for suggestions in the manual definition of constraints. ered suspicious, if it coincides with a telescope entering or leaving the array. The plausibility of an IACT event list may therefore be assessed using observing condition metadata. 4.3 Pipeline Integration Single or multi pipeline execution. A basic constraint checks A key question in automated plausibility analysis is how to integrate the value of one or more data items in isolation. However, the definition of plausibility constraints into existing pipelines. The plausibility assessment may require the joint analysis of a objective of pipeline integration is two-fold: sequence of data items as a value distribution across multi- ◦ Data entities from the underlying provenance model need to ple pipeline executions. For instance, for the example con- be linked to software parameters and function arguments; straint C1, the sequence of 𝑛 previous values of a data item ◦ Plausibility constraints need to be placed in a pipeline exe- is required to analyze the distribution variance. cution graph in the first possible position, where all required Independent or dependent on external data. A constraint may data items are available. This ensures that implausibilities refer to external data sources. For example, the constraint C2 can be detected and reacted to as early as possible. requires access to an exhaustive catalog of known sources, Embedding plausibility constraints in a pipeline definition is not that is not available as a data item during pipeline execution. straightforward. Even in pipelines modeled as directed acyclic Our idea is to develop a model for plausibility constraints based graphs, the same type of data may occur multiple times in an exe- on provenance graphs, which capture the relation between data cution graph. Also, a constraint may concern multiple data items. items in a pipeline as a directed acyclic graph (DAG) [7]. Nodes To enable pipeline integration, the functionality for plausibility in the graph are named entities and represent data items, whereas analysis first needs to be implemented. Since exploratory analysis edges are causal dependencies. Data items also need to be linked to often relies on software toolkits that offer standard solutions for computational steps. We intend to realize this using the ProvONE data management and analysis methods in a specific domain, these model [5], which links data provenance with pipeline definitions provide a suitable starting point for such an implementation. For (i.e., workflows in ProvONE terminology). Based thereon, the defi- example, for the analysis of IACT data, the libraries ctools and nition of a plausibility constraint includes: gammalib [9] are examples for such toolkits. (1) A collection of one or more ProvONE entities representing To integrate an implementation of plausibility analysis in the data items in a pipeline. definition of a pipeline, our idea is to exploit the fact that processing 3 steps in a pipeline are commonly defined as parameterized soft- 4.5 Violation Analysis ware interfaces, e.g., a function, a software class, or a service. Our Once a constraint indicates that certain data is implausible with approach for integrating plausibility checks, therefore, is to create a high-confidence, a user needs to assess whether there is indeed a wrapper layer around common software components. In our use some error in the pipeline or whether the phenomenon is due case, for instance, there are common types of plots for visualiz- to an unexpected, and often particularly interesting trend in the ing specific data entities, such as source spectra. If a pipeline uses data. While the lineage of the data items for which implausibility pre-defined calls to construct common plot types, the link from is indicated can be derived directly from a respective provenance function arguments to data entities and from plausibility constraints graph, effective violation analysis aims at a more targeted analysis. to execution placements can be derived automatically. That is, given the set of all upstream data items, we strive for a Listing 1 illustrates a Python wrapper for a call (ctbutterfly()) separation of those that are actually correlated with the violation of the ctools library to create a so-called butterfly plot. Knowing that of the constraint from those that are irrelevant. the input of the call is a light curve, the wrapper fetches applicable We intend to approach this use case from two angles. First, a constraints (brown), and instantiates (green) and checks them (red). correlation analysis between the data items created at intermediate, upstream steps of the pipeline may help to identify which type def plauscheck_plot_ctbutterfly ( iact_light_curve ) : of data is likely to have a causal effect. Second, standard means entity = plauscheck.entities.IACT_LIGHT_CURVE constraintTypes = plauscheck.getConstraintsFor( entity ) for outlier detection for the distributions of data items created by for constraintType in constaintTypes : upstream steps in the pipeline may provide clues on abnormal constraint = constraintType( iact_light_curve ) trends that led to downstream constraint violations. plauscheck.validate( constraint ) ctools . ctbutterfly ( iact_light_curve ) 5 CONCLUSIONS iact_light_curve = derive_light_curve ( iact_event_list ) plauscheck_plot_ctbutterfly ( iact_light_curve ) In this paper, we motivated the need to support the design, adapta- tion, and evaluation of data processing pipelines with a case from Listing 1: A wrapper function for integrating plausibility Astrophysics. Against this background, we outlined the require- analysis into a library call for plotting IACT light curves. ments, our approach, and initial results on how to enable compre- hensive plausibility checking in exploratory data analysis. Having a first version of the model of plausibility constraints, our current 4.4 Constraint Validation research focuses on the derivation of constraints from the physical Once plausibility constraints have been formulated and integrated, models underlying the illustrated pipeline for flare detection. they need to be validated during pipeline execution. Given the common properties of scientific data, such validation is challenging: ACKNOWLEDGMENTS ◦ Data may be sparse, meaning that data is available only for We thank Iftach Sadeh (DESY) for valuable insights to gamma-ray certain (spatial or temporal) contexts. For our example of astrophysics. The work is supported by the Helmholtz Einstein MW data integration, time series are available only in short International Berlin Research School in Data Science (HEIBRiDS). time periods with varying cadences between instruments. ◦ Data may be uncertain in various ways. In the context of REFERENCES IACT and MW data, for example, there exists uncertainty [1] Saswat Anand, Edmund K Burke, Tsong Yueh Chen, et al. 2013. An orchestrated survey of methodologies for automated software test case generation. Journal of in the distance estimation between source and observer, as Systems and Software 86, 8 (2013), 1978–2001. well as in the source association, where activity in a region [2] Khalid Belhajjame, Helena Deus, Daniel Garijo, et al. 2012. Prov model primer. WWW Consortium (2012). of interest may be associated with several source candidates. [3] Jeffrey Chang. 2015. Core services: Reward bioinformaticians. Nature 520, 7546 ◦ Data can be multi-resolution, e.g. measuring a phenomena (April 2015), 151–152. in varying level of detail. In our use case, the sensitivity of [4] The Cherenkov Telescope Array Consortium. 2015. Science with the Cherenkov Telescope Array. International Journal of Modern Physics D (2015). different MW instruments varies greatly in terms of cap- [5] Vıctor Cuevas-Vicenttın et al. 2016. ProvONE: A PROV Extension Data Model tured energy resolution, requiring a careful integration that for Scientific Workflow Provenance. respects uncertainty associated with the instrument type. [6] Juliana Freire and Cláudio T. Silva. 2012. Making Computations and Publications Reproducible with VisTrails. Comput. Sci. Eng. 14, 4 (2012), 18–25. While we consider plausibility constraints to be stochastic, see §4.1, [7] Melanie Herschel, Ralf Diestelkämper, and Houssem B. Lahmar. 2017. A Survey on the above properties also motivate a probabilistic validation of them. Provenance: What for? What Form? What From? VLDB J. 26, 6 (2017), 881–906. [8] Jamie Holder. 2015. Atmospheric Cherenkov Gamma-ray Telescopes. arXiv This way, the confidence into the result of constraint validation is e-prints, (Oct. 2015), arXiv:1510.05675. quantified. In the extreme case, some plausibility constraints cannot [9] Jea Knödlseder, M Mayer, C Deil, et al. 2016. GammaLib and ctools-A software framework for the analysis of astronomical gamma-ray data. Astronomy & be validated due to data sparsity, uncertainty, or resolution differ- Astrophysics 593 (2016), A1. ences. Moreover, the evolution of confidence over time needs to be [10] Chee Sun Liew, Malcolm P. Atkinson, Michelle Galea, et al. 2016. Scientific taken into account. For some constraints, such as the constraint C1, Workflows: Moving Across Paradigms. ACM Comput. Surv. 49, 4, (2016). [11] Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, and Jiang Li. 2010. Mining the confidence changes over time. The more data is processed, the Invariants from Console Logs for System Problem Detection. In USENIX ATC. more reliable will be the assessment of the state of up-stream data [12] Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2020. BugDoc: Algorithms to sources, which increases the confidence in the ability to validate Debug Computational Processes. In SIGMOD, 463–478. [13] Victoria Stodden, Marcia McNutt, David H Bailey, et al. 2016. Enhancing repro- constraint C1. Against this background, we strive for algorithms ducibility for computational methods. Science 354, 6317 (2016), 1240–1241. for constraint validation that are rooted in Bayesian modeling. 4