Checking Plausibility in Exploratory Data Analysis
                                                                          Hermann Stolte
                                       supervised by Prof. Matthias Weidlich1 and Dr. Elisa Pueschel2
                                 1 Humboldt-Universität zu Berlin, 2 Deutsches Elektronen-Synchrotron DESY

                                                        hermann.stolte@hu-berlin.de
ABSTRACT
Exploratory data analysis is widespread across many scientific do-                                        Model Construction
mains. It relies on complex pipelines and computational models
for data processing, that are commonly designed collaboratively
                                                                                                                     t
                                                                                                                Plausibility


                                                                                                            I
                                                                                                                                    Pipeline
                                                                                                                Constraint
by scientists with diverse backgrounds for a variety of software
                                                                                                                                    o o
                                                                                                                  Model

stacks and computation environments. Here, a major challenge is
the uncertainty about the correctness of analysis results, due to the                                                t         ca        b            Pipeline
high complexity of both, the actual data and the implemented anal-
ysis steps, and the continuous reuse and adaptation of data analysis                       o          f   Plausibility                Pipeline        o o
                                                                                                                                                           (annotated)


                                                                                                                                                                         y
                                                                                                      Constraint De nition          Integration


                                                                                        D
pipelines in different application settings. This PhD project inves-
tigates how the design, adaptation, and evaluation of exploratory
data analysis pipelines can be supported through automated plau-                        User
                                                                                                                                                  a
sibility assessment. To this end, we outline the requirements, our                                    Violation Analysis       C3   Constraint           Pipeline
approach, and initial results for models and methods to enable                                                                 C4   Validation          Execution

plausibility checking in the context of exploratory data analysis.
                                                                                       Figure 1: An illustration of the steps for plausibility analysis
                                                                                       of scientific data processing pipelines.
                                                                                       originate not only from the restricted scope of a single research
1    INTRODUCTION                                                                      group or laboratory, but from external collaborators with limited or
Today’s large-scale research projects in domains such as Materials                     no involvement in the subsequent use of the updated code and data.
Science, Astrophysics, or Remote Sensing, to name just a few exam-                     In the absence of established routines for version management of
ples, often involve exploratory data analysis. Here, data from mul-                    datasets and data processing pipelines, such a setting can lead to
tiple distributed sources is integrated and analyzed collaboratively                   hard-to-find bugs, especially when time-pressure is involved.
by scientists with diverse backgrounds, from various disciplines,                         The overall uncertainty about the correctness of analysis results
and from different organizations. Key to the process are complex                       makes the development, maintenance, and evaluation of pipelines
pipelines for scientific data processing, sometimes referred to as                     difficult. Support is needed for users to assess the plausibility of the
scientific workflows [10], which may be designed for a variety of                      results obtained by data processing pipelines. Therefore, this PhD
software stacks and computation environments [13].                                     project is dedicated to answer the following research question:
   Main challenges here arise from the complexity of both, datasets                            How to support the design, adaptation, and evaluation
and analysis steps, that makes results difficult to interpret and error-                       of exploratory data processing pipelines through auto-
prone. With data being collected by multiple internal or external                              mated plausibility analyses?
stakeholders each using their own methods in varying environ-                             We aim to answer this research question by providing the foun-
ments, bias and noise need to be accounted for. For complex re-                        dations for automated plausibility analysis, as illustrated in Fig. 1.
search questions, interdisciplinary teams collaborate to combine
                                                                                           (1) Construct a meta-model for plausibility constraints
domain and technical expertise, e.g., for creating computational
                                                                                           (2) Support users in defining plausibility constraints
models. Researchers from different backgrounds have an individual
                                                                                           (3) Integrate constraints into a given data processing pipeline
set of hidden assumptions about data and models, which can easily
                                                                                           (4) Validate constraints during pipeline execution
cause miscommunication and thus introduce errors in the design
                                                                                           (5) Enable users to identify root causes of constraint validations
of the data processing pipelines.
   An important aspect of exploratory data analysis is that the                           The next section illustrates the need for automated plausibility
specific research questions evolve over time. As a consequence,                        assessment by a specific application case from the field of Astro-
pipelines and datasets are also subject to frequent changes [3].                       physics. §3 gives an overview of relevant related work. In §4, we
While program code for complex analysis is error-prone in general,                     outline our solution approach, before we conclude in §5.
the evolution of pipelines and data is inherent to exploratory data
analysis and amplifies the resulting challenges. Moreover, changes                     2       BACKGROUND
                                                                                                 fi

                                                                                       In Astrophysics research, extreme acceleration processes (e.g., around
Proceedings of the VLDB 2021 PhD Workshop, August 16th, 2021. Copenhagen, Den-         black holes, exploding or merging stars) are of high scientific inter-
mark. Copyright (C) 2021 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).                             est [4]. Here, energy is emitted as radiation across the whole electro-
                                                                                       magnetic spectrum, including gamma radiation. When gamma-rays
                                                                                                                                Aggregating event                                                           High
     (manual or         IACT                                                     Filtering by          Filtered                                                                       Delayed
                                            Event                IACT                                                            counts over time,             IACT Light                                  Quality           Local
     automatic)        Camera                                                   Pre-Selection           IACT                                                                        High-Quality
                                         Reconstruction        Event List                                                       grouped by energy                Curve                                   IACT Light        Observation
     Observation       Images                                                        Cuts             Event List                                                                   Post-Processing
                                                                                                                                      range                                                                Curve            Archive
     Scheduling

                                           Observing                                                                                                                        Is there any spectral
                                           Condition                                                                                                              C4        upturn or is the spectrum
                                           Metadata                                                                               Is there any activity that                unphysically hard?            Alert
                                                                                                                                                                                                                            IACT Alert
                                                                                                                                  cannot be associated                                                  Decision
                                                                                                                       C2                                                                                                     Steam
                                                                                                                                  with a known source?                                                  Log Entry

                                                                                   Temporal               Multi-
                                                               Raw MW           Integration and
                                              Filtering                                                Wavelength
                                                                Data                Source             Light Curve                          Is the multi-wavelength
                                                                                  Association                                                                                                                                (manual)
                                                                                                                                   C3       light curve after source
                                                                                                                                                                                                                         Decision-making
                                                                                                                                            association realistic?
             Alert       External                                                                                                                                                                                         for public alert
           Stream      Light-Curves                            Has another pipeline stopped
             from       (e.g. radio,                           updating and sent out the                                  Flare
                                                          C1                                    Preprocessing                                                           Flare          Near Real-Time
           external    optical, x-ray,                         same timestamps/values?                                  Detection            Flare Detection                                              Observation
                                                                                                 and Training           Training                                       Detection        Inference on
         instruments    gama-ray)                                                                                                            Model Training                                               Significance
                                                                                                Data Curation            Dataset                                        Model          Observed Data


   Figure 2: A pipeline for near real-time flare detection in IACT and multi-wavelength data with four plausibility checks.

hit the Earth’s atmosphere, they cause faint light showers, which                                                           the combined expertise and interdisciplinary collaboration of astro-
are observable using ground-based telescopes. With so called Imag-                                                          physicists, computer scientists, and further engineers, is required.
ing Atmospheric Cherenkov Telescopes (IACT) [8], the physics                                                                Against this background, the design of data processing pipelines is
behind acceleration processes can be studied, e.g., by analyzing                                                            an error-prone process, which continuously raises the question of
temporal and spectral changes over time.                                                                                    how to assess the plausibility of the produced data.
   However, the respective data analysis is challenging. Background                                                            For the pipeline in Fig. 2, for instance, the plausibility of inter-
noise from cosmic rays needs to be filtered out and data can only                                                           mediate data can be assessed based on the following constraints:
be captured when specific observing conditions are met (e.g., given                                                            C1 It may happen that an external pipeline to deliver and filter
source visibility in the sky, no clouds, no bright moonlight). Also,                                                               light curves from external instruments is faulty and sends
reconstructing the properties of gamma ray photons from images                                                                     identical measurements repetitively, which is implausible.
of light showers introduces systematic bias (e.g., related to the visi-                                                        C2 In MW data, source activity can usually be observed by mul-
bility of light showers at low energies and different zenith angles).                                                              tiple instruments. A source with high activity being detected
   To study acceleration processes effectively, IACT data is com-                                                                  only by a single instrument is therefore suspicious.
bined with data from other instruments (e.g., other gamma-ray,                                                                 C3 MW instruments can have a limited spatial resolution, so
optical, x-ray or radar observatories), which is referred to as multi-                                                             that observed activity may be associated with several source
wavelength (MW) data. The integration and joint analysis of such                                                                   candidates. An erroneous source association may be discov-
data is particularly challenging when research questions demand                                                                    ered based on an implausible multi-wavelength light curve.
a near real-time analysis of transient phenomena, i.e., events that                                                            C4 Based on the current physical understanding of gamma-ray
may be observable for only a few minutes.                                                                                          emission from blazars, certain features of an IACT spectrum,
   For instance, consider the use case of near real-time IACT and                                                                  such as a spectral upturn, may be unexpected.
multi-wavelength blazar1 flare detection. Fig. 2 illustrates a pipeline                                                     Note that violations of plausibility constraints are not necessarily
for detecting flaring states and scheduling observations of blazars                                                         related to errors in the data processing pipeline. Rather, they may
based on IACT and MW data. Here, data from different distributed                                                            also hint at unexpected and, therefore, particularly interesting phys-
sources need to be jointly analyzed to detect transient flaring states                                                      ical phenomena. In any case, a plausibility assessment is beneficial
of blazars within hours to minutes. When a flaring state of interest is                                                     and supports the design, adaptation, and evaluation of the pipeline.
detected, the local observation schedule can be updated accordingly,
and other observatories will be alerted.
   Challenges for the analysis of IACT and MW data are imposed                                                              3        RELATED WORK
by the properties of the data, mainly its heterogeneity, sparsity, and                                                      Scientific workflows. Our ideas are related to the field of scientific
inherent bias and noise. Moreover, the research questions tend to                                                           workflows and workflow engines [10]. They have been researched
evolve, as for example, the definition of an interesting flare may be                                                       for decades to provide accessibility and reproducibility. To this
revised, which then leads to changes in the data processing pipeline.                                                       end, workflow engines (e.g., Apache Taverna, Galaxy, KNIME, and
In addition, the setting in which the analysis is conducted imposes                                                         Pegasus) support for the design and execution of pipelines by large
challenges on the technical side, as a variety of software stacks                                                           catalogs of operators, graphical user interfaces, and techniques for
and computational environments is used, from machine learning                                                               data integration. Yet, plausibility analysis as envisioned here is not
frameworks, through batch processing systems and middleware                                                                 part of their functionality.
for stream processing, to transient alert brokers. To cope with the
                                                                                                                            Analysis of provenance data. Some scientific workflow engines
challenges induced by the data, the analysis, and the infrastructure,
                                                                                                                            collect provenance data [7] as a source for analyzing data lineage,
1 A blazar is an astronomical object of particular scientific interest. When pointed to                                     i.e., to explain how a certain data item was created. Various mod-
earth, it can be observed as a highly variable source of multi-wavelength radiation.                                        els for provenance data have been defined in the literature, most
                                                                                                                   2
prominently PROV [2]. Moreover, there are tools available to ex-                 (2) Optionally, the location of external data referenced in the
plore and analyze provenance data, e.g., VisTrails [6], or to use it                  definition of the constraint.
for distributed debugging, e.g., with BugDoc [12]. We argue that                 (3) A constraint function that maps from the domains of the
models for provenance data are a useful starting point for automatic                  entities (and external data, if required) to a probability dis-
plausibility analysis of data processing pipelines.                                   tribution, thereby expressing the plausibility of data items.
                                                                                 According to this model, C4 from Fig. 2 is captured as a function
Program verification. Pipelines can be seen as programs, so that              that maps from the domain of an IACT light curve to a probabil-
inspiration for plausibility analysis may be taken from the field of          ity distribution. Specifically, the constraint function assigns high
software engineering, especially software verification. Here, work            probabilities when expected spectral features are present.
on automatic test case generation [1] or invariant mining [11],
which aims at identifying state-based expressions known to hold
true, are particularly promising. Considering system faults to be a           4.2    Constraint Definition
special case of implausible pipeline executions, and system invari-           The effort needed to define constraints shall be minimized, since
ants to be a special case of plausible pipeline behavior, inspiration         it is the crucial step in which a user has to invest time to benefit
can be drawn from approaches for discovering both.                            from automated plausibility checking down the line. Ideally, con-
                                                                              straints could be found and suggested to a user for review. Research
                                                                              questions here are hence (1) How to support the user in the manual
4     CHECKING PLAUSIBILITY
                                                                              definition of plausibility constraints, and (2) how to mine plausibility
This section outlines our approach to automated plausibility analy-           constraints automatically and assess their usefulness?
sis. For each step outlined in Fig. 1, we discuss the requirements,               To this end, we observe in our use case that constraints are closely
our approach and preliminary results of how to address them.                  related to physical models and laws. For example, the constraint C4
                                                                              requires defining a function that maps features of a source’s IACT
4.1    Model                                                                  light curve to a probability of plausibility. The latter is based on the
                                                                              source type, so that we aim to leverage physical models and laws
Automated plausibility assessment requires a model for making
                                                                              for mining and defining constraints.
statements about data. The model needs to enable the specification
                                                                                  We further intend to investigate whether textual documentation
of dependencies between data that is processed in a pipeline, while
                                                                              (e.g., instrument specifications) can serve as a basis for constraint
incorporating the following considerations:
                                                                              mining. This requires mapping the data entities of a provenance
Single or multi data item. Constraints may not only consider                  graph to named entities in natural language text, i.e., using tech-
        single data items, but the relation of multiple data items. For       niques for named entity recognition and relation extraction. Once
        instance, considering our use case, a change in the rate of           causal relations between entities have been discovered, they can be
        reconstructed events by an IACT telescope array is consid-            employed for suggestions in the manual definition of constraints.
        ered suspicious, if it coincides with a telescope entering or
        leaving the array. The plausibility of an IACT event list may
        therefore be assessed using observing condition metadata.             4.3    Pipeline Integration
Single or multi pipeline execution. A basic constraint checks                 A key question in automated plausibility analysis is how to integrate
        the value of one or more data items in isolation. However,            the definition of plausibility constraints into existing pipelines. The
        plausibility assessment may require the joint analysis of a           objective of pipeline integration is two-fold:
        sequence of data items as a value distribution across multi-               ◦ Data entities from the underlying provenance model need to
        ple pipeline executions. For instance, for the example con-                   be linked to software parameters and function arguments;
        straint C1, the sequence of 𝑛 previous values of a data item               ◦ Plausibility constraints need to be placed in a pipeline exe-
        is required to analyze the distribution variance.                             cution graph in the first possible position, where all required
Independent or dependent on external data. A constraint may                           data items are available. This ensures that implausibilities
        refer to external data sources. For example, the constraint C2                can be detected and reacted to as early as possible.
        requires access to an exhaustive catalog of known sources,            Embedding plausibility constraints in a pipeline definition is not
        that is not available as a data item during pipeline execution.       straightforward. Even in pipelines modeled as directed acyclic
Our idea is to develop a model for plausibility constraints based             graphs, the same type of data may occur multiple times in an exe-
on provenance graphs, which capture the relation between data                 cution graph. Also, a constraint may concern multiple data items.
items in a pipeline as a directed acyclic graph (DAG) [7]. Nodes                 To enable pipeline integration, the functionality for plausibility
in the graph are named entities and represent data items, whereas             analysis first needs to be implemented. Since exploratory analysis
edges are causal dependencies. Data items also need to be linked to           often relies on software toolkits that offer standard solutions for
computational steps. We intend to realize this using the ProvONE              data management and analysis methods in a specific domain, these
model [5], which links data provenance with pipeline definitions              provide a suitable starting point for such an implementation. For
(i.e., workflows in ProvONE terminology). Based thereon, the defi-            example, for the analysis of IACT data, the libraries ctools and
nition of a plausibility constraint includes:                                 gammalib [9] are examples for such toolkits.
    (1) A collection of one or more ProvONE entities representing                To integrate an implementation of plausibility analysis in the
        data items in a pipeline.                                             definition of a pipeline, our idea is to exploit the fact that processing
                                                                          3
steps in a pipeline are commonly defined as parameterized soft-                4.5     Violation Analysis
ware interfaces, e.g., a function, a software class, or a service. Our         Once a constraint indicates that certain data is implausible with
approach for integrating plausibility checks, therefore, is to create          a high-confidence, a user needs to assess whether there is indeed
a wrapper layer around common software components. In our use                  some error in the pipeline or whether the phenomenon is due
case, for instance, there are common types of plots for visualiz-              to an unexpected, and often particularly interesting trend in the
ing specific data entities, such as source spectra. If a pipeline uses         data. While the lineage of the data items for which implausibility
pre-defined calls to construct common plot types, the link from                is indicated can be derived directly from a respective provenance
function arguments to data entities and from plausibility constraints          graph, effective violation analysis aims at a more targeted analysis.
to execution placements can be derived automatically.                          That is, given the set of all upstream data items, we strive for a
   Listing 1 illustrates a Python wrapper for a call (ctbutterfly())           separation of those that are actually correlated with the violation
of the ctools library to create a so-called butterfly plot. Knowing that       of the constraint from those that are irrelevant.
the input of the call is a light curve, the wrapper fetches applicable             We intend to approach this use case from two angles. First, a
constraints (brown), and instantiates (green) and checks them (red).           correlation analysis between the data items created at intermediate,
                                                                               upstream steps of the pipeline may help to identify which type
def plauscheck_plot_ctbutterfly ( iact_light_curve ) :                         of data is likely to have a causal effect. Second, standard means
    entity = plauscheck.entities.IACT_LIGHT_CURVE
    constraintTypes = plauscheck.getConstraintsFor( entity )                   for outlier detection for the distributions of data items created by
    for constraintType in constaintTypes :                                     upstream steps in the pipeline may provide clues on abnormal
        constraint = constraintType( iact_light_curve )                        trends that led to downstream constraint violations.
        plauscheck.validate( constraint )
    ctools . ctbutterfly ( iact_light_curve )
                                                                               5     CONCLUSIONS
iact_light_curve = derive_light_curve ( iact_event_list )
plauscheck_plot_ctbutterfly ( iact_light_curve )                               In this paper, we motivated the need to support the design, adapta-
                                                                               tion, and evaluation of data processing pipelines with a case from
Listing 1: A wrapper function for integrating plausibility                     Astrophysics. Against this background, we outlined the require-
analysis into a library call for plotting IACT light curves.                   ments, our approach, and initial results on how to enable compre-
                                                                               hensive plausibility checking in exploratory data analysis. Having
                                                                               a first version of the model of plausibility constraints, our current
4.4    Constraint Validation                                                   research focuses on the derivation of constraints from the physical
Once plausibility constraints have been formulated and integrated,             models underlying the illustrated pipeline for flare detection.
they need to be validated during pipeline execution. Given the
common properties of scientific data, such validation is challenging:          ACKNOWLEDGMENTS
     ◦ Data may be sparse, meaning that data is available only for             We thank Iftach Sadeh (DESY) for valuable insights to gamma-ray
       certain (spatial or temporal) contexts. For our example of              astrophysics. The work is supported by the Helmholtz Einstein
       MW data integration, time series are available only in short            International Berlin Research School in Data Science (HEIBRiDS).
       time periods with varying cadences between instruments.
     ◦ Data may be uncertain in various ways. In the context of                REFERENCES
       IACT and MW data, for example, there exists uncertainty                  [1] Saswat Anand, Edmund K Burke, Tsong Yueh Chen, et al. 2013. An orchestrated
                                                                                    survey of methodologies for automated software test case generation. Journal of
       in the distance estimation between source and observer, as                   Systems and Software 86, 8 (2013), 1978–2001.
       well as in the source association, where activity in a region            [2] Khalid Belhajjame, Helena Deus, Daniel Garijo, et al. 2012. Prov model primer.
                                                                                    WWW Consortium (2012).
       of interest may be associated with several source candidates.            [3] Jeffrey Chang. 2015. Core services: Reward bioinformaticians. Nature 520, 7546
     ◦ Data can be multi-resolution, e.g. measuring a phenomena                     (April 2015), 151–152.
       in varying level of detail. In our use case, the sensitivity of          [4] The Cherenkov Telescope Array Consortium. 2015. Science with the Cherenkov
                                                                                    Telescope Array. International Journal of Modern Physics D (2015).
       different MW instruments varies greatly in terms of cap-                 [5] Vıctor Cuevas-Vicenttın et al. 2016. ProvONE: A PROV Extension Data Model
       tured energy resolution, requiring a careful integration that                for Scientific Workflow Provenance.
       respects uncertainty associated with the instrument type.                [6] Juliana Freire and Cláudio T. Silva. 2012. Making Computations and Publications
                                                                                    Reproducible with VisTrails. Comput. Sci. Eng. 14, 4 (2012), 18–25.
While we consider plausibility constraints to be stochastic, see §4.1,          [7] Melanie Herschel, Ralf Diestelkämper, and Houssem B. Lahmar. 2017. A Survey on
the above properties also motivate a probabilistic validation of them.              Provenance: What for? What Form? What From? VLDB J. 26, 6 (2017), 881–906.
                                                                                [8] Jamie Holder. 2015. Atmospheric Cherenkov Gamma-ray Telescopes. arXiv
This way, the confidence into the result of constraint validation is                e-prints, (Oct. 2015), arXiv:1510.05675.
quantified. In the extreme case, some plausibility constraints cannot           [9] Jea Knödlseder, M Mayer, C Deil, et al. 2016. GammaLib and ctools-A software
                                                                                    framework for the analysis of astronomical gamma-ray data. Astronomy &
be validated due to data sparsity, uncertainty, or resolution differ-               Astrophysics 593 (2016), A1.
ences. Moreover, the evolution of confidence over time needs to be             [10] Chee Sun Liew, Malcolm P. Atkinson, Michelle Galea, et al. 2016. Scientific
taken into account. For some constraints, such as the constraint C1,                Workflows: Moving Across Paradigms. ACM Comput. Surv. 49, 4, (2016).
                                                                               [11] Jian-Guang Lou, Qiang Fu, Shengqi Yang, Ye Xu, and Jiang Li. 2010. Mining
the confidence changes over time. The more data is processed, the                   Invariants from Console Logs for System Problem Detection. In USENIX ATC.
more reliable will be the assessment of the state of up-stream data            [12] Raoni Lourenço, Juliana Freire, and Dennis Shasha. 2020. BugDoc: Algorithms to
sources, which increases the confidence in the ability to validate                  Debug Computational Processes. In SIGMOD, 463–478.
                                                                               [13] Victoria Stodden, Marcia McNutt, David H Bailey, et al. 2016. Enhancing repro-
constraint C1. Against this background, we strive for algorithms                    ducibility for computational methods. Science 354, 6317 (2016), 1240–1241.
for constraint validation that are rooted in Bayesian modeling.
                                                                           4