<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Checking Plausibility in Exploratory Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
<string-name>Hermann Stolte, supervised by Matthias Weidlich</string-name>
          <email>hermann.stolte@hu-berlin.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dr. Elisa Pueschel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>D User</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deutsches Elektronen-Synchrotron DESY</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Humboldt-Universität zu Berlin</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Model Construction t Plausibility</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>Exploratory data analysis is widespread across many scientific domains. It relies on complex pipelines and computational models for data processing, which are commonly designed collaboratively by scientists with diverse backgrounds for a variety of software stacks and computation environments. Here, a major challenge is the uncertainty about the correctness of analysis results, due to the high complexity of both the actual data and the implemented analysis steps, and due to the continuous reuse and adaptation of data analysis pipelines in different application settings. This PhD project investigates how the design, adaptation, and evaluation of exploratory data analysis pipelines can be supported through automated plausibility assessment. To this end, we outline the requirements, our approach, and initial results for models and methods to enable plausibility checking in the context of exploratory data analysis.</p>
      </abstract>
      <kwd-group>
        <kwd>Plausibility</kwd>
        <kwd>Constraint Definition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>1 INTRODUCTION</title>
      <p>
        Today’s large-scale research projects in domains such as Materials
Science, Astrophysics, or Remote Sensing, to name just a few
examples, often involve exploratory data analysis. Here, data from
multiple distributed sources is integrated and analyzed collaboratively
by scientists with diverse backgrounds, from various disciplines,
and from different organizations. Key to the process are complex
pipelines for scientific data processing, sometimes referred to as
scientific workflows [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which may be designed for a variety of
software stacks and computation environments [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
<p>Main challenges here arise from the complexity of both datasets
and analysis steps, which makes results difficult to interpret and
error-prone. With data being collected by multiple internal or external
stakeholders each using their own methods in varying
environments, bias and noise need to be accounted for. For complex
research questions, interdisciplinary teams collaborate to combine
domain and technical expertise, e.g., for creating computational
models. Researchers from different backgrounds have an individual
set of hidden assumptions about data and models, which can easily
cause miscommunication and thus introduce errors in the design
of the data processing pipelines.</p>
      <p>
        An important aspect of exploratory data analysis is that the
specific research questions evolve over time. As a consequence,
pipelines and datasets are also subject to frequent changes [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
While program code for complex analysis is error-prone in general,
the evolution of pipelines and data is inherent to exploratory data
analysis and amplifies the resulting challenges. Moreover, changes
originate not only from the restricted scope of a single research
group or laboratory, but from external collaborators with limited or
no involvement in the subsequent use of the updated code and data.
In the absence of established routines for version management of
datasets and data processing pipelines, such a setting can lead to
hard-to-find bugs, especially when time-pressure is involved.</p>
      <p>The overall uncertainty about the correctness of analysis results
makes the development, maintenance, and evaluation of pipelines
difficult. Support is needed for users to assess the plausibility of the
results obtained by data processing pipelines. Therefore, this PhD
project is dedicated to answering the following research question:
How to support the design, adaptation, and evaluation
of exploratory data processing pipelines through
automated plausibility analyses?</p>
      <p>We aim to answer this research question by providing the
foundations for automated plausibility analysis, as illustrated in Fig. 1.
(1) Construct a meta-model for plausibility constraints
(2) Support users in defining plausibility constraints
(3) Integrate constraints into a given data processing pipeline
(4) Validate constraints during pipeline execution
(5) Enable users to identify root causes of constraint violations
The next section illustrates the need for automated plausibility
assessment by a specific application case from the field of
Astrophysics. §3 gives an overview of relevant related work. In §4, we
outline our solution approach, before we conclude in §5.
</p>
    </sec>
    <sec id="sec-2">
<title>2 BACKGROUND</title>
      <p>
        In Astrophysics research, extreme acceleration processes (e.g., around
black holes, exploding or merging stars) are of high scientific
interest [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Here, energy is emitted as radiation across the whole
electromagnetic spectrum, including gamma radiation. When gamma-rays
hit the Earth’s atmosphere, they cause faint light showers, which
are observable using ground-based telescopes. With so called
Imaging Atmospheric Cherenkov Telescopes (IACT) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the physics
behind acceleration processes can be studied, e.g., by analyzing
temporal and spectral changes over time.
      </p>
      <p>However, the respective data analysis is challenging. Background
noise from cosmic rays needs to be filtered out and data can only
be captured when specific observing conditions are met (e.g., given
source visibility in the sky, no clouds, no bright moonlight). Also,
reconstructing the properties of gamma ray photons from images
of light showers introduces systematic bias (e.g., related to the
visibility of light showers at low energies and different zenith angles).</p>
<p>To study acceleration processes effectively, IACT data is
combined with data from other instruments (e.g., other gamma-ray,
optical, x-ray or radio observatories), which is referred to as
multiwavelength (MW) data. The integration and joint analysis of such
data is particularly challenging when research questions demand
a near real-time analysis of transient phenomena, i.e., events that
may be observable for only a few minutes.</p>
<p>For instance, consider the use case of near real-time IACT and
multi-wavelength blazar flare detection. (A blazar is an astronomical
object of particular scientific interest; when pointed towards Earth, it
can be observed as a highly variable source of multi-wavelength
radiation.) Fig. 2 illustrates a pipeline for detecting flaring states
and scheduling observations of blazars based on IACT and MW data. Here,
data from different distributed sources need to be jointly analyzed to
detect transient flaring states of blazars within hours to minutes.
When a flaring state of interest is detected, the local observation
schedule can be updated accordingly, and other observatories will be
alerted.</p>
      <p>Challenges for the analysis of IACT and MW data are imposed
by the properties of the data, mainly its heterogeneity, sparsity, and
inherent bias and noise. Moreover, the research questions tend to
evolve; for example, the definition of an interesting flare may be
revised, which then leads to changes in the data processing pipeline.
In addition, the setting in which the analysis is conducted imposes
challenges on the technical side, as a variety of software stacks
and computational environments is used, from machine learning
frameworks, through batch processing systems and middleware
for stream processing, to transient alert brokers. To cope with the
challenges induced by the data, the analysis, and the infrastructure,
the combined expertise and interdisciplinary collaboration of
astrophysicists, computer scientists, and further engineers is required.
Against this background, the design of data processing pipelines is
an error-prone process, which continuously raises the question of
how to assess the plausibility of the produced data.</p>
<p>For the pipeline in Fig. 2, for instance, the plausibility of
intermediate data can be assessed based on the following constraints:
C1 It may happen that an external pipeline that delivers and filters
light curves from external instruments is faulty and sends
identical measurements repetitively, which is implausible.
C2 In MW data, source activity can usually be observed by
multiple instruments. A source with high activity being detected
only by a single instrument is therefore suspicious.</p>
      <p>C3 MW instruments can have a limited spatial resolution, so
that observed activity may be associated with several source
candidates. An erroneous source association may be
discovered based on an implausible multi-wavelength light curve.
C4 Based on the current physical understanding of gamma-ray
emission from blazars, certain features of an IACT spectrum,
such as a spectral upturn, may be unexpected.</p>
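<p>To make C1 concrete, a minimal check can be sketched as follows; the function name and the fixed window size are illustrative assumptions of ours, not part of the actual pipeline:</p>
      <p><preformat>
```python
# Toy sketch of constraint C1 (hypothetical helper, not pipeline code):
# flag an external light-curve stream whose most recent measurements
# repeat identical timestamp/value pairs, hinting at a stuck upstream
# pipeline.
def c1_violated(timestamps, values, window=5):
    """True if the last `window` (timestamp, value) pairs are identical."""
    recent = list(zip(timestamps, values))[-window:]
    return len(recent) == window and len(set(recent)) == 1
```
</preformat></p>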
      <p>Note that violations of plausibility constraints are not necessarily
related to errors in the data processing pipeline. Rather, they may
also hint at unexpected and, therefore, particularly interesting
physical phenomena. In any case, a plausibility assessment is beneficial
and supports the design, adaptation, and evaluation of the pipeline.</p>
    </sec>
    <sec id="sec-3">
      <title>3 RELATED WORK</title>
      <p>
        Scientific workflows. Our ideas are related to the field of scientific
workflows and workflow engines [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. They have been researched
for decades to provide accessibility and reproducibility. To this
end, workflow engines (e.g., Apache Taverna, Galaxy, KNIME, and
Pegasus) support the design and execution of pipelines through large
catalogs of operators, graphical user interfaces, and techniques for
data integration. Yet, plausibility analysis as envisioned here is not
part of their functionality.
      </p>
      <p>
        Analysis of provenance data. Some scientific workflow engines
collect provenance data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as a source for analyzing data lineage,
i.e., to explain how a certain data item was created. Various
models for provenance data have been defined in the literature, most
prominently PROV [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, there are tools available to
explore and analyze provenance data, e.g., VisTrails [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], or to use it
for distributed debugging, e.g., with BugDoc [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We argue that
models for provenance data are a useful starting point for automatic
plausibility analysis of data processing pipelines.
      </p>
      <p>
        Program verification. Pipelines can be seen as programs, so that
inspiration for plausibility analysis may be taken from the field of
software engineering, especially software verification. Here, work
on automatic test case generation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or invariant mining [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
which aims at identifying state-based expressions known to hold
true, is particularly promising. Considering system faults to be a
special case of implausible pipeline executions, and system
invariants to be a special case of plausible pipeline behavior, inspiration
can be drawn from approaches for discovering both.
      </p>
    </sec>
    <sec id="sec-4">
<title>4 CHECKING PLAUSIBILITY</title>
      <p>This section outlines our approach to automated plausibility
analysis. For each step outlined in Fig. 1, we discuss the requirements,
our approach, and preliminary results of how to address them.</p>
    </sec>
    <sec id="sec-5">
<title>4.1 Model</title>
<p>Automated plausibility assessment requires a model for making
statements about data. The model needs to enable the specification
of dependencies between data that is processed in a pipeline, while
incorporating the following considerations:</p>
      <p>Single or multi data item. Constraints may not only consider
single data items, but the relation of multiple data items. For
instance, considering our use case, a change in the rate of events
reconstructed by an IACT telescope array is considered suspicious
if it coincides with a telescope entering or leaving the array. The
plausibility of an IACT event list may therefore be assessed using
observing condition metadata.</p>
      <p>Single or multi pipeline execution. A basic constraint checks
the value of one or more data items in isolation. However,
plausibility assessment may require the joint analysis of a sequence
of data items as a value distribution across multiple pipeline
executions. For instance, for the example constraint C1, the sequence
of previous values of a data item is required to analyze the
distribution variance.</p>
      <sec id="sec-5-1">
<title>Independent or dependent on external data</title>
        <p>
          A constraint may refer to external data sources. For example, the constraint C2
requires access to an exhaustive catalog of known sources,
that is not available as a data item during pipeline execution.
Our idea is to develop a model for plausibility constraints based
on provenance graphs, which capture the relation between data
items in a pipeline as a directed acyclic graph (DAG) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Nodes
in the graph are named entities and represent data items, whereas
edges are causal dependencies. Data items also need to be linked to
computational steps. We intend to realize this using the ProvONE
model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], which links data provenance with pipeline definitions
(i.e., workflows in ProvONE terminology). Based thereon, the
definition of a plausibility constraint includes:
(1) A collection of one or more ProvONE entities representing
data items in a pipeline.
(2) Optionally, the location of external data referenced in the
definition of the constraint.
(3) A constraint function that maps from the domains of the
entities (and external data, if required) to a probability
distribution, thereby expressing the plausibility of data items.
        </p>
        <p>According to this model, C4 from Fig. 2 is captured as a function
that maps from the domain of an IACT light curve to a
probability distribution. Specifically, the constraint function assigns high
probabilities when expected spectral features are present.
</p>
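<p>The three ingredients of a constraint definition can be sketched as a small Python structure; all names here (Entity, PlausibilityConstraint) are illustrative assumptions of ours, not ProvONE terminology:</p>
        <p><preformat>
```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Illustrative sketch of the constraint model; class and field names
# are our assumptions, not part of ProvONE.
@dataclass(frozen=True)
class Entity:
    """Named ProvONE-style entity representing a data item."""
    name: str

@dataclass
class PlausibilityConstraint:
    entities: Sequence[Entity]           # (1) data items in the pipeline
    function: Callable[..., float]       # (3) maps values to a plausibility score
    external_data: Optional[str] = None  # (2) e.g., URI of a source catalog

    def plausibility(self, *values) -> float:
        return self.function(*values)

# A C4-like constraint: low plausibility for an unphysically hard spectrum.
c4 = PlausibilityConstraint(
    entities=[Entity("IACT_LIGHT_CURVE")],
    function=lambda spectral_index: 0.9 if spectral_index > 1.5 else 0.1,
)
```
</preformat></p>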
      </sec>
    </sec>
    <sec id="sec-6">
<title>4.2 Constraint Definition</title>
<p>The effort needed to define constraints shall be minimized, since
it is the crucial step in which a user has to invest time to benefit
from automated plausibility checking down the line. Ideally,
constraints could be found and suggested to a user for review. Research
questions here are hence (1) How to support the user in the manual
definition of plausibility constraints, and (2) how to mine plausibility
constraints automatically and assess their usefulness?</p>
      <p>To this end, we observe in our use case that constraints are closely
related to physical models and laws. For example, the constraint C4
requires defining a function that maps features of a source’s IACT
light curve to a probability of plausibility. The latter is based on the
source type, so that we aim to leverage physical models and laws
for mining and defining constraints.</p>
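<p>To make this concrete, the following sketch derives a C4-style constraint function from the physical expectation that a blazar's photon spectrum follows a falling power law; the function name and the slope threshold are illustrative assumptions, not a validated physical model:</p>
      <p><preformat>
```python
import math

# Hypothetical sketch: a constraint function derived from the physical
# expectation that the photon spectrum falls as a power law in energy.
def c4_plausibility(photon_energies_tev, photon_counts):
    """Plausibility score for an IACT spectrum: low if the fitted
    log-log slope indicates an unphysically hard spectrum or upturn."""
    xs = [math.log(e) for e in photon_energies_tev]
    ys = [math.log(c) for c in photon_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    slope = cov / var  # least-squares spectral slope in log-log space
    # A falling spectrum has a clearly negative slope; a slope near zero
    # or positive would be an implausible upturn (illustrative threshold).
    return 0.1 if slope > -1.0 else 0.9
```
</preformat></p>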
      <p>We further intend to investigate whether textual documentation
(e.g., instrument specifications) can serve as a basis for constraint
mining. This requires mapping the data entities of a provenance
graph to named entities in natural language text, i.e., using
techniques for named entity recognition and relation extraction. Once
causal relations between entities have been discovered, they can be
employed for suggestions in the manual definition of constraints.
</p>
    </sec>
    <sec id="sec-7">
<title>4.3 Pipeline Integration</title>
      <p>A key question in automated plausibility analysis is how to integrate
the definition of plausibility constraints into existing pipelines. The
objective of pipeline integration is two-fold:
◦ Data entities from the underlying provenance model need to
be linked to software parameters and function arguments;
◦ Plausibility constraints need to be placed in a pipeline
execution graph at the earliest position where all required
data items are available. This ensures that implausibilities
can be detected and reacted to as early as possible.</p>
      <p>Embedding plausibility constraints in a pipeline definition is not
straightforward. Even in pipelines modeled as directed acyclic
graphs, the same type of data may occur multiple times in an
execution graph. Also, a constraint may concern multiple data items.</p>
      <p>
        To enable pipeline integration, the functionality for plausibility
analysis first needs to be implemented. Since exploratory analysis
often relies on software toolkits that offer standard solutions for
data management and analysis methods in a specific domain, these
provide a suitable starting point for such an implementation. For
example, for the analysis of IACT data, the libraries ctools and
gammalib [
        <xref ref-type="bibr" rid="ref9">9</xref>
] are two such toolkits.
      </p>
      <p>To integrate an implementation of plausibility analysis in the
definition of a pipeline, our idea is to exploit the fact that processing
steps in a pipeline are commonly defined as parameterized
software interfaces, e.g., a function, a software class, or a service. Our
approach for integrating plausibility checks, therefore, is to create
a wrapper layer around common software components. In our use
case, for instance, there are common types of plots for
visualizing specific data entities, such as source spectra. If a pipeline uses
pre-defined calls to construct common plot types, the link from
function arguments to data entities and from plausibility constraints
to execution placements can be derived automatically.</p>
<p>Listing 1 illustrates a Python wrapper for a call (ctbutterfly())
of the ctools library to create a so-called butterfly plot. Knowing that
the input of the call is a light curve, the wrapper fetches the applicable
constraint types, instantiates them for the given data item, and validates
them before issuing the actual library call.</p>
      <p><preformat>
def plauscheck_plot_ctbutterfly(iact_light_curve):
    entity = plauscheck.entities.IACT_LIGHT_CURVE
    constraintTypes = plauscheck.getConstraintsFor(entity)
    for constraintType in constraintTypes:
        constraint = constraintType(iact_light_curve)
        plauscheck.validate(constraint)
    ctools.ctbutterfly(iact_light_curve)

iact_light_curve = derive_light_curve(iact_event_list)
plauscheck_plot_ctbutterfly(iact_light_curve)
</preformat></p>
      <sec id="sec-7-1">
        <title>Listing 1: A wrapper function for integrating plausibility</title>
        <p>analysis into a library call for plotting IACT light curves.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>4.4 Constraint Validation</title>
      <p>Once plausibility constraints have been formulated and integrated,
they need to be validated during pipeline execution. Given the
common properties of scientific data, such validation is challenging:
◦ Data may be sparse, meaning that data is available only for
certain (spatial or temporal) contexts. For our example of
MW data integration, time series are available only in short
time periods with varying cadences between instruments.
◦ Data may be uncertain in various ways. In the context of
IACT and MW data, for example, there exists uncertainty
in the distance estimation between source and observer, as
well as in the source association, where activity in a region
of interest may be associated with several source candidates.
◦ Data can be multi-resolution, e.g., measuring a phenomenon
in varying levels of detail. In our use case, the sensitivity of
different MW instruments varies greatly in terms of
captured energy resolution, requiring a careful integration that
respects the uncertainty associated with the instrument type.
Since we consider plausibility constraints to be stochastic (see §4.1),
the above properties also motivate their probabilistic validation.
This way, the confidence in the result of constraint validation is
quantified. In the extreme case, some plausibility constraints cannot
be validated at all due to data sparsity, uncertainty, or resolution
differences. Moreover, the evolution of confidence over time needs to be
taken into account. For some constraints, such as the constraint C1,
the confidence changes over time: the more data is processed, the
more reliable the assessment of the state of upstream data sources
becomes, which increases the confidence in the ability to validate
constraint C1. Against this background, we strive for algorithms
for constraint validation that are rooted in Bayesian modeling.</p>
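<p>As an example of what such a Bayesian treatment could look like for constraint C1, the following sketch tracks the posterior belief that an upstream source is stuck, using a Beta-Bernoulli model over repeated values; this is a simplifying assumption of ours, not the project's final algorithm:</p>
      <p><preformat>
```python
# Sketch: Beta-Bernoulli posterior for a C1-style constraint. Each
# consecutive pair of observations is a Bernoulli trial: "repeat"
# (identical to the previous value) or "change". The posterior mean of
# the repeat rate approaches 1.0 for a stuck upstream source, and the
# confidence grows as more data is processed.
def stuck_probability(values, alpha=1.0, beta=1.0):
    prev = None
    for v in values:
        if prev is not None:
            if v == prev:
                alpha += 1.0  # observed a repeat
            else:
                beta += 1.0   # observed a change
        prev = v
    return alpha / (alpha + beta)  # posterior mean of the repeat rate
```
</preformat></p>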
    </sec>
    <sec id="sec-9">
      <title>4.5 Violation Analysis</title>
<p>Once a constraint indicates with high confidence that certain
data is implausible, a user needs to assess whether there is indeed
some error in the pipeline or whether the phenomenon is due
to an unexpected, and often particularly interesting trend in the
data. While the lineage of the data items for which implausibility
is indicated can be derived directly from a respective provenance
graph, effective violation analysis aims at a more targeted analysis.
That is, given the set of all upstream data items, we strive for a
separation of those that are actually correlated with the violation
of the constraint from those that are irrelevant.</p>
      <p>We intend to approach this use case from two angles. First, a
correlation analysis between the data items created at intermediate,
upstream steps of the pipeline may help to identify which type
of data is likely to have a causal effect. Second, standard means
for outlier detection for the distributions of data items created by
upstream steps in the pipeline may provide clues on abnormal
trends that led to downstream constraint violations.</p>
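<p>A first cut at the correlation angle can be sketched as follows; the representation of upstream items as per-execution summary statistics is our assumption for illustration:</p>
      <p><preformat>
```python
from statistics import mean

# Sketch: rank upstream data items by how strongly their per-execution
# summary statistic correlates with the observed constraint violations.
def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def rank_suspects(upstream_stats, violations):
    """upstream_stats: dict mapping item name to per-execution values;
    violations: 0/1 indicator per execution. Returns item names sorted
    by absolute correlation with the violations, strongest first."""
    scores = {name: abs(pearson(vals, violations))
              for name, vals in upstream_stats.items()}
    return sorted(scores, key=scores.get, reverse=True)
```
</preformat></p>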
    </sec>
    <sec id="sec-10">
      <title>5 CONCLUSIONS</title>
      <p>In this paper, we motivated the need to support the design,
adaptation, and evaluation of data processing pipelines with a case from
Astrophysics. Against this background, we outlined the
requirements, our approach, and initial results on how to enable
comprehensive plausibility checking in exploratory data analysis. Having
a first version of the model of plausibility constraints, our current
research focuses on the derivation of constraints from the physical
models underlying the illustrated pipeline for flare detection.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGMENTS</title>
<p>We thank Iftach Sadeh (DESY) for valuable insights into gamma-ray
astrophysics. The work is supported by the Helmholtz Einstein
International Berlin Research School in Data Science (HEIBRiDS).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Saswat</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
Edmund K Burke, Tsong Yueh Chen
          , et al.
          <year>2013</year>
          .
          <article-title>An orchestrated survey of methodologies for automated software test case generation</article-title>
          .
          <source>Journal of Systems and Software 86</source>
          ,
          <issue>8</issue>
          (
          <year>2013</year>
          ),
          <fpage>1978</fpage>
          -
          <lpage>2001</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Khalid</given-names>
            <surname>Belhajjame</surname>
          </string-name>
          , Helena Deus,
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Garijo</surname>
          </string-name>
          , et al.
          <year>2012</year>
          .
          <article-title>Prov model primer</article-title>
          .
          <source>WWW Consortium</source>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
<given-names>Jeffrey</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Core services: Reward bioinformaticians</article-title>
          .
          <source>Nature</source>
          <volume>520</volume>
          ,
          <issue>7546</issue>
          (April
          <year>2015</year>
          ),
          <fpage>151</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>The</given-names>
            <surname>Cherenkov Telescope Array Consortium</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Science with the Cherenkov Telescope Array</article-title>
          .
<source>International Journal of Modern Physics D</source>
(
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
<given-names>Víctor</given-names>
            <surname>Cuevas-Vicenttín</surname>
          </string-name>
          et al.
          <year>2016</year>
          .
          <article-title>ProvONE: A PROV Extension Data Model for Scientific Workflow Provenance</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
<string-name>
            <given-names>Juliana</given-names>
            <surname>Freire</surname>
          </string-name>
          and
          <string-name>
            <given-names>Cláudio T.</given-names>
            <surname>Silva</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Making Computations and Publications Reproducible with VisTrails</article-title>
          .
          <source>Comput. Sci. Eng</source>
          .
          <volume>14</volume>
          ,
          <issue>4</issue>
          (
          <year>2012</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Melanie</given-names>
            <surname>Herschel</surname>
          </string-name>
          , Ralf Diestelkämper, and
<string-name>
            <given-names>Houssem B.</given-names>
            <surname>Lahmar</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A Survey on Provenance: What for? What Form? What From?</article-title>
<source>VLDB J.</source>
          <volume>26</volume>
          ,
          <issue>6</issue>
          (
          <year>2017</year>
          ),
          <fpage>881</fpage>
          -
          <lpage>906</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Holder</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Atmospheric Cherenkov Gamma-ray Telescopes</article-title>
. arXiv e-prints (Oct.
          <year>2015</year>
          ), arXiv:1510.05675.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jea</given-names>
            <surname>Knödlseder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Mayer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C</given-names>
            <surname>Deil</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>GammaLib and ctools-A software framework for the analysis of astronomical gamma-ray data</article-title>
          .
          <source>Astronomy &amp; Astrophysics</source>
          <volume>593</volume>
          (
          <year>2016</year>
          ),
          <fpage>A1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
<string-name>
            <given-names>Chee Sun</given-names>
            <surname>Liew</surname>
          </string-name>
          , Malcolm P. Atkinson,
          <string-name>
            <given-names>Michelle</given-names>
            <surname>Galea</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Scientific Workflows: Moving Across Paradigms</article-title>
          .
          <source>ACM Comput. Surv</source>
          .
          <volume>49</volume>
          ,
          <issue>4</issue>
          , (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
<string-name>
            <given-names>Jian-Guang</given-names>
            <surname>Lou</surname>
          </string-name>
          , Qiang Fu, Shengqi Yang, Ye Xu, and
          <string-name>
            <given-names>Jiang</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Mining Invariants from Console Logs for System Problem Detection</article-title>
          .
          <source>In USENIX ATC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
<string-name>
            <given-names>Raoni</given-names>
            <surname>Lourenço</surname>
          </string-name>
          , Juliana Freire, and
          <string-name>
            <given-names>Dennis</given-names>
            <surname>Shasha</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>BugDoc: Algorithms to Debug Computational Processes</article-title>
          . In SIGMOD,
          <fpage>463</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
<string-name>
            <given-names>Victoria</given-names>
            <surname>Stodden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marcia</given-names>
            <surname>McNutt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David H</given-names>
            <surname>Bailey</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Enhancing reproducibility for computational methods</article-title>
          .
          <source>Science</source>
          <volume>354</volume>
          ,
          <issue>6317</issue>
          (
          <year>2016</year>
          ),
          <fpage>1240</fpage>
          -
          <lpage>1241</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>