=Paper= {{Paper |id=Vol-1800/paper6 |storemode=property |title=Online Input Data Reduction in Scientific Workflows |pdfUrl=https://ceur-ws.org/Vol-1800/paper6.pdf |volume=Vol-1800 |authors=Renan Souza,Vitor Silva,Alvaro L G A Coutinho,Patrick Valduriez,Marta Mattoso |dblpUrl=https://dblp.org/rec/conf/sc/SouzaSCVM16 }} ==Online Input Data Reduction in Scientific Workflows== https://ceur-ws.org/Vol-1800/paper6.pdf

Online Input Data Reduction in Scientific Workflows
Renan Souza§,°, Vítor Silva§, Alvaro L G A Coutinho§,
Patrick Valduriez¶, Marta Mattoso§
§ ° ¶
COPPE/Federal University of Rio de Janeiro, IBM Research Brazil, Inria and LIRMM, Montpellier

ABSTRACT space that will not influence relevant results and thus, as with a
“branch and bound” optimization strategy, can be bounded. A
Many scientific workflows are data-intensive and need be
similar scenario occurs when the workflow involves a large input
iteratively executed for large input sets of data elements.
dataset. When domain-specialist users can actively participate in
Reducing input data is a powerful way to reduce overall execution
the computational process, practice frequently referred to as
time in such workflows. When this is accomplished online (i.e.,
“human-in-the-loop”, they may analyze partial result data and tell
without requiring users to stop execution to reduce the data and
which part of the data is relevant or not for the final result [14].
resume execution), it can save much time and user interactions
Then, based on their domain knowledge, users can identify which
can integrate within workflow execution. Then, a major problem
subset of the data is not interesting and thus should be removed
is to determine which subset of the input data should be removed.
from the execution by the SWMS, thereby reducing execution
Other related problems include guaranteeing that the workflow
time.
system will maintain execution and data consistent after
reduction, and keeping track of how users interacted with Data reduction can be accomplished in at least three different
execution. In this paper, we adopt the approach “human-in-the- forms. First, users can do it before the execution starts. However,
loop” for scientific workflows by enabling users to steer the in most complex scenarios, the high number of possibilities makes
workflow execution and reduce input elements from datasets at it impossible to know beforehand the uninteresting subsets,
runtime. We propose an adaptive monitoring approach that without any prior execution. Furthermore, not only the initial
combines workflow provenance monitoring and computational dataset can be reduced, but also the data generated by the
steering to support users in analyzing the evolution of key workflow, since the activities composing scientific workflows
parameters and determining which subset of the data should be continuously produce significant amounts of partial data that are
removed. We also extend a provenance data model to keep track consumed by other activities. A second form of data reduction is
of user interactions when users reduce data at runtime. In our to do it online. When the SWMS allows for partial result data
experimental validation, we develop a test case from the oil and analysis, the user may interact with the generated partial data, find
gas industry, using a 936-cores cluster. The results on our which slice of the dataset is not interesting, and reduce the dataset
parameter sweep test case show that the user interactions for online. We use the term online for the interactions where users are
online data reduction yield a 37% reduction of execution time. able to inspect workflow execution, analyze partial and
performance data, and dynamically adapt (i.e., steer) workflow
CCS Concepts settings while the workflow is running (i.e., at runtime). The third
• Massively parallel and high-performance simulations. form of data reduction is by stopping execution, reducing the data
offline, and then resuming execution with the reduced dataset.
Keywords Because of the difficulty in defining the exploratory input dataset
Scientific Workflows; Human in the Loop; Online Data and the long execution time of such iterative workflows, users
Reduction; Provenance Data; Dynamic Workflows. frequently adopt the third form. However, in the offline form, the
SWMS is not aware of the changes, and the results with one
1. INTRODUCTION workflow configuration are not related to the others. Therefore,
Scientific Workflow Management Systems (SWMS) with parallel this is generally more time-consuming, there is no control or
capabilities have been designed for executing data-intensive registration of user interactions, and the execution may become
scientific workflows, scientific workflows for short hereafter, in inconsistent [7].
High Performance Computing (HPC) environments. A typical
Online data reduction has obvious advantages but introduces
execution may involve thousands of parallel tasks with large input
several problems related to computational steering in HPC
datasets. When the workflow is iterative, it is repeatedly executed
environments [14]. First, because of the complexity of their
for each element of an input dataset. The more the data to be
scientific question to address and the huge amounts of data, users
processed, the longer the workflow may take, which may be days
do not exactly know beforehand which subset of dataset should be
depending on the problem and HPC environment [7]. Configuring
kept or removed. Also, if users cannot actively follow the result
a scientific workflow with parameters and data to be processed is
data evolution online, in particular domain data associated to
hard. Typically, the user needs to try several input data or
execution and provenance data (history of data derivation), they
parameter combinations in different workflow executions. These
can be driven to misleading conclusions when trying to identify
trials make the scientific experiment take even longer. It is well
the uninteresting subset of the data. Indeed, this is the main
known that optimizing performance of the parallel system is a
related challenge. Second, if they can find which subset to remove
way to improve overall workflow execution time, but reducing the
and actually try to remove, the SWMS must guarantee that the
input data that was initially planned to be processed is another
operation will be done consistently. Otherwise, it can introduce
effective approach to reduce workflow execution time [4].
anomalous data, yielding to no control of data elimination, data
In scientific workflows, the total amount of data is very large, but redundancy, or even execution crash. Third, in a long run, there
not necessarily the entire input dataset has relevant data for may be more than one user interaction, each removing more
achieving the goal of the workflow execution. This is particularly subsets, at different times. If the SWMS does not keep track of
the case when a large parameter space needs to be processed in user actions, it may negatively impact the results’ reproducibility
parameter sweep workflows. There may be slices of the parameter and reliability. Although data reduction is not new in SWMS [4],

Copyright held by the authors
44
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

to the best of our knowledge, these problems related to online subsea equipment and the offshore oil floating production unit.
user-steered data reduction while maintaining data provenance They are susceptible to a wide variation of environmental
have not been addressed by related works. conditions (e.g., sea currents, wind speed, ocean waves,
Our approach is to represent workflow input datasets as database temperature), which may damage their structure. The fatigue
relations. Each input element from the scientific domain dataset is analysis workflow adopts a cumulative damage approach as part
represented as a tuple of the input relations. When the elements of of the riser's risk assessment procedure considering a wide
the input dataset are files, we insert paths to these files. The combination of possible conditions. The result is the estimate of
approach is implemented in Chiron, a parallel SWMS that adopts riser’s fatigue life, which is the length of time that it will safely
a tuple-oriented algebraic approach [17]. Chiron has been used to operate. The Design Fatigue Factor (DFF) may range from 3 to
manage workflow applications with user steering in domains, 10, meaning that the riser’s fatigue life must be at least DFF times
such as bioinformatics [2], computational fluid dynamics [7], higher than the service life [6].
astronomy [20], etc. Chiron continuously populates a relational Sensors located at the offshore platform collect external
database at runtime to store domain-specific data, workflow conditions and floating unit data, which are stored in multiple raw
execution data, and, more importantly, provenance data, all files. Offshore engineers use specialized programs (mostly
integrated in the same database available for online queries. In complex simulation solvers) to consume the files, evaluate the
this paper, we use the term workflow Database (wf-Database) to impact on the risers in the near future (e.g., risk of a fracture
refer to this database. In addition to the traditional advantages of occurrence), and estimate the risers’ fatigue life. Figure 1 presents
managing provenance data in scientific workflows (i.e., a scientific workflow composed of seven piped specialized
reproducibility, reliability, and quality of result data) [3], online programs (represented by workflow activities) with a dataflow in
provenance data management eases interactive domain data between.
analysis [2][20]. Such interactive flexible data analysis through
provenance helps finding which subset of a dataset to be removed.
Moreover, execution monitoring is another very desirable feature
in any data-intensive system, including SWMS, and can also be
used to assist users in addressing the subset to be removed.
However, Chiron does not control changes in input datasets,
including removing a subset. In this work, we extend Chiron's wf-
Database to maintain the provenance of the subsets of the dataset Figure 1. Risers Fatigue Analysis Workflow.
that are removed. Furthermore, we take advantage of a distributed Each task of Data Gathering (Activity 1) decompresses one
in-memory database system coupled to Chiron, in a version called large file into many files containing important input data, reads
d-Chiron that is significantly more scalable [21], to address the decompressed files, and gathers specific values
consistency issues with respect to data reduction. We make the (environmental conditions, floating unit’s, and other data), which
following contributions: are used by the following activities. Preprocessing (Activity 2)
performs pre-calculations and data cleansing over some other
• A mechanism coupled to d-Chiron for online input data finite element mesh files that will be processed in the following
reduction, which allows users to remove subsets of the dataset activities. Stress Analysis (Activity 3) runs a computational
to be processed at runtime. It guarantees that both execution structural mechanics program to calculate the stresses applied to
and data remain consistent after reduction. the riser. Each task consumes pre-processed meshes and other
• An extension to a provenance data model (which is W3C important input data values (gathered from first activity) and
PROV compliant) to maintain the history of user interactions generates result data files, such as histograms of stresses applied
when users decide to reduce a dataset during workflow throughout the riser (this is an output file), and stress intensity
execution. factors in the riser and principal stress tensor components. It also
• An adaptive monitoring approach that combines monitoring calculates the current curvature of the riser. Then, Stress
and computational steering. It helps users to follow the Critical Case Selection (Activity 4) and Curvature
evolution of interesting parameters and result data to find Critical Case Selection (Activity 5) calculate the fatigue
which subsets of the dataset can be removed during execution. life of the riser based on the stresses and curvature, respectively.
Also, since what users find interesting may change over time, These two activities filter out results corresponding to risers that
this approach allows the user to steer the monitoring certainly in a good state (no critical stress or curvature values
definitions, such as which data should be monitored and how. were identified), which are of no interest to the analysis.
Although existing solutions enable workflow execution Calculate Fatigue Life (Activity 6) uses previously
monitoring [13][16][19], there is no approach to enable users
calculated values to execute a standard methodology [6] and
to run monitoring queries that integrate execution, provenance,
calculate the final fatigue life value of a riser. Compress
and domain data, and dynamically adapt these queries online,
Results (Activity 7) compresses output files by riser.
to the best of our knowledge.
Most of these activities generate result data (both raw data files
Paper organization: Section 2 gives a motivating example. Section
and some other domain-specific data values), which are consumed
3 gives the background for this work. We present our online data
by the subsequent activities. These intermediary data need to be
reduction approach in Section 4 and our adaptive monitoring
analyzed during workflow execution. More importantly,
approach in in Section 5. Section 6 gives the experimental
depending on a specific range of data values for an output result
validation. Section 7 discusses related work. Section 8 concludes.
data (e.g., fatigue life value), there may be a specific combination
2. MOTIVATING EXAMPLE IN OIL AND of input data (e.g., environmental conditions) that are more or less
GAS INDUSTRY important during an interval of time within the workflow
In ultra-deep water oil production systems, a major application is execution. The specific range is frequently hard to determine and
to perform risers’ analyses. Risers are fluid conduits between requires a domain expert to analyze partial data during execution.

45
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

For example, an input data element for Activity 2 is a file that data and provenance dataflow analysis. The corresponding
contains a large matrix of data values, composed of thousands of generated SQL, as well as the relational schema, are in
rows and dozens of columns. Each column contains data for an http://github.com/hpcdb/d-chiron. These queries reflect typical
environmental condition and each row has data collected for a user interactions. When these workflows are executed as scripts,
given instant in time. Each row can be processed in parallel and without Chiron's support, users look for files in their directories,
the domain application needs to consume and produce other data open files, and try to do this analysis in an ad-hoc way, frequently
files (on average, about 14 MB consumed and 6 MB produced per writing programs to "query" these result files. Often they interrupt
processed input data element). After many analyses online, the the execution to fine tune input data and save execution time.
user finds that, for waves > 38m with frequency > 1Hz, a riser Table 1. Domain dataflow provenance interactive queries.
fatigue will never happen. Thus, within the entire matrix, any What is the average of the 10 environmental conditions that are
input data element that contains this specific uninteresting range 𝑸𝟏
leading to the largest fatigue life value?
does not need to be processed. Therefore, by reducing the input
𝑸𝟐 What are the water craft’s hull conditions that are leading to risers’
dataset, the overall data processed and generated are reduced and, curvature lower than 800?
more importantly, the overall execution time is reduced. In this 𝑸𝟑
What are the top 5 raw data files that contain original data that are
paper, we use this workflow in our examples. leading to lowest fatigue life value?
What are the histograms and finite element meshes files related when
𝑸𝟒
3. USER-STEERED WORKFLOWS computed fatigue life based on stress analysis is lower than 60?
Mattoso et al. [14] analyze six aspects of computational steering
For Queries 𝑄1-𝑄4, the SWMS needs to store the history of the
in scientific workflows: interactive analysis, monitoring, human
tuples generated in Activities 4 and 5 since the beginning of the
adaptation, notification, interface for interaction, and computing
flow, adequately linking each tuple flow in between. For example,
model. Despite their importance, the first three are essential and
environmental conditions (𝑄1) and hull conditions (𝑄2) are
we mostly focus on those in this work. In fact, human adaptation
obtained in Activity 1, and stress- and curvature-related values are
is definitely the core of computational steering. However, users
obtained in Activities 4 and 5, respectively. To correlate output
will only know how to fine-tune parameters or which subset needs
tuples from Activity 4 or 5 to tuples from Activity 1, provenance
further focus if they can explore partial result data during a long-
data relationships are required.
term execution. For this, interactive analysis and monitoring play
an important role to put the human in the loop. B. Workflow execution information. Lower level execution engine
information, such as physical location (i.e., virtual machine or
Online provenance data management in SWMS is an essential
cluster node) where a task is being executed, can highly benefit
asset to support all six aspects of computational steering in
data analysis and debugging in HPC execution. Users may want to
scientific workflows. In this section, we explain the three
interactively investigate how many parallel tasks each node is
computational steering aspects explored in this paper.
running. Moreover, tasks can run domain applications that result
3.1 Interactive analysis in errors. If there were thousands of tasks in a large execution,
We address two aspects of workflow data that should be how to determine which tasks resulted in domain application
interactively analyzed: (A) domain dataflow and (B) workflow errors and what the errors were? This also eases debugging, an
execution information [14]. important feature to be provided in large parallel executions.
Furthermore, performance data analysis is very useful. Users are
A. Domain dataflow. Workflows are composed of activities frequently interested in knowing how long tasks are taking. All
(scientific programs, scripts, or services) linked as a dataflow. this workflow execution information is important to be analyzed
Each activity invocation, or task, may consume input datasets and and can deliver much more interesting insights when linked to
input raw data files and produce output datasets and files. These
domain dataflow data. When execution data is stored separately
flows form the domain dataflow. To support domain dataflow
from domain and provenance data, these steering queries are not
interactive analysis, Chiron stores dataflow provenance data in the
possible or demand combining different tools and writing specific
wf-Database during execution and makes them available for analysis programs [20].
online user queries. Users can then query the wf-Database using a
query interface or SQL following PROV-Wf [2], a W3C PROV- To support all this, Chiron’s wf-Database registers parallel
compliant data model [15] that specializes PROV entities into workflow execution data. This means that all necessary execution
domain-data entities to allow for domain dataflow analysis at a information for the parallel engine to work are linked to domain
finer grain than PROV. dataflow data and managed in the same database. Table 2 shows
some provenance queries for the Risers Analysis workflow that
Chiron’s tuple-oriented engine first stores input dataset as tuples link workflow execution data to domain dataflow.
in the wf-Database. In parameter sweep, tuples are data values
from the Cartesian product of input parameters. Then, each task Table 2. Domain data linked to performance data.
consumes input tuples retrieved from this database, executes Determine the average of each environmental conditions (output of
them, and then stores the generated output tuples in the wf- Data Gathering – Activity 1) associated to the tasks that are
𝑸𝟓
taking execution time more than 2 standard deviations of
Database immediately after task execution, adequately
Curvature Critical Case Selection (Activity 5).
maintaining the data relationships to the input tuples. The
Determine the finite element meshes files (output of Preprocessing –
workflow activities that generated the tuples are also stored in the 𝑸𝟔
Activity 2) associated to the tasks that are finishing with error status.
wf-Database and linked accordingly. Large raw data files List the 5 computing nodes with the greatest number of
consumed or produced by each task are not stored in the wf- 𝑸𝟕 Preprocessing activity tasks that are consuming tuples that
Database, but are rather linked to them, for file flow management. contain wind speed values greater than 70 Km/h.
Thus, Chiron enables online fine-grained domain dataflow 3.2 Monitoring
analysis [2] as well as the analysis of related domain raw data files Another important form to help gaining insights from the data is
through file flow relationships [20]. Table 1 shows some useful by monitoring in a passive way. It means that users can set up
queries for the riser fatigue analysis workflow involving domain some monitoring analyses and wait for the monitoring results to

46
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

be generated. Results might be delivered to end-users as graphical entire dataset, the subset that contains those values can be
dashboards or three-dimensional in-situ scientific data removed, hence reducing the dataset.
visualizations. As users gain insights from monitoring results, if However, reducing a dataset to be processed has specific
the SWMS has dynamic analytical support, they can adapt constraints that need to be addressed so the execution remains in a
previously set up monitoring configurations or add new consistent state, i.e., with valid data and with guarantee that the
monitoring analysis [14]. Also, from these new insights, new data execution will not crash. As previously described, Chiron
exploration through interactive analysis can be done. implements a relational data model in a tuple-oriented algebraic
If the SWMS allows for provenance data analysis during approach for scientific workflows [17].
workflow execution, monitoring becomes more effective, since We propose to represent the input dataset as a database relation,
the SWMS can exploit the continuously populated wf-Database. which is a set of tuples. In the tuple-oriented approach [17], tuples
By doing this, all the aforementioned data provenance analysis represent a domain-specific dataset to be consumed or produced
and queries executed by users may be used by a monitoring by a workflow activity. As examples, tuples may be composed of
engine. parameter values of a computational model, file paths to a large
3.3 Human adaptation raw data file (e.g., genomic sequences, finite element meshes,
After users have analyzed partial data and gained insight from textual data, binary files), or calculated values. In the tuple-
them, they may decide to adapt the workflow execution. oriented approach, removing a subset of the entire dataset to be
Adaptation brings powerful abilities to end-users, putting the processed means removing a set of tuples to be consumed by a
human in full control of scientific workflow executions. Many workflow activity. As a consequence of this removal, the tasks
aspects can be adapted by humans, but very few systems support that would process the tuples within the removed set of tuples will
human-in-the-loop actions [14]. The human-adaptable aspects not be executed, hence, reducing both workflow execution time
range from computing resources involved in the execution (e.g., and data processing. Data processing reduction becomes more
adding or removing nodes), to checking-point and rolling-back evident if the removed tuples contain paths to large raw data files
(debugging), loop break conditions, reducing datasets, that would be consumed by tasks if the tuples were not removed.
modification of filter conditions, and very specific parameter fine- Not only this prevents execution of tasks that would consume
tuning. uninteresting data, but also the non-executed tasks will not
produce more data, reducing overall generated data amount in a
Populating the wf-Database during workflow execution can help workflow execution. Furthermore, if a tuple of a given activity is
all these aspects. In the Chiron SWMS, it has been shown that it removed, the following tuples forming the tuple-flow of the next
particularly facilitates steering. For example, in [8], it was shown linked activities will not be processed too, reducing data and,
that it is possible to change filter conditions during execution. more importantly, execution time in cascade.
Also, in [7], it is proposed an algebraic approach to enable
Addressing which specific subset will be removed is quite
steering and dynamic changes of loop conditions during execution
important. So, we first formalize the subset to be removed
of iterative workflows (e.g., modify number of iterations or loop
(Section 4.1). In Section 4.2, we describe how we implemented
stop conditions), and such approach was evaluated in Chiron.
this in d-Chiron, which is a modified version of Chiron that
These works show that these adaptations can significantly reduce
manages the wf-Database in an in-memory distributed database
overall execution time, since domain expert users are able to
system and is significantly more scalable than the original Chiron
identify a satisfactory result before the programmed number of
[21]. We also highlight that even though we implemented our
iterations. Prior to this work, although [7] has highlighted its
solutions in d-Chiron, other SWMS could be used. The only
advantages, no work has been developed in Chiron to tackle user-
requirement is that the SWMS engine needs to manage workflow
steered data reduction online.
data as datasets in a tuple-oriented approach and manage domain,
Since provenance data is so beneficial, we consider that when a provenance, and workflow execution data online in the same
user interacts with the workflow execution, new data (user database.
interaction data) are generated, and thus their provenance must be
registered. In a long-running execution, many interactions can
4.1 Removing slices of input datasets
occur and many adaptations may be made. If the SWMS does not In the tuple-oriented approach, to address a subset of the dataset to
adequately register the provenance of interaction data, the users be removed, we first define a slice, which is a subset of tuples to
can easily lose track of what they have changed in the past. This is be removed according to a criteria defined by the user. Let 𝑅 with
critical if the entire computational experiment takes days to finish data schema ₰ = {𝐶} be the relation that represents a workflow
and many specific adaptations had to occur, since it may be activity input dataset to be reduced. {𝐶} is the set of attributes 𝑐! ,
impossible for the users to remember in the last day of execution 1 ≤ 𝑗 ≤ | 𝐶 |, from 𝑅 and each 𝑐! assumes a data type
what they have steered in the first days. Furthermore, adding (integer, string, boolean, etc). Moreover, we split 𝑅 into two
interaction data to the wf-Database enriches its content and subsets 𝑃 and 𝑆, where 𝑃 is the subset of 𝑅 containing tuples that
enables future user interaction analysis. One example of how such have already been processed and 𝑆 is the subset of 𝑅 containing
data can be exploited is that the registered interaction data could tuples that will be processed. Thus, 𝑅 ← 𝑃 ∪ 𝑆 | 𝑃 ∩ 𝑆 = ∅. 𝑃
be used by artificial intelligence algorithms for understanding and 𝑆 have the same schema ₰.
interaction patterns and recommend future adaptations. For these Then, we define a slice § as a subset of 𝑆, which is represented as
reasons, the SWMS that enables computational steering should a primary horizontal fragment of 𝑆, defined by the selection
collect provenance of user interaction data. To the best of our relational algebraic operation 𝜎 [18]. Thus, § ← 𝜎! 𝑆 , where 𝐹
knowledge, this has not been done before.
is the selection formula to obtain the primary horizontal fragment.
4. ONLINE DATA REDUCTION The formula 𝐹 may either be a simple predicate (e.g., 𝑐!" =
′𝐹𝐴𝑇𝐼𝐺𝑈𝐸′) or a minterm predicate (e.g., 𝑐! > 38 ∧ 𝑐! > 0.1
In this section, we show our main contribution. Suppose that after
analyzing the monitoring results, a user identifies, within the ∧ 𝑐! < 1.0) [18]. Figure 2 shows a workflow example on the left:

47
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

Act. 1 consumes input relation 𝑅 and produces an output relation
that also works as an input relation to be consumed by Act. 2,
which produces the final output relation. Although in this
illustration we show a data reduction in the first activity, we
highlight that input data of any workflow activity can be reduced,
including intermediary ones, as shown in Section 6, where input
data from the second activity is reduced. The input relation 𝑅 is
magnified on the right of Figure 2, where we show the subsets 𝑃
and 𝑆, and the slice § defined by the formula 𝐹.
Once the user has selected the slice to be removed (based on user-
defined criteria), the slice can be cut off. For this, we define the
operation 𝐶𝑢𝑡 using the difference relational algebraic operator, so
that 𝐶𝑢𝑡 𝑅, § ← 𝑅 − §.
By doing so, we ensure that only tuples from 𝑆 will be removed,
since § only contains tuples from 𝑆. This is necessary because
only tuples that have not been processed yet (i.e., they are “ready” Figure 2. Relation 𝑹 with subsets 𝑷 and 𝑺, and a slice
to be processed) can be removed. Otherwise, either the data or the component needs to update the Task table to mark the tasks with
workflow execution may become inconsistent. We note that our the identifiers in the 𝜋!"#$% slice as removed by user, so that the
solution is applicable to reduce input data of any workflow engine will not get them for execution.
activity that needs to process a dataset, as long as the SWMS is
aware of the data elements composing the dataset. The wf-Database tables are distributed, thus making concurrency
control of the Task table partitions even more complex. In d-
4.2 Implementation Chiron, distributed concurrency control in the Task table is
In this section, we describe how we implement slice removal and outsourced to the distributed database system that guarantees the
cut in d-Chiron. In the SWMS that implements the tuple-oriented ACID properties [18]. Thus, concurrency caused by the
approach and manages execution data in the wf-Database, each aforementioned updates is controlled by the database system,
input tuple (or set of tuples, depending on the dataflow operator) guarantying that both execution and data remain consistent.
to be consumed is related to a task. For this reason, removing To store provenance of removed tuples, we extend the wf-
tuples means removing the tasks that would consume the Database schema with the table User_Query to store the queries
referenced tuples. In the PROV-Wf data model [2], which d- that select the slice of the dataset to be removed. The description
Chiron supports, tasks’ data and input tuples are stored in for each User_Query column is described in Table 3. We also
different relations, with a relationship in between. Thus, to
keep track of the removed tasks in table Modified_Task, which
implement the set 𝑆, as defined previously, we need to semi-join
is a table that represents a many-to-many relationship between
[18] input tuples from the input relation 𝑅 with tasks in READY
User_Query and Task tables. In Section 5.2, we will give the
state in order to only select the tuples that will be processed. Then,
necessary extensions to the PROV-Wf data model implemented in
to get the identifiers of the ready tasks 𝜋!"#$% to be removed, we
d-Chiron to accommodate User_Query and modified tasks.
project over the task identifiers. Using relational algebra:
Table 3. User_Query table description.
𝜋!"#$% ← Π!"#$_!" 𝜎! 𝑅 ⋉ 𝜎!"#"$!!"#$% 𝑇𝑎𝑠𝑘 ,
Column name Description
where the formula 𝐹 is analogous to that in Section 4.1, which is query_id Auto increment identifier
the criteria to select the slice § to be removed. We emphasize that slice_query Query that selects the slice of the dataset to be
such verifications are necessary to guarantee consistency and are removed.
the SWMS’s responsibility only, not the users’. The users would tasks_query Query generated by the SWMS to retrieve the
ready tasks associated.
only need to specify the formula 𝐹 to select the slice. issued_time Timestamp of the user interaction
To ease slice removal in d-Chiron, we developed a Steer Field that determines how the user interacted.
module. With the Steer module, users can issue command lines It could be “Removal”, “Addition”, and
query_type others. We currently only implemented
to inform the name of the input relation 𝑅 and the formula 𝐹 to
“Removal” of tuples, but it can be extended in
select the slice to be removed. Then, the module is responsible for future work.
retrieving the identifiers 𝜋!"#$% of the ready tasks to be removed Relationship with the user who issued the
(analogous to the set 𝑆 needed for the slice definition). Instead of user_id
interaction query
physically removing the tasks from the wf-Database, we choose to To maintain relationship with the rest of
wkfid
mark them with the state REMOVED_BY_USER. By doing so, we workflow execution data.
enable these tasks to be later analyzed with provenance queries
and to be consistently related within the table Modified_Task. 5. ADAPTIVE MONITORING
In this section, we combine monitoring and computational
To guarantee consistency, we take advantage of d-Chiron’s
steering into an adaptive monitoring approach. Our workflow
database system [21]. d-Chiron uses a transaction-optimized in-
monitoring approach relies on online queries to the continuously
memory distributed database system that provides atomicity,
populated wf-Database. Users can set up monitoring queries (as in
consistency, isolation, and durability (ACID). In a data reduction,
Table 1 and Table 2) and analyze monitoring results.
both d-Chiron's engine and the Steer module need to
concurrently update a shared resource, the Task table in the wf- In Section 5.1, we present a formal description and describe the
Database. While d-Chiron's engine updates the Task table to get implementation in Section 5.2 with the extensions to PROV-Wf to
tasks for execution and to mark them later as executed, the Steer accommodate adaptive monitoring and online data reduction.

48
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

5.1 Formal description of adaptive monitoring Similar to what we did for the Steer module, we also developed
Monitoring works as follows. There is a set {𝑄} composed of the module Monitor to facilitate utilization. The Monitor should
monitoring queries 𝑚𝑞! , 0 ≤ 𝑖 ≤ | 𝑄 |, each one to be executed start at any cluster node that is able to access the distributed
at each 𝑑! > 0. Users do not need to specify queries at the database system and should start after the workflow execution has
beginning of execution, since they do not know everything they begun, whenever users want to monitor the workflow execution.
want to monitor. This is why {𝑄} starts empty. After users gain Similar to what we did for the Steer module, we also developed
insights from the data, after some interactive provenance data the module Monitor to facilitate utilization. The Monitor should
analyses, they can add monitoring queries to {𝑄} in an ad-hoc start at any cluster node that is able to access the distributed
manner. Each 𝑑! can be adapted by users, meaning that users have database system and should start after the workflow execution has
control of the time frame of each 𝑚𝑞! during execution. begun, whenever users want to monitor the workflow execution.
Each 𝑚𝑞! execution generates a monitoring query result set A command line starts the Monitor module that runs in
𝑚𝑞𝑟!" , 𝑡 = 𝑘𝑑! | 𝑘 ∈ ℕ!! , at each time interval 𝑑! . We background. It establishes a connection with the distributed
constrain that each 𝑚𝑞𝑟!" must deliver one column only. If users database system (connection settings are provided in the XML
want more columns, they can write different monitoring queries configuration file). Chiron (and d-Chiron) makes use of this XML
for each new column. However, the number of rows in the result file to define the workflow design, workflow general settings, and
set is not limited. This means that each monitoring result set other user-defined variables. Then, the Monitor program keeps
𝑚𝑞𝑟!" should be either a scalar value or an array. querying the Monitoring_Query table at each 𝑠 to check if a
To improve human-in-the-loop, the end-users have the flexibility new monitoring query was added. The default value for 𝑠 is 30s,
to adapt monitoring during workflow execution. To do so, at each as the time interval to check if monitoring queries were added or
instant 𝑡 after each monitoring query result 𝑚𝑞𝑟!" has been removed. However, users can customize this. After the Monitor
generated, the values for 𝑑! and 𝑚𝑞! are reloaded from the wf- has started, users can add (or remove) monitoring queries to (or
Database. If any change has happened, it will be considered in the from) the Monitoring_Query table. Currently, users can add
next iteration 𝑡 + 𝑑! . Moreover, at each certain amount of time monitoring queries using a command line to inform the SQL
during execution (also configured by the user), the system checks query to be executed at each time interval and the time interval.
if the user has added new monitoring queries in {𝑄}. Our adaptive Whenever the Monitor module identifies that the user added a
monitoring approach takes full advantage of the data stored online new monitoring query, it launches a new thread. Each thread is
in the wf-Database. More importantly, it enables users to responsible for executing each monitoring query in
dynamically steer monitoring settings (including which data will Monitoring_Query at each defined time interval. A thread is
be monitored and how), highly benefiting them in finding finished when a monitoring query is removed or when the
uninteresting subsets to be removed. workflow stops executing (in that case, all threads are finished).
Figure 3 shows the steps executed at each time interval.
5.2 Implementation 1. Execute the monitoring query 𝑚𝑞!
To implement our approach, we first need to extend the wf-
Database schema. To store {𝑄}, we add the table 2. Store query results in the wf-Database
Monitoring_Query, shown in Table 4. 3. Reload all information for 𝑚𝑞! from the wf-Database for the next
time iteration. The user could have adapted any of this information.
Table 4. Monitoring_Query table description.
Column name Description 4. Wait for 𝑑! seconds
monitoring_id Auto increment identifier Figure 3. Steps executed by each thread within a time interval.
interval Interval time (in seconds) between each
monitoring query (𝑑! ) To enable all these monitoring capabilities and human-adaptation,
monitoring_query Raw SQL query to be queried three of these steps represent queries to the wf-Database,
Relationship between the monitoring queries including reads and writes. The stored results can be further
wkfid and the current execution of the workflow. In analyzed a posteriori or, more interestingly, used as input for
d-Chiron’s wf-Database, there may be data
from past executions for a same workflow.
runtime data visualization tools, since results are immediately
made available after they are generated.
The main advantage of storing monitoring results in the wf-
Database (and adequately linking the results with the remainder of Another contribution of this paper is that we add three concepts to
the data already stored in this database) whenever a monitoring PROV-Wf [2], which is W3C PROV-compliant [15]. Our main
query result is executed is that users are able to query the results motivations to adhere to the W3C PROV recommendations are to
immediately after their generation. The wf-Database can also help on query specification, to maintain compatibility between
serve as data source for data visualization applications. To store different SWMS and facilitate interoperability between different
monitoring results in the wf-Database, we add another table: databases.
Monitoring_Query_Result, shown in Table 5. These concepts are: UserQuery, MonitoringQuery, and
Table 5. Monitoring_Result table description. MonitoringResult, as in Figure 4. Using PROV nomenclature,
Column name Description UserQuery is a PROV Activity that stores the user queries that
monitoring_result_id Auto increment identifier remove sets of tuples and thus influence the state of the associated
monitoring_id Relationship with the monitoring query tasks (i.e., remove them). MonitoringQuery is a PROV Activity
that generated this result that contains the monitoring queries submitted by the user in
monitoring_values Results of the monitoring_query specific time intervals. The monitoring queries generate PROV
Data type of the result values of both Entity MonitoringResult that stores the query results.
result_type queries. Currently, “Integer”, “Double”,
“Array[Integer]”, and “Array[Double]”

49
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

6. EXPERIMENTAL VALIDATION associated to low risk of fatigue life values. In the workflow
In this section, we validate our solution (for online data reduction (Figure 1), the final value of fatigue life is calculated in Activity
and adaptive monitoring) based on a real data. In Section 6.1, we 6, but input values are obtained as output of Activity 1, gathered
show the experimental setup, Section 6.2 shows a test case where from raw input files. Keeping provenance is essential to associate
an expert monitors the execution and removes slices of the data from Activity 1 with data from Activity 6.
dataset. In Section 6.3, we analyze the added overhead. To understand which input values are leading to high fatigue life
values, Peter monitors the generated data online. For simplicity,
6.1 Experimental setup we consider wind speed, which is only one out of the many
Scientific workflow. As a proof of concept for this work, we use environmental condition parameter values captured by Activity 1
a synthetic parameter sweep workflow of the Riser Fatigue to serve as input for Activity 2. Peter knows that wind speed has a
Analysis example (see Figure 1), which is based on a real case strong correlation with fatigue life in risers. He expects that with
study. The workflow manipulates approximately 300 GB of raw low speed winds, there is a lower risk of accident.
data. In all executions, we use the same dataset, which spans over When workflow execution starts, the Monitor module is
approximately 12,000 data elements to be processed in parallel. initialized. Then, Peter adds two monitoring queries: 𝑚𝑞! shows
Depending on the workflow activity, tasks may take few seconds the average of the 10 greatest values of fatigue life calculated in
(e.g., Activity 7) or up to one minute on average (e.g., Activity 3). the last 30s of workflow execution, setting 𝑑! = 30s; and 𝑚𝑞!
Software. In all executions, we use d-Chiron [21], which uses shows the average wind speed associated to the 10 greatest values
MySQL Cluster 7.4.9 as its in-memory distributed database of fatigue life calculated in the last 30s, also setting the query
system to manage the wf-Database. The code to run d-Chiron and interval 𝑑! = 30s. We recall from Table 1 that 𝑚𝑞! is similar to
setup files are in github.com/hpcdb/d-chiron. 𝑄1, but only considering data processed in the last 30s. 𝑚𝑞! and
𝑚𝑞! queries are added to the Monitoring_Query table.
Hardware. The experiments were conducted in Grid5000 using a
cluster with 39 nodes, containing 24 cores each (936 cores). Every Peter monitors the results using the Monitoring_Result table.
node has two AMD Opteron 1.7 GHz 12-core processors, 48GB These results can be a data source for a visualization that plots
RAM, and 250GB of local disk. All nodes are connected via dashboards dynamically, refreshed according to the query
Gigabit Ethernet and access a shared storage of 10TB. intervals. After gaining insights from the results and
understanding patterns, he can start removing the undesired values
6.2 Test case for wind speed. The monitoring query results 𝑚𝑞𝑟!! and
Let us consider the following scenario. Peter is an offshore 𝑚𝑞𝑟!! for the two previously listed queries, as well as when the
engineer, expert in riser analysis and learned how to set up user reduced the data, are plotted along the workflow elapsed
monitoring, analyze d-Chiron’s wf-Database, and use the Steer time, as shown in Figure 5. It presents 𝑚𝑞𝑟!! in full black line
module developed in this work. In Peter’s project, the Design with square markers and 𝑚𝑞𝑟!! in full gray line with triangle
Fatigue Factor is set to 3 and service life is set to 20 years, markers. These markers determine when the monitoring occurred.
meaning that fatigue life must be at least 60 years (see from
The workflow execution starts at 𝑡 = 0, but only after
Section 2). Peter is only interested in analyzing risers with low
approximately 150s, the first output results from Activity 6 starts
fatigue life values, because they are critical and might need repair
to be generated. From the first results, at 𝑡 = 150 and 𝑡 = 180,
or replacement. During workflow execution, it would be
Peter checks that when wind speed is less than 16 Km/h (see
interesting if Peter could inform the SWMS, which input values
horizontal dashed line in 𝑤𝑖𝑛𝑑 𝑠𝑝𝑒𝑒𝑑 = 16 in Figure 5), the
would lead to low risk of fatigue, so they could be removed.
results lead to the largest fatigue life values. Since risers with
However, this is not simple because it is hard to determine the
large fatigue life values are not interesting in this analysis, he
specific range of values (i.e., the slice to be removed). For this,
decides, at 𝑡 = 190, to remove all input data elements that
Peter first needs to understand the pattern of input values

Figure 4. Extended PROV-Wf data module to accommodate modified tasks and monitoring

50
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

contain wind speed less than 16 Km/h. For this, the first user in the large fatigue life values was registered after this last Peter’s
query 𝑞! is issued with a command line to the Steer module. steering. Thus, he keeps analyzing the monitoring results, but does
User queries are represented with circles in the horizontal axis not remove input data anymore until the end of execution.
(Elapsed time) in Figure 5. The exact time a user issued an We store each interaction in the User_Query table and map (in
interaction query is stored in User_Query table. table Modified_Task) its rows with rows in the Task table, to
The next markers after 𝑞! happens at 𝑡 = 210. Comparing with consistently keep provenance of which tasks were modified (in
the previous monitoring mark, at 𝑡 = 180, we can observe that this case, removed) by each specific user steering. Thus, keeping
this Peter’s steering (𝑞! ) increases the minimum wind speed provenance of user steering helps analyzing how specific
values to be considered from 14.2 Km/h to 24.1 Km/h. Also, we interactions impacted the results. Figure 5 shows that some
observe a significant decrease in the slope of the largest values for specific interactions imply significant changes in lines’ slopes.
fatigue life (10.6% lower). This means that the removal of these Queries on the wf-Database can show finer details about how
input data containing wind speed less than 16 Km/h made the many tuples each user interaction made the SWMS not process, as
SWMS not process data containing low wind speed values, which shown in Table 6. Each issued time follows Figure 5 and is
would lead to larger fatigue life results. registered with the timestamp of when the first activity started.
Then, monitoring continues, but that slope decrease calls Peter’s Table 6. Provenance of slices removed by the user
attention. To obtain a finer detail of what is happening, he decides Inter Issued Number of removed
Slice query
to adjust the monitoring interval time (𝑑! and 𝑑! ) at runtime, by act. time (s) data elements
reducing to 10s to get monitoring feedbacks more frequently. We 𝑞! 190 wind_speed < 16 623
can observe that for both lines 𝑚𝑞𝑟!! and 𝑚𝑞𝑟!! , the markers
become more frequent during 𝑡 = [220, 270]. This is because a 𝑞! 310 wind_speed < 25 373
monitoring is registered at every 10s. We highlight that, although 𝑞! 370 wind_speed < 30 355
in test use case we are only showing monitoring correlating wind
𝑞! 430 wind_speed < 34.5 115
speed and fatigue life, other monitoring correlations could also be
analyzed and users can add, remove or adjust monitoring queries 𝑞! 520 wind_speed < 35.5 3
at any time during execution. Finally, we run the exact same workflow and input datasets, but
After verifying that the results are reasonable, Peter decides to with no monitoring or interactions to compare how such slice
increase back the monitoring query intervals for both queries to removals help decrease overall execution time. The workflow
30s after 𝑡 = 270. He then observes that since 𝑞! , wind speed with no interaction processes all input data, including those
less than 25 Km/h are leading to large fatigue life values. containing wind speed values that lead to risers with low risk of
Then, at 𝑡 = 310, he calls Steer again to issue 𝑞! that removes fatigue, which are not considered in Peter’s analyses. In total,
Peter’s steering yields the removal of 1469 input data elements
input data for wind speed < 25 Km/h. The next markers after
(out of approximately 12,000). This reduces the execution time
𝑞! shows that this steering made the wind speed value associated
for this test case by 37% compared with no steering. Furthermore,
to large fatigue life be at least 30.5 Km/h and a decrease of 6.5%
these removed input data would make the workflow process and
in large fatigue life values between 𝑡 = 300 𝑎𝑛𝑑 𝑡 = 330.
generate more raw data files if the input data elements were not
Similarly, Peter continues to monitor and steer the execution. He removed. By querying the wf-Database in the end of execution,
issues 𝑞! at 𝑡 = 370 to remove input data with wind speed < we found that the execution with no user steering processed
30.5 Km/h, making a decrease of 4.9% in large fatigue life approximately 300GB of raw data files, whereas with steering the
(comparing fatigue life in 𝑡 = 360 and 𝑡 = 390). Then, he total was 258GB, representing 14% of data reduction.
issues 𝑞! at 𝑡 = 430 to remove input data with wind speed <
34.5, attaining a decrease of 1.7% in large fatigue life (comparing 6.3 Analyzing monitoring overhead
fatigue life in 𝑡 = 420 and 𝑡 = 450). Despite this small A monitoring query 𝑚𝑞! in {𝑄} is run by a thread at each 𝑑!
decrease, he decides at t = 520 to further remove data, but with seconds. Depending on the number of threads (|{𝑄}|) and on the
wind speed < 35.5 Km/h. However, no decrease greater than 1% interval 𝑑! there may be too many concurrent accesses to the wf-

80.0
35.0
33.0 78.0
31.0 76.0
Wind speed (Km/h)

Fatigue life (years)

29.0 74.0
27.0 72.0
25.0 70.0
23.0
68.0
21.0
66.0
19.0
17.0 64.0
15.0 q1 q2 q3 q4 q5 62.0
13.0 60.0
150 180 210 230 250 270 300 330 360 390 420 450 480 510 540 570 600
Elapsed time (seconds)
Wind speed Fatigue life Steering
Figure 5. Use case plot to analyze impact of user steering comparing Wind Speed (input) with Fatigue life (output).

51
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

Database, which may add overhead. The goal of this experiment is being processed online by dimension reduction or by reducing the
to analyze such overhead. range of some parameters, sharing similar motivations to our
We set up the Monitor module to run queries, which are work. It uses Nimrod/K as its underlying parallel workflow
engine, which is an extension of the Kepler workflow system [1].
variations of the queries 𝑄1-𝑄7 presented in Table 1 and Table 2.
WorkWays presents several tools for user interaction in human-in-
For example, in 𝑄2, we vary the curvature value. We also modify
the-loop workflows, such as graphic user interfaces, data
them to calculate only the results over the last 𝑑 seconds, at each
visualization, and interoperability among others. However,
𝑑 seconds. To evaluate the overheads, we measure execution time
WorkWays does not provide for provenance representation and
without monitoring and then with monitoring, but varying the
users do not have query access to simulation data, execution data,
number of queries |{𝑄}| and the interval 𝑑, which is considered
metadata, and provenance, all related in a database, which limits
the same for all queries in {𝑄} in this experiment. The
the power of online computational steering. For example, it
experiments were repeated until the standard deviation of
prevents ad-hoc data analysis using both domain and workflow
workflow elapsed times was less than 1%. The results are the
execution data, such as those presented in Table 1 and Table 2,
average of these times within the 1% margin. Figure 6 shows the
which support the user in defining which slice of the dataset
results, where the gray portion represents the workflow execution
should be removed. In contrast, our work uses a robust in-memory
time when no monitoring is used; and the black portion represents
distributed database system to manage and relate analytical data
the difference between the workflow execution time with and
involved in the workflow execution. Moreover, the lack of
without monitoring (i.e., the monitoring overhead).
provenance data support in WorkWays, either online or post-
From these results, we observe that when the interval 𝑑 is equal to mortem, does not support reproducibility and prevents from
30s, the overhead is negligible. When the interval is 1s, the registering user adaptations, missing opportunities to determine in
overhead is higher when the number of monitoring threads is detail how specific user interactions influenced workflow results.
greater. This happens because three queries are executed in each Another notable SWMS example is WINGS/Pegasus [9], which
time interval (see Figure 3), for each thread. In the scenarios with especially focus on assisting users in automatic data discovery. It
30 threads, there will be 120 queries in a single time interval 𝑑. In helps generating and executing multiple combinations of
that case, if 𝑑 is small (e.g., 𝑑 = 1), there are 120 queries being workflows based on user contraints, selecting appropriate input
executed per second, just for the monitoring. The database that is data, and eliminating workflows that are not viable. However, it
queried by the monitors is also frequently queried by the SWMS differs from our solution in the sense that it tries to explore
engine, thus adding higher overhead. However, even in this multiple workflows until finding the most suitable one, whereas
specific scenario that shows higher overhead (|{𝑄}| = 30 and we often model our experiments as one single scientific workflow
𝑑 = 1), it is only 33s or 3.19% higher than when no monitoring is to be processed. Also, it does not aim at putting users in the loop
used. Most of the real monitoring cases do not need such frequent to actively eliminate subsets of an input dataset, especially based
(every second) updates. If 30s is frequent enough for the user, on extensive ad-hoc intermediary data analysis online.
there might be no added overhead, like in this test case. Additionally, as WorkWays, provenance data is not collected
We also evaluated the same scenarios without storing monitoring online, nor is it integrated with domain-specific and execution
data for enhanced analysis.
results in the wf-Database, but rather appending in CSV files,
which is simpler. The results are nearly the same as in Figure 6. While human adaptation is less explored in parallel SWMS,
This suggests storing all monitoring results in the wf-Database at monitoring is widely supported in several existing SWMS
runtime, which enables users to submit powerful queries as they [13][14]. For example, Pegasus [5] and Triana may be integrated
are generated, with all other provenance data. This would not be to analytical tools such as Stampede [10][22], which provides a
possible with a solution that appends data to CSV. framework to monitor workflow executions and has rich
capabilities for online performance monitoring, troubleshooting,
7. RELATED WORK and debugging. However, in these solutions, it is not possible to
Considering our contributions, we discuss the SWMS with monitor workflow execution data associating them to provenance
parallel capabilities with respect to human adaptation (especially and domain data, as we do using queries to the wf-Database. To
data reduction), online provenance support, and monitoring the best of our knowledge, there is no related work that allows for
features. online data reduction based on a rich analytical support with
Although online human adaptation is the core of computational adaptive monitoring and provenance registration of human
steering, there are few parallel SWMS [11][12][19] that support it. adaptations in scientific workflows. These features allow for
These solutions have monitoring services and are highly scalable, performance improvements of scientific workflows, while
but do not allow for online data reduction as a means to reduce keeping data reduction consistency and provenance queries that
overall execution time. WorkWays [16] is a powerful science can show the history of human-in-the-loop actions and results.
gateway that enables users to steer and dynamically reduce data
8. CONCLUSION
Overhead This work contributes to putting the human in the loop of
16.5 No monitor time
Exec. time (min)

scientific workflow systems, especially when users can actively
16 steer and reduce data online to improve performance. As a
solution to the input data reduction problem, we made use of a
15.5
tuple-oriented algebraic approach that organizes workflow data to
15 be processed as sets of tuples stored in a wf-Database, managed
d=1 d=1 d=30 d=30 by an in-memory distributed database system at runtime. We
|{Q}|=3 |{Q}|=30 |{Q}|=3 |{Q}|=30 developed a mechanism coupled to d-Chiron, a distributed version
of Chiron SWMS, which allows for reducing data, while
Figure 6. Results of adaptive monitoring overhead. maintaining both data integrity and execution consistency. A
major challenge to the problem of data reduction is to address

52
WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

which subset of the data should be removed. As a solution to this, [4] Deelman, E., Gannon, D., Shields, M., Taylor, I. Workflows and e-
we proposed an adaptive monitoring approach that aids users in Science: an overview of workflow system features and capabilities.
analyzing partial result data at runtime. Based on the evaluation of FGCS, 25(5):528–540, 2009.
[5] Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S.,
input data elements and its corresponding results, the user may
Maechling, P.J., Mayani, R., Chen, W., Ferreira da Silva, R., et al.
find which subset of the input data is not interesting for a Pegasus, a workflow management system for science automation.
particular execution, hence can be removed. The adaptive FGCS, 46(C):17–35, 2015.
monitoring allows users not only to follow the evolution of the [6] Det Norske Veritas. Recommended practice: riser fatigue. DNV-RP-
workflow, but also to dynamically adjust monitoring aspects F204, 2010.
during execution. We extended our previous workflow [7] Dias, J., Guerra, G., Rochinha, F., Coutinho, A.L.G.A., Valduriez, P.,
provenance data model to be able to represent provenance of the Mattoso, M. Data-centric iteration in dynamic workflows. FGCS,
online data reduction actions by users and the monitoring results. 46(C):114–126, 2015.
[8] Dias, J., Ogasawara, E., Oliveira, D., Porto, F., Coutinho, A.L.G.A.,
Although we implemented our solution in d-Chiron, other SWMS
Mattoso, M. Supporting dynamic parameter sweep in adaptive and
could be used if provenance, execution, and domain dataflow data user-steered workflow. WORKS, 31–36, 2011.
are managed in a database at runtime. [9] Gil, Y., Ratnakar, V., Kim, J., Gonzalez-Calero, P., Groth, P., Moody,
To validate our solution, we executed a data-intensive parameter J., Deelman, E. Wings: Intelligent workflow-based design of
sweep workflow based on a real case study from the oil and gas computational experiments. Intelligent Systems, 26(1):62–72, 2011.
[10] Gunter, D., Deelman, E., Samak, et al. Online workflow management
industry, running on a 936-cores cluster. A test case demonstrated and performance analysis with Stampede. CNSM, 152–161, 2011.
how the user can monitor the execution, dynamically adapt [11] Jain, A., Ong, S.P., Chen, W., et al. FireWorks: a dynamic workflow
monitoring settings, and especially remove uninteresting data to system designed for high-throughput applications. CCPE,
be processed, all during execution. Results for this test case show 27(17):5037–5059, 2015.
that the user interactions reduced the execution time by 37% [12] Lee, K., Paton, N.W., Sakellariou, R., Fernandes, A.A.A. Utility
comparing with the execution that processed the whole dataset. functions for adaptively executing concurrent workflows. CCPE,
Although the test case was from the oil and gas domain, any other 23(6):646–666, 2011.
workflow application could have been used, as long as a domain [13] Mandal, A., Ruth, P., Baldin, I., et al. Toward an end-to-end
framework for modeling, monitoring and anomaly detection for
expert can tell which slice is not interesting, removed with no scientific workflows. IPDPSW, 1370–1379, 2016.
harm to the final results. [14] Mattoso, M., Dias, J., Ocaña, K.A.C.S., Ogasawara, E., Costa, F.,
To the best of our knowledge, this is the first work that explores Horta, F., Silva, V., de Oliveira, D. Dynamic steering of HPC
user-steered online data reduction in scientific workflows steered scientific workflows: A survey. FGCS, 46:100–113, 2015.
[15] Moreau, L., Missier, P. PROV-DM: the PROV data model. Available
by ad-hoc queries and adaptive monitoring, while maintaining
at: http://www.w3.org/TR/prov-dm Accessed: 1 Aug 2016., 2013.
provenance of user interactions. The results motivate us to extend [16] Nguyen, H.A., Abramson, D., Kiporous, T., Janke, A., Galloway, G.
our solution and explore different aspects that can be adapted by WorkWays: interacting with scientific workflows. Gateway
humans based on sophisticated workflow data analysis support. Computing Environments Workshop, 21–24, 2014.
Our solution is currently dependent on the domain expert’s [17] Ogasawara, E., Dias, J., Oliveira, D., Porto, F., Valduriez, P.,
knowledge to identify correlations between input and output data Mattoso, M. An algebraic approach for data-centric scientific
to determine which subsets are uninteresting. We plan to address workflows. PVLDB, 4(12):1328–1339, 2011.
in-situ data visualization based on the adaptive monitoring and [18] Özsu, M.T., Valduriez, P. Principles of distributed database systems.
3 ed. New York, Springer, 2011.
interactive queries results and develop recommendation models to
[19] Reuillon, R., Leclaire, M., Rey-Coyrehourcq, S. OpenMOLE, a
suggest correlations based on history stored in the wf-Database. workflow engine specifically tailored for the distributed exploration
Other future work include: enabling users to set priorities to of simulation models. FGCS, 29(8):1981–1990, 2013.
different slices of the data in a way that the SWMS system will [20] Silva, V., de Oliveira, D., Valduriez, P., Mattoso, M. Analyzing
process critic slices before; improving usability of the system by related raw data files through dataflows. CCPE, 28:2528–2545, 2015.
developing intuitive user interfaces to decrease the learning curve, [21] Souza, R., Silva, V., Oliveira, Daniel, Valduriez, P., Lima, A.A.B.,
especially related to the query interface, to take full advantage of Mattoso, M. Parallel execution of workflows driven by a distributed
the wf-Database. We also plan to expand our experiments and database management system. Poster in Supercomputing, 2015.
[22] Vahi, K., Harvey, I., Samak, T., et al. A case study into using
analyze how reducing each specific type of data (relation tuples,
common real-time workflow monitoring infrastructure for scientific
raw data files not processed and not generated) impact final workflows. J. Grid Comput., 11(3):381–406, 2013.
results.

9. ACKNOWLEDGMENTS
This work was partially funded by CNPq, FAPERJ and Inria
(MUSIC project), EU H2020 Programme and MCTI/RNP-Brazil
(HPC4E grant no. 689772), and performed (for P. Valduriez) in
the context of the Computational Biology Institute (www.ibc-
montpellier.fr). The experiments were carried out using the
Grid'5000 testbed (https://www.grid5000.fr).

10. REFERENCES
[1] Abramson, D., Enticott, C., Altinas, I. Nimrod/K: Towards massively
parallel dynamic grid workflows. Supercomputing, 24:1–24:11, 2008.
[2] Costa, F., Silva, V., de Oliveira, D., Ocaña, K., Ogasawara, E., Dias,
J., Mattoso, M. Capturing and querying workflow runtime provenance
with PROV: a practical approach. EDBT Workshops, 282–289, 2013.
[3] Davidson, S.B., Freire, J. Provenance and scientific workflows:
challenges and opportunities. SIGMOD, 1345–1350, 2008.