=Paper= {{Paper |id=Vol-1800/paper6 |storemode=property |title=Online Input Data Reduction in Scientific Workflows |pdfUrl=https://ceur-ws.org/Vol-1800/paper6.pdf |volume=Vol-1800 |authors=Renan Souza,Vitor Silva,Alvaro L G A Coutinho,Patrick Valduriez,Marta Mattoso |dblpUrl=https://dblp.org/rec/conf/sc/SouzaSCVM16 }} ==Online Input Data Reduction in Scientific Workflows== https://ceur-ws.org/Vol-1800/paper6.pdf
        Online Input Data Reduction in Scientific Workflows
                                Renan Souza§,°, Vítor Silva§, Alvaro L G A Coutinho§,
                                       Patrick Valduriez¶, Marta Mattoso§
         §                                                        °                            ¶
          COPPE/Federal University of Rio de Janeiro, IBM Research Brazil, Inria and LIRMM, Montpellier

ABSTRACT                                                                    space that will not influence relevant results and thus, as with a
                                                                            “branch and bound” optimization strategy, can be bounded. A
Many scientific workflows are data-intensive and need be
                                                                            similar scenario occurs when the workflow involves a large input
iteratively executed for large input sets of data elements.
                                                                            dataset. When domain-specialist users can actively participate in
Reducing input data is a powerful way to reduce overall execution
                                                                            the computational process, practice frequently referred to as
time in such workflows. When this is accomplished online (i.e.,
                                                                            “human-in-the-loop”, they may analyze partial result data and tell
without requiring users to stop execution to reduce the data and
                                                                            which part of the data is relevant or not for the final result [14].
resume execution), it can save much time and user interactions
                                                                            Then, based on their domain knowledge, users can identify which
can integrate within workflow execution. Then, a major problem
                                                                            subset of the data is not interesting and thus should be removed
is to determine which subset of the input data should be removed.
                                                                            from the execution by the SWMS, thereby reducing execution
Other related problems include guaranteeing that the workflow
                                                                            time.
system will maintain execution and data consistent after
reduction, and keeping track of how users interacted with                   Data reduction can be accomplished in at least three different
execution. In this paper, we adopt the approach “human-in-the-              forms. First, users can do it before the execution starts. However,
loop” for scientific workflows by enabling users to steer the               in most complex scenarios, the high number of possibilities makes
workflow execution and reduce input elements from datasets at               it impossible to know beforehand the uninteresting subsets,
runtime. We propose an adaptive monitoring approach that                    without any prior execution. Furthermore, not only the initial
combines workflow provenance monitoring and computational                   dataset can be reduced, but also the data generated by the
steering to support users in analyzing the evolution of key                 workflow, since the activities composing scientific workflows
parameters and determining which subset of the data should be               continuously produce significant amounts of partial data that are
removed. We also extend a provenance data model to keep track               consumed by other activities. A second form of data reduction is
of user interactions when users reduce data at runtime. In our              to do it online. When the SWMS allows for partial result data
experimental validation, we develop a test case from the oil and            analysis, the user may interact with the generated partial data, find
gas industry, using a 936-cores cluster. The results on our                 which slice of the dataset is not interesting, and reduce the dataset
parameter sweep test case show that the user interactions for               online. We use the term online for the interactions where users are
online data reduction yield a 37% reduction of execution time.              able to inspect workflow execution, analyze partial and
                                                                            performance data, and dynamically adapt (i.e., steer) workflow
CCS Concepts                                                                settings while the workflow is running (i.e., at runtime). The third
• Massively parallel and high-performance simulations.                      form of data reduction is by stopping execution, reducing the data
                                                                            offline, and then resuming execution with the reduced dataset.
Keywords                                                                    Because of the difficulty in defining the exploratory input dataset
Scientific Workflows; Human in the Loop; Online Data                        and the long execution time of such iterative workflows, users
Reduction; Provenance Data; Dynamic Workflows.                              frequently adopt the third form. However, in the offline form, the
                                                                            SWMS is not aware of the changes, and the results with one
1. INTRODUCTION                                                             workflow configuration are not related to the others. Therefore,
Scientific Workflow Management Systems (SWMS) with parallel                 this is generally more time-consuming, there is no control or
capabilities have been designed for executing data-intensive                registration of user interactions, and the execution may become
scientific workflows, scientific workflows for short hereafter, in          inconsistent [7].
High Performance Computing (HPC) environments. A typical
                                                                            Online data reduction has obvious advantages but introduces
execution may involve thousands of parallel tasks with large input
                                                                            several problems related to computational steering in HPC
datasets. When the workflow is iterative, it is repeatedly executed
                                                                            environments [14]. First, because of the complexity of their
for each element of an input dataset. The more the data to be
                                                                            scientific question to address and the huge amounts of data, users
processed, the longer the workflow may take, which may be days
                                                                            do not exactly know beforehand which subset of dataset should be
depending on the problem and HPC environment [7]. Configuring
                                                                            kept or removed. Also, if users cannot actively follow the result
a scientific workflow with parameters and data to be processed is
                                                                            data evolution online, in particular domain data associated to
hard. Typically, the user needs to try several input data or
                                                                            execution and provenance data (history of data derivation), they
parameter combinations in different workflow executions. These
                                                                            can be driven to misleading conclusions when trying to identify
trials make the scientific experiment take even longer. It is well
                                                                            the uninteresting subset of the data. Indeed, this is the main
known that optimizing performance of the parallel system is a
                                                                            related challenge. Second, if they can find which subset to remove
way to improve overall workflow execution time, but reducing the
                                                                            and actually try to remove, the SWMS must guarantee that the
input data that was initially planned to be processed is another
                                                                            operation will be done consistently. Otherwise, it can introduce
effective approach to reduce workflow execution time [4].
                                                                            anomalous data, yielding to no control of data elimination, data
In scientific workflows, the total amount of data is very large, but        redundancy, or even execution crash. Third, in a long run, there
not necessarily the entire input dataset has relevant data for              may be more than one user interaction, each removing more
achieving the goal of the workflow execution. This is particularly          subsets, at different times. If the SWMS does not keep track of
the case when a large parameter space needs to be processed in              user actions, it may negatively impact the results’ reproducibility
parameter sweep workflows. There may be slices of the parameter             and reliability. Although data reduction is not new in SWMS [4],



Copyright held by the authors
                                                                       44
              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

to the best of our knowledge, these problems related to online               subsea equipment and the offshore oil floating production unit.
user-steered data reduction while maintaining data provenance                They are susceptible to a wide variation of environmental
have not been addressed by related works.                                    conditions (e.g., sea currents, wind speed, ocean waves,
Our approach is to represent workflow input datasets as database             temperature), which may damage their structure. The fatigue
relations. Each input element from the scientific domain dataset is          analysis workflow adopts a cumulative damage approach as part
represented as a tuple of the input relations. When the elements of          of the riser's risk assessment procedure considering a wide
the input dataset are files, we insert paths to these files. The             combination of possible conditions. The result is the estimate of
approach is implemented in Chiron, a parallel SWMS that adopts               riser’s fatigue life, which is the length of time that it will safely
a tuple-oriented algebraic approach [17]. Chiron has been used to            operate. The Design Fatigue Factor (DFF) may range from 3 to
manage workflow applications with user steering in domains,                  10, meaning that the riser’s fatigue life must be at least DFF times
such as bioinformatics [2], computational fluid dynamics [7],                higher than the service life [6].
astronomy [20], etc. Chiron continuously populates a relational              Sensors located at the offshore platform collect external
database at runtime to store domain-specific data, workflow                  conditions and floating unit data, which are stored in multiple raw
execution data, and, more importantly, provenance data, all                  files. Offshore engineers use specialized programs (mostly
integrated in the same database available for online queries. In             complex simulation solvers) to consume the files, evaluate the
this paper, we use the term workflow Database (wf-Database) to               impact on the risers in the near future (e.g., risk of a fracture
refer to this database. In addition to the traditional advantages of         occurrence), and estimate the risers’ fatigue life. Figure 1 presents
managing provenance data in scientific workflows (i.e.,                      a scientific workflow composed of seven piped specialized
reproducibility, reliability, and quality of result data) [3], online        programs (represented by workflow activities) with a dataflow in
provenance data management eases interactive domain data                     between.
analysis [2][20]. Such interactive flexible data analysis through
provenance helps finding which subset of a dataset to be removed.
Moreover, execution monitoring is another very desirable feature
in any data-intensive system, including SWMS, and can also be
used to assist users in addressing the subset to be removed.
However, Chiron does not control changes in input datasets,
including removing a subset. In this work, we extend Chiron's wf-
Database to maintain the provenance of the subsets of the dataset                       Figure 1. Risers Fatigue Analysis Workflow.
that are removed. Furthermore, we take advantage of a distributed            Each task of Data Gathering (Activity 1) decompresses one
in-memory database system coupled to Chiron, in a version called             large file into many files containing important input data, reads
d-Chiron that is significantly more scalable [21], to address                the decompressed files, and gathers specific values
consistency issues with respect to data reduction. We make the               (environmental conditions, floating unit’s, and other data), which
following contributions:                                                     are used by the following activities. Preprocessing (Activity 2)
                                                                             performs pre-calculations and data cleansing over some other
• A mechanism coupled to d-Chiron for online input data                      finite element mesh files that will be processed in the following
  reduction, which allows users to remove subsets of the dataset             activities. Stress Analysis (Activity 3) runs a computational
  to be processed at runtime. It guarantees that both execution              structural mechanics program to calculate the stresses applied to
  and data remain consistent after reduction.                                the riser. Each task consumes pre-processed meshes and other
• An extension to a provenance data model (which is W3C                      important input data values (gathered from first activity) and
  PROV compliant) to maintain the history of user interactions               generates result data files, such as histograms of stresses applied
  when users decide to reduce a dataset during workflow                      throughout the riser (this is an output file), and stress intensity
  execution.                                                                 factors in the riser and principal stress tensor components. It also
• An adaptive monitoring approach that combines monitoring                   calculates the current curvature of the riser. Then, Stress
  and computational steering. It helps users to follow the                   Critical Case Selection (Activity 4) and Curvature
  evolution of interesting parameters and result data to find                Critical Case Selection (Activity 5) calculate the fatigue
  which subsets of the dataset can be removed during execution.              life of the riser based on the stresses and curvature, respectively.
  Also, since what users find interesting may change over time,              These two activities filter out results corresponding to risers that
  this approach allows the user to steer the monitoring                      certainly in a good state (no critical stress or curvature values
  definitions, such as which data should be monitored and how.               were identified), which are of no interest to the analysis.
  Although existing solutions enable workflow execution                      Calculate Fatigue Life (Activity 6) uses previously
  monitoring [13][16][19], there is no approach to enable users
                                                                             calculated values to execute a standard methodology [6] and
  to run monitoring queries that integrate execution, provenance,
                                                                             calculate the final fatigue life value of a riser. Compress
  and domain data, and dynamically adapt these queries online,
                                                                             Results (Activity 7) compresses output files by riser.
  to the best of our knowledge.
                                                                             Most of these activities generate result data (both raw data files
Paper organization: Section 2 gives a motivating example. Section
                                                                             and some other domain-specific data values), which are consumed
3 gives the background for this work. We present our online data
                                                                             by the subsequent activities. These intermediary data need to be
reduction approach in Section 4 and our adaptive monitoring
                                                                             analyzed during workflow execution. More importantly,
approach in in Section 5. Section 6 gives the experimental
                                                                             depending on a specific range of data values for an output result
validation. Section 7 discusses related work. Section 8 concludes.
                                                                             data (e.g., fatigue life value), there may be a specific combination
2. MOTIVATING EXAMPLE IN OIL AND                                             of input data (e.g., environmental conditions) that are more or less
GAS INDUSTRY                                                                 important during an interval of time within the workflow
In ultra-deep water oil production systems, a major application is           execution. The specific range is frequently hard to determine and
to perform risers’ analyses. Risers are fluid conduits between               requires a domain expert to analyze partial data during execution.




                                                                        45
              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

For example, an input data element for Activity 2 is a file that             data and provenance dataflow analysis. The corresponding
contains a large matrix of data values, composed of thousands of             generated SQL, as well as the relational schema, are in
rows and dozens of columns. Each column contains data for an                 http://github.com/hpcdb/d-chiron. These queries reflect typical
environmental condition and each row has data collected for a                user interactions. When these workflows are executed as scripts,
given instant in time. Each row can be processed in parallel and             without Chiron's support, users look for files in their directories,
the domain application needs to consume and produce other data               open files, and try to do this analysis in an ad-hoc way, frequently
files (on average, about 14 MB consumed and 6 MB produced per                writing programs to "query" these result files. Often they interrupt
processed input data element). After many analyses online, the               the execution to fine tune input data and save execution time.
user finds that, for waves > 38m with frequency > 1Hz, a riser                 Table 1. Domain dataflow provenance interactive queries.
fatigue will never happen. Thus, within the entire matrix, any                  What is the average of the 10 environmental conditions that are
input data element that contains this specific uninteresting range           𝑸𝟏
                                                                                leading to the largest fatigue life value?
does not need to be processed. Therefore, by reducing the input
                                                                             𝑸𝟐 What are the water craft’s hull conditions that are leading to risers’
dataset, the overall data processed and generated are reduced and,              curvature lower than 800?
more importantly, the overall execution time is reduced. In this             𝑸𝟑
                                                                                What are the top 5 raw data files that contain original data that are
paper, we use this workflow in our examples.                                    leading to lowest fatigue life value?
                                                                                What are the histograms and finite element meshes files related when
                                                                             𝑸𝟒
3. USER-STEERED WORKFLOWS                                                       computed fatigue life based on stress analysis is lower than 60?
Mattoso et al. [14] analyze six aspects of computational steering
                                                                             For Queries 𝑄1-𝑄4, the SWMS needs to store the history of the
in scientific workflows: interactive analysis, monitoring, human
                                                                             tuples generated in Activities 4 and 5 since the beginning of the
adaptation, notification, interface for interaction, and computing
                                                                             flow, adequately linking each tuple flow in between. For example,
model. Despite their importance, the first three are essential and
                                                                             environmental conditions (𝑄1) and hull conditions (𝑄2) are
we mostly focus on those in this work. In fact, human adaptation
                                                                             obtained in Activity 1, and stress- and curvature-related values are
is definitely the core of computational steering. However, users
                                                                             obtained in Activities 4 and 5, respectively. To correlate output
will only know how to fine-tune parameters or which subset needs
                                                                             tuples from Activity 4 or 5 to tuples from Activity 1, provenance
further focus if they can explore partial result data during a long-
                                                                             data relationships are required.
term execution. For this, interactive analysis and monitoring play
an important role to put the human in the loop.                              B. Workflow execution information. Lower level execution engine
                                                                             information, such as physical location (i.e., virtual machine or
Online provenance data management in SWMS is an essential
                                                                             cluster node) where a task is being executed, can highly benefit
asset to support all six aspects of computational steering in
                                                                             data analysis and debugging in HPC execution. Users may want to
scientific workflows. In this section, we explain the three
                                                                             interactively investigate how many parallel tasks each node is
computational steering aspects explored in this paper.
                                                                             running. Moreover, tasks can run domain applications that result
3.1 Interactive analysis                                                     in errors. If there were thousands of tasks in a large execution,
We address two aspects of workflow data that should be                       how to determine which tasks resulted in domain application
interactively analyzed: (A) domain dataflow and (B) workflow                 errors and what the errors were? This also eases debugging, an
execution information [14].                                                  important feature to be provided in large parallel executions.
                                                                             Furthermore, performance data analysis is very useful. Users are
A. Domain dataflow. Workflows are composed of activities                     frequently interested in knowing how long tasks are taking. All
(scientific programs, scripts, or services) linked as a dataflow.            this workflow execution information is important to be analyzed
Each activity invocation, or task, may consume input datasets and            and can deliver much more interesting insights when linked to
input raw data files and produce output datasets and files. These
                                                                             domain dataflow data. When execution data is stored separately
flows form the domain dataflow. To support domain dataflow
                                                                             from domain and provenance data, these steering queries are not
interactive analysis, Chiron stores dataflow provenance data in the
                                                                             possible or demand combining different tools and writing specific
wf-Database during execution and makes them available for                    analysis programs [20].
online user queries. Users can then query the wf-Database using a
query interface or SQL following PROV-Wf [2], a W3C PROV-                    To support all this, Chiron’s wf-Database registers parallel
compliant data model [15] that specializes PROV entities into                workflow execution data. This means that all necessary execution
domain-data entities to allow for domain dataflow analysis at a              information for the parallel engine to work are linked to domain
finer grain than PROV.                                                       dataflow data and managed in the same database. Table 2 shows
                                                                             some provenance queries for the Risers Analysis workflow that
Chiron’s tuple-oriented engine first stores input dataset as tuples          link workflow execution data to domain dataflow.
in the wf-Database. In parameter sweep, tuples are data values
from the Cartesian product of input parameters. Then, each task                     Table 2. Domain data linked to performance data.
consumes input tuples retrieved from this database, executes                    Determine the average of each environmental conditions (output of
them, and then stores the generated output tuples in the wf-                    Data Gathering – Activity 1) associated to the tasks that are
                                                                             𝑸𝟓
                                                                                taking execution time more than 2 standard deviations of
Database immediately after task execution, adequately
                                                                                Curvature Critical Case Selection (Activity 5).
maintaining the data relationships to the input tuples. The
                                                                                Determine the finite element meshes files (output of Preprocessing –
workflow activities that generated the tuples are also stored in the         𝑸𝟔
                                                                                Activity 2) associated to the tasks that are finishing with error status.
wf-Database and linked accordingly. Large raw data files                        List the 5 computing nodes with the greatest number of
consumed or produced by each task are not stored in the wf-                  𝑸𝟕 Preprocessing activity tasks that are consuming tuples that
Database, but are rather linked to them, for file flow management.              contain wind speed values greater than 70 Km/h.
Thus, Chiron enables online fine-grained domain dataflow                     3.2 Monitoring
analysis [2] as well as the analysis of related domain raw data files        Another important form to help gaining insights from the data is
through file flow relationships [20]. Table 1 shows some useful              by monitoring in a passive way. It means that users can set up
queries for the riser fatigue analysis workflow involving domain             some monitoring analyses and wait for the monitoring results to




                                                                        46
              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

be generated. Results might be delivered to end-users as graphical          entire dataset, the subset that contains those values can be
dashboards or three-dimensional in-situ scientific data                     removed, hence reducing the dataset.
visualizations. As users gain insights from monitoring results, if          However, reducing a dataset to be processed has specific
the SWMS has dynamic analytical support, they can adapt                     constraints that need to be addressed so the execution remains in a
previously set up monitoring configurations or add new                      consistent state, i.e., with valid data and with guarantee that the
monitoring analysis [14]. Also, from these new insights, new data           execution will not crash. As previously described, Chiron
exploration through interactive analysis can be done.                       implements a relational data model in a tuple-oriented algebraic
If the SWMS allows for provenance data analysis during                      approach for scientific workflows [17].
workflow execution, monitoring becomes more effective, since                We propose to represent the input dataset as a database relation,
the SWMS can exploit the continuously populated wf-Database.                which is a set of tuples. In the tuple-oriented approach [17], tuples
By doing this, all the aforementioned data provenance analysis              represent a domain-specific dataset to be consumed or produced
and queries executed by users may be used by a monitoring                   by a workflow activity. As examples, tuples may be composed of
engine.                                                                     parameter values of a computational model, file paths to a large
3.3 Human adaptation                                                        raw data file (e.g., genomic sequences, finite element meshes,
After users have analyzed partial data and gained insight from              textual data, binary files), or calculated values. In the tuple-
them, they may decide to adapt the workflow execution.                      oriented approach, removing a subset of the entire dataset to be
Adaptation brings powerful abilities to end-users, putting the              processed means removing a set of tuples to be consumed by a
human in full control of scientific workflow executions. Many               workflow activity. As a consequence of this removal, the tasks
aspects can be adapted by humans, but very few systems support              that would process the tuples within the removed set of tuples will
human-in-the-loop actions [14]. The human-adaptable aspects                 not be executed, hence, reducing both workflow execution time
range from computing resources involved in the execution (e.g.,             and data processing. Data processing reduction becomes more
adding or removing nodes), to checking-point and rolling-back               evident if the removed tuples contain paths to large raw data files
(debugging), loop break conditions, reducing datasets,                      that would be consumed by tasks if the tuples were not removed.
modification of filter conditions, and very specific parameter fine-        Not only this prevents execution of tasks that would consume
tuning.                                                                     uninteresting data, but also the non-executed tasks will not
                                                                            produce more data, reducing overall generated data amount in a
Populating the wf-Database during workflow execution can help               workflow execution. Furthermore, if a tuple of a given activity is
all these aspects. In the Chiron SWMS, it has been shown that it            removed, the following tuples forming the tuple-flow of the next
particularly facilitates steering. For example, in [8], it was shown        linked activities will not be processed too, reducing data and,
that it is possible to change filter conditions during execution.           more importantly, execution time in cascade.
Also, in [7], it is proposed an algebraic approach to enable
                                                                            Addressing which specific subset will be removed is quite
steering and dynamic changes of loop conditions during execution
                                                                            important. So, we first formalize the subset to be removed
of iterative workflows (e.g., modify number of iterations or loop
                                                                            (Section 4.1). In Section 4.2, we describe how we implemented
stop conditions), and such approach was evaluated in Chiron.
                                                                            this in d-Chiron, which is a modified version of Chiron that
These works show that these adaptations can significantly reduce
                                                                            manages the wf-Database in an in-memory distributed database
overall execution time, since domain expert users are able to
                                                                            system and is significantly more scalable than the original Chiron
identify a satisfactory result before the programmed number of
                                                                            [21]. We also highlight that even though we implemented our
iterations. Prior to this work, although [7] has highlighted its
                                                                            solutions in d-Chiron, other SWMS could be used. The only
advantages, no work has been developed in Chiron to tackle user-
                                                                            requirement is that the SWMS engine needs to manage workflow
steered data reduction online.
                                                                            data as datasets in a tuple-oriented approach and manage domain,
Since provenance data is so beneficial, we consider that when a             provenance, and workflow execution data online in the same
user interacts with the workflow execution, new data (user                  database.
interaction data) are generated, and thus their provenance must be
registered. In a long-running execution, many interactions can
                                                                            4.1 Removing slices of input datasets
occur and many adaptations may be made. If the SWMS does not                In the tuple-oriented approach, to address a subset of the dataset to
adequately register the provenance of interaction data, the users           be removed, we first define a slice, which is a subset of tuples to
can easily lose track of what they have changed in the past. This is        be removed according to a criteria defined by the user. Let 𝑅 with
critical if the entire computational experiment takes days to finish        data schema ₰ = {𝐶} be the relation that represents a workflow
and many specific adaptations had to occur, since it may be                 activity input dataset to be reduced. {𝐶}  is the set of attributes 𝑐! ,
impossible for the users to remember in the last day of execution           1   ≤   𝑗   ≤ | 𝐶 |,   from   𝑅 and each 𝑐! assumes a data type
what they have steered in the first days. Furthermore, adding               (integer, string, boolean, etc). Moreover, we split 𝑅 into two
interaction data to the wf-Database enriches its content and                subsets 𝑃 and 𝑆, where 𝑃 is the subset of 𝑅 containing tuples that
enables future user interaction analysis. One example of how such           have already been processed and 𝑆 is the subset of 𝑅 containing
data can be exploited is that the registered interaction data could         tuples that will be processed. Thus, 𝑅 ← 𝑃 ∪ 𝑆  |  𝑃 ∩ 𝑆 =   ∅. 𝑃
be used by artificial intelligence algorithms for understanding             and 𝑆 have the same schema ₰.
interaction patterns and recommend future adaptations. For these            Then, we define a slice § as a subset of 𝑆, which is represented as
reasons, the SWMS that enables computational steering should                a primary horizontal fragment of 𝑆, defined by the selection
collect provenance of user interaction data. To the best of our             relational algebraic operation 𝜎 [18]. Thus, § ←    𝜎! 𝑆 , where 𝐹
knowledge, this has not been done before.
                                                                            is the selection formula to obtain the primary horizontal fragment.
4. ONLINE DATA REDUCTION                                                    The formula 𝐹 may either be a simple predicate (e.g., 𝑐!"    =
                                                                            ′𝐹𝐴𝑇𝐼𝐺𝑈𝐸′) or a minterm predicate (e.g., 𝑐! > 38   ∧ 𝑐! > 0.1
In this section, we show our main contribution. Suppose that after
analyzing the monitoring results, a user identifies, within the             ∧ 𝑐! < 1.0) [18]. Figure 2 shows a workflow example on the left:




                                                                       47
              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

Act. 1 consumes input relation 𝑅 and produces an output relation
that also works as an input relation to be consumed by Act. 2,
which produces the final output relation. Although in this
illustration we show a data reduction in the first activity, we
highlight that input data of any workflow activity can be reduced,
including intermediary ones, as shown in Section 6, where input
data from the second activity is reduced. The input relation 𝑅 is
magnified on the right of Figure 2, where we show the subsets 𝑃
and 𝑆, and the slice § defined by the formula 𝐹.
Once the user has selected the slice to be removed (based on user-
defined criteria), the slice can be cut off. For this, we define the
operation 𝐶𝑢𝑡 using the difference relational algebraic operator, so
that 𝐶𝑢𝑡 𝑅, § ← 𝑅 − §.
By doing so, we ensure that only tuples from 𝑆 will be removed,
since § only contains tuples from 𝑆. This is necessary because
only tuples that have not been processed yet (i.e., they are “ready”               Figure 2. Relation 𝑹 with subsets 𝑷 and 𝑺, and a slice
to be processed) can be removed. Otherwise, either the data or the            component needs to update the Task table to mark the tasks with
workflow execution may become inconsistent. We note that our                  the identifiers in the 𝜋!"#$% slice as removed by user, so that the
solution is applicable to reduce input data of any workflow                   engine will not get them for execution.
activity that needs to process a dataset, as long as the SWMS is
aware of the data elements composing the dataset.                             The wf-Database tables are distributed, thus making concurrency
                                                                              control of the Task table partitions even more complex. In d-
4.2 Implementation                                                            Chiron, distributed concurrency control in the Task table is
In this section, we describe how we implement slice removal and               outsourced to the distributed database system that guarantees the
cut in d-Chiron. In the SWMS that implements the tuple-oriented               ACID properties [18]. Thus, concurrency caused by the
approach and manages execution data in the wf-Database, each                  aforementioned updates is controlled by the database system,
input tuple (or set of tuples, depending on the dataflow operator)            guarantying that both execution and data remain consistent.
to be consumed is related to a task. For this reason, removing                To store provenance of removed tuples, we extend the wf-
tuples means removing the tasks that would consume the                        Database schema with the table User_Query to store the queries
referenced tuples. In the PROV-Wf data model [2], which d-                    that select the slice of the dataset to be removed. The description
Chiron supports, tasks’ data and input tuples are stored in                   for each User_Query column is described in Table 3. We also
different relations, with a relationship in between. Thus, to
                                                                              keep track of the removed tasks in table Modified_Task, which
implement the set 𝑆, as defined previously, we need to semi-join
                                                                              is a table that represents a many-to-many relationship between
[18] input tuples from the input relation 𝑅 with tasks in READY
                                                                              User_Query and Task tables. In Section 5.2, we will give the
state in order to only select the tuples that will be processed. Then,
                                                                              necessary extensions to the PROV-Wf data model implemented in
to get the identifiers of the ready tasks 𝜋!"#$%   to be removed, we
                                                                              d-Chiron to accommodate User_Query and modified tasks.
project over the task identifiers. Using relational algebra:
                                                                                         Table 3. User_Query table description.
       𝜋!"#$% ←    Π!"#$_!" 𝜎! 𝑅 ⋉ 𝜎!"#"$!!"#$% 𝑇𝑎𝑠𝑘 ,
                                                                                   Column name                          Description
where the formula 𝐹 is analogous to that in Section 4.1, which is                   query_id          Auto increment identifier
the criteria to select the slice § to be removed. We emphasize that               slice_query         Query that selects the slice of the dataset to be
such verifications are necessary to guarantee consistency and are                                     removed.
the SWMS’s responsibility only, not the users’. The users would                   tasks_query         Query generated by the SWMS to retrieve the
                                                                                                      ready tasks associated.
only need to specify the formula 𝐹 to select the slice.                           issued_time         Timestamp of the user interaction
To ease slice removal in d-Chiron, we developed a Steer                                               Field that determines how the user interacted.
module. With the Steer module, users can issue command lines                                          It could be “Removal”, “Addition”, and
                                                                                   query_type         others. We currently only implemented
to inform the name of the input relation 𝑅 and the formula 𝐹 to
                                                                                                      “Removal” of tuples, but it can be extended in
select the slice to be removed. Then, the module is responsible for                                   future work.
retrieving the identifiers 𝜋!"#$% of the ready tasks to be removed                                    Relationship with the user who issued the
(analogous to the set 𝑆 needed for the slice definition). Instead of                 user_id
                                                                                                      interaction query
physically removing the tasks from the wf-Database, we choose to                                      To maintain relationship with the rest of
                                                                                      wkfid
mark them with the state REMOVED_BY_USER. By doing so, we                                             workflow execution data.
enable these tasks to be later analyzed with provenance queries
and to be consistently related within the table Modified_Task.                5. ADAPTIVE MONITORING
                                                                              In this section, we combine monitoring and computational
To guarantee consistency, we take advantage of d-Chiron’s
                                                                              steering into an adaptive monitoring approach. Our workflow
database system [21]. d-Chiron uses a transaction-optimized in-
                                                                              monitoring approach relies on online queries to the continuously
memory distributed database system that provides atomicity,
                                                                              populated wf-Database. Users can set up monitoring queries (as in
consistency, isolation, and durability (ACID). In a data reduction,
                                                                              Table 1 and Table 2) and analyze monitoring results.
both d-Chiron's engine and the Steer module need to
concurrently update a shared resource, the Task table in the wf-              In Section 5.1, we present a formal description and describe the
Database. While d-Chiron's engine updates the Task table to get               implementation in Section 5.2 with the extensions to PROV-Wf to
tasks for execution and to mark them later as executed, the Steer             accommodate adaptive monitoring and online data reduction.




                                                                         48
              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

5.1 Formal description of adaptive monitoring                                  Similar to what we did for the Steer module, we also developed
Monitoring works as follows. There is a set {𝑄}  composed of                   the module Monitor to facilitate utilization. The Monitor should
monitoring queries 𝑚𝑞! , 0   ≤   𝑖   ≤ | 𝑄 |, each one to be executed          start at any cluster node that is able to access the distributed
at each 𝑑! > 0. Users do not need to specify queries at the                    database system and should start after the workflow execution has
beginning of execution, since they do not know everything they                 begun, whenever users want to monitor the workflow execution.
want to monitor. This is why {𝑄} starts empty. After users gain                Similar to what we did for the Steer module, we also developed
insights from the data, after some interactive provenance data                 the module Monitor to facilitate utilization. The Monitor should
analyses, they can add monitoring queries to {𝑄}  in an ad-hoc                 start at any cluster node that is able to access the distributed
manner. Each 𝑑! can be adapted by users, meaning that users have               database system and should start after the workflow execution has
control of the time frame of each 𝑚𝑞!   during execution.                      begun, whenever users want to monitor the workflow execution.
Each 𝑚𝑞! execution generates a monitoring query result set                     A command line starts the Monitor module that runs in
𝑚𝑞𝑟!" , 𝑡   = 𝑘𝑑! |  𝑘   ∈ ℕ!! , at each time interval 𝑑! . We                 background. It establishes a connection with the distributed
constrain that each 𝑚𝑞𝑟!"   must deliver one column only. If users             database system (connection settings are provided in the XML
want more columns, they can write different monitoring queries                 configuration file). Chiron (and d-Chiron) makes use of this XML
for each new column. However, the number of rows in the result                 file to define the workflow design, workflow general settings, and
set is not limited. This means that each monitoring result set                 other user-defined variables. Then, the Monitor program keeps
𝑚𝑞𝑟!"   should be either a scalar value or an array.                           querying the Monitoring_Query table at each 𝑠 to check if a
To improve human-in-the-loop, the end-users have the flexibility               new monitoring query was added. The default value for 𝑠 is 30s,
to adapt monitoring during workflow execution. To do so, at each               as the time interval to check if monitoring queries were added or
instant 𝑡 after each monitoring query result 𝑚𝑞𝑟!"   has been                  removed. However, users can customize this. After the Monitor
generated, the values for 𝑑! and 𝑚𝑞!   are reloaded from the wf-               has started, users can add (or remove) monitoring queries to (or
Database. If any change has happened, it will be considered in the             from) the Monitoring_Query table. Currently, users can add
next iteration 𝑡   +    𝑑! . Moreover, at each certain amount of time          monitoring queries using a command line to inform the SQL
during execution (also configured by the user), the system checks              query to be executed at each time interval and the time interval.
if the user has added new monitoring queries in {𝑄}. Our adaptive              Whenever the Monitor module identifies that the user added a
monitoring approach takes full advantage of the data stored online             new monitoring query, it launches a new thread. Each thread is
in the wf-Database. More importantly, it enables users to                      responsible for executing each monitoring query in
dynamically steer monitoring settings (including which data will               Monitoring_Query at each defined time interval. A thread is
be monitored and how), highly benefiting them in finding                       finished when a monitoring query is removed or when the
uninteresting subsets to be removed.                                           workflow stops executing (in that case, all threads are finished).
                                                                               Figure 3 shows the steps executed at each time interval.
5.2 Implementation                                                             1. Execute the monitoring query 𝑚𝑞!
To implement our approach, we first need to extend the wf-
Database schema. To store {𝑄}, we add the table                                2. Store query results in the wf-Database
Monitoring_Query, shown in Table 4.                                            3. Reload all information for 𝑚𝑞!    from the wf-Database for the next
                                                                                  time iteration. The user could have adapted any of this information.
       Table 4. Monitoring_Query table description.
    Column name                    Description                                 4. Wait for 𝑑! seconds
   monitoring_id  Auto increment identifier                                    Figure 3. Steps executed by each thread within a time interval.
     interval     Interval time (in seconds) between each
                  monitoring query (𝑑! )                                       To enable all these monitoring capabilities and human-adaptation,
 monitoring_query Raw SQL query to be queried                                  three of these steps represent queries to the wf-Database,
                  Relationship between the monitoring queries                  including reads and writes. The stored results can be further
      wkfid       and the current execution of the workflow. In                analyzed a posteriori or, more interestingly, used as input for
                  d-Chiron’s wf-Database, there may be data
                  from past executions for a same workflow.
                                                                               runtime data visualization tools, since results are immediately
                                                                               made available after they are generated.
The main advantage of storing monitoring results in the wf-
Database (and adequately linking the results with the remainder of             Another contribution of this paper is that we add three concepts to
the data already stored in this database) whenever a monitoring                PROV-Wf [2], which is W3C PROV-compliant [15]. Our main
query result is executed is that users are able to query the results           motivations to adhere to the W3C PROV recommendations are to
immediately after their generation. The wf-Database can also                   help on query specification, to maintain compatibility between
serve as data source for data visualization applications. To store             different SWMS and facilitate interoperability between different
monitoring results in the wf-Database, we add another table:                   databases.
Monitoring_Query_Result, shown in Table 5.                                     These concepts are: UserQuery, MonitoringQuery, and
      Table 5. Monitoring_Result table description.                            MonitoringResult, as in Figure 4. Using PROV nomenclature,
      Column name                             Description                      UserQuery is a PROV Activity that stores the user queries that
 monitoring_result_id          Auto increment identifier                       remove sets of tuples and thus influence the state of the associated
      monitoring_id            Relationship with the monitoring query          tasks (i.e., remove them). MonitoringQuery is a PROV Activity
                               that generated this result                      that contains the monitoring queries submitted by the user in
   monitoring_values           Results of the monitoring_query                 specific time intervals. The monitoring queries generate PROV
                               Data type of the result values of both          Entity MonitoringResult that stores the query results.
       result_type             queries. Currently, “Integer”, “Double”,
                               “Array[Integer]”, and “Array[Double]”




                                                                          49
              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

6. EXPERIMENTAL VALIDATION                                                   associated to low risk of fatigue life values. In the workflow
In this section, we validate our solution (for online data reduction         (Figure 1), the final value of fatigue life is calculated in Activity
and adaptive monitoring) based on a real data. In Section 6.1, we            6, but input values are obtained as output of Activity 1, gathered
show the experimental setup, Section 6.2 shows a test case where             from raw input files. Keeping provenance is essential to associate
an expert monitors the execution and removes slices of the                   data from Activity 1 with data from Activity 6.
dataset. In Section 6.3, we analyze the added overhead.                      To understand which input values are leading to high fatigue life
                                                                             values, Peter monitors the generated data online. For simplicity,
6.1 Experimental setup                                                       we consider wind speed, which is only one out of the many
Scientific workflow. As a proof of concept for this work, we use             environmental condition parameter values captured by Activity 1
a synthetic parameter sweep workflow of the Riser Fatigue                    to serve as input for Activity 2. Peter knows that wind speed has a
Analysis example (see Figure 1), which is based on a real case               strong correlation with fatigue life in risers. He expects that with
study. The workflow manipulates approximately 300 GB of raw                  low speed winds, there is a lower risk of accident.
data. In all executions, we use the same dataset, which spans over           When workflow execution starts, the Monitor module is
approximately 12,000 data elements to be processed in parallel.              initialized. Then, Peter adds two monitoring queries: 𝑚𝑞!   shows
Depending on the workflow activity, tasks may take few seconds               the average of the 10 greatest values of fatigue life calculated in
(e.g., Activity 7) or up to one minute on average (e.g., Activity 3).        the last 30s of workflow execution, setting 𝑑!    =   30s; and 𝑚𝑞!
Software. In all executions, we use d-Chiron [21], which uses                shows the average wind speed associated to the 10 greatest values
MySQL Cluster 7.4.9 as its in-memory distributed database                    of fatigue life calculated in the last 30s, also setting the query
system to manage the wf-Database. The code to run d-Chiron and               interval 𝑑!    =   30s. We recall from Table 1 that 𝑚𝑞! is similar to
setup files are in github.com/hpcdb/d-chiron.                                𝑄1, but only considering data processed in the last 30s. 𝑚𝑞! and
                                                                             𝑚𝑞!   queries are added to the Monitoring_Query table.
Hardware. The experiments were conducted in Grid5000 using a
cluster with 39 nodes, containing 24 cores each (936 cores). Every           Peter monitors the results using the Monitoring_Result table.
node has two AMD Opteron 1.7 GHz 12-core processors, 48GB                    These results can be a data source for a visualization that plots
RAM, and 250GB of local disk. All nodes are connected via                    dashboards dynamically, refreshed according to the query
Gigabit Ethernet and access a shared storage of 10TB.                        intervals. After gaining insights from the results and
                                                                             understanding patterns, he can start removing the undesired values
6.2 Test case                                                                for wind speed. The monitoring query results 𝑚𝑞𝑟!! and
Let us consider the following scenario. Peter is an offshore                 𝑚𝑞𝑟!!   for the two previously listed queries, as well as when the
engineer, expert in riser analysis and learned how to set up                 user reduced the data, are plotted along the workflow elapsed
monitoring, analyze d-Chiron’s wf-Database, and use the Steer                time, as shown in Figure 5. It presents 𝑚𝑞𝑟!!   in full black line
module developed in this work. In Peter’s project, the Design                with square markers and 𝑚𝑞𝑟!! in full gray line with triangle
Fatigue Factor is set to 3 and service life is set to 20 years,              markers. These markers determine when the monitoring occurred.
meaning that fatigue life must be at least 60 years (see from
                                                                             The workflow execution starts at 𝑡   =   0, but only after
Section 2). Peter is only interested in analyzing risers with low
                                                                             approximately 150s, the first output results from Activity 6 starts
fatigue life values, because they are critical and might need repair
                                                                             to be generated. From the first results, at 𝑡   =   150    and 𝑡 = 180,
or replacement. During workflow execution, it would be
                                                                             Peter checks that when wind speed is less than 16 Km/h (see
interesting if Peter could inform the SWMS, which input values
                                                                             horizontal dashed line in 𝑤𝑖𝑛𝑑  𝑠𝑝𝑒𝑒𝑑   =   16 in Figure 5), the
would lead to low risk of fatigue, so they could be removed.
                                                                             results lead to the largest fatigue life values. Since risers with
However, this is not simple because it is hard to determine the
                                                                             large fatigue life values are not interesting in this analysis, he
specific range of values (i.e., the slice to be removed). For this,
                                                                             decides, at 𝑡   =   190, to remove all input data elements that
Peter first needs to understand the pattern of input values




                     Figure 4. Extended PROV-Wf data module to accommodate modified tasks and monitoring




                                                                        50
                               WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

contain wind speed less than 16 Km/h. For this, the first user                                     in the large fatigue life values was registered after this last Peter’s
query 𝑞! is issued with a command line to the Steer module.                                        steering. Thus, he keeps analyzing the monitoring results, but does
User queries are represented with circles in the horizontal axis                                   not remove input data anymore until the end of execution.
(Elapsed time) in Figure 5. The exact time a user issued an                                        We store each interaction in the User_Query table and map (in
interaction query is stored in User_Query table.                                                   table Modified_Task) its rows with rows in the Task table, to
The next markers after 𝑞! happens at 𝑡   =   210. Comparing with                                   consistently keep provenance of which tasks were modified (in
the previous monitoring mark, at 𝑡   =   180, we can observe that                                  this case, removed) by each specific user steering. Thus, keeping
this Peter’s steering (𝑞! ) increases the minimum wind speed                                       provenance of user steering helps analyzing how specific
values to be considered from 14.2 Km/h to 24.1 Km/h. Also, we                                      interactions impacted the results. Figure 5 shows that some
observe a significant decrease in the slope of the largest values for                              specific interactions imply significant changes in lines’ slopes.
fatigue life (10.6% lower). This means that the removal of these                                   Queries on the wf-Database can show finer details about how
input data containing wind speed less than 16 Km/h made the                                        many tuples each user interaction made the SWMS not process, as
SWMS not process data containing low wind speed values, which                                      shown in Table 6. Each issued time follows Figure 5 and is
would lead to larger fatigue life results.                                                         registered with the timestamp of when the first activity started.
Then, monitoring continues, but that slope decrease calls Peter’s                                           Table 6. Provenance of slices removed by the user
attention. To obtain a finer detail of what is happening, he decides                                Inter         Issued                                    Number of removed
                                                                                                                                     Slice query
to adjust the monitoring interval time (𝑑!   and 𝑑! ) at runtime, by                                act.         time (s)                                     data elements
reducing to 10s to get monitoring feedbacks more frequently. We                                      𝑞!            190             wind_speed < 16                 623
can observe that for both lines 𝑚𝑞𝑟!!   and 𝑚𝑞𝑟!! , the markers
become more frequent during 𝑡   =    [220, 270]. This is because a                                   𝑞!            310             wind_speed < 25                 373
monitoring is registered at every 10s. We highlight that, although                                   𝑞!            370             wind_speed < 30                 355
in test use case we are only showing monitoring correlating wind
                                                                                                     𝑞!            430            wind_speed < 34.5                115
speed and fatigue life, other monitoring correlations could also be
analyzed and users can add, remove or adjust monitoring queries                                      𝑞!            520            wind_speed < 35.5                 3
at any time during execution.                                                                      Finally, we run the exact same workflow and input datasets, but
After verifying that the results are reasonable, Peter decides to                                  with no monitoring or interactions to compare how such slice
increase back the monitoring query intervals for both queries to                                   removals help decrease overall execution time. The workflow
30s after 𝑡   =   270. He then observes that since 𝑞! , wind speed                                 with no interaction processes all input data, including those
less than 25 Km/h are leading to large fatigue life values.                                        containing wind speed values that lead to risers with low risk of
Then, at 𝑡   =   310, he calls Steer again to issue 𝑞! that removes                                fatigue, which are not considered in Peter’s analyses. In total,
                                                                                                   Peter’s steering yields the removal of 1469 input data elements
input data for wind speed < 25 Km/h. The next markers after
                                                                                                   (out of approximately 12,000). This reduces the execution time
𝑞!   shows that this steering made the wind speed value associated
                                                                                                   for this test case by 37% compared with no steering. Furthermore,
to large fatigue life be at least 30.5 Km/h and a decrease of 6.5%
                                                                                                   these removed input data would make the workflow process and
in large fatigue life values between 𝑡   =   300  𝑎𝑛𝑑  𝑡   =   330.
                                                                                                   generate more raw data files if the input data elements were not
Similarly, Peter continues to monitor and steer the execution. He                                  removed. By querying the wf-Database in the end of execution,
issues 𝑞!   at 𝑡   =   370 to remove input data with wind speed <                                  we found that the execution with no user steering processed
30.5 Km/h, making a decrease of 4.9% in large fatigue life                                         approximately 300GB of raw data files, whereas with steering the
(comparing fatigue life in 𝑡   =   360 and 𝑡   =   390). Then, he                                  total was 258GB, representing 14% of data reduction.
issues 𝑞!   at 𝑡   =   430 to remove input data with wind speed <
34.5, attaining a decrease of 1.7% in large fatigue life (comparing                                6.3 Analyzing monitoring overhead
fatigue life in 𝑡   =   420 and 𝑡   =   450). Despite this small                                   A monitoring query 𝑚𝑞! in {𝑄} is run by a thread at each 𝑑!
decrease, he decides at t = 520 to further remove data, but with                                   seconds. Depending on the number of threads (|{𝑄}|) and on the
wind speed < 35.5 Km/h. However, no decrease greater than 1%                                       interval 𝑑! there may be too many concurrent accesses to the wf-

                                                                                                                                                                   80.0
                       35.0
                       33.0                                                                                                                                        78.0
                       31.0                                                                                                                                        76.0
   Wind speed (Km/h)




                                                                                                                                                                          Fatigue life (years)




                       29.0                                                                                                                                        74.0
                       27.0                                                                                                                                        72.0
                       25.0                                                                                                                                        70.0
                       23.0
                                                                                                                                                                   68.0
                       21.0
                                                                                                                                                                   66.0
                       19.0
                       17.0                                                                                                                                        64.0
                       15.0               q1                           q2               q3                  q4                             q5                      62.0
                       13.0                                                                                                                                        60.0
                              150   180        210 230 250 270   300        330   360        390      420        450        480      510        540   570   600
                                                                              Elapsed time (seconds)
                                                 Wind speed                         Fatigue life                                      Steering
                          Figure 5. Use case plot to analyze impact of user steering comparing Wind Speed (input) with Fatigue life (output).




                                                                                             51
                              WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

Database, which may add overhead. The goal of this experiment is                    being processed online by dimension reduction or by reducing the
to analyze such overhead.                                                           range of some parameters, sharing similar motivations to our
We set up the Monitor module to run queries, which are                              work. It uses Nimrod/K as its underlying parallel workflow
                                                                                    engine, which is an extension of the Kepler workflow system [1].
variations of the queries 𝑄1-𝑄7 presented in Table 1 and Table 2.
                                                                                    WorkWays presents several tools for user interaction in human-in-
For example, in 𝑄2, we vary the curvature value. We also modify
                                                                                    the-loop workflows, such as graphic user interfaces, data
them to calculate only the results over the last 𝑑 seconds, at each
                                                                                    visualization, and interoperability among others. However,
𝑑 seconds. To evaluate the overheads, we measure execution time
                                                                                    WorkWays does not provide for provenance representation and
without monitoring and then with monitoring, but varying the
                                                                                    users do not have query access to simulation data, execution data,
number of queries |{𝑄}| and the interval 𝑑, which is considered
                                                                                    metadata, and provenance, all related in a database, which limits
the same for all queries in {𝑄} in this experiment. The
                                                                                    the power of online computational steering. For example, it
experiments were repeated until the standard deviation of
                                                                                    prevents ad-hoc data analysis using both domain and workflow
workflow elapsed times was less than 1%. The results are the
                                                                                    execution data, such as those presented in Table 1 and Table 2,
average of these times within the 1% margin. Figure 6 shows the
                                                                                    which support the user in defining which slice of the dataset
results, where the gray portion represents the workflow execution
                                                                                    should be removed. In contrast, our work uses a robust in-memory
time when no monitoring is used; and the black portion represents
                                                                                    distributed database system to manage and relate analytical data
the difference between the workflow execution time with and
                                                                                    involved in the workflow execution. Moreover, the lack of
without monitoring (i.e., the monitoring overhead).
                                                                                    provenance data support in WorkWays, either online or post-
From these results, we observe that when the interval 𝑑 is equal to                 mortem, does not support reproducibility and prevents from
30s, the overhead is negligible. When the interval is 1s, the                       registering user adaptations, missing opportunities to determine in
overhead is higher when the number of monitoring threads is                         detail how specific user interactions influenced workflow results.
greater. This happens because three queries are executed in each                    Another notable SWMS example is WINGS/Pegasus [9], which
time interval (see Figure 3), for each thread. In the scenarios with                especially focus on assisting users in automatic data discovery. It
30 threads, there will be 120 queries in a single time interval 𝑑. In               helps generating and executing multiple combinations of
that case, if 𝑑 is small (e.g., 𝑑 = 1), there are 120 queries being                 workflows based on user contraints, selecting appropriate input
executed per second, just for the monitoring. The database that is                  data, and eliminating workflows that are not viable. However, it
queried by the monitors is also frequently queried by the SWMS                      differs from our solution in the sense that it tries to explore
engine, thus adding higher overhead. However, even in this                          multiple workflows until finding the most suitable one, whereas
specific scenario that shows higher overhead (|{𝑄}| = 30 and                        we often model our experiments as one single scientific workflow
𝑑 = 1), it is only 33s or 3.19% higher than when no monitoring is                   to be processed. Also, it does not aim at putting users in the loop
used. Most of the real monitoring cases do not need such frequent                   to actively eliminate subsets of an input dataset, especially based
(every second) updates. If 30s is frequent enough for the user,                     on extensive ad-hoc intermediary data analysis online.
there might be no added overhead, like in this test case.                           Additionally, as WorkWays, provenance data is not collected
We also evaluated the same scenarios without storing monitoring                     online, nor is it integrated with domain-specific and execution
                                                                                    data for enhanced analysis.
results in the wf-Database, but rather appending in CSV files,
which is simpler. The results are nearly the same as in Figure 6.                   While human adaptation is less explored in parallel SWMS,
This suggests storing all monitoring results in the wf-Database at                  monitoring is widely supported in several existing SWMS
runtime, which enables users to submit powerful queries as they                     [13][14]. For example, Pegasus [5] and Triana may be integrated
are generated, with all other provenance data. This would not be                    to analytical tools such as Stampede [10][22], which provides a
possible with a solution that appends data to CSV.                                  framework to monitor workflow executions and has rich
                                                                                    capabilities for online performance monitoring, troubleshooting,
7. RELATED WORK                                                                     and debugging. However, in these solutions, it is not possible to
Considering our contributions, we discuss the SWMS with                             monitor workflow execution data associating them to provenance
parallel capabilities with respect to human adaptation (especially                  and domain data, as we do using queries to the wf-Database. To
data reduction), online provenance support, and monitoring                          the best of our knowledge, there is no related work that allows for
features.                                                                           online data reduction based on a rich analytical support with
Although online human adaptation is the core of computational                       adaptive monitoring and provenance registration of human
steering, there are few parallel SWMS [11][12][19] that support it.                 adaptations in scientific workflows. These features allow for
These solutions have monitoring services and are highly scalable,                   performance improvements of scientific workflows, while
but do not allow for online data reduction as a means to reduce                     keeping data reduction consistency and provenance queries that
overall execution time. WorkWays [16] is a powerful science                         can show the history of human-in-the-loop actions and results.
gateway that enables users to steer and dynamically reduce data
                                                                                    8. CONCLUSION
                                                         Overhead                   This work contributes to putting the human in the loop of
                       16.5                              No monitor time
  Exec. time (min)




                                                                                    scientific workflow systems, especially when users can actively
                        16                                                          steer and reduce data online to improve performance. As a
                                                                                    solution to the input data reduction problem, we made use of a
                       15.5
                                                                                    tuple-oriented algebraic approach that organizes workflow data to
                        15                                                          be processed as sets of tuples stored in a wf-Database, managed
                                  d=1        d=1       d=30      d=30               by an in-memory distributed database system at runtime. We
                                |{Q}|=3   |{Q}|=30   |{Q}|=3   |{Q}|=30             developed a mechanism coupled to d-Chiron, a distributed version
                                                                                    of Chiron SWMS, which allows for reducing data, while
                     Figure 6. Results of adaptive monitoring overhead.             maintaining both data integrity and execution consistency. A
                                                                                    major challenge to the problem of data reduction is to address




                                                                               52
               WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah

which subset of the data should be removed. As a solution to this,                [4] Deelman, E., Gannon, D., Shields, M., Taylor, I. Workflows and e-
we proposed an adaptive monitoring approach that aids users in                         Science: an overview of workflow system features and capabilities.
analyzing partial result data at runtime. Based on the evaluation of                   FGCS, 25(5):528–540, 2009.
                                                                                  [5] Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S.,
input data elements and its corresponding results, the user may
                                                                                       Maechling, P.J., Mayani, R., Chen, W., Ferreira da Silva, R., et al.
find which subset of the input data is not interesting for a                           Pegasus, a workflow management system for science automation.
particular execution, hence can be removed. The adaptive                               FGCS, 46(C):17–35, 2015.
monitoring allows users not only to follow the evolution of the                   [6] Det Norske Veritas. Recommended practice: riser fatigue. DNV-RP-
workflow, but also to dynamically adjust monitoring aspects                            F204, 2010.
during execution. We extended our previous workflow                               [7] Dias, J., Guerra, G., Rochinha, F., Coutinho, A.L.G.A., Valduriez, P.,
provenance data model to be able to represent provenance of the                        Mattoso, M. Data-centric iteration in dynamic workflows. FGCS,
online data reduction actions by users and the monitoring results.                     46(C):114–126, 2015.
                                                                                  [8] Dias, J., Ogasawara, E., Oliveira, D., Porto, F., Coutinho, A.L.G.A.,
Although we implemented our solution in d-Chiron, other SWMS
                                                                                       Mattoso, M. Supporting dynamic parameter sweep in adaptive and
could be used if provenance, execution, and domain dataflow data                       user-steered workflow. WORKS, 31–36, 2011.
are managed in a database at runtime.                                             [9] Gil, Y., Ratnakar, V., Kim, J., Gonzalez-Calero, P., Groth, P., Moody,
To validate our solution, we executed a data-intensive parameter                       J., Deelman, E. Wings: Intelligent workflow-based design of
sweep workflow based on a real case study from the oil and gas                         computational experiments. Intelligent Systems, 26(1):62–72, 2011.
                                                                                  [10] Gunter, D., Deelman, E., Samak, et al. Online workflow management
industry, running on a 936-cores cluster. A test case demonstrated                     and performance analysis with Stampede. CNSM, 152–161, 2011.
how the user can monitor the execution, dynamically adapt                         [11] Jain, A., Ong, S.P., Chen, W., et al. FireWorks: a dynamic workflow
monitoring settings, and especially remove uninteresting data to                       system designed for high-throughput applications. CCPE,
be processed, all during execution. Results for this test case show                    27(17):5037–5059, 2015.
that the user interactions reduced the execution time by 37%                      [12] Lee, K., Paton, N.W., Sakellariou, R., Fernandes, A.A.A. Utility
comparing with the execution that processed the whole dataset.                         functions for adaptively executing concurrent workflows. CCPE,
Although the test case was from the oil and gas domain, any other                      23(6):646–666, 2011.
workflow application could have been used, as long as a domain                    [13] Mandal, A., Ruth, P., Baldin, I., et al. Toward an end-to-end
                                                                                       framework for modeling, monitoring and anomaly detection for
expert can tell which slice is not interesting, removed with no                        scientific workflows. IPDPSW, 1370–1379, 2016.
harm to the final results.                                                        [14] Mattoso, M., Dias, J., Ocaña, K.A.C.S., Ogasawara, E., Costa, F.,
To the best of our knowledge, this is the first work that explores                     Horta, F., Silva, V., de Oliveira, D. Dynamic steering of HPC
user-steered online data reduction in scientific workflows steered                     scientific workflows: A survey. FGCS, 46:100–113, 2015.
                                                                                  [15] Moreau, L., Missier, P. PROV-DM: the PROV data model. Available
by ad-hoc queries and adaptive monitoring, while maintaining
                                                                                       at: http://www.w3.org/TR/prov-dm Accessed: 1 Aug 2016., 2013.
provenance of user interactions. The results motivate us to extend                [16] Nguyen, H.A., Abramson, D., Kiporous, T., Janke, A., Galloway, G.
our solution and explore different aspects that can be adapted by                      WorkWays: interacting with scientific workflows. Gateway
humans based on sophisticated workflow data analysis support.                          Computing Environments Workshop, 21–24, 2014.
Our solution is currently dependent on the domain expert’s                        [17] Ogasawara, E., Dias, J., Oliveira, D., Porto, F., Valduriez, P.,
knowledge to identify correlations between input and output data                       Mattoso, M. An algebraic approach for data-centric scientific
to determine which subsets are uninteresting. We plan to address                       workflows. PVLDB, 4(12):1328–1339, 2011.
in-situ data visualization based on the adaptive monitoring and                   [18] Özsu, M.T., Valduriez, P. Principles of distributed database systems.
                                                                                       3 ed. New York, Springer, 2011.
interactive queries results and develop recommendation models to
                                                                                  [19] Reuillon, R., Leclaire, M., Rey-Coyrehourcq, S. OpenMOLE, a
suggest correlations based on history stored in the wf-Database.                       workflow engine specifically tailored for the distributed exploration
Other future work include: enabling users to set priorities to                         of simulation models. FGCS, 29(8):1981–1990, 2013.
different slices of the data in a way that the SWMS system will                   [20] Silva, V., de Oliveira, D., Valduriez, P., Mattoso, M. Analyzing
process critic slices before; improving usability of the system by                     related raw data files through dataflows. CCPE, 28:2528–2545, 2015.
developing intuitive user interfaces to decrease the learning curve,              [21] Souza, R., Silva, V., Oliveira, Daniel, Valduriez, P., Lima, A.A.B.,
especially related to the query interface, to take full advantage of                   Mattoso, M. Parallel execution of workflows driven by a distributed
the wf-Database. We also plan to expand our experiments and                            database management system. Poster in Supercomputing, 2015.
                                                                                  [22] Vahi, K., Harvey, I., Samak, T., et al. A case study into using
analyze how reducing each specific type of data (relation tuples,
                                                                                       common real-time workflow monitoring infrastructure for scientific
raw data files not processed and not generated) impact final                           workflows. J. Grid Comput., 11(3):381–406, 2013.
results.

9. ACKNOWLEDGMENTS
This work was partially funded by CNPq, FAPERJ and Inria
(MUSIC project), EU H2020 Programme and MCTI/RNP-Brazil
(HPC4E grant no. 689772), and performed (for P. Valduriez) in
the context of the Computational Biology Institute (www.ibc-
montpellier.fr). The experiments were carried out using the
Grid'5000 testbed (https://www.grid5000.fr).

10. REFERENCES
[1] Abramson, D., Enticott, C., Altinas, I. Nimrod/K: Towards massively
    parallel dynamic grid workflows. Supercomputing, 24:1–24:11, 2008.
[2] Costa, F., Silva, V., de Oliveira, D., Ocaña, K., Ogasawara, E., Dias,
    J., Mattoso, M. Capturing and querying workflow runtime provenance
    with PROV: a practical approach. EDBT Workshops, 282–289, 2013.
[3] Davidson, S.B., Freire, J. Provenance and scientific workflows:
    challenges and opportunities. SIGMOD, 1345–1350, 2008.




                                                                             53