=Paper=
{{Paper
|id=Vol-1800/paper6
|storemode=property
|title=Online Input Data Reduction in Scientific Workflows
|pdfUrl=https://ceur-ws.org/Vol-1800/paper6.pdf
|volume=Vol-1800
|authors=Renan Souza,Vitor Silva,Alvaro L G A Coutinho,Patrick Valduriez,Marta Mattoso
|dblpUrl=https://dblp.org/rec/conf/sc/SouzaSCVM16
}}
==Online Input Data Reduction in Scientific Workflows==
Renan Souza §,°, Vítor Silva §, Alvaro L. G. A. Coutinho §, Patrick Valduriez ¶, Marta Mattoso §

§ COPPE/Federal University of Rio de Janeiro; ° IBM Research Brazil; ¶ Inria and LIRMM, Montpellier

ABSTRACT

Many scientific workflows are data-intensive and need to be iteratively executed for large input sets of data elements. Reducing input data is a powerful way to reduce overall execution time in such workflows. When this is accomplished online (i.e., without requiring users to stop execution to reduce the data and then resume execution), it can save much time, and user interactions can be integrated within workflow execution. A major problem, then, is to determine which subset of the input data should be removed. Other related problems include guaranteeing that the workflow system will keep execution and data consistent after reduction, and keeping track of how users interacted with the execution. In this paper, we adopt the "human-in-the-loop" approach for scientific workflows by enabling users to steer the workflow execution and reduce input elements from datasets at runtime. We propose an adaptive monitoring approach that combines workflow provenance monitoring and computational steering to support users in analyzing the evolution of key parameters and determining which subset of the data should be removed. We also extend a provenance data model to keep track of user interactions when users reduce data at runtime. In our experimental validation, we develop a test case from the oil and gas industry, using a 936-core cluster. The results on our parameter sweep test case show that user interactions for online data reduction yield a 37% reduction of execution time.

CCS Concepts

• Massively parallel and high-performance simulations.

Keywords

Scientific Workflows; Human in the Loop; Online Data Reduction; Provenance Data; Dynamic Workflows.

Copyright held by the authors. WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah.

1. INTRODUCTION

Scientific Workflow Management Systems (SWMS) with parallel capabilities have been designed for executing data-intensive scientific workflows (scientific workflows for short hereafter) in High Performance Computing (HPC) environments. A typical execution may involve thousands of parallel tasks with large input datasets. When the workflow is iterative, it is repeatedly executed for each element of an input dataset. The more data to be processed, the longer the workflow may take, which may be days depending on the problem and the HPC environment [7]. Configuring a scientific workflow with parameters and data to be processed is hard. Typically, the user needs to try several input data or parameter combinations in different workflow executions. These trials make the scientific experiment take even longer. It is well known that optimizing the performance of the parallel system is a way to improve overall workflow execution time, but reducing the input data that was initially planned to be processed is another effective approach to reduce workflow execution time [4].

In scientific workflows, the total amount of data is very large, but not necessarily the entire input dataset has relevant data for achieving the goal of the workflow execution. This is particularly the case when a large parameter space needs to be processed in parameter sweep workflows. There may be slices of the parameter space that will not influence relevant results and thus, as with a "branch and bound" optimization strategy, can be bounded. A similar scenario occurs when the workflow involves a large input dataset. When domain-specialist users can actively participate in the computational process, a practice frequently referred to as "human-in-the-loop", they may analyze partial result data and tell which part of the data is relevant or not for the final result [14]. Then, based on their domain knowledge, users can identify which subset of the data is not interesting and thus should be removed from the execution by the SWMS, thereby reducing execution time.

Data reduction can be accomplished in at least three different forms. First, users can do it before the execution starts. However, in most complex scenarios, the high number of possibilities makes it impossible to know beforehand the uninteresting subsets without any prior execution. Furthermore, not only the initial dataset can be reduced, but also the data generated by the workflow, since the activities composing scientific workflows continuously produce significant amounts of partial data that are consumed by other activities. A second form of data reduction is to do it online. When the SWMS allows for partial result data analysis, the user may interact with the generated partial data, find which slice of the dataset is not interesting, and reduce the dataset online. We use the term online for the interactions where users are able to inspect workflow execution, analyze partial and performance data, and dynamically adapt (i.e., steer) workflow settings while the workflow is running (i.e., at runtime). The third form of data reduction is by stopping execution, reducing the data offline, and then resuming execution with the reduced dataset. Because of the difficulty in defining the exploratory input dataset and the long execution time of such iterative workflows, users frequently adopt the third form. However, in the offline form, the SWMS is not aware of the changes, and the results with one workflow configuration are not related to the others. Therefore, this is generally more time-consuming, there is no control or registration of user interactions, and the execution may become inconsistent [7].

Online data reduction has obvious advantages but introduces several problems related to computational steering in HPC environments [14]. First, because of the complexity of the scientific question to address and the huge amounts of data, users do not know exactly beforehand which subset of the dataset should be kept or removed. Also, if users cannot actively follow the result data evolution online, in particular domain data associated to execution and provenance data (the history of data derivation), they can be driven to misleading conclusions when trying to identify the uninteresting subset of the data. Indeed, this is the main related challenge. Second, if they can find which subset to remove and actually try to remove it, the SWMS must guarantee that the operation will be done consistently. Otherwise, it can introduce anomalous data, leading to no control of data elimination, data redundancy, or even an execution crash. Third, over a long run, there may be more than one user interaction, each removing more subsets, at different times. If the SWMS does not keep track of user actions, it may negatively impact the results' reproducibility and reliability. Although data reduction is not new in SWMS [4], to the best of our knowledge, these problems related to online user-steered data reduction while maintaining data provenance have not been addressed by related works.

Our approach is to represent workflow input datasets as database relations.
Each input element from the scientific domain dataset is represented as a tuple of the input relations. When the elements of the input dataset are files, we insert paths to these files. The approach is implemented in Chiron, a parallel SWMS that adopts a tuple-oriented algebraic approach [17]. Chiron has been used to manage workflow applications with user steering in domains such as bioinformatics [2], computational fluid dynamics [7], astronomy [20], etc. Chiron continuously populates a relational database at runtime to store domain-specific data, workflow execution data, and, more importantly, provenance data, all integrated in the same database and available for online queries. In this paper, we use the term workflow Database (wf-Database) to refer to this database. In addition to the traditional advantages of managing provenance data in scientific workflows (i.e., reproducibility, reliability, and quality of result data) [3], online provenance data management eases interactive domain data analysis [2][20]. Such interactive, flexible data analysis through provenance helps finding which subset of a dataset should be removed. Moreover, execution monitoring is another very desirable feature in any data-intensive system, including SWMS, and can also be used to assist users in determining the subset to be removed. However, Chiron does not control changes in input datasets, including removing a subset. In this work, we extend Chiron's wf-Database to maintain the provenance of the subsets of the dataset that are removed. Furthermore, we take advantage of a distributed in-memory database system coupled to Chiron, in a version called d-Chiron that is significantly more scalable [21], to address consistency issues with respect to data reduction. We make the following contributions:

• A mechanism coupled to d-Chiron for online input data reduction, which allows users to remove subsets of the dataset to be processed at runtime. It guarantees that both execution and data remain consistent after reduction.

• An extension to a provenance data model (which is W3C PROV compliant) to maintain the history of user interactions when users decide to reduce a dataset during workflow execution.

• An adaptive monitoring approach that combines monitoring and computational steering. It helps users to follow the evolution of interesting parameters and result data to find which subsets of the dataset can be removed during execution. Also, since what users find interesting may change over time, this approach allows the user to steer the monitoring definitions, such as which data should be monitored and how. Although existing solutions enable workflow execution monitoring [13][16][19], to the best of our knowledge there is no approach that enables users to run monitoring queries integrating execution, provenance, and domain data, and to dynamically adapt these queries online.

Paper organization: Section 2 gives a motivating example. Section 3 gives the background for this work. We present our online data reduction approach in Section 4 and our adaptive monitoring approach in Section 5. Section 6 gives the experimental validation. Section 7 discusses related work. Section 8 concludes.

2. MOTIVATING EXAMPLE IN OIL AND GAS INDUSTRY

In ultra-deep water oil production systems, a major application is to perform risers' analyses. Risers are fluid conduits between subsea equipment and the offshore oil floating production unit. They are susceptible to a wide variation of environmental conditions (e.g., sea currents, wind speed, ocean waves, temperature), which may damage their structure. The fatigue analysis workflow adopts a cumulative damage approach as part of the riser's risk assessment procedure, considering a wide combination of possible conditions. The result is the estimate of the riser's fatigue life, which is the length of time that it will safely operate. The Design Fatigue Factor (DFF) may range from 3 to 10, meaning that the riser's fatigue life must be at least DFF times higher than the service life [6].

Sensors located at the offshore platform collect external conditions and floating unit data, which are stored in multiple raw files. Offshore engineers use specialized programs (mostly complex simulation solvers) to consume the files, evaluate the impact on the risers in the near future (e.g., risk of a fracture occurrence), and estimate the risers' fatigue life. Figure 1 presents a scientific workflow composed of seven piped specialized programs (represented by workflow activities) with a dataflow in between.

Figure 1. Risers Fatigue Analysis Workflow.

Each task of Data Gathering (Activity 1) decompresses one large file into many files containing important input data, reads the decompressed files, and gathers specific values (environmental conditions, floating unit data, and other data), which are used by the following activities. Preprocessing (Activity 2) performs pre-calculations and data cleansing over some other finite element mesh files that will be processed in the following activities. Stress Analysis (Activity 3) runs a computational structural mechanics program to calculate the stresses applied to the riser. Each task consumes pre-processed meshes and other important input data values (gathered from the first activity) and generates result data files, such as histograms of stresses applied throughout the riser (an output file), stress intensity factors in the riser, and principal stress tensor components. It also calculates the current curvature of the riser. Then, Stress Critical Case Selection (Activity 4) and Curvature Critical Case Selection (Activity 5) calculate the fatigue life of the riser based on the stresses and curvature, respectively. These two activities filter out results corresponding to risers that are certainly in a good state (no critical stress or curvature values were identified), which are of no interest to the analysis. Calculate Fatigue Life (Activity 6) uses previously calculated values to execute a standard methodology [6] and calculate the final fatigue life value of a riser. Compress Results (Activity 7) compresses output files by riser.

Most of these activities generate result data (both raw data files and some other domain-specific data values), which are consumed by the subsequent activities. These intermediary data need to be analyzed during workflow execution. More importantly, depending on a specific range of data values for an output result (e.g., fatigue life value), there may be a specific combination of input data (e.g., environmental conditions) that is more or less important during an interval of time within the workflow execution. The specific range is frequently hard to determine and requires a domain expert to analyze partial data during execution. For example, an input data element for Activity 2 is a file that contains a large matrix of data values, composed of thousands of rows and dozens of columns. Each column contains data for an environmental condition and each row has data collected for a given instant in time. Each row can be processed in parallel, and the domain application needs to consume and produce other data files (on average, about 14 MB consumed and 6 MB produced per processed input data element).
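To make the data representation concrete, the following is a minimal sketch of how rows of such an environmental-condition matrix could be loaded as tuples of an input relation, with large raw files referenced by path rather than stored inline. The column names and file paths are invented for illustration; the real relational schema is d-Chiron's, not this one.

```python
import csv
import io

# Hypothetical input relation for the risers workflow: each matrix row
# becomes one tuple; the large raw file is referenced by its path only.
raw = io.StringIO(
    "element_id,wave_height_m,wave_freq_hz,raw_file\n"
    "1,10.0,0.5,/data/cond_0001.dat\n"
    "2,39.5,1.2,/data/cond_0002.dat\n"
)
input_relation = [
    {"element_id": int(r["element_id"]),
     "wave_height_m": float(r["wave_height_m"]),
     "wave_freq_hz": float(r["wave_freq_hz"]),
     "raw_file": r["raw_file"]}
    for r in csv.DictReader(raw)
]
```

Keeping only paths in the tuples is what makes removing a tuple cheap while still pruning the (much larger) file processing attached to it.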
After many analyses online, the user finds that, for waves > 38 m with frequency > 1 Hz, riser fatigue will never happen. Thus, within the entire matrix, any input data element that contains this specific uninteresting range does not need to be processed. Therefore, by reducing the input dataset, the overall data processed and generated are reduced and, more importantly, the overall execution time is reduced. In this paper, we use this workflow in our examples.

3. USER-STEERED WORKFLOWS

Mattoso et al. [14] analyze six aspects of computational steering in scientific workflows: interactive analysis, monitoring, human adaptation, notification, interface for interaction, and computing model. Despite their importance, the first three are essential, and we mostly focus on those in this work. In fact, human adaptation is definitely the core of computational steering. However, users will only know how to fine-tune parameters or which subset needs further focus if they can explore partial result data during a long-term execution. For this, interactive analysis and monitoring play an important role in putting the human in the loop.

Online provenance data management in SWMS is an essential asset to support all six aspects of computational steering in scientific workflows. In this section, we explain the three computational steering aspects explored in this paper.

3.1 Interactive analysis

We address two aspects of workflow data that should be interactively analyzed: (A) domain dataflow and (B) workflow execution information [14].

A. Domain dataflow. Workflows are composed of activities (scientific programs, scripts, or services) linked as a dataflow. Each activity invocation, or task, may consume input datasets and input raw data files and produce output datasets and files. These flows form the domain dataflow. To support domain dataflow interactive analysis, Chiron stores dataflow provenance data in the wf-Database during execution and makes them available for online user queries. Users can then query the wf-Database using a query interface or SQL following PROV-Wf [2], a W3C PROV-compliant data model [15] that specializes PROV entities into domain-data entities to allow for domain dataflow analysis at a finer grain than PROV.

Chiron's tuple-oriented engine first stores the input dataset as tuples in the wf-Database. In parameter sweeps, tuples are data values from the Cartesian product of input parameters. Then, each task consumes input tuples retrieved from this database, executes them, and stores the generated output tuples in the wf-Database immediately after task execution, adequately maintaining the data relationships to the input tuples. The workflow activities that generated the tuples are also stored in the wf-Database and linked accordingly. Large raw data files consumed or produced by each task are not stored in the wf-Database, but are rather linked to it, for file flow management. Thus, Chiron enables online fine-grained domain dataflow analysis [2] as well as the analysis of related domain raw data files through file flow relationships [20].

Table 1 shows some useful queries for the riser fatigue analysis workflow involving domain data and provenance dataflow analysis. The corresponding generated SQL, as well as the relational schema, are available at http://github.com/hpcdb/d-chiron. These queries reflect typical user interactions. When these workflows are executed as scripts, without Chiron's support, users look for files in their directories, open files, and try to do this analysis in an ad-hoc way, frequently writing programs to "query" these result files. Often they interrupt the execution to fine-tune input data and save execution time.

Table 1. Domain dataflow provenance interactive queries.
• Q1: What is the average of the 10 environmental conditions that are leading to the largest fatigue life value?
• Q2: What are the water craft's hull conditions that are leading to risers' curvature lower than 800?
• Q3: What are the top 5 raw data files that contain original data leading to the lowest fatigue life value?
• Q4: What are the histograms and finite element mesh files related when the computed fatigue life based on stress analysis is lower than 60?

For Queries Q1-Q4, the SWMS needs to store the history of the tuples generated in Activities 4 and 5 since the beginning of the flow, adequately linking each tuple flow in between. For example, environmental conditions (Q1) and hull conditions (Q2) are obtained in Activity 1, and stress- and curvature-related values are obtained in Activities 4 and 5, respectively. To correlate output tuples from Activity 4 or 5 to tuples from Activity 1, provenance data relationships are required.

B. Workflow execution information. Lower-level execution engine information, such as the physical location (i.e., virtual machine or cluster node) where a task is being executed, can highly benefit data analysis and debugging in HPC execution. Users may want to interactively investigate how many parallel tasks each node is running. Moreover, tasks can run domain applications that result in errors. If there were thousands of tasks in a large execution, how does one determine which tasks resulted in domain application errors and what the errors were? This also eases debugging, an important feature to be provided in large parallel executions. Furthermore, performance data analysis is very useful. Users are frequently interested in knowing how long tasks are taking. All this workflow execution information is important to analyze and can deliver much more interesting insights when linked to domain dataflow data. When execution data is stored separately from domain and provenance data, these steering queries are not possible or demand combining different tools and writing specific analysis programs [20].

To support all this, Chiron's wf-Database registers parallel workflow execution data. This means that all the execution information necessary for the parallel engine to work is linked to domain dataflow data and managed in the same database. Table 2 shows some provenance queries for the Risers Analysis workflow that link workflow execution data to domain dataflow.

Table 2. Domain data linked to performance data.
• Q5: Determine the average of each environmental condition (output of Data Gathering, Activity 1) associated to the tasks of Curvature Critical Case Selection (Activity 5) whose execution time exceeds the mean by more than 2 standard deviations.
• Q6: Determine the finite element mesh files (output of Preprocessing, Activity 2) associated to the tasks that are finishing with error status.
• Q7: List the 5 computing nodes with the greatest number of Preprocessing activity tasks that are consuming tuples containing wind speed values greater than 70 km/h.

3.2 Monitoring

Another important way to help gain insights from the data is by monitoring in a passive way. It means that users can set up some monitoring analyses and wait for the monitoring results to be generated. Results might be delivered to end-users as graphical dashboards or three-dimensional in-situ scientific data visualizations. As users gain insights from monitoring results, if the SWMS has dynamic analytical support, they can adapt previously set up monitoring configurations or add new monitoring analyses [14]. Also, from these new insights, new data exploration through interactive analysis can be done.

If the SWMS allows for provenance data analysis during workflow execution, monitoring becomes more effective, since the SWMS can exploit the continuously populated wf-Database. By doing this, all the aforementioned data provenance analyses and queries executed by users may be used by a monitoring engine.
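A query in the spirit of Q5, linking domain output to task performance data, can be sketched over a toy wf-Database fragment. The table and column names below are illustrative assumptions, not d-Chiron's actual schema (which is published in the repository cited above), and the join is simplified to each slow task's own output tuple rather than a full provenance trace back to Activity 1.

```python
import sqlite3
import statistics

# Toy fragment of a wf-Database: execution data (task) and domain data
# (output_tuple) live in the same database, so they can be joined directly.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE task (task_id INTEGER, activity TEXT, node TEXT, elapsed_s REAL);
CREATE TABLE output_tuple (task_id INTEGER, wind_speed REAL);
""")
rows = [(i, "CurvatureCriticalCaseSelection", "n1", 10.0) for i in range(1, 9)]
rows.append((9, "CurvatureCriticalCaseSelection", "n2", 100.0))  # one straggler
db.executemany("INSERT INTO task VALUES (?,?,?,?)", rows)
db.executemany("INSERT INTO output_tuple VALUES (?,?)",
               [(i, 20.0 + i) for i in range(1, 10)])

# Q5-style steering query: tasks of Activity 5 slower than
# mean + 2 population standard deviations, with their domain output.
times = [r[0] for r in db.execute(
    "SELECT elapsed_s FROM task WHERE activity = 'CurvatureCriticalCaseSelection'")]
cutoff = statistics.mean(times) + 2 * statistics.pstdev(times)
slow = db.execute(
    "SELECT t.task_id, o.wind_speed FROM task t "
    "JOIN output_tuple o ON t.task_id = o.task_id "
    "WHERE t.elapsed_s > ?", (cutoff,)).fetchall()
```

The point of the sketch is the join itself: because execution and domain data share one database, a single SQL statement answers a steering question that would otherwise require combining separate tools.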
3.3 Human adaptation

After users have analyzed partial data and gained insights from them, they may decide to adapt the workflow execution. Adaptation brings powerful abilities to end-users, putting the human in full control of scientific workflow executions. Many aspects can be adapted by humans, but very few systems support human-in-the-loop actions [14]. The human-adaptable aspects range from computing resources involved in the execution (e.g., adding or removing nodes), to checkpointing and rolling back (debugging), loop break conditions, reducing datasets, modification of filter conditions, and very specific parameter fine-tuning.

Populating the wf-Database during workflow execution can help all these aspects. In the Chiron SWMS, it has been shown that it particularly facilitates steering. For example, in [8], it was shown that it is possible to change filter conditions during execution. Also, in [7], an algebraic approach is proposed to enable steering and dynamic changes of loop conditions during execution of iterative workflows (e.g., modifying the number of iterations or loop stop conditions), and such an approach was evaluated in Chiron. These works show that these adaptations can significantly reduce overall execution time, since domain expert users are able to identify a satisfactory result before the programmed number of iterations. Prior to this work, although [7] has highlighted its advantages, no work had been developed in Chiron to tackle user-steered data reduction online.

Since provenance data is so beneficial, we consider that when a user interacts with the workflow execution, new data (user interaction data) are generated, and thus their provenance must be registered. In a long-running execution, many interactions can occur and many adaptations may be made. If the SWMS does not adequately register the provenance of interaction data, the users can easily lose track of what they have changed in the past. This is critical if the entire computational experiment takes days to finish and many specific adaptations had to occur, since it may be impossible for the users to remember in the last day of execution what they have steered in the first days. Furthermore, adding interaction data to the wf-Database enriches its content and enables future user interaction analysis. One example of how such data can be exploited is that the registered interaction data could be used by artificial intelligence algorithms for understanding interaction patterns and recommending future adaptations. For these reasons, an SWMS that enables computational steering should collect provenance of user interaction data. To the best of our knowledge, this has not been done before.

4. ONLINE DATA REDUCTION

In this section, we show our main contribution. Suppose that, after analyzing the monitoring results, a user identifies uninteresting values within the entire dataset; the subset that contains those values can be removed, hence reducing the dataset.

However, reducing a dataset to be processed has specific constraints that need to be addressed so that the execution remains in a consistent state, i.e., with valid data and with a guarantee that the execution will not crash. As previously described, Chiron implements a relational data model in a tuple-oriented algebraic approach for scientific workflows [17]. We propose to represent the input dataset as a database relation, which is a set of tuples. In the tuple-oriented approach [17], tuples represent a domain-specific dataset to be consumed or produced by a workflow activity. As examples, tuples may be composed of parameter values of a computational model, file paths to large raw data files (e.g., genomic sequences, finite element meshes, textual data, binary files), or calculated values. In the tuple-oriented approach, removing a subset of the entire dataset to be processed means removing a set of tuples to be consumed by a workflow activity. As a consequence of this removal, the tasks that would process the tuples within the removed set will not be executed, hence reducing both workflow execution time and data processing. Data processing reduction becomes more evident if the removed tuples contain paths to large raw data files that would be consumed by tasks if the tuples were not removed. Not only does this prevent the execution of tasks that would consume uninteresting data, but the non-executed tasks will also not produce more data, reducing the overall amount of data generated in a workflow execution. Furthermore, if a tuple of a given activity is removed, the following tuples forming the tuple-flow of the next linked activities will not be processed either, reducing data and, more importantly, execution time in cascade.

Addressing which specific subset will be removed is quite important. So, we first formalize the subset to be removed (Section 4.1). In Section 4.2, we describe how we implemented this in d-Chiron, which is a modified version of Chiron that manages the wf-Database in an in-memory distributed database system and is significantly more scalable than the original Chiron [21]. We also highlight that even though we implemented our solutions in d-Chiron, other SWMS could be used. The only requirement is that the SWMS engine needs to manage workflow data as datasets in a tuple-oriented approach and manage domain, provenance, and workflow execution data online in the same database.

4.1 Removing slices of input datasets

In the tuple-oriented approach, to address a subset of the dataset to be removed, we first define a slice, which is a subset of tuples to be removed according to a criterion defined by the user. Let R, with data schema ₰ = {C}, be the relation that represents a workflow activity input dataset to be reduced. {C} is the set of attributes c_j, 1 ≤ j ≤ |C|, from R, and each c_j assumes a data type (integer, string, boolean, etc.). Moreover, we split R into two subsets P and S, where P is the subset of R containing tuples that have already been processed and S is the subset of R containing tuples that will be processed. Thus, R ← P ∪ S, with P ∩ S = ∅. P and S have the same schema ₰.

Then, we define a slice § as a subset of S, which is represented as a primary horizontal fragment of S, defined by the selection relational algebraic operation σ [18]. Thus, § ← σ_F(S), where F is the selection formula to obtain the primary horizontal fragment. The formula F may either be a simple predicate (e.g., c_j = 'FATIGUE') or a minterm predicate (e.g., c_i > 38 ∧ c_j > 0.1 ∧ c_k < 1.0) [18]. Figure 2 shows a workflow example on the left: Act. 1 consumes input relation R and produces an output relation that also works as an input relation to be consumed by Act. 2, which produces the final output relation. Although in this illustration we show a data reduction in the first activity, we highlight that input data of any workflow activity can be reduced, including intermediary ones, as shown in Section 6, where input data from the second activity is reduced. The input relation R is magnified on the right of Figure 2, where we show the subsets P and S, and the slice § defined by the formula F.

Figure 2. Relation R with subsets P and S, and a slice.

Once the user has selected the slice to be removed (based on user-defined criteria), the slice can be cut off. For this, we define the operation Cut using the difference relational algebraic operator, so that Cut(R, §) ← R − §. By doing so, we ensure that only tuples from S will be removed, since § only contains tuples from S. This is necessary because only tuples that have not been processed yet (i.e., that are "ready" to be processed) can be removed. Otherwise, either the data or the workflow execution may become inconsistent. We note that our solution is applicable to reduce the input data of any workflow activity that needs to process a dataset, as long as the SWMS is aware of the data elements composing the dataset.
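The slice and Cut definitions above translate directly into set operations. The following is a minimal sketch with invented attribute names (the paper's R is generic); it only illustrates that a slice is selected from S alone, so Cut never touches already-processed tuples.

```python
from collections import namedtuple

# Hypothetical schema for R; the real attributes depend on the workflow.
Elem = namedtuple("Elem", ["id", "wave_height", "wave_freq"])

def split(R, processed_ids):
    """Split R into P (already processed) and S (still to be processed):
    R = P U S with P and S disjoint."""
    P = {t for t in R if t.id in processed_ids}
    return P, R - P

def slice_(S, F):
    """A slice is a primary horizontal fragment of S: the tuples of S
    that satisfy the user-defined selection formula F."""
    return {t for t in S if F(t)}

def cut(R, sl):
    """Cut(R, slice) <- R - slice (relational difference)."""
    return R - sl

R = {Elem(1, 10.0, 0.5), Elem(2, 40.0, 1.2), Elem(3, 39.5, 1.1)}
P, S = split(R, processed_ids={1})
# F: waves > 38 m with frequency > 1 Hz are uninteresting (Section 2).
F = lambda t: t.wave_height > 38 and t.wave_freq > 1
sl = slice_(S, F)
R_reduced = cut(R, sl)
```

Because the slice is computed over S and not over R, a tuple in P that happens to satisfy F stays in the relation, which is exactly the consistency guarantee the Cut operation encodes.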
4.2 Implementation

In this section, we describe how we implement slice removal and cut in d-Chiron. In an SWMS that implements the tuple-oriented approach and manages execution data in the wf-Database, each input tuple (or set of tuples, depending on the dataflow operator) to be consumed is related to a task. For this reason, removing tuples means removing the tasks that would consume the referenced tuples. In the PROV-Wf data model [2], which d-Chiron supports, tasks' data and input tuples are stored in different relations, with a relationship in between. Thus, to implement the set S, as defined previously, we need to semi-join [18] the input tuples from the input relation R with the tasks in READY state, in order to select only the tuples that will be processed. Then, to get the identifiers of the ready tasks π_tasks to be removed, we project over the task identifiers. Using relational algebra:

π_tasks ← Π_task_id ( σ_F(R) ⋉ σ_status='READY'(Task) ),

where the formula F is analogous to that in Section 4.1, i.e., the criterion to select the slice § to be removed. We emphasize that such verifications are necessary to guarantee consistency and are the SWMS's responsibility only, not the users'. The users only need to specify the formula F to select the slice.

To ease slice removal in d-Chiron, we developed a Steer module. With the Steer module, users can issue command lines to inform the name of the input relation R and the formula F to select the slice to be removed. Then, the module is responsible for retrieving the identifiers π_tasks of the ready tasks to be removed (analogous to the set S needed for the slice definition). Instead of physically removing the tasks from the wf-Database, we choose to mark them with the state REMOVED_BY_USER. By doing so, we enable these tasks to be later analyzed with provenance queries and to be consistently related within the table Modified_Task.

To guarantee consistency, we take advantage of d-Chiron's database system [21]. d-Chiron uses a transaction-optimized in-memory distributed database system that provides atomicity, consistency, isolation, and durability (ACID). In a data reduction, both d-Chiron's engine and the Steer module need to concurrently update a shared resource: the Task table in the wf-Database. While d-Chiron's engine updates the Task table to get tasks for execution and to mark them later as executed, the Steer component needs to update the Task table to mark the tasks with the identifiers in the π_tasks slice as removed by the user, so that the engine will not get them for execution. The wf-Database tables are distributed, making concurrency control of the Task table partitions even more complex. In d-Chiron, distributed concurrency control in the Task table is outsourced to the distributed database system, which guarantees the ACID properties [18]. Thus, the concurrency caused by the aforementioned updates is controlled by the database system, guaranteeing that both execution and data remain consistent.

To store the provenance of removed tuples, we extend the wf-Database schema with the table User_Query to store the queries that select the slice of the dataset to be removed. The description of each User_Query column is given in Table 3. We also keep track of the removed tasks in the table Modified_Task, which represents a many-to-many relationship between the User_Query and Task tables. In Section 5.2, we give the necessary extensions to the PROV-Wf data model implemented in d-Chiron to accommodate User_Query and modified tasks.

Table 3. User_Query table description.
• query_id: Auto-increment identifier.
• slice_query: Query that selects the slice of the dataset to be removed.
• tasks_query: Query generated by the SWMS to retrieve the associated ready tasks.
• issued_time: Timestamp of the user interaction.
• query_type: Field that determines how the user interacted. It could be "Removal", "Addition", and others. We currently only implemented "Removal" of tuples, but it can be extended in future work.
• user_id: Relationship with the user who issued the interaction query.
• wkfid: Maintains the relationship with the rest of the workflow execution data.
Users can set up monitoring queries (as in consistency, isolation, and durability (ACID). In a data reduction, Table 1 and Table 2) and analyze monitoring results. both d-Chiron's engine and the Steer module need to concurrently update a shared resource, the Task table in the wf- In Section 5.1, we present a formal description and describe the Database. While d-Chiron's engine updates the Task table to get implementation in Section 5.2 with the extensions to PROV-Wf to tasks for execution and to mark them later as executed, the Steer accommodate adaptive monitoring and online data reduction. 48 WORKS 2016 Workshop, Workflows in Support of Large-Scale Science, November 2016, Salt Lake City, Utah 5.1 Formal description of adaptive monitoring Similar to what we did for the Steer module, we also developed Monitoring works as follows. There is a set {𝑄} composed of the module Monitor to facilitate utilization. The Monitor should monitoring queries 𝑚𝑞! , 0 ≤ 𝑖 ≤ | 𝑄 |, each one to be executed start at any cluster node that is able to access the distributed at each 𝑑! > 0. Users do not need to specify queries at the database system and should start after the workflow execution has beginning of execution, since they do not know everything they begun, whenever users want to monitor the workflow execution. want to monitor. This is why {𝑄} starts empty. After users gain Similar to what we did for the Steer module, we also developed insights from the data, after some interactive provenance data the module Monitor to facilitate utilization. The Monitor should analyses, they can add monitoring queries to {𝑄} in an ad-hoc start at any cluster node that is able to access the distributed manner. Each 𝑑! can be adapted by users, meaning that users have database system and should start after the workflow execution has control of the time frame of each 𝑚𝑞! during execution. begun, whenever users want to monitor the workflow execution. Each 𝑚𝑞! 
execution generates a monitoring query result set A command line starts the Monitor module that runs in 𝑚𝑞𝑟!" , 𝑡 = 𝑘𝑑! | 𝑘 ∈ ℕ!! , at each time interval 𝑑! . We background. It establishes a connection with the distributed constrain that each 𝑚𝑞𝑟!" must deliver one column only. If users database system (connection settings are provided in the XML want more columns, they can write different monitoring queries configuration file). Chiron (and d-Chiron) makes use of this XML for each new column. However, the number of rows in the result file to define the workflow design, workflow general settings, and set is not limited. This means that each monitoring result set other user-defined variables. Then, the Monitor program keeps 𝑚𝑞𝑟!" should be either a scalar value or an array. querying the Monitoring_Query table at each 𝑠 to check if a To improve human-in-the-loop, the end-users have the flexibility new monitoring query was added. The default value for 𝑠 is 30s, to adapt monitoring during workflow execution. To do so, at each as the time interval to check if monitoring queries were added or instant 𝑡 after each monitoring query result 𝑚𝑞𝑟!" has been removed. However, users can customize this. After the Monitor generated, the values for 𝑑! and 𝑚𝑞! are reloaded from the wf- has started, users can add (or remove) monitoring queries to (or Database. If any change has happened, it will be considered in the from) the Monitoring_Query table. Currently, users can add next iteration 𝑡 + 𝑑! . Moreover, at each certain amount of time monitoring queries using a command line to inform the SQL during execution (also configured by the user), the system checks query to be executed at each time interval and the time interval. if the user has added new monitoring queries in {𝑄}. Our adaptive Whenever the Monitor module identifies that the user added a monitoring approach takes full advantage of the data stored online new monitoring query, it launches a new thread. 
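The loop each such thread runs (the four steps of Figure 3) can be sketched as follows; the query, store, and reload functions are stand-ins for the wf-Database accesses, and a fixed iteration count replaces the real stop condition (query removed or workflow finished).

```python
import time

def monitor_thread(run_query, store_result, reload_settings, max_iterations):
    """Minimal sketch of one Monitor thread; stand-in callables replace the
    wf-Database accesses of the real implementation."""
    for _ in range(max_iterations):
        result = run_query()          # 1. execute the monitoring query mq_i
        store_result(result)          # 2. store the result in the wf-Database
        interval = reload_settings()  # 3. reload d_i / mq_i (user may have adapted them)
        time.sleep(interval)          # 4. wait d_i seconds

stored = []
monitor_thread(run_query=lambda: 68.2,       # e.g., an average fatigue life value
               store_result=stored.append,
               reload_settings=lambda: 0.0,  # zero interval just for this demo
               max_iterations=3)
# stored == [68.2, 68.2, 68.2]
```

Because the interval is re-read inside the loop (step 3), a user who shrinks d_i in the wf-Database sees the change take effect at the very next iteration, which is the adaptive behavior described above.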
More importantly, our adaptive monitoring approach enables users to dynamically steer the monitoring settings (including which data will be monitored and how), highly benefiting them in finding uninteresting subsets to be removed.

5.2 Implementation

To implement our approach, we first need to extend the wf-Database schema. To store {Q}, we add the table Monitoring_Query, shown in Table 4.

Table 4. Monitoring_Query table description.
- monitoring_id: auto-increment identifier.
- interval: interval time (in seconds) between each execution of the monitoring query (d_i).
- monitoring_query: raw SQL query to be executed.
- wkfid: relationship between the monitoring queries and the current execution of the workflow; in d-Chiron's wf-Database, there may be data from past executions of a same workflow.

Each thread is responsible for executing one monitoring query in Monitoring_Query at its defined time interval. A thread is finished when its monitoring query is removed or when the workflow stops executing (in that case, all threads are finished). Figure 3 shows the steps executed at each time interval:

1. Execute the monitoring query mq_i.
2. Store the query results in the wf-Database.
3. Reload all information for mq_i from the wf-Database for the next iteration. The user could have adapted any of this information.
4. Wait for d_i seconds.

Figure 3. Steps executed by each thread within a time interval.

To enable all these monitoring capabilities and human adaptation, three of these steps issue queries to the wf-Database, including reads and writes. The stored results can be further analyzed a posteriori or, more interestingly, used as input for runtime data visualization tools, since the results are made available immediately after they are generated.
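Step 2 and this immediate availability can be illustrated with SQLite (d-Chiron actually uses MySQL Cluster; the column names follow the Monitoring_Query_Result table of the wf-Database schema, while the SQLite setup itself is only illustrative):

```python
import sqlite3

# Illustrative stand-in for the wf-Database.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE monitoring_query_result (
                  monitoring_result_id INTEGER PRIMARY KEY AUTOINCREMENT,
                  monitoring_id        INTEGER,
                  monitoring_values    TEXT,
                  result_type          TEXT)""")

# A monitor thread stores mqr_i,t as soon as it is produced...
db.execute("INSERT INTO monitoring_query_result "
           "(monitoring_id, monitoring_values, result_type) VALUES (?, ?, ?)",
           (1, "68.2", "Double"))

# ...so users (or a dashboard) can query it immediately afterwards.
rows = list(db.execute("SELECT monitoring_values, result_type "
                       "FROM monitoring_query_result WHERE monitoring_id = 1"))
# rows == [('68.2', 'Double')]
```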
The main advantage of storing the monitoring results in the wf-Database (and adequately linking them with the remainder of the data already stored in this database) whenever a monitoring query is executed is that users are able to query the results immediately after their generation. The wf-Database can also serve as a data source for data visualization applications. To store monitoring results in the wf-Database, we add another table, Monitoring_Query_Result, shown in Table 5.

Table 5. Monitoring_Query_Result table description.
- monitoring_result_id: auto-increment identifier.
- monitoring_id: relationship with the monitoring query that generated this result.
- monitoring_values: results of the monitoring_query.
- result_type: data type of the result values. Currently, "Integer", "Double", "Array[Integer]", and "Array[Double]" are supported.

Another contribution of this paper is that we add three concepts to PROV-Wf [2], which is W3C PROV-compliant [15]. Our main motivations to adhere to the W3C PROV recommendations are to help with query specification, to maintain compatibility between different SWMSs, and to facilitate interoperability between different databases. These concepts are UserQuery, MonitoringQuery, and MonitoringResult, as in Figure 4. Using PROV nomenclature, UserQuery is a PROV Activity that stores the user queries that remove sets of tuples and thus influence the state of the associated tasks (i.e., remove them). MonitoringQuery is a PROV Activity that contains the monitoring queries submitted by the user at specific time intervals. The monitoring queries generate the PROV Entity MonitoringResult, which stores the query results.

6. EXPERIMENTAL VALIDATION
In this section, we validate our solution (for online data reduction and adaptive monitoring) based on real data. In Section 6.1, we show the experimental setup; Section 6.2 shows a test case where an expert monitors the execution and removes slices of the dataset; in Section 6.3, we analyze the added overhead.

6.1 Experimental setup

Scientific workflow. As a proof of concept for this work, we use a synthetic parameter sweep workflow of the Riser Fatigue Analysis example (see Figure 1), which is based on a real case study. The workflow manipulates approximately 300 GB of raw data. In all executions, we use the same dataset, which spans approximately 12,000 data elements to be processed in parallel. Depending on the workflow activity, tasks may take a few seconds (e.g., Activity 7) or up to one minute on average (e.g., Activity 3).

Software. In all executions, we use d-Chiron [21], which uses MySQL Cluster 7.4.9 as its in-memory distributed database system to manage the wf-Database. The code to run d-Chiron and the setup files are available at github.com/hpcdb/d-chiron.

Hardware. The experiments were conducted on Grid5000, using a cluster with 39 nodes containing 24 cores each (936 cores). Every node has two AMD Opteron 1.7 GHz 12-core processors, 48GB RAM, and 250GB of local disk. All nodes are connected via Gigabit Ethernet and access a shared storage of 10TB.

6.2 Test case

Let us consider the following scenario. Peter is an offshore engineer, expert in riser analysis, who has learned how to set up monitoring, analyze d-Chiron's wf-Database, and use the Steer module developed in this work. In Peter's project, the Design Fatigue Factor is set to 3 and the service life is set to 20 years, meaning that the fatigue life must be at least 60 years (see Section 2).

In the workflow (Figure 1), the final value of fatigue life is calculated in Activity 6, but the input values are obtained as output of Activity 1, gathered from raw input files. Keeping provenance is essential to associate data from Activity 1 with data from Activity 6. To understand which input values are leading to high fatigue life values, Peter monitors the generated data online. For simplicity, we consider wind speed, which is only one out of the many environmental condition parameter values captured by Activity 1 to serve as input for Activity 2. Peter knows that wind speed has a strong correlation with fatigue life in risers. He expects that with low-speed winds, there is a lower risk of accident.

When the workflow execution starts, the Monitor module is initialized. Then, Peter adds two monitoring queries: mq_1 shows the average of the 10 greatest values of fatigue life calculated in the last 30s of workflow execution, setting d_1 = 30s; and mq_2 shows the average wind speed associated with the 10 greatest values of fatigue life calculated in the last 30s, also setting the query interval d_2 = 30s. We recall from Table 1 that mq_1 is similar to Q1, but considering only data processed in the last 30s. The mq_1 and mq_2 queries are added to the Monitoring_Query table.

Peter monitors the results using the Monitoring_Result table. These results can be a data source for a visualization that plots dashboards dynamically, refreshed according to the query intervals. After gaining insights from the results and understanding patterns, he can start removing the undesired values of wind speed. The monitoring query results mqr_1,t and mqr_2,t for the two queries listed previously, as well as the moments when the user reduced the data, are plotted along the workflow elapsed time in Figure 5, which presents mqr_1,t as a full black line with square markers and mqr_2,t as a full gray line with triangle markers. These markers determine when the monitoring occurred. The workflow execution starts at t = 0.
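The shape of Peter's two monitoring queries can be sketched in SQLite (a stand-in for the wf-Database; the table and column names here are assumptions, not d-Chiron's actual schema): the average of the 10 greatest fatigue-life values produced in the last 30 s, and the average wind speed associated with those same values.

```python
import sqlite3

# Illustrative output relation of Activity 6, with a completion timestamp.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE o_fatigue (fatigue_life REAL, wind_speed REAL, finished_at REAL)")
db.executemany("INSERT INTO o_fatigue VALUES (?, ?, ?)",
               [(60 + i, 10 + i, 100 + i) for i in range(40)])

now = 139  # pretend current elapsed time, in seconds
windowed_top10 = """
    SELECT fatigue_life, wind_speed FROM o_fatigue
    WHERE finished_at >= ? - 30          -- only data from the last 30 s
    ORDER BY fatigue_life DESC LIMIT 10  -- the 10 greatest fatigue-life values
"""
# mq_1: average of the top-10 fatigue-life values in the window
mq1 = db.execute(f"SELECT AVG(fatigue_life) FROM ({windowed_top10})", (now,)).fetchone()[0]
# mq_2: average wind speed associated with those same top-10 values
mq2 = db.execute(f"SELECT AVG(wind_speed) FROM ({windowed_top10})", (now,)).fetchone()[0]
```

Re-running these two statements every d_1 = d_2 = 30 s is exactly what the two monitor threads do; only the window endpoint `now` changes between iterations.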
Peter is only interested in analyzing risers with low fatigue life values, because they are critical and might need repair or replacement. During the workflow execution, it would be interesting if Peter could inform the SWMS which input values would lead to a low risk of fatigue, so that they could be removed. However, this is not simple, because it is hard to determine the specific range of values (i.e., the slice to be removed). For this, Peter first needs to understand the pattern of input values associated with low risk of fatigue life values.

Figure 4. Extended PROV-Wf data model to accommodate modified tasks and monitoring.

Only after approximately 150s do the first output results from Activity 6 start to be generated. From the first results, at t = 150 and t = 180, Peter checks that when the wind speed is less than 16 Km/h (see the horizontal dashed line at wind speed = 16 in Figure 5), the results lead to the largest fatigue life values. Since risers with large fatigue life values are not interesting in this analysis, he decides, at t = 190, to remove all input data elements that contain wind speed less than 16 Km/h. For this, Peter's first user query q_1 is issued with a command line to the Steer module. User queries are represented with circles on the horizontal axis (elapsed time) in Figure 5. The exact time a user issued an interaction query is stored in the User_Query table.

The next marker after q_1 happens at t = 210. Comparing with the previous monitoring mark, at t = 180, we can observe that this steering (q_1) increased the minimum wind speed value to be considered from 14.2 Km/h to 24.1 Km/h. Also, we observe a significant decrease in the slope of the largest values of fatigue life (10.6% lower). This means that the removal of the input data containing wind speed less than 16 Km/h made the SWMS not process data containing low wind speed values, which would lead to larger fatigue life results.

Then, monitoring continues, but that slope decrease calls Peter's attention. To obtain finer detail of what is happening, he decides to adjust the monitoring interval times (d_1 and d_2) at runtime, reducing them to 10s to get monitoring feedback more frequently. We can observe that, for both lines mqr_1,t and mqr_2,t, the markers become more frequent during t = [220, 270], because a monitoring result is registered every 10s. We highlight that, although in this test case we only show monitoring correlating wind speed and fatigue life, other monitoring correlations could also be analyzed, and users can add, remove, or adjust monitoring queries at any time during execution.

We store each interaction in the User_Query table and map (in the table Modified_Task) its rows to rows in the Task table, to consistently keep the provenance of which tasks were modified (in this case, removed) by each specific user steering. Keeping this provenance of user steering helps analyzing how specific interactions impacted the results. Figure 5 shows that some specific interactions imply significant changes in the lines' slopes. Queries on the wf-Database can show finer details about how many tuples each user interaction made the SWMS not process, as shown in Table 6. Each issued time follows Figure 5 and is registered with the timestamp relative to when the first activity started.

Table 6. Provenance of slices removed by the user.
Interaction | Issued time (s) | Slice query | Removed data elements
q_1 | 190 | wind_speed < 16 | 623
q_2 | 310 | wind_speed < 25 | 373
q_3 | 370 | wind_speed < 30 | 355
q_4 | 430 | wind_speed < 34.5 | 115
q_5 | 520 | wind_speed < 35.5 | 3
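The aggregate numbers reported for this test case can be cross-checked directly against Table 6 and the raw-data figures in the text:

```python
# Removed data elements per user query, copied from Table 6.
removed_per_query = {"q1": 623, "q2": 373, "q3": 355, "q4": 115, "q5": 3}

total_removed = sum(removed_per_query.values())
# 623 + 373 + 355 + 115 + 3 = 1469 input data elements, matching the text

# Raw data processed without and with steering (GB), as reported in the text.
raw_no_steering_gb, raw_steering_gb = 300, 258
data_reduction = (raw_no_steering_gb - raw_steering_gb) / raw_no_steering_gb
# (300 - 258) / 300 = 0.14, i.e., the reported 14% reduction in raw data
```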
After verifying that the results are reasonable, Peter decides to increase the monitoring query intervals of both queries back to 30s after t = 270. He then observes that, since q_1, wind speeds less than 25 Km/h are leading to large fatigue life values. Then, at t = 310, he calls Steer again to issue q_2, which removes the input data with wind speed < 25 Km/h. The next markers after q_2 show that this steering made the wind speed value associated with large fatigue life be at least 30.5 Km/h, with a decrease of 6.5% in the large fatigue life values between t = 300 and t = 330. Similarly, Peter continues to monitor and steer the execution. He issues q_3 at t = 370 to remove input data with wind speed < 30.5 Km/h, yielding a decrease of 4.9% in large fatigue life values (comparing the fatigue life at t = 360 and t = 390). Then, he issues q_4 at t = 430 to remove input data with wind speed < 34.5 Km/h, attaining a decrease of 1.7% in large fatigue life values (comparing the fatigue life at t = 420 and t = 450). Despite this small decrease, he decides at t = 520 to further remove data, now with wind speed < 35.5 Km/h. However, no decrease greater than 1% in the large fatigue life values was registered after this last steering. Thus, he keeps analyzing the monitoring results, but does not remove input data anymore until the end of the execution.

Finally, we run the exact same workflow and input datasets, but with no monitoring or interactions, to compare how such slice removals help decrease the overall execution time. The workflow with no interaction processes all input data, including the data containing wind speed values that lead to risers with low risk of fatigue, which are not considered in Peter's analyses. In total, Peter's steering yields the removal of 1469 input data elements (out of approximately 12,000). This reduces the execution time for this test case by 37% compared with no steering. Furthermore, these removed input data would make the workflow process and generate more raw data files if the input data elements were not removed. By querying the wf-Database at the end of the execution, we found that the execution with no user steering processed approximately 300GB of raw data files, whereas with steering the total was 258GB, representing a 14% data reduction.

Figure 5. Use case plot to analyze the impact of user steering, comparing wind speed (input, in Km/h) with fatigue life (output, in years) along the workflow elapsed time (in seconds).

6.3 Analyzing monitoring overhead

A monitoring query mq_i in {Q} is run by a thread every d_i seconds. Depending on the number of threads (|{Q}|) and on the interval d_i, there may be too many concurrent accesses to the wf-Database, which may add overhead. The goal of this experiment is to analyze such overhead. We set up the Monitor module to run queries that are variations of the queries Q1–Q7 presented in Table 1 and Table 2. For example, in Q2, we vary the curvature value. We also modify the queries to calculate only the results over the last d seconds, at every d seconds.

To evaluate the overhead, we measure the execution time without monitoring and then with monitoring, varying the number of queries |{Q}| and the interval d, which is the same for all queries in {Q} in this experiment. The experiments were repeated until the standard deviation of the workflow elapsed times was less than 1%; the results are the averages of these times within the 1% margin. Figure 6 shows the results, where the gray portion represents the workflow execution time when no monitoring is used, and the black portion represents the difference between the workflow execution times with and without monitoring (i.e., the monitoring overhead).

From these results, we observe that when the interval d is equal to 30s, the overhead is negligible. When the interval is 1s, the overhead is higher when the number of monitoring threads is greater. This happens because three queries are executed in each time interval (see Figure 3), for each thread. In the scenarios with 30 threads, there will be 120 queries in a single time interval d. In that case, if d is small (e.g., d = 1), there are 120 queries being executed per second, just for the monitoring. The database queried by the monitors is also frequently queried by the SWMS engine, thus adding higher overhead. However, even in this specific scenario with higher overhead (|{Q}| = 30 and d = 1), the execution is only 33s, or 3.19%, slower than when no monitoring is used. Most real monitoring cases do not need such frequent (every second) updates. If 30s is frequent enough for the user, there might be no added overhead, as in this test case.

We also evaluated the same scenarios without storing the monitoring results in the wf-Database, appending them to CSV files instead, which is simpler. The results are nearly the same as in Figure 6. This suggests storing all monitoring results in the wf-Database at runtime, which enables users to submit powerful queries over the results as they are generated, together with all other provenance data; this would not be possible with a solution that appends data to CSV files.

7. RELATED WORK

Considering our contributions, we discuss SWMSs with parallel capabilities with respect to human adaptation (especially data reduction), online provenance support, and monitoring features.

Although online human adaptation is the core of computational steering, there are few parallel SWMSs [11][12][19] that support it. These solutions have monitoring services and are highly scalable, but do not allow for online data reduction as a means to reduce the overall execution time. WorkWays [16] is a powerful science gateway that enables users to steer and dynamically reduce the data being processed online, by dimension reduction or by reducing the range of some parameters, sharing similar motivations with our work. It uses Nimrod/K as its underlying parallel workflow engine, which is an extension of the Kepler workflow system [1]. WorkWays presents several tools for user interaction in human-in-the-loop workflows, such as graphical user interfaces, data visualization, and interoperability, among others. However, WorkWays does not provide provenance representation, and users do not have query access to simulation data, execution data, metadata, and provenance, all related in a database, which limits the power of online computational steering. For example, it prevents ad-hoc data analysis using both domain and workflow execution data, such as the analyses presented in Table 1 and Table 2, which support the user in defining which slice of the dataset should be removed. In contrast, our work uses a robust in-memory distributed database system to manage and relate the analytical data involved in the workflow execution. Moreover, the lack of provenance data support in WorkWays, either online or post-mortem, does not support reproducibility and prevents registering user adaptations, missing opportunities to determine in detail how specific user interactions influenced the workflow results.

Another notable SWMS example is WINGS/Pegasus [9], which especially focuses on assisting users in automatic data discovery. It helps generating and executing multiple combinations of workflows based on user constraints, selecting appropriate input data, and eliminating workflows that are not viable. However, it differs from our solution in the sense that it tries to explore multiple workflows until finding the most suitable one, whereas we often model our experiments as one single scientific workflow to be processed. Also, it does not aim at putting users in the loop to actively eliminate subsets of an input dataset, especially based on extensive ad-hoc intermediary data analysis online. Additionally, as in WorkWays, provenance data is not collected online, nor is it integrated with domain-specific and execution data for enhanced analysis.

While human adaptation is less explored in parallel SWMSs, monitoring is widely supported in several existing SWMSs [13][14]. For example, Pegasus [5] and Triana may be integrated with analytical tools such as Stampede [10][22], which provides a framework to monitor workflow executions and has rich capabilities for online performance monitoring, troubleshooting, and debugging. However, in these solutions, it is not possible to monitor workflow execution data associating it with provenance and domain data, as we do using queries to the wf-Database. To the best of our knowledge, there is no related work that allows for online data reduction based on rich analytical support, with adaptive monitoring and provenance registration of human adaptations in scientific workflows. These features allow for performance improvements of scientific workflows, while keeping data reduction consistent and enabling provenance queries that can show the history of human-in-the-loop actions and results.

8. CONCLUSION

This work contributes to putting the human in the loop of scientific workflow systems, especially when users can actively steer and reduce data online to improve performance. As a solution to the input data reduction problem, we made use of a tuple-oriented algebraic approach that organizes the workflow data to be processed as sets of tuples stored in a wf-Database, managed by an in-memory distributed database system at runtime.
Figure 6. Results of adaptive monitoring overhead: execution time (min), split into execution time without monitoring and monitoring overhead, for the combinations of d = 1 or d = 30 with |{Q}| = 3 or |{Q}| = 30.

We developed a mechanism coupled to d-Chiron, a distributed version of the Chiron SWMS, which allows for reducing data while maintaining both data integrity and execution consistency. A major challenge in the data reduction problem is to determine which subset of the data should be removed. As a solution to this, we proposed an adaptive monitoring approach that aids users in analyzing partial result data at runtime. Based on the evaluation of the input data elements and their corresponding results, the user may find which subset of the input data is not interesting for a particular execution and hence can be removed. The adaptive monitoring allows users not only to follow the evolution of the workflow, but also to dynamically adjust monitoring aspects during execution. We extended our previous workflow provenance data model to represent the provenance of the users' online data reduction actions and of the monitoring results. Although we implemented our solution in d-Chiron, other SWMSs could be used if provenance, execution, and domain dataflow data are managed in a database at runtime.

To validate our solution, we executed a data-intensive parameter sweep workflow based on a real case study from the oil and gas industry, running on a 936-core cluster. A test case demonstrated how the user can monitor the execution, dynamically adapt the monitoring settings, and especially remove uninteresting data to be processed, all during execution. Results for this test case show that the user interactions reduced the execution time by 37% compared with the execution that processed the whole dataset. Although the test case was from the oil and gas domain, any other workflow application could have been used, as long as a domain expert can tell which slice is not interesting and can be removed with no harm to the final results.

To the best of our knowledge, this is the first work that explores user-steered online data reduction in scientific workflows, steered by ad-hoc queries and adaptive monitoring, while maintaining the provenance of user interactions. The results motivate us to extend our solution and explore different aspects that can be adapted by humans based on sophisticated workflow data analysis support. Our solution is currently dependent on the domain expert's knowledge to identify correlations between input and output data to determine which subsets are uninteresting. We plan to address in-situ data visualization based on the adaptive monitoring and interactive query results, and to develop recommendation models that suggest correlations based on the history stored in the wf-Database. Other future work includes: enabling users to set priorities for different slices of the data, so that the SWMS will process critical slices first; and improving the usability of the system by developing intuitive user interfaces to decrease the learning curve, especially for the query interface, to take full advantage of the wf-Database. We also plan to expand our experiments and analyze how reducing each specific type of data (relation tuples, raw data files not processed and not generated) impacts the final results.

9. ACKNOWLEDGMENTS

This work was partially funded by CNPq, FAPERJ and Inria (MUSIC project), EU H2020 Programme and MCTI/RNP-Brazil (HPC4E grant no. 689772), and performed (for P. Valduriez) in the context of the Computational Biology Institute (www.ibc-montpellier.fr). The experiments were carried out using the Grid'5000 testbed (https://www.grid5000.fr).

10. REFERENCES

[1] Abramson, D., Enticott, C., Altinas, I. Nimrod/K: towards massively parallel dynamic grid workflows. Supercomputing, 24:1–24:11, 2008.
[2] Costa, F., Silva, V., de Oliveira, D., Ocaña, K., Ogasawara, E., Dias, J., Mattoso, M. Capturing and querying workflow runtime provenance with PROV: a practical approach. EDBT Workshops, 282–289, 2013.
[3] Davidson, S.B., Freire, J. Provenance and scientific workflows: challenges and opportunities. SIGMOD, 1345–1350, 2008.
[4] Deelman, E., Gannon, D., Shields, M., Taylor, I. Workflows and e-Science: an overview of workflow system features and capabilities. FGCS, 25(5):528–540, 2009.
[5] Deelman, E., Vahi, K., Juve, G., Rynge, M., Callaghan, S., Maechling, P.J., Mayani, R., Chen, W., Ferreira da Silva, R., et al. Pegasus, a workflow management system for science automation. FGCS, 46(C):17–35, 2015.
[6] Det Norske Veritas. Recommended practice: riser fatigue. DNV-RP-F204, 2010.
[7] Dias, J., Guerra, G., Rochinha, F., Coutinho, A.L.G.A., Valduriez, P., Mattoso, M. Data-centric iteration in dynamic workflows. FGCS, 46(C):114–126, 2015.
[8] Dias, J., Ogasawara, E., Oliveira, D., Porto, F., Coutinho, A.L.G.A., Mattoso, M. Supporting dynamic parameter sweep in adaptive and user-steered workflow. WORKS, 31–36, 2011.
[9] Gil, Y., Ratnakar, V., Kim, J., Gonzalez-Calero, P., Groth, P., Moody, J., Deelman, E. Wings: intelligent workflow-based design of computational experiments. Intelligent Systems, 26(1):62–72, 2011.
[10] Gunter, D., Deelman, E., Samak, T., et al. Online workflow management and performance analysis with Stampede. CNSM, 152–161, 2011.
[11] Jain, A., Ong, S.P., Chen, W., et al. FireWorks: a dynamic workflow system designed for high-throughput applications. CCPE, 27(17):5037–5059, 2015.
[12] Lee, K., Paton, N.W., Sakellariou, R., Fernandes, A.A.A. Utility functions for adaptively executing concurrent workflows. CCPE, 23(6):646–666, 2011.
[13] Mandal, A., Ruth, P., Baldin, I., et al. Toward an end-to-end framework for modeling, monitoring and anomaly detection for scientific workflows. IPDPSW, 1370–1379, 2016.
[14] Mattoso, M., Dias, J., Ocaña, K.A.C.S., Ogasawara, E., Costa, F., Horta, F., Silva, V., de Oliveira, D. Dynamic steering of HPC scientific workflows: a survey. FGCS, 46:100–113, 2015.
[15] Moreau, L., Missier, P. PROV-DM: the PROV data model. http://www.w3.org/TR/prov-dm, 2013. Accessed: 1 Aug 2016.
[16] Nguyen, H.A., Abramson, D., Kiporous, T., Janke, A., Galloway, G. WorkWays: interacting with scientific workflows. Gateway Computing Environments Workshop, 21–24, 2014.
[17] Ogasawara, E., Dias, J., Oliveira, D., Porto, F., Valduriez, P., Mattoso, M. An algebraic approach for data-centric scientific workflows. PVLDB, 4(12):1328–1339, 2011.
[18] Özsu, M.T., Valduriez, P. Principles of Distributed Database Systems. 3rd ed. New York, Springer, 2011.
[19] Reuillon, R., Leclaire, M., Rey-Coyrehourcq, S. OpenMOLE, a workflow engine specifically tailored for the distributed exploration of simulation models. FGCS, 29(8):1981–1990, 2013.
[20] Silva, V., de Oliveira, D., Valduriez, P., Mattoso, M. Analyzing related raw data files through dataflows. CCPE, 28:2528–2545, 2015.
[21] Souza, R., Silva, V., Oliveira, D., Valduriez, P., Lima, A.A.B., Mattoso, M. Parallel execution of workflows driven by a distributed database management system. Poster, Supercomputing, 2015.
[22] Vahi, K., Harvey, I., Samak, T., et al. A case study into using common real-time workflow monitoring infrastructure for scientific workflows. J. Grid Comput., 11(3):381–406, 2013.