Linking provenance with system logs: a context-aware information integration and exploration framework for analyzing workflow execution

Elias el Khaldi Ahanach, Spiros Koulouzis, Zhiming Zhao
elias.el.khaldi@gmail.com, {S.Koulouzis|Z.Zhao}@uva.nl
Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract—When executing scientific workflows in a distributed environment, anomalies in workflow behavior are often caused by a mixture of different issues, e.g., careless design of the workflow logic, buggy workflow components, unexpected performance bottlenecks, or resource failures in the underlying infrastructure. Provenance information only describes data evolution at the workflow level and has no explicit connection with the system logs provided by the underlying infrastructure. Analyzing provenance information together with the relevant system metrics requires expertise and a considerable amount of manual effort. Moreover, it is often time-consuming to aggregate this information and to correlate events occurring at different levels of the infrastructure. In this paper, we propose an architecture that automates the integration of workflow provenance information with the performance information collected from the infrastructure nodes running the workflow tasks. Our architecture enables workflow developers or domain scientists to effectively browse workflow execution information together with the system metrics, and to analyze contextual information for possible anomalies.

I. INTRODUCTION

In the last decades, researchers have been using sophisticated research support environments for efficient data discovery, experiment management, and workflow composition and execution. These research support environments typically include:

1) e-Infrastructures, e.g., EGI (http://www.egi.eu/) and EUDAT (http://www.eudat.eu/), which focus on the management of the service lifecycle of computing, storage and network resources, and provide services to research communities or other user groups to provision dedicated infrastructure and to manage persistent services and their underlying storage, data processing and networking requirements.
2) Research Infrastructures (RIs), which are facilities, resources, and services constructed for specific scientific communities to conduct research. They can include scientific equipment and knowledge-based resources such as collections, archives or scientific data. Examples of RIs include the Integrated Carbon Observation System (ICOS, https://www.icos-ri.eu/) for carbon monitoring in atmosphere, ecosystems and marine environments, the European Plate Observing System (EPOS, https://www.epos-ip.org/) for solid earth science, and Euro-Argo (http://www.euro-argo.eu/) for collecting environmental observations from large-scale deployments of robotic floats in the world's oceans.
3) Virtual Research Environments (VREs), which provide user-centric support for discovering and selecting data and software services from different sources, and for composing and executing application workflows [1], [2]; they are also referred to as Virtual Laboratories or Science Gateways [3].

Although the roles and functions of these different kinds of environments may substantially overlap, we can distinguish that e-Infrastructures focus on generic Information and Communications Technology (ICT) resources (e.g., computing or networking), RIs manage data and services focused on specific scientific domains, and VREs support the lifecycle of specific research activities. Although the boundaries between these environments are not always entirely clear (they often share services for infrastructure and data management [4]), collectively they represent an important trend in many international research and development projects.

Research support environments combine multiple resources, including federated clouds and repositories, that allow the sharing of service-oriented architecture (SOA) based scientific workflows [5]. These environments also offer the means to store provenance data concerning the execution of scientific workflows.
When performing an experiment using a Workflow Management System (WFMS) in a VRE, different types of contextual information can be collected:

• Provenance information provided by the workflow system.
• Application logs monitored by the platform (e.g., by Apache Tomcat or the Java virtual machine).
• System logs collected by the infrastructure monitoring systems.

Provenance (PROV, https://www.w3.org/TR/prov-overview/) is a typical model often used for workflow provenance. It models the causality of workflow events using the concepts of agents, entities, and activities involved in data evolution [6]. It has been used in workflow systems such as Apache Taverna to export information on workflow executions. Provenance can be stored using the XML or RDF (https://www.w3.org/RDF/) standards, and many tools are available for parsing and querying the data.
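As an illustration of how such provenance can be queried, the sketch below uses Python with rdflib to extract the start and end time of every activity from a PROV document serialized as RDF (Turtle). This is a minimal sketch under stated assumptions, not part of our prototype: the file name is a placeholder, and a concrete WFMS export may require a different serialization or parsing route.

```python
# Minimal sketch: extract activity time ranges from a PROV document
# serialized as RDF/Turtle. The file name is a placeholder.
from rdflib import Graph

PROV_QUERY = """
PREFIX prov: <http://www.w3.org/ns/prov#>
SELECT ?activity ?start ?end WHERE {
    ?activity a prov:Activity ;
              prov:startedAtTime ?start ;
              prov:endedAtTime   ?end .
}
ORDER BY ?start
"""

g = Graph()
g.parse("workflow-prov.ttl", format="turtle")  # placeholder file name

for activity, start, end in g.query(PROV_QUERY):
    # One row per workflow activity: exactly the time-range information
    # that later allows provenance to be aligned with system metrics.
    print(f"{activity}  {start} -> {end}")
```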
At the same time, monitoring systems provide metrics about the usage of e-Infrastructure resources, e.g., CPU usage, memory consumption, and network traffic. Those metrics can be useful for workflow developers to investigate application behavior at the level of the low-level resources, e.g., to locate workflow failures caused by an underlying infrastructure resource. However, provenance and system metrics differ in the scope of the information they cover and are provided by different sources, which makes an integrated analysis difficult and time-consuming.

In this paper, we propose a context-aware information integration and exploration framework that allows users to effectively investigate possible workflow execution bottlenecks by combining provenance with the system logs. We discuss an architecture that allows scientists or service developers to analyze the execution of service-based scientific workflows and to visualize possible bottlenecks related to the infrastructure by getting a detailed view of the resource usage of each workflow task. By taking advantage of this information, infrastructure administrators, service developers or scaling controllers may reconfigure the provisioned virtualized infrastructure.

The rest of the paper is organized as follows: in Section II we discuss the motivation for our solution. Section III presents an overview of related work. Our proposed system architecture is given in Section IV, while Sections V and VI are devoted to the assessment of our proposed solution. The paper concludes with Section VII.

II. MOTIVATION

By abstracting the application logic of the steps or processes in an experiment, a WFMS allows scientists to efficiently construct, execute and validate complex applications at a high level [7].

The adoption of cloud technologies, together with the increasing popularity of containerization and DevOps practices, has made the development, deployment, and monitoring of web services faster and more efficient.

Scientific workflows usually interact with multiple web services which are often hosted in different cloud infrastructures, and they may be composed of numerous tasks [8]. When executing such complex workflows, it is often hard to detect the underlying cause of execution bottlenecks related to the performance of the infrastructure. If we break the infrastructure down into multiple abstraction levels, we can see that workflows are created and executed at the highest level, while resource usage is measured at the lowest levels [9]. As a result, the measured resource metrics are often unreachable and obscured for the user. Although the use of cloud technologies provides scientists with a dedicated infrastructure which combines data and computation [10], along with sophisticated monitoring tools, it is very difficult for a scientist or service developer to discover which Virtual Machine (VM) or container is responsible for execution bottlenecks or failed workflow executions.

To bridge the gap between the workflow abstraction level and the dedicated infrastructure, we set the following objectives:

1) Analyse the execution time of service-based scientific workflows.
2) Detect bottlenecks that cause workflow performance degradation.
3) Detect the cause of those bottlenecks.
4) Present the analysis to a user.
5) Make resource scaling suggestions that can be used by scaling controllers.

III. RELATED WORK

In the past, there have been many efforts to detect the sources of performance loss, or to detect anomalies in the infrastructure, while executing scientific workflows.

In [11] the authors propose an approach that aims to help workflow end-users and middleware developers understand the sources of performance loss when executing scientific workflows in Grid environments. To achieve that, they propose a model for estimating the ideal (lowest) execution time of a workflow and then calculate the total overhead as the difference between the workflow's measured Grid execution time and its ideal time. This work is focused on Grid environments and depends on the presence of a specific scheduler, GRAM [12], to collect the state of each workflow task submitted to a Grid site. Moreover, in Grid environments the performance of the node that is assigned to execute a workflow task is more or less stable: once a workflow task is scheduled on a node, it has exclusive use of its resources.

Also focusing on Grid environments, the authors of [13] presented a simple approach for the autonomous detection and handling of operational incidents in workflow activities. In this work, the authors attempt to classify the state of each task in a workflow and apply the appropriate rule in case of an error. Issues like resource scaling are not addressed, since Grid environments assign tasks to a static resource. In [14] the authors gathered eight months of workflow activity from the e-BioInfra platform and present an analysis of task failures in their e-Infrastructure. Although this work provides some useful insight into the behavior of workflow tasks, it does not connect the higher-level workflow tasks with the underlying usage of resources.

The work presented in [15] proposes an online mechanism for detecting anomalies while executing scientific workflows on networked clouds. The authors use an integrated framework to collect online monitoring time-series data from the workflow tasks and the infrastructure. This approach is tightly coupled with the Pegasus WFMS, which depends on specific worker nodes to execute workflow tasks.

In summary, existing solutions are either outdated or depend on specific task schedulers or WFMSs. Although scientific workflows traditionally preserve provenance information, this information is generally not used together with information collected from the infrastructure nodes to analyze contextual information for possible anomalies.
IV. ARCHITECTURE

Fig. 1. The CWEA is made of four main components: 1) the WCDR for querying provenance data and parsing workflows, 2) the RCDR for querying performance data, 3) the WFEA for implementing diagnosis algorithms, and 4) the interactive GUI for presenting the results.

Our architecture, named the Cross-context Workflow Execution Analyzer (CWEA), enables workflow developers or domain scientists to effectively browse workflow execution information together with the system metrics, and to analyze contextual information for possible anomalies. To achieve that, it relies on the following components:

• Workflow Context Data Retriever (WCDR): this component queries the provenance data generated by the WFMS to extract the name, start time, and end time of each web service call described in the workflow. The WCDR also parses the workflow itself to extract the endpoints of the web services used and the type of call made (e.g., GET, POST, etc.).
• Resource Context Data Retriever (RCDR): this component uses the service endpoints obtained by the WCDR to query the corresponding hosts and retrieve the available performance data within each web service's call time range. The performance data typically include CPU utilization, memory and network usage.
• Workflow Execution Analyzer (WFEA): this component abstracts a set of diagnostic algorithms that analyze various performance metrics to find correlations between workflow execution and resource usage. Currently, we have implemented an algorithm that identifies the most time-consuming web services within the context of a workflow execution (a sketch of such a ranking is given after this list). The findings of the WFEA are presented in the interactive GUI for visualization, or can be consumed in the form of JSON by other software components such as infrastructure scaling controllers.
• Interactive GUI: a web-based interface that allows the user to combine and visualize the workflow execution steps with the performance metrics of the underlying resources.
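The time-consumption analysis currently implemented in the WFEA can be summarized by the following minimal sketch. It is illustrative rather than the prototype's actual code: the Invocation record and the function name are ours, and the input is assumed to be the (name, endpoint, start, end) tuples that the WCDR extracts from the provenance.

```python
# Illustrative sketch of the WFEA's "most time-consuming services"
# analysis; the Invocation record and function name are not the
# prototype's API.
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class Invocation:
    name: str        # task name from the provenance trace
    endpoint: str    # web service endpoint from the workflow file
    start: datetime  # invocation start time
    end: datetime    # invocation end time

    @property
    def duration(self) -> float:
        return (self.end - self.start).total_seconds()

def rank_by_duration(invocations: List[Invocation]) -> List[dict]:
    """Rank tasks by execution time, with each task's share of the
    summed task time (cf. Table I in Section VI)."""
    total = sum(inv.duration for inv in invocations)
    ranked = sorted(invocations, key=lambda inv: inv.duration, reverse=True)
    return [
        {
            "task": inv.name,
            "endpoint": inv.endpoint,
            "duration_sec": round(inv.duration, 2),
            "percent_of_total": round(100 * inv.duration / total, 2),
        }
        for inv in ranked
    ]
```

The returned list of dictionaries can be serialized directly to JSON, which matches the way the WFEA's findings are consumed by other components.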
Figure 1 shows the overall architecture and its individual components. Each of the components is implemented as a RESTful microservice with its own functionality. For our prototype, the WCDR is able to parse t2flow workflows, a specification used by the Taverna WFMS. Our microservice architecture also allows us to support other workflow specifications, such as SCUFL2, by implementing additional parsers.

As mentioned above, the WCDR parses the provenance data to obtain the execution trace of a workflow. Complex workflow structures such as loops or conditions are therefore already recorded in the provenance data. As a next step, the WCDR parses the workflow itself to extract the web service endpoints and the type of call made.
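t2flow is an XML-based format. The sketch below shows one simplified way to pull service endpoints out of a t2flow file; it deliberately ignores the details of the t2flow schema and merely scans all elements and attributes for HTTP(S) URLs, so it should be read as an approximation of what a real parser does.

```python
# Simplified, schema-agnostic endpoint extraction from a t2flow file.
# A production WCDR would walk the actual t2flow activity definitions;
# here we only scan the XML for HTTP(S) URLs.
import re
import xml.etree.ElementTree as ET

URL_RE = re.compile(r"https?://[^\s<>\"']+")

def extract_endpoints(t2flow_path: str) -> set:
    tree = ET.parse(t2flow_path)
    endpoints = set()
    for elem in tree.iter():
        candidates = [elem.text or ""] + list(elem.attrib.values())
        for text in candidates:
            endpoints.update(URL_RE.findall(text))
    return endpoints

# Example (placeholder file name):
# print(extract_endpoints("workflow.t2flow"))
```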
To be able to query more data sources for performance data, the RCDR may include additional implementations (at the moment we support the Prometheus database). It is therefore necessary that each VM hosts a performance metrics collector and a metrics database.
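As an illustration of the RCDR's job, the following sketch retrieves CPU usage for one invocation time range through Prometheus' standard range-query HTTP API. The port and the choice of cAdvisor metric are assumptions based on the experimental setup in Section V, not the prototype's actual code; memory and network data can be fetched analogously via container_memory_usage_bytes and the container_network_receive_bytes_total / container_network_transmit_bytes_total counters.

```python
# Illustrative RCDR query against a per-VM Prometheus instance using
# Prometheus' range-query HTTP API. Host, port, and metric choice are
# assumptions based on the cAdvisor/Prometheus setup in Section V.
import requests

def query_cpu_usage(host: str, start: float, end: float, step: str = "5s"):
    """Fetch per-container CPU usage (rate over the cAdvisor counter)
    between two unix timestamps bounding a service invocation."""
    resp = requests.get(
        f"http://{host}:9090/api/v1/query_range",  # default Prometheus port
        params={
            "query": "rate(container_cpu_usage_seconds_total[1m])",
            "start": start,
            "end": end,
            "step": step,
        },
        timeout=10,
    )
    resp.raise_for_status()
    # Prometheus returns a "matrix": one time series per label set, each
    # with a list of [timestamp, value] samples.
    return resp.json()["data"]["result"]
```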
The GUI component is the only component that is accessible from the outside. Besides acting as a graphical interface for the user, the back end of the GUI component exposes a REST API for calls made by other applications. This design allows both manual and programmatic interaction.

For our prototype, the provenance data and the workflow are manually uploaded by the user to the GUI. However, with our design we aim to be able to connect to provenance and workflow repositories.

To analyze and visualize potential bottlenecks in the execution of a workflow, a user performs the following steps:

1) Once the WFMS, in our case Taverna, has executed a workflow, the user exports the workflow's provenance as a file.
2) Once the execution is over, the user uploads the provenance and workflow files to the GUI.
3) The GUI sends the files to the WCDR, which parses them and returns, for each service in the workflow: 1) its name, 2) its endpoint, 3) its invocation start time and 4) its invocation end time. This information is returned in the form of a list and visualized by the GUI. The list can be filtered to select the hosts the user wishes to analyze further.
4) The user specifies the hosts to be analyzed and sends a request to the RCDR via the GUI to gather the relevant performance data. The RCDR queries the databases on those hosts to retrieve the performance data bounded by the timestamps of each web service invocation.
5) Once the performance data are available to the WFEA, it performs its analysis and returns the results to the GUI.

The sequence diagram in Figure 2 shows the process described above.

Fig. 2. A sequence diagram showing the interactions between the different components. We assume the use case in which an actor is using the GUI to gather and analyze performance data, giving workflow descriptions as input.

V. EXPERIMENTS

For our experiments, we hosted several services on three distributed VMs. We used Taverna to create and execute a workflow comprising these services, each implementing a set of methods that exhaust the system resources. By doing this, we simulated heavy CPU/memory/network use that can be traced back in the performance metrics. To conduct our experiment we used the following components:

• The Taverna WFMS, where we composed and executed a workflow. We also used Taverna to extract the workflow provenance data.
• Three VMs (labeled A, B and C), each containing:
  – A web service that offers the following methods: 1) a lightweight call which requires very few resources from the VM, 2) a CPU-intensive call and 3) a memory-intensive call; the latter two run the stress-ng command for 15 seconds. Using these methods, we can simulate heavy CPU and memory usage that can be traced back in the performance metrics (a sketch of such a service is shown after this list).
  – A performance metrics collector. For our experimental setup, we used cAdvisor [16], a daemon that collects, aggregates, processes, and exports a range of system metrics about running containers. By default, these metrics include CPU, memory, and network usage.
  – A metrics database. In our setup, we used Prometheus [17], a time-series database used to gather and store the performance metrics from cAdvisor. We use this database to query the metrics concerning the workflow execution.
• The CWEA and its components, to parse the workflow, gather the relevant metrics from the different hosts, and perform the visualization.
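For illustration, a load-generating service of this kind could look like the following minimal sketch, assuming Flask for the web layer and a stress-ng binary available on the VM; the route names and the 256M memory load are illustrative choices rather than our exact implementation.

```python
# Illustrative load-generating web service: one lightweight method and
# two methods that run stress-ng for 15 seconds (cf. the experiment
# description above). Flask and the route names are assumptions.
import subprocess
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/lightweight")
def lightweight():
    # Returns immediately; consumes almost no resources on the VM.
    return jsonify(status="ok")

@app.route("/cpu_intensive")
def cpu_intensive():
    # Saturate one CPU core for 15 seconds.
    subprocess.run(["stress-ng", "--cpu", "1", "--timeout", "15s"], check=True)
    return jsonify(status="done")

@app.route("/mem_intensive")
def mem_intensive():
    # Allocate and touch memory for 15 seconds (256M is an arbitrary size).
    subprocess.run(
        ["stress-ng", "--vm", "1", "--vm-bytes", "256M", "--timeout", "15s"],
        check=True,
    )
    return jsonify(status="done")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```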
Since we set up our experiments using VMs, we opted for Docker to run all of these components as separate containers. Deploying Docker containers in combination with virtualized resources (e.g., VMs) is common practice in DevOps, as the VMs offer exclusive access to their resources. Moreover, a wide range of monitoring tools is available as Docker containers. Therefore, it requires minimal effort to set up a realistic performance-monitoring framework on each VM. Singularity [18] is another option for the containerization of applications; it focuses more on high-performance computing (HPC) by providing access to devices such as GPUs or MPI hardware. Nevertheless, the wide adoption of Docker, its wide range of tools, and the fact that the scope of this work lies outside HPC made us choose Docker.

Fig. 3. This diagram depicts our experimental setup. We use three VMs (named A, B and C). On each VM we hosted a simple web service, cAdvisor and Prometheus. After the workflow execution, the user provides the provenance and workflow information to the CWEA (Figure 1 shows its architecture), which uses it to query Prometheus on each VM.

Figure 3 shows the configuration of each VM. We used Taverna to create and execute the test workflow shown in Figure 4. We examined published Taverna workflows (taken from www.myexperiment.org) to create a workflow with a realistic structure. As a result, we constructed our test workflow comprising a total of six tasks ending in one output port. The workflow contains both sequential and parallel executions. The tasks that are executed are of the following types: 1) lightweight, 2) CPU-intensive and 3) memory-intensive.

Fig. 4. A simple workflow made of six tasks spread over three VMs. Tasks lightweight 1, CPU intensive 2 and mem intensive 2 were hosted on VM A, tasks CPU intensive 1 and mem intensive 1 on VM B, and task CPU intensive 3 on VM C.

VI. RESULTS

We executed the workflow described in the previous section and analyzed it with the CWEA. Figure 5 shows the results as visualized by the GUI. In Figure 5(a) we see the workflow execution analysis: a table that shows, for each invocation, the name of the task, its endpoint, the HTTP method, and the start and end times. Next, in Figure 5(b) we see the execution timeline of the workflow and the time required to execute each task. In this timeline, each task is represented by a bar of a different color, which is also highlighted in the resource usage graphs below. The position and length of each bar correspond to the start time and duration of each task. Table I shows the execution times in more detail.

TABLE I
EXECUTION TIMES AND PERCENTAGE OF TOTAL EXECUTION FOR EACH TASK IN THE WORKFLOW.

Task Name        | Exec. Duration (sec) | Perc. of Total Exec. (%)
mem intensive 1  | 26.04                | 22.34
mem intensive 2  | 24.57                | 21.09
CPU intensive 1  | 22.83                | 19.59
CPU intensive 3  | 21.80                | 18.71
CPU intensive 2  | 21.11                | 18.11
lightweight 1    |  0.19                |  0.16

Fig. 5. Workflow execution analysis (a) and execution timeline (b) as presented to the user by the GUI. The execution timeline provides the user with an overview of the time taken to execute each task.

Figure 6 shows the metrics collected from each VM. All sub-figures highlight each service execution (when each service call started and ended) with a separate color which corresponds to the colors shown in Figure 5(b). More specifically, Figure 6(a) presents the CPU used by each task, highlighted with the color that corresponds to each task in the timeline of Figure 5(b); the x-axis is the percentage of the CPU used and the y-axis the time of the recorded metric. Similarly, Figure 6(b) presents the memory used by each task, also color-highlighted; in this graph, the x-axis shows the memory used in MB and the y-axis the time. Figures 6(c) and 6(d) show the network usage, where the x-axis represents incoming or outgoing data in KB/s and the y-axis the time.

Fig. 6. Combined results after parsing the workflow, querying the provenance file and querying the relevant resource usage: (a) CPU resources utilized by each service during the workflow execution; (b) memory resources utilized by each service; (c) incoming network traffic for each task execution; (d) outgoing network traffic for each task execution. All sub-figures highlight each service execution with a separate color which corresponds to the colors shown in Figure 5(b).

Considering the performance of task CPU intensive 1, we see in Figure 6(a) that it started at approximately 21:45:05, ended at 21:45:28, and used more than 80% of the CPU. However, in Figure 6(b) we see that the same task did not require any memory from its hosting VM, although it did use some network resources, as can be seen in Figure 6(d).

From the results presented here, we can see that the most time-consuming task of the workflow presented in Section V is the memory-intensive one (mem intensive 1). Looking at the results in Figure 6, we see that this task uses CPU, memory and network resources, which may account for its increased execution time.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented and evaluated our proposed architecture, which allows scientists or service developers to analyze the execution of service-based scientific workflows and to visualize possible bottlenecks related to the infrastructure by getting a detailed view of the resource usage of each workflow task. By taking advantage of our solution, infrastructure administrators, service developers or scaling controllers may reconfigure the provisioned virtualized infrastructure.

As described in Section IV, our proposed architecture relies on the WFMS to collect provenance data. This means that the workflow execution needs to complete before the CWEA can use the provenance data. However, being able to collect provenance data as each task completes would greatly benefit our architecture, as results could then be presented on the fly. Such an approach requires further investigation into how to use the appropriate APIs of multiple WFMSs, or how to abstract this process by relying on a provenance repository that collects provenance data as workflow tasks complete.

Another issue that will require our attention in the future is the persistence of the performance data on each VM. It can be the case that, as soon as the workflow execution is over or the VMs are no longer used, the VMs are deleted, causing the loss of all performance data. In the future, in combination with the on-the-fly data gathering described above, the performance data shall be copied to a separate performance database. Having performance data from many workflow executions will also enable us to make use of statistical and AI algorithms to detect and predict possible workflow execution failures due to errors in the resource infrastructure.

ACKNOWLEDGMENT

This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 824068 (ENVRI-FAIR), 654182 (ENVRIPLUS), 825134 (ARTICONF), 676247 (VRE4EIC) and 643963 (SWITCH).

REFERENCES

[1] L. Candela, D. Castelli, and P. Pagano, "Virtual research environments: an overview and a research agenda," Data Science Journal, vol. 12, pp. GRDI75–GRDI81, 2013.
[2] Z. Zhao, A. Belloum, C. De Laat, P. Adriaans, and B. Hertzberger, "Distributed execution of aggregated multi domain workflows using an agent framework," in Services, 2007 IEEE Congress on, pp. 183–190, 2007.
[3] M. A. Miller, W. Pfeiffer, and T. Schwartz, "The CIPRES science gateway: enabling high-impact science for phylogenetics researchers with limited resources," in Proceedings of the 1st Conference of the Extreme Science and Engineering Discovery Environment: Bridging from the eXtreme to the campus and beyond, p. 39, 2012.
[4] S. Koulouzis, A. S. Belloum, M. T. Bubak, Z. Zhao, M. Živković, and C. T. de Laat, "SDN-aware federation of distributed data," Future Generation Computer Systems, vol. 56, pp. 64–76, 2016.
[5] K. Evans, A. Jones, A. Preece, F. Quevedo, D. Rogers, I. Spasić, I. Taylor, V. Stankovski, S. Taherizadeh, J. Trnkoczy, G. Suciu, V. Suciu, P. Martin, J. Wang, and Z. Zhao, "Dynamically reconfigurable workflows for time-critical applications," in Proceedings of the 10th Workshop on Workflows in Support of Large-Scale Science, pp. 7:1–7:10, 2015.
[6] P. Groth and L. Moreau, "PROV-Overview: an overview of the PROV family of documents," W3C Working Group Note, 2013.
[7] R. Cushing, S. Koulouzis, A. Belloum, and M. Bubak, "Applying workflow as a service paradigm to application farming," Concurrency and Computation: Practice and Experience, vol. 26, no. 6, pp. 1297–1312, 2014.
[8] R. F. da Silva, R. Filgueira, I. Pietri, M. Jiang, R. Sakellariou, and E. Deelman, "A characterization of workflow management systems for extreme-scale applications," Future Generation Computer Systems, vol. 75, pp. 228–238, 2017.
[9] Y. Demchenko, Z. Zhao, P. Grosso, A. Wibisono, and C. De Laat, "Addressing big data challenges for scientific data infrastructure," in 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, pp. 614–617, IEEE, 2012.
[10] S. Koulouzis, D. Vasyunin, R. Cushing, A. Belloum, and M. Bubak, "Cloud data federation for scientific applications," in Euro-Par 2013: Parallel Processing Workshops, pp. 13–22, Springer Berlin Heidelberg, 2014.
[11] R. Prodan and T. Fahringer, "Overhead analysis of scientific workflows in Grid environments," IEEE Transactions on Parallel and Distributed Systems, vol. 19, pp. 378–393, March 2008.
[12] I. Foster, "Globus Toolkit version 4: software for service-oriented systems," Journal of Computer Science and Technology, vol. 21, no. 4, p. 513, 2006.
[13] R. Ferreira da Silva, T. Glatard, and F. Desprez, "Self-healing of operational workflow incidents on distributed computing infrastructures," in 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pp. 318–325, May 2012.
[14] S. Madougou, S. Shahand, M. Santcroos, B. van Schaik, A. Benabdelkader, A. van Kampen, and S. Olabarriaga, "Characterizing workflow-based activity on a production e-infrastructure using provenance data," Future Generation Computer Systems, vol. 29, no. 8, pp. 1931–1942, 2013.
[15] P. Gaikwad, A. Mandal, P. Ruth, G. Juve, D. Król, and E. Deelman, "Anomaly detection for scientific workflow applications on networked clouds," in 2016 International Conference on High Performance Computing & Simulation (HPCS), pp. 645–652, July 2016.
[16] "cAdvisor (container advisor), official GitHub page." https://github.com/google/cadvisor. Accessed: 2019-03-28.
[17] "Prometheus, an open-source systems monitoring and alerting toolkit." https://prometheus.io/docs/introduction/overview/. Accessed: 2019-03-28.
[18] G. M. Kurtzer, "Singularity 2.1.2: Linux application and environment containers for science," 2016. https://doi.org/10.5281/zenodo.60736.