Cycle Orchestrator: A Knowledge-Based Approach for Structuring Cyclic ML Pipelines in the O&G Industry

Rafael Brandão, Vitor Lourenço, Marcelo Machado, Leonardo Azevedo, Marcelo Cardoso, Renan Souza, Guilherme Lima, Renato Cerqueira, Marcio Moreno

IBM Research, Rio de Janeiro, RJ, Brazil

Abstract. This work introduces the Cycle Orchestrator, a microservices infrastructure for structuring and managing workflows over heterogeneous data from the O&G domain. Through a knowledge-based perspective, it enables reasoning, explainability, and collaboration among stakeholders.

Keywords: Knowledge-based Workflow Orchestration, ML Pipelines

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Domain and requirements. In the natural resources domain, particularly in the oil and gas (O&G) industry, seismic data interpretation is key in exploration processes to identify geological structures in the subsurface, allowing experts to detect patterns and correlate geological factors by exploring different data sources. This practice commonly involves processing massive amounts of data through diverse techniques, aiming at detecting geological structures, enhancing information, correcting potential inconsistencies introduced in the data acquisition process, and other purposes. A growing number of works in the literature apply Machine Learning (ML) workflows to support aspects of such processing. To systematically model geological exploration processes that apply complex data processing pipelines, while allowing other stakeholders to collaborate and consume experiments' results, a holistic perspective is required. In this sense, we conceptualized and developed the Cycle Orchestrator, a knowledge-based workflow management system (WfMS) to support and operationalize the whole lifecycle of ML and general-purpose workflows, including their specification, setup, execution, and provenance data management. It was conceived within the O&G domain, primarily to support exploration use cases that apply cyclic ML workflows: streams of tasks that can yield improved results through a chain of execution iterations. These workflows are associated with particular types of data sources (e.g., pre-stack and post-stack seismic data). The use cases considered comprise unsupervised ML pipelines that both train new models and reuse pre-trained models and weights on new datasets, improving result quality through cyclic evolution. In this context, orchestration involves defining which model and version should be applied to analyze a specific data source, as required by the particular workflows used in O&G exploration processes.

Knowledge-based Workflow Management. The Cycle Orchestrator takes advantage of the Hyperknowledge [2] conceptual model for relating knowledge specifications, aligned through a domain ontology, to segments of multimodal content. Information is represented in the Hyperknowledge Base, a hybrid storage solution that maintains all information about workflow execution plans and provenance data in a hyperlinked knowledge graph. The proposed modeling adheres to the MLWfM ontology [1] to structure basic aspects of ML and to PROV-ML [4] as the provenance data model.
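To illustrate the kind of linking this knowledge-based representation enables across execution cycles, consider the minimal sketch below, in plain Python: each cycle registers, for a data source, which model version it reused and which it produced, so the next cycle can resolve which pre-trained weights to apply. All names and relations here (KnowledgeBase, producedModel, salt-net) are hypothetical illustrations, not the actual Hyperknowledge API.

    # Illustrative sketch only: a stand-in for the hyperlinked knowledge
    # graph maintained by the Hyperknowledge Base. Entity and relation
    # names are assumptions made for this example.
    from dataclasses import dataclass, field

    @dataclass
    class KnowledgeBase:
        links: list = field(default_factory=list)  # (subject, relation, object)

        def add(self, subject, relation, obj):
            self.links.append((subject, relation, obj))

        def latest_model_for(self, data_source):
            # Resolve the most recent model version produced for a data source.
            produced = [o for s, r, o in self.links
                        if s == data_source and r == "producedModel"]
            return produced[-1] if produced else None

    kb = KnowledgeBase()

    # Cycle 1: train a model from scratch on a post-stack seismic volume.
    kb.add("poststack/volume-42", "producedModel", "salt-net:v1")

    # Cycle 2: reuse the pre-trained weights and produce an improved version.
    base = kb.latest_model_for("poststack/volume-42")   # -> "salt-net:v1"
    kb.add("poststack/volume-42", "reusedModel", base)
    kb.add("poststack/volume-42", "producedModel", "salt-net:v2")

    print(kb.latest_model_for("poststack/volume-42"))   # salt-net:v2

This captures, in miniature, the orchestration decision described above: which model and version should be applied to a specific data source at each iteration.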
Figure 1 shows an architectural overview of the system.

Fig. 1. Cycle Orchestrator's infrastructure overview.

Users interact with the system through a REST API and through a web UI for curating and querying information, the Knowledge Explorer System (KES) [3]. The REST API has endpoints for workflow specification, execution, and lineage retrieval. The specification endpoints provide basic operations for workflow plans. Workflow definitions use a JSON-based specification language to model tasks, execution flow, required input data, expected output data, and knowledge relations. This specification is parsed, producing a Hyperknowledge representation and a directed acyclic graph (DAG) data structure. The Execution endpoint interfaces with the execution engine's API (Apache Airflow1). The execution handler captures provenance data and structures it according to the provenance data model, which can then be queried through the Lineage endpoint (a hypothetical client-side sketch of this interaction is given after the references).

By integrating workflows' lifecycles in a common representation, our approach promotes knowledge production, consumption, and curation in the O&G domain, enabling industry experts to design exploration processes holistically, connecting heterogeneous data processing, ontologies, and stakeholders.

References
1. Moreno, M. et al.: Managing Machine Learning Workflow Components. In: 14th IEEE International Conference on Semantic Computing (ICSC), pp. 25–30 (2020).
2. Moreno, M.F. et al.: Extending Hypermedia Conceptual Models to Support Hyperknowledge Specifications. International Journal of Semantic Computing 11(1), 43–64 (2017).
3. Moreno, M.F. et al.: KES: The Knowledge Explorer System. In: 2018 International Semantic Web Conference (P&D/Industry/BlueSky), ISWC (2018).
4. Souza, R. et al.: Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering. In: 2019 IEEE/ACM Workflows in Support of Large-Scale Science (WORKS), pp. 1–10 (2019).

1 https://airflow.apache.org/
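Client-side sketch of the REST API flow. The listing below makes the specification, execution, and lineage interaction concrete. The endpoint paths, JSON field names, and specification schema are assumptions for illustration only, not the actual Cycle Orchestrator API; only the three endpoint roles are taken from the text.

    # Hypothetical client for the Cycle Orchestrator REST API:
    # register a workflow plan, trigger an execution, query its lineage.
    import requests

    BASE = "http://localhost:8080/api"  # assumed service address

    # A workflow specification modeling tasks, execution flow, required
    # input data, expected output data, and knowledge relations. The
    # concrete schema is an assumption made for this example.
    spec = {
        "name": "salt-detection-cycle",
        "tasks": [
            {"id": "preprocess", "image": "seismic-preprocess:1.0"},
            {"id": "train", "image": "unsupervised-train:2.3",
             "inputs": ["preprocess.output", "model:salt-net:v1"],
             "outputs": ["model:salt-net:v2"]},
        ],
        "flow": [["preprocess", "train"]],          # DAG edges
        "knowledge": {"dataSource": "post-stack-seismic"},
    }

    # Specification endpoint: register the workflow plan.
    plan = requests.post(f"{BASE}/workflows", json=spec).json()

    # Execution endpoint: trigger a run of the registered plan
    # (the orchestrator drives Apache Airflow underneath).
    run = requests.post(f"{BASE}/workflows/{plan['id']}/executions").json()

    # Lineage endpoint: retrieve the provenance captured for the run.
    lineage = requests.get(f"{BASE}/lineage/{run['id']}").json()
    print(lineage)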