<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cycle Orchestrator: A Knowledge-Based Approach for Structuring Cyclic ML Pipelines in the O&amp;G Industry</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rafael Brandão</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vitor Lourenço</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcelo Machado</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leonardo Azevedo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcelo Cardoso</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renan Souza</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guilherme Lima</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renato Cerqueira</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcio Moreno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research</institution>
          ,
          <addr-line>Rio de Janeiro, RJ</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This work introduces the Cycle Orchestrator, a microservices infrastructure to structure and manage workflows related to heterogeneous data from the O&amp;G domain. Through a knowledge-based perspective, it leverages reasoning, explainability and collaboration among stakeholders.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge-based Workflow Orchestration</kwd>
        <kwd>ML pipelines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Domain and requirements. In the natural resources domain, particularly in the oil and
gas (O&amp;G) industry, seismic data interpretation is key to exploration processes: it
identifies geological structures in the subsurface, allowing experts to detect patterns and
correlate geological factors across different data sources. This practice commonly
involves processing massive amounts of data with diverse techniques that aim at
detecting geological structures, enhancing information, and correcting potential
inconsistencies introduced in the data acquisition process, among other purposes. A growing
number of works in the literature apply Machine Learning (ML) workflows to
support aspects of such processing. Systematically modeling geological exploration
processes that apply complex data processing pipelines, while allowing other stakeholders to
collaborate and consume experiments’ results, requires a holistic perspective. To this
end, we conceptualized and developed the Cycle Orchestrator, a knowledge-based
workflow management system (WfMS) that supports and operationalizes the whole
lifecycle of ML and general-purpose workflows, including their specification, setup, execution
and provenance data management. It was conceived within the O&amp;G
domain, primarily to support exploration use cases that apply cyclic ML workflows,
that is, streams of tasks that can yield improved results through a chain of execution iterations.
These workflows are associated with particular types of data sources (e.g., pre-stack and
post-stack seismic data). The considered use cases comprise unsupervised ML
pipelines that produce (train) new models and reuse pre-trained models and weights against
new datasets, improving quality through cyclic evolution. In this context,
orchestration involves defining which model and version should be applied to analyze a
specific data source, as required by the particular workflows applied in O&amp;G exploration
processes.</p>
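      <p>As an illustration only, the following minimal sketch shows the bookkeeping behind such a cyclic workflow: each iteration reuses the model version produced by the previous one and records which model and version were applied to which data source. The names used here (ModelVersion, run_cycles, the dataset labels and the model name) are hypothetical and are not part of the Cycle Orchestrator; no actual training is performed.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical record identifying a model and its version."""
    name: str
    version: int


def run_cycles(data_source: str, datasets: list[str], model_name: str) -> list[dict]:
    """Simulate one execution iteration per dataset.

    Every iteration starts from the model produced by the previous one,
    so the returned history states which version analyzed which dataset.
    """
    history: list[dict] = []
    current: Optional[ModelVersion] = None  # nothing to reuse before the first cycle
    for dataset in datasets:
        next_version = 1 if current is None else current.version + 1
        produced = ModelVersion(model_name, next_version)
        history.append(
            {
                "data_source": data_source,  # e.g. pre-stack or post-stack
                "dataset": dataset,
                "reused": current,           # model the iteration started from
                "produced": produced,        # model the iteration yielded
            }
        )
        current = produced
    return history


history = run_cycles(
    "post-stack", ["survey_a", "survey_b", "survey_c"], "seismic-clustering"
)
for step in history:
    print(step["dataset"], step["reused"], step["produced"])
```
]]></preformat>
      <p>In this sketch the chain of iterations is explicit in the history list, which is the kind of information a provenance record would need in order to answer which model version analyzed a given dataset.</p>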
      <p>
        Knowledge-based Workflow Management. The Cycle Orchestrator takes advantage
of the Hyperknowledge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] conceptual model for relating knowledge specifications
aligned through a domain ontology to segments of multimodal content. Information is
represented in the Hyperknowledge Base, a hybrid storage solution that uses a direct
hyperlinked knowledge graph to maintain all information about workflow execution
plans and provenance data stored in a knowledge base. The proposed modeling adheres
to the MLWfM ontology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to structure basic aspects of ML and to PROV-ML [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] as the
provenance data model. Figure 1 shows the architectural overview of the system.
      </p>
      <p>[Figure 1: Architectural overview of the Cycle Orchestrator. Recoverable component labels: API; Workflow Specification API; Execution Control API; Lineage API; Orchestrator Specification Controller; Specification Parser; Setup Parser; Workflow Builder; Specification Workflow Handler; Workflow Representation; Workflow DAGs; Orchestrator Execution Controller; Execution Engine (Apache Airflow); Execution Workflow Handler; Output Data and Logs; Provenance Data; Provenance Collector Service; Provenance Manager Service; Orchestrator Lineage Controller; Lineage Workflow Handler; Hyperknowledge Base.]</p>
    </sec>
    <sec id="sec-10">
      <title>Knowledge Explorer System</title>
      <p>
        Users interact with the system through a REST API and through the
Knowledge Explorer System (KES) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a web UI for curating and querying information. The REST API
has endpoints for workflow specification, execution, and lineage retrieval. The
specification endpoints provide basic operations for workflow plans. Workflow definitions
use a JSON-based specification language to model tasks, execution flow, required input
data, expected output data and knowledge relations. Each definition file is parsed, producing a
Hyperknowledge representation and a directed acyclic graph (DAG) data structure. The
execution endpoint interfaces with the API of the execution engine (Apache Airflow). The
execution handler captures provenance data and structures it according to the provenance
data model, so that it can be queried through the lineage endpoint.
      </p>
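      <p>The parsing step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the specification language itself is not reproduced in this paper, so the JSON field names (workflow, tasks, id, inputs, outputs, depends_on) and the task names are hypothetical, and the ordering relies on the graphlib module of the Python standard library rather than on Apache Airflow.</p>
      <preformat><![CDATA[
```python
import json
from graphlib import TopologicalSorter

# Hypothetical workflow definition; the real schema is not given in the paper.
SPEC = """
{
  "workflow": "seismic-clustering-cycle",
  "tasks": [
    {"id": "load", "inputs": ["post_stack_volume"], "outputs": ["traces"],
     "depends_on": []},
    {"id": "preprocess", "inputs": ["traces"], "outputs": ["clean_traces"],
     "depends_on": ["load"]},
    {"id": "train", "inputs": ["clean_traces"], "outputs": ["model"],
     "depends_on": ["preprocess"]},
    {"id": "evaluate", "inputs": ["model", "clean_traces"], "outputs": ["report"],
     "depends_on": ["train", "preprocess"]}
  ]
}
"""


def build_dag(spec_text: str) -> dict[str, set[str]]:
    """Parse a JSON definition into a mapping task id -> set of predecessors."""
    spec = json.loads(spec_text)
    dag = {task["id"]: set(task["depends_on"]) for task in spec["tasks"]}
    # Reject references to tasks that the definition never declares.
    unknown = {dep for deps in dag.values() for dep in deps} - set(dag)
    if unknown:
        raise ValueError(f"unknown task(s) referenced: {sorted(unknown)}")
    return dag


def execution_order(dag: dict[str, set[str]]) -> list[str]:
    """Return one valid execution order; graphlib raises CycleError on cycles."""
    return list(TopologicalSorter(dag).static_order())


dag = build_dag(SPEC)
order = execution_order(dag)
print(order)  # ['load', 'preprocess', 'train', 'evaluate']
```
]]></preformat>
      <p>Because each definition is acyclic, the cyclic behavior of a workflow would come from executing such a DAG repeatedly, with the outputs of one iteration (for instance the trained model) serving as inputs to the next.</p>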
      <p>By integrating workflows’ lifecycles in a common representation, our approach
promotes knowledge production, consumption and curation in the O&amp;G domain, enabling
industry experts to design exploration processes holistically by connecting heterogeneous
data processing, ontologies and stakeholders.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          et al.:
          <article-title>Managing Machine Learning Workflow Components</article-title>
          . In:
          <source>14th IEEE Conference on Semantic Computing, ICSC</source>
          . pp.
          <fpage>25</fpage>
          -
          <lpage>30</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          et al.:
          <article-title>Extending Hypermedia Conceptual Models to Support Hyperknowledge Specifications</article-title>
          .
          <source>Int. J. Semantic Computing</source>
          .
          <volume>11</volume>
          ,
          <issue>01</issue>
          ,
          <fpage>43</fpage>
          -
          <lpage>64</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          et al.:
          <article-title>KES: The Knowledge Explorer System</article-title>
          . In:
          <source>2018 International Semantic Web Conference (P&amp;D/Industry/BlueSky), ISWC</source>
          . (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Souza</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          et al.:
          <article-title>Provenance Data in the Machine Learning Lifecycle in Computational Science and Engineering</article-title>
          . In:
          <source>2019 IEEE/ACM Workflows in Support of Large-Scale Science, WORKS</source>
          . pp.
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>