<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Planning for Optimal Data Pipeline Instantiation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Amado</string-name>
          <email>leonardo.amado@abdn.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Vogel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dalvan Griebler</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabriel Paludo Licks</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eric Simon</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felipe Meneguzzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Johannes Kepler University Linz</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pontifical Catholic University of Rio Grande do Sul</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>SAP Labs</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>University of Aberdeen</institution>
          ,
          <addr-line>Scotland</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Work done while affiliated with the Pontifical Catholic University of Rio Grande do Sul (PUCRS). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International</institution>
          ,
          <addr-line>CC BY 4.0</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Data pipeline frameworks provide abstractions for implementing sequences of data-intensive transformation operators, automating the deployment and execution of such transformations in a cluster. Deploying a data pipeline, however, requires computing resources to be allocated in a data center, ideally minimizing the overhead for communicating data and executing operators in the pipeline while considering each operator's execution requirements. In this paper, we model the problem of optimal data pipeline deployment as planning with action costs, where we propose heuristics aiming to minimize total execution time. Experimental results indicate that the heuristics can outperform the baseline deployment and that a heuristic based on connections outperforms other strategies.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Planning</kwd>
        <kwd>Data Pipelines</kwd>
        <kwd>Pipeline optimization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        A recent report from the European Commission forecasts an increase to 8.4 million data professionals
in the EU27 by 2025, with 1.8 million positions to be added in the period between 2020 and 2025, leading
to a shortage of approximately 484,000 positions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The shortage of highly trained human resources
calls for data processing platforms accessible to a broader category of users (the so-called citizen data
scientists) who have strong data expertise, moderate statistics training, and low expertise in programming
and distributed computing [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Several recent technology trends acknowledge this situation through
the development of data pipeline frameworks [
        <xref ref-type="bibr" rid="ref4 ref5 ref6 ref7 ref8 ref9">4, 5, 6, 7, 8, 9</xref>
        ] relying on a flow-based programming
model [
        <xref ref-type="bibr" rid="ref10">10</xref>
         ], in which the data pipelines are implemented as directed graphs where nodes represent
computational operators that transform data and require diverse runtimes, and edges represent the
flow of data between these operators.
      </p>
      <p>In these frameworks, the goal is to enable non-technical experts (aka citizen data scientists) to quickly
build, debug, and maintain high-quality data pipelines at scale within a cloud cluster by providing
high-level abstractions in their programming model. Since users design data pipelines without being
assumed to comprehend the underlying engine that deploys and executes them, the resulting data
pipelines often lead to a non-optimal allocation of the resources that execute them. Thus, data pipeline engines must
provide optimization techniques to ensure an optimal usage of computing resources in a distributed
system and minimize data-intensive communication costs during graph execution.</p>
      <p>
        The problem of optimizing a pipeline execution is similar to the task of constraint optimization [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
which has been a focus of research on automated planning. Indeed, research in that area has focused
on developing algorithms to compute optimized solutions for optimization problems with constraints,
leading to fast and reliable algorithms [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. With such a research background, we develop a model
for automated planning algorithms that use heuristic search and state-dependent action costs to solve
the problem of optimally allocating computation tasks to the worker nodes of a distributed system for
executing a data pipeline. We chose to formalize the problem in terms of automated planning (instead
of, e.g., a discrete optimization problem) partly to showcase recent advances on numeric planning, and
partly to assess the limits of current heuristics and planning systems in that space. We evaluate our
optimization methods using the pipeline engine of the SAP Data Intelligence commercial product¹,
called VFlow², which adopts a flow-based programming model, and whose execution model is based
on a distributed Kubernetes container orchestration platform.
      </p>
      <p>This work provides two main technical contributions: I) We model several optimization methods that
determine how the operators of a data pipeline should be deployed on the worker nodes of Kubernetes
clusters; II) We perform an experimental analysis of the proposed approaches with a real-world industry
data pipeline processing system, report our findings and indicate directions for future performance
improvement.</p>
      <p>We organize this article as follows. Section 2 provides an overview of automated planning and
data pipelines. In Section 3, we model the elements and constraints of deploying data pipelines and
computing plans. Then, in Section 4 we describe the approaches to pipeline grouping for reducing the
total execution time. Section 5 explains our experimental methodology that compares the results of
the heuristic approaches against solutions from a random baseline approach on various workloads.
Section 6 shows our empirical performance results for the different approaches using pipelines with a
sequential or parallel topology in which we vary the number of special operators and the computational
load. Finally, Section 8 summarizes our key contributions and points toward future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section introduces the formal background upon which our application relies. We start with a
formalization of Automated Planning and the elements of problems in this formalism in Section 2.1.
Section 2.2 then formalizes data pipelines and how their definition relates to deployment in an underlying
high-performance infrastructure. This allows us to formalize the pipeline optimization problem in
Section 3.</p>
      <sec id="sec-2-1">
        <title>2.1. Automated Planning</title>
        <p>We leverage automated planning techniques to drive the optimization. Automated planning is a sub-area
of Artificial Intelligence concerned with finding sequences or policies of actions (i.e., plans) able to
transition an agent from a given initial state to a desirable goal state [13, 14, Ch. 1]. A classical planning
problem can be described as one of graph search, in which the nodes in the graph represent states,
edges represent transitions between states that are caused by applying an action at a state. The solution
to a planning problem is a sequence of actions (edges) that forms a path to traverse this graph from a
unique initial state to any one of the goal states. Domain experts seldom define such graphs explicitly,
but rather use a logic-based description language (which we define later) to induce the actual search
graph. This allows graphs with huge state-spaces to be defined compactly. In what follows we adopt
the planning terminology from Ghallab et al. [15] to represent states and actions in planning domain
problems.</p>
        <p>The most fundamental part of a classical planning task is the planning domain. A planning domain is
a formal description of the dynamics of an environment in which an agent acts. Definition 1 formalizes
a planning domain.</p>
        <p>Definition 1 (Planning Domain). A planning domain Ξ consists of a pair ⟨ℱ, 𝒜⟩, which specifies the
knowledge of the domain, and comprises: a finite set of facts ℱ, i.e., a set of ground instantiated predicates,</p>
        <sec id="sec-2-1-1">
          <title>1. https://www.sap.com/products/technology-platform/data-intelligence.html 2. https://api.sap.com/api/vflow/overview</title>
          <p>defining the environment state properties; and a finite set of actions 𝒜, which is technically a set of ground
instantiated operators, representing the actions that can be performed in the environment.
States s ⊆ ℱ consist of facts from a planning domain indicating properties that are true at any
moment in time and follow the closed-world assumption, so that any fact not included in a state is
false. Conditions or formulas in our formalism comprise positive and negated facts (f, ¬f) representing
an implicit conjunctive formula indicating what must be true (alternatively, false) in a state. The
positive part of a condition c (pos(c)) comprises the positive facts, and the negated part of a condition
(neg(c)) comprises the negated facts. We say a state s supports a condition c, s ⊨ c (alternatively, c is
valid in s), if and only if all positive facts are present in s, and all negated facts are absent in s, i.e.,
s ⊨ c if and only if (s ∪ pos(c) = s) ∧ (s ∩ neg(c) = ∅). An action a ∈ 𝒜 is represented by a tuple
a = ⟨pre(a), ef(a), cost(a)⟩ containing the preconditions pre(a), the effects ef(a), and a non-negative
cost cost(a), indicating possible transitions between states.</p>
          <p>The transition of a state s into a new state s′ using an action a is represented as the transition function
s′ = γ(s, a). The transition is valid if s ⊨ pre(a), and s′ = (s ∪ pos(ef(a))) − neg(ef(a)).</p>
          <p>A planning task (or planning instance) is the formal description of a task to be solved in a given
planning domain. Planning tasks comprise a planning domain Ξ and a planning problem. A planning
problem describes the finite set of objects of the environment, the initial state from which the planning
problem starts, and the goal condition which an agent desires to achieve.</p>
          <p>Definition 2 (Planning Task). A planning task is a tuple Π = ⟨Ξ, ℐ, 𝒢⟩, in which Ξ is a planning domain,
ℐ is an initial state, and 𝒢 is a goal condition.</p>
          <p>The solution for a planning task, which we call a plan, is a sequence of actions that an agent can
perform in a planning domain to achieve the goal condition 𝒢 from the initial state ℐ. Definition 3 formalizes
the notion of a plan and an optimal plan.</p>
          <p>Definition 3 (Plan). A plan π for a planning task Π is a sequence of actions π = ⟨a₁, . . . , aₙ⟩ that
induces a sequence of states ⟨s₀, s₁, . . . , sₙ⟩ such that ℐ = s₀ ⊨ pre(a₁), sₙ ⊨ 𝒢, and every state
sᵢ is such that sᵢ₋₁ ⊨ pre(aᵢ) and sᵢ = γ(sᵢ₋₁, aᵢ). The cost of a plan is the sum of the costs of all of its
actions, such that cost(π) = ∑ᵢ₌₁ⁿ cost(aᵢ). We say a plan is optimal, denoted π*, if its cost is minimal, that
is, if no other plan π′ has cost(π′) &lt; cost(π*).</p>
          <p>Note that Definition 3 allows for multiple optimal plans, which may have the same cost, and are
thus equivalently optimal. Modern planners use the Planning Domain Definition Language (PDDL)
as a standardized domain and problem representation medium [16]. PDDL encodes a number of
expressive planning formalisms, such as STRIPS-style [17] planning tasks. For simplicity, most planning
formalizations assume that every action in 𝒜 has cost 1, so that the optimal plan is the plan with the
smallest number of actions. Here, we define actions with varying costs that depend on the state of the
environment, where the environment considered in this work relates to the configurations of a platform
to run data pipelines.</p>
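          <p>To make these definitions concrete, the following Python sketch (our own illustration, not part of any planner) encodes states as frozensets of facts, applies effects as in the transition function above, and sums state-dependent action costs to score a plan; the fact and action names are hypothetical.</p>

```python
def supports(state, positive, negated):
    """A state supports a condition iff all positive facts are present
    and all negated facts are absent (closed-world assumption)."""
    return positive.issubset(state) and not (state & negated)

def progress(state, action):
    """Apply an action's effects: add positive effects, drop negated ones."""
    return (state | action["eff_pos"]) - action["eff_neg"]

def plan_cost(initial, goal, plan):
    """Return the total cost of a plan, or None if the plan is invalid."""
    state, total = frozenset(initial), 0
    for act in plan:
        if not supports(state, act["pre_pos"], act["pre_neg"]):
            return None  # precondition violated: invalid transition
        state = progress(state, act)
        total += act["cost"](state)  # state-dependent action cost
    return total if supports(state, goal, frozenset()) else None

# Hypothetical single-action task: place operator o1 into a group.
move = {"pre_pos": frozenset({"ungrouped-o1"}), "pre_neg": frozenset(),
        "eff_pos": frozenset({"grouped-o1"}),
        "eff_neg": frozenset({"ungrouped-o1"}),
        "cost": lambda s: 5}
```

          <p>Here the cost is a function of the resulting state, mirroring the state-dependent action costs used in our encoding.</p>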
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Designing and deploying data pipelines</title>
        <p>
          In the flow-based programming model [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], a data pipeline corresponds to a type of application that
processes large amounts of data and produces results promptly. We refer to processing as the capacity of
data pipelines to ingest, transform, and output data with high throughput and low latency [18]. At the
design level, the data pipelines considered in this article are graphs in which nodes are operators and
edges represent the flow of data, following the flow-based programming paradigm [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Operators are
responsible for performing operations with data. (Note that we explicitly make a distinction between
operators, i.e., elements of the data pipeline formalism, and actions, i.e., state transformation functions
in the planning formalism.) Operators can be written using different programming languages using a
Software Development Kit (SDK) provided by the data pipeline framework, formalized
in Definition 4.
        </p>
        <p>Definition 4 (SDK). Let 𝒮 be the set of all SDKs made available by the pipeline engine. An SDK s ∈ 𝒮
provides a programming environment and runtime specific to a programming language, with tools such as
an API for the user to program operators and interact with the pipeline framework. Thus, the data pipeline
framework can provide SDKs for developing operators in Java, Python, Go, etc.</p>
        <p>Implementations of operators can difer in requirements that need to be satisfied by the runtime
system in order to be executable, and tags associated with an operator encode these requirements.
Definition 5 formalizes the notion of a tag.</p>
        <p>Definition 5 (Tag). Let 𝒯 be the set of all possible tags defined by the user. A tag t ∈ 𝒯 denotes a
requirement that has to be satisfied for an operator to be executable, i.e., a specific library or package
(possibly a specific version of each).</p>
        <p>Definition 6 brings together Definitions 4 and 5 into the formal notion of an operator within a pipeline.
Definition 6 (Operator). A pipeline operator o = ⟨id, s, T⟩ comprises an identification id, the SDK
s ∈ 𝒮 it is developed with, and a set of tags T ⊆ 𝒯 that it requires in order to execute.</p>
        <p>The runtime environment in which an operator must be executed is set by creating a Docker image
(later called OS image) [19], which is a lightweight way to package code, runtime, and all dependencies,
and by associating tags to the OS image that reflect the characteristics of the environment. Definition 7
formalizes the notion of an image in terms of the tag requirements from Definition 5.
Definition 7 (Image). An image i ⊆ 𝒯 is a set of tags that the image specification supports, representing
a possible grouping for operators’ tags satisfied by the image.</p>
        <p>A designer of a data pipeline can then tag an operator to indicate the runtime environment required
for its execution. By default, when deploying a data pipeline for execution, the execution framework
searches for an OS image whose tags match all the required tags of the operators in the data pipeline.
In this case, all operators become part of a default “group”. Operators in a group are deployed as a
container running in a Kubernetes Pod [20]: they share the libraries and packages available in the
environment defined by the containerized OS image. Definition 8 formalizes such matching of tags to
images as a group.</p>
        <p>Definition 8 (Group). Let 𝒢 be the set of groups in a data pipeline and 𝒪 be the set of operators in a
data pipeline. A group g ∈ 𝒢 then is a tuple ⟨O, i⟩, where O ⊆ 𝒪 is the set of operators in the group and
i ∈ ℐ is the image associated to the group. Since all operators have to be grouped, explicitly or implicitly,
the intersection between the operators in each group has to be empty, i.e., Oᵢ ∩ Oⱼ = ∅ for all gᵢ, gⱼ ∈ 𝒢
with gᵢ ≠ gⱼ.</p>
        <p>However, when the execution framework cannot find a single matching OS image, the user must
manually group operators according to their required tags until the remaining non-grouped operators
can be matched by a single OS image. Alternatively, in order to have control over which requirements a
user needs for specific tasks, users can manually group operators and create an OS image specific to the
grouped operators. Finally, we bring together the notion of an undeployed Data Pipeline incorporating
the elements from Definitions 4–6 in Definition 9.</p>
        <p>Definition 9 (Data Pipeline). Let e = ⟨oᵢ, oⱼ⟩ be an edge tuple representing the directed connection
between two operators oᵢ and oⱼ. A data pipeline then is a tuple P = ⟨𝒢, ℰ⟩, containing the specified
groups and edges between all operators.</p>
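        <p>As an illustration only, Definitions 4–9 can be sketched as a small Python data model (class and field names are ours, not VFlow’s), including the checks that a group’s image must satisfy every tag of its operators and that groups must be disjoint.</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Operator:                 # Definition 6
    id: str
    sdk: str                    # Definition 4: the SDK it is developed with
    tags: frozenset             # Definition 5: requirements to execute

@dataclass
class Group:                    # Definition 8
    operators: tuple
    image: frozenset            # Definition 7: set of tags the image supports

    def is_valid(self):
        # every operator's required tags must be satisfied by the image
        return all(op.tags.issubset(self.image) for op in self.operators)

@dataclass
class Pipeline:                 # Definition 9
    groups: tuple
    edges: tuple                # directed (source_id, target_id) pairs

    def groups_disjoint(self):
        # no operator may belong to more than one group
        seen = set()
        for g in self.groups:
            ids = {op.id for op in g.operators}
            if seen & ids:
                return False
            seen |= ids
        return True

o1 = Operator("o1", "golang", frozenset({"golang"}))
o2 = Operator("o2", "golang", frozenset({"golang", "spt-1"}))
grp = Group((o1, o2), image=frozenset({"golang", "spt-1"}))
pipe = Pipeline(groups=(grp,), edges=(("o1", "o2"),))
```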
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The pipeline optimization problem</title>
      <p>At the design level, data pipelines are graphs in which nodes correspond to operators and edges represent
the flow of data. There are multiple concrete implementations of data pipelines, including open-source
systems such as Flink [21] and Storm. In this work, we focus specifically on the
implementation provided in the VFlow engine, where operators perform computing operations with
data. Operators in VFlow can be written using different programming languages using SDKs provided
by the data pipeline framework.</p>
      <p>Given a data pipeline  from Definition 9, with a set of connected operators, we must find the optimal
grouping configuration, in terms of total execution time, which satisfies the constraints imposed by the
operator tags  . It is therefore an optimization problem under constraints, where the constraints are
the tags of operators, which must match the tags of an OS image to belong to the same group.</p>
      <p>The runtime environment in which operators execute can be set by creating groups of operators.
These runtime environments execute in container images (e.g., Docker images), where each group
ideally has the same image associated to it. This allows the operators of the same group to share the resources,
libraries, and packages available in that environment. In order to have control over which requirements
they need for specific tasks, users can manually group operators and create an image specific to the
grouped operators. In case operators are not explicitly grouped by users,
they belong to an implicit default group, to which the pipeline framework automatically associates an image
from the available images and stock images that are tailored to the basic requirements of each SDK.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Automating operator grouping</title>
      <p>In this section, we detail three optimization methods: two greedy approaches and one approach based
on deriving specific costs. As a baseline, we define a random approach that optimizes the pipeline
by selecting random images and adding random operators. We use PDDL to create descriptions of
state-space search problems involving a variety of domains independently of the underlying AI search
engine, enabling us to solve the problem of Definition 2. Such flexibility, coupled with the number of
efficient planning systems available [22, 23], allows us to prototype a number of optimization strategies
for grouping in order to achieve desirable deployment properties.</p>
      <p>In order to generate an optimized graph, we provide the domain and problem specifications in PDDL to a planner such
as ENHSP [24, 25], and the resulting plan can then be translated into an optimized graph. We illustrate
this workflow in Figure 2.</p>
      <sec id="sec-4-1">
        <title>4. We refer to data pipelines as graphs and use the terms interchangeably.</title>
        <sec id="sec-4-1-1">
          <title>4.1. Connection heuristic</title>
          <p>The connection greedy approach prioritizes the creation of groups with a minimum number of intergroup
communication links. To encode this approach using PDDL, we use diferent weights for intragroup and
intergroup communication, respectively 5 and 20. We assign a medium cost for instantiating groups,
since creating many groups will lead to more intergroup connections. Figure 3a exemplifies a solution
using this approach, where the number of intergroup communication links is only one, and two operator
groups are created.</p>
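          <p>The effect of these weights can be sketched with a small scoring function (our own illustration: 5 per intragroup edge, 20 per intergroup edge, and an assumed “medium” per-group instantiation cost of 10; the actual PDDL encoding differs in form).</p>

```python
INTRA, INTER, GROUP_COST = 5, 20, 10  # GROUP_COST: assumed "medium" value

def grouping_cost(edges, assignment):
    """Score a grouping: edges inside a group are cheap, edges that
    cross groups are expensive, and each instantiated group adds cost."""
    total = GROUP_COST * len(set(assignment.values()))
    for src, dst in edges:
        total += INTRA if assignment[src] == assignment[dst] else INTER
    return total

edges = [("o1", "o2"), ("o2", "o3"), ("o3", "o4")]
# cut the chain once vs. alternating groups on every edge
one_cut = grouping_cost(edges, {"o1": "g1", "o2": "g1", "o3": "g2", "o4": "g2"})
many_cut = grouping_cost(edges, {"o1": "g1", "o2": "g2", "o3": "g1", "o4": "g2"})
```

          <p>On a four-operator chain, cutting it once scores lower than alternating groups on every edge, which is exactly the behavior this heuristic rewards.</p>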
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2. Node heuristic</title>
          <p>The node approach prioritizes the creation of groups with as many nodes as possible, greedily searching
for groups with a larger number of operators. To encode this approach using PDDL, we use the same
weight of 5 for both intragroup and intergroup communication. We assign a very high cost for
instantiating groups, which forces the planner to search for configurations with a very small number
of groups. Figure 3b illustrates an example of a solution using this heuristic. As we can see, this
solution ignores the number of intergroup connections, leading to two groups with many intergroup
connections.</p>
          <p>(Figure 3: example groupings of operators OPERATOR_1 through OPERATOR_6 into Group 1 and Group 2 under the connection and node heuristics. Figure 4a: propagation of the Fibonacci step parameter in a parallel graph, with a generator feeding operators that compute F(n), F(n + fs), and F(n + 2fs).)</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>4.3. Random baseline</title>
          <p>The random baseline approach groups operators by randomly selecting an image from the list of
available images and then grouping all operators satisfied by the selected image, if there are any. This
algorithm randomly selects images and groups operators until no operators are left to be grouped.
This process of randomly selecting groups works as follows. First, we randomly select an available
image capable of executing operators based on the available tags of this image. Second, we select all
n operators that can run using this image. Third, if n is greater than 0, we
randomly compute a number from the range [1, n], which defines how many of these possible operators
our new group will contain. Finally, we generate a new group with the selected operators and repeat
this process until all operators are assigned to a group.</p>
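          <p>The four steps above can be sketched as follows (a simplified illustration in Python; we assume every operator is satisfiable by at least one image, and all names are hypothetical).</p>

```python
import random

def random_grouping(operators, images, rng=random):
    """Randomly group operators, as the random baseline does. `operators`
    maps operator id to its required tags; `images` maps image name to the
    tags it supports (assumed to cover every operator)."""
    remaining = dict(operators)
    groups = []
    while remaining:
        # 1) randomly select an available image
        image, supported = rng.choice(sorted(images.items()))
        # 2) select all operators that can run using this image
        runnable = [oid for oid, tags in remaining.items()
                    if tags.issubset(supported)]
        if not runnable:
            continue
        # 3) draw how many of them the new group will contain, in [1, n]
        chosen = rng.sample(runnable, rng.randint(1, len(runnable)))
        # 4) create the group; repeat until every operator is grouped
        groups.append((image, chosen))
        for oid in chosen:
            del remaining[oid]
    return groups
```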
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation methodology</title>
      <p>The evaluation of our solution requires that we design different examples of data pipelines and generate
a synthetic workload that incurs substantive computation. We instantiate data pipelines following the two
most common types of patterns that users design: sequential and parallel executions. As a workload
choice, we choose the computation of Fibonacci sequences, which allows us to parameterize the amount
of work that needs to be computed by an operator by defining the size of the sequence. We detail these
in the following sections.</p>
      <sec id="sec-5-1">
        <title>5.1. Workload configuration</title>
        <p>In order to evaluate the grouping methods, we use two patterns of synthetic data pipelines built from
the same operator, which computes the nth number of the Fibonacci sequence and sends it to the next
operator. This operator allows us to control the processing time of a data pipeline. The two patterns of
data pipelines, illustrated in Figure 5, are as follows:
1. a topology representing sequential computation steps;
2. a topology representing independent and concurrent paths of sequential computation steps.</p>
        <p>We define two parameters to generate pipeline optimization problems with varying degrees of
complexity. These parameters are: number of special tags (later called special_ops) and Fibonacci
step.</p>
        <p>The special tag represents operators with specific library requirements beyond what is included in
the standard images associated with operators in each sub-engine. These operators necessitate specific images to
be run in VFlow. By default, all operators have the default "golang" tag, and all images have the default
tag. In addition to the default tag "golang", we create three distinct special tags (spt-1, spt-2, and spt-3),
where each special tag belongs to a single specific image. When the parameter "special_ops" has the
value n, then n randomly selected operators will receive a special tag. If n ≤ 3, then all the operators
with special tags will have distinct tags among spt-1, spt-2, and spt-3. Otherwise, two operators can have
the same special tag. For example, if n = 4, two operators will have the special tag spt-1.</p>
        <p>The goal of the special tag is to increase the complexity of the optimization problem, mirroring real
deployment scenarios where operators require specific libraries (e.g., machine learning frameworks,
specialized data processing tools) that cannot coexist in a single image due to dependency conflicts
or size constraints. By default, all generated pipelines already contain one special tag to ensure the
problem cannot be trivially solved by including all operators in a single group. As the number of special
tags increases, the problem of grouping the operators becomes more difficult.</p>
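        <p>The assignment of special tags can be sketched as below (our own reading of the generator logic; the function and variable names are hypothetical).</p>

```python
import random

SPECIAL_TAGS = ("spt-1", "spt-2", "spt-3")

def assign_special_tags(operator_ids, n, rng=random):
    """Attach a special tag to n randomly chosen operators: tags stay
    distinct while n is at most 3 and repeat cyclically beyond that."""
    chosen = rng.sample(list(operator_ids), n)
    return {oid: SPECIAL_TAGS[i % len(SPECIAL_TAGS)]
            for i, oid in enumerate(chosen)}

# with special_ops = 4, one tag (spt-1) is necessarily repeated
tags = assign_special_tags(["op-%d" % i for i in range(12)], 4, random.Random(1))
```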
        <p>(Figure: setup, execution, and total times (in seconds) of the Conn, Node, Random, and Default approaches for the line topology: (a) fibo_step = 1 and special_ops = 2; (b) fibo_step = 1 and special_ops = 4; (c) fibo_step = 3 and special_ops = 4.)</p>
        <p>The Fibonacci step is an application parameter related to increasing the n values. The n value starts
at 5 (the fifth element of the Fibonacci sequence) and the Fibonacci step ranges from 2 to 4. As this
parameter increases, the overall execution time of the graph increases steeply. Figure 4a illustrates the
propagation of the Fibonacci step parameter in a parallel topology, where F(n) is the algorithm that
computes the nth element of the sequence, and fs is the Fibonacci step parameter. The same logic of an
individual line of the parallel topology is applied to the linear topology pipeline.</p>
        <p>The pipelines consist of 14 operators, where 12 are computing operators, one is a generator, and one
is a terminator. The first operator is a generator, which sets up the workload for the other operators.
The generator sends a message to all connected operators, carrying the n value (5 by default) for the
Fibonacci operator. This message is sent 5 times, to ensure continuous communication throughout the
execution of the operator. The generator is connected with a terminator operator, which signals the
end of the execution of the pipeline once all other operators finish. Finally, the other twelve operators
are Fibonacci operators, which receive a message with the n value and compute the nth element of the
Fibonacci sequence. After computing the sequence, each operator sends a message to the next one with
n + fs, which sets up the execution of the next operator.</p>
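        <p>The workload can be simulated as in the sketch below (our own illustration: fib(n) stands in for the Fibonacci operator’s computation, each operator forwards n + fs to its successor as described above, and message passing and the generator/terminator are abstracted away).</p>

```python
def fib(n):
    """Compute the nth Fibonacci number iteratively."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def run_line_pipeline(n_start=5, fibo_step=1, n_operators=12):
    """Simulate a line of Fibonacci operators: each computes fib(n)
    and forwards n + fibo_step to the next operator."""
    results, n = [], n_start
    for _ in range(n_operators):
        results.append(fib(n))
        n += fibo_step          # the message sent to the next operator
    return results

out = run_line_pipeline()       # twelve operators, starting at n = 5
```

        <p>Raising fibo_step makes n grow faster along the line, which is why the overall execution time increases steeply with this parameter.</p>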
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Experiment Execution</title>
        <p>To optimize data pipelines, we set up a framework to measure our experiments, which comprises the
following processes:
• a parser to convert VFlow graphs into a planning task in PDDL;
• a numeric planner to compute a grouping strategy to optimize the parsed pipeline, yielding a
PDDL-based plan; and
• a converter to interpret this PDDL plan into a VFlow graph, which is executed in an installation
of VFlow and extracts performance metrics, e.g., execution time.</p>
        <p>VFlow includes a profiling tool to evaluate the performance of the designed approaches, which allows us
to collect data on three metrics: setup time, execution time, and total time. The setup time measures how
long it took to prepare the environment; execution time tracks how long the pipeline took to execute
after the setup phase. The total time is the setup time plus the execution time.</p>
        <p>Figure 4b illustrates the architecture of our approach. VFlow exports data pipeline graphs in JSON
format, which serves as the medium of communication between VFlow and our optimizer. A graph
described in JSON is sent to a converter, which translates the JSON into a planning description language, PDDL.
The PDDL description comprises two files: the domain and the problem. The domain is where we define the optimization
strategy (the ones we detailed in Section 5.1), dictating how we optimize the given pipeline. The problem
comprises the details of the given graph, detailing the operators (as black boxes) that are going to run,
how they are connected forming the topology of the given graph, and the necessary tags to run each
operator. With these files we execute the planner, ENHSP [23], which outputs a plan. The plan consists of
the optimization steps, which define the number of groups and which specific images are going to be used
to optimize the graph. We then use a converter to translate the plan (a sequence of optimization steps) into
a JSON description of the optimized graph, which we run using the VFlow profiler.</p>
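        <p>The converter step can be sketched as follows. The JSON schema, predicate names, and domain name below are illustrative assumptions, not the exact encoding our converter emits:</p>

```python
import json

def json_to_pddl_problem(graph_json, name="pipeline"):
    """Convert a VFlow-style JSON graph (assumed schema) into a PDDL problem.

    Assumed schema: {"operators": [{"id": ..., "tags": [...]}, ...],
                     "edges": [[src, dst], ...]}.
    """
    graph = json.loads(graph_json)
    ops = [op["id"] for op in graph["operators"]]
    facts = []
    for op in graph["operators"]:
        # Tags encode the (black-box) software requirements of each operator.
        for tag in op.get("tags", []):
            facts.append("(requires-tag {} {})".format(op["id"], tag))
    for src, dst in graph["edges"]:
        # Edges encode the topology of the pipeline.
        facts.append("(connected {} {})".format(src, dst))
    return "\n".join([
        "(define (problem {})".format(name),
        "  (:domain pipeline-grouping)",
        "  (:objects {} - operator)".format(" ".join(ops)),
        "  (:init {})".format(" ".join(facts)),
        "  (:goal (and {})))".format(" ".join("(grouped {})".format(o) for o in ops)),
    ])
```

        <p>The resulting problem file, together with the fixed domain file, is what the planner takes as input.</p>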
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experimental results</title>
      <p>In the following, we show the quantitative results obtained from running our experimental evaluation.
We first describe the experimental environment in which we deployed the data pipelines, followed by
a description of the results obtained for each topology of graphs that we generated, i.e., line and parallel.</p>
      <sec id="sec-6-1">
        <title>6.1. Experimental setup</title>
        <p>The experiments are executed in a virtualized cloud cluster composed of two worker nodes. Docker
containers using the Centos-72 image run within the worker nodes, where the containers are organized
in pods managed by Kubernetes (version 1.19.13).</p>
        <p>To evaluate our approaches, we run each graph five times to account for variations in the cluster’s
performance. We distinguish the measurements of the first execution (cold start) from the other
executions (warm starts). As metrics, we compute the mean setup time, execution time, and total time
discussed in the previous section.</p>
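        <p>The aggregation over the five runs can be sketched as follows (illustrative only; the list of measured times is hypothetical):</p>

```python
from statistics import mean, stdev

def summarize_runs(times_ms):
    """Report the first run (cold start) separately and average the warm runs."""
    cold, warm = times_ms[0], times_ms[1:]
    return {
        "cold_start": cold,
        "warm_mean": mean(warm),
        "warm_stdev": stdev(warm) if len(warm) > 1 else 0.0,
    }
```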
        <p>We compare our approaches (connection, node, and random) to the default baseline that has no
grouping, assuming that all operators run in a single group. For line topology experiments, it
represents an optimal grouping, as all operators are sequential and in the same image. For parallel
topology experiments, the default baseline runs all operators in a single group, which prevents them
from benefiting from the parallelism of the topology. Note that it would be impossible to run the graphs with
only one image if the special tags were real software constraints, so the default baseline is not
a viable strategy outside of this evaluation scenario. Considering this, we include it as a method for
comparison, highlighting the effects of having no groups in different scenarios.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Line topology</title>
        <p>Our first batch of experiments uses the line topology with sequential operators. We vary the parameters
described in Section 5.1, running nine distinct baseline graphs, and optimizing each graph using all four
approaches, resulting in 36 graphs. Figures 6a, 6b, 6c show the average results in milliseconds for all
approaches, dividing these results into Setup, Execution, and Total time. Results from these experiments
show that the heuristics provide performance gains, mainly in terms of setup time. Figures 6b and 6c
show that the connection-based heuristic also achieved performance gains in execution time. Such
results indicate that planning heuristics are effective at optimizing stream processing applications.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Parallel topology</title>
        <p>The experiments with parallel lines topology use three lines of four operators. We vary the parameters
described in Section 5.1, running nine distinct baseline graphs with the same procedure we did for
the line topology. Figures 7a, 7b, and 7c show the results obtained for three distinct combinations
of parameters. Figure 7b shows the random approach performing well, even outperforming some
heuristics in execution time. The random approach creates more groups, which increases the graph's
setup time. Hence, the difference between the approaches in total time is insignificant if we consider the standard
deviation. However, as the execution time is often longer with higher workloads, in some cases the
execution speed-up can compensate for the higher setup time.</p>
        <p>Figures 8a and 8b show the results for the first execution of two parameter configurations. As we can
see, the setup time for the first execution is much higher, which impacts the performance of the random
approach when the workload is small (Fibonacci step = 1). As the workload increases, execution time
gains outweigh setup time losses.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Related Work</title>
      <p>Researchers have applied automated planning to various stages of software engineering, including
application synthesis, deployment, and optimisation of distributed systems. In stream processing
and cloud-native environments, planning methods help manage component composition, resource
allocation, and runtime adaptation. These approaches differ in formalism (classical planning, HTN
models, preference-based search, and cost-sensitive optimisation) and target distinct problems such as
flow graph generation, operator placement, and service orchestration. In what follows, we illustrate
these directions and provide context for our contribution.</p>
      <p>[Figure: Setup, Execution, and Total time (ms) for the parallel graphs with fibo_step = 1 and fibo_step = 2 (special_ops = 2), comparing the Conn, Node, and Random approaches.]</p>
      <p>Arshad et al. [26] develop Planit, a system that uses temporal planning to automate the initial
deployment and dynamic reconfiguration of distributed software systems. Their planner, LPG, generates
plans that minimise total execution time, using durative actions and explicit time annotations. While
their work addresses deployment, it targets general-purpose component-based systems, focusing on
scheduling actions under temporal constraints. In contrast, our work assumes a fixed data pipeline and
optimises its deployment over a distributed infrastructure using numeric planning with state-dependent
action costs. We do not model reconfiguration or time-based scheduling, and they do not optimise
for execution efficiency of the deployed system beyond plan duration. The two approaches operate at
different levels of abstraction and solve distinct optimisation problems.</p>
      <p>Early work by Riabov and Liu [27] introduces SPPL, a domain-specific extension of classical planning
tailored to stream processing composition. Their planner constructs workflows by selecting and
connecting components to satisfy goal predicates, modelling streams as immutable entities with structured
input/output ports and predicate propagation. This approach exposes structural regularities in stream
processing domains that traditional PDDL encodings have struggled to exploit efficiently. Building on
this foundation, Sohrabi et al. [28] develop a preference-based HTN planning framework that translates
Cascade flow patterns into HTN domains, enabling more expressive composition and optimisation.
Their work inherits the compositional focus of SPPL but replaces its custom formalism with a
general-purpose HTN planner, allowing integration of user preferences and improved scalability. Unlike prior
work on stream processing composition using Classical and HTN planning [27, 28], which focuses on
synthesising pipeline structures from abstract flow patterns, our work assumes a fixed pipeline and
addresses the orthogonal problem of optimally deploying it on a distributed infrastructure. While all
three approaches leverage AI planning, we focus on numeric planning with state-dependent action
costs to minimise execution time, rather than preference-based HTN planning for goal satisfaction.</p>
      <p>Georgievski et al. [29] and Quenum and Josua [30] also focus on composition, applying HTN planning
to generate cloud-ready applications from abstract component models. The former handles instance
creation, duplication, and configuration processes using a formal model (Aeolus). By contrast, the
latter generates hierarchical plans, matches actions to serverless functions across providers, and refines
compositions using constraint satisfaction over QoS attributes. Unlike [28], these approaches target cloud services
rather than stream processing. Like the other work in this section, they apply planning to optimise
structure or configuration. In contrast, the contribution of this paper optimises deployment.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Results discussion and closing remarks</title>
      <p>Overall, the connection heuristic approach had the best results. However, albeit with a high standard
deviation, the random strategy with more groups has the best results for the parallel topology. We
believe that this happens because the random approach created many groups that are capable of
executing the processing steps in parallel. Moreover, the results show that more groups for the line
topology do not improve the total execution time. The connection strategy, although consistently good
in the line topology, does not consider possible parallel executions.</p>
      <p>Our results show that the usage of AI techniques can improve the setup and execution time of pipelines
by optimizing the grouping of operators. In this work, our optimization goal was the total execution
time. In the future, other desirable evaluation metrics can be included, such as the infrastructure cost of
running a graph (which includes the number of pods used) or the resilience of the deployed graph.</p>
      <p>A hybrid strategy could also be explored to optimize each line of the parallel topology individually.
In future work, we could evaluate grouping strategies based on TCO metrics that measure the total
performance time per dollar (on a given cloud platform).</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgements</title>
      <p>We are grateful for the financial support and computing resources from SAP Labs. We also thank the
reviewers who pointed us to related work we were not familiar with at the outset of our project.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used no Generative AI tools for any part of the
writing, editing, or revision.</p>
      <p>[14] K. Chowdhary, Fundamentals of Artificial Intelligence, Springer, 2020.
[15] M. Ghallab, D. S. Nau, P. Traverso, Automated Planning and Acting, 1st ed., Elsevier, 2016. 368p.
[16] M. Fox, D. Long, PDDL2.1: An extension to PDDL for expressing temporal planning domains, Journal of Artificial Intelligence Research (2003).
[17] R. Fikes, N. Nilsson, STRIPS: A new approach to the application of theorem proving to problem solving, Artificial Intelligence 2 (1971) 189–208.
[18] F. Hueske, V. Kalavri, Stream Processing with Apache Flink: Fundamentals, Implementation, and Operation of Streaming Applications, O'Reilly Media, 2019.
[19] J. Turnbull, The Docker Book: Containerization Is the New Virtualization, James Turnbull, 2014.
[20] D. Bernstein, Containers and cloud: From LXC to Docker to Kubernetes, IEEE Cloud Computing 1 (2014) 81–84.
[21] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink™: Stream and batch processing in a single engine, IEEE Data Engineering Bulletin (2015).
[22] E. Scala, P. Haslum, S. Thiébaux, Heuristics for numeric planning via subgoaling, in: International Joint Conference on Artificial Intelligence, 2016.
[23] E. Scala, Expressive Numeric Heuristic Search Planner (ENHSP-20), 2020. URL: https://gitlab.com/enricos83/ENHSP-Public/-/tree/enhsp-20.
[24] E. Scala, P. Haslum, D. Magazzeni, S. Thiébaux, Landmarks for numeric planning problems, in: International Joint Conference on Artificial Intelligence, 2017, pp. 4384–4390.
[25] P. Haslum, F. Ivankovic, M. Ramírez, D. Gordon, S. Thiébaux, V. Shivashankar, D. S. Nau, Extending classical planning with state constraints: Heuristics and search for optimal planning, Journal of Artificial Intelligence Research 62 (2018) 373–431.
[26] N. Arshad, D. Heimbigner, A. L. Wolf, Deployment and dynamic reconfiguration planning for distributed software systems, in: 15th IEEE International Conference on Tools with Artificial Intelligence, IEEE Computer Society, 2003, pp. 39–46.
[27] A. Riabov, Z. Liu, Planning for stream processing systems, in: M. M. Veloso, S. Kambhampati (Eds.), Proceedings of the Twentieth AAAI Conference on Artificial Intelligence and the Seventeenth Innovative Applications of Artificial Intelligence Conference, AAAI Press / The MIT Press, 2005, pp. 1205–1210.
[28] S. Sohrabi, O. Udrea, A. Ranganathan, A. Riabov, HTN planning for the composition of stream processing applications, in: D. Borrajo, S. Kambhampati, A. Oddi, S. Fratini (Eds.), Proceedings of the Twenty-Third International Conference on Automated Planning and Scheduling, AAAI, 2013.
[29] I. Georgievski, F. Nizamic, A. Lazovik, M. Aiello, Cloud ready applications composed via HTN planning, in: 10th IEEE Conference on Service-Oriented Computing and Applications, IEEE Computer Society, 2017, pp. 81–89.
[30] J. G. Quenum, J. Josua, Multi-cloud serverless function composition, in: I. Brandic, R. Sakellariou, J. Spillner (Eds.), IEEE/ACM 14th International Conference on Utility and Cloud Computing, ACM, 2021, pp. 9:1–9:10.</p>
    </sec>
  </body>
  <back>
  </back>
</article>