                                Orchestrating DAG Workflows on the Cloud-to-Edge
                                Continuum
                                Angelo Marchese1,† , Orazio Tomarchio1,∗,†
                                1
                                    Dept. of Electrical Electronic and Computer Engineering, University of Catania, Catania, Italy


                                                                         Abstract
                                                                         Orchestrating data streaming and analytics applications presents challenges due to the increasing data
                                                                         volume and time-sensitive requirements. The combination of cloud and edge computing paradigms
                                                                         attempts to avoid their pitfalls while taking the best of both worlds: cloud scalability and compute closer
                                                                         to the edge where data is typically generated. However, placing microservices in such heterogeneous
                                                                         environments while meeting QoS constraints is a challenging task due to the geo-distribution of nodes and
                                                                         varying computational resources. In this paper we propose to extend Kubernetes to enable dynamic DAG
                                                                         workflow orchestration, taking into account both infrastructure and application states. Our approach
                                                                         aims to reduce QoS violations and improve application response time in Cloud-to-Edge continuum
                                                                         scenarios.

                                                                         Keywords
                                                                         DAG workflows, Containers technology, Orchestration, Kubernetes, Kubernetes scheduler




                                1. Introduction
                                The orchestration of modern data streaming and analytics applications is a complex problem
                                to deal with, considering the increasing amount of data that needs to be processed and the
                                requirement for deadline-constrained response times [1, 2]. These applications are typically
                                implemented as DAG (Directed Acyclic Graph) workflows, where data collected from geo-
                                graphically distributed sources, like users or sensors, is moved between different processing
                                microservices. Today, Cloud Computing offers a reliable and scalable environment to execute
these applications. However, Cloud data centers are far from the network edge and thus from end users and devices, which can lead to high application response times and limited throughput. The Edge Computing paradigm has emerged as a promising technology to mitigate this problem by moving computation towards the network edge [3, 4]. However, Edge environments are characterized by resource-constrained nodes, which can also negatively affect application response times. Therefore, to take advantage of both the abundant computational resources of the Cloud and the reduced network distance between data sources and Edge nodes, Cloud and Edge infrastructures are combined to form the Cloud-to-Edge continuum, an environment for executing distributed DAG workflows on multiple nodes organized in clusters.
                                ITADATA2023: The 2nd Italian Conference on Big Data and Data Science, September 11–13, 2023, Naples, Italy
                                ∗
                                    Corresponding author.
                                †
                                    These authors contributed equally.
angelo.marchese@phd.unict.it (A. Marchese); orazio.tomarchio@unict.it (O. Tomarchio)
ORCID: 0000-0002-0877-7063 (A. Marchese); 0000-0003-4653-0480 (O. Tomarchio)
                                                                       © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
However, establishing where to place each microservice of a workflow is a complex problem,
considering the geo-distribution of nodes and their heterogeneity in terms of computational
resources [5, 6].
   Kubernetes1 is today the de-facto orchestration platform for the deployment, scheduling
and management of containerized applications in Cloud environments. Kubernetes was initially conceived for the orchestration of general-purpose web services, but today it is also used for data analytics and AI training workflows [7, 8]. However, Kubernetes has not been
designed for the orchestration of complex DAG workflows in geo-distributed and heterogeneous
environments such as the aforementioned Cloud-Edge infrastructures [9, 10]. In particular, the
default Kubernetes scheduling and orchestration strategy presents some limitations, because it considers neither the ever-changing resource availability on cluster nodes and the node-to-node network latencies, nor the current application state, in terms of the resource usage of each microservice and the communication relationships between microservices [11]. Considering
these factors when scheduling DAG workflows is critical in order to reduce QoS violations on
the application response time.
To deal with these limitations, in this work we propose to extend the Kubernetes platform to adapt it to environments distributed across the Cloud-to-Edge continuum. Our approach enhances Kubernetes with a dynamic DAG workflow orchestration and scheduling strategy that considers the current infrastructure state when determining a placement for each microservice and continuously tunes the microservice placement based on the ever-changing infrastructure and application states.
   The rest of the paper is organized as follows. Section 2 provides some background information
about the Kubernetes platform and discusses in more detail some of its limitations that motivate
our work. In Section 3 the proposed approach is presented, providing some implementation
details of its components, while Section 4 provides results of our prototype evaluation in a
testbed environment. Section 5 examines related work and, finally, Section 6 concludes the paper.


2. Background And Motivation
Kubernetes is a container orchestration platform which automates the lifecycle management
of distributed applications deployed on large-scale node clusters [12]. A Kubernetes cluster
consists of a control plane and a set of worker nodes. The control plane is made up of dif-
ferent management services that run inside one or more master nodes. The worker nodes
represent the execution environment of containerized application workloads. In Kubernetes,
the minimal deployment unit is the Pod, which in turn contains one or more containers. In a microservices-based application, each Pod corresponds to a single microservice instance.
   Among control plane components, the kube-scheduler2 is in charge of selecting an optimal
cluster node for each Pod to run on, taking into account Pod requirements and node
resources availability. Each Pod scheduling attempt is split into two phases: the scheduling
cycle and the binding cycle, which in turn are divided into different sub-phases. During the

1
    https://kubernetes.io
2
    https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler
scheduling cycle a suitable node for the Pod to schedule is selected, while during the binding
cycle the scheduling decision is applied to the cluster by reserving the necessary resources and
deploying the Pod to the selected node. Each sub-phase of both cycles is implemented by one or
more plugins, and a single plugin can in turn implement more than one sub-phase. The Kubernetes scheduler is meant to be extensible: each scheduling sub-phase represents an extension point at which one or more custom plugins can be registered.
The placement decisions of the Kubernetes scheduler are based on the cluster state at the point in time when a new Pod appears for scheduling. As Kubernetes clusters are very dynamic and their state changes over time, placement decisions better than the initial ones may become possible later on. Several reasons can motivate the migration of a Pod from one node to another, for example node under-utilization or over-utilization, Pod or node affinity requirements that are no longer satisfied, and node failures or additions.
   To this aim a descheduler component has been recently proposed as a Kubernetes sub-project3 .
This component is in charge of evicting running Pods so that they can be rescheduled onto more
suitable nodes. The descheduler does not schedule replacements for evicted Pods but relies on
the default scheduler for that. The descheduler’s policy is configurable and includes strategies
that can be enabled or disabled.
   While the default Kubernetes scheduler and descheduler implementations are suitable for
the orchestration of DAG workflows on centralized Cloud data centers, characterized by high
and uniform computational resources and low network latencies, they present some limitations
when dealing with node clusters distributed across the Cloud-to-Edge continuum. The Kubernetes scheduler does not place microservices based on their resource and communication requirements and the current infrastructure state, in terms of node resource availability and node-to-node network latencies. This means that microservices with high computational resource requirements could be placed on resource-constrained nodes, and microservices that exchange traffic with each other are not always placed on nearby nodes. Likewise, the Kubernetes descheduler does not reschedule Pods based on the ever-changing infrastructure and application states. This can lead to higher application response times and more frequent QoS violations.


3. Proposed Approach
3.1. Overall Design
Considering the limitations described in Section 2, in this work we propose to extend the
default Kubernetes orchestration strategy in order to adapt its usage to dynamic Cloud-to-
Edge continuum environments. Leveraging our previous work presented in [13, 14], the main idea of the proposed approach is that, in this context, the orchestration and scheduling of complex DAG workflows should consider both the dynamic state of the infrastructure where the workflow is executed and the run-time resource and communication requirements of the workflow microservices. In particular, in our approach, current node resource availability, node-to-node
network latencies, resource usage of microservices and communication intensity between
microservices are continuously monitored and taken into account during workflow scheduling.

3
    https://github.com/kubernetes-sigs/descheduler
[Figure 1 sketches the architecture: workflow and infrastructure telemetry from the Kubernetes cluster is collected by a Metrics Server; the Workflow Monitoring Agent and the Cluster Monitoring Operator build the workflow graph and the cluster graph, which are used by the Custom Scheduler and the Custom Descheduler.]
Figure 1: Overall architecture


Furthermore, a key point of our approach is the requirement to continuously tune the placement of the workflow based on the ever-changing infrastructure state of Cloud-Edge environments, the run-time resource usage of microservices and their communication interactions.
   Figure 1 shows a general model of the proposed approach. The current infrastructure and
application states are monitored and all the telemetry data are collected by a metrics server. For
the infrastructure, node resource availability and node-to-node latencies are monitored, while
for the application, CPU and memory usage of microservices and the traffic amount exchanged
between them are monitored. Based on the infrastructure telemetry data the cluster monitoring
agent determines a cluster graph with the set of available resources on each cluster node and the
network latencies between them. Similarly, the workflow monitoring agent uses microservice telemetry data to determine a workflow graph whose nodes represent microservices with their current resource usage and whose edges represent the communication channels between them, each weighted by the amount of traffic sent through that channel. The cluster and workflow graphs are then used by the custom scheduler to determine a placement for each application Pod, and by the custom descheduler to take Pod rescheduling actions if better scheduling decisions can be made. Further details on the components of the proposed approach
are provided in the following subsections.
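To make the two graph representations concrete, the following sketch (our illustration, not code from the prototype) models them as plain Python records; the field names mirror the node and Deployment annotations described in the next subsection.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ClusterNode:
    name: str
    available_cpu: float        # CPU currently free on the node
    available_memory: float     # memory currently free on the node
    network_cost: Dict[str, float] = field(default_factory=dict)  # peer node -> latency-based cost

@dataclass
class WorkflowMicroservice:
    name: str
    cpu_usage: float            # average CPU usage of the microservice's Pods
    memory_usage: float         # average memory usage of the microservice's Pods
    traffic: Dict[str, float] = field(default_factory=dict)       # peer microservice -> exchanged traffic

# The cluster graph and the workflow graph are then collections of these records,
# keyed by node name and microservice name respectively.
ClusterGraph = Dict[str, ClusterNode]
WorkflowGraph = Dict[str, WorkflowMicroservice]
```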

3.2. Workflow and Infrastructure Monitoring
The cluster monitoring agent runs as a controller in the Kubernetes control plane. It periodically
determines the cluster graph with the currently available CPU and memory resources on each
cluster node and the node-to-node network latencies.
   During each execution the list of cluster Node resources is fetched from the Kubernetes API
server. First, for each node 𝑛𝑖 the CPU and memory values currently available on it, 𝑐𝑝𝑢𝑖 and
𝑚𝑒𝑚𝑖 respectively, are determined. These values are fetched by the agent from a Prometheus4
metrics server, which in turn collects them from node exporters executed on each cluster
node. The 𝑐𝑝𝑢𝑖 and 𝑚𝑒𝑚𝑖 parameters are then assigned as values for the available-cpu and
available-memory annotations of the node 𝑛𝑖 .
   Then, for each pair of nodes 𝑛𝑖 and 𝑛𝑗 their network cost 𝑛𝑐𝑖,𝑗 is determined. The 𝑛𝑐𝑖,𝑗 parameter

4
    https://prometheus.io/
is proportional to the network latency between nodes 𝑛𝑖 and 𝑛𝑗. Network latency metrics are fetched by the agent from the Prometheus metrics server, which in turn collects them from network probe agents executed on each cluster node as Pods managed by a DaemonSet. These agents are configured to periodically send ICMP traffic to all the other cluster nodes in order to measure the round-trip time. For each node 𝑛𝑖 the agent assigns to it a set of annotations network-cost-𝑛𝑗, with values equal to those of the corresponding 𝑛𝑐𝑖,𝑗 parameters. Finally, the cluster graph with the updated available CPU and memory resources and the network cost values for each node is submitted to the Kubernetes API server.
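As an illustration of one iteration of the cluster monitoring agent, the following Python sketch queries Prometheus over its HTTP API and writes the results back as Node annotations through the official Kubernetes client. The Prometheus address, the PromQL expressions and the probe metric name are assumptions for illustration, not the exact queries used by our prototype.

```python
import requests
from kubernetes import client, config

PROMETHEUS = "http://prometheus.monitoring:9090"  # assumed in-cluster service address

def prom_query(expr):
    """Run an instant PromQL query and return the result vector."""
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    return resp.json()["data"]["result"]

def update_cluster_graph():
    config.load_incluster_config()        # the agent runs as a controller in the control plane
    v1 = client.CoreV1Api()
    for node in v1.list_node().items:
        name = node.metadata.name
        # Hypothetical PromQL expressions for the CPU and memory currently free on the node.
        cpu = prom_query(f'node_cpu_available{{node="{name}"}}')
        mem = prom_query(f'node_memory_available{{node="{name}"}}')
        annotations = {
            "available-cpu": cpu[0]["value"][1] if cpu else "0",
            "available-memory": mem[0]["value"][1] if mem else "0",
        }
        # Hypothetical metric exposed by the ICMP probe DaemonSet: RTT towards each peer node.
        for probe in prom_query(f'probe_rtt_seconds{{source="{name}"}}'):
            target = probe["metric"]["target"]
            annotations[f"network-cost-{target}"] = probe["value"][1]
        # Patch the Node object so that the scheduler can read the cluster graph from its annotations.
        v1.patch_node(name, {"metadata": {"annotations": annotations}})
```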
The workflow monitoring agent also runs as a controller in the Kubernetes control plane and
periodically determines the workflow graph with the current CPU and memory usage for each
microservice and the traffic amounts exchanged between them.
During each execution the list of Deployment resources that constitute a DAG workflow is fetched from the Kubernetes API server. First, for each Deployment 𝐷𝑖 its CPU and memory usage, 𝑐𝑝𝑢𝑖 and 𝑚𝑒𝑚𝑖 respectively, are determined. These values are equal to the average CPU and memory consumption of all the Pods managed by the Deployment 𝐷𝑖 and are fetched by the agent from the Prometheus metrics server, which in turn collects them from cAdvisor5 agents.
These agents are executed on each cluster node and monitor current CPU and memory usage
for the Pods executed on that node. The 𝑐𝑝𝑢𝑖 and 𝑚𝑒𝑚𝑖 parameters are then assigned as values
for the cpu-usage and memory-usage annotations of the Deployment 𝐷𝑖 .
   Then for each Deployment 𝐷𝑖 , the traffic amounts 𝑡𝑟𝑎𝑓 𝑓𝑖,𝑗 with all the other Deployments 𝐷𝑗
of the workflow are determined. The 𝑡𝑟𝑎𝑓 𝑓𝑖,𝑗 parameter is proportional to the traffic amount
exchanged between microservices 𝜇𝑖 and 𝜇𝑗 . Traffic metrics are fetched by the agent from the
Prometheus metrics server, which in turn collects them from the Istio6 platform. Istio is a service
mesh implementation, whose control plane is installed in the Kubernetes cluster. The Istio
control plane injects a sidecar container running an Envoy proxy into each Pod when it is created. All the traffic between Pods is intercepted by their corresponding Envoy proxies, which in
turn expose traffic statistics through metrics exporters that can be queried by the Prometheus
server.
   Each 𝑡𝑟𝑎𝑓 𝑓𝑖,𝑗 parameter is assigned by the agent as the value for the annotation traffic-𝐷𝑗 of
the Deployment 𝐷𝑖 . The workflow graph with the set of CPU and memory usage and the traffic
amounts for each Deployment is then submitted to the Kubernetes API server.
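The workflow monitoring agent can be sketched along the same lines. The fragment below derives pairwise traffic amounts from the Envoy statistics that Istio exports to Prometheus and stores them as traffic-𝐷𝑗 annotations on the source Deployment; the PromQL expression, label names and namespace are illustrative assumptions.

```python
import requests
from kubernetes import client, config

PROMETHEUS = "http://prometheus.monitoring:9090"  # assumed address, as above

def update_workflow_graph(namespace="workflow"):  # assumed workflow namespace
    config.load_incluster_config()
    apps = client.AppsV1Api()
    deployments = apps.list_namespaced_deployment(namespace).items
    for src in deployments:
        annotations = {}
        for dst in deployments:
            if dst.metadata.name == src.metadata.name:
                continue
            # Hypothetical PromQL: bytes sent from the Pods of src to the Pods of dst,
            # as reported by the Envoy sidecars and scraped by Prometheus.
            expr = (f'sum(rate(istio_request_bytes_sum{{source_app="{src.metadata.name}",'
                    f'destination_app="{dst.metadata.name}"}}[5m]))')
            resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": expr})
            result = resp.json()["data"]["result"]
            if result:
                annotations[f"traffic-{dst.metadata.name}"] = result[0]["value"][1]
        # Store the outgoing edges of the workflow graph on the source Deployment.
        apps.patch_namespaced_deployment(
            src.metadata.name, namespace,
            {"metadata": {"annotations": annotations}})
```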

3.3. Custom Scheduler
The proposed custom scheduler extends the default Kubernetes scheduler by implementing two additional plugins, ResourceAware and NetworkAware, which extend its node scoring phase. For each Pod to be scheduled, each of the two plugins
assigns a partial score to each candidate node of the cluster that has passed the filtering phase.
The ResourceAware plugin takes into account the values of the cpu-usage and memory-usage
annotations of the Deployment associated with the Pod to be scheduled and the values of the
available-cpu and available-memory annotations of the node to be scored. The NetworkAware
plugin takes into account the values of the traffic annotations of the Deployment associated
5
    https://github.com/google/cadvisor
6
    https://istio.io
with the Pod to be scheduled and the values of the network-cost annotations of the node to be
scored. The node scores calculated by the ResourceAware and NetworkAware plugins are added
to the scores of the other scoring plugins of the default Kubernetes scheduler.

Algorithm 1 Custom scheduler node scoring function
Input: p, cpu_p, mem_p, n, cpu_n, mem_n, cns, nc, traff
Output: score
 1: rascore ← α × ((cpu_n − cpu_p) / cpu_p) × 100 + β × ((mem_n − mem_p) / mem_p) × 100
 2: cmc ← 0
 3: for cn in cns do
 4:   pcmc ← 0
 5:   for cnp in cn.pods do
 6:     pcmc ← pcmc + nc_{n,cn} × traff_{p,cnp}
 7:   end for
 8:   cmc ← cmc + pcmc
 9: end for
10: nascore ← −cmc
11: score ← γ × rascore + δ × nascore


  The algorithm takes as inputs the following arguments:

      • 𝑝: the Pod to be scheduled.
      • 𝑐𝑝𝑢𝑝 : the CPU usage of Pod 𝑝.
      • 𝑚𝑒𝑚𝑝 : the memory usage of Pod 𝑝.
      • 𝑛: the node to be scored.
      • 𝑐𝑝𝑢𝑛 : the CPU available on node 𝑛.
      • 𝑚𝑒𝑚𝑛 : the memory available on node 𝑛.
      • 𝑐𝑛𝑠: the set of nodes in the cluster, including node 𝑛.
• 𝑛𝑐: the network costs between node 𝑛 and all the other nodes in 𝑐𝑛𝑠.
      • 𝑡𝑟𝑎𝑓 𝑓: the traffic amounts between the Pod 𝑝 and all the other Pods in the workflow.

   The algorithm starts by calculating the value of the variable 𝑟𝑎𝑠𝑐𝑜𝑟𝑒. This variable represents
the partial contribution to the final node score given by the ResourceAware scheduler plugin.
Its value is given by the weighted sum of the percentage differences between the CPU and
memory currently available on node 𝑛 and the respective usage values of Pod 𝑝. The α and
β parameters are in the range between 0 and 1 and their sum is equal to 1. By changing the
values of these parameters, a different contribution to the 𝑟𝑎𝑠𝑐𝑜𝑟𝑒 variable value is given by
the respective CPU and memory percentage differences. The higher the difference between
available resources on node 𝑛 and those used by Pod 𝑝, the greater the score assigned to node 𝑛.
This makes it possible to balance the load between cluster nodes and thus to reduce the shared resource interference between Pods resulting from incorrect node resource usage estimation, and consequently its impact on application performance.
   Then the partial contribution to the final node score given by the NetworkAware scheduler
plugin is calculated. First, the variable 𝑐𝑚𝑐 is initialized to zero. This variable represents the total cost of communication between the Pod 𝑝 and all the other Pods of the application when the Pod 𝑝 is placed on node 𝑛. The algorithm
iterates through the list of cluster nodes 𝑐𝑛𝑠. For each cluster node 𝑐𝑛 the 𝑝𝑐𝑚𝑐 variable value is
calculated. This variable represents the cost of communication between the Pod 𝑝 and all the
other Pods 𝑐𝑛.𝑝𝑜𝑑𝑠 currently running on node 𝑐𝑛 when the Pod 𝑝 is placed on node 𝑛. For each
Pod 𝑐𝑛𝑝 running on node 𝑐𝑛 the 𝑡𝑟𝑎𝑓 𝑓𝑝,𝑐𝑛𝑝 parameter value is multiplied by the network cost
𝑛𝑐𝑛,𝑐𝑛 between node 𝑛 and node 𝑐𝑛 and added to the 𝑝𝑐𝑚𝑐 variable. The 𝑝𝑐𝑚𝑐 variable value is
then added to the 𝑐𝑚𝑐 variable. The final partial node score contribution of the NetworkAware
scheduler plugin is assigned to the variable 𝑛𝑎𝑠𝑐𝑜𝑟𝑒 as the negative of the 𝑐𝑚𝑐 value. The 𝑛𝑎𝑠𝑐𝑜𝑟𝑒 value is defined in such a way that the Pod 𝑝 is placed on the node, or on a nearby node in terms of network latency, where the Pods with which the Pod 𝑝 exchanges the greatest amount of traffic are executed.
   Then the final node score is calculated as the weighted sum between the 𝑟𝑎𝑠𝑐𝑜𝑟𝑒 and 𝑛𝑎𝑠𝑐𝑜𝑟𝑒
variables values, where the γ and δ parameters are in the range between 0 and 1 and their sum
is equal to 1. By changing the values of these parameters, a different contribution to the 𝑠𝑐𝑜𝑟𝑒
variable value is given by the 𝑟𝑎𝑠𝑐𝑜𝑟𝑒 and 𝑛𝑎𝑠𝑐𝑜𝑟𝑒 values.
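For reference, a direct Python transcription of the scoring function of Algorithm 1 is given below; the mappings are passed as plain dictionaries and the default weights of 0.5 match the values used in the evaluation of Section 4.

```python
def node_score(cpu_p, mem_p, cpu_n, mem_n, pods_per_node, network_cost, traffic,
               alpha=0.5, beta=0.5, gamma=0.5, delta=0.5):
    """Score a candidate node n for a Pod p, following Algorithm 1.

    cpu_p, mem_p:   current CPU/memory usage of Pod p
    cpu_n, mem_n:   CPU/memory currently available on candidate node n
    pods_per_node:  mapping cluster node -> list of Pods currently running on it
    network_cost:   mapping cluster node -> network cost between n and that node
    traffic:        mapping Pod -> traffic exchanged between p and that Pod
    """
    # ResourceAware contribution: weighted percentage headroom of CPU and memory.
    rascore = (alpha * (cpu_n - cpu_p) / cpu_p * 100
               + beta * (mem_n - mem_p) / mem_p * 100)

    # NetworkAware contribution: total communication cost if p is placed on n.
    cmc = 0.0
    for cn, cn_pods in pods_per_node.items():
        pcmc = sum(network_cost.get(cn, 0.0) * traffic.get(cnp, 0.0) for cnp in cn_pods)
        cmc += pcmc
    nascore = -cmc

    # Final score: weighted sum of the two partial contributions.
    return gamma * rascore + delta * nascore
```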

3.4. Custom Descheduler
The custom descheduler runs as a controller in the Kubernetes control plane. The main business
logic of the custom descheduler is implemented by a descheduling function that is called
periodically for each application Pod to establish if that Pod should be rescheduled or not. Inside
the descheduling function, the same node scoring function implemented by the custom scheduler and shown in Algorithm 1 is invoked for each cluster node in order to assign it a score based on the current cluster and workflow graphs determined by the cluster and workflow monitoring agents respectively. If there is at least one node with a higher score than that of the node where the Pod is currently executed, the descheduler evicts the Pod. As in the case of the default Kubernetes descheduler, the proposed custom descheduler does not schedule replacements for evicted Pods but relies on the custom scheduler for that. The proposed custom descheduler gives running application Pods the opportunity to be rescheduled on the basis of the current cluster network latencies, the computational resource availability on each node, the traffic exchanged between microservices and their computational resource usage, thus allowing the application placement to be optimized at run time. By evicting currently running Pods and forcing them to be rescheduled, application scheduling can take into account the ever-changing cluster and application states, with the latter mainly influenced by the
user request load and patterns. One limitation of the proposed approach is that Pod eviction can
cause downtime in the overall application. However, it should be considered that cloud-native
microservices are typically replicated, so the temporary shutdown of one instance generally
causes only a graceful degradation of the application quality of service. To reduce the impact of
Pod rescheduling, at each execution the descheduler evicts at most one Pod among the replicas of a single Deployment.
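A minimal sketch of one descheduling pass is shown below, reusing a node_score-like function. The eviction call assumes a recent version of the official Python Kubernetes client and a hypothetical workflow namespace; controller details (informers, retries, back-off) are omitted.

```python
from collections import defaultdict
from kubernetes import client, config

def deschedule_once(pods, current_node_of, cluster_nodes, score, namespace="workflow"):
    """Evict at most one Pod per Deployment when a better-scoring node exists.

    pods:            iterable of (pod_name, deployment_name) pairs
    current_node_of: mapping pod_name -> node it currently runs on
    cluster_nodes:   list of cluster node names
    score:           callable (pod_name, node_name) -> float, e.g. built on node_score()
    """
    config.load_incluster_config()        # the descheduler runs in the control plane
    v1 = client.CoreV1Api()
    already_evicted = defaultdict(bool)   # deployment_name -> Pod already evicted in this pass?
    for pod_name, deployment in pods:
        if already_evicted[deployment]:
            continue                      # limit impact: one replica per Deployment per pass
        current = current_node_of[pod_name]
        best = max(cluster_nodes, key=lambda n: score(pod_name, n))
        if score(pod_name, best) > score(pod_name, current):
            # Evict the Pod; its replacement will be placed by the custom scheduler.
            eviction = client.V1Eviction(
                metadata=client.V1ObjectMeta(name=pod_name, namespace=namespace))
            v1.create_namespaced_pod_eviction(name=pod_name, namespace=namespace, body=eviction)
            already_evicted[deployment] = True
```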
[Figure 2 shows the sample DAG workflow: the entry-point microservice m0, backend microservices m1–m12 and database servers db1–db4.]
Figure 2: Sample DAG workflow application


4. Evaluation
The proposed solution has been validated using a sample DAG workflow application executed in a testbed environment. The application, whose structure is depicted in Figure 2, is composed of different microservices and database servers. Microservice 𝑚0 represents the entry point for external traffic coming from end users and input data sources. External requests received by this service are served by backend microservices that interact with each other by means of network communication.
The testbed environment for the experiments consists of a Kubernetes cluster with one master node and five worker nodes. These nodes are deployed as virtual machines on a Proxmox7 physical node and configured with 8 GB of RAM and 2 vCPUs each. In order to simulate a realistic Cloud-to-Edge continuum environment with geo-distributed nodes, network latencies between cluster nodes are emulated with the Linux traffic control (tc8) utility, which configures latency delays on the virtual network interfaces of the cluster nodes.
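As an example of how a latency scenario is configured on a node, the following sketch applies a fixed netem delay with tc; the interface name and delay value are placeholders to be adapted to the actual VM configuration.

```python
import subprocess

def add_latency(interface="ens18", delay_ms=100):
    """Add a fixed egress delay on a node's network interface with tc/netem (run on each VM)."""
    subprocess.run(
        ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", f"{delay_ms}ms"],
        check=True)
```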
   We conduct black box experiments by evaluating the end-to-end response time of the workflow
application when HTTP requests are sent to the microservice 𝑚0 with a specified number of
virtual users each sending one request every second in parallel. Requests to the application
are sent through the k6 load testing utility9 . Each experiment consists of 10 trials, during
which the k6 tool sends requests to the microservice 𝑚0 for 40 minutes. For each trial, statistics
about the end-to-end application response time are measured and averaged with those of the
other trials of the same experiment. For each experiment we compare the case in which our cluster and workflow monitoring agents and our custom scheduler and descheduler components are deployed on the cluster with the case in which only the default Kubernetes scheduler is present. We
consider three different scenarios based on the network latency between the cluster nodes:
10ms, 100ms and 200ms. In all the scenarios the α and β parameters of the ResourceAware
plugin of the custom scheduler are assigned the same value of 0.5 in order to make the CPU and
memory percentage differences between the respective resource availability on cluster nodes
7
  https://www.proxmox.com
8
  https://man7.org/linux/man-pages/man8/tc.8.html
9
  https://k6.io
[Figure 3 consists of three panels, one per scenario (10ms, 100ms and 200ms network latency), each plotting the 95th percentile response time (ms) against the number of virtual users for the proposed approach and the default scheduler.]
Figure 3: Experimental results


and the resource usage of Pods contribute equally to the 𝑟𝑎𝑠𝑐𝑜𝑟𝑒 variable value. Similarly the γ
and δ parameters are assigned the same value of 0.5 in order to make the 𝑟𝑎𝑠𝑐𝑜𝑟𝑒 and 𝑛𝑎𝑠𝑐𝑜𝑟𝑒
variables values, determined by the ResourceAware and NetworkAware plugins respectively,
equally contribute to the final node score.
Figure 3 illustrates the results of the three experiments performed, one for each scenario,
showing the 95th percentile of the application response time as a function of the number of
virtual users that send requests to the application in parallel. In all the cases, the proposed
approach performs better than the default Kubernetes scheduler with average improvements of
39%, 56% and 66% in the three scenarios respectively. In the first scenario, network communica-
tion has no significant impact on the application response time because of the low node-to-node
network latencies. Thus, the proposed network-aware scheduling strategy does not lead to high
improvements in the application response time. Furthermore, for a low number of virtual users the proposed approach has performance similar to the default scheduler. This is because of the limited shared resource interference between Pods, even though they are placed on the same nodes by the default scheduler. However, when the number of virtual users increases, the proposed approach performs better than the default scheduler, with higher improvements for higher numbers of virtual users. The response time in the case of the default scheduler grows faster than in the case of the proposed approach, because the proposed resource-aware scheduling strategy distributes Pods on cluster nodes based on their run-time resource usage, thus reducing the shared resource interference between Pods. In the other scenarios,
network communication becomes a bottleneck for the application response time and the lack
of a network-aware scheduling strategy leads to high response times. In these scenarios our
approach performs better than the default scheduler also for low numbers of virtual users, with higher improvements for higher node-to-node network latencies. One consequence of combining a resource-aware and a network-aware scheduling strategy in our approach is that, when the number of virtual users increases, the response time grows much faster for high network latencies, though it remains lower than the response time in the case of the default scheduler. This can be explained by the fact that, for a higher number of virtual users and thus for a higher request load, the average resource usage of microservices increases and the resource-aware scheduling strategy spreads Pods across more cluster nodes. This leads to an increase in the network latency between application microservices and thus in the end-to-end application response times.


5. Related Work
In the literature, there is a variety of works that propose to extend the Kubernetes platform
in order to adapt its usage for the orchestration of microservices-based applications and in
particular DAG workflows on the Cloud-to-Edge continuum [15, 16].
   NetMARKS [17] is a Kubernetes scheduler extender that uses dynamic network metrics
collected with Istio Service Mesh to ensure an efficient placement of Service Function Chains,
based on the historical traffic amount exchanged between services. However, the proposed scheduler does not consider run-time cluster network conditions in its placement decisions.
   The authors of [18] propose to leverage application-level telemetry information during the
lifetime of an application to create service communication graphs that represent the internal
communication patterns of all components. The graph-based representations are then used to
generate colocation policies of the application workload in such a way that the cross-server
internal communication is minimized. However, in this work scheduling decisions are not
influenced by the cluster network state.
   In [19] a scheduling framework is proposed which enables edge sensitive and Service-Level
Objectives (SLO) aware scheduling in the Cloud-Edge-IoT Continuum. The proposed scheduler
extends the base Kubernetes scheduler and makes scheduling decisions based on a service
graph, which models application components and their interactions, and a cluster topology
graph, which maintains current cluster and infrastructure-specific states. However, this work
does not consider historical information about the traffic exchanged between microservices in
order to determine their run time communication affinity.
   In [20] Nautilus is presented, a run-time system that includes, among its modules, a
communication-aware microservice mapper. This module divides the microservice graph
into multiple partitions based on the communication overhead between microservices and
maps the partitions to the cluster nodes in order to make frequent data interaction complete
in memory. While the proposed solution migrates application Pods if computational resource utilization is unbalanced among nodes, there is no Pod rescheduling in the case of degradation
in the communication between microservices.
In [21] Pogonip, an edge-aware Kubernetes scheduler designed for asynchronous microservices, is presented. The authors formulate the placement problem as an Integer Linear Programming
optimization problem and define a heuristic to quickly find an approximate solution for real-
world execution scenarios. The heuristic is implemented as a set of Kubernetes scheduler
plugins. Also in this work, there is no Pod rescheduling if network conditions change over time.
   In [22] an extension to the Kubernetes default scheduler is proposed that uses information
about the status of the network, like bandwidth and round trip time, to optimize batch job
scheduling decisions. The scheduler predicts whether an application can be executed within
its deadline and rejects applications if their deadlines cannot be met. Although information
about current network conditions and historical job execution times is used during scheduling
decisions, communication interactions between microservices are not considered in this work.
6. Conclusions
In this work we proposed to extend the Kubernetes platform to adapt its usage for the orches-
tration of complex DAG workflows executed on the Cloud-to-Edge continuum. The main goal
is to overcome the limitations of the Kubernetes static scheduling policies when dealing with
the placement of DAG workflows in highly distributed environments. The idea is to make the Kubernetes scheduler aware of the run-time communication intensity between the workflow microservices, their resource usage and the cluster network conditions, in order to make scheduling decisions that aim to reduce the overall workflow response time. Furthermore, a descheduler is proposed to dynamically reschedule microservices if better scheduling decisions can be made based on the ever-changing application and cluster network states. As future work we plan to improve the proposed scheduling and descheduling strategies by using AI techniques, in particular Reinforcement Learning, in order to design more sophisticated
algorithms that take into account historical information about both the infrastructure and
application run time states.


References
 [1] F. A. Salaht, F. Desprez, A. Lebre, An overview of service placement problem in fog and
     edge computing, ACM Comput. Surv. 53 (2020). doi:10.1145/3391196 .
 [2] D. Calcaterra, G. Di Modica, O. Tomarchio, Cloud resource orchestration in the multi-cloud
     landscape: a systematic review of existing frameworks, Journal of Cloud Computing 9
     (2020). doi:10.1186/s13677- 020- 00194- 7 .
 [3] B. Varghese, E. de Lara, A. Ding, C. Hong, F. Bonomi, S. Dustdar, P. Harvey, P. Hewkin,
     W. Shi, M. Thiele, P. Willis, Revisiting the arguments for edge computing research, IEEE
     Internet Computing 25 (2021) 36–42. doi:10.1109/MIC.2021.3093924 .
 [4] X. Kong, Y. Wu, H. Wang, F. Xia, Edge computing for internet of everything: A survey,
     IEEE Internet of Things Journal 9 (2022) 23472–23485. doi:10.1109/JIOT.2022.3200431 .
 [5] M. Goudarzi, M. Palaniswami, R. Buyya, Scheduling iot applications in edge and fog
     computing environments: A taxonomy and future directions, ACM Comput. Surv. 55
     (2022). doi:10.1145/3544836 .
 [6] W. Z. Khan, E. Ahmed, S. Hakak, I. Yaqoob, A. Ahmed, Edge computing: A survey, Future
     Generation Computer Systems 97 (2019) 219–235. doi:10.1016/j.future.2019.02.050 .
 [7] M. Riedlinger, R. Bernijazov, F. Hanke, AI Marketplace: Serving Environment for AI
     Solutions using Kubernetes, in: Proceedings of the 13th International Conference on
     Cloud Computing and Services Science - CLOSER, 2023, pp. 269–276. doi:10.5220/
     0000172900003488 .
 [8] A.-A. Corodescu, N. Nikolov, A. Q. Khan, A. Soylu, M. Matskin, A. H. Payberah, D. Roman,
     Big data workflows: Locality-aware orchestration using software containers, Sensors 21
     (2021). doi:10.3390/s21248212 .
 [9] P. Kayal, Kubernetes in fog computing: Feasibility demonstration, limitations and im-
     provement scope : Invited paper, in: 2020 IEEE 6th World Forum on Internet of Things
     (WF-IoT), 2020, pp. 1–6. doi:10.1109/WF- IoT48130.2020.9221340 .
[10] S. Böhm, G. Wirtz, Towards orchestration of cloud-edge architectures with kubernetes, in:
     S. Paiva, X. Li, S. I. Lopes, N. Gupta, D. B. Rawat, A. Patel, H. R. Karimi (Eds.), Science and
     Technologies for Smart Cities, Springer International Publishing, Cham, 2022, pp. 207–230.
[11] I. Ahmad, M. G. AlFailakawi, A. AlMutawa, L. Alsalman, Container scheduling techniques:
     A survey and assessment, Journal of King Saud University - Computer and Information
Sciences (2021). doi:10.1016/j.jksuci.2021.03.002 .
[12] B. Burns, B. Grant, D. Oppenheimer, E. Brewer, J. Wilkes, Borg, omega, and kubernetes:
     Lessons learned from three container-management systems over a decade, Queue 14 (2016)
     70–93. doi:10.1145/2898442.2898444 .
[13] A. Marchese, O. Tomarchio, Extending the Kubernetes Platform with Network-Aware
     Scheduling Capabilities, in: Service-Oriented Computing: 20th International Con-
     ference, ICSOC 2022, Springer-Verlag, Seville, Spain, 2022, p. 465–480. doi:10.1007/
     978- 3- 031- 20984- 0_33 .
[14] A. Marchese and O. Tomarchio, Sophos: A Framework for Application Orchestration in
     the Cloud-to-Edge Continuum, in: Proceedings of the 13th International Conference on
     Cloud Computing and Services Science (CLOSER 2023), SciTePress, 2023, pp. 261–268.
     doi:10.5220/0011972600003488 .
[15] Z. Rejiba, J. Chamanara, Custom scheduling in kubernetes: A survey on common problems
     and solution approaches, ACM Comput. Surv. 55 (2022). doi:10.1145/3544788 .
[16] C. Carrión, Kubernetes scheduling: Taxonomy, ongoing issues and challenges, ACM
     Comput. Surv. 55 (2022). doi:10.1145/3539606 .
[17] L. Wojciechowski, K. Opasiak, J. Latusek, M. Wereski, V. Morales, T. Kim, M. Hong,
     Netmarks: Network metrics-aware kubernetes scheduler powered by service mesh, in:
     IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–9.
     doi:10.1109/INFOCOM42981.2021.9488670 .
[18] L. Cao, P. Sharma, Co-locating containerized workload using service mesh telemetry, in:
     Proceedings of the 17th International Conference on Emerging Networking EXperiments
     and Technologies, CoNEXT ’21, Association for Computing Machinery, New York, NY,
     USA, 2021, p. 168–174. doi:10.1145/3485983.3494867 .
[19] S. Nastic, T. Pusztai, A. Morichetta, V. C. Pujol, S. Dustdar, D. Vii, Y. Xiong, Polaris
     scheduler: Edge sensitive and slo aware workload scheduling in cloud-edge-iot clusters,
     in: 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021, pp.
     206–216. doi:10.1109/CLOUD53861.2021.00034 .
[20] K. Fu, W. Zhang, Q. Chen, D. Zeng, X. Peng, W. Zheng, M. Guo, Qos-aware and re-
     source efficient microservice deployment in cloud-edge continuum, in: IEEE Inter-
     national Parallel and Distributed Processing Symposium (IPDPS), 2021, pp. 932–941.
     doi:10.1109/IPDPS49936.2021.00102 .
[21] T. Pusztai, F. Rossi, S. Dustdar, Pogonip: Scheduling asynchronous applications on the
     edge, in: IEEE 14th International Conference on Cloud Computing (CLOUD), 2021, pp.
     660–670. doi:10.1109/CLOUD53861.2021.00085 .
[22] A. C. Caminero, R. Muñoz-Mansilla, Quality of service provision in fog computing:
     Network-aware scheduling of containers, Sensors 21 (2021). doi:10.3390/s21123978 .