<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>multi-workflow scheduling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yifan Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hui Ma</string-name>
          <email>hui.ma@ecs.vuw.ac.nz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gang Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Hartmann</string-name>
          <email>sven.hartmann@tu-clausthal.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Clausthal University of Technology</institution>
          ,
          <addr-line>Clausthal-Zellerfeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Project Exhibitions</institution>
          ,
          <addr-line>Posters and Demos, and Doctoral Consortium</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Victoria University of Wellington</institution>
          ,
          <addr-line>Wellington</addr-line>
          ,
          <country country="NZ">New Zealand</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Machine learning is becoming increasingly popular for solving dynamic combinatorial optimization problems. With the popularity of cloud computing, machine learning has been applied to dynamic multi-workflow scheduling, a challenging combinatorial optimization problem that involves many problem features, such as workflow patterns, tasks, virtual machine capacities, and cost. Existing machine learning approaches to dynamic combinatorial optimization problems use randomly generated training instances to train optimization rules or heuristics, without paying attention to knowledge about the problem domain data. In this paper, we propose a model-driven machine learning approach to dynamic multi-workflow scheduling. In particular, we propose a conceptual database schema that models the information of the dynamic multi-workflow scheduling problem. We further propose a new approach to designing training instances for dynamic multi-workflow scheduling by incorporating important attributes of the problem into the database schema. We evaluate our proposed approach in a case study with real-world data. The results of the evaluation demonstrate that our proposed model-driven machine learning approach can effectively solve dynamic workflow scheduling problems.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>dynamic workflow scheduling</kwd>
        <kwd>data modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In the era of big data, more and more data-intensive applications are used in various domains
to solve complex data-intensive problems. To effectively solve such problems we need to design
not only effective machine learning algorithms but also training datasets that capture the
essential requirements of the problem domain. However, with the increasing complexity of
combinatorial optimization problems, designing training datasets is becoming more and more
challenging: the datasets must be representative of the problem domain and robust to unseen
problem instances, while at the same time not hurting the efficiency of the training process.
Combinatorial optimization is a category of optimization in which the size and the
relationships of the data grow massively as the problem complexity escalates [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ].
      </p>
      <p>A typical example of a combinatorial optimization problem is the dynamic workflow scheduling problem
in cloud computing, where workflow tasks are scheduled to cloud resources in the best possible
way with respect to some objectives, e.g., minimizing the makespan and/or response time. Cloud users often
request multiple workflows to be scheduled for execution on heterogeneous cloud resources.
This gives rise to the dynamic multi-workflow scheduling (DMWS) problem. To solve such
a combinatorial optimization problem, we need to use not only data about tasks and cloud
resources (e.g., VMs), but also the relationships among these tasks and workflows, because they can
significantly affect the response time and the overall cost.</p>
      <p>Dynamic multi-workflow scheduling is an NP-hard problem [3]. A major challenge in practice
is that scheduling decisions for the DMWS problem must be made in real time.
Therefore, heuristics need to be used. The manual design of heuristics is
very time consuming and requires domain experts with good domain knowledge. Genetic
programming hyper-heuristic (GPHH), a machine learning method, is an effective approach to
automatically design heuristics that does not require domain experts [4]. However, datasets are
needed to train heuristics so that they can solve the problem effectively in real time.</p>
      <p>To efficiently search for heuristics for dynamic multi-workflow scheduling, the quality of
the datasets used for training heuristics plays a crucial role in making the learned heuristics effective in
practice, e.g., robust to unseen problem instances [5]. While it is common practice to
use historical datasets as training instances, they are not efficient for evaluating heuristics, since this
involves expensive evaluations over all the historical data [6]. To avoid expensive training while
at the same time training heuristics that are robust to unseen problem instances, it is important
to carefully design the training instances. Current approaches to designing training instances
are ad hoc and are not clearly described. Most existing work uses randomly generated training
instances in the training process [7], without paying any attention to the knowledge contained in
the given datasets. Further, dynamic workflow scheduling is a dynamic combinatorial optimization problem
for which different states of the problem domain will be used during the training process,
including resource scheduling requests and resource status, which change dynamically.
Meanwhile, to train heuristics that are effective and robust in solving dynamic multi-workflow
scheduling problems in real time, it is inefficient and unnecessary to use all the possible states
of the requests and data centers.</p>
      <p>A conceptual model of training datasets, consisting of the data and the relationships between
different data, would help make machine learning systematic and effective. A conceptual
database schema consists of entities and relationships. Designing training data essentially means
retrieving, or sampling, datasets from a database of historical data. This makes conceptual models
attractive for the machine learning domain since they provide a high-level view of the datasets.
However, there is no data model in the literature that can be used to support machine learning
processes in learning heuristics effectively and efficiently when there is a large number of requests
and the size of the workflows is extremely large.</p>
      <p>By taking advantage of conceptual models, different views of the datasets can be retrieved as
training datasets to efficiently evaluate and train heuristics [8]. Adequate modeling of historical
data and training data plays an important role in designing a high-quality training dataset for
machine learning. However, despite the popularity of machine learning for combinatorial
optimization, there is limited research on data modeling for machine learning. A data
model, GR4ML, is proposed in [9] for machine learning, focusing on expressing machine learning
requirements and solution design. In practice, datasets are often designed in an ad hoc way or
based on best practice recommendations [10]. There are no common or well-defined approaches
to model training datasets and design training instances using data models.</p>
      <p>The overall aim of this paper is to propose a model-driven machine learning approach to the
dynamic multi-workflow scheduling problem. In particular, we propose a GPHH approach that
trains heuristics using well-defined training instances based on a conceptual database schema. We
expect that the heuristics learned by our machine learning approach can schedule dynamic
workflows effectively. The major contributions of this paper are the following:
• We propose a conceptual model that can be used for designing training datasets for
effectively learning heuristics for dynamic workflow scheduling.
• We present a model-driven GPHH approach to automatically learn heuristics that can be
used to schedule multiple workflows in cloud data centers. In particular, our proposed
GPHH approach uses training datasets generated using the conceptual model designed
for DMWS.
• We evaluate the effectiveness of our proposed conceptual model in designing training
datasets with a case study using benchmark data of workflow scheduling.</p>
      <p>To the best of our knowledge, our proposed model-driven approach is the first attempt at
incorporating conceptual modeling into the machine learning process for solving dynamic
workflow scheduling and other combinatorial optimization problems. In particular, our
conceptual database schema provides a good overview of the structure of the data involved in dynamic
workflow scheduling problems, and it supports the systematic design of training instances
for effectively learning heuristics that can efficiently schedule the tasks of numerous
dynamically arriving workflows.</p>
      <p>Organization. This paper is organized as follows. Section 2 briefly outlines current machine
learning approaches from the literature, in particular for dynamic multi-workflow scheduling problems,
while Section 3 introduces dynamic workflow scheduling in more detail.
In Section 4 we present a conceptual data schema for dynamic multi-workflow scheduling and
propose a GPHH approach to automatically design heuristics for the DMWS problem, with a
model-driven approach for designing training instances. In Section 5 we report on a case study
that we have conducted to evaluate our proposed model-driven machine learning approach.
Finally, Section 6 gives conclusions and an outlook on future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>As dynamic multi-workflow scheduling is known to be NP-hard, heuristics are widely used
to ensure scalability and real-time applicability. Heuristics are priority functions for making
decisions on cloud resources, e.g., VMs. IaaS cloud partial critical paths (IC-PCP) is proposed in
[11] for scheduling computation requests in data centers. Heterogeneous earliest finish time
(HEFT) is another manually designed heuristic, proposed in [12]. However, manually designing
such heuristics requires domain knowledge and is time-consuming. Moreover, a heuristic may be
effective on some problem instances but not on others,
since only a small set of domain attributes is considered.</p>
      <p>Machine learning approaches have been proposed in the literature to automatically learn
heuristics for dynamic workflow scheduling problems and other dynamic scheduling problems.
In [13, 14], GPHH is utilized for online service resource allocation in clouds. In [15], a GPHH
approach is proposed to solve multi-cloud resource allocation problems, where a single training instance
is used during the whole training process. In order to generate heuristics that are robust to
unseen problem instances, a GPHH approach is proposed in [16] that uses training instances
randomly generated during the training process to learn heuristics for the DWS problem.
A similar approach to generating training instances is used in [7], which proposes a GPHH
approach to learn heuristics for the dynamic job shop scheduling problem. However, none
of the above-mentioned machine learning studies paid attention to the structure of problem
attributes or data when designing training instances.</p>
      <p>To understand the machine learning requirements of organizations, conceptual modelers have
proposed different models. In [17], a conceptual modeling framework is proposed to
systematically justify the need for predictive analytics within an organizational context and to identify
the requirements of predictive analytics design. To incorporate domain-specific questions, [18]
adopts conceptual modeling methods to elicit analytical requirements, design machine learning
solutions, and ensure alignment between analytic initiatives and business strategies,
among others. The modeling framework consists of three complementary modeling views: the
business view, the analytic design view, and the data preparation view. However, as far as we are aware,
there is no systematic way of designing training instances for machine learning in dynamic
combinatorial optimization problems, including dynamic workflow scheduling problems.</p>
      <p>In summary, research on data modeling of training instances for machine learning of
combinatorial optimization is still in its early stages. To the best of our knowledge, there is very
limited work on conceptual modeling of machine learning data for dynamic multi-workflow
scheduling, which motivates us to propose an effective design method for training instances by
incorporating application-specific data models.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Preliminaries</title>
      <p>SLA-aware dynamic multi-workflow scheduling (DMWS) has attracted rising interest in
research and practice. Workflows arrive at a data center dynamically and are unknown beforehand,
i.e., requests to schedule workflows arrive on the fly and must be executed in real time, to
meet deadline constraints set in service-level agreements (SLAs). The tasks of the workflows are
scheduled to execute on a number of VM instances that are available in a cloud data center. Each
VM instance is of a VM type, which defines its capacities. During this process, we need to make
two types of decisions concurrently: task allocation, which allocates the tasks of the workflows
to VMs, and task sequencing, which sequences the tasks on each individual VM instance.</p>
      <p>Figure 1 shows an overview of DMWS. The overall goal is to satisfy the workflow scheduling
requests with the best possible quality, e.g., minimizing the makespan of workflows, minimizing
the cost of renting cloud resources, and minimizing SLA violation penalties. In particular, the
SLA-aware dynamic multi-workflow scheduling problem focuses on scheduling a sequence of
workflows to cloud resources so that the overall cost and SLA penalty are minimized. In the
rest of the paper, DMWS refers to the SLA-aware dynamic multi-workflow scheduling problem.</p>
      <p>A workflow w arrives at some time with a deadline and can be represented as a directed acyclic graph
(DAG), denoted by w = (T, E). Herein, T = {t_j}, j ∈ {1, …, n}, is a set of nodes representing the tasks
of the workflow, and E = {e_jk} is a set of directed edges, where e_jk connects t_j with t_k. A task t_j is
defined by its size s_j. In an edge e_jk, t_j is a parent task and t_k is a child task. Note that a child task can be
executed only after all its parent tasks are completed. We use pre(t_j) to denote all immediate
predecessors of task t_j. There might be multiple tasks allocated to the same VM instance to be
executed. We use prev(t_j) to denote the previous task of t_j executed on the same VM.</p>
      <p>In a data center, there are VM instances to perform the scheduled workflow tasks. Each VM
instance is of a VM type that is defined by its speed, memory, and cost. Any task must
be scheduled to a VM instance with sufficient capacity, i.e., CPU and memory. Note
that one VM instance can only execute one task at any time, and each task can only be executed
on a single VM instance. The time to process a task depends on the resource requirement of
the task and on the resources of the VM it is scheduled to. The more resource capacity is allocated to
a task, the faster it is processed and the lower the probability of violating the deadline constraints set in the
SLA. However, the more resources are used for executing workflows, the higher the cost
will be. Different workflow schedules will thus lead to different numbers of time units for renting VMs, and
hence to different makespans and costs.</p>
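      <p>To make the relation between task size, VM capacity, and execution time concrete, the following sketch assumes the common convention that a task's execution time equals its size divided by the speed of the VM type it runs on; the attribute names and the numeric values are illustrative and not taken from the paper.</p>
      <preformat>
# Hedged sketch: execution time of a task on a VM type, assuming
# execution_time = task_size / vm_speed (a common modelling convention).
from dataclasses import dataclass

@dataclass
class VMType:
    name: str
    speed: float        # processing capacity (normalized, illustrative)
    memory: int         # GB
    hourly_price: float

@dataclass
class Task:
    tid: str
    size: float         # workload of the task
    memory: int         # memory requirement in GB

def execution_time(task: Task, vm_type: VMType) -> float:
    """Estimated processing time of `task` on a VM of type `vm_type`."""
    if vm_type.memory >= task.memory:   # capacity check: enough memory
        return task.size / vm_type.speed
    raise ValueError("VM type has insufficient memory for this task")

# A faster (and more expensive) VM type shortens the execution time.
small = VMType("m5.large", speed=1.0, memory=8, hourly_price=0.096)
large = VMType("m5.2xlarge", speed=4.0, memory=32, hourly_price=0.384)
t = Task("t1", size=120.0, memory=4)
print(execution_time(t, small), execution_time(t, large))  # 120.0 vs 30.0
</preformat>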
      <p>For a given sequence of workflows W = [w_1, …, w_m] and a set of available VM instances, the
objective of the DMWS problem is to schedule the workflow tasks for execution on VMs so as to
minimize the total cost of workflow execution, consisting of the rental fees c_v^rent
incurred by resource provisioning and the deadline penalties c_w^pen caused by
violations, formulated as</p>
      <p>min TotalCost = ∑_{v ∈ V} c_v^rent + ∑_{w ∈ W} c_w^pen    (1)</p>
      <p>where V refers to the set of rented VMs. In particular, c_v^rent is the rental fee of VM v, defined
as the hourly-based price p_v applied to the rental time between the start of processing of its first allocated
task and the completion of its last allocated task, i.e., the interval [t_v^first, t_v^last], calculated by</p>
      <p>c_v^rent = p_v ⋅ ⌈ (t_v^last − t_v^first) / 3600 ⌉    (2)</p>
      <p>c_w^pen is the penalty of workflow w, defined as the penalty fee paid for the portion of the workflow's
completion time beyond its deadline d_w, weighted by a penalty coefficient [19]. A larger coefficient represents a lower
tolerance for violating the workflow deadline. Details of the problem definition can be found in [ ].</p>
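      <p>The following sketch illustrates how the objective can be computed for a completed schedule: the rental fee follows the hourly ceiling of Eq. (2), while the deadline penalty is only approximated here as a coefficient times the deadline overrun, since the exact penalty formula of [19] is not reproduced above; all function and field names are illustrative.</p>
      <preformat>
import math

def rental_fee(hourly_price, first_task_start, last_task_end):
    """Rental fee of one VM, Eq. (2): hourly price times the number of
    started hours between its first and last allocated task."""
    return hourly_price * math.ceil((last_task_end - first_task_start) / 3600.0)

def deadline_penalty(finish_time, deadline, coefficient):
    """Sketched penalty: proportional to the time by which a workflow
    overruns its deadline (the paper's exact formula follows [19])."""
    return coefficient * max(0.0, finish_time - deadline)

def total_cost(vms, workflows, coefficient=1.0):
    """Objective of Eq. (1): sum of VM rental fees plus deadline penalties."""
    rent = sum(rental_fee(v["price"], v["start"], v["end"]) for v in vms)
    pen = sum(deadline_penalty(w["finish"], w["deadline"], coefficient)
              for w in workflows)
    return rent + pen

# Example: one VM rented for 90 minutes, one workflow 10 minutes late.
vms = [{"price": 0.096, "start": 0.0, "end": 5400.0}]
workflows = [{"finish": 5400.0, "deadline": 4800.0}]
print(total_cost(vms, workflows, coefficient=0.01))  # 0.192 + 6.0
</preformat>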
    </sec>
    <sec id="sec-4a">
      <title>4. Model-driven GPHH for dynamic multi-workflow scheduling</title>
      <p>In this section we present our model-driven genetic programming hyper-heuristic (GPHH)
approach for the dynamic multi-workflow scheduling problem. To begin with, we present a data
model of the domain data for scheduling multiple dynamic workflows. Afterwards, we describe
our proposed GPHH approach using training instances designed in a model-driven way.</p>
      <p>Data model. To ensure the robustness of the heuristics obtained by machine learning, we first
model the data involved in the training of heuristics by GPHH. Figure 2 presents a conceptual
data schema, denoted ER-DMWS, for the problem of dynamic multi-workflow scheduling.</p>
      <p>[Figure 2: The ER-DMWS conceptual schema, showing the entity types Task, VM, VM-Type, Pattern, Workflow, and Problem Instance with their attributes (e.g., TID, Size, Mem, VID, TypeName, CPU, Cost, PatternName, WorkflowID, InstanceID, ArriveTime, Deadline), the relationship types Schedule, Depend, Generation, and Mini-Batch, and the cardinality constraints (n,n) and (m,m).]</p>
        <p>We utilize the higher-order ER model (HERM) [20], which preserves the aggregation-based
modeling principle of the basic ER model, where a conceptual data schema consists of a set of
object types. Object types are characterized by their attributes. Recall the definition of attributes
from [20]. Let U denote a countable set of simple attributes (called the universe) together with
a type assignment type(·) that assigns to each attribute A ∈ U its data type type(A). Complex
attributes may be obtained from simple ones by nesting. Let 𝒜 denote the smallest superset of
U such that
X(A_1, …, A_n), X{A}, X[A], X⟨A⟩, X_1(A_1) ⊕ ⋯ ⊕ X_n(A_n), X(A_1 → A_2) ∈ 𝒜
hold whenever A, A_1, …, A_n ∈ 𝒜 hold, with labels X, X_1, …, X_n chosen from some fixed
alphabet ℒ.</p>
        <p>An important extension of HERM is that dynamic and complex data types are permitted in 𝒜.
The data type type(A) assigned to a (simple) attribute A ∈ U is some simple data type, like INT,
FLOAT, STRING, DATE, or TIME, which are relevant for the features of dynamic workflow scheduling
problems. It should be emphasized that complex data types offer additional opportunities for
machine learning researchers but are not a must-use.</p>
        <p>A structured component is a pair r : E with a role name r and a component expression
E ∈ C. An object type O is a triple (comp(O), attr(O), key(O)), where comp(O) is a finite set
{r_1 : E_1, …, r_n : E_n} of structured components with pairwise different role names r_1, …, r_n,
attr(O) is a finite set {A_1, …, A_m} ⊆ 𝒜 of complex attributes, and key(O) ⊆ comp(O) ∪ attr(O)
is the primary key of O. An object type O is called an entity type if comp(O) is the empty set;
otherwise it is called a relationship type.</p>
        <p>Example 1. In the diagram of ER-DMWS in Figure 2, the entity types are displayed as rectangles
while relationship types are displayed as diamonds. Let R be a relationship type and C a component
of R, say of the form E or r : E. In the diagram this is displayed by an edge from R to E. The
relationship type Depend in Figure 2 can be specified as follows:</p>
        <p>Depend = ({Parent: Task, Child: Task}, {}, {Parent: Task, Child: Task})
It has two structured components, each with a different role, no attributes, and its primary key
includes both structured components. A cardinality constraint is an expression card(R, C) = (a, b),
where a is a natural number and b is a natural number or ∞. For more details on
cardinality constraints see [20, 21]. For the integration of different user views in HERM see, e.g.,
[22]. Cardinality constraints are also displayed in the diagram. For example, the constraint
card(Problem Instance, Workflow) = (m, m) states that every problem instance consists of m
workflows. The cardinality constraint card(Mini-Batch, Problem Instance) = (n, n) states that every
mini-batch consists of n problem instances. Note that a mini-batch, containing a set of problem
instances, will be used in the machine learning process to ensure the generality of the learned
heuristics.</p>
        <p>We use the conceptual schema in Figure 2 to represent data on workflows, virtual machines, and
scheduling, as well as on the training instances, i.e., mini-batches, used during the machine learning
process. Note that the entity and relationship types in gray capture the information on
workflows, while the entity and relationship types in green capture the information on
cloud resources, i.e., virtual machines. Further, there are two relationship types, Schedule and
Mini-Batch, in blue that capture the information on scheduling decisions and training
instances.</p>
        <p>For example, a workflow pattern can be considered as a complex data type which may be
regarded as a list of tasks and a set of edges defined by pairs of tasks. If the machine learning
researchers consider the internal structure of the workflow pattern to be conceptually
irrelevant for the application under development, then they can define a simple attribute pattern
with type(pattern) = DAG. Otherwise, if the internal structure of the workflow pattern is
conceptually relevant, then they may use a complex attribute pattern({task}, {(parent: task, child: task)}).</p>
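        <p>To illustrate how the object types of ER-DMWS might be mirrored in code when generating training data, the following sketch encodes some of the entity and relationship types of Figure 2 as plain Python dataclasses. The class and field names follow the attribute labels in the figure, but the mapping is an illustrative reading rather than code from the paper.</p>
        <preformat>
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pattern:                 # entity type Pattern
    pattern_name: str          # e.g. "CyberShake"
    size: int                  # number of tasks in the pattern

@dataclass
class Task:                    # entity type Task
    tid: str
    size: float
    mem: int

@dataclass
class Depend:                  # relationship type Depend (Parent, Child)
    parent: Task
    child: Task

@dataclass
class Workflow:                # entity type Workflow
    workflow_id: str
    pattern: Pattern
    arrive_time: float
    deadline: float
    tasks: List[Task] = field(default_factory=list)
    edges: List[Depend] = field(default_factory=list)

@dataclass
class ProblemInstance:         # entity type Problem Instance
    instance_id: str
    workflows: List[Workflow] = field(default_factory=list)
</preformat>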
        <p>Model-driven approach to automatically learning heuristics. We now present our
model-driven GPHH for dynamic multi-workflow scheduling. Figure 3 shows a high-level overview
of the GPHH algorithm for learning heuristics based on model-driven training instances.</p>
        <p>[Figure 3: Overview of the model-driven GPHH algorithm: initialization of a population of heuristics, evaluation on a generated mini-batch of training instances, selection, and evolution via reproduction, crossover, and mutation, repeated until termination; an example evolved heuristic tree (ET + TS) is also shown.]</p>
        <p>Here, we represent heuristics as syntax trees, with internal nodes as arithmetic functions and
leaf nodes as terminals. For an example, see the heuristic in Figure 3. The terminals can be
either attributes or derived attributes of the object types in the ER-DMWS schema, e.g., the task
size. Table 1 summarizes the terminal set used in this case and the function set, which comprises
{+, −, ×, ÷, max, min}, based on [23].</p>
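        <p>As a concrete illustration of the tree representation, the following sketch encodes a small heuristic as nested tuples and evaluates it bottom-up against a dictionary of terminal values; the terminal names ET and TS are taken from the example tree in Figure 3, while the protected division and the remaining details are assumptions.</p>
        <preformat>
import operator

# Function set assumed here: arithmetic plus max/min.
FUNCTIONS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": lambda a, b: a / b if b != 0 else 1.0,   # protected division
    "max": max, "min": min,
}

def evaluate(tree, terminals):
    """Bottom-up evaluation of a GP tree given terminal values.
    A tree is either a terminal name (str) or (function, left, right)."""
    if isinstance(tree, str):
        return terminals[tree]
    func, left, right = tree
    return FUNCTIONS[func](evaluate(left, terminals), evaluate(right, terminals))

# The example heuristic from Figure 3: ET + TS.
heuristic = ("+", "ET", "TS")
print(evaluate(heuristic, {"ET": 30.0, "TS": 120.0}))  # 150.0
</preformat>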
        <p>During the process of solving a DMWS problem instance in real time, any GP tree evolved
by GPHH is treated as a priority function. Whenever a task in a workflow becomes ready for
execution, i.e., all its parent tasks have been completed, the GP tree is applied to each eligible
cloud resource, e.g., a leased VM instance with the requested CPU and memory capacities to
accomplish the task, to calculate their respective priorities. For this purpose, the value of each
terminal node of the GP tree is determined first. Note that, for different cloud resources under
consideration, the same terminal node can have varying values. With respect to any candidate
cloud resource, driven by the corresponding values of all terminal nodes, we can proceed to
calculate the value of each internal node of the GP tree in bottom-up order and eventually
determine the value of the root node, which is assigned to the candidate cloud resource
as its priority. After calculating the priorities of all eligible VMs, the VM with the highest
priority is scheduled to process the task. The above process is performed repeatedly
to schedule the execution of every task of each workflow, whenever a task becomes ready.
Therefore, guided by a GP tree, the full schedule that solves any DMWS problem instance can
be obtained. Details regarding the GPHH algorithm for DMWS can be found in [4, 16].</p>
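        <p>The dispatching step described above can be sketched as follows: whenever a task becomes ready, the evolved priority function is evaluated once per eligible VM, each VM yielding its own terminal values, and the VM with the highest priority receives the task. The terminal names and the way their values are derived are illustrative assumptions.</p>
        <preformat>
def dispatch_ready_task(task, eligible_vms, priority_fn):
    """Apply an evolved priority function to every eligible VM and
    return the VM with the highest priority for this ready task."""
    best_vm, best_priority = None, float("-inf")
    for vm in eligible_vms:
        # Terminal values differ per candidate VM (illustrative terminals).
        terminals = {
            "TS": task["size"],                      # task size
            "ET": task["size"] / vm["speed"],        # estimated execution time
            "VMC": vm["hourly_price"],               # VM rental price
            "WT": vm["queue_wait"],                  # waiting time on this VM
        }
        priority = priority_fn(terminals)
        if priority > best_priority:
            best_vm, best_priority = vm, priority
    return best_vm

# Usage with a toy priority function preferring fast, lightly loaded VMs.
vms = [{"speed": 1.0, "hourly_price": 0.096, "queue_wait": 0.0},
       {"speed": 4.0, "hourly_price": 0.384, "queue_wait": 300.0}]
task = {"size": 120.0}
print(dispatch_ready_task(task, vms, lambda t: -(t["ET"] + t["WT"])))
</preformat>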
        <p>Apparently, different GP trees can produce different full schedules for the same DMWS problem
instance, and each full schedule leads to a different total cost. The aim of the GPHH algorithm is
to identify a GP tree that achieves on average the smallest total cost, the objective defined
in (1), over a wide variety of DMWS problem instances. For this purpose, the GPHH algorithm
randomly generates a set (a population) of GP trees, called the initial population.
Each GP tree of the initial population is evaluated using training instances and a fitness
function, i.e., TotalCost. GP trees with good fitness values are selected for reproduction,
mutation, and crossover to generate the next generation of GP trees. This process continues
until the stopping criteria are met, e.g., the maximum number of generations is reached. The
evolved GP tree with the best fitness is then reported by the GPHH algorithm as its final learned
heuristic for DMWS. As can be seen in Figure 3, in every generation, i.e., iteration, all the
heuristics are evaluated using the training instances. Evidently, these training instances play
an important role in the evolution (or heuristic learning) process.</p>
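        <p>The generational loop just described, including the per-generation rotation of training instances, is sketched below in deliberately simplified form: candidate heuristics are opaque objects scored by a caller-supplied cost function, and GP crossover and mutation are abstracted into a single vary step, since only the control flow is of interest here; all names are ours rather than the paper's.</p>
        <preformat>
import random

def gphh_loop(init_population, sample_mini_batch, total_cost, vary,
              generations=50, tournament_size=7, elitism=1):
    """Skeleton of the GPHH generational loop with mini-batch rotation.
    init_population: list of candidate heuristics (opaque objects);
    sample_mini_batch(g): training instances used in generation g;
    total_cost(h, instance): simulated cost of heuristic h on one instance;
    vary(p1, p2): stand-in for GP crossover/mutation, returns one offspring."""
    population = list(init_population)
    for g in range(generations):
        batch = sample_mini_batch(g)                  # rotate training instances
        fitness = {id(h): sum(total_cost(h, inst) for inst in batch) / len(batch)
                   for h in population}
        ranked = sorted(population, key=lambda h: fitness[id(h)])
        next_pop = ranked[:elitism]                   # keep the elite individual(s)
        while len(next_pop) != len(population):       # fill by tournament + variation
            parents = [min(random.sample(population, tournament_size),
                           key=lambda h: fitness[id(h)])
                       for _ in range(2)]
            next_pop.append(vary(parents[0], parents[1]))
        population = next_pop
    # Report the individual with the best cost on a final mini-batch.
    final_batch = sample_mini_batch(generations)
    return min(population, key=lambda h: sum(total_cost(h, i) for i in final_batch))
</preformat>
        <p>In the actual approach the candidates are GP trees and the variation step applies subtree crossover and mutation with the rates reported in Section 5.</p>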
        <p>Model-driven training instance design. The modeling technique that we have just
proposed can be generalized to generate training datasets for GPHH for dynamic multi-workflow
scheduling. In order to learn good heuristics that perform well on unseen problem instances,
different training instances are rotated from generation to generation, see Figure 4. However, if the
training instances are too different from generation to generation, then no individuals can
steadily survive through the generations to evolve. Therefore, we need to carefully design suitable
training instances. This can be controlled by specifying a proper size for the instance pool. In
the meantime, to train heuristics that perform well on unseen problem instances, we adapt the
concept of a ‘mini-batch’ [5] in the design of training instances. As can be seen in Figure 2, a
mini-batch may contain one or more problem instances. We explore settings with different
batch sizes in the section on experiments.</p>
        <p>As can be seen in Figure 4, our model-driven approach first creates a problem instance pool
before running GPHH. During the training process of GPHH, training instances are obtained
by querying the problem instance pool. In the following, we provide details of each step.
Definition 1. A dynamic multi-workflow scheduling problem instance I consists of a list of
m workflows, i.e., I = [w_1, …, w_m], where each workflow is randomly generated using available workflow
patterns and sizes, (PatternName, Size), and is assigned a random arrival time ArriveTime. A
problem instance pool ℙ consists of p problem instances, i.e., ℙ = {I_1, …, I_k, …, I_p}.</p>
        <p>As shown in Figure 2, we propose to use a mini-batch consisting of a set of problem instances
for each generation of the training process.</p>
        <p>Definition 2. A mini-batch B(g) for generation g is a set of n problem instances, i.e.,
B(g) = {I_1(g), I_2(g), …, I_j(g), …, I_n(g)},
where each problem instance I_j(g) is retrieved from the problem instance pool ℙ, i.e., I_j(g) ∈ ℙ, for
j = 1, …, n.</p>
        <p>To create a problem instance pool ℙ, we generate a set of problem instances using the schema
defined for the DMWS problem, each time with a different combination of patterns, sizes, and
arrival times. In this way, our problem instance pool contains a variety of problem instances that
represent various possible problem scenarios. We can then obtain a mini-batch of n problem
instances by randomly selecting n problem instances from ℙ.</p>
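        <p>A minimal sketch of Definitions 1 and 2 is given below: a problem instance pool is built by combining workflow patterns, sizes, and random arrival times, and a mini-batch of n instances is then drawn for each generation. The pattern names follow Figure 6, while the pool size, the size values, and the arrival-time range are illustrative assumptions.</p>
        <preformat>
import random

PATTERNS = ["CyberShake", "Inspiral", "Montage", "SIPHT"]   # cf. Figure 6
SIZES = [30, 50, 100]                                        # illustrative pattern sizes

def make_problem_instance(num_workflows, horizon=3600.0, rng=random):
    """Definition 1: a problem instance is a list of randomly generated
    workflows, each with a (pattern, size) choice and a random arrival time."""
    return [{
        "pattern": rng.choice(PATTERNS),
        "size": rng.choice(SIZES),
        "arrive_time": rng.uniform(0.0, horizon),
        "deadline_factor": rng.uniform(1.2, 2.0),   # illustrative SLA tightness
    } for _ in range(num_workflows)]

def make_pool(pool_size, num_workflows, seed=0):
    rng = random.Random(seed)
    return [make_problem_instance(num_workflows, rng=rng) for _ in range(pool_size)]

def sample_mini_batch(pool, n, generation):
    """Definition 2: a mini-batch for a generation is n instances drawn from the pool."""
    rng = random.Random(generation)        # reproducible rotation per generation
    return rng.sample(pool, n)

pool = make_pool(pool_size=20, num_workflows=5)
batch = sample_mini_batch(pool, n=3, generation=7)
print(len(pool), len(batch))   # 20 3
</preformat>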
    </sec>
    <sec id="sec-5">
      <title>5. A case study</title>
      <p>To evaluate the performance of our proposed model-driven approach, we have implemented
a machine learning simulator using benchmark data of workflow scheduling as in [4, 24, 25],
i.e., benchmark workflow patterns with various sizes, and VM data obtained from commercial
AWS data centers [26, 27]. The simulator is used to train heuristics for dynamic workflow
scheduling. After the heuristics are trained using the simulator, we evaluate them
using test datasets. We evaluated the performance (in terms of the total cost) of the heuristics
in solving SLA-aware dynamic multi-workflow scheduling, compared with heuristics
trained on training datasets generated using existing approaches. That is, two
different training datasets are used, one based on our proposed model-driven approach
and the other based on the existing approach [12, 16, 28], and a set of heuristics is trained on
each of the two training datasets. Since real-world datasets were considered, the number of
workflow scheduling requests naturally varies throughout the day and week. This may lead
to increased execution time for a single workflow due to the queuing time of tasks on VMs.</p>
      <p>Datasets. We first discuss the simulation configuration, followed by the parameter settings of
GPHH and the detailed setup of the comparative experiments. A simulated cloud
environment is used to experimentally evaluate the proposed model-driven GPHH approach to
DMWS. It contains several key components, listed below.</p>
      <p>VM Types: Following Amazon EC2 (https://aws.amazon.com/ec2/pricing/on-demand/), we assume that the cloud environment supports six
different VM types, with their respective configurations summarized in Table 2. The maximum
number of VM instances of each type is unlimited.</p>
      <p>Workflows: Figure 6 shows the four widely used workflow patterns, namely CyberShake, Inspiral,
Montage, and SIPHT (https://confluence.pegasus.isi.edu/display/pegasus/Deprecated+Workflow+Generator), that are employed in our experimental study. Table 3 gives detailed
information on the 12 heterogeneous workflows used in our experiments.</p>
      <p>[Figure 6: The CyberShake, Inspiral, Montage, and SIPHT workflow patterns.]</p>
      <p>Baselines. Several baselines are utilized to provide a point of comparison in the experiments.
• HEFT: Heterogeneous earliest finish time [12] is a widely known heuristic algorithm for
workflow scheduling.
• Random Generate (no rotation): Traditional approach where the GPHH algorithm is
set to use a single training instance that is kept constant across generations, so that all
individuals in all generations are evaluated on the same training instance [15].
• Random Generate (rotation): Traditional approach where the GPHH algorithm is set to
use one training instance per generation, randomly generating a new training instance
in each generation [7].</p>
      <p>Parameter settings. The study utilized default parameter settings for GPHH, following
the recommendations provided in [4, 29]. The population size was set to 1024, with a total
of 50 iterations. One individual was designated as elite, and the tournament size was set to
7. Crossover, mutation, and reproduction rates were set at 0.80, 0.15, and 0.05, respectively.
Furthermore, the initial depth of GP trees ranged from 2 to 6, and the maximum depth was
constrained to 8 throughout the evolutionary process.</p>
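      <p>For reference, the reported parameter settings can be collected in a single configuration object, as in the sketch below; the key names are our own, and the paper does not state which GP implementation these values were supplied to.</p>
      <preformat>
# Reported GPHH parameter settings, gathered into one configuration dict.
GPHH_PARAMS = {
    "population_size": 1024,
    "generations": 50,
    "elitism": 1,               # one individual kept as elite
    "tournament_size": 7,
    "crossover_rate": 0.80,
    "mutation_rate": 0.15,
    "reproduction_rate": 0.05,
    "init_tree_depth": (2, 6),  # initial depth range of GP trees
    "max_tree_depth": 8,
}
</preformat>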
      <p>Evaluation results. Let Mini-batch Select n/p denote our proposed model-driven approach
where the GPHH algorithm is set to randomly select n problem instances per generation from a problem instance
pool with p instances. Next we show the performance of the proposed
model-driven GPHH approach in comparison to the baselines.</p>
      <p>We have also conducted the Wilcoxon rank sum test with a significance level of 0.05 to make
pairwise comparisons (each algorithm is compared with all the algorithms to its left in the
table). The “(+)/(-)/(=)” after each entry in the table indicates that the corresponding results are
significantly better/worse than or similar to the results of the compared method.</p>
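      <p>The significance testing described here corresponds to the standard two-sided Wilcoxon rank-sum test; a minimal sketch using SciPy is shown below, where the per-run total costs of the two compared methods are made-up numbers for illustration.</p>
      <preformat>
from scipy.stats import ranksums

# Total costs over independent runs of two methods (illustrative numbers only).
model_driven = [41.2, 39.8, 42.5, 40.1, 38.9]
random_rotation = [45.3, 46.1, 44.8, 47.0, 43.9]

stat, p_value = ranksums(model_driven, random_rotation)
# Mark the result "(+)" when p_value is below the 0.05 significance level
# and the model-driven costs are lower, "(-)" or "(=)" otherwise.
print(stat, p_value)
</preformat>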
      <p>To provide insights into the performance, Figure 7 shows the convergence curves for the methods
using different training instances. It can be observed that the heuristics learned using a
mini-batch designed based on our model-driven approach perform better than the heuristics learned
using randomly generated training instances, throughout the whole training process.</p>
      <p>We further evaluate the performance of our proposed approach using a mini-batch with
multiple problem instances. Here we consider n = 3 and n = 5. Table 6 shows the evaluation results.</p>
      <p>As we can see from Table 6, our proposed model-driven approach using a mini-batch of
size larger than 1 achieves better results than using a mini-batch of size 1. However,
even with a mini-batch of size 1, our proposed model-driven approach performs better than
the existing approaches using randomly generated training instances, as done in the recent
literature on dynamic workflow scheduling, cf. [7], according to Table 4. This demonstrates
that using training instances generated based on the conceptual schema in Figure 2 can well
support GPHH in learning better heuristics. Hence, model-driven GPHH can produce better
heuristics than the traditional approach that uses randomly generated training instances.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and future work</title>
      <p>In this paper we proposed a model-driven machine learning approach to the dynamic
multi-workflow scheduling (DMWS) problem. For machine learning algorithms to effectively learn
optimization rules or heuristics, we proposed a model-driven approach to design mini-batches
of training instances. We designed a conceptual database schema for DMWS, from which training
instances can be generated in a systematic way to cover possible problem scenarios. Simulations
using benchmark and real-world data demonstrate that our proposed model-driven GPHH
approach outperforms existing approaches to dynamic multi-workflow scheduling problems.
Our novel model-driven approach is very effective in supporting machine learning algorithms
to train effective heuristics for dynamic workflow scheduling problems. Future work can be
conducted to investigate the optimal choice of the mini-batch size and the problem pool size.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] C. H. Papadimitriou, K. Steiglitz, Combinatorial optimization: algorithms and complexity,</p>
      <p>Courier Corporation, 1998.
[3] Z. Zhu, G. Zhang, M. Li, X. Liu, Evolutionary multi-objective workflow scheduling in
cloud, IEEE Transactions on Parallel and Distributed Systems 27 (2015) 1344–1357.
[4] Y. Yang, G. Chen, H. Ma, M. Zhang, Dual-tree genetic programming for
deadline-constrained dynamic workflow scheduling in cloud, in: International Conference on
Service-Oriented Computing (ICSOC), Springer, 2022, pp. 433–448.
[5] N. Gazagnadou, R. Gower, J. Salmon, Optimal mini-batch and step sizes for saga, in:</p>
      <p>International Conference on Machine Learning (ICML), PMLR, 2019, pp. 2142–2150.
[6] J. Wu, W. Hu, H. Xiong, J. Huan, V. Braverman, Z. Zhu, On the noisy gradient descent that
generalizes as SGD, in: International Conference on Machine Learning (ICML), PMLR,
2020, pp. 10367–10376.
[7] F. Zhang, Y. Mei, S. Nguyen, M. Zhang, Evolving scheduling heuristics via genetic
programming with feature selection in dynamic flexible job-shop scheduling, IEEE Transactions
on Cybernetics 51 (2021) 1797–1811.
[8] S. Nalchigar, E. Yu, R. Ramani, A conceptual modeling framework for business analytics,
in: International Conference on Conceptual Modeling (ER), Springer, 2016, pp. 35–49.
[9] S. Nalchigar, E. Yu, K. Keshavjee, Modeling machine learning requirements from three
perspectives: a case report from the healthcare domain, Requirements Engineering 26
(2021) 237–254.
[10] Y. Liu, Y. Mei, M. Zhang, Z. Zhang, Automated heuristic design using genetic programming
hyper-heuristic for uncertain capacitated arc routing problem, in: ACM Genetic and
Evolutionary Computation Conference (GECCO), 2017, pp. 290–297.
[11] S. Abrishami, M. Naghibzadeh, D. H. Epema, Deadline-constrained workflow scheduling
algorithms for infrastructure as a service clouds, Future Generation Computer Systems 29
(2013) 158–169.
[12] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task
scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems
13 (2002) 260–274.
[13] B. Tan, H. Ma, Y. Mei, M. Zhang, A cooperative coevolution genetic programming
hyper-heuristic approach for on-line resource allocation in container-based clouds, IEEE Trans.</p>
      <p>Cloud Comput. 10 (2022) 1500–1514.
[14] B. Tan, H. Ma, Y. Mei, A hybrid genetic programming hyper-heuristic approach for online
two-level resource allocation in container-based clouds, in: IEEE Congress on Evolutionary
Computation (CEC), IEEE, 2019, pp. 2681–2688.
[15] Y. Chen, T. Shi, H. Ma, G. Chen, Multi-objective location-aware service brokering in
multi-cloud - a GPHH approach with transfer learning, in: International Conference on
Applications of Evolutionary Computation (EvoStar), Springer, 2023, pp. 573–587.
[16] K.-R. Escott, H. Ma, G. Chen, Genetic programming based hyper heuristic approach for
dynamic workflow scheduling in the cloud, in: International Conference on Database and
Expert Systems Applications (DEXA), Springer, 2020, pp. 76–90.
[17] A. Nasiri, S. Nalchigar, E. Yu, W. Ahmed, R. Wrembel, E. Zimányi, From indicators to
predictive analytics: A conceptual modelling framework, in: IFIP Working Conference on
The Practice of Enterprise Modeling (PoEM), Springer, 2017, pp. 171–186.
[18] S. Nalchigar, E. Yu, Business-driven data analytics: A conceptual modeling framework,</p>
      <p>Data &amp; Knowledge Engineering 117 (2018) 359–372.
[19] C.-H. Youn, M. Chen, P. Dazzi, Cloud broker and cloudlet for workflow scheduling, Springer,
2017.
[20] B. Thalheim, Entity Relationship Modeling – Foundations of Database Technology,</p>
      <p>Springer, 2000.
[21] S. Hartmann, On the consistency of int-cardinality constraints, in: International
Conference on Conceptual Modeling (ER), Springer, 1998, pp. 150–163.
[22] H. Ma, K. Schewe, B. Thalheim, J. Zhao, View integration and cooperation in databases,
data warehouses and web information systems, J. Data Semantics (2005) 213–249.
[23] Q.-Z. Xiao, J. Zhong, L. Feng, L. Luo, J. Lv, A cooperative coevolution hyper-heuristic
framework for workflow scheduling problem, IEEE Transactions on Services Computing
15 (2022) 150–163.
[24] W. Chen, E. Deelman, Workflowsim: A toolkit for simulating scientific workflows in
distributed environments, in: IEEE International Conference on e-Science (E-Science),
IEEE, 2012, pp. 1–8.
[25] J. Liu, J. Ren, W. Dai, D. Zhang, P. Zhou, Y. Zhang, G. Min, N. Najjari, Online multi-workflow
scheduling under uncertain task execution time in IaaS clouds, IEEE Transactions on
Cloud Computing 9 (2019) 1180–1194.
[26] H. R. Faragardi, M. R. S. Sedghpour, S. Fazliahmadi, T. Fahringer, N. Rasouli, GRP-HEFT: A
budget-constrained resource provisioning scheme for workflow scheduling in IaaS clouds,
IEEE Transactions on Parallel and Distributed Systems 31 (2019) 1239–1254.
[27] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, P. Maechling, Scientific
workflow applications on Amazon EC2, in: IEEE International Conference on e-Science
(E-Science), IEEE, 2009, pp. 59–66.
[28] V. Arabnejad, K. Bubendorfer, B. Ng, Dynamic multi-workflow scheduling: a deadline and
cost-aware approach for commercial clouds, Future Generation Computer Systems 100
(2019) 98–108.
[29] M. Xu, Y. Mei, S. Zhu, B. Zhang, T. Xiang, F. Zhang, M. Zhang, Genetic programming for
dynamic workflow scheduling in fog computing, IEEE Transactions on Services Computing
16 (2023) 2657–2671.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schrijver</surname>
          </string-name>
          , et al.,
          <article-title>Combinatorial optimization: polyhedra and efficiency</article-title>
          , volume
          <volume>24</volume>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>