=Paper=
{{Paper
|id=Vol-3618/forum_paper_26
|storemode=property
|title=A model-driven machine learning approach to dynamic multi-workflow scheduling
|pdfUrl=https://ceur-ws.org/Vol-3618/forum_paper_26.pdf
|volume=Vol-3618
|authors=Hui Ma,Sven Hartmann,Gang Chen,Yifan Yang
|dblpUrl=https://dblp.org/rec/conf/er/0001HCY23
}}
==A model-driven machine learning approach to dynamic multi-workflow scheduling==
Yifan Yang¹, Hui Ma¹,*, Gang Chen¹ and Sven Hartmann²

¹ Victoria University of Wellington, Wellington, New Zealand
² Clausthal University of Technology, Clausthal-Zellerfeld, Germany

Abstract

Machine learning is becoming increasingly popular for solving dynamic combinatorial optimization problems. With the rise of cloud computing, machine learning has been applied to dynamic multi-workflow scheduling, a challenging combinatorial optimization problem that involves many problem features, e.g., workflow patterns, tasks, virtual machine capacities, and cost. Existing machine learning approaches to dynamic combinatorial optimization problems use randomly generated training instances to train optimization rules or heuristics, without paying attention to knowledge about the problem domain data. In this paper, we propose a model-driven machine learning approach to dynamic multi-workflow scheduling. In particular, we propose a conceptual database schema that models the information of the dynamic multi-workflow scheduling problem. We further propose a new approach to designing training instances for dynamic multi-workflow scheduling by incorporating important attributes of the problem into the database schema. We evaluate our proposed approach in a case study with real-world data. The results of the evaluation demonstrate that our proposed model-driven machine learning approach can effectively solve dynamic workflow scheduling problems.

Keywords: machine learning, dynamic workflow scheduling, data modeling

ER2023: Companion Proceedings of the 42nd International Conference on Conceptual Modeling: ER Forum, 7th SCME, Project Exhibitions, Posters and Demos, and Doctoral Consortium, November 06-09, 2023, Lisbon, Portugal
∗ Corresponding author. hui.ma@ecs.vuw.ac.nz (H. Ma); sven.hartmann@tu-clausthal.de (S. Hartmann)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

In the era of big data, more and more data-intensive applications are used in various domains to solve complex data-intensive problems. To solve such problems effectively, we need to design not only effective machine learning algorithms but also training datasets that capture the essential requirements of the problem domain. However, with the increasing complexity of combinatorial optimization problems, designing training datasets is becoming more and more challenging: the datasets must be representative of the problem domain and lead to solutions that are robust to unseen problem instances, while at the same time not hurting the efficiency of the training process. Combinatorial optimization is a category of optimization in which the size and relationships of the data grow massively with the escalating problem complexity [1, 2]. A typical example of a combinatorial problem is the dynamic workflow scheduling problem in cloud computing, where workflow tasks are scheduled to cloud resources in the best possible way with respect to some objectives, e.g., minimizing the makespan and/or response time. Cloud users often request multiple workflows to be scheduled for execution on heterogeneous cloud resources. This gives rise to the dynamic multi-workflow scheduling (DMWS) problem.
To solve such a combinatorial optimization problem, we need to use not only data about tasks and cloud resources (e.g., VMs), but also the relationships among tasks and workflows, because they can significantly affect the response time and the overall cost. Dynamic multi-workflow scheduling is an NP-hard problem [3]. A major challenge in practice is that scheduling decisions for the DMWS problem must be made in real-time. Therefore, heuristics need to be used. The manual design of heuristics is very time consuming and requires experts with good domain knowledge. Genetic programming hyper-heuristic (GPHH), a machine learning method, is an effective approach to automatically designing heuristics that does not require domain experts [4]. However, datasets are needed to train heuristics so that they can solve the problem effectively in real-time. To efficiently search for heuristics for dynamic multi-workflow scheduling, the quality of the datasets used for training plays a crucial role in making the heuristics effective in practice, e.g., robust to unseen problem instances [5]. While it is common practice to use historical datasets as training instances, they are not efficient for training because they involve expensive evaluations of the heuristics over all the historical data [6]. To avoid expensive training while at the same time training heuristics that are robust to unseen problem instances, it is important to carefully design the training instances.

Current approaches to designing training instances are ad hoc and not clearly described. Most existing work uses randomly generated training instances in the training process [7], without any attention to knowledge about the given datasets. Further, dynamic workflow scheduling is a dynamic combinatorial optimization problem for which different states of the problem domain, including resource scheduling requests and resource status, change dynamically and are used during the training process. Meanwhile, to train heuristics that are effective and robust in solving dynamic multi-workflow scheduling problems in real time, it is inefficient and unnecessary to use all possible states of the requests and data centers. A conceptual model of the training datasets, consisting of the data and the relationships between different data, would help make machine learning systematic and effective. A conceptual database schema consists of entities and relationships. Designing training data is essentially retrieving, or sampling, datasets from a database of historical data. This makes conceptual models attractive for the machine learning domain, since they provide a high-level view of the datasets. However, there is no data model in the literature that can be used to support machine learning processes in learning heuristics effectively and efficiently when there is a large number of requests and the workflows are extremely large. To take advantage of conceptual models, different views of the datasets can be retrieved as training datasets to efficiently evaluate and train heuristics [8]. Adequate modeling of historical data and training data plays an important role in designing a high-quality training dataset for machine learning. However, despite the popularity of machine learning for combinatorial optimization, there is limited research on data modeling for machine learning.
A data model, GR4ML, is proposed in [9] for machine learning, focusing on expressing machine learning requirements and solution design. In practice, datasets are often designed in an ad hoc way or based on best practice recommendations [10]. There are no common or well-defined approaches to model training datasets and design training instances using data models.

The overall aim of this paper is to propose a model-driven machine learning approach to the dynamic multi-workflow scheduling problem. In particular, we propose a GPHH approach that trains heuristics using well-defined training instances based on a conceptual database schema. We expect that the heuristics learned by our machine learning approach can schedule dynamic workflows effectively. The major contributions of this paper are the following:

• We propose a conceptual model that can be used for designing training datasets for effectively learning heuristics for dynamic workflow scheduling.
• We present a model-driven GPHH approach to automatically learn heuristics that can be used to schedule multiple workflows in cloud data centers. In particular, our proposed GPHH approach uses training datasets generated with the conceptual model designed for DMWS.
• We evaluate the effectiveness of our proposed conceptual model in designing training datasets with a case study using benchmark data of workflow scheduling.

To the best of our knowledge, our proposed model-driven approach is the first attempt to incorporate conceptual modeling into the machine learning process for solving dynamic workflow scheduling and other combinatorial optimization problems. In particular, our conceptual database schema provides a good overview of the structure of the data involved in dynamic workflow scheduling problems, to systematically support the design of training instances for effectively learning heuristics that can efficiently schedule the tasks of numerous dynamically arriving workflows.

Organization. This paper is organized as follows. Section 2 briefly outlines current approaches for machine learning, in particular for dynamic multi-workflow scheduling problems, from the literature, while in Section 3 we introduce dynamic workflow scheduling in more detail. In Section 4 we present a conceptual data schema for dynamic multi-workflow scheduling and propose a GPHH approach to automatically design heuristics for the DMWS problem, with a model-driven approach for designing training instances. In Section 5 we report on a case study that we have conducted to evaluate our proposed model-driven machine learning approach. Finally, Section 6 gives conclusions and an outlook on future work.

2. Related work

As dynamic multi-workflow scheduling is known to be NP-hard, heuristics are widely used to ensure scalability and real-time applicability. Heuristics are priority functions for making scheduling decisions on cloud resources, e.g., VMs. IaaS cloud partial critical paths (IC-PCP) is a heuristic proposed in [11] for scheduling computation requests in data centers. Heterogeneous earliest finish time (HEFT) is another manually designed heuristic, proposed in [12]. However, manually designing such heuristics requires domain knowledge and is time-consuming. Moreover, a heuristic may be effective on some problem instances but not on others, since only a small set of domain attributes is considered.
Machine learning approaches have been proposed in the literature to automatically learn heuristics for dynamic workflow scheduling and other dynamic scheduling problems. In [13, 14], GPHH is utilized for online service resource allocation in clouds. In [15], a GPHH approach is proposed to solve multi-cloud resource allocation problems; a single training instance is used during the whole training process. In order to generate heuristics that are robust to unseen problem instances, a GPHH approach is proposed in [16] that uses training instances randomly generated during the training process to learn heuristics for the dynamic workflow scheduling problem. A similar way of generating training instances is used in [7], which proposes a GPHH approach to learn heuristics for the dynamic job shop scheduling problem. However, none of the above-mentioned machine learning studies paid attention to the structure of the problem attributes or data when designing training instances.

To understand the machine learning requirements of organizations, conceptual modelers have proposed different models. In [17], a conceptual modeling framework is proposed to systematically justify the need for predictive analytics within an organizational context and to identify the requirements of predictive analytics design. To incorporate domain-specific questions, [18] adopts conceptual modeling methods to elicit analytical requirements, design machine learning solutions, and ensure the alignment between analytic initiatives and business strategies, among others. The modeling framework consists of three complementary modeling views: a business view, an analytic design view, and a data preparation view. However, as far as we are aware, there is no systematic way of designing training instances for machine learning in dynamic combinatorial optimization problems, including dynamic workflow scheduling problems.

In summary, research on data modeling of training instances for machine learning of combinatorial optimization is still in its early stages. To the best of our knowledge, there is very limited work on conceptual modeling of machine learning data for dynamic multi-workflow scheduling, which motivates us to propose an effective design method for training instances by incorporating application-specific data models.

3. Preliminaries

SLA-aware dynamic multi-workflow scheduling (DMWS) has attracted rising interest in research and practice. Workflows arrive at a data center dynamically and are unknown beforehand, i.e., requests for scheduling workflows arrive on the fly and must be executed in real-time to meet deadline constraints set in service-level agreements (SLAs). The tasks of the workflows are scheduled for execution on a number of VM instances that are available in a cloud data center. Each VM instance is of a VM type, which defines its capacities. During the process, we need to make two types of decisions concurrently: task allocation, which allocates the tasks of the workflows to VMs, and task sequencing, which sequences the tasks on each individual VM instance. Figure 1 shows an overview of DMWS. The overall goal is to satisfy the workflow scheduling requests with the best possible quality, e.g., minimizing the makespan of workflows, minimizing the costs of renting cloud resources, and minimizing SLA violation penalties. In particular, the SLA-aware dynamic multi-workflow scheduling problem focuses on scheduling a sequence of workflows to cloud resources so that the overall cost and SLA penalty are minimized.
In the rest of the paper, DMWS refers to the SLA-aware dynamic multi-workflow scheduling problem. A workflow 𝑊 arrives at a certain time with a deadline and can be represented as a directed acyclic graph (DAG), denoted by 𝐺 = (𝑇, 𝐸). Herein, 𝑇 = {𝑡ᵢ}, 𝑖 ∈ {1, …, 𝑁}, is a set of nodes representing the tasks in the workflow, and 𝐸 = {𝑒ᵢⱼ} is a set of directed edges where 𝑒ᵢⱼ connects 𝑡ᵢ with 𝑡ⱼ. A task 𝑡ᵢ is defined by its size 𝑠ᵢ. For an edge 𝑒ᵢⱼ, 𝑡ᵢ is a parent task and 𝑡ⱼ is a child task. Note that a child task can be executed only after all its parent tasks are completed. We use Parent(𝑡ᵢ) to denote all immediate predecessors of task 𝑡ᵢ. There might be multiple tasks allocated to the same VM instance for execution. We use Pre(𝑡ᵢ) to denote the previous task of 𝑡ᵢ executed on the same VM.

Figure 1: Dynamic multi-workflow scheduling in a cloud data center

In a data center, there are VM instances to perform the scheduled workflow tasks. Each VM instance is of a VM type that is defined by its speed 𝑣ₖ, memory 𝑚ₖ, and cost 𝑐ₖ. Any task must be scheduled to a VM instance with sufficient capacity, i.e., CPU and memory. Note that one VM instance can only execute one task at any time, and each task can only be executed on a single VM instance. The time to process a task depends on the resource requirement 𝑠ᵢ of the task and on the VM to which it is scheduled. The more resource capacity is allocated to a task, the faster it is processed and the lower the probability of violating the deadline constraints set in the SLA. However, the more resources are used for executing workflows, the higher the cost. Different workflow schedules will lead to different rental durations for VMs, and hence to different makespans and costs.

For a given sequence of workflows 𝒲 = [𝑊₁, …, 𝑊ₖ] and a set of available VM instances, the objective of the DMWS problem is to schedule workflow tasks for execution on VMs so as to minimize the total cost of workflow execution, consisting of the rental fees (denoted as RentFee) incurred from resource provisioning and the deadline penalties (denoted as Penalty) caused by violations, formulated as

min TotalCost = ∑_{k ∈ 𝒱} RentFee_k + ∑_{i: W_i ∈ 𝒲} Penalty_i    (1)

where 𝒱 refers to the set of rented VMs. In particular, RentFee_k is the rental fee of VM 𝑉ₖ, defined as the hourly-based fee for the rental period of 𝑉ₖ between the start time ST_{t_k,first} of processing its first allocated task and the finish time FT_{t_k,last} of processing its last allocated task, calculated as

RentFee_k = PRICE_k ⋅ ⌈(FT_{t_k,last} − ST_{t_k,first}) / 3600⌉    (2)

Penalty_i is the penalty of workflow 𝑊ᵢ, defined as the penalty fee paid for the portion of its execution beyond its deadline DL_i, calculated as

Penalty_i = 𝛿 ⋅ max{0, AT_i + Makespan_i − DL_i}    (3)

where 𝛿 is a penalty coefficient [19]. A larger coefficient represents a lower tolerance for violating workflow deadlines. Details of the problem definition can be found in [4].
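To make the objective function concrete, the following sketch evaluates Eqs. (1)-(3) for a given (already computed) schedule. It is only an illustrative reading of the formulas, not the authors' implementation; the identifiers and the input format are hypothetical.

```python
import math

HOUR = 3600  # VMs are billed per started hour, cf. Eq. (2)

def rent_fee(price_per_hour, start_first_task, finish_last_task):
    """Eq. (2): hourly fee for the span between the start of the first task
    and the finish of the last task allocated to a rented VM."""
    return price_per_hour * math.ceil((finish_last_task - start_first_task) / HOUR)

def penalty(delta, arrival_time, makespan, deadline):
    """Eq. (3): penalty proportional to the time by which the workflow
    overruns its deadline (zero if it finishes in time)."""
    return delta * max(0.0, arrival_time + makespan - deadline)

def total_cost(vm_usage, workflows, delta):
    """Eq. (1): sum of rental fees over rented VMs plus deadline penalties."""
    fees = sum(rent_fee(v["price"], v["start_first"], v["finish_last"]) for v in vm_usage)
    penalties = sum(penalty(delta, w["arrival"], w["makespan"], w["deadline"]) for w in workflows)
    return fees + penalties

# Example: one m5.large-style VM used for 2.4 hours (billed as 3 hours),
# and one workflow that finishes 100 s after its deadline.
vms = [{"price": 0.096, "start_first": 0.0, "finish_last": 2.4 * HOUR}]
wfs = [{"arrival": 0.0, "makespan": 4000.0, "deadline": 3900.0}]
print(total_cost(vms, wfs, delta=0.01))  # 0.096 * 3 + 0.01 * 100 = 1.288
```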
4. Model-driven GPHH for dynamic multi-workflow scheduling

In this section we present our model-driven genetic programming hyper-heuristic (GPHH) approach for the dynamic multi-workflow scheduling problem. To begin with, we present a data model of the domain data for scheduling multiple dynamic workflows. Afterwards we describe our proposed GPHH approach using model-driven designed training instances.

Data model. To ensure the robustness of the heuristics obtained by machine learning, we first model the data involved in the training of heuristics by GPHH. Figure 2 presents a conceptual data schema, denoted as ER-DMWS, for the problem of dynamic multi-workflow scheduling.

Figure 2: ER-DMWS - Data model for dynamic workflow scheduling training instances

We utilize the higher-order ER model (HERM) [20], which preserves the aggregation-based modeling principle of the basic ER model where a conceptual data schema consists of a set of object types. Object types are characterized by their attributes. Recall the definition of attributes from [20]. Let U denote a countable set of simple attributes (called the universe), together with a type assignment 𝑡𝑝 that assigns to each attribute 𝐴 ∈ U its data type 𝑡𝑝(𝐴). Complex attributes may be obtained from simple ones by nesting. Let A denote the smallest superset of U such that 𝑋(𝐴₁, …, 𝐴ₙ), 𝑋{𝐴}, 𝑋[𝐴], 𝑋⟨𝐴⟩, 𝑋₁(𝐴₁) ⊕ ⋯ ⊕ 𝑋ₙ(𝐴ₙ), 𝑋(𝐴₁ → 𝐴₂) ∈ A holds whenever 𝐴, 𝐴₁, …, 𝐴ₙ ∈ A holds, with labels 𝑋, 𝑋₁, …, 𝑋ₙ chosen from some fixed alphabet ℒ. An important extension of HERM is that dynamic and complex data types are permitted in A. The data type 𝑡𝑝(𝐴) assigned to a (simple) attribute 𝐴 ∈ U is some simple data type, like INT, FLOAT, STRING, DATE, or TIME, which cover the relevant features of dynamic workflow scheduling problems. It should be emphasized that complex data types offer additional opportunities for machine learning researchers, but their use is not mandatory.

A structured component is a pair 𝜌 ∶ 𝐶 with a role name 𝜌 and a component expression 𝐶 ∈ C. An object type 𝑂 is a triple (𝑐𝑜𝑚𝑝(𝑂), 𝑎𝑡𝑡𝑟(𝑂), 𝑘𝑒𝑦(𝑂)) where 𝑐𝑜𝑚𝑝(𝑂) is a finite set {𝜌₁ ∶ 𝐶₁, …, 𝜌ₙ ∶ 𝐶ₙ} of structured components with pairwise different role names 𝜌₁, …, 𝜌ₙ, 𝑎𝑡𝑡𝑟(𝑂) is a finite set {𝐴₁, …, 𝐴ₘ} ⊆ A of complex attributes, and 𝑘𝑒𝑦(𝑂) ⊆ 𝑐𝑜𝑚𝑝(𝑂) ∪ 𝑎𝑡𝑡𝑟(𝑂) is the primary key of 𝑂. An object type 𝑂 is called an entity type if 𝑐𝑜𝑚𝑝(𝑂) is the empty set; otherwise it is called a relationship type.

Example 1. In the diagram of ER-DMWS in Figure 2, entity types are displayed as rectangles while relationship types are displayed as diamonds. Let 𝑂 be a relationship type and 𝑋ᵢ a component of 𝑂, say in the form 𝑂ᵢ or 𝑝ᵢ ∶ 𝑂ᵢ. In the diagram this is displayed by an edge from 𝑂 to 𝑂ᵢ. The relationship type Depend in Figure 2 can be specified as follows:

Depend = ({Parent: Task, Child: Task}, {}, {Parent: Task, Child: Task})

It has two structured components with different role names, no attributes, and its primary key includes both structured components.

A cardinality constraint is an expression 𝑐𝑎𝑟𝑑(𝑂, 𝑋ᵢ) = (𝑎, 𝑏) where 𝑎 is a natural number and 𝑏 is a natural number or ∞. For more details on cardinality constraints see [20, 21]. For the integration of different user views in HERM see, e.g., [22]. Cardinality constraints are also displayed in the diagram. For example, the constraint 𝑐𝑎𝑟𝑑(Problem-Instance, Workflow) = (𝑚, 𝑚) states that every problem instance consists of 𝑚 workflows. The cardinality constraint 𝑐𝑎𝑟𝑑(Mini-Batch, Problem-Instance) = (𝑛, 𝑛) states that every mini-batch consists of 𝑛 problem instances. Note that a mini-batch, containing a set of problem instances, will be used in the machine learning process to ensure the generality of the learned heuristics.
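For readers who prefer code over diagrams, the ER-DMWS schema of Figure 2 can be approximated by plain data structures as sketched below. This mapping is our own simplification for illustration; attribute names follow the diagram, while the concrete types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Task:
    tid: int
    size: float          # task size / resource requirement
    mem: float

@dataclass
class Pattern:
    pattern_name: str                                             # e.g. "CyberShake"
    size: int                                                     # number of tasks
    tasks: List[Task] = field(default_factory=list)
    depend: List[Tuple[int, int]] = field(default_factory=list)   # Depend: (parent TID, child TID)

@dataclass
class Workflow:
    workflow_id: int
    pattern: Pattern
    arrive_time: float
    deadline: float

@dataclass
class VMType:
    type_name: str       # e.g. "m5.large"
    cpu: int
    mem: float
    cost: float          # hourly price

@dataclass
class VM:
    vid: int
    vm_type: VMType

@dataclass
class ProblemInstance:
    instance_id: int
    workflows: List[Workflow]          # card(Problem-Instance, Workflow) = (m, m)

@dataclass
class MiniBatch:
    generation: int
    instances: List[ProblemInstance]   # card(Mini-Batch, Problem-Instance) = (n, n)

# The Schedule relationship (a Task executed on a VM at a Time) could then be
# represented as a list of (task, vm, time) triples produced by a heuristic.
```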
We use the conceptual schema in Figure 2 to represent data on workflows, virtual machines, and scheduling, as well as on the training instances, i.e., mini-batches, used during the machine learning process. Note that the entity and relationship types in gray capture the information on workflows, while the entity and relationship types in green capture the information on cloud resources, i.e., virtual machines. Further, the two relationship types Schedule and Mini-Batch in blue capture the information on scheduling decisions and training instances. For example, a workflow pattern can be considered as a complex data type, which may be regarded as a list of tasks and a set of edges defined by pairs of tasks. If the machine learning researchers consider the internal task structure of a workflow pattern to be conceptually irrelevant for the application under development, then they can define a simple attribute pattern with 𝑡𝑝(𝑝𝑎𝑡𝑡𝑒𝑟𝑛) = DAG. Otherwise, if the internal structure of the workflow pattern is conceptually relevant, then they may use a complex attribute pattern({task}, ({parent: task, child: task})).

Model-driven approach to automatically learning heuristics. Now we present our model-driven GPHH for dynamic multi-workflow scheduling. Figure 3 shows a high-level overview of the GPHH algorithm for learning heuristics based on model-driven training instances.

Figure 3: Model-driven machine learning for dynamic multi-workflow scheduling

Here, we represent heuristics as syntax trees, with internal nodes as arithmetic functions and leaf nodes as terminals. For an example see the heuristic in Figure 3. The terminals can be either attributes or derived attributes of the object types in the ER-DMWS schema, e.g., the task size 𝑇𝑆. Table 1 summarizes the terminal set used in this work; the function set consists of {+, −, ×, ÷, max, min}, based on [23].

During the process of solving a DMWS problem instance in real time, any GP tree evolved by GPHH is treated as a priority function. Whenever a task in a workflow becomes ready for execution, i.e., all its parent tasks have been completed, the GP tree is applied to each eligible cloud resource, e.g., a leased VM instance with the requested CPU and memory capacities to accomplish the task, to calculate its priority. For this purpose, the value of each terminal node of the GP tree is determined first. Note that, for different cloud resources under consideration, the same terminal node can have different values. With respect to any candidate cloud resource, driven by the corresponding values of all terminal nodes, we can proceed to calculate the value of each internal node of the GP tree in bottom-up order and eventually determine the value of the root node, which is assigned to the candidate cloud resource as its priority. After calculating the priorities of all the eligible VMs, the VM with the highest priority is then scheduled to process the task.
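The decision step just described can be sketched as follows: the evolved GP tree acts as a priority function over the terminal values of Table 1, computed per candidate VM, and the ready task goes to the VM with the highest priority. The heuristic below is a hand-written stand-in for an evolved GP tree, and the data structures are simplified assumptions.

```python
def choose_vm(task, eligible_vms, heuristic, workflow):
    """Evaluate the priority function on every eligible VM and return the best one."""
    best_vm, best_priority = None, float("-inf")
    for vm in eligible_vms:
        terminals = {
            "TS": task["size"],
            "ET": task["size"] / vm["compute_unit"],   # execution time of the task on this VM
            "CU": vm["compute_unit"],
            "PRICE": vm["price"],
            "TIQ": vm["queued_time"],                  # work already queued on the VM
            "RDL": workflow["deadline"] - workflow["now"],
        }
        priority = heuristic(terminals)
        if priority > best_priority:
            best_vm, best_priority = vm, priority
    return best_vm

# A toy stand-in for a GP tree: prefer fast, cheap, lightly loaded VMs.
heuristic = lambda t: -(t["ET"] + t["TIQ"]) - t["PRICE"] * t["TS"]

vms = [{"compute_unit": 2, "price": 0.096, "queued_time": 500.0},
       {"compute_unit": 8, "price": 0.384, "queued_time": 0.0}]
task = {"size": 4000.0}
wf = {"deadline": 5000.0, "now": 1200.0}
print(choose_vm(task, vms, heuristic, wf))  # picks the faster, idle VM in this example
```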
The above process is performed repeatedly to schedule the execution of every task of each workflow, whenever a task becomes ready. Therefore, guided by a GP tree, the full schedule that solves any DMWS problem instance can be obtained. Details regarding the GPHH algorithm for DMWS can be found in [4, 16].

Table 1: The terminal set

| Category | Terminal | Definition |
|---|---|---|
| task-related | TS | The size of a task |
| task-related | ET | The execution time of a task |
| VM-related | CU | The compute unit of a VM |
| VM-related | PRICE | The price of renting a VM for one hour |
| VM-related | TIQ | The total execution time of all tasks in a VM queue |
| VM-related | VMR | The remaining available time for a VM |
| VM-related | FT | The finish time of a task on a VM |
| VM-related | NIQ | The number of tasks in a VM queue |
| workflow-related | NOC | The number of successor tasks of a task |
| workflow-related | NOR | The number of remaining tasks in a workflow |
| problem-specific | RDL | The remaining deadline time of a workflow |

Apparently, different GP trees can produce different full schedules for the same DMWS problem instance, and each full schedule leads to a different total cost. The aim of the GPHH algorithm is to identify a GP tree that achieves, on average, the smallest total cost, i.e., the objective defined in Eq. (1), for a wide variety of DMWS problem instances. For this purpose, the GPHH algorithm randomly generates a set of initial GP trees (called the initial population). Each GP tree of the initial population is evaluated using training instances and a fitness function, i.e., TotalCost. GP trees with good fitness values are selected for reproduction, mutation and crossover to generate the next generation of GP trees. This process continues until the stopping criteria are met, e.g., the maximum number of generations is reached. The evolved GP tree with the best fitness is then reported by the GPHH algorithm as its final learned heuristic for DMWS. As can be seen in Figure 3, in every generation, i.e., iteration, all the heuristics are evaluated using the training instances. Evidently, these training instances play an important role in the evolution (or heuristic learning) process.

Model-driven training instance design. The modeling technique that we have just proposed can be generalized to generate training datasets for GPHH for dynamic multi-workflow scheduling. In order to learn good heuristics that perform well on unseen problem instances, different training instances are rotated from generation to generation, see Figure 4. However, if the training instances are too different from generation to generation, then no individuals can steadily survive through the generations to evolve. Therefore, we need to carefully design suitable training instances. This can be controlled by specifying a proper size for the instance pool. In the meantime, to train heuristics that perform well on unseen problem instances, we adapt the concept of 'mini-batch' [5] in the design of training instances. As can be seen in Figure 2, a mini-batch may contain one or more problem instances. We will explore settings with different batch sizes in the section on experiments.

Figure 4: Generating training instances for GPHH

Figure 4 shows two possible ways to generate training instances, i.e., the model-driven and the traditional approach. The traditional approach randomly generates a problem instance consisting of 𝑚 workflows following randomly selected workflow patterns and scales.
Our model-driven approach creates a problem instance pool, which consists of 𝑝 problem instances. For each generation, a mini-batch of training instances is created by selecting 𝑛 problem instances from the pool.

Figure 5: Rotating training instances through GPHH generations: (a) model-driven approach, (b) traditional approach

As can be seen in Figure 4, our model-driven approach first creates a problem instance pool before running GPHH. During the training process of GPHH, training instances are obtained by querying the problem instance pool. In the following, we provide details of each step.

Definition 1. A dynamic multi-workflow scheduling problem instance 𝐼ₜ consists of a list of 𝑚 workflows, i.e., 𝐼ₜ = [𝑤ₜ₁, …, 𝑤ₜₘ], where each 𝑤ₜᵢ is randomly generated using the available workflow patterns and sizes, (PatternName, Size), and is assigned a random arrival time ArriveTime. A problem instance pool ℙ consists of 𝑝 problem instances, i.e., ℙ = {𝐼₁, …, 𝐼ₜ, …, 𝐼ₚ}.

As shown in Figure 2, we propose to use a mini-batch, consisting of a set of problem instances, for each generation of the training process.

Definition 2. A mini-batch 𝐵(𝑔ᵢ) for generation 𝑔ᵢ is a set of 𝑛 problem instances, i.e., 𝐵(𝑔ᵢ) = {𝐼¹(𝑔ᵢ), 𝐼²(𝑔ᵢ), …, 𝐼ᵏ(𝑔ᵢ), …, 𝐼ⁿ(𝑔ᵢ)}, where each problem instance 𝐼ᵏ(𝑔ᵢ) is retrieved from the problem instance pool ℙ, i.e., 𝐼ᵏ(𝑔ᵢ) ∈ ℙ for 𝑘 = 1, …, 𝑛.

To create a problem instance pool ℙ, we generate a set of problem instances using the schema defined for the DMWS problem, each time with a different combination of patterns, sizes and arrival times. In this way, our problem instance pool contains a variety of problem instances that represent various possible problem scenarios. We can then obtain a mini-batch of 𝑛 problem instances by randomly selecting 𝑛 problem instances from ℙ.
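A minimal sketch of Definitions 1 and 2, i.e., building a problem instance pool and drawing a mini-batch from it in each generation, is given below. The pattern names and the values p = 20, n = 3 mirror the case study that follows; the helper names, the number of workflows per instance, the deadline range and the uniform sampling choices are assumptions made for illustration.

```python
import random

PATTERNS = ["CyberShake", "Inspiral", "Montage", "SIPHT"]
SIZES = [30, 50, 100]

def random_workflow(max_arrival=3600.0):
    """A workflow = (PatternName, Size) plus a random arrival time and deadline (assumed ranges)."""
    pattern, size = random.choice(PATTERNS), random.choice(SIZES)
    arrive_time = random.uniform(0.0, max_arrival)
    deadline = arrive_time + random.uniform(1800.0, 7200.0)
    return {"pattern": pattern, "size": size, "arrive": arrive_time, "deadline": deadline}

def make_problem_instance(m):
    """Definition 1: a problem instance I_t is a list of m randomly generated workflows."""
    return [random_workflow() for _ in range(m)]

def make_pool(p, m):
    """A problem instance pool P containing p problem instances."""
    return [make_problem_instance(m) for _ in range(p)]

def mini_batch(pool, n):
    """Definition 2: a mini-batch B(g_i) of n problem instances drawn from the pool."""
    return random.sample(pool, n)

# Example: pool of p = 20 instances with (an assumed) m = 30 workflows each;
# in each generation a mini-batch of n = 3 instances evaluates the GP population.
pool = make_pool(p=20, m=30)
for generation in range(50):
    batch = mini_batch(pool, n=3)
    # evaluate every GP tree of the current population on `batch` here (omitted)
```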
5. A case study

To evaluate the performance of our proposed model-driven approach, we have implemented a machine learning simulator using benchmark data of workflow scheduling as in [4, 24, 25], i.e., benchmark workflow patterns of various sizes, and VM data obtained from commercial AWS data centers [26, 27]. The simulator is used to train heuristics for dynamic workflow scheduling. After the heuristics are trained using the simulator, we evaluate them using test datasets. We evaluated the performance (in terms of the total cost) of the heuristics in solving SLA-aware dynamic multi-workflow scheduling, comparing them with heuristics trained on training datasets generated with an existing approach. That is, two different training datasets are used, one based on our proposed model-driven approach and the other based on the existing approach [12, 16, 28], and two sets of heuristics are trained, one on each of the two training datasets. Since real-world datasets were considered, the number of workflow scheduling requests naturally varies throughout the day and week. This may lead to increased execution times for individual workflows due to the queuing time of tasks on VMs.

Datasets. We first discuss the simulation configuration, followed by the parameter settings of GPHH; the detailed setup of the comparative experiments is introduced as well. A simulated cloud environment is used to experimentally evaluate the proposed model-driven GPHH approach to DMWS. It contains several key components, listed below.

VM Types: Following Amazon EC2¹, we assume that the cloud environment supports six different VM types, with their respective configurations summarized in Table 2. The maximum number of VM instances of each type is unlimited.

¹ https://aws.amazon.com/ec2/pricing/on-demand/

Table 2: Configurations of 6 VM types based on Amazon EC2

| Instance Name | vCPU | Memory | On-Demand hourly rate |
|---|---|---|---|
| m5.large | 2 | 8 GiB | $0.096 |
| m5.xlarge | 4 | 16 GiB | $0.192 |
| m5.2xlarge | 8 | 32 GiB | $0.384 |
| m5.4xlarge | 16 | 64 GiB | $0.768 |
| m5.8xlarge | 32 | 128 GiB | $1.536 |
| m5.12xlarge | 48 | 192 GiB | $2.304 |

Workflows: Four widely used workflow patterns², namely CyberShake, Inspiral, Montage, and SIPHT (Figure 6), are employed for our experimental study. Table 3 gives detailed information on the 12 heterogeneous workflows used in our experiments.

² https://confluence.pegasus.isi.edu/display/pegasus/Deprecated+Workflow+Generator

Figure 6: Four workflow patterns (CyberShake, Inspiral, Montage, SIPHT) widely used in the state-of-the-art literature on workflow scheduling

Table 3: Information on the 12 different workflow types based on Figure 6 and utilized in our experiments

| Index | DAG-Size | Number of nodes | Number of edges | Average task process time (CU=1) |
|---|---|---|---|---|
| 1 | CyberShake_30 | 30 | 52 | 405.62 s |
| 2 | CyberShake_50 | 50 | 88 | 487.86 s |
| 3 | CyberShake_100 | 100 | 180 | 514.52 s |
| 4 | Inspiral_30 | 30 | 35 | 3529.10 s |
| 5 | Inspiral_50 | 50 | 60 | 3763.82 s |
| 6 | Inspiral_100 | 100 | 119 | 3363.83 s |
| 7 | Montage_25 | 25 | 45 | 145.76 s |
| 8 | Montage_50 | 50 | 106 | 162.76 s |
| 9 | Montage_100 | 100 | 233 | 172.69 s |
| 10 | SIPHT_30 | 29 | 33 | 3060.12 s |
| 11 | SIPHT_60 | 58 | 66 | 3219.01 s |
| 12 | SIPHT_100 | 97 | 109 | 2866.76 s |

Baselines. Several baselines are utilized to provide a point of comparison for the experiments.

• HEFT: Heterogeneous earliest finish time [12] is a widely known heuristic algorithm for workflow scheduling.
• Random Generate (no rotation): Traditional approach where the GPHH algorithm uses a single training instance that is kept constant across generations, i.e., all individuals in all generations are evaluated on the same training instance [15].
• Random Generate (rotation): Traditional approach where the GPHH algorithm uses one training instance per generation and randomly generates a new training instance in each generation [7].

Parameter settings. The study utilized the default parameter settings for GPHH, following the recommendations provided in [4, 29]. The population size was set to 1024, with a total of 50 generations. One individual was designated as elite, and the tournament size was set to 7. The crossover, mutation, and reproduction rates were set to 0.80, 0.15, and 0.05, respectively. Furthermore, the initial depth of the GP trees ranged from 2 to 6, and the maximum depth was constrained to 8 throughout the evolutionary process.
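The paper does not describe its implementation, but as an illustration the GPHH parameter settings above could be expressed with the DEAP genetic programming library roughly as follows; the terminal names follow Table 1, and everything else (function names, protected division, mutation sub-tree depths, library choice) is an assumption of this sketch.

```python
import operator
from deap import base, creator, gp, tools

def protected_div(a, b):  # division primitive guarded against division by zero
    return a / b if abs(b) > 1e-9 else 1.0

TERMINALS = ["TS", "ET", "CU", "PRICE", "TIQ", "VMR", "FT", "NIQ", "NOC", "NOR", "RDL"]

pset = gp.PrimitiveSet("DMWS", len(TERMINALS))
pset.renameArguments(**{f"ARG{i}": name for i, name in enumerate(TERMINALS)})
for fn, arity in [(operator.add, 2), (operator.sub, 2), (operator.mul, 2),
                  (protected_div, 2), (max, 2), (min, 2)]:
    pset.addPrimitive(fn, arity)

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))   # minimize TotalCost, Eq. (1)
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=2, max_=6)  # initial depth 2-6
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("select", tools.selTournament, tournsize=7)            # tournament size 7
toolbox.register("mate", gp.cxOnePoint)
toolbox.register("expr_mut", gp.genFull, min_=0, max_=2)
toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)
for op in ("mate", "mutate"):                                            # depth limit of 8
    toolbox.decorate(op, gp.staticLimit(key=operator.attrgetter("height"), max_value=8))

POP_SIZE, N_GENERATIONS, N_ELITES = 1024, 50, 1
P_CROSSOVER, P_MUTATION, P_REPRODUCTION = 0.80, 0.15, 0.05
# The fitness evaluation (running the DMWS simulator on the current mini-batch)
# and the elitist generational loop are omitted here.
```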
Evaluation results. Let Mini-batch Select 𝑛/𝑝 denote our proposed model-driven approach where the GPHH algorithm randomly selects 𝑛 problem instances from a problem instance pool with 𝑝 instances in each generation. Next we show the performance of the proposed model-driven GPHH approach in comparison to the baselines.

Table 4: The mean (standard deviation) of the test performance of 30 independent runs of our proposed model-driven approach using mini-batch size 𝑛 = 1 and problem instance pool sizes 𝑝 = 10, 20, 40 in comparison to the baselines HEFT and the traditional GPHH approaches

| Test Case | HEFT | Random Generate (no rotation) | Random Generate (rotation) | Mini-batch Select 1/10 (rotation) | Mini-batch Select 1/20 (rotation) | Mini-batch Select 1/40 (rotation) |
|---|---|---|---|---|---|---|
| mix_all | 74.03(6.47) | 83.28(15.26) | 66.75(4.85) | 69.09(12.81) | 66.37(4.59) | 72.17(13.35) |
| mix_small | 47.67(4.68) | 40.01(7.38) | 31.65(2.19) | 33.71(6.79) | 32.52(3.06) | 34.98(6.81) |
| mix_medium | 72.0(2.37) | 75.05(13.53) | 60.18(4.03) | 64.5(10.45) | 62.35(4.15) | 66.7(11.12) |
| mix_large | 110.25(8.85) | 122.38(19.16) | 99.88(6.58) | 103.05(16.56) | 99.89(6.01) | 108.18(20.17) |
| pure_cybershake | 37.05(0.63) | 32.95(6.34) | 26.68(2.62) | 27.08(4.41) | 26.83(3.48) | 27.39(4.68) |
| pure_inspiral | 124.55(15.77) | 137.06(31.45) | 119.17(7.44) | 120.75(14.14) | 115.74(7.45) | 119.74(10.64) |
| pure_montage | 36.03(0.5) | 23.27(3.58) | 19.28(1.54) | 18.76(2.91) | 17.62(1.93) | 18.97(3.65) |
| pure_sipht | 89.87(0.08) | 103.6(26.16) | 86.89(3.73) | 88.28(9.61) | 85.52(4.89) | 89.99(10.97) |
| Average Rank | 5.25 | 5.62 | 1.75 | 3 | 1.5 | 3.88 |

We have also conducted the Wilcoxon rank-sum test with a significance level of 0.05 to make pairwise comparisons (each algorithm is compared with all the algorithms to its left in the table). The "(+)/(-)/(=)" after each entry in the table indicates that the corresponding results are significantly better than, significantly worse than, or similar to the results of the compared method. To provide insights into the performance, Figure 7 shows the convergence curves for the methods using different training instances. It can be observed that the heuristics learned using a mini-batch designed with our model-driven approach perform better than the heuristics learned using randomly generated training instances, throughout the whole training process.
Table 5: Wilcoxon test on the approaches compared in Table 4 (each algorithm is compared with all the algorithms to its left)

| Test Case | HEFT | Random Generate (no rotation) | Random Generate (rotation) | Mini-batch Select 1/10 (rotation) | Mini-batch Select 1/20 (rotation) | Mini-batch Select 1/40 (rotation) |
|---|---|---|---|---|---|---|
| mix_all | — | (-) | (+)(+) | (+)(+)(=) | (+)(+)(=)(=) | (=)(+)(-)(=)(-) |
| mix_small | — | (+) | (+)(+) | (+)(+)(=) | (+)(+)(=)(=) | (+)(+)(-)(=)(=) |
| mix_medium | — | (=) | (+)(+) | (+)(+)(-) | (+)(+)(-)(=) | (+)(+)(-)(=)(=) |
| mix_large | — | (-) | (+)(+) | (+)(+)(=) | (+)(+)(=)(=) | (=)(+)(-)(=)(-) |
| pure_cybershake | — | (+) | (+)(+) | (+)(+)(=) | (+)(+)(=)(=) | (+)(+)(=)(=)(=) |
| pure_inspiral | — | (-) | (=)(+) | (=)(+)(=) | (+)(+)(=)(=) | (+)(+)(=)(=)(=) |
| pure_montage | — | (+) | (+)(+) | (+)(+)(+) | (+)(+)(+)(+) | (+)(+)(=)(=)(-) |
| pure_sipht | — | (-) | (+)(+) | (+)(+)(=) | (+)(+)(+)(=) | (+)(+)(=)(=)(=) |

We further evaluate the performance of our proposed approach using mini-batches with multiple problem instances; here we consider 𝑛 = 3, 5. Table 6 shows the evaluation results.

Table 6: The mean (standard deviation) of the test performance of 30 independent runs of our proposed model-driven approach using mini-batch sizes 𝑛 = 1, 3, 5 and problem instance pool sizes 𝑝 up to 40

| Test Case | Mini-batch Select 1/20 | Mini-batch Select 3/10 | Mini-batch Select 3/20 | Mini-batch Select 5/20 | Mini-batch Select 5/40 |
|---|---|---|---|---|---|
| mix_all | 66.37(4.59) | 62.86(4.32)(+) | 64.79(6.0)(=) | 62.77(4.52)(+) | 63.87(5.2)(+) |
| mix_small | 32.52(3.06) | 30.47(2.09)(+) | 31.37(3.54)(+) | 30.26(1.84)(+) | 30.7(2.01)(+) |
| mix_medium | 62.35(4.15) | 59.05(4.12)(+) | 60.17(5.47)(+) | 58.99(4.07)(+) | 59.62(4.62)(+) |
| mix_large | 99.89(6.01) | 95.26(6.87)(+) | 97.27(8.38)(=) | 94.72(7.0)(+) | 95.8(7.98)(+) |
| pure_cybershake | 26.83(3.48) | 26.07(1.8)(=) | 26.27(2.52)(=) | 26.06(1.47)(=) | 26.26(2.12)(=) |
| pure_inspiral | 115.74(7.45) | 115.59(7.99)(=) | 113.21(7.79)(=) | 114.16(9.03)(=) | 117.16(8.12)(=) |
| pure_montage | 17.62(1.93) | 17.52(1.59)(=) | 18.09(2.66)(=) | 17.24(1.23)(=) | 18.14(2.18)(=) |
| pure_sipht | 85.52(4.89) | 84.26(2.95)(=) | 85.06(4.42)(=) | 83.68(2.36)(+) | 84.6(2.85)(=) |
| +/=/- | — | 4/4/0 | 2/6/0 | 5/3/0 | 4/4/0 |
| Average Rank | 4.62 | 2.12 | 3.62 | 1.12 | 3.5 |

As we can see from Table 6, our proposed model-driven approach using a mini-batch of size larger than 1 can achieve better results than just using a mini-batch of size 1. However, even with a mini-batch of size 1, our proposed model-driven approach performs better than the existing approaches that use randomly generated training instances, as done in the recent literature on dynamic workflow scheduling, cf. [7], according to Table 4. This demonstrates that training instances generated based on the conceptual schema in Figure 2 can well support GPHH in learning better heuristics. Hence, model-driven GPHH can produce better heuristics than the traditional approach that uses randomly generated training instances.

6. Conclusions and future work

In this paper we proposed a model-driven machine learning approach to the dynamic multi-workflow scheduling (DMWS) problem. To enable machine learning algorithms to effectively learn optimization rules or heuristics, we proposed a model-driven approach to designing mini-batches of training instances. We designed a conceptual database schema for DMWS, based on which training instances can be generated in a systematic way to cover possible problem scenarios. Simulations using benchmark and real-world data demonstrate that our proposed model-driven GPHH approach learns heuristics that outperform those trained on randomly generated training instances.