<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>multi-workflow scheduling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yifan Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hui Ma</string-name>
          <email>hui.ma@ecs.vuw.ac.nz</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gang Chen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sven Hartmann</string-name>
          <email>sven.hartmann@tu-clausthal.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Clausthal University of Technology</institution>
          ,
          <addr-line>Clausthal-Zellerfeld</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Project Exhibitions</institution>
          ,
          <addr-line>Posters and Demos, and Doctoral Consortium</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Victoria University of Wellington</institution>
          ,
          <addr-line>Wellington</addr-line>
          ,
          <country country="NZ">New Zealand</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Machine learning is becoming increasingly popular for solving dynamic combinatorial optimization problems. With the popularity of cloud computing, machine learning has been applied to dynamic multi-workflow scheduling, a challenging combinatorial optimization problem that involves many problem features, such as workflow patterns, tasks, virtual machine capacities, and cost. Existing machine learning approaches to dynamic combinatorial optimization problems use randomly generated training instances to train optimization rules or heuristics, without paying attention to knowledge about the problem domain data. In this paper, we propose a model-driven machine learning approach to dynamic multi-workflow scheduling. In particular, we propose a conceptual database schema that models the information of the dynamic multi-workflow scheduling problem. We further propose a new approach to designing training instances for dynamic multi-workflow scheduling by incorporating important attributes of the problem into the database schema. We evaluate our proposed approach in a case study with real-world data. The results of the evaluation demonstrate that our proposed model-driven machine learning approach can effectively solve dynamic workflow scheduling problems.</p>
      </abstract>
      <kwd-group>
        <kwd>machine learning</kwd>
        <kwd>dynamic workflow scheduling</kwd>
        <kwd>data modeling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        In the era of big data, more and more data-intensive applications are used in various domains
to solve complex data-intensive problems. To effectively solve such problems we need to design
not only effective machine learning algorithms but also training datasets that capture the
essential requirements of the problem domain. However, with the increasing complexity of
combinatorial optimization problems, designing training datasets is becoming more and more
challenging: the datasets must be representative of the problem domain and robust to unseen
problem instances, while at the same time not hurting the efficiency of the training process.
Combinatorial optimization is a category of optimization in which the size and the
relationships of the data grow massively as the problem complexity escalates [
        <xref ref-type="bibr" rid="ref1">1, 2</xref>
        ].
      </p>
      <p>A typical example of a combinatorial optimization problem is the dynamic workflow scheduling problem
in cloud computing, where workflow tasks are scheduled to cloud resources in the best possible
way with respect to some objectives, e.g., minimizing the makespan and/or response time. Cloud users often
request multiple workflows to be scheduled for execution on heterogeneous cloud resources.
This gives rise to the dynamic multi-workflow scheduling (DMWS) problem. To solve such
a combinatorial optimization problem, we need to use not only data about tasks and cloud
resources (e.g., VMs), but also the relationships among these tasks and workflows, because they can
significantly affect the response time and the overall cost.</p>
      <p>Dynamic multi-workflow scheduling is an NP-hard problem [3]. A major challenge in practice
is that scheduling decisions for the DMWS problem must be made in real time.
Therefore, heuristics need to be used. The manual design of heuristics is
very time consuming and requires domain experts with good domain knowledge. Genetic
programming hyper-heuristic (GPHH), a machine learning method, is an effective approach to
automatically design heuristics that does not require domain experts [4]. However, datasets are
needed to train heuristics so that they can solve the problem effectively in real time.</p>
      <p>To efficiently search for heuristics for dynamic multi-workflow scheduling, the quality of
the datasets used for training heuristics plays a crucial role in making the learned heuristics effective in
practice, e.g., robust to unseen problem instances [5]. While it is common practice to
use historical datasets as training instances, they are not efficient for evaluating heuristics, since this
involves expensive evaluations over all the historical data [6]. To avoid expensive training while
at the same time training heuristics that are robust to unseen problem instances, it is important
to carefully design the training instances. Current approaches to designing training instances
are ad hoc and are not clearly described. Most existing work uses randomly generated training
instances in the training process [7], without paying any attention to the knowledge contained in
the given datasets. Further, dynamic workflow scheduling is a dynamic combinatorial optimization problem
for which different states of the problem domain will be used during the training process,
including resource scheduling requests and resource status, which change dynamically.
Meanwhile, to train heuristics that are effective and robust in solving dynamic multi-workflow
scheduling problems in real time, it is inefficient and unnecessary to use all the possible states
of the requests and data centers.</p>
      <p>A conceptual model of training datasets, consisting of the data and the relationships between
different data, would help make machine learning systematic and effective. A conceptual
database schema consists of entities and relationships. Designing training data essentially means
retrieving, or sampling, datasets from a database of historical data. This makes conceptual models
attractive for the machine learning domain since they provide a high-level view of the datasets.
However, there is no data model in the literature that can be used to support machine learning
processes in learning heuristics effectively and efficiently when there is a large number of requests
and the size of the workflows is extremely large.</p>
      <p>By taking advantage of conceptual models, different views of the datasets can be retrieved as
training datasets to efficiently evaluate and train heuristics [8]. Adequate modeling of historical
data and training data plays an important role in designing a high-quality training dataset for
machine learning. However, despite the popularity of machine learning for combinatorial
optimization, there is limited research on data modeling for machine learning. A data
model, GR4ML, is proposed in [9] for machine learning, focusing on expressing machine learning
requirements and solution design. In practice, datasets are often designed in an ad hoc way or
based on best practice recommendations [10]. There are no common or well-defined approaches
to model training datasets and design training instances using data models.</p>
      <p>The overall aim of this paper is to propose a model-driven machine learning approach to the
dynamic multi-workflow scheduling problem. In particular, we propose a GPHH approach that
trains heuristics using well-defined training instances based on a conceptual database schema. We
expect that the heuristics learned by our machine learning approach can schedule dynamic
workflows effectively. The major contributions of this paper are the following:
• We propose a conceptual model that can be used for designing training datasets for
effectively learning heuristics for dynamic workflow scheduling.
• We present a model-driven GPHH approach to automatically learn heuristics that can be
used to schedule multiple workflows in cloud data centers. In particular, our proposed
GPHH approach uses training datasets generated using the conceptual model designed
for DMWS.
• We evaluate the effectiveness of our proposed conceptual model in designing training
datasets with a case study using benchmark data of workflow scheduling.</p>
      <p>To the best of our knowledge, our proposed model-driven approach is the first attempt at
incorporating conceptual modeling into the machine learning process for solving dynamic
workflow scheduling and other combinatorial optimization problems. In particular, our
conceptual database schema provides a good overview of the structure of the data involved in dynamic
workflow scheduling problems, and it supports the systematic design of training instances
for effectively learning heuristics that can efficiently schedule the tasks of numerous
dynamically arriving workflows.</p>
      <p>Organization. This paper is organized as follows. Section 2 briefly outlines current machine
learning approaches from the literature, in particular for dynamic multi-workflow scheduling problems,
while Section 3 introduces dynamic workflow scheduling in more detail.
In Section 4 we present a conceptual data schema for dynamic multi-workflow scheduling and
propose a GPHH approach to automatically design heuristics for the DMWS problem, with a
model-driven approach for designing training instances. In Section 5 we report on a case study
that we have conducted to evaluate our proposed model-driven machine learning approach.
Finally, Section 6 gives conclusions and an outlook on future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related work</title>
      <p>As dynamic multi-workflow scheduling is known to be NP-hard, heuristics are widely used
to ensure scalability and real-time applicability. Heuristics are priority functions for making
decisions on cloud resources, e.g., VMs. IaaS cloud partial critical paths (IC-PCP) is proposed in
[11] for scheduling computation requests in data centers. Heterogeneous earliest finish time
(HEFT) is another manually designed heuristic, proposed in [12]. However, manually designing
such heuristics requires domain knowledge and is time-consuming. Moreover, a heuristic may be
effective on some problem instances but not on others,
since only a small set of domain attributes is considered.</p>
      <p>Machine learning approaches have been proposed in the literature to automatically learn
heuristics for dynamic workflow scheduling problems and other dynamic scheduling problems.
In [13, 14], GPHH is utilized for online service resource allocation in clouds. In [15], a GPHH
approach is proposed to solve multi-cloud resource allocation problems, where a single training instance
is used during the whole training process. In order to generate heuristics that are robust to
unseen problem instances, a GPHH approach is proposed in [16] that uses training instances
randomly generated during the training process to learn heuristics for the DWS problem.
A similar approach to generating training instances is used in [7], which proposes a GPHH
approach to learn heuristics for the dynamic job shop scheduling problem. However, none
of the above-mentioned machine learning studies paid attention to the structure of problem
attributes or data when designing training instances.</p>
      <p>To understand the machine learning requirements of organizations, conceptual modelers have
proposed different models. In [17], a conceptual modeling framework is proposed to
systematically justify the need for predictive analytics within an organizational context and to identify
the requirements of predictive analytics design. To incorporate domain-specific questions, [18]
adopts conceptual modeling methods to elicit analytical requirements, design machine learning
solutions, and ensure alignment between analytic initiatives and business strategies,
among others. The modeling framework consists of three complementary modeling views: the
business view, the analytic design view, and the data preparation view. However, as far as we are aware,
there is no systematic way of designing training instances for machine learning in dynamic
combinatorial optimization problems, including dynamic workflow scheduling problems.</p>
      <p>In summary, research on data modeling of training instances for machine learning of
combinatorial optimization is still in its early stages. To the best of our knowledge, there is very
limited work on conceptual modeling of machine learning data for dynamic multi-workflow
scheduling, which motivates us to propose an effective design method for training instances by
incorporating application-specific data models.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Preliminaries</title>
      <p>SLA-aware dynamic multi-workflow scheduling (DMWS) has attracted rising interest in
research and practice. Workflows arrive at a data center dynamically and are unknown beforehand,
i.e., requests to schedule workflows arrive on the fly and must be executed in real time, to
meet deadline constraints set in service-level agreements (SLAs). The tasks of the workflows are
scheduled to execute on a number of VM instances that are available in a cloud data center. Each
VM instance is of a VM type, which defines its capacities. During this process, we need to make
two types of decisions concurrently: task allocation, which allocates the tasks of the workflows
to VMs, and task sequencing, which sequences the tasks on each individual VM instance.</p>
      <p>Figure 1 shows an overview of DMWS. The overall goal is to satisfy the workflow scheduling
requests with the best possible quality, e.g., minimizing the makespan of workflows, minimizing
the cost of renting cloud resources, and minimizing SLA violation penalties. In particular, the
SLA-aware dynamic multi-workflow scheduling problem focuses on scheduling a sequence of
workflows to cloud resources so that the overall cost and SLA penalty are minimized. In the
rest of the paper, DMWS refers to the SLA-aware dynamic multi-workflow scheduling problem.</p>
      <p>A workflow w arrives at some time with a deadline and can be represented as a directed acyclic graph
(DAG), denoted by w = (T, E). Herein, T = {t_j}, j ∈ {1, …, n}, is a set of nodes representing the tasks
of the workflow, and E = {e_jk} is a set of directed edges, where e_jk connects t_j with t_k. A task t_j is
defined by its size s_j. In an edge e_jk, t_j is a parent task and t_k is a child task. Note that a child task can be
executed only after all its parent tasks are completed. We use pre(t_j) to denote all immediate
predecessors of task t_j. There might be multiple tasks allocated to the same VM instance to be
executed. We use prev(t_j) to denote the previous task of t_j executed on the same VM.</p>
      <p>In a data center, there are VM instances to perform the scheduled workflow tasks. Each VM
instance is of a VM type that is defined by its speed, memory, and cost. Any task must
be scheduled to a VM instance with sufficient capacity, i.e., CPU and memory. Note
that one VM instance can only execute one task at any time, and each task can only be executed
on a single VM instance. The time to process a task depends on the resource requirement of
the task and on the resources of the VM it is scheduled to. The more resource capacity is allocated to
a task, the faster it is processed and the lower the probability of violating the deadline constraints set in the
SLA. However, the more resources are used for executing workflows, the higher the cost
will be. Different workflow schedules will thus lead to different numbers of time units for renting VMs, and
hence to different makespans and costs.</p>
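      <p>To make the relation between task size, VM capacity, and execution time concrete, the following sketch assumes the common convention that a task's execution time equals its size divided by the speed of the VM type it runs on; the attribute names and the numeric values are illustrative and not taken from the paper.</p>
      <preformat>
# Hedged sketch: execution time of a task on a VM type, assuming
# execution_time = task_size / vm_speed (a common modelling convention).
from dataclasses import dataclass

@dataclass
class VMType:
    name: str
    speed: float        # processing capacity (normalized, illustrative)
    memory: int         # GB
    hourly_price: float

@dataclass
class Task:
    tid: str
    size: float         # workload of the task
    memory: int         # memory requirement in GB

def execution_time(task: Task, vm_type: VMType) -> float:
    """Estimated processing time of `task` on a VM of type `vm_type`."""
    if vm_type.memory >= task.memory:   # capacity check: enough memory
        return task.size / vm_type.speed
    raise ValueError("VM type has insufficient memory for this task")

# A faster (and more expensive) VM type shortens the execution time.
small = VMType("m5.large", speed=1.0, memory=8, hourly_price=0.096)
large = VMType("m5.2xlarge", speed=4.0, memory=32, hourly_price=0.384)
t = Task("t1", size=120.0, memory=4)
print(execution_time(t, small), execution_time(t, large))  # 120.0 vs 30.0
</preformat>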
      <p>For a given sequence of workflows W = [w_1, …, w_m] and a set of available VM instances, the
objective of the DMWS problem is to schedule the workflow tasks for execution on VMs so as to
minimize the total cost of workflow execution, consisting of the rental fees c_v^rent
incurred by resource provisioning and the deadline penalties c_w^pen caused by
violations, formulated as</p>
      <p>min TotalCost = ∑_{v ∈ V} c_v^rent + ∑_{w ∈ W} c_w^pen    (1)</p>
      <p>where V refers to the set of rented VMs. In particular, c_v^rent is the rental fee of VM v, defined
as the hourly-based price p_v applied to the rental time between the start of processing of its first allocated
task and the completion of its last allocated task, i.e., the interval [t_v^first, t_v^last], calculated by</p>
      <p>c_v^rent = p_v ⋅ ⌈ (t_v^last − t_v^first) / 3600 ⌉    (2)</p>
      <p>c_w^pen is the penalty of workflow w, defined as the penalty fee paid for the portion of the workflow's
completion time beyond its deadline d_w, weighted by a penalty coefficient [19]. A larger coefficient represents a lower
tolerance for violating the workflow deadline. Details of the problem definition can be found in [ ].</p>
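      <p>The following sketch illustrates how the objective can be computed for a completed schedule: the rental fee follows the hourly ceiling of Eq. (2), while the deadline penalty is only approximated here as a coefficient times the deadline overrun, since the exact penalty formula of [19] is not reproduced above; all function and field names are illustrative.</p>
      <preformat>
import math

def rental_fee(hourly_price, first_task_start, last_task_end):
    """Rental fee of one VM, Eq. (2): hourly price times the number of
    started hours between its first and last allocated task."""
    return hourly_price * math.ceil((last_task_end - first_task_start) / 3600.0)

def deadline_penalty(finish_time, deadline, coefficient):
    """Sketched penalty: proportional to the time by which a workflow
    overruns its deadline (the paper's exact formula follows [19])."""
    return coefficient * max(0.0, finish_time - deadline)

def total_cost(vms, workflows, coefficient=1.0):
    """Objective of Eq. (1): sum of VM rental fees plus deadline penalties."""
    rent = sum(rental_fee(v["price"], v["start"], v["end"]) for v in vms)
    pen = sum(deadline_penalty(w["finish"], w["deadline"], coefficient)
              for w in workflows)
    return rent + pen

# Example: one VM rented for 90 minutes, one workflow 10 minutes late.
vms = [{"price": 0.096, "start": 0.0, "end": 5400.0}]
workflows = [{"finish": 5400.0, "deadline": 4800.0}]
print(total_cost(vms, workflows, coefficient=0.01))  # 0.192 + 6.0
</preformat>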
    </sec>
    <sec id="sec-4a">
      <title>4. Model-driven GPHH for dynamic multi-workflow scheduling</title>
      <p>In this section we present our model-driven genetic programming hyper-heuristic (GPHH)
approach for the dynamic multi-workflow scheduling problem. To begin with, we present a data
model of the domain data for scheduling multiple dynamic workflows. Afterwards, we describe
our proposed GPHH approach using training instances designed in a model-driven way.</p>
      <p>Data model. To ensure the robustness of the heuristics obtained by machine learning, we first
model the data involved in the training of heuristics by GPHH. Figure 2 presents a conceptual
data schema, denoted ER-DMWS, for the problem of dynamic multi-workflow scheduling.</p>
      <p>[Figure 2: The ER-DMWS conceptual schema, showing the entity types Task, VM, VM-Type, Pattern, Workflow, and Problem Instance with their attributes (e.g., TID, Size, Mem, VID, TypeName, CPU, Cost, PatternName, WorkflowID, InstanceID, ArriveTime, Deadline), the relationship types Schedule, Depend, Generation, and Mini-Batch, and the cardinality constraints (n,n) and (m,m).]</p>
        <p>We utilize the higher-order ER model (HERM) [20], which preserves the aggregation-based
modeling principle of the basic ER model, where a conceptual data schema consists of a set of
object types. Object types are characterized by their attributes. Recall the definition of attributes
from [20]. Let U denote a countable set of simple attributes (called the universe) together with
a type assignment type(·) that assigns to each attribute A ∈ U its data type type(A). Complex
attributes may be obtained from simple ones by nesting. Let 𝒜 denote the smallest superset of
U such that
X(A_1, …, A_n), X{A}, X[A], X⟨A⟩, X_1(A_1) ⊕ ⋯ ⊕ X_n(A_n), X(A_1 → A_2) ∈ 𝒜
hold whenever A, A_1, …, A_n ∈ 𝒜 hold, with labels X, X_1, …, X_n chosen from some fixed
alphabet ℒ.</p>
        <p>An important extension of HERM is that dynamic and complex data types are permitted in 𝒜.
The data type type(A) assigned to a (simple) attribute A ∈ U is some simple data type, like INT,
FLOAT, STRING, DATE, or TIME, which are relevant for the features of dynamic workflow scheduling
problems. It should be emphasized that complex data types offer additional opportunities for
machine learning researchers but are not a must-use.</p>
        <p>A structured component is a pair r : E with a role name r and a component expression
E ∈ C. An object type O is a triple (comp(O), attr(O), key(O)), where comp(O) is a finite set
{r_1 : E_1, …, r_n : E_n} of structured components with pairwise different role names r_1, …, r_n,
attr(O) is a finite set {A_1, …, A_m} ⊆ 𝒜 of complex attributes, and key(O) ⊆ comp(O) ∪ attr(O)
is the primary key of O. An object type O is called an entity type if comp(O) is the empty set;
otherwise it is called a relationship type.</p>
        <p>Example 1. In the diagram of ER-DMWS in Figure 2, the entity types are displayed as rectangles
while relationship types are displayed as diamonds. Let R be a relationship type and C a component
of R, say of the form E or r : E. In the diagram this is displayed by an edge from R to E. The
relationship type Depend in Figure 2 can be specified as follows:</p>
        <p>Depend = ({Parent: Task, Child: Task}, {}, {Parent: Task, Child: Task})
It has two structured components, each with a different role, no attributes, and its primary key
includes both structured components. A cardinality constraint is an expression card(R, C) = (a, b),
where a is a natural number and b is a natural number or ∞. For more details on
cardinality constraints see [20, 21]. For the integration of different user views in HERM see, e.g.,
[22]. Cardinality constraints are also displayed in the diagram. For example, the constraint
card(Problem Instance, Workflow) = (m, m) states that every problem instance consists of m
workflows. The cardinality constraint card(Mini-Batch, Problem Instance) = (n, n) states that every
mini-batch consists of n problem instances. Note that a mini-batch, containing a set of problem
instances, will be used in the machine learning process to ensure the generality of the learned
heuristics.</p>
        <p>We use the conceptual schema in Figure 2 to represent data on workflows, virtual machines, and
scheduling, as well as on the training instances, i.e., mini-batches, used during the machine learning
process. Note that the entity and relationship types in gray capture the information on
workflows, while the entity and relationship types in green capture the information on
cloud resources, i.e., virtual machines. Further, there are two relationship types, Schedule and
Mini-Batch, in blue that capture the information on scheduling decisions and training
instances.</p>
        <p>For example, a workflow pattern can be considered as a complex data type which may be
regarded as a list of tasks and a set of edges defined by pairs of tasks. If the machine learning
researchers consider the internal structure of the workflow pattern to be conceptually
irrelevant for the application under development, then they can define a simple attribute pattern
with type(pattern) = DAG. Otherwise, if the internal structure of the workflow pattern is
conceptually relevant, then they may use a complex attribute pattern({task}, {(parent: task, child: task)}).</p>
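        <p>To illustrate how the object types of ER-DMWS might be mirrored in code when generating training data, the following sketch encodes some of the entity and relationship types of Figure 2 as plain Python dataclasses. The class and field names follow the attribute labels in the figure, but the mapping is an illustrative reading rather than code from the paper.</p>
        <preformat>
from dataclasses import dataclass, field
from typing import List

@dataclass
class Pattern:                 # entity type Pattern
    pattern_name: str          # e.g. "CyberShake"
    size: int                  # number of tasks in the pattern

@dataclass
class Task:                    # entity type Task
    tid: str
    size: float
    mem: int

@dataclass
class Depend:                  # relationship type Depend (Parent, Child)
    parent: Task
    child: Task

@dataclass
class Workflow:                # entity type Workflow
    workflow_id: str
    pattern: Pattern
    arrive_time: float
    deadline: float
    tasks: List[Task] = field(default_factory=list)
    edges: List[Depend] = field(default_factory=list)

@dataclass
class ProblemInstance:         # entity type Problem Instance
    instance_id: str
    workflows: List[Workflow] = field(default_factory=list)
</preformat>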
        <p>Model-driven approach to automatically learning heuristics. We now present our
model-driven GPHH for dynamic multi-workflow scheduling. Figure 3 shows a high-level overview
of the GPHH algorithm for learning heuristics based on model-driven training instances.</p>
        <p>[Figure 3: Overview of the model-driven GPHH algorithm: initialization of a population of heuristics, evaluation on a generated mini-batch of training instances, selection, and evolution via reproduction, crossover, and mutation, repeated until termination; an example evolved heuristic tree (ET + TS) is also shown.]</p>
        <p>Here, we represent heuristics as syntax trees, with internal nodes as arithmetic functions and
leaf nodes as terminals. For an example, see the heuristic in Figure 3. The terminals can be
either attributes or derived attributes of the object types in the ER-DMWS schema, e.g., the task
size. Table 1 summarizes the terminal set used in this case and the function set, which comprises
{+, −, ×, ÷, max, min}, based on [23].</p>
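        <p>As a concrete illustration of the tree representation, the following sketch encodes a small heuristic as nested tuples and evaluates it bottom-up against a dictionary of terminal values; the terminal names ET and TS are taken from the example tree in Figure 3, while the protected division and the remaining details are assumptions.</p>
        <preformat>
import operator

# Function set assumed here: arithmetic plus max/min.
FUNCTIONS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "/": lambda a, b: a / b if b != 0 else 1.0,   # protected division
    "max": max, "min": min,
}

def evaluate(tree, terminals):
    """Bottom-up evaluation of a GP tree given terminal values.
    A tree is either a terminal name (str) or (function, left, right)."""
    if isinstance(tree, str):
        return terminals[tree]
    func, left, right = tree
    return FUNCTIONS[func](evaluate(left, terminals), evaluate(right, terminals))

# The example heuristic from Figure 3: ET + TS.
heuristic = ("+", "ET", "TS")
print(evaluate(heuristic, {"ET": 30.0, "TS": 120.0}))  # 150.0
</preformat>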
        <p>During the process of solving a DMWS problem instance in real time, any GP tree evolved
by GPHH is treated as a priority function. Whenever a task in a workflow becomes ready for
execution, i.e., all its parent tasks have been completed, the GP tree is applied to each eligible
cloud resource, e.g., a leased VM instance with the requested CPU and memory capacities to
accomplish the task, to calculate their respective priorities. For this purpose, the value of each
terminal node of the GP tree is determined first. Note that, for different cloud resources under
consideration, the same terminal node can have varying values. With respect to any candidate
cloud resource, driven by the corresponding values of all terminal nodes, we can proceed to
calculate the value of each internal node of the GP tree in bottom-up order and eventually
determine the value of the root node, which is assigned to the candidate cloud resource
as its priority. After calculating the priorities of all eligible VMs, the VM with the highest
priority is scheduled to process the task. The above process is performed repeatedly
to schedule the execution of every task of each workflow, whenever a task becomes ready.
Therefore, guided by a GP tree, the full schedule that solves any DMWS problem instance can
be obtained. Details regarding the GPHH algorithm for DMWS can be found in [4, 16].</p>
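        <p>The dispatching step described above can be sketched as follows: whenever a task becomes ready, the evolved priority function is evaluated once per eligible VM, each VM yielding its own terminal values, and the VM with the highest priority receives the task. The terminal names and the way their values are derived are illustrative assumptions.</p>
        <preformat>
def dispatch_ready_task(task, eligible_vms, priority_fn):
    """Apply an evolved priority function to every eligible VM and
    return the VM with the highest priority for this ready task."""
    best_vm, best_priority = None, float("-inf")
    for vm in eligible_vms:
        # Terminal values differ per candidate VM (illustrative terminals).
        terminals = {
            "TS": task["size"],                      # task size
            "ET": task["size"] / vm["speed"],        # estimated execution time
            "VMC": vm["hourly_price"],               # VM rental price
            "WT": vm["queue_wait"],                  # waiting time on this VM
        }
        priority = priority_fn(terminals)
        if priority > best_priority:
            best_vm, best_priority = vm, priority
    return best_vm

# Usage with a toy priority function preferring fast, lightly loaded VMs.
vms = [{"speed": 1.0, "hourly_price": 0.096, "queue_wait": 0.0},
       {"speed": 4.0, "hourly_price": 0.384, "queue_wait": 300.0}]
task = {"size": 120.0}
print(dispatch_ready_task(task, vms, lambda t: -(t["ET"] + t["WT"])))
</preformat>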
        <p>Apparently, different GP trees can produce different full schedules for the same DMWS problem
instance, and each full schedule leads to a different total cost. The aim of the GPHH algorithm is
to identify a GP tree that achieves on average the smallest total cost, the objective defined
in (1), over a wide variety of DMWS problem instances. For this purpose, the GPHH algorithm
randomly generates a set (a population) of GP trees, called the initial population.
Each GP tree of the initial population is evaluated using training instances and a fitness
function, i.e., TotalCost. GP trees with good fitness values are selected for reproduction,
mutation, and crossover to generate the next generation of GP trees. This process continues
until the stopping criteria are met, e.g., the maximum number of generations is reached. The
evolved GP tree with the best fitness is then reported by the GPHH algorithm as its final learned
heuristic for DMWS. As can be seen in Figure 3, in every generation, i.e., iteration, all the
heuristics are evaluated using the training instances. Evidently, these training instances play
an important role in the evolution (or heuristic learning) process.</p>
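        <p>The generational loop just described, including the per-generation rotation of training instances, is sketched below in deliberately simplified form: candidate heuristics are opaque objects scored by a caller-supplied cost function, and GP crossover and mutation are abstracted into a single vary step, since only the control flow is of interest here; all names are ours rather than the paper's.</p>
        <preformat>
import random

def gphh_loop(init_population, sample_mini_batch, total_cost, vary,
              generations=50, tournament_size=7, elitism=1):
    """Skeleton of the GPHH generational loop with mini-batch rotation.
    init_population: list of candidate heuristics (opaque objects);
    sample_mini_batch(g): training instances used in generation g;
    total_cost(h, instance): simulated cost of heuristic h on one instance;
    vary(p1, p2): stand-in for GP crossover/mutation, returns one offspring."""
    population = list(init_population)
    for g in range(generations):
        batch = sample_mini_batch(g)                  # rotate training instances
        fitness = {id(h): sum(total_cost(h, inst) for inst in batch) / len(batch)
                   for h in population}
        ranked = sorted(population, key=lambda h: fitness[id(h)])
        next_pop = ranked[:elitism]                   # keep the elite individual(s)
        while len(next_pop) != len(population):       # fill by tournament + variation
            parents = [min(random.sample(population, tournament_size),
                           key=lambda h: fitness[id(h)])
                       for _ in range(2)]
            next_pop.append(vary(parents[0], parents[1]))
        population = next_pop
    # Report the individual with the best cost on a final mini-batch.
    final_batch = sample_mini_batch(generations)
    return min(population, key=lambda h: sum(total_cost(h, i) for i in final_batch))
</preformat>
        <p>In the actual approach the candidates are GP trees and the variation step applies subtree crossover and mutation with the rates reported in Section 5.</p>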
        <p>Model-driven training instance design. The modeling technique that we have just
proposed can be generalized to generate training datasets for GPHH for dynamic multi-workflow
scheduling. In order to learn good heuristics that perform well on unseen problem instances,
different training instances are rotated from generation to generation, see Figure 4. However, if the
training instances are too different from generation to generation, then no individuals can
steadily survive through the generations to evolve. Therefore, we need to carefully design suitable
training instances. This can be controlled by specifying a proper size for the instance pool. In
the meantime, to train heuristics that perform well on unseen problem instances, we adapt the
concept of a ‘mini-batch’ [5] in the design of training instances. As can be seen in Figure 2, a
mini-batch may contain one or more problem instances. We explore settings with different
batch sizes in the section on experiments.</p>
        <p>As can be seen in Figure 4, our model-driven approach first creates a problem instance pool
before running GPHH. During the training process of GPHH, training instances are obtained
by querying the problem instance pool. In the following, we provide details of each step.
Definition 1. A dynamic multi-workflow scheduling problem instance I consists of a list of
m workflows, i.e., I = [w_1, …, w_m], where each workflow is randomly generated using available workflow
patterns and sizes, (PatternName, Size), and is assigned a random arrival time ArriveTime. A
problem instance pool ℙ consists of p problem instances, i.e., ℙ = {I_1, …, I_k, …, I_p}.</p>
        <p>As shown in Figure 2, we propose to use a mini-batch consisting of a set of problem instances
for each generation of the training process.</p>
        <p>Definition 2. A mini-batch B(g) for generation g is a set of n problem instances, i.e.,
B(g) = {I_1(g), I_2(g), …, I_j(g), …, I_n(g)},
where each problem instance I_j(g) is retrieved from the problem instance pool ℙ, i.e., I_j(g) ∈ ℙ, for
j = 1, …, n.</p>
        <p>To create a problem instance pool ℙ, we generate a set of problem instances using the schema
defined for the DMWS problem, each time with a different combination of patterns, sizes, and
arrival times. In this way, our problem instance pool contains a variety of problem instances that
represent various possible problem scenarios. We can then obtain a mini-batch of n problem
instances by randomly selecting n problem instances from ℙ.</p>
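        <p>A minimal sketch of Definitions 1 and 2 is given below: a problem instance pool is built by combining workflow patterns, sizes, and random arrival times, and a mini-batch of n instances is then drawn for each generation. The pattern names follow Figure 6, while the pool size, the size values, and the arrival-time range are illustrative assumptions.</p>
        <preformat>
import random

PATTERNS = ["CyberShake", "Inspiral", "Montage", "SIPHT"]   # cf. Figure 6
SIZES = [30, 50, 100]                                        # illustrative pattern sizes

def make_problem_instance(num_workflows, horizon=3600.0, rng=random):
    """Definition 1: a problem instance is a list of randomly generated
    workflows, each with a (pattern, size) choice and a random arrival time."""
    return [{
        "pattern": rng.choice(PATTERNS),
        "size": rng.choice(SIZES),
        "arrive_time": rng.uniform(0.0, horizon),
        "deadline_factor": rng.uniform(1.2, 2.0),   # illustrative SLA tightness
    } for _ in range(num_workflows)]

def make_pool(pool_size, num_workflows, seed=0):
    rng = random.Random(seed)
    return [make_problem_instance(num_workflows, rng=rng) for _ in range(pool_size)]

def sample_mini_batch(pool, n, generation):
    """Definition 2: a mini-batch for a generation is n instances drawn from the pool."""
    rng = random.Random(generation)        # reproducible rotation per generation
    return rng.sample(pool, n)

pool = make_pool(pool_size=20, num_workflows=5)
batch = sample_mini_batch(pool, n=3, generation=7)
print(len(pool), len(batch))   # 20 3
</preformat>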
    </sec>
    <sec id="sec-5">
      <title>5. A case study</title>
      <p>To evaluate the performance of our proposed model-driven approach, we have implemented
a machine learning simulator using benchmark data of workflow scheduling as in [4, 24, 25],
i.e., benchmark workflow patterns with various sizes, and VM data obtained from commercial
AWS data centers [26, 27]. The simulator is used to train heuristics for dynamic workflow
scheduling. After the heuristics are trained using the simulator, we evaluate them
using test datasets. We evaluated the performance (in terms of the total cost) of the heuristics
in solving SLA-aware dynamic multi-workflow scheduling, compared with heuristics
trained on training datasets generated using existing approaches. That is, two
different training datasets are used, one based on our proposed model-driven approach
and the other based on the existing approach [12, 16, 28], and a set of heuristics is trained on
each of the two training datasets. Since real-world datasets were considered, the number of
workflow scheduling requests naturally varies throughout the day and week. This may lead
to increased execution time for a single workflow due to the queuing time of tasks on VMs.</p>
      <p>Datasets. We first discuss the simulation configuration, followed by the parameter settings of
GPHH and the detailed setup of the comparative experiments. A simulated cloud
environment is used to experimentally evaluate the proposed model-driven GPHH approach to
DMWS. It contains several key components, listed below.</p>
      <p>VM Types: Following Amazon EC2 (https://aws.amazon.com/ec2/pricing/on-demand/), we assume that the cloud environment supports six
different VM types, with their respective configurations summarized in Table 2. The maximum
number of VM instances of each type is unlimited.</p>
      <p>Workflows: Figure 6 shows the four widely used workflow patterns, namely CyberShake, Inspiral,
Montage, and SIPHT (https://confluence.pegasus.isi.edu/display/pegasus/Deprecated+Workflow+Generator), that are employed in our experimental study. Table 3 gives detailed
information on the 12 heterogeneous workflows used in our experiments.</p>
      <p>[Figure 6: The CyberShake, Inspiral, Montage, and SIPHT workflow patterns.]</p>
      <p>Baselines. Several baselines are utilized to provide a point of comparison in the experiments.
• HEFT: Heterogeneous earliest finish time [12] is a widely known heuristic algorithm for
workflow scheduling.
• Random Generate (no rotation): Traditional approach where the GPHH algorithm is
set to use a single training instance that is kept constant across generations, so that all
individuals in all generations are evaluated on the same training instance [15].
• Random Generate (rotation): Traditional approach where the GPHH algorithm is set to
use one training instance per generation, randomly generating a new training instance
in each generation [7].</p>
      <p>Parameter settings. The study utilized default parameter settings for GPHH, following
the recommendations provided in [4, 29]. The population size was set to 1024, with a total
of 50 iterations. One individual was designated as elite, and the tournament size was set to
7. Crossover, mutation, and reproduction rates were set at 0.80, 0.15, and 0.05, respectively.
Furthermore, the initial depth of GP trees ranged from 2 to 6, and the maximum depth was
constrained to 8 throughout the evolutionary process.</p>
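      <p>For reference, the reported parameter settings can be collected in a single configuration object, as in the sketch below; the key names are our own, and the paper does not state which GP implementation these values were supplied to.</p>
      <preformat>
# Reported GPHH parameter settings, gathered into one configuration dict.
GPHH_PARAMS = {
    "population_size": 1024,
    "generations": 50,
    "elitism": 1,               # one individual kept as elite
    "tournament_size": 7,
    "crossover_rate": 0.80,
    "mutation_rate": 0.15,
    "reproduction_rate": 0.05,
    "init_tree_depth": (2, 6),  # initial depth range of GP trees
    "max_tree_depth": 8,
}
</preformat>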
      <p>Evaluation results. Let Mini-batch Select n/p denote our proposed model-driven approach
where the GPHH algorithm is set to randomly select n problem instances per generation from a problem instance
pool with p instances. Next we show the performance of the proposed
model-driven GPHH approach in comparison to the baselines.</p>
      <p>We have also conducted the Wilcoxon rank sum test with a significance level of 0.05 to make
pairwise comparisons (each algorithm is compared with all the algorithms to its left in the
table). The “(+)/(-)/(=)” after each entry in the table indicates that the corresponding results are
significantly better/worse than or similar to the results of the compared method.</p>
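      <p>The significance testing described here corresponds to the standard two-sided Wilcoxon rank-sum test; a minimal sketch using SciPy is shown below, where the per-run total costs of the two compared methods are made-up numbers for illustration.</p>
      <preformat>
from scipy.stats import ranksums

# Total costs over independent runs of two methods (illustrative numbers only).
model_driven = [41.2, 39.8, 42.5, 40.1, 38.9]
random_rotation = [45.3, 46.1, 44.8, 47.0, 43.9]

stat, p_value = ranksums(model_driven, random_rotation)
# Mark the result "(+)" when p_value is below the 0.05 significance level
# and the model-driven costs are lower, "(-)" or "(=)" otherwise.
print(stat, p_value)
</preformat>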
      <p>To provide insights into the performance, Figure 7 shows the convergence curves for the methods
using different training instances. It can be observed that the heuristics learned using a
mini-batch designed based on our model-driven approach perform better than the heuristics learned
using randomly generated training instances, throughout the whole training process.</p>
      <p>We further evaluate the performance of our proposed approach using a mini-batch with
multiple problem instances. Here we consider n = 3 and n = 5. Table 6 shows the evaluation results.</p>
      <p>As we can see from Table 6, our proposed model-driven approach using a mini-batch of
size larger than 1 achieves better results than using a mini-batch of size 1. However,
even with a mini-batch of size 1, our proposed model-driven approach performs better than
the existing approaches using randomly generated training instances, as done in the recent
literature on dynamic workflow scheduling, cf. [7], according to Table 4. This demonstrates
that using training instances generated based on the conceptual schema in Figure 2 can well
support GPHH in learning better heuristics. Hence, model-driven GPHH can produce better
heuristics than the traditional approach that uses randomly generated training instances.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and future work</title>
      <p>In this paper we proposed a model-driven machine learning approach to the dynamic
multi-workflow scheduling (DMWS) problem. For machine learning algorithms to effectively learn
optimization rules or heuristics, we proposed a model-driven approach to design mini-batches
of training instances. We designed a conceptual database schema for DMWS, from which training
instances can be generated in a systematic way to cover possible problem scenarios. Simulations
using benchmark and real-world data demonstrate that our proposed model-driven GPHH
approach outperforms existing approaches to dynamic multi-workflow scheduling problems.
Our novel model-driven approach is very effective in supporting machine learning algorithms
to train effective heuristics for dynamic workflow scheduling problems. Future work can be
conducted to investigate the optimal choice of the mini-batch size and the problem pool size.</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[2] C. H. Papadimitriou, K. Steiglitz, Combinatorial optimization: algorithms and complexity,</p>
      <p>Courier Corporation, 1998.
[3] Z. Zhu, G. Zhang, M. Li, X. Liu, Evolutionary multi-objective workflow scheduling in
cloud, IEEE Transactions on Parallel and Distributed Systems 27 (2015) 1344–1357.
[4] Y. Yang, G. Chen, H. Ma, M. Zhang, Dual-tree genetic programming for
deadline-constrained dynamic workflow scheduling in cloud, in: International Conference on
Service-Oriented Computing (ICSOC), Springer, 2022, pp. 433–448.
[5] N. Gazagnadou, R. Gower, J. Salmon, Optimal mini-batch and step sizes for saga, in:</p>
      <p>International Conference on Machine Learning (ICML), PMLR, 2019, pp. 2142–2150.
[6] J. Wu, W. Hu, H. Xiong, J. Huan, V. Braverman, Z. Zhu, On the noisy gradient descent that
generalizes as SGD, in: International Conference on Machine Learning (ICML), PMLR,
2020, pp. 10367–10376.
[7] F. Zhang, Y. Mei, S. Nguyen, M. Zhang, Evolving scheduling heuristics via genetic
programming with feature selection in dynamic flexible job-shop scheduling, IEEE Transactions
on Cybernetics 51 (2021) 1797–1811.
[8] S. Nalchigar, E. Yu, R. Ramani, A conceptual modeling framework for business analytics,
in: International Conference on Conceptual Modeling (ER), Springer, 2016, pp. 35–49.
[9] S. Nalchigar, E. Yu, K. Keshavjee, Modeling machine learning requirements from three
perspectives: a case report from the healthcare domain, Requirements Engineering 26
(2021) 237–254.
[10] Y. Liu, Y. Mei, M. Zhang, Z. Zhang, Automated heuristic design using genetic programming
hyper-heuristic for uncertain capacitated arc routing problem, in: ACM Genetic and
Evolutionary Computation Conference (GECCO), 2017, pp. 290–297.
[11] S. Abrishami, M. Naghibzadeh, D. H. Epema, Deadline-constrained workflow scheduling
algorithms for infrastructure as a service clouds, Future Generation Computer Systems 29
(2013) 158–169.
[12] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task
scheduling for heterogeneous computing, IEEE Transactions on Parallel and Distributed Systems
13 (2002) 260–274.
[13] B. Tan, H. Ma, Y. Mei, M. Zhang, A cooperative coevolution genetic programming
hyper-heuristic approach for on-line resource allocation in container-based clouds, IEEE Trans.</p>
      <p>Cloud Comput. 10 (2022) 1500–1514.
[14] B. Tan, H. Ma, Y. Mei, A hybrid genetic programming hyper-heuristic approach for online
two-level resource allocation in container-based clouds, in: IEEE Congress on Evolutionary
Computation (CEC), IEEE, 2019, pp. 2681–2688.
[15] Y. Chen, T. Shi, H. Ma, G. Chen, Multi-objective location-aware service brokering in
multi-cloud - a GPHH approach with transfer learning, in: International Conference on
Applications of Evolutionary Computation (EvoStar), Springer, 2023, pp. 573–587.
[16] K.-R. Escott, H. Ma, G. Chen, Genetic programming based hyper heuristic approach for
dynamic workflow scheduling in the cloud, in: International Conference on Database and
Expert Systems Applications (DEXA), Springer, 2020, pp. 76–90.
[17] A. Nasiri, S. Nalchigar, E. Yu, W. Ahmed, R. Wrembel, E. Zimányi, From indicators to
predictive analytics: A conceptual modelling framework, in: IFIP Working Conference on
The Practice of Enterprise Modeling (PoEM), Springer, 2017, pp. 171–186.
[18] S. Nalchigar, E. Yu, Business-driven data analytics: A conceptual modeling framework,</p>
      <p>Data &amp; Knowledge Engineering 117 (2018) 359–372.
[19] C.-H. Youn, M. Chen, P. Dazzi, Cloud broker and cloudlet for workflow scheduling, Springer,
2017.
[20] B. Thalheim, Entity Relationship Modeling – Foundations of Database Technology,</p>
      <p>Springer, 2000.
[21] S. Hartmann, On the consistency of int-cardinality constraints, in: International
Conference on Conceptual Modeling (ER), Springer, 1998, pp. 150–163.
[22] H. Ma, K. Schewe, B. Thalheim, J. Zhao, View integration and cooperation in databases,
data warehouses and web information systems, J. Data Semantics (2005) 213–249.
[23] Q.-Z. Xiao, J. Zhong, L. Feng, L. Luo, J. Lv, A cooperative coevolution hyper-heuristic
framework for workflow scheduling problem, IEEE Transactions on Services Computing
15 (2022) 150–163.
[24] W. Chen, E. Deelman, Workflowsim: A toolkit for simulating scientific workflows in
distributed environments, in: IEEE International Conference on e-Science (E-Science),
IEEE, 2012, pp. 1–8.
[25] J. Liu, J. Ren, W. Dai, D. Zhang, P. Zhou, Y. Zhang, G. Min, N. Najjari, Online multi-workflow
scheduling under uncertain task execution time in IaaS clouds, IEEE Transactions on
Cloud Computing 9 (2019) 1180–1194.
[26] H. R. Faragardi, M. R. S. Sedghpour, S. Fazliahmadi, T. Fahringer, N. Rasouli, GRP-HEFT: A
budget-constrained resource provisioning scheme for workflow scheduling in IaaS clouds,
IEEE Transactions on Parallel and Distributed Systems 31 (2019) 1239–1254.
[27] G. Juve, E. Deelman, K. Vahi, G. Mehta, B. Berriman, B. P. Berman, P. Maechling, Scientific
workflow applications on Amazon EC2, in: IEEE International Conference on e-Science
(E-Science), IEEE, 2009, pp. 59–66.
[28] V. Arabnejad, K. Bubendorfer, B. Ng, Dynamic multi-workflow scheduling: a deadline and
cost-aware approach for commercial clouds, Future Generation Computer Systems 100
(2019) 98–108.
[29] M. Xu, Y. Mei, S. Zhu, B. Zhang, T. Xiang, F. Zhang, M. Zhang, Genetic programming for
dynamic workflow scheduling in fog computing, IEEE Transactions on Services Computing
16 (2023) 2657–2671.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Schrijver</surname>
          </string-name>
          , et al.,
          <article-title>Combinatorial optimization: polyhedra and efficiency</article-title>
          , volume
          <volume>24</volume>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>