-

A CP Scheduler for High-Performance Computers

Thomas Bridi

Michele Lombardi

Andrea Bartolini

Luca Benini

Michela Milano

Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the role of mapping incoming jobs to the available resources, trying to maximize equipment utilization and user satisfaction. Optimal Job Scheduling is a well-known NP-hard problem, forcing commercial schedulers to adopt greedy approaches based on rules. Constraint Programming (CP) is a well-known combinatorial optimization approach that has been shown to be very e ective in optimally solving scheduling problems. We present the rst CP-based job scheduler for HPC machines, working in a real-life production environment. We evaluate our solution both on a cluster of virtual machines and on the Eurora Supercomputer with production workloads. Results show significant improvements in terms of user fair-waiting without degradation in overall machine utilization w.r.t state-of-the-art rule-based dispatchers.

Constraint Programming Scheduling Supercomputer

Today high-performance computing (HPC) centers are investment-intensive facilities with short depreciation cycles. An average supercomputer reaches full depreciation in just three years, hence its utilization has to be aggressively managed to produce and acceptable return on investment. A key role in this challenge is played by scheduling software that decides where and when a job submitted by a user has to be started. The scheduling software orders the set of jobs and the set of nodes and then tries to allocate each job to nodes taking into account the amount of currently available computing resources. When a new job is submitted, the software usually applies a back- lling algorithm that tries to place the job on unused resources without delaying the start of highest priority jobs in queues. These priority-rule-based algorithms are simple and reasonably fast, but they usually do not nd the best solution of the scheduling problem. One of the most widespread queue-based scheduling software in HPC facilities is PBS Professional [ 5 ]. In this work, we present a complete CP model for solving the optimal scheduling problem in HPC machines. The model signi cantly extends the one in [ 1 ] to account for multiple classes of jobs and their temporal constraints. In addition, the solution space exploration strategies have been optimized for on-line use, taking into account the impact of the schedule computation time on machine utilization. The CP solver based on the new model has been embedded as a plug-in module within the software framework of a well-known commercial HPC scheduler [ 5 ] replacing it's scheduling engine. By linking our solver with a state-of-the-art HPC scheduling tool, we have been able to validate our approach on a real-life HPC machine, Eurora from CINECA (Consorzio Interuniversitario Calcolo Automatico). Experiments on Eurora demonstrate that the new scheduler achieves signi cant improvements in job waiting time with respect to the commercial scheduler used in CINECA, while at the same time maintaining high machine utilization. An experimental campaign on a wide range of synthetic workloads proves that the approach is exible, robust and well-suited for integration in a portfolio of scheduling strategies to cover di erent levels of average machine utilizations. In section 3 we show an overview of the scheduling software running on the Eurora HPC machine. In section 4 we formally describe the problem of scheduling. In section 5 we describe the optimization techniques used to model the problem. In section 6 we present our optimization model and all the features implemented to make it desirable for a real HPC center. In section 7 we show results from simulations and from the Eurora supercomputer and we report statistics on the computational overhead. Finally in section 8 we show our conclusions. 2

Related work

The problem of batch scheduling is well-known and widely investigated. The interested reader can refer to [ 4 ] for a good survey of the scheduling algorithms used in HPC and computing clusters. Most of the algorithms described in [ 4 ] can be implemented withincommercial scheduling software by de ning appropriate \scheduling rules". To the best of our knowledge, the only examples that apply optimization techniques to a scheduler in a production context is [ 2 ]. In this paper, the author present an optimization technique applied as an extension to the TORQUE scheduler. This extension replaces the scheduling core of the framework with a back lling-like algorithm that inserts one job at a time into the schedule starting from a previous solution and then applies a Tabu Search to optimize the solution. This approach considers a job as a set of resources. This assumption drastically decreases the exibility of the scheduler by avoiding the possibility for a job to request more than one node. In our work, instead, we consider a job as a set of nodes, each requiring a set of resources. In this way we maintain the exibility of commercial schedulers (like TORQUE and PBS Professional) but we deal with a more complex settings w.r.t. [ 2 ]. 3

Eurora, Heterogeneous Supercomputer

Eurora is a heterogeneous HPC machine of CINECA. Eurora is composed of 65 nodes, one login node with 12 cores, 32 nodes with 16 cores at 3.1GHz and 2 GPU Kepler K20 each and 32 nodes with 16 cores at 2.1GHz and 2 Intel Xeon phi (MIC) each. Users from this HPC machine can submit a job that speci es the amount of resources, nodes and walltime to a queue; a queue is the place where job waits to be executed. Each queue has a name and a priority, after the submission. The scheduling software decides the start time and nodes where to execute the job. The scheduling and dispatching software currently used in Eurora is PBS Professional 12 from Altair; PBS Professional is a Portable Batch System [ 5 ] that schedules jobs based on rules. The original scheduler PBS sched can be disabled and replaced by with ad-hoc scheduling algorithms. We take advantage of this functionality to implement in a plug-and-play fashion our optimized scheduling policy. 4

The Scheduling problem n this section, we formally describe the problem of on-line scheduling and dispatching of a supercomputer. The scheduling problem considers a set of jobs J and a set of queues Q. Every jobi, de ned on the set J , is submitted in a speci c queue qi de ned on the set Q. Each job, when submitted, has to specify its maximal duration di, the number of jobs units ui (the number of virtual nodes required for the execution) and the amount of required resource rijkl (cores, memory, GPUs, MICs) for each job unit k 2 [1::ui] and for each l 2 R, where R is the set of resource. Each node nj of the system has a limit rljl for each resource l, with j 2 N where N is the set of nodes. We have to assign the starting execution time si and for each job unit juik of the job jobi, the node nj where it has to be executed. Given the current time t, the set of running jobs cannot be changed (migration is not supported), while the set of waiting jobs has to be allocated and scheduled on resources without exceeding their capacity at any point in time. 5

Constraint Programming

The technique used in this work to model the problem is Constraint Programming. A Constraint Program is de ned on a set of variables, each de ned on a discrete domain, and a set of constraints. Di erently from Convex Optimization (like LP, ILP, etc. . . ), with this paradigm we are not forced to have a convex polytope as solution set and a convex objective function. The global constraint we will use are: { alternative(a; [b]; C) : the constraint holds i at least C activities from the vector [b] has the same start time and duration of the activity a. { cumulative([s]; [d]; [r]; L) : the constraint holds i all the activities i de ned by a starting at time si, a duration si and a resource requirement ri never exceed the resource capacity L at any point in time. { noOverlap(a; t; [s]; [d]) : the constraint holds i all the activities i with start time si and duration di do not overlap an activity with start time a and duration t (for each i we can have a si + di OR a + t si). { synchronize(a; [b]) : the constraint holds i the start time a is synchronized with each start time i of the vector [b].

Model

Starting from the work in [ 1 ] we create a model that contains all the requirements and services needed by supercomputers in production. For every job i we have a Conditional Interval Variables (CVI, see [ 3 ]) jobi. A CVI represents an interval variable. The domain of a CVI is a subset of f?g Sf[s; e)js; e 2 Z; s eg. If a CVI has domain ?, this variable does not belong to the model and it is not considered in the solution process. A job's CVI contains the job walltime di speci ed by the user. For every job we have also a matrix U Ni of M xPij of CVIs, where M is the number of nodes in the system and Pij is the maximum number of job units dispatchable in the jth node. These elements assume the value s(i) if the ith job uses the node j, the value bottom otherwise. R is the set of resources of the node, A the set of jobs in a queue and B the set of running jobs. The base model created is described in 1. With the alternative constraint, we introduce the possibility for every job unit to be displaced in a node partially used by another job unit of the same job. The cumulative constrain the set of jobs start times. jobi

t 8i 2 A jobi = s(b) 8i 2 B alternative(jobi; U Nijk; ui) 8i = 1::N cumulative(U Nijk; diPij ; riPjikjl; rljl) 8k = 1::M; l 2 R Equation 2 represents the objective function used in this model. This function takes the job waiting-time weighted on the expected waiting time (ewti) of the queue where the job is submitted. This objective function is designed to optimize the jobs waitings paying attention to the fairness of these. This mean that waitings have to be distributed taking into account the priority of the jobs. min z = n X si

ewti i=1

qi 7

Experimental Results

We have evaluated the performance of our scheduler in two distinct experimental setups, namely (1) in a simulated environment on Virtual Machines (VM); and (2) on the actual Eurora HPC machine. The PBS software can be con gured in di erent modes to suit the purpose of the system administrator. the following experiments consider two di erent PBS setups: 1. The CINECA PBS con guration (referred to as PBSFifo): this setup uses a FIFO job ordering, no preemption, and back lling limited to the rst 10 jobs in the queue.

Test 1 Test 2 PBSFifo PBSWalltime CP Sched. PBSFifo PBSWalltime CP Sched. 152,94 137,74 119,77 1034,2 853,681 2441,3

65 60 46 234 200 376 1298810 1223690 1003970 16798300 13693000 16774800 0,47 3,14 11,45 1,02 15,47 34,82

Table 1: Test 1 and Test 2 results 2. A PBS con guration (referred to as PBSWalltime) designed to get the best trade-o between waiting time and computational overhead: this setup employs a strict job ordering (by increasing walltime), no preemption and backlling limited to the rst 400 jobs. 7.1

Simulation-based tests

We have designed the simulation so as to evaluate the performance of our CP scheduler w.r.t. PBS. The experiments di er under a wide range of conditions with respect to number of jobs, job units, and platform nodes. The goal is to assess the scalability of both approaches and their ability to deal with workloads having di erent resource requirements and processing times. The quality of the schedules was measured according to a number of metrics. Speci cally, we have de ned: { Weighted queue time (WQT): sum of job waiting-times, each divided (for fairness) by the maximum wait-time of the job queue. { Number of late jobs (NL): the number of jobs exceeding the maximum waittime of their queue. { Tardiness (TR): sum of job delays, where the delay of a job is the amount of time by which the maximum wait-time of its queue is exceeded. { Average overhead (AO): average computation time of the scheduler. In test 1 we simulate all the 65 Eurora nodes: the results are in Table 1. Our model manages to outperform considerably PBSFifo and PBSWalltime in terms of all the metrics related to waiting time and delay. In test 2 tested a 65 nodes con guration with a larger number of jobs (namely 700): the results are reported in Table 1. Due to the large number of jobs and (more importantly) job units, in this case, our framework was forced to employ the overhead reduction techniques. Such techniques are indeed e ective in limiting the overhead, but they also have an adverse e ect on the quality of the model solutions. As it can be seen in the table, our model yields a small improvement in tardiness w.r.t. PBSFifo, a small increase in the total time in queue, and a considerable increase of the number of late jobs, the WQT, and the weighted tardiness. 7.2

Execution on Eurora

Thanks to our modeling and design from Section 6, we have managed to obtain a scheduling system that is mature enough to be deployed and evaluated on the actual Eurora HPC machine. In detail, we have compared the performance of our approach and the PBSFifo con guration over ve weeks of regular utilization of the HPC machine. Since the comparison is performed in a production environment, it is impossible to guarantee that the two approaches process the same sequence of jobs. For this reason, we chose to compare the CP approach and PBSFifo in terms of: (1) the average WQT per job, and (2) the average number of used cores over time (i.e. the average core utilization). Our CP system performed consistently better with an average WQT per job of 2:50 10 6, against the 3:93 10 6 of PBSFifo. The standard deviation for the two approaches is very similar. The average core utilization obtained by both approaches during each week, show that the two approach have similar performance, which ranges between 520 and 599 for PBSFifo and between 510 and 573 for CP. 8

Conclusion

In this paper we presented a scheduler, based on Constraint Programming techniques, that can improve the results obtained from commercial schedulers highly tuned for a production environment and we implemented all the features for made it usable on a real-life HPC setting. The scheduler has been tested both in a simulated environment and on a real HPC machine with promising results. We have seen that in the medium hardness range we can improve results obtained by the commercial scheduler by a signi cant amount (21% on 152 points of WQT) and in the high hardness range we did not get an improvement due to the computational time and the too aggressive technique we had to implement. The experimental results on the Eurora HPC system shown an improvement on the weighted queue time while the system utilization. Future work will focus on the following directions: improving the integration between the scheduling management framework and the optimizer, and developing incremental strategies to hot-start the optimization engine.

1. Bartolini , A. , Borghesi , A. , Bridi , T. , Lombardi , M. , Milano , M. : Proactive workload dispatching on the eurora supercomputer . In: OSullivan, B. (ed.) Principles and Practice of Constraint Programming, Lecture Notes in Computer Science , vol. 8656 , pp. 765 { 780 . Springer International Publishing ( 2014 )

2. Chlumsky , V. , Klusacek , D. , Ruda , M.: The extension of torque scheduler allowing the use of planning and optimization in grids . Computer Science 13 , 5 { 19 ( 2012 )

3. Laborie , P. , Rogerie , J. : Reasoning with conditional time-intervals . In: Proc. of FLAIRS . pp. 555 { 560 ( 2008 )

4. Salot , P.: A survey of various scheduling algorithm in cloud computing environment . International Journal of research and engineering Technology (IJRET) , ISSN pp. 2319 { 1163 ( 2013 )

5. Works , P. : Pbs professional 12 .2, administrators

guide

, november 2013