A CP Scheduler for High-Performance Computers

Thomas Bridi, Michele Lombardi, Andrea Bartolini, Luca Benini, and Michela Milano
{thomas.bridi,michele.lombardi2,a.bartolini,luca.benini,michela.milano}@unibo.it
Università di Bologna

Abstract. Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the role of mapping incoming jobs to the available resources, trying to maximize equipment utilization and user satisfaction. Optimal job scheduling is a well-known NP-hard problem, forcing commercial schedulers to adopt greedy approaches based on rules. Constraint Programming (CP) is a well-known combinatorial optimization approach that has been shown to be very effective in optimally solving scheduling problems. We present the first CP-based job scheduler for HPC machines working in a real-life production environment. We evaluate our solution both on a cluster of virtual machines and on the Eurora supercomputer with production workloads. Results show significant improvements in terms of fairness-weighted user waiting time without degradation in overall machine utilization w.r.t. state-of-the-art rule-based dispatchers.

Keywords: Constraint Programming, Scheduling, Supercomputer

1 Introduction

Today high-performance computing (HPC) centers are investment-intensive facilities with short depreciation cycles. An average supercomputer reaches full depreciation in just three years, hence its utilization has to be aggressively managed to produce an acceptable return on investment. A key role in this challenge is played by the scheduling software that decides where and when a job submitted by a user has to be started. The scheduling software orders the set of jobs and the set of nodes and then tries to allocate each job to nodes, taking into account the amount of currently available computing resources. When a new job is submitted, the software usually applies a backfilling algorithm that tries to place the job on unused resources without delaying the start of the highest-priority jobs in the queues. These priority-rule-based algorithms are simple and reasonably fast, but they usually do not find the best solution of the scheduling problem. One of the most widespread queue-based scheduling software packages in HPC facilities is PBS Professional [5].

In this work, we present a complete CP model for solving the optimal scheduling problem in HPC machines. The model significantly extends the one in [1] to account for multiple classes of jobs and their temporal constraints. In addition, the solution space exploration strategies have been optimized for on-line use, taking into account the impact of the schedule computation time on machine utilization. The CP solver based on the new model has been embedded as a plug-in module within the software framework of a well-known commercial HPC scheduler [5], replacing its scheduling engine. By linking our solver with a state-of-the-art HPC scheduling tool, we have been able to validate our approach on a real-life HPC machine, Eurora from CINECA (Consorzio Interuniversitario Calcolo Automatico). Experiments on Eurora demonstrate that the new scheduler achieves significant improvements in job waiting time with respect to the commercial scheduler used in CINECA, while at the same time maintaining high machine utilization. An experimental campaign on a wide range of synthetic workloads shows that the approach is flexible, robust and well suited for integration in a portfolio of scheduling strategies covering different levels of average machine utilization.
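To make the rule-based baseline concrete, the listing below sketches the aggressive backfilling idea in simplified Python; the function, the Job record and its fields are illustrative assumptions of ours, not the PBS Professional implementation.

```python
from collections import namedtuple

# Hypothetical job record; the field names are ours, not PBS Professional's.
Job = namedtuple("Job", "name priority requested_cores walltime")

def try_backfill(waiting_jobs, free_cores, now, reservation_start):
    """Greedy backfilling sketch: start a waiting job early only if it fits on
    the currently idle cores and its declared walltime ends before the
    reservation held for the highest-priority queued job."""
    started = []
    for job in sorted(waiting_jobs, key=lambda j: j.priority, reverse=True):
        if (job.requested_cores <= free_cores
                and now + job.walltime <= reservation_start):
            started.append(job)
            free_cores -= job.requested_cores
    return started

# Example: only the short job fits before the reservation at time 1000.
queue = [Job("a", 5, 8, 300), Job("b", 3, 4, 5000)]
print([j.name for j in try_backfill(queue, free_cores=16, now=0,
                                    reservation_start=1000)])  # -> ['a']
```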
The paper is organized as follows. In Section 2 we discuss related work. In Section 3 we give an overview of the scheduling software running on the Eurora HPC machine. In Section 4 we formally describe the scheduling problem. In Section 5 we describe the optimization technique used to model the problem. In Section 6 we present our optimization model and all the features implemented to make it suitable for a real HPC center. In Section 7 we show results from simulations and from the Eurora supercomputer, and we report statistics on the computational overhead. Finally, in Section 8 we draw our conclusions.

2 Related Work

The problem of batch scheduling is well known and widely investigated. The interested reader can refer to [4] for a good survey of the scheduling algorithms used in HPC and computing clusters. Most of the algorithms described in [4] can be implemented within commercial scheduling software by defining appropriate “scheduling rules”. To the best of our knowledge, the only example that applies optimization techniques to a scheduler in a production context is [2]. In that paper, the authors present an optimization technique applied as an extension to the TORQUE scheduler. This extension replaces the scheduling core of the framework with a backfilling-like algorithm that inserts one job at a time into the schedule, starting from a previous solution, and then applies a Tabu Search to optimize the solution. This approach considers a job as a set of resources. This assumption drastically decreases the flexibility of the scheduler, since it prevents a job from requesting more than one node. In our work, instead, we consider a job as a set of nodes, each requiring a set of resources. In this way we maintain the flexibility of commercial schedulers (like TORQUE and PBS Professional) but we deal with a more complex setting w.r.t. [2].

3 Eurora, a Heterogeneous Supercomputer

Eurora is a heterogeneous HPC machine of CINECA. Eurora is composed of 65 nodes: one login node with 12 cores, 32 nodes with 16 cores at 3.1GHz and 2 Kepler K20 GPUs each, and 32 nodes with 16 cores at 2.1GHz and 2 Intel Xeon Phi (MIC) accelerators each. Users of this HPC machine submit a job, specifying the amount of resources, nodes and walltime, to a queue; a queue is the place where jobs wait to be executed. Each queue has a name and a priority. After the submission, the scheduling software decides the start time and the nodes where the job will be executed. The scheduling and dispatching software currently used in Eurora is PBS Professional 12 from Altair; PBS Professional is a Portable Batch System [5] that schedules jobs based on rules. The original scheduler, PBS sched, can be disabled and replaced with ad-hoc scheduling algorithms. We take advantage of this functionality to implement our optimized scheduling policy in a plug-and-play fashion.
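Before the formal problem statement in the next section, the following sketch illustrates the kind of input data the dispatcher receives for each submission: the job walltime, the number of job units with their per-unit resource request, the queue (with its priority and expected waiting time), and the per-node capacities. The classes, field names and values are illustrative assumptions, not the actual PBS or Eurora data format.

```python
from dataclasses import dataclass, field

# Illustrative data model only; names are ours, not PBS Professional's.
@dataclass
class Queue:
    name: str
    priority: int
    expected_wait: int          # expected waiting time in seconds (ewt)

@dataclass
class Job:
    job_id: str
    queue: Queue
    walltime: int               # declared maximal duration d_i (seconds)
    units: int                  # number of job units u_i (virtual nodes)
    per_unit_request: dict = field(default_factory=dict)  # e.g. {"cores": 8, "gpus": 1}

@dataclass
class Node:
    name: str
    capacity: dict = field(default_factory=dict)  # e.g. {"cores": 16, "gpus": 2}

# Example: a two-unit job submitted to a "parallel" queue, and one GPU node.
queue = Queue(name="parallel", priority=10, expected_wait=3600)
job = Job("job42", queue, walltime=1800, units=2,
          per_unit_request={"cores": 8, "gpus": 1})
node = Node("node012", capacity={"cores": 16, "gpus": 2})
```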
4 The Scheduling Problem

In this section, we formally describe the problem of on-line scheduling and dispatching on a supercomputer. The scheduling problem considers a set of jobs J and a set of queues Q. Every job i ∈ J is submitted to a specific queue q_i ∈ Q. Each job, when submitted, has to specify its maximal duration d_i, its number of job units u_i (the number of virtual nodes required for the execution) and the amount of required resources r_ijkl (cores, memory, GPUs, MICs) for each job unit k ∈ [1..u_i] and for each resource l ∈ R, where R is the set of resources. Each node n_j of the system has a capacity rl_jl for each resource l, with j ∈ N, where N is the set of nodes. We have to assign the starting execution time s_i of each job and, for each job unit ju_ik of job i, the node n_j where it has to be executed. Given the current time t, the set of running jobs cannot be changed (migration is not supported), while the set of waiting jobs has to be allocated and scheduled on the resources without exceeding their capacity at any point in time.

5 Constraint Programming

The technique used in this work to model the problem is Constraint Programming. A constraint program is defined on a set of variables, each with a discrete domain, and a set of constraints. Differently from convex optimization (e.g. LP, ILP), with this paradigm we are not forced to have a convex polytope as the solution set and a convex objective function. The global constraints we use are:

– alternative(a, [b], C): the constraint holds iff at least C activities from the vector [b] have the same start time and duration as the activity a.
– cumulative([s], [d], [r], L): the constraint holds iff the activities i, each defined by a start time s_i, a duration d_i and a resource requirement r_i, never exceed the resource capacity L at any point in time.
– noOverlap(a, t, [s], [d]): the constraint holds iff all the activities i with start time s_i and duration d_i do not overlap an activity with start time a and duration t (for each i we have a ≥ s_i + d_i OR a + t ≤ s_i).
– synchronize(a, [b]): the constraint holds iff the start time a is synchronized with the start time of each activity in the vector [b].

6 CP Model

Starting from the work in [1], we create a model that contains all the requirements and services needed by supercomputers in production. For every job i we have a Conditional Interval Variable (CVI, see [3]) job_i. A CVI represents an interval variable whose domain is a subset of {⊥} ∪ {[s, e) | s, e ∈ Z, s ≤ e}. If a CVI takes the value ⊥, the variable does not belong to the model and it is not considered in the solution process. A job's CVI has the walltime d_i specified by the user as its duration. For every job we also have a matrix UN_i of CVIs of size M × P_ij, where M is the number of nodes in the system and P_ij is the maximum number of job units dispatchable on the j-th node. The element UN_ijk takes the value s(i), the start time of job i, if the i-th job uses node j, and ⊥ otherwise. In the following, R is the set of resources of a node, A the set of jobs waiting in a queue and B the set of running jobs. The base model is described in (1). With the alternative constraint we introduce the possibility for every job unit to be placed on a node partially used by another job unit of the same job. The cumulative constraints restrict the job start times so that the capacity of each node resource is never exceeded.

    job_i ≥ t                                          ∀i ∈ A
    job_i = s(i)                                       ∀i ∈ B
    alternative(job_i, [UN_ijk], u_i)                  ∀i ∈ A ∪ B            (1)
    cumulative([UN_ijk], [d_i], [r_ijkl], rl_jl)       ∀j = 1..M, ∀l ∈ R
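The paper does not state which CP solver implements the model, so the following is only an illustrative sketch: it encodes the core of (1) with IBM's docplex.cp Python API, whose optional interval variables play the role of CVIs [3]. One optional interval is created per (job, node, unit slot), an alternative constraint selects u_i of them, and a per-node, per-resource cumulative is built from pulse expressions; all data and names are hypothetical.

```python
# Illustrative encoding of the core of model (1) with IBM's docplex.cp API
# (CP Optimizer). Solver choice, data and names are assumptions, not the
# paper's actual code.
from docplex.cp.model import (CpoModel, interval_var, alternative, pulse,
                              start_of, minimize)
from docplex.cp.modeler import sum as cpo_sum

# Toy instance: two identical nodes and two waiting jobs (A = {0, 1}, B empty).
nodes = [{"cores": 16, "gpus": 2}, {"cores": 16, "gpus": 2}]
jobs = [  # walltime d_i (s), job units u_i, per-unit resource request
    {"walltime": 1800, "units": 2, "req": {"cores": 8, "gpus": 1}},
    {"walltime": 3600, "units": 1, "req": {"cores": 16, "gpus": 0}},
]
ewt = [3600, 7200]  # expected waiting time ewt_i of each job's queue
t = 0               # current time; q_i = 0 for all jobs in this toy instance

mdl = CpoModel()
job_itv, unit_itv = {}, {}
for i, job in enumerate(jobs):
    # job_i: mandatory interval of length d_i, starting no earlier than t
    job_itv[i] = interval_var(length=job["walltime"], name="job_%d" % i)
    mdl.add(start_of(job_itv[i]) >= t)
    # UN_ijk: optional intervals, present iff unit k of job i runs on node j
    # (here P_ij = u_i, i.e. all units of a job may share a node)
    units = []
    for j in range(len(nodes)):
        for k in range(job["units"]):
            unit_itv[i, j, k] = interval_var(optional=True,
                                             length=job["walltime"],
                                             name="un_%d_%d_%d" % (i, j, k))
            units.append(unit_itv[i, j, k])
    # alternative: u_i unit intervals are selected and synchronized with job_i
    mdl.add(alternative(job_itv[i], units, job["units"]))

# cumulative: on each node j and resource l, selected units never exceed rl_jl
for j, cap in enumerate(nodes):
    for res, limit in cap.items():
        pulses = [pulse(unit_itv[i, j, k], jobs[i]["req"][res])
                  for i in range(len(jobs)) for k in range(jobs[i]["units"])
                  if jobs[i]["req"][res] > 0]
        if pulses:
            mdl.add(cpo_sum(pulses) <= limit)

# Objective (2): minimize the sum of (s_i - q_i) / ewt_i (q_i = 0 here)
mdl.add(minimize(cpo_sum([(1.0 / ewt[i]) * start_of(job_itv[i])
                          for i in range(len(jobs))])))

sol = mdl.solve(TimeLimit=10)
if sol:
    for i in range(len(jobs)):
        print("job", i, "starts at", sol.get_var_solution(job_itv[i]).get_start())
```

In the real on-line setting, a model of this kind would be rebuilt at every scheduling event, with the sets A and B updated and the start times of already running jobs fixed, as in the second line of (1).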
Equation (2) represents the objective function used in this model. This function takes the job waiting time weighted by the expected waiting time (ewt_i) of the queue where the job is submitted. The objective is designed to optimize the job waiting times while paying attention to fairness: waiting times have to be distributed taking into account the priority of the jobs.

                n
    min z  =    Σ   (s_i − q_i) / ewt_i                                      (2)
               i=1

where s_i is the start time of job i, q_i its submission time, and ewt_i the expected waiting time of its queue.

7 Experimental Results

We have evaluated the performance of our scheduler in two distinct experimental setups, namely (1) in a simulated environment on virtual machines (VMs), and (2) on the actual Eurora HPC machine. The PBS software can be configured in different modes to suit the purpose of the system administrator. The following experiments consider two different PBS setups:

1. The CINECA PBS configuration (referred to as PBSFifo): this setup uses a FIFO job ordering, no preemption, and backfilling limited to the first 10 jobs in the queue.
2. A PBS configuration (referred to as PBSWalltime) designed to get the best trade-off between waiting time and computational overhead: this setup employs a strict job ordering (by increasing walltime), no preemption, and backfilling limited to the first 400 jobs.

7.1 Simulation-based tests

We have designed the simulations so as to evaluate the performance of our CP scheduler w.r.t. PBS. The experiments cover a wide range of conditions with respect to the number of jobs, job units, and platform nodes. The goal is to assess the scalability of both approaches and their ability to deal with workloads having different resource requirements and processing times. The quality of the schedules was measured according to a number of metrics (a small computation sketch is given at the end of this subsection). Specifically, we have defined:

– Weighted queue time (WQT): sum of job waiting times, each divided (for fairness) by the maximum wait time of the job's queue.
– Number of late jobs (NL): the number of jobs exceeding the maximum wait time of their queue.
– Tardiness (TR): sum of job delays, where the delay of a job is the amount of time by which the maximum wait time of its queue is exceeded.
– Average overhead (AO): average computation time of the scheduler.

               Test 1                              Test 2
           PBSFifo  PBSWalltime  CP Sched.     PBSFifo  PBSWalltime  CP Sched.
    WQT     152.94       137.74     119.77      1034.2      853.681    2441.3
    NL          65           60         46         234          200       376
    TR     1298810      1223690    1003970    16798300     13693000  16774800
    AO        0.47         3.14      11.45        1.02        15.47     34.82

    Table 1: Test 1 and Test 2 results

In test 1 we simulate all the 65 Eurora nodes: the results are reported in Table 1. Our model manages to considerably outperform PBSFifo and PBSWalltime in terms of all the metrics related to waiting time and delay. In test 2 we tested a 65-node configuration with a larger number of jobs (namely 700): the results are also reported in Table 1. Due to the large number of jobs and (more importantly) job units, in this case our framework was forced to employ its overhead reduction techniques. Such techniques are indeed effective in limiting the overhead, but they also have an adverse effect on the quality of the solutions. As can be seen in the table, our model yields a small improvement in tardiness w.r.t. PBSFifo, a small increase in the total time in queue, and a considerable increase in the number of late jobs, the WQT, and the weighted tardiness.
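As referenced in the metric definitions above, the snippet below shows one straightforward way to compute WQT, NL and TR from a list of per-job records; the record layout and names are ours and purely illustrative (AO is simply the mean of the measured scheduler run times).

```python
# Illustrative computation of the WQT, NL and TR metrics defined above; the
# record layout (submit, start, max_wait) is an assumption for this sketch.
def schedule_metrics(records):
    """records: iterable of (submit_time, start_time, queue_max_wait) tuples."""
    wqt = tr = 0.0
    nl = 0
    for submit, start, max_wait in records:
        wait = start - submit
        wqt += wait / max_wait             # waiting time, weighted for fairness
        delay = max(0.0, wait - max_wait)  # time beyond the queue's max wait
        if delay > 0:
            nl += 1                        # late job
        tr += delay                        # tardiness contribution
    return {"WQT": wqt, "NL": nl, "TR": tr}

# Example: two jobs, the second of which exceeds its queue's maximum wait time.
print(schedule_metrics([(0, 150, 600), (0, 900, 600)]))
# -> {'WQT': 1.75, 'NL': 1, 'TR': 300.0}
```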
7.2 Execution on Eurora

Thanks to our modeling and design choices from Section 6, we have obtained a scheduling system that is mature enough to be deployed and evaluated on the actual Eurora HPC machine. In detail, we have compared the performance of our approach and of the PBSFifo configuration over five weeks of regular utilization of the HPC machine. Since the comparison is performed in a production environment, it is impossible to guarantee that the two approaches process the same sequence of jobs. For this reason, we chose to compare the CP approach and PBSFifo in terms of: (1) the average WQT per job, and (2) the average number of used cores over time (i.e. the average core utilization). Our CP system performed consistently better, with an average WQT per job of about 2.50 × 10^-6, against about 3.93 × 10^-6 for PBSFifo. The standard deviation for the two approaches is very similar. The average core utilization obtained by both approaches during each week shows that the two approaches have similar performance, ranging between 520 and 599 cores for PBSFifo and between 510 and 573 cores for CP.

8 Conclusion

In this paper we presented a scheduler, based on Constraint Programming techniques, that can improve on the results obtained by commercial schedulers highly tuned for a production environment, and we implemented all the features needed to make it usable in a real-life HPC setting. The scheduler has been tested both in a simulated environment and on a real HPC machine, with promising results. We have seen that in the medium-hardness range we can improve the results obtained by the commercial scheduler by a significant amount (21% on 152 points of WQT), while in the high-hardness range we did not obtain an improvement, due to the computational time and the overly aggressive overhead-reduction techniques we had to employ. The experimental results on the Eurora HPC system show an improvement in the weighted queue time while preserving the system utilization. Future work will focus on the following directions: improving the integration between the scheduling management framework and the optimizer, and developing incremental strategies to hot-start the optimization engine.

References

1. Bartolini, A., Borghesi, A., Bridi, T., Lombardi, M., Milano, M.: Proactive workload dispatching on the Eurora supercomputer. In: O'Sullivan, B. (ed.) Principles and Practice of Constraint Programming, Lecture Notes in Computer Science, vol. 8656, pp. 765–780. Springer International Publishing (2014)
2. Chlumský, V., Klusáček, D., Ruda, M.: The extension of TORQUE scheduler allowing the use of planning and optimization in grids. Computer Science 13, 5–19 (2012)
3. Laborie, P., Rogerie, J.: Reasoning with conditional time-intervals. In: Proc. of FLAIRS, pp. 555–560 (2008)
4. Salot, P.: A survey of various scheduling algorithms in cloud computing environment. International Journal of Research in Engineering and Technology (IJRET), ISSN 2319-1163 (2013)
5. PBS Works: PBS Professional 12.2 Administrator's Guide (November 2013)