<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A CP Scheduler for High-Performance Computers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thomas Bridi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Lombardi</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Bartolini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Benini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michela Milano</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Scheduling and dispatching tools for High-Performance Computing (HPC) machines have the role of mapping incoming jobs to the available resources, trying to maximize equipment utilization and user satisfaction. Optimal Job Scheduling is a well-known NP-hard problem, forcing commercial schedulers to adopt greedy approaches based on rules. Constraint Programming (CP) is a well-known combinatorial optimization approach that has been shown to be very e ective in optimally solving scheduling problems. We present the rst CP-based job scheduler for HPC machines, working in a real-life production environment. We evaluate our solution both on a cluster of virtual machines and on the Eurora Supercomputer with production workloads. Results show significant improvements in terms of user fair-waiting without degradation in overall machine utilization w.r.t state-of-the-art rule-based dispatchers.</p>
      </abstract>
      <kwd-group>
        <kwd>Constraint Programming</kwd>
        <kwd>Scheduling</kwd>
        <kwd>Supercomputer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Today high-performance computing (HPC) centers are investment-intensive
facilities with short depreciation cycles. An average supercomputer reaches full
depreciation in just three years, hence its utilization has to be aggressively
managed to produce and acceptable return on investment. A key role in this challenge
is played by scheduling software that decides where and when a job submitted
by a user has to be started. The scheduling software orders the set of jobs and
the set of nodes and then tries to allocate each job to nodes taking into account
the amount of currently available computing resources. When a new job is
submitted, the software usually applies a back- lling algorithm that tries to place
the job on unused resources without delaying the start of highest priority jobs
in queues. These priority-rule-based algorithms are simple and reasonably fast,
but they usually do not nd the best solution of the scheduling problem. One of
the most widespread queue-based scheduling software in HPC facilities is PBS
Professional [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this work, we present a complete CP model for solving the
optimal scheduling problem in HPC machines. The model signi cantly extends the
one in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to account for multiple classes of jobs and their temporal constraints.
In addition, the solution space exploration strategies have been optimized for
on-line use, taking into account the impact of the schedule computation time on
machine utilization. The CP solver based on the new model has been embedded
as a plug-in module within the software framework of a well-known commercial
HPC scheduler [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] replacing it's scheduling engine. By linking our solver with a
state-of-the-art HPC scheduling tool, we have been able to validate our approach
on a real-life HPC machine, Eurora from CINECA (Consorzio Interuniversitario
Calcolo Automatico). Experiments on Eurora demonstrate that the new
scheduler achieves signi cant improvements in job waiting time with respect to the
commercial scheduler used in CINECA, while at the same time maintaining
high machine utilization. An experimental campaign on a wide range of
synthetic workloads proves that the approach is exible, robust and well-suited for
integration in a portfolio of scheduling strategies to cover di erent levels of
average machine utilizations. In section 3 we show an overview of the scheduling
software running on the Eurora HPC machine. In section 4 we formally describe
the problem of scheduling. In section 5 we describe the optimization techniques
used to model the problem. In section 6 we present our optimization model and
all the features implemented to make it desirable for a real HPC center. In
section 7 we show results from simulations and from the Eurora supercomputer and
we report statistics on the computational overhead. Finally in section 8 we show
our conclusions.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        The problem of batch scheduling is well-known and widely investigated. The
interested reader can refer to [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for a good survey of the scheduling algorithms
used in HPC and computing clusters. Most of the algorithms described in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] can
be implemented withincommercial scheduling software by de ning appropriate
\scheduling rules". To the best of our knowledge, the only examples that apply
optimization techniques to a scheduler in a production context is [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this
paper, the author present an optimization technique applied as an extension
to the TORQUE scheduler. This extension replaces the scheduling core of the
framework with a back lling-like algorithm that inserts one job at a time into
the schedule starting from a previous solution and then applies a Tabu Search to
optimize the solution. This approach considers a job as a set of resources. This
assumption drastically decreases the exibility of the scheduler by avoiding the
possibility for a job to request more than one node. In our work, instead, we
consider a job as a set of nodes, each requiring a set of resources. In this way
we maintain the exibility of commercial schedulers (like TORQUE and PBS
Professional) but we deal with a more complex settings w.r.t. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Eurora, Heterogeneous Supercomputer</title>
      <p>
        Eurora is a heterogeneous HPC machine of CINECA. Eurora is composed of
65 nodes, one login node with 12 cores, 32 nodes with 16 cores at 3.1GHz and
2 GPU Kepler K20 each and 32 nodes with 16 cores at 2.1GHz and 2 Intel
Xeon phi (MIC) each. Users from this HPC machine can submit a job that
speci es the amount of resources, nodes and walltime to a queue; a queue is the
place where job waits to be executed. Each queue has a name and a priority,
after the submission. The scheduling software decides the start time and nodes
where to execute the job. The scheduling and dispatching software currently used
in Eurora is PBS Professional 12 from Altair; PBS Professional is a Portable
Batch System [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that schedules jobs based on rules. The original scheduler
PBS sched can be disabled and replaced by with ad-hoc scheduling algorithms.
We take advantage of this functionality to implement in a plug-and-play fashion
our optimized scheduling policy.
4
      </p>
      <p>The Scheduling problem
n this section, we formally describe the problem of on-line scheduling and
dispatching of a supercomputer. The scheduling problem considers a set of jobs J
and a set of queues Q. Every jobi, de ned on the set J , is submitted in a
speci c queue qi de ned on the set Q. Each job, when submitted, has to specify its
maximal duration di, the number of jobs units ui (the number of virtual nodes
required for the execution) and the amount of required resource rijkl (cores,
memory, GPUs, MICs) for each job unit k 2 [1::ui] and for each l 2 R, where
R is the set of resource. Each node nj of the system has a limit rljl for each
resource l, with j 2 N where N is the set of nodes. We have to assign the starting
execution time si and for each job unit juik of the job jobi, the node nj where
it has to be executed. Given the current time t, the set of running jobs cannot
be changed (migration is not supported), while the set of waiting jobs has to
be allocated and scheduled on resources without exceeding their capacity at any
point in time.
5</p>
    </sec>
    <sec id="sec-4">
      <title>Constraint Programming</title>
      <p>The technique used in this work to model the problem is Constraint
Programming. A Constraint Program is de ned on a set of variables, each de ned on a
discrete domain, and a set of constraints. Di erently from Convex Optimization
(like LP, ILP, etc. . . ), with this paradigm we are not forced to have a convex
polytope as solution set and a convex objective function. The global constraint
we will use are:
{ alternative(a; [b]; C) : the constraint holds i at least C activities from the
vector [b] has the same start time and duration of the activity a.
{ cumulative([s]; [d]; [r]; L) : the constraint holds i all the activities i de ned
by a starting at time si, a duration si and a resource requirement ri never
exceed the resource capacity L at any point in time.
{ noOverlap(a; t; [s]; [d]) : the constraint holds i all the activities i with start
time si and duration di do not overlap an activity with start time a and
duration t (for each i we can have a si + di OR a + t si).
{ synchronize(a; [b]) : the constraint holds i the start time a is synchronized
with each start time i of the vector [b].</p>
      <p>CP</p>
    </sec>
    <sec id="sec-5">
      <title>Model</title>
      <p>
        Starting from the work in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] we create a model that contains all the requirements
and services needed by supercomputers in production. For every job i we have a
Conditional Interval Variables (CVI, see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) jobi. A CVI represents an interval
variable. The domain of a CVI is a subset of f?g Sf[s; e)js; e 2 Z; s eg. If
a CVI has domain ?, this variable does not belong to the model and it is not
considered in the solution process. A job's CVI contains the job walltime di
speci ed by the user. For every job we have also a matrix U Ni of M xPij of
CVIs, where M is the number of nodes in the system and Pij is the maximum
number of job units dispatchable in the jth node. These elements assume the
value s(i) if the ith job uses the node j, the value bottom otherwise. R is the set
of resources of the node, A the set of jobs in a queue and B the set of running
jobs. The base model created is described in 1. With the alternative constraint,
we introduce the possibility for every job unit to be displaced in a node partially
used by another job unit of the same job. The cumulative constrain the set of
jobs start times.
jobi
      </p>
      <p>t 8i 2 A
jobi = s(b) 8i 2 B
alternative(jobi; U Nijk; ui) 8i = 1::N
cumulative(U Nijk; diPij ; riPjikjl; rljl) 8k = 1::M; l 2 R
Equation 2 represents the objective function used in this model. This function
takes the job waiting-time weighted on the expected waiting time (ewti) of the
queue where the job is submitted. This objective function is designed to
optimize the jobs waitings paying attention to the fairness of these. This mean that
waitings have to be distributed taking into account the priority of the jobs.
min z =
n
X si</p>
      <p>ewti
i=1</p>
      <p>qi
7</p>
    </sec>
    <sec id="sec-6">
      <title>Experimental Results</title>
      <p>We have evaluated the performance of our scheduler in two distinct experimental
setups, namely (1) in a simulated environment on Virtual Machines (VM); and
(2) on the actual Eurora HPC machine. The PBS software can be con gured in
di erent modes to suit the purpose of the system administrator. the following
experiments consider two di erent PBS setups:
1. The CINECA PBS con guration (referred to as PBSFifo): this setup uses
a FIFO job ordering, no preemption, and back lling limited to the rst 10
jobs in the queue.</p>
      <p>Test 1 Test 2
PBSFifo PBSWalltime CP Sched. PBSFifo PBSWalltime CP Sched.
152,94 137,74 119,77 1034,2 853,681 2441,3</p>
      <p>65 60 46 234 200 376
1298810 1223690 1003970 16798300 13693000 16774800
0,47 3,14 11,45 1,02 15,47 34,82</p>
      <p>Table 1: Test 1 and Test 2 results
2. A PBS con guration (referred to as PBSWalltime) designed to get the best
trade-o between waiting time and computational overhead: this setup
employs a strict job ordering (by increasing walltime), no preemption and
backlling limited to the rst 400 jobs.
7.1</p>
      <sec id="sec-6-1">
        <title>Simulation-based tests</title>
        <p>We have designed the simulation so as to evaluate the performance of our CP
scheduler w.r.t. PBS. The experiments di er under a wide range of conditions
with respect to number of jobs, job units, and platform nodes. The goal is to
assess the scalability of both approaches and their ability to deal with workloads
having di erent resource requirements and processing times. The quality of the
schedules was measured according to a number of metrics. Speci cally, we have
de ned:
{ Weighted queue time (WQT): sum of job waiting-times, each divided (for
fairness) by the maximum wait-time of the job queue.
{ Number of late jobs (NL): the number of jobs exceeding the maximum
waittime of their queue.
{ Tardiness (TR): sum of job delays, where the delay of a job is the amount
of time by which the maximum wait-time of its queue is exceeded.
{ Average overhead (AO): average computation time of the scheduler.
In test 1 we simulate all the 65 Eurora nodes: the results are in Table 1. Our
model manages to outperform considerably PBSFifo and PBSWalltime in terms
of all the metrics related to waiting time and delay. In test 2 tested a 65 nodes
con guration with a larger number of jobs (namely 700): the results are reported
in Table 1. Due to the large number of jobs and (more importantly) job units, in
this case, our framework was forced to employ the overhead reduction techniques.
Such techniques are indeed e ective in limiting the overhead, but they also have
an adverse e ect on the quality of the model solutions. As it can be seen in the
table, our model yields a small improvement in tardiness w.r.t. PBSFifo, a small
increase in the total time in queue, and a considerable increase of the number of
late jobs, the WQT, and the weighted tardiness.
7.2</p>
      </sec>
      <sec id="sec-6-2">
        <title>Execution on Eurora</title>
        <p>Thanks to our modeling and design from Section 6, we have managed to obtain
a scheduling system that is mature enough to be deployed and evaluated on the
actual Eurora HPC machine. In detail, we have compared the performance of our
approach and the PBSFifo con guration over ve weeks of regular utilization of
the HPC machine. Since the comparison is performed in a production
environment, it is impossible to guarantee that the two approaches process the same
sequence of jobs. For this reason, we chose to compare the CP approach and
PBSFifo in terms of: (1) the average WQT per job, and (2) the average number
of used cores over time (i.e. the average core utilization). Our CP system
performed consistently better with an average WQT per job of 2:50 10 6, against
the 3:93 10 6 of PBSFifo. The standard deviation for the two approaches is
very similar. The average core utilization obtained by both approaches during
each week, show that the two approach have similar performance, which ranges
between 520 and 599 for PBSFifo and between 510 and 573 for CP.
8</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper we presented a scheduler, based on Constraint Programming
techniques, that can improve the results obtained from commercial schedulers highly
tuned for a production environment and we implemented all the features for
made it usable on a real-life HPC setting. The scheduler has been tested both
in a simulated environment and on a real HPC machine with promising results.
We have seen that in the medium hardness range we can improve results
obtained by the commercial scheduler by a signi cant amount (21% on 152 points
of WQT) and in the high hardness range we did not get an improvement due to
the computational time and the too aggressive technique we had to implement.
The experimental results on the Eurora HPC system shown an improvement on
the weighted queue time while the system utilization. Future work will focus on
the following directions: improving the integration between the scheduling
management framework and the optimizer, and developing incremental strategies to
hot-start the optimization engine.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bartolini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borghesi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bridi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lombardi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milano</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Proactive workload dispatching on the eurora supercomputer</article-title>
          . In: OSullivan, B. (ed.)
          <source>Principles and Practice of Constraint Programming, Lecture Notes in Computer Science</source>
          , vol.
          <volume>8656</volume>
          , pp.
          <volume>765</volume>
          {
          <fpage>780</fpage>
          . Springer International Publishing (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chlumsky</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klusacek</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ruda</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>The extension of torque scheduler allowing the use of planning and optimization in grids</article-title>
          .
          <source>Computer Science</source>
          <volume>13</volume>
          ,
          <issue>5</issue>
          {
          <fpage>19</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Laborie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rogerie</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Reasoning with conditional time-intervals</article-title>
          .
          <source>In: Proc. of FLAIRS</source>
          . pp.
          <volume>555</volume>
          {
          <issue>560</issue>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Salot</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>A survey of various scheduling algorithm in cloud computing environment</article-title>
          .
          <source>International Journal of research and engineering Technology (IJRET)</source>
          , ISSN pp.
          <volume>2319</volume>
          {
          <issue>1163</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Works</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <source>Pbs professional 12</source>
          .2,
          <string-name>
            <surname>administrators</surname>
            <given-names>guide</given-names>
          </string-name>
          ,
          <source>november 2013</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>