<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 2022</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Personnel Scheduling with Reinforcement Learning and Graph Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Platten</string-name>
          <email>platten.ben@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Macfarlane</string-name>
          <email>m.v.macfarlane@uva.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Graus</string-name>
          <email>david.graus@randstadgroep.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sepideh Mesbah</string-name>
          <email>sepideh.mesbah@randstadgroep.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Randstad Groep Nederland</institution>
          ,
          <addr-line>Diemermere 25, 1112 TC Diemen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universiteit van Amsterdam</institution>
          ,
          <addr-line>Binnengasthuisstraat 9, 1012 ZA Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>This research investigates the extent to which a reinforcement learning agent can learn the heuristics of a combinatorial optimisation (CO) problem, personnel scheduling, from a graph representation. In recent years, reinforcement learning has emerged as an effective data-driven approach to solving CO problems, a subset of mathematical optimisation that appears in many real-world tasks and can be costly to solve in terms of computing power and/or human resources. CO problems often have an underlying graph structure with relational inductive biases that can be exploited by powerful graph neural networks. Across a range of problem complexities, our approach was able to return consistently high average reward, performing significantly better than baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural Networks</kwd>
        <kwd>Combinatorial optimisation</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Graph neural networks</kwd>
        <kwd>Personnel scheduling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Matching personnel with shifts in a way that ensures coverage, legality and employee well-being is a process that becomes exponentially more complicated as scale increases [1]. Randstad is a human resource consulting firm responsible for scheduling over 40,000 employees into more than 70,000 shifts every week. Employees are matched with work from several different industries, such as retail, healthcare, and catering. Staff shortages and increased demand have made this process even more challenging in recent years. Personnel scheduling (PS) as an operations research problem has generally been framed as the nurse rostering problem (NRP) - a reference to the difficult task of assigning nurses to shifts in a 24-7 cycle - and has been studied since the 1960s by researchers from computer science, mathematics, operations research and medicine [2]. Scheduling problems carry hard and soft constraints; a typical hard constraint in NRPs is that there is sufficient staff coverage at all times. A soft constraint could be an employee's preference for particular shifts. At Randstad, personnel scheduling is characterised by high variance between clients (e.g. different retail companies) in terms of problem requirements; planning horizon, shift types, requisite skills and volume of staff and shifts vary significantly.</p>
      <sec id="sec-1-1">
        <title>For example, a single client could require as many as</title>
        <p>nEvelop-O
Human Resources, in conjunction with the 16th ACM Conference on
without violating a weight constraint [3]. If a CO prob- scheduling problems. The complexity of a NRP is
deterlem is modelled as a sequential decision-making process, mined by the combination of constraints and parameters
neural network function approximation can be used to [1]. Usually, problems with a large number of possible
learn features of a problem and use them to optimise the shift types, a large number of nurses and/or a long
plandecisions of an agent [4]. This is the essence of Deep ning period, are expected to require more computational
Reinforcement Learning (RL), a class of algorithms that efort. In their comprehensive review, Burke et al [ 2],
have attained some notoriety in recent years due to eye- declared that exact methods ”cannot cope with the
enorcatching performance on various games [5]. Many CO mous search spaces that are represented by real problems
problems naturally induce an underlying graph structure (at least on their own)”. However, innovative modelling
of nodes and edges [6]. By representing a problem using techniques have been used to formulate integer linear
graph neural networks (GNNs), the learned vector rep- programs (ILPs) with huge numbers of variables.
Branchresentation encodes crucial structures that improve the and-price, a technique that augments linear relaxation
ability of an algorithm to learn from such problems and a method for solving hard ILPs by temporarily relaxing
solve them eficiently [ 7]. CO is at the heart of many real the integer constraints - by dropping a proportion of the
world problems; vehicle routing, microchip design, big constraints and then checking if the subsequent solution
data processing, and numerous forms of scheduling tasks, is feasible, was used by [8]. Modelling complex ILPs is
to name a few. This paper seeks to investigate whether a challenging and powerful solvers are needed to run them
RL agent can learn to solve personnel scheduling prob- [9].
lems represented as graphs. The most common form of heuristic optimisation is</p>
        <p>Research question: To what extent can reinforcement meta-heuristics, which use strategies inspired by other
learning optimise personnel scheduling problems? systems to guide the search process. For example, Ant</p>
        <p>There are many ways to design experiments to an- colony optimisation meta-heuristic algorithms build
soswer this research question. We chose to keep the RL lutions by mimicking the foraging behaviour of ants [10].
algorithm and problem constraints fixed throughout the Although efective and simpler to implement than exact
experiments and instead to focus on problem complexity methods, heuristics need to be revised if the problem
and reward function. Problem complexity can be con- statement changes slightly [2]. Furthermore, costly and
trolled by the number of shifts and also by the ratio be- error-prone feature engineering can introduce biases that
tween shifts and employees, both of which determine the do not align with the real world and lead to imprecise
probability that a correct action can be taken by chance. decisions.</p>
        <p>An increase in problem complexity will therefore lead to Methods that hybridise exact methods with heuristic
increased constraint violations from a random agent. A approaches are known as hybrid approaches and Burke
reward function determines whether an agent’s actions et al. [2] went so far as to say that ”some kind of (hybrid)
are rewarded with a positive value or punished with a heuristic method ofers the only realistic way of tackling
negative value. Reward is averaged over a set number such dificult and challenging problems in the foreseeable
of episodes (complete runs through a problem) as this future.”
provides a stable measure of an agent’s performance over In general, the models seen in the literature are
comtime. plex, narrowly applicable to a certain environment, and</p>
        <p>In this context, we focus on the following sub-research lack generality, but it is dificult draw conclusions when
questions: there is so much variance between studies. We can say
that, by considering each instance of a problem in
isolation, these existing methods do not exploit the fact that
the problem instances often come from a common real
world underlying distribution.
1. To what extent does increasing problem
complex</p>
        <p>ity afect average reward?
2. To what extent does reward shaping afect
aver</p>
        <p>age reward?</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.2. Machine Learning Methods for CO problems</title>
        <p>Substantial research has been conducted into personnel For real world CO problems, it is highly likely that
probscheduling and combinatorial optimisation more gener- lem instances will share certain characteristics or
patally. terns [6]. Data-dependent machine learning approaches
can exploit these patterns, leading to the development of
2.1. Personnel Scheduling Methods faster solutions for practical cases.
A neural network model was deployed on a CO
probThe most common methods - exact, heuristic, and hybrid lem as far back as 1996 when a Hopfield network was
- have diferent ways of dealing with the complexity of used by Mańdziuk [11] to solve instances of the
Travelling Salesman Problem. Chen et al. [12] used neural net- CNNs get their name from the process of convolving the
works to guide a tree search. By representing schedules input vectors with a sliding kernel which reduces the
as matrices, a network could analyse existing solutions size of an image whilst exploiting its spatial structure.
and learn to predict the probability of each node leading Accounting for the underlying structure of the data
into a solution. Václavík et al. [13] use classifiers to discard troduces an inductive bias which promotes learning and
bad solutions and speed up a subsequent heuristic search. leads to better results [7]. The inductive bias of a learning</p>
        <p>Reinforcement learning (RL) refers to a computational algorithm is the set of assumptions that the learner uses
approach to learning from interaction [9]. In RL, an agent to predict outputs for new data. GNNs also perform
cominteracts with a Markov Decision Process (MDP) with putations over input data in a way that exploits spatial
the goal of maximising some reward generated by the en- structure, a process known as message passing. Node
vironment. A combination of a state of the environment and edge feature vectors are propagated by an exchange
and an available action, a state-action pair (s, a), has an of information (messages) between adjacent nodes.
expected return known as a state-action value function, Intuitively, it makes sense that a node might be
repreor Q-value. A policy is a conditional distribution that the sented by the average of the connections to its neighbours
agent uses to select actions. The main aim of RL is to and in this way the spatial information of the graph is
ifnd an optimal policy - a conditional distribution used maintained. It also means that the number of message
by the agent as a strategy to select actions - such that the passing layers determines how far a signal can travel;
state-action value function is optimised. with 2-layers, the message passing happens twice so the</p>
        <p>Broadly, there are two categories of RL algorithm; message travels two hops away. There are many GNN
arthose which learn by optimising the value function to chitectures optimised for diferent functions; node, edge,
increase reward at each step and those which directly or graph classification, using features stored at the node,
optimise the policy to increase total expected reward edge or both. The promise of GNNs is that the learned
[14]. Policy-based methods directly model the agent’s vector representation encodes crucial graph structures
policy as a parametric function. By collecting previous that help solve a CO problem more eficiently [ 6].
decisions that the agent made in the environment we can Mirhoseini et al. [18] developed a learning-based
apoptimise the parameters of the network to maximise the proach to chip placement, one of the most complex and
expected reward of the process [14]. time-consuming stages of the chip design process.
Exploiting the graph structure of netlists (a description of
 +1 =   +    ∇((  | | , , )) t[h18e]cuosnendeactGivNitNy otof acno melpeucttreofneiactucirrecsuoitf),nMetilrishtonsoeidneiseatnadl.</p>
        <p>An episode refers to all the states of an environment create a rich representation for learning. Mirhoseini et al.
that come in between an initial-state and a terminal- [18] claim that their method is the first that can quickly
state, generated from the sequence of actions taken and generalise to previously unseen netlists and produce
’suchanges observed in the environment. An episode can perhuman’ results without the need for human experts
be terminated due to success or failure, and, over many in the loop. In the opinion of Cappart et al. [6], this study
episodes, the parameters of a policy network are trained is only scratching the surface of ”the innovations that
until they can achieve good performance on an unseen can be enabled from a careful synergy of GNNs and CO”.
problem. Stronger convergence guarantees are available Kool et al. [15] represented travelling salesman
probfor policy-based methods than for value-based methods. lems as graphs and used an encoder-decoder GNN as a
One of the main strengths of RL is that agents can learn parameterised policy to make decisions on which node
the rules of an environment without the need for explicit to visit next. An encoder transforms input data into
programming or examples of optimal endpoints. A key embeddings of a specific dimensions. The decoder then
aim of RL research is to train models that make decisions transforms the embeddings to the required output
dimento solve real problems [15] sions. The solution is constructed incrementally, adding</p>
        <p>Reinforcement Learning (RL) methods were first ap- one node at a time. The policy network is trained
usplied to CO by Zhang and Dietterich, 1995, with a model ing the REINFORCE algorithm, a policy gradient method.
designed to solve the NP-Hard job-shop problem [16]. The motivation for this choice was greater eficiency than</p>
        <p>A growing body of research uses RL in combination a value function approach. Kool et al. [15] report
signifwith graph neural networks (GNNs), a type of function icantly improved results over recent learned heuristic
approximator designed to operate directly on a graph approaches to TSP. Their solution worked on problems
structure [17]. Whenever there is a set of entities that with up to 100 nodes and generalised to 2 variants of
interact with one another or are related in any way, they the problem using the same hyper-parameters. They are
can be expressed in the form of a graph [6]. GNNs are clear that their goal is not to outperform specialised TSP
structurally similar to convolutional neural networks algorithms, but to show that their approach is flexible
(CNNs) which are mostly used for image classification. enough to generalise to similar problems.</p>
        <sec id="sec-2-1-1">
          <title>After a thorough search of the literature, there was</title>
          <p>no credible study found to be using an RL algorithm
and graph representations for personnel scheduling. The
research most relevant to our aims is Kool et al. [15]
which has many useful insights.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our goal is to train an agent to sequentially assign employees to shifts in an acceptable configuration. A schedule is acceptable if the constraints are not violated. The foundation of this research is a pipeline that ingests a pool and a shift schedule and outputs a trained scheduling agent. We train models on a fixed set of problems in order to generate an agent that is capable of solving general, previously unseen problems. The pipeline was designed to take pool and shift schedule input of any size, adapting the parameters of the environment, graph representation, and policy GNN accordingly.</p>
      <p>[Table 1. Overview of the scheduling problems used for training the scheduling agent: 420 problems in total, with 6 different shift counts and a range of shift-employee ratios. The number of shifts, and the ratio between shifts and employees, determine the complexity of our scheduling problems: the probability that a correct action can be taken by chance.]</p>
      <sec id="sec-3-1">
        <title>3.1. Data</title>
        <p>At Randstad, employees are grouped into pools, which act as a pre-filter for eligibility; anyone assigned to a pool is eligible for shifts assigned to that pool. A pool represents an instance of a scheduling problem: the employees and shifts must be combined in a way that provides acceptable coverage and respects employment law and employee preference. Computational complexity increases with shift count and shift-employee ratio, meaning that Randstad data includes many highly complex scheduling problems. Furthermore, the data contains many features with high cardinality, such as shift length and shift start time. To test the viability of an RL approach, simpler problem instances were required.</p>
        <p>Inspired by the characteristics of Randstad data, we created synthetic schedule and pool data sets that can be combined to create simplified scheduling problems. The aim is to create problems that test an agent's ability to learn to satisfy a single constraint: the same employee can't work sequential shifts. This constraint has to be learned from the features of the problems.</p>
        <p>Synthetic schedules contain a shift id and 2 features: day of week (Monday - Friday) and time of day (morning or evening). Schedule shift count ranges from 3 to 8 and pool employee count ranges from 2 to 8. By combining these schedules and pools, we created a training set of 420 unique problems. Full details of the training set can be seen in Table 1. Functions were created to manage the problem combinations and check for uniqueness and duplicate problems. Another function randomly generates new, unseen problems of variable shift count and shift-employee ratio, for testing.</p>
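        <p>A minimal sketch of a generator for problems in this format (the function names and sampling choices are illustrative assumptions, not the authors' code):</p>
        <preformat>
# Illustrative sketch: generate a synthetic scheduling problem consisting of
# a schedule (shift ids with day/time features) and a pool of employee ids.
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
TIMES = ["morning", "evening"]

def make_problem(num_shifts, num_employees, seed=None):
    rng = random.Random(seed)
    schedule = [
        {"shift_id": i,
         "day": rng.choice(DAYS),
         "time": rng.choice(TIMES)}
        for i in range(num_shifts)
    ]
    pool = [f"employee_{j}" for j in range(num_employees)]
    return schedule, pool

# e.g. a problem with 8 shifts and 3 employees (shift-employee ratio ~2.7)
schedule, pool = make_problem(num_shifts=8, num_employees=3, seed=0)
        </preformat>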
      <sec id="sec-3-1">
        <title>3.2. Pipeline</title>
        <sec id="sec-3-1-1">
          <title>So that a model can learn to find acceptable solutions to the scheduling problems, we define a Reinforcement method called REINFORCE.</title>
          <p>3.2.1. REINFORCE</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>REINFORCE was proposed by Williams [19]. We start</title>
          <p>by initialising a random policy which the REINFORCE
algorithm takes as input. The agent collects the
trajectory of an episode from current policy, storing actions,
states and rewards as a history. For each episode we
calculate the reward and use it to evaluate the policy.</p>
          <p>For each action in the episode, the loss is calculated as
the probability (according to the policy) of choosing the
action, multiplied by the reward received. The loss is
then back-propagated through the network to update
the policy, increasing the probability that the agent will
choose the actions that lead to higher reward in the next
episode.</p>
          <p>$$\theta_{t+1} = \theta_t + \alpha \, G_t \, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$</p>
          <p>where $\theta$ are the policy parameters, $\alpha$ is the learning rate and $G_t$ is the return following time step $t$.</p>
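          <p>In code, this update can be sketched as follows, assuming a policy_net that maps a state to action probabilities; this is a minimal illustration, not the authors' implementation:</p>
          <preformat>
# Minimal REINFORCE update sketch in PyTorch. `policy_net` is assumed to map
# a state to a probability distribution over employees.
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=1.0):
    """episode: list of (state, action, reward) tuples from one rollout."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):          # compute returns G_t backwards
        g = r + gamma * g
        returns.insert(0, g)
    loss = torch.zeros(())
    for (state, action, _), g_t in zip(episode, returns):
        probs = policy_net(state)               # policy pi(a | s, theta)
        loss = loss - torch.log(probs[action]) * g_t   # -log pi * G_t
    optimizer.zero_grad()
    loss.backward()                             # ascend the expected return
    optimizer.step()
          </preformat>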
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Policy Network</title>
          <p>For the agent's policy, we implemented an encoder-decoder GNN architecture, similar to the model used in Kool et al. [15]'s framework for tackling Travelling Salesman Problems. The building blocks of the GNN are from the Deep Graph Library, which supports a range of GNN architectures on top of the PyTorch framework and also provides graph building functions. An example scheduling problem graph representation can be seen in Figure 2. The encoder takes the state of a scheduling problem as input and uses message passing convolutional layers to update and aggregate the embeddings:</p>
          <p>$$h_v^{(l+1)} = \sigma\Big(\sum_{u \in \mathcal{N}(v)} f\big(h_u^{(l)}, h_v^{(l)}\big)\Big)$$</p>
          <p>where $\mathcal{N}(v)$ is the neighbourhood of node $v$, $f$ is a learned message function and $\sigma$ is an activation function.</p>
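          <p>The bipartite graph itself can be assembled with the Deep Graph Library's graph building functions; a minimal sketch, where the node/edge type names and feature sizes are our assumptions rather than the authors' schema:</p>
          <preformat>
# Sketch: a bipartite shift-employee graph in DGL, with assignment status
# stored on the edges (every employee is connected to every shift).
import torch
import dgl

num_shifts, num_employees = 4, 3
src = torch.arange(num_employees).repeat_interleave(num_shifts)  # employee ids
dst = torch.arange(num_shifts).repeat(num_employees)             # shift ids

g = dgl.heterograph({("employee", "works", "shift"): (src, dst)})
g.nodes["shift"].data["feat"] = torch.zeros(num_shifts, 7)       # 5 days + 2 times
g.edges["works"].data["assigned"] = torch.zeros(num_employees * num_shifts, 1)
          </preformat>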
        <sec id="sec-3-1-3">
          <title>Then, at each time step, the decoder network calculates</title>
          <p>the inner product of the current shift’s node features and
the worker edge features, which serves as a compatibility
score. The decoder calculate this score for problems
of any size. The softmax output of these compatibility
scores are the probabilities for the policy.</p>
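          <p>A minimal sketch of this decoding step (tensor shapes and names are our assumptions):</p>
          <preformat>
# Decoder sketch: compatibility scores as inner products between the current
# shift embedding and each employee's embedding, softmaxed into a policy.
import torch
import torch.nn.functional as F

def decode(shift_embedding, employee_embeddings):
    # shift_embedding: (d,); employee_embeddings: (num_employees, d)
    scores = employee_embeddings @ shift_embedding   # one score per employee
    return F.softmax(scores, dim=0)                  # action probabilities
          </preformat>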
          <p>The number of nodes in the input graph reflects the count of shifts and employees in the scheduling problem. Shift features are consistent (5 days of the week and 2 times of day), but the employee features - which indicate whether an employee has been assigned to a specific shift - vary with employee and shift count. The network needs to be able to process graphs of different sizes without being re-initialised with the parameters of the current graph, which would reset the learned weights. To achieve this, we designed the architecture to store the employee feature data in the edges, so that a linear transformation of employee features can be performed at each step. This is illustrated in Figure 3. We use a message passing layer that incorporates edge data in its update and aggregation functions. The design of this layer was proposed by [20]:</p>
          <p>$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big)$$</p>
          <p>where $\mathcal{N}_i^r$ is the neighbour set of node $i$ w.r.t. relation $r$, $c_{i,r}$ is the normaliser, $\sigma$ is an activation function, and $W_0^{(l)}$ is the self-loop weight.</p>
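          <p>To make the update rule concrete, here is a plain-PyTorch sketch of one such relation-aware layer in the spirit of [20]; DGL provides an equivalent built-in module, and this hand-rolled version and its naming are ours:</p>
          <preformat>
# One relational message passing step:
#   h_i' = sigma( sum_r sum_{j in N_i^r} (1/c_ir) W_r h_j + W_0 h_i )
import torch
import torch.nn as nn

class RelationalLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.w_rel = nn.ModuleList(
            nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations))
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)   # W_0

    def forward(self, h, neighbours):
        # h: (num_nodes, in_dim); neighbours[r][i]: neighbour ids of i under r
        self_term = self.w_self(h)
        rows = []
        for i in range(h.shape[0]):
            acc = self_term[i]
            for r, w in enumerate(self.w_rel):
                nbrs = neighbours[r][i]
                if nbrs:                     # mean = normalised sum, c_ir = |N_i^r|
                    acc = acc + w(h[nbrs]).mean(dim=0)
            rows.append(acc)
        return torch.relu(torch.stack(rows))
          </preformat>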
        <sec id="sec-3-1-4">
          <title>These edge features would allow us to measure im</title>
          <p>portant statistics such as total shifts allocated for each
employee without any hard-coding with of problem size.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>These statistics could be used to add additional constraints such as minimum / maximum hours worked per</title>
        </sec>
        <sec id="sec-3-1-6">
          <title>The network has a fixed learning rate of 1e-3. The</title>
          <p>encoder uses 2 convolutional layers, 5 × 32 shift
embeddings and  × 32 employee embeddings, where  = to the
number of employee nodes. The decoder uses 32 × 32
embeddings of the features for the current shift and 32 × 32
embeddings for each employee.</p>
          <p>For the agent’s policy, we implemented an encoder- employee.
decoder GNN architecture, similar to the model used
by Kool et al. [15]. The building blocks of the GNN are
from the Deep Graph Library, which supports a range
of GNN architectures on top of the PyTorch framework
and also provides graph building functions. An example</p>
        </sec>
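        <p>Below is a hedged sketch of such a custom Env for this scheduling task using the classic Gym API; the state layout follows the description above, but the class name and the simplified violation check are our assumptions:</p>
        <preformat>
# A custom scheduling environment sketch: one employee is assigned per shift,
# and the episode length equals the number of shifts.
import numpy as np
import gym
from gym import spaces

class SchedulingEnv(gym.Env):
    def __init__(self, shift_features, num_employees):
        super().__init__()
        self.shift_features = shift_features            # (num_shifts, num_feats)
        self.num_shifts = shift_features.shape[0]
        self.num_employees = num_employees
        self.action_space = spaces.Discrete(num_employees)
        self.observation_space = spaces.Box(
            low=0.0, high=1.0,
            shape=(self.num_shifts,
                   shift_features.shape[1] + num_employees))

    def reset(self):
        self.t = 0
        self.assignments = np.zeros((self.num_shifts, self.num_employees))
        return self._state()

    def _state(self):
        # State matrix: shift features concatenated with the assignment matrix.
        return np.concatenate([self.shift_features, self.assignments], axis=1)

    def step(self, action):
        self.assignments[self.t, action] = 1.0
        # Simplified stand-in for the consecutive-shift check of Section 4.
        reward = 0.0
        if self.t > 0 and self.assignments[self.t - 1, action] == 1.0:
            reward = -1.0 / (self.num_shifts - 1)
        self.t += 1
        done = self.t == self.num_shifts               # all shifts assigned
        return self._state(), reward, done, {}
        </preformat>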
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>The process of repeating episodes to train the policy network is managed by a custom Open AI Gym environment.</title>
      </sec>
      <sec id="sec-4-2">
        <title>There are many possible variants for the design of this re</title>
        <p>search, such as RL algorithm, training problem
complex3.2.3. Open AI Gym ity, feature engineering, reward function and network
architecture, to name a few. We have chosen to focus
OpenAI’s Gym library is a popular choice for implement- on reward function and problem complexity. Reward is
ing environments for RL. It’s main building block is a based on performance with regards to a single constraint:
Python class, Env, which simulates the environment you an employee cannot work simultaneous or consecutive
want to train your agent in. There are many standard shifts. Given time constraints, hyper-parameter tuning
environments available, such as the Atari games used by was not performed for this study. The code can be viewed
DeepMind [21], and it supports custom environments. on github.</p>
        <p>The key components of an environment are the state Average reward is used to evaluate a models
perforspace, action space, reward function. For the state space mance. Reward is averaged across the previous 100
of our custom scheduling environment, we combine a episodes to better capture an agent’s long-term
behavpool and a schedule into a matrix. The length of the ior. In a real world setting, the constraint we used is
matrix equals the number of shifts, and width of the ma- considered a hard constraint meaning that a violation
trix equals the number of shift features plus the number renders the solution unacceptable. A solution could
feaof employees in the pool. For each shift, the agent can sibly have a high reward but be unacceptable due to a
assign an employee from the pool, therefore the action single constraint violation. We used an additional metric,
space is equal to the number of employees. The reward ’percentage acceptable’ to measure the number of
acceptfunction is adjusted as part of the experimental setup able solutions produced by the agent: a solution with a
but essentially it rewards the agent when an acceptable maximum score and therefore no constraint violations.
schedule is created. Or rather, it punishes the agent when
constraints are violated. Another key part of the environ- 4.1. Reward Functions
ment is the step function, which takes the agents action
as input and updates the state and reward accordingly.</p>
        <p>TSPs have an inbuilt cost, tour length (the cumulative distance between points), which the agent in Kool et al. [15] tries to minimise. There are examples of TSP solvers that apply cost (negative reward) incrementally (after each stop) - which has been shown to encourage greedy behaviour - and at terminal states (when the tour is complete) [15]. As there is no implicit reward or cost for scheduling problems, we chose an arbitrary total reward value of 1. Constraint violations result in a negative reward. Constraint violations are calculated based on the features in the schedule data: assigning 2 shifts on the same day to an employee is a constraint violation, and assigning shifts on subsequent days to the same employee is a constraint violation only if the first shift is in the evening and the second shift is in the morning. The cost of a constraint violation is 1/(number of shifts - 1), because it is not possible to have a violation on the first assignment.</p>
        <p>The effect of the reward function on performance is tested by comparing the performance of 3 reward functions: 1) calculating reward at the terminal state of an episode (Terminal), 2) calculating reward at every step (Step), and 3) assigning half of the reward stepwise, and the remaining half at the terminal state as a reward for an acceptable solution (Step Bonus).</p>
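        <p>A hedged sketch of how the three schemes could be computed, following the totals described above; the exact bookkeeping is our assumption:</p>
        <preformat>
# Total reward for a perfect episode is 1 and each violation costs
# 1/(num_shifts - 1), per the text; the rest is illustrative.
def episode_rewards(violations, num_shifts, scheme):
    """violations: one boolean per assignment after the first one."""
    unit = 1.0 / (num_shifts - 1)           # cost of a single violation
    n_viol = sum(violations)
    if scheme == "Terminal":                # all reward at the terminal state
        return [0.0] * (len(violations) - 1) + [1.0 - n_viol * unit]
    if scheme == "Step":                    # reward/punish every assignment
        return [-unit if v else unit for v in violations]
    if scheme == "Step Bonus":              # half stepwise, half terminal bonus
        rewards = [0.5 * (-unit if v else unit) for v in violations]
        rewards[-1] += 0.5 if n_viol == 0 else 0.0
        return rewards
        </preformat>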
        <sec id="sec-4-2-1">
          <title>4.2. Problem Complexity</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Problem complexity can be increased in two ways; by</title>
        <p>the number of shifts and also by the ratio between shifts
and employees. More shifts means more decisions for
the agent to take, so there is more chance of a constraint
violation occurring. The efect of shift-employee ratio is
illustrated in 4, which shows a schedule and two pools. Table 2
Combining the schedule with pool 1 has a higher ratio
than for pool 2, which reduces the number of actions an
agent can take without incurring a constraint violation.</p>
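        <p>As a back-of-envelope illustration of this effect - a simplification we introduce, not the paper's model - suppose that at each assignment after the first, exactly one employee is blocked by the previous assignment:</p>
        <preformat>
# A uniform random agent then produces a fully acceptable schedule with
# probability ((E - 1) / E) ** (S - 1), which drops quickly as the
# shift-employee ratio S/E rises. Purely illustrative.
def p_acceptable_random(num_employees, num_shifts):
    return ((num_employees - 1) / num_employees) ** (num_shifts - 1)

print(p_acceptable_random(2, 8))   # high ratio (4.0): ~0.008
print(p_acceptable_random(6, 8))   # lower ratio (1.3): ~0.28
        </preformat>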
        <p>Models were trained on 420 problems with between 3
and 8 shifts across 10,000 episodes. This means that the
agent would see the same problem many times during
training.</p>
        <p>Models were tested on 500 problems from a range of complexity distributions, as can be seen in Table 2, which summarises the test set problems by size, shift-employee ratio and count of unique problems. The agent sees each problem only once.</p>
        <p>The test problems have two levels of shift-employee
ratio: an ’average’ ratio of between 1.2 and 2.8, which
reflects the first and third quartiles for shift-employee
ratio in the Randstad data, and a ’high’ ratio of between
2.9 and 5 to reflect the more complex problems seen in
the data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>Models were trained on 5 diferent seeds and each model</title>
        <p>was tested on every problem twice. The results were
averaged across both runs for all 5 models. We tested
the significance of our results using Welch’s t-test, which
doesn’t assume equal variance between groups. Results
are summarised in Table 3.
Summary of test set problems showing size, shift-employee
ratio and count of unique problems.</p>
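      <p>The test itself is available in SciPy; a minimal sketch with illustrative (not the paper's) reward arrays:</p>
      <preformat>
# Welch's t-test: equal_var=False drops the equal-variance assumption.
from scipy import stats

step_bonus = [0.91, 0.88, 0.95, 0.86, 0.90]    # per-problem average rewards
random_agent = [0.52, 0.41, 0.60, 0.47, 0.45]

t, p = stats.ttest_ind(step_bonus, random_agent, equal_var=False)
      </preformat>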
      <p>Across all experiments, the best performing model used the Step Bonus reward function (mean=0.89, standard deviation=0.22), which returned significantly higher average reward than the Random baseline (m=0.49, sd=0.49), p&lt;.05. Step Bonus also performed significantly better than the other reward functions, Step (m=0.62, sd=0.62), p&lt;.05, and Terminal (m=0.58, sd=0.57), p&lt;.05.</p>
      <p>In terms of problem complexity, average reward was significantly higher for Average ratio problems (m=0.74, sd=0.39) than for High ratio problems (m=0.65, sd=0.65), p&lt;.05. This effect was observed for each model. The effect of increasing shift count is less clear. Excluding results from the Random agent, problems with a max shift count of 8 (ms8) (m=0.72, sd=0.36) show significantly higher average reward than problems with max shift counts of 14 (m=0.68, sd=0.68), 18 (m=0.68, sd=0.68) and 23 (m=0.68, sd=0.68). However, there was no significant difference in average reward observed between the smallest problems (ms8) and the largest problems (ms30) (m=0.72, sd=0.41). Furthermore, ms30 showed higher average reward than ms23, ms18, and ms14. These effects were observed across both Average and High ratio problems.</p>
        <p>Figure 5 shows the performance of the 3 reward
functions against a Random agent baseline on average ratio
scheduling problems of increasing shift count.</p>
        <p>Figure 6 shows the performance of 3 reward functions
against a Random agent baseline on high ratio scheduling
problems of increasing shift count.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>Personnel scheduling is a real-world combinatorial opti</title>
        <p>misation task that requires significant time and expertise
to manage efectively. In this paper, we developed a
scheduling problem solver based on a framework from practical value. We created a graph representation of
Kool et al. [15], which was designed to solve another CO each combination of shift schedule and employee pool
problem, the travelling salesman problem. Compared to for use in a GNN. The GNN functions as the policy, using
previous works, which used problem-specific approaches, valuable information about the node in the context of
this framework is an opportunity for a data-driven ap- the graph to guide the decisions of an agent. The
paramproach to personnel scheduling that could be of great eters of the network are trained using the REINFORCE
Model
Random agent
Step
Terminal
Step Bonus</p>
        <p>Average Reward</p>
        <p>% Acceptable
0.49
0.62
0.58
0.89
14.00
51.68
51.56
79.60
algorithm. To measure the ability of the RL scheduler methods. State of the art algorithms could be used, but a
to optimise scheduling problems, we used average re- simple change worth testing is the use of a baseline, as
ward and percentage acceptable metrics. We compared seen in [15]: ”A good baseline reduces gradient variance
performance across 3 diferent reward functions and 10 and therefore increases speed of learning”. As Figure
diferent levels of problem dificulty. 7 shows, Step Bonus appears to converge on a stable</p>
        <p>The Step Bonus model achieved consistently high per- reward after only a few hundred episodes, but this could
formance across all problems, significantly better than well change if more constraints were added.
the baseline and the other reward functions. Despite the
fact that the problems in training set had a max shift
count of 8, the trained Step Bonus agent was able to 7. Conclusion
able achieve maximum score on 77% of High ratio, max
shift 30 problems. The results show that, with the right reward function, an</p>
        <p>With a focus on an academic problem, rather than a RL agent was able to solve simplified personnel
schedulreal world application, [15] were able to test their model ing problems across a range of shift counts and
shifton several standard sets of travelling salesman problems employee ratios. We have shown that an agent can learn
(plus variants) and compare the performance to other the constraints of a combinatorial optimisation problem
studies that used the same data. There are standard test from data. For future work, additional evaluation using
sets for personnel scheduling problems but they were real world data is of importance to better understand the
out of scope for this research as these problems generally range of applicability of our approach.
have more complicated constraints than the one we used.</p>
        <p>A comparison that can be made between this study and References
[15], is the efect of node count on performance. The
cost for TSPs is positively correlated to the number of [1] S. Den Hartog, et al., On the complexity
nodes, so performance appears worse (higher cost) for of nurse scheduling problems, Master thesis
larger problems. However, [15] observed performance (2016). URL: https://studenttheses.uu.nl/handle/20.
on 100 node TSP problems comparable to state of the 500.12932/22268.
art TSP solvers, suggesting that their graph model scaled [2] E. K. Burke, P. De Causmaecker, G. V. Berghe,
well to larger problems. A similar case could be made H. Van Landeghem, The state of the art of
for this research, as we saw no significant diference nurse rostering, Journal of scheduling 7 (2004)
in performance between the hardest (ms30, High ratio) 441–499. URL: http://link.springer.com/10.1023/B:
problems and the easiest (ms8, average ratio). JOSH.0000046076.75950.0b.</p>
        <p>Because the size of the reward / cost for our system is [3] Y. Bengio, A. Lodi, A. Prouvost, Machine learning
negatively correlated to the number of shifts in a prob- for combinatorial optimization: a methodological
lem, violations are punished less harshly for big problems. tour d’horizon, European Journal of Operational
However, the award of a bonus reward of .5 for an ac- Research 290 (2021) 405–421. URL: http://arxiv.org/
ceptable solution should counteract this. abs/1811.06128.</p>
        <p>Another explanation is that shift count and shift- [4] I. Bello, H. Pham, Q. V. Le, M. Norouzi, S.
Benemployee ratio are not the best or only way to mea- gio, Neural combinatorial optimization with
resure the complexity of a personnel scheduling problem. inforcement learning, 5th International Conference
The consistently high rewards we observed for the Step on Learning Representations, ICLR (2017). URL:
Bonus model suggest that the network was able to learn http://arxiv.org/abs/1611.09940.
the constraint - no simultaneous or consecutive employee [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu,
assignments - and that it was no harder to exploit it for J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller,
a large problem than a small one. Aside from the bonus A. K. Fidjeland, G. Ostrovski, et al., Human-level
reward, a decision made at time-step  doesn’t efect  + 1 control through deep reinforcement learning,
na. In a real world scheduling problem, this is unlikely to ture 518 (2015) 529–533. URL: http://www.nature.
be the case as there will be additional constraints to con- com/articles/nature14236.
sider. The most obvious constraint to add to this research [6] Q. Cappart, D. Chételat, E. B. Khalil, A. Lodi, C.
Morwould be one to ensure that each worker in the pool is ris, P. Veličković, Combinatorial optimization and
assigned to at least one shift. reasoning with graph neural networks, Proceedings</p>
        <p>If, as expected, adding additional constraints increases of the Thirtieth International Joint Conference on
complexity to an extent that performance is gets worse, Artificial Intelligence, IJCAI-21 (2021) 4348–4355.
there are architecture changes that could be made make URL: https://doi.org/10.24963/ijcai.2021/595.
the model more efective. The RL algorithm we have used, [7] P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez,
REINFORCE, is one of the most simple policy gradient V. Zambaldi, M. Malinowski, A. Tacchetti, D.
Ra</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>poso</surname>
            , A. Santoro,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Faulkner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Song</surname>
          </string-name>
          , arXiv preprint arXiv:
          <volume>1701</volume>
          .07274 (
          <year>2017</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ballard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Vaswani, //arxiv.org/abs/1701.07274.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Langston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          , [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kool</surname>
          </string-name>
          , H. van Hoof,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          , Attention,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Botvinick</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.</surname>
          </string-name>
          <article-title>Vinyals, learn to solve routing problems!</article-title>
          , in: 7th Inter-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>learning</surname>
          </string-name>
          , and graph networks,
          <source>arXiv</source>
          (
          <year>2018</year>
          ).
          <source>URL: ICLR</source>
          <year>2019</year>
          ,
          <article-title>New Orleans</article-title>
          , LA, USA, May 6-
          <issue>9</issue>
          ,
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          https://arxiv.org/pdf/
          <year>1806</year>
          .01261.pdf. OpenReview.net,
          <year>2019</year>
          . URL: https://openreview. [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Legrain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Omer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosat</surname>
          </string-name>
          , A rotation- net/forum?id=ByxBFsRqYm.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>based branch-and-price approach for the nurse</article-title>
          [16]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          , Reinforcement learning:
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>scheduling problem, Mathematical Programming An introduction (</article-title>
          <year>2018</year>
          )
          <fpage>352</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Computation</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>417</fpage>
          -
          <lpage>450</lpage>
          . URL: https://hal. [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Scarselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Tsoi</surname>
          </string-name>
          , M. Hagenbuchner,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>archives-ouvertes</article-title>
          .fr/hal-01545421. G. Monfardini,
          <article-title>The graph neural network model</article-title>
          , [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Koole</surname>
          </string-name>
          , An Introduction to Business Analytics,
          <source>IEEE transactions on neural networks 20</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          Lulu. com,
          <year>2019</year>
          .
          <fpage>61</fpage>
          -
          <lpage>80</lpage>
          . [10]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Jaradat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Badareen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ayob</surname>
          </string-name>
          , M. Al- [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirhoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazgan</surname>
          </string-name>
          , J. Jiang,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          University-Computer and
          <source>Information Sciences</source>
          <volume>31</volume>
          arXiv:
          <year>2004</year>
          .
          <volume>10746</volume>
          (
          <year>2020</year>
          ). URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (
          <year>2019</year>
          )
          <fpage>378</fpage>
          -
          <lpage>384</lpage>
          . URL: https://linkinghub.elsevier.
          <year>2004</year>
          .
          <volume>10746</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          com/retrieve/pii/S1319157818300363. [19]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          , Simple statistical gradient-following [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mańdziuk</surname>
          </string-name>
          ,
          <article-title>Solving the travelling salesman prob- algorithms for connectionist reinforcement learn-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>lem with a hopfield-type neural network, Demon- ing, Machine learning 8 (</article-title>
          <year>1992</year>
          )
          <fpage>229</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>stratio Mathematica</source>
          <volume>29</volume>
          (
          <year>1996</year>
          )
          <fpage>219</fpage>
          -
          <lpage>232</lpage>
          . [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bloem</surname>
          </string-name>
          , R. v. d. Berg, [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , P. De Causmaecker,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dou</surname>
          </string-name>
          , Deep neural
          <string-name>
            <given-names>I.</given-names>
            <surname>Titov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Modeling relational data with</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>rostering problem</article-title>
          ,
          <source>in: Proceedings of the 13th In- mantic web conference</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ternational Conference on the Practice and Theory URL</source>
          : http://arxiv.org/abs/1703.06103.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>of Automated Timetabling-PATAT</source>
          , volume
          <volume>2</volume>
          ,
          <year>2021</year>
          . [21]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          , [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Václavík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Šcha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hanzálek</surname>
          </string-name>
          , Roster evaluation
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          , M. Riedmiller,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>Journal of Heuristics</source>
          <volume>22</volume>
          (
          <year>2016</year>
          )
          <fpage>667</fpage>
          -
          <lpage>697</lpage>
          . URL:
          <article-title>http: control through deep reinforcement learning</article-title>
          , na-
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          //arxiv.org/abs/
          <year>1804</year>
          .05002. ture 518 (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          . URL: http://www.nature. [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning: An overview</article-title>
          , com/articles/nature14236.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>