Automated Personnel Scheduling with Reinforcement Learning and Graph Neural Networks

Benjamin Platten¹,², Matthew Macfarlane¹, David Graus² and Sepideh Mesbah²
¹ Universiteit van Amsterdam, Binnengasthuisstraat 9, 1012 ZA Amsterdam, The Netherlands
² Randstad Groep Nederland, Diemermere 25, 1112 TC Diemen, The Netherlands

Abstract
This research investigates the extent to which a reinforcement learning agent can learn the heuristics of a combinatorial optimisation (CO) problem, personnel scheduling, from a graph representation. In recent years, reinforcement learning has emerged as an effective data-driven approach to solving CO problems, a subset of mathematical optimisation problems that appear in many real-world tasks and can be costly to solve in terms of computing power and/or human resources. CO problems often have an underlying graph structure with relational inductive biases that can be exploited by powerful graph neural networks. Across a range of problem complexities, our approach was able to return consistently high average reward, performing significantly better than baselines.

Keywords
Combinatorial optimisation, Reinforcement learning, Graph neural networks, Personnel scheduling

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
platten.ben@gmail.com (B. Platten); m.v.macfarlane@uva.nl (M. Macfarlane); david.graus@randstadgroep.nl (D. Graus); sepideh.mesbah@randstadgroep.nl (S. Mesbah)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Matching personnel with shifts in a way that ensures coverage, legality and employee well-being is a process that becomes exponentially more complicated as scale increases [1]. Randstad is a human resource consulting firm responsible for scheduling over 40,000 employees into more than 70,000 shifts every week. Employees are matched with work from several different industries such as retail, healthcare, and catering. Staff shortages and increased demand have made this process even more challenging in recent years. Personnel scheduling (PS) as an operations research problem has generally been known as the Nurse Rostering Problem (NRP) [2]—a reference to the difficult task of assigning nurses to shifts in a 24-7 cycle—and has been studied since the 1960s by researchers from computer science, mathematics, operations research and medicine [2]. Scheduling problems carry hard and soft constraints; a typical hard constraint in NRPs is that there is sufficient staff coverage at all times, while a soft constraint could be an employee's preference not to work on weekends [2]. For Randstad, personnel scheduling is characterised by high variance between clients (e.g. different retail companies) in terms of problem requirements; planning horizon, shift types, requisite skills and volume of staff and shifts vary significantly. For example, a single client could require as many as 2,000 shifts to be scheduled in one week, or as few as 2; the average is closer to 80 shifts per week per client. In this context, there is a need for an automated technique that helps clients and planners fill shifts efficiently by recommending personnel for given shifts while respecting the constraints.

Researchers have developed linear models that find optimal solutions for scheduling problems by exploring the search space of possible solutions [2]. Known as exact methods, these algorithms simulate a decision tree in which each node is a possible state of the scheduling problem and the solutions are situated at the leaves. Due to the enormous and complex search spaces of modern scheduling problems, exact methods alone are not viable. Furthermore, an 'optimal' solution is usually not the goal; administrators want to quickly generate a high-quality schedule that satisfies all hard constraints and as many of a wide range of soft constraints as possible. The other broad class of solutions is heuristic methods, which use higher-level knowledge of a problem to take shortcuts through the search space.
Reliance on domain expertise and hand-crafted features means that these methods are susceptible to changes in the problem formulation. Neither of these approaches generalises well to new problems [2].

Personnel scheduling can be framed as a combinatorial optimisation problem, and there is a growing body of research using machine learning techniques to solve such problems [3]. CO is the process of searching for an optimal solution amongst a finite set of possibilities. Classic problems include the Travelling Salesman Problem (TSP)—where the objective is to find the shortest route through a list of cities—and the knapsack problem, where the objective is to maximise the value of objects in a knapsack without violating a weight constraint [3]. If a CO problem is modelled as a sequential decision-making process, neural network function approximation can be used to learn features of a problem and use them to optimise the decisions of an agent [4]. This is the essence of Deep Reinforcement Learning (RL), a class of algorithms that have attained some notoriety in recent years due to eye-catching performance on various games [5]. Many CO problems naturally induce an underlying graph structure of nodes and edges [6]. By representing a problem using graph neural networks (GNNs), the learned vector representation encodes crucial structures that improve the ability of an algorithm to learn from such problems and solve them efficiently [7]. CO is at the heart of many real-world problems: vehicle routing, microchip design, big data processing, and numerous forms of scheduling tasks, to name a few. This paper investigates whether an RL agent can learn to solve personnel scheduling problems represented as graphs.

Research question: To what extent can reinforcement learning optimise personnel scheduling problems?

There are many ways to design experiments to answer this research question. We chose to keep the RL algorithm and problem constraints fixed throughout the experiments and instead to focus on problem complexity and reward function. Problem complexity can be controlled by the number of shifts and also by the ratio between shifts and employees, both of which determine the probability that a correct action can be taken by chance. An increase in problem complexity will therefore lead to increased constraint violations from a random agent. A reward function determines whether an agent's actions are rewarded with a positive value or punished with a negative value. Reward is averaged over a set number of episodes (complete runs through a problem), as this provides a stable measure of an agent's performance over time.

In this context, we focus on the following sub-research questions:

1. To what extent does increasing problem complexity affect average reward?
2. To what extent does reward shaping affect average reward?

2. Related Work

Substantial research has been conducted into personnel scheduling and combinatorial optimisation more generally.

2.1. Personnel Scheduling Methods

The most common methods—exact, heuristic, and hybrid—have different ways of dealing with the complexity of scheduling problems. The complexity of an NRP is determined by the combination of constraints and parameters [1]. Usually, problems with a large number of possible shift types, a large number of nurses and/or a long planning period are expected to require more computational effort. In their comprehensive review, Burke et al. [2] declared that exact methods "cannot cope with the enormous search spaces that are represented by real problems (at least on their own)". However, innovative modelling techniques have been used to formulate integer linear programs (ILPs) with huge numbers of variables. Branch-and-price—a technique that augments linear relaxation (a method for solving hard ILPs by temporarily relaxing the integer constraints) by dropping a proportion of the constraints and then checking whether the subsequent solution is feasible—was used by [8]. Modelling complex ILPs is challenging, and powerful solvers are needed to run them [9].
The most common form of heuristic optimisation is meta-heuristics, which use strategies inspired by other systems to guide the search process. For example, ant colony optimisation meta-heuristic algorithms build solutions by mimicking the foraging behaviour of ants [10]. Although effective and simpler to implement than exact methods, heuristics need to be revised if the problem statement changes slightly [2]. Furthermore, costly and error-prone feature engineering can introduce biases that do not align with the real world and lead to imprecise decisions.

Methods that hybridise exact methods with heuristic approaches are known as hybrid approaches, and Burke et al. [2] went so far as to say that "some kind of (hybrid) heuristic method offers the only realistic way of tackling such difficult and challenging problems in the foreseeable future."

In general, the models seen in the literature are complex, narrowly applicable to a certain environment, and lack generality, but it is difficult to draw conclusions when there is so much variance between studies. We can say that, by considering each instance of a problem in isolation, these existing methods do not exploit the fact that problem instances often come from a common underlying real-world distribution.

2.2. Machine Learning Methods for CO Problems

For real-world CO problems, it is highly likely that problem instances will share certain characteristics or patterns [6]. Data-dependent machine learning approaches can exploit these patterns, leading to the development of faster solutions for practical cases.

A neural network model was deployed on a CO problem as far back as 1996, when a Hopfield network was used by Mańdziuk [11] to solve instances of the Travelling Salesman Problem. Chen et al. [12] used neural networks to guide a tree search: by representing schedules as matrices, a network could analyse existing solutions
and learn to predict the probability of each node leading to a solution. Václavík et al. [13] use classifiers to discard bad solutions and speed up a subsequent heuristic search.

Reinforcement learning (RL) refers to a computational approach to learning from interaction [9]. In RL, an agent interacts with a Markov Decision Process (MDP) with the goal of maximising some reward generated by the environment. A combination of a state of the environment and an available action, a state-action pair (s, a), has an expected return known as a state-action value function, or Q-value. A policy is a conditional distribution that the agent uses to select actions. The main aim of RL is to find an optimal policy—a conditional distribution used by the agent as a strategy to select actions—such that the state-action value function is optimised.

Broadly, there are two categories of RL algorithm: those which learn by optimising the value function to increase reward at each step, and those which directly optimise the policy to increase total expected reward [14]. Policy-based methods directly model the agent's policy as a parametric function. By collecting previous decisions that the agent made in the environment, we can optimise the parameters of the network to maximise the expected reward of the process [14]:

\theta_{t+1} = \theta_t + \alpha R_t \frac{\nabla \pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta)}

An episode refers to all the states of an environment that come between an initial state and a terminal state, generated from the sequence of actions taken and changes observed in the environment. An episode can be terminated due to success or failure, and, over many episodes, the parameters of a policy network are trained until they achieve good performance on an unseen problem. Stronger convergence guarantees are available for policy-based methods than for value-based methods. One of the main strengths of RL is that agents can learn the rules of an environment without the need for explicit programming or examples of optimal endpoints. A key aim of RL research is to train models that make decisions to solve real problems [15].

Reinforcement learning methods were first applied to CO by Zhang and Dietterich, 1995, with a model designed to solve the NP-hard job-shop problem [16].

A growing body of research uses RL in combination with graph neural networks (GNNs), a type of function approximator designed to operate directly on a graph structure [17]. Whenever there is a set of entities that interact with one another or are related in any way, they can be expressed in the form of a graph [6]. GNNs are structurally similar to convolutional neural networks (CNNs), which are mostly used for image classification. CNNs get their name from the process of convolving the input vectors with a sliding kernel, which reduces the size of an image whilst exploiting its spatial structure. Accounting for the underlying structure of the data introduces an inductive bias which promotes learning and leads to better results [7]; the inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs for new data. GNNs also perform computations over input data in a way that exploits spatial structure, a process known as message passing. Node and edge feature vectors are propagated by an exchange of information (messages) between adjacent nodes. Intuitively, it makes sense that a node might be represented by the average of the connections to its neighbours, and in this way the spatial information of the graph is maintained. It also means that the number of message passing layers determines how far a signal can travel; with 2 layers, the message passing happens twice, so a message travels two hops away. There are many GNN architectures optimised for different functions—node, edge, or graph classification—using features stored at the node, the edge, or both. The promise of GNNs is that the learned vector representation encodes crucial graph structures that help solve a CO problem more efficiently [6].
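To make the two-hop intuition concrete, the following is a minimal illustrative sketch (ours, not taken from any of the cited works) of mean-aggregation message passing; practical GNN layers add learned weight matrices and non-linearities:

```python
import numpy as np

def mean_message_pass(h, adj):
    """One round of message passing: each node becomes the mean of its neighbours.
    h: (n_nodes, dim) node features; adj: (n_nodes, n_nodes) 0/1 adjacency matrix."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # guard against isolated nodes
    return (adj @ h) / deg

# A path graph 0-1-2-3: after one layer, node 0 only "sees" node 1;
# after two layers, information from node 2 has reached node 0 (two hops).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h0 = np.eye(4)                      # one-hot features make the propagation visible
h1 = mean_message_pass(h0, adj)
h2 = mean_message_pass(h1, adj)     # h2[0] now carries mass from node 2's feature
```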
Mirhoseini et al. [18] developed a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Exploiting the graph structure of netlists (a description of the connectivity of an electronic circuit), Mirhoseini et al. [18] used a GNN to compute features of netlist nodes and create a rich representation for learning. They claim that their method is the first that can quickly generalise to previously unseen netlists and produce 'superhuman' results without the need for human experts in the loop. In the opinion of Cappart et al. [6], this study is only scratching the surface of "the innovations that can be enabled from a careful synergy of GNNs and CO".

Kool et al. [15] represented travelling salesman problems as graphs and used an encoder-decoder GNN as a parameterised policy to make decisions on which node to visit next. An encoder transforms input data into embeddings of a specific dimension; the decoder then transforms the embeddings to the required output dimensions. The solution is constructed incrementally, adding one node at a time. The policy network is trained using the REINFORCE algorithm, a policy gradient method; the motivation for this choice was greater efficiency than a value function approach. Kool et al. [15] report significantly improved results over recent learned heuristic approaches to TSP. Their solution worked on problems with up to 100 nodes and generalised to 2 variants of the problem using the same hyper-parameters. They are clear that their goal is not to outperform specialised TSP algorithms, but to show that their approach is flexible enough to generalise to similar problems.

After a thorough search of the literature, no credible study was found that uses an RL algorithm and graph representations for personnel scheduling. The research most relevant to our aims is Kool et al. [15], which has many useful insights.

3. Methodology

Our goal is to train an agent to sequentially assign employees to shifts in an acceptable configuration. A schedule is acceptable if the constraints are not violated. The foundation of this research is a pipeline that ingests a pool and a shift schedule and outputs a trained scheduling agent. We train models on a fixed set of problems in order to generate an agent that is capable of solving general, previously unseen problems. The pipeline was designed to take pool and shift schedule input of any size, adapting the parameters of the environment, graph representation, and policy GNN accordingly.

3.1. Data

At Randstad, employees are grouped into pools which act as a pre-filter for eligibility; anyone assigned to a pool is eligible for shifts assigned to that pool. A pool represents an instance of a scheduling problem; the employees and shifts must be combined in a way that provides acceptable coverage and respects employment law and employee preference. Computational complexity increases with shift count and shift-employee ratio, meaning that Randstad data includes many highly complex scheduling problems. Furthermore, the data contains many features with high cardinality, such as shift length and start time of a shift. To test the viability of an RL approach, simpler problem instances were required.

Inspired by the characteristics of Randstad data, we created synthetic schedule and pool data sets that can be combined to create simplified scheduling problems. The aim is to create problems that test an agent's ability to learn to satisfy a single constraint: the same employee can't work sequential shifts. This constraint has to be learned from the features of the problems.

Synthetic schedules contain a shift id and 2 features: day of week (Monday-Friday) and time of day (morning or evening). Schedule shift count ranges from 3 to 8 and pool employee count ranges from 2 to 8. By combining these schedules and pools, we created a training set of 420 unique problems; full details can be seen in Table 1. Functions were created to manage the problem combinations and check for uniqueness/duplicate problems. Another function randomly generates new, unseen problems of variable shift count and shift-employee ratio, for testing. A sketch of this generation process is given after Table 1.

Table 1
Overview of the scheduling problems used for training the scheduling agent: 420 problems in total, with 6 different shift counts and a range of shift-employee ratios. The number of shifts, and the ratio between shifts and employees, determine the complexity of our scheduling problems: the probability that a correct action can be taken by chance.

Shifts  Problem Count  Avg. Ratio  Min Ratio  Max Ratio
3       70             0.74        0.375      1.5
4       70             0.98        0.500      2.0
5       70             1.23        0.625      2.5
6       70             1.47        0.750      3.0
7       70             1.72        0.875      3.5
8       70             1.96        1.000      4.0
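A minimal sketch of how such a synthetic problem could be generated (the function and column names are ours, and sampling distinct (day, time) slots is one plausible reading of the setup; the paper's exact generator is not published in this text):

```python
import itertools
import random
import pandas as pd

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
TIMES = ["morning", "evening"]

def make_problem(n_shifts, n_employees, rng):
    """One synthetic problem: a schedule csv (shift id + day/time features)
    and a pool csv (employee ids)."""
    slots = rng.sample(list(itertools.product(DAYS, TIMES)), n_shifts)
    schedule = pd.DataFrame(
        [{"shift_id": i, "day": d, "time": t} for i, (d, t) in enumerate(slots)])
    pool = pd.DataFrame({"employee_id": range(n_employees)})
    return schedule, pool

rng = random.Random(42)
# Training problems combine 3-8 shifts with 2-8 employees (Table 1).
schedule, pool = make_problem(n_shifts=rng.randint(3, 8),
                              n_employees=rng.randint(2, 8), rng=rng)
```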
3.2. Pipeline

Figure 1: Pipeline.

- A pool csv containing a list of employee ids, and a schedule csv containing a list of shift ids, are the initial input. The schedule csv also contains features for each shift: day of week and time of day (morning or evening).
- A custom OpenAI Gym RL environment takes the input csvs and combines them into a single matrix.
- The length of an episode is equal to the number of shifts in the schedule csv. At each step in an episode, the matrix is converted to a bipartite graph and passed through the GNN (see the sketch after this list).
- The GNN outputs a policy: a probability distribution over the employee/action space.
- The agent takes actions according to the policy and updates the weights of the policy network based on the loss function.
- Repeat until a terminal state is reached: all shifts are assigned.
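The conversion of the state matrix into a bipartite graph might look like the following sketch, using DGL's heterograph API (the build_graph helper and edge-type names are ours, not the paper's):

```python
import dgl
import torch

def build_graph(shift_feats, assignments):
    """Build the bipartite shift/employee graph for the current state.
    shift_feats: (n_shifts, 5) float tensor of one-hot day/time features.
    assignments: (n_shifts, n_employees) 0/1 tensor of assignments so far."""
    n_shifts, n_employees = assignments.shape
    # Fully connect every employee to every shift, in both directions,
    # so messages can flow between the two node sets.
    emp = torch.arange(n_employees).repeat_interleave(n_shifts)
    shf = torch.arange(n_shifts).repeat(n_employees)
    g = dgl.heterograph({
        ("employee", "works", "shift"): (emp, shf),
        ("shift", "worked_by", "employee"): (shf, emp),
    })
    g.nodes["shift"].data["feat"] = shift_feats
    # Assignment indicators live on the edges, keeping layer shapes
    # independent of the number of employees in the pool (see Section 3.2.2).
    flags = assignments.t().reshape(-1, 1).float()  # matches the (emp, shf) edge order
    g.edges["works"].data["assigned"] = flags
    g.edges["worked_by"].data["assigned"] = flags
    return g

# Example: 4 shifts, 2 employees, nothing assigned yet.
g = build_graph(torch.rand(4, 5), torch.zeros(4, 2))
```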
So that a model can learn to find acceptable solutions to the scheduling problems, we define a reinforcement learning framework specific to this task, similar to Kool et al. [15]'s framework for tackling Travelling Salesman problems. The aim is to optimise the reward received for taking actions (assigning workers) at each step (shift). In this framework, an encoder-decoder graph neural network is used to estimate a probability distribution over the actions available at each state of a problem. The encoder produces embeddings of all input nodes, whilst the decoder outputs log probabilities for each available action. To train this policy model, we used a policy gradient method called REINFORCE.

3.2.1. REINFORCE

REINFORCE was proposed by Williams [19]. We start by initialising a random policy, which the REINFORCE algorithm takes as input. The agent collects the trajectory of an episode from the current policy, storing actions, states and rewards as a history. For each episode we calculate the reward and use it to evaluate the policy. For each action in the episode, the loss is calculated as the probability (according to the policy) of choosing the action, multiplied by the reward received. The loss is then back-propagated through the network to update the policy, increasing the probability that the agent will choose the actions that lead to higher reward in the next episode:

\theta_{t+1} = \theta_t + \alpha R_t \frac{\nabla \pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta)}

The encoder-decoder policy network is implemented using the Deep Graph Library.
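A minimal PyTorch sketch of this training loop, using the standard log-probability form of the REINFORCE loss (the paper's exact implementation may differ in details such as how the reward is attributed to individual actions):

```python
import torch

def reinforce_episode(policy, optimizer, env):
    """Roll out one episode with the current policy, then scale the summed
    negative log-probabilities of the chosen actions by the episode return."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(state)                    # GNN outputs one logit per employee
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    ret = sum(rewards)                            # episodic return R
    loss = -ret * torch.stack(log_probs).sum()    # REINFORCE: maximise R * log pi(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return ret
```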
3.2.2. Policy Network

For the agent's policy, we implemented an encoder-decoder GNN architecture, similar to the model used by Kool et al. [15]. The building blocks of the GNN are from the Deep Graph Library, which supports a range of GNN architectures on top of the PyTorch framework and also provides graph building functions. An example scheduling problem graph representation can be seen in Figure 2.

Figure 2: Example of a graph representation of a scheduling problem with 4 shifts and 2 employees. A bipartite graph has two distinct sets of nodes, with edges connecting each node in one set to each node in the other. These graphs can be used as input to a graph neural network.

The role of graph neural networks is to distill information about the type and connectivity of a node within a large graph into low-dimensional vector representations which can be used in downstream tasks. Our downstream task is to output log probabilities for each action in the action space.

The encoder produces embeddings of all input nodes. To do this, it takes a graph representation of the current state of a scheduling problem as input and uses message passing convolutional layers to update and aggregate the embeddings:

h_i^{(l+1)} = \sigma\left(\sum_{m \in M_i} g_m\left(h_i^{(l)}, h_j^{(l)}\right)\right)

Then, at each time step, the decoder network calculates the inner product of the current shift's node features and the worker edge features, which serves as a compatibility score. The decoder calculates this score for problems of any size. The softmax over these compatibility scores gives the probabilities for the policy.

The number of nodes in the input graph reflects the count of shifts and employees in the scheduling problem. Shift features are consistent (5 days of week and 2 times of day), but the employee features—which indicate whether an employee has been assigned to a specific shift—vary with employee and shift count. The network needs to be able to process graphs of different sizes without being re-initialised with the parameters of the current graph, which would reset the learned weights. To achieve this, we designed the architecture to store the employee feature data in the edges, so that the linear transformation of employee features could be performed at each step. This is illustrated in Figure 3. We use a message passing layer that incorporates edge data in its update and aggregation functions. The design of this layer was proposed by [20]:

h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_r(i)} e_{j,i} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right)

where N_r(i) is the neighbour set of node i w.r.t. relation r, e_{j,i} is the normaliser, \sigma is an activation function, and W_0 is the self-loop weight.

These edge features would allow us to measure important statistics, such as total shifts allocated to each employee, without any hard-coding of problem size. Such statistics could be used to add additional constraints, such as minimum/maximum hours worked per employee.

The network has a fixed learning rate of 1e-3. The encoder uses 2 convolutional layers, 5 × 32 shift embeddings and n × 32 employee embeddings, where n is the number of employee nodes. The decoder uses 32 × 32 embeddings of the features for the current shift and 32 × 32 embeddings for each employee.

Figure 3: This figure shows 2 scheduling problems with different shift counts in matrix and graph form. By using a fixed number of shift features for each problem size (highlighted in yellow), the same linear transformation (5 × 32) can be applied to the feature data of different sized graph representations. The employee features, which indicate an assignment of employee to shift, are stored in the edges; a red edge indicates an assignment. The employee feature data is transformed to n × 32, where n is the number of employee nodes. Indexing starts at 0 (e.g. shift 0) and all features are one-hot encoded. With the exception of employee features, the first value for each feature is dropped to reduce collinearity, in accordance with standard practice.

The process of repeating episodes to train the policy network is managed by a custom OpenAI Gym environment.

3.2.3. OpenAI Gym

OpenAI's Gym library is a popular choice for implementing environments for RL. Its main building block is a Python class, Env, which simulates the environment you want to train your agent in. There are many standard environments available, such as the Atari games used by DeepMind [21], and it supports custom environments. The key components of an environment are the state space, action space and reward function. For the state space of our custom scheduling environment, we combine a pool and a schedule into a matrix. The length of the matrix equals the number of shifts, and the width of the matrix equals the number of shift features plus the number of employees in the pool. For each shift, the agent can assign an employee from the pool, so the action space is equal to the number of employees. The reward function is adjusted as part of the experimental setup, but essentially it rewards the agent when an acceptable schedule is created; or rather, it punishes the agent when constraints are violated. Another key part of the environment is the step function, which takes the agent's action as input and updates the state and reward accordingly.
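A skeleton of such an environment, assuming the classic gym.Env interface (class and attribute names are ours; the reward logic is deferred to Section 4.1):

```python
import gym
import numpy as np
from gym import spaces

class SchedulingEnv(gym.Env):
    """Sketch of the custom scheduling environment described above.
    State: an (n_shifts, n_shift_features + n_employees) matrix; one shift per step."""

    def __init__(self, shift_feats, n_employees):
        self.shift_feats = shift_feats                    # (n_shifts, n_feats) array
        self.n_shifts, self.n_feats = shift_feats.shape
        self.n_employees = n_employees
        self.action_space = spaces.Discrete(n_employees)  # pick one employee per shift
        self.observation_space = spaces.Box(
            0, 1, shape=(self.n_shifts, self.n_feats + n_employees))

    def reset(self):
        self.t = 0                                        # index of the shift to fill
        self.assignments = np.zeros((self.n_shifts, self.n_employees))
        return self._obs()

    def step(self, action):
        self.assignments[self.t, action] = 1
        reward = self._reward(action)                     # see Section 4.1
        self.t += 1
        done = self.t == self.n_shifts                    # terminal: all shifts assigned
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.concatenate([self.shift_feats, self.assignments], axis=1)

    def _reward(self, action):
        raise NotImplementedError  # one of Terminal / Step / Step Bonus (Section 4.1)
```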
4. Experimental Setup

There are many possible variants for the design of this research, such as RL algorithm, training problem complexity, feature engineering, reward function and network architecture, to name a few. We have chosen to focus on reward function and problem complexity. Reward is based on performance with regard to a single constraint: an employee cannot work simultaneous or consecutive shifts. Given time constraints, hyper-parameter tuning was not performed for this study. The code can be viewed on GitHub.

Average reward is used to evaluate a model's performance. Reward is averaged across the previous 100 episodes to better capture an agent's long-term behaviour. In a real-world setting, the constraint we used is considered a hard constraint, meaning that a violation renders the solution unacceptable. A solution could feasibly have a high reward but be unacceptable due to a single constraint violation. We therefore used an additional metric, 'percentage acceptable', to measure the number of acceptable solutions produced by the agent: solutions with a maximum score and therefore no constraint violations.

4.1. Reward Functions

TSPs have an inbuilt cost, tour length (the cumulative distance between points), which the agent in Kool et al. [15] tries to minimise. There are examples of TSP solvers that apply cost (negative reward) incrementally (after each stop)—which has been shown to encourage greedy behaviour—and at terminal states (when the tour is complete) [15]. As there is no implicit reward or cost for scheduling problems, we chose an arbitrary total reward value of 1. Constraint violations result in a negative reward. Constraint violations are calculated based on the features in the schedule data: assigning 2 shifts on the same day to an employee is a constraint violation, and assigning shifts on subsequent days to the same employee is a constraint violation only if the first shift is in the evening and the second is in the morning. The cost of a constraint violation is 1/(number of shifts − 1), because it is not possible to have a violation on the first assignment.

The effect of reward function on performance is tested by comparing the performance of 3 reward functions: 1) calculating reward at the terminal state of an episode (Terminal); 2) calculating reward at every step (Step); and 3) assigning half of the reward stepwise, and the remaining half at the terminal state as a reward for an acceptable solution (Step Bonus).
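A sketch of the violation check and the Step Bonus schedule as we read them (the even per-step split of the positive reward is our assumption; only the 1/(S−1) violation cost and the terminal bonus of 0.5 are stated explicitly):

```python
def violates(assigned, new):
    """Constraint check from Section 4.1. Shifts are (day_index, time) tuples,
    e.g. (0, "evening"). Same day always clashes; adjacent days clash only
    when an evening shift is followed by a morning shift."""
    d2, t2 = new
    for d1, t1 in assigned:
        if d1 == d2:
            return True
        if abs(d1 - d2) == 1:
            first, second = ((d1, t1), (d2, t2)) if d1 < d2 else ((d2, t2), (d1, t1))
            if first[1] == "evening" and second[1] == "morning":
                return True
    return False

def step_bonus_reward(violation, terminal, acceptable, n_shifts):
    """Step Bonus schedule: half the total reward of 1 spread over the steps
    (our assumed split), a -1/(S-1) cost per violation, and a bonus of 0.5 at
    the terminal state if the whole schedule is acceptable."""
    reward = -1.0 / (n_shifts - 1) if violation else 0.5 / n_shifts
    if terminal and acceptable:
        reward += 0.5
    return reward
```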
4.2. Problem Complexity

Problem complexity can be increased in two ways: by the number of shifts, and by the ratio between shifts and employees. More shifts means more decisions for the agent to take, so there is more chance of a constraint violation occurring. The effect of shift-employee ratio is illustrated in Figure 4, which shows a schedule and two pools. Combining the schedule with pool 1 gives a higher ratio than with pool 2, which reduces the number of actions an agent can take without incurring a constraint violation.

Figure 4: The effect of shift-employee ratio on action space. Green squares represent actions that will lead to an acceptable solution.

Models were trained on 420 problems with between 3 and 8 shifts across 10,000 episodes. This means that the agent would see the same problem many times during training.

Models were tested on 500 problems from a range of complexity distributions, as can be seen in Table 2. The agent sees each test problem once only. The test problems have two levels of shift-employee ratio: an 'average' ratio of between 1.2 and 2.8, which reflects the first and third quartiles for shift-employee ratio in the Randstad data, and a 'high' ratio of between 2.9 and 5, to reflect the more complex problems seen in the data.

Table 2
Summary of test set problems showing size, shift-employee ratio and count of unique problems.

Shifts  Ratio Range  Problem Count
3-8     1.2-2.8      50
3-8     2.9-5        50
9-14    1.2-2.8      50
9-14    2.9-5        50
15-18   1.2-2.8      50
15-18   2.9-5        50
19-23   1.2-2.8      50
19-23   2.9-5        50
24-30   1.2-2.8      50
24-30   2.9-5        50

5. Results

Models were trained on 5 different seeds and each model was tested on every problem twice. The results were averaged across both runs for all 5 models. We tested the significance of our results using Welch's t-test, which does not assume equal variance between groups. Results are summarised in Table 3.

Across all experiments, the best performing model was the Step Bonus reward function (mean=0.89, standard deviation=0.22), which returned significantly higher average reward than the Random baseline (m=0.49, sd=0.49), p<.05. Step Bonus also performed significantly better than the other reward functions, Step (m=0.62, sd=0.62), p<.05, and Terminal (m=0.58, sd=0.57), p<.05.

In terms of problem complexity, average reward was significantly higher for Average ratio problems (m=0.74, sd=0.39) than for High ratio problems (m=0.65, sd=0.65), p<.05. This effect was observed for each model. The effect of increasing shift count is less clear. Excluding results from the Random agent, problems with a max shift count of 8 (ms8) (m=0.72, sd=0.36) show significantly higher average reward than problems with max shift counts of 14 (m=0.68, sd=0.68), 18 (m=0.68, sd=0.68) and 23 (m=0.68, sd=0.68). However, there was no significant difference in average reward observed between the smallest problems (ms8) and the largest problems (ms30) (m=0.72, sd=0.41). Furthermore, ms30 showed higher average reward than ms23, ms18, and ms14. These effects were observed across both Average and High ratio problems.

Figure 5 shows the performance of the 3 reward functions against a Random agent baseline on average ratio scheduling problems of increasing shift count. Figure 6 shows the performance of the 3 reward functions against a Random agent baseline on high ratio scheduling problems of increasing shift count.

Figure 5: Performance on problems with Average Shift-Employee ratio. Performance on both metrics is stable across levels of Problem Complexity, even increasing slightly.

Figure 6: Performance on problems with High Shift-Employee ratio. Performance is less stable and significantly lower on high ratio problems.

Table 3
Summary of results across all test problems by model (reward function). In a real-world setting, a hard constraint violation invalidates a solution. % Acceptable indicates the percentage of solutions that had no constraint violations and earned a maximum reward of 1.

Model         Average Reward  % Acceptable
Random agent  0.49            14.00
Step          0.62            51.68
Terminal      0.58            51.56
Step Bonus    0.89            79.60

6. Discussion

Personnel scheduling is a real-world combinatorial optimisation task that requires significant time and expertise to manage effectively. In this paper, we developed a scheduling problem solver based on a framework from Kool et al. [15], which was designed to solve another CO problem, the travelling salesman problem. Compared to previous works, which used problem-specific approaches, this framework is an opportunity for a data-driven approach to personnel scheduling that could be of great practical value. We created a graph representation of each combination of shift schedule and employee pool for use in a GNN. The GNN functions as the policy, using valuable information about each node in the context of the graph to guide the decisions of an agent. The parameters of the network are trained using the REINFORCE algorithm. To measure the ability of the RL scheduler to optimise scheduling problems, we used the average reward and percentage acceptable metrics. We compared performance across 3 different reward functions and 10 different levels of problem difficulty.
The Step Bonus model achieved consistently high performance across all problems, significantly better than the baseline and the other reward functions. Despite the fact that the problems in the training set had a max shift count of 8, the trained Step Bonus agent was able to achieve a maximum score on 77% of High ratio, max shift 30 problems.

With a focus on an academic problem, rather than a real-world application, [15] were able to test their model on several standard sets of travelling salesman problems (plus variants) and compare the performance to other studies that used the same data. There are standard test sets for personnel scheduling problems, but they were out of scope for this research as these problems generally have more complicated constraints than the one we used. A comparison that can be made between this study and [15] is the effect of node count on performance. The cost for TSPs is positively correlated with the number of nodes, so performance appears worse (higher cost) for larger problems. However, [15] observed performance on 100-node TSP problems comparable to state-of-the-art TSP solvers, suggesting that their graph model scaled well to larger problems. A similar case could be made for this research, as we saw no significant difference in performance between the hardest (ms30, High ratio) problems and the easiest (ms8, Average ratio).

Because the size of the reward/cost for our system is negatively correlated with the number of shifts in a problem, violations are punished less harshly for big problems. However, the award of a bonus reward of .5 for an acceptable solution should counteract this.

Another explanation is that shift count and shift-employee ratio are not the best or only ways to measure the complexity of a personnel scheduling problem. The consistently high rewards we observed for the Step Bonus model suggest that the network was able to learn the constraint—no simultaneous or consecutive employee assignments—and that it was no harder to exploit it for a large problem than for a small one. Aside from the bonus reward, a decision made at time-step t doesn't affect t+1. In a real-world scheduling problem this is unlikely to be the case, as there will be additional constraints to consider. The most obvious constraint to add to this research would be one that ensures each worker in the pool is assigned to at least one shift.

If, as expected, adding additional constraints increases complexity to the extent that performance gets worse, there are architecture changes that could be made to make the model more effective. The RL algorithm we have used, REINFORCE, is one of the simplest policy gradient methods. State-of-the-art algorithms could be used, but a simple change worth testing is the use of a baseline, as seen in [15]: "A good baseline reduces gradient variance and therefore increases speed of learning". As Figure 7 shows, Step Bonus appears to converge on a stable reward after only a few hundred episodes, but this could well change if more constraints were added.

Figure 7: Performance on training data. The Step Bonus reward function converges after less than 1000 episodes.

7. Conclusion

The results show that, with the right reward function, an RL agent was able to solve simplified personnel scheduling problems across a range of shift counts and shift-employee ratios. We have shown that an agent can learn the constraints of a combinatorial optimisation problem from data. For future work, additional evaluation using real-world data is of importance to better understand the range of applicability of our approach.

References

[1] S. Den Hartog, et al., On the complexity of nurse scheduling problems, Master thesis (2016). URL: https://studenttheses.uu.nl/handle/20.500.12932/22268.
[2] E. K. Burke, P. De Causmaecker, G. V. Berghe, H. Van Landeghem, The state of the art of nurse rostering, Journal of Scheduling 7 (2004) 441–499. URL: http://link.springer.com/10.1023/B:JOSH.0000046076.75950.0b.
[3] Y. Bengio, A. Lodi, A. Prouvost, Machine learning for combinatorial optimization: a methodological tour d'horizon, European Journal of Operational Research 290 (2021) 405–421. URL: http://arxiv.org/abs/1811.06128.
[4] I. Bello, H. Pham, Q. V. Le, M. Norouzi, S. Bengio, Neural combinatorial optimization with reinforcement learning, 5th International Conference on Learning Representations, ICLR (2017). URL: http://arxiv.org/abs/1611.09940.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A.
Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533. URL: http://www.nature.com/articles/nature14236.
[6] Q. Cappart, D. Chételat, E. B. Khalil, A. Lodi, C. Morris, P. Veličković, Combinatorial optimization and reasoning with graph neural networks, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21 (2021) 4348–4355. URL: https://doi.org/10.24963/ijcai.2021/595.
[7] P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. J. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, R. Pascanu, Relational inductive biases, deep learning, and graph networks, arXiv (2018). URL: https://arxiv.org/pdf/1806.01261.pdf.
[8] A. Legrain, J. Omer, S. Rosat, A rotation-based branch-and-price approach for the nurse scheduling problem, Mathematical Programming Computation 12 (2020) 417–450. URL: https://hal.archives-ouvertes.fr/hal-01545421.
[9] G. Koole, An Introduction to Business Analytics, Lulu.com, 2019.
[10] G. M. Jaradat, A. Al-Badareen, M. Ayob, M. Al-Smadi, I. Al-Marashdeh, M. Ash-Shuqran, E. Al-Odat, Hybrid elitist-ant system for nurse-rostering problem, Journal of King Saud University-Computer and Information Sciences 31 (2019) 378–384. URL: https://linkinghub.elsevier.com/retrieve/pii/S1319157818300363.
[11] J. Mańdziuk, Solving the travelling salesman problem with a hopfield-type neural network, Demonstratio Mathematica 29 (1996) 219–232.
[12] Z. Chen, P.
De Causmaecker, Y. Dou, Deep neural networked assisted tree search for the personnel rostering problem, in: Proceedings of the 13th International Conference on the Practice and Theory of Automated Timetabling-PATAT, volume 2, 2021.
[13] R. Václavík, P. Šůcha, Z. Hanzálek, Roster evaluation based on classifiers for the nurse rostering problem, Journal of Heuristics 22 (2016) 667–697. URL: http://arxiv.org/abs/1804.05002.
[14] Y. Li, Deep reinforcement learning: An overview, arXiv preprint arXiv:1701.07274 (2017). URL: http://arxiv.org/abs/1701.07274.
[15] W. Kool, H. van Hoof, M. Welling, Attention, learn to solve routing problems!, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=ByxBFsRqYm.
[16] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20 (2008) 61–80.
[18] A. Mirhoseini, A. Goldie, M. Yazgan, J. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, S. Bae, et al., Chip placement with deep reinforcement learning, arXiv preprint arXiv:2004.10746 (2020). URL: http://arxiv.org/abs/2004.10746.
[19] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992) 229–256.
[20] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, M. Welling, Modeling relational data with graph convolutional networks, in: European Semantic Web Conference, Springer, 2018, pp. 593–607. URL: http://arxiv.org/abs/1703.06103.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533. URL: http://www.nature.com/articles/nature14236.