Automated Personnel Scheduling with Reinforcement Learning and Graph Neural Networks

Benjamin Platten¹,², Matthew Macfarlane¹, David Graus² and Sepideh Mesbah²
¹ Universiteit van Amsterdam, Binnengasthuisstraat 9, 1012 ZA Amsterdam, The Netherlands
² Randstad Groep Nederland, Diemermere 25, 1112 TC Diemen, The Netherlands

Abstract
This research investigates the extent to which a reinforcement learning agent can learn the heuristics of a combinatorial optimisation (CO) problem, personnel scheduling, from a graph representation. In recent years, reinforcement learning has emerged as an effective data-driven approach to solving CO problems, a subset of mathematical optimisation problems that appear in many real-world tasks and can be costly to solve in terms of computing power and/or human resources. CO problems often have an underlying graph structure with relational inductive biases that can be exploited by powerful graph neural networks. Across a range of problem complexities, our approach was able to return consistently high average reward, performing significantly better than baselines.

Keywords
Combinatorial optimisation, Reinforcement learning, Graph neural networks, Personnel scheduling

RecSys in HR'22: The 2nd Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 18–23, 2022, Seattle, USA.
platten.ben@gmail.com (B. Platten); m.v.macfarlane@uva.nl (M. Macfarlane); david.graus@randstadgroep.nl (D. Graus); sepideh.mesbah@randstadgroep.nl (S. Mesbah)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Matching personnel with shifts in a way that ensures coverage, legality and employee well-being is a process that becomes exponentially more complicated as scale increases [1]. Randstad is a human resource consulting firm responsible for scheduling over 40,000 employees into more than 70,000 shifts every week. Employees are matched with work from several different industries such as retail, healthcare, and catering. Staff shortages and increased demand have made this process even more challenging in recent years. Personnel scheduling (PS) as an operations research problem has generally been known as the Nurse Rostering Problem (NRP) [2]—a reference to the difficult task of assigning nurses to shifts in a 24-7 cycle—and has been studied since the 1960s by researchers from computer science, mathematics, operations research and medicine [2]. Scheduling problems carry hard and soft constraints; a typical hard constraint in NRPs is that there is sufficient staff coverage at all times, while a soft constraint could be an employee's preference not to work on weekends [2]. For Randstad, personnel scheduling is characterised by high variance between clients (e.g. different retail companies) in terms of problem requirements; planning horizon, shift types, requisite skills and volume of staff and shifts vary significantly. For example, a single client could require as many as 2,000 shifts to be scheduled in one week, or as few as 2; the average is closer to 80 shifts per week per client. In this context, there is a need for an automated technique that helps clients and planners fill shifts efficiently by recommending personnel for given shifts while respecting the constraints.

Researchers have developed linear models that find optimal solutions for scheduling problems by exploring the search space of possible solutions [2]. Known as exact methods, these algorithms simulate a decision tree in which each node is a possible state of the scheduling problem and the solutions are situated at the leaves. Due to the enormous and complex search spaces of modern scheduling problems, exact methods alone are not viable. Furthermore, an 'optimal' solution is usually not the goal; administrators want to quickly generate a high-quality schedule that satisfies all hard constraints and as many of a wide range of soft constraints as possible. The other broad class of solutions is heuristic methods, which use higher-level knowledge of a problem to take shortcuts through the search space.
Reliance on domain expertise and hand-crafted features means that these methods are susceptible to changes in the problem formulation. Neither of these approaches generalises well to new problems [2].

Personnel scheduling can be framed as a combinatorial optimisation problem, and there is a growing body of research using machine learning techniques to solve such problems [3]. CO is the process of searching for an optimal solution amongst a finite set of possibilities. Classic problems include the Travelling Salesman Problem (TSP)—where the objective is to find the shortest route through a list of cities—and the knapsack problem, where the objective is to maximise the value of objects in a knapsack without violating a weight constraint [3]. If a CO problem is modelled as a sequential decision-making process, neural network function approximation can be used to learn features of a problem and use them to optimise the decisions of an agent [4]. This is the essence of Deep Reinforcement Learning (RL), a class of algorithms that have attained some notoriety in recent years due to eye-catching performance on various games [5]. Many CO problems naturally induce an underlying graph structure of nodes and edges [6]. By representing a problem using graph neural networks (GNNs), the learned vector representation encodes crucial structures that improve the ability of an algorithm to learn from such problems and solve them efficiently [7]. CO is at the heart of many real-world problems: vehicle routing, microchip design, big data processing, and numerous forms of scheduling tasks, to name a few. This paper investigates whether an RL agent can learn to solve personnel scheduling problems represented as graphs.

Research question: To what extent can reinforcement learning optimise personnel scheduling problems?

There are many ways to design experiments to answer this research question. We chose to keep the RL algorithm and problem constraints fixed throughout the experiments and instead to focus on problem complexity and reward function. Problem complexity can be controlled by the number of shifts and also by the ratio between shifts and employees, both of which determine the probability that a correct action can be taken by chance. An increase in problem complexity will therefore lead to increased constraint violations from a random agent. A reward function determines whether an agent's actions are rewarded with a positive value or punished with a negative value. Reward is averaged over a set number of episodes (complete runs through a problem), as this provides a stable measure of an agent's performance over time.

In this context, we focus on the following sub-research questions:

1. To what extent does increasing problem complexity affect average reward?
2. To what extent does reward shaping affect average reward?

2. Related Work

Substantial research has been conducted into personnel scheduling and combinatorial optimisation more generally.

2.1. Personnel Scheduling Methods

The most common methods—exact, heuristic, and hybrid—have different ways of dealing with the complexity of scheduling problems. The complexity of an NRP is determined by the combination of constraints and parameters [1]. Usually, problems with a large number of possible shift types, a large number of nurses and/or a long planning period are expected to require more computational effort. In their comprehensive review, Burke et al. [2] declared that exact methods "cannot cope with the enormous search spaces that are represented by real problems (at least on their own)". However, innovative modelling techniques have been used to formulate integer linear programs (ILPs) with huge numbers of variables. Branch-and-price—a technique that augments linear relaxation (a method for solving hard ILPs by temporarily relaxing the integer constraints) by dropping a proportion of the constraints and then checking whether the subsequent solution is feasible—was used by [8]. Modelling complex ILPs is challenging, and powerful solvers are needed to run them [9].
The most common form of heuristic optimisation is meta-heuristics, which use strategies inspired by other systems to guide the search process. For example, ant colony optimisation meta-heuristic algorithms build solutions by mimicking the foraging behaviour of ants [10]. Although effective and simpler to implement than exact methods, heuristics need to be revised if the problem statement changes slightly [2]. Furthermore, costly and error-prone feature engineering can introduce biases that do not align with the real world and lead to imprecise decisions.

Methods that hybridise exact methods with heuristic approaches are known as hybrid approaches, and Burke et al. [2] went so far as to say that "some kind of (hybrid) heuristic method offers the only realistic way of tackling such difficult and challenging problems in the foreseeable future."

In general, the models seen in the literature are complex, narrowly applicable to a certain environment, and lack generality, but it is difficult to draw conclusions when there is so much variance between studies. We can say that, by considering each instance of a problem in isolation, these existing methods do not exploit the fact that problem instances often come from a common underlying real-world distribution.

2.2. Machine Learning Methods for CO Problems

For real-world CO problems, it is highly likely that problem instances will share certain characteristics or patterns [6]. Data-dependent machine learning approaches can exploit these patterns, leading to the development of faster solutions for practical cases.

A neural network model was deployed on a CO problem as far back as 1996, when a Hopfield network was used by Mańdziuk [11] to solve instances of the Travelling Salesman Problem. Chen et al. [12] used neural networks to guide a tree search: by representing schedules as matrices, a network could analyse existing solutions
and learn to predict the probability of each node leading to a solution. Václavík et al. [13] use classifiers to discard bad solutions and speed up a subsequent heuristic search.

Reinforcement learning (RL) refers to a computational approach to learning from interaction [9]. In RL, an agent interacts with a Markov Decision Process (MDP) with the goal of maximising some reward generated by the environment. A combination of a state of the environment and an available action, a state-action pair (s, a), has an expected return known as a state-action value function, or Q-value. A policy is a conditional distribution that the agent uses to select actions. The main aim of RL is to find an optimal policy—a conditional distribution used by the agent as a strategy to select actions—such that the state-action value function is optimised.

Broadly, there are two categories of RL algorithm: those which learn by optimising the value function to increase reward at each step, and those which directly optimise the policy to increase total expected reward [14]. Policy-based methods directly model the agent's policy as a parametric function. By collecting previous decisions that the agent made in the environment, we can optimise the parameters of the network to maximise the expected reward of the process [14]:

\theta_{t+1} = \theta_t + \alpha R_t \frac{\nabla \pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta)}

An episode refers to all the states of an environment that come between an initial state and a terminal state, generated from the sequence of actions taken and changes observed in the environment. An episode can be terminated due to success or failure, and, over many episodes, the parameters of a policy network are trained until they achieve good performance on an unseen problem. Stronger convergence guarantees are available for policy-based methods than for value-based methods. One of the main strengths of RL is that agents can learn the rules of an environment without the need for explicit programming or examples of optimal endpoints. A key aim of RL research is to train models that make decisions to solve real problems [15].

Reinforcement learning methods were first applied to CO by Zhang and Dietterich, 1995, with a model designed to solve the NP-hard job-shop problem [16].

A growing body of research uses RL in combination with graph neural networks (GNNs), a type of function approximator designed to operate directly on a graph structure [17]. Whenever there is a set of entities that interact with one another or are related in any way, they can be expressed in the form of a graph [6]. GNNs are structurally similar to convolutional neural networks (CNNs), which are mostly used for image classification. CNNs get their name from the process of convolving the input vectors with a sliding kernel, which reduces the size of an image whilst exploiting its spatial structure. Accounting for the underlying structure of the data introduces an inductive bias which promotes learning and leads to better results [7]; the inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs for new data. GNNs also perform computations over input data in a way that exploits spatial structure, a process known as message passing. Node and edge feature vectors are propagated by an exchange of information (messages) between adjacent nodes. Intuitively, it makes sense that a node might be represented by the average of the connections to its neighbours, and in this way the spatial information of the graph is maintained. It also means that the number of message passing layers determines how far a signal can travel; with 2 layers, the message passing happens twice, so a message travels two hops away. There are many GNN architectures optimised for different functions—node, edge, or graph classification—using features stored at the node, the edge, or both. The promise of GNNs is that the learned vector representation encodes crucial graph structures that help solve a CO problem more efficiently [6].
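To make the two-hop intuition concrete, the following is a minimal illustrative sketch (ours, not taken from any of the cited works) of mean-aggregation message passing; practical GNN layers add learned weight matrices and non-linearities:

```python
import numpy as np

def mean_message_pass(h, adj):
    """One round of message passing: each node becomes the mean of its neighbours.
    h: (n_nodes, dim) node features; adj: (n_nodes, n_nodes) 0/1 adjacency matrix."""
    deg = adj.sum(axis=1, keepdims=True).clip(min=1)  # guard against isolated nodes
    return (adj @ h) / deg

# A path graph 0-1-2-3: after one layer, node 0 only "sees" node 1;
# after two layers, information from node 2 has reached node 0 (two hops).
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
h0 = np.eye(4)                      # one-hot features make the propagation visible
h1 = mean_message_pass(h0, adj)
h2 = mean_message_pass(h1, adj)     # h2[0] now carries mass from node 2's feature
```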
Mirhoseini et al. [18] developed a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Exploiting the graph structure of netlists (a description of the connectivity of an electronic circuit), Mirhoseini et al. [18] used a GNN to compute features of netlist nodes and create a rich representation for learning. They claim that their method is the first that can quickly generalise to previously unseen netlists and produce 'superhuman' results without the need for human experts in the loop. In the opinion of Cappart et al. [6], this study is only scratching the surface of "the innovations that can be enabled from a careful synergy of GNNs and CO".

Kool et al. [15] represented travelling salesman problems as graphs and used an encoder-decoder GNN as a parameterised policy to make decisions on which node to visit next. An encoder transforms input data into embeddings of a specific dimension; the decoder then transforms the embeddings to the required output dimensions. The solution is constructed incrementally, adding one node at a time. The policy network is trained using the REINFORCE algorithm, a policy gradient method; the motivation for this choice was greater efficiency than a value function approach. Kool et al. [15] report significantly improved results over recent learned heuristic approaches to TSP. Their solution worked on problems with up to 100 nodes and generalised to 2 variants of the problem using the same hyper-parameters. They are clear that their goal is not to outperform specialised TSP algorithms, but to show that their approach is flexible enough to generalise to similar problems.

After a thorough search of the literature, no credible study was found that uses an RL algorithm and graph representations for personnel scheduling. The research most relevant to our aims is Kool et al. [15], which has many useful insights.

3. Methodology

Our goal is to train an agent to sequentially assign employees to shifts in an acceptable configuration. A schedule is acceptable if the constraints are not violated. The foundation of this research is a pipeline that ingests a pool and a shift schedule and outputs a trained scheduling agent. We train models on a fixed set of problems in order to generate an agent that is capable of solving general, previously unseen problems. The pipeline was designed to take pool and shift schedule input of any size, adapting the parameters of the environment, graph representation, and policy GNN accordingly.

3.1. Data

At Randstad, employees are grouped into pools which act as a pre-filter for eligibility; anyone assigned to a pool is eligible for shifts assigned to that pool. A pool represents an instance of a scheduling problem; the employees and shifts must be combined in a way that provides acceptable coverage and respects employment law and employee preference. Computational complexity increases with shift count and shift-employee ratio, meaning that Randstad data includes many highly complex scheduling problems. Furthermore, the data contains many features with high cardinality, such as shift length and start time of a shift. To test the viability of an RL approach, simpler problem instances were required.

Inspired by the characteristics of Randstad data, we created synthetic schedule and pool data sets that can be combined to create simplified scheduling problems. The aim is to create problems that test an agent's ability to learn to satisfy a single constraint: the same employee can't work sequential shifts. This constraint has to be learned from the features of the problems.

Synthetic schedules contain a shift id and 2 features: day of week (Monday-Friday) and time of day (morning or evening). Schedule shift count ranges from 3 to 8 and pool employee count ranges from 2 to 8. By combining these schedules and pools, we created a training set of 420 unique problems; full details can be seen in Table 1. Functions were created to manage the problem combinations and check for uniqueness/duplicate problems. Another function randomly generates new, unseen problems of variable shift count and shift-employee ratio, for testing. A sketch of this generation process is given after Table 1.

Table 1
Overview of the scheduling problems used for training the scheduling agent: 420 problems in total, with 6 different shift counts and a range of shift-employee ratios. The number of shifts, and the ratio between shifts and employees, determine the complexity of our scheduling problems: the probability that a correct action can be taken by chance.

Shifts  Problem Count  Avg. Ratio  Min Ratio  Max Ratio
3       70             0.74        0.375      1.5
4       70             0.98        0.500      2.0
5       70             1.23        0.625      2.5
6       70             1.47        0.750      3.0
7       70             1.72        0.875      3.5
8       70             1.96        1.000      4.0
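A minimal sketch of how such a synthetic problem could be generated (the function and column names are ours, and sampling distinct (day, time) slots is one plausible reading of the setup; the paper's exact generator is not published in this text):

```python
import itertools
import random
import pandas as pd

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
TIMES = ["morning", "evening"]

def make_problem(n_shifts, n_employees, rng):
    """One synthetic problem: a schedule csv (shift id + day/time features)
    and a pool csv (employee ids)."""
    slots = rng.sample(list(itertools.product(DAYS, TIMES)), n_shifts)
    schedule = pd.DataFrame(
        [{"shift_id": i, "day": d, "time": t} for i, (d, t) in enumerate(slots)])
    pool = pd.DataFrame({"employee_id": range(n_employees)})
    return schedule, pool

rng = random.Random(42)
# Training problems combine 3-8 shifts with 2-8 employees (Table 1).
schedule, pool = make_problem(n_shifts=rng.randint(3, 8),
                              n_employees=rng.randint(2, 8), rng=rng)
```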
3.2. Pipeline

Figure 1: Pipeline.

- A pool csv containing a list of employee ids, and a schedule csv containing a list of shift ids, are the initial input. The schedule csv also contains features for each shift: day of week and time of day (morning or evening).
- A custom OpenAI Gym RL environment takes the input csvs and combines them into a single matrix.
- The length of an episode is equal to the number of shifts in the schedule csv. At each step in an episode, the matrix is converted to a bipartite graph and passed through the GNN (see the sketch after this list).
- The GNN outputs a policy: a probability distribution over the employee/action space.
- The agent takes actions according to the policy and updates the weights of the policy network based on the loss function.
- Repeat until a terminal state is reached: all shifts are assigned.
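The conversion of the state matrix into a bipartite graph might look like the following sketch, using DGL's heterograph API (the build_graph helper and edge-type names are ours, not the paper's):

```python
import dgl
import torch

def build_graph(shift_feats, assignments):
    """Build the bipartite shift/employee graph for the current state.
    shift_feats: (n_shifts, 5) float tensor of one-hot day/time features.
    assignments: (n_shifts, n_employees) 0/1 tensor of assignments so far."""
    n_shifts, n_employees = assignments.shape
    # Fully connect every employee to every shift, in both directions,
    # so messages can flow between the two node sets.
    emp = torch.arange(n_employees).repeat_interleave(n_shifts)
    shf = torch.arange(n_shifts).repeat(n_employees)
    g = dgl.heterograph({
        ("employee", "works", "shift"): (emp, shf),
        ("shift", "worked_by", "employee"): (shf, emp),
    })
    g.nodes["shift"].data["feat"] = shift_feats
    # Assignment indicators live on the edges, keeping layer shapes
    # independent of the number of employees in the pool (see Section 3.2.2).
    flags = assignments.t().reshape(-1, 1).float()  # matches the (emp, shf) edge order
    g.edges["works"].data["assigned"] = flags
    g.edges["worked_by"].data["assigned"] = flags
    return g

# Example: 4 shifts, 2 employees, nothing assigned yet.
g = build_graph(torch.rand(4, 5), torch.zeros(4, 2))
```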
So that a model can learn to find acceptable solutions to the scheduling problems, we define a reinforcement learning framework specific to this task, similar to Kool et al. [15]'s framework for tackling Travelling Salesman problems. The aim is to optimise the reward received for taking actions (assigning workers) at each step (shift). In this framework, an encoder-decoder graph neural network is used to estimate a probability distribution over the actions available at each state of a problem. The encoder produces embeddings of all input nodes, whilst the decoder outputs log probabilities for each available action. To train this policy model, we used a policy gradient method called REINFORCE.

3.2.1. REINFORCE

REINFORCE was proposed by Williams [19]. We start by initialising a random policy, which the REINFORCE algorithm takes as input. The agent collects the trajectory of an episode from the current policy, storing actions, states and rewards as a history. For each episode we calculate the reward and use it to evaluate the policy. For each action in the episode, the loss is calculated as the probability (according to the policy) of choosing the action, multiplied by the reward received. The loss is then back-propagated through the network to update the policy, increasing the probability that the agent will choose the actions that lead to higher reward in the next episode:

\theta_{t+1} = \theta_t + \alpha R_t \frac{\nabla \pi(a_t \mid s_t, \theta)}{\pi(a_t \mid s_t, \theta)}

The encoder-decoder policy network is implemented using the Deep Graph Library.
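A minimal PyTorch sketch of this training loop, using the standard log-probability form of the REINFORCE loss (the paper's exact implementation may differ in details such as how the reward is attributed to individual actions):

```python
import torch

def reinforce_episode(policy, optimizer, env):
    """Roll out one episode with the current policy, then scale the summed
    negative log-probabilities of the chosen actions by the episode return."""
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        logits = policy(state)                    # GNN outputs one logit per employee
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done, _ = env.step(action.item())
        rewards.append(reward)
    ret = sum(rewards)                            # episodic return R
    loss = -ret * torch.stack(log_probs).sum()    # REINFORCE: maximise R * log pi(a|s)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return ret
```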
3.2.2. Policy Network

For the agent's policy, we implemented an encoder-decoder GNN architecture, similar to the model used by Kool et al. [15]. The building blocks of the GNN are from the Deep Graph Library, which supports a range of GNN architectures on top of the PyTorch framework and also provides graph building functions. An example scheduling problem graph representation can be seen in Figure 2.

Figure 2: Example of a graph representation of a scheduling problem with 4 shifts and 2 employees. A bipartite graph has two distinct sets of nodes, with edges connecting each node in one set to each node in the other. These graphs can be used as input to a graph neural network.

The role of graph neural networks is to distill information about the type and connectivity of a node within a large graph into low-dimensional vector representations which can be used in downstream tasks. Our downstream task is to output log probabilities for each action in the action space.

The encoder produces embeddings of all input nodes. To do this, it takes a graph representation of the current state of a scheduling problem as input and uses message passing convolutional layers to update and aggregate the embeddings:

h_i^{(l+1)} = \sigma\left(\sum_{m \in M_i} g_m\left(h_i^{(l)}, h_j^{(l)}\right)\right)

Then, at each time step, the decoder network calculates the inner product of the current shift's node features and the worker edge features, which serves as a compatibility score. The decoder calculates this score for problems of any size. The softmax over these compatibility scores gives the probabilities for the policy.

The number of nodes in the input graph reflects the count of shifts and employees in the scheduling problem. Shift features are consistent (5 days of week and 2 times of day), but the employee features—which indicate whether an employee has been assigned to a specific shift—vary with employee and shift count. The network needs to be able to process graphs of different sizes without being re-initialised with the parameters of the current graph, which would reset the learned weights. To achieve this, we designed the architecture to store the employee feature data in the edges, so that the linear transformation of employee features could be performed at each step. This is illustrated in Figure 3. We use a message passing layer that incorporates edge data in its update and aggregation functions. The design of this layer was proposed by [20]:

h_i^{(l+1)} = \sigma\left(\sum_{r \in R} \sum_{j \in N_r(i)} e_{j,i} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\right)

where N_r(i) is the neighbour set of node i w.r.t. relation r, e_{j,i} is the normaliser, \sigma is an activation function, and W_0 is the self-loop weight.

These edge features would allow us to measure important statistics, such as total shifts allocated to each employee, without any hard-coding of problem size. Such statistics could be used to add additional constraints, such as minimum/maximum hours worked per employee.

The network has a fixed learning rate of 1e-3. The encoder uses 2 convolutional layers, 5 × 32 shift embeddings and n × 32 employee embeddings, where n is the number of employee nodes. The decoder uses 32 × 32 embeddings of the features for the current shift and 32 × 32 embeddings for each employee.

Figure 3: This figure shows 2 scheduling problems with different shift counts in matrix and graph form. By using a fixed number of shift features for each problem size (highlighted in yellow), the same linear transformation (5 × 32) can be applied to the feature data of different sized graph representations. The employee features, which indicate an assignment of employee to shift, are stored in the edges; a red edge indicates an assignment. The employee feature data is transformed to n × 32, where n is the number of employee nodes. Indexing starts at 0 (e.g. shift 0) and all features are one-hot encoded. With the exception of employee features, the first value for each feature is dropped to reduce collinearity, in accordance with standard practice.

The process of repeating episodes to train the policy network is managed by a custom OpenAI Gym environment.

3.2.3. OpenAI Gym

OpenAI's Gym library is a popular choice for implementing environments for RL. Its main building block is a Python class, Env, which simulates the environment you want to train your agent in. There are many standard environments available, such as the Atari games used by DeepMind [21], and it supports custom environments. The key components of an environment are the state space, action space and reward function. For the state space of our custom scheduling environment, we combine a pool and a schedule into a matrix. The length of the matrix equals the number of shifts, and the width of the matrix equals the number of shift features plus the number of employees in the pool. For each shift, the agent can assign an employee from the pool, so the action space is equal to the number of employees. The reward function is adjusted as part of the experimental setup, but essentially it rewards the agent when an acceptable schedule is created; or rather, it punishes the agent when constraints are violated. Another key part of the environment is the step function, which takes the agent's action as input and updates the state and reward accordingly.
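A skeleton of such an environment, assuming the classic gym.Env interface (class and attribute names are ours; the reward logic is deferred to Section 4.1):

```python
import gym
import numpy as np
from gym import spaces

class SchedulingEnv(gym.Env):
    """Sketch of the custom scheduling environment described above.
    State: an (n_shifts, n_shift_features + n_employees) matrix; one shift per step."""

    def __init__(self, shift_feats, n_employees):
        self.shift_feats = shift_feats                    # (n_shifts, n_feats) array
        self.n_shifts, self.n_feats = shift_feats.shape
        self.n_employees = n_employees
        self.action_space = spaces.Discrete(n_employees)  # pick one employee per shift
        self.observation_space = spaces.Box(
            0, 1, shape=(self.n_shifts, self.n_feats + n_employees))

    def reset(self):
        self.t = 0                                        # index of the shift to fill
        self.assignments = np.zeros((self.n_shifts, self.n_employees))
        return self._obs()

    def step(self, action):
        self.assignments[self.t, action] = 1
        reward = self._reward(action)                     # see Section 4.1
        self.t += 1
        done = self.t == self.n_shifts                    # terminal: all shifts assigned
        return self._obs(), reward, done, {}

    def _obs(self):
        return np.concatenate([self.shift_feats, self.assignments], axis=1)

    def _reward(self, action):
        raise NotImplementedError  # one of Terminal / Step / Step Bonus (Section 4.1)
```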
4. Experimental Setup

There are many possible variants for the design of this research, such as RL algorithm, training problem complexity, feature engineering, reward function and network architecture, to name a few. We have chosen to focus on reward function and problem complexity. Reward is based on performance with regard to a single constraint: an employee cannot work simultaneous or consecutive shifts. Given time constraints, hyper-parameter tuning was not performed for this study. The code can be viewed on GitHub.

Average reward is used to evaluate a model's performance. Reward is averaged across the previous 100 episodes to better capture an agent's long-term behaviour. In a real-world setting, the constraint we used is considered a hard constraint, meaning that a violation renders the solution unacceptable. A solution could feasibly have a high reward but be unacceptable due to a single constraint violation. We therefore used an additional metric, 'percentage acceptable', to measure the number of acceptable solutions produced by the agent: solutions with a maximum score and therefore no constraint violations.

4.1. Reward Functions

TSPs have an inbuilt cost, tour length (the cumulative distance between points), which the agent in Kool et al. [15] tries to minimise. There are examples of TSP solvers that apply cost (negative reward) incrementally (after each stop)—which has been shown to encourage greedy behaviour—and at terminal states (when the tour is complete) [15]. As there is no implicit reward or cost for scheduling problems, we chose an arbitrary total reward value of 1. Constraint violations result in a negative reward. Constraint violations are calculated based on the features in the schedule data: assigning 2 shifts on the same day to an employee is a constraint violation, and assigning shifts on subsequent days to the same employee is a constraint violation only if the first shift is in the evening and the second is in the morning. The cost of a constraint violation is 1/(number of shifts − 1), because it is not possible to have a violation on the first assignment.

The effect of reward function on performance is tested by comparing the performance of 3 reward functions: 1) calculating reward at the terminal state of an episode (Terminal); 2) calculating reward at every step (Step); and 3) assigning half of the reward stepwise, and the remaining half at the terminal state as a reward for an acceptable solution (Step Bonus).
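A sketch of the violation check and the Step Bonus schedule as we read them (the even per-step split of the positive reward is our assumption; only the 1/(S−1) violation cost and the terminal bonus of 0.5 are stated explicitly):

```python
def violates(assigned, new):
    """Constraint check from Section 4.1. Shifts are (day_index, time) tuples,
    e.g. (0, "evening"). Same day always clashes; adjacent days clash only
    when an evening shift is followed by a morning shift."""
    d2, t2 = new
    for d1, t1 in assigned:
        if d1 == d2:
            return True
        if abs(d1 - d2) == 1:
            first, second = ((d1, t1), (d2, t2)) if d1 < d2 else ((d2, t2), (d1, t1))
            if first[1] == "evening" and second[1] == "morning":
                return True
    return False

def step_bonus_reward(violation, terminal, acceptable, n_shifts):
    """Step Bonus schedule: half the total reward of 1 spread over the steps
    (our assumed split), a -1/(S-1) cost per violation, and a bonus of 0.5 at
    the terminal state if the whole schedule is acceptable."""
    reward = -1.0 / (n_shifts - 1) if violation else 0.5 / n_shifts
    if terminal and acceptable:
        reward += 0.5
    return reward
```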
4.2. Problem Complexity

Problem complexity can be increased in two ways: by the number of shifts, and by the ratio between shifts and employees. More shifts means more decisions for the agent to take, so there is more chance of a constraint violation occurring. The effect of shift-employee ratio is illustrated in Figure 4, which shows a schedule and two pools. Combining the schedule with pool 1 gives a higher ratio than with pool 2, which reduces the number of actions an agent can take without incurring a constraint violation.

Figure 4: The effect of shift-employee ratio on action space. Green squares represent actions that will lead to an acceptable solution.

Models were trained on 420 problems with between 3 and 8 shifts across 10,000 episodes. This means that the agent would see the same problem many times during training.

Models were tested on 500 problems from a range of complexity distributions, as can be seen in Table 2. The agent sees each test problem once only. The test problems have two levels of shift-employee ratio: an 'average' ratio of between 1.2 and 2.8, which reflects the first and third quartiles for shift-employee ratio in the Randstad data, and a 'high' ratio of between 2.9 and 5, to reflect the more complex problems seen in the data.

Table 2
Summary of test set problems showing size, shift-employee ratio and count of unique problems.

Shifts  Ratio Range  Problem Count
3-8     1.2-2.8      50
3-8     2.9-5        50
9-14    1.2-2.8      50
9-14    2.9-5        50
15-18   1.2-2.8      50
15-18   2.9-5        50
19-23   1.2-2.8      50
19-23   2.9-5        50
24-30   1.2-2.8      50
24-30   2.9-5        50

5. Results

Models were trained on 5 different seeds and each model was tested on every problem twice. The results were averaged across both runs for all 5 models. We tested the significance of our results using Welch's t-test, which does not assume equal variance between groups. Results are summarised in Table 3.

Across all experiments, the best performing model was the Step Bonus reward function (mean=0.89, standard deviation=0.22), which returned significantly higher average reward than the Random baseline (m=0.49, sd=0.49), p<.05. Step Bonus also performed significantly better than the other reward functions, Step (m=0.62, sd=0.62), p<.05, and Terminal (m=0.58, sd=0.57), p<.05.

In terms of problem complexity, average reward was significantly higher for Average ratio problems (m=0.74, sd=0.39) than for High ratio problems (m=0.65, sd=0.65), p<.05. This effect was observed for each model. The effect of increasing shift count is less clear. Excluding results from the Random agent, problems with a max shift count of 8 (ms8) (m=0.72, sd=0.36) show significantly higher average reward than problems with max shift counts of 14 (m=0.68, sd=0.68), 18 (m=0.68, sd=0.68) and 23 (m=0.68, sd=0.68). However, there was no significant difference in average reward observed between the smallest problems (ms8) and the largest problems (ms30) (m=0.72, sd=0.41). Furthermore, ms30 showed higher average reward than ms23, ms18, and ms14. These effects were observed across both Average and High ratio problems.

Figure 5 shows the performance of the 3 reward functions against a Random agent baseline on average ratio scheduling problems of increasing shift count. Figure 6 shows the performance of the 3 reward functions against a Random agent baseline on high ratio scheduling problems of increasing shift count.

Figure 5: Performance on problems with Average Shift-Employee ratio. Performance on both metrics is stable across levels of Problem Complexity, even increasing slightly.

Figure 6: Performance on problems with High Shift-Employee ratio. Performance is less stable and significantly lower on high ratio problems.

Table 3
Summary of results across all test problems by model (reward function). In a real-world setting, a hard constraint violation invalidates a solution. % Acceptable indicates the percentage of solutions that had no constraint violations and earned a maximum reward of 1.

Model         Average Reward  % Acceptable
Random agent  0.49            14.00
Step          0.62            51.68
Terminal      0.58            51.56
Step Bonus    0.89            79.60

6. Discussion

Personnel scheduling is a real-world combinatorial optimisation task that requires significant time and expertise to manage effectively. In this paper, we developed a scheduling problem solver based on a framework from Kool et al. [15], which was designed to solve another CO problem, the travelling salesman problem. Compared to previous works, which used problem-specific approaches, this framework is an opportunity for a data-driven approach to personnel scheduling that could be of great practical value. We created a graph representation of each combination of shift schedule and employee pool for use in a GNN. The GNN functions as the policy, using valuable information about each node in the context of the graph to guide the decisions of an agent. The parameters of the network are trained using the REINFORCE algorithm. To measure the ability of the RL scheduler to optimise scheduling problems, we used the average reward and percentage acceptable metrics. We compared performance across 3 different reward functions and 10 different levels of problem difficulty.
The Step Bonus model achieved consistently high performance across all problems, significantly better than the baseline and the other reward functions. Despite the fact that the problems in the training set had a max shift count of 8, the trained Step Bonus agent was able to achieve a maximum score on 77% of High ratio, max shift 30 problems.

With a focus on an academic problem, rather than a real-world application, [15] were able to test their model on several standard sets of travelling salesman problems (plus variants) and compare the performance to other studies that used the same data. There are standard test sets for personnel scheduling problems, but they were out of scope for this research as these problems generally have more complicated constraints than the one we used. A comparison that can be made between this study and [15] is the effect of node count on performance. The cost for TSPs is positively correlated with the number of nodes, so performance appears worse (higher cost) for larger problems. However, [15] observed performance on 100-node TSP problems comparable to state-of-the-art TSP solvers, suggesting that their graph model scaled well to larger problems. A similar case could be made for this research, as we saw no significant difference in performance between the hardest (ms30, High ratio) problems and the easiest (ms8, Average ratio).

Because the size of the reward/cost for our system is negatively correlated with the number of shifts in a problem, violations are punished less harshly for big problems. However, the award of a bonus reward of .5 for an acceptable solution should counteract this.

Another explanation is that shift count and shift-employee ratio are not the best or only ways to measure the complexity of a personnel scheduling problem. The consistently high rewards we observed for the Step Bonus model suggest that the network was able to learn the constraint—no simultaneous or consecutive employee assignments—and that it was no harder to exploit it for a large problem than for a small one. Aside from the bonus reward, a decision made at time-step t doesn't affect t+1. In a real-world scheduling problem this is unlikely to be the case, as there will be additional constraints to consider. The most obvious constraint to add to this research would be one that ensures each worker in the pool is assigned to at least one shift.

If, as expected, adding additional constraints increases complexity to the extent that performance gets worse, there are architecture changes that could be made to make the model more effective. The RL algorithm we have used, REINFORCE, is one of the simplest policy gradient methods. State-of-the-art algorithms could be used, but a simple change worth testing is the use of a baseline, as seen in [15]: "A good baseline reduces gradient variance and therefore increases speed of learning". As Figure 7 shows, Step Bonus appears to converge on a stable reward after only a few hundred episodes, but this could well change if more constraints were added.

Figure 7: Performance on training data. The Step Bonus reward function converges after less than 1000 episodes.

7. Conclusion

The results show that, with the right reward function, an RL agent was able to solve simplified personnel scheduling problems across a range of shift counts and shift-employee ratios. We have shown that an agent can learn the constraints of a combinatorial optimisation problem from data. For future work, additional evaluation using real-world data is of importance to better understand the range of applicability of our approach.

References

[1] S. Den Hartog, et al., On the complexity of nurse scheduling problems, Master thesis (2016). URL: https://studenttheses.uu.nl/handle/20.500.12932/22268.
[2] E. K. Burke, P. De Causmaecker, G. V. Berghe, H. Van Landeghem, The state of the art of nurse rostering, Journal of Scheduling 7 (2004) 441–499. URL: http://link.springer.com/10.1023/B:JOSH.0000046076.75950.0b.
[3] Y. Bengio, A. Lodi, A. Prouvost, Machine learning for combinatorial optimization: a methodological tour d'horizon, European Journal of Operational Research 290 (2021) 405–421. URL: http://arxiv.org/abs/1811.06128.
[4] I. Bello, H. Pham, Q. V. Le, M. Norouzi, S. Bengio, Neural combinatorial optimization with reinforcement learning, 5th International Conference on Learning Representations, ICLR (2017). URL: http://arxiv.org/abs/1611.09940.
[5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A.
Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533. URL: http://www.nature.com/articles/nature14236.
[6] Q. Cappart, D. Chételat, E. B. Khalil, A. Lodi, C. Morris, P. Veličković, Combinatorial optimization and reasoning with graph neural networks, Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21 (2021) 4348–4355. URL: https://doi.org/10.24963/ijcai.2021/595.
[7] P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, C. Gulcehre, F. Song, A. Ballard, J. Gilmer, G. E. Dahl, A. Vaswani, K. Allen, C. Nash, V. J. Langston, C. Dyer, N. Heess, D. Wierstra, P. Kohli, M. Botvinick, O. Vinyals, Y. Li, R. Pascanu, Relational inductive biases, deep learning, and graph networks, arXiv (2018). URL: https://arxiv.org/pdf/1806.01261.pdf.
[8] A. Legrain, J. Omer, S. Rosat, A rotation-based branch-and-price approach for the nurse scheduling problem, Mathematical Programming Computation 12 (2020) 417–450. URL: https://hal.archives-ouvertes.fr/hal-01545421.
[9] G. Koole, An Introduction to Business Analytics, Lulu.com, 2019.
[10] G. M. Jaradat, A. Al-Badareen, M. Ayob, M. Al-Smadi, I. Al-Marashdeh, M. Ash-Shuqran, E. Al-Odat, Hybrid elitist-ant system for nurse-rostering problem, Journal of King Saud University-Computer and Information Sciences 31 (2019) 378–384. URL: https://linkinghub.elsevier.com/retrieve/pii/S1319157818300363.
[11] J. Mańdziuk, Solving the travelling salesman problem with a hopfield-type neural network, Demonstratio Mathematica 29 (1996) 219–232.
[12] Z. Chen, P.
De Causmaecker, Y. Dou, Deep neural networked assisted tree search for the personnel rostering problem, in: Proceedings of the 13th International Conference on the Practice and Theory of Automated Timetabling-PATAT, volume 2, 2021.
[13] R. Václavík, P. Šůcha, Z. Hanzálek, Roster evaluation based on classifiers for the nurse rostering problem, Journal of Heuristics 22 (2016) 667–697. URL: http://arxiv.org/abs/1804.05002.
[14] Y. Li, Deep reinforcement learning: An overview, arXiv preprint arXiv:1701.07274 (2017). URL: http://arxiv.org/abs/1701.07274.
[15] W. Kool, H. van Hoof, M. Welling, Attention, learn to solve routing problems!, in: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, OpenReview.net, 2019. URL: https://openreview.net/forum?id=ByxBFsRqYm.
[16] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, MIT Press, 2018.
[17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network model, IEEE Transactions on Neural Networks 20 (2008) 61–80.
[18] A. Mirhoseini, A. Goldie, M. Yazgan, J. Jiang, E. Songhori, S. Wang, Y.-J. Lee, E. Johnson, O. Pathak, S. Bae, et al., Chip placement with deep reinforcement learning, arXiv preprint arXiv:2004.10746 (2020). URL: http://arxiv.org/abs/2004.10746.
[19] R. J. Williams, Simple statistical gradient-following algorithms for connectionist reinforcement learning, Machine Learning 8 (1992) 229–256.
[20] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. v. d. Berg, I. Titov, M. Welling, Modeling relational data with graph convolutional networks, in: European Semantic Web Conference, Springer, 2018, pp. 593–607. URL: http://arxiv.org/abs/1703.06103.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., Human-level control through deep reinforcement learning, Nature 518 (2015) 529–533. URL: http://www.nature.com/articles/nature14236.