<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Recommender Systems for Human Resources, in conjunction with the 16th ACM Conference on Recommender Systems, September 2022</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Personnel Scheduling with Reinforcement Learning and Graph Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Platten</string-name>
          <email>platten.ben@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Macfarlane</string-name>
          <email>m.v.macfarlane@uva.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Graus</string-name>
          <email>david.graus@randstadgroep.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sepideh Mesbah</string-name>
          <email>sepideh.mesbah@randstadgroep.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Randstad Groep Nederland</institution>
          ,
          <addr-line>Diemermere 25, 1112 TC Diemen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universiteit van Amsterdam</institution>
          ,
          <addr-line>Binnengasthuisstraat 9, 1012 ZA Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>23</lpage>
      <abstract>
        <p>This research investigates the extent to which a reinforcement learning agent can learn the heuristics of a combinatorial optimisation (CO) problem, personnel scheduling, from a graph representation. In recent years, reinforcement learning has emerged as an effective data-driven approach to solving CO problems, a subset of mathematical optimisation that appears in many real-world tasks and can be costly to solve in terms of computing power and/or human resources. CO problems often have an underlying graph structure with relational inductive biases that can be exploited by powerful graph neural networks. Across a range of problem complexities, our approach was able to return consistently high average reward, performing significantly better than baselines.</p>
      </abstract>
      <kwd-group>
        <kwd>Neural Networks</kwd>
        <kwd>Combinatorial optimisation</kwd>
        <kwd>Reinforcement learning</kwd>
        <kwd>Graph neural networks</kwd>
        <kwd>Personnel scheduling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Matching personnel with shifts in a way that ensures coverage, legality and employee well-being is a process that becomes exponentially more complicated as scale increases [1]. Randstad is a human resource consulting firm responsible for scheduling over 40,000 employees into more than 70,000 shifts every week. Employees are matched with work from several different industries, such as retail, healthcare, and catering. Staff shortages and increased demand have made this process even more challenging in recent years. Personnel scheduling (PS) as an operations research problem has generally been framed as the nurse rostering problem (NRP) - a reference to the difficult task of assigning nurses to shifts in a 24-7 cycle - and has been studied since the 1960s by researchers from computer science, mathematics, operations research and medicine [2]. Scheduling problems carry hard and soft constraints; a typical hard constraint in NRPs is that there is sufficient staff coverage at all times. A soft constraint could be an employee's preference for particular shifts. At Randstad, personnel scheduling is characterised by high variance between clients (e.g. different retail companies) in terms of problem requirements; planning horizon, shift types, requisite skills and volume of staff and shifts vary significantly.</p>
      <sec id="sec-1-1">
        <title>For example, a single client could require as many as</title>
        <p>nEvelop-O
Human Resources, in conjunction with the 16th ACM Conference on
without violating a weight constraint [3]. If a CO prob- scheduling problems. The complexity of a NRP is
deterlem is modelled as a sequential decision-making process, mined by the combination of constraints and parameters
neural network function approximation can be used to [1]. Usually, problems with a large number of possible
learn features of a problem and use them to optimise the shift types, a large number of nurses and/or a long
plandecisions of an agent [4]. This is the essence of Deep ning period, are expected to require more computational
Reinforcement Learning (RL), a class of algorithms that efort. In their comprehensive review, Burke et al [ 2],
have attained some notoriety in recent years due to eye- declared that exact methods ”cannot cope with the
enorcatching performance on various games [5]. Many CO mous search spaces that are represented by real problems
problems naturally induce an underlying graph structure (at least on their own)”. However, innovative modelling
of nodes and edges [6]. By representing a problem using techniques have been used to formulate integer linear
graph neural networks (GNNs), the learned vector rep- programs (ILPs) with huge numbers of variables.
Branchresentation encodes crucial structures that improve the and-price, a technique that augments linear relaxation
ability of an algorithm to learn from such problems and a method for solving hard ILPs by temporarily relaxing
solve them eficiently [ 7]. CO is at the heart of many real the integer constraints - by dropping a proportion of the
world problems; vehicle routing, microchip design, big constraints and then checking if the subsequent solution
data processing, and numerous forms of scheduling tasks, is feasible, was used by [8]. Modelling complex ILPs is
to name a few. This paper seeks to investigate whether a challenging and powerful solvers are needed to run them
RL agent can learn to solve personnel scheduling prob- [9].
lems represented as graphs. The most common form of heuristic optimisation is</p>
        <p>Research question: To what extent can reinforcement meta-heuristics, which use strategies inspired by other
learning optimise personnel scheduling problems? systems to guide the search process. For example, Ant</p>
        <p>There are many ways to design experiments to an- colony optimisation meta-heuristic algorithms build
soswer this research question. We chose to keep the RL lutions by mimicking the foraging behaviour of ants [10].
algorithm and problem constraints fixed throughout the Although efective and simpler to implement than exact
experiments and instead to focus on problem complexity methods, heuristics need to be revised if the problem
and reward function. Problem complexity can be con- statement changes slightly [2]. Furthermore, costly and
trolled by the number of shifts and also by the ratio be- error-prone feature engineering can introduce biases that
tween shifts and employees, both of which determine the do not align with the real world and lead to imprecise
probability that a correct action can be taken by chance. decisions.</p>
        <p>An increase in problem complexity will therefore lead to Methods that hybridise exact methods with heuristic
increased constraint violations from a random agent. A approaches are known as hybrid approaches and Burke
reward function determines whether an agent’s actions et al. [2] went so far as to say that ”some kind of (hybrid)
are rewarded with a positive value or punished with a heuristic method ofers the only realistic way of tackling
negative value. Reward is averaged over a set number such dificult and challenging problems in the foreseeable
of episodes (complete runs through a problem) as this future.”
provides a stable measure of an agent’s performance over In general, the models seen in the literature are
comtime. plex, narrowly applicable to a certain environment, and</p>
        <p>In this context, we focus on the following sub-research lack generality, but it is dificult draw conclusions when
questions: there is so much variance between studies. We can say
that, by considering each instance of a problem in
isolation, these existing methods do not exploit the fact that
the problem instances often come from a common real
world underlying distribution.
1. To what extent does increasing problem
complex</p>
        <p>ity afect average reward?
2. To what extent does reward shaping afect
aver</p>
        <p>age reward?</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.2. Machine Learning Methods for CO problems</title>
        <p>Substantial research has been conducted into personnel For real world CO problems, it is highly likely that
probscheduling and combinatorial optimisation more gener- lem instances will share certain characteristics or
patally. terns [6]. Data-dependent machine learning approaches
can exploit these patterns, leading to the development of
2.1. Personnel Scheduling Methods faster solutions for practical cases.
A neural network model was deployed on a CO
probThe most common methods - exact, heuristic, and hybrid lem as far back as 1996 when a Hopfield network was
- have diferent ways of dealing with the complexity of used by Mańdziuk [11] to solve instances of the
Travelling Salesman Problem. Chen et al. [12] used neural net- CNNs get their name from the process of convolving the
works to guide a tree search. By representing schedules input vectors with a sliding kernel which reduces the
as matrices, a network could analyse existing solutions size of an image whilst exploiting its spatial structure.
and learn to predict the probability of each node leading Accounting for the underlying structure of the data
into a solution. Václavík et al. [13] use classifiers to discard troduces an inductive bias which promotes learning and
bad solutions and speed up a subsequent heuristic search. leads to better results [7]. The inductive bias of a learning</p>
        <p>Reinforcement learning (RL) refers to a computational algorithm is the set of assumptions that the learner uses
approach to learning from interaction [9]. In RL, an agent to predict outputs for new data. GNNs also perform
cominteracts with a Markov Decision Process (MDP) with putations over input data in a way that exploits spatial
the goal of maximising some reward generated by the en- structure, a process known as message passing. Node
vironment. A combination of a state of the environment and edge feature vectors are propagated by an exchange
and an available action, a state-action pair (s, a), has an of information (messages) between adjacent nodes.
expected return known as a state-action value function, Intuitively, it makes sense that a node might be
repreor Q-value. A policy is a conditional distribution that the sented by the average of the connections to its neighbours
agent uses to select actions. The main aim of RL is to and in this way the spatial information of the graph is
ifnd an optimal policy - a conditional distribution used maintained. It also means that the number of message
by the agent as a strategy to select actions - such that the passing layers determines how far a signal can travel;
state-action value function is optimised. with 2-layers, the message passing happens twice so the</p>
        <p>Broadly, there are two categories of RL algorithm; message travels two hops away. There are many GNN
arthose which learn by optimising the value function to chitectures optimised for diferent functions; node, edge,
increase reward at each step and those which directly or graph classification, using features stored at the node,
optimise the policy to increase total expected reward edge or both. The promise of GNNs is that the learned
[14]. Policy-based methods directly model the agent’s vector representation encodes crucial graph structures
policy as a parametric function. By collecting previous that help solve a CO problem more eficiently [ 6].
decisions that the agent made in the environment we can Mirhoseini et al. [18] developed a learning-based
apoptimise the parameters of the network to maximise the proach to chip placement, one of the most complex and
expected reward of the process [14]. time-consuming stages of the chip design process.
Exploiting the graph structure of netlists (a description of
 +1 =   +    ∇((  | | , , )) t[h18e]cuosnendeactGivNitNy otof acno melpeucttreofneiactucirrecsuoitf),nMetilrishtonsoeidneiseatnadl.</p>
        <p>An episode refers to all the states of an environment create a rich representation for learning. Mirhoseini et al.
that come in between an initial-state and a terminal- [18] claim that their method is the first that can quickly
state, generated from the sequence of actions taken and generalise to previously unseen netlists and produce
’suchanges observed in the environment. An episode can perhuman’ results without the need for human experts
be terminated due to success or failure, and, over many in the loop. In the opinion of Cappart et al. [6], this study
episodes, the parameters of a policy network are trained is only scratching the surface of ”the innovations that
until they can achieve good performance on an unseen can be enabled from a careful synergy of GNNs and CO”.
problem. Stronger convergence guarantees are available Kool et al. [15] represented travelling salesman
probfor policy-based methods than for value-based methods. lems as graphs and used an encoder-decoder GNN as a
One of the main strengths of RL is that agents can learn parameterised policy to make decisions on which node
the rules of an environment without the need for explicit to visit next. An encoder transforms input data into
programming or examples of optimal endpoints. A key embeddings of a specific dimensions. The decoder then
aim of RL research is to train models that make decisions transforms the embeddings to the required output
dimento solve real problems [15] sions. The solution is constructed incrementally, adding</p>
        <p>Reinforcement Learning (RL) methods were first ap- one node at a time. The policy network is trained
usplied to CO by Zhang and Dietterich, 1995, with a model ing the REINFORCE algorithm, a policy gradient method.
designed to solve the NP-Hard job-shop problem [16]. The motivation for this choice was greater eficiency than</p>
        <p>A growing body of research uses RL in combination a value function approach. Kool et al. [15] report
signifwith graph neural networks (GNNs), a type of function icantly improved results over recent learned heuristic
approximator designed to operate directly on a graph approaches to TSP. Their solution worked on problems
structure [17]. Whenever there is a set of entities that with up to 100 nodes and generalised to 2 variants of
interact with one another or are related in any way, they the problem using the same hyper-parameters. They are
can be expressed in the form of a graph [6]. GNNs are clear that their goal is not to outperform specialised TSP
structurally similar to convolutional neural networks algorithms, but to show that their approach is flexible
(CNNs) which are mostly used for image classification. enough to generalise to similar problems.</p>
        <sec id="sec-2-1-1">
          <title>After a thorough search of the literature, there was</title>
          <p>no credible study found to be using an RL algorithm
and graph representations for personnel scheduling. The
research most relevant to our aims is Kool et al. [15]
which has many useful insights.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Our goal is to train an agent to sequentially assign employees to shifts in an acceptable configuration. A schedule is acceptable if the constraints are not violated. The foundation of this research is a pipeline that ingests a pool and a shift schedule and outputs a trained scheduling agent. We train models on a fixed set of problems in order to generate an agent that is capable of solving general, previously unseen problems. The pipeline was designed to take pool and shift schedule input of any size, adapting the parameters of the environment, graph representation, and policy GNN accordingly.</p>
      <p>[Table 1. Overview of the scheduling problems used for training the scheduling agent: 420 problems in total, with 6 different shift counts and a range of shift-employee ratios. The number of shifts, and the ratio between shifts and employees, determine the complexity of our scheduling problems: the probability that a correct action can be taken by chance.]</p>
      <sec id="sec-3-1">
        <title>3.1. Data</title>
        <p>At Randstad, employees are grouped into pools, which act as a pre-filter for eligibility; anyone assigned to a pool is eligible for shifts assigned to that pool. A pool represents an instance of a scheduling problem: the employees and shifts must be combined in a way that provides acceptable coverage and respects employment law and employee preference. Computational complexity increases with shift count and shift-employee ratio, meaning that Randstad data includes many highly complex scheduling problems. Furthermore, the data contains many features with high cardinality, such as shift length and shift start time. To test the viability of an RL approach, simpler problem instances were required.</p>
        <p>Inspired by the characteristics of Randstad data, we created synthetic schedule and pool data sets that can be combined to create simplified scheduling problems. The aim is to create problems that test an agent's ability to learn to satisfy a single constraint: the same employee can't work sequential shifts. This constraint has to be learned from the features of the problems.</p>
        <p>Synthetic schedules contain a shift id and 2 features: day of week (Monday - Friday) and time of day (morning or evening). Schedule shift count ranges from 3 to 8 and pool employee count ranges from 2 to 8. By combining these schedules and pools, we created a training set of 420 unique problems. Full details of the training set can be seen in Table 1. Functions were created to manage the problem combinations and check for uniqueness and duplicate problems. Another function randomly generates new, unseen problems of variable shift count and shift-employee ratio, for testing.</p>
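        <p>A minimal sketch of a generator for problems in this format (the function names and sampling choices are illustrative assumptions, not the authors' code):</p>
        <preformat>
# Illustrative sketch: generate a synthetic scheduling problem consisting of
# a schedule (shift ids with day/time features) and a pool of employee ids.
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
TIMES = ["morning", "evening"]

def make_problem(num_shifts, num_employees, seed=None):
    rng = random.Random(seed)
    schedule = [
        {"shift_id": i,
         "day": rng.choice(DAYS),
         "time": rng.choice(TIMES)}
        for i in range(num_shifts)
    ]
    pool = [f"employee_{j}" for j in range(num_employees)]
    return schedule, pool

# e.g. a problem with 8 shifts and 3 employees (shift-employee ratio ~2.7)
schedule, pool = make_problem(num_shifts=8, num_employees=3, seed=0)
        </preformat>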
      <sec id="sec-3-1">
        <title>3.2. Pipeline</title>
        <sec id="sec-3-1-1">
          <title>So that a model can learn to find acceptable solutions to the scheduling problems, we define a Reinforcement method called REINFORCE.</title>
          <p>3.2.1. REINFORCE</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>REINFORCE was proposed by Williams [19]. We start</title>
          <p>by initialising a random policy which the REINFORCE
algorithm takes as input. The agent collects the
trajectory of an episode from current policy, storing actions,
states and rewards as a history. For each episode we
calculate the reward and use it to evaluate the policy.</p>
          <p>For each action in the episode, the loss is calculated as
the probability (according to the policy) of choosing the
action, multiplied by the reward received. The loss is
then back-propagated through the network to update
the policy, increasing the probability that the agent will
choose the actions that lead to higher reward in the next
episode.</p>
          <p>$$\theta_{t+1} = \theta_t + \alpha \, G_t \, \frac{\nabla_\theta \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}$$</p>
          <p>where $\theta$ are the policy parameters, $\alpha$ is the learning rate and $G_t$ is the return following time step $t$.</p>
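          <p>In code, this update can be sketched as follows, assuming a policy_net that maps a state to action probabilities; this is a minimal illustration, not the authors' implementation:</p>
          <preformat>
# Minimal REINFORCE update sketch in PyTorch. `policy_net` is assumed to map
# a state to a probability distribution over employees.
import torch

def reinforce_update(policy_net, optimizer, episode, gamma=1.0):
    """episode: list of (state, action, reward) tuples from one rollout."""
    returns, g = [], 0.0
    for _, _, r in reversed(episode):          # compute returns G_t backwards
        g = r + gamma * g
        returns.insert(0, g)
    loss = torch.zeros(())
    for (state, action, _), g_t in zip(episode, returns):
        probs = policy_net(state)               # policy pi(a | s, theta)
        loss = loss - torch.log(probs[action]) * g_t   # -log pi * G_t
    optimizer.zero_grad()
    loss.backward()                             # ascend the expected return
    optimizer.step()
          </preformat>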
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Policy Network</title>
          <p>For the agent's policy, we implemented an encoder-decoder GNN architecture, similar to the model used in Kool et al. [15]'s framework for tackling Travelling Salesman Problems. The building blocks of the GNN are from the Deep Graph Library, which supports a range of GNN architectures on top of the PyTorch framework and also provides graph building functions. An example scheduling problem graph representation can be seen in Figure 2. The encoder takes the state of a scheduling problem as input and uses message passing convolutional layers to update and aggregate the embeddings:</p>
          <p>$$h_v^{(l+1)} = \sigma\Big(\sum_{u \in \mathcal{N}(v)} f\big(h_u^{(l)}, h_v^{(l)}\big)\Big)$$</p>
          <p>where $\mathcal{N}(v)$ is the neighbourhood of node $v$, $f$ is a learned message function and $\sigma$ is an activation function.</p>
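          <p>The bipartite graph itself can be assembled with the Deep Graph Library's graph building functions; a minimal sketch, where the node/edge type names and feature sizes are our assumptions rather than the authors' schema:</p>
          <preformat>
# Sketch: a bipartite shift-employee graph in DGL, with assignment status
# stored on the edges (every employee is connected to every shift).
import torch
import dgl

num_shifts, num_employees = 4, 3
src = torch.arange(num_employees).repeat_interleave(num_shifts)  # employee ids
dst = torch.arange(num_shifts).repeat(num_employees)             # shift ids

g = dgl.heterograph({("employee", "works", "shift"): (src, dst)})
g.nodes["shift"].data["feat"] = torch.zeros(num_shifts, 7)       # 5 days + 2 times
g.edges["works"].data["assigned"] = torch.zeros(num_employees * num_shifts, 1)
          </preformat>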
        <sec id="sec-3-1-3">
          <title>Then, at each time step, the decoder network calculates</title>
          <p>the inner product of the current shift’s node features and
the worker edge features, which serves as a compatibility
score. The decoder calculate this score for problems
of any size. The softmax output of these compatibility
scores are the probabilities for the policy.</p>
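          <p>A minimal sketch of this decoding step (tensor shapes and names are our assumptions):</p>
          <preformat>
# Decoder sketch: compatibility scores as inner products between the current
# shift embedding and each employee's embedding, softmaxed into a policy.
import torch
import torch.nn.functional as F

def decode(shift_embedding, employee_embeddings):
    # shift_embedding: (d,); employee_embeddings: (num_employees, d)
    scores = employee_embeddings @ shift_embedding   # one score per employee
    return F.softmax(scores, dim=0)                  # action probabilities
          </preformat>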
          <p>The number of nodes in the input graph reflects the count of shifts and employees in the scheduling problem. Shift features are consistent (5 days of the week and 2 times of day), but the employee features - which indicate whether an employee has been assigned to a specific shift - vary with employee and shift count. The network needs to be able to process graphs of different sizes without being re-initialised with the parameters of the current graph, which would reset the learned weights. To achieve this, we designed the architecture to store the employee feature data in the edges, so that a linear transformation of employee features can be performed at each step. This is illustrated in Figure 3. We use a message passing layer that incorporates edge data in its update and aggregation functions. The design of this layer was proposed by [20]:</p>
          <p>$$h_i^{(l+1)} = \sigma\Big(\sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{c_{i,r}} W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}\Big)$$</p>
          <p>where $\mathcal{N}_i^r$ is the neighbour set of node $i$ w.r.t. relation $r$, $c_{i,r}$ is the normaliser, $\sigma$ is an activation function, and $W_0^{(l)}$ is the self-loop weight.</p>
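          <p>To make the update rule concrete, here is a plain-PyTorch sketch of one such relation-aware layer in the spirit of [20]; DGL provides an equivalent built-in module, and this hand-rolled version and its naming are ours:</p>
          <preformat>
# One relational message passing step:
#   h_i' = sigma( sum_r sum_{j in N_i^r} (1/c_ir) W_r h_j + W_0 h_i )
import torch
import torch.nn as nn

class RelationalLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_relations):
        super().__init__()
        self.w_rel = nn.ModuleList(
            nn.Linear(in_dim, out_dim, bias=False) for _ in range(num_relations))
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)   # W_0

    def forward(self, h, neighbours):
        # h: (num_nodes, in_dim); neighbours[r][i]: neighbour ids of i under r
        self_term = self.w_self(h)
        rows = []
        for i in range(h.shape[0]):
            acc = self_term[i]
            for r, w in enumerate(self.w_rel):
                nbrs = neighbours[r][i]
                if nbrs:                     # mean = normalised sum, c_ir = |N_i^r|
                    acc = acc + w(h[nbrs]).mean(dim=0)
            rows.append(acc)
        return torch.relu(torch.stack(rows))
          </preformat>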
        <sec id="sec-3-1-4">
          <title>These edge features would allow us to measure im</title>
          <p>portant statistics such as total shifts allocated for each
employee without any hard-coding with of problem size.</p>
        </sec>
        <sec id="sec-3-1-5">
          <title>These statistics could be used to add additional constraints such as minimum / maximum hours worked per</title>
        </sec>
        <sec id="sec-3-1-6">
          <title>The network has a fixed learning rate of 1e-3. The</title>
          <p>encoder uses 2 convolutional layers, 5 × 32 shift
embeddings and  × 32 employee embeddings, where  = to the
number of employee nodes. The decoder uses 32 × 32
embeddings of the features for the current shift and 32 × 32
embeddings for each employee.</p>
          <p>For the agent’s policy, we implemented an encoder- employee.
decoder GNN architecture, similar to the model used
by Kool et al. [15]. The building blocks of the GNN are
from the Deep Graph Library, which supports a range
of GNN architectures on top of the PyTorch framework
and also provides graph building functions. An example</p>
        </sec>
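        <p>Below is a hedged sketch of such a custom Env for this scheduling task using the classic Gym API; the state layout follows the description above, but the class name and the simplified violation check are our assumptions:</p>
        <preformat>
# A custom scheduling environment sketch: one employee is assigned per shift,
# and the episode length equals the number of shifts.
import numpy as np
import gym
from gym import spaces

class SchedulingEnv(gym.Env):
    def __init__(self, shift_features, num_employees):
        super().__init__()
        self.shift_features = shift_features            # (num_shifts, num_feats)
        self.num_shifts = shift_features.shape[0]
        self.num_employees = num_employees
        self.action_space = spaces.Discrete(num_employees)
        self.observation_space = spaces.Box(
            low=0.0, high=1.0,
            shape=(self.num_shifts,
                   shift_features.shape[1] + num_employees))

    def reset(self):
        self.t = 0
        self.assignments = np.zeros((self.num_shifts, self.num_employees))
        return self._state()

    def _state(self):
        # State matrix: shift features concatenated with the assignment matrix.
        return np.concatenate([self.shift_features, self.assignments], axis=1)

    def step(self, action):
        self.assignments[self.t, action] = 1.0
        # Simplified stand-in for the consecutive-shift check of Section 4.
        reward = 0.0
        if self.t > 0 and self.assignments[self.t - 1, action] == 1.0:
            reward = -1.0 / (self.num_shifts - 1)
        self.t += 1
        done = self.t == self.num_shifts               # all shifts assigned
        return self._state(), reward, done, {}
        </preformat>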
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <sec id="sec-4-1">
        <title>The process of repeating episodes to train the policy network is managed by a custom Open AI Gym environment.</title>
      </sec>
      <sec id="sec-4-2">
        <title>There are many possible variants for the design of this re</title>
        <p>search, such as RL algorithm, training problem
complex3.2.3. Open AI Gym ity, feature engineering, reward function and network
architecture, to name a few. We have chosen to focus
OpenAI’s Gym library is a popular choice for implement- on reward function and problem complexity. Reward is
ing environments for RL. It’s main building block is a based on performance with regards to a single constraint:
Python class, Env, which simulates the environment you an employee cannot work simultaneous or consecutive
want to train your agent in. There are many standard shifts. Given time constraints, hyper-parameter tuning
environments available, such as the Atari games used by was not performed for this study. The code can be viewed
DeepMind [21], and it supports custom environments. on github.</p>
        <p>The key components of an environment are the state Average reward is used to evaluate a models
perforspace, action space, reward function. For the state space mance. Reward is averaged across the previous 100
of our custom scheduling environment, we combine a episodes to better capture an agent’s long-term
behavpool and a schedule into a matrix. The length of the ior. In a real world setting, the constraint we used is
matrix equals the number of shifts, and width of the ma- considered a hard constraint meaning that a violation
trix equals the number of shift features plus the number renders the solution unacceptable. A solution could
feaof employees in the pool. For each shift, the agent can sibly have a high reward but be unacceptable due to a
assign an employee from the pool, therefore the action single constraint violation. We used an additional metric,
space is equal to the number of employees. The reward ’percentage acceptable’ to measure the number of
acceptfunction is adjusted as part of the experimental setup able solutions produced by the agent: a solution with a
but essentially it rewards the agent when an acceptable maximum score and therefore no constraint violations.
schedule is created. Or rather, it punishes the agent when
constraints are violated. Another key part of the environ- 4.1. Reward Functions
ment is the step function, which takes the agents action
as input and updates the state and reward accordingly.</p>
        <p>TSPs have an inbuilt cost, tour length (the cumulative distance between points), which the agent in Kool et al. [15] tries to minimise. There are examples of TSP solvers that apply cost (negative reward) incrementally (after each stop) - which has been shown to encourage greedy behaviour - and at terminal states (when the tour is complete) [15]. As there is no implicit reward or cost for scheduling problems, we chose an arbitrary total reward value of 1. Constraint violations result in a negative reward. Constraint violations are calculated based on the features in the schedule data: assigning 2 shifts on the same day to an employee is a constraint violation, and assigning shifts on subsequent days to the same employee is a constraint violation only if the first shift is in the evening and the second shift is in the morning. The cost of a constraint violation is 1/(number of shifts - 1), because it is not possible to have a violation on the first assignment.</p>
        <p>The effect of the reward function on performance is tested by comparing the performance of 3 reward functions: 1) calculating reward at the terminal state of an episode (Terminal), 2) calculating reward at every step (Step), and 3) assigning half of the reward stepwise, and the remaining half at the terminal state as a reward for an acceptable solution (Step Bonus).</p>
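        <p>A hedged sketch of how the three schemes could be computed, following the totals described above; the exact bookkeeping is our assumption:</p>
        <preformat>
# Total reward for a perfect episode is 1 and each violation costs
# 1/(num_shifts - 1), per the text; the rest is illustrative.
def episode_rewards(violations, num_shifts, scheme):
    """violations: one boolean per assignment after the first one."""
    unit = 1.0 / (num_shifts - 1)           # cost of a single violation
    n_viol = sum(violations)
    if scheme == "Terminal":                # all reward at the terminal state
        return [0.0] * (len(violations) - 1) + [1.0 - n_viol * unit]
    if scheme == "Step":                    # reward/punish every assignment
        return [-unit if v else unit for v in violations]
    if scheme == "Step Bonus":              # half stepwise, half terminal bonus
        rewards = [0.5 * (-unit if v else unit) for v in violations]
        rewards[-1] += 0.5 if n_viol == 0 else 0.0
        return rewards
        </preformat>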
        <sec id="sec-4-2-1">
          <title>4.2. Problem Complexity</title>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>Problem complexity can be increased in two ways; by</title>
        <p>the number of shifts and also by the ratio between shifts
and employees. More shifts means more decisions for
the agent to take, so there is more chance of a constraint
violation occurring. The efect of shift-employee ratio is
illustrated in 4, which shows a schedule and two pools. Table 2
Combining the schedule with pool 1 has a higher ratio
than for pool 2, which reduces the number of actions an
agent can take without incurring a constraint violation.</p>
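        <p>As a back-of-envelope illustration of this effect - a simplification we introduce, not the paper's model - suppose that at each assignment after the first, exactly one employee is blocked by the previous assignment:</p>
        <preformat>
# A uniform random agent then produces a fully acceptable schedule with
# probability ((E - 1) / E) ** (S - 1), which drops quickly as the
# shift-employee ratio S/E rises. Purely illustrative.
def p_acceptable_random(num_employees, num_shifts):
    return ((num_employees - 1) / num_employees) ** (num_shifts - 1)

print(p_acceptable_random(2, 8))   # high ratio (4.0): ~0.008
print(p_acceptable_random(6, 8))   # lower ratio (1.3): ~0.28
        </preformat>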
        <p>Models were trained on 420 problems with between 3
and 8 shifts across 10,000 episodes. This means that the
agent would see the same problem many times during
training.</p>
        <p>Models were tested on 500 problems from a range of complexity distributions, as can be seen in Table 2, which summarises the test set problems by size, shift-employee ratio and count of unique problems. The agent sees each problem only once.</p>
        <p>The test problems have two levels of shift-employee
ratio: an ’average’ ratio of between 1.2 and 2.8, which
reflects the first and third quartiles for shift-employee
ratio in the Randstad data, and a ’high’ ratio of between
2.9 and 5 to reflect the more complex problems seen in
the data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>Models were trained on 5 diferent seeds and each model</title>
        <p>was tested on every problem twice. The results were
averaged across both runs for all 5 models. We tested
the significance of our results using Welch’s t-test, which
doesn’t assume equal variance between groups. Results
are summarised in Table 3.
Summary of test set problems showing size, shift-employee
ratio and count of unique problems.</p>
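      <p>The test itself is available in SciPy; a minimal sketch with illustrative (not the paper's) reward arrays:</p>
      <preformat>
# Welch's t-test: equal_var=False drops the equal-variance assumption.
from scipy import stats

step_bonus = [0.91, 0.88, 0.95, 0.86, 0.90]    # per-problem average rewards
random_agent = [0.52, 0.41, 0.60, 0.47, 0.45]

t, p = stats.ttest_ind(step_bonus, random_agent, equal_var=False)
      </preformat>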
      <p>Across all experiments, the best performing model used the Step Bonus reward function (mean=0.89, standard deviation=0.22), which returned significantly higher average reward than the Random baseline (m=0.49, sd=0.49), p&lt;.05. Step Bonus also performed significantly better than the other reward functions, Step (m=0.62, sd=0.62), p&lt;.05, and Terminal (m=0.58, sd=0.57), p&lt;.05.</p>
      <p>In terms of problem complexity, average reward was significantly higher for Average ratio problems (m=0.74, sd=0.39) than for High ratio problems (m=0.65, sd=0.65), p&lt;.05. This effect was observed for each model. The effect of increasing shift count is less clear. Excluding results from the Random agent, problems with a max shift count of 8 (ms8) (m=0.72, sd=0.36) show significantly higher average reward than problems with max shift counts of 14 (m=0.68, sd=0.68), 18 (m=0.68, sd=0.68) and 23 (m=0.68, sd=0.68). However, there was no significant difference in average reward observed between the smallest problems (ms8) and the largest problems (ms30) (m=0.72, sd=0.41). Furthermore, ms30 showed higher average reward than ms23, ms18, and ms14. These effects were observed across both Average and High ratio problems.</p>
        <p>Figure 5 shows the performance of the 3 reward
functions against a Random agent baseline on average ratio
scheduling problems of increasing shift count.</p>
        <p>Figure 6 shows the performance of 3 reward functions
against a Random agent baseline on high ratio scheduling
problems of increasing shift count.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>Personnel scheduling is a real-world combinatorial opti</title>
        <p>misation task that requires significant time and expertise
to manage efectively. In this paper, we developed a
scheduling problem solver based on a framework from practical value. We created a graph representation of
Kool et al. [15], which was designed to solve another CO each combination of shift schedule and employee pool
problem, the travelling salesman problem. Compared to for use in a GNN. The GNN functions as the policy, using
previous works, which used problem-specific approaches, valuable information about the node in the context of
this framework is an opportunity for a data-driven ap- the graph to guide the decisions of an agent. The
paramproach to personnel scheduling that could be of great eters of the network are trained using the REINFORCE
Model
Random agent
Step
Terminal
Step Bonus</p>
        <p>Average Reward</p>
        <p>% Acceptable
0.49
0.62
0.58
0.89
14.00
51.68
51.56
79.60
algorithm. To measure the ability of the RL scheduler methods. State of the art algorithms could be used, but a
to optimise scheduling problems, we used average re- simple change worth testing is the use of a baseline, as
ward and percentage acceptable metrics. We compared seen in [15]: ”A good baseline reduces gradient variance
performance across 3 diferent reward functions and 10 and therefore increases speed of learning”. As Figure
diferent levels of problem dificulty. 7 shows, Step Bonus appears to converge on a stable</p>
        <p>The Step Bonus model achieved consistently high per- reward after only a few hundred episodes, but this could
formance across all problems, significantly better than well change if more constraints were added.
the baseline and the other reward functions. Despite the
fact that the problems in training set had a max shift
count of 8, the trained Step Bonus agent was able to 7. Conclusion
able achieve maximum score on 77% of High ratio, max
shift 30 problems. The results show that, with the right reward function, an</p>
        <p>With a focus on an academic problem, rather than a RL agent was able to solve simplified personnel
schedulreal world application, [15] were able to test their model ing problems across a range of shift counts and
shifton several standard sets of travelling salesman problems employee ratios. We have shown that an agent can learn
(plus variants) and compare the performance to other the constraints of a combinatorial optimisation problem
studies that used the same data. There are standard test from data. For future work, additional evaluation using
sets for personnel scheduling problems but they were real world data is of importance to better understand the
out of scope for this research as these problems generally range of applicability of our approach.
have more complicated constraints than the one we used.</p>
        <p>A comparison that can be made between this study and References
[15], is the efect of node count on performance. The
cost for TSPs is positively correlated to the number of [1] S. Den Hartog, et al., On the complexity
nodes, so performance appears worse (higher cost) for of nurse scheduling problems, Master thesis
larger problems. However, [15] observed performance (2016). URL: https://studenttheses.uu.nl/handle/20.
on 100 node TSP problems comparable to state of the 500.12932/22268.
art TSP solvers, suggesting that their graph model scaled [2] E. K. Burke, P. De Causmaecker, G. V. Berghe,
well to larger problems. A similar case could be made H. Van Landeghem, The state of the art of
for this research, as we saw no significant diference nurse rostering, Journal of scheduling 7 (2004)
in performance between the hardest (ms30, High ratio) 441–499. URL: http://link.springer.com/10.1023/B:
problems and the easiest (ms8, average ratio). JOSH.0000046076.75950.0b.</p>
        <p>Because the size of the reward / cost for our system is [3] Y. Bengio, A. Lodi, A. Prouvost, Machine learning
negatively correlated to the number of shifts in a prob- for combinatorial optimization: a methodological
lem, violations are punished less harshly for big problems. tour d’horizon, European Journal of Operational
However, the award of a bonus reward of .5 for an ac- Research 290 (2021) 405–421. URL: http://arxiv.org/
ceptable solution should counteract this. abs/1811.06128.</p>
        <p>Another explanation is that shift count and shift- [4] I. Bello, H. Pham, Q. V. Le, M. Norouzi, S.
Benemployee ratio are not the best or only way to mea- gio, Neural combinatorial optimization with
resure the complexity of a personnel scheduling problem. inforcement learning, 5th International Conference
The consistently high rewards we observed for the Step on Learning Representations, ICLR (2017). URL:
Bonus model suggest that the network was able to learn http://arxiv.org/abs/1611.09940.
the constraint - no simultaneous or consecutive employee [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu,
assignments - and that it was no harder to exploit it for J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller,
a large problem than a small one. Aside from the bonus A. K. Fidjeland, G. Ostrovski, et al., Human-level
reward, a decision made at time-step  doesn’t efect  + 1 control through deep reinforcement learning,
na. In a real world scheduling problem, this is unlikely to ture 518 (2015) 529–533. URL: http://www.nature.
be the case as there will be additional constraints to con- com/articles/nature14236.
sider. The most obvious constraint to add to this research [6] Q. Cappart, D. Chételat, E. B. Khalil, A. Lodi, C.
Morwould be one to ensure that each worker in the pool is ris, P. Veličković, Combinatorial optimization and
assigned to at least one shift. reasoning with graph neural networks, Proceedings</p>
        <p>If, as expected, adding additional constraints increases of the Thirtieth International Joint Conference on
complexity to an extent that performance is gets worse, Artificial Intelligence, IJCAI-21 (2021) 4348–4355.
there are architecture changes that could be made make URL: https://doi.org/10.24963/ijcai.2021/595.
the model more efective. The RL algorithm we have used, [7] P. Battaglia, J. B. C. Hamrick, V. Bapst, A. Sanchez,
REINFORCE, is one of the most simple policy gradient V. Zambaldi, M. Malinowski, A. Tacchetti, D.
Ra</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>poso</surname>
            , A. Santoro,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Faulkner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Song</surname>
          </string-name>
          , arXiv preprint arXiv:
          <volume>1701</volume>
          .07274 (
          <year>2017</year>
          ). URL: http:
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Ballard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gilmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Dahl</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Vaswani, //arxiv.org/abs/1701.07274.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. J.</given-names>
            <surname>Langston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Heess</surname>
          </string-name>
          , [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kool</surname>
          </string-name>
          , H. van Hoof,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          , Attention,
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kohli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Botvinick</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.</surname>
          </string-name>
          <article-title>Vinyals, learn to solve routing problems!</article-title>
          , in: 7th Inter-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>learning</surname>
          </string-name>
          , and graph networks,
          <source>arXiv</source>
          (
          <year>2018</year>
          ).
          <source>URL: ICLR</source>
          <year>2019</year>
          ,
          <article-title>New Orleans</article-title>
          , LA, USA, May 6-
          <issue>9</issue>
          ,
          <year>2019</year>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          https://arxiv.org/pdf/
          <year>1806</year>
          .01261.pdf. OpenReview.net,
          <year>2019</year>
          . URL: https://openreview. [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Legrain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Omer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rosat</surname>
          </string-name>
          , A rotation- net/forum?id=ByxBFsRqYm.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>based branch-and-price approach for the nurse</article-title>
          [16]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          , Reinforcement learning:
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <article-title>scheduling problem, Mathematical Programming An introduction (</article-title>
          <year>2018</year>
          )
          <fpage>352</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Computation</source>
          <volume>12</volume>
          (
          <year>2020</year>
          )
          <fpage>417</fpage>
          -
          <lpage>450</lpage>
          . URL: https://hal. [17]
          <string-name>
            <given-names>F.</given-names>
            <surname>Scarselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Tsoi</surname>
          </string-name>
          , M. Hagenbuchner,
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>archives-ouvertes</article-title>
          .fr/hal-01545421. G. Monfardini,
          <article-title>The graph neural network model</article-title>
          , [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Koole</surname>
          </string-name>
          , An Introduction to Business Analytics,
          <source>IEEE transactions on neural networks 20</source>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          Lulu. com,
          <year>2019</year>
          .
          <fpage>61</fpage>
          -
          <lpage>80</lpage>
          . [10]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Jaradat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Badareen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ayob</surname>
          </string-name>
          , M. Al- [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirhoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazgan</surname>
          </string-name>
          , J. Jiang,
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          University-Computer and
          <source>Information Sciences</source>
          <volume>31</volume>
          arXiv:
          <year>2004</year>
          .
          <volume>10746</volume>
          (
          <year>2020</year>
          ). URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          (
          <year>2019</year>
          )
          <fpage>378</fpage>
          -
          <lpage>384</lpage>
          . URL: https://linkinghub.elsevier.
          <year>2004</year>
          .
          <volume>10746</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          com/retrieve/pii/S1319157818300363. [19]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Williams</surname>
          </string-name>
          , Simple statistical gradient-following [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mańdziuk</surname>
          </string-name>
          ,
          <article-title>Solving the travelling salesman prob- algorithms for connectionist reinforcement learn-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>lem with a hopfield-type neural network, Demon- ing, Machine learning 8 (</article-title>
          <year>1992</year>
          )
          <fpage>229</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>stratio Mathematica</source>
          <volume>29</volume>
          (
          <year>1996</year>
          )
          <fpage>219</fpage>
          -
          <lpage>232</lpage>
          . [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Kipf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bloem</surname>
          </string-name>
          , R. v. d. Berg, [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , P. De Causmaecker,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dou</surname>
          </string-name>
          , Deep neural
          <string-name>
            <given-names>I.</given-names>
            <surname>Titov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Welling</surname>
          </string-name>
          ,
          <article-title>Modeling relational data with</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>rostering problem</article-title>
          ,
          <source>in: Proceedings of the 13th In- mantic web conference</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>593</fpage>
          -
          <lpage>607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>ternational Conference on the Practice and Theory URL</source>
          : http://arxiv.org/abs/1703.06103.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>of Automated Timetabling-PATAT</source>
          , volume
          <volume>2</volume>
          ,
          <year>2021</year>
          . [21]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Rusu</surname>
          </string-name>
          , [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Václavík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Šcha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hanzálek</surname>
          </string-name>
          , Roster evaluation
          <string-name>
            <given-names>J.</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Bellemare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          , M. Riedmiller,
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>Journal of Heuristics</source>
          <volume>22</volume>
          (
          <year>2016</year>
          )
          <fpage>667</fpage>
          -
          <lpage>697</lpage>
          . URL:
          <article-title>http: control through deep reinforcement learning</article-title>
          , na-
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          //arxiv.org/abs/
          <year>1804</year>
          .05002. ture 518 (
          <year>2015</year>
          )
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          . URL: http://www.nature. [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning: An overview</article-title>
          , com/articles/nature14236.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>