Demand-Responsive Zone Generation for Real-Time Vehicle Rebalancing in Ride-Sharing Fleets

Alberto Castagna¹ and Maxime Guériau¹ and Giuseppe Vizzari² and Ivana Dusparic¹

¹ School of Computer Science and Statistics, Trinity College Dublin, Ireland, emails: acastagn@tcd.ie, maxime.gueriau@scss.tcd.ie, ivana.dusparic@scss.tcd.ie
² University of Milano - Bicocca, Italy, email: giuseppe.vizzari@unimib.it

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. Enabling Ride-Sharing (RS) in existing Mobility-on-Demand (MoD) systems makes it possible to reduce the operating vehicle fleet size while achieving a similar level of service. This, however, requires an efficient assignment of vehicles to multiple requests, which is the focus of most RS-related research, as well as an adaptive fleet rebalancing strategy, which counteracts the uneven geographical spread of demand and relocates unoccupied vehicles to areas of higher demand. Existing research into rebalancing generally divides the system coverage area into predefined geographical zones; however, this is done statically at design time and can limit adaptivity to evolving demand patterns. To enable dynamic, and therefore more accurate, rebalancing, this paper proposes a Dynamic Demand-Responsive Rebalancer (D2R2) for RS systems. D2R2 uses Expectation-Maximization (EM) clustering to determine relocation zones at runtime. It re-calculates the zones at each decision step and assigns them relative probabilities based on current demand. We demonstrate the use of D2R2 by integrating it with a Deep Reinforcement Learning multi-agent RS-enabled MoD system in a fleet of 200 vehicle agents serving 10,000 trips extracted from New York taxi trip data. Results show a fairer workload division across the fleet, without loss of performance with respect to waiting time and distribution of passengers per vehicle, when compared to baselines with no rebalancing and with static pre-defined equiprobable zones.

1 INTRODUCTION

Mobility-on-Demand (MoD) systems are gaining popularity over privately owned vehicles and public transportation due to reduced prices and shorter overall journey times [7]. Recent work suggests that RS-enabled MoD systems can achieve a similar level of service with fewer vehicles by better optimizing: (i) the assignment of vehicles to multiple requests [2] and (ii) the rebalancing of empty vehicles to match real-time demand [24].

Vehicle-to-request matching has been widely investigated. Offline methods relying on constraint solving [3, 4] can design an optimized plan that is then executed by the vehicles. Online methods match vehicles to requests dynamically and have so far been addressed using constraint solving [2] and agent-based models [1, 8, 9, 25].

However, fleet rebalancing in MoD systems has been investigated less, despite having been shown to have a strong impact on the level of service of RS-enabled systems [24]. As depicted in Figure 1, created by extracting requests from New York Taxi data [21] for two different periods, mobility demand changes over time and the distribution of requests is uneven. This can lead to an unbalanced fleet distribution in RS-enabled MoD systems [2], as illustrated in Figure 2, where most of the demand is concentrated in the top area while the majority of vehicles are located on the opposite side after finishing their last trips, where fewer new customers are requesting a ride.

Figure 1: Observed demand imbalance in New York Taxi dataset [21] trips between morning (7-10am) and evening (6-9pm) peak hours in the south part of Manhattan on Tuesday, February 2nd 2016

Figure 2: Example of an unbalanced fleet distribution where demand (requests in red) location differs from available supply (vehicles)
Adaptively following (or even anticipating) changes in the spatial patterns of demand can improve the perceived level of service from the perspective of MoD system users. It can also, assuming a joined fleet in which human drivers can participate with their own vehicles, improve the system's capability to consistently enable drivers to meet the actual demand, optimizing vehicle usage and increasing the return on investment (ROI), assuming that participation in the RS system implies a subscription cost (due to the setup and operating costs of the system).

To address this issue, existing work defines rebalancing strategies that relocate vehicles according to past or current demand. Rebalancing can be achieved using predefined locations per vehicle (station-based relocation) or by defining a set of areas (also called zones) to which each vehicle can be sent. Rebalancing approaches can rely on a static [9, 1, 8, 25, 24, 19] or a dynamic [15] partition of the network to split it into zones. Rebalancing vehicles using a dynamic partition is expected to better track changes in mobility demand caused, e.g., by temporary network disruptions, special events concentrating demand temporarily, and long-term city developments affecting observed patterns.

In this paper, we propose a Dynamic Demand-Responsive Rebalancer (D2R2) which, using the Expectation-Maximization (EM) clustering technique, generates a dynamic set of rebalancing zones and computes a relocation probability per zone from the current demand every time a vehicle needs to relocate. The novelty of our approach is twofold: first, the boundaries and the number of relocation areas are computed when required; second, the rebalancing probability for each zone is allocated dynamically using only unserved-request data available in real time. Therefore, D2R2 requires neither data collection nor a learning phase, enabling our proposal to operate from the start while being responsive to current demand.

We evaluate D2R2 in an implementation of a Multi-Agent Reinforcement Learning (MARL) ride-sharing-enabled MoD system in which 200 vehicle agents serve 10,000 ride requests in the lower Manhattan road network. Requests have been generated from the open NYC taxi dataset [21] to be representative of real demand patterns. Results show that coupling D2R2 with ride-sharing enhances performance from a single-vehicle perspective and improves the overall balance of the distribution of requests across the fleet. At the cost of a few more kilometres travelled empty for rebalancing, the performance at the fleet level confirms the overall efficiency of our demand-responsive rebalancing strategy.
2 RELATED WORK

Rebalancing for MoD can be categorized into approaches relying on static rebalancing zones [1, 8, 9, 24, 25] and on dynamic zones [15]. In static rebalancing zone generation, the geographical coverage of the relocation zones is predefined at design time. For example, in [9], the NYC Lower Manhattan area is divided into predefined zones which do not change over time. Each vehicle, using RL, learns and decides at each time step whether to relocate to one of the neighbouring zones or to stay in its current zone. In [8], rebalancing areas in Austin, Texas, are defined by partitioning the area into 2-mile by 2-mile square blocks. A block balance is calculated for each zone, capturing the excess or deficit of vehicles within the block in relation to supply and expected travel demand, where expected travel demand is estimated from historical data and current requests. Blocks with a negative balance try to attract vehicles from neighbouring blocks where there is a surplus. In [1], zones are defined using a fine-grained but also static grid. In [25], if a vehicle is idling, it can rebalance based on its local knowledge: according to the demand distribution in the surrounding areas, it decides whether or not to rebalance to a neighbouring zone. The work presented in [24] partitions the area into rebalancing zones according to the road network layout. Zones are defined such that for each region r_i there exists a zone from which r_i can be reached within the established maximum travel time. Idle vehicles are rebalanced taking into account travel time, to limit empty travel, and future demand, estimated from current demand, to avoid an excess of vehicles in the same area. However, the zones, once defined at the start, do not change based on traffic or fleet conditions. The majority of the approaches allow rebalancing only for idle (and empty) vehicles, while [1] and [9] mix rebalancing with RS assignment and allow RS pick-ups from neighbouring zones, shortening waiting times for passengers but increasing their travel times.

The only current work that uses dynamic zone generation for rebalancing is presented in [15]; rebalancing zones are computed using a clustering algorithm. K-means clustering is applied to virtual requests generated from a distribution defined on historical data. Therefore, the coverage and the size of the zones can change, but the total number of zones remains fixed. This approach is the closest related work; however, we allow a different number of zones depending on the density of requests. Furthermore, our approach relies on real-time rather than historical data, to respond better to dynamic demand.

Table 1 summarizes existing work on rebalancing for MoD systems, categorized by four main characteristics: Analysed data, which can be historical, real-time or both, depending on what kind of data rebalancing is based on; Dynamic # zones, i.e., whether the number of rebalancing zones can change over time; RB only empty, whether vehicles relocate only when empty or can also relocate as part of an RS assignment; and Dynamic boundaries, whether the area covered by each rebalancing zone can adapt dynamically.

Table 1: Characteristics of existing rebalancing algorithms

                            Analysed data            Dynamic # zones   RB only empty   Dynamic boundaries
    Wen, 2017 [25]          Real-time                ✗                 ✓               ✗
    Fagnant, 2017 [8]       Real-time + Historical   ✗                 ✓               ✗
    Alonso-Mora, 2018 [24]  Real-time                ✗                 ✓               ✗
    Alabbasi, 2019 [1]      Historical               ✗                 ✗               ✗
    Yang, 2019 [15]         Historical               ✗                 ✓               ✓
    Guériau, 2020 [9]       Real-time + Historical   ✗                 ✗               ✗
    This paper              Real-time                ✓                 ✓               ✓

We observe that the main issue of the reviewed research is its low adaptability to demand changes (daily, seasonal, or more long-term ones resulting from new city developments), as the addition of new city areas/zones, or a change in their granularity, requires system redesign.

With respect to request assignment, multiple algorithms are used in the literature to match riders and drivers (or vehicles). For example, [3, 4, 2] use integer programming to optimize an objective function for the optimal matching. RL-based approaches [9, 10, 1], in which agents explore possible solutions by themselves without any prior knowledge, are the closest to ours. A full review of vehicle assignment algorithms is out of the scope of this paper, as our contribution focuses on rebalancing. Nevertheless, it is worth mentioning that, even though we illustrate the application of D2R2 in conjunction with an RL-based vehicle assignment, D2R2 is designed to be independent of the assignment algorithm used in the MoD system.
3 BACKGROUND

This section introduces the background needed to understand the design and implementation of D2R2: Reinforcement Learning (RL), used for the vehicle assignment problem, and Expectation-Maximization (EM) with a model selection criterion, used for rebalancing.

3.1 Deep Reinforcement Learning

RL is a branch of machine learning in which an agent learns autonomously, by trial and error, to map actions to the current environment state, receiving a positive or negative reward for their execution [23]. The goal of the agent is to learn actions that maximize the long-term cumulative reward. RL iterates three tasks: at each time step, the agent obtains a perception of the environment and maps it to a state s from its overall state space; based on past experience, it selects an action a from the action space; at time step t, the agent then receives a reward r_t = R(s_t, a_t) which expresses how good the selected action was.

In most real-world scenarios, the environment space is complex or continuous, making it intractable to handle all possible state-action pairs. To overcome this issue, RL has been combined with deep neural networks to approximate states, giving rise to a range of Deep RL techniques. The approach we choose for our implementation is Proximal Policy Optimization (PPO) [22], which is simpler to implement and tune than other state-of-the-art Deep RL approaches without sacrificing performance. PPO uses a novel objective function, formed by three terms, which is maximized at each iteration:

L_t^{CLIP+VF+S}(Θ) = Ê_t[ L_t^{CLIP}(Θ) − c_1 L_t^{VF}(Θ) + c_2 S[π_Θ](s_t) ]    (1)

Θ is the policy's parameter vector. The first term, L^{CLIP}, is the clipped objective function defined in Equation 2. c_1 is a coefficient, defined between 0.5 and 1, applied to L^{VF} = (V_Θ(s_t) − V_t^{targ})², which computes the squared-error loss of V, the learned state-value function, with respect to the target value at time t. The last term, S, is an entropy bonus used to ensure sufficient exploration, regulated by c_2, ranging from 0 to 0.01; S of the stochastic policy π_Θ refers to the state at time t.

L^{CLIP}(Θ) = Ê_t[ min( r_t(Θ) Â_t, clip(r_t(Θ), 1−ε, 1+ε) Â_t ) ]    (2)

In the definition of the L^{CLIP} objective function (Equation 2), the first term inside the min is the surrogate objective of conservative policy iteration, which is clipped by the second term. Â is an estimator of the advantage function shown in Equation 3, and r_t(Θ) denotes the probability ratio π_Θ(a_t|s_t) / π_{Θ_old}(a_t|s_t), which expresses the difference between the current and the old policy and is clipped if this difference falls outside the boundaries defined by ε, a small hyper-parameter which weighs the distance of the new policy with respect to the old one.

Â_t = σ_t + (γλ)σ_{t+1} + (γλ)² σ_{t+2} + ⋯ + (γλ)^{T−t+1} σ_{T−1},  where σ_t = r_t + γV(s_{t+1}) − V(s_t)    (3)

During training, PPO collects a sequence of samples of length T from the environment and then estimates the advantage Â_t for the complete sequence. Finally, several epochs of optimization of L^{CLIP} are performed on the same batch, to make the most of the gathered experience.

While we are not aware of any previous application of PPO to a ride-sharing problem, it shows good performance in many other applications with similar characteristics, such as portfolio management [14], robot control [12, 18, 16], and simulation and games [5, 26], which motivated our decision to use it as the basis for our implementation.
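To make the objective concrete, the following is a minimal NumPy sketch of Equations 1 and 2 evaluated on a toy batch. The coefficient values (c_1 = 0.5, c_2 = 0.01, ε = 0.2) and the array contents are illustrative assumptions; the policy and value networks, as well as the advantage estimation of Equation 3, are omitted.

```python
import numpy as np

def l_clip(ratio, advantage, eps=0.2):
    """Equation 2: clipped surrogate objective, averaged over the batch."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.mean(np.minimum(unclipped, clipped))

def ppo_objective(ratio, advantage, v_pred, v_target, entropy,
                  c1=0.5, c2=0.01, eps=0.2):
    """Equation 1: clipped term minus value loss plus entropy bonus."""
    l_vf = np.mean((v_pred - v_target) ** 2)   # squared-error value loss
    return l_clip(ratio, advantage, eps) - c1 * l_vf + c2 * np.mean(entropy)

# Toy batch: probability ratios pi_new/pi_old, advantage estimates,
# predicted and target state values, and per-state policy entropies.
ratio     = np.array([0.90, 1.35, 1.05])
advantage = np.array([1.00, -0.50, 0.20])
v_pred    = np.array([0.40, 0.10, 0.30])
v_target  = np.array([0.50, 0.00, 0.25])
entropy   = np.array([0.70, 0.60, 0.65])
print(ppo_objective(ratio, advantage, v_pred, v_target, entropy))
```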
c1 is a coefficient, ∂θ defined between 0.5 and 1, applied to LV F = (VΘ (st ) − Vttarg )2 , EM algorithm terminates when difference between expectation at which computes the squared-error loss of V , the learned state-value time t and time t − 1, obtained from Equation 4, is smaller than a function, compared to the target value at time t. Last term S is the en- threshold . tropy bonus, used to ensure sufficient exploration which is regulated Once terminated, the likelihood indicates how good our model fits by c2 , ranging from 0 to 0.01. S of a stochastic policy πΘ refers to data. However, this parameter alone does not take into account over- state at time t. fitting and the number of clusters; in fact likelihood could be maxi- h i mized with each datapoint belonging to a different cluster. An option LCLIP (Θ) = Êt min(rt (Θ)Ât , clip(rt (Θ), 1 − , 1 + )Ât ) (2) to validate and select a model is by using Bayesian Information Cri- teria (BIC) [6]. It prevents over-fitting by taking into account num- In the LCLIP objective function definition (Equation 2), the first ber of clusters. BIC is computed through Equation 6, where number term inside the min is the surrogate objective with a conservative of free parameters, k, depends on number of clusters. BIC measure policy iteration which is clipped by the second term. Â is an estima- weighs the number of free parameters with the number of samples tor of the advantage function shown in Equation 3 and rt (Θ) denotes available. It looks for the true model among the set of candidates. t |st ) the probability ratio πΘπΘ (a(a t |st ) that expresses the difference be- old BIC = ln(n)k − 2 ln(L̂) (6) tween current and old policy, which is clipped if difference falls out of boundaries by , a small hyper-parameter which weighs distance Where L̂ is the maximized value of the likelihood function, result of from new policy in respect to the old. Equation 4, n is the number of data points within a dataset and k is the number of free parameters to be estimated. We have opted to use EM over other clustering techniques because Ât =σt + (γλ)σt+1 + (γλ)2 σt+2 + · · · + (γλ)(T −t+1) σt−1 it computes clusters by estimating normal distribution with their pa- (3) where, σt = rt + γV (st+1 ) − V (St ) rameters, which underlies data. By doing that, EM enables clusters to have different shapes unlike other clustering methods which tends During training, PPO method collects a sequence of samples of to find clusters with comparable areas by working directly on data length T from the environment and then estimates the advantage Ât points. for the complete sequence. Finally, several epochs of optimization on LCLIP are performed on the same batch, to maximize gathered experiences. 4 DYNAMIC DEMAND-RESPONSIVE While we are not aware of any application of PPO in a ride- REBALANCER sharing problem, it shows good performance in many other applica- This section describes the main contribution of our paper, a Demand- tions with similar characteristics, such as portfolio management [14], Responsive Zone Generation for Real-Time Vehicle Rebalancing robot control [12, 18, 16] or simulation and games [5, 26], which mo- (D2R2) in RS fleets. We first introduce our unpublished RS system tivated our decision to use it as basis for our implementation. using multi-agent Deep Reinforcement Learning, and then our novel rebalancer. 
4 DYNAMIC DEMAND-RESPONSIVE REBALANCER

This section describes the main contribution of our paper, Demand-Responsive Zone Generation for Real-Time Vehicle Rebalancing (D2R2) in RS fleets. We first introduce our unpublished RS system using multi-agent Deep Reinforcement Learning, and then our novel rebalancer.

4.1 Ride-Sharing using Deep Reinforcement Learning

We designed a multi-agent decentralized algorithm for RS applied to a fleet of 5-seater autonomous vehicles in a MoD system; it is model-free and designed to be replicable in any city in the world. To take a decision, each agent implements PPO [22], introduced in Section 3. Each agent controls a vehicle, taking an action at each step by evaluating its internal state and perception, without communicating or coordinating with other vehicles. Once an agent completes its action, the next step can begin. Agents evaluate and decide on an action in sequential order. At each time step, each agent perceives the environment and decides on the next action: to pick up a ride, to drop off passengers, or to rebalance. Finally, it updates its learning process. This cycle is described in Algorithm 1.

Algorithm 1: Controller for a single vehicle V
  Parameters: V vehicle, a action, r request, PPO model
   1  Perceive and act (V)
   2    update vehicle perception and status
   3    a ← PPO.getAction([V.perception, V.status])
   4    if a is park then
   5      if (V.queue ∧ V.perception) are empty then
   6        rebalance(V)                               // Alg. 2
   7        reward ← −0.01
   8      end
   9      else V.wait()
  10        if V.queue is empty then reward ← −0.3
  11        else reward ← −0.5
  12    else if a is drop-off then
  13      if V.queue is empty then return −10
  14      V.destination ← arg min_{r_i ∈ V.queue} Supply(v, r_i)
  15      detourRatio ← V.goToDestination()
  16      if detourRatio ≤ maxDetourRatio then reward ← 5
  17      else reward ← 5 − (detourRatio − 1)
  18    else a is pick-up
  19      if ∃ r associated to a ∧ V.freeSeats ≥ r.passengers then
  20        V.pickUp(r)
  21        if size(V.queue) == 1 then reward ← 1
  22        else reward ← 2                            // doing rs
  23      end
  24      else reward ← −10
  25    end
  26    PPO.update(reward, a)

The agent's internal state is composed of the vehicle position, represented by a latitude-longitude pair, its destination, and the number of vacant seats. For an empty vehicle, the destination is void; if a vehicle is serving one or more requests, its destination matches that of the request r_i that can be served the quickest, as shown in line 14 of Algorithm 1. The vehicle location on the road network is updated every time a new position is reached, either a destination or a pick-up point.

The perception P is composed of the three closest requests that an agent could serve. A request r is available to a vehicle v if v has enough empty seats to accommodate the number of passengers associated with the request (ranging from 1 to 6), and the total waiting time for r, i.e., the delay between the request being created and the estimated passenger pick-up time, is less than the maximum time allowed (set to 15 minutes). All customers who have waited more than 15 minutes leave the system without being served, and the request is recorded as not served.

The perception is defined as the aggregation of the states of the requests perceived by an agent, P : {S_r1, S_r2, S_r3}, where S_ri : {r^i_Pos, r^i_Dest, r^i_Passengers} represents the state of the i-th request perceived by the agent. Each request consists of a pick-up location, a destination, and a number of passengers.

Vehicles can choose between 5 actions, organized in three categories: (1) drop-off, in which an agent serves a request by driving the passenger(s) to their destination; (2) park, in which an agent waits for one minute while parked; and (3) pick-up, in which an agent drives to the pick-up point of the selected request. The pick-up action has three variations: pick up the first, second, or third request from the perception set. Once an agent selects a pick-up action, it is first checked whether a request in the perception corresponds to it, and then whether the vehicle can accommodate the new passenger(s) (line 19 in Algorithm 1). When the vehicle's perception is empty and it is not serving any request, it is enabled to rebalance, as shown at line 6 in Algorithm 1. We further discuss rebalancing in the next section.

The rewards associated with the actions are also shown in Algorithm 1. An agent gets a negative reward of −10 for attempting to pick up a request when it does not have enough free seats. Otherwise, the best (+5) and the potentially worst reward are related to the same action: when a vehicle performs a drop-off while carrying passengers from several requests, if the travel time for the passengers to reach their destination exceeds the estimated travel time without RS by 30% or more, the reward is reduced according to the total additional detour distance travelled.
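The reward values scattered through Algorithm 1 can be collected into a single function. The sketch below is one consistent reading of the algorithm rather than the authors' code; the argument names are illustrative, and the 1.3 detour threshold is an assumption encoding the 30% extra-travel-time limit described above.

```python
def reward(action, queue_empty, perception_empty=False,
           enough_seats=True, queue_size=0,
           detour_ratio=1.0, max_detour_ratio=1.3):
    """Reward scheme of Algorithm 1 for a single action of one vehicle agent."""
    if action == "park":
        if queue_empty and perception_empty:
            return -0.01                     # idle with nothing to serve: rebalancing step
        return -0.3 if queue_empty else -0.5  # waiting although requests exist
    if action == "drop-off":
        if queue_empty:
            return -10                       # nothing on board to drop off
        if detour_ratio <= max_detour_ratio:
            return 5                         # served within the allowed detour
        return 5 - (detour_ratio - 1)        # penalise excessive detours
    if action == "pick-up":
        if not enough_seats:
            return -10                       # cannot accommodate the passengers
        # queue_size: number of requests on board after this pick-up
        return 1 if queue_size == 1 else 2   # higher reward when ride-sharing
    raise ValueError(f"unknown action: {action}")

# Example: a drop-off that stretched passenger travel time by 60%.
print(reward("drop-off", queue_empty=False, detour_ratio=1.6))   # 4.4
```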
4.2 Rebalancer - D2R2

The D2R2 rebalancer can be used with different MoD systems; for illustration, however, we describe its implementation combined with the Deep RL ride-sharing request assignment strategy presented in the previous section. Rebalancing is triggered when a vehicle is not serving any requests and has no further requests to serve in its neighbourhood. D2R2 aims to dispatch vehicles efficiently and dynamically according to the current demand, preventing fleet imbalance, which can in turn result in longer passenger waiting times or an increased number of unserved requests. D2R2 infers relocation zones and computes their associated probabilities (Equation 7) for a vehicle to be relocated into:

p_r(v, z_i) = |R_i| / |R|    (7)

where z_i is the i-th zone, R_i is the set of pending requests within that zone, and R is the set of pending requests across all zones.
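A minimal sketch of Equation 7 and of the weighted random zone selection used for relocation is given below. The zone identifiers and request lists are hypothetical, and the grouping of pending requests into zones would come from the EM clustering step described next.

```python
import random

def zone_probabilities(zones):
    """Equation 7: p_r(v, z_i) = |R_i| / |R| over the pending requests per zone."""
    total = sum(len(requests) for requests in zones.values())
    return {zone: len(requests) / total for zone, requests in zones.items()}

def pick_relocation_zone(zones, rng=random):
    """Weighted random selection of the zone an idle vehicle relocates to."""
    probs = zone_probabilities(zones)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical zones mapping a zone id to the pending requests it contains.
zones = {"z1": ["r3", "r4", "r7"], "z2": ["r9"]}
print(zone_probabilities(zones))     # {'z1': 0.75, 'z2': 0.25}
print(pick_relocation_zone(zones))   # 'z1' three times out of four on average
```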
The D2R2 framework is illustrated in Figure 3, which depicts the rebalancing module and the agent behaviour. The demand filter has a double role: it selects three requests for the agents' perception and also filters the demand used for rebalancing, according to the chosen time frame. D2R2 takes as input the pending requests available at the current time. All previously unsatisfied requests and future requests (estimated or scheduled) are not taken into account, as illustrated in Figure 4.

Figure 3: Ride-sharing framework with D2R2 integrated

Figure 4: Example of pending requests at time t_v. Only requests active at time t_v (i.e., r_3, r_4) are taken into account when rebalancing

Algorithm 3: Defining relocating zones for rebalancing
  Result: Given a set of pending requests (reqsAvailable) and a time, finds relocating zones C : {c_1, c_2, ..., c_n} with associated probabilities P : {p_1, p_2, ..., p_n}
  Parameters: V vehicle, r request, p probability
   1  updatingClusters (reqsAvailable, V.time)
   2    foreach r ∈ reqsAvailable do
   3      if V.time − r.timeBegin ≤ TIMEFRAME then
   4        queueReqs.append(r)
   5      end
   6    end
   7    clusters ← min_bic_{k=10..50} (EM(queueReqs, k))
   8    prob ← 0
   9    foreach c ∈ clusters do
  10      prob ← prob + c.size() / queueReqs.size()
  11      relocatingProb[c] ← prob
  12    end
  13    return clusters, relocatingProb

Algorithm 2 describes the procedure applied to relocate an idle vehicle to a new position. First, D2R2 generates new clusters based on the pending requests, following Algorithm 3. For a single rebalancing task, several runs of EM occur, producing different numbers of clusters. The algorithm then selects the model which best represents the data by choosing the clustering model which minimizes the BIC measure (line 7 in Algorithm 3). The vehicle is then rebalanced to a zone within the selected clustering model according to a weighted random selection, defined over the probability distribution among clusters computed in Equation 7 (lines 3-4 in Algorithm 2). Among the possible approaches, we preferred a weighted random selection in order to let vehicles explore different zones while rebalancing, preventing all vehicles from relocating to a similar area.

Algorithm 2: Single vehicle relocating by D2R2
  Result: relocates a vehicle V to a new position
   1  rebalance (V)
   2    clusters, relocatingProbability ← updatingClusters(reqsAvailable, V.time)   // Alg. 3
   3    rnd ← generate random value ∈ [0, 1]; i ← 0
   4    do i++ while rnd > relocatingProbability[c_i]

peak time, are distributed as follows: 71.95% of the demand consists of single-passenger requests, 13% of two passengers, 4.1% of three, 2.28% of four, 5.3% of five, and the remaining 3.37% of six.

The agents' learning stage is performed by running multiple rounds of single-vehicle training. Only a single vehicle v_t is allowed to explore the environment at a given time t, and it can perceive the remaining requests that the previous vehicle v_{t−1} could not serve. This emulates a multi-vehicle concurrent exploration without competition between vehicles in serving customers. All the experience gained by the vehicle agents during training is gathered into a single learning process. In this way, all vehicles update the same learning process, to optimize the use of acquired knowledge and speed up the overall learning. However, if one vehicle fails to perform the update, or is not available to serve requests in a particular location, the others can continue seamlessly. Once training is completed, the knowledge is replicated to all vehicles of the fleet. This allows new vehicles/agents to join the fleet without carrying out any additional training.

Travel information for vehicles is estimated using the Open Source Routing Machine (OSRM) [17], which, given two longitude-latitude coordinates, estimates the distance and travel time driving on the