Demand-Responsive Zone Generation for Real-Time
         Vehicle Rebalancing in Ride-Sharing Fleets
             Alberto Castagna1 and Maxime Guériau1 and Giuseppe Vizzari2 and Ivana Dusparic1


Abstract. Enabling Ride-sharing (RS) in existing Mobility-on-
demand (MoD) systems allows to reduce the operating vehicle fleet
size while achieving a similar level of service. This however re-
quires an efficient vehicle to multiple requests assignment, which
is the focus of most RS-related research, and an adaptive fleet
rebalancing strategy, which counter-acts the uneven geographical
spread of demand and relocates unoccupied vehicles to the areas
of higher demand. Existing research into rebalancing generally di-
vides the system coverage area into predefined geographical zones,
however, this is done statically at design-time and can limit their
adaptivity to evolving demand patterns. To enable dynamic, and                 Figure 1: Observed demand imbalance in New York Taxi dataset [21]
therefore more accurate rebalancing, this paper proposes a Dynamic             trips between morning (7-10am) and evening (6-9pm) peak hours in
Demand-Responsive Rebalancer (D2R2) for RS systems. D2R2 uses                  the south part of Manhattan on Tuesday, February 2nd 2016
Expectation-Maximization (EM) clustering to determine relocation
zones at runtime. D2R2 re-calculates zones at each decision step               mobility demand is changing over time and distribution of requests
and assigns them relative probabilities based on current demand. We            is uneven. This can lead to an unbalanced fleet distribution for RS-
demonstrate the use of D2R2 by integrating it with a Deep Rein-                enabled MoD systems [2], as illustrated in Figure 2, where most of
forcement Learning multi-agent RS-enabled MoD system in a fleet                the demand is concentrated in the top area while majority of vehicles
of 200 vehicle agents serving 10,000 trips extracted from New York             are located in the opposite side after finishing their last trip, where
taxi trip data. Results show a more fair workload division across the          fewer new customers are requesting for a ride.
fleet without loss of performance with respect to waiting time and                Adaptively following (or even preventing) changes in the demand
distribution of passengers per vehicle, when compared to baselines             spatial patterns can improve the perceived level of service from the
with no rebalancing and static pre-defined equiprobable zones.                 perspective of the use of the MoD system, and also, assuming a
                                                                               joined fleet in which human drivers can participate with their ve-
                                                                               hicles, its capability to consistently enable drivers to meet the actual
1    INTRODUCTION                                                              demand while optimizing vehicles usage and increase the ROI (re-
Mobility-on-Demand (MoD) systems are gaining popularity over                   turn on investment), assuming that the participation to the RS system
privately owned vehicles and public transportation due to reduced              would imply a subscription cost (due to setup and operating costs of
prices and shorter overall journey times [7]. Recent work suggests             the system).
that RS-enabled MoD systems can achieve similar level of service
using fewer vehicles, by better optimizing: (i) vehicle to multiple re-
quests assignment [2] and (ii) rebalancing empty vehicles to fit real-
time demand [24].
   Vehicle to request matching has been widely investigated. Of-
fline methods relying on constraint solving [3, 4] can design an op-
timized plan that is then executed by the vehicle. Online methods
involve matching to requests dynamically and has so far been ad-
dressed using constraint solving methods [2] and agent-based mod-              Figure 2: Example of an unbalanced fleet distribution where demand
els [1, 8, 9, 25].                                                             (requests in red) location differs from available supply (vehicles)
   However, fleet rebalancing in MoD systems has been less inves-
tigated while shown to have a strong impact on level of service of                To address this issue, existing work define rebalancing strategies
RS-enabled systems [24]. As depicted in Figure 1 created by extract-           that consist in relocating vehicles according to past or current de-
ing requests from New York Taxi data [21] on two different periods,            mand. Rebalancing can be achieved using pre-defined location per
1
                                                                               vehicles (station-based relocation) or by defining a set of areas (also
   School of Computer Science and Statistics, Trinity College Dublin,
  Ireland,    emails:     acastagn@tcd.ie,       maxime.gueriau@scss.tcd.ie,   called zones) each vehicle can be sent to. Rebalancing approaches
  ivana.dusparic@scss.tcd.ie                                                   can rely on a static [9, 1, 8, 25, 24, 19] or a dynamic partition of
2 University of Milano - Bicocca, Italy, email: giuseppe.vizzari@unimib.it
                                                                               the network [15] to split it into zones. Rebalancing vehicles using a

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
dynamic partition is expected to better track changes in mobility de-       a clustering algorithm. A k-means clustering is applied on virtual re-
mand e.g., caused by temporary network disruptions, special events          quests generated by a distribution defined on historical data. There-
concentrating demand temporarily and long-term city developments            fore the coverage and the size of the zones can change, but the total
affecting observed patterns.                                                number of zones remains fixed. This approach is the closest related
   In this paper, we propose a Dynamic Demand-Responsive Rebal-             work, however, we allow different number of zones based on dif-
ancer (D2R2) which, using Expectation-Maximization (EM) clus-               ferent density of requests. Furthermore, our approach relies on real
tering technique, generates a dynamic set of rebalancing zones and          time data rather than historical to have a better respond to dynamic
computes relocating probability per zone from current demand trend          demand.
every time a vehicle needs to relocate. Novelty of our approach is             Table 1 summarizes existing work on rebalancing for MoD sys-
twofold: first, boundaries and number of relocating areas are com-          tems, categorized by four main characteristics: Analysed data, which
puted when required and second, rebalancing probability for each            can be historical, real-time or both, depending on what kind of data
zone is allocated dynamically using only unserved requests data             is rebalancing based; Dynamic # zones, i.e., whether the number of
available in real-time. Therefore, D2R2 does not require data collec-       rebalancing zones can change over time; RB empty, whether the vehi-
tion and no learning phase is required, enabling our proposal to op-        cles relocate only when empty or can relocate as a part of RS assign-
erate from the beginning while being responsive to current demand.          ment; and Dynamic boundaries, whether the area covered by each
   We evaluate D2R2 in an implementation of a Multi-Agent Re-               rebalancing zone can adapt dynamically.
inforcement Learning (MARL) ride-sharing enabled MoD system
where 200 vehicle agents serve 10,000 ride requests in lower Man-                  Table 1: Characteristics of existing rebalancing algorithms
hattan road network. Requests have been generated from the open
                                                                                                     Analysed     Dynamic   RB only    Dynamic
NYC taxi dataset [21] to be representative of real demand patterns.                                    data       # zones    empty     boundaries
Results show that coupling D2R2 with ride-sharing enhances perfor-               Wen, 2017 [25]      Real-time       7         3          7
mance from a single vehicle perspective, and improves the overall                                    Real-time
balance of the distribution of requests across the fleet. At the cost           Fagnant, 2017 [8]                   7          3           7
                                                                                                     Historical
of few more kilometres travelled empty for rebalancing, the perfor-               Alonso-Mora,
                                                                                                     Real-time      7          3           7
mance at the fleet level confirms the overall efficiency of our demand-             2018 [24]
responsive rebalancing strategy.                                                Alabbasi, 2019 [1]   Historical     7          7           7
                                                                                 Yang, 2019 [15]     Historical     7          3           3
                                                                                                     Real-time
                                                                                Guériau, 2020 [9]                  7          7           7
2   RELATED WORK                                                                                     Historical
                                                                                   This paper        Real-time      3          3           3
Rebalancing for MoD can be categorized in the approaches relying
on static rebalancing zones [1, 8, 9, 24, 25] and dynamic zones [15].          We observe that the main issue of reviewed research is the low
In static rebalancing zone generation, geographical coverage of relo-       adaptability to demand changes (daily, seasonal, or more long-term
cation zones is predefined at design time. For example, in [9], NYC         ones resulting from new city developments), as the addition of new
Lower Manhattan area is divided in predefined zones which do not            city areas/zones, or changing their granularity, requires system re-
change over time. Each vehicle, using RL, learns and decides at each        design.
time step whether to relocate to one of the neighbouring zones or to           With respect to request assignment, multiple algorithms are used
stay in its current zone. In [8], rebalancing areas in Austin, Texas, are   in the literature to match riders and drivers (or vehicles). For ex-
defined by partitioning the area in 2-mile by 2-mile square blocks.         ample, [3, 4, 2] use integer programming to optimize the objective
Block balance is calculated for each zone, capturing the excess or          function for the optimal matching. RL-based approaches [9, 10, 1]
deficit of vehicles within the block in relation to supply and expected     are the closest to our approach, in which agents explore by them-
travel demand. Blocks with a negative balance try to gather vehicles        selves possible solutions without having any prior knowledge. A full
from neighbourhood where there is a surplus, and expected travel de-        review of vehicle assignment algorithms is out of scope of this paper
mand is estimated from historical data and current requests. In [1],        as our contribution focuses on rebalancing, nevertheless, it is worth
zones are defined using a fine-grained but also static grid. In [25],       mentioning that, even though we illustrate D2R2 application in con-
if a vehicle is idling, it can rebalance based on its local knowledge:      junction with an RL-based vehicle assignment, R2D2 is designed to
according to the demand distribution in surrounding areas, it decides       be independent from the assignment algorithm used in the MoD sys-
to rebalance to a neighbouring zone or not. The work presented in           tem.
[24] partitions the area into rebalancing zones according to the road
network layout. Zones are defined such as that for each region ri , ex-
ists a zone which allows to reach ri within the established maximum         3      BACKGROUND
travel time. Idle vehicles are rebalanced by taking into account travel     This section introduces the background information needed to un-
time, to limit empty travels, and future demand, estimated from cur-        derstand D2R2 design and implementation: Reinforcement Learn-
rent demand, to avoid an excess of vehicles in the same area. How-          ing (RL) used for vehicle assignment problem and expectation max-
ever, the zones, once defined at the start, do not change based on traf-    imization (EM) with model selection criterion for rebalancing.
fic or fleet conditions. Majority of the approaches allow rebalancing
only for idle (and empty) vehicles, while [1] and [9] mix rebalanc-
                                                                            3.1      Deep Reinforcement Learning
ing with RS assignment, and allow RS pick-ups from neighbouring
zones, shortening waiting time for passengers, but increasing their         RL is a branch of machine learning in which an agent learns au-
travel time.                                                                tonomously by trial-and-error to map actions to the current environ-
   The only current work that uses dynamic zone generation for re-          ment state, by receiving a positive or negative reward for their exe-
balancing is presented in [15]; rebalancing zones are computed using        cution [23]. The goal of the agent is to learn actions that maximize
the long term cumulative reward. RL iterates three tasks: at each time       computes the complete log-likelihood of data memberships in clus-
step an agent obtains the perception of the environment and maps it          ters, and Maximization (M step), which, by maximizing the com-
to a state s from its overall state space. Based on past experience,         puted likelihood, updates the hidden parameters θ related to nor-
it can select an action a from the action space. The agent then, at          mal distribution, hence mean, variance and prior probability for each
timestep t receives a reward rt = R(st , at ) which expresses how            cluster.
good was the selected action.                                                   E step: hidden parameters are initialized by some random values
    In most of real-world scenarios, the environment space is complex        or, more likely, computed from data points. Then, expectation of the
or continuous, making it intractable to handle all possible state-action     complete log-likelihood, conditioned by observed samples and cur-
pairs. To overcome this issue, RL has been combined with deep neu-           rent estimation of θ, is computed. Expected value is:
ral networks to approximate states, giving rise to a range of Deep                                       hX                            i
RL techniques. The approach we choose for our implementation is                          Q(θ; θ(t)) = E        ln(py (yk ; θ|X; θ(t)))          (4)
Proximal Policy Optimization (PPO) [22], which is simpler to im-                                              k

plement and tune without affecting the performance when compared             where θ is the unknown parameter vector and the term inside the log-
to other state-of-the-art Deep RL approaches. PPO uses a novel ob-           arithm expresses the conditional probability of a datapoint to belong
jective function, formed by three terms, which is maximized each             to a cluster k given the observed samples X and value of θ at the
iteration:                                                                   previous step.
                           h                                            i       M step: computes the next (t + 1)-th estimation of the unknown
LCLIP
  t
        +V F +S
                 (Θ) = Êt LCLIP
                              t     (Θ) − c1 LVt F (Θ) + c2 S[πΘ ](st )      parameter vector by maximising the expected value Q(θ; θ(t)) ob-
                                                                      (1)    tained from previous step.

Θ is the policy’s parameter vector. The first term LCLIP is the                                               ∂Q(θ; θ(t))
                                                                                                 θ(t + 1) :               =0                    (5)
clipped objective function defined in Equation 2. c1 is a coefficient,                                           ∂θ
defined between 0.5 and 1, applied to LV F = (VΘ (st ) − Vttarg )2 ,         EM algorithm terminates when difference between expectation at
which computes the squared-error loss of V , the learned state-value         time t and time t − 1, obtained from Equation 4, is smaller than a
function, compared to the target value at time t. Last term S is the en-     threshold .
tropy bonus, used to ensure sufficient exploration which is regulated            Once terminated, the likelihood indicates how good our model fits
by c2 , ranging from 0 to 0.01. S of a stochastic policy πΘ refers to        data. However, this parameter alone does not take into account over-
state at time t.                                                             fitting and the number of clusters; in fact likelihood could be maxi-
                   h                                               i         mized with each datapoint belonging to a different cluster. An option
LCLIP (Θ) = Êt min(rt (Θ)Ât , clip(rt (Θ), 1 − , 1 + )Ât )        (2)   to validate and select a model is by using Bayesian Information Cri-
                                                                             teria (BIC) [6]. It prevents over-fitting by taking into account num-
In the LCLIP objective function definition (Equation 2), the first           ber of clusters. BIC is computed through Equation 6, where number
term inside the min is the surrogate objective with a conservative           of free parameters, k, depends on number of clusters. BIC measure
policy iteration which is clipped by the second term. Â is an estima-       weighs the number of free parameters with the number of samples
tor of the advantage function shown in Equation 3 and rt (Θ) denotes         available. It looks for the true model among the set of candidates.
                              t |st )
the probability ratio πΘπΘ (a(a  t |st )
                                         that expresses the difference be-
                         old                                                                      BIC = ln(n)k − 2 ln(L̂)                       (6)
tween current and old policy, which is clipped if difference falls out
of boundaries by , a small hyper-parameter which weighs distance            Where L̂ is the maximized value of the likelihood function, result of
from new policy in respect to the old.                                       Equation 4, n is the number of data points within a dataset and k is
                                                                             the number of free parameters to be estimated.
                                                                                We have opted to use EM over other clustering techniques because
Ât =σt + (γλ)σt+1 + (γλ)2 σt+2 + · · · + (γλ)(T −t+1) σt−1                  it computes clusters by estimating normal distribution with their pa-
                                                                      (3)
      where, σt = rt + γV (st+1 ) − V (St )                                  rameters, which underlies data. By doing that, EM enables clusters
                                                                             to have different shapes unlike other clustering methods which tends
During training, PPO method collects a sequence of samples of                to find clusters with comparable areas by working directly on data
length T from the environment and then estimates the advantage Ât           points.
for the complete sequence. Finally, several epochs of optimization
on LCLIP are performed on the same batch, to maximize gathered
experiences.                                                                 4     DYNAMIC DEMAND-RESPONSIVE
   While we are not aware of any application of PPO in a ride-                     REBALANCER
sharing problem, it shows good performance in many other applica-            This section describes the main contribution of our paper, a Demand-
tions with similar characteristics, such as portfolio management [14],       Responsive Zone Generation for Real-Time Vehicle Rebalancing
robot control [12, 18, 16] or simulation and games [5, 26], which mo-        (D2R2) in RS fleets. We first introduce our unpublished RS system
tivated our decision to use it as basis for our implementation.              using multi-agent Deep Reinforcement Learning, and then our novel
                                                                             rebalancer.
3.2    Clustering
                                                                             4.1    Ride-Sharing using Deep Reinforcement
Expectation-Maximization (EM) is an iterative algorithm to find the
                                                                                    Learning
maximum-likelihood estimates of parameters by computing proba-
bilities of cluster memberships across several probability distribu-         We designed a multi-agent decentralized algorithm for RS applied to
tions [20]. EM consists of two steps, Expectation (E step), which            a fleet composed of 5-seater autonomous vehicles for a MoD system,
                                                                                                                       
which is model-free and designed to be replicable to any city in the      quests perceived by an agent, P : Sr1 , Sr2 , Sr3 . Where,
world. To take a decision, agent implements PPO [22], introduced in               i      i       i
                                                                          Sri : rPos  , rDest , rPassengers represents the state of the i-th request
Section 3. Each agent controls a vehicle, taking an action at each step   perceived by an agent. Each requests consists of a pick-up location,
by evaluating its internal state and perception, without communicat-      the destination and number of passengers.
ing or coordinating with other vehicles. Once an agent completes its         Vehicles can choose between 5 actions, organized in three cate-
action a next step can begin. Agents evaluate and decide of an ac-        gories: (1) drop-off, in which an agent serves a request by driving the
tion in a sequential order. At each time step, each agent perceives       passenger(s) to their destination; (2) park, in which an agent waits
the environment and decides of the next action: to pick-up a ride, to     one minute being parked. (3) pick-up, an agent drives to a pick-up
drop-off passengers or to rebalance. Finally, it updates its learning     point of the selected request. Pick-up action has three variations,
process. This cycle is described in Alg. 1.                               pick-up first, second or third request from the perception set. Once
                                                                          an agent selects a pick-up action, is first checked whether in percep-
                                                                          tion corresponds a request and then if the vehicle can accommodate
    Algorithm 1: Controller for a single vehicle V
                                                                          the new passenger(s), line 19 in Algorithm 1. When the vehicle per-
   Parameters: V vehicle, a action, r request, PPO model                  ception is empty and it is not serving any requests, it is enabled to
1  Perceive and act (V )                                                  rebalance as shown at line 6 in Algorithm 1. We further discuss re-
 2    update vehicle perception and status                                balancing in next section.
 3    a←PPO.getAction([V.perception, V.status])                              Rewards associated with actions are also shown in Algorithm 1.
 4    if a is parked then                                                 An agent get a negative reward of -10 for attempting to try to pick-
 5         if (V.queue ∧ V.perception) are empty then                     up a request while it does not have enough free seats. Otherwise,
 6              rebalance(V )                        // Alg. 2            the best (+5) and the potential worst reward are related to the same
 7              reward ← −0.01                                            action: when a vehicle is doing a drop-off while carrying passenger
 8         end                                                            from several requests, if the travel time for passengers to reach their
 9         else V.wait()                                                  destination exceed the estimated travel time without RS by 30% or
10         if V.queue is empty then reward ← −0.3                         more, then the reward is reduced according to the total additional
11         else reward ← −0.5                                             detour distance travelled.
12    else if a is drop-off then
13         if V.queue is empty then return −10
14         V.destination ← arg minri ∈V.queue Supply(v, ri )
                                                                          4.2    Rebalancer - D2R2
15         detourRatio←V.goToDestination()                                D2R2 rebalancer can be used with different MoD systems, however
16         if detourRatio ≤ max detourRatio then reward ← 5               for illustration we describe its implementation as combined with the
17         else reward ← 5 − (detourRatio−1)                              Deep RL ride-sharing request assignment strategy presented in pre-
18    else a is pick-up                                                   vious section. Rebalancing is triggered when a vehicle is not serving
19         if ∃ r associated to a ∧ V.freeSeats ≥ r.passengers            any requests and it has no further requests to serve in its neighbour-
             then                                                         hood. D2R2 aims to dispatch vehicles efficiently and dynamically ac-
20              V.pickUp(r)                                               cording to current demand, preventing fleet unbalance, which in turn
21              if size(V.queue)== 1 then reward ← 1                      can result in longer passenger waiting times, or an increased num-
22              else reward ← 2                  // doing rs              ber of unserved requests. D2R2 infers relocating zones and computes
23         end                                                            their associated probabilities (Eq. 7) for a vehicle to be relocated into:
24         else reward ← −10
                                                                                                                     |Ri |
25    end                                                                                            pr (v, zi ) =                              (7)
                                                                                                                      |R|
26    PPO.update(reward, a)
                                                                          where zi is the ith zone, Ri is the set of pending requests within
                                                                          current zone, and R is the set of pending requests across all zones.
   The agent’s internal state is composed by the vehicle position, rep-
resented by latitude-longitude pair, its destination, and the number of
vacant seats. For an empty vehicle, destination is void, and if a ve-
hicle is serving one or more requests, its destination matches to that
of the request ri that can be served the quickest, as shown in line 14
in Alg. 1. The vehicle location on the road network is updated ev-
ery time that a new position is reached, either destination or pick-up
point.
   Perception P is composed of the three closest requests that an
agent could serve. A request r is available to a vehicle v if v has
enough empty seats to accommodate the number of passengers asso-
ciated with the request (ranging from 1 to 6), and the total waiting
time for r, i.e., the delay between request being created and esti-
mated passengers pick-up time, is less than the maximum time al-
lowed (set to 15 minutes). All customers, who have waited more than             Figure 3: Ride-sharing framework with D2R2 integrated
15 minutes leave the system without being served, and the request is
recorded as not served.                                                     R2D2 framework is illustrated in Figure 3, depicting the rebalanc-
   Perception is defined as the aggregation of states of re-              ing module and an agent behaviour. The demand filter has a double
role: it selects three requests for agents perception and also filters de-
                                                                               Algorithm 3: Defining relocating zones for rebalancing
mand for rebalancing, according to used time frame. D2R2 takes as
input pending requests available at current time. All previous unsat-           Result: Given a set of pending requests (reqsAvailable) and a
isfied requests and future requests (estimated or scheduled), are not                   time, finds relocating zones C : {c1 , c2 , . . . , cn } with
taken into account, as illustrated in Figure 4.                                         associated probabilities P : {p1 , p2 , . . . , pn }
                                                                                Parameters: V vehicle, r request, p probability
                                                                              1 updatingClusters (reqsAvailable, V.time)
                                                                              2     foreach r ∈ reqsAvailable do
                                                                              3         if V.time − r.timeBegin ≤ TIMEFRAME then
                                                                              4              queueReqs.append(r)
                                                                              5         end
                                                                              6     end
                                                                              7     clusters← min bic50  k=10 (EM (queueReqs, k ))
                                                                              8     prob ← 0
                                                                              9     foreach c ∈ clusters do
                                                                                                              c.size()
                                                                             10         prob ← prob + queueReqs.size()
                                                                             11         relocatingProb[c] ← prob
Figure 4: Example of pending requests at time tv . Only requests ac-         12     end
tive at time tv , (i.e., r3 , r4 ) are taken into account when rebalancing   13 return clusters, relocatingProb


   Algorithm 2 describes the procedure applied to relocate an idle
vehicle to a new position. First, R2D2 generates new clusters, based         peak time, are distributed as follows: 71.95% of demand is composed
on pending requests, following Algorithm 3. For a single rebalancing         by a single passenger request, 13% by two, 4.1% by three, 2.28% by
task, several runs of EM occur, producing different number of clus-          four, 5.3% by 5 and the remaining 3.37% by 6.
ters. The algorithm then selects the model which better represents the          Agents learning stage is performed by running multiple rounds
data by choosing the clustering model which minimizes BIC mea-               of single-vehicle training. Only a single vehicle vt is allowed to ex-
sure (line 7 in Algorithm 3). The vehicle is then rebalanced to a zone       plore the environment at a given time t and can perceive remaining
within the selected clustering model according to a weighted random          requests that the previous vehicle vt−1 could not serve.
selection defined over a probability distribution among clusters and            This emulates a multi-vehicle concurrent exploration without
computed in Equation 7 (lines 3-4 in Algorithm 2). Among different           competition between vehicles in serving customers. All the experi-
approaches we preferred a weighted random selection to enable ve-            ence gained by all of the vehicle agents during training is gathered
hicles exploring different zones while rebalancing, avoiding all the         into a single learning process. In this way, all vehicles update the
vehicles to rebalance in a similar area.                                     same learning process, to optimize the use of acquired knowledge
                                                                             and speed up the overall learning process. However, if one vehicle
 Algorithm 2: Single vehicle relocating by D2R2                              fails to do the update, or is not available to serve requests in a par-
                                                                             ticular location, the others can continue seamlessly. Once training is
  Result: relocates a vehicle V to a new position
                                                                             completed, knowledge is replicated to all vehicles of the fleet. This
1 rebalance (V )
                                                                             allows new vehicles/agents to join the fleet without carrying out any
2    clusters, relocatingProbability ←
                                                                             additional training.
       updatingClusters(reqsAvailable, V.time) // Alg 3
                                                                                Travel information for vehicles is estimated using the Open Source
      rnd ← generate random value ∈ [0, 1]
                                                                             Routing Machine (OSRM) [17], which, given two longitude-latitude
3    i←0
                                                                             coordinates, estimates the distance and travel time driving on the
4    do i++ while rnd<relocatingProbability[i]
                                                                             shortest route from origin and destination. Distance and time are
5    V.destination ←clusters[i].getCenter()
                                                                             computed according to a 24-hours snap-shoot acquired from a real-
6    V.driveToDestination()
                                                                             world scenario.
                                                                                Agents, which are implemented through Tensorforce [13], use
   Each agent therefore, at each timestep, takes an action using its         PPO with a deep neural network with a dense topology. Input layer
policy trained though PPO and, when needed, relocates by using EM            has 20 neurons since an agent can perceive at most 3 requests and for
as described in this section.                                                each request is taken number of passengers and position with desti-
                                                                             nation, as latitude-longitude coordinates, while the output layer has 5
                                                                             neurons, one for each action. Between input and output layer lie three
5   SIMULATION
                                                                             hidden layers, each of them composed of 32 neurons. The clipping
We evaluate the performance of D2R2 in an RL-based RS system                 ratio is set to 0.2 and the discount factor to 0.99. We used Adam [11]
by using a fleet of 200 5-seater vehicles controlled by autonomous           optimizer, with a learning rate of 1e−3 and an entropy regularization
agents. Requests have been extracted from NYC city taxi data, and            set to 0.01.
filtered to include only requests within lower Manhattan in New York            PPO batching capacity is 100, and we executed 10 iterations over
City. Our evaluation is performed using data for the evening peak            the batches of the PPO objective. The model was trained during 20
hour (from 6 to 9 pm), and consists of 10,000 requests. These 10,000         episodes, and each episode is composed of 10 rounds in which a ve-
requests have been generated by aggregating trips from 50 consecu-           hicle serves requests. Each episode does not have a fixed number
tive Tuesdays between July 2015 and June 2016 (to represent a typi-          of iterations, as it terminates when a given agent does not have any
cal weekday demand pattern). Passengers by request, during evening           more requests to serve. During training, rebalancing is disabled to
avoid unpredictable bias as it relies on a random weighted choice to       Level of Service From the system perspective, as reported in Ta-
select the new zone in which a vehicle is relocated into. For this rea-   ble 3, each scenario shows a different level of service. Scenario Base
son integrating rebalancing as an action would have affected agents       - no RB, no RS, which models a classical taxi service, serves around
learning, requiring more exploration.                                     74% of requests, and D2R2 RB only serves 95% of the requests. All
                                                                          other scenarios (RS only, RS with fixed zones RB, and D2R2 RB and
6     EVALUATION RESULTS AND ANALYSIS                                     RS) achieve maximum level of service possible with 5-seater cars,
                                                                          of 96.63% requests. The results show that by enabling ride-sharing
To evaluate the performance of D2R2 rebalancing strategy, we pro-         in combination with D2R2 rebalancing, decreases number of request
pose 5 different scenarios, as presented in Table 2.                      ride-share (68% vs 80%) when compared to RS only. However, de-
                                                                          tour ratio is decreased (1.5 vs 1.7), which means that average time
            Table 2: Specification of evaluated scenarios                 needed for passengers to reach their destination is decreased in D2R2
                              Scenarios              RB     RS            RB and RS.
                         Base - no RB, no RS          no    no
         Baselines             RS only                no    yes            Passenger waiting time Figure 5a shows passenger waiting times
                        RS with fixed zones RB       yes*   yes
                           D2R2 RB only              yes    no
                                                                          for each scenario. By enabling ride-sharing in the baseline scenario
          D2R2                                                            (RS with fixed zones RB) or D2R2 (D2R2 RB and RS), we observed a
                          D2R2 RB and RS             yes    yes
         * Vehicle randomly chooses the rebalancing zone.                 significant reduction when compared to Base - no RB, no RS. RS only
                                                                          shows lower overall waiting times (3.761 minutes) when compared
   The first baseline scenario (Base - no RB, no RS) imitates the be-     to D2R2 RB only (4.218 min). However, ride-sharing can generate
haviour of a simple MoD system without ride-sharing or rebalanc-          additional delay for passengers as vehicles are following a detour to
ing. In the second baseline scenario (RS only) ride-sharing alone is      serve more requests as shown in Table 3. Rebalancing in addition can
enabled. Third baseline is a MoD with both ride-sharing and rebal-        limit the additional travelled distance due to detours as the vehicle
ancing, but the number, size, and boundaries of zones is fixed (simi-     relocates in zones with closer requests that can be matched. Hence,
larly to existing related work) and the zone for rebalancing is chosen    D2R2 RB only results as the best option for passengers with respect
randomly. We also evaluate two variations of D2R2. (D2R2 RB only)         to travel time.
uses D2R2 without ride-sharing while (D2R2 RB and RS) uses both
ride-sharing and rebalancing. This combination of scenarios allows         Passenger Distribution We can observe in Figure 5b that in Base
to evaluate the benefits of rebalancing overall as well as the benefits   - no RB, no RS many vehicles are driving with only a few passengers.
of D2R2-based rebalancing specifically.                                   These vehicles, once serving one or few requests, may end up in an
   For all simulations, we assumed each vehicle to have a capacity of     area of the network that is empty of any further request. Enabling
5 passengers, therefore requests with more passengers are ignored by      rebalancing or ride-sharing can prevent them from staying idle and
vehicles. This means that, based on the dataset used, the maximum         help vehicle to find new requests. This can be observed from sce-
level of service the fleet can reach is serving 96.63% of the 10,000      narios D2R2 RB only and RS only, where further improvements are
requests.                                                                 achieved by enabling ride-sharing and rebalancing, as all vehicles
                                                                          serve a similar number of passengers. In particular, D2R2 RB and RS
6.1    Evaluation metrics                                                 seems to converge to a higher average value when compared to base-
                                                                          line rebalancer RS with fixed zones RB. Moreover, as shown in Ta-
We rely on commonly used measures adopted by related work                 ble 3, the variance of the number of served passengers is also lower:
[1, 2, 9, 24] to evaluate the performance of D2R2 in the proposed         the combinations of these leads to conclude that the ROI for the indi-
scenarios. The overall performance of the MoD fleet is assessed by        vidual vehicle/agent, member of this network, is more consistent and
analysing the number/percentage of served requests and the percent-       stable (for a fixed number of vehicles serving shared rides).
age of requests that involved ride-sharing (as opposed to occupancy
of the vehicle only by one or more passengers from a single request).
                                                                           Vehicle mileage Distance travelled by agents is depicted in Fig-
From the passenger perspective, we evaluate the average detour ra-
                                                                          ure 5c. Base scenario Base - no RB, no RS shows that around one
tio (Dr ) (i.e., the percentage of extra travel time used to facilitate
                                                                          fourth of vehicles are travelling only a few kilometres, confirming
ride-sharing) and the average waiting time (wt). Dr is obtained by
                                                                          they only serve a few requests and then stay idle in an area with no
comparing actual time spent to reach the passenger destination serv-
                                                                          further demand. Also, since the number of served requests varies by
ing other RS requests, and the expected travel time for a direct trip
                                                                          scenario, an important difference in terms of distance travelled was
between the passenger origin and destination. wt represents the av-
                                                                          recorded. We further investigate travelled distance in Table 3. From
erage time passengers have to wait between creating a request and
                                                                          the Table, we can confirm that enabling rebalancing adds additional
being picked-up. The maximum waiting time per request is limited
                                                                          travel distance for empty vehicles.
to 15 minutes and after this the request is discarded from the system
and flagged as unserved.
   From the vehicles perspective, we recorded the distribution of the     6.3    Discussion
number of passengers per vehicle in the fleet and computed its vari-
ance (σpass). We also recorded the average distance travelled per         Simulation results show that enabling ride-sharing alone is enough to
vehicle (dt ).                                                            satisfy all of the requests possible to serve, when the baseline Base
                                                                          - no RB, no RS only serves 74%. We showed that D2R2 improves
                                                                          the average and individual performance of all vehicles when enabled
6.2    Simulation results
                                                                          in combination with ride-sharing (D2R2 RB and RS) compared to
This section presents the simulation results for all the scenarios.       Base - no RD, no RS. Passenger waiting time for (D2R2 RB and RS)
                                                             (a) Waiting time for request


                                                    (b) Number of passengers served per vehicle


                                                                   (c) Vehicle mileage

Figure 5: Comparison of simulation results for the implemented scenarios: waiting time (a), passengers distribution per vehicle (b) and distance
travelled per vehicle (c)

Table 3: All values refer to 10,000 requests served by a fleet com-           ing (RS with fixed zones RB and both R2D2 scenarios) generate ad-
posed by 200 5-seater vehicles                                                ditional travelled distance: total mileage is slightly increased due to
                                                                              empty vehicles travelling as a part of relocation process.
                    %Served                 wt      σ       dt
    Scenarios                    %RS                                Dr
                     Reqs.                 (sec)   pass    (km)
       Base          74.21          0       410    1926     88       1
      RS only        96.63         80       226    331     118.1    1.7
      RS with
                      96.63        70      210      82     143.6    1.5
  fixed zones RB
  D2R2 RB only          95         0       253     274     137       1
  D2R2 RB and
                      96.63        68      204      63     141.2    1.5
        RS

is almost halved when compared to a normal taxi cab service (Sce-
nario Base - no RB, no RS). However, the main observed advantage
of D2R2 (with RB and RS) is that the workload that each vehicle has
to carry out seems to better converge to a global average value (Fig-
ure 5b), resulting in fairer workload distribution. This is in contrast to
other scenarios, where we observed that a few agents are contributing                    Figure 6: Sample outcome of D2R2 clustering
more that the others and some vehicles can be under-utilized, serv-
ing only a few requests. The closest in terms of workload distribution           To illustrate the zone outcomes of D2R2 and how does it differ
to our approach is (RS with fixed zones RB), however the variance is          from fixed zone clustering, we here show a snapshot of the number,
still larger at 82 passengers vs 63 in our approach (as σpass in Table 3      size, and shape of the clusters it generated for a particular vehicle
shows). Interestly, fairness is not a metric existing work considered.        v at time t (Figure 6). In this instance, 10 relocation zones were
Our results show that this should be included as a standard measure           computed by D2R2. Cluster centres are represented by a cross and
when evaluating ride-sharing systems, along side other system, user           colours intensity indicates its rebalancing probability. However, the
and vehicle metrics. We also observe that approaches with rebalanc-           number of clusters throughout the simulation, based on the demand
                                                                               [3] Vincent Armant and Kenneth N. Brown, ‘Minimizing the driving dis-
                                                                                   tance in ride sharing systems’, 2014 IEEE 26th International Confer-
                                                                                   ence on Tools with Artificial Intelligence, 568–575, (2014).
                                                                               [4] Vincent Armant, Nahid Mabub, and Kenneth N. Brown, ‘Maximising
                                                                                   the number of participants in a ride-sharing scheme: Mip versus cp for-
                                                                                   mulations’, 2015 IEEE 27th International Conference on Tools with
                                                                                   Artificial Intelligence (ICTAI), 836–843, (2015).
 Figure 7: D2R2 clusters distribution during the simulated scenario            [5] Trapit Bansal, Jakub W. Pachocki, Szymon Sidor, Ilya Sutskever, and
                                                                                   Igor Mordatch, ‘Emergent complexity via multi-agent competition’,
                                                                                   ArXiv, abs/1710.03748, (2017).
at any particular time, ranged from 4 to 30, as illustrated in Figure 7,       [6] Gerda Claeskens and Nils Lid Hjort, The Bayesian information crite-
with majority being in range between 13 and 20.                                    rion, 70–98, Cambridge Series in Statistical and Probabilistic Mathe-
                                                                                   matics, Cambridge University Press, 2008.
                                                                               [7] Regina R Clewlow and Gouri Shankar Mishra, ‘Disruptive transporta-
7   CONCLUSION AND FUTURE WORK                                                     tion: The adoption, utilization, and impacts of ride-hailing in the united
                                                                                   states’, Technical report, (2017).
This paper presents a Dynamic Demand-Responsive Rebalancer                     [8] Daniel Fagnant and Kara Kockelman, ‘Dynamic ride-sharing and fleet
(D2R2), a novel vehicle rebalancing algorithm for ride-sharing                     sizing for a system of shared autonomous vehicles in austin, texas’,
mobility-on-demand systems. Unlike existing approaches which use                   Transportation, 45, (08 2016).
a fixed number and fixed geographical division of the network in               [9] Maxime Guériau and Ivana Dusparic, ‘Samod: Shared autonomous
                                                                                   mobility-on-demand using decentralized reinforcement learning’, in
zones to relocate empty vehicles, D2R2 uses EM clustering to dy-
                                                                                   2018 21st International Conference on Intelligent Transportation Sys-
namically generate zones. D2R2 enables zones to be dynamic in                      tems (ITSC), pp. 1558–1563. IEEE, (2018).
terms of their number, position and boundaries. First, rebalancing            [10] Maxime Guériau, Federico Cugurullo, Ransford A. Acheampong, and
zones are identified by analysing real-time pending requests at each               Ivana Dusparic, ‘Shared autonomous mobility-on-demand: Learning-
time step from each vehicle perspective. The zone to which a vehicle               based approach and its performance in the presence of traffic conges-
                                                                                   tion’, IEEE Intelligent Transportation Systems Magazine, (01 2020).
rebalances is then selected, among the defined zones according to a           [11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic
probability distribution defined over the zones. Thus, idle vehicles               optimization, 2014.
are spread across the area rather than rebalanced to the same zone(s).        [12] William Koch, Renato Mancuso, Richard West, and Azer Bestavros,
D2R2 effectiveness is shown by integrating it with 200 RL-based                    ‘Reinforcement learning for uav attitude control’, ACM Trans. Cyber-
                                                                                   Phys. Syst., 3(2), (February 2019).
ride-sharing vehicles, which serve 10,000 ride-sharing requests in the
                                                                              [13] Alexander Kuhnle, Michael Schaarschmidt, and Kai Fricke. Tensor-
lower Manhattan area. We compare D2R2 to approaches with no re-                    force: a tensorflow library for applied reinforcement learning. Web
balancing and fixed-zone rebalancing, and observe a more fair work-                page, 2017.
load division across the fleet when using D2R2, indicating a more             [14] Zhipeng Liang, Kangkang Jiang, Hao Chen, Junhao Zhu, and Yan-
accurate rebalancing strategy, without loss of performance with re-                ran Li, ‘Deep reinforcement learning in portfolio management’, ArXiv,
                                                                                   abs/1808.09940, (2018).
spect to waiting time and distribution of passengers per vehicle.             [15] Yang Liu and Samitha Samaranayake, ‘Proactive rebalancing and
   This work can be expanded in multiple directions. To verify                     speed-up techniques for on-demand high capacity vehicle pooling’,
its general applicability, it should be integrated with other RS ap-               CoRR, (2019).
proaches, and evaluated on other road network maps and datasets.              [16] Guilherme Lopes, Murillo Ferreira, Alexandre Simões, and Esther
                                                                                   Colombini, ‘Intelligent control of a quadrotor with proximal policy op-
In terms of improving the underlying PPO learning process, vehicles
                                                                                   timization reinforcement learning’, pp. 503–508, (11 2018).
could be enabled with further online learning, to fine-tune their be-         [17] Dennis Luxen and Christian Vetter, ‘Real-time routing with open-
haviours to new request patterns as they arise. Rebalancing could be               streetmap data’, in 19th ACM SIGSPATIAL International Symposium on
further improved by taking into account vehicle position, for estimat-             Advances in Geographic Information Systems, ACM-GIS 2011, Novem-
ing travel time, when selecting the cluster to which to relocate. Thus,            ber 1-4, 2011, Chicago, IL, USA, Proceedings, pp. 513–516, (2011).
                                                                              [18] A. Rupam Mahmood, Dmytro Korenkevych, Gautham Vasan, William
probability computed by Equation 7 would be conditioned by vehi-                   Ma, and James Bergstra, ‘Benchmarking reinforcement learning algo-
cle position and then normalized. Additionally to be more precise,                 rithms on real-world robots’, in CoRL, (2018).
it could integrates real-time traffic congestion data. It could further       [19] Katarzyna Marczuk, Harold Soh, Carlos Lima Azevedo, Der-Horng
be integrated with existing approaches, as reviewed in Related Work,               Lee, and Emilio Frazzoli, ‘Simulation framework for rebalancing of au-
                                                                                   tonomous mobility on demand systems’, MATEC Web of Conferences,
which use historical data to predict future demand, and tune cluster
                                                                                   81, 01005, (01 2016).
probabilities accordingly.                                                    [20] T. K. Moon, ‘The expectation-maximization algorithm’, IEEE Signal
                                                                                   Processing Magazine, 13(6), 47–60, (Nov 1996).
                                                                              [21] NYC Taxi and Limousine Commission. Tlc trip record data, 2020.
ACKNOWLEDGEMENTS                                                              [22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford,
                                                                                   and Oleg Klimov, ‘Proximal policy optimization algorithms’, CoRR,
This publication has emanated from research supported in part by a                 abs/1707.06347, (2017).
research grant from Science Foundation Ireland (SFI) under Grant              [23] Richard S. Sutton and Andrew G. Barto, Introduction to Reinforcement
Number 18/CRT/6223, Center for Research Training in Artificial In-                 Learning, MIT Press, Cambridge, MA, USA, 1st edn., 1998.
telligence.                                                                   [24] A. Wallar, M. Van Der Zee, J. Alonso-Mora, and D. Rus, ‘Vehicle re-
                                                                                   balancing for mobility-on-demand systems with ride-sharing’, in 2018
                                                                                   IEEE/RSJ International Conference on Intelligent Robots and Systems
REFERENCES                                                                         (IROS), pp. 4539–4546, (Oct 2018).
                                                                              [25] Jian Wen, Jinhua Zhao, and Patrick Jaillet, ‘Rebalancing shared
[1] Abubakr Alabbasi, Arnob Ghosh, and Vaneet Aggarwal, ‘Deeppool:                 mobility-on-demand systems: A reinforcement learning approach’, in
    Distributed model-free algorithm for ride-sharing using deep reinforce-        2017 IEEE 20th International Conference on Intelligent Transportation
    ment learning’, (03 2019).                                                     Systems (ITSC), pp. 220–225. IEEE, (2017).
[2] Javier Alonso-Mora, Samitha Samaranayake, Alex Wallar, Emilio Fraz-       [26] Yunqi Zhao, Igor Borovikov, Jason Rupert, Caedmon Somers, and Ah-
    zoli, and Daniela Rus, ‘On-demand high-capacity ride-sharing via dy-           mad Beirami, ‘On multi-agent learning in team sports games’, (2019).
    namic trip-vehicle assignment’, Proceedings of the National Academy
    of Sciences, 114(3), 462–467, (2017).