Improved Trip Planning by Learning from Travelers' Choices

Boris Chidlovskii  BCHIDLOVSKII@XRCE.XEROX.COM
Xerox Research Center Europe, 6 chemin Maupertuis, 38240 Meylan, France

Abstract

We analyze the work of urban trip planners and the relevance of the trips they recommend upon user queries. We propose to improve the planner recommendations by learning from the choices made by travelers who use the transportation network on a daily basis. We analyze individual travelers' trips and convert them into pairwise preferences for traveling from a given origin to a destination at a given time point. To address the sparse and noisy character of raw trip data, we model passenger preferences with a number of smoothed time-dependent latent variables, which are used to learn a ranking function for trips. This function can be used to re-rank the planner's top recommendations. Results of tests for the cities of Nancy, France and Adelaide, Australia show a considerable increase in the recommendation relevance.

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright 2015 for this paper by its authors. Copying permitted for private and academic purposes.

1. Introduction

Most cities and agglomerations around the world propose their trip planners, in the form of a web or mobile application. Upon a user travel request, they recommend trips using a static library of roads, public transportation networks and services. Although these planners are increasingly reliable in their knowledge of the transportation network and the available services, they all share the same static-world assumptions. In particular, they make a general assumption of constancy and universality (Letchner et al., 2006): that the optimal trip is independent of the time of day of the actual journey and of the passengers' preferences.

In reality, constancy and universality rarely hold. Most urban travelers can verify that the best trip between work and home at midnight is not necessarily the best choice between the same locations at 8am. Similarly, different passengers may choose different ways to travel between the same origin and destination points.

While personal knowledge plays an important role, in many cases passengers simply have different preferences about trip planning. For example, one passenger may avoid multiple changes, extending the duration of her journey by a few minutes, while another simply wants to arrive at the destination as quickly as possible.

When a user queries a planner for a journey from origin o to destination d starting at time t_s, there are often a large number of trips satisfying the query. Planners are designed to provide the k-top recommendations according to a set of predefined criteria, such as the minimal transfer time, the minimal number of changes, etc. Their work is similar to that of any information retrieval system, where the goal is to place the most relevant documents among the k-top answers. Therefore, it is highly desirable that a trip planner behaves intelligently and suggests k-top trips which reflect the real passengers' preferences.

In this paper we closely analyze the cases of divergence between the planner recommendations and the real choices made by urban travelers. We collect two sets of individual trips extracted from fare collection systems in the cities of Nancy, France and Adelaide, Australia (see Figure 1). We compare these data to the city planners' recommendations; in the case of divergence, we propose a novel method to rank the trips that better reflects reality.

Figure 1. Trip planners of Nancy (left) and Adelaide (right).

Our method relies on two main contributions. First, we consider any individual trip as a set of explicit preferences made by the traveler during the trip. We use this set of pairwise preferences to learn a ranking function for trips. This function is then used on top of the trip planner, to re-rank the k-top recommendations. Second, we model passenger preferences for choosing a specific service or change point in a way that reflects their dynamic nature. To address the sparse and noisy character of the raw trip data, we model the user preferences by a set of dynamic latent variables. We estimate these variables by a smoothed dynamic non-negative factorization of service and transit counts.

The remainder of this paper is organized as follows. In Section 2 we briefly review the state of the art in urban trip planning. Section 3 introduces the trip ranking problem by analyzing individual trips for the Nancy city case. Learning to rank for trip planning is presented in Section 4. Section 5 then proposes to model user preferences by dynamic latent variables and develops an estimation method by smoothed dynamic non-negative factorization of service and transit counts. In Section 6, we report the results of the evaluation of trip re-ranking on two city datasets. Section 7 concludes the paper.

2. Prior Art

Trip planners. Public transport (PT) trip planners are designed to provide information about available journeys in the transport system. The application prompts a user to input an origin o, a destination d and a departure time t_s (or an arrival time t_f); it then deploys a trip planning engine to find a sequence of available PT services from o to d starting at time t_s (or ending at time t_f).

Trip planners often retrieve multiple trips for a user query. They typically use a variation of the time-dependent shortest path algorithm to search a graph of nodes (representing access points to the network) and edges (representing possible journeys between points) (Casey et al., 2014). Different weightings such as distance, cost or accessibility are often associated with each edge and node. Search may be optimized on different criteria, for example, the fastest, least-changes or cheapest trips (Pelletier et al., 2009).

Planning high quality realistic trips remains difficult for several reasons (McGinty & Smyth, 2000). First, available General Transit Feed Specification (GTFS) sources rarely contain all the information useful for constructing realistic plans. Second, the notion of "service quality" is difficult to define and is likely to change from person to person. Consequently, in real-world trip planning, the shortest trip is rarely the best one for a given user.

Multiple efforts have been made to improve trip planning (Lathia & Capra, 2011; Liebig et al., 2014; Mokhtari et al., 2009; Trepanier et al., 2005; Yuan et al., 2011). Analysis of trip planner log files (Trepanier et al., 2005) can help improve transit service by providing better knowledge of transit users. Log files were useful for identifying new locations to be assessed for a better understanding of user behaviors, and for guiding updates of the PT information system.

Personalization of trip planning takes user preferences into account and tries to identify the best trips among a set of possible answers. In (Mokhtari et al., 2009), fuzzy set theory was used to model complex user preferences. A typology of preferences was proposed to explicitly express the preferences and integrate them into a query language. Trip personalization by mining public transport data has been addressed in (Lathia & Capra, 2011). It established a relation between urban mobility and fare purchasing habits in the London public transport network (Seaborn et al., 2010), and proposed personalized ticket recommendations based on estimated future travel patterns, matching travelers to the best fare.

Integrating real-time information into trip planners has been another research trend. (Yuan et al., 2011) presented a cloud-based system computing customized and practically fast driving routes for an end user, using (historical and real-time) traffic conditions and driver behavior. GPS-equipped taxicabs are used as mobile sensors constantly probing the traffic rhythm of the city, as well as the taxi drivers' intelligence in choosing driving directions. Real-time trip planning has also been extended to multi-modality (Casey et al., 2014; Seaborn et al., 2010). It used data from GPS-enabled vehicles to produce more accurate plans in terms of time and transit vehicles, and incorporates the delays into the transit network in real time to minimize the gap with respect to the prediction model.

Learning to Rank. In document retrieval, ranking documents according to their degree of relevance to a query has been the key question for decades. Much effort has been placed on developing document ranking functions. Early methods used a small number of document features (e.g., term frequency, inverse document frequency, and document length), with an empirical tuning of the ranking function parameters. To avoid manual tuning, document retrieval was proposed to be regarded as learning to rank (Burges et al., 2005; 2006; Cao et al., 2006; Liu, 2011). Click-through data are used to deduce pairwise training data for learning ranking functions.

In learning to rank, a number of categories are given and a total order is assumed to exist over the categories. Labeled instances are provided; each instance is represented by a feature vector, and each label denotes a rank. Existing methods can be categorized as point-wise, pair-wise and list-wise (Liu, 2011). In point-wise methods, each instance with its rank is used as an independent training example. The goal of learning is to correctly map instances into intervals.
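The reduction from graded labels to pairwise training data can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name and the list-based representation are our own, and any pair ordering convention would do.

```python
from itertools import combinations

def preference_pairs(instances, labels):
    """Convert graded relevance labels for the answers to one query
    into pairwise preference examples (x_i, x_j), x_i ranked above x_j."""
    pairs = []
    for (xi, yi), (xj, yj) in combinations(zip(instances, labels), 2):
        if yi > yj:
            pairs.append((xi, xj))   # xi preferred over xj
        elif yj > yi:
            pairs.append((xj, xi))   # xj preferred over xi
        # ties produce no preference pair
    return pairs

# e.g. preference_pairs(["a", "b", "c"], [2, 0, 1])
#      yields [("a", "b"), ("a", "c"), ("c", "b")]
```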
In pair-wise methods, each instance pair is used as a training example and the goal of training is to correctly find the difference between the ranks of the instances in a pair; ranking is transformed into pairwise classification or pairwise regression (Herbrich et al., 2000). This model formalizes learning to rank as learning a classifier on pairs of instances and can deploy any classification method. In list-wise methods, the loss function is defined on a ranked list with respect to a query (Xia et al., 2008).

3. Individual trips analysis

We consider a public transportation system that offers a number of services (buses, trams, trains, etc.) to urban travelers. An individual passenger trip J represents a sequence of PT services and changes between the services. The service legs of J form a sequence S_J = {l_1, ..., l_n}, n ≥ 1, where leg l_i is a tuple (s_i, b_i, a_i, t_{b_i}, t_{a_i}): s_i is a service identifier (a bus number, for example); b_i and a_i are the boarding and alighting stops; t_{b_i} and t_{a_i} are the boarding and alighting timestamps. A trip is direct if n = 1, and transit otherwise. A transit trip includes n − 1 changes, which refer to waiting and/or walking between the services. The sequence of changes is defined as C_J = {c_1, ..., c_{n−1}}, n ≥ 1, where c_i is uniquely defined by two successive service legs l_i and l_{i+1}, as c_i = (a_i, b_{i+1}, t_{a_i}, t_{b_{i+1}}).

We make the following association between individual trips and trip recommendations. We consider a trip J as an explicit answer to an implicit travel query Q = (o = b_1, d = a_n, t_s = t_{b_1}) or Q = (o = b_1, d = a_n, t_f = t_{a_n}).

We analyze sets of individual trips collected from the automated fare collection systems (Mezghani, 2008) installed in Nancy, France and Adelaide, Australia; we mined these data to understand how passengers' choices differ from the planner recommendations.

For every pair of locations (o, d) in a network, we extract all real trips from o to d and analyze their travel time distribution. Figure 2.a shows the distribution of the minimal versus the average travel time for every (o, d) pair in Nancy. The high density zone suggests that the average travel time is far longer than the minimal time which is conventionally assumed by the planners.

The trip datasets expose a very large variety of paths for any (o, d) pair; the maximum number of different paths observed is 46 for Nancy and 37 for Adelaide, and the average number of paths between two locations is 2.71 and 3.12, respectively. We measure the uncertainty of choosing one or another path from an origin o to a destination d by the Kullback-Leibler divergence KL(q||p) of the trip distribution q from the uniform distribution p. Higher KL values indicate higher certainty and a clear domination of one trip over the others. Figure 2.b plots the KL divergence values for all (o, d) pairs in Nancy on a log-log scale. Again, the high density zone suggests that a large part of the (o, d) pairs is dominated not by one but by 2 to 5 different paths of high frequency.

Figure 2. a) Minimal travel time vs average travel time. b) Trip uncertainty.

It is important to recall that travelling preferences change during the day. Figure 3.a shows the 5-top transit trips for an example (o, d) location pair in Nancy. Figure 3.b shows the travel time and average trip counts for the 5-top trips for this example. All 5 trips are transit ones with one change. The figure reveals how the user preferences vary during the day. The trip planner recommends the trip shown in red for the fastest-trip query. First, this recommended trip is neither the fastest nor the most frequent one. Second, the trip shown in green is the most frequent during lunch time, despite being far from fast.

Figure 3. a) 5-top trips for one (origin, destination) pair in Nancy. b) The travel time and trip count distributions for the top 5 trips.

Figure 4 gives a more general picture. It shows the 240 most frequent (o, d) pairs in Nancy. For each pair, Figure 4.a uses different colors to show the changing user preferences. The most frequent trip is colored in dark blue. The second, third, fourth and fifth preferences are shown in blue, green, orange and brown, respectively. Trips are sorted by the distance between the origin and the destination (see Figure 4.b). Short trips expose a higher variability than longer ones. As the figure shows, the second choices are more visible (blue color) during the morning rush hours. Figure 4.c shows the trip planner recommendations for the same pairs. The recommendations are static and do not reflect the user preferences.

Figure 4. a) Changing user preferences for the most frequent (o,d) pairs in Nancy. b) Trip distances. c) Trip recommendations by the planner.

We conclude this section with Figure 5, which shows how the user preferences vary between the PT services. It presents the total passenger counts for all Nancy change points at 8am, 1pm and 6pm.

Figure 5. Change counts in Nancy at 8am, 1pm and 6pm.

4. Learning to rank trips

When a passenger travels from an origin o to a destination d at time t_s, she implicitly prefers the trip J she takes to all other trips J′, J′ ≠ J. Our approach is to transform this implicit feedback into an explicit set of pairwise trip preferences and to learn the ranking function f from them.

Algorithm 1 below uses the trip planner and a set T of individual passengers' trips. For any trip J ∈ T matching the query Q = (o, d, t_s), the algorithm retrieves the k-top candidates for Q and retains that J has been preferred to any of these candidates, except J itself if it happens to be in this set. A real trip J matches a recommended trip J′ if it has the same number of legs and follows the same sequence of services. If S_J = {l_1, ..., l_n} and S_{J′} = {l′_1, ..., l′_n}, then J matches J′ iff s_i = s′_i ∧ b_i = b′_i ∧ a_i = a′_i, for all i = 1, ..., n.

Algorithm 1 Rank learning algorithm.
Require: Collection T of passenger trips J = (S, C)
Require: Trip planner P with k-top recommendations
1: S ← ∅ {the set of pairwise preferences}
2: for each J ∈ T do
3:   Form the query Q = (o = b_1, d = a_n, t_s = t_{b_1})
4:   Query the planner P with query Q
5:   Retrieve the k-top trips as a list L
6:   for each J′ ∈ L, J′ ≠ J do
7:     Add (Q, x(J) ≻ x(J′)) to S
8:   end for
9: end for
10: Learn the ranking model f from S
Ensure: f

Once the ranking function f is learned, it can be used to improve the relevance of the trip planner recommendations according to the re-ranking scenario. The trip planner does not change the way it works: for a new user query Q, the trip planner first generates the k-top candidate trips, and these candidates are then re-ranked using the function f.

To learn a ranking function f, Algorithm 1 requires every trip J to be described by a feature vector x(J). In the following sections, we first describe a method for learning the ranking function f and then show how to extract relevant and dynamic features from individual trips.

4.1. Gradient Boosting Rank

We use the individual trips to form a set of pairwise preferences from which a ranking function f can be learned. For each individual trip J ∈ T, we generate a set of labeled data (x_{i,1}, y_{i,1}), ..., (x_{i,m_i}, y_{i,m_i}), i = 1, ..., |T|, which are preference pairs of feature vectors. If x_{i,j} has a higher rank than x_{i,k} (y_{i,j} > y_{i,k}), then x_{i,j} ≻ x_{i,k} is a preference pair, which means that x_{i,j} is ranked ahead of x_{i,k}. The preference pairs can be viewed as instances and labels in a new classification problem, where x_{i,j} ≻ x_{i,k} is a positive instance.

Any classification method can be used to train a classifier f(x), which is then used for ranking. Trips are assigned scores by f(x) and sorted by the scores.
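The preference-extraction loop of Algorithm 1 can be sketched as follows. This is a minimal illustration under assumed interfaces: the trip/leg record layout, the `planner(o, d, ts)` callable and the `features` function are placeholders standing in for the real planner API and the feature extraction of Section 5.

```python
from collections import namedtuple

# Assumed minimal trip representation (illustrative, not the paper's format)
Leg = namedtuple("Leg", "service board_stop alight_stop board_time")
Trip = namedtuple("Trip", "legs")

def matches(J, Jp):
    """J matches J' iff same number of legs and the same sequence of
    services with the same boarding/alighting stops."""
    return len(J.legs) == len(Jp.legs) and all(
        a.service == b.service
        and a.board_stop == b.board_stop
        and a.alight_stop == b.alight_stop
        for a, b in zip(J.legs, Jp.legs))

def extract_preferences(trips, planner, features, k=5):
    """Algorithm 1 (sketch): each observed trip J is recorded as preferred
    over every k-top planner candidate J' that does not match J."""
    S = []  # pairwise preferences: (x(J), x(J')) with x(J) preferred
    for J in trips:
        o = J.legs[0].board_stop
        d = J.legs[-1].alight_stop
        ts = J.legs[0].board_time
        for Jp in planner(o, d, ts)[:k]:   # k-top recommendations for Q
            if not matches(J, Jp):         # skip J itself if recommended
                S.append((features(J), features(Jp)))
    return S
```

The resulting list S is exactly the pairwise training set fed to the ranker in Section 4.1.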
Learning a good ranking model is thus realized by training a model for pairwise classification. The loss function in learning is pairwise because it is defined on pairs of feature vectors.

The pairwise approach is adopted in many methods, including Ranking SVM (Herbrich et al., 2000), RankBoost (Freund et al., 2003), RankNet (Burges et al., 2005), IR SVM (Tsai et al., 2007), GBRank (Zheng et al., 2007), LambdaRank (Burges et al., 2006), and others. In the following we adopt GBRank, one of the popular pairwise methods currently in use.

GBRank takes preference pairs as training data, {x_i^1, x_i^2}, x_i^1 ≻ x_i^2, i = 1, ..., N, and uses the parametric pairwise loss function

L(f) = (1/2) Σ_{i=1}^N (max{0, τ − (f(x_i^1) − f(x_i^2))})^2,

where f(x) is the ranking function and τ is a parameter, 0 < τ ≤ 1. The loss is 0 if f(x_i^1) is larger than f(x_i^2) + τ; otherwise, the incurred loss is (1/2)(f(x_i^2) − f(x_i^1) + τ)^2.

To optimize the loss function with respect to the training instances, Functional Gradient Descent is deployed. Treating all f(x_i^1), f(x_i^2), i = 1, ..., N as variables, the gradient of L(f) with respect to the training instances is

(−max{0, f(x_i^2) − f(x_i^1) + τ}, max{0, f(x_i^2) − f(x_i^1) + τ}), i = 1, ..., N.

If f(x_i^1) − f(x_i^2) ≥ τ, the corresponding loss is zero, and there is no need to change the ranking function. If f(x_i^1) − f(x_i^2) < τ, the loss is non-zero, and the ranking function is updated using gradient descent:

f_k(x) = f_{k−1}(x) − ν ∇L(f_k(x)),

where f_k(x) and f_{k−1}(x) denote the values of f(x) at the k-th and (k−1)-th iterations, respectively, and ν is the learning rate.

At the k-th iteration of learning, GBRank collects all the pairs with non-zero losses, {(x_i^1, f_{k−1}(x_i^2) + τ), (x_i^2, f_{k−1}(x_i^1) − τ)}, and employs Gradient Boosting Trees (Friedman, 2000) to learn a regression model g_k(x) that makes predictions on these regression data. The learned model g_k(x) is then linearly combined with the existing model f_{k−1}(x) to create the new model f_k(x) as follows:

f_k(x) = (k f_{k−1}(x) + β_k g_k(x)) / (k + 1),

with β_k a shrinkage factor (Zheng et al., 2007).

5. Trip feature extraction

We now describe each real trip J by a set of relevant and dynamic features x(J). There may exist explicit and implicit factors which influence the passenger choice. Passengers make their choices as a function of location and time.

We distinguish two groups of trip features. First, global features describe the whole trip; they include the travel time, the number of changes, the usage of specific types of transport (bus, train, tram, etc.), multi-modality, etc. Second, much more relevant and specific are local features that describe each service leg and change composing a given trip. For each PT service, we may extract the estimated mean and variance of the speed when using this line at this time period, and the average delay with respect to the schedule. For each change point, we can estimate the walking distance, if any, the closeness to a commercial zone or a transportation hub, etc.

Unfortunately, the raw service and change counts are generally sparse, noisy and prone to many errors. The main reasons for errors are the incorrect setup of ticket validation machines, the lack of alignment between ticket validation machines and GPS localization, and card misuse by travelers. We therefore intend to extract from these sparse and noisy counts latent features able to represent the user preferences and their dynamic character.

We split all trips J ∈ T into two collections of service and change observations, A^s = {l_i | l_i ∈ S_J, J ∈ T} and A^c = {c_i | c_i ∈ C_J, J ∈ T}. In the following, for brevity, we assume we work with a set of observations A; it may denote the service or change observations, or their sum.

Splitting all observations in A into T time periods, we obtain a sequence of count matrices A_t ∈ R_+^{p×p}, t = 1, ..., T, where entry a_{ij} is the service or change count during period t, and p is the number of stops.

The full diagram of latent feature extraction from individual trips and learning of the ranking function is given in Figure 6.

Figure 6. Preference features and re-ranking function learning.

5.1. Collapsed matrices

We first consider the static case when T = 1 and all observations from A are collapsed into one matrix A.

Both service and change data are sparse non-negative counts, and we can use non-negative matrix factorization (NNMF) as a method giving a robust low-rank interpretation of the data (Lee & Seung, 2001). The factorizations can be efficiently computed by formulating a penalized optimization problem and using modern gradient-descent algorithms (Hoyer, 2004).

Matrix A is approximated by a product of two low-rank matrices, estimated through the following minimization:

min_{U≥0, V≥0} ||A − U V^T||_F^2,

where U and V are p × K non-negative matrices. The rank or dimension K of the approximation corresponds to the number of latent factors; it is chosen to obtain a good data fit and interpretability. U gives the latent factors for the origin stops and V for the destination stops.

The factorized matrices are obtained by minimizing an objective function that consists of a goodness-of-fit term and a roughness penalty:

min_{U≥0, V≥0} ||A − U V^T||_F^2 + λ(||U||_1 + ||V||_1),    (1)

where the parameter λ ≥ 0 indicates the penalty strength; a larger penalty encourages sparser matrices U and V. Adding penalties to NNMF is a common strategy, since they not only improve interpretability but often improve the numerical stability of the estimation.

5.2. Smoothed Dynamic NNMF

In the general case T > 1, we have a sequence of matrices {A_t}, t = 1, ..., T. To produce a sequence of low-rank matrix factorizations {U_t, V_t}, t = 1, ..., T, we can extend the factorization in (1) to the case T > 1 by independent factorization of the T matrices {A_t}. However, we additionally impose a smoothness constraint on both U_t and V_t, in order to force the latent factors to be similar to those of the previous time periods, in both boardings and alightings. The objective function then becomes

min_{U_t≥0, V_t≥0} Σ_{t=1}^T ||A_t − U_t V_t^T||_F^2 + µ Σ_{t=2}^T (||U_t − U_{t−1}||_F^2 + ||V_t − V_{t−1}||_F^2) + λ Σ_{t=1}^T (||U_t||_1 + ||V_t||_1),    (2)

where the parameters λ, µ are set by the user. The objective function imposes smoothing of U_t and V_t over two successive time periods, but it can be generalized to a larger window. To estimate the matrices U_t and V_t, we use an extended version of the multiplicative updating algorithm for NNMF (Gillis & Glineur, 2012; Lee & Seung, 2001; Mankad & Michailidis, 2013), based on an adaptive gradient descent.

Temporal extensions of matrix factorization techniques have been studied in (Elsas & Dumais, 2010; Mankad & Michailidis, 2013; Saha & Sindhwani, 2012; Sun et al., 2014). (Elsas & Dumais, 2010) analyzed the temporal dynamics of Web document content; to improve relevance ranking, it developed a probabilistic document ranking algorithm that allows differential weighting of terms based on their temporal characteristics. (Sun et al., 2014) addressed recommendation systems with significant temporal dynamics; it developed the collaborative Kalman filter, which extends probabilistic matrix factorization in time through a state-space model. Community detection in time-evolving graphs is analyzed in (Mankad & Michailidis, 2013); the latent structure of overlapping communities is discovered through sequential matrix factorization.

To solve (2), we follow (Mankad & Michailidis, 2013) and consider the following Lagrangian:

L = Σ_{t=1}^T ||A_t − U_t V_t^T||_F^2 + µ Σ_{t=2}^T (||U_t − U_{t−1}||_F^2 + ||V_t − V_{t−1}||_F^2) + Σ_{t=1}^T (λ(||U_t||_1 + ||V_t||_1) + Tr(Φ U_t) + Tr(Ψ V_t)),    (3)

where Φ, Ψ are Lagrange multipliers. The method works as an adaptive gradient descent converging to a local minimum; the Karush-Kuhn-Tucker (KKT) conditions give the necessary conditions for convergence. The KKT optimality conditions are obtained by setting ∂L/∂U_t = 0 and ∂L/∂V_t = 0, t = 1, ..., T. It can be shown that they yield

Φ_t = −2 A_t V_t + 2 U_t V_t^T V_t − 2µ(U_{t−1} − U_t) + 2λ,
Ψ_t = −2 A_t^T U_t + 2 V_t U_t^T U_t − 2µ(V_{t−1} − V_t) + 2λ,    (4)

which after matrix algebra manipulations lead to the multiplicative updating rules presented in Algorithm 2.

Algorithm 2 Smoothed Dynamic NNMF algorithm.
Require: Matrices A_t, t = 1, ..., T; constants λ, µ
1: Initialize U_t, V_t as dense, positive random matrices
2: repeat
3:   for t = 1, ..., T do
4:     U_t ← U_t ⊙ (A_t V_t + µ U_{t−1}) ⊘ (U_t V_t^T V_t + µ U_t + λ)
5:     V_t ← V_t ⊙ (A_t^T U_t + µ V_{t−1}) ⊘ (V_t U_t^T U_t + µ V_t + λ)
6:   end for
7: until convergence
Ensure: U_t, V_t, t = 1, ..., T
(⊙ and ⊘ denote elementwise multiplication and division; the smoothing terms are omitted for t = 1.)

The convergence of the multiplicative updating algorithm is often reported to be slow. In practice we obtain meaningful factorizations after a handful of iterations, which we tend to explain by the sparseness of the input matrices A_t. In the future, when working with dense data, faster methods like the active set version of the alternating non-negative least squares (ANLS) algorithm (Kim & Park, 2008) will be more appropriate.

5.3. Dynamic trip features

Algorithm 2 finds sparse factorized matrices for a sequence of input matrices A_t, t = 1, ..., T. We first apply the algorithm to the sequences of service matrices A_t^s and change matrices A_t^c extracted from the full trip collection. We thus obtain smoothed factorized matrices U_t^s, V_t^s and U_t^c, V_t^c, t = 1, ..., T, for the services and changes, respectively. At time period t, a boarding stop b has latent factors given by the corresponding row in U_t^s; this row is denoted U_t^s(b). For an alighting stop a, the row V_t^s(a) gives the latent factors at time t. We then apply the algorithm to the sum matrices A_t^f = A_t^c + A_t^s, t = 1, ..., T; the smoothed factorized matrices for A_t^f are denoted U_t^f, V_t^f.

To generate a feature vector x for a trip J, we use its decomposition into service legs and changes, J = (S, C). The vector x(J) is then composed of a general feature vector x^g and four latent components, x(J) = {x^g, x_b^s, x_a^s, x_b^c, x_a^c}, where

• x_b^s, x_a^s are latent feature vectors averaged over the trip boarding and alighting places, respectively:

x_b^s = (1/n) Σ_{i=1}^n U^s_{t_{b_i}}(b_i);  x_a^s = (1/n) Σ_{i=1}^n V^s_{t_{a_i}}(a_i);

• x_b^c, x_a^c are latent feature vectors averaged over the change places (alighting and boarding), respectively:

x_b^c = (1/(n−1)) Σ_{i=1}^{n−1} U^c_{t_{b_i}}(b_i);  x_a^c = (1/(n−1)) Σ_{i=1}^{n−1} V^c_{t_{a_i}}(a_i).

In the case of the sum latent matrices U_t^f, V_t^f, x(J) is composed of the general feature vector x^g and two latent components, x(J) = {x^g, x_b^f, x_a^f}, obtained from U_t^f and V_t^f.

6. Evaluation

To test our method for learning a ranking function from individual trips, we processed 5.2M individual trips collected in Nancy, France during 3 months in 2012. The Nancy PT network includes 1129 nodes/stops and offers 107 bus and tram services to travelers. We also processed 12.5M trips from Adelaide, Australia collected during 2.5 months in 2013. The Adelaide network offers 312 bus and tram service variations and accounts for 3524 stops.

To evaluate the impact of modeling user preferences from actual trips, we selected the 240 most frequent origin-destination pairs in Nancy (see Figure 4) and the 160 most frequent pairs in Adelaide.

When generating the temporal sequences of count matrices, we test two cases, T = 24 and T = 48, where each matrix aggregates the passenger counts over one hour or 30 minutes, respectively. Once a matrix sequence is generated, each matrix is randomly split into 70% training data and the remaining 30% for testing. All results below are means and variances over 10 independent runs.

We retrieved the trip planner recommendations for Nancy (http://www.reseau-stan.com/) and Adelaide (https://www.adelaidemetro.com.au/). We learn the ranking function and use it to re-rank the trip recommendations, using the different options described in the previous sections. To understand the effect of raw count factorization, we consider several options. First, we collapse the matrices, thus disregarding the temporal aspect. Second, we consider either the service matrices A_t^s and the change matrices A_t^c separately, or sum them up, A_t^f = A_t^s + A_t^c, before the factorization. Third, we study the effect of temporal smoothing, where the factorization is done either independently or by smoothing over successive time periods. Finally, we test different values of K for the factorization.

In all experiments with GBRank (see Section 4.1), parameter τ was set to 0.3 and the shrinkage factors β_k to 0.8. For the smoothed dynamic NNMF, the optimal values of µ and λ were determined by cross-validation. For evaluating the results of the ranking methods, we use a measure commonly used in information retrieval, the Normalized Discounted Cumulative Gain (NDCG). We report NDCG at rank 1, which measures the error rate of the 1-top recommendation with respect to the perfect ranking.

Method                        | Nancy T = 24 | Nancy T = 48 | Adelaide T = 24 | Adelaide T = 48
Baseline: Trip Planner        | 24.91 ± 1.20 | 24.91 ± 1.28 | 38.17 ± 2.28    | 38.17 ± 2.28
Collapsed: Services           | 24.73 ± 1.17 | 24.73 ± 1.22 | 29.97 ± 2.11    | 29.97 ± 2.11
Collapsed: Changes            | 19.69 ± 1.01 | 19.69 ± 1.09 | 28.63 ± 2.29    | 28.63 ± 2.29
Collapsed: Services+Changes   | 19.30 ± 1.14 | 19.30 ± 1.03 | 28.05 ± 2.32    | 28.05 ± 2.32
Collapsed: Sum                | 19.59 ± 1.13 | 19.59 ± 1.10 | 28.17 ± 2.18    | 28.17 ± 2.18
Indep: Services               | 14.08 ± 0.92 | 15.33 ± 0.97 | 25.33 ± 2.07    | 24.87 ± 1.87
Indep: Changes                |  9.55 ± 0.90 |  9.89 ± 0.86 | 23.89 ± 1.67    | 23.93 ± 1.75
Indep: Services+Changes       |  9.52 ± 0.89 |  9.41 ± 0.87 | 22.41 ± 1.72    | 22.15 ± 1.55
Indep: Sum                    | 10.42 ± 0.89 |  9.37 ± 0.86 | 22.37 ± 1.56    | 23.55 ± 1.59
Smooth: Services              |  9.22 ± 0.77 |  9.37 ± 0.78 | 15.37 ± 1.38    | 14.71 ± 1.24
Smooth: Changes               |  6.71 ± 0.82 |  6.69 ± 0.74 | 16.69 ± 1.24    | 16.69 ± 1.15
Smooth: Services+Changes      |  5.83 ± 0.81 |  6.12 ± 0.79 | 14.12 ± 1.29    | 13.63 ± 1.14
Smooth: Sum                   |  7.63 ± 0.79 |  7.05 ± 0.81 | 15.05 ± 1.41    | 14.43 ± 1.32

Table 1. NDCG@1 values for 12 methods and two cities.

Table 1 reports the evaluation results for 12 different methods and compares them to the trip planner baseline for both cities. The analysis of these results provides some interesting insights. First, the results are globally better for the smaller Nancy than for the bigger Adelaide, for both T = 24 and T = 48. Second, the collapsed matrices improve somewhat over the baseline, but only taking the temporal user preferences into account really boosts the performance; moreover, the smoothed matrix factorization improves considerably over the independent one. Third, the change latent variables appear to be more relevant than the service ones. Instead, using the sum counts performs worse than keeping the service and change variables separate; we tend to explain this by the heterogeneity of service and change preferences.

Figure 7. Independent and smoothed predictions vs. the number of latent variables.

Figure 8. NDCG@1: independent and smoothed predictions during the day.

Figure 7 shows the performance of the 3 independent and the 3 smoothed methods for T = 24 for Nancy, with the number of latent variables K varying between 2 and 30.
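For concreteness, the multiplicative updates of Algorithm 2 used to produce these factorizations can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the elementwise update form implied by the KKT conditions (4); the hyperparameter defaults and the handling of t = 1 (no smoothing term) are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def smoothed_dynamic_nnmf(A, K, lam=0.1, mu=0.1, n_iter=100, eps=1e-9):
    """Sketch of Algorithm 2: factorize each count matrix A[t] ~ U[t] V[t]^T
    with an L1 sparsity penalty (lam) and smoothing toward the factors of
    the previous time period (mu)."""
    T = len(A)
    p = A[0].shape[0]
    rng = np.random.default_rng(0)
    U = [rng.random((p, K)) + 0.1 for _ in range(T)]  # dense, positive init
    V = [rng.random((p, K)) + 0.1 for _ in range(T)]
    for _ in range(n_iter):
        for t in range(T):
            # For t = 0 there is no previous period; reusing U[t]/V[t]
            # makes the smoothing term neutral (illustrative choice).
            Uprev = U[t - 1] if t > 0 else U[t]
            Vprev = V[t - 1] if t > 0 else V[t]
            U[t] *= (A[t] @ V[t] + mu * Uprev) / (
                U[t] @ V[t].T @ V[t] + mu * U[t] + lam + eps)
            V[t] *= (A[t].T @ U[t] + mu * Vprev) / (
                V[t] @ U[t].T @ U[t] + mu * V[t] + lam + eps)
    return U, V
```

The updates keep all factors non-negative by construction, since each step multiplies a non-negative matrix by a ratio of non-negative terms.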
Surpris- the transportation system daily. We convert the actual trips ingly, already K=2 performs well enough, thus indicating into a set of pairwise preferences and learn a ranking func- the sparsity of the count matrices. tion using the Gradient Boosting Rank method. We de- Figure 8 reports the hour-per-hour performance for the scribe actual trips with a number of time-dependent latent same six methods for Nancy case. Rush hours and lunch features, and develop a smoothed non-negative matrix fac- time appear to be hard for all methods; the error is the torization to estimate the latent variables of user prefer- smallest for the periods 10am-12am and 2pm-4pm that ences while choosing PT services and change points. Ex- points to the correlation between the traffic and trip vari- periments with real trip data demonstrate that the re-ranked ability. The traffic growth pushes travelers away from the trips are measurably closer to those actually chosen by pas- conventional traveling choices. sengers than are the trips produced by planners with static k9 ICML MUD 2015 heuristics. D. Lee and H. S. Seung. Algorithms for non-negative ma- trix factorization. In Proc. NIPS’01, pages 556–562, References 2001. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, J. Letchner, J. Krumm, and E. Horvitz. Trip router with in- N. Hamilton, and G. Hullender. Learning to rank using dividualized preferences: Incorporating personalization gradient descent. In ICML’05, pages 89–96, 2005. into route planning. In Proc. IAAI’06 - Vol 2, pages 1795–1800. AAAI Press, 2006. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with T. Liebig, N. Piatkowski, C. Bockermann, and K. Morik. nonsmooth cost functions. In Proc. NIPS’06, pages 193– Predictive trip planning-smart routing in smart cities. In 200, 2006. EDBT/ICDT Workshops, pages 331–338, 2014. Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. T.-Y. Liu. Learning to Rank for Information Retrieval. 
B. Casey, A. Bhaskar, H. Guo, and E. Chung. Critical review of time-dependent shortest path algorithms: A multimodal trip planner perspective. Transport Reviews, 34:522–539, 2014.

J. L. Elsas and S. T. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. WSDM'10, pages 1–10, 2010.

Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Machine Learning Res., 4:933–969, 2003.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.

N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput., 24(4):1085–1105, 2012.

R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132, 2000.

P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457–1469, 2004.

H. Kim and H. Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl., 30(2):713–730, 2008.

N. Lathia and L. Capra. Mining mobility data to minimise travellers' spending on public transport. In Proc. ACM KDD'11, pages 1181–1189, 2011.

S. Mankad and G. Michailidis. Structural and functional discovery in dynamic networks with non-negative matrix factorization. Phys. Rev. E, 88:042812, 2013.

L. McGinty and B. Smyth. Turas: A personalised route planning system. In Proc. PRICAI'00, pages 791–791, 2000.

M. Mezghani. Study on electronic ticketing in public transport. European Metropolitan Transport Authorities (EMTA), 38:1–56, 2008.

A. Mokhtari, O. Pivert, and A. HadjAli. Integrating complex user preferences into a route planner: A fuzzy-set-based approach. In IFSA/EUSFLAT Conf., pages 501–506, 2009.

M.-P. Pelletier, M. Trepanier, and C. Morency. Smart card data in public transit planning: A review. CIRRELT Rapport 2009-46, 2009.

A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: A dynamic NMF approach with temporal regularization. In Proc. WSDM'12, pages 693–702, 2012.

C. Seaborn, J. Attanucci, and N. H. M. Wilson. Analyzing multimodal public transport journeys in London with smart card fare payment data. Transportation Research Record: J. Transp. Research Board, 2121:55–62, 2009.

J. Z. Sun, D. Parthasarathy, and K. R. Varshney. Collaborative Kalman filtering for dynamic matrix factorization. IEEE Trans. on Signal Processing, 62(14):3499–3509, 2014.

M. Trepanier, R. Chapleau, and B. Allard. Can trip planner log files analysis help in transit service planning? Journal of Public Transportation, 8(2):79–103, 2005.

M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma. FRank: A ranking method with fidelity loss. In Proc. SIGIR'07, pages 383–390, 2007.

F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proc. ICML'08, pages 1192–1199, 2008.

J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledge from the physical world. In Proc. KDD'11, pages 316–324, 2011.

Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In Proc. SIGIR'07, pages 287–294, 2007.
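As a supplementary illustration, the assembly of the trip feature vector x(J) from the time-dependent latent matrices (the averages over boarding, alighting, and change places given above) can be sketched in Python. This is a minimal sketch under assumed data layouts: the function name, the dense (T, n_stops, K) array representation of the latent matrices, and the time-slot indexing are illustrative and not part of the paper.

```python
import numpy as np

def trip_feature_vector(x_g, U_s, V_s, U_c, V_c, b, a, t_b, t_a):
    """Assemble x(J) = {x_g, x_sb, x_sa, x_cb, x_ca} for a trip with n legs.

    U_s, V_s (services) and U_c, V_c (changes) are hypothetical dense
    arrays of shape (T, n_stops, K): one K-dim latent vector per
    (time slot, stop).  b, a are the boarding/alighting stop indices
    b_1..b_n and a_1..a_n; t_b, t_a are their time slots.
    """
    n = len(b)
    # service components: average the latent vectors over all n legs
    x_sb = np.mean([U_s[t_b[i], b[i]] for i in range(n)], axis=0)
    x_sa = np.mean([V_s[t_a[i], a[i]] for i in range(n)], axis=0)
    if n > 1:
        # change components: average over the first n-1 legs, as in the
        # formulas with sums running to n-1
        x_cb = np.mean([U_c[t_b[i], b[i]] for i in range(n - 1)], axis=0)
        x_ca = np.mean([V_c[t_a[i], a[i]] for i in range(n - 1)], axis=0)
    else:
        # direct trip without changes: no change places to average over
        x_cb, x_ca = np.zeros_like(x_sb), np.zeros_like(x_sa)
    return np.concatenate([x_g, x_sb, x_sa, x_cb, x_ca])
```

With K latent variables per matrix, the result has |x_g| + 4K components, matching the five-part decomposition x(J) = {x_g, x_sb, x_sa, x_cb, x_ca}; for the sum latent matrices, only two latent components would be concatenated instead.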