=Paper= {{Paper |id=Vol-1392/paper-03 |storemode=property |title=Improved Trip Planning by Learning from Travelers' Choices |pdfUrl=https://ceur-ws.org/Vol-1392/paper-03.pdf |volume=Vol-1392 |dblpUrl=https://dblp.org/rec/conf/icml/Chidlovskii15 }} ==Improved Trip Planning by Learning from Travelers' Choices== https://ceur-ws.org/Vol-1392/paper-03.pdf
              Improved Trip Planning by Learning from Travelers’ Choices


Boris Chidlovskii                                                                         BCHIDLOVSKII@XRCE.XEROX.COM
Xerox Research Center Europe, 6 chemin Maupertuis, 38240 Meylan, France



                          Abstract

We analyze the work of urban trip planners and the relevance of the trips they recommend upon user queries. We propose to improve the planner recommendations by learning from the choices made by travelers who use the transportation network on a daily basis. We analyze individual travelers' trips and convert them into pair-wise preferences for traveling from a given origin to a destination at a given time point. To address the sparse and noisy character of raw trip data, we model passenger preferences with a number of smoothed time-dependent latent variables, which are used to learn a ranking function for trips. This function can be used to re-rank the planner's top recommendations. Results of tests for the cities of Nancy, France and Adelaide, Australia show a considerable increase in the relevance of the recommendations.

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright © 2015 for this paper by its authors. Copying permitted for private and academic purposes.

1. Introduction

Most cities and agglomerations around the world offer trip planners, in the form of a web or mobile application. Upon a user travel request, they recommend trips using a static library of roads, public transportation networks and services. Although these planners are increasingly reliable in their knowledge of the transportation network and available services, they all share the same static-world assumptions. In particular, they make a general assumption of constancy and universality (Letchner et al., 2006): that the optimal trip is independent of the time of day of the actual journey and of the passengers' preferences.

In reality, constancy and universality rarely hold. Most urban travelers can verify that the best trip between work and home at midnight is not necessarily the best choice between the same locations at 8am. Similarly, different passengers may choose different ways to travel between the same origin and destination points.

While personal knowledge plays an important role, in many cases passengers simply have different preferences about trip planning. For example, one passenger may avoid multiple changes, extending the duration of her journey by a few minutes, while another simply wants to arrive at the destination as quickly as possible.

When a user queries a planner for a journey from origin o to destination d starting at time ts, there are often a large number of trips satisfying the query. Planners are designed to provide the k-top recommendations according to a set of predefined criteria, such as the minimal transfer time, the minimal number of changes, etc. Their work is similar to that of an information retrieval system, where the goal is to place the most relevant documents among the k-top answers. Therefore, it is highly desirable that a trip planner behave intelligently and suggest k-top trips which reflect the real passengers' preferences.

In this paper we closely analyze the cases of divergence between the planner recommendations and the real choices made by urban travelers. We collect two sets of individual trips extracted from fare collection systems in the cities of Nancy, France and Adelaide, Australia (see Figure 1). We compare these data to the city planners' recommendations and, in the case of divergence, we propose a novel method to rank the trips that better reflects reality.

Figure 1. Trip planners of Nancy (left) and Adelaide (right).

Our method relies on two main contributions. First, we consider any individual trip as a set of explicit preferences made by the traveler during the trip. We use this set of pairwise preferences to learn a ranking function of trips.
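The re-ranking idea can be sketched as a toy example (the candidate trips, their features and the scoring rule below are invented for illustration; the score stands in for the learned ranking function, it is not the function described in this paper):

```python
from typing import Callable, Dict, List

def rerank_top_k(candidates: List[Dict],
                 score: Callable[[Dict], float]) -> List[Dict]:
    """Re-rank the planner's k-top candidate trips with a learned scoring
    function; the planner itself is left unchanged."""
    return sorted(candidates, key=score, reverse=True)

# Invented candidate trips and a hand-made score standing in for the
# learned ranking function (here it penalizes travel time and changes).
candidates = [
    {"id": "fastest",    "travel_min": 22, "changes": 2},
    {"id": "one-change", "travel_min": 26, "changes": 1},
    {"id": "direct",     "travel_min": 31, "changes": 0},
]
score = lambda trip: -(trip["travel_min"] + 8 * trip["changes"])
print([t["id"] for t in rerank_top_k(candidates, score)])
# ['direct', 'one-change', 'fastest']
```

With this weighting, the trip the planner would call "fastest" drops to the bottom once changes are penalized, which is exactly the kind of divergence the paper studies.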




                                                         ICML MUD 2015

This function is then used on top of the trip planner, to re-rank the k-top recommendations. Second, we model passenger preferences for choosing a specific service or change point in a way that reflects their dynamic nature. To address the sparse and noisy character of the raw trip data, we model the user preferences by a set of dynamic latent variables. We estimate these variables by a smoothed dynamic non-negative factorization of service and transit counts.

The remainder of this paper is organized as follows. In Section 2 we briefly review the state of the art in urban trip planning. Section 3 introduces the trip ranking problem by analyzing individual trips for the Nancy city case. Learning to rank for trip planning is presented in Section 4. Section 5 then proposes to model user preferences by dynamic latent variables and develops an estimation method based on a smoothed dynamic non-negative factorization of service and transit counts. In Section 6, we report evaluation results on trip re-ranking for the two city datasets. Section 7 concludes the paper.

2. Prior Art

Trip planners. Public transport (PT) trip planners are designed to provide information about available journeys in the transport system. The application prompts a user to input an origin o, a destination d and a departure time ts (or arrival time tf); it then deploys a trip planning engine to find a sequence of available PT services from o to d starting at time ts (or ending at time tf).

Trip planners often retrieve multiple trips for a user query. They typically use a variation of the time-dependent shortest path algorithm to search a graph of nodes (representing access points to the network) and edges (representing possible journeys between points) (Casey et al., 2014). Different weightings such as distance, cost or accessibility are often associated with each edge and node. Search may be optimized on different criteria, for example, the fastest, fewest-changes or cheapest trips (Pelletier et al., 2009).

Planning high-quality realistic trips remains difficult for several reasons (McGinty & Smyth, 2000). First, available General Transit Feed Specification (GTFS) sources rarely contain all the information useful for constructing realistic plans. Second, the notion of "service quality" is difficult to define and is likely to change from person to person. Consequently, in real-world trip planning, the shortest trip is rarely the best one for a given user.

Multiple efforts have been made to improve trip planning (Lathia & Capra, 2011; Liebig et al., 2014; Mokhtari et al., 2009; Trepanier et al., 2005; Yuan et al., 2011). Analysis of trip planner log files (Trepanier et al., 2005) can help improve transit service by providing better knowledge of transit users. Log files were useful for identifying new locations to be assessed for better understanding user behaviors, and for guiding updates of the PT information system.

Personalization of trip planning takes user preferences into account and tries to identify the best trips among a set of possible answers. In (Mokhtari et al., 2009), fuzzy set theory was used to model complex user preferences. A typology of preferences was proposed to explicitly express the preferences and integrate them in a query language.

Trip personalization by mining public transport data has been addressed in (Lathia & Capra, 2011). It established a relation between urban mobility and fare purchasing habits in the London public transport network (Seaborn et al., 2010), and proposed personalized ticket recommendations based on estimated future travel patterns, matching travelers to the best fare.

Integrating real-time information in trip planners has been another research trend. (Yuan et al., 2011) presented a cloud-based system computing customized and practically fast driving routes for an end user using (historical and real-time) traffic conditions and driver behavior. GPS-equipped taxicabs are used as mobile sensors constantly probing the traffic rhythm of a city and the taxi drivers' intelligence in choosing driving directions. Real-time trip planning has also been extended to multi-modality (Casey et al., 2014; Seaborn et al., 2010). These systems use data from GPS-enabled vehicles to produce more accurate plans in terms of time and transit vehicles, and incorporate delays into the transit network in real time to minimize the gap with respect to the prediction model.

Learning to Rank. In document retrieval, ranking documents based on their degrees of relevance to a query has been the key question for decades. Much effort has been placed on developing document ranking functions. Early methods used a small number of document features (e.g., term frequency, inverse document frequency, and document length), with empirical tuning of the ranking function parameters. To avoid manual tuning, document retrieval was proposed to be regarded as learning to rank (Burges et al., 2005; 2006; Cao et al., 2006; Liu, 2011). Click-through data are used to deduce pair-wise training data for learning ranking functions.

In learning to rank, a number of categories are given and a total order is assumed to exist over the categories. Labeled instances are provided; each instance is represented by a feature vector, and each label denotes a rank. Existing methods can be categorized as point-wise, pair-wise and list-wise (Liu, 2011). In point-wise methods, each instance with its rank is used as an independent training example, and the goal of learning is to correctly map instances into intervals. In pair-wise methods, each instance pair is used as a training example, and the goal of training is to correctly find the differences between the ranks of instance pairs; ranking is transformed into pairwise classification or pairwise regression (Herbrich et al., 2000). This model formalizes learning to rank as learning for classification on pairs of instances and can deploy any classification method. In list-wise methods, the loss function is defined on a ranked list with respect to a query (Xia et al., 2008).
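As an illustration of the pair-wise reduction just described (a sketch of the generic technique used by Ranking-SVM-style methods, not code from any cited system), preference pairs can be turned into a binary classification problem on feature-vector differences:

```python
import numpy as np

def pairs_to_classification(pairs):
    """Reduce preference pairs (x_preferred, x_other) to a binary
    classification problem on feature differences:
    x_pref - x_other -> +1 and x_other - x_pref -> -1."""
    X, y = [], []
    for x_pref, x_other in pairs:
        d = np.asarray(x_pref, dtype=float) - np.asarray(x_other, dtype=float)
        X.append(d)
        y.append(+1)
        X.append(-d)
        y.append(-1)
    return np.array(X), np.array(y)

# Toy example: trips described by (travel time, number of changes);
# the faster trip with fewer changes is preferred.
X, y = pairs_to_classification([((20.0, 1.0), (35.0, 2.0))])
print(X.shape, y.tolist())  # (2, 2) [1, -1]
```

Any binary classifier trained on (X, y) then yields a scoring function whose sign on a difference vector predicts which of two trips ranks higher.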





3. Individual trips analysis

We consider a public transportation system that offers a number of services (buses, trams, trains, etc.) to urban travelers. Any individual passenger trip J represents a sequence of PT services and changes between the services. The service legs of J form a sequence SJ = {l1, ..., ln}, n ≥ 1, where leg li is a tuple (si, bi, ai, tbi, tai): si is a service identifier (a bus number, for example); bi and ai are the boarding and alighting stops; tbi and tai are the boarding and alighting timestamps. A trip is direct if n = 1, and transit otherwise. A transit trip includes n − 1 changes, which refer to waiting and/or walking between the services. The sequence of changes is defined as CJ = {c1, ..., cn−1}, n ≥ 1, where ci is uniquely defined by two successive service legs li and li+1, as ci = (ai, bi+1, tai, tbi+1).

We make the following association between individual trips and trip recommendations. We consider a trip J as an explicit answer to an implicit travel query Q = (o = b1, d = an, ts = tb1) or Q = (o = b1, d = an, tf = tan).

We analyze sets of individual trips collected from the automated fare collection systems (Mezghani, 2008) installed in Nancy, France and Adelaide, Australia; we mined these data to understand how passengers' choices differ from the planner recommendations.

For every pair of locations (o, d) in a network, we extract all real trips from o to d and analyze their travel time distribution. Figure 2.a shows the distribution of the minimal versus the average travel time for every (o, d) pair in Nancy. The high density zone suggests that the average travel time is far longer than the minimal time which is conventionally assumed by the planners.

Figure 2. a) Minimal travel time vs average travel time. b) Trip uncertainty.

Trip datasets expose a very large variety of paths for any (o, d) pair; the maximum number of different paths observed is 46 for Nancy and 37 for Adelaide; the average number of paths between two locations is 2.71 and 3.12, respectively. We measure the uncertainty of choosing one or another path from an origin o to a destination d by the Kullback-Leibler divergence KL(q||p) of the trip distribution q from the uniform distribution p. Higher KL values indicate higher certainty and a clear domination of one trip over the others. Figure 2.b plots the KL divergence values for all (o, d) pairs in Nancy on a log-log scale. Again, the high density zone suggests that a large part of the (o, d) pairs is dominated not by one but by 2 to 5 different paths of high frequency.

Figure 3. a) 5-top trips for one (origin, destination) pair in Nancy. b) The travel time and trip count distributions for the top 5 trips.

It is important to recall that traveling preferences change during the day. Figure 3.a shows the 5-top transit trips for an example (o, d) location pair in Nancy. Figure 3.b shows the travel time and average trip counts for the 5-top trips for this example. All 5 trips are transit ones with one change. The figure reveals how the user preferences vary during the day. The trip planner recommends the trip shown in red for the fastest-trip query. First, this recommended trip is neither the fastest nor the most frequent one. Second, the trip shown in green is the most frequent during lunch hours, despite being far from fast.

Figure 4 gives a more general picture. It shows the 240 most frequent (o, d) pairs in Nancy. For each pair, Figure 4.a uses different colors to show changing user preferences. The most frequent trip is colored in dark blue. The second, third, fourth and fifth preferences are shown in blue, green, orange and brown, respectively. Trips are sorted by the distance between the origin and destination (see Figure 4.b). Short trips expose a higher variability than longer ones.
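The KL-based uncertainty measure used in this section can be sketched as follows, assuming raw per-path trip counts for one (o, d) pair (q is the empirical path distribution, p the uniform one over the observed paths):

```python
import math

def path_uncertainty(path_counts):
    """KL(q || p): divergence of the empirical path distribution q for one
    (o, d) pair from the uniform distribution p over the observed paths.
    Higher values mean one path clearly dominates the others."""
    total = sum(path_counts)
    k = len(path_counts)
    q = [c / total for c in path_counts]
    p = 1.0 / k  # uniform probability over the k observed paths
    return sum(qi * math.log(qi / p) for qi in q if qi > 0)

# One dominant path (high certainty) vs. an even three-way split (low certainty).
dominant = path_uncertainty([95, 3, 2])
split = path_uncertainty([34, 33, 33])
print(dominant > split)  # True
```

A near-uniform split of trips over paths gives a KL value close to 0, matching the low-certainty (o, d) pairs in the high-density zone of Figure 2.b.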





As the figure shows, the second choices are more visible (blue color) during the morning rush hours. Figure 4.c shows the trip planner recommendations for the same pairs. The recommendations are static and do not reflect the user preferences.

Figure 4. a) Changing user preferences for the most frequent (o,d) pairs in Nancy. b) Trip distances. c) Trip recommendations by the planner.

We conclude this section with Figure 5, which shows how the user preferences vary between the PT services. It presents the total passenger counts for all Nancy change points at 8am, 1pm and 6pm.

Figure 5. Change counts in Nancy at 8am, 1pm and 6pm.

4. Learning to rank trips

When a passenger travels from an origin o to a destination d at time ts, she implicitly prefers the trip J she takes to all other trips J′, J′ ≠ J. Our approach is to transform this implicit feedback into an explicit set of pair-wise trip preferences and to learn the ranking function f from them.

Algorithm 1 below uses the trip planner and a set T of individual passengers' trips. For any trip J ∈ T matching the query Q = (o, d, ts), the algorithm retrieves the k-top candidates for Q and records that J has been preferred to any of these candidates, except J itself if it happens to be in this set. A real trip J matches a recommended trip J′ if it has the same number of legs and follows the same sequence of services: if SJ = {l1, ..., ln} and SJ′ = {l1′, ..., ln′}, then J matches J′ iff si = si′ ∧ bi = bi′ ∧ ai = ai′, for all i = 1, ..., n.

Algorithm 1 Rank learning algorithm.
Require: Collection T of passenger trips J = (S, C)
Require: Trip planner P with k-top recommendations
 1: S = ∅  ; set of pairwise preferences
 2: for each J ∈ T do
 3:   Form a query Q = (o = b1, d = an, ts = tb1)
 4:   Query the planner P with query Q
 5:   Retrieve the k-top trips as a list L
 6:   for each J′ ∈ L, J′ ≠ J do
 7:     Add (Q, x(J) ≻ x(J′)) to S
 8:   end for
 9: end for
10: Learn the ranking model f from S
Ensure: f

Once the ranking function f is learned, it can be used to improve the relevance of trip planner recommendations according to the re-ranking scenario. The trip planner does not change the way it works: for a new user query Q, it first generates the k-top candidate trips; then these candidates are re-ranked using the function f.

To learn a ranking function f, Algorithm 1 requires every trip J to be described by a feature vector x(J). In the following sections, we first describe a method for learning the ranking function f and then how to extract relevant and dynamic features from individual trips.

4.1. Gradient Boosting Rank

We use individual trips to form a set of pairwise preferences from which a ranking function f can be learned. For each individual trip J ∈ T, we generate a set of labeled data (xi,1, yi,1), ..., (xi,mi, yi,mi), i = 1, ..., |T|, which are preference pairs of feature vectors. If xi,j has a higher rank than xi,k (yi,j > yi,k), then xi,j ≻ xi,k is a preference pair, meaning that xi,j is ranked ahead of xi,k. The preference pairs can be viewed as instances and labels of a new classification problem, in which xi,j ≻ xi,k is a positive instance.

Any classification method can be used to train a classifier f(x), which is then used for ranking. Trips are assigned scores by f(x) and sorted by these scores.
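A minimal Python sketch of Algorithm 1's preference-pair extraction (the Leg/Trip types and the planner interface are hypothetical stand-ins, not the authors' code):

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Leg:
    service: str   # s_i: service identifier (e.g. a bus number)
    board: str     # b_i: boarding stop
    alight: str    # a_i: alighting stop
    t_board: int   # tb_i: boarding timestamp
    t_alight: int  # ta_i: alighting timestamp

Trip = Tuple[Leg, ...]

def matches(j: Trip, j2: Trip) -> bool:
    """J matches J' iff they have the same number of legs and
    s_i = s'_i, b_i = b'_i, a_i = a'_i for all i."""
    return len(j) == len(j2) and all(
        u.service == v.service and u.board == v.board and u.alight == v.alight
        for u, v in zip(j, j2))

def preference_pairs(trips: List[Trip],
                     planner: Callable[[str, str, int], List[Trip]]):
    """Algorithm 1: each observed trip J is recorded as preferred to every
    k-top planner candidate J' for the implicit query (o=b1, d=an, ts=tb1),
    except candidates matching J itself."""
    prefs = []
    for j in trips:
        o, d, ts = j[0].board, j[-1].alight, j[0].t_board
        for cand in planner(o, d, ts):
            if not matches(j, cand):
                prefs.append((j, cand))   # J ≻ J'
    return prefs

# Toy usage: the observed trip equals one of the two candidates,
# so exactly one preference pair remains.
a = (Leg("12", "O", "D", 480, 500),)
b = (Leg("7", "O", "D", 480, 510),)
print(len(preference_pairs([a], planner=lambda o, d, ts: [a, b])))  # 1
```

Feeding the resulting pairs, mapped through the feature extractor x(·), to a pairwise learner yields the ranking model f.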





Learning a good ranking model is realized by training a model for pairwise classification. The loss function in learning is pairwise because it is defined on a pair of feature vectors.

The pairwise approach is adopted in many methods, including Ranking SVM (Herbrich et al., 2000), RankBoost (Freund et al., 2003), RankNet (Burges et al., 2005), IR SVM (Tsai et al., 2007), GBRank (Zheng et al., 2007), LambdaRank (Burges et al., 2006), and others. In the following we adopt GBRank, one of the popular pairwise methods currently in use.

GBRank takes preference pairs as training data, {x_i^1, x_i^2}, x_i^1 ≻ x_i^2, i = 1, ..., N, and uses the parametric pairwise loss function

    L(f) = (1/2) Σ_{i=1}^{N} ( max{0, τ − (f(x_i^1) − f(x_i^2))} )²,

where f(x) is the ranking function and τ is a parameter, 0 < τ ≤ 1. The loss is 0 if f(x_i^1) is larger than f(x_i^2) + τ; otherwise, the incurred loss is (1/2)(f(x_i^2) − f(x_i^1) + τ)².

To optimize the loss function with respect to the training instances, Functional Gradient Descent is deployed. Treating all f(x_i^1), f(x_i^2), i = 1, ..., N as variables, the gradient of L(f) with respect to the training instances is

    ( −max{0, f(x_i^2) − f(x_i^1) + τ},  max{0, f(x_i^2) − f(x_i^1) + τ} ),  i = 1, ..., N.

If f(x_i^1) − f(x_i^2) ≥ τ, the corresponding loss is zero and there is no need to change the ranking function. If f(x_i^1) − f(x_i^2) < τ, the loss is non-zero and the ranking function is updated by gradient descent:

    f_k(x) = f_{k−1}(x) − ν ∇L(f_k(x)),

where f_k(x) and f_{k−1}(x) denote the values of f(x) at the k-th and (k−1)-th iterations, respectively, and ν is the learning rate.

At the k-th iteration, GBRank collects all the pairs with non-zero losses, {(x_i^1, f_{k−1}(x_i^2) + τ), (x_i^2, f_{k−1}(x_i^1) − τ)}, and employs Gradient Boosting Trees (Friedman, 2000) to learn a regression model g_k(x) on these regression data. The learned model g_k(x) is then linearly combined with the existing model f_{k−1}(x) to create a new model f_k(x):

    f_k(x) = ( k f_{k−1}(x) + β_k g_k(x) ) / (k + 1),

with β_k a shrinkage factor (Zheng et al., 2007).

5. Trip feature extraction

We now describe each real trip J by a set of relevant and dynamic features x(J). There may exist explicit and implicit factors which influence the passenger choice. Passengers make their choices as a function of location and time.

We distinguish two groups of trip features. First, global features describe the whole trip: the travel time, the number of changes, the usage of specific types of transport (bus, train, tram, etc.), multi-modality, etc. Second, much more relevant and specific are local features that describe each service leg and change composing a given trip. For each PT service, we may extract the estimated mean and variance of the speed on this line in this time period, and the average delay with respect to the schedule. For each change point, we can estimate the walking distance, if any, the closeness to a commercial zone or transportation hub, etc.

Unfortunately, raw features of service and change counts are generally sparse, noisy and prone to many errors. The main sources of error are incorrect setup of ticket validation machines, lack of alignment between ticket validation machines and GPS localization, and card misuse by travelers.

We therefore intend to extract from the sparse and noisy counts latent features able to represent user preferences and their dynamic character.

We split all trips J ∈ T into two collections of service and change observations, As = {li | li ∈ SJ, J ∈ T} and Ac = {ci | ci ∈ CJ, J ∈ T}. In the following, for brevity, we assume we work with a set of observations A; it may denote the service or change observations, or their sum.

If we split all observations in A into T time periods, we obtain a sequence of count matrices At, t = 1, ..., T, At ∈ R_+^{p×p}, where entry aij is the service or change count during period t and p is the number of stops.

The full diagram of latent feature extraction for individual trips and learning of the ranking function is given in Figure 6.

5.1. Collapsed matrices

We first consider the static case when T = 1 and all observations from A are collapsed into one matrix A.

Both service and change data are sparse non-negative counts, and we can use non-negative matrix factorization (NNMF) as a method giving a robust low-rank interpretation of the data (Lee & Seung, 2001). The factors can be efficiently computed by formulating a penalized optimization problem and using modern gradient-descent algorithms (Hoyer, 2004).

Matrix A is approximated by a product of two low-rank matrices estimated through the following minimiza-




                                                                        kR
                                                          ICML MUD 2015
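The GBRank iteration described above (collect the pairs that violate the margin τ, fit a regressor on them, then blend the new model with the old one) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: a linear least-squares learner stands in for the regression trees of (Friedman, 2000), and the toy data and defaults are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares base learner (with bias); stands in for a regression tree."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def gbrank(x1, x2, tau=0.3, beta=0.8, n_iter=10):
    """x1[i] is preferred over x2[i]; returns a scoring function f."""
    terms = []  # (coefficient, base learner); f(X) = sum_k c_k * g_k(X)

    def f(X):
        out = np.zeros(len(X))
        for c, w in terms:
            out += c * predict_linear(w, X)
        return out

    for k in range(1, n_iter + 1):
        s1, s2 = f(x1), f(x2)
        viol = s1 - s2 < tau                 # pairs with non-zero hinge loss
        if not viol.any():
            break
        # Regression targets: push f(x1) up to f(x2)+tau, f(x2) down to f(x1)-tau.
        X = np.vstack([x1[viol], x2[viol]])
        y = np.concatenate([s2[viol] + tau, s1[viol] - tau])
        g = fit_linear(X, y)
        # Blend: f_k = (k * f_{k-1} + beta_k * g_k) / (k + 1).
        terms = [(c * k / (k + 1), w) for c, w in terms]
        terms.append((beta / (k + 1), g))
    return f
```

Because the blend fk = (k fk−1 + βk gk)/(k + 1) only rescales and mixes scores, the induced ranking is what matters, not the absolute score values.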

Figure 6. Preference features and re-ranking function learning.

Matrix A is approximated with a product of two low-rank matrices, estimated through the following minimization:

    min_{U≥0,V≥0} ||A − UV^T||F^2,

where U and V are n × K non-negative matrices. The rank of the approximation, K, corresponds to the number of latent factors; it is chosen to obtain a good data fit and interpretability. U gives the latent factors for origin stops, and V those for destination stops.

The factorized matrices are obtained by minimizing an objective function that consists of a goodness-of-fit term and a roughness penalty:

    min_{U≥0,V≥0} ||A − UV^T||F^2 + λ(||U||1 + ||V||1),    (1)

where the parameter λ ≥ 0 indicates the penalty strength; a larger penalty encourages sparser matrices U and V. Adding penalties to NNMF is a common strategy, since they not only improve interpretability but often also improve the numerical stability of the estimation.

5.2. Smoothed Dynamic NNMF

In the general case T > 1, we have a sequence of matrices {At}, t = 1, ..., T. To produce a sequence of low-rank matrix factorizations {Ut, Vt}, we could extend the factorization in (1) to the case T > 1 by independent factorization of the T matrices {At}. However, we additionally impose a smoothness constraint on both Ut and Vt, in order to force the latent factors to be similar to those of the previous time periods, in both boardings and alightings. The objective function then becomes

    min_{Ut≥0,Vt≥0} Σ_{t=1}^T ||At − Ut Vt^T||F^2
        + µ Σ_{t=2}^T (||Ut − Ut−1||F^2 + ||Vt − Vt−1||F^2)    (2)
        + λ Σ_{t=1}^T (||Ut||1 + ||Vt||1),

where the parameters λ, µ are set by the user. The objective function smooths Ut and Vt over two successive time periods, but it can be generalized to a larger window.

To estimate the matrices Ut and Vt, we use an extended version of the multiplicative updating algorithm for NNMF (Gillis & Glineur, 2012; Lee & Seung, 2001; Mankad & Michailidis, 2013), based on an adaptive gradient descent.

Temporal extensions of matrix factorization techniques have been studied in (Elsas & Dumais, 2010; Mankad & Michailidis, 2013; Saha & Sindhwani, 2012; Sun et al., 2014). (Elsas & Dumais, 2010) analyzed the temporal dynamics of Web document content; to improve relevance ranking, it developed a probabilistic document ranking algorithm that allows differential weighting of terms based on their temporal characteristics. (Sun et al., 2014) addressed recommendation systems with significant temporal dynamics; it developed the collaborative Kalman filter, which extends probabilistic matrix factorization in time through a state-space model. Community detection in time-evolving graphs is analyzed in (Mankad & Michailidis, 2013); the latent structure of overlapping communities is discovered through sequential matrix factorization.

To solve (2), we follow (Mankad & Michailidis, 2013) and consider the Lagrangian

    L = Σ_{t=1}^T ||At − Ut Vt^T||F^2
        + µ Σ_{t=2}^T (||Ut − Ut−1||F^2 + ||Vt − Vt−1||F^2)    (3)
        + Σ_{t=1}^T (λ(||Ut||1 + ||Vt||1) + Tr(ΦUt) + Tr(ΨVt)),

where Φ, Ψ are Lagrange multipliers. The method works as an adaptive gradient descent converging to a local minimum; the Karush-Kuhn-Tucker (KKT) conditions give the necessary optimality conditions. They are obtained by setting ∂L/∂Ut = 0 and ∂L/∂Vt = 0, t = 1, ..., T. It can be shown that the KKT optimality conditions yield

    Φt = −2At Vt + 2Ut Vt^T Vt − 2µ(Ut−1 − Ut) + 2λ,
    Ψt = −2At^T Ut + 2Vt Ut^T Ut − 2µ(Vt−1 − Vt) + 2λ,    (4)

which, after matrix algebra manipulations, lead to the multiplicative updating rules presented in Algorithm 2.

The convergence of the multiplicative updating algorithm is often reported to be slow. In practice we obtain meaningful factorizations after a handful of iterations, which we attribute to the sparseness of the input matrices At. In the future, when working with dense data, faster methods such as the active-set version of the alternating non-negative least squares (ANLS) algorithm (Kim & Park, 2008) will be more appropriate.
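To illustrate these multiplicative updates, here is a minimal NumPy sketch of temporally smoothed NNMF: each factor update puts the data term and the pull toward the previous period's factors in the numerator, and the reconstruction, smoothing, and L1 terms in the denominator, mirroring the structure of (4). This is a simplified stand-in for Algorithm 2, not the paper's exact implementation, and the parameter defaults are illustrative.

```python
import numpy as np

def smoothed_dynamic_nnmf(As, K=2, lam=0.01, mu=0.1, n_iter=300, seed=0, eps=1e-9):
    """Factorize each count matrix A_t ~ U_t V_t^T while pulling (U_t, V_t)
    toward (U_{t-1}, V_{t-1}); simplified multiplicative-update variant."""
    rng = np.random.default_rng(seed)
    p, q = As[0].shape
    Us = [rng.random((p, K)) + 0.1 for _ in As]  # dense positive initialization
    Vs = [rng.random((q, K)) + 0.1 for _ in As]
    for _ in range(n_iter):
        for t, A in enumerate(As):
            # Previous-period anchors; for t = 0 the smoothing term cancels out.
            Up = Us[t - 1] if t > 0 else Us[t]
            Vp = Vs[t - 1] if t > 0 else Vs[t]
            U, V = Us[t], Vs[t]
            # Numerator: data term + smoothing pull; denominator: reconstruction,
            # smoothing, and L1 penalty terms. The ratio keeps factors non-negative.
            U *= (A @ V + mu * Up) / (U @ (V.T @ V) + mu * U + lam + eps)
            V *= (A.T @ U + mu * Vp) / (V @ (U.T @ U) + mu * V + lam + eps)
    return Us, Vs
```

Starting from positive matrices, the multiplicative form keeps all factors non-negative throughout; the eps guard only prevents division by zero on empty rows of a sparse count matrix.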

Algorithm 2 Dynamic Smoothing NNMF algorithm
Require: matrices At, t = 1, ..., T; constants λ, µ
 1: Initialize Ut, Vt as dense, positive random matrices
 2: repeat
 3:   for t = 1, ..., T do
 4:     Ut ← Ut ∘ (At Vt + µUt−1) / (Ut Vt^T Vt + µUt + λ)
 5:     Vt ← Vt ∘ (At^T Ut + µVt−1) / (Vt Ut^T Ut + µVt + λ)
 6:   end for
 7: until convergence
Ensure: Ut, Vt, t = 1, ..., T
(∘ and / denote elementwise product and division.)

5.3. Dynamic trip features

Algorithm 2 finds sparse factorized matrices for a sequence of input matrices At, t = 1, ..., T. We first apply the algorithm to the sequences of service matrices At^s and change matrices At^c extracted from the full trip collection. We thus obtain smoothed factorized matrices Ut^s, Vt^s and Ut^c, Vt^c, t = 1, ..., T, for services and changes, respectively. At time period t, a boarding stop b has latent factors given by the corresponding row of Ut^s, denoted Ut^s(b). For an alighting stop a, the row Vt^s(a) gives the latent factors at time t. We then apply the algorithm to the sum matrices At^f = At^c + At^s, t = 1, ..., T; the smoothed factorized matrices for At^f are denoted Ut^f, Vt^f.

To generate a feature vector x for a trip J, we may use its decomposition into service legs and changes, J = (S, C). The vector x(J) is then composed of a general feature vector xg and four latent components, x(J) = {xg, xb^s, xa^s, xb^c, xa^c}, where

  • xb^s, xa^s are latent feature vectors averaged over the trip boarding and alighting places, respectively:

        xb^s = (1/n) Σ_{i=1}^n U_{tbi}^s(bi);    xa^s = (1/n) Σ_{i=1}^n V_{tai}^s(ai);

  • xb^c, xa^c are latent feature vectors averaged over the change places (alighting and boarding), respectively:

        xb^c = (1/(n−1)) Σ_{i=1}^{n−1} U_{tbi}^c(bi);    xa^c = (1/(n−1)) Σ_{i=1}^{n−1} V_{tai}^c(ai).

In the case of the sum latent matrices Ut^f, Vt^f, x(J) is composed of a general feature vector xg and two latent components, x(J) = {xg, xb^f, xa^f}, obtained from Ut^f and Vt^f.

6. Evaluation

To test our method for learning a ranking function from individual trips, we processed 5.2M individual trips collected in Nancy, France, during 3 months in 2012. The Nancy network includes 1129 nodes/stops and offers 107 bus and tram services to travelers. We also processed 12.5M trips from Adelaide, Australia, collected during 2.5 months in 2013. The Adelaide network offers 312 bus and tram service variations and accounts for 3524 stops.

To evaluate the impact of modeling user preferences from actual trips, we selected the 240 most frequent origin-destination pairs in Nancy (see Figure 4) and the 160 most frequent pairs in Adelaide.

When generating temporal sequences of count matrices, we test two cases, T = 24 and T = 48, in which a matrix includes all passenger counts during one hour or during 30 minutes, respectively. Once a matrix sequence is generated, each matrix is randomly split into 70% for training and the remaining 30% for testing. All results below are means and variances over 10 independent runs.

We retrieved the trip planner recommendations for Nancy (http://www.reseau-stan.com/) and Adelaide (https://www.adelaidemetro.com.au/). We learn the ranking function and use it to re-rank the trip recommendations, using the different options described in the previous sections. To understand the effect of raw count factorization, we consider several options. First, we collapse the matrices, thus disregarding the temporal aspect. Second, we consider the service matrices At^s and the change matrices At^c either separately, or sum them up, At^f = At^s + At^c, before the factorization. Third, we study the effect of temporal smoothing, with factorization done either independently per period or with smoothing over successive time periods. Finally, we test different values of K for the factorization.

In all experiments with GBRank (see Section 4.1), parameter τ was set to 0.3 and the shrinkage factors βk to 0.8. For smoothed dynamic NNMF, optimal values of µ and λ were determined by cross-validation. To evaluate the ranking methods, we use a measure commonly used in information retrieval, the Normalized Discounted Cumulative Gain (NDCG). We report the deviation from the perfect ranking's NDCG@1 score of 1, i.e., the error rate of the top-1 recommendation.

Table 1 reports the evaluation results for 12 different methods and compares them to the trip planner baseline for both cities. The analysis of these results provides some interesting insights. First, results are globally better for the smaller Nancy network than for the bigger Adelaide one, for both T = 24 and T = 48. Second, collapsed matrices improve on the baseline somewhat, but only taking temporal user preferences into account really boosts the performance; moreover, smoothed matrix factorization improves considerably over the independent one. Third, the change latent variables appear to be more relevant than the service ones.

City                           Nancy                          Adelaide
Method                         T = 24         T = 48          T = 24         T = 48
Baseline: Trip Planner         24.91 ± 1.20   24.91 ± 1.28    38.17 ± 2.28   38.17 ± 2.28
Collapsed: Services            24.73 ± 1.17   24.73 ± 1.22    29.97 ± 2.11   29.97 ± 2.11
Collapsed: Changes             19.69 ± 1.01   19.69 ± 1.09    28.63 ± 2.29   28.63 ± 2.29
Collapsed: Services+Changes    19.30 ± 1.14   19.30 ± 1.03    28.05 ± 2.32   28.05 ± 2.32
Collapsed: Sum                 19.59 ± 1.13   19.59 ± 1.10    28.17 ± 2.18   28.17 ± 2.18
Indep: Services                14.08 ± 0.92   15.33 ± 0.97    25.33 ± 2.07   24.87 ± 1.87
Indep: Changes                  9.55 ± 0.90    9.89 ± 0.86    23.89 ± 1.67   23.93 ± 1.75
Indep: Services+Changes         9.52 ± 0.89    9.41 ± 0.87    22.41 ± 1.72   22.15 ± 1.55
Indep: Sum                     10.42 ± 0.89    9.37 ± 0.86    22.37 ± 1.56   23.55 ± 1.59
Smooth: Services                9.22 ± 0.77    9.37 ± 0.78    15.37 ± 1.38   14.71 ± 1.24
Smooth: Changes                 6.71 ± 0.82    6.69 ± 0.74    16.69 ± 1.24   16.69 ± 1.15
Smooth: Services+Changes        5.83 ± 0.81    6.12 ± 0.79    14.12 ± 1.29   13.63 ± 1.14
Smooth: Sum                     7.63 ± 0.79    7.05 ± 0.81    15.05 ± 1.41   14.43 ± 1.32

Table 1. NDCG@1 error rates for 12 methods and two cities.

Instead, using the sum counts performs worse than keeping the service and change variables separate; we tend to explain this by the heterogeneity of service and change preferences.

Figure 7. Independent and smoothed predictions vs. number of latent variables.

Figure 7 shows the performance of the 3 independent and 3 smoothed methods for T = 24 for Nancy, with the number of latent variables K varying between 2 and 30. Surprisingly, K = 2 already performs well enough, which indicates the sparsity of the count matrices.

Figure 8. NDCG@1: Independent and smoothed predictions during the day.

Figure 8 reports the hour-per-hour performance of the same six methods for the Nancy case. Rush hours and lunch time appear to be hard for all methods; the error is smallest for the periods 10am-12pm and 2pm-4pm, which points to a correlation between traffic and trip variability: traffic growth pushes travelers away from their conventional traveling choices.

7. Conclusion

We address the problem of the relevance of trips recommended by urban trip planners. We analyzed passengers' trips extracted from two public transportation systems. We propose a method for improving the recommendation relevance by learning from choices made by travelers who use the transportation system daily. We convert the actual trips into a set of pairwise preferences and learn a ranking function using the Gradient Boosting Rank method. We describe actual trips with a number of time-dependent latent features, and develop a smoothed non-negative matrix factorization to estimate the latent variables of user preferences in choosing services and change points. Experiments with real trip data demonstrate that the re-ranked trips are measurably closer to those actually chosen by passengers than are the trips produced by planners with static

heuristics.

References

C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proc. ICML'05, pages 89-96, 2005.

C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Proc. NIPS'06, pages 193-200, 2006.

Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. Adapting ranking SVM to document retrieval. In Proc. SIGIR'06, pages 186-193, 2006.

B. Casey, A. Bhaskar, H. Guo, and E. Chung. Critical review of time-dependent shortest path algorithms: A multimodal trip planner perspective. Transport Reviews, 34:522-539, 2014.

J. L. Elsas and S. T. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. WSDM'10, pages 1-10, 2010.

Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Machine Learning Res., 4:933-969, 2003.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189-1232, 2000.

N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput., 24(4):1085-1105, 2012.

R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115-132, 2000.

P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457-1469, 2004.

H. Kim and H. Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl., 30(2):713-730, 2008.

N. Lathia and L. Capra. Mining mobility data to minimise travellers' spending on public transport. In Proc. ACM KDD'11, pages 1181-1189, 2011.

D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Proc. NIPS'01, pages 556-562, 2001.

J. Letchner, J. Krumm, and E. Horvitz. Trip router with individualized preferences: Incorporating personalization into route planning. In Proc. IAAI'06, pages 1795-1800, 2006.

T. Liebig, N. Piatkowski, C. Bockermann, and K. Morik. Predictive trip planning - smart routing in smart cities. In EDBT/ICDT Workshops, pages 331-338, 2014.

T.-Y. Liu. Learning to Rank for Information Retrieval. Springer, 2011.

S. Mankad and G. Michailidis. Structural and functional discovery in dynamic networks with non-negative matrix factorization. Phys. Rev. E, 88:042812, 2013.

L. McGinty and B. Smyth. Turas: A personalised route planning system. In Proc. PRICAI'00, pages 791-791, 2000.

M. Mezghani. Study on electronic ticketing in public transport. European Metropolitan Transport Authorities (EMTA), 38:1-56, 2008.

A. Mokhtari, O. Pivert, and A. HadjAli. Integrating complex user preferences into a route planner: A fuzzy-set-based approach. In IFSA/EUSFLAT Conf., pages 501-506, 2009.

M.-P. Pelletier, M. Trepanier, and C. Morency. Smart card data in public transit planning: A review. CIRRELT Rapport 2009-46, 2009.

A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: A dynamic NMF approach with temporal regularization. In Proc. WSDM'12, pages 693-702, 2012.

C. Seaborn, J. Attanucci, and N. H. M. Wilson. Analyzing multimodal public transport journeys in London with smart card fare payment data. Transportation Research Record: J. Transp. Research Board, 2121:55-62, 2009.

J. Z. Sun, D. Parthasarathy, and K. R. Varshney. Collaborative Kalman filtering for dynamic matrix factorization. IEEE Trans. on Signal Processing, 62(14):3499-3509, 2014.

M. Trepanier, R. Chapleau, and B. Allard. Can trip planner log files analysis help in transit service planning? Journal of Public Transportation, 8(2):79-103, 2005.

M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma. FRank: A ranking method with fidelity loss. In Proc. SIGIR'07, pages 383-390, 2007.

F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proc. ICML'08, pages 1192-1199, 2008.

J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledge from the physical world. In Proc. KDD'11, pages 316-324, 2011.

Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In Proc. SIGIR'07, pages 287-294, 2007.