Improved Trip Planning by Learning from Travelers' Choices

Boris Chidlovskii  BCHIDLOVSKII@XRCE.XEROX.COM
Xerox Research Center Europe, 6 chemin Maupertuis, 38240 Meylan, France

Abstract

We analyze the work of urban trip planners and the relevance of the trips they recommend upon user queries. We propose to improve the planner recommendations by learning from the choices made by travelers who use the transportation network on a daily basis. We analyze individual travelers' trips and convert them into pairwise preferences for traveling from a given origin to a destination at a given time point. To address the sparse and noisy character of raw trip data, we model passenger preferences with a number of smoothed time-dependent latent variables, which are used to learn a ranking function for trips. This function can be used to re-rank the planner's top recommendations. Results of tests for the cities of Nancy, France and Adelaide, Australia show a considerable increase in the recommendation relevance.

Proceedings of the 2nd International Workshop on Mining Urban Data, Lille, France, 2015. Copyright 2015 for this paper by its authors. Copying permitted for private and academic purposes.

1. Introduction

Most cities and agglomerations around the world propose their trip planners, in the form of a web or mobile application. Upon a user travel request, they recommend trips using a static library of roads, public transportation networks and services. Although these planners are increasingly reliable in their knowledge of the transportation network and the available services, they all share the same static-world assumptions. In particular, they make a general assumption of constancy and universality (Letchner et al., 2006): that the optimal trip is independent of the time of day of the actual journey and of the passengers' preferences.

In reality, constancy and universality rarely hold. Most urban travelers can verify that the best trip between work and home at midnight is not necessarily the best choice between the same locations at 8am. Similarly, different passengers may choose different ways to travel between the same origin and destination points.

While personal knowledge plays an important role, in many cases passengers simply have different preferences about trip planning. For example, one passenger may avoid multiple changes, extending the duration of her journey by a few minutes, while another simply wants to arrive at the destination as quickly as possible.

When a user queries a planner for a journey from origin o to destination d starting at time t_s, there are often a large number of trips satisfying the query. Planners are designed to provide the k-top recommendations according to a set of predefined criteria, such as the minimal transfer time, the minimal number of changes, etc. Their work is similar to that of any information retrieval system, where the goal is to place the most relevant documents among the k-top answers. Therefore, it is highly desirable that a trip planner behaves intelligently and suggests k-top trips which reflect the real passengers' preferences.

In this paper we closely analyze the cases of divergence between the planner recommendations and the real choices made by urban travelers. We collect two sets of individual trips extracted from fare collection systems in the cities of Nancy, France and Adelaide, Australia (see Figure 1). We compare these data to the city planners' recommendations; in the case of divergence, we propose a novel method to rank the trips that better reflects reality.

Figure 1. Trip planners of Nancy (left) and Adelaide (right).

Our method relies on two main contributions. First, we consider any individual trip as a set of explicit preferences made by the traveler during the trip. We use this set of pairwise preferences to learn a ranking function for trips. This function is then used on top of the trip planner, to re-rank the k-top recommendations. Second, we model passenger preferences for choosing a specific service or change point in a way that reflects their dynamic nature. To address the sparse and noisy character of the raw trip data, we model the user preferences by a set of dynamic latent variables. We estimate these variables by a smoothed dynamic non-negative factorization of service and transit counts.

The remainder of this paper is organized as follows. In Section 2 we briefly review the state of the art in urban trip planning. Section 3 introduces the trip ranking problem by analyzing individual trips for the Nancy city case. Learning to rank for trip planning is presented in Section 4. Section 5 then proposes to model user preferences by dynamic latent variables and develops an estimation method by smoothed dynamic non-negative factorization of service and transit counts. In Section 6, we report the results of the evaluation of trip re-ranking on two city datasets. Section 7 concludes the paper.

2. Prior Art

Trip planners. Public transport (PT) trip planners are designed to provide information about available journeys in the transport system. The application prompts a user to input an origin o, a destination d and a departure time t_s (or an arrival time t_f); it then deploys a trip planning engine to find a sequence of available PT services from o to d starting at time t_s (or ending at time t_f).

Trip planners often retrieve multiple trips for a user query. They typically use a variation of the time-dependent shortest path algorithm to search a graph of nodes (representing access points to the network) and edges (representing possible journeys between points) (Casey et al., 2014). Different weightings such as distance, cost or accessibility are often associated with each edge and node. Search may be optimized on different criteria, for example, the fastest, least-changes or cheapest trips (Pelletier et al., 2009).

Planning high quality realistic trips remains difficult for several reasons (McGinty & Smyth, 2000). First, available General Transit Feed Specification (GTFS) sources rarely contain all the information useful for constructing realistic plans. Second, the notion of "service quality" is difficult to define and is likely to change from person to person. Consequently, in real-world trip planning, the shortest trip is rarely the best one for a given user.

Multiple efforts have been made to improve trip planning (Lathia & Capra, 2011; Liebig et al., 2014; Mokhtari et al., 2009; Trepanier et al., 2005; Yuan et al., 2011). Analysis of trip planner log files (Trepanier et al., 2005) can help improve transit service by providing better knowledge of transit users. Log files were useful for identifying new locations to be assessed for a better understanding of user behaviors, and for guiding updates of the PT information system.

Personalization of trip planning takes user preferences into account and tries to identify the best trips among a set of possible answers. In (Mokhtari et al., 2009), fuzzy set theory was used to model complex user preferences. A typology of preferences was proposed to explicitly express the preferences and integrate them into a query language. Trip personalization by mining public transport data has been addressed in (Lathia & Capra, 2011). It established a relation between urban mobility and fare purchasing habits in the London public transport network (Seaborn et al., 2010), and proposed personalized ticket recommendations based on estimated future travel patterns, matching travelers to the best fare.

Integrating real-time information into trip planners has been another research trend. (Yuan et al., 2011) presented a cloud-based system computing customized and practically fast driving routes for an end user, using (historical and real-time) traffic conditions and driver behavior. GPS-equipped taxicabs are used as mobile sensors constantly probing the traffic rhythm of the city, as well as the taxi drivers' intelligence in choosing driving directions. Real-time trip planning has also been extended to multi-modality (Casey et al., 2014; Seaborn et al., 2010). It used data from GPS-enabled vehicles to produce more accurate plans in terms of time and transit vehicles, and incorporates the delays into the transit network in real time to minimize the gap with respect to the prediction model.

Learning to Rank. In document retrieval, ranking documents according to their degree of relevance to a query has been the key question for decades. Much effort has been placed on developing document ranking functions. Early methods used a small number of document features (e.g., term frequency, inverse document frequency, and document length), with an empirical tuning of the ranking function parameters. To avoid manual tuning, document retrieval was proposed to be regarded as learning to rank (Burges et al., 2005; 2006; Cao et al., 2006; Liu, 2011). Click-through data are used to deduce pairwise training data for learning ranking functions.

In learning to rank, a number of categories are given and a total order is assumed to exist over the categories. Labeled instances are provided; each instance is represented by a feature vector, and each label denotes a rank. Existing methods can be categorized as point-wise, pair-wise and list-wise (Liu, 2011). In point-wise methods, each instance with its rank is used as an independent training example. The goal of learning is to correctly map instances into intervals.
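The reduction from graded labels to pairwise training data can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function name and the list-based representation are our own, and any pair ordering convention would do.

```python
from itertools import combinations

def preference_pairs(instances, labels):
    """Convert graded relevance labels for the answers to one query
    into pairwise preference examples (x_i, x_j), x_i ranked above x_j."""
    pairs = []
    for (xi, yi), (xj, yj) in combinations(zip(instances, labels), 2):
        if yi > yj:
            pairs.append((xi, xj))   # xi preferred over xj
        elif yj > yi:
            pairs.append((xj, xi))   # xj preferred over xi
        # ties produce no preference pair
    return pairs

# e.g. preference_pairs(["a", "b", "c"], [2, 0, 1])
#      yields [("a", "b"), ("a", "c"), ("c", "b")]
```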
In pair-wise methods, each instance pair is used as a training example and the goal of training is to correctly find the difference between the ranks of the instances in a pair; ranking is transformed into pairwise classification or pairwise regression (Herbrich et al., 2000). This model formalizes learning to rank as learning a classifier on pairs of instances and can deploy any classification method. In list-wise methods, the loss function is defined on a ranked list with respect to a query (Xia et al., 2008).

3. Individual trips analysis

We consider a public transportation system that offers a number of services (buses, trams, trains, etc.) to urban travelers. An individual passenger trip J represents a sequence of PT services and changes between the services. The service legs of J form a sequence S_J = {l_1, ..., l_n}, n ≥ 1, where leg l_i is a tuple (s_i, b_i, a_i, t_{b_i}, t_{a_i}): s_i is a service identifier (a bus number, for example); b_i and a_i are the boarding and alighting stops; t_{b_i} and t_{a_i} are the boarding and alighting timestamps. A trip is direct if n = 1, and transit otherwise. A transit trip includes n − 1 changes, which refer to waiting and/or walking between the services. The sequence of changes is defined as C_J = {c_1, ..., c_{n−1}}, n ≥ 1, where c_i is uniquely defined by two successive service legs l_i and l_{i+1}, as c_i = (a_i, b_{i+1}, t_{a_i}, t_{b_{i+1}}).

We make the following association between individual trips and trip recommendations. We consider a trip J as an explicit answer to an implicit travel query Q = (o = b_1, d = a_n, t_s = t_{b_1}) or Q = (o = b_1, d = a_n, t_f = t_{a_n}).

We analyze sets of individual trips collected from the automated fare collection systems (Mezghani, 2008) installed in Nancy, France and Adelaide, Australia; we mined these data to understand how passengers' choices differ from the planner recommendations.

For every pair of locations (o, d) in a network, we extract all real trips from o to d and analyze their travel time distribution. Figure 2.a shows the distribution of the minimal versus the average travel time for every (o, d) pair in Nancy. The high density zone suggests that the average travel time is far longer than the minimal time which is conventionally assumed by the planners.

The trip datasets expose a very large variety of paths for any (o, d) pair; the maximum number of different paths observed is 46 for Nancy and 37 for Adelaide, and the average number of paths between two locations is 2.71 and 3.12, respectively. We measure the uncertainty of choosing one or another path from an origin o to a destination d by the Kullback-Leibler divergence KL(q||p) of the trip distribution q from the uniform distribution p. Higher KL values indicate higher certainty and a clear domination of one trip over the others. Figure 2.b plots the KL divergence values for all (o, d) pairs in Nancy on a log-log scale. Again, the high density zone suggests that a large part of the (o, d) pairs is dominated not by one but by 2 to 5 different paths of high frequency.

Figure 2. a) Minimal travel time vs average travel time. b) Trip uncertainty.

It is important to recall that travelling preferences change during the day. Figure 3.a shows the 5-top transit trips for an example (o, d) location pair in Nancy. Figure 3.b shows the travel time and average trip counts for the 5-top trips for this example. All 5 trips are transit ones with one change. The figure reveals how the user preferences vary during the day. The trip planner recommends the trip shown in red for the fastest-trip query. First, this recommended trip is neither the fastest nor the most frequent one. Second, the trip shown in green is the most frequent during lunch time, despite being far from fast.

Figure 3. a) 5-top trips for one (origin, destination) pair in Nancy. b) The travel time and trip count distributions for the top 5 trips.

Figure 4 gives a more general picture. It shows the 240 most frequent (o, d) pairs in Nancy. For each pair, Figure 4.a uses different colors to show the changing user preferences. The most frequent trip is colored in dark blue. The second, third, fourth and fifth preferences are shown in blue, green, orange and brown, respectively. Trips are sorted by the distance between the origin and the destination (see Figure 4.b). Short trips expose a higher variability than longer ones. As the figure shows, the second choices are more visible (blue color) during the morning rush hours. Figure 4.c shows the trip planner recommendations for the same pairs. The recommendations are static and do not reflect the user preferences.

Figure 4. a) Changing user preferences for the most frequent (o,d) pairs in Nancy. b) Trip distances. c) Trip recommendations by the planner.

We conclude this section with Figure 5, which shows how the user preferences vary between the PT services. It presents the total passenger counts for all Nancy change points at 8am, 1pm and 6pm.

Figure 5. Change counts in Nancy at 8am, 1pm and 6pm.

4. Learning to rank trips

When a passenger travels from an origin o to a destination d at time t_s, she implicitly prefers the trip J she takes to all other trips J′, J′ ≠ J. Our approach is to transform this implicit feedback into an explicit set of pairwise trip preferences and to learn the ranking function f from them.

Algorithm 1 below uses the trip planner and a set T of individual passengers' trips. For any trip J ∈ T matching the query Q = (o, d, t_s), the algorithm retrieves the k-top candidates for Q and retains that J has been preferred to any of these candidates, except J itself if it happens to be in this set. A real trip J matches a recommended trip J′ if it has the same number of legs and follows the same sequence of services. If S_J = {l_1, ..., l_n} and S_{J′} = {l′_1, ..., l′_n}, then J matches J′ iff s_i = s′_i ∧ b_i = b′_i ∧ a_i = a′_i, for all i = 1, ..., n.

Algorithm 1 Rank learning algorithm.
Require: Collection T of passenger trips J = (S, C)
Require: Trip planner P with k-top recommendations
1: S ← ∅ {the set of pairwise preferences}
2: for each J ∈ T do
3:   Form the query Q = (o = b_1, d = a_n, t_s = t_{b_1})
4:   Query the planner P with query Q
5:   Retrieve the k-top trips as a list L
6:   for each J′ ∈ L, J′ ≠ J do
7:     Add (Q, x(J) ≻ x(J′)) to S
8:   end for
9: end for
10: Learn the ranking model f from S
Ensure: f

Once the ranking function f is learned, it can be used to improve the relevance of the trip planner recommendations according to the re-ranking scenario. The trip planner does not change the way it works: for a new user query Q, the trip planner first generates the k-top candidate trips, and these candidates are then re-ranked using the function f.

To learn a ranking function f, Algorithm 1 requires every trip J to be described by a feature vector x(J). In the following sections, we first describe a method for learning the ranking function f and then show how to extract relevant and dynamic features from individual trips.

4.1. Gradient Boosting Rank

We use the individual trips to form a set of pairwise preferences from which a ranking function f can be learned. For each individual trip J ∈ T, we generate a set of labeled data (x_{i,1}, y_{i,1}), ..., (x_{i,m_i}, y_{i,m_i}), i = 1, ..., |T|, which are preference pairs of feature vectors. If x_{i,j} has a higher rank than x_{i,k} (y_{i,j} > y_{i,k}), then x_{i,j} ≻ x_{i,k} is a preference pair, which means that x_{i,j} is ranked ahead of x_{i,k}. The preference pairs can be viewed as instances and labels in a new classification problem, where x_{i,j} ≻ x_{i,k} is a positive instance.

Any classification method can be used to train a classifier f(x), which is then used for ranking. Trips are assigned scores by f(x) and sorted by the scores.
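The preference-extraction loop of Algorithm 1 can be sketched as follows. This is a minimal illustration under assumed interfaces: the trip/leg record layout, the `planner(o, d, ts)` callable and the `features` function are placeholders standing in for the real planner API and the feature extraction of Section 5.

```python
from collections import namedtuple

# Assumed minimal trip representation (illustrative, not the paper's format)
Leg = namedtuple("Leg", "service board_stop alight_stop board_time")
Trip = namedtuple("Trip", "legs")

def matches(J, Jp):
    """J matches J' iff same number of legs and the same sequence of
    services with the same boarding/alighting stops."""
    return len(J.legs) == len(Jp.legs) and all(
        a.service == b.service
        and a.board_stop == b.board_stop
        and a.alight_stop == b.alight_stop
        for a, b in zip(J.legs, Jp.legs))

def extract_preferences(trips, planner, features, k=5):
    """Algorithm 1 (sketch): each observed trip J is recorded as preferred
    over every k-top planner candidate J' that does not match J."""
    S = []  # pairwise preferences: (x(J), x(J')) with x(J) preferred
    for J in trips:
        o = J.legs[0].board_stop
        d = J.legs[-1].alight_stop
        ts = J.legs[0].board_time
        for Jp in planner(o, d, ts)[:k]:   # k-top recommendations for Q
            if not matches(J, Jp):         # skip J itself if recommended
                S.append((features(J), features(Jp)))
    return S
```

The resulting list S is exactly the pairwise training set fed to the ranker in Section 4.1.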
Learning a good ranking model is thus realized by training a model for pairwise classification. The loss function in learning is pairwise because it is defined on pairs of feature vectors.

The pairwise approach is adopted in many methods, including Ranking SVM (Herbrich et al., 2000), RankBoost (Freund et al., 2003), RankNet (Burges et al., 2005), IR SVM (Tsai et al., 2007), GBRank (Zheng et al., 2007), LambdaRank (Burges et al., 2006), and others. In the following we adopt GBRank, one of the popular pairwise methods currently in use.

GBRank takes preference pairs as training data, {x_i^1, x_i^2}, x_i^1 ≻ x_i^2, i = 1, ..., N, and uses the parametric pairwise loss function

L(f) = (1/2) Σ_{i=1}^N (max{0, τ − (f(x_i^1) − f(x_i^2))})^2,

where f(x) is the ranking function and τ is a parameter, 0 < τ ≤ 1. The loss is 0 if f(x_i^1) is larger than f(x_i^2) + τ; otherwise, the incurred loss is (1/2)(f(x_i^2) − f(x_i^1) + τ)^2.

To optimize the loss function with respect to the training instances, Functional Gradient Descent is deployed. Treating all f(x_i^1), f(x_i^2), i = 1, ..., N as variables, the gradient of L(f) with respect to the training instances is

(−max{0, f(x_i^2) − f(x_i^1) + τ}, max{0, f(x_i^2) − f(x_i^1) + τ}), i = 1, ..., N.

If f(x_i^1) − f(x_i^2) ≥ τ, the corresponding loss is zero, and there is no need to change the ranking function. If f(x_i^1) − f(x_i^2) < τ, the loss is non-zero, and the ranking function is updated using gradient descent:

f_k(x) = f_{k−1}(x) − ν ∇L(f_k(x)),

where f_k(x) and f_{k−1}(x) denote the values of f(x) at the k-th and (k−1)-th iterations, respectively, and ν is the learning rate.

At the k-th iteration of learning, GBRank collects all the pairs with non-zero losses, {(x_i^1, f_{k−1}(x_i^2) + τ), (x_i^2, f_{k−1}(x_i^1) − τ)}, and employs Gradient Boosting Trees (Friedman, 2000) to learn a regression model g_k(x) that makes predictions on these regression data. The learned model g_k(x) is then linearly combined with the existing model f_{k−1}(x) to create the new model f_k(x) as follows:

f_k(x) = (k f_{k−1}(x) + β_k g_k(x)) / (k + 1),

with β_k a shrinkage factor (Zheng et al., 2007).

5. Trip feature extraction

We now describe each real trip J by a set of relevant and dynamic features x(J). There may exist explicit and implicit factors which influence the passenger choice. Passengers make their choices as a function of location and time.

We distinguish two groups of trip features. First, global features describe the whole trip; they include the travel time, the number of changes, the usage of specific types of transport (bus, train, tram, etc.), multi-modality, etc. Second, much more relevant and specific are local features that describe each service leg and change composing a given trip. For each PT service, we may extract the estimated mean and variance of the speed when using this line at this time period, and the average delay with respect to the schedule. For each change point, we can estimate the walking distance, if any, the closeness to a commercial zone or a transportation hub, etc.

Unfortunately, the raw service and change counts are generally sparse, noisy and prone to many errors. The main reasons for errors are the incorrect setup of ticket validation machines, the lack of alignment between ticket validation machines and GPS localization, and card misuse by travelers. We therefore intend to extract from these sparse and noisy counts latent features able to represent the user preferences and their dynamic character.

We split all trips J ∈ T into two collections of service and change observations, A^s = {l_i | l_i ∈ S_J, J ∈ T} and A^c = {c_i | c_i ∈ C_J, J ∈ T}. In the following, for brevity, we assume we work with a set of observations A; it may denote the service or change observations, or their sum.

Splitting all observations in A into T time periods, we obtain a sequence of count matrices A_t ∈ R_+^{p×p}, t = 1, ..., T, where entry a_{ij} is the service or change count during period t, and p is the number of stops.

The full diagram of latent feature extraction from individual trips and learning of the ranking function is given in Figure 6.

Figure 6. Preference features and re-ranking function learning.

5.1. Collapsed matrices

We first consider the static case when T = 1 and all observations from A are collapsed into one matrix A.

Both service and change data are sparse non-negative counts, and we can use non-negative matrix factorization (NNMF) as a method giving a robust low-rank interpretation of the data (Lee & Seung, 2001). The factorizations can be efficiently computed by formulating a penalized optimization problem and using modern gradient-descent algorithms (Hoyer, 2004).

Matrix A is approximated by a product of two low-rank matrices, estimated through the following minimization:

min_{U≥0, V≥0} ||A − U V^T||_F^2,

where U and V are p × K non-negative matrices. The rank or dimension K of the approximation corresponds to the number of latent factors; it is chosen to obtain a good data fit and interpretability. U gives the latent factors for the origin stops and V for the destination stops.

The factorized matrices are obtained by minimizing an objective function that consists of a goodness-of-fit term and a roughness penalty:

min_{U≥0, V≥0} ||A − U V^T||_F^2 + λ(||U||_1 + ||V||_1),    (1)

where the parameter λ ≥ 0 indicates the penalty strength; a larger penalty encourages sparser matrices U and V. Adding penalties to NNMF is a common strategy, since they not only improve interpretability but often improve the numerical stability of the estimation.

5.2. Smoothed Dynamic NNMF

In the general case T > 1, we have a sequence of matrices {A_t}, t = 1, ..., T. To produce a sequence of low-rank matrix factorizations {U_t, V_t}, t = 1, ..., T, we can extend the factorization in (1) to the case T > 1 by independent factorization of the T matrices {A_t}. However, we additionally impose a smoothness constraint on both U_t and V_t, in order to force the latent factors to be similar to those of the previous time periods, in both boardings and alightings. The objective function then becomes

min_{U_t≥0, V_t≥0} Σ_{t=1}^T ||A_t − U_t V_t^T||_F^2 + µ Σ_{t=2}^T (||U_t − U_{t−1}||_F^2 + ||V_t − V_{t−1}||_F^2) + λ Σ_{t=1}^T (||U_t||_1 + ||V_t||_1),    (2)

where the parameters λ, µ are set by the user. The objective function imposes smoothing of U_t and V_t over two successive time periods, but it can be generalized to a larger window. To estimate the matrices U_t and V_t, we use an extended version of the multiplicative updating algorithm for NNMF (Gillis & Glineur, 2012; Lee & Seung, 2001; Mankad & Michailidis, 2013), based on an adaptive gradient descent.

Temporal extensions of matrix factorization techniques have been studied in (Elsas & Dumais, 2010; Mankad & Michailidis, 2013; Saha & Sindhwani, 2012; Sun et al., 2014). (Elsas & Dumais, 2010) analyzed the temporal dynamics of Web document content; to improve relevance ranking, it developed a probabilistic document ranking algorithm that allows differential weighting of terms based on their temporal characteristics. (Sun et al., 2014) addressed recommendation systems with significant temporal dynamics; it developed the collaborative Kalman filter, which extends probabilistic matrix factorization in time through a state-space model. Community detection in time-evolving graphs is analyzed in (Mankad & Michailidis, 2013); the latent structure of overlapping communities is discovered through sequential matrix factorization.

To solve (2), we follow (Mankad & Michailidis, 2013) and consider the following Lagrangian:

L = Σ_{t=1}^T ||A_t − U_t V_t^T||_F^2 + µ Σ_{t=2}^T (||U_t − U_{t−1}||_F^2 + ||V_t − V_{t−1}||_F^2) + Σ_{t=1}^T (λ(||U_t||_1 + ||V_t||_1) + Tr(Φ U_t) + Tr(Ψ V_t)),    (3)

where Φ, Ψ are Lagrange multipliers. The method works as an adaptive gradient descent converging to a local minimum; the Karush-Kuhn-Tucker (KKT) conditions give the necessary conditions for convergence. The KKT optimality conditions are obtained by setting ∂L/∂U_t = 0 and ∂L/∂V_t = 0, t = 1, ..., T. It can be shown that they yield

Φ_t = −2 A_t V_t + 2 U_t V_t^T V_t − 2µ(U_{t−1} − U_t) + 2λ,
Ψ_t = −2 A_t^T U_t + 2 V_t U_t^T U_t − 2µ(V_{t−1} − V_t) + 2λ,    (4)

which after matrix algebra manipulations lead to the multiplicative updating rules presented in Algorithm 2.

Algorithm 2 Smoothed Dynamic NNMF algorithm.
Require: Matrices A_t, t = 1, ..., T; constants λ, µ
1: Initialize U_t, V_t as dense, positive random matrices
2: repeat
3:   for t = 1, ..., T do
4:     U_t ← U_t ⊙ (A_t V_t + µ U_{t−1}) ⊘ (U_t V_t^T V_t + µ U_t + λ)
5:     V_t ← V_t ⊙ (A_t^T U_t + µ V_{t−1}) ⊘ (V_t U_t^T U_t + µ V_t + λ)
6:   end for
7: until convergence
Ensure: U_t, V_t, t = 1, ..., T
(⊙ and ⊘ denote elementwise multiplication and division; the smoothing terms are omitted for t = 1.)

The convergence of the multiplicative updating algorithm is often reported to be slow. In practice we obtain meaningful factorizations after a handful of iterations, which we tend to explain by the sparseness of the input matrices A_t. In the future, when working with dense data, faster methods like the active set version of the alternating non-negative least squares (ANLS) algorithm (Kim & Park, 2008) will be more appropriate.

5.3. Dynamic trip features

Algorithm 2 finds sparse factorized matrices for a sequence of input matrices A_t, t = 1, ..., T. We first apply the algorithm to the sequences of service matrices A_t^s and change matrices A_t^c extracted from the full trip collection. We thus obtain smoothed factorized matrices U_t^s, V_t^s and U_t^c, V_t^c, t = 1, ..., T, for the services and changes, respectively. At time period t, a boarding stop b has latent factors given by the corresponding row in U_t^s; this row is denoted U_t^s(b). For an alighting stop a, the row V_t^s(a) gives the latent factors at time t. We then apply the algorithm to the sum matrices A_t^f = A_t^c + A_t^s, t = 1, ..., T; the smoothed factorized matrices for A_t^f are denoted U_t^f, V_t^f.

To generate a feature vector x for a trip J, we use its decomposition into service legs and changes, J = (S, C). The vector x(J) is then composed of a general feature vector x^g and four latent components, x(J) = {x^g, x_b^s, x_a^s, x_b^c, x_a^c}, where

• x_b^s, x_a^s are latent feature vectors averaged over the trip boarding and alighting places, respectively:

x_b^s = (1/n) Σ_{i=1}^n U^s_{t_{b_i}}(b_i);  x_a^s = (1/n) Σ_{i=1}^n V^s_{t_{a_i}}(a_i);

• x_b^c, x_a^c are latent feature vectors averaged over the change places (alighting and boarding), respectively:

x_b^c = (1/(n−1)) Σ_{i=1}^{n−1} U^c_{t_{b_i}}(b_i);  x_a^c = (1/(n−1)) Σ_{i=1}^{n−1} V^c_{t_{a_i}}(a_i).

In the case of the sum latent matrices U_t^f, V_t^f, x(J) is composed of the general feature vector x^g and two latent components, x(J) = {x^g, x_b^f, x_a^f}, obtained from U_t^f and V_t^f.

6. Evaluation

To test our method for learning a ranking function from individual trips, we processed 5.2M individual trips collected in Nancy, France during 3 months in 2012. The Nancy PT network includes 1129 nodes/stops and offers 107 bus and tram services to travelers. We also processed 12.5M trips from Adelaide, Australia collected during 2.5 months in 2013. The Adelaide network offers 312 bus and tram service variations and accounts for 3524 stops.

To evaluate the impact of modeling user preferences from actual trips, we selected the 240 most frequent origin-destination pairs in Nancy (see Figure 4) and the 160 most frequent pairs in Adelaide.

When generating the temporal sequences of count matrices, we test two cases, T = 24 and T = 48, where each matrix aggregates the passenger counts over one hour or 30 minutes, respectively. Once a matrix sequence is generated, each matrix is randomly split into 70% training data and the remaining 30% for testing. All results below are means and variances over 10 independent runs.

We retrieved the trip planner recommendations for Nancy (http://www.reseau-stan.com/) and Adelaide (https://www.adelaidemetro.com.au/). We learn the ranking function and use it to re-rank the trip recommendations, using the different options described in the previous sections. To understand the effect of raw count factorization, we consider several options. First, we collapse the matrices, thus disregarding the temporal aspect. Second, we consider either the service matrices A_t^s and the change matrices A_t^c separately, or sum them up, A_t^f = A_t^s + A_t^c, before the factorization. Third, we study the effect of temporal smoothing, where the factorization is done either independently or by smoothing over successive time periods. Finally, we test different values of K for the factorization.

In all experiments with GBRank (see Section 4.1), parameter τ was set to 0.3 and the shrinkage factors β_k to 0.8. For the smoothed dynamic NNMF, the optimal values of µ and λ were determined by cross-validation. For evaluating the results of the ranking methods, we use a measure commonly used in information retrieval, the Normalized Discounted Cumulative Gain (NDCG). We report NDCG at rank 1, which measures the error rate of the 1-top recommendation with respect to the perfect ranking.

Method                        | Nancy T = 24 | Nancy T = 48 | Adelaide T = 24 | Adelaide T = 48
Baseline: Trip Planner        | 24.91 ± 1.20 | 24.91 ± 1.28 | 38.17 ± 2.28    | 38.17 ± 2.28
Collapsed: Services           | 24.73 ± 1.17 | 24.73 ± 1.22 | 29.97 ± 2.11    | 29.97 ± 2.11
Collapsed: Changes            | 19.69 ± 1.01 | 19.69 ± 1.09 | 28.63 ± 2.29    | 28.63 ± 2.29
Collapsed: Services+Changes   | 19.30 ± 1.14 | 19.30 ± 1.03 | 28.05 ± 2.32    | 28.05 ± 2.32
Collapsed: Sum                | 19.59 ± 1.13 | 19.59 ± 1.10 | 28.17 ± 2.18    | 28.17 ± 2.18
Indep: Services               | 14.08 ± 0.92 | 15.33 ± 0.97 | 25.33 ± 2.07    | 24.87 ± 1.87
Indep: Changes                |  9.55 ± 0.90 |  9.89 ± 0.86 | 23.89 ± 1.67    | 23.93 ± 1.75
Indep: Services+Changes       |  9.52 ± 0.89 |  9.41 ± 0.87 | 22.41 ± 1.72    | 22.15 ± 1.55
Indep: Sum                    | 10.42 ± 0.89 |  9.37 ± 0.86 | 22.37 ± 1.56    | 23.55 ± 1.59
Smooth: Services              |  9.22 ± 0.77 |  9.37 ± 0.78 | 15.37 ± 1.38    | 14.71 ± 1.24
Smooth: Changes               |  6.71 ± 0.82 |  6.69 ± 0.74 | 16.69 ± 1.24    | 16.69 ± 1.15
Smooth: Services+Changes      |  5.83 ± 0.81 |  6.12 ± 0.79 | 14.12 ± 1.29    | 13.63 ± 1.14
Smooth: Sum                   |  7.63 ± 0.79 |  7.05 ± 0.81 | 15.05 ± 1.41    | 14.43 ± 1.32

Table 1. NDCG@1 values for 12 methods and two cities.

Table 1 reports the evaluation results for 12 different methods and compares them to the trip planner baseline for both cities. The analysis of these results provides some interesting insights. First, the results are globally better for the smaller Nancy than for the bigger Adelaide, for both T = 24 and T = 48. Second, the collapsed matrices improve somewhat over the baseline, but only taking the temporal user preferences into account really boosts the performance; moreover, the smoothed matrix factorization improves considerably over the independent one. Third, the change latent variables appear to be more relevant than the service ones. Instead, using the sum counts performs worse than keeping the service and change variables separate; we tend to explain this by the heterogeneity of service and change preferences.

Figure 7. Independent and smoothed predictions vs. the number of latent variables.

Figure 8. NDCG@1: independent and smoothed predictions during the day.

Figure 7 shows the performance of the 3 independent and the 3 smoothed methods for T = 24 for Nancy, with the number of latent variables K varying between 2 and 30.
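For concreteness, the multiplicative updates of Algorithm 2 used to produce these factorizations can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the elementwise update form implied by the KKT conditions (4); the hyperparameter defaults and the handling of t = 1 (no smoothing term) are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def smoothed_dynamic_nnmf(A, K, lam=0.1, mu=0.1, n_iter=100, eps=1e-9):
    """Sketch of Algorithm 2: factorize each count matrix A[t] ~ U[t] V[t]^T
    with an L1 sparsity penalty (lam) and smoothing toward the factors of
    the previous time period (mu)."""
    T = len(A)
    p = A[0].shape[0]
    rng = np.random.default_rng(0)
    U = [rng.random((p, K)) + 0.1 for _ in range(T)]  # dense, positive init
    V = [rng.random((p, K)) + 0.1 for _ in range(T)]
    for _ in range(n_iter):
        for t in range(T):
            # For t = 0 there is no previous period; reusing U[t]/V[t]
            # makes the smoothing term neutral (illustrative choice).
            Uprev = U[t - 1] if t > 0 else U[t]
            Vprev = V[t - 1] if t > 0 else V[t]
            U[t] *= (A[t] @ V[t] + mu * Uprev) / (
                U[t] @ V[t].T @ V[t] + mu * U[t] + lam + eps)
            V[t] *= (A[t].T @ U[t] + mu * Vprev) / (
                V[t] @ U[t].T @ U[t] + mu * V[t] + lam + eps)
    return U, V
```

The updates keep all factors non-negative by construction, since each step multiplies a non-negative matrix by a ratio of non-negative terms.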
Surpris- the transportation system daily. We convert the actual trips ingly, already K=2 performs well enough, thus indicating into a set of pairwise preferences and learn a ranking func- the sparsity of the count matrices. tion using the Gradient Boosting Rank method. We de- Figure 8 reports the hour-per-hour performance for the scribe actual trips with a number of time-dependent latent same six methods for Nancy case. Rush hours and lunch features, and develop a smoothed non-negative matrix fac- time appear to be hard for all methods; the error is the torization to estimate the latent variables of user prefer- smallest for the periods 10am-12am and 2pm-4pm that ences while choosing PT services and change points. Ex- points to the correlation between the traffic and trip vari- periments with real trip data demonstrate that the re-ranked ability. The traffic growth pushes travelers away from the trips are measurably closer to those actually chosen by pas- conventional traveling choices. sengers than are the trips produced by planners with static k9 ICML MUD 2015 heuristics. D. Lee and H. S. Seung. Algorithms for non-negative ma- trix factorization. In Proc. NIPS’01, pages 556–562, References 2001. C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, J. Letchner, J. Krumm, and E. Horvitz. Trip router with in- N. Hamilton, and G. Hullender. Learning to rank using dividualized preferences: Incorporating personalization gradient descent. In ICML’05, pages 89–96, 2005. into route planning. In Proc. IAAI’06 - Vol 2, pages 1795–1800. AAAI Press, 2006. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with T. Liebig, N. Piatkowski, C. Bockermann, and K. Morik. nonsmooth cost functions. In Proc. NIPS’06, pages 193– Predictive trip planning-smart routing in smart cities. In 200, 2006. EDBT/ICDT Workshops, pages 331–338, 2014. Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Huang, and H.-W. Hon. T.-Y. Liu. Learning to Rank for Information Retrieval. 
B. Casey, A. Bhaskar, H. Guo, and E. Chung. Critical review of time-dependent shortest path algorithms: A multimodal trip planner perspective. Transport Reviews, 34:522–539, 2014.

J. L. Elsas and S. T. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. WSDM'10, pages 1–10, 2010.

Y. Freund, R. D. Iyer, R. E. Schapire, and Y. Singer. An efficient boosting algorithm for combining preferences. J. Machine Learning Res., 4:933–969, 2003.

J. H. Friedman. Greedy function approximation: A gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.

N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Comput., 24(4):1085–1105, 2012.

R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers, pages 115–132, 2000.

P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5:1457–1469, 2004.

H. Kim and H. Park. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl., 30(2):713–730, 2008.

N. Lathia and L. Capra. Mining mobility data to minimise travellers' spending on public transport. In Proc. ACM KDD'11, pages 1181–1189, 2011.

S. Mankad and G. Michailidis. Structural and functional discovery in dynamic networks with non-negative matrix factorization. Phys. Rev. E, 88:042812, 2013.

L. McGinty and B. Smyth. Turas: A personalised route planning system. In Proc. PRICAI'00, pages 791–791, 2000.

M. Mezghani. Study on electronic ticketing in public transport. European Metropolitan Transport Authorities (EMTA), 38:1–56, 2008.

A. Mokhtari, O. Pivert, and A. HadjAli. Integrating complex user preferences into a route planner: A fuzzy-set-based approach. In IFSA/EUSFLAT Conf., pages 501–506, 2009.

M.-P. Pelletier, M. Trepanier, and C. Morency. Smart card data in public transit planning: A review. CIRRELT Rapport 2009-46, 2009.

A. Saha and V. Sindhwani. Learning evolving and emerging topics in social media: A dynamic NMF approach with temporal regularization. In Proc. WSDM'12, pages 693–702, 2012.

C. Seaborn, J. Attanucci, and N. H. M. Wilson. Analyzing multimodal public transport journeys in London with smart card fare payment data. Transportation Research Record: J. Transp. Research Board, 2121:55–62, 2009.

J. Z. Sun, D. Parthasarathy, and K. R. Varshney. Collaborative Kalman filtering for dynamic matrix factorization. IEEE Trans. on Signal Processing, 62(14):3499–3509, 2014.

M. Trepanier, R. Chapleau, and B. Allard. Can trip planner log files analysis help in transit service planning? Journal of Public Transportation, 8(2):79–103, 2005.

M.-F. Tsai, T.-Y. Liu, T. Qin, H.-H. Chen, and W.-Y. Ma. FRank: A ranking method with fidelity loss. In Proc. SIGIR'07, pages 383–390, 2007.

F. Xia, T.-Y. Liu, J. Wang, W. Zhang, and H. Li. Listwise approach to learning to rank: Theory and algorithm. In Proc. ICML'08, pages 1192–1199, 2008.

J. Yuan, Y. Zheng, X. Xie, and G. Sun. Driving with knowledge from the physical world. In Proc. KDD'11, pages 316–324, 2011.

Z. Zheng, K. Chen, G. Sun, and H. Zha. A regression framework for learning ranking functions using relative relevance judgments. In Proc. SIGIR'07, pages 287–294, 2007.
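As a supplementary illustration, the assembly of the trip feature vector x(J) from the time-dependent latent matrices (the averages over boarding, alighting, and change places given above) can be sketched in Python. This is a minimal sketch under assumed data layouts: the function name, the dense (T, n_stops, K) array representation of the latent matrices, and the time-slot indexing are illustrative and not part of the paper.

```python
import numpy as np

def trip_feature_vector(x_g, U_s, V_s, U_c, V_c, b, a, t_b, t_a):
    """Assemble x(J) = {x_g, x_sb, x_sa, x_cb, x_ca} for a trip with n legs.

    U_s, V_s (services) and U_c, V_c (changes) are hypothetical dense
    arrays of shape (T, n_stops, K): one K-dim latent vector per
    (time slot, stop).  b, a are the boarding/alighting stop indices
    b_1..b_n and a_1..a_n; t_b, t_a are their time slots.
    """
    n = len(b)
    # service components: average the latent vectors over all n legs
    x_sb = np.mean([U_s[t_b[i], b[i]] for i in range(n)], axis=0)
    x_sa = np.mean([V_s[t_a[i], a[i]] for i in range(n)], axis=0)
    if n > 1:
        # change components: average over the first n-1 legs, as in the
        # formulas with sums running to n-1
        x_cb = np.mean([U_c[t_b[i], b[i]] for i in range(n - 1)], axis=0)
        x_ca = np.mean([V_c[t_a[i], a[i]] for i in range(n - 1)], axis=0)
    else:
        # direct trip without changes: no change places to average over
        x_cb, x_ca = np.zeros_like(x_sb), np.zeros_like(x_sa)
    return np.concatenate([x_g, x_sb, x_sa, x_cb, x_ca])
```

With K latent variables per matrix, the result has |x_g| + 4K components, matching the five-part decomposition x(J) = {x_g, x_sb, x_sa, x_cb, x_ca}; for the sum latent matrices, only two latent components would be concatenated instead.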