Analysis of the preferences of public transport passengers in the task of building a personalized recommender system A A Borodinov1, V V Myasnikov1,2 1 Samara National Research University, Moskovskoe Shosse 34А, Samara, Russia, 443086 2 Image Processing Systems Institute of RAS - Branch of the FSRC "Crystallography and Photonics" RAS, Molodogvardejskaya street 151, Samara, Russia, 443001 e-mail: aaborodinov@yandex.ru Abstract. The paper presents the theoretical and algorithmic aspects for making a personalized recommender system (mobile service) designed for public route transport users. The main focus is on identifying and formalizing the concept of "user preferences", which is the basis of modern personalized recommender systems. Informal (verbal) and formal (mathematical) formulations of the corresponding problems of determining "user preferences" in a specific spatial-temporal context are presented: the preferred stops definition and the preferred "transport correspondence" definition. The first task can be represented as a well-known classification problem. Thus, it can be formulated and solved using well-known pattern recognition and machine learning methods. The second is reduced to the construction of dynamic graphs series. The experiments were conducted on data from the mobile application "Pribyvalka-63". The application is the tosamara.ru service part, currently used to inform Samara residents about the public transport movement. 1. Introduction The amount of heterogeneous data characterizing the transport situation in the city has increased due to the widespread and active use of modern electronic communications systems, global navigation systems, and various active and passive sensors. Such information is used in navigation and reference systems (services) quite widely [1]. However, along with the development of services and their popularization, the expectations and demands of users and the amount of information that has to be taken into account when planning movements are growing. User demand is individualized from the classic tasks of searching for the “shortest path” [2] or getting a “forecast of arrival at a stop of public transport” [3, 4], shifting expectations from services to Intelligent personal assistants. Although the final decision or choice in such systems remains with the person, the options of the solutions they offer are significantly dependent on the scenario conditions of the request as well as on previous actions and decisions of the user [5, 6]. Accounting for all of the indicated factors is possible in “self-tuning” systems for individual user preferences based on machine learning methods [7]. The imperfection of existing algorithms and the lack of significant experience in the use of machine learning methods in such systems prevents the rapid emergence of such services. V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov Multimodal routing is determined by the possibility of using several modes of transport in one trip. The analysis of modern literature devoted to recommender systems of multimodal routing [5, 8, 9] allows us to identify some major problems: – The cold start problem is a well-known and well-researched problem for recommender systems [8, 10]: it is essential to achieve a balance between the accuracy of the recommended routes from system initialization. Thus, the allowable setting time for a personal preference profile should be small. – The receiving information method from the user is not formalized [11, 12]. – Individual characteristics such as personal income, age, gender, family size, access to public transport influence the choice of the route even for the same purpose of the trip [13]. – User preferences change over time. In addition, context influences user selection [14, 15]. – Typical existing solutions mainly use the Bayesian approach with a sequential parameter recalculation scheme [5, 16]. – It is possible to use transfer learning to improve recommendations [17]. – The problem of determining traffic flow on the vehicle route [18]. This article proposes one of the possible ways of describing and solving the problem of determining individual preferences of users of public route transport and creating a personalized recommender system. The system uses user interaction data with the mobile service as part of the problem of creating a personalized recommender system. The second section of the work formalizes the basic concepts and introduces the basic notation for all objects of interaction. The third section describes the information arising from the public transport user interaction with the mobile application "Pribyvalka-63". Mobile application and service tosamara.ru, are currently used to inform Samara residents about public transport movement and its arrival at the stop. Also in this section are presented the variants of unformalized (verbally described) definitions of "user preferences", suitable for further consideration. The fourth section presents the mathematical formulations of problems, as well as methods and algorithms. Finally, the fifth section presents the results of experimental studies on real data obtained using the mobile application "Pribyvalka-63". 2. Definitions and designations Let S be the set of public transport stops. Let for each stop s ∈ S defined spatial (geographical) coordinates x s ≡ ( xs , ys , zs ) and some unique stop identifier, denoted by ID(s). Without loss of generality, we can assume that the set S is ordered (for example, by ID(s)): S = s1 , s2 , , s S . { } Let the value d determine the calendar date, the value t - the time of day, and w ( d ) ∈W - the day of the week, taking values from the set: W=W0  W1 , (1) W0 ≡ {MON , TUE ,WEN , THU , FRI } , W1 ≡ {SAT , SUN }. Let V determine the set of public transport vehicles, each v ∈ V is characterized by the type type ( v ) ∈ { BUS , TRAM , TROL, MARS } , (2) and has a unique identifier ID(v) (in practice, the unique identifier may coincide with the vehicle state registration number). For each vehicle at any time, we consider its spatial coordinates to be determined: x ( v, d , t ) ≡ ( x ( v, d , t ) , y ( v, d , t ) , z ( v, d , t ) ) . (3) Denote the routes set of public transport objects as M. In addition, each route m ∈ M is characterized by five arguments: ( m ≡ ID ( m ) , N ( m ) , s ( m ) , N * ( m) , x ( m)) , (4) where ID ( m ) - route identifier (in practice, the route number), N ( m ) - stops number in the route, and s ( m ) - stops sequence in an amount N ( m ) of form: V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 199 Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov ( ) s ( m ) = s1m , s2m ,, sNm( m ) , (5) where snm ∈ S ( m ∈ M, n ∈1, N ( m )) . Let S( m ) ≡ {s } m ( ) i i =1, N m ⊆ S be the stops set of the corresponding route, ind ( s, m ) - stop index s of route m, viz. ind ( snm , m ) = n . In case s ∉ S ( m ) the corresponding index is assumed to be "indefinite", and denoted as ∆: ind ( s, m ) = ∆ . Denote the routes set passing through one or a couple stops as follows: M ( s ) ≡ {m ∈ M: s ∈ S ( m )} , M ( s1, s 2 ) ≡ {m ∈ M: s1∈ S ( m ) ∧ s2 ∈ S ( m )} . (6) More detailed information about the route geometry is represented by a pair N ( m ) , x ( m ) , where * the first value determines the number of nodes of the polyline describing the route, and the second is the vector defining the coordinates of these nodes: ( x ( m ) ≡ x1m , x 2m , , x mN * m . ( ) ) (7) For convenience, we will call the pair ( m, k ) , ( k = 1, K ( d , m )) route implementations (RI) on the appropriate day d. Additionally, we denote t ( d , m, k , s ) - vehicle arrival time assigned to RI (m,k) on day d, to stop s ∈ S ( m ) (in case s ∉ S ( m ) we consider the time value as uncertain). Denote the vehicle assigned to RI (m,k) on day d, as v ( d , m, k ) ∈ V (k = 1, K ( d , m ) ) . In addition to the vehicles, pedestrians and passengers, considered in the paper as the users of transport services, are participants in the traffic. Denote by U the set of users, and we will characterize each specific user u ∈ U with a unique identifier ID(u) (mobile device id or hash code) and spatial coordinates at a specific point in time d , t : x ( u, d , t ) ≡ ( x ( u, d , t ) , y ( u, d , t ) , z ( u, d , t ) ) . (8) If there is no information about user coordinates, we consider that they have "undefined value" ∆ (all three at the same time). Figure 1. Blue circle – stop location, purple circles – user location points at request time. 3. Mobile application and support service data, «informal preferences» options For the mobile service "Pribyvalka-63" data for analysis are presented as follows: - stop data (identifiers and coordinates); - route information (identifiers and stop identifier list); V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 200 Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov - data on the vehicle (identifiers), location coordinates (with a frequency of 2 times per minute), the destination to routes; - coordinates of users and request parameters are recorded during requests (request results are not saved, since they can be restored from vehicle traffic data) in the form: ID ( s ) , d , t , ID ( u ) , x ( u , d , t ) ; - user response to the request is not saved. Based on the presented data, the following two options "user preferences" seem appropriate (we consider the user known): - user-preferred stops at specific space-time coordinates (Figure 1); - user-preferred "transport correspondence", also considered in the space-time context. "Transport correspondence" refers to the actual movement from one stop to another, the route chosen and the route vehicle type. Information about the "starting" and "end" stops of a particular user is optional (derived) information (Figure 2). Figure 2. "Start" and "End" stops of a specific user. 4. Methods 4.1 User-preferred stops The task of determining "user-preferred stops" when it is in certain space-time coordinates can be formalized as follows. Let given precedent set (training examples) for a specific user. Each precedent is a set of space- time context vector-description and the corresponding "answer" (in our case, the stop identifier). It is necessary for the newly received space-time context vector-description of the new situation (which may be absent in the list of precedent examples), to indicate the most "suitable (s) / relevant (s)" answer (s), if necessary, ordering them by degree relevance. For the case of a single issued response, the described task is a well-known classification theory [19, 20] or machine learning [7, 21], where, according to the object/situation description, the system should indicate the object/situation class as a feature vector. Any recognition algorithm can be represented as two consecutive operators. The first operator translates the description of the object into a numerical value characterizing the "degree of membership" to the class. And the second, according to the indicated value, refers to a specific class [22]. In statistical methods of recognition, the posterior probability [19] is used as a numerical value, in algebraic [23] - estimates, in neural network - the output of the last layer of neurons, etc. Denote the specified numeric value Γ ( features; class ) . V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 201 Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov Thus, the formal formulation of the problem for a specific user u ∈ U can be represented as follows (where z ≡ ID ( s ) ). Given: a) precedent set in the form {xi , di , ti ; zi }i∈ℑ . (feature vector; answer) b) feature vector of the new situation x, d , t . It is necessary: for the specified vector x, d , t to determine the permutation of objects { from an ordered set (stops) S = s1 , s2 , , s S } such that Γ ( x, d , t ; ID ( s ( ) ) ) ≥ Γ ( x, d , t ; ID ( s ( ) ) ) ≥  ≥ Γ ( x, d , t ; ID ( s ( ) ) ) . σ 1 σ 2 σ S (9) The result for the user is an ordered stops list: sσ (1) , sσ ( 2) , , sσ ( S ) . (10) The formal quality measure of the final decision: ( ( )) S 1 Γ=Σ ∑i =1 i Γ x, d , t ; ID sσ ( i ) . (11) Solution: Theory of pattern recognition and machine learning offers a variety of methods for solving the problem. In this paper, we use an approach based on the calculating estimates idea proposed by Yu. I. Zhuravlev [23] and nonparametric estimation of Parzen probability density [19]. Set the value Γ ( features; class ) , characterizing feature vector belongs to a class Γ ( x, d , t ; z ) = ∑ µ ( x, d , t= i∈ℑ ;x , d ,t ) I ( z z ) , i i i i (12) where  ( w ( d ) ∈W0 ∧ w ( di ) ∈W0 ) ∨  = µ ( x, d , t ; xi , di , ti ) I   ⋅ exp ( −α t − ti ) ⋅ exp ( − β x − xi ) . (13)  ( w ( d ) ∈W1 ∧ w ( di ) ∈W1 )    Event indicator: 1, a = true; I (a) =  (14) 0, a = false. The values - some coefficients, t − ti - a numerical value (for example, the number of seconds), which characterizes the difference between t , ti . As a result, the algorithm for solving the problem of determining "user-preferred stops" will be as follows: Step 1. For all stops from the set S calculate the values (12): Γ ( x, d , t ; ID ( si ) ) , i = 1, S . (15) Step 2. The values set obtained in (15) is ordered in descending order — a permutation is formed (9). The resulting permutation is the solution of the problem. An ordered list of stops is provided to the user (10). «Cold Start» Solution: To solve the "cold start" problem, the precedent set as follows supplements the initially empty precedent set: {( x , d 0, t 0; ID ( s ) )}  {( x , d1, t 0; ID ( s ) )} , i = 1, S , i i i i (16) where t0="0h00min", d0 and d1 are the dates, respectively, of the weekend and working days preceding the system launch date, and xi i = 1, S ( ) - stop coordinates s (i = 1, S ) . Analysis of i V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 202 Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov expressions (12) - (13) shows that with such "starting" data, the contribution of the time component exp ( −α t − ti ) in expression (12) will be the same, and the differences in values Γ () will be completely determined by differences in Euclidean distances from the point x to the stops coordinates ( ) xi i = 1, S . Thus, the value Γ ( x,; ID ( s ) ) will be more significant when s closer to x . 4.2 User-preferred "transport correspondence" The task of determining the user-preferred "transport correspondence" can be presented as the task of estimating the probability characteristics (relative frequency) of correspondences. That is, the movements from the stop s1 to the stop s2, in the space-time context. The following values are important (all characteristics are related to the behavior of a particular user u): • pu ( t s1, s 2, m, Wa ) ( m ∈ M ( s1, s2) , a = 0,1) - correspondence time distribution density s1→s2 with the route choice m on the weekday Wa ; the density function corresponds to the “boarding time” on the route vehicle, and is indicated to stop s1; • pu ( t s1, s 2, Wa ) - correspondence time distribution density s1→s2 on the weekday Wa ; • Pu ( s1, s 2 Wa ) - correspondence probability s1→s2 on the weekday Wa ; • Pu ( m s1, s 2, Wa ) - probability of choosing the route m for implementing the correspondence s1→s2 on the weekday Wa ; • Pu ( m Wa ) - probability of choosing the route m for implementing the correspondence on the weekday Wa ; • Pu * ( s | Wa ) - the probability that the stop s is the “end/start”. Additional information about user behavior can be obtained from additional data: • pu ( ρ | Wa ) - distances distribution that the user is able to overcome without using route vehicles; • pu (τ ρ , Wa ) - time distribution that the user spends in overcoming the corresponding distance. All specified values can be calculated on potential user correspondences data collected by the recommender system: { } sistart , siend , mij , ki j , t ( d , mij , ki j , sistart ) ,τ i* , σ i* i∈I d (17) start end for each day d and user. Where s i ,s i - correspondence data, information about the route mij , ki j t ( d , mij , ki j , sistart ) - vehicle arrival time RI at the stop sistart ,τ i* ,σ i* - mean and standard deviation of potential boarding on the vehicle. 5. Experiments The presented method software implementation was written in Python. The results were visualized based on Google Maps. The experiments used "Pribyvalka-63" mobile application and tosamara.ru service data. The database obtained during the experiments, contains information about requests 57190 users. Each user is represented by a unique identifier ID(u), which is defined by the device ID hash code and is impersonal. The database contains a total of 4103161 user requests for an arrival forecast at a public transport stop. From 1478 stops of the tosamara.ru service, users made requests to 1417 stops. For the experiments, we selected common user requests that represent the average user of the service. Maps with different parameters and request time were built to visualize the results of the proposed approach. The color of the area on the map corresponds to the first stop from the ordered list (10). An example of determining the preferred stop for the user is shown in Figure 3. V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 203 Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov Figure 3. Preferred stops map depending on user location. Leave-One-Out cross-validation was applied to obtain statistical indicators characterizing the quality of the proposed algorithm. The partitions number С1ℑ of the user requests set in this case is equal to ℑ , and the classification accuracy is calculated as follows: Accuracy = 1 ( ) ∑ I zi ≡ ID(sσ i (1) ) ⋅100% ℑ i∈ℑ (18) where the indicator I ( a ) corresponds to (14). Also, another method was implemented for comparison with the proposed algorithm, in which the user was offered the nearest stop, without taking into account previous requests. The proposed algorithm accuracy was 93%, for the nearest stop algorithm - 65%. 6. Conclusion In this paper, we presented the informal and mathematical problem formulations of defining user preferences. Users take public transport route in a personalized recommendation system task. We showed the results of an experimental study. An algorithm was developed using pattern recognition and machine learning methods. The determining the user preferred transport correspondence task was formalized and an approach to its solution was specified. 7. References [1] Chorus C G, Molin E J E and Van Wee B 2006 Use and effects of Advanced Traveller Information Services (ATIS): A review of the literature Transport Reviews 26 127-149 [2] Agafonov A A and Myasnikov V V 2018 Numerical route reservation method in the geoinformatic task of autonomous vehicle routing Computer Optics 42(5) 912-920 DOI: 10.18287/2412-6179-2018-42-5-912-920 [3] Agafonov A A, Yumaganov A S and Myasnikov V V 2018 Big data analysis in a geoinformatic problem of short-term traffic flow forecasting based on a K nearest neighbors method Computer Optics 42(6) 1101-1111 DOI: 10.18287/2412-6179-2018-42-6-1101-1111 [4] Agafonov A and Myasnikov V 2015 Traffic flow forecasting algorithm based on combination of adaptive elementary predictors Communications in Computer and Information Science 542 163-174 V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 204 Image Processing and Earth Remote Sensing A A Borodinov, V V Myasnikov [5] Arentze T A 2013 Adaptive personalized travel information systems: A bayesian method to learn users’ personal preferences in multimodal transport networks IEEE Transactions on Intelligent Transportation Systems 14 1957-1966 [6] Nuzzolo A, Crisalli U, Comi A and Rosati L 2015 Individual behavioural models for personal transit pre-trip planners Transportation Research Procedia 5 30-43 [7] Portugal I, Alencar P and Cowan D 2018 The use of machine learning algorithms in recommender systems: A systematic review Expert Systems with Applications 97 205-2 27 [8] Campigotto P, Rudloff C, Leodolter M and Bauer D 2017 Personalized and Situation-Aware Multimodal Route Recommendations: The FAVOUR Algorithm IEEE Transactions on Intelligent Transportation Systems 18 92-102 [9] Eiter T, Krennwallner T, Prandtstetter M, Rudloff C, Schneider P and Straub M 2016 Semantically Enriched Multi-Modal Routing International Journal of Intelligent Transportation Systems Research 14 20-35 [10] Mikic Fonte F A, López M R, Burguillo J C, Peleteiro A and Barragáns Martínez A B 2013 A Tagging Recommender Service for Mobile Terminals Information and Communication Technologies in Tourism (Springer Berlin Heidelberg) 424-435 [11] G. March J 1978 Bounded Rationality, Ambiguity, and the Engineering of Choice Bell Journal of Economics 9 587-608 [12] Campigotto P and Passerini A 2010 Adapting to a realistic decision maker: Experiments towards a reactive multi-objective optimizer Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6073 LNCS 338-341 [13] Zhang J and Arentze T 2013 Design and implementation of a daily activity scheduler in the context of a personal travel information system Lecture Notes in Geoinformation and Cartography 407-433 [14] Braunhofer M and Ricci F 2017 Selective contextual information acquisition in travel recommender systems Information Technology and Tourism 17 5-29 [15] Braunhofer M, Elahi M and Ricci F 2015 User Personality and the New User Problem in a Context-Aware Point of Interest Recommender System Information and Communication Technologies in Tourism (Springer International Publishing) 537-549 [16] Guo S and Sanner S 2010 Real-time multiattribute Bayesian preference elicitation with pairwise comparison queries Journal of Machine Learning Research 9 289-296 [17] Pan S J and Yang Q 2010 A Survey on Transfer Learning IEEE Transactions on Knowledge and Data Engineering 22 1345-1359 [18] Myasnikov V V 2012 Method for detection of vehicles in digital aerial and space remote sensed images Computer Optics 36 429-438 [19] Fukunaga K 1990 Introduction to statistical pattern recognition (San Diego: Academic Press) [20] Vorontsov K V Machine Learning URL: http://www.machinelearning.ru/wiki/ (date of the application 10.11.18) [21] Bishop C M 2006 Pattern Recognition and Machine Learning (Springer) [22] Zhuravlev Yu I and Nikiforov V V 1971 Recognition Algorithms Based on Estimation Calculation Cybernetics 3 [23] Zhuravlev Yu I and Gurevich I B 1989 Pattern recognition and image recognition Recognition, classification, forecast. Mathematical methods and their application 2 Acknowledgments The work was funded by the Ministry of Science and Higher Education of the Russian Federation (unique project identifier RFMEFI57518X0177). V International Conference on "Information Technology and Nanotechnology" (ITNT-2019) 205