Data-Driven Location Annotation for Fleet Mobility Modeling Riccardo Guidotti Mirco Nanni Francesca Sbolgi University of Pisa ISTI-CNR, Pisa University of Pisa Pisa, Italy Pisa, Italy Pisa, Italy riccardo.guidotti@di.unipi.it mirco.nanni@isti.cnr.it f.sbolgi@unipi.it ABSTRACT (IMN) capturing the structured patterns of visits to locations [21]. The large availability of mobility data allows studying human The existing works either focus on building mobility models or behavior and human activities. However, this massive and raw on adding semantics from external data sources. On the other amount of data generally lacks any detailed semantics or useful hand, in this work we exploit the mobility data used to build the categorization. Annotations of the locations where the users stop individual model also to annotate the model itself. may be helpful in a number of contexts, including user modeling In particular, we advance IMNs proposing a data-driven pro- and profiling, urban planning, activity recommendations, and cedure for extracting Annotated Individual Mobility Networks can even lead to a deeper understanding of the mobility evo- (AIMNs). A limitation of actual IMNs is that they are indeed β€œin- lution of an urban area. In this paper, we foster the expressive dividual” and not easily comparable. Following [1], our goal is to power of individual mobility networks, a data model describing add semantics to the locations modeling the nodes of the IMNs users’ behavior, by defining a data-driven procedure for locations in order to make them comparable among different users. We annotation. The procedure considers individual, collective, and accomplish this task by designing a procedure considering indi- contextual features for turning locations into annotated ones. vidual, collective, and contextual features for turning locations The annotated locations own a high expressiveness that allows into annotated locations. The location annotation procedure, be- generalizing individual mobility networks, and that makes them sides differentiating the locations with respect to the different comparable across different users. The results of our study on a features, also provides a hierarchy showing the reasons for the dataset of trucks moving in Greece show that the annotated indi- different annotation categories. The procedure is basically a two- vidual mobility networks can enable detailed analysis of urban step-clustering. The first step applies a distance-based clustering areas and the planning of advanced mobility applications. to group the different locations. The second step exploits a hier- archical approach for further summarizing the locations, and for better describing them according to features characterizing the 1 INTRODUCTION annotation categories. This provides to the annotated locations a The large availability of digital traces of individuals is offering higher expressiveness and allows to generalize IMNs by making novel possibilities for understanding the patterns characterizing AIMNs comparable. Therefore, the analysis on AIMNs allows to human mobility [7]. However, the personal data collected by segment the users moving on different areas by means of the smartphones or devices installed by car telematics companies data-driven semantics provided by the area itself. for business and insurance purposes is generally limited to the We employ the proposed methodology on a dataset of trucks positions of the vehicle, with no vision of what happens around it. moving in Greece, Albania, and other EU countries. We focused On the other hand, planning individual and collective advanced our analysis on two areas with a different size finding in both mobility applications as well as providing detailed analysis of cases six types of locations for annotating the IMNs of the vehi- urban and suburban areas require additional and more complex cles. The first main differences between the location types are information [6, 10]. Raw mobility data like GPS positioning de- the number of stops in a location, the number of links with other scribes elementary events (position, acceleration, etc.), while any individual locations, and the average arrival/leaving times. Then, proper modeling requires a higher-level vision of what is happen- the locations differ on their centrality with respect to individual ing to the user, to other users living in the surroundings, and to locations of other users and with respect to existing points of the environment in which the user is moving. Such higher-level interest like gas stations, parking areas, shops, supermarkets, etc. modeling should provide some clear categorization, annotation or A preliminary analysis of the AIMNs highlights that in each area semantics for locations and/or movements. Recognizing different there are distinct types of users (trucks, in our study), that visit individual mobility behaviors abstracting at a higher comparable the various types of location with different frequencies. Low en- level and making them explicit is mandatory for enabling novel tropy trucks visit frequently the same type of annotated location, applications or empowering existing ones. while high entropy trucks have a central node from which they Many studies semantically enrich mobility data with anno- reach all the differently annotated locations. tations about human activities and build individual mobility The rest of the paper is organized as follows. Section 2 summa- models on top of that. For instance, these approaches estimate rizes related work on locations annotation. In Section 3 we recall home/work locations of an individual by analyzing the frequency individual mobility networks and further concepts for under- of visits in a particular place [14], observing the sequence of move- standing the procedures for locations annotation and annotated ments to derive the sequence of activities [15], their semantic [19] individual mobility networks extraction described in Section 4. and possible different aspects [3], or extract network-based per- Section 5 presents experiments in the form of a case study in sonal data-driven models named Individual Mobility Network which we employ the proposed methodology. Finally, Section 6 Copyright Β© 2020 for this paper by its author(s). Published in the Workshop Proceed- concludes the paper and illustrates future research directions. ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At- tribution 4.0 International (CC BY 4.0). 2 RELATED WORK In the literature, various works address the problem of describ- ing and characterizing visited locations for modeling mobility behavior and for describing land use. In [4, 5] it is presented a technique to determine land uses in specific urban areas based on tweeting patterns, and to identify points of interest as places with high activity of tweets. Human activities and geographical areas are modeled in [17] by means of Foursquare place categories. A spectral clustering algorithm is used on areas and users for identi- fying user communities that visit similar categories of places, and for comparing urban neighborhoods within and across cities. Se- mantic information attached to places could be used for location- Figure 1: IMN extracted from the mobility of an individ- based applications. Indeed, in [18], both Foursquare and cellular ual. Edges represent the existence of a route between the data are exploited to infer user activities in urban environments. locations. The functions characterize each location. The authors employ user communication patterns to predict the activity of Foursquare users who check-in at nearby venues 3 SETTING THE STAGE adopting a machine learning approach. In [2] it is presented a In the following, we introduce the definitions of trajectory [25] location-based and preference-aware recommender system con- and individual mobility network [12, 21], useful for understanding sidering user preferences learned from location history with a the rest of the paper. We adapt them to the needs of the problem predefined weighted category hierarchy (from Foursquare), and we are facing and the approach designed to solve it. social opinions mined from location histories. Similarly, in [29], Foursquare check-in category information is exploited to model Definition 3.1 (Trajectory). A trajectory is a sequence 𝑑 = βŸ¨π‘ 1, user’s movement patterns and to predict the category of user . . . , 𝑝𝑛 ⟩ of spatio-temporal points, each being a tuple 𝑝𝑖 = (π‘₯𝑖 , 𝑦𝑖 , 𝑧𝑖 ) activity at the next step by means of a mixed hidden Markov that contains latitude π‘₯𝑖 , longitude 𝑦𝑖 and timestamp 𝑧𝑖 of the model. In order to describe and characterize the locations, the point. The points of a trajectory are chronologically ordered, i.e., aforementioned works adopt social network data like Twitter βˆ€1 ≀ 𝑖 < 𝑛 : 𝑧𝑖 < 𝑧𝑖+1 . and Foursquare. None of them use GPS positioning as it is not Given a trajectory 𝑑 we refer to its i-th point 𝑝𝑖 with the nota- sufficiently informative for their purpose. On the opposite, our ap- tion 𝑑 [𝑖], and to its number of points with 𝑑 .𝑛. Also, we indicate proach extracts the location categorization through data mining the longitude, latitude and timestamp components of point 𝑑 [𝑖] and location modeling applied to GPS trajectories. respectively with the notation 𝑑 [𝑖].π‘₯, 𝑑 [𝑖].𝑦, and 𝑑 [𝑖].𝑧. In the following, we illustrate works modeling human mobility from GPS data but not explicitly characterizing locations with Definition 3.2 (Individual History). Given a user 𝑒, we indicate respect to activities at locations or neighborhood description. with 𝐻𝑒 = βŸ¨π‘‘ 1, . . . , 𝑑𝑛 ⟩ the individual history of user 𝑒 as the set In [25] it is introduced the mobility profile of a user as the set of of trajectories traveled by 𝑒 in a certain time period. her routine trips, i.e., a set of very systematic and repetitive move- Given the individual history 𝐻𝑒 of user 𝑒, we can extract from ments. On the other hand, in [9] the authors focus on the locations it the individual mobility network (IMN) 𝐺𝑒 . An IMN describes and points of interest by developing a parameter-free method the individual mobility of a user through a graph representation for extracting individual locations with a data-driven approach. of her locations and movements, grasping the relevant properties Combining these approaches, in [12, 21] the authors define in- of individual mobility and removing unnecessary details. dividual mobility networks (IMNs) for modeling all the salient aspects of individual mobility. An IMN describes the mobility of Definition 3.3 (Individual Mobility Network). Given a user 𝑒, an individual through a graph representation of her locations we indicate with 𝐺𝑒 = (𝐿𝑒 , 𝑀𝑒 ) the individual mobility network and movements, grasping the relevant properties and removing of user 𝑒, where 𝐿𝑒 is the set of nodes and 𝑀𝑒 is the set of edges. unnecessary details. In [1] the authors define a general approach On the nodes and edges we define the following functions: to use state-transition graphs (STGs) in movement analysis. A β€’ πœ” : 𝐿 β†’ N returns the number of stops in location 𝑙 ∈ 𝐿𝑒 ; STG is an aggregate representation of a behaviour by a directed β€’ 𝜏 : 𝐿 β†’ R returns the typical time spent in location 𝑙 ∈ 𝐿𝑒 ; weighted graph, where the nodes stand for the possible states β€’ 𝜌 : 𝐿 β†’ Time returns the typical time of arrival of 𝑒 along and the edges for the transitions between the states. Information all the movements π‘š ∈ 𝑀𝑒 reaching location 𝑙 ∈ 𝐿𝑒 . from the states and activities is gathered from external sources β€’ πœ‹π‘‘ : 𝐿 β†’ R returns the typical time travelled by 𝑒 along like land allocations or presence of points of interest. The main all the movements π‘š ∈ 𝑀𝑒 reaching location 𝑙 ∈ 𝐿𝑒 . difference between IMNs and STGs lies in the fact that the first β€’ πœ‹π‘‘ : 𝐿 β†’ R returns the typical distance travelled by 𝑒 one models the mobility and it is more attached to geographic along all the movements π‘š ∈ 𝑀𝑒 reaching location 𝑙 ∈ 𝐿𝑒 . components, while the second one models the events, states or activities that can occur one after another. Nodes represent locations 𝐿𝑒 and edges represent movements Modeling individual behavior is a precious task as, besides 𝑀𝑒 between locations. With the term typical we refer to using providing a succinct and understandable representation of the aggregating function like mean and median, also including the mobility patterns of the users, it enables the development of associated dispersion indexes like standard deviation. To clarify applications like carpooling [11], or trajectory prediction [26]. the concept of IMN, let us consider Figure 1. It describes the IMN As a consequence, an individual model with enhanced location extracted from the mobility of an individual who visited seven descriptions can be undoubtedly beneficial and informative. distinct locations. Location (π‘Ž) has been visited by user 𝑒 for a total of πœ” (π‘Ž) = 25 times, i.e., 25 stops, with a typical stay of 𝜏 (π‘Ž) = 7β„Ž40π‘šπ‘–π‘›. On the other hand, user 𝑒 stopped in location Name Description lat, lon location prototype coordinates next locs number of different subsequent locations radius radius of gyration entropy entropy of ingoing/outgoing movements stops total and categorized number of stopsβˆ— staytime typical† stay timeβˆ— arrival times typical† arrival timesβˆ— Figure 2: AIMN extracted from the IMN of Figure 1. Anno- duration/length typical† duration/length of movementsβˆ— tations are represented with colors. The first graph reports is regular if a location is frequently visited the annotated nodes, while the second graph is the AIMN Table 1: Individual Features. (βˆ—) indicates that the features with the novel annotated nodes and edges. are calculated with respect to weekdays vs. weekend, and daytime (from 7 am to 8 pm) vs. nighttime. (†) indicates features for both mean and standard deviation. (𝑓 ) for πœ” (𝑓 ) = 3 times with a typical stay of 𝜏 (𝑓 ) = 1β„Ž10π‘šπ‘–π‘›. Name Description Edges (𝑐, 𝑓 ) and (𝑒, 𝑓 ) lead to location 𝑓 , typically arriving at π‘Ÿ -exclusivity number of stops within π‘Ÿ km w.r.t. other vehicles time 𝜌 (𝑓 ) = 18.10, traveling πœ‹π‘‘ (𝑓 ) = 15π‘šπ‘–π‘›, and πœ‹π‘‘ (𝑓 ) = 10π‘˜π‘š. π‘Ÿ -centrality number of locations in a radius of π‘Ÿ km The computation of an IMN 𝐺𝑒 starts from the ordered se- π‘˜-distance the distance of the π‘˜ th nearest locations quence history 𝐻𝑒 of user 𝑒. Locations are obtained by aggregat- Table 2: Collective Features. They are calculated compar- ing the origin and destination points of the trajectories using the ing an individual location with other individual locations. TOSCA location clustering algorithm [9, 12]. A location identifies a set of points, and a location prototype is the point minimizing Name Description the distance with the other observations part of the location. POI π‘Ÿ -centrality number of different POIs within a radius of π‘Ÿ km POI π‘˜-centrality number of different POIs within the π‘˜ th nearest POI POI π‘˜-distance distance to closest POIs within the π‘˜ th nearest POI 4 PROPOSED APPROACH Table 3: Contextual Features. All measures are also com- In this section, we describe the location annotation procedure puted separately over nine different categories of POIs: and how it is applied to extract annotated individual mobility gas stations, parking areas, piers, hotels, food shops, networks (AIMNs) from individual mobility networks (IMNs). leisure activities, (no-food) shops, services, supermarkets. 4.1 Annotated Individual Mobility Network 4.2 Location Annotation Procedure An annotated individual mobility network (AIMN) extracted from In this section, we describe the location annotation procedure we an IMN describes the individual mobility of a user through a designed to obtain a non-trivial annotation function πœ† : 𝐿 β†’ N. very simple graph representation of annotated locations and As previously discussed, the aim of πœ† is to provide the same movements, summarizing the relevant properties of the IMN and annotation for similar locations. Hence, a definition of similarity compressing the information contained in the mobility model. must be provided in order to design πœ†. Given an IMN 𝐺𝑒 of user 𝑒, The extraction of an AIMN 𝐺 𝑒 = (𝐿𝑒 , 𝑀 𝑒 ) of a user 𝑒 starts in order to compare two locations with a distance function, every from her IMN 𝐺𝑒 = (𝐿𝑒 , 𝑀𝑒 ), and an annotation function πœ†. location 𝑙 ∈ 𝐿𝑒 must be described with a set of attributes (see [12, Given an individual location 𝑙 ∈ 𝐿𝑒 , the annotation function 21] for more details). Individual locations represented as a vector πœ† : 𝐿 β†’ N returns its annotation as a consequence of a learn- of features can be grouped using clustering algorithms [24]. As ing process. Two locations 𝑙 1, 𝑙 2 have the same annotations, i.e., a consequence, πœ† is the function which annotates each location πœ†(𝑙 1 ) = πœ†(𝑙 2 ), if they are β€œsimilar”. Details of the location annota- with the cluster label to which the location belongs to. tion procedure that returns πœ† and the meaning of similar locations Simple πœ† functions can be obtained by modeling an individ- are presented in the next section. ual location through a basic set of attributes, for instance using The AIMN 𝐺 𝑒 is built by mapping each node 𝑙 ∈ 𝐿𝑒 to a node features like latitude and longitude, i.e., spatial features, or the 𝑙 = πœ†(𝑙) corresponding to its annotation, and by merging the average stay time and arrival time, i.e., temporal features. How- corresponding edges. Thus, locations with the same annotation ever, such basic representations of a location can be insufficient are collapsed into the same annotated location. More in detail, to capture various aspects related to (i) the mobility behavior of given an edge between 𝑙 1, 𝑙 2 if they have a different annotation the individual user, (ii) the user behavior in relationship with πœ†(𝑙 1 ) β‰  πœ†(𝑙 2 ), then two annotated locations 𝑙 1, 𝑙 2 are created the mobility behavior of other users, (iii) the user behavior in together with the movement-edge connecting them. On the other relationship with the geography and context in which she moves. hand, if both 𝑙 1, 𝑙 2 have the same annotation πœ†(𝑙 1 ) = πœ†(𝑙 2 ), then a To overcome these weaknesses, we design the following lo- unique annotated location 𝑙 is created together with a self-loop. cation annotation procedure that takes as input a set of IMNs We exemplify the concept of AIMN through Figure 2. Given G = {𝐺 1, . . . , 𝐺𝑛 }, a context C where the users move and returns a certain πœ† function, the first graph reports the nodes/locations an annotation function πœ†. The procedure works as follows. For annotation of the graph in Figure 1, where the annotations are each IMN 𝐺𝑒 ∈ G, for each individual location 𝑙 ∈ 𝐿𝑒 , we de- represented through colors. Nodes having the same colors are scribe 𝑙 with a vector of features capturing three distinct aspects. described by similar features, i.e., are similar according to some β€’ Individual features. These features characterize an indi- aspects. The second graph illustrates the AIMN obtained by col- vidual location 𝑙 only with the mobility of the vehicle lapsing nodes with the same annotation into an annotated node, 𝑒 stopping in 𝑙. The list attributes is shown in Table 1. merging edges and creating self-loops. Some features are calculated with respect to weekdays vs. Truck Type Inter-regional Area Urban Area Van 214,129 (71.03%) 105,065 (75.24%) Truck 3 82,697 (27.43%) 33,353 (23.88%) Truck 3 ax. 3,478 (1.15%) 1,223 (0.88%) Flatbed Truck 978 (0.32%) - Truck 179 (0.05%) - Table 4: Distribution of different type of vehicles for the two areas analyzed (percentages in brackets). Figure 3: Heatmaps of the stop points in the areas ana- lyzed in Greece: Inter-regional (left), and Urban (right). Measure Inter-regional Area Urban Area Traj per Vehicle 419.86 Β± 255.75 498.72 Β± 278.72 Avg Length 10.40 Β± 14.17 6.64 Β± 9.65 weekend, and daytime (from 7 am to 8 pm) vs. nighttime. Avg Duration 23.85 Β± 23.82 19.25 Β± 19.85 For features describing aggregates we report mean and Table 5: Mean and standard deviation of some descriptive standard deviation. A location is regular if it is frequently statistics for the trajectories in the two areas analyzed. visited more than the others (see [12] for details). β€’ Collective features. These features characterize an individ- ual location 𝑙 with respect to all the other locations both of user 𝑒 and of the other users in G. Details in Table 2. β€’ Contextual features. These features characterize an individ- ual location 𝑙 with respect to facilities, i.e., different types of points of interests (POIs) in the surrounding areas given by the context C. Table 3 reports a detailed description. Figure 4: K-Means Sum of Squared Error (SSE) varying the Given the locations described by these features, we implement number of clusters π‘˜. We selected π‘˜ = 155 for the inter- the annotation function by means of a clustering algorithm. In- regional area, and π‘˜ = 140 for the urban area. spired by [8, 9], instead of using a simple clustering approach, we design a two steps clustering allowing to simultaneously sum- marize the different locations and obtain a hierarchy for better following types of trucks which are the most frequent in the describing them. In particular, we use K-Means clustering algo- dataset: Van, Truck 3, Truck 3 ax., Flatbed Truck, Truck2 . rithm [16] for the first clustering phase. As illustrated in the next Analyzing all the vehicles together would not lead to a fair section, we observe that we need a high π‘˜ (between 100 and comparison due to the very different types of trajectories followed 200) in order to have a good clustering with respect to internal by the trucks. For instance, we cannot compare the IMN of a truck evaluation measures like the Sum of Squared Error (SSE). The performing daily deliveries in a radius of 20 km with the IMN clustering with K-Means allows to consistently reduce the num- of a truck moving across regions or countries along very long ber of different location prototypes. The second phase is aimed trajectories. Therefore, we partition the trajectories of the dataset at further reducing the number of clusters for keeping simple through a simple, K-means-like iterative procedure. First, each the annotation computed by πœ†, and simultaneously obtaining a vehicle 𝑒 is associated to the geographical bounding box π‘Ÿπ‘’ of its hierarchy for the different location prototypes. To this purpose trajectories in 𝐻𝑒 , and the set of areas 𝐴 is initialized to βˆ…. Then, we adopt a hierarchical clustering approach with the Ward’s cri- we iteratively consider each vehicle 𝑒 and compare its π‘Ÿπ‘’ with terion [28] to determine the clusters to be merged. A visual and all the existing areas π‘Žπ‘– ∈ 𝐴. If, for at least 75% of the vehicles interactive inspection of the resulting dendogram and centroids 𝑣 ∈ π‘Žπ‘– , π‘Ÿπ‘’ ∩ π‘Ÿ 𝑣 β‰  βˆ… and 1/4 ≀ π‘Žπ‘Ÿπ‘’π‘Ž(π‘Ÿπ‘’ )/π‘Žπ‘Ÿπ‘’π‘Ž(π‘Ÿ 𝑣 ) ≀ 4, then 𝑒 is describing the clusters allows to understand the reasons [20] added to π‘Žπ‘– ; otherwise, if Βšπ‘Žπ‘– ∈ 𝐴 satisfying this condition, a new for having different types of annotated locations and which is a area π‘Ž 𝑗 is created and 𝑒 is added to π‘Ž 𝑗 . Finally, we check that reasonable number of clusters representing the different types of each vehicle belongs to the area with the highest overlap, and annotated locations in an AIMN. in case we move a vehicle between two areas until convergence. The above procedure recognizes twelve different areas. In the following, we focus on two areas depicted in Figure 3. 5 EXPERIMENTS We name the first inter-regional area since it contains the trajecto- In this section, we present a case study on a dataset of trucks in ries of vehicles moving in various regions of Greece. On the other which we employ the proposed methodology1 . First, we briefly hand, the second is an urban area containing the movements of analyze the dataset. Then, we provide details for the location trucks in Athens. Details about the number of different vehicles annotations and the clustering results. Finally, we show some can be found in Table 4. In addition, in Table 5 we report basic preliminary analysis enabled by the extracted AIMNs. statistics of the trajectories for the two different areas. We notice how in the urban area a vehicle performs on average much more 5.1 Dataset Description trajectories than a vehicle in the inter-regional area. As expected, We analyze a dataset of about 15 million of trajectories of trucks the trajectories in the inter-regional area are on average longer moving in Greece, Albania, Cyprus, and other few EU countries than those in the urban area. Moreover, the high standard devia- from July 2017 to June 2018. There are different kinds of trucks, tion means that the trajectories in the inter-regional area are also depending on the size, number of carts, etc. We focus on the more variegate than those in the urban area. The similar average duration might be due to the higher traffic in the urban area, that 1 The code for performing the analysis is available at: https://github.com/fsbolgi/ makes the speed lower than in the inter-regional area. AnnotatedIndividualMobilityNetwork. The dataset is not publicly available. 2 More details are available at https://trackandknowproject.eu/about/deliverables/. regular ↑ ∧ next locs ↑ ∧ radius ↑ ∧ entropy ↑ ∧ avg arrival time ↑ true false π‘Ÿ 15 βˆ’centr ↑ ∧ regular ↓ ∧ π‘Ÿ 15 βˆ’centr ↓ ∧ POI π‘Ÿ 0.5 βˆ’centr ↑ ∧ POI π‘Ÿ 0.5 βˆ’centr ↓ ∧ POI π‘˜ 30 βˆ’dist ↑ POI π‘˜ 30 βˆ’dist ↓ true false false true π‘Ÿ 15 βˆ’centr ↓ ∧ regular ↓ ∧ Figure 5: Inter-regional Area: hierarchical clustering den- C1 C4 POI π‘Ÿ 0.5 βˆ’centr ↓ ∧ avg staytime ↓ ∧ stops ↓ POI π‘˜ 30 βˆ’dist ↑ dogram with a cut at six clusters (clusters size in brackets). true false true false C2 C3 C5 C6 Figure 7: Inter-regional Area: Tree showing the most dis- criminant features for the dendogram in Figure 5. (iv) we remove π‘Ÿ 5 -centrality and π‘˜ 1 -centrality, π‘˜ 5 -centrality, π‘˜ 20 - centrality as they are highly correlated with the remaining ones. We ran K-Means with π‘˜ = 2, . . . , 900. The visual inspection of the Sum of Squared Error (SEE) in Figure 4 suggested to select π‘˜ = 155 for the inter-regional area, and π‘˜ = 140 for the urban area. The subsequent run of the hierarchical clustering with the Ward’s criterion yields the dendograms in Figures 5 and 8. In both cases, we cut the dendograms to obtain six clusters that characterize the description of the different annotated locations with a good trade-off between a sufficiently high level of abstraction and a Figure 6: Inter-regional Area: centroids parallel plots detailed specification. We observe that in both cases, more than showing discriminant features for dendogram in Fig. 5. 50% of the locations end in the rightmost part of the dendogram, making it slightly imbalanced. That produces some small clusters, yet none of them is negligible in terms of size. In the following, 5.2 Annotated Locations Clustering Analysis we combine the hierarchy of the dendogram with the information Using the procedure described in [12], we extract the IMNs for the returned by the parallel plots of the centroids of the clusters for vehicles in both areas, and we use them as input for the proposed each split. This visualization, due to the interpretable features [13] methodology. In particular, in our analysis, we focus on Septem- describing the locations, allows to explain [20] the hierarchy and ber and October 2017, obtaining 883 IMNs in the inter-regional consequently the annotations of the various clusters 𝐢 1, . . . , 𝐢 6 . area and 373 IMNs in the urban area. Statistics describing the In order to better understand the reasons that led to differenti- IMNs are reported in Table 7. As context C, we exploit a dataset ate the locations into the six clusters – and therefore understand containing the POIs of the whole Europe3 . We extracted only the what each cluster contains –, we show in Figures 6 and 9 the cor- POIs relative to the areas we are interested in and we restricted to responding parallel plots of the most significant features, respec- some more general and relevant categories, namely: gas, parking, tively for the inter-regional and urban areas; we also summarize pier, hotel, food, leisure, (no-food) shop, service, supermarket. in Figures 7 and 10 the insights we obtained through inspection Given the IMNs for the two areas, the location annotation of the features at different levels of the dendogram, by means of procedure received as input about 110k locations for the inter- a decision tree representation. regional area and about 39k locations for the urban one. We de- In both dendograms the first split is a consequence of differ- scribed each location with the individual, collective, and contex- ences relative to individual features. The first split (Figures 6 tual characteristics illustrated in Section 4.2, ending in a vector of and 9 top), using the individual features, separates on the left 72 features. For π‘Ÿ -exclusivity we adopt π‘Ÿ = 0.2 km, for π‘Ÿ -centrality branch regularly visited locations (↑ regular) from which is possi- π‘Ÿ ∈ {1, 5, 15} km, for π‘˜-distance π‘˜ ∈ {1, 3, 5, 8, 10, 20}. For POI ble to reach various destinations (↑ next locs and ↑ entropy) with π‘Ÿ -centrality, POI π‘˜-centrality and POI π‘˜-distance we adopt π‘Ÿ = 0.5 early arrivals and leavings (↑ avg arrival times, i.e., the vehicle km and π‘˜ = 30, respectively. Before running the location anno- leaves before 8 am and is back before 7 pm) in a location defined tation procedure, we performed a correlation analysis using the by a not very specific area (↑ radius), from the other locations Pearson correlation coefficient. This step allowed us to reduce (right branch). Thus, with respect to our case study, the locations the number of features to 55. In particular, (i) the total number on the left-hand side can be matched with storage points and/or of stops is removed in favor of next loc, entropy and number of deposits of the trucks analyzed, while those on the right are all daytime weekday stops; (ii) among the temporal aggregations of the others. This consideration can also justify the fact that the stay times, arrival and leaving times, only the weekday vs week- majority of locations lie in the rightmost part of the dendogram. end is considered; (iii) movement duration features are removed Moving forward on the left branch, we have a different split for since they are strongly correlated with movement length features; the inter-regional area and for the urban area. For the inter-regional area (Figure 6 center left) the second 3 The POIs are points collected from OpenStreetMap filtered based on Geofabrik’s split, making use of collective and contextual features, separates taxonomy of OpenStreetMap features, i.e., points with the label β€œPOI” are kept. not regular locations (regular ↓) not close to individual locations Figure 8: Urban area: hierarchical clustering dendogram with a cut yielding six clusters (clusters size in brackets). Figure 11: Heatmaps of the different clusters for the an- notated locations of the Inter-regional area. From left to right, top to bottom, clusters 𝐢 1, 𝐢 2, . . . , 𝐢 6 . Following the same logic used to describe the splits of the inter- regional area we can understand the splits for the annotations of the urban area from Figures 9 and 10. The second split (Fig- ure 9 center left) discriminates between locations close to various Figure 9: Urban Area: centroids parallel plots showing dis- existing POIs (POI π‘Ÿ 0.5 βˆ’centr ↑ and POI π‘˜ 30 βˆ’dist ↓). Since the criminant features for dendogram in Figure 8. closest categories for the black lines are food, leisure, shop, service and supermarket we can assume that cluster 𝐢 1 mainly identifies β€œregular” locations in the city center. A further split on individual regular ↑ ∧ next locs ↑ ∧ radius ↑ ∧ entropy ↑ ∧ avg arrival time ↑ features (Figure 9 bottom left) identifies more regularly visited locations (↑ regular) with a high stay time and number of stops true false during both weekend and weekday (↑ avg staytime, ↑ stops). The other branch separates the locations based on the distance to POI π‘Ÿ 0.5 βˆ’centr ↑ ∧ POI π‘Ÿ 0.5 βˆ’centr ↑ ∧ POI π‘˜ 30 βˆ’dist ↓ POI π‘˜ 30 βˆ’dist ↓ existing POIs (POIπ‘Ÿ 0.5 βˆ’centr ↑ and POIπ‘˜ 30 βˆ’dist ↓). This time false false there is a clear separation for all the contextual features also true true considering gas, parking and hotel. Hence, the cluster 𝐢 4 captures radius ↑ ∧ central locations (probably even more central than those on the regular ↑ ∧ C1 C4 next locs ↑ ∧ left branch of the tree) visited sporadically by the vehicles ana- avg staytime ↑ ∧ stops ↑ POI βˆ’ gas π‘Ÿ 0.5 βˆ’centr ↑ true false lyzed. This cluster is the biggest in the urban area. Finally, the true false last split relative to suburban locations distinguish cluster 𝐢 5 , containing suburban locations with a large radius from which C2 C3 C5 C6 can be reached other many individual locations but far away Figure 10: Urban Area: Tree showing the most discrimi- from POIs except gas stations (probably located into an industrial nant features for the dendogram in Figure 8. area), from cluster 𝐢 6 , which is formed by suburban locations close to facilities but reached sporadically. In Figure 11 we show the heatmaps of the locations for the of other vehicles (π‘Ÿ 15 βˆ’centrality ↓, i.e., not central w.r.t. the oth- various clusters for the Inter-regional area. We can notice how ers) nor to POIs (POI π‘Ÿ 0.5 βˆ’centrality ↓ and POI π‘˜ 30 βˆ’distance ↑) the position of the different annotated locations on the maps is from the rest. Cluster 𝐢 1 models suburban and peripheral storage coherent with the descriptions reported above. From a very high points. On the other hand, the subsequent split (Figure 6 bottom level we can summarize the distinction between the different left) identifies less frequent locations with shorter stops (cluster annotated locations as delivery β€œorigins” (𝐢 1, 𝐢 2, 𝐢 3 ) and β€œdesti- 𝐢 2 ) and more frequent locations with longer stops (cluster 𝐢 3 ). nations ”(𝐢 4, 𝐢 5, 𝐢 6 ), i.e., recipients. Then, the clusters distinguish The right branch of the inter-regional tree in Figure 7 performs according to the closeness with respect to other individual loca- a symmetric split with respect to the left branch. Thus, in cluster tions and POIs, to usage in terms of stay times and arrival times. 𝐢 4 we find not regularly visited locations surrounded by many For instance, we have locations in the city center in 𝐢 4 , isolated other personal locations and POIs (π‘Ÿ 15 βˆ’centr ↑, POI π‘Ÿ 0.5 βˆ’centr ↑, suburban locations 𝐢 5 , and suburban locations surrounded by POI π‘˜ 30 βˆ’dist ↓). Due to the high density, we can infer that these POIs in 𝐢 6 . Moreover, we highlight that latitude and longitude locations are central regions of the inter-regional area. Finally, are not crucial for distinguishing the various clusters at the final clusters 𝐢 5 and 𝐢 6 containing most of the locations are placed in level of the dendogram. Indeed the points are not entirely sep- suburbans regions far away from many POIs. arated from a spatial point of view. This implies that the other Measure Inter-regional Area Urban Area IMN vs TMP 0.1580 0.1465 IMN vs SPT 0.1186 0.0723 TMP vs SPT 0.0036 0.0136 Table 6: NMI comparing the proposed location annotation procedure based on IMNs against a temporal annotation (TMP) and a spatial annotation (SPT) procedure. Measure Inter-regional Area Urban Area Nodes 96.84 Β± 74.98 138.96 Β± 112.04 Edges 270.47 Β± 190.32 312.21 Β± 224.90 Density 0.07 Β± 0.15 0.05 Β± 0.10 Degree 5.78 Β± 1.46 4.85 Β± 2.39 Clus. Coef. 0.19 Β± 0.13 0.17 Β± 0.13 Table 7: IMNs characteristics (mean Β± std dev). Measure Inter-regional Area Urban Area Figure 12: Purity and entropy distributions for number of Nodes 5.38 Β± 0.89 5.06 Β± 1.02 stops in different annotated locations. Edges 15.12 Β± 4.69 13.31 Β± 4.56 Density 1.28 Β± 0.28 1.28 Β± 0.32 truck with a vector 𝑣𝑒 = βŸ¨π‘“1, 𝑓2, . . . , 𝑓6 ⟩ of six elements contain- Degree 5.48 Β± 1.16 5.10 Β± 1.10 ing the relative number of stops in each type 𝐢 1, 𝐢 2, . . . , 𝐢 6 of Clus. Coef. 0.82 Β± 0.22 0.85 Β± 0.20 annotated locations. Using these vectors of relative frequencies Table 8: AIMNs characteristics (mean Β± std dev). 𝑓𝑖 , we calculate for each user the purity and (normalized) en- tropy [22] as purity(𝑣) = π‘šπ‘Žπ‘₯𝑖 ∈ [1,6] (𝑓𝑖 ), and as entropy(𝑣) = Í features are much more relevant than the geographical aspects βˆ’ 𝑖 ∈ [1,6] 𝑓𝑖 log2 (𝑓𝑖 )/log2 (6). We highlight that purity and en- for capturing different characteristics. tropy capture two different aspects. We report the distributions Finally, we adopt a clustering evaluation measure to show that of purity and entropy for the inter-regional and urban areas in the proposed location annotation procedure instantiated with Figure 12. For the purity (top row), we observe two normal-like the truck case study is meaningful and not trivial. We implement distributions with most of the users having a purity of 0.5 in both two simple annotation functions πœ†. The first procedure (SPT) de- areas meaning that half of the stops of the vehicle refer to the scribes the locations only using spatial features, i.e., latitude and same annotated location. On the other hand, there are also trucks longitude. The second procedure (TMP) describes the locations with a purity close to 1.0, meaning that almost all the stops belong only using the average stay time and arrival time. In both proce- to the same annotated locations. There are no vehicles having dures, the locations are grouped and annotated using K-Means a purity lower than 0.2. With respect to entropy (bottom row), with π‘˜ = 6, i.e., the same number of annotated locations discov- we observe right-skewed distributions with a mean of about 0.7, ered with the proposed procedure. We compare the annotations, showing that, despite there is an average high purity, most of the i.e., the clustering results, returned by the three procedures using trucks stop in different annotated locations. In addition, there the Normalized Mutual Information score [27]. The more similar are very few vehicles with an entropy close to 0.0 signaling that are the annotations between two location annotation procedures, a truck generally visits a minimum number of different anno- the higher is the NMI score. The very low NMI scores reported tated locations. Finally, we report that we found no relationships in Table 6 confirms that it is not possible to design an annotation between the type of truck and entropy or purity. procedure similar to the proposed one with a trivial approach. We report in Figure 13 the AIMNs of three trucks moving in the urban area with different levels of purity and entropy. The 5.3 Inspection of Annotated IMNs leftmost AIMNs show a high entropy and low purity, visiting all Given the locations annotation described in the previous sec- the types of annotated locations with a balanced number of stops, tion, we know the behavior of the annotation function πœ† and the i.e., similar sizes. We notice on the map how it covers a larger area meaning of the different annotations. Thus, given a location 𝑙 compared to the other AIMNs. The big yellow node 𝐢 2 models with the vector of features describing it, πœ† assigns 𝑙 to the most the parking/storage while the others model different types of similar cluster with respect to the six prototypes represented by destinations. On the other hand, the rightmost AIMN shows a the centroids reported in Figures 6 and 9, i.e., πœ† : 𝐿 β†’ [1, . . . , 6]. low entropy and high purity: the majority of the stops are on Using the annotation functions πœ† for the IMNs of the two areas annotated locations of type 𝐢 3 (red), which are on the north-west we obtain the corresponding AIMNs. We report in Table 8 statis- and south-east in the map. Finally, the central AIMN shows a tics describing the AIMNs. By comparing Table 8 with Table 7 we typical vehicle with a medium level of entropy and purity. notice that the number of nodes and edges obviously drops. On the other hand, we observe that AIMNs are much denser than 6 CONCLUSION IMNs and with a higher average degree about 5 with a standard We have proposed a data-driven procedure for annotating indi- deviation of 1.1, meaning that typically each vehicle from an vidual locations, and we have employed it to extract annotated annotated location can reach at least 3 other annotated locations. individual mobility networks (AIMNs) from individual mobility Consequently, AIMNs show a much higher clustering coefficient. networks (IMNs). A case study experimentation on a real dataset Moreover, we exploit the AIMNs for a preliminary analysis of fleet moving in Greek areas has shown the effectiveness of aimed at comparing vehicles. From the AIMNs we model each the proposed approach. We have found a hierarchical structure Figure 13: AIMNs for three vehicles moving in the urban area with different purity and entropy levels. Left low purity and high entropy. Center medium purity and entropy. Right high purity and low entropy. The top row reports the AIMNs while the bottom row contains the corresponding AIMNs with a spatial disaggregation of the annotated locations on a real map. In both cases the higher is the size of the node/location, the higher the number of stops in the locations. describing the annotated locations through individual, collective, [4] Vanessa Frias-Martinez et al. 2012. Characterizing urban landscapes using and contextual features. The principal discrimination is based geolocated tweets. In PASSAT. IEEE, 239–248. [5] Vanessa Frias-Martinez and Enrique Frias-Martinez. 2014. Spectral clustering on the frequency of visits, length of stay, and centrality with for sensing urban land use using Twitter activity. EAAI 35 (2014), 237–245. respect to other locations and existing points of interest. As a [6] Fosca Giannotti et al. 2011. Unveiling the complexity of human mobility by querying and mining massive trajectory data. VLDB 20, 5 (2011), 695–719. consequence of the annotation, we can observed different vehi- [7] Marta C Gonzalez et al. 2008. Understanding individual human mobility cles studying the corresponding AIMNs. Such information can be patterns. Nature 453, 7196 (2008), 779. applied for a detailed analysis of inter-regional and urban areas [8] Riccardo Guidotti et al. 2014. Retrieving points of interest from human sys- tematic movements. In SEFM. Springer, 294–308. and for planning ad-hoc and personalized mobility applications. [9] Riccardo Guidotti et al. 2015. TOSCA: two-steps clustering algorithm for Future research directions can purse various goals. First, we personal locations detection. In SIGSPATIAL. ACM, 38. would apply the proposed methodology to different data sources. [10] Riccardo Guidotti et al. 2016. Unveiling mobility complexity through complex network analysis. SNAM 6, 1 (2016), 59. For instance, observing the personal mobility of individual car [11] Riccardo Guidotti et al. 2017. Never drive alone: Boosting carpooling with drivers, or of users adopting different types of means of trans- network analysis. IS 64 (2017), 237–257. [12] Riccardo Guidotti et al. 2017. There’s a path for everyone: A data-driven portations like bicycle, foot, public services, would probably lead personal model reproducing mobility agendas. In DSAA. IEEE, 303–312. to identify other types of annotations for the locations. Second, [13] Riccardo Guidotti et al. 2018. Discovering temporal regularities in retail we would like to perform a deeper analysis of the users by clus- customers’ shopping behavior. EPJ Data Science 7, 1 (2018), 6. [14] Shan Jiang et al. 2012. Clustering daily patterns of human activities in the tering the AIMNs with respect to the annotated locations for city. DAMI 25, 3 (2012), 478–510. discovering relevant users’ segmentation and geographical char- [15] John Lafferty et al. 2001. Conditional random fields: Probabilistic models for acterization of the territory. Third, we would like to use the an- segmenting and labeling sequence data. (2001). [16] James MacQueen et al. 1967. Some methods for classification and analysis of notated locations and the mobility history for building sequences multivariate observations. In BSMSP, Vol. 1. Oakland, CA, USA, 281–297. of β€œstates” [1]. Exploiting sequential pattern mining algorithms [17] Anastasios Noulas et al. 2011. Exploiting semantic annotations for clustering geographic areas and users in location-based social networks. In AAAI. would allow us to search for mobility patterns represented as [18] Anastasios Noulas et al. 2013. Exploiting foursquare and cellular data to infer sequences of different annotated locations. Fourth, we would user activity in urban environments. In ICMDM, Vol. 1. IEEE, 167–176. like to analyze the benefits of integrating our concepts of anno- [19] Christine Parent et al. 2013. Semantic trajectories modeling and analysis. CSUR 45, 4 (2013), 42. tated locations and AIMNs into visualization platforms [1, 23] [20] Dino Pedreschi et al. 2019. Meaningful explanations of Black Box AI decision for semantic annotation of individual mobility. systems. In AAAI, Vol. 33. 9780–9784. [21] Salvatore Rinzivillo et al. 2014. The purpose of motion: Learning activities from individual mobility networks. In DSAA. IEEE, 312–318. ACKNOWLEDGMENTS [22] Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell system technical journal 27, 3 (1948), 379–423. This work is partially supported by the E.C. H2020 programme [23] AmΓ­lcar Soares et al. 2019. VISTA: A visual analytics platform for semantic under the funding scheme Track &Know, G.A. 780754, https:// annotation of trajectories.. In EDBT. 570–573. [24] Pang-Ning Tan. 2018. Introduction to data mining. Pearson Education India. trackandknowproject.eu/. We thank Haosheng Huang and Cheng [25] Roberto Trasarti et al. 2011. Mining mobility user profiles for car pooling. In Fu from University of Zurich for sharing the POIs dataset. KDD. ACM, 1190–1198. [26] Roberto Trasarti et al. 2017. Myway: Location prediction via mobility profiling. IS 64 (2017), 350–367. REFERENCES [27] Nguyen Xuan Vinh et al. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. [1] Natalia Andrienko and Gennady Andrienko. 2018. State transition graphs for JMLR 11, Oct (2010), 2837–2854. semantic analysis of movement behaviours. IV 17, 1 (2018), 41–65. [28] Joe H Ward Jr. 1963. Hierarchical grouping to optimize an objective function. [2] Jie Bao et al. 2012. Location-based and preference-aware recommendation JASA 58, 301 (1963), 236–244. using sparse geo-social networking data. In SIGSPATIAL. ACM, 199–208. [29] Jihang Ye, Zhe Zhu, and Hong Cheng. 2013. What’s your next move: User [3] Ronaldo Dos Santos Mello et al. 2019. MASTER: A Multiple Aspect View on activity prediction in location-based social networks. In ICDM. SIAM, 171–179. Trajectories. TGIS 12526 (2019).