Data-Driven Location Annotation
                                          for Fleet Mobility Modeling
               Riccardo Guidotti                                            Mirco Nanni                               Francesca Sbolgi
                University of Pisa                                         ISTI-CNR, Pisa                              University of Pisa
                    Pisa, Italy                                              Pisa, Italy                                   Pisa, Italy
          riccardo.guidotti@di.unipi.it                                mirco.nanni@isti.cnr.it                         f.sbolgi@unipi.it

ABSTRACT                                                                               (IMN) capturing the structured patterns of visits to locations [21].
The large availability of mobility data allows studying human                          The existing works either focus on building mobility models or
behavior and human activities. However, this massive and raw                           on adding semantics from external data sources. On the other
amount of data generally lacks any detailed semantics or useful                        hand, in this work we exploit the mobility data used to build the
categorization. Annotations of the locations where the users stop                      individual model also to annotate the model itself.
may be helpful in a number of contexts, including user modeling                           In particular, we advance IMNs proposing a data-driven pro-
and profiling, urban planning, activity recommendations, and                           cedure for extracting Annotated Individual Mobility Networks
can even lead to a deeper understanding of the mobility evo-                           (AIMNs). A limitation of actual IMNs is that they are indeed “in-
lution of an urban area. In this paper, we foster the expressive                       dividual” and not easily comparable. Following [1], our goal is to
power of individual mobility networks, a data model describing                         add semantics to the locations modeling the nodes of the IMNs
users’ behavior, by defining a data-driven procedure for locations                     in order to make them comparable among different users. We
annotation. The procedure considers individual, collective, and                        accomplish this task by designing a procedure considering indi-
contextual features for turning locations into annotated ones.                         vidual, collective, and contextual features for turning locations
The annotated locations own a high expressiveness that allows                          into annotated locations. The location annotation procedure, be-
generalizing individual mobility networks, and that makes them                         sides differentiating the locations with respect to the different
comparable across different users. The results of our study on a                       features, also provides a hierarchy showing the reasons for the
dataset of trucks moving in Greece show that the annotated indi-                       different annotation categories. The procedure is basically a two-
vidual mobility networks can enable detailed analysis of urban                         step-clustering. The first step applies a distance-based clustering
areas and the planning of advanced mobility applications.                              to group the different locations. The second step exploits a hier-
                                                                                       archical approach for further summarizing the locations, and for
                                                                                       better describing them according to features characterizing the
1    INTRODUCTION                                                                      annotation categories. This provides to the annotated locations a
The large availability of digital traces of individuals is offering                    higher expressiveness and allows to generalize IMNs by making
novel possibilities for understanding the patterns characterizing                      AIMNs comparable. Therefore, the analysis on AIMNs allows to
human mobility [7]. However, the personal data collected by                            segment the users moving on different areas by means of the
smartphones or devices installed by car telematics companies                           data-driven semantics provided by the area itself.
for business and insurance purposes is generally limited to the                           We employ the proposed methodology on a dataset of trucks
positions of the vehicle, with no vision of what happens around it.                    moving in Greece, Albania, and other EU countries. We focused
On the other hand, planning individual and collective advanced                         our analysis on two areas with a different size finding in both
mobility applications as well as providing detailed analysis of                        cases six types of locations for annotating the IMNs of the vehi-
urban and suburban areas require additional and more complex                           cles. The first main differences between the location types are
information [6, 10]. Raw mobility data like GPS positioning de-                        the number of stops in a location, the number of links with other
scribes elementary events (position, acceleration, etc.), while any                    individual locations, and the average arrival/leaving times. Then,
proper modeling requires a higher-level vision of what is happen-                      the locations differ on their centrality with respect to individual
ing to the user, to other users living in the surroundings, and to                     locations of other users and with respect to existing points of
the environment in which the user is moving. Such higher-level                         interest like gas stations, parking areas, shops, supermarkets, etc.
modeling should provide some clear categorization, annotation or                       A preliminary analysis of the AIMNs highlights that in each area
semantics for locations and/or movements. Recognizing different                        there are distinct types of users (trucks, in our study), that visit
individual mobility behaviors abstracting at a higher comparable                       the various types of location with different frequencies. Low en-
level and making them explicit is mandatory for enabling novel                         tropy trucks visit frequently the same type of annotated location,
applications or empowering existing ones.                                              while high entropy trucks have a central node from which they
   Many studies semantically enrich mobility data with anno-                           reach all the differently annotated locations.
tations about human activities and build individual mobility                              The rest of the paper is organized as follows. Section 2 summa-
models on top of that. For instance, these approaches estimate                         rizes related work on locations annotation. In Section 3 we recall
home/work locations of an individual by analyzing the frequency                        individual mobility networks and further concepts for under-
of visits in a particular place [14], observing the sequence of move-                  standing the procedures for locations annotation and annotated
ments to derive the sequence of activities [15], their semantic [19]                   individual mobility networks extraction described in Section 4.
and possible different aspects [3], or extract network-based per-                      Section 5 presents experiments in the form of a case study in
sonal data-driven models named Individual Mobility Network                             which we employ the proposed methodology. Finally, Section 6
Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceed-
                                                                                       concludes the paper and illustrates future research directions.
ings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen,
Denmark) on CEUR-WS.org. Use permitted under Creative Commons License At-
tribution 4.0 International (CC BY 4.0).
2    RELATED WORK
In the literature, various works address the problem of describ-
ing and characterizing visited locations for modeling mobility
behavior and for describing land use. In [4, 5] it is presented a
technique to determine land uses in specific urban areas based on
tweeting patterns, and to identify points of interest as places with
high activity of tweets. Human activities and geographical areas
are modeled in [17] by means of Foursquare place categories. A
spectral clustering algorithm is used on areas and users for identi-
fying user communities that visit similar categories of places, and
for comparing urban neighborhoods within and across cities. Se-
mantic information attached to places could be used for location-        Figure 1: IMN extracted from the mobility of an individ-
based applications. Indeed, in [18], both Foursquare and cellular        ual. Edges represent the existence of a route between the
data are exploited to infer user activities in urban environments.       locations. The functions characterize each location.
The authors employ user communication patterns to predict
the activity of Foursquare users who check-in at nearby venues           3    SETTING THE STAGE
adopting a machine learning approach. In [2] it is presented a           In the following, we introduce the definitions of trajectory [25]
location-based and preference-aware recommender system con-              and individual mobility network [12, 21], useful for understanding
sidering user preferences learned from location history with a           the rest of the paper. We adapt them to the needs of the problem
predefined weighted category hierarchy (from Foursquare), and            we are facing and the approach designed to solve it.
social opinions mined from location histories. Similarly, in [29],
Foursquare check-in category information is exploited to model                Definition 3.1 (Trajectory). A trajectory is a sequence 𝑡 = ⟨𝑝 1,
user’s movement patterns and to predict the category of user             . . . , 𝑝𝑛 ⟩ of spatio-temporal points, each being a tuple 𝑝𝑖 = (𝑥𝑖 , 𝑦𝑖 , 𝑧𝑖 )
activity at the next step by means of a mixed hidden Markov              that contains latitude 𝑥𝑖 , longitude 𝑦𝑖 and timestamp 𝑧𝑖 of the
model. In order to describe and characterize the locations, the          point. The points of a trajectory are chronologically ordered, i.e.,
aforementioned works adopt social network data like Twitter              ∀1 ≤ 𝑖 < 𝑛 : 𝑧𝑖 < 𝑧𝑖+1 .
and Foursquare. None of them use GPS positioning as it is not
                                                                            Given a trajectory 𝑡 we refer to its i-th point 𝑝𝑖 with the nota-
sufficiently informative for their purpose. On the opposite, our ap-
                                                                         tion 𝑡 [𝑖], and to its number of points with 𝑡 .𝑛. Also, we indicate
proach extracts the location categorization through data mining
                                                                         the longitude, latitude and timestamp components of point 𝑡 [𝑖]
and location modeling applied to GPS trajectories.
                                                                         respectively with the notation 𝑡 [𝑖].𝑥, 𝑡 [𝑖].𝑦, and 𝑡 [𝑖].𝑧.
   In the following, we illustrate works modeling human mobility
from GPS data but not explicitly characterizing locations with              Definition 3.2 (Individual History). Given a user 𝑢, we indicate
respect to activities at locations or neighborhood description.          with 𝐻𝑢 = ⟨𝑡 1, . . . , 𝑡𝑛 ⟩ the individual history of user 𝑢 as the set
In [25] it is introduced the mobility profile of a user as the set of    of trajectories traveled by 𝑢 in a certain time period.
her routine trips, i.e., a set of very systematic and repetitive move-
                                                                             Given the individual history 𝐻𝑢 of user 𝑢, we can extract from
ments. On the other hand, in [9] the authors focus on the locations
                                                                         it the individual mobility network (IMN) 𝐺𝑢 . An IMN describes
and points of interest by developing a parameter-free method
                                                                         the individual mobility of a user through a graph representation
for extracting individual locations with a data-driven approach.
                                                                         of her locations and movements, grasping the relevant properties
Combining these approaches, in [12, 21] the authors define in-
                                                                         of individual mobility and removing unnecessary details.
dividual mobility networks (IMNs) for modeling all the salient
aspects of individual mobility. An IMN describes the mobility of            Definition 3.3 (Individual Mobility Network). Given a user 𝑢,
an individual through a graph representation of her locations            we indicate with 𝐺𝑢 = (𝐿𝑢 , 𝑀𝑢 ) the individual mobility network
and movements, grasping the relevant properties and removing             of user 𝑢, where 𝐿𝑢 is the set of nodes and 𝑀𝑢 is the set of edges.
unnecessary details. In [1] the authors define a general approach        On the nodes and edges we define the following functions:
to use state-transition graphs (STGs) in movement analysis. A                 • 𝜔 : 𝐿 → N returns the number of stops in location 𝑙 ∈ 𝐿𝑢 ;
STG is an aggregate representation of a behaviour by a directed               • 𝜏 : 𝐿 → R returns the typical time spent in location 𝑙 ∈ 𝐿𝑢 ;
weighted graph, where the nodes stand for the possible states                 • 𝜌 : 𝐿 → Time returns the typical time of arrival of 𝑢 along
and the edges for the transitions between the states. Information               all the movements 𝑚 ∈ 𝑀𝑢 reaching location 𝑙 ∈ 𝐿𝑢 .
from the states and activities is gathered from external sources              • 𝜋𝑡 : 𝐿 → R returns the typical time travelled by 𝑢 along
like land allocations or presence of points of interest. The main               all the movements 𝑚 ∈ 𝑀𝑢 reaching location 𝑙 ∈ 𝐿𝑢 .
difference between IMNs and STGs lies in the fact that the first              • 𝜋𝑑 : 𝐿 → R returns the typical distance travelled by 𝑢
one models the mobility and it is more attached to geographic                   along all the movements 𝑚 ∈ 𝑀𝑢 reaching location 𝑙 ∈ 𝐿𝑢 .
components, while the second one models the events, states or
activities that can occur one after another.                                 Nodes represent locations 𝐿𝑢 and edges represent movements
   Modeling individual behavior is a precious task as, besides           𝑀𝑢 between locations. With the term typical we refer to using
providing a succinct and understandable representation of the            aggregating function like mean and median, also including the
mobility patterns of the users, it enables the development of            associated dispersion indexes like standard deviation. To clarify
applications like carpooling [11], or trajectory prediction [26].        the concept of IMN, let us consider Figure 1. It describes the IMN
As a consequence, an individual model with enhanced location             extracted from the mobility of an individual who visited seven
descriptions can be undoubtedly beneficial and informative.              distinct locations. Location (𝑎) has been visited by user 𝑢 for
                                                                         a total of 𝜔 (𝑎) = 25 times, i.e., 25 stops, with a typical stay of
                                                                         𝜏 (𝑎) = 7ℎ40𝑚𝑖𝑛. On the other hand, user 𝑢 stopped in location
                                                                                       Name                           Description
                                                                                       lat, lon            location prototype coordinates
                                                                                     next locs        number of different subsequent locations
                                                                                        radius                      radius of gyration
                                                                                      entropy         entropy of ingoing/outgoing movements
                                                                                         stops         total and categorized number of stops∗
                                                                                      staytime                     typical† stay time∗
                                                                                   arrival times                 typical† arrival times∗
Figure 2: AIMN extracted from the IMN of Figure 1. Anno-                          duration/length     typical† duration/length of movements∗
tations are represented with colors. The first graph reports                         is regular           if a location is frequently visited
the annotated nodes, while the second graph is the AIMN
                                                                            Table 1: Individual Features. (∗) indicates that the features
with the novel annotated nodes and edges.
                                                                            are calculated with respect to weekdays vs. weekend, and
                                                                            daytime (from 7 am to 8 pm) vs. nighttime. (†) indicates
                                                                            features for both mean and standard deviation.

(𝑓 ) for 𝜔 (𝑓 ) = 3 times with a typical stay of 𝜏 (𝑓 ) = 1ℎ10𝑚𝑖𝑛.                  Name                          Description
Edges (𝑐, 𝑓 ) and (𝑒, 𝑓 ) lead to location 𝑓 , typically arriving at           𝑟 -exclusivity    number of stops within 𝑟 km w.r.t. other vehicles
time 𝜌 (𝑓 ) = 18.10, traveling 𝜋𝑡 (𝑓 ) = 15𝑚𝑖𝑛, and 𝜋𝑑 (𝑓 ) = 10𝑘𝑚.             𝑟 -centrality       number of locations in a radius of 𝑟 km
   The computation of an IMN 𝐺𝑢 starts from the ordered se-                      𝑘-distance         the distance of the 𝑘 th nearest locations
quence history 𝐻𝑢 of user 𝑢. Locations are obtained by aggregat-            Table 2: Collective Features. They are calculated compar-
ing the origin and destination points of the trajectories using the         ing an individual location with other individual locations.
TOSCA location clustering algorithm [9, 12]. A location identifies
a set of points, and a location prototype is the point minimizing                 Name                                 Description
the distance with the other observations part of the location.               POI 𝑟 -centrality      number of different POIs within a radius of 𝑟 km
                                                                             POI 𝑘-centrality     number of different POIs within the 𝑘 th nearest POI
                                                                              POI 𝑘-distance       distance to closest POIs within the 𝑘 th nearest POI
4     PROPOSED APPROACH
                                                                            Table 3: Contextual Features. All measures are also com-
In this section, we describe the location annotation procedure              puted separately over nine different categories of POIs:
and how it is applied to extract annotated individual mobility              gas stations, parking areas, piers, hotels, food shops,
networks (AIMNs) from individual mobility networks (IMNs).                  leisure activities, (no-food) shops, services, supermarkets.

4.1    Annotated Individual Mobility Network                                4.2    Location Annotation Procedure
An annotated individual mobility network (AIMN) extracted from              In this section, we describe the location annotation procedure we
an IMN describes the individual mobility of a user through a                designed to obtain a non-trivial annotation function 𝜆 : 𝐿 → N.
very simple graph representation of annotated locations and                 As previously discussed, the aim of 𝜆 is to provide the same
movements, summarizing the relevant properties of the IMN and               annotation for similar locations. Hence, a definition of similarity
compressing the information contained in the mobility model.                must be provided in order to design 𝜆. Given an IMN 𝐺𝑢 of user 𝑢,
   The extraction of an AIMN 𝐺 𝑢 = (𝐿𝑢 , 𝑀 𝑢 ) of a user 𝑢 starts           in order to compare two locations with a distance function, every
from her IMN 𝐺𝑢 = (𝐿𝑢 , 𝑀𝑢 ), and an annotation function 𝜆.                 location 𝑙 ∈ 𝐿𝑢 must be described with a set of attributes (see [12,
Given an individual location 𝑙 ∈ 𝐿𝑢 , the annotation function               21] for more details). Individual locations represented as a vector
𝜆 : 𝐿 → N returns its annotation as a consequence of a learn-               of features can be grouped using clustering algorithms [24]. As
ing process. Two locations 𝑙 1, 𝑙 2 have the same annotations, i.e.,        a consequence, 𝜆 is the function which annotates each location
𝜆(𝑙 1 ) = 𝜆(𝑙 2 ), if they are “similar”. Details of the location annota-   with the cluster label to which the location belongs to.
tion procedure that returns 𝜆 and the meaning of similar locations             Simple 𝜆 functions can be obtained by modeling an individ-
are presented in the next section.                                          ual location through a basic set of attributes, for instance using
   The AIMN 𝐺 𝑢 is built by mapping each node 𝑙 ∈ 𝐿𝑢 to a node              features like latitude and longitude, i.e., spatial features, or the
𝑙 = 𝜆(𝑙) corresponding to its annotation, and by merging the                average stay time and arrival time, i.e., temporal features. How-
corresponding edges. Thus, locations with the same annotation               ever, such basic representations of a location can be insufficient
are collapsed into the same annotated location. More in detail,             to capture various aspects related to (i) the mobility behavior of
given an edge between 𝑙 1, 𝑙 2 if they have a different annotation          the individual user, (ii) the user behavior in relationship with
𝜆(𝑙 1 ) ≠ 𝜆(𝑙 2 ), then two annotated locations 𝑙 1, 𝑙 2 are created        the mobility behavior of other users, (iii) the user behavior in
together with the movement-edge connecting them. On the other               relationship with the geography and context in which she moves.
hand, if both 𝑙 1, 𝑙 2 have the same annotation 𝜆(𝑙 1 ) = 𝜆(𝑙 2 ), then a      To overcome these weaknesses, we design the following lo-
unique annotated location 𝑙 is created together with a self-loop.           cation annotation procedure that takes as input a set of IMNs
   We exemplify the concept of AIMN through Figure 2. Given                 G = {𝐺 1, . . . , 𝐺𝑛 }, a context C where the users move and returns
a certain 𝜆 function, the first graph reports the nodes/locations           an annotation function 𝜆. The procedure works as follows. For
annotation of the graph in Figure 1, where the annotations are              each IMN 𝐺𝑢 ∈ G, for each individual location 𝑙 ∈ 𝐿𝑢 , we de-
represented through colors. Nodes having the same colors are                scribe 𝑙 with a vector of features capturing three distinct aspects.
described by similar features, i.e., are similar according to some               • Individual features. These features characterize an indi-
aspects. The second graph illustrates the AIMN obtained by col-                     vidual location 𝑙 only with the mobility of the vehicle
lapsing nodes with the same annotation into an annotated node,                      𝑢 stopping in 𝑙. The list attributes is shown in Table 1.
merging edges and creating self-loops.                                              Some features are calculated with respect to weekdays vs.
                                                                                         Truck Type Inter-regional Area          Urban Area
                                                                                             Van           214,129 (71.03%)    105,065 (75.24%)
                                                                                           Truck 3          82,697 (27.43%)     33,353 (23.88%)
                                                                                          Truck 3 ax.        3,478 (1.15%)       1,223 (0.88%)
                                                                                        Flatbed Truck         978 (0.32%)              -
                                                                                            Truck             179 (0.05%)              -
                                                                                      Table 4: Distribution of different type of vehicles for the
                                                                                      two areas analyzed (percentages in brackets).
Figure 3: Heatmaps of the stop points in the areas ana-
lyzed in Greece: Inter-regional (left), and Urban (right).                                  Measure         Inter-regional Area     Urban Area
                                                                                         Traj per Vehicle      419.86 ± 255.75     498.72 ± 278.72
                                                                                           Avg Length           10.40 ± 14.17        6.64 ± 9.65
        weekend, and daytime (from 7 am to 8 pm) vs. nighttime.                           Avg Duration          23.85 ± 23.82       19.25 ± 19.85
        For features describing aggregates we report mean and
                                                                                      Table 5: Mean and standard deviation of some descriptive
        standard deviation. A location is regular if it is frequently
                                                                                      statistics for the trajectories in the two areas analyzed.
        visited more than the others (see [12] for details).
      • Collective features. These features characterize an individ-
        ual location 𝑙 with respect to all the other locations both
        of user 𝑢 and of the other users in G. Details in Table 2.
      • Contextual features. These features characterize an individ-
        ual location 𝑙 with respect to facilities, i.e., different types
        of points of interests (POIs) in the surrounding areas given
        by the context C. Table 3 reports a detailed description.
                                                                                      Figure 4: K-Means Sum of Squared Error (SSE) varying the
   Given the locations described by these features, we implement
                                                                                      number of clusters 𝑘. We selected 𝑘 = 155 for the inter-
the annotation function by means of a clustering algorithm. In-
                                                                                      regional area, and 𝑘 = 140 for the urban area.
spired by [8, 9], instead of using a simple clustering approach,
we design a two steps clustering allowing to simultaneously sum-
marize the different locations and obtain a hierarchy for better                      following types of trucks which are the most frequent in the
describing them. In particular, we use K-Means clustering algo-                       dataset: Van, Truck 3, Truck 3 ax., Flatbed Truck, Truck2 .
rithm [16] for the first clustering phase. As illustrated in the next                    Analyzing all the vehicles together would not lead to a fair
section, we observe that we need a high 𝑘 (between 100 and                            comparison due to the very different types of trajectories followed
200) in order to have a good clustering with respect to internal                      by the trucks. For instance, we cannot compare the IMN of a truck
evaluation measures like the Sum of Squared Error (SSE). The                          performing daily deliveries in a radius of 20 km with the IMN
clustering with K-Means allows to consistently reduce the num-                        of a truck moving across regions or countries along very long
ber of different location prototypes. The second phase is aimed                       trajectories. Therefore, we partition the trajectories of the dataset
at further reducing the number of clusters for keeping simple                         through a simple, K-means-like iterative procedure. First, each
the annotation computed by 𝜆, and simultaneously obtaining a                          vehicle 𝑢 is associated to the geographical bounding box 𝑟𝑢 of its
hierarchy for the different location prototypes. To this purpose                      trajectories in 𝐻𝑢 , and the set of areas 𝐴 is initialized to ∅. Then,
we adopt a hierarchical clustering approach with the Ward’s cri-                      we iteratively consider each vehicle 𝑢 and compare its 𝑟𝑢 with
terion [28] to determine the clusters to be merged. A visual and                      all the existing areas 𝑎𝑖 ∈ 𝐴. If, for at least 75% of the vehicles
interactive inspection of the resulting dendogram and centroids                       𝑣 ∈ 𝑎𝑖 , 𝑟𝑢 ∩ 𝑟 𝑣 ≠ ∅ and 1/4 ≤ 𝑎𝑟𝑒𝑎(𝑟𝑢 )/𝑎𝑟𝑒𝑎(𝑟 𝑣 ) ≤ 4, then 𝑢 is
describing the clusters allows to understand the reasons [20]                         added to 𝑎𝑖 ; otherwise, if 𝑎𝑖 ∈ 𝐴 satisfying this condition, a new
for having different types of annotated locations and which is a                      area 𝑎 𝑗 is created and 𝑢 is added to 𝑎 𝑗 . Finally, we check that
reasonable number of clusters representing the different types of                     each vehicle belongs to the area with the highest overlap, and
annotated locations in an AIMN.                                                       in case we move a vehicle between two areas until convergence.
                                                                                      The above procedure recognizes twelve different areas.
                                                                                         In the following, we focus on two areas depicted in Figure 3.
5     EXPERIMENTS
                                                                                      We name the first inter-regional area since it contains the trajecto-
In this section, we present a case study on a dataset of trucks in                    ries of vehicles moving in various regions of Greece. On the other
which we employ the proposed methodology1 . First, we briefly                         hand, the second is an urban area containing the movements of
analyze the dataset. Then, we provide details for the location                        trucks in Athens. Details about the number of different vehicles
annotations and the clustering results. Finally, we show some                         can be found in Table 4. In addition, in Table 5 we report basic
preliminary analysis enabled by the extracted AIMNs.                                  statistics of the trajectories for the two different areas. We notice
                                                                                      how in the urban area a vehicle performs on average much more
5.1     Dataset Description                                                           trajectories than a vehicle in the inter-regional area. As expected,
We analyze a dataset of about 15 million of trajectories of trucks                    the trajectories in the inter-regional area are on average longer
moving in Greece, Albania, Cyprus, and other few EU countries                         than those in the urban area. Moreover, the high standard devia-
from July 2017 to June 2018. There are different kinds of trucks,                     tion means that the trajectories in the inter-regional area are also
depending on the size, number of carts, etc. We focus on the                          more variegate than those in the urban area. The similar average
                                                                                      duration might be due to the higher traffic in the urban area, that
1 The code for performing the analysis is available at: https://github.com/fsbolgi/
                                                                                      makes the speed lower than in the inter-regional area.
AnnotatedIndividualMobilityNetwork. The dataset is not publicly available.            2 More details are available at https://trackandknowproject.eu/about/deliverables/.
                                                                                                              regular ↑ ∧ next locs ↑ ∧
                                                                                                      radius ↑ ∧ entropy ↑ ∧ avg arrival time ↑
                                                                                                               true               false

                                                                                                                                     𝑟 15 −centr ↑ ∧
                                                                                          regular ↓ ∧ 𝑟 15 −centr ↓ ∧
                                                                                                                                   POI 𝑟 0.5 −centr ↑ ∧
                                                                                      POI 𝑟 0.5 −centr ↓ ∧ POI 𝑘 30 −dist ↑
                                                                                                                                     POI 𝑘 30 −dist ↓
                                                                                               true            false                               false
                                                                                                                                   true
                                                                                                                                               𝑟 15 −centr ↓ ∧
                                                                                                            regular ↓ ∧
Figure 5: Inter-regional Area: hierarchical clustering den-                                 C1                                    C4         POI 𝑟 0.5 −centr ↓ ∧
                                                                                                      avg staytime ↓ ∧ stops ↓
                                                                                                                                              POI 𝑘 30 −dist ↑
dogram with a cut at six clusters (clusters size in brackets).                                            true         false                   true        false


                                                                                                          C2            C3                     C5           C6

                                                                                   Figure 7: Inter-regional Area: Tree showing the most dis-
                                                                                   criminant features for the dendogram in Figure 5.

                                                                                   (iv) we remove 𝑟 5 -centrality and 𝑘 1 -centrality, 𝑘 5 -centrality, 𝑘 20 -
                                                                                   centrality as they are highly correlated with the remaining ones.
                                                                                      We ran K-Means with 𝑘 = 2, . . . , 900. The visual inspection of
                                                                                   the Sum of Squared Error (SEE) in Figure 4 suggested to select
                                                                                   𝑘 = 155 for the inter-regional area, and 𝑘 = 140 for the urban area.
                                                                                   The subsequent run of the hierarchical clustering with the Ward’s
                                                                                   criterion yields the dendograms in Figures 5 and 8. In both cases,
                                                                                   we cut the dendograms to obtain six clusters that characterize
                                                                                   the description of the different annotated locations with a good
                                                                                   trade-off between a sufficiently high level of abstraction and a
Figure 6: Inter-regional Area: centroids parallel plots                            detailed specification. We observe that in both cases, more than
showing discriminant features for dendogram in Fig. 5.                             50% of the locations end in the rightmost part of the dendogram,
                                                                                   making it slightly imbalanced. That produces some small clusters,
                                                                                   yet none of them is negligible in terms of size. In the following,
5.2     Annotated Locations Clustering Analysis                                    we combine the hierarchy of the dendogram with the information
Using the procedure described in [12], we extract the IMNs for the                 returned by the parallel plots of the centroids of the clusters for
vehicles in both areas, and we use them as input for the proposed                  each split. This visualization, due to the interpretable features [13]
methodology. In particular, in our analysis, we focus on Septem-                   describing the locations, allows to explain [20] the hierarchy and
ber and October 2017, obtaining 883 IMNs in the inter-regional                     consequently the annotations of the various clusters 𝐶 1, . . . , 𝐶 6 .
area and 373 IMNs in the urban area. Statistics describing the                        In order to better understand the reasons that led to differenti-
IMNs are reported in Table 7. As context C, we exploit a dataset                   ate the locations into the six clusters – and therefore understand
containing the POIs of the whole Europe3 . We extracted only the                   what each cluster contains –, we show in Figures 6 and 9 the cor-
POIs relative to the areas we are interested in and we restricted to               responding parallel plots of the most significant features, respec-
some more general and relevant categories, namely: gas, parking,                   tively for the inter-regional and urban areas; we also summarize
pier, hotel, food, leisure, (no-food) shop, service, supermarket.                  in Figures 7 and 10 the insights we obtained through inspection
    Given the IMNs for the two areas, the location annotation                      of the features at different levels of the dendogram, by means of
procedure received as input about 110k locations for the inter-                    a decision tree representation.
regional area and about 39k locations for the urban one. We de-                       In both dendograms the first split is a consequence of differ-
scribed each location with the individual, collective, and contex-                 ences relative to individual features. The first split (Figures 6
tual characteristics illustrated in Section 4.2, ending in a vector of             and 9 top), using the individual features, separates on the left
72 features. For 𝑟 -exclusivity we adopt 𝑟 = 0.2 km, for 𝑟 -centrality             branch regularly visited locations (↑ regular) from which is possi-
𝑟 ∈ {1, 5, 15} km, for 𝑘-distance 𝑘 ∈ {1, 3, 5, 8, 10, 20}. For POI                ble to reach various destinations (↑ next locs and ↑ entropy) with
𝑟 -centrality, POI 𝑘-centrality and POI 𝑘-distance we adopt 𝑟 = 0.5                early arrivals and leavings (↑ avg arrival times, i.e., the vehicle
km and 𝑘 = 30, respectively. Before running the location anno-                     leaves before 8 am and is back before 7 pm) in a location defined
tation procedure, we performed a correlation analysis using the                    by a not very specific area (↑ radius), from the other locations
Pearson correlation coefficient. This step allowed us to reduce                    (right branch). Thus, with respect to our case study, the locations
the number of features to 55. In particular, (i) the total number                  on the left-hand side can be matched with storage points and/or
of stops is removed in favor of next loc, entropy and number of                    deposits of the trucks analyzed, while those on the right are all
daytime weekday stops; (ii) among the temporal aggregations of                     the others. This consideration can also justify the fact that the
stay times, arrival and leaving times, only the weekday vs week-                   majority of locations lie in the rightmost part of the dendogram.
end is considered; (iii) movement duration features are removed                    Moving forward on the left branch, we have a different split for
since they are strongly correlated with movement length features;                  the inter-regional area and for the urban area.
                                                                                      For the inter-regional area (Figure 6 center left) the second
3 The POIs are points collected from OpenStreetMap filtered based on Geofabrik’s   split, making use of collective and contextual features, separates
taxonomy of OpenStreetMap features, i.e., points with the label “POI” are kept.    not regular locations (regular ↓) not close to individual locations
Figure 8: Urban area: hierarchical clustering dendogram
with a cut yielding six clusters (clusters size in brackets).


                                                                              Figure 11: Heatmaps of the different clusters for the an-
                                                                              notated locations of the Inter-regional area. From left to
                                                                              right, top to bottom, clusters 𝐶 1, 𝐶 2, . . . , 𝐶 6 .

                                                                                 Following the same logic used to describe the splits of the inter-
                                                                              regional area we can understand the splits for the annotations
                                                                              of the urban area from Figures 9 and 10. The second split (Fig-
                                                                              ure 9 center left) discriminates between locations close to various
Figure 9: Urban Area: centroids parallel plots showing dis-                   existing POIs (POI 𝑟 0.5 −centr ↑ and POI 𝑘 30 −dist ↓). Since the
criminant features for dendogram in Figure 8.                                 closest categories for the black lines are food, leisure, shop, service
                                                                              and supermarket we can assume that cluster 𝐶 1 mainly identifies
                                                                              “regular” locations in the city center. A further split on individual
                      regular ↑ ∧ next locs ↑ ∧
              radius ↑ ∧ entropy ↑ ∧ avg arrival time ↑
                                                                              features (Figure 9 bottom left) identifies more regularly visited
                                                                              locations (↑ regular) with a high stay time and number of stops
                         true             false
                                                                              during both weekend and weekday (↑ avg staytime, ↑ stops). The
                                                                              other branch separates the locations based on the distance to
      POI 𝑟 0.5 −centr ↑ ∧                 POI 𝑟 0.5 −centr ↑ ∧
       POI 𝑘 30 −dist ↓                      POI 𝑘 30 −dist ↓                 existing POIs (POI𝑟 0.5 −centr ↑ and POI𝑘 30 −dist ↓). This time
                       false                              false               there is a clear separation for all the contextual features also
       true                                 true
                                                                              considering gas, parking and hotel. Hence, the cluster 𝐶 4 captures
                                                        radius ↑ ∧            central locations (probably even more central than those on the
                    regular ↑ ∧
     C1                                   C4           next locs ↑ ∧          left branch of the tree) visited sporadically by the vehicles ana-
              avg staytime ↑ ∧ stops ↑
                                                   POI − gas 𝑟 0.5 −centr ↑
                  true          false
                                                                              lyzed. This cluster is the biggest in the urban area. Finally, the
                                                      true         false
                                                                              last split relative to suburban locations distinguish cluster 𝐶 5 ,
                                                                              containing suburban locations with a large radius from which
                  C2             C3                   C5            C6        can be reached other many individual locations but far away
Figure 10: Urban Area: Tree showing the most discrimi-                        from POIs except gas stations (probably located into an industrial
nant features for the dendogram in Figure 8.                                  area), from cluster 𝐶 6 , which is formed by suburban locations
                                                                              close to facilities but reached sporadically.
                                                                                 In Figure 11 we show the heatmaps of the locations for the
of other vehicles (𝑟 15 −centrality ↓, i.e., not central w.r.t. the oth-      various clusters for the Inter-regional area. We can notice how
ers) nor to POIs (POI 𝑟 0.5 −centrality ↓ and POI 𝑘 30 −distance ↑)           the position of the different annotated locations on the maps is
from the rest. Cluster 𝐶 1 models suburban and peripheral storage             coherent with the descriptions reported above. From a very high
points. On the other hand, the subsequent split (Figure 6 bottom              level we can summarize the distinction between the different
left) identifies less frequent locations with shorter stops (cluster          annotated locations as delivery “origins” (𝐶 1, 𝐶 2, 𝐶 3 ) and “desti-
𝐶 2 ) and more frequent locations with longer stops (cluster 𝐶 3 ).           nations ”(𝐶 4, 𝐶 5, 𝐶 6 ), i.e., recipients. Then, the clusters distinguish
The right branch of the inter-regional tree in Figure 7 performs              according to the closeness with respect to other individual loca-
a symmetric split with respect to the left branch. Thus, in cluster           tions and POIs, to usage in terms of stay times and arrival times.
𝐶 4 we find not regularly visited locations surrounded by many                For instance, we have locations in the city center in 𝐶 4 , isolated
other personal locations and POIs (𝑟 15 −centr ↑, POI 𝑟 0.5 −centr ↑,         suburban locations 𝐶 5 , and suburban locations surrounded by
POI 𝑘 30 −dist ↓). Due to the high density, we can infer that these           POIs in 𝐶 6 . Moreover, we highlight that latitude and longitude
locations are central regions of the inter-regional area. Finally,            are not crucial for distinguishing the various clusters at the final
clusters 𝐶 5 and 𝐶 6 containing most of the locations are placed in           level of the dendogram. Indeed the points are not entirely sep-
suburbans regions far away from many POIs.                                    arated from a spatial point of view. This implies that the other
       Measure    Inter-regional Area Urban Area
     IMN vs TMP          0.1580            0.1465
      IMN vs SPT         0.1186            0.0723
     TMP vs SPT          0.0036            0.0136
Table 6: NMI comparing the proposed location annotation
procedure based on IMNs against a temporal annotation
(TMP) and a spatial annotation (SPT) procedure.

      Measure Inter-regional Area        Urban Area
        Nodes         96.84 ± 74.98     138.96 ± 112.04
        Edges        270.47 ± 190.32    312.21 ± 224.90
       Density         0.07 ± 0.15        0.05 ± 0.10
        Degree         5.78 ± 1.46        4.85 ± 2.39
      Clus. Coef.      0.19 ± 0.13        0.17 ± 0.13
       Table 7: IMNs characteristics (mean ± std dev).

       Measure Inter-regional Area Urban Area                              Figure 12: Purity and entropy distributions for number of
         Nodes         5.38 ± 0.89       5.06 ± 1.02                       stops in different annotated locations.
         Edges         15.12 ± 4.69      13.31 ± 4.56
        Density        1.28 ± 0.28       1.28 ± 0.32
                                                                           truck with a vector 𝑣𝑢 = ⟨𝑓1, 𝑓2, . . . , 𝑓6 ⟩ of six elements contain-
        Degree         5.48 ± 1.16       5.10 ± 1.10
                                                                           ing the relative number of stops in each type 𝐶 1, 𝐶 2, . . . , 𝐶 6 of
       Clus. Coef.     0.82 ± 0.22       0.85 ± 0.20
                                                                           annotated locations. Using these vectors of relative frequencies
      Table 8: AIMNs characteristics (mean ± std dev).                     𝑓𝑖 , we calculate for each user the purity and (normalized) en-
                                                                           tropy [22] as purity(𝑣) = 𝑚𝑎𝑥𝑖 ∈ [1,6] (𝑓𝑖 ), and as entropy(𝑣) =
                                                                              Í
features are much more relevant than the geographical aspects              − 𝑖 ∈ [1,6] 𝑓𝑖 log2 (𝑓𝑖 )/log2 (6). We highlight that purity and en-
for capturing different characteristics.                                   tropy capture two different aspects. We report the distributions
    Finally, we adopt a clustering evaluation measure to show that         of purity and entropy for the inter-regional and urban areas in
the proposed location annotation procedure instantiated with               Figure 12. For the purity (top row), we observe two normal-like
the truck case study is meaningful and not trivial. We implement           distributions with most of the users having a purity of 0.5 in both
two simple annotation functions 𝜆. The first procedure (SPT) de-           areas meaning that half of the stops of the vehicle refer to the
scribes the locations only using spatial features, i.e., latitude and      same annotated location. On the other hand, there are also trucks
longitude. The second procedure (TMP) describes the locations              with a purity close to 1.0, meaning that almost all the stops belong
only using the average stay time and arrival time. In both proce-          to the same annotated locations. There are no vehicles having
dures, the locations are grouped and annotated using K-Means               a purity lower than 0.2. With respect to entropy (bottom row),
with 𝑘 = 6, i.e., the same number of annotated locations discov-           we observe right-skewed distributions with a mean of about 0.7,
ered with the proposed procedure. We compare the annotations,              showing that, despite there is an average high purity, most of the
i.e., the clustering results, returned by the three procedures using       trucks stop in different annotated locations. In addition, there
the Normalized Mutual Information score [27]. The more similar             are very few vehicles with an entropy close to 0.0 signaling that
are the annotations between two location annotation procedures,            a truck generally visits a minimum number of different anno-
the higher is the NMI score. The very low NMI scores reported              tated locations. Finally, we report that we found no relationships
in Table 6 confirms that it is not possible to design an annotation        between the type of truck and entropy or purity.
procedure similar to the proposed one with a trivial approach.                  We report in Figure 13 the AIMNs of three trucks moving in
                                                                           the urban area with different levels of purity and entropy. The
5.3    Inspection of Annotated IMNs                                        leftmost AIMNs show a high entropy and low purity, visiting all
Given the locations annotation described in the previous sec-              the types of annotated locations with a balanced number of stops,
tion, we know the behavior of the annotation function 𝜆 and the            i.e., similar sizes. We notice on the map how it covers a larger area
meaning of the different annotations. Thus, given a location 𝑙             compared to the other AIMNs. The big yellow node 𝐶 2 models
with the vector of features describing it, 𝜆 assigns 𝑙 to the most         the parking/storage while the others model different types of
similar cluster with respect to the six prototypes represented by          destinations. On the other hand, the rightmost AIMN shows a
the centroids reported in Figures 6 and 9, i.e., 𝜆 : 𝐿 → [1, . . . , 6].   low entropy and high purity: the majority of the stops are on
   Using the annotation functions 𝜆 for the IMNs of the two areas          annotated locations of type 𝐶 3 (red), which are on the north-west
we obtain the corresponding AIMNs. We report in Table 8 statis-            and south-east in the map. Finally, the central AIMN shows a
tics describing the AIMNs. By comparing Table 8 with Table 7 we            typical vehicle with a medium level of entropy and purity.
notice that the number of nodes and edges obviously drops. On
the other hand, we observe that AIMNs are much denser than                 6    CONCLUSION
IMNs and with a higher average degree about 5 with a standard              We have proposed a data-driven procedure for annotating indi-
deviation of 1.1, meaning that typically each vehicle from an              vidual locations, and we have employed it to extract annotated
annotated location can reach at least 3 other annotated locations.         individual mobility networks (AIMNs) from individual mobility
Consequently, AIMNs show a much higher clustering coefficient.             networks (IMNs). A case study experimentation on a real dataset
   Moreover, we exploit the AIMNs for a preliminary analysis               of fleet moving in Greek areas has shown the effectiveness of
aimed at comparing vehicles. From the AIMNs we model each                  the proposed approach. We have found a hierarchical structure
Figure 13: AIMNs for three vehicles moving in the urban area with different purity and entropy levels. Left low purity
and high entropy. Center medium purity and entropy. Right high purity and low entropy. The top row reports the AIMNs
while the bottom row contains the corresponding AIMNs with a spatial disaggregation of the annotated locations on a
real map. In both cases the higher is the size of the node/location, the higher the number of stops in the locations.


describing the annotated locations through individual, collective,                 [4] Vanessa Frias-Martinez et al. 2012. Characterizing urban landscapes using
and contextual features. The principal discrimination is based                         geolocated tweets. In PASSAT. IEEE, 239–248.
                                                                                   [5] Vanessa Frias-Martinez and Enrique Frias-Martinez. 2014. Spectral clustering
on the frequency of visits, length of stay, and centrality with                        for sensing urban land use using Twitter activity. EAAI 35 (2014), 237–245.
respect to other locations and existing points of interest. As a                   [6] Fosca Giannotti et al. 2011. Unveiling the complexity of human mobility by
                                                                                       querying and mining massive trajectory data. VLDB 20, 5 (2011), 695–719.
consequence of the annotation, we can observed different vehi-                     [7] Marta C Gonzalez et al. 2008. Understanding individual human mobility
cles studying the corresponding AIMNs. Such information can be                         patterns. Nature 453, 7196 (2008), 779.
applied for a detailed analysis of inter-regional and urban areas                  [8] Riccardo Guidotti et al. 2014. Retrieving points of interest from human sys-
                                                                                       tematic movements. In SEFM. Springer, 294–308.
and for planning ad-hoc and personalized mobility applications.                    [9] Riccardo Guidotti et al. 2015. TOSCA: two-steps clustering algorithm for
   Future research directions can purse various goals. First, we                       personal locations detection. In SIGSPATIAL. ACM, 38.
would apply the proposed methodology to different data sources.                   [10] Riccardo Guidotti et al. 2016. Unveiling mobility complexity through complex
                                                                                       network analysis. SNAM 6, 1 (2016), 59.
For instance, observing the personal mobility of individual car                   [11] Riccardo Guidotti et al. 2017. Never drive alone: Boosting carpooling with
drivers, or of users adopting different types of means of trans-                       network analysis. IS 64 (2017), 237–257.
                                                                                  [12] Riccardo Guidotti et al. 2017. There’s a path for everyone: A data-driven
portations like bicycle, foot, public services, would probably lead                    personal model reproducing mobility agendas. In DSAA. IEEE, 303–312.
to identify other types of annotations for the locations. Second,                 [13] Riccardo Guidotti et al. 2018. Discovering temporal regularities in retail
we would like to perform a deeper analysis of the users by clus-                       customers’ shopping behavior. EPJ Data Science 7, 1 (2018), 6.
                                                                                  [14] Shan Jiang et al. 2012. Clustering daily patterns of human activities in the
tering the AIMNs with respect to the annotated locations for                           city. DAMI 25, 3 (2012), 478–510.
discovering relevant users’ segmentation and geographical char-                   [15] John Lafferty et al. 2001. Conditional random fields: Probabilistic models for
acterization of the territory. Third, we would like to use the an-                     segmenting and labeling sequence data. (2001).
                                                                                  [16] James MacQueen et al. 1967. Some methods for classification and analysis of
notated locations and the mobility history for building sequences                      multivariate observations. In BSMSP, Vol. 1. Oakland, CA, USA, 281–297.
of “states” [1]. Exploiting sequential pattern mining algorithms                  [17] Anastasios Noulas et al. 2011. Exploiting semantic annotations for clustering
                                                                                       geographic areas and users in location-based social networks. In AAAI.
would allow us to search for mobility patterns represented as                     [18] Anastasios Noulas et al. 2013. Exploiting foursquare and cellular data to infer
sequences of different annotated locations. Fourth, we would                           user activity in urban environments. In ICMDM, Vol. 1. IEEE, 167–176.
like to analyze the benefits of integrating our concepts of anno-                 [19] Christine Parent et al. 2013. Semantic trajectories modeling and analysis.
                                                                                       CSUR 45, 4 (2013), 42.
tated locations and AIMNs into visualization platforms [1, 23]                    [20] Dino Pedreschi et al. 2019. Meaningful explanations of Black Box AI decision
for semantic annotation of individual mobility.                                        systems. In AAAI, Vol. 33. 9780–9784.
                                                                                  [21] Salvatore Rinzivillo et al. 2014. The purpose of motion: Learning activities
                                                                                       from individual mobility networks. In DSAA. IEEE, 312–318.
ACKNOWLEDGMENTS                                                                   [22] Claude Elwood Shannon. 1948. A mathematical theory of communication.
                                                                                       Bell system technical journal 27, 3 (1948), 379–423.
This work is partially supported by the E.C. H2020 programme                      [23] Amílcar Soares et al. 2019. VISTA: A visual analytics platform for semantic
under the funding scheme Track &Know, G.A. 780754, https://                            annotation of trajectories.. In EDBT. 570–573.
                                                                                  [24] Pang-Ning Tan. 2018. Introduction to data mining. Pearson Education India.
trackandknowproject.eu/. We thank Haosheng Huang and Cheng                        [25] Roberto Trasarti et al. 2011. Mining mobility user profiles for car pooling. In
Fu from University of Zurich for sharing the POIs dataset.                             KDD. ACM, 1190–1198.
                                                                                  [26] Roberto Trasarti et al. 2017. Myway: Location prediction via mobility profiling.
                                                                                       IS 64 (2017), 350–367.
REFERENCES                                                                        [27] Nguyen Xuan Vinh et al. 2010. Information theoretic measures for clusterings
                                                                                       comparison: Variants, properties, normalization and correction for chance.
 [1] Natalia Andrienko and Gennady Andrienko. 2018. State transition graphs for
                                                                                       JMLR 11, Oct (2010), 2837–2854.
     semantic analysis of movement behaviours. IV 17, 1 (2018), 41–65.
                                                                                  [28] Joe H Ward Jr. 1963. Hierarchical grouping to optimize an objective function.
 [2] Jie Bao et al. 2012. Location-based and preference-aware recommendation
                                                                                       JASA 58, 301 (1963), 236–244.
     using sparse geo-social networking data. In SIGSPATIAL. ACM, 199–208.
                                                                                  [29] Jihang Ye, Zhe Zhu, and Hong Cheng. 2013. What’s your next move: User
 [3] Ronaldo Dos Santos Mello et al. 2019. MASTER: A Multiple Aspect View on
                                                                                       activity prediction in location-based social networks. In ICDM. SIAM, 171–179.
     Trajectories. TGIS 12526 (2019).