=Paper= {{Paper |id=Vol-2578/BMDA2 |storemode=property |title=Data-Driven Location Annotation for Fleet Mobility Modeling |pdfUrl=https://ceur-ws.org/Vol-2578/BMDA2.pdf |volume=Vol-2578 |authors=Riccardo Guidotti,Mirco Nanni,Francesca Sbolgi |dblpUrl=https://dblp.org/rec/conf/edbt/GuidottiNS20 }} ==Data-Driven Location Annotation for Fleet Mobility Modeling== https://ceur-ws.org/Vol-2578/BMDA2.pdf

Data-Driven Location Annotation for Fleet Mobility Modeling

Riccardo Guidotti, University of Pisa, Pisa, Italy, riccardo.guidotti@di.unipi.it
Mirco Nanni, ISTI-CNR, Pisa, Italy, mirco.nanni@isti.cnr.it
Francesca Sbolgi, University of Pisa, Pisa, Italy, f.sbolgi@unipi.it

ABSTRACT
The large availability of mobility data allows studying human behavior and human activities. However, this massive and raw amount of data generally lacks any detailed semantics or useful categorization. Annotations of the locations where the users stop may be helpful in a number of contexts, including user modeling and profiling, urban planning, and activity recommendations, and can even lead to a deeper understanding of the mobility evolution of an urban area. In this paper, we enhance the expressive power of individual mobility networks, a data model describing users' behavior, by defining a data-driven procedure for location annotation. The procedure considers individual, collective, and contextual features for turning locations into annotated ones. The annotated locations have a high expressiveness that allows generalizing individual mobility networks and makes them comparable across different users. The results of our study on a dataset of trucks moving in Greece show that the annotated individual mobility networks can enable detailed analysis of urban areas and the planning of advanced mobility applications.

1 INTRODUCTION
The large availability of digital traces of individuals is offering novel possibilities for understanding the patterns characterizing human mobility [7]. However, the personal data collected by smartphones or devices installed by car telematics companies for business and insurance purposes is generally limited to the positions of the vehicle, with no vision of what happens around it. On the other hand, planning individual and collective advanced mobility applications, as well as providing detailed analysis of urban and suburban areas, requires additional and more complex information [6, 10]. Raw mobility data like GPS positioning describes elementary events (position, acceleration, etc.), while any proper modeling requires a higher-level vision of what is happening to the user, to other users living in the surroundings, and to the environment in which the user is moving. Such higher-level modeling should provide some clear categorization, annotation or semantics for locations and/or movements. Recognizing different individual mobility behaviors, abstracting them at a higher comparable level, and making them explicit is mandatory for enabling novel applications or empowering existing ones.

Many studies semantically enrich mobility data with annotations about human activities and build individual mobility models on top of that. For instance, these approaches estimate home/work locations of an individual by analyzing the frequency of visits in a particular place [14], observe the sequence of movements to derive the sequence of activities [15], their semantics [19] and possible different aspects [3], or extract network-based personal data-driven models named Individual Mobility Networks (IMNs) capturing the structured patterns of visits to locations [21]. The existing works either focus on building mobility models or on adding semantics from external data sources. In this work, instead, we exploit the mobility data used to build the individual model also to annotate the model itself.

In particular, we advance IMNs by proposing a data-driven procedure for extracting Annotated Individual Mobility Networks (AIMNs). A limitation of current IMNs is that they are indeed "individual" and not easily comparable. Following [1], our goal is to add semantics to the locations modeling the nodes of the IMNs in order to make them comparable among different users. We accomplish this task by designing a procedure considering individual, collective, and contextual features for turning locations into annotated locations. The location annotation procedure, besides differentiating the locations with respect to the different features, also provides a hierarchy showing the reasons for the different annotation categories. The procedure is basically a two-step clustering. The first step applies a distance-based clustering to group the different locations. The second step exploits a hierarchical approach for further summarizing the locations, and for better describing them according to features characterizing the annotation categories. This gives the annotated locations a higher expressiveness and generalizes IMNs by making AIMNs comparable. Therefore, the analysis of AIMNs makes it possible to segment the users moving in different areas by means of the data-driven semantics provided by the area itself.

We employ the proposed methodology on a dataset of trucks moving in Greece, Albania, and other EU countries. We focused our analysis on two areas of different size, finding in both cases six types of locations for annotating the IMNs of the vehicles. The first main differences between the location types are the number of stops in a location, the number of links with other individual locations, and the average arrival/leaving times. Then, the locations differ in their centrality with respect to individual locations of other users and with respect to existing points of interest like gas stations, parking areas, shops, supermarkets, etc. A preliminary analysis of the AIMNs highlights that in each area there are distinct types of users (trucks, in our study) that visit the various types of location with different frequencies. Low-entropy trucks frequently visit the same type of annotated location, while high-entropy trucks have a central node from which they reach all the differently annotated locations.

The rest of the paper is organized as follows. Section 2 summarizes related work on location annotation. In Section 3 we recall individual mobility networks and further concepts for understanding the procedures for location annotation and annotated individual mobility network extraction described in Section 4. Section 5 presents experiments in the form of a case study in which we employ the proposed methodology. Finally, Section 6 concludes the paper and illustrates future research directions.

Copyright © 2020 for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 RELATED WORK
In the literature, various works address the problem of describing and characterizing visited locations for modeling mobility behavior and for describing land use. In [4, 5], a technique is presented to determine land uses in specific urban areas based on tweeting patterns, and to identify points of interest as places with high tweet activity. Human activities and geographical areas are modeled in [17] by means of Foursquare place categories. A spectral clustering algorithm is used on areas and users for identifying user communities that visit similar categories of places, and for comparing urban neighborhoods within and across cities. Semantic information attached to places could be used for location-based applications. Indeed, in [18], both Foursquare and cellular data are exploited to infer user activities in urban environments. The authors employ user communication patterns to predict the activity of Foursquare users who check in at nearby venues, adopting a machine learning approach. In [2], a location-based and preference-aware recommender system is presented that considers user preferences learned from location history with a predefined weighted category hierarchy (from Foursquare), and social opinions mined from location histories. Similarly, in [29], Foursquare check-in category information is exploited to model users' movement patterns and to predict the category of user activity at the next step by means of a mixed hidden Markov model. In order to describe and characterize the locations, the aforementioned works adopt social network data like Twitter and Foursquare. None of them use GPS positioning, as it is not sufficiently informative for their purpose. In contrast, our approach extracts the location categorization through data mining and location modeling applied to GPS trajectories.

In the following, we illustrate works modeling human mobility from GPS data but not explicitly characterizing locations with respect to activities at locations or neighborhood description. In [25], the mobility profile of a user is introduced as the set of her routine trips, i.e., a set of very systematic and repetitive movements. On the other hand, in [9] the authors focus on the locations and points of interest by developing a parameter-free method for extracting individual locations with a data-driven approach. Combining these approaches, in [12, 21] the authors define individual mobility networks (IMNs) for modeling all the salient aspects of individual mobility. An IMN describes the mobility of an individual through a graph representation of her locations and movements, grasping the relevant properties and removing unnecessary details. In [1] the authors define a general approach to use state-transition graphs (STGs) in movement analysis. An STG is an aggregate representation of a behaviour by a directed weighted graph, where the nodes stand for the possible states and the edges for the transitions between the states. Information about the states and activities is gathered from external sources like land allocations or presence of points of interest. The main difference between IMNs and STGs lies in the fact that the former model the mobility and are more attached to geographic components, while the latter model the events, states or activities that can occur one after another.

Modeling individual behavior is a precious task as, besides providing a succinct and understandable representation of the mobility patterns of the users, it enables the development of applications like carpooling [11] or trajectory prediction [26]. As a consequence, an individual model with enhanced location descriptions can be beneficial and informative.

3 SETTING THE STAGE
In the following, we introduce the definitions of trajectory [25] and individual mobility network [12, 21], useful for understanding the rest of the paper. We adapt them to the needs of the problem we are facing and the approach designed to solve it.

Definition 3.1 (Trajectory). A trajectory is a sequence $t = \langle p_1, \dots, p_n \rangle$ of spatio-temporal points, each being a tuple $p_i = (x_i, y_i, z_i)$ that contains latitude $x_i$, longitude $y_i$ and timestamp $z_i$ of the point. The points of a trajectory are chronologically ordered, i.e., $\forall\, 1 \le i < n : z_i < z_{i+1}$.

Given a trajectory $t$ we refer to its $i$-th point $p_i$ with the notation $t[i]$, and to its number of points with $t.n$. Also, we indicate the latitude, longitude and timestamp components of point $t[i]$ respectively with the notation $t[i].x$, $t[i].y$, and $t[i].z$.

Definition 3.2 (Individual History). Given a user $u$, we indicate with $H_u = \langle t_1, \dots, t_n \rangle$ the individual history of user $u$ as the set of trajectories traveled by $u$ in a certain time period.

Given the individual history $H_u$ of user $u$, we can extract from it the individual mobility network (IMN) $G_u$. An IMN describes the individual mobility of a user through a graph representation of her locations and movements, grasping the relevant properties of individual mobility and removing unnecessary details.

Definition 3.3 (Individual Mobility Network). Given a user $u$, we indicate with $G_u = (L_u, M_u)$ the individual mobility network of user $u$, where $L_u$ is the set of nodes and $M_u$ is the set of edges. On the nodes and edges we define the following functions:
• $\omega : L \to \mathbb{N}$ returns the number of stops in location $l \in L_u$;
• $\tau : L \to \mathbb{R}$ returns the typical time spent in location $l \in L_u$;
• $\rho : L \to \mathit{Time}$ returns the typical time of arrival of $u$ along all the movements $m \in M_u$ reaching location $l \in L_u$;
• $\pi_t : L \to \mathbb{R}$ returns the typical time travelled by $u$ along all the movements $m \in M_u$ reaching location $l \in L_u$;
• $\pi_d : L \to \mathbb{R}$ returns the typical distance travelled by $u$ along all the movements $m \in M_u$ reaching location $l \in L_u$.

Nodes represent locations $L_u$ and edges represent movements $M_u$ between locations. With the term typical we refer to the use of aggregating functions like mean and median, also including the associated dispersion indexes like standard deviation. To clarify the concept of IMN, let us consider Figure 1. It describes the IMN extracted from the mobility of an individual who visited seven distinct locations. Location $(a)$ has been visited by user $u$ for a total of $\omega(a) = 25$ times, i.e., 25 stops, with a typical stay of $\tau(a) = 7h40min$. On the other hand, user $u$ stopped in location $(f)$ for $\omega(f) = 3$ times with a typical stay of $\tau(f) = 1h10min$. Edges $(c, f)$ and $(e, f)$ lead to location $f$, typically arriving at time $\rho(f) = 18.10$, traveling $\pi_t(f) = 15min$ and $\pi_d(f) = 10km$.

Figure 1: IMN extracted from the mobility of an individual. Edges represent the existence of a route between the locations. The functions characterize each location.

The computation of an IMN $G_u$ starts from the ordered sequence history $H_u$ of user $u$. Locations are obtained by aggregating the origin and destination points of the trajectories using the TOSCA location clustering algorithm [9, 12]. A location identifies a set of points, and a location prototype is the point minimizing the distance to the other observations that are part of the location.

4 PROPOSED APPROACH
In this section, we describe the location annotation procedure and how it is applied to extract annotated individual mobility networks (AIMNs) from individual mobility networks (IMNs).

Figure 2: AIMN extracted from the IMN of Figure 1. Annotations are represented with colors. The first graph reports the annotated nodes, while the second graph is the AIMN with the novel annotated nodes and edges.

Table 1: Individual Features. (∗) indicates that the features are calculated with respect to weekdays vs. weekend, and daytime (from 7 am to 8 pm) vs. nighttime. (†) indicates features for both mean and standard deviation.
Name               Description
lat, lon           location prototype coordinates
next locs          number of different subsequent locations
radius             radius of gyration
entropy            entropy of ingoing/outgoing movements
stops              total and categorized number of stops∗
staytime           typical† stay time∗
arrival times      typical† arrival times∗
duration/length    typical† duration/length of movements∗
is regular         if a location is frequently visited

Table 2: Collective Features. They are calculated comparing an individual location with other individual locations.
Name               Description
$r$-exclusivity    number of stops within $r$ km w.r.t. other vehicles
$r$-centrality     number of locations in a radius of $r$ km
$k$-distance       the distance of the $k$-th nearest locations

Table 3: Contextual Features. All measures are also computed separately over nine different categories of POIs: gas stations, parking areas, piers, hotels, food shops, leisure activities, (no-food) shops, services, supermarkets.
Name                  Description
POI $r$-centrality    number of different POIs within a radius of $r$ km
POI $k$-centrality    number of different POIs within the $k$-th nearest POI
POI $k$-distance      distance to closest POIs within the $k$-th nearest POI
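
Two of the collective features in Table 2 reduce to distance queries between location prototypes. The following is a minimal sketch of $r$-centrality and $k$-distance, not the authors' implementation: it assumes prototypes are (latitude, longitude) pairs in degrees and uses the haversine great-circle distance, which the paper does not specify. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumption for the haversine formula


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def r_centrality(loc, others, r_km):
    """Number of other location prototypes within r_km of loc."""
    return sum(1 for o in others if haversine_km(loc, o) <= r_km)


def k_distance(loc, others, k):
    """Distance in km from loc to its k-th nearest other prototype (k starts at 1)."""
    if not 1 <= k <= len(others):
        raise ValueError("k must be between 1 and the number of other locations")
    return sorted(haversine_km(loc, o) for o in others)[k - 1]
```

For a location at (0.0, 0.0) and other prototypes at 0.01, 0.1 and 1.0 degrees of latitude, `r_centrality` with a 5 km radius counts one neighbour and `k_distance` with `k = 2` is roughly 11.1 km. With many locations, a spatial index would be preferable to this linear scan.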

4.1 Annotated Individual Mobility Network
An annotated individual mobility network (AIMN) extracted from an IMN describes the individual mobility of a user through a very simple graph representation of annotated locations and movements, summarizing the relevant properties of the IMN and compressing the information contained in the mobility model.

The extraction of an AIMN $\bar{G}_u = (\bar{L}_u, \bar{M}_u)$ of a user $u$ starts from her IMN $G_u = (L_u, M_u)$ and an annotation function $\lambda$. Given an individual location $l \in L_u$, the annotation function $\lambda : L \to \mathbb{N}$ returns its annotation as a consequence of a learning process. Two locations $l_1, l_2$ have the same annotation, i.e., $\lambda(l_1) = \lambda(l_2)$, if they are "similar". Details of the location annotation procedure that returns $\lambda$ and the meaning of similar locations are presented in the next section.

The AIMN $\bar{G}_u$ is built by mapping each node $l \in L_u$ to a node $\bar{l} = \lambda(l)$ corresponding to its annotation, and by merging the corresponding edges. Thus, locations with the same annotation are collapsed into the same annotated location. In more detail, given an edge between $l_1$ and $l_2$, if they have different annotations, $\lambda(l_1) \neq \lambda(l_2)$, then two annotated locations $\bar{l}_1, \bar{l}_2$ are created together with the movement-edge connecting them. On the other hand, if $l_1$ and $l_2$ have the same annotation, $\lambda(l_1) = \lambda(l_2)$, then a unique annotated location $\bar{l}$ is created together with a self-loop.

We exemplify the concept of AIMN through Figure 2. Given a certain $\lambda$ function, the first graph reports the node/location annotations of the graph in Figure 1, where the annotations are represented through colors. Nodes having the same color are described by similar features, i.e., are similar according to some aspects. The second graph illustrates the AIMN obtained by collapsing nodes with the same annotation into an annotated node, merging edges and creating self-loops.

4.2 Location Annotation Procedure
In this section, we describe the location annotation procedure we designed to obtain a non-trivial annotation function $\lambda : L \to \mathbb{N}$. As previously discussed, the aim of $\lambda$ is to provide the same annotation for similar locations. Hence, a definition of similarity must be provided in order to design $\lambda$. Given an IMN $G_u$ of user $u$, in order to compare two locations with a distance function, every location $l \in L_u$ must be described with a set of attributes (see [12, 21] for more details). Individual locations represented as a vector of features can be grouped using clustering algorithms [24]. As a consequence, $\lambda$ is the function which annotates each location with the label of the cluster to which the location belongs.

Simple $\lambda$ functions can be obtained by modeling an individual location through a basic set of attributes, for instance using features like latitude and longitude, i.e., spatial features, or the average stay time and arrival time, i.e., temporal features. However, such basic representations of a location can be insufficient to capture various aspects related to (i) the mobility behavior of the individual user, (ii) the user behavior in relationship with the mobility behavior of other users, (iii) the user behavior in relationship with the geography and context in which she moves.

To overcome these weaknesses, we design the following location annotation procedure that takes as input a set of IMNs $\mathcal{G} = \{G_1, \dots, G_n\}$ and a context $\mathcal{C}$ where the users move, and returns an annotation function $\lambda$. The procedure works as follows. For each IMN $G_u \in \mathcal{G}$, for each individual location $l \in L_u$, we describe $l$ with a vector of features capturing three distinct aspects.
• Individual features. These features characterize an individual location $l$ only with the mobility of the vehicle $u$ stopping in $l$. The list of attributes is shown in Table 1. Some features are calculated with respect to weekdays vs. weekend, and daytime (from 7 am to 8 pm) vs. nighttime. For features describing aggregates we report mean and standard deviation. A location is regular if it is visited more frequently than the others (see [12] for details).
• Collective features. These features characterize an individual location $l$ with respect to all the other locations both of user $u$ and of the other users in $\mathcal{G}$. Details in Table 2.
• Contextual features. These features characterize an individual location $l$ with respect to facilities, i.e., different types of points of interest (POIs) in the surrounding areas given by the context $\mathcal{C}$. Table 3 reports a detailed description.

Given the locations described by these features, we implement the annotation function by means of a clustering algorithm. Inspired by [8, 9], instead of using a simple clustering approach, we design a two-step clustering that simultaneously summarizes the different locations and yields a hierarchy for better describing them. In particular, we use the K-Means clustering algorithm [16] for the first clustering phase. As illustrated in the next section, we observe that we need a high $k$ (between 100 and 200) in order to have a good clustering with respect to internal evaluation measures like the Sum of Squared Error (SSE). The clustering with K-Means considerably reduces the number of different location prototypes. The second phase is aimed at further reducing the number of clusters, to keep the annotation computed by $\lambda$ simple, and at simultaneously obtaining a hierarchy for the different location prototypes. To this purpose we adopt a hierarchical clustering approach with Ward's criterion [28] to determine the clusters to be merged. A visual and interactive inspection of the resulting dendrogram and of the centroids describing the clusters makes it possible to understand the reasons [20] for having different types of annotated locations, and to choose a reasonable number of clusters representing the different types of annotated locations in an AIMN.

5 EXPERIMENTS
In this section, we present a case study on a dataset of trucks in which we employ the proposed methodology¹. First, we briefly analyze the dataset. Then, we provide details for the location annotations and the clustering results. Finally, we show some preliminary analysis enabled by the extracted AIMNs.

¹ The code for performing the analysis is available at: https://github.com/fsbolgi/AnnotatedIndividualMobilityNetwork. The dataset is not publicly available.

5.1 Dataset Description
We analyze a dataset of about 15 million trajectories of trucks moving in Greece, Albania, Cyprus, and a few other EU countries from July 2017 to June 2018. There are different kinds of trucks, depending on the size, number of carts, etc. We focus on the following types of trucks, which are the most frequent in the dataset: Van, Truck 3, Truck 3 ax., Flatbed Truck, Truck².

² More details are available at https://trackandknowproject.eu/about/deliverables/.

Analyzing all the vehicles together would not lead to a fair comparison due to the very different types of trajectories followed by the trucks. For instance, we cannot compare the IMN of a truck performing daily deliveries in a radius of 20 km with the IMN of a truck moving across regions or countries along very long trajectories. Therefore, we partition the trajectories of the dataset through a simple, K-means-like iterative procedure. First, each vehicle $u$ is associated to the geographical bounding box $r_u$ of its trajectories in $H_u$, and the set of areas $A$ is initialized to $\emptyset$. Then, we iteratively consider each vehicle $u$ and compare its $r_u$ with all the existing areas $a_i \in A$. If, for at least 75% of the vehicles $v \in a_i$, $r_u \cap r_v \neq \emptyset$ and $1/4 \le area(r_u)/area(r_v) \le 4$, then $u$ is added to $a_i$; otherwise, if no $a_i \in A$ satisfies this condition, a new area $a_j$ is created and $u$ is added to $a_j$. Finally, we check that each vehicle belongs to the area with the highest overlap and, if it does not, we move the vehicle between areas, repeating until convergence. The above procedure recognizes twelve different areas.

In the following, we focus on two areas depicted in Figure 3. We name the first the inter-regional area since it contains the trajectories of vehicles moving in various regions of Greece. On the other hand, the second is an urban area containing the movements of trucks in Athens. Details about the number of different vehicles can be found in Table 4. In addition, in Table 5 we report basic statistics of the trajectories for the two different areas. We notice that in the urban area a vehicle performs on average many more trajectories than a vehicle in the inter-regional area. As expected, the trajectories in the inter-regional area are on average longer than those in the urban area. Moreover, the high standard deviation means that the trajectories in the inter-regional area are also more varied than those in the urban area. The similar average duration might be due to the higher traffic in the urban area, which makes the speed lower than in the inter-regional area.

Figure 3: Heatmaps of the stop points in the areas analyzed in Greece: Inter-regional (left), and Urban (right).

Table 4: Distribution of different types of vehicles for the two areas analyzed (percentages in brackets).
Truck Type       Inter-regional Area    Urban Area
Van              214,129 (71.03%)       105,065 (75.24%)
Truck 3          82,697 (27.43%)        33,353 (23.88%)
Truck 3 ax.      3,478 (1.15%)          1,223 (0.88%)
Flatbed Truck    978 (0.32%)            -
Truck            179 (0.05%)            -

Table 5: Mean and standard deviation of some descriptive statistics for the trajectories in the two areas analyzed.
Measure             Inter-regional Area    Urban Area
Traj per Vehicle    419.86 ± 255.75        498.72 ± 278.72
Avg Length          10.40 ± 14.17          6.64 ± 9.65
Avg Duration        23.85 ± 23.82          19.25 ± 19.85

Figure 4: K-Means Sum of Squared Error (SSE) varying the number of clusters $k$. We selected $k = 155$ for the inter-regional area, and $k = 140$ for the urban area.
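
The first pass of the area-partitioning procedure of Section 5.1 can be sketched as follows. This is an illustrative reading of the description, not the released code: `BBox`, `compatible` and `assign_areas` are hypothetical names, bounding boxes are treated as planar rectangles, a vehicle joins the first area that satisfies the 75% condition (the text does not say which one to pick when several qualify), and the final reassignment step "until convergence" is omitted because the overlap measure is not specified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def intersects(self, other: "BBox") -> bool:
        return (self.min_x <= other.max_x and other.min_x <= self.max_x
                and self.min_y <= other.max_y and other.min_y <= self.max_y)


def compatible(ru: BBox, rv: BBox, max_ratio: float = 4.0) -> bool:
    """True if the boxes intersect and 1/max_ratio <= area(ru)/area(rv) <= max_ratio."""
    if not ru.intersects(rv) or ru.area() == 0 or rv.area() == 0:
        return False
    return 1 / max_ratio <= ru.area() / rv.area() <= max_ratio


def assign_areas(boxes: dict, min_share: float = 0.75, max_ratio: float = 4.0) -> list:
    """Single pass: add each vehicle to the first area where at least
    min_share of the members are compatible, else open a new area."""
    areas = []  # each area is a list of vehicle ids
    for u, ru in boxes.items():
        for members in areas:
            hits = sum(compatible(ru, boxes[v], max_ratio) for v in members)
            if hits >= min_share * len(members):
                members.append(u)
                break
        else:  # no existing area satisfied the condition
            areas.append([u])
    return areas
```

With four vehicles whose boxes are two overlapping unit squares, one 10 by 10 square, and one distant unit square, the pass yields three areas: the two overlapping unit squares together, and the other two vehicles alone.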
Figure 5: Inter-regional Area: hierarchical clustering dendrogram with a cut at six clusters (cluster sizes in brackets).

Figure 7: Inter-regional Area: Tree showing the most discriminant features for the dendrogram in Figure 5. [Decision tree: the root tests regular ↑ ∧ next locs ↑ ∧ radius ↑ ∧ entropy ↑ ∧ avg arrival time ↑; the leaves are the clusters $C_1, \dots, C_6$.]

Figure 6: Inter-regional Area: centroids parallel plots showing discriminant features for the dendrogram in Figure 5.

5.2 Annotated Locations Clustering Analysis
Using the procedure described in [12], we extract the IMNs for the vehicles in both areas, and we use them as input for the proposed methodology. In particular, in our analysis, we focus on September and October 2017, obtaining 883 IMNs in the inter-regional area and 373 IMNs in the urban area. Statistics describing the IMNs are reported in Table 7. As context $\mathcal{C}$, we exploit a dataset containing the POIs of the whole of Europe³. We extracted only the POIs relative to the areas we are interested in and we restricted them to some more general and relevant categories, namely: gas, parking, pier, hotel, food, leisure, (no-food) shop, service, supermarket.

³ The POIs are points collected from OpenStreetMap filtered based on Geofabrik's taxonomy of OpenStreetMap features, i.e., points with the label "POI" are kept.

Given the IMNs for the two areas, the location annotation procedure received as input about 110k locations for the inter-regional area and about 39k locations for the urban one. We described each location with the individual, collective, and contextual characteristics illustrated in Section 4.2, resulting in a vector of 72 features. For $r$-exclusivity we adopt $r = 0.2$ km, for $r$-centrality $r \in \{1, 5, 15\}$ km, for $k$-distance $k \in \{1, 3, 5, 8, 10, 20\}$. For POI $r$-centrality, POI $k$-centrality and POI $k$-distance we adopt $r = 0.5$ km and $k = 30$, respectively. Before running the location annotation procedure, we performed a correlation analysis using the Pearson correlation coefficient. This step allowed us to reduce the number of features to 55. In particular, (i) the total number of stops is removed in favor of next loc, entropy and number of daytime weekday stops; (ii) among the temporal aggregations of stay times, arrival and leaving times, only the weekday vs. weekend is considered; (iii) movement duration features are removed since they are strongly correlated with movement length features; (iv) we remove $r_5$-centrality and $k_1$-centrality, $k_5$-centrality, $k_{20}$-centrality as they are highly correlated with the remaining ones.

We ran K-Means with $k = 2, \dots, 900$. The visual inspection of the Sum of Squared Error (SSE) in Figure 4 suggested selecting $k = 155$ for the inter-regional area, and $k = 140$ for the urban area. The subsequent run of the hierarchical clustering with Ward's criterion yields the dendrograms in Figures 5 and 8. In both cases, we cut the dendrograms to obtain six clusters that characterize the description of the different annotated locations with a good trade-off between a sufficiently high level of abstraction and a detailed specification. We observe that in both cases, more than 50% of the locations end in the rightmost part of the dendrogram, making it slightly imbalanced. That produces some small clusters, yet none of them is negligible in terms of size. In the following, we combine the hierarchy of the dendrogram with the information returned by the parallel plots of the centroids of the clusters for each split. This visualization, thanks to the interpretable features [13] describing the locations, makes it possible to explain [20] the hierarchy and consequently the annotations of the various clusters $C_1, \dots, C_6$.

In order to better understand the reasons that led to differentiating the locations into the six clusters (and therefore understand what each cluster contains), we show in Figures 6 and 9 the corresponding parallel plots of the most significant features, respectively for the inter-regional and urban areas; we also summarize in Figures 7 and 10 the insights we obtained through inspection of the features at different levels of the dendrogram, by means of a decision tree representation.

In both dendrograms the first split is a consequence of differences relative to individual features. The first split (Figures 6 and 9 top), using the individual features, separates on the left branch regularly visited locations (↑ regular) from which it is possible to reach various destinations (↑ next locs and ↑ entropy) with early arrivals and leavings (↑ avg arrival times, i.e., the vehicle leaves before 8 am and is back before 7 pm) in a location defined by a not very specific area (↑ radius), from the other locations (right branch). Thus, with respect to our case study, the locations on the left-hand side can be matched with storage points and/or deposits of the trucks analyzed, while those on the right are all the others. This consideration can also justify the fact that the majority of locations lie in the rightmost part of the dendrogram. Moving forward on the left branch, we have a different split for the inter-regional area and for the urban area.

For the inter-regional area (Figure 6 center left) the second split, making use of collective and contextual features, separates not regular locations (regular ↓) not close to individual locations
Figure 8: Urban Area: hierarchical clustering dendrogram with a cut yielding six clusters (cluster sizes in brackets).

Figure 9: Urban Area: centroids parallel plots showing discriminant features for the dendrogram in Figure 8.

Figure 10: Urban Area: Tree showing the most discriminant features for the dendrogram in Figure 8. [Decision tree: the root tests regular ↑ ∧ next locs ↑ ∧ radius ↑ ∧ entropy ↑ ∧ avg arrival time ↑; the leaves are the clusters $C_1, \dots, C_6$.]

Figure 11: Heatmaps of the different clusters for the annotated locations of the Inter-regional area. From left to right, top to bottom, clusters $C_1, C_2, \dots, C_6$.

Following the same logic used to describe the splits of the inter-regional area, we can understand the splits for the annotations of the urban area from Figures 9 and 10. The second split (Figure 9 center left) discriminates locations that are close to various existing POIs (POI $r_{0.5}$-centr ↑ and POI $k_{30}$-dist ↓). Since the closest categories for the black lines are food, leisure, shop, service and supermarket, we can assume that cluster $C_1$ mainly identifies "regular" locations in the city center. A further split on individual features (Figure 9 bottom left) identifies more regularly visited locations (↑ regular) with a high stay time and number of stops during both weekend and weekday (↑ avg staytime, ↑ stops). The other branch separates the locations based on the distance to existing POIs (POI $r_{0.5}$-centr ↑ and POI $k_{30}$-dist ↓). This time there is a clear separation for all the contextual features, also considering gas, parking and hotel. Hence, the cluster $C_4$ captures central locations (probably even more central than those on the left branch of the tree) visited sporadically by the vehicles analyzed. This cluster is the biggest in the urban area. Finally, the last split, relative to suburban locations, distinguishes cluster $C_5$, containing suburban locations with a large radius from which many other individual locations can be reached but which are far away from POIs except gas stations (probably located in an industrial area), from cluster $C_6$, which is formed by suburban locations close to facilities but reached sporadically.
In Figure 11 we show the heatmaps of the locations for the
of other vehicles (𝑟 15 −centrality ↓, i.e., not central w.r.t. the oth- various clusters for the Inter-regional area. We can notice how
ers) nor to POIs (POI 𝑟 0.5 −centrality ↓ and POI 𝑘 30 −distance ↑) the position of the different annotated locations on the maps is
from the rest. Cluster 𝐶 1 models suburban and peripheral storage coherent with the descriptions reported above. From a very high
points. On the other hand, the subsequent split (Figure 6 bottom level we can summarize the distinction between the different
left) identifies less frequent locations with shorter stops (cluster annotated locations as delivery “origins” (𝐶 1, 𝐶 2, 𝐶 3 ) and “desti-
𝐶 2 ) and more frequent locations with longer stops (cluster 𝐶 3 ). nations ”(𝐶 4, 𝐶 5, 𝐶 6 ), i.e., recipients. Then, the clusters distinguish
The right branch of the inter-regional tree in Figure 7 performs according to the closeness with respect to other individual loca-
a symmetric split with respect to the left branch. Thus, in cluster tions and POIs, to usage in terms of stay times and arrival times.
𝐶 4 we find not regularly visited locations surrounded by many For instance, we have locations in the city center in 𝐶 4 , isolated
other personal locations and POIs (𝑟 15 −centr ↑, POI 𝑟 0.5 −centr ↑, suburban locations 𝐶 5 , and suburban locations surrounded by
POI 𝑘 30 −dist ↓). Due to the high density, we can infer that these POIs in 𝐶 6 . Moreover, we highlight that latitude and longitude
locations are central regions of the inter-regional area. Finally, are not crucial for distinguishing the various clusters at the final
clusters 𝐶 5 and 𝐶 6 containing most of the locations are placed in level of the dendogram. Indeed the points are not entirely sep-
suburbans regions far away from many POIs. arated from a spatial point of view. This implies that the other
Measure        Inter-regional Area    Urban Area
IMN vs TMP     0.1580                 0.1465
IMN vs SPT     0.1186                 0.0723
TMP vs SPT     0.0036                 0.0136
Table 6: NMI comparing the proposed location annotation
procedure based on IMNs against a temporal annotation
(TMP) and a spatial annotation (SPT) procedure.
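
The NMI values in Table 6 compare pairs of labelings of the same locations. The following is a minimal sketch of how such a score can be computed. It assumes natural logarithms and normalization by the arithmetic mean of the two entropies, which is one of the variants discussed in [27]; the text does not say which variant was used, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return sum(-(c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_a, labels_b):
    """Mutual information between two labelings of the same items."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


def nmi(labels_a, labels_b):
    """NMI normalized by the mean of the two entropies (assumed variant)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same locations")
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put everything in a single cluster
    return mutual_information(labels_a, labels_b) / ((h_a + h_b) / 2)


# Toy annotations of a few locations by two hypothetical procedures.
same_up_to_renaming = nmi(
    [1, 1, 2, 2, 3, 3, 3, 3],
    ["a", "a", "b", "b", "c", "c", "c", "c"],
)
unrelated = nmi([1, 1, 2, 2], [1, 2, 1, 2])
```

Two partitions that differ only in how the clusters are named score 1, while statistically independent partitions score 0, so values close to 0 indicate annotations that share very little information.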

Measure        Inter-regional Area    Urban Area
Nodes          96.84 ± 74.98          138.96 ± 112.04
Edges          270.47 ± 190.32        312.21 ± 224.90
Density        0.07 ± 0.15            0.05 ± 0.10
Degree         5.78 ± 1.46            4.85 ± 2.39
Clus. Coef.    0.19 ± 0.13            0.17 ± 0.13
Table 7: IMNs characteristics (mean ± std dev).
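
Tables 7 and 8 summarize the networks before and after each location is replaced by its annotation. The sketch below illustrates that contraction on a toy directed graph. The dictionary-based graph encoding, the function names, the choice to keep self-loops, and the directed density formula m / (n(n-1)) are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def contract_imn(edges, annotation):
    """Collapse an IMN into an AIMN.

    edges: dict mapping (origin, destination) location pairs to trip counts.
    annotation: dict mapping each location to its annotation label,
        playing the role of the annotation function lambda.
    """
    contracted = Counter()
    for (src, dst), trips in edges.items():
        # Movements between locations of the same types are merged.
        contracted[(annotation[src], annotation[dst])] += trips
    return dict(contracted)


def summary(edges):
    """Number of nodes, number of edges and directed density of a graph."""
    nodes = {node for edge in edges for node in edge}
    n, m = len(nodes), len(edges)
    # Self-loops are counted as edges here (assumption).
    density = m / (n * (n - 1)) if n > 1 else 0.0
    return {"nodes": n, "edges": m, "density": density}


# Hypothetical IMN of one vehicle: four locations, six movements.
imn = {
    ("depot", "shop_a"): 4,
    ("shop_a", "depot"): 4,
    ("depot", "shop_b"): 2,
    ("shop_b", "shop_a"): 1,
    ("shop_a", "gas"): 1,
    ("gas", "depot"): 1,
}
# Hypothetical annotation of the four locations.
lam = {"depot": "C3", "shop_a": "C4", "shop_b": "C4", "gas": "C5"}

aimn = contract_imn(imn, lam)
imn_stats = summary(imn)
aimn_stats = summary(aimn)
```

In this toy case the contraction reduces both nodes and edges while the density increases, which is the qualitative pattern visible when comparing the two tables.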

features are much more relevant than the geographical aspects for capturing different characteristics.
Finally, we adopt a clustering evaluation measure to show that the proposed location annotation procedure instantiated with the truck case study is meaningful and not trivial. We implement two simple annotation functions $\lambda$. The first procedure (SPT) describes the locations only using spatial features, i.e., latitude and longitude. The second procedure (TMP) describes the locations only using the average stay time and arrival time. In both procedures, the locations are grouped and annotated using K-Means with $k = 6$, i.e., the same number of annotated locations discovered with the proposed procedure. We compare the annotations, i.e., the clustering results, returned by the three procedures using the Normalized Mutual Information score [27]. The more similar the annotations of two location annotation procedures, the higher the NMI score. The very low NMI scores reported in Table 6 confirm that it is not possible to design an annotation procedure similar to the proposed one with a trivial approach.

5.3 Inspection of Annotated IMNs
Given the location annotation described in the previous section, we know the behavior of the annotation function $\lambda$ and the meaning of the different annotations. Thus, given a location $l$ with the vector of features describing it, $\lambda$ assigns $l$ to the most similar cluster with respect to the six prototypes represented by the centroids reported in Figures 6 and 9, i.e., $\lambda : L \rightarrow [1, \ldots, 6]$. Using the annotation functions $\lambda$ for the IMNs of the two areas we obtain the corresponding AIMNs. We report in Table 8 statistics describing the AIMNs. By comparing Table 8 with Table 7 we notice that the number of nodes and edges obviously drops. On the other hand, we observe that AIMNs are much denser than IMNs and have an average degree of about 5 with a standard deviation of about 1.1, meaning that typically each vehicle from an annotated location can reach at least 3 other annotated locations. Consequently, AIMNs show a much higher clustering coefficient.

Measure        Inter-regional Area    Urban Area
Nodes          5.38 ± 0.89            5.06 ± 1.02
Edges          15.12 ± 4.69           13.31 ± 4.56
Density        1.28 ± 0.28            1.28 ± 0.32
Degree         5.48 ± 1.16            5.10 ± 1.10
Clus. Coef.    0.82 ± 0.22            0.85 ± 0.20
Table 8: AIMNs characteristics (mean ± std dev).

Moreover, we exploit the AIMNs for a preliminary analysis aimed at comparing vehicles. From the AIMNs we model each truck with a vector $v_u = \langle f_1, f_2, \ldots, f_6 \rangle$ of six elements containing the relative number of stops in each type $C_1, C_2, \ldots, C_6$ of annotated locations. Using these vectors of relative frequencies $f_i$, we calculate for each user the purity and (normalized) entropy [22] as $\mathrm{purity}(v) = \max_{i \in [1,6]} f_i$ and as $\mathrm{entropy}(v) = -\sum_{i \in [1,6]} f_i \log_2(f_i) / \log_2(6)$. We highlight that purity and entropy capture two different aspects. We report the distributions of purity and entropy for the inter-regional and urban areas in Figure 12. For the purity (top row), we observe two normal-like distributions with most of the users having a purity of 0.5 in both areas, meaning that half of the stops of the vehicle refer to the same annotated location. On the other hand, there are also trucks with a purity close to 1.0, meaning that almost all the stops belong to the same annotated location. There are no vehicles having a purity lower than 0.2. With respect to entropy (bottom row), we observe right-skewed distributions with a mean of about 0.7, showing that, despite a high average purity, most of the trucks stop in different annotated locations. In addition, there are very few vehicles with an entropy close to 0.0, signaling that a truck generally visits a minimum number of different annotated locations. Finally, we report that we found no relationships between the type of truck and entropy or purity.

Figure 12: Purity and entropy distributions for number of stops in different annotated locations.

We report in Figure 13 the AIMNs of three trucks moving in the urban area with different levels of purity and entropy. The leftmost AIMN shows a high entropy and low purity, visiting all the types of annotated locations with a balanced number of stops, i.e., similar sizes. We notice on the map how it covers a larger area compared to the other AIMNs. The big yellow node $C_2$ models the parking/storage while the others model different types of destinations. On the other hand, the rightmost AIMN shows a low entropy and high purity: the majority of the stops are on annotated locations of type $C_3$ (red), which are on the north-west and south-east in the map. Finally, the central AIMN shows a typical vehicle with a medium level of entropy and purity.

6 CONCLUSION
We have proposed a data-driven procedure for annotating individual locations, and we have employed it to extract annotated individual mobility networks (AIMNs) from individual mobility networks (IMNs). A case study experimentation on a real dataset of fleet moving in Greek areas has shown the effectiveness of the proposed approach. We have found a hierarchical structure
Figure 13: AIMNs for three vehicles moving in the urban area with different purity and entropy levels. Left: low purity and high entropy. Center: medium purity and entropy. Right: high purity and low entropy. The top row reports the AIMNs while the bottom row contains the corresponding AIMNs with a spatial disaggregation of the annotated locations on a real map. In both cases, the larger the node/location, the higher the number of stops in that location.
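
The purity and normalized entropy that distinguish the three vehicles in Figure 13 can be computed from the number of stops a vehicle makes in each type of annotated location. The following is a minimal sketch, assuming six annotation types, at least one recorded stop, and the usual convention that a zero frequency contributes nothing to the entropy; the function names and the example counts are hypothetical.

```python
from math import log2


def stop_profile(stop_counts):
    """Relative number of stops per annotation type (the vector v_u)."""
    total = sum(stop_counts)
    if total == 0:
        raise ValueError("the vehicle has no recorded stops")
    return [count / total for count in stop_counts]


def purity(freqs):
    """Share of stops falling in the most visited annotation type."""
    return max(freqs)


def normalized_entropy(freqs):
    """Shannon entropy in bits divided by log2 of the number of types."""
    h = sum(-f * log2(f) for f in freqs if f > 0)  # 0 * log(0) taken as 0
    return h / log2(len(freqs))


# Hypothetical stop counts per type C1..C6 for three vehicles.
balanced = stop_profile([5, 5, 5, 5, 5, 5])
two_types = stop_profile([10, 10, 0, 0, 0, 0])
single_type = stop_profile([0, 0, 30, 0, 0, 0])

for name, v in [("balanced", balanced), ("two types", two_types),
                ("single type", single_type)]:
    print(name, round(purity(v), 3), round(normalized_entropy(v), 3))
```

The two measures are related but not redundant: a vehicle splitting its stops evenly between two types has a purity of 0.5, yet its entropy depends on how the remaining stops are spread over the other types.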

describing the annotated locations through individual, collective, and contextual features. The principal discrimination is based on the frequency of visits, length of stay, and centrality with respect to other locations and existing points of interest. As a consequence of the annotation, we can observe different vehicles by studying the corresponding AIMNs. Such information can be applied for a detailed analysis of inter-regional and urban areas and for planning ad-hoc and personalized mobility applications.
Future research directions can pursue various goals. First, we would apply the proposed methodology to different data sources. For instance, observing the personal mobility of individual car drivers, or of users adopting different means of transportation such as bicycle, foot, or public services, would probably lead to identifying other types of annotations for the locations. Second, we would like to perform a deeper analysis of the users by clustering the AIMNs with respect to the annotated locations for discovering relevant users' segmentation and geographical characterization of the territory. Third, we would like to use the annotated locations and the mobility history for building sequences of "states" [1]. Exploiting sequential pattern mining algorithms would allow us to search for mobility patterns represented as sequences of different annotated locations. Fourth, we would like to analyze the benefits of integrating our concepts of annotated locations and AIMNs into visualization platforms [1, 23] for semantic annotation of individual mobility.

ACKNOWLEDGMENTS
This work is partially supported by the E.C. H2020 programme under the funding scheme Track & Know, G.A. 780754, https://trackandknowproject.eu/. We thank Haosheng Huang and Cheng Fu from University of Zurich for sharing the POIs dataset.

REFERENCES
[1] Natalia Andrienko and Gennady Andrienko. 2018. State transition graphs for semantic analysis of movement behaviours. IV 17, 1 (2018), 41–65.
[2] Jie Bao et al. 2012. Location-based and preference-aware recommendation using sparse geo-social networking data. In SIGSPATIAL. ACM, 199–208.
[3] Ronaldo Dos Santos Mello et al. 2019. MASTER: A Multiple Aspect View on Trajectories. TGIS 12526 (2019).
[4] Vanessa Frias-Martinez et al. 2012. Characterizing urban landscapes using geolocated tweets. In PASSAT. IEEE, 239–248.
[5] Vanessa Frias-Martinez and Enrique Frias-Martinez. 2014. Spectral clustering for sensing urban land use using Twitter activity. EAAI 35 (2014), 237–245.
[6] Fosca Giannotti et al. 2011. Unveiling the complexity of human mobility by querying and mining massive trajectory data. VLDB 20, 5 (2011), 695–719.
[7] Marta C Gonzalez et al. 2008. Understanding individual human mobility patterns. Nature 453, 7196 (2008), 779.
[8] Riccardo Guidotti et al. 2014. Retrieving points of interest from human systematic movements. In SEFM. Springer, 294–308.
[9] Riccardo Guidotti et al. 2015. TOSCA: two-steps clustering algorithm for personal locations detection. In SIGSPATIAL. ACM, 38.
[10] Riccardo Guidotti et al. 2016. Unveiling mobility complexity through complex network analysis. SNAM 6, 1 (2016), 59.
[11] Riccardo Guidotti et al. 2017. Never drive alone: Boosting carpooling with network analysis. IS 64 (2017), 237–257.
[12] Riccardo Guidotti et al. 2017. There's a path for everyone: A data-driven personal model reproducing mobility agendas. In DSAA. IEEE, 303–312.
[13] Riccardo Guidotti et al. 2018. Discovering temporal regularities in retail customers' shopping behavior. EPJ Data Science 7, 1 (2018), 6.
[14] Shan Jiang et al. 2012. Clustering daily patterns of human activities in the city. DAMI 25, 3 (2012), 478–510.
[15] John Lafferty et al. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. (2001).
[16] James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In BSMSP, Vol. 1. Oakland, CA, USA, 281–297.
[17] Anastasios Noulas et al. 2011. Exploiting semantic annotations for clustering geographic areas and users in location-based social networks. In AAAI.
[18] Anastasios Noulas et al. 2013. Exploiting foursquare and cellular data to infer user activity in urban environments. In ICMDM, Vol. 1. IEEE, 167–176.
[19] Christine Parent et al. 2013. Semantic trajectories modeling and analysis. CSUR 45, 4 (2013), 42.
[20] Dino Pedreschi et al. 2019. Meaningful explanations of Black Box AI decision systems. In AAAI, Vol. 33. 9780–9784.
[21] Salvatore Rinzivillo et al. 2014. The purpose of motion: Learning activities from individual mobility networks. In DSAA. IEEE, 312–318.
[22] Claude Elwood Shannon. 1948. A mathematical theory of communication. Bell system technical journal 27, 3 (1948), 379–423.
[23] Amílcar Soares et al. 2019. VISTA: A visual analytics platform for semantic annotation of trajectories. In EDBT. 570–573.
[24] Pang-Ning Tan. 2018. Introduction to data mining. Pearson Education India.
[25] Roberto Trasarti et al. 2011. Mining mobility user profiles for car pooling. In KDD. ACM, 1190–1198.
[26] Roberto Trasarti et al. 2017. Myway: Location prediction via mobility profiling. IS 64 (2017), 350–367.
[27] Nguyen Xuan Vinh et al. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. JMLR 11, Oct (2010), 2837–2854.
[28] Joe H Ward Jr. 1963. Hierarchical grouping to optimize an objective function. JASA 58, 301 (1963), 236–244.
[29] Jihang Ye, Zhe Zhu, and Hong Cheng. 2013. What's your next move: User activity prediction in location-based social networks. In ICDM. SIAM, 171–179.