=Paper= {{Paper |id=Vol-3651/BMDA_paper12 |storemode=property |title=A Shape-Based Map Matching Approach for Geographic Transferability of Discriminative Subtrajectories |pdfUrl=https://ceur-ws.org/Vol-3651/BMDA-12.pdf |volume=Vol-3651 |authors=Cristiano Landi,Riccardo Guidotti |dblpUrl=https://dblp.org/rec/conf/edbt/LandiG24 }} ==A Shape-Based Map Matching Approach for Geographic Transferability of Discriminative Subtrajectories== https://ceur-ws.org/Vol-3651/BMDA-12.pdf

A Shape-Based Map Matching Approach for Geographic
Transferability of Discriminative Subtrajectories
Cristiano Landi1,2,∗ , Riccardo Guidotti1,2
1
University of Pisa, Pisa, Italy.
2
ISTI-CNR, Pisa, Italy.

Abstract
This paper addresses the challenge of map matching and geographic transferability in trajectory analysis. Existing methods
often face limitations tied to specific coordinates or road networks. In response, we propose GASM, a shape-based map
matching method that relies solely on trajectory shapes, irrespective of geographic origin. GASM introduces a symbolic road
network representation, facilitating efficient searches based solely on trajectory shapes. Our experimentation, spanning over
5,000 km of roads, demonstrates GASM’s ability to accurately position trajectories with an impressive accuracy exceeding 90%.
Notably, GASM stands as the first in the literature to perform shape-based symbolic map matching without prior knowledge
of the geographic region.

Keywords
Map Matching, Geographic Transferability, Machine Learning, Discriminative Subtrajectories

1. Introduction mobility patterns from one geographical region and ef-
fectively applying them in another region [8, 9].
In recent years, the widespread adoption of cutting-edge Particularly noteworthy are recent advancements in
technologies equipped with Global Positioning System machine learning leveraging shapelet-based subtrajec-
(GPS) devices has enabled the recording of positions for tories [5, 6, 7]. Originating from the domain of time
various moving objects, ranging from cars and transporta- series analytics, shapelets represent discriminatory sub-
tion vehicles to phones and wearables. Unfortunately, sequences that encapsulate a collection of distinctive
the coordinates captured by these sensors often fail to ac- shapes, crucial for discerning specific classes [10]. Vari-
curately reflect real positions due to physical constraints ous approaches exist for defining discriminative subtra-
and/or legal regulations. Nevertheless, in various applica- jectories. In [6], the Movelet method is introduced—an
tions, it is imperative to accurately align GPS trajectories approach for extracting discriminative subtrajectories
with a road network. For instance, in navigation ser- selected through a rigorous statistical test. During the
vices, map matching empowers drivers to monitor their discovery phase, Movelet generates candidate subtrajec-
exact locations and receive optimal routes to specified tories by extracting all possible subsequences with more
destinations. Conversely, in machine learning tasks, map than two contiguous observations, utilizing a sliding win-
matching enhances users’ mobility information by in- dow. Building upon the foundation laid by Movelet,
corporating knowledge related to the territory, such as Geolet is introduced in [7]. This extension incorporates
Points Of Interest (POI), feature engineering [1, 2, 3, 4], or a normalization step after the discovery phase. The nor-
the identification of discriminatory subsequences, such malization step is designed to ease the comparison of
as mobility shapelets [5, 6, 7]. Without an appropriate discriminative subsequences with trajectories recorded
map-matching procedure, reliance on an expert becomes in diverse geographical regions. The underlying ratio-
necessary to determine which features can be extracted nale is that a subtrajectory pinpointing a sudden break
from trajectories concerning the territory. However, the in a road segment in one city should exhibit similarities
reliance on ad-hoc features restricts the applicability of to a subtrajectory associated with the same event in an-
machine learning methods and amplifies sensitivity to other city. This normalization enhances the method’s
input changes [1], rendering it unsuitable for geographic adaptability across various geographic contexts.
transferability. This implies the challenge of extracting While Geolet successfully addresses the limitation
of Movelet by providing normalized subtrajectories,
Published in the Proceedings of the Workshops of the EDBT/ICDT 2024 thereby enhancing geographic transferability indepen-
Joint Conference (March 25-28, 2024), Paestum, Italy dent of specific GPS coordinates, it introduces a potential
∗
Corresponding author. vulnerability tied to the road network.
Envelope-Open cristiano.landi@phd.unipi.it (C. Landi);
riccardo.guidotti@unipi.it (R. Guidotti)
Our underlying hypothesis is that the less frequently
Orcid 0000-0003-4907-9728 (C. Landi); 0000-0002-2827-7613 a trajectory occurs, the greater the likelihood that shape-
(R. Guidotti) based methods utilizing it as a discriminative subse-
Copyright © 2024 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).

CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
quence may not capture the intrinsic features of the key strategies employed to tackle this challenge.
trajectory but rather only its geographic position. In In the literature of trajectory analysis, a multitude of
essence, if a discriminative subtrajectory is intrinsically
strategies exists for mapping trajectories onto a road net-
linked to a particular road network due to its distinc- work. For high-frequency sampled trajectories, the sim-
tive shape, it becomes unsuitable for geographic trans- plest approach involves associating each spatio-temporal
ferability. This limitation arises from the fact that thepoint with the nearest street segment [11, 12]. However,
discriminatory aspect is not rooted in the movements these techniques, while fast and straightforward, have ex-
themselves but rather in the structural characteristics hibited inaccuracies, particularly at intersections and par-
of the road network. Consequently, evaluating the geo- allel roads. To address these limitations, enhancements
graphic transferability of discriminative subtrajectorieshave been introduced, incorporating heading direction
necessitates a shape-based map-matching approach that or employing a Kalman filter to eliminate outlier points
exclusively relies on shapes without prior knowledge of in trajectories [13]. Alternatively, some approaches lever-
the position. Regrettably, to the best of our knowledge, age probabilistic-based map-matching algorithms, in-
such an approach is currently unavailable. This under- tegrating hidden Markov models to identify the most
scores the need for innovative solutions in the realm of likely sequence of road segments aligning with the tra-
shape-based map matching to comprehensively assess jectory [14, 15]. On the other hand, in the context of
the adaptability of discriminative subtrajectories acrosslow-sampled trajectories, much of the existing literature
diverse geographical contexts. presupposes that the most probable route connecting two
To overcome this limitation, our paper introduces successive points is also the shortest or fastest [16]. How-
GASM, an Geographic Automaton Shape-based map ever, in [17], is introduced a map-matching algorithm
Matching approach. GASM relies solely on the shape of a that exploits temporal intervals between GPS points. This
trajectory to accurately determine its position within the
method identifies the optimal match between two GPS
road network. Specifically, GASM employs a symbolic points by selecting the route with the most similar travel
representation to transform the road network, construct- time. Also, in [18] is proposed a method for map match-
ing a spatial index independent of coordinates. This ing low-sampled trajectories based on supplementary
unique approach allows for efficient trajectory searches information such as speed and moving direction, typi-
based solely on their shapes. To the best of our knowl- cally collected alongside spatial locations.
edge, GASM is the first proposal in the literature that Geographic transferability encapsulates the challenge
exclusively utilizes a discretized representation of a trajec-
of extending knowledge gleaned from one geographic re-
tory’s shape, devoid of any knowledge of the geographic gion to another. This entails constructing machine learn-
region, for shape-based map matching. Our experimenta- ing models capable of adeptly executing tasks in regions
tion with GASM on a novel comprehensive geographical distinct from their original training grounds. The core
dataset spanning over 5,000 km of roads in Tuscany, cen- of this challenge emerges from pronounced disparities
tral Italy, demonstrates its capability to identify correct
in data distributions, patterns, and characteristics across
alignments with an impressive accuracy exceeding 90%. diverse geographical locations. As articulated in [8, 9],
Furthermore, GASM exhibits efficiency, as it can con- models trained on data from one region may encounter
struct the necessary representation for the entire dataset
difficulties in generalization when confronted with the
in less than 1.5 hours, maintaining a linear complexity at
unique variations inherent in another region. In pursuit
query time. of a global model, a fundamental strategy is employed
The paper is organized as follows. Section 2 sum- as demonstrated in [8], where diverse data sources are
marises the related works concerning map-matching aggregated on a global scale to build a transferable model.
methods and the challenges posed by geographic trans- Conversely, in [9], city indicators are identified pertain-
ferability. In Section 3, we encapsulate the technical ing to road networks, traffic flows, and individual mobil-
concepts essential for comprehending the algorithm de- ity, to facilitate the assessment of similarities between ge-
lineated in Section 4. The outcomes of experiments con- ographical regions. Subsequently, an ensemble classifier
ducted with GASM are detailed in Section 5. Finally, is devised, computing the output as a weighted average of
Section 6 encapsulates our findings and delves into po- outputs generated by individual local classifiers. Notably,
tential avenues for future developments. the importance of local models is determined by their
higher similarity to the target regions compared to others
in the ensemble with respect to the city indicators. Al-
2. Related Works ternate methodologies involve adapting a model initially
trained in a data-rich region and transplanting it to a tar-
In the following, we provide a concise overview of the
get region characterized by limited data availability. This
literature concerning map matching methods and eluci-
adaptation process entails integrating additional data to
date the geographic transferability problem, introducing
compute sub-region similarities, subsequently enabling
the remapping of the model [19, 20]. matrix 𝑇 ∈ ℝ𝑛×ℎ , obtained by taking the best fitting of
each trajectory 𝑋 ∈ 𝒳, and each subtrajectory 𝑆 ∈ 𝒮.
3. Problem Setting Movelet computes the best fitting as the minimal Eu-
clidean Distance (ED) between 𝑆 in each subsequence of
In this section, we articulate the fundamental concepts length 𝑙 = |𝑆| of 𝑋, formally,
essential for comprehending our proposal. Initially, we 𝑚−𝑙
establish a shared language and framework by introduc- bestfitMovelets (𝑋 , 𝑆) = min{𝐸𝐷(𝑋𝑗∶𝑗+𝑙 , 𝑆)} (1)
𝑗=0
ing notation that serves as a basis for discussing key
elements. Subsequently, we delve into the transforma- On the other hand, Geolet computes the best fitting in
tion facilitated by Geolet, a catalyst for the motivation the same way, but geographically shifting 𝑆 to overlap
behind this work. Ultimately, we present a formal intro- each subsequence of 𝑋 of length 𝑙. In particular, Geolet
duction to the problem at hand. extends the best fitting function of Movelet by adding a
pre-processing function, shift, that subtracts the value of
Definition 1 (Trajectory). A trajectory 𝑋 is a sequence the first vector of the subsequence from all the others,
of spatio-temporal points 𝑋 = {⃗x𝑡0 , … , x⃗ 𝑡𝑚 } ∈ ℝ𝑚×3 where
𝑚−𝑙
the spatial vectors x⃗ 𝑡𝑗 = (lat𝑡𝑗 , long𝑡𝑗 ) are sorted by in- bestfitGeolet (𝑋 , 𝑆) = min{𝐸𝐷(shift(𝑋𝑗∶𝑗+𝑙 ), shift(𝑆))} (2)
creasing time 𝑡𝑗 , i.e., ∀1 ≤ 𝑗 < 𝑚 we have 𝑡𝑗 < 𝑡𝑗+1 . 𝑗=0

where shift(𝑋 ) = {𝑥⃗0 − 𝑥⃗0 , … , 𝑥⃗𝑚 − 𝑥⃗0 }. The shift function
In a sense, trajectories can be viewed as multivariate
makes Geolet suitable for geographic transferability not
time series containing two signals, i.e., the latitude and
being tide to the territory.
longitude, recorded at non-constant sampling rates [5,
21, 6]. In order to simplify notation, we will use 𝑗 instead Finally, in order to present map matching we define a
of 𝑡𝑗 every time. A trajectory classification dataset is a set road network as follows:
of trajectories with a vector of labels attached. Formally: Definition 5 (Road Network). A road network 𝐺 =
Definition 2 (Trajectory Dataset). A trajectory dataset ⟨𝑉 , 𝐸⟩ is a directed graph where 𝑉 = {𝑣1 , … , 𝑣𝑝 } is the set
𝒳 ∈ ℝ𝑛×𝑚×3 is a set of 𝑛 trajectories, 𝒳 = {𝑋0 … , 𝑋𝑛 }. of 𝑝 road junction (or nodes), and 𝐸 = {𝑒1 , … , 𝑒𝑞 } is the
set of 𝑞 road segments (or edges), where 𝑒𝑖 = (𝑣𝑖1 , 𝑣𝑖2 ).
For simplicity, we use a single symbol 𝑚 to denote the
lengths of the trajectories, even if a dataset can contain We underline that we rely on an enhanced road network
trajectories with a different number of observations. Sim- representation where, for each edge, we also have ac-
ilarly, we emphasize that there is no constraint on the cess to the road segment geometries expressed as a se-
sampling rate, i.e., we can have a non-constant sample quence of 𝑘 latitude and longitude, formally shape(𝐸) =
in the same trajectory. Furthermore, we define a subtra- {𝑥⃗0 , … , 𝑥⃗𝑘 } for some 𝐸 ∈ ℰ. As for trajectories, for sim-
jectory as: plicity of notation, we use a single symbol 𝑘 to denote
the lengths of the points describing the geometry of the
Definition 3 (Subtrajectory). Given a trajectory 𝑋 of road segment, even if, in real-case scenarios, the shape
length 𝑚, a subtrajectory 𝑆 = {⃗𝑠𝑗 , … , 𝑠⃗𝑗+𝑙 } ⊂ 𝑋, of length can be described using an arbitrary number of points.
𝑙 ≤ 𝑚, is an ordered sequence of consecutive values such We are now able to formalize the shape-based map
that 0 ≤ 𝑗 ≤ 𝑚 − 𝑙. matching problem as follows:
As previously mentioned, Movelet [6] and Geolet [7] Definition 6 (Shape-based Map Matching). Given a
are shapelet-inspired [10] trajectory approaches that road network 𝒢 = ⟨𝑉 , 𝐸⟩ and a trajectory 𝑋, the shape-
identify discriminative subtrajectories for classification based map matching problem consists in finding the best
purposes. They both select the most discriminative sub- sequence of edges 𝑌 = {𝑒1 , … , 𝑒𝑧 } ⊆ 𝐸 such that does
trajectories w.r.t. the target label using the mutual infor- not exist another sequence of edges 𝑌 ′ ⊂ 𝐸 different
mation [22]. Like shapelet-based approaches, Movelet from 𝑌, i.e., 𝑌 ′ ≠ 𝑌, where bestfitGeolet (𝑋 , shape(𝑌 ′ )) <
and Geolet extract discriminative subtrajectories that bestfitGeolet (𝑋 , shape(𝑌 )).
can be used to train any machine learning model [23].
In other words, the shape-based map matching problem
Indeed, once the most discriminative subtrajectories are
involves determining the optimal alignment for a (sub)tra-
identified, a trajectory dataset can be transformed into
jectory 𝑋 within a designated road network 𝐺, relying on
a tabular representation capturing the distance between
the configuration of the edges comprising the road seg-
trajectories and discriminative subtrajectories through
ments. It is essential to emphasize that this map matching
the subtrajectory transform function:
endeavor necessitates resolution without any reliance on
Definition 4 (Subtrajectory Transform). Given a dataset GPS coordinates as the usage of the shift operator nor-
𝒳 and a set 𝒮 containing ℎ subtrajectories, the subtrajec- malizes the trajectory 𝑋 rendering state-of-the-art map
tory transform converts 𝒳 ∈ ℝ𝑛×𝑚×3 into a real-valued matching methods unsuitable for this particular task.
Algorithm 1: build(𝐸, 𝑑, Σ, ℎ)
Input : 𝐸 - road segments, 𝑑 - resampling
distance, 𝛼 - max nbr. of symbols,
ℎ - h-hop aggregation
Output : 𝑎ℎ𝑜 - Aho-Corasick automaton
1 𝐸ℎ ← aggregate(𝐸, ℎ); // aggregate trajectories
Figure 1: Aho-Corasick automaton using symbolic sequences 2 𝒲 ← ∅;
{𝐴𝐶, 𝐵, 𝐵𝐶𝐴, 𝐶}. Blue dashed arches are suffix arches, while 3 for 𝑒 ∈ 𝐸ℎ do // for each road segment
green dotted arches are the dictionary suffix arches. 4 𝑋 ← resample(𝑠ℎ𝑎𝑝𝑒(𝑒), 𝑑); // resample traj.
5 𝑋⃗ ← direction(𝑋 ); // get traj. direction
6 𝑊 ← SAX(𝑋⃗ , 𝛼); // discretize traj.
4. Shape-based Map Matching 7 𝒲 ← 𝒲 ∪ {𝑊 }; // add to dict.
8 return Aho-Corasick(𝒲 );
To tackle the shape-based map-matching problem, our
aim is to design a map-matching method with the capabil-
ity to accurately deduce the original GPS coordinates of Algorithm 2: 𝑠𝑒𝑎𝑟𝑐ℎ(𝑋 , 𝐴, 𝑑, Σ)
a trajectory within a designated road network. Crucially,
this precision is sought exclusively through an examina- Input : 𝑋 - query traj., 𝐴 - Aho-Corasick
tion of the trajectory’s shape and the configurations of automaton, 𝐸 - road segments, 𝑑 -
the edges within the road network, entirely independent resampling dist., 𝛼 - max nbr. of symbols
of any reliance on GPS coordinates. Output : 𝑌 ∗ , 𝒴 - best methc and best candidates
A brute-force approach to address the problem in- 1 𝑋 ′ ← resample(𝑋 , 𝑑); // resample traj.
volves map matching all conceivable alignments of 𝑋 2 𝑋⃗ ′ ← direction(𝑄); // get traj. direction
within every segment 𝐸 of the road network 𝐺, employ- 3 𝑄 ← SAX(𝑋 ⃗ ′ , 𝛼); // discretize using SAX
ing the bestfitGeolet function. However, this naive strategy 4 𝒴 ← search(𝐴, 𝑄, 𝐸); // get best candidates
is only viable for small road networks due to the algo- ∗ ′
5 𝑌 ← arg min𝑌 ∈𝒴 bestfitGeolet (𝑌 , 𝑋 );
rithmic complexity being 𝑂(|𝐸|(𝑚 − 𝑘)𝑘), where 𝑚 and
6 return 𝑌; // return the best match
𝑘 represent the number of points characterizing the tra-
jectory 𝑋 and the number of points describing each road
segments in 𝐸, respectively1 . This limitation also extends
to other map-matching algorithms that rely on latitude finite sequence of symbols over an alphabet Σ. Subse-
and longitude coordinates to confine the matching scope quently, it builds a finite-state automaton based on the
to the nearest roads. sequences in 𝒲 within a given finite symbol sequence de-
We overcome this limitation by proposing GASM a fined over the alphabet Σ. Consequently, the automaton,
Geographic Automaton Shape-based map Matching ap- constructed using the dictionary 𝒲, identifies a subset
proach that is able to significantly reduce the number 𝒲 ′ ⊂ 𝒲 wherein each sequence in 𝒲 ′ is contained in
of road segment alignments to test with the brute force 𝑄. Figure 1 provides an illustration of the automaton cre-
method. In essence, GASM comprises two key steps. Ini- ated by the Aho-Corasick algorithm, using the sequences
tially, leveraging the Aho-Corasick algorithm [24], GASM 𝒲 = {𝐴𝐶, 𝐵, 𝐵𝐶𝐴, 𝐶} over the alphabet Σ = {𝐴, 𝐵, 𝐶, 𝐷}.
constructs a shape-based index for all road segments in The Aho-Corasick algorithm initiates by constructing
𝐸, portraying it as a geographic finite state automaton. a suffix trie [25], depicted in black in Figure 1. Subse-
Subsequently, GASM facilitates querying the automaton quently, it designates all leaves of the trie as final states
to pinpoint a set of candidate partial matches between of the automaton and introduces edges to complete the
𝑋 and {shape(𝑌 )| ∀ 𝑌 ⊆ 𝐸}. Further elucidation of these automaton. Two types of edges are incorporated, con-
two steps is provided in the subsequent sections. necting their respective suffixes: suffix edges, depicted
Geographic Automaton Construction. GASM in blue, are utilized in the case of a mismatch, without
leverages the Aho-Corasick algorithm to construct a guaranteeing that the suffix is also a sequence in the dic-
geographic automaton, serving as a spatial index for tionary. In contrast, dictionary-suffix edges, portrayed in
expedited query processing [24]. The Aho-Corasick green, guarantee that the suffix is a sequence present in
algorithm, renowned for string searching, takes a set the dictionary. These operations unfold linearly concern-
𝒲 = {𝑊1 , … , 𝑊𝑛 } as input, where each 𝑊𝑖 represents a ing the total number of symbols in the input dictionary 𝒲.
The automaton enables the search for all sequences con-
1 tained in a query by traversing the automaton, achievable
For each 𝑌 ⊆ 𝐸, we compute the Euclidean Distance (linear com-
plexity), for all the possible 𝑚 − 𝑘 alignments of shape(𝑌 ) in 𝑋. in linear time relative to the query’s length.
Algorithm 1 delineates the procedural steps requisite
of GASM for constructing the Aho-Corasick automaton.
The algorithm accepts, as input, the road segments 𝐸 of
the road network 𝐺, the resampling distance 𝑑, the maxi-
mum allowed number of symbols 𝛼, and the number of
hops ℎ, producing a geographical automaton 𝐴 as out-
put. The GASM-build algorithm begins by aggregating
the road network, concatenating ℎ times a road segment
to linked road segments in 𝐸ℎ to extend the length of
existing segments and enhance their representativeness
(line 1). Subsequently, it initializes an empty dictionary
𝒲 (line 2). The following steps are applied for each road
segment in 𝐸ℎ (denoted as 𝑒). Given that the shape of a
road segment 𝑒 may be described by varying numbers of
points based on its length and sinuosity, GASM initially
resamples the geometries into a series of evenly spaced
Figure 2: Summary of GASM, depicting geographic automa-
points 𝑋. This ensures that the symbolic representation’s
ton construction (left) and shape-based matching (right).
length, crucial for Aho-Corasick automaton construction,
is proportional solely to the road length. To fulfill the
prerequisite of representing each road segment 𝑒 in a dis- Hyperparameter GASM Values
cretized space, a sequence of symbols is generated (line 𝑑 Resampling distance (m) [5, 10, 20, 50]
5). Subsequently, GASM determines the heading direc- |Σ| Alphabet size [4, 8, 16]
ℎ h-hop aggregation [0, 1, 2, 3]
tion 𝑋⃗ between consecutive points along the resampled
road segment, transforming the shape of each road se- Table 1
quence 𝑒 into a univariate time series of directions 𝑋⃗ with Tested hyperparameters with their values.
a consistent length-based sampling rate 𝑑 (line 6). This
facilitates the utilization of Symbolic Aggregate approXi- Figure 2 visually summarise GASM. On the left side,
mation (SAX) [26] to obtain a symbolic representation the geographic automaton construction phase is depicted,
of each road segment over an alphabet Σ (line 7). These wherein each road within an arbitrary large road network
representations are added to the dictionary 𝒲. Finally, is indexed according to its heading direction. On the
the dictionary of discretized representations of the road right side, the shape-based matching phase is illustrated.
segments is employed to construct the Aho-Corasick au- Here, given a trajectory with known shape but unknown
tomaton, which is then returned as the output (line 8). origin point, GASM computes the set of potential partial
Shape-based Matching. Once the construction of the map matches. Subsequently, it selects the match that
geographic automaton is complete, GASM can execute minimizes the bestfitGeolet function.
shape-based map matching over the automaton following
the steps outlined in Algorithm 2. GASM-search takes as
input the query trajectory 𝑋, the geographical automaton 5. Experiments
𝐴, the road segments 𝐸, the resampling distance 𝑑, and
the maximum allowed number of symbols 𝛼. It yields the In this section, we evaluate the effectiveness of GASM2 .
sequence of edges 𝑌 ⊆ 𝐸 that minimizes the bestfitGeolet First, we present the experimental setting, then we report
function, as per the ensuing procedure. The initial three and discuss the best performance achieved. Finally, we
steps of Algorithm 2, aligning with Algorithm 1, involve illustrate details of the hyperparameter tuning and the
resampling the query trajectory 𝑋, extracting its direc- result of a sensitivity analysis w.r.t. some data properties.
tion, and transforming it into a symbolic representation Experimental Setting. Regrettably, only a handful
𝑄. Indeed, the same preprocessing applied to the road seg- of mobility datasets, such as GeoLife and Porto Taxi3 , are
ments 𝐸 is applied to the query trajectories. Subsequently, available as open access [1, 7]. However, these datasets
the geographic automaton 𝐴 is utilized to perform a lin- possess limited geographic coverage, rendering them
ear search for the best matches 𝒴 among all possibilities unsuitable for our study. Thus, we introduce a novel
offered by 𝐸 (line 4). This implementation enables GASM high-sample rate dataset derived from the publicly acces-
to identify an “initial best match”, presenting a set of
best match candidates 𝒴 = {𝑌1 , … , 𝑌𝑛 }. From this set, the 2
Python code: https://t.ly/wVlXS. We ran our experiments on a
final selection of the optimal alignment 𝑌 ∗ is determined 2xIntel Xeon Gold 6342 24-core CPU, limiting each test to use at
through a naive approach (line 5). most 12 cores.
3
GeoLife: https://t.ly/6VJ-E. Porto: https://t.ly/0GMR9.
Length (km) Length (#points) Kind of Road (%)
Province #Trj Totoal Average (𝜎) Average (𝜎) Motorway Trunk Primary Secondary Minor
Arezzo 17 341 17.2 (18.38) 853 (851) 0.019 0.227 0.205 0.344 0.196
Firenze 58 1041 30.4 (37.31) 1526 (1427) 0.122 0.032 0.392 0.205 0.249
Grosseto 35 215 6.7 (9.35) 578 (545) 0.020 0.133 0.054 0.327 0.466
Livorno 53 410 18.5 (25.42) 916 (645) 0.410 0.043 0.089 0.144 0.313
Lucca 39 804 17.5 (15.49) 1128 (1133) 0.225 0.000 0.056 0.442 0.278
M. Carrara 46 267 5.1 (3.54) 625 (430) 0.187 0.173 0.160 0.267 0.212
Pisa 35 831 26.0 (23.04) 1347 (1178) 0.000 0.002 0.219 0.200 0.578
Pistoia 20 468 53.1 (31.07) 929 (777) 0.000 0.186 0.251 0.163 0.399
Prato 31 146 4.5 (2.37) 660 (672) 0.141 0.050 0.395 0.146 0.269
Siena 24 497 22.4 (16.54) 983 (660) 0.557 0.032 0.340 0.026 0.046
Table 2
Dataset description. Besides average values are reported standard deviations (𝜎).

Method Performances Road Network Characteristics
Selectivity Building #Road Avg node
Province Factor ↓ Accuracy ↑ Time (s) ↓ Segments #Intersections Degree Length (km)
Arezzo 0.096 0.875 708.32 133028 54485 4.883 26852
Siena 0.068 1.000 516.37 175088 72079 4.858 28949
Pistoia 0.080 0.950 433.24 81870 34492 4.747 12363
Lucca 0.070 0.846 658.03 141149 59274 4.763 20328
Firenze 0.399 0.263 157.17 299312 119068 5.028 39025
Grosseto 0.060 0.823 421.66 121045 50014 4.841 26818
Livorno 0.069 1.000 277.38 96631 39613 4.879 11269
M. Carrara 0.056 0.978 294.50 71917 300071 4.783 10809
Pisa 0,081 0.857 1007.06 150954 62580 4.824 22396
Prato 0.105 0.936 190.77 45060 18794 4.795 5224
Macro Avg (𝜎) 0.108 (0.103) 0.853 (0.217)
Table 3
Selectivity factor, accuracy, automaton construction runtime, and other road network informations.

sible 2013 GPS traces on OpenStreetMap4 . Although the the metric of accuracy.On the other hand, the evaluation
initial OpenStreetMap dataset encompasses GPS trajecto- of the second question relies on the metric of selectivity,
ries spanning the entire globe, our analysis concentrates commonly employed in database literature [27]. Selec-
on the ten provinces in Tuscany, a region encompass- tivity measures the reduction of potential alignments
ing 22, 985𝑘𝑚2 in central Italy. The Mappymatch python between a query result and the entire dataset. In our con-
package5 was employed to map-match each trajectory, text, selectivity is defined as the ratio of matched road
retaining only those trajectories with an average error of segments (|𝒴 |) to the total number of road segments (|𝐸|).
less than 10𝑚. The final dataset encompasses 358 distinct For accuracy, higher values indicate better results, while
trajectories, covering a total travel distance of 5, 024𝑘𝑚 for selectivity, lower values indicate better outcomes.
and described by 300, 049 GPS points. Additional infor- GASM Performance. Table 5 presents the perfor-
mation on the types of roads traversed in each province mance metrics of GASM across individual provinces. To
in Tuscany, as per the OpenStreetMap taxonomy6 , is determine the optimal hyperparameters, a grid search
presented in Table 6. was conducted over the values outlined in Table 6, specif-
Within the framework of our shape-based map- ically for the province of Grosseto. This process yielded
matching formulation, we aim to address the following the following hyperparameters: a resampling distance
questions. First, to what extent can GASM infer the of 𝑑 = 10 meters, an alphabet size of 𝛼 = 8 symbols, and
original GPS coordinates without utilizing them for map a street aggregation of ℎ = 2 hops. GASM showcases
matching? Second, how effectively can GASM reduce the an impressive ability to deduce the original GPS coor-
number of potential alignments compared to the entire dinates, achieving an average light accuracy of 90.1%.
road network? The first question is evaluated through Furthermore, it significantly narrows down the potential
4 alignments, as indicated by the selectivity factor, reduc-
OpenStreetMap 2013 public GPS traces: https://t.ly/q7u2N
5
Mappymatch: https://t.ly/RHafS ing it to just 10.8% of the original road network. These
6
Highway taxonomy: https://t.ly/NpxZv commendable results are attained while maintaining a
Figure 3: Hyperparameters influence: ℎ-hop aggregation (left), 𝛼 alphabet size (center), and 𝑑 resampling distance (right).

Figure 4: Influence of trajectory length (left), trajectory straightness (center), and kind of road (right) on performance metrics.

reasonable automaton construction time of 7.8 minutes criminative nature of trajectories based on the type of
per province, resulting in a total indexing time of a mere road. Our hypothesis posits that straight streets, such
1.29 hours for the entire region. as motorways, exhibit lower discriminative characteris-
Hyperparameters Tuning. In this section, we tics. Consequently, trajectories observed on such roads
present the results of experiments conducted on the are more likely to avoid re-identification, suggesting en-
province of Grosseto while varying the hyperparame- hanced geographic transferability. To investigate this, we
ters detailed in Table 6. Initially, we compute the Pearson identify the type of road traveled within each segment.
correlation between the method’s hyperparameters and In cases where multiple types of roads are encountered,
two key performance metrics: the selectivity factor and we perform a majority vote weighted by road length. Ad-
accuracy. Figure 3 visually depicts the changes in perfor- ditionally, to examine the influence of changes in the
mance metrics, emphasizing variations in the top three input data, we create random subtrajectories of varying
most influential hyperparameters—those exhibiting the lengths, including 100m, 150m, 500m, 1km, 1.5km, 5km,
highest absolute values of Pearson correlation. Notably, and 10km, derived from our OpenStreetMap dataset. In
the most influential hyperparameter is the number of order to assess our hypothesis, we evaluate the straight-
road segment aggregations (ℎ), demonstrating a corre- ness [4] of each subtrajectory by calculating the ratio
lation of −0.52 with the selectivity. Thus, increasing ℎ between the shortest path from the origin to the destina-
proves beneficial for GASM as it helps select fewer can- tion and the actual trajectory.
didate road segments without significantly impacting Figure 4 encapsulates these results. The initial plot on
accuracy. The alphabet size (𝛼) displays correlations of the left highlights a notable trend: an increase in subtra-
−0.44 and 0.40 with respect to the selectivity and accu- jectory length correlates with a rapid elevation in both
racy, respectively. This hyperparameter introduces a accuracy and selectivity. In simpler terms, as the subtra-
trade-off, as increasing the number of symbols reduces jectory length extends, the model’s precision improves.
selectivity but may lead to a slight decrease in accuracy. The central plot reveals that trajectory straightness has
Finally, the resampling distance (𝑑) exhibits a correlation a negligible impact on the number of candidate matches.
of 0.22 with accuracy. Interestingly, decreasing 𝑑 slightly However, as trajectories become more linear, the accu-
enhances accuracy according to our observations. racy experiences a decline. Finally, the rightmost plot
illustrates the method’s performance across various road
Sensitivity Analysis. We delve here into the varia-
types. This plot validates the findings of the straightness
tions in performance with respect to the length of the
plot: roads with greater straightness, like motorways,
query trajectory 𝑋. Additionally, we explore the dis-
pose the greatest challenge for re-identification. Con-
versely, more sinuous roads present a slightly higher [7] C. Landi, et al., Geolet: An interpretable model for
difficulty in re-identification, reflected in a higher selec- trajectory classification, in: IDA, volume 13876 of
tivity but with a concomitant boost in accuracy. LNCS, Springer, 2023, pp. 236–248.
[8] F. Veronesi, et al., Assessing accuracy and geograph-
6. Conclusion ical transferability of machine learning algorithms
for wind speed modelling, in: AGILE, LNGC, 2017.
In this paper we have introduced GASM, a map matching [9] M. Nanni, et al., City indicators for geographical
method capable of determining a trajectory’s position transfer learning: an application to crash prediction,
solely based on its shape. Our experiments showcase GeoInformatica 26 (2022) 581–612.
that GASM significantly reduces the number of poten- [10] J. Lines, et al., A shapelet transform for time series
tial alignments and deduces the original GPS coordinates classification, in: ACM SIGKDD, 2012, pp. 289–297.
with remarkable accuracy. Further analysis reveals that [11] C. White, et al., Some map matching algorithms for
longer and less linear trajectories are more straightfor- personal navigation assistants, TRC 8 (2000).
ward to map match. However, this observation raises [12] D. Bernstein, et al., An introduction to map match-
concerns about the potential for shape-based methods ing for personal navigation assistants (1996).
to inadvertently learn geographic positions instead of [13] M. A. Quddus, et al., A general map matching al-
focusing on other intrinsic features. As a part of future gorithm for transport telematics applications, GPS
work, as outlined at the beginning, we aim to assess the solutions 7 (2003) 157–167.
geographic transferability of shape-based methods, such [14] Y. Liu, Z. Li, A novel algorithm of low sampling
as Geolet, by incorporating GASM. Specifically, we pro- rate GPS trajectories on map-matching, EURASIP
pose giving more weight to the selection of discriminative 2017 (2017) 30.
subsequences with higher selectivity rather than basing [15] X. Liu, et al., A st-crf map-matching method for
the decision solely on a statistical test. low-frequency floating car data, TITS 18 (2016)
1241–1254.
Acknowledgments [16] L. Jiang, et al., From driving trajectories to driving
paths: a survey on map-matching algorithms, CCF
This work is partially supported by the EU NextGener- TPCI 4 (2022) 252–267.
ationEU programme under the funding schemes PNRR- [17] P. Cintia, M. Nanni, An effective time-aware map
“SoBigData.it - Strengthening the Italian RI for Social matching process for low sampling gps data, arXiv
Mining and Big Data Analytics” - Prot. IR0000013, H2020- preprint arXiv:1603.07376 (2016).
INFRAIA-2019-1: Res. Infr. G.A. 871042 SoBigData++, [18] G. Hu, et al., If-matching: Towards accurate map-
and GreenDatAI G.A. 101070416. matching with information fusion, TKDE 29 (2016)
114–127.
[19] L. Wang, et al., Smart city development with urban
References transfer learning, Computer 51 (2018) 32–41.
[1] C. L. da Silva, et al., A survey and comparison of [20] L. Wang, et al., Cross-city transfer learning for
trajectory classification methods, in: BRACIS, IEEE, deep spatio-temporal prediction, in: IJCAI, ijcai.org,
2019, pp. 788–793. 2019, pp. 1893–1899.
[2] A. Bolbol, et al., Inferring hybrid transportation [21] R. Trasarti, et al., Myway: Location prediction via
modes from sparse GPS data using a moving win- mobility profiling, Inf. Syst. 64 (2017) 350–367.
dow SVM classification, CEUS 36 (2012) 526–537. [22] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to
[3] S. Dodge, et al., Revealing the physics of movement: data mining, Pearson Education India, 2016.
Comparing the similarity of movement characteris- [23] A. J. Bagnall, et al., The great time series classifica-
tics, CEUS 33 (2009) 419–434. tion bake off, CoRR abs/1602.01711 (2016).
[4] P. J. Almeida, et al., Indices of movement behaviour: [24] A. V. Aho, et al., Efficient string matching: An aid
conceptual background, effects of scale and location to bibliographic search, ACM 18 (1975) 333–340.
errors, Zoologia (Curitiba) 27 (2010) 674–680. [25] E. M. McCreight, A space-economical suffix tree
[5] I. Kontopoulos, et al., Traclets: Harnessing the construction algorithm, JACM 23 (1976) 262–272.
power of computer vision for trajectory classifica- [26] J. Lin, et al., A symbolic representation of time
tion, arXiv:2205.13880 (2022). series, with implications for streaming algorithms,
[6] C. A. Ferrero, et al., MOVELETS: exploring relevant in: DMKD, ACM, 2003, pp. 2–11.
subtrajectories for robust trajectory classification, [27] S. Acharya, et al., Selectivity estimation in spatial
in: SAC, ACM, 2018, pp. 849–856. databases, in: SIGMOD, ACM, 1999, pp. 13–24.