=Paper= {{Paper |id=Vol-3651/BMDA_paper12 |storemode=property |title=A Shape-Based Map Matching Approach for Geographic Transferability of Discriminative Subtrajectories |pdfUrl=https://ceur-ws.org/Vol-3651/BMDA-12.pdf |volume=Vol-3651 |authors=Cristiano Landi,Riccardo Guidotti |dblpUrl=https://dblp.org/rec/conf/edbt/LandiG24 }} ==A Shape-Based Map Matching Approach for Geographic Transferability of Discriminative Subtrajectories== https://ceur-ws.org/Vol-3651/BMDA-12.pdf
                                A Shape-Based Map Matching Approach for Geographic
                                Transferability of Discriminative Subtrajectories
                                Cristiano Landi1,2,∗ , Riccardo Guidotti1,2
                                1
                                    University of Pisa, Pisa, Italy.
                                2
                                    ISTI-CNR, Pisa, Italy.


                                                   Abstract
                                                   This paper addresses the challenge of map matching and geographic transferability in trajectory analysis. Existing methods
                                                   often face limitations tied to specific coordinates or road networks. In response, we propose GASM, a shape-based map
                                                   matching method that relies solely on trajectory shapes, irrespective of geographic origin. GASM introduces a symbolic road
                                                   network representation, facilitating efficient searches based solely on trajectory shapes. Our experimentation, spanning over
                                                   5,000 km of roads, demonstrates GASM’s ability to accurately position trajectories with an impressive accuracy exceeding 90%.
                                                   Notably, GASM stands as the first in the literature to perform shape-based symbolic map matching without prior knowledge
                                                   of the geographic region.

                                                   Keywords
                                                   Map Matching, Geographic Transferability, Machine Learning, Discriminative Subtrajectories



                                1. Introduction                                                                                             mobility patterns from one geographical region and ef-
                                                                                                                                            fectively applying them in another region [8, 9].
                                In recent years, the widespread adoption of cutting-edge                                                       Particularly noteworthy are recent advancements in
                                technologies equipped with Global Positioning System                                                        machine learning leveraging shapelet-based subtrajec-
                                (GPS) devices has enabled the recording of positions for                                                    tories [5, 6, 7]. Originating from the domain of time
                                various moving objects, ranging from cars and transporta-                                                   series analytics, shapelets represent discriminatory sub-
                                tion vehicles to phones and wearables. Unfortunately,                                                       sequences that encapsulate a collection of distinctive
                                the coordinates captured by these sensors often fail to ac-                                                 shapes, crucial for discerning specific classes [10]. Vari-
                                curately reflect real positions due to physical constraints                                                 ous approaches exist for defining discriminative subtra-
                                and/or legal regulations. Nevertheless, in various applica-                                                 jectories. In [6], the Movelet method is introduced—an
                                tions, it is imperative to accurately align GPS trajectories                                                approach for extracting discriminative subtrajectories
                                with a road network. For instance, in navigation ser-                                                       selected through a rigorous statistical test. During the
                                vices, map matching empowers drivers to monitor their                                                       discovery phase, Movelet generates candidate subtrajec-
                                exact locations and receive optimal routes to specified                                                     tories by extracting all possible subsequences with more
                                destinations. Conversely, in machine learning tasks, map                                                    than two contiguous observations, utilizing a sliding win-
                                matching enhances users’ mobility information by in-                                                        dow. Building upon the foundation laid by Movelet,
                                corporating knowledge related to the territory, such as                                                     Geolet is introduced in [7]. This extension incorporates
                                Points Of Interest (POI), feature engineering [1, 2, 3, 4], or                                              a normalization step after the discovery phase. The nor-
                                the identification of discriminatory subsequences, such                                                     malization step is designed to ease the comparison of
                                as mobility shapelets [5, 6, 7]. Without an appropriate                                                     discriminative subsequences with trajectories recorded
                                map-matching procedure, reliance on an expert becomes                                                       in diverse geographical regions. The underlying ratio-
                                necessary to determine which features can be extracted                                                      nale is that a subtrajectory pinpointing a sudden break
                                from trajectories concerning the territory. However, the                                                    in a road segment in one city should exhibit similarities
                                reliance on ad-hoc features restricts the applicability of                                                  to a subtrajectory associated with the same event in an-
                                machine learning methods and amplifies sensitivity to                                                       other city. This normalization enhances the method’s
                                input changes [1], rendering it unsuitable for geographic                                                   adaptability across various geographic contexts.
                                transferability. This implies the challenge of extracting                                                      While Geolet successfully addresses the limitation
                                                                                                                                            of Movelet by providing normalized subtrajectories,
                                Published in the Proceedings of the Workshops of the EDBT/ICDT 2024                                         thereby enhancing geographic transferability indepen-
                                Joint Conference (March 25-28, 2024), Paestum, Italy                                                        dent of specific GPS coordinates, it introduces a potential
                                ∗
                                     Corresponding author.                                                                                  vulnerability tied to the road network.
                                Envelope-Open cristiano.landi@phd.unipi.it (C. Landi);
                                riccardo.guidotti@unipi.it (R. Guidotti)
                                                                                                                                               Our underlying hypothesis is that the less frequently
                                Orcid 0000-0003-4907-9728 (C. Landi); 0000-0002-2827-7613                                                   a trajectory occurs, the greater the likelihood that shape-
                                (R. Guidotti)                                                                                               based methods utilizing it as a discriminative subse-
                                             Copyright © 2024 for this paper by its authors. Use permitted under Creative Commons License
                                             Attribution 4.0 International (CC BY 4.0).




CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
quence may not capture the intrinsic features of the     key strategies employed to tackle this challenge.
trajectory but rather only its geographic position. In      In the literature of trajectory analysis, a multitude of
essence, if a discriminative subtrajectory is intrinsically
                                                         strategies exists for mapping trajectories onto a road net-
linked to a particular road network due to its distinc-  work. For high-frequency sampled trajectories, the sim-
tive shape, it becomes unsuitable for geographic trans-  plest approach involves associating each spatio-temporal
ferability. This limitation arises from the fact that thepoint with the nearest street segment [11, 12]. However,
discriminatory aspect is not rooted in the movements     these techniques, while fast and straightforward, have ex-
themselves but rather in the structural characteristics  hibited inaccuracies, particularly at intersections and par-
of the road network. Consequently, evaluating the geo-   allel roads. To address these limitations, enhancements
graphic transferability of discriminative subtrajectorieshave been introduced, incorporating heading direction
necessitates a shape-based map-matching approach that    or employing a Kalman filter to eliminate outlier points
exclusively relies on shapes without prior knowledge of  in trajectories [13]. Alternatively, some approaches lever-
the position. Regrettably, to the best of our knowledge, age probabilistic-based map-matching algorithms, in-
such an approach is currently unavailable. This under-   tegrating hidden Markov models to identify the most
scores the need for innovative solutions in the realm of likely sequence of road segments aligning with the tra-
shape-based map matching to comprehensively assess       jectory [14, 15]. On the other hand, in the context of
the adaptability of discriminative subtrajectories acrosslow-sampled trajectories, much of the existing literature
diverse geographical contexts.                           presupposes that the most probable route connecting two
   To overcome this limitation, our paper introduces     successive points is also the shortest or fastest [16]. How-
GASM, an Geographic Automaton Shape-based map            ever, in [17], is introduced a map-matching algorithm
Matching approach. GASM relies solely on the shape of a  that exploits temporal intervals between GPS points. This
trajectory to accurately determine its position within the
                                                         method identifies the optimal match between two GPS
road network. Specifically, GASM employs a symbolic      points by selecting the route with the most similar travel
representation to transform the road network, construct- time. Also, in [18] is proposed a method for map match-
ing a spatial index independent of coordinates. This     ing low-sampled trajectories based on supplementary
unique approach allows for efficient trajectory searches information such as speed and moving direction, typi-
based solely on their shapes. To the best of our knowl-  cally collected alongside spatial locations.
edge, GASM is the first proposal in the literature that     Geographic transferability encapsulates the challenge
exclusively utilizes a discretized representation of a trajec-
                                                         of extending knowledge gleaned from one geographic re-
tory’s shape, devoid of any knowledge of the geographic  gion to another. This entails constructing machine learn-
region, for shape-based map matching. Our experimenta-   ing models capable of adeptly executing tasks in regions
tion with GASM on a novel comprehensive geographical     distinct from their original training grounds. The core
dataset spanning over 5,000 km of roads in Tuscany, cen- of this challenge emerges from pronounced disparities
tral Italy, demonstrates its capability to identify correct
                                                         in data distributions, patterns, and characteristics across
alignments with an impressive accuracy exceeding 90%.    diverse geographical locations. As articulated in [8, 9],
Furthermore, GASM exhibits efficiency, as it can con-    models trained on data from one region may encounter
struct the necessary representation for the entire dataset
                                                         difficulties in generalization when confronted with the
in less than 1.5 hours, maintaining a linear complexity at
                                                         unique variations inherent in another region. In pursuit
query time.                                              of a global model, a fundamental strategy is employed
   The paper is organized as follows. Section 2 sum-     as demonstrated in [8], where diverse data sources are
marises the related works concerning map-matching        aggregated on a global scale to build a transferable model.
methods and the challenges posed by geographic trans-    Conversely, in [9], city indicators are identified pertain-
ferability. In Section 3, we encapsulate the technical   ing to road networks, traffic flows, and individual mobil-
concepts essential for comprehending the algorithm de-   ity, to facilitate the assessment of similarities between ge-
lineated in Section 4. The outcomes of experiments con-  ographical regions. Subsequently, an ensemble classifier
ducted with GASM are detailed in Section 5. Finally,     is devised, computing the output as a weighted average of
Section 6 encapsulates our findings and delves into po-  outputs generated by individual local classifiers. Notably,
tential avenues for future developments.                 the importance of local models is determined by their
                                                         higher similarity to the target regions compared to others
                                                         in the ensemble with respect to the city indicators. Al-
2. Related Works                                         ternate methodologies involve adapting a model initially
                                                         trained in a data-rich region and transplanting it to a tar-
In the following, we provide a concise overview of the
                                                         get region characterized by limited data availability. This
literature concerning map matching methods and eluci-
                                                         adaptation process entails integrating additional data to
date the geographic transferability problem, introducing
                                                         compute sub-region similarities, subsequently enabling
the remapping of the model [19, 20].                              matrix 𝑇 ∈ ℝ𝑛×ℎ , obtained by taking the best fitting of
                                                                  each trajectory 𝑋 ∈ 𝒳, and each subtrajectory 𝑆 ∈ 𝒮.
3. Problem Setting                                                   Movelet computes the best fitting as the minimal Eu-
                                                                  clidean Distance (ED) between 𝑆 in each subsequence of
In this section, we articulate the fundamental concepts           length 𝑙 = |𝑆| of 𝑋, formally,
essential for comprehending our proposal. Initially, we                                             𝑚−𝑙
establish a shared language and framework by introduc-                     bestfitMovelets (𝑋 , 𝑆) = min{𝐸𝐷(𝑋𝑗∶𝑗+𝑙 , 𝑆)}      (1)
                                                                                                    𝑗=0
ing notation that serves as a basis for discussing key
elements. Subsequently, we delve into the transforma-             On the other hand, Geolet computes the best fitting in
tion facilitated by Geolet, a catalyst for the motivation         the same way, but geographically shifting 𝑆 to overlap
behind this work. Ultimately, we present a formal intro-          each subsequence of 𝑋 of length 𝑙. In particular, Geolet
duction to the problem at hand.                                   extends the best fitting function of Movelet by adding a
                                                                  pre-processing function, shift, that subtracts the value of
Definition 1 (Trajectory). A trajectory 𝑋 is a sequence           the first vector of the subsequence from all the others,
of spatio-temporal points 𝑋 = {⃗x𝑡0 , … , x⃗ 𝑡𝑚 } ∈ ℝ𝑚×3 where
                                                                                          𝑚−𝑙
the spatial vectors x⃗ 𝑡𝑗 = (lat𝑡𝑗 , long𝑡𝑗 ) are sorted by in-     bestfitGeolet (𝑋 , 𝑆) = min{𝐸𝐷(shift(𝑋𝑗∶𝑗+𝑙 ), shift(𝑆))} (2)
creasing time 𝑡𝑗 , i.e., ∀1 ≤ 𝑗 < 𝑚 we have 𝑡𝑗 < 𝑡𝑗+1 .                                   𝑗=0

                                                                where shift(𝑋 ) = {𝑥⃗0 − 𝑥⃗0 , … , 𝑥⃗𝑚 − 𝑥⃗0 }. The shift function
   In a sense, trajectories can be viewed as multivariate
                                                                makes Geolet suitable for geographic transferability not
time series containing two signals, i.e., the latitude and
                                                                being tide to the territory.
longitude, recorded at non-constant sampling rates [5,
21, 6]. In order to simplify notation, we will use 𝑗 instead       Finally, in order to present map matching we define a
of 𝑡𝑗 every time. A trajectory classification dataset is a set road network as follows:
of trajectories with a vector of labels attached. Formally: Definition 5 (Road Network). A road network 𝐺 =
Definition 2 (Trajectory Dataset). A trajectory dataset ⟨𝑉 , 𝐸⟩ is a directed graph where 𝑉 = {𝑣1 , … , 𝑣𝑝 } is the set
𝒳 ∈ ℝ𝑛×𝑚×3 is a set of 𝑛 trajectories, 𝒳 = {𝑋0 … , 𝑋𝑛 }.        of 𝑝 road junction (or nodes), and 𝐸 = {𝑒1 , … , 𝑒𝑞 } is the
                                                                set of 𝑞 road segments (or edges), where 𝑒𝑖 = (𝑣𝑖1 , 𝑣𝑖2 ).
For simplicity, we use a single symbol 𝑚 to denote the
lengths of the trajectories, even if a dataset can contain We underline that we rely on an enhanced road network
trajectories with a different number of observations. Sim- representation where, for each edge, we also have ac-
ilarly, we emphasize that there is no constraint on the cess to the road segment geometries expressed as a se-
sampling rate, i.e., we can have a non-constant sample quence of 𝑘 latitude and longitude, formally shape(𝐸) =
in the same trajectory. Furthermore, we define a subtra- {𝑥⃗0 , … , 𝑥⃗𝑘 } for some 𝐸 ∈ ℰ. As for trajectories, for sim-
jectory as:                                                     plicity of notation, we use a single symbol 𝑘 to denote
                                                                the lengths of the points describing the geometry of the
Definition 3 (Subtrajectory). Given a trajectory 𝑋 of road segment, even if, in real-case scenarios, the shape
length 𝑚, a subtrajectory 𝑆 = {⃗𝑠𝑗 , … , 𝑠⃗𝑗+𝑙 } ⊂ 𝑋, of length can be described using an arbitrary number of points.
𝑙 ≤ 𝑚, is an ordered sequence of consecutive values such           We are now able to formalize the shape-based map
that 0 ≤ 𝑗 ≤ 𝑚 − 𝑙.                                             matching problem as follows:
   As previously mentioned, Movelet [6] and Geolet [7]            Definition 6 (Shape-based Map Matching). Given a
are shapelet-inspired [10] trajectory approaches that             road network 𝒢 = ⟨𝑉 , 𝐸⟩ and a trajectory 𝑋, the shape-
identify discriminative subtrajectories for classification        based map matching problem consists in finding the best
purposes. They both select the most discriminative sub-           sequence of edges 𝑌 = {𝑒1 , … , 𝑒𝑧 } ⊆ 𝐸 such that does
trajectories w.r.t. the target label using the mutual infor-      not exist another sequence of edges 𝑌 ′ ⊂ 𝐸 different
mation [22]. Like shapelet-based approaches, Movelet              from 𝑌, i.e., 𝑌 ′ ≠ 𝑌, where bestfitGeolet (𝑋 , shape(𝑌 ′ )) <
and Geolet extract discriminative subtrajectories that            bestfitGeolet (𝑋 , shape(𝑌 )).
can be used to train any machine learning model [23].
                                                           In other words, the shape-based map matching problem
Indeed, once the most discriminative subtrajectories are
                                                           involves determining the optimal alignment for a (sub)tra-
identified, a trajectory dataset can be transformed into
                                                           jectory 𝑋 within a designated road network 𝐺, relying on
a tabular representation capturing the distance between
                                                           the configuration of the edges comprising the road seg-
trajectories and discriminative subtrajectories through
                                                           ments. It is essential to emphasize that this map matching
the subtrajectory transform function:
                                                           endeavor necessitates resolution without any reliance on
Definition 4 (Subtrajectory Transform). Given a dataset GPS coordinates as the usage of the shift operator nor-
𝒳 and a set 𝒮 containing ℎ subtrajectories, the subtrajec- malizes the trajectory 𝑋 rendering state-of-the-art map
tory transform converts 𝒳 ∈ ℝ𝑛×𝑚×3 into a real-valued matching methods unsuitable for this particular task.
                                                                   Algorithm 1: build(𝐸, 𝑑, Σ, ℎ)
                                                                    Input : 𝐸 - road segments, 𝑑 - resampling
                                                                             distance, 𝛼 - max nbr. of symbols,
                                                                             ℎ - h-hop aggregation
                                                                    Output : 𝑎ℎ𝑜 - Aho-Corasick automaton
                                                                  1 𝐸ℎ ← aggregate(𝐸, ℎ); // aggregate trajectories
Figure 1: Aho-Corasick automaton using symbolic sequences         2 𝒲 ← ∅;
{𝐴𝐶, 𝐵, 𝐵𝐶𝐴, 𝐶}. Blue dashed arches are suffix arches, while      3 for 𝑒 ∈ 𝐸ℎ do           // for each road segment
green dotted arches are the dictionary suffix arches.             4     𝑋 ← resample(𝑠ℎ𝑎𝑝𝑒(𝑒), 𝑑); // resample traj.
                                                                  5     𝑋⃗ ← direction(𝑋 );   // get traj. direction
                                                                  6     𝑊 ← SAX(𝑋⃗ , 𝛼);         // discretize traj.
4. Shape-based Map Matching                                       7     𝒲 ← 𝒲 ∪ {𝑊 };                // add to dict.
                                                                  8 return Aho-Corasick(𝒲 );
To tackle the shape-based map-matching problem, our
aim is to design a map-matching method with the capabil-
ity to accurately deduce the original GPS coordinates of           Algorithm 2: 𝑠𝑒𝑎𝑟𝑐ℎ(𝑋 , 𝐴, 𝑑, Σ)
a trajectory within a designated road network. Crucially,
this precision is sought exclusively through an examina-            Input : 𝑋 - query traj., 𝐴 - Aho-Corasick
tion of the trajectory’s shape and the configurations of                     automaton, 𝐸 - road segments, 𝑑 -
the edges within the road network, entirely independent                      resampling dist., 𝛼 - max nbr. of symbols
of any reliance on GPS coordinates.                                 Output : 𝑌 ∗ , 𝒴 - best methc and best candidates
   A brute-force approach to address the problem in-              1 𝑋 ′ ← resample(𝑋 , 𝑑);            // resample traj.
volves map matching all conceivable alignments of 𝑋               2 𝑋⃗ ′ ← direction(𝑄);     // get traj. direction
within every segment 𝐸 of the road network 𝐺, employ-             3 𝑄 ← SAX(𝑋   ⃗ ′ , 𝛼);   // discretize using SAX
ing the bestfitGeolet function. However, this naive strategy      4 𝒴 ← search(𝐴, 𝑄, 𝐸);     // get best candidates
is only viable for small road networks due to the algo-               ∗                                 ′
                                                                  5 𝑌 ← arg min𝑌 ∈𝒴 bestfitGeolet (𝑌 , 𝑋 );
rithmic complexity being 𝑂(|𝐸|(𝑚 − 𝑘)𝑘), where 𝑚 and
                                                                  6 return 𝑌;              // return the best match
𝑘 represent the number of points characterizing the tra-
jectory 𝑋 and the number of points describing each road
segments in 𝐸, respectively1 . This limitation also extends
to other map-matching algorithms that rely on latitude               finite sequence of symbols over an alphabet Σ. Subse-
and longitude coordinates to confine the matching scope              quently, it builds a finite-state automaton based on the
to the nearest roads.                                                sequences in 𝒲 within a given finite symbol sequence de-
   We overcome this limitation by proposing GASM a                   fined over the alphabet Σ. Consequently, the automaton,
Geographic Automaton Shape-based map Matching ap-                    constructed using the dictionary 𝒲, identifies a subset
proach that is able to significantly reduce the number               𝒲 ′ ⊂ 𝒲 wherein each sequence in 𝒲 ′ is contained in
of road segment alignments to test with the brute force              𝑄. Figure 1 provides an illustration of the automaton cre-
method. In essence, GASM comprises two key steps. Ini-               ated by the Aho-Corasick algorithm, using the sequences
tially, leveraging the Aho-Corasick algorithm [24], GASM             𝒲 = {𝐴𝐶, 𝐵, 𝐵𝐶𝐴, 𝐶} over the alphabet Σ = {𝐴, 𝐵, 𝐶, 𝐷}.
constructs a shape-based index for all road segments in              The Aho-Corasick algorithm initiates by constructing
𝐸, portraying it as a geographic finite state automaton.             a suffix trie [25], depicted in black in Figure 1. Subse-
Subsequently, GASM facilitates querying the automaton                quently, it designates all leaves of the trie as final states
to pinpoint a set of candidate partial matches between               of the automaton and introduces edges to complete the
𝑋 and {shape(𝑌 )| ∀ 𝑌 ⊆ 𝐸}. Further elucidation of these             automaton. Two types of edges are incorporated, con-
two steps is provided in the subsequent sections.                    necting their respective suffixes: suffix edges, depicted
    Geographic Automaton Construction. GASM in blue, are utilized in the case of a mismatch, without
leverages the Aho-Corasick algorithm to construct a guaranteeing that the suffix is also a sequence in the dic-
geographic automaton, serving as a spatial index for tionary. In contrast, dictionary-suffix edges, portrayed in
expedited query processing [24]. The Aho-Corasick green, guarantee that the suffix is a sequence present in
algorithm, renowned for string searching, takes a set the dictionary. These operations unfold linearly concern-
𝒲 = {𝑊1 , … , 𝑊𝑛 } as input, where each 𝑊𝑖 represents a ing the total number of symbols in the input dictionary 𝒲.
                                                                     The automaton enables the search for all sequences con-
1                                                                    tained in a query by traversing the automaton, achievable
  For each 𝑌 ⊆ 𝐸, we compute the Euclidean Distance (linear com-
  plexity), for all the possible 𝑚 − 𝑘 alignments of shape(𝑌 ) in 𝑋. in linear time relative to the query’s length.
   Algorithm 1 delineates the procedural steps requisite
of GASM for constructing the Aho-Corasick automaton.
The algorithm accepts, as input, the road segments 𝐸 of
the road network 𝐺, the resampling distance 𝑑, the maxi-
mum allowed number of symbols 𝛼, and the number of
hops ℎ, producing a geographical automaton 𝐴 as out-
put. The GASM-build algorithm begins by aggregating
the road network, concatenating ℎ times a road segment
to linked road segments in 𝐸ℎ to extend the length of
existing segments and enhance their representativeness
(line 1). Subsequently, it initializes an empty dictionary
𝒲 (line 2). The following steps are applied for each road
segment in 𝐸ℎ (denoted as 𝑒). Given that the shape of a
road segment 𝑒 may be described by varying numbers of
points based on its length and sinuosity, GASM initially
resamples the geometries into a series of evenly spaced
                                                               Figure 2: Summary of GASM, depicting geographic automa-
points 𝑋. This ensures that the symbolic representation’s
                                                               ton construction (left) and shape-based matching (right).
length, crucial for Aho-Corasick automaton construction,
is proportional solely to the road length. To fulfill the
prerequisite of representing each road segment 𝑒 in a dis-                  Hyperparameter GASM            Values
cretized space, a sequence of symbols is generated (line               𝑑    Resampling distance (m)        [5, 10, 20, 50]
5). Subsequently, GASM determines the heading direc-                  |Σ|   Alphabet size                  [4, 8, 16]
                                                                       ℎ    h-hop aggregation              [0, 1, 2, 3]
tion 𝑋⃗ between consecutive points along the resampled
road segment, transforming the shape of each road se-          Table 1
quence 𝑒 into a univariate time series of directions 𝑋⃗ with   Tested hyperparameters with their values.
a consistent length-based sampling rate 𝑑 (line 6). This
facilitates the utilization of Symbolic Aggregate approXi-        Figure 2 visually summarise GASM. On the left side,
mation (SAX) [26] to obtain a symbolic representation          the geographic automaton construction phase is depicted,
of each road segment over an alphabet Σ (line 7). These        wherein each road within an arbitrary large road network
representations are added to the dictionary 𝒲. Finally,        is indexed according to its heading direction. On the
the dictionary of discretized representations of the road      right side, the shape-based matching phase is illustrated.
segments is employed to construct the Aho-Corasick au-         Here, given a trajectory with known shape but unknown
tomaton, which is then returned as the output (line 8).        origin point, GASM computes the set of potential partial
   Shape-based Matching. Once the construction of the          map matches. Subsequently, it selects the match that
geographic automaton is complete, GASM can execute             minimizes the bestfitGeolet function.
shape-based map matching over the automaton following
the steps outlined in Algorithm 2. GASM-search takes as
input the query trajectory 𝑋, the geographical automaton       5. Experiments
𝐴, the road segments 𝐸, the resampling distance 𝑑, and
the maximum allowed number of symbols 𝛼. It yields the         In this section, we evaluate the effectiveness of GASM2 .
sequence of edges 𝑌 ⊆ 𝐸 that minimizes the bestfitGeolet       First, we present the experimental setting, then we report
function, as per the ensuing procedure. The initial three      and discuss the best performance achieved. Finally, we
steps of Algorithm 2, aligning with Algorithm 1, involve       illustrate details of the hyperparameter tuning and the
resampling the query trajectory 𝑋, extracting its direc-       result of a sensitivity analysis w.r.t. some data properties.
tion, and transforming it into a symbolic representation          Experimental Setting. Regrettably, only a handful
𝑄. Indeed, the same preprocessing applied to the road seg-     of mobility datasets, such as GeoLife and Porto Taxi3 , are
ments 𝐸 is applied to the query trajectories. Subsequently,    available as open access [1, 7]. However, these datasets
the geographic automaton 𝐴 is utilized to perform a lin-       possess limited geographic coverage, rendering them
ear search for the best matches 𝒴 among all possibilities      unsuitable for our study. Thus, we introduce a novel
offered by 𝐸 (line 4). This implementation enables GASM        high-sample rate dataset derived from the publicly acces-
to identify an “initial best match”, presenting a set of
best match candidates 𝒴 = {𝑌1 , … , 𝑌𝑛 }. From this set, the   2
                                                                 Python code: https://t.ly/wVlXS. We ran our experiments on a
final selection of the optimal alignment 𝑌 ∗ is determined       2xIntel Xeon Gold 6342 24-core CPU, limiting each test to use at
through a naive approach (line 5).                               most 12 cores.
                                                               3
                                                                 GeoLife: https://t.ly/6VJ-E. Porto: https://t.ly/0GMR9.
                               Length (km)           Length (#points)                      Kind of Road (%)
     Province      #Trj    Totoal Average (𝜎)            Average (𝜎)     Motorway     Trunk Primary Secondary            Minor
        Arezzo       17       341   17.2 (18.38)             853 (851)        0.019    0.227      0.205     0.344         0.196
       Firenze       58      1041   30.4 (37.31)           1526 (1427)        0.122    0.032      0.392     0.205         0.249
      Grosseto       35       215      6.7 (9.35)            578 (545)        0.020    0.133      0.054     0.327         0.466
       Livorno       53       410   18.5 (25.42)             916 (645)        0.410    0.043      0.089     0.144         0.313
         Lucca       39       804   17.5 (15.49)           1128 (1133)        0.225    0.000      0.056     0.442         0.278
    M. Carrara       46       267      5.1 (3.54)            625 (430)        0.187    0.173      0.160     0.267         0.212
           Pisa      35       831   26.0 (23.04)           1347 (1178)        0.000    0.002      0.219     0.200         0.578
        Pistoia      20       468   53.1 (31.07)             929 (777)        0.000    0.186      0.251     0.163         0.399
         Prato       31       146      4.5 (2.37)            660 (672)        0.141    0.050      0.395     0.146         0.269
         Siena       24       497   22.4 (16.54)             983 (660)        0.557    0.032      0.340     0.026         0.046
Table 2
Dataset description. Besides average values are reported standard deviations (𝜎).

                                Method Performances                            Road Network Characteristics
                        Selectivity                Building           #Road                   Avg node
        Province           Factor ↓ Accuracy ↑ Time (s) ↓          Segments    #Intersections   Degree Length (km)
            Arezzo             0.096         0.875   708.32           133028            54485     4.883      26852
             Siena             0.068         1.000   516.37           175088            72079     4.858      28949
            Pistoia            0.080         0.950   433.24            81870            34492     4.747      12363
             Lucca             0.070         0.846   658.03           141149            59274     4.763      20328
           Firenze             0.399         0.263   157.17           299312           119068     5.028      39025
         Grosseto              0.060         0.823   421.66           121045            50014     4.841      26818
           Livorno             0.069         1.000   277.38            96631            39613     4.879      11269
       M. Carrara              0.056         0.978   294.50            71917           300071     4.783      10809
               Pisa            0,081         0.857  1007.06           150954            62580     4.824      22396
             Prato             0.105         0.936   190.77            45060            18794     4.795        5224
     Macro Avg (𝜎)     0.108 (0.103) 0.853 (0.217)
Table 3
Selectivity factor, accuracy, automaton construction runtime, and other road network informations.



sible 2013 GPS traces on OpenStreetMap4 . Although the             the metric of accuracy.On the other hand, the evaluation
initial OpenStreetMap dataset encompasses GPS trajecto-            of the second question relies on the metric of selectivity,
ries spanning the entire globe, our analysis concentrates          commonly employed in database literature [27]. Selec-
on the ten provinces in Tuscany, a region encompass-               tivity measures the reduction of potential alignments
ing 22, 985𝑘𝑚2 in central Italy. The Mappymatch python             between a query result and the entire dataset. In our con-
package5 was employed to map-match each trajectory,                text, selectivity is defined as the ratio of matched road
retaining only those trajectories with an average error of         segments (|𝒴 |) to the total number of road segments (|𝐸|).
less than 10𝑚. The final dataset encompasses 358 distinct          For accuracy, higher values indicate better results, while
trajectories, covering a total travel distance of 5, 024𝑘𝑚         for selectivity, lower values indicate better outcomes.
and described by 300, 049 GPS points. Additional infor-               GASM Performance. Table 5 presents the perfor-
mation on the types of roads traversed in each province            mance metrics of GASM across individual provinces. To
in Tuscany, as per the OpenStreetMap taxonomy6 , is                determine the optimal hyperparameters, a grid search
presented in Table 6.                                              was conducted over the values outlined in Table 6, specif-
   Within the framework of our shape-based map-                    ically for the province of Grosseto. This process yielded
matching formulation, we aim to address the following              the following hyperparameters: a resampling distance
questions. First, to what extent can GASM infer the                of 𝑑 = 10 meters, an alphabet size of 𝛼 = 8 symbols, and
original GPS coordinates without utilizing them for map            a street aggregation of ℎ = 2 hops. GASM showcases
matching? Second, how effectively can GASM reduce the              an impressive ability to deduce the original GPS coor-
number of potential alignments compared to the entire              dinates, achieving an average light accuracy of 90.1%.
road network? The first question is evaluated through              Furthermore, it significantly narrows down the potential
4                                                                  alignments, as indicated by the selectivity factor, reduc-
    OpenStreetMap 2013 public GPS traces: https://t.ly/q7u2N
5
    Mappymatch: https://t.ly/RHafS                                 ing it to just 10.8% of the original road network. These
6
    Highway taxonomy: https://t.ly/NpxZv                           commendable results are attained while maintaining a
Figure 3: Hyperparameters influence: ℎ-hop aggregation (left), 𝛼 alphabet size (center), and 𝑑 resampling distance (right).




Figure 4: Influence of trajectory length (left), trajectory straightness (center), and kind of road (right) on performance metrics.



reasonable automaton construction time of 7.8 minutes             criminative nature of trajectories based on the type of
per province, resulting in a total indexing time of a mere        road. Our hypothesis posits that straight streets, such
1.29 hours for the entire region.                                 as motorways, exhibit lower discriminative characteris-
   Hyperparameters Tuning. In this section, we                    tics. Consequently, trajectories observed on such roads
present the results of experiments conducted on the               are more likely to avoid re-identification, suggesting en-
province of Grosseto while varying the hyperparame-               hanced geographic transferability. To investigate this, we
ters detailed in Table 6. Initially, we compute the Pearson       identify the type of road traveled within each segment.
correlation between the method’s hyperparameters and              In cases where multiple types of roads are encountered,
two key performance metrics: the selectivity factor and           we perform a majority vote weighted by road length. Ad-
accuracy. Figure 3 visually depicts the changes in perfor-        ditionally, to examine the influence of changes in the
mance metrics, emphasizing variations in the top three            input data, we create random subtrajectories of varying
most influential hyperparameters—those exhibiting the             lengths, including 100m, 150m, 500m, 1km, 1.5km, 5km,
highest absolute values of Pearson correlation. Notably,          and 10km, derived from our OpenStreetMap dataset. In
the most influential hyperparameter is the number of              order to assess our hypothesis, we evaluate the straight-
road segment aggregations (ℎ), demonstrating a corre-             ness [4] of each subtrajectory by calculating the ratio
lation of −0.52 with the selectivity. Thus, increasing ℎ          between the shortest path from the origin to the destina-
proves beneficial for GASM as it helps select fewer can-          tion and the actual trajectory.
didate road segments without significantly impacting                 Figure 4 encapsulates these results. The initial plot on
accuracy. The alphabet size (𝛼) displays correlations of          the left highlights a notable trend: an increase in subtra-
−0.44 and 0.40 with respect to the selectivity and accu-          jectory length correlates with a rapid elevation in both
racy, respectively. This hyperparameter introduces a              accuracy and selectivity. In simpler terms, as the subtra-
trade-off, as increasing the number of symbols reduces            jectory length extends, the model’s precision improves.
selectivity but may lead to a slight decrease in accuracy.        The central plot reveals that trajectory straightness has
Finally, the resampling distance (𝑑) exhibits a correlation       a negligible impact on the number of candidate matches.
of 0.22 with accuracy. Interestingly, decreasing 𝑑 slightly       However, as trajectories become more linear, the accu-
enhances accuracy according to our observations.                  racy experiences a decline. Finally, the rightmost plot
                                                                  illustrates the method’s performance across various road
   Sensitivity Analysis. We delve here into the varia-
                                                                  types. This plot validates the findings of the straightness
tions in performance with respect to the length of the
                                                                  plot: roads with greater straightness, like motorways,
query trajectory 𝑋. Additionally, we explore the dis-
                                                                  pose the greatest challenge for re-identification. Con-
versely, more sinuous roads present a slightly higher            [7] C. Landi, et al., Geolet: An interpretable model for
difficulty in re-identification, reflected in a higher selec-        trajectory classification, in: IDA, volume 13876 of
tivity but with a concomitant boost in accuracy.                     LNCS, Springer, 2023, pp. 236–248.
                                                                 [8] F. Veronesi, et al., Assessing accuracy and geograph-
6. Conclusion                                                        ical transferability of machine learning algorithms
                                                                     for wind speed modelling, in: AGILE, LNGC, 2017.
In this paper we have introduced GASM, a map matching            [9] M. Nanni, et al., City indicators for geographical
method capable of determining a trajectory’s position                transfer learning: an application to crash prediction,
solely based on its shape. Our experiments showcase                  GeoInformatica 26 (2022) 581–612.
that GASM significantly reduces the number of poten-            [10] J. Lines, et al., A shapelet transform for time series
tial alignments and deduces the original GPS coordinates             classification, in: ACM SIGKDD, 2012, pp. 289–297.
with remarkable accuracy. Further analysis reveals that         [11] C. White, et al., Some map matching algorithms for
longer and less linear trajectories are more straightfor-            personal navigation assistants, TRC 8 (2000).
ward to map match. However, this observation raises             [12] D. Bernstein, et al., An introduction to map match-
concerns about the potential for shape-based methods                 ing for personal navigation assistants (1996).
to inadvertently learn geographic positions instead of          [13] M. A. Quddus, et al., A general map matching al-
focusing on other intrinsic features. As a part of future            gorithm for transport telematics applications, GPS
work, as outlined at the beginning, we aim to assess the             solutions 7 (2003) 157–167.
geographic transferability of shape-based methods, such         [14] Y. Liu, Z. Li, A novel algorithm of low sampling
as Geolet, by incorporating GASM. Specifically, we pro-              rate GPS trajectories on map-matching, EURASIP
pose giving more weight to the selection of discriminative           2017 (2017) 30.
subsequences with higher selectivity rather than basing         [15] X. Liu, et al., A st-crf map-matching method for
the decision solely on a statistical test.                           low-frequency floating car data, TITS 18 (2016)
                                                                     1241–1254.
Acknowledgments                                                 [16] L. Jiang, et al., From driving trajectories to driving
                                                                     paths: a survey on map-matching algorithms, CCF
This work is partially supported by the EU NextGener-                TPCI 4 (2022) 252–267.
ationEU programme under the funding schemes PNRR-               [17] P. Cintia, M. Nanni, An effective time-aware map
“SoBigData.it - Strengthening the Italian RI for Social              matching process for low sampling gps data, arXiv
Mining and Big Data Analytics” - Prot. IR0000013, H2020-             preprint arXiv:1603.07376 (2016).
INFRAIA-2019-1: Res. Infr. G.A. 871042 SoBigData++,             [18] G. Hu, et al., If-matching: Towards accurate map-
and GreenDatAI G.A. 101070416.                                       matching with information fusion, TKDE 29 (2016)
                                                                     114–127.
                                                                [19] L. Wang, et al., Smart city development with urban
References                                                           transfer learning, Computer 51 (2018) 32–41.
 [1] C. L. da Silva, et al., A survey and comparison of         [20] L. Wang, et al., Cross-city transfer learning for
     trajectory classification methods, in: BRACIS, IEEE,            deep spatio-temporal prediction, in: IJCAI, ijcai.org,
     2019, pp. 788–793.                                              2019, pp. 1893–1899.
 [2] A. Bolbol, et al., Inferring hybrid transportation         [21] R. Trasarti, et al., Myway: Location prediction via
     modes from sparse GPS data using a moving win-                  mobility profiling, Inf. Syst. 64 (2017) 350–367.
     dow SVM classification, CEUS 36 (2012) 526–537.            [22] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to
 [3] S. Dodge, et al., Revealing the physics of movement:            data mining, Pearson Education India, 2016.
     Comparing the similarity of movement characteris-          [23] A. J. Bagnall, et al., The great time series classifica-
     tics, CEUS 33 (2009) 419–434.                                   tion bake off, CoRR abs/1602.01711 (2016).
 [4] P. J. Almeida, et al., Indices of movement behaviour:      [24] A. V. Aho, et al., Efficient string matching: An aid
     conceptual background, effects of scale and location            to bibliographic search, ACM 18 (1975) 333–340.
     errors, Zoologia (Curitiba) 27 (2010) 674–680.             [25] E. M. McCreight, A space-economical suffix tree
 [5] I. Kontopoulos, et al., Traclets: Harnessing the                construction algorithm, JACM 23 (1976) 262–272.
     power of computer vision for trajectory classifica-        [26] J. Lin, et al., A symbolic representation of time
     tion, arXiv:2205.13880 (2022).                                  series, with implications for streaming algorithms,
 [6] C. A. Ferrero, et al., MOVELETS: exploring relevant             in: DMKD, ACM, 2003, pp. 2–11.
     subtrajectories for robust trajectory classification,      [27] S. Acharya, et al., Selectivity estimation in spatial
     in: SAC, ACM, 2018, pp. 849–856.                                databases, in: SIGMOD, ACM, 1999, pp. 13–24.