A Shape-Based Map Matching Approach for Geographic Transferability of Discriminative Subtrajectories

Cristiano Landi (cristiano.landi@phd.unipi.it)
University of Pisa, Pisa, Italy; ISTI-CNR, Pisa, Italy

Riccardo Guidotti (riccardo.guidotti@unipi.it)
University of Pisa, Pisa, Italy; ISTI-CNR, Pisa, Italy

ISSN 1613-0073

Keywords: Map Matching, Geographic Transferability, Machine Learning, Discriminative Subtrajectories

This paper addresses the challenge of map matching and geographic transferability in trajectory analysis. Existing methods often face limitations tied to specific coordinates or road networks. In response, we propose GASM, a shape-based map matching method that relies solely on trajectory shapes, irrespective of geographic origin. GASM introduces a symbolic road network representation, facilitating efficient searches based solely on trajectory shapes. Our experimentation, spanning over 5,000 km of roads, demonstrates GASM's ability to accurately position trajectories with an impressive accuracy exceeding 90%. Notably, GASM stands as the first in the literature to perform shape-based symbolic map matching without prior knowledge of the geographic region.

Introduction

In recent years, the widespread adoption of cutting-edge technologies equipped with Global Positioning System (GPS) devices has enabled the recording of positions for various moving objects, ranging from cars and transportation vehicles to phones and wearables. Unfortunately, the coordinates captured by these sensors often fail to accurately reflect real positions due to physical constraints and/or legal regulations. Nevertheless, in various applications, it is imperative to accurately align GPS trajectories with a road network. For instance, in navigation services, map matching empowers drivers to monitor their exact locations and receive optimal routes to specified destinations. Conversely, in machine learning tasks, map matching enhances users' mobility information by incorporating knowledge related to the territory, such as Points Of Interest (POI), feature engineering [1,2,3,4], or the identification of discriminatory subsequences, such as mobility shapelets [5,6,7]. Without an appropriate map-matching procedure, reliance on an expert becomes necessary to determine which features can be extracted from trajectories concerning the territory. However, the reliance on ad-hoc features restricts the applicability of machine learning methods and amplifies sensitivity to input changes [1], rendering it unsuitable for geographic transferability. This implies the challenge of extracting mobility patterns from one geographical region and effectively applying them in another region [8,9].

Particularly noteworthy are recent advancements in machine learning leveraging shapelet-based subtrajectories [5,6,7]. Originating from the domain of time series analytics, shapelets represent discriminatory subsequences that encapsulate a collection of distinctive shapes, crucial for discerning specific classes [10]. Various approaches exist for defining discriminative subtrajectories. In [6], the Movelet method is introduced, an approach for extracting discriminative subtrajectories selected through a rigorous statistical test. During the discovery phase, Movelet generates candidate subtrajectories by extracting all possible subsequences with more than two contiguous observations, utilizing a sliding window. Building upon the foundation laid by Movelet, Geolet is introduced in [7]. This extension incorporates a normalization step after the discovery phase. The normalization step is designed to ease the comparison of discriminative subsequences with trajectories recorded in diverse geographical regions. The underlying rationale is that a subtrajectory pinpointing a sudden break in a road segment in one city should exhibit similarities to a subtrajectory associated with the same event in another city. This normalization enhances the method's adaptability across various geographic contexts.

While Geolet successfully addresses the limitation of Movelet by providing normalized subtrajectories, thereby enhancing geographic transferability independent of specific GPS coordinates, it introduces a potential vulnerability tied to the road network.

Our underlying hypothesis is that the less frequently a trajectory occurs, the greater the likelihood that shape-based methods utilizing it as a discriminative subsequence may not capture the intrinsic features of the trajectory but rather only its geographic position. In essence, if a discriminative subtrajectory is intrinsically linked to a particular road network due to its distinctive shape, it becomes unsuitable for geographic transferability. This limitation arises from the fact that the discriminatory aspect is not rooted in the movements themselves but rather in the structural characteristics of the road network. Consequently, evaluating the geographic transferability of discriminative subtrajectories necessitates a shape-based map-matching approach that exclusively relies on shapes without prior knowledge of the position. Regrettably, to the best of our knowledge, such an approach is currently unavailable. This underscores the need for innovative solutions in the realm of shape-based map matching to comprehensively assess the adaptability of discriminative subtrajectories across diverse geographical contexts.

To overcome this limitation, our paper introduces GASM, a Geographic Automaton Shape-based map Matching approach. GASM relies solely on the shape of a trajectory to accurately determine its position within the road network. Specifically, GASM employs a symbolic representation to transform the road network, constructing a spatial index independent of coordinates. This unique approach allows for efficient trajectory searches based solely on their shapes. To the best of our knowledge, GASM is the first proposal in the literature that exclusively utilizes a discretized representation of a trajectory's shape, devoid of any knowledge of the geographic region, for shape-based map matching. Our experimentation with GASM on a novel comprehensive geographical dataset spanning over 5,000 km of roads in Tuscany, central Italy, demonstrates its capability to identify correct alignments with an accuracy exceeding 90%. Furthermore, GASM is efficient: it constructs the necessary representation for the entire dataset in less than 1.5 hours, and its complexity at query time is linear.

The paper is organized as follows. Section 2 summarises the related works concerning map-matching methods and the challenges posed by geographic transferability. In Section 3, we encapsulate the technical concepts essential for comprehending the algorithm delineated in Section 4. The outcomes of experiments conducted with GASM are detailed in Section 5. Finally, Section 6 encapsulates our findings and delves into potential avenues for future developments.

Related Works

In the following, we provide a concise overview of the literature concerning map matching methods and elucidate the geographic transferability problem, introducing key strategies employed to tackle this challenge.

In the literature of trajectory analysis, a multitude of strategies exists for mapping trajectories onto a road network. For high-frequency sampled trajectories, the simplest approach involves associating each spatio-temporal point with the nearest street segment [11,12]. However, these techniques, while fast and straightforward, have exhibited inaccuracies, particularly at intersections and parallel roads. To address these limitations, enhancements have been introduced, incorporating heading direction or employing a Kalman filter to eliminate outlier points in trajectories [13]. Alternatively, some approaches leverage probabilistic map-matching algorithms, integrating hidden Markov models to identify the most likely sequence of road segments aligning with the trajectory [14,15]. On the other hand, in the context of low-sampled trajectories, much of the existing literature presupposes that the most probable route connecting two successive points is also the shortest or fastest [16]. However, [17] introduces a map-matching algorithm that exploits the temporal intervals between GPS points. This method identifies the optimal match between two GPS points by selecting the route with the most similar travel time. Also, [18] proposes a method for map matching low-sampled trajectories based on supplementary information such as speed and moving direction, typically collected alongside spatial locations.

Geographic transferability encapsulates the challenge of extending knowledge gleaned from one geographic region to another. This entails constructing machine learning models capable of adeptly executing tasks in regions distinct from their original training grounds. The core of this challenge emerges from pronounced disparities in data distributions, patterns, and characteristics across diverse geographical locations. As articulated in [8,9], models trained on data from one region may encounter difficulties in generalization when confronted with the unique variations inherent in another region. In pursuit of a global model, a fundamental strategy is employed as demonstrated in [8], where diverse data sources are aggregated on a global scale to build a transferable model. Conversely, in [9], city indicators are identified pertaining to road networks, traffic flows, and individual mobility, to facilitate the assessment of similarities between geographical regions. Subsequently, an ensemble classifier is devised, computing the output as a weighted average of outputs generated by individual local classifiers. Notably, the importance of local models is determined by their higher similarity to the target regions compared to others in the ensemble with respect to the city indicators. Alternate methodologies involve adapting a model initially trained in a data-rich region and transplanting it to a target region characterized by limited data availability. This adaptation process entails integrating additional data to compute sub-region similarities, subsequently enabling the remapping of the model [19,20].

Problem Setting

In this section, we articulate the fundamental concepts essential for comprehending our proposal. Initially, we establish a shared language and framework by introducing notation that serves as a basis for discussing key elements. Subsequently, we delve into the transformation facilitated by Geolet, a catalyst for the motivation behind this work. Ultimately, we present a formal introduction to the problem at hand.

Definition 1 (Trajectory). A trajectory $X$ is a sequence of spatio-temporal points $X = \{\vec{x}_{t_0}, \dots, \vec{x}_{t_m}\} \in \mathbb{R}^{m \times 3}$, where the spatial vectors $\vec{x}_{t_j} = (\mathrm{lat}_{t_j}, \mathrm{long}_{t_j})$ are sorted by increasing time $t_j$, i.e., $\forall\, 1 \le j < m$ we have $t_j < t_{j+1}$.

In a sense, trajectories can be viewed as multivariate time series containing two signals, i.e., the latitude and longitude, recorded at non-constant sampling rates [5,21,6]. In order to simplify notation, we will use 𝑗 instead of 𝑡 𝑗 every time. A trajectory classification dataset is a set of trajectories with a vector of labels attached. Formally:

Definition 2 (Trajectory Dataset). A trajectory dataset $\mathcal{X} \in \mathbb{R}^{n \times m \times 3}$ is a set of $n$ trajectories, $\mathcal{X} = \{X_0, \dots, X_n\}$.

For simplicity, we use a single symbol 𝑚 to denote the lengths of the trajectories, even if a dataset can contain trajectories with a different number of observations. Similarly, we emphasize that there is no constraint on the sampling rate, i.e., the sampling rate can vary within the same trajectory. Furthermore, we define a subtrajectory as:

Definition 3 (Subtrajectory). Given a trajectory $X$ of length $m$, a subtrajectory $S = \{\vec{s}_j, \dots, \vec{s}_{j+l}\} \subset X$, of length $l \le m$, is an ordered sequence of consecutive values such that $0 \le j \le m - l$.

As previously mentioned, Movelet [6] and Geolet [7] are shapelet-inspired [10] trajectory approaches that identify discriminative subtrajectories for classification purposes. They both select the most discriminative subtrajectories w.r.t. the target label using the mutual information [22]. Like shapelet-based approaches, Movelet and Geolet extract discriminative subtrajectories that can be used to train any machine learning model [23]. Indeed, once the most discriminative subtrajectories are identified, a trajectory dataset can be transformed into a tabular representation capturing the distance between trajectories and discriminative subtrajectories through the subtrajectory transform function (Definition 4).

On the other hand, Geolet computes the best fitting in the same way, but geographically shifting $S$ to overlap each subsequence of $X$ of length $l$. In particular, Geolet extends the best fitting function of Movelet by adding a pre-processing function, shift, that subtracts the value of the first vector of the subsequence from all the others:

$$\mathrm{bestfit}_{\mathrm{Geolet}}(X, S) = \min_{j=0}^{m-l} \{ ED(\mathrm{shift}(X_{j:j+l}), \mathrm{shift}(S)) \} \qquad (2)$$

where $\mathrm{shift}(X) = \{\vec{x}_0 - \vec{x}_0, \dots, \vec{x}_m - \vec{x}_0\}$.

The shift function makes Geolet suitable for geographic transferability, since the comparison is no longer tied to the territory.
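The shift-and-compare step of Equation (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names are hypothetical, points are treated as plain 2D pairs, and the time component is ignored.

```python
import math

def shift(points):
    """Translate a sequence of 2D points so that its first point is the origin."""
    x0, y0 = points[0]
    return [(x - x0, y - y0) for x, y in points]

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length sequences of 2D points."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(a, b)))

def bestfit_geolet(X, S):
    """Minimum distance between shift(S) and every shifted window of X of length |S|."""
    l = len(S)
    shifted_S = shift(S)
    return min(euclidean_distance(shift(X[j:j + l]), shifted_S)
               for j in range(len(X) - l + 1))

# Toy data (assumed): S has the same shape as the last three points of X,
# but lies in a different place.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
S = [(10.0, 5.0), (11.0, 5.0), (12.0, 5.0)]
print(bestfit_geolet(X, S))  # 0.0
```

Because both windows are shifted to the origin before the comparison, the distance is zero even though S and X share no coordinates, which is the property that makes the match independent of position.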

Finally, in order to present map matching we define a road network as follows:

Definition 5 (Road Network). A road network $G = \langle V, E \rangle$ is a directed graph where $V = \{v_1, \dots, v_p\}$ is the set of $p$ road junctions (or nodes), and $E = \{e_1, \dots, e_q\}$ is the set of $q$ road segments (or edges), where $e_i = (v_{i_1}, v_{i_2})$.

We underline that we rely on an enhanced road network representation where, for each edge, we also have access to the road segment geometry expressed as a sequence of $k$ latitude and longitude pairs, formally $\mathrm{shape}(e) = \{\vec{x}_0, \dots, \vec{x}_k\}$ for some $e \in E$. As for trajectories, for simplicity of notation, we use a single symbol $k$ to denote the number of points describing the geometry of a road segment, even if, in real-case scenarios, the shape can be described using an arbitrary number of points.

We are now able to formalize the shape-based map matching problem as follows:

Definition 6 (Shape-based Map Matching). Given a road network $G = \langle V, E \rangle$ and a trajectory $X$, the shape-based map matching problem consists in finding the best sequence of edges $Y = \{e_1, \dots, e_z\} \subseteq E$ such that there exists no other sequence of edges $Y' \subset E$, $Y' \neq Y$, for which $\mathrm{bestfit}_{\mathrm{Geolet}}(X, \mathrm{shape}(Y')) < \mathrm{bestfit}_{\mathrm{Geolet}}(X, \mathrm{shape}(Y))$.

In other words, the shape-based map matching problem involves determining the optimal alignment for a (sub)trajectory 𝑋 within a designated road network 𝐺, relying on the configuration of the edges comprising the road segments. It is essential to emphasize that this map matching endeavor necessitates resolution without any reliance on GPS coordinates as the usage of the shift operator normalizes the trajectory 𝑋 rendering state-of-the-art map matching methods unsuitable for this particular task.

Shape-based Map Matching

To tackle the shape-based map-matching problem, our aim is to design a map-matching method with the capability to accurately deduce the original GPS coordinates of a trajectory within a designated road network. Crucially, this precision is sought exclusively through an examination of the trajectory's shape and the configurations of the edges within the road network, entirely independent of any reliance on GPS coordinates.

A brute-force approach to address the problem involves map matching all conceivable alignments of 𝑋 within every segment 𝐸 of the road network 𝐺, employing the bestfit Geolet function. However, this naive strategy is only viable for small road networks due to the algorithmic complexity being 𝑂(|𝐸|(𝑚 − 𝑘)𝑘), where 𝑚 and 𝑘 represent the number of points characterizing the trajectory 𝑋 and the number of points describing each road segment in 𝐸, respectively¹. This limitation also extends to other map-matching algorithms that rely on latitude and longitude coordinates to confine the matching scope to the nearest roads.

We overcome this limitation by proposing GASM, a Geographic Automaton Shape-based map Matching approach that significantly reduces the number of road segment alignments to test with the brute-force method. In essence, GASM comprises two key steps. Initially, leveraging the Aho-Corasick algorithm [24], GASM constructs a shape-based index for all road segments in 𝐸, portraying it as a geographic finite state automaton. Subsequently, GASM facilitates querying the automaton to pinpoint a set of candidate partial matches between 𝑋 and {shape(𝑌 )| ∀ 𝑌 ⊆ 𝐸}. Further elucidation of these two steps is provided in the subsequent sections.

Geographic Automaton Construction. GASM leverages the Aho-Corasick algorithm to construct a geographic automaton, serving as a spatial index for expedited query processing [24]. The Aho-Corasick algorithm, renowned for string searching, takes as input a set $\mathcal{W} = \{W_1, \dots, W_n\}$, where each $W_i$ is a finite sequence of symbols over an alphabet $\Sigma$. It builds a finite-state automaton that locates the sequences in $\mathcal{W}$ within a given finite symbol sequence $Q$ defined over the same alphabet $\Sigma$. Consequently, the automaton, constructed using the dictionary $\mathcal{W}$, identifies the subset $\mathcal{W}' \subset \mathcal{W}$ of sequences that are contained in $Q$. Figure 1 provides an illustration of the automaton created by the Aho-Corasick algorithm, using the sequences $\mathcal{W} = \{AC, B, BCA, C\}$ over the alphabet $\Sigma = \{A, B, C, D\}$. The Aho-Corasick algorithm starts by constructing a trie [25] over the dictionary sequences, depicted in black in Figure 1. Subsequently, it designates the nodes at which a dictionary sequence ends as final states of the automaton and introduces edges to complete the automaton. Two types of edges are added, each connecting a node to one of its suffixes: suffix edges, depicted in blue, are followed in the case of a mismatch, without guaranteeing that the suffix is also a sequence in the dictionary. In contrast, dictionary-suffix edges, portrayed in green, guarantee that the suffix is a sequence present in the dictionary. These operations take time linear in the total number of symbols in the input dictionary $\mathcal{W}$. The automaton enables the search for all dictionary sequences contained in a query by traversing the automaton, which takes time linear in the query's length.
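To make the construction concrete, the following is a minimal pure-Python sketch of an Aho-Corasick automaton built on the dictionary of Figure 1. It is a textbook-style illustration under simplifying assumptions, not the code used by GASM: function names are hypothetical, and the dictionary-suffix edges are not stored explicitly but folded into the output set of each state.

```python
from collections import deque

def build_automaton(words):
    """Build trie transitions, failure (suffix) links and output sets."""
    goto = [{}]      # goto[state][symbol] -> next state
    fail = [0]       # failure (suffix) link of each state
    out = [set()]    # dictionary words recognised when reaching each state
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][symbol] = len(goto) - 1
            state = goto[state][symbol]
        out[state].add(word)
    # Breadth-first traversal: the failure link of a state points to the
    # longest proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for symbol, nxt in goto[state].items():
            f = fail[state]
            while f and symbol not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(symbol, 0)
            out[nxt] |= out[fail[nxt]]  # inherit matches ending at the suffix
            queue.append(nxt)
    return goto, fail, out

def search(query, automaton):
    """Return sorted (start_index, word) pairs for every dictionary word in query."""
    goto, fail, out = automaton
    state, matches = 0, []
    for i, symbol in enumerate(query):
        while state and symbol not in goto[state]:
            state = fail[state]
        state = goto[state].get(symbol, 0)
        matches.extend((i - len(word) + 1, word) for word in out[state])
    return sorted(matches)

automaton = build_automaton(["AC", "B", "BCA", "C"])
print(search("ABCAC", automaton))
# [(1, 'B'), (1, 'BCA'), (2, 'C'), (3, 'AC'), (4, 'C')]
```

The query is scanned once from left to right, and every dictionary sequence occurring in it is reported together with its starting position.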

Algorithm 1 outlines the steps GASM follows to construct the Aho-Corasick automaton. The algorithm accepts, as input, the road segments $E$ of the road network $G$, the resampling distance $d$, the maximum allowed number of symbols $\alpha$, and the number of hops $h$, producing a geographical automaton $A$ as output. The GASM-build algorithm begins by aggregating the road network, concatenating each road segment with its linked road segments up to $h$ times to obtain $E^h$, which extends the length of existing segments and enhances their representativeness (line 1). Subsequently, it initializes an empty dictionary $\mathcal{W}$ (line 2). The following steps are applied for each road segment $e$ in $E^h$. Given that the shape of a road segment $e$ may be described by varying numbers of points based on its length and sinuosity, GASM initially resamples the geometries into a series of evenly spaced points $X$. This ensures that the symbolic representation's length, crucial for Aho-Corasick automaton construction, is proportional solely to the road length. To fulfill the prerequisite of representing each road segment $e$ in a discretized space, a sequence of symbols is generated (line 5). Subsequently, GASM determines the heading direction $\vec{X}$ between consecutive points along the resampled road segment, transforming the shape of each road segment $e$ into a univariate time series of directions $\vec{X}$ with a consistent length-based sampling rate $d$ (line 6). This facilitates the utilization of Symbolic Aggregate approXimation (SAX) [26] to obtain a symbolic representation of each road segment over an alphabet $\Sigma$ (line 7). These representations are added to the dictionary $\mathcal{W}$. Finally, the dictionary of discretized representations of the road segments is employed to construct the Aho-Corasick automaton, which is then returned as the output (line 8).
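The per-segment preprocessing described above (resampling, heading extraction, discretization) can be illustrated with the following sketch. Several simplifications are assumed and should not be read as the authors' choices: coordinates are treated as planar and expressed in metres, headings are angles measured from the x-axis, SAX is replaced by plain equal-width angular bins, and the $h$-hop aggregation is omitted. All function names are hypothetical.

```python
import math

def resample(points, d):
    """Return points spaced every d units along the polyline (planar coordinates assumed)."""
    out = [points[0]]
    need = d  # distance still to travel before emitting the next point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        pos = 0.0  # distance already consumed on this segment
        while seg - pos >= need:
            pos += need
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            need = d
        need -= seg - pos
    return out

def headings(points):
    """Angle in degrees, in [0, 360), between consecutive points."""
    return [math.degrees(math.atan2(y1 - y0, x1 - x0)) % 360.0
            for (x0, y0), (x1, y1) in zip(points, points[1:])]

def symbolize(angles, alpha):
    """Map each angle to one of alpha equal-width bins, written as letters A, B, ..."""
    width = 360.0 / alpha
    return "".join(chr(ord("A") + min(alpha - 1, int(a // width))) for a in angles)

def road_to_word(points, d, alpha):
    """Symbolic word of a road geometry: resample, take headings, discretize."""
    return symbolize(headings(resample(points, d)), alpha)

# Toy L-shaped road (assumed data): 30 m along x, then 20 m along y.
print(road_to_word([(0.0, 0.0), (30.0, 0.0), (30.0, 20.0)], d=10, alpha=8))  # AAACC
```

With a fixed resampling distance, the word has one symbol per $d$ metres of road, so its length depends only on the road length, as required by the automaton construction.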

Shape-based Matching.

Once the construction of the geographic automaton is complete, GASM can execute shape-based map matching over the automaton following the steps outlined in Algorithm 2. GASM-search takes as input the query trajectory 𝑋, the geographical automaton 𝐴, the road segments 𝐸, the resampling distance 𝑑, and the maximum allowed number of symbols 𝛼. It yields the sequence of edges 𝑌 ⊆ 𝐸 that minimizes the bestfit Geolet function, as per the ensuing procedure. The initial three steps of Algorithm 2, aligning with Algorithm 1, involve resampling the query trajectory 𝑋, extracting its direction, and transforming it into a symbolic representation 𝑄. Indeed, the same preprocessing applied to the road segments 𝐸 is applied to the query trajectories. Subsequently, the geographic automaton 𝐴 is utilized to perform a linear search for the best matches 𝒴 among all possibilities offered by 𝐸 (line 4). This implementation enables GASM to identify an "initial best match", presenting a set of best match candidates 𝒴 = {𝑌 1 , … , 𝑌 𝑛 }. From this set, the final selection of the optimal alignment 𝑌 * is determined through a naive approach (line 5).
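The two phases of the search, symbolic filtering followed by exhaustive refinement, can be summarised by the sketch below. It is a simplification under stated assumptions: a plain substring test stands in for the traversal of the automaton, the shape distance is passed in as an arbitrary callable, and the road identifiers, words, and distances are toy values invented for the example.

```python
def search_sketch(query_word, road_words, distance):
    """Two-phase search: symbolic filtering, then exhaustive refinement.

    query_word : symbolic representation Q of the query trajectory
    road_words : dict mapping a road identifier to its symbolic word
    distance   : callable giving the shape distance of a road to the query
    """
    # Phase 1: keep the roads whose word occurs in Q (a substring test is
    # used here as a stand-in for traversing the automaton).
    candidates = [road for road, word in road_words.items() if word in query_word]
    # Phase 2: naive selection of the candidate with the smallest distance.
    best = min(candidates, key=distance, default=None)
    return best, candidates

# Toy values (assumed), only to show the flow of the two phases.
road_words = {"e1": "AAC", "e2": "CCA", "e3": "AC", "e4": "BB"}
toy_distance = {"e1": 0.4, "e2": 0.1, "e3": 0.9, "e4": 0.0}.get
best, candidates = search_sketch("BAACC", road_words, toy_distance)
print(best, candidates)  # e1 ['e1', 'e3']
```

Only the two roads that survive the symbolic filter are compared by distance, so the expensive comparison runs on a fraction of the network.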

Table 1

Tested hyperparameters with their values.

Figure 2 visually summarises GASM. On the left side, the geographic automaton construction phase is depicted, wherein each road within an arbitrarily large road network is indexed according to its heading direction. On the right side, the shape-based matching phase is illustrated. Here, given a trajectory with known shape but unknown origin point, GASM computes the set of potential partial map matches. Subsequently, it selects the match that minimizes the bestfit Geolet function.

Experiments

In this section, we evaluate the effectiveness of GASM². First, we present the experimental setting, then we report and discuss the best performance achieved. Finally, we illustrate details of the hyperparameter tuning and the result of a sensitivity analysis w.r.t. some data properties.

Experimental Setting. Regrettably, only a handful of mobility datasets, such as GeoLife and Porto Taxi³, are available as open access [1,7]. However, these datasets possess limited geographic coverage, rendering them unsuitable for our study. Thus, we introduce a novel high-sample-rate dataset derived from the publicly accessible 2013 GPS traces on OpenStreetMap⁴. Although the initial OpenStreetMap dataset encompasses GPS trajectories spanning the entire globe, our analysis concentrates on the ten provinces in Tuscany, a region encompassing 22,985 km² in central Italy. The Mappymatch Python package⁵ was employed to map-match each trajectory, retaining only those trajectories with an average error of less than 10 m. The final dataset encompasses 358 distinct trajectories, covering a total travel distance of 5,024 km and described by 300,049 GPS points. Additional information on the types of roads traversed in each province in Tuscany, as per the OpenStreetMap taxonomy⁶, is presented in Table 2.


⁴ OpenStreetMap 2013 public GPS traces: https://t.ly/q7u2N
⁵ Mappymatch: https://t.ly/RHafS
⁶ Highway taxonomy: https://t.ly/NpxZv

Within the framework of our shape-based map-matching formulation, we aim to address the following questions. First, to what extent can GASM infer the original GPS coordinates without utilizing them for map matching? Second, how effectively can GASM reduce the number of potential alignments compared to the entire road network? The first question is evaluated through the metric of accuracy. On the other hand, the evaluation of the second question relies on the metric of selectivity, commonly employed in database literature [27]. Selectivity measures the reduction of potential alignments between a query result and the entire dataset. In our context, selectivity is defined as the ratio of matched road segments (|𝒴 |) to the total number of road segments (|𝐸|). For accuracy, higher values indicate better results, while for selectivity, lower values indicate better outcomes.

GASM Performance. Table 3 presents the performance metrics of GASM across individual provinces. To determine the optimal hyperparameters, a grid search was conducted over the values outlined in Table 1, specifically for the province of Grosseto. This process yielded the following hyperparameters: a resampling distance of 𝑑 = 10 meters, an alphabet size of 𝛼 = 8 symbols, and a street aggregation of ℎ = 2 hops. GASM showcases an impressive ability to deduce the original GPS coordinates, achieving an average light accuracy of 90.1%. Furthermore, it significantly narrows down the potential alignments, as indicated by the selectivity factor, reducing it to just 10.8% of the original road network. These commendable results are attained while maintaining a

Hyperparameters Tuning. In this section, we present the results of experiments conducted on the province of Grosseto while varying the hyperparameters detailed in Table 1.
Initially, we compute the Pearson correlation between the method's hyperparameters and two key performance metrics: the selectivity factor and accuracy. Figure 3 visually depicts the changes in performance metrics, emphasizing variations in the top three most influential hyperparameters, namely those exhibiting the highest absolute values of Pearson correlation. Notably, the most influential hyperparameter is the number of road segment aggregations (ℎ), demonstrating a correlation of −0.52 with the selectivity. Thus, increasing ℎ proves beneficial for GASM as it helps select fewer candidate road segments without significantly impacting accuracy. The alphabet size (𝛼) displays correlations of −0.44 and 0.40 with respect to the selectivity and accuracy, respectively. This hyperparameter introduces a trade-off, as increasing the number of symbols reduces selectivity but may lead to a slight decrease in accuracy. Finally, the resampling distance (𝑑) exhibits a correlation of 0.22 with accuracy. Interestingly, decreasing 𝑑 slightly enhances accuracy according to our observations.

Sensitivity Analysis. We delve here into the variations in performance with respect to the length of the query trajectory 𝑋. Additionally, we explore the discriminative nature of trajectories based on the type of road. Our hypothesis posits that straight streets, such as motorways, exhibit lower discriminative characteristics. Consequently, trajectories observed on such roads are more likely to avoid re-identification, suggesting enhanced geographic transferability. To investigate this, we identify the type of road traveled within each segment. In cases where multiple types of roads are encountered, we perform a majority vote weighted by road length. Additionally, to examine the influence of changes in the input data, we create random subtrajectories of varying lengths, including 100m, 150m, 500m, 1km, 1.5km, 5km, and 10km, derived from our OpenStreetMap dataset.
In order to assess our hypothesis, we evaluate the straightness [4] of each subtrajectory by calculating the ratio between the length of the shortest path from the origin to the destination and the length of the actual trajectory.
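As an illustration of this index, the sketch below computes straightness for a sequence of planar points. It assumes that the shortest path is the straight line between the first and last point and that coordinates are planar; the function name and the handling of the degenerate zero-length case are choices made for this example only.

```python
import math

def straightness(points):
    """Ratio of the straight-line origin-destination distance to the travelled length."""
    path = sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
    if path == 0:
        return 1.0  # no movement: convention assumed for this sketch
    (xs, ys), (xe, ye) = points[0], points[-1]
    return math.hypot(xe - xs, ye - ys) / path

print(straightness([(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]))  # 1.0
print(straightness([(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)]))   # 5/7, about 0.714
```

Values close to 1 correspond to nearly linear subtrajectories, while lower values indicate more sinuous ones.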

Figure 4 encapsulates these results. The initial plot on the left highlights a notable trend: an increase in subtrajectory length correlates with a rapid elevation in both accuracy and selectivity. In simpler terms, as the subtrajectory length extends, the model's precision improves. The central plot reveals that trajectory straightness has a negligible impact on the number of candidate matches. However, as trajectories become more linear, the accuracy experiences a decline. Finally, the rightmost plot illustrates the method's performance across various road types. This plot validates the findings of the straightness plot: roads with greater straightness, like motorways, pose the greatest challenge for re-identification. Conversely, more sinuous roads present a slightly higher difficulty in re-identification, reflected in a higher selectivity but with a concomitant boost in accuracy.

Conclusion

In this paper we have introduced GASM, a map matching method capable of determining a trajectory's position solely based on its shape. Our experiments showcase that GASM significantly reduces the number of potential alignments and deduces the original GPS coordinates with remarkable accuracy. Further analysis reveals that longer and less linear trajectories are more straightforward to map match. However, this observation raises concerns about the potential for shape-based methods to inadvertently learn geographic positions instead of focusing on other intrinsic features. As a part of future work, as outlined at the beginning, we aim to assess the geographic transferability of shape-based methods, such as Geolet, by incorporating GASM. Specifically, we propose giving more weight to the selection of discriminative subsequences with higher selectivity rather than basing the decision solely on a statistical test.

Definition 4 (Subtrajectory Transform). Given a dataset $\mathcal{X}$ and a set $\mathcal{S}$ containing $h$ subtrajectories, the subtrajectory transform converts $\mathcal{X} \in \mathbb{R}^{n \times m \times 3}$ into a real-valued matrix $T \in \mathbb{R}^{n \times h}$, obtained by taking the best fitting of each trajectory $X \in \mathcal{X}$ and each subtrajectory $S \in \mathcal{S}$. Movelet computes the best fitting as the minimal Euclidean Distance (ED) between $S$ and each subsequence of length $l = |S|$ of $X$, formally:

$$\mathrm{bestfit}_{\mathrm{Movelets}}(X, S) = \min_{j=0}^{m-l} \{ ED(X_{j:j+l}, S) \} \qquad (1)$$

Figure 1: Aho-Corasick automaton using symbolic sequences {𝐴𝐶, 𝐵, 𝐵𝐶𝐴, 𝐶}. Blue dashed arcs are suffix arcs, while green dotted arcs are the dictionary-suffix arcs.

Figure 2: Summary of GASM, depicting geographic automaton construction (left) and shape-based matching (right).

Figure 3: Hyperparameters influence: ℎ-hop aggregation (left), 𝛼 alphabet size (center), and 𝑑 resampling distance (right).

Figure 4: Influence of trajectory length (left), trajectory straightness (center), and kind of road (right) on performance metrics.

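| | | Length (km) | | Length (#points) | Kind of Road (%) | | | | |
| Province | #Trj | Total | Average (𝜎) | Average (𝜎) | Motorway | Trunk | Primary | Secondary | Minor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arezzo | 17 | 341 | 17.2 (18.38) | 853 (851) | 0.019 | 0.227 | 0.205 | 0.344 | 0.196 |
| Firenze | 58 | 1041 | 30.4 (37.31) | 1526 (1427) | 0.122 | 0.032 | 0.392 | 0.205 | 0.249 |
| Grosseto | 35 | 215 | 6.7 (9.35) | 578 (545) | 0.020 | 0.133 | 0.054 | 0.327 | 0.466 |
| Livorno | 53 | 410 | 18.5 (25.42) | 916 (645) | 0.410 | 0.043 | 0.089 | 0.144 | 0.313 |
| Lucca | 39 | 804 | 17.5 (15.49) | 1128 (1133) | 0.225 | 0.000 | 0.056 | 0.442 | 0.278 |
| M. Carrara | 46 | 267 | 5.1 (3.54) | 625 (430) | 0.187 | 0.173 | 0.160 | 0.267 | 0.212 |
| Pisa | 35 | 831 | 26.0 (23.04) | 1347 (1178) | 0.000 | 0.002 | 0.219 | 0.200 | 0.578 |
| Pistoia | 20 | 468 | 53.1 (31.07) | 929 (777) | 0.000 | 0.186 | 0.251 | 0.163 | 0.399 |
| Prato | 31 | 146 | 4.5 (2.37) | 660 (672) | 0.141 | 0.050 | 0.395 | 0.146 | 0.269 |
| Siena | 24 | 497 | 22.4 (16.54) | 983 (660) | 0.557 | 0.032 | 0.340 | 0.026 | 0.046 |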

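Table 2: Dataset description. Standard deviations (𝜎) are reported beside the average values.

| | Method Performances | | | Road Network Characteristics | | | |
| Province | Selectivity Factor ↓ | Accuracy ↑ | Building Time (s) ↓ | #Road Segments | #Intersections | Avg Node Degree | Length (km) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arezzo | 0.096 | 0.875 | 708.32 | 133028 | 54485 | 4.883 | 26852 |
| Siena | 0.068 | 1.000 | 516.37 | 175088 | 72079 | 4.858 | 28949 |
| Pistoia | 0.080 | 0.950 | 433.24 | 81870 | 34492 | 4.747 | 12363 |
| Lucca | 0.070 | 0.846 | 658.03 | 141149 | 59274 | 4.763 | 20328 |
| Firenze | 0.399 | 0.263 | 157.17 | 299312 | 119068 | 5.028 | 39025 |
| Grosseto | 0.060 | 0.823 | 421.66 | 121045 | 50014 | 4.841 | 26818 |
| Livorno | 0.069 | 1.000 | 277.38 | 96631 | 39613 | 4.879 | 11269 |
| M. Carrara | 0.056 | 0.978 | 294.50 | 71917 | 30007 | 4.783 | 10809 |
| Pisa | 0.081 | 0.857 | 1007.06 | 150954 | 62580 | 4.824 | 22396 |
| Prato | 0.105 | 0.936 | 190.77 | 45060 | 18794 | 4.795 | 5224 |
| Macro Avg (𝜎) | 0.108 (0.103) | 0.853 (0.217) | | | | | |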

Table 3: Selectivity factor, accuracy, automaton construction runtime, and other road network information.

Algorithm 2 (final steps): $Y^* \leftarrow \arg\min_{Y \in \mathcal{Y}} \mathrm{bestfit}_{\mathrm{Geolet}}(Y, X')$; return $Y^*$ // return the best match

¹ For each 𝑌 ⊆ 𝐸, we compute the Euclidean Distance (linear complexity), for all the possible 𝑚 − 𝑘 alignments of shape(𝑌 ) in 𝑋.
² Python code: https://t.ly/wVlXS. We ran our experiments on a 2x Intel Xeon Gold 6342 24-core CPU, limiting each test to use at most 12 cores.
³ GeoLife: https://t.ly/6VJ-E. Porto: https://t.ly/0GMR9.

Acknowledgments

This work is partially supported by the EU NextGenerationEU programme under the funding schemes PNRR "SoBigData.it - Strengthening the Italian RI for Social Mining and Big Data Analytics" - Prot. IR0000013, H2020-INFRAIA-2019-1: Res. Infr. G.A. 871042 SoBigData++, and GreenDatAI G.A. 101070416.

References

[1] C. L. Da Silva. A survey and comparison of trajectory classification methods. BRACIS, IEEE, 2019.
[2] A. Bolbol. Inferring hybrid transportation modes from sparse GPS data using a moving window SVM classification. CEUS 36, 2012.
[3] S. Dodge. Revealing the physics of movement: Comparing the similarity of movement characteristics. CEUS 33, 2009.
[4] P. J. Almeida. Indices of movement behaviour: conceptual background, effects of scale and location errors. Zoologia 27, 2010.
[5] I. Kontopoulos. Traclets: Harnessing the power of computer vision for trajectory classification. arXiv:2205.13880, 2022.
[6] C. A. Ferrero. MOVELETS: exploring relevant subtrajectories for robust trajectory classification. SAC, ACM, 2018.
[7] C. Landi. Geolet: An interpretable model for trajectory classification. Springer, 13876, 2023.
[8] F. Veronesi. Assessing accuracy and geographical transferability of machine learning algorithms for wind speed modelling. AGILE, LNGC, 2017.
[9] M. Nanni. City indicators for geographical transfer learning: an application to crash prediction. GeoInformatica 26, 2022.
[10] J. Lines. A shapelet transform for time series classification. ACM SIGKDD, 2012.
[11] C. White. Some map matching algorithms for personal navigation assistants. TRC 8, 2000.
[12] D. Bernstein. An introduction to map matching for personal navigation assistants. 1996.
[13] M. A. Quddus. A general map matching algorithm for transport telematics applications. GPS Solutions 7, 2003.
[14] Y. Liu, Z. Li. A novel algorithm of low sampling rate GPS trajectories on map-matching. EURASIP 2017, 30, 2017.
[15] X. Liu. A st-crf map-matching method for low-frequency floating car data. TITS 18, 2016.
[16] L. Jiang. From driving trajectories to driving paths: a survey on map-matching algorithms. CCF TPCI 4, 2022.
[17] P. Cintia, M. Nanni. An effective time-aware map matching process for low sampling gps data. arXiv preprint arXiv:1603.07376, 2016.
[18] G. Hu. If-matching: Towards accurate map-matching with information fusion. TKDE 29, 2016.
[19] L. Wang. Smart city development with urban transfer learning. Computer 51, 2018.
[20] L. Wang. Cross-city transfer learning for deep spatio-temporal prediction. IJCAI, ijcai.org, 2019.
[21] R. Trasarti. Myway: Location prediction via mobility profiling. Inf. Syst. 64, 2017.
[22] P.-N. Tan, M. Steinbach, V. Kumar. Introduction to data mining. Pearson Education India, 2016.
[23] A. J. Bagnall. The great time series classification bake off. CoRR abs/1602.01711, 2016.
[24] A. V. Aho. Efficient string matching: An aid to bibliographic search. ACM 18, 1975.
[25] E. M. McCreight. A space-economical suffix tree construction algorithm. JACM 23, 1976.
[26] J. Lin. A symbolic representation of time series, with implications for streaming algorithms. DMKD, ACM, 2003.
[27] S. Acharya. Selectivity estimation in spatial databases. SIGMOD, ACM, 1999.