On feature selection and evaluation of transportation mode prediction strategies

Mohammad Etemad (Institute for Big Data Analytics, Dalhousie University, Halifax, NS, Canada, etemad@dal.ca)
Amílcar Soares (Institute for Big Data Analytics, Dalhousie University, Halifax, NS, Canada, amilcar.soares@dal.ca)
Stan Matwin* (Institute for Big Data Analytics, Dalhousie University, Halifax, NS, Canada, stan@cs.dal.ca)
Luis Torgo (Faculty of Computer Science, Dalhousie University, Halifax, NS, Canada, ltorgo@dal.ca)

* Also with the Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland.

© 2019 Copyright held by the owner/author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2019 Joint Conference (March 26, 2019) on CEUR-WS.org. Distribution of this paper is permitted under the terms of the Creative Commons license CC-by-nc-nd 4.0.

ABSTRACT
Transportation mode prediction is a fundamental task for decision making in smart cities and traffic management systems. Traffic policies based on trajectory mining can save money and time for authorities and the public, and may reduce fuel consumption and commute time. Since the number of features that may be used to predict a user's transportation mode can be substantial, finding the subset of features that maximizes a performance measure is worth investigating. In this work, we explore a wrapper method and an information-theoretical method to find the best subset of trajectory features for a transportation mode dataset. Our results were compared with two related papers that applied deep learning methods, and showed that our approach achieves better performance. Furthermore, two types of cross-validation were investigated, and the performance results show that random cross-validation may provide overestimated results.

KEYWORDS
Trajectory mining, feature selection, cross-validation

1 INTRODUCTION
Trajectory mining is a very active topic, since positioning devices are now used to track people, vehicles, vessels, natural phenomena, and animals. Its applications include, but are not limited to, transportation mode detection [3, 6, 7, 31, 33], fishing detection [4], tourism [8], vessel monitoring [5], and animal behaviour analysis [9]. A number of topics in this field still need further investigation, such as high-performance trajectory classification methods [3, 6, 20, 31, 33], accurate trajectory segmentation methods [28, 30, 34], trajectory similarity and clustering [10, 17], dealing with trajectory uncertainty [15], active learning [29], and semantic trajectories [2, 22, 24]. These topics are highly correlated, and solving one of them requires, to some extent, exploring more than one.

As one of the trajectory mining applications, transportation mode prediction is a fundamental task for decision making in smart cities and traffic management systems. Traffic policies designed based on trajectory mining can save money and time for authorities and the public, may reduce fuel consumption and commute time, and may provide more pleasant moments for residents and tourists. Since a trajectory is a collection of geolocations captured through time, extracting features that describe the behavior of a trajectory is of prime importance. The number of features that can be generated for trajectory data is significant; however, some of these features are more important than others for the transportation mode prediction task. Selecting the best subset of features not only saves processing time but may also increase the performance of the learning algorithm. The feature selection problem and the trajectory classification task are the focus of this work.

The contributions of this paper are listed below.
• We investigated several classifiers using their default parameter values and selected the one with the best performance.
• Using two distinct feature selection approaches, we investigated the best subset of features for transportation mode prediction.
• After finding the best subset of features, we compared our results with the works of [3] and [6]. The results showed that our approach performed better than the others from the literature.
• Finally, we investigated the differences between the two cross-validation methods used in the literature on transportation mode prediction. The results show that random cross-validation may suggest overestimated results in comparison to user-oriented cross-validation.

The rest of this work is structured as follows. Related works are reviewed in Section 2. Basic concepts and definitions are provided in Section 3, and the proposed framework is presented in Section 4. We report our experimental results in Section 5. Finally, conclusions and future work are presented in Section 6.

2 RELATED WORKS
Feature engineering is an essential part of building a learning algorithm. Some algorithms extract features artificially using representation learning methods, while other studies select a subset of handcrafted features. Both approaches have advantages, such as faster learning, less storage space, improved learning performance, and more general models [18]. The two approaches differ from two perspectives. First, artificial feature extraction generates a new set of features by learning, while feature selection chooses a subset of existing handcrafted ones. Second, selecting handcrafted features produces more readable and interpretable models than artificially extracted features [18].
This work focuses on the handcrafted feature selection task.

Feature selection methods can be categorized into three general groups: filter methods, wrapper methods, and embedded methods [12]. Filter methods are independent of the learning algorithm: they select features based on the nature of the data, regardless of the learning algorithm [18]. Wrapper methods, on the other hand, rely on some kind of search, such as sequential, best-first, or branch and bound, to find the subset that yields the highest score for a selected learning algorithm [18]. Embedded methods combine filter and wrapper ideas [18]. Feature selection methods can also be grouped by the type of data they assume. Methods that rely on the i.i.d. (independent and identically distributed) assumption are conventional feature selection methods [18], such as Laplacian methods [14] and spectral feature selection methods [32]; they are not designed to handle heterogeneous or auto-correlated data. Some feature selection methods have been introduced to handle heterogeneous and stream data, most of them working on graph structures, such as [11].

Conventional feature selection methods fall into four groups: similarity-based methods, like Laplacian methods [14]; information-theoretical methods [26]; sparse learning methods, such as [19]; and statistical methods, like Chi2 [21]. Similarity-based approaches are independent of the learning algorithm, and most of them cannot handle feature redundancy or correlation between features [21]. Likewise, statistical methods like chi-square cannot handle feature redundancy, require discretization strategies, and are not effective in high-dimensional spaces [21]. Since our data is not sparse, and sparse learning methods need to overcome the complexity of their optimization procedures, they were not candidates for our experiments. Information-theoretical methods, on the other hand, can handle both feature relevance and redundancy [21], and the selected features generalize across learning tasks. However, information gain, which is at the core of information-theoretical methods, assumes that samples are independently and identically distributed. Finally, the wrapper method only sees the score of the learning algorithm and tries to maximize it.

The most common evaluation metric reported in the related works is the accuracy of the models. Therefore, we use the accuracy metric to compare our work with others from the literature. Since the data was imbalanced, we also report the f-score, which gives equal importance to precision and recall. Although most related works apply the accuracy metric, it is calculated using different methods, including random cross-validation, cross-validation dividing users, cross-validation mixing users, and a simple division into training and test sets without cross-validation. The latter is a weak method used only in [35]. Random (conventional) cross-validation was applied in [31], [20], and [3]. [33] mixed the training and test sets by user, so that 70% of a user's trajectories go to the training set and the rest to the test set. Only [6] performed cross-validation by dividing users between the training and test sets. Because trajectory data has spatial and temporal dimensions, and users can be placed in the same semantic hierarchical structure (e.g., students, workers, visitors, and teachers), a conventional cross-validation method could provide overestimated results, as studied in [27].

3 NOTATIONS AND DEFINITIONS

Definition 3.1. A trajectory point li ∈ L is a tuple li = (xi, yi, ti), where xi is the longitude, varying from 0° to ±180°; yi is the latitude, varying from 0° to ±90°; ti (ti < ti+1) is the capturing time of the moving object; and L is the set of all trajectory points.

A trajectory point can be assigned features that describe different attributes of the moving object at a specific time-stamp and location. The time-stamp and the location are two dimensions that make trajectory points spatio-temporal data with two important properties: (i) auto-correlation and (ii) heterogeneity [1]. These properties make conventional cross-validation less suitable [27].

Definition 3.2. A raw trajectory, or simply a trajectory, τ is a sequence of trajectory points captured through time, where τ = (li, li+1, ..., ln), li ∈ L, i ≤ n.

Definition 3.3. A sub-trajectory is one of the consecutive subsequences of a raw trajectory, generated by splitting the raw trajectory into two or more sub-trajectories.

For example, if we have one split point k and τ1 is a raw trajectory, then s1 = (li, li+1, ..., lk) and s2 = (lk+1, lk+2, ..., ln) are two sub-trajectories generated from τ1.

Definition 3.4. The process of generating sub-trajectories from a raw trajectory is called segmentation.

We used a daily segmentation of raw trajectories and then segmented the data using the transportation mode annotations to partition the data. This approach is also used in [6] and [3].

Definition 3.5. A point feature is a measured value F^p assigned to each trajectory point of a sub-trajectory s:

F^p = (f_i, f_{i+1}, ..., f_n)    (1)

Notation 1 shows the feature F^p for a sub-trajectory s. For example, speed can be a point feature, since we can calculate the speed of a moving object at each trajectory point. Since we need two trajectory points to calculate speed, we assume the speed of the first trajectory point is equal to the speed of the second trajectory point.

Definition 3.6. A trajectory feature is a measured value F_t assigned to a sub-trajectory s:

F_t = (Σ f_k) / n    (2)

Equation 2 shows the trajectory feature F_t (here illustrated with the mean) for a sub-trajectory s. For example, the speed mean can be a trajectory feature, since we can calculate the mean speed of a moving object over a sub-trajectory. F_t^p denotes all trajectory features generated from a point feature p; for example, F_t^speed represents all the trajectory features derived from the speed point feature. Moreover, F_mean^speed denotes the mean of the trajectory features derived from the speed point feature.

4 THE FRAMEWORK
In this section, the sequence of the eight steps of the framework is explained (Figure 1). The first step groups the trajectory points by trajectory id to create daily sub-trajectories (segmentation). Sub-trajectories with fewer than ten trajectory points were discarded to avoid generating low-quality trajectories.

Figure 1: The steps of the applied framework to predict transportation modes.

Point features, including speed, acceleration, bearing, jerk, bearing rate, and the rate of the bearing rate, were generated in step two. The features speed, acceleration, and bearing were first introduced in [34], and jerk was proposed in [3]. The very first point feature that we generated was duration, the time difference between two trajectory points. This feature gives us essential information, including some of the segmentation position points and signal-loss points, and is useful in calculating point features such as speed and acceleration.
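A minimal vectorized sketch of these point-feature computations follows (illustrative code, not the authors' TrajLib implementation; the haversine distance is computed as in the framework's second step, and the 6371 km mean Earth radius is an assumption):

```python
import numpy as np

def haversine_m(lat_deg, lon_deg):
    """Great-circle distance (meters) between consecutive trajectory points."""
    R = 6371000.0  # assumed mean Earth radius in meters
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    dlat, dlon = np.diff(lat), np.diff(lon)
    a = np.sin(dlat / 2) ** 2 + np.cos(lat[:-1]) * np.cos(lat[1:]) * np.sin(dlon / 2) ** 2
    return 2.0 * R * np.arcsin(np.sqrt(a))

def speed_mps(lat_deg, lon_deg, t_sec):
    """Speed point feature: distance over duration for consecutive fixes."""
    distance = haversine_m(lat_deg, lon_deg)            # meters
    duration = np.diff(np.asarray(t_sec, dtype=float))  # seconds
    return distance / duration
```

Since one value fewer is produced than there are points, the speed of the first trajectory point of a sub-trajectory can then be copied from the second point, as described above.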
The distance was calculated using the haversine formula. Having duration and distance as two point features, we calculate speed, acceleration, and jerk using Equations 3, 4, and 5, respectively:

S_i = Distance_i / Duration_i    (3)

A_{i+1} = (S_{i+1} - S_i) / ∆t    (4)

J_{i+1} = (A_{i+1} - A_i) / ∆t    (5)

A function to calculate the bearing B between two consecutive points was also implemented and is detailed in Equation 6, where (ϕ_i, λ_i) is the start point and (ϕ_{i+1}, λ_{i+1}) is the end point:

B_{i+1} = atan2( sin(λ_{i+1} - λ_i) cos ϕ_{i+1} , cos ϕ_i sin ϕ_{i+1} - sin ϕ_i cos ϕ_{i+1} cos(λ_{i+1} - λ_i) )    (6)

Two new point features, named bearing rate and rate of the bearing rate, were introduced in [7]. Applying Equation 7, we computed the bearing rate, where B_i and B_{i+1} are the bearing values at points i and i+1 and ∆t is the time difference:

B_rate(i+1) = (B_{i+1} - B_i) / ∆t    (7)

The rate of the bearing rate is computed using Equation 8:

B_rrate(i+1) = (B_rate(i+1) - B_rate(i)) / ∆t    (8)

Since extensive calculations are performed over trajectory points, an efficient way to evaluate these equations for each trajectory was necessary. Therefore, the code was written in a vectorized manner in Python, which is faster than other publicly available Python implementations of the bearing calculation. Further performance gains would be possible in languages like C/C++.

After calculating the point features for each trajectory, the trajectory features were extracted in step three. Trajectory features were divided into two types: global trajectory features and local trajectory features. Global features, like the minimum, maximum, mean, median, and standard deviation, summarize information about the whole trajectory, while local trajectory features, the percentiles (10, 25, 50, 75, and 90), describe behavior related to part of a trajectory. The local trajectory features extracted in this work were the percentiles of every point feature, and five different global trajectory features were used in the models tested in this work. In summary, we computed 70 trajectory features (10 statistical measures, five global and five local, calculated for 7 point features) for each sample trajectory.

In step 4, two feature selection approaches were performed: wrapper search and information-theoretical feature importance. According to the best accuracy results on the development set, a subset of the top 19 features was selected in step 5. The code implementation of all these steps is available at https://github.com/metemaad/TrajLib.

In step 6, the framework optionally deals with noise in the data; we ran the experiments both with and without this step. Finally, we normalized the features (step 7) using the Min-Max normalization method to avoid saturation, since this method preserves the relationships between values while transforming the features to the same range, improving the quality of the classification process [13]. Another possible method is Z normalization; however, finding the best normalization method was out of the scope of this work.

5 EXPERIMENTS
In this section, we detail the four experiments performed in this work. We used the GeoLife dataset [34]. This dataset has 5,504,363 GPS records collected by 69 users and is labeled with eleven transportation modes: taxi (4.41%); car (9.40%); train (10.19%); subway (5.68%); walk (29.35%); airplane (0.16%); boat (0.06%); bike (17.34%); run (0.03%); motorcycle (0.006%); and bus (23.33%).

Two primary sources of uncertainty in the GeoLife dataset are device error and human error. Device inaccuracy can be categorized into two major groups, systematic errors and random errors [16]. Systematic errors occur when the recording device cannot find enough satellites to provide precise data; random errors can happen because of atmospheric and ionospheric effects. Furthermore, the data annotation was performed after each tracking session, as explained in the GeoLife dataset documentation [34]. As humans, we are all subject to failures in providing precise information; it is possible that some users forgot to annotate the trajectory when they switched from one transportation mode to another. For example, changes in the speed pattern might be a trace of such human error.

Moreover, we divided the data into two folds, a development set and a validation set, in such a way that each user can be in either the development set or the validation set; there is no overlap in terms of users. This division is applied for user-oriented cross-validation. We divided the validation fold into five folds to perform cross-validation and used it to compare our results with related work.

The best classifier using default input parameters was found in our first experiment (Section 5.1); check the scikit-learn documentation (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) for the classifiers' default parameter values. Tuning the classifier parameters might lead to a better classifier, but a grid search is expensive and does not change the framework. In our second experiment (Section 5.2), the wrapper and information-theoretical methods are used to search for the best subset of our 70 features for the transportation mode prediction task. The third experiment (Section 5.3) is a comparison between [6], [3], and our implementation. In the last experiment (Section 5.4), the type of cross-validation was investigated.

To avoid using non-parametric statistical tests, we repeated the experiments with different seeds and collected more than 30 samples for the statistical tests. According to the central limit theorem, we can assume these samples follow a normal distribution; therefore, t-test results are reported.

5.1 Classifier selection
In this experiment, we investigated which of six classifiers is the best. The experiment settings use conventional cross-validation and the transportation mode prediction task as described in [3]. XGBoost, SVM, decision tree, random forest, neural network, and AdaBoost are the six classifiers applied in the reviewed literature [7, 31, 33, 35]; our implementation is available at https://github.com/metemaad/trajpred. The dataset was filtered based on the labels applied in [3] (e.g., walking, train, bus, bike, driving), and no noise removal method was applied. The classifiers mentioned above were trained, and the accuracy metric was calculated using random cross-validation, similar to [20], [31], and [3]. This experiment was repeated for eight randomly selected seeds (8, 65, 44, 7, 99, 654, 127, 653) to generate more than 30 result samples, which makes it safe to assume a normal distribution of the results based on the central limit theorem.

The results of cross-validation accuracy, presented in Figure 2, show that the random forest performs better than the other models (µ_accuracy = 0.8189, σ = 0.10%) on the development set. The results of cross-validation f-score, presented in Figure 3, likewise show that the random forest performs better than the other models (µ_f1 = 0.8179, σ = 0.12%) on the development set. The second-best model was XGBoost (µ_accuracy = 0.8245, σ = 0.11%): a paired t-test indicated that the random forest results were not statistically significantly higher than the XGBoost results, but since XGBoost has a higher variance than random forest, we decided to rank random forest first. On the other hand, paired t-tests indicated that the random forest results were statistically significantly higher than those of the SVM, decision tree, neural network, and AdaBoost classifiers.

Figure 2: Among the trained classifiers, random forest achieved the highest mean accuracy.

Figure 3: Among the trained classifiers, random forest achieved the highest mean F-score.

5.2 Feature selection using wrapper and information-theoretical methods
The second experiment aims to select the best features for the transportation mode prediction task on the GeoLife dataset. We selected one method from the filter category, an information-theoretical method, to observe the effect of data heterogeneity on feature selection, and another method from the wrapper category, the full-search wrapper method. Filter methods suffer from the i.i.d. assumption, while wrapper methods do not; therefore, comparing these two methods shows the importance of taking the heterogeneity of trajectory features into account.

We selected the wrapper feature selection method because it can be used with any classifier. Using this approach, we first defined an empty set of selected features. Then, we searched all the trajectory features one by one to find the best feature to append to the selected feature set, using the maximum accuracy score as the selection criterion. We then removed the chosen feature from the feature set and repeated the search over the union of the selected features and each next candidate feature. We used the labels applied in [6] and the same cross-validation technique. The results are shown in Figure 4 and suggest that the top 19 features yield the highest accuracy. Therefore, we selected this subset as the best subset for classification purposes using the random forest algorithm.

Figure 4: Accuracy of the random forest classifier for incrementally appended features ranked by random forest feature importance.
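The greedy wrapper search described above can be sketched as follows (the additive scoring function and feature names are hypothetical stand-ins for illustration; in the experiment, the score is the cross-validated accuracy of the classifier on the candidate subset):

```python
def forward_select(features, score, k):
    """Greedy wrapper search: repeatedly append the feature that maximizes score(subset)."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature gains standing in for cross-validated accuracy.
gain = {"speed_p90": 0.5, "speed_mean": 0.4, "bearing_rate": 0.2, "jerk_max": 0.1}
best_subset = forward_select(gain, lambda s: sum(gain[f] for f in s), k=2)
```

With a real classifier in the loop, `score` would retrain and cross-validate the model for every candidate subset, which is what makes the full wrapper search expensive.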
Information-theoretical feature selection is one of the methods widely used to select essential features, and random forest is a classifier with embedded feature selection based on information-theoretical metrics. We calculated the feature importance using random forest; then, each feature was appended to the selected feature set in importance order, and the accuracy score of the random forest classifier was calculated. User-oriented cross-validation was used here, and the target labels are similar to [6]. Figure 5 shows the results of cross-validation for appending features with respect to the importance rank suggested by the random forest. We chose the wrapper approach results, since that approach produces a statistically significantly higher accuracy score.

Figure 5: Accuracy of the random forest classifier for incrementally appending the best features.

5.3 Comparison with the related work
In this third experiment, we filtered the transportation modes that were used by [6] for evaluation. We divided the validation fold into training and test folds in such a way that each user can appear only in either the training fold or the test fold. The top 19 features, the best feature subset found in Section 5.2, were used in this experiment. We approximately divided 80% of the data into the training set and 20% into the test set.

We selected [6] because it is the only paper that divided the dataset in a way that isolates users between the training and test sets. Moreover, that work applied handcrafted features and interpretable classifiers, while [3] did not isolate users and used representation learning features. These two works are therefore at the two ends of the spectrum, and comparing our results with theirs may provide insights for validating our results.

We assume that the Bayes error is the minimum possible error and that human error is close to the Bayes error [23]. Avoidable bias is defined as the difference between the training error and the human error, and achieving performance near human performance is the primary objective in each task. Recent advancements in deep learning have led to performance levels even above human performance on some tasks, by using large samples and scrutinizing the data to finely clean it. However, "we cannot do better than bayes error unless we are overfitting" [23]. The noise in GPS data and the human annotation error discussed above suggest that the avoidable bias is more than five percent. This reasoning was our basis for excluding papers that reported more than 95% accuracy.

Thus, we compared our per-segment accuracy results, repeated for 8 different seeds, against the mean accuracy of [6], 67.9%. A one-sample t-test indicated that our accuracy results (70.97%) are statistically significantly higher than the results of [6] (67.9%), p = 0.0182.

The label set of [3] is walking, train, bus, bike, taxi, subway, and car, where taxi and car are merged into a driving class, and subway and train are merged into a train class. We filtered the GeoLife data to obtain the same subsets as reported in [3]. Then, we randomly selected 80% of the data for training and the rest as the test set, applied five-fold cross-validation, and repeated this for 8 different seeds. The best subset of features was the same as in the previous experiment (Section 5.2). Running the random forest classifier with 50 estimators, using the scikit-learn implementation [25], resulted in a mean accuracy of 87.16% for the five-fold cross-validation. A one-sample t-test indicated that our accuracy results (87.16%) are statistically significantly higher than the results of [3] (84.8%), p = 2.27e-12.

We avoided using the noise removal method in this experiment because, in practice, one does not have access to the labels of the test dataset, and using this method would only increase our accuracy unrealistically.

5.4 Effects of types of cross-validation
To visualize the effect of the type of cross-validation on the transportation mode prediction task, we set up a controlled experiment. We used the same classifiers and the same features to calculate the cross-validation accuracy on the whole dataset. Only the type of cross-validation differs in this experiment: one is random cross-validation, and the other is user-oriented cross-validation. Figure 6 shows that there is a considerable difference between the cross-validation accuracy results of user-oriented cross-validation and random cross-validation. Furthermore, Figure 7 shows a similarly considerable difference between the cross-validation f-score results.

Figure 6: The accuracy cross-validation results for user-oriented cross-validation and random cross-validation.

Figure 7: The F-score cross-validation results for user-oriented cross-validation and random cross-validation.

These results indicate that random cross-validation provides overestimated accuracy and f-score results. Since the correlation between user-oriented cross-validation results is lower than between random cross-validation results, proposing a specific cross-validation method for evaluating transportation mode prediction is a topic that needs attention.
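The user-oriented split used in this experiment can be sketched by assigning whole users to folds, so that no user contributes segments to both sides of any split (a hypothetical helper, not the authors' code; scikit-learn's GroupKFold offers similar behavior):

```python
import random
from collections import defaultdict

def user_oriented_folds(user_ids, n_folds=5, seed=0):
    """Partition sample indices into folds so each user lands in exactly one fold."""
    users = sorted(set(user_ids))
    rng = random.Random(seed)
    rng.shuffle(users)
    fold_of_user = {u: i % n_folds for i, u in enumerate(users)}
    folds = defaultdict(list)
    for idx, u in enumerate(user_ids):
        folds[fold_of_user[u]].append(idx)
    return [folds[i] for i in range(n_folds)]
```

Random cross-validation, by contrast, shuffles individual segments into folds, so segments of the same user can appear on both sides of a split, which is one plausible source of the overestimation observed here.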
6 CONCLUSIONS
In this work, we reviewed some recent transportation mode prediction methods and feature selection methods. We proposed a framework for transportation mode prediction, and four experiments were conducted to cover different aspects of the task.

First, the performance of six recently used classifiers for transportation mode prediction was evaluated. The results showed that the random forest classifier performs best among all the evaluated classifiers. The SVM was the worst classifier, and the accuracy of XGBoost was competitive with that of the random forest classifier.

In the second experiment, the effect of the features was evaluated using two different approaches, the wrapper method and the information-theoretical method. The wrapper method shows that we can achieve the highest accuracy using the top 19 features. Both approaches suggest that F_p90^speed (the 90th percentile of the speed, as defined in Section 3) is the most essential feature among all 70 introduced features. This feature is robust to noise, since outlier values do not contribute to the calculation of the 90th percentile.

In the third experiment, the best model was compared with the results reported in [6] and [3]. The results show that our suggested model achieved a higher accuracy. Our features are readable and interpretable in comparison to [6], and our model has a lower computational cost.

Finally, we investigated the effects of user-oriented cross-validation and random cross-validation in the last experiment. The results showed that random cross-validation provides overestimated results in terms of the analyzed performance measures.

We intend to extend this work in several directions. The spatiotemporal characteristics of trajectory data (e.g., autocorrelation and heterogeneity) are not taken into account in most of the works from the literature. We plan to fine-tune the classification models with grid search and automatic methods (e.g., genetic algorithms, racing algorithms, and meta-learning). We also intend to investigate in depth the effects of cross-validation and other strategies, like holdout, on trajectory data. Finally, space and time dependencies can also be explored to tailor features for transportation mode prediction.

ACKNOWLEDGMENTS
The authors would like to thank NSERC (Natural Sciences and Engineering Research Council of Canada) for financial support.

REFERENCES
[1] Gowtham Atluri, Anuj Karpatne, and Vipin Kumar. 2017. Spatio-Temporal Data Mining: A Survey of Problems and Methods. arXiv:1711.04710 (2017).
[2] Vania Bogorny, Chiara Renso, Artur Ribeiro de Aquino, Fernando de Lucca Siqueira, and Luis Otavio Alvares. 2014. CONSTAnT - a conceptual data model for semantic trajectories of moving objects. Transactions in GIS 18, 1 (2014), 66-88.
[3] Sina Dabiri and Kevin Heaslip. 2018. Inferring transportation modes from GPS trajectories using a convolutional neural network. Transportation Research Part C: Emerging Technologies 86 (2018), 360-371.
[4] Erico N de Souza, Kristina Boerder, Stan Matwin, and Boris Worm. 2016. Improving fishing pattern detection from satellite AIS using data mining and machine learning. PloS one 11, 7 (2016), e0158248.
[5] Renata Dividino, Amilcar Soares, Stan Matwin, Anthony W Isenor, Sean Webb, and Matthew Brousseau. 2018. Semantic Integration of Real-Time Heterogeneous Data Streams for Ocean-related Decision Making. In Big Data and Artificial Intelligence for Military Decision Making. STO. https://doi.org/10.14339/STO-MP-IST-160-S1-3-PDF
[6] Yuki Endo, Hiroyuki Toda, Kyosuke Nishida, and Akihisa Kawanobe. 2016. Deep feature extraction from trajectories for transportation mode estimation. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 54-66.
[7] Mohammad Etemad, Amílcar Soares Júnior, and Stan Matwin. 2018. Predicting Transportation Modes of GPS Trajectories using Feature Engineering and Noise Removal. In Advances in AI: 31st Canadian Conference on AI, Toronto, ON, Canada. Springer, 259-264.
[8] Shanshan Feng, Gao Cong, Bo An, and Yeow Meng Chee. 2017. POI2Vec: Geographical Latent Representation for Predicting Future Visitors. In AAAI.
[15] Sungsoon Hwang, Cynthia VanDeMark, Navdeep Dhatt, Sai V Yalla, and Ryan T Crews. 2018. Segmenting human trajectory data by movement states while addressing signal loss and signal noise. International Journal of Geographical Information Science (2018), 1-22.
[16] Jungwook Jun, Randall Guensler, and Jennifer Ogle. 2006. Smoothing methods to minimize impact of global positioning system random error on travel distance, speed, and acceleration profile estimates. Transportation Research Record 1972 (2006), 141-150.
[17] Hye-Young Kang, Joon-Seok Kim, and Ki-Joune Li. 2009. Similarity measures for trajectory of moving objects in cellular space. In SIGAPP09. 1325-1330.
[18] Jundong Li, Kewei Cheng, Suhang Wang, Fred Morstatter, Robert P Trevino, Jiliang Tang, and Huan Liu. 2017. Feature selection: A data perspective. CSUR 50, 6 (2017), 94.
[19] Zechao Li, Yi Yang, Jing Liu, Xiaofang Zhou, Hanqing Lu, et al. 2012. Unsupervised feature selection using nonnegative spectral analysis. In AAAI, Vol. 2.
[20] Hongbin Liu and Ickjai Lee. 2017. End-to-end trajectory transportation mode classification using Bi-LSTM recurrent neural network. In Intelligent Systems and Knowledge Engineering (ISKE), 2017 12th International Conference on. IEEE, 1-5.
[21] Huan Liu and Rudy Setiono. 1995. Chi2: Feature selection and discretization of numeric attributes. In Tools with Artificial Intelligence, 1995, Proceedings, Seventh International Conference on. IEEE, 388-391.
[22] B. N. Moreno, A. Soares Júnior, V. C. Times, P. Tedesco, and Stan Matwin. 2014. Weka-SAT: A Hierarchical Context-Based Inference Engine to Enrich Trajectories with Semantics. In Advances in Artificial Intelligence. Springer International Publishing, Cham, 333-338. https://doi.org/10.1007/978-3-319-06483-3_34
[23] Andrew Ng. 2016. Nuts and bolts of building AI applications using Deep Learning. NIPS.
[24] Christine Parent, Stefano Spaccapietra, Chiara Renso, Gennady Andrienko, Natalia Andrienko, Vania Bogorny, Maria Luisa Damiani, Aris Gkoulalas-Divanis, Jose Macedo, Nikos Pelekis, Yannis Theodoridis, and Zhixian Yan. 2013. Semantic Trajectories Modeling and Analysis. ACM Comput. Surv. 45, 4, Article 42 (Aug. 2013), 32 pages.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. JMLR 12 (2011), 2825-2830.
[26] Hanchuan Peng, Fuhui Long, and Chris Ding. 2005. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 8 (2005), 1226-1238.
[27] David R Roberts, Volker Bahn, Simone Ciuti, Mark S Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein, José J Lahoz-Monfort, Boris Schröder, Wilfried Thuiller, et al. 2017. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40, 8 (2017), 913-929.
[28] A. Soares Júnior, B. N. Moreno, V. C. Times, S. Matwin, and L. A. F. Cabral. 2015. GRASP-UTS: an algorithm for unsupervised trajectory segmentation. International Journal of Geographical Information Science 29, 1 (2015), 46-68.
[29] A. Soares Júnior, C. Renso, and S. Matwin. 2017. ANALYTiC: An Active Learning System for Trajectory Classification. IEEE Computer Graphics and Applications 37, 5 (2017), 28-39. https://doi.org/10.1109/MCG.2017.3621221
[30] A. Soares Júnior, V. Cesario Times, C. Renso, S. Matwin, and L. A. F. Cabral. 2018. A Semi-Supervised Approach for the Semantic Segmentation of Trajectories. In 2018 19th IEEE International Conference on Mobile Data Management (MDM). 145-154.
[31] Xiao. 2017. Identifying Different Transportation Modes from Trajectory Data Using Tree-Based Ensemble Classifiers. ISPRS 6, 2 (2017), 57.
[32] Zheng Zhao and Huan Liu. 2007. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 1151-1157.
[33] Yu Zheng, Yukun Chen, Quannan Li, Xing Xie, and Wei-Ying Ma. 2010. Understanding transportation modes based on GPS data for web applications. TWEB 4, 1 (2010), 1.
[34] Yu Zheng, Quannan Li, Yukun Chen, Xing Xie, and Wei-Ying Ma. 2008. Understanding mobility based on GPS data. In UbiComp 10th. ACM, 312-321.
[35] Qiuhui Zhu, Min Zhu, Mingzhao Li, Min Fu, Zhibiao Huang, Qihong Gan, and Zhenghao Zhou. 2018.
Transportation modes behaviour analysis based [9] Sabrina Fossette, Victoria J Hobson, Charlotte Girard, Beatriz Calmettes, on raw GPS dataset. International Journal of Embedded Systems 10, 2 (2018), Philippe Gaspar, Jean-Yves Georges, and Graeme C Hays. 2010. Spatio- 126–136. temporal foraging patterns of a giant zooplanktivore, the leatherback turtle. Journal of Marine systems 81, 3 (2010), 225–234. [10] Andre Salvaro Furtado, Laercio Lima Pilla, and Vania Bogorny. 2018. A branch and bound strategy for Fast Trajectory Similarity Measuring. Data Knowledge Engineering 115 (2018), 16 – 31. https://doi.org/10.1016/j.datak.2018.01.003 [11] Quanquan Gu and Jiawei Han. 2011. Towards feature selection in network. In Proceedings of the 20th ACM ICIKM. ACM, 1175–1184. [12] Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of ML research 3, Mar (2003), 1157–1182. [13] Jiawei Han, Jian Pei, and Micheline Kamber. 2011. Data mining: concepts and techniques. Elsevier. [14] X He, D Cai, and P Niyogi. 2005. Laplacian Score for Feature Selection, Advances in Nerual Information Processing Systems. (2005).