=Paper=
{{Paper
|id=Vol-2083/paper-18
|storemode=property
|title=Mining Vessel Trajectory Data for Patterns of Search and Rescue
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-18.pdf
|volume=Vol-2083
|authors=Konstantinos Chatzikokolakis,Dimitrios Zissis,Giannis Spiliopoulos,Konstantinos Tserpes
|dblpUrl=https://dblp.org/rec/conf/edbt/0002ZST18
}}
==Mining Vessel Trajectory Data for Patterns of Search and Rescue==
<pdf width="1500px">https://ceur-ws.org/Vol-2083/paper-18.pdf</pdf>
<pre>
        Mining Vessel Trajectory Data for Patterns of Search
                            and Rescue
         Konstantinos Chatzikokolakis*, Dimitrios Zissis*†, Giannis Spiliopoulos* and Konstantinos Tserpes**
                                           * MarineTraffic, London, United Kingdom
                          Email:{konstantinos.chatzikokolakis, giannis.spiliopoulos} @marinetraffic.com
                 † Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece
                                                    Email: dzissis@aegean.gr
                      ** Department of Informatics and Telematics, Harokopio University of Athens, Greece
                                                    Email: ktserpes@hua.com


                        Figure 1. Visualization of SAR activity in the Mediterranean Sea during July-September 2015


ABSTRACT                                                                                  Mediterranean Sea to Europe. Since the Syrian war in 2011,
                                                                                          there has been a rapid increase in the number of people
The overall aim of this work is to explore the possibility of                             crossing; a trend which is not expected to stop any time soon.
automatically detecting Search And Rescue (SAR) activity,                                 According to the UN Refugee Agency, this year alone, at
even when a distress call has not yet been received. For this,                            least 2,030 people have died or gone missing on the voyage,
we exploit a large volume of historical Automatic                                         with the greatest number of fatalities occurring along the so-
Identification System (AIS) data so as to detect SAR activity                             called Central Mediterranean Route, through Libya [23].
from vessel trajectories, in a scalable, data-driven supervised                           Although under maritime law, any vessel in the area of a
way, with no reliance on external sources of information                                  vessel in distress is obliged to offer assistance, numerous
(e.g. coast guard reports). Specifically, we present our                                  national and international missions have been launched on
approach which is based on a parallelised, nonparametric                                  the EU borders and in the international waters of the
statistical method (Random Forests), which has proved                                     Mediterranean, so as to assist in Search and Rescue (SAR)
capable of achieving prediction accuracy rates higher than                                operations, such as Operation Mare Nostrum led by Italy,
77%.                                                                                      Operation Triton led by Frontex, NATO Operation Sea
                                                                                          Guardian and the EU operation Sophia. Many of these
1    INTRODUCTION                                                                         operations were not designed with SAR as a primary mission
    For many years, North Africa has served as the jumping                                goal. Due to this, numerous Non-Governmental
off point for refugees and migrants hoping to cross the                                   Organisations (NGOs) have stepped in and have been
                                                                                          performing SAR operations in the area; these include

© 2018 Copyright held by the owner/author(s). Published in the Workshop
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted
under the terms of the Creative Commons license CC-by-nc-nd 4.0


Migrant Offshore Aid Station (MOAS), Doctors Without                        •    Domain Specific: The overall aim of this work is
Borders, Sea-Watch and others. According to the UNHCR,                           to explore if it is possible to automatically detect
41% of all those rescued have been rescued by NGOs.                              SAR activity from open data (such as AIS), even
           Recently, though, concerns have been raised about                     when a distress call has not been received. This
the possible interactions between NGOs and smugglers. A                          work has an important social impact, as it can help
report published by the EU agency Frontex stated that there                      improve coordination of SAR efforts and
were “clear indications before departure on the precise                          understanding of implicated activities (e.g.
direction to be followed in order to reach the NGOs’                             response time).
boats”[4]. According to this same report, during 2015, and
the first months of 2016, smuggling groups instructed                       •    Algorithmic: We extract patterns of “rescue-like
migrants to make satellite phone calls to the Maritime                           behavior” from billions of records of spatio-
Rescue Coordination Centre (MRCC) in Rome so as to                               temporal (AIS) data and apply Random Forests,
initiate targeted rescues on the high seas. During this period,                  which is a parallelised nonparametric statistical
SAR operations were mainly undertaken by Italian law                             method, evaluated as capable of achieving
enforcement, EUNAVFOR Med or Frontex vessels with                                prediction accuracy rates of more than 77%, even
NGO vessels involved in less than 5% of the incidents. From                      when applied to large volumes of highly skewed
June to October 2016, however, the pattern was reversed.                         geospatial data. To the best of the authors’
“Satellite phone calls to MRCC Rome decreased sharply                            knowledge, no previous work has considered
(down to 10%) and NGO rescue operations rose significantly                       deriving SAR activity from AIS data.
to more than 40% of all incidents. Since June 2016, a                   The rest of the paper is organized as follows: Section 2
significant number of boats were intercepted or rescued by              briefly presents previous work in this domain, Section 3
NGO vessels without any prior distress call and without                 describes our approach, Section 4 presents the preliminary
official information as to the rescue location” according to            results, and Section 5 concludes the paper by briefly
Frontex [4].                                                            outlining the main contributions of this work and
           Maritime Domain Awareness (MDA) is the                       suggesting future improvements.
effective understanding of activities, events and threats in
the maritime environment that could impact global safety,               2   RELATED WORK
security, economic activity or the environment [5]. Whilst in           The rise in the availability of larger quantity and better
the past, MDA had suffered from a lack of data, current                 quality mobility data has increased the interest of
tracking technology has transformed the problem into one of             researchers in data-driven knowledge discovery. Some of the
an overabundance of data and information. Currently, huge               typical mining tasks in the spatio-temporal context include
amounts of structured and unstructured data, tracking vessels           frequent pattern discovery, trajectory pattern clustering,
during their voyages across the seas, are becoming available,           trajectory classification, forecasting, and outlier detection.
mostly due to the Automatic Identification System (AIS) that            Recent works on pattern discovery are based on online event
vessels of specific categories are required to carry. The AIS           recognition systems that recognize suspicious and illegal
is a collaborative, self-reporting system that allows maritime          vessel activities of compressed routes (i.e., only critical
vessels to broadcast their information to nearby vessels and            points of routes are preserved)[17]. Although this solution
coastal based stations [26]. AIS transceivers allow real time           identifies complex events, it does not classify those to
information exchange between vessels and shore based                    specific vessel operations (e.g. tugging, fishing, search and
stations through digital radio signals transmitted over                 rescue, etc.). This work has been extended in [14],
dedicated channels in the VHF band. The major challenge faced           where vessels’ movement pattern analysis is performed through
today is exploiting these vast amounts of data and transforming         an ontology-based system. Trajectory classification
them into actionable information. Discovering patterns                  involves constructing a model capable of predicting the class
emerging within these huge datasets is of great importance              labels of moving objects based on their trajectories and other
so as to provide critical insights into the patterns vessels            features [9]. Trajectory classification has been applied in
follow during their voyages at sea.                                     many mobility applications and numerous methods have
           The main objective of our work is to explore the             been proposed throughout the given literature, however less
possibility of leveraging these huge mobility datasets so as            attention has been paid to the maritime domain and
to automatically detect vessels performing SAR operations.              classifying a vessel’s type with regard to its trajectory. For
Towards this direction we adopt a practical data mining and             example, in [9], authors propose a feature generation
machine learning approach which is capable of overcoming                framework TraClass for trajectory data from satellite images
the shortcomings and difficulties presented by AIS data                 and trace gas measurements, which generates a hierarchy of
(highly skewed, non-uniform, reception errors etc.) [6]. In             features by partitioning trajectories and explores two types
sum, this work presents novelties on two fronts:                        of clustering: (1) region-based and (2) trajectory-based. In


this paper, hierarchical region-based and trajectory-based              vessel types: cargo ship, tanker, tug and law-enforcement
clustering after trajectory partitioning is performed, and a            vessel with the best accuracy score being 76.25%. Jiang,
vessel classification rate as high as 84.4% is reported, but            Silver, Hu, De Souza, and Matwin in [8], also make use of
unfortunately information on how many vessel types are                  AIS data and compare Autoencoders with SVMs and
included in the dataset is not provided [9].                            Random Forests. In their work they suggest that
          Several studies have proved the value of using AIS            autoencoders can perform at least as well as and sometimes
data for data-driven knowledge discovery in this domain [12,            better than SVM and Random Forests on classifying
15, 16]. An interesting trajectory classification case that has         fishing activities, achieving up to 85% accuracy [8].
caught researchers’ attention is that of fishing activity               However, the nature of the autoencoders is to capture as
detection; especially for applications such as illegal fishing,         much information as possible and not as much relevant
where the task can be defined as given a ship trajectory T,             information as possible and since this work utilised only a
predict a label 𝑦𝑖 for each data point 𝑡𝑖 where 𝑦𝑖 ∈{Fishing,           small dataset it would be difficult to have only a small part
NonFishing}[21]. In [21], authors develop three different               of the input that is relevant to the considered problem.
models to detect potential fishing behavior according to the            Furthermore, SVMs do not work well with categorical
type of fishing activity; for trawlers a Hidden Markov Model            features and often fail to handle larger datasets as they pose
(HMM) is developed using vessel speed as observation                    significant memory requirements and computational
variable; for longliners a pattern recognition approach                 complexity in such cases. Other studies indicate the
named Lavielle’s algorithm has been applied; and for purse              superiority of Random Forests when used for classification
seiners a multi-layered filtering strategy based on vessel              tasks, compared to SVMs and back propagation neural
speed and operation time was implemented. Validation                    networks [10].
against expert-labeled datasets showed average detection                    Random Forests, which are based on decision trees
accuracies of 83% for trawler and longliner, and 97% for                combined with aggregation and bootstrap ideas, were first
purse seiner. Although these methods were designed for                  introduced by Breiman in 2001 [2]. They are a powerful
wide applicability, high accuracy results are only achieved             nonparametric statistical method allowing to consider in a
by preprocessing AIS data, where wrong detections, noise                single and versatile framework regression problems, as well
and faulty out-of-bounds data (e.g. observations on land) are           as two-class and multi-class classification problems [19].
previously removed [21]. The use of AIS data poses a series             Random Forests can deal with large numbers of predictor
of data management and data processing challenges linked                variables even in the presence of complex interactions, and
to the treatment of large volumes of data which may heavily             have been applied successfully in genetics, clinical
reduce the applicability of the approach. Many traditional              medicine, and bioinformatics within the past few years.
data mining approaches assume that the underlying data                  Random Forests have been shown to achieve a high
distribution is uniform and spatially continuous. This is not           prediction accuracy in such applications and to provide
the case for global AIS data, which often has large                     descriptive variable importance measures reflecting the
geographical coverage gaps, message collisions or erroneous             impact of each variable in both main effects and interactions
messages especially when processing large areas [18, 25].               [22]. They are considered capable of good accuracy,
          In [11] Mazzarella, Vespe, Damalas and Osio focus             relatively robust to outliers and noise, can be parallelised
on discovering and characterising fishing areas by exploiting           and are thus considered a suitable data mining algorithm for
historical AIS data broadcast by fishing vessels.                       big data [1, 2].
Specifically, they focused on detecting the behavior of
fishing boats that are probably actively fishing. The                   3     PROPOSED APPROACH
methodology used for the identification of fishing activity             Our aim is to explore the possibility of automatically
was based on assuming that fishing behaviour is highly dependent        detecting SAR activity from open data (such as AIS), even
on and characterised by speed. Detecting changes and                    when a distress call has not been received. The task can be
frequency of speed could help identify which part of the                formulated as: given a set of vessel trajectories T, predict a
vessel track can be considered as fishing and which not                 label 𝑦𝑖 for each trajectory 𝑡𝑖 where 𝑦𝑖 ∈{SAR, Non-SAR}.
[13]. Their approach relies on DB-SMoT [20] and DBSCAN                  A trajectory T is a set of AIS messages monitoring a vessel’s
[3] but unfortunately it is difficult to evaluate the overall           movement from a departure port to a destination port.
accuracy of their results due to the limited availability of
ground truth data.                                                      3.1    Dataset description and processing requirements
          In [24], authors make use of trajectory kernels in
                                                                        According to the International Organisation for Migration, more
combination with Support Vector Machines (SVM) to
                                                                        than 360,000 migrants arrived in the EU by sea in 2016,
detect fishing activity from AIS data, which was collected in
                                                                        mainly in Italy, Greece and Spain [7]. With respect to the
a 50km radius around the Port of Rotterdam. For their
                                                                        spatial coverage, our analysis has been focused on a
classification experiments they use the four most common                bounding box covering the Central Mediterranean Route,


where most of the refugee fatalities have been observed.               up to a Spark cluster with 56 cores and 392GB memory (in
Figure 2 below illustrates the bounding box taken into                 total). The worker nodes have 8 processing cores and 56GB
account in conjunction with the refugee fatalities in 2016. It         of memory each and the head nodes have 4 processing cores
should be noted that our approach relies only on AIS data              and 28GB of memory each.
and the migration fatalities dataset visualised in Figure 2 is
used only as a reference to define the bounding box area.              3.2    Data processing and analysis
                                                                       The dataset used for this study consists of all the voyages of
                                                                       2016 that intersect with the bounding box shown in Figure
                                                                       2. More specifically, this includes 275,657 (SAR and non-
                                                                       SAR, according to the reported AIS SHIPTYPE) voyages
                                                                       made by 12,291 vessels. These correspond to 54,766,629
                                                                       AIS observations. After processing the initial data we used
                                                                       an algorithmic approach we have introduced in [6], which
                                                                       determines departure and destination port for each AIS
                                                                       message, thus transforming them into specific voyages. Each
    Figure 2. Spatial coverage in conjunction with migration           voyage includes the vessel’s trajectory as well as its static
                       fatalities for 2016                             and voyage information described in the previous sub-
                                                                       section. Then, a data curation process was performed, to
          The considered dataset includes the 6 AIS message types      discard voyages with an insignificant number of positions (e.g.
most relevant to navigation out of the 27 AIS message types            statistically too few to be representative). More specifically,
defined in the ITU 1371-4 report [26], which are used in               all the voyages that included fewer than 50 positions were
approximately 90% of AIS-based scenarios. More                         removed (as the geographical area selected covers a distance
specifically, the dataset includes messages of types 1, 2, 3,          of over 1500 kilometers, trajectories with only 50 reported
5, 18, and 19 out of which 1, 2, 3, 18 and 19 are position             positions translate to a sample rate of less than one sample
reports, including latitude, longitude, speed-over-ground              per hour). Such voyages suffer from gaps of communication,
(SOG), course-over-ground (COG), and other fields related              which will affect the accuracy and the effectiveness of the
to ship movement, while type 5 messages correspond to                  proposed approach. After the curation process the dataset included
static and voyage information, including the IMO identifier,           114,762 voyages, performed by 10,816 vessels, containing
radio call sign, name, ship dimensions, ship and cargo types.          52,505,718 AIS records. However, the SAR data available
          Each vessel's type can be deduced using the                  in this geographic area for 2016 are more than 100 times scarcer
information contained in these messages that the vessel is             than the data of non-SAR voyages. More
transmitting. This piece of information, typically referred to         specifically, the dataset includes 114,377 non-SAR voyages
as AIS SHIPTYPE, usually consists of two digits, the first             of 10,788 vessels, which include a total of 52,429,521 AIS
one ranging from 1-9 indicates the general category of the             messages, while there are 385 SAR voyages, made by 28
subject vessel (e.g., Special Category, Passenger, Cargo,              vessels, with 75,797 AIS records. For evaluating the
etc.), while the second one provides additional information            approach, the dataset was split into training and test data; the
regarding the vessel’s type of cargo in certain vessel                 training set included 70% of the SAR voyages and in order
categories (e.g., Cargo Ships, Tankers, etc.). The vessel's            to avoid having imbalanced training data or having
crew or the accountable officer are responsible for correctly          imbalanced evaluation metric of the classifier (e.g. true
entering information into the AIS transponder and although             positive rate at some false positive threshold), we
there are explicit types for SAR vessels, it is frequently the         subsampled the non-SAR voyages (i.e., randomly selecting
case that vessels participating in SAR operations are not              a subset) included in the training data. Particularly the
declared as such. Furthermore, the mere fact that a vessel’s           training data included 1,544 non-SAR voyages and 261 SAR
type is SAR does not necessarily imply that each voyage of             voyages made by 949 and 26 distinct vessels respectively.
the vessel is linked to SAR operations (e.g., such vessel              The rest of the data (i.e. 30% of the SAR voyages and all the
could travel between ports for maintenance purposes). The data         non-SAR voyages not included in the training set)
volume included in our analysis demands large                          constituted our test data.
computational power and a parallel processing approach,                          For all the records in the dataset we select the
because traditional analytics fail to handle such                      following attributes which will be used in our analysis for
volumes of data in a reasonable time frame. Consequently,              distinguishing SAR patterns:
we have deployed our approach in Microsoft Azure, which is
a distributed computing framework capable of processing large            a.    Ship id: This is a unique identifier for each vessel
amounts of data fast. In particular, our system included two
Head D12v2 nodes and six D13v2 Worker nodes, summing


  b.   Ship type: This is a two-digit code that corresponds             However, there are also other types of ships that may follow
       to the general category of the vessel and the vessel’s           similar patterns. For instance, inland vessels tend to have
       type of cargo in certain vessel categories                       frequent changes in course over ground and heading due to
  c.   Latitude, Longitude: These represent the geographic              the voyage area topography. Another example is tugboats
       location of the vessel                                           that maneuver other vessels by pushing or towing them.
  d.   SOG: This is the speed over ground of the vessel                 Such vessels typically operate in crowded ports or narrow
       measured in knots                                                canals and perform various maneuvers leading to increased
  e.   COG: This is the course over ground of the vessel                frequency of turns. Furthermore, tugs typically have the
       measured in degrees with 0 corresponding to north                same departure and arrival port as they are called to leave a
  f.   Heading: This attribute represents the ship's heading            port (i.e., depart), reach the vessel to be towed (or pushed)
       in degrees with 0 corresponding to north                         and return to the same port. One of the distinguishing factors
  g.   Timestamp: This is the full UTC timestamp that the               between such vessels and SAR is the voyage duration. In
       AIS message was received by MarineTraffic                        many SAR operations, once vessels recover migrants from
                                                                        sea, they return to the same port from which they departed
          It should be noted that COG and Heading may be                so as to disembark rescued people and return to the
different, due to weather conditions such as wind speed and             SAR operation area. Furthermore, SAR vessels patrolling
direction, wave height and currents (e.g. when vessels are              tend to have a steady course, whereas when they are engaged
drifting). COG on the one hand is the actual moving                     in a rescue operation they perform complex maneuvers to
direction of the vessel, while heading simply indicates where           collect migrants. In some cases, it has been observed that
the ship is pointing compared to north. Based on all these              vessels patrolling an area, may be at open sea (i.e., outside
attributes and in conjunction with other datasets that assist           of port boundaries) for several days (or even weeks)
in determining the boundaries of a port, the following                  traveling in a rather small bounding box (compared to the
additional attributes were calculated:                                  overall time of their voyage).
                                                                                  Based on those characteristics, we produced some
  a.   Departure port id: This is a unique identifier of the            additional attributes that have been considered as possible
       port from which the vessel departed                              features for the classification process. For each voyage we
  b.   Departure timestamp: Full timestamp of the first AIS             have ordered the AIS messages received chronologically and
                                                                        we calculated COG, SOG and Heading deltas for each pair
       message outside of departure port geometry
                                                                        of (chronologically) consecutive messages. Negative values
  c.   Departure port name: This is the name of the
                                                                        in the COG delta feature indicate moving to the left, while
       departure port
                                                                        positive values indicate moving to the right. Similarly,
  d.   Departure port type: This attribute determines the
       type of the port (e.g., port, anchorage, etc.)                   negative values in the SOG delta feature indicate speed
  e.   Departure country code: This attribute indicates the             decrease, while positive values indicate speed increase.
                                                                        Finally, negative values for the Heading delta imply a turn
       country of the departure port
                                                                        of ship’s heading to the left, while positive values indicate a
                                                                        turn to the right. In our analysis we use the absolute values
          Similar attributes related to the arrival of each vessel      of COG, SOG and Heading deltas, which capture the
at a port have also been calculated.                                    magnitude of change of the corresponding attributes. In
                                                                        addition, two extra features have been added to the dataset.
3.3 SAR Motion analysis                                                 The first is a Boolean value indicating whether the vessel
                                                                        has the same departure and arrival port, while the second
All these attributes have been used to transform raw                    is the voyage duration.
positional data into vessel voyages. However, in order to                         After constructing these last features, we were able
distinguish SAR trajectories from other voyages it was                  to measure the quantiles for the COG, SOG and Heading
necessary to delve into more detail on the motion patterns              deltas and it has been observed that SAR operation voyages
during SAR operations and focus on the maneuverability of such          have different behavior compared to other voyages. More
vessels. The methodology used for the identification of SAR             specifically, non-SAR voyages seem to have low values
activity is based on assuming that SAR behaviour is highly              even for large quantiles (e.g., 75%, 80%, 85%) compared
dependent on and characterised by frequency of speed changes,           to the SAR voyages, meaning that in most observations the
frequency of turns, departing and arriving at the same port             COG, SOG and Heading deltas are typically small, while for
or anchorage and voyage duration. Detecting changes and                 SAR voyages those quantiles had large values. Thus, we
frequency of speed as well as departing and arriving at the             added to our dataset the 50%, 75%, 85% and 95% quantiles
same port will help distinguish SAR trajectories from                   for each of those voyages.
typical voyages (i.e., travelling from one port to another).
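
The per-voyage features described in this subsection (quantiles of the COG, SOG and Heading deltas, the same-port flag and the voyage duration) can be illustrated with a minimal sketch. It is not the authors' implementation: the Position record, the feature names, the time unit, the linear-interpolation quantile and the wrap-around handling of angles at 360 degrees are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Quantile levels named in the text, as integer percentages.
QUANTILE_LEVELS = (50, 75, 85, 95)


@dataclass
class Position:
    """One AIS position report (hypothetical record layout)."""
    t: float        # timestamp in seconds (assumed unit)
    sog: float      # speed over ground, knots
    cog: float      # course over ground, degrees
    heading: float  # heading, degrees


def angular_delta(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees.
    Assumption: the paper does not state how wrap-around is handled."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def quantile(values: Sequence[float], percent: int) -> float:
    """Quantile with linear interpolation between order statistics."""
    if not values:
        return 0.0
    s = sorted(values)
    pos = percent / 100 * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def voyage_features(points: List[Position], dep_port: str, arr_port: str) -> Dict[str, float]:
    """Build the per-voyage feature vector from consecutive position reports."""
    pairs = list(zip(points, points[1:]))
    deltas = {
        "sog": [abs(b.sog - a.sog) for a, b in pairs],
        "cog": [angular_delta(a.cog, b.cog) for a, b in pairs],
        "heading": [angular_delta(a.heading, b.heading) for a, b in pairs],
    }
    feats: Dict[str, float] = {
        "same_port": float(dep_port == arr_port),
        "duration_s": points[-1].t - points[0].t if points else 0.0,
    }
    for name, vals in deltas.items():
        for p in QUANTILE_LEVELS:
            feats[f"{name}_delta_q{p}"] = quantile(vals, p)
    return feats
```

Under these assumptions each voyage yields 14 features: two voyage-level values plus four quantiles for each of the three delta series.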


4. RESULTS AND DISCUSSION                                              measure giving the ratio of correctly predicted observations
The focus of this work is on exploiting large volumes of               to the total observations. Precision is the ratio of correctly
historical AIS data so as to identify SAR operations from              predicted positive observations to the total predicted positive
trajectories in a scalable, data-driven and supervised way.            observations. High Precision corresponds to a low false positive
Our approach is based on a parallelised, non-parametric                rate. Finally, Recall is the ratio of correctly predicted
statistical method, Random Forests. To evaluate the                    positive observations to all observations in the actual class.
                                                                                  Table 1: Prediction model metrics scores
approach’s performance, we conducted a series of
experiments that showcase its effectiveness on unseen real-                        Metric                       Value
world data. First, we applied a multi-fold cross-validation                        F1 score                     0.986
procedure and measured the F1 score. This score,                                   Accuracy                     0.975
given by Equation (1) below, is the harmonic mean of                               Weighted Recall              0.975
Precision and Recall, taking both false positives and false                        Weighted Precision           0.998
negatives into account. Then, using the best model derived
through the cross-validation procedure, the algorithm                           The results show high scores in all the metrics.
classified the test data.                                              This is due to the highly imbalanced test dataset. More
                                                                       specifically it shows that the model can distinguish non-SAR
                                                                       voyages and classify them as such. The ROC curve and the
  F_1 = 2 \cdot \frac{\text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \qquad (1)
                                                                       Area Under ROC curve shown in Figure 3 below also indicate
                                                                       the capability of the derived model to classify SAR
4.1 Random Forest training and validation                              and non-SAR voyages, as the area under the ROC curve is
                                                                       equal to 0.86.
The training dataset described in subsection 3.2 has been
used to train and validate the Random Forest model using
the features analysed in subsections 3.2 and 3.3. The dataset
has been partitioned, following the well-known k-fold
cross-validation procedure, into training and validation
pairs. We split the dataset into five parts (i.e., 5-fold
cross-validation) and repeated the process five times, each
time using four parts as the training set and the remaining
part as the validation set. The training set was used to
build the Random Forest model, and the validation set to
predict the class of its observations and compare each
prediction against the actual value. Each Random Forest
model derived has 10,000 trees and the F1 metric has
been measured, leading to an average score of 0.946 over
the 5 folds. The best model derived from the cross-validation
process has been retained and used for predicting the values
of the test set. Finally, it should be noted that, although
classification has not previously been applied to SAR
missions, the Random Forest algorithm shows similar
performance
compared to other classification schemes used for                         Figure 3: ROC curve and Area Under ROC curve of the
identifying other types of vessels’ motion patterns such as                          Random Forest prediction model
fishing [8][21][24].
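
The 5-fold procedure described in subsection 4.1 (train on four parts, validate on the fifth, average the F1 scores and keep the best model) can be sketched as follows. This is an illustration only and not the authors' pipeline: the function names are hypothetical, the shuffle seed is arbitrary, and the learner shown (fit_threshold) is a deliberately trivial placeholder rather than a Random Forest. Any function that maps a training set to a predictor could be passed in its place.

```python
import random
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[List[float], int]      # (feature vector, label); 1 = SAR, 0 = non-SAR
Model = Callable[[List[float]], int]  # maps a feature vector to a predicted label


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 of the positive class, following Equation (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_select(data: List[Sample],
                  fit: Callable[[List[Sample]], Model],
                  k: int = 5,
                  seed: int = 0) -> Tuple[Model, float, List[float]]:
    """Return (best model, mean validation F1, per-fold F1 scores)."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::k] for i in range(k)]  # k disjoint parts
    scores: List[float] = []
    best_model, best_score = None, -1.0
    for i in range(k):
        val = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        model = fit(train)
        score = f1_score([y for _, y in val], [model(x) for x, _ in val])
        scores.append(score)
        if score > best_score:
            best_model, best_score = model, score
    return best_model, sum(scores) / k, scores


def fit_threshold(train: List[Sample]) -> Model:
    """Placeholder learner (NOT a Random Forest): thresholds feature 0 at the
    midpoint of the two class means. Assumes both classes occur in `train`."""
    pos = [x[0] for x, y in train if y == 1]
    neg = [x[0] for x, y in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x[0] > cut)
```

A call such as k_fold_select(samples, fit_threshold) returns the retained model together with the mean and per-fold F1 scores; the retained model would then be applied to the held-out test set.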
                                                                                However, since the test dataset is imbalanced, and
                                                                       in order to further investigate how well the algorithm
4.2 Random Forest prediction model evaluation
                                                                       identified SAR voyages we have measured the
                                                                       misclassification rate for each vessel type. In particular, the
The best model obtained through the 5-fold cross-validation            prediction accuracy of each vessel type class has been
process has been used for predicting the labels of the test            derived, and Table 2 below includes the five most
dataset. To evaluate the performance of the model on                   misclassified vessel types (i.e., false positives) and the
previously unseen data, we measured the F1 score, the                  misclassification of SAR voyages (i.e., false negatives). The
Accuracy, the weighted Recall and the weighted Precision
presented in Table 1 below. Accuracy is the most intuitive performance


results show that the classification model correctly labelled           a set of ship trajectories T, predict a label 𝑦𝑖 for each
77.5% of the SAR voyages.                                               trajectory 𝑡𝑖, where 𝑦𝑖 ∈ {SAR, Non-SAR}. Our proposed
                                                                        approach proved capable of classifying SAR trajectories with
          Table 2: Top-5 misclassified vessel classes                   an accuracy higher than 77%. To the best of the authors’
    AIS vessel  AIS vessel type     # voyages  Misclassification        knowledge, no previous work has considered deriving SAR
    type        name                           rate (%)                 activity from AIS data in a data-driven approach. In the
    51          SAR                 124        22.5 (false negatives)   future, we will attempt to reformulate the problem as a
    34          Dive Vessels        40         62.5                     point-based classification: given a ship trajectory T,
    53          Port Tender         10         60                       predict a label 𝑦𝑖 for each data point 𝑡𝑖, where 𝑦𝑖
    49          High-Speed Craft I  548        57.6                     ∈ {SAR, Non-SAR}. Based on these labeled points, SAR time
    40          High-Speed Craft II 435        57.01                    per area can possibly be calculated on any given scale.
    30          Fishing             1021       26.75
                                                                        ACKNOWLEDGEMENT
                                                                        This project has received funding from the European
                                                                        Union’s Horizon 2020 research and innovation programme
                                                                        under grant agreement No 732310 and by Microsoft
                                                                        Research through a Microsoft Azure for Research Award.

                                                                        REFERENCES
                                                                        [1]    A Parallel Random Forest Algorithm for Big Data in a Spark Cloud
                                                                               Computing Environment - IEEE Journals & Magazine:
                                                                               http://ieeexplore.ieee.org/document/7557062/. Accessed: 2017-11-30.
                                                                        [2]    Breiman, L. 2001. Random Forests. Machine Learning. 45, 1 (Oct.
                                                                               2001), 5–32. DOI:https://doi.org/10.1023/A:1010933404324.
          Although the misclassification rate of the non-SAR            [3]    Ester, M. et al. 1996. A Density-based Algorithm for Discovering
voyages presented above is high, these classes represent a                     Clusters in Large Spatial Databases with Noise. Proceedings
small portion of the overall test dataset, with only a few tens                of the Second International Conference on Knowledge
or hundreds of voyages. On the other hand, the classification                  Discovery and Data Mining (Portland, Oregon, 1996),
                                                                               226–231.
algorithm achieved a remarkable accuracy rate, reaching up to           [4]    FRONTEX Risk Analysis for 2017.
99.7% in classes with more voyages in the test set. Table 3             [5]    Galdorisi, G. and Goshorn, R. 2006. Maritime Domain Awareness:
below includes the five vessel types with the most voyages                     The Key to Maritime Security Operational Challenges and
in the test set and the misclassification rate for those vessel                Technical Solutions.
                                                                        [6]    Spiliopoulos, G. et al. 2017. A big data driven approach to
types.                                                                         extracting global trade patterns. (Sep. 2017).
          Table 3: Top 5 vessels with most voyages                      [7]    International Organization for Migration- UN Mixed Migration
    AIS vessel  AIS vessel type     # voyages  Misclassification               Flows in the Mediterranean and Beyond.
    type        name                           rate (%)                 [8]    Jiang, X. et al. 2016. Fishing Activity Detection from AIS Data
    70          Cargo               32,611     0.3                             Using Autoencoders. 33–39.
    60          Passenger           17,253     1.64                     [9]    Lee, J.-G. et al. 2008. TraClass: Trajectory Classification Using
    71          Cargo – Hazard A    10,308     0.32                            Hierarchical Region-based and Trajectory-based Clustering. Proc.
    80          Tanker              9,599      1.43                            VLDB Endow. 1, 1 (Aug. 2008), 1081–1094.
    69          Passenger           9,057      0.695                           DOI:https://doi.org/10.14778/1453856.1453972.
                                                                        [10]   Liu, M. et al. 2013. Comparison of random forest, support vector
                                                                               machine and back propagation neural network for electronic tongue
                                                                               data classification: Application to the recognition of orange
                                                                               beverage and Chinese vinegar. Sensors and Actuators B: Chemical.
                                                                               177, Supplement C (Feb. 2013), 970–980.
                                                                               DOI:https://doi.org/10.1016/j.snb.2012.11.071.
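
The per-vessel-type misclassification rates reported in Tables 2 and 3 can be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' code: the function names are hypothetical, and the ground-truth label is assumed to follow from the AIS vessel type alone (type 51 is SAR, every other type is non-SAR), which is how the tables appear to be organised.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SAR_TYPE = 51  # AIS vessel type code listed for SAR in Table 2


def misclassification_by_type(rows: Iterable[Tuple[int, bool]]) -> Dict[int, Tuple[int, float]]:
    """rows: one (ais_vessel_type, predicted_is_sar) pair per voyage.
    Returns {vessel_type: (number of voyages, misclassification rate in %)}.
    Assumption: a voyage is truly SAR exactly when its vessel type is 51."""
    total: Dict[int, int] = defaultdict(int)
    wrong: Dict[int, int] = defaultdict(int)
    for vtype, predicted_sar in rows:
        total[vtype] += 1
        if predicted_sar != (vtype == SAR_TYPE):
            wrong[vtype] += 1  # false negative for type 51, false positive otherwise
    return {t: (n, 100.0 * wrong[t] / n) for t, n in total.items()}


def top_misclassified(stats: Dict[int, Tuple[int, float]], k: int = 5) -> List[Tuple[int, Tuple[int, float]]]:
    """The k vessel types with the highest misclassification rate."""
    return sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)[:k]
```

Sorting the same dictionary by voyage count instead of rate would give the view used in Table 3.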
                                                                        [11]   Mazzarella, F. et al. 2014. Discovering vessel activities at sea using
5. CONCLUSION AND FUTURE WORK                                                  AIS data: Mapping of fishing footprints. 17th International
                                                                               Conference on Information Fusion (FUSION) (Jul. 2014), 1–7.
This work focused on the task of automatically detecting                [12]   Millefiori, L. et al. 2016. A distributed approach to estimating sea
SAR vessels from maritime trajectory data. Specifically, we                    port operational regions from lots of AIS data. (Washington D.C.,
leveraged a large volume of historical AIS data and                            USA, 2016).
                                                                        [13]   Natale, F. et al. 2015. Mapping Fishing Effort through AIS Data.
described our approach, which is based on Random Forests,                      PLOS ONE. 10, 6 (2015), e0130746.
a parallelized nonparametric statistical method, with no                       DOI:https://doi.org/10.1371/journal.pone.0130746.
reliance on external sources of information (e.g. coast guard           [14]   OBDAIR: Ontology-Based Distributed Framework for Accessing,
reports), so as to detect vessels performing SAR operations                    Integrating and Reasoning with Data in Disparate Data Sources
                                                                               (PDF Download Available):
in the Mediterranean Sea. The task was formulated as follows: given


       https://www.researchgate.net/publication/319280828_OBDAIR_Ontology-Based_Distributed_Framework_for_Accessing_Integrating_and_Reasoning_with_Data_in_Disparate_Data_Sources.
       Accessed: 2018-02-02.
[15]   Pallotta, G. et al. 2013. Traffic knowledge discovery from AIS data.
       Proceedings of the 16th International Conference on Information
       Fusion (Jul. 2013), 1996–2003.
[16]   Pallotta, G. et al. 2013. Vessel Pattern Knowledge Discovery from
       AIS Data: A Framework for Anomaly Detection and Route
       Prediction. Entropy. 15, 6 (Jun. 2013), 2218–2245.
       DOI:https://doi.org/10.3390/e15062218.
[17]   Patroumpas, K. et al. 2017. Online event recognition from moving
       vessel trajectories. GeoInformatica. 21, 2 (Apr. 2017), 389–427.
       DOI:https://doi.org/10.1007/s10707-016-0266-x.
[18]   Poļevskis, J. et al. 2012. Methods for Processing and Interpretation
       of AIS Signals Corrupted by Noise and Packet Collisions. Latvian
       Journal of Physics and Technical Sciences. 49, (Jan. 2012), 25–31.
       DOI:https://doi.org/10.2478/v10047-012-0015-3.
[19]   Random Forests for Big Data - ScienceDirect:
       http://www.sciencedirect.com/science/article/pii/S2214579616301939.
       Accessed: 2017-11-30.
[20]   Rocha, J.A.M.R. et al. 2010. DB-SMoT: A direction-based spatio-
       temporal clustering method. 2010 5th IEEE International
       Conference Intelligent Systems (Jul. 2010), 114–119.
[21]   Souza, E.N. de et al. 2016. Improving Fishing Pattern Detection
       from Satellite AIS Using Data Mining and Machine Learning.
       PLOS ONE. 11, 7 (2016), e0158248.
       DOI:https://doi.org/10.1371/journal.pone.0158248.
[22]   Strobl, C. et al. 2009. An introduction to recursive partitioning:
       Rationale, application, and characteristics of classification and
       regression trees, bagging, and random forests. Psychological
       Methods. 14, 4 (2009).
[23]   UN Refugee Agency Refugee and migrant flows through Libya on
       the rise – report.
[24]   de Vries, G.K.D. and van Someren, M. 2012. Machine learning for
       vessel trajectories using compression, alignments and domain
       knowledge. Expert Systems with Applications. 39, 18 (Dec. 2012),
       13426–13439. DOI:https://doi.org/10.1016/j.eswa.2012.05.060.
[25]   Yang, M. et al. 2012. Collision and Detection Performance with
       Three Overlap Signal Collisions in Space-Based AIS Reception.
       2012 IEEE 11th International Conference on Trust, Security and
       Privacy in Computing and Communications (Jun. 2012), 1641–
       1648.
[26]   2001. ITU Recommendation 1371-4, “Technical characteristics for
       an Automatic Identification System using time-division multiple
       access in the VHF maritime mobile band.” Tech. Rep.
       Recommendation.



</pre>