=Paper=
{{Paper
|id=Vol-2083/paper-18
|storemode=property
|title=Mining Vessel Trajectory Data for Patterns of Search and Rescue
|pdfUrl=https://ceur-ws.org/Vol-2083/paper-18.pdf
|volume=Vol-2083
|authors=Konstantinos Chatzikokolakis,Dimitrios Zissis,Giannis Spiliopoulos,Konstantinos Tserpes
|dblpUrl=https://dblp.org/rec/conf/edbt/0002ZST18
}}
==Mining Vessel Trajectory Data for Patterns of Search and Rescue==
Konstantinos Chatzikokolakis*, Dimitrios Zissis*†, Giannis Spiliopoulos* and Konstantinos Tserpes**
* MarineTraffic, London, United Kingdom
Email:{konstantinos.chatzikokolakis, giannis.spiliopoulos} @marinetraffic.com
† Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece
Email: dzissis@aegean.gr
** Department of Informatics and Telematics, Harokopio University of Athens, Greece
Email: ktserpes@hua.com
Figure 1. Visualization of SAR activity in the Mediterranean Sea during July-September 2015
ABSTRACT

The overall aim of this work is to explore the possibility of automatically detecting Search And Rescue (SAR) activity, even when a distress call has not yet been received. For this, we exploit a large volume of historical Automatic Identification System (AIS) data so as to detect SAR activity from vessel trajectories, in a scalable, data-driven supervised way, with no reliance on external sources of information (e.g. coast guard reports). Specifically, we present our approach, which is based on a parallelised, nonparametric statistical method (Random Forests) and has proved capable of achieving prediction accuracy rates higher than 77%.

1 INTRODUCTION

For many years, North Africa has served as the jumping-off point for refugees and migrants hoping to cross the Mediterranean Sea to Europe. Since the start of the Syrian war in 2011, there has been a rapid increase in the number of people crossing, a trend which is not expected to stop any time soon. According to the UN Refugee Agency, this year alone, at least 2,030 people have died or gone missing on the voyage, with the greatest number of fatalities occurring along the so-called Central Mediterranean Route, through Libya [23]. Under maritime law, any vessel in the area of a vessel in distress is obliged to offer assistance. In addition, numerous national and international missions have been launched on the EU borders and in the international waters of the Mediterranean so as to assist in Search and Rescue (SAR) operations, such as Operation Mare Nostrum led by Italy, Operation Triton led by Frontex, NATO Operation Sea Guardian and the EU operation Sophia. Many of these operations were not designed with SAR as a primary mission goal. Because of this, numerous Non-Governmental Organisations (NGOs) have stepped in and have been performing SAR operations in the area; these include
© 2018 Copyright held by the owner/author(s). Published in the Workshop
Proceedings of the EDBT/ICDT 2018 Joint Conference (March 26, 2018, Vienna,
Austria) on CEUR-WS.org (ISSN 1613-0073). Distribution of this paper is permitted
under the terms of the Creative Commons license CC-by-nc-nd 4.0
Migrant Offshore Aid Station (MOAS), Doctors Without Borders, Sea-Watch and others. According to the UNHCR, an overall 41% of those rescued have been rescued by NGOs.

Recently, though, concerns have been raised about the possible interactions between NGOs and smugglers. A report published by the EU agency Frontex stated that there were “clear indications before departure on the precise direction to be followed in order to reach the NGOs’ boats” [4]. According to this same report, during 2015 and the first months of 2016, smuggling groups instructed migrants to make satellite phone calls to the Maritime Rescue Coordination Centre (MRCC) in Rome so as to initiate targeted rescues on the high seas. During this period, SAR operations were mainly undertaken by Italian law enforcement, EUNAVFOR Med or Frontex vessels, with NGO vessels involved in less than 5% of the incidents. From June to October 2016, however, the pattern was reversed. “Satellite phone calls to MRCC Rome decreased sharply (down to 10%) and NGO rescue operations rose significantly to more than 40% of all incidents. Since June 2016, a significant number of boats were intercepted or rescued by NGO vessels without any prior distress call and without official information as to the rescue location” according to Frontex [4].

Maritime Domain Awareness (MDA) is the effective understanding of activities, events and threats in the maritime environment that could impact global safety, security, economic activity or the environment [5]. Whilst in the past MDA suffered from a lack of data, current tracking technology has transformed the problem into one of an overabundance of data and information. Currently, huge amounts of structured and unstructured data, tracking vessels during their voyages across the seas, are becoming available, mostly due to the Automatic Identification System (AIS) that vessels of specific categories are required to carry. The AIS is a collaborative, self-reporting system that allows maritime vessels to broadcast their information to nearby vessels and coast-based stations [26]. AIS transceivers allow real-time information exchange between vessels and shore-based stations through digital radio signals transmitted over dedicated channels in the VHF band. The major challenge faced today is exploiting these vast amounts of data and transforming them into actionable information. Discovering patterns emerging within these huge datasets is of great importance, as it provides critical insights into the patterns vessels follow during their voyages at sea.

The main objective of our work is to explore the possibility of leveraging these huge mobility datasets so as to automatically detect vessels performing SAR operations. To this end, we adopt a practical data mining and machine learning approach which is capable of overcoming the shortcomings and difficulties presented by AIS data (highly skewed, non-uniform, reception errors etc.) [6]. In sum, this work presents novelties on two fronts:

• Domain Specific: The overall aim of this work is to explore if it is possible to automatically detect SAR activity from open data (such as AIS), even when a distress call has not been received. This work has an important social impact, as it can help improve coordination of SAR efforts and understanding of implicated activities (e.g. response time).

• Algorithmic: We extract patterns of “rescue-like behavior” from billions of records of spatio-temporal (AIS) data and apply Random Forests, which is a parallelised nonparametric statistical method, evaluated as capable of achieving prediction accuracy rates of more than 77%, even when applied to large volumes of highly skewed geospatial data. To the best of the authors’ knowledge, no previous work has considered deriving SAR activity from AIS data.

The rest of the paper is organized as follows: Section 2 briefly presents previous work in this domain, Section 3 describes our approach, Section 4 presents the preliminary results, and Section 5 concludes the paper by briefly outlining the main contributions of this work and suggesting future improvements.

2 RELATED WORK

The rise in the availability of larger quantities of better-quality mobility data has increased the interest of researchers in data-driven knowledge discovery. Some of the typical mining tasks in the spatio-temporal context include frequent pattern discovery, trajectory pattern clustering, trajectory classification, forecasting, and outlier detection. Recent works on pattern discovery are based on online event recognition systems that recognize suspicious and illegal vessel activities from compressed routes (i.e., only critical points of routes are preserved) [17]. Although this solution identifies complex events, it does not classify them into specific vessel operations (e.g. tugging, fishing, search and rescue, etc.). The merits of this work have been extended in [14], where vessels’ movement pattern analysis is performed through an ontology-based system. Trajectory classification includes constructing a model capable of predicting the class labels of moving objects based on their trajectories and other features [9]. Trajectory classification has been applied in many mobility applications and numerous methods have been proposed throughout the literature; however, less attention has been paid to the maritime domain and to classifying a vessel’s type with regard to its trajectory. For example, in [9], the authors propose a feature generation framework, TraClass, for trajectory data from satellite images and trace gas measurements, which generates a hierarchy of features by partitioning trajectories and explores two types of clustering: (1) region-based and (2) trajectory-based. In
this paper, hierarchical region-based and trajectory-based clustering after trajectory partitioning is performed, and a vessel classification rate as high as 84.4% is reported, but unfortunately information on how many vessel types are included in the dataset is not provided [9].

Several studies have proved the value of using AIS data for data-driven knowledge discovery in this domain [12, 15, 16]. An interesting trajectory classification case that has caught researchers’ attention is that of fishing activity detection, especially for applications such as illegal fishing, where the task can be defined as: given a ship trajectory T, predict a label 𝑦𝑖 for each data point 𝑡𝑖, where 𝑦𝑖 ∈ {Fishing, NonFishing} [21]. In [21], the authors develop three different models to detect potential fishing behavior according to the type of fishing activity: for trawlers, a Hidden Markov Model (HMM) is developed using vessel speed as the observation variable; for longliners, a pattern recognition approach named Lavielle’s algorithm has been applied; and for purse seiners, a multi-layered filtering strategy based on vessel speed and operation time was implemented. Validation against expert-labeled datasets showed average detection accuracies of 83% for trawlers and longliners, and 97% for purse seiners. Although these methods were designed for wide applicability, high accuracy results are only achieved by preprocessing AIS data, where wrong detections, noise and faulty out-of-bounds data (e.g. observations on land) are removed beforehand [21]. The use of AIS data poses a series of data management and data processing challenges linked to the treatment of large volumes of data, which may heavily reduce the applicability of the approach. Many traditional data mining approaches assume that the underlying data distribution is uniform and spatially continuous. This is not the case for global AIS data, which often have large geographical coverage gaps, message collisions or erroneous messages, especially when processing large areas [18, 25].

In [11], Mazzarella, Vespe, Damalas and Osio focus on discovering and characterising fishing areas by exploiting historical AIS data broadcast by fishing vessels. Specifically, they focused on detecting the behavior of fishing boats that are probably actively fishing. The methodology used for the identification of fishing activity was based on assuming that fishing behaviour is highly dependent on, and characterised by, speed. Detecting changes and frequency of speed could help identify which part of the vessel track can be considered as fishing and which not [13]. Their approach relies on DB-SMoT [20] and DBSCAN [3], but unfortunately it is difficult to evaluate the overall accuracy of their results due to the limited availability of ground truth data.

In [24], the authors make use of trajectory kernels in combination with Support Vector Machines (SVM) to detect fishing activity from AIS data, which was collected in a 50km radius around the Port of Rotterdam. For their classification experiments they use the four most common vessel types: cargo ship, tanker, tug and law-enforcement vessel, with the best accuracy score being 76.25%. Jiang, Silver, Hu, De Souza, and Matwin [8] also make use of AIS data and compare Autoencoders with SVMs and Random Forests. In their work they suggest that autoencoders can perform at least as well as, and sometimes better than, SVMs and Random Forests on classifying fishing activities, achieving up to 85% accuracy [8]. However, the nature of autoencoders is to capture as much information as possible rather than as much relevant information as possible, and since that work utilised only a small dataset, it would be difficult to have only a small part of the input that is relevant to the considered problem. Furthermore, SVMs do not work well with categorical features and often fail to handle larger datasets, as they pose significant memory requirements and computational complexity in such cases. Other studies indicate the superiority of Random Forests when used for classification tasks, compared to SVMs and back propagation neural networks [10].

Random Forests, which are based on decision trees combined with aggregation and bootstrap ideas, were first introduced by Breiman in 2001 [2]. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single and versatile framework [19]. Random Forests can deal with large numbers of predictor variables even in the presence of complex interactions, and have been applied successfully in genetics, clinical medicine, and bioinformatics within the past few years. Random Forests have been shown to achieve a high prediction accuracy in such applications and to provide descriptive variable importance measures reflecting the impact of each variable in both main effects and interactions [22]. They are considered capable of good accuracy, are relatively robust to outliers and noise, can be parallelised, and are thus considered a suitable data mining algorithm for big data [1, 2].

3 PROPOSED APPROACH

Our aim is to explore the possibility of automatically detecting SAR activity from open data (such as AIS), even when a distress call has not been received. The task can be formulated as: given a set of vessel trajectories T, predict a label 𝑦𝑖 for each trajectory 𝑡𝑖, where 𝑦𝑖 ∈ {SAR, Non-SAR}. Each trajectory 𝑡𝑖 in T is a set of AIS messages monitoring a vessel’s movement from a departure port to a destination port.

3.1 Dataset description and processing requirements

According to the International Organisation for Migration, more than 360,000 migrants arrived in the EU by sea in 2016, mainly in Italy, Greece and Spain [7]. With respect to the spatial coverage, our analysis has been focused on a bounding box covering the Central Mediterranean Route,
where most of the refugee fatalities have been observed. Figure 2 below illustrates the bounding box taken into account in conjunction with the refugee fatalities in 2016. It should be noted that our approach relies only on AIS data, and the migration fatalities dataset visualised in Figure 2 is used only as a reference to define the bounding box area.

Figure 2. Spatial coverage in conjunction with migration fatalities for 2016

The considered dataset includes the 6 AIS message types most relevant to navigation out of the 27 AIS message types defined in the ITU 1371-4 report [26], which are used in approximately 90% of AIS-based scenarios. More specifically, the dataset includes messages of types 1, 2, 3, 5, 18, and 19, out of which 1, 2, 3, 18 and 19 are position reports, including latitude, longitude, speed-over-ground (SOG), course-over-ground (COG), and other fields related to ship movement, while type 5 messages correspond to static and voyage information, including the IMO identifier, radio call sign, name, ship dimensions, ship and cargo types.

Each vessel's type can be deduced using the information contained in these messages that the vessel is transmitting. This piece of information, typically referred to as AIS SHIPTYPE, usually consists of two digits: the first one, ranging from 1 to 9, indicates the general category of the subject vessel (e.g., Special Category, Passenger, Cargo, etc.), while the second one provides additional information regarding the vessel’s type of cargo in certain vessel categories (e.g., Cargo Ships, Tankers, etc.). The vessel's crew or the accountable officer are responsible for correctly entering information into the AIS transponder and, although there are explicit types for SAR vessels, it is frequently the case that vessels participating in SAR operations are not declared as such. Furthermore, the fact that a vessel’s type is SAR does not necessarily imply that each voyage of the vessel is linked to SAR operations (e.g., such a vessel could travel between ports for maintenance purposes).

The data volume included in our analysis demands large computational power and a parallel processing approach, because traditional analytics fail to handle such volumes of data in a reasonable time frame. Consequently, we have deployed our approach in Microsoft Azure, which is a distributed computing framework capable of processing large amounts of data fast. In particular, our system included two D12v2 head nodes and six D13v2 worker nodes, summing up to a Spark cluster with 56 cores and 392GB memory (in total). The worker nodes have 8 processing cores and 56GB of memory each and the head nodes have 4 processing cores and 28GB of memory each.

3.2 Data processing and analysis

The dataset used for this study consists of all the voyages of 2016 that intersect with the bounding box shown in Figure 2. More specifically, this includes 275,657 voyages (SAR and non-SAR, according to the reported AIS SHIPTYPE) made by 12,291 vessels. These correspond to 54,766,629 AIS observations. After processing the initial data we used an algorithmic approach we have introduced in [6], which determines the departure and destination port for each AIS message, thus transforming the messages into specific voyages. Each voyage includes the vessel’s trajectory as well as its static and voyage information described in the previous subsection. Then, a data curation process was performed to discard voyages with an insignificant number of positions (e.g. statistically too few to be representative). More specifically, all the voyages that included fewer than 50 positions were removed (as the geographical area selected covers a distance of over 1500 kilometers, trajectories with only 50 reported positions translate to a sample rate of less than one sample per hour). Such voyages suffer from gaps in communication, which would affect the accuracy and the effectiveness of the proposed approach. After the curation process the dataset included 114,762 voyages, performed by 10,816 vessels, containing 52,505,718 AIS records. However, the SAR data available in this geographic area for 2016 are more than 100 times fewer than the data of non-SAR voyages. More specifically, the dataset includes 114,377 non-SAR voyages of 10,788 vessels, which include a total of 52,429,521 AIS messages, while there are 385 SAR voyages made by 28 vessels with 75,797 AIS records. For evaluating the approach, the dataset was split into training and test data; the training set included 70% of the SAR voyages and, in order to avoid having imbalanced training data or an imbalanced evaluation metric of the classifier (e.g. true positive rate at some false positive threshold), we subsampled the non-SAR voyages (i.e., randomly selected a subset) included in the training data. In particular, the training data included 1,544 non-SAR voyages and 261 SAR voyages made by 949 and 26 distinct vessels respectively. The rest of the data (i.e. 30% of the SAR voyages and all the non-SAR voyages not included in the training set) constituted our test data.

For all the records in the dataset we select the following attributes, which will be used in our analysis for distinguishing SAR patterns:

a. Ship id: This is a unique identifier for each vessel
b. Ship type: This is a two-digit code that corresponds to the general category of the vessel and the vessel’s type of cargo in certain vessel categories
c. Latitude, Longitude: These represent the geographic location of the vessel
d. SOG: This is the speed over ground of the vessel measured in knots
e. COG: This is the course over ground of the vessel measured in degrees with 0 corresponding to north
f. Heading: This attribute represents the ship's heading in degrees with 0 corresponding to north
g. Timestamp: This is the full UTC timestamp at which the AIS message was received by MarineTraffic

It should be noted that COG and Heading may be different, due to weather conditions such as wind speed and direction, wave height and currents (e.g. when vessels are drifting). COG is the actual direction of movement of the vessel, while heading simply indicates where the ship is pointing relative to north. Based on all these attributes, and in conjunction with other datasets that assist in determining the boundaries of a port, the following additional attributes were calculated:

a. Departure port id: This is a unique identifier of the port from which the vessel departed
b. Departure timestamp: Full timestamp of the first AIS message outside of the departure port geometry
c. Departure port name: This is the name of the departure port
d. Departure port type: This attribute determines the type of the port (e.g., port, anchorage, etc.)
e. Departure country code: This attribute indicates the country of the departure port

Similar attributes related to the arrival of each vessel at a port have also been calculated.

3.3 SAR Motion analysis

All these attributes have been used to transform raw positional data into vessel voyages. However, in order to distinguish SAR trajectories from other voyages, it was necessary to look in more detail at the motion patterns during SAR operations and to focus on the maneuverability of such vessels. The methodology used for the identification of SAR activity is based on assuming that SAR behaviour is highly dependent on, and characterised by, frequency of speed changes, frequency of turns, departing and arriving at the same port or anchorage, and voyage duration. Detecting changes and frequency of speed, as well as departing and arriving at the same port, will help distinguish SAR trajectories from typical voyages (i.e., travelling from one port to another).

However, there are also other types of ships that may follow similar patterns. For instance, inland vessels tend to have frequent changes in course over ground and heading due to the topography of the voyage area. Another example is tugboats, which maneuver other vessels by pushing or towing them. Such vessels typically operate in crowded ports or narrow canals and perform various maneuvers, leading to an increased frequency of turns. Furthermore, tugs typically have the same departure and arrival port, as they are called to leave a port (i.e., depart), reach the vessel to be towed (or pushed) and return to the same port. One of the distinguishing factors between such vessels and SAR vessels is the voyage duration. In many SAR operations, once vessels recover migrants from sea, they return to the same port from which they departed so as to disembark rescued people, and then return to the SAR operation area. Furthermore, SAR vessels on patrol tend to have a steady course, while when they are engaged in a rescue operation they perform complex maneuvers to collect migrants. In some cases, it has been observed that vessels patrolling an area may be at open sea (i.e., outside of port boundaries) for several days (or even weeks), traveling in a rather small bounding box (compared to the overall time of their voyage).

Based on those characteristics, we produced some additional attributes that have been considered as possible features for the classification process. For each voyage we have ordered the AIS messages received chronologically and calculated COG, SOG and Heading deltas for each pair of (chronologically) consecutive messages. Negative values in the COG delta feature indicate moving to the left, while positive values indicate moving to the right. Similarly, negative values in the SOG delta feature indicate a speed decrease, while positive values indicate a speed increase. Finally, negative values for the Heading delta imply a turn of the ship’s heading to the left, while positive values indicate a turn to the right. In our analysis we use the absolute values of the COG, SOG and Heading deltas, which capture the magnitude of change of the corresponding attributes. In addition, two extra features have been added to the dataset. The first is a Boolean value indicating whether the vessel has the same departure and arrival port, while the second is the voyage duration.

After constructing these last features, we were able to measure the quantiles for the COG, SOG and Heading deltas, and it has been observed that SAR operation voyages behave differently compared to other voyages. More specifically, non-SAR voyages seem to have low values even for large quantiles (i.e., 75%, 80%, 85% etc.) compared to the SAR voyages, meaning that in most observations the COG, SOG and Heading deltas are typically small, while for SAR voyages those quantiles had large values. Thus, we added to our dataset the 50%, 75%, 85% and 95% quantiles for each of those voyages.
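The feature construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Spark implementation: the `AisMessage` record, the function names and the feature keys are hypothetical, angular deltas are wrapped to the range [-180, 180) (the paper does not say how the 0/360 boundary is handled), and quantiles use linear interpolation between order statistics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AisMessage:
    """Hypothetical, reduced AIS position report."""
    timestamp: float  # seconds since epoch (UTC)
    sog: float        # speed over ground, knots
    cog: float        # course over ground, degrees (0 = north)
    heading: float    # heading, degrees (0 = north)


def angular_delta(prev_deg: float, curr_deg: float) -> float:
    """Signed change in degrees, wrapped to [-180, 180).

    Negative means a turn to the left, positive a turn to the right.
    The wrap-around is an assumption; the paper does not describe it.
    """
    return (curr_deg - prev_deg + 180.0) % 360.0 - 180.0


def quantile(values: List[float], q: float) -> float:
    """Quantile by linear interpolation between order statistics (assumed method)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


QUANTILE_PERCENTS = (50, 75, 85, 95)  # quantiles named in the text


def voyage_features(messages: List[AisMessage],
                    departure_port_id: int,
                    arrival_port_id: int) -> Dict[str, float]:
    """Build one feature row for a voyage from its AIS messages."""
    if len(messages) < 2:
        raise ValueError("at least two messages are needed to compute deltas")
    # Order messages chronologically, then pair consecutive ones.
    msgs = sorted(messages, key=lambda m: m.timestamp)
    pairs = list(zip(msgs, msgs[1:]))
    # Absolute deltas capture the magnitude of change only.
    deltas = {
        "cog": [abs(angular_delta(a.cog, b.cog)) for a, b in pairs],
        "sog": [abs(b.sog - a.sog) for a, b in pairs],
        "heading": [abs(angular_delta(a.heading, b.heading)) for a, b in pairs],
    }
    features = {
        "same_port": float(departure_port_id == arrival_port_id),
        "duration_hours": (msgs[-1].timestamp - msgs[0].timestamp) / 3600.0,
    }
    for name, values in deltas.items():
        for p in QUANTILE_PERCENTS:
            features[f"{name}_delta_q{p}"] = quantile(values, p / 100.0)
    return features
```

Each voyage yields 14 values here (two voyage-level features plus four quantiles for each of the three deltas). A voyage with many sharp course or speed changes will show large upper quantiles, which is the signal the text attributes to SAR operations.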
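The curation and sampling steps of Section 3.2 can be expressed compactly in Python. The sketch below is illustrative only: the voyage dictionaries with keys `id`, `n_positions` and `is_sar` are hypothetical, the split is assumed to be made at voyage level with a seeded shuffle, and the number of non-SAR training voyages is a parameter (the paper reports 1,544 non-SAR and 261 SAR voyages in its training set).

```python
import random
from typing import Dict, List, Tuple

Voyage = Dict[str, object]  # hypothetical record: {"id", "n_positions", "is_sar"}

MIN_POSITIONS = 50  # voyages with fewer positions are discarded (Section 3.2)


def curate(voyages: List[Voyage]) -> List[Voyage]:
    """Drop voyages that report too few positions to be representative."""
    return [v for v in voyages if v["n_positions"] >= MIN_POSITIONS]


def split_voyages(voyages: List[Voyage],
                  sar_train_fraction: float = 0.7,
                  n_non_sar_train: int = 1544,
                  seed: int = 0) -> Tuple[List[Voyage], List[Voyage]]:
    """Split into (train, test) with a subsampled non-SAR training part.

    Train: a fraction of the SAR voyages plus a random subset of non-SAR voyages.
    Test:  the remaining SAR voyages plus all non-SAR voyages not used for training.
    """
    rng = random.Random(seed)
    sar = [v for v in voyages if v["is_sar"]]
    non_sar = [v for v in voyages if not v["is_sar"]]
    rng.shuffle(sar)
    rng.shuffle(non_sar)
    n_sar_train = round(sar_train_fraction * len(sar))
    n_non = min(n_non_sar_train, len(non_sar))
    train = sar[:n_sar_train] + non_sar[:n_non]
    test = sar[n_sar_train:] + non_sar[n_non:]
    return train, test
```

Because only the training part is subsampled, the test set keeps the original class imbalance, which is why the text later looks at per-class misclassification rates in addition to the overall scores.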
4. RESULTS AND DISCUSSION

The focus of this work is on exploiting large volumes of historical AIS data so as to identify SAR operations from trajectories in a scalable, data-driven and supervised way. Our approach is based on a parallelised, non-parametric statistical method, Random Forests. To evaluate the approach’s performance, we conducted a series of experiments that showcase its effectiveness on unseen real-world data. Firstly, we applied a multiple-fold cross-validation procedure and measured the F1 score. This score, given by Equation (1) below, is the harmonic mean of Precision and Recall, taking both false positives and false negatives into account. Then, using the best model derived through the cross-validation procedure, the algorithm classified the test data.

F1 Score = 2 ∗ (Recall ∗ Precision) / (Recall + Precision)    (1)

4.1 Random Forest training and validation

The training dataset described in subsection 3.2 has been used to train and validate the Random Forest model using the features analysed in subsections 3.2 and 3.3. The dataset has been repeatedly partitioned, following the well-known k-fold cross-validation procedure, into training and validation pairs. The partitioning process has been repeated 5 times (i.e. 5-fold cross validation), each time leading to different training and validation pairs. In each partition we have split the dataset into five parts: four of them are used as the training set and one as the validation set, with the former utilised to create the Random Forest model and the latter used for predicting the class of the observations and comparing it against its actual value. Each Random Forest model derived has 10,000 trees and the F1 metric has been measured, leading to an average score of 0.946 over the 5 folds. The best model derived from the cross-validation process has been retained and used for predicting the values of the test set. Finally, it should be noted that, although classification has not been applied before to SAR missions, the Random Forest algorithm shows similar performance compared to other classification schemes used for identifying other types of vessels’ motion patterns such as fishing [8][21][24].

4.2 Random Forest prediction model evaluation

The best model obtained through the 5-fold cross-validation process has been used for predicting the labels of the test dataset. To evaluate the performance of the model against previously unseen data, we measured the F1 score, the Accuracy, the weighted Recall and the weighted Precision presented in Table 1 below. Accuracy is the most intuitive performance measure, giving the ratio of correctly predicted observations to the total observations. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. High Precision relates to a low false positive rate. Finally, Recall is the ratio of correctly predicted positive observations to all observations in the actual class.

Table 1: Prediction model metrics scores
Metric | Value
F1 score | 0.986
Accuracy | 0.975
Weighted Recall | 0.975
Weighted Precision | 0.998

The results show high scores in all the metrics. This occurs due to the highly imbalanced test dataset. More specifically, it shows that the model can distinguish non-SAR voyages and classify them as such. The ROC curve and the Area Under ROC curve shown in Figure 3 below also indicate the capabilities of the derived model to classify SAR and non-SAR voyages, as the area under ROC is equal to 0.86.

Figure 3: ROC curve and Area Under ROC curve of the Random Forest prediction model

However, since the test dataset is imbalanced, and in order to further investigate how well the algorithm identified SAR voyages, we have measured the misclassification rate for each vessel type. In particular, the prediction accuracy of each vessel type class has been derived, and Table 2 below includes the top-5 (i.e., with most misclassification) vessel types (i.e. false positives) and the misclassification of SAR voyages (i.e. false negatives). The
results show that the classification model accurately labelled 77.5% of the SAR voyages.

Table 2: Top-5 misclassified vessel classes
AIS Vessel type | AIS Vessel type name | # voyages | Misclassification rate (%)
51 | SAR | 124 | 22.5 (false negatives)
34 | Dive Vessels | 40 | 62.5
53 | Port Tender | 10 | 60
49 | High-Speed Craft I | 548 | 57.6
40 | High-Speed Craft II | 435 | 57.01
30 | Fishing | 1021 | 26.75

Although the misclassification rate of the non-SAR voyages presented above is high, these classes represent a small portion of the overall test dataset, with only a few tens or hundreds of voyages. On the other hand, the classification algorithm achieved a remarkable accuracy rate, reaching up to 99.7% in classes with more voyages in the test set. Table 3 below includes the five vessel types with the most voyages in the test set and the misclassification rate for those vessel types.

Table 3: Top 5 vessels with most voyages
AIS Vessel type | AIS Vessel type name | # voyages | Misclassification rate (%)
70 | Cargo | 32,611 | 0.3
60 | Passenger | 17,253 | 1.64
71 | Cargo – Hazard A | 10,308 | 0.32
80 | Tanker | 9,599 | 1.43
69 | Passenger | 9,057 | 0.695

5. CONCLUSION AND FUTURE WORK

This work focused on the task of automatically detecting SAR vessels from maritime trajectory data. Specifically, we leveraged a large volume of historical AIS data and described our approach which is based on Random Forests, a parallelized nonparametric statistical method, with no reliance on external sources of information (e.g. coast guard
a set of ship trajectories T, predict a label 𝑦𝑖 for each trajectory 𝑡𝑖, where 𝑦𝑖 ∈ {SAR, Non-SAR}. Our proposed approach proved capable of classifying SAR trajectories at an accuracy higher than 77%. To the best of the authors’ knowledge, no previous work has considered deriving SAR activity from AIS data in a data-driven approach. In the future, we will attempt to reformulate the problem towards a point-based classification approach, such that given a ship trajectory T, predict a label 𝑦𝑖 for each data point 𝑡𝑖, where 𝑦𝑖 ∈ {SAR, NotSAR}. Based on these labeled points, SAR time per area can possibly be calculated on any given scale.

ACKNOWLEDGEMENT

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 732310 and by Microsoft Research through a Microsoft Azure for Research Award.

REFERENCES

[1] A Parallel Random Forest Algorithm for Big Data in a Spark Cloud Computing Environment - IEEE Journals & Magazine: http://ieeexplore.ieee.org/document/7557062/. Accessed: 2017-11-30.
[2] Breiman, L. 2001. Random Forests. Machine Learning. 45, 1 (Oct. 2001), 5–32. DOI:https://doi.org/10.1023/A:1010933404324.
[3] Ester, M. et al. 1996. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (Portland, Oregon, 1996), 226–231.
[4] FRONTEX Risk Analysis for 2017.
[5] Galdorisi, G. and Goshorn, R. 2006. Maritime Domain Awareness: The Key to Maritime Security Operational Challenges and Technical Solutions.
[6] Giannis Spiliopoulos et al. 2017. A big data driven approach to extracting global trade patterns. (Sep. 2017).
[7] International Organization for Migration- UN Mixed Migration Flows in the Mediterranean and Beyond.
[8] Jiang, X. et al. 2016. Fishing Activity Detection from AIS Data Using Autoencoders. 33–39.
[9] Lee, J.-G. et al. 2008. TraClass: Trajectory Classification Using Hierarchical Region-based and Trajectory-based Clustering. Proc. VLDB Endow. 1, 1 (Aug. 2008), 1081–1094. DOI:https://doi.org/10.14778/1453856.1453972.
[10] Liu, M. et al. 2013. Comparison of random forest, support vector machine and back propagation neural network for electronic tongue data classification: Application to the recognition of orange beverage and Chinese vinegar. Sensors and Actuators B: Chemical. 177, Supplement C (Feb. 2013), 970–980. DOI:https://doi.org/10.1016/j.snb.2012.11.071.
[11] Mazzarella, F. et al. 2014. Discovering vessel activities at sea using AIS data: Mapping of fishing footprints. 17th International Conference on Information Fusion (FUSION) (Jul. 2014), 1–7.
[12] Millefiori, L. et al. 2016. A distributed approach to estimating sea port operational regions from lots of AIS data. (Washington D.C., USA, 2016).
[13] Natale, F. et al. 2015. Mapping Fishing Effort through AIS Data. PLOS ONE. 10, 6 (2015), e0130746. DOI:https://doi.org/10.1371/journal.pone.0130746.
[14] OBDAIR: Ontology-Based Distributed Framework for Accessing,
reports), so as to detect vessels performing SAR operations Integrating and Reasoning with Data in Disparate Data Sources
(PDF Download Available):
in the Mediterranean Sea. The task was formulated as given
123
https://www.researchgate.net/publication/319280828_OBDAIR_O
ntology-
Based_Distributed_Framework_for_Accessing_Integrating_and_
Reasoning_with_Data_in_Disparate_Data_Sources. Accessed:
2018-02-02.
[15] Pallotta, G. et al. 2013. Traffic knowledge discovery from AIS data.
Proceedings of the 16th International Conference on Information
Fusion (Jul. 2013), 1996–2003.
[16] Pallotta, G. et al. 2013. Vessel Pattern Knowledge Discovery from
AIS Data: A Framework for Anomaly Detection and Route
Prediction. Entropy. 15, 6 (Jun. 2013), 2218–2245.
DOI:https://doi.org/10.3390/e15062218.
[17] Patroumpas, K. et al. 2017. Online event recognition from moving
vessel trajectories. GeoInformatica. 21, 2 (Apr. 2017), 389–427.
DOI:https://doi.org/10.1007/s10707-016-0266-x.
[18] Poļevskis, J. et al. 2012. Methods for Processing and Interpretation
of AIS Signals Corrupted by Noise and Packet Collisions. Latvian
Journal of Physics and Technical Sciences. 49, (Jan. 2012), 25–31.
DOI:https://doi.org/10.2478/v10047-012-0015-3.
[19] Random Forests for Big Data - ScienceDirect:
http://www.sciencedirect.com/science/article/pii/S2214579616301939.
Accessed: 2017-11-30.
[20] Rocha, J.A.M.R. et al. 2010. DB-SMoT: A direction-based spatio-
temporal clustering method. 2010 5th IEEE International
Conference Intelligent Systems (Jul. 2010), 114–119.
[21] Souza, E.N. de et al. 2016. Improving Fishing Pattern Detection
from Satellite AIS Using Data Mining and Machine Learning.
PLOS ONE. 11, 7 (2016), e0158248.
DOI:https://doi.org/10.1371/journal.pone.0158248.
[22] Strobl, C. et al. 2009. An introduction to recursive partitioning:
Rationale, application, and characteristics of classification and
regression trees, bagging, and random forests. Psychological
Methods. 14, 4 (2009).
[23] UN Refugee Agency Refugee and migrant flows through Libya on
the rise – report.
[24] de Vries, G.K.D. and van Someren, M. 2012. Machine learning for
vessel trajectories using compression, alignments and domain
knowledge. Expert Systems with Applications. 39, 18 (Dec. 2012),
13426–13439. DOI:https://doi.org/10.1016/j.eswa.2012.05.060.
[25] Yang, M. et al. 2012. Collision and Detection Performance with
Three Overlap Signal Collisions in Space-Based AIS Reception.
2012 IEEE 11th International Conference on Trust, Security and
Privacy in Computing and Communications (Jun. 2012), 1641–1648.
[26] 2001. ITU Recommendation 1371-4, “Technical characteristics for
an Automatic Identification System using time-division multiple
access in the VHF maritime mobile band.” Tech. Rep.
Recommendation.