Data Variety and Integrity Assessment for Maritime Anomaly Detection

Cyril Ray
Naval Academy Research Institute
Brest, France
cyril.ray@ecole-navale.fr


Abstract: The ever-increasing spread of mobile technologies and connected
sensors necessitates the continuous updating of methods and techniques to cope
with the growing volume of data. While fast processing and data management
have received a lot of attention, data veracity and its assessment based on a
large variety of contextual data remain open questions. Indeed, sensor-based
data collected through automated processes can be altered at every stage of
their collection and processing, accidentally or maliciously. In the maritime
domain, several embedded sensors continuously report vessels' positions.
Beyond known errors (human, misconfiguration), recent work has shown that
falsifying such data is easy and could therefore mask or favor illegal
actions, disturb monitoring systems and create new maritime risks. This
research presents maritime data quality issues and a methodological approach
for modelling, analyzing and detecting such anomalies in data.

Keywords: Data quality assessment, data falsification, maritime data,
maritime cyber threats.

I. INTRODUCTION

The maritime environment undergoes an ever-growing activity. In order to
mitigate the risk of grounding or collision, passive and active systems have
been developed for mariners and mandated by international authorities. The
Automatic Identification System (AIS) is an electronic system installed on
board vessels which transmits the vessel's location, amongst many other data.
The AIS broadcasts, on a regular basis, 27 kinds of messages, each one having
its own purpose in information transmission (positioning, nominative
information, management...). The messages are openly broadcast on two
dedicated Very High Frequency (VHF) channels. As messages are sent and
received by vessels and coastal stations within the radio horizon (circa 40
nautical miles), the system enables a better understanding of the surroundings
for vessels and coastal states, thus supporting several uses such as fleet
control, traffic control or boarding prevention.

The number of messages is large: on an average day, 19 million messages can be
received in Europe from about 80,000 unique vessels [3]. With such a large
amount of data to process, mainly but not only spatial and temporal, issues
linked to big data analysis arise. In particular, this research focuses on the
veracity aspect. Indeed, sent messages contain errors (unintentional) and
falsifications (intentional) and undergo spoofing (intentional) due to the
unsecured transmission channel, which weakens the whole system and the safety
of navigation.

This work reports on the design and results of a methodology for the detection
of AIS falsification. The objectives are the determination of false messages
in real time and the improvement of both the effectiveness of the system as a
security system and maritime situational awareness.

II. A SYSTEM WITH WEAKNESSES

Three major cases of bad data quality can be distinguished: errors (when false
data is broadcast non-deliberately), falsifications (when false data is
broadcast deliberately) and spoofing (when data is created or modified and
broadcast by an outsider) [5]. Data contained in AIS messages can be
erroneous, falsified or spoofed for several reasons: there is no strong
verification of the transmission, the transmission uses a non-secured channel,
some pieces of information might not be well known by the crew, or the crew
may want to hide some data from others. These operations alter and hamper the
understanding of maritime traffic.

Errors, by nature unintentional, can be caused by transponder deficiency,
wrong manual data input, manual data input of poor quality, or erroneous
pieces of information coming from external sensors. They can affect, for
instance, the name of the vessel, its physical characteristics, its position
or its destination. Those pieces of information can then be false, incomplete,
impossible according to the norm or physically impossible (for instance, a
latitude value cannot exceed 90°). According to [7], circa 50% of the messages
contain erroneous data.

A falsification is the deliberate degradation of a message, either by
replacing a genuine value with a false one or by stopping the broadcast of
messages, in order to mislead the outer world. Identity theft [8],
disappearances [9], the broadcast of false GNSS coordinates or the statement
of a wrong activity [10] are types of falsification. According to [7], about
1% of the vessels broadcast falsified data.

The spoofing of messages is done by an external actor who creates false
messages ex nihilo and broadcasts them on the AIS frequencies [4]. Spoofing
activities aim to mislead both the outer world and the crews at sea, by
creating ghost vessels, a false closest point of approach trigger, a false
emergency message or even a false course (in the case of a spoofed vessel).

The whole chain of AIS data transmission can be affected by one of these three
problems, from the GPS signal to human supervision, going through data
transmission and the distributed data processing and information systems
involved. In order to formally identify these threats, an EBIOS risk analysis
of the AIS has been performed [11]. It consists in the analysis of the
vulnerabilities, failures and risks associated with the system, enabling the
identification of issues that could actually emanate from the use of AIS.
This method has been chosen for its
compliance with ISO norms, and a list of circa 350 threat scenarios and a
typology of anomalies have been established.

III. A VARIETY OF DATA

Depending on the objectives, variety of data can be more important than
volume. Varying the variety means using heterogeneous data sources to
complement a core dataset in the understanding of a given situation. Indeed,
data analysis at large, including the detection of abnormal situations, can be
resolved or confirmed only by means of algorithms taking advantage of
additional, complementary sources of information. This variety of data is
absolutely required where (sensor-based) data with known quality issues are
analysed despite a lack of "ground truth". Beyond the understanding of data,
the use of variations in variety, which consists in progressively including
additional sources, is also a means of understanding the quality of algorithms
processing data (e.g. data compression, mining, visualisation).

While efforts have been initiated to centralise maritime data and information,
most of the data are of heterogeneous type and format and are still
independently sourced and maintained [1]. These data can support maritime
situational awareness as long as they are harmonised, properly combined,
integrated, summarised, and possibly cleaned of inconsistencies. Indeed, it
is expected that the analysis and understanding of maritime activities cannot
be deduced solely from vessel kinematics but would strongly benefit from
complementary data of various types. However, the integration and combination
(or fusion) of such data remain challenging (e.g., spatio-temporal alignment
of data, fusion of data from different sensors, maritime anomaly detection,
activity classification) and research is still needed to develop efficient
techniques.

In this research, ship information collected through the Automatic
Identification System has been prepared together with correlated data aligned
in space and time. The dataset has been carefully prepared and validated in
order to offer the research community a set of heterogeneous real data to
challenge, test and validate their research developments [2] and, in the scope
of this research, to assess falsification cases of the AIS data.

The dataset¹ contains four categories of data: navigation data (vessel
positions acquired automatically by an AIS receiver), vessel-oriented data
(public, official nominative vessel position), geographic data (cartographic,
topographic or regulatory context of vessel navigation), and environmental
data (weather and ocean data from forecast models and from observations). It
covers a time span of six months, from October 1st, 2015 to March 31st, 2016,
and provides ship positions over the Celtic Sea, the North Atlantic Ocean, the
English Channel, and the Bay of Biscay.

¹ C. Ray, R. Dréo, E. Camossi, A.-L. Jousselme, Heterogeneous Integrated
Dataset for Maritime Intelligence, Surveillance, and Reconnaissance (Version
0.1). Data set. Licence CC-BY-NC-SA-4.0. Zenodo.
doi.org/10.5281/zenodo.1167594, February 2018

IV. INTEGRITY ASSESSMENT

Since the AIS does not carry perfectly genuine data (beyond data errors),
since those inaccuracies are not perfect and are therefore detectable, and
since their impact on the real world can be substantial, a set of objectives
has been defined for the detection of falsifications. Relying on an accurate
understanding of the way the system is supposed to work, of its
vulnerabilities and of the errors and falsifications that have been
highlighted, these objectives include the creation of an attacking platform
allowing the creation and broadcast of falsified data, the modelling of a
statistics- and algorithm-based falsification detection mechanism, the
creation of an information system for the real-time handling of data taking
into account archived or forecasted data, and the modelling of risks induced
by an inadequate use of AIS, as well as an assessment of the risks linked to
AIS errors, anomalies, falsification or spoofing.

Intentional broadcast of false AIS information can be understood at both the
physical and logical levels. The first approach focuses on signals transmitted
by transponders, while the second considers the information exchanged, where
fraud and attacks can be identified by a message-based data mining methodology
that identifies abnormal messages (and parameters). Our approach combines both
analyses within a single information system.

A. Message-based analysis

The method for the integrity assessment of messages and the discovery of
anomalous data is largely based on spatial information, which is the
cornerstone of AIS messages, but not exclusively, as AIS also broadcasts much
contextual and control information across its 27 messages [6].

Considering the data within the fields of the 27 AIS messages, four ways to
assess the inner integrity of those data can be distinguished. The first way
consists in checking the integrity of each field of each message taken
individually. The second way works at the scale of one single message and
assesses the integrity, within this message, of all the fields with respect to
one another. As there are 27 types of messages, messages of the same type have
the same fields; it is thus possible to compare them as time series and assess
their integrity, which is the third way. Finally, the fourth way is the
comparison and integrity assessment of the fields of different messages.
Indeed, although pieces of information can come from different messages, it is
possible to assess their integrity as some fields are either the same, linked
or comparable (e.g. id-based cross verification in order to link information
received by different stations). Those four ways are referred to as
first-order, second-order, third-order and fourth-order assessments,
respectively.

Depending on the type of messages assessed and the order of assessment, the
number of items to check is fixed. We established a list of 935 integrity
items for the 27 messages, together with an ad-hoc nomenclature so that each
item has a clear unique identifier. Predicate logic can express, in a formal,
rigorous and unambiguous way, the actions that lead to the integrity
determination of an item. A formalism based on predicate logic has therefore
been chosen for item assessment, relying on three main elements: the data
field values, the syntax and the expert knowledge values. Of the 935 items,
666 have been implemented.

Since a falsification is either the transmission of erroneous data or an
attempt to trick the system into behaving in a way it is not supposed to, a
falsification scenario can take numerous forms. Linking integrity assessment
with falsification
scenarios is essential for the identification of cyber threats relying on the
AIS. A set of 23 algorithms (so-called flags) has been designed for the
identification of 4 falsification scenarios: falsified identity, positions,
control messages, and saturation. Amongst these flags, f_quadruplet analyses
whether one element of the identity quadruplet (MMSI number, IMO number,
Callsign, Name) has changed over time, and f_ubiquity analyses whether a
vessel reports two distinct locations at the same time.

B. Signal-based analysis

We also studied physical characteristics of the signal, which are intended to
be integrated in the mining process. We considered five parameters. The first
parameter is the power of the received signal; the four others are
time-dependent and relate to the shape of the signal. While these parameters
cannot fully qualify a ship's identity and presence, their regularity can
help to identify inconsistent values.

C. Processing principles

A synoptic diagram of the proposed architecture can be found in Figure 1. The
AIS stream can be received from various sources and goes towards a centralised
processing. The parser provides message parameters (Pi). It includes a
statistical analysis of messages (per ship identity, per type and at the
global level) for the identification of AIS saturation. The whole architecture
is built around the central database (DéAIS DB), where the historical data
described in Section III are stored and where streamed data are processed
asynchronously. The implementation of the database relies on the relational
database model (postgres/postgis).

Fig. 1. Processing principles

Additionally, an online processing of the AIS stream, based on Flink [12], has
also been designed for the computation of black holes. As the AIS coverage of
a receiver (and consequently the black hole locations) evolves continuously,
this processing is essential to accurately detect falsified positions.

The data processing box number two corresponds to the signal processing that
determines the aforementioned characteristics. These data are stored in the
database with the associated decoded AIS messages.

The data processing box number one is in charge of the on-the-fly analysis of
first-order and second-order data assessment, and outputs coefficients to be
stored in the database. Similarly, the data processing box number three is in
charge of the analysis of third-order and fourth-order data assessment, and
outputs coefficients to be stored in the database. This part of the study,
unlike the previous one, considers time series and needs to request historical
data.

The data processing box number four will be in charge of integrity assessments
between AIS data and external and aggregated data (e.g. cartographic
information, weather conditions, results of black hole computations). Finally,
data processing box number five is in charge of running flags. Of course, the
types of processing at this level vary according to the variety of information
available and the requested anomaly scenarios.

V. CONCLUSION

This article proposes a method for analysing message-based data using
integrity of information as a key factor. Considering a variety of data
sources, the approach assesses the message itself, the message with respect to
other messages, the message with respect to external databases, and the signal
itself with its physical characteristics. Applied in the context of maritime
data, such an assessment is made necessary by the defects of the AIS, which
transmits erroneous and possibly falsified data. The method provides
integrity-based predicates on data that are useful for the determination of
erroneous and falsified data, leading to risk assessment and alert triggering
for maritime cyber threats. The approach is generic and could be transposed to
many sensor-based systems.

ACKNOWLEDGMENT

This research is supported by the French National Research Agency (ANR) and
co-funded by the Defense procurement and technology agency (DGA) under
reference ANR-14-CE28-0028, and labelled by the French clusters Pôle Mer
Bretagne Atlantique and Pôle Mer Méditerranée.

REFERENCES

[1]  Kalyvas C., Kokkos A., Tzouramanis T., A survey of official online
     sources of high-quality free-of-charge geospatial data for maritime
     geographic information systems applications, Information Systems,
     Volume 65, 2017, Pages 36-51
[2]  Ray, C., R. Dréo, E. Camossi, A.-L. Jousselme, and C. Iphar (2018).
     Heterogeneous integrated dataset for maritime intelligence,
     surveillance, and reconnaissance. Data In Brief, Accepted for
     publication
[3]  European Maritime Safety Agency. EMSA Facts & Figures 2016. Report.
     40p. (2016)
[4]  Balduzzi M., A. Pasta, and K. Wilhoit. A security evaluation of AIS
     automated identification system. In Proceedings of the 30th Annual
     Computer Security Applications Conference, pages 436–445. ACM, 2014.
[5]  Ray, C., Iphar, C., Napoli, A., Gallen, R. and Bouju, A., DeAIS
     project: Detection of AIS Spoofing and Resulting Risks. In: The
     proceedings of OCEANS'15. Genova, 2015
[6]  Iphar, C., Napoli, A., Ray, C., Data Quality Assessment For Maritime
     Situation Awareness, 9th ISPRS International Symposium on Spatial
     Data Quality (ISSDQ 2015), Volume II-3/W5, pages 291-296, La
     Grande Motte - France, 29-30 September 2015
[7]  Harati-Mokhtari, A., Wall, A., Brooks, P. and Wang J., Automatic
     Identification System (AIS): a human factors approach. J. Navig. Vol
     60(3), Cambridge University Press, 2007.
[8]  The Maritime Executive, Iran, Tanzania and falsifying AIS signals to
     trade with Syria. Published in The Maritime Executive, December 7th,
     2012.
[9]  Windward, AIS data on the high seas: an analysis of the magnitude
     and implications of growing data manipulation at sea, 2014.
[10] Katsilieris, F., Braca, P. and Coraluppi, S., Detection of malicious
     AIS position spoofing by exploiting radar information. In:
     proceedings of the 16th international conference on information
     fusion. Istanbul, 2013.
[11] Iphar C., Napoli A., Ray C., Alincourt E., Brosset D., Risk Analysis
     of falsified Automatic Identification System for the improvement of
     maritime traffic safety, 8 pages, ESREL 2016, Glasgow 25th–29th
     September 2016
[12] Salmon, L., Ray, C., Design principles of a stream-based framework
     for mobility analysis, Geoinformatica, Special Issue on
     GeoStreaming, 25 pages, April 2016 (DOI 10.1007/s10707-016-0256-z)
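
ILLUSTRATIVE SKETCH

The message-based analysis of Section IV.A can be illustrated with a minimal
Python sketch of one first-order check (a latitude or longitude outside its
physical range) and of the two flags named in the text, f_quadruplet and
f_ubiquity. This is not the implementation used in the project: the record
types PositionReport and StaticReport, their field names, the tolerance
tol_deg and the assumption that static reports have already been associated
with one tracked vessel are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class PositionReport:
    """Hypothetical decoded position message (field names are assumptions)."""
    mmsi: int
    timestamp: int  # seconds, assumed to come from the receiver
    lat: float      # degrees
    lon: float      # degrees


@dataclass(frozen=True)
class StaticReport:
    """Hypothetical decoded static message carrying the identity quadruplet."""
    mmsi: int
    imo: Optional[int]
    callsign: str
    name: str


def first_order_position_ok(msg: PositionReport) -> bool:
    """First-order check: each field, taken alone, is physically possible."""
    return -90.0 <= msg.lat <= 90.0 and -180.0 <= msg.lon <= 180.0


def f_ubiquity(reports: Iterable[PositionReport], tol_deg: float = 1e-4) -> Set[int]:
    """Return the MMSIs reporting two distinct positions at the same timestamp."""
    seen = {}          # (mmsi, timestamp) -> list of (lat, lon) already observed
    flagged = set()
    for r in reports:
        key = (r.mmsi, r.timestamp)
        for lat, lon in seen.setdefault(key, []):
            if abs(lat - r.lat) > tol_deg or abs(lon - r.lon) > tol_deg:
                flagged.add(r.mmsi)
        seen[key].append((r.lat, r.lon))
    return flagged


def f_quadruplet(reports: Iterable[StaticReport]) -> Set[str]:
    """Return the identity fields whose value changes between consecutive
    reports of one tracked vessel (association is assumed to be done upstream)."""
    changed = set()
    prev = None
    for r in reports:
        if prev is not None:
            for field in ("mmsi", "imo", "callsign", "name"):
                if getattr(prev, field) != getattr(r, field):
                    changed.add(field)
        prev = r
    return changed


# Usage with made-up values: one vessel appears at two places at once.
positions = [
    PositionReport(227006760, 1000, 48.38, -4.49),
    PositionReport(227006760, 1000, 43.30, 5.37),
    PositionReport(228000000, 1000, 48.00, -5.00),
]
ubiquitous = f_ubiquity(positions)
name_change = f_quadruplet([
    StaticReport(227006760, 9074729, "FABC", "ALPHA"),
    StaticReport(227006760, 9074729, "FABC", "BRAVO"),
])
```

In this sketch a flag returns the offending identifiers or fields rather than
a boolean, so that the result could be stored as a coefficient alongside the
message, in the spirit of the processing boxes of Section IV.C.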