<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Cyril Ray Naval Academy Research Institute Brest</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>4</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>The ever-increasing spread of mobile technologies and connected sensors requires continuous updating of methods and techniques to cope with the growing volume of data. While fast processing and data management have received a lot of attention, data veracity and its assessment against a large variety of contextual data remain open questions. Indeed, sensor-based data collected through automated processes can be altered at every stage of collection and processing, accidentally or maliciously. In the maritime domain, several embedded sensors continuously report vessel positions. Beyond known errors (human, misconfiguration), recent works have shown that falsifying such data is easy and could therefore mask or favour illegal actions, disturb monitoring systems and create new maritime risks. This research presents maritime data quality issues and a methodological approach for modelling, analysing and detecting such anomalies in data.</p>
      </abstract>
      <kwd-group>
        <kwd>Data quality assessment</kwd>
        <kwd>data falsification</kwd>
        <kwd>maritime data</kwd>
        <kwd>maritime cyber threats</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>Activity in the maritime environment keeps growing. To mitigate the risk of grounding or collision, passive and active systems have been developed for mariners and established by international authorities. The Automatic Identification System (AIS) is an electronic system installed on board vessels that transmits the vessel's location, amongst many other data. The AIS regularly broadcasts 27 types of messages, each with its own purpose (positioning, nominative information, management, etc.). The messages are openly broadcast on two dedicated Very High Frequency (VHF) channels. As messages are sent and received by vessels and coastal stations within the radio horizon (circa 40 nautical miles), the system gives vessels and coastal states a better understanding of their surroundings, thus supporting uses such as fleet control, traffic control or boarding prevention.</p>
      <p>
        The number of messages is large: on an average day, 19
million messages can be received in Europe from about
80,000 unique vessels [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. With such an amount of data to process, mainly but not
only spatial and temporal, issues linked to big data analysis
arise. In particular, this research focuses on the veracity
aspect. Indeed, messages contain errors (unintentional) and
falsifications (intentional) and are subject to spoofing
(intentional) because the transmission channel is unsecured,
which weakens the whole system and the safety of navigation.
      </p>
      <p>This work reports on the design and results of a
methodology for the detection of AIS falsification. The
objectives are to identify false messages in real time and to
improve both the effectiveness of the system as a security
system and maritime situational awareness.</p>
    </sec>
    <sec id="sec-2">
      <title>II. A SYSTEM WITH WEAKNESSES</title>
      <p>
        Three major cases of poor data quality can be
distinguished: errors (when false data is unintentionally
broadcast), falsifications (when false data is
deliberately broadcast) and spoofing (when data is
created or modified and broadcast by an outsider) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Data contained in AIS messages can be erroneous, falsified
or spoofed for several reasons: the transmission is not strongly
verified, it uses an unsecured channel, some pieces of
information might not be well known by the crew, or the crew
may want to hide some data from others. These alterations
distort and hinder the understanding of maritime traffic.
      </p>
      <p>Errors, unintentional by nature, can be caused by a
transponder deficiency, wrong or poor-quality manual input,
or erroneous information coming from external sensors. They
can affect, for instance, the name of the vessel, its physical
characteristics, its position or its destination. The resulting
values can be false, incomplete, impossible according to the
standard or physically impossible (for instance, a latitude
cannot exceed 90° in absolute value). According to [7], circa
50% of the messages contain erroneous data.</p>
      <p>
        A falsification is the deliberate degradation of a
message, either by replacing a genuine value with a false
one or by stopping the broadcast of messages, in
order to mislead the outer world. Identity theft [8],
disappearances [9], the broadcast of false GNSS coordinates
and the statement of a wrong activity [
        <xref ref-type="bibr" rid="ref9">10</xref>
        ] are types of
falsification. According to [7], about 1% of the vessels
broadcast falsified data.
      </p>
      <p>
        Spoofing is carried out by an external actor who
creates false messages ex nihilo and broadcasts them
on the AIS frequencies [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Spoofing
aims to mislead both the outer world
and the crews at sea, by creating ghost vessels,
false closest-point-of-approach alerts, false emergency
messages or even a false course (in the case of a spoofed
vessel).
      </p>
      <p>
        The whole chain of AIS data transmission can be
affected by any of these three problems, from the GPS
signal to human supervision, through data
transmission and the distributed data processing and
information systems involved. In order to formally identify
these threats, an EBIOS risk analysis of the AIS has been
performed [
        <xref ref-type="bibr" rid="ref10">11</xref>
        ]. It consists of analysing the
vulnerabilities, failures and associated risks, which enables
the identification of issues that could actually arise from
the use of AIS. This method was chosen for its
compliance with ISO standards. A list of circa 350 threat
scenarios and a typology of anomalies have been established.
      </p>
    </sec>
    <sec id="sec-3">
      <title>III. A VARIETY OF DATA</title>
      <p>Depending on the objectives, the variety of data can be more
important than its volume. Variety here means using
heterogeneous data sources to complement a core
dataset in the understanding of a given situation. Indeed, data
analysis at large, including the detection of abnormal situations,
can be resolved or confirmed only by means of algorithms
that take advantage of additional, complementary sources of
information. This variety is essential where
(sensor-based) data with known quality issues are
analysed without “ground truth”. Beyond the
understanding of data, varying the variety, that is,
progressively including additional sources, is also a
means of understanding the quality of the algorithms that process
the data (e.g. data compression, mining, visualisation).</p>
      <p>
        While efforts have been initiated to centralise maritime
data and information, most of the data are of heterogeneous
type and format and still independently sourced and
maintained [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. These data can support maritime situational
awareness provided that they are harmonised, properly combined,
integrated, summarised, and possibly cleaned of
inconsistencies. Indeed, it is expected that the analysis and
understanding of maritime activities cannot be deduced
solely from vessel kinematics but would strongly benefit
from complementary data of various types. However, the
integration and combination (or fusion) of such data remain
challenging (e.g., spatio-temporal alignment of data, fusion
of data from different sensors, maritime anomaly detection,
activity classification), and research is still needed to develop
efficient techniques.
      </p>
      <p>
        In this research, ship information collected through the
Automatic Identification System has been prepared together
with correlated data aligned in space and time. The dataset
has been carefully prepared and validated in order to offer
the research community a set of heterogeneous real data to
challenge, test and validate their research developments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
and in the scope of this research to assess falsification cases
of the AIS data.
      </p>
      <p>The dataset contains four categories of data: navigation
data (vessel positions acquired automatically by an AIS
receiver), vessel-oriented data (public, official nominative
vessel position), geographic data (cartographic, topographic
or regulatory context of vessel navigation), and
environmental data (weather and ocean data from forecast
models and from observations). It covers a time span of six
months, from October 1st, 2015 to March 31st, 2016, and
provides ship positions over the Celtic Sea, the North
Atlantic Ocean, the English Channel, and the Bay of Biscay.</p>
    </sec>
    <sec id="sec-4">
      <title>IV. INTEGRITY ASSESSMENT</title>
      <p>Since the AIS does not carry perfectly genuine data
(even beyond plain errors), since falsifications are themselves
imperfect and therefore detectable, and since the impact on the
real world can be substantial, a set of objectives has been
defined for the detection of falsifications. Relying on an accurate
understanding of the way the system is supposed to work, of
its vulnerabilities and of the errors and falsifications that have
been highlighted, these objectives include: the creation of an
attack platform allowing the creation and broadcast
of falsified data; the modelling of a statistics- and
algorithm-based falsification detection mechanism; the creation of an
information system for the real-time handling of data, taking
into account archived or forecast data; and the modelling
of the risks induced by an inadequate use of AIS,
together with an assessment of the risks linked to AIS errors,
anomalies, falsification or spoofing.</p>
      <p>The intentional broadcast of false AIS information can be
understood at both the physical and logical levels. The first
approach focuses on the signals transmitted by transponders,
while the second considers the information exchanged, where
fraud and attacks can be identified by a message-based data
mining methodology that detects abnormal messages (and
parameters). Our approach combines both analyses within a
single information system.</p>
      <sec id="sec-4-1">
        <title>A. Message-based analysis</title>
        <p>
          The method for the integrity assessment of messages and the
discovery of anomalous data is primarily based on spatial
information, which is the cornerstone of AIS messages, but
not exclusively, as the AIS also broadcasts much contextual and
control information across its 27 messages [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>Considering the data within the fields of the 27 AIS
messages, four ways of assessing the internal integrity of
those data can be distinguished. The first controls the
integrity of each field of each message taken
individually. The second works at the scale of a single
message and assesses the integrity of all its fields with
respect to one another. As there are 27 types of messages and
messages of the same type have the same fields, it is possible
to compare them as time series and assess their integrity;
this is the third way. Finally, the fourth way is the
comparison and integrity assessment of the fields of different
messages. Indeed, although pieces of information can come from
different messages, their integrity can be assessed because some
fields are identical, linked or comparable (e.g. id-based
cross verification in order to link information received by
different stations). These four ways are referred to as
first-order, second-order, third-order and fourth-order
assessments, respectively.</p>
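        <p>As a minimal illustration of the first three orders, the following Python sketch checks one field on its own, two fields of one message against each other, and two consecutive messages of the same type. The dictionary keys (lat, lon, sog_kn, nav_status, t) and both thresholds are assumptions made for this example and are not the items or nomenclature of the study; the value 91 for an unavailable latitude and the code 1 for the "at anchor" navigational status follow the usual AIS position report encoding.</p>
        <preformat><![CDATA[
```python
import math

LAT_NOT_AVAILABLE = 91.0  # default "not available" latitude in AIS position reports
AT_ANCHOR = 1             # navigational status code for "at anchor"


def first_order_latitude(msg):
    """First order: one field checked on its own."""
    lat = msg["lat"]
    return lat == LAT_NOT_AVAILABLE or -90.0 <= lat <= 90.0


def second_order_anchor_speed(msg, max_anchor_speed_kn=3.0):
    """Second order: fields of one message checked against each other.

    The 3-knot tolerance is a hypothetical expert value.
    """
    if msg["nav_status"] != AT_ANCHOR:
        return True
    return msg["sog_kn"] <= max_anchor_speed_kn


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles (spherical Earth)."""
    r_nm = 3440.065  # approximate mean Earth radius in nautical miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r_nm * math.asin(math.sqrt(a))


def third_order_implied_speed(prev, curr, max_speed_kn=60.0):
    """Third order: two consecutive same-type messages seen as a time series.

    Timestamps t are in seconds; the 60-knot ceiling is a hypothetical value.
    """
    hours = (curr["t"] - prev["t"]) / 3600.0
    if hours <= 0:
        return False  # non-increasing timestamps are treated as inconsistent
    dist = haversine_nm(prev["lat"], prev["lon"], curr["lat"], curr["lon"])
    return dist / hours <= max_speed_kn
```
]]></preformat>
        <p>Each function returns a Boolean that could be stored as the outcome of one integrity item. A fourth-order check would follow the same pattern but compare fields taken from messages of different types.</p>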
        <p>For a given message type and order of assessment, the
number of items to check is fixed. We established a list of
935 integrity items for the 27 messages, together with an
ad hoc nomenclature that gives each item a clear, unique
identifier. Predicate logic can express, in a formal way, the
steps that lead to the integrity determination of an item
rigorously and unambiguously. A formalism based on predicate
logic has therefore been chosen for item assessment; it relies
on three main elements: the data field values, the syntax and
the expert knowledge values. Of the 935 items, 666 have been
implemented.</p>
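        <p>As a hedged illustration of what such predicates could look like (the predicate names and the threshold are assumptions made for this example, not the nomenclature of the study), a first-order item on the latitude field of a message m and a second-order item linking navigational status and speed over ground might be written as:</p>
        <p>
          <disp-formula><tex-math><![CDATA[
\begin{aligned}
\mathrm{ValidLat}(m) &\iff \bigl(-90 \le \mathrm{lat}(m) \le 90\bigr) \lor \bigl(\mathrm{lat}(m) = 91\bigr) \\
\mathrm{AnchorCoherent}(m) &\iff \Bigl(\mathrm{status}(m) = \text{at anchor} \Rightarrow \mathrm{sog}(m) \le v_{\max}\Bigr)
\end{aligned}
]]></tex-math></disp-formula>
        </p>
        <p>In one possible reading of the three elements, lat(m), status(m) and sog(m) are data field values, the admissible range and the default value 91 ("not available") come from the syntax of the message, and v_max stands for an expert knowledge value.</p>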
        <p>Since a falsification consists either in transmitting
erroneous data or in tricking the system into behaving in a way
it is not supposed to, a falsification scenario can take numerous
forms. Linking integrity assessment with falsification
scenarios is essential for the identification of cyber threats
relying on the AIS. A set of 23 algorithms (so-called flags)
has been designed for the identification of four falsification
scenarios: falsified identity, falsified positions, falsified
control messages, and saturation. Amongst these flags,
f_quadruplet checks whether one element of the identity
quadruplet (MMSI number, IMO number, call sign, name)
has changed over time, and f_ubiquity checks whether a
vessel reports two distinct locations at the same time.</p>
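        <p>The two flags named above can be sketched as follows. This is an illustrative reading of their description, not the implementation used in the study: the record layout (dictionaries with keys mmsi, imo, callsign, name, lat, lon, t) and the tolerances max_dt_s and max_dist_nm, which decide what counts as "the same time" and "distinct locations", are assumptions.</p>
        <preformat><![CDATA[
```python
import math
from collections import defaultdict

QUADRUPLET = ("mmsi", "imo", "callsign", "name")


def f_quadruplet(static_reports):
    """Return the (field, value) pairs that appear in more than one
    distinct identity quadruplet, i.e. where some other element changed."""
    by_key = defaultdict(set)
    for r in static_reports:
        quad = tuple(r[f] for f in QUADRUPLET)
        for f in QUADRUPLET:
            by_key[(f, r[f])].add(quad)
    return {key for key, quads in by_key.items() if len(quads) > 1}


def haversine_nm(lat1, lon1, lat2, lon2):
    """Great-circle distance in nautical miles (spherical Earth)."""
    r_nm = 3440.065  # approximate mean Earth radius in nautical miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r_nm * math.asin(math.sqrt(a))


def f_ubiquity(position_reports, max_dt_s=2.0, max_dist_nm=1.0):
    """Return the MMSIs that report two far-apart positions almost simultaneously.

    Both tolerances are hypothetical values chosen for the example.
    """
    by_mmsi = defaultdict(list)
    for r in position_reports:
        by_mmsi[r["mmsi"]].append(r)
    flagged = set()
    for mmsi, reports in by_mmsi.items():
        reports.sort(key=lambda r: r["t"])
        # Compare each report with the next one in time.
        for prev, curr in zip(reports, reports[1:]):
            close_in_time = curr["t"] - prev["t"] <= max_dt_s
            far_apart = haversine_nm(prev["lat"], prev["lon"],
                                     curr["lat"], curr["lon"]) > max_dist_nm
            if close_in_time and far_apart:
                flagged.add(mmsi)
    return flagged
```
]]></preformat>
        <p>Grouping by every element of the quadruplet, rather than by MMSI alone, lets the sketch also catch the case where the MMSI itself is the element that changes.</p>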
      </sec>
      <sec id="sec-4-2">
        <title>B. Signal-based analysis</title>
        <p>We also studied physical characteristics of the signal,
which are intended to be integrated in the mining process.
We considered five parameters. The first is the
power of the received signal; the four others are
time-dependent and relate to the shape of the signal. While
these parameters cannot fully establish a ship’s identity and
presence, their regularity can help to identify
inconsistent values.</p>
      </sec>
      <sec id="sec-4-3">
        <title>C. Processing principles</title>
        <p>A synoptic diagram of the proposed architecture can be
found in Figure 1. The AIS stream can be received from
various sources and is directed to a centralised processing chain.
The parser provides the message parameters (Pi). It includes a
statistical analysis of messages (per ship identity, per type
and at the global level) for the identification of AIS
saturation. The whole architecture is built around the central
database (DéAIS DB), where the historical data described in
Section III are stored and where streamed data are processed
asynchronously. The database implementation relies
on the relational model (PostgreSQL/PostGIS).</p>
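        <p>The statistical analysis performed alongside parsing can be pictured with the short Python sketch below, which counts messages per ship identity, per message type and globally within one time window, then lists the identities whose count exceeds a ceiling. The record keys (mmsi, type, t), the 60-second window and the ceiling of 30 messages are assumptions for the example, not parameters reported by the study.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def message_rates(messages, window_start, window_s=60.0):
    """Count messages per MMSI, per type and globally inside one time window."""
    per_mmsi, per_type = Counter(), Counter()
    total = 0
    for m in messages:
        if window_start <= m["t"] < window_start + window_s:
            per_mmsi[m["mmsi"]] += 1
            per_type[m["type"]] += 1
            total += 1
    return per_mmsi, per_type, total


def saturation_suspects(per_mmsi, max_per_mmsi=30):
    """Return the identities that sent more messages than the assumed ceiling."""
    return {mmsi for mmsi, n in per_mmsi.items() if n > max_per_mmsi}
```
]]></preformat>
        <p>The same counters, compared against a global ceiling rather than a per-identity one, would give a coarse indicator of saturation at the level of the whole stream.</p>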
        <p>
          Additionally, an online processing of the AIS stream has
also been designed, based on Flink [
          <xref ref-type="bibr" rid="ref11">12</xref>
          ], for the computation
of black holes. As the AIS coverage of a receiver (and consequently
the black hole locations) evolves continuously, this processing
is essential to accurately detect falsified positions.
        </p>
        <p>The data processing box number two corresponds to a
signal processing for the determination of aforementioned
characteristics. These data are stored in the database with the
associated AIS decoded messages.</p>
        <p>The data processing box number one is in charge of the
on-the-fly first-order and second-order data
assessment, and outputs coefficients to be stored in
the database. Similarly, the data processing box number three
is in charge of the third-order and fourth-order
data assessment, and also outputs coefficients to be
stored in the database. Unlike the previous one, this part of
the processing considers time series and needs to query
historical data.</p>
        <p>The data processing box number four will be in charge of
integrity assessments between AIS data and external and
aggregated data (e.g. cartographic information, weather
conditions, results of black hole computations). Finally, data
processing box number five is in charge of running the flags.
The types of processing at this level vary according to
the variety of information available and the requested
anomaly scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>V. CONCLUSION</title>
      <p>This article proposes a method for analysing
message-based data using integrity of information as a key factor.
Considering a variety of data sources, the approach
assesses the message itself, the message with respect to other
messages, the message with respect to external databases, and
the signal itself through its physical characteristics. Applied
to maritime data, such an assessment is motivated by the defects
of the AIS, which transmits erroneous and possibly falsified
data. The method provides integrity-based predicates on data
that are useful for identifying erroneous and
falsified data, leading to risk assessment and alert
triggering for maritime cyber threats. The approach is generic
and could be transposed to many sensor-based systems.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENT</title>
      <p>This research is supported by The French National
Research Agency (ANR) and co-funded by Defense
procurement and technology agency (DGA) under reference
ANR-14-CE28-0028 and labelled by French clusters Pôle
Mer Bretagne Atlantique and Pôle Mer Méditerranée.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Kalyvas</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kokkos</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tzouramanis</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <article-title>A survey of official online sources of high-quality free-of-charge geospatial data for maritime geographic information systems applications</article-title>
          ,
          <source>Information Systems</source>
          , Volume
          <volume>65</volume>
          ,
          <year>2017</year>
          , Pages
          <fpage>36</fpage>
          -
          <lpage>51</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dréo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Camossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Jousselme</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Iphar</surname>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Heterogeneous integrated dataset for maritime intelligence, surveillance, and reconnaissance</article-title>
          .
          <source>Data in Brief</source>
          , accepted for publication
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <collab>European Maritime Safety Agency</collab>
          .
          <source>EMSA Facts &amp; Figures 2016</source>
          . Report, 40 p.
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Balduzzi</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pasta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilhoit</surname>
          </string-name>
          .
          <article-title>A security evaluation of AIS automated identification system</article-title>
          .
          <source>In Proceedings of the 30th Annual Computer Security Applications Conference</source>
          , pages
          <fpage>436</fpage>
          -
          <lpage>445</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iphar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gallen</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bouju</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <article-title>DeAIS project: Detection of AIS Spoofing and Resulting Risks</article-title>
          . In:
          <source>Proceedings of OCEANS'15</source>
          , Genova
          ,
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Iphar</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>Data Quality Assessment For Maritime Situation Awareness</article-title>
          ,
          <source>9th ISPRS International Symposium on Spatial Data Quality (ISSDQ 2015)</source>
          , Volume
          <volume>II-3/W5</volume>
          , pages
          <fpage>291</fpage>
          -
          <lpage>296</lpage>
          , La Grande Motte, France, 29-30 September
          <year>2015</year>
        </mixed-citation>
      </ref>
      <ref id="ref6a">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Harati-Mokhtari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brooks</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Automatic Identification System (AIS): a human factors approach</article-title>
          .
          <source>J. Navig.</source>
          Vol.
          <volume>60</volume>
          (
          <issue>3</issue>
          ), Cambridge University Press,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <collab>The Maritime Executive</collab>
          ,
          <article-title>Iran, Tanzania and falsifying AIS signals to trade with Syria</article-title>
          .
          <source>The Maritime Executive</source>
          , December 7th,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Windward</surname>
          </string-name>
          ,
          <article-title>AIS data on the high seas: an analysis of the magnitude and implications of growing data manipulation at sea</article-title>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Katsilieris</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braca</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Coraluppi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <article-title>Detection of malicious AIS position spoofing by exploiting radar information</article-title>
          .
          <source>In: Proceedings of the 16th International Conference on Information Fusion. Istanbul</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Iphar</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Napoli</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alincourt</surname>
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brosset</surname>
            <given-names>D.</given-names>
          </string-name>
          ,
          <article-title>Risk Analysis of falsified Automatic Identification System for the improvement of maritime traffic safety</article-title>
          , 8 pages,
          <source>ESREL 2016</source>
          , Glasgow, 25th-29th September
          <year>2016</year>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Salmon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ray</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>Design principles of a stream-based framework for mobility analysis</article-title>
          ,
          <source>Geoinformatica</source>
          , Special Issue on GeoStreaming, 25 pages, April
          <year>2016</year>
          (DOI 10.1007/s10707-016-0256-z)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>