Data Variety and Integrity Assessment for Maritime Anomaly Detection

Cyril Ray
Naval Academy Research Institute
Brest, France
cyril.ray@ecole-navale.fr

Abstract: The ever-increasing spread of mobile technologies and connected sensors requires methods and techniques to be continuously updated to cope with the growing volume of data. While fast processing and data management have received a lot of attention, data veracity, and its assessment on the basis of a large variety of contextual data, remains a serious open question. Indeed, sensor-based data collected through automated processes can be altered at every stage of their collection and processing, accidentally or maliciously. In the maritime domain, several embedded sensors continuously report vessels' positions. Beyond known errors (human, misconfiguration), recent works have shown that falsification of such data is easy, and could therefore mask or favor illegal actions, disturb monitoring systems and create new maritime risks. This research presents maritime data quality issues and a methodological approach for modelling, analyzing and detecting such anomalies in data.

Keywords: Data quality assessment, data falsification, maritime data, maritime cyber threats.

I. INTRODUCTION

Activity in the maritime environment is ever-growing. In order to mitigate the risk of grounding or collision, passive and active systems have been developed for mariners and mandated by international authorities. The Automatic Identification System (AIS) is an electronic system installed on board vessels which transmits the vessel's location, amongst many other data. The AIS broadcasts, on a regular basis, 27 kinds of messages, each one having its own purpose in information transmission (positioning, nominative information, management...). The messages are openly broadcast on two dedicated Very High Frequency (VHF) channels. As messages are sent and received by vessels and coastal stations within the radio horizon (circa 40 nautical miles), the system enables a better understanding of the surroundings for vessels and coastal states, thus supporting several uses such as fleet control, traffic control or boarding prevention.

The number of messages is large: on an average day, 19 million messages can be received in Europe from about 80,000 unique vessels [3]. With such an amount of data to process, mainly but not only spatial and temporal, issues linked to big data analyses arise. In particular, this research focuses on the veracity aspect. Indeed, sent messages contain errors (unintentional) and falsifications (intentional), and undergo spoofing (intentional) due to the unsecured transmission channel, which weakens the whole system and the safety of navigation.

This work reports on the design and results of a methodology for the detection of AIS falsification. The objectives are the determination of false messages in real time and the improvement of both the effectiveness of the system as a security system and maritime situational awareness.

II. A SYSTEM WITH WEAKNESSES

Three major cases of bad data quality can be distinguished: errors (when false data is broadcast unintentionally), falsifications (when false data is broadcast deliberately) and spoofing (when data is created or modified and broadcast by an outsider) [5]. Data contained in AIS messages can be erroneous, falsified or spoofed for several reasons: there is no strong verification of the transmission, the transmission uses a non-secured channel, some pieces of information might not be well known by the crew, or the crew may want to hide some data from others. These alterations modify and hamper the understanding of maritime traffic.

Errors, by nature unintentional, can be caused by a transponder deficiency, a wrong manual input, a manual input of poor quality, or erroneous information coming from external sensors. They can affect, for instance, the name of the vessel, its physical characteristics, its position or its destination. The resulting pieces of information can be false, incomplete, impossible according to the norm or physically impossible (for instance, a latitude value cannot exceed 90°). According to [7], circa 50% of the messages contain erroneous data.

A falsification is the deliberate degradation of a message, either by replacing a genuine value with a false one or by stopping the broadcast of messages, in order to mislead the outer world. Identity theft [8], disappearances [9], the broadcast of false GNSS coordinates and the statement of a wrong activity [10] are types of falsification. According to [7], about 1% of the vessels broadcast falsified data.

Spoofing is done by an external actor who creates false messages ex nihilo and broadcasts them on the AIS frequencies [4]. Spoofing activities aim to mislead both the outer world and the crews at sea, by creating ghost vessels, a false closest-point-of-approach trigger, a false emergency message or even a false course (in the case of a spoofed vessel).

The whole chain of AIS data transmission can be affected by one of these three problems, from the GPS signal to human supervision, through data transmission and the distributed data processing and information systems involved. In order to formally identify these threats, an EBIOS risk analysis of the AIS has been performed [11]. It consists of the analysis of the vulnerabilities, failures and risks associated with the system, enabling the identification of issues that could actually emanate from the use of AIS. This method has been chosen for its compliance with ISO norms. A list of circa 350 threat scenarios and a typology of anomalies have been established.

III. A VARIETY OF DATA

Depending on the objectives, variety of data can be more important than volume. Variations in variety consider the use of heterogeneous data sources to complement a core dataset in the understanding of a given situation. Indeed, data analysis at large, including the detection of abnormal situations, can sometimes be resolved or confirmed only by algorithms that take advantage of additional, complementary sources of information. This variety of data is absolutely required where (sensor-based) data with known quality issues are analysed despite a lack of ground truth. Beyond the understanding of data, varying the variety, that is, progressively including additional sources, is also a means of understanding the quality of the algorithms processing the data (e.g. data compression, mining, visualisation).

While efforts have been initiated to centralise maritime data and information, most of the data are of heterogeneous type and format and are still independently sourced and maintained [1]. These data can support maritime situational awareness provided they are harmonised, properly combined, integrated, summarised, and possibly cleaned of inconsistencies. Indeed, it is expected that the analysis and understanding of maritime activities cannot be deduced solely from vessel kinematics but would strongly benefit from complementary data of various types. However, the integration and combination (or fusion) of such data remain challenging (e.g., spatio-temporal alignment of data, fusion of data from different sensors, maritime anomaly detection, activity classification) and research is still needed to develop efficient techniques.

In this research, ship information collected through the Automatic Identification System has been prepared together with correlated data aligned in space and time. The dataset has been carefully prepared and validated in order to offer the research community a set of heterogeneous real data to challenge, test and validate their research developments [2] and, in the scope of this research, to assess falsification cases of AIS data.

The dataset¹ contains four categories of data: navigation data (vessel positions acquired automatically by an AIS receiver), vessel-oriented data (public, official nominative vessel position), geographic data (cartographic, topographic or regulatory context of vessel navigation), and environmental data (weather and ocean data from forecast models and from observations). It covers a time span of six months, from October 1st, 2015 to March 31st, 2016, and provides ship positions over the Celtic Sea, the North Atlantic Ocean, the English Channel, and the Bay of Biscay.

¹ C. Ray, R. Dréo, E. Camossi, A.-L. Jousselme, Heterogeneous Integrated Dataset for Maritime Intelligence, Surveillance, and Reconnaissance (Version 0.1). Data set. Licence CC-BY-NC-SA-4.0. Zenodo. doi.org/10.5281/zenodo.1167594, February 2018

IV. INTEGRITY ASSESSMENT

Since the AIS does not carry perfectly genuine data (beyond data errors), since those inaccuracies are imperfect and therefore detectable, and since their impact on the real world can be substantial, a set of objectives has been defined for the detection of falsifications. Relying on an accurate understanding of the way the system is supposed to work, of its vulnerabilities and of the errors and falsifications that have been highlighted, these objectives include the creation of an attacking platform allowing the creation and broadcast of falsified data, the modelling of a statistics- and algorithm-based falsification detection mechanism, the creation of an information system for the real-time handling of data taking into account archived or forecast data, and the modelling of the risks induced by an inadequate use of AIS, as well as an assessment of the risks linked to AIS errors, anomalies, falsification or spoofing.

The intentional broadcast of false AIS information can be understood at both the physical and logical levels. The first approach focuses on the signals transmitted by transponders, while the second considers the information exchanged, where fraud and attacks can be identified by a message-based data mining methodology that identifies abnormal messages (and parameters). Our approach combines both analyses within a single information system.

A. Message-based analysis

The method for the integrity assessment of messages and the discovery of anomalous data relies primarily on spatial information, which is the cornerstone of AIS messages, but not only on it, as the AIS also broadcasts a large amount of contextual and control information across its 27 messages [6].

Considering the data within the fields of the 27 AIS messages, four ways to assess the inner integrity of those data can be distinguished. The first way is to check the integrity of each field of each message taken individually. The second way works at the scale of one single message and assesses the integrity, within this very message, of all the fields with respect to one another. As there are 27 types of messages, and messages of the same type have the same fields, it is possible to compare them as time series and assess their integrity; this is the third way. Finally, the fourth way is the comparison and integrity assessment of the fields of different messages. Indeed, although pieces of information can come from different messages, it is possible to assess their integrity as some fields are either the same, linked or comparable (e.g. id-based cross verification in order to link information received by different stations). Those four ways are referred to as first-order, second-order, third-order and fourth-order assessments, respectively.

Depending on the type of messages assessed and the order of assessment, the number of items to check is fixed. We established a list of 935 integrity items for the 27 messages, together with an ad-hoc nomenclature so that each item has a clear unique identifier. Predicate logic can present, in a formal, rigorous and unambiguous way, the actions that lead to the integrity determination of an item. Relying on three main elements (the data field values, the syntax and the expert knowledge values), a logic-based formalism based on predicate logic has been chosen for item assessment. Of these items, 666 have been implemented.

A falsification being either the transmission of erroneous data or an attempt to trick the system by making it behave in a way it is not supposed to, a falsification scenario can take numerous forms. Linking integrity assessment with falsification scenarios is essential for the identification of cyber threats relying on the AIS. A set of 23 algorithms (so-called flags) has been designed for the identification of 4 falsification scenarios: falsified identity, positions, control messages, and saturation. Amongst these flags, f_quadruplet checks whether one element of the identity quadruplet (MMSI number, IMO number, Callsign, Name) has changed over time, and f_ubiquity checks whether a vessel reports two distinct locations at the same time.

B. Signal-based analysis

We also studied physical characteristics of the signal, which are intended to be integrated in the mining process. We considered five parameters. The first parameter is the power of the received signal; the four others are time-dependent and relate to the shape of the signal. While these parameters cannot fully qualify a ship's identity and presence, their regularity can help to identify inconsistent values.

C. Processing principles

A synoptic diagram of the proposed architecture can be found in Figure 1. The AIS stream can be received from various sources and is directed towards a centralised processing. The parser provides message parameters (Pi). It includes a statistical analysis of messages (per ship identity, per type and at the global level) for the identification of AIS saturation. The whole architecture is built around the central database (DéAIS DB), where the historical data described in Section III are stored and where streamed data are processed asynchronously. The implementation of the database relies on the relational database model (postgres/postgis).

Fig. 1. Processing principles

Additionally, an online processing of the AIS stream has also been designed, based on Flink [12], for the computation of black holes. As the AIS coverage of a receiver (and consequently the black hole locations) evolves continuously, this processing is essential to accurately detect falsified positions.

Data processing box number two corresponds to the signal processing that determines the aforementioned characteristics. These data are stored in the database with the associated decoded AIS messages.

Data processing box number one is in charge of the on-the-fly analysis of first-order and second-order data assessments; its output is a set of coefficients stored in the database. Similarly, data processing box number three is in charge of the analysis of third-order and fourth-order data assessments, whose output coefficients are also stored in the database. This part of the study, unlike the previous one, considers time series and needs to request historical data.

Data processing box number four will be in charge of integrity assessments between AIS data and external and aggregated data (e.g. cartographic information, weather conditions, results of black hole computations). Finally, data processing box number five is in charge of running flags. Of course, the types of processing at this level vary according to the variety of information available and the requested anomaly scenarios.

V. CONCLUSION

This article proposes a method for analysing message-based data using integrity of information as a key factor. Considering a variety of data sources, the approach described performs an assessment on the message itself, on the message with respect to other messages, on the message with respect to external databases, and on the signal itself with its physical characteristics. Applied in the context of maritime data, such an assessment is made necessary by the defects of the AIS, which transmits erroneous and possibly falsified data. This method provides integrity-based predicates on data that are useful for the determination of erroneous and falsified data, leading to risk assessment and alert triggering for maritime cyber threats. The approach is generic and could be transposed to many sensor-based systems.

ACKNOWLEDGMENT

This research is supported by the French National Research Agency (ANR) and co-funded by the Defense procurement and technology agency (DGA) under reference ANR-14-CE28-0028, and labelled by the French clusters Pôle Mer Bretagne Atlantique and Pôle Mer Méditerranée.

REFERENCES

[1] Kalyvas C., Kokkos A., Tzouramanis T., A survey of official online sources of high-quality free-of-charge geospatial data for maritime geographic information systems applications, Information Systems, Volume 65, 2017, Pages 36-51
[2] Ray, C., R. Dréo, E. Camossi, A.-L. Jousselme, and C. Iphar (2018). Heterogeneous integrated dataset for maritime intelligence, surveillance, and reconnaissance. Data In Brief, Accepted for publication
[3] European Maritime Safety Agency. EMSA Facts & Figures 2016. Report. 40p. (2016)
[4] Balduzzi M., A. Pasta, and K. Wilhoit. A security evaluation of ais automated identification system. In Proceedings of the 30th Annual Computer Security Applications Conference, pages 436–445. ACM, 2014.
[5] Ray, C., Iphar, C., Napoli, A., Gallen, R. and Bouju, A. DeAIS project: Detection of AIS Spoofing and Resulting Risks. In: The proceedings of OCEANS'15. Genova, 2015
[6] Iphar, C., Napoli, A., Ray, C., Data Quality Assessment For Maritime Situation Awareness, 9th ISPRS International Symposium on Spatial Data Quality (ISSDQ 2015), Volume II-3/W5, pages 291-296, La Grande Motte - France, 29-30 September 2015
[7] Harati-Mokhari, A., Wall, A., Brooks, P. and Wang J., Automatic Identification System (AIS): a human factors approach. J. Navig. Vol 60(3), Cambridge University Press, 2007.
[8] The Maritime Executive, Iran, Tanzania and falsifying AIS signals to trade with Syria. Published in The Maritime Executive, December 7th, 2012.
[9] Windward, AIS data on the high seas: an analysis of the magnitude and implications of growing data manipulation at sea, 2014.
[10] Kastilieris, F., Braca, P. and Coraluppi, S., Detection of malicious AIS position spoofing by exploiting radar information. In: proceedings of the 16th international conference on information fusion. Istanbul, 2013.
[11] Iphar C., Napoli A., Ray C., Alincourt E., Brosset D., Risk Analysis of falsified Automatic Identification System for the improvement of maritime traffic safety, 8 pages, ESREL 2016, Glasgow 25th–29th September 2016
[12] Salmon, L., Ray, C., Design principles of a stream-based framework for mobility analysis, Geoinformatica, Special Issue on GeoStreaming, 25 pages, April 2016 (DOI 10.1007/s10707-016-0256-z)
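
As an illustration of the first-order and second-order assessments of the message-based analysis (Section IV.A), the following Python sketch evaluates a few integrity items on a single position report. It is not the implementation used in this work. The PositionReport structure, the item names and the 3-knot threshold are assumptions introduced for the example, and the real items are expressed in predicate logic over the actual AIS fields. The latitude bound follows the physical constraint recalled in Section II, and the codes 1 (at anchor) and 5 (moored) are the standard AIS navigational status values.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PositionReport:
    """Hypothetical, simplified view of a decoded AIS position report."""
    mmsi: int
    timestamp: float    # seconds since epoch, as seen by the receiver
    latitude: float     # decimal degrees
    longitude: float    # decimal degrees
    speed_knots: float  # speed over ground
    nav_status: int     # AIS navigational status code


AT_ANCHOR = 1
MOORED = 5
# Assumed tolerance: a vessel declared stationary should not move faster.
MAX_STATIONARY_SPEED_KNOTS = 3.0


def latitude_in_range(report: PositionReport) -> bool:
    """First-order item: the latitude field alone is physically possible."""
    return -90.0 <= report.latitude <= 90.0


def longitude_in_range(report: PositionReport) -> bool:
    """First-order item: the longitude field alone is physically possible."""
    return -180.0 <= report.longitude <= 180.0


def status_speed_consistent(report: PositionReport) -> bool:
    """Second-order item: two fields of the same message agree.

    A vessel declared at anchor or moored should report a near-zero speed.
    """
    if report.nav_status in (AT_ANCHOR, MOORED):
        return report.speed_knots <= MAX_STATIONARY_SPEED_KNOTS
    return True


def assess(report: PositionReport) -> Dict[str, bool]:
    """Evaluate every item and return one boolean per item identifier."""
    return {
        "latitude_in_range": latitude_in_range(report),
        "longitude_in_range": longitude_in_range(report),
        "status_speed_consistent": status_speed_consistent(report),
    }
```

Each item returns a boolean keyed by an identifier, which mirrors the idea of a nomenclature of uniquely named integrity items whose results can be stored and later combined by flags.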
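
The two flags named in Section IV.A, f_quadruplet and f_ubiquity, can be sketched in the same spirit. Only the flag names and their stated purpose come from this work. The data structures, the 10-second tolerance defining "the same time" and the 50 km separation defining "distinct locations" are assumptions chosen for the example, and the great-circle distance uses the standard haversine formula.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple


class Identity(NamedTuple):
    """Identity quadruplet broadcast by a vessel."""
    mmsi: int
    imo: int
    callsign: str
    name: str


class Fix(NamedTuple):
    """Hypothetical minimal position fix."""
    mmsi: int
    timestamp: float  # seconds
    latitude: float   # decimal degrees
    longitude: float  # decimal degrees


EARTH_RADIUS_KM = 6371.0
SAME_TIME_TOLERANCE_S = 10.0  # assumed meaning of "at the same time"
MIN_SEPARATION_KM = 50.0      # assumed meaning of "distinct locations"


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def f_quadruplet(history: Sequence[Identity]) -> List[str]:
    """Return the quadruplet fields that took more than one value.

    The history is assumed to be already attributed to a single vessel.
    """
    return [field for field in Identity._fields
            if len({getattr(q, field) for q in history}) > 1]


def f_ubiquity(fixes: Sequence[Fix]) -> List[Tuple[Fix, Fix]]:
    """Return pairs of near-simultaneous, far-apart fixes of one MMSI."""
    ordered = sorted(fixes, key=lambda f: f.timestamp)
    conflicts = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.timestamp - first.timestamp > SAME_TIME_TOLERANCE_S:
                break  # later fixes are even further away in time
            if first.mmsi == second.mmsi and haversine_km(
                    first.latitude, first.longitude,
                    second.latitude, second.longitude) >= MIN_SEPARATION_KM:
                conflicts.append((first, second))
    return conflicts
```

Returning the changed fields or the conflicting pairs, rather than a bare boolean, keeps the evidence needed by an operator to qualify the alert. A non-empty result raises the corresponding flag.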