Joint Proceedings of Modellierung 2020 Short, Workshop and Tools & Demo Papers, Workshop on Models in AI

Learning of Utility Functions for the Behaviour Analysis in Maritime Surveillance Tasks

Mathias Anneken1,∗, Sebastian Markgraf1,∗, Sebastian Robert2,3, Jürgen Beyerer1,2,3

1 Vision and Fusion Laboratory (IES), Karlsruhe Institute of Technology (KIT), mathias.anneken@kit.edu, sebastian.markgraf@student.kit.edu
2 Fraunhofer IOSB, Karlsruhe, Germany, sebastian.robert@iosb.fraunhofer.de, juergen.beyerer@iosb.fraunhofer.de
3 Fraunhofer Center for Machine Learning
∗ equal contribution, alphabetical order

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: For detecting suspicious activities in the maritime domain, such as illegal, unreported and unregulated fishing, it is crucial to counter the increasing amount of available information on vessels at sea with sophisticated algorithms and user interfaces. Deep learning and other data-driven approaches yield good results, but for an operator to be able to intervene in suspicious activities, the supporting algorithms must be explainable and transparent. One possibility is to use simulations based on utility functions and behaviour models derived from the field of human behaviour modelling, such as game theory. An expensive and time-consuming way is to have experts model these utility functions by hand. As this is not always feasible, and the behaviour patterns might not be easily expressed by experts, this work follows a different approach and utilizes inverse reinforcement learning to estimate the utility functions for different ship types. For this study, data based on the automatic identification system (AIS) is used to compare the behaviour of cargo vessels and fishing boats.

Keywords: maximum entropy inverse reinforcement learning; maritime domain; agents; decision support; spatiotemporal data

1 Introduction

The amount of traffic at sea is large and steadily increasing: In 2018, 80 % of all international goods were carried by sea. This is only possible due to a fleet of over 94,000 vessels with a combined tonnage of 1.92 billion dwt [Un18]. In order to ensure the safe voyage of all vessels, different methods and technologies were introduced to handle the traffic, e. g., the automatic identification system (AIS) for collision avoidance or the analysis of the spatiotemporal data of actors.

Due to the better quality and availability of sensor systems, the amount of information available for analysing and monitoring the behaviour of entities in surveillance tasks is steadily increasing. This is also true for the maritime domain: Satellites, radio communication, radar and other technologies make it easy to acquire a huge amount of data, which might result in an information overload for operators such as coast guards. In order to make the most of the data, it is crucial to support the operators with reliable algorithms. For an operator, not only the performance of an algorithm is of importance, but also the possibility to understand why the algorithm has decided on a particular result. This means the algorithm has to be transparent and explainable. One possible approach for an explainable algorithm is to rely on utility functions (including reward and cost functions) for explaining certain behaviour.
In [AFB17], a game theoretic approach is used for modelling rational behaviour of different vessels in the maritime domain. The underlying utility functions are hand-crafted and need to be specifically tailored to mimic the behaviour of the vessels. As this process is tedious and error-prone, this paper shows a data-driven method to learn a utility function for different vessel types based on inverse reinforcement learning. This method is also applicable to other domains, e. g., car traffic.

2 State of the art

In general, utility functions are used to train an agent for a specific problem. The agent assesses actions by evaluating their utility in order to choose a favourable one. Mostly, this function is tailored by hand and then directly used to learn and predict the behaviour of agents. Here, the approach is reversed: By learning the utility function, the behaviour will be mimicked. A popular example is inverse reinforcement learning (IRL), as introduced by Ng and Russell [NR00] for learning a reward function from expert trajectories. Some open challenges of the original approach include the insufficient heuristics for deciding on a reward function and the limitation to small state spaces.

Wulfmeier et al. [Wu17] learn cost functions directly from sensor inputs. Their own deep maximum entropy IRL algorithm is used in conjunction with a convolutional neural network for autonomous driving in complex, urban environments. This allows them to skip the feature extraction and feed the sensor map around a vehicle directly to the algorithm. The learned cost map is given to a path planning algorithm, and trajectories of the vehicle can be calculated corresponding to the current sensor input. Their data consists of over one year of vehicle trajectories.

Maximum entropy IRL by Ziebart et al. [Zi08] extends the original approach by using the maximum entropy principle as the deciding factor for a reward function. The maximum entropy relates to the amount of assumptions made about the function. The authors showed that using the least amount of assumptions leads to the most general reward function. Wulfmeier et al. [WOP15] further extend maximum entropy IRL to use a neural network for the approximation of the reward function and describe the current state-of-the-art approach: deep maximum entropy IRL.

Hirakawa et al. [Hi18] predict animal trajectories by learning their underlying reward structure. They use GPS data collected from bird flights to create cost maps depending on the current terrain type. Afterwards, these cost maps are used to predict paths with an optimal control algorithm. The approach is novel in using time as an explicit part of the state. This increases the state space's dimension vastly, but allows them to predict indirect trajectories instead of only direct paths.

3 Foundations

This section briefly introduces the necessary foundations for implementing IRL. In this work, each vessel is modelled as an agent. The agent perceives its environment and performs actions to influence it, as described by Russell and Norvig [RN16]. For this purpose, the agent senses the environment with some kind of sensor. This information is used to build an internal representation of all important entities around the agent. Based on this state, the agent performs actions to alter the environment. The transitions between states do not have to be deterministic. Russell and Norvig [RN16] differentiate between different kinds of agents.
For reinforcement learning, the learning agent in particular is crucial.

3.1 Markov Decision Process

A (finite) Markov decision process (MDP) models a decision process. It is given by a tuple (S, A, T, R, γ) with

• the state space S: set of all possible states of the environment
• the action space A: set of all possible actions of the agent
• the reward R: abstract reward value for each state
• the transition probability T: probability of a state transition depending on the current state and chosen action
• the discount factor γ: trade-off between near and far away rewards

As the MDP fulfils the Markov property, the next step depends only on the current state and the chosen action. The solution to an MDP is called a policy π, which maps a given state to an (in some measure) optimal action: π : S → A. As this paper deals with stochastic policies, the policy does not map to a single action, but rather to a probability distribution over possible actions.

3.2 Inverse reinforcement learning

While reinforcement learning estimates an optimal policy π* from a given MDP with reward function R, IRL tries to map a given set Z of expert trajectories ζ = [s_1, s_2, ..., s_n] of an unknown optimal policy π* to the underlying reward function R of the expert. Each trajectory ζ is given by consecutive states s_i. The states are composed of a two-dimensional position and a discrete timestamp s_i = (x_i, y_i, t_i). Here, the reward function is assumed to be linear in its features:

    R(\zeta, \theta) = \sum_{s_j \in \zeta} \theta^T f(s_j)    (1)

with f(s_j) as feature vector and θ as the corresponding weight vector. f(ζ) denotes the sum of a feature over all states in a trajectory:

    f(\zeta) = \sum_{s_j \in \zeta} f(s_j)    (2)

Following [Hi18], the expected reward is then defined by

    E_{p_\pi(\zeta)}[R(\zeta, \theta)] = E_{p_\pi(\zeta)}\Big[\sum_{s_j \in \zeta} \theta^T f(s_j)\Big] = \theta^T E_{p_\pi(\zeta)}[f(\zeta)]    (3)

with p_π(ζ) as the probability of trajectory ζ given the policy π. The expected feature vector E_{p_\pi(\zeta)}[f(\zeta)] is then matched with the empirical count of the observed behaviours Z from the optimal policy π*:

    E_{p_{\pi^*}(\zeta)}[f(\zeta)] = \frac{1}{|Z|} \sum_{\zeta \in Z} f(\zeta) = \bar{f}    (4)

Following [NR00], this yields a set of solutions containing degenerate reward functions as well, e. g., R = 0. Filtering the reward functions through heuristics yields a useful result.

3.3 Maximum entropy inverse reinforcement learning

Maximum entropy inverse reinforcement learning (MaxEntIRL) is an extension to the general IRL approach. As its name suggests, it utilizes the maximum entropy principle as described by [Ja57]. In [Zi08], Ziebart et al. tackle the problem by defining a probability distribution over all possible paths in the MDP. Although many paths match the feature expectations, following [Ja57] the paths yielding higher rewards should be exponentially more likely to be chosen. This results in the probability distribution with the partition function Z(θ) as given in [Zi08]:

    P(\zeta \mid \theta) = \frac{1}{Z(\theta)} \exp\Big(\sum_{s_j \in \zeta} \theta^T f(s_j)\Big) = \frac{\exp\big(\sum_{s_j \in \zeta} \theta^T f(s_j)\big)}{\int_{\zeta'} \exp\big(\sum_{s'_j \in \zeta'} \theta^T f(s'_j)\big)\, d\zeta'}    (5)

By maximizing the log-likelihood, the optimal weights θ* can be calculated:

    \theta^* = \operatorname{argmax}_\theta L(\theta) = \operatorname{argmax}_\theta \frac{1}{|Z|} \sum_{\zeta \in Z} \log P(\zeta \mid \theta)    (6)

The gradient is then given by:

    \nabla L(\theta) = \bar{f} - \sum_{\zeta \in Z} P(\zeta \mid \theta)\, f(\zeta) = \bar{f} - \sum_{\zeta \in Z} \sum_{s_j \in \zeta} f(s_j)\, D_\zeta(s_j)    (7)

with D_ζ as the expected state visitation frequency. Following [Zi08], the expected state visitation frequency can be calculated by using forward and backward passes.
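Before describing these passes, the quantities from Eqs. (1), (2) and (4) can be made concrete. The following is a minimal sketch, not taken from the paper, assuming that the feature maps are stacked into a NumPy array over the analysis grid and that each state is encoded as a grid cell; all names and values are illustrative.

```python
import numpy as np

# Assumed setup (illustrative, not from the paper): F feature maps over an
# H x W grid, stacked as an array of shape (F, H, W); a state is a grid cell
# (row, col); a trajectory is a list of such cells.

def trajectory_features(feature_maps: np.ndarray, trajectory) -> np.ndarray:
    """Sum of the feature vectors f(s_j) over all states of one trajectory, Eq. (2)."""
    return sum(feature_maps[:, r, c] for r, c in trajectory)

def trajectory_reward(theta: np.ndarray, feature_maps: np.ndarray, trajectory) -> float:
    """Linear reward R(zeta, theta) = theta^T * sum_j f(s_j), Eq. (1)."""
    return float(theta @ trajectory_features(feature_maps, trajectory))

def empirical_feature_expectation(feature_maps: np.ndarray, demonstrations) -> np.ndarray:
    """Mean feature count f_bar over the expert demonstrations Z, Eq. (4)."""
    return np.mean([trajectory_features(feature_maps, z) for z in demonstrations], axis=0)

# Toy usage with two hypothetical 5 x 5 feature maps and two expert trajectories.
maps = np.stack([np.ones((5, 5)), np.random.rand(5, 5)])
expert = [[(0, 0), (1, 1), (2, 2)], [(0, 1), (1, 2), (2, 3)]]
theta = np.array([-1.0, 0.5])
print(trajectory_reward(theta, maps, expert[0]))
print(empirical_feature_expectation(maps, expert))
```

The expected state visitation frequencies required for the gradient in Eq. (7) are then obtained with the backward and forward passes described next.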
The algorithm calculates backwards from the terminal states and keeps track of the probability mass associated with each branch by calculating the partition function. This leads to state frequencies which add up to the total frequency counts. Following [KW97], the weights are updated using the exponentiated gradient with a learning rate λ,

    \theta \leftarrow \theta\, e^{\lambda \nabla L(\theta)}    (8)

until they converge.

4 Preprocessing of AIS data

This section describes the necessary preprocessing steps applied to the AIS input data so that it is suited for further usage in MaxEntIRL. This includes filtering for the correct location as well as the ship types, and the mapping to the grid.

The dataset [Ra18] contains AIS data from the Celtic Sea, the Channel and the Bay of Biscay in the vicinity of Brest, France, as shown in Fig. 1 (longitude between -10.00 and 0.00, and latitude between 45.00 and 51.00). The data was recorded over a period of six months from 01/10/2015 until 31/03/2016. In addition to the AIS data, complementary material, such as lists of ports, weather and ocean conditions, and fishing areas, is included. The AIS data can be split into a dynamic (18,648,556 decoded messages) and a static (1,032,187 decoded messages) part. The dynamic part consists of the position, speed over ground, true heading, course over ground, rate of turn, and navigational status. The static part includes the Maritime Mobile Service Identity (MMSI), ship type, destination, and draught.

Fig. 1: The area around Brest indicated by the black rectangle (longitude between -10.00 and 0.00, and latitude between 45.00 and 51.00) is used. The dots indicate ports as listed in the World Port Index.

4.0.1 Filtering for ship type

As only fishing boats and container vessels are analysed, a filter is applied to the static data.

4.0.2 Splitting by length

As the complete trajectories of the vessels are not suitable for the learning task, the trajectories are split using their length as discriminating factor. Training on shorter trajectories increases the learning speed due to the smaller state space and therefore faster state visitation calculation. Furthermore, the positional information provided by AIS data can be quite noisy, as described in [SA17]. This low reliability regarding the position can be compensated for in many cases by splitting the trajectory based on a length threshold: Position errors often span great distances, and the affected tracks are thus split into shorter trajectories. Too short trajectories are discarded in the last step.

4.0.3 Splitting by time

As the input data partly stems from a port area, many vessels are either anchored or moored, which implies that they will not move in a way that is meaningful for discerning their type: The position will be centred around a single point (in port) or on a circle (anchorage area). Thus, when considering the time dimension, including the vessels in port would just create longer trajectories. By applying a threshold on the time, the trajectories are split at ports (and at general stopping manoeuvres).

4.0.4 Filtering rivers

As vessels on rivers are not free in their navigation, learning their decision process will not yield additional information on their contribution to the utility function. Using [WS96] as a database for geo-referenced information on rivers, trajectories that have more than 80 % of their path close to a river are removed.
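As an illustration of the filtering and splitting steps described above, the following sketch shows one possible realisation, not the authors' pipeline. It assumes the dynamic and static AIS messages are available as pandas data frames with mmsi, timestamp, lat, lon and ship_type columns; the ship-type codes and thresholds are illustrative only.

```python
import pandas as pd

# Assumed columns: dynamic AIS messages (mmsi, timestamp, lat, lon) and
# static AIS messages (mmsi, ship_type). All thresholds are illustrative.
FISHING, CARGO = 30, 70          # assumed AIS ship-type codes for the two analysed types
MAX_GAP = pd.Timedelta("30min")  # split a trajectory when messages are further apart
MIN_POINTS = 20                  # discard trajectories shorter than this

def split_trajectories(dynamic: pd.DataFrame, static: pd.DataFrame) -> list:
    """Filter for the two analysed ship types and split tracks at large time gaps."""
    wanted = static.loc[static["ship_type"].isin([FISHING, CARGO]), "mmsi"]
    messages = dynamic[dynamic["mmsi"].isin(wanted)].sort_values(["mmsi", "timestamp"])
    trajectories = []
    for _, track in messages.groupby("mmsi"):
        gaps = track["timestamp"].diff() > MAX_GAP      # True where a new segment starts
        for _, part in track.groupby(gaps.cumsum()):
            if len(part) >= MIN_POINTS:
                trajectories.append(part[["lat", "lon", "timestamp"]])
    return trajectories
```

The subsequent steps in Sections 4.0.5 to 4.0.7 then operate on the resulting list of trajectories.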
4.0.5 Mapping to grid

Using a continuous state space would imply a huge state space, resulting in challenges regarding computability. Thus, the state space is reduced by discretizing the position.

4.0.6 Splitting round trips

Predicting trajectories that perform round trips (start and end point are nearly the same) poses a challenge to MaxEntIRL. The logical path in these cases would be the direct path between start and goal, with nearly no movement at all. Therefore, these trajectories are split into two different trajectories.

4.0.7 Filtering short trajectories

Only longer trajectories covering a certain number of grid cells will be used in the evaluation. Therefore, all trajectories shorter than a threshold are omitted.

5 Creating feature maps

For the analysed areas, MaxEntIRL needs to be fed with feature maps representing certain properties of the environment. The feature maps are used to estimate the utility for the agent based on Equation (1). All types of feature maps are shown in Fig. 2. As traffic at sea is analysed, geographical information like the coastline is crucial. Here, a shapefile from [WS96] is used. A shapefile provides geo-referenced geometrical objects describing, e. g., the outlines of continents and islands or the paths of rivers.

5.0.1 Constant Feature Map

In order to penalize longer trajectories, a constant feature map as depicted in Fig. 2a is used. By increasing the weight for this feature, the prediction will prefer shorter trajectories.

5.0.2 Coast Feature Map

A map of the coast is generated by using the shapefile. The coastline is directly mapped to the grid. The resulting coast feature is shown in Fig. 2b. All landmasses have a constant cost in order to avoid (fatal) collisions.

5.0.3 Special Feature Maps

Some other special feature maps are generated for fishing areas and traffic separation areas. The fishing areas are bundled in [Ra18]. Originally, they were released by the European Commission Joint Research Centre4. The traffic separation scheme denotes areas that normal sea traffic is not allowed to enter or in which it has to follow a specific direction. Here, OpenSeaMap5 data is used.

4 https://bluehub.jrc.ec.europa.eu/webgis_fish
5 https://www.openseamap.org

5.0.4 Distance to Coast Maps

Several other coastline-related features are derived. First of all, the distance to the coast is used, resulting in a feature map that rewards keeping a greater distance to the coast. An exponential function with three different variances σ² = {1, 4, 10} is applied to the distance. This results in a total of three maps; one of them is given in Fig. 2e. As many vessels try to keep a certain distance from the coast, another feature based on the inverse of the first class of distance-to-coast maps is introduced, again with the same three variances. One of these feature maps is shown in Fig. 2f.

Fig. 2: All generated feature maps. (a) Constant. (b) Coast. (c) Fishing areas. (d) Traffic separation areas. (e) Distance to coast for σ² = 10. (f) Distance to coast inverted for σ² = 1. (g) Distance to harbour for σ² = 1.

5.0.5 Distance to Harbour Maps

The last class of feature maps is based on the distance to harbours. It is depicted in Fig. 2g. The generation is similar to the other distance-related features. The idea is to capture the distance a vessel likes to keep from the nearest harbour.
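The distance-based feature maps of this section could, for instance, be rasterised as sketched below. This is an assumption about one possible implementation, not the authors' code: it presumes a boolean land mask on the analysis grid and uses a Euclidean distance transform together with an exponential weighting for the three variances mentioned above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def distance_to_coast_maps(land_mask: np.ndarray, variances=(1.0, 4.0, 10.0)) -> dict:
    """Build the constant, coast, distance-to-coast and inverted distance feature maps.

    land_mask: boolean grid, True for cells covered by land (rasterised shapefile).
    Returns a dict of feature maps with values in [0, 1].
    """
    # Distance (in grid cells) of every sea cell to the nearest land cell.
    dist = distance_transform_edt(~land_mask)
    maps = {"constant": np.ones_like(dist), "coast": land_mask.astype(float)}
    for var in variances:
        near_coast = np.exp(-dist**2 / (2.0 * var))   # high close to the coast (assumed form)
        maps[f"coast_dist_inv_{var}"] = near_coast
        maps[f"coast_dist_{var}"] = 1.0 - near_coast  # high far away from the coast
    return maps

# Toy usage with a hypothetical 4 x 4 grid whose left column is land.
mask = np.zeros((4, 4), dtype=bool)
mask[:, 0] = True
features = distance_to_coast_maps(mask)
```

The harbour-distance maps could be built analogously by rasterising the port locations instead of the land mask.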
6 Evaluation

This section discusses the results for the different ship types, namely container vessels and fishing boats.

6.1 Resulting Weights

In Fig. 3, the weights for container vessels and fishing boats are depicted in blue and orange. For both types, the weights are quite similar: First of all, the greatest weight for both types is on the coast feature map (Fig. 2b), in order to avoid trajectories over land. The second highest weight is on the different distance maps, possibly in order to keep a safe distance to the coast. The feature map for the traffic separation scheme (Fig. 2d) has only a small weight. The problem here is that there is only one small traffic separation area in the used database; thus, the small weight is no surprise. More unexpected is the small weight on the constant feature map (Fig. 2a), as this feature was intended to ensure shorter trajectories. Probably, the combination of other features, which are constant over larger parts of the area, has this specific effect.

Fig. 3: Resulting weights for the different feature maps.

By combining the weighted feature maps using Equation (1), an estimate of a static reward function can be calculated. This map is depicted for the two types in Fig. 4. As expected from the similarity of the weight distributions of the two types, these maps look quite alike.

Fig. 4: Resulting weighted combination of feature maps for fishing boats (left) and container ships (right).

6.2 Testing the Utility Function

The previous section gives a qualitative assessment of the training results. At first glance, they seem to match the general expectations. For a quantitative evaluation, the utility function is used for predicting actual trajectories. For this prediction, the start and end state of a real trajectory are used. The trajectory between these two states is estimated by calculating the best policy using the learned reward functions. For each step, the action with the highest probability is taken until the end state is reached. This will result in an indirect trajectory. For comparison, a linear interpolation between the start and end state is used. This will always result in the shortest path between these two states, but it might not respect land masses which are not traversable.

To compare the predictions based on the learned utility function and on the linear interpolation with the ground truth, dynamic time warping (DTW) is used in order to account for differences in timing. The Euclidean distance is used as distance metric. In Tab. 1, the resulting distances for the different vessel types are shown. For fishing boats, the MaxEntIRL predictions yield better results compared to the linear interpolation. This means that, on average, the MaxEntIRL predictions are closer to the original trajectory and thus mimic the behaviour better than the linear interpolation. One major advantage is the avoidance of land, as seen, e. g., in Fig. 5. Thus, the algorithm has learned that fishing boats want to keep a certain distance to the coast. For the container vessels, the results of MaxEntIRL are slightly worse compared to the linear interpolation. One reason could be the often straight and direct routes of the container vessels far away from the coast. Thus, by just following a straight line, the interpolation is already quite close to the original trajectory. MaxEntIRL seems to be at a disadvantage for this task.
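To illustrate how the comparison reported in Tab. 1 can be computed, the following is a minimal sketch of a plain dynamic time warping distance between a predicted and an observed trajectory of grid positions, using the Euclidean distance as step cost. It is an illustrative implementation, not the one used by the authors.

```python
import numpy as np

def dtw_distance(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dynamic time warping distance between two trajectories of (x, y) positions."""
    n, m = len(pred), len(truth)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - truth[j - 1])  # Euclidean step cost
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])

# Toy usage: a slightly indirect prediction against a straight ground truth.
truth = np.array([[0, 0], [1, 1], [2, 2], [3, 3]], dtype=float)
pred = np.array([[0, 0], [1, 0], [2, 1], [3, 3]], dtype=float)
print(dtw_distance(pred, truth))
```

Averaging such distances over all test trajectories of a vessel type would yield numbers comparable to those reported in Tab. 1.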
Tab. 1: Resulting distances to the original trajectory using dynamic time warping.

Method                 Fishing boats   Container vessels
Linear interpolation   3.9 ± 2.7       1.0 ± 0.7
MaxEntIRL              2.0 ± 1.2       1.9 ± 0.9

Fig. 5: Trajectory prediction for a fishing boat (left) and a container vessel (right). The red line is the prediction, green a linear interpolation, and black the original trajectory. Each pixel represents one grid cell.

7 Results

Utility functions for two different vessel types in the maritime domain are learned using MaxEntIRL. These functions are successfully used for predicting trajectories. The training is based on filtered AIS data together with different feature maps created from geographical and nautical information. This method was applied to fishing boats and container vessels. For both types, feature maps representing the preferences of each type are generated.

The results for fishing boats are superior to the comparison algorithm. The results for the direct trajectories of container vessels are less promising. One reason might be the linear combination of the feature maps together with only considering a static environment. The latter means that, e. g., other vessels are not considered. As the area is discretized, thin landmasses might be mapped to only a few cells. As traversing the grid diagonally is possible, the predicted trajectory might be infeasible. This has nothing to do with the learning algorithm but rather with the limited resolution, which is used in order to have a manageable state space size.

8 Conclusion

This work explored the possibility of learning utility functions for maritime agents instead of crafting them manually. Here, an approach for learning utility functions for birds was transferred to the maritime context. The results show the feasibility of learning utility functions for maritime agents.

While the first results look promising, the actual usage of this method in a real operational environment is still in the future. The presented method is limited in its learning capabilities and thus will not sufficiently convey the necessary information to operators to really support their work. Nevertheless, by further extending this version and exploring different approaches, a better understanding of the behaviour of maritime agents can be achieved.

9 Future work

The usage of only a linear combination of feature maps drastically limits the possibilities of the algorithm. Further, the algorithm itself chooses to fully use only a limited number of the provided features. Another future challenge is the inclusion of dynamic features, i. e., the position data of other vessels, weather conditions, sea state, etc. In reality, the operator of a vessel will take all these dynamic factors into account when making decisions. Further, only data from a small area is considered as a first application. Transferring the method to other areas in the maritime domain or even to other domains might give more insight into the possibilities of the learned utility functions.

In order to enhance these utility functions, the algorithm needs to be extended, e. g., by using the deep learning approach described by Wulfmeier et al. in [WOP15]. This (and other deep learning approaches) will enable the usage of non-linear functions. Further, the usage of dynamic data instead of the static maps will be possible. In order to have a better comparison, in future work, other path planning algorithms like A* or similar could be used to avoid obstacles.
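As a rough illustration of this idea, the sketch below runs a standard A* grid search on a per-cell cost map; land cells receive infinite cost, so predicted paths cannot cut across thin landmasses. This is an assumption about how such a planner could consume the learned cost map, not part of the presented system.

```python
import heapq
import numpy as np

def astar(cost: np.ndarray, start: tuple, goal: tuple):
    """A* over a 2D cost grid with 4-connected moves; cost[r, c] is the cell's step cost."""
    # Manhattan heuristic; admissible here because every traversable cell costs at least 1.
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0.0, start, [start])]
    seen = set()
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            r, c = node[0] + dr, node[1] + dc
            if 0 <= r < cost.shape[0] and 0 <= c < cost.shape[1] and np.isfinite(cost[r, c]):
                step = g + cost[r, c]
                heapq.heappush(frontier, (step + h((r, c)), step, (r, c), path + [(r, c)]))
    return None

# Toy usage: a hypothetical thin landmass gets infinite cost, so the planner goes around it.
cost = np.ones((5, 5))
cost[1:4, 2] = np.inf
print(astar(cost, (2, 0), (2, 4)))
```

With the learned static reward map converted into such a cost grid, the planner could produce land-avoiding trajectories comparable to the policy rollout used in the evaluation.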
To improve the results for container vessels, deploying a deep approach that can leverage the specific areas container vessels use would be needed. The trajectory estimation alone is not enough to detect scenarios of interest, e. g., illegal, unreported and unregulated fishing or the smuggling of contraband. For these kinds of situations, the prediction together with the utility function itself could be used to detect deviations of vessels from their normal behaviour.

Acknowledgment. Underlying projects to this article are funded by the WTD 81 of the German Federal Ministry of Defense. The authors are responsible for the content of this article. This work was developed in the Fraunhofer Cluster of Excellence "Cognitive Internet Technologies".

Bibliography

[AFB17] Anneken, Mathias; Fischer, Yvonne; Beyerer, Jürgen: A Multi-agent Approach to Model and Analyze the Behavior of Vessels in the Maritime Domain. SCITEPRESS, pp. 200–207, 2017.

[Hi18] Hirakawa, Tsubasa; Yamashita, Takayoshi; Tamaki, Toru; Fujiyoshi, Hironobu; Umezu, Yuta; Takeuchi, Ichiro; Matsumoto, Sakiko; Yoda, Ken: Can AI predict animal movements? Filling gaps in animal trajectories using inverse reinforcement learning. Ecosphere, 9(10):e02447, 2018.

[Ja57] Jaynes, Edwin T.: Information theory and statistical mechanics. Physical Review, 106(4):620, 1957.

[KW97] Kivinen, Jyrki; Warmuth, Manfred K.: Exponentiated Gradient versus Gradient Descent for Linear Predictors. Information and Computation, 132(1):1–63, 1997.

[NR00] Ng, Andrew Y.; Russell, Stuart J.: Algorithms for Inverse Reinforcement Learning. In: Proceedings of the Seventeenth International Conference on Machine Learning. ICML '00, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp. 663–670, 2000.

[Ra18] Ray, Cyril; Dréo, Richard; Camossi, Elena; Jousselme, Anne-Laure: Heterogeneous Integrated Dataset for Maritime Intelligence, Surveillance, and Reconnaissance. 2018.

[RN16] Russell, Stuart J.; Norvig, Peter: Artificial Intelligence: A Modern Approach. Pearson Education Limited, 2016.

[SA17] Sotirov, Svetoslav; Alexandrov, Chavdar: Improving AIS Data Reliability. 2017.

[Un18] United Nations Conference on Trade and Development: Review of Maritime Transport 2018, 2018.

[WOP15] Wulfmeier, Markus; Ondruska, Peter; Posner, Ingmar: Deep Inverse Reinforcement Learning. CoRR, abs/1507.04888, 2015.

[WS96] Wessel, Paul; Smith, Walter: A global, self-consistent, hierarchical, high-resolution shoreline database. Journal of Geophysical Research, 101:8741–8743, 1996.

[Wu17] Wulfmeier, Markus; Rao, Dushyant; Wang, Dominic Zeng; Ondruska, Peter; Posner, Ingmar: Large-scale cost function learning for path planning using deep inverse reinforcement learning. The International Journal of Robotics Research, 36(10):1073–1087, 2017.

[Zi08] Ziebart, Brian; Maas, Andrew; Bagnell, J. Andrew; Dey, Anind K.: Maximum Entropy Inverse Reinforcement Learning. pp. 1433–1438, 2008.