=Paper=
{{Paper
|id=Vol-3349/paper10
|storemode=property
|title=Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model
|pdfUrl=https://ceur-ws.org/Vol-3349/paper10.pdf
|volume=Vol-3349
|authors=Jon Muhovic,Gregor Koporec,Janez Pers
|dblpUrl=https://dblp.org/rec/conf/cvww/MuhovicKP23
}}
==Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model==
Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model

Jon Muhovič¹, Gregor Koporec¹,² and Janez Perš¹,*

¹ Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia
² Gorenje, d.o.o., 3320 Velenje, Slovenia

26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023.
* Corresponding author. † These authors contributed equally.
jon.muhovic@fe.uni-lj.si (J. Muhovič); gregor.koporec@gorenje.com (G. Koporec); janez.pers@fe.uni-lj.si (J. Perš)
https://lmi.fe.uni-lj.si/en/jon-muhovic/ (J. Muhovič); https://lmi.fe.uni-lj.si/en/janez-pers-2/ (J. Perš)
ORCID: 0000-0002-6039-6110 (J. Perš)

Abstract

The water environment in which unmanned surface vehicles (USVs) navigate presents many unique challenges. One of these is the risk of encountering obstacles that are (partially) submerged and therefore poorly visible, so their extent cannot be determined directly from the available above-water sensor data. On the other hand, it is well known that human skippers are able to safely navigate boats around such obstacles without underwater sensors, relying only on their expertise. In this paper, we describe initial work on extending USV obstacle detection with such functionality using a compositional model. To learn to hallucinate the extent of obstacles with minimal learning effort, we exploit the nature of obstacles (people in kayaks, canoes, and on paddleboards) that are visible most of the time, but not always. We evaluate the impact of such hallucinations on USV safety and maneuverability, and suggest additional cases where such hallucinations can be used to improve USV safety.

Keywords: unmanned vehicles, USV, obstacle detection, compositional models

1. Introduction

Unmanned surface vehicles (USVs) are increasingly recognized as a valuable tool for a variety of applications, including military, environmental, and commercial purposes. These autonomous craft are capable of operating in difficult or hazardous environments, making them ideal for tasks that would be too risky for humans. Moreover, one of the envisioned benefits of USVs is the ability to gather data and perform tasks for extended periods of time without the need for human intervention. This would allow them to cover large areas and collect large amounts of data that can then be used for a variety of purposes. USVs equipped with sensors and cameras could be used, for example, to monitor and map the marine environment, track wildlife [1], or assess the health of coral reefs [2]. However, truly autonomous vehicles with no captain on board and no contact with remote operators must essentially duplicate the reasoning of a trained skipper in certain situations. One such situation involves (partially) submerged objects that cannot be detected by USV sensors located above the water, but whose presence could easily be inferred by a human operator.

Our approach is best illustrated by Fig. 1. Based on the observation that people cannot walk or sit on water, we force the hallucination of a boat for every person detected on the surface of the water. The parameters of the hallucinated object are learned from person-boat compositions obtained by running a pretrained object detector on a separate dataset, and do not require annotation.

Figure 1: Left: detection of objects using Yolov7 [3]; a person is detected (dark blue), but neither a boat nor a paddle board is detected. We hallucinate the boat (in green). Right: the same person, later, when the boat is actually detected by Yolov7 (light blue), comparing the actual detection with the hallucination (green).
This paper is organized as follows. Following the related work, we define the problem used to demonstrate the capabilities of our method. We then introduce the basic concepts of compositional models and describe our use case and evaluation method. In the experimental part, we present our own dataset and its properties, followed by the evaluation setup focusing on USV navigation. Finally, we discuss the results and further applications of the presented approach.

2. Related work

Recently, numerous papers have been published on the subject of USV sensors, obstacle detection and navigation. The computer vision aspect of marine environment interpretation has been approached in several ways so far. Some authors have acquired datasets to facilitate domain transfer for Deep Learning and to further investigate the specific problems of the maritime domain [4, 5, 6]. Several USV architectures with different sensors have been presented to solve problems such as poor lighting conditions and the need for absolute distance measurements [7, 8, 9]. In addition, authors have proposed deep learning methods specific to the maritime domain that either incorporate additional relevant modalities or address problems that arise in that domain [10]. Numerous publications also address automatic navigation and maritime collision avoidance compliance [11].

Han et al. [12] presented a complete platform and framework for obstacle detection and avoidance, complete with multimodal sensors, obstacle detectors, and collision avoidance rules. They use the SSD detector [13] to detect potential obstacles and track them using sensor fusion. Since real-time performance is usually desired, fast detectors such as SSD or YOLO [14] are generally preferred for USV applications.

Several datasets have also been published, some intended as training data for Deep Learning-based methods and others as benchmarks for existing methods. One such dataset, SMD, was proposed by Prasad et al. [4]. It contained 51 RGB and 30 NIR sequences and was primarily intended for monitoring. Since then, several more USV-oriented datasets have been proposed, such as MODD [15], MaSTr1325 [5], and MODS [6].

In the past, obstacle detection was performed directly by estimating salient regions [16] or by color segmentation [8]. Before the widespread use of Deep Learning, several approaches were also proposed that mainly focused on semantic segmentation followed by anomaly detection. These methods [15, 17] used prior information about the scene and refined it with color image information. With the advent of Deep Learning, both branches of obstacle detection have been improved. On the one hand, researchers have adapted or retrained general object detectors for marine environments [18, 19, 10] using more precise classification information and custom datasets. However, such approaches only work for well-defined objects; unknown structures, such as floating debris or piers, usually cannot be detected with them. The other branch of obstacle detection is semantic segmentation. Several methods have adapted general segmentation methods to the marine environment [20, 7, 21]. Obstacle detection can then be performed by determining regions that are partially or completely surrounded by water.

The method presented in this paper operates at a higher level of reasoning and aims to use assumptions that reasonably hold in water-bound environments. It relies on existing but imperfect methods for obstacle detection (in this paper we use Yolov7 [3]). This work contains two contributions:

• A method for improving the safety of the USV and its environment by improving the estimation of free passage corridors in front of the USV, even with imperfect obstacle detectors.
• An evaluation method that quantifies the resulting increase in safety.
3. Problem definition

In situations where we cannot reliably observe fully or partially submerged obstacles using any of the sensors mounted above the water, we use knowledge of commonly occurring structures in marine environments to improve the safety of a USV.

In this paper, we present preliminary research results. We focused on the problem of detecting boats or other floating objects in situations where a person was detected above the water surface, but the corresponding boat was not. Such cases often occur when boats are of a similar color to the surrounding water, partially submerged due to maneuvering, or otherwise poorly visible due to backlight or the distance between a smaller object and the camera. The work was performed on RGB images because of the wide availability of pre-trained object detectors that perform reasonably well without additional training.

Since we are dealing with coastal and continental water regions where smaller boats such as rowboats and paddle boats are usually found, consistent detection of such obstacles is necessary. Depending on the lighting conditions, size and color of the boats, detection with conventional detectors applied to color images is not always consistent. This inconsistency can be a hazard to safe navigation, especially when maneuvering near other boats.

This problem has the following interesting properties:

• Solid physical foundation. People cannot walk or sit on the water; there must be some kind of highly buoyant device present to support their weight.
• No opportunity to introduce gross errors with false detections. False positives only restrict the possibilities for the USV to advance, and our experiments were designed to check for that effect.
• No manual annotations are needed, since we can obtain ground truth using the object detector (Yolov7), and therefore obtain plenty of data to train the higher-reasoning model.

The method will later be extended to a wider range of problems, discussed in Section 7.2, but these represent edge cases and are thus subject to data collection problems.

4. Our method

Our method is heavily influenced by the work of Koporec et al. [22], which uses hierarchical compositional models to detect objects' visible parts even when large parts of the objects are occluded, and allows the collection of expert knowledge from a small number of targeted human annotations. In our work we use a highly simplified implementation of the Human-Centered Deep Compositional (HCDC) model [22].

4.1. Compositional models

In computer vision, a composition refers to the arrangement of visual elements in an image. These visual elements are called parts and can be low-level primitives (e.g. edges, corners) or high-level objects themselves (e.g. the cap, the label and the recognizably shaped bottom of a soft drink bottle), as shown in Fig. 2. Parts can be compositions themselves, yielding a hierarchical compositional model.

Figure 2: Concept of the compositional model – modelling of the Coke bottle. The composition is shown on the left, each part marked with a green rectangle. Names of the parts are shown in the middle. A compositional hierarchical model is shown on the right: the darkest rectangles represent high-level parts 1, 2, 3. Of those, part 3 is a composition itself, containing parts 6 and 7 (lighter). Part 7 is again a composition of parts 4 and 5 (lightest).

The compositional model as shown in Fig. 2 is not particularly useful, as it is completely rigid. In practice, the geometric parameters of the parts are modelled as random vectors. In Figure 3 we show a hierarchical compositional model of a 3-part Coke bottle under the assumption that the probability distribution of the $j$-th part position $(x_{ij}, y_{ij})$ relative to the center (origin) of the $i$-th composition is Gaussian:

$$\mathbf{X}_{ij} = \begin{bmatrix} x_{ij} & y_{ij} \end{bmatrix}^T, \qquad \mathbf{X}_{ij} \sim \mathcal{N}_2(\mu_{ij}, \Sigma_{ij}) \tag{1}$$

where $\mathbf{X}_{ij}$ is a two-dimensional random vector, generated by the Gaussian distribution $\mathcal{N}_2$ with mean vector $\mu_{ij}$ and covariance matrix $\Sigma_{ij}$. The parameters of the Gaussian distribution are obtained by learning on a sufficiently large set of training data, from which the vectors $\mathbf{X}_{ij}$ are extracted.

Figure 3: Illustration: three parts of a Coke bottle (parts 1, 2, and 3 from Fig. 2) could look something like this if the learning samples featured Coke bottles tilted slightly to the right. Other parts are not shown. Ellipses show the Gaussian distributions of part displacements $(x_{ij}, y_{ij})$ relative to the center of the composition (denoted as Composition 1).

Compositional models can be used in the following ways:

1. Robust, explainable detection of partially occluded objects, where the object (composition) is detected even if not all of its parts are visible.
2. Explanation (hallucination) of the missing part. This is the functionality we use in the presented work.
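To make Eq. (1) concrete, the following minimal Python sketch (function and variable names are ours, not taken from the paper) estimates the parameters $\mu_{ij}$ and $\Sigma_{ij}$ of one part's displacement model from observed training displacements:

```python
import numpy as np

def fit_part_displacement_model(displacements):
    """Maximum-likelihood estimate of the Gaussian in Eq. (1).

    displacements: (M, 2) array of part positions (x_ij, y_ij)
    relative to the composition's origin, one row per training sample.
    Returns (mu_ij, Sigma_ij) as a 2-vector and a 2x2 matrix.
    """
    X = np.asarray(displacements, dtype=float)
    mu = X.mean(axis=0)               # mean displacement vector
    sigma = np.cov(X, rowvar=False)   # 2x2 covariance of displacements
    return mu, sigma
```

Each part of each composition would get its own $(\mu, \Sigma)$ pair, fitted independently from the samples in which that part was observed.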
4.2. Model of a person on a boat

The Human-Centered Deep Compositional (HCDC) model [22] operates on parts that are themselves deep detections (detections obtained by convolutional neural network models, CNNs). This makes the model explainable, as the parts are already categorized into human-understandable categories.

We follow this example and use the detections provided by an obstacle detector pretrained on MS COCO [23]. We retained only the pertinent detection classes: person, boat and surfboard. Additionally, we treated the classes boat and surfboard as the same semantic entity (referred to as boat in the remainder of the text), since both of these classes almost always appear simultaneously with the class person. The compositional model that we use is shown in Fig. 4.

Figure 4: Model of the two-part composition used in this research – a person on a boat. The centroid of the person detection is the origin of the composition coordinate system, and two corners of a bounding box represent the boat. The positions of the corners are modelled using two Gaussian distributions. To adjust the model to different scales, we use the diagonal of the person's bounding box, $d$.

In our case, Eq. (1) changes, since we have two separate Gaussian models, one for the upper-left and one for the bottom-right corner of the boat bounding box, and that for each of $N$ scales. The scale index $k$ is tied to the diagonal $d$ of the person's bounding box through $N$ uniform bins, $d = k \, \frac{d_{max}}{N}$:

$$\mathbf{X}_{Lk} = \begin{bmatrix} x_{Lk} & y_{Lk} \end{bmatrix}^T, \qquad \mathbf{X}_{Rk} = \begin{bmatrix} x_{Rk} & y_{Rk} \end{bmatrix}^T \tag{2}$$
$$\mathbf{X}_{Lk} \sim \mathcal{N}_{Lk}(\mu_{Lk}, \Sigma_{Lk}), \qquad \mathbf{X}_{Rk} \sim \mathcal{N}_{Rk}(\mu_{Rk}, \Sigma_{Rk})$$

where the subscripts $L$ and $R$ denote the top-left and bottom-right points of the boat bounding box, respectively, and $k$ denotes the scale index. Therefore, the total parameter set of our 2D model consists of $2N$ Gaussian mean vectors and $2N$ 2D Gaussian covariance matrices.
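The person's bounding box thus provides both the composition origin (its centroid) and the scale cue (its diagonal $d$). A minimal sketch of this geometry, assuming the uniform binning of $d$ implied by $d = k\,d_{max}/N$ (the exact quantization is not spelled out in the paper, and all names here are ours):

```python
import numpy as np

def person_geometry(box):
    """Centroid and diagonal d of a person bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    centroid = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    d = np.hypot(x2 - x1, y2 - y1)
    return centroid, d

def scale_index(d, d_max, n_scales):
    """Map diagonal d to a scale index k in {0, ..., N-1},
    assuming N uniform bins of width d_max / N."""
    k = int(d * n_scales / d_max)
    return min(max(k, 0), n_scales - 1)
```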
4.3. Training the compositional model

Our training does not require any manual annotations. Due to the good (but not perfect) performance of the chosen detector (Yolov7 detects about 95% of boats and an even higher percentage of persons), we use the cases where both the boat and the person on it were detected to establish a model that can reasonably predict the position and size of a boat in the absence of detections.

Although we assume a Gaussian model for the probability distributions $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$, we estimate each separate distribution using the expectation maximization (EM) algorithm with a 2-component Gaussian Mixture Model (GMM) and retain the larger of the two components as either $\mathcal{N}_{Lk}$ or $\mathcal{N}_{Rk}$. Our preliminary testing revealed that using a 2-component GMM results in more accurate fitting of the Gaussian model to the data, since the significantly smaller component collects the outliers.
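A sketch of this training step, using scikit-learn's GaussianMixture for the EM fit; the grouping of samples by corner ($L$/$R$) and scale index $k$ is assumed to have been done beforehand:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_corner_model(displacements):
    """Fit a 2-component GMM to corner displacements and keep the
    dominant component as N_Lk (or N_Rk); the smaller component
    absorbs the outliers, as described in the text.

    displacements: (M, 2) array of boat-corner offsets relative to
    the person centroid, for one corner at one scale index k.
    """
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          random_state=0)
    gmm.fit(np.asarray(displacements, dtype=float))
    dominant = int(np.argmax(gmm.weights_))
    return gmm.means_[dominant], gmm.covariances_[dominant]
```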
4.4. Hallucination

To hallucinate the most likely bounding box of the (undetected) boat, we examine the bounding box of the detected person, calculate its centroid and diagonal $d$, calculate the scale index $k$, and look up the relevant Gaussian models $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$ obtained during training. The hallucinated bounding box points of the boat are placed at the displacements $(x_{Lk}, y_{Lk})$ and $(x_{Rk}, y_{Rk})$ at which $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$ attain their maximum values. Note that these displacements are relative to the person's centroid point.
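Since the density of a Gaussian peaks at its mean, the hallucination step reduces to adding the learned mean displacements to the person centroid. A sketch, reusing person_geometry and scale_index from the earlier sketch (the models dictionary keyed by corner and scale is our own bookkeeping assumption):

```python
def hallucinate_boat(person_box, models, d_max, n_scales):
    """Hallucinate the most likely boat box for a detected person.

    models: dict mapping ('L', k) and ('R', k) to the learned mean
    displacements mu_Lk and mu_Rk (2-vectors); the Gaussian mode
    equals its mean, so the mean is the most likely displacement.
    Returns the hallucinated boat box (x1, y1, x2, y2).
    """
    centroid, d = person_geometry(person_box)   # from earlier sketch
    k = scale_index(d, d_max, n_scales)         # from earlier sketch
    x1, y1 = centroid + models[("L", k)]
    x2, y2 = centroid + models[("R", k)]
    return (x1, y1, x2, y2)
```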
5. USV safety-focused evaluation

To compare the performance of object detectors, a generic approach of counting false positives and false negatives with respect to some minimum intersection-over-union (IoU) value is often used. However, when evaluating detectors with an actual application in mind, it is often the case that not all errors are equally important or relevant. For example, the USV benchmark [6] defines a so-called danger zone to evaluate the more relevant obstacles separately. The problem that we address in this work is increasing the safety of USV navigation in cases where actual boats are not detected. The challenge is: how do we measure the increase in safety?

Note that a crucial safety issue here is that the USV may navigate in areas that actually contain part of a boat. Fig. 5 shows a situation with multiple detections and corresponding hallucinations. The aim of the USV is to proceed in the forward direction while avoiding obstacles. Therefore, it can proceed only through navigable channels, marked with arrows in Fig. 5. To ensure safety, navigable channels cannot contain any part of a boat at any distance, so the problem can be compressed to a one-dimensional representation along the horizontal ($x$) axis. However, if the hallucinations are too wide, there may not be any navigable channel left in front of the boat.

Figure 5: Illustration of the evaluation methodology. Note that the shore does not influence the evaluation in any way; this is an intentional simplification. Blue and green bounding boxes represent ground truth detections and the output of our method (hallucinations), respectively. All bounding boxes are vertically projected onto the horizontal ($x$) axis, and all evaluation, including IoU, is done in one dimension along this axis. Arrows denote the widths of "navigable channels" after the projection of the bounding boxes onto the horizontal axis.

Therefore, we define the following two metrics (a sketch of both is given after this list):

• One-dimensional IoU (referred to as IoU-1D), calculated from the projections of the actual (ground truth) bounding boxes and the hallucinated bounding boxes, both projected downwards onto the horizontal axis (the evaluation line in Fig. 5). This value should be as high as possible.
• One-dimensional coverage (referred to as Cov-1D) of the horizontal axis (evaluation line) by the projections of the ground truth and hallucinated bounding boxes. If the coverage by hallucinations becomes too high, the USV may have no possibility of advancing, and regardless of the increase in safety, such a solution is not good. Coverage is obtained by dividing the number of pixels on the evaluation line covered by projected bounding boxes by the width of the evaluation line in pixels.

This evaluation protocol does not assume or require complex obstacle avoidance maneuvers, and it is not sensitive to vertical displacement of the bounding boxes.
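A minimal sketch of both metrics, operating on the x-axis projections of the boxes (interval endpoints in pixels); the interval-union bookkeeping is ours:

```python
def merge_intervals(intervals):
    """Union of 1D intervals [(x1, x2), ...] from projected boxes."""
    merged = []
    for x1, x2 in sorted(intervals):
        if merged and x1 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x2)
        else:
            merged.append([x1, x2])
    return merged

def covered_length(intervals):
    return sum(x2 - x1 for x1, x2 in merge_intervals(intervals))

def iou_1d(gt_spans, hal_spans):
    """IoU-1D between the unions of ground-truth and hallucinated
    projections on the evaluation line."""
    inter = 0.0
    for a1, a2 in merge_intervals(gt_spans):
        for b1, b2 in merge_intervals(hal_spans):
            inter += max(0.0, min(a2, b2) - max(a1, b1))
    union = covered_length(gt_spans) + covered_length(hal_spans) - inter
    return inter / union if union > 0 else 0.0

def cov_1d(spans, line_width):
    """Cov-1D: fraction of the evaluation line covered by projections."""
    return covered_length(spans) / line_width
```

The navigable channels of Fig. 5 are then simply the gaps between the merged intervals on the evaluation line.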
6. Experiments

We recorded several hours of video on the Ljubljanica river (sessions denoted LJU1, LJU2, and LJU3) in different weather conditions, on Lake Bled (denoted BLE1), and on the Adriatic Sea (near the coast, in several areas between Koper and Portorož; denoted ADR1). In each case, we hired human workers who served as obstacles in boats, kayaks, canoes and on paddleboards. The data contains about 10 obstacles in the near vicinity of the recording boat, captured in different configurations and from different angles relative to the position of the sun (so challenging backlit scenes were also captured). Videos were recorded at 10 frames per second using a Stereolabs ZED 3D stereo camera (https://www.stereolabs.com), mounted 1-1.5 meters above the water surface (different watercraft were used at different locations). In this experiment we only used the left RGB images; the right RGB images and depth were not used in any way.

6.1. Analysis of dataset contents

The training data was constructed by first obtaining predictions for all the relevant classes using Yolov7. The compositions were then constructed from cases where there was overlap between detections of the class person and either of the classes boat or surfboard (a sketch of this pairing is given below). Analysis of the detections provides some insight into the problem of "invisible" boats and paddle boards, as shown in Table 1.

Table 1: Fractions of detected people without boats vs. detected people with boats, among all person detections, for each recording session. Note that the share of missing boat detections ranges from 3-5%. The videos contained a negligible number of people on the shore (physically plausible detections without boats).

Session (dataset)   LJU1   LJU2   LJU3   BLE1   ADR1
person only         0.04   0.05   0.05   0.03   0.05
person + boat       0.96   0.95   0.95   0.97   0.95
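The pairing used to build the training compositions can be sketched as follows; taking the most-overlapping boat per person is our assumption, as the paper only requires overlap:

```python
def intersection_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def build_compositions(detections):
    """Build (person_box, boat_box) training pairs from the Yolov7
    detections of a single image.

    detections: list of (class_name, box) tuples; the classes 'boat'
    and 'surfboard' are merged into one semantic entity (Section 4.2).
    """
    persons = [box for cls, box in detections if cls == "person"]
    boats = [box for cls, box in detections
             if cls in ("boat", "surfboard")]
    pairs = []
    for p in persons:
        overlapping = [b for b in boats if intersection_area(p, b) > 0]
        if overlapping:
            best = max(overlapping, key=lambda b: intersection_area(p, b))
            pairs.append((p, best))
    return pairs
```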
6.2. Training

We used session BLE1 for training the Gaussian distributions $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$, as it featured boats of varying shapes and sizes. The training time using precalculated Yolov7 detections was negligible.

6.3. Testing

Free from requirements for manual annotation, we were able to run the evaluation of our method on all images from our dataset. For evaluation, we used only the detections of people with corresponding boats. Boat detections obtained via Yolov7 were considered ground truth, against which the hallucinations obtained using our compositional model were tested. Person detections without corresponding boats were not used, as these had no usable ground truth. Table 2 shows the results.

Table 2: Evaluation results using the model trained on the BLE1 session. IoU-1D is the one-dimensional IoU of bounding box projections onto the horizontal axis, and Cov-1D is the coverage of the horizontal axis by each type of bounding box. The projections of pure person detection bounding boxes are included for comparison.

Session (dataset)         LJU1    LJU2    LJU3    BLE1    ADR1
IoU-1D                    0.465   0.435   0.532   0.605   0.582
Ground truth Cov-1D       0.13    0.149   0.152   0.101   0.193
Hallucination Cov-1D      0.074   0.065   0.117   0.083   0.149
Person detection Cov-1D   0.067   0.054   0.127   0.062   0.139

Analysing the results, we can see that there is good overlap between the ground truth detections and the hallucinations, with IoU-1D ranging from 0.435 to 0.605; the highest value is achieved on the same session (BLE1) on which the model was trained. Note that an IoU-1D of 0.5 corresponds, for example, to a hallucination covering the middle half of a ground truth projection, with a quarter at each edge non-overlapping.

Coverage by hallucinations is not as high as coverage by detections, and, most surprisingly, coverage by pure person detections (i.e., in the absence of any detected boats) is not much lower than the coverage by hallucinations. We examined the reason behind this and found that the increase is smaller than expected due to obstacles that are further away and have disproportionately wide person detection bounding boxes, and due to differences between the sets of boats used for training and testing (note the highest increase in Cov-1D from person detection to hallucination when the training set BLE1 was tested). Figure 6 shows an image where the result of our method is poor.

Figure 6: Image on which the proposed method does not significantly improve safety. Note the wide detections of persons and an uncharacteristically long canoe.

7. Discussion

This paper presents preliminary research on the use of hallucinations, provided by compositional models, in water-borne obstacle detection and avoidance. The experimental design in this work has been subject to many constraints, most notably the absence of proper ground truth annotations. These issues will be addressed in further work, towards a general framework for hallucinating obstacles that are not directly observed by the sensors. Since using an obstacle detector precludes us from detecting unknown objects, combining its results with semantic segmentation, another method of anomaly detection, or a different sensor modality (such as LIDAR) might help in producing a more general system that performs hazard detection from multimodal cues.

7.1. Underwater sensors

The state of the art in experimental autonomous road vehicles relies heavily on a multimodal sensor setup, with sensors like LIDAR and RADAR [24, 25], which bear no resemblance to human sensing. Therefore, an argument could be made that instead of hallucinating the obstacles and trying to emulate the skipper, one could detect the hidden obstacles using a proper underwater sensor setup. In practice, this results in a fragile setup due to water turbidity – USVs are expected to navigate safely even in water that is dirty or muddy.

Note also that a paddleboard, as shown in Fig. 1, is a very thin object at the boundary between air and water, which is not comparable to the situations encountered in autonomous driving (on the road), so it is unlikely that additional (underwater) sensors would reliably detect it. In fact, some watercraft may be completely submerged at times, as can be seen in Fig. 7, which shows a fast-moving athlete in a kayak.

Figure 7: A submerged kayak that cannot possibly be reliably detected using visual sensors.

7.2. Other examples of invisible hazards

Missing detections of boats and paddleboards are immediately available in our waterborne datasets. However, there are other scenarios where such an approach would be useful, but for which there is currently insufficient data to train the models. The main reason for this is that these scenarios are to some extent hazardous to the USV and represent edge cases in USV deployment. In Figure 8, we present a common scenario that we have encountered several times, but for which we currently do not have enough data to properly test, let alone train, a model.

Plant debris is common in continental waters and usually safe to traverse. It often covers the entire navigable area (e.g., leaves in the fall), so avoiding it at all times is not an option. However, debris may accumulate in shallow water areas (in which case it may not be debris at all, but aquatic plants sticking out of the shallow water). Thus, if we encounter debris farther from shore, it is not a cause for concern, as it is most likely floating. However, if it is found near land features (e.g., trees, mud), it usually means that the area is dangerous, shallow, and not navigable. To detect this case, we might model the shallow, non-navigable area as a composition of debris and other land-based features.

Figure 8: Example of an invisible danger – plant water debris. Plant debris can be seen in all four images. Plant debris is usually mobile, buoyant, harmless, and can be run over by a boat (top left). However, if the plant debris is near the shore, it can accumulate on aquatic plants and signal dangerously shallow depth (top right and bottom left). The presence of other clues (muddy water) increases the likelihood that the water in the area of the debris is precariously shallow (bottom right).

As can be seen in the top right image of Fig. 8, it is sometimes difficult to determine whether the situation is a hazard or not. The labeling of such situations cannot be done by (untrained) labelers, but must be defined by experienced skippers working in cooperation with computer vision engineers. These compositions and their parameters must be defined by hand for a small number of available cases. The HCDC approach [22] has shown that this is indeed possible for common, well-known food items. In this case, it would be used to insert concentrated expert knowledge into the compositional hazard detection model.

8. Acknowledgments

This work was financed by the Slovenian Research Agency (ARRS), research program [P2-0095] and research project [J2-2506].
References

[1] A. Dallolio, H. B. Bjerck, H. A. Urke, J. A. Alfredsen, A persistent sea-going platform for robotic fish telemetry using a wave-propelled USV: technical solution and proof-of-concept, Frontiers in Marine Science 9 (2022). doi:10.3389/fmars.2022.857623.
[2] G. T. Raber, S. R. Schill, Reef Rover: a low-cost small autonomous unmanned surface vehicle (USV) for mapping and monitoring coral reefs, Drones 3 (2019). doi:10.3390/drones3020038.
[3] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022. arXiv:2207.02696.
[4] D. K. Prasad, D. Rajan, L. Rachmawati, E. Rajabally, C. Quek, Video processing from electro-optical sensors for object detection and tracking in a maritime environment: a survey, IEEE Transactions on Intelligent Transportation Systems 18 (2017) 1993–2016.
[5] B. Bovcon, J. Muhovič, J. Perš, M. Kristan, The MaSTr1325 dataset for training deep USV obstacle detection models, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 3431–3438.
[6] B. Bovcon, J. Muhovič, D. Vranac, D. Mozetič, J. Perš, M. Kristan, MODS – a USV-oriented object detection and obstacle segmentation benchmark, IEEE Transactions on Intelligent Transportation Systems (2021).
[7] L. Steccanella, D. Bloisi, A. Castellini, A. Farinelli, Waterline and obstacle detection in images from low-cost autonomous boats for environmental monitoring, Robotics and Autonomous Systems 124 (2020) 103346.
[8] A. J. Sinisterra, M. R. Dhanak, K. Von Ellenrieder, Stereovision-based target tracking system for USV operations, Ocean Engineering 133 (2017) 197–214.
[9] Y. Cheng, M. Jiang, J. Zhu, Y. Liu, Are we ready for unmanned surface vehicles in inland waterways? The USVInland multisensor dataset and benchmark, IEEE Robotics and Automation Letters 6 (2021) 3964–3970.
[10] D. Nunes, J. Fortuna, B. Damas, R. Ventura, Real-time vision based obstacle detection in maritime environments, in: 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), IEEE, 2022, pp. 243–248.
[11] Y. Kuwata, M. T. Wolf, D. Zarzhitsky, T. L. Huntsberger, Safe maritime autonomous navigation with COLREGS, using velocity obstacles, IEEE Journal of Oceanic Engineering 39 (2013) 110–119.
[12] J. Han, Y. Cho, J. Kim, J. Kim, N.-s. Son, S. Y. Kim, Autonomous collision detection and avoidance for ARAGON USV: development and field tests, Journal of Field Robotics 37 (2020) 987–1002.
[13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, A. C. Berg, SSD: single shot multibox detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), ECCV (1), volume 9905 of Lecture Notes in Computer Science, Springer, 2016, pp. 21–37.
[14] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, 2015. arXiv:1506.02640.
[15] M. Kristan, V. S. Kenk, S. Kovačič, J. Perš, Fast image-based obstacle detection from unmanned surface vehicles, IEEE Transactions on Cybernetics 46 (2015) 641–654.
[16] H. Wang, Z. Wei, S. Wang, C. S. Ow, K. T. Ho, B. Feng, A vision-based obstacle detection system for unmanned surface vehicle, in: 2011 IEEE Conference on Robotics, Automation and Mechatronics (RAM), IEEE, 2011, pp. 364–369.
[17] B. Bovcon, J. Perš, M. Kristan, et al., Stereo obstacle detection for unmanned surface vehicles by IMU-assisted semantic segmentation, Robotics and Autonomous Systems 104 (2018) 1–13.
[18] J. Yang, Y. Li, Q. Zhang, Y. Ren, Surface vehicle detection and tracking with deep learning and appearance feature, in: 2019 5th International Conference on Control, Automation and Robotics (ICCAR), IEEE, 2019, pp. 276–280.
[19] S. Moosbauer, D. Konig, J. Jakel, M. Teutsch, A benchmark for deep learning based object detection in maritime environments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[20] H. Kim, J. Koo, D. Kim, B. Park, Y. Jo, H. Myung, D. Lee, Vision-based real-time obstacle segmentation algorithm for autonomous surface vehicle, IEEE Access 7 (2019) 179420–179428.
[21] B. Bovcon, M. Kristan, A water-obstacle separation and refinement network for unmanned surface vehicles, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 9470–9476.
[22] G. Koporec, J. Perš, Human-centered deep compositional model for handling occlusions, 2022. 2nd revision in Pattern Recognition.
[23] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft COCO: common objects in context, 2014. arXiv:1405.0312.
[24] J. Peršić, I. Marković, I. Petrović, Extrinsic 6DoF calibration of a radar–lidar–camera system enhanced by radar cross section estimates evaluation, Robotics and Autonomous Systems 114 (2019) 217–230.
[25] C. Schöller, M. Schnettler, A. Krämmer, G. Hinz, M. Bakovic, M. Güzet, A. Knoll, Targetless rotational auto-calibration of radar and camera for intelligent transportation systems, in: 2019 IEEE Intelligent Transportation Systems Conference (ITSC), IEEE, 2019, pp. 3934–3941.