<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jon Muhovič</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gregor Koporec</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janez Perš</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Electrical Engineering, University of Ljubljana</institution>
          ,
          <addr-line>Tržaška 25, 1000 Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gorenje</institution>
          ,
          <addr-line>d.o.o., 3320 Velenje</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The water environment in which unmanned surface vehicles (USVs) navigate presents many unique challenges. One of these is the risk of encountering obstacles that are (partially) submerged and therefore poorly visible, so that their extent cannot be determined directly from the available above-water sensor data. On the other hand, it is well known that human skippers are able to safely navigate boats around such obstacles even without underwater sensors, relying only on their expertise. In this paper, we describe initial work on extending USV obstacle detection with such functionality using a compositional model. To learn to hallucinate the extent of obstacles with a minimum of learning effort, we exploit the nature of obstacles (people in kayaks, canoes, and on paddleboards) that are visible most of the time, but not always. We evaluate the impact of such hallucinations on USV safety and maneuverability, and suggest additional cases where such hallucinations can be used to improve USV safety.</p>
      </abstract>
      <kwd-group>
<kwd>unmanned vehicles</kwd>
        <kwd>USV</kwd>
        <kwd>obstacle detection</kwd>
        <kwd>compositional models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Unmanned surface vehicles (USVs) are increasingly
recognized as a valuable tool for a variety of applications,
including military, environmental, and commercial
purposes. These autonomous craft are capable of operating
in difficult or hazardous environments, making them
ideal for tasks that would be too risky for humans.</p>
      <p>
        On the other hand, one of the envisioned benefits of
USVs is the ability to gather data and perform tasks for
extended periods of time without the need for human
intervention. This would allow them to cover large areas
and collect a large amount of data that can then be used
for a variety of purposes. USVs equipped with sensors
and cameras could be used, for example, to monitor and
map the marine environment, track wildlife [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or assess
the health of coral reefs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, truly autonomous
vehicles with no captain on board and no contact with
remote operators must essentially duplicate the
reasoning of a trained skipper in certain situations. One of
these situations involves (partially) submerged objects that
cannot be detected by USV sensors located above the
water surface.
      </p>
      <p>
        This paper is organized as follows: following related
work, we define the problem we use to demonstrate the
capabilities of our method. We then introduce the basic
concepts of compositional models and describe our use case
and evaluation method. In the experimental part, we present
our own dataset used in our experiments and its properties,
followed by the evaluation setup focusing on USV navigation.
Finally, we discuss the results and further applications of
the presented approach.
      </p>
      <p>
2. Related work
Recently, numerous papers have been published on the
subject of USV sensors, obstacle detection and navigation.
      </p>
      <p>
        The computer vision aspect of marine environment
interpretation has been approached in several ways so
far: some authors have acquired datasets to facilitate
domain transfer for Deep Learning and to further investigate
the specific problems of the maritime domain [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ].
      </p>
      <p>
        Several USV architectures with different sensors have
been presented to solve problems such as poor lighting
conditions and the need for absolute distance
measurements [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]. In addition, authors have proposed deep
learning methods specific to the maritime domain that
either incorporate additional relevant modalities or address
problems that arise in the maritime domain [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Numerous publications have also been presented that
address automatic navigation and maritime collision
avoidance compliance [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Han et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] have presented a complete platform and
framework for obstacle detection and avoidance, complete
with multimodal sensors, obstacle detectors, and collision
avoidance rules. They use the SSD detector [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to detect potential obstacles and track them using sensor
fusion. Since real-time performance is usually desired, fast
detectors such as SSD or YOLO [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] are usually preferred for USV applications.
      </p>
      <p>
        Several datasets have also been published, some of
which are used as learning data for Deep Learning-based
methods and others as benchmarks for existing methods.
One such dataset, SMD, was proposed by
Prasad et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. It contained 51 RGB and 30 NIR sequences
and was primarily intended for monitoring.
Since then, several more USV-oriented datasets have
been proposed, such as MODD [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], MaSTr1325 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and MODS [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        In the past, obstacle detection was performed directly
by estimating salient regions [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] or by color segmentation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Before the widespread use of Deep Learning,
several approaches were also proposed that mainly
focused on semantic segmentation followed by anomaly
detection. These methods [
        <xref ref-type="bibr" rid="ref15 ref17">15, 17</xref>
        ] used prior information about the scene and refined it
with color image information. With the advent of Deep
Learning, the two branches of obstacle detection have been
improved. On the one hand, researchers have adapted or
retrained general object detectors for marine environments [
        <xref ref-type="bibr" rid="ref10 ref18">18, 19, 10</xref>
        ]
using more precise classification information and custom
datasets. However, such approaches only work for
well-defined objects. Unknown structures, such as floating
debris or piers, cannot usually be detected using such
methods.
      </p>
      <p>
        The other branch of obstacle detection is semantic
segmentation. Several methods have adapted general
segmentation methods to the marine environment [
        <xref ref-type="bibr" rid="ref7">20,
7, 21</xref>
        ]. Obstacle detection can be performed using such
methods by determining regions that are partially or
completely surrounded by water.
      </p>
      <p>
        The method presented in this paper operates at a
higher level of reasoning and aims to use assumptions
that reasonably hold in water-bound environments. It
relies on existing but imperfect methods for obstacle
detection (in this paper we use Yolov7 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). This work contains two contributions:
• A method for improving the safety of the USV and
its environment by improving the estimation of
free passage corridors in front of the USV, even
with imperfect obstacle detectors.
• An evaluation method that evaluates the increase of
safety in that case.
      </p>
      <p>
3. Problem definition
In situations where we cannot reliably observe fully or
partially submerged obstacles using any of the sensors
mounted above the water, we use knowledge of
commonly occurring structures in marine environments to
improve the safety of a USV.
      </p>
      <p>
        In this paper, we present preliminary research results:
we focused on the problem of detecting boats or other
floating objects in situations where a person was detected
above the water surface, but the corresponding boat was not.
Such cases often occur when boats are of a
similar color to the surrounding water, partially submerged
due to maneuvering, or otherwise poorly visible due
to backlight or the distance between a smaller object and
the camera. The work was performed using RGB images,
because of the wide availability of pre-trained object
detectors that perform reasonably well without the need
for additional training.
      </p>
      <p>
        Since we are dealing with coastal and continental
water regions where smaller boats such as rowboats and
paddle boats are usually found, consistent detection of
such obstacles is necessary. Depending on the lighting
conditions and the size and color of the boats, detection with
conventional detectors applied to color images is not
always consistent. This inconsistency can be a hazard to
safe navigation, especially when maneuvering near other
boats.
      </p>
<p>This problem has the following interesting properties:
• Solid physical foundation. People cannot walk
or sit on the water. There must be some kind of
highly buoyant device present to support their
weight.
• No opportunity to introduce gross errors with
false detections. False positives only restrict the
possibilities for the USV to advance, and our
experiments were designed to check for that effect.
• No manual annotations are needed, since we can
obtain ground truth using the object detector
(Yolov7) and therefore obtain plenty of data to
train the higher-reasoning model.</p>
      <p>The method will later be extended to a wider range of problems, which are discussed in Section 7.2, but these represent edge cases and are thus subject to problems of data collection.</p>
    </sec>
    <sec id="sec-2">
      <title>4. Our method</title>
      <sec id="sec-2-1">
        <title>4.1. Compositional models</title>
<p>In computer vision, a composition refers to the
arrangement of visual elements in an image. These visual
elements are called parts and can be low-level primitives (e.g.
edges, corners) or high-level objects themselves (e.g. the cap,
the label, and the recognizably shaped bottom of a bottle of
soft drink), as shown in Fig. 2. Parts can be compositions
themselves, yielding a hierarchical compositional model.</p>
<p>The compositional model, as shown in Fig. 2, is not
particularly useful, as it is completely rigid. In
practice, the geometrical parameters of the parts are modelled
as random vectors. In Figure 3 we show a hierarchical
compositional model of a 3-part Coke bottle under the
assumption that the probability distribution of each part's
position relative to the center (origin) of the
composition is Gaussian:
X ∼ N(μ, Σ)   (1)
with mean μ and covariance matrix Σ. The parameters of the
Gaussian distribution are obtained by learning on a
sufficiently large set of training data, from which the vectors X
are extracted.</p>
<p>Our method is heavily influenced by the work of
Koporec et al. [22], which uses hierarchical compositional
models to detect objects' visible parts even when large
parts of the objects are occluded, and allows collection of
expert knowledge from a small number of targeted human
annotations. In our work we use a highly simplified
implementation of the Human-Centered Deep Compositional
(HCDC) model [22].</p>
        <p>Compositional models can be used in the following ways:
• Robust, explainable detection of partially
occluded objects, where the object (composition) is
detected even if not all its parts are visible.
• Explanation (hallucination) of the missing part.
This is the functionality we use in the presented
work.</p>
      </sec>
      <sec id="sec-2-1-1">
        <title>4.2. Model of a person on a boat</title>
        <p>The Human-Centered Deep Compositional (HCDC)
model [22] operates on parts that are themselves deep
detections (detections obtained by convolutional
neural network models, CNNs). This makes the model
explainable, as the parts are already categorized into
human-understandable categories.</p>
        <p>We follow this example and use the detections
provided by an obstacle detector pretrained on MS
COCO [23]. We only retained the pertinent detection
classes: person, boat and surfboard. Additionally, we
treated the classes boat and surfboard as the same
semantic entity (referred to as boat in the remainder of the
text), since both of those classes almost always appear
simultaneously with the class person. The compositional
model that we use is shown in Fig. 4.</p>
        <p>In our case, Eq. (1) changes, since we have two
separate Gaussian models for the upper-left and bottom-right
corners of the boat bounding box, one pair for each scale:
X_L ∼ N(μ_L, Σ_L),  X_R ∼ N(μ_R, Σ_R)   (2)
where the subscripts L and R denote the left-top and
right-bottom points of the boat bounding box, respectively,
and one such pair of models is kept for each scale index.
Therefore, the total parameter set of our 2D model consists
of two 2D Gaussian means and two 2D Gaussian covariance
matrices per scale.</p>
      </sec>
      <sec id="sec-2-1-2">
        <title>4.3. Training the compositional model</title>
        <p>Our training does not require any manual annotations.
Due to the good (but not perfect) performance of the
chosen detector (Yolov7 detects about 95% of boats and an
even higher percentage of persons), we use those cases where
both the boat and the person on it were detected to
establish a model that can reasonably predict the position
and size of a boat in the absence of detections.</p>
        <p>Although we assume a Gaussian model for the probability
distributions of X_L and X_R, we estimate each separate
distribution using the expectation maximization (EM)
algorithm with a 2-component Gaussian Mixture Model (GMM)
and retain the larger of the two components as either
μ_L or μ_R. Our preliminary testing has revealed that
the 2-component GMM results in a more accurate fit
of the Gaussian model to the data, collecting the outliers in
the significantly smaller component.</p>
      </sec>
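The EM-based outlier rejection described in Section 4.3 can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: it fits a 2-component Gaussian mixture to 1D data with a hand-rolled EM loop and keeps the dominant component, whereas the paper's model works on 2D displacement vectors; all function names are our own.

```python
import math

def em_gmm2(xs, iters=100):
    """Fit a 2-component 1D Gaussian mixture with EM.

    Returns (weights, means, variances)."""
    mu = [min(xs), max(xs)]   # spread the initial means apart
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate weights, means and variances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(1e-6, sum(r[k] * (x - mu[k]) ** 2
                                   for r, x in zip(resp, xs)) / nk)
    return w, mu, var

# toy data: ~95% inlier displacements around 10, ~5% outliers at 40
xs = [9.0, 10.0, 11.0] * 63 + [40.0] * 10
w, mu, var = em_gmm2(xs)
k = 0 if w[0] > w[1] else 1   # retain the larger component, as in the paper
print(mu[k])                  # close to the inlier mean of 10.0
```

The dominant component's mean then serves as the Gaussian parameter for the corner displacement, while the smaller component absorbs mismatched co-detections.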
      <sec id="sec-2-2">
        <title>4.4. Hallucination</title>
<p>To hallucinate the most likely bounding box of the
(undetected) boat, we examine the bounding box of the
detected person, calculate its centroid and diagonal d,
calculate the scale index and look up the relevant Gaussian
models for the upper-left and bottom-right corners obtained
during the training. The hallucinated bounding box points of
the boat are determined at the displacements at which the
two Gaussian densities have their maximum values, i.e., at the
means μ_L and μ_R. Note that these displacements
are relative to the person's centroid point.</p>
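As a concrete illustration of this lookup, the sketch below (our own simplification, not the authors' code) learns the mean displacement of the boat box corners relative to the person's centroid from co-detections, then hallucinates a boat box for a lone person detection. Since the maximum of a Gaussian density lies at its mean, a mean lookup implements the maximum-density step; the scale binning by the person's diagonal is omitted for brevity, and all names are hypothetical.

```python
def centroid(box):
    # box = (x1, y1, x2, y2)
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def fit_displacement_model(pairs):
    """pairs: list of (person_box, boat_box) co-detections (training data).

    Returns the mean displacement (dx1, dy1, dx2, dy2) of the boat box
    corners relative to the person's centroid."""
    n = len(pairs)
    sums = [0.0] * 4
    for person, boat in pairs:
        cx, cy = centroid(person)
        sums[0] += boat[0] - cx
        sums[1] += boat[1] - cy
        sums[2] += boat[2] - cx
        sums[3] += boat[3] - cy
    return [s / n for s in sums]

def hallucinate_boat(person_box, mu):
    # place the boat corners at the mean displacement from the centroid
    cx, cy = centroid(person_box)
    return (cx + mu[0], cy + mu[1], cx + mu[2], cy + mu[3])

# toy training pairs: the boat box is wider than, and below, the person box
pairs = [((10, 10, 20, 30), (0, 25, 30, 40)),
         ((50, 10, 60, 30), (40, 25, 70, 40))]
mu = fit_displacement_model(pairs)
print(hallucinate_boat((100, 10, 110, 30), mu))  # -> (90.0, 25.0, 120.0, 40.0)
```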
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. USV safety-focused evaluation</title>
      <p>
        To compare the performance of object detectors, a generic
approach of counting false positives and false negatives,
with respect to some minimum intersection over union
(IoU) value, is often used. However, when evaluating
detectors with an actual application in mind, it is
often the case that not all errors are equally important
or relevant. For example, the USV benchmark [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] defines a
so-called danger zone to evaluate the more relevant obstacles
separately. The problem that we are addressing in this
work is increasing the safety of USV navigation, in cases
where actual boats are not detected. The challenge is:
how do we measure the increase in safety?
      </p>
      <p>Note that a crucial safety issue here is that the USV
may navigate in areas that actually contain part of the
boat. Fig. 5 shows a situation with multiple detections
and corresponding hallucinations. The aim of the USV is
to proceed in the forward direction, but it has to avoid
obstacles. Therefore, it can proceed only through navigable
channels, marked with arrows in Fig. 5. To ensure safety,
navigable channels cannot contain any part of the boat
at any distance, and the problem can be compressed to a
one-dimensional representation along the horizontal
axis. However, if the hallucinations are too wide, there
may not be any navigable channel left in front of the
boat.</p>
      <p>Therefore, we define the following two metrics:
• One-dimensional IoU value (referred to as IoU-1D),
calculated from the projections of actual
(ground truth) bounding boxes and hallucinated
bounding boxes, both projected downwards onto
the horizontal axis (the evaluation line in Fig. 5). This
value should be as high as possible.
• One-dimensional coverage (referred to as Cov-1D)
of the horizontal axis (evaluation line) with the
projections of both ground truth bounding boxes
and hallucinated bounding boxes. If the coverage
of hallucinations becomes too high, then the USV
may not have any possibility of advancing, and
regardless of the increase in safety, this solution
is not good. Coverage is obtained by dividing the
number of pixels on the evaluation line covered by
projected bounding boxes by the width of the
evaluation line in pixels.</p>
      <p>This evaluation protocol does not assume or require
complex obstacle avoidance maneuvers, and is not
sensitive to vertical displacement of bounding boxes.</p>
      <p>
6. Experiments
We recorded several hours of video on the Ljubljanica
river (sessions denoted LJU1, LJU2, and LJU3) in
different weather conditions, on Lake Bled (denoted BLE1),
and on the Adriatic Sea (near the coast, in several areas
between Koper and Portorož), denoted ADR1. In each
case, we hired human workers who served as obstacles
in boats, kayaks, canoes and on paddleboards. The data
contains about 10 obstacles in the near vicinity of the
recording boat, captured in different configurations and
from different angles relative to the position of the sun (so
challenging backlit scenes were also captured). Videos
were recorded at 10 frames per second using a Stereolabs
ZED 3D stereo camera (https://www.stereolabs.com),
mounted between 1 and 1.5 meters
above the water surface (different watercraft were used
at different locations). In this experiment we only use the
left RGB images; the right RGB image and depth were
not used in any way.</p>
      <sec id="sec-3-1">
        <title>6.1. Analysis of dataset contents</title>
        <p>The training data was constructed by first obtaining
predictions for all the relevant classes using Yolov7. The
compositions were then constructed from cases where
there was overlap between detections of class person and
either of the classes boat or surfboard.</p>
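The construction step above can be sketched as follows. This is a minimal sketch under the assumption that a simple bounding-box overlap test suffices to pair detections; the function names are our own, not the authors' code:

```python
def overlaps(a, b):
    # boxes as (x1, y1, x2, y2); True if the rectangles intersect
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def split_detections(persons, boats):
    """Persons overlapping a boat/surfboard box become training
    compositions; persons with no overlap are the cases where a boat
    must be hallucinated."""
    pairs, lone = [], []
    for p in persons:
        hits = [b for b in boats if overlaps(p, b)]
        if hits:
            pairs.append((p, hits[0]))   # training composition
        else:
            lone.append(p)               # hallucination candidate
    return pairs, lone

persons = [(10, 10, 20, 30), (100, 10, 110, 30)]
boats = [(0, 25, 30, 40)]
pairs, lone = split_detections(persons, boats)
print(len(pairs), len(lone))  # 1 1
```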
<p>Analysis of the detections provides some insight into
the problem of "invisible" boats and paddle boards, as
shown in Table 1.</p>
<table-wrap id="tab1">
  <label>Table 1</label>
  <table>
    <thead>
      <tr><th>Session (dataset)</th><th>person only (%)</th><th>person+boat (%)</th></tr>
    </thead>
    <tbody>
      <tr><td>LJU1</td><td>0.04</td><td>0.96</td></tr>
      <tr><td>LJU2</td><td>0.05</td><td>0.95</td></tr>
      <tr><td>LJU3</td><td>0.05</td><td>0.95</td></tr>
      <tr><td>BLE1</td><td>0.03</td><td>0.97</td></tr>
      <tr><td>ADR1</td><td>0.05</td><td>0.95</td></tr>
    </tbody>
  </table>
</table-wrap>
      </sec>
      <sec id="sec-3-2">
        <title>6.2. Training</title>
<p>We decided to use session BLE1 for training of the
Gaussian distributions of the boat corner displacements, as
it featured boats of varying shapes and sizes. The training
time using precalculated Yolov7 detections was negligible.</p>
      </sec>
      <sec id="sec-3-3">
        <title>6.3. Testing</title>
<p>Free from requirements for manual annotation, we were
able to run the evaluation of our method on all images
from our dataset. For evaluation, we used only the
detections of people with corresponding boats. Boat
detections, obtained via Yolov7, were considered ground
truth, against which the hallucinations, obtained using
our compositional model, were tested. Person detections
without corresponding boats were not used, as these had
no usable ground truth. Table 2 shows the results.</p>
<p>Analysing the results, we can see that there is good
overlap between ground truth detections and
hallucinations, with IoU-1D ranging from 0.465 to 0.605 for the
same dataset on which the model was trained. Note that,
for two equally wide projections, an IoU-1D of 0.5 means
that about two thirds of each projection overlaps with the
other, while the remaining third at the edge is
non-overlapping.</p>
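Our reading of the two metrics can be made concrete with a short sketch (not the authors' evaluation code; the names are our own). Bounding boxes are projected onto the horizontal evaluation line and compared as intervals:

```python
def project(box):
    # box = (x1, y1, x2, y2) -> horizontal interval (x1, x2)
    return (box[0], box[2])

def iou_1d(box_a, box_b):
    """Intersection over union of two boxes projected onto the x axis."""
    (a1, a2), (b1, b2) = project(box_a), project(box_b)
    inter = max(0.0, min(a2, b2) - max(a1, b1))
    union = (a2 - a1) + (b2 - b1) - inter
    return inter / union if union > 0 else 0.0

def coverage_1d(boxes, line_width):
    """Fraction of the evaluation line covered by projected boxes
    (overlapping intervals are merged before summing)."""
    covered, end = 0.0, float("-inf")
    for x1, x2 in sorted(project(b) for b in boxes):
        x1 = max(x1, end)
        if x2 > x1:
            covered += x2 - x1
            end = x2
    return covered / line_width

# a hallucination shifted by a quarter of its width against the ground truth
gt = (100, 0, 200, 50)
hall = (125, 0, 225, 50)
print(iou_1d(gt, hall))               # 75 / 125 = 0.6
print(coverage_1d([gt, hall], 1000))  # 125 / 1000 = 0.125
```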
<p>Coverage of hallucinations is not as high as coverage
of detections, and, most surprisingly, coverage of pure
person detections (i.e., in the absence of any detected boats)
is not much lower than the coverage of hallucinations.
We examined the reason behind this and found that the
increase is not as high as expected due to obstacles which
are further away and have disproportionately wide
person detection bounding boxes, and due to differences
in the set of boats used for training and testing (note
the highest increase in Cov-1D from person detection
to hallucination when the training set BLE1 was tested).
Figure 6 shows an image where the result of our method
is poor.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7. Discussion</title>
<p>This paper presents preliminary research on the use of
hallucinations, provided by compositional models, in
water-borne obstacle detection and avoidance. The
experimental design in this work has been subject to many
constraints, most notably the absence of proper ground
truth annotations. These issues will be addressed in
further work, towards a general framework for hallucinating
obstacles that are not directly observed by the sensors.</p>
<p>Since using an obstacle detector precludes us from
detecting unknown objects, combining its results with
either semantic segmentation, another method of
anomaly detection, or a different sensor modality (such as
LIDAR) might help in producing a more general hazard
detection system that performs hazard detection from
multimodal cues.</p>
      <sec id="sec-4-1">
        <title>7.1. Underwater sensors</title>
<p>The state of the art in experimental autonomous road
vehicles relies heavily on a multimodal sensor setup, with
sensors like LIDAR and RADAR [24, 25], which bear no
resemblance to human sensing. Therefore, an argument
could be made that instead of hallucinating the obstacles
and trying to emulate the skipper, one could detect the
hidden obstacles using a proper underwater sensor setup.</p>
<table-wrap id="tab2">
  <label>Table 2</label>
  <table>
    <thead>
      <tr><th>Session (dataset)</th><th>IoU-1D</th><th>Ground truth Cov-1D</th><th>Hallucination Cov-1D</th><th>Person detection Cov-1D</th></tr>
    </thead>
    <tbody>
      <tr><td>LJU1</td><td>0.465</td><td>0.13</td><td>0.074</td><td>0.067</td></tr>
    </tbody>
  </table>
</table-wrap>
        <p>In practice, this results in a fragile setup due to water
turbidity – USVs are expected to navigate safely even in
water that is dirty or muddy.</p>
        <p>Note also that a paddleboard, as shown in Fig. 1, is a
very thin object at the boundary between air and water,
which is not comparable to the situations encountered in
autonomous driving (on the road), so it is unlikely that
additional (underwater) sensors will reliably detect it. In
fact, some watercraft may be completely submerged at
times, as can be seen in Fig. 7, which shows a fast-moving
athlete in a kayak.</p>
      </sec>
      <sec id="sec-4-2">
        <title>7.2. Other examples of invisible hazards</title>
        <p>Missing detections of boats and paddleboards are
immediately available in our waterborne datasets. However,
there are other scenarios where such an approach would
be useful, but for which there is currently insufficient
data to train the models. The main reason for this is that
these scenarios are to some extent hazardous to the USV
and represent edge cases in USV deployment. In Figure 8,
we present a common scenario that we have encountered
several times, but for which we currently do not have
enough data to properly test, let alone train. Plant
debris is common in continental waters and usually safe to
traverse. Often it covers the entire navigable area (e.g.,
leaves in the fall), so avoiding it at all times is not an
option. However, debris may accumulate in shallow water
areas (it may not be debris, but aquatic plants sticking
out of the shallow water). So, if we encounter debris
farther from shore, it is not a cause for concern, as it is most
likely floating. However, if it is found near land features
(e.g., trees, mud), then it usually means that the area is
dangerous, shallow, and not navigable. To detect this
case, we might model the shallow, non-navigable area as
a composition of debris and other land-based features.</p>
        <p>As can be seen in the top right image in Fig. 8, it is
sometimes difficult to determine whether the situation is a
hazard or not. The labeling of such situations cannot
be done by (untrained) labelers, but must be defined by
experienced skippers working in cooperation with
computer vision engineers. These compositions and their
parameters must be defined by hand for a small number
of available cases. The HCDC approach [22] has shown
that this is indeed possible for common, well-known food
items. In this case, it will be used to insert concentrated
expert knowledge into the compositional hazard detection
model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>8. Acknowledgments</title>
      <p>This work was financed by the Slovenian Research Agency (ARRS), research program [P2-0095], and research project [J2-2506].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] A. Dallolio, H. B. Bjerck, H. A. Urke, J. A. Alfredsen, A persistent sea-going platform for robotic fish telemetry using a wave-propelled USV: technical solution and proof-of-concept, Frontiers in Marine Science 9 (2022). URL: https://www.frontiersin.org/articles/10.3389/fmars.2022.857623. doi:10.3389/fmars.2022.857623.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. T. Raber, S. R. Schill, Reef Rover: a low-cost small autonomous unmanned surface vehicle (USV) for mapping and monitoring coral reefs, Drones 3 (2019). URL: https://www.mdpi.com/2504-446X/3/2/38. doi:10.3390/drones3020038.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022. URL: https://arxiv.org/abs/2207.02696. doi:10.48550/ARXIV.2207.02696.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] D. K. Prasad, D. Rajan, L. Rachmawati, E. Rajabally, C. Quek, Video processing from electro-optical sensors for object detection and tracking in a maritime environment: a survey, IEEE Transactions on Intelligent Transportation Systems 18 (2017) 1993-2016.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>B.</given-names> <surname>Bovcon</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Muhovič</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Perš</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kristan</surname></string-name>,
          <article-title>The MaSTr1325 dataset for training deep USV obstacle detection models</article-title>,
          <source>in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>, IEEE,
          <year>2019</year>, pp.
          <fpage>3431</fpage>-<lpage>3438</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>B.</given-names> <surname>Bovcon</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Muhovič</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Vranac</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Mozetič</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Perš</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kristan</surname></string-name>,
          <article-title>MODS - a USV-oriented object detection and obstacle segmentation benchmark</article-title>,
          <source>IEEE Transactions on Intelligent Transportation Systems</source>
          (<year>2021</year>).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>L.</given-names> <surname>Steccanella</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Bloisi</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Castellini</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Farinelli</surname></string-name>,
          <article-title>Waterline and obstacle detection in images from low-cost autonomous boats for environmental monitoring</article-title>,
          <source>Robotics and Autonomous Systems</source>
          <volume>124</volume>
          (<year>2020</year>) 103346.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>A. J.</given-names> <surname>Sinisterra</surname></string-name>,
          <string-name><given-names>M. R.</given-names> <surname>Dhanak</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Von Ellenrieder</surname></string-name>,
          <article-title>Stereovision-based target tracking system for USV operations</article-title>,
          <source>Ocean Engineering</source>
          <volume>133</volume>
          (<year>2017</year>)
          <fpage>197</fpage>-<lpage>214</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Y.</given-names> <surname>Cheng</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Jiang</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Zhu</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Liu</surname></string-name>,
          <article-title>Are we ready for unmanned surface vehicles in inland waterways? The USVInland multisensor dataset and benchmark</article-title>,
          <source>IEEE Robotics and Automation Letters</source>
          <volume>6</volume>
          (<year>2021</year>)
          <fpage>3964</fpage>-<lpage>3970</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>D.</given-names> <surname>Nunes</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Fortuna</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Damas</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Ventura</surname></string-name>,
          <article-title>Real-time vision based obstacle detection in maritime environments</article-title>,
          <source>in: 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC)</source>, IEEE,
          <year>2022</year>, pp.
          <fpage>243</fpage>-<lpage>248</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name><given-names>Y.</given-names> <surname>Kuwata</surname></string-name>,
          <string-name><given-names>M. T.</given-names> <surname>Wolf</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Zarzhitsky</surname></string-name>,
          <string-name><given-names>T. L.</given-names> <surname>Huntsberger</surname></string-name>,
          <article-title>Safe maritime autonomous navigation with COLREGS, using velocity obstacles</article-title>,
          <source>IEEE Journal of Oceanic Engineering</source>
          <volume>39</volume>
          (<year>2013</year>)
          <fpage>110</fpage>-<lpage>119</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>J.</given-names> <surname>Han</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Cho</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>N.-s.</given-names> <surname>Son</surname></string-name>,
          <string-name><given-names>S. Y.</given-names> <surname>Kim</surname></string-name>,
          <article-title>Autonomous collision detection and avoidance for ARAGON USV: development and field tests</article-title>,
          <source>Journal of Field Robotics</source>
          <volume>37</volume>
          (<year>2020</year>)
          <fpage>987</fpage>-<lpage>1002</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>W.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Anguelov</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Erhan</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Szegedy</surname></string-name>,
          <string-name><given-names>S. E.</given-names> <surname>Reed</surname></string-name>,
          <string-name><given-names>C.-Y.</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>A. C.</given-names> <surname>Berg</surname></string-name>,
          <article-title>SSD: single shot multibox detector</article-title>,
          in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.),
          <source>ECCV (1)</source>, volume
          <volume>9905</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2016</year>, pp.
          <fpage>21</fpage>-<lpage>37</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>J.</given-names> <surname>Redmon</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Divvala</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Farhadi</surname></string-name>,
          <article-title>You only look once: unified, real-time object detection</article-title>,
          <year>2015</year>. URL: http://arxiv.org/abs/1506.02640. arXiv:1506.02640.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kristan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Kenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kovačič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Perš</surname>
          </string-name>
          ,
          <article-title>Fast imagebased obstacle detection from unmanned surface vehicles</article-title>
          ,
          <source>IEEE Transactions on Cybernetics</source>
          <volume>46</volume>
          (
          <year>2015</year>
          )
          <fpage>641</fpage>
          -
          <lpage>654</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Ow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. T.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <article-title>A vision-based obstacle detection system for unmanned surface vehicle</article-title>
          ,
          <source>in: Robotics, Automation and Mechatronics (RAM), 2011 IEEE Conference on</source>, IEEE,
          <year>2011</year>
          , pp.
          <fpage>364</fpage>
          -
          <lpage>369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Bovcon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Perš</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kristan</surname>
          </string-name>
          , et al.,
          <article-title>Stereo obstacle detection for unmanned surface vehicles by imu-assisted semantic segmentation</article-title>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>104</volume>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>,
          <string-name><given-names>Q.</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Ren</surname></string-name>,
          <article-title>Surface vehicle detection and tracking with deep learning and appearance feature</article-title>,
          <source>in: 2019 5th International Conference on Control, Automation and Robotics (ICCAR)</source>, IEEE,
          <year>2019</year>, pp.
          <fpage>276</fpage>-<lpage>280</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name><given-names>S.</given-names> <surname>Moosbauer</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Konig</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Jakel</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Teutsch</surname></string-name>,
          <article-title>A benchmark for deep learning based object detection in maritime environments</article-title>,
          <source>in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops</source>,
          <year>2019</year>, pp.
          <fpage>0</fpage>-<lpage>0</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name><given-names>H.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Koo</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Kim</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Park</surname></string-name>,
          <string-name><given-names>Y.</given-names> <surname>Jo</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Myung</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Lee</surname></string-name>,
          <article-title>Vision-based real-time obstacle segmentation algorithm for autonomous surface vehicle</article-title>,
          <source>IEEE Access</source>
          <volume>7</volume>
          (<year>2019</year>)
          <fpage>179420</fpage>-<lpage>179428</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name><given-names>B.</given-names> <surname>Bovcon</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Kristan</surname></string-name>,
          <article-title>A water-obstacle separation and refinement network for unmanned surface vehicles</article-title>,
          <source>in: 2020 IEEE International Conference on Robotics and Automation (ICRA)</source>, IEEE,
          <year>2020</year>, pp.
          <fpage>9470</fpage>-<lpage>9476</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name><given-names>G.</given-names> <surname>Koporec</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Perš</surname></string-name>,
          <article-title>Human-centered deep compositional model for handling occlusions</article-title>,
          <year>2022</year>. 2nd revision in Pattern Recognition.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name><given-names>T.-Y.</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Maire</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Belongie</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Bourdev</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Girshick</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Hays</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Perona</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Ramanan</surname></string-name>,
          <string-name><given-names>C. L.</given-names> <surname>Zitnick</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dollár</surname></string-name>,
          <article-title>Microsoft COCO: common objects in context</article-title>,
          <year>2014</year>. URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name><given-names>J.</given-names> <surname>Peršić</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Marković</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Petrović</surname></string-name>,
          <article-title>Extrinsic 6DoF calibration of a radar-lidar-camera system enhanced by radar cross section estimates evaluation</article-title>,
          <source>Robotics and Autonomous Systems</source>
          <volume>114</volume>
          (<year>2019</year>)
          <fpage>217</fpage>-<lpage>230</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name><given-names>C.</given-names> <surname>Schöller</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Schnettler</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Krämmer</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Hinz</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Bakovic</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Güzet</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Knoll</surname></string-name>,
          <article-title>Targetless rotational auto-calibration of radar and camera for intelligent transportation systems</article-title>,
          <source>in: 2019 IEEE Intelligent Transportation Systems Conference (ITSC)</source>, IEEE,
          <year>2019</year>, pp.
          <fpage>3934</fpage>-<lpage>3941</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>