=Paper=
{{Paper
|id=Vol-3349/paper10
|storemode=property
|title=Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model
|pdfUrl=https://ceur-ws.org/Vol-3349/paper10.pdf
|volume=Vol-3349
|authors=Jon Muhovic,Gregor Koporec,Janez Pers
|dblpUrl=https://dblp.org/rec/conf/cvww/MuhovicKP23
}}
==Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model==
Hallucinating Hidden Obstacles for Unmanned Surface Vehicles Using a Compositional Model

Jon Muhovič¹, Gregor Koporec¹,² and Janez Perš¹,*

¹ Faculty of Electrical Engineering, University of Ljubljana, Tržaška 25, 1000 Ljubljana, Slovenia
² Gorenje, d.o.o., 3320 Velenje, Slovenia

26th Computer Vision Winter Workshop, Robert Sablatnig and Florian Kleber (eds.), Krems, Lower Austria, Austria, Feb. 15-17, 2023.
* Corresponding author. † These authors contributed equally.
jon.muhovic@fe.uni-lj.si (J. Muhovič); gregor.koporec@gorenje.com (G. Koporec); janez.pers@fe.uni-lj.si (J. Perš)
https://lmi.fe.uni-lj.si/en/jon-muhovic/ (J. Muhovič); https://lmi.fe.uni-lj.si/en/janez-pers-2/ (J. Perš)
ORCID: 0000-0002-6039-6110 (J. Perš)

Abstract

The water environment in which unmanned surface vehicles (USVs) navigate presents many unique challenges. One of these is the risk of encountering obstacles that are (partially) submerged and therefore poorly visible, so their extent cannot be determined directly from the available above-water sensor data. On the other hand, it is well known that human skippers are able to safely navigate boats around such obstacles without underwater sensors, relying only on their expertise. In this paper, we describe initial work on extending USV obstacle detection with such functionality using a compositional model. To learn to hallucinate the extent of obstacles with minimal learning effort, we exploit the nature of obstacles (people in kayaks, canoes, and on paddleboards) that are visible most of the time, but not always. We evaluate the impact of such hallucinations on USV safety and maneuverability, and suggest additional cases where such hallucinations can be used to improve USV safety.

Keywords: unmanned vehicles, USV, obstacle detection, compositional models

1. Introduction

Unmanned surface vehicles (USVs) are increasingly recognized as a valuable tool for a variety of applications, including military, environmental, and commercial purposes. These autonomous craft are capable of operating in difficult or hazardous environments, making them ideal for tasks that would be too risky for humans. Moreover, one of the envisioned benefits of USVs is the ability to gather data and perform tasks for extended periods of time without the need for human intervention. This would allow them to cover large areas and collect large amounts of data that can then be used for a variety of purposes. USVs equipped with sensors and cameras could be used, for example, to monitor and map the marine environment, track wildlife [1], or assess the health of coral reefs [2]. However, truly autonomous vehicles with no captain on board and no contact with remote operators must essentially duplicate the reasoning of a trained skipper in certain situations. One such situation involves (partially) submerged objects that cannot be detected by USV sensors located above the water, but whose presence could easily be inferred by a human operator.

Our approach is best illustrated by Fig. 1. Based on the observation that people cannot walk or sit on water, we force the hallucination of a boat for every person detected on the surface of the water. The parameters of the hallucinated object are learned from person-boat compositions obtained by running a pretrained object detector on a separate dataset, and do not require annotation.

Figure 1: Left: detection of objects using Yolov7 [3]; a person is detected (dark blue), but neither a boat nor a paddle board is detected. We hallucinate the boat (in green). Right: the same person, later, when the boat is actually detected by Yolov7 (light blue), comparing the actual detection with the hallucination (green).
This paper is organized as follows. Following the related work, we define the problem used to demonstrate the capabilities of our method. We then introduce the basic concepts of compositional models and describe our use case and evaluation method. In the experimental part, we present our own dataset and its properties, followed by the evaluation setup focusing on USV navigation. Finally, we discuss the results and further applications of the presented approach.

2. Related work

Recently, numerous papers have been published on the subject of USV sensors, obstacle detection and navigation. The computer vision aspect of marine environment interpretation has been approached in several ways so far. Some authors have acquired datasets to facilitate domain transfer for Deep Learning and to further investigate the specific problems of the maritime domain [4, 5, 6]. Several USV architectures with different sensors have been presented to solve problems such as poor lighting conditions and the need for absolute distance measurements [7, 8, 9]. In addition, authors have proposed deep learning methods specific to the maritime domain that either incorporate additional relevant modalities or address problems that arise in that domain [10]. Numerous publications also address automatic navigation and maritime collision avoidance compliance [11].

Han et al. [12] presented a complete platform and framework for obstacle detection and avoidance, complete with multimodal sensors, obstacle detectors, and collision avoidance rules. They use the SSD detector [13] to detect potential obstacles and track them using sensor fusion. Since real-time performance is usually desired, fast detectors such as SSD or YOLO [14] are generally preferred for USV applications.

Several datasets have also been published, some intended as training data for Deep Learning-based methods and others as benchmarks for existing methods. One such dataset, SMD, was proposed by Prasad et al. [4]. It contained 51 RGB and 30 NIR sequences and was primarily intended for monitoring. Since then, several more USV-oriented datasets have been proposed, such as MODD [15], MaSTr1325 [5], and MODS [6].

In the past, obstacle detection was performed directly by estimating salient regions [16] or by color segmentation [8]. Before the widespread use of Deep Learning, several approaches were also proposed that mainly focused on semantic segmentation followed by anomaly detection. These methods [15, 17] used prior information about the scene and refined it with color image information. With the advent of Deep Learning, both branches of obstacle detection have been improved. On the one hand, researchers have adapted or retrained general object detectors for marine environments [18, 19, 10] using more precise classification information and custom datasets. However, such approaches only work for well-defined objects; unknown structures, such as floating debris or piers, usually cannot be detected with them. The other branch of obstacle detection is semantic segmentation. Several methods have adapted general segmentation methods to the marine environment [20, 7, 21]. Obstacle detection can then be performed by determining regions that are partially or completely surrounded by water.

The method presented in this paper operates at a higher level of reasoning and aims to use assumptions that reasonably hold in water-bound environments. It relies on existing but imperfect methods for obstacle detection (in this paper we use Yolov7 [3]). This work contains two contributions:

• A method for improving the safety of the USV and its environment by improving the estimation of free passage corridors in front of the USV, even with imperfect obstacle detectors.
• An evaluation method that quantifies the resulting increase in safety.
3. Problem definition

In situations where we cannot reliably observe fully or partially submerged obstacles using any of the sensors mounted above the water, we use knowledge of commonly occurring structures in marine environments to improve the safety of a USV.

In this paper, we present preliminary research results. We focused on the problem of detecting boats or other floating objects in situations where a person was detected above the water surface, but the corresponding boat was not. Such cases often occur when boats are of a similar color to the surrounding water, partially submerged due to maneuvering, or otherwise poorly visible due to backlight or the distance between a smaller object and the camera. The work was performed on RGB images because of the wide availability of pre-trained object detectors that perform reasonably well without additional training.

Since we are dealing with coastal and continental water regions where smaller boats such as rowboats and paddle boats are usually found, consistent detection of such obstacles is necessary. Depending on the lighting conditions, size and color of the boats, detection with conventional detectors applied to color images is not always consistent. This inconsistency can be a hazard to safe navigation, especially when maneuvering near other boats.

This problem has the following interesting properties:

• Solid physical foundation. People cannot walk or sit on the water; there must be some kind of highly buoyant device present to support their weight.
• No opportunity to introduce gross errors with false detections. False positives only restrict the possibilities for the USV to advance, and our experiments were designed to check for that effect.
• No manual annotations are needed, since we can obtain ground truth using the object detector (Yolov7), and therefore obtain plenty of data to train the higher-reasoning model.

The method will later be extended to a wider range of problems, discussed in Section 7.2, but these represent edge cases and are thus subject to data collection problems.

4. Our method

Our method is heavily influenced by the work of Koporec et al. [22], which uses hierarchical compositional models to detect objects' visible parts even when large parts of the objects are occluded, and allows the collection of expert knowledge from a small number of targeted human annotations. In our work we use a highly simplified implementation of the Human-Centered Deep Compositional (HCDC) model [22].

4.1. Compositional models

In computer vision, a composition refers to the arrangement of visual elements in an image. These visual elements are called parts and can be low-level primitives (e.g. edges, corners) or high-level objects themselves (e.g. the cap, the label and the recognizably shaped bottom of a soft drink bottle), as shown in Fig. 2. Parts can be compositions themselves, yielding a hierarchical compositional model.

Figure 2: Concept of the compositional model – modelling of the Coke bottle. The composition is shown on the left, each part marked with a green rectangle. Names of the parts are shown in the middle. A compositional hierarchical model is shown on the right: the darkest rectangles represent high-level parts 1, 2, 3. Of those, part 3 is a composition itself, containing parts 6 and 7 (lighter). Part 7 is again a composition of parts 4 and 5 (lightest).

The compositional model as shown in Fig. 2 is not particularly useful, as it is completely rigid. In practice, the geometric parameters of the parts are modelled as random vectors. In Figure 3 we show a hierarchical compositional model of a 3-part Coke bottle under the assumption that the probability distribution of the $j$-th part position $(x_{ij}, y_{ij})$ relative to the center (origin) of the $i$-th composition is Gaussian:

$$\mathbf{X}_{ij} = \begin{bmatrix} x_{ij} & y_{ij} \end{bmatrix}^T, \qquad \mathbf{X}_{ij} \sim \mathcal{N}_2(\mu_{ij}, \Sigma_{ij}) \tag{1}$$

where $\mathbf{X}_{ij}$ is a two-dimensional random vector, generated by the Gaussian distribution $\mathcal{N}_2$ with mean vector $\mu_{ij}$ and covariance matrix $\Sigma_{ij}$. The parameters of the Gaussian distribution are obtained by learning on a sufficiently large set of training data, from which the vectors $\mathbf{X}_{ij}$ are extracted.

Figure 3: Illustration: three parts of a Coke bottle (parts 1, 2, and 3 from Fig. 2) could look something like this if the learning samples featured Coke bottles tilted slightly to the right. Other parts are not shown. Ellipses show the Gaussian distributions of part displacements $(x_{ij}, y_{ij})$ relative to the center of the composition (denoted as Composition 1).

Compositional models can be used in the following ways:

1. Robust, explainable detection of partially occluded objects, where the object (composition) is detected even if not all of its parts are visible.
2. Explanation (hallucination) of the missing part. This is the functionality we use in the presented work.
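To make Eq. (1) concrete, the following minimal Python sketch (function and variable names are ours, not taken from the paper) estimates the parameters $\mu_{ij}$ and $\Sigma_{ij}$ of one part's displacement model from observed training displacements:

```python
import numpy as np

def fit_part_displacement_model(displacements):
    """Maximum-likelihood estimate of the Gaussian in Eq. (1).

    displacements: (M, 2) array of part positions (x_ij, y_ij)
    relative to the composition's origin, one row per training sample.
    Returns (mu_ij, Sigma_ij) as a 2-vector and a 2x2 matrix.
    """
    X = np.asarray(displacements, dtype=float)
    mu = X.mean(axis=0)               # mean displacement vector
    sigma = np.cov(X, rowvar=False)   # 2x2 covariance of displacements
    return mu, sigma
```

Each part of each composition would get its own $(\mu, \Sigma)$ pair, fitted independently from the samples in which that part was observed.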
4.2. Model of a person on a boat

The Human-Centered Deep Compositional (HCDC) model [22] operates on parts that are themselves deep detections (detections obtained by convolutional neural network models, CNNs). This makes the model explainable, as the parts are already categorized into human-understandable categories.

We follow this example and use the detections provided by an obstacle detector pretrained on MS COCO [23]. We retained only the pertinent detection classes: person, boat and surfboard. Additionally, we treated the classes boat and surfboard as the same semantic entity (referred to as boat in the remainder of the text), since both of these classes almost always appear simultaneously with the class person. The compositional model that we use is shown in Fig. 4.

Figure 4: Model of the two-part composition used in this research – a person on a boat. The centroid of the person detection is the origin of the composition coordinate system, and two corners of a bounding box represent the boat. The positions of the corners are modelled using two Gaussian distributions. To adjust the model to different scales, we use the diagonal of the person's bounding box, $d$.

In our case, Eq. (1) changes, since we have two separate Gaussian models, one for the upper-left and one for the bottom-right corner of the boat bounding box, and that for each of $N$ scales. The scale index $k$ is tied to the diagonal $d$ of the person's bounding box through $N$ uniform bins, $d = k \, \frac{d_{max}}{N}$:

$$\mathbf{X}_{Lk} = \begin{bmatrix} x_{Lk} & y_{Lk} \end{bmatrix}^T, \qquad \mathbf{X}_{Rk} = \begin{bmatrix} x_{Rk} & y_{Rk} \end{bmatrix}^T \tag{2}$$
$$\mathbf{X}_{Lk} \sim \mathcal{N}_{Lk}(\mu_{Lk}, \Sigma_{Lk}), \qquad \mathbf{X}_{Rk} \sim \mathcal{N}_{Rk}(\mu_{Rk}, \Sigma_{Rk})$$

where the subscripts $L$ and $R$ denote the top-left and bottom-right points of the boat bounding box, respectively, and $k$ denotes the scale index. Therefore, the total parameter set of our 2D model consists of $2N$ Gaussian mean vectors and $2N$ 2D Gaussian covariance matrices.
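The person's bounding box thus provides both the composition origin (its centroid) and the scale cue (its diagonal $d$). A minimal sketch of this geometry, assuming the uniform binning of $d$ implied by $d = k\,d_{max}/N$ (the exact quantization is not spelled out in the paper, and all names here are ours):

```python
import numpy as np

def person_geometry(box):
    """Centroid and diagonal d of a person bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    centroid = np.array([(x1 + x2) / 2.0, (y1 + y2) / 2.0])
    d = np.hypot(x2 - x1, y2 - y1)
    return centroid, d

def scale_index(d, d_max, n_scales):
    """Map diagonal d to a scale index k in {0, ..., N-1},
    assuming N uniform bins of width d_max / N."""
    k = int(d * n_scales / d_max)
    return min(max(k, 0), n_scales - 1)
```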
4.3. Training the compositional model

Our training does not require any manual annotations. Due to the good (but not perfect) performance of the chosen detector (Yolov7 detects about 95% of boats and an even higher percentage of persons), we use the cases where both the boat and the person on it were detected to establish a model that can reasonably predict the position and size of a boat in the absence of detections.

Although we assume a Gaussian model for the probability distributions $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$, we estimate each separate distribution using the expectation maximization (EM) algorithm with a 2-component Gaussian Mixture Model (GMM) and retain the larger of the two components as either $\mathcal{N}_{Lk}$ or $\mathcal{N}_{Rk}$. Our preliminary testing revealed that using a 2-component GMM results in more accurate fitting of the Gaussian model to the data, since the significantly smaller component collects the outliers.
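A sketch of this training step, using scikit-learn's GaussianMixture for the EM fit; the grouping of samples by corner ($L$/$R$) and scale index $k$ is assumed to have been done beforehand:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_corner_model(displacements):
    """Fit a 2-component GMM to corner displacements and keep the
    dominant component as N_Lk (or N_Rk); the smaller component
    absorbs the outliers, as described in the text.

    displacements: (M, 2) array of boat-corner offsets relative to
    the person centroid, for one corner at one scale index k.
    """
    gmm = GaussianMixture(n_components=2, covariance_type="full",
                          random_state=0)
    gmm.fit(np.asarray(displacements, dtype=float))
    dominant = int(np.argmax(gmm.weights_))
    return gmm.means_[dominant], gmm.covariances_[dominant]
```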
4.4. Hallucination

To hallucinate the most likely bounding box of the (undetected) boat, we examine the bounding box of the detected person, calculate its centroid and diagonal $d$, calculate the scale index $k$, and look up the relevant Gaussian models $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$ obtained during training. The hallucinated bounding box points of the boat are placed at the displacements $(x_{Lk}, y_{Lk})$ and $(x_{Rk}, y_{Rk})$ at which $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$ attain their maximum values. Note that these displacements are relative to the person's centroid point.
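Since the density of a Gaussian peaks at its mean, the hallucination step reduces to adding the learned mean displacements to the person centroid. A sketch, reusing person_geometry and scale_index from the earlier sketch (the models dictionary keyed by corner and scale is our own bookkeeping assumption):

```python
def hallucinate_boat(person_box, models, d_max, n_scales):
    """Hallucinate the most likely boat box for a detected person.

    models: dict mapping ('L', k) and ('R', k) to the learned mean
    displacements mu_Lk and mu_Rk (2-vectors); the Gaussian mode
    equals its mean, so the mean is the most likely displacement.
    Returns the hallucinated boat box (x1, y1, x2, y2).
    """
    centroid, d = person_geometry(person_box)   # from earlier sketch
    k = scale_index(d, d_max, n_scales)         # from earlier sketch
    x1, y1 = centroid + models[("L", k)]
    x2, y2 = centroid + models[("R", k)]
    return (x1, y1, x2, y2)
```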
5. USV safety-focused evaluation

To compare the performance of object detectors, a generic approach of counting false positives and false negatives with respect to some minimum intersection-over-union (IoU) value is often used. However, when evaluating detectors with an actual application in mind, it is often the case that not all errors are equally important or relevant. For example, the USV benchmark [6] defines a so-called danger zone to evaluate the more relevant obstacles separately. The problem that we address in this work is increasing the safety of USV navigation in cases where actual boats are not detected. The challenge is: how do we measure the increase in safety?

Note that a crucial safety issue here is that the USV may navigate in areas that actually contain part of a boat. Fig. 5 shows a situation with multiple detections and corresponding hallucinations. The aim of the USV is to proceed in the forward direction while avoiding obstacles. Therefore, it can proceed only through navigable channels, marked with arrows in Fig. 5. To ensure safety, navigable channels cannot contain any part of a boat at any distance, so the problem can be compressed to a one-dimensional representation along the horizontal ($x$) axis. However, if the hallucinations are too wide, there may not be any navigable channel left in front of the boat.

Figure 5: Illustration of the evaluation methodology. Note that the shore does not influence the evaluation in any way; this is an intentional simplification. Blue and green bounding boxes represent ground truth detections and the output of our method (hallucinations), respectively. All bounding boxes are vertically projected onto the horizontal ($x$) axis, and all evaluation, including IoU, is done in one dimension along this axis. Arrows denote the widths of "navigable channels" after the projection of the bounding boxes onto the horizontal axis.

Therefore, we define the following two metrics (a sketch of both is given after this list):

• One-dimensional IoU (referred to as IoU-1D), calculated from the projections of the actual (ground truth) bounding boxes and the hallucinated bounding boxes, both projected downwards onto the horizontal axis (the evaluation line in Fig. 5). This value should be as high as possible.
• One-dimensional coverage (referred to as Cov-1D) of the horizontal axis (evaluation line) by the projections of the ground truth and hallucinated bounding boxes. If the coverage by hallucinations becomes too high, the USV may have no possibility of advancing, and regardless of the increase in safety, such a solution is not good. Coverage is obtained by dividing the number of pixels on the evaluation line covered by projected bounding boxes by the width of the evaluation line in pixels.

This evaluation protocol does not assume or require complex obstacle avoidance maneuvers, and it is not sensitive to vertical displacement of the bounding boxes.
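A minimal sketch of both metrics, operating on the x-axis projections of the boxes (interval endpoints in pixels); the interval-union bookkeeping is ours:

```python
def merge_intervals(intervals):
    """Union of 1D intervals [(x1, x2), ...] from projected boxes."""
    merged = []
    for x1, x2 in sorted(intervals):
        if merged and x1 <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], x2)
        else:
            merged.append([x1, x2])
    return merged

def covered_length(intervals):
    return sum(x2 - x1 for x1, x2 in merge_intervals(intervals))

def iou_1d(gt_spans, hal_spans):
    """IoU-1D between the unions of ground-truth and hallucinated
    projections on the evaluation line."""
    inter = 0.0
    for a1, a2 in merge_intervals(gt_spans):
        for b1, b2 in merge_intervals(hal_spans):
            inter += max(0.0, min(a2, b2) - max(a1, b1))
    union = covered_length(gt_spans) + covered_length(hal_spans) - inter
    return inter / union if union > 0 else 0.0

def cov_1d(spans, line_width):
    """Cov-1D: fraction of the evaluation line covered by projections."""
    return covered_length(spans) / line_width
```

The navigable channels of Fig. 5 are then simply the gaps between the merged intervals on the evaluation line.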
6. Experiments

We recorded several hours of video on the Ljubljanica river (sessions denoted LJU1, LJU2, and LJU3) in different weather conditions, on Lake Bled (denoted BLE1), and on the Adriatic Sea (near the coast, in several areas between Koper and Portorož; denoted ADR1). In each case, we hired human workers who served as obstacles in boats, kayaks, canoes and on paddleboards. The data contains about 10 obstacles in the near vicinity of the recording boat, captured in different configurations and from different angles relative to the position of the sun (so challenging backlit scenes were also captured). Videos were recorded at 10 frames per second using a Stereolabs ZED 3D stereo camera (https://www.stereolabs.com), mounted 1-1.5 meters above the water surface (different watercraft were used at different locations). In this experiment we only used the left RGB images; the right RGB images and depth were not used in any way.

6.1. Analysis of dataset contents

The training data was constructed by first obtaining predictions for all the relevant classes using Yolov7. The compositions were then constructed from cases where there was overlap between detections of the class person and either of the classes boat or surfboard (a sketch of this pairing is given below). Analysis of the detections provides some insight into the problem of "invisible" boats and paddle boards, as shown in Table 1.

Table 1: Fractions of detected people without boats vs. detected people with boats, among all person detections, for each recording session. Note that the share of missing boat detections ranges from 3-5%. The videos contained a negligible number of people on the shore (physically plausible detections without boats).

Session (dataset)   LJU1   LJU2   LJU3   BLE1   ADR1
person only         0.04   0.05   0.05   0.03   0.05
person + boat       0.96   0.95   0.95   0.97   0.95
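The pairing used to build the training compositions can be sketched as follows; taking the most-overlapping boat per person is our assumption, as the paper only requires overlap:

```python
def intersection_area(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def build_compositions(detections):
    """Build (person_box, boat_box) training pairs from the Yolov7
    detections of a single image.

    detections: list of (class_name, box) tuples; the classes 'boat'
    and 'surfboard' are merged into one semantic entity (Section 4.2).
    """
    persons = [box for cls, box in detections if cls == "person"]
    boats = [box for cls, box in detections
             if cls in ("boat", "surfboard")]
    pairs = []
    for p in persons:
        overlapping = [b for b in boats if intersection_area(p, b) > 0]
        if overlapping:
            best = max(overlapping, key=lambda b: intersection_area(p, b))
            pairs.append((p, best))
    return pairs
```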
6.2. Training

We used session BLE1 for training the Gaussian distributions $\mathcal{N}_{Lk}$ and $\mathcal{N}_{Rk}$, as it featured boats of varying shapes and sizes. The training time using precalculated Yolov7 detections was negligible.

6.3. Testing

Free from requirements for manual annotation, we were able to run the evaluation of our method on all images from our dataset. For evaluation, we used only the detections of people with corresponding boats. Boat detections obtained via Yolov7 were considered ground truth, against which the hallucinations obtained using our compositional model were tested. Person detections without corresponding boats were not used, as these had no usable ground truth. Table 2 shows the results.

Table 2: Evaluation results using the model trained on the BLE1 session. IoU-1D is the one-dimensional IoU of bounding box projections onto the horizontal axis, and Cov-1D is the coverage of the horizontal axis by each type of bounding box. The projections of pure person detection bounding boxes are included for comparison.

Session (dataset)         LJU1    LJU2    LJU3    BLE1    ADR1
IoU-1D                    0.465   0.435   0.532   0.605   0.582
Ground truth Cov-1D       0.13    0.149   0.152   0.101   0.193
Hallucination Cov-1D      0.074   0.065   0.117   0.083   0.149
Person detection Cov-1D   0.067   0.054   0.127   0.062   0.139

Analysing the results, we can see that there is good overlap between the ground truth detections and the hallucinations, with IoU-1D ranging from 0.435 to 0.605; the highest value is achieved on the same session (BLE1) on which the model was trained. Note that an IoU-1D of 0.5 corresponds, for example, to a hallucination covering the middle half of a ground truth projection, with a quarter at each edge non-overlapping.

Coverage by hallucinations is not as high as coverage by detections, and, most surprisingly, coverage by pure person detections (i.e., in the absence of any detected boats) is not much lower than the coverage by hallucinations. We examined the reason behind this and found that the increase is smaller than expected due to obstacles that are further away and have disproportionately wide person detection bounding boxes, and due to differences between the sets of boats used for training and testing (note the highest increase in Cov-1D from person detection to hallucination when the training set BLE1 was tested). Figure 6 shows an image where the result of our method is poor.

Figure 6: Image on which the proposed method does not significantly improve safety. Note the wide detections of persons and an uncharacteristically long canoe.

7. Discussion

This paper presents preliminary research on the use of hallucinations, provided by compositional models, in water-borne obstacle detection and avoidance. The experimental design in this work has been subject to many constraints, most notably the absence of proper ground truth annotations. These issues will be addressed in further work, towards a general framework for hallucinating obstacles that are not directly observed by the sensors. Since using an obstacle detector precludes us from detecting unknown objects, combining its results with semantic segmentation, another method of anomaly detection, or a different sensor modality (such as LIDAR) might help in producing a more general system that performs hazard detection from multimodal cues.

7.1. Underwater sensors

The state of the art in experimental autonomous road vehicles relies heavily on a multimodal sensor setup, with sensors like LIDAR and RADAR [24, 25], which bear no resemblance to human sensing. Therefore, an argument could be made that instead of hallucinating the obstacles and trying to emulate the skipper, one could detect the hidden obstacles using a proper underwater sensor setup. In practice, this results in a fragile setup due to water turbidity – USVs are expected to navigate safely even in water that is dirty or muddy.

Note also that a paddleboard, as shown in Fig. 1, is a very thin object at the boundary between air and water, which is not comparable to the situations encountered in autonomous driving (on the road), so it is unlikely that additional (underwater) sensors would reliably detect it. In fact, some watercraft may be completely submerged at times, as can be seen in Fig. 7, which shows a fast-moving athlete in a kayak.

Figure 7: A submerged kayak that cannot possibly be reliably detected using visual sensors.

7.2. Other examples of invisible hazards

Missing detections of boats and paddleboards are immediately available in our waterborne datasets. However, there are other scenarios where such an approach would be useful, but for which there is currently insufficient data to train the models. The main reason for this is that these scenarios are to some extent hazardous to the USV and represent edge cases in USV deployment. In Figure 8, we present a common scenario that we have encountered several times, but for which we currently do not have enough data to properly test, let alone train, a model.

Plant debris is common in continental waters and usually safe to traverse. It often covers the entire navigable area (e.g., leaves in the fall), so avoiding it at all times is not an option. However, debris may accumulate in shallow water areas (in which case it may not be debris at all, but aquatic plants sticking out of the shallow water). Thus, if we encounter debris farther from shore, it is not a cause for concern, as it is most likely floating. However, if it is found near land features (e.g., trees, mud), it usually means that the area is dangerous, shallow, and not navigable. To detect this case, we might model the shallow, non-navigable area as a composition of debris and other land-based features.

Figure 8: Example of an invisible danger – plant water debris. Plant debris can be seen in all four images. Plant debris is usually mobile, buoyant, harmless, and can be run over by a boat (top left). However, if the plant debris is near the shore, it can accumulate on aquatic plants and signal dangerously shallow depth (top right and bottom left). The presence of other clues (muddy water) increases the likelihood that the water in the area of the debris is precariously shallow (bottom right).

As can be seen in the top right image of Fig. 8, it is sometimes difficult to determine whether the situation is a hazard or not. The labeling of such situations cannot be done by (untrained) labelers, but must be defined by experienced skippers working in cooperation with computer vision engineers. These compositions and their parameters must be defined by hand for a small number of available cases. The HCDC approach [22] has shown that this is indeed possible for common, well-known food items. In this case, it would be used to insert concentrated expert knowledge into the compositional hazard detection model.

8. Acknowledgments

This work was financed by the Slovenian Research Agency (ARRS), research program [P2-0095] and research project [J2-2506].
References

[1] A. Dallolio, H. B. Bjerck, H. A. Urke, J. A. Alfredsen, A persistent sea-going platform for robotic fish telemetry using a wave-propelled USV: technical solution and proof-of-concept, Frontiers in Marine Science 9 (2022). doi:10.3389/fmars.2022.857623.
[2] G. T. Raber, S. R. Schill, Reef Rover: a low-cost small autonomous unmanned surface vehicle (USV) for mapping and monitoring coral reefs, Drones 3 (2019). doi:10.3390/drones3020038.
[3] C.-Y. Wang, A. Bochkovskiy, H.-Y. M. Liao, YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors, 2022. arXiv:2207.02696.
[4] D. K. Prasad, D. Rajan, L. Rachmawati, E. Rajabally, C. Quek, Video processing from electro-optical sensors for object detection and tracking in a maritime environment: a survey, IEEE Transactions on Intelligent Transportation Systems 18 (2017) 1993–2016.
[5] B. Bovcon, J. Muhovič, J. Perš, M. Kristan, The MaSTr1325 dataset for training deep USV obstacle detection models, in: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2019, pp. 3431–3438.
[6] B. Bovcon, J. Muhovič, D. Vranac, D. Mozetič, J. Perš, M. Kristan, MODS – a USV-oriented object detection and obstacle segmentation benchmark, IEEE Transactions on Intelligent Transportation Systems (2021).
[7] L. Steccanella, D. Bloisi, A. Castellini, A. Farinelli, Waterline and obstacle detection in images from low-cost autonomous boats for environmental monitoring, Robotics and Autonomous Systems 124 (2020) 103346.
[8] A. J. Sinisterra, M. R. Dhanak, K. Von Ellenrieder, Stereovision-based target tracking system for USV operations, Ocean Engineering 133 (2017) 197–214.
[9] Y. Cheng, M. Jiang, J. Zhu, Y. Liu, Are we ready for unmanned surface vehicles in inland waterways? The USVInland multisensor dataset and benchmark, IEEE Robotics and Automation Letters 6 (2021) 3964–3970.
[10] D. Nunes, J. Fortuna, B. Damas, R. Ventura, Real-time vision based obstacle detection in maritime environments, in: 2022 IEEE International Conference on Autonomous Robot Systems and Competitions (ICARSC), IEEE, 2022, pp. 243–248.
[11] Y. Kuwata, M. T. Wolf, D. Zarzhitsky, T. L. Huntsberger, Safe maritime autonomous navigation with COLREGS, using velocity obstacles, IEEE Journal of Oceanic Engineering 39 (2013) 110–119.
[12] J. Han, Y. Cho, J. Kim, J. Kim, N.-s. Son, S. Y. Kim, Autonomous collision detection and avoidance for ARAGON USV: development and field tests, Journal of Field Robotics 37 (2020) 987–1002.
[13] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C.-Y. Fu, A. C. Berg, SSD: single shot multibox detector, in: B. Leibe, J. Matas, N. Sebe, M. Welling (Eds.), ECCV (1), volume 9905 of Lecture Notes in Computer Science, Springer, 2016, pp. 21–37.
[14] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: unified, real-time object detection, 2015. arXiv:1506.02640.
[15] M. Kristan, V. S. Kenk, S. Kovačič, J. Perš, Fast image-based obstacle detection from unmanned surface vehicles, IEEE Transactions on Cybernetics 46 (2015) 641–654.
[16] H. Wang, Z. Wei, S. Wang, C. S. Ow, K. T. Ho, B. Feng, A vision-based obstacle detection system for unmanned surface vehicle, in: 2011 IEEE Conference on Robotics, Automation and Mechatronics (RAM), IEEE, 2011, pp. 364–369.
[17] B. Bovcon, J. Perš, M. Kristan, et al., Stereo obstacle detection for unmanned surface vehicles by IMU-assisted semantic segmentation, Robotics and Autonomous Systems 104 (2018) 1–13.
[18] J. Yang, Y. Li, Q. Zhang, Y. Ren, Surface vehicle detection and tracking with deep learning and appearance feature, in: 2019 5th International Conference on Control, Automation and Robotics (ICCAR), IEEE, 2019, pp. 276–280.
[19] S. Moosbauer, D. Konig, J. Jakel, M. Teutsch, A benchmark for deep learning based object detection in maritime environments, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[20] H. Kim, J. Koo, D. Kim, B. Park, Y. Jo, H. Myung, D. Lee, Vision-based real-time obstacle segmentation algorithm for autonomous surface vehicle, IEEE Access 7 (2019) 179420–179428.
[21] B. Bovcon, M. Kristan, A water-obstacle separation and refinement network for unmanned surface vehicles, in: 2020 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2020, pp. 9470–9476.
[22] G. Koporec, J. Perš, Human-centered deep compositional model for handling occlusions, 2022. 2nd revision in Pattern Recognition.
[23] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, P. Dollár, Microsoft COCO: common objects in context, 2014. arXiv:1405.0312.
[24] J. Peršić, I. Marković, I. Petrović, Extrinsic 6DoF calibration of a radar–lidar–camera system enhanced by radar cross section estimates evaluation, Robotics and Autonomous Systems 114 (2019) 217–230.
[25] C. Schöller, M. Schnettler, A. Krämmer, G. Hinz, M. Bakovic, M. Güzet, A. Knoll, Targetless rotational auto-calibration of radar and camera for intelligent transportation systems, in: 2019 IEEE Intelligent Transportation Systems Conference (ITSC), IEEE, 2019, pp. 3934–3941.