<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bridging the Sim-to-Real Gap with Explainability for ML-based Object Detection on Sonar Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Şakir Furkan Yöndem</string-name>
          <email>sakirfurkan.yoendem@th-nuernberg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramin Tavakoli Kolagari</string-name>
          <email>ramin.tavakolikolagari@th-nuernberg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benedikt Schlereth-Groh</string-name>
          <email>benedikt.schlereth-groh@th-nuernberg.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Hochschule Nürnberg</institution>
          ,
          <addr-line>Keßlerplatz 12, 90489 Nürnberg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Saving drowning victims is time-critical, but detecting people underwater is highly challenging due to poor visibility and large distances. While side-scan sonar (SSS) is widely used for seafloor mapping and debris detection, human detection in sonar data remains largely unexplored. Training deep neural networks for this task requires a large dataset, but collecting real maritime data is difficult and expensive, making a synthetic data generation approach necessary. We introduce SimWave, a simulation environment designed to generate synthetic data for underwater human detection. We train deep learning models on real, synthetic, and hybrid datasets, evaluating their performance on real sonar images. The contribution of this paper lies in combining synthetic data generation with Explainable Artificial Intelligence (XAI) to systematically refine artificial datasets, addressing the gap between synthetic and natural data to enhance real-world performance, an approach not previously explored in underwater sonar-based human detection. To gain insight into the model's decision-making process, we apply XAI techniques to analyze how attention shifts between real and synthetic training data. This helps visualize the synthetic-real data mismatch, refine synthetic data, and enhance model performance in real-world conditions. Our experimental results show that models trained on hybrid datasets, supported by XAI-based analysis, achieve notable performance improvements and better generalization. XAI helps identify domain gaps between real and synthetic data, allowing for dataset refinement and improved model accuracy. These findings highlight the effectiveness of synthetically generated data in training deep learning models for underwater human detection and emphasize the critical role of XAI in optimizing training data for real-world conditions.</p>
      </abstract>
      <kwd-group>
        <kwd>Underwater Human Detection</kwd>
        <kwd>Side-Scan Sonar</kwd>
        <kwd>Synthetic Data Generation</kwd>
        <kwd>Explainable AI (XAI)</kwd>
        <kwd>Domain Bridging</kwd>
        <kwd>Domain Gap</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Human search and recovery operations in underwater environments are time-consuming and complex
due to poor visibility, hazardous conditions, and operational challenges, placing a significant burden
on specially trained divers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Side-scan sonar is widely used to support such searches, but it still
relies heavily on manual control and remains dependent on trained sonar operators [2, 3]. Artificial
Intelligence (AI)-based automated analysis methods offer a promising alternative to speed up the process
and reduce human intervention. However, the effectiveness of these methods depends on the availability
of large, high-quality training datasets [4, 5]. In specialized areas such as sonar imaging, collecting
real-world data is particularly challenging, and the lack of diverse data limits model generalization,
leading to performance degradation in unseen or underrepresented scenarios. The key barriers to data
collection include high costs, security risks, legal restrictions, and operational complexities [6].
      </p>
      <p>To address these challenges, synthetic data generation is increasingly used for training deep learning
models. Simulation-based approaches enhance model generalization by generating datasets that simulate
diverse real-world conditions [7, 8]. However, structural differences between synthetic and real data
(the domain gap) can cause models to underperform in real-world applications. Optimizing synthetic data
and integrating hybrid datasets can significantly improve generalization and model robustness [9].</p>
      <p>The limitations of real-world side-scan sonar datasets for human detection and the low quality
of available sonar images are the main motivation for this work. In order to overcome these issues,
we introduce SimWave, a synthetic dataset designed for underwater human detection. We train the
YOLOv8 and YOLOv11 models using both real and synthetic data and compare their performance on
real sonar images. By leveraging XAI, we optimize synthetic images to enhance model decision-making
and reduce errors caused by excessive brightness and reflections through a dynamic clipping technique
applied to SimWave. Experimental results show that the dynamically clipped dataset improves detection
accuracy and recall on real sonar images, demonstrating the effectiveness of synthetic data refinement.</p>
      <p>The remainder of our paper begins by summarizing existing research on underwater human detection
and situating our contribution within this broader context. We then detail the design and implementation
of the SimWave simulation system, introducing the synthetic sonar data generation mechanism and
the different datasets we use. We then examine the relationship between real and synthetic data
and show that an XAI-guided data optimization technique yields noticeable improvements in model
performance. Finally, we analyze our findings, draw insights from our experimental evaluations,
and outline future directions to improve sonar-based detection.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the Art</title>
      <p>In order to understand and solve the problem of detecting people in sonar images, important research
areas are analysed below and suitable methods are presented.</p>
      <sec id="sec-2-1">
        <title>2.1. Detection on Sonar images</title>
        <p>Sonar images, which are formed from acoustic wave reflections, can be used effectively in underwater
object detection tasks with deep learning models such as convolutional neural networks (CNNs), as they can be
represented as Red-Green-Blue (RGB) images [10]. Object detection on sonar images has been applied
successfully by various research groups, especially for detecting submerged objects in turbid waters. For
example, Lu et al. improved the YOLO network by replacing convolutional layers with
residual blocks and were able to detect different objects in sonar images [5]. Similar detection work exists for
side-scan sonar, for example with YOLOv7 [11] or YOLOv9 [12]. Humans can also be detected in
sonar images, as shown by Hu and Liu, who detected an underwater rescue target in multi-beam imaging
sonar data [13].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Domain Gap</title>
        <p>Since data collection is particularly expensive and time-consuming, expanding the dataset with
synthetic data from a simulation environment is promising. However, the synthetic-to-real
domain gap, i.e., the differences between the simulation and the real world, remains a challenge. Kiefer
et al. [14] demonstrated poor performance on real-world data when training only on synthetic data, but
improvements when training on both synthetic and real-world data, showcasing a gap between object detectors
trained on either simulated or natural data. Synthetic sonar images have been generated for underwater mine-like
objects [15] and for wrecks on the seabed [16]. Both works show that combining synthetic and real data
can improve training results on sonar images as well, but they do not show how large the gap is between
training on purely synthetic and purely real data [15, 16].</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Explainable AI</title>
        <p>Explainable artificial intelligence (XAI) for object detection algorithms can help to understand
which features are important for deep neural networks. Visual explanations can be provided by
model-agnostic or model-dependent algorithms. By using the gradients of the target class flowing into the
last convolutional layer, Grad-CAM can provide an explanation for the predicted class [17]. Petsiuk et al. improved the
explanation by applying random masks and calculating the similarity to the original detection, which
explains both localization and classification on images [18]. Since only the predictions are needed to calculate
the saliency maps, this approach is also model-agnostic and can be used for different object detectors
[18].</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. SimWave</title>
      <p>In this study, we developed a simulation environment to overcome the lack of real sonar data and train
deep learning models for underwater human detection. We created this environment, which we call
SimWave, using the Robot Operating System 2 (ROS2), Gazebo and Blender.</p>
      <p>ROS2 is an open-source framework that facilitates the management of autonomous systems,
sensor-based data processing, and the integration of robotic applications thanks to its modular structure. Support
for real-time communication and distributed computing allows us to create simulation environments
that reflect real-world conditions [19]. Gazebo provides a powerful platform for simulating robotic
systems with its realistic physics engine and high-precision sensor modelling capabilities. In particular,
its ability to model hydrodynamic interactions makes it ideal for testing the behaviour of sonar systems
in an underwater environment. SimWave was developed to generate synthetic side-scan sonar images.
First, we integrated the forward-looking sonar (FLS) sensor developed in Project Dave (Underwater Sonar Sensor Development
Project) [20] into the simulation environment by making it compatible with ROS2. We then configured
SimWave with two FLS sensors placed on the left and right sides to provide wide-angle data collection.
Finally, we combined the data from these two sensors to produce an image similar to side-scan sonar. The
sonar sensor parameters used in the simulation are as follows: The sensor captures 512 vertical samples
within a range of -90° to +90°. Horizontally, it records 300 samples within a range of approximately
between 0° and 5° to +20°.</p>
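      <p>The text does not detail how the outputs of the two sensors are merged. One plausible composition, shown here purely as an assumption, mirrors the port-side intensity profile and appends the starboard profile so that range increases outward from the center, then stacks one such row per ping into a waterfall image. The function names and the per-ping intensity lists are hypothetical and not part of SimWave.</p>
      <preformat>
```python
def side_scan_row(port, starboard):
    """One waterfall row: mirrored port returns followed by starboard returns."""
    return list(reversed(port)) + list(starboard)


def waterfall(pings):
    """Stack one row per ping; `pings` is a sequence of (port, starboard) pairs."""
    return [side_scan_row(port, starboard) for port, starboard in pings]


# Two toy pings with three range bins per side (index 0 = closest to the vehicle).
pings = [([1, 2, 3], [4, 5, 6]), ([7, 8, 9], [10, 11, 12])]
image = waterfall(pings)
print(image)  # [[3, 2, 1, 4, 5, 6], [9, 8, 7, 10, 11, 12]]
```
      </preformat>
      <p>In this layout the two closest range bins meet in the middle of each row, which matches the typical appearance of a side-scan waterfall image.</p>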
      <p>To make the simulation environment compatible with the sonar images from the real data set, we
used Blender to create a standard-sized 3D human model and various rock and wood objects. Blender is
an open-source 3D modeling software that allows the modeling of physical objects, human figures and
environmental elements. Figure 1 presents both simulated and real-world sonar images.</p>
      <p>Figure 1: Simulated (SimWave) and real-world (Aaltonen) sonar images.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Improving Data Quality with XAI</title>
      <p>In this section, we show how a dataset for finding submerged bodies in water can be improved by
leveraging an XAI analysis. Evaluating the performance of object detection algorithms (YOLOv8
and YOLOv11) trained on different dataset compositions reveals the need to improve the synthetic
data. Based on the attention of the object detection algorithms, visualized using an XAI
method, we propose an adaptation of the sensor simulator.</p>
      <sec id="sec-4-1">
        <title>4.1. Datasets</title>
        <p>In this study, both real and synthetic data are used to improve human detection in sonar images. The
dataset consists of real-world data collected using side-scan sonar and synthetic data generated in the
SimWave simulation environment. The real-world data was collected by Aaltonen [21], who provides
waterfall images from side-scan sonar as a public dataset. We partitioned this dataset
into 145 training, 61 validation, and 125 test images for model training and evaluation. Due to the limited
size of the real-world dataset, we generated additional sonar images using the SimWave simulation
environment. We designed this simulated dataset to align with the real-world dataset, incorporating
145 training and 61 validation images. To further enhance the model’s generalization capability and
increase data diversity, we applied data augmentation techniques tailored for the functionality of sonar
sensors. In contrast, we constructed the hybrid dataset without augmentation, solely by merging real
and synthetic data, resulting in a total of 351 training samples (145 real and 206 synthetic sonar images),
as shown in Table 1.</p>
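        <p>The split sizes can be checked with a few lines of arithmetic. The sketch below assumes that the 206 synthetic images in the hybrid training set are the 145 training plus 61 validation images generated with SimWave; this reading follows from the numbers above but is not stated explicitly. The class and variable names are illustrative only.</p>
        <preformat>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Image counts per split; names are illustrative only."""
    train: int
    val: int
    test: int = 0


real = Split(train=145, val=61, test=125)  # Aaltonen dataset
synthetic = Split(train=145, val=61)       # SimWave dataset

# Assumed reading: all 206 synthetic images join the 145 real training images.
synthetic_total = synthetic.train + synthetic.val
hybrid_train = real.train + synthetic_total

print(synthetic_total, hybrid_train)  # 206 351
```
        </preformat>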
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Evaluation</title>
        <p>The performance of all models is evaluated on the same 125 test images from the real-world dataset
collected by Aaltonen [21]. As discussed by Lu et al., the YOLOv8 model performed well on the Marine
Debris Dataset and the Underwater Acoustic Target Detection Dataset [5]. We therefore decided to
test the YOLOv8l [22] model as well as the most recent YOLO model, YOLOv11l [23]. The performance
of both models trained on different datasets for detecting humans underwater is presented in Table 2.</p>
        <p>The YOLOv8 model demonstrated superior precision values, whereas the YOLOv11 model exhibited
enhanced performance in terms of recall. The hybrid dataset (Aaltonen + SimWave) facilitated a more
balanced performance across both models, leading to an increase in mAP@50-95 and precision, thereby
illustrating the capacity of synthetic data to augment the generalization capabilities of the models
and provide more consistent outcomes under diverse sonar conditions. However, the models trained
exclusively with SimWave experienced a decline in precision and recall, indicating a domain gap
between the real and synthetic datasets. An XAI approach can provide insight into which differences
matter for detection.</p>
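        <p>For readers less familiar with the metrics, the following is a minimal sketch of how precision and recall can be computed for one image at a single intersection-over-union (IoU) threshold. mAP@50-95 averages the average precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05. The greedy matching below is a common simplification and is not claimed to reproduce the exact evaluation code behind Table 2; all function names and the toy boxes are hypothetical.</p>
        <preformat>
```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truth, iou_threshold=0.5):
    """predictions: list of (box, confidence); ground_truth: list of boxes."""
    matched = set()
    true_positives = 0
    # Match the most confident predictions first; each ground truth is used once.
    for box, _conf in sorted(predictions, key=lambda p: p[1], reverse=True):
        best_idx, best_iou = None, iou_threshold
        for idx, gt_box in enumerate(ground_truth):
            overlap = iou(box, gt_box)
            if idx not in matched and overlap >= best_iou:
                best_idx, best_iou = idx, overlap
        if best_idx is not None:
            matched.add(best_idx)
            true_positives += 1
    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# IoU thresholds used by mAP@50-95.
map_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Toy example: one correct detection, one false alarm, one missed person.
ground_truth = [(10, 10, 50, 50), (100, 100, 140, 140)]
predictions = [((12, 12, 50, 50), 0.9), ((200, 200, 240, 240), 0.6)]
print(precision_recall(predictions, ground_truth))  # (0.5, 0.5)
```
        </preformat>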
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Understanding Sim-to-Real Gap</title>
        <p>Although only one sensor modality and a small selection of object detection algorithms are analyzed,
we select the model-agnostic XAI approach D-RISE to enable an extension to other models and
datasets in the future. D-RISE, proposed by Petsiuk et al. [18], calculates a
saliency map that indicates the image areas important for a given detection. We used
N=5000 masks, a masking probability of p=0.5, and a mask
resolution of (h,w)=(16,16) [18]. Because YOLOv8 achieved the better precision values, we visualize its detections
on the same image for the differently trained models, together with the corresponding saliency maps.</p>
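        <p>The masking idea behind D-RISE can be sketched in a few lines. The version below is a simplified illustration under stated assumptions: it upsamples the random grid with nearest-neighbour interpolation instead of the smooth, randomly shifted upsampling described in [18], it scores each masked image by IoU times the cosine similarity of class probabilities without an objectness term, and it uses a hypothetical toy detector on an 8x8 image in place of YOLO. All function names are illustrative.</p>
        <preformat>
```python
import math
import random


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two class-probability vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def random_mask(height, width, grid, p, rng):
    """Binary grid mask (cell kept with probability p), nearest-neighbour upsampled."""
    gh, gw = grid
    cells = [[1.0 if p > rng.random() else 0.0 for _ in range(gw)] for _ in range(gh)]
    return [[cells[y * gh // height][x * gw // width] for x in range(width)]
            for y in range(height)]


def d_rise_saliency(image, detector, target, n_masks=5000, grid=(16, 16), p=0.5, seed=0):
    """Sum the masks, each weighted by how well the target detection survives it."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    target_box, target_probs = target
    saliency = [[0.0] * width for _ in range(height)]
    for _ in range(n_masks):
        mask = random_mask(height, width, grid, p, rng)
        masked = [[image[y][x] * mask[y][x] for x in range(width)] for y in range(height)]
        # Weight: best similarity between the target and any detection on the masked image.
        weight = max((iou(target_box, box) * cosine(target_probs, probs)
                      for box, probs in detector(masked)), default=0.0)
        if weight > 0.0:
            for y in range(height):
                for x in range(width):
                    saliency[y][x] += weight * mask[y][x]
    return saliency


def toy_detector(masked):
    """Hypothetical detector: reports the object only if its 2x2 blob is visible."""
    visible = sum(masked[y][x] for y in (2, 3) for x in (2, 3)) > 0
    return [((2, 2, 4, 4), [1.0])] if visible else []


# Toy 8x8 image with a bright 2x2 blob; the target is the unmasked detection.
image = [[0.0] * 8 for _ in range(8)]
for y in (2, 3):
    for x in (2, 3):
        image[y][x] = 1.0
target = ((2, 2, 4, 4), [1.0])
saliency = d_rise_saliency(image, toy_detector, target, n_masks=200, grid=(4, 4))
```
        </preformat>
        <p>In this toy setting the saliency accumulates on the blob pixels, because only masks that keep the blob visible receive a non-zero weight. With a real detector, the same loop would be run with the parameters given above (N=5000, p=0.5, 16x16 grid).</p>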
        <p>Figure 2: Saliency maps of the YOLOv8 models trained on (a) Aaltonen, (b) SimWave, and (c) the hybrid dataset.</p>
        <p>In Figure 2, the saliency maps of the models trained on the real-world dataset and on the simulated
dataset differ considerably. The model trained on the hybrid dataset missed the detection in the image, so
no saliency map could be computed for it. From this visual analysis and similar comparisons, we derive the
following assumptions about the synthetic sonar images:
• The model trained on real sonar images focuses on shadows during human detection.
• The model trained with synthetic data alone is more sensitive to reflections than to shadows.
• The hybrid model trained with real and simulated data makes errors in human detection because
it has difficulty generalizing shadow and reflection information.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Improving Synthetic Data Generation</title>
        <p>Based on the performance of the different models, an adaptation of the synthetic data generation is
needed. The analysis of the saliency maps indicates overly strong reflections in the synthetic data. To
counteract this effect, we use dynamic clipping, which approximates the synthetic data
to the real-world data using the following formula:
<disp-formula id="eq-1"><label>(1)</label><tex-math>I' = \begin{cases} I - \alpha \cdot (I - T), &amp; \text{if } I &gt; T \\ I, &amp; \text{otherwise} \end{cases}</tex-math></disp-formula></p>
        <p>Here, I represents pixel intensity, T is the threshold value, and α is the scaling factor. In the synthetic
images, the intensity of reflections in sonar images ranges between 0 and 255. Our observations show
that the brightest regions reach up to 250, while weak reflections are around 75. To correct overly strong
reflections while preserving weak ones, we set T = 100. Additionally, to balance reflections from bright
and weak regions while maintaining the integrity of features inside the image, we chose α = 0.5.</p>
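        <p>A minimal sketch of dynamic clipping follows. It assumes that intensities above the threshold (here called T, set to 100) are reduced by the scaling factor (here called alpha, set to 0.5) times their excess over the threshold, and that all other intensities are left unchanged. The function name and the nested-list image representation are illustrative and not taken from the SimWave implementation.</p>
        <preformat>
```python
def dynamic_clip(image, threshold=100.0, alpha=0.5):
    """Compress intensities above `threshold`; leave the rest unchanged.

    Assumed rule: I' = I - alpha * (I - T) if I is above T, otherwise I' = I.
    `image` is a list of rows of 8-bit intensities (0 to 255).
    """
    clipped = []
    for row in image:
        new_row = []
        for value in row:
            if value > threshold:
                value = value - alpha * (value - threshold)
            new_row.append(int(round(value)))
        clipped.append(new_row)
    return clipped


# Toy 2x3 image with weak reflections (around 75) and bright ones (up to 250).
sample = [[0, 75, 100], [150, 200, 250]]
print(dynamic_clip(sample))  # [[0, 75, 100], [125, 150, 175]]
```
        </preformat>
        <p>With these settings, a weak reflection of 75 is preserved, while the brightest value of 250 is reduced to 175. The mapping is continuous at the threshold, so no artificial edges are introduced.</p>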
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Evaluating on improved data</title>
        <p>The dynamic clipping method is applied to the synthetically generated data, yielding two new datasets
(SimWave (clipping) and Aaltonen + SimWave (clipping)), and the training is repeated. Table 3 shows the evaluation
of the four new models alongside the previous results; the new models exceed the previous ones on all metrics.
YOLOv8 demonstrated higher precision by effectively filtering out false positives, while YOLOv11
improved recall by detecting more people. These results suggest that the clipping method reduces the
domain gap and enhances the model’s generalization ability for the hybrid dataset. The improvement
in detection is also supported by the saliency maps.</p>
        <p>Figure 3 shows the saliency maps on the same image as in Figure 2, but with the last two
maps produced by models trained on the improved data. The model trained on the clipped hybrid dataset is now able to
detect the object in the upper right corner, as the detection rate increased. It is also worth mentioning
the difference between the models trained on the synthetic dataset: the attention for the object
changes from focusing on both the reflection and the shadow in Figure 2 to aligning more closely with the saliency
map of the model trained on real-world data (Aaltonen) in Figure 3.</p>
        <p>To further emphasize this point, the difference in attention between models trained on real-world and synthetic
data is of most interest. The saliency maps of another example are shown in Figure 4. In this
example, the model trained on real-world data focuses on the legs, while the model trained on the
hybrid dataset focuses on another region. After approximating the synthetic data to the real-world data
by applying the proposed dynamic clipping process, the saliency map focuses even better on the object.
This improvement supports our assumption that the performance of the model was further improved
by the selected data adjustment procedure, derived from an XAI methodology.</p>
        <p>Figure 4: Saliency maps for the models trained on Aaltonen + SimWave and on Aaltonen + SimWave (clipping).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and future work</title>
      <p>This study investigates the integration of real and simulated data to improve human detection in
side-scan sonar images. Due to the limitations of real sonar datasets, we developed a simulation-based
approach called SimWave to generate synthetic side-scan images and evaluated its impact on real-world
data. Our study demonstrates how XAI methods can be used effectively to improve synthetically
generated sonar data. The experimental results show that an object detector can perform better when
trained on both real-world and synthetically generated data. With the help of an XAI method applied to individual objects,
an assumption could be made about the difference between synthetic and real images. This assumption
could be used to improve the data provided by the simulation and the domain generalization,
thereby enhancing detection. Further qualitative analyses showed that the model trained with clipped
data successfully identified people that the hybrid model without clipping and the model trained with
real data failed to detect. The testing conducted in this paper is based on a small dataset consisting of
125 side-scan sonar scans. For a solid investigation of our assumption, a larger dataset is required. In
the field of search and rescue, a larger dataset is currently not available, but the simulation could be
transferred to a similar problem with larger datasets at hand. In future work, the real-world dataset
will be enlarged in order to train a better model. In addition, the simulated sonar images can
be improved by using articulated 3D human models to create more realistic scenarios. With the
model-agnostic explanation method used, other object detection algorithms such as Detection Transformer (DETR)
or Region-based Convolutional Neural Networks (R-CNN) can be tested as well. All developments are
intended to contribute to a robust and generalizable object detection algorithm
based on side-scan sonar to assist human search and recovery operations.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was accomplished within the project KI-S, FKZ 03DPS1124A, funded by the German Federal Ministry of Education and Research.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The author has not employed any Generative AI tools.</p>
      <sec id="sec-7-1">
        <title>References</title>
        <p>[2] J. Rutledge, W. Yuan, J. Wu, S. Freed, A. Lewis, Z. Wood, T. Gambin, C. Clark, Intelligent shipwreck
search using autonomous underwater vehicles, in: 2018 IEEE International Conference on Robotics
and Automation (ICRA), IEEE, 2018, pp. 6175–6182.
[3] A. Ruffell, Lacustrine flow (divers, side scan sonar, hydrogeology, water penetrating radar) used to
understand the location of a drowned person, Journal of Hydrology 513 (2014) 164–168.
[4] Y. Z. Nga, Z. Rymansaib, A. Anthony Treloar, A. Hunter, Automated recognition of submerged
body-like objects in sonar images using convolutional neural networks, Remote Sensing 16 (2024)
4036.
[5] Y. Lu, J. Zhang, Q. Chen, C. Xu, M. Irfan, Z. Chen, Aquayolo: Enhancing yolov8 for accurate
underwater object detection for sonar images, Journal of Marine Science and Engineering 13
(2025) 73.
[6] F. Zhang, W. Zhang, C. Cheng, X. Hou, C. Cao, Detection of small objects in side-scan sonar
images using an enhanced yolov7-based approach, Journal of Marine Science and Engineering 11
(2023) 2155.
[7] S. R. Richter, V. Vineet, S. Roth, V. Koltun, Playing for data: Ground truth from computer games, in:
Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October
11-14, 2016, Proceedings, Part II 14, Springer, 2016, pp. 102–118.
[8] B. Kiefer, D. Ott, A. Zell, Leveraging synthetic data in object detection on unmanned aerial vehicles,
in: 2022 26th international conference on pattern recognition (ICPR), IEEE, 2022, pp. 3564–3571.
[9] N. Mital, S. Malzard, R. Walters, C. M. De Melo, R. Rao, V. Nockles, Improving object detection by
modifying synthetic data with explainable ai, arXiv preprint arXiv:2412.01477 (2024).
[10] S. Lee, Deep learning of submerged body images from 2d sonar sensor based on convolutional
neural network, 2017 IEEE Underwater Technology (UT) (2017) 1–3.
[11] X. Wen, J. Wang, C. Cheng, F. Zhang, G. Pan, Underwater side-scan sonar target detection: Yolov7
model combined with attention mechanism and scaling factor, Remote. Sens. 16 (2024) 2492.
[12] X. Yuan, J. Li, W. Wang, X. Zhou, N. Li, C. Yu, Improved yolov9 for underwater side scan sonar
target detection, The Computer Journal (2024).
[13] S. Hu, T. Liu, Underwater rescue target detection based on acoustic images, Sensors (Basel, Switzerland) 24 (2024).</p>
        <p>[14] B. Kiefer, D. Ott, A. Zell, Leveraging synthetic data in object detection on unmanned aerial vehicles,
2022 26th International Conference on Pattern Recognition (ICPR) (2021) 3564–3571.
[15] A. Agrawal, A. Sikdar, R. Makam, S. Sundaram, S. K. Besai, M. Gopi, Syn2real domain generalization
for underwater mine-like object detection using side-scan sonar, ArXiv abs/2410.12953 (2024).
[16] K. Basha, A. Nambiar, S3simulator: A benchmarking side scan sonar simulator dataset for
underwater image analysis, ArXiv abs/2408.12833 (2024).
[17] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, D. Batra, Grad-cam: Visual
explanations from deep networks via gradient-based localization, International Journal of Computer
Vision 128 (2016) 336 – 359.
[18] V. Petsiuk, R. Jain, V. Manjunatha, V. I. Morariu, A. Mehra, V. Ordonez, K. Saenko, Black-box
explanation of object detectors via saliency maps, 2021 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR) (2020) 11438–11447.
[19] S. Macenski, T. Foote, B. Gerkey, C. Lalancette, W. Woodall, Robot operating system 2: Design,
architecture, and uses in the wild, Science Robotics 7 (2022) eabm6074. URL: https://www.science.
org/doi/abs/10.1126/scirobotics.abm6074. doi:10.1126/scirobotics.abm6074.
[20] W.-S. Choi, D. R. Olson, D. Davis, M. Zhang, A. Racson, B. Bingham, M. McCarrin, C. Vogt,
J. Herman, Physics-based modelling and simulation of multibeam echosounder perception for
autonomous underwater manipulation, Frontiers in Robotics and AI 8 (2021) 279.
[21] T. Aaltonen, Consumer class side scanning sonar dataset for human detection, 2023 46th MIPRO ICT and Electronics Convention (MIPRO) (2023) 1161–1166.</p>
        <p>[22] G. Jocher, A. Chaurasia, J. Qiu, Ultralytics yolov8, 2023. URL: https://github.com/ultralytics/
ultralytics.
[23] G. Jocher, J. Qiu, Ultralytics yolo11, 2024. URL: https://github.com/ultralytics/ultralytics.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Nordby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jon</surname>
          </string-name>
          ,
          <article-title>Underwater forensic investigation</article-title>
          , CRC Press,
          <year>2013</year>
          . doi:10.1201/b14765.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>