<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Usage of Intermediate Fusion of Multimodal Data for Dangerous Objects Enhanced Detection⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hlib Shchur</string-name>
          <email>hlib.o.shchur@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iryna Dumyn</string-name>
          <email>iryna.b.shvorob@lpnu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lviv Polytechnic National University</institution>
          ,
          <addr-line>Bandery 12, 79000, Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In real-world applications like autonomous driving, maritime navigation, and industrial monitoring, reliably detecting dangerous objects is critical. Traditional object detection systems that rely on just one type of sensor often struggle when conditions are challenging, whether due to adverse weather, low light, or partial occlusion of objects. This study first reviews recent publications that explore multimodal sensor fusion techniques, combining information from cameras, LiDAR, thermal, terahertz, and tactile sensors to create detection systems that are both more accurate and more robust. Building on these insights, the paper proposes a unified framework that merges visual and sensory data using an intermediate-level fusion strategy enhanced by attention mechanisms. The proposed approach extracts detailed features from each sensor and fuses them into a single, cohesive representation. It also introduces an object criticality score, considering factors like distance, relative velocity, and orientation, to prioritize high-risk objects. A hypothetical example shows how the system might work in practice.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Sensor Fusion</kwd>
        <kwd>Object Detection</kwd>
        <kwd>Autonomous Systems</kwd>
        <kwd>LiDAR</kwd>
        <kwd>Attention Mechanisms</kwd>
        <kwd>Feature Extraction</kwd>
        <kwd>Risk-Based Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Ensuring the reliable detection of dangerous objects is absolutely critical in many real-world
settings — from autonomous vehicles maneuvering through busy urban streets and maritime
vessels navigating treacherous waters to industrial facilities and smart waste management systems
in crowded cities. Traditional object detection systems[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which typically rely on just one sensor
type (like standard RGB cameras), often struggle under challenging conditions such as low-light
environments, occlusions, or adverse weather. These shortcomings have sparked a growing
interest in multimodal sensor fusion[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], where data from diverse sensors — such as cameras,
LiDAR, thermal/infrared sensors, tactile sensors, and even terahertz imaging — are combined to
deliver a more robust and reliable detection performance.
      </p>
      <p>
        The process of demining areas contaminated with explosive devices remains one of the most
pressing global challenges[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. According to international organizations, a significant portion of
land in conflict zones and post-war regions remains affected by landmines, posing a serious threat
to civilian populations and hindering economic development. The application of robotic systems in
this field represents a promising direction, as it significantly reduces risks for deminers, enhances
demining efficiency, and shortens the duration of operations. Autonomous and remotely operated
demining robots can function in challenging conditions, including high-threat areas and
difficult terrain.
      </p>
      <p>Despite these advantages, traditional control of demining robots in real-world environments
faces several challenges. Key issues include limited situational awareness of the operator due to
delays in video signal and data transmission, difficulties in navigating uneven terrain, and potential
errors in decision-making algorithms that may lead to mission failures. Additionally, real-world
testing of demining robots requires substantial financial resources and specially designed testing
grounds, limiting their evaluation across a wide range of scenarios.</p>
      <p>
        A promising approach to addressing these challenges is the use of virtual modeling for testing
and training operators of robotic systems. Modern simulation platforms enable the creation of
realistic environments where demining scenarios can be practiced, various threats can be modeled,
and autonomous control algorithms can be adapted[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This approach significantly reduces testing
costs, improves operator training, and facilitates more effective learning in safe conditions.
Furthermore, the integration of machine learning methods in virtual environments enhances the
adaptability of robotic systems to dynamic changes in real-world conditions. Multimodal data plays
a key role in complex analysis and decision-making systems, as it combines information from
different sensors (e.g., images, lidar data, audio, temperature readings), which increases the
accuracy and reliability of processing. Relational and non-relational databases, graph structures,
and specialized platforms for streaming large amounts of data are used to efficiently store such
data[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Processing multimodal data includes deep learning, graph algorithms for establishing
relationships between different modalities, and various data fusion methods.
      </p>
      <p>This paper proposes a unified framework that fuses visual and sensory modalities to enhance
the detection of dangerous objects. By leveraging intermediate-level fusion with attention
mechanisms, the proposed approach combines feature representations from multiple sensors and
integrates an object criticality model to prioritize detections based on safety relevance. The
framework is designed to be robust across a variety of environments and applicable to multiple
domains, ultimately addressing the limitations of single-modality systems and advancing the state
of safety-critical detection systems.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature Review</title>
      <p>A careful examination of recent literature reveals a rich variety of approaches and challenges in the
field of multimodal sensor fusion for object detection. The review of the latest scientific
investigations in the area of object detection is provided in this section.</p>
      <p>
        Thompson [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] delves into the realm of maritime object detection by combining LiDAR and
vision data. In his study, high-fidelity GPS/INS information is fused with 3D LiDAR point clouds
and camera images to track and classify objects on autonomous surface vehicles. The result is a
detection system that achieves an impressive 98.7% accuracy across six object classes. However, the
study also highlights important challenges — sensor alignment, accurate coordinate
transformation, and the creation of reliable occupancy grids — which are crucial for extracting
objects in the ever-changing maritime environment.
      </p>
      <p>
        Vadidar et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] focus on overcoming the limitations of conventional RGB cameras by fusing
visual and thermal (infrared) data for autonomous driving. Their unified learning pipeline, centered
around an innovative RGB-thermal (RGBT) fusion network, leverages an entropy-block attention
module (EBAM) to refine the feature fusion process. This attention-based approach results in a
notable 10% improvement in mean Average Precision (mAP) over existing methods, making it a
powerful solution for reliable object detection under low-light or adverse weather conditions.
      </p>
      <p>
        Bhown [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] tackles the critical issue of long-range detection for autonomous trucks, which must
identify vulnerable road users (VRUs) in time to avoid collisions. Large vehicles require extended
detection ranges because of their slower maneuverability compared to smaller cars. By fusing data
from LiDAR and monocular cameras, Bhown’s method compensates for the inherent sparsity of
LiDAR point clouds at long distances. This fusion strategy is essential for ensuring that large
autonomous vehicles can detect objects in urban and suburban environments where space is
limited and reaction time is critical.
      </p>
      <p>
        In the context of smart city applications, Alsubaei et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] address the challenge of detecting
and classifying small objects for effective garbage waste management. Their work leverages an
enhanced version of the RefineDet deep learning model, with hyperparameters optimally tuned
using an arithmetic optimization algorithm (AOA). Although the focus is on waste segregation, the
techniques developed have broader implications for detecting small, dangerous objects in complex
environments, demonstrating the versatility of their approach.
      </p>
      <p>
        Ceccarelli and Montecchi [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provide a critical analysis of traditional object detection metrics,
arguing that conventional measures like Average Precision do not adequately account for safety
and reliability. They introduce an object criticality model that factors in an object’s distance,
relative velocity, and trajectory—elements that determine the potential risk posed by the object.
This approach shifts the focus from merely detecting objects to prioritizing those that could
significantly impact safety, a concept that is particularly relevant for autonomous driving systems.
      </p>
      <p>
        Tabrik et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] explore the intriguing overlap between visual and tactile perception. Their
experiments with virtual 3D objects, or “digital embryos,” reveal that both the visual and tactile
systems share common shape features when it comes to object recognition. This finding suggests
that the cognitive processes underlying these two sensory modalities are remarkably similar, which
in turn supports the idea of integrating tactile data with visual data in robotic systems to enhance
overall recognition performance.
      </p>
      <p>
        Building on the interplay between vision and touch, Rouhafzay and Cretu [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] propose a
framework in which visual attention guides tactile data acquisition. In their system, visually
selected object contours determine where tactile data should be sequentially collected. By
combining both cutaneous (surface) and kinesthetic (movement-based) cues through a deep
learning approach employing CNNs, their framework achieves a very high recognition accuracy of
98.97%. This adaptive strategy mirrors how humans explore objects, and it demonstrates the
benefits of a synergistic visuo-tactile approach.
      </p>
      <p>
        Ahmad and Del Bue [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] present mmFUSION, an intermediate-level fusion framework that
specifically addresses the challenges of integrating features from heterogeneous sensors like
cameras and LiDAR. Their approach uses separate encoders to process each modality, and then
employs cross-modality and multi-modality attention mechanisms to fuse these features effectively.
The method not only preserves the detailed semantic and spatial information from each sensor but
also achieves superior performance on standard benchmarks like KITTI and NuScenes.
      </p>
      <p>
        Önal and Dandıl [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] take a slightly different approach by focusing on the detection of unsafe
behaviors in workplace environments. Their system, Unsafe-Net, combines the spatial detection
power of YOLO v4 with the temporal analysis capabilities of ConvLSTM networks. After
processing 39 days of factory video footage, their hybrid approach achieves a classification
accuracy of 95.81% and an action recognition latency of just 0.14 seconds. Although their primary
application is workplace safety, the underlying techniques are highly relevant to object detection in
other safety-critical domains.
      </p>
      <p>
        Finally, Danso et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] explore the use of terahertz imaging for detecting concealed dangerous
objects, a method particularly useful for security screening applications. Terahertz images, despite
being safe and non-ionizing, are often plagued by low resolution and noise. To address these issues,
the authors enhance the YOLOv5 model with a BiFPN module and employ transfer learning to
finetune the network. Their incremental improvements in mAP metrics highlight the potential of
combining non-traditional imaging modalities with deep learning for detecting hidden hazards.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed Approach</title>
      <p>Below, the proposed unified fusion framework that integrates visual and sensory modalities to
improve threat detection is described. The proposed approach is designed around a mid-level
fusion strategy that uses attention mechanisms to dynamically weight and combine features from
various sensors. This section details the overall architecture, key processing steps, and rationale for
the proposed approach.</p>
      <p>The framework is structured in multiple stages that transform raw sensor data into a
consolidated detection decision.</p>
      <p>The sensor suite provides raw data from multiple modalities. Preprocessing extracts features
which are then fused in the Intermediate Fusion Module using attention mechanisms. The fused
features are decoded into object detections that are further evaluated for safety-criticality before
triggering the final decision and alerting systems.</p>
      <p>The detailed list of the main modules of the proposed framework is provided below.</p>
      <p>
        Module 1 — Sensor Suite and Preprocessing. The framework starts with a Sensor Suite that
includes:
1. Camera – This sensor captures RGB images, which are important for obtaining semantic
details and texture information [
        <xref ref-type="bibr" rid="ref13 ref7">7, 13</xref>
        ].
2. LiDAR – This sensor provides accurate depth and spatial data that is later transformed
into 3D occupancy grids [
        <xref ref-type="bibr" rid="ref6 ref8">6, 8</xref>
        ].
3. Thermal/IR Sensor – This sensor records temperature gradients, which helps in detecting
objects in low-light conditions or during adverse weather [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
4. Tactile Sensors (Optional) – These sensors collect cutaneous and kinesthetic data, which
are useful for analyzing the shape and texture of objects [
        <xref ref-type="bibr" rid="ref11 ref12">12, 11</xref>
        ].
      </p>
      <p>Each sensor’s raw data is processed through specific preprocessing steps:
1. Visual Data: The data is enhanced and normalized using CNN-based methods to reduce
noise and improve contrast.
2. LiDAR Data: The data is converted into structured formats, such as voxel grids or
occupancy maps, to aid in feature extraction.
3. Thermal Data: The data is synchronized with camera frames to ensure spatial alignment
between the different modalities.
4. Tactile Data: The data is transformed into feature maps that capture cues related to
surface pressure and texture.</p>
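      <p>As a rough illustration of these preprocessing steps, the NumPy sketch below normalizes an RGB frame, voxelizes a LiDAR point cloud into a coarse occupancy grid, and resamples a thermal frame to the camera resolution. It is only a simplified sketch: the array shapes, grid size, and metric extent are arbitrary assumptions made for the example and are not prescribed by the framework.</p>
      <preformat>
import numpy as np

def preprocess_rgb(img):
    """Normalize an RGB frame (H x W x 3, uint8) to zero-mean, unit-variance floats."""
    img = img.astype(np.float32) / 255.0
    return (img - img.mean()) / (img.std() + 1e-6)

def voxelize_lidar(points, grid_shape=(32, 32, 8), extent=40.0):
    """Convert an (N, 3) point cloud into a binary occupancy grid around the sensor."""
    grid = np.zeros(grid_shape, dtype=np.float32)
    # map metric coordinates in [-extent, extent] to voxel indices
    idx = ((points + extent) / (2 * extent) * np.array(grid_shape)).astype(int)
    idx = np.clip(idx, 0, np.array(grid_shape) - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

def align_thermal(thermal, cam_shape):
    """Resample a thermal frame to the camera resolution by nearest-neighbour indexing."""
    rows = np.linspace(0, thermal.shape[0] - 1, cam_shape[0]).astype(int)
    cols = np.linspace(0, thermal.shape[1] - 1, cam_shape[1]).astype(int)
    return thermal[np.ix_(rows, cols)]

# Example usage with synthetic data
rgb = preprocess_rgb(np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8))
occupancy = voxelize_lidar(np.random.uniform(-40, 40, (5000, 3)))
thermal = align_thermal(np.random.rand(120, 160), (480, 640))
      </preformat>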
      <p>Module 2 — Intermediate Fusion Module. At the core of the proposed framework lies the
Intermediate Fusion Module (IFM), which is responsible for combining features from different
modalities into a common representation. Unlike early fusion, which concatenates raw data,
and late fusion, which aggregates final decisions, intermediate fusion leverages high-level
features while preserving spatial and semantic integrity.</p>
      <p>The IFM consists of two main steps, listed below.</p>
      <p>Step 1. Separate Encoding. Each modality’s features are encoded into a lower-dimensional
space while maintaining key geometric and semantic details. This is accomplished using
modality-specific encoders.</p>
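      <p>A minimal sketch of what such modality-specific encoders could look like is given below, using PyTorch purely for illustration. The layer sizes and the shared embedding dimension are assumptions made for the example, not values prescribed by the proposed framework.</p>
      <preformat>
import torch
import torch.nn as nn

EMBED_DIM = 256  # assumed shared embedding size for all modalities

class ImageEncoder(nn.Module):
    """Encodes an RGB (or camera-aligned thermal) frame into a compact feature vector."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, EMBED_DIM)

    def forward(self, x):                 # x: (B, 3, H, W)
        h = self.backbone(x).flatten(1)   # (B, 32)
        return self.proj(h)               # (B, EMBED_DIM)

class LidarEncoder(nn.Module):
    """Encodes a voxel occupancy grid into the same embedding space."""
    def __init__(self, grid_shape=(32, 32, 8)):
        super().__init__()
        in_dim = grid_shape[0] * grid_shape[1] * grid_shape[2]
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, 512),
                                 nn.ReLU(), nn.Linear(512, EMBED_DIM))

    def forward(self, grid):              # grid: (B, X, Y, Z)
        return self.net(grid)

# Each encoder projects its modality into the same EMBED_DIM-dimensional space,
# so the fusion module can later weight and combine the features.
image_feature = ImageEncoder()(torch.randn(2, 3, 128, 128))   # (2, 256)
lidar_feature = LidarEncoder()(torch.randn(2, 32, 32, 8))     # (2, 256)
      </preformat>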
      <p>
        S_final = β · S_detection + (1 − β) · C,   (6)
where β (0 ≤ β ≤ 1) is a weighting factor balancing detection confidence and criticality [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. If
S_final exceeds a predefined threshold, the system issues a detection decision and triggers
corresponding safety protocols.
      </p>
      <p>The proposed framework integrates multiple sensor modalities at an intermediate level to
exploit the strengths of each sensor while mitigating their individual weaknesses. The process
begins with dedicated preprocessing and feature extraction from raw sensor data, followed by an
attention-based fusion that produces a robust, unified feature representation. A joint decoder then
translates these features into object detections, which are further evaluated for safety-criticality.
Finally, a decision module synthesizes this information to yield a final detection outcome and, if
necessary, initiate safety alerts.</p>
      <p>By adopting this framework, systems can achieve enhanced detection accuracy and robustness in
various complex and dynamic environments, thus making them more suitable for applications
where safety is of utmost importance.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Hypothetical Example</title>
      <p>In this section, the example of the operation of the proposed intermediate fusion framework
through a detailed, hypothetical scenario is illustrated. The example demonstrates how multiple
sensor inputs are processed, fused, and evaluated to make a safety-critical detection decision.</p>
      <sec id="sec-4-1">
        <title>4.1. Scenario Description</title>
        <p>Imagine an autonomous truck operating in an urban environment approaching an intersection. The
system is tasked with detecting a pedestrian who is potentially crossing the road in an unsafe
manner. The detection process involves three sensor modalities:
• Camera (RGB) captures visual information, including color and texture, to identify objects.
• LiDAR provides depth information by generating point clouds, crucial for estimating object
distance.
• Thermal/IR Sensor captures temperature differences, which can highlight living beings
even under low-light conditions.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Sensor Inputs and Preprocessing</title>
        <p>For this example, assume the following sensor observations:
• Camera detects a candidate pedestrian with a raw confidence score of 0.90. After image
enhancement and feature extraction (using a CNN encoder), the extracted feature is
denoted as f_cam.</p>
        <p>• LiDAR returns a sparse point cloud corresponding to an object with a raw confidence score
of 0.80. The LiDAR data is converted into an occupancy grid and then processed by its
dedicated encoder to produce feature f_LiDAR.</p>
        <p>• Thermal/IR Sensor detects a warm signature in the same region with a confidence score of
0.85. Thermal features are extracted after alignment with the camera frame, resulting in
feature f_thermal.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Intermediate Fusion Process</title>
        <p>Each sensor’s feature is weighted according to its relevance, as determined by the attention
mechanism. Assume that under the current environmental conditions (e.g., dusk with low ambient
light), the system assigns the following attention weights:
1. Camera: a_cam = 0.6
2. LiDAR: a_LiDAR = 0.3
3. Thermal: a_thermal = 0.1
The fused feature F is computed as below:</p>
        <p>F = a_cam · f_cam + a_LiDAR · f_LiDAR + a_thermal · f_thermal.   (7)</p>
        <p>Simultaneously, the raw detection confidences from each modality are fused (as a simplified
weighted sum) to produce an overall detection confidence:</p>
        <p>S_detection = (0.6 · 0.90) + (0.3 · 0.80) + (0.1 · 0.85) = 0.54 + 0.24 + 0.085 = 0.865.   (8)</p>
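        <p>The same computation can be written compactly as below. The numeric values repeat the assumed scenario, and the feature vectors are random stand-ins for the encoded features, used only to make the sketch runnable.</p>
        <preformat>
import numpy as np

# attention weights assumed for the dusk scenario
weights = {"camera": 0.6, "lidar": 0.3, "thermal": 0.1}
# raw per-modality detection confidences from Section 4.2
confidences = {"camera": 0.90, "lidar": 0.80, "thermal": 0.85}
# stand-in encoded features f_cam, f_LiDAR, f_thermal (random, for illustration only)
features = {m: np.random.rand(256) for m in weights}

# Equation (7): attention-weighted fused feature
F = sum(weights[m] * features[m] for m in weights)

# Equation (8): fused detection confidence
s_detection = sum(weights[m] * confidences[m] for m in weights)
print(round(s_detection, 3))  # 0.865
        </preformat>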
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Object Criticality Evaluation</title>
        <p>Given the safety-critical context, the system computes an object criticality score C to prioritize
objects based on potential risk.</p>
        <p>For the current example, suppose the following values are measured or estimated:
1. Distance, d : 10 meters from the truck.
2. Relative Velocity, v : The pedestrian is moving toward the truck at 2 m/s.
3. Orientation Factor, θ : The pedestrian’s path is directly toward the truck (θ = 1 ).
4. Decay Constant, α : 0.1, chosen to modulate the impact of distance.</p>
        <p>The criticality score is then calculated as:</p>
        <p>C = e^(−α·d) · v · θ = e^(−0.1·10) · 2 · 1 = e^(−1) · 2 ≈ 0.3679 · 2 = 0.7358.   (9)</p>
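        <p>Expressed as code, the criticality score of equation (9) is a one-line function; the parameter names below are chosen for readability and the values repeat the assumptions of this example.</p>
        <preformat>
import math

def criticality(distance, rel_velocity, orientation, alpha=0.1):
    """Object criticality C = exp(-alpha * d) * v * theta, as in equation (9)."""
    return math.exp(-alpha * distance) * rel_velocity * orientation

C = criticality(distance=10.0, rel_velocity=2.0, orientation=1.0)
print(round(C, 4))  # approximately 0.7358
        </preformat>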
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Final Detection Decision</title>
        <p>The final detection score Sfinal is derived by combining the fused detection confidence and the
criticality score. Using a weighting factor β = 0.7 (to prioritize raw detection confidence while still
considering safety-critical information):</p>
        <p>S_final = β · S_detection + (1 − β) · C.   (10)
Substituting the values:</p>
        <p>S_final = 0.7 · 0.865 + 0.3 · 0.7358 ≈ 0.6055 + 0.2207 = 0.8262.   (11)</p>
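        <p>Combining the fused confidence with the criticality score and applying a threshold then reduces to a few lines; β = 0.7 and the 0.80 threshold used below follow the assumptions of this example.</p>
        <preformat>
def final_score(s_detection, criticality, beta=0.7):
    """Equation (10): weighted combination of detection confidence and criticality."""
    return beta * s_detection + (1 - beta) * criticality

s_final = final_score(0.865, 0.7358)    # approximately 0.8262
is_dangerous = s_final >= 0.80          # detection threshold assumed at 0.80
print(round(s_final, 4), is_dangerous)  # 0.8262 True
        </preformat>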
        <p>Assume the system’s detection threshold is set at 0.80. Since S_final = 0.8262 exceeds this
threshold, the system classifies the object as dangerous. Consequently, safety protocols are
activated, such as issuing an audible and visual alert and initiating braking or evasive maneuvers.
Below is an illustration of the proposed structure that integrates multiple sensor modalities at an
intermediate level to leverage the strengths of each sensor while reducing their individual
weaknesses. The process begins with specialized preprocessing and feature extraction from raw
sensor data, followed by an attention-based fusion that produces a robust unified feature
representation. Next, a joint decoder transforms these features into object detections, which are
then evaluated for safety-criticality. Finally, a decision-making module synthesizes this information
to generate the final detection result and, if necessary, trigger safety alerts.</p>
        <p>This flow diagram outlines the sequential process from sensor input through to the final
decision, demonstrating how the intermediate fusion and safety-critical evaluation work together.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Summary</title>
        <p>This hypothetical example illustrates the comprehensive process of the proposed fusion
framework. Initially, raw sensor data from various sources is preprocessed and encoded to prepare
it for further analysis. Following this, an attention-based fusion mechanism is applied to
dynamically weight and combine features, resulting in a unified representation. The framework
then performs a safety-critical assessment by computing object criticality based on factors such as
distance, relative velocity, and orientation. Finally, the detection confidence is combined with the
criticality score to determine whether the object is hazardous.</p>
        <p>The example demonstrates that, even when sensor confidence and conditions vary, the
proposed framework can robustly integrate multimodal data to enhance the detection of
safety-critical objects. This approach is particularly beneficial in real-world scenarios, where the timely
identification of dangerous objects is essential.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Advantages</title>
        <p>The proposed fusion framework and accompanying mathematical formulations address several
long-standing challenges in detecting dangerous objects across varied, real-world scenarios. In this
section, the benefits, limitations, and future prospects of the proposed approach are discussed.</p>
        <p>The framework improves robustness and accuracy by combining several sensor modalities such as
RGB cameras, LiDAR, thermal sensors, and tactile data. This design takes advantage of the
strengths of each sensor. For instance, RGB cameras provide detailed semantic information, while
LiDAR offers precise depth measurements. Thermal sensors work effectively in low-light
conditions, and tactile data adds useful insights into object shape and texture. This blend of sensor
inputs enhances overall detection accuracy and reliability, especially in challenging situations
where systems using a single modality might fail.</p>
        <p>The system also uses an attention-based fusion mechanism that adjusts the weight of each
sensor based on the current environment. For example, in poor lighting or bad weather, the system
can give more importance to thermal or LiDAR data than to RGB images. The softmax-based
attention mechanism helps to ensure that the most reliable sensor inputs have the greatest
influence on the final feature representation.</p>
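        <p>As a simple illustration of this idea (not the exact mechanism of the framework), the sketch below turns per-modality reliability scores into attention weights with a softmax, so that lowering the camera score in poor light automatically shifts weight toward the LiDAR and thermal features.</p>
        <preformat>
import numpy as np

def attention_weights(reliability):
    """Softmax over per-modality reliability scores (higher score, larger weight)."""
    logits = np.array(list(reliability.values()), dtype=np.float64)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return dict(zip(reliability.keys(), exp / exp.sum()))

# daylight: camera judged most reliable
print(attention_weights({"camera": 2.0, "lidar": 1.0, "thermal": 0.5}))
# dusk or fog: camera reliability reduced, weight shifts to LiDAR and thermal
print(attention_weights({"camera": 0.5, "lidar": 1.5, "thermal": 1.5}))
        </preformat>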
        <p>Additionally, the framework includes an object criticality model to increase safety by
prioritizing detections that present higher risks. By combining detection confidence with factors
such as distance, speed, and orientation, the system focuses quickly on objects that may be on a
collision path. This approach is vital in areas like autonomous driving and maritime navigation,
where detection errors can have serious consequences.</p>
        <p>Finally, the framework uses an intermediate fusion strategy that avoids the problems of both
early fusion, which can lead to misaligned raw data, and late fusion, which may depend on
inaccurate initial proposals. By merging high-level features from each sensor, the approach
maintains important semantic and spatial details, leading to better detection performance.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Limitations</title>
        <p>Ensuring precise calibration between sensors is one of the most challenging aspects of multimodal
fusion. Differences in field of view, resolution, and sampling rates may lead to misalignment that
reduces the quality of integrated features. Although the framework applies preprocessing and
synchronization steps, further research is needed to develop more robust and adaptive calibration
methods.</p>
        <p>Another issue is computational complexity. The use of attention mechanisms and intermediate
fusion increases the computational overhead, which can be a significant challenge for real-time
applications such as autonomous vehicles and industrial monitoring systems. Optimizing the
network architecture and utilizing hardware accelerators like GPUs or TPUs may help mitigate
these costs.</p>
        <p>Furthermore, many of the current datasets used for object detection in safety-critical domains
are limited in diversity and lack comprehensive multimodal labeling. This scarcity of fully
annotated multimodal datasets makes it difficult to completely train and evaluate advanced fusion
models. Expanding these datasets to cover a wider range of dangerous objects and adverse
conditions is essential for future progress.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Future Work</title>
        <p>Future research should concentrate on developing dynamic calibration techniques that
automatically align sensor data in real time. These techniques may use adaptive algorithms that
adjust to changes in sensor positioning and environmental conditions. Addressing the
computational overhead is also critical for real-time applications; exploring model compression,
efficient network architectures, and specialized hardware solutions can help bridge the gap
between theoretical research and practical deployment. Moreover, there is an urgent need for
comprehensive datasets containing synchronized and labeled data from multiple sensor modalities
across various scenarios. Such datasets would enable more thorough training, benchmarking, and
refinement of multimodal fusion models, ultimately improving their applicability in real-world
settings. Finally, although the current framework focuses on detection, future systems might
integrate these outputs with higher-level decision-making and control processes. For example, in
autonomous vehicles, detection results could be directly linked to trajectory planning algorithms
that make immediate adjustments to prevent collisions.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Summary</title>
        <p>The discussion underscores that the proposed fusion framework effectively addresses key
challenges in dangerous object detection by leveraging multimodal sensor data and advanced
fusion strategies. The dynamic weighting through attention mechanisms and the inclusion of a
safety-critical evaluation component significantly enhance detection robustness and reliability.
Nonetheless, challenges remain in sensor calibration, computational efficiency, and dataset
availability. Addressing these limitations through future research will be essential to fully realize
the potential of multimodal sensor fusion in safety-critical applications.</p>
        <p>References from earlier sections consistently emphasize the importance of robust data fusion,
dynamic sensor weighting, and safety-critical performance evaluation. The current work builds
upon these foundational ideas, providing a comprehensive, adaptable, and practical solution for
enhanced detection in complex environments.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, a comprehensive framework for the enhanced detection of dangerous objects through
the fusion of visual and sensory modalities was presented. By synthesizing insights from the latest
studies, the proposed approach addresses key challenges encountered in safety-critical applications
such as autonomous driving, maritime navigation, and industrial monitoring.</p>
      <p>The described unified framework employs an intermediate-level fusion strategy that leverages
dedicated encoders for each sensor modality — such as cameras, LiDAR, thermal sensors, and
tactile sensors — to extract high-level features. These features are dynamically weighted using
attention mechanisms and fused into a unified representation, which is then decoded to produce
robust 3D object detections. A critical component of the proposed approach is the object criticality
model, which quantifies the risk posed by detected objects based on their distance, relative velocity,
and orientation. This enables the system to prioritize high-risk objects, thus enhancing safety in
environments where timely detection is essential.</p>
      <p>The hypothetical example in Section 4 further illustrates how the proposed framework
effectively integrates multimodal sensor data to produce reliable detection decisions in a real-world
scenario. While the proposed framework shows significant promise, challenges remain. Accurate
sensor calibration, computational efficiency for real-time processing, and the need for expanded,
well-annotated multimodal datasets are areas that warrant further investigation. Future work
should focus on developing dynamic calibration methods, optimizing the fusion architecture, and
integrating the detection module with higher-level decision-making systems to enable seamless
real-time responses.</p>
      <p>In summary, the integration of visual and sensory modalities through intermediate fusion and
attention mechanisms represents a powerful solution for detecting dangerous objects in complex,
dynamic environments. The current approach lays a foundation for future research and practical
implementations in safety-critical domains, ultimately contributing to the development of
next-generation autonomous systems with enhanced robustness and reliability.</p>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly to check grammar
and spelling and to paraphrase and reword text. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Arya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tripathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Diwakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sharma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <article-title>Object detection using deep learning: a review</article-title>
          ,
          <source>Journal of Physics: Conference Series</source>
          ,
          <year>1854</year>
          (1), p.
          <fpage>012012</fpage>
          ,
          <year>2021</year>
          . doi: 10.1088/1742-6596/1854/1/012012.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <article-title>Deep Multimodal Data Fusion</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          ,
          <volume>56</volume>
          (
          <issue>9</issue>
          ), Article 216, pp.
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          ,
          <year>2024</year>
          . doi: 10.1145/3649447.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Fedorenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fesenko</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Kharchenko</surname>
          </string-name>
          ,
          <article-title>Analysis of methods and development of a concept for guaranteed detection and recognition of explosive objects</article-title>
          ,
          <source>Innovative Technologies and Scientific Solutions for Industries</source>
          ,
          <volume>4</volume>
          (
          <issue>22</issue>
          ), pp.
          <fpage>20</fpage>
          -
          <lpage>31</lpage>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Shchur</surname>
          </string-name>
          ,
          <article-title>Intelligent system for demining robot control in a virtual environment</article-title>
          , Herald of Khmelnytskyi National University.
          <source>Technical Sciences</source>
          ,
          <volume>335</volume>
          (
          <issue>3</issue>
          (
          <issue>1</issue>
          )), pp.
          <fpage>326</fpage>
          -
          <lpage>329</lpage>
          ,
          <year>2024</year>
          . doi: 10.31891/2307-5732-2024-335-3-43.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Dumyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Basystiuk</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Dumyn</surname>
          </string-name>
          ,
          <article-title>Graph-based approaches for multimodal medical data processing</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>Maritime Object Detection, Tracking, and Classification Using LiDAR and Vision-Based Sensor Fusion</article-title>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Vadidar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kariminezhad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mayr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kloeker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Eckstein</surname>
          </string-name>
          ,
          <article-title>Robust Environment Perception for Automated Driving: A Unified Learning Pipeline for Visual-Infrared Object Detection</article-title>
          .
          <source>In 2022 IEEE Intelligent Vehicles Symposium (IV)</source>
          (pp.
          <fpage>367</fpage>
          -
          <lpage>374</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhown</surname>
          </string-name>
          ,
          <article-title>Improving Long-Range 3D Object Detection Methods for Autonomous Box Trucks Using Sensor Fusion</article-title>
          .
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>F. S.</given-names>
            <surname>Alsubaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. N.</given-names>
            <surname>Al-Wesabi</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Hilal</surname>
          </string-name>
          ,
          <article-title>Deep Learning-Based Small Object Detection and Classification Model for Garbage Waste Management in Smart Cities and IoT Environment</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ceccarelli</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Montecchi</surname>
          </string-name>
          ,
          <article-title>Evaluating Object (mis)Detection from a Safety and Reliability Perspective: Discussion and Measures</article-title>
          .
          <source>IEEE Access</source>
          ,
          <volume>11</volume>
          ,
          <fpage>44952</fpage>
          -
          <lpage>44963</lpage>
          .
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tabrik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Behroozi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schlaffke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lissek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Güntürkün</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Dinse</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Tegenthoff</surname>
          </string-name>
          ,
          <article-title>Visual and Tactile Sensory Systems Share Common Features in Object Recognition</article-title>
          .
          <source>Eneuro</source>
          ,
          <volume>8</volume>
          (
          <issue>5</issue>
          ).
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Rouhafzay</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.-M.</given-names>
            <surname>Cretu</surname>
          </string-name>
          ,
          <article-title>An Application of Deep Learning to Tactile Data for Object Recognition under Visual Guidance</article-title>
          .
          <source>Sensors</source>
          ,
          <volume>19</volume>
          (
          <issue>7</issue>
          ),
          <fpage>1534</fpage>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Bue</surname>
          </string-name>
          ,
          <article-title>mmFUSION: Multimodal Fusion for 3D Objects Detection</article-title>
          ,
          <source>arXiv preprint arXiv:2311.04058</source>
          .
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>O.</given-names>
            <surname>Önal</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Dandıl</surname>
          </string-name>
          ,
          <article-title>Unsafe-Net: YOLO v4 and ConvLSTM Based Computer Vision System for Real-Time Detection of Unsafe Behaviours in Workplace</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>27</lpage>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Danso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Odoom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Nyarko</surname>
          </string-name>
          ,
          <article-title>Hidden Dangerous Object Recognition in Terahertz Images Using Deep Learning Methods</article-title>
          .
          <source>Applied Sciences</source>
          ,
          <volume>12</volume>
          (
          <issue>15</issue>
          ),
          <volume>7354</volume>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>