<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Healthcare Social Robotics: A Preliminary Study Using Multimodal Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Pallonetto</string-name>
          <email>luca.pallonetto@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi D'Arco</string-name>
          <email>luigi.darco@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Silvia Rossi</string-name>
          <email>silvia.rossi@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electrical Engineering and Information Technologies, University of Naples Federico II</institution>
          ,
          <addr-line>Via Claudio 21, 80125 Naples, Italy</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Socially assistive robots in healthcare must interpret complex and ambiguous environments to behave safely and appropriately. This pilot study investigates the use of Multimodal Large Language Models (MLLMs) for context-aware scene understanding by combining visual and auditory inputs. We propose a modular pipeline integrating Moondream, a CLIP-based vision-language model, and CoNeTTE, an audio captioning model, to interpret static images and ambient sounds. The system was evaluated on two datasets: the Audiovisual Aerial Scene dataset and a custom synthetic hospital dataset with images from HIOD and audio generated via Stable Audio 1.0. On the aerial dataset, multimodal input improved classification accuracy from 69.04% to 81.09% and F1-score from 65.15% to 80.22%, showing the benefit of audio in disambiguating visually similar scenes. In contrast, limited gains were observed on the hospital dataset due to weak image-audio alignment, highlighting challenges in synthetic healthcare data. The findings highlight the significant impact that MLLM-based perception can have on healthcare robotics, yet they also reveal present challenges with data quality, domain adaptation, and cross-modal grounding in practical applications. Future work will integrate the proposed perception layer into an actual robotic platform to evaluate its real-time context awareness and adaptive responses in dynamic settings. Ultimately, combining the multimodal perception layer with advanced planning, dialogue, and emotion recognition capabilities will be essential for developing socially intelligent robots capable of assisting both patients and healthcare professionals in a contextually aware manner.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The adoption of social robots in healthcare settings is steadily increasing, with applications ranging from
patient monitoring to therapeutic support and logistical assistance [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, successful integration
into these environments demands a deep understanding of the surrounding context. Unlike industrial
robots operating in structured and predictable environments, social robots deployed in hospitals, elderly
care facilities, and rehabilitation centers must function in dynamic, unstructured spaces filled with
social cues, physical obstacles, and safety-critical scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These settings are often ambiguous and
continuously changing, requiring robots to interpret not only spatial information but also the evolving
human activities, emotional tone, and environmental constraints [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For instance, in a hospital setting,
an empathetic demeanor may be appropriate in a waiting room when interacting with people, whereas
efficient and streamlined movement may be prioritized in a hallway. In contrast, discretion and minimal
disruption are essential in more sensitive areas such as operating rooms. These scenarios require not
only spatial awareness but also necessitate a sophisticated level of contextual understanding that goes
beyond basic perception. The potential of context-aware robotics lies in its ability to deliver not only
enhanced operational robustness but also more intuitive, trustworthy, and human-aligned behavior.
This includes promoting safer interactions, improving task effectiveness, and ensuring that robotic
actions are interpretable and appropriate within a healthcare scenario.
      </p>
      <p>To enable effective context awareness, a fundamental capability is the ability of the robot to interpret
and understand its surrounding environment. Early approaches to scene understanding primarily relied
on handcrafted features and rule-based systems. These traditional methods, while foundational, often
struggled with generalization and scalability across diverse environments [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Initially, scene recognition
relied on global attribute descriptors, which aimed to mimic human visual perception using low-level
features. Techniques such as GIST [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], CENTRIST [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and LDBP [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] captured holistic characteristics of
scenes, but lacked robustness to variations in viewpoint, lighting, and object occlusion. To address these
limitations, researchers introduced local patch-based representations, leveraging descriptors like SIFT
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and SURF [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to extract finer-grained features. These were often aggregated using methods like
Bag-of-Visual-Words (BoVW), improving recognition accuracy by capturing local structure and texture.
However, these methods still required substantial manual tuning and could not dynamically adapt to
novel scenes. Subsequent methods introduced spatial layout pattern learning and discriminative region
detection, aiming to model scene composition more flexibly and emphasize key visual regions. While
these approaches offered improvements, they still fell short of integrating semantic and contextual
information in a unified framework. More recently, transformer-based architectures have emerged as
powerful tools for multimodal perception in robotics [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. These models excel at fusing visual and
linguistic data, enabling richer contextual understanding. In particular, Vision-Language Models (VLMs)
integrate image features with textual knowledge, allowing robots not only to recognize scenes but
also to reason about them semantically. Further steps forward have been taken with the release of
Multimodal Large Language Models (MLLMs) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. These models, trained on large corpora spanning multiple data
types, such as images, text, and audio, can not only describe a scene but also reason about it, inferring
affordances, intentions, and risks from abstract cues. MLLMs are able to interpret a scene through a
human-like reasoning lens, overcoming the problem of partial information or occluded images.
      </p>
      <p>This study explores the feasibility of exploiting MLLMs as a foundation for scene understanding and
contextual reasoning, with the long-term goal of enabling adaptive and context-aware behaviors in
social robots. By evaluating the reasoning capabilities of MLLMs, we aim to determine their potential
as high-level perception modules to support intelligent and socially appropriate robotic behaviors
in dynamic environments. In this preliminary stage, we focus on static scenes containing partial or
ambiguous visual information, such as images depicting only a fragment of a room, isolated medical
equipment, or occluded spaces. This setup allowed the investigation of the effectiveness of MLLMs
in inferring contextual meaning, environmental affordances, and potential human activities from
incomplete or implicit visual cues. Initially, the models were assessed using only images; subsequently,
audio information was integrated to improve situational interpretation. This form of multimodal
reasoning enables a transition to goal-aware, context-sensitive behavioral modulation, where a robot’s
actions are guided not only by spatial awareness but also by inferred intent, emotion, urgency, and
social appropriateness. For example, understanding whether a scene suggests a quiet waiting room, a
critical medical event, or a routine interaction can influence a robot’s decision to speak, move, assist, or
remain passive. The final aim will be to expand on this foundation by incorporating richer sensory
inputs and evaluating real-time performance in human-robot interaction scenarios.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>To explore the potential of multimodal perception in assistive robotics for healthcare, we developed
a structured evaluation pipeline utilizing an MLLM to assess its ability to interpret and reason about
scene context. This preliminary study explores whether integrating visual and auditory inputs can
improve a model’s ability to interpret complex, ambiguous, or safety-critical scenes, conditions that
social robots frequently encounter in real-world clinical settings. To simulate realistic sensory inputs,
we focus on static visual scenes paired with ambient audio recordings, evaluating how well MLLMs can
extract meaningful semantic and contextual understanding from this multimodal data.</p>
      <p>
        The developed pipeline integrates Moondream, a lightweight yet expressive vision-language model
designed for real-time applications. Moondream is built on the CLIP (Contrastive Language-Image
Pre-training) encoder architecture [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which aligns visual and textual representations in a shared
embedding space. Unlike larger multimodal transformers, Moondream is optimized for low-latency
inference and on-device execution, making it particularly suitable for robotic platforms with limited
computational resources. In our setup, Moondream is tasked with producing semantic-level descriptions
of visual scenes based on single static images, capturing both object-level content and contextual cues.
To incorporate auditory information, the pipeline also integrates CoNeTTE [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], a neural captioning
model that generates descriptive text from audio recordings. CoNeTTE uses a conformer-based encoder
to extract temporal and spectral features from raw audio and decodes them into natural language
descriptions via a transformer-based language decoder. This model is capable of identifying both
environmental sounds (e.g., alarms, footsteps, conversations) and their semantic implications, offering a
high-level linguistic summary of the audio context. In our implementation, the audio caption produced
by CoNeTTE is prepended to the visual prompt before being passed to Moondream, effectively creating
a multimodal composite input that allows the system to reason over combined visual and auditory cues.
      </p>
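      <p>The prompt-composition step described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the function name, label set, and prompt wording are assumptions; only the structure (audio caption prepended to the visual question) comes from the text.</p>

```python
from typing import List, Optional

# Illustrative scene labels; the paper's hospital dataset uses room types like these.
SCENE_LABELS = ["waiting area", "patient room", "operating room", "hospital corridor"]

def build_prompt(audio_caption: Optional[str], labels: List[str]) -> str:
    """Compose the text prompt passed to the vision-language model.

    In the visual + audio condition, the caption generated by the audio
    captioning model is prepended to the classification question; in the
    visual-only condition, the question is used alone.
    """
    question = (
        "Which of the following scene types best matches the image? "
        + ", ".join(labels) + "."
    )
    if audio_caption:  # visual + audio condition
        return f"Ambient audio in this scene: {audio_caption}. {question}"
    return question  # visual-only condition

# The same image can then be queried under both experimental conditions:
p_visual = build_prompt(None, SCENE_LABELS)
p_multi = build_prompt("a heart-rate monitor beeps steadily", SCENE_LABELS)
```
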
      <p>The experimental pipeline was tested under two configurations:</p>
      <list list-type="bullet">
        <list-item>
          <p>Visual-only condition: the system is presented with a static image and tasked with classifying the scene based solely on visual cues;</p>
        </list-item>
        <list-item>
          <p>Visual + Audio condition: the system is presented with a static image and a 5-second audio recording.</p>
        </list-item>
      </list>
      <p>This modular setup allows us to systematically assess how multimodal inputs contribute to contextual
reasoning, laying the groundwork for future, real-time integration into socially intelligent robotic
platforms.</p>
      <sec id="sec-2-1">
        <title>2.1. Evaluation Datasets</title>
        <p>
          To validate the proposed architecture, we initially employed the Audiovisual Aerial Scene Recognition
Dataset [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], a publicly available collection of 5,075 paired images and environmental audio clips
depicting various ambiguous outdoor scenes. For this study, a total of 2,996 image-audio pairs were
chosen across 9 scene categories characterized by the presence of partial information, in order
to test the feasibility of the pipeline. The categories included: airport, beach, bridge, farmland, forest,
grassland, and harbour. This dataset is not directly aligned with healthcare scenarios, but it provides a
valuable benchmark to test the pipeline’s robustness in disambiguating semantically distinct contexts,
which is a relevant capability for real-world robotic perception.
        </p>
        <p>Building on this preliminary phase, the methodology was extended to the healthcare domain, where
multimodal datasets suitable for robotics applications remain scarce and underexplored. To address this
gap, we developed a custom synthetic dataset designed to simulate realistic indoor hospital scenarios.
Visual data were manually selected from the Hospital Indoor Object Detection (HIOD) dataset [15],
focusing on scenes with partial occlusions, limited visible cues, or ambiguous content. A total of 160
images were selected from this dataset, covering common healthcare spaces such as waiting
areas, patient rooms, operating rooms, and hospital corridors. Furthermore, this dataset contains
various visual cues such as partially occluded equipment and hallways without identifiable signage.
These images were particularly relevant as they often depicted only partial views of the environments
mentioned before, simulating the constrained and task-oriented perspective a robot might have while
performing specific actions within the scene.</p>
        <p>Since the HIOD dataset does not include audio data, we synthetically generated ambient sounds to
simulate auditory context. As shown in Fig. 1, for each image, a semantic caption summarizing the scene
was generated using the BLIP (Bootstrapping Language-Image Pretraining) model [16], which extracts
high-level visual descriptions by aligning image content with natural language. Additionally, two
categorical labels were manually assigned to each image to reflect the likely room type and its function,
aiding in downstream contextual interpretation. For each label, a prompt template was constructed by
combining it with the image captions, and passed through LLAMA 3.1 to generate a natural language
audio scene description. This textual description was subsequently used as input for Stable Audio 1.0,
a state-of-the-art diffusion-based generative model capable of producing high-fidelity 5-second audio
clips. These clips reflect typical ambient sounds expected in each room type (e.g., beeping monitors in
patient rooms, footsteps and murmurs in waiting areas, equipment sounds in operating rooms).</p>
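        <p>The label-plus-caption templating step above can be sketched as follows. The paper does not give the exact wording passed to LLAMA 3.1, so this template is a hypothetical reconstruction of the structure it describes: a room label and its function, combined with the BLIP image caption, yielding a request for an ambient-sound description.</p>

```python
def audio_description_prompt(room_label: str, room_function: str, caption: str) -> str:
    """Build the natural-language prompt sent to the text LLM; the LLM's
    answer is then used as the generation prompt for the text-to-audio model.

    The template wording is an assumption, not the authors' exact text.
    """
    return (
        f"A hospital {room_label} used for {room_function}. "
        f"Visual description: {caption}. "
        "Describe, in one sentence, the typical ambient sounds of this room."
    )

prompt = audio_description_prompt(
    "patient room", "inpatient care", "a bed with medical monitors beside it"
)
```
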
        <p>The motivation behind constructing this dataset was to enable experimentation under conditions
that closely mirror real-world deployment scenarios for assistive robots in hospitals. In such settings, a
robot must continuously interpret environmental cues and adjust its behavior accordingly.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
      <p>To evaluate the effectiveness of the proposed multimodal pipeline, a comparative analysis of the model’s
performance was conducted under two experimental conditions: using only visual input (visual-only)
and using both visual and auditory inputs (visual + audio). Performance was assessed using standard
scene classification metrics, including Accuracy and F1 score.</p>
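      <p>For reference, both metrics can be computed directly from paired ground-truth and predicted labels. The paper does not state which F1 averaging was used, so macro averaging over scene categories is assumed in this sketch.</p>

```python
def accuracy(y_true, y_pred):
    """Fraction of scenes whose predicted label matches the ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Per-class F1 averaged uniformly over classes (macro averaging assumed)."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(
            2 * precision * recall / (precision + recall) if precision + recall else 0.0
        )
    return sum(f1_scores) / len(f1_scores)
```
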
      <sec id="sec-3-1">
        <title>3.1. Audiovisual Aerial Scene Dataset</title>
        <p>The performance of the proposed pipeline on the Audiovisual Aerial Scene dataset is presented in Table
1. In the visual-only experiments, where the model was prompted with a single image, classification
performance reached an Accuracy of 69.04% and an F1-score of 65.15%. Similarity among the images
with semantically distinct but visually overlapping environments posed challenges to the model in the
recognition phase. Such limitations suggest that relying exclusively on visual input may be insufficient
for robust scene interpretation, especially in real-world scenarios requiring nuanced contextual
understanding. In the visual-audio experiments, the images were accompanied by the audio description. The
pipeline’s recognition performance increased to 81.09% and 80.22%, for Accuracy and F1-score,
respectively. The inclusion of audio-derived context helped reduce confusion in key categories. For example,
environmental sounds such as crowd chatter, transportation noises (e.g., engines, horns, or train signals),
and various ambient sounds produced by everyday objects served as strong semantic information that
complemented the image content, enabling the model to differentiate between similar-looking scenes
more effectively.</p>
        <p>These findings validate the benefit of multimodal integration for scene recognition tasks,
demonstrating that ambient audio, even when transformed into linguistic input, can provide complementary
cues. This improvement is particularly relevant in assistive robotics contexts, where misinterpretation
of a setting could lead to inappropriate or unsafe behavior.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hospital Rooms Dataset</title>
        <p>To evaluate the performance of the proposed pipeline in real-world healthcare settings, a synthetic
dataset was created with ambiguous images and generated audios.</p>
        <p>The pipeline, with the Moondream core, achieved an accuracy of 35.95% and F1 score of 29.88% in the
visual-only experiment, and 35.29% and 32.50% in the visual-audio experiment. Following a manual
inspection of the model outputs and a thorough analysis of the synthetic hospital dataset, it became
evident that the data contained a high degree of semantic ambiguity. Many of the visual scenes lacked
distinctive cues, and the alignment between images and their corresponding audio descriptions was
often weak or non-informative, particularly when compared to the more coherent and contextually rich
audiovisual dataset used in the earlier phase of the study.</p>
        <p>To verify whether performance limitations were due to model constraints or dataset quality, several
models were tested under both visual-only and visual+audio conditions, including Gemma, LLaVA, and
Qwen2.5. The results are reported in Table 2. While some models exhibited slight improvements when
prompted with the audio description, the overall classification performance across models remained relatively
low. This consistently poor Accuracy and F1-score across architectures supports the conclusion that
the primary bottleneck lies in the dataset itself, rather than in the models or pipeline design. Even
when selecting models with different numbers of parameters (Gemma 3 with 4B and 12B parameters), the
overall performance increased only slightly. These findings reinforce the importance of using high-quality,
semantically aligned multimodal data, particularly in sensitive domains like healthcare, where clarity
and precision are crucial.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>This work explored the application of MLLMs to context-aware scene understanding for socially assistive
robots in healthcare settings. We proposed and evaluated a modular perception pipeline capable of
interpreting complex indoor scenes by combining visual and auditory inputs. Preliminary results
indicate that the multimodal configuration improves performance, particularly when dealing with
partially occluded or visually ambiguous scenes. Notably, the addition of audio information played
a disambiguating role, helping the model to distinguish between visually similar settings by using
contextual acoustic cues. However, the healthcare dataset did not demonstrate an improvement in
performance, requiring additional investigation. Overall, equipping robots with the ability to interpret
such differences through multimodal inputs and MLLMs could enable more appropriate, responsive,
and socially aligned behaviors that can adapt to different scenarios.</p>
      <p>Future efforts should focus on collecting real-world multimodal datasets within clinical settings,
including rich audio environments and high-fidelity image data annotated for context, emotional tone,
and functional zones. From a system integration perspective, the next step is to embed the proposed
perception pipeline into a physical robotic platform, evaluating its real-time performance in a dynamic
environment. Furthermore, future research will explore how to integrate the scene-understanding
capabilities of MLLMs with higher-level planning, dialogue, and emotion recognition capabilities, with
the final aim of building socially intelligent robots capable of assisting patients and professionals in a
contextually aware manner.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This research has been supported by the European Union - Next Generation EU, Mission 4 Component
1, CUP E53D23016260001 PRIN 2022 PNRR ADVISOR, and under the complementary actions to the
NRRP “Fit4MedRob - Fit for Medical Robotics” Grant (# PNC0000007).</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During this work, the authors used ChatGPT for grammar and spelling checks. All content was
subsequently reviewed and edited by the authors, who take full responsibility for the final version.</p>
      <p>[15] D. Hu, S. Li, M. Wang, Object detection in hospital facilities: A comprehensive dataset and
performance evaluation, Engineering Applications of Artificial Intelligence 123 (2023) 106223.
[16] J. Li, D. Li, C. Xiong, S. Hoi, BLIP: Bootstrapping language-image pre-training for unified
vision-language understanding and generation, in: International Conference on Machine Learning, PMLR,
2022, pp. 12888-12900.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kwon</surname>
          </string-name>
          ,
          <article-title>The influence of politeness behavior on user compliance with social robots in a healthcare service setting</article-title>
          ,
          <source>International Journal of Social Robotics</source>
          <volume>9</volume>
          (
          <year>2017</year>
          )
          <fpage>727</fpage>
          -
          <lpage>743</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Johanson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Broadbent</surname>
          </string-name>
          ,
          <article-title>Improving interactions with healthcare robots: a review of communication behaviours in social and healthcare contexts</article-title>
          ,
          <source>International Journal of Social Robotics</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>1835</fpage>
          -
          <lpage>1850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>D'Arco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Raggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Randazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Gasperis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Costantini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <article-title>Towards trustworthy and explainable socially assistive robots: A cognitive architecture for dietary guidance</article-title>
          ,
          <source>in: 2025 IEEE International Conference on Simulation, Modeling, and Programming for Autonomous Robots (SIMPAR)</source>
          , IEEE,
          <year>2025</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kotani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Scene recognition: A comprehensive survey</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>102</volume>
          (
          <year>2020</year>
          )
          <fpage>107205</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Oliva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Torralba</surname>
          </string-name>
          ,
          <article-title>Modeling the shape of the scene: A holistic representation of the spatial envelope</article-title>
          ,
          <source>International journal of computer vision 42</source>
          (
          <year>2001</year>
          )
          <fpage>145</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Rehg</surname>
          </string-name>
          ,
          <article-title>Centrist: A visual descriptor for scene categorization</article-title>
          ,
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          <volume>33</volume>
          (
          <year>2010</year>
          )
          <fpage>1489</fpage>
          -
          <lpage>1501</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>X.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Building global image features for scene recognition</article-title>
          ,
          <source>Pattern recognition 45</source>
          (
          <year>2012</year>
          )
          <fpage>373</fpage>
          -
          <lpage>380</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Distinctive image features from scale-invariant keypoints</article-title>
          ,
          <source>International journal of computer vision 60</source>
          (
          <year>2004</year>
          )
          <fpage>91</fpage>
          -
          <lpage>110</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tuytelaars</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Van Gool</surname>
          </string-name>
          ,
          <article-title>Surf: Speeded up robust features</article-title>
          ,
          <source>in: Computer Vision-ECCV 2006: 9th European Conference on Computer Vision</source>
          , Graz, Austria, May 7-13,
          <year>2006</year>
          .
          <source>Proceedings, Part I 9</source>
          , Springer,
          <year>2006</year>
          , pp.
          <fpage>404</fpage>
          -
          <lpage>417</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>Multimodal fusion and vision-language models: A survey for robot vision</article-title>
          ,
          <source>arXiv preprint arXiv:2504.02477</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Multimodal large language models: A survey</article-title>
          ,
          <source>in: 2023 IEEE International Conference on Big Data (BigData)</source>
          , IEEE,
          <year>2023</year>
          , pp.
          <fpage>2247</fpage>
          -
          <lpage>2256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          , PMLR,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>É.</given-names>
            <surname>Labbé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pinquier</surname>
          </string-name>
          ,
          <article-title>CoNeTTE: An efficient audio captioning system leveraging multiple datasets with task embedding</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/pdf/2309.00454.pdf. arXiv:2309.00454.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <article-title>Audiovisual aerial scene recognition dataset</article-title>
          ,
          <year>2020</year>
          . URL: https://doi.org/10.5281/zenodo.3828124. doi:10.5281/zenodo.3828124.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>