<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multimodal Signal Processing for HRI in RoboCup: Understanding a Human Referee</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Filippo Ansalone</string-name>
          <email>ansalone.1950936@studenti.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Maiorana</string-name>
          <email>maiorana.2051396@studenti.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Afinita</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Volpi</string-name>
          <email>volpi.1884040@studenti.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eugenio Bugli</string-name>
          <email>bugli.1934824@studenti.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Petri</string-name>
          <email>francesco.petri@uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Brienza</string-name>
          <email>brienza@diag.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerio Spagnoli</string-name>
          <email>spagnoli.1887715@studenti.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vincenzo Suriani</string-name>
          <email>vincenzo.suriani@unibas.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniele Nardi</string-name>
          <email>nardi@diag.uniroma1.it</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Domenico D. Bloisi</string-name>
          <email>domenico.bloisi@unint.eu</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Cognitive Sciences and Technologies, National Research Council</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Advancing human-robot communication is crucial for autonomous systems operating in dynamic environments, where accurate real-time interpretation of human signals is essential. RoboCup provides a compelling scenario for testing these capabilities, requiring robots to understand referee gestures and whistles with minimal network reliance. Using the NAO robot platform, this study implements a two-stage pipeline for gesture recognition through keypoint extraction and classification, alongside Continuous Kernel Convolutional Neural Networks (CKCNNs) for efficient whistle detection. The proposed approach enhances real-time human-robot interaction in a competitive setting like RoboCup, offering some tools to advance the development of autonomous systems capable of cooperating with humans.</p>
      </abstract>
      <kwd-group>
        <kwd>Human-Robot Interaction</kwd>
        <kwd>Audio Communication</kwd>
        <kwd>Gesture Communication</kwd>
        <kwd>Soccer Robots</kwd>
        <kwd>Referee</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Human-robot communication has evolved significantly, but it becomes challenging in competitive
environments such as RoboCup, where robots must interpret human signals with high accuracy. In
these settings, the challenge is to reduce the reliance on network-based communications in favor of
multimodal signal processing. This shift aligns with the growing interest in developing robots capable
of understanding human gestures and audio cues, such as referee signals during matches. The challenge
lies in the robots’ ability to process and interpret these multimodal signals in real-time, despite the
constraints of limited computational resources. In the context of RoboCup, where human referees
convey critical game states and events through gestures and whistles, the need for precise and efficient
recognition systems becomes evident. This paper explores the integration of multimodal perception of
gestures and whistles using the NAO robot platform, focusing on achieving robust performance under
real-time conditions while complying with the official competition rules.</p>
      <p>
        We employ a two-stage pipeline approach for gesture recognition, combining keypoint extraction and
classification to interpret referee poses accurately. Simultaneously, we utilize Continuous Kernel
Convolutional Neural Networks (CKCNNs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for whistle detection, balancing accuracy with computational
efficiency. The proposed methods demonstrate the potential for enhancing human-robot interaction in
competitive environments, contributing to the ongoing development of robot synergy with humans.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Interpreting human behavior has long been a central challenge in robotics. Humans communicate
through various modalities, including vision, audio, and motion. This multimodal nature provides rich
information that robot sensors can capture and analyze.</p>
      <p>
        Recent advances in Deep Learning have facilitated the integration of multimodal data, significantly
improving the comprehension of relationships within individual modalities, a key factor for precise
message interpretation [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>In the context of RoboCup, human-robot interaction is predominantly one-way, with human referees
conveying game states and events to robots. A significant trend in the RoboCup SPL league is the
progressive reduction of network communication in favor of human-like signal interpretation, allowing
robots to interpret human signals more naturally.</p>
      <p>
        In human soccer matches, gestures serve as a critical means of communication, especially in noisy
environments such as stadiums. Previous works have extensively explored gesture recognition among
agents using deep learning models, as seen in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A common approach is a two-stage pipeline, in
which the person’s skeleton is first extracted [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], followed by the classification of keypoint evolution
over time. Specifically, Di Giambattista et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] employed OpenPose with Part Affinity Fields to
extract the skeleton, using a subsequent network to analyze the relative positioning of keypoints for
final pose prediction. Alternatively, single-stage pipelines [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] offer end-to-end models, but they require
consideration of both spatial and temporal data from image sequences, often resulting in significantly
larger models. Given that the NAO robot is an edge device with limited computational resources, we
opted for a two-stage pipeline to maintain efficiency while ensuring accurate pose recognition.
      </p>
      <p>
        Audio processing to detect specific sounds is an active research field, finding applications in various
domains such as environmental monitoring [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], security [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and sports analytics [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In particular,
whistle detection has received attention in the context of sports, where referees’ whistles are used to
signal important events during matches. Unlike gestures, the whistling signal itself does not convey a
specific meaning directly. Instead, it must be interpreted in the context of the current situation and
game state, requiring a grounding [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] mechanism to relate the sound to relevant game events. A
potential approach for whistle detection is to use LSTMs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which offer the advantage of modeling
long-range temporal dependencies and providing a larger context for analysis. However, they tend to be
computationally expensive and slower due to their recurrent structure [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Alternatively, computing
the Fourier transform of the audio signal and using CNNs to process the resulting spectrogram is a more
efficient solution [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Given that whistle recognition does not require modeling extensive temporal
context, CNNs provide a better balance between accuracy and computational efficiency for our task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To handle the detection of signals coming from the referee, we designed a pipeline involving
multiple robots of the team. Fig. 2 shows the pipeline together with the game states and the teammates.</p>
      <sec id="sec-3-1">
        <title>3.1. Whistle Recognition</title>
        <p>For whistle recognition, we employed Continuous Kernel Convolutional Neural Networks (CKCNNs), which extend
classical CNNs by using a kernel parametrized by a small neural network. CNNs excel at efficiently
learning functions over structured data, like images or audio, by leveraging translation equivariance,
albeit with a fixed receptive field size. In contrast, continuous kernel convolutions adapt to varying
input lengths and resolutions, offering several advantages in audio processing:
• The same architecture accommodates different preprocessing techniques, such as varying
sampling rates, window sizes, or feature extraction methods (e.g., STFT or MFCC).
• The number of parameters of the network is decoupled from its receptive field, allowing
a long-range kernel with a relatively small number of parameters.</p>
        <sec id="sec-3-1-1">
          <title>In our application, the basic building block is the CKBlock:</title>
          <p>input -&gt; BatchNorm -&gt; CKConv -&gt; GELU -&gt; DropOut -&gt; Linear -&gt; GELU -&gt; + -&gt; output
|__________________________________________________________________|
The CKConv layer is the core of the architecture, since it contains the kernel generation and convolution
operation. The convolution operation is defined as ( ∗  )() =
∑   ( ) ⋅   ( −  ) , which means that
the convolver is now viewed as a vector-valued continuous function  ∶ ℝ → ℝ  
×  , parametrized
with a small neural network  
 :</p>
          <p>∑
=1 =0
• The input is a relative position ( −  ) of the convolvee
• The output is the value  ( −  )</p>
          <p>of the convolutional kernel at that position</p>
        </sec>
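        <p>To make the kernel generation concrete, the following is a minimal PyTorch sketch of a CKConv layer
using the sizes reported below (a 3-layer kernel MLP with hidden size 16 and a kernel size of 31); the class
names, the uniform position sampling, and the toy input are our illustrative assumptions, not the team's actual code.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelMLP(nn.Module):
    """3-layer MLP mapping a relative position (t - tau) to kernel values."""
    def __init__(self, in_ch, out_ch, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, in_ch * out_ch),
        )
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, positions):
        # positions: (kernel_size, 1); result: (out_ch, in_ch, kernel_size)
        values = self.net(positions)
        return values.t().view(self.out_ch, self.in_ch, -1)

class CKConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=31):
        super().__init__()
        self.kernel_net = KernelMLP(in_ch, out_ch)
        # Relative positions sampled uniformly; the kernel length is set here,
        # independently of the number of trainable parameters.
        self.register_buffer(
            "positions", torch.linspace(-1.0, 1.0, kernel_size).unsqueeze(1))

    def forward(self, x):
        # x: (batch, in_ch, length); the kernel is generated on the fly.
        kernel = self.kernel_net(self.positions)
        return F.conv1d(x, kernel, padding=kernel.shape[-1] // 2)

x = torch.randn(8, 1, 513)            # a batch of STFT frames of shape (1, 513)
out = CKConv(in_ch=1, out_ch=32)(x)   # (8, 32, 513)
        </preformat>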
        <sec id="sec-3-1-2">
          <title>The main consequence of this is that the kernel is arbitrarily large.</title>
          <p>
            The entire network is a sequence of 4 CKBlocks, with a final fully connected part. More specifically,
the convolutional layers have a hidden size of 32. The convolutional kernels are structured as simple
3-layer MLPs with hidden size 16. We chose as kernel size 31, since an overly large kernel would overfit
the training data, while a too small kernel would need a deeper network. Overall, we reached a network
size of 59.1k trainable parameters.
3.1.1. Data gathering
Structure and preprocessing In addition to the task of classifying an audio sample as either whistle
or no-whistle, a critical challenge in RoboCup games is ensuring accurate predictions in the presence of
background noise, such as crowd sounds, robot movements, and other environmental sounds. Therefore,
the dataset [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ] is a collection of audio files collected both in lab conditions and during the actual
matches, using the robots’ microphones. Since, on average there are few whistles in a match, the result
is a heavily unbalanced dataset, with a ratio of 10 ∶ 1 (60000 no-whistle samples, 6000 whistle samples).
The dataset was manually cleaned, removing many samples where the only noise source was the robot
walking, or where there was silence. Also, the labelling happened manually through the software
Audacity by extracting the audio events, defined as start and end of the whistle, in text files. These were
then associated with the corresponding audio samples using the library Librosa [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ].
Feature extraction To extract the features, we perform a frequency analysis of the audio
signal using short-time fourier transforms. The result is, for each audio, a series of vectors of shape
(1,NUMBER_FREQUENCIES), where each vector represents the frequency amplitudes of a window. We
extracted 1024 frames per window at 44100 Hz. This resulted in every data sample being a vector of
shape (1,513).
          </p>
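          <p>The sketch below illustrates this feature extraction with Librosa [<xref ref-type="bibr" rid="ref17">17</xref>];
the file name is a placeholder, and the hop length is our assumption, since only the window size and
sampling rate are specified above.</p>
          <preformat>
import librosa
import numpy as np

# Load the recording at the sampling rate used on the robot.
audio, sr = librosa.load("whistle_sample.wav", sr=44100)

# Short-time Fourier transform: n_fft = 1024 gives 1024 // 2 + 1 = 513 bins.
stft = librosa.stft(audio, n_fft=1024, hop_length=512)  # hop length assumed
magnitudes = np.abs(stft).T                             # (num_windows, 513)

# Each window becomes one classifier input of shape (1, 513).
samples = magnitudes[:, np.newaxis, :]                  # (num_windows, 1, 513)
          </preformat>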
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Gesture Recognition</title>
        <p>For the recognition of a referee pose, we propose a two-step architecture based on a pretrained keypoint
extractor followed by a classification module.</p>
        <p>
          Since one of the goals in RoboCup is to optimize each algorithm as much as possible to ensure fast
real-time execution, we rely on MoveNet Lightning [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] which is a deep learning architecture
based on MobileNetV2 [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] specifically developed for real-time applications, which takes as input a
192x192 RGB image. We adapt the NAO camera frames, which have a resolution of 640x480, by scaling
and padding them to match the input shape. Due to the distance of the referee from the robots, scaling
the image down leads to a loss of detail in our region of interest (ROI), and the keypoint extractor fails to
recognize the pose correctly. To overcome this issue, we implemented a crop on the ROI containing the
referee, which is then resized and padded to the desired input shape. This crop is also useful to prevent
MoveNet from focusing on a different person standing at the border of the field, which could cause
false readings; this makes the entire pipeline more robust. Figure 1 illustrates an example of the ROI
selection and the pose estimation network in action, estimating the referee’s skeleton.
        </p>
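        <p>A minimal sketch of this preprocessing step is shown below, assuming the ROI box is already
available; the OpenCV-based implementation and the example crop coordinates are illustrative, not the
team's actual code.</p>
        <preformat>
import cv2
import numpy as np

def preprocess_for_movenet(frame, roi):
    """Crop the ROI, then resize and zero-pad it to the 192x192 MoveNet input."""
    x, y, w, h = roi
    crop = frame[y:y + h, x:x + w]

    # Resize the longer side to 192 pixels, preserving the aspect ratio.
    scale = 192.0 / max(w, h)
    resized = cv2.resize(crop, (int(round(w * scale)), int(round(h * scale))))

    # Pad the shorter side with zeros to reach a square 192x192 input.
    padded = np.zeros((192, 192, 3), dtype=resized.dtype)
    padded[:resized.shape[0], :resized.shape[1]] = resized
    return padded

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder 640x480 NAO frame
model_input = preprocess_for_movenet(frame, roi=(200, 100, 160, 240))
        </preformat>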
        <p>After the keypoint extraction, we needed to extract a good feature, because classifying directly on
the raw keypoint coordinates would be a much harder problem, especially with a small dataset. To
address this problem, we decided to compute the angles of the joints that are most useful for our task.
So, for both the left and right sides of the body, the algorithm computes the angles between:
• Hip - Shoulder - Elbow
• Shoulder - Elbow - Wrist
This procedure yields fewer features that are, on the other hand, much more representative
of our problem. In general, given 3 points (A, B, C), the angle at B is computed as:
θ = atan2(C_y − B_y, C_x − B_x) − atan2(A_y − B_y, A_x − B_x)
This feature is better not only because it is easier to interpret, but also because it grants scale and rotation
invariance, which are very useful considering that both the dataset and the classifier architecture were
small. These two properties, together with the intrinsic translation equivariance provided by the CNN
architecture, contribute to a generally more robust pipeline.</p>
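        <p>A minimal sketch of this feature computation, assuming 2D (x, y) keypoints from the extractor
(the example coordinates are illustrative):</p>
        <preformat>
import math

def joint_angle(A, B, C):
    """Angle at vertex B between the segments B-C and B-A, in radians."""
    angle = math.atan2(C[1] - B[1], C[0] - B[0]) - math.atan2(A[1] - B[1], A[0] - B[0])
    # Normalize to [0, 2*pi) so equivalent poses map to the same feature value.
    return angle % (2.0 * math.pi)

# Example: the two angles used per body side, from (x, y) keypoints.
hip, shoulder, elbow, wrist = (0.50, 0.80), (0.50, 0.50), (0.65, 0.45), (0.70, 0.20)
features = [joint_angle(hip, shoulder, elbow), joint_angle(shoulder, elbow, wrist)]
        </preformat>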
        <p>When a robot sees the pose for at least 4 consecutive frames, the recognition is successful, and a packet
is sent to the team so that every robot can enter the ready state, as shown in Fig. 2.</p>
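        <p>A minimal sketch of this temporal-consistency check follows; the packet-sending call is a
hypothetical placeholder for the team's communication layer.</p>
        <preformat>
REQUIRED_CONSECUTIVE_FRAMES = 4

class GestureFilter:
    """Accepts a gesture only after enough consecutive positive frames."""
    def __init__(self):
        self.streak = 0

    def update(self, gesture_detected):
        self.streak = self.streak + 1 if gesture_detected else 0
        return self.streak &gt;= REQUIRED_CONSECUTIVE_FRAMES

gesture_filter = GestureFilter()
for frame_prediction in [True, True, False, True, True, True, True]:
    if gesture_filter.update(frame_prediction):
        # send_team_packet("READY")  # hypothetical team-communication call
        break
        </preformat>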
        <p>[Table 1(b): Gesture Recognition Results based on 153 test samples and 18 real situations over 8 games. Rows: Test, Real (Play), Real (Ready/Set).]</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Data gathering</title>
          <p>The released rule that has to be followed states that “To announce the transition from standby to ready
state, the referee will raise both hands over their head”.</p>
        <p>To this end, the dataset was collected by our team in a private environment, allowing for consistent
conditions throughout the data acquisition process. This approach facilitated data gathering, which
was subsequently labeled manually to ensure high-quality annotations.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We evaluated the whistle and gesture subsystems separately. Table 1 shows the results of the models
on the test data and in the real scenario. On the whistle test data, we reached a lower precision, due to
the highly imbalanced dataset, whereas in the real scenario the detector performed well. Lower
precision could be a problem when sounds similar to whistles cause false detections.
This can be easily mitigated by using a consensus approach. In the real scenario, the distinction
between playing and not playing is made to show the difference between these two cases. When the
robots are playing, the whistle always comes after a goal is scored, and usually the crowd cheers in
such a situation. Therefore, especially when referees do not whistle loudly, the model is not able to
distinguish the whistle sound from the crowd noise. On the other hand, when the robots are not playing,
it means they are waiting for a kick-off. In this case, there is usually less noise, and the model is able to
detect the whistles with high accuracy. The same pattern occurs in the gesture recognition case, in
which high precision was preferred over recall to avoid incurring rule penalties.</p>
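      <p>As an illustration of the consensus mitigation mentioned above, a simple majority vote over the
robots listening in the same time window could look as follows (the quorum value is an assumption):</p>
      <preformat>
def consensus_whistle(detections, quorum=0.5):
    """detections: per-robot whistle flags for the current time window."""
    return sum(detections) &gt; quorum * len(detections)

print(consensus_whistle([True, True, False, True, False]))    # True
print(consensus_whistle([True, False, False, False, False]))  # False
      </preformat>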
      <p>Both pipelines are fast enough to run on a NAO robot, taking about 0.8 ms per inference for the whistle and 200 ms for the gesture.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This paper presents an approach to detecting audiovisual signals from a human referee in real time, in the
context of a robot soccer game. Using a two-stage pipeline for gestures and a CKCNN for whistles, we
balanced computational efficiency with accuracy on the NAO robot platform.</p>
      <p>Our results showed strong performance in whistle detection, while gesture recognition faced
challenges in real-world conditions, particularly in noisy environments. Future work will focus on enhancing
noise resilience and improving gesture recognition to better handle dynamic scenarios.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been carried out while Francesco Petri and Michele Brienza were enrolled in the Italian
National Doctorate on Artificial Intelligence run by Sapienza University of Rome. We also acknowledge
partial financial support from PNRR MUR project PE0000013-FAIR.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kuzina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Bekkers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Tomczak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoogendoorn</surname>
          </string-name>
          ,
          <article-title>CKConv: Continuous kernel convolution for sequential data</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2102.02611. arXiv:2102.02611.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Deep learning-based multimodal control interface for human-robot collaboration</article-title>
          ,
          <source>Procedia CIRP 72</source>
          (
          <year>2018</year>
          )
          <fpage>3</fpage>
          -
          <lpage>8</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S2212827118303846. doi:10.1016/j.procir.2018.03.224,
          <source>51st CIRP Conference on Manufacturing Systems</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sandoval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Laribi</surname>
          </string-name>
          ,
          <article-title>Recent advancements in multimodal human-robot interaction</article-title>
          ,
          <source>Frontiers in Neurorobotics</source>
          <volume>17</volume>
          (
          <year>2023</year>
          )
          <fpage>1084000</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simão</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Safeea</surname>
          </string-name>
          ,
          <article-title>Gesture-based human-robot interaction for human assistance in manufacturing</article-title>
          ,
          <source>The International Journal of Advanced Manufacturing Technology</source>
          <volume>101</volume>
          (
          <year>2019</year>
          )
          <fpage>119</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Grimes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cipolla</surname>
          </string-name>
          ,
          <article-title>Posenet: A convolutional network for real-time 6-dof camera relocalization</article-title>
          ,
          <source>2015 IEEE International Conference on Computer Vision</source>
          (ICCV) (
          <year>2015</year>
          )
          <fpage>2938</fpage>
          -
          <lpage>2946</lpage>
          . URL: https://api.semanticscholar.org/CorpusID:12888763.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <article-title>Pose flow: Efficient online pose tracking</article-title>
          , arXiv preprint arXiv:1802.00977 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Di Giambattista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fawakherji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Suriani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D.</given-names>
            <surname>Bloisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nardi</surname>
          </string-name>
          ,
          <article-title>On field gesture-based robot-to-robot communication with nao soccer players</article-title>
          , in: S. Chalup,
          <string-name>
            <given-names>T.</given-names>
            <surname>Niemueller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Suthakorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Williams</surname>
          </string-name>
          (Eds.),
          <source>RoboCup</source>
          <year>2019</year>
          : Robot World Cup XXIII, Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>367</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F. B.</given-names>
            <surname>Ashraf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. U.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Kabir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uddin</surname>
          </string-name>
          ,
          <article-title>Yonet: A neural network for yoga pose classification</article-title>
          ,
          <source>SN Computer Science</source>
          <volume>4</volume>
          (
          <year>2023</year>
          ).
          doi:10.1007/s42979-022-01618-8.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ur Rehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Attique</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Tariq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Abdulaziz Alfouzan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M</given-names>
            <surname>Alzahrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <article-title>Dynamic hand gesture recognition using 3d-cnn and lstm networks</article-title>
          ,
          <source>Computers, Materials &amp; Continua</source>
          <volume>70</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E. L.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. R.</given-names>
            <surname>White</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Bull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Risch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Beck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Edwards</surname>
          </string-name>
          ,
          <article-title>More than a whistle: Automated detection of marine sound sources with a convolutional neural network</article-title>
          ,
          <source>Frontiers in Marine Science</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>879145</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Battisti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Carli</surname>
          </string-name>
          ,
          <article-title>Sound event detection for human safety and security in noisy environments</article-title>
          ,
          <source>IEEE Access 10</source>
          (
          <year>2022</year>
          )
          <fpage>134230</fpage>
          -
          <lpage>134240</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.-M.</given-names>
            <surname>Filippidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vryzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kotsakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Thoidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Dimoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bratsas</surname>
          </string-name>
          ,
          <article-title>Audio event identification in sports media content: The case of basketball</article-title>
          , in: Audio Engineering Society Convention 146, Audio Engineering Society,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <article-title>Affective grounding in human-robot interaction</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM/IEEE International Conference on Human-Robot Interaction</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>273</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Zweig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <article-title>LSTM time and frequency recurrence for automatic speech recognition</article-title>
          , in:
          <source>2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU)</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>191</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Purwins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Virtanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schlüter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sainath</surname>
          </string-name>
          ,
          <article-title>Deep learning for audio signal processing</article-title>
          ,
          <source>IEEE Journal of Selected Topics in Signal Processing</source>
          <volume>13</volume>
          (
          <year>2019</year>
          )
          <fpage>206</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kleingarn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Brämer</surname>
          </string-name>
          ,
          <article-title>Neural network and prior knowledge ensemble for whistle recognition</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Buche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simões</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Visser</surname>
          </string-name>
          (Eds.),
          <source>RoboCup</source>
          <year>2023</year>
          : Robot World Cup XXVI, Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <source>librosa/librosa: 0.10.2.post1</source>
          ,
          <year>2024</year>
          . URL: https://doi.org/10.5281/zenodo.11192913. doi:10.5281/zenodo.11192913.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>Next-generation pose detection with MoveNet and TensorFlow.js</article-title>
          ,
          <year>2021</year>
          . URL: https://blog.tensorflow.org/2021/05/next-generation-pose-detection-with-movenet-and-tensorflowjs.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sandler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zhmoginov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>MobileNetV2: Inverted residuals and linear bottlenecks</article-title>
          , arXiv preprint arXiv:1801.04381 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>