<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gaia Caligiore</string-name>
          <email>gaia.caligiore@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raffaele Mineo</string-name>
          <email>raffaele.mineo@phd.unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Concetto Spampinato</string-name>
          <email>concetto.spampinato@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Egidio Ragonese</string-name>
          <email>egidio.ragonese@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Palazzo</string-name>
          <email>simone.palazzo@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sabina Fontana</string-name>
          <email>sfontana@unict.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
          ,
          <addr-line>Dec 04 - 06, 2024, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Catania</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Because sign languages are unwritten visual-gestural languages, research on their automatic recognition has increasingly relied on multisource capturing tools for data collection and processing. This paper explores advancements in Italian Sign Language (LIS) recognition using a multimodal dataset in the medical domain: the MultiMedaLIS Dataset. We investigate the integration of RGB frames, depth data, optical flow, and skeletal information to develop and evaluate two computational models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). RADAR data was collected but not included in the testing phase. Our experiments validate the effectiveness of these models in enhancing the accuracy and robustness of isolated LIS sign recognition. Our findings highlight the potential of multisource approaches in computational linguistics to improve linguistic accessibility and inclusivity for members of the signing community.</p>
      </abstract>
      <kwd-group>
        <kwd>Italian Sign Language</kwd>
        <kwd>Sign Language Recognition</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Computer Vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Italian Sign Language (LIS, Lingua dei Segni Italiana) is
the primary means of communication within the Italian
signing community. Due to their visual-gestural
modality, sign languages (SLs) were initially not
considered fully-fledged linguistic systems. However,
since the 1960s, beginning with Stokoe’s pioneering
works [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the contemporary study of SLs has evolved
into a robust field of research. Over the past
half-century, significant societal and scientific advancements
have transformed the perception and status of SLs, now
recognized as natural and complete languages, having
received legal recognition in many countries.
      </p>
      <p>
        In the Italian context, the study of signed
communication began in the early 1980s, involving both
hearing and deaf researchers. At that time, what we now
call LIS was still mostly unnamed and was often referred
to as ‘mime’ or ‘gesture’ by both signers and non-signers
alike [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The first significant publications on LIS [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
along with the collaborative efforts of deaf and hearing
researchers, initiated a transformative period in SL
research in the Italian context [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This shift in
perspective was influenced by factors beyond the
language itself, such as increased meta-linguistic
awareness and greater visibility of the community and
its language to the wider public. In fact, from a societal
perspective, the visibility of SL in Italy, especially in
media, has significantly changed with technological
advancements, mirroring global trends.
      </p>
      <p>
        In the late 1980s, Italy introduced subtitles in movies
on television, marking a step toward content
accessibility. The importance of media accessibility,
through subtitles or LIS interpreting, was accentuated
during the COVID-19 pandemic. The need for equitable
access to critical information for deaf individuals
became evident, with efforts born within the community
stressing the central role of LIS in ensuring that the deaf
signers received accessible information during
challenging times [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], highlighting the significant
communication barriers that deaf individuals face,
especially when in-person interactions were restricted.
This increased visibility, along with persistent advocacy
by the signing community, played a crucial role in the
official recognition of LIS and Tactile LIS (LISt) in May
2021.
      </p>
      <p>Within this evolving societal and linguistic
framework, marked by the increased media visibility of LIS
and the spread of video-capturing tools in daily life,
language collection emerges as a central issue. For SLs,
the need for comprehensive collections is particularly
significant. Unlike oral languages, which in some cases
have developed standardized written systems, SLs must
rely on video collections to capture signed
communication accurately. These videos, whether raw
or annotated, are essential for analyzing SLs with both
qualitative and quantitative evidence.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Automatic Sign Language</title>
    </sec>
    <sec id="sec-3">
      <title>Recognition</title>
      <p>The development and use of SL datasets or
corpora, preferably annotated, are crucial for training and
validating automatic recognition models, and access to
high-quality data from diverse SLs and cultural contexts
enhances the generalizability of these solutions.
Comprehensive data collections of this kind ensure that
models can effectively understand and process the wide
range of linguistic and cultural nuances present in
different SLs.</p>
      <p>
        In the domain of automatic sign language
recognition (SLR) of LIS, the integration of visual and
spatial information presents a complex challenge. As
mentioned, LIS operates through the visual-gestural
channel. More precisely, it is characterized as
multimodal (signed discourse comprises manual
and body components) and multilinear (manual and
body components are performed simultaneously) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Recent advancements in SLR have been significantly
driven by annotated datasets, which serve as the basis
for training and validating models [
        <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">7, 8, 9, 10, 11</xref>
        ].
      </p>
      <p>Given our group's interdisciplinarity, we found that "multimodal" can
mean different things depending on one's background: in linguistics, it
refers to the employment of manual and body components while signing,
while in computer vision it means using multiple capturing tools. To
differentiate, we use "multisource" for capturing tools; thus, "multimodal"
in this text follows SL linguistics terminology.</p>
      <p>
        Machine learning technologies, particularly deep
learning neural networks, have facilitated the
development of more precise and robust models for SL
interpretation. These models are able to refine their
performance through training on diverse and complex
datasets. Additionally, computer vision plays a central
role in this field by enabling real-time analysis and
interpretation of body and manual components [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], that
is, hand movements, facial expressions, and body posture
[
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15">12, 13, 14, 15</xref>
        ].
      </p>
      <p>A significant challenge in applying deep learning
and computer vision methods to SLR lies in ensuring the
quality and adequacy of training data, which is essential
for achieving optimal model performance.</p>
      <p>Therefore, in this study, we focus on evaluating the
efficacy of the MultiMedaLIS Dataset (Multimodal
Medical LIS Dataset) and on assessing deep learning
models for SLR that interpret isolated signs by
integrating diverse data types such as RGB video, depth
information, optical flow, and skeletal data.</p>
      <p>We benchmark our Dataset with two models: the
Skeleton-Based Graph Convolutional Network
(SL-GCN) and the Spatiotemporal Separable Convolutional
Network (SSTCN). These models are trained on the
MultiMedaLIS Dataset, showcasing how the
incorporation of multisource data can enhance the
accuracy of sign recognition. This approach aims at
testing the potential of integrating different data
modalities to improve the robustness and performance
of SLR systems.</p>
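      <p>As an illustration of how such multisource integration can work at prediction time, the following minimal Python sketch shows a simple late-fusion scheme in which each data stream yields per-class scores for an isolated sign and the (optionally weighted) average gives the final label. The function and variable names are illustrative only and do not reproduce the SL-GCN or SSTCN implementations.</p>
      <preformat>
# Hypothetical late-fusion sketch: average per-stream class scores.
import numpy as np

def fuse_streams(stream_scores, weights=None):
    """stream_scores: list of [num_classes] score arrays, one per data stream."""
    scores = np.stack(stream_scores, axis=0)
    if weights is not None:
        w = np.asarray(weights, dtype=float)[:, None]
        return (scores * w / w.sum()).sum(axis=0)
    return scores.mean(axis=0)

# Example with random scores for a 126-class task (100 signs + 26 letters).
rng = np.random.default_rng(0)
streams = [rng.random(126) for _ in range(3)]   # e.g., skeleton, RGB, depth
predicted_sign = int(np.argmax(fuse_streams(streams)))
      </preformat>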
    </sec>
    <sec id="sec-4">
      <title>3. State of the Art</title>
      <p>In this section, we discuss the state of the art from two
perspectives considered during our work on the Dataset:
LIS data collection and SLR tools.</p>
      <sec id="sec-4-1">
        <title>3.1. LIS Data Collections</title>
        <p>SL researchers in Italy have been actively engaged in the
creation of LIS corpora and datasets. This effort involves
a complex process of video data collection and
annotation, as SL datasets can vary significantly
depending on their intended use. Within this context, SL
data collections can be categorized into two main types.
The first type includes datasets that feature videos
depicting continuous signing, capturing the flow and
context of natural SL usage. The second type comprises
datasets that focus on isolated signs, which are
individual signs presented separately from continuous
discourse.</p>
        <p>The scarcity of available LIS data collections has
prompted researchers to develop their own resources.
Several smaller-scale LIS corpora have been
independently established, each serving distinct
purposes based on the type of data collected.</p>
        <p>
          The methodologies employed for collecting LIS data
encompass a diverse array of approaches, ranging from
naming tasks to semi-structured and spontaneous
interviews with deaf signers, to video recording sessions
involving hearing individuals learning LIS as a second
language (L2) or second modality (M2) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. These
documentations serve equally diverse purposes, ranging
from documenting the language itself to creating tools
for automatic translation highlighting the ongoing
commitment of researchers to expand and enrich the
available resources for studying LIS [
          <xref ref-type="bibr" rid="ref17 ref18 ref19 ref20 ref21 ref22 ref23 ref24">17, 18, 19, 20, 21, 22,
23, 24</xref>
          ].
        </p>
        <p>
          Despite the predominant private nature of corpora
collections, an exception to the accessibility challenge is
found in the online dictionary SpreadTheSign, a project
originating in 2004. Initially conceived as a dictionary
for SLs, SpreadTheSign has evolved into a versatile
resource for language documentation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Another
significant resource is the Corpus LIS, recognized as the
largest collection of spontaneous, semi-structured, and
structured videos in LIS by deaf signers. The primary
objectives of this corpus were twofold: to collect a
substantial quantity of data suitable for quantitative
analysis and to establish a comprehensive
representation of LIS usage in Italy [
          <xref ref-type="bibr" rid="ref26 ref27 ref28">26, 27, 28</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. SLR Tools</title>
        <p>Like SL data collections, SLR approaches can be broadly
classified into two main categories: those that rely on
specialized hardware and those that use visual
information. The former employ specialized hardware,
such as gloves able to capture precise hand movements.
While these systems can provide detailed data, they are
often considered intrusive and can compromise the
natural flow of communication. Additionally, they are
unable to capture the full spectrum of SLs, which
includes manual and body components. In contrast,
vision-based approaches use visual information
captured by cameras, including RGB, depth, infrared, or
a combination of these. These methods are less intrusive
for users, as they do not require the use of special
equipment.</p>
        <p>
          In SLR, a challenge lies in effectively capturing both
body movements and specific motions of hands, arms,
and face. For instance, [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] introduces a multi-scale,
multi-modal framework that focuses on spatial details
across different scales. This approach involves each
visual modality capturing spatial information uniquely,
supported by a system operating at three temporal
scales. The training methodology emphasizes precise
initialization of individual modalities and progressive
fusion via ModDrop, which enhances overall robustness
and performance.
        </p>
        <p>
          Another study proposes an iterative optimization
alignment network tailored for weakly supervised
continuous SLR [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. The framework employs a 3D
residual convolutional network for feature extraction,
complemented by an encoder-decoder architecture
featuring LSTM decoders and Connectionist Temporal
Classification (CTC).
        </p>
        <p>
          [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] introduces a 3D convolutional neural network
enhanced with an attention module, designed to extract
spatiotemporal features directly from raw video data. In
contrast, [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] combines bidirectional recurrence and
temporal convolutions, emphasizing temporal
information’s effectiveness in sign tasks, although not
covering the full spectrum of movements. Moreover,
[
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] employs CNNs, a Feature Pooling Module, and
LSTM networks to generate distinctive visual
representations but falls short in capturing
comprehensive movements and signing.
        </p>
        <p>
          However, as previously noted, RGB-based SLR
systems can raise privacy concerns, particularly when
processing visual data in cloud environments or for
machine learning training [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ]. Addressing these issues,
radio frequency (RF) sensors have emerged as a
promising alternative, ensuring privacy preservation
while enabling innovative data representations for SLR.
In the literature, deep learning techniques have been
applied to various RF modalities such as ultra-wideband
(UWB) [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], Doppler [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], continuous wave (CW) [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ],
micro-Doppler [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], frequency modulated continuous
wave (FMCW) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], multi-antenna systems [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ], and
millimeter waves [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ].
        </p>
        <p>As part of the Dataset discussed in this work, we
have also collected RADAR data and are actively
analyzing it. However, preliminary results are not
available at this time, so they are not included in this
report. Currently, RADAR-based solutions have
demonstrated robust performance across diverse
environmental conditions, underscoring the value
of incorporating this sensor technology in data
collection efforts. Nevertheless, many existing RADAR
solutions are tailored to recognizing a limited set of
signs, highlighting the ongoing challenge of expanding
vocabulary recognition capabilities in datasets like the
one discussed in the following section.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. The MultiMedaLIS Dataset</title>
      <p>
        The MultiMedaLIS [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ] Dataset was created thanks to
the interdisciplinary collaboration established between
the Department of Humanities (DISUM) and the
Department of Electrical, Electronic and Computer
Engineering (DIEEI) of the University of Catania (Unict).
It aims to offer a multimodal collection of LIS signs
specifically focused on medical contexts.
      </p>
      <p>For the data recording protocol, the DIEEI group
developed a customized recording software to collect the
LIS data, supplemented with a desktop computer and a
modified keyboard transformed into a pedal board. This
pedal board, equipped with two pedals, allowed
hands-free navigation of the software, enabling users to move
forward (by pushing on the right pedal) or backward (by
pushing on the left pedal) while maintaining a neutral
recording position, that is, a seated position with the arms
extended along the sides of the torso, elbows bent at 90°, and palms
facing downward [<xref ref-type="bibr" rid="ref41">41</xref>]. During sessions, one of 126 Italian
labels or alphabet letters was displayed on a screen, with
adjustable display time for preparation and transition
from one sign to the other. Each recording started from
a neutral position, and the right pedal marked the
completion of a sign. If errors occurred, the left pedal
allowed re-recording. The software’s interface features
a color-coded background: yellow for preparation and
green for recording. Additionally, it supports flexible
data expansion, accepting word lists from text files for
easy customization in future collections.</p>
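      <p>For readers interested in the session flow, the sketch below summarizes the protocol described above (yellow preparation phase, green recording phase, right pedal to confirm, left pedal to re-record). All helper names are hypothetical; the DIEEI recording software is not reproduced here.</p>
      <preformat>
# Hypothetical sketch of the pedal-driven recording loop (not the actual software).
import time

def show_background(color):
    print(f"[background: {color}]")        # stand-in for the color-coded interface

def run_session(labels, read_pedal, record_sign, save_clip, prep_seconds=2.0):
    """labels: word list loaded from a text file; callbacks supply I/O."""
    i = 0
    while i &lt; len(labels):
        show_background("yellow")          # preparation phase
        print("sign to perform:", labels[i])
        time.sleep(prep_seconds)
        show_background("green")           # recording phase
        clip = record_sign()               # capture frames until a pedal press
        if read_pedal() == "right":        # right pedal: confirm and advance
            save_clip(labels[i], clip)
            i += 1
        # left pedal: stay on the same label and re-record
      </preformat>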
      <p>
        After the recording process, the Dataset comprises
synchronized data capturing facial expressions and hand
and body movements, for a total of 25,830
sign instances: 205 repetitions each of 100
different signs and the 26 signs of the LIS alphabet [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ].
Beyond these 26 signs, the signs included in the
MultiMedaLIS Dataset can be broadly categorized into
two groups [
        <xref ref-type="bibr" rid="ref42">42</xref>
        ]: semantically marked signs related to
health and health issues, and non-semantically marked
signs. It is important to note that while the first group of
signs is categorized as semantically marked, this
classification does not imply that these signs belong
exclusively to a specialized jargon lexicon. The decision
to categorize signs as semantically marked was driven
by their significance in contexts related to health and
medical interactions in the post-pandemic world (the
period in which the Dataset was first conceived). However, it was
also important to include additional signs that could
contribute to constructing meaningful utterances in
patient-doctor interactions. During the creation of the
MultiMedaLIS Dataset, careful consideration was given
to selecting signs that could be combined to form
coherent and meaningful utterances.
      </p>
      <p>
        Regarding the specific form of signs, the
MultiMedaLIS Dataset includes a lexicon of standard,
isolated signs that are not combined within utterances.
      </p>
      <p>These signs reflect forms commonly found in online
dictionaries and educational materials. To ensure the
accuracy of the data, sign variants performed by a
professional LIS interpreter during the collection of a
test dataset were compared with the same variants
found in the online dictionary SpreadTheSign. This
comparison aimed to select documented versions of each
sign for inclusion in the Dataset. By incorporating these
documented variants, we aimed to enhance its precision,
reliability, and real-world applicability. This approach
contributed to ensuring that the Dataset aligns with
established standards and supports effective research
and application in the field of LIS.</p>
      <p>
        When discussing recording tools for state-of-the-art
multimodal corpora in the Italian context, such as the
Corpus LIS [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] and the CORMIP [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ], the emphasis is
placed on the portability and non-invasiveness of these
tools. This approach ensures minimal interference with
the signer's natural environment and activities.
      </p>
      <p>Portable and non-invasive recording tools are
chosen specifically for their ability to capture data in
familiar, and sometimes domestic, settings without
disrupting the signer’s surroundings, aiming to maintain
the authenticity of the signed interactions and minimize
any discomfort or distraction for the participants.</p>
      <p>
        To capture LIS for recognition with minimal
invasiveness, we integrated a combination of recording
tools. A 60GHz RADAR sensor, employed to capture
detailed manual motion data, provided Time- and
Frequency-Domain data and Range Doppler Maps for
distinguishing moving objects at 13 fps. For more
structured depth and facial recognition data, the
Realsense D455 depth camera and Kinect v1 were
incorporated. The Realsense D455, equipped with dual
infrared cameras and RGB mode, captured depth data at
848x480 pixels and RGB data at 1280x720 pixels, both at
30 fps, enabling the tracking of facial expressions
through 68 facial points. The Zed v1 and Zed v2 cameras
provided high-resolution stereoscopic data, recording at
1920x1080 pixels and 25 fps, with capabilities for
generating depth maps and 3D point clouds.
Additionally, the Zed v2 offered tracking for 18 body
points in both 2D and 3D [
        <xref ref-type="bibr" rid="ref41">41</xref>
        ].
      </p>
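      <p>For reference, the acquisition setup described above can be summarized as a plain configuration table; the sketch below restates the reported specifications as a Python dictionary (a convenience summary, not a file format used by the recording software).</p>
      <preformat>
# Sensor summary as reported in the text (values restated, not measured here).
SENSORS = {
    "radar_60ghz":    {"fps": 13,
                       "outputs": ["time_domain", "frequency_domain", "range_doppler_map"]},
    "realsense_d455": {"depth_resolution": (848, 480), "rgb_resolution": (1280, 720),
                       "fps": 30, "facial_landmarks": 68},
    "kinect_v1":      {"role": "structured depth and facial data"},
    "zed_v1":         {"rgb_resolution": (1920, 1080), "fps": 25,
                       "outputs": ["depth_map", "point_cloud_3d"]},
    "zed_v2":         {"rgb_resolution": (1920, 1080), "fps": 25,
                       "outputs": ["depth_map", "point_cloud_3d"], "body_points": 18},
}
      </preformat>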
      <p>By prioritizing portability and non-invasiveness,
high-quality data can be still collected, while respecting
the privacy and comfort of the individuals recorded.
Anonymization is achieved through the use of the
RADAR sensor, which we introduced specifically to
address privacy concerns inherent in face-to-face signed
communication.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Testing the Dataset</title>
      <p>The MultiMedaLIS Dataset was designed with the aim of
supporting the development of SLR models by enabling
the collection and integration of information through
various data modalities:
• RGB frames: images extracted from videos.
• Depth data: three-dimensional information for each RGB frame.
• Optical flow: to emphasize movement.
• Skeletal data: face landmarks and body joints.
One of the main components of the Dataset is the set of RGB
frames, which are images extracted from videos. These
frames provide a two-dimensional visual representation
of the signs performed by the signer, capturing details
such as hand positions and facial expressions. The
Dataset also includes depth data, providing a
three-dimensional aspect to the images and allowing for more
detailed information on the distance and relative
position of elements in the scene. This type of data is
particularly useful for understanding the spatial
dynamics of signs.</p>
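      <p>To make the structure of a single Dataset entry concrete, the following sketch shows how one isolated-sign sample could be assembled from the four modalities listed above. Field names and array shapes are illustrative assumptions, not the released Dataset schema.</p>
      <preformat>
# Hypothetical container for one multisource sample (shapes are assumptions).
from dataclasses import dataclass
import numpy as np

@dataclass
class SignSample:
    label: str                 # one of the 126 Italian labels
    rgb: np.ndarray            # [T, H, W, 3] frames extracted from the video
    depth: np.ndarray          # [T, H, W] per-frame depth maps
    optical_flow: np.ndarray   # [T-1, H, W, 2] displacements between frames
    skeleton: np.ndarray       # [T, J, C] face landmarks and body joints
      </preformat>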
      <p>Alongside RGB and depth data, the MultiMedaLIS
Dataset also contains optical flow information, which
describes the movement between consecutive frames.
Optical flow is essential for capturing the direction and
speed of movements, providing a more detailed
understanding of the transitions between various signs.
Finally, the Dataset includes skeletal data, representing
face landmarks and body joints, allowing for precise
tracking of joint and body segment positions, facilitating
the analysis of signs in terms of joint movements.</p>
      <p>Managing this multimodal data is an emerging topic
in computational linguistics. By combining different
sources of information, it is possible to significantly
improve the performance of SLR models. For example,
integrating depth data with RGB frames can provide a
more complete representation of signs, while adding
optical flow and skeletal data can further enrich the
analysis of movement’s temporal structure. In our view,
the MultiMedaLIS Dataset provides a solid foundation
for exploring these combinations, allowing researchers
to develop more effective and accurate solutions for SLR.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Models and Architectures</title>
      <p>In the context of automatic SLR, various approaches and
model architectures have been tested to leverage the
characteristics of multimodal data in the MultiMedaLIS
Dataset.</p>
      <p>
        The SL-GCN (Skeleton-Based Graph Convolutional
Network) represents a significant innovation in this
field. This model generates skeletal data from videos and
creates temporal graphs that capture the spatiotemporal
relationships between joint movements. Through
fine-tuning and the combination of different data streams,
SL-GCN has demonstrated high accuracy in sign
recognition [
        <xref ref-type="bibr" rid="ref44">44</xref>
        ] [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ].
      </p>
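      <p>Conceptually, a skeleton-based GCN of this kind operates on a graph in which joints are nodes, bones are spatial edges within a frame, and the same joint in consecutive frames is linked by a temporal edge. The sketch below builds such an edge list for a toy skeleton; joint indices and bone pairs are made up for illustration and do not reproduce the SL-GCN graph definition.</p>
      <preformat>
# Illustrative construction of spatiotemporal graph edges for a skeleton sequence.
def build_spatiotemporal_edges(bones, num_joints, num_frames):
    """Nodes are (frame, joint) pairs flattened to integer indices."""
    edges = []
    for t in range(num_frames):
        offset = t * num_joints
        for a, b in bones:                     # spatial edges (bones) within a frame
            edges.append((offset + a, offset + b))
        if t + 1 &lt; num_frames:                 # temporal edges to the next frame
            for j in range(num_joints):
                edges.append((offset + j, offset + num_joints + j))
    return edges

# Toy example: a 4-joint chain (shoulder-elbow-wrist-hand) over 3 frames.
toy_bones = [(0, 1), (1, 2), (2, 3)]
edges = build_spatiotemporal_edges(toy_bones, num_joints=4, num_frames=3)
      </preformat>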
      <p>
        Another prominent architecture is the SSTCN
(Spatiotemporal Separable Convolutional Network) [
        <xref ref-type="bibr" rid="ref46">46</xref>
        ],
which excels in feature extraction from videos using
HRNet [
        <xref ref-type="bibr" rid="ref47">47</xref>
        ]. This approach has shown an accuracy of
96.33%, highlighting its effectiveness in capturing spatial
and temporal dynamics of LIS signs.
      </p>
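      <p>The core idea behind separable spatiotemporal convolution can be sketched as a spatial 1x3x3 convolution followed by a temporal 3x1x1 convolution over a clip tensor, rather than a single full 3D kernel. The PyTorch block below is a minimal sketch of that idea, not the SSTCN implementation or its HRNet feature extractor.</p>
      <preformat>
# Minimal separable spatiotemporal block (illustrative, not the SSTCN code).
import torch
import torch.nn as nn

class SeparableSpatioTemporalBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                      # x: [N, C, T, H, W] clip tensor
        return self.act(self.temporal(self.act(self.spatial(x))))

clip = torch.randn(2, 3, 16, 112, 112)         # 2 clips, 16 frames of 112x112 RGB
features = SeparableSpatioTemporalBlock(3, 32)(clip)
      </preformat>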
      <p>
        RGB frames are crucial for the visual representation
of signs. The process of splitting videos into frames,
cropping, and normalization optimally prepares the data
for analysis by deep learning models. The use of dense
optical flow presents significant challenges in sign
recognition. Optical flow extraction using the Farneback
algorithm [
        <xref ref-type="bibr" rid="ref48">48</xref>
        ] led to 56% accuracy, highlighting
difficulties in capturing precise details of movements,
alongside computational limitations. Depth data
encoded as Horizontal disparity, Height above ground, and
Angle with gravity (HHA) represent another crucial resource in the
MultiMedaLIS Dataset. Applying HHA encoding to
depth frames achieved 88% accuracy using the
ResNet(2+1)D architecture [
        <xref ref-type="bibr" rid="ref49">49</xref>
        ], substantiating the
importance of three-dimensional information in
enhancing the understanding and interpretation of signs and
offering a more detailed perspective compared to
two-dimensional data.
      </p>
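      <p>For reproducibility of the optical-flow modality, dense flow of the kind used here can be computed frame by frame with OpenCV's implementation of the Farneback algorithm; the parameter values in the sketch below are common defaults and are not necessarily those used when building the Dataset.</p>
      <preformat>
# Dense optical flow between consecutive grayscale frames (Farneback, OpenCV).
import cv2

def farneback_flow(prev_gray, next_gray):
    # Returns an [H, W, 2] array of per-pixel (dx, dy) displacements.
    return cv2.calcOpticalFlowFarneback(
        prev_gray, next_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

def flow_for_video(frames_bgr):
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames_bgr]
    return [farneback_flow(a, b) for a, b in zip(grays[:-1], grays[1:])]
      </preformat>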
    </sec>
    <sec id="sec-8">
      <title>7. Training and Evaluation</title>
    </sec>
    <sec id="sec-9">
      <title>Procedure</title>
      <p>For the training of the models, we employed a multi-stream
approach that integrates skeletal, RGB, and depth
data to improve sign recognition accuracy. The models
were trained on an NVIDIA Tesla T4 16GB GPU using the
Adam optimizer with an initial learning rate of 0.001 and
a batch size of 8. We applied cross-validation to ensure
the robustness of the results, splitting the Dataset into
training (70%) and validation (15%) subsets, and used data
augmentation techniques, such as color jittering
(varying brightness, contrast, saturation, and hue), to
increase the diversity of the training data and improve
generalization.</p>
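      <p>The training settings reported above can be restated as a short configuration sketch; the model and data-loading objects are placeholders, and the jitter magnitudes are assumptions since only the transformation types are specified in the text.</p>
      <preformat>
# Optimizer, batch size, splits, and color-jitter augmentation as described above.
import torch
from torchvision import transforms

augment = transforms.ColorJitter(brightness=0.2, contrast=0.2,
                                 saturation=0.2, hue=0.05)    # magnitudes assumed

def make_optimizer(model):
    return torch.optim.Adam(model.parameters(), lr=1e-3)      # initial LR 0.001

BATCH_SIZE = 8
SPLITS = {"train": 0.70, "val": 0.15, "test": 0.15}
      </preformat>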
      <p>The loss function adopted for training was
categorical cross-entropy, appropriate for multi-class
classification tasks. The models were trained for a
maximum of 100 epochs, with an early stopping
criterion set to terminate training if no improvement in
validation loss was observed for 10 consecutive epochs.
For evaluation, we used a test set comprising 15% of the
Dataset, ensuring that the models were tested on unseen
data.</p>
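      <p>The stopping rule can be expressed compactly as follows: categorical cross-entropy as the loss, at most 100 epochs, and early stopping after 10 epochs without improvement in validation loss. The training and validation passes themselves are elided and passed in as a callback in this sketch.</p>
      <preformat>
# Early-stopping skeleton matching the reported criterion (training loop elided).
import torch.nn as nn

criterion = nn.CrossEntropyLoss()              # categorical cross-entropy

def train_with_early_stopping(run_epoch, max_epochs=100, patience=10):
    """run_epoch() performs one train+validation pass and returns the validation loss."""
    best_val, stale = float("inf"), 0
    for _ in range(max_epochs):
        val_loss = run_epoch()
        if val_loss &lt; best_val:
            best_val, stale = val_loss, 0
        else:
            stale += 1
            if stale &gt;= patience:              # 10 epochs without improvement
                break
    return best_val
      </preformat>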
    </sec>
    <sec id="sec-10">
      <title>8. Results</title>
      <p>The results demonstrate the models' efficiency in
leveraging multisource data for improved outcomes. As
can be seen in Table 1, the SL-GCN multi-stream model
achieved the best accuracy, with a Top-1 accuracy of
97.98% and a Top-5 accuracy of 99.94%, surpassing the
performance of models using single data streams such as
skeletal joints, bones, or motion alone. This
demonstrates the advantage of combining multiple
streams of information to capture both spatial and
temporal dynamics of signs. The weakest performance was
obtained with optical flow data alone, reaching just 56.31% accuracy,
suggesting that while the optical flow provides valuable
information on motion, it lacks the richness of spatial
features found in RGB and depth data. The
HHA-encoded depth data, when processed with the
ResNet(2+1)D model, achieved an accuracy of 88.04%,
confirming that depth information is complementary,
but not as effective as RGB data in isolation.</p>
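      <p>For clarity on the metrics, Top-1 and Top-5 accuracy count a prediction as correct when the true sign is, respectively, the highest-scoring class or among the five highest-scoring classes. The sketch below shows the computation; the figures quoted above come from the experiments, not from this code.</p>
      <preformat>
# Top-k accuracy from per-sample class scores (illustrative computation).
import numpy as np

def top_k_accuracy(scores, labels, k):
    """scores: [N, num_classes] array; labels: [N] integer class indices."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    hits = [labels[i] in topk[i] for i in range(len(labels))]
    return float(np.mean(hits))

# top_k_accuracy(scores, labels, 1) and top_k_accuracy(scores, labels, 5)
# correspond to the Top-1 and Top-5 figures reported in Table 1.
      </preformat>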
    </sec>
    <sec id="sec-11">
      <title>9. Discussion and Conclusion</title>
      <p>In this study, our goal was to demonstrate our first steps
into testing the efficacy of the MultiMedaLIS Dataset in
contributing to the advancement of the field of SLR
through multisource approaches. The integration of
RGB frames, depth data, optical flow, and skeletal data
has provided a comprehensive basis for developing and
evaluating SLR models. Our experiments with the
SL-GCN and SSTCN architectures have highlighted
advancements in recognizing isolated LIS signs in
medical semantic contexts, given the domain of our
Dataset.</p>
      <p>The SL-GCN model, trained on skeletal data to
construct temporal graphs, achieved high accuracy in
capturing the spatiotemporal relationships critical to sign
recognition. This approach not only enhances the
precision with which LIS signs are modeled but also shows
that the Dataset is able to support robust graph-based
convolutional networks in multisource SLR tasks. At the
same time, our Dataset proved robust, precise, and
varied enough for testing the SSTCN model, which focuses on
spatiotemporal separable convolutions and showed strong
performance in extracting spatial dynamics from RGB
frames.</p>
      <p>Having validated the visual modalities on the
mentioned models, we have promising preliminary
results on adapting these models to accept RADAR data.
We plan to extract the pre-trained RADAR data
processing module and use it independently during
inference. This approach will eliminate the need for RGB
visual data. Furthermore, we plan to expand the Dataset
by applying the same protocol with 10 deaf signers. This
will effectively increase the current Dataset, enhancing
the generalizability across different signers. Our goal is
to develop an autonomous, resource-constrained system
(thanks to the exclusion of RGB data) that operates
on-edge or even offline. This cost-effective solution could be
used in emergency contexts where direct access to
interpreting is not available.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>Stokoe</surname>
          </string-name>
          ,
          <article-title>Sign language structure: an outline of the visual communication systems of the American deaf</article-title>
          , University of Buffalo, Buffalo, New York,
          <year>1960</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Volterra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Roccaforte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Di</given-names>
            <surname>Renzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          ,
          <article-title>Italian Sign Language from a Cognitive and Sociosemiotic Perspective. Implications for a general language theory</article-title>
          , John Benjamins Publishing Company, Amsterdam-Philadelphia,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Montanini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Facchini</surname>
          </string-name>
          , L. Fruggeri, Dal Gesto al Gesto:
          <article-title>il bambino sordo tra gesto e parola</article-title>
          , Cappelli, Bologna,
          <year>1979</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Volterra</surname>
          </string-name>
          ,
          <article-title>I segni come le parole: la comunicazione dei sordi</article-title>
          , Boringhieri, Torino,
          <year>1981</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Corazza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Boyes-Braem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Volterra</surname>
          </string-name>
          ,
          <article-title>Language research and language community change: Italian Sign Language (LIS)</article-title>
          <year>1981</year>
          -
          <fpage>2013</fpage>
          , in volume
          <volume>236</volume>
          of the
          <source>International Journal of the Sociology of Language</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Tomasuolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gulli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Volterra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          ,
          <article-title>The Italian Deaf Community at the Time of Coronavirus</article-title>
          , in volume 5 of Frontiers in Sociology,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Opazo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison</article-title>
          ,
          <source>in proceedings of the 2020 IEEE WACV, Snowmass</source>
          , CO, USA,
          <year>2020</year>
          , pp.
          <fpage>1448</fpage>
          -
          <lpage>1458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O. Mercanoglu</given-names>
            <surname>Sincan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yalim Keles</surname>
          </string-name>
          ,
          <article-title>AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods</article-title>
          ,
          <source>IEEE Access</source>
          ,
          <year>2020</year>
          . https://doi.org/10.48550/arXiv.2008.00932
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H. R. Vaezi</given-names>
            <surname>Joze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          , MS-ASL:
          <article-title>A large-scale data set and benchmark for understanding American sign language</article-title>
          ,
          <source>arXiv preprint arXiv</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>U.</given-names>
            <surname>von Agris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Knorr</surname>
          </string-name>
          and
          <string-name>
            <given-names>K. F.</given-names>
            <surname>Kraiss</surname>
          </string-name>
          ,
          <article-title>The significance of facial features for automatic sign language recognition</article-title>
          ,
          <source>proceedings of the 8th IEEE International Conference on Automatic Face &amp; Gesture Recognition</source>
          , Amsterdam, Netherlands,
          <year>2008</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tornay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Aran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Magimai</given-names>
            <surname>Doss</surname>
          </string-name>
          ,
          <article-title>An HMM Approach with Inherent Model Selection for Sign Language and Gesture Recognition</article-title>
          ,
          <source>In Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>6049</fpage>
          -
          <lpage>6056</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. -S.</given-names>
            <surname>Wei</surname>
          </string-name>
          , L. Liu and
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation, 2017</article-title>
          IEEE ICCV,
          <year>2017</year>
          , pp.
          <fpage>1221</fpage>
          -
          <lpage>1230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>E.</given-names>
            <surname>Barsoum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , C. Canton Ferrer, Z. Zhang,
          <article-title>Training deep networks for facial expression recognition with crowd-sourced label distribution</article-title>
          ,
          <source>in Proceedings of the 18th ACM ICMI</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A Novel Detection and Recognition Method for Continuous Hand Gesture Using FMCW Radar</article-title>
          , in volume 8 of IEEE Access,
          <year>2020</year>
          , pp.
          <fpage>167264</fpage>
          -
          <lpage>167275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Yusuf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Habib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moustafa</surname>
          </string-name>
          ,
          <article-title>Real-time hand gesture recognition: Integrating skeleton-based data fusion and multi-stream CNN</article-title>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardinaletti</surname>
          </string-name>
          , L. Mantovan, Le Lingue dei Segni nel 'Volume Complementare' e l'
          <source>Insegnamento della LIS nelle Università Italiane</source>
          ,
          <volume>2</volume>
          , volume
          <volume>14</volume>
          of Italiano Lingua Seconda.
          <article-title>Rivista internazionale di linguistica italiana e educazione linguistica</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>T.</given-names>
            <surname>Russo Cardona</surname>
          </string-name>
          ,
          <source>Iconicity and Productivity in Sign Language Discourse: An Analysis of Three LIS Discourse Registers</source>
          ,
          <volume>2</volume>
          , volume
          <volume>4</volume>
          of Sign Language Studies,
          <year>2004</year>
          , pp.
          <fpage>164</fpage>
          -
          <lpage>197</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ricci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bonsignori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Renzo</surname>
          </string-name>
          ,
          <article-title>Che giorno è oggi? Prime analisi e riflessioni sull'espressione del tempo in LIS [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro'</article-title>
          , Rome,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fornasiero</surname>
          </string-name>
          ,
          <article-title>La morfologia valutativa in LIS: una descrizione preliminare [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro'</article-title>
          , Rome,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Di Renzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Slonimska</surname>
          </string-name>
          ,
          <article-title>L'uso delle Strutture di Grande Iconicità nei testi narrativi segnati: primi dati su bambini prescolari, scolari e adulti [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro'</article-title>
          , Rome,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Conte</surname>
          </string-name>
          ,
          <article-title>Nomi di persona e di luogo nella comunità sorda in Italia: interviste, analisi e primi risultati [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro'</article-title>
          , Rome,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          , E. Raniolo,
          <article-title>Interazioni tra oralità e unità segniche: uno studio sulle labializzazioni nella Lingua dei Segni Italiana (LIS)</article-title>
          , in: G. Schneider,
          <string-name>
            <given-names>M.</given-names>
            <surname>Janner</surname>
          </string-name>
          ,
          <string-name>
            <surname>B.</surname>
          </string-name>
          Élie (Eds.),
          <source>Proceedings of the VII Dies Romanicus Turicensis</source>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Lang</surname>
          </string-name>
          , Bern,
          <year>2015</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>258</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cuccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Di</given-names>
            <surname>Stasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          ,
          <article-title>On the Embodiment of Negation in Italian Sign Language: An Approach Based on Multiple Representation Theories</article-title>
          , in volume 1 of Frontiers in Psychology,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          ,
          <article-title>Grammar and Experience: The Interplay Between Language Awareness and Attitude in Italian Sign Language (LIS), 5</article-title>
          , volume
          <volume>14</volume>
          of the
          <source>International Journal of Linguistics</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hilzensauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Krammer</surname>
          </string-name>
          ,
          <article-title>A multilingual dictionary for sign languages: 'SpreadTheSign'</article-title>
          ,
          <source>in proceedings of ICERI , Seville</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Cecchetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giudice</surname>
          </string-name>
          , E. Mereghetti,
          <article-title>La raccolta del Corpus LIS</article-title>
          , in: A.
          <string-name>
            <surname>Cardinaletti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cecchetto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan,
          <year>2011</year>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Geraci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Battaglia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardinaletti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cecchetto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Donati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Giudice</surname>
          </string-name>
          , E. Mereghetti,
          <source>The LIS Corpus Project, in volume 11 of Sign Language Studies</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>528</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Santoro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Poletti</surname>
          </string-name>
          ,
          <article-title>L'Annotazione del Corpus</article-title>
          , in: A.
          <string-name>
            <surname>Cardinaletti</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cecchetto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan,
          <year>2011</year>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>78</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>N.</given-names>
            <surname>Neverova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Taylor</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Nebout</surname>
          </string-name>
          ,
          <article-title>ModDrop: Adaptive Multi-Modal Gesture Recognition</article-title>
          ,
          <source>in volume 38, issue 8 of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1692</fpage>
          -
          <lpage>1706</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Iterative alignment network for continuous sign language recognition</article-title>
          ,
          <source>in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4165</fpage>
          -
          <lpage>4174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition</article-title>
          , in volume
          <volume>29</volume>
          <source>of IEEE Transactions on Circuits and Systems for Video Technology</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2822</fpage>
          -
          <lpage>2832</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bragg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Verhoef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vogler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bellard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Berke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Boudreault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Braffort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huenerfauth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Kacorri</surname>
          </string-name>
          ,
          <article-title>Sign language recognition, generation, and translation: An interdisciplinary perspective</article-title>
          ,
          <source>in Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>O. Mercanoglu</given-names>
            <surname>Sincan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Tur</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Yalim Keles</surname>
          </string-name>
          ,
          <article-title>Isolated Sign Language Recognition with Multi-scale Features using LSTM</article-title>
          ,
          <source>in Proceedings of the 27th Signal Processing and Communications Applications Conference (SIU)</source>
          , Sivas, Turkey,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Gurbuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Gurbuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Malaia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Griffin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Crawford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Rahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aksu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kurtoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mdrafi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anbuselvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Macks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ozcelik</surname>
          </string-name>
          ,
          <article-title>A linguistic perspective on radar micro-Doppler analysis of American Sign Language</article-title>
          ,
          <source>in Proceedings of the 2020 IEEE International Radar Conference (RADAR)</source>
          , Washington, DC, USA,
          <year>2020</year>
          , pp.
          <fpage>232</fpage>
          -
          <lpage>237</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Sign language/gesture recognition based on cumulative distribution density features using UWB radar</article-title>
          ,
          <source>in volume 70 of IEEE Transactions on Instrumentation and Measurement (TIM)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kulhandjian</surname>
          </string-name>
          ,
          <article-title>Sign language gesture recognition using Doppler radar and deep learning</article-title>
          ,
          <source>in Proceedings of the 2019 IEEE Globecom Workshops (GC Wkshps)</source>
          , Waikoloa, HI, USA,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <article-title>Sign language recognition with CW radar and machine learning</article-title>
          ,
          <source>in Proceedings of the 21st International Radar Symposium (IRS)</source>
          , Warsaw, Poland,
          <year>2020</year>
          , pp.
          <fpage>31</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>McCleary</surname>
          </string-name>
          ,
          <article-title>Sign language recognition using micro-doppler and explainable deep learning</article-title>
          ,
          <source>in volume 139 of Computer Modeling in Engineering &amp; Sciences</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2399</fpage>
          -
          <lpage>2450</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          , volume
          <volume>39</volume>
          <source>of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>O. O.</given-names>
            <surname>Adeoluwa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Kearney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kurtoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Connors</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Gurbuz</surname>
          </string-name>
          ,
          <article-title>Near real-time ASL recognition using a millimeter wave radar</article-title>
          ,
          <source>Proceedings of Volume 11742 of Radar Sensor Technology XXV, SPIE</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mineo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Caligiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fontana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Palazzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ragonese</surname>
          </string-name>
          ,
          <article-title>Sign Language Recognition for Patient-Doctor Communication: A Multimedia/Multimodal Dataset</article-title>
          ,
          <source>Proceedings of the IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>G.</given-names>
            <surname>Caligiore</surname>
          </string-name>
          ,
          <article-title>Codifying the body: exploring the cognitive and socio-semiotic framework in building a multimodal Italian Sign Language (LIS) dataset</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Catania, Catania,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lo Re</surname>
          </string-name>
          ,
          <article-title>Corpus Multimodale dell'Italiano Parlato: basi metodologiche per la creazione di un prototipo</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Florence, Florence,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>C.</given-names>
            <surname>Correia de Amorim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macedo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zanchettin</surname>
          </string-name>
          ,
          <article-title>Spatial-Temporal Graph Convolutional Networks for Sign Language Recognition</article-title>
          ,
          <source>Proceedings of the 2019 International Conference on Artificial Neural Networks</source>
          , Munich, Germany,
          <year>2019</year>
          , pp.
          <fpage>646</fpage>
          -
          <lpage>657</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>Ayas Faikar</given-names>
            <surname>Nafis</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nanik</given-names>
            <surname>Suciati</surname>
          </string-name>
          ,
          <article-title>Sign language recognition on video data based on graph convolutional network</article-title>
          , in volume
          <volume>99</volume>
          , issue
          <issue>18</issue>
          <source>of Journal of Theoretical and Applied Information Technology</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>4323</fpage>
          -
          <lpage>4333</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Skeleton aware multi-modal sign language recognition</article-title>
          ,
          <source>Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5693</fpage>
          -
          <lpage>5703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Deep high-resolution representation learning for human pose estimation</article-title>
          ,
          <source>Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5693</fpage>
          -
          <lpage>5703</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>G.</given-names>
            <surname>Farneback</surname>
          </string-name>
          ,
          <article-title>Two-frame motion estimation based on polynomial expansion</article-title>
          ,
          <source>in volume 2749 of Lecture Notes in Computer Science</source>
          , Springer, Berlin, Heidelberg,
          <year>2003</year>
          , pp.
          <fpage>363</fpage>
          -
          <lpage>370</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Torresani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluri</surname>
          </string-name>
          ,
          <article-title>A closer look at spatiotemporal convolutions for action recognition</article-title>
          ,
          <source>in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>6450</fpage>
          -
          <lpage>6459</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>