=Paper= {{Paper |id=Vol-3878/16_main_long |storemode=property |title=Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset |pdfUrl=https://ceur-ws.org/Vol-3878/16_main_long.pdf |volume=Vol-3878 |authors=Gaia Caligiore,Raffaele Mineo,Concetto Spampinato,Egidio Ragonese,Simone Palazzo,Sabina Fontana |dblpUrl=https://dblp.org/rec/conf/clic-it/CaligioreMSRPF24 }}
Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset

Gaia Caligiore∗†1, Raffaele Mineo†2, Concetto Spampinato2, Egidio Ragonese2, Simone Palazzo2, Sabina Fontana2

1 University of Modena and Reggio Emilia, Italy.
2 University of Catania, Italy.

Abstract
Given their status as unwritten visual-gestural languages, research on the automatic recognition of sign languages has increasingly implemented multisource capturing tools for data collection and processing. This paper explores advancements in Italian Sign Language (LIS) recognition using a multimodal dataset in the medical domain: the MultiMedaLIS Dataset. We investigate the integration of RGB frames, depth data, optical flow, and skeletal information to develop and evaluate two computational models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). RADAR data was collected but not included in the testing phase. Our experiments validate the effectiveness of these models in enhancing the accuracy and robustness of isolated LIS sign recognition. Our findings highlight the potential of multisource approaches in computational linguistics to improve linguistic accessibility and inclusivity for members of the signing community.

Keywords
Italian Sign Language, Sign Language Recognition, Deep Learning, Computer Vision



1. Introduction

Italian Sign Language (LIS, Lingua dei Segni Italiana) is the primary means of communication within the Italian signing community. Due to their visual-gestural modality, sign languages (SLs) were initially not considered fully-fledged linguistic systems. However, since the 1960s, beginning with Stokoe's pioneering works [1], the contemporary study of SLs has evolved into a robust field of research. Over the past half-century, significant societal and scientific advancements have transformed the perception and status of SLs, now recognized as natural and complete languages, having received legal recognition in many countries.

In the Italian context, the study of signed communication began in the early 1980s, involving both hearing and deaf researchers. At that time, what we now call LIS was still mostly unnamed and was often referred to as 'mime' or 'gesture' by both signers and non-signers alike [2]. The first significant publications on LIS [3, 4], along with the collaborative efforts of deaf and hearing researchers, initiated a transformative period in SL research in the Italian context [5]. This shift in perspective was influenced by factors beyond the language itself, such as increased meta-linguistic awareness and greater visibility of the community and its language to the wider public. In fact, from a societal perspective, the visibility of SL in Italy, especially in media, has significantly changed with technological advancements, mirroring global trends.

In the late 1980s, Italy introduced subtitles in movies on television, marking a step toward content accessibility. The importance of media accessibility, through subtitles or LIS interpreting, was accentuated during the COVID-19 pandemic. The need for equitable access to critical information for deaf individuals became evident, with efforts born within the community


CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04–06, 2024, Pisa, Italy
∗ Corresponding author.
† These authors contributed equally.
gaia.caligiore@unimore.it (G. Caligiore); raffaele.mineo@phd.unict.it (R. Mineo); concetto.spampinato@unict.it (C. Spampinato); egidio.ragonese@unict.it (E. Ragonese); simone.palazzo@unict.it (S. Palazzo); sfontana@unict.it (S. Fontana).
ORCID: 0000-0002-7087-1819 (G. Caligiore); 0000-0002-1171-5672 (R. Mineo); 0000-0001-6653-2577 (C. Spampinato); 0000-0001-6893-7076 (E. Ragonese); 0000-0002-2441-0982 (S. Palazzo); 0000-0003-3083-1676 (S. Fontana)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




stressing the central role of LIS in ensuring that deaf signers received accessible information during challenging times [6], and highlighting the significant communication barriers that deaf individuals faced, especially when in-person interactions were restricted. This increased visibility, along with persistent advocacy by the signing community, played a crucial role in the official recognition of LIS and Tactile LIS (LISt) in May 2021.

Within this evolving societal and linguistic framework, with the increased media visibility of LIS and the introduction of video-capturing tools into daily life, language collection emerges as a central issue. For SLs, the need for comprehensive collections is particularly significant. Unlike oral languages, which in some cases have developed standardized written systems, SLs must rely on video collections to capture signed communication accurately. These videos, whether raw or annotated, are essential for analyzing SLs with both qualitative and quantitative evidence.

2. Automatic Sign Language Recognition

The development and use of preferably annotated SL datasets or corpora are crucial for training and validating automatic recognition models, and access to high-quality data from diverse SLs and cultural contexts enhances the generalizability of these solutions. Comprehensive data collections of this kind ensure that models can effectively understand and process the wide range of linguistic and cultural nuances present in different SLs.

In the domain of automatic sign language recognition (SLR) of LIS, the integration of visual and spatial information presents a complex challenge. As mentioned, LIS operates through the visual-gestural channel. More precisely, it is characterized as multimodal² (signed discourse is comprised of manual and body components) and multilinear (manual and body components are performed simultaneously) [2]. Recent advancements in SLR have been significantly driven by annotated datasets, which serve as the basis for training and validating models [7, 8, 9, 10, 11].

Machine learning technologies, particularly deep learning neural networks, have facilitated the development of more precise and robust models for SL interpretation. These models are able to refine their performance through training on diverse and complex datasets. Additionally, computer vision plays a central role in this field by enabling real-time analysis and interpretation of body and manual components [2], that is, hand movements, facial expressions, and body posture [12, 13, 14, 15].

A significant challenge in applying deep learning and computer vision methods to SLR lies in ensuring the quality and adequacy of training data, which is essential for achieving optimal model performance.

Therefore, in this study, we focus on evaluating the efficacy of the MultiMedaLIS Dataset (Multimodal Medical LIS Dataset) and assessing deep learning models for SLR that interpret isolated signs by integrating diverse data types such as RGB video, depth information, optical flow, and skeletal data.

We benchmark our Dataset with two models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). These models are trained on the MultiMedaLIS Dataset, showcasing how the incorporation of multisource data can enhance the accuracy of sign recognition. This approach aims to test the potential of integrating different data modalities to improve the robustness and performance of SLR systems.

3. State of the Art

In this section, we discuss the state of the art from two perspectives considered during our work on the Dataset: LIS data collection and SLR tools.

3.1. LIS Data Collections

SL researchers in Italy have been actively engaged in the creation of LIS corpora and datasets. This effort involves a complex process of video data collection and annotation, as SL datasets can vary significantly depending on their intended use. Within this context, SL data collections can be categorized into two main types. The first type includes datasets that feature videos depicting continuous signing, capturing the flow and context of natural SL usage. The second type comprises datasets that focus on isolated signs, which are individual signs presented separately from continuous discourse.

The scarcity of available LIS data collections has prompted researchers to develop their own resources. Several smaller-scale LIS corpora have been


² Given our group's interdisciplinarity, we found "multimodal" can mean different things depending on one's background: in linguistics, it refers to the employment of manual and body components while signing, while in computer vision, it means using multiple capturing tools. To differentiate, we use "multisource" for capturing tools. Thus, "multimodal" in this text follows SL linguistics terminology.
independently established, each serving distinct purposes based on the type of data collected.

The methodologies employed for collecting LIS data encompass a diverse array of approaches, ranging from naming tasks to semi-structured and spontaneous interviews with deaf signers, to video recording sessions involving hearing individuals learning LIS as a second language (L2) or second modality (M2) [16]. These collections serve equally diverse purposes, ranging from documenting the language itself to creating tools for automatic translation, highlighting the ongoing commitment of researchers to expand and enrich the available resources for studying LIS [17, 18, 19, 20, 21, 22, 23, 24].

Despite the predominantly private nature of corpora collections, an exception to the accessibility challenge is found in the online dictionary SpreadTheSign, a project originating in 2004. Initially conceived as a dictionary for SLs, SpreadTheSign has evolved into a versatile resource for language documentation [25]. Another significant resource is the Corpus LIS, recognized as the largest collection of spontaneous, semi-structured, and structured videos in LIS by deaf signers. The primary objectives of this corpus were twofold: to collect a substantial quantity of data suitable for quantitative analysis and to establish a comprehensive representation of LIS usage in Italy [26, 27, 28].

3.2. SLR Tools

Like SL data collections, SLR approaches can be broadly classified into two main categories: those that rely on specialized hardware and those that use visual information. The former employ specialized hardware, such as gloves able to capture precise hand movements. While these systems can provide detailed data, they are often considered intrusive and can compromise the natural flow of communication. Additionally, they are unable to capture the full spectrum of SLs, which includes manual and body components. In contrast, vision-based approaches use visual information captured by cameras, including RGB, depth, infrared, or a combination of these. These methods are less intrusive for users, as they do not require the use of special equipment.

In SLR, a challenge lies in effectively capturing both body movements and the specific motions of hands, arms, and face. For instance, [29] introduces a multi-scale, multi-modal framework that focuses on spatial details across different scales. This approach involves each visual modality capturing spatial information uniquely, supported by a system operating at three temporal scales. The training methodology emphasizes precise initialization of individual modalities and progressive fusion via ModDrop, which enhances overall robustness and performance.

Another study proposes an iterative optimization alignment network tailored for weakly supervised continuous SLR [30]. The framework employs a 3D residual convolutional network for feature extraction, complemented by an encoder-decoder architecture featuring LSTM decoders and Connectionist Temporal Classification (CTC).

[31] introduces a 3D convolutional neural network enhanced with an attention module, designed to extract spatiotemporal features directly from raw video data. In contrast, [32] combines bidirectional recurrence and temporal convolutions, emphasizing temporal information's effectiveness in sign tasks, although not covering the full spectrum of movements. Moreover, [33] employs CNNs, a Feature Pooling Module, and LSTM networks to generate distinctive visual representations but falls short in capturing comprehensive movements and signing.

However, as previously noted, RGB-based SLR systems can raise privacy concerns, particularly when processing visual data in cloud environments or for machine learning training [34]. Addressing these issues, radio frequency (RF) sensors have emerged as a promising alternative, ensuring privacy preservation while enabling innovative data representations for SLR. In the literature, deep learning techniques have been applied to various RF modalities such as ultra-wideband (UWB) [35], Doppler [36], continuous wave (CW) [37], micro-Doppler [38], frequency modulated continuous wave (FMCW) [14], multi-antenna systems [39], and millimeter waves [40].

As part of the Dataset discussed in this work, we have also collected RADAR data and are actively analyzing it. However, preliminary results are not available at this time, so they are not included in this report. Currently, RADAR-based solutions have demonstrated robust performance across diverse environmental conditions, highlighting the utility of incorporating this sensor technology in data collection efforts. Nevertheless, many existing RADAR solutions are tailored to recognizing a limited set of signs, highlighting the ongoing challenge of expanding vocabulary recognition capabilities in datasets like the one discussed in the following section.

4. The MultiMedaLIS Dataset

The MultiMedaLIS [41] Dataset was created thanks to the interdisciplinary collaboration established between the Department of Humanities (DISUM) and the Department of Electrical, Electronic and Computer Engineering (DIEEI) of the University of Catania (Unict). It aims to offer a multimodal collection of LIS signs specifically focused on medical contexts.

For the data recording protocol, the DIEEI group developed customized recording software to collect the LIS data, supplemented with a desktop computer and a modified keyboard transformed into a pedal board. This pedal board, equipped with two pedals, allowed hands-free navigation of the software, enabling users to move forward (by pushing on the right pedal) or backward (by pushing on the left pedal) while maintaining a neutral recording position³. During sessions, one of 126 Italian labels or alphabet letters was displayed on a screen, with adjustable display time for preparation and transition from one sign to the next. Each recording started from a neutral position, and the right pedal marked the completion of a sign. If errors occurred, the left pedal allowed re-recording. The software's interface features a color-coded background: yellow for preparation and green for recording. Additionally, it supports flexible data expansion, accepting word lists from text files for easy customization in future collections.

Figure 1: User interface display presented during the recording phase (green) and preparation phase (yellow).

After the recording process, the Dataset included synchronized data capturing facial expressions and hand and body movements, comprising a total of 25,830 sign instances. This includes 205 repetitions of 100 different signs and the 26 signs of the LIS alphabet [41]. Beyond these 26 signs, the signs included in the MultiMedaLIS Dataset can be broadly categorized into two groups [42]: semantically marked signs related to health and health issues, and non-semantically marked signs. It is important to note that while the first group of signs is categorized as semantically marked, this classification does not imply that these signs belong exclusively to a specialized jargon lexicon. The decision to categorize signs as semantically marked was driven by their significance in contexts related to health and medical interactions in the post-pandemic world (hence, when the Dataset was first theorized). However, it was also important to include additional signs that could contribute to constructing meaningful utterances in patient-doctor interactions. During the creation of the MultiMedaLIS Dataset, careful consideration was given to selecting signs that could be combined to form coherent and meaningful utterances.

Regarding the specific form of signs, the MultiMedaLIS Dataset includes a lexicon of standard, isolated signs that are not combined within utterances. These signs reflect forms commonly found in online dictionaries and educational materials. To ensure the accuracy of the data, sign variants performed by a professional LIS interpreter during the collection of a test dataset were compared with the same variants found in the online dictionary SpreadTheSign. This comparison aimed to select documented versions of each sign for inclusion in the Dataset. By incorporating these documented variants, we aimed to enhance its precision, reliability, and real-world applicability. This approach contributed to ensuring that the Dataset aligns with established standards and supports effective research and application in the field of LIS.

When discussing recording tools for state-of-the-art multimodal corpora in the Italian context, such as the Corpus LIS [27] and the CORMIP [43], the emphasis is placed on the portability and non-invasiveness of these tools. This approach ensures minimal interference with the signer's natural environment and activities.

Portable and non-invasive recording tools are chosen specifically for their ability to capture data in familiar, and sometimes domestic, settings without disrupting the signer's surroundings, aiming to maintain the authenticity of the signed interactions and minimize any discomfort or distraction for the participants.

To capture LIS for recognition with minimal invasiveness, we integrated a combination of recording tools. A 60 GHz RADAR sensor, employed to capture detailed manual motion data, provided Time- and Frequency-Domain data and Range Doppler Maps for distinguishing moving objects at 13 fps. For more structured depth and facial recognition data, the Realsense D455 depth camera and Kinect v1 were incorporated. The Realsense D455, equipped with dual infrared cameras and RGB mode, captured depth data at 848x480 pixels and RGB data at 1280x720 pixels, both at 30 fps, enabling the tracking of facial expressions through 68 facial points. The Zed v1 and Zed v2 cameras provided high-resolution stereoscopic data, recording at 1920x1080 pixels and 25 fps, with capabilities for generating depth maps and 3D point clouds. Additionally, the Zed v2 offered tracking for 18 body points in both 2D and 3D [41].
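Since the capture devices run at different frame rates (the RADAR at 13 fps, the Zed cameras at 25 fps, the Realsense at 30 fps), assembling the synchronized multisource instances described above requires placing all streams on a common timeline. The paper does not detail its synchronization method, so the following is only a minimal illustrative sketch, assuming timestamped frames and nearest-timestamp matching; all function names are ours.

```python
import numpy as np

def frame_timestamps(fps: float, duration_s: float) -> np.ndarray:
    """Timestamps (seconds) of frames captured at a constant rate."""
    return np.arange(0.0, duration_s, 1.0 / fps)

def align_to_reference(ref_ts: np.ndarray, stream_ts: np.ndarray) -> np.ndarray:
    """For each reference timestamp, the index of the nearest frame
    in the other stream (both arrays assumed sorted ascending)."""
    idx = np.searchsorted(stream_ts, ref_ts)
    idx = np.clip(idx, 1, len(stream_ts) - 1)
    left = stream_ts[idx - 1]
    right = stream_ts[idx]
    # step back one index where the left neighbour is strictly closer
    idx -= ref_ts - left < right - ref_ts
    return idx

# Frame rates reported for the capture setup, over a 2-second clip
rgb_ts = frame_timestamps(30.0, 2.0)    # Realsense RGB/depth (reference)
radar_ts = frame_timestamps(13.0, 2.0)  # 60 GHz RADAR
zed_ts = frame_timestamps(25.0, 2.0)    # Zed stereo cameras

radar_idx = align_to_reference(rgb_ts, radar_ts)
zed_idx = align_to_reference(rgb_ts, zed_ts)
```

Resampling every stream against one reference clock keeps each multisource sign instance the same length across modalities, at the cost of duplicating frames from the slower sensors (here, each RADAR frame is matched to roughly two or three RGB frames).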


³ The neutral recording position referenced is a seated position in which the user has their arms extended along the sides of the torso, elbows bent at 90°, and palms facing downward [41].
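The pedal-driven recording flow described above (yellow preparation phase, green recording phase, right pedal to confirm a sign and advance, left pedal to discard and re-record) can be sketched as a small state machine. This is a hypothetical reconstruction for illustration, not the DIEEI software itself; the class and method names are our own.

```python
class RecordingSession:
    """Sketch of the pedal-driven recording flow: one label from the sign
    list is displayed at a time, the right pedal accepts the current take
    and advances, the left pedal discards the last take for re-recording."""

    def __init__(self, labels):
        self.labels = list(labels)   # e.g. the 126 Italian labels / letters
        self.position = 0            # index of the label currently displayed
        self.takes = {}              # label -> number of accepted takes
        self.phase = "preparation"   # yellow background

    def start_recording(self):
        self.phase = "recording"     # green background

    def right_pedal(self):
        """Mark the current sign as completed and move to the next label."""
        label = self.labels[self.position]
        self.takes[label] = self.takes.get(label, 0) + 1
        if self.position < len(self.labels) - 1:
            self.position += 1
        self.phase = "preparation"

    def left_pedal(self):
        """Discard the last take of the current label for re-recording."""
        label = self.labels[self.position]
        if self.takes.get(label, 0) > 0:
            self.takes[label] -= 1
        self.phase = "preparation"
```

The point of the design, as the text notes, is that both hands stay free for signing: every transition between preparation and recording is driven by the feet, so the signer never leaves the neutral position to operate the software.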
Figure 2: Combination of synchronized infrared and depth data from the MultiMedaLIS Dataset.

By prioritizing portability and non-invasiveness, high-quality data can still be collected, while respecting the privacy and comfort of the individuals recorded. Anonymization is achieved through the use of the RADAR sensor, which we introduced specifically to address privacy concerns inherent in face-to-face signed communication.

5. Testing the Dataset

The MultiMedaLIS Dataset was designed with the aim of supporting the development of SLR models by enabling the collection and integration of information through various data modalities:

• RGB frames: images extracted from videos.
• Depth data: three-dimensional information for each RGB frame.
• Optical flow: to emphasize movement.
• Skeletal data: face landmarks and body joints.

One of the main components of the Dataset is its RGB frames, which are images extracted from videos. These frames provide a two-dimensional visual representation of the signs performed by the signer, capturing details such as hand positions and facial expressions. The Dataset includes depth data, providing a three-dimensional aspect to the images and allowing for more detailed information on the distance and relative position of elements in the scene. This type of data is particularly useful for understanding the spatial dynamics of signs.

Alongside RGB and depth data, the MultiMedaLIS Dataset also contains optical flow information, which describes the movement between consecutive frames. Optical flow is essential for capturing the direction and speed of movements, providing a more detailed understanding of the transitions between various signs. Finally, the Dataset includes skeletal data, representing face landmarks and body joints, allowing for precise tracking of joint and body segment positions and facilitating the analysis of signs in terms of joint movements.

Managing this multimodal data is an emerging topic in computational linguistics. By combining different sources of information, it is possible to significantly improve the performance of SLR models. For example, integrating depth data with RGB frames can provide a more complete representation of signs, while adding optical flow and skeletal data can further enrich the analysis of the temporal structure of movement. In our view, the MultiMedaLIS Dataset provides a solid foundation for exploring these combinations, allowing researchers to develop more effective and accurate solutions for SLR.

6. Models and Architectures

In the context of automatic SLR, various approaches and model architectures have been tested to leverage the characteristics of multimodal data in the MultiMedaLIS Dataset.

The SL-GCN (Skeleton-Based Graph Convolutional Network) represents a significant innovation in this field. This model generates skeletal data from videos and creates temporal graphs that capture the spatiotemporal relationships between joint movements. Through fine-tuning and the combination of different data streams, SL-GCN has demonstrated high accuracy in sign recognition [44, 45].

Another prominent architecture is the SSTCN (Spatiotemporal Separable Convolutional Network) [46], which excels in feature extraction from videos using HRNet [47]. This approach has shown an accuracy of 96.33%, highlighting its effectiveness in capturing the spatial and temporal dynamics of LIS signs.

RGB frames are crucial for the visual representation of signs. The process of splitting videos into frames, cropping, and normalization optimally prepares the data for analysis by deep learning models. The use of dense optical flow presents significant challenges in sign recognition. Optical flow extraction using the Farneback algorithm [48] led to 56% accuracy, highlighting difficulties in capturing precise details of movements, alongside computational limitations. Depth data encoded with Height, Horizontal disparity, Angle (HHA) represent another crucial resource in the MultiMedaLIS Dataset. Applying HHA encoding to depth frames achieved 88% accuracy using the ResNet(2+1)D architecture [49], substantiating the importance of three-dimensional information in enhancing the understanding and interpretation of signs, offering a more detailed perspective compared to two-dimensional data.

7. Training and Evaluation Procedure

For the training of the models, we employed a multi-stream approach that integrates skeletal, RGB, and depth data to improve sign recognition accuracy. The models were trained on an NVIDIA Tesla T4 16GB GPU using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 8. We applied cross-validation to ensure the robustness of the results, splitting the Dataset into training (70%) and validation (15%) subsets, and used data augmentation techniques, such as color jittering (changing the brightness, contrast, saturation, and hue), to increase the diversity of the training data and improve generalization.

The loss function adopted for training was categorical cross-entropy, appropriate for multi-class classification tasks. The models were trained for a maximum of 100 epochs, with an early stopping criterion set to terminate training if no improvement in validation loss was observed for 10 consecutive epochs. For evaluation, we used a test set comprising 15% of the Dataset, ensuring that the models were tested on unseen data.

optical flow data alone, reaching just 56.31% accuracy, suggesting that while the optical flow provides valuable information on motion, it lacks the richness of spatial features found in RGB and depth data. The HHA-encoded depth data, when processed with the ResNet(2+1)D model, achieved an accuracy of 88.04%, confirming that depth information is complementary, but not as effective as RGB data in isolation.

Table 3
Performance of various methods on the MultiMedaLIS
                                                           Dataset
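As a concrete illustration, the training configuration described above (Adam with a 0.001 learning rate, batch size 8, categorical cross-entropy, color jittering, and early stopping with a 10-epoch patience) can be sketched as follows. This is a minimal sketch assuming PyTorch; the model, dataset objects, and the simplified jitter function are illustrative placeholders, not the code used in our experiments:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def color_jitter(frames, strength=0.2):
    """Toy stand-in for the color-jittering augmentation: randomly scales
    the brightness and contrast of a batch of RGB frames in [0, 1]."""
    b = 1.0 + (torch.rand(1).item() * 2 - 1) * strength  # brightness factor
    c = 1.0 + (torch.rand(1).item() * 2 - 1) * strength  # contrast factor
    mean = frames.mean(dim=(-2, -1), keepdim=True)       # per-channel mean
    return ((frames * b - mean) * c + mean).clamp(0.0, 1.0)

def train(model, train_set, val_set, max_epochs=100, patience=10):
    """Adam (lr=0.001), batch size 8, categorical cross-entropy, and early
    stopping when validation loss fails to improve for `patience` epochs."""
    train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=8)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    criterion = nn.CrossEntropyLoss()  # categorical cross-entropy

    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
    return model
```

Monitoring validation loss rather than accuracy for early stopping is the design choice reflected in our setup: it halts training as soon as the model stops generalizing, before the 100-epoch cap is reached.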
8. Results

    The results demonstrate the models' efficiency in leveraging multi-modal data for improved outcomes. As can be seen in Table 1, the SL-GCN multi-stream model achieved the best accuracy, with a Top-1 accuracy of 97.98% and a Top-5 accuracy of 99.94%, surpassing the performance of models using single data streams such as skeletal joints, bones, or motion alone. This demonstrates the advantage of combining multiple streams of information to capture both the spatial and temporal dynamics of signs.

Table 1
Performance of SL-GCN multi-stream on the test set
     Data           Top-1 Accuracy (%)   Top-5 Accuracy (%)
     Joints         96.24                99.84
     Bones          95.82                99.84
     Joint Motion   90.37                99.15
     Bone Motion    92.69                99.52
     Multi-stream   97.98                99.94

    In Table 2, datasets trained on the SL-GCN model are compared. Our Dataset produced the highest accuracy (97.98%) among the datasets evaluated, outperforming larger datasets such as AUTSL (95.45%).

Table 2
Comparison of different datasets on the SL-GCN model
     Dataset        Number of signs   Accuracy (%)
     MultiMedaLIS   126               97.98
     AUTSL          226               95.45
     ASLLVD         20                61.04
     Alphabet       26                85.19

    Table 3 presents a comparison of different methods across the entire Dataset. The SL-GCN trained on RGB frames achieved the highest accuracy (97.98%), followed by the SSTCN model with 96.33%. The ResNet(2+1)D architecture showed strong performance when applied to RGB frames (97.29%), but struggled when using optical flow data alone, reaching just 56.31% accuracy, suggesting that while optical flow provides valuable information on motion, it lacks the richness of spatial features found in RGB and depth data. The HHA-encoded depth data, when processed with the ResNet(2+1)D model, achieved an accuracy of 88.04%, confirming that depth information is complementary, but not as effective as RGB data in isolation.

Table 3
Performance of various methods on the MultiMedaLIS Dataset
     Methods                     Data    Accuracy (%)
     SL-GCN                      RGB     97.98
     SSTCN                       RGB     96.33
     ResNet(2+1)D Optical Flow   RGB     56.31
     ResNet(2+1)D Frame          RGB     97.29
     ResNet(2+1)D HHA Encoding   Depth   88.04

    The results highlight the importance of combining multiple data modalities, especially RGB and skeletal data, for improving the accuracy and robustness of SLR systems. The performance of the SL-GCN model with multi-stream data shows the model's ability to effectively capture signs, as well as the Dataset's value.

9. Discussion and Conclusion

    In this study, our goal was to demonstrate our first steps in testing the efficacy of the MultiMedaLIS Dataset in contributing to the advancement of the field of SLR through multisource approaches. The integration of RGB frames, depth data, optical flow, and skeletal data has provided a comprehensive basis for developing and evaluating SLR models. Our experiments with the SL-GCN and SSTCN architectures have highlighted advancements in recognizing isolated LIS signs in medical semantic contexts, given the domain of our Dataset.
    The SL-GCN model, trained on skeletal data to construct temporal graphs, achieved high accuracy in capturing the spatiotemporal relationships critical to sign recognition. This approach not only enhances the precision of rendering LIS signs but is also reinforced by a Dataset able to support robust graph-based convolutional networks in multimodal SLR tasks. At the same time, our Dataset proved robust, precise and variable enough for testing the SSTCN model, which focuses on spatiotemporal separable convolutions and revealed robust performance in extracting spatial dynamics from RGB frames.
    Having validated the visual modalities on the mentioned models, we have promising preliminary results on adapting these models to accept RADAR data. We plan to extract the pre-trained RADAR data processing module and use it independently during inference. This approach will eliminate the need for RGB visual data. Furthermore, we plan to expand the Dataset by applying the same protocol with 10 deaf signers. This will effectively increase the current Dataset, enhancing generalizability across different signers. Our goal is to develop an autonomous, resource-constrained system (thanks to the exclusion of RGB data) that operates on-edge or even offline. This cost-effective solution can be used in emergency contexts where direct access to interpreting is not available.
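The planned extraction of the pre-trained RADAR processing module can be sketched in simplified form as follows. This is a minimal PyTorch illustration under assumed class and attribute names (MultiStreamModel, radar_branch); the actual multi-stream architecture differs, though the 126 output classes match the number of signs in our Dataset:

```python
import torch
import torch.nn as nn

class MultiStreamModel(nn.Module):
    """Toy stand-in for a multi-stream SLR model with a RADAR branch."""
    def __init__(self, radar_dim=64, rgb_dim=128, num_classes=126):
        super().__init__()
        self.radar_branch = nn.Sequential(nn.Linear(radar_dim, 32), nn.ReLU())
        self.rgb_branch = nn.Sequential(nn.Linear(rgb_dim, 32), nn.ReLU())
        self.classifier = nn.Linear(64, num_classes)  # fuses both streams

    def forward(self, radar, rgb):
        feats = torch.cat([self.radar_branch(radar), self.rgb_branch(rgb)], dim=-1)
        return self.classifier(feats)

class RadarOnlyModel(nn.Module):
    """Standalone inference path that reuses only the pre-trained RADAR
    branch, with a new classification head in place of the fused classifier."""
    def __init__(self, trained, num_classes=126):
        super().__init__()
        self.radar_branch = trained.radar_branch  # shares pre-trained weights
        self.head = nn.Linear(32, num_classes)    # no RGB input required

    def forward(self, radar):
        return self.head(self.radar_branch(radar))
```

Because the extracted branch shares its parameters with the trained multi-stream model, it can be exported on its own (e.g., via TorchScript) for on-edge deployment without the RGB pipeline.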
References

[1] W. Stokoe, Sign language structure: an outline of the visual communication systems of the American deaf, University of Buffalo, Buffalo, New York, 1960.
[2] V. Volterra, M. Roccaforte, A. Di Renzo, S. Fontana, Italian Sign Language from a Cognitive and Socio-semiotic Perspective. Implications for a general language theory, John Benjamins Publishing Company, Amsterdam-Philadelphia, 2022.
[3] M. Montanini, M. Facchini, L. Fruggeri, Dal Gesto al Gesto: il bambino sordo tra gesto e parola, Cappelli, Bologna, 1979.
[4] V. Volterra, I segni come le parole: la comunicazione dei sordi, Boringhieri, Torino, 1981.
[5] S. Fontana, S. Corazza, P. Boyes-Braem, V. Volterra, Language research and language community change: Italian Sign Language (LIS) 1981-2013, in volume 236 of the International Journal of the Sociology of Language, 2015.
[6] E. Tomasuolo, T. Gulli, V. Volterra, S. Fontana, The Italian Deaf Community at the Time of Coronavirus, in volume 5 of Frontiers in Sociology, 2021.
[7] D. Li, C. R. Opazo, X. Yu, H. Li, Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison, in Proceedings of the 2020 IEEE WACV, Snowmass, CO, USA, 2020, pp. 1448-1458.
[8] O. Mercanoglu Sincan, H. Yalim Keles, AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods, IEEE Access, 2020. https://doi.org/10.48550/arXiv.2008.00932
[9] H. R. Vaezi Joze, O. Koller, MS-ASL: A large-scale data set and benchmark for understanding American sign language, arXiv preprint, 2018.
[10] U. von Agris, M. Knorr, K. F. Kraiss, The significance of facial features for automatic sign language recognition, in Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands, 2008, pp. 1-6.
[11] S. Tornay, O. Aran, M. Magimai Doss, An HMM Approach with Inherent Model Selection for Sign Language and Gesture Recognition, in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 6049-6056.
[12] Y. Chen, C. Shen, X.-S. Wei, L. Liu, J. Yang, Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation, in Proceedings of the 2017 IEEE ICCV, 2017, pp. 1221-1230.
[13] E. Barsoum, C. Zhang, C. Canton Ferrer, Z. Zhang, Training deep networks for facial expression recognition with crowd-sourced label distribution, in Proceedings of the 18th ACM ICMI, 2016, pp. 279-283.
[14] Y. Wang, A. Ren, M. Zhou, W. Wang, X. Yang, A Novel Detection and Recognition Method for Continuous Hand Gesture Using FMCW Radar, in volume 8 of IEEE Access, 2020, pp. 167264-167275.
[15] O. Yusuf, M. Habib, M. Moustafa, Real-time hand gesture recognition: Integrating skeleton-based data fusion and multi-stream CNN, 2024.
[16] A. Cardinaletti, L. Mantovan, Le Lingue dei Segni nel 'Volume Complementare' e l'Insegnamento della LIS nelle Università Italiane, 2, volume 14 of Italiano Lingua Seconda. Rivista internazionale di linguistica italiana e educazione linguistica, 2022, pp. 113-128.
[17] T. Russo Cardona, Iconicity and Productivity in Sign Language Discourse: An Analysis of Three LIS Discourse Registers, 2, volume 4 of Sign Language Studies, 2004, pp. 164-197.
[18] A. Ricci, C. Bonsignori, A. Di Renzo, Che giorno è oggi? Prime analisi e riflessioni sull'espressione del tempo in LIS [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[19] E. Fornasiero, La morfologia valutativa in LIS: una descrizione preliminare [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[20] A. Di Renzo, A. Slonimska, L'uso delle Strutture di Grande Iconicità nei testi narrativi segnati: primi dati su bambini prescolari, scolari e adulti [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[21] S. R. Conte, Nomi di persona e di luogo nella comunità sorda in Italia: interviste, analisi e primi risultati [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[22] S. Fontana, E. Raniolo, Interazioni tra oralità e unità segniche: uno studio sulle labializzazioni nella Lingua dei Segni Italiana (LIS), in: G. Schneider, M. Janner, B. Élie (Eds.), Proceedings of the VII Dies Romanicus Turicensis, Peter Lang, Bern, 2015, pp. 241-258.
[23] V. Cuccio, G. Di Stasio, S. Fontana, On the Embodiment of Negation in Italian Sign Language: An Approach Based on Multiple Representation Theories, in volume 1 of Frontiers in Psychology, 2022.
[24] S. Fontana, Grammar and Experience: The Interplay Between Language Awareness and Attitude in Italian Sign Language (LIS), 5, volume 14 of the International Journal of Linguistics, 2022, pp. 1-18.
[25] M. Hilzensauer, K. Krammer, A multilingual dictionary for sign languages: 'SpreadTheSign', in Proceedings of ICERI, Seville, 2015.
[26] C. Cecchetto, S. Giudice, E. Mereghetti, La raccolta del Corpus LIS, in: A. Cardinaletti, C. Cecchetto, C. Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan, 2011, pp. 55-68.
[27] C. Geraci, K. Battaglia, A. Cardinaletti, C. Cecchetto, C. Donati, S. Giudice, E. Mereghetti, The LIS Corpus Project, in volume 11 of Sign Language Studies, 2011, pp. 528-571.
[28] M. Santoro, F. Poletti, L'Annotazione del Corpus, in: A. Cardinaletti, C. Cecchetto, C. Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan, 2011, pp. 69-78.
[29] N. Neverova, C. Wolf, G. Taylor, F. Nebout, ModDrop: Adaptive Multi-Modal Gesture Recognition, in volume 8 of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016, pp. 1692-1706.
[30] J. Pu, W. Zhou, H. Li, Iterative alignment network for continuous sign language recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4165-4174.
[31] J. Huang, W. Zhou, H. Li, W. Li, Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition, in volume 29 of IEEE Transactions on Circuits and Systems for Video Technology, 2019, pp. 2822-2832.
[32] D. Bragg, T. Verhoef, C. Vogler, M. Morris, O. Koller, M. Bellard, L. Berke, P. Boudreault, A. Braffort, N. Caselli, M. Huenerfauth, H. Kacorri, Sign language recognition, generation, and translation: An interdisciplinary perspective, in Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, 2019, pp. 16-31.
[33] O. Mercanoglu Sincan, A. O. Tur, H. Yalim Keles, Isolated Sign Language Recognition with Multi-scale Features using LSTM, in Proceedings of the 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 2019, pp. 1-4.
[34] S. Z. Gurbuz, A. C. Gurbuz, E. A. Malaia, D. J. Griffin, C. Crawford, M. M. Rahman, R. Aksu, E. Kurtoglu, R. Mdrafi, A. Anbuselvam, T. Macks, E. Ozcelik, A linguistic perspective on radar micro-doppler analysis of American sign language, in Proceedings of the 2020 IEEE International Radar Conference (RADAR), Washington, DC, USA, 2020, pp. 232-237.
[35] B. Li, Sign language/gesture recognition based on cumulative distribution density features using UWB radar, in volume 70 of IEEE TIM, 2021, pp. 1-13.
[36] H. Kulhandjian, Sign language gesture recognition using Doppler radar and deep learning, in Proceedings of the 2019 IEEE Globecom Workshops (GC Wkshps), Waikoloa, HI, USA, 2019, pp. 1-6.
[37] Y. Lu, Y. Lang, Sign language recognition with CW radar and machine learning, in Proceedings of the 21st International Radar Symposium (IRS), Warsaw, Poland, 2020, pp. 31-34.
[38] J. McCleary, Sign language recognition using micro-doppler and explainable deep learning, in volume 139 of Computer Modeling in Engineering & Sciences, 2024, pp. 2399-2450.
[39] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, in volume 39 of IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2016, pp. 1137-1149.
[40] O. O. Adeoluwa, S. J. Kearney, E. Kurtoglu, C. J. Connors, S. Z. Gurbuz, Near real-time ASL recognition using a millimeter wave radar, in Proceedings of Volume 11742 of Radar Sensor Technology XXV, SPIE, 2021.
[41] R. Mineo, G. Caligiore, C. Spampinato, S. Fontana, S. Palazzo, E. Ragonese, Sign Language Recognition for Patient-Doctor Communication: A Multimedia/Multimodal Dataset, in Proceedings of the IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI), 2024.
[42] G. Caligiore, Codifying the body: exploring the cognitive and socio-semiotic framework in building a multimodal Italian sign language (LIS) dataset [Ph.D. thesis], University of Catania, Catania, 2024.
[43] L. Lo Re, Corpus Multimodale dell'Italiano Parlato: basi metodologiche per la creazione di un prototipo [Ph.D. thesis], University of Florence, Florence, 2022.
[44] C. Correia de Amorim, C. Macedo, C. Zanchettin, Spatial-Temporal Graph Convolutional Networks for Sign Language Recognition, in Proceedings of the 2019 International Conference on Artificial Neural Networks, Munich, Germany, 2019, pp. 646-657.
[45] A. F. Nafis, N. Suciati, Sign language recognition on video data based on graph convolutional network, 18, volume 99 of Journal of Theoretical and Applied Information Technology, 2023, pp. 4323-4333.
[46] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, Y. Fu, Skeleton aware multi-modal sign language recognition, in Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 5693-5703.
[47] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693-5703.
[48] G. Farneback, Two-frame motion estimation based on polynomial expansion, in volume 2749 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2003, pp. 363-370.
[49] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal convolutions for action recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450-6459.