=Paper=
{{Paper
|id=Vol-3878/16_main_long
|storemode=property
|title=Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset
|pdfUrl=https://ceur-ws.org/Vol-3878/16_main_long.pdf
|volume=Vol-3878
|authors=Gaia Caligiore,Raffaele Mineo,Concetto Spampinato,Egidio Ragonese,Simone Palazzo,Sabina Fontana
|dblpUrl=https://dblp.org/rec/conf/clic-it/CaligioreMSRPF24
}}
==Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset==

Gaia Caligiore∗†1, Raffaele Mineo†2, Concetto Spampinato2, Egidio Ragonese2, Simone Palazzo2, Sabina Fontana2

1 University of Modena and Reggio Emilia, Italy.
2 University of Catania, Italy.

∗ Corresponding author. † These authors contributed equally.
Abstract
Given the status of sign languages as unwritten visual-gestural languages, research on their automatic recognition has increasingly implemented multisource capturing tools for data collection and processing. This paper explores advancements in Italian Sign Language (LIS) recognition using a multimodal dataset in the medical domain: the MultiMedaLIS Dataset. We investigate the integration of RGB frames, depth data, optical flow, and skeletal information to develop and evaluate two computational models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). RADAR data was collected but not included in the testing phase. Our experiments validate the effectiveness of these models in enhancing the accuracy and robustness of isolated LIS sign recognition. Our findings highlight the potential of multisource approaches in computational linguistics to improve linguistic accessibility and inclusivity for members of the signing community.
Keywords
Italian Sign Language, Sign Language Recognition, Deep Learning, Computer Vision
1. Introduction

Italian Sign Language (LIS, Lingua dei Segni Italiana) is the primary means of communication within the Italian signing community. Due to their visual-gestural modality, sign languages (SLs) were initially not considered fully-fledged linguistic systems. However, since the 1960s, beginning with Stokoe's pioneering works [1], the contemporary study of SLs has evolved into a robust field of research. Over the past half-century, significant societal and scientific advancements have transformed the perception and status of SLs, now recognized as natural and complete languages that have received legal recognition in many countries.

In the Italian context, the study of signed communication began in the early 1980s, involving both hearing and deaf researchers. At that time, what we now call LIS was still mostly unnamed and was often referred to as 'mime' or 'gesture' by signers and non-signers alike [2]. The first significant publications on LIS [3] [4], along with the collaborative efforts of deaf and hearing researchers, initiated a transformative period in SL research in the Italian context [5]. This shift in perspective was influenced by factors beyond the language itself, such as increased meta-linguistic awareness and greater visibility of the community and its language to the wider public. In fact, from a societal perspective, the visibility of SL in Italy, especially in media, has significantly changed with technological advancements, mirroring global trends.

In the late 1980s, Italy introduced subtitles in movies on television, marking a step toward content accessibility. The importance of media accessibility, through subtitles or LIS interpreting, was accentuated during the COVID-19 pandemic. The need for equitable access to critical information for deaf individuals became evident, with efforts born within the community
stressing the central role of LIS in ensuring that deaf signers received accessible information during challenging times [6], and highlighting the significant communication barriers that deaf individuals face, especially when in-person interactions were restricted. This increased visibility, along with persistent advocacy by the signing community, played a crucial role in the official recognition of LIS and Tactile LIS (LISt) in May 2021.

Within this evolving societal and linguistic framework, marked by the increased media visibility of LIS and the introduction of video-capturing tools into daily life, language collection emerges as a central issue. For SLs, the need for comprehensive collections is particularly significant. Unlike oral languages, which in some cases have developed standardized written systems, SLs must rely on video collections to capture signed communication accurately. These videos, whether raw or annotated, are essential for analyzing SLs with both qualitative and quantitative evidence.
2. Automatic Sign Language Recognition

The development and use of preferably annotated SL datasets or corpora are crucial for training and validating automatic recognition models, and access to high-quality data from diverse SLs and cultural contexts enhances the generalizability of these solutions. Comprehensive data collections of this kind ensure that models can effectively understand and process the wide range of linguistic and cultural nuances present in different SLs.

In the domain of automatic sign language recognition (SLR) of LIS, the integration of visual and spatial information presents a complex challenge. As mentioned, LIS operates through the visual-gestural channel. More precisely, it is characterized as multimodal² (signed discourse comprises manual and body components) and multilinear (manual and body components are performed simultaneously) [2]. Recent advancements in SLR have been significantly driven by annotated datasets, which serve as the basis for training and validating models [7, 8, 9, 10, 11]. Machine learning technologies, particularly deep learning neural networks, have facilitated the development of more precise and robust models for SL interpretation. These models are able to refine their performance through training on diverse and complex datasets. Additionally, computer vision plays a central role in this field by enabling real-time analysis and interpretation of body and manual components [2], that is, hand movements, facial expressions, and body posture [12, 13, 14, 15].

² Given our group's interdisciplinarity, we found that "multimodal" can mean different things depending on one's background: in linguistics, it refers to the employment of manual and body components while signing, while in computer vision, it means using multiple capturing tools. To differentiate, we use "multisource" for capturing tools. Thus, "multimodal" in this text follows SL linguistics terminology.

A significant challenge in applying deep learning and computer vision methods to SLR lies in ensuring the quality and adequacy of training data, which is essential for achieving optimal model performance.

Therefore, in this study, we focus on evaluating the efficacy of the MultiMedaLIS Dataset (Multimodal Medical LIS Dataset) and assessing various deep learning models for SLR, which employ advanced deep learning techniques to interpret isolated signs by integrating diverse data types such as RGB video, depth information, optical flow, and skeletal data.

We benchmark our Dataset with two models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). These models are trained on the MultiMedaLIS Dataset, showcasing how the incorporation of multisource data can enhance the accuracy of sign recognition. This approach aims at testing the potential of integrating different data modalities to improve the robustness and performance of SLR systems.

3. State of the Art

In this section, we discuss the state of the art from two perspectives considered during our work on the Dataset: LIS data collection and SLR tools.

3.1. LIS Data Collections

SL researchers in Italy have been actively engaged in the creation of LIS corpora and datasets. This effort involves a complex process of video data collection and annotation, as SL datasets can vary significantly depending on their intended use. Within this context, SL data collections can be categorized into two main types. The first type includes datasets that feature videos depicting continuous signing, capturing the flow and context of natural SL usage. The second type comprises datasets that focus on isolated signs, which are individual signs presented separately from continuous discourse.

The scarcity of available LIS data collections has prompted researchers to develop their own resources. Several smaller-scale LIS corpora have been
independently established, each serving distinct purposes based on the type of data collected.

The methodologies employed for collecting LIS data encompass a diverse array of approaches, ranging from naming tasks to semi-structured and spontaneous interviews with deaf signers, to video recording sessions involving hearing individuals learning LIS as a second language (L2) or second modality (M2) [16]. These documentations serve equally diverse purposes, ranging from documenting the language itself to creating tools for automatic translation, highlighting the ongoing commitment of researchers to expand and enrich the available resources for studying LIS [17, 18, 19, 20, 21, 22, 23, 24].

Despite the predominantly private nature of corpora collections, an exception to the accessibility challenge is found in the online dictionary SpreadTheSign, a project originating in 2004. Initially conceived as a dictionary for SLs, SpreadTheSign has evolved into a versatile resource for language documentation [25]. Another significant resource is the Corpus LIS, recognized as the largest collection of spontaneous, semi-structured, and structured videos in LIS by deaf signers. The primary objectives of this corpus were twofold: to collect a substantial quantity of data suitable for quantitative analysis and to establish a comprehensive representation of LIS usage in Italy [26, 27, 28].

3.2. SLR Tools

Like SL data collections, SLR approaches can be broadly classified into two main categories: those that rely on specialized hardware and those that use visual information. The former employ specialized hardware, such as gloves able to capture precise hand movements. While these systems can provide detailed data, they are often considered intrusive and can compromise the natural flow of communication. Additionally, they are unable to capture the full spectrum of SLs, which includes manual and body components. In contrast, vision-based approaches use visual information captured by cameras, including RGB, depth, infrared, or a combination of these. These methods are less intrusive for users, as they do not require the use of special equipment.

In SLR, a challenge lies in effectively capturing both body movements and the specific motions of hands, arms, and face. For instance, [29] introduces a multi-scale, multi-modal framework that focuses on spatial details across different scales. This approach involves each visual modality capturing spatial information uniquely, supported by a system operating at three temporal scales. The training methodology emphasizes precise initialization of individual modalities and progressive fusion via ModDrop, which enhances overall robustness and performance.

Another study proposes an iterative optimization alignment network tailored for weakly supervised continuous SLR [30]. The framework employs a 3D residual convolutional network for feature extraction, complemented by an encoder-decoder architecture featuring LSTM decoders and Connectionist Temporal Classification (CTC).

[31] introduces a 3D convolutional neural network enhanced with an attention module, designed to extract spatiotemporal features directly from raw video data. In contrast, [32] combines bidirectional recurrence and temporal convolutions, emphasizing the effectiveness of temporal information in sign tasks, although not covering the full spectrum of movements. Moreover, [33] employs CNNs, a Feature Pooling Module, and LSTM networks to generate distinctive visual representations but falls short in capturing comprehensive movements and signing.

However, as previously noted, RGB-based SLR systems can raise privacy concerns, particularly when processing visual data in cloud environments or for machine learning training [34]. Addressing these issues, radio frequency (RF) sensors have emerged as a promising alternative, ensuring privacy preservation while enabling innovative data representations for SLR. In the literature, deep learning techniques have been applied to various RF modalities such as ultra-wideband (UWB) [35], Doppler [36], continuous wave (CW) [37], micro-Doppler [38], frequency modulated continuous wave (FMCW) [14], multi-antenna systems [39], and millimeter waves [40].

As part of the Dataset discussed in this work, we have also collected RADAR data and are actively analyzing it. However, preliminary results are not available at this time, so they are not included in this report. Currently, RADAR-based solutions have demonstrated robust performance across diverse environmental conditions, highlighting the productivity of incorporating this sensor technology in data collection efforts. Nevertheless, many existing RADAR solutions are tailored to recognizing a limited set of signs, highlighting the ongoing challenge of expanding vocabulary recognition capabilities in datasets like the one discussed in the following section.
4. The MultiMedaLIS Dataset

The MultiMedaLIS Dataset [41] was created thanks to the interdisciplinary collaboration established between the Department of Humanities (DISUM) and the Department of Electrical, Electronic and Computer Engineering (DIEEI) of the University of Catania (Unict). It aims to offer a multimodal collection of LIS signs specifically focused on medical contexts.

For the data recording protocol, the DIEEI group developed customized recording software to collect the LIS data, supplemented with a desktop computer and a modified keyboard transformed into a pedal board. This pedal board, equipped with two pedals, allowed hands-free navigation of the software, enabling users to move forward (by pushing on the right pedal) or backward (by pushing on the left pedal) while maintaining a neutral recording position³. During sessions, one of 126 Italian labels or alphabet letters was displayed on a screen, with adjustable display time for preparation and transition from one sign to the other. Each recording started from a neutral position, and the right pedal marked the completion of a sign. If errors occurred, the left pedal allowed re-recording. The software's interface features a color-coded background: yellow for preparation and green for recording. Additionally, it supports flexible data expansion, accepting word lists from text files for easy customization in future collections.

³ The neutral recording position referenced is a seated position in which the user has their arms extended along the sides of the torso, elbows bent at 90°, and palms facing downward [41].
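As an illustration of this protocol, the following minimal sketch models the session logic described above. It is not the authors' recording software: pedal events are simulated with keyboard input, and clip capture is reduced to bookkeeping; only the state machine (yellow preparation phase, green recording phase, right pedal to confirm, left pedal to re-record) reflects the text.

```python
# Minimal sketch of the pedal-driven recording protocol (illustrative only).
# 'r' stands in for the right pedal (sign completed), 'l' for the left pedal
# (discard the current clip and re-record the same label).

from dataclasses import dataclass, field

@dataclass
class Session:
    labels: list                            # e.g. loaded from a text file
    recorded: list = field(default_factory=list)

    def run(self) -> None:
        i = 0
        while i < len(self.labels):
            label = self.labels[i]
            print(f"[YELLOW] Prepare sign: {label}")
            print(f"[GREEN]  Recording '{label}'...")
            pedal = input("pedal (r = accept / l = redo): ").strip().lower()
            if pedal == "r":                # right pedal: sign completed
                self.recorded.append(label)
                i += 1
            elif pedal == "l":              # left pedal: discard, re-record
                print(f"Re-recording '{label}'")
            # any other input: stay on the same label

if __name__ == "__main__":
    # Word lists come from plain text files, one label per line.
    Session(labels=["FEBBRE", "DOLORE", "A"]).run()
```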
Figure 1: User interface display presented during the recording phase (green) and preparation phase (yellow).

After the recording process, the Dataset comprised synchronized data capturing facial expressions and hand and body movements, for a total of 25,830 sign instances. This includes 205 repetitions of 100 different signs and the 26 signs of the LIS alphabet [41]. Beyond these 26 signs, the signs included in the MultiMedaLIS Dataset can be broadly categorized into two groups [42]: semantically marked signs related to health and health issues, and non-semantically marked signs. It is important to note that while the first group of signs is categorized as semantically marked, this classification does not imply that these signs belong exclusively to a specialized jargon lexicon. The decision to categorize signs as semantically marked was driven by their significance in contexts related to health and medical interactions in the post-pandemic world (hence, when the Dataset was first theorized). However, it was also important to include additional signs that could contribute to constructing meaningful utterances in patient-doctor interactions. During the creation of the MultiMedaLIS Dataset, careful consideration was given to selecting signs that could be combined to form coherent and meaningful utterances.

Regarding the specific form of signs, the MultiMedaLIS Dataset includes a lexicon of standard, isolated signs that are not combined within utterances. These signs reflect forms commonly found in online dictionaries and educational materials. To ensure the accuracy of the data, sign variants performed by a professional LIS interpreter during the collection of a test dataset were compared with the same variants found in the online dictionary SpreadTheSign. This comparison aimed to select documented versions of each sign for inclusion in the Dataset. By incorporating these documented variants, we aimed to enhance its precision, reliability, and real-world applicability. This approach contributed to ensuring that the Dataset aligns with established standards and supports effective research and application in the field of LIS.

When discussing recording tools for state-of-the-art multimodal corpora in the Italian context, such as the Corpus LIS [27] and the CORMIP [43], the emphasis is placed on the portability and non-invasiveness of these tools. This approach ensures minimal interference with the signer's natural environment and activities. Portable and non-invasive recording tools are chosen specifically for their ability to capture data in familiar, and sometimes domestic, settings without disrupting the signer's surroundings, aiming to maintain the authenticity of the signed interactions and minimize any discomfort or distraction for the participants.

To capture LIS for recognition with minimal invasiveness, we integrated a combination of recording tools. A 60GHz RADAR sensor, employed to capture detailed manual motion data, provided Time- and Frequency-Domain data and Range Doppler Maps for distinguishing moving objects at 13 fps. For more structured depth and facial recognition data, the Realsense D455 depth camera and Kinect v1 were incorporated. The Realsense D455, equipped with dual infrared cameras and RGB mode, captured depth data at 848x480 pixels and RGB data at 1280x720 pixels, both at 30 fps, enabling the tracking of facial expressions through 68 facial points. The Zed v1 and Zed v2 cameras provided high-resolution stereoscopic data, recording at 1920x1080 pixels and 25 fps, with capabilities for generating depth maps and 3D point clouds. Additionally, the Zed v2 offered tracking for 18 body points in both 2D and 3D [41].
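For illustration, the capture rig described above can be summarized as a configuration table. The `CaptureStream` structure below is hypothetical; only the numeric values mirror the specifications reported in the text.

```python
# Illustrative summary of the multisource capture rig. The class is a
# hypothetical convenience, not part of any recording software; resolutions
# and frame rates are the ones reported in the text.

from __future__ import annotations
from dataclasses import dataclass

@dataclass(frozen=True)
class CaptureStream:
    device: str
    modality: str
    resolution: tuple[int, int] | None   # (width, height); None for RADAR
    fps: float
    notes: str = ""

RIG = [
    CaptureStream("60GHz RADAR", "time/frequency-domain + range-Doppler maps",
                  None, 13, "privacy-preserving motion data"),
    CaptureStream("RealSense D455", "depth", (848, 480), 30),
    CaptureStream("RealSense D455", "RGB", (1280, 720), 30,
                  "68 facial landmarks tracked"),
    CaptureStream("Zed v1/v2", "stereoscopic RGB", (1920, 1080), 25,
                  "depth maps and 3D point clouds"),
    CaptureStream("Zed v2", "skeleton", (1920, 1080), 25,
                  "18 body points in 2D and 3D"),
]

for s in RIG:
    print(f"{s.device:15s} {s.modality:45s} {s.resolution} @ {s.fps} fps")
```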
Figure 2: Combination of synchronized infrared and depth data from the MultiMedaLIS Dataset.

By prioritizing portability and non-invasiveness, high-quality data can still be collected, while respecting the privacy and comfort of the individuals recorded. Anonymization is achieved through the use of the RADAR sensor, which we introduced specifically to address privacy concerns inherent in face-to-face signed communication.

5. Testing the Dataset

The MultiMedaLIS Dataset was designed with the aim of supporting the development of SLR models by enabling the collection and integration of information through various data modalities:

• RGB frames: images extracted from videos.
• Depth data: three-dimensional information for each RGB frame.
• Optical flow: to emphasize movement.
• Skeletal data: face landmarks and body joints.
One of the main components of the Dataset are RGB frames, which are images extracted from videos. These frames provide a two-dimensional visual representation of the signs performed by the signer, capturing details such as hand positions and facial expressions. The Dataset also includes depth data, providing a three-dimensional aspect to the images and allowing for more detailed information on the distance and relative position of elements in the scene. This type of data is particularly useful for understanding the spatial dynamics of signs.

Alongside RGB and depth data, the MultiMedaLIS Dataset also contains optical flow information, which describes the movement between consecutive frames. Optical flow is essential for capturing the direction and speed of movements, providing a more detailed understanding of the transitions between various signs. Finally, the Dataset includes skeletal data, representing face landmarks and body joints, allowing for precise tracking of joint and body segment positions and facilitating the analysis of signs in terms of joint movements.

Managing this multimodal data is an emerging topic in computational linguistics. By combining different sources of information, it is possible to significantly improve the performance of SLR models. For example, integrating depth data with RGB frames can provide a more complete representation of signs, while adding optical flow and skeletal data can further enrich the analysis of movement's temporal structure. In our view, the MultiMedaLIS Dataset provides a solid foundation for exploring these combinations, allowing researchers to develop more effective and accurate solutions for SLR.
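To make this multimodal structure concrete, the sketch below shows one plausible in-memory layout for a single sign instance. The array shapes, the joint count, and the gloss label are illustrative assumptions, not the Dataset's actual storage format.

```python
# Illustrative in-memory layout for ONE isolated-sign instance combining the
# four modalities listed above. Shapes are assumptions for illustration
# (T frames, H x W pixels, J tracked points).

import numpy as np

T, H, W, J = 64, 480, 848, 86            # hypothetical: 68 face + 18 body points

sample = {
    "label": "FEVER",                               # gloss of the sign
    "rgb":   np.zeros((T, H, W, 3), np.uint8),      # RGB frames
    "depth": np.zeros((T, H, W),    np.uint16),     # per-pixel depth
    "flow":  np.zeros((T - 1, H, W, 2), np.float32),  # dense flow (dx, dy)
    "skeleton": np.zeros((T, J, 3), np.float32),    # (x, y, z) per point
}

# A recognition model can consume any subset of these streams; the
# experiments below combine them to trade off accuracy, privacy, and compute.
for name, arr in sample.items():
    if hasattr(arr, "shape"):
        print(f"{name:9s} {arr.shape}")
```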
6. Models and Architectures

In the context of automatic SLR, various approaches and model architectures have been tested to leverage the characteristics of multimodal data in the MultiMedaLIS Dataset.

The SL-GCN (Skeleton-Based Graph Convolutional Network) represents a significant innovation in this field. This model generates skeletal data from videos and creates temporal graphs that capture the spatiotemporal relationships between joint movements. Through fine-tuning and the combination of different data streams, SL-GCN has demonstrated high accuracy in sign recognition [44] [45].
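Multi-stream skeleton pipelines in the SL-GCN family commonly derive the additional streams evaluated later in Table 1 (bones, joint motion, bone motion) from the raw joint coordinates. The sketch below illustrates that standard derivation; the five-joint parent hierarchy is a toy example, not the model's actual joint graph.

```python
# Common derivation of the four skeleton streams reported in Table 1.
# Joints: raw coordinates; Bones: joint minus its parent joint;
# Joint/Bone Motion: frame-to-frame differences. The 5-joint parent
# hierarchy below is a toy example only.

import numpy as np

def bone_stream(joints: np.ndarray, parents: list) -> np.ndarray:
    """joints: (T, J, C); bones[t, j] = joints[t, j] - joints[t, parent(j)]."""
    bones = np.zeros_like(joints)
    for j, p in enumerate(parents):
        bones[:, j] = joints[:, j] - joints[:, p] if p >= 0 else joints[:, j]
    return bones

def motion_stream(x: np.ndarray) -> np.ndarray:
    """Temporal difference; zero-padded so the stream keeps T frames."""
    m = np.zeros_like(x)
    m[:-1] = x[1:] - x[:-1]
    return m

T, J, C = 64, 5, 3
parents = [-1, 0, 1, 0, 3]                 # toy hierarchy rooted at joint 0
joints = np.random.rand(T, J, C).astype(np.float32)

streams = {
    "joints": joints,
    "bones": bone_stream(joints, parents),
    "joint_motion": motion_stream(joints),
    "bone_motion": motion_stream(bone_stream(joints, parents)),
}
# Each stream feeds its own GCN; per-stream scores are then fused.
```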
Another prominent architecture is the SSTCN (Spatiotemporal Separable Convolutional Network) [46], which excels in feature extraction from videos using HRNet [47]. This approach has shown an accuracy of 96.33%, highlighting its effectiveness in capturing the spatial and temporal dynamics of LIS signs.

RGB frames are crucial for the visual representation of signs. The process of splitting videos into frames, cropping, and normalization optimally prepares the data for analysis by deep learning models. The use of dense optical flow, by contrast, presents significant challenges in sign recognition. Optical flow extraction using the Farneback algorithm [48] led to 56% accuracy, highlighting difficulties in capturing precise details of movements, alongside computational limitations.
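For reference, a minimal sketch of dense Farneback flow extraction with OpenCV follows. The clip path is a placeholder, and the parameter values are commonly used OpenCV defaults, not necessarily the settings used for the Dataset.

```python
# Minimal dense optical-flow extraction with the Farneback algorithm [48],
# as implemented in OpenCV. "clip.mp4" is a placeholder path.

import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, prev = cap.read()
assert ok, "could not read the first frame"
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Returns an (H, W, 2) array of per-pixel (dx, dy) displacements.
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    flows.append(flow)
    prev_gray = gray

cap.release()
print(f"extracted {len(flows)} flow fields")
```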
Depth data encoded with Height, Horizontal disparity, Angle (HHA) represent another crucial resource in the MultiMedaLIS Dataset. Applying HHA encoding to depth frames achieved 88% accuracy using the ResNet(2+1)D architecture [49], substantiating the importance of three-dimensional information in enhancing the understanding and interpretation of signs and offering a more detailed perspective compared to two-dimensional data.

7. Training and Evaluation Procedure

For the training of the models, we employed a multi-stream approach that integrates skeletal, RGB, and depth data to improve sign recognition accuracy. The models were trained on an NVIDIA Tesla T4 16GB GPU using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 8. We applied cross-validation to ensure the robustness of the results, splitting the Dataset into training (70%) and validation (15%) subsets, and used data augmentation techniques, such as color jittering (changing the brightness, contrast, saturation, and hue), to increase the diversity of the training data and improve generalization.

The loss function adopted for training was categorical cross-entropy, appropriate for multi-class classification tasks. The models were trained for a maximum of 100 epochs, with an early stopping criterion set to terminate training if no improvement in validation loss was observed for 10 consecutive epochs. For evaluation, we used a test set comprising 15% of the Dataset, ensuring that the models were tested on unseen data.
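A condensed PyTorch sketch of this regime follows. `model`, `train_set`, and `val_set` are placeholders for the architectures and modality subsets discussed above; only the stated hyperparameters (Adam at 0.001, batch size 8, cross-entropy, up to 100 epochs, early stopping with patience 10) come from the text.

```python
# Condensed sketch of the training regime described above (illustrative).

import torch
from torch.utils.data import DataLoader

def train(model, train_set, val_set, device="cuda"):
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()       # categorical cross-entropy
    train_dl = DataLoader(train_set, batch_size=8, shuffle=True)
    val_dl = DataLoader(val_set, batch_size=8)

    best_val, patience = float("inf"), 0
    for epoch in range(100):                    # maximum of 100 epochs
        model.train()
        for x, y in train_dl:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(x.to(device)), y.to(device)).item()
                           for x, y in val_dl) / len(val_dl)

        if val_loss < best_val:
            best_val, patience = val_loss, 0
        else:
            patience += 1
            if patience >= 10:                  # early stopping criterion
                break
```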
8. Results

The results demonstrate the model's efficiency in leveraging multi-modal data for improved outcomes. As can be seen in Table 1, the SL-GCN multi-stream model achieved the best accuracy, with a Top-1 accuracy of 97.98% and a Top-5 accuracy of 99.94%, surpassing the performance of models using single data streams such as skeletal joints, bones, or motion alone. This demonstrates the advantage of combining multiple streams of information to capture both the spatial and temporal dynamics of signs.
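Top-1 and Top-5 follow the usual definition: a prediction counts as correct if the true class appears among the k highest-scoring classes. A minimal sketch with illustrative values:

```python
# Top-k accuracy as reported in Table 1 (illustrative implementation).

import torch

def topk_accuracy(logits: torch.Tensor, labels: torch.Tensor, k: int = 1) -> float:
    """logits: (N, num_classes); labels: (N,). Returns accuracy in [0, 1]."""
    topk = logits.topk(k, dim=1).indices            # (N, k) class indices
    hits = (topk == labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Example: 3 samples, 4 classes.
logits = torch.tensor([[0.1, 0.7, 0.1, 0.1],
                       [0.3, 0.2, 0.4, 0.1],
                       [0.1, 0.2, 0.3, 0.4]])
labels = torch.tensor([1, 0, 3])
print(topk_accuracy(logits, labels, k=1))   # 2/3 correct
print(topk_accuracy(logits, labels, k=2))   # 3/3 correct
```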
Table 1: Performance of SL-GCN multi-stream on the test set

  Data           Top-1 Accuracy (%)   Top-5 Accuracy (%)
  Joints         96.24                99.84
  Bones          95.82                99.84
  Joint Motion   90.37                99.15
  Bone Motion    92.69                99.52
  Multi-stream   97.98                99.94

In Table 2, datasets trained on the SL-GCN model are compared. Our Dataset produced the highest accuracy (97.98%) among the datasets evaluated, outperforming larger datasets like AUTSL (95.45%).

Table 2: Comparison of different datasets on the SL-GCN model

  Dataset        Number of signs   Accuracy (%)
  MultiMedaLIS   126               97.98
  AUTSL          226               95.45
  ASLLVD         20                61.04
  Alphabet       26                85.19

Table 3 presents a comparison of different methods across the entire Dataset. The SL-GCN trained on RGB frames achieved the highest accuracy (97.98%), followed by the SSTCN model with 96.33%. The ResNet(2+1)D architecture showed strong performance when applied to RGB frames (97.29%), but struggled when using optical flow data alone, reaching just 56.31% accuracy, suggesting that while optical flow provides valuable information on motion, it lacks the richness of spatial features found in RGB and depth data. The HHA-encoded depth data, when processed with the ResNet(2+1)D model, achieved an accuracy of 88.04%, confirming that depth information is complementary, but not as effective as RGB data in isolation.

Table 3: Performance of various methods on the MultiMedaLIS Dataset

  Method         Data                Accuracy (%)
  SL-GCN         RGB                 97.98
  SSTCN          RGB                 96.33
  ResNet(2+1)D   Optical flow        56.31
  ResNet(2+1)D   RGB frames          97.29
  ResNet(2+1)D   HHA-encoded depth   88.04

The results highlight the importance of combining multiple data modalities, especially RGB and skeletal data, for improving the accuracy and robustness of SLR systems. The performance of the SL-GCN model with multi-stream data shows the model's ability to effectively capture signs, as well as the Dataset's value.

9. Discussion and Conclusion

In this study, our goal was to demonstrate our first steps into testing the efficacy of the MultiMedaLIS Dataset in contributing to the advancement of the field of SLR through multisource approaches. The integration of RGB frames, depth data, optical flow, and skeletal data has provided a comprehensive basis for developing and evaluating SLR models. Our experiments with the SL-GCN and SSTCN architectures have highlighted advancements in recognizing isolated LIS signs in medical semantic contexts, given the domain of our Dataset.

The SL-GCN model, trained on skeletal data to construct temporal graphs, achieved high accuracy in capturing the spatiotemporal relationships critical to sign recognition. This approach not only enhances the precision of rendering LIS signs but is also reinforced by a Dataset able to support robust graph-based convolutional networks in multimodal SLR tasks. At the same time, our Dataset proved robust, precise, and variable enough for SSTCN model testing, focusing on spatiotemporal separable convolutions, revealing robust performance in extracting spatial dynamics from RGB frames.

Having validated the visual modalities on the mentioned models, we have promising preliminary results on adapting these models to accept RADAR data. We plan to extract the pre-trained RADAR data
processing module and use it independently during inference. This approach will eliminate the need for RGB visual data. Furthermore, we plan to expand the Dataset by applying the same protocol with 10 deaf signers. This will effectively increase the current Dataset, enhancing generalizability across different signers. Our goal is to develop an autonomous, resource-constrained system (thanks to the exclusion of RGB data) that operates on-edge or even offline. This cost-effective solution can be used in any emergency context where direct access to interpreting is not available.
References

[1] W. Stokoe, Sign language structure: an outline of the visual communication systems of the American deaf, University of Buffalo, Buffalo, New York, 1960.
[2] V. Volterra, M. Roccaforte, A. Di Renzo, S. Fontana, Italian Sign Language from a Cognitive and Socio-semiotic Perspective: Implications for a general language theory, John Benjamins Publishing Company, Amsterdam-Philadelphia, 2022.
[3] M. Montanini, M. Facchini, L. Fruggeri, Dal Gesto al Gesto: il bambino sordo tra gesto e parola, Cappelli, Bologna, 1979.
[4] V. Volterra, I segni come le parole: la comunicazione dei sordi, Boringhieri, Torino, 1981.
[5] S. Fontana, S. Corazza, P. Boyes-Braem, V. Volterra, Language research and language community change: Italian Sign Language (LIS) 1981-2013, International Journal of the Sociology of Language 236 (2015).
[6] E. Tomasuolo, T. Gulli, V. Volterra, S. Fontana, The Italian Deaf Community at the Time of Coronavirus, Frontiers in Sociology 5 (2021).
[7] D. Li, C. R. Opazo, X. Yu, H. Li, Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison, in: Proceedings of the 2020 IEEE WACV, Snowmass, CO, USA, 2020, pp. 1448-1458.
[8] O. Mercanoglu Sincan, H. Yalim Keles, AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods, IEEE Access, 2020. https://doi.org/10.48550/arXiv.2008.00932
[9] H. R. Vaezi Joze, O. Koller, MS-ASL: A large-scale data set and benchmark for understanding American sign language, arXiv preprint, 2018.
[10] U. von Agris, M. Knorr, K. F. Kraiss, The significance of facial features for automatic sign language recognition, in: Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands, 2008, pp. 1-6.
[11] S. Tornay, O. Aran, M. Magimai Doss, An HMM Approach with Inherent Model Selection for Sign Language and Gesture Recognition, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 6049-6056.
[12] Y. Chen, C. Shen, X.-S. Wei, L. Liu, J. Yang, Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation, in: 2017 IEEE ICCV, 2017, pp. 1221-1230.
[13] E. Barsoum, C. Zhang, C. Canton Ferrer, Z. Zhang, Training deep networks for facial expression recognition with crowd-sourced label distribution, in: Proceedings of the 18th ACM ICMI, 2016, pp. 279-283.
[14] Y. Wang, A. Ren, M. Zhou, W. Wang, X. Yang, A Novel Detection and Recognition Method for Continuous Hand Gesture Using FMCW Radar, IEEE Access 8 (2020), pp. 167264-167275.
[15] O. Yusuf, M. Habib, M. Moustafa, Real-time hand gesture recognition: Integrating skeleton-based data fusion and multi-stream CNN, 2024.
[16] A. Cardinaletti, L. Mantovan, Le Lingue dei Segni nel 'Volume Complementare' e l'Insegnamento della LIS nelle Università Italiane, Italiano Lingua Seconda. Rivista internazionale di linguistica italiana e educazione linguistica 14(2) (2022), pp. 113-128.
[17] T. Russo Cardona, Iconicity and Productivity in Sign Language Discourse: An Analysis of Three LIS Discourse Registers, Sign Language Studies 4(2) (2004), pp. 164-197.
[18] A. Ricci, C. Bonsignori, A. Di Renzo, Che giorno è oggi? Prime analisi e riflessioni sull'espressione del tempo in LIS [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[19] E. Fornasiero, La morfologia valutativa in LIS: una descrizione preliminare [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[20] A. Di Renzo, A. Slonimska, L'uso delle Strutture di Grande Iconicità nei testi narrativi segnati: primi dati su bambini prescolari, scolari e adulti [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[21] S. R. Conte, Nomi di persona e di luogo nella comunità sorda in Italia: interviste, analisi e primi risultati [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[22] S. Fontana, E. Raniolo, Interazioni tra oralità e unità segniche: uno studio sulle labializzazioni nella Lingua dei Segni Italiana (LIS), in: G. Schneider, M. Janner, B. Élie (Eds.), Proceedings of the VII Dies Romanicus Turicensis, Peter Lang, Bern, 2015, pp. 241-258.
[23] V. Cuccio, G. Di Stasio, S. Fontana, On the Embodiment of Negation in Italian Sign Language: An Approach Based on Multiple Representation Theories, Frontiers in Psychology 1 (2022).
[24] S. Fontana, Grammar and Experience: The Interplay Between Language Awareness and Attitude in Italian Sign Language (LIS), International Journal of Linguistics 14(5) (2022), pp. 1-18.
[25] M. Hilzensauer, K. Krammer, A multilingual dictionary for sign languages: 'SpreadTheSign', in: Proceedings of ICERI, Seville, 2015.
[26] C. Cecchetto, S. Giudice, E. Mereghetti, La raccolta del Corpus LIS, in: A. Cardinaletti, C. Cecchetto, C. Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan, 2011, pp. 55-68.
[27] C. Geraci, K. Battaglia, A. Cardinaletti, C. Cecchetto, C. Donati, S. Giudice, E. Mereghetti, The LIS Corpus Project, Sign Language Studies 11 (2011), pp. 528-571.
[28] M. Santoro, F. Poletti, L'Annotazione del Corpus, in: A. Cardinaletti, C. Cecchetto, C. Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan, 2011, pp. 69-78.
[29] N. Neverova, C. Wolf, G. Taylor, F. Nebout, ModDrop: Adaptive Multi-Modal Gesture Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 8 (2016), pp. 1692-1706.
[30] J. Pu, W. Zhou, H. Li, Iterative alignment network for continuous sign language recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4165-4174.
[31] J. Huang, W. Zhou, H. Li, W. Li, Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition, IEEE Transactions on Circuits and Systems for Video Technology 29 (2019), pp. 2822-2832.
[32] D. Bragg, T. Verhoef, C. Vogler, M. Morris, O. Koller, M. Bellard, L. Berke, P. Boudreault, A. Braffort, N. Caselli, M. Huenerfauth, H. Kacorri, Sign language recognition, generation, and translation: An interdisciplinary perspective, in: Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, 2019, pp. 16-31.
[33] O. Mercanoglu Sincan, A. O. Tur, H. Yalim Keles, Isolated Sign Language Recognition with Multi-scale Features using LSTM, in: Proceedings of the 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 2019, pp. 1-4.
[34] S. Z. Gurbuz, A. C. Gurbuz, E. A. Malaia, D. J. Griffin, C. Crawford, M. M. Rahman, R. Aksu, E. Kurtoglu, R. Mdrafi, A. Anbuselvam, T. Macks, E. Ozcelik, A linguistic perspective on radar micro-doppler analysis of American sign language, in: Proceedings of the 2020 IEEE International Radar Conference (RADAR), Washington, DC, USA, 2020, pp. 232-237.
[35] B. Li, Sign language/gesture recognition based on cumulative distribution density features using UWB radar, IEEE TIM 70 (2021), pp. 1-13.
[36] H. Kulhandjian, Sign language gesture recognition using Doppler radar and deep learning, in: Proceedings of the 2019 IEEE Globecom Workshops (GC Wkshps), Waikoloa, HI, USA, 2019, pp. 1-6.
[37] Y. Lu, Y. Lang, Sign language recognition with CW radar and machine learning, in: Proceedings of the 21st International Radar Symposium (IRS), Warsaw, Poland, 2020, pp. 31-34.
[38] J. McCleary, Sign language recognition using micro-doppler and explainable deep learning, Computer Modeling in Engineering & Sciences 139 (2024), pp. 2399-2450.
[39] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39 (2016), pp. 1137-1149.
[40] O. O. Adeoluwa, S. J. Kearney, E. Kurtoglu, C. J. Connors, S. Z. Gurbuz, Near real-time ASL recognition using a millimeter wave radar, in: Radar Sensor Technology XXV, volume 11742, SPIE, 2021.
[41] R. Mineo, G. Caligiore, C. Spampinato, S. Fontana, S. Palazzo, E. Ragonese, Sign Language Recognition for Patient-Doctor Communication: A Multimedia/Multimodal Dataset, in: Proceedings of the IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI), 2024.
[42] G. Caligiore, Codifying the body: exploring the cognitive and socio-semiotic framework in building a multimodal Italian sign language (LIS) dataset, Ph.D. thesis, University of Catania, Catania, 2024.
[43] L. Lo Re, Corpus Multimodale dell'Italiano Parlato: basi metodologiche per la creazione di un prototipo, Ph.D. thesis, University of Florence, Florence, 2022.
[44] C. Correia de Amorim, C. Macedo, C. Zanchettin, Spatial-Temporal Graph Convolutional Networks for Sign Language Recognition, in: Proceedings of the 2019 International Conference on Artificial Neural Networks, Munich, Germany, 2019, pp. 646-657.
[45] A. F. Nafis, N. Suciati, Sign language recognition on video data based on graph convolutional network, Journal of Theoretical and Applied Information Technology 99(18) (2023), pp. 4323-4333.
[46] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, Y. Fu, Skeleton aware multi-modal sign language recognition, in: Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021.
[47] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in: Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693-5703.
[48] G. Farneback, Two-frame motion estimation based on polynomial expansion, in: Lecture Notes in Computer Science, volume 2749, Springer, Berlin, Heidelberg, 2003, pp. 363-370.
[49] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal convolutions for action recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450-6459.