Multisource Approaches to Italian Sign Language (LIS) Recognition: Insights from the MultiMedaLIS Dataset

Gaia Caligiore*†1, Raffaele Mineo†2, Concetto Spampinato2, Egidio Ragonese2, Simone Palazzo2, Sabina Fontana2
1 University of Modena and Reggio Emilia, Italy.
2 University of Catania, Italy.

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy.
* Corresponding author.
† These authors contributed equally.
gaia.caligiore@unimore.it (G. Caligiore); raffaele.mineo@phd.unict.it (R. Mineo); concetto.spampinato@unict.it (C. Spampinato); egidio.ragonese@unict.it (E. Ragonese); simone.palazzo@unict.it (S. Palazzo); sfontana@unict.it (S. Fontana).
ORCID: 0000-0002-7087-1819 (G. Caligiore); 0000-0002-1171-5672 (R. Mineo); 0000-0001-6653-2577 (C. Spampinato); 0000-0001-6893-7076 (E. Ragonese); 0000-0002-2441-0982 (S. Palazzo); 0000-0003-3083-1676 (S. Fontana).
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract
Given their status as unwritten visual-gestural languages, research on the automatic recognition of sign languages has increasingly implemented multisource capturing tools for data collection and processing. This paper explores advancements in Italian Sign Language (LIS) recognition using a multimodal dataset in the medical domain: the MultiMedaLIS Dataset. We investigate the integration of RGB frames, depth data, optical flow, and skeletal information to develop and evaluate two computational models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). RADAR data was collected but not included in the testing phase. Our experiments validate the effectiveness of these models in enhancing the accuracy and robustness of isolated LIS sign recognition. Our findings highlight the potential of multisource approaches in computational linguistics to improve linguistic accessibility and inclusivity for members of the signing community.

Keywords
Italian Sign Language, Sign Language Recognition, Deep Learning, Computer Vision

1. Introduction

Italian Sign Language (LIS, Lingua dei Segni Italiana) is the primary means of communication within the Italian signing community. Due to their visual-gestural modality, sign languages (SLs) were initially not considered fully-fledged linguistic systems. However, since the 1960s, beginning with Stokoe's pioneering works [1], the contemporary study of SLs has evolved into a robust field of research. Over the past half-century, significant societal and scientific advancements have transformed the perception and status of SLs, which are now recognized as natural and complete languages and have received legal recognition in many countries.

In the Italian context, the study of signed communication began in the early 1980s, involving both hearing and deaf researchers. At that time, what we now call LIS was still mostly unnamed and was often referred to as 'mime' or 'gesture' by signers and non-signers alike [2]. The first significant publications on LIS [3] [4], along with the collaborative efforts of deaf and hearing researchers, initiated a transformative period in SL research in the Italian context [5]. This shift in perspective was influenced by factors beyond the language itself, such as increased meta-linguistic awareness and greater visibility of the community and its language to the wider public. Indeed, from a societal perspective, the visibility of SL in Italy, especially in the media, has changed significantly with technological advancements, mirroring global trends.

In the late 1980s, Italy introduced subtitles for movies on television, marking a step toward content accessibility. The importance of media accessibility, through subtitles or LIS interpreting, was accentuated during the COVID-19 pandemic. The need for equitable access to critical information for deaf individuals became evident, with efforts born within the community stressing the central role of LIS in ensuring that deaf signers received accessible information during challenging times [6], and highlighting the significant communication barriers that deaf individuals face, especially when in-person interactions are restricted. This increased visibility, along with persistent advocacy by the signing community, played a crucial role in the official recognition of LIS and Tactile LIS (LISt) in May 2021.

Within this evolving societal and linguistic framework, marked by the increased media visibility of LIS and the introduction of video capturing tools into daily life, language collection emerges as a central issue. For SLs, the need for comprehensive collections is particularly significant. Unlike oral languages, which in some cases have developed standardized written systems, SLs must rely on video collections to capture signed communication accurately. These videos, whether raw or annotated, are essential for analyzing SLs with both qualitative and quantitative evidence.
2. Automatic Sign Language Recognition

The development and use of preferably annotated SL datasets or corpora are crucial for training and validating automatic recognition models, and access to high-quality data from diverse SLs and cultural contexts enhances the generalizability of these solutions. Comprehensive data collections of this kind ensure that models can effectively understand and process the wide range of linguistic and cultural nuances present in different SLs.

In the domain of automatic sign language recognition (SLR) of LIS, the integration of visual and spatial information presents a complex challenge. As mentioned, LIS operates through the visual-gestural channel. More precisely, it is characterized as multimodal² (signed discourse is comprised of manual and body components) and multilinear (manual and body components are performed simultaneously) [2].

Recent advancements in SLR have been significantly driven by annotated datasets, which serve as the basis for training and validating models [7, 8, 9, 10, 11]. Machine learning technologies, particularly deep learning neural networks, have facilitated the development of more precise and robust models for SL interpretation. These models are able to refine their performance through training on diverse and complex datasets. Additionally, computer vision plays a central role in this field by enabling real-time analysis and interpretation of body and manual components [2], that is, hand movements, facial expressions, and body posture [12, 13, 14, 15].

A significant challenge in applying deep learning and computer vision methods to SLR lies in ensuring the quality and adequacy of training data, which is essential for achieving optimal model performance. Therefore, in this study, we focus on evaluating the efficacy of the MultiMedaLIS Dataset (Multimodal Medical LIS Dataset) and assessing deep learning models for SLR that interpret isolated signs by integrating diverse data types such as RGB video, depth information, optical flow, and skeletal data.

We benchmark our Dataset with two models: the Skeleton-Based Graph Convolutional Network (SL-GCN) and the Spatiotemporal Separable Convolutional Network (SSTCN). These models are trained on the MultiMedaLIS Dataset, showcasing how the incorporation of multisource data can enhance the accuracy of sign recognition. This approach aims at testing the potential of integrating different data modalities to improve the robustness and performance of SLR systems.

² Given our group's interdisciplinarity, we found "multimodal" can mean different things depending on one's background: in linguistics, it refers to the employment of manual and body components while signing, while in computer vision, it means using multiple capturing tools. To differentiate, we use "multisource" for capturing tools. Thus, "multimodal" in this text follows SL linguistics terminology.

3. State of the Art

In this section, we discuss the state of the art from the two perspectives considered during our work on the Dataset: LIS data collection and SLR tools.
3.1. LIS Data Collections

SL researchers in Italy have been actively engaged in the creation of LIS corpora and datasets. This effort involves a complex process of video data collection and annotation, as SL datasets can vary significantly depending on their intended use. Within this context, SL data collections can be categorized into two main types. The first type includes datasets that feature videos depicting continuous signing, capturing the flow and context of natural SL usage. The second type comprises datasets that focus on isolated signs, that is, individual signs presented separately from continuous discourse.

The scarcity of available LIS data collections has prompted researchers to develop their own resources. Several smaller-scale LIS corpora have been independently established, each serving distinct purposes based on the type of data collected. The methodologies employed for collecting LIS data encompass a diverse array of approaches, ranging from naming tasks to semi-structured and spontaneous interviews with deaf signers, to video recording sessions involving hearing individuals learning LIS as a second language (L2) or second modality (M2) [16]. These documentations serve equally diverse purposes, ranging from documenting the language itself to creating tools for automatic translation, highlighting the ongoing commitment of researchers to expand and enrich the available resources for studying LIS [17, 18, 19, 20, 21, 22, 23, 24].

Despite the predominantly private nature of corpora collections, an exception to the accessibility challenge is found in the online dictionary SpreadTheSign, a project originating in 2004. Initially conceived as a dictionary for SLs, SpreadTheSign has evolved into a versatile resource for language documentation [25]. Another significant resource is the Corpus LIS, recognized as the largest collection of spontaneous, semi-structured, and structured videos in LIS by deaf signers. The primary objectives of this corpus were twofold: to collect a substantial quantity of data suitable for quantitative analysis and to establish a comprehensive representation of LIS usage in Italy [26, 27, 28].
3.2. SLR Tools

Like SL data collections, SLR approaches can be broadly classified into two main categories: those that rely on specialized hardware and those that use visual information. The former employ specialized hardware, such as gloves able to capture precise hand movements. While these systems can provide detailed data, they are often considered intrusive and can compromise the natural flow of communication. Additionally, they are unable to capture the full spectrum of SLs, which includes both manual and body components. In contrast, vision-based approaches use visual information captured by cameras, including RGB, depth, infrared, or a combination of these. These methods are less intrusive for users, as they do not require special equipment.

In SLR, a challenge lies in effectively capturing both body movements and the specific motions of the hands, arms, and face. For instance, [29] introduces a multi-scale, multi-modal framework that focuses on spatial details across different scales. This approach involves each visual modality capturing spatial information uniquely, supported by a system operating at three temporal scales. The training methodology emphasizes precise initialization of individual modalities and progressive fusion via ModDrop, which enhances overall robustness and performance.

Another study proposes an iterative optimization alignment network tailored for weakly supervised continuous SLR [30]. The framework employs a 3D residual convolutional network for feature extraction, complemented by an encoder-decoder architecture featuring LSTM decoders and Connectionist Temporal Classification (CTC). [31] introduces a 3D convolutional neural network enhanced with an attention module, designed to extract spatiotemporal features directly from raw video data. In contrast, [32] combines bidirectional recurrence and temporal convolutions, emphasizing the effectiveness of temporal information in sign tasks, although not covering the full spectrum of movements. Moreover, [33] employs CNNs, a Feature Pooling Module, and LSTM networks to generate distinctive visual representations, but falls short in capturing comprehensive movements and signing.

However, as previously noted, RGB-based SLR systems can raise privacy concerns, particularly when processing visual data in cloud environments or for machine learning training [34]. Addressing these issues, radio frequency (RF) sensors have emerged as a promising alternative, ensuring privacy preservation while enabling innovative data representations for SLR. In the literature, deep learning techniques have been applied to various RF modalities such as ultra-wideband (UWB) [35], Doppler [36], continuous wave (CW) [37], micro-Doppler [38], frequency modulated continuous wave (FMCW) [14], multi-antenna systems [39], and millimeter waves [40].

As part of the Dataset discussed in this work, we have also collected RADAR data and are actively analyzing it. However, preliminary results are not available at this time, so they are not included in this report. To date, RADAR-based solutions have demonstrated robust performance across diverse environmental conditions, highlighting the value of incorporating this sensor technology in data collection efforts. Nevertheless, many existing RADAR solutions are tailored to recognizing a limited set of signs, highlighting the ongoing challenge of expanding vocabulary recognition capabilities in datasets like the one discussed in the following section.

4. The MultiMedaLIS Dataset

The MultiMedaLIS Dataset [41] was created through the interdisciplinary collaboration established between the Department of Humanities (DISUM) and the Department of Electrical, Electronic and Computer Engineering (DIEEI) of the University of Catania (Unict). It aims to offer a multimodal collection of LIS signs specifically focused on medical contexts.
For the data recording protocol, the DIEEI group developed customized recording software to collect the LIS data, supplemented with a desktop computer and a modified keyboard transformed into a pedal board. This pedal board, equipped with two pedals, allowed hands-free navigation of the software, enabling users to move forward (by pushing the right pedal) or backward (by pushing the left pedal) while maintaining a neutral recording position³. During sessions, one of 126 Italian labels or alphabet letters was displayed on a screen, with an adjustable display time for preparation and transition from one sign to the next. Each recording started from a neutral position, and the right pedal marked the completion of a sign. If errors occurred, the left pedal allowed re-recording. The software's interface features a color-coded background: yellow for preparation and green for recording. Additionally, it supports flexible data expansion, accepting word lists from text files for easy customization in future collections; a schematic sketch of this session logic is given below.

Figure 1: User interface display presented during the recording phase (green) and preparation phase (yellow).

³ The neutral recording position referenced is a seated position in which the user has their arms extended along the sides of the torso, elbows bent at 90°, and palms facing downward [41].
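To make the session logic above concrete, the following sketch mimics a pedal-driven recording loop: a word list loaded from a text file, a yellow preparation phase, a green recording phase, the right pedal to confirm, and the left pedal to redo. It is a hypothetical reconstruction, not the DIEEI software; every name in it (`Pedal`, `load_labels`, `run_session`, and the injected callbacks) is our own assumption made for illustration.

```python
# Hypothetical sketch of the pedal-driven session logic described above.
# This is NOT the DIEEI recording software; all names are assumptions.
from enum import Enum, auto

class Pedal(Enum):
    RIGHT = auto()  # confirm the sign and move forward
    LEFT = auto()   # discard the take and re-record

def load_labels(path: str) -> list[str]:
    """Read one Italian label or alphabet letter per line of a text file."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def run_session(labels, show, record_take, get_pedal_event, save):
    """Iterate over labels: yellow preparation phase, then green recording."""
    i = 0
    while i < len(labels):
        show(labels[i], background="yellow")  # preparation, adjustable duration
        show(labels[i], background="green")   # recording from neutral position
        take = record_take(labels[i])         # capture from all synced sensors
        if get_pedal_event() is Pedal.RIGHT:  # right pedal marks completion
            save(take)
            i += 1
        # left pedal: the take is dropped and the same label is repeated
```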
After the recording process, the Dataset included synchronized data capturing facial expressions and hand and body movements, for a total of 25,830 sign instances: 205 repetitions of each of 100 different signs and of the 26 signs of the LIS alphabet [41].

Beyond these 26 alphabet signs, the signs included in the MultiMedaLIS Dataset can be broadly categorized into two groups [42]: semantically marked signs related to health and health issues, and non-semantically marked signs. It is important to note that while the first group of signs is categorized as semantically marked, this classification does not imply that these signs belong exclusively to a specialized jargon lexicon. The decision to categorize signs as semantically marked was driven by their significance in contexts related to health and medical interactions in the post-pandemic world (hence, when the Dataset was first theorized). However, it was also important to include additional signs that could contribute to constructing meaningful utterances in patient-doctor interactions. During the creation of the MultiMedaLIS Dataset, careful consideration was given to selecting signs that could be combined to form coherent and meaningful utterances.

Regarding the specific form of the signs, the MultiMedaLIS Dataset includes a lexicon of standard, isolated signs that are not combined within utterances. These signs reflect forms commonly found in online dictionaries and educational materials. To ensure the accuracy of the data, sign variants performed by a professional LIS interpreter during the collection of a test dataset were compared with the same variants found in the online dictionary SpreadTheSign. This comparison aimed to select documented versions of each sign for inclusion in the Dataset. By incorporating these documented variants, we aimed to enhance its precision, reliability, and real-world applicability. This approach contributed to ensuring that the Dataset aligns with established standards and supports effective research and application in the field of LIS.

When discussing recording tools for state-of-the-art multimodal corpora in the Italian context, such as the Corpus LIS [27] and the CORMIP [43], the emphasis is placed on the portability and non-invasiveness of these tools. This approach ensures minimal interference with the signer's natural environment and activities. Portable and non-invasive recording tools are chosen specifically for their ability to capture data in familiar, and sometimes domestic, settings without disrupting the signer's surroundings, aiming to maintain the authenticity of the signed interactions and to minimize any discomfort or distraction for the participants.

To capture LIS for recognition with minimal invasiveness, we integrated a combination of recording tools. A 60 GHz RADAR sensor, employed to capture detailed manual motion data, provided Time- and Frequency-Domain data and Range-Doppler Maps for distinguishing moving objects at 13 fps. For more structured depth and facial recognition data, the RealSense D455 depth camera and the Kinect v1 were incorporated. The RealSense D455, equipped with dual infrared cameras and an RGB mode, captured depth data at 848x480 pixels and RGB data at 1280x720 pixels, both at 30 fps, enabling the tracking of facial expressions through 68 facial points. The Zed v1 and Zed v2 cameras provided high-resolution stereoscopic data, recording at 1920x1080 pixels and 25 fps, with capabilities for generating depth maps and 3D point clouds. Additionally, the Zed v2 offered tracking of 18 body points in both 2D and 3D [41].

Figure 2: Combination of synchronized infrared and depth data from the MultiMedaLIS Dataset.

By prioritizing portability and non-invasiveness, high-quality data can still be collected while respecting the privacy and comfort of the individuals recorded. Anonymization is achieved through the use of the RADAR sensor, which we introduced specifically to address privacy concerns inherent in face-to-face signed communication.

5. Testing the Dataset

The MultiMedaLIS Dataset was designed with the aim of supporting the development of SLR models by enabling the collection and integration of information through various data modalities:

• RGB frames: images extracted from videos
• Depth data: three-dimensional information for each RGB frame
• Optical flow: to emphasize movement
• Skeletal data: face landmarks and body joints

One of the main components of the Dataset are the RGB frames, i.e., images extracted from videos. These frames provide a two-dimensional visual representation of the signs performed by the signer, capturing details such as hand positions and facial expressions. The Dataset also includes depth data, which add a three-dimensional aspect to the images, allowing for more detailed information on the distance and relative position of elements in the scene. This type of data is particularly useful for understanding the spatial dynamics of signs.

Alongside RGB and depth data, the MultiMedaLIS Dataset contains optical flow information, which describes the movement between consecutive frames. Optical flow is essential for capturing the direction and speed of movements, providing a more detailed understanding of the transitions between signs (see the sketch at the end of this section). Finally, the Dataset includes skeletal data, representing face landmarks and body joints, which allow for precise tracking of joint and body segment positions and facilitate the analysis of signs in terms of joint movements.

Managing this multimodal data is an emerging topic in computational linguistics. By combining different sources of information, it is possible to significantly improve the performance of SLR models. For example, integrating depth data with RGB frames can provide a more complete representation of signs, while adding optical flow and skeletal data can further enrich the analysis of the temporal structure of movement. In our view, the MultiMedaLIS Dataset provides a solid foundation for exploring these combinations, allowing researchers to develop more effective and accurate solutions for SLR.
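As a concrete illustration of the optical flow modality, the sketch below extracts dense flow from a sign clip with OpenCV's implementation of the Farnebäck algorithm, which Section 6 reports was used for flow extraction [48]. It is a minimal sketch, not the project's actual preprocessing pipeline; the clip path and parameter values are placeholders.

```python
# Minimal sketch: dense Farneback optical flow per consecutive frame pair.
# Illustrative only; the path and parameters are placeholder assumptions,
# not the MultiMedaLIS preprocessing code.
import cv2

cap = cv2.VideoCapture("sign_clip.mp4")  # placeholder path
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

flows = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Two-channel (dx, dy) displacement field between consecutive frames
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    flows.append(flow)
    prev_gray = gray
cap.release()

# Magnitude and direction of motion, e.g. for visualization or model input
mag, ang = cv2.cartToPolar(flows[0][..., 0], flows[0][..., 1])
```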
6. Models and Architectures

In the context of automatic SLR, various approaches and model architectures have been tested to leverage the characteristics of the multimodal data in the MultiMedaLIS Dataset.

The SL-GCN (Skeleton-Based Graph Convolutional Network) represents a significant innovation in this field. This model generates skeletal data from videos and creates temporal graphs that capture the spatiotemporal relationships between joint movements. Through fine-tuning and the combination of different data streams, SL-GCN has demonstrated high accuracy in sign recognition [44] [45]. Another prominent architecture is the SSTCN (Spatiotemporal Separable Convolutional Network) [46], which excels in feature extraction from videos using HRNet [47]. This approach achieved an accuracy of 96.33%, highlighting its effectiveness in capturing the spatial and temporal dynamics of LIS signs.

RGB frames are crucial for the visual representation of signs. The process of splitting videos into frames, cropping, and normalization optimally prepares the data for analysis by deep learning models. The use of dense optical flow, by contrast, presents significant challenges in sign recognition: optical flow extraction using the Farneback algorithm [48] led to 56% accuracy, highlighting difficulties in capturing precise details of movements, alongside computational limitations. Depth data encoded as Height, Horizontal disparity, Angle (HHA) represent another crucial resource in the MultiMedaLIS Dataset. Applying HHA encoding to depth frames achieved 88% accuracy using the ResNet(2+1)D architecture [49], substantiating the importance of three-dimensional information in enhancing the understanding and interpretation of signs and offering a more detailed perspective compared to two-dimensional data.

7. Training and Evaluation Procedure

For the training of the models, we employed a multi-stream approach that integrates skeletal, RGB, and depth data to improve sign recognition accuracy. The models were trained on an NVIDIA Tesla T4 16GB GPU using the Adam optimizer with an initial learning rate of 0.001 and a batch size of 8. We applied cross-validation to ensure the robustness of the results, splitting the Dataset into training (70%) and validation (15%) subsets, and we used data augmentation techniques such as color jittering (changing the brightness, contrast, saturation, and hue) to increase the diversity of the training data and improve generalization.

The loss function adopted for training was categorical cross-entropy, appropriate for multi-class classification tasks. The models were trained for a maximum of 100 epochs, with an early stopping criterion set to terminate training if no improvement in validation loss was observed for 10 consecutive epochs. For evaluation, we used a test set comprising the remaining 15% of the Dataset, ensuring that the models were tested on unseen data.
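The recipe above maps directly onto a standard PyTorch training loop. The sketch below is a hedged reconstruction of that setup (Adam, learning rate 0.001, batch size 8, categorical cross-entropy, up to 100 epochs, early stopping with patience 10); the tiny MLP and random tensors are placeholders standing in for the actual models and multimodal streams, not the project's training script.

```python
# Sketch of the training setup described above; the model and the random
# data are placeholder assumptions, not the MultiMedaLIS training code.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder data: 126 classes, flat feature vectors instead of real streams
X, y = torch.randn(1000, 256), torch.randint(0, 126, (1000,))
train_set = TensorDataset(X[:700], y[:700])      # 70% training split
val_set = TensorDataset(X[700:850], y[700:850])  # 15% validation split
train_loader = DataLoader(train_set, batch_size=8, shuffle=True)
val_loader = DataLoader(val_set, batch_size=8)

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(),
                      nn.Linear(512, 126)).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # initial LR 0.001
criterion = nn.CrossEntropyLoss()                # categorical cross-entropy

best_val, patience, wait = float("inf"), 10, 0
for epoch in range(100):                         # at most 100 epochs
    model.train()
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():                        # validation pass
        val_loss = sum(criterion(model(xb.to(device)), yb.to(device)).item()
                       for xb, yb in val_loader) / len(val_loader)

    if val_loss < best_val:                      # early stopping, patience 10
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:
            break
```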
8. Results

The results demonstrate the models' efficiency in leveraging multimodal data for improved outcomes. As can be seen in Table 1, the SL-GCN multi-stream model achieved the best accuracy, with a Top-1 accuracy of 97.98% and a Top-5 accuracy of 99.94%, surpassing the performance of models using single data streams such as skeletal joints, bones, or motion alone. This demonstrates the advantage of combining multiple streams of information to capture both the spatial and the temporal dynamics of signs.

Table 1
Performance of SL-GCN multi-stream on the test set

Data          Top-1 Accuracy (%)   Top-5 Accuracy (%)
Joints        96.24                99.84
Bones         95.82                99.84
Joint Motion  90.37                99.15
Bone Motion   92.69                99.52
Multi-stream  97.98                99.94

In Table 2, datasets trained on the SL-GCN model are compared. Our Dataset produced the highest accuracy (97.98%) among the datasets evaluated, outperforming larger datasets such as AUTSL (95.45%).

Table 2
Comparison of different datasets on the SL-GCN model

Dataset       Number of signs   Accuracy (%)
MultiMedaLIS  126               97.98
AUTSL         226               95.45
ASLLVD        20                61.04
Alphabet      26                85.19

Table 3 presents a comparison of different methods across the entire Dataset. The SL-GCN trained on RGB frames achieved the highest accuracy (97.98%), followed by the SSTCN model with 96.33%. The ResNet(2+1)D architecture showed strong performance when applied to RGB frames (97.29%), but struggled when using optical flow data alone, reaching just 56.31% accuracy, suggesting that while optical flow provides valuable information on motion, it lacks the richness of spatial features found in RGB and depth data. The HHA-encoded depth data, when processed with the ResNet(2+1)D model, achieved an accuracy of 88.04%, confirming that depth information is complementary, but not as effective as RGB data in isolation.

Table 3
Performance of various methods on the MultiMedaLIS Dataset

Method        Input                Accuracy (%)
SL-GCN        RGB                  97.98
SSTCN         RGB                  96.33
ResNet(2+1)D  Optical flow (RGB)   56.31
ResNet(2+1)D  RGB frames           97.29
ResNet(2+1)D  HHA-encoded depth    88.04

The results highlight the importance of combining multiple data modalities, especially RGB and skeletal data, for improving the accuracy and robustness of SLR systems. The performance of the SL-GCN model with multi-stream data shows the model's ability to effectively capture signs, as well as the Dataset's value.
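The four input streams in Table 1 are standard derivations from the same joint sequence: bones as vectors between connected joints, and motion as frame-to-frame differences of either stream. The sketch below shows one common way to derive them; the array shapes and the toy edge list are assumptions for illustration, not the exact SL-GCN skeleton graph or preprocessing used here.

```python
# Sketch: deriving the four skeleton streams of Table 1 from joint data.
# Assumes joints has shape (T frames, V joints, C coords); the edge list
# is a toy placeholder, not the authors' skeleton graph.
import numpy as np

def bone_stream(joints: np.ndarray, edges: list[tuple[int, int]]) -> np.ndarray:
    """Bones = vector from parent joint to child joint, per frame."""
    bones = np.zeros_like(joints)
    for child, parent in edges:
        bones[:, child] = joints[:, child] - joints[:, parent]
    return bones

def motion_stream(stream: np.ndarray) -> np.ndarray:
    """Motion = frame-to-frame difference of a stream (last frame zero)."""
    motion = np.zeros_like(stream)
    motion[:-1] = stream[1:] - stream[:-1]
    return motion

# Example with a toy 3-joint chain 0-1-2 over 64 frames of 2D coordinates
T, V, C = 64, 3, 2
joints = np.random.rand(T, V, C).astype(np.float32)
edges = [(1, 0), (2, 1)]              # (child, parent) pairs

bones = bone_stream(joints, edges)
streams = {
    "joints": joints,
    "bones": bones,
    "joint_motion": motion_stream(joints),
    "bone_motion": motion_stream(bones),
}  # each stream feeds its own GCN; per-stream scores are fused at the end
```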
9. Discussion and Conclusion

In this study, our goal was to demonstrate our first steps in testing the efficacy of the MultiMedaLIS Dataset in contributing to the advancement of the field of SLR through multisource approaches. The integration of RGB frames, depth data, optical flow, and skeletal data has provided a comprehensive basis for developing and evaluating SLR models. Our experiments with the SL-GCN and SSTCN architectures have highlighted advancements in recognizing isolated LIS signs in medical semantic contexts, given the domain of our Dataset.

The SL-GCN model, trained on skeletal data to construct temporal graphs, achieved high accuracy in capturing the spatiotemporal relationships critical to sign recognition. This approach not only enhances the precision of rendering LIS signs but is also reinforced by a Dataset able to support robust graph-based convolutional networks in multimodal SLR tasks. At the same time, our Dataset proved robust, precise, and varied enough for testing the SSTCN model, whose spatiotemporal separable convolutions showed robust performance in extracting spatial dynamics from RGB frames.

Having validated the visual modalities on the mentioned models, we have promising preliminary results on adapting these models to accept RADAR data. We plan to extract the pre-trained RADAR data processing module and use it independently during inference. This approach will eliminate the need for RGB visual data. Furthermore, we plan to expand the Dataset by applying the same protocol with 10 deaf signers. This will effectively increase the current Dataset, enhancing generalizability across different signers. Our goal is to develop an autonomous, resource-constrained system (thanks to the exclusion of RGB data) that operates on-edge or even offline. This cost-effective solution can be used in emergency contexts where direct access to interpreting is not available.

References

[1] W. Stokoe, Sign language structure: an outline of the visual communication systems of the American deaf, University of Buffalo, Buffalo, New York, 1960.
[2] V. Volterra, M. Roccaforte, A. Di Renzo, S. Fontana, Italian Sign Language from a Cognitive and Socio-semiotic Perspective: Implications for a General Language Theory, John Benjamins Publishing Company, Amsterdam-Philadelphia, 2022.
[3] M. Montanini, M. Facchini, L. Fruggeri, Dal Gesto al Gesto: il bambino sordo tra gesto e parola, Cappelli, Bologna, 1979.
[4] V. Volterra, I segni come le parole: la comunicazione dei sordi, Boringhieri, Torino, 1981.
[5] S. Fontana, S. Corazza, P. Boyes-Braem, V. Volterra, Language research and language community change: Italian Sign Language (LIS) 1981-2013, International Journal of the Sociology of Language, volume 236, 2015.
[6] E. Tomasuolo, T. Gulli, V. Volterra, S. Fontana, The Italian Deaf Community at the Time of Coronavirus, Frontiers in Sociology, volume 5, 2021.
[7] D. Li, C. R. Opazo, X. Yu, H. Li, Word-level Deep Sign Language Recognition from Video: A New Large-scale Dataset and Methods Comparison, in Proceedings of the 2020 IEEE WACV, Snowmass, CO, USA, 2020, pp. 1448-1458.
[8] O. Mercanoglu Sincan, H. Yalim Keles, AUTSL: A large scale multi-modal Turkish sign language dataset and baseline methods, IEEE Access, 2020. https://doi.org/10.48550/arXiv.2008.00932
[9] H. R. Vaezi Joze, O. Koller, MS-ASL: A large-scale data set and benchmark for understanding American sign language, arXiv preprint, 2018.
[10] U. von Agris, M. Knorr, K. F. Kraiss, The significance of facial features for automatic sign language recognition, in Proceedings of the 8th IEEE International Conference on Automatic Face & Gesture Recognition, Amsterdam, Netherlands, 2008, pp. 1-6.
[11] S. Tornay, O. Aran, M. Magimai Doss, An HMM Approach with Inherent Model Selection for Sign Language and Gesture Recognition, in Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 2020, pp. 6049-6056.
[12] Y. Chen, C. Shen, X.-S. Wei, L. Liu, J. Yang, Adversarial PoseNet: A Structure-Aware Convolutional Network for Human Pose Estimation, in Proceedings of the 2017 IEEE ICCV, 2017, pp. 1221-1230.
[13] E. Barsoum, C. Zhang, C. Canton Ferrer, Z. Zhang, Training deep networks for facial expression recognition with crowd-sourced label distribution, in Proceedings of the 18th ACM ICMI, 2016, pp. 279-283.
[14] Y. Wang, A. Ren, M. Zhou, W. Wang, X. Yang, A Novel Detection and Recognition Method for Continuous Hand Gesture Using FMCW Radar, IEEE Access, volume 8, 2020, pp. 167264-167275.
[15] O. Yusuf, M. Habib, M. Moustafa, Real-time hand gesture recognition: Integrating skeleton-based data fusion and multi-stream CNN, 2024.
[16] A. Cardinaletti, L. Mantovan, Le Lingue dei Segni nel 'Volume Complementare' e l'Insegnamento della LIS nelle Università Italiane, Italiano Lingua Seconda. Rivista internazionale di linguistica italiana e educazione linguistica, volume 14, no. 2, 2022, pp. 113-128.
[17] T. Russo Cardona, Iconicity and Productivity in Sign Language Discourse: An Analysis of Three LIS Discourse Registers, Sign Language Studies, volume 4, no. 2, 2004, pp. 164-197.
[18] A. Ricci, C. Bonsignori, A. Di Renzo, Che giorno è oggi? Prime analisi e riflessioni sull'espressione del tempo in LIS [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[19] E. Fornasiero, La morfologia valutativa in LIS: una descrizione preliminare [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[20] A. Di Renzo, A. Slonimska, L'uso delle Strutture di Grande Iconicità nei testi narrativi segnati: primi dati su bambini prescolari, scolari e adulti [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[21] S. R. Conte, Nomi di persona e di luogo nella comunità sorda in Italia: interviste, analisi e primi risultati [Poster presentation], IV Convegno Nazionale LIS 'La Lingua dei Segni Italiana: una risorsa per il futuro', Rome, 2018.
[22] S. Fontana, E. Raniolo, Interazioni tra oralità e unità segniche: uno studio sulle labializzazioni nella Lingua dei Segni Italiana (LIS), in: G. Schneider, M. Janner, B. Élie (Eds.), Proceedings of the VII Dies Romanicus Turicensis, Peter Lang, Bern, 2015, pp. 241-258.
[23] V. Cuccio, G. Di Stasio, S. Fontana, On the Embodiment of Negation in Italian Sign Language: An Approach Based on Multiple Representation Theories, Frontiers in Psychology, volume 1, 2022.
[24] S. Fontana, Grammar and Experience: The Interplay Between Language Awareness and Attitude in Italian Sign Language (LIS), International Journal of Linguistics, volume 14, no. 5, 2022, pp. 1-18.
[25] M. Hilzensauer, K. Krammer, A multilingual dictionary for sign languages: 'SpreadTheSign', in Proceedings of ICERI, Seville, 2015.
[26] C. Cecchetto, S. Giudice, E. Mereghetti, La raccolta del Corpus LIS, in: A. Cardinaletti, C. Cecchetto, C. Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan, 2011, pp. 55-68.
[27] C. Geraci, K. Battaglia, A. Cardinaletti, C. Cecchetto, C. Donati, S. Giudice, E. Mereghetti, The LIS Corpus Project, Sign Language Studies, volume 11, 2011, pp. 528-571.
[28] M. Santoro, F. Poletti, L'Annotazione del Corpus, in: A. Cardinaletti, C. Cecchetto, C. Donati (Eds.), Grammatica, Lessico e Dimensioni di Variazione della LIS, FrancoAngeli, Milan, 2011, pp. 69-78.
[29] N. Neverova, C. Wolf, G. Taylor, F. Nebout, ModDrop: Adaptive Multi-Modal Gesture Recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), volume 38, no. 8, 2016, pp. 1692-1706.
[30] J. Pu, W. Zhou, H. Li, Iterative alignment network for continuous sign language recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4165-4174.
[31] J. Huang, W. Zhou, H. Li, W. Li, Attention-Based 3D-CNNs for Large-Vocabulary Sign Language Recognition, IEEE Transactions on Circuits and Systems for Video Technology, volume 29, 2019, pp. 2822-2832.
[32] D. Bragg, T. Verhoef, C. Vogler, M. Morris, O. Koller, M. Bellard, L. Berke, P. Boudreault, A. Braffort, N. Caselli, M. Huenerfauth, H. Kacorri, Sign language recognition, generation, and translation: An interdisciplinary perspective, in Proceedings of the 21st International ACM SIGACCESS Conference on Computers and Accessibility, 2019, pp. 16-31.
[33] O. Mercanoglu Sincan, A. O. Tur, H. Yalim Keles, Isolated Sign Language Recognition with Multi-scale Features using LSTM, in Proceedings of the 27th Signal Processing and Communications Applications Conference (SIU), Sivas, Turkey, 2019, pp. 1-4.
[34] S. Z. Gurbuz, A. C. Gurbuz, E. A. Malaia, D. J. Griffin, C. Crawford, M. M. Rahman, R. Aksu, E. Kurtoglu, R. Mdrafi, A. Anbuselvam, T. Macks, E. Ozcelik, A linguistic perspective on radar micro-Doppler analysis of American sign language, in Proceedings of the 2020 IEEE International Radar Conference (RADAR), Washington, DC, USA, 2020, pp. 232-237.
[35] B. Li, Sign language/gesture recognition based on cumulative distribution density features using UWB radar, IEEE Transactions on Instrumentation and Measurement, volume 70, 2021, pp. 1-13.
[36] H. Kulhandjian, Sign language gesture recognition using Doppler radar and deep learning, in Proceedings of the 2019 IEEE Globecom Workshops (GC Wkshps), Waikoloa, HI, USA, 2019, pp. 1-6.
[37] Y. Lu, Y. Lang, Sign language recognition with CW radar and machine learning, in Proceedings of the 21st International Radar Symposium (IRS), Warsaw, Poland, 2020, pp. 31-34.
[38] J. McCleary, Sign language recognition using micro-Doppler and explainable deep learning, Computer Modeling in Engineering & Sciences, volume 139, 2024, pp. 2399-2450.
[39] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), volume 39, 2016, pp. 1137-1149.
[40] O. O. Adeoluwa, S. J. Kearney, E. Kurtoglu, C. J. Connors, S. Z. Gurbuz, Near real-time ASL recognition using a millimeter wave radar, in Radar Sensor Technology XXV, volume 11742 of Proceedings of SPIE, 2021.
[41] R. Mineo, G. Caligiore, C. Spampinato, S. Fontana, S. Palazzo, E. Ragonese, Sign Language Recognition for Patient-Doctor Communication: A Multimedia/Multimodal Dataset, in Proceedings of the IEEE 8th Forum on Research and Technologies for Society and Industry Innovation (RTSI), 2024.
[42] G. Caligiore, Codifying the body: exploring the cognitive and socio-semiotic framework in building a multimodal Italian Sign Language (LIS) dataset [Ph.D. thesis], University of Catania, Catania, 2024.
[43] L. Lo Re, Corpus Multimodale dell'Italiano Parlato: basi metodologiche per la creazione di un prototipo [Ph.D. thesis], University of Florence, Florence, 2022.
[44] C. Correia de Amorim, C. Macedo, C. Zanchettin, Spatial-Temporal Graph Convolutional Networks for Sign Language Recognition, in Proceedings of the 2019 International Conference on Artificial Neural Networks, Munich, Germany, 2019, pp. 646-657.
[45] A. F. Nafis, N. Suciati, Sign language recognition on video data based on graph convolutional network, Journal of Theoretical and Applied Information Technology, volume 99, no. 18, 2023, pp. 4323-4333.
[46] S. Jiang, B. Sun, L. Wang, Y. Bai, K. Li, Y. Fu, Skeleton aware multi-modal sign language recognition, in Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021.
[47] K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation, in Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693-5703.
[48] G. Farnebäck, Two-frame motion estimation based on polynomial expansion, in Image Analysis, volume 2749 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2003, pp. 363-370.
[49] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal convolutions for action recognition, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6450-6459.