<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Did Somebody Say 'Gest-IT'? A Pilot Exploration of Multimodal Data Management</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ludovica Pannitto</string-name>
          <email>ludovica.pannitto@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Albanesi</string-name>
          <email>lorenzo.albanesi@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Marion</string-name>
          <email>laura.marion@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Maria Martines</string-name>
          <email>federica.martines2@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmelo Caruso</string-name>
          <email>carmelo.caruso@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia S. Bianchini</string-name>
          <email>claudia.savina.bianchini@univ-poitiers.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Masini</string-name>
          <email>francesca.masini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caterina Mauri</string-name>
          <email>caterina.mauri@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum - University of Bologna</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Poitiers, FoReLLIS Laboratory</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a pilot exploration of the construction, management and analysis of a multimodal corpus. Through a three-layer annotation that provides orthographic, prosodic, and gestural transcriptions, the Gest-IT resource makes it possible to investigate the variation of gesture-making patterns in conversations between sighted people and people with visual impairment. After discussing the transcription methods and technical procedures employed in our study, we propose a unified CoNLL-U corpus and indicate our future steps.</p>
      </abstract>
      <kwd-group>
        <kwd>Corpora</kwd>
        <kwd>Multimodality</kwd>
        <kwd>Gestuality</kwd>
        <kwd>Blindness</kwd>
        <kwd>Universal Dependencies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>There exists a wide variety of multimodal resources for spoken and signed language, many of them openly available to the community through initiatives such as CLARIN-ERIC (https://www.clarin.eu/): for a collection of available multimodal resources see https://www.clarin.eu/resource-families/multimodal-corpora (spoken language) and https://www.clarin.eu/resource-families/sign-language-resources (sign language), while a list of audio-only resources of spoken language can be found at https://www.clarin.eu/resource-families/spoken-corpora.</p>
      <p>[…] label is often used to refer to any pragmatic gesture of epistemic denial performed by moving the shoulders, without specifying either the characteristics of the shoulder movement or the movements of other body parts that may have contributed to the execution of the gesture). Similar challenges arise in studies on Sign Languages. Although there is a larger number of transcription systems for the latter (see, for instance, the review in [7]), none of them has achieved the status of a universal standard. Additionally, attempts to adapt these systems to the transcription of gestures have so far been limited and not particularly successful. The lack of a transcription standard for gestures, one that describes them independently of their function or meaning, hinders the ability to precisely investigate the relationship between speech and gesture.</p>
      <p>Another aspect concerns the nature of the language data captured in multimodal resources: as the collection and standardization process for this kind of linguistic data is, by its very nature, much more complex, resources are often tailored to specific purposes and therefore involve task-oriented interactions (e.g., describing objects as in the NM-MoCap-Corpus [8]; spatial communication tasks as in the SaGA Corpus [9]), thus capturing interactions that may be naturalistic but are inherently non-ecological, i.e. not naturally-occurring [10, 11]. Often, participants are asked to wear special devices such as headsets or trackers during the recordings [12, 13], clearly altering the spontaneity of the interaction.</p>
      <p>The aim of the Gest-IT project is to build a multimodal corpus of ecological data, allowing for the integrated analysis of verbal and gestural communication in spontaneous interactions. In this paper, we focus on the protocol of multimodal data management that we tested for this resource. We first discuss the main existing multimodal resources (Section 2), showing how, as of today, there does not seem to be any ecological, accessible, multimodal corpus for Italian. We then introduce the Gest-IT pilot resource and present its main features with respect to existing resources (Section 3). Section 4 outlines the main design choices taken for the creation of our resource and Section 5 describes the path ahead.</p>
    </sec>
    <sec id="sec-overview">
      <title>2. Multimodal resources: problems and overview</title>
      <p>Multimodal corpus research faces two major problems: (i) the lack of existing transcription and annotation standards (tools, formats and schemes), especially for coding nonverbal behavior [4]; and (ii) the time-consuming nature of the transcription and annotation process, which is responsible for the relatively small sizes of the searchable multimodal corpora that are currently available.</p>
      <p>Specifically with respect to point (i), a major problem concerning available resources is the non-separation between the identification and description of gestures on the one hand, and their interpretation on the other. Indeed, in many resources and studies a particular gestural pattern is transcribed based on its function, i.e. its interpretation, rather than on a description of the ‘objective’ aspects that characterize its ‘form’. However, if we aim to provide an integrated analysis of verbal and nonverbal communication, it is crucial that – just as we employ IPA or simplified orthographic transcriptions for verbal signs – we establish a standard to transcribe nonverbal signs, in order to then annotate and interpret them.</p>
      <p>Furthermore, in most resources gesture is transcribed only with reference to verbal behaviour: the very identification of the gestures depends on their association, according to the annotator’s subjective filtering, with an identifiable verbal sequence. In the PoliModal corpus [14] (https://github.com/dhfbk/InMezzoraDataset), a resource including transcripts of 14 hours of TV face-to-face interviews from the Italian political talk show Mezz’ora in più, for instance, gestures are annotated if they are judged as having a communicative intention [15] (displayed or signalled), or a noticeable effect on the recipient. Once a gesture has been selected, it is annotated with functional values, as well as with features that describe its behavioural shape and dynamics. The descriptions provided for gesture annotation, moreover, seem to be an approximation of the movement: gestures are often described relying on the annotator’s categorization and not using meaningful and objective parameters. For example, in the MUMIN coding scheme [16], used in the PoliModal Corpus and reported in Table 1, a number of possible values for each behaviour attribute are defined, but these either fail to describe the entire range of possibilities (e.g., only three values are provided for face movements) or excessively simplify the description (e.g., the value complex is used to capture movements where several trajectories are combined, thus leaving unspecified whether they combine sequentially or in a non-linear trajectory, for instance). Similar coding schemes are used in the Corpus d’interactions dialogales (CID, [12]) and in the Hungarian Multimodal Corpus [17].</p>
      <p>Table 1 (excerpt from the MUMIN coding scheme): behaviour attributes include Handedness, Head movement, trajectory, and Body posture (values: Towards-Interlocutor, Up, Down, Sideways, Other).</p>
      <p>For resources such as the Natural Media Motion-Capture Corpus (NM-MoCap-Corpus [8]), the Bielefeld Speech and Gesture Alignment Corpus (SaGA [9]) and the BAS SmartKom Public Video and Gesture corpus (SKP [18]), researchers decided to adopt McNeill’s categories [19] or a schema inspired by them [20, 21]. In addition, some of the Swedish data in the Thai/Swedish child data corpus [22] were partially annotated thanks to the standard notation CHAT [23]. In the CORMIP [24] resource, instead, each gesture is segmented according to gesture phrases and gesture units [25]. Gestures are then classified solely on the basis of iconicity, as ‘Pictorial’, ‘Non-Pictorial’ or ‘Conventional’. While the authors claim to avoid any categorization of gesture functions or conventionality, the description of their labels (see Table 2) seems to contradict this statement [26].</p>
      <p>Table 2 (gesture classification in CORMIP [26]): Pictorial gestures reproduce image-like shapes or the boundaries of a real-world object or action; Non-Pictorial gestures are rhythmic (i.e., batonic) movements or geometric forms, with deictic gestures also falling within this category; Conventional gestures have a degree of conventionality that allows a semantic value to be associated with them in a specific linguistic system (e.g., the ‘okay’ sign).</p>
      <p>Lastly, as far as Italian is concerned, the Padova Multimodal Corpus [27, 28] has to be mentioned, where textual transcriptions are enriched with annotations about a number of non-verbal components, including aspects such as gaze and gestures. The MultiModal MultiDimensional (M3D) labelling scheme [29, 30] (https://osf.io/ankdx/) tries to decouple gesture transcription into the three different dimensions of its form, its relation to spoken prosody, and its semantic or pragmatic functions. As reported in their manual, however, the transcriber is required to make choices, on the form layer, such as which is the predominant articulator (e.g., left or right hand, or both), or to choose for the articulator one of the provided forms, one of which is labeled as iconic OK Shape.</p>
      <p>The challenge of transcription becomes even more significant when dealing with multimodal corpora representing sign language. Typically, this issue is addressed using glosses, a form of sign-to-word translation that provides information about the meaning of signs without indicating their form [7]. However, over the years, some systems have been developed to represent the shape of signs. Most of these systems focus primarily on the hands [31], which are only a small part of the articulators contributing to meaning. Among these systems, Typannot [32] stands out as it offers a comprehensive description of the entire set of body parts – from fingers to toes, including the head and torso – used to transcribe both sign languages and co-verbal gestures.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Towards the Gest-IT corpus: blind and sighted speakers</title>
      <sec id="sec-2-1">
        <title>We aim at building a corpus consisting of maximally eco</title>
        <p>logical interactions, transcribed on three separate layers
aligned to each other: (i) an orthographic transcription;
Handedness (ii) a prosodic transcription, and (iii) a gestural
transcripjHeactnodrymovement tra- tion. At present, we are still in an initial, exploratory
Body posture Towards-Interlocutor, Up, Down, Sideways, Other phase, but we already addressed the most important
decisions to be made.</p>
        <p>Table 2 The first decision concerned the informants to be
Gestures classification in CORMIP [26] recorded. In order to be able to investigate whether the
ability to see and the perception of being seen during a
Pictorial iombjaegcet-olirkaecsthioanp.es, or boundaries of a real-world communicative exchange can influence gesture
producNon-Pictorial rythmic movements (i.e., batonic) or geometric tion, we decided to take into consideration both sighted
feogromrys.. Deictic gestures also fall within this cat- and visually impaired L1 speakers in dialogical situations.
Conventional gestures with a degree of conventionality that al- Gesture is indeed closely linked not only to
intersubjeclsoewmsatnotiacsvsaolcuiaetteo, itnehamsp(eec.gif.,icthlieng‘oukiastyi’cssiygsnt)e.m, a tive needs, connected to clarity, eficiency and
attentiongetting functions, but also to cognitive needs: speakers
recur to gestures both when the interlocutor is not
visicategorization of gesture functions or conventionality, ble [33] and when the speaker is visually impaired [34],
the description of their lables (see Table 2) seems to con- thus independently of the interlocutors’ ability to see and
tradict this statement [26]. Lastly, as far as Italian is interpret them. Yet, the actual relation and reciprocal
inlfuence between gestures and the perception of being
tcoonbceermneend,titohneedP,adwohvearMetuelxtitmuaoldatrlaCnosrcpriupsti[o2n7s, 2a8r]e heans- seen has received little attention so far.
riched with annotations about a number of non-verbal We included in the study 6 blind and 8 sighted
particicomponents, there including also aspects such as gaze pants, recruited on a voluntary basis and through a
protolaanbdelgliensgtusrcehse.mThe3e [M29u,lt3i0M] otrdiaels MtoudlteicDoiumpelensgieosntaulre(Mtr3aDn)- ceothlitchaaltrheaqsubireeemneenvtasl4u.aTtehdeabslicnodmgprloiaunpt
iwnictlhudGeDdPsRpeaankdscription in the three diferent dimensions of its form, its ers who were born blind, who acquired blindness later
relation to spoken prosody and its semantic or pragmatic and who are partially-sighted. The total average age of
functions. As reported in their manual, however, the the participants is mean = 39 years old (sd = ± 18.7).
transcriber is required to make choices, on the form layer, The average age of the PG is mean = 55.8 years (sd
such as which is the predominant articulator (e.g., left or = ± 18), while the control group has an average age of
right hand, or both) or to choose for the articulator one mean = ± 26 years old (sd = ± 3.9). The total gender
of the provided forms, one of which is labeled as iconic distribution is 85.7% F and 14.2% M. In the blind goup
(BG) 100% of the participants are F. In the sighted group
OKTshheapceh.allenge of transcription becomes even more (SG) 75% are F and 25% are M. The total average
educasignificant when dealing with multimodal corpora repre- tional level distribution shows that 64.2% of participants
senting sign language. Typically, this issue is addressed has a bachelor’s degree, while 35.7% has a high school
using glosses, a form of sign-to-word translation that diploma. In the BG 83.3% of participants has a high
provides information about the meaning of signs with- school diploma, while 16.7% of the participants has a
out indicating their form [7]. However, over the years, bachelor’s degree. In the SG 100% of participants in the
some systems have been developed to represent the shape control group has a bachelor’s degree.
of signs. Most of these systems focus primarily on the All participants were paired and later involved in
30hands [31], which are only a small part of the articu- minutes seated conversations, to elicit samples of
spontalators contributing to meaning. Among these systems, neous speech. As the participants to each dialogue were
Typannot [32] stands out as it ofers a comprehensive</p>
      </sec>
      <sec id="sec-2-2">
        <title>3https://osf.io/ankdx/</title>
      </sec>
      <sec id="sec-2-3">
        <title>4Positive evaluation of the Bioethics Committee of the University of</title>
        <p>Bologna n. 0020349, 24/01/2024.</p>
        <sec id="sec-2-3-1">
          <title>4.1. Data repository</title>
          <p>unlikely to know each other, in order to avoid moments of
silence, some questions were prepared to enhance
spontaneous conversations (see Appendix A). Interestingly,
speakers recurred to these prompts only in few cases: the
interactions developed very spontaneously despite the
absence of previous contacts among the interlocutors.</p>
          <p>We built the pairs and the interactional setting
according to two parameters:
• speakers could belong to the same category of
participant (both blind, both sighted) or diferent
categories. We coded these two situations as S
(same, blind-blind or sighted-sighted
conversation) or D (diferent , blind-sighted conversation);
• speakers could be facing each other or be seated
back-to-back, to ensure that participants could
not perceive the other’s nonverbal
communication. We coded these two situations as M (masked,
back-to-back situation) or U (unmasked, facing
situation).</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Resource building is a team enterprise, performed asyn</title>
        <p>chronously by a number of diferent people (i.e., PIs,
interns, technicians etc.), often with diferent levels of
technical expertise and background knowledge about the</p>
        <p>We recorded 13 conversations, for a total of roughly 7 genesis of the data. Our project is no exception.
hours (428.15 minutes), from three points of view: the Therefore, in order to ensure data consistency and
central camera faced the couple, whereas the other two maintenance, a specific workflow has been put in place.
recorded the left side and the right side (the left and More specifically, a central git repository6 keeps track of
right cameras were located so that they could capture the status of the resource. The main branch contains the
the participants frontally, see Figures 1 and 2). The goal last, released version of the corpus while the dev branch
was to take the participants’ gestures from all possible is used for development in between releases (versions are
perspectives. Recordings took place over two days. Some numbered according to semantic versioning standards).
details about the 13 recorded conversations are available Each participant and each conversation is defined
in Table 35. through a .yaml file (Appendix B), allowing for a
number of CI/CD practices to be put in place: each time a new
conversation description file is pushed to the repository,
4. The Gest-IT corpus schema for instance, a table summarizing the full status of the
resource is generated. Similarly, automatic checks are
The other decisions that we had to make from the very performed each time a transcription is updated to ensure
beginning concerned the repository, the archiving proto- the consistency of the overall resource: for instance, a
col, and the standards for transcribing the three layers we script makes sure that names of layers in the
transcripaim to represent (orthographic, prosodic and gestural). tion correspond to participants, that jefersonian notation
(see Section 4.2) is well formed, etc.</p>
      </sec>
      <sec id="sec-2-5">
        <title>5Interactions involving only blind speakers did not require the</title>
        <p>masked setting, which was aimed to let sighted speakers experience
a sight impairment of some sort during in-presence communication. 6https://github.com/LaboratorioSperimentale/Gest-IT</p>
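        <p>As an illustration of the kind of automatic check mentioned above, the following sketch loads a conversation description file and verifies that the participant codes it references are well formed and known, and that the Facing value is admissible. It is only a minimal example: the field layout follows the schema in Appendix B, but the paths, file names and the exact set of checks are assumptions of this sketch, not the project’s actual scripts.</p>
        <preformat># Minimal sketch of a consistency check that a CI job could run on every push.
# Paths, file names and the set of checks are illustrative assumptions.
import re
import sys
from pathlib import Path

import yaml  # PyYAML

CODE_PATTERN = re.compile(r"^[SB]\d{3}$")  # 4-char participant code, e.g. S001 or B012

def check_conversation(conv_file, participants_dir):
    """Return a list of human-readable problems found in one conversation file."""
    problems = []
    conv = yaml.safe_load(conv_file.read_text(encoding="utf-8"))
    for code in conv.get("Participants", []):
        if not CODE_PATTERN.match(str(code)):
            problems.append(f"{conv_file.name}: malformed participant code {code!r}")
        elif not (participants_dir / f"{code}.yaml").exists():
            problems.append(f"{conv_file.name}: unknown participant {code}")
    if conv.get("Facing") not in ("M", "U"):
        problems.append(f"{conv_file.name}: Facing must be 'M' or 'U'")
    return problems

if __name__ == "__main__":
    issues = []
    for f in sorted(Path("conversations").glob("*.yaml")):
        issues.extend(check_conversation(f, Path("participants")))
    for msg in issues:
        print(msg)
    sys.exit(1 if issues else 0)</preformat>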
        <p>Data pertaining to each conversation is constituted Transcriptions are thus available in two formats: they
by a set of digital objects, that represent diferent layers can be read as simple orthographic texts, or they can be
of information attached to the same recording. These read as enriched texts with prosodic and interactional
include: (i) three video tracks and one audio track; (ii) a information (such as overlaps, speed alterations,
ascendverbal transcription layer, which was initially automat- ing or descending intonation, pauses, etc., as in example
ically created with the whisper ASR toolkit [35] and below). In both cases, it is possible to directly relate the
then revised at the ortographic and prosodic level (Sec- transcription unit to the audiovisual unit. A further
retion 4.2); (iii) gesture transcription, starting from video vision step will be done once the corpus will be fully
sources (Section 4.3); (iv) UD annotation layers. transcribed, in order to make sure that notation is
con</p>
        <p>Transcriptions are maintained in CoNLL-U format7, sistent throughout the resource.
with specific MISC features for the gesture component.</p>
        <p>This will allow, in the future, to enrich the resource with
additional annotation layers.</p>
        <sec id="sec-2-5-1">
          <title>4.2. Verbal language transcription</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>As regards verbal communication, we decided to adopt</title>
        <p>the standards of the KIParla corpus [36], a corpus of
spoken Italian that allows full access to audio files and
transcriptions of roughly 153 hours of spontaneous speech 8.</p>
        <p>Once the recordings were acquired, the transcription
process began. In accordance with the KIParla
protocol, it was agreed to use the ELAN software [37], which
allows for time alignment of videos, audio files and
transcriptions. In practice, the speech was segmented into
transcription units identified on a perceptual basis,
especially by reference to prosodic unit boundaries. The
transcription process involved two steps:
7https://universaldependencies.org/ext-format.html
8The KIParla corpus is an incremental and modular resource,
therefore this count refers to the three modules KIP, ParlaTO and KIPasti,
which are online at the moment (as of July 2024). As soon as new
modules are published, the global dimension of the resource will
increase.</p>
        <p>• orthographic transcription, which included
anonimization, turn assignment, and nonverbal
behaviours. Whenever the annotator didn’t under- ‘wonderful Seville’
stand, they could either choose ‘xxx’ or type their
hypothesis in parentheses; 4.3. Gesture transcription
• prosodic transcription, following a simplification
of the Jeferson system [ 38], widely shared by the In order to provide also a transcription of gestures, as
scientific community [ 39]. The employed conven- objective and interpretation-independent as possible, we
tions [40] are reported in Table 4. decided to employ Typannot.</p>
        <p>Typannot is a typographic system for the
representation of sign languages, a project in development since
2013 by the Gestual Script research group, composed of
linguists, graphic designers, typographers, and computer
scientists. Its articulatory description of the body,
independent of the language studied, allows it to be adapted
(4) B001 [siviglia meravigliosa]</p>
        <p>speak-id overlap overlap
(1) S001 l’ultimo che ho fatt:o allora sono
speak-id - - - Prolonged -
stata a siviglia:, p[er natale]
- - Prolonged+Ascending overlap overlap
‘my last one well I was in Seville for Christmas’
(2) B001 [che io ador]o che [io
speak-id overlap overlap overlap - overlap
adoro]
overlap
‘(Seville) which I love’
(3) S001 [bellissima po]i a [natale è
speak-id overlap overlap - overlap overlap
stato mag]ico
overlap
‘wonderful Christmas was magic’
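        <p>Since each transcription unit stores both the plain text and the Jefferson-annotated text (the text and text_jefferson metadata shown in Appendix C), the enriched reading can in principle be reduced to the orthographic one by stripping the prosodic markup. The sketch below handles only the conventions visible in the example above (overlap brackets, ‘:’ for prolonged sounds, a final ‘,’ for ascending intonation); the full inventory of Table 4 may require more rules.</p>
        <preformat># Minimal sketch: derive a plain orthographic reading from a Jefferson-style line.
# Only the conventions visible in the example above are handled; the complete
# convention table used in Gest-IT (Table 4) may contain further symbols.
import re

def strip_jefferson(line):
    line = re.sub(r"[\[\]]", "", line)   # remove overlap brackets
    line = line.replace(":", "")         # drop prolonged-sound marks
    line = re.sub(r",\s*$", "", line)    # drop a final ascending-intonation mark
    return re.sub(r"\s+", " ", line).strip()

print(strip_jefferson("e[h: la sera]"))  # -> "eh la sera"</preformat>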
      </sec>
      <sec id="sec-schema-3">
        <title>4.3. Gesture transcription</title>
        <p>In order to provide a transcription of gestures that is as objective and interpretation-independent as possible, we decided to employ Typannot.</p>
        <p>Typannot is a typographic system for the representation of sign languages, a project in development since 2013 by the Gestual Script research group, composed of linguists, graphic designers, typographers, and computer scientists. Its articulatory description of the body, independent of the language studied, allows it to be adapted to the study of gestures as well. Typannot proposes to analyze gestures and signs as realizations of the whole body and not just the hands: to facilitate the analysis, the body is divided into different Articulatory Systems (AS), covering every body part from the hands to the feet and including a description of facial expressions. For the purpose of the Gest-IT project, only the three systems listed below will be considered:</p>
      <sec id="sec-2-7">
        <title>Finger (F): the dynamics of the fingers of the hand</title>
        <p>(thumb, index, middle, ring, and little finger).
Furthermore, the distinction between the fingers of
the right hand and those of the left hand will be
considered and referred to respectively as RH and
LH;</p>
      </sec>
      <sec id="sec-2-8">
        <title>UpperLimb (UL): the dynamics of the upper limbs (arm, forearm, hand);</title>
      </sec>
      <sec id="sec-2-9">
        <title>UpperBody (UB): the dynamics of the segments that make up torso (hip, spine and shoulder), neck and head.</title>
      </sec>
      <sec id="sec-2-10">
        <title>In this system, the sign’s form is seen as a set of ar</title>
        <p>ticulatory body information (we extend this view to
gestures). Currently, the generic characters that make up
the graphic inventory of Typannot are used to describe
the dynamics of all body segments.</p>
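        <p>To make the relation between these Articulatory Systems and the corpus schema concrete, the sketch below models a transcribed gesture unit with the information attached to it in the CoNLL-U excerpts of Appendix C (the type F:LH/F:RH, the alignment times and the Typannot code). The class and field names are assumptions made for this illustration, not the project’s actual code.</p>
        <preformat># Illustrative data structure for a transcribed gesture unit, mirroring the
# metadata used in Appendix C (# type = F:LH, AlignBegin/AlignEnd, gesture=...).
# Names are assumptions for this sketch.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ArticulatorySystem(Enum):
    FINGER = "F"       # dynamics of the fingers of the hand
    UPPER_LIMB = "UL"  # arm, forearm, hand
    UPPER_BODY = "UB"  # torso, neck and head

@dataclass
class GestureUnit:
    sent_id: str                  # e.g. "gu0003"
    speaker: str                  # e.g. "S001"
    system: ArticulatorySystem    # which Articulatory System is described
    hand: Optional[str]           # "LH" or "RH" for Finger units, otherwise None
    align_begin: float            # seconds from the start of the recording
    align_end: float
    typannot_code: str            # string of Typannot generic characters

unit = GestureUnit("gu0003", "S001", ArticulatorySystem.FINGER,
                   "LH", 11.410, 11.740, "...")
print(unit.system.value + ":" + unit.hand)  # prints "F:LH"</preformat>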
        <sec id="sec-2-10-1">
          <title>4.4. Towards a unified CoNLL-U corpus</title>
        </sec>
      </sec>
      <sec id="sec-2-11">
        <title>The resulting corpus is composed of verbal-prosodic units</title>
        <p>and gestural units, with information about their
overlaps9. Each unit is described by the metadata listed in
Table 5. In case of non verbal units, the text is filled with
a placeholder token (EMPTY) and relevant information is
contained in the MISC column, where the following
features are introduced, meta for para-verbal information
(such as laughs, coughs...) and gesture for Typannot
codes (see Appendix C).</p>
      </sec>
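        <p>To illustrate how the unified file can be consumed, the sketch below reads an extended CoNLL-U file such as the excerpt in Appendix C and pairs each verbal unit with the gestural units it overlaps with, relying only on the sent_id, text and overlaps metadata visible there. The use of the conllu Python package and the file name are assumptions of this sketch, not project requirements.</p>
        <preformat># Minimal sketch: pair verbal units with overlapping gesture units in the
# unified CoNLL-U corpus, using only the metadata keys shown in Appendix C.
from conllu import parse_incr  # pip install conllu

def load_units(path):
    """Split units into verbal and gestural ones, keyed by sent_id."""
    verbal, gestural = {}, {}
    with open(path, encoding="utf-8") as fh:
        for sent in parse_incr(fh):
            meta = sent.metadata
            # gestural units carry the EMPTY placeholder as their text
            target = gestural if meta.get("text") == "EMPTY" else verbal
            target[meta["sent_id"]] = meta
    return verbal, gestural

def overlapping_gestures(verbal, gestural):
    for uid, meta in verbal.items():
        ids = meta.get("overlaps", "").split()
        yield uid, meta.get("text"), [g for g in ids if g in gestural]

verbal, gestural = load_units("pilot.conllu")  # hypothetical file name
for uid, text, gests in overlapping_gestures(verbal, gestural):
    print(uid, repr(text), "overlaps with gesture units:", gests)</preformat>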
      <sec id="sec-2-12">
        <title>The aim of this paper is to share with the scientific com</title>
        <p>munity the protocol developed to build a multimodal
resource for the Italian language in terms of data collection
(design, ethic issues, practicalities); data management
and curation; data transcription, annotation and analysis.
In doing so, we contribute to the debate on multimodal
resource building, which is still lacking an established
standard. In particular, our contribution in this respect is
twofold.</p>
        <p>Firstly, our study suggests to adopt a three-layer
transcription where the three layers (i.e., the orthographic
transcription, the prosodic/interactional transcription,
and the gestural transcription) align to each other, by
using ELAN as a tool for transcribing and CoNLL-X as
an interoperable output format. This has the advantage
of grounding gestures as an integrated semiotic source
within verbal conversation and ultimately allows to
unveil gesture-speech regularities.</p>
        <p>Secondly, we propose an innovative approach for the
annotation of gesture data. By relying on common
practices in the field of sign languages, we suggest that
gesture transcription should follow the same rationale of
phonetic transcription, with a method that describes
‘objective’ aspects that characterize the ‘form’ of the gesture,
thus allowing for an interpretation-independent
annotation.</p>
        <p>Clearly, the project is still at a very preliminary
stage. Next steps will include the complete orthographic,
prosodic and gesture transcription of the recordings; a
thorough revision and pseudoanymization.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>The Gest-IT corpus was built as an internship project</title>
        <p>at the Experimental Lab10 of the Department of Modern
Languages, Literatures, and Cultures (LILEC) of the
University of Bologna. We would like to thank the Gestual
Script team (Ecole Supérieure d’Art et Design, ESAD,
Amiens) for providing us with the Typannot system, and
the Istituto dei ciechi Francesco Cavazza11 (Bologna) for
helping us with the recruiting of visually impaired
participants. A special acknowledgment to the Centro Studi
sul Seicento e Settecento Spagnolo (CSSS) of the LILEC
Department for letting us record our pilot videos in their
studio.
[1] A. Lüdeling, M. Kytö, Corpus Linguistics: An
International Handbook, De Gruyter Mouton,</p>
      </sec>
      <sec id="sec-3-2">
        <title>9At the moment of writing, 1 minute of pilot transcription has been</title>
        <p>
          produced.
10https://site.unibo.it/laboratorio-sperimentale/
11https://www.cavazza.it/
2009. URL: https://www.degruyter.com/database/ [14] D. Trotta, A. Palmero Aprosio, S. Tonelli, A. Elia,
COGBIB/entry/cogbib.7917/html. Adding gesture, posture and facial displays to the
[2] J. Bezemer, C. Jewitt, Multimodality: A guide for polimodal corpus of political interviews, in:
Prolinguists, Research methods in linguistics 28 (
          <xref ref-type="bibr" rid="ref7">2018</xref>
          ). ceedings of the 12th Language Resources and
Evalu[3] N. Abner, K. Cooperrider, S. Goldin-Meadow, Ges- ation Conference (LREC 2020), European Language
ture for linguists: A handy primer, Language Resources Association, 2020, pp. 4320–4326.
and Linguistics Compass 9 (2015) 437–451. doi:10. [15] J. Allwood, Capturing diferences between social
ac1111/lnc3.12168. tivities in spoken language, Pragmatics and Beyond
[4] Á. Abuczki, E. B. Ghazaleh, An overview of multi- New Series (2001) 301–320.
        </p>
        <p>
          modal corpora, annotation tools and schemes, Ar- [16] J. Allwood, L. Cerrato, K. Jokinen, C. Navarretta,
gumentum 9 (2013) 86–98. P. Paggio, The mumin coding scheme for the
anno[5] M. E. Foster, J. Oberlander, Corpus-based gen- tation of feedback, turn management and
sequenceration of head and eyebrow motion for an ing phenomena, Language Resources and
Evaluaembodied conversational agent, Language Re- tion 41 (
          <xref ref-type="bibr" rid="ref21">2007</xref>
          ) 273–287.
sources and Evaluation 41 (
          <xref ref-type="bibr" rid="ref21">2007</xref>
          ). doi:10.1007/ [17] K. Pápay, S. Szeghalmy, I. Szekrényes, Hucomtech
s10579-007-9055-3. multimodal corpus annotation, Argumentum 7
[6] J. Bressem, S. H. Ladewig, C. Müller, 71. Linguistic (2011) 330–347.
        </p>
        <p>Annotation System for Gestures, De Gruyter Mou- [18] F. Schiel, S. Steininger, U. Türk, The smartkom
ton, Berlin, Boston, 2013, pp. 1098–1124. URL: https: multimodal corpus at bas., in: LREC, Citeseer, 2002.
//doi.org/10.1515/9783110261318.1098. doi:doi:10. [19] D. McNeill, Gesture and Thought, University
1515/9783110261318.1098. of Chicago Press, 2013. doi:10.7208/chicago/
[7] C. S. Bianchini, (D)écrire les Langues des Signes: 9780226514642.001.0001.
une approche grapholinguistique aux Langues [20] I. Mlakar, M. Rojc, Capturing form of non-verbal
des Signes, number 8 in Grapholinguistics and conversational behavior for recreation on synthetic
its Applications, Fluxus Editions, 2024. URL: conversational agent eva, WSEAS Trans.
Comhttps://hal.science/hal-04602726. doi:10.36824/ put.[Print ed.] 11 (2012) 218–226.
2024-bianchini, iSSN 2681-8566 &amp; eISSN 2534- [21] I. Mlakar, D. Verdonik, S. Majhenič, M. Rojc,
To5192; EAN 9782487055025; CrossRef 1612798840. wards pragmatic understanding of conversational
[8] F. Freigang, M. A. Priesters, R. Nishio, K. Bergmann, intent: A multimodal annotation approach to
mulYour data at the center of attention: A metadata tiparty informal interaction–the eva corpus, in:
session profile for multimodal corpora, in: Proceed- Statistical Language and Speech Processing: 7th
Inings of the CLARIN Annual Conference, volume ternational Conference, SLSP 2019, Ljubljana,
Slove2014, 2014. nia, October 14–16, 2019, Proceedings 7, Springer,
[9] A. Lücking, K. Bergmann, F. Hahn, S. Kopp, 2019, pp. 19–30.</p>
        <p>
          H. Rieser, The bielefeld speech and gesture align- [22] D. Fišer, J. Lenardič, Overview of multimodal
corment corpus (saga), in: LREC 2010 workshop: Mul- pora in the clarin (
          <xref ref-type="bibr" rid="ref4">2020</xref>
          ).
timodal corpora–advances in capturing, coding and [23] B. MacWhinney, Computational transcript analysis
analyzing multimodality, 2010. and language disorders, in: Handbook of
Neurolin[10] J. Du Bois, G. Troiani, Typology and its data: func- guistics, Elsevier, 1998, pp. 599–616.
tional monoculture or structural diversity?, pre- [24] L. Lo Re, Prosody and gestures to modelling
mulsented at Naturally occurring data in and beyond timodal interaction: Constructing an italian pilot
linguistic typology, 2023. corpus, IJCoL. Italian Journal of Computational
[11] G. Troiani, Representing a language in use: cor- Linguistics 7 (2021) 33–44.
        </p>
        <p>
          pus construction, prosody, and grammar in Kazakh, [25] A. Kendon, et al., Gesticulation and speech: Two
Ph.D. thesis, UC Santa Barbara, 2023. aspects of the process of utterance, The relationship
[12] R. Bertrand, P. Blache, R. Espesser, G. Ferré, C. Meu- of verbal and nonverbal communication 25 (1980)
nier, B. Priego-Valverde, S. Rauzy, Le cid-corpus of 207–227.
interactional data-annotation et exploitation mul- [26] L. Lo Re, Corpus multimodale dell’italiano parlato:
timodale de parole conversationnelle, Revue TAL: basi metodologiche per la creazione di un prototipo,
traitement automatique des langues 49 (
          <xref ref-type="bibr" rid="ref17">2008</xref>
          ) pp– Ph.D. thesis, University of Firenze, 2022.
105. [27] K. Ackerley, F. Coccetta, Enriching language
learn[13] D. Knight, S. Adolphs, P. Tennent, R. Carter, The ing through a multimedia corpus, ReCALL 19 (
          <xref ref-type="bibr" rid="ref21">2007</xref>
          )
nottingham multi-modal corpus: A demonstration, 351–370. doi:10.1017/S0958344007000730.
in: Programme of the Workshop on Multimodal [28] F. Coccetta, et al., Multimodal functional-notional
Corpora, 2009, p. 64. concordancing, New Trends in Corpora and
Lan
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Prompts for conversation</title>
    </sec>
    <sec id="sec-5">
      <title>B. Metadata schemata</title>
      <sec id="sec-5-1">
        <title>For both participants (see Subsection B.1) and conversations Subsection B.2), metadata is collected and maintained in .yaml files, with the following formats</title>
        <sec id="sec-5-1-1">
          <title>B.1. Participants metadata</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Code : # 4− c h a r s t r i n g composed by e i t h e r S ( S i g h t e d ) or B ( B l i n d ) and an i n t e g e r padded with 0 s</title>
      </sec>
      <sec id="sec-5-3">
        <title>Gender : # e i t h e r F ( Female ) or M ( Male )</title>
      </sec>
      <sec id="sec-5-4">
        <title>Age : # age range o f t h e p a r t i c i p a n t e x p r e s s e d a s 5− y e a r s b i n s ( 0 − 5 , 6 −10 , 1 1 − 2 0 , . . . )</title>
        <p>Region : # 1 o f t h e 20 i t a l i a n
r e g i o n s ( t y p i n g c o n v e n t i o n s
p r o v i d e d )</p>
      </sec>
      <sec id="sec-5-5">
        <title>F i r s t l a n g u a g e : # upper c a s e d i s o</title>
        <p>−693 −3 code o f mother tongue
E d u c a t i o n l e v e l : # one v a l u e i n (
P r i m a r i a , Medie i n f e r i o r i ,
Medie s u p e r i o r i , Laurea , PhD )
P a r t i c i p a n t s :
− [ p a r t i c i p a n t _ c o d e _ 1 ] # code
o f p a r t i c i p a n t s i t t i n g on
l e f t s i d e
− [ p a r t i c i p a n t _ c o d e _ 2 ] # code
o f p a r t i c i p a n t s i t t i n g on
r i g h t s i d e
F a c i n g : # M ( Masked ) or U (
unmasked ) depending on t y p e o f
c o n v e r s a t i o n
Data :
− Video :
− L e f t : path / t o / l e f t / camera /
r e c o r d i n g
− C e n t r e : path / t o / c e n t r a l /</p>
        <p>camera / r e c o r d i n g
− R i g h t : path / t o / r i g h t / camera
/ r e c o r d i n g
− Audio : path / t o / a u d i o / f i l e
− T r a n s c r i p t i o n :
− Automatic : path / t o /</p>
        <p>a u t o m a t i c / t r a n s c r i p t i o n
− Manually r e v i s e d : path / t o /
manually / r e v i s e d /
t r a n s c r i p t i o n
− P r o s o d i c : path / t o / p r o s o d i c /
t r a n s c r i p t i o n
− G e s t u a l : path / t o / g e s t u a l /
t r a n s c r i p t i o n</p>
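        <p>For concreteness, a filled-in pair of files might look as follows. All values, file names and extensions are invented for illustration and do not describe any actual participant or recording; only the field layout follows the schemata above.</p>
        <preformat># participants/S001.yaml (illustrative values only)
Code: S001
Gender: F
Age: 26-30
Region: Emilia-Romagna
First language: ITA
Education level: Laurea

# conversations/DUC22051430.yaml (illustrative values only)
Participants:
  - S001   # sitting on the left side
  - B001   # sitting on the right side
Facing: U
Data:
  - Video:
    - Left: data/DUC22051430/video_left.mp4
    - Centre: data/DUC22051430/video_centre.mp4
    - Right: data/DUC22051430/video_right.mp4
  - Audio: data/DUC22051430/audio.wav
  - Transcription:
    - Automatic: data/DUC22051430/auto.eaf
    - Manually revised: data/DUC22051430/revised.eaf
    - Prosodic: data/DUC22051430/prosodic.eaf
    - Gestual: data/DUC22051430/gestural.eaf</preformat>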
      </sec>
    </sec>
    <sec id="sec-6">
      <title>C. Integrated transcription in</title>
    </sec>
    <sec id="sec-7">
      <title>ELAN</title>
      <preformat># sent_id = tu0005
# overlaps = gu0003 gu0004 gu0005 gu0006
# conversation = DUC22051430
# speaker_id = S001
# duration = 1.088
# text_jefferson = entrambe da sole
# text = entrambe da sole
1   entrambe   entrambi   PRON   _   Gender=Fem|Number=Plur|PronType=Ind   0   root   _   AlignBegin=11.704
2   da         da         ADP    _   _                                     3   case   _   _
3   sole       solo       ADJ    _   Gender=Fem|Number=Plur                1   nmod   _   AlignEnd=12.792

# sent_id = tu0006
# overlaps = tu0007 gu0007 gu0008 gu0009
# conversation = DUC22051430
# speaker_id = S001
# duration = 0.987
# text_jefferson = e[h: la sera]
# text = eh la sera
1   eh     eh     INTJ   _   _                                     3   discourse   _   AlignBegin=13.047|Overlap=B:tu0007|ProlongedSound=eh:
2   la     il     DET    _   Definite=Def|Gender=Fem|Number=Sing   3   det         _   Overlap=I
3   sera   sera   NOUN   _   Gender=Fem|Number=Sing                0   root        _   AlignEnd=14.034|Overlap=I

# conversation = DUC22051430
# sent_id = gu0003
# overlaps = tu0004 tu0005
# speaker = S001
# duration = 0.330
# text = EMPTY
# type = F:LH
1   EMPTY   EMPTY   X   _   _   0   root   _   AlignBegin=11.410|AlignEnd=11.740|gesture='\ue5de\ue002[\uf197\ue008\uf19f\ue5ea\ue5ef\ue5e8\ue5ef\uf1a0\ue5fe\ue5ee-\ue004\ue005\ue006\ue007\uf19f\ue5fe\ue5ee\ue5e8\ue5ef\uf1a0\ue5e7\ue5ef][\uf198\ue001]'

# conversation = DUC22051430
# sent_id = gu0004
# overlaps = tu0005 gu0005
# speaker = S001
# duration = 0.610
# text = EMPTY
# type = F:LH
1   EMPTY   EMPTY   X   _   _   0   root   _   AlignBegin=11.740|AlignEnd=12.350|gesture='\ue5de\ue002[\uf197\ue008\uf19f\ue5ff\ue5ee\ue5fb\ue5ee\uf1a0\ue5fe\ue5ee-\ue004\ue005\ue006\ue007\uf19f\ue5fe\ue5ee\ue5fb\ue5ee\uf1a0\ue5fd\ue5ee][\uf198\ue001]'

# conversation = DUC22051430
# sent_id = gu0005
# overlaps = tu0005 gu0004
# speaker = S001
# duration = 0.610
# text = EMPTY
# type = F:RH
1   EMPTY   EMPTY   X   _   _   0   root   _   AlignBegin=11.740|AlignEnd=12.350|gesture='\ue5de\ue003[\uf197\ue008\uf19f\ue5ff\ue5ee\ue5fb\ue5ee\uf1a0\ue5fe\ue5ee-\ue004\ue005\ue006\ue007\uf19f\ue5fe\ue5ee\ue5fb\ue5ee\uf1a0\ue5fd\ue5ee][\uf198\ue001]'</preformat>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>guage Learning. London: Continuum</source>
          (
          <year>2011</year>
          )
          <fpage>121</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          138. [29]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Rohrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vilà-Giménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Florit-Pons</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Prieto</surname>
          </string-name>
          ,
          <article-title>The multimodal multidimensional (m3d)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>(GESPIN)</source>
          (
          <year>2020</year>
          ). [30]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Rohrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Delais-Roussarie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prieto</surname>
          </string-name>
          , Visual-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>demic discourses, Lingua</source>
          <volume>293</volume>
          (
          <year>2023</year>
          )
          <fpage>103583</fpage>
          . [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chevrefils</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Danet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Doan</surname>
          </string-name>
          , C. Thomas,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <issue>Modalities 1</issue>
          (
          <year>2021</year>
          )
          <fpage>49</fpage>
          -
          <lpage>63</lpage>
          . [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boutet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Danet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>otics</surname>
          </string-name>
          (
          <year>2018</year>
          )
          <fpage>391</fpage>
          -
          <lpage>426</lpage>
          . [33]
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Alibali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Heath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Myers</surname>
          </string-name>
          , Efects of
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Journal of Memory and Language</source>
          <volume>44</volume>
          (
          <year>2001</year>
          )
          <fpage>169</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1006/JMLA.
          <year>2000</year>
          .
          <volume>2752</volume>
          . [34]
          <string-name>
            <surname>J. M. Iverson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goldin-Meadow</surname>
          </string-name>
          , Why people
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>gesture when they speak</article-title>
          ,
          <source>Nature</source>
          <year>1998</year>
          396:
          <fpage>6708</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <volume>396</volume>
          (
          <year>1998</year>
          )
          <fpage>228</fpage>
          -
          <lpage>228</lpage>
          . URL: https://www.nature.com/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>articles/24300</source>
          . doi:
          <volume>10</volume>
          .1038/24300. [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu, G. Brockman,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>nition via large-scale weak supervision</article-title>
          ,
          <year>2023</year>
          . [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mauri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ballarè</surname>
          </string-name>
          , E. Goria,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cerruti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suriano</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>SunSITE Central Europe</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sloetjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          , Annotation by category-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>elan and iso dcr</article-title>
          , in: 6th international Conference on
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Language</given-names>
            <surname>Resources</surname>
          </string-name>
          and
          <article-title>Evaluation (LREC</article-title>
          <year>2008</year>
          ),
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <year>2008</year>
          . [38]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jeferson</surname>
          </string-name>
          , et al.,
          <source>Glossary of transcript symbols</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>with an introduction, Conversation analysis (</article-title>
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          13-
          <fpage>31</fpage>
          . [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Slembrouck</surname>
          </string-name>
          ,
          <article-title>Transcription-the extended direc-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          variation in transcription',
          <source>Discourse Studies 9</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          (
          <year>2007</year>
          )
          <fpage>822</fpage>
          -
          <lpage>827</lpage>
          . [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ballarè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Goria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mauri</surname>
          </string-name>
          , Italiano parlato e vari-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>del corpus</surname>
            <given-names>KIParla</given-names>
          </string-name>
          , Pàtron editore,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>