<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Did Somebody Say 'Gest-IT'? A Pilot Exploration of Multimodal Data Management</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ludovica Pannitto</string-name>
          <email>ludovica.pannitto@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lorenzo Albanesi</string-name>
          <email>lorenzo.albanesi@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Marion</string-name>
          <email>laura.marion@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federica Maria Martines</string-name>
          <email>federica.martines2@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carmelo Caruso</string-name>
          <email>carmelo.caruso@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudia S. Bianchini</string-name>
          <email>claudia.savina.bianchini@univ-poitiers.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Masini</string-name>
          <email>francesca.masini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Caterina Mauri</string-name>
          <email>caterina.mauri@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Alma Mater Studiorum - University of Bologna</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Poitiers, FoReLLIS Laboratory</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a pilot exploration of the construction, management and analysis of a multimodal corpus. Through a three-layer annotation that provides orthographic, prosodic, and gestural transcriptions, the Gest-IT resource makes it possible to investigate the variation of gesture-making patterns in conversations between sighted people and people with visual impairment. After discussing the transcription methods and technical procedures employed in our study, we propose a unified CoNLL-U corpus and indicate our future steps.</p>
      </abstract>
      <kwd-group>
        <kwd>Corpora</kwd>
        <kwd>Multimodality</kwd>
        <kwd>Gestuality</kwd>
        <kwd>Blindness</kwd>
        <kwd>Universal Dependencies</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>There exists a wide variety of multimodal resources for spoken and signed language, many of them openly available to the community through initiatives such as CLARIN-ERIC (https://www.clarin.eu/): for a collection of available multimodal resources see https://www.clarin.eu/resource-families/multimodal-corpora (spoken language) and https://www.clarin.eu/resource-families/sign-language-resources (sign language), while a list of audio-only resources of spoken language can be found at https://www.clarin.eu/resource-families/spoken-corpora.</p>
      <p>[…] label is often used to refer to any pragmatic gesture of epistemic denial performed by moving the shoulders, without specifying either the characteristics of the shoulder movement or the movements of other body parts that may have contributed to the execution of the gesture). Similar challenges arise in studies on Sign Languages. Although there is a larger number of transcription systems for the latter (see, for instance, the review in [7]), none of them has achieved the status of a universal standard. Additionally, attempts to adapt these systems to the transcription of gestures have so far been limited and not particularly successful. The lack of a transcription standard for gestures, one that describes them independently of their function or meaning, hinders the ability to precisely investigate the relationship between speech and gesture.</p>
      <p>Another aspect concerns the nature of the language data captured in multimodal resources: as the collection and standardization process for this kind of linguistic data is, by its very nature, much more complex, resources are often tailored to specific purposes and therefore involve task-oriented interactions (e.g., describing objects as in the NM-MoCap-Corpus [8]; spatial communication tasks as in the SaGA Corpus [9]), thus capturing interactions that may be naturalistic but are inherently non-ecological, i.e. not naturally-occurring [10, 11]. Often, participants are asked to wear special devices such as headsets or trackers during the recordings [12, 13], clearly altering the spontaneity of the interaction.</p>
      <p>The aim of the Gest-IT project is to build a multimodal corpus of ecological data, allowing for the integrated analysis of verbal and gestural communication in spontaneous interactions. In this paper, we focus on the protocol of multimodal data management that we tested for this resource. We first discuss the main existing multimodal resources (Section 2), showing how, as of today, there does not seem to be any ecological, accessible, multimodal corpus for Italian. We then introduce the Gest-IT pilot resource and present its main features with respect to existing resources (Section 3). Section 4 outlines the main design choices taken for the creation of our resource and Section 5 describes the path ahead.</p>
    </sec>
    <sec id="sec-overview">
      <title>2. Multimodal resources: problems and overview</title>
      <p>Multimodal corpus research faces two major problems: (i) the lack of existing transcription and annotation standards (tools, formats and schemes), especially for coding nonverbal behavior [4]; and (ii) the time-consuming nature of the transcription and annotation process, which is responsible for the relatively small sizes of the searchable multimodal corpora that are currently available.</p>
      <p>Specifically with respect to point (i), a major problem concerning available resources is the non-separation between the identification and description of gestures on the one hand, and their interpretation on the other. Indeed, in many resources and studies a particular gestural pattern is transcribed based on its function, i.e. its interpretation, rather than on a description of the ‘objective’ aspects that characterize its ‘form’. However, if we aim to provide an integrated analysis of verbal and nonverbal communication, it is crucial that – just as we employ IPA or simplified orthographic transcriptions for verbal signs – we establish a standard to transcribe nonverbal signs, in order to then annotate and interpret them.</p>
      <p>Furthermore, in most resources gesture is transcribed only with reference to verbal behaviour: the very identification of the gestures depends on their association, according to the annotator’s subjective filtering, with an identifiable verbal sequence. In the PoliModal corpus [14] (https://github.com/dhfbk/InMezzoraDataset), a resource including transcripts of 14 hours of TV face-to-face interviews from the Italian political talk show Mezz’ora in più, for instance, gestures are annotated if they are judged as having a communicative intention [15] (displayed or signalled), or a noticeable effect on the recipient. Once a gesture has been selected, it is annotated with functional values, as well as with features that describe its behavioural shape and dynamics. The descriptions provided for gesture annotation, moreover, seem to be an approximation of the movement: gestures are often described relying on the annotator’s categorization and not using meaningful and objective parameters. For example, in the MUMIN coding scheme [16], used in the PoliModal Corpus and reported in Table 1, a number of possible values for each behaviour attribute are defined, but these either fail to describe the entire range of possibilities (e.g., only three values are provided for face movements) or excessively simplify the description (e.g., the value complex is used to capture movements where several trajectories are combined, thus leaving unspecified whether they combine sequentially or in a non-linear trajectory, for instance). Similar coding schemes are used in the Corpus d’interactions dialogales (CID, [12]) and in the Hungarian Multimodal Corpus [17].</p>
      <p>Table 1 (excerpt from the MUMIN coding scheme): behaviour attributes include Handedness, Head movement, trajectory, and Body posture (values: Towards-Interlocutor, Up, Down, Sideways, Other).</p>
      <p>For resources such as the Natural Media Motion-Capture Corpus (NM-MoCap-Corpus [8]), the Bielefeld Speech and Gesture Alignment Corpus (SaGA [9]) and the BAS SmartKom Public Video and Gesture corpus (SKP [18]), researchers decided to adopt McNeill’s categories [19] or a schema inspired by them [20, 21]. In addition, some of the Swedish data in the Thai/Swedish child data corpus [22] were partially annotated thanks to the standard notation CHAT [23]. In the CORMIP [24] resource, instead, each gesture is segmented according to gesture phrases and gesture units [25]. Gestures are then classified solely on the basis of iconicity, as ‘Pictorial’, ‘Non-Pictorial’ or ‘Conventional’. While the authors claim to avoid any categorization of gesture functions or conventionality, the description of their labels (see Table 2) seems to contradict this statement [26].</p>
      <p>Table 2 (gesture classification in CORMIP [26]): Pictorial gestures reproduce image-like shapes or the boundaries of a real-world object or action; Non-Pictorial gestures are rhythmic (i.e., batonic) movements or geometric forms, with deictic gestures also falling within this category; Conventional gestures have a degree of conventionality that allows a semantic value to be associated with them in a specific linguistic system (e.g., the ‘okay’ sign).</p>
      <p>Lastly, as far as Italian is concerned, the Padova Multimodal Corpus [27, 28] has to be mentioned, where textual transcriptions are enriched with annotations about a number of non-verbal components, including aspects such as gaze and gestures. The MultiModal MultiDimensional (M3D) labelling scheme [29, 30] (https://osf.io/ankdx/) tries to decouple gesture transcription into the three different dimensions of its form, its relation to spoken prosody, and its semantic or pragmatic functions. As reported in their manual, however, the transcriber is required to make choices, on the form layer, such as which is the predominant articulator (e.g., left or right hand, or both), or to choose for the articulator one of the provided forms, one of which is labeled as iconic OK Shape.</p>
      <p>The challenge of transcription becomes even more significant when dealing with multimodal corpora representing sign language. Typically, this issue is addressed using glosses, a form of sign-to-word translation that provides information about the meaning of signs without indicating their form [7]. However, over the years, some systems have been developed to represent the shape of signs. Most of these systems focus primarily on the hands [31], which are only a small part of the articulators contributing to meaning. Among these systems, Typannot [32] stands out as it offers a comprehensive description of the entire set of body parts – from fingers to toes, including the head and torso – used to transcribe both sign languages and co-verbal gestures.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Towards the Gest-IT corpus: blind and sighted speakers</title>
      <sec id="sec-2-1">
        <title>We aim at building a corpus consisting of maximally eco</title>
        <p>logical interactions, transcribed on three separate layers
aligned to each other: (i) an orthographic transcription;
Handedness (ii) a prosodic transcription, and (iii) a gestural
transcripjHeactnodrymovement tra- tion. At present, we are still in an initial, exploratory
Body posture Towards-Interlocutor, Up, Down, Sideways, Other phase, but we already addressed the most important
decisions to be made.</p>
        <p>Table 2 The first decision concerned the informants to be
Gestures classification in CORMIP [26] recorded. In order to be able to investigate whether the
ability to see and the perception of being seen during a
Pictorial iombjaegcet-olirkaecsthioanp.es, or boundaries of a real-world communicative exchange can influence gesture
producNon-Pictorial rythmic movements (i.e., batonic) or geometric tion, we decided to take into consideration both sighted
feogromrys.. Deictic gestures also fall within this cat- and visually impaired L1 speakers in dialogical situations.
Conventional gestures with a degree of conventionality that al- Gesture is indeed closely linked not only to
intersubjeclsoewmsatnotiacsvsaolcuiaetteo, itnehamsp(eec.gif.,icthlieng‘oukiastyi’cssiygsnt)e.m, a tive needs, connected to clarity, eficiency and
attentiongetting functions, but also to cognitive needs: speakers
recur to gestures both when the interlocutor is not
visicategorization of gesture functions or conventionality, ble [33] and when the speaker is visually impaired [34],
the description of their lables (see Table 2) seems to con- thus independently of the interlocutors’ ability to see and
tradict this statement [26]. Lastly, as far as Italian is interpret them. Yet, the actual relation and reciprocal
inlfuence between gestures and the perception of being
tcoonbceermneend,titohneedP,adwohvearMetuelxtitmuaoldatrlaCnosrcpriupsti[o2n7s, 2a8r]e heans- seen has received little attention so far.
riched with annotations about a number of non-verbal We included in the study 6 blind and 8 sighted
particicomponents, there including also aspects such as gaze pants, recruited on a voluntary basis and through a
protolaanbdelgliensgtusrcehse.mThe3e [M29u,lt3i0M] otrdiaels MtoudlteicDoiumpelensgieosntaulre(Mtr3aDn)- ceothlitchaaltrheaqsubireeemneenvtasl4u.aTtehdeabslicnodmgprloiaunpt
iwnictlhudGeDdPsRpeaankdscription in the three diferent dimensions of its form, its ers who were born blind, who acquired blindness later
relation to spoken prosody and its semantic or pragmatic and who are partially-sighted. The total average age of
functions. As reported in their manual, however, the the participants is mean = 39 years old (sd = ± 18.7).
transcriber is required to make choices, on the form layer, The average age of the PG is mean = 55.8 years (sd
such as which is the predominant articulator (e.g., left or = ± 18), while the control group has an average age of
right hand, or both) or to choose for the articulator one mean = ± 26 years old (sd = ± 3.9). The total gender
of the provided forms, one of which is labeled as iconic distribution is 85.7% F and 14.2% M. In the blind goup
(BG) 100% of the participants are F. In the sighted group
OKTshheapceh.allenge of transcription becomes even more (SG) 75% are F and 25% are M. The total average
educasignificant when dealing with multimodal corpora repre- tional level distribution shows that 64.2% of participants
senting sign language. Typically, this issue is addressed has a bachelor’s degree, while 35.7% has a high school
using glosses, a form of sign-to-word translation that diploma. In the BG 83.3% of participants has a high
provides information about the meaning of signs with- school diploma, while 16.7% of the participants has a
out indicating their form [7]. However, over the years, bachelor’s degree. In the SG 100% of participants in the
some systems have been developed to represent the shape control group has a bachelor’s degree.
of signs. Most of these systems focus primarily on the All participants were paired and later involved in
30hands [31], which are only a small part of the articu- minutes seated conversations, to elicit samples of
spontalators contributing to meaning. Among these systems, neous speech. As the participants to each dialogue were
Typannot [32] stands out as it ofers a comprehensive</p>
      </sec>
      <sec id="sec-2-2">
        <title>3https://osf.io/ankdx/</title>
      </sec>
      <sec id="sec-2-3">
        <title>4Positive evaluation of the Bioethics Committee of the University of</title>
        <p>Bologna n. 0020349, 24/01/2024.</p>
        <sec id="sec-2-3-1">
          <title>4.1. Data repository</title>
          <p>unlikely to know each other, in order to avoid moments of
silence, some questions were prepared to enhance
spontaneous conversations (see Appendix A). Interestingly,
speakers recurred to these prompts only in few cases: the
interactions developed very spontaneously despite the
absence of previous contacts among the interlocutors.</p>
          <p>We built the pairs and the interactional setting
according to two parameters:
• speakers could belong to the same category of
participant (both blind, both sighted) or diferent
categories. We coded these two situations as S
(same, blind-blind or sighted-sighted
conversation) or D (diferent , blind-sighted conversation);
• speakers could be facing each other or be seated
back-to-back, to ensure that participants could
not perceive the other’s nonverbal
communication. We coded these two situations as M (masked,
back-to-back situation) or U (unmasked, facing
situation).</p>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>Resource building is a team enterprise, performed asyn</title>
        <p>chronously by a number of diferent people (i.e., PIs,
interns, technicians etc.), often with diferent levels of
technical expertise and background knowledge about the</p>
        <p>We recorded 13 conversations, for a total of roughly 7 genesis of the data. Our project is no exception.
hours (428.15 minutes), from three points of view: the Therefore, in order to ensure data consistency and
central camera faced the couple, whereas the other two maintenance, a specific workflow has been put in place.
recorded the left side and the right side (the left and More specifically, a central git repository6 keeps track of
right cameras were located so that they could capture the status of the resource. The main branch contains the
the participants frontally, see Figures 1 and 2). The goal last, released version of the corpus while the dev branch
was to take the participants’ gestures from all possible is used for development in between releases (versions are
perspectives. Recordings took place over two days. Some numbered according to semantic versioning standards).
details about the 13 recorded conversations are available Each participant and each conversation is defined
in Table 35. through a .yaml file (Appendix B), allowing for a
number of CI/CD practices to be put in place: each time a new
conversation description file is pushed to the repository,
4. The Gest-IT corpus schema for instance, a table summarizing the full status of the
resource is generated. Similarly, automatic checks are
The other decisions that we had to make from the very performed each time a transcription is updated to ensure
beginning concerned the repository, the archiving proto- the consistency of the overall resource: for instance, a
col, and the standards for transcribing the three layers we script makes sure that names of layers in the
transcripaim to represent (orthographic, prosodic and gestural). tion correspond to participants, that jefersonian notation
(see Section 4.2) is well formed, etc.</p>
      </sec>
      <sec id="sec-2-5">
        <title>5Interactions involving only blind speakers did not require the</title>
        <p>masked setting, which was aimed to let sighted speakers experience
a sight impairment of some sort during in-presence communication. 6https://github.com/LaboratorioSperimentale/Gest-IT</p>
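        <p>As an illustration of the kind of automatic check mentioned above, the following sketch loads a conversation description file and verifies that the participant codes it references are well formed and known, and that the Facing value is admissible. It is only a minimal example: the field layout follows the schema in Appendix B, but the paths, file names and the exact set of checks are assumptions of this sketch, not the project’s actual scripts.</p>
        <preformat># Minimal sketch of a consistency check that a CI job could run on every push.
# Paths, file names and the set of checks are illustrative assumptions.
import re
import sys
from pathlib import Path

import yaml  # PyYAML

CODE_PATTERN = re.compile(r"^[SB]\d{3}$")  # 4-char participant code, e.g. S001 or B012

def check_conversation(conv_file, participants_dir):
    """Return a list of human-readable problems found in one conversation file."""
    problems = []
    conv = yaml.safe_load(conv_file.read_text(encoding="utf-8"))
    for code in conv.get("Participants", []):
        if not CODE_PATTERN.match(str(code)):
            problems.append(f"{conv_file.name}: malformed participant code {code!r}")
        elif not (participants_dir / f"{code}.yaml").exists():
            problems.append(f"{conv_file.name}: unknown participant {code}")
    if conv.get("Facing") not in ("M", "U"):
        problems.append(f"{conv_file.name}: Facing must be 'M' or 'U'")
    return problems

if __name__ == "__main__":
    issues = []
    for f in sorted(Path("conversations").glob("*.yaml")):
        issues.extend(check_conversation(f, Path("participants")))
    for msg in issues:
        print(msg)
    sys.exit(1 if issues else 0)</preformat>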
        <p>Data pertaining to each conversation is constituted Transcriptions are thus available in two formats: they
by a set of digital objects, that represent diferent layers can be read as simple orthographic texts, or they can be
of information attached to the same recording. These read as enriched texts with prosodic and interactional
include: (i) three video tracks and one audio track; (ii) a information (such as overlaps, speed alterations,
ascendverbal transcription layer, which was initially automat- ing or descending intonation, pauses, etc., as in example
ically created with the whisper ASR toolkit [35] and below). In both cases, it is possible to directly relate the
then revised at the ortographic and prosodic level (Sec- transcription unit to the audiovisual unit. A further
retion 4.2); (iii) gesture transcription, starting from video vision step will be done once the corpus will be fully
sources (Section 4.3); (iv) UD annotation layers. transcribed, in order to make sure that notation is
con</p>
        <p>Transcriptions are maintained in CoNLL-U format7, sistent throughout the resource.
with specific MISC features for the gesture component.</p>
        <p>This will allow, in the future, to enrich the resource with
additional annotation layers.</p>
        <sec id="sec-2-5-1">
          <title>4.2. Verbal language transcription</title>
        </sec>
      </sec>
      <sec id="sec-2-6">
        <title>As regards verbal communication, we decided to adopt</title>
        <p>the standards of the KIParla corpus [36], a corpus of
spoken Italian that allows full access to audio files and
transcriptions of roughly 153 hours of spontaneous speech 8.</p>
        <p>Once the recordings were acquired, the transcription
process began. In accordance with the KIParla
protocol, it was agreed to use the ELAN software [37], which
allows for time alignment of videos, audio files and
transcriptions. In practice, the speech was segmented into
transcription units identified on a perceptual basis,
especially by reference to prosodic unit boundaries. The
transcription process involved two steps:
7https://universaldependencies.org/ext-format.html
8The KIParla corpus is an incremental and modular resource,
therefore this count refers to the three modules KIP, ParlaTO and KIPasti,
which are online at the moment (as of July 2024). As soon as new
modules are published, the global dimension of the resource will
increase.</p>
        <p>• orthographic transcription, which included
anonimization, turn assignment, and nonverbal
behaviours. Whenever the annotator didn’t under- ‘wonderful Seville’
stand, they could either choose ‘xxx’ or type their
hypothesis in parentheses; 4.3. Gesture transcription
• prosodic transcription, following a simplification
of the Jeferson system [ 38], widely shared by the In order to provide also a transcription of gestures, as
scientific community [ 39]. The employed conven- objective and interpretation-independent as possible, we
tions [40] are reported in Table 4. decided to employ Typannot.</p>
        <p>Typannot is a typographic system for the
representation of sign languages, a project in development since
2013 by the Gestual Script research group, composed of
linguists, graphic designers, typographers, and computer
scientists. Its articulatory description of the body,
independent of the language studied, allows it to be adapted
(4) B001 [siviglia meravigliosa]</p>
        <p>speak-id overlap overlap
(1) S001 l’ultimo che ho fatt:o allora sono
speak-id - - - Prolonged -
stata a siviglia:, p[er natale]
- - Prolonged+Ascending overlap overlap
‘my last one well I was in Seville for Christmas’
(2) B001 [che io ador]o che [io
speak-id overlap overlap overlap - overlap
adoro]
overlap
‘(Seville) which I love’
(3) S001 [bellissima po]i a [natale è
speak-id overlap overlap - overlap overlap
stato mag]ico
overlap
‘wonderful Christmas was magic’
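        <p>Since each transcription unit stores both the plain text and the Jefferson-annotated text (the text and text_jefferson metadata shown in Appendix C), the enriched reading can in principle be reduced to the orthographic one by stripping the prosodic markup. The sketch below handles only the conventions visible in the example above (overlap brackets, ‘:’ for prolonged sounds, a final ‘,’ for ascending intonation); the full inventory of Table 4 may require more rules.</p>
        <preformat># Minimal sketch: derive a plain orthographic reading from a Jefferson-style line.
# Only the conventions visible in the example above are handled; the complete
# convention table used in Gest-IT (Table 4) may contain further symbols.
import re

def strip_jefferson(line):
    line = re.sub(r"[\[\]]", "", line)   # remove overlap brackets
    line = line.replace(":", "")         # drop prolonged-sound marks
    line = re.sub(r",\s*$", "", line)    # drop a final ascending-intonation mark
    return re.sub(r"\s+", " ", line).strip()

print(strip_jefferson("e[h: la sera]"))  # -> "eh la sera"</preformat>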
      </sec>
      <sec id="sec-schema-3">
        <title>4.3. Gesture transcription</title>
        <p>In order to provide a transcription of gestures that is as objective and interpretation-independent as possible, we decided to employ Typannot.</p>
        <p>Typannot is a typographic system for the representation of sign languages, a project in development since 2013 by the Gestual Script research group, composed of linguists, graphic designers, typographers, and computer scientists. Its articulatory description of the body, independent of the language studied, allows it to be adapted to the study of gestures as well. Typannot proposes to analyze gestures and signs as realizations of the whole body and not just the hands: to facilitate the analysis, the body is divided into different Articulatory Systems (AS), covering every body part from the hands to the feet and including a description of facial expressions. For the purpose of the Gest-IT project, only the three systems listed below will be considered:</p>
      <sec id="sec-2-7">
        <title>Finger (F): the dynamics of the fingers of the hand</title>
        <p>(thumb, index, middle, ring, and little finger).
Furthermore, the distinction between the fingers of
the right hand and those of the left hand will be
considered and referred to respectively as RH and
LH;</p>
      </sec>
      <sec id="sec-2-8">
        <title>UpperLimb (UL): the dynamics of the upper limbs (arm, forearm, hand);</title>
      </sec>
      <sec id="sec-2-9">
        <title>UpperBody (UB): the dynamics of the segments that make up torso (hip, spine and shoulder), neck and head.</title>
      </sec>
      <sec id="sec-2-10">
        <title>In this system, the sign’s form is seen as a set of ar</title>
        <p>ticulatory body information (we extend this view to
gestures). Currently, the generic characters that make up
the graphic inventory of Typannot are used to describe
the dynamics of all body segments.</p>
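        <p>To make the relation between these Articulatory Systems and the corpus schema concrete, the sketch below models a transcribed gesture unit with the information attached to it in the CoNLL-U excerpts of Appendix C (the type F:LH/F:RH, the alignment times and the Typannot code). The class and field names are assumptions made for this illustration, not the project’s actual code.</p>
        <preformat># Illustrative data structure for a transcribed gesture unit, mirroring the
# metadata used in Appendix C (# type = F:LH, AlignBegin/AlignEnd, gesture=...).
# Names are assumptions for this sketch.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ArticulatorySystem(Enum):
    FINGER = "F"       # dynamics of the fingers of the hand
    UPPER_LIMB = "UL"  # arm, forearm, hand
    UPPER_BODY = "UB"  # torso, neck and head

@dataclass
class GestureUnit:
    sent_id: str                  # e.g. "gu0003"
    speaker: str                  # e.g. "S001"
    system: ArticulatorySystem    # which Articulatory System is described
    hand: Optional[str]           # "LH" or "RH" for Finger units, otherwise None
    align_begin: float            # seconds from the start of the recording
    align_end: float
    typannot_code: str            # string of Typannot generic characters

unit = GestureUnit("gu0003", "S001", ArticulatorySystem.FINGER,
                   "LH", 11.410, 11.740, "...")
print(unit.system.value + ":" + unit.hand)  # prints "F:LH"</preformat>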
        <sec id="sec-2-10-1">
          <title>4.4. Towards a unified CoNLL-U corpus</title>
        </sec>
      </sec>
      <sec id="sec-2-11">
        <title>The resulting corpus is composed of verbal-prosodic units</title>
        <p>and gestural units, with information about their
overlaps9. Each unit is described by the metadata listed in
Table 5. In case of non verbal units, the text is filled with
a placeholder token (EMPTY) and relevant information is
contained in the MISC column, where the following
features are introduced, meta for para-verbal information
(such as laughs, coughs...) and gesture for Typannot
codes (see Appendix C).</p>
      </sec>
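        <p>To illustrate how the unified file can be consumed, the sketch below reads an extended CoNLL-U file such as the excerpt in Appendix C and pairs each verbal unit with the gestural units it overlaps with, relying only on the sent_id, text and overlaps metadata visible there. The use of the conllu Python package and the file name are assumptions of this sketch, not project requirements.</p>
        <preformat># Minimal sketch: pair verbal units with overlapping gesture units in the
# unified CoNLL-U corpus, using only the metadata keys shown in Appendix C.
from conllu import parse_incr  # pip install conllu

def load_units(path):
    """Split units into verbal and gestural ones, keyed by sent_id."""
    verbal, gestural = {}, {}
    with open(path, encoding="utf-8") as fh:
        for sent in parse_incr(fh):
            meta = sent.metadata
            # gestural units carry the EMPTY placeholder as their text
            target = gestural if meta.get("text") == "EMPTY" else verbal
            target[meta["sent_id"]] = meta
    return verbal, gestural

def overlapping_gestures(verbal, gestural):
    for uid, meta in verbal.items():
        ids = meta.get("overlaps", "").split()
        yield uid, meta.get("text"), [g for g in ids if g in gestural]

verbal, gestural = load_units("pilot.conllu")  # hypothetical file name
for uid, text, gests in overlapping_gestures(verbal, gestural):
    print(uid, repr(text), "overlaps with gesture units:", gests)</preformat>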
      <sec id="sec-2-12">
        <title>The aim of this paper is to share with the scientific com</title>
        <p>munity the protocol developed to build a multimodal
resource for the Italian language in terms of data collection
(design, ethic issues, practicalities); data management
and curation; data transcription, annotation and analysis.
In doing so, we contribute to the debate on multimodal
resource building, which is still lacking an established
standard. In particular, our contribution in this respect is
twofold.</p>
        <p>Firstly, our study suggests to adopt a three-layer
transcription where the three layers (i.e., the orthographic
transcription, the prosodic/interactional transcription,
and the gestural transcription) align to each other, by
using ELAN as a tool for transcribing and CoNLL-X as
an interoperable output format. This has the advantage
of grounding gestures as an integrated semiotic source
within verbal conversation and ultimately allows to
unveil gesture-speech regularities.</p>
        <p>Secondly, we propose an innovative approach for the
annotation of gesture data. By relying on common
practices in the field of sign languages, we suggest that
gesture transcription should follow the same rationale of
phonetic transcription, with a method that describes
‘objective’ aspects that characterize the ‘form’ of the gesture,
thus allowing for an interpretation-independent
annotation.</p>
        <p>Clearly, the project is still at a very preliminary
stage. Next steps will include the complete orthographic,
prosodic and gesture transcription of the recordings; a
thorough revision and pseudoanymization.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>The Gest-IT corpus was built as an internship project</title>
        <p>at the Experimental Lab10 of the Department of Modern
Languages, Literatures, and Cultures (LILEC) of the
University of Bologna. We would like to thank the Gestual
Script team (Ecole Supérieure d’Art et Design, ESAD,
Amiens) for providing us with the Typannot system, and
the Istituto dei ciechi Francesco Cavazza11 (Bologna) for
helping us with the recruiting of visually impaired
participants. A special acknowledgment to the Centro Studi
sul Seicento e Settecento Spagnolo (CSSS) of the LILEC
Department for letting us record our pilot videos in their
studio.
[1] A. Lüdeling, M. Kytö, Corpus Linguistics: An
International Handbook, De Gruyter Mouton,</p>
      </sec>
      <sec id="sec-3-2">
        <title>9At the moment of writing, 1 minute of pilot transcription has been</title>
        <p>
          produced.
10https://site.unibo.it/laboratorio-sperimentale/
11https://www.cavazza.it/
2009. URL: https://www.degruyter.com/database/ [14] D. Trotta, A. Palmero Aprosio, S. Tonelli, A. Elia,
COGBIB/entry/cogbib.7917/html. Adding gesture, posture and facial displays to the
[2] J. Bezemer, C. Jewitt, Multimodality: A guide for polimodal corpus of political interviews, in:
Prolinguists, Research methods in linguistics 28 (
          <xref ref-type="bibr" rid="ref7">2018</xref>
          ). ceedings of the 12th Language Resources and
Evalu[3] N. Abner, K. Cooperrider, S. Goldin-Meadow, Ges- ation Conference (LREC 2020), European Language
ture for linguists: A handy primer, Language Resources Association, 2020, pp. 4320–4326.
and Linguistics Compass 9 (2015) 437–451. doi:10. [15] J. Allwood, Capturing diferences between social
ac1111/lnc3.12168. tivities in spoken language, Pragmatics and Beyond
[4] Á. Abuczki, E. B. Ghazaleh, An overview of multi- New Series (2001) 301–320.
        </p>
        <p>
          modal corpora, annotation tools and schemes, Ar- [16] J. Allwood, L. Cerrato, K. Jokinen, C. Navarretta,
gumentum 9 (2013) 86–98. P. Paggio, The mumin coding scheme for the
anno[5] M. E. Foster, J. Oberlander, Corpus-based gen- tation of feedback, turn management and
sequenceration of head and eyebrow motion for an ing phenomena, Language Resources and
Evaluaembodied conversational agent, Language Re- tion 41 (
          <xref ref-type="bibr" rid="ref21">2007</xref>
          ) 273–287.
sources and Evaluation 41 (
          <xref ref-type="bibr" rid="ref21">2007</xref>
          ). doi:10.1007/ [17] K. Pápay, S. Szeghalmy, I. Szekrényes, Hucomtech
s10579-007-9055-3. multimodal corpus annotation, Argumentum 7
[6] J. Bressem, S. H. Ladewig, C. Müller, 71. Linguistic (2011) 330–347.
        </p>
        <p>Annotation System for Gestures, De Gruyter Mou- [18] F. Schiel, S. Steininger, U. Türk, The smartkom
ton, Berlin, Boston, 2013, pp. 1098–1124. URL: https: multimodal corpus at bas., in: LREC, Citeseer, 2002.
//doi.org/10.1515/9783110261318.1098. doi:doi:10. [19] D. McNeill, Gesture and Thought, University
1515/9783110261318.1098. of Chicago Press, 2013. doi:10.7208/chicago/
[7] C. S. Bianchini, (D)écrire les Langues des Signes: 9780226514642.001.0001.
une approche grapholinguistique aux Langues [20] I. Mlakar, M. Rojc, Capturing form of non-verbal
des Signes, number 8 in Grapholinguistics and conversational behavior for recreation on synthetic
its Applications, Fluxus Editions, 2024. URL: conversational agent eva, WSEAS Trans.
Comhttps://hal.science/hal-04602726. doi:10.36824/ put.[Print ed.] 11 (2012) 218–226.
2024-bianchini, iSSN 2681-8566 &amp; eISSN 2534- [21] I. Mlakar, D. Verdonik, S. Majhenič, M. Rojc,
To5192; EAN 9782487055025; CrossRef 1612798840. wards pragmatic understanding of conversational
[8] F. Freigang, M. A. Priesters, R. Nishio, K. Bergmann, intent: A multimodal annotation approach to
mulYour data at the center of attention: A metadata tiparty informal interaction–the eva corpus, in:
session profile for multimodal corpora, in: Proceed- Statistical Language and Speech Processing: 7th
Inings of the CLARIN Annual Conference, volume ternational Conference, SLSP 2019, Ljubljana,
Slove2014, 2014. nia, October 14–16, 2019, Proceedings 7, Springer,
[9] A. Lücking, K. Bergmann, F. Hahn, S. Kopp, 2019, pp. 19–30.</p>
        <p>
          H. Rieser, The bielefeld speech and gesture align- [22] D. Fišer, J. Lenardič, Overview of multimodal
corment corpus (saga), in: LREC 2010 workshop: Mul- pora in the clarin (
          <xref ref-type="bibr" rid="ref4">2020</xref>
          ).
timodal corpora–advances in capturing, coding and [23] B. MacWhinney, Computational transcript analysis
analyzing multimodality, 2010. and language disorders, in: Handbook of
Neurolin[10] J. Du Bois, G. Troiani, Typology and its data: func- guistics, Elsevier, 1998, pp. 599–616.
tional monoculture or structural diversity?, pre- [24] L. Lo Re, Prosody and gestures to modelling
mulsented at Naturally occurring data in and beyond timodal interaction: Constructing an italian pilot
linguistic typology, 2023. corpus, IJCoL. Italian Journal of Computational
[11] G. Troiani, Representing a language in use: cor- Linguistics 7 (2021) 33–44.
        </p>
        <p>
          pus construction, prosody, and grammar in Kazakh, [25] A. Kendon, et al., Gesticulation and speech: Two
Ph.D. thesis, UC Santa Barbara, 2023. aspects of the process of utterance, The relationship
[12] R. Bertrand, P. Blache, R. Espesser, G. Ferré, C. Meu- of verbal and nonverbal communication 25 (1980)
nier, B. Priego-Valverde, S. Rauzy, Le cid-corpus of 207–227.
interactional data-annotation et exploitation mul- [26] L. Lo Re, Corpus multimodale dell’italiano parlato:
timodale de parole conversationnelle, Revue TAL: basi metodologiche per la creazione di un prototipo,
traitement automatique des langues 49 (
          <xref ref-type="bibr" rid="ref17">2008</xref>
          ) pp– Ph.D. thesis, University of Firenze, 2022.
105. [27] K. Ackerley, F. Coccetta, Enriching language
learn[13] D. Knight, S. Adolphs, P. Tennent, R. Carter, The ing through a multimedia corpus, ReCALL 19 (
          <xref ref-type="bibr" rid="ref21">2007</xref>
          )
nottingham multi-modal corpus: A demonstration, 351–370. doi:10.1017/S0958344007000730.
in: Programme of the Workshop on Multimodal [28] F. Coccetta, et al., Multimodal functional-notional
Corpora, 2009, p. 64. concordancing, New Trends in Corpora and
Lan
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>A. Prompts for conversation</title>
    </sec>
    <sec id="sec-5">
      <title>B. Metadata schemata</title>
      <sec id="sec-5-1">
        <title>For both participants (see Subsection B.1) and conversations Subsection B.2), metadata is collected and maintained in .yaml files, with the following formats</title>
        <sec id="sec-5-1-1">
          <title>B.1. Participants metadata</title>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Code : # 4− c h a r s t r i n g composed by e i t h e r S ( S i g h t e d ) or B ( B l i n d ) and an i n t e g e r padded with 0 s</title>
      </sec>
      <sec id="sec-5-3">
        <title>Gender : # e i t h e r F ( Female ) or M ( Male )</title>
      </sec>
      <sec id="sec-5-4">
        <title>Age : # age range o f t h e p a r t i c i p a n t e x p r e s s e d a s 5− y e a r s b i n s ( 0 − 5 , 6 −10 , 1 1 − 2 0 , . . . )</title>
        <p>Region : # 1 o f t h e 20 i t a l i a n
r e g i o n s ( t y p i n g c o n v e n t i o n s
p r o v i d e d )</p>
      </sec>
      <sec id="sec-5-5">
        <title>F i r s t l a n g u a g e : # upper c a s e d i s o</title>
        <p>−693 −3 code o f mother tongue
E d u c a t i o n l e v e l : # one v a l u e i n (
P r i m a r i a , Medie i n f e r i o r i ,
Medie s u p e r i o r i , Laurea , PhD )
P a r t i c i p a n t s :
− [ p a r t i c i p a n t _ c o d e _ 1 ] # code
o f p a r t i c i p a n t s i t t i n g on
l e f t s i d e
− [ p a r t i c i p a n t _ c o d e _ 2 ] # code
o f p a r t i c i p a n t s i t t i n g on
r i g h t s i d e
F a c i n g : # M ( Masked ) or U (
unmasked ) depending on t y p e o f
c o n v e r s a t i o n
Data :
− Video :
− L e f t : path / t o / l e f t / camera /
r e c o r d i n g
− C e n t r e : path / t o / c e n t r a l /</p>
        <p>camera / r e c o r d i n g
− R i g h t : path / t o / r i g h t / camera
/ r e c o r d i n g
− Audio : path / t o / a u d i o / f i l e
− T r a n s c r i p t i o n :
− Automatic : path / t o /</p>
        <p>a u t o m a t i c / t r a n s c r i p t i o n
− Manually r e v i s e d : path / t o /
manually / r e v i s e d /
t r a n s c r i p t i o n
− P r o s o d i c : path / t o / p r o s o d i c /
t r a n s c r i p t i o n
− G e s t u a l : path / t o / g e s t u a l /
t r a n s c r i p t i o n</p>
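        <p>For concreteness, a filled-in pair of files might look as follows. All values, file names and extensions are invented for illustration and do not describe any actual participant or recording; only the field layout follows the schemata above.</p>
        <preformat># participants/S001.yaml (illustrative values only)
Code: S001
Gender: F
Age: 26-30
Region: Emilia-Romagna
First language: ITA
Education level: Laurea

# conversations/DUC22051430.yaml (illustrative values only)
Participants:
  - S001   # sitting on the left side
  - B001   # sitting on the right side
Facing: U
Data:
  - Video:
    - Left: data/DUC22051430/video_left.mp4
    - Centre: data/DUC22051430/video_centre.mp4
    - Right: data/DUC22051430/video_right.mp4
  - Audio: data/DUC22051430/audio.wav
  - Transcription:
    - Automatic: data/DUC22051430/auto.eaf
    - Manually revised: data/DUC22051430/revised.eaf
    - Prosodic: data/DUC22051430/prosodic.eaf
    - Gestual: data/DUC22051430/gestural.eaf</preformat>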
      </sec>
    </sec>
    <sec id="sec-6">
      <title>C. Integrated transcription in</title>
    </sec>
    <sec id="sec-7">
      <title>ELAN</title>
      <preformat># sent_id = tu0005
# overlaps = gu0003 gu0004 gu0005 gu0006
# conversation = DUC22051430
# speaker_id = S001
# duration = 1.088
# text_jefferson = entrambe da sole
# text = entrambe da sole
1   entrambe   entrambi   PRON   _   Gender=Fem|Number=Plur|PronType=Ind   0   root   _   AlignBegin=11.704
2   da         da         ADP    _   _                                     3   case   _   _
3   sole       solo       ADJ    _   Gender=Fem|Number=Plur                1   nmod   _   AlignEnd=12.792

# sent_id = tu0006
# overlaps = tu0007 gu0007 gu0008 gu0009
# conversation = DUC22051430
# speaker_id = S001
# duration = 0.987
# text_jefferson = e[h: la sera]
# text = eh la sera
1   eh     eh     INTJ   _   _                                     3   discourse   _   AlignBegin=13.047|Overlap=B:tu0007|ProlongedSound=eh:
2   la     il     DET    _   Definite=Def|Gender=Fem|Number=Sing   3   det         _   Overlap=I
3   sera   sera   NOUN   _   Gender=Fem|Number=Sing                0   root        _   AlignEnd=14.034|Overlap=I

# conversation = DUC22051430
# sent_id = gu0003
# overlaps = tu0004 tu0005
# speaker = S001
# duration = 0.330
# text = EMPTY
# type = F:LH
1   EMPTY   EMPTY   X   _   _   0   root   _   AlignBegin=11.410|AlignEnd=11.740|gesture='\ue5de\ue002[\uf197\ue008\uf19f\ue5ea\ue5ef\ue5e8\ue5ef\uf1a0\ue5fe\ue5ee-\ue004\ue005\ue006\ue007\uf19f\ue5fe\ue5ee\ue5e8\ue5ef\uf1a0\ue5e7\ue5ef][\uf198\ue001]'

# conversation = DUC22051430
# sent_id = gu0004
# overlaps = tu0005 gu0005
# speaker = S001
# duration = 0.610
# text = EMPTY
# type = F:LH
1   EMPTY   EMPTY   X   _   _   0   root   _   AlignBegin=11.740|AlignEnd=12.350|gesture='\ue5de\ue002[\uf197\ue008\uf19f\ue5ff\ue5ee\ue5fb\ue5ee\uf1a0\ue5fe\ue5ee-\ue004\ue005\ue006\ue007\uf19f\ue5fe\ue5ee\ue5fb\ue5ee\uf1a0\ue5fd\ue5ee][\uf198\ue001]'

# conversation = DUC22051430
# sent_id = gu0005
# overlaps = tu0005 gu0004
# speaker = S001
# duration = 0.610
# text = EMPTY
# type = F:RH
1   EMPTY   EMPTY   X   _   _   0   root   _   AlignBegin=11.740|AlignEnd=12.350|gesture='\ue5de\ue003[\uf197\ue008\uf19f\ue5ff\ue5ee\ue5fb\ue5ee\uf1a0\ue5fe\ue5ee-\ue004\ue005\ue006\ue007\uf19f\ue5fe\ue5ee\ue5fb\ue5ee\uf1a0\ue5fd\ue5ee][\uf198\ue001]'</preformat>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>guage Learning. London: Continuum</source>
          (
          <year>2011</year>
          )
          <fpage>121</fpage>
          -
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          138. [29]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Rohrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vilà-Giménez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Florit-Pons</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Prieto</surname>
          </string-name>
          ,
          <article-title>The multimodal multidimensional (m3d)</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>(GESPIN)</source>
          (
          <year>2020</year>
          ). [30]
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Rohrer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Delais-Roussarie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prieto</surname>
          </string-name>
          , Visual-
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>demic discourses, Lingua</source>
          <volume>293</volume>
          (
          <year>2023</year>
          )
          <fpage>103583</fpage>
          . [31]
          <string-name>
            <given-names>L.</given-names>
            <surname>Chevrefils</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Danet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Doan</surname>
          </string-name>
          , C. Thomas,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <issue>Modalities 1</issue>
          (
          <year>2021</year>
          )
          <fpage>49</fpage>
          -
          <lpage>63</lpage>
          . [32]
          <string-name>
            <given-names>D.</given-names>
            <surname>Boutet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Danet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Bianchini</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>otics</surname>
          </string-name>
          (
          <year>2018</year>
          )
          <fpage>391</fpage>
          -
          <lpage>426</lpage>
          . [33]
          <string-name>
            <given-names>M. W.</given-names>
            <surname>Alibali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. C.</given-names>
            <surname>Heath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. J.</given-names>
            <surname>Myers</surname>
          </string-name>
          , Efects of
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>Journal of Memory and Language</source>
          <volume>44</volume>
          (
          <year>2001</year>
          )
          <fpage>169</fpage>
          -
          <lpage>188</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>doi:10</source>
          .1006/JMLA.
          <year>2000</year>
          .
          <volume>2752</volume>
          . [34]
          <string-name>
            <surname>J. M. Iverson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goldin-Meadow</surname>
          </string-name>
          , Why people
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>gesture when they speak</article-title>
          ,
          <source>Nature</source>
          <year>1998</year>
          396:
          <fpage>6708</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <volume>396</volume>
          (
          <year>1998</year>
          )
          <fpage>228</fpage>
          -
          <lpage>228</lpage>
          . URL: https://www.nature.com/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>articles/24300</source>
          . doi:
          <volume>10</volume>
          .1038/24300. [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          , T. Xu, G. Brockman,
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>nition via large-scale weak supervision</article-title>
          ,
          <year>2023</year>
          . [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mauri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ballarè</surname>
          </string-name>
          , E. Goria,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cerruti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Suriano</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>SunSITE Central Europe</surname>
          </string-name>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sloetjes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          , Annotation by category-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>elan and iso dcr</article-title>
          , in: 6th international Conference on
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Language</given-names>
            <surname>Resources</surname>
          </string-name>
          and
          <article-title>Evaluation (LREC</article-title>
          <year>2008</year>
          ),
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <year>2008</year>
          . [38]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jeferson</surname>
          </string-name>
          , et al.,
          <source>Glossary of transcript symbols</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>with an introduction, Conversation analysis (</article-title>
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          13-
          <fpage>31</fpage>
          . [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Slembrouck</surname>
          </string-name>
          ,
          <article-title>Transcription-the extended direc-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          variation in transcription',
          <source>Discourse Studies 9</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          (
          <year>2007</year>
          )
          <fpage>822</fpage>
          -
          <lpage>827</lpage>
          . [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ballarè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Goria</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mauri</surname>
          </string-name>
          , Italiano parlato e vari-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>del corpus</surname>
            <given-names>KIParla</given-names>
          </string-name>
          , Pàtron editore,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>