=Paper=
{{Paper
|id=Vol-2091/paper8
|storemode=property
|title=An Audiovisual Corpus of Guided Tours in Cultural Sites: Data Collection protocols in the CHROME Project
|pdfUrl=https://ceur-ws.org/Vol-2091/paper8.pdf
|volume=Vol-2091
|authors=Antonio Origlia,Renata Savy,Isabella Poggi,Francesco Cutugno,Iolanda Alfano,Francesca D'Errico,Laura Vincze,Violetta Cataldo
|dblpUrl=https://dblp.org/rec/conf/avi/OrigliaSPCADVC18
}}
==An Audiovisual Corpus of Guided Tours in Cultural Sites: Data Collection protocols in the CHROME Project==
<pdf width="1500px">https://ceur-ws.org/Vol-2091/paper8.pdf</pdf>
<pre>
    An Audiovisual Corpus of Guided Tours in Cultural Sites: Data
            Collection protocols in the CHROME Project
          Antonio Origlia                                 Renata Savy                    Isabella Poggi                   Francesco Cutugno
       URBAN/ECO Research                        Department of Humanities           Department of Philosophy,            Department of Electrical
        Center, University of                      Studies, University of               Communication and                    Engineering and
        Naples, "Federico II"                             Salerno                   Performing Arts, Roma Tre            Information Technology,
           Naples, Italy                               Salerno, Italy                        University                    University of Naples
      antonio.origlia@unina.it                        rsavy@unisa.it                         Rome, Italy                       "Federico II"
                                                                                     isabella.poggi@uniroma3.                  Naples, Italy
                                                                                                 it                         cutugno@unina.it

           Iolanda Alfano                           Francesca D’Errico                    Laura Vincze                      Violetta Cataldo
    Department of Humanities                     Department of Philosophy,          Department of Philosophy,          Department of Humanities
      Studies, University of                        Communication and                  Communication and                 Studies, University of
             Salerno                             Performing Arts, Roma Tre          Performing Arts, Roma Tre                    Salerno
          Salerno, Italy                                 University                         University                        Salerno, Italy
        ialfano@unisa.it                                 Rome, Italy                        Rome, Italy                 violetta.cataldo@live.it
                                                     francesca.derrico@              Laura.vincze@gmail.com
                                                         uniroma3.it
ABSTRACT                                                                             1   INTRODUCTION
Creating interfaces for cultural heritage access is considered a fun-                Developing Social Signal Processing [15] techniques for advanced,
damental research field because of the many beneficial effects it has                natural interfaces requires a significant analysis effort on multiple
on society. In this era of significant advances towards natural inter-               aspects of communication between individuals engaging in social
action with machines and deeper understanding of social commu-                       activity. Collecting meaningful corpora to document the multimodal
nication nuances, it is important to investigate the communicative                   signals people exchange during these activities has been the subject
strategies human experts adopt when delivering contents to the                       of a large amount of research. Among others, available corpora
visitors of cultural sites, as this allows the creation of a strong theo-            document meetings [6], intercultural dynamics of first acquain-
retical background for the development of efficient conversational                   tance [1], phone calls between non acquainted subjects [10], and
agents. In this work, we present the data collection and annotation                  two-person dialogues [14]. The Italian national project CHROME
protocols adopted for the ongoing creation of the reference material                 aims at developing a data collection and annotation procedure to
to be used in the Cultural Heritage Resources Orienting Multimodal                   support the development of new interactive technologies for cul-
Experiences (CHROME) project to accomplish that goal.                                tural heritage. The project concentrates on the three Campanian
                                                                                     Charterhouses: an integrated description of these from different
CCS CONCEPTS                                                                         points of view (textual, behavioural, geometrical, etc.) is being
• Human-centered computing → User studies; HCI theory,                               developed.
concepts and models; User models;                                                       In this paper, we present the data collection and annotation pro-
                                                                                     tocols adopted in the CHROME project to obtain reference material
KEYWORDS                                                                             of expert gatekeepers, intended as holders of knowledge for others
Corpus collection, guided tours, social signal processing                            to refer to, accompanying visitors of cultural sites. This data will be
                                                                                     used to investigate the social communication strategies adopted by
ACM Reference Format:
                                                                                     the considered experts to deliver information to different groups of
Antonio Origlia, Renata Savy, Isabella Poggi, Francesco Cutugno, Iolanda
                                                                                     visitors. By comparing different experts (inter-subject comparisons)
Alfano, Francesca D’Errico, Laura Vincze, and Violetta Cataldo. 2018. An Au-
diovisual Corpus of Guided Tours in Cultural Sites: Data Collection protocols        and different groups accompanied by the same expert (intra-subject
in the CHROME Project. In Proceedings of 2nd Workshop on Advanced Vi-                comparison) a Gatekeeper Computational Model will be obtained
sual Interfaces for Cultural Heritage (AVI-CH 2018). Vol. 2091. CEUR-WS.org,         and, on the basis of this model, a socially aware conversational
Article 8. http://ceur-ws.org/Vol-2091/paper8.pdf, 4 pages.                          agent, in the form of a 3D avatar, will be developed. This is ex-
                                                                                     pected to improve the capabilities of an interactive agent to involve
AVI-CH 2018, May 29, 2018, Castiglione della Pescaia, Italy                          people in engaging presentations of cultural heritage. These will
© 2018 Copyright held by the owner/author(s).                                        make use of the 3D reconstructions of the three Campanian Char-
                                                                                     terhouses, also collected in the framework of the CHROME project.



                                                                                Each recruited expert accompanies four groups of four people in
                                                                             an hour-long guided tour at the San Martino Charterhouse in Naples.
                                                                             Recruited members of the audience vary on a socio-demographic
                                                                             basis and each group is gender balanced. The visit is divided into
                                                                             six points of interest (POIs), selected as the most relevant parts of
                                                                             the Charterhouse from an architectural and artistic point of view:
                                                                                  • Pronaos: outside the doorstep of the church. The introduc-
                                                                                    tory part of the visit is recorded in this POI. Environmental
                                                                                    elements mainly consist of architectural details;
                                                                                  • Great cloister: a large external place, near the monks’ ceme-
                                                                                    tery. Further details about the monks’ life are given. Envi-
                                                                                    ronmental elements consist of the natural setting of a large
Figure 1: A screenshot of the ELAN interface showing the                            garden and of the cemetery elements (e.g. memento mori);
synchronised videos of the expert and of the audience, to-                        • Parlor: the first internal setting. Specific details about the
gether with an example annotation.                                                  Carthusians’ rules are given here. Environmental elements
                                                                                    mainly consist of frescoes;
Upon completion of the project, the dataset will be made freely                   • Chapter hall: next to the parlor. Specific details about the
available for the scientific community.                                             Carthusians’ order are given here. Environmental elements
   In the next sections, we will present the data collection protocol,              mainly consist of frescoes;
highlighting the chosen recording positions in the site of interest               • Wooden choir: inside the church, behind the altar. The history
and the recording setup. We will, then, present the multimodal                      of the church decoration process is given here. Environmen-
annotation protocol, designed to provide a formal description of                    tal elements consist of both architectural details (e.g. the
how the guide makes use of social signals exchange to adapt the                     choir and the harmonic chassis) and artistic elements (fres-
presentation and to effectively support the verbal transfer of cul-                 coes and statues);
tural contents. Next, we will describe the informative, syntactic                 • Treasure hall: deeper inside the complex. Details about the
and prosodic annotations documenting the linguistic behaviour                       relationship between the monks and the different governing
that characterises the domain expert. The transcribed recordings,                   parties in Naples are given. Environmental elements mainly
together with the produced annotations, will be compared with a                     consist of architectural details.
corpus of textual resources describing the objects of interest. This            The selected POIs allow us to capture the social behaviour visi-
will support the development of a synthetic voice model for 3D               tors and gatekeepers exhibit to negotiate the approach to the visit
avatars designed to extract cultural contents from textual databases         and to document postural and gestural behaviour of an art historian
and deliver them using social communication strategies. To improve           presenting a complex environment.
the quality of the model, the linguistic analysis will also include a           Videos and audio recordings are synchronised a posteriori using
detailed annotation of disfluency phenomena, which are important             a visual-acoustic marker. Linguistic and multimodal annotations,
to produce a natural sounding voice.                                         performed on the synchronised versions of the collected material,
                                                                             will be merged using the ELAN software [17]. An ELAN project file
2   DATA COLLECTION                                                          will be produced for each POI visit in order to allow cross-domain
The data collection plan foresees a campaign of audiovisual record-          research and closed vocabularies for the label sets belonging to
ings involving four art historians with strong experience in accom-          each annotation domain will be used to ensure consistency. An
panying groups of visitors. Given the limited number of gatekeep-            example of the ELAN interface showing the two video shots and a
ers considered in the CHROME project, only female experts were               sample annotation tier is shown in Figure 1.
recruited to remove gender effects in multimodal and linguistic
analysis. Future extensions of the corpus will include male experts
                                                                             3   MULTIMODAL ANNOTATION
as well.                                                                     The video recording of the expert gatekeeper is annotated as to the
   Recorded data include two Full-HD video recordings: the first             structure of verbal discourse and to body communicative behaviour.
one is a fixed shot of the gatekeeper, taken from a position im-             The discourse structure point of view is based on a previous analy-
mediately next to the attending group while the second one is a              sis of videos of Art Commentators (ACs), that is, both museum
fixed shot of the visitors. A close range digital microphone with            gatekeepers and art historians illustrating artworks on TV, where a
background noise cancellation is used to record the gatekeeper’s             general script was extracted of what an AC can/should say in their
voice. Immediately after the visit, the recruited visitors fill in           work. This allowed us to outline the typical discourse structure of any
a questionnaire composed of 23 items including both Likert scale             AC, which, based on the analysis of discourse as a hierarchy of goals
evaluations and open answer questions. The items are designed to             [8], distinguishes four main goals pursued by the gatekeeper: a
collect demographic data, a self-evaluation of artistic competence, an       general goal of cultural elevation, which encompasses favouring aesthetic
evaluation of personal satisfaction after the visit and an evaluation        enjoyment, imagination and emotion triggering; and, subsumed under
of the gatekeeper’s performance. These data will be used to weight           it, the textual goals of providing information about the artwork, its
objective measures of social behaviour.                                      history, function, cultural milieu, and the author; the corresponding


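                                           Table 1: An example of multimodality annotation.

Verbal Text                  | Discourse function       | Gesture                      | Meaning
-----------------------------+--------------------------+------------------------------+---------------------------------
The Saint Martin’s           | Textual goal:            | hands, palms to each other,  | I am framing
Charterhouse here in Naples  | Information on           | like framing something       | the object of discourse,
has at least two souls       | the artwork              |                              | Metacognitive gesture
-----------------------------+--------------------------+------------------------------+---------------------------------
Nowadays it is not           | Textual goal:            | Left hand moves to left,     | I locate the identity
only a Charterhouse          | Information on the       | Metadiscursive gesture       | Charterhouse on my left →
                             | identity of the artwork  |                              | I build the first entity
-----------------------------+--------------------------+------------------------------+---------------------------------
but it is also               | Textual goal:            | Right hand moves to right.   | I locate the identity
a national museum            | Information on the       | Metadiscursive gesture       | known as Museum on my right →
                             | identity of the artwork  |                              | I build the second entity
-----------------------------+--------------------------+------------------------------+---------------------------------
So try to imagine            | Emotional goal:          | -                            | -
Naples 700 years ago         | Solicit imagination      |                              |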


modal goals of attracting and sustaining attention, favouring com-            4     LINGUISTIC ANNOTATION
prehension and inferential connections with the tourists’ previous            Using the close-mic recordings, speech produced by the expert
knowledge; interactional goals such as tuning, setting empathic con-          gatekeeper is analysed and annotated on different levels. From the
nection with tourists. Each particular performance of a gatekeeper            informative-syntactic point of view, an orthographic level is pro-
or other AC can be analysed in terms of this abstract script, and             duced on the basis of the indications provided by [11]. This level
this allows us, among other things, to distinguish the idiosyncratic          involves the transcription of a number of elements: lexical elements,
styles of different ACs in terms of which nodes of the structure              silent and filled pauses, noises, vocal (nonverbal) phenomena, trun-
they prefer to expand. Some mainly focus on the author and his                cated words, interrupted words, false starts and lapsus linguae. A
life, some on the deep symbolic meanings of the artwork, some on              phonetic level is included to store the phonetic transcription of the
the author’s style and the surrounding cultural milieu, and so on.            utterances and markers of phonetic phenomena like coarticulation,
    The analysis of the gatekeeper’s multimodal communication takes           following the indications found in [12]. A syllabic level is produced
into account the following body communicative modalities: ges-                to allow speech fluency and speech rate analyses. A disfluency level,
tures, postures, head movements, facial expression, gaze communi-             involving the annotation of disfluency phenomena [3, 13], is also in-
cation. For each communicative item in each modality, the signal is           cluded. This analysis level consists of four annotation tiers, detailed
annotated in ELAN in terms of a detailed description of its produc-           in Table 2.
tion: gestures are described according to their parameters of hand               To document the prosodic component of the experts’ linguistic
configuration, location, orientation and movement; gaze in terms              behaviour, a multilevel annotation, structured in different tiers,
of eye direction, eyebrows and eyelids movements; face in terms of            has been produced. The considered aspects include: an intonative
Ekman’s FACS; head movements in terms of head nod, shake, toss,               level, using the INTSINT coding scheme [4, 5], providing a label
canting; postures in terms of leg and trunk movements. Then, for              sequence representing the f0 curve, obtained with the Prosomarker
the signal described in this way, a verbal phrasing of its meaning is         tool [7]; a pragmatic-informative level, providing an analysis of
provided (after [9]). Based on this meaning, the item is classified as        information structure considering topic (preposed or postposed)
to its role and function within the gatekeeper’s discourse structure.         and comment units [2]; a macro-syntactic level, indicating the types
An example of multimodal annotation is shown in Table 1.                      of clauses dividing independent clauses from dependent clauses and
                                                                              specifying the type of subordination; a syntactic level, describing the
             Table 2: Disfluency annotation levels                            main syntactic functions; an intra-syntactic level, labelling the type
                                                                              of phrase and its composition (in parentheses); a measure of
    Disfluency Type          Type of disfluent phenomenon                     syntactic weight, based on [16], which takes into account both the
    Disfluency Function      Pragmatic function of the disfluent phenomenon   structure and the length of constituents. It considers the following
    Disfluency Model         Model of occurrence                              features: ± presence of determiners, ± presence of modifiers, ±
    Disfluency Components    Internal regions of the phenomenon               presence of pronouns, ± verbal valency saturation. An annotation
                                                                              example is shown in Figure 2.

                                                                              5     CONCLUSIONS AND FUTURE WORK
                                                                              We have presented the data collection and annotation protocols
                                                                              for a work in progress on an audiovisual corpus documenting how
                                                                              cultural heritage gatekeepers support people in accessing archi-
                                                                              tectural heritage. The corpus consists of both video and audio recordings
                                                                              to capture the social interaction process taking place between the

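[Figure 2 content: pitch track (Pitch, 0-500 Hz) over Time (0-9.862 s), with the annotation tiers listed below]

Intonative (INTSINT):     M  H  T(H)  L  H  L(B)  H  H  L  S  L  UH  L  L  H  L  H
Pragmatic-informative:    T | C
Macro-syntactic:          IC
Syntactic:                SUB | PRED | OBJ | IO
Intra-syntactic:          NP(DET+N) | VP(V) | NP(DET+N+PP(PREP+DET+N)) | PP(PREP+NP(DET+N))
Syntactic weight:         NP4 | VP4 | NP5 | PP4
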

Figure 2: An annotation example for the utterance I certosini devono la fondazione del loro ordine a un uomo (The Carthusians owe
their order’s foundation to a man). The order of the annotation tiers is the one found in the text.
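
The syllabic level described in Section 4 is meant to support speech rate and
fluency analyses. As a minimal sketch of such an analysis, and not the project's
actual tooling, the example below computes speech rate (pauses included) and
articulation rate (pauses excluded) from a list of time-aligned intervals. The
Interval type, the PAUSE label, the function name and the toy data are all
assumptions made for illustration; the paper does not specify how the syllabic
tier is stored or how pauses are labelled.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical label for silent or filled pauses on the syllabic tier;
# the label set actually used in the corpus is not given in the paper.
PAUSE = "PAUSE"


@dataclass(frozen=True)
class Interval:
    start: float  # seconds from the beginning of the recording
    end: float    # seconds from the beginning of the recording
    label: str    # syllable text, or PAUSE


def speech_rates(tier: List[Interval]) -> Tuple[float, float]:
    """Return (speech rate, articulation rate) in syllables per second.

    The tier is assumed to be sorted by time. Speech rate divides the
    syllable count by the whole time span, pauses included; articulation
    rate divides it by the time spent on syllables only.
    """
    syllables = [iv for iv in tier if iv.label != PAUSE]
    if not syllables:
        raise ValueError("the tier contains no syllables")
    total_time = tier[-1].end - tier[0].start
    phonation_time = sum(iv.end - iv.start for iv in syllables)
    if not (total_time > 0 and phonation_time > 0):
        raise ValueError("intervals must have positive duration")
    return len(syllables) / total_time, len(syllables) / phonation_time


# Invented toy data: four 0.25 s syllables around a 1 s pause.
tier = [
    Interval(0.0, 0.25, "i"),
    Interval(0.25, 0.5, "cer"),
    Interval(0.5, 1.5, PAUSE),
    Interval(1.5, 1.75, "to"),
    Interval(1.75, 2.0, "si"),
]
speech_rate, articulation_rate = speech_rates(tier)
print(speech_rate, articulation_rate)
```

Keeping the two rates separate is a common choice in fluency studies, since a
speaker who pauses often can have a low speech rate and still articulate quickly.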


group guide and the attending audience. Annotation levels cover                                         In Proceedings of the 5th International Conference on Methods and Techniques in
linguistic and multimodal aspects of communication to allow a                                           Behavioral Research, Vol. 88. 100.
                                                                                                    [7] Antonio Origlia and Iolanda Alfano. 2012. Prosomarker: a prosodic analysis tool
multi-faceted investigation of the ongoing communicative process.                                       based on optimal pitch stylization and automatic syllabification. In Proc. of the
The collected material will be used as reference to build a computa-                                    International Conference on Language Resources and Evaluation (LREC). 997–1002.
                                                                                                    [8] Domenico Parisi and Cristiano Castelfranchi. 1976. The discourse as a hierarchy of
tional model of a 3D virtual character presenting reconstructions                                       goals. Centro Internazionale di Semiotica e di Linguistica, Università di Urbino.
of architectural heritage sites.                                                                    [9] Isabella Poggi. 2007. Mind, hands, face and body: a goal and belief view of multi-
                                                                                                        modal communication. Weidler.
                                                                                                   [10] Hugues Salamin, Anna Polychroniou, and Alessandro Vinciarelli. 2013. Automatic
6            ACKNOWLEDGMENTS                                                                            detection of laughter and fillers in spontaneous mobile phone conversations. In
Antonio Origlia’s work is funded by the Italian PRIN project Cul-                                       Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE,
                                                                                                        4282–4287.
tural Heritage Resources Orienting Multimodal Experiences (CHROME)                                 [11] Renata Savy. 2005. Specifiche per la trascrizione ortografica annotata dei testi.
#B52F15000450001.                                                                                       Italiano Parlato, Analisi di un dialogo (2005), 1–28.
                                                                                                   [12] Renata Savy. 2005. Specifiche per l’etichettatura dei livelli segmentali. Italiano
                                                                                                        Parlato. Analisi di un dialogo. Napoli: Liguori (2005).
REFERENCES                                                                                         [13] Elizabeth Ellen Shriberg. 1994. Preliminaries to a theory of speech disfluencies.
 [1] Jens Allwood, Nataliya Berbyuk Lindström, and Jia Lu. 2011. Intercultural dy-                      Ph.D. Dissertation. Citeseer.
     namics of first acquaintance: comparative study of Swedish, Chinese and Swedish-              [14] Yasir Tahir, Debsubhra Chakraborty, Tomasz Maszczyk, Shoko Dauwels, Justin
     Chinese first time encounters. In International Conference on Universal Access in                  Dauwels, Nadia Thalmann, and Daniel Thalmann. 2015. Real-time sociometrics
     Human-Computer Interaction. Springer, 12–21.                                                       from audio-visual features for two-person dialogs. In Digital Signal Processing
 [2] Jeanette K Gundel. 1988. Universals of topic-comment structure. Studies in                         (DSP), 2015 IEEE International Conference on. IEEE, 823–827.
     syntactic typology 17 (1988), 209–239.                                                        [15] Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal
 [3] Adolf E Hieke. 1981. A content-processing view of hesitation phenomena. Lan-                       processing: Survey of an emerging domain. Image and vision computing 27, 12
     guage and Speech 24, 2 (1981), 147–160.                                                            (2009), 1743–1759.
 [4] Daniel Hirst and Albert Di Cristo. 1998. A survey of intonation systems. Intonation           [16] Miriam Voghera and Giuseppina Turco. 2007. Il peso del parlare e dello scrivere.
     systems: A survey of twenty languages (1998), 1–44.                                                In Proc. of International Conf. Il Parlato Italiano, Liguori, Napoli.
 [5] Daniel Hirst, Albert Di Cristo, and Robert Espesser. 2000. Levels of representation           [17] Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han
     and levels of analysis for the description of intonation systems. In Prosody: Theory               Sloetjes. 2006. ELAN: a professional framework for multimodality research. In
     and experiment. Springer, 51–87.                                                                   Proc. of the International Conference on Language Resources and Evaluation (LREC).
 [6] Iain McCowan, Jean Carletta, W Kraaij, S Ashby, S Bourban, M Flynn, M Guille-                      1556–1559.
     mot, T Hain, J Kadlec, V Karaiskos, and others. 2005. The AMI meeting corpus.



</pre>