=Paper=
{{Paper
|id=Vol-2091/paper8
|storemode=property
|title=An Audiovisual Corpus of Guided Tours in Cultural Sites: Data Collection protocols in the CHROME Project
|pdfUrl=https://ceur-ws.org/Vol-2091/paper8.pdf
|volume=Vol-2091
|authors=Antonio Origlia,Renata Savy,Isabella Poggi,Francesco Cutugno,Iolanda Alfano,Francesca D'Errico,Laura Vincze,Violetta Cataldo
|dblpUrl=https://dblp.org/rec/conf/avi/OrigliaSPCADVC18
}}
==An Audiovisual Corpus of Guided Tours in Cultural Sites: Data Collection protocols in the CHROME Project==
Antonio Origlia, URBAN/ECO Research Center, University of Naples "Federico II", Naples, Italy, antonio.origlia@unina.it
Renata Savy, Department of Humanities Studies, University of Salerno, Salerno, Italy, rsavy@unisa.it
Isabella Poggi, Department of Philosophy, Communication and Performing Arts, Roma Tre University, Rome, Italy, isabella.poggi@uniroma3.it
Francesco Cutugno, Department of Electrical Engineering and Information Technology, University of Naples "Federico II", Naples, Italy, cutugno@unina.it
Iolanda Alfano, Department of Humanities Studies, University of Salerno, Salerno, Italy, ialfano@unisa.it
Francesca D’Errico, Department of Philosophy, Communication and Performing Arts, Roma Tre University, Rome, Italy, francesca.derrico@uniroma3.it
Laura Vincze, Department of Philosophy, Communication and Performing Arts, Roma Tre University, Rome, Italy, Laura.vincze@gmail.com
Violetta Cataldo, Department of Humanities Studies, University of Salerno, Salerno, Italy, violetta.cataldo@live.itt
===ABSTRACT===
Creating interfaces for cultural heritage access is considered a fundamental research field because of the many beneficial effects it has on society. In this era of significant advances towards natural interaction with machines and deeper understanding of social communication nuances, it is important to investigate the communicative strategies human experts adopt when delivering contents to the visitors of cultural sites, as this allows the creation of a strong theoretical background for the development of efficient conversational agents. In this work, we present the data collection and annotation protocols adopted for the ongoing creation of the reference material to be used in the Cultural Heritage Resources Orienting Multimodal Experiences (CHROME) project to accomplish that goal.

===CCS CONCEPTS===
• Human-centered computing → User studies; HCI theory, concepts and models; User models;

===KEYWORDS===
Corpus collection, guided tours, social signal processing

ACM Reference Format:
Antonio Origlia, Renata Savy, Isabella Poggi, Francesco Cutugno, Iolanda Alfano, Francesca D’Errico, Laura Vincze, and Violetta Cataldo. 2018. An Audiovisual Corpus of Guided Tours in Cultural Sites: Data Collection protocols in the CHROME Project. In Proceedings of 2nd Workshop on Advanced Visual Interfaces for Cultural Heritage (AVI-CH 2018). Vol. 2091. CEUR-WS.org, Article 8. http://ceur-ws.org/Vol-2091/paper8.pdf, 4 pages.

AVI-CH 2018, May 29, 2018, Castiglione della Pescaia, Italy
© 2018 Copyright held by the owner/author(s).

===1 INTRODUCTION===
Developing Social Signal Processing [15] techniques for advanced, natural interfaces requires a significant analysis effort on multiple aspects of communication between individuals engaging in social activity. Collecting meaningful corpora to document the multimodal signals people exchange during these activities has been the subject of a large amount of research. Among others, available corpora document meetings [6], intercultural dynamics of first acquaintance [1], phone calls between non-acquainted subjects [10], and two-person dialogues [14]. The Italian national project CHROME aims at developing a data collection and annotation procedure to support the development of new interactive technologies for cultural heritage. The project concentrates on the three Campanian Charterhouses: an integrated description of these from different points of view (textual, behavioural, geometrical, etc.) is being developed.

In this paper, we present the data collection and annotation protocols adopted in the CHROME project to obtain reference material of expert gatekeepers, intended as holders of knowledge for others to refer to, accompanying visitors of cultural sites. These data will be used to investigate the social communication strategies adopted by the considered experts to deliver information to different groups of visitors. By comparing different experts (inter-subject comparisons) and different groups accompanied by the same expert (intra-subject comparisons), a Gatekeeper Computational Model will be obtained and, on the basis of this model, a socially aware conversational agent, in the form of a 3D avatar, will be developed. This is expected to improve the capabilities of an interactive agent to involve people in engaging presentations of cultural heritage. These will make use of the 3D reconstructions of the three Campanian Charterhouses, also collected in the framework of the CHROME project.
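The inter-subject and intra-subject comparisons mentioned above can be illustrated with a minimal sketch. It assumes the recording plan given in Section 2 (four experts, each accompanying four groups); the identifiers `E1`..`E4` and the function names are hypothetical and are not part of the CHROME protocol.

```python
from itertools import combinations

# Hypothetical identifiers; the counts (four experts, four groups each)
# are taken from the recording plan described in Section 2.
EXPERTS = ["E1", "E2", "E3", "E4"]
GROUPS_PER_EXPERT = 4


def build_sessions():
    """Return one (expert, group) pair per guided tour."""
    return [(e, g) for e in EXPERTS for g in range(1, GROUPS_PER_EXPERT + 1)]


def split_comparisons(sessions):
    """Split all unordered session pairs into intra- and inter-subject sets."""
    intra, inter = [], []
    for a, b in combinations(sessions, 2):
        # Same expert with two different groups: intra-subject comparison.
        # Two different experts: inter-subject comparison.
        (intra if a[0] == b[0] else inter).append((a, b))
    return intra, inter


sessions = build_sessions()
intra, inter = split_comparisons(sessions)
print(len(sessions), len(intra), len(inter))  # 16 24 96
```

Under these assumptions the 16 tours yield 24 intra-subject pairs (six per expert) and 96 inter-subject pairs; which of these pairs are actually analysed is not specified in the paper.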
Upon completion of the project, the dataset will be made freely available for the scientific community.

In the next sections, we will present the data collection protocol, highlighting the chosen recording positions in the site of interest and the recording setup. We will then present the multimodal annotation protocol, designed to provide a formal description of how the guide makes use of social signal exchange to adapt the presentation and to effectively support the verbal transfer of cultural contents. Next, we will describe the informative, syntactic and prosodic annotations documenting the linguistic behaviour that characterises the domain expert. The transcribed recordings, together with the produced annotations, will be compared with a corpus of textual resources describing the objects of interest. This will support the development of a synthetic voice model for 3D avatars designed to extract cultural contents from textual databases and deliver them using social communication strategies. To improve the quality of the model, the linguistic analysis will also include a detailed annotation of disfluency phenomena, which are important to produce a natural sounding voice.

===2 DATA COLLECTION===
The data collection plan foresees a campaign of audiovisual recordings involving four art historians with strong experience in accompanying groups of visitors. Given the limited number of gatekeepers considered in the CHROME project, only female experts were recruited to remove gender effects in multimodal and linguistic analysis. Future extensions of the corpus will include male experts as well.

Recorded data include two Full-HD video recordings: the first one is a fixed shot of the gatekeeper, taken from a position immediately next to the attending group, while the second one is a fixed shot of the visitors. A close-range digital microphone with background noise cancellation is used to record the gatekeeper’s voice. Immediately after the visit, the recruited visitors fill in a questionnaire composed of 23 items including both Likert scale evaluations and open answer questions. The items are designed to collect demographic data, a self-evaluation of artistic competence, an evaluation of personal satisfaction after the visit and an evaluation of the gatekeeper’s performance. These data will be used to weight objective measures of social behaviour.

Each recruited expert accompanies four groups of four people in an hour-long guided tour at the San Martino Charterhouse in Naples. Recruited members of the audience vary on a socio-demographic basis and each group is gender balanced. The visit is divided into six points of interest (POIs), selected as the most relevant parts of the Charterhouse from an architectural and artistic point of view:
• Pronaos: outside the doorstep of the church. The introductory part of the visit is recorded in this POI. Environmental elements mainly consist of architectural details;
• Great cloister: a large external place, near the monks’ cemetery. Further details about the monks’ life are given. Environmental elements consist of the natural setting of a large garden and of the cemetery elements (e.g. memento mori);
• Parlor: the first internal setting. Specific details about the Carthusians’ rules are given here. Environmental elements mainly consist of frescoes;
• Chapter hall: next to the parlor. Specific details about the Carthusians’ order are given here. Environmental elements mainly consist of frescoes;
• Wooden choir: inside the church, behind the altar. The history of the church decoration process is given here. Environmental elements consist of both architectural details (e.g. the choir and the harmonic chassis) and artistic elements (frescoes and statues);
• Treasure hall: deeper inside the complex. Details about the relationship between the monks and the different governing parties in Naples are given. Environmental elements mainly consist of architectural details.

The selected POIs allow us to capture the social behaviour visitors and gatekeepers exhibit to negotiate the approach to the visit and to document the postural and gestural behaviour of an art historian presenting a complex environment.

Videos and audio recordings are synchronised a posteriori using a visual-acoustic marker. Linguistic and multimodal annotations, performed on the synchronised versions of the collected material, will be merged using the ELAN software [17]. An ELAN project file will be produced for each POI visit in order to allow cross-domain research, and closed vocabularies for the label sets belonging to each annotation domain will be used to ensure consistency. An example of the ELAN interface showing the two video shots and a sample annotation tier is shown in Figure 1.

Figure 1: A screenshot of the ELAN interface showing the synchronised videos of the expert and of the audience, together with an example annotation.

===3 MULTIMODAL ANNOTATION===
The video recording of the expert gatekeeper is annotated as to the structure of verbal discourse and to body communicative behaviour. The discourse structure point of view is based on a previous analysis of videos of Art Commentators (ACs), that is, both museum gatekeepers and art historians illustrating artworks on TV, from which a general script was extracted of what the AC can or should say in their work. This allowed us to outline the typical discourse structure of any AC which, based on the analysis of discourse as a hierarchy of goals [8], distinguishes four main goals pursued by the gatekeeper: a general goal of cultural elevation, encompassing favouring aesthetic enjoyment, imagination and emotion triggering, and, subsumed to it, the textual goals of providing information about the artwork, its history, function, cultural milieu, and the author; the corresponding
modal goals of attracting and sustaining attention, favouring comprehension and inferential connections with the tourists’ previous knowledge; interactional goals such as tuning, setting empathic connection with tourists. Each particular performance of a gatekeeper or other AC can be analysed in terms of this abstract script, and this allows us, among other things, to distinguish the idiosyncratic styles of different ACs in terms of which nodes of the structure they prefer to expand. Some mainly focus on the author and his life, some on the deep symbolic meanings of the artwork, some on the author’s style and the surrounding cultural milieu, and so on.

The analysis of the gatekeeper’s multimodal communication takes into account the following body communicative modalities: gestures, postures, head movements, facial expression, gaze communication. For each communicative item in each modality, the signal is annotated in ELAN in terms of a detailed description of its production: gestures are described according to their parameters of hand configuration, location, orientation and movement; gaze in terms of eye direction, eyebrow and eyelid movements; face in terms of Ekman’s FACS; head movements in terms of head nod, shake, toss, canting; postures in terms of leg and trunk movements. Then, for the signal described in this way, a verbal phrasing of its meaning is provided (after [9]). Based on this meaning, the item is classified as to its role and function within the gatekeeper’s discourse structure. An example of multimodal annotation is shown in Table 1.

{| class="wikitable"
|+ Table 1: An example of multimodality annotation.
! Verbal Text !! Discourse function !! Gesture !! Meaning
|-
| The Saint Martin’s Charterhouse here in Naples has at least two souls || Textual goal: Information on the artwork || hands, palms to each other, like framing something || I am framing the object of discourse. Metacognitive gesture
|-
| Nowadays it is not only a Charterhouse || Textual goal: Information on the identity of the artwork || Left hand moves to left. Metadiscursive gesture || I locate the identity Charterhouse on my left → I build the first entity
|-
| but it is also a national museum || Textual goal: Information on the identity of the artwork || Right hand moves to right. Metadiscursive gesture || I locate the identity known as Museum on my right → I build the second entity
|-
| So try to imagine Naples 700 years ago || Emotional goal: Solicit imagination || - || -
|}

===4 LINGUISTIC ANNOTATION===
Using the close-mic recordings, speech produced by the expert gatekeeper is analysed and annotated on different levels. From the informative-syntactic point of view, an orthographic level is produced on the basis of the indications provided by [11]. This level involves the transcription of a number of elements: lexical elements, silent and filled pauses, noises, vocal (nonverbal) phenomena, truncated words, interrupted words, false starts and lapsus linguae. A phonetic level is included to store the phonetic transcription of the utterances and markers of phonetic phenomena like coarticulation, following the indications found in [12]. A syllabic level is produced to allow speech fluency and speech rate analyses. A disfluency level, involving the annotation of disfluency phenomena [3, 13], is also included. This analysis level consists of four annotation tiers, detailed in Table 2.

{| class="wikitable"
|+ Table 2: Disfluency annotation levels
|-
| Disfluency Type || Type of disfluent phenomenon
|-
| Disfluency Function || Pragmatic function of the disfluent phenomenon
|-
| Disfluency Model || Model of occurrence
|-
| Disfluency Components || Internal regions of the phenomenon
|}

To document the prosodic component of the experts’ linguistic behaviour, a multilevel annotation, structured in different tiers, has been produced. The considered aspects include: an intonation level, using the INTSINT coding scheme [4, 5], providing a label sequence representing the f0 curve, obtained with the Prosomarker tool [7]; a pragmatic-informative level, providing an analysis of information structure considering topic (preposed or postposed) and comment units [2]; a macro-syntactic level, indicating the types of clauses, dividing independent clauses from dependent clauses and specifying the type of subordination; a syntactic level, describing the main syntactic functions; an intra-syntactic level, labelling the type of phrase and its composition (in parentheses); a measure of syntactic weight, based on [16], which takes into account both the structure and the length of constituents. It considers the following features: ± presence of determiners, ± presence of modifiers, ± presence of pronouns, ± verbal valency saturation. An annotation example is shown in Figure 2.

===5 CONCLUSIONS AND FUTURE WORK===
We have presented the data collection and annotation protocols for a work in progress on an audiovisual corpus documenting how cultural heritage gatekeepers support people in accessing architectural heritage. The corpus consists of both video and audio recordings to capture the social interaction process taking place between the
group guide and the attending audience. Annotation levels cover linguistic and multimodal aspects of communication to allow a multi-faceted investigation of the ongoing communicative process. The collected material will be used as reference to build a computational model of a 3D virtual character presenting reconstructions of architectural heritage sites.

[Figure 2 plot: pitch (Hz, 0 to 500) over time (s, 0 to 9.862), with the annotation tiers below the curve. Tier contents, top to bottom: M H T(H) L H L(B) H H L S L UH L L H L H; T C; IC; SUB PRED OBJ IO; NP(DET+N) VP(V) NP(DET+N+PP(PREP+DET+N)) PP(PREP+NP(DET+N)); NP4 VP4 NP5 PP4.]
Figure 2: An example on the utterance I certosini devono la fondazione del loro ordine a un uomo (Carthusians owe their order’s foundation to a man). The order of the annotation tiers is the one found in the text.

===6 ACKNOWLEDGMENTS===
Antonio Origlia’s work is funded by the Italian PRIN project Cultural Heritage Resources Orienting Multimodal Experience (CHROME) #B52F15000450001.

===REFERENCES===
[1] Jens Allwood, Nataliya Berbyuk Lindström, and Jia Lu. 2011. Intercultural dynamics of first acquaintance: comparative study of Swedish, Chinese and Swedish-Chinese first time encounters. In International Conference on Universal Access in Human-Computer Interaction. Springer, 12–21.

[2] Jeanette K Gundel. 1988. Universals of topic-comment structure. Studies in syntactic typology 17 (1988), 209–239.

[3] Adolf E Hieke. 1981. A content-processing view of hesitation phenomena. Language and Speech 24, 2 (1981), 147–160.

[4] Daniel Hirst and Albert Di Cristo. 1998. A survey of intonation systems. Intonation systems: A survey of twenty languages (1998), 1–44.

[5] Daniel Hirst, Albert Di Cristo, and Robert Espesser. 2000. Levels of representation and levels of analysis for the description of intonation systems. In Prosody: Theory and experiment. Springer, 51–87.

[6] Iain McCowan, Jean Carletta, W Kraaij, S Ashby, S Bourban, M Flynn, M Guillemot, T Hain, J Kadlec, V Karaiskos, and others. 2005. The AMI meeting corpus. In Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, Vol. 88. 100.

[7] Antonio Origlia and Iolanda Alfano. 2012. Prosomarker: a prosodic analysis tool based on optimal pitch stylization and automatic syllabification. In Proc. of the International Conference on Language Resources and Evaluation (LREC). 997–1002.

[8] Domenico Parisi and Cristiano Castelfranchi. 1976. The discourse as a hierarchy of goals. Centro Internazionale di Semiotica e di Linguistica, Università di Urbino.

[9] Isabella Poggi. 2007. Mind, hands, face and body: a goal and belief view of multimodal communication. Weidler.

[10] Hugues Salamin, Anna Polychroniou, and Alessandro Vinciarelli. 2013. Automatic detection of laughter and fillers in spontaneous mobile phone conversations. In Systems, Man, and Cybernetics (SMC), 2013 IEEE International Conference on. IEEE, 4282–4287.

[11] Renata Savy. 2005. Specifiche per la trascrizione ortografica annotata dei testi. Italiano Parlato, Analisi di un dialogo (2005), 1–28.

[12] Renata Savy. 2005. Specifiche per l’etichettatura dei livelli segmentali. Italiano Parlato. Analisi di un dialogo. Napoli: Liguori (2005).

[13] Elizabeth Ellen Shriberg. 1994. Preliminaries to a theory of speech disfluencies. Ph.D. Dissertation. Citeseer.

[14] Yasir Tahir, Debsubhra Chakraborty, Tomasz Maszczyk, Shoko Dauwels, Justin Dauwels, Nadia Thalmann, and Daniel Thalmann. 2015. Real-time sociometrics from audio-visual features for two-person dialogs. In Digital Signal Processing (DSP), 2015 IEEE International Conference on. IEEE, 823–827.

[15] Alessandro Vinciarelli, Maja Pantic, and Hervé Bourlard. 2009. Social signal processing: Survey of an emerging domain. Image and vision computing 27, 12 (2009), 1743–1759.

[16] Miriam Voghera and Giuseppina Turco. 2007. Il peso del parlare e dello scrivere. In Proc. of International Conf. Il Parlato Italiano, Liguori, Napoli.

[17] Peter Wittenburg, Hennie Brugman, Albert Russel, Alex Klassmann, and Han Sloetjes. 2006. ELAN: a professional framework for multimodality research. In Proc. of the International Conference on Language Resources and Evaluation (LREC). 1556–1559.
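As a supplementary illustration of the protocol in Section 2 (one ELAN project file per POI visit, with closed vocabularies for the label sets), the following is a minimal sketch of a consistency check. It does not use the ELAN software or its file format. The `Annotation` record, the tier name `head_movement` and the function names are hypothetical; the four head movement labels are those listed in Section 3, and the sample label `tilt` is invented to show a violation.

```python
from dataclasses import dataclass

# The six points of interest listed in Section 2.
POIS = ["Pronaos", "Great cloister", "Parlor",
        "Chapter hall", "Wooden choir", "Treasure hall"]

# Closed vocabulary for one tier. The labels come from Section 3;
# the tier name "head_movement" is a hypothetical identifier.
CLOSED_VOCABULARIES = {"head_movement": {"nod", "shake", "toss", "canting"}}


@dataclass(frozen=True)
class Annotation:
    tier: str
    start_ms: int
    end_ms: int
    label: str


def find_violations(annotations, vocabularies=CLOSED_VOCABULARIES):
    """Return annotations with an unknown tier, a label outside the tier's
    closed vocabulary, or an empty or reversed time interval."""
    bad = []
    for ann in annotations:
        allowed = vocabularies.get(ann.tier)
        if allowed is None or ann.label not in allowed or ann.end_ms <= ann.start_ms:
            bad.append(ann)
    return bad


def expected_project_files(n_experts=4, groups_per_expert=4, pois=POIS):
    """One project file per POI visit: experts x groups x POIs."""
    return n_experts * groups_per_expert * len(pois)


sample = [
    Annotation("head_movement", 1200, 1650, "nod"),
    Annotation("head_movement", 2000, 2300, "tilt"),  # not in the vocabulary
]
violations = find_violations(sample)
print([a.label for a in violations])  # ['tilt']
print(expected_project_files())       # 96
```

With four experts, four groups per expert and six POIs, this design would yield 96 project files; the figure follows from the numbers in Section 2 and is not stated in the paper.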