<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Attention in a Multimodal Corpus of Guided Tours</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
          <email>andreaamelio.ravelli@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Origlia</string-name>
          <email>antonio.origlia@unina.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <email>felice.dellorletta@ilc.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Naples “Federico II”</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper explores the possibility to annotate engagement as an extra-linguistic information in a multimodal corpus of guided tours in cultural sites. Engagement has been annotated in terms of gain or loss of perceived attention from the audience, and this information has been aligned to the transcription of the speech from the guide. A preliminary analysis suggests that the level of engagement correlates with some specific linguistic features, opening up to possible future exploitation.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Understanding a message expressed through the
speech channel in face-to-face interactions
involves more than the ability to decipher a string of
characters and to assign a meaning to words and
sentences. The linguistic information conveyed
by the lexicon is only the tip of the iceberg:
intonation, gesture, facial expression, gaze and body
movement play a key role in spoken communication. By
combining the information from all these
complementary modalities, acquired through different
channels (i.e. the auditory and visual systems), the
human brain is capable of analysing and decoding a
message not only on the basis of the words it contains.
Moreover, the visual modality enables the speaker
to evaluate the effectiveness of his/her message on
the audience. In fact, face-to-face interactions
offer the possibility of on-line feedback from
the addressee even without an ongoing active
dialogue. Simply by interpreting unconscious signals
accessible through the visual modality, such as body
postures and movements, facial expressions and
eye gaze, the speaker can understand whether the addressee
is engaged with the discourse, and continuously
fine-tune his/her communication strategy in order
to keep the attention of the audience high.</p>
      <p>Copyright © 2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>
        Engagement can be explained as the process by
which two or more actors establish, maintain and
end their perceived connection during interactions
they jointly undertake
        <xref ref-type="bibr" rid="ref14">(Rich et al., 2010)</xref>
        . It is
composed of a series of verbal and non-verbal
behaviours, useful to understand the involvement
between the actors, and specifically between the
actors and the content of their communication scene,
and it can be used to provide evidence of the
waning of connectedness
        <xref ref-type="bibr" rid="ref15">(Sidner et al., 2005)</xref>
        .
      </p>
      <p>
        In this work we describe a pilot annotation
of audience engagement during guided tours in
cultural sites, by evaluating the observable
behaviours of the visitors in response to the speech
from the guide. The main goal is to trace the level
of attention of the visitors. Engagement is defined
as a multidimensional meta-construct
        <xref ref-type="bibr" rid="ref3">(Fredricks et
al., 2004)</xref>
        , and attention is considered a component
of its visible cues.<sup>1</sup> The paper is organised
as follows: section 2 introduces the CHROME
project and its multimodal corpus; section 3
describes the visual annotation; section 4 reports the
results of the annotation in terms of agreement,
together with some linguistic analysis of the available set
of aligned transcriptions; section 5 concludes with
a discussion of possible future work and
uses for this kind of resource.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The CHROME Project</title>
      <p>
        The Italian national project Cultural Heritage
Resources Orienting Multimodal Experience
        <xref ref-type="bibr" rid="ref13">(Origlia
et al., 2018)</xref>
        aims at developing a data collection
and annotation procedure to support the
development of new interactive technologies for cultural
heritage. The project concentrates on the three
Campanian Charterhouses: an integrated
description of these from different points of view
(textual, behavioural, geometrical, etc.) is being
developed. In the framework of this project, a
data collection campaign was defined to document how
professional guides present architectural heritage
contents on site.
      </p>
      <p>
        <sup>1</sup> By definition, cognitive engagement refers to internal
processes, whereas only the emotional and behavioural
components are manifested in visible cues. Nevertheless, all
engagement elements are highly interrelated and do not occur
in isolation
        <xref ref-type="bibr" rid="ref3">(Fredricks et al., 2004)</xref>
        . Thus, attention plays a
crucial role
        <xref ref-type="bibr" rid="ref4">(Goldberg et al., 2019)</xref>
        .
      </p>
      <sec id="sec-2-1">
        <title>The CHROME multimodal corpus</title>
        <p>The collected data consist of audiovisual
recordings involving three art historians with strong
experience in accompanying groups of visitors.
Given the limited number of informants
considered in the CHROME project, only female experts
were recruited to remove gender effects in
multimodal and linguistic analysis.</p>
        <p>Recorded data include two Full-HD video
recordings: the first one is a fixed shot of the art
historian, taken from a position immediately next
to the attending group, while the second one is a
fixed shot of the group of recruited visitors. A
close-range digital microphone with background
noise cancellation is used to record the guide’s
voice.</p>
        <p>Each recruited expert accompanied four groups
of four people on an hour-long guided tour of
the San Martino Charterhouse in Naples.
Recruited members of the audience vary on a
sociodemographic basis and each group is gender
balanced. The visit is divided into six points of
interest (POIs), selected as the most relevant parts of
the Charterhouse from an architectural and artistic
point of view:</p>
        <list list-type="bullet">
          <list-item>
            <p>Pronaos: outside the doorstep of the church. The introductory part of the visit is recorded in this POI. Environmental elements mainly consist of architectural details;</p>
          </list-item>
          <list-item>
            <p>Great cloister: a large external place, near the monks’ cemetery. Further details about the monks’ life are given. Environmental elements consist of the natural setting of a large garden and of the cemetery elements (e.g. memento mori);</p>
          </list-item>
          <list-item>
            <p>Parlor: the first internal setting. Specific details about the Carthusians’ rules are given here. Environmental elements mainly consist of frescoes;</p>
          </list-item>
          <list-item>
            <p>Chapter hall: next to the parlor. Specific details about the Carthusians’ order are given here. Environmental elements mainly consist of frescoes;</p>
          </list-item>
          <list-item>
            <p>Wooden choir: inside the church, behind the altar. The history of the church decoration process is given here. Environmental elements consist of both architectural details (e.g. the choir and the harmonic chassis) and artistic elements (frescoes and statues);</p>
          </list-item>
          <list-item>
            <p>Treasure hall: deeper inside the complex. Details about the relationship between the monks and the different governing parties in Naples are given. Environmental elements mainly consist of architectural details.</p>
          </list-item>
        </list>
        <p>The selected POIs allow us to capture the social
behaviour visitors and gatekeepers exhibit to
negotiate the approach to the visit and to document
postural and gestural behaviour of an art historian
presenting a complex environment.</p>
        <p>
          Videos and audio recordings are synchronised
a posteriori using a visual-acoustic marker.
Linguistic and multimodal annotations, performed on
the synchronised versions of the collected
material, are merged and aligned using the ELAN
software
          <xref ref-type="bibr" rid="ref16">(Wittenburg et al., 2006)</xref>
          . An ELAN project
file is produced for each POI visit in order to allow
cross-domain research, and closed vocabularies for
the label sets belonging to each annotation domain
are used to ensure consistency. As for the
linguistic annotations, word, syllable and phone level
transcriptions are obtained using WebMAUS
          <xref ref-type="bibr" rid="ref6">(Kisler et al.,
2017)</xref>
          and manually checked by human experts.
Tonal units and syntactic structures are also
manually marked by a human expert.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Engagement annotation</title>
      <p>
        A subset of data from the CHROME Project has
been used for this work. More specifically, we
acquired data for one guide accompanying four
different groups of visitors in the San Martino
Charterhouse in Naples, consisting of 24 video pairs
(aligned videos of both the guide and the
audience, one pair for each POI). Annotation was
performed by two annotators by means of
the web-based PAGAN annotation platform
        <xref ref-type="bibr" rid="ref11">(Melhart
et al., 2019)</xref>
        , which enables users to easily
align and play two videos. Annotators were
asked to recognise signals of gain or loss of
attention in the audience, and they recorded their
observations through simple interactions with the up
and down keys of the keyboard, where up stands
for a gain and down for a loss in attention. Given
the nature of the annotation (and the scope of this
pilot work), no strict instructions were
given to the annotators. They based their
judgement on visible cues of perceivable variation in
the level of attention of the group of visitors,
such as gaze following a deictic gesture, facial
expressions as feedback to the guide’s speech, head
movements, pose and so on. The interactions in
PAGAN are recorded using the RankTrace framework
        <xref ref-type="bibr" rid="ref9">(Lopes et al., 2017)</xref>
        , and the whole annotation
session is exported as a tab-separated file containing
a continuous series of timestamps in milliseconds and values for
each interaction. In total, the set of videos amounts
to 3:20 hours, with an average length of 8:40
minutes per point of interest.
      </p>
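      <p>As an illustration of how such an export can be consumed, the following minimal Python sketch reads a two-column trace (time in milliseconds, annotation value) and labels each interaction as a gain or a loss of attention. The header-less two-column layout, the starting level of 0 and the names read_trace and label_changes are assumptions made for the example; they are not taken from the PAGAN export format.</p>
      <preformat><![CDATA[
```python
import csv
import io


def read_trace(tsv_text):
    """Parse a two-column tab-separated trace into (time_ms, value) pairs."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    trace = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed rows
        trace.append((int(row[0]), float(row[1])))
    return sorted(trace)  # order by timestamp


def label_changes(trace):
    """Label each step as 'gain', 'loss' or 'null' from the change in value."""
    labels = []
    previous = 0.0  # assumed level before the first interaction
    for time_ms, value in trace:
        if value > previous:
            labels.append((time_ms, "gain"))
        elif previous > value:
            labels.append((time_ms, "loss"))
        else:
            labels.append((time_ms, "null"))
        previous = value
    return labels


# Invented sample: two key presses up, then one key press down.
sample = "1000\t1\n2500\t2\n4000\t1\n"
events = label_changes(read_trace(sample))
```
]]></preformat>
      <p>Keeping the parsing separate from the labelling makes it possible to reuse the same trace both for agreement measures on the raw values and for the gain or loss events used in the alignment.</p>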
      <p>
        For 3 of these videos, the ELAN project file containing the orthographic
transcription of the guide’s speech was already available<sup>2</sup> (more
specifically, the speech from the visit to POI 1 with
the first three groups); it was thus possible
to automatically align the visually derived
annotation using the pympi-ling Python module
        <xref ref-type="bibr" rid="ref10">(Lubbers and Torreira, 2018)</xref>
        .
      </p>
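      <p>The alignment step can be sketched as follows. The paper relies on pympi-ling to read the ELAN files; this sketch leaves the file handling out and assumes that the transcription tier is already available as (start_ms, end_ms, text) tuples and the annotation as (time_ms, label) events. The names align_events and classify, and the rule that a unit counts as positive when it contains at least one gain, are hypothetical.</p>
      <preformat><![CDATA[
```python
def align_events(intervals, events):
    """Attach each timestamped event to the transcription unit containing it."""
    aligned = [(start, end, text, []) for start, end, text in intervals]
    for time_ms, label in events:
        for start, end, _text, labels in aligned:
            if start <= time_ms < end:
                labels.append(label)
                break  # units are assumed not to overlap
    return aligned


def classify(labels):
    """Collapse the labels of one unit into positive, negative or null."""
    if not labels:
        return "null"  # the annotator did not interact during this unit
    if "gain" in labels:
        return "positive"  # assumption: any gain marks the unit as positive
    return "negative"


# Invented intervals and events, in milliseconds.
intervals = [(0, 2000, "first unit"), (2000, 5000, "second unit"), (5000, 7000, "third unit")]
events = [(1000, "gain"), (4000, "loss")]
classes = [classify(labels) for _s, _e, _t, labels in align_events(intervals, events)]
```
]]></preformat>
      <p>Events that fall outside every interval (for instance during a pause) are simply ignored in this sketch.</p>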
      <p>Figure 1 shows an example of the alignment for
one of the videos in an ELAN project file. Using
these alignments it has been possible to investigate
whether any correlation exists between linguistic features
extracted from the guide’s speech and the engagement
of the visitors.</p>
      <p><sup>2</sup> The transcription and annotation of the whole corpus of
the CHROME Project is still ongoing, thus
completely annotated and aligned data are still limited.</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation of the corpus</title>
      <p>[Table 1: correlation between the two annotators for each video (POI 1 to POI 6 for each of the four groups), with the overall average (AVG).]</p>
      <p>To evaluate the agreement, and thus the
reliability of the annotation, we calculated
Spearman’s rho for the continuous series of values from
the two annotators. Table 1 reports the results of
the correlations: the overall agreement is
high, with an average correlation between the
two series of 0.87. Figures 2 and 3 show,
respectively, the plots for the highest and lowest correlation.</p>
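      <p>For reference, Spearman’s rho is the Pearson correlation computed on the ranks of the two series. The sketch below is a plain-Python illustration, assuming that the two annotators’ traces have already been resampled to the same time points (an assumption, since the resampling step is not described here); the function names and sample values are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented traces from two annotators, sampled at the same time points.
annotator_a = [1, 2, 3, 4, 5]
annotator_b = [1, 3, 2, 5, 4]
rho = spearman_rho(annotator_a, annotator_b)
```
]]></preformat>
      <p>Because the measure only depends on ranks, it suits unbounded traces in which the two annotators may press the keys a different number of times while still agreeing on the direction of change.</p>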
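      <table-wrap id="tab-2">
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th>Linguistic feature</th>
              <th>Positive, Avg (St.Dev)</th>
              <th>Null, Avg (St.Dev)</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>n tokens</td><td>19.78 (14.63)**</td><td>10.42 (9.79)**</td></tr>
            <tr><td>% NOUN</td><td>15.97 (9.69)</td><td>17.32 (14.32)</td></tr>
            <tr><td>% PROPN</td><td>4.48 (11.7)*</td><td>4.24 (9.9)*</td></tr>
            <tr><td>% PRON</td><td>7.65 (8.04)**</td><td>6.77 (11.85)**</td></tr>
            <tr><td>% VERB</td><td>11.33 (9.2)*</td><td>12.2 (18.07)*</td></tr>
            <tr><td>% AUX</td><td>5.87 (7.19)**</td><td>5.07 (12.12)**</td></tr>
            <tr><td>% ADJ</td><td>3.94 (5.04)</td><td>5.06 (10.91)</td></tr>
            <tr><td>% ADV</td><td>14.14 (13.49)**</td><td>13.55 (20.19)**</td></tr>
            <tr><td>% DET</td><td>15.49 (13.99)</td><td>14.74 (12.73)</td></tr>
            <tr><td>% NUM</td><td>0.32 (1.35)</td><td>0.42 (2.45)</td></tr>
            <tr><td>% CCONJ</td><td>4.85 (15.34)**</td><td>2.48 (8.16)**</td></tr>
            <tr><td>% SCONJ</td><td>2.48 (3.52)**</td><td>3 (11.46)**</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-4-3">
        <p>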
Such information can be used to extract
meaningful segments concerning the level of attention (e.g.
for machine learning purposes).
        </p>
        <p>
As briefly mentioned before, we used the
available orthographic
transcriptions to analyse the
possible correlation between the content of the speech and
the perceivable engagement of the audience. To
do so, we considered pause tags, i.e. short and
long pauses (respectively, &lt;sp&gt; and &lt;lp&gt;), as
boundaries for sentence-like units of text to be
processed along with the corresponding
engagement value. We are aware that breath groups
cannot be considered as reference units for the
analysis of speech,<sup>3</sup> and that applying written
language methodologies and tools to the spoken modality
introduces a bias
          <xref ref-type="bibr" rid="ref7 ref8">(Linell, 2005; Linell, 2019)</xref>
          , but for the
scope of the present work it has been necessary to
make use of the available segmentation.
        </p>
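        <p>The segmentation on pause tags can be sketched in a few lines of Python. This is only an illustration of the idea: it assumes that the transcription is available as a single string in which pauses appear literally as &lt;sp&gt; and &lt;lp&gt;, and the function names and the sample text are invented.</p>
        <preformat><![CDATA[
```python
import re

# Short and long pause tags, used as boundaries of sentence-like units.
PAUSE_TAG = re.compile(r"<sp>|<lp>")


def split_on_pauses(transcript):
    """Split a transcript into sentence-like units at pause tags."""
    units = (unit.strip() for unit in PAUSE_TAG.split(transcript))
    return [unit for unit in units if unit]  # drop empty units between adjacent pauses


def token_counts(units):
    """Count whitespace-separated tokens in each unit."""
    return [len(unit.split()) for unit in units]


# Invented sample text with two short pauses and one long pause.
text = "welcome to the charterhouse <sp> this is the pronaos <lp> <sp> follow me"
units = split_on_pauses(text)
lengths = token_counts(units)
```
]]></preformat>
        <p>Whitespace tokenisation is a simplification; a linguistic profiling tool applies its own tokeniser, so the counts may differ.</p>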
        <p>
          Even though little text was available (3
transcriptions, for a total of 5,648 tokens in 464 sentences;
about 12 tokens per sentence), we analysed the
corpus using Profiling-UD<sup>4</sup>
          <xref ref-type="bibr" rid="ref2">(Brunato et al., 2020)</xref>
          ,
a web-based application that performs linguistic
profiling of a given text. The output of
Profiling-UD is a tab-separated file, with one row per
document (one per sentence, in this case) and one
column for each of the 122 linguistic features
analysed by the system. The objective is to investigate
whether any relation can be traced between the
perceived attention of the audience and the
linguistic features extracted from the guide’s speech. We
compared the scores for the sentences marked with
a gain of attention against those for which
annotators did not interact with the platform (i.e. those
sentences that, aligned by time stamps to the
series of annotations, were not marked as gain
or loss of attention). We performed the Wilcoxon
rank sum test on feature values for the two groups
of sentences (positive vs. null) for both
annotators.
        </p>
        <p><sup>3</sup> Segmentation of speech into basic units is still an open
challenge in spoken language studies, as recently testified by
          <xref ref-type="bibr" rid="ref5">Izre’el et al. (2020)</xref>
          and
          <xref ref-type="bibr" rid="ref12">Mello et al. (2020)</xref>
          .
        </p>
        <p><sup>4</sup> http://linguistic-profiling.italianlp.it</p>
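        <p>The comparison between positive and null sentences can be illustrated with a plain-Python version of the Wilcoxon rank sum test. This is a sketch that uses the normal approximation with a tie correction and no continuity correction, so its p-values may differ slightly from those of a statistical package; which implementation was used for the analysis is not stated here. The function names and the sample values are invented for the example.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1  # extend the run of tied values
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(sample_a, sample_b):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns the rank sum of sample_a, the z statistic and the p-value.
    """
    pooled = list(sample_a) + list(sample_b)
    n_a, n_b = len(sample_a), len(sample_b)
    n = n_a + n_b
    ranks = average_ranks(pooled)
    w = sum(ranks[:n_a])  # rank sum of the first sample
    mean_w = n_a * (n + 1) / 2  # expected rank sum under the null hypothesis
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    var_w = n_a * n_b / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / math.sqrt(var_w)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal
    return w, z, p_value


# Invented token counts for positive and null sentences.
positive = [12, 15, 18, 21, 25, 30]
null = [5, 7, 9, 10, 11, 14]
w, z, p_value = rank_sum_test(positive, null)
```
]]></preformat>
        <p>In the setting described above, the test would be run once per linguistic feature and per annotator, with the feature values of the positive sentences as one sample and those of the null sentences as the other.</p>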
        <p>Table 2 reports the average and standard deviation
of the linguistic features with p&lt;0.05 for at least
one annotator.<sup>5</sup> Between
positive and null marked sentences in both
annotators’ data, the features that vary most
significantly are sentence length
(n tokens) and the distribution of auxiliaries (AUX).</p>
        <p>
          The correlation between length and attention is
not surprising, since longer sentences are likely to
be more informative and thus probably more
engaging. Even if sentence length is normally
associated with higher sentence complexity
          <xref ref-type="bibr" rid="ref1">(Brunato
et al., 2018)</xref>
          , other typical features of complexity
are not appreciably present, given that subordinating
conjunctions (SCONJ) are noticeably less frequent in the
sentences marked with higher attention, while coordinating
conjunctions (CCONJ) show the opposite trend in the
two groups. For both groups, proper nouns
(PROPN) and pronouns (PRON) seem to
characterise engaging sentences.
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Works</title>
      <p>In this work we introduced a pilot annotation of
visually perceivable attention, understood as a
component of engagement, and its alignment in a
multimodal corpus of guided tours in cultural sites.
Moreover, we analysed the available speech
transcriptions for 3 of the 24 videos and,
notwithstanding the small size of the corpus (about 5K
tokens), some signal of a connection between
attention and specific lexical features emerges. It
would be interesting to augment the data in terms
of annotations and alignment in order to
verify these correlations extensively. A much more
reliable analysis could be carried out by exploiting
better textual segmentation, e.g. tonal units, and
fine-tuning the feature extraction procedure in order
to better handle spoken language. In this way, it
would also be possible to account for spoken-specific
peculiarities and correlate them with audience
engagement.</p>
      <p><sup>5</sup> In this analysis we consider exclusively features on
sentence length and part-of-speech distributions. Profiling-UD
is a tool designed for written text and not trained to work on
speech transcriptions, thus any significance on syntactic
features is not reliable.</p>
      <p>Finally, in the specific context of hosting and
guiding visitors in cultural sites, the possibility
to trace the level of engagement during tours can
open up to interesting outcomes. In this regard,
aligning speech transcription with attention
tracking and other data, such as gaze, intonation,
gesture, facial expression, body movement (for both
the speaker and the addressee), would be
particularly useful to train a classifier to recognise
engaging information both in spoken language and
in videos.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Brunato</surname>
          </string-name>
          , Lorenzo De Mattei, Felice Dell'Orletta,
          <string-name>
            <given-names>Benedetta</given-names>
            <surname>Iavarone</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Is this sentence difficult? do you agree</article-title>
          ?
          <source>In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>2690</fpage>
          -
          <lpage>2699</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Brunato</surname>
          </string-name>
          , Andrea Cimino, Felice Dell'Orletta,
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Profiling-ud: a tool for linguistic profiling of texts</article-title>
          .
          <source>In Proceedings of The 12th Language Resources and Evaluation Conference</source>
          , pages
          <fpage>7145</fpage>
          -
          <lpage>7151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Jennifer A</given-names>
            <surname>Fredricks</surname>
          </string-name>
          , Phyllis C Blumenfeld, and Alison H Paris.
          <year>2004</year>
          .
          <article-title>School engagement: Potential of the concept, state of the evidence</article-title>
          .
          <source>Review of educational research</source>
          ,
          <volume>74</volume>
          (
          <issue>1</issue>
          ):
          <fpage>59</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Patricia</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , Ömer Sümer, Kathleen Stürmer, Wolfgang Wagner, Richard Göllner, Peter Gerjets, Enkelejda Kasneci, and
          <string-name>
            <given-names>Ulrich</given-names>
            <surname>Trautwein</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Attentive or Not? Toward a Machine Learning Approach to Assessing Students' Visible Engagement in Classroom Instruction</article-title>
          . Educational Psychology Review,
          <volume>35</volume>
          (
          <issue>1</issue>
          ):
          <fpage>463</fpage>
          -
          <lpage>23</lpage>
          , January.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Shlomo</given-names>
            <surname>Izre'el</surname>
          </string-name>
          , Heliana Mello, Alessandro Panunzi, and
          <string-name>
            <given-names>Tommaso</given-names>
            <surname>Raso</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>In Search of Basic Units of Spoken Language, volume 94 of A corpus-driven approach</article-title>
          . John Benjamins Publishing Company, Amsterdam, June.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Kisler</surname>
          </string-name>
          , Uwe Reichel, and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Schiel</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Multilingual processing of speech via web services</article-title>
          .
          <source>Computer Speech &amp; Language</source>
          ,
          <volume>45</volume>
          :
          <fpage>326</fpage>
          -
          <lpage>347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Per</given-names>
            <surname>Linell</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The Written Language Bias in Linguistics. Its Nature, Origins and Transformations</article-title>
          . Routledge.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Per</given-names>
            <surname>Linell</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>The Written Language Bias (WLB) in linguistics 40 years after</article-title>
          .
          <source>Language Sciences</source>
          ,
          <volume>76</volume>
          :
          <fpage>101230</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Phil</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Georgios N</given-names>
            <surname>Yannakakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Antonios</given-names>
            <surname>Liapis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Ranktrace: Relative and unbounded affect annotation</article-title>
          .
          <source>In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII)</source>
          , pages
          <fpage>158</fpage>
          -
          <lpage>163</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Mart</given-names>
            <surname>Lubbers</surname>
          </string-name>
          and Francisco Torreira.
          <year>2018</year>
          .
          <article-title>pympi-ling: a Python module for processing ELANs EAF and Praats TextGrid annotation files</article-title>
          . https://pypi.python.org/pypi/pympi-ling. Version 1.69.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Melhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Antonios</given-names>
            <surname>Liapis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Georgios N</given-names>
            <surname>Yannakakis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Pagan: Video affect annotation made easy</article-title>
          .
          <source>In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)</source>
          , pages
          <fpage>130</fpage>
          -
          <lpage>136</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Heliana</given-names>
            <surname>Mello</surname>
          </string-name>
          , Lúcia Ferrari, and
          <string-name>
            <given-names>Bruno</given-names>
            <surname>Rocha</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Multimodality, Segmentation and Prominence in Speech</article-title>
          .
          <source>Journal of Speech Sciences</source>
          ,
          <volume>9</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Origlia</surname>
          </string-name>
          , Renata Savy, Isabella Poggi, Francesco Cutugno, Iolanda Alfano,
          <string-name>
            <given-names>Francesca</given-names>
            <surname>D'Errico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Laura</given-names>
            <surname>Vincze</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Violetta</given-names>
            <surname>Cataldo</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An Audiovisual Corpus of Guided Tours in Cultural Sites - Data Collection protocols in the CHROME Project</article-title>
          . JOWO,
          <year>2091</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Charles</given-names>
            <surname>Rich</surname>
          </string-name>
          , Brett Ponsler, Aaron Holroyd, and Candace L Sidner.
          <year>2010</year>
          .
          <article-title>Recognizing engagement in human-robot interaction</article-title>
          .
          <source>In 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI)</source>
          , pages
          <fpage>375</fpage>
          -
          <lpage>382</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Candace L</given-names>
            <surname>Sidner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cory D</given-names>
            <surname>Kidd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Neal</given-names>
            <surname>Lesh</surname>
          </string-name>
          , and Charles Rich.
          <year>2005</year>
          .
          <article-title>Explorations in engagement for humans and robots</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>166</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>140</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Peter</given-names>
            <surname>Wittenburg</surname>
          </string-name>
          , Hennie Brugman,
          <string-name>
            <given-names>Albert</given-names>
            <surname>Russel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alex</given-names>
            <surname>Klassmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Han</given-names>
            <surname>Sloetjes</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Elan: a professional framework for multimodality research</article-title>
          .
          <source>In Proc. of the International Conference on Language Resources and Evaluation (LREC)</source>
          , pages
          <fpage>1556</fpage>
          -
          <lpage>1559</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>