<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Audience Engagement Prediction in Guided Tours through Multimodal Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Cimino</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>This paper explores the possibility of predicting audience engagement, measured in terms of visible attention, in the context of guided tours. We built a dataset composed of Italian sentences derived from the speech of an expert guide leading visitors in cultural sites, enriched with multimodal features, and labelled on the basis of the perceivable engagement of the audience. We ran experiments in various classification scenarios and observed the impact of modality-specific features on the classifiers.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>During face-to-face interactions, the average
speaker is generally very good at estimating the
interlocutor’s level of involvement, without the need
for explicit verbal feedback. He/she only needs
to interpret visually accessible unconscious
signals, such as body postures and movements,
facial expressions, and eye gaze. The speaker can
understand whether the addressee is engaged with the
discourse, and continuously fine-tune his/her
communication strategy in order to keep the
communication channel open and the audience’s
attention high.1</p>
      <p>Understanding non-verbal feedback is not
easy to achieve for virtual agents and robots, but
this ability is strategic for enabling more natural
interfaces capable of adapting to users.</p>
      <p>
        Indeed, perceiving signals of loss of attention (and thus, of
engagement) is of paramount importance to design
naturally behaving virtual agents, able to
adjust their communication strategy and keep the
interest of their addressees high. That information is
also a general sign of the quality of the interaction
and, more broadly, of the communication
experience. At the same time, the ability to generate
engaging behaviours in an agent can be beneficial in
terms of social awareness
        <xref ref-type="bibr" rid="ref14">(Oertel et al., 2020)</xref>
        .
      </p>
      <p>
        1Recent studies have shown that the processing of
emotionality in prosody, facial expressions and speech content
is associated in the listeners’ brain with enhanced activation
of auditory cortices, fusiform gyri and middle temporal gyri,
respectively, confirming that emotional states are processed
through modality-specific modulation strategies
        <xref ref-type="bibr" rid="ref19">(Regenbogen et al., 2012)</xref>
        .
      </p>
      <p>
        The objective of developing a naturally behaving
agent, able to guide visitors along a tour in cultural
sites, was at the core of the CHROME Project2
        <xref ref-type="bibr" rid="ref15 ref4">(Cutugno et al., 2018; Origlia et al., 2018)</xref>
        , and
the present work moves in the same direction.
More specifically, this paper explores the
possibility of predicting audience engagement in the context
of guided tours, by considering acoustic and
linguistic features of the speech of an expert guide
leading visitors inside museums.
      </p>
      <p>The paper is organised as follows: Section 2
draws a brief overview of related work in the
field of engagement annotation and prediction;
Section 3 describes in detail the construction of
the dataset; Section 4 reports the methodology
adopted to extract features specific to both the
linguistic and acoustic modalities; Section 5
illustrates the set of experiments conducted on the
collected data, in terms of classification scenarios and
features used; Section 6 gathers final observations
and ideas for future work.</p>
      <p>Contributions The main contributions of this
paper are: i) a novel multimodal Italian dataset
with engagement annotation; ii) experiments in
multiple classification scenarios; iii) an analysis of the
impact of modality-specific features on multimodal
classification.</p>
      <p>2Cultural Heritage Resources Orienting Multimodal
Experience. http://www.chrome.unina.it/</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>
        With the word engagement we refer to the level
of involvement reached during a social interaction,
which unfolds as a process throughout the
whole communication exchange. More
specifically, Poggi (2007) defines the process of social
engagement as the value that a participant in an
interaction attributes to the goal of being together
with the other participant(s) and continuing the
interaction. Another definition, adopted by many
studies in Human-Robot Interaction (HRI),3
describes engagement as the process by which
interactors start, maintain, and end their perceived
connections to each other during an interaction
        <xref ref-type="bibr" rid="ref21">(Sidner et al., 2005)</xref>
        .
      </p>
      <p>
        Observations and annotations of engagement
are collected on the basis of visible cues, such
as facial expressions and reactions, eye gazes,
body movements and postures. The majority of
studies are conducted on a dyadic basis,
i.e. focusing on communication contexts
involving only two participants, most often a
human interacting with an agent/robot
        <xref ref-type="bibr" rid="ref1 ref20 ref3">(Castellano et
al., 2009; Sanghvi et al., 2011; Ben-Youssef et
al., 2021)</xref>
        . Nevertheless, engagement can be
measured in groups of people taking part in the same
communication event as the average of the degree
to which individuals are involved
        <xref ref-type="bibr" rid="ref13 ref8">(Gatica-Perez et
al., 2005; Oertel et al., 2011)</xref>
        . Human-to-human
interactions within groups have been studied
principally in the research field of education
        <xref ref-type="bibr" rid="ref7">(Fredricks
et al., 2004)</xref>
        where visible cues are related to
attention, which is considered a perceivable proxy
for the more complex and inner process of
engagement
        <xref ref-type="bibr" rid="ref9">(Goldberg et al., 2019)</xref>
        .
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Dataset</title>
      <p>
        The dataset presented in this paper is derived from
a subset of the CHROME Project data collection
        <xref ref-type="bibr" rid="ref15">(Origlia et al., 2018)</xref>
        , which comprises aligned
videos, audio and transcriptions of guided tours
in three Charterhouses in Campania. Two videos
have been recorded for each session: one video
with the guide as subject, the other focused on the
group of visitors. Data of 3 visits with the same
expert guide (in the same Charterhouse) have been
selected. Each visit is organised in 6 points of
interest (POI), i.e. rooms or areas inside the
Charterhouse where groups stop during the tours and
the guide describes the place with its furnishings,
history, and anecdotes.
      </p>
      <p>3For a broad and complete overview of works on
engagement in HRI studies, see Oertel et al. (2020).</p>
      <p>In total, the source data consist of 2:44:25 hours of
audiovisual material and 22,621 tokens from the
aligned transcriptions. The language of the speech
is Italian.</p>
      <sec id="sec-3-1">
        <title>3.1 Annotation and Segmentation</title>
        <p>
          Engagement has been annotated as a continuous
measurement of the visitors’ attention, taken as a visible cue
of engagement. The annotation has been carried
out using PAGAN Annotation Tool
          <xref ref-type="bibr" rid="ref11">(Melhart et al.,
2019)</xref>
          , and performed by two annotators watching
videos of the groups of visitors in order to observe
cues of gain or loss of attention. Following
Oertel et al. (2011), annotators have been asked to
evaluate the average behaviour of the whole group.
Agreement between the two annotators is
consistent, with an average Spearman’s rho of 0.87
          <xref ref-type="bibr" rid="ref18">(Ravelli et al., 2020)</xref>
          .
        </p>
        <p>The raw transcriptions have been manually
segmented with the objective of creating textual
segments close to written sentences, and this
segmentation has been projected on audio files, in order
to obtain aligned text-audio pairs for each
segment. Given that every visit is similarly
structured, and also topics and whole pieces of
information are mostly the same across different
visits, the resulting transcriptions are extremely clear,
and phenomena such as retractions and
disfluencies are minimal compared to transcriptions of
typical spontaneous speech. Thus, text
normalisation (i.e., disfluency removal, basic punctuation
insertion) has been easy to obtain, and the
resulting adaptation led to sentences that are easy to parse with
common NLP tools trained on written texts.</p>
        <p>Segmentation has been performed on the
basis of perceptual cues of utterance completeness.
As described by Danieli et al. (2005), a break is
said to be terminal if a competent speaker (i.e. a mother-tongue
speaker) assigns to it the quality of
concluding the sequence. Starting from this
observation, two annotators have been asked to listen to
the original audio tracks and mark transcriptions
with a full stop where they perceived a break as a
boundary between utterances, on the basis of
intonation and prosodic contour. Utterances perceived
as independent but pronounced too quickly to
allow a clean cut (especially considering audio
segmentation and the consequent features extraction)
have been kept together in a single segment.</p>
        <p>To assess the reliability of the segmentation
process, we measured the accuracy between the two
annotators on a subset of the data (40% of the
total, corresponding to one of the three visits). We
adopted a chunking approach to the problem, by
adapting an IOB (Inside-Outside-Begin) tagging
framework to label tokens, from the continuous
transcriptions of the sample, at the beginning (B),
inside (I), end (E) of segments, or outside (O) any
of those. We measured an accuracy of 91.53% in
terms of agreement/disagreement on the basis of
the series of labelled tokens derived for each
annotator.</p>
        <p>At the end of the segmentation process, the
dataset counts 1,114 Italian sentences, with an
average of 20.31 tokens per sentence (std: 11.96),
and an average duration of audio segments of 8.13
seconds (std: 5.22).</p>
        <p>An engagement class has been assigned to each
sentence: 1 if an increase in engagement has been
recorded in the span of that sentence, 0 in case of
decrease or no variation. To compute the class, we
considered the delta of the continuous measurement
obtained with the annotations between the
beginning and the end of each sentence. Specifically, for each
sentence we selected all the annotations (one per
millisecond) falling within the sentence boundaries,
and then subtracted the value of the first one
from that of the last one. We reduced the task to a
binary classification in order to test to what extent
it is possible to predict engaging content, before
evaluating the possibility of expanding the analysis to
a finer classification, accounting also for what is
specifically engaging, non-engaging or neutral.</p>
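        <p>A minimal sketch of this labelling step, assuming the continuous annotation is available as one value per millisecond (names and values are illustrative), is the following.</p>
        <preformat>
# Sketch: derive the binary engagement class of a sentence from the continuous
# attention annotation (one value per millisecond).
def engagement_class(annotation, start_ms, end_ms):
    """Return 1 if attention increases over the sentence span, 0 otherwise."""
    span = annotation[start_ms:end_ms]   # annotations falling inside the sentence
    delta = span[-1] - span[0]           # last value minus first value
    return 1 if delta &gt; 0 else 0      # 0 also covers "no variation"

# Toy example: attention slowly rises during a 3-second sentence.
values = [0.42 + 0.00002 * t for t in range(3000)]
print(engagement_class(values, 0, 3000))  # 1
        </preformat>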
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Feature Extraction</title>
      <p>In order to train and test a classifier in
predicting the engagement of the addressee of an
utterance, using both linguistic and acoustic
information, features specific to each modality have
been extracted independently, and then
concatenated into a single vector representing each entry of
the dataset.</p>
      <sec id="sec-4-1">
        <title>4.1 Linguistic Features</title>
        <p>
          The textual modality has been encoded by using
Profiling–UD
          <xref ref-type="bibr" rid="ref2">(Brunato et al., 2020)</xref>
          , a publicly
available web–based application4 inspired by the
methodology initially presented in Montemagni
(2013), which performs linguistic profiling of a text,
or of a large collection of texts, for multiple
languages. The system, based on an intermediate step
of linguistic annotation with UDPipe
          <xref ref-type="bibr" rid="ref22">(Straka et al.,
2016)</xref>
          , extracts a total of 129 features for each
analysed document. In this case, the Profiling-UD
analysis has been performed per sentence, so the
output has been considered as the linguistic
feature set of each segment of the dataset. Table 1
reports the 127 features extracted with Profiling-UD
and used as textual-modality features for the
classifier.5
        </p>
        <p>4Profiling-UD can be accessed at the following link: http://linguistic-profiling.italianlp.it</p>
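        <p>As an illustration of the kind of sentence-level properties involved, the sketch below computes two profiling-style features (tokens per sentence and dependency tree depth) from a CoNLL-U parse using the conllu library; it is not the Profiling-UD implementation, whose full feature set is described in Brunato et al. (2020).</p>
        <preformat>
# Sketch: two profiling-style features from a CoNLL-U parsed sentence.
import conllu

def tree_depth(sentence):
    """Maximum distance from any token to the root of the dependency tree."""
    heads = {tok["id"]: tok["head"] for tok in sentence if isinstance(tok["id"], int)}
    def depth(tok_id):
        steps = 0
        while heads.get(tok_id, 0) != 0:   # climb head links until the root (head = 0)
            tok_id = heads[tok_id]
            steps += 1
        return steps
    return max(depth(i) for i in heads)

def profile(conllu_text):
    """Toy per-sentence profile; the real Profiling-UD output has 129 features."""
    sentence = conllu.parse(conllu_text)[0]
    return {"tokens_per_sent": len(sentence), "max_tree_depth": tree_depth(sentence)}

# Usage (given a CoNLL-U string produced, e.g., by UDPipe):
#   features = profile(conllu_string)
        </preformat>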
        <p>[Table 1: Linguistic features extracted with Profiling-UD, grouped into raw text properties, morpho–syntactic information, verbal predicate structure, parsed tree structures, syntactic relations and subordination phenomena, with the total count of features.]</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Acoustic Features</title>
        <p>
          The acoustic modality has been encoded using
OpenSmile6
          <xref ref-type="bibr" rid="ref6">(Eyben et al., 2010)</xref>
          , a complete and
open-source toolkit for analysis, processing and
classification of audio data, especially targeted at
speech and music applications such as automatic
speech recognition, speaker identification,
emotion recognition, or beat tracking and chord
detection. The acoustic feature set used in this case
is the Computational Paralinguistics ChallengE7
(ComParE) set, which comprises 65 Low-Level
Descriptors (LLDs), computed per frame. Table 2
reports a summary of the ComParE LLDs extracted
with OpenSmile, grouped by type:
prosody-related, spectrum-related and quality-related.
        </p>
        <p>Given that the duration (and, consequently, the number
of frames) of audio segments varies, common
transformations (min, max, mean, median, std)
have been applied on the set of per-frame features
(a minimal sketch of this aggregation is given below).</p>
        <p>5Out of the 129 Profiling-UD features, n sentences and
tokens per sent (raw text properties) have not been considered,
given that the analysis has been performed per sentence.</p>
        <p>6https://www.audeering.com/research/opensmile/</p>
        <p>7http://www.compare.openaudio.eu</p>
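        <p>The aggregation of per-frame LLDs into fixed-length segment vectors can be sketched as follows, assuming the descriptors are already available as a NumPy array of shape frames × 65 (the extraction itself is done with OpenSmile).</p>
        <preformat>
import numpy as np

def aggregate_llds(llds):
    """Collapse a variable-length (n_frames, 65) LLD matrix into a fixed
    325-dimensional vector: min, max, mean, median and std per descriptor."""
    stats = [llds.min(axis=0), llds.max(axis=0), llds.mean(axis=0),
             np.median(llds, axis=0), llds.std(axis=0)]
    return np.concatenate(stats)          # shape: (5 * 65,) = (325,)

# Dummy segment of 800 frames and 65 descriptors, only to show the shapes.
segment = np.random.rand(800, 65)
print(aggregate_llds(segment).shape)      # (325,)
        </preformat>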
        <p>[Table 2: ComParE acoustic LLDs extracted with OpenSmile, grouped by type. Prosodic: F0 (SHS and Viterbi smoothing), sum of auditory spectrum (loudness), sum of RASTA-style filtered auditory spectrum, RMS energy, zero-crossing rate. Spectral: RASTA-style auditory spectrum bands 1–26 (0–8 kHz), MFCC 1–14, spectral energy 250–650 Hz and 1–4 kHz, spectral roll-off points 0.25, 0.50, 0.75, 0.90, spectral flux, centroid, entropy, slope, psychoacoustic sharpness, harmonicity, spectral variance, skewness, kurtosis. Sound quality: voicing probability, log. HNR, jitter (local, delta), shimmer (local).]</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiments and Results</title>
      <p>
        To explore the possibility of predicting engaging
sentences, we implemented a machine learning
classifier using the linear SVM algorithm provided by
the scikit-learn library
        <xref ref-type="bibr" rid="ref16">(Pedregosa et al., 2011)</xref>
        .
      </p>
        <p>We defined various classification scenarios on
the basis of 3 different train-test splits of the
dataset. The first, and most common, scenario is
based on a k-fold setting, in which data have been
randomly split into 10 folds, with training on 9 of them
and testing on the remaining one. The second
scenario uses data from one POI across all the visits as
a test set, and is trained on the remaining parts. The
third scenario considers data from a whole visit as
the test set and is trained on the remaining two. Global
results are obtained by averaging the classification
performances of each run per scenario (e.g. the
average of all k-fold outputs tested on every fold).</p>
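        <p>The three splitting scenarios can be reproduced, for instance, with scikit-learn splitters; the sketch below uses dummy stand-ins for the data and illustrative group identifiers per sentence.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import KFold, LeaveOneGroupOut

# Dummy stand-ins with the dataset's dimensions: 1,114 sentences, 452 features,
# 6 POIs and 3 visits (identifiers per sentence are illustrative).
rng = np.random.default_rng(0)
X = rng.random((1114, 452))
y = rng.integers(0, 2, size=1114)
poi_ids = rng.integers(0, 6, size=1114)
visit_ids = rng.integers(0, 3, size=1114)

# Scenario 1: random 10-fold cross-validation.
kfold_splits = list(KFold(n_splits=10, shuffle=True, random_state=0).split(X))

# Scenarios 2 and 3: leave one POI out / leave one whole visit out.
logo = LeaveOneGroupOut()
poi_splits = list(logo.split(X, y, groups=poi_ids))
visit_splits = list(logo.split(X, y, groups=visit_ids))
        </preformat>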
        <p>For each scenario, the SVM classifier has been
trained and tested three times, once per single
modality (i.e. linguistic or acoustic features
exclusively) and once with joint representations (the
full set of both linguistic and acoustic features).
All the features have been normalised in the range
[0, 1] using the MinMaxScaler algorithm
implemented in scikit-learn.</p>
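        <p>A minimal, self-contained sketch of one such run (MinMaxScaler followed by a linear SVM, evaluated with 10-fold cross-validation on dummy data of the same dimensions) could look as follows.</p>
        <preformat>
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Dummy stand-ins with the dataset's dimensions (1,114 sentences, 452 features).
rng = np.random.default_rng(0)
X = rng.random((1114, 452))
y = rng.integers(0, 2, size=1114)

# Scale every feature to [0, 1], then train a linear SVM; average k-fold accuracy.
model = make_pipeline(MinMaxScaler(feature_range=(0, 1)), LinearSVC())
print(cross_val_score(model, X, y, cv=10, scoring="accuracy").mean())
        </preformat>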
        <p>Table 3 reports the aggregated results, in terms
of accuracy, from all the experiments. The
baseline considered is the assignment of the majority
class found in the training data. All the classifiers
in the three scenarios obtain better results than the
baseline, but the multimodal systems (the ones
exploiting both linguistic and acoustic sets of
features) are never able to do better than models based
on linguistic features only. Moreover, it is possible
to observe that multimodal systems achieve scores
similar to acoustic systems.</p>
        <p>The low performance, especially for multimodal
systems, may be ascribed to the fact that the
classifiers are fed with too many features (452 in total:
127 textual and 325 acoustic features) with respect
to the size of the dataset (1,114 items), and
thus they build representations with low variation
in terms of single feature weights. Moreover,
summing the two sets in the multimodal systems leads
to worse results than single-modality systems,
amplifying the problem.</p>
        <p>In order to verify this hypothesis, we reduced the
number of features by observing the weights
assigned to each feature by classifiers trained on
single modalities, and selecting only the top 20 from
each ranked set. Figures 1 and 2 show the reduced
set of features along with their weights for the
linguistic and acoustic feature sets, respectively.
Among the top-rated, on the linguistic side, we can
find features related to the syntactic tree of the
sentence and to verbal predicate structure; on the
acoustic side, principally spectral and prosodic features.</p>
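        <p>A sketch of this weight-based selection (keeping the 20 features with the largest absolute weight learned by a single-modality linear SVM; data shapes are illustrative) follows.</p>
        <preformat>
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

# Dummy single-modality matrix: e.g. 1,114 sentences x 127 linguistic features.
rng = np.random.default_rng(0)
X = MinMaxScaler().fit_transform(rng.random((1114, 127)))
y = rng.integers(0, 2, size=1114)

clf = LinearSVC().fit(X, y)
weights = np.abs(clf.coef_[0])        # one weight per feature
top20 = np.argsort(weights)[-20:]     # indices of the 20 largest absolute weights
X_reduced = X[:, top20]               # reduced feature set used for retraining
        </preformat>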
        <p>As shown in Table 4, by using these reduced
feature sets, all systems obtain better results with
respect to the experiments conducted exploiting the
whole sets of features. The most significant
improvements can be traced to models based on the acoustic
and multimodal feature sets, with an average
increase in accuracy of 10%. Differently from the
previous experiments, multimodal systems reach
the best overall results in two out of three
scenarios (k-fold and POI).</p>
        <p>Again, multimodal systems’ scores are close to
those obtained exploiting exclusively acoustic
features. For this reason, we compared the
predictions from single modalities with the multimodal
ones, and we found that multimodal systems’
predictions overlap more with those of acoustic systems
(0.86) than with those of linguistic systems (0.79). This
confirms that this behaviour is due to the fact that
acoustic features are those weighted most heavily by the
multimodal classifier.</p>
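        <p>The overlap scores above can be read as the fraction of sentences on which two systems output the same prediction; a minimal sketch follows.</p>
        <preformat>
import numpy as np

def prediction_overlap(preds_a, preds_b):
    """Fraction of items on which two classifiers output the same label."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    return float(np.mean(preds_a == preds_b))

# Toy example with hypothetical multimodal vs. acoustic predictions.
print(prediction_overlap([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 0.8
        </preformat>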
        <p>
          It is possible to observe the higher contribution
of acoustic features to the multimodal systems
in Figure 3: among the top 10 most important
features, only 2 are linguistic, and the trend is
dramatically off balance in favour of acoustic features.
        </p>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusions and Future Work</title>
      <p>
        In this paper we introduced a novel multimodal
dataset for the analysis and prediction of
engagement, composed of Italian sentences derived from
the speech of an expert guide leading visitors in
cultural sites, enriched with multimodal features,
and labelled on the basis of the perceivable
engagement of the audience. We performed several
experiments in different classification scenarios, in
order to explore the possibility of predicting
engagement on the basis of features extracted for both
the linguistic and acoustic modalities.
Combining modalities in classification leads to good
results, but only with a filtered set of features that avoids
overly noisy representations. An interesting
experiment would be to combine the outcomes of
two different systems (one exploiting exclusively
acoustic features, the other linguistic features)
rather than using a monolithic one fed with all the
features. This technique often leads to better
performance with respect to the decisions taken by a
single system
        <xref ref-type="bibr" rid="ref10 ref23">(Woźniak et al., 2014; Malmasi and
Dras, 2018)</xref>
        .
      </p>
        <p>Moreover, we are working on aligning features
derived from the visual modality, by encoding
information from the videos used to annotate
engagement. In this way, the dataset will contain
a more complete representation, and it would be
possible to correlate perceived engagement in the
audience with the full set of stimuli offered during
the guided tour.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to acknowledge the
contribution of Luca Poggianti and Mario Gomis, who
have annotated the engagement on the videos, and
Federico Boggia and Ludovica Binetti, who have
segmented the sentences of the dataset.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Atef</given-names>
            <surname>Ben-Youssef</surname>
          </string-name>
          , Chloé Clavel,
          <string-name>
            <given-names>and Slim</given-names>
            <surname>Essid</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Early detection of user engagement breakdown in spontaneous human-humanoid interaction</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          ,
          <volume>12</volume>
          (
          <issue>3</issue>
          ):
          <fpage>776</fpage>
          -
          <lpage>787</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dominique</given-names>
            <surname>Brunato</surname>
          </string-name>
          , Andrea Cimino, Felice Dell'Orletta,
          <string-name>
            <given-names>Giulia</given-names>
            <surname>Venturi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Profiling-ud: a tool for linguistic profiling of texts</article-title>
          .
          <source>In Proceedings of The 12th Language Resources and Evaluation Conference</source>
          , pages
          <fpage>7145</fpage>
          -
          <lpage>7151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Ginevra</given-names>
            <surname>Castellano</surname>
          </string-name>
          , André Pereira, Iolanda Leite, Ana Paiva, and
          <string-name>
            <surname>Peter W McOwan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Detecting user engagement with a robot companion using task and social interaction-based features</article-title>
          .
          <source>In Proceedings of the 2009 international conference on Multimodal interfaces</source>
          , pages
          <fpage>119</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Francesco</given-names>
            <surname>Cutugno</surname>
          </string-name>
          , Felice Dell'Orletta, Isabella Poggi, Renata Savy, and Antonio Sorgente.
          <year>2018</year>
          .
          <article-title>The CHROME Manifesto: Integrating Multimodal Data into Cultural Heritage Resources</article-title>
          . In Elena Cabrio, Alessandro Mazzei, and Fabio Tamburini, editors,
          <source>Proceedings of the Fifth Italian Conference on Computational Linguistics</source>
          , CLiC-it
          <year>2018</year>
          , Torino, Italy,
          <source>December 10-12</source>
          ,
          <year>2018</year>
          , volume
          <volume>2253</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Morena</given-names>
            <surname>Danieli</surname>
          </string-name>
          , Juan María Garrido, Massimo Moneglia, Andrea Panizza, Silvia Quazza, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Swerts</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Evaluation of Consensus on the Annotation of Prosodic Breaks in the Romance Corpus of Spontaneous Speech ”C-ORAL-ROM”</article-title>
          . In Emanuela Cresti and Massimo Moneglia, editors, CORAL-ROM:
          <article-title>integrated reference corpora for spoken romance languages</article-title>
          , pages
          <fpage>1513</fpage>
          -
          <lpage>1516</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Florian</given-names>
            <surname>Eyben</surname>
          </string-name>
          , Martin Wöllmer, and Björn Schuller.
          <year>2010</year>
          .
          <article-title>Opensmile: the munich versatile and fast open-source audio feature extractor</article-title>
          .
          <source>In Proceedings of the 18th ACM international conference on Multimedia</source>
          , pages
          <fpage>1459</fpage>
          -
          <lpage>1462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Jennifer A Fredricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Phyllis C</given-names>
            <surname>Blumenfeld</surname>
          </string-name>
          , and Alison H Paris.
          <year>2004</year>
          .
          <article-title>School engagement: Potential of the concept, state of the evidence</article-title>
          .
          <source>Review of educational research</source>
          ,
          <volume>74</volume>
          (
          <issue>1</issue>
          ):
          <fpage>59</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L McCowan</given-names>
            ,
            <surname>Dong Zhang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Samy</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Detecting group interest-level in meetings</article-title>
          .
          <source>In Proceedings.(ICASSP'05)</source>
          .
          <source>IEEE International Conference on Acoustics, Speech, and Signal Processing</source>
          ,
          <year>2005</year>
          ., volume
          <volume>1</volume>
          , pages
          <fpage>I-489</fpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Patricia</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , Ömer Sümer, Kathleen Stürmer, Wolfgang Wagner, Richard Göllner, Peter Gerjets, Enkelejda Kasneci, and
          <string-name>
            <given-names>Ulrich</given-names>
            <surname>Trautwein</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Attentive or Not? Toward a Machine Learning Approach to Assessing Students' Visible Engagement in Classroom Instruction</article-title>
          . Educational Psychology Review,
          <volume>35</volume>
          (
          <issue>1</issue>
          ):
          <fpage>463</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Shervin</given-names>
            <surname>Malmasi</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Dras</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Native language identification with classifier stacking and ensembles</article-title>
          .
          <source>Comput. Linguistics</source>
          ,
          <volume>44</volume>
          (
          <issue>3</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Melhart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Antonios</given-names>
            <surname>Liapis</surname>
          </string-name>
          , and
          <string-name>
            <surname>Georgios N Yannakakis</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>PAGAN: Video Affect Annotation Made Easy</article-title>
          .
          <source>In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII)</source>
          , pages
          <fpage>130</fpage>
          -
          <lpage>136</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Simonetta</given-names>
            <surname>Montemagni</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Tecnologie linguistico-computazionali e monitoraggio della lingua italiana</article-title>
          .
          <source>Studi Italiani di Linguistica Teorica e Applicata (SILTA)</source>
          , XLII(1):
          <fpage>145</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Catharine</given-names>
            <surname>Oertel</surname>
          </string-name>
          , Stefan Scherer, and Nick Campbell.
          <year>2011</year>
          .
          <article-title>On the use of multimodal cues for the prediction of degrees of involvement in spontaneous conversation</article-title>
          .
          <source>In Twelfth annual conference of the international speech communication association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Catharine</given-names>
            <surname>Oertel</surname>
          </string-name>
          , Ginevra Castellano, Mohamed Chetouani, Jauwairia Nasir, Mohammad Obaid, Catherine Pelachaud, and
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Peters</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Engagement in human-agent interaction: An overview</article-title>
          .
          <source>Frontiers in Robotics and AI</source>
          ,
          <volume>7</volume>
          :
          <fpage>92</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Antonio</given-names>
            <surname>Origlia</surname>
          </string-name>
          , Renata Savy, Isabella Poggi, Francesco Cutugno, Iolanda Alfano,
          <string-name>
            <surname>Francesca D'Errico</surname>
            ,
            <given-names>Laura</given-names>
          </string-name>
          <string-name>
            <surname>Vincze</surname>
            , and
            <given-names>Violetta</given-names>
          </string-name>
          <string-name>
            <surname>Cataldo</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>An audiovisual corpus of guided tours in cultural sites: Data collection protocols in the chrome project</article-title>
          .
          <source>In 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage</source>
          , volume
          <volume>2091</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss,
          <string-name>
            <surname>Vincent Dubourg</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Isabella</given-names>
            <surname>Poggi</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Mind, hands, face and body: a goal and belief view of multimodal communication</article-title>
          .
          <source>Weidler.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Andrea Amelio</given-names>
            <surname>Ravelli</surname>
          </string-name>
          , Antonio Origlia, and Felice Dell'Orletta.
          <year>2020</year>
          .
          <article-title>Exploring Attention in a Multimodal Corpus of Guided Tours</article-title>
          . In Johanna Monti, Felice Dell'Orletta, and Fabio Tamburini, editors,
          <source>Proceedings of the Seventh Italian Conference on Computational Linguistics</source>
          , CLiC-it
          <year>2020</year>
          , Bologna, Italy, March 1-
          <issue>3</issue>
          ,
          <year>2021</year>
          , volume
          <volume>2769</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Christina</given-names>
            <surname>Regenbogen</surname>
          </string-name>
          , Daniel A Schneider, Raquel E Gur, Frank Schneider,
          <string-name>
            <given-names>Ute</given-names>
            <surname>Habel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Thilo</given-names>
            <surname>Kellermann</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Multimodal human communication - Targeting facial expressions, speech content and prosody</article-title>
          .
          <source>NeuroImage</source>
          ,
          <volume>60</volume>
          (
          <issue>4</issue>
          ):
          <fpage>2346</fpage>
          -
          <lpage>2356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Jyotirmay</given-names>
            <surname>Sanghvi</surname>
          </string-name>
          , Ginevra Castellano, Iolanda Leite, André Pereira,
          <string-name>
            <surname>Peter W. McOwan</surname>
            ,
            <given-names>and Ana</given-names>
          </string-name>
          <string-name>
            <surname>Paiva</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Automatic analysis of affective postures and body motion to detect engagement with a game companion</article-title>
          .
          <source>In Proceedings of the 6th International Conference on Human-Robot Interaction, HRI '11, page 305-312</source>
          , New York, NY, USA. Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Candace L Sidner</surname>
          </string-name>
          ,
          <string-name>
            <surname>Christopher Lee</surname>
          </string-name>
          ,
          <string-name>
            <surname>Cory D Kidd</surname>
            ,
            <given-names>Neal</given-names>
          </string-name>
          <string-name>
            <surname>Lesh</surname>
          </string-name>
          , and Charles Rich.
          <year>2005</year>
          .
          <article-title>Explorations in engagement for humans and robots</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>166</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>140</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Milan</given-names>
            <surname>Straka</surname>
          </string-name>
          , Jan Hajič, and Jana Straková.
          <year>2016</year>
          .
          <article-title>UDPipe: Trainable pipeline for processing CoNLL-U files performing tokenization, morphological analysis, POS tagging and parsing</article-title>
          .
          <source>In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)</source>
          , pages
          <fpage>4290</fpage>
          -
          <lpage>4297</lpage>
          , Portorož, Slovenia.
          <source>European Language Resources Association (ELRA).</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Michał</given-names>
            <surname>Woźniak</surname>
          </string-name>
          , Manuel Graña, and
          <string-name>
            <given-names>Emilio</given-names>
            <surname>Corchado</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A survey of multiple classifier systems as hybrid systems</article-title>
          .
          <source>Information Fusion</source>
          ,
          <volume>16</volume>
          :
          <fpage>3</fpage>
          -
          <lpage>17</lpage>
          . Special Issue on Information Fusion in
          <source>Hybrid Intelligent Fusion Systems.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>