<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Sixth Workshop on Natural Language for Artificial Intelligence, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>What makes the audience engaged? Engagement prediction exploiting multimodal features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Borghesi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Amelio Ravelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Felice Dell'Orletta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Istituto di Linguistica Computazionale “A. Zampolli” (ILC-CNR)</institution>
          ,
          <addr-line>Via Giuseppe Moruzzi 1, 56124, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Università di Pisa</institution>
          ,
          <addr-line>Lungarno Pacinotti 43, 56126, Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>30</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>This paper reports a series of experiments and analyses aimed at understanding which of the numerous linguistic and acoustic aspects of spoken language are distinctive in detecting engagement potential in speech. Starting from a dataset of sentences uttered during guided sightseeing tours and characterised by a set of multimodal features, various classification algorithms were tested and optimised in different scenarios and configurations. Thanks to the implementation of a recursive feature elimination algorithm, it was possible to identify which characteristics of language play a key role in the presence of engagement potential, and can thus differentiate an engaging sentence or speech from a non-engaging one. The analyses of the selected features showed that, among the strictly linguistic aspects, only basic features (i.e. sentence or word length) proved relevant in the classification process. In contrast, acoustic aspects proved to play a considerably important role, in particular those related to sound spectrum and prosody. Overall, feature selection led to appreciable increases in the performance of all implemented classification models.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal dataset</kwd>
        <kwd>feature selection</kwd>
        <kwd>engagement prediction</kwd>
        <kwd>audience engagement</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and motivation</title>
      <p>
        In recent years we have witnessed major advances in Artificial Intelligence and Natural
Language Processing, to the point that we now have models capable of writing complete (and,
above all, plausible-sounding) pieces of text from a simple prompt [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. The ability to generate
content is impressive, but the purpose of a text often goes beyond the pure information it
conveys. Moreover, the effectiveness of information transfer often depends on the willingness
of the receiver to accept it. This is particularly evident if we move our focus from the written
page to more interactive communication media and channels, such as face-to-face interactions.
In fact, the average (human) speaker is generally very good at estimating the interlocutor’s
level of involvement from visually accessible signals (e.g. body postures and movements, facial
expressions, eye gaze), and at refining his/her communication strategy in order to keep the
communication channel open and the audience’s attention high. Such visible cues are
mostly signals of attention, which is considered a perceivable proxy for the broader and more
complex inner processes of engagement [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Moreover, recent studies have shown that the
processing of emotionality in the human brain is performed on a modality-specific basis [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]:
prosody, facial expressions and speech content (i.e. the semantic information) are processed in
the listener’s brain with the selective activation of the auditory cortex, the fusiform gyri and
the middle temporal gyri, respectively.
      </p>
      <p>
        Understanding non-verbal feedback is not easy for virtual agents and robots, but
this ability is strategic for enabling more natural interfaces capable of adapting to users. Indeed,
perceiving signals of loss of attention (and thus, of engagement) is of paramount importance for
designing naturally behaving virtual agents that can adjust their communication strategy to keep
the interest of their addressees high. That information is also a general sign of the quality of the
interaction and, more broadly, of the communication experience. At the same time, the ability
to generate engaging behaviours in an agent can be beneficial in terms of social awareness [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>The objective of the present work is to understand the phenomena correlated with the increase
or decrease of perceivable engagement in the audience of a speech, specifically in the domain
of guided tours. We are interested in highlighting which features, from which specific modality,
play a key role in driving the attention of the listener(s), in order to exploit a reduced set of
features as dense but highly informative representations.</p>
      <sec id="sec-1-1">
        <title>1.1. Related Work</title>
        <p>
          With the word engagement we refer to the level of involvement reached during a social
interaction, which takes the shape of a process spanning the whole communication exchange. More
specifically, [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] defines the process of social engagement as the value that a participant in an
interaction attributes to the goal of being together with the other participant(s) and continuing
the interaction. Another definition, adopted by many studies in Human-Robot Interaction
(HRI),<sup>1</sup> describes engagement as the process by which interactors start, maintain, and end their
perceived connections to each other during an interaction [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Most studies are
conducted on a dyadic basis (i.e. one-to-one), in contexts where one of the participants is
often an agent/robot [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
          ]. Nevertheless, engagement can be measured in groups of people
as the average of the degree to which individuals are involved [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>
        Data for the experiments described herein derive from a subset of the data collected for the
CHROME Project<sup>2</sup> [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. The domain of the data is Cultural Heritage; more specifically,
the project focused on guided tours in 3 Charterhouses in Campania (Italy), where
an expert historian led groups of 4 persons. Tours are organised in 6 Points of Interest (POI),
i.e. rooms or areas inside the Charterhouses where the visit stops and the guide describes
the place with its furnishings, history and anecdotes. The communication event type is
quasi-unidirectional (one-to-many), i.e. one of the participants is the holder of the knowledge (the
expert guide) and talks to the others (the audience), with a few moments of dialogue (e.g. when
the guide asks the audience something).
      </p>
      <p><sup>1</sup>For a broad and complete overview of works on engagement in HRI studies, see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
<sup>2</sup>Cultural Heritage Resources Orienting Multimodal Experience. http://www.chrome.unina.it/</p>
      <p>The original data collection campaign led to a multimodal corpus with aligned transcriptions,
audio and video. From this, we selected a subset composed of 3 visits (i.e. 3 different groups of
4 persons) led by the same expert guide inside one of the three Charterhouses (San Martino
Charterhouse, Naples). Given the exploratory objective of the present study, we made this
selection in order to level out differences such as voice features and discourse style, which are
speaker-specific.</p>
      <p>The final set of data on which we ran our experiments is composed of 1,114 sentences,
enriched with the annotation of the perceivable engagement of the audience, and characterised
by a total of 452 features extracted from multiple modalities (127 linguistic, 325 acoustic)
used to model the speech of the guide. The process through which we obtained our dataset is
described in the following sections.</p>
      <sec id="sec-2-1">
        <title>2.1. Human engagement annotation</title>
        <p>
          We considered the attention of the audience as a perceivable proxy to model and highlight
participants’ engagement, in line with the assumption that engagement is a complex process,
a multidimensional meta-construct [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] composed of behavioural, emotional and cognitive
aspects. Behavioural aspects (and some externalisation of emotional states) can be tracked by
observing the subject, while for the others it is necessary to exploit specific equipment to record
biomarkers such as heart rate or neural activity. Nevertheless, all aspects of engagement are
highly interrelated and do not occur in isolation, thus attention plays a crucial role in defining
whether the audience is engaged or not [18].<sup>3</sup>
        </p>
        <p>To annotate audience engagement, we exploited the visual part of the original CHROME
dataset, consisting of 2 parallel video recordings for each visit: one focused on the speaker,
the other on the audience. We asked 2 annotators to watch the audience and the guide videos
at the same time, with the guide video in a small window superimposed on the audience one,
and to annotate the level of attention and its variation among the attendees. We recorded this
information by means of the PAGAN annotation tool [19], which enables the annotator to easily
track the observed phenomenon by pressing one of two keys on the keyboard: arrow-up if
a rise is perceived, arrow-down otherwise. Our annotators reached a high agreement on this
task, with an average Spearman’s rho of 0.87.<sup>4</sup> The resulting annotation is a continuous series
of values, indicating the rise or fall of engagement along the whole visit, for all the visits in our
dataset.</p>
        <p><sup>3</sup>We will continue to use the term engagement to refer to the perceivable attention of the audience.
<sup>4</sup>For a more detailed description of the annotation process, see [20].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Sentence segmentation</title>
        <p>We acquired the textual data in the form of ELAN annotation files [21], containing the
orthographic transcription of single words tagged with their start and end times (in milliseconds),
aligned on the timeline of the whole speech. In order to obtain more exploitable units of text,
we split the flow of speech into sentence-like segments, by concatenating all the
words that can represent a finite unit of language.<sup>5</sup> We asked two annotators to segment our
texts, relying on a purely perceptual principle: mark the end of a sentence whenever conceptual
completeness is perceived. We relied on the capability of mother-tongue speakers of Italian
to mentally segment the flow of speech, as we normally do during everyday
conversations. In other words, we asked the annotators to identify terminal breaks and mark
them with a full stop. Given that punctuation is a convention of the written medium, the
annotators were asked to minimise its use but, besides the full stop, we allowed the use
of commas to signal short pauses or listings, and question marks when a questioning intonation
was identified.</p>
        <p>A limitation of this methodology is that the speech rate often makes fine segmentation
difficult, especially taking into account the need to propagate the segmentation
from the text to the audio files. In fact, we projected the start-end spans of sentences onto
the audio files in order to obtain the audio objects from which we extracted acoustic features.
We kept multiple sentences together in the same text/audio object if they were uttered at a high rate and were
difficult to separate cleanly at the audio level, in order to avoid noise that would have altered
the computation of acoustic features.</p>
        <p>We measured the accuracy of the segmentation on a portion of the data (about 40% of the
total) by adapting an IOB (Inside-Outside-Begin) tagging framework. We labelled all the tokens,
according to each annotator, on the basis of their position at the beginning (B), the inside (I), the
end (E) or the outside (O) of a constructed sentence. Comparing the two resulting series of
labelled tokens, we registered an agreement of 91.53% in terms of accuracy;
the obtained segments can thus be considered reliable and consistent.</p>
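        <p>The token-level agreement described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names <monospace>tag_sentences</monospace> and <monospace>token_agreement</monospace> are hypothetical, the toy segmentations are invented, one-token sentences are assumed to receive the tag B, and tokens outside any sentence (O) are not modelled.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def tag_sentences(sentence_lengths):
    """Turn sentence lengths (in tokens) into one positional tag per token.

    Assumption: a one-token sentence is tagged "B"; the O tag is not modelled.
    """
    tags = []
    for length in sentence_lengths:
        if length == 1:
            tags.append("B")
        else:
            tags.extend(["B"] + ["I"] * (length - 2) + ["E"])
    return tags


def token_agreement(labels_a, labels_b):
    """Share of tokens to which both annotators assigned the same tag."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same token sequence")
    if not labels_a:
        raise ValueError("cannot compute agreement on an empty sequence")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Invented example: the same 10 tokens, segmented differently by two annotators.
annotator_1 = tag_sentences([4, 3, 3])  # three sentences
annotator_2 = tag_sentences([4, 6])     # the last two sentences merged into one
agreement = token_agreement(annotator_1, annotator_2)
```
]]></preformat>
        <p>In this toy case the two annotators disagree only on the two tokens around the boundary that the second annotator did not mark, so the agreement is 8 tokens out of 10.</p>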
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Engagement projection on sentences</title>
        <p>As anticipated in Section 2.1, the engagement annotation consists of a continuous series of values
along the timeline of each video/visit: we have a numerical value indicating the level of
engagement for each instant in which the latter changed. Our aim was to use these values
to extract the level of engagement for each individual sentence; for this, we aggregated all
values within the span of the segmented sentences, in order to adapt the continuous annotation
of the engagement to discrete units (i.e. the sentences), by translating those values into finite
classes: engaging (associated with class 1) vs. non-engaging (associated with class 0). In this
regard, two different aggregation methods were designed and implemented: by subtraction and
by summation.</p>
        <p><sup>5</sup>Speech segmentation is not a trivial task, and researchers have debated (and are still debating) the
problem. A recent special issue on the topic is collected in [22].</p>
        <sec id="sec-2-3-1">
          <title>2.3.1. Aggregation by subtraction</title>
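          <p>With the subtraction method, we considered the delta between the first and the last value
of engagement annotated in the time span of a sentence. Considering the time interval of a
sentence <inline-formula><tex-math>s</tex-math></inline-formula>, in which <inline-formula><tex-math>n</tex-math></inline-formula> values of engagement were annotated (one for each variation), we obtained
the engagement level <inline-formula><tex-math>e_s</tex-math></inline-formula> of the entire sentence by subtracting the first engagement value (<inline-formula><tex-math>v_0</tex-math></inline-formula>)
from the last one (<inline-formula><tex-math>v_n</tex-math></inline-formula>), as illustrated by equation 1:
<disp-formula><label>(1)</label><tex-math><![CDATA[e_s = v_n - v_0]]></tex-math></disp-formula></p>
        </sec>
        <sec id="sec-2-3-2">
          <title>2.3.2. Aggregation by summation</title>
          <p>With the summation method, all the values and variations in the series of engagement
values within the time span of a sentence are taken into account. Considering a series of <inline-formula><tex-math>n</tex-math></inline-formula>
values of engagement (one for each variation) annotated within the time interval of a sentence
<inline-formula><tex-math>s</tex-math></inline-formula>, a cumulative sum was calculated, to which 1 was added whenever an increase in
the level of engagement (<inline-formula><tex-math>v_i &gt; v_{i-1}</tex-math></inline-formula>) occurred, and −1 was added whenever
a decrease in the level of engagement (<inline-formula><tex-math>v_i &lt; v_{i-1}</tex-math></inline-formula>) occurred. The final result of the
sum gives the level of engagement <inline-formula><tex-math>e_s</tex-math></inline-formula> of the entire sentence, as illustrated by
equation 2, based on the system of equations 3:
<disp-formula><label>(2)</label><tex-math><![CDATA[e_s = \sum_{i=1}^{n} d_i]]></tex-math></disp-formula>
<disp-formula><label>(3)</label><tex-math><![CDATA[d_i = \begin{cases} 1, & \text{if } v_i > v_{i-1} \\ -1, & \text{if } v_i < v_{i-1} \end{cases}]]></tex-math></disp-formula></p>
        </sec>
        <sec id="sec-2-3-3">
          <title>2.3.3. Engagement thresholds</title>
          <p>After computing the engagement level for each sentence, we further converted these values to
Boolean classes (<inline-formula><tex-math>c_s</tex-math></inline-formula>): 1 if the sentence is engaging, 0 if it is non-engaging. We considered 3 thresholds as
different degrees of inclusiveness:
• −1, to generate a more generous classification;
• 0, to generate a more balanced classification;
• +1, to generate a more sceptical classification.</p>
          <p>Every sentence with an engagement level <inline-formula><tex-math>e_s</tex-math></inline-formula> above the threshold <inline-formula><tex-math>t</tex-math></inline-formula> was considered engaging,
while the others were considered non-engaging, as illustrated by the system of equations 4:
<disp-formula><label>(4)</label><tex-math><![CDATA[c_s = \begin{cases} 1, & \text{if } e_s > t \\ 0, & \text{if } e_s \leq t \end{cases}]]></tex-math></disp-formula></p>
          <p>In conclusion, we obtain six different sentence classification series: three series (one for each
engagement threshold) for each of the two aggregation methodologies. The selection of the most
suitable series is discussed in Section 3.4.</p>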
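          <p>The two aggregation methods and the thresholding step described in this section can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names and the toy engagement series are assumptions, and each sentence is assumed to contain at least one annotated value.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def aggregate_by_subtraction(values):
    """Equation 1: last annotated engagement value minus the first one."""
    return values[-1] - values[0]


def aggregate_by_summation(values):
    """Equations 2 and 3: add 1 for every rise and -1 for every fall
    between consecutive annotated values."""
    total = 0
    for previous, current in zip(values, values[1:]):
        if current > previous:
            total += 1
        elif current < previous:
            total -= 1
    return total


def to_class(level, threshold=0):
    """Equation 4: 1 (engaging) if the level is above the threshold, else 0."""
    return 1 if level > threshold else 0


# Invented engagement values annotated within the span of one sentence.
values = [3, 4, 5, 4, 6]
by_subtraction = aggregate_by_subtraction(values)  # 6 - 3
by_summation = aggregate_by_summation(values)      # +1 +1 -1 +1

# One class per threshold, from the most generous to the most sceptical.
classes = {t: to_class(by_summation, t) for t in (-1, 0, 1)}
```
]]></preformat>
          <p>A sentence whose aggregated level is exactly 0 shows why the threshold matters: it is labelled engaging with threshold −1 and non-engaging with thresholds 0 and +1.</p>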
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Features</title>
        <p>In this section we describe the methodology and the tools used to extract features for both
the textual and the acoustic modality. We relied on explicit feature extraction systems in order to
explore which specific features, and to what extent, convey most of the information that
creates a state of engagement in the audience.<sup>6</sup></p>
        <sec id="sec-2-4-1">
          <title>2.4.1. Linguistic Features</title>
          <p>The textual modality has been encoded using Profiling-UD [26], a publicly available
web-based application<sup>7</sup> inspired by the methodology initially presented in [27], which performs
linguistic profiling of a text, or a large collection of texts, for multiple languages. The system,
based on an intermediate step of linguistic annotation with UDPipe [28], extracts a total of 129
features for each analysed document. In this case, the Profiling-UD analysis has been performed
per sentence, thus the output has been considered as the linguistic feature set of each segment
of the dataset. Table 1 reports the 127 features extracted with Profiling-UD and used as textual
modality features for the classifier.<sup>8</sup></p>
        </sec>
        <sec id="sec-2-4-2">
          <title>2.4.2. Acoustic Features</title>
          <p>The acoustic modality has been encoded using OpenSmile<sup>9</sup> [29], a complete and open-source
toolkit for analysis, processing and classification of audio data, especially targeted at speech
and music applications such as automatic speech recognition, speaker identification, emotion
recognition, or beat tracking and chord detection. The acoustic feature set used in this case
is the Computational Paralinguistics ChallengE<sup>10</sup> (ComParE) set, which comprises 65 Low-Level
Descriptors (LLDs), computed per frame.</p>
          <p><sup>6</sup>The current state of the art in both linguistic and acoustic feature extraction makes use of recent Deep Learning
methods and techniques [23, 24, 25], but those systems extract features that are by nature not explainable.
<sup>7</sup>Profiling-UD can be accessed at the following link: http://linguistic-profiling.italianlp.it
<sup>8</sup>Out of the 129 Profiling-UD features, n_sentences and tokens_per_sent (raw text properties) have not been
considered, given that the analysis has been performed per sentence.</p>
          <p><sup>9</sup>https://www.audeering.com/research/opensmile/
<sup>10</sup>http://www.compare.openaudio.eu</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setting</title>
      <p>In order to explore multiple methodologies and techniques for the task of engagement
potential prediction, we set our experiments in different classification scenarios, exploiting two
Machine Learning models, applying alternative feature normalisation and engagement class
assignment methods, and selecting the most representative features to predict
the engagement potential of a sentence.</p>
      <sec id="sec-3-1">
        <title>3.1. Classification scenarios and baseline</title>
        <p>Working with a small dataset, as in this case (1,114 items in total), may lead to an overestimation of
classification performance, making the predictions unreliable, especially if relying on a simple
train-validation split of the dataset [30, 31, 32]. To avoid this, we opted for a Cross-Validation
approach [33, 34, 35], organising our experimentation into 3 classification scenarios:
• By stratified Random Partitioning (RaP): the dataset is divided into 10 equally sized parts,
composed of randomly extracted elements. The stratified approach makes it possible to
maintain the same proportion between classes of the dataset in each individual
subdivision; this is achieved with a Stratified Cross-Validation technique;<sup>11</sup>
• By Visits (Vis): the dataset is divided on the basis of tourist visits, thus obtaining three
partitions, related to the three visits considered;
• By Points Of Interest (POI): the dataset is partitioned on the basis of Points of Interest,
thus obtaining six partitions, based on the POIs taken into consideration.</p>
        <p>It is important to specify that, in each classification scenario, an unseen part of the dataset (a
test-set) was kept aside until the conclusion of the study, in order to ultimately test the
performance of the fully optimised system on unknown data. For the RaP scenario we excluded
from the Cross-Validation a portion of 20% of the data, which is also stratified. In the case of
the Vis scenario, the test-set is represented by the data related to the first visit (V01), while for
the POI scenario, the test-set is represented by the data related to the first point of interest
(P01).</p>
        <p>For each classifier, and in each scenario, we trained 3 different models, namely Multimodal,
Linguistic and Acoustic, on the basis of the type (or the combination of types) of features used
for training. We also decided to calculate and use a baseline for each validation-set and each
test-set: each sentence in the set was assigned the Most Frequent Class within the respective
training-set. The individual baselines can be found in the appendices, where we report tables
with details of every experiment we ran in this work, with the figures of each baseline.</p>
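        <p>The group-based scenarios (Vis and POI) and the Most Frequent Class baseline can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the visit identifiers and the labels are invented for illustration, and only the partitioning by group is shown (the stratified random partitioning of the RaP scenario is not reproduced here).</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, validation_indices),
    holding out one group (a visit or a POI) at a time."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        validation = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, validation


def most_frequent_class_accuracy(train_labels, validation_labels):
    """Accuracy obtained by always predicting the majority class
    observed in the training labels."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == majority for label in validation_labels)
    return hits / len(validation_labels)


# Invented toy data: one visit identifier and one engagement class per sentence.
visits = ["V02", "V02", "V02", "V03", "V03", "V03", "V03"]
labels = [1, 1, 0, 0, 0, 1, 0]

baselines = {}
for held_out, train, validation in leave_one_group_out(visits):
    baselines[held_out] = most_frequent_class_accuracy(
        [labels[i] for i in train],
        [labels[i] for i in validation],
    )
```
]]></preformat>
        <p>Because the majority class is computed on the training partition only, the baseline can differ considerably from one held-out group to another when the class balance varies across visits or POIs.</p>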
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Classifiers</title>
        <p>One of the primary objectives of the study is to obtain a model capable of classifying a sentence
as either engaging or not engaging. To achieve this goal, as anticipated, we selected two
Machine Learning models: Linear Support Vector Classifier [36, 37] (Linear-SVC) and Random
Forest Classifier [38, 39] (Random-Forest). Choosing two radically different classifiers, rather
than using a single one, allows us to perform an accurate comparison between two different
classification processes, in terms of behaviour and performance. Most importantly, we relied on
fully explainable classification models, where it is possible to work with explicit features, thus
focusing on the phenomena behind a decision.</p>
        <p>More precisely, at the feature selection stage, it will be possible to highlight which feature
categories were deemed important by both classifiers, i.e. could be considered relevant for
detecting an engagement potential in language. Indeed, both classifiers are able to sort the
training features on the basis of their influence in the classification process, assigning them a
rank [40, 41, 42, 39] that can be used for performing feature selection and subsequent in-depth
analysis.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Hyperparameters tuning</title>
          <p>A very important aspect in setting up the classifiers is the optimisation of the hyperparameters:
Machine Learning models, in fact, have several hyperparameters that can be modified to
improve classification performance, allowing for more accurate results [43, 44, 45, 46].</p>
          <p><sup>11</sup>https://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold</p>
          <p>A complete engineering of the chosen models would have been outside the objectives of the
study, thus we chose to optimise exclusively the most relevant hyperparameter in each of the
two chosen classifiers:
• For Linear-SVC, the regularisation parameter (commonly referred to as parameter C) was
optimised by testing a range of values (0.001, 0.01, 0.10, and 1.00) [47];
• For Random-Forest, the number of decision trees (Decision-Trees) that make up the
"forest" was optimised. In this case, forests of 10, 100 and 1000 trees were
tested.</p>
          <p>The hyperparameter tuning results showed that Linear-SVC achieved the best performance
with a regularisation parameter of 0.001, while Random-Forest scored best with
1000 Decision-Trees. Detailed results of the hyperparameter tuning, in a
cross-comparison with the aggregation methods explained in Section 2.3, can be found in Appendix A
(Tables 3 and 4).</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Feature normalisation</title>
        <p>Standardising and normalising data (e.g., scaling within a common numerical range) can benefit
the training and performance of Machine Learning models [48]. In this regard, we tested several
normalisation methods, which can be divided into two main groups:
• Linear normalisation methods: Standard-Scaler (StaS), Max-Abs-Scaler (MAS),
Min-Max-Scaler (MiMaS) with two different numerical ranges (0 to 1, and -1 to 1), and Robust-Scaler
(RoS);
• Nonlinear normalisation methods: Power-Transformer (PoT) and Quantile-Transformer
(QuT).</p>
        <p>In our experimentation, no appreciable differences in accuracy emerged between
the normalisation methods. However, the Quantile-Transformer (QuT) provided slightly
better overall results, thus it has been selected as the default for the subsequent experiments. All the
results of the comparison between data normalisation methods, for both classifiers,
can be found in Appendix A (Table 5).</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Engagement class assignment</title>
        <p>As anticipated, we considered 2 alternative aggregation methodologies (i.e. summation and subtraction)
with 3 thresholds to determine whether a sentence could be classified as engaging or not. In
our experimentation, the summation method led to the best results;
therefore, we applied it in our configuration. As for the engagement thresholds,
a further test was performed: by comparing the three devised thresholds (-1, 0 and 1), it was
found that threshold 0 (considered the most neutral) allowed for the best accuracy. Accuracy
results of the comparison between engagement thresholds, for each classifier, can be
found in Appendix A (Table 6).</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Feature selection algorithm</title>
        <p>The performance of a Machine Learning model can be improved by reducing the number of
features it is trained with, based on their influence in the classification process [49]. For this
reason, we implemented a recursive feature elimination algorithm to identify which features
are most relevant for the prediction of engagement potential in a sentence, and consequently
to improve the performance of the models. The process of the feature selection algorithm is
structured in four steps:
1. Using the total set of features, the value of Accuracy in Cross-Validation is calculated;
2. The Accuracy value is compared with the best result obtained so far (0, if we are at the
first iteration):
• If the value obtained is greater, a ranking of features is made (based on the degrees
of importance provided by the classification model), which will be considered the
new optimal feature combination;
• If the value obtained turns out to be lower, the previous optimal feature combination
(obtained from the model that provided the higher Accuracy result) is retained;
3. Steps 1 and 2 are repeated, each time eliminating a predefined number of features,
starting with the least important according to the ranking, until the minimum
threshold of about 10% of the total feature set is reached;
4. The algorithm returns the selection of the most important features with which the best
Accuracy result was obtained.</p>
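        <p>The four steps above can be sketched in Python as a generic loop. This is a hedged illustration, not the authors' implementation: the function name, the <monospace>evaluate</monospace> and <monospace>rank</monospace> callables and the toy feature weights are hypothetical stand-ins for the cross-validated Accuracy and for the importance ranking provided by the classifier. The sketch also assumes that features are re-ranked at every iteration and that a tie with the best score keeps the previous selection.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def recursive_feature_elimination(features, evaluate, rank, step, min_features):
    """Return (best_subset, best_score).

    evaluate(subset) -> score of a model trained on `subset`
    rank(subset)     -> `subset` sorted from most to least important
    step             -> number of features dropped at each iteration
    min_features     -> smallest subset size the loop may evaluate
    """
    current = list(features)
    best_score, best_subset = 0, list(current)
    while True:
        score = evaluate(current)             # step 1
        ranking = rank(current)
        if score > best_score:                # step 2
            best_score, best_subset = score, list(ranking)
        if len(current) - step < min_features:
            break                             # minimum threshold reached
        current = ranking[:len(current) - step]  # step 3: drop least important
    return best_subset, best_score            # step 4


# Toy stand-ins: positive weights mimic informative features, negative ones noise.
weights = {"f1": 20, "f2": 10, "f3": 5, "n1": -5, "n2": -10, "n3": -2}


def toy_accuracy(subset):
    """Pretend accuracy in percentage points."""
    return 50 + sum(weights[name] for name in subset)


def toy_ranking(subset):
    """Most important feature first."""
    return sorted(subset, key=lambda name: weights[name], reverse=True)


selected, accuracy = recursive_feature_elimination(
    list(weights), toy_accuracy, toy_ranking, step=1, min_features=2
)
```
]]></preformat>
        <p>In the toy run the score rises as the three noisy features are removed and drops once an informative feature is discarded, so the loop returns the three informative features even though it keeps eliminating down to the minimum size.</p>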
        <p>Given the long calculation times required for training Random-Forest (with 1000 estimators), it
was decided to set the number of features to be eliminated at each iteration as follows:
• 10, for experiments performed with all features (Multimodal);
• 3, for experiments performed with linguistic features only (Linguistic);
• 7, for experiments performed with acoustic features only (Acoustic).</p>
        <p>The choice was made taking into account the approximate proportion of linguistic features,
and acoustic features, to the number of total features. In this way, it is possible to significantly
reduce the number of iterations, and therefore speed up the experiment, while maintaining
excellent Accuracy. With Linear-SVC, given its speed of execution, the number of features to be
deleted at each iteration was set to 1 for all tests.</p>
        <p>Another aspect that needs to be illustrated is the minimum number of features that the
algorithm is forced to select: about 10% of the total set, for each type of feature:
• 45, for experiments performed with all features (Multimodal);
• 13, for experiments performed with linguistic features only (Linguistic);
• 32, for experiments performed with acoustic features only (Acoustic).</p>
        <p>In this case, the proportion was maintained both for the experiments performed with Linear-SVC
and for those performed with Random-Forest. Maintaining a minimum percentage of features
of about 10% allows us to have a sufficient number of features in the analysis phase, to draw
more precise and in-depth conclusions on the behaviour and choices of the models.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental results</title>
      <p>This section discusses the results of the most relevant experiments in the study, with a focus on
the effects of feature selection. All figures reporting results show absolute accuracy scores,
with bar length indicating the increment over the baseline. The percentages for each
classification scenario are the average of those obtained for the three models
(Multimodal, Linguistic and Acoustic); conversely, the percentages for each model are the average
of those obtained for the three scenarios (RaP, Vis, POI). All reported results are averages
over the Linear-SVC and Random-Forest classifiers. As stated in the introduction, the focus of this
paper is not state-of-the-art performance, but the analysis of the most salient features in
both modalities. We do not treat Linear-SVC and Random-Forest as equivalent; rather, we look at
the average performance obtained with a specific subset of features. Detailed tables for all the
experiments, with the results of each individual classifier, can be found in Appendix B.</p>
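      <p>The averaging scheme described above can be made concrete with a short sketch. The accuracy values are synthetic placeholders generated by a formula, not results from the paper, and the names <monospace>accuracy</monospace>, <monospace>scenario_mean</monospace>, <monospace>model_mean</monospace> and <monospace>increment_over_baseline</monospace> are assumptions introduced for illustration only.</p>
      <preformat>
```python
from statistics import fmean

SCENARIOS = ("RaP", "Vis", "POI")
MODELS = ("Multimodal", "Linguistic", "Acoustic")
CLASSIFIERS = ("Linear-SVC", "Random-Forest")

# Synthetic placeholder accuracies in percent (NOT results from the paper),
# keyed by (classifier, scenario, model).
accuracy = {
    (clf, scen, model): 60.0 + 2.0 * i + 1.0 * j + 4.0 * k
    for k, clf in enumerate(CLASSIFIERS)
    for i, scen in enumerate(SCENARIOS)
    for j, model in enumerate(MODELS)
}


def scenario_mean(scenario: str) -> float:
    """Average over the three models and both classifiers for one scenario."""
    return fmean([accuracy[(c, scenario, m)] for c in CLASSIFIERS for m in MODELS])


def model_mean(model: str) -> float:
    """Average over the three scenarios and both classifiers for one model."""
    return fmean([accuracy[(c, s, model)] for c in CLASSIFIERS for s in SCENARIOS])


def increment_over_baseline(score: float, baseline: float) -> float:
    """Difference in percentage points between a score and its baseline."""
    return score - baseline


print(scenario_mean("RaP"), model_mean("Acoustic"))
```
      </preformat>
      <p>Because every scenario, model and classifier contributes the same number of values, averaging first within each classifier and then between the two classifiers gives the same figure as the single average computed here.</p>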
      <sec id="sec-4-1">
        <title>4.1. Feature selection effectiveness</title>
        <p>The feature selection algorithm was run on both classifiers in use, testing all classification
scenarios and all models. As illustrated in Figure 1, feature selection increases mean accuracy
in all cases, while reducing the feature space to between 10% and 32% of the original feature
set (range averaged between Linear-SVC and Random-Forest).</p>
        <p>The increases in accuracy indicate that some features are particularly relevant for
predicting an engagement potential and, conversely, that many of the original features are
redundant or noisy, across both scenarios and models.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results on test-set</title>
        <p>We ran the final step of experimentation on an unseen portion of the dataset (i.e. the test-set)
for each classification scenario. As illustrated in Figure 2, the Accuracy increments over baseline
achieved on the test-set are extremely similar to those achieved on the validation-set,
even though the data were previously unseen. The single exception is the POI classification
scenario, where the gap between the percentages can be attributed to the large difference
between the baselines of the validation-set and the test-set (detailed accuracy results
on the test-set can be found in Appendix C, for each classification model).</p>
        <p>The good results achieved on the test-set indicate that classifiers trained with a restricted set
of features are able to effectively detect an engagement potential in unseen data. This confirms
that the most important features, as chosen by the feature selection algorithm, constitute a
core set of aspects for detecting the engagement potential of a sentence.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Feature analysis</title>
      <p>To understand which linguistic and acoustic features are the most relevant for detecting an
engagement potential, it is necessary to analyse the subset of features with which the classifiers
performed best. Specifically, it is possible to determine what percentage of each feature category (e.g.
linguistic:morpho-syntactic, acoustic:spectral) was included by the feature selection algorithm
among the most relevant. Only those categories of features selected by both classifiers were
considered, i.e. only those that proved to be highly relevant to the classification process
regardless of the classifier used.</p>
      <p>Considering the most important 10% of all features (on the basis of the algorithm ranking),
we can observe that acoustic features seem to be the most important for the classification of
engaging and non-engaging sentences: the average percentage of acoustic features (9.34%)
included in the total set of most important features selected is about 1.57 times higher than the
average percentage of linguistic features (5.94%). We can conclude that acoustic features play a
considerably more important role than linguistic features.</p>
      <p>A closer look at the selected feature categories shows what percentage of each was
included among the most important ones by the classifiers. As shown in Figure 3, Raw Text
Properties (i.e. sentence and word length) are the most relevant group of features. The other linguistic
features among those selected concern syntactic relations and the order of elements,
but these are selected for only 7.89% of the total. The rest of the selected
features all come from the acoustic modality and relate specifically to the sound spectrum.
In this regard, the timbre of the speech, its amplitude and its richness
in the frequency range appear to be decisive factors in maintaining attention. However, it should
also be noted that the first group of acoustic features (RMS energy and zero-crossing rate)
is prosodic in nature; the rhythmic features of the voice, which highlight
traits such as irony and sarcasm, therefore still play a strong role. Voice quality aspects, on the other hand,
do not seem to be particularly involved in the classification process. Detailed percentages of
included features per modality can be found in Appendix D.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>The implemented Machine Learning models were able to detect an engagement potential in
language in multiple scenarios and on unseen data. It emerged that certain phenomena and
features of language, mainly acoustic in nature (such as prosodic or spectral ones), play a key role in the
classification process, and thus in assessing the engagement potential of an uttered sentence.</p>
      <p>Ultimately, all the results were achieved by fully exploiting
the potential of a restricted set of features (between 10% and 32% of the total sets). This study,
therefore, also aims to show to what extent optimised Machine Learning models, combined with
a selected and optimised data representation (i.e. relevant features), can achieve
better accuracy. The stringent feature selection, moreover, proved crucial in
understanding which aspects, among the various linguistic and acoustic ones considered, play a
critical role in making a sentence engaging or not. On the acoustic level, prosodic and
spectrum-related features play a major role in discriminating engaging from non-engaging sentences, while
on the linguistic level raw text properties give the main contribution. We can conclude that the
attention of the listener(s), and thus the perceivable engagement, can be driven by acoustic and
linguistic features; for this reason we studied the phenomenon of engagement by means of
fully explainable classification models.</p>
      <sec id="sec-6-1">
        <title>6.1. Future developments</title>
        <p>One of the critical issues of this study undoubtedly concerns the dataset, which is
relatively small (1,114 sentences) and not very varied (the 3 visits are led by the
same guide, thus all the data concern one person). To conduct an even more precise and accurate
study, and to generalise the results, it would be necessary to increase the size of the dataset by
including data coming from more guides and groups of visitors.</p>
        <p>Another important enrichment of the dataset could involve visual data. So far, we have
exploited the visual part of the dataset exclusively for annotating the variation of
attention/engagement, but it would be interesting to explore visual features in the classification
process, and to measure the performance of a model that considers linguistic, acoustic, and
visual data to predict the engagement potential of a communication act.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>A. Setting of classifiers and data preprocessing</title>
      <p>This appendix shows the detailed tables containing the Accuracy results obtained during the
first phase of experimentation, in which the classifier models and the dataset were configured and
optimised. Specifically, it reports the percentages obtained from the cross-comparison of the various
hyperparameter configurations (Section 3.2.1), for both classifier models, and of the two engagement
aggregation techniques implemented in the study (Section 2.3). The results of
the comparison between the data normalisation techniques tested (Section 3.3) and
between the engagement thresholds used (Section 2.3.3) are also shown. Each table shows the
baseline percentages (based on the most frequent class of engagement, i.e. 0 for non-engaging
or 1 for engaging) for each classification scenario, for each engagement threshold, and for each
engagement aggregation technique (the experiments shown in these tables were all performed in
the RaP classification scenario).</p>
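      <p>A baseline based on the most frequent class can be computed as in the following minimal sketch. The function name <monospace>majority_baseline</monospace> and the example labels are hypothetical and serve only to illustrate the idea; they are not taken from the dataset.</p>
      <preformat>
```python
from collections import Counter
from typing import Sequence


def majority_baseline(labels: Sequence[int]) -> float:
    """Accuracy (in percent) of always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must not be empty")
    _, count = Counter(labels).most_common(1)[0]
    return 100.0 * count / len(labels)


# Hypothetical labels: 1 = engaging, 0 = non-engaging.
example_labels = [0, 0, 0, 1, 1, 0, 1, 0]
print(majority_baseline(example_labels))  # 62.5
```
      </preformat>
      <p>A classifier is informative only to the extent that its accuracy exceeds this value, which is why results are read as increments over the baseline.</p>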
    </sec>
    <sec id="sec-8">
      <title>B. Accuracies in Cross-Validation</title>
      <p>This appendix details the averages of all Accuracy results obtained during Cross-Validation, with
a cross-comparison between the feature combinations used and the classification scenarios. The
tables are divided by individual classifier model, and show the accuracies obtained both before
and after feature selection (see Section 3.5). Each table also shows the baseline percentages
(based on the most frequent class) for each classification scenario and for each modality, used
for comparison with the results obtained.</p>
    </sec>
    <sec id="sec-9">
      <title>C. Accuracies on test-set</title>
      <p>This appendix shows the Accuracy percentages obtained on the test-set, in the final phase of
testing the classifier models. In particular, there is a table of accuracies for each of the two
classification models, with a cross-comparison between the feature combinations used and the
classification scenarios. Each table also shows the baseline percentages (based on the most
frequent class) for each classification scenario and for each modality.</p>
    </sec>
    <sec id="sec-10">
      <title>D. Feature selection results</title>
      <p>This appendix shows the results of the analysis of the features selected by the classifier models
through the feature selection algorithm. The tables show the percentage by which each feature
category and subcategory was included in the top 10% of the features, based on the ranking
produced by the classification models. Note that the indicated percentages
are the average of the percentages found with Linear-SVC and Random-Forest.</p>
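      <table-wrap>
        <table>
          <thead>
            <tr><th>ACOUSTIC FEATURE CATEGORIES</th><th>RaP</th><th>Vis</th><th>POI</th><th>MEAN</th></tr>
          </thead>
          <tbody>
            <tr><th colspan="5">PROSODICS</th></tr>
            <tr><td>F0 (SHS and Viterbi smoothing)</td><td>10.00</td><td>10.00</td><td>0.00</td><td>6.66</td></tr>
            <tr><td>Sum of auditory spectrum (loudness)</td><td>20.00</td><td>10.00</td><td>20.00</td><td>16.66</td></tr>
            <tr><td>Sum of RASTA-style filtered auditory spectrum</td><td>0.00</td><td>0.00</td><td>10.00</td><td>3.33</td></tr>
            <tr><td>RMS energy and zero-crossing rate</td><td>25.00</td><td>20.00</td><td>30.00</td><td>25.00</td></tr>
            <tr><th colspan="5">SPECTRAL</th></tr>
            <tr><td>RASTA-style auditory spectrum, bands 1-26</td><td>6.92</td><td>9.61</td><td>8.07</td><td>8.20</td></tr>
            <tr><td>MFCC 1-14</td><td>20.71</td><td>14.28</td><td>16.43</td><td>17.14</td></tr>
            <tr><td>Spectral energy 250-650 Hz, 1-4 kHz</td><td>20.00</td><td>20.00</td><td>20.00</td><td>20.00</td></tr>
            <tr><td>Spectral roll-off point 0.25, 0.50, 0.75, 0.90</td><td>0.00</td><td>2.50</td><td>0.00</td><td>0.83</td></tr>
            <tr><td>Spectral flux, centroid, entropy, slope</td><td>15.00</td><td>15.00</td><td>15.00</td><td>15.00</td></tr>
            <tr><td>Psychoacoustic sharpness, harmonicity</td><td>20.00</td><td>5.00</td><td>15.00</td><td>13.33</td></tr>
            <tr><td>Spectral variance, skewness, kurtosis</td><td>10.00</td><td>6.66</td><td>6.67</td><td>7.78</td></tr>
            <tr><th colspan="5">VOICE QUALITY</th></tr>
            <tr><td>Voicing probability</td><td>10.00</td><td/><td>10.00</td><td>6.67</td></tr>
            <tr><td>Log. HNR, Jitter (local, delta), Shimmer (local)</td><td>5.00</td><td>7.50</td><td>7.50</td><td>9.16</td></tr>
          </tbody>
        </table>
      </table-wrap>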
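      <p>One possible reading of how such inclusion percentages are obtained is sketched below: for each category, the share of its features that falls within the top 10% of a classifier's ranking is computed, and the two classifiers' shares are then averaged. The function names, the feature names and the toy rankings are all hypothetical and are not taken from the study.</p>
      <preformat>
```python
from typing import Dict, List


def inclusion_percentages(
    ranking: List[str], category_of: Dict[str, str], top_fraction: float = 0.10
) -> Dict[str, float]:
    """Percentage of each category's features found in the top share of a ranking."""
    n_top = max(1, round(len(ranking) * top_fraction))
    top = set(ranking[:n_top])
    totals: Dict[str, int] = {}
    hits: Dict[str, int] = {}
    for feature in ranking:
        category = category_of[feature]
        totals[category] = totals.get(category, 0) + 1
        hits[category] = hits.get(category, 0) + int(feature in top)
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}


def average_between(a: Dict[str, float], b: Dict[str, float]) -> Dict[str, float]:
    """Average the per-category percentages of two classifiers."""
    return {cat: (a[cat] + b[cat]) / 2 for cat in a}


# Toy setup (hypothetical): 10 prosodic and 10 spectral features.
category_of = {f"p{i}": "prosodic" for i in range(10)}
category_of.update({f"s{i}": "spectral" for i in range(10)})


def ranking_with_head(head: List[str]) -> List[str]:
    """Build a full ranking whose best-ranked features are those in `head`."""
    return list(head) + [name for name in category_of if name not in head]


svc = inclusion_percentages(ranking_with_head(["p0", "s0"]), category_of)
forest = inclusion_percentages(ranking_with_head(["p0", "p1"]), category_of)
averaged = average_between(svc, forest)
print(averaged)
```
      </preformat>
      <p>Normalising by the size of each category, rather than by the size of the selected set, keeps large groups such as the spectral bands from dominating the comparison simply because they contain more features.</p>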
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          , Preface to the
          <source>Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI)</source>
          , in: D.
          <string-name>
            <surname>Nozza</surname>
            ,
            <given-names>L. C.</given-names>
          </string-name>
          <string-name>
            <surname>Passaro</surname>
          </string-name>
          , M. Polignano (Eds.),
          <source>Proceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI</source>
          <year>2022</year>
          )
          <article-title>co-located with 21th International Conference of the Italian Association for Artificial Intelligence (AI*IA</article-title>
          <year>2022</year>
          ), November 30,
          <year>2022</year>
          , CEUR-WS.org,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Floridi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chiriatti</surname>
          </string-name>
          , Gpt-3
          <article-title>: Its nature, scope, limits, and consequences</article-title>
          ,
          <source>Minds and Machines</source>
          <volume>30</volume>
          (
          <year>2020</year>
          )
          <fpage>681</fpage>
          -
          <lpage>694</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , et al.,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          .,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          , Ö. Sümer,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stürmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Göllner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gerjets</surname>
          </string-name>
          , E. Kasneci, U. Trautwein, Attentive or Not?
          <article-title>Toward a Machine Learning Approach to Assessing Students' Visible Engagement in Classroom Instruction</article-title>
          ,
          <source>Educational Psychology Review</source>
          <volume>35</volume>
          (
          <year>2019</year>
          )
          <fpage>463</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.</given-names>
            <surname>Regenbogen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Gur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Habel</surname>
          </string-name>
          , T. Kellermann,
          <article-title>Multimodal human communication - Targeting facial expressions, speech content and prosody</article-title>
          ,
          <source>NeuroImage</source>
          <volume>60</volume>
          (
          <year>2012</year>
          )
          <fpage>2346</fpage>
          -
          <lpage>2356</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Oertel</surname>
          </string-name>
          , G. Castellano,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chetouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nasir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Obaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pelachaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <article-title>Engagement in human-agent interaction: An overview</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>7</volume>
          (
          <year>2020</year>
          )
          <fpage>92</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>I. Poggi</surname>
          </string-name>
          ,
          <article-title>Mind, hands, face and body: a goal and belief view of multimodal communication</article-title>
          ,
          <source>Weidler</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Sidner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Kidd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lesh</surname>
          </string-name>
          , C. Rich,
          <article-title>Explorations in engagement for humans and robots</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>166</volume>
          (
          <year>2005</year>
          )
          <fpage>140</fpage>
          -
          <lpage>164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Castellano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paiva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>McOwan</surname>
          </string-name>
          ,
          <article-title>Detecting user engagement with a robot companion using task and social interaction-based features</article-title>
          ,
          <source>in: Proceedings of the 2009 international conference on Multimodal interfaces</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Sanghvi</surname>
          </string-name>
          , G. Castellano,
          <string-name>
            <surname>I. Leite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. W.</given-names>
            <surname>McOwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Paiva</surname>
          </string-name>
          ,
          <article-title>Automatic analysis of affective postures and body motion to detect engagement with a game companion</article-title>
          ,
          <source>in: Proceedings of the 6th International Conference on Human-Robot Interaction, HRI '11</source>
          ,
          Association for Computing Machinery, New York, NY, USA,
          <year>2011</year>
          , pp.
          <fpage>305</fpage>
          -
          <lpage>312</lpage>
          . URL: https://doi.org/10.1145/1957656.1957781. doi:10.1145/1957656.1957781.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben-Youssef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clavel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Essid</surname>
          </string-name>
          ,
          <article-title>Early detection of user engagement breakdown in spontaneous human-humanoid interaction</article-title>
          ,
          <source>IEEE Transactions on Affective Computing</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>776</fpage>
          -
          <lpage>787</lpage>
          . doi:10.1109/TAFFC.2019.2898399.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>McCowan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Bengio, Detecting group interest
          <article-title>-level in meetings</article-title>
          ,
          <source>in: Proceedings.(ICASSP'05)</source>
          .
          <source>IEEE International Conference on Acoustics, Speech, and Signal Processing</source>
          ,
          <year>2005</year>
          ., volume
          <volume>1</volume>
          , IEEE,
          <year>2005</year>
          , pp.
          <fpage>I</fpage>
          -
          <volume>489</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Oertel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Scherer</surname>
          </string-name>
          , N. Campbell,
          <article-title>On the use of multimodal cues for the prediction of degrees of involvement in spontaneous conversation</article-title>
          .,
          <source>in: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>1541</fpage>
          -
          <lpage>1544</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Fredricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Blumenfeld</surname>
          </string-name>
          , A. H. Paris, School engagement:
          <article-title>Potential of the concept, state of the evidence</article-title>
          ,
          <source>Review of educational research 74</source>
          (
          <year>2004</year>
          )
          <fpage>59</fpage>
          -
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Savy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sorgente</surname>
          </string-name>
          ,
          <article-title>The CHROME manifesto: integrating multimodal data into cultural heritage resources</article-title>
          ,
          <source>Computational Linguistics CLiC-it 2018</source>
          (
          <year>2018</year>
          )
          <fpage>155</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Origlia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Savy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cutugno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Alfano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>D'Errico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vincze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Cataldo</surname>
          </string-name>
          ,
          <article-title>An audiovisual corpus of guided tours in cultural sites: Data collection protocols in the CHROME project</article-title>
          , in:
          <source>2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage</source>
          , volume
          <volume>2091</volume>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Fredricks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. C.</given-names>
            <surname>Blumenfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Paris</surname>
          </string-name>
          ,
          <article-title>School engagement: Potential of the concept, state of the evidence</article-title>
          ,
          <source>Review of Educational Research</source>
          <volume>74</volume>
          (
          <year>2004</year>
          )
          <fpage>59</fpage>
          -
          <lpage>109</lpage>
          .
          <publisher-name>Sage Publications</publisher-name>
          ,
          <publisher-loc>Thousand Oaks, CA</publisher-loc>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>