What makes the audience engaged?
Engagement prediction
exploiting multimodal features
Daniele Borghesi1 , Andrea Amelio Ravelli2 and Felice Dell’Orletta2
1 Università di Pisa, Lungarno Pacinotti 43, 56126, Pisa, Italy
2 Istituto di Linguistica Computazionale “A. Zampolli” (ILC–CNR), Via Giuseppe Moruzzi 1, 56124, Pisa, Italy


                                         Abstract
This paper reports a series of experiments and analyses aimed at understanding which, among numerous
linguistic and acoustic aspects of spoken language, are distinctive in detecting an engagement potential
within speech. Starting from a dataset of sentences, pronounced during guided sightseeing tours and
characterised by a set of multimodal features, various classification algorithms were tested and optimised
in different scenarios and configurations. Thanks to the implementation of a recursive feature elimination
algorithm, it has been possible to select and identify which characteristics of the language play a key role
in the presence of an engagement potential, and which can thus differentiate an engaging sentence or
speech from a non-engaging one. The analyses on the selected features showed that, among the strictly
linguistic aspects, only basic features (i.e. sentence or word length) proved to be relevant in the
classification process. In contrast, acoustic aspects proved to play a considerably important role, in
particular those related to the sound spectrum and prosody. Overall, feature selection led to appreciable
increases in the performance of all implemented classification models.

                                         Keywords
                                         multimodal dataset, feature selection, engagement prediction, audience engagement




1. Introduction and motivation
In recent years we have witnessed major advances in Artificial Intelligence and Natural
Language Processing, to the point that we now have models capable of writing complete (and,
most of all, natural-sounding) pieces of text from a simple prompt [2, 3]. The ability to generate
content is impressive, but the scope of a text often goes beyond the pure information conveyed
with it. Nevertheless, the effectiveness of information transfer often depends on the willingness
of the receiver to accept it. This is particularly evident if we move our focus from the written
page to more interactive communication media and channels, such as face-to-face interactions.
In fact, the average (human) speaker is generally very good at estimating the interlocutor’s

NL4AI 2022: Sixth Workshop on Natural Language for Artificial Intelligence, November 30, 2022, Udine, Italy [1]
Email: d.borghesi@studenti.unipi.it (D. Borghesi); andreaamelio.ravelli@ilc.cnr.it (A. A. Ravelli);
felice.dellorletta@ilc.cnr.it (F. Dell’Orletta)
Web: http://www.ilc.cnr.it/it/content/andrea-amelio-ravelli (A. A. Ravelli);
http://www.ilc.cnr.it/it/content/felice-dellorletta (F. Dell’Orletta)
ORCID: 0000-0002-0979-0585 (D. Borghesi); 0000-0002-0232-8881 (A. A. Ravelli); 0000-0003-3454-9387 (F. Dell’Orletta)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
level of involvement from visually accessible signals (e.g. body postures and movements, facial
expressions, eye-gazes), and at refining his/her communication strategy, in order to keep the
communication channel open and the attention high in the audience. Such visible cues are
mostly signals of attention, which is considered as a perceivable proxy to broader and more
complex inner processes of engagement [4]. Moreover, recent studies have shown that the
processing of emotionality in the human brain is performed on modality-specific basis [5]:
prosody, facial expressions and speech content (i.e. the semantic information) are processed in
the listener’s brain with the selective activation of the auditory cortex, the fusiform gyri and
the middle temporal gyri, respectively.
   Understanding non-verbal feedback is not easy for virtual agents and robots, but
this ability is strategic for enabling more natural interfaces capable of adapting to users. Indeed,
perceiving signals of loss of attention (and thus, of engagement) is of paramount importance to
design naturally behaving virtual agents, able to adjust their communication strategy to keep
the interest of their addressees high. That information is also a general sign of the quality of the
interaction and, more broadly, of the communication experience. At the same time, the ability
to generate engaging behaviours in an agent can be beneficial in terms of social awareness [6].
   The objective of the present work is to understand the phenomena correlated to the increase
or decrease of perceivable engagement in the audience of a speech, specifically in the domain
of guided tours. We are interested in highlighting which features, from which specific modality,
have a key role in driving the attention of the listener(s), in order to exploit a reduced set of
features as dense but highly informative representations.

1.1. Related Work
With the word engagement we refer to the level of involvement reached during a social interac-
tion, which assumes the shape of a process through the whole communication exchange. More
specifically, [7] defines the process of social engagement as the value that a participant in an
interaction attributes to the goal of being together with the other participant(s) and continuing
the interaction. Another definition, adopted by many studies in Human-Robot Interaction
(HRI),1 describes engagement as the process by which interactors start, maintain, and end their
perceived connections to each other during an interaction [8]. The majority of studies are
conducted on a dyadic basis (i.e. one-to-one), in contexts where one of the participants is
an agent/robot [9, 10, 11]. Nevertheless, engagement can be measured in groups of people
as the average of the degree to which individuals are involved [12, 13, 14].


2. Dataset
Data for the experiments described herein derive from a subset of the data collected for the
CHROME Project2 [15, 16]. The domain of the data is Cultural Heritage; more specifically,
the project has been focused on guided tours in 3 Charterhouses in Campania (Italy), where
an expert historian led groups of 4 persons. Tours are organised in 6 Points of Interest (POI),

   1
       For a broad and complete overview of works on engagement in HRI studies, see [6].
   2
       Cultural Heritage Resources Orienting Multimodal Experience. http://www.chrome.unina.it/
i.e. rooms or areas inside the Charterhouses where the visits stop and the guide describes
the place with its furnishings, history and anecdotes. The communication event type is quasi-
unidirectional (one-to-many), i.e. one of the participants is the holder of the knowledge (the
expert guide) and talks to the others (the audience), with a few moments of dialogue (e.g. when
the guide asks the audience something).
   The original data collection campaign led to a multimodal corpus with aligned transcriptions,
audio and video recordings. From this, we selected a subset composed of 3 visits (i.e. 3 different
groups of 4 persons) led by the same expert guide inside one of the three Charterhouses (San
Martino Charterhouse, Naples). Given the exploratory objective of the present study, we made
this selection in order to control for speaker-specific differences such as voice features and
discourse style.
   The final set of data on which we run our experiments is composed of 1,114 sentences,
enriched with the annotation of the perceivable engagement of the audience, and characterised
with a total of 452 features extracted from multiple modalities (127 linguistic, 325 acoustic)
used to model the speech of the guide. The process through which we obtained our dataset is
described in the following.

2.1. Human engagement annotation
We considered the attention of the audience as a perceivable proxy to model and highlight
participants’ engagement, in line with the assumption that engagement is a complex process,
a multidimensional meta-construct [17] composed of behavioural, emotional and cognitive
aspects. Behavioural aspects (and some externalisation of emotional states) can be tracked by
observing the subject, while for the others it is necessary to exploit specific equipment to record
biomarkers such as heart rate or neural activity. Nevertheless, all aspects of engagement are
highly interrelated and do not occur in isolation, thus attention plays a crucial role in defining
if the audience is engaged or not [18].3
    To annotate audience engagement, we exploited the visual part of the original CHROME
dataset, consisting of 2 parallel video recordings for each visit: one focused on the speaker,
the other on the audience. We asked 2 annotators to watch the audience and the guide videos
at the same time, with the guide video in a small window superimposed on the audience one,
and to annotate the level of attention and its variation among the attendees. We recorded this
information by means of PAGAN Annotation Tool [19], that enables the annotator to easily
track the observed phenomenon with a simple press of two keys on the keyboard: arrow-up if
a rise is perceived, arrow-down otherwise. Our annotators reached a high agreement on this
task, with an average Spearman’s rho of 0.87.4 The resulting annotation is a continuous series
of values, indicating rise or fall of engagement along the whole visit, for all the visits in our
dataset.
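
As a minimal illustration of how such an agreement figure could be computed, the following sketch correlates two annotators' engagement traces with SciPy; it assumes the two PAGAN traces have already been resampled onto a common timeline, and the file and column names are purely illustrative.

```python
import pandas as pd
from scipy.stats import spearmanr

# Illustrative input: two engagement traces resampled onto the same time grid
# (file names and the 'engagement' column are hypothetical).
trace_a = pd.read_csv("annotator_a.csv")["engagement"]
trace_b = pd.read_csv("annotator_b.csv")["engagement"]

# Spearman's rank correlation between the two continuous annotation series.
rho, p_value = spearmanr(trace_a, trace_b)
print(f"Spearman's rho = {rho:.2f} (p = {p_value:.3g})")
```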




   3
       We will continue to use the term engagement to refer to the perceivable attention of the audience.
   4
       For a more detailed description of the annotation process, see [20].
2.2. Sentence segmentation
We acquired the textual data in the form of ELAN annotation files [21], containing the ortho-
graphic transcription of single words tagged with their start- and end-time (in milliseconds),
aligned on the timeline of the whole speech. In order to obtain more exploitable units of text,
we split the flow of the speech in sentence-like segments, by concatenating together all the
words that can represent a finite unit of language.5 We asked two annotators to segment our
texts, relying on a pure perceptual principle: mark the end of a sentence whenever conceptual
completeness is perceived. We relied on the capability of native speakers of Italian to mentally
segment the flow of speech, just as we normally do during everyday conversations. In other
words, we asked the annotators to identify terminal breaks and mark them with a full stop.
Given that punctuation is a convention of the written medium, the annotators were asked to
minimise its use; besides the full stop, we allowed for the use of commas to signal short pauses
or listings, and question marks when a questioning intonation was identified.
   A limitation of this methodology is that a high speech rate often makes it difficult to segment
finely, especially taking into account the necessity to propagate the segmentation from the text
to the audio files. In fact, we projected the start-end spans of sentences onto the audio files in
order to obtain the audio objects from which we extracted acoustic features. We kept multiple
sentences together in the same text/audio object when they were uttered at a high rate and
difficult to cleanly separate at the audio level, in order to avoid noise that would have altered
the computation of acoustic features.
   We measured the accuracy of the segmentation on a portion of the data (about 40% of the
total) by adapting an IOB (Inside-Outside-Begin) tagging framework. We labelled all the tokens,
according to each annotator, on the basis of their position at the beginning (B), the inside (I), the
end (E) or the outside (O) of a constructed sentence. With this annotation, we registered an
agreement of 91.53% in terms of accuracy between the two series of labelled tokens, thus the
obtained segments can be considered reliable and consistent.
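
A minimal sketch of how such a token-level agreement can be computed, assuming the two annotators' B/I/E/O labels are available as aligned lists (the toy labels below are invented):

```python
def segmentation_agreement(labels_a, labels_b):
    """Token-level accuracy between two aligned series of B/I/E/O labels."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy example with invented labels for five tokens.
annotator_1 = ["B", "I", "I", "E", "O"]
annotator_2 = ["B", "I", "E", "B", "O"]
print(f"Agreement: {segmentation_agreement(annotator_1, annotator_2):.2%}")  # 60.00%
```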

2.3. Engagement projection on sentences
As anticipated in Section 2.1, the engagement annotation consists of a continuous series of values
along the timeline of each video/visit: we have a numerical value indicating the level of
engagement at each instant in which it changed. Our aim was to use these values to extract the
level of engagement of each individual sentence; to this end, we aggregated all the values within
the span of each segmented sentence, adapting the continuous annotation of the engagement to
discrete units (i.e. the sentences) by translating those values into finite classes: engaging
(associated with class 1) vs. non-engaging (associated with class 0). In this regard, two different
aggregation methods were designed and implemented: by subtraction and by summation.



    5
     Speech segmentation is not a trivial task, and many researchers debated (and they are still debating) on the
problem. A recent special issue on the topic has been collected in [22].
2.3.1. Aggregation by subtraction
By using the subtraction method, we considered the delta between the first and the last value
of engagement annotated in the time span of a sentence. Considering the time interval of a
sentence 𝑆, where 𝑛 values of engagement were annotated (one for each variation), to obtain
the engagement level 𝐸𝑆 of an entire sentence we subtracted the first engagement value (𝑒0 )
from the last one (𝑒𝑛 ), as illustrated by equation 1:

\[ E_S = e_n - e_0 \tag{1} \]

2.3.2. Aggregation by summation
By using the summation method, all the values and variations in the series of engagement
values within the time span of a sentence are taken into account. Considering a series of $n$
engagement values (one for each variation) annotated within the time interval of a sentence $S$,
a cumulative sum is computed, in which 1 is added whenever an increase in the level of
engagement occurs ($e_i > e_{i-1}$) and $-1$ is added whenever a decrease occurs ($e_i < e_{i-1}$).
The final result of the sum gives the level of engagement $E_S$ of the entire sentence, as
illustrated by equation 2, based on the system of equations 3:
\[ E_S = \sum_{i=1}^{n} a_i \tag{2} \]

\[ a_i = \begin{cases} 1, & \text{if } e_i > e_{i-1} \\ -1, & \text{if } e_i < e_{i-1} \end{cases} \tag{3} \]
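
As an illustration, both aggregation strategies can be expressed in a few lines of code; the sketch below assumes that the engagement values annotated within a sentence are available as an ordered list.

```python
def aggregate_by_subtraction(values):
    """E_S = e_n - e_0: delta between the last and first annotated value (Eq. 1)."""
    return values[-1] - values[0]

def aggregate_by_summation(values):
    """E_S = sum of +1/-1 for each rise/fall between consecutive values (Eqs. 2-3)."""
    return sum(1 if curr > prev else -1
               for prev, curr in zip(values, values[1:])
               if curr != prev)

# Toy series of engagement values annotated within one sentence.
series = [3, 4, 5, 4, 4, 6]
print(aggregate_by_subtraction(series))  # 3
print(aggregate_by_summation(series))    # 2 (three rises, one fall)
```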

2.3.3. Engagement thresholds
After computing the engagement level for each sentence, we further converted these values to
Boolean classes (𝐶𝑠 ): 1 if resulting engaging, 0 if non-engaging. We considered 3 thresholds as
different degrees of inclusiveness:

    • −1, to generate a more generous classification;
    • 0, to generate a more balanced classification;
    • +1, to generate a more sceptical classification.

  Every sentence with an engagement level $E_S$ above the threshold $t$ was considered engaging,
while the others were considered non-engaging, as illustrated by the system of equations 4:

\[ C_S = \begin{cases} 1, & \text{if } E_S > t \\ 0, & \text{if } E_S \le t \end{cases} \tag{4} \]
  In conclusion, we obtain six different sentence classification series: three series (one for each
engagement threshold) for each of the two aggregation methodologies. The selection of the most
suitable series is discussed in Section 3.4.
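
A minimal sketch of the class assignment of Equation 4, applied with the three thresholds considered above (the toy engagement levels are invented):

```python
def assign_class(engagement_level, threshold=0):
    """C_S = 1 (engaging) if E_S > t, otherwise 0 (non-engaging), as in Eq. 4."""
    return 1 if engagement_level > threshold else 0

# Generous (-1), balanced (0) and sceptical (+1) thresholds on toy E_S values.
for t in (-1, 0, 1):
    print(t, [assign_class(e, t) for e in (-2, -1, 0, 1, 2)])
```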
2.4. Features
In this section we describe the methodology and the tools used to extract features for both
the textual and the acoustic modality. We relied on explicit feature extraction systems in order to
explore which specific features, and to what extent, convey most of the information that
creates an engagement status in the audience.6

2.4.1. Linguistic Features


Table 1
Set of linguistic features extracted with Profiling-UD.
                                         Linguistic features                 n
                                               Raw text properties            2
                                     Morpho–syntactic information            52
                                         Verbal predicate structure          10
                                             Parsed tree structures          15
                                                Syntactic relations          38
                                        Subordination phenomena              10
                                                              Total         127


   The textual modality has been encoded by using Profiling–UD [26], a publicly available
web–based application7 inspired by the methodology initially presented in [27], which performs
linguistic profiling of a text, or of a large collection of texts, for multiple languages. The system,
based on an intermediate step of linguistic annotation with UDPipe [28], extracts a total of 129
features for each analysed document. In this case, the Profiling-UD analysis has been performed
per sentence, thus the output has been considered as the linguistic feature set of each segment
of the dataset. Table 1 reports the 127 features extracted with Profiling-UD and used as textual
modality features for the classifier.8
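
As a purely illustrative sketch, assuming the per-sentence Profiling-UD output has been exported as a table with one column per feature (the actual export format may differ), the two excluded raw-text properties can be dropped as follows:

```python
import pandas as pd

# Hypothetical export of the per-sentence Profiling-UD analysis:
# one row per sentence, one column per feature.
profiling = pd.read_csv("profiling_ud_output.csv")

# Drop the two raw-text properties that are not meaningful at sentence level,
# leaving the 127 linguistic features used in the experiments.
linguistic_features = profiling.drop(columns=["n_sentences", "tokens_per_sent"])
```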

2.4.2. Acoustic Features
The acoustic modality has been encoded using OpenSmile9 [29], a complete and open-source
toolkit for the analysis, processing and classification of audio data, especially targeted at speech
and music applications such as automatic speech recognition, speaker identification, emotion
recognition, or beat tracking and chord detection. The acoustic feature set used in this case
is the Computational Paralinguistics ChallengE10 (ComParE) set, which comprises 65 Low-Level
Descriptors (LLDs), computed per frame.

    6
      The current state of the art in both linguistic and acoustic feature extraction makes use of recent Deep Learning
methods and techniques [23, 24, 25], but those systems extract features that are by nature not explainable.
    7
      Profiling-UD can be accessed at the following link: http://linguistic-profiling.italianlp.it
    8
      Out of the 129 Profiling-UD features, n_sentences and tokens_per_sent (raw text properties) have not been
considered, given that the analysis has been performed per sentence.
    9
      https://www.audeering.com/research/opensmile/
   10
      http://www.compare.openaudio.eu
Table 2
Set of acoustic features extracted with OpenSmile.
                                       Acoustic features                     n
                                               Prosodic
                    F0 (SHS and viterbi smoothing)                           1
                    Sum of auditory spectrum (loudness)                      1
                    Sum of RASTA-style filtered auditory spectrum            1
                    RMS energy, zero-crossing rate                           2
                                                Spectral
                    RASTA-style auditory spectrum, bands 1–26 (0–8 kHz)      26
                    MFCC 1–14                                                14
                    Spectral energy 250–650 Hz, 1 k–4 kHz                    2
                    Spectral roll off point 0.25, 0.50, 0.75, 0.90           4
                    Spectral flux, centroid, entropy, slope                  4
                    Psychoacoustic sharpness, harmonicity                    2
                    Spectral variance, skewness, kurtosis                    3
                                            Sound quality
                    Voicing probability                                       1
                    Log. HNR, Jitter (local, delta), Shimmer (local)          4
                                                                     Total   65


   Table 2 reports a summary of the ComParE LLDs extracted with OpenSmile, grouped by type:
prosody-related, spectrum-related and quality-related. Given that the duration (and, consequently,
the number of frames) of the audio segments varies, common transformations (min, max, mean,
median, std) have been applied to the set of per-frame features of each segment, leading to a
total of 325 acoustic features (65 LLDs × 5 transformations).
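
A possible way to reproduce this kind of extraction with the opensmile Python package is sketched below; the audio file name and segment boundaries are placeholders, and the exact behaviour may vary with the package version.

```python
import opensmile

# ComParE 2016 Low-Level Descriptors, computed per frame.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

# Placeholder segment: start/end times (in seconds) of one sentence-like unit.
lld = smile.process_file("visit01.wav", start=12.3, end=17.8)

# Collapse the per-frame LLDs into 5 fixed-size statistics per descriptor,
# yielding 65 x 5 = 325 acoustic features for the segment.
stats = lld.agg(["min", "max", "mean", "median", "std"])
segment_features = {
    f"{descriptor}_{stat}": stats.loc[stat, descriptor]
    for descriptor in stats.columns
    for stat in stats.index
}
```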


3. Experimental setting
In order to explore multiple methodologies and techniques to study the task of engagement
potential prediction, we set our experiments in different classification scenarios, exploiting two
Machine Learning models, applying alternative feature normalisation and engagement class
assignment methods, and executing a selection of the most representative features to predict
the engagement potential of a sentence.

3.1. Classification scenarios and baseline
Dealing with few data, as in this case (1,114 total items), may lead to an overestimation of the
classification performance, making the predictions unreliable, especially when relying on a simple
train-validation split of the dataset [30, 31, 32]. To avoid this, we opted for a Cross-Validation
approach [33, 34, 35], articulating our experimentation into 3 classification scenarios:

    • By stratified Random Partitioning (RaP): the dataset is divided into 10 equally sized parts,
      composed of randomly extracted elements. The stratified approach makes it possible to
      maintain the same proportion between classes in the dataset even in individual subdivi-
      sions; this is possible exploiting a Stratified Cross-Validation technique;11
    • By Visits (Vis): the dataset is divided on the basis of tourist visits, thus obtaining three
      partitions, related to the three visits considered;
    • By Points Of Interest (POI ): the dataset is partitioned on the basis of Points of Interest,
      thus obtaining six partitions, based on the POIs taken into consideration.

   It is important to specify that, in each classification scenario, an unseen part of the dataset (a
test-set) has been kept aside until the conclusion of the study, in order to ultimately test the
performance of the fully optimised system on unknown data. For the RaP scenario we excluded
from the Cross-Validation a portion of 20% of the data, which is also stratified. In the case of
the Vis scenario, the test-set is represented by the data related to the first visit (V01), while for
the POI scenario, the test-set is represented by the data related to the first point of interest
(P01).
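
The three scenarios can be sketched with standard scikit-learn utilities as follows; the dataframe layout and the visit/poi column names are hypothetical, and the held-out test partitions (the stratified 20%, V01 and P01) are assumed to be set aside beforehand.

```python
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut, StratifiedKFold, train_test_split

# Hypothetical layout: one row per sentence, an 'engaging' label and
# 'visit' / 'poi' columns identifying the visit and the Point of Interest.
df = pd.read_csv("sentences.csv")
X = df.drop(columns=["engaging", "visit", "poi"])
y = df["engaging"]

# RaP: stratified 80/20 hold-out, then stratified 10-fold CV on the remaining 80%.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
rap_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Vis / POI: leave one group out, where the group is the visit or the POI.
vis_splits = LeaveOneGroupOut().split(X, y, groups=df["visit"])
poi_splits = LeaveOneGroupOut().split(X, y, groups=df["poi"])
```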
   For each classifier, and in each scenario, we trained 3 different models, namely Multimodal,
Linguistic and Acoustic, on the basis of the type (or the combination of types) of features used
as training. We also decided to calculate and use a baseline for each validation-set and each
test-set: each sentence in the set was assigned the Most Frequent Class within the respective
training-set. The individual baselines can be found in the appendices, where we report tables
with the details of every experiment we ran in this work, together with the figures of each baseline.

3.2. Classifiers
One of the primary objectives of the study is to obtain a model capable of classifying a sentence
as either engaging or not engaging. To achieve this goal, as anticipated, we selected two
Machine Learning models: Linear Support Vector Classifier [36, 37] (Linear-SVC) and Random
Forest Classifier [38, 39] (Random-Forest). Choosing two radically different classifiers, rather
than using a single one, allows us to perform an accurate comparison between two different
classification processes, in terms of behaviour and performance. Most importantly, we relied on
fully explainable classification models, where it is possible to work with explicit features, thus
focusing on the phenomena behind a decision.
   More precisely, at the feature selection stage, it will be possible to highlight which feature
categories were deemed important by both classifiers, i.e. could be considered relevant for
detecting an engagement potential in language. Indeed, both classifiers are able to sort the
training features on the basis of their influence in the classification process, assigning them a
rank [40, 41, 42, 39] that can be used for performing feature selection and subsequent in-depth
analysis.

3.2.1. Hyperparameters tuning
A very important aspect in setting up the classifiers is the optimisation of the hyperparameters:
the Machine Learning models, in fact, have several hyperparameters that can be modified to
improve classification performance, allowing for more accurate results [43, 44, 45, 46].
   11
        https://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold
   A complete engineering of the chosen models would have been outside the objectives of the
study, thus we chose to optimise exclusively the most relevant hyperparameter in each of the
two chosen classifiers:

    • For Linear-SVC, the regularisation parameter (commonly referred to as parameter C) was
      optimised by testing a range of values (0.001, 0.01, 0.10, and 1.00) [47];
     • For the Random-Forest, the number of decision trees (Decision-Trees) that make up the
       "forest" was optimised. In this case, 10, 100, and 1000 trees were tested.

The hyperparameter tuning results showed that the Linear-SVC achieved the best performance
with a regularisation parameter of 0.001, while the Random-Forest scored best with 1000
Decision-Trees. Detailed results of the hyperparameter tuning, cross-compared with the
aggregation methods explained in Section 2.3, can be found in Appendix A (Tables 3 and 4).
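
A minimal sketch of the kind of grid explored here, using scikit-learn (the cross-validation object stands for any of the splits described in Section 3.1):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Only the single most relevant hyperparameter of each model is searched.
svc_search = GridSearchCV(LinearSVC(), {"C": [0.001, 0.01, 0.10, 1.00]},
                          scoring="accuracy", cv=10)
rf_search = GridSearchCV(RandomForestClassifier(), {"n_estimators": [10, 100, 1000]},
                         scoring="accuracy", cv=10)

# svc_search.fit(X_dev, y_dev); rf_search.fit(X_dev, y_dev)
# Best values found in this study: C = 0.001 and 1000 Decision-Trees.
```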

3.3. Feature normalisation
Standardising and normalising data (e.g., scaling within a common numerical range) can benefit
the training and performance of Machine Learning models [48]. In this regard, we tested several
normalisation methods, which we can divide into two main groups:

    • Linear normalisation methods: Standard-Scaler (StaS), Max-Abs-Scaler (MAS), Min-Max-
      Scaler (MiMaS) with two different numerical ranges (0 to 1, and -1 to 1), and Robust-Scaler
      (RoS);
    • Nonlinear normalisation methods: Power-Transformer (PoT) and Quantile-Transformer
      (QuT).

   In our experimentation, no appreciable differences in accuracy emerged among the
normalisation methods. However, the Quantile-Transformer (QuT) provided slightly better
overall results, and it was therefore selected as the default for the subsequent experiments. All
the results of the comparison between data normalisation methods, for both classifiers, can be
found in Appendix A (Table 5).
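
As an illustration, the retained normalisation step can be placed in front of a classifier in a pipeline, so that the scaling is fitted on the training folds only; this is a sketch, not the exact configuration used in the study.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer
from sklearn.svm import LinearSVC

# Quantile-Transformer (QuT) followed by the optimised Linear-SVC.
model = make_pipeline(
    QuantileTransformer(output_distribution="uniform"),
    LinearSVC(C=0.001),
)
# model.fit(X_train, y_train); model.score(X_val, y_val)
```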

3.4. Engagement class assignment
As anticipated, we considered 2 alternative aggregation methodologies (i.e. summation and
subtraction) and 3 thresholds to determine whether a sentence could be classified as engaging or
not. Our experimentation showed that the summation method led to the best results; therefore,
we adopted it in our configuration. As for the engagement thresholds, a further test was
performed: by comparing the three devised thresholds (-1, 0 and 1), it was found that threshold 0
(considered the most neutral) allowed for the best accuracy. Accuracy results of the comparison
between engagement thresholds, for each classifier, can be found in Appendix A (Table 6).
3.5. Feature selection algorithm
The performance of a Machine Learning model can be improved by reducing the number of
features it is trained with, based on their influence in the classification process [49]. For this
reason, we implemented a recursive feature elimination algorithm to identify which features
are most relevant for the prediction of engagement potential in a sentence, and consequently
to improve the performance of the models. The process of the feature selection algorithm is
structured in four steps:
   1. Using the total set of features, the value of Accuracy in Cross-Validation is calculated;
   2. The Accuracy value is compared with the best result obtained so far (0, if we are at the
      first iteration):
          • If the value obtained is greater, a ranking of features is made (based on the degrees
            of importance provided by the classification model), which will be considered the
            new optimal feature combination;
          • If the value obtained turns out to be lower, the previous optimal feature combination
            (obtained from the model that provided the higher Accuracy result) is retained;
   3. Steps 1 and 2 are repeated, recursively eliminating a predefined number of features at each
      iteration (starting with the least important ones according to the ranking), until the
      minimum threshold of about 10% of the total feature set is reached;
   4. The algorithm returns the selection of the most important features, i.e. the subset with
      which the best Accuracy result was obtained.
Given the long calculation times required for training Random-Forest (with 1000 estimators), it
was decided to set the number of features to be eliminated at each iteration as follows:
    • 10, for experiments performed with all features (Multimodal);
    • 3, for experiments performed with linguistic features only (Linguistic);
    • 7, for experiments performed with acoustic features only (Acoustic).
The choice was made taking into account the approximate proportion of linguistic and acoustic
features to the total number of features. In this way, it is possible to significantly reduce the
number of iterations, and therefore speed up the experiments, while maintaining excellent
Accuracy. With Linear-SVC, given its speed of execution, the number of features to be deleted
at each iteration was set to 1 for all tests.
   Another aspect that needs to be illustrated is the minimum number of features that the
algorithm is forced to select: about 10% of the total set, for each type of feature:
    • 45, for experiments performed with all features (Multimodal);
    • 13, for experiments performed with linguistic features only (Linguistic);
    • 32, for experiments performed with acoustic features only (Acoustic).
In this case, the proportion was maintained both for the experiments performed with Linear-SVC
and for those performed with Random-Forest. Maintaining a minimum percentage of features
of about 10% allows us to have a sufficient number of features in the analysis phase, to draw
more precise and in-depth conclusions on the behaviour and choices of the models.
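
A simplified sketch of this selection loop is given below; the actual implementation may differ in details (e.g. when the ranking is refreshed), but the overall logic is the same: drop a fixed number of the least important features at each iteration, down to roughly 10% of the set, and keep the subset that yielded the best cross-validated accuracy.

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def recursive_feature_elimination(model, X, y, cv, step, min_features):
    """Keep the feature subset that gives the best cross-validated accuracy,
    removing the `step` least important features per iteration."""
    features = list(X.columns)
    best_score, best_features = 0.0, features[:]
    while len(features) >= min_features:
        score = cross_val_score(model, X[features], y, cv=cv,
                                scoring="accuracy").mean()
        if score > best_score:
            best_score, best_features = score, features[:]
        # Rank features by importance: |coefficients| for Linear-SVC,
        # impurity-based importances for Random-Forest.
        model.fit(X[features], y)
        importance = (np.abs(model.coef_).ravel() if hasattr(model, "coef_")
                      else model.feature_importances_)
        ranked = [f for _, f in sorted(zip(importance, features))]
        features = ranked[step:]  # discard the `step` least important features
    return best_features, best_score
```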
4. Experimental results
This section discusses the results of the most relevant experiments in the study, with a focus on
the effects of feature selection. All the figures reporting the results show absolute accuracy
scores as percentages and bar extensions for the increments over the baseline. In the figures, the
percentages for each classification scenario represent the average of those obtained for the three
models (Multimodal, Linguistic and Acoustic); conversely, each model represents the average of
those obtained for the three scenarios (RaP, Vis, POI ). The reported results are meant as an
average between the Linear-SVC and Random-Forest models. As stated in the introduction, the
focus of this paper is not state-of-the-art performance, but the analysis of the most salient
features in both modalities. That said, we do not consider Linear-SVC and Random-Forest to
behave identically; rather, we look at an average performance based on a specific subset of
features. Detailed tables on all the experiments can be found in Appendix B, where we report
single-classifier results.




Figure 1: Comparison between average accuracy increments over baseline, before and after feature-
selection (bar percentages refer to absolute Accuracy values, not to increments). All the values represent
a mean between Linear-SVC and Random-Forest (Multi = Multimodal; Lin = Linguistic; Aco = Acoustic).



4.1. Feature selection effectiveness
The feature selection algorithm was run on both classifiers, testing all classification
scenarios and all models. As illustrated in Figure 1, feature selection leads to increases in mean
accuracy in all cases, while reducing the feature space to between 10% and 32% (average range
between Linear-SVC and Random-Forest) of the original feature set.
  The increases in accuracy are a signal of the presence of features that are particularly
relevant to predict an engagement potential and, conversely, that many of the original features
are redundant or noisy, regardless of scenario and model variation.
4.2. Results on test-set




Figure 2: Comparison between average accuracy increments over baseline on validation-set and test-set
(bar percentages refer to absolute Accuracy values, not to increments). All the values represent a mean
between Linear-SVC and Random-Forest (Multi = Multimodal; Lin = Linguistic; Aco = Acoustic).


   We ran the final step of the experimentation on an unseen portion of the dataset (i.e. the
test-set) of each classification scenario: as illustrated in Figure 2, the classifiers achieved accuracy
increments over the baseline on the test-set that are extremely similar to those achieved on the
validation-set, despite the fact that these were unknown data. A single exception can be observed
for the POI classification scenario: the gaps between the percentages can be attributed to the
large differences between the baselines of the validation-set and the test-set (detailed accuracy
results on the test-set can be found in Appendix C, for each classification model).
   The good results achieved on the test-set indicate that classifiers trained with a restricted set
of features are able to effectively detect an engagement potential in unseen data. This confirms
that the most important features, selected by the feature selection algorithm, can indeed
constitute a set of fundamental aspects to detect the engagement potential of a sentence.


5. Feature analysis
In order to understand which linguistic and acoustic features are the most relevant to detect an
engagement potential, it is necessary to analyse the subset of features with which the classifiers
performed best. Specifically, it is possible to define what percentage of each feature category (e.g.
linguistic:morpho-syntactic, acoustic:spectral) was included by the feature selection algorithm
among the most relevant. It is important to specify that only the feature categories selected
by both classifiers were considered, i.e. only those that proved to be highly relevant to the
classification process, regardless of the classifier exploited.
Figure 3: Average percentage (between the two classifiers) in which each feature category was included
in the top 10% of the features, based on the ranking developed by the classifier models (only the feature
categories selected by both the classifiers are considered).


   Considering the most important 10% of all features (on the basis of the algorithm ranking),
we can observe that acoustic features seem to be the most important for the classification of
engaging and non-engaging sentences: the average percentage of acoustic features (9.34%)
included in the set of most important selected features is about 1.57 times higher than the
average percentage of linguistic features (5.94%). We can infer that acoustic features play a
significantly more important role than linguistic features.
   A closer look at the selected feature categories shows what percentage of each was included
among the most important ones by the classifiers. As shown in Figure 3, Raw Text Properties
(i.e. sentence and word length) are the most relevant group of features. The other linguistic
features included in the selection regard syntactic relations and the order of elements, but those
are selected only for 7.89% of the total. The rest of the selected features all come from the
acoustic modality, and are specifically related to the sound spectrum. In this regard, it is possible
to observe that the timbre of the speech, its amplitude and its richness in the frequency range
are decisive factors in the maintenance of attention. However, it is also worth noting that the
first group of acoustic features (RMS energy and zero-crossing rate) belongs to the prosodic
group; the rhythmic features of the voice, which highlight traits such as irony and sarcasm,
therefore still play a strong role. Voice quality aspects, on the other hand, do not seem to be
particularly implicated in the classification process. Detailed percentages of included features
per modality can be found in Appendix D.
6. Conclusions
The implemented Machine Learning models were able to detect an engagement potential in
language in multiple scenarios and on unknown data. It emerged that certain phenomena and
features of language, mainly acoustic in nature (such as prosodic or spectral ones), play a key
role in the classification process, and thus in assessing the engagement potential of an uttered
sentence.
   Ultimately, it is possible to observe that all the results were achieved by fully exploiting
the potential of a restricted set of features (between 10 and 32% of the total sets). This study,
therefore, also aims to show to what extent optimised Machine Learning models, combined with
a selected and optimised data representation (i.e. relevant features), can succeed in achieving
better accuracy results. The stringent feature selection, moreover, proved to be crucial in
understanding which aspects, among the various linguistic and acoustic ones considered, play a
critical role in making a sentence engaging or not. On the acoustic level, prosodic and
spectrum-related features play a major role in discriminating engaging and non-engaging
sentences, while on the linguistic level raw text properties give the main contribution. We can
conclude that the attention of the listener(s), and thus the perceivable engagement, can be
driven by acoustic and linguistic features, and this is why we studied the phenomenon of
engagement by means of fully explainable classification models.

6.1. Future developments
One of the critical issues of this study undoubtedly concerns the size of the dataset, which can be
considered relatively small (1,114 sentences) and not very varied (the 3 visits are led by the
same guide, thus all the data regard a single speaker). To conduct an even more precise and
accurate study, and to generalise the results, it would be necessary to increase the size of the
dataset by including data coming from more guides and groups of visitors.
   Another important enrichment of the dataset could involve visual data. Currently, we
exploited the visual part of the dataset exclusively for the annotation of the variation of at-
tention/engagement, but it would be interesting to explore visual features in the classification
process, and to measure the performance of a model that considers linguistic, acoustic, and
visual data to predict the engagement potential of a communication act.


References
 [1] D. Nozza, L. Passaro, M. Polignano, Preface to the Sixth Workshop on Natural Language
     for Artificial Intelligence (NL4AI), in: D. Nozza, L. C. Passaro, M. Polignano (Eds.), Pro-
     ceedings of the Sixth Workshop on Natural Language for Artificial Intelligence (NL4AI
     2022) co-located with 21th International Conference of the Italian Association for Artificial
     Intelligence (AI*IA 2022), November 30, 2022, CEUR-WS.org, 2022.
 [2] L. Floridi, M. Chiriatti, Gpt-3: Its nature, scope, limits, and consequences, Minds and
     Machines 30 (2020) 681–694.
 [3] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
     et al., Exploring the limits of transfer learning with a unified text-to-text transformer., J.
     Mach. Learn. Res. 21 (2020) 1–67.
 [4] P. Goldberg, Ö. Sümer, K. Stürmer, W. Wagner, R. Göllner, P. Gerjets, E. Kasneci,
     U. Trautwein, Attentive or Not? Toward a Machine Learning Approach to Assessing
     Students’ Visible Engagement in Classroom Instruction, Educational Psychology Review
     35 (2019) 463–23.
 [5] C. Regenbogen, D. A. Schneider, R. E. Gur, F. Schneider, U. Habel, T. Kellermann, Multi-
     modal human communication — Targeting facial expressions, speech content and prosody,
     NeuroImage 60 (2012) 2346–2356.
 [6] C. Oertel, G. Castellano, M. Chetouani, J. Nasir, M. Obaid, C. Pelachaud, C. Peters, En-
     gagement in human-agent interaction: An overview, Frontiers in Robotics and AI 7 (2020)
     92.
 [7] I. Poggi, Mind, hands, face and body: a goal and belief view of multimodal communication,
     Weidler, 2007.
 [8] C. L. Sidner, C. Lee, C. D. Kidd, N. Lesh, C. Rich, Explorations in engagement for humans
     and robots, Artificial Intelligence 166 (2005) 140–164.
 [9] G. Castellano, A. Pereira, I. Leite, A. Paiva, P. W. McOwan, Detecting user engagement
     with a robot companion using task and social interaction-based features, in: Proceedings
     of the 2009 international conference on Multimodal interfaces, 2009, pp. 119–126.
[10] J. Sanghvi, G. Castellano, I. Leite, A. Pereira, P. W. McOwan, A. Paiva, Automatic analysis
     of affective postures and body motion to detect engagement with a game companion,
     in: Proceedings of the 6th International Conference on Human-Robot Interaction, HRI
     ’11, Association for Computing Machinery, New York, NY, USA, 2011, p. 305–312. URL:
     https://doi.org/10.1145/1957656.1957781. doi:10.1145/1957656.1957781.
[11] A. Ben-Youssef, C. Clavel, S. Essid, Early detection of user engagement breakdown in
     spontaneous human-humanoid interaction, IEEE Transactions on Affective Computing 12
     (2021) 776–787. doi:10.1109/TAFFC.2019.2898399.
[12] D. Gatica-Perez, L. McCowan, D. Zhang, S. Bengio, Detecting group interest-level in
     meetings, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and
     Signal Processing (ICASSP ’05), volume 1, IEEE, 2005, pp. I–489.
[13] C. Oertel, S. Scherer, N. Campbell, On the use of multimodal cues for the prediction of
     degrees of involvement in spontaneous conversation., in: Proceedings of the Annual
     Conference of the International Speech Communication Association, INTERSPEECH, 2011,
     pp. 1541–1544.
[14] J. A. Fredricks, P. C. Blumenfeld, A. H. Paris, School engagement: Potential of the concept,
     state of the evidence, Review of educational research 74 (2004) 59–109.
[15] F. Cutugno, F. Dell’Orletta, I. Poggi, R. Savy, A. Sorgente, The chrome manifesto: integrating
     multimodal data into cultural heritage resources, Computational Linguistics CLiC-it 2018
     (2018) 155.
[16] A. Origlia, R. Savy, I. Poggi, F. Cutugno, I. Alfano, F. D’Errico, L. Vincze, V. Cataldo, An
     audiovisual corpus of guided tours in cultural sites: Data collection protocols in the chrome
     project, in: 2018 AVI-CH Workshop on Advanced Visual Interfaces for Cultural Heritage,
     volume 2091, 2018, pp. 1–4.
[17] J. A. Fredricks, P. C. Blumenfeld, A. H. Paris, School engagement: Potential of the concept,
     state of the evidence, Review of educational research 74 (2004) 59–109. Publisher: Sage
     Publications Sage CA: Thousand Oaks, CA.
[18] P. Goldberg, O. Sümer, K. Stürmer, W. Wagner, R. Göllner, P. Gerjets, E. Kasneci,
     U. Trautwein, Attentive or Not? Toward a Machine Learning Approach to Assess-
     ing Students’ Visible Engagement in Classroom Instruction, Educational Psychol-
     ogy Review 35 (2019) 463–23. URL: http://link.springer.com/10.1007/s10648-019-09514-z.
     doi:10.1007/s10648-019-09514-z, publisher: Springer US.
[19] D. Melhart, A. Liapis, G. N. Yannakakis, Pagan: Video affect annotation made easy, in:
     2019 8th International Conference on Affective Computing and Intelligent Interaction
     (ACII), IEEE, 2019, pp. 130–136.
[20] A. A. Ravelli, A. Origlia, F. Dell’Orletta, Exploring attention in a multimodal corpus of
     guided tours, in: Computational Linguistics CLiC-it 2020, 2020, p. 353.
[21] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, H. Sloetjes, Elan: a professional
     framework for multimodality research, in: Proc. of the International Conference on
     Language Resources and Evaluation (LREC), 2006, pp. 1556–1559.
[22] S. Izre’el, H. Mello, A. Panunzi, T. Raso, In Search of Basic Units of Spoken Language: A
     Corpus-Driven Approach, volume 94 of Studies in Corpus Linguistics, John Benjamins
     Publishing Company, Amsterdam, 2020. doi:10.1075/scl.94. ISSN: 1388-0373.
[23] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[24] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan,
     N. Dawalatabad, A. Heba, J. Zhong, et al., Speechbrain: A general-purpose speech toolkit,
     arXiv preprint arXiv:2106.04624 (2021).
[25] A. Baevski, Y. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised
     learning of speech representations, Advances in Neural Information Processing Systems
     33 (2020) 12449–12460.
[26] D. Brunato, A. Cimino, F. Dell’Orletta, G. Venturi, S. Montemagni, Profiling-ud: a tool
     for linguistic profiling of texts, in: Proceedings of The 12th Language Resources and
     Evaluation Conference, 2020, pp. 7145–7151.
[27] S. Montemagni, Tecnologie linguistico-computazionali e monitoraggio della lingua italiana,
     Studi Italiani di Linguistica Teorica e Applicata (SILTA) XLII (2013) 145–172.
[28] M. Straka, J. Hajič, J. Straková, UDPipe: Trainable pipeline for processing CoNLL-U files
     performing tokenization, morphological analysis, POS tagging and parsing, in: Proceedings
     of the Tenth International Conference on Language Resources and Evaluation (LREC’16),
     European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 4290–4297.
     URL: https://aclanthology.org/L16-1680.
[29] F. Eyben, M. Wöllmer, B. Schuller, Opensmile: the munich versatile and fast open-source
     audio feature extractor, in: Proceedings of the 18th ACM international conference on
     Multimedia, 2010, pp. 1459–1462.
[30] J. Gareth, W. Daniela, H. Trevor, T. Robert, An introduction to statistical learning: with
     applications in R, Spinger, 2013.
[31] G. C. Cawley, N. L. Talbot, On over-fitting in model selection and subsequent selection
     bias in performance evaluation, The Journal of Machine Learning Research 11 (2010)
     2079–2107.
[32] G. Seni, J. F. Elder, Ensemble methods in data mining: improving accuracy through
     combining predictions, Synthesis lectures on data mining and knowledge discovery 2
     (2010) 1–126.
[33] D. M. Allen, The relationship between variable selection and data agumentation and a
     method for prediction, technometrics 16 (1974) 125–127.
[34] M. Stone, Cross-validatory choice and assessment of statistical predictions, Journal of the
     royal statistical society: Series B (Methodological) 36 (1974) 111–133.
[35] M. Stone, An asymptotic equivalence of choice of model by cross-validation and akaike’s
     criterion, Journal of the Royal Statistical Society: Series B (Methodological) 39 (1977)
     44–47.
[36] N. Cristianini, J. Shawe-Taylor, et al., An introduction to support vector machines and
     other kernel-based learning methods, Cambridge university press, 2000.
[37] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers,
     in: Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp.
     144–152.
[38] T. K. Ho, Random decision forests, in: Proceedings of 3rd international conference on
     document analysis and recognition, volume 1, IEEE, 1995, pp. 278–282.
[39] L. Breiman, Random forests, Machine learning 45 (2001) 5–32.
[40] A. L. Blum, P. Langley, Selection of relevant features and examples in machine learning,
     Artificial intelligence 97 (1997) 245–271.
[41] P. S. Bradley, O. L. Mangasarian, Feature selection via concave minimization and support
     vector machines., in: ICML, volume 98, Citeseer, 1998, pp. 82–90.
[42] P. S. Bradley, O. L. Mangasarian, W. N. Street, Feature selection via mathematical program-
     ming, INFORMS Journal on Computing 10 (1998) 209–217.
[43] F. Hutter, L. Kotthoff, J. Vanschoren, Automated machine learning: methods, systems,
     challenges, Springer Nature, 2019.
[44] M. Claesen, B. De Moor, Hyperparameter search in machine learning, arXiv preprint
     arXiv:1502.02127 (2015).
[45] J. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-parameter optimization,
     Advances in neural information processing systems 24 (2011).
[46] J. Bergstra, Y. Bengio, Random search for hyper-parameter optimization., Journal of
     machine learning research 13 (2012).
[47] F. Cucker, S. Smale, et al., Best choices for regularization parameters in learning theory: on
     the bias-variance problem, Foundations of computational Mathematics 2 (2002) 413–428.
[48] J. Han, J. Pei, H. Tong, Data mining: concepts and techniques, Morgan kaufmann, 2022.
[49] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, Journal of machine
     learning research 3 (2003) 1157–1182.
A. Setting of classifiers and data preprocessing
This appendix shows the detailed tables containing the Accuracy results obtained during the
first phase of experimentation, where the classifier models and dataset were configured and
optimised. Specifically, the percentages obtained from the cross-comparison of various hyper-
parameter configurations (Section 3.2.1), for both classifier models, and of the two engagement
aggregation techniques implemented in the study (Section 2.3) are illustrated. The results of
the comparison between the various data normalisation techniques tested (Section 3.3), and
between the engagement thresholds used (Section 2.3.3) are also shown. Each table shows the
baseline percentages (based on the most frequent class of engagement, i.e. 0 for non-engaging
or 1 for engaging) for each classification scenario, for each engagement threshold, and for each
engagement aggregation technique (experiments shown in these Tables were all performed in
the RaP classification scenario).

Table 3
Accuracies with different regularisation parameter values used by the Linear Support Vector Classifier
(RaP scenario).
                                        Subtraction        Summation         MEAN
                          0.001         58.59 (+5.84)      59.27 (+6.63)     58.93 (+6.23)
                          0.01          58.23 (+5.48)      58.79 (+6.15)     58.51 (+5.82)
                          0.10          57.15 (+4.40)      57.38 (+4.74)     57.27 (+4.57)
                          1.00          55.35 (+2.60)      56.68 (+4.04)     56.01 (+3.32)
                          MEAN          57.23 (+4.58)      58.03 (+5.39)     57.63 (+4.94)
                          BASELINE      52.75              52.64             52.69



Table 4
Accuracies with different Decision-Tree quantities used by the Random Forest Classifier (RaP scenario).
                                        Subtraction        Summation         MEAN
                          10            60.36 (+7.61)      58.66 (+6.03)     59.51 (+6.82)
                          100           60.14 (+7.39)      61.63 (+8.99)     60.89 (+8.19)
                          1000          61.93 (+9.18)      63.14 (+10.50)    62.53 (+9.84)
                          MEAN          60.81 (+8.06)      61.14 (+8.50)     60.97 (+8.28)
                          BASELINE      52.75              52.64             52.69
Table 5
Comparison of normalization techniques: accuracies of classifiers on RaP scenario depending on data
normalization (using only aggregation by summation technique, and engagement threshold 0).
                              StaS     MAS      MiMaS (-1,1)   MiMaS (0,1)   RoS      PoT      QuT
             Linear-SVC       59.93    59.26    59.03          59.03         58.13    58.92    60.60
             Random-Forest    63.19    63.41    63.41          63.41         63.30    61.95    63.30
             MEAN             61.55    61.33    61.22          61.22         60.71    60.44    61.95
             BASELINE         52.64

Table 6
Comparison of engagement thresholds on RaP scenario (using only aggregation by summation tech-
nique).
                                              -1                 0                 1
                            Linear-SVC        57.35 (-23.12)     60.60 (+7.96)     58.47 (-4.04)
                            Random-Forest     80.47 (+0.00)      63.30 (+10.66)    64.64 (+2.13)
                            MEAN              68.91 (-11.56)     61.95 (+9.31)     61.55 (-0.96)
                            BASELINE          80.47              52.64             62.51


B. Accuracies in Cross-Validation
This appendix details the averages of all Accuracy results obtained during Cross-Validation, with
a cross-comparison between the feature combinations used and the classification scenarios. The
tables are divided by individual classifier model, and show the accuracies obtained both before
and after feature selection (see section 3.5). Each table also shows the baseline percentages
(based on the most frequent class) for each classification scenario and for each modality, used
for comparison with the results obtained.

Table 7
Accuracy values obtained by comparison between models and classification scenarios, before feature
selection, with Linear-SVC.
                                      RaP               Vis               POI               MEAN
                       Multimodal     60.60 (+7.96)     59.55 (+5.96)     58.02 (+14.12)    59.39 (+9.35)
                       Linguistic     57.46 (+4.82)     58.08 (+4.49)     58.44 (+14.54)    57.99 (+7.95)
                       Acoustic       60.27 (+7.63)     59.59 (+6.00)     59.58 (+15.68)    59.81 (+9.77)
                       MEAN           59.44 (+6.80)     59.07 (+5.48)     58.68 (+14.78)    59.06 (+9.02)
                       BASELINE       52.64             53.59             43.90             50.04
Table 8
Cross-validation accuracy values for each feature combination and classification scenario, before feature selection, with Random-Forest.
                       RaP                Vis                POI                MEAN
Multimodal        63.19 (+10.55)     59.45  (+5.86)     60.46 (+16.56)     61.03 (+10.99)
Linguistic        56.22  (+3.58)     56.96  (+3.37)     54.76 (+10.86)     55.98  (+5.94)
Acoustic          65.53 (+12.89)     59.89  (+6.30)     59.86 (+15.96)     61.09 (+11.05)
MEAN              60.98  (+8.34)     58.77  (+5.18)     58.36 (+14.46)     59.37  (+9.33)
BASELINE          52.64              53.59              43.90              50.04



Table 9
Cross-validation accuracy values for each feature combination and classification scenario, after feature selection, with Linear-SVC.
                       RaP                Vis                POI                MEAN
Multimodal        64.42 (+11.78)     64.80 (+11.21)     65.21 (+21.31)     64.81 (+14.77)
Linguistic        59.92  (+7.28)     63.21  (+9.62)     61.04 (+17.14)     61.39 (+11.35)
Acoustic          64.20 (+11.56)     64.66 (+11.07)     65.42 (+21.52)     64.76 (+14.72)
MEAN              62.85 (+10.21)     64.22 (+10.63)     63.89 (+19.99)     63.65 (+13.61)
BASELINE          52.64              53.59              43.90              50.04



Table 10
Cross-validation accuracy values for each feature combination and classification scenario, after feature selection, with Random-Forest.
                       RaP                Vis                POI                MEAN
Multimodal        67.46 (+14.82)     65.51 (+11.92)     66.17 (+22.24)     66.38 (+16.34)
Linguistic        57.46  (+4.82)     59.45  (+5.86)     55.98 (+12.08)     57.63  (+7.59)
Acoustic          67.68 (+15.04)     64.59 (+11.00)     65.91 (+22.01)     66.06 (+16.02)
MEAN              64.20 (+11.56)     63.18  (+9.59)     62.69 (+18.79)     63.36 (+13.32)
BASELINE          52.64              53.59              43.90              50.04
C. Accuracies on test-set
This appendix reports the accuracy percentages obtained on the test-set in the final evaluation of the classifier models. There is one table for each of the two classification models, cross-comparing the feature combinations used with the classification scenarios. Each table also includes the baseline accuracy (based on the most frequent class) for each classification scenario, used as a reference for the results; differences from the baseline are given in parentheses.
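
The sketch below illustrates, on synthetic stand-in data, the kind of final evaluation these tables report: the model is fitted on the training split and scored once on the held-out test set, together with a most-frequent-class baseline on the same split. The split size and model settings are assumptions, not the authors' actual configuration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in for the feature matrix and binary engagement labels
    X, y = make_classification(n_samples=1500, n_features=100, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)

    # Fit on the training split, evaluate once on the held-out test set
    forest = RandomForestClassifier(n_estimators=1000, random_state=0).fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, forest.predict(X_test))

    # Most-frequent-class baseline computed on the same test split
    majority_class = np.bincount(y_train).argmax()
    baseline = accuracy_score(y_test, np.full_like(y_test, majority_class))

    print(f"test accuracy: {test_accuracy:.2%}  baseline: {baseline:.2%}")
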

Table 11
Accuracy values on the test-set for each feature combination and classification scenario, with Linear-SVC.
                       RaP                Vis                POI                MEAN
Multimodal        64.57 (+12.10)     54.75 (+22.79)     57.68  (+6.00)     59.00 (+13.63)
Linguistic        59.64  (+7.17)     49.68 (+17.72)     57.68  (+6.00)     55.67 (+10.30)
Acoustic          65.92 (+13.45)     53.48 (+21.52)     58.05  (+6.37)     59.15 (+13.78)
MEAN              63.38 (+10.91)     52.64 (+20.68)     57.80  (+6.12)     57.94 (+12.57)
BASELINE          52.47              31.96              51.68              45.37



Table 12
Accuracy values on the test-set for each feature combination and classification scenario, with Random-Forest.
                       RaP                Vis                POI                MEAN
Multimodal        65.47 (+13.00)     55.70 (+23.73)     64.42 (+12.73)     61.86 (+16.49)
Linguistic        57.85  (+5.38)     48.42 (+16.46)     53.56  (+1.87)     53.28  (+7.91)
Acoustic          66.82 (+14.35)     55.70 (+23.74)     62.92 (+11.24)     61.81 (+16.44)
MEAN              63.38 (+10.91)     53.27 (+21.31)     60.30  (+8.62)     58.98 (+13.61)
BASELINE          52.47              31.96              51.68              45.37



D. Feature selection results
This appendix reports the analysis of the features selected by the classifier models through the feature selection algorithm. The tables show, for each feature category and subcategory, the percentage of its features that falls within the top 10% of the ranking produced by the classification models. Note that the reported percentages are averaged over the values obtained with Linear-SVC and Random-Forest.
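To make the computation behind these percentages explicit, the sketch below ranks a set of features with recursive feature elimination, keeps the top 10%, and reports, for each category, the share of its features that ends up in that subset. The estimator settings, the synthetic data and the feature-to-category mapping are placeholders; in the paper the percentages are additionally averaged over the Linear-SVC and Random-Forest rankings.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    # Synthetic stand-in: 100 features, each assigned to a placeholder category
    X, y = make_classification(n_samples=1500, n_features=100, random_state=0)
    categories = np.array(["prosodic"] * 20 + ["spectral"] * 50 + ["linguistic"] * 30)

    # Rank the features with recursive feature elimination and keep the top 10%
    rfe = RFE(LinearSVC(max_iter=10000), n_features_to_select=10).fit(X, y)
    top_mask = rfe.support_  # True for the 10 top-ranked features

    # Share of each category's features that falls within the top 10%
    for category in np.unique(categories):
        share = 100 * top_mask[categories == category].mean()
        print(f"{category:12s} {share:5.2f}%")
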
Table 13
Percentage of inclusion of linguistic feature categories among the top 10% of the features, based on the ranking produced by the classification models (average between Linear-SVC and Random-Forest).
Linguistic feature category           RaP      Vis      POI     MEAN
Raw Text Properties                 16.66    33.33    33.33    27.77
Morphosyntactic information          2.25     3.82     3.82     3.30
Verbal Predicate Structure          13.63    13.63     0.00     9.09
Parsed Tree Structures              12.50    12.50    25.00    16.67
Syntactic Relations                  6.58     6.58    10.52     7.89
Use of Subordination                15.00    15.00     5.00    11.67




Table 14
Percentage of inclusion of acoustic feature categories among the top 10% of the features, based on the ranking produced by the classification models (average between Linear-SVC and Random-Forest).
ACOUSTIC FEATURE CATEGORIES                            RaP      Vis      POI     MEAN
PROSODICS                                            13.75    10.00    15.00    12.91
F0 (SHS and Viterbi smoothing)                       10.00    10.00     0.00     6.66
Sum of auditory spectrum (loudness)                  20.00    10.00    20.00    16.66
Sum of RASTA-style filtered auditory spectrum         0.00     0.00    10.00     3.33
RMS energy and zero-crossing rate                    25.00    20.00    30.00    25.00
SPECTRAL                                             13.23    10.44    11.59    11.77
RASTA-style auditory spectrum, bands 1-26             6.92     9.61     8.07     8.20
MFCC 1-14                                            20.71    14.28    16.43    17.14
Spectral energy 250-650 Hz, 1-4 kHz                  20.00    20.00    20.00    20.00
Spectral roll-off point 0.25, 0.50, 0.75, 0.90        0.00     2.50     0.00     0.83
Spectral flux, centroid, entropy, slope              15.00    15.00    15.00    15.00
Psychoacoustic sharpness, harmonicity                20.00     5.00    15.00    13.33
Spectral variance, skewness, kurtosis                10.00     6.66     6.67     7.78
VOICE QUALITY                                         7.75     7.50     8.75     7.91
Voicing probability                                  10.00     0.00    10.00     6.67
Log. HNR, Jitter (local, delta), Shimmer (local)      5.00    15.00     7.50     9.16