 No-Audio Multimodal Speech Detection Task at MediaEval 2020
                                             Laura Cabrera-Quiros1, Jose Vargas2, Hayley Hung2
                                                 lcabrera@itcr.ac.cr,{j.d.vargasquiros,h.hung}@tudelft.nl
                                                     1Instituto Tecnológico de Costa Rica, Costa Rica
                                                       2Delft University of Technology, Netherlands


ABSTRACT
This overview paper provides a description of the No-Audio multimodal speech detection task for MediaEval 2020. As in the previous two editions, the participants of this task are encouraged to estimate the speaking status (i.e. whether a person is speaking or not) of individuals interacting freely during a crowded mingle event, from multimodal data. In contrast to conventional speech detection approaches, no audio is used for this task. Instead, the proposed automatic estimation system must exploit the natural human movements that accompany speech, captured by cameras and wearable sensors. Task participants are provided with cropped videos of individuals while interacting, captured by an overhead camera, and the tri-axial acceleration of each individual throughout the event, captured with a single badge-like device hung around the neck. This year's edition of the task also focuses on investigating possible reasons for interpersonal differences in the performances obtained.

1 INTRODUCTION
Speaking status is one of the key signals used for studying conversational dynamics in face-to-face settings [10]. From the speaking status of multiple people one can also derive speaking turns, and other features that have proven beneficial for estimating many different social constructs such as dominance [8] or cohesion [7]. Overall, automated analysis of conversational dynamics in large unstructured social gatherings is an under-explored problem despite the relevance of such events [11], and automated speech detection is one of its key components.

The majority of work on speaking status detection focuses on the audio signal captured by microphones. However, most unstructured social gatherings, such as parties or cocktail events, have inherent background noise, and collecting good-quality audio signals requires participants to wear uncomfortable and intrusive equipment. Recording audio also risks being perceived as an invasion of privacy, since it gives access to the precise verbal contents of the conversation, further limiting the natural behavior of the individuals involved. Because of these restrictions, recording audio in such settings is challenging.

As a suitable alternative, the main goal of this task is to estimate a person's speaking status using video and wearable acceleration data from a smart ID badge hung around the neck, instead of audio. Such alternative modalities are more privacy-preserving, and easier to use and replicate in crowded environments such as conferences, networking events, or organizational settings.

Body movements such as gesturing tend to co-occur with speaking, as has been well documented by social scientists [9]. Thus, an automatic estimation system should exploit the natural human movements that accompany speech. This task is motivated by such insights, and by past work which estimated speaking status from a single body-worn tri-axial accelerometer [5, 6] and from video [4].

Despite many efforts, one of the major challenges for these alternative approaches has been achieving estimation performance competitive with audio-based systems. Moreover, results from past editions of this task have shown a significant difference in the performance of different individuals, with lower performances for a particular subset of them (failure cases) that is not yet fully understood.

2 TASK DETAILS

2.1 Unimodal estimation of speaking status
Participants are encouraged to design and implement separate speaking status estimators for each modality. However, baseline approaches for each modality are provided, in case participants prefer to focus on improving an estimator for only one of the modalities, or on the fusion technique. The acceleration baseline implements the logistic regression in [5], and the video baseline employs dense trajectories and multiple instance learning, as explained in [3].

For the video modality, the input will be a video of a person interacting freely in a social gathering (see Figure 1), and an estimation of that person's speaking status (speaking/non-speaking) should be provided every second. For the wearable modality, the method will have the wearable tri-axial acceleration signal of a person as input and must also return a speaking status estimation every second.

2.2 Multimodal estimation of speaking status
For this subtask, teams must provide an estimation of speaking status every second by exploiting both modalities together. Teams can use any type of fusion method they see fit [1]. The goal is to leverage the complementary nature of the modalities to better estimate the speaking status. Thus, teams are encouraged to go beyond basic fusion and carefully consider the impact of each modality on the estimation.

2.3 Analysis of failure test cases
As a new addition for this year's edition, teams must analyze the differences in performance on the test set, focusing on the three subjects with the lowest performances, and hypothesize about the reasons their method underperforms for these subjects. Participants are encouraged to consider the circumstances of the subjects (e.g. occlusion) or interpersonal differences that could explain such dissimilarities.

3 DATA
A subset of the MatchNMingle dataset¹ [2] is used for this task. It contains data for 70 people who attended one of three separate

¹ MatchNMingle is openly available for research purposes under an EULA at http://matchmakers.ewi.tudelft.nl/matchnmingle/pmwiki/

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, 14-15 December 2020, Online
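As a concrete sketch of the wearable subtask in Section 2.1: window the 20 Hz tri-axial signal into 1-second segments, extract features, and score each second with a classifier. The specific features and the synthetic data below are illustrative assumptions only; the actual task baseline follows [5].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FS = 20  # accelerometer sampling rate in Hz, as in the task data

def window_features(acc, fs=FS):
    """Split a (T, 3) tri-axial acceleration signal into 1-second
    windows and compute simple per-axis statistics for each window."""
    n_win = acc.shape[0] // fs
    feats = []
    for w in range(n_win):
        win = acc[w * fs:(w + 1) * fs]  # one second of samples, shape (fs, 3)
        feats.append(np.concatenate([
            win.mean(axis=0),                           # mean per axis
            win.std(axis=0),                            # movement intensity
            np.abs(np.diff(win, axis=0)).mean(axis=0),  # jerkiness proxy
        ]))
    return np.asarray(feats)  # shape (n_win, 9)

# Synthetic stand-in for the real data: "speaking" seconds are given
# extra movement energy so the example is self-contained and runnable.
rng = np.random.default_rng(0)
n_sec = 200
labels = rng.integers(0, 2, n_sec)  # one binary speaking label per second
scale = 1 + 2 * np.repeat(labels, FS).astype(float)
acc = rng.normal(0.0, scale[:, None], (n_sec * FS, 3))

X = window_features(acc)
clf = LogisticRegression(max_iter=1000).fit(X, labels)
scores = clf.predict_proba(X)[:, 1]  # continuous score per second
```

The continuous per-second scores, rather than thresholded labels, are what the evaluation in Section 4 expects.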

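The late fusion allowed in Section 2.2 can be sketched minimally as a weighted average of per-modality scores, evaluated with the ROC-AUC metric of Section 4. The weights and synthetic scores below are illustrative assumptions, not the task baseline.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic per-second ground truth and per-modality scores for one
# subject (30 minutes at one label per second); each modality is
# informative but corrupted by its own independent noise.
y = rng.integers(0, 2, 1800)
video_scores = y + rng.normal(0.0, 0.8, y.size)
accel_scores = y + rng.normal(0.0, 0.8, y.size)

# Late fusion: average the continuous per-modality scores. The equal
# 0.5 weighting is an illustrative choice, not a tuned value.
fused = 0.5 * video_scores + 0.5 * accel_scores

# ROC-AUC, the task metric, is computed on the continuous scores;
# thresholding them into hard labels would discard ranking information.
auc_video = roc_auc_score(y, video_scores)
auc_accel = roc_auc_score(y, accel_scores)
auc_fused = roc_auc_score(y, fused)
```

Because the two noise sources here are independent, averaging reduces their variance and the fused AUC exceeds either unimodal AUC; real fusion methods [1] can be far more sophisticated, e.g. weighting the more reliable modality per situation.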

Figure 1: Alternative modalities to audio used for the task. Left: Individual video of each participant while interacting freely. Right: Wearable tri-axial acceleration recorded by a device hung around the neck.

mingle events for over 45 minutes. To eliminate the effects of acclimatization, only 30 minutes in the middle of each event are used. Subjects were separated using stratified sampling to create the train (54 subjects) and test (16 subjects) sets. Stratification was done with various criteria to ensure balanced distributions in both sets for speaking status, gender, event day, and level of occlusion in the video.² An additional segment of the data was created for the optional subject-specific evaluation of the task (see Section 4). While the dataset used this year is the same as in previous editions of the challenge, making comparisons between solutions from different years possible, focus is given to the differences shown by the 16 subjects in the test set.

Videos were captured from an overhead view at 20 FPS. The rectangular (bounding box) area around each subject has been cropped, such that a video is provided per person. Important challenges in the automatic analysis of this data include the significant amount of cross-contamination and occlusion, both self-occlusion and occlusion by other subjects, due to the crowded nature of the event (a cocktail party).

Subjects also wore a badge-like body-worn accelerometer (see Figure 1), recording tri-axial acceleration at 20 Hz. These acceleration readings were processed via whitening applied per axis. All video and wearable data are synchronized.

Finally, binary speaking status (speaking/non-speaking) was annotated by 3 different annotators. Inter-annotator agreement was calculated on a 2-minute segment of the data, resulting in a Fleiss' kappa coefficient of 0.55.

² Occlusion levels can be requested if needed for the training set.

4 EVALUATION
The Area Under the ROC Curve (ROC-AUC) is used as the evaluation metric, since it is robust against the class imbalance present in our scenario. Therefore, participants need to submit continuous prediction scores (posterior probabilities, distances to the separating hyperplane, etc.) obtained by running their method on the evaluation set. These scores will be compared against the test labels, which are not available to participants.

Required evaluation. For the unimodal and multimodal estimations, each team must provide up to 5 runs with their scores for a person's speaking status. As mentioned, the training set does not contain any data from participants in the test set, to achieve person-independent results.

Optional evaluation. Teams may optionally submit up to 5 runs (per person) using person-dependent training. To do so, a separate 5-minute interval for each person in the test set is provided. Thus, samples and labels from the same subject can be used to train or fine-tune a method before testing on that specific test subject's data, which is temporally adjacent to the training samples. A method would be expected to perform better when trained or fine-tuned on the target person rather than on other people.

5 DISCUSSION AND OUTLOOK
With this task, we aim to support the study of speaking status detection in the wild using alternative modalities to audio. We aim to learn more about the connection between speaking and body movements, expecting that in the future this will yield valuable insights for both the social science and multimedia communities.

Participation in previous editions of the task has been limited, with only small improvements over the baseline. We believe this is due to the variety of ways in which this task is atypical. For example, the connection between speech and body movements has been found to be person-specific [5]. Additionally, the interaction between the two modalities of interest (chest acceleration and video) is not traditionally explored, i.e. the combination of these two modalities is not common. This leaves open opportunities to explore their complementarity, to better understand in which situations one modality is more reliable than the other, and to develop or apply appropriate fusion strategies.

Moreover, differences in performance between test subjects were consistently found in previous editions, further supporting past research [5]. Thus, this year participants are encouraged to focus on such failure cases and hypothesize about the reasons for such dissimilarities.

We are reaching out to different communities (affective computing, multimedia, computer vision, and speech), as we believe each of these communities can bring its own expertise to the task. In the following years, as well as augmenting the data, we aim to include and explore the implications of the spatial social component of the mingle (e.g. F-formations) on speaking status detection.

ACKNOWLEDGMENTS
This task is partially supported by the Netherlands Organization for Scientific Research (NWO) under project number 639.022.606.

REFERENCES
 [1] Pradeep K Atrey, M Anwar Hossain, Abdulmotaleb El Saddik, and Mohan S Kankanhalli. 2010. Multimodal fusion for multimedia analysis: a survey. Multimedia Systems 16, 6 (2010), 345–379.
 [2] Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, and Hayley Hung. 2018. The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. IEEE Transactions on Affective Computing (2018).
 [3] Laura Cabrera-Quiros, David MJ Tax, and Hayley Hung. 2019. Gestures in-the-wild: detecting conversational hand gestures in crowded scenes using a multimodal fusion of bags of video trajectories and body worn acceleration. IEEE Transactions on Multimedia (2019).
 [4] Marco Cristani, Anna Pesarin, Alessandro Vinciarelli, Marco Crocco,
     and Vittorio Murino. 2011. Look at who’s talking: Voice activity detec-
     tion by automated gesture analysis. In International Joint Conference
     on Ambient Intelligence. Springer, 72–80.
 [5] Ekin Gedik and Hayley Hung. 2017. Personalised models for speech
     detection from body movements using transductive parameter transfer.
     Personal and Ubiquitous Computing 21, 4 (2017), 723–737.
 [6] Hayley Hung, Gwenn Englebienne, and Jeroen Kools. 2013. Classify-
     ing social actions with a single accelerometer. In Proceedings of the
     2013 ACM international joint conference on Pervasive and ubiquitous
     computing. ACM, 207–210.
 [7] Hayley Hung and Daniel Gatica-Perez. 2010. Estimating cohesion in
     small groups using audio-visual nonverbal behavior. IEEE Transactions
     on Multimedia 12, 6 (2010), 563–575.
 [8] Dinesh Babu Jayagopi, Hayley Hung, Chuohao Yeo, and Daniel Gatica-
     Perez. 2009. Modeling Dominance in Group Conversations Using
     Nonverbal Activity Cues. IEEE Transactions on Audio, Speech, and
     Language Processing 17, 3 (2009), 501–513.
 [9] David McNeill. 2000. Language and gesture. Vol. 2. Cambridge Univer-
     sity Press.
[10] Alessandro Vinciarelli, Maja Pantic, Dirk Heylen, Catherine Pelachaud,
     Isabella Poggi, Francesca D’Errico, and Marc Schroeder. 2012. Bridging
     the gap between social animal and unsocial machine: A survey of social
     signal processing. IEEE Transactions on Affective Computing 3, 1 (2012),
     69–87.
[11] Hans-Georg Wolff and Klaus Moser. 2009. Effects of networking on
     career success: a longitudinal study. Journal of Applied Psychology 94,
     1 (2009), 196.