<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>No-Audio Multimodal Speech Detection Task at MediaEval 2020</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Cabrera-Quiros</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <email>lcabrera@itcr.ac.cr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jose Vargas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <email>j.d.vargasquiros@tudelft.nl</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hayley Hung</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <email>h.hung@tudelft.nl</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Instituto Tecnológico de Costa Rica</institution>
          ,
          <country country="CR">Costa Rica</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>This overview paper provides a description of the No-Audio Multimodal Speech Detection task for MediaEval 2020. As in the previous two editions, participants of this task are encouraged to estimate the speaking status (i.e. whether a person is speaking or not) of individuals interacting freely during a crowded mingle event, from multimodal data. In contrast to conventional speech detection approaches, no audio is used for this task. Instead, the proposed automatic estimation system must exploit the natural human movements that accompany speech, captured by cameras and wearable sensors. Task participants are provided with cropped videos of individuals while interacting, captured by an overhead camera, and the tri-axial acceleration of each individual throughout the event, captured with a single badge-like device hung around the neck. This year's edition of the task also focuses on investigating possible reasons for interpersonal differences in the performances obtained.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Speaking status is one of the key signals used for studying
conversational dynamics in face-to-face settings [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. From the
speaking status of multiple people one can also derive speaking
turns, and other features that have been shown to be beneficial for estimating
many different social constructs such as dominance [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or cohesion
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Overall, automated analysis of conversational dynamics in large
unstructured social gatherings is an under-explored problem despite
the relevance of such events [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and automated speaking detection
is one of its key components.
      </p>
      <p>The majority of work on speaking status detection
focuses on the audio signal captured by microphones.
However, most unstructured social gatherings, such as parties or cocktail
events, have inherent background noise, and collecting good-quality
audio signals requires participants to wear uncomfortable and
intrusive equipment. Recording audio also risks being perceived
as an invasion of privacy, since it gives access to the precise verbal
content of the conversation, further limiting the natural behavior
of the individuals involved. Because of these restrictions, recording
audio in such settings is challenging.</p>
      <p>As a suitable alternative, the main goal of this task is to estimate
a person’s speaking status using video and wearable acceleration
data from a smart ID badge hung around the neck, instead
of audio. Such alternative modalities are more privacy-preserving,
and easier to use and replicate in crowded environments such as
conferences, networking events, or organizational settings.</p>
      <p>
        Body movements such as gesturing tend to co-occur with
speaking, as has been well documented by social scientists [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Thus,
an automatic estimation system should exploit the natural human
movements that accompany speech. This task is motivated by such
insights, and by past work which estimated speaking status from a
single body-worn tri-axial accelerometer [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] and video [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Despite many efforts, one of the major challenges of these
alternative approaches has been achieving estimation performance
competitive with audio-based systems. Moreover, results from
past editions of this task have shown significant differences in
performance across individuals, with lower performance for a
particular subset of them (failure cases) that is not yet fully understood.</p>
    </sec>
    <sec id="sec-2">
      <title>TASK DETAILS</title>
      <sec id="sec-3">
      <title>Unimodal estimation of speaking status</title>
      <p>
        Participants are encouraged to design and implement separate
speaking status estimators for each modality. However, baseline
approaches for each modality are provided, in case teams prefer to
focus on improving an estimator for only one of the modalities, or
on the fusion technique. The acceleration baseline implements
the logistic regression in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and the video baseline employs dense
trajectories and multiple instance learning, as explained in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>For the video modality, the input will be a video of a person
interacting freely in a social gathering (see Figure 1), and an estimation
of that person’s speaking status (speaking/non-speaking) should be
provided every second. For the wearable modality, the method will
take the wearable tri-axial acceleration signal of a person as input
and must also return a speaking status estimation every second.</p>
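      <p>For illustration, a minimal sketch of a per-second estimator for the wearable modality follows. It is not the baseline of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]; the windowing and feature choices are illustrative assumptions.</p>
      <preformat>
# Hedged sketch: per-second speaking-status scores from tri-axial
# acceleration sampled at 20 Hz. Window and feature choices are
# illustrative assumptions, not the official task baseline.
import numpy as np
from sklearn.linear_model import LogisticRegression

FS = 20  # accelerometer sampling rate (Hz)

def window_features(acc):
    """acc: (n_samples, 3) tri-axial signal -> (n_seconds, n_features)."""
    n_sec = len(acc) // FS
    feats = []
    for s in range(n_sec):
        w = acc[s * FS:(s + 1) * FS]
        # simple per-axis statistics over each 1-second window
        feats.append(np.concatenate([w.mean(axis=0), w.std(axis=0),
                                     np.abs(np.diff(w, axis=0)).mean(axis=0)]))
    return np.asarray(feats)

clf = LogisticRegression(max_iter=1000)
# clf.fit(window_features(train_acc), train_labels)              # per-second labels
# scores = clf.predict_proba(window_features(test_acc))[:, 1]    # per-second scores
      </preformat>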
    </sec>
    <sec id="sec-4">
      <title>Multimodal estimation of speaking status</title>
      <p>
        For this subtask teams must provide an estimation of speaking
status every second by exploiting both modalities together. Teams
can use any type of fusion method they see fit [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The goal is
to leverage the complementary nature of the modalities to better
estimate the speaking status. Thus, teams are encouraged to go
beyond basic fusion and to consider the impact of each
modality on the estimation; a minimal fusion sketch is given below.
      </p>
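      <p>As a simple point of reference, the following sketch shows score-level (late) fusion of the two unimodal outputs; the weighting scheme is an illustrative assumption, not a recommended method.</p>
      <preformat>
# Hedged sketch: late fusion of aligned per-second scores from the two
# modalities. The weight alpha is an illustrative assumption.
import numpy as np

def late_fusion(video_scores, accel_scores, alpha=0.5):
    """Weighted average of per-second scores from each modality."""
    video_scores = np.asarray(video_scores, dtype=float)
    accel_scores = np.asarray(accel_scores, dtype=float)
    return alpha * video_scores + (1.0 - alpha) * accel_scores
      </preformat>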
    </sec>
    <sec id="sec-5">
      <title>Analysis of failure test cases</title>
      <p>As a new addition for this year’s edition, teams must analyze the
differences in the performance results for the test set, focusing on
the three subjects with the lowest performance, and hypothesize
about the reasons why the method underperforms for these persons.
Participants are encouraged to think about the circumstances of
the subjects (e.g. occlusion) or interpersonal differences that could
explain such dissimilarities.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>DATA</title>
      <p>
        A subset of the MatchNMingle dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is used for this task.
(MatchNMingle is openly available for research purposes under an EULA at
http://matchmakers.ewi.tudelft.nl/matchnmingle/pmwiki/.)
It contains data for 70 people who attended one of three separate
mingle events for over 45 minutes. To eliminate the effects of
acclimatization, only 30 minutes in the middle of the event are used.
Subjects were separated using stratified sampling to create the train
(54 subjects) and test (16 subjects) sets. Stratification was done on
various criteria to ensure balanced distributions in both sets for
speaking status, gender, event day, and level of occlusion in the
video (occlusion levels for the training set can be requested if needed).
An additional segment of the data was created for the
optional subject-specific evaluation of the task (see more in Section
4). While the dataset used this year is the same as in
previous editions of the challenge, making comparisons possible
between solutions of different years, focus is given to the differences
shown by the 16 subjects in the test set.
      </p>
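      <p>For illustration, a comparable subject-level stratified split is sketched below; the attribute column names are hypothetical, and combining the criteria into one composite key only approximates the multi-criteria procedure used for the task.</p>
      <preformat>
# Hedged sketch: subject-level stratified train/test split. The columns
# 'gender', 'day', and 'occlusion' are hypothetical names.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_subjects(subjects, test_size=16, seed=0):
    """subjects: DataFrame with one row per subject."""
    strata = (subjects['gender'].astype(str) + '_'
              + subjects['day'].astype(str) + '_'
              + subjects['occlusion'].astype(str))
    return train_test_split(subjects, test_size=test_size,
                            stratify=strata, random_state=seed)
      </preformat>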
      <p>Videos were captured from an overhead view at 20 FPS. The
rectangular (bounding box) area around each subject has been
cropped, so that one video is provided per person.
Important challenges in the automatic analysis of this data include the
significant amount of cross-contamination and occlusion, both
self-occlusion and occlusion by other subjects, due to the crowded
nature of the event (a cocktail party).</p>
      <p>Subjects also wore a badge-like body-worn accelerometer (see
Figure 1), recording tri-axial acceleration at 20Hz. These
acceleration readings were processed by whitening applied per axis. All
video and wearable data are synchronized.</p>
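      <p>Reading the per-axis whitening as standardization of each acceleration axis, a minimal sketch of this preprocessing step is:</p>
      <preformat>
# Sketch of per-axis whitening: standardize each tri-axial acceleration
# axis to zero mean and unit variance (one possible reading of the step).
import numpy as np

def whiten_per_axis(acc, eps=1e-8):
    """acc: (n_samples, 3) raw tri-axial acceleration readings."""
    return (acc - acc.mean(axis=0)) / (acc.std(axis=0) + eps)
      </preformat>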
      <p>Finally, binary speaking status (speaking/non-speaking) was
annotated by three different annotators. Inter-annotator agreement
was calculated on a 2-minute segment of the data, which resulted
in a Fleiss’ kappa coefficient of 0.55.</p>
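      <p>For reference, this agreement statistic can be computed as sketched below, assuming one binary label per annotator per second; the array shown is toy data, not the task annotations.</p>
      <preformat>
# Hedged sketch: Fleiss' kappa over per-second labels from 3 annotators,
# using statsmodels. The labels array is toy data for illustration.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

labels = np.array([[0, 0, 1],   # rows: seconds, columns: annotators
                   [1, 1, 1],
                   [0, 0, 0]])
counts, _ = aggregate_raters(labels)  # (n_seconds, n_categories) counts
print(fleiss_kappa(counts))           # the task reports 0.55 on a 2-minute segment
      </preformat>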
    </sec>
    <sec id="sec-7">
      <title>EVALUATION</title>
      <p>The Area Under the ROC Curve (ROC-AUC) is used as the evaluation
metric, since it is robust against the class imbalance that exists in our
scenario. Participants therefore need to submit continuous
prediction scores (posterior probabilities, distances to the separating
hyperplane, etc.) obtained by running their method on the
evaluation set. These scores will be compared against the test labels,
which are not available to participants.</p>
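      <p>Concretely, the metric is the standard ROC-AUC over per-second continuous scores, as in the scikit-learn sketch below (the values shown are illustrative only).</p>
      <preformat>
# Sketch of the evaluation metric: ROC-AUC over per-second scores.
# y_true and y_score below are illustrative values only.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]               # per-second speaking labels
y_score = [0.1, 0.4, 0.8, 0.6, 0.9, 0.3]  # continuous prediction scores
print(roc_auc_score(y_true, y_score))     # insensitive to class imbalance
      </preformat>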
      <p>Required evaluation. For the unimodal and multimodal estimations,
each team must provide up to 5 runs with their scores for a
person’s speaking status. As mentioned, the training set does not
contain any data from the participants in the test set, so that the
results are person-independent.</p>
      <p>Optional evaluation. Teams may optionally submit up to 5 runs
(per person) using person-dependent training. To do so, a separate
5-minute interval for all people in the training set is provided. Thus,
samples and labels from the same subject can be used to train or
fine-tune and then test on a specific test subject’s data, which is
temporally adjacent to the training samples. A method would
be expected to perform better when trained or fine-tuned on the
target person rather than on other people.</p>
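      <p>One possible reading of this person-dependent setup, sketched below under the assumption of an incrementally trainable classifier, is to warm-start a person-independent model and then adapt it on the target subject’s 5-minute interval.</p>
      <preformat>
# Hedged sketch of person-dependent adaptation: fit a generic model,
# then continue training on the target subject's 5-minute interval.
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss')  # supports incremental partial_fit
# clf.partial_fit(X_train, y_train, classes=[0, 1])  # person-independent pass
# clf.partial_fit(X_5min_subject, y_5min_subject)    # per-subject adaptation
# scores = clf.decision_function(X_test_subject)     # per-second scores
      </preformat>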
    </sec>
    <sec id="sec-8">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>With this task, we aim to support the study of speaking status
detection in the wild using alternative modalities to audio. We aim
to learn more about the connection between speaking and body
movements, expecting that in the future this will bring valuable
insights to both the social science and multimedia communities.</p>
      <p>
        Participation in previous editions of the task has been limited,
with only small improvements over the baseline. We believe this
is due to the variety of ways in which this task is atypical. For
example, the connection between speech and body movements
has been found to be person-specific [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Additionally, the
interaction between the two modalities of interest (chest acceleration and
video) is not traditionally explored, i.e. the combination of these
two modalities is not common. This leaves open opportunities to
explore their complementarity, to better understand in which
situations one modality is more reliable than the other, and to develop or
apply appropriate fusion strategies.
      </p>
      <p>
        Moreover, differences in performance between test subjects
were consistently found in previous editions, further supporting
past research [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Thus, this year participants are encouraged to
focus on such failure cases and hypothesize about the reasons for
such dissimilarities.
      </p>
      <p>We are reaching out to different communities (affective
computing, multimedia, computer vision, and speech), as we believe
each of these communities can bring its own expertise to the
task. In the following years, as well as augmenting the data, we
aim to include and explore the implications of the spatial social
component of the mingle (e.g. F-formations) on speaking status
detection.</p>
    </sec>
    <sec id="sec-9">
      <title>ACKNOWLEDGMENTS</title>
      <p>This task is partially supported by the Netherlands Organization
for Scientific Research (NWO) under project number 639.022.606.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Pradeep K</given-names>
            <surname>Atrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M Anwar</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Abdulmotaleb</given-names>
            <surname>El Saddik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Mohan S</given-names>
            <surname>Kankanhalli</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Multimodal fusion for multimedia analysis: a survey</article-title>
          .
          <source>Multimedia Systems</source>
          <volume>16</volume>
          ,
          <issue>6</issue>
          (
          <year>2010</year>
          ),
          <fpage>345</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Cabrera-Quiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Demetriou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ekin</given-names>
            <surname>Gedik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Leander</given-names>
            <surname>van der Meij</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Cabrera-Quiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>David MJ</given-names>
            <surname>Tax</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Gestures in-the-wild: detecting conversational hand gestures in crowded scenes using a multimodal fusion of bags of video trajectories and body worn acceleration</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Marco</given-names>
            <surname>Cristani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anna</given-names>
            <surname>Pesarin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Vinciarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marco</given-names>
            <surname>Crocco</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vittorio</given-names>
            <surname>Murino</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Look at who's talking: Voice activity detection by automated gesture analysis</article-title>
          .
          In
          <source>International Joint Conference on Ambient Intelligence</source>
          . Springer,
          <fpage>72</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ekin</given-names>
            <surname>Gedik</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Personalised models for speech detection from body movements using transductive parameter transfer</article-title>
          .
          <source>Personal and Ubiquitous Computing</source>
          <volume>21</volume>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ),
          <fpage>723</fpage>
          -
          <lpage>737</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gwenn</given-names>
            <surname>Englebienne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeroen</given-names>
            <surname>Kools</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Classifying social actions with a single accelerometer</article-title>
          .
          In
          <source>Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing</source>
          . ACM,
          <fpage>207</fpage>
          -
          <lpage>210</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Estimating cohesion in small groups using audio-visual nonverbal behavior</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>12</volume>
          ,
          <issue>6</issue>
          (
          <year>2010</year>
          ),
          <fpage>563</fpage>
          -
          <lpage>575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dinesh Babu</given-names>
            <surname>Jayagopi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chuohao</given-names>
            <surname>Yeo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Modeling Dominance in Group Conversations Using Nonverbal Activity Cues</article-title>
          .
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          <volume>17</volume>
          ,
          <issue>3</issue>
          (
          <year>2009</year>
          ),
          <fpage>501</fpage>
          -
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>David</given-names>
            <surname>McNeill</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Language and gesture</article-title>
          . Vol.
          <volume>2</volume>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Vinciarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Maja</given-names>
            <surname>Pantic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dirk</given-names>
            <surname>Heylen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Catherine</given-names>
            <surname>Pelachaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Isabella</given-names>
            <surname>Poggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Francesca</given-names>
            <surname>D'Errico</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Schroeder</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Bridging the gap between social animal and unsocial machine: A survey of social signal processing</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          <volume>3</volume>
          ,
          <issue>1</issue>
          (
          <year>2012</year>
          ),
          <fpage>69</fpage>
          -
          <lpage>87</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Hans-Georg</given-names>
            <surname>Wolff</surname>
          </string-name>
          and
          <string-name>
            <given-names>Klaus</given-names>
            <surname>Moser</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Effects of networking on career success: a longitudinal study</article-title>
          .
          <source>Journal of Applied Psychology</source>
          <volume>94</volume>
          ,
          <issue>1</issue>
          (
          <year>2009</year>
          ),
          <fpage>196</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>