<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Fusion of Body Movement Signals for No-audio Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinsheng Wang</string-name>
          <email>wangxinsheng@stu.xjtu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jihua Zhu</string-name>
          <email>zhujh@xjtu.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Odette Scharenborg</string-name>
          <email>o.e.scharenborg@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Computing Group, Delft University of Technology</institution>
          ,
          <addr-line>Delft</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Software Engineering, Xi'an Jiaotong University</institution>
          ,
          <addr-line>Xi'an</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>No-audio Multimodal Speech Detection is one of the tasks of MediaEval 2020, with the goal of automatically detecting whether someone in a social interaction is speaking on the basis of body movement signals. In this paper, a multimodal fusion method combining signals obtained by an overhead camera and a wearable accelerometer is proposed to determine whether someone is speaking. The proposed system directly takes the accelerometer signals as input, while a pre-trained 3D convolutional network is used to extract the video features that serve as the other input. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        There is a close relationship between body movements, e.g.,
gesturing, and speaking status, i.e., whether someone is speaking or
not. This makes it possible, in principle, to determine whether a person
is speaking by analyzing the person’s body movements. The
No-audio Multimodal Speech Detection task of MediaEval 2020 focuses
on determining the speaking status of standing subjects in
crowded mingling scenarios using the information recorded by an
overhead camera and a single body-worn triaxial accelerometer
hung around the neck of each subject [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In
this paper, we fuse the signals from these two modalities to perform
the No-audio Speech Detection task. The details of the proposed
approach are described in the following section.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>The architecture of the proposed method is shown in Fig. 1. The
proposed model consists of three parts, i.e., the AccelNet, the VideoNet,
and the fusion part, for the accelerometer input, the video
input, and the multi-modality fusion, respectively. Following
the requirements of this task, the AccelNet and the VideoNet are also
designed to predict the speaking status individually.</p>
    </sec>
    <sec id="sec-3">
      <title>Data processing</title>
      <p>In the provided database, video and accelerometer data were recorded
with a duration of 22 minutes at 20 Hz. For training, we segmented
the video and accelerometer data into 11 segments, each of which
has a duration of 2 minutes, resulting in video segments with 2400
frames and accelerometer data segments with a size of 3 × 2400. The
code of the proposed method can be found at:
https://github.com/xinshengwang/No-audio-speech-detection</p>
      <p>As shown in Fig. 1, the AccelNet consists of 3 1-D convolution
layers and a bi-directional GRU layer. Between every two adjacent
convolutional layers, a batch normalization layer is adopted. The 3
convolution layers take kernel sizes of 5, 3, and 3, and stride sizes
of 5, 2, and 2, respectively, resulting in features with a receptive
field of 23 frames, close to one second of data at the 20 Hz sampling
rate. Therefore, we can assume that each of the 120 frames output
by the last convolutional layer, each with a dimension of 256,
represents the movement status within one second. Intuitively, the
speaking status at one moment is related to the previous and following
time steps; a bi-directional GRU with 256 units is therefore adopted
after the last 1-D convolutional layer to capture this relationship.</p>
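The shape arithmetic of the AccelNet's convolutional stack can be checked with a short sketch. The paper does not state the padding of each layer; assuming zero padding for the stride-5 layer and a padding of 1 for the two stride-2 layers reproduces the stated 120 output frames (one per second of a 2-minute, 20 Hz segment):

```python
def conv1d_out_len(n_in, kernel, stride, padding=0):
    """Output length of a 1-D convolution (floor formula)."""
    return (n_in + 2 * padding - kernel) // stride + 1

# A 2-minute accelerometer segment at 20 Hz: 2400 frames, 3 channels.
n = 2400
# Kernels 5/3/3 and strides 5/2/2 as in the paper; the paddings
# (0, 1, 1) are our assumption to reproduce the stated 120 frames.
for kernel, stride, padding in [(5, 5, 0), (3, 2, 1), (3, 2, 1)]:
    n = conv1d_out_len(n, kernel, stride, padding)

print(n)        # 120 output frames, i.e. roughly one per second
print(2 * 256)  # a bi-directional GRU with 256 units yields 512-d features
```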
      <p>
        Concatenating the features of the two directions at each time step,
the bi-directional GRU yields a 512-d feature with a sequence
length of 120. This feature is then concatenated with the video
feature to perform the multimodal speech detection task. To allow
the AccelNet to detect speaking status on the basis of the
accelerometer data only, a linear transformation followed by a
sigmoid layer can be added after the bi-directional GRU.
      </p>
      <p>
        The C3D [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] pre-trained on Sports-1M [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is adopted to extract the
video features. The video was recorded with a frequency of 20Hz,
while the C3D model only uses 16 consecutive frames as context to
obtain the 3D convolutional features. In practice, we dropped the
last 4 frames within each second in the video, so that we can use
the C3D to extract video features of each second, resulting in 120
feature vectors with a dimension of 512 for each video segment (2
minutes). The C3D features go through a bi-directional GRU, with
256 units, before being fused with the accelerometer features.
      </p>
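The per-second frame dropping for the C3D input can be sketched as follows; the frame array is a tiny stand-in (4×4×3 "frames"), since only the temporal reshaping matters here:

```python
import numpy as np

# A 2-minute video segment at 20 fps: 2400 frames; the 4x4x3 frame
# size is a small stand-in, as only the temporal axis matters here.
segment = np.zeros((2400, 4, 4, 3))

# C3D takes 16 consecutive frames per clip, but the video runs at
# 20 fps: dropping the last 4 frames of every second gives exactly
# one 16-frame clip per second, i.e. 120 clips per segment.
clips = segment.reshape(120, 20, 4, 4, 3)[:, :16]
print(clips.shape)  # (120, 16, 4, 4, 3)
```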
      <p>Similar to the AccelNet, the output of the VideoNet can also be used
for unimodal speech detection.</p>
    </sec>
    <sec id="sec-4">
      <title>Fusion and objective function</title>
      <p>The early fusion strategy is adopted in this paper. Specifically, the
accelerometer feature from the AccelNet and the visual feature
from the VideoNet are concatenated, resulting in a feature with
1024 dimensions and 120 frames. Two linear transformation layers
are used to transform the feature dimension from 1024 to 1, and
then a sigmoid layer is utilized after the last linear transformation
layer to obtain the final prediction probability.</p>
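A minimal sketch of the fusion head in NumPy, with random weights in place of trained ones; the hidden width of 256 between the two linear layers is our assumption, as the paper only fixes the input (1024) and output (1) dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Per-segment features (random stand-ins): 120 time steps, with a
# 512-d feature from the AccelNet and a 512-d feature from the VideoNet.
accel_feat = rng.normal(size=(120, 512))
video_feat = rng.normal(size=(120, 512))

# Early fusion: concatenate along the feature axis -> (120, 1024).
fused = np.concatenate([accel_feat, video_feat], axis=1)

# Two linear transformation layers map 1024 -> 1; the hidden width
# of 256 is our assumption, and the weights here are random rather
# than trained.
w1, b1 = 0.01 * rng.normal(size=(1024, 256)), np.zeros(256)
w2, b2 = 0.01 * rng.normal(size=(256, 1)), np.zeros(1)
prob = sigmoid((fused @ w1 + b1) @ w2 + b2)  # per-frame speaking probability
print(prob.shape)  # (120, 1)
```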
      <p>To train the model, the binary cross-entropy loss is adopted on
the frame level. First, the AccelNet and VideoNet are trained for the
unimodal prediction task individually. Next, the pre-trained models
are used in the multimodal task. During multimodal task training,
we only updated the fusion network, i.e., two linear transformation
layers, while keeping the parameters of the pre-trained AccelNet
and VideoNet fixed.</p>
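The frame-level binary cross-entropy used for training can be written as a small sketch (toy probabilities and labels, not actual model output):

```python
import numpy as np

def frame_bce(probs, labels, eps=1e-7):
    """Frame-level binary cross-entropy, averaged over frames."""
    p = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# Toy per-frame speaking probabilities and binary ground-truth labels.
probs = np.array([0.9, 0.2, 0.7, 0.1])
labels = np.array([1.0, 0.0, 1.0, 0.0])
print(round(frame_bce(probs, labels), 4))  # 0.1976
```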
    </sec>
    <sec id="sec-4a">
      <title>RESULTS AND DISCUSSION</title>
      <p>
        In order to evaluate our speech detection approach, we followed
the given split method of the No-audio Speech Detection task. The
model was trained on data from 54 subjects and tested on data
from 16 unseen subjects that non-overlap with the subjects in the
training set. We report the Area Under Curve (AUC) metric for each
test subject and each modality. The mean AUC scores computed
over all test subjects are shown in Table 1, while the AUC scores
for each test subject separately are shown in Fig. 2.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Mean AUC scores computed over all test subjects for each modality.</p>
        </caption>
        <table>
          <thead>
            <tr>
              <th>Method</th>
              <th>Accel</th>
              <th>Video</th>
              <th>Fusion</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>Cabrera-Quiros et al. [<xref ref-type="bibr" rid="ref2">2</xref>]</td>
              <td>0.656±0.074</td>
              <td>0.549±0.079</td>
              <td>0.658±0.073</td>
            </tr>
            <tr>
              <td>Liu et al. [<xref ref-type="bibr" rid="ref6">6</xref>]</td>
              <td>0.533±0.020</td>
              <td>0.512±0.021</td>
              <td>0.535±0.019</td>
            </tr>
            <tr>
              <td>Giannakeris et al. [<xref ref-type="bibr" rid="ref3">3</xref>]</td>
              <td>0.649±0.066</td>
              <td>0.614±0.067</td>
              <td>0.672±0.051</td>
            </tr>
            <tr>
              <td>Li et al. [<xref ref-type="bibr" rid="ref5">5</xref>]</td>
              <td>0.644</td>
              <td>0.513</td>
              <td>0.620</td>
            </tr>
            <tr>
              <td>Vargas et al. [<xref ref-type="bibr" rid="ref8">8</xref>]</td>
              <td>0.692</td>
              <td>0.552</td>
              <td>0.693</td>
            </tr>
            <tr>
              <td>The proposed model</td>
              <td>0.689±0.094</td>
              <td>0.656±0.076</td>
              <td>0.712±0.081</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
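The AUC metric reported above can be computed with the rank-based (Mann-Whitney) formulation; a minimal sketch on toy scores (in practice a library routine such as scikit-learn's roc_auc_score would be used):

```python
def auc(scores, labels):
    """AUC via the rank (Mann-Whitney U) formulation: the fraction of
    positive/negative score pairs ranked correctly, ties counting half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))  # 0.75
```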
      <p>
        In Table 1, our method is compared with the submission results
of previous years. Our method achieves the best performance on
the multimodal speech detection task. On the unimodal tasks, our
AccelNet outperforms our VideoNet. Moreover, our accelerometer-based
method performs only slightly below that of [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], while
our video-based method achieves a much higher performance than
the second best approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], indicating that the C3D extracts effective video features and
that the VideoNet is well designed. The best performance of our
multimodal model benefits from this strong performance of the
VideoNet.
      </p>
      <p>From Fig. 2 we can see that the accelerometer-based method
does not always outperform the video-based method, indicating
that the signals from the accelerometer and the video could be
complementary; this could explain the higher performance of the
fusion of the two modalities compared to the unimodal methods.
However, fusion did not lead to an improved performance for all
individual test subjects (see subjects 17 and 83), and a better fusion
method should be considered in the future.
</p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSION</title>
      <p>In this paper, we proposed a multimodal speech detection model
with video and accelerometer data as input. Our model showed
competitive results on the unimodal speech detection tasks with
either video or accelerometer data as input, and it outperformed
previous methods on the multimodal task, which uses both types
of input.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Cabrera-Quiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Demetriou</surname>
          </string-name>
          , Ekin Gedik, Leander van der Meij, and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates</article-title>
          .
          <source>IEEE Transactions on Affective Computing</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Cabrera-Quiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ekin</given-names>
            <surname>Gedik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection</article-title>
          .
          <source>In 2018 Working Notes Proceedings of the MediaEval Workshop</source>
          , MediaEval
          <year>2018</year>
          .
          CEUR-WS.org, 3
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Panagiotis</given-names>
            <surname>Giannakeris</surname>
          </string-name>
          , Stefanos Vrochidis, and
          <string-name>
            <given-names>Ioannis</given-names>
            <surname>Kompatsiaris</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Andrej</given-names>
            <surname>Karpathy</surname>
          </string-name>
          , George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and
          <string-name>
            <given-names>Li</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Large-scale Video Classification with Convolutional Neural Networks</article-title>
          .
          <source>In CVPR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Liandong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhuo</given-names>
            <surname>Hao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Bo</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Combining Body Pose and Movement Modalities for No-audio Speech Detection</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Yang</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhonglei</given-names>
            <surname>Gu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tobey H.</given-names>
            <surname>Ko</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification</article-title>
          .
          <source>In MediaEval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Du</given-names>
            <surname>Tran</surname>
          </string-name>
          , Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and
          <string-name>
            <given-names>Manohar</given-names>
            <surname>Paluri</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning spatiotemporal features with 3d convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision</source>
          .
          <fpage>4489</fpage>
          -
          <lpage>4497</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jose</given-names>
            <surname>Vargas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>CNNs and Fisher Vectors for No-Audio Multimodal Speech Detection</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>