Multimodal Fusion of Body Movement Signals for No-audio Speech Detection

Xinsheng Wang1,2, Jihua Zhu1, Odette Scharenborg2
1 School of Software Engineering, Xi'an Jiaotong University, Xi'an, China
2 Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands
wangxinsheng@stu.xjtu.edu.cn, zhujh@xjtu.edu.cn, o.e.scharenborg@tudelft.nl

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online

ABSTRACT
No-audio Multimodal Speech Detection is one of the tasks of MediaEval 2020, with the goal of automatically detecting whether someone in a social interaction is speaking on the basis of body movement signals. In this paper, a multimodal fusion method that combines signals obtained by an overhead camera and a wearable accelerometer is proposed to determine whether someone is speaking. The proposed system takes the raw accelerometer signals directly as input, while using a pre-trained 3D convolutional network to extract the video features that serve as the second input. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.

1 INTRODUCTION
There is a close relationship between body movements, e.g., gesturing, and speaking status, i.e., whether someone is speaking or not. This makes it possible, in principle, to determine whether a person is speaking by analyzing that person's body movements. The No-Audio Multimodal Speech Detection task of MediaEval 2020 focuses on determining the speaking status of standing subjects in crowded mingling scenarios from the information recorded by an overhead camera and a single body-worn triaxial accelerometer hung around the neck of each subject [1]. In this paper, we fuse the signals from these two modalities to perform the No-audio Speech Detection task. The details of the proposed approach are described in the following section. The code of the proposed method can be found at: https://github.com/xinshengwang/No-audio-speech-detection

2 APPROACH
The architecture of the proposed method is shown in Fig. 1. The proposed model consists of three parts, i.e., AccelNet, VideoNet, and the fusion part, which handle the accelerometer input, the video input, and the multi-modality fusion, respectively. Following the requirements of this task, the AccelNet and VideoNet are also designed to be able to predict the speaking status individually.

2.1 Data processing
In the provided database, video and accelerometer data were recorded for a duration of 22 minutes at 20 Hz. For training, we segmented the video and accelerometer data into 11 segments, each with a duration of 2 minutes, resulting in video segments of 2400 frames and accelerometer segments of size 3 × 2400.

2.2 AccelNet
As shown in Fig. 1, the AccelNet consists of three 1-D convolution layers and a bi-directional GRU layer. Between every two adjacent convolutional layers, a batch normalization layer is adopted. The three convolution layers have kernel sizes of 5, 3, and 3, and stride sizes of 5, 2, and 2, respectively, resulting in features with a receptive field of 23 frames and an overall temporal stride of 20 frames, i.e., exactly the sampling rate of 20 Hz. Therefore, we can assume that each of the 120 frames output by the last convolutional layer, each with a dimension of 256, represents the movement status within one second. Intuitively, the speaking status at one moment is related to the previous and following several time steps; therefore, a bi-directional GRU with 256 units is adopted after the last 1-D convolutional layer to capture this relationship. Concatenating the features of the two directions at each time step, the bi-directional GRU yields a 512-dimensional feature with a sequence length of 120. This feature is then concatenated with the video feature to perform the multimodal speech detection task. To allow the AccelNet to detect speaking status on the basis of the accelerometer data alone, a linear transformation followed by a sigmoid layer can be added after the bi-directional GRU.

2.3 VideoNet
The C3D network [7] pre-trained on Sports-1M [4] is adopted to extract the video features. The video was recorded at a frequency of 20 Hz, while the C3D model uses only 16 consecutive frames as context to obtain the 3D convolutional features. In practice, we therefore dropped the last 4 frames within each second of video, so that C3D extracts one video feature per second, resulting in 120 feature vectors with a dimension of 512 for each 2-minute video segment. The C3D features go through a bi-directional GRU with 256 units before being fused with the accelerometer features. Similar to the AccelNet, the output of the VideoNet can also be used for unimodal speech detection.

2.4 Fusion and objective function
An early fusion strategy is adopted in this paper. Specifically, the accelerometer feature from the AccelNet and the visual feature from the VideoNet are concatenated, resulting in a feature with 1024 dimensions and 120 frames. Two linear transformation layers are used to transform the feature dimension from 1024 to 1, and a sigmoid layer is applied after the last linear transformation layer to obtain the final prediction probability. To train the model, the binary cross-entropy loss is adopted at the frame level.
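The temporal downsampling described above can be sanity-checked with a short sketch. This is an illustrative calculation, not the released implementation: the kernel/stride/padding values (5/5/2, 3/2/1, 3/2/1) are those given in Fig. 1, and the formula is the standard output-length rule for 1-D convolutions.

```python
import math

def conv1d_out_len(length, kernel, stride, padding):
    """Standard output-length rule for a 1-D convolution layer."""
    return math.floor((length + 2 * padding - kernel) / stride) + 1

# AccelNet convolution stack as (kernel, stride, padding), per Fig. 1.
LAYERS = [(5, 5, 2), (3, 2, 1), (3, 2, 1)]

def accelnet_seq_lens(n_frames):
    """Sequence length after each convolution layer in the stack."""
    lens = []
    for k, s, p in LAYERS:
        n_frames = conv1d_out_len(n_frames, k, s, p)
        lens.append(n_frames)
    return lens

# A 2-minute segment at 20 Hz has 2400 frames.
print(accelnet_seq_lens(2400))  # -> [480, 240, 120]
```

The cumulative stride of the stack is 5 × 2 × 2 = 20 frames, so each of the 120 output steps covers exactly one second of accelerometer data at 20 Hz, matching the one-feature-per-second rate of the VideoNet.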
First, the AccelNet and VideoNet are trained individually on the unimodal prediction task. Next, the pre-trained models are used in the multimodal task. During multimodal training, we only update the fusion network, i.e., the two linear transformation layers, while keeping the parameters of the pre-trained AccelNet and VideoNet fixed.

[Figure 1: The proposed multimodal speech detection network. AccelNet branch: input 3×2400 → 1-D CNN (kernel 5, stride 5, padding 2) → 64×480 → batch normalization → 1-D CNN (kernel 3, stride 2, padding 1) → 128×240 → batch normalization → 1-D CNN (kernel 3, stride 2, padding 1) → 256×120 → bi-directional GRU → 512×120. VideoNet branch: C3D → 120 features of dimension 512 → bi-directional GRU → 512×120. Fusion: concatenate → 1024×120 → FC → 512×120 → FC → 1×120 → sigmoid → fusion output. Each branch also has its own FC → 1×120 → sigmoid head for unimodal output.]

3 RESULTS
In order to evaluate our speech detection approach, we followed the given data split of the No-audio Speech Detection task. The model was trained on data from 54 subjects and tested on data from 16 unseen subjects that do not overlap with the subjects in the training set. We report the Area Under the Curve (AUC) metric for each test subject and each modality. The mean AUC scores computed over all test subjects are shown in Table 1, while the AUC scores for each test subject separately are shown in Fig. 2.

Table 1: Performance of each of the previously submitted results and our proposed method for the unimodal and multimodal speech detection tasks. Bold indicates best result.

Method                      Accel          Video          Fusion
Cabrera-Quiros et al. [2]   0.656±0.074    0.549±0.079    0.658±0.073
Liu et al. [6]              0.533±0.020    0.512±0.021    0.535±0.019
Giannakeris et al. [3]      0.649±0.066    0.614±0.067    0.672±0.051
Li et al. [5]               0.644          0.513          0.620
Vargas et al. [8]           0.692          0.552          0.693
The proposed model          0.689±0.094    0.656±0.076    0.712±0.081

[Figure 2: AUC scores of the acceleration, video, and fusion models for each test subject (IDs 2, 3, 15, 17, 26, 39, 40, 43, 51, 54, 59, 65, 67, 80, 83, 85).]

In Table 1, our method is compared with the submission results of previous years. Our method achieves the best performance on the multimodal speech detection task. On the unimodal tasks, our AccelNet outperforms our VideoNet. Moreover, the performance of our accelerometer-based method is only slightly lower than that of [8], while our video-based method achieves a much higher performance than the second-best approach [3], indicating both the strength of C3D for extracting video features and the good design of the VideoNet. The strong performance of our multimodal result benefits from the good performance of the VideoNet.

From Fig. 2 we can see that the accelerometer-based method does not always outperform the video-based method, indicating that the signals from the accelerometer and the video could be complementary, which could explain the higher performance of the fusion of the two modalities compared to the unimodal methods. However, fusion did not lead to improved performance for all individual test subjects (see subjects 17 and 83), and a better fusion method should be considered in the future.

4 CONCLUSION
In this paper, we proposed a multimodal speech detection model with video and accelerometer data as input. Our model showed competitive results on the unimodal speech detection tasks with either video or accelerometer data as input, and it outperformed previous methods on the multimodal task, which uses both types of input.

REFERENCES
[1] Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, and Hayley Hung. 2018. The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. IEEE Transactions on Affective Computing (2018).
[2] Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung. 2018.
Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection. In 2018 Working Notes Proceedings of the MediaEval Workshop, MediaEval 2018. CEUR-WS.org, 3.
[3] Panagiotis Giannakeris, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2019. Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection. (2019).
[4] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale Video Classification with Convolutional Neural Networks. In CVPR.
[5] Liandong Li, Zhuo Hao, and Bo Sun. 2019. Combining Body Pose and Movement Modalities for No-audio Speech Detection. (2019).
[6] Yang Liu, Zhonglei Gu, and Tobey H. Ko. 2018. Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification. In MediaEval.
[7] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[8] Jose Vargas and Hayley Hung. 2019. CNNs and Fisher Vectors for No-Audio Multimodal Speech Detection. (2019).