=Paper=
{{Paper
|id=Vol-2882/MediaEval_20_paper_8
|storemode=property
|title=Multimodal Fusion of Body Movement Signals for No-audio Speech Detection
|pdfUrl=https://ceur-ws.org/Vol-2882/paper8.pdf
|volume=Vol-2882
|authors=Xinsheng Wang,Jihua Zhu,Odette Scharenborg
|dblpUrl=https://dblp.org/rec/conf/mediaeval/WangZS20
}}
==Multimodal Fusion of Body Movement Signals for No-audio Speech Detection==
Multimodal Fusion of Body Movement Signals for No-audio Speech Detection
Xinsheng Wang1,2, Jihua Zhu1, Odette Scharenborg2
1 School of Software Engineering, Xi’an Jiaotong University, Xi’an, China
2 Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands
wangxinsheng@stu.xjtu.edu.cn,zhujh@xjtu.edu.cn,o.e.scharenborg@tudelft.nl
ABSTRACT
No-audio Multimodal Speech Detection is one of the tasks of MediaEval 2020, with the goal of automatically detecting whether someone in a social interaction is speaking on the basis of body movement signals. In this paper, a multimodal fusion method, combining signals obtained by an overhead camera and a wearable accelerometer, is proposed to determine whether someone is speaking. The proposed system takes the accelerometer signals directly as input, while the video features that serve as input are extracted with a pre-trained 3D convolutional network. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval’20, December 14-15 2020, Online
1 INTRODUCTION
There is a close relationship between body movements, e.g., gesturing, and speaking status, i.e., whether someone is speaking or not. This might make it possible to determine whether a person is speaking by analyzing the person's body movements. The No-Audio Multimodal Speech Detection task of MediaEval 2020 focuses on determining the speaking status of standing subjects in crowded mingling scenarios, using information recorded by an overhead camera and a single body-worn triaxial accelerometer hung around the neck of the subjects [1]. In this paper, we fuse the signals from these two modalities to perform the No-audio Speech Detection task. The details of the proposed approach are described in the following section¹.

¹The code of the proposed method can be found at: https://github.com/xinshengwang/No-audio-speech-detection
2 APPROACH
The architecture of the proposed method is shown in Fig. 1. The proposed model consists of three parts, i.e., AccelNet, VideoNet, and the fusion part, which handle the accelerometer input, the video input, and the multi-modality fusion, respectively. Following the requirements of this task, AccelNet and VideoNet are also designed to be able to predict the speaking status individually.

2.1 Data processing
In the provided database, video and accelerometer data were recorded with a duration of 22 minutes at 20 Hz. For training, we segmented the video and accelerometer data into 11 segments, each of which has a duration of 2 minutes, resulting in a video segment with 2400 frames and an accelerometer data segment with a size of 3 × 2400.
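The segmentation step is straightforward; the following is a minimal sketch under the shapes stated above (function and variable names are ours, not from the released code):

```python
import numpy as np

FPS = 20                                 # sampling rate of video and accelerometer (Hz)
SEGMENT_SECONDS = 120                    # each training segment lasts 2 minutes
SEGMENT_FRAMES = FPS * SEGMENT_SECONDS   # 2400 frames per segment

def segment_recording(accel, video, labels):
    """Split one 22-minute recording into 11 two-minute segments.

    accel:  (3, 26400) triaxial accelerometer signal (22 min at 20 Hz)
    video:  (26400, H, W, 3) video frames
    labels: (26400,) binary per-frame speaking-status labels
    """
    n_segments = accel.shape[1] // SEGMENT_FRAMES   # 11 for a 22-minute recording
    segments = []
    for i in range(n_segments):
        s, e = i * SEGMENT_FRAMES, (i + 1) * SEGMENT_FRAMES
        segments.append((accel[:, s:e],      # (3, 2400)
                         video[s:e],         # (2400, H, W, 3)
                         labels[s:e]))       # (2400,)
    return segments
```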
2.2 AccelNet
As shown in Fig. 1, the AccelNet consists of three 1-D convolutional layers and a bi-directional GRU layer. Between every two adjacent convolutional layers, a batch normalization layer is adopted. The three convolutional layers use kernel sizes of 5, 3, and 3, and stride sizes of 5, 2, and 2, respectively, resulting in features with a receptive field of 23 frames, which roughly corresponds to one second of data at the 20 Hz sampling rate. Therefore, we can assume that each of the 120 frames output by the last convolutional layer, each with a dimension of 256, represents the movement status within one second. Intuitively, the speaking status at one moment is related to the previous and following time steps; a bi-directional GRU with 256 units is therefore adopted after the last 1-D convolutional layer to capture this relationship.

By concatenating the features of the two directions at each time step, the bi-directional GRU yields a 512-d feature with a sequence length of 120. This feature is then concatenated with the video feature to perform the multimodal speech detection task. In order for the AccelNet to detect speaking status on the basis of the accelerometer data only, a linear transformation followed by a sigmoid layer can be added after the bi-directional GRU.
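The layer hyper-parameters above fully determine the feature shapes shown in Fig. 1. A PyTorch sketch of this branch might look as follows (the ReLU activations and the exact unimodal head layout are our assumptions; see the released code for the authors' implementation):

```python
import torch
import torch.nn as nn

class AccelNet(nn.Module):
    """Accelerometer branch: three 1-D convolutions + BiGRU.

    Kernels 5/3/3, strides 5/2/2, paddings 2/1/1 and channel widths
    64/128/256 follow Fig. 1; everything else is assumed.
    """
    def __init__(self):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(3, 64, kernel_size=5, stride=5, padding=2),    # (B, 64, 480)
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3, stride=2, padding=1),  # (B, 128, 240)
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=2, padding=1), # (B, 256, 120)
        )
        self.gru = nn.GRU(256, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, 1)  # optional unimodal head

    def forward(self, accel):               # accel: (B, 3, 2400)
        x = self.convs(accel)               # (B, 256, 120)
        x = x.transpose(1, 2)               # (B, 120, 256): one step per second
        feats, _ = self.gru(x)              # (B, 120, 512)
        probs = torch.sigmoid(self.head(feats)).squeeze(-1)  # (B, 120)
        return feats, probs
```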
2.3 VideoNet
The C3D [7] pre-trained on Sports-1M [4] is adopted to extract the video features. The video was recorded at a frequency of 20 Hz, while the C3D model uses 16 consecutive frames as context to obtain the 3D convolutional features. In practice, we therefore dropped the last 4 frames within each second of the video, so that the C3D extracts one video feature per second, resulting in 120 feature vectors with a dimension of 512 for each 2-minute video segment. The C3D features go through a bi-directional GRU with 256 units before being fused with the accelerometer features. Similar to the AccelNet, the output of the VideoNet can also be used for unimodal speech detection.
2.4 Fusion and objective function
The early fusion strategy is adopted in this paper. Specifically, the accelerometer feature from the AccelNet and the visual feature from the VideoNet are concatenated, resulting in a feature with 1024 dimensions and 120 frames. Two linear transformation layers are used to transform the feature dimension from 1024 to 1, and a sigmoid layer is then applied after the last linear transformation layer to obtain the final prediction probability.

To train the model, the binary cross-entropy loss is adopted at the frame level. First, the AccelNet and VideoNet are trained individually on the unimodal prediction task. Next, the pre-trained models are used in the multimodal task. During multimodal training, we only updated the fusion network, i.e., the two linear transformation layers, while keeping the parameters of the pre-trained AccelNet and VideoNet fixed.
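A minimal sketch of the fusion head and the second training stage (the 1024 → 512 → 1 layout follows Fig. 1; the optimizer setup and the use of one label per second are our assumptions):

```python
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    """Early fusion: concatenate (512 + 512) -> FC -> FC -> sigmoid."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 1)

    def forward(self, accel_feats, video_feats):           # each (B, 120, 512)
        x = torch.cat([accel_feats, video_feats], dim=-1)  # (B, 120, 1024)
        return torch.sigmoid(self.fc2(self.fc1(x))).squeeze(-1)  # (B, 120)

def fusion_step(accel_net, video_net, fusion, optimizer, accel, video, labels):
    """One multimodal training step: backbones frozen, only the head updated."""
    criterion = nn.BCELoss()               # frame-level binary cross-entropy
    with torch.no_grad():                  # pre-trained branches stay fixed
        accel_feats, _ = accel_net(accel)
        video_feats, _ = video_net(video)
    probs = fusion(accel_feats, video_feats)   # (B, 120)
    loss = criterion(probs, labels.float())    # labels: (B, 120) in {0, 1}
    optimizer.zero_grad()                      # optimizer holds fusion params only
    loss.backward()
    optimizer.step()
    return loss.item()
```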
[Figure 1: The proposed multimodal speech detection network. AccelNet branch: input 3×2400 → 1-D CNN (kernel 5, stride 5, padding 2) → 64×480 → batch normalization → 1-D CNN (kernel 3, stride 2, padding 1) → 128×240 → batch normalization → 1-D CNN (kernel 3, stride 2, padding 1) → 256×120 → bi-directional GRU → 512×120. VideoNet branch: one C3D feature per second (1st to 120th second) → 512×120 → bi-directional GRU → 512×120. Each branch has an FC + sigmoid head producing a 1×120 unimodal output; for fusion, the two 512×120 features are concatenated (1024×120) and passed through FC (512×120), FC (1×120), and a sigmoid to produce the fused output.]
3 RESULTS
In order to evaluate our speech detection approach, we followed the given data split of the No-audio Speech Detection task. The model was trained on data from 54 subjects and tested on data from 16 unseen subjects that do not overlap with the subjects in the training set. We report the Area Under the Curve (AUC) metric for each test subject and each modality. The mean AUC scores computed over all test subjects are shown in Table 1, while the AUC scores for each test subject separately are shown in Fig. 2.
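Per-subject AUC can be computed along these lines (a sketch; `labels_by_subject` and `probs_by_subject` are hypothetical containers, not names from the released code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_subject_auc(labels_by_subject, probs_by_subject):
    """Frame-level AUC per test subject, plus mean/std over subjects.

    Both arguments map a subject ID to a 1-D array of per-frame
    speaking labels / predicted speaking probabilities.
    """
    scores = {s: roc_auc_score(labels_by_subject[s], probs_by_subject[s])
              for s in labels_by_subject}
    values = np.array(list(scores.values()))
    return scores, values.mean(), values.std()
```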
Table 1: Performance of each of the previously submitted results and our proposed method for the unimodal and multimodal speech detection tasks. Bold indicates the best result.

| Method | Accel | Video | Fusion |
|---|---|---|---|
| Cabrera-Quiros et al. [2] | 0.656±0.074 | 0.549±0.079 | 0.658±0.073 |
| Liu et al. [6] | 0.533±0.020 | 0.512±0.021 | 0.535±0.019 |
| Giannakeris et al. [3] | 0.649±0.066 | 0.614±0.067 | 0.672±0.051 |
| Li et al. [5] | 0.644 | 0.513 | 0.620 |
| Vargas et al. [8] | **0.692** | 0.552 | 0.693 |
| The proposed model | 0.689±0.094 | **0.656±0.076** | **0.712±0.081** |

[Figure 2: AUC scores for each test subject (subjects 2, 3, 15, 17, 26, 39, 40, 43, 51, 54, 59, 65, 67, 80, 83, 85) for the acceleration, video, and fusion models.]

In Table 1, our method is compared with the submitted results of previous years. Our method achieves the best performance on the multimodal speech detection task. On the unimodal tasks, our AccelNet outperforms our VideoNet. Moreover, our accelerometer-based result is only slightly lower than that of [8], while our video-based method achieves a much higher performance than the second-best approach [3], indicating the good performance of C3D for extracting video features and also the good design of the VideoNet. The best performance of our multimodal result benefits from the good performance of the VideoNet.

From Fig. 2 we can see that the accelerometer-based method does not always outperform the video-based method, indicating that the signals from the accelerometer and the video could be complementary, which could explain the higher performance of the fusion of the two modalities compared to the unimodal methods. However, fusion did not lead to improved performance for all individual test subjects (see subjects 17 and 83), and a better fusion method should be considered in the future.
4 CONCLUSION
In this paper, we proposed a multimodal speech detection model with video and accelerometer data as input. Our model showed competitive results on the unimodal speech detection tasks with either video or accelerometer data as input, and it outperformed previous methods on the multimodal task, which uses both types of input.
REFERENCES
[1] Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, and Hayley Hung. 2018. The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. IEEE Transactions on Affective Computing (2018).
[2] Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung. 2018. Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection. In Working Notes Proceedings of the MediaEval 2018 Workshop. CEUR-WS.org.
[3] Panagiotis Giannakeris, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2019. Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection. In Working Notes Proceedings of the MediaEval 2019 Workshop.
[4] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale Video Classification with Convolutional Neural Networks. In CVPR.
[5] Liandong Li, Zhuo Hao, and Bo Sun. 2019. Combining Body Pose and Movement Modalities for No-audio Speech Detection. In Working Notes Proceedings of the MediaEval 2019 Workshop.
[6] Yang Liu, Zhonglei Gu, and Tobey H. Ko. 2018. Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification. In Working Notes Proceedings of the MediaEval 2018 Workshop.
[7] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[8] Jose Vargas and Hayley Hung. 2019. CNNs and Fisher Vectors for No-Audio Multimodal Speech Detection. In Working Notes Proceedings of the MediaEval 2019 Workshop.