Multimodal Fusion of Body Movement Signals for No-audio Speech Detection

Xinsheng Wang1,2, Jihua Zhu1, Odette Scharenborg2
1 School of Software Engineering, Xi'an Jiaotong University, Xi'an, China
2 Multimedia Computing Group, Delft University of Technology, Delft, The Netherlands
wangxinsheng@stu.xjtu.edu.cn, zhujh@xjtu.edu.cn, o.e.scharenborg@tudelft.nl

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'20, December 14-15 2020, Online

ABSTRACT
No-audio Multimodal Speech Detection is one of the tasks of MediaEval 2020, with the goal of automatically detecting whether someone in a social interaction is speaking on the basis of body movement signals. In this paper, a multimodal fusion method that combines signals obtained by an overhead camera and a wearable accelerometer is proposed to determine whether someone is speaking. The proposed system takes the raw accelerometer signals directly as input, while using a pre-trained 3D convolutional network to extract the video features that serve as the second input. Experiments on the No-audio Multimodal Speech Detection task show that our method outperforms all submissions of previous years.

1 INTRODUCTION
There is a close relationship between body movements, e.g., gesturing, and speaking status, i.e., whether someone is speaking or not. This makes it possible, in principle, to determine whether a person is speaking by analyzing that person's body movements. The No-Audio Multimodal Speech Detection task of MediaEval 2020 focuses on determining the speaking status of standing subjects in crowded mingling scenarios from the information recorded by an overhead camera and a single body-worn triaxial accelerometer hung around the neck of each subject [1]. In this paper, we fuse the signals from these two modalities to perform the No-audio Speech Detection task. The details of the proposed approach are described in the following section. The code of the proposed method can be found at: https://github.com/xinshengwang/No-audio-speech-detection

2 APPROACH
The architecture of the proposed method is shown in Fig. 1. The proposed model consists of three parts, i.e., AccelNet, VideoNet, and the fusion part, which handle the accelerometer input, the video input, and the multi-modality fusion, respectively. Following the requirements of this task, the AccelNet and VideoNet are also designed to be able to predict the speaking status individually.

2.1 Data processing
In the provided database, video and accelerometer data were recorded for a duration of 22 minutes at 20 Hz. For training, we segmented the video and accelerometer data into 11 segments, each with a duration of 2 minutes, resulting in video segments of 2400 frames and accelerometer segments of size 3 × 2400.

2.2 AccelNet
As shown in Fig. 1, the AccelNet consists of three 1-D convolution layers and a bi-directional GRU layer. Between every two adjacent convolutional layers, a batch normalization layer is adopted. The three convolution layers have kernel sizes of 5, 3, and 3, and stride sizes of 5, 2, and 2, respectively, resulting in features with a receptive field of 23 frames and an overall temporal stride of 20 frames, i.e., exactly the sampling rate of 20 Hz. Therefore, we can assume that each of the 120 frames output by the last convolutional layer, each with a dimension of 256, represents the movement status within one second. Intuitively, the speaking status at one moment is related to the previous and following several time steps; therefore, a bi-directional GRU with 256 units is adopted after the last 1-D convolutional layer to capture this relationship. Concatenating the features of the two directions at each time step, the bi-directional GRU yields a 512-dimensional feature with a sequence length of 120. This feature is then concatenated with the video feature to perform the multimodal speech detection task. To allow the AccelNet to detect speaking status on the basis of the accelerometer data alone, a linear transformation followed by a sigmoid layer can be added after the bi-directional GRU.

2.3 VideoNet
The C3D network [7] pre-trained on Sports-1M [4] is adopted to extract the video features. The video was recorded at a frequency of 20 Hz, while the C3D model uses only 16 consecutive frames as context to obtain the 3D convolutional features. In practice, we therefore dropped the last 4 frames within each second of video, so that C3D extracts one video feature per second, resulting in 120 feature vectors with a dimension of 512 for each 2-minute video segment. The C3D features go through a bi-directional GRU with 256 units before being fused with the accelerometer features. Similar to the AccelNet, the output of the VideoNet can also be used for unimodal speech detection.

2.4 Fusion and objective function
An early fusion strategy is adopted in this paper. Specifically, the accelerometer feature from the AccelNet and the visual feature from the VideoNet are concatenated, resulting in a feature with 1024 dimensions and 120 frames. Two linear transformation layers are used to transform the feature dimension from 1024 to 1, and a sigmoid layer is applied after the last linear transformation layer to obtain the final prediction probability. To train the model, the binary cross-entropy loss is adopted at the frame level.
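The temporal downsampling described above can be sanity-checked with a short sketch. This is an illustrative calculation, not the released implementation: the kernel/stride/padding values (5/5/2, 3/2/1, 3/2/1) are those given in Fig. 1, and the formula is the standard output-length rule for 1-D convolutions.

```python
import math

def conv1d_out_len(length, kernel, stride, padding):
    """Standard output-length rule for a 1-D convolution layer."""
    return math.floor((length + 2 * padding - kernel) / stride) + 1

# AccelNet convolution stack as (kernel, stride, padding), per Fig. 1.
LAYERS = [(5, 5, 2), (3, 2, 1), (3, 2, 1)]

def accelnet_seq_lens(n_frames):
    """Sequence length after each convolution layer in the stack."""
    lens = []
    for k, s, p in LAYERS:
        n_frames = conv1d_out_len(n_frames, k, s, p)
        lens.append(n_frames)
    return lens

# A 2-minute segment at 20 Hz has 2400 frames.
print(accelnet_seq_lens(2400))  # -> [480, 240, 120]
```

The cumulative stride of the stack is 5 × 2 × 2 = 20 frames, so each of the 120 output steps covers exactly one second of accelerometer data at 20 Hz, matching the one-feature-per-second rate of the VideoNet.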
First, the AccelNet and VideoNet are trained individually on the unimodal prediction task. Next, the pre-trained models are used in the multimodal task. During multimodal training, we only update the fusion network, i.e., the two linear transformation layers, while keeping the parameters of the pre-trained AccelNet and VideoNet fixed.

[Figure 1: The proposed multimodal speech detection network. AccelNet branch: input 3×2400 → 1-D CNN (kernel 5, stride 5, padding 2) → 64×480 → batch normalization → 1-D CNN (kernel 3, stride 2, padding 1) → 128×240 → batch normalization → 1-D CNN (kernel 3, stride 2, padding 1) → 256×120 → bi-directional GRU → 512×120. VideoNet branch: C3D → 120 features of dimension 512 → bi-directional GRU → 512×120. Fusion: concatenate → 1024×120 → FC → 512×120 → FC → 1×120 → sigmoid → fusion output. Each branch also has its own FC → 1×120 → sigmoid head for unimodal output.]

3 RESULTS
In order to evaluate our speech detection approach, we followed the given data split of the No-audio Speech Detection task. The model was trained on data from 54 subjects and tested on data from 16 unseen subjects that do not overlap with the subjects in the training set. We report the Area Under the Curve (AUC) metric for each test subject and each modality. The mean AUC scores computed over all test subjects are shown in Table 1, while the AUC scores for each test subject separately are shown in Fig. 2.

Table 1: Performance of each of the previously submitted results and our proposed method for the unimodal and multimodal speech detection tasks. Bold indicates best result.

Method                      Accel          Video          Fusion
Cabrera-Quiros et al. [2]   0.656±0.074    0.549±0.079    0.658±0.073
Liu et al. [6]              0.533±0.020    0.512±0.021    0.535±0.019
Giannakeris et al. [3]      0.649±0.066    0.614±0.067    0.672±0.051
Li et al. [5]               0.644          0.513          0.620
Vargas et al. [8]           0.692          0.552          0.693
The proposed model          0.689±0.094    0.656±0.076    0.712±0.081

[Figure 2: AUC scores of the acceleration, video, and fusion models for each test subject (IDs 2, 3, 15, 17, 26, 39, 40, 43, 51, 54, 59, 65, 67, 80, 83, 85).]

In Table 1, our method is compared with the submission results of previous years. Our method achieves the best performance on the multimodal speech detection task. On the unimodal tasks, our AccelNet outperforms our VideoNet. Moreover, the performance of our accelerometer-based method is only slightly lower than that of [8], while our video-based method achieves a much higher performance than the second-best approach [3], indicating both the strength of C3D for extracting video features and the good design of the VideoNet. The strong performance of our multimodal result benefits from the good performance of the VideoNet.

From Fig. 2 we can see that the accelerometer-based method does not always outperform the video-based method, indicating that the signals from the accelerometer and the video could be complementary, which could explain the higher performance of the fusion of the two modalities compared to the unimodal methods. However, fusion did not lead to improved performance for all individual test subjects (see subjects 17 and 83), and a better fusion method should be considered in the future.

4 CONCLUSION
In this paper, we proposed a multimodal speech detection model with video and accelerometer data as input. Our model showed competitive results on the unimodal speech detection tasks with either video or accelerometer data as input, and it outperformed previous methods on the multimodal task, which uses both types of input.

REFERENCES
[1] Laura Cabrera-Quiros, Andrew Demetriou, Ekin Gedik, Leander van der Meij, and Hayley Hung. 2018. The MatchNMingle dataset: a novel multi-sensor resource for the analysis of social interactions and group dynamics in-the-wild during free-standing conversations and speed dates. IEEE Transactions on Affective Computing (2018).
[2] Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung. 2018.
Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection. In 2018 Working Notes Proceedings of the MediaEval Workshop, MediaEval 2018. CEUR-WS.org, 3.
[3] Panagiotis Giannakeris, Stefanos Vrochidis, and Ioannis Kompatsiaris. 2019. Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection. (2019).
[4] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale Video Classification with Convolutional Neural Networks. In CVPR.
[5] Liandong Li, Zhuo Hao, and Bo Sun. 2019. Combining Body Pose and Movement Modalities for No-audio Speech Detection. (2019).
[6] Yang Liu, Zhonglei Gu, and Tobey H. Ko. 2018. Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification. In MediaEval.
[7] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. 2015. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision. 4489–4497.
[8] Jose Vargas and Hayley Hung. 2019. CNNs and Fisher Vectors for No-Audio Multimodal Speech Detection. (2019).