=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_28
|storemode=property
|title=Combining Visual and Movement Modalities for No-audio Speech Detection
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_28.pdf
|volume=Vol-2670
|authors=Liandong Li,Zhuo Hao,Bo Sun
|dblpUrl=https://dblp.org/rec/conf/mediaeval/LiH019
}}
==Combining Visual and Movement Modalities for No-audio Speech Detection==
Combining Body Pose and Movement Modalities for No-audio Speech Detection

Liandong Li, Zhuo Hao, Bo Sun
Beijing Normal University, China
bnulee@hotmail.com, hz@mail.bnu.edu.cn, tosunbo@bnu.edu.cn

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT

Speech detection is important to automatic social behaviour analysis. In this paper, we describe our approach for no-audio speech detection. We estimate speaker pose and movement with both cameras and acceleration sensors. The multimodal features are combined and utilized for per-second speech status prediction. The approach is tested on the MediaEval 2019 No-Audio Multimodal Speech Detection Task.

1 INTRODUCTION

Speech detection is important to automatic social behaviour analysis. There has been research focusing on this task using the audio signal. However, the use of audio can be restricted in certain situations, such as noisy environments or when privacy preservation is required [7]. Thus, some research explores this task by analysing human body behaviour [5, 7].

The 2019 No-Audio Multimodal Speech Detection Task [2] provides recorded data of speakers in social situations. The visual signal is captured through overhead cameras, and tri-axial body accelerations are collected using wearable devices [3]. Both modality signals are captured at 20 Hz, and the binary speaking status is annotated at each time-step.

In our work, we estimate human body pose and movement from the multimodal signals. The accelerometers provide information about the overall movement of a person, but they cannot describe the body language expressed by the movement and pose of the human body. Thus, we estimate body pose points in every frame to represent detailed body movement. The details of the proposed approach are described in Section 2.

Figure 1: The scheme of our proposed method. Pose and acceleration signals are combined to predict the speech probabilities.

2 APPROACH

In this section, we introduce our framework for no-audio speech detection. The framework consists of two components: multimodal representation and sequential classification. The first component extracts multimodal feature representations, while the second component classifies the sequential data.

2.1 Multimodal Representation

Two types of modalities are utilized in our framework: tri-axial acceleration and visual. For the tri-axial acceleration signal, we follow the method of [7]. Acceleration features are extracted from 3 s windows with 1.5 s overlap of the raw signal, the absolute values of the signal and the magnitude of the acceleration. Mean, variance and the power spectral density are calculated to form the final representation. The 70-dimensional acceleration feature X_acce is standardised to have zero mean and unit variance.

To extract the visual representation, we utilize the Regional Multi-Person Pose Estimation model [1]. This model consists of two components: a Symmetric Spatial Transformer Network (SSTN) and Parametric Pose Non-Maximum-Suppression (p-Pose NMS). The SSTN combines a spatial transformer network (STN), a single-person pose estimator (SPPE) [6] and a spatial de-transformer network (SDTN) [4]. The STN selects the potential area of the human body and the SPPE estimates the body pose. The estimated pose is then mapped back to the original image coordinates, and the NMS is used to eliminate redundant pose estimations. During training, a Pose-Guided Proposals Generator is utilized to augment the training data. The extracted 34-dimensional pose feature X_pose is normalized per speaker.
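The acceleration branch described above can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the authors' code: the windowing values (3 s windows, 1.5 s overlap, 20 Hz sampling) come from the text, but the exact PSD estimator and binning are not specified, so scipy.signal.welch and eight PSD bins per channel are assumptions. With seven channels (three raw axes, three absolute-value axes and the magnitude), this choice happens to yield 7 × 10 = 70 dimensions, matching the reported feature size.

```python
import numpy as np
from scipy.signal import welch

def acceleration_features(acc, fs=20, win_s=3.0, overlap_s=1.5, psd_bins=8):
    """Per-window statistics of a tri-axial acceleration stream.

    acc: (T, 3) array sampled at fs Hz. For each 3 s window with 1.5 s overlap
    we form raw, absolute-value and magnitude channels, then summarise each
    channel by its mean, variance and (assumed) first psd_bins Welch PSD bins.
    """
    win = int(win_s * fs)                    # 60 samples per window
    step = int((win_s - overlap_s) * fs)     # 30-sample hop (1.5 s overlap)
    feats = []
    for start in range(0, len(acc) - win + 1, step):
        w = acc[start:start + win]                                        # raw axes
        channels = np.hstack([w, np.abs(w),
                              np.linalg.norm(w, axis=1, keepdims=True)])  # 7 channels
        stats = []
        for c in channels.T:
            _, psd = welch(c, fs=fs, nperseg=min(len(c), 32))
            stats.extend([c.mean(), c.var(), *psd[:psd_bins]])
        feats.append(stats)                  # 7 x (2 + 8) = 70 values per window
    feats = np.asarray(feats)
    # standardise X_acce to zero mean and unit variance, as stated in the paper
    return (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)
```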
2.2 Sequential Logistic Classification

Given the multimodal feature representation of the speech data, we follow a sequential classification scheme to predict the speech status. However, a typical sequential model may face the challenge of the limited data available for training. Instead, we use a logistic regression model to classify the speech status at each time-step; a filter is then used to eliminate outliers in the prediction.

Specifically, given the features X_acce and X_pose, we first concatenate them into X_con. A logistic classifier h_Θ(X) = f(Θ^T X_con + b) is trained to predict the binary speech status y_t at time-step t. The sequential prediction y_s(t) of a speaker is then convolved with an N-dimensional filter g(t): Y = y_s(t) * g(t), where g(t) = [1/N, 1/N, ..., 1/N].
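The classification and smoothing step can be summarised in a few lines. This is a hedged sketch rather than the authors' implementation: scikit-learn's LogisticRegression stands in for the logistic classifier h_Θ, and the function name and argument layout below are our own; only the fusion by concatenation, the logistic model and the N-point moving-average filter come from the text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_speaking(X_acce, X_pose, y, X_acce_test, X_pose_test, N=15):
    """Fuse modalities, fit a logistic classifier and smooth the per-time-step
    speaking probabilities with an N-point moving-average filter."""
    X_con = np.hstack([X_acce, X_pose])       # concatenated features X_con
    clf = LogisticRegression(max_iter=1000).fit(X_con, y)

    X_test = np.hstack([X_acce_test, X_pose_test])
    y_s = clf.predict_proba(X_test)[:, 1]     # raw sequential prediction y_s(t)
    g = np.full(N, 1.0 / N)                   # g(t) = [1/N, ..., 1/N]
    return np.convolve(y_s, g, mode="same")   # Y = y_s(t) * g(t)
```

The filter length N would then be chosen by grid search on training folds (cf. Table 1 below), for example by maximizing sklearn.metrics.roc_auc_score on held-out annotations.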
3 RESULTS AND ANALYSIS

We evaluate the performance of the proposed method on the MediaEval 2019 No-Audio Multimodal Speech Detection Task dataset. The data is split into a training set and a testing set, with 54 and 16 videos respectively. Each video is 22 minutes long. A logistic regression model is trained to predict the binary speech status at each time-step; 54 × 22 × 60 = 71280 samples are used for training. To evaluate the model performance, we utilize three-fold cross validation on the training set, with the results shown in Table 2. Results are measured by the Area Under the ROC Curve (AUC).

Table 2: Cross validation AUC results of different modalities

Method | Raw    | Filtered
Accel  | 0.5281 | 0.5393
Video  | 0.5215 | 0.5305
Fusion | 0.5341 | 0.5529

Testing is done on each video of the testing set. After obtaining the predictions on the test videos, an N-dimensional filter [1/N, 1/N, ...] is convolved with the prediction probability vector to smooth the output. The dimension N is chosen through grid search on the training set (Table 1). The testing results are shown in Table 3. From the results, we can see that the acceleration signal outperforms the visual signal. Although the fusion prediction obtains a higher result on the training data, its testing performance is not as good as the acceleration modality alone.

Table 1: Cross validation AUC results with different values of N

N   | 3      | 5      | 7      | 9      | 11     | 13     | 15     | 17     | 19
AUC | 0.5439 | 0.5486 | 0.5505 | 0.5517 | 0.5524 | 0.5528 | 0.5529 | 0.5527 | 0.5523

Table 3: Testing data AUC results

Subject ID | Accel  | Video  | Fusion
2          | 0.6735 | 0.4234 | 0.5742
3          | 0.6759 | 0.5127 | 0.6534
15         | 0.7465 | 0.4958 | 0.7438
17         | 0.6556 | 0.5555 | 0.6539
26         | 0.7023 | 0.4557 | 0.6225
39         | 0.5429 | 0.5131 | 0.5468
40         | 0.6123 | 0.4733 | 0.5701
43         | 0.6751 | 0.5484 | 0.6421
51         | 0.4020 | 0.2527 | 0.2588
54         | 0.6652 | 0.6763 | 0.7191
59         | 0.6067 | 0.6430 | 0.6611
65         | 0.5718 | 0.5812 | 0.6254
67         | 0.7621 | 0.6173 | 0.7590
80         | 0.8573 | 0.4529 | 0.7941
83         | 0.6234 | 0.5034 | 0.5682
85         | 0.5274 | 0.5100 | 0.5208
Mean       | 0.6438 | 0.5134 | 0.6196

It is understandable that utilizing the visual signal is very challenging in this task, especially with the limited video data. In spite of the large number of image frames that could be used to train an image-level classifier, it is actually difficult to tackle the task with such a classifier. Considering that, we employed a pose estimator trained on a larger video dataset. However, the position of the camera from which the data was recorded differs from regular video, so the pose estimator does not work well in this task, which introduces a large proportion of inaccurate pose features. In comparison, the acceleration signal is acquired by a wearable physical sensor, which guarantees its reliability. Despite this, as mentioned before, the acceleration signal is ambiguous about local body movement. It therefore remains important to explore more efficient and accurate visual feature representations.

4 DISCUSSION AND OUTLOOK

In this paper, we describe our work on no-audio speech detection. We estimate speaker pose and movement through cameras and acceleration sensors. The multimodal features are combined and utilized for per-second speech status prediction. Results show that the acceleration modality outperforms the visual modality. In the future, we will explore more efficient and accurate visual feature representations.

ACKNOWLEDGMENTS

This work is supported by the Fundamental Research Funds for the Central Universities.

REFERENCES

[1] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-person Pose Estimation. In ICCV.
[2] Ekin Gedik, Laura Cabrera-Quiros, and Hayley Hung. 2019. No-Audio Multimodal Speech Detection Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop.
[3] Hayley Hung, Gwenn Englebienne, and Jeroen Kools. 2013. Classifying Social Actions with a Single Accelerometer. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 207–210.
[4] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and others. 2015. Spatial Transformer Networks. In Advances in Neural Information Processing Systems. 2017–2025.
[5] Yang Liu, Zhonglei Gu, and Tobey H. Ko. 2018. Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification. In MediaEval.
[6] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision. Springer, 483–499.
[7] Laura Cabrera Quiros, Ekin Gedik, and Hayley Hung. 2018. Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection. In MediaEval.