    Combining Body Pose and Movement Modalities for No-audio
                       Speech Detection
                                                             Liandong Li, Zhuo Hao, Bo Sun
                                                       Beijing Normal University, China
bnulee@hotmail.com, hz@mail.bnu.edu.cn, tosunbo@bnu.edu.cn
ABSTRACT
Speech detection is important for automatic social behaviour analysis. In this paper, we describe our approach to no-audio speech detection. We estimate speaker pose and movement using both cameras and acceleration sensors. The multimodal features are combined and utilized for per-second speech status prediction. The approach is tested on the MediaEval 2019 No-Audio Multimodal Speech Detection Task.


1     INTRODUCTION
Speech detection is important for automatic social behaviour analysis. There has been research focusing on this task using audio signals. However, the utilization of audio could be restricted in certain situations, such as noisy environments or when privacy preservation is required [7]. Thus, some research explores this task by analysing human body behaviour [5, 7].
   The 2019 No-Audio Multimodal Speech Detection Task [2] provides recorded data of speakers in social situations. The visual signal is captured through overhead cameras. In addition, tri-axial body accelerations are collected using wearable devices [3]. The modality signals are captured at 20 Hz, while the binary speaking status is annotated at each time-step.
   In our work, we estimate human body pose and movement using the multimodal signals. The accelerometers provide information about the overall movement of a person, but they cannot describe the body language that is expressed by the movement and pose of the human body. Thus, we estimate body pose points in every frame to represent detailed body movement. The overall scheme is shown in Figure 1, and the details of the proposed approach are described in Section 2.

Figure 1: The scheme of our proposed method. Pose and acceleration signals are combined to predict the speech probabilities.
2    APPROACH
In this section, we introduce our framework for no-audio speech detection. The framework consists of two components: multimodal representation and sequential classification. The first component extracts multimodal feature representations, while the second component classifies the sequential data.
2.1    Multimodal Representation
Two types of modalities are utilized in our framework: tri-axial acceleration and visual. For the tri-axial acceleration signal, we follow the method of [7]. Acceleration features are extracted from 3 s windows with 1.5 s overlap, computed over the raw signal, the absolute values of the signal, and the magnitude of the acceleration. The mean, variance, and power spectral density are calculated to form the final representation. The resulting 70-dimensional acceleration feature $X_{acce}$ is standardised to have zero mean and unit variance.
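For illustration, the following minimal Python sketch mirrors the windowed statistics described above. It assumes the raw tri-axial acceleration is available as a NumPy array sampled at 20 Hz; the exact power-spectral-density binning that yields the 70-dimensional vector is not specified in the text, so the PSD summary below is a placeholder rather than the authors' exact feature set.

import numpy as np
from scipy.signal import welch

def acceleration_features(acc, fs=20, win_s=3.0, overlap_s=1.5):
    # acc: array of shape (n_samples, 3) with raw tri-axial acceleration.
    # Channels: raw x/y/z, their absolute values, and the magnitude -> 7 signals.
    channels = np.hstack([acc, np.abs(acc),
                          np.linalg.norm(acc, axis=1, keepdims=True)])
    win = int(win_s * fs)                      # 3 s window -> 60 samples
    step = int((win_s - overlap_s) * fs)       # 1.5 s overlap -> 30-sample step
    feats = []
    for start in range(0, len(channels) - win + 1, step):
        seg = channels[start:start + win]      # (win, 7)
        mean = seg.mean(axis=0)
        var = seg.var(axis=0)
        # Placeholder PSD summary; the paper's exact 70-dim layout is unspecified.
        _, psd = welch(seg, fs=fs, nperseg=win, axis=0)
        feats.append(np.concatenate([mean, var, psd.mean(axis=0)]))
    return np.asarray(feats)                   # (n_windows, 21) in this sketch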
   To extract the visual representation, we utilize the Regional Multi-Person Pose Estimation model [1]. This model consists of two components: a Symmetric Spatial Transformer Network (SSTN) and Parametric Pose Non-Maximum-Suppression (p-Pose NMS). The SSTN combines a spatial transformer network (STN), a single-person pose estimator (SPPE) [6], and a spatial de-transformer network (SDTN) [4]. The STN selects the potential area of the human body and the SPPE estimates the body pose; the estimated human pose is then mapped back to the original image coordinates by the SDTN. The p-Pose NMS is used to eliminate redundant pose estimations. During training, a Pose-Guided Proposals Generator is utilized to augment the training data. The extracted 34-dimensional pose feature $X_{pose}$ is normalized per speaker.
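As a small illustration of the per-speaker normalization step, the sketch below z-scores each speaker's pose features independently; the 17-keypoint x 2-coordinate layout and the choice of z-scoring are assumptions, since the text only states that the 34-dimensional $X_{pose}$ is normalized per speaker.

import numpy as np

def normalize_pose_per_speaker(pose_feats):
    # pose_feats: dict mapping speaker id -> array of shape (n_frames, 34),
    # i.e. 17 keypoints x 2 image coordinates per frame (assumed layout).
    normalized = {}
    for speaker, feats in pose_feats.items():
        mu = feats.mean(axis=0)
        sigma = feats.std(axis=0) + 1e-8   # avoid division by zero
        # Z-score per speaker (assumed form of the per-speaker normalization).
        normalized[speaker] = (feats - mu) / sigma
    return normalized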
2.2    Sequential Logistic Classification
Given the multimodal feature representation of the speech data, we follow a sequential classification scheme to predict the speech status. However, a typical sequential model may face the challenge of limited data that could be used for training. Instead, we use a logistic regression model to classify the speech status at each time-step, and a filter is then used to eliminate outliers in the prediction.
   Specifically, given the features $X_{acce}$ and $X_{pose}$, we first concatenate them into $X_{con}$. A logistic classifier $h_\Theta(X) = f(\Theta^\top X_{con} + b)$ is utilized to train and predict the binary speech status $y_t$ at time-step $t$. The sequential prediction $y_s(t)$ of a speaker is then convolved with an $N$-dimensional filter $g(t)$: $Y = y_s(t) \ast g(t)$, where $g(t) = [\tfrac{1}{N}, \tfrac{1}{N}, \ldots, \tfrac{1}{N}]$.
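A minimal sketch of this classification-and-smoothing step is given below, with scikit-learn's LogisticRegression standing in for the logistic classifier $h_\Theta$ and NumPy's convolution implementing the averaging filter $g(t)$; the feature matrices and the default N = 15 (the best value in Table 1) are illustrative, not the authors' exact setup.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_and_smooth(X_con_train, y_train, X_con_test, N=15):
    # Logistic classifier over the concatenated features X_con.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_con_train, y_train)
    # Per-time-step speaking probabilities y_s(t) for one speaker's sequence.
    y_s = clf.predict_proba(X_con_test)[:, 1]
    # Averaging filter g(t) = [1/N, ..., 1/N] suppresses isolated outlier predictions.
    g = np.full(N, 1.0 / N)
    return np.convolve(y_s, g, mode="same")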

                                         Table 1: Cross validation AUC results with different values of N

                               N            3            5     7        9       11        13        15        17        19
                             AUC      0.5439        0.5486   0.5505   0.5517   0.5524   0.5528    0.5529    0.5527     0.5523


Table 2: Cross validation AUC results of different modalities

                     Method        Raw          Filtered
                      Accel        0.5281       0.5393
                      Video        0.5215       0.5305
                      Fusion       0.5341       0.5529

Table 3: Testing data AUC results

                     Subject ID    Accel     Video     Fusion
                          2        0.6735    0.4234    0.5742
                          3        0.6759    0.5127    0.6534
                         15        0.7465    0.4958    0.7438
                         17        0.6556    0.5555    0.6539
                         26        0.7023    0.4557    0.6225
                         39        0.5429    0.5131    0.5468
                         40        0.6123    0.4733    0.5701
                         43        0.6751    0.5484    0.6421
                         51        0.4020    0.2527    0.2588
                         54        0.6652    0.6763    0.7191
                         59        0.6067    0.6430    0.6611
                         65        0.5718    0.5812    0.6254
                         67        0.7621    0.6173    0.7590
                         80        0.8573    0.4529    0.7941
                         83        0.6234    0.5034    0.5682
                         85        0.5274    0.5100    0.5208
                       Mean        0.6438    0.5134    0.6196
3    RESULTS AND ANALYSIS
We evaluate the performance of the proposed method on the MediaEval 2019 No-Audio Multimodal Speech Detection Task dataset. The data is split into a training set and a testing set, with 54 and 16 videos respectively. Each video is 22 minutes long. A logistic regression model is trained to predict the binary speech status at each time-step. For training the logistic regression model, 54 x 22 x 60 = 71280 samples are used. To evaluate the model performance, we utilize three-fold cross validation on the training set, with the results shown in Table 2. Results are measured by the Area Under the ROC Curve (AUC).
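The evaluation protocol can be sketched as follows; grouping the three folds by training video is our assumption, since the text only states that three-fold cross validation is used on the training set.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

def three_fold_auc(X, y, video_ids):
    # Three-fold cross-validation AUC, with folds grouped by video id
    # (assumed grouping) so time-steps from one recording stay together.
    aucs = []
    for train_idx, val_idx in GroupKFold(n_splits=3).split(X, y, groups=video_ids):
        clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], scores))
    return float(np.mean(aucs))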
   Testing is done on each video of the testing set. After obtaining the predictions on the test videos, an $N$-dimensional filter $[\tfrac{1}{N}, \tfrac{1}{N}, \ldots]$ is convolved with the prediction probability vector to smooth the output. The dimension $N$ is chosen through a grid search on the training set (Table 1). The testing results are shown in Table 3. From the results, we can see that the acceleration signal outperforms the visual signal. Although the fusion prediction achieves a higher result on the training data, its testing performance is not as good as that of the acceleration modality.
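The grid search over the filter width can be sketched as below, applied to a single speaker's held-out prediction sequence; the candidate values follow Table 1, while the rest is illustrative.

import numpy as np
from sklearn.metrics import roc_auc_score

def search_filter_width(y_true, y_prob, candidates=(3, 5, 7, 9, 11, 13, 15, 17, 19)):
    # Smooth one prediction sequence with each candidate N and keep the best AUC.
    best_N, best_auc = None, -np.inf
    for N in candidates:
        g = np.full(N, 1.0 / N)
        smoothed = np.convolve(y_prob, g, mode="same")
        auc = roc_auc_score(y_true, smoothed)
        if auc > best_auc:
            best_N, best_auc = N, auc
    return best_N, best_auc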
   It is understandable that utilizing the visual signal is very challenging in this task, especially with the limited video data. In spite of the large number of image frames that could be used to train an image-level classifier, it is actually difficult to tackle the task with an image classifier. Considering that, we employed a pose estimator trained on a larger video dataset. However, the overhead position of the camera from which the data was recorded is different from regular video, so the pose estimator does not work well in this task, which yields a large proportion of inaccurate pose features. In comparison, the acceleration signal is acquired by a wearable physical sensor, which guarantees its reliability. Despite this, as mentioned before, the acceleration signal is ambiguous about local body movement. It is therefore still important to explore more efficient and accurate visual feature representations.
4    DISCUSSION AND OUTLOOK
In this paper, we describe our work for no-audio speech detection. We estimate speaker pose and movement through cameras and acceleration sensors. The multimodal features are combined and utilized for per-second speech status prediction. Results show that the acceleration modality outperforms the visual modality. In the future, we will explore more efficient and accurate visual feature representations.

ACKNOWLEDGMENTS
This work is supported by the Fundamental Research Funds for the Central Universities.

REFERENCES
[1] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. 2017. RMPE: Regional Multi-person Pose Estimation. In ICCV.
[2] Ekin Gedik, Laura Cabrera-Quiros, and Hayley Hung. 2019. No-Audio Multimodal Speech Detection Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop.
[3] Hayley Hung, Gwenn Englebienne, and Jeroen Kools. 2013. Classifying social actions with a single accelerometer. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing. ACM, 207–210.
[4] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and others. 2015. Spatial transformer networks. In Advances in Neural Information Processing Systems. 2017–2025.
[5] Yang Liu, Zhonglei Gu, and Tobey H. Ko. 2018. Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification. In MediaEval.
[6] Alejandro Newell, Kaiyu Yang, and Jia Deng. 2016. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision. Springer, 483–499.
[7] Laura Cabrera Quiros, Ekin Gedik, and Hayley Hung. 2018. Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection. In MediaEval.