=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_20
|storemode=property
|title=Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection.
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_20.pdf
|volume=Vol-2670
|authors=Panagiotis Giannakeris,Stefanos Vrochidis,Ioannis Kompatsiaris
|dblpUrl=https://dblp.org/rec/conf/mediaeval/GiannakerisVK19
}}
==Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection.==
Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris
Centre for Research & Technology Hellas - Information Technologies Institute, Greece
{giannakeris,stefanos,ikom}@iti.gr

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

ABSTRACT

In this paper we examine the task of automatic detection of speech without microphones, using an overhead camera and wearable accelerometers. For this purpose, we propose the extraction of hand-crafted appearance and optical flow features from the video modality, and time-domain features from the accelerometer data. We evaluate the performance of the separate modalities on a large dataset of over 25 hours of standing conversations between multiple individuals. Finally, we show that applying a multimodal late fusion technique can lead to a performance boost in most cases.

1 INTRODUCTION

There is increasing interest in applications that require automatic voice activity detection. Recognizing the speech status of people gathered in crowded environments, such as meetings or conferences, is highly informative, as speech is one of the primary elements of social interaction.

This paper presents the algorithms and results from CERTH-ITI's participation in the No-Audio Multimodal Speech Detection task at MediaEval 2019 [2]. The task focuses on automatic speech detection using an overhead camera and wearable accelerometers. The camera records a meeting event where several individuals participate in standing conversations. Each subject wears a tri-axial accelerometer that captures body movement. The use of microphones is not suitable in many cases, since they may introduce background noise from the environment, be uncomfortable to wear, or even raise privacy concerns. In contrast, an overhead camera is not as invasive, and the accelerometers are isolated instruments free of environment noise.

2 APPROACH

2.1 Detecting Speech from Video

We aim to process short, non-overlapping video segments in order to classify them into speech or non-speech status. For this purpose we chose to extract low-level descriptors for each frame that represent body pose movements and speech gestures, and then aggregate the information along short temporal windows.

The videos are all taken from a single overhead camera which captures the full meeting space. Each video clip is a cropped version of the full-resolution video that shows the subject and the immediate surrounding space. The subjects move freely inside the room, changing conversation partners, and as such the videos follow the subjects at all times. Several challenges are posed as a result of this particular setting:

• Facial characteristics are severely occluded. A subject's body may be partially occluded as well, as a result of their movements and interactions with others.
• Multiple other subjects may appear inside a subject's immediate area, cross-contaminating the video data.
• When the cropped region moves to follow a subject, global camera motion is introduced.
• The orientation of the video is not aligned with head pose orientation, making it difficult to obtain structured information consistent with pose or gaze.

In order to deal with occlusions and the changing orientation of the human body, we choose to extract appearance features, specifically the Histogram of Oriented Gradients (HOG) descriptor, in a spatial 3 × 3 grid. Therefore, 9 different HOG descriptors are obtained and concatenated to form the HOG vector of a frame. We hypothesize that using HOG features in this manner introduces some structure to the final representation regarding: (a) the primary subject's pose orientation and (b) the surrounding area elements, which may consist of other people as well as background space.

To capture gestures and body movements from the speaker we compute dense optical flow for each frame. Then, we extract Histogram of Optical Flow (HOF) features in a spatial grid as described above. The grid partitioning here should make our representations capable of describing movement in different areas of the frame. The surrounding environment may contain other people talking and moving, which can indicate that the primary subject in the center is currently not speaking. In these cases it is expected that HOF descriptors in peripheral grid cells have higher values. To compensate for camera motion we also extract Motion Boundary Histogram (MBH) features for each cell of the spatial grid. HOF and MBH are generally known to have complementary benefits for activity recognition tasks.

All the low-level frame descriptors of the same type are L2-normalized and averaged across temporal windows of 20 frames, and then concatenated together to form a single representation for each second. Since the annotations are provided for each frame, we assign the label that the majority of the frames hold in order to annotate each 1-second segment. We remove any black-screen instances from the training set and, since the classes are severely imbalanced, we also remove random negative samples to balance the training set. We chose under-sampling instead of over-sampling in order to avoid having duplicates in the training set. Finally, a Linear SVM classifier is trained using cross-validation on a random split, leaving 30% of the subjects out, to obtain the optimal value of the regularization parameter C.
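As a rough illustration of the per-frame grid descriptors described above, the sketch below computes 3 × 3 grid HOG and HOF features for one frame pair. The library choices (OpenCV's Farneback flow, scikit-image's HOG) and the bin and cell parameters are assumptions made for illustration, not the paper's exact implementation; MBH can be obtained analogously from the spatial gradients of the flow field.

```python
# Illustrative sketch of per-frame 3x3 grid HOG and HOF descriptors.
# Library choices (OpenCV, scikit-image) and parameter values are
# assumptions; the paper does not specify its implementation details.
import numpy as np
import cv2
from skimage.feature import hog

def grid_cells(img, rows=3, cols=3):
    """Split an image (H x W) into a rows x cols grid of cells."""
    h, w = img.shape[:2]
    for r in range(rows):
        for c in range(cols):
            yield img[r * h // rows:(r + 1) * h // rows,
                      c * w // cols:(c + 1) * w // cols]

def frame_hog(gray):
    """Concatenate one HOG descriptor per grid cell (9 cells in total)."""
    return np.concatenate([
        hog(cell, orientations=9, pixels_per_cell=(16, 16),
            cells_per_block=(1, 1), feature_vector=True)
        for cell in grid_cells(gray)
    ])

def frame_hof(prev_gray, gray, bins=9):
    """Histogram of optical-flow orientations, weighted by magnitude, per cell."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    feats = []
    for m_cell, a_cell in zip(grid_cells(mag), grid_cells(ang)):
        hist, _ = np.histogram(a_cell, bins=bins, range=(0, 2 * np.pi),
                               weights=m_cell)
        feats.append(hist)
    return np.concatenate(feats)
```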
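The aggregation and training step can be sketched in the same spirit. The snippet below L2-normalizes per-frame descriptors of one type, averages them over 20-frame windows, assigns majority-vote labels and under-samples negatives before tuning a linear SVM over C; scikit-learn and all helper names are illustrative assumptions, and the subject-wise 30% hold-out split is not reproduced here.

```python
# Sketch of the 1-second aggregation and balanced Linear SVM training
# described above; frame descriptors and labels are assumed given, and
# scikit-learn is an assumed (not stated) implementation choice.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

def aggregate_seconds(frame_descs, frame_labels, window=20):
    """L2-normalize per-frame descriptors, average them over windows of
    `window` frames, and take the majority label of each window."""
    X, y = [], []
    for start in range(0, len(frame_descs) - window + 1, window):
        block = np.asarray(frame_descs[start:start + window], dtype=float)
        block /= np.linalg.norm(block, axis=1, keepdims=True) + 1e-12
        X.append(block.mean(axis=0))
        labels = frame_labels[start:start + window]
        y.append(int(np.sum(labels) * 2 > len(labels)))  # majority vote
    return np.vstack(X), np.asarray(y)

def undersample(X, y, seed=0):
    """Randomly drop negatives (the majority class) to balance the set."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep_neg = rng.choice(neg, size=len(pos), replace=False)
    idx = np.concatenate([pos, keep_neg])
    return X[idx], y[idx]

# Grid search over the regularization parameter C with a simple
# cross-validation; the exact 30% subject-wise split is not reproduced.
svm = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1, 10]}, cv=3)
```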
2.2 Detecting Speech from Accelerometers

We deal with the task of speech detection from accelerometers in a similar fashion. We slide non-overlapping windows of 20 steps to segment the continuous x, y, z signal values, computing the magnitude values in each window:

M = [m_1, m_2, \ldots, m_{20}], \qquad m_i = \sqrt{x_i^2 + y_i^2 + z_i^2}

Then the following time-domain features are extracted from the magnitude values:

(1) Kurtosis
(2) Interquartile range
(3) Mean value
(4) Standard deviation
(5) Min and max values
(6) Number of zero crossings

Again, because we solve the task by classifying each temporal window, we remove random negative instances in order to balance the training set. A Linear SVM classifier is trained here as well, cross-validating on a random split, leaving 30% of the subjects out, to obtain the optimal C.
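A minimal sketch of this windowed accelerometer feature extraction, assuming the raw tri-axial signal arrives as an (N, 3) array and using SciPy for kurtosis and the interquartile range; counting zero crossings on the mean-removed magnitude is one plausible reading of the description above, not a detail stated in the paper.

```python
# Sketch of the accelerometer window features listed above; SciPy and the
# helper layout are assumptions, not the authors' exact implementation.
import numpy as np
from scipy.stats import kurtosis, iqr

def window_features(xyz, window=20):
    """Split an (N, 3) accelerometer signal into non-overlapping windows of
    `window` samples and compute time-domain features of the magnitude."""
    mag = np.linalg.norm(xyz, axis=1)          # m_i = sqrt(x^2 + y^2 + z^2)
    feats = []
    for start in range(0, len(mag) - window + 1, window):
        m = mag[start:start + window]
        # zero crossings of the mean-removed magnitude (assumed reading)
        zero_crossings = np.sum(np.diff(np.sign(m - m.mean())) != 0)
        feats.append([
            kurtosis(m),          # (1) kurtosis
            iqr(m),               # (2) interquartile range
            m.mean(),             # (3) mean value
            m.std(),              # (4) standard deviation
            m.min(), m.max(),     # (5) min and max values
            zero_crossings,       # (6) number of zero crossings
        ])
    return np.asarray(feats)
```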
2.3 Late Fusion

We deploy a late fusion mechanism in order to explore the multimodal nature of the task. We feed the visual and accelerometer SVMs with all the test samples, in order to obtain for each one a pair of distances from the two separating hyperplanes. Then, we assign the label that corresponds to the larger of the two absolute distances. This simple late fusion mechanism ensures that the most confident classifier for a particular sample is trusted.
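A short sketch of this fusion rule, assuming the two fitted linear SVMs expose a scikit-learn-style decision_function that returns signed distances to the separating hyperplanes (the object and variable names are illustrative):

```python
# Sketch of the late fusion rule described above: for every test sample,
# trust the modality whose SVM lies farther from its separating hyperplane.
import numpy as np

def late_fusion(video_svm, accel_svm, X_video, X_accel):
    """Return fused labels by taking, per sample, the prediction of the
    classifier with the larger absolute decision value (confidence)."""
    d_video = video_svm.decision_function(X_video)   # signed distances
    d_accel = accel_svm.decision_function(X_accel)
    use_video = np.abs(d_video) >= np.abs(d_accel)
    fused_scores = np.where(use_video, d_video, d_accel)
    return (fused_scores > 0).astype(int)
```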
3 RESULTS AND ANALYSIS

In order to evaluate our speech detection algorithms we train our classifiers on videos taken from 54 subjects and test on videos from 16 unseen subjects. We report the Area Under Curve (AUC) metric for each test subject and each modality (Fig. 1). The mean AUC scores over all subjects are presented in Table 1, where the performance is compared with last year's participants in this task. Our video estimator has the lowest mean score with 61% mean AUC, and the accelerometer estimator performs higher (64.9% vs. 61.4%). The late fusion scheme achieves the best result, gaining another 2%, which looks promising given that our fusion scheme is a fairly simple one.

[Figure 1: AUC scores for each test subject.]

Table 1: Comparison of mean AUC±std scores between speech detection algorithms.

Method   Accel          Video          Fusion
[1]      0.656±0.074    0.549±0.079    0.658±0.073
[3]      0.533±0.020    0.512±0.021    0.535±0.019
Ours     0.649±0.066    0.614±0.067    0.672±0.051

We hypothesize that the shortcomings of our video estimator lie in the ineffectiveness of our approach with respect to the frequent head pose orientation changes of the subjects. Nevertheless, it performs better by a good margin than the dense trajectories of [1] and the colorhist+LBP of [3], which strengthens our belief that the spatial grid structure is a good first step towards making video estimators achieve more competitive results in this task. Another step for improvement would be to detect the head pose of the primary subject and align the spatial grid accordingly, to ensure that each cell encapsulates visual information from a similar position relative to the speaker across all subjects.

The accelerometer estimator yields a satisfying performance compared with other methods presented at a previous version of this task, despite the fact that no frequency-domain signal processing was performed. The under-sampling strategy during the training phase may be a factor of improvement in this case, as well as for the video estimator.

The fusion scores are better than the video and accelerometer scores for the majority of the test subjects. This shows that the confidence of the individual classifiers is a trustworthy measure for producing fused predictions in this task.

In this paper we tackle the task by classifying temporal segments. A promising alternative would be to apply statistical modeling to the sequences of extracted features, such as Hidden Markov Models. Additionally, in neither technique did we adopt any speech behavioral modeling for the subjects, which is a topic yet to be explored.

4 DISCUSSION AND OUTLOOK

In this work we have managed to achieve competitive results for the video modality on the task of no-audio speech detection, and as a result we have made the late fusion estimator more effective using only the confidence of the individual classifiers. However, there is still a lot of experimentation to be done with early fusion techniques as well. Finally, we have proposed some key areas for improvement that should be examined thoroughly in order to achieve better performance from the separate modalities.

ACKNOWLEDGMENTS

This work was supported by the SUITCEYES project, funded by the European Commission under grant agreement No 780814.

REFERENCES

[1] Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung. Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection. In Proc. of the MediaEval 2018 Workshop. 2018.
[2] Ekin Gedik, Laura Cabrera-Quiros, and Hayley Hung. No-Audio Multimodal Speech Detection Task at MediaEval 2019. In Proc. of the MediaEval 2019 Workshop. Sophia Antipolis, France, Oct. 27-29, 2019.
[3] Yang Liu, Zhonglei Gu, and Tobey H. Ko. Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification. In Proc. of the MediaEval 2018 Workshop. 2018.