    Multimodal Fusion of Appearance Features, Optical Flow and
            Accelerometer Data for Speech Detection.
                               Panagiotis Giannakeris1 , Stefanos Vrochidis1 , Ioannis Kompatsiaris1
                          1 Centre for Research & Technology Hellas - Information Technologies Institute, Greece

                                                             {giannakeris,stefanos,ikom}@iti.gr

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
MediaEval'19, 27-29 October 2019, Sophia Antipolis, France

ABSTRACT
In this paper we examine the task of automatic detection of speech without microphones, using an overhead camera and wearable accelerometers. For this purpose, we propose the extraction of hand-crafted appearance and optical flow features from the video modality, and time-domain features from the accelerometer data. We evaluate the performance of the separate modalities on a large dataset of over 25 hours of standing conversation between multiple individuals. Finally, we show that applying a multimodal late fusion technique can lead to a performance boost in most cases.

1 INTRODUCTION
There is increasing interest in applications that require automatic voice activity detection. Recognizing the speech status of people gathered in crowded environments, such as meetings or conferences, is highly informative, since speech is one of the primary elements of social interaction.
   This paper presents the algorithms and results from CERTH-ITI's participation in the No-Audio Multimodal Speech Detection task at MediaEval 2019 [2]. The task focuses on automatic speech detection using an overhead camera and wearable accelerometers. The camera records a meeting event where several individuals participate in standing conversations. Each subject wears a tri-axial accelerometer that captures body movement. Microphones are not suitable in many cases, since they may introduce background noise from the environment, be uncomfortable to wear, or even raise privacy concerns. In contrast, an overhead camera is not as invasive, and the accelerometers are isolated instruments free of environment noise.

2 APPROACH
2.1 Detecting Speech from Video
We aim to process short, non-overlapping video segments in order to classify them into speech or non-speech status. For this purpose we chose to extract low-level descriptors for each frame that represent body pose movements and speech gestures, and then aggregate the information along the short temporal windows.
   The videos are all taken from a single overhead camera which captures the full meeting space. Each video clip is a cropped version of the full-resolution video that shows the subject and the immediate surrounding space. The subjects move freely inside the room, changing conversation partners, and as such the videos follow the subjects at all times. This particular setting poses several challenges:
   • Facial characteristics are severely occluded. A subject's body may be partially occluded as well, as a result of their movements and interactions with others.
   • Multiple other subjects may appear inside a subject's immediate area, cross-contaminating the video data.
   • When the cropped region moves to follow a subject, global camera motion is introduced.
   • The orientation of the video is not aligned with the head pose orientation, making it difficult to obtain structured information consistent with pose or gaze.
   In order to deal with occlusions and the changing orientation of the human body we choose to extract appearance features, specifically the Histogram of Oriented Gradients (HOG) descriptor, over a spatial 3 × 3 grid. Nine HOG descriptors (one per cell) are therefore obtained and concatenated to form the HOG vector of a frame. We hypothesize that using HOG features in this manner introduces some structure into the final representation regarding: (a) the primary subject's pose orientation, and (b) the surrounding area elements, which may consist of other people as well as background space.
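A minimal sketch of this per-frame grid HOG step is given below, assuming grayscale frames and scikit-image; the nine orientation bins and the choice of one HOG block per grid cell are our assumptions, not parameters reported above.

```python
# A minimal sketch of the grid-based HOG descriptor described above, assuming
# grayscale frames and scikit-image. The 3x3 grid matches the text; the nine
# orientation bins and "one HOG block per cell" choice are our assumptions.
import numpy as np
from skimage.feature import hog


def grid_hog(gray_frame: np.ndarray, grid=(3, 3)) -> np.ndarray:
    """Compute one HOG descriptor per grid cell and concatenate them."""
    h, w = gray_frame.shape
    gh, gw = h // grid[0], w // grid[1]
    cells = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            patch = gray_frame[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            cells.append(hog(patch,
                             orientations=9,
                             pixels_per_cell=(gh, gw),   # one HOG cell spans the patch
                             cells_per_block=(1, 1),
                             feature_vector=True))
    return np.concatenate(cells)  # frame-level HOG vector
```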
   To capture gestures and body movements from the speaker we compute dense optical flow for each frame. Then, we extract Histogram of Optical Flow (HOF) features in a spatial grid as described above. The grid partitioning here should make our representations capable of describing movement in different areas of the frame. The surrounding environment may contain other people talking and moving, which can indicate that the primary subject in the center is currently not speaking. In these cases the HOF descriptors in peripheral grid cells are expected to have higher values. To compensate for camera motion we also extract Motion Boundary Histogram (MBH) features for each cell of the spatial grid. HOF and MBH are generally known to have complementary benefits for activity recognition tasks.
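The sketch below illustrates one way to realise this motion step: Farneback dense optical flow between consecutive frames, a magnitude-weighted orientation histogram (HOF) per cell of the same 3 × 3 grid, and gradient histograms of each flow component (MBH) per cell. The Farneback algorithm, its parameters and the eight-bin histograms are assumptions; the text does not specify them.

```python
# A minimal sketch of the HOF/MBH extraction described above. The Farneback
# parameters and the eight orientation bins are assumptions.
import cv2
import numpy as np


def orientation_hist(fx: np.ndarray, fy: np.ndarray, bins: int = 8) -> np.ndarray:
    """Magnitude-weighted histogram of 2D vector orientations, L2-normalised."""
    mag = np.hypot(fx, fy)
    ang = np.arctan2(fy, fx) % (2 * np.pi)
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-8)


def grid_hof_mbh(prev_gray: np.ndarray, gray: np.ndarray, grid=(3, 3)):
    """Return concatenated HOF and MBH vectors for one frame pair."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = gray.shape
    gh, gw = h // grid[0], w // grid[1]
    hof, mbh = [], []
    for r in range(grid[0]):
        for c in range(grid[1]):
            cell = flow[r * gh:(r + 1) * gh, c * gw:(c + 1) * gw]
            hof.append(orientation_hist(cell[..., 0], cell[..., 1]))
            # MBH: histogram the spatial gradients of each flow component,
            # which suppresses constant (camera) motion inside the cell.
            for comp in (cell[..., 0], cell[..., 1]):
                gy, gx = np.gradient(comp)
                mbh.append(orientation_hist(gx, gy))
    return np.concatenate(hof), np.concatenate(mbh)
```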
   All the low-level frame descriptors of the same type are L2-normalized and averaged across temporal windows of 20 frames, and then concatenated together to form a single representation for each second. Since the annotations are provided per frame, we assign to each 1-second segment the label held by the majority of its frames. We remove any black-screen instances from the training set and, since the classes are severely imbalanced, we also remove random negative samples to balance the training set. We chose under-sampling instead of over-sampling in order to avoid having duplicates in the training set. Finally, a Linear SVM classifier is trained using cross-validation on a random split, leaving 30% of the subjects out, to obtain the optimal value of the regularization parameter C.
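One way to realise this aggregation and training step with scikit-learn is sketched below; only the 20-frame windows, the negative under-sampling and the 30% subject hold-out come from the text, while the C grid, variable names and use of GridSearchCV/GroupShuffleSplit are our assumptions.

```python
# A minimal sketch of the aggregation and SVM training described above.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit


def window_features(per_frame: dict, frame_labels: np.ndarray, win: int = 20):
    """per_frame maps descriptor name (HOG/HOF/MBH) -> (num_frames, dim) array."""
    feats, labels = [], []
    for i in range(len(frame_labels) // win):
        sl = slice(i * win, (i + 1) * win)
        parts = []
        for x in per_frame.values():
            xw = x[sl] / (np.linalg.norm(x[sl], axis=1, keepdims=True) + 1e-8)
            parts.append(xw.mean(axis=0))               # average of L2-normalised frames
        feats.append(np.concatenate(parts))             # concatenate descriptor types
        labels.append(int(frame_labels[sl].mean() >= 0.5))  # majority frame label
    return np.vstack(feats), np.array(labels)


def train_linear_svm(X, y, subject_ids):
    # Under-sample the negatives (assumes negatives outnumber positives).
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = np.concatenate([pos, np.random.choice(neg, size=len(pos), replace=False)])
    cv = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)  # 30% of subjects out
    search = GridSearchCV(LinearSVC(), {"C": [0.01, 0.1, 1.0, 10.0]},
                          cv=cv.split(X[keep], y[keep], subject_ids[keep]))
    return search.fit(X[keep], y[keep]).best_estimator_
```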


2.2 Detecting Speech from Accelerometers
We deal with the task of speech detection from accelerometers in a
similar fashion. We slide non-overlapping windows of 20 steps to
segment the continuous x, y, z signal values, computing the magnitude values in each window:

    M = [m_1, m_2, \dots, m_{20}], \qquad m_i = \sqrt{x_i^2 + y_i^2 + z_i^2}

Then the following time-domain features are extracted from the magnitude values (a sketch of this computation follows the list):
     (1) Kurtosis
     (2) Interquartile range
     (3) Mean value
     (4) Standard deviation
     (5) Min and max values
     (6) Number of zero crossings
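A minimal sketch of these accelerometer features, using NumPy and SciPy, is given below; counting zero crossings of the mean-removed magnitude is our interpretation, since the text does not specify how zero crossings of a strictly positive magnitude signal are defined.

```python
# A minimal sketch of the accelerometer features: window magnitudes followed
# by the six time-domain statistics listed above. Counting zero crossings of
# the mean-removed magnitude is our assumption.
import numpy as np
from scipy.stats import iqr, kurtosis


def accel_window_features(xyz: np.ndarray, win: int = 20) -> np.ndarray:
    """xyz: (num_samples, 3) raw accelerometer signal -> (num_windows, 7) features."""
    mag = np.linalg.norm(xyz, axis=1)        # m_i = sqrt(x_i^2 + y_i^2 + z_i^2)
    feats = []
    for i in range(len(mag) // win):
        m = mag[i * win:(i + 1) * win]
        zero_crossings = int(np.sum(np.diff(np.sign(m - m.mean())) != 0))
        feats.append([kurtosis(m), iqr(m), m.mean(), m.std(),
                      m.min(), m.max(), zero_crossings])
    return np.array(feats)
```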
Again, since we solve the task by classifying each temporal window, we remove random negative instances in order to balance the training set. A Linear SVM classifier is trained here as well, cross-validating on a random split that leaves 30% of the subjects out, to obtain the optimal C.

2.3 Late Fusion
We deploy a late fusion mechanism in order to explore the multimodal nature of the task. We feed the visual and accelerometer SVMs with all the test samples, obtaining for each sample a pair of distances from the two separating hyperplanes. Then, we assign the label that corresponds to the larger of the two absolute distances. This simple late fusion mechanism guarantees that the most confident classifier is trusted for each particular sample.
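A minimal sketch of this fusion rule, assuming the two fitted linear SVMs from the previous sections:

```python
# A minimal sketch of the late fusion rule: keep the prediction of whichever
# classifier is farther from its separating hyperplane for each test sample.
import numpy as np


def fuse_predictions(video_svm, accel_svm, X_video, X_accel) -> np.ndarray:
    d_video = video_svm.decision_function(X_video)   # signed hyperplane distances
    d_accel = accel_svm.decision_function(X_accel)
    fused = np.where(np.abs(d_video) >= np.abs(d_accel), d_video, d_accel)
    return (fused > 0).astype(int)                    # 1 = speech, 0 = no speech
```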
3 RESULTS AND ANALYSIS
In order to evaluate our speech detection algorithms we train our classifiers on videos taken from 54 subjects and test on videos from 16 unseen subjects. We report the Area Under the Curve (AUC) metric for each test subject and each modality (Fig. 1). The mean AUC scores over all subjects are presented in Table 1, where performance is compared with last year's participants in this task. Our video estimator has the lowest mean score, at 61% AUC, and the accelerometer estimator performs higher by nearly 5%. The late fusion scheme achieves the best result, gaining another 2%, which looks promising given that our fusion scheme is a fairly simple one.

Table 1: Comparison of mean AUC±std scores between speech detection algorithms.

Method    Accel          Video          Fusion
[1]       0.656±0.074    0.549±0.079    0.658±0.073
[3]       0.533±0.020    0.512±0.021    0.535±0.019
Ours      0.649±0.066    0.614±0.067    0.672±0.051

Figure 1: AUC scores for each test subject.
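For completeness, the per-subject evaluation can be computed as in the sketch below, where the scores are decision values (e.g. the SVM distances, or the fused distances) rather than hard labels; variable names are assumptions.

```python
# A minimal sketch of the per-subject AUC evaluation reported in Fig. 1 and Table 1.
import numpy as np
from sklearn.metrics import roc_auc_score


def per_subject_auc(scores, labels, subject_ids):
    aucs = {s: roc_auc_score(labels[subject_ids == s], scores[subject_ids == s])
            for s in np.unique(subject_ids)}
    values = np.array(list(aucs.values()))
    return aucs, values.mean(), values.std()   # per-subject AUCs, mean, std
```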
   We hypothesize that the shortcomings of our video estimator lie in the ineffectiveness of our approach with respect to the frequent head pose orientation changes of the subjects. Nevertheless, it performs better by a good margin than the dense trajectories of [1] and the color histogram + LBP features of [3], which strengthens our belief that the spatial grid structure is a good first step towards making video estimators achieve more competitive results in this task. Another step for improvement would be to detect the head pose of the primary subject and align the spatial grid accordingly, ensuring that each cell encapsulates visual information from a similar position relative to the speaker across all subjects.
   The accelerometer estimator yields a satisfying performance compared with other methods presented at a previous version of this task, despite the fact that no frequency-domain signal processing was performed. The under-sampling strategy during the training phase may be a factor of improvement in this case, as well as for the video estimator.
   The fusion scores are better than the video and accelerometer scores for the majority of the test subjects. This shows that the confidence of the individual classifiers is a trustworthy measure for producing fused predictions in this task.
   In this paper we tackle the task by classifying temporal segments. A promising alternative would be to apply statistical modeling, such as Hidden Markov Models, to the sequences of extracted features. Additionally, in neither technique did we adopt any speech behavioral modeling for the subjects, which is a topic yet to be explored.

4 DISCUSSION AND OUTLOOK
In this work we have achieved competitive results for the video modality on the task of no-audio speech detection and, as a result, we have made the late fusion estimator more effective using only the confidence of the individual classifiers. However, there is still a lot of experimentation to be done with early fusion techniques as well. Finally, we have identified some key areas for improvement that should be examined thoroughly in order to achieve better performance from the separate modalities.

ACKNOWLEDGMENTS
This work was supported by the SUITCEYES project, funded by the European Commission under grant agreement No 780814.


REFERENCES
[1] Laura Cabrera-Quiros, Ekin Gedik, and Hayley Hung. Transductive
    Parameter Transfer, Bags of Dense Trajectories and MILES for No-
    Audio Multimodal Speech Detection. In Proc. of the MediaEval 2018
    Workshop. 2018.
[2] Ekin Gedik, Laura Cabrera-Quiros, and Hayley Hung. No-Audio
    Multimodal Speech Detection task at MediaEval 2019. In Proc. of the
    MediaEval 2019 Workshop. Sophia Antipolis, France, Oct. 27-29, 2019.
[3] Yang Liu, Zhonglei Gu, and Tobey H Ko. Analyzing Human Behavior
    in Subspace: Dimensionality Reduction + Classification. In Proc. of the
    MediaEval 2018 Workshop. 2018.