<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multimodal Fusion of Appearance Features, Optical Flow and Accelerometer Data for Speech Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Panagiotis Giannakeris</string-name>
          <email>giannakeris@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefanos Vrochidis</string-name>
          <email>stefanos@iti.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ioannis Kompatsiaris</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for Research &amp; Technology Hellas - Information Technologies Institute</institution>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>27</fpage>
      <lpage>29</lpage>
      <abstract>
        <p>In this paper we examine the task of automatically detecting speech without microphones, using an overhead camera and wearable accelerometers. For this purpose, we propose extracting hand-crafted appearance and optical flow features from the video modality, and time-domain features from the accelerometer data. We evaluate the performance of the separate modalities on a large dataset of over 25 hours of standing conversation between multiple individuals. Finally, we show that applying a multimodal late fusion technique can boost performance in most cases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>There is increasing interest in applications that require automatic
voice activity detection. Recognizing the speech status of people
gathered in crowded environments, such as meetings or conferences,
is particularly informative, as speech is one of the primary elements
of social interaction.</p>
      <p>
        This paper presents the algorithms and results from
CERTH-ITI’s participation in the No-Audio Multimodal Speech Detection
task at MediaEval 2019 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The task focuses on automatic speech
detection using an overhead camera and wearable accelerometers.
The camera records a meeting event where several individuals
participate in standing conversations. Each subject wears a tri-axial
accelerometer that captures body movement. Microphones are
unsuitable in many cases, since they may pick up background
noise from the environment, be uncomfortable to wear, or even
raise privacy concerns. In contrast, an overhead camera is less
invasive, and the accelerometers are isolated instruments free of
environmental noise.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>Detecting Speech from Video</title>
      <p>We aim to process short, non-overlapping video segments and
classify each one as speech or non-speech. For this purpose
we extract low-level descriptors from each frame that
represent body pose movements and speech gestures, and then aggregate
the information over short temporal windows.</p>
      <p>The videos are all taken from a single overhead camera that
captures the full meeting space. Each video clip is a cropped version
of the full-resolution video that shows the subject and the
immediate surrounding space. The subjects move freely inside the room,
changing conversation partners, so the cropped region follows each
subject at all times. This setting poses several challenges:</p>
      <list list-type="bullet">
        <list-item><p>Facial characteristics are severely occluded. A subject’s
body may be partially occluded as well, as a result of their
movements and interactions with others.</p></list-item>
        <list-item><p>Multiple other subjects may appear inside a subject’s
immediate area, cross-contaminating the video data.</p></list-item>
        <list-item><p>When the cropped region moves to follow a subject,
global camera motion is introduced.</p></list-item>
        <list-item><p>The orientation of the video is not aligned with the head pose
orientation, making it difficult to obtain structured
information consistent with pose or gaze.</p></list-item>
      </list>
      <p>To deal with occlusions and the changing orientation
of the human body we extract appearance features,
specifically the Histogram of Oriented Gradients (HOG) descriptor,
in a spatial 3 × 3 grid. Thus, 9 different HOG descriptors are
obtained and concatenated to form the HOG vector of a frame. We
hypothesize that using HOG features in this manner introduces
some structure to the final representation regarding (a) the primary
subject’s pose orientation and (b) the elements of the surrounding area,
which may include other people as well as background space.</p>
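      <p>As a rough illustration of the grid-based appearance descriptor, the following Python sketch computes one gradient-orientation histogram per cell of a 3 × 3 grid and concatenates the nine histograms. It is a simplification under stated assumptions rather than the implementation used in the experiments: each cell is reduced to a single unsigned 9-bin histogram with no block normalization, and the function names, the bin count and the list-of-lists image format are illustrative choices.</p>
      <preformat><![CDATA[
```python
import math


def cell_histogram(gray, r0, r1, c0, c1, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    for the pixels in rows r0..r1-1 and columns c0..c1-1."""
    rows, cols = len(gray), len(gray[0])
    hist = [0.0] * bins
    for r in range(r0, r1):
        for c in range(c0, c1):
            # Central differences, clamped at the image border.
            gx = gray[r][min(c + 1, cols - 1)] - gray[r][max(c - 1, 0)]
            gy = gray[min(r + 1, rows - 1)][c] - gray[max(r - 1, 0)][c]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            # Unsigned orientation folded into [0, pi).
            angle = math.atan2(gy, gx) % math.pi
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += mag
    return hist


def grid_hog(gray, grid=3, bins=9):
    """Concatenate one orientation histogram per cell of a grid x grid
    partition of the frame (assumed layout: row-major list of rows)."""
    rows, cols = len(gray), len(gray[0])
    descriptor = []
    for i in range(grid):
        for j in range(grid):
            r0, r1 = i * rows // grid, (i + 1) * rows // grid
            c0, c1 = j * cols // grid, (j + 1) * cols // grid
            descriptor.extend(cell_histogram(gray, r0, r1, c0, c1, bins))
    return descriptor
```
]]></preformat>
      <p>Because the cells are concatenated in a fixed order, the position of a histogram inside the vector encodes where in the frame the gradients were observed, which is the structural property the hypothesis above relies on.</p>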
      <p>To capture the speaker’s gestures and body movements we
compute dense optical flow for each frame. Then, we extract
Histogram of Optical Flow (HOF) features in a spatial grid as described
above. The grid partitioning should make our representations
capable of describing movement in different areas of the frame.
The surrounding environment may contain other people talking
and moving, which can indicate that the primary subject in the
center is currently not speaking. In these cases we expect the
HOF descriptors in peripheral grid cells to have higher values. To
compensate for camera motion we also extract Motion Boundary
Histogram (MBH) features for each cell of the spatial grid. HOF
and MBH are generally known to have complementary benefits for
activity recognition tasks.</p>
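      <p>The difference between HOF and MBH with respect to camera motion can be sketched as follows. The code assumes that a dense flow field is already available as two 2-D lists <italic>u</italic> and <italic>v</italic> (horizontal and vertical components) for a single grid cell. The 8-bin quantization, the function names and the omission of any normalization are assumptions made for illustration, not details taken from the experiments.</p>
      <preformat><![CDATA[
```python
import math


def orientation_histogram(dx, dy, bins=8):
    """Magnitude-weighted histogram of vector directions over [0, 2*pi)."""
    hist = [0.0] * bins
    for row_x, row_y in zip(dx, dy):
        for x, y in zip(row_x, row_y):
            mag = math.hypot(x, y)
            if mag > 0.0:
                angle = math.atan2(y, x) % (2.0 * math.pi)
                index = min(int(angle / (2.0 * math.pi) * bins), bins - 1)
                hist[index] += mag
    return hist


def spatial_gradient(field):
    """Central-difference derivatives of a 2-D field, clamped at the border."""
    rows, cols = len(field), len(field[0])
    gx = [[field[r][min(c + 1, cols - 1)] - field[r][max(c - 1, 0)]
           for c in range(cols)] for r in range(rows)]
    gy = [[field[min(r + 1, rows - 1)][c] - field[max(r - 1, 0)][c]
           for c in range(cols)] for r in range(rows)]
    return gx, gy


def hof(u, v, bins=8):
    """Histogram of Optical Flow: directions of the flow vectors themselves."""
    return orientation_histogram(u, v, bins)


def mbh(u, v, bins=8):
    """Motion Boundary Histogram: histograms of the spatial gradients of
    each flow component (MBHx followed by MBHy)."""
    ux, uy = spatial_gradient(u)
    vx, vy = spatial_gradient(v)
    return orientation_histogram(ux, uy, bins) + orientation_histogram(vx, vy, bins)
```
]]></preformat>
      <p>For a flow field that is constant over the cell, as produced by a pure translation of the cropped region, the HOF histogram is non-zero while the MBH histogram is all zeros, because the derivatives of a constant field vanish. Only local differences in motion, such as a moving arm against a static background, contribute to MBH.</p>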
      <p>All low-level frame descriptors of the same type are L2-normalized
and averaged across temporal windows of 20 frames, and
then concatenated to form a single representation for
each second. Since the annotations are provided per frame,
we annotate each 1-second segment with the label held by the majority
of its frames. We remove any black-screen
instances from the training set and, since the classes are severely
imbalanced, we also remove random negative samples to balance
the training set. We chose under-sampling over over-sampling
to avoid duplicates in the training set. Finally, a
linear SVM classifier is trained, using cross-validation on a random
split that leaves 30% of the subjects out to obtain the optimal value
of the regularization parameter C.</p>
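      <p>A minimal sketch of the window aggregation, majority labelling and random under-sampling steps is given below. It handles a single descriptor type; in the full pipeline the averaged vectors of the different descriptor types would be concatenated afterwards. The function names, the binary 0/1 label encoding, the fixed random seed and the tie-breaking behaviour (the label seen first in the window wins a 10/10 tie) are assumptions of this sketch.</p>
      <preformat><![CDATA[
```python
import math
import random
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def aggregate_windows(frame_descs, frame_labels, window=20):
    """Average L2-normalized frame descriptors over non-overlapping windows
    and give each window the majority label of its frames."""
    feats, labels = [], []
    for start in range(0, len(frame_descs) - window + 1, window):
        block = [l2_normalize(d) for d in frame_descs[start:start + window]]
        feats.append([sum(col) / window for col in zip(*block)])
        votes = Counter(frame_labels[start:start + window])
        labels.append(votes.most_common(1)[0][0])
    return feats, labels


def undersample_negatives(feats, labels, seed=0):
    """Randomly drop negative (label 0) windows until the classes are balanced."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    keep = sorted(pos + rng.sample(neg, min(len(neg), len(pos))))
    return [feats[i] for i in keep], [labels[i] for i in keep]
```
]]></preformat>
      <p>Trailing frames that do not fill a complete window are discarded in this sketch. Under-sampling keeps every positive window and only a random subset of the negatives, so no training sample is duplicated.</p>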
    </sec>
    <sec id="sec-4">
      <title>Detecting Speech from Accelerometers</title>
      <p>We approach speech detection from accelerometers in a
similar fashion. We slide non-overlapping windows of 20 samples to
segment the continuous x, y, z signal values, and compute the
magnitude values in each window:</p>
      <disp-formula><tex-math>M = [m_1, m_2, \ldots, m_{20}], \qquad m_i = \sqrt{x_i^2 + y_i^2 + z_i^2}</tex-math></disp-formula>
      <p>Then the following time-domain features are extracted from the
magnitude values:</p>
      <list list-type="order">
        <list-item><p>Kurtosis</p></list-item>
        <list-item><p>Interquartile range</p></list-item>
        <list-item><p>Mean value</p></list-item>
        <list-item><p>Standard deviation</p></list-item>
        <list-item><p>Min and max values</p></list-item>
        <list-item><p>Number of zero crossings</p></list-item>
      </list>
      <p>Again, because we solve the task by classifying
each temporal window, we remove random negative instances in
order to balance the training set. A linear SVM classifier is trained
here as well, cross-validating on a random split that leaves 30% of
the subjects out to obtain the optimal C.</p>
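      <p>The accelerometer features listed above can be sketched with the Python standard library as follows. Several details are assumptions of this sketch rather than statements about the experiments: kurtosis is computed in its non-excess population form, quartiles use the default method of <monospace>statistics.quantiles</monospace>, and zero crossings are counted on the mean-centred magnitude, since the raw magnitude is never negative.</p>
      <preformat><![CDATA[
```python
import math
import statistics


def magnitude(x, y, z):
    """Per-sample Euclidean norm of the tri-axial signal."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]


def window_features(m):
    """Time-domain features of one window of magnitude values."""
    n = len(m)
    mean = statistics.fmean(m)
    std = statistics.pstdev(m)
    # Non-excess population kurtosis; defined as 0 for a constant window.
    kurt = (sum((v - mean) ** 4 for v in m) / n) / std ** 4 if std > 0 else 0.0
    q1, _, q3 = statistics.quantiles(m, n=4)
    # Assumption: crossings are counted around the window mean.
    centred = [v - mean for v in m]
    crossings = sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)
    return {
        "kurtosis": kurt,
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        "min": min(m),
        "max": max(m),
        "zero_crossings": crossings,
    }
```
]]></preformat>
      <p>Each window of 20 samples yields a 7-dimensional feature vector, which is far more compact than the video representation.</p>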
    </sec>
    <sec id="sec-5">
      <title>Late Fusion</title>
      <p>We deploy a late fusion mechanism in order to explore the
multimodal nature of the task. We feed all the test samples to the visual
and accelerometer SVMs to obtain, for each sample, a
pair of distances from the two separating hyperplanes.
Then, we assign the label given by the classifier with the larger absolute
distance. This simple late fusion mechanism ensures that the more
confident classifier for a particular sample is
trusted.</p>
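      <p>The fusion rule amounts to a per-sample selection between two signed distances. A minimal sketch is shown below, assuming that positive distances correspond to the speech class and that ties are resolved in favour of the video classifier; both conventions, like the function names, are illustrative.</p>
      <preformat><![CDATA[
```python
def fuse(video_dist, accel_dist):
    """Return (label, score) from whichever classifier lies farther
    from its separating hyperplane for this sample."""
    score = video_dist if abs(video_dist) >= abs(accel_dist) else accel_dist
    label = 1 if score > 0 else 0
    return label, score


def fuse_all(video_dists, accel_dists):
    """Apply the fusion rule to aligned sequences of signed distances."""
    return [fuse(v, a) for v, a in zip(video_dists, accel_dists)]
```
]]></preformat>
      <p>For example, a sample with distances 0.2 (video) and -1.5 (accelerometer) is labelled as non-speech, because the accelerometer classifier is the more confident of the two. The returned score can be used directly as the fused confidence when computing ranking-based metrics.</p>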
    </sec>
    <sec id="sec-6">
      <title>RESULTS AND ANALYSIS</title>
      <p>To evaluate our speech detection algorithms we train our
classifiers on videos taken from 54 subjects and test on videos from
16 unseen subjects. We report the Area Under Curve (AUC) metric
for each test subject and each modality (Fig. 1). The mean AUC
scores over all subjects are presented in Table 1, where the performance
is also compared with last year’s submissions to this task. Our video
estimator has the lowest mean score, with 61% mean AUC, and the
accelerometer estimator scores nearly 5% higher. The late
fusion scheme achieves the best result, gaining another 2%, which
is promising given that our fusion scheme is fairly simple.</p>
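      <p>For reference, the per-subject AUC can be computed directly from classifier scores through its rank interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below is a plain quadratic-time version of that definition with illustrative function names; it is not the evaluation script of the task.</p>
      <preformat><![CDATA[
```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive
    and negative scores (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(per_subject):
    """Average the AUC over subjects; per_subject is a list of
    (labels, scores) pairs, one pair per test subject."""
    values = [auc(labels, scores) for labels, scores in per_subject]
    return sum(values) / len(values)
```
]]></preformat>
      <p>Since AUC depends only on the ordering of the scores, the signed distances from the SVM hyperplanes can be used as scores without any calibration.</p>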
      <p>
        We hypothesize that the shortcomings of our video estimator lie
in the ineffectiveness of our approach with respect to the frequent
head pose orientation changes of the subjects. Nevertheless, it
outperforms by a good margin the dense trajectories of [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
and the colorhist+LBP of [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which strengthens our belief that the
spatial grid structure is a good first step towards making the video
estimators achieve more competitive results in this task. A further
improvement would be to detect the head pose of the
primary subject and align the spatial grid accordingly, to ensure that
each cell encapsulates visual information from a similar position
relative to the speaker across all subjects.
      </p>
      <p>The accelerometer estimator yields satisfactory performance
compared with other methods presented at a previous edition of
this task, despite the fact that no frequency-domain signal processing
was performed. The under-sampling strategy during the training
phase may have contributed to the improvement in this case, as well as for
the video estimator.</p>
      <p>The fusion scores are better than both the video and accelerometer
scores for the majority of the test subjects. This suggests that the
confidence of the individual classifiers is a trustworthy
measure for producing fused predictions in this task.</p>
      <p>In this paper we tackle the task by classifying temporal
segments. A promising alternative would be to apply statistical
sequence models, such as Hidden
Markov Models, to the sequences of extracted features. Additionally, in neither technique did we adopt
any speech behavioral modeling for the subjects, which is a topic
yet to be explored.</p>
    </sec>
    <sec id="sec-7">
      <title>DISCUSSION AND OUTLOOK</title>
      <p>In this work we have achieved competitive results for the
video modality on the task of no-audio speech detection and,
as a result, we have made the late fusion estimator more effective
using only the confidence of the individual classifiers. However,
there is still a lot of experimentation to be done, including with early fusion
techniques. Finally, we have proposed some key areas for
improvement that should be examined thoroughly in order to achieve
better performance from the separate modalities.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the SUITCEYES project, funded by the
European Commission under grant agreement No 780814.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Laura</given-names>
            <surname>Cabrera-Quiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ekin</given-names>
            <surname>Gedik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <article-title>Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for NoAudio Multimodal Speech Detection</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ekin</given-names>
            <surname>Gedik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Laura</given-names>
            <surname>Cabrera-Quiros</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Hayley</given-names>
            <surname>Hung</surname>
          </string-name>
          .
          <article-title>No-Audio Multimodal Speech Detection task at MediaEval 2019</article-title>
          .
          <source>In Proc. of the MediaEval 2019 Workshop</source>
          . Sophia Antipolis, France, Oct. 27-29
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Yang</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Zhonglei</given-names>
            <surname>Gu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tobey H</given-names>
            <surname>Ko</surname>
          </string-name>
          .
          <article-title>Analyzing Human Behavior in Subspace: Dimensionality Reduction + Classification</article-title>
          .
          <source>In Proc. of the MediaEval 2018 Workshop</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>