Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection

Laura Cabrera-Quiros (1,2), Ekin Gedik (1), Hayley Hung (1)
(1) Delft University of Technology, Netherlands
(2) Instituto Tecnológico de Costa Rica, Costa Rica
{l.c.cabreraquiros,e.gedik,h.hung}@tudelft.nl

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, France.

ABSTRACT
This paper presents the algorithms that the task organisers deployed for the automatic Human Behaviour Analysis (HBA) task of MediaEval 2018. The HBA task investigates the alternative modalities of video and body-worn acceleration for the detection of speaking status. For unimodal estimation from acceleration, we employ a transfer learning approach, Transductive Parameter Transfer (TPT), which has been shown to perform satisfactorily in a similar setting [4]. For estimation from the video modality, bags of dense trajectories are classified with a multiple instance learning approach (MILES) [2]. Finally, late fusion combines the outputs of both modalities. The multimodal approach obtains a mean AUC of 0.658, outperforming both single-modality approaches.

1 INTRODUCTION
The Human Behaviour Analysis (HBA) task of MediaEval 2018 focuses on non-audio speaking status detection in crowded mingling events [1]. Such events are interesting because they are concentrated moments in which people interact freely, resulting in unstructured and varied social behaviour. Since speaking turns have been shown to be vital units of social behaviour [9], their automatic detection makes detailed analysis of social behaviour possible.

Traditionally, audio is used for the detection of speech. However, the dense nature of large gatherings introduces restrictions such as background noise, making the use of audio challenging. To overcome this challenge, the HBA task investigates the alternative modalities of wearable acceleration and video for the detection of speaking status. The main idea behind this approach is backed by prior work in social science showing that speakers tend to move (e.g. gesture) during speech [5].

The task requires participants to provide solutions for unimodal estimation, both from acceleration and from video, and for multimodal estimation. For more details about the task, please refer to [1].

For acceleration, we employ the transfer learning method called Transductive Parameter Transfer (TPT), which was shown to perform satisfactorily in a similar setting [4]. Speaker estimation from video is carried out by extracting bags of dense trajectories and using MILES (a multiple instance learning method) for classification. This approach allows us to overcome the cross-contamination between subjects standing close together, caused by their overlapping bounding boxes. Finally, the multimodal estimation is obtained by combining the outputs of the two unimodal classifiers using late fusion. The following section explains these approaches in detail.

2 METHODOLOGY

2.1 Estimation from acceleration: TPT
Even though speakers are known to act differently from non-speakers [5], their behaviours vary greatly, which makes automatic estimation from acceleration a challenging task. To account for this variance, we employ a transfer learning model called TPT, which provides personalised models. TPT computes the parameters of the optimal classifier for a target dataset $X^t$ given a set of source datasets with their own corresponding optimal classifiers. The classifier for the target data is computed without using any label information from the target dataset. The method was first proposed for facial expression detection [7]; a specialised version tuned for speaking status detection from acceleration was presented in [4].

Let the $N$ labelled source datasets and the unlabelled target dataset be defined as $\mathcal{D}^{s_1}, \ldots, \mathcal{D}^{s_N}$, with $\mathcal{D}^{s_i} = \{(x_j^{s_i}, y_j^{s_i})\}_{j=1}^{n_{s_i}}$, and $X^t = \{x_j^t\}_{j=1}^{n_t}$. The following steps compute the optimal parameters $(w^t, c^t)$ for $X^t$, where $w$ and $c$ correspond to the regression coefficients and the intercept, respectively:
(1) $\{\theta_i = (w_i, c_i)\}_{i=1}^{N}$ is computed using L2-penalised logistic regression on each source dataset.
(2) The training set $\tau = \{(X^{s_i}, \theta_i)\}_{i=1}^{N}$ is created.
(3) The kernel matrix $K$ that defines the distances between dataset distributions, where $K_{ij} = \kappa(X^{s_i}, X^{s_j})$, is computed with an Earth Mover's Distance kernel [6].
(4) Given $K$ and $\tau$, the mapping $\hat{f}(\cdot)$ between the marginal distributions of the datasets and their optimal parameters is computed with Kernel Ridge Regression.
(5) $(w^t, c^t) = \hat{f}(X^t)$ is computed using the mapping obtained in the previous step.
For a more detailed explanation of each step, readers can refer to [4].

We used statistical and spectral features extracted from 3 s windows with 1.5 s overlap, computed for each axis of the raw acceleration signal, the absolute values of the acceleration signal, and the magnitude of the acceleration. As statistical features, mean and variance values are calculated. The power spectral density, computed using 8 bins with logarithmic spacing, forms the spectral feature set. Each axis of the acceleration is standardised to zero mean and unit variance. The probability outputs are then upsampled to 1 s windows.
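A minimal sketch of steps (1)-(5) is given below, for illustration only; the actual implementation follows [4]. The scikit-learn estimators, the approximation of the Earth Mover's Distance by an averaged per-feature 1-D Wasserstein distance, and the hyperparameters gamma and alpha are assumptions made for this sketch.

```python
# Sketch of TPT steps (1)-(5); a simplified stand-in for the pipeline of [4].
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_ridge import KernelRidge

def emd_kernel(Xa, Xb, gamma=1.0):
    """Approximate EMD kernel: average 1-D Wasserstein distance per feature,
    turned into a similarity with an exponential kernel (assumption; the
    paper uses the EMD kernel of Rubner et al. [6])."""
    d = np.mean([wasserstein_distance(Xa[:, f], Xb[:, f])
                 for f in range(Xa.shape[1])])
    return np.exp(-gamma * d)

def tpt(source_sets, X_target, gamma=1.0, alpha=1.0):
    # Step 1: per-source L2-penalised logistic regression -> theta_i = (w_i, c_i)
    thetas = []
    for X_s, y_s in source_sets:
        clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X_s, y_s)
        thetas.append(np.append(clf.coef_.ravel(), clf.intercept_))
    thetas = np.vstack(thetas)

    # Steps 2-3: kernel matrix K over the source datasets
    Xs = [X for X, _ in source_sets]
    K = np.array([[emd_kernel(Xi, Xj, gamma) for Xj in Xs] for Xi in Xs])

    # Step 4: kernel ridge regression from dataset distributions to parameters
    krr = KernelRidge(alpha=alpha, kernel="precomputed").fit(K, thetas)

    # Step 5: map the unlabelled target dataset to its personalised classifier
    k_t = np.array([[emd_kernel(X_target, Xi, gamma) for Xi in Xs]])
    w_c = krr.predict(k_t).ravel()
    w_t, c_t = w_c[:-1], w_c[-1]

    # Probability of speaking for each target window (logistic model)
    return 1.0 / (1.0 + np.exp(-(X_target @ w_t + c_t)))
```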
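The feature extraction itself can be sketched as follows, under assumed settings; the sampling rate and the use of Welch's method for the power spectral density are not specified in the paper and are placeholders here.

```python
# Sketch of the acceleration features described above: 3 s windows with 1.5 s
# overlap; mean, variance and an 8-bin log-spaced PSD per channel.
import numpy as np
from scipy.signal import welch

def window_features(acc, fs=20, win_s=3.0, hop_s=1.5, n_bins=8):
    # Channels: 3 raw axes, their absolute values, and the magnitude
    channels = np.hstack([acc, np.abs(acc),
                          np.linalg.norm(acc, axis=1, keepdims=True)])
    # Standardise each channel to zero mean and unit variance
    channels = (channels - channels.mean(0)) / (channels.std(0) + 1e-8)

    win, hop = int(win_s * fs), int(hop_s * fs)
    feats = []
    for start in range(0, len(channels) - win + 1, hop):
        seg = channels[start:start + win]
        stats = np.concatenate([seg.mean(0), seg.var(0)])
        # PSD per channel, pooled into 8 logarithmically spaced bins
        f, psd = welch(seg, fs=fs, nperseg=win, axis=0)
        edges = np.logspace(np.log10(f[1]), np.log10(f[-1]), n_bins + 1)
        spec = np.concatenate([
            [psd[(f >= lo) & (f < hi), c].sum()
             for lo, hi in zip(edges[:-1], edges[1:])]
            for c in range(psd.shape[1])])
        feats.append(np.concatenate([stats, spec]))
    return np.array(feats)   # one feature row per 3 s window
```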
2.2 Estimation from video: Bags of dense trajectories and MILES
The video for this problem is inherently noisy, as more than one person can appear in the video region of the person of interest (e.g. people talking close together). Thus, we propose to use bags of dense trajectories to overcome this cross-contamination in the video.

First, we extract the dense trajectories of all participants using the method proposed by Wang et al. [10]. These trajectories are then grouped into bags using a sliding window of 3 s with an overlap of 1.5 s: every trajectory that overlaps at least 80% with the window becomes part of the bag for that window.

This grouping results in a set $B^s$ of bags (positive and negative) for subject $s$, where $s = \{1, \ldots, S\}$ and $S$ is the total number of subjects. A bag from this set is denoted $B_j^s$, where $j = \{1, \ldots, N_s\}$ and $N_s$ is the total number of bags for subject $s$. Moreover, we also cluster the trajectories within a bag spatially using k-means, both to account for spatial similarities and for computational efficiency. In this way, the trajectories of each bag are reduced to its $k$ most representative prototypes.

Note that each bag $B_j^s$ will contain good trajectories (corresponding to subject $s$) as well as bad or noisy trajectories (other subjects, shadows and other background artefacts). We therefore need to treat the samples in a bag jointly, rather than classifying each trajectory independently. This is the main motivation for using a Multiple Instance Learning (MIL) approach for classification on video.

As our MIL approach we use Multiple Instance Learning via Embedded Instance Selection (MILES) [2]. Overall, MILES classifies a bag by considering both contributing information (e.g. trajectories of subject $s$ in our case) and opposing information (e.g. trajectories from other subjects or the background). It does so by creating a concept in an embedded space and comparing all instances to this concept.

Let us define $B = \{B_1, B_2, \ldots, B_S\}$ as the set of bags of all participants in the training set. $B_a$ is then a bag of this set $B$, where $a = \{1, \ldots, A\}$ and $A$ is the sum of the number of bags over all $S$ subjects, and $x_a^j$ is an instance (prototype trajectory) of this bag. For a given bag $B_a$, the similarity between the bag and any instance of the training set (disregarding its bag) is calculated by

$$s(x^k, B_a) = \max_j \exp\left(-\frac{\|x_a^j - x^k\|^2}{\sigma^2}\right) \qquad (1)$$

where $x^k$ is the $k$-th instance in the training set and $x_a^j$ is instance $j$ within bag $B_a$. Bag $B_a$ is thus embedded into a space of similarities defined as

$$m(B_a) = [s(x^1, B_a), s(x^2, B_a), \ldots, s(x^{n}, B_a)]^T \qquad (2)$$

where $n$ is the total number of instances in the training set. This yields the matrix representation of all training bags in the embedded space, $m(B) = [m(B_1), \ldots, m(B_A)]$.

On this representation a (sparse) linear classifier is then trained, and a new bag is classified by

$$y = \mathrm{sign}\left(\sum_{k \in I} w_k^{*}\, s(x^k, B_{new}) + b^{*}\right) \qquad (3)$$

where $I$ is the subset of instances with non-zero weights ($I = \{k : |w_k^{*}| > 0\}$). Note that instances carrying contributing information receive positive weights $w_k^{*}$, while those carrying opposing information receive negative weights. We used the MILES implementation in PRTools [3]. For more details, please refer to [2].
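As an illustration of the bag construction described above, a minimal sketch follows. The trajectory representation (start frame, end frame, descriptor vector), the frame rate and the value of $k$ are assumptions; the dense trajectories themselves would come from the tool of Wang et al. [10].

```python
# Sketch of the bag construction: trajectories are grouped into 3 s sliding-
# window bags and each bag is reduced to k prototype trajectories via k-means.
import numpy as np
from sklearn.cluster import KMeans

def build_bags(trajectories, n_frames, fps=20, win_s=3.0, hop_s=1.5,
               min_overlap=0.8, k=10):
    win, hop = int(win_s * fps), int(hop_s * fps)
    bags = []
    for w_start in range(0, n_frames - win + 1, hop):
        w_end = w_start + win
        descs = []
        for t_start, t_end, desc in trajectories:
            # temporal overlap between trajectory and window, as a fraction
            # of the trajectory length
            overlap = max(0, min(t_end, w_end) - max(t_start, w_start))
            if overlap >= min_overlap * (t_end - t_start):
                descs.append(desc)
        if len(descs) >= k:
            # k most representative prototypes for this bag
            protos = KMeans(n_clusters=k, n_init=10).fit(
                np.array(descs)).cluster_centers_
            bags.append(protos)
        elif descs:
            bags.append(np.array(descs))
    return bags   # list of (<=k, d) prototype arrays, one per 3 s window
```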
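The embedding of Eqs. (1)-(3) can be sketched as follows. The authors used the PRTools MILES implementation [3]; here an L1-regularised linear SVM stands in for the sparse (1-norm) linear classifier, so this is only an approximation of MILES, not the implementation used for the reported results.

```python
# Sketch of the MILES embedding of Eqs. (1)-(3).
import numpy as np
from sklearn.svm import LinearSVC

def embed(bag, instances, sigma=1.0):
    """m(B_a): similarity of bag B_a to every training instance x^k (Eqs. 1-2)."""
    # squared distances between each instance in the bag and each training instance
    d2 = ((bag[:, None, :] - instances[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2).max(axis=0)   # max over instances in the bag

def train_miles(train_bags, labels, sigma=1.0, C=1.0):
    # all instances (prototype trajectories) of all training bags
    instances = np.vstack(train_bags)
    M = np.vstack([embed(B, instances, sigma) for B in train_bags])
    # sparse linear classifier on the embedded representation
    clf = LinearSVC(penalty="l1", dual=False, C=C).fit(M, labels)
    return clf, instances

def classify_bag(new_bag, clf, instances, sigma=1.0):
    """Eq. (3): sign of the sparse linear classifier in the embedded space."""
    return clf.predict(embed(new_bag, instances, sigma)[None, :])[0]
```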
2.3 Multimodal estimation: Late fusion
After computing 1 s estimations from the acceleration and video modalities with the methods described above, we combine the predictions of both methods using mean fusion [8]. If the video of the current subject is missing, we directly use the output of the TPT.
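A minimal sketch of this fusion rule, assuming the per-second probabilities of the two modalities are aligned arrays and that missing video is marked with NaN (variable names are illustrative):

```python
# Sketch of the late (mean) fusion of Section 2.3.
import numpy as np

def late_fusion(p_accel, p_video):
    """p_accel, p_video: per-second speaking probabilities; p_video may
    contain NaN where the video of the subject is missing."""
    p_accel = np.asarray(p_accel, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    fused = (p_accel + p_video) / 2.0
    missing = np.isnan(p_video)
    fused[missing] = p_accel[missing]   # fall back to the TPT output alone
    return fused
```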
3 RESULTS
Table 1 presents the performance of each approach, and Figure 1 shows the performance obtained for each participant.

Table 1: Performances of each modality and their (late) fusion.
                 Accel            Video            Fusion
Mean AUC ± Std   0.656 ± 0.074    0.549 ± 0.079    0.658 ± 0.073

[Figure 1: Performances per participant (p. independent)]

For the unimodal estimations, mean AUC scores of 0.656 and 0.549, with standard deviations of 0.074 and 0.079, are obtained for acceleration and video, respectively. As can be seen in Figure 1, the performance per participant varies considerably. This further supports the claim that the movement patterns of speakers are highly varied, making detection harder for some participants than for others.

The relatively low performance of the video modality is probably caused by the missing video data for some participants. These missing intervals are included in the performance evaluation, which lowers the overall performance for those participants. Cases where the acceleration modality is outperformed by video further illustrate the multimodal nature of the problem.

Moreover, the available video data can be noisy due to occlusions between participants. Our MIL approach for video could tackle this problem to a certain degree, but some cases are too crowded to be handled from video alone.

Finally, we see that even with a basic fusion technique such as mean fusion, the multimodal approach provides better performance than the single modalities. Although the overall performance difference is marginal, mean fusion guaranteed similar or higher performance scores than either modality alone. We argue that a more sophisticated fusion approach should be able to exploit the multimodal nature of the problem even further. A possible research direction is to address the occluded video segments with a smarter fusion scheme.

4 CONCLUSION
In this paper, we presented our approach for no-audio speech detection. The promising performance shows that such a challenging task can be tackled. The highest scores, obtained by the multimodal fusion, further support the multimodal nature of the problem. However, there is still considerable room for improvement, and we believe that, with the help of many, it will eventually be possible to solve this challenging problem.

ACKNOWLEDGMENTS
This task is partially supported by the Instituto Tecnológico de Costa Rica and the Netherlands Organization for Scientific Research (NWO) under project number 639.022.606.

REFERENCES
[1] L. Cabrera-Quiros, E. Gedik, and H. Hung. 2018. No-Audio Multimodal Speech Detection in Crowded Social Settings task at MediaEval 2018. MediaEval (2018).
[2] Y. Chen, J. Bi, and J. Z. Wang. 2006. MILES: Multiple-Instance Learning via Embedded Instance Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2006).
[3] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, and D. M. J. Tax. 2017. PRTools: A Matlab Toolbox for Pattern Recognition. Version 5.3 (March 2017).
[4] Ekin Gedik and Hayley Hung. 2017. Personalised models for speech detection from body movements using transductive parameter transfer. Personal and Ubiquitous Computing 21, 4 (2017), 723–737.
[5] David McNeill. 2000. Language and Gesture. Vol. 2. Cambridge University Press.
[6] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 2000. The Earth Mover's Distance as a metric for image retrieval. International Journal of Computer Vision 40, 2 (2000), 99–121.
[7] Enver Sangineto, Gloria Zen, Elisa Ricci, and Nicu Sebe. 2014. We are not all equal: Personalizing models for facial expression analysis with transductive parameter transfer. In Proceedings of the ACM International Conference on Multimedia. ACM, 357–366.
[8] David M. J. Tax, Martijn van Breukelen, Robert P. W. Duin, and Josef Kittler. 2000. Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33, 9 (2000), 1475–1485.
[9] Alessandro Vinciarelli, Maja Pantic, Dirk Heylen, Catherine Pelachaud, Isabella Poggi, Francesca D'Errico, and Marc Schroeder. 2012. Bridging the gap between social animal and unsocial machine: A survey of social signal processing. IEEE Transactions on Affective Computing 3, 1 (2012), 69–87.
[10] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. 2013. Dense Trajectories and Motion Boundary Descriptors for Action Recognition. International Journal of Computer Vision (2013).