Transductive Parameter Transfer, Bags of Dense Trajectories and MILES for No-Audio Multimodal Speech Detection

Laura Cabrera-Quiros (1,2), Ekin Gedik (1), Hayley Hung (1)
(1) Delft University of Technology, Netherlands
(2) Instituto Tecnológico de Costa Rica, Costa Rica
{l.c.cabreraquiros,e.gedik,h.hung}@tudelft.nl

Copyright held by the owner/author(s). MediaEval'18, 29-31 October 2018, France.

ABSTRACT
This paper presents the algorithms that the task organisers deployed for the automatic Human Behaviour Analysis (HBA) task of MediaEval 2018. The HBA task investigates the alternative modalities of video and body-worn acceleration for the detection of speaking status. For unimodal estimation from acceleration, we employ a transfer learning approach, Transductive Parameter Transfer (TPT), which has been shown to perform satisfactorily in a similar setting [4]. For estimation from the video modality, bags of dense trajectories are classified with a multiple instance learning approach (MILES) [2]. Finally, late fusion combines the outputs of both modalities. The multimodal approach obtains a mean AUC of 0.658, outperforming both single-modality approaches.

1 INTRODUCTION
The Human Behaviour Analysis (HBA) task of MediaEval 2018 focuses on non-audio speaking status detection in crowded mingling events [1]. Such events are interesting because they are concentrated moments in which people interact freely, resulting in unstructured and varied social behaviour. Since speaking turns have been shown to be vital units of social behaviour [9], their automatic detection makes detailed analysis of social behaviour possible.

Traditionally, audio is used for the detection of speech. However, the dense nature of large gatherings introduces restrictions such as background noise, making the use of audio challenging. To overcome this challenge, the HBA task investigates the alternative modalities of wearable acceleration and video for the detection of speaking status. The main idea behind this approach is backed by prior work in social science showing that speakers tend to move (e.g. gesture) during speech [5].

The task requires participants to provide solutions for unimodal estimation, both from acceleration and from video, and for multimodal estimation. For more details about the task, please refer to [1].

For acceleration, we employ the transfer learning method called Transductive Parameter Transfer (TPT), which was shown to perform satisfactorily in a similar setting [4]. Speaker estimation from video is carried out by extracting bags of dense trajectories and using MILES (a multiple instance learning method) for classification. This approach allows us to overcome the cross-contamination between subjects standing close together, caused by their overlapping bounding boxes. Finally, the multimodal estimation is obtained by combining the outputs of the two unimodal classifiers using late fusion. The following section explains these approaches in detail.

2 METHODOLOGY

2.1 Estimation from acceleration: TPT
Even though speakers are known to act differently from non-speakers [5], their behaviours vary greatly, which makes automatic estimation from acceleration a challenging task. To account for this variance, we employ a transfer learning model called TPT, which provides personalised models. TPT computes the parameters of the optimal classifier for a target dataset $X^t$ given a set of source datasets with their own corresponding optimal classifiers. The classifier for the target data is computed without using any label information from the target dataset. The method was first proposed for facial expression detection [7]; a specialised version tuned for speaking status detection from acceleration was presented in [4].

Let the $N$ labelled source datasets and the unlabelled target dataset be defined as $\mathcal{D}^{s_1}, \ldots, \mathcal{D}^{s_N}$, with $\mathcal{D}^{s_i} = \{(x_j^{s_i}, y_j^{s_i})\}_{j=1}^{n_{s_i}}$, and $X^t = \{x_j^t\}_{j=1}^{n_t}$. The following steps compute the optimal parameters $(w^t, c^t)$ for $X^t$, where $w$ and $c$ correspond to the regression coefficients and the intercept, respectively:
(1) $\{\theta_i = (w_i, c_i)\}_{i=1}^{N}$ is computed using L2-penalised logistic regression on each source dataset.
(2) The training set $\tau = \{(X^{s_i}, \theta_i)\}_{i=1}^{N}$ is created.
(3) The kernel matrix $K$ that defines the distances between dataset distributions, where $K_{ij} = \kappa(X^{s_i}, X^{s_j})$, is computed with an Earth Mover's Distance kernel [6].
(4) Given $K$ and $\tau$, the mapping $\hat{f}(\cdot)$ between the marginal distributions of the datasets and their optimal parameters is computed with Kernel Ridge Regression.
(5) $(w^t, c^t) = \hat{f}(X^t)$ is computed using the mapping obtained in the previous step.
For a more detailed explanation of each step, readers can refer to [4].

We used statistical and spectral features extracted from 3 s windows with 1.5 s overlap, computed for each axis of the raw acceleration signal, the absolute values of the acceleration signal, and the magnitude of the acceleration. As statistical features, mean and variance values are calculated. The power spectral density, computed using 8 bins with logarithmic spacing, forms the spectral feature set. Each axis of the acceleration is standardised to zero mean and unit variance. The probability outputs are then upsampled to 1 s windows.
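A minimal sketch of steps (1)-(5) is given below, for illustration only; the actual implementation follows [4]. The scikit-learn estimators, the approximation of the Earth Mover's Distance by an averaged per-feature 1-D Wasserstein distance, and the hyperparameters gamma and alpha are assumptions made for this sketch.

```python
# Sketch of TPT steps (1)-(5); a simplified stand-in for the pipeline of [4].
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import LogisticRegression
from sklearn.kernel_ridge import KernelRidge

def emd_kernel(Xa, Xb, gamma=1.0):
    """Approximate EMD kernel: average 1-D Wasserstein distance per feature,
    turned into a similarity with an exponential kernel (assumption; the
    paper uses the EMD kernel of Rubner et al. [6])."""
    d = np.mean([wasserstein_distance(Xa[:, f], Xb[:, f])
                 for f in range(Xa.shape[1])])
    return np.exp(-gamma * d)

def tpt(source_sets, X_target, gamma=1.0, alpha=1.0):
    # Step 1: per-source L2-penalised logistic regression -> theta_i = (w_i, c_i)
    thetas = []
    for X_s, y_s in source_sets:
        clf = LogisticRegression(penalty="l2", max_iter=1000).fit(X_s, y_s)
        thetas.append(np.append(clf.coef_.ravel(), clf.intercept_))
    thetas = np.vstack(thetas)

    # Steps 2-3: kernel matrix K over the source datasets
    Xs = [X for X, _ in source_sets]
    K = np.array([[emd_kernel(Xi, Xj, gamma) for Xj in Xs] for Xi in Xs])

    # Step 4: kernel ridge regression from dataset distributions to parameters
    krr = KernelRidge(alpha=alpha, kernel="precomputed").fit(K, thetas)

    # Step 5: map the unlabelled target dataset to its personalised classifier
    k_t = np.array([[emd_kernel(X_target, Xi, gamma) for Xi in Xs]])
    w_c = krr.predict(k_t).ravel()
    w_t, c_t = w_c[:-1], w_c[-1]

    # Probability of speaking for each target window (logistic model)
    return 1.0 / (1.0 + np.exp(-(X_target @ w_t + c_t)))
```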
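The feature extraction itself can be sketched as follows, under assumed settings; the sampling rate and the use of Welch's method for the power spectral density are not specified in the paper and are placeholders here.

```python
# Sketch of the acceleration features described above: 3 s windows with 1.5 s
# overlap; mean, variance and an 8-bin log-spaced PSD per channel.
import numpy as np
from scipy.signal import welch

def window_features(acc, fs=20, win_s=3.0, hop_s=1.5, n_bins=8):
    # Channels: 3 raw axes, their absolute values, and the magnitude
    channels = np.hstack([acc, np.abs(acc),
                          np.linalg.norm(acc, axis=1, keepdims=True)])
    # Standardise each channel to zero mean and unit variance
    channels = (channels - channels.mean(0)) / (channels.std(0) + 1e-8)

    win, hop = int(win_s * fs), int(hop_s * fs)
    feats = []
    for start in range(0, len(channels) - win + 1, hop):
        seg = channels[start:start + win]
        stats = np.concatenate([seg.mean(0), seg.var(0)])
        # PSD per channel, pooled into 8 logarithmically spaced bins
        f, psd = welch(seg, fs=fs, nperseg=win, axis=0)
        edges = np.logspace(np.log10(f[1]), np.log10(f[-1]), n_bins + 1)
        spec = np.concatenate([
            [psd[(f >= lo) & (f < hi), c].sum()
             for lo, hi in zip(edges[:-1], edges[1:])]
            for c in range(psd.shape[1])])
        feats.append(np.concatenate([stats, spec]))
    return np.array(feats)   # one feature row per 3 s window
```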
2.2 Estimation from video: Bags of dense trajectories and MILES
The video for this problem is inherently noisy, as more than one person can appear in the video region of the person of interest (e.g. people talking close together). Thus, we propose to use bags of dense trajectories to overcome this cross-contamination in the video.

First, we extract the dense trajectories of all participants using the method proposed by Wang et al. [10]. These trajectories are then grouped into bags using a sliding window of 3 s with an overlap of 1.5 s: every trajectory that overlaps at least 80% with the window becomes part of the bag for that window.

This grouping results in a set $B^s$ of bags (positive and negative) for subject $s$, where $s = \{1, \ldots, S\}$ and $S$ is the total number of subjects. A bag from this set is denoted $B_j^s$, where $j = \{1, \ldots, N_s\}$ and $N_s$ is the total number of bags for subject $s$. Moreover, we also cluster the trajectories within a bag spatially using k-means, both to account for spatial similarities and for computational efficiency. In this way, the trajectories of each bag are reduced to its $k$ most representative prototypes.

Note that each bag $B_j^s$ will contain good trajectories (corresponding to subject $s$) as well as bad or noisy trajectories (other subjects, shadows and other background artefacts). We therefore need to treat the samples in a bag jointly, rather than classifying each trajectory independently. This is the main motivation for using a Multiple Instance Learning (MIL) approach for classification on video.

As our MIL approach we use Multiple Instance Learning via Embedded Instance Selection (MILES) [2]. Overall, MILES classifies a bag by considering both contributing information (e.g. trajectories of subject $s$ in our case) and opposing information (e.g. trajectories from other subjects or the background). It does so by creating a concept in an embedded space and comparing all instances to this concept.

Let us define $B = \{B_1, B_2, \ldots, B_S\}$ as the set of bags of all participants in the training set. $B_a$ is then a bag of this set $B$, where $a = \{1, \ldots, A\}$ and $A$ is the sum of the number of bags over all $S$ subjects, and $x_a^j$ is an instance (prototype trajectory) of this bag. For a given bag $B_a$, the similarity between the bag and any instance of the training set (disregarding its bag) is calculated by

$$s(x^k, B_a) = \max_j \exp\left(-\frac{\|x_a^j - x^k\|^2}{\sigma^2}\right) \qquad (1)$$

where $x^k$ is the $k$-th instance in the training set and $x_a^j$ is instance $j$ within bag $B_a$. Bag $B_a$ is thus embedded into a space of similarities defined as

$$m(B_a) = [s(x^1, B_a), s(x^2, B_a), \ldots, s(x^{n}, B_a)]^T \qquad (2)$$

where $n$ is the total number of instances in the training set. This yields the matrix representation of all training bags in the embedded space, $m(B) = [m(B_1), \ldots, m(B_A)]$.

On this representation a (sparse) linear classifier is then trained, and a new bag is classified by

$$y = \mathrm{sign}\left(\sum_{k \in I} w_k^{*}\, s(x^k, B_{new}) + b^{*}\right) \qquad (3)$$

where $I$ is the subset of instances with non-zero weights ($I = \{k : |w_k^{*}| > 0\}$). Note that instances carrying contributing information receive positive weights $w_k^{*}$, while those carrying opposing information receive negative weights. We used the MILES implementation in PRTools [3]. For more details, please refer to [2].
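As an illustration of the bag construction described above, a minimal sketch follows. The trajectory representation (start frame, end frame, descriptor vector), the frame rate and the value of $k$ are assumptions; the dense trajectories themselves would come from the tool of Wang et al. [10].

```python
# Sketch of the bag construction: trajectories are grouped into 3 s sliding-
# window bags and each bag is reduced to k prototype trajectories via k-means.
import numpy as np
from sklearn.cluster import KMeans

def build_bags(trajectories, n_frames, fps=20, win_s=3.0, hop_s=1.5,
               min_overlap=0.8, k=10):
    win, hop = int(win_s * fps), int(hop_s * fps)
    bags = []
    for w_start in range(0, n_frames - win + 1, hop):
        w_end = w_start + win
        descs = []
        for t_start, t_end, desc in trajectories:
            # temporal overlap between trajectory and window, as a fraction
            # of the trajectory length
            overlap = max(0, min(t_end, w_end) - max(t_start, w_start))
            if overlap >= min_overlap * (t_end - t_start):
                descs.append(desc)
        if len(descs) >= k:
            # k most representative prototypes for this bag
            protos = KMeans(n_clusters=k, n_init=10).fit(
                np.array(descs)).cluster_centers_
            bags.append(protos)
        elif descs:
            bags.append(np.array(descs))
    return bags   # list of (<=k, d) prototype arrays, one per 3 s window
```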
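The embedding of Eqs. (1)-(3) can be sketched as follows. The authors used the PRTools MILES implementation [3]; here an L1-regularised linear SVM stands in for the sparse (1-norm) linear classifier, so this is only an approximation of MILES, not the implementation used for the reported results.

```python
# Sketch of the MILES embedding of Eqs. (1)-(3).
import numpy as np
from sklearn.svm import LinearSVC

def embed(bag, instances, sigma=1.0):
    """m(B_a): similarity of bag B_a to every training instance x^k (Eqs. 1-2)."""
    # squared distances between each instance in the bag and each training instance
    d2 = ((bag[:, None, :] - instances[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2).max(axis=0)   # max over instances in the bag

def train_miles(train_bags, labels, sigma=1.0, C=1.0):
    # all instances (prototype trajectories) of all training bags
    instances = np.vstack(train_bags)
    M = np.vstack([embed(B, instances, sigma) for B in train_bags])
    # sparse linear classifier on the embedded representation
    clf = LinearSVC(penalty="l1", dual=False, C=C).fit(M, labels)
    return clf, instances

def classify_bag(new_bag, clf, instances, sigma=1.0):
    """Eq. (3): sign of the sparse linear classifier in the embedded space."""
    return clf.predict(embed(new_bag, instances, sigma)[None, :])[0]
```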
2.3 Multimodal estimation: Late fusion
After computing 1 s estimations from the acceleration and video modalities with the methods described above, we combine the predictions of both methods using mean fusion [8]. If the video of the current subject is missing, we directly use the output of the TPT.
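A minimal sketch of this fusion rule, assuming the per-second probabilities of the two modalities are aligned arrays and that missing video is marked with NaN (variable names are illustrative):

```python
# Sketch of the late (mean) fusion of Section 2.3.
import numpy as np

def late_fusion(p_accel, p_video):
    """p_accel, p_video: per-second speaking probabilities; p_video may
    contain NaN where the video of the subject is missing."""
    p_accel = np.asarray(p_accel, dtype=float)
    p_video = np.asarray(p_video, dtype=float)
    fused = (p_accel + p_video) / 2.0
    missing = np.isnan(p_video)
    fused[missing] = p_accel[missing]   # fall back to the TPT output alone
    return fused
```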
3 RESULTS
Table 1 presents the performance of each approach, and Figure 1 shows the performance obtained for each participant.

Table 1: Performances of each modality and their (late) fusion.
                 Accel            Video            Fusion
Mean AUC ± Std   0.656 ± 0.074    0.549 ± 0.079    0.658 ± 0.073

[Figure 1: Performances per participant (p. independent)]

For the unimodal estimations, mean AUC scores of 0.656 and 0.549, with standard deviations of 0.074 and 0.079, are obtained for acceleration and video, respectively. As can be seen in Figure 1, the performance per participant varies considerably. This further supports the claim that the movement patterns of speakers are highly varied, making detection harder for some participants than for others.

The relatively low performance of the video modality is probably caused by the missing video data for some participants. These missing intervals are included in the performance evaluation, which lowers the overall performance for those participants. Cases where the acceleration modality is outperformed by video further illustrate the multimodal nature of the problem.

Moreover, the available video data can be noisy due to occlusions between participants. Our MIL approach for video could tackle this problem to a certain degree, but some cases are too crowded to be handled from video alone.

Finally, we see that even with a basic fusion technique such as mean fusion, the multimodal approach provides better performance than the single modalities. Although the overall performance difference is marginal, mean fusion guaranteed similar or higher performance scores than either modality alone. We argue that a more sophisticated fusion approach should be able to exploit the multimodal nature of the problem even further. A possible research direction is to address the occluded video segments with a smarter fusion scheme.

4 CONCLUSION
In this paper, we presented our approach for no-audio speech detection. The promising performance shows that such a challenging task can be tackled. The highest scores, obtained by the multimodal fusion, further support the multimodal nature of the problem. However, there is still considerable room for improvement, and we believe that, with the help of many, it will eventually be possible to solve this challenging problem.

ACKNOWLEDGMENTS
This task is partially supported by the Instituto Tecnológico de Costa Rica and the Netherlands Organization for Scientific Research (NWO) under project number 639.022.606.

REFERENCES
[1] L. Cabrera-Quiros, E. Gedik, and H. Hung. 2018. No-Audio Multimodal Speech Detection in Crowded Social Settings task at MediaEval 2018. MediaEval (2018).
[2] Y. Chen, J. Bi, and J. Z. Wang. 2006. MILES: Multiple-Instance Learning via Embedded Instance Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) (2006).
[3] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, and D. M. J. Tax. 2017. PRTools: A Matlab Toolbox for Pattern Recognition. Version 5.3 (March 2017).
[4] Ekin Gedik and Hayley Hung. 2017. Personalised models for speech detection from body movements using transductive parameter transfer. Personal and Ubiquitous Computing 21, 4 (2017), 723–737.
[5] David McNeill. 2000. Language and Gesture. Vol. 2. Cambridge University Press.
[6] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. 2000. The Earth Mover's Distance as a metric for image retrieval. International Journal of Computer Vision 40, 2 (2000), 99–121.
[7] Enver Sangineto, Gloria Zen, Elisa Ricci, and Nicu Sebe. 2014. We are not all equal: Personalizing models for facial expression analysis with transductive parameter transfer. In Proceedings of the ACM International Conference on Multimedia. ACM, 357–366.
[8] David M. J. Tax, Martijn van Breukelen, Robert P. W. Duin, and Josef Kittler. 2000. Combining multiple classifiers by averaging or by multiplying? Pattern Recognition 33, 9 (2000), 1475–1485.
[9] Alessandro Vinciarelli, Maja Pantic, Dirk Heylen, Catherine Pelachaud, Isabella Poggi, Francesca D'Errico, and Marc Schroeder. 2012. Bridging the gap between social animal and unsocial machine: A survey of social signal processing. IEEE Transactions on Affective Computing 3, 1 (2012), 69–87.
[10] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. 2013. Dense Trajectories and Motion Boundary Descriptors for Action Recognition. International Journal of Computer Vision (2013).