=Paper=
{{Paper
|id=Vol-2670/MediaEval_19_paper_65
|storemode=property
|title=DCU-ADAPT at MediaEval 2019: Eyes and Ears Together
|pdfUrl=https://ceur-ws.org/Vol-2670/MediaEval_19_paper_65.pdf
|volume=Vol-2670
|authors=Yasufumi Moriya,Gareth J.F. Jones
|dblpUrl=https://dblp.org/rec/conf/mediaeval/MoriyaJ19a
}}
==DCU-ADAPT at MediaEval 2019: Eyes and Ears Together==
Yasufumi Moriya, Gareth J. F. Jones
ADAPT Centre, School of Computing, Dublin City University, Dublin 9, Ireland
yasufumi.moriya@adaptcentre, gareth.jones@dcu.ie

ABSTRACT

We describe the DCU-ADAPT participation in the Eyes and Ears Together task at MediaEval 2019. Our submitted systems were developed to choose object bounding boxes from automatically generated proposals given query entities. The first system finds relevance between object proposals and queries using multiple instance learning. The second system employs an attention mechanism to find the object proposals which most likely correspond to the given queries. The last system is a baseline which chooses region proposals at random. We observed that the first two systems produced higher accuracy than the random baseline. The best approach was multiple instance learning, which achieved an accuracy of 9% at an intersection-over-union threshold of 0.5.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 27-29 October 2019, Sophia Antipolis, France.

1 INTRODUCTION

The nature of human communication is often a multimodal process, where textual, visual and audio information are processed simultaneously. The Eyes and Ears Together task at MediaEval 2019 aims to ground speech transcripts into videos [7]. Visual grounding tasks are usually conducted on images or videos with manually created captions [4, 5, 8], but rarely on vision and speech. Speech grounding is interesting in that it replicates human communication, where listening to speech and seeing objects happen simultaneously. A practical advantage of grounding speech into vision is that, unlike caption grounding, speech transcripts can be obtained easily from user-generated content (e.g., YouTube) or using automatic speech recognition.

As a task organiser, we generated pairs of video frames and entities from the How2 dataset [7, 9]. The challenge of this task is that systems need to discover relationships between objects and entities without explicit annotation of objects, since pairs of video frames and entities are automatically aligned.

In this paper, we describe our investigation into whether two existing approaches employed for caption grounding can be applied to speech grounding. The common characteristic of these approaches is that they both use pre-computed candidate region proposals of objects. The first approach finds relationships between object proposals and queries using a contrastive loss [4]. This employs an established technique referred to as multiple instance learning (MIL), which is often applied to other computer vision tasks [3]. The second approach uses the attention mechanism [1], with the object bounding box that has the highest attention weight taken as the prediction for a given query entity [8]. To compare these approaches to the most basic system, a final system randomly chooses object bounding boxes from the candidate region proposals.

2 OUR APPROACH

We use machine learning approaches to visual grounding based on automatically generated object proposals. For each video frame, there are n object proposals. We extract n fixed-length feature vectors by cropping the video frame according to the object proposals and applying a convolutional neural network (CNN) to each cropped image. Each query entity associated with a video frame is also transformed into a fixed-length vector using a word embedding model.

Figure 1: Computation of the loss function using contrastive loss.

2.1 Multiple Instance Learning

Given region proposals transformed into fixed-length vectors, and a query entity also represented as a vector, a neural network model can find the region proposal which is most strongly associated with the query entity [4]. This can be expressed in the following equations:

    \phi(r_{ijk}) = W_r(f_{CNN}(r_{ijk}))    (1)
    \psi(e_i) = W_e(f_{EMB}(e_i))    (2)
    \bar{k} = \arg\max_k \mathrm{sigmoid}(\phi(r_{ijk})^T \cdot \psi(e_i))    (3)

where i denotes the i-th entity, j the j-th video frame of an entity and k the k-th region proposal; φ(r_ijk) is the CNN feature of r_ijk, ψ(e_i) is the word embedding of query entity e_i, and k̄ is the index of the region proposal most strongly associated with e_i. While f_CNN and f_EMB are fixed during training, W_r and W_e are updated at training time.

At training time, given region proposals and a query entity, the neural network model is trained to find relationships between video frames and query entities, as shown in Figure 1. For each pair of a video frame and a query entity, two additional pairs are created which mismatch a video frame and a query entity. The loss function penalises the model when it gives a higher score to a mismatched pair. This is expressed in Equation 5:

    S_{ii} = \sum_j \max_k (\phi(r_{ijk})^T \cdot \psi(e_i))    (4)
    L = \sum_i^I \left( \max(0, S_{il} - S_{ii} + \delta) + \max(0, S_{li} - S_{ii} + \delta) \right)    (5)

where S_ii is the score of a correctly matched image-entity pair, S_il pairs the current image with a random query entity, S_li pairs a random image with the current query entity, and δ is a margin.
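To make the training objective concrete, the following is a minimal PyTorch-style sketch of the MIL scoring and contrastive ranking loss in Equations 1-5. It is not the authors' implementation: the joint embedding dimension, the margin value and the reduction of the sum over frames in Equation 4 to a single frame are assumptions made for illustration.

<pre>
# Minimal PyTorch sketch of the MIL scoring and contrastive ranking loss
# (Equations 1-5). Dimensions and the margin value are illustrative
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn


class MILGrounding(nn.Module):
    def __init__(self, visual_dim=2048, embed_dim=100, joint_dim=256):
        super().__init__()
        self.W_r = nn.Linear(visual_dim, joint_dim)   # Eq. 1: projects CNN features
        self.W_e = nn.Linear(embed_dim, joint_dim)    # Eq. 2: projects word embeddings

    def scores(self, proposal_feats, entity_emb):
        # proposal_feats: (n_proposals, visual_dim), entity_emb: (embed_dim,)
        phi = self.W_r(proposal_feats)                # (n_proposals, joint_dim)
        psi = self.W_e(entity_emb)                    # (joint_dim,)
        return torch.sigmoid(phi @ psi)               # Eq. 3: one score per proposal

    def frame_score(self, proposal_feats, entity_emb):
        # Eq. 4 for a single frame: the pair scores as its best-matching proposal
        # (the sum over frames j reduces to one term here).
        return self.scores(proposal_feats, entity_emb).max()

    def predict(self, proposal_feats, entity_emb):
        # Eq. 3: index of the proposal most strongly associated with the entity
        return int(self.scores(proposal_feats, entity_emb).argmax())


def contrastive_loss(model, feats, emb, feats_rand, emb_rand, margin=0.1):
    """Eq. 5: penalise the model when a mismatched frame-entity pair scores
    higher than the matched one (the margin value is an assumed hyperparameter)."""
    s_ii = model.frame_score(feats, emb)              # matched pair
    s_il = model.frame_score(feats, emb_rand)         # current frame, random entity
    s_li = model.frame_score(feats_rand, emb)         # random frame, current entity
    zero = torch.zeros(())
    return torch.max(zero, s_il - s_ii + margin) + torch.max(zero, s_li - s_ii + margin)
</pre>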
2.2 Reconstruction

A neural network can also find the region proposal most strongly associated with a query entity using an attention mechanism [1]:

    \bar{k} = \arg\max_k f_{ATTN}([\phi(r_{ijk}); \psi(e_i)])    (6)

In Equation 6, f_ATTN is an attention function which computes attention weights over the k region proposals given the concatenation of the visual feature φ(r_ijk) and the embedded query entity ψ(e_i).

At training time, the model can learn the relationship between a visual object and a query entity by reconstructing the embedded query entity from the region proposal which has the highest attention weight [8]. Figure 2 shows how an object bounding box is found at testing time, and how the model is trained to reconstruct a query entity from a region proposal at training time. Formally, the following equations express how the reconstruction loss is computed:

Figure 2: Computation of the loss function using reconstruction.

    r_{attn} = W_{rec} \sum_{k=1}^{N} a_k \, \phi(r_{ijk})    (7)
    L_{rec} = \frac{1}{D} \sum_{d=1}^{D} (\psi(e_i)_d - r_{attn,d})^2    (8)

In Equation 7, the sum of the visual features of the region proposals weighted by the attention weights a_k is transformed into a reconstructed embedding of the query entity, r_attn. In Equation 8, L_rec is essentially the mean squared error between the reconstructed query entity and the embedded query entity.
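A corresponding sketch of the attention-based selection and reconstruction loss in Equations 6-8 is given below. The exact form of f_ATTN and the layer sizes are not specified in the paper, so the two-layer attention network and the dimensions used here are assumptions.

<pre>
# Minimal PyTorch sketch of the attention-based selection and reconstruction
# loss (Equations 6-8). The attention network and layer sizes are assumptions
# made for illustration, not the authors' exact architecture.
import torch
import torch.nn as nn


class ReconstructionGrounding(nn.Module):
    def __init__(self, visual_dim=2048, embed_dim=100, hidden_dim=256):
        super().__init__()
        # f_ATTN: scores the concatenation [phi(r_ijk); psi(e_i)] per proposal
        self.attn = nn.Sequential(
            nn.Linear(visual_dim + embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        # W_rec: maps the attended visual feature back to the embedding space
        self.W_rec = nn.Linear(visual_dim, embed_dim)

    def attention_weights(self, proposal_feats, entity_emb):
        # proposal_feats: (n_proposals, visual_dim), entity_emb: (embed_dim,)
        expanded = entity_emb.expand(proposal_feats.size(0), -1)
        concat = torch.cat([proposal_feats, expanded], dim=1)
        return torch.softmax(self.attn(concat).squeeze(1), dim=0)  # a_k, sums to 1

    def predict(self, proposal_feats, entity_emb):
        # Eq. 6: proposal with the highest attention weight
        return int(self.attention_weights(proposal_feats, entity_emb).argmax())

    def reconstruction_loss(self, proposal_feats, entity_emb):
        a = self.attention_weights(proposal_feats, entity_emb)     # (n_proposals,)
        r_attn = self.W_rec(a @ proposal_feats)                    # Eq. 7
        return torch.mean((entity_emb - r_attn) ** 2)              # Eq. 8 (MSE)
</pre>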
3 IMPLEMENTATION DETAILS

For each video frame, 20 region proposals were extracted from the How2 dataset [9] using Mask-RCNN [6], which uses ResNeXt101 [10] as its backbone. For each region proposal, a ResNet-152 model was used to extract a fixed-length vector; the dimension of each visual feature was 2,048. The word embedding model was trained on the training set of the How2 speech transcripts using the fastText library [2], and each query entity was embedded into a 100-dimensional vector.

4 RESULTS

Table 1: Accuracy of visual grounding at IoU thresholds 0.5, 0.3 and 0.1.

            IoU > 0.5   IoU > 0.3   IoU > 0.1
  MIL       0.094       0.227       0.494
  Rec       0.080       0.192       0.402
  Random    0.077       0.181       0.408

Table 1 shows the results of visual grounding using the MIL-based approach, the reconstruction-based approach and the system which chooses region proposals at random. The systems were evaluated in terms of the intersection of a selected region proposal and a gold standard bounding box divided by their union (IoU). When an IoU value exceeded a threshold of 0.5, 0.3 or 0.1, a system prediction was regarded as correct. As can be seen in the table, both the MIL and reconstruction approaches generally produced slightly better results than the simple random approach. A possible explanation for the poor results of the two models is that these approaches have previously been applied to caption grounding, where they showed reasonable results, but have not been applied to speech grounding. For speech grounding, it is possible that entities are sometimes only weakly associated with visual objects. Therefore, existing models may need modification for speech grounding to efficiently learn relationships between entities and objects.
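As a reference for the evaluation protocol described above, the following sketch computes the IoU between a predicted and a gold standard bounding box and the accuracy at a threshold; the (x1, y1, x2, y2) corner format of the boxes is an assumption.

<pre>
# Sketch of the IoU-based evaluation described in Section 4: a prediction is
# counted as correct when the IoU between the selected proposal and the gold
# bounding box exceeds the threshold. Box format (x1, y1, x2, y2) is assumed.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_threshold(predictions, gold, threshold):
    """Fraction of (predicted, gold) box pairs whose IoU exceeds the threshold."""
    correct = sum(1 for p, g in zip(predictions, gold) if iou(p, g) > threshold)
    return correct / len(gold)


# Example usage for the three thresholds reported in Table 1:
# for t in (0.5, 0.3, 0.1):
#     print(t, accuracy_at_threshold(predicted_boxes, gold_boxes, t))
</pre>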
5 CONCLUSIONS

This paper describes the DCU-ADAPT participation in the Eyes and Ears Together task at MediaEval 2019. We employed machine learning approaches previously applied to caption grounding, and investigated whether those models can also work on speech grounding. It was found that, while they still perform better than the random baseline, they require modification to better capture the weak relationships between entities in speech transcripts and visual objects.

ACKNOWLEDGMENTS

This work was supported by Science Foundation Ireland as part of the ADAPT Centre (Grant 13/RC/2106) at Dublin City University.

REFERENCES

[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In 3rd International Conference on Learning Representations (ICLR).
[2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135-146.
[3] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. 2018. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognition 77 (2018), 329-353. https://doi.org/10.1016/j.patcog.2017.10.009
[4] De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-Fei, and Juan Carlos Niebles. 2018. Finding "It": Weakly-Supervised, Reference-Aware Visual Grounding in Instructional Videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 5948-5957.
[5] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Computer Vision and Pattern Recognition (CVPR). 3128-3137.
[6] Francisco Massa and Ross Girshick. 2018. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark. Accessed: 07 June 2019.
[7] Yasufumi Moriya, Ramon Sanabria, Florian Metze, and Gareth J. F. Jones. 2019. MediaEval 2019: Eyes and Ears Together. In Proceedings of MediaEval 2019.
[8] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, and Bernt Schiele. 2016. Grounding of Textual Phrases in Images by Reconstruction. In European Conference on Computer Vision (ECCV). 817-834.
[9] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loic Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset For Multimodal Language Understanding. In Workshop on Visually Grounded Interaction and Language (ViGIL), NeurIPS. http://arxiv.org/abs/1811.00347
[10] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. 2017. Aggregated Residual Transformations for Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).