            DCU-ADAPT at MediaEval 2019: Eyes and Ears Together
                                                             Yasufumi Moriya, Gareth J. F. Jones
                                ADAPT Centre, School of Computing, Dublin City University, Dublin 9, Ireland
yasufumi.moriya@adaptcentre, gareth.jones@dcu.ie

ABSTRACT
We describe the DCU-ADAPT participation in the Eyes and Ears
Together task at MediaEval 2019. Our submitted systems were
developed to choose object bounding boxes from automatically
generated proposals given query entities. The first system finds
relevance between object proposals and queries using multiple
instance learning. The second system employs an attention mech-
anism to find the object proposals which most likely correspond
to the given queries. The last system is a baseline system which
chooses region proposals at random. We observed that the first two
systems produced higher accuracy than the random baseline. The
best approach was to use multiple instance learning which resulted
in an accuracy of 9% when the intersection-over-union threshold was 0.5.

Figure 1: Computation of loss function using contrastive loss.

1 INTRODUCTION

Human communication is often a multimodal process, where textual, visual and audio information are processed simultaneously. The Eyes and Ears Together task at MediaEval 2019 aims to ground speech transcripts into videos [7]. Visual grounding tasks are typically conducted on images or videos paired with manually created captions [4, 5, 8], but rarely on vision and speech. Speech grounding is interesting in that it replicates human communication, where listening to speech and seeing objects happen simultaneously. A practical advantage of grounding speech into vision is that, unlike caption grounding, speech transcripts can be obtained easily from user-generated content (e.g., YouTube) or using automatic speech recognition.

As task organisers, we generated pairs of video frames and entities from the How2 dataset [7, 9]. The challenge of this task is that systems need to discover relationships between objects and entities without explicit annotation of objects, since pairs of video frames and entities are aligned automatically.

In this paper, we describe our investigation into whether two existing approaches employed for caption grounding can be applied to speech grounding. The common characteristic of these approaches is that both use pre-computed candidate region proposals of objects. The first approach finds relationships between object proposals and queries using a contrastive loss [4]. This employs an established technique referred to as multiple instance learning (MIL), which is often applied to other computer vision tasks [3]. The second approach uses an attention mechanism [1], with the object bounding box that has the highest attention weight taken as the prediction for a given query entity [8]. To compare these approaches to the most basic system, a final system randomly chooses object bounding boxes from the candidate region proposals.

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval’19, 27-29 October 2019, Sophia Antipolis, France.

2 OUR APPROACH

We use machine learning approaches to visual grounding with automatically generated object proposals. For each video frame, there are n object proposals. We extract n fixed-length feature vectors by cropping the video frame according to the object proposals and applying a convolutional neural network (CNN) to each cropped image. Each query entity associated with a video frame is also transformed into a fixed-length vector using a word embedding model.

2.1 Multiple Instance Learning

Given region proposals transformed into fixed-length vectors, and a query entity also represented as a vector, a neural network model can find the region proposal which is most strongly associated with the query entity [4]. This can be expressed in the following equations:

    \phi(r_{ijk}) = W_r(f_{CNN}(r_{ijk}))                                         (1)
    \psi(e_i)     = W_e(f_{EMB}(e_i))                                             (2)
    \bar{k}       = \arg\max_k \mathrm{sigmoid}(\phi(r_{ijk})^T \cdot \psi(e_i))  (3)

where i denotes the i-th entity, j the j-th video frame of an entity and k the k-th region proposal, \phi(r_{ijk}) is the CNN feature of region proposal r_{ijk}, \psi(e_i) is the word embedding of query entity e_i, and \bar{k} is the index of the region proposal most strongly associated with e_i. While f_{CNN} and f_{EMB} are fixed during training, the projections W_r and W_e of the neural network model are updated at training time.
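As a concrete illustration of Equations 1-3, the following is a minimal PyTorch sketch of the scoring step. It is not the authors' implementation; the module name, the joint dimension and the use of linear projections are our own assumptions.

    import torch
    import torch.nn as nn

    class MILScorer(nn.Module):
        """Sketch of Equations 1-3: project pre-computed visual and textual
        features into a joint space and score every region proposal."""
        def __init__(self, visual_dim=2048, text_dim=100, joint_dim=256):
            super().__init__()
            self.W_r = nn.Linear(visual_dim, joint_dim)   # Eq. (1)
            self.W_e = nn.Linear(text_dim, joint_dim)     # Eq. (2)

        def forward(self, region_feats, entity_emb):
            # region_feats: (k, visual_dim) CNN features f_CNN of the k proposals
            # entity_emb:   (text_dim,) word embedding f_EMB of the query entity
            phi = self.W_r(region_feats)                  # (k, joint_dim)
            psi = self.W_e(entity_emb)                    # (joint_dim,)
            scores = torch.sigmoid(phi @ psi)             # Eq. (3): one score per proposal
            return scores, scores.argmax()                # \bar{k}: best-matching proposal

At test time, the index returned by scores.argmax() selects the bounding box predicted for the query entity.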
At training time, given region proposals and a query entity, the neural network model is trained to find relationships between video frames and query entities, as shown in Figure 1. For each pair of a video frame and a query entity, two additional pairs are constructed which mis-match a video frame and a query entity. The loss function penalises the model when it gives a higher score to a mis-matched pair. This is expressed in Equation 5:

    S_{ii} = \sum_j \max_k \left( \phi(r_{ijk})^T \cdot \psi(e_i) \right)                                    (4)
    L = \sum_{i=1}^{I} \left( \max(0, S_{il} - S_{ii} + \delta) + \max(0, S_{li} - S_{ii} + \delta) \right)  (5)

where S_{ii} is the score of a correctly matched image-entity pair, S_{il} the score of the current image paired with a random query entity, S_{li} the score of a random image paired with the current query entity, and \delta is a margin hyper-parameter.
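A minimal sketch of the contrastive loss in Equations 4-5 for a single positive pair is given below. It assumes the projections \phi and \psi have already been computed; the function and variable names are ours, and the default margin value is an assumption.

    import torch

    def contrastive_loss(phi_pos, psi_pos, phi_neg, psi_neg, delta=0.1):
        """Sketch of Equations 4-5 for one training example.

        phi_pos: (j, k, d) projected region features of the matched video frames
        psi_pos: (d,)      projected embedding of the matched query entity
        phi_neg: (j, k, d) projected region features of a mismatched (random) image
        psi_neg: (d,)      projected embedding of a mismatched (random) entity
        delta:   margin hyper-parameter (assumed value)
        """
        def score(phi, psi):
            # Eq. (4): dot product per proposal, max over proposals k, sum over frames j
            return (phi @ psi).max(dim=-1).values.sum()

        s_ii = score(phi_pos, psi_pos)   # correctly matched pair
        s_il = score(phi_pos, psi_neg)   # current image, random entity
        s_li = score(phi_neg, psi_pos)   # random image, current entity

        # Eq. (5): hinge terms penalise mismatched pairs that score close to the match
        zero = torch.zeros(())
        return torch.maximum(zero, s_il - s_ii + delta) + torch.maximum(zero, s_li - s_ii + delta)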
Figure 2: Computation of loss function using reconstruction.

2.2 Reconstruction

A neural network can also find the region proposal most strongly associated with a query entity using an attention mechanism [1]:

    \bar{k} = \arg\max_k f_{ATTN}([\phi(r_{ijk}); \psi(e_i)])                                                (6)

In Equation 6, f_{ATTN} is an attention function which computes attention weights over the k region proposals given the concatenation of a visual feature \phi(r_{ijk}) and an embedded query entity \psi(e_i).

At training time, the model learns a relationship between a visual object and a query entity by reconstructing the embedded query entity from the region proposal which has the highest attention weight [8]. Figure 2 shows how an object bounding box is found at testing time, and how the model is trained to reconstruct a query entity from a region proposal at training time. Formally, the following equations express how to compute the reconstruction loss:

    r_{attn} = W_{rec} \sum_{k=1}^{N} a_k \phi(r_{ijk})                                                      (7)
    L_{rec}  = \frac{1}{D} \sum_{d=1}^{D} \left( \psi(e_i)_d - r_{attn,d} \right)^2                          (8)

In Equation 7, the sum of the visual features of the region proposals weighted by the attention weights a_k is transformed into a reconstructed embedding of the query entity, r_{attn}. In Equation 8, L_{rec} is essentially the mean squared error between the reconstructed query entity and the embedded query entity, computed over the D embedding dimensions.
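The sketch below illustrates Equations 6-8 with a small additive attention network over the proposals. The architecture of f_{ATTN} (a two-layer MLP) and the hidden dimension are assumptions on our part, not details taken from the paper or from [8].

    import torch
    import torch.nn as nn

    class AttentionReconstructor(nn.Module):
        """Sketch of Equations 6-8: attend over region proposals and
        reconstruct the query-entity embedding from the attended feature."""
        def __init__(self, visual_dim=2048, text_dim=100, hidden_dim=256):
            super().__init__()
            # f_ATTN scores the concatenation [phi(r_ijk); psi(e_i)] for each proposal
            self.f_attn = nn.Sequential(
                nn.Linear(visual_dim + text_dim, hidden_dim),
                nn.Tanh(),
                nn.Linear(hidden_dim, 1),
            )
            self.W_rec = nn.Linear(visual_dim, text_dim)      # Eq. (7)

        def forward(self, region_feats, entity_emb):
            # region_feats: (k, visual_dim), entity_emb: (text_dim,)
            k = region_feats.size(0)
            concat = torch.cat([region_feats, entity_emb.expand(k, -1)], dim=-1)
            logits = self.f_attn(concat).squeeze(-1)          # (k,) unnormalised scores
            a = torch.softmax(logits, dim=0)                  # attention weights a_k
            r_attn = self.W_rec(a @ region_feats)             # Eq. (7): reconstructed embedding
            loss = torch.mean((entity_emb - r_attn) ** 2)     # Eq. (8): reconstruction MSE
            return loss, logits.argmax()                      # Eq. (6): predicted proposal index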
3 IMPLEMENTATION DETAILS

For each video frame, 20 region proposals were extracted from the How2 dataset [9] using Mask-RCNN [6], which uses ResNeXt-101 [10] as its backbone. For each region proposal, the ResNet-152 model was used to extract a fixed-length vector; the dimension of each visual feature was 2,048. The word embedding model was trained on the training set of the How2 speech transcripts using the fastText library [2], and each query entity was embedded into a 100-dimensional vector.
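As an illustration of this feature pipeline, the sketch below crops each proposal box, extracts a 2,048-dimensional ResNet-152 feature with torchvision, and embeds a query entity with a fastText model. The model path, the example entity and the image pre-processing are placeholders, not the paper's actual configuration.

    import torch
    import torchvision.models as models
    import torchvision.transforms as T
    from PIL import Image
    import fasttext

    # ResNet-152 with the classification head removed -> 2,048-d features
    resnet = models.resnet152(pretrained=True)
    resnet.fc = torch.nn.Identity()
    resnet.eval()

    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def proposal_features(frame_path, boxes):
        """Crop each (x1, y1, x2, y2) proposal box and extract a ResNet-152 feature."""
        frame = Image.open(frame_path).convert("RGB")
        crops = torch.stack([preprocess(frame.crop(box)) for box in boxes])
        with torch.no_grad():
            return resnet(crops)                       # (n_proposals, 2048)

    # fastText model trained on the How2 transcripts (hypothetical path)
    ft = fasttext.load_model("how2_transcripts_100d.bin")
    entity_vector = ft.get_word_vector("guitar")       # 100-d query-entity embedding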
4 RESULTS

Table 1: Accuracy of visual grounding at three IoU thresholds (0.5, 0.3 and 0.1).

              IoU > 0.5    IoU > 0.3    IoU > 0.1
    MIL       0.094        0.227        0.494
    Rec       0.080        0.192        0.402
    Random    0.077        0.181        0.408

Table 1 shows the results of visual grounding using the MIL-based approach, the reconstruction-based approach and the system which chooses region proposals at random. The systems were evaluated by the intersection of the selected region proposal with the gold-standard bounding box divided by their union (IoU). When the IoU value exceeded a threshold of 0.5, 0.3 or 0.1, a system prediction was regarded as correct. As can be seen in the table, both the MIL and reconstruction approaches generally produced slightly better results than the simple random approach. A possible explanation for the poor results of the two models is that these approaches have previously been applied to caption grounding, where they showed reasonable results, but not to speech grounding. For speech grounding, it is possible that entities are sometimes only weakly associated with visual objects. Therefore, existing models may need modification to learn the relationships between entities and objects in speech grounding efficiently.
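The IoU-based evaluation described above is simple to compute. The helper below is our own minimal sketch, not the official task scorer.

    def iou(box_a, box_b):
        """Intersection over union of two (x1, y1, x2, y2) boxes."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def accuracy_at_threshold(predicted_boxes, gold_boxes, threshold):
        """Fraction of predictions whose IoU with the gold box exceeds the threshold."""
        hits = sum(iou(p, g) > threshold for p, g in zip(predicted_boxes, gold_boxes))
        return hits / len(gold_boxes)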
5 CONCLUSIONS

This paper describes the DCU-ADAPT participation in the Eyes and Ears Together task at MediaEval 2019. We employed machine learning approaches previously applied to caption grounding, and investigated whether those models can also work for speech grounding. It was found that, while they still perform better than the random baseline, they require modification to better capture the weak relationships between entities in speech transcripts and visual objects.

ACKNOWLEDGMENTS

This work was supported by Science Foundation Ireland as part of the ADAPT Centre (Grant 13/RC/2106) at Dublin City University.


REFERENCES
 [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural
     Machine Translation by Jointly Learning to Align and Translate. In
     3rd International Conference on Learning Representations (ICLR).
 [2] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas
     Mikolov. 2017. Enriching word vectors with subword information.
     Transactions of the Association for Computational Linguistics 5 (2017),
     135–146.
 [3] Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and
     Ghyslain Gagnon. 2018. Multiple instance learning: A survey of prob-
     lem characteristics and applications. Pattern Recognition 77 (2018), 329
     – 353. https://doi.org/10.1016/j.patcog.2017.10.009
 [4] De-An Huang, Shyamal Buch, Lucio Dery, Animesh Garg, Li Fei-
     Fei, and Juan Carlos Niebles. 2018. Finding “It”: Weakly-Supervised,
     Reference-Aware Visual Grounding in Instructional Videos. In Interna-
     tional Conference on Computer Vision and Pattern Recognition (CVPR).
     5948–5957.
 [5] Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments
     for generating image descriptions. In Computer Vision and Pattern
     Recognition (CVPR). 3128–3137.
 [6] Francisco Massa and Ross Girshick. 2018. maskrcnn-benchmark:
     Fast, modular reference implementation of Instance Segmentation
     and Object Detection algorithms in PyTorch. https://github.com/
     facebookresearch/maskrcnn-benchmark. (2018). Accessed: 07 June
     2019.
 [7] Yasufumi Moriya, Ramon Sanabria, Florian Metze, and Gareth J. F.
     Jones. 2019. MediaEval 2019: Eyes and Ears Together. In Proceedings
     of MediaEval 2019.
 [8] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell,
     and Bernt Schiele. 2016. Grounding of Textual Phrases in Images by
     Reconstruction. In European Conference on Computer Vision (ECCV).
     817–834.
 [9] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott,
     Loic Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-
     scale Dataset For Multimodal Language Understanding. In Workshop
     on Visually Grounded Interaction and Language (ViGIL). NeurIPS. http:
     //arxiv.org/abs/1811.00347
[10] Saining Xie, Ross Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming
     He. 2017. Aggregated Residual Transformations for Deep Neural
     Networks. In The IEEE Conference on Computer Vision and Pattern
     Recognition (CVPR).