HCMUS at Eyes and Ears Together 2019: Entity Localization with Guided Word Embedding and Human Pose Estimation Approach

Gia-Han Diep, Duc-Tuan Luu, Son-Thanh Tran-Nguyen, Minh-Triet Tran
Faculty of Information Technology and Software Engineering Laboratory
University of Science, VNU-HCM, Vietnam
{dghan,ldtuan,tnsthanh}@apcs.vn, tmtriet@fit.hcmus.edu.vn
{dghan,ldtuan,tnsthanh,tmtriet}@selab.hcmus.edu.vn

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 29-31 October 2019, Sophia Antipolis, France.

ABSTRACT
The Eyes and Ears Together task focuses on developing an efficient framework for analyzing and localizing entities and their associated pronouns from speech transcripts. We present the HCMUS team's approach, which combines a Faster R-CNN model with the Word2vec architecture. We submit multiple runs with different priority orders in our combined model. Our methods show promising results and achieve up to twice the accuracy of the task organizers' approaches.

1 INTRODUCTION
The Eyes and Ears Together challenge at MediaEval 2019 [10] aims to automate data collection for visual grounding by exploiting speech transcriptions and videos, and to develop larger-scale visual grounding systems. In this challenge, we are given a collection of instructional videos called "How2" [8] and their corresponding speech transcripts. In addition, the task organizers provide a list of nouns, timestamps, and 2048-dimensional feature vectors extracted from a 152-layer Residual Neural Network [3] for each of the top 20 proposals produced by Mask R-CNN [2]. The goal of the challenge is to localize in the videos the specific entities given in the time-aligned speech transcription.

Since the given region proposals may not visually contain exactly the desired object, we decided to use a Faster Region-based Convolutional Neural Network (Faster R-CNN) [7] with COCO [5] and OpenImages [4] pretrained weights to re-extract the proposals of the dataset. After that, the Word2vec tool and OpenPose [1] are used in different priority orders to select the best candidate proposal. We also examine using Tesseract Optical Character Recognition (OCR) [9] for detection.

2 APPROACH
Our goal is to localize the target objects given the object concepts from the frame names. Our proposed approach assumes that any model used to extract features from frames (to query the target objects) has already gained some knowledge to distinguish the target objects from others. Our first approach is to exploit knowledge from the concepts using word embeddings combined with region proposals from pretrained models (Section 2.2). We can then build a dictionary that translates the target concepts into known concepts on the development set and apply it to the test set. Our second approach uses pose estimation to localize the concepts relating to the human body (Section 2.3). The last approach is to use OCR for detection (Section 2.4). Overall, our method's output is a set of bounding boxes for all detected target objects.

2.1 Data preprocessing
We noticed that the dataset contains synonyms as well as singular and plural nouns that refer to the same object; hence, to ensure consistency, we automatically convert the labels given in the frame names into canonical labels.

2.2 Word Embedding with regional proposal
Our method creates a dictionary that maps the keyword of a query to existing concepts in the MS COCO and OpenImages V4 datasets. To do this, we first use word embeddings (Section 2.2.1) to encode the keyword of each query and each known concept and measure their contextual relationship, with the L2 distance used as the context distance. To build the dictionary, we manually create ground-truth bounding boxes for approximately 50% of the development set (Section 2.2.2), and for each keyword we choose the top 3 related concepts (lowest L2 distance) whose region proposals are detected with an IoU greater than or equal to a chosen threshold (Section 2.2.2).

For the test set, using only this dictionary and the region proposals, we obtain a list of corresponding proposals and select the final bounding boxes based on probability, position relative to the central region, or at random, depending on the query concept. Figure 2 shows how this association dictionary is built.

2.2.1 Word Embedding. Using the Word2vec tool with Google News pretrained weights and the Skip-gram architecture [6], we compute 300-dimensional vectors representing the predicted labels (from OpenImages and COCO) and the target concepts. We then calculate the cosine, L1, and L2 distances between each pair of target and predicted concepts to obtain the distance scores and choose the best metric.
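To make the distance computation of Section 2.2.1 concrete, the following is a minimal sketch, assuming the gensim library is used to load the Google News Skip-gram vectors; the file path, the averaging of multi-word labels, and the example candidate labels are illustrative assumptions rather than the exact implementation used in our runs.

```python
# Sketch of the concept-distance step (Section 2.2.1). Assumes gensim is installed
# and the Google News word2vec binary has been downloaded separately (path is a placeholder).
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed(phrase):
    """Average the 300-d word vectors of a (possibly multi-word) concept name."""
    words = [w for w in phrase.lower().split() if w in w2v]
    return np.mean([w2v[w] for w in words], axis=0) if words else None

def distances(target, predicted):
    """Cosine, L1 and L2 distances between a target concept and a predicted label."""
    t, p = embed(target), embed(predicted)
    if t is None or p is None:
        return None
    cos = 1.0 - np.dot(t, p) / (np.linalg.norm(t) * np.linalg.norm(p))
    return {"cosine": cos, "L1": np.sum(np.abs(t - p)), "L2": np.linalg.norm(t - p)}

# Example: rank pretrained detector labels by L2 distance to a target concept.
candidates = ["Cosmetics", "Human mouth", "Human leg"]  # illustrative labels only
scores = {c: distances("polish", c)["L2"] for c in candidates}
print(sorted(scores.items(), key=lambda kv: kv[1]))
```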
2.2.2 Build concept dictionary from word embedding and regional proposal. We use Faster R-CNN pretrained on the COCO and OpenImages V4 datasets to extract region proposals (with their predicted concepts, Section 2.2.1) for every frame. For each image in the development set, we calculate the IoU between each output region proposal and each ground-truth bounding box, and filter out the proposals with an IoU lower than a chosen threshold. We then select the three best-scoring predicted concepts (from the remaining proposals) for each target concept (using the L2 metric) to obtain a dictionary translating the target concepts into known concepts (an example is shown in Table 1).

Table 1: Example of the concept correspondence dictionary.

Target concept    Top predicted concepts
POLISH            Cosmetics
MOUTH             Human mouth
CALF/CALVES       Human leg, Human arm

Figure 1: Example result of the Word Embedding and Regional Proposal approach.

Figure 2: How the association dictionary is built using the Guided Word Embedding approach on the development set. (Pipeline: video frames are passed through Faster R-CNN pretrained on COCO and OpenImageV4 to obtain region proposals and feature vectors; the list of nouns and pronouns is encoded with Word2Vec pretrained on Google News; the top-3 similarity scores (L1, L2, cosine distance) and the best IoU score (threshold = 0.5) yield the appropriate concepts and the final dictionary.)
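The dictionary construction of Section 2.2.2 can be summarized by the sketch below. The IoU threshold of 0.5 follows Figure 2; the per-frame data layout (lists of labelled proposals and ground-truth boxes) and the injected l2_distance function are simplifying assumptions about our pipeline, not its exact interface.

```python
# Sketch of the dictionary construction (Section 2.2.2).
# Boxes are (x1, y1, x2, y2); l2_distance is the embedding distance from Section 2.2.1.
from collections import defaultdict

IOU_THRESHOLD = 0.5  # chosen threshold (Figure 2)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter > 0 else 0.0

def build_dictionary(dev_frames, l2_distance):
    """dev_frames: list of dicts with 'proposals' [(label, box)] and
    'ground_truth' [(target_concept, box)] for each manually annotated frame."""
    best = defaultdict(dict)  # target concept -> {predicted label: best L2 distance}
    for frame in dev_frames:
        for target, gt_box in frame["ground_truth"]:
            for label, box in frame["proposals"]:
                if iou(box, gt_box) < IOU_THRESHOLD:
                    continue  # proposal does not overlap the ground truth enough
                d = l2_distance(target, label)
                if d is not None:
                    best[target][label] = min(d, best[target].get(label, float("inf")))
    # keep the three predicted concepts with the lowest L2 distance per target concept
    return {t: sorted(labels, key=labels.get)[:3] for t, labels in best.items()}
```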
2.3 OpenPose Approach
We use OpenPose for concepts relating to the human body, especially those in neither the COCO nor the OpenImages known concepts, such as back, hips, toes, and heels. The pose estimation provides specific estimated positions of human skeletal keypoints. Based on those keypoints, we can expand the padding of the detected corresponding bone part to obtain the target object. We also apply human knowledge or heuristics for some concepts. For instance, we assume that the human back extends from the hips to the shoulders.

We also assume that objects held by human hands should lie either between the thumbs and index fingers and the upper hand region, or between one hand and the other, or between one thumb and the region extending from that hand in the elbow-to-hand direction. Figure 3 shows an example result from our OpenPose approach.

Figure 3: Example result from the OpenPose approach.
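As an illustration of the pose-based heuristic in Section 2.3, the sketch below localizes the "back" concept from shoulder and hip keypoints; the keypoint naming, the padding ratio, and the function interface are illustrative assumptions rather than the exact values and interfaces used in our runs.

```python
# Sketch of the pose-based localization heuristic (Section 2.3) for the "back" concept.
# `keypoints` maps joint names to (x, y) pixel coordinates taken from OpenPose output;
# the naming convention and padding ratio are illustrative assumptions.

PAD_RATIO = 0.15  # expand the keypoint box by 15% of its size on each side (illustrative)

def box_from_keypoints(keypoints, joints, frame_w, frame_h, pad=PAD_RATIO):
    """Bounding box around the requested joints, expanded by a padding margin."""
    pts = [keypoints[j] for j in joints if j in keypoints]
    if not pts:
        return None  # required keypoints were not detected in this frame
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    x1 = max(0, min(xs) - pad * w)
    y1 = max(0, min(ys) - pad * h)
    x2 = min(frame_w, max(xs) + pad * w)
    y2 = min(frame_h, max(ys) + pad * h)
    return (x1, y1, x2, y2)

def localize_back(keypoints, frame_w, frame_h):
    """Human back is assumed to span from the hips to the shoulders."""
    joints = ["left_shoulder", "right_shoulder", "left_hip", "right_hip"]
    return box_from_keypoints(keypoints, joints, frame_w, frame_h)
```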
2.4 OCR Approach
There are several concepts for which we decided to use OCR for detection. For these frames, we crop all region proposals as input to an OCR model and keep those whose output contains any word similar (sharing at least 90% of its letters) to a word in the correspondence dictionary that we manually built for the OCR concepts only. However, this approach is not very effective due to the low image resolution.

3 EXPERIMENTS AND RESULTS
3.1 Run submissions
In Run 1, we use word embedding with regional proposal (Section 2.2) with threshold 0 to obtain region proposals whose predicted concepts are in the correspondence dictionary, then use the OpenPose approach (Section 2.3) for only straightforward concepts such as hands, eyes, etc., and take the one-fourth area in the center of the frame for the remaining cases, assuming that target objects should be in the center of the image. We obtain Run 2 in the same way as Run 1, except that we use threshold 0.5 for the word embedding with regional proposal approach (Section 2.2). Run 3 and Run 4 are similar to Run 1 and Run 2, but we take bounding boxes with an area of nearly half (instead of one fourth) of the image area in the worst cases.

For Run 5, we use the word embedding with regional proposal approach (Section 2.2) with threshold 0.5 and the OpenPose approach with more hypotheses (Section 2.3). We also use the OCR approach (Section 2.4) for some concepts in this run and, lastly, the one-fourth center region of the frame in the worst cases.

3.2 Result
Table 2 shows that Run 2 is better than Run 1, and Run 4 is better than Run 3. Run 5 is our best model overall, which suggests that applying human knowledge to specific scenarios may yield even better results.

Table 2: Eyes and Ears Together challenge 2019 results.

Threshold    0.5      0.3      0.1
Run 1        0.208    0.35     0.545
Run 2        0.209    0.354    0.551
Run 3        0.213    0.346    0.54
Run 4        0.215    0.35     0.547
Run 5        0.216    0.348    0.542

4 CONCLUSION AND FUTURE WORKS
The Eyes and Ears Together challenge is a novel problem that tries to map knowledge gained from natural language to vision. Our current approach only focuses on extracting proposals using Faster R-CNN, calculating distances between pairs of target and known concepts, and using OpenPose keypoints as a guide for our hypotheses about human body parts. With the current approach, we obtained a modest accuracy using only pretrained models, which can be increased with better detectors.

ACKNOWLEDGMENTS
This research is supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19. We would like to thank AIOZ Pte Ltd for supporting our team with computing infrastructure.

REFERENCES
[1] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1302–1310.
[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV). 2980–2988.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90
[4] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982 (2018).
[5] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
[6] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 91–99.
[8] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. CoRR abs/1811.00347 (2018). arXiv:1811.00347 http://arxiv.org/abs/1811.00347
[9] Ray Smith. 2007. An Overview of the Tesseract OCR Engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR). 629–633.
[10] Yasufumi Moriya, Ramon Sanabria, Florian Metze, and Gareth J. F. Jones. 2019. Eyes and Ears Together Task at MediaEval 2019. In MediaEval 2019.