HCMUS at Eyes and Ears Together 2019: Entity Localization with Guided Word Embedding and Human Pose Estimation Approach

Gia-Han Diep, Duc-Tuan Luu, Son-Thanh Tran-Nguyen, Minh-Triet Tran
Faculty of Information Technology and Software Engineering Laboratory
University of Science, VNU-HCM, Vietnam
{dghan,ldtuan,tnsthanh}@apcs.vn, tmtriet@fit.hcmus.edu.vn
{dghan,ldtuan,tnsthanh,tmtriet}@selab.hcmus.edu.vn

Copyright 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). MediaEval'19, 29-31 October 2019, Sophia Antipolis, France.

ABSTRACT
The Eyes and Ears Together task focuses on developing an efficient framework for analyzing and localizing entities and their associated pronouns from speech transcripts. We present the HCMUS team's approach, which combines a Faster R-CNN model with the Word2vec architecture. We submit multiple runs with different priority orders in our combined model. Our methods show promising results and achieve up to twice the accuracy of the task organizers' approaches.

1 INTRODUCTION
The Eyes and Ears Together challenge at MediaEval 2019 [10] aims to automate data collection for visual grounding by exploiting speech transcriptions and videos, and to develop larger-scale visual grounding systems. In this challenge, we are given a collection of instructional videos called "How2" [8] and their corresponding speech transcripts. In addition, the task organizers provide a list of nouns, timestamps, and 2048-dimensional feature vectors extracted from a 152-layer Residual Neural Network [3] for each of the top 20 proposals produced by Mask R-CNN [2]. The goal of the challenge is to localize in the videos the specific entities given in the time-aligned speech transcription.

Since the given region proposals may not visually contain exactly the desired object, we decided to use a Faster Region-based Convolutional Neural Network (Faster R-CNN) [7] with COCO [5] and OpenImages [4] pretrained weights to re-extract the proposals of the dataset. After that, the Word2vec tool and OpenPose [1] are used in different priority orders to select the best candidate proposal. We also examine using Tesseract Optical Character Recognition (OCR) [9] for detection.

2 APPROACH
Our goal is to localize the target objects given the object concepts from the frame names. Our proposed approach assumes that any model used to extract features from frames (to query the target objects) has already gained some knowledge to distinguish the target objects from others. Our first approach is to exploit knowledge from the concepts using word embeddings combined with region proposals from pretrained models (Section 2.2). We can then build a dictionary that translates the target concepts into known concepts on the development set and apply it to the test set. Our second approach uses pose estimation to localize the concepts relating to the human body (Section 2.3). The last approach is to use OCR for detection (Section 2.4). Overall, our method's output is a set of bounding boxes for all detected target objects.

2.1 Data preprocessing
We noticed that the dataset contains synonyms as well as singular and plural nouns that refer to the same object; hence, to ensure consistency, we automatically convert the labels given in the frame names into canonical labels.

2.2 Word Embedding with regional proposal
Our method creates a dictionary that maps the keyword of a query to existing concepts in the MS COCO and OpenImages V4 datasets. To do this, we first use word embeddings (Section 2.2.1) to encode the keyword of each query and each known concept and measure their contextual relationship, with the L2 distance used as the context distance. To build the dictionary, we manually create ground-truth bounding boxes for approximately 50% of the development set (Section 2.2.2), and for each keyword we choose the top 3 related concepts (lowest L2 distance) whose region proposals are detected with an IoU greater than or equal to a chosen threshold (Section 2.2.2).

For the test set, using only this dictionary and the region proposals, we obtain a list of corresponding proposals and select the final bounding boxes based on probability, position relative to the central region, or at random, depending on the query concept. Figure 2 shows how this association dictionary is built.

2.2.1 Word Embedding. Using the Word2vec tool with Google News pretrained weights and the Skip-gram architecture [6], we compute 300-dimensional vectors representing the predicted labels (from OpenImages and COCO) and the target concepts. We then calculate the cosine, L1, and L2 distances between each pair of target and predicted concepts to obtain the distance scores and choose the best metric.
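To make the distance computation of Section 2.2.1 concrete, the following is a minimal sketch, assuming the gensim library is used to load the Google News Skip-gram vectors; the file path, the averaging of multi-word labels, and the example candidate labels are illustrative assumptions rather than the exact implementation used in our runs.

```python
# Sketch of the concept-distance step (Section 2.2.1). Assumes gensim is installed
# and the Google News word2vec binary has been downloaded separately (path is a placeholder).
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def embed(phrase):
    """Average the 300-d word vectors of a (possibly multi-word) concept name."""
    words = [w for w in phrase.lower().split() if w in w2v]
    return np.mean([w2v[w] for w in words], axis=0) if words else None

def distances(target, predicted):
    """Cosine, L1 and L2 distances between a target concept and a predicted label."""
    t, p = embed(target), embed(predicted)
    if t is None or p is None:
        return None
    cos = 1.0 - np.dot(t, p) / (np.linalg.norm(t) * np.linalg.norm(p))
    return {"cosine": cos, "L1": np.sum(np.abs(t - p)), "L2": np.linalg.norm(t - p)}

# Example: rank pretrained detector labels by L2 distance to a target concept.
candidates = ["Cosmetics", "Human mouth", "Human leg"]  # illustrative labels only
scores = {c: distances("polish", c)["L2"] for c in candidates}
print(sorted(scores.items(), key=lambda kv: kv[1]))
```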
2.2.2 Build concept dictionary from word embedding and regional proposal. We use Faster R-CNN pretrained on the COCO and OpenImages V4 datasets to extract region proposals (with their predicted concepts, Section 2.2.1) for every frame. For each image in the development set, we calculate the IoU between each output region proposal and each ground-truth bounding box, and filter out the proposals with an IoU lower than a chosen threshold. We then select the three best-scoring predicted concepts (from the remaining proposals) for each target concept (using the L2 metric) to obtain a dictionary translating the target concepts into known concepts (an example is shown in Table 1).

Table 1: Example of the concept correspondence dictionary.

Target concept    Top predicted concepts
POLISH            Cosmetics
MOUTH             Human mouth
CALF/CALVES       Human leg, Human arm

Figure 1: Example result of the Word Embedding and Regional Proposal approach.

Figure 2: How the association dictionary is built using the Guided Word Embedding approach on the development set. (Pipeline: video frames are passed through Faster R-CNN pretrained on COCO and OpenImageV4 to obtain region proposals and feature vectors; the list of nouns and pronouns is encoded with Word2Vec pretrained on Google News; the top-3 similarity scores (L1, L2, cosine distance) and the best IoU score (threshold = 0.5) yield the appropriate concepts and the final dictionary.)
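The dictionary construction of Section 2.2.2 can be summarized by the sketch below. The IoU threshold of 0.5 follows Figure 2; the per-frame data layout (lists of labelled proposals and ground-truth boxes) and the injected l2_distance function are simplifying assumptions about our pipeline, not its exact interface.

```python
# Sketch of the dictionary construction (Section 2.2.2).
# Boxes are (x1, y1, x2, y2); l2_distance is the embedding distance from Section 2.2.1.
from collections import defaultdict

IOU_THRESHOLD = 0.5  # chosen threshold (Figure 2)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter > 0 else 0.0

def build_dictionary(dev_frames, l2_distance):
    """dev_frames: list of dicts with 'proposals' [(label, box)] and
    'ground_truth' [(target_concept, box)] for each manually annotated frame."""
    best = defaultdict(dict)  # target concept -> {predicted label: best L2 distance}
    for frame in dev_frames:
        for target, gt_box in frame["ground_truth"]:
            for label, box in frame["proposals"]:
                if iou(box, gt_box) < IOU_THRESHOLD:
                    continue  # proposal does not overlap the ground truth enough
                d = l2_distance(target, label)
                if d is not None:
                    best[target][label] = min(d, best[target].get(label, float("inf")))
    # keep the three predicted concepts with the lowest L2 distance per target concept
    return {t: sorted(labels, key=labels.get)[:3] for t, labels in best.items()}
```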
2.3 OpenPose Approach
We use OpenPose for concepts relating to the human body, especially those in neither the COCO nor the OpenImages known concepts, such as back, hips, toes, and heels. The pose estimation provides specific estimated positions of human skeletal keypoints. Based on those keypoints, we can expand the padding of the detected corresponding bone part to obtain the target object. We also apply human knowledge or heuristics for some concepts. For instance, we assume that the human back extends from the hips to the shoulders.

We also assume that objects held by human hands should lie either between the thumbs and index fingers and the upper hand region, or between one hand and the other, or between one thumb and the region extending from that hand in the elbow-to-hand direction. Figure 3 shows an example result from our OpenPose approach.

Figure 3: Example result from the OpenPose approach.
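As an illustration of the pose-based heuristic in Section 2.3, the sketch below localizes the "back" concept from shoulder and hip keypoints; the keypoint naming, the padding ratio, and the function interface are illustrative assumptions rather than the exact values and interfaces used in our runs.

```python
# Sketch of the pose-based localization heuristic (Section 2.3) for the "back" concept.
# `keypoints` maps joint names to (x, y) pixel coordinates taken from OpenPose output;
# the naming convention and padding ratio are illustrative assumptions.

PAD_RATIO = 0.15  # expand the keypoint box by 15% of its size on each side (illustrative)

def box_from_keypoints(keypoints, joints, frame_w, frame_h, pad=PAD_RATIO):
    """Bounding box around the requested joints, expanded by a padding margin."""
    pts = [keypoints[j] for j in joints if j in keypoints]
    if not pts:
        return None  # required keypoints were not detected in this frame
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    x1 = max(0, min(xs) - pad * w)
    y1 = max(0, min(ys) - pad * h)
    x2 = min(frame_w, max(xs) + pad * w)
    y2 = min(frame_h, max(ys) + pad * h)
    return (x1, y1, x2, y2)

def localize_back(keypoints, frame_w, frame_h):
    """Human back is assumed to span from the hips to the shoulders."""
    joints = ["left_shoulder", "right_shoulder", "left_hip", "right_hip"]
    return box_from_keypoints(keypoints, joints, frame_w, frame_h)
```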
2.4 OCR Approach
There are several concepts for which we decided to use OCR for detection. For these frames, we crop all region proposals as input to an OCR model and keep those whose output contains any word similar (sharing at least 90% of its letters) to a word in the correspondence dictionary that we manually built for the OCR concepts only. However, this approach is not very effective due to the low image resolution.

3 EXPERIMENTS AND RESULTS
3.1 Run submissions
In Run 1, we use word embedding with regional proposal (Section 2.2) with threshold 0 to obtain region proposals whose predicted concepts are in the correspondence dictionary, then use the OpenPose approach (Section 2.3) for only straightforward concepts such as hands, eyes, etc., and take the one-fourth area in the center of the frame for the remaining cases, assuming that target objects should be in the center of the image. We obtain Run 2 in the same way as Run 1, except that we use threshold 0.5 for the word embedding with regional proposal approach (Section 2.2). Run 3 and Run 4 are similar to Run 1 and Run 2, but we take bounding boxes with an area of nearly half (instead of one fourth) of the image area in the worst cases.

For Run 5, we use the word embedding with regional proposal approach (Section 2.2) with threshold 0.5 and the OpenPose approach with more hypotheses (Section 2.3). We also use the OCR approach (Section 2.4) for some concepts in this run and, lastly, the one-fourth center region of the frame in the worst cases.

3.2 Result
Table 2 shows that Run 2 is better than Run 1, and Run 4 is better than Run 3. Run 5 is our best model overall, which suggests that applying human knowledge to specific scenarios may yield even better results.

Table 2: Eyes and Ears Together challenge 2019 results.

Threshold    0.5      0.3      0.1
Run 1        0.208    0.35     0.545
Run 2        0.209    0.354    0.551
Run 3        0.213    0.346    0.54
Run 4        0.215    0.35     0.547
Run 5        0.216    0.348    0.542

4 CONCLUSION AND FUTURE WORKS
The Eyes and Ears Together challenge is a novel problem that tries to map knowledge gained from natural language to vision. Our current approach only focuses on extracting proposals using Faster R-CNN, calculating distances between pairs of target and known concepts, and using OpenPose keypoints as a guide for our hypotheses about human body parts. With the current approach, we obtained a modest accuracy using only pretrained models, which can be increased with better detectors.

ACKNOWLEDGMENTS
This research is supported by the Vingroup Innovation Foundation (VINIF) under project code VINIF.2019.DA19. We would like to thank AIOZ Pte Ltd for supporting our team with computing infrastructure.

REFERENCES
[1] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2017. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 1302–1310.
[2] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. 2017. Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV). 2980–2988.
[3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). https://doi.org/10.1109/cvpr.2016.90
[4] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper R. R. Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Tom Duerig, and Vittorio Ferrari. 2018. The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982 (2018).
[5] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
[6] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
[7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). Curran Associates, Inc., 91–99.
[8] Ramon Sanabria, Ozan Caglayan, Shruti Palaskar, Desmond Elliott, Loïc Barrault, Lucia Specia, and Florian Metze. 2018. How2: A Large-scale Dataset for Multimodal Language Understanding. CoRR abs/1811.00347 (2018). arXiv:1811.00347 http://arxiv.org/abs/1811.00347
[9] Ray Smith. 2007. An Overview of the Tesseract OCR Engine. In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR). 629–633.
[10] Yasufumi Moriya, Ramon Sanabria, Florian Metze, and Gareth J. F. Jones. 2019. Eyes and Ears Together Task at MediaEval 2019. In MediaEval 2019.