<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at Eyes and Ears Together 2019: Entity Localization with Guided Word Embedding and Human Pose Estimation approach</article-title>
      </title-group>
      <contrib-group>
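        <contrib contrib-type="author">
          <string-name><given-names>Gia-Han</given-names> <surname>Diep</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Duc-Tuan</given-names> <surname>Luu</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Son-Thanh</given-names> <surname>Tran-Nguyen</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Minh-Triet</given-names> <surname>Tran</surname></string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Information Technology and Software Engineering Laboratory, University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>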
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>29</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>The Eyes and Ears Together Task focuses on developing an efficient framework for analyzing and localizing entities and associated pronouns from speech transcripts. We present the HCMUS Team's approach, which combines the Faster R-CNN model with the Word2vec architecture. We submit multiple runs with different priority orders in our combined model. Our methods show promising results and achieve up to 2x the accuracy of the task organizers' approaches.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The Eyes and Ears Together challenge at MediaEval 2019 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] aims to
automate data collection for visual grounding by exploiting speech
transcription and videos, and to develop larger-scale visual
grounding systems. In this challenge, we are given a collection of
instructional videos called “How2” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and their corresponding speech
transcripts. In addition, the task organizers provide a list of nouns,
timestamps, and 2048-dimensional feature vectors extracted with a
152-layer Residual Neural Network [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for each of the top 20 proposals
produced by Mask R-CNN [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The goal of the challenge is to localize in the videos
the specific entities given in the time-aligned speech transcriptions.
      </p>
      <p>
        Since the given region proposals may not visually contain
exactly the desired object, we decided to use the Faster Region-based
Convolutional Neural Network (Faster R-CNN) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with COCO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and OpenImageV4 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] pretrained weights to re-extract the proposals
for the dataset. After that, the Word2vec tool and OpenPose [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] are applied
in different priority orders to select the best candidate proposal.
We also examine the use of Tesseract Optical Character Recognition
(OCR) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] for detection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>Our goal is to localize the target objects given the object concepts
taken from the frame names. Our approach assumes that any model used to
extract features from frames (querying the target objects) has already
gained some knowledge for distinguishing the target objects from others.
Our first approach exploits knowledge about the concepts through word
embedding, combined with region proposals from pretrained models (section 2.2).
We then build a dictionary that translates each target concept into
known concepts on the development set and apply it to the test set.
Our second approach uses Pose Estimation to localize concepts
related to the human body (section 2.3). The last approach uses
OCR for detection (section 2.4). Overall, the raw output of our method
is a set of bounding boxes for all detected target objects.</p>
    </sec>
    <sec id="sec-3">
      <label>2.1</label>
      <title>Data preprocessing</title>
      <p>We noticed that the dataset contains synonyms as well as singular and
plural nouns that target the same object. Hence, to ensure
consistency, we automatically convert the labels given in the
frames’ names into unified ("real") labels.</p>
    </sec>
    <sec id="sec-4">
      <label>2.2</label>
      <title>Word Embedding with regional proposal</title>
      <p>Our method creates a dictionary that maps the keyword of a query
to existing concepts in the two datasets MS COCO and
OpenImageV4. To do this, we first use word embedding (section 2.2.1)
to encode the keyword of each query and each known concept and
measure their contextual relationship, using the L2 distance as
the contextual distance. To build the dictionary, we manually create
ground-truth bounding boxes for approximately 50% of the
development set (section 2.2.2). For each keyword, we then choose the top
3 related concepts (lowest L2 score) that have a detected region proposal
with IoU ≥ a chosen threshold (section 2.2.2).</p>
      <p>For the test set, using only this dictionary and the region proposals,
we obtain a list of corresponding proposals and select the final
bounding boxes based on detection probability, on position relative to the
central region, or at random, depending on the query concept.
Figure 2 shows how this associated dictionary is built.</p>
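      <p>The selection step above can be sketched as follows. This is a minimal illustration rather than the code used for the submitted runs: the <monospace>Proposal</monospace> type, the strategy names, and the use of Euclidean distance to the frame center are all assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import random
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical container for one candidate region."""
    box: tuple    # (x1, y1, x2, y2) in pixels
    score: float  # detector confidence for the predicted concept


def center_distance(box, width, height):
    """Euclidean distance from the box center to the frame center."""
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    return ((cx - width / 2.0) ** 2 + (cy - height / 2.0) ** 2) ** 0.5


def select_box(proposals, strategy, width, height, rng=None):
    """Pick one box among the proposals matched through the dictionary.

    strategy is one of "probability", "center" or "random"; which one
    applies to which query concept is a per-concept choice.
    """
    if not proposals:
        return None
    if strategy == "probability":
        return max(proposals, key=lambda p: p.score).box
    if strategy == "center":
        return min(
            proposals, key=lambda p: center_distance(p.box, width, height)
        ).box
    if strategy == "random":
        return (rng or random).choice(proposals).box
    raise ValueError("unknown strategy: " + strategy)
```
]]></preformat>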
      <p>
        2.2.1 Word Embedding. Using the Word2vec tool with Google News
pretrained weights and the Skip-gram architecture [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we compute
300-dimensional vectors representing the predicted labels (from
OpenImageV4 and COCO) and the target concepts. We then calculate the
cosine, L1 and L2 distances between each pair of target and predicted
concepts to obtain the best distance scores and choose the best metric.
      </p>
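      <p>The three distance measures can be illustrated with a short sketch. It uses plain Python lists in place of the 300-dimensional Word2vec vectors; the function names and the toy two-dimensional vectors in the usage note are hypothetical and serve only to show the ranking step.</p>
      <preformat><![CDATA[
```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


METRICS = {"cosine": cosine_distance, "l1": l1_distance, "l2": l2_distance}


def nearest_concepts(target_vec, known_vecs, metric="l2", top_k=3):
    """Return the names of the top_k known concepts closest to the target.

    known_vecs maps a concept name to its embedding vector.
    """
    dist = METRICS[metric]
    ranked = sorted(known_vecs, key=lambda name: dist(target_vec, known_vecs[name]))
    return ranked[:top_k]
```
]]></preformat>
      <p>For example, with the toy vectors <monospace>{"Human mouth": [1.0, 0.1], "Cosmetics": [0.0, 1.0], "Human leg": [-1.0, 0.0]}</monospace> and the target <monospace>[0.9, 0.2]</monospace>, the L2 ranking places "Human mouth" first.</p>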
      <table-wrap id="tab1">
        <label>Table 1</label>
        <table>
          <tbody>
            <tr><th>Target concept</th><td>POLISH</td><td>MOUTH</td><td>CALF/CALVES</td></tr>
            <tr><th>Top predicted concepts</th><td>Cosmetics</td><td>Human mouth</td><td>Human leg, Human arm</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-4-2">
        <label>2.2.2</label>
        <title>Build concept dictionary from word embedding and regional proposal</title>
        <p>We use Faster R-CNN models pretrained on the COCO and
OpenImageV4 datasets to extract region proposals, each with a predicted
concept (section 2.2.1), for every frame. For each image in the
development set, we calculate the IoU between each output region
proposal and each ground-truth bounding box, and filter
out the proposals whose IoU is below a chosen threshold. We then
select, for each target concept, the three best-scoring predicted concepts
among the remaining proposals (using the L2 metric) to obtain
a dictionary that translates the target concept into known concepts
(an example is shown in Table 1).</p>
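        <p>A minimal sketch of this dictionary construction is given below, assuming boxes in <monospace>(x1, y1, x2, y2)</monospace> format. The <monospace>samples</monospace> layout and the <monospace>distance</monospace> callable (for instance an L2 distance between word vectors) are hypothetical interfaces introduced for the example, not the authors' implementation.</p>
        <preformat><![CDATA[
```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_dictionary(samples, distance, threshold=0.5, top_k=3):
    """Map each target concept to its top_k closest known concepts.

    samples: iterable of (target_concept, gt_box, proposals), where
        proposals is a list of (predicted_concept, box) pairs.
    distance: callable (target_concept, predicted_concept) -> float,
        smaller meaning more related.
    """
    candidates = {}
    for target, gt_box, proposals in samples:
        for concept, box in proposals:
            # Keep only proposals that overlap the ground truth enough.
            if iou(box, gt_box) >= threshold:
                candidates.setdefault(target, set()).add(concept)
    return {
        target: sorted(concepts, key=lambda c: distance(target, c))[:top_k]
        for target, concepts in candidates.items()
    }
```
]]></preformat>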
        <fig id="fig2">
          <label>Figure 2</label>
          <caption>
            <p>Flowchart (image not recovered). Text branch: List of nouns and pronouns; Word2Vec (Google News pretrained); Feature vectors; Top 3 similarity score (L1, L2, cosine distance). Image branch: Video frames; Faster-RCNN (COCO, OpenImageV4 pretrained); Regional proposals; Best IoU score (threshold = 0.5). Output: Appropriate concepts; Appropriate dictionary.</p>
          </caption>
        </fig>
        <sec id="sec-4-2-9">
          <label>2.3</label>
          <title>OpenPose Approach</title>
          <p>We use OpenPose for concepts related to the human body, especially
those that appear in neither the COCO nor the OpenImageV4 known concepts, such as back,
hips, toes and heels. Pose estimation provides the estimated
positions of the keypoints of the human skeleton. Based on those keypoints, we
add padding around the detected body part that corresponds to the concept
to obtain the target object. We also apply human knowledge or
heuristics for some concepts. For instance, we assume that the human
back extends from the hips to the shoulders.</p>
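          <p>The back heuristic can be sketched as a padded bounding box around the shoulder and hip keypoints. The keypoint names, the dictionary layout of <monospace>pose</monospace>, and the padding ratio are assumptions made for this illustration; OpenPose itself returns indexed keypoints that would first have to be mapped to such names.</p>
          <preformat><![CDATA[
```python
def box_from_keypoints(points, pad_ratio=0.1, image_size=None):
    """Padded bounding box around the available (x, y) keypoints.

    points may contain None for keypoints that were not detected.
    Returns (x1, y1, x2, y2), or None if no keypoint is available.
    """
    pts = [p for p in points if p is not None]
    if not pts:
        return None
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    x1, y1, x2, y2 = min(xs), min(ys), max(xs), max(ys)
    pad_x = pad_ratio * (x2 - x1)
    pad_y = pad_ratio * (y2 - y1)
    box = [x1 - pad_x, y1 - pad_y, x2 + pad_x, y2 + pad_y]
    if image_size is not None:
        # Clip the padded box to the frame.
        width, height = image_size
        box = [max(0.0, box[0]), max(0.0, box[1]),
               min(float(width), box[2]), min(float(height), box[3])]
    return tuple(float(v) for v in box)


def back_box(pose, pad_ratio=0.1, image_size=None):
    """Heuristic: the back spans from the hips to the shoulders."""
    names = ("left_shoulder", "right_shoulder", "left_hip", "right_hip")
    return box_from_keypoints([pose.get(n) for n in names], pad_ratio, image_size)
```
]]></preformat>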
          <p>We also assume that an object held in human hands lies in one of
three regions: from the thumbs and index fingers to the upper
hand region, between one hand and the other, or from one thumb into the
region extending from that hand in the elbow-to-hand
direction. Figure 3 shows an example result of our OpenPose approach.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <label>2.4</label>
      <title>OCR Approach</title>
      <p>For several concepts, we decided to use OCR for detection. For
the corresponding frames, we crop every region proposal and feed it to an OCR
model, keeping the proposals whose output contains any word similar (matching at
least 90% of its letters) to a word in the corresponding dictionary,
which we built manually for the OCR concepts only. However, this
approach is not very effective due to the image resolution.</p>
      <p>In Run 1, we use word embedding with regional proposal (section
2.2) with a threshold of 0 to obtain the region proposals whose predicted
concepts appear in the corresponding dictionary. We then apply the OpenPose approach
(section 2.3), restricted to straightforward concepts such as hands and
eyes. For the remaining queries, we take a centered box covering one-fourth of the frame area,
assuming that target objects should be in the center of the image. We
obtain Run 2 in the same way as Run 1, except that we use a
threshold of 0.5 for the word embedding with regional proposal (section 2.2)
approach. Run 3 and Run 4 are similar to Run 1 and Run 2, respectively, but
in the worst cases we take a bounding box covering nearly half (instead
of one-fourth) of the image area.</p>
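      <p>The centered fallback box used in the worst cases can be sketched as below. The text only fixes the fraction of the image area (one-fourth or nearly a half); keeping the aspect ratio of the frame is an assumption of this example, as is the function name.</p>
      <preformat><![CDATA[
```python
import math


def center_box(width, height, area_fraction=0.25):
    """Centered box covering area_fraction of a width x height frame.

    Assumption: the box keeps the frame's aspect ratio, so both sides
    are scaled by sqrt(area_fraction).
    """
    if not (1.0 >= area_fraction > 0.0):
        raise ValueError("area_fraction must be in (0, 1]")
    scale = math.sqrt(area_fraction)
    box_w = width * scale
    box_h = height * scale
    x1 = (width - box_w) / 2.0
    y1 = (height - box_h) / 2.0
    return (x1, y1, x1 + box_w, y1 + box_h)
```
]]></preformat>
      <p>With these assumptions, a 640 x 480 frame and a fraction of one-fourth give a box from (160, 120) to (480, 360).</p>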
      <p>For Run 5, we also use the word embedding with regional proposal
approach (section 2.2) with a threshold of 0.5 and the OpenPose approach
with more hypotheses (section 2.3). In this run we also use the OCR approach
(section 2.4) for some concepts and, as a last resort, the one-fourth
center region of the frame in the worst cases.</p>
      <p>The Eyes and Ears Together challenge poses a novel problem: mapping
knowledge gained from natural language to vision. Our current
approach focuses only on extracting proposals using Faster R-CNN,
calculating the distance between pairs of target and known concepts,
and using OpenPose keypoints as a guide for our hypotheses
about human body parts. With the current approach, we achieve
modest accuracy using only pretrained models, which could be
increased with better detectors.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>This research is supported by the Vingroup Innovation Foundation (VINIF)
under project code VINIF.2019.DA19. We would like to thank AIOZ Pte
Ltd for supporting our team with computing infrastructure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>Zhe</given-names> <surname>Cao</surname></string-name>,
          <string-name><given-names>Gines</given-names> <surname>Hidalgo</surname></string-name>,
          <string-name><given-names>Tomas</given-names> <surname>Simon</surname></string-name>,
          <string-name><given-names>Shih-En</given-names> <surname>Wei</surname></string-name>, and
          <string-name><given-names>Yaser</given-names> <surname>Sheikh</surname></string-name>.
          <year>2016</year>.
          <article-title>Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields</article-title>.
          <source>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (2016),
          <fpage>1302</fpage>-<lpage>1310</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Kaiming</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>Georgia</given-names> <surname>Gkioxari</surname></string-name>,
          <string-name><given-names>Piotr</given-names> <surname>Dollár</surname></string-name>, and
          <string-name><given-names>Ross B.</given-names> <surname>Girshick</surname></string-name>.
          <year>2017</year>.
          <article-title>Mask R-CNN</article-title>.
          <source>2017 IEEE International Conference on Computer Vision (ICCV)</source>
          (2017),
          <fpage>2980</fpage>-<lpage>2988</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Kaiming</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>Xiangyu</given-names> <surname>Zhang</surname></string-name>,
          <string-name><given-names>Shaoqing</given-names> <surname>Ren</surname></string-name>, and
          <string-name><given-names>Jian</given-names> <surname>Sun</surname></string-name>.
          <year>2016</year>.
          <article-title>Deep Residual Learning for Image Recognition</article-title>.
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (Jun 2016). https://doi.org/10.1109/cvpr.2016.90
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
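          [4]
          <string-name><given-names>Alina</given-names> <surname>Kuznetsova</surname></string-name>,
          <string-name><given-names>Hassan</given-names> <surname>Rom</surname></string-name>,
          <string-name><given-names>Neil</given-names> <surname>Alldrin</surname></string-name>,
          <string-name><given-names>Jasper R. R.</given-names> <surname>Uijlings</surname></string-name>,
          <string-name><given-names>Ivan</given-names> <surname>Krasin</surname></string-name>,
          <string-name><given-names>Jordi</given-names> <surname>Pont-Tuset</surname></string-name>,
          <string-name><given-names>Shahab</given-names> <surname>Kamali</surname></string-name>,
          <string-name><given-names>Stefan</given-names> <surname>Popov</surname></string-name>,
          <string-name><given-names>Matteo</given-names> <surname>Malloci</surname></string-name>,
          <string-name><given-names>Tom</given-names> <surname>Duerig</surname></string-name>, and
          <string-name><given-names>Vittorio</given-names> <surname>Ferrari</surname></string-name>.
          <year>2018</year>.
          <article-title>The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale</article-title>.
          ArXiv abs/1811.00982 (2018).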
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>Tsung-Yi</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Michael</given-names> <surname>Maire</surname></string-name>,
          <string-name><given-names>Serge J.</given-names> <surname>Belongie</surname></string-name>,
          <string-name><given-names>Lubomir D.</given-names> <surname>Bourdev</surname></string-name>,
          <string-name><given-names>Ross B.</given-names> <surname>Girshick</surname></string-name>,
          <string-name><given-names>James</given-names> <surname>Hays</surname></string-name>,
          <string-name><given-names>Pietro</given-names> <surname>Perona</surname></string-name>,
          <string-name><given-names>Deva</given-names> <surname>Ramanan</surname></string-name>,
          <string-name><given-names>Piotr</given-names> <surname>Dollár</surname></string-name>, and
          <string-name><given-names>C. Lawrence</given-names> <surname>Zitnick</surname></string-name>.
          <year>2014</year>.
          <article-title>Microsoft COCO: Common Objects in Context</article-title>.
          <source>In ECCV.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name><given-names>Tomas</given-names> <surname>Mikolov</surname></string-name>,
          <string-name><given-names>Kai</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Gregory S.</given-names> <surname>Corrado</surname></string-name>, and
          <string-name><given-names>Jeffrey</given-names> <surname>Dean</surname></string-name>.
          <year>2013</year>.
          <article-title>Efficient Estimation of Word Representations in Vector Space</article-title>.
          <source>CoRR</source>
          abs/1301.3781 (2013).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>Shaoqing</given-names> <surname>Ren</surname></string-name>,
          <string-name><given-names>Kaiming</given-names> <surname>He</surname></string-name>,
          <string-name><given-names>Ross</given-names> <surname>Girshick</surname></string-name>, and
          <string-name><given-names>Jian</given-names> <surname>Sun</surname></string-name>.
          <year>2015</year>.
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>.
          <source>In Advances in Neural Information Processing Systems</source>
          28,
          <string-name><given-names>C.</given-names> <surname>Cortes</surname></string-name>,
          <string-name><given-names>N. D.</given-names> <surname>Lawrence</surname></string-name>,
          <string-name><given-names>D. D.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Sugiyama</surname></string-name>, and
          <string-name><given-names>R.</given-names> <surname>Garnett</surname></string-name>
          (Eds.). Curran Associates, Inc.,
          <fpage>91</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
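          [8]
          <string-name><given-names>Ramon</given-names> <surname>Sanabria</surname></string-name>,
          <string-name><given-names>Ozan</given-names> <surname>Caglayan</surname></string-name>,
          <string-name><given-names>Shruti</given-names> <surname>Palaskar</surname></string-name>,
          <string-name><given-names>Desmond</given-names> <surname>Elliott</surname></string-name>,
          <string-name><given-names>Loïc</given-names> <surname>Barrault</surname></string-name>,
          <string-name><given-names>Lucia</given-names> <surname>Specia</surname></string-name>, and
          <string-name><given-names>Florian</given-names> <surname>Metze</surname></string-name>.
          <year>2018</year>.
          <article-title>How2: A Large-scale Dataset for Multimodal Language Understanding</article-title>.
          <source>CoRR</source>
          abs/1811.00347 (2018). arXiv:1811.00347 http://arxiv.org/abs/1811.00347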
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Ray</given-names> <surname>Smith</surname></string-name>
          and
          <collab>Google Inc</collab>.
          <year>2007</year>.
          <article-title>An overview of the Tesseract OCR Engine</article-title>.
          <source>In Proc. 9th IEEE Intl. Conf. on Document Analysis and Recognition (ICDAR)</source>.
          <fpage>629</fpage>-<lpage>633</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>Florian</given-names> <surname>Metze</surname></string-name>,
          <string-name><given-names>Yasufumi</given-names> <surname>Moriya</surname></string-name>,
          <string-name><given-names>Ramon</given-names> <surname>Sanabria</surname></string-name>, and
          <string-name><given-names>Gareth J. F.</given-names> <surname>Jones</surname></string-name>.
          <year>2019</year>.
          <article-title>Eyes and Ears Together Task at MediaEval 2019</article-title>.
          <source>MediaEval 2019</source>
          (2019).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>