<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HCMUS at MediaEval 2020: Image-Text Fusion for Automatic News-Images Re-Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Thuc Nguyen-Quang</string-name>
          <email>nqthuc@apcs.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tuan-Duy H. Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thang-Long Nguyen-Ho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anh-Kiet Duong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhat Hoang-Xuan</string-name>
          <email>hxnhat@selab.hcmus.edu.vn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinh-Thuyen Nguyen-Truong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hai-Dang Nguyen</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minh-Triet Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>John von Neumann Institute</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Science</institution>
          ,
          <addr-line>VNU-HCM</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh city</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>Matching text and images based on their semantics plays an important role in cross-media retrieval. In news especially, the connection between text and images is highly ambiguous. In the context of the MediaEval 2020 Challenge, we propose three multi-modal methods for mapping the text and images of news articles to a shared space in order to perform efficient cross-retrieval. Our methods show systematic improvement and validate our hypotheses, while the best-performing method reaches a recall@100 score of 0.2064.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        News articles represent a complex class of multimedia, whose textual
content and accompanying images might not be explicitly related
[
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Existing research in the multimedia and recommendation-system
domains mostly investigates image-text pairs with simple
relationships, e.g., image captions that literally describe components of the
images [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. To address this, the MediaEval 2020 NewsImages Task
calls for researchers to investigate the real-world relationship of
news text and images in more depth, in order to understand its
implications for journalism and news recommendation systems [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>Our team at HCMUS responds to this call by addressing the
Image-Text Re-Matching task. Particularly, given a set of image-text pairs
in the wild, the task requires us to correctly re-assign images to their
decoupled articles, with the aim of understanding how journalism
chooses illustrative images.</p>
      <p>Our methods mainly concern fusing cross-modal embeddings for
automatic matching. We experimented with a range of embedded
information, including simple set intersection, deep neural features,
and knowledge-graph-enhanced neural features, and we combine these
features in various ways across our experiments. Finally, we obtain
our best result with an ensemble of the experimented methods.</p>
    </sec>
    <sec id="sec-2">
      <title>2 METHODS</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Metric Learning</title>
      <p>
        The primary idea of this baseline method is using metric learning to
project embeddings of image-text pairs to bases of significant
similarity. Particularly, we use two approaches to embed image features:
global context embedding and local context embedding. In the first
approach, we use EfficientNet [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ], a state-of-the-art classification
architecture, to extract image features before flattening the output.
Our motivation in the latter approach is to harness critical
local information to complement the extracted global context. Thus, we use the
bottom-up-attention model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to extract the top-k objects based on
their confidence score, before passing them over to a self-attention
sequential model. For both routines, we employ BERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] language
model to embed textual content, then project the textual and image
embeddings onto a shared space trained with a triplet network [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
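      <p>For illustration, below is a minimal PyTorch sketch of this projection step under a triplet objective; the feature dimensions, margin value, and single-layer projection heads are our assumptions for the sketch, not the exact training setup.</p>
      <preformat>
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Projects one modality onto the shared space (a single linear
    layer is an assumption for this sketch)."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        return F.normalize(self.fc(x), dim=-1)  # unit-norm embeddings

text_proj = ProjectionHead(768)    # BERT sentence features
image_proj = ProjectionHead(2048)  # flattened EfficientNet features
loss_fn = nn.TripletMarginLoss(margin=0.2)  # margin is hypothetical

def training_step(text_feat, matched_img_feat, mismatched_img_feat):
    # Pull matching text-image pairs together, push mismatched pairs apart.
    return loss_fn(text_proj(text_feat),
                   image_proj(matched_img_feat),
                   image_proj(mismatched_img_feat))
      </preformat>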
    </sec>
    <sec id="sec-4">
      <title>2.2 Image-Text Matching via Categorization</title>
      <p>
        In this method, we train two gradient boosting decision trees [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ],
one for categorizing images, and the other for categorizing
articles. The target categories are [’nrw’, ’kultur’, ’region’, ’panorama’,
’sport’, ’wirtschaft’, ’koeln’, ’ratgeber’, ’politik’, ’unknown’], which
are deduced from URLs in the train set.
      </p>
      <p>
        We use features extracted for images and text to train the decision
tree. To augment the data, we use VGG16, InceptionResNetV2,
MobileNetV2, EfficientNetB1-7, Xception, ResNet152V2, NASNetLarge,
DenseNet201 [
        <xref ref-type="bibr" rid="ref10 ref14 ref17 ref27 ref28 ref29 ref30 ref32">10, 14, 17, 27–30, 32</xref>
        ] for images, while using
pretrained BERT models [
        <xref ref-type="bibr" rid="ref11 ref2 ref8 ref9">2, 8, 9, 11</xref>
        ], and pretrained ELECTRA models
[
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ] to extract contextual features.
      </p>
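      <p>
        As a sketch, training one such categorizer with LightGBM [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] might look as follows; the multiclass objective and the feature layout are our assumptions.
      </p>
      <preformat>
import lightgbm as lgb

CATEGORIES = ['nrw', 'kultur', 'region', 'panorama', 'sport',
              'wirtschaft', 'koeln', 'ratgeber', 'politik', 'unknown']

def train_categorizer(features, category_indices):
    """features: stacked CNN (images) or BERT/ELECTRA (text) vectors;
    category_indices: integer labels into CATEGORIES, deduced from URLs."""
    params = dict(objective='multiclass', num_class=len(CATEGORIES),
                  verbosity=-1)
    return lgb.train(params, lgb.Dataset(features, label=category_indices))
      </preformat>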
      <p>We presume that images and articles of the same category might
have some relations. Moreover, the rank of matching categories also
affects ranking. For example, an image-text pair sharing a 3rd-ranked
category might be less relevant than a pair sharing a 1st-ranked
category. Hence, instead of using Jaccard similarity, we propose
an iterative ranking method that takes into account the order of
matched categories. At the i-th iteration, our method first finds the
top-i categories for each image and the top-i categories for each article.
Then, for each article, we create a list of candidate images whose
top-i categories intersect those of the article. This list of candidates
at the i-th iteration is concatenated to the final list. Finally, the
remaining images that are not candidates are kept in their original order and
concatenated to the end of the final list, as in the sketch below.</p>
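      <p>A minimal sketch of the iterative ranking; top_categories is a hypothetical helper that returns the i highest-scoring categories from the corresponding categorizer.</p>
      <preformat>
def iterative_rank(article, images, top_categories, max_iter=5):
    """Rank candidate images for one article by order of matched categories."""
    ranked, seen = [], set()
    for i in range(1, max_iter + 1):
        article_cats = set(top_categories(article, i))
        for img in images:
            if img in seen:
                continue
            # Candidate if its top-i categories intersect the article's.
            if article_cats.intersection(top_categories(img, i)):
                ranked.append(img)
                seen.add(img)
    # Remaining images keep their original order at the end of the list.
    ranked.extend(img for img in images if img not in seen)
    return ranked
      </preformat>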
    </sec>
    <sec id="sec-5">
      <title>2.3 Graph-based Face-Name Matching</title>
      <p>Based on our observation that, in many instances, the publisher uses a
portrait of somebody mentioned in the text, we build a face-name
graph to represent the relation between names and faces.</p>
      <p>
        Person name extraction: To automatically extract people’s names
from the text, we use entity-fishing [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] – an open-source,
high-performance entity recognition and disambiguation tool. It relies on
Random Forest and Gradient Tree Boosting to recognize named
entities, in our case people’s names, and link them against Wikidata
entities using their word embeddings and Wikidata entities’ embeddings.
      </p>
      <p>
        Face encoding: We use the open-source face_recognition library [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]
to detect faces and represent each as a 128-dimensional vector. The tool uses a
pre-trained model from the dlib-models repository [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] and chooses
ResNet as the backbone for face feature extraction.
      </p>
      <p>Using the train set, we connect each person mentioned in the
articles with features extracted from accompanying faces. During
testing, we encode the faces in the image and aggregate the number
of matched faces connected to the people mentioned in the text. Two
faces are matched if the ℓ2-distance between their vectors is less than 0.6.
Images are then ranked by this total number of matches, as in the sketch below.</p>
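      <p>
        A minimal sketch of this matching step with the face_recognition library [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; name_to_encodings stands for the face-name graph built from the train set, and names_in_text for the names extracted with entity-fishing.
      </p>
      <preformat>
import numpy as np
import face_recognition

def count_matched_faces(image_path, names_in_text, name_to_encodings):
    """Count matches between faces in the image and people named in the text."""
    image = face_recognition.load_image_file(image_path)
    total = 0
    for enc in face_recognition.face_encodings(image):  # 128-d vector per face
        for name in names_in_text:
            for known in name_to_encodings.get(name, []):
                if np.linalg.norm(enc - known) &lt; 0.6:  # l2-distance threshold
                    total += 1
    return total
      </preformat>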
    </sec>
    <sec id="sec-6">
      <title>Image-Text Fusion with Image</title>
    </sec>
    <sec id="sec-7">
      <title>Captioning and Contextual Embeddings</title>
      <p>
        Based on the hypothesis that the description of the image is
semantically similar to the article title, we build an image captioning model
inspired by the tutorial Image Captioning with Visual Attention [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
The model has three main parts:
• Image feature extractor: We use EfficientNet [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] for feature
extraction; the features have shape (8, 8, 2048).
• Feature encoder: The features pass through a fully connected layer,
giving a 256-dimensional vector.
• Decoder: To generate the caption, we use Bahdanau attention [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and a GRU to predict the next word.
      </p>
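      <p>
        For illustration, a minimal TensorFlow sketch of the feature encoder, following the structure of the tutorial [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]; the ReLU activation is an assumption.
      </p>
      <preformat>
import tensorflow as tf

class FeatureEncoder(tf.keras.Model):
    """Projects (8, 8, 2048) EfficientNet features, flattened to (64, 2048),
    down to 256 dimensions for the attention decoder."""
    def __init__(self, embedding_dim=256):
        super().__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim, activation='relu')

    def call(self, x):     # x: (batch, 64, 2048)
        return self.fc(x)  # (batch, 64, 256)
      </preformat>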
      <p>
        We merge the train set with the Flickr and COCO datasets for training. We use
fuzzywuzzy full-ratio and partial-ratio string matching to compare
captions and article titles. To represent the caption and the title as
vectors, we use RoBERTa and the doc2vec [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] models enwiki_dbow and apnews_dbow.
Then, we calculate the similarity of two vectors by cosine similarity.
The final score is calculated by:
      </p>
      <p>total = wiki + apnews + RoBERTa + (1 − fuzzy) + (1 − partial),
where wiki, apnews, and RoBERTa are the cosine similarities of the vector pairs
generated by enwiki_dbow, apnews_dbow, and RoBERTa, and fuzzy and partial
are the fuzzywuzzy full and partial ratios, respectively.</p>
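      <p>A minimal sketch of this scoring; embed is a hypothetical wrapper around the doc2vec and RoBERTa encoders, and rescaling the fuzzywuzzy ratios from [0, 100] to [0, 1] is our assumption.</p>
      <preformat>
import numpy as np
from fuzzywuzzy import fuzz

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def total_score(caption, title, embed):
    wiki = cosine(embed('enwiki_dbow', caption), embed('enwiki_dbow', title))
    apnews = cosine(embed('apnews_dbow', caption), embed('apnews_dbow', title))
    roberta = cosine(embed('roberta', caption), embed('roberta', title))
    fuzzy = fuzz.ratio(caption, title) / 100.0            # full-string ratio
    partial = fuzz.partial_ratio(caption, title) / 100.0  # partial ratio
    return wiki + apnews + roberta + (1 - fuzzy) + (1 - partial)
      </preformat>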
    </sec>
    <sec id="sec-8">
      <title>Image-Text Fusion with Knowledge</title>
    </sec>
    <sec id="sec-9">
      <title>Graph-based Contextual Embeddings</title>
      <p>We observe that image-text pairs may not have any explicit
relationships. Yet, such text-image pairs could still be remotely related through
layers of abstraction. For example, an article about violence could
feature a stock photo of a gun barrel. Although such a stock photo
does not literally illustrate the textual content, we understand that
a gun conveys a sense of threat, which, in turn, is related to violence.</p>
      <p>
        Thus, we consider exploiting knowledge graphs. On a knowledge
graph, such as BabelNet [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], the concept node of gun is also remotely
connected with violence through intermediate nodes. Thus, we
hypothesize that the projections of the textual and visual content of a
news article onto a knowledge graph would be connected, and that their
embeddings, in turn, would lie in close proximity.
      </p>
      <p>
        To implement this projection, we use the EWISER word sense
disambiguator [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] to link textual entities from texts to their synsets
in the WordNet subset of BabelNet. Then, the mean of the
SensEmBERT+LMMS embeddings corresponding to these
extracted synsets represents the text. For the images, we first map
images to the textual domain. To enhance the method by
featuring abstract human-level concepts in the mapping, we
use TResNet-L with Asymmetric Loss (ASL) [
        <xref ref-type="bibr" rid="ref26 ref5">5, 26</xref>
        ] pre-trained on
OpenImagesV6 [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] to extract multiple labels from each image. Our decision
is grounded in the fact that OpenImagesV6 features image-level labels
conforming to the Freebase [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] knowledge graph, including figurative labels, e.g.,
festivals, sport, comedy, etc., while TResNet-L with ASL is the
state-of-the-art method on the OpenImagesV6 multi-label benchmark. The
extracted lists of labels are also linked to synsets using EWISER,
and the mean of these synset embedding vectors represents the images.
      </p>
      <p>We then train a canonical correlation analysis (CCA) module with
the vector representations of the train set before using it to transform
the test-set vectors. To measure relatedness, for each test article,
we rank all images in the test set by the ℓ2-distance between the
article vector and the image vectors, as in the sketch below.</p>
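      <p>A minimal sketch with scikit-learn's CCA, assuming row-aligned matrices of mean synset embeddings for articles and images.</p>
      <preformat>
import numpy as np
from sklearn.cross_decomposition import CCA

def rank_test_images(train_text, train_image, test_text, test_image):
    """Fit CCA on train pairs, then rank all test images per test article."""
    cca = CCA(n_components=64)  # 64 components, as in our KG-Fusion run
    cca.fit(train_text, train_image)
    text_c, image_c = cca.transform(test_text, test_image)
    rankings = []
    for article_vec in text_c:
        dists = np.linalg.norm(image_c - article_vec, axis=1)  # l2-distance
        rankings.append(np.argsort(dists))  # image indices, closest first
    return rankings
      </preformat>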
    </sec>
    <sec id="sec-10">
      <title>3 EXPERIMENTAL RESULTS</title>
    </sec>
    <sec id="sec-11">
      <title>3.1 Data preprocessing</title>
      <p>The MediaEval 2020 Image-Text Re-Matching benchmark releases
three batches of data in total, consisting of the ledes and titles of German
news articles and their accompanying images. The first two batches are used
for training, and the last one for testing.</p>
      <p>For the sake of manual verification, we decide to translate all the
text to English using Google Translate and employ this translated
text in our experiments. All data batches are cleaned automatically,
with images crawled from the given URLs and pairs whose URLs return
404 Not Found dropped from the train set, as in the sketch below.</p>
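      <p>For illustration, a minimal sketch of the cleaning step that drops 404 pairs; the function name and the timeout are hypothetical.</p>
      <preformat>
import requests

def fetch_image(url, out_path):
    """Download an image; return False for 404 pairs, which are dropped."""
    response = requests.get(url, timeout=10)
    if response.status_code == 404:
        return False
    response.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(response.content)
    return True
      </preformat>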
    </sec>
    <sec id="sec-12">
      <title>3.2 Submissions</title>
      <p>First, TripletLocal and TripletGlobal demonstrate the local- and
global-context approaches of Section 2.1, respectively. In both submissions,
we empirically choose k = 30 to embed images with their top-k objects,
then sort the candidate images for each article by the similarity of their
embeddings to that of the article.</p>
      <p>The Group-Face&amp;Cap submission, meanwhile, combines three
different methods. First, we match image-article pairs using the
method in Section 2.2 with k = 5. However, at each iteration, we
sort the candidates by the total score defined in Section 2.4. Finally,
candidate images matched with the article through the method in Section
2.3 are promoted to the top of the final result.</p>
      <p>The KG-Fusion submission manifests the method described in
Section 2.5. Specifically, the TResNet-L with ASL model used for
multi-label extraction uses a sigmoid threshold of 0.7, the EWISER
disambiguator consumes chunks of 5 tokens, and the target
decomposition of the CCA module has 64 components.</p>
      <p>Finally, the Ensemble submission combines all described methods,
weighting each model based on its efficiency. As such, the final
ranking score of a candidate image is:</p>
      <p>rank(Ensemble) = w1 · rank(Caption) + w2 · rank(Triplet) + w3 · rank(Face) + w4 · rank(KG-Fusion),
where rank(Caption), rank(Triplet), rank(Face), and rank(KG-Fusion) are the ranks of
the image produced by the Group-Face&amp;Cap, TripletGlobal, Face-Name
Matching, and KG-Fusion methods, respectively. Weighting factors are
empirically chosen to be w1 = w4 = 1, w2 = 0.02, and w3 = 0.25.</p>
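      <p>A minimal sketch of this weighted rank fusion; each argument is assumed to map an image identifier to its rank under the corresponding method.</p>
      <preformat>
def ensemble_rank(caption_rank, triplet_rank, face_rank, kg_rank,
                  w1=1.0, w2=0.02, w3=0.25, w4=1.0):
    """Combine per-method ranks with the empirically chosen weights."""
    score = {img: w1 * caption_rank[img] + w2 * triplet_rank[img]
                  + w3 * face_rank[img] + w4 * kg_rank[img]
             for img in caption_rank}
    # A lower combined score yields a better final rank.
    return sorted(score, key=score.get)
      </preformat>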
    </sec>
    <sec id="sec-13">
      <title>4 CONCLUSION AND FUTURE WORKS</title>
      <p>Although our methods show poor accuracy, they systematically
increase performance on the recall@100 metric. This validates
our hypothesis that incorporating high-level semantics increases
performance. Moreover, our methods yield consistent results, i.e.,
high-ranking images are relevant to the queried articles. Thus, they
can still be useful for building news-image recommendation systems,
as news-image suitability is not injective in practice. The
ensemble method's performance also suggests that practical system builders
should use multiple methods to handle different aspects of the complex
image-text multimodal relation. In future work, we wish to
investigate better fusion methods, conduct a thorough ablation study of the
proposed methods, and enhance the dataset for thorough evaluation
with information retrieval metrics such as NDCG.</p>
      <p>Acknowledgments: This research is supported by the Vingroup
Innovation Foundation (VINIF) under project code VINIF.2019.DA19.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <fpage>2020</fpage>
          . Model from https://huggingface.co/german-nlp-group/electra-base-german-uncased. (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <fpage>2020</fpage>
          . Model from https://huggingface.co/
          <article-title>T-Systems-onsite/bert-german-dbmdz-uncased-sentence-stsb</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Anderson</surname>
          </string-name>
          , Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould,
          <string-name>
            <given-names>and Lei</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Bottom-Up and Top-Down Attention for Image Captioning and VQA</article-title>
          .
          <source>CoRR abs/1707</source>
          .07998 (
          <year>2017</year>
          ). arXiv:
          <volume>1707</volume>
          .07998 http://arxiv.org/abs/1707.07998
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          . (
          <year>2016</year>
          ).
          <source>arXiv:cs.CL/1409.0473</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Emanuel</given-names>
            <surname>Ben-Baruch</surname>
          </string-name>
          , Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, and
          <string-name>
            <surname>Lihi</surname>
          </string-name>
          Zelnik-Manor.
          <year>2020</year>
          .
          <article-title>Asymmetric Loss For Multi-Label Classification</article-title>
          . arXiv preprint arXiv:
          <year>2009</year>
          .
          <volume>14119</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michele</given-names>
            <surname>Bevilacqua</surname>
          </string-name>
          and
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Breaking through the 80% glass ceiling: Raising the state of the art in Word Sense Disambiguation by incorporating knowledge graph information</article-title>
          .
          <source>In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          .
          <fpage>2854</fpage>
          -
          <lpage>2864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Kurt</given-names>
            <surname>Bollacker</surname>
          </string-name>
          , Colin Evans, Praveen Paritosh, Tim Sturge, and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Freebase: a collaboratively created graph database for structuring human knowledge</article-title>
          .
          <source>In Proceedings of the 2008 ACM SIGMOD international conference on Management of data. 1247-1250.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Malte</given-names>
            <surname>Pietsch Tanay Soni Branden Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Timo</given-names>
            <surname>Möller</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Model from https://huggingface.co/bert-base-german-cased</article-title>
          . (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Branden</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stefan</given-names>
            <surname>Schweter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Timo</given-names>
            <surname>Möller</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>German's Next Language Model</article-title>
          . (
          <year>2020</year>
          ).
          <article-title>arXiv:cs</article-title>
          .CL/
          <year>2010</year>
          .10906
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>François</given-names>
            <surname>Chollet</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Xception: Deep learning with depthwise separable convolutions</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>1251</volume>
          -
          <fpage>1258</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <fpage>dbmdz</fpage>
          .
          <year>2020</year>
          . Model from https://huggingface.co/dbmdz/bert-base-german-uncased. (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Jacob</surname>
            <given-names>Devlin</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ming-Wei</surname>
            <given-names>Chang</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>Kenton</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Kristina</given-names>
            <surname>Toutanova</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          . (
          <year>2019</year>
          ).
          <article-title>arXiv:cs</article-title>
          .CL/
          <year>1810</year>
          .04805
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Adam</given-names>
            <surname>Geitgey</surname>
          </string-name>
          .
          <year>2018</year>
          . Face Recognition. (
          <year>2018</year>
          ). https://github.com/ageitgey/face_recognition
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Kaiming</surname>
            <given-names>He</given-names>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Identity mappings in deep residual networks</article-title>
          .
          <source>In European conference on computer vision</source>
          . Springer,
          <fpage>630</fpage>
          -
          <lpage>645</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Elad</given-names>
            <surname>Hoffer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nir</given-names>
            <surname>Ailon</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Deep metric learning using Triplet network</article-title>
          . (
          <year>2018</year>
          ).
          <source>arXiv:cs.LG/1412.6622</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>MD</given-names>
            <surname>Zakir Hossain</surname>
          </string-name>
          , Ferdous Sohel, Mohd Fairuz Shiratuddin, and
          <string-name>
            <given-names>Hamid</given-names>
            <surname>Laga</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A comprehensive survey of deep learning for image captioning</article-title>
          .
          <source>ACM Computing Surveys (CSUR) 51</source>
          ,
          <issue>6</issue>
          (
          <year>2019</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Gao</surname>
            <given-names>Huang</given-names>
          </string-name>
          , Zhuang Liu,
          <string-name>
            <surname>Laurens Van Der Maaten</surname>
          </string-name>
          , and
          <string-name>
            <surname>Kilian Q Weinberger</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Densely connected convolutional networks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>4700</volume>
          -
          <fpage>4708</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Guolin</surname>
            <given-names>Ke</given-names>
          </string-name>
          , Qi Meng, Thomas Finley, Taifeng Wang,
          <string-name>
            <surname>Wei</surname>
            <given-names>Chen</given-names>
          </string-name>
          , Weidong Ma, Qiwei Ye, and
          <string-name>
            <surname>Tie-Yan Liu</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>LightGBM: A Highly Efficient Gradient Boosting Decision Tree</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , I. Guyon,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          , and R.
          <source>Garnett (Eds.)</source>
          , Vol.
          <volume>30</volume>
          . Curran Associates, Inc.,
          <fpage>3146</fpage>
          -
          <lpage>3154</lpage>
          . https://proceedings.neurips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Benjamin</surname>
            <given-names>Kille</given-names>
          </string-name>
          , Andreas Lommatzsch, and
          <string-name>
            <given-names>Özlem</given-names>
            <surname>Özgöbek</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>News Images in MediaEval 2020</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop</source>
          . Online.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Davis</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>King</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>dlib-models</article-title>
          . (
          <year>2018</year>
          ). https://github.com/davisking/dlib-models
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Alina</surname>
            <given-names>Kuznetsova</given-names>
          </string-name>
          , Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and
          <string-name>
            <given-names>Vittorio</given-names>
            <surname>Ferrari</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale</article-title>
          .
          <source>IJCV</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] Jey Han Lau and
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Baldwin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation</article-title>
          . (
          <year>2016</year>
          ).
          <source>arXiv:cs.CL/1607.05368</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Patrice</given-names>
            <surname>Lopez</surname>
          </string-name>
          .
          <year>2020</year>
          . Entity Fishing. (
          <year>2020</year>
          ). https://github.com/kermitt2/entity-fishing
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Roberto</given-names>
            <surname>Navigli</surname>
          </string-name>
          and Simone Paolo Ponzetto.
          <year>2010</year>
          .
          <article-title>BabelNet: Building a very large multilingual semantic network</article-title>
          .
          <source>In Proceedings of the 48th annual meeting of the association for computational linguistics</source>
          .
          <volume>216</volume>
          -
          <fpage>225</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>NHJ</given-names>
            <surname>Oostdijk</surname>
          </string-name>
          , H van Halteren,
          <source>Erkan Basar, and Martha A Larson</source>
          .
          <year>2020</year>
          .
          <article-title>The Connection between the Text and Images of News Articles: New Insights for Multimedia Analysis</article-title>
          .
          <article-title>(</article-title>
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <surname>Tal</surname>
            <given-names>Ridnik</given-names>
          </string-name>
          , Hussam Lawen, Asaf Noy, and
          <string-name>
            <given-names>Itamar</given-names>
            <surname>Friedman</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>TResNet: High Performance GPU-Dedicated Architecture</article-title>
          . arXiv preprint arXiv:
          <year>2003</year>
          .
          <volume>13630</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          , Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and
          <string-name>
            <surname>Liang-Chieh Chen</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Mobilenetv2: Inverted residuals and linear bottlenecks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>4510</volume>
          -
          <fpage>4520</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Christian</surname>
            <given-names>Szegedy</given-names>
          </string-name>
          , Sergey Ioffe, Vincent Vanhoucke, and
          <string-name>
            <given-names>Alex</given-names>
            <surname>Alemi</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          .
          <source>arXiv preprint arXiv:1602.07261</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Mingxing</given-names>
            <surname>Tan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          . (
          <year>2020</year>
          ).
          <article-title>arXiv:cs</article-title>
          .LG/
          <year>1905</year>
          .11946
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Kelvin</surname>
            <given-names>Xu</given-names>
          </string-name>
          , Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. (</article-title>
          <year>2016</year>
          ).
          <source>arXiv:cs.LG/1502.03044</source>
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Barret</surname>
            <given-names>Zoph</given-names>
          </string-name>
          , Vijay Vasudevan, Jonathon Shlens, and
          <string-name>
            <surname>Quoc</surname>
            <given-names>V</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learning transferable architectures for scalable image recognition</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>8697</volume>
          -
          <fpage>8710</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>