<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Lviv, Ukraine, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Image Recommendation for Wikipedia Articles</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleh Onyshchak</string-name>
          <email>o.onyshchak@ucu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miriam Redi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ukrainian Catholic University</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Wikimedia Foundation</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>Multimodal learning, that is, simultaneous learning from different data sources such as audio, text, and images, is a rapidly emerging field of machine learning. It is also considered to be machine learning at a higher level of abstraction. It makes it possible to tackle more complicated problems, such as creating cartoons from a plot or recognizing speech from lip movement. In this paper, we propose to investigate whether state-of-the-art multimodal learning techniques can solve the problem of recommending the most relevant images for a Wikipedia article. In other words, we need to create a shared text-image representation of the abstract notion that an article describes, so that, given only a text description, a machine would “understand” which images visualize the same notion accurately.</p>
      </abstract>
      <kwd-group>
        <kwd>multimodal learning</kwd>
        <kwd>text-image similarity</kwd>
        <kwd>image recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Every day we perceive the world around us through multiple senses such as
sight, smell, hearing, touch, and taste. Moreover, our ability to consolidate the
information from these different sources into one complete picture helps us
understand the world comprehensively.</p>
      <p>With the trend toward digitization over the last few decades, more and more information is
recorded in different kinds of media such as audio, images, video, text, and 3D models.
This has also created new challenges in efficiently processing significant amounts of
recorded information, where substantial progress has already been made. However, every
type of digital medium captures only a subset of the available information. For example,
imagery captures only visual appearance, while audio captures only sound, just as our eyes and
ears do. Thus, all scientific progress in processing a given data carrier is bounded by
what that medium can capture.</p>
      <p>In other words, to represent a dog digitally, we have to have more than just a visual
representation. Similar to humans, we need to combine all the information streams,
which describe the entity from different perspectives, into one comprehensive
representation.</p>
      <p>
        That is the motivation for multimodal representation learning, which aims to
combine different types of data into a complete representation of a real-world entity. In this
context, the word “modality” refers to a particular way of encoding information. Thus,
a problem in the domain of, for example, image processing is called unimodal, while a
problem involving multiple information encodings, for example image-to-caption
generation, is called multimodal, since it works with both image and text modalities [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        By having a complete representation of an entity, created from multimodal
data that captures complementary or supplementary subsets of information about an object, we
gain a more comprehensive computational “understanding” of that entity. That helps us
increase the precision of existing data science applications and extend their limits to
more abstract problems, such as not only identifying the objects in an image but also understanding
their value. For example [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], early research on speech recognition showed that
adding the visual modality of lip movement on top of the sound modality provides extra
information that improves the quality of voice recognition, just as it
does for humans [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>In this project, we are going to research possible approaches to the problem of image
recommendation for Wikipedia articles, which is part of the multimodal
representation learning domain. That is, based on the article text, we need to recommend
images depicting the entity described in the article. In other words, we need to create
a high-level representation of an entity described by both text and images.
Furthermore, we are interested in finding out which image representation of the notion is best
suited to a given text description.</p>
      <p>Within the scope of this project, we are going to explore state-of-the-art techniques of
multimodal representation learning and analyze whether they can be applied to
this problem. We believe this project will be valuable from both research and
application perspectives.</p>
      <p>This paper presents a Master's thesis project proposal, which formally defines
the problem, provides a rigorous overview of the state-of-the-art approaches in the
multimodal representation learning domain, specifies the goals of the project, suggests an
approach to the envisioned solution, and provides a time plan for the thesis.</p>
    </sec>
    <sec id="sec-2">
      <title>Motivation</title>
      <p>Wikipedia is the largest collection of human knowledge, containing more than 35
million pages and receiving nearly 9 billion views per month<sup>1</sup>. It is continually growing,
with more than 500 new pages per day<sup>2</sup>, and all of that only in its English version.</p>
      <p>
        As part of the Wikimedia movement's 2030 strategy, one of the key goals is to break down any barriers to
accessing free information<sup>3</sup>. Researching ways to recommend images to
Wikipedia editors automatically will help achieve better media enrichment of
articles, which in turn will make information easier and faster to comprehend [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Automation would also help reduce the time and effort spent searching for and adding
appropriate article visualizations.
      </p>
      <p>In addition to making Wikipedia better, this work might offer
useful insights for the development of the multimodal learning field, because
(1) it is a real-world problem, which might show how to apply and
adjust current academic progress; and (2) it has a more complicated problem setting
of one large article corresponding to multiple images, instead of the simpler
one-to-one correspondence between images and their tags or descriptive sentences.</p>
      <p><sup>1</sup> https://stats.wikimedia.org/v2/#/en.wikipedia.org</p>
      <p><sup>2</sup> https://en.wikipedia.org/wiki/Wikipedia:Statistics</p>
      <p><sup>3</sup> https://meta.wikimedia.org/wiki/Strategy/Wikimedia_movement/2017/Direction</p>
    </sec>
    <sec id="sec-3">
      <title>Problem</title>
      <p>
        We are going to research how state-of-the-art multimodal learning techniques
perform on the task of recommending images for Wikipedia articles. In other words, given
a text with wiki formatting, we need to rank images from the Wikimedia Commons
database [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] by relevance.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Data</title>
      <p>
        All data is publicly available on Wikipedia. Specifically, there are more than 35 million
Wikipedia pages, a fair share of which are enriched with images. There is also the
Commons image dataset [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], containing more than 55 million images<sup>4</sup>. This is the
real-world data to which the solution should ultimately be applied.
      </p>
      <p>In our research, we will use a reliable subset of this data for training. In
particular, Wikipedia has a notion of featured articles<sup>5</sup>, which are the best articles,
with high-quality text and many supporting visualizations. In other words, it is a manually curated,
high-quality dataset of more than 5000 articles, each with multiple associated images.
Nevertheless, it still requires proper preprocessing and cleaning
before use.</p>
      <p>By text we mean the entire textual content of an article, cleared of
Wikipedia formatting, along with extra metadata such as categories or the title. Images are
also collected with additional metadata such as filenames or descriptions. More details
can be found on Kaggle<sup>6</sup>.</p>
      <p><sup>4</sup> https://en.wikipedia.org/wiki/Wikimedia_Commons</p>
      <p><sup>5</sup> https://en.wikipedia.org/wiki/Wikipedia:Featured_articles</p>
      <p><sup>6</sup> https://www.kaggle.com/jacksoncrow/extended-wikipedia-multimodal-dataset</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>
        In the last decades, there was much progress in the field of unimodal representation, while
research in multimodal learning was limited to simple concatenation of unimodal
features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However, in recent years, the scientific landscape in this domain has been
rapidly evolving [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One of the triggers for it was the success of deep learning models,
which have a powerful representation ability with multiple levels of abstractions.
Therefore, these models were incorporated in multimodal learning. As Guo et al.
suggested [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we can divide all multimodal learning approaches into three categories:
        <list list-type="order">
          <list-item><p>Joint representation, which aims to integrate modality-specific features into some
common space.</p></list-item>
          <list-item><p>Coordinated representation, which aims to preserve modality-specific features
while introducing a space in which to measure multimodal similarities.</p></list-item>
          <list-item><p>Intermediate representation, which aims to encode features of one modality into some
intermediate space, from which the features of another modality are later generated.</p></list-item>
        </list>
In this section, we cover available techniques for extracting features from the text and
image modalities, review available solutions for each type of multimodal learning,
and then summarize their applicability to our problem.
      </p>
      <sec id="sec-5-1">
        <title>Unimodal Representation</title>
        <p>
          Image. The most popular models for extracting features from images are different
types of Convolutional Neural Networks (CNNs), such as AlexNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], VGGNet [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ],
and ResNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. When working with big datasets, it is preferable to use a pre-trained
version of the chosen CNN. This field has developed tremendously in recent years, and
well-defined solutions already exist for most problems.
        </p>
        <p>
          Text. A popular way to extract features from text is to encode it as a vector, as is done
in the word2vec [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] or GloVe [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] algorithms. A common problem with these
approaches, however, is the out-of-vocabulary error, which occurs when a word is not present
in the vocabulary. There is a variety of alternative solutions to this problem, such as
character embeddings [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. A more powerful tool for dealing
with text is the Recurrent Neural Network (RNN) [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which is more context-aware and
can produce a better encoding of the n-th word by knowing what has already appeared in the sentence.
One of the most successful realizations of the RNN is Long Short-Term Memory (LSTM)
[
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Joint Representation</title>
        <p>
          The main idea of joint representation is to integrate multimodal features into a single
input, which is then processed as an artificial unimodal input with well-known
machine learning techniques. More formally, it aims to project unimodal representations
into a shared semantic subspace, where the multimodal features can be fused [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], as
shown in Fig. 1(a). Until recently, this was the primary technique in multimodal learning,
with features fused by simple concatenation. Now, however, the
most popular choice is a distinct hidden layer in which modality-specific features
are combined into a single output vector.
        </p>
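        <p>The two fusion strategies mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the additive form of the hidden layer, the tanh activation, and all weights and feature values are hypothetical and are not taken from any of the cited models.</p>
        <preformat><![CDATA[
```python
import math


def concat_fusion(image_feat, text_feat):
    """Early fusion: place both feature vectors side by side."""
    return list(image_feat) + list(text_feat)


def hidden_layer_fusion(image_feat, text_feat, w_img, w_txt, bias):
    """Fuse two modalities in one hidden layer (assumed additive form):
    z = tanh(W_img * x_img + W_txt * x_txt + b)."""
    fused = []
    for row_img, row_txt, b in zip(w_img, w_txt, bias):
        s = b
        s += sum(w * x for w, x in zip(row_img, image_feat))
        s += sum(w * x for w, x in zip(row_txt, text_feat))
        fused.append(math.tanh(s))
    return fused


# Toy features (hypothetical values).
image_feat = [0.5, 0.1]
text_feat = [0.2, 0.5]

joint_concat = concat_fusion(image_feat, text_feat)

# One hidden unit that reads image_feat[0] and text_feat[1].
joint_hidden = hidden_layer_fusion(
    image_feat, text_feat,
    w_img=[[1.0, 0.0]], w_txt=[[0.0, 1.0]], bias=[0.0],
)
```
]]></preformat>
        <p>In both cases the output is a single vector, so the modality-specific inputs can no longer be recovered from it separately.</p>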
        <p>
          The concatenation approach was historically the first and is still commonly applied
in video classification [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], event detection [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], and visual question answering [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
However, its main disadvantage is that it neglects the fact that different modalities carry not
only supplementary information, which represents the same notion from different
perspectives, but also complementary information, where one modality captures
information that another cannot. For example, lip movement and the audio of
speech are mostly supplementary sources, while images of a bird and audio of its
singing are mostly complementary sources. Because of that, much information gets lost
in the shared space.
        </p>
        <p>
          Although joint representation has the advantages of being simple and producing a
modality-invariant common feature space, it cannot be used to infer separate representations
for each modality [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Thus, the methods from this category are not applicable to our
problem.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>Intermediate Representation</title>
        <p>
          Intermediate representation models aim to encode features of one modality to some
intermediate space, from which later features of another modality can be generated (or
decoded), as shown in Fig. 1(c). To prevent the intermediate space from being related
only to a source modality, during encoder-decoder training we maximize, e.g., the
likelihood of the target sentence given source image, so that the error function employs the
error of decoding. Subsequently, the generated intermediate representation tends to
capture the shared semantics from both modalities [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          An interesting application of this model was proposed by Mor et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], where
the algorithm encodes a musical track into intermediate space, which is then decoded
by multiple decoders into a space of some specific instrument. In other words, the
encoder extracts instrument-invariant generic musical features, which then each decoder
transforms into features of its target instrument.
        </p>
        <p>
          The general advantage of this approach is that it is one of the best ways to generate
new features in a target domain. Thus, this technique is used in image captioning [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ],
video description [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and text-to-image generation [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. The disadvantages of this
model are: (1) it can only encode one modality; (2) the complexity of designing a
feature generator should be taken into account [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]; and (3) the intermediate space extracts only
the shared subspace of the two modalities. Moreover, because we need to query existing
information rather than generate new information, these methods are also not suitable for our
problem.
        </p>
        <p>
          <sup>7</sup> Fig. 1 has been drawn for this paper for illustrative purposes, inspired by [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>Coordinated Representation</title>
        <p>The last type of multimodal learning is coordinated representation. Instead of learning
a joint representation, it learns modality-specific representations separately but
under a shared constraint, namely a loss function that captures cross-modal similarity
or correlation. Since different modalities hold unique information about an object, this
approach operates with all available knowledge. A visual explanation can be seen in
Fig. 1(b). A commonly used constraint is a cross-modal
similarity function, where the learning objective is to preserve both the inter-modality and
intra-modality similarity structure. In other words, it forces the cross-modal distance
between elements with the same semantics to be as small as possible, and between elements
with dissimilar semantics to be as large as possible.</p>
        <p>
          Cross-modal ranking is a widely used constraint, where the loss function is defined
in the following way:
          <disp-formula><label>(1)</label><tex-math>\sum_{v} \sum_{t^{-}} \max\left(0,\ \alpha - S(v, t) + S(v, t^{-})\right) + \sum_{t} \sum_{v^{-}} \max\left(0,\ \alpha - S(t, v) + S(t, v^{-})\right),</tex-math></disp-formula>
where (<italic>v</italic>, <italic>t</italic>) is a matching image-text pair, α is the margin, <italic>S</italic> is a similarity function, <italic>v</italic><sup>−</sup> is
an image mismatching <italic>t</italic>, and vice versa (<italic>t</italic><sup>−</sup> is a text mismatching <italic>v</italic>). Frome et al. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] used a combination of
dot-product similarity and margin rank loss to learn a visual-semantic embedding model
(DeViSE) for visual recognition [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. DeViSE trains deep networks for both image and text
features, and then adjusts the features based on the above-mentioned ranking loss, though
in a simpler form.
        </p>
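        <p>As a minimal sketch of the ranking constraint (1), the Python snippet below sums the hinge terms over all mismatching pairs in a small batch. The function names, the dot-product similarity, the margin value of 0.2, and the toy vectors are illustrative assumptions and are not taken from DeViSE or any other cited model.</p>
        <preformat><![CDATA[
```python
def dot(u, v):
    """Dot-product similarity S(u, v)."""
    return sum(a * b for a, b in zip(u, v))


def ranking_loss(images, texts, margin=0.2, sim=dot):
    """Bidirectional margin ranking loss.

    images[i] and texts[i] are assumed to be a matching pair (v, t);
    every other item in the batch serves as a mismatch (v-, t-).
    """
    loss = 0.0
    n = len(images)
    for i in range(n):
        pos = sim(images[i], texts[i])
        for j in range(n):
            if j == i:
                continue
            # Image as anchor against a mismatching text t-.
            loss += max(0.0, margin - pos + sim(images[i], texts[j]))
            # Text as anchor against a mismatching image v-.
            loss += max(0.0, margin - pos + sim(texts[i], images[j]))
    return loss


# Toy embeddings in a shared two-dimensional space (hypothetical values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = ranking_loss(images, aligned_texts)
loss_swapped = ranking_loss(images, swapped_texts)
```
]]></preformat>
        <p>With well-aligned embeddings every hinge term is zero, whereas swapping the texts makes every mismatch score higher than its matching pair, so the loss becomes positive.</p>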
        <p>
          As an alternative to cross-modal ranking, another widely used constraint is Euclidean
distance, which is also used to ensure that the similarity structure is preserved both within
and across modalities. For inter-modality similarity, we map text and image
features into a low-dimensional space, where we can calculate the distance between
feature vectors. The idea is to ensure that features of the same
semantics from different modalities are as close as possible [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. For intra-modality similarity, we want to preserve the
similarity between neighboring items, that is:
          <disp-formula><label>(2)</label><tex-math>d(x_i, x_j) + m &lt; d(x_i, x_k), \quad \forall x_j \in N(x_i),\ \forall x_k \notin N(x_i),</tex-math></disp-formula>
where <italic>x</italic> is a data point of any modality, <italic>x<sub>i</sub></italic> is a point of interest, <italic>d</italic> is a distance function, <italic>m</italic> is a margin, and <italic>N</italic>(<italic>x<sub>i</sub></italic>) denotes the
neighborhood of <italic>x<sub>i</sub></italic> [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ].
        </p>
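        <p>The neighborhood constraint (2) can be checked directly on a handful of points. The sketch below lists the pairs that violate it for one point of interest; the function name, the margin value, and the toy coordinates are assumptions made for illustration, and Euclidean distance is computed with the standard-library function math.dist (Python 3.8 or later).</p>
        <preformat><![CDATA[
```python
import math


def violations(points, neighbors, i, margin=0.1):
    """Return (j, k) pairs that break d(x_i, x_j) + m < d(x_i, x_k),
    where j is a neighbor of point i and k is not."""
    inside = neighbors[i]
    outside = [k for k in range(len(points)) if k != i and k not in inside]
    bad = []
    for j in sorted(inside):
        d_near = math.dist(points[i], points[j])
        for k in outside:
            d_far = math.dist(points[i], points[k])
            if not d_near + margin < d_far:
                bad.append((j, k))
    return bad


# Toy embedding: point 1 is the only declared neighbor of point 0,
# yet point 3 lies almost as close to point 0 as point 1 does.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.15, 0.0)]
neighbors = {0: {1}}

broken = violations(points, neighbors, i=0)
```
]]></preformat>
        <p>Here point 2 is far enough from point 0 to satisfy the margin, while point 3 is not, so the pair (1, 3) is the one that training would have to push apart.</p>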
        <p>
          Hence, coordinated representation preserves all modality-specific information. It
also explicitly compares features from different modalities; thus, given data from one
modality, we can identify the closest data point from another modality. Because of
these properties, it is used for cross-modal retrieval [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ], retrieval-based visual
description [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], and knowledge transfer across modalities [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. Thus, it can be applied to our
problem of image recommendation for articles, and we will proceed with these
methods.
        </p>
        <p>
          Based on the analysis of the related work, coordinated representation techniques were
identified as the most relevant approach to our problem. The coordinated
representation approach aims to exploit modality-specific features fully; thus, we train the features of each
modality separately.
        </p>
        <p>
          To make the system learn the right features in each modality, we map all of them into a
space where inter-modality similarity can be evaluated while also preserving the
intra-modality similarity structure [
          <xref ref-type="bibr" rid="ref12 ref27 ref28">12, 27, 28</xref>
          ]. We then define the loss function by enforcing the
ranking function (1) in that space to return high values for mismatched modality pairs
and low values otherwise. Each modality-specific
model minimizes this loss function, which drives modality-specific feature learning. This
idea is visualized in Fig. 2.
        </p>
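        <p>Once text and image features live in such a common space, recommendation reduces to ranking images by their similarity to the article. The following sketch assumes cosine similarity and uses hypothetical file names and hand-written three-dimensional embeddings; in practice the vectors would come from the trained modality-specific models.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def rank_images(article_vec, image_vecs):
    """Return image ids ordered from most to least similar to the article."""
    return sorted(
        image_vecs,
        key=lambda image_id: cosine(article_vec, image_vecs[image_id]),
        reverse=True,
    )


# Hypothetical embeddings in the shared space.
article = [0.9, 0.1, 0.0]
images = {
    "cat.jpg": [1.0, 0.0, 0.0],
    "map.png": [0.0, 1.0, 0.0],
    "logo.svg": [0.0, 0.0, 1.0],
}

ranked = rank_images(article, images)
```
]]></preformat>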
        <p>
          We will focus on adapting the recent Word2VisualVec [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and dual encoding [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
models to our broader and more realistic problem setting. They showed impressive
results but were evaluated on a narrower problem. More specifically, they worked
with the Flickr dataset [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], where one image corresponds to 5 descriptive sentences. In
our setting, one article corresponds to multiple images, all of which
have additional metadata such as category, name, and description.
        </p>
        <p>
          This paper is supported by a GitHub repository<sup>9</sup> containing all experiments.
        </p>
        <p>
          <sup>8</sup> Figure 2 is adopted from the GitHub resource
(https://github.com/danieljf24/dual_encoding/blob/master/dual_encoding.jpg) by the author(s) of [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for illustrative purposes.
The resource is freely available for use under the conditions of the Apache License 2.0.
        </p>
        <p>
          <sup>9</sup> https://github.com/OlehOnyshchak/WikiImageRecommendation
        </p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Methodology</title>
      <sec id="sec-6-1">
        <title>Methodological Approach</title>
        <p>
          The hypothesis under test is “it is possible to implement a model to recommend relevant
Commons [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] images for a specific Wikipedia article using multimodal learning
techniques”, which implies a quantitative research approach. The aim is to discover whether
state-of-the-art techniques of multimodal representation learning can solve this specific
problem for Wikipedia with no worse precision.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>Methods of Data Collection</title>
        <p>Existing Wikipedia data will be used to conduct the research. More specifically, we
will use a collection of featured articles<sup>10</sup>, where each page has gone through a thorough
manual review by the Wikipedia community and represents the best Wikipedia
can offer. Thus, it is theoretically the best possible quality for machine learning
algorithms.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Methods of Analysis</title>
        <p>We will select candidate algorithms by analyzing recent literature surveys of the
corresponding domain and choosing the most prominent state-of-the-art approaches
described there. We will also check the most cited approaches to similar problems.
In that way, we can ensure that the state-of-the-art methods in the field are
reviewed and that the most applicable ones are adequately tested.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Evaluation</title>
        <p>
          Since we have a labeled dataset, classical evaluation metrics can be applied.
Currently, the most appropriate approach is the rank-based performance metric [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] P@K
(K = 1, 5, 10), where P is the percentage of articles for which the corresponding images are
found within the top K × N<sub>images</sub> images, where N<sub>images</sub> is the number of images of the
article.
        </p>
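        <p>One possible reading of this metric is sketched below. The text does not state whether all or only some of an article's images must appear within the cutoff, so the sketch exposes this as a flag (require_all, an assumption of this example); the function names and the toy rankings are likewise hypothetical.</p>
        <preformat><![CDATA[
```python
def hit_at_k(ranked_ids, relevant_ids, k, require_all=True):
    """Check whether an article's images fall within the top K * N_images
    ranked images, where N_images is the number of its true images."""
    relevant = set(relevant_ids)
    cutoff = k * len(relevant)
    found = relevant.intersection(ranked_ids[:cutoff])
    return found == relevant if require_all else bool(found)


def p_at_k(rankings, ground_truth, k, require_all=True):
    """Percentage of articles that count as a hit at the given K."""
    hits = sum(
        hit_at_k(rankings[article], ground_truth[article], k, require_all)
        for article in ground_truth
    )
    return 100.0 * hits / len(ground_truth)


# Toy data: ranked candidate images and true images per article.
rankings = {
    "article_a": ["img1", "img7", "img2", "img9"],
    "article_b": ["img5", "img3", "img4", "img8"],
}
ground_truth = {
    "article_a": ["img1", "img2"],
    "article_b": ["img4"],
}

scores = {k: p_at_k(rankings, ground_truth, k) for k in (1, 2, 5)}
```
]]></preformat>
        <p>Under the stricter reading (require_all set to True), an article with two images is a hit at K = 1 only if both are ranked in the top two positions.</p>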
        <p>When scaling up to the size of a real-world image dataset, the evaluation metric will require
additional improvements, such as merging visually similar images among the top-ranked
matches, although this is outside the scope of testing our hypothesis.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Time Plan</title>
      <sec id="sec-7-1">
        <p>The timetable of milestones is given in Table 1.</p>
        <p>Table 1. Timetable of milestones (surviving entry: Date, 10 Sep 2019).</p>
        <p><sup>10</sup> https://en.wikipedia.org/wiki/Wikipedia:Featured_articles</p>
        <p>
          The goal of the project is to research whether it is possible to implement a system
that would recommend relevant Commons [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] images for a specific Wikipedia
article. Thus, we are planning to investigate the scientific landscape in that area and
report whether it can solve our specific problem of image recommendation with the
Wikipedia dataset. We do not expect to create a complete end-to-end solution, but rather to
investigate whether a path towards it is feasible.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep multimodal representation learning: a survey</article-title>
          .
          <source>IEEE Access</source>
          <volume>7</volume>
          ,
          <fpage>63373</fpage>
          -
          <lpage>63394</lpage>
          (
          <year>2019</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/ACCESS.2019.2916887</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>McGurk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacDonald</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Hearing lips and seeing voices</article-title>
          .
          <source>Nature</source>
          <volume>264</volume>
          ,
          <fpage>746</fpage>
          -
          <lpage>748</lpage>
          (
          <year>1976</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1038/264746a0</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Baltrušaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahuja</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L.-P.</given-names>
          </string-name>
          :
          <article-title>Multimodal machine learning: a survey and taxonomy</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>41</volume>
          (
          <issue>2</issue>
          ),
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          (
          <year>2018</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/TPAMI.2018.2798607</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>D'Mello</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kory</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A review and meta-analysis of multimodal affect detection systems</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>47</volume>
          (
          <issue>3</issue>
          ), Article No 43 (
          <year>2015</year>
          ).
          doi:
          <pub-id pub-id-type="doi">10.1145/2682899</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Predicting visual features from text for image and video caption retrieval</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          <volume>20</volume>
          (
          <issue>12</issue>
          ),
          <fpage>3377</fpage>
          -
          <lpage>3388</lpage>
          (
          <year>2018</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1109/TMM.2018.2832602</pub-id>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Dual encoding for zero-example video retrieval</article-title>
          .
          <source>In: 2019 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>9346</fpage>
          -
          <lpage>9355</lpage>
          . IEEE Press, New York (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Plummer</surname>
            ,
            <given-names>B.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cervantes</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Caicedo</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hockenmaier</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models</article-title>
          .
          <source>In: 2015 IEEE International Conference on Computer Vision</source>
          . IEEE Press, New York (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In: 25th International Conference on Neural Information Processing Systems</source>
          . Volume
          <volume>1</volume>
          , pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          . Curran Associates Inc., New York (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . IEEE Press, New York (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vogel</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dickson</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehman</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          :
          <article-title>Persuasion and the role of visual presentation support: the UM/3M study</article-title>
          .
          <source>Minneapolis: Management Information Systems Research Center</source>
          , School of Management, University of Minnesota (
          <year>1986</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Q.-Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.-J.</given-names>
          </string-name>
          :
          <article-title>Deep cross-modal hashing</article-title>
          .
          <source>In: 2017 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>3232</fpage>
          -
          <lpage>3240</lpage>
          . IEEE Press, New York (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
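          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: global vectors for word representation</article-title>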
          .
          <source>In: 2014 ACL Conference on Empirical Methods in Natural Language Processing</source>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          . Association for Computational Linguistics (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jernite</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sontag</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rush</surname>
            ,
            <given-names>A.M.</given-names>
          </string-name>
          :
          <article-title>Character-aware neural language models</article-title>
          .
          <source>In: 30th AAAI Conference on Artificial Intelligence</source>
          , pp.
          <fpage>2741</fpage>
          -
          <lpage>2749</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Elman</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Finding structure in time</article-title>
          .
          <source>Cognitive Science</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <fpage>179</fpage>
          -
          <lpage>221</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.-G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xue</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>S.-F.</given-names>
          </string-name>
          :
          <article-title>Exploiting feature and class relationships in video categorization with regularized deep neural networks</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>40</volume>
          (
          <issue>2</issue>
          ),
          <fpage>352</fpage>
          -
          <lpage>364</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <collab>Wikimedia Commons</collab>
          , Wikimedia. https://commons.wikimedia.org/wiki/Main_Page
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Habibian</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mensink</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Snoek</surname>
            ,
            <given-names>C.G.M.</given-names>
          </string-name>
          :
          <article-title>Video2vec embeddings recognize events when examples are scarce</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>39</volume>
          (
          <issue>10</issue>
          ),
          <fpage>2089</fpage>
          -
          <lpage>2103</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Fukui</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal compact bilinear pooling for visual question answering and visual grounding</article-title>
          .
          <source>arXiv preprint arXiv:1606.01847</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Mor</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polyak</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taigman</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A universal music translation network</article-title>
          .
          <source>arXiv preprint arXiv:1805.07848</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toshev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Show and tell: a neural image caption generator</article-title>
          .
          <source>In: 2015 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          . IEEE Press, New York (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Venugopalan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Donahue</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mooney</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Translating videos to natural language using deep recurrent neural networks</article-title>
          .
          <source>arXiv preprint arXiv:1412.4729</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akata</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Logeswaran</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schiele</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Generative adversarial text to image synthesis</article-title>
          .
          <source>arXiv preprint arXiv:1605.05396</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Multimodal convolutional neural networks for matching image and sentence</article-title>
          .
          <source>In: 2015 IEEE International Conference on Computer Vision</source>
          , pp.
          <fpage>2623</fpage>
          -
          <lpage>2631</lpage>
          . IEEE Press, New York (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Frome</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranzato</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>DeViSE: a deep visual-semantic embedding model</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems</source>
          . Volume
          <volume>26</volume>
          , pp.
          <fpage>2121</fpage>
          -
          <lpage>2129</lpage>
          . Curran Associates, Inc. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          :
          <article-title>Unifying visual semantic embeddings with multimodal neural language models</article-title>
          .
          <source>arXiv preprint arXiv:1411.2539</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          :
          <article-title>Grounded compositional semantics for finding and describing images with sentences</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>2</volume>
          ,
          <fpage>207</fpage>
          -
          <lpage>218</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mei</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rui</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Jointly modeling embedding and translation to bridge video and language</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>4594</fpage>
          -
          <lpage>4602</lpage>
          . IEEE Press, New York (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lazebnik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning deep structure-preserving image-text embeddings</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>5005</fpage>
          -
          <lpage>5013</lpage>
          . IEEE Press, New York (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>