<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identical Objects Recognition Based On Image and Textual Description</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bauman Moscow State Technical University</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Moscow</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Russian Federation andrew.aslanov@gmail.com</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Moscow Lomonosov State University</institution>
          ,
          <addr-line>Moscow, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1883</year>
      </pub-date>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Image retrieval methods are developing rapidly, but an image alone is not sufficient in some specific cases, especially when searching for information in social networks. For this reason, we introduce a combined method that uses both text and image representations for identical object search. The task is solved in the practical context of finding lost pets in social networks. The suggested method shows an approximately 14.66% better quality result in comparison with the same method using the image retrieval technique only.</p>
      </abstract>
      <kwd-group>
        <kwd>dataset</kwd>
        <kwd>object detection</kwd>
        <kwd>neural network</kwd>
        <kwd>deep learning</kwd>
        <kwd>siamese network</kwd>
        <kwd>joint representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        When solving different problems, a human uses different sources of information,
including audio, video, or text. Multimodal task settings study approaches that exploit
different modalities to improve the quality of task solutions.
Cross-media (or multimedia) retrieval aims to search for information when queries and
retrieval results are of different media types: text, image, or video [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1–3</xref>
        ].
      </p>
      <p>
        Multimodal machine translation approaches consider techniques of translation with
the use of available pictures [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. The main problem of all multimodal tasks is the
so-called “media gap”: representations of different media types lie in
different feature spaces.
      </p>
      <p>
        Multimodal machine learning approaches aim to build models that can process and
relate information from multiple modalities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Authors of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] list the following
problems of multimodal machine learning: representation of multimodal data, translation
(mapping) data from one modality to another, alignment between subelements of two
or more modalities, fusion of information from different modalities, and co-learning.
      </p>
      <p>In the current paper, this issue is studied for pet search over social network
posts, where each post may include text only, an image only, or an image with its textual
description. Such a system is useful when a user uploads a photo or a textual portrait
of a lost pet and then receives a response containing the most similar recent
posts aggregated from the social network.</p>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>In this section, we briefly discuss the related methods where text and image are both
processed.</p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the authors introduced a method of automatic photo annotation for event photography.
They suggest combining an image with its description and applying a neural
network to check how strongly the text pertains to the corresponding photo. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the
authors proposed to represent image and text with two kinds of scene graphs: a visual scene
graph and a textual scene graph. These parts jointly characterize objects and the relationships
between them. The image-text retrieval task is then naturally formulated as cross-modal
scene graph matching.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] the encoder is a convolutional neural network, and the features of the last
fully connected layer or convolutional layer are extracted as features of the image. The
decoder is a recurrent neural network, which is mainly used for image description
generation. In [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] the RNN was replaced with an LSTM to address the vanishing gradient problem.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] an attention mechanism was first proposed. It allows the neural
network to focus on specific inputs or features. The attention mechanism covers
two aspects: deciding which part of the input to attend to, and
allocating limited information processing resources to the important part.
      </p>
      <p>
        There are various techniques to calculate the attention distribution, and the "value" is
used to generate the selected information. Experiments have shown that the attention
mechanism is effective in abstractive summarization [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], visual captioning [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and other
tasks.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>We define a post, or document, as an entity containing an image, a text, or both. We
need to find images or texts in which an identical object is mentioned. Formally,
given a training subset of texts or images, the task is to minimize the
distance between documents that contain the same object.</p>
      <p>We propose two methods of identical object search. The first one is presented in
Section 4 and is based on image retrieval only. It includes
image data collection (4.1), filtration of inappropriate images (4.2),
augmentation (4.3), object detection (4.4), and similarity evaluation (4.5). The second method is
described in Section 5. It uses image-text joint representations and contains collection of
posts (an image with its corresponding description) (5.1), transformation of post data into joint
feature vectors, and vector similarity evaluation (5.2). All steps of both methods
are shown in Fig. 1.
Also, the reported results were obtained on the configuration below:
1. Processor: Intel Core i9-9820X, 3.3 GHz;
2. Graphics: nVidia 1080TI, 11 GB GPU memory;
3. Memory: 64 GB, 4.2 GHz;
4. OS: Ubuntu 18.04 x64.</p>
    </sec>
    <sec id="sec-4">
      <title>Method 1: Identical objects search based on images only</title>
      <sec id="sec-4-1">
        <title>Data collection</title>
        <p>Our goal is to build a dataset containing folders with images that show the same animal
in each of them. For this purpose, we use crawling, an automated data gathering
method. In order to collect data, we should get all available images from the social
networks and then filter out improper cases. To do that, we find Instagram and Flickr
accounts that aggregate images of the desired animal (cat or dog), where for each post
there is a text description with a link to the profile of the user who uploaded this image
to the social network. The user profile specified in such a post usually contains images of the
desired identical animal (Fig. 2). The first 100 images are saved from every
such profile, as sketched below.</p>
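        <p>The exact crawler depends on each platform's API. The following minimal sketch
only illustrates the per-profile saving step, assuming a hypothetical list of direct image
URLs already extracted by such a crawler:</p>
        <preformat>
import os
import requests

MAX_IMAGES_PER_PROFILE = 100  # only the first 100 images of every profile are kept

def save_profile_images(profile_id, image_urls, out_dir="dataset"):
    """Download at most MAX_IMAGES_PER_PROFILE images into a per-profile folder."""
    folder = os.path.join(out_dir, str(profile_id))
    os.makedirs(folder, exist_ok=True)
    for i, url in enumerate(image_urls[:MAX_IMAGES_PER_PROFILE]):
        resp = requests.get(url, timeout=10)
        if resp.status_code == 200:
            with open(os.path.join(folder, "{:03d}.jpg".format(i)), "wb") as f:
                f.write(resp.content)

# usage: image_urls would come from the platform-specific crawler
# save_profile_images("some_cat_account", image_urls)
</preformat>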
        <p>By means of the offered algorithm, we extract 9502 cat accounts and 8070 dog accounts.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Inappropriate images filtration</title>
        <p>For every account received from the previous step, it is necessary to filter out the images
that do not contain the desired object (cat or dog). To select the neural network
architecture that is most efficient for image filtering, we gather a manually labeled
test dataset:
─ 1000 random images from different Instagram accounts;
─ 118 selected images representing complicated cases in which an animal is present but
the image has no clear outlines or the presence is ambiguous (Fig. 3).
Such images may contain disguise, a difficult viewing angle, merging with the background, blur,
close-ups, graphical effects, and so on.</p>
        <p>To meet the classification challenge, pretrained neural network models are used. This
saves training time and computational resources while answering the question of
whether the required animal is present in the image. After that, we are able to decide,
with sufficient precision, whether a specific image should be filtered out. A minimal
filtering sketch is given below.</p>
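        <p>As an illustration, such a presence check can be sketched with an ImageNet-pretrained
classifier from Keras; the keyword-based decision rule over ImageNet class names below is our
simplified assumption, not the exact filter used in this work:</p>
        <preformat>
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")  # pretrained classifier, no fine-tuning

def contains_cat_or_dog(img_path, top=5):
    """Return True if any of the top ImageNet predictions looks like a cat or a dog."""
    img = image.load_img(img_path, target_size=(224, 224))  # fixed input size
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    preds = decode_predictions(model.predict(x), top=top)[0]
    keywords = ("cat", "dog", "terrier", "retriever", "spaniel", "hound")
    return any(any(k in label for k in keywords) for (_, label, _) in preds)
</preformat>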
        <p>
          In this paper, we compare the following CNN architectures, each of which has
unique properties for classification tasks:
1. VGG16/VGG19 [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]: unification of 16/19 convolution layers into a sequence of
convolutions; reduction of filter size to 3×3; rejection of the local response normalization
layer;
2. ResNet50 [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]: maintaining high precision at the current convolution layer and introducing the idea of residual
connections;
3. Inception v3 [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]: introduction of the Inception module, Root Mean Square propagation (RMSprop)
and batch normalization; reduction of filter size to 3×3; filter
decomposition into a pair of 1×N and N×1 filters;
4. Xception [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]: spatial and channel feature separation, replacing Inception modules
with depthwise separable convolutions;
5. Inception ResNet v2 [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]: dimensionality reduction via 1×1 convolutions;
addition of shortcut connections; increased number of hyperparameters;
6. NasNet Large [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]: blocks or cells are searched by reinforcement learning; the
number of initial convolutional filters is a free parameter used for scaling. Only cells
returning a feature map of the same dimension (or reduced by a factor of 2) are searched by the
recurrent neural network.
The input of a neural network is an image of a fixed size. The output is a binary
answer: whether an object (cat or dog) is present in the image or not.
Table 3 shows how many accounts remained after the crawling and filtration steps.
Each account holds an average of 42 photos of the identical animal.
        </p>
        <p>[Table: per-architecture results (VGG16, VGG19, ResNet50, Inception v3, Xception,
Inception ResNet v2, NasNet Large) for the cat and dog data.]</p>
      </sec>
      <sec id="sec-4-4">
        <title>Data augmentation</title>
        <p>If an account contains fewer than 16 images, an augmentation technique is
applied. Based on the available photos, several geometric and photometric operations (such as affine
transformation, brightness change, rotation, reflection, and others) are performed. This
is necessary to increase the number of photos per account and to ensure convergence
of the subsequent algorithm. A minimal augmentation sketch is given below.</p>
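        <p>As a sketch of such augmentation (an illustrative choice of operations using Pillow,
not a specific library from this work), simple reflections, rotations, and brightness changes
already multiply the number of images per account:</p>
        <preformat>
import random
from PIL import Image, ImageEnhance

def augment(img):
    """Return simple augmented variants of a PIL image."""
    return [
        img.transpose(Image.FLIP_LEFT_RIGHT),                             # reflection
        img.rotate(random.uniform(-25, 25), expand=True),                  # small rotation
        ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4)),    # brightness change
    ]

def ensure_min_images(images, min_count=16):
    """Augment until the account holds at least min_count images."""
    result = list(images)
    while min_count > len(result):
        result.extend(augment(random.choice(images)))
    return result
</preformat>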
      </sec>
      <sec id="sec-4-5">
        <title>Object detection</title>
        <p>We use object detection to focus on the object instead of the image background. As in
the classification task, we use pretrained neural networks here.</p>
        <p>
          The following architectures were compared:
1. Yolo v3 [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]: batch normalization and a higher-resolution classifier; multi-scale
training and a feature pyramid network; Darknet-53 feature extractor.
2. Faster R-CNN [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]: at the conceptual level, this architecture is composed of three
neural networks:
─ Feature Network – a pretrained image classification network without its last few layers;
─ Region Proposal Network – generates a number of bounding boxes that
have a high probability of containing an object;
─ Detection Network – takes input from both the Feature Network and the RPN, and
generates the final class and bounding box.
        </p>
        <p>
          3. Grid R-CNN [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]: uses a multi-point supervision formulation to encode more
clues and to reduce the impact of inaccurate prediction of specific points.
4. RetinaNet [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]: a single, unified network composed of a backbone network (for
computing a convolutional feature map) and two task-specific subnetworks. The first
subnet performs classification on the backbone's output; the second subnet
performs convolutional bounding box regression.
        </p>
        <p>
          5. CenterNet [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]: uses centeredness information to perceive the visual patterns
within each proposed region.
        </p>
        <p>The obtained object detection results are specified in Table 4 and Table 5. We use a
subset of 118 dog photos and 155 cat photos as a test set.
[Tables 4 and 5: per-architecture mAP values; among the recoverable numbers are 0.9814,
0.9876, 0.5975, 0.9642, 0.9914, and 0.5078.]
The anchor-based RetinaNet neural network gave the best results, so we use the
RetinaNet + FreeAnchor option for dog and cat detection. A minimal inference sketch with
a pretrained detector is given below.</p>
      </sec>
      <sec id="sec-4-6">
        <title>Similarity calculation with siamese neural network</title>
        <p>
          Siamese neural networks are used to calculate the similarity between texts or images.
We take a modified FaceNet version called EmbeddingNet1 with the triplet loss
function [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for our task. Our full dataset contains 6572 cat accounts and 3552 dog
accounts. For the similarity calculation, we make a shuffled subset (training set) containing
1200 dog and cat accounts; this is done to reduce computation. Also, we
make a subset (test set) containing 582 shuffled accounts. Below we list the
hyperparameters used during training (a triplet-loss sketch follows the list):
1. Initial image size: 512×512×3;
2. Margin: 0.4;
3. Loss function: triplet loss;
4. Learning rate: 0.0001;
5. Optimizer: RAdam;
6. Epochs: 500;
7. Maximum number of neighbors: 100.
        </p>
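        <p>A minimal sketch of the triplet objective (a simplified TensorFlow version written for
illustration, not the exact EmbeddingNet code): the anchor and positive come from the same
account, the negative from a different one, and the margin corresponds to item 2 above.</p>
        <preformat>
import tensorflow as tf

MARGIN = 0.4  # margin hyperparameter from the list above

def triplet_loss(anchor, positive, negative, margin=MARGIN):
    """Standard triplet loss on embedding batches of shape (batch, dim)."""
    d_pos = tf.reduce_sum(tf.square(anchor - positive), axis=-1)  # anchor-positive distance
    d_neg = tf.reduce_sum(tf.square(anchor - negative), axis=-1)  # anchor-negative distance
    return tf.reduce_mean(tf.maximum(d_pos - d_neg + margin, 0.0))

# embeddings produced by the shared (Siamese) backbone:
# loss = triplet_loss(emb_anchor, emb_positive, emb_negative)
</preformat>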
        <p>If the result has not changed for more than 10 epochs in a row, we conclude that training
is completed.</p>
        <p>We use Instagram data for training and VK2 posts for testing. The VK test set contains 582
photos. Training took 79 epochs. As a result, we obtain a top-1 accuracy of
54.10% and a top-5 accuracy of 71.81% on the test data. Processing every account takes 56.62 ms.
To compute similarity, we use the nVidia 1080TI graphics processor
configuration. Top-k retrieval accuracy can be evaluated as sketched below.</p>
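        <p>An illustrative evaluation loop for top-k retrieval accuracy over the learned embeddings
(written under our assumptions, with numpy arrays as inputs; it is not the exact evaluation script
of this work):</p>
        <preformat>
import numpy as np

def topk_accuracy(query_emb, query_labels, gallery_emb, gallery_labels, k=5):
    """Fraction of queries whose k nearest gallery embeddings contain the correct account."""
    hits = 0
    for q, label in zip(query_emb, query_labels):
        dists = np.linalg.norm(gallery_emb - q, axis=1)  # distances to all gallery items
        nearest = np.argsort(dists)[:k]                  # indices of the k closest embeddings
        hits += int(label in gallery_labels[nearest])
    return hits / len(query_labels)

# top1 = topk_accuracy(test_emb, test_ids, gallery_emb, gallery_ids, k=1)
# top5 = topk_accuracy(test_emb, test_ids, gallery_emb, gallery_ids, k=5)
</preformat>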
        <sec id="sec-4-6-1">
          <title>1 https://github.com/RocketFlash/EmbeddingNet 2 https://vk.com</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Method 2: Identical objects search based on image-text joint representations</title>
      <sec id="sec-5-1">
        <title>Data collection</title>
        <p>To perform identical object search, much like in the previous method, we first gather
data automatically. This time, we collect posts from the VK social network and then filter out
inappropriate ones.</p>
        <p>In general, the algorithm comprises the following stages (a preprocessing sketch follows this list):
1. Data crawling from the social network. We crawl several groups from VK. The
first half of these groups (6 groups) covers different general-purpose topics and the other
half (6 groups) is related to animal shelters and volunteering. We extract
39995 posts in total.
2. Text preprocessing. We split and lemmatize all the words from the previous stage;
we also remove all punctuation marks. Words within hashtags usually do not
add value, so they are removed first. After that, we remove
numbers, proper names, English words (we work with texts in Russian only),
Latin symbols, and special symbols. We perform basic stop-word removal (with
default parameters) with the Python nltk3 framework.
3. Token sorting. We sort all the frequent words (unigrams) in descending order and
then take the first 10%. Also, we deliberately increase the frequency counts of
some words in order to raise the priority of location references.</p>
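        <p>A simplified preprocessing sketch (an illustrative version based on regular expressions and
nltk stop words; the choice of lemmatizer for Russian is an assumption and is omitted here):</p>
        <preformat>
import re
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
RU_STOPWORDS = set(stopwords.words("russian"))

def preprocess(text):
    """Lowercase, drop hashtags, punctuation, digits and Latin symbols, remove Russian stop words."""
    text = re.sub(r"#\w+", " ", text.lower())   # hashtags are removed first
    text = re.sub(r"[^а-яё\s]", " ", text)      # keep Cyrillic letters only
    tokens = [t for t in text.split() if t not in RU_STOPWORDS]
    return tokens  # lemmatization with a Russian morphological analyzer would follow here
</preformat>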
        <p>As a result, we have the k most frequent meaningful words (k = 370 in our case).
Further, each unigram frequency is normalized with respect to the total sum of
frequencies using the following expression:</p>
        <p>$p_w = \dfrac{f_w}{\sum_{i} f_i}$   (1)</p>
        <p>The $p_w$ values obtained in (1) are the priorities of the unigram tokens. The more frequently a word is
mentioned, the higher priority it has.</p>
        <p>In (1): $f_w$ – frequency of the word w; $p_w$ – priority of the word w.</p>
        <p>By aggregating the priorities of all words of a post, we can calculate the priority of its
text c:</p>
        <p>$p_c = \dfrac{1}{l}\sum_{i=1}^{l} p_{w_i}$   (2)</p>
        <p>In (2): l – number of words in the post; $p_c$ – priority of the text c.</p>
        <p>If a post contains no words from the list of most frequent unigrams, its final priority is
set to 0.</p>
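        <p>Under this reading of (1) and (2), the priority computation can be sketched as follows
(a minimal illustration; unigram_freqs and tokens are hypothetical inputs produced by the
preprocessing stage):</p>
        <preformat>
def word_priorities(unigram_freqs):
    """Normalize unigram frequencies into priorities p_w, as in (1)."""
    total = sum(unigram_freqs.values())
    return {w: f / total for w, f in unigram_freqs.items()}

def text_priority(tokens, priorities):
    """Average priority of a post's words, as in (2); 0 if no frequent word occurs."""
    known = [priorities[t] for t in tokens if t in priorities]
    return sum(known) / len(tokens) if known else 0.0
</preformat>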
        <p>Text priority values are added to an array, and the median of this array is refined
with every new priority value. The median is used as a threshold.</p>
        <sec id="sec-5-1-1">
          <title>3 https://nltk.org</title>
          <p>If a text's priority is higher than
the median value, the text is considered relevant to our topic (mentions of cats and dogs)
and the corresponding target is set to 1. Otherwise, the target is set to 0.</p>
          <p>The median value converges as the number of posts grows, as shown in Fig. 4.
About 350–400 iterations are enough for sufficient convergence of the algorithm. The final
threshold after 1200 iterations equals 118.02.</p>
          <p>We can calculate the current post error (residual) $e_c$ as the difference between the current
text priority $p_{c_i}$ and the median threshold value. Applying linear interpolation, we obtain the
confidence that the current post is assigned to the correct class:</p>
          <p>$e_c = f\big(\mathrm{median}(\mathbf{p}_c) - p_{c_i}\big)$   (3)</p>
          <p>In (3): $\mathbf{p}_c$ – vector of all priorities preceding the current one; $p_{c_i}$ – the current
priority value; f – the linear interpolation function.</p>
          <p>We use scipy 1.5.0 of the Python programming language to compute the linear interpolation
function, as sketched below.</p>
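          <p>An illustrative sketch of the thresholding and interpolation steps (our reading of (3);
scipy.interpolate.interp1d stands in for the interpolation function f, and the mapping of
residuals onto [0, 1] is an assumption made for the example):</p>
          <preformat>
import numpy as np
from scipy.interpolate import interp1d

def label_and_confidence(priority, previous_priorities):
    """Compare a text priority with the running median and turn the residual into a confidence."""
    p = np.sort(np.asarray(previous_priorities, dtype=float))
    threshold = np.median(p)              # running median used as the threshold
    target = int(priority > threshold)    # 1 = relevant (cat/dog mention), 0 = otherwise
    residual = priority - threshold       # residual from (3), before interpolation
    # linear interpolation f over the observed residual range, mapped onto [0, 1]
    f = interp1d([p[0] - threshold, p[-1] - threshold], [0.0, 1.0],
                 bounds_error=False, fill_value=(0.0, 1.0))
    return target, float(f(residual))
</preformat>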
          <p>Thus, in accordance with the detection results, each image in the post is associated with
the mean average precision (mAP) quality of the proper class (dog or cat).</p>
          <p>As a result, we have the following full probability equation representing whether a post
contains at least one mention of a dog or a cat:</p>
          <p>$P_c = \dfrac{1}{4}\,q_{img} + \dfrac{3}{4}\,e_c$   (4)</p>
          <p>In (4), $q_{img}$ comes from the image data of the post and $e_c$ from its text data. For
verification, we take 100 posts for algorithm testing; 94 of the 100 posts turn out to be relevant.</p>
          <p>As a result, we have a method capable of collecting posts on required topics
automatically. We collect posts about cats and dogs and achieve sufficient data precision for our
case. In total, we build a dataset of 7292 posts.</p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Joint feature vector construction and similarity evaluation</title>
        <p>
          We compare post vectors to determine object identity. Both image and text are
transformed into vector representations. We use pretrained DistilBERT [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] to obtain the text
embedding. Also, we use the img2vec4 framework with a ResNet50 backbone to
obtain the image embedding. After that, we concatenate these vectors to get a joint
representation of the entire post; a sketch of this pipeline is given below. The schematic view
of the algorithm is shown in Fig. 5.
We compare the obtained vectors with each other using cosine distance. As in the previous
method, we take a subset of 582 test posts for verification. The
difference lies in the additional text descriptions and the choice of social network. Also, the
identity of objects in the images inside each post is not initially guaranteed. As a result, we
obtain a top-1 accuracy of 69.09% and a top-5 accuracy of 86.15%. Every post takes
61.05 ms to process.
The results of both methods are shown in Table 6.
        </p>
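        <p>A minimal sketch of the joint representation: for illustration, the ResNet50 feature is
extracted directly with torchvision instead of the img2vec wrapper, DistilBERT token states are
mean-pooled, and the checkpoint names are the standard Hugging Face ones, which is our assumption:</p>
        <preformat>
import torch
import torchvision
from PIL import Image
from scipy.spatial.distance import cosine
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-multilingual-cased")
text_model = DistilBertModel.from_pretrained("distilbert-base-multilingual-cased").eval()
resnet = torchvision.models.resnet50(pretrained=True).eval()
image_model = torch.nn.Sequential(*list(resnet.children())[:-1])  # drop the classification head
to_tensor = torchvision.transforms.Compose([
    torchvision.transforms.Resize((224, 224)),
    torchvision.transforms.ToTensor(),
])

def post_vector(text, image_path):
    """Concatenate a mean-pooled DistilBERT text embedding with a ResNet50 image embedding."""
    with torch.no_grad():
        tokens = tokenizer(text, return_tensors="pt", truncation=True)
        text_emb = text_model(**tokens).last_hidden_state.mean(dim=1).squeeze(0)  # 768-d
        img = to_tensor(Image.open(image_path).convert("RGB")).unsqueeze(0)
        img_emb = image_model(img).flatten()                                       # 2048-d
    return torch.cat([text_emb, img_emb]).numpy()

# similarity between two posts = 1 - cosine distance of their joint vectors
# sim = 1 - cosine(post_vector(t1, p1), post_vector(t2, p2))
</preformat>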
        <p>Ultimately, in this article, we present new methods of training set creation. The
identical object recognition task is then solved by searching for similar objects in the
images or in image-text joint representations. Such representations are numerical
vectors including both image and text features. We compare these representations to
find out how close similar objects lie in the feature space after the methods are applied.</p>
        <p>All the calculations are performed with the Python 3.6.8 programming language and the
following libraries: keras 2.3, tensorflow-gpu 1.15.0, mxnet+cu 1.5.1, CUDA 9.0, numpy 1.18.0,
scipy 1.5.0, matplotlib 3.1.3, mmcv 0.5.4, torch 1.4.0.</p>
        <sec id="sec-5-2-1">
          <title>4 https://github.com/christiansafka/img2vec</title>
          <p>As a result, we find that using text as an additional property has a significant positive
impact: the quality metrics increase by approximately 14.66%.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we solve the problem of recognizing identical objects in a social
network. The issue is studied for lost pet search. Two methods are suggested. In the
first case, we collect a dataset of identical lost pet photos and then try to find the most
similar objects of the required class using a Siamese neural network. The second case
uses a combined technique including both text and image representations. Once
obtained, they are transformed into a joint feature vector for further
comparison with the cosine distance metric.</p>
      <p>As the results demonstrate, the approach that includes various data types is preferable,
as it gives a higher result. We obtain 14.66% better metric quality in comparison with
the method that uses image retrieval only. We also provide the source code in a GitHub
repository5 for reproducibility.</p>
      <p>Potential future research will focus on designing a retrieval system that will employ
image captioning additionally.</p>
      <sec id="sec-6-1">
        <title>5 https://github.com/andreqwert/lost_pets_search</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges</article-title>
          .
          <source>Transactions on circuits and systems for video technology</source>
          ,
          <fpage>2372</fpage>
          -
          <lpage>2385</lpage>
          . IEEE (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lv</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Internet cross-media retrieval based on deep learning</article-title>
          .
          <source>Journal of Visual Communication and Image Representation</source>
          ,
          <fpage>356</fpage>
          -
          <lpage>366</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wei</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          et al:
          <article-title>Modality-dependent cross-media retrieval</article-title>
          .
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Specia</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sima</surname>
            'an,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elliott</surname>
            ,
            <given-names>D.:</given-names>
          </string-name>
          <article-title>A shared task on multimodal machine translation and crosslingual image description</article-title>
          .
          <source>In: Proceedings of the First Conference on Machine Translation</source>
          , vol.
          <volume>2</volume>
          :
          <string-name>
            <given-names>Shared</given-names>
            <surname>Task Papers</surname>
          </string-name>
          ,
          <fpage>543</fpage>
          -
          <lpage>553</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Huang</surname>
          </string-name>
          , P. Y.,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shiang</surname>
            ,
            <given-names>S. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oh</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dyer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Attention-based multimodal neural machine translation</article-title>
          .
          <source>In: Proceedings of the First Conference on Machine Translation</source>
          , vol.
          <volume>2</volume>
          :
          <string-name>
            <given-names>Shared</given-names>
            <surname>Task Papers</surname>
          </string-name>
          ,
          <fpage>639</fpage>
          -
          <lpage>645</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ngiam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nam</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Multimodal deep learning</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Baltrušaitis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahuja</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morency</surname>
            ,
            <given-names>L. P.:</given-names>
          </string-name>
          <article-title>Multimodal machine learning: A survey and taxonomy</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence</source>
          ,
          <fpage>423</fpage>
          -
          <lpage>443</lpage>
          . IEEE (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Postnikov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobrov</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>News Stories Representation Using Event Photos</article-title>
          .
          <source>In: XIX International Conference on Data Analytics and Management in Data Intensive Domains</source>
          ,
          <fpage>359</fpage>
          -
          <lpage>366</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval</article-title>
          .
          <source>In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision</source>
          . IEEE, USA (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Yagfarov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ostankovich</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akhmetzyanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Traffic Sign Classification Using Embedding Learning Approach for Self-driving Cars</article-title>
          .
          <source>In: International Conference of Human Interaction and Emerging Technologies</source>
          ,
          <fpage>180</fpage>
          -
          <lpage>184</lpage>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toshev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Show and tell: a neural image caption generator</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          . IEEE, USA (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Johnson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>F.-F.</given-names>
          </string-name>
          :
          <article-title>Visualizing and understanding recurrent networks</article-title>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heess</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Recurrent models of visual attention</article-title>
          .
          <source>Advances in Neural Information Processing Systems</source>
          ,
          <volume>3</volume>
          ,
          <fpage>2204</fpage>
          -
          <lpage>2212</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Allamanis</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A convolutional attention network for extreme summarization of source code</article-title>
          .
          <source>In: Proceedings of the Thirty-Third International Conference on Machine Learning</source>
          . New York, USA (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanjalic</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>From deterministic to generative: multi-modal stochastic RNNS for video captioning</article-title>
          .
          <source>IEEE Transaction on Neural Networks and Learning System</source>
          ,
          <volume>30</volume>
          (
          <issue>10</issue>
          ),
          <fpage>3047</fpage>
          -
          <lpage>3058</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sanh</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Debut</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaumond</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
          </string-name>
          , T.:
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          .
          <source>In: 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goyal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Focal loss for dense object detection</article-title>
          .
          <source>In: International Conference of Computer Vision</source>
          ,
          <fpage>2980</fpage>
          -
          <lpage>2988</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
          </string-name>
          , J.:
          <string-name>
            <surname>Faster</surname>
            <given-names>R-CNN</given-names>
          </string-name>
          :
          <article-title>Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          ,
          <fpage>1137</fpage>
          -
          <lpage>1149</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Duan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Qi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tian</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          :
          <article-title>Centernet: Keypoint triplets for object detection</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          ,
          <fpage>6569</fpage>
          -
          <lpage>6578</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Yolo v3: An incremental improvement</article-title>
          . arXiv:1804.02767 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yue</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
          </string-name>
          , J.:
          <article-title>Grid-rcnn</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>7363</fpage>
          -
          <lpage>7372</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>arXiv:1512.03385</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojna</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Rethinking the Inception Architecture for Computer Vision</article-title>
          .
          <source>In: The IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>2818</fpage>
          -
          <lpage>2826</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Xception: Deep Learning with Depthwise Separable Convolutions</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <fpage>1251</fpage>
          -
          <lpage>1258</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning</article-title>
          .
          <source>In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence</source>
          ,
          <fpage>4278</fpage>
          -
          <lpage>4284</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zoph</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasudevan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q. V.</given-names>
          </string-name>
          :
          <article-title>Learning transferable architectures for scalable image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <fpage>8697</fpage>
          -
          <lpage>8710</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>