<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Multiple Deep-learning-based Image Features for Visual Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandros Pournaras</string-name>
          <email>apournaras@iti.gr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaos Gkalelis</string-name>
          <email>gkalelis@iti.gr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damianos Galanopoulos</string-name>
          <email>dgalanop@iti.gr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasileios Mezaris</string-name>
          <email>bmezaris@iti.gr</email>
        </contrib>
        <aff>CERTH-ITI, Greece</aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <fpage>13</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper presents our team's (IDT-ITI-CERTH) proposed method for the Visual Sentiment Analysis task of the MediaEval 2021 benchmarking activity. Visual sentiment analysis is a challenging task as it involves a high level of subjectivity. The most recent works are based on deep convolutional neural networks, and exploit transfer learning from other image classification tasks. However, transferring knowledge from tasks other than image classification has not been investigated in the literature. Motivated by this, in our approach we examine the potential of transferring knowledge from several pre-trained networks, some of which are out-of-domain. We concatenate these diverse feature vectors and construct an image representation that is used to train a classifier for each of the three subtasks of this MediaEval task. Due to a bug in the original submission file, the official scores we obtained are 0.595, 0.479 and 0.380 for subtasks 1, 2 and 3, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Visual sentiment analysis is the problem of identifying the
sentiment conveyed by an image. The problem has recently attracted
significant attention due to the large-scale use of images in
social media. This MediaEval task focuses on images from natural
disasters, i.e. content that can often induce strongly negative
sentiments. A human-labeled disaster-related dataset and a
deep-learning-based approach to this problem were proposed in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] a detailed description of the task is presented.
      </p>
      <p>
        In general, visual sentiment analysis is challenging because it
involves a higher level of human subjectivity in the classification
process than other image classification tasks. As in
such tasks, deep convolutional neural networks are widely used;
many works in the literature, e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], rely on transfer learning by
fine-tuning pre-trained networks that have most
commonly been trained on ImageNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However,
little emphasis has been given to investigating the potential of
transferring knowledge learned from neural networks trained on tasks
other than image classification. For this reason, we employ several
pre-trained networks, some of which are trained on out-of-domain
datasets and tasks. We extract their encodings and concatenate
them to create a rich image representation. Using this, we train a
sentiment classifier that takes the form of either a dense 3-layer
neural network or a Mixture of Experts (MoE) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] classifier.
      </p>
    </sec>
    <sec id="sec-2">
      <title>APPROACH</title>
      <p>
        Our proposed method is closely based on [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a method that
achieves state-of-the-art performance on many benchmark image
sentiment analysis datasets. We transfer knowledge from
five pre-trained neural networks. These networks have different
architectures and are trained on different datasets, some for problems
other than image classification. They were chosen for use in this
task because they perform very well in their respective domains.
A feature vector is extracted from each network. In the following
subsections, we briefly describe each network and how each feature
vector is extracted. We classify each feature vector into one of
two categories: in-domain for those coming from networks
trained on image classification tasks, or out-of-domain for those
coming from networks trained on other tasks. A summary of the
employed feature vectors can be seen in Table 1.
      </p>
      <p>2.1 In-Domain Feature Vectors</p>
      <p>
        2.1.1 EfficientNet features. EfficientNet [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a recently
proposed deep convolutional neural network architecture that achieves
state-of-the-art performance on image classification tasks. We used
a "B2"-variation model pre-trained on the 1000-class ImageNet
dataset. We remove the last fully connected layer, so the network
outputs a 1408-element feature vector, E.
      </p>
      <p>
        2.1.2 ResNet features. ResNet [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a family of convolutional
neural networks based on residual blocks that have shown
state-of-the-art performance in image classification tasks. We use the
152-layer deep ResNet architecture trained on the 11k-class ImageNet
dataset [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] and extract the 2048-element "pool5" layer as the feature
vector, R.
      </p>
      <p>2.2 Out-of-Domain Feature Vectors</p>
      <p>
        2.2.1 YT8M features. YouTube-8M [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a large annotated video
dataset containing approximately 6 million videos with a total
duration of more than 500,000 hours, labeled with 3862 classes. To
train a classifier on this dataset, we extract features at a 1 fps
sampling rate using an Inception neural network [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] pre-trained
on ImageNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The ReLU activation of the last hidden layer of
this network is given as input to a rather simple CNN classifier,
consisting of a 1D convolutional layer with 64 filters, a max-pooling
layer, a dropout layer and a sigmoid output layer with 3862 outputs. This is the
YouTube-8M-trained classifier that we ultimately use as a feature extractor for
sentiment classification: the classifier’s 3862-element output vector
for each image is our feature vector, Y.
      </p>
      <p>
        2.2.2 "Signature" features. To obtain the "signature" features,
we utilize a cross-modal network designed for ad-hoc video search.
More specifically, the attention-based dual encoding network
presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is used. The network is trained to translate a media
item (i.e. an entire video) V or a textual item (i.e. a natural-language
video caption or search query) T into a new joint feature space <italic>f</italic>(·),
resulting in representations <italic>f</italic>(V) or <italic>f</italic>(T), respectively; such
representations, despite being derived from different data modalities, are
directly comparable. This network is trained using large datasets
of video-caption pairs: MSR-VTT [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], TGIF [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Vatex [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and
ActivityNet [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. To leverage this pre-trained network as a feature
generator in the image sentiment analysis task, we consider an
image as a special type of video comprising only one keyframe. The
image is used as input to the visual encoding branch of the network,
fed forward through the multi-level encoding layers, and the global
image representation <italic>f</italic>(V), a 2048-element vector, is used as our
"signature" feature vector.
      </p>
      <p>
        2.2.3 Graph Convolutional Network (GCN) features. To obtain
this feature vector, we employ a neural network trained for the
task of video event recognition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Following the application of an
object detector on the frames of the video, a neural network is used
to extract the objects’ features and graphs are used to model the
relations between objects. Then, a graph convolutional network
(GCN) is utilized to perform reasoning on the graphs. The resulting
object-based frame-level features are then forwarded to a long
short-term memory (LSTM) network for video event recognition.
To extract the feature vector that we use for the image sentiment
analysis in this work, we fetch the output of the GCN, which is a
2048-element vector G.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2.3 Sentiment Classifiers</title>
      <p>We concatenate the five feature vectors described above, resulting in
a final 11414-element feature vector, which is used to train our
classifiers for the 3 subtasks.</p>
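      <p>The size of the fused representation follows directly from the sizes of the five vectors, and the arithmetic can be checked with a short sketch. This is only an illustration under stated assumptions: the dictionary keys and the helper <monospace>concatenate_features</monospace> are hypothetical names, the concatenation order is assumed, and zero vectors stand in for real network outputs.</p>
      <preformat><![CDATA[
```python
# Sizes of the five feature vectors, as stated in Sections 2.1 and 2.2.
# The key names and their order are illustrative assumptions.
FEATURE_DIMS = {
    "efficientnet": 1408,  # E
    "resnet": 2048,        # R
    "yt8m": 3862,          # Y
    "signature": 2048,
    "gcn": 2048,           # G
}


def concatenate_features(features):
    """Concatenate the per-network vectors into one flat list, checking sizes."""
    fused = []
    for name, dim in FEATURE_DIMS.items():
        vector = features[name]
        if len(vector) != dim:
            raise ValueError(f"{name}: expected {dim} values, got {len(vector)}")
        fused.extend(vector)
    return fused


# Zero vectors stand in for real network outputs.
placeholder = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}
fused_vector = concatenate_features(placeholder)
print(len(fused_vector))  # 11414
```
]]></preformat>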
      <p>2.3.1 Subtask 1. For subtask 1 we employ a Mixture of Experts
classifier. The first layer of this classifier is a fully connected layer
which transforms the input vector to a 200-element vector. After
passing through a Dropout and a ReLU block, this 200-element
vector is the input <italic>x</italic> forwarded to the <italic>M</italic> = 2 experts,
<italic>e</italic><sub>1</sub>(<italic>x</italic>), <italic>e</italic><sub>2</sub>(<italic>x</italic>), which are
defined for each class <italic>k</italic>, as well as to the associated gates, <italic>g</italic><sub>1</sub>(<italic>x</italic>), <italic>g</italic><sub>2</sub>(<italic>x</italic>).
For each class, an extra "dummy" expert is also defined to represent
the rest-of-the-world class, and only participates in partitioning
the feature space through the gate component of the Mixture of
Experts classifier. The experts and the gate are implemented as fully
connected layers with a sigmoid and a softmax nonlinearity,
respectively. A confidence score for the <italic>k</italic>th class is then computed by
merging the experts’ outputs into a single output <italic>o</italic>(<italic>k</italic>) according to the
gate’s decision (Eq. (1)). The whole network is trained end-to-end.
        <disp-formula id="eq-1">
          <label>(1)</label>
          <tex-math><![CDATA[o(k) = \sum_{j=1}^{M} g_j(x) \, e_j(x)]]></tex-math>
        </disp-formula>
      </p>
      <p>2.3.2 Subtask 2. For subtask 2 we employ a dense 3-layer neural
network classifier. This classifier comprises three fully-connected
layers with 1000, 200 and, finally, 7 neurons, as there are 7 target
classes in this subtask. Between consecutive layers there is a ReLU
and a dropout layer with probability 0.4. Finally, the output passes
through a sigmoid nonlinearity.</p>
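      <p>The forward pass of such a dense classifier can be sketched in plain Python. This is a minimal illustration, not the implementation used for the submission: the class name <monospace>DenseClassifier</monospace> and the helper functions are hypothetical, the weights are random rather than trained, inverted dropout is an assumed dropout variant, and the demonstration uses scaled-down layer sizes (the sizes in the text would be 11414, 1000, 200 and 7).</p>
      <preformat><![CDATA[
```python
import math
import random


def init_linear(rng, n_in, n_out):
    """Return (weights, bias) of a fully connected layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    """Apply a fully connected layer; each row of weights yields one output."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class DenseClassifier:
    """Fully connected layers with ReLU and dropout in between, sigmoid outputs."""

    def __init__(self, layer_sizes, dropout_p=0.4, seed=0):
        self.rng = random.Random(seed)
        self.dropout_p = dropout_p
        self.layers = [
            init_linear(self.rng, n_in, n_out)
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
        ]

    def _dropout(self, values):
        # Inverted dropout (an assumption): zero a unit with probability
        # dropout_p and rescale the survivors so the expected value is unchanged.
        keep = 1.0 - self.dropout_p
        return [v / keep if self.rng.random() < keep else 0.0 for v in values]

    def forward(self, features, training=False):
        hidden = features
        last = len(self.layers) - 1
        for index, (weights, bias) in enumerate(self.layers):
            hidden = linear(hidden, weights, bias)
            if index < last:
                hidden = [max(0.0, v) for v in hidden]  # ReLU
                if training:
                    hidden = self._dropout(hidden)
        # One independent sigmoid score per target class.
        return [sigmoid(z) for z in hidden]


# Scaled-down demonstration: 32 inputs, hidden layers of 16 and 8, 7 classes.
model = DenseClassifier([32, 16, 8, 7])
scores = model.forward([0.5] * 32)
print(len(scores))  # 7
```
]]></preformat>
      <p>Because each output passes through its own sigmoid, the scores are independent per class, which matches the binary cross-entropy loss used for this subtask.</p>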
      <p>2.3.3 Subtask 3. For subtask 3 we use the same classifier as
in subtask 2, with the exception of the output layer, which in this
case comprises 10 neurons, as there are 10 target classes.</p>
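      <p>For the Mixture of Experts classifier of subtask 1, the per-class gating can be sketched in the same dependency-free style. The sketch rests on several assumptions that are not taken from the text: the class name <monospace>MixtureOfExperts</monospace> and the helpers are hypothetical, the experts' outputs are merged as a gate-weighted sum in which the dummy expert receives a gate weight but contributes no output, three classes are used purely for illustration, the weights are random, and dropout is omitted because only inference is shown.</p>
      <preformat><![CDATA[
```python
import math
import random


def init_linear(rng, n_in, n_out):
    """Return (weights, bias) of a fully connected layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, weights, bias):
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class MixtureOfExperts:
    """Per-class experts (sigmoid) merged by a per-class gate (softmax)."""

    def __init__(self, n_in, n_hidden, n_classes, n_experts=2, seed=0):
        rng = random.Random(seed)
        self.n_classes = n_classes
        self.n_experts = n_experts
        self.hidden = init_linear(rng, n_in, n_hidden)
        # For every class: n_experts expert logits and n_experts + 1 gate
        # logits; the extra gate entry belongs to the "dummy" expert.
        self.experts = [init_linear(rng, n_hidden, n_experts) for _ in range(n_classes)]
        self.gates = [init_linear(rng, n_hidden, n_experts + 1) for _ in range(n_classes)]

    def forward(self, features):
        # First layer followed by ReLU (dropout is the identity at inference).
        x = [max(0.0, v) for v in linear(features, *self.hidden)]
        scores = []
        for k in range(self.n_classes):
            expert_out = [sigmoid(z) for z in linear(x, *self.experts[k])]
            gate_out = softmax(linear(x, *self.gates[k]))
            # The dummy expert only absorbs gate mass; it adds nothing to the sum.
            scores.append(sum(gate_out[j] * expert_out[j] for j in range(self.n_experts)))
        return scores


# Scaled-down demonstration; the text uses a 200-element hidden vector.
moe = MixtureOfExperts(n_in=32, n_hidden=8, n_classes=3)
class_scores = moe.forward([0.1] * 32)
print(len(class_scores))  # 3
```
]]></preformat>
      <p>Since the gate weights of the real experts sum to less than one and every expert output lies between 0 and 1, each class score in this sketch also lies between 0 and 1.</p>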
    </sec>
    <sec id="sec-4">
      <title>RESULTS</title>
      <p>To optimize the parameters and choose the classifier for each
subtask, we randomly split the development set into a training and
a validation set with 80% and 20% of the images, respectively. For
subtask 1, we measured the cross-entropy loss. We employed the
Adam optimizer and trained with a learning rate of 10<sup>−5</sup>. We trained
the model for 300 epochs. For subtasks 2 and 3 we additionally
performed augmentations to the images of the training set: random
crop, blurring, change in brightness and random rotations. We
measured the binary cross-entropy loss and optimized with Adam.
For subtask 2 we trained for 300 epochs with a learning rate of 3 × 10<sup>−6</sup>,
while for subtask 3 we trained for 200 epochs with a learning rate of 5 × 10<sup>−6</sup>. The
learning rate is scheduled to drop by half every 70 epochs in all
the subtasks. The batch size for the training was set to 64 for all
the subtasks. Following the selection of the above parameters, we
used the entire development set provided by the task organizers
to train our final models. The experimental results we obtained on the
development set as well as the official (with bug) and unofficial
(corrected) test set results are shown in Table 2.</p>
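      <p>The data split and the learning-rate schedule described above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, epochs are counted from zero, and integer identifiers stand in for the development-set images.</p>
      <preformat><![CDATA[
```python
import random

# Base learning rate and number of epochs per subtask, as reported in the text.
SUBTASK_SCHEDULES = {1: (1e-5, 300), 2: (3e-6, 300), 3: (5e-6, 200)}


def split_development_set(items, train_fraction=0.8, seed=0):
    """Randomly split items into a training list and a validation list."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def step_decay_lr(base_lr, epoch, step=70, factor=0.5):
    """Learning rate at a zero-based epoch when it is halved every `step` epochs."""
    return base_lr * factor ** (epoch // step)


# Integer identifiers stand in for the development-set images.
train_ids, val_ids = split_development_set(range(1000))
print(len(train_ids), len(val_ids))  # 800 200

for subtask, (base_lr, epochs) in SUBTASK_SCHEDULES.items():
    final_lr = step_decay_lr(base_lr, epochs - 1)
    print(f"subtask {subtask}: final learning rate {final_lr:.3e}")
```
]]></preformat>
      <p>Under these assumptions the rate is halved four times over 300 epochs and twice over 200 epochs.</p>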
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the EU Horizon 2020 programme under
grant agreement 832921 (MIRROR).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Abu-El-Haija</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kothari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Natsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Toderici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Varadarajan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vijayanarasimhan</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>YouTube-8M: A Large-Scale Video Classification Benchmark</article-title>
          . arXiv:1609.08675. https://arxiv.org/pdf/1609.08675v1.pdf
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Giró-i-Nieto</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>From pixels to sentiment: Fine-tuning CNNs for visual sentiment prediction</article-title>
          .
          <source>Image and Vision Computing</source>
          <volume>65</volume>
          (
          <year>2017</year>
          ),
          <fpage>15</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Attention Mechanisms, Signal Encodings and Fusion Strategies for Improved Ad-hoc Video Search with Dual Encoding Networks</article-title>
          .
          <source>In Proc. of the ACM Int. Conf. on Multimedia Retrieval (ICMR '20)</source>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Gkalelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>ObjectGraphs: Using Objects and a Graph Convolutional Network for the Bottom-Up Recognition and Explanation of Events in Video</article-title>
          .
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>
          .
          <fpage>3375</fpage>
          -
          <lpage>3383</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Conci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Fuqaha</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Visual Sentiment Analysis: A Natural Disaster Use-case Task at MediaEval 2021</article-title>
          .
          <source>In Proceedings of the MediaEval 2021 Workshop</source>
          , Online.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          ),
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Islam</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Visual sentiment analysis for social images using transfer learning approach</article-title>
          .
          <source>In IEEE Int. Conf. on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom)</source>
          . IEEE,
          <fpage>124</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nowlan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Adaptive Mixtures of Local Experts</article-title>
          .
          <source>Neural Computation</source>
          <volume>3</volume>
          (
          <issue>02</issue>
          <year>1991</year>
          ),
          <fpage>78</fpage>
          -
          <lpage>88</lpage>
          . https://doi.org/10.1162/neco.1991.3.1.79
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Niebles</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Dense-Captioning Events in Videos</article-title>
          .
          <source>In Int. Conf. on Computer Vision</source>
          (ICCV).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Song</surname>
          </string-name>
          , and others.
          <year>2016</year>
          .
          <article-title>TGIF: A new dataset and benchmark on animated GIF description</article-title>
          .
          <source>In Proc. of IEEE CVPR</source>
          .
          <fpage>4641</fpage>
          -
          <lpage>4650</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pournaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gkalelis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Galanopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Mezaris</surname>
          </string-name>
          .
          <year>2021</year>
          .
          <article-title>Exploiting Out-of-Domain Datasets and Visual Representations for Image Sentiment Classification</article-title>
          .
          <source>In 2021 16th International Workshop on Semantic and Social Media Adaptation Personalization (SMAP)</source>
          .
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . https://doi.org/10.1109/SMAP53521.2021.9610801
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>O.</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Krause</surname>
          </string-name>
          , and others.
          <year>2015</year>
          .
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>115</volume>
          , 3 (
          <year>2015</year>
          ),
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          .
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          . https://doi.org/10.1109/CVPR.2015.7298594
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Tan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks</article-title>
          .
          <source>In Int. Conf. on Machine Learning</source>
          .
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          , and others.
          <year>2019</year>
          .
          <article-title>Vatex: A large-scale, high-quality multilingual dataset for video-and-language research</article-title>
          .
          <source>In Proc. of the IEEE Int. Conf. on Computer Vision</source>
          .
          <fpage>4581</fpage>
          -
          <lpage>4591</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rui</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>MSR-VTT: A Large Video Description Dataset for Bridging Video and Language</article-title>
          .
          <source>In Proc. of IEEE CVPR</source>
          .
          <fpage>5288</fpage>
          -
          <lpage>5296</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zohaib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Conci</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Fuqaha</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Sentiment Analysis from Images of Natural Disasters</article-title>
          . arXiv:cs.CV/1910.04416
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>