<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explaining Emotional Attitude Through the Task of Image-captioning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleg Bisikalo</string-name>
          <email>obisikalo@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Volodymyr Kovenko</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilona Bogach</string-name>
          <email>ilona.bogach@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olha Chorna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kremenchuk Mykhailo Ostrohradskyi National University</institution>
          ,
          <addr-line>Pershotravneva Street, 20, Kremenchuk, 39600</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vinnytsia National Technical University</institution>
          ,
          <addr-line>Khmelnytsky highway 95, Vinnytsya, 21021</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning algorithms trained on huge datasets containing visual and textual information have been shown to learn features useful for downstream tasks, which implies that such models understand the data at different levels of the hierarchy. In this paper we study the ability of state-of-the-art (SOTA) models for both texts and images to understand the emotional attitude caused by a situation. For this purpose we gathered a small dataset based on the IMDB-WIKI one and annotated it specifically for the task. In order to investigate the ability of pretrained models to understand the data, a KNN clustering procedure over representations of texts and images is utilized in parallel. It is shown that although the models used are not capable of understanding the task at hand, a transfer learning procedure based on them helps to improve results on such tasks as image-captioning and sentiment analysis. We then frame our problem as an image-captioning task and experiment with different architectures and training approaches. Finally, we show that adding biometric features, such as emotion and gender probabilities, improves the results and leads to a better understanding of emotional attitude.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep learning algorithms</kwd>
        <kwd>Emotional attitude</kwd>
        <kwd>SOTA models</kwd>
        <kwd>Image-captioning</kwd>
        <kwd>NLP</kwd>
        <kwd>Transfer-learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>2. A set of experiments on the tasks of image-captioning and sentiment analysis, based on
features extracted from the highlighted models. It is also shown that adding biometric features such as
gender and emotion distributions improves the performance of image-captioning models.</p>
      <p>
        The training procedure was conducted using tensorflow [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and pytorch [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Data collection</title>
      <p>The data needed to include both images and their captions. As the main intent was to
capture emotional attitude, the images had to contain people and explicit or implicit
information about the cause of their emotional state, while the captions had to give an
exhaustive, unbiased description of the situation. Based on these requirements, the first idea was
to build a dataset from a subset of existing image-captioning datasets.</p>
      <p>
        Image-captioning is the process of generating textual description of an image. The task implies
that the relevant dataset consists of image-text pairs. One of the most popular datasets for the
discussed task is COCO [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which consists of 330K images. We used only the subset of the dataset
related to image-captioning, namely the 2014 train split, which consisted of 29,766 images with
5 captions per image. As manually filtering the images would be hard and cumbersome,
a YoloV3 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] object-detection algorithm trained on the discussed dataset was used. Only images that
contained objects of class “person” were kept. As a result, the COCO dataset was shrunk to 3731
images. However, the filtered images and captions contained only the actual plot of the image, without
any emotional attitude. The other analyzed dataset was the VizWiz [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] one. VizWiz is the first
goal-oriented VQA (visual question answering) dataset arising from a natural VQA setting; it consists
of over 31,000 visual questions originating from blind people. The needed data subset was found by
filtering the captions using people-related words. As the resulting data was of poor quality, this variant
was declined. The last image-text dataset we experimented with was the SentiCap [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] one. SentiCap
consists of 2360 images containing sentiments. After filtering the dataset in the same way as was
done for the VizWiz one, we arrived at only 830 samples, which was not enough for our task.
      </p>
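      <p>The dataset-filtering step above can be sketched in a few lines. This is a minimal illustration, assuming the object detections have already been computed; the list-of-dicts detection format and the filter_person_images helper are hypothetical stand-ins for real YoloV3 output, not the actual pipeline code:</p>

```python
# Sketch of the dataset-filtering step: keep only images where the object
# detector found at least one object of class "person". The detection
# format here (a list of {"image_id", "labels"} dicts) is a simplifying
# assumption; in the paper the detections come from YOLOv3.

def filter_person_images(detections):
    """Return ids of images whose detections contain the class 'person'."""
    return [d["image_id"] for d in detections if "person" in d["labels"]]

# Toy detections standing in for real YOLOv3 output.
detections = [
    {"image_id": 1, "labels": ["person", "dog"]},
    {"image_id": 2, "labels": ["car"]},
    {"image_id": 3, "labels": ["person"]},
]

kept = filter_person_images(detections)
print(kept)  # [1, 3]
```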
      <p>
        The other variant was to gather a dataset from the very beginning and annotate it. The images were
taken from the IMDB-WIKI [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] dataset for age and gender detection. Each image was annotated
with a description of the emotional attitude of the person or people in it. As a result, we arrived at
a dataset of 3840 image-text pairs, where each image was resized to 224x224 pixels (Fig. 1).
Figure 1 (a - f): Dataset examples with corresponding captions
      </p>
      <p>
        In order to categorize the dataset, sentiments related to captions were added using Vader [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ],
which is a rule-based model for sentiment analysis. The sentiments were then checked by humans one
more time to produce more meaningful labels. As a result of the analysis, the data appeared to be
imbalanced with respect to the new category (Fig. 2).
      </p>
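      <p>The categorization step can be sketched as follows. VADER's polarity_scores() returns a dict with a "compound" value in [-1, 1]; the ±0.05 thresholds below are the ones commonly used with VADER, and the caption texts and scores are illustrative, not taken from the dataset:</p>

```python
# Hedged sketch of deriving a sentiment category from a VADER compound
# score. The +/-0.05 thresholds are the conventional ones for VADER.

def sentiment_category(compound, threshold=0.05):
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Example compound scores as VADER might produce for three captions.
scores = {"a happy couple laughing": 0.72,
          "a man arguing angrily": -0.51,
          "a person standing": 0.0}

labels = {text: sentiment_category(c) for text, c in scores.items()}
print(labels)
```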
      <p>The new sentiment category was used for the clustering analysis and for solving the task of sentiment
analysis given the captions.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Pretrained models overview</title>
      <p>In order to analyze the ability of pretrained models to understand information as complex as
emotional attitude, recent SOTA models trained on large datasets of textual and visual information were
chosen.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1. ResNet</title>
      <p>
        ResNet, introduced by Kaiming He et al., is a deep convolutional architecture which surpassed
previous results on the Imagenet benchmark and proved successful for object detection, obtaining
a 28% relative improvement on the COCO object detection dataset. The main advantage of this
architecture is the addition of residual connections, which help to fight the vanishing gradient
problem typical for deep neural networks. This made it possible to train a very
deep network, each layer of which learned different useful features. In our work, ResNet152V2
pretrained on the Imagenet dataset was used. We also experimented with ResNet50 trained on the FER
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] dataset.
      </p>
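      <p>The residual connection idea can be illustrated in a few lines. This is a toy sketch, not actual ResNet code: f stands in for a block's convolutional layers, and the output is f(x) + x, so gradients can always flow through the identity path even when f's gradients vanish:</p>

```python
# Minimal illustration of a residual connection: the block output is
# f(x) + x. The identity path gives gradients a direct route through
# the block, which is what lets very deep networks train.

def residual_block(x, f):
    return [fi + xi for fi, xi in zip(f(x), x)]

f = lambda v: [0.1 * vi for vi in v]   # toy stand-in for conv layers
print(residual_block([1.0, 2.0], f))   # [1.1, 2.2]
```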
    </sec>
    <sec id="sec-5">
      <title>3.2. EfficientNet</title>
      <p>
        EfficientNet, introduced by Tan et al., is a deep convolutional neural network architecture and
scaling method that uniformly scales all dimensions of depth/width/resolution using a compound
coefficient. It achieves state-of-the-art 84.3% top-1 accuracy on ImageNet and transfers well to other
tasks, reaching state-of-the-art accuracy on CIFAR-100 [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] (91.7%), Flowers [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] (98.8%), and 3
other transfer learning datasets. In our work, an EfficientNet trained on the age-gender IMDB-WIKI dataset
was used.
      </p>
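      <p>The compound scaling rule can be sketched as follows. Depth, width and resolution are scaled jointly by a single coefficient phi; the base coefficients alpha=1.2, beta=1.1, gamma=1.15 are the ones reported in the EfficientNet paper, and the helper itself is an illustration, not library code:</p>

```python
# Sketch of EfficientNet-style compound scaling: all three network
# dimensions grow together as a function of one coefficient phi.

def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    return {"depth": alpha ** phi,        # layer-count multiplier
            "width": beta ** phi,         # channel-count multiplier
            "resolution": gamma ** phi}   # input-size multiplier

print(compound_scale(1))
```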
    </sec>
    <sec id="sec-6">
      <title>3.3. Word2Vec</title>
      <p>
        Word2Vec, introduced by Mikolov et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] is a neural-network-based approach to learning word
embeddings. The approach offers two training methods: CBOW and
skip-gram. In the CBOW approach, the model is asked to predict the current word given its
context, whereas the skip-gram one tries to predict words within a certain range before and after the
current word. As a result of such training, the model learns meaningful word vectors that are often
used for transfer learning. Word2Vec embeddings pretrained on Google News with the vectors’
dimensionality of 300 were used in the paper.
      </p>
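      <p>The skip-gram training-pair construction can be sketched as follows. This is a minimal illustration of how (target, context) pairs are formed from a sentence; CBOW inverts the direction, predicting the target from the surrounding context:</p>

```python
# For each target word, every word within `window` positions before and
# after it becomes a (target, context) training pair.

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = ["the", "cat", "sat"]
print(skipgram_pairs(sentence, window=1))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```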
      <p>The exact setup of the experiments, the layers from which the data representations were
derived, and the experimental results are discussed further in the paper.</p>
    </sec>
    <sec id="sec-7">
      <title>4. Experiments</title>
    </sec>
    <sec id="sec-8">
      <title>4.1. Image-captioning</title>
      <p>
        Image understanding is the process of interpreting regions/objects to figure out what's happening
in the image. This may include figuring out what the objects are, their spatial relationship to each
other, etc [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. This statement implies that one definition of scene understanding is the capability
of describing its context. Thus, we theorize that a model which can describe the emotional attitude
based on an image is capable of understanding it. The task of describing an image is known as
image-captioning, and it gained huge popularity with the development of deep neural networks [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Though
there are many different approaches to the task [20], we exploit only the encoder-decoder architecture,
where the encoder’s goal is to encode the representation of the image into a feature vector and
the decoder’s is to generate captions based on this information. The theoretical foundations of
constructing text messages / captions by modeling combinations of significant words are considered
in [21]. For the role of the encoder a convolutional neural network is often exploited, whereas for the role
of the decoder a recurrent one. In our work, different encoder-decoder
architectures for solving the task of image-captioning are investigated.
      </p>
      <p>As it was stated by Kovenko et al. [22], by solving the problem of data reconstruction,
autoencoders tend to learn low-level features which are useful for transfer learning. Based on this
idea, we train a deep convolutional autoencoder on our dataset and use the latent code produced by the
encoder part for encoding images in the image-captioning task. The experiments also include the output
of the 4th block of ResNet, as well as the logits of ResNet, as encoders. In order to evaluate these
transfer learning approaches, we also experiment with a custom, non-pretrained convolutional encoder.</p>
      <p>The decoder part is represented by the embedding layer and LSTM (Long-short-term-memory)
[23] network. LSTM is capable of learning long-time dependencies, which is especially useful when
working with sequential data. As the embedding layer, for all the experiments, Word2Vec was used.
For all the approaches, layer normalization [24] was applied after the LSTM. As it was stated by Xu et al.
[25], the attention mechanism applied to image-captioning tasks can greatly improve results. Nezami
et al. [26] showed that using additional emotion features helps to improve results on
image-captioning datasets that include emotional aspects. Based on these ideas, we experimented with
attention and with conditioning the LSTM on additional features. Differently from Nezami’s approach,
gender features were also used, and the emotional ones were encoded as a probability distribution. Specifically,
YoloV3 is used to extract face regions from the images, and an EfficientNet trained on the Age-Gender
dataset along with a ResNet trained on the FER one are used to predict gender and emotions.</p>
      <p>Gender features are produced from the predicted gender probabilities of each face present on the image
(formula 1):</p>
      <p>P_g = (1/N) ∑_{i=1}^{N} 1[p_i = g],  p_i = argmax(pred_i),  g = 1, …, G  (1)</p>
      <p>where G is the number of unique genders, g is a gender, N is the number of faces present on the image,
P is the normalized vector of gender probabilities, 1[p_i = g] is the indicator of p_i being equal to the
specific gender g, and p_i is the result of an argmax operation over the prediction probability vector
pred_i for a specific face i.</p>
      <p>Emotional features are produced as the normalized probability distribution of the sum of the emotion
probability vectors of each face present on the image (formula 2):</p>
      <p>E_m = ∑_{i=1}^{N} pred_{i,m} / ∑_{j=1}^{M} ∑_{i=1}^{N} pred_{i,j},  m = 1, …, M  (2)</p>
      <p>where E is the vector of averaged emotion probabilities, N is the number of faces present on the image,
pred_i is the prediction probability vector for a specific face i, and M is the number of unique emotions.</p>
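      <p>The gender and emotion feature computations can be sketched over toy per-face probability vectors. The vectors below are illustrative (in the pipeline they come from the EfficientNet gender classifier and the ResNet emotion classifier), and the two helpers are a sketch of formulas (1) and (2), not the actual implementation:</p>

```python
# Formula (1): the gender feature for gender g is the fraction of faces
# whose argmax-predicted gender equals g.
def gender_features(face_gender_probs):
    n = len(face_gender_probs)
    num_genders = len(face_gender_probs[0])
    counts = [0] * num_genders
    for probs in face_gender_probs:
        counts[probs.index(max(probs))] += 1   # argmax over one face
    return [c / n for c in counts]

# Formula (2): the emotion features are the per-emotion sums over faces,
# normalized so they form a probability distribution.
def emotion_features(face_emotion_probs):
    sums = [sum(face[m] for face in face_emotion_probs)
            for m in range(len(face_emotion_probs[0]))]
    total = sum(sums)
    return [s / total for s in sums]

genders = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]   # 3 faces, 2 genders
emotions = [[0.5, 0.5], [0.1, 0.9]]              # 2 faces, 2 emotions

print(gender_features(genders))    # ~[0.667, 0.333]
print(emotion_features(emotions))  # ~[0.3, 0.7]
```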
      <p>The data was split in the same way as for sentiment analysis. The approaches were validated
based on test set performance using the beam search technique with a beam size of 5. The BLEU score
along with perplexity were used as the main metrics. For all the experiments the RMSprop optimizer was
used, with an initial learning rate of 0.0001. In order not to overfit, a learning rate reduction technique was
used: if there was no improvement in validation perplexity for two epochs, the learning rate was reduced by
a factor of 10. All the models were trained with a batch size of 64 for 30 epochs (Fig. 3).</p>
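      <p>The beam-search decoding used for validation can be sketched in pure Python. This is a toy illustration (beam size 2 rather than the paper's 5, to keep the trace short); the step function, which returns next-token log-probabilities given the prefix, is a hypothetical stand-in for the trained LSTM decoder:</p>

```python
import math

def beam_search(step, start, end, beam_size=2, max_len=5):
    beams = [([start], 0.0)]                     # (token sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end:                   # finished beams carry over
                candidates.append((seq, score))
                continue
            for tok, logp in step(seq).items():
                candidates.append((seq + [tok], score + logp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy "model": after <s> prefer "a", then "man", then </s>.
def step(seq):
    table = {"<s>": {"a": math.log(0.7), "the": math.log(0.3)},
             "a": {"man": math.log(0.6), "woman": math.log(0.4)},
             "the": {"man": math.log(0.5), "woman": math.log(0.5)},
             "man": {"</s>": math.log(1.0)},
             "woman": {"</s>": math.log(1.0)}}
    return table[seq[-1]]

print(beam_search(step, "<s>", "</s>"))  # ['<s>', 'a', 'man', '</s>']
```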
      <p>Analyzing the results, it is obvious that the transfer-learning procedure gives better results than
training from scratch (ordinary) w.r.t. BLEU on the test set. It is also clear that the ResNet representation
tends to give better results than the autoencoder’s one, possibly because of a deeper architecture and
better learned features. Attention did not work well for any of the approaches, probably because of the low
number of samples in the dataset and the small number of epochs. The approach that utilized the logits
output of ResNet for the encoder part of the network, along with Word2Vec embeddings and additional
emotion features (resnet_logits_w2v_emotions), gave the best results on test data w.r.t. the averaged
BLEU score. The other model which is also worth paying attention to is the one that
incorporates both emotion and gender features. Although resnet_logits_w2v_emotions_gender did not
achieve the best performance on test BLEU, it reached the best balanced performance across all the data
splits, and was thus chosen as the best one. The architecture of the overall prediction pipeline is shown
in Fig. 4.</p>
      <p>As can be seen from Fig. 4, the overall pipeline depends on the face pre-processing step
along with the detection of emotions and gender. Obviously, if the performance of the highlighted steps is
poor, the final output will be at least biased. An example of such bias is presented in Fig. 5.</p>
      <p>During error analysis, it was found that the model suffers from slight overfitting on the most frequent
words and phrases (like “man is flirting with a woman” presented in Fig. 5), which is a problem
caused by the small diversity of the dataset. Despite the fact that the collected data is noisy, as
each image was annotated by a different expert, which is not very suitable for the task of
image-captioning, the model succeeds in giving adequate results on average (Fig. 6).
Figure 6 (a - e). Examples of generated captions. T - true caption, greedy - result of greedy decoding,
beam - result of beam search decoding. Captions which are fully inappropriate are marked with blue.</p>
      <p>It is important to note that longer training would probably give better results.</p>
    </sec>
    <sec id="sec-9">
      <title>5. Conclusion and further work</title>
      <p>In this paper, we analyzed the ability of deep learning models to understand the emotional attitude driven by
a situation. For this purpose, a new dataset with image-text pairs was presented. As a result of the analysis of
pretrained SOTA models, it was concluded that some of them can be used for transfer learning. Through the
experiments it was shown that the dataset can be used to solve the problem of sentiment analysis. It was then
theorized that the problem of understanding emotional attitude can be transferred to the task of
image-captioning. Empirical results have shown that the addition of emotion and gender features, along with
transfer learning based on the ResNet network and Word2Vec embeddings, improves the overall captioning performance. Our
approach gives pleasant results on average, confirming that deep learning models are able to understand
emotional attitude if they are trained to. It is important to note that such an approach has many downsides, as it is
dependent on the performance of three additional models for face, emotion and gender detection. The other
problem faced is the noisy nature of the dataset and the small variation of phrases in it. In future work it is
planned to gather a bigger dataset, label each image with 5 captions and fix the current problems.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Acknowledgements</title>
      <p>We would like to thank Oleksii Abdullaiev, Dmytro Tarasovskyi and Dmytro Maliovanyi for their
contribution to the creation of the dataset.</p>
    </sec>
    <sec id="sec-11">
      <title>7. References</title>
      <p>[20] Hossain, M. Z., Sohel, F., Shiratuddin, M. F., &amp; Laga, H. (2019). A comprehensive survey
of deep learning for image captioning. ACM Computing Surveys (CSUR), 51(6), 1-36.
[21] Bisikalo, O., Bogach, I., &amp; Sholota, V. (2020). The Method of Modelling the Mechanism of
Random Access Memory of System for Natural Language Processing. In 2020 IEEE 15th
International Conference on Advanced Trends in Radioelectronics, Telecommunications and
Computer Engineering (TCSET) (pp. 472-477). doi: 10.1109/TCSET49122.2020.235477.
[22] Kovenko, V., &amp; Bogach, I. (2020). A Comprehensive Study of Autoencoders' Applications
Related to Images. In IT&amp;I Workshops (pp. 43-54).
[23] Hochreiter, S., &amp; Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8),
1735-1780.
[24] Ba, J. L., Kiros, J. R., &amp; Hinton, G. E. (2016). Layer normalization. arXiv preprint
arXiv:1607.06450.
[25] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., ... &amp; Bengio, Y. (2015, June).
Show, attend and tell: Neural image caption generation with visual attention. In International
conference on machine learning (pp. 2048-2057). PMLR.
[26] Nezami, O. M., Dras, M., Anderson, P., &amp; Hamey, L. (2018, September). Face-cap: Image
captioning using facial expression analysis. In Joint European Conference on Machine Learning
and Knowledge Discovery in Databases (pp. 226-240). Springer, Cham.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Elizabeth</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Liddy</surname>
          </string-name>
          .
          <source>Natural Language Processing. In Encyclopaedia of Library and Information Science</source>
          , 2nd Ed. NY. Marcel Decker, Inc. https://surface.syr.edu/cgi/viewcontent.cgi?
          <source>article=1043&amp;context=istpub (accessed 12 December</source>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>IBM</surname>
          </string-name>
          . What is Computer Vision? https://www.ibm.com/topics/computer-vision
          <source>(accessed 12 December</source>
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
          </string-name>
          . et al.,
          <year>2009</year>
          .
          <article-title>Imagenet: A large-scale hierarchical image database</article-title>
          .
          <source>In 2009 IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Yosinski</surname>
          </string-name>
          ,
          <string-name>
            <surname>Jason</surname>
          </string-name>
          , et al.
          <article-title>How transferable are features in deep neural networks?</article-title>
          .
          <source>arXiv preprint arXiv:1411.1792</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kovenko</surname>
          </string-name>
          , Volodymyr; Abdullaiev, Oleksii; Maliovanyi, Dmytro; Tarasovskyi, Dmytro; Bogach, Ilona; Bisikalo,
          <string-name>
            <surname>Oleh</surname>
          </string-name>
          (
          <year>2021</year>
          ), “
          <article-title>EmoAtCap : Emotional attitude captioning dataset”, Mendeley Data, V5</article-title>
          , doi: 10.17632/dym6p2pvbt.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Abadi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agarwal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barham</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brevdo</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Citro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Tensorflow: Large-scale machine learning on heterogeneous distributed systems</article-title>
          .
          <source>arXiv preprint arXiv:1603</source>
          .
          <fpage>04467</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Paszke</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gross</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massa</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lerer</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradbury</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chanan</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Chintala</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2019</year>
          ).
          <article-title>Pytorch: An imperative style, high-performance deep learning library</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ,
          <volume>32</volume>
          ,
          <fpage>8026</fpage>
          -
          <lpage>8037</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>T. Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maire</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Belongie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hays</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perona</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramanan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          (
          <year>2014</year>
          ,
          <article-title>September)</article-title>
          .
          <article-title>Microsoft coco: Common objects in context</article-title>
          .
          <source>In European conference on computer vision</source>
          (pp.
          <fpage>740</fpage>
          -
          <lpage>755</lpage>
          ). Springer, Cham.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Redmon</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Yolov3: An incremental improvement</article-title>
          . arXiv preprint arXiv:
          <year>1804</year>
          .02767.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Gurari</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stangl</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Bigham</surname>
            ,
            <given-names>J. P.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Vizwiz grand challenge: Answering visual questions from blind people</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          (pp.
          <fpage>3608</fpage>
          -
          <lpage>3617</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Mathews</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          (
          <year>2016</year>
          , March). Senticap:
          <article-title>Generating image descriptions with sentiments</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence</source>
          (Vol.
          <volume>30</volume>
          , No. 1).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Rothe</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Timofte</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Van Gool</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Dex: Deep expectation of apparent age from a single image</article-title>
          .
          <source>In Proceedings of the IEEE international conference on computer vision workshops</source>
          (pp.
          <fpage>10</fpage>
          -
          <lpage>15</lpage>
          ), doi: 10.1109/ICCVW.
          <year>2015</year>
          .
          <volume>41</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hutto</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          (
          <year>2014</year>
          , May).
          <article-title>Vader: A parsimonious rule-based model for sentiment analysis of social media text</article-title>
          .
          <source>In Proceedings of the International AAAI Conference on Web and Social Media</source>
          (Vol.
          <volume>8</volume>
          , No. 1).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carrier</surname>
            ,
            <given-names>P. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hamner</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          .
          <article-title>Challenges in representation learning: A report on three machine learning contests</article-title>
          .
          <source>Neural Networks</source>
          ,
          <volume>64</volume>
          :
          <fpage>59</fpage>
          -
          <lpage>63</lpage>
          ,
          <year>2015</year>
          . doi: 10.1016/j.neunet.2014.09.005.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          .
          <article-title>Learning Multiple Layers of Features from Tiny Images</article-title>
          .
          <source>Tech Report</source>
          . https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed 12 December
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Nilsback</surname>
            ,
            <given-names>M. E.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2008</year>
          , December).
          <article-title>Automated flower classification over a large number of classes</article-title>
          .
          <source>In 2008 Sixth Indian Conference on Computer Vision, Graphics &amp; Image Processing</source>
          (pp.
          <fpage>722</fpage>
          -
          <lpage>729</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Barz</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Denzler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2019</year>
          , January).
          <article-title>Hierarchy-based image embeddings for semantic image retrieval</article-title>
          .
          <source>In 2019 IEEE Winter Conference on Applications of Computer Vision</source>
          (WACV) (pp.
          <fpage>638</fpage>
          -
          <lpage>647</lpage>
          ). IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Morse</surname>
            ,
            <given-names>B. S.</given-names>
          </string-name>
          .
          <article-title>Image Understanding</article-title>
          . http://www.sci.utah.edu/~gerig/CS6640-F2012/Materials/BMorse-BYU-iu-active-contours.pdf (accessed 12 December
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toshev</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Show and tell: A neural image caption generator</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          (pp.
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>