A Multi-class Approach – Building a Visual Classifier based on Textual Descriptions using Zero-Shot Learning

Preeti Jagdish Sajjan and Frank G. Glavin
School of Computer Science, College of Science and Engineering, National University of Ireland, Galway, Ireland.
p.sajjan1@nuigalway.ie, frank.glavin@nuigalway.ie

Abstract. Machine Learning (ML) techniques for image classification routinely require many labelled images for training the model, and the test images must belong to the same domain as those used for training. In this paper, we address two main hurdles of ML, namely the scarcity of labelled data and the constrained prediction space of the classification model. We do this by introducing a visual classifier which uses a concept of transfer learning, namely Zero-Shot Learning (ZSL), and standard Natural Language Processing techniques. We train a classifier by mapping labelled images to their textual descriptions instead of training it for specific classes. Transfer learning involves transferring knowledge across domains that are similar. ZSL intelligently applies the knowledge learned during training to future recognition tasks. ZSL distinguishes two types of classes: seen and unseen. Seen classes are the classes on which the model is trained, and unseen classes are the classes on which the model is tested; examples from unseen classes are not encountered in the training phase. Earlier research in this domain focused on developing a binary classifier but, in this paper, we present a multi-class classifier with a Zero-Shot Learning approach.

Keywords: Transfer Learning · Computer Vision · NLP · Zero-Shot Learning · Meta-Learning · Image Classification.

1 Introduction

In this work, we develop a visual classifier capable of classifying images based on their textual descriptions. This is made possible by mapping visual features from images and textual features from the descriptions to the semantic label space of the classes. We do this by using Zero-Shot Learning (ZSL). As the name would suggest, Zero-Shot Learning can be defined as a setup in which the model is given certain classes during its testing phase which were not included in the training phase. In simpler terms, ZSL intelligently allows the model to recognise and classify objects that it has never seen before with some degree of certainty. We consider the earlier work by Elhoseiny et al. [5] as a base reference.

1.1 Motivation and Dataset

Concepts of transfer learning, such as Zero-Shot Learning and Few-Shot Learning, have gained greater visibility and produced significant research in the past decade. The main motivation for applying the concepts of Zero-Shot Learning is to achieve a model that can classify images into categories without being trained on them. The key idea of ZSL is to explore and exploit the knowledge of how an unseen class is semantically related to the seen classes. We present a classifier that is built using various emerging Computer Vision, Deep Learning, and NLP techniques. The novelty here is in creating a state-of-the-art Zero-Shot Learning multi-class classification model that learns to map images to their textual descriptions and, in turn, to their class labels. While our model is trained on a much smaller dataset, when compared to earlier work, we still achieve promising results.
The dataset that we use for building our model is the Caltech Birds dataset (CUB-200-2011) [17], which contains 11,788 images of 200 bird species. Textual descriptions for each of these 200 classes are obtained from earlier research [5], with the information extracted manually from Wikipedia. Examples of the textual descriptions and the Caltech Birds dataset, organised with respect to the class labels, are shown in Figure 1.

Fig. 1. Samples from the dataset [Class Label, Text Description, Images]
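To make the data organisation concrete, the following is a minimal sketch of loading the image list, class labels, and per-class descriptions. It assumes the standard CUB-200-2011 release layout (images.txt, image_class_labels.txt, classes.txt) and a hypothetical wikipedia_text/ folder holding one description file per class; the paths and file-naming convention shown here are illustrative assumptions.

```python
import os
import pandas as pd

CUB_ROOT = "CUB_200_2011"     # standard CUB-200-2011 release layout (assumed path)
TEXT_DIR = "wikipedia_text"   # hypothetical folder: one Wikipedia description per class

# images.txt: "<image_id> <relative_path>", image_class_labels.txt: "<image_id> <class_id>",
# classes.txt: "<class_id> <class_name>", all space separated.
images = pd.read_csv(os.path.join(CUB_ROOT, "images.txt"),
                     sep=" ", names=["image_id", "path"])
labels = pd.read_csv(os.path.join(CUB_ROOT, "image_class_labels.txt"),
                     sep=" ", names=["image_id", "class_id"])
classes = pd.read_csv(os.path.join(CUB_ROOT, "classes.txt"),
                      sep=" ", names=["class_id", "class_name"])

# One row per image with its class label attached.
dataset = images.merge(labels, on="image_id").merge(classes, on="class_id")

def class_description(class_name):
    """Read the Wikipedia description for a class, e.g. '001.Black_footed_Albatross'.
    The '<class_name>.txt' naming is illustrative only."""
    fname = class_name.split(".", 1)[-1] + ".txt"
    with open(os.path.join(TEXT_DIR, fname), encoding="utf-8") as f:
        return f.read()
```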
2 Background Information and Related Work

2.1 Transfer Learning

Transfer Learning (TL) involves transferring the knowledge learnt in one domain to another domain. There are many real-world applications in which collecting new training data can be difficult or expensive. Transfer learning aims to reduce the need to collect such training data by transferring knowledge between the task domains. Pan and Yang [10] presented a survey on TL and noted that its motivation was initially discussed in a NIPS-95 workshop as “Learning to Learn”, which focused on the need for machine learning methods to learn, and then use this learned knowledge, when a sample from an unseen category or domain appears.

2.2 Applying Zero-Shot Learning to Image Classification

The lack of training data for each class, and the need to learn both local and global features for a group of images, make image classification using Zero-Shot Learning a challenging task. Li et al. [8] formulate ZSL as a conditioned image classification problem in which they aim to classify the visual features using a classifier that has learned from semantic descriptions. ZSL models typically learn a mapping function that maps the feature space to the semantic vector space. Since, with ZSL, the model only has visibility of instances from the training classes, it suffers from a projection domain shift. This problem of ZSL was first identified by Fu et al. [6]. Kodirov et al. [7] present Semantic Autoencoders, in which an encoder aims to project the visual feature vectors to the semantic vector space and a decoder aims to reconstruct the original visual feature vectors from the semantic vector space. Once the model is trained, the authors retrieve the encoder model, which establishes an optimal mapping function, as a solution to this problem. Akata et al. [2] use images from seen classes and semantic attributes, from both seen and unseen classes, to learn two dictionaries (a “coupled dictionary”) that can sparsely represent the visual and semantic feature vectors of an image. They also provide an attribute-aware system to address the domain shift and hubness problems of ZSL (hubness is the problem in which a few points, or hubs, are too close to many others, especially in high-dimensional spaces). Akanksha et al. [1] propose a Semantically Aligned Bias Reducing (SABR) approach that focuses on overcoming the hubness problem by learning a latent space which preserves the semantic relation between the labels and then encodes the discriminating information of the classes.

2.3 Mapping Textual Descriptions to Images

Humans can write a summary of the events seen in an image to provide a better understanding of that image. Wang et al. [16] describe a system that establishes a link from an image to a sentence using a score obtained from comparing the context vector of an image with the context vector of a sentence. Later, other researchers [15] presented a system that predicted natural language descriptions automatically from the image input. This is made possible by using recognition algorithms and exploiting statistics obtained from parsing large amounts of text data. Carrara et al. [4] proposed Text2Vis, a neural network that is capable of generating a visual representation, in the visual feature space of ImageNet, from a short textual description. This concept approaches our image classification objective from the opposite direction. In more recent work, Otto et al. [9] outlined an approach to understand, categorise, and predict the semantic relations between an image and text. Here, the authors derive a categorisation of eight semantic relations in image-text pairs and illustrate how they can systematically be characterised by a set of metrics. They make use of a Deep Learning system to predict the classes using multimodal embeddings.

2.4 Image Classification based on Textual Descriptions

Many earlier approaches relevant to this paper use texts from the web to train a Zero-Shot Learning classifier. Elhoseiny et al. [5] proposed an optimised formulation that combines knowledge transfer techniques and a regression function to predict a visual classifier. Ba et al. [3] used deep neural networks to predict convolutional classifiers, using text features to find the optimal weights for the layers of a deep neural network. This approach was reported to give a noticeable improvement in zero-shot classification. Qiao et al. [13] revisited the importance of regularisation in ZSL. They show that applying the attribute-based formulation to text achieves better performance.

3 Model Architecture and Methodology

In this section, we describe our model architecture in detail. It is divided into three components: i) extracting features from images, ii) extracting features from text documents, and iii) building a ZSL model that learns to classify an image as a result of mapping these attributes.

3.1 Image Feature Extraction

Every image is stored in a machine in the form of a matrix, where each element of the matrix holds a pixel value. Any mathematical operation on an image is inherently performed on this matrix. Through image feature extraction, we intend to collect representative data from the images. Features play a crucial role in the domains of computer vision and image processing. Keras provides a wide variety of Deep Learning models that are used for image classification, feature extraction, and transfer learning. We decided to use VGG16 [14] due to its promising results in learning critical features in images. VGG16 is a convolutional neural network with 16 layers. This model was proposed by Simonyan and Zisserman [14] and the authors report achieving a 92.7% top-5 test accuracy. As an aside, we will explain what we mean by ‘top-5 accuracy’. If we are testing our model with an image of a ‘cat’ and the model predicts the classes { ‘lion’, ‘tiger’, ‘cat’, ‘leopard’, ‘dog’ } as its five most probable classes, then, since the expected class is among the predicted classes, we consider this a true positive instance.
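The following is a minimal sketch of this kind of feature extraction with the pre-trained Keras VGG16, reading the output of the second fully connected layer ('fc2') as the 1 × 4096 image representation (see also Section 5.1); the helper name and preprocessing steps shown are illustrative.

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image

# Pre-trained VGG16 with ImageNet weights; bypass the final 1000-way
# classification layer by reading the output of the 'fc2' layer (4096 units).
base = VGG16(weights="imagenet", include_top=True)
extractor = Model(inputs=base.input, outputs=base.get_layer("fc2").output)

def image_features(img_path):
    """Return a 4096-dimensional feature vector for one image."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x)[0]      # shape: (4096,)
```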
3.2 Text Feature Extraction

A challenging problem when dealing with text descriptions is understanding the context. Traditional NLP techniques can fail when context is important. The second component of our model is to extract features from the text. We have a total of 200 textual descriptions, one for each class. Most previous work uses Term Frequency-Inverse Document Frequency (TF-IDF) to retrieve features from the text. Some researchers also define attributes manually, helping the model to focus on important aspects of the text. However, TF-IDF is inefficient at capturing semantics and overlooks the position of text in a document, while manually defining attributes makes the model unstable when it encounters raw textual descriptions. As a solution, we turn to one of the biggest breakthroughs in understanding context in text: ELMo (Embeddings from Language Models), a state-of-the-art NLP framework introduced by AllenNLP [11].

ELMo is a deep contextualised word representation that captures the semantics and context of a word in a sentence [11]. Word vectors from ELMo are obtained as a result of computations carried out on top of a two-layer bidirectional language model (biLM), where each layer has a forward and a backward pass. This architecture converts a raw text string to word vectors with the help of a character-level CNN. These word vectors are provided as input to the first layer of the biLM, the output of which forms intermediate word vectors which are then fed to the second layer of the biLM. ELMo word vectors are the result of a weighted sum of the raw word vectors and the two intermediate word vectors. Since this architecture works at the character level, and forms the vectors considering the entire sentence, a word can have different word vectors under different contexts [12]. TensorFlow Hub is the library that enables us to use ELMo in our work.

3.3 Building a Zero-Shot Learning Model

The visual feature vectors and textual feature vectors retrieved above are fed into our neural network as input. We form two sets of classes: seen (training) and unseen (testing or zero-shot). We then form two datasets (seen and zero-shot) within our code that each consist of two attributes (image features, text features) and one label (class). One might wonder: if we are not including samples from the test classes during training, how is it possible for our model to predict those classes? This is what makes Zero-Shot Learning a challenging and interesting area. To build a model, we have to make sure that both the independent and dependent attributes are located in the same space. To achieve this, we explore the relationship between these attributes with the help of an intermediate-level semantic vector representation. This representation is introduced to enable knowledge sharing and to establish a mapping function between seen and unseen classes. These semantic representations can be obtained either from semantic attributes or from semantic word vectors. We focus on semantic word vectors, where we project each class label into the semantic space and these projections are then used as prototypes for our Zero-Shot Learning classes. For simplicity, let us call these semantic word vectors class vectors. To obtain class vectors for each class in our dataset, we use Google’s Word2Vec model.

To summarise, for training classes, we have image features, text features, training class labels, and training class vectors. However, for testing classes, we only have their class vectors visible to our model. Note that the model has never seen any samples from the testing classes and only has visibility of their class vectors, as illustrated in Figure 2.

Fig. 2. Visibility for training and testing our model
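The following is a minimal sketch of how such class vectors could be built with the pre-trained Google News Word2Vec model via gensim (see also Section 5.3). How a single vector is obtained for a multi-word class name is not fully specified in our description, so averaging the constituent word vectors, the variable all_class_names, and the file names are all assumptions made for illustration.

```python
import numpy as np
from gensim.models import KeyedVectors

# Pre-trained 300-dimensional Google News Word2Vec vectors (path is assumed).
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def class_vector(class_name):
    """Average the word vectors of the words in a class name, e.g. 'Brewer Blackbird'.
    Returns None if no constituent word is in the Word2Vec vocabulary."""
    words = [w for w in class_name.replace("_", " ").split() if w in kv]
    return np.mean([kv[w] for w in words], axis=0) if words else None

# Keep only classes with a usable embedding (196 of the 200 classes in our setting).
# all_class_names: hypothetical list of the 200 class label strings.
class_vectors = {name: vec for name in all_class_names
                 if (vec := class_vector(name)) is not None}

# Saved locally and reloaded later when mapping class vectors to class labels.
np.save("class_vectors.npy", class_vectors)
# reload with: np.load("class_vectors.npy", allow_pickle=True).item()
```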
4 Proposed Model

Since we aim to develop Zero-Shot Learning classifiers, we have to define a training model that learns to map our attributes (visual features from the images and textual features from the descriptions) to their class vectors and, in turn, to their class labels. In order to achieve this, we use categorical cross-entropy as our loss function. If we consider a visual feature vector $x_i$, the corresponding textual feature vector $t_j$, and $I_{i,j}$ as the actual class label in categorical (one-hot) form, then the loss function is computed using the following equation:

$$\mathrm{Loss} = -\sum_{i=1}^{C} I_{i,j} \log\big(\hat{y}_j(x_i, t_j)\big)$$

where $\hat{y}_j(x_i, t_j)$ is the $j$-th scalar value in the prediction we obtain from our model. Once we are done training our model, we pop out the last layer so that we get class vectors as predictions. When a sample from an unseen class is given to this network, we obtain a semantic word vector as output. We know that if this output vector is nearer to a certain class vector, then the sample has a higher probability of belonging to that class. By measuring the output vector’s distance to all the class vectors, we are able to perform classification.

Let $x^{(n_1)}$ denote the visual feature vector for an image, $y^{(n_2)}$ the textual feature vector for the corresponding textual description, and $l \in \{1, \dots, C\}$ the class labels, where $(n_1, n_2)$ are the vector dimensionalities, $x, y \in \mathbb{R}^d$, and $C$ is the total number of classes in our dataset. We split this data into training and testing sets with $M$ and $N$ instances respectively, such that their label sets are disjoint: $p \in l_M \Rightarrow p \notin l_N$ and $p \in l_N \Rightarrow p \notin l_M$. Therefore, we now have a training set $D_{\mathrm{train}} = \{x^{(n_1)}_{\mathrm{train}}, y^{(n_2)}_{\mathrm{train}}, l_{\mathrm{train}}\}$ and a testing set $D_{\mathrm{test}} = \{x^{(n_1)}_{\mathrm{test}}, y^{(n_2)}_{\mathrm{test}}, l_{\mathrm{test}}\}$. The model is trained on the training set $D_{\mathrm{train}}$, and the testing set $D_{\mathrm{test}}$ is not seen until the model is being tested.

4.1 Training Model

We consider 171 classes for training, with 60 images per class. Once we have the visual feature vectors and the corresponding textual feature vectors, we feed both of these vectors to our neural network. The network consists of 10 hidden layers and one semantic vector space layer with 300 neurons (the embedding size), followed by the output layer with 171 neurons (the number of training classes). All the hidden layers use the ‘relu’ activation function and the final output layer uses ‘softmax’. While training the network, the model learns to map both feature vectors to the corresponding class labels with the help of the class vectors in the semantic vector space. So, in turn, it learns to map the feature vectors to their class vectors. The illustration in Figure 3 explains this clearly.

Fig. 3. Proposed Training Model with image and text input

Brewer Blackbird is one of the training classes, and a model with a sample of this class is illustrated in Figure 3. As we can see, the image for this class is fed to VGG16, which yields an image feature vector of 1 × 4096 dimensions. On the other hand, the corresponding textual document is fed to ELMo, which yields a textual feature vector of 1 × 1024 dimensions. These two feature vectors form the attributes of our model and are fed in to train our neural network, the mapping function. Once we finish training our network, we save the model along with its weights, which are used during testing.
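The following is a minimal Keras sketch of the mapping network just described: the concatenated image and text features pass through ReLU hidden layers into a 300-neuron semantic layer, followed by a 171-way softmax trained with categorical cross-entropy. The hidden layer widths and the choice of optimiser are assumptions, as they are not specified in the text.

```python
from tensorflow.keras import layers, models

NUM_SEEN_CLASSES = 171    # training (seen) classes
EMBEDDING_SIZE = 300      # dimensionality of the semantic (class vector) space

img_in = layers.Input(shape=(4096,), name="image_features")   # VGG16 output
txt_in = layers.Input(shape=(1024,), name="text_features")    # ELMo output
x = layers.Concatenate()([img_in, txt_in])

# Ten ReLU hidden layers as described above; the widths here are illustrative.
for units in (2048, 2048, 1024, 1024, 512, 512, 512, 512, 512, 512):
    x = layers.Dense(units, activation="relu")(x)

semantic = layers.Dense(EMBEDDING_SIZE, activation="relu", name="semantic_space")(x)
output = layers.Dense(NUM_SEEN_CLASSES, activation="softmax", name="class_probs")(semantic)

model = models.Model([img_in, txt_in], output)
model.compile(optimizer="adam",                  # optimiser choice is an assumption
              loss="categorical_crossentropy",   # loss as stated in Section 4
              metrics=["accuracy"])
# model.fit([X_img, X_txt], Y_onehot, ...)

# For zero-shot testing, the softmax layer is dropped and the 300-d semantic
# output is used as the predicted class vector (see Section 4.2).
zsl_model = models.Model([img_in, txt_in], model.get_layer("semantic_space").output)
```

At test time, `zsl_model` predicts a 1 × 300 semantic vector for any image-text pair, which is then matched against the class vectors as described in the next subsection.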
4.2 Testing Model

We consider 25 classes (each with 60 images) for testing our model. When we say unseen classes, we mean that samples belonging to these 25 classes have not been encountered before; the visual features and textual features of these classes are entirely new to the model. The process initially mirrors that of training. To perform ZSL, we pop out the final softmax layer from the training model discussed above and let the model predict the class vector in the semantic vector space. Now, for any pair of attributes, our model will predict a class vector of 1 × 300 dimensions in the semantic vector space.

Semantic Vector Space Mapping: Let us now see how we retrieve the class label when our model predicts only the class vector. We make use of K-dimensional trees, popularly known as KDTrees. A KDTree is a binary search tree in which the data in each node is a K-dimensional point in space. We can derive the class labels by providing a query to this tree instance along with the number of nearest nodes, or vectors, we intend to consider (here k = 10, 5, 1). The returned class labels are arranged in increasing order of their distance from the predicted class vector.

Fig. 4. Semantic Vector Space illustrating the visibility of the model during training

5 Experimental Settings

5.1 Image Feature Extraction

When compared to traditional hand-engineered feature extraction techniques, CNNs outperform them by learning complex and crucial feature representations from the raw image pixels. The image feature vectors obtained from CNNs hold local and spatial information from the pixels of the image. Therefore, we import the pre-trained VGG16 from the Keras models and, by popping the final softmax layer, we extract 1 × 4096 dimension feature vectors. The optimiser used while training this model is stochastic gradient descent with a learning rate of 0.1 and the loss is categorical cross-entropy. The remaining hyper-parameters are set to their default values. We load the images using the Keras image loading utilities and preprocess them before feeding them to the network.

Suppose we have $N$ images belonging to $C$ classes with labels $l = \{1, \dots, C\}$. For every $n \in N$, $VGG16(n)$ yields a feature vector $x^{(n_1)}$, where $n_1$ is the dimensionality of the vector, i.e. $1 \times 4096$. Once we extract feature vectors for all the images, we have $x^{(n_1)}_N$.

5.2 Textual Feature Extraction

The features from the textual descriptions are obtained from ELMo using TensorFlow Hub. We extract the embeddings in the form of a dictionary. To successfully extract features from the text, we have to clean it and perform lemmatisation (normalisation), which converts each word to its base form. We provide this preprocessed text as input to the word embedding instance (ELMo), which outputs the corresponding feature vector. As specified, we have 200 classes in our dataset and the textual descriptions for each of these 200 classes are extracted purely from Wikipedia. Let us assume that we have $M$ documents, where $M = C$ classes with labels $l = \{1, \dots, C\}$. For every $m \in M$, $ELMo(m)$ yields an embedding, or textual feature vector, $y^{(n_2)}$, where $n_2$ is the dimensionality of the vector, i.e. $1 \times 1024$. Once we extract feature vectors for all the textual documents, we have $y^{(n_2)}_M$.
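The following is a minimal sketch of this extraction using the publicly available ELMo module on TensorFlow Hub, whose 'default' output is a mean-pooled 1024-dimensional vector per input string. It assumes a TensorFlow 1.x session-style environment; the cleaning and lemmatisation steps described above are omitted for brevity.

```python
import tensorflow as tf           # TensorFlow 1.x session-style execution assumed
import tensorflow_hub as hub

# Public ELMo module; trainable=False keeps the pre-trained biLM weights fixed.
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

def text_features(documents):
    """Return one 1024-dimensional ELMo vector per (preprocessed) document."""
    embeddings = elmo(documents, signature="default", as_dict=True)["default"]
    with tf.Session() as sess:
        sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
        return sess.run(embeddings)          # shape: (len(documents), 1024)

# e.g. text_vectors = text_features(cleaned_descriptions)   # one entry per class
```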
5.3 ZSL Model

The two feature vectors obtained above, $x^{(n_1)}$ and $y^{(n_2)}$, are the two independent variables, or attributes, of our model. In our experiments, we found that reducing the dimensionality of the visual feature vector from 1 × 4096 to 1 × 1024 increased the model performance by 5%. This dimension reduction is performed by a neural network with 3 hidden layers equipped with a ‘relu’ activation function and a final ‘softmax’ layer. To avoid overfitting, we perform batch normalisation and add dropout layers to the network. We feed this reduced visual feature vector to the neural network along with the textual feature vectors. This neural network is developed with five hidden layers, followed by the semantic vector space layer and then by the final softmax layer. The hidden layers and the semantic vector space layer are equipped with a ‘relu’ activation function.

Let us now explain how we get class vectors for each class label in the semantic vector space. From our experimental results, we found that Word2Vec establishes better embeddings when compared to those of GloVe. This is mostly because we wish to extract embeddings for the scientific names of the species. Word2Vec makes it possible for us to have embeddings for 196 of the 200 classes, so we remove the classes that it cannot handle. As a result, we have 11,547 images belonging to 196 classes. Word2Vec provides us with an embedding of 1 × 300 dimensions. We obtain an instance of Word2Vec from the gensim models and, once we extract the class vectors from this instance, we save them locally as a numpy file. We load this numpy file later while mapping the class vectors to the class labels. The final layer of the model is responsible for this mapping function. This is achieved with the help of the customised kernel initialisation offered by Keras. Here we load the saved numpy file with class labels as keys and class vectors as values and, for every instance in the training dataset, we return the vector associated with the class label (key), which makes it possible for our model to learn a mapping of the semantic vectors to the class labels.

6 Results and Observations

6.1 Model Evaluation

The evaluation metric we use is top-k accuracy. According to Ba et al. [3], the best way to evaluate a multi-class classifier is by sorting the final prediction scores $\hat{y}_c$ obtained from the model. We deal with Zero-Shot Learning on top of multi-class classification, which increases the complexity due to the much larger prediction space.

Table 1. Proposed Zero-Shot Learning model performance on the Caltech Birds dataset

                 Top-10 Accuracy (%)   Top-5 Accuracy (%)   Top-1 Accuracy (%)
Seen classes     99.9 ± 0.01           99 ± 0.5             97 ± 0.5
Unseen classes   43 ± 0.7              32 ± 0.5             19 ± 0.8

Table 1 illustrates the results achieved by our model when classifying samples from both the seen and unseen class categories of the Caltech Birds dataset. Seen samples are tested using the usual machine learning technique of splitting the data (here belonging to the training classes) into training and testing sets and evaluating the performance of the model. The above results demonstrate that our approach of developing a multi-class Zero-Shot Learning classifier can classify images belonging to classes it has seen and also to classes it has never seen before.
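The following is a minimal sketch of how the top-k figures for the unseen classes can be computed by combining the KDTree lookup of Section 4.2 with the predicted semantic vectors; the function and variable names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def top_k_accuracy(pred_vectors, true_labels, class_vectors, class_labels, k=5):
    """pred_vectors: (n_samples, 300) semantic vectors from the trimmed network.
    class_vectors: (n_classes, 300) Word2Vec prototypes of the candidate classes.
    class_labels:  class names aligned with the rows of class_vectors.
    true_labels:   ground-truth class names for the test samples."""
    tree = cKDTree(class_vectors)
    _, idx = tree.query(pred_vectors, k=k)               # k nearest prototypes per sample
    idx = np.asarray(idx).reshape(len(pred_vectors), -1)  # handles k=1 as well
    hits = sum(truth in [class_labels[j] for j in row]
               for truth, row in zip(true_labels, idx))
    return hits / len(true_labels)

# e.g. top_k_accuracy(zsl_model.predict([X_img_test, X_txt_test]),
#                     y_test_names, unseen_class_vectors, unseen_class_names, k=5)
```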
6.2 Performance Evaluation

Many earlier research approaches use articles from the web to develop a Zero-Shot Learning classifier. One such approach, which gave a noticeable improvement in zero-shot classification, is that of Ba et al. [3], who considered deep neural networks to predict convolutional classifiers. Since this is the most relevant to our work, we discuss the results of this approach. The authors considered 160 seen and 40 unseen classes, randomly assigned from the available 200 classes. Ba et al. achieved, at best for the unseen classes, a top-5 accuracy of 42.8% and a top-1 accuracy of 12%. The authors also report the performance of the model for seen classes: a top-5 accuracy of 66.8%. The training-testing split for the seen class evaluation is not reported in their work.

Using the same dataset, we have developed a model using an altogether different approach. We use the word embedding model ELMo for extracting textual features and VGG16, pre-trained on the ImageNet dataset, for extracting visual feature vectors. Both are then fed to the neural network as attributes. Our model considers 171 classes as seen and 25 classes as unseen. For the unseen classes, the model achieves a top-5 accuracy of 32 ± 0.5% and a top-1 accuracy of 19 ± 0.8%. To test the performance of our model on the seen classes, we considered a 70:30 random split of the training dataset (samples from the 171 training classes), and our model achieved a top-5 accuracy of 99 ± 0.5% and a top-1 accuracy of 97 ± 0.5%. Since we were able to find the optimal weights for our multi-class classifier, the proposed model performs extremely well when classifying images belonging to both seen and unseen classes.

Top-1 accuracy is when the model assigns the highest probability to the actual class. Since we are able to achieve promising results for the most critical prediction, i.e. top-1 accuracy, on the unseen classes, and given that the model is trained on a much smaller dataset, it is evident that the model proposed in this paper is efficient and achieves promising results.

6.3 Model Predictions

Figure 5 shows a visualisation of the predictions produced by our model. This visualisation illustrates the top-5 accuracy metric explained earlier, where the leftmost class among the nearest neighbour labels has the highest similarity score, and the similarity score decreases to the right. ‘Marsh Wren’, ‘Ringed Kingfisher’, and ‘Ivory Gull’ all belong to our testing classes, i.e. the classes whose samples are never seen by the model during training. The model performance is satisfactory for the first two examples. We can see that the third example is an instance where our model is unable to predict the exact class label, but the classes chosen by the model among the top-5 nearest neighbours belong to the same parent class (‘Gull’).

Fig. 5. Model predictions on testing of zero-shot classes

7 Conclusion

In this paper, we introduce a multi-class Zero-Shot Learning model that learns to predict the label for images belonging to unseen classes from their Wikipedia articles. We have developed a deep neural network to establish a mapping function that maps the textual feature vectors from the raw Wikipedia articles, and the visual feature vectors from the images, to a semantic space with the help of semantic vectors derived from the class labels. This can also be interpreted as the ability of the model to intelligently apply the knowledge acquired over the training classes with the help of an objective function. We demonstrated that our model outperformed an existing zero-shot model on the top-1 accuracy metric on the CUB-200-2011 Birds dataset using only raw images and Wikipedia articles.

References
1. Akanksha, Krishnan, Paul, N., Munjal, P.: Semantically aligned bias reducing zero shot learning (04 2019)
2. Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Label-embedding for attribute-based classification. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition. pp. 819–826 (2013)
3. Ba, J.L., Swersky, K., Fidler, S., Salakhutdinov, R.: Predicting deep zero-shot convolutional neural networks using textual descriptions. In: International Conference on Computer Vision (ICCV). pp. 4247–4255 (2015)
4. Carrara, F., Esuli, A., Fagni, T., Falchi, F., Moreo Fernández, A.: Picture it in your mind: Generating high level visual representations from textual descriptions. Information Retrieval Journal (10 2017)
5. Elhoseiny, M.: Write a classifier: Zero shot learning using purely textual descriptions (12 2013)
6. Fu, Y., Hospedales, T., Xiang, T., Gong, S.: Transductive multi-view zero-shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (01 2015)
7. Kodirov, E., Xiang, T., Gong, S.: Semantic autoencoder for zero-shot learning. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4447–4456 (2017)
8. Li, K., Min, M.R., Fu, Y.: Rethinking zero-shot learning: A conditional visual classification perspective. In: International Conference on Computer Vision (ICCV). pp. 3582–3591 (2019)
9. Otto, C., Springstein, M., Anand, A., Ewerth, R.: Understanding, categorizing and predicting semantic image-text relations. In: Proceedings of the 2019 International Conference on Multimedia Retrieval (2019)
10. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10), 1345–1359 (2010)
11. Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations (02 2018)
12. Prateek: ELMo Architecture. https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/ (2019)
13. Qiao, R., Liu, L., Shen, C., Van Den Hengel, A.: Less is more: Zero-shot learning from online textual documents with noise suppression. In: Computer Vision and Pattern Recognition (CVPR). pp. 2249–2257 (2016)
14. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (09 2014)
15. Wang, H., Zhang, Y., Yu, X.: An overview of image caption generation methods. Computational Intelligence and Neuroscience p. 3062706 (2020)
16. Wang, J., Markert, K., Everingham, M.: Learning models for object recognition from natural language descriptions (01 2009)
17. Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001 (2010)