Figure 1: Top 5 classification results of our system given an image of class jaguar.

Andrea Frome et al. [FCS+13] present a method to incorporate word embeddings [MCCD13] in a visual-semantic embedding model employed for image classification. They showcased the technique using 1000 classes of the ILSVRC 2012 dataset [DDS+09], where the ground-truth labels were replaced during training with word embeddings obtained from a skip-gram architecture [MCCD13, MSC+13]. The results showed that the model could match state-of-the-art classification performance while making semantically reasonable errors. Most notably, it showed capabilities of zero-shot learning [PPHM09]. Several similar approaches have been presented since, proposing methods of integrating word embeddings with vision architectures and improving performance in zero-shot learning [ZXG17, ZS15, NMB+13].

Our study was primarily inspired by the above contributions and motivated by the goal of improving the semantic representations employed. We aim to explore the effect of structured knowledge representations, such as knowledge graphs, in place of unstructured text corpora [FCS+13], hoping to capture more explicit relationships between entities. We study the effects of such an approach on an image classification task.

The main contribution of this study is the introduction of a technique to embed semantics learnt from structured knowledge bases in a visual-semantic embedding model. We also introduce a novel evaluation framework to assess the generalisation of concepts with respect to the structured knowledge used to inform the semantics during training. We demonstrate zero-shot learning and introduce a novel evaluation metric to assess its performance according to the class hierarchy of the structured knowledge used.

2 Proposed method

Our approach to incorporating the semantics of structured knowledge bases into the image classification task can be subdivided into two main steps. The first is a technique to obtain accurate concept representations, which define the vector space for our architecture. The second is mapping visual features of images, obtained via a visual-semantic embedding model, to the vector space of the concept representations. The goal is to have both image representations and concept representations in a single vector space, so that they can be compared during the image classification task.

2.1 Obtaining concept representations

Concept representations define the vector space for our architecture. Figure 2 a) shows this process in a flow chart. Initially we have a graph G with nodes V and edges E. We assume either that all edges are is-a (directed specialisation) edges or that some edges are labelled as such, because we identify is-a as the basic form of a specialisation hierarchy that would be used to inform the semantics of the concepts in question. The edges could carry more complex relational representations, but this is beyond the scope of our current study. We also do not restrict the structure of the graph. Two considerations matter when selecting a graph: its nodes should contain the training image label classes of the classification task (L_i in Figure 2 b)), and the graph should represent the relations between concepts in the considered domain. Chakaveh Saedi et al. [SBRS18] recently proposed a technique to compute embeddings, specifically on WordNet [Mil98], that represent the lexical semantics of the nodes.
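To make the graph-construction step concrete, the following is a minimal sketch in Python, assuming the NLTK WordNet interface, NumPy and scikit-learn. The function names, the toy label synsets and the parameter values are illustrative placeholders rather than our actual implementation; the matrix enrichment it performs corresponds to Equation 1, introduced in the following paragraphs.

    import numpy as np
    from nltk.corpus import wordnet as wn            # assumes nltk.download('wordnet') has been run
    from sklearn.decomposition import PCA

    def build_isa_graph(label_synsets):
        """Collect the label synsets plus everything reachable via is-a (hypernym) links."""
        nodes, frontier = set(label_synsets), list(label_synsets)
        while frontier:
            syn = frontier.pop()
            for hyper in syn.hypernyms():            # is-a edges in WordNet
                if hyper not in nodes:
                    nodes.add(hyper)
                    frontier.append(hyper)
        nodes = sorted(nodes, key=lambda s: s.name())
        index = {s: i for i, s in enumerate(nodes)}
        M = np.zeros((len(nodes), len(nodes)))
        for syn in nodes:                            # symmetric adjacency: 1 iff directly related
            for hyper in syn.hypernyms():
                if hyper in index:
                    M[index[syn], index[hyper]] = M[index[hyper], index[syn]] = 1.0
        return nodes, M

    def concept_representations(M, alpha=0.3, dim=850):
        """Enrich the adjacency matrix, L2-normalise rows, and reduce the dimensionality with PCA."""
        # Closed form of the series in Equation 1; valid when alpha is small enough for convergence.
        MG = np.linalg.inv(np.eye(M.shape[0]) - alpha * M)
        MG = MG / np.linalg.norm(MG, axis=1, keepdims=True)
        return PCA(n_components=dim).fit_transform(MG)

    # Toy usage with two label synsets; a real run would use the full 60k-word subgraph.
    labels = [wn.synset('dog.n.01'), wn.synset('cat.n.01')]
    nodes, M = build_isa_graph(labels)
    R_C = concept_representations(M, dim=min(8, len(nodes)))   # small dimensionality for the toy graph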
Their study showed that the resulting embedding vectors outperform word embeddings [MCCD13] in semantic similarity tasks. In our approach we make use of these findings and adopt such embeddings as our concept representations. G is converted to an adjacency matrix M such that, if two words w_i and w_j in G are directly related by an edge, the entry M_ij is set to 1 (and to 0 otherwise). To account for words that are not directly connected, M is further enriched by aggregating over distantly connected nodes as in Equation 1.

    M_G = Σ_{n=0}^{∞} (αM)^n = (I − αM)^{−1}    (1)

Here n is the length of the path between two nodes, and α (< 1) is a decay factor that determines the effect of path length on M: the longer the path between two words (the larger n), the less it contributes to M_G. M_G is normalised using the L2-norm and reduced to a set of lower-dimensional vectors using Principal Component Analysis (PCA). These vectors are our concept representations.

Figure 2: Flow diagram showing the two main steps of our approach. a) The process of obtaining concept representations from the knowledge base. b) The process of mapping visual features to the obtained concept representations.

2.2 Mapping visual features to concept representations

After defining the concept vector space as described in Section 2.1, we train our visual-semantic embedding model to map input images to the relevant concept representations. Figure 2 b) summarises the procedure: in the visual feature representation learning step, the learnt concept representations R_C are used to inform the semantics of the concepts. We employ a deep neural network with an additional projection layer at the top that outputs a representation matching the dimensionality of the concept representations. The projection layer is trained using the cosine similarity between the predicted embeddings and the pre-computed concept representations as the loss, according to Equation 2.

    loss(I, R_C) = 1 − cos(E(I), R_C)    (2)

In Equation 2, I and R_C are the image and the corresponding concept representation of the image label class, respectively, and E is the function that represents both the feature extraction and the mapping of the projection layer. We minimise one minus the cosine similarity during training, since the more similar the two vectors are, the closer their cosine similarity is to 1. After training, we use the function E to calculate and store the visual embeddings of all the training images; R_I in Figure 2 b) denotes the set of these embeddings.

2.3 The evaluation frameworks

2.3.1 Taxonomy-Aware Measure for Errors (TAME)

We introduce a novel evaluation framework, which we name the Taxonomy-Aware Measure for Errors (TAME), that analyses classification errors with respect to the property of subsumption extracted from the hierarchical structure of the knowledge base used. Subsumption is formed by the is-a relations in the knowledge base: if C is related to D via an is-a link, we say that D subsumes C and that all properties of C are also present in D. We use this property to check whether our architecture is effective in generalising to the concepts that subsume the ground-truth classes of the images. We can check this at different levels of subsumption going up the is-a hierarchy; a minimal sketch of such a check is given below.
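The sketch below assumes the NLTK WordNet interface as the source of the is-a hierarchy; the helper names and the example synset are placeholders rather than part of our implementation.

    from nltk.corpus import wordnet as wn        # assumes the WordNet corpus is available

    def subsumers(synset, steps):
        """All concepts reachable from `synset` within `steps` is-a (hypernym) links."""
        frontier, collected = {synset}, set()
        for _ in range(steps):
            frontier = {h for s in frontier for h in s.hypernyms()}
            collected |= frontier
        return collected

    def tame_correct(predicted, ground_truth, steps):
        """TAME-style check: the ground truth or any subsumer within `steps` levels counts as correct."""
        return predicted == ground_truth or predicted in subsumers(ground_truth, steps)

    # Example: predicting a direct (1-step) subsumer of the ground-truth class is accepted at 3 levels.
    ground_truth = wn.synsets('barrel', pos=wn.NOUN)[0]    # first noun sense, purely illustrative
    predicted = ground_truth.hypernyms()[0]
    print(tame_correct(predicted, ground_truth, steps=3))  # True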
In the presence of background knowledge, it makes sense to distinguish bad errors from not-so-bad errors: if the system predicts a subsumer class of the ground-truth class instead of the ground-truth class itself, this is a less serious error than predicting a less closely related class. For example, referring to Figure 3, if the system classifies an image of a barrel as a vessel, TAME will consider it to be almost correct. TAME captures this aspect of our system and quantifies image classification results together with their semantically reasonable errors. Results are shown in Section 3.2.

Figure 3: The hypernym hierarchy of the classes barrel and backpack up to 3 subsumption levels.

2.3.2 Addition to zero-shot evaluation

Next, a novel approach to evaluating zero-shot learning is presented. We divide the set of chosen zero-shot classes into two sets, namely Sibling and Non-Sibling classes. The notion of sibling classes is derived from the property of sharing common subsumer classes. For example, referring to Figure 3, vessel and bag share the same subsumer, container; we therefore categorise such classes as Sibling classes in the zero-shot evaluation, and the remaining classes as Non-Sibling classes. During evaluation, we distinguish the difference in performance between these two sets in the image classification task.

3 Experiments

In this study, we use ILSVRC 2012 [RDS+15] as the image dataset and WordNet [Mil98] as the knowledge base for obtaining concept representations. Since ILSVRC 2012 uses WordNet entities as the ground-truth class labels for the objects in the images, WordNet qualifies as a reliable structured knowledge base for all the concepts found in ILSVRC 2012. WordNet is a lexical ontology for the English language that consists of over 120k concepts, 25 types of relations between these concepts, and over 155k words (lemmas) categorised into nouns, verbs, adjectives and adverbs. In this implementation, we adopt the same approach as [SBRS18] and extract a subgraph of 60k words from all parts of speech in WordNet. All relations were considered and weighted equally during the embedding calculation described in Section 2.1. The dimensionality of the resulting concept representations is chosen to be 850 [SBRS18].

We employ a deep residual network [HZRS16] (ResNet-50), pre-trained for an image classification task on the ILSVRC 2012 dataset. The softmax prediction layer at the top of the network is replaced with a projection layer, as explained in Section 2.2. The projection layer acts as a linear transformation that maps a 512-dimensional feature vector of the image to an 850-dimensional concept embedding produced in our previous step. To evaluate our goal of semantically informed image classification, we use the process shown in Figure 4. Throughout our evaluation we compare the results with the DeViSE [FCS+13] architecture, trained on the same classes and tested under the same settings. DeViSE takes a similar approach to embedding semantics in image classification, the only difference being that its concept representations are word embeddings [MCCD13]. This gives us the opportunity to compare the difference in performance when concept representations are calculated using structured knowledge versus unstructured text.

Figure 4: System setup to retrieve a concept given an image as query (image classification). R_C is the same vector as in Figure 2.
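To illustrate the projection layer, the cosine loss of Equation 2, and the retrieval setup of Figure 4, the following is a minimal PyTorch sketch under stated assumptions: it uses a stock torchvision ResNet-50 (whose pooled feature is 2048-dimensional, rather than the 512-dimensional feature reported above), a frozen backbone, and random placeholder tensors in place of real image batches and the precomputed concept matrix R_C. It is a sketch of the idea, not our exact training code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models                       # assumes torchvision >= 0.13

    CONCEPT_DIM = 850                                    # dimensionality of the concept representations
    FEAT_DIM = 2048                                      # pooled feature size of a stock ResNet-50

    backbone = models.resnet50(weights="IMAGENET1K_V1")  # ImageNet-pretrained feature extractor
    backbone.fc = nn.Identity()                          # drop the softmax head, keep the pooled feature
    for p in backbone.parameters():                      # freeze the pre-trained backbone
        p.requires_grad = False
    backbone.eval()

    projection = nn.Linear(FEAT_DIM, CONCEPT_DIM)        # the projection layer of Section 2.2
    optimizer = torch.optim.Adam(projection.parameters(), lr=1e-3)

    def embed(images):
        """E(I): feature extraction followed by the projection layer."""
        with torch.no_grad():
            feats = backbone(images)
        return projection(feats)

    # One illustrative training step on a fake batch (Equation 2: loss = 1 - cosine similarity).
    images = torch.randn(8, 3, 224, 224)                 # placeholder image batch
    targets = torch.randn(8, CONCEPT_DIM)                # placeholder rows of R_C for the batch labels
    loss = (1 - F.cosine_similarity(embed(images), targets, dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Retrieval as in Figure 4: rank all concept representations by cosine similarity to E(I).
    R_C = F.normalize(torch.randn(300, CONCEPT_DIM), dim=1)   # placeholder concept matrix (300 classes)
    query = F.normalize(embed(images[:1]), dim=1)
    top5 = torch.topk(query @ R_C.t(), k=5, dim=1).indices    # Hit@5 candidate classes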
3.1 Evaluating visual model performance

To train the visual-semantic embedding model (denoted by E in Figure 2 b)), we choose 300 randomly selected classes from the ILSVRC 2012 1K dataset [RDS+15]. These 300 classes were selected because they are all present as nodes in the WordNet graph considered during the concept extraction phase. We compare the Hit@k image classification accuracies of our approach with a standard softmax image classification baseline (ResNet-50 [HZRS16]) and DeViSE [FCS+13], both trained on the same training classes. The results are shown in Table 1.

Table 1: Hit@k accuracies of our system compared to the softmax classifier and DeViSE.

    Model      Dataset     Hit@1 (%)   Hit@5 (%)
    Softmax    Train Set   89.06       94.14
               Test Set    46.69       63.55
    DeViSE     Train Set   60.67       70.88
               Test Set    53.36       66.20
    Proposed   Train Set   71.92       88.93
               Test Set    56.82       77.06

From the results in Table 1, we see that although the softmax model attains higher image classification accuracy on the training set, both DeViSE and our proposed model show higher accuracies on the test set, with our model having the highest accuracy for both Hit@1 and Hit@5. The test set consists of unseen images of the same training classes, and higher accuracy on this set implies better generalisation. For the purposes of this study, we conclude that our visual-semantic embedding model shows promising image classification performance. In addition, our system inherits the capability of incorporating the semantics of the classes and their relationships according to the structured knowledge base used during training (i.e. WordNet). We now move on to the novel evaluation framework introduced in Section 2.3.1 to evaluate this capability.

3.2 TAME on image classification

ILSVRC 2012 provides the hierarchy of the selected classes according to the hypernym tree in WordNet, similar to the example shown in Figure 3. We obtain three sets of subsumers for all 300 selected classes, one, two and three levels above the training classes, named 1-step, 2-step and 3-step subsumers. As the level increases, the classes become more general in meaning. Note that these subsumer classes are not used during the training of the systems; to identify the more general subsumer classes correctly, a system has to draw on the semantic relationships informed by the concept representations provided during training. A prediction is taken as correct if the system outputs either the ground-truth class or any subsumer class in each of the tests. The results are compared with DeViSE [FCS+13], retrained on the same 300 training classes and tested with the same subsumer sets. The results are shown in Table 2.

Table 2: Image classification performance including subsumer classes. 1-step, 2-step and 3-step subsumers are the classes the respective number of steps above the ground-truth class labels according to the WordNet hypernym (is-a) hierarchy.
Model: DeViSE
                     Train Set Hit@k (%)           Test Set Hit@k (%)
                     1      5      10     20       1      5      10     20
1-step Subsumers     60.94  75.75  80.55  84.65    53.36  66.20  68.45  70.70
2-step Subsumers     60.87  76.92  82.68  87.25    53.41  72.70  79.57  84.70
3-step Subsumers     60.77  76.81  82.88  87.86    55.94  73.58  79.86  85.09

Model: Proposed
                     Train Set Hit@k (%)           Test Set Hit@k (%)
                     1      5      10     20       1      5      10     20
1-step Subsumers     72.24  88.79  91.03  95.02    61.29  80.17  83.39  88.06
2-step Subsumers     72.26  88.79  91.12  95.13    61.41  80.20  83.63  88.39
3-step Subsumers     73.65  90.24  92.30  95.99    62.61  81.72  84.98  89.40

The results demonstrate that both architectures are able to generalise from the trained classes to their subsumer classes. This is seen in the increasing accuracy of both architectures as 1-step, 2-step and 3-step subsumers are included. It also shows that both approaches gain a semantic understanding of the classes that is not captured by the traditional classification test (results in Table 1). The superior performance of our approach in almost all cases of TAME compared to DeViSE shows the effectiveness of employing structured knowledge rather than unstructured text to obtain the semantics used to train the visual-semantic embedding model. Although embeddings learned from unstructured text also attain a level of generalisation to similar concepts, structured knowledge captures these semantics more accurately.

3.3 Evaluating zero-shot learning

To demonstrate zero-shot learning, we pick a random set of 30 entirely unseen classes from the ILSVRC 2012 dataset. We identify the sibling properties of these classes, as explained in Section 2.3.2, with respect to the 300 classes used during training; 14 of the 30 zero-shot classes turned out to be Sibling classes and the rest Non-Sibling classes. We use the same DeViSE architecture from Section 3.2 for comparison. The tests were carried out under two settings of R_C (from Figure 4): results were taken with and without R_C containing the training class embeddings. A classification is considered correct if the correct label is present among the k outputs of each Hit@k task. The results are shown in Table 3.

Table 3: Zero-shot learning results for both Sibling and Non-Sibling classes.

                                                Sibling Classes Hit@k (%)    Non-Sibling Classes Hit@k (%)
Model                                           1      5      10             1      5      10
DeViSE (only zero-shot class labels)            48.63  60.66  71.86          28.07  48.40  69.52
DeViSE (zero-shot + training class labels)      0.55   14.48  28.96          0.27   6.42   16.31
Proposed (only zero-shot class labels)          53.01  74.04  89.07          25.67  54.28  77.27
Proposed (zero-shot + training class labels)    1.64   27.60  41.53          0.00   8.82   12.83

Both models demonstrate zero-shot learning capabilities, with higher accuracies when R_C does not contain the training class embeddings. The reason is that, when the known class labels are present in the system, the models map the zero-shot image embeddings more towards them. One takeaway from Table 3 is that the Sibling classes are classified more accurately by both models than the Non-Sibling classes. The intuition is that during training the models have been exposed to classes similar to these zero-shot classes (the effect of having common subsumers). Our proposed model outperforms DeViSE in all instances in the Sibling category. Another interesting insight is that our proposed model performs worse than DeViSE on the Non-Sibling classes in several instances. This can be seen as a consequence of the structure of the knowledge enforced on our embeddings.
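As an illustration of the Sibling/Non-Sibling split of Section 2.3.2, the following is a small Python sketch assuming the NLTK WordNet interface. The number of is-a steps within which a shared subsumer is searched for is a free parameter here, and the helper and the toy synsets are illustrative rather than the exact procedure used in our experiments.

    from nltk.corpus import wordnet as wn        # assumes the WordNet corpus is available

    def subsumers(synset, steps):
        """All concepts reachable from `synset` within `steps` is-a (hypernym) links."""
        frontier, collected = {synset}, set()
        for _ in range(steps):
            frontier = {h for s in frontier for h in s.hypernyms()}
            collected |= frontier
        return collected

    def split_siblings(zero_shot_synsets, training_synsets, steps=2):
        """Zero-shot classes that share a subsumer with any training class are Sibling classes."""
        training_subsumers = set()
        for syn in training_synsets:
            training_subsumers |= subsumers(syn, steps)
        siblings, non_siblings = [], []
        for syn in zero_shot_synsets:
            bucket = siblings if subsumers(syn, steps) & training_subsumers else non_siblings
            bucket.append(syn)
        return siblings, non_siblings

    # Toy usage; the choice of first noun senses is purely illustrative.
    training = [wn.synsets('bucket', pos=wn.NOUN)[0]]
    zero_shot = [wn.synsets('barrel', pos=wn.NOUN)[0], wn.synsets('zebra', pos=wn.NOUN)[0]]
    siblings, non_siblings = split_siblings(zero_shot, training, steps=2)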
Next, we extend the zero-shot evaluation to include TAME, introduced in Section 2.3.1.

3.4 TAME on zero-shot evaluation

We extend our zero-shot evaluation to test whether the models can also generalise zero-shot classification to the subsumer classes of the zero-shot classes. The subsumer classes are obtained by the same method as explained in Section 3.2 for the training classes. The results are shown in Table 4.

Table 4: Results after extending the zero-shot evaluation to include TAME.

Model: DeViSE
                     Sibling Classes          Sibling + Training       Non-Sibling Classes      Non-Sibling + Training
                     Hit@k (%)                Classes Hit@k (%)        Hit@k (%)                Classes Hit@k (%)
                     1      5      10         1      5      10         1      5      10         1      5      10
1-step Subsumers     61.48  79.51  82.24      1.09   28.69  45.08      44.92  77.54  86.90      0.27   13.10  38.50
2-step Subsumers     50.55  77.05  81.42      1.09   30.60  50.55      37.17  77.54  88.50      0.27   17.38  42.25
3-step Subsumers     50.27  84.70  96.45      1.09   36.34  56.56      36.10  77.01  88.50      0.27   18.45  41.71

Model: Proposed
                     Sibling Classes          Sibling + Training       Non-Sibling Classes      Non-Sibling + Training
                     Hit@k (%)                Classes Hit@k (%)        Hit@k (%)                Classes Hit@k (%)
                     1      5      10         1      5      10         1      5      10         1      5      10
1-step Subsumers     52.46  73.77  84.70      2.73   28.69  43.17      24.87  46.26  65.78      0.00   9.36   15.24
2-step Subsumers     53.55  73.22  83.61      3.01   28.96  44.26      26.74  51.60  65.51      0.00   13.37  20.86
3-step Subsumers     51.71  70.86  82.29      3.14   27.43  42.29      26.47  52.41  66.04      0.00   13.37  21.39

We take the same Sibling and Non-Sibling subsets of the zero-shot set from Table 3 and include the 1-step, 2-step and 3-step subsumers of those classes when computing the Hit@k classification accuracy. This evaluates the generalisation ability of our architecture on the entirely unseen zero-shot classes. The results in Table 4 again show the better overall performance on the Sibling classes for both models, in line with the results of Table 3. Another observation is the drop in accuracy for both approaches in several Hit@k instances as the subsumer level increases: with each additional subsumer step, more concept representations are introduced into R_C for the systems to choose from, which increases the possibility of the models choosing wrong concepts for the zero-shot classes.

4 Related work

Image understanding is often challenged by the need for an accurate semantic understanding of concepts [ZLT17, MMBG04] according to human-level knowledge. As Wengang Zhou et al. [ZLT17] point out, translating human-level semantic knowledge to low-level visual feature representations is a crucial hurdle to overcome, referred to as the semantic gap [ZLT17]. Even though machine learning and deep learning techniques have been applied to this problem [LZLM07], an accurate definition of human-level knowledge is lacking in existing systems.

Towards bridging the semantic gap, we turn to the many approaches presented in the area of zero-shot learning [PPHM09]. These are known for incorporating attributes to represent concepts in vector spaces and mapping visual features to them, giving more meaning to the features extracted from an image [FEHF09, LNH09]. Many recent deep-learning-based techniques for zero-shot learning make use of semantic embeddings learnt from a language base in an unsupervised setting, which are then used to guide visual feature representations [SGMN13, RPT15, FS16, ZS15, ZS16]. Our study was largely inspired by the findings of Andrea Frome et al.
[FCS+13], where word embeddings learnt from unstructured text [MCCD13] were employed for zero-shot classification of the ILSVRC 2012 1K dataset [RDS+15]. Our approach differs in that we replace the word embeddings with concept representations captured from a structured knowledge base (such as a knowledge graph) to inform the visual-semantic embedding model.

Xiaolong Wang et al. [WYG18] recently explored the idea of incorporating relations from knowledge graphs into zero-shot learning using Graph Convolutional Networks (GCN) [KW16]. They used the relation information from the edges of the graph to enrich the word embeddings when inferring visual features of unseen images. Our study differs from this approach in that we completely replace the word embeddings with embeddings produced from a structured knowledge base that encompasses the relations between the entities concerned.

5 Conclusion

We propose a technique to incorporate concept representations obtained from a structured knowledge base to inform a visual-semantic embedding model. We show how these embeddings are calculated using the graph structure of the knowledge base, and how they are used to train a projection layer at the top of a traditional image classification model. Informed by the semantics, the image classification model attains better generalisation, with enhanced classification accuracies on the test sets compared with traditional methods. We introduce two novel evaluation frameworks that demonstrate how to assess classification errors with respect to the structure of the knowledge base used. Our approach shows promising results under TAME, which evaluates the ability of our approach to identify more general classes of the ground-truth classes according to the class hierarchy. We demonstrate zero-shot classification and evaluate its performance with the novel addition of the notions of sibling and non-sibling classes. The results show how sibling classes perform better in a zero-shot classification setting and how the structure of the external knowledge plays a part in the classification. Overall, the results indicate that concept representations extracted from structured knowledge are more effective than those from unstructured text when informing a visual-semantic embedding model.

References

[DDS+09] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[FCS+13] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129, 2013.
[FEHF09] Ali Farhadi, Ian Endres, Derek Hoiem, and David Forsyth. Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785. IEEE, 2009.
[FS16] Yanwei Fu and Leonid Sigal. Semi-supervised vocabulary-informed learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5337–5346, 2016.
[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[KW16] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
[LNH09] Christoph H. Lampert, Hannes Nickisch, and Stefan Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 951–958. IEEE, 2009.
[LZLM07] Ying Liu, Dengsheng Zhang, Guojun Lu, and Wei-Ying Ma. A survey of content-based image retrieval with high-level semantics. Pattern Recognition, 40(1):262–282, 2007.
[Mar18] Gary Marcus. Deep learning: A critical appraisal. arXiv preprint arXiv:1801.00631, 2018.
[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[Mil98] George Miller. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[MMBG04] Henning Müller, Nicolas Michoux, David Bandon, and Antoine Geissbuhler. A review of content-based image retrieval systems in medical applications—clinical benefits and future directions. International Journal of Medical Informatics, 73(1):1–23, 2004.
[MSC+13] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
[NMB+13] Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, and Jeffrey Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
[PPHM09] Mark Palatucci, Dean Pomerleau, Geoffrey E. Hinton, and Tom M. Mitchell. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, pages 1410–1418, 2009.
[RDS+15] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[RPT15] Bernardino Romera-Paredes and Philip Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning, pages 2152–2161, 2015.
[SBRS18] Chakaveh Saedi, António Branco, João António Rodrigues, and João Silva. WordNet embeddings. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 122–131, 2018.
[SGMN13] Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943, 2013.
[SWM17] Wojciech Samek, Thomas Wiegand, and Klaus-Robert Müller. Explainable artificial intelligence: Understanding, visualizing and interpreting deep learning models. arXiv preprint arXiv:1708.08296, 2017.
[WSG+00] M. Worring, A. M. Smeulders, A. Gupta, S. Santini, and R. Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[WYG18] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6857–6866, 2018.
[ZLT17] Wengang Zhou, Houqiang Li, and Qi Tian. Recent advance in content-based image retrieval: A literature survey. CoRR, abs/1706.06064, 2017.
[ZS15] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE International Conference on Computer Vision, pages 4166–4174, 2015.
[ZS16] Ziming Zhang and Venkatesh Saligrama. Zero-shot learning via joint latent similarity embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6034–6042, 2016.
[ZXG17] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2021–2030, 2017.