Mining Discriminative Visual Features Based on Semantic Relations

Qing Wei1,3, Xiaowang Zhang1,3,*, Kewen Wang2, and Zhiyong Feng1,3

1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2 School of Information and Communication Technology, Griffith University, Brisbane, QLD 4111, Australia
3 Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin, China
* Corresponding author: xiaowangzhang@tju.edu.cn

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. In this paper, we present an embedding-based framework for fine-grained image classification in which the semantics of background knowledge about images are internally fused into image recognition. Specifically, we propose a semantic-fusion model that explores semantic embeddings from both background knowledge (such as text and knowledge bases) and visual information. Moreover, we present a multi-level embedding model to extract multiple semantic segments of background knowledge. Experimental results on the challenging CUB-200-2011 benchmark verify that our approach outperforms state-of-the-art methods.

1 Introduction

The goal of fine-grained image classification is to recognize subcategories of objects, such as identifying the species of birds, under some basic-level category. Different from general-level object classification, fine-grained image classification is challenging due to the large intra-class variance and small inter-class variance. Often, human beings recognize an object not only by its visual outline but also by drawing on their accumulated knowledge about the object.

In this paper, we make full use of category attribute knowledge and deep convolutional neural networks to construct a fusion-based model, Semantic Visual Representation Learning (SVRL), for fine-grained image classification. SVRL consists of a multi-level embedding fusion model and a visual feature extraction model. Our proposed SVRL has two distinct features: i) it is a novel weakly supervised model for fine-grained image classification that can automatically obtain the part regions of an image; ii) it can effectively integrate visual information and relevant knowledge to improve image classification.

Fig. 1. Overview of our SVRL model. The backbone of the vision stream is ResNet-50.

2 Semantic Visual Representation Learning

The framework of SVRL is shown in Figure 1. Based on the intuition of knowledge conduction, we propose a multi-level fusion-based Semantic Visual Representation Learning model for learning latent semantic representations.

Discriminative Patch Detector. In this part, we adopt discriminative mid-level features to classify images. Specifically, we treat a 1 x 1 convolutional filter as a small patch detector [4]. First, the input image is passed through a sequence of convolutional and pooling layers; each C x 1 x 1 vector across channels at a fixed spatial location then represents a small patch at the corresponding location in the original image, and the most discriminative response can be found simply by picking the maximum activation over the entire feature map. In this way, we pick out the discriminative region features of the image.
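For concreteness, the following is a minimal sketch of the 1 x 1 patch-detector idea described above, assuming a PyTorch-style implementation; the class name PatchDetector, the ResNet-50 feature-map shape, and the number of detectors are illustrative assumptions rather than the authors' released code.

```python
# Minimal sketch of a 1x1-convolution "patch detector" (cf. Section 2).
# Assumptions: PyTorch; a ResNet-50 feature map of shape (N, 2048, H, W);
# the number of detectors (num_patches) is an illustrative choice.
import torch
import torch.nn as nn


class PatchDetector(nn.Module):
    def __init__(self, in_channels: int = 2048, num_patches: int = 200):
        super().__init__()
        # Each 1x1 filter scores every spatial location, i.e. every small
        # patch of the original image, for one discriminative pattern.
        self.detector = nn.Conv2d(in_channels, num_patches, kernel_size=1)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (N, C, H, W); each C x 1 x 1 column corresponds to
        # a patch at the matching location in the input image.
        responses = self.detector(feature_map)               # (N, num_patches, H, W)
        # Picking the spatial maximum selects the most discriminative
        # region for each detector across the entire feature map.
        part_features, _ = responses.flatten(2).max(dim=2)   # (N, num_patches)
        return part_features
```

A global feature (e.g. average pooling of the same feature map) can be kept alongside these part features, matching the global and part streams of the vision branch in Figure 1.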
Multi Embedding Fusion. As shown in Figure 1, the knowledge stream consists of the Cgate and the visual fusion components. In our work, we use the word2vec and TransR embedding methods; note that we can adaptively use N embedding methods, not only these two. Given weight parameters $w \in W$ and embedding spaces $e \in E$, where $N$ is the number of embedding methods, the Cgate is defined as

$C_{gate} = \frac{1}{N} \sum_{i=1}^{N} w_i e_i$, where $\sum_{i=1}^{N} w_i = 1$.

After we obtain the integrated feature space, we map the semantic space into the visual space through the visual fully connected layer $FC_b$, which is trained only by the part-stream visual vectors. Here we propose asynchronous learning: the semantic feature vector is trained every p epochs, but it does not update the parameters of $FC_b$. This asynchronous method not only preserves semantic information but also learns better visual features for fusing the semantic and visual spaces. The fusion is defined as $T = V + \alpha \cdot (V \odot \tanh(S))$, where $V$ is the visual feature vector, $S$ is the semantic vector, and $T$ is the fused vector; the element-wise product is a fusion operation that lets the two sources of information interact. We set the dimensions of $S$, $V$, and $T$ to 200. The gate mechanism consists of the Cgate, a tanh gate, and the product of the visual feature with the semantic feature (a code sketch of this fusion step is given at the end of Section 3).

3 Experiments and Evaluation

In our experiments, we train our model using SGD with mini-batches of 64 and a learning rate of 0.0007. The loss weights of the vision stream and the knowledge stream are set to 0.6, 0.3, and 0.1, and the two embedding weights are set to 0.3 and 0.7.

Table 1. Comparison with state-of-the-art methods on the CUB-200-2011 dataset.

Method         Train Annotation   Test Annotation   Accuracy (%)
Part R-CNN     Parts, BBox        Parts, BBox       76.4
PA-CNN         BBox               BBox              82.8
SPDA-CNN       Parts, BBox        BBox              85.1
AGAL-CNN [2]   Parts, BBox        --                85.5
DVAN           --                 --                79.0
B-CNN          --                 --                84.1
PDFS           --                 --                84.5
CVL [1]        --                 --                85.5
T-CNN [5]      --                 --                86.2
SVRL (ours)    --                 --                87.1

Classification Result and Comparison. Table 1 compares our SVRL with nine state-of-the-art fine-grained image classification methods on CUB-200-2011 [3]. In our experiments, we use neither part annotations nor bounding boxes (BBox). We obtain 1.6% higher accuracy than the best part-based method, AGAL, which uses both part annotations and BBox. Compared with T-CNN and CVL, which also use no annotations or BBox, our method achieves 0.9% and 1.6% higher accuracy, respectively. These works obtain good performance by combining knowledge and vision; the difference is that we fuse multi-level embeddings to obtain the knowledge representation, and our mid-level visual patch regions learn discriminative features.

Table 2. Results of different components and variants on CUB-200-2011.

Knowledge Components      Accuracy (%)   Vision Components      Accuracy (%)
Knowledge-W2V             82.2           Global-Stream Only     80.8
Knowledge-TransR          83.0           Part-Stream Only       81.9
Knowledge Stream-VGG      83.2           Vision Stream-VGG      85.2
Knowledge Stream-ResNet   83.6           Vision Stream-ResNet   85.9
Our SVRL-VGG              86.5           Our SVRL-ResNet        87.1

Fig. 2. Visualization of discriminative regions on the CUB-200-2011 dataset.

More Experiments and Visualization. We compare different variants of our SVRL approach. From Table 2, we observe that combining vision with multi-level knowledge achieves higher accuracy than either stream alone, which demonstrates that visual information, text descriptions, and structured knowledge are complementary in fine-grained image classification. Fig. 2 visualizes the discriminative regions learned on the CUB dataset.
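As referenced in Section 2, the following is a minimal sketch of the Cgate multi-embedding fusion and the gated visual-semantic fusion $T = V + \alpha \cdot (V \odot \tanh(S))$, again assuming a PyTorch implementation. The class name SemanticVisualFusion, the value of alpha, and the detached-weight treatment of FC_b are illustrative assumptions rather than the authors' released code; the 200-dimensional spaces and the embedding weights (0.3, 0.7) follow the description above.

```python
# Minimal sketch of the Cgate multi-embedding fusion and the visual-semantic
# fusion T = V + alpha * (V * tanh(S)) described in Section 2.
# Assumptions: PyTorch; the word2vec/TransR embeddings are already projected
# to a common 200-d space; alpha and all module names are illustrative,
# while the embedding weights (0.3, 0.7) follow Section 3.
from typing import Sequence

import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVisualFusion(nn.Module):
    def __init__(self, embed_dim: int = 200,
                 embed_weights: Sequence[float] = (0.3, 0.7),
                 alpha: float = 1.0):
        super().__init__()
        self.alpha = alpha
        # Cgate weights w_i, one per embedding method, summing to 1.
        self.register_buffer("w", torch.tensor(embed_weights))
        # FC_b maps the semantic space into the 200-d visual space; in the
        # paper it is trained only by the part-stream visual vectors.
        self.fc_b = nn.Linear(embed_dim, embed_dim)

    def forward(self, embeddings: Sequence[torch.Tensor],
                visual: torch.Tensor) -> torch.Tensor:
        # Cgate = (1/N) * sum_i w_i * e_i, with sum_i w_i = 1.
        n = len(embeddings)
        c_gate = sum(self.w[i] * embeddings[i] for i in range(n)) / n
        # Asynchronous update: the semantic branch may receive gradients, but
        # it must not update FC_b, so FC_b's weights are detached here.
        semantic = F.linear(c_gate, self.fc_b.weight.detach(),
                            self.fc_b.bias.detach())
        # Gate mechanism: tanh gate plus the product of the visual and
        # semantic features, added back onto the visual vector.
        return visual + self.alpha * visual * torch.tanh(semantic)
```

In this sketch, FC_b itself would be trained only through the ordinary part-stream forward pass (e.g. self.fc_b(part_visual)) every p epochs, and the whole network would be optimized with SGD using the mini-batch size of 64 and learning rate of 0.0007 reported above.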
4 Conclusion

In this paper, we proposed SVRL, a novel fine-grained image classification model that efficiently leverages external knowledge to improve fine-grained image classification. One important advantage of our approach is that the SVRL model reinforces both the vision and knowledge representations, which captures more discriminative features for fine-grained classification. We believe that our proposal is helpful for fusing semantics internally when processing cross-media, multi-source information.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (2017YFC0908401) and the National Natural Science Foundation of China (61976153, 61972455). Xiaowang Zhang is supported by the Peiyang Young Scholars Program of Tianjin University (2019XRX-0032).

References

1. He, X., Peng, Y.: Fine-grained image classification via combining vision and language. In: Proc. of CVPR 2017, pp. 7332-7340.
2. Liu, X., Wang, J., Wen, S., Ding, E., Lin, Y.: Localizing by describing: Attribute-guided attention localization for fine-grained recognition. In: Proc. of AAAI 2017, pp. 4190-4196.
3. Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
4. Wang, Y., Morariu, V.I., Davis, L.S.: Learning a discriminative filter bank within a CNN for fine-grained recognition. In: Proc. of CVPR 2018, pp. 4148-4157.
5. Xu, H., Qi, G., Li, J., Wang, M., Xu, K., Gao, H.: Fine-grained image classification by visual-semantic embedding. In: Proc. of IJCAI 2018, pp. 1043-1049.