             Asymptotic Cross-Entropy Weighting and Guided-Loss in Supervised
                    Hierarchical Setting using Deep Attention Networks
                     Charles Kantor1,2,3 Brice Rauby2,3,5 Léonard Boussioux2,3,6
       Emmanuel Jehanno2,3 André-Philippe Drapeau Picard4 Maxim Larrivée4 Hugues Talbot2,3,7
1
    Mila Artificial Intelligence Institute, Montreal, Canada 2 Paris-Saclay University, France 3 Ecole CentraleSupélec Paris, France
                           4
                             Montreal Insectarium - Space for Life, Canada 5 Polytechnique Montreal, Canada
         6
           Massachusetts Institute of Technology, Operations Research Center, Cambridge, MA, USA 7 Inria Paris, France

                               Abstract

   This article presents two main techniques for improving fine-grained recognition and classification, defined as executing these tasks between items with similar general patterns that differ only in small details. First, we build a preliminary automated segmentation algorithm to ignore the image's background for attention-guided classification. To do this, we use segmentation-issued masks to ease the training of the classification network through an additional loss that penalizes attention given to features outside the mask using a convolutional attention block. Furthermore, we propose a hierarchical loss based on cross-entropy that penalizes parent-level classification mistakes to leverage the phylogeny of each wildlife species. We apply our approaches in the particular context of butterfly recognition, which is of practical interest to entomologists.

                      Context and Motivation

In this work, we deal with the issue of accurately identifying large numbers of items in photographs, some of which may differ only in minute details. This is a difficult problem because both large and small differences must be taken into account in order to recognize and classify.
   Among the images collected, a high percentage of species remains unidentified and represents a time-consuming labeling task for experts. Identifying an insect at the species level is challenging and depends on the tiniest of details. Citizen scientists can help collect a large amount of data such as insect photographic documentation (Horn et al. 2017; Boussioux et al. 2019), but accurate identification remains a bottleneck. Recent improvements in performance on a wide range of classification tasks with deep learning methods offer opportunities for population monitoring and efficient, large-scale annotation. We worked with the eButterfly (Prudic et al. 2017) citizen science program, which maintains a fine-grained dataset of observations of all North American butterfly species.
   We develop computer vision algorithms and propose fine-grained classification innovations using segmentation tools to encourage the model to focus on areas of an image that are salient for identification. We propose an additional loss using the segmentation masks, penalizing attention given to features outside the mask. We also design a specific loss function that leverages the dataset's hierarchical nature, consequently improving the Family-, Genus- and Species-level accuracy.

AAAI Fall 2020 Symposium on AI for Social Good.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                           Related Work

Preliminaries
Fine-grained classification is a category of image classification: the task is to distinguish between subtly rather than grossly different items; for example, between different species of birds or dogs rather than giraffes vs. trucks. This setting is more complex, requires better annotations and more data, and is not yet satisfactorily solved (Xie et al. 2013; Chai, Lempitsky, and Zisserman 2013). A fundamental difficulty is to induce the learning architecture to focus on small but essential details without relying on overly complicated annotations. A recent interesting approach to this end has been to use a deconstruction-reconstruction method (Chen et al. 2019) and bipartite and bi-modal graphs (Zhou and Lin 2016; Song et al. 2020).
   Segmentation is a fundamental task in computer vision. Its objective is to find semantically consistent regions that represent objects. Given enough data and annotations, deep residual CNN architectures such as ResNet (He et al. 2016) and convolutional encoder-decoders like U-Net (Ronneberger, Fischer, and Brox 2015) constitute the current state of the art in segmentation methods. In particular, U-Net and its variants may learn a segmentation task from a few hundred labeled inputs.
   The background of macro wildlife photos is typically full of environmental details like grass or leaves that can mislead the classification model and introduce bias. We noticed experimentally via saliency maps that too much attention is generally paid to the background rather than to the insect itself. Consequently, we used automatic segmentation to help focus on the foreground.

Hierarchical Classification
Hierarchical labels of a fine-grained dataset can be leveraged to improve performance (Zheng et al. 2020).
   Tree-CNN is an architecture-based approach to this setting in (Roy, Panda, and Roy 2019). Their study aims to
overcome the issue of catastrophic forgetting when fine-tuning a model successively on each task (i.e., pre-training on the task of classifying Families, then Genus and finally Species). The architecture is built with a common trunk and several finer branches corresponding to each task. Hence, the model uses the hierarchy through the trunk while also preserving the memory of each task. This Tree-CNN limits the computational cost of retraining and can learn with a smaller training effort than fine-tuning while retaining much of the accuracy.
   (Wu, Tygert, and LeCun 2019) present another approach based on the loss instead of the architecture. They propose a new loss that takes the hierarchy into account and is no longer flat, in contrast to the cross-entropy (which compares classes at the same level). A convenient metric should make all leaves equidistant from the root node.
   (Kosmopoulos et al. 2020) develop a different methodology for tuning the loss. They compare different measures presented in the literature and classify them into two subgroups: pair-based and set-based measures. They consider building a hierarchical loss as an optimization problem and propose a pair-based metric, optimized with a max-flow approach. The article also offers a set-based implementation that approximates the sparsity-inducing ℓ0 norm. Considering the path from the root to the predicted node as a set of nodes, and identically for the ground-truth node, they can propose a measure that uses the intersection, union and difference between the sets. This measure computes analogous precision, recall and F1 scores. They implement a Lowest Common Ancestor measure, which is a bridge between pair-based and set-based measures. The corresponding new measure performs very well and takes the advantages of both approaches.

Attention and Visualizing CNNs
Attention mechanisms were originally introduced for Neural Machine Translation in (Bahdanau, Cho, and Bengio 2015) using recurrent neural networks. These mechanisms utilize a divide-and-conquer approach to various AI tasks by focusing features on relevant items. These tools have been extensively used in Natural Language Processing tasks (e.g., (Parikh et al. 2016) and (Lin et al. 2017)). (Vaswani et al. 2017) developed a model relying only on attention and achieved state-of-the-art results in machine translation. Later, the use of attention was extended to computer vision tasks such as image classification and segmentation.
   The Convolutional Block Attention Module, or CBAM (Woo et al. 2018a), proposes a simple attention mechanism for feed-forward Convolutional Neural Network (CNN) architectures. Its lightweight structure and generality make it suitable for many vision tasks that require large numbers of parameters.
   Grad-CAM (Selvaraju et al. 2017) is a popular technique to make CNN models more explainable, showing which areas of the picture they focused on to make the prediction, by back-propagating gradients. The discriminative regions are localized through the areas of high gradient flow within the network.

                          Imbalanced Data

Main source
For our preliminary experiments, we use a data set of pictures submitted across Canada, Mexico and the United States, representing over seven hundred different species as of June 2020. Among these observations, two-thirds have been annotated by experts. The eButterfly program, co-founded by the Montreal Insectarium, allows participants to record sightings by uploading images with date and time information (Prudic et al. 2017).

Highly imbalanced classes
Our data set is organized hierarchically. Each image has three labels: a species belongs to one and only one genus, which in turn belongs to one and only one family. This distribution of labels enables us to have different complexity levels for our classification task. Given more than two-thirds labeled images, we anticipate being able to learn the family label with the best precision, provide a slightly less accurate estimate of the genus, and a slightly worse again estimate of the species. In the provided dataset, classes are highly imbalanced, meaning we are facing a problem of fine-grained classification with significantly under-represented classes.

                               Methods

Guided Attention Mechanism
As the shape of butterflies presents limited variability, it seems feasible to incorporate prior knowledge of the shape of the object of interest. Due to the similarity between butterflies' overall shapes, we posit that butterfly segmentation is a simpler task than fine-grained butterfly classification. We propose to use masks obtained through an automatic segmentation pipeline to improve the classification performance. However, even though the segmentation is generally correct, a few failure cases can deteriorate the classification performance if used at test time. For this reason, we developed a method to leverage these masks during training through an additional loss, later called the guided attention loss.

Prior automated segmentation We used a pre-trained Mask R-CNN network to generate the segmentation masks and fine-tuned it on a small subset of the dataset. This approach is possible because the butterfly segmentation task is sufficiently similar to the task of segmenting other objects present in a common dataset, and therefore pre-training is very effective. We annotated a small subset (10%) used for pre-training, and we qualitatively assessed the segmentation performance. The segmentation results obtained were satisfactory enough to be used for guided attention.

Foreword designed attention-based loss As our goal is to enforce the model's attention on the butterfly, we use a network that explicitly implements an attention mechanism. For this reason, we used an attention model based on the generic implementation of CBAM (Woo et al. 2018b).
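As a rough sketch of CBAM-style attention (channel attention followed by spatial attention), the following minimal NumPy version illustrates the idea; it is not the authors' implementation, and the MLP weights `w1`, `w2` and the element-wise sum standing in for CBAM's 7×7 convolution are simplifying assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cbam_sketch(feat, w1, w2):
    """Simplified CBAM pass. feat: (C, H, W) feature map.
    w1: (hidden, C) and w2: (C, hidden) weights of the shared MLP."""
    # Channel attention: average- and max-pooled descriptors through a shared MLP.
    avg_desc = feat.mean(axis=(1, 2))                     # (C,)
    max_desc = feat.max(axis=(1, 2))                      # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)          # shared 2-layer MLP, ReLU
    channel_att = sigmoid(mlp(avg_desc) + mlp(max_desc))  # (C,)
    feat = feat * channel_att[:, None, None]
    # Spatial attention: average and max over channels; the real CBAM applies a
    # 7x7 convolution to the stacked maps, replaced here by a plain sum.
    spatial_att = sigmoid(feat.mean(axis=0) + feat.max(axis=0))  # (H, W)
    return channel_att, spatial_att, feat * spatial_att[None, :, :]
```

The separate spatial map is what matters for the guided loss below: it can be compared directly against a segmentation mask of the same spatial size.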
This architecture is a good candidate for its good classification results on several benchmarks and because it separates the spatial attention mask from the channel attention. Therefore, we can penalize high values of the spatial attention mask located outside of the butterfly. Our loss can be written as follows:

   L(M, S) = Σ_{i,k,l} M_i^{k,l} (1 − S^{k,l}) / Σ_{i,k,l} M_i^{k,l}

with M_i^{k,l} the pixel intensity at index (k, l) along the spatial dimensions of the attention-issued mask (M) for channel i, and S^{k,l} the pixel value at index (k, l) along the spatial dimensions of the segmentation-issued mask (S).
   An attention-based loss is applied at each level to the attention-issued masks (M), computed spatially at low scale, with the segmentation-issued masks (S) max-pooled to the correct spatial dimension (e.g., for M ∈ R^{28×28} and S ∈ R^{250×250}, a max-pooling is applied to S).

Experimental set-up For our preliminary experiments, we use a ResNet (He et al. 2016) pre-trained on ImageNet (Deng et al. 2009). To prevent over-fitting, the weight-decay parameter was fine-tuned and a dropout layer (Srivastava et al. 2014) was added before the last fully-connected layer with a keep-probability. Random rotation, flipping, rescaling and cropping were added for data augmentation during training. The best weights on the validation set were saved, and training was interrupted when no improvement was noticed for more than 50 epochs. To obtain preliminary results without changing the class-balancing parameters, we restricted ourselves to a reduced, perfectly balanced dataset containing fewer than 100 species.
   We compare the proposed approach with the model without an attention mechanism (original ResNet) and with the model with an attention mechanism trained without the guided attention loss. We witness the importance of the attention mechanism in the classification task, highlighting the potential of the guided attention approach. Our proposed approach is almost as good in top-1 and better in top-3 accuracy than ResNet with CBAM.

Table 1: ResNet model performance with and without the Convolutional Block Attention Module and Guided Attention. We provide the average accuracy obtained over 3 different seeds and the standard deviation in parentheses. Current best accuracies are in bold.

   Accuracy   ResNet         ResNet + CBAM   ResNet + CBAM + Designed Loss
   Top 1      79.54 (0.70)   80.95 (0.45)    81.01 (0.40)
   Top 3      91.72 (0.49)   93.35 (0.20)    93.00 (0.29)

Analysis Our top-1 accuracy scores in training are near perfect (better than 99% for the three models), which means our loss is ineffective due to over-fitting. In future work we will use more training data to address the class imbalance, as well as an adaptive sampling strategy. Our strategy with adaptive sampling is to use the uncertainty prediction measure of our approach, called Over-CAM (Kantor et al. 2020): our measure rejects predictions if the overlap between the binarized attention (or transformed saliency maps) and the object segmentation is not satisfactory. Indeed, this case implies that the network likely based its prediction on at least some regions outside the butterfly, i.e., on both the background and the foreground. Then, we determine the overlap distribution on the whole test set. Following that, several thresholds are chosen on the distribution curve to determine from which percentage we can ensure a corresponding certainty of prediction.
   Indeed, with a correct prediction and a good overlap on a given picture, it is reasonable to under-weigh this sample in our training set. Furthermore, a good prediction with a low overlap would mean the decision is based on irrelevant features, and therefore under-weighting the image can even benefit the training. Indeed, we can imagine that the wrongly used features would be forgotten later in training. Finally, we can augment the weight of the images incorrectly predicted, similarly to a hard-negative mining (HNM) strategy (Felzenszwalb et al. 2009), which bases the sampling process on the training results for each class. This is equivalent to providing an uncertainty measure, which we can use to ameliorate the class-imbalance problem via image-adaptive sampling.

Regularization The most straightforward solution would be to use more training data (another training set is already at our disposal). Our future work will be to use more training data: one can, for example, use all the training data (with an adaptive sampling strategy to address the class imbalance) or pre-train our model on other pre-existing butterfly datasets. If unsuccessful in addressing the overfitting issue, our approach would be to implement stochastic depth as a regularization method, in addition to stronger data augmentation, such as consistency-training methods applied in a semi-supervised configuration as in MixMatch (Berthelot et al. 2019).

Hierarchical Classification
Our data structure presents a hierarchical property since each label is composed of different but related items. We should make the best use of this knowledge to improve the results. Indeed, classifying over families should be simpler and more robust than classifying over species. Exploiting this hierarchy can improve robustness. For example, if the model is uncertain between two species A and B, which respectively belong to families 1 and 2, while being certain the specimen belongs to family 1, it should predict species A.

Learning the underlying structure while preserving flat classification Even if these hierarchies are common in the real world, they are challenging to leverage to improve classification. On the one hand, using the parent-to-children relation seems critical to extract relevant features and reduce parent-level classification mistakes, where the task should be easier. On the other hand, over-penalizing parent-level relationships can cause the classifier to under-perform on leaf classes compared to flat classification. Therefore, designing a loss that enforces the learning of the underlying hierarchy while preserving the flat classification performance is a challenge we plan to address.
   We thus propose to evaluate the impact of the weighting of the varying elements of a hierarchical loss on the classification performance. In addition, we introduce a loss WCE based on cross-entropy that improves the flat classification performance while penalizing parent-level classification mistakes. For a given sample, we use the following loss:

   WCE = −λs log p(cs) − λg log p(cg) − λf log p(cf),

with λs, λg, λf the weighting coefficients for species, genus and family, cs, cg, cf the species, genus and family class labels, and p a probability distribution function.

The general hierarchical loss

Problem definitions In supervised hierarchical classification applied to images, we consider a data set D containing images whose labels belong to C, a set of classes, and we assume the existence and knowledge of an underlying tree structure of height d > 1. The leaves of the structure represent all the classes of C. More precisely, each leaf is a set containing only one element of C, and each element of C is contained in a different leaf. A parent node is defined as the union of its children. We denote by Ck the collection of sets composed of the nodes at depth k (each node being the set composed of the classes descending from it). This way, C0, the root of the tree, is a collection of sets whose union is equal to C. We assume that:

• ∀c ∈ C, ∃! c′ ∈ Cd, c ∈ c′
• ∀c′ ∈ Cd, |c′| = 1
• ∀1 ≤ i ≤ d, ∀c ∈ Ci, ∀c′ ∈ Ci−1, c ∩ c′ ≠ ∅ ⇒ c ⊆ c′
• ∀1 ≤ i ≤ d, ∀c ∈ Ci, ∃! c′ ∈ Ci−1, c ⊆ c′

   Under these assumptions, it is possible to assess the importance of a classification error. Given an image I ∈ D with its corresponding label cI ∈ C, and considering the prediction tI ∈ C, we define the importance of the error as k(I, t) = d − max{0 ≤ n ≤ d | ∃c ∈ Cn, tI ∈ c ∧ cI ∈ c}. We note that if tI = cI, then k(I, t) = 0.
   With this setting, we are interested in learning a classifier that reduces both the number and the importance of the errors. We denote by p the predicted probability; it is defined over the leaves of the structure and can be extended to every node by the construction principle of each parent node. Indeed, at each level, every node has an empty intersection with its siblings, and therefore the predicted probability of a parent node is equal to the sum of the predicted probabilities of its children.

Weighting-impact method A natural and straightforward approach to learning the hierarchical structure is through a weighted classification loss. We consider the cross-entropy loss for the nodes at each depth of the tree structure. For a depth k ∈ {1, .., d}, we define the cross-entropy loss at this depth as:

   CE(k, I) = − Σ_{c ∈ Ck} 1[cI ∈ c] log p(c),

with p(c) the predicted probability of a node c.
   Given a tuple of weights Λ = (λ1, . . . , λd) ∈ (R+*)^d, we compute the weighted cross-entropy:

   WCE_Λ(I) = Σ_{j=1}^{d} λj CE(j, I)

   This weighted cross-entropy loss is differentiable and allows the optimization of the weights of a CNN through gradient descent.

Cross-entropy loss limitation It is critical to tune the Λ parameter properly, which requires a time-consuming optimization or some expert knowledge. Moreover, the cross-entropy loss has inherent limitations that need to be addressed for proper hierarchical learning. Indeed, as the model weights converge during training, the predicted probability of the target class converges to 1. Moreover, given the labels' underlying structure, the parent node's predicted probability is always greater than its children's. Since the cross-entropy loss is expressed as −log p(ci), with ci the label class, the magnitude of its gradient with respect to p decreases as p increases. As a result, the cross-entropy loss naturally under-weighs the optimization of parent-level features with respect to the children and requires a weighting. For this reason, we propose a loss whose gradient magnitude does not decrease as the probability gets closer to 1. The divergence at 0 implies a small impact of the weighting, and the convergence to 0 at 1 implies the importance of the species-level term.

Loss properties We designed a loss function with the shape shown in Figure 1. When both probabilities are close to 1, it yields 0, and when both probabilities are close to 0, it yields 1. The essence of this idea is that when the gradient magnitude of the loss is close to 0, the gradient magnitude is higher in the direction of the genus rather than the family. The reverse is observed when it is close to 1. Such penalization is selected to hinder optimization on the genus if the family's optimization is affected negatively.
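For concreteness, the weighted hierarchical cross-entropy WCE_Λ can be sketched as a short NumPy function that aggregates leaf probabilities up the tree (parent probability = sum of its children's probabilities). The function name and the example parent maps are hypothetical, not the authors' code:

```python
import numpy as np

def weighted_hierarchical_ce(leaf_probs, leaf_label, parent_maps, weights):
    """WCE_Lambda for one sample.
    leaf_probs:  predicted probabilities over the leaves (species).
    leaf_label:  index of the true leaf class.
    parent_maps: one list per level above the leaves, mapping each node
                 index to its parent index (species->genus, genus->family, ...).
    weights:     (lambda_leaf, lambda_level1, ...), one weight per depth."""
    probs = np.asarray(leaf_probs, dtype=float)
    label = leaf_label
    loss = -weights[0] * np.log(probs[label])          # leaf-level cross-entropy
    for w, parent_of in zip(weights[1:], parent_maps):
        agg = np.zeros(max(parent_of) + 1)
        for child, parent in enumerate(parent_of):
            agg[parent] += probs[child]                # parent prob = sum of children
        probs, label = agg, parent_of[label]
        loss += -w * np.log(probs[label])              # cross-entropy at this depth
    return loss
```

With uniform weights this reduces to the flat sum −log p(cs) − log p(cg) − log p(cf) from the WCE definition; tuning the λ's trades off leaf-level accuracy against parent-level consistency.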
Figure 1: Loss function to be designed: Pf is the family probability and Pg is the genus probability.

                             Conclusion

In this article, we propose a method for fine-grained recognition and classification of wildlife images. In particular, we propose to guide convolutional neural networks by leveraging attention masks along with segmentation as a means of being less sensitive to the typically detail-rich environment. This work shows improved results in top-3 accuracy in comparison to the state of the art. Furthermore, we explore the use of a hierarchical loss to leverage species phylogeny. Our approach is general enough to be adapted to broader fine-grained classification contexts. Our methodology can be of great use for large-scale wildlife crowd-sourcing programs that gather crucial census data to understand species demographics and dynamics.

                             References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473.
Berthelot, D.; Carlini, N.; Goodfellow, I.; Oliver, A.; Papernot, N.; and Raffel, C. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. URL https://arxiv.org/pdf/1905.02249.pdf.
Boussioux, L.; Giro-Larraz, T.; Guille-Escuret, C.; Cherti, M.; and Kégl, B. 2019. InsectUp: Crowdsourcing Insect Observations to Assess Demographic Shifts and Improve Classification. URL https://arxiv.org/pdf/1906.11898.pdf.
Chai, Y.; Lempitsky, V.; and Zisserman, A. 2013. Symbiotic segmentation and part localization for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision, 321–328.
Chen, Y.; Bai, Y.; Zhang, W.; and Mei, T. 2019. Destruction and construction learning for fine-grained image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5157–5166.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D.; and Ramanan, D. 2009. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9): 1627–1645.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Horn, G. V.; Aodha, O. M.; Song, Y.; Shepard, A.; Adam, H.; Perona, P.; and Belongie, S. J. 2017. The iNaturalist Challenge 2017 Dataset. CoRR abs/1707.06642. URL http://arxiv.org/abs/1707.06642.
Kantor, C.; Rauby, B.; Boussioux, L.; Jehanno, E.; and Talbot, H. 2020. Over-CAM: Gradient-Based Localization and Spatial Attention for Confidence Measure in Fine-Grained Recognition using Deep Neural Networks. URL https://hal.archives-ouvertes.fr/hal-02974521. Working paper or preprint.
Kosmopoulos, A.; Partalas, I.; Gaussier, E.; Paliouras, G.; and Androutsopoulos, I. 2020. Evaluation Measures for Hierarchical Classification: a Unified View and Novel Approaches. URL http://www2.aueb.gr/users/ion/docs/dami_final_manuscript.pdf.
Lin, Z.; Feng, M.; Dos Santos, C.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A Structured Self-attentive Sentence Embedding.
Parikh, A.; Täckström, O.; Das, D.; and Uszkoreit, J. 2016. A Decomposable Attention Model for Natural Language Inference. 2249–2255. doi:10.18653/v1/D16-1244.
Prudic, K. L.; McFarland, K. P.; Oliver, J. C.; Hutchinson, R. A.; Long, E. C.; Kerr, J. T.; and Larrivée, M. 2017. eButterfly: Leveraging Massive Online Citizen Science for Butterfly Conservation. Insects 8(2). ISSN 2075-4450. doi:10.3390/insects8020053. URL https://www.mdpi.com/2075-4450/8/2/53.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241. Springer.
Roy, D.; Panda, P.; and Roy, K. 2019. Tree-CNN: A Hierarchical Deep Convolutional Neural Network for Incremental Learning. URL https://arxiv.org/pdf/1802.05800.pdf.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, 618–626.
Song, K.; Wei, X.; Shu, X.; Song, R.; and Lu, J. 2020. Bi-Modal Progressive Mask Attention for Fine-Grained Recognition. IEEE Transactions on Image Processing 29: 7006–7018. doi:10.1109/TIP.2020.2996736.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15(56): 1929–1958. URL http://jmlr.org/papers/v15/srivastava14a.html.
Vaswani, A.; et al. 2017. Attention is All you Need. In Guyon, I.; Luxburg, U. V.; Bengio, S.; Wallach, H.; Fergus, R.; Vishwanathan, S.; and Garnett, R., eds., Advances in Neural Information Processing Systems 30, 5998–6008. Curran Associates, Inc. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Woo, S.; Park, J.; Lee, J.-Y.; and So Kweon, I. 2018a. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), 3–19.
Woo, S.; Park, J.; Lee, J.-Y.; and So Kweon, I. 2018b. CBAM: Convolutional Block Attention Module. In The European Conference on Computer Vision (ECCV).
Wu, C.; Tygert, M.; and LeCun, Y. 2019. A hierarchical loss and its problems when classifying non-hierarchically. URL https://arxiv.org/pdf/1709.01062.pdf.
Xie, L.; Tian, Q.; Hong, R.; Yan, S.; and Zhang, B. 2013. Hi-
erarchical part matching for fine-grained visual categoriza-
tion. In Proceedings of the IEEE International Conference
on Computer Vision, 1641–1648.
Zheng, H.; Fu, J.; Zha, Z.; Luo, J.; and Mei, T. 2020. Learn-
ing Rich Part Hierarchies With Progressive Attention Net-
works for Fine-Grained Image Recognition. IEEE Transac-
tions on Image Processing 29: 476–488. doi:10.1109/TIP.
2019.2921876.
Zhou, F.; and Lin, Y. 2016. Fine-grained image classification
by exploring bipartite-graph labels. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recogni-
tion, 1124–1133.