<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mykola Baranov</string-name>
          <email>mykola.baranov@lnu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuriy Shcherbyna</string-name>
          <email>yuriy.shcherbyna@lnu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ivan Franko National University of Lviv</institution>
          ,
          <addr-line>Universytetska St, 1, Lviv, L'vivs'ka oblast, 79000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Deep learning has introduced many successful approaches in numerous supervised learning areas, including computer vision. Modern neural network-based models have demonstrated human-level accuracy. The limitation that comes along with deep learning is the requirement of large-scale datasets to train such models. Despite large-scale datasets like ImageNet or OpenImages, a huge number of classes remain uncovered. Extending an existing model by even one more class requires a lot of data collection and annotation. Few-shot learning approaches tackle the issue of large-scale dataset requirements. Most few-shot learning approaches address identity recognition (such as face recognition), while novel object classification remains a challenging task. In this work, we build a metric learning model based on triplet loss and explore how such a triplet loss-driven model may be applied to image recognition when data is scarce. Our experiments show that this kind of model reaches up to 83% accuracy using only a few samples per class.</p>
      </abstract>
      <kwd-group>
        <kwd>Few-shot learning</kwd>
        <kwd>metric learning</kwd>
        <kwd>distance learning</kwd>
        <kwd>deep learning</kwd>
        <kwd>computer vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Deep learning models have recently achieved great success in various computer vision tasks
like image classification, segmentation, object detection, etc. Much of this progress has been achieved
by increasing model capacity: researchers came up with models like ResNeXt[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or Inception[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Some steps have been made towards a tradeoff between model depth and parameter count: the
EfficientNet family[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is the best choice when a balance between processing speed and accuracy is required. It
has been proven that using rich large-scale datasets leads to good results in terms of model key
performance indicators. This is natural, since deep learning models tend to generalize data. So, providing
a large amount of data leads to better generalization, while training the same model on a few samples
will certainly lead to overfitting. There are numerous techniques for preventing overfitting (random
erase[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], CutOut[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], grid mask[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], drop block[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and others). Such techniques make it harder to
overfit to specific image features by augmenting them, but they are almost useless when the number of
training samples is tiny.
      </p>
      <p>
        Synthetic data generation may be used to augment a small dataset with
generative adversarial networks[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They are proven to generate realistic images in a controlled
environment[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], but they fail to produce completely new views of a given object (for example, a side
view instead of a front view).
      </p>
      <p>
        Transfer learning is a widely used approach in which an existing pretrained model is fine-tuned on a small
dataset. In computer vision, this is usually done by cutting off the head layers and replacing them with
newly created ones. It is proven that fine-tuning only the last layers gives a significant improvement
by transferring the knowledge of the base model to the tuned model[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The described research trend is extensive rather than intensive development in terms of model
flexibility. It is impossible to extend a classification model with a new class without:
● retraining the model;
● extending a large-scale dataset.</p>
      <p>Moreover, even satisfying the requirements described above does not guarantee a
successful model retraining. It is worth mentioning that many business cases for deep learning
models do not accept manual work like data annotation, and others cannot manage
computation resources on their own. A typical example of such a business case is intelligent
cashier-less checkout in a supermarket.</p>
      <p>Few-shot learning is a novel approach to deep learning. The main idea of this concept is to
replace the classical supervised learning task. Instead of forcing the model to generalize the training
dataset (or memorize it, in the case of overfitting), we train the model's ability to learn. It is natural
that a baby is able to recognize a car after seeing only several cars in its life. This is done by memorizing
several samples of cars and then performing an intelligent comparison of new objects with the
memorized ones. This is exactly what metric learning does.</p>
      <p>Many few-shot research works are focused on individual recognition rather than distinct
object recognition (e.g. face recognition is the main focus of few-shot learning, since all faces share
the same shape but differ in details). In this paper, we move the focus of few-shot object classification
from individual recognition to distinct object recognition, and we explore the capacity of
few-shot classification approaches as a replacement for traditional deep learning
classification approaches. In summary, the contributions of this paper are:
1. Building a few-shot 1000-class classification deep learning pipeline
2. Exploring the strengths and weaknesses of our model on routine object classification
3. Comparing the results with traditional image classification approaches</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>Traditional classification approaches use softmax activation along with a cross-entropy loss.
That means that the i-th output neuron of the model is responsible for the probability of the i-th class.
Obviously, such models deal with a fixed number of classes.</p>
      <p>
        Metric learning suggests taking another approach: instead of directly predicting the class
of the input picture, we calculate the distance between picture pairs. In other words, we build a novel
metric function that is able to calculate distances in picture space. Such a function should satisfy
the following requirements:
1. The distance between pictures that belong to the same class should be minimal
2. The distance between pictures of different classes should be as high as possible
3. The distance between identical pictures should be zero
Obviously, comparing pictures pixel-wise leads to poor results. Usually, such metrics are constructed
from a deep learning backbone (feature extractor) and some predefined metric (such as
L2 distance or cosine distance). In that setup the backbone is the trainable part, which learns the
embeddings of the pictures.
A simple loss that forces the model to learn such embeddings is the contrastive loss[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]:
$L = (1 - Y)\,\frac{1}{2}D_w^2 + Y\,\frac{1}{2}\max(0,\, m - D_w)^2$,
where Y corresponds to the pair label. It is equal to 1 if the pair contains images of different
objects; otherwise, it is equal to 0. In the contrastive loss formula, Y acts as a switch, tying two
different terms together without an explicit switch while keeping the formula differentiable. This allows us
to use the contrastive loss in backpropagation. Dw represents the distance between the two images in some
arbitrary metric space. In a nutshell, optimizing the contrastive loss forces the model to increase the distance
between images of different objects and to keep the distance small within images of the same
object. The margin parameter m controls the maximum penalized distance between pairs of images. It allows
the model to pay attention to pairs that fail to satisfy the loss rather than continue optimizing successful cases.
      </p>
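      <p>For illustration, the contrastive loss above can be sketched in a few lines (PyTorch assumed; the function and variable names are ours, not part of the original pipeline):</p>
      <preformat>
import torch

def contrastive_loss(emb1, emb2, y, margin=1.0):
    """y = 0 for same-object pairs, y = 1 for different-object pairs."""
    d = torch.nn.functional.pairwise_distance(emb1, emb2)  # D_w, the L2 distance
    same = (1 - y) * d.pow(2)                              # pull same-object pairs together
    diff = y * torch.clamp(margin - d, min=0).pow(2)       # push different pairs apart, up to margin m
    return 0.5 * (same + diff).mean()
      </preformat>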
      <p>
        Further study showed that a pairwise loss is not the best option. Triplet loss performs much
better in many cases[
        <xref ref-type="bibr" rid="ref12 ref13">12,13</xref>
        ]. The idea of triplet loss is the same as in the contrastive loss, but it operates
on three samples:
1. Anchor - a random sample from the dataset
2. Positive - a random sample of the same class as the anchor
3. Negative - a random sample of another class
The loss is $L = \max(d(a, p) - d(a, n) + m,\, 0)$,
where d(a, p) stands for the distance between the anchor and the positive sample, and d(a, n) stands for
the distance between the anchor and the negative sample. The target of triplet loss is to make the distance
between samples of different classes higher than the distance between similar samples, which implicitly defines our
main target. The margin m is used in order to prevent overfitting on easier samples (it plays a
role similar to the margin of the contrastive loss).
      </p>
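      <p>Under the same assumptions, a minimal sketch of the triplet loss:</p>
      <preformat>
import torch

def triplet_loss(anchor, positive, negative, margin=1.0):
    d_ap = torch.nn.functional.pairwise_distance(anchor, positive)  # d(a, p)
    d_an = torch.nn.functional.pairwise_distance(anchor, negative)  # d(a, n)
    # hinge: only triplets violating d(a, p) + m > d(a, n) contribute to the loss
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
      </preformat>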
      <p>
        However, the selection of random triplets is not the best option. We will show in the next sections that it
leads to optimizing the model into a local minimum. In work[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] several strategies of
triplet mining were proposed. Triplets may be classified into three groups:
1. Easy triplets: triplets which have a loss of 0
2. Hard triplets: triplets where the negative is closer to the anchor than the positive
3. Semi-hard triplets: triplets where the negative is not closer to the anchor than the positive, but
which still have a positive loss
The impact of these strategies is shown in the next sections.
      </p>
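      <p>The three groups can be expressed directly in terms of the distances d(a, p) and d(a, n). The following toy helper (our naming, PyTorch assumed) makes the definitions concrete:</p>
      <preformat>
import torch

def categorize_triplets(d_ap, d_an, margin=1.0):
    """d_ap, d_an: per-triplet anchor-positive / anchor-negative distances."""
    loss = torch.clamp(d_ap - d_an + margin, min=0)
    easy = loss == 0   # already satisfied, contributes no gradient
    hard = d_ap > d_an  # negative closer to the anchor than the positive
    semi_hard = torch.logical_and(d_an > d_ap, loss > 0)  # correctly ordered, but inside the margin
    return easy, hard, semi_hard
      </preformat>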
      <p>
        Center loss[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] penalizes the distance between embeddings and their corresponding class
centers in the Euclidean space to achieve intra-class compactness. However, it has limited applications,
since many real-life scene images do not form dense clusters, so the centroid of a cluster may lie outside
the cluster, which breaks the training process. The contrastive-center loss is defined as
$L_{ct-c} = \frac{1}{2}\sum_{i=1}^{m} \frac{\|x_i - c_{y_i}\|_2^2}{\left(\sum_{j=1, j \neq y_i}^{k} \|x_i - c_j\|_2^2\right) + \delta}$,
where Lct−c stands for the contrastive-center loss; xi denotes the embedding of the i-th training sample, with
dimension d; yi stands for the label of xi; cyi represents the yi-th class center of the deep features, with
the same dimension d. The hyperparameter δ is a constant used for numeric stability.
      </p>
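      <p>For reference, a sketch of the contrastive-center loss as a trainable module (a direct reading of the formula above; PyTorch assumed, names are ours):</p>
      <preformat>
import torch

class ContrastiveCenterLoss(torch.nn.Module):
    """Pull embeddings toward their own class center, push them from the other centers."""
    def __init__(self, num_classes, dim, delta=1.0):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(num_classes, dim))
        self.delta = delta

    def forward(self, x, y):
        d2 = torch.cdist(x, self.centers).pow(2)         # squared distance to every class center
        intra = d2.gather(1, y.unsqueeze(1)).squeeze(1)  # distance to the own-class center
        inter = d2.sum(dim=1) - intra                    # sum of distances to all other centers
        return 0.5 * (intra / (inter + self.delta)).mean()
      </preformat>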
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Materials</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Dataset and task definition</title>
      <p>There are many techniques for few-shot learning. One of the most popular uses a support
set. Let's define the N-way-K-shot classification problem: in that setup, the support set contains K
samples for each of N classes, and during a forward pass the model assigns an input to one of the N
classes. There are many papers and benchmarks tackling that problem.</p>
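      <p>To make the N-way-K-shot setup concrete, here is a toy episode sampler; the dict-based dataset layout is a hypothetical illustration, not the format of any particular benchmark:</p>
      <preformat>
import random

def sample_episode(dataset, n_way=5, k_shot=5, n_queries=5):
    """dataset: dict mapping a class name to a list of image paths."""
    classes = random.sample(list(dataset), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = random.sample(dataset[cls], k_shot + n_queries)
        support += [(path, label) for path in picks[:k_shot]]
        query += [(path, label) for path in picks[k_shot:]]
    return support, query
      </preformat>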
      <p>
        It is natural that increasing the number of classes N decreases the overall accuracy of the
model. So, that approach leads to poor results when working with a huge number of classes (as in
MiniImageNet[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]).
      </p>
      <p>In this work, we are going to build a model that classifies images across 1000 classes
where only 10 images per class are available. Since we have a limited amount of data per class,
traditional classification would lead to poor results.</p>
      <p>
        Our work is mainly based on the FSS-1000 dataset[
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This dataset is specially collected for
few-shot learning purposes. Many popular datasets (like ImageNet[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], PASCAL VOC[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
ILSVRC[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) introduce a high bias: it may be a class balance bias, a semantic bias, an items-per-image
bias, etc. The main purpose of the FSS-1000 dataset is to eliminate any bias from the data. Thus it
contains 1000 classes, and each class is represented by 10 pictures. All pictures were collected across different
search engines (Google, Yahoo, Bing, etc). Moreover, each picture has a constant resolution of
224×224 and only one class instance is present at a time. These properties ensure that our model will
have no bias toward any specific properties of the images. In contrast, a model trained on ImageNet may
be perfect at dog classification (because that dataset contains more than 200 dog breeds) but fail at
car recognition.
      </p>
      <p>FSS-1000 provides instance segmentation masks. Since this work is focused on a
classification task, we crop each image by the bounding box of its mask. We compute the bounding
box as the minimum and maximum coordinates of the mask along both the X and Y axes. Each cropped
sample was resized to the standard resolution of 224x224 in order to match the expected input resolution of
the pretrained backbone models.</p>
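      <p>A minimal sketch of this cropping step, assuming NumPy masks and PIL images (the helper name is ours):</p>
      <preformat>
import numpy as np
from PIL import Image

def crop_by_mask(image, mask, size=(224, 224)):
    """image: PIL.Image; mask: binary array of the same height and width."""
    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)  # min/max mask coordinates
    return image.crop(box).resize(size, Image.BILINEAR)
      </preformat>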
      <p>Resizing the cropped samples introduces additional distortion. In Figure 5 we can
observe the level of such distortions. However, according to our exploration of the original aspect ratios, most
of the samples have a 1:1 aspect ratio, which eliminates distortion when resizing to 224x224.</p>
    </sec>
    <sec id="sec-5">
      <title>3.2. Methodology and setup</title>
      <p>In our experiments we propose two deep learning pipelines. One is suitable for binary
classification (i.e. deciding whether two given images are similar or not). The second one performs traditional
image classification over 1000 classes. Both pipelines are quite similar; the main difference lies in
the embedding processor module.</p>
      <p>
        Since the dataset we are working with is quite limited, we decided to use a model from the
EfficientNet family. In our experiments, we used EfficientNet B2[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] as the main backbone of our
model. We took weights pretrained on ImageNet[
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] as the initial weights.
      </p>
      <p>A fully connected layer is initialized randomly. L2 normalization is used in order to keep a
meaningful scale of the extracted embeddings.</p>
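      <p>A sketch of the resulting embedding model, assuming the torchvision implementation of EfficientNet B2 (whose penultimate feature width is 1408); the exact head layout is our reading of the setup, not the authors' released code:</p>
      <preformat>
import torch
import torchvision

class EmbeddingNet(torch.nn.Module):
    """EfficientNet B2 backbone, a randomly initialized FC head, and L2 normalization."""
    def __init__(self, dim=256):
        super().__init__()
        backbone = torchvision.models.efficientnet_b2(weights="IMAGENET1K_V1")
        backbone.classifier = torch.nn.Identity()  # drop the 1000-class ImageNet head
        self.backbone = backbone
        self.fc = torch.nn.Linear(1408, dim)       # new, randomly initialized layer

    def forward(self, x):
        z = self.fc(self.backbone(x))
        return torch.nn.functional.normalize(z, p=2, dim=1)  # keep embeddings on the unit sphere
      </preformat>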
      <p>For the 1000-class classification task we start from the pipeline defined above. First of all, we
cut off the similarity estimation model and replace it with a traditional K-nearest neighbors (KNN) model.
Since the KNN model cannot be trained using gradient descent methods, we introduce a two-stage
training pipeline. In the first stage, we train the backbone model using triplet loss; moreover, we may
reuse the backbone extractor trained for the binary classification task. In the second stage, we
calculate an embedding for each sample in the training dataset. Afterward, the extracted embeddings are fed
to the KNN model along with the class labels. So, the backbone model extracts an embedding from the input
image, and the KNN model predicts a class label based on the nearest neighbors.</p>
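      <p>The second stage can then be sketched with a standard KNN implementation, e.g. scikit-learn (an assumption; the paper does not name the library):</p>
      <preformat>
from sklearn.neighbors import KNeighborsClassifier

# Stage 2: fit KNN on embeddings precomputed by the stage-1 backbone.
def fit_knn(train_embeddings, train_labels, k=7):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(train_embeddings, train_labels)
    return knn

# Inference: embed the image with the frozen backbone, then query the KNN.
# predictions = knn.predict(model(images).detach().cpu().numpy())
      </preformat>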
    </sec>
    <sec id="sec-6">
      <title>4. Experiment</title>
    </sec>
    <sec id="sec-7">
      <title>4.1. Dataset setup</title>
      <p>All training and evaluation have been performed on the FSS-1000 dataset. We split the
dataset randomly per class: we take 7 samples of each class for training, and the rest are used for
validation purposes. We follow the traditional train/validation split approach (70% to 30% in our case), with
shuffling performed with respect to class balance. As a result, we obtain a perfectly balanced dataset
with exactly the same number of samples per class in both the training and validation splits.</p>
    </sec>
    <sec id="sec-8">
      <title>4.2. Embedding model training</title>
      <p>We use an embedding size of 256 (i.e. the fully connected layer size is set to 256). In order to
keep the backbone weights healthy, we start training with the backbone weights frozen. After reaching a
loss plateau, we unfreeze the backbone weights and continue training. Such a setup helps us to increase the final
accuracy by ~2%.</p>
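      <p>The two-phase freeze/unfreeze schedule, sketched against the EmbeddingNet model above:</p>
      <preformat>
model = EmbeddingNet(dim=256)

# Phase 1: freeze the backbone so only the fresh FC head is updated.
for p in model.backbone.parameters():
    p.requires_grad = False

# ... train with the triplet loss until the loss reaches a plateau ...

# Phase 2: unfreeze the backbone and continue training end to end.
for p in model.backbone.parameters():
    p.requires_grad = True
      </preformat>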
      <p>
        Triplet loss is used as the main loss to optimize during training. We use a semi-hard triplet
mining policy[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. We set the triplet loss margin to 1, and L2 distance is used as the distance
between embeddings.
We calculate the distances between each pair of samples. The distribution of these distances is presented
in Figure 9. We find the most suitable threshold for the training set using the Otsu thresholding
algorithm. Applying the obtained threshold (0.0048) to the validation set gives us a 0.921 true positive
rate and a 0.998 true negative rate.
      </p>
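      <p>A sketch of the threshold selection and evaluation; we assume scikit-image's Otsu implementation and our own function names, since the paper does not specify the tooling:</p>
      <preformat>
import numpy as np
from skimage.filters import threshold_otsu

def evaluate_threshold(train_dists, val_dists, val_is_positive):
    """train_dists: pair distances on the training split;
    val_dists, val_is_positive: pair distances and same-class flags on validation."""
    t = threshold_otsu(np.asarray(train_dists))  # split the bimodal distance histogram
    pred_same = t >= np.asarray(val_dists)       # below the threshold: predicted same class
    is_pos = np.asarray(val_is_positive, dtype=bool)
    tpr = pred_same[is_pos].mean()               # true positive rate on same-class pairs
    tnr = (~pred_same)[~is_pos].mean()           # true negative rate on different-class pairs
    return t, tpr, tnr
      </preformat>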
    </sec>
    <sec id="sec-9">
      <title>5. Results</title>
      <p>We use the KNN model with K=7. This number of nearest neighbors is chosen according
to the number of training samples per class. We fit the KNN model to the training embeddings.</p>
      <p>It is very important to select a triplet mining policy when using triplet loss. Deep models tend to
overfit training data (especially when trained on small datasets). Using triplet loss increases the
number of training examples combinatorially (by combining three samples there are many possible
triplets). Despite the large number of different triplets, there is still a chance of overfitting the model.
This is caused by the fact that the model optimizes all distances between samples, including the already
good distances between easy samples. In order to prevent such behavior, a triplet mining policy is
applied.</p>
      <p>As we can see, random triplet mining tends to generate very good embeddings for some classes,
while the embeddings of the rest are poor. Such a setup may produce good scores on the binary classification
task, but for 1000-class classification only 0.122 accuracy was obtained.</p>
    </sec>
    <sec id="sec-10">
      <title>6. Discussions</title>
      <p>Traditional deep learning approaches come up with a model that classifies
images end-to-end. Such models implicitly consist of two parts: feature extraction and
feature classification. The goal of the classification layers is to attribute the input features to one of the
classes. The role of the feature extraction module is to produce features that the classification layers
are able to classify. Thus, combining these two modules in one model, along with gradient descent
optimization methods, gives us an end-to-end trainable model. Since these modules are trained
simultaneously, the feature extractor tries to produce features that are easy to classify, and there are no
other constraints on such embeddings. Thus, the model may pay attention to one distinctive part of the
object regardless of the real importance of that part (for example, glass may be treated as the most
important evidence that an object is a car, despite numerous other objects having glass).</p>
      <p>In contrast to that approach, the same networks can explicitly learn separable embeddings. By
separable we mean that the model is forced to compute features that are distinctive from the embeddings
of other classes rather than tailored to a fixed set of labels; in other words, the model doesn't operate with class labels at all.</p>
      <p>In Figure 9 we show how well our embeddings are separable. There is a tiny overlap between
the positive and negative distances, which corresponds to hard negative and hard positive pairs. In this overlap
region we are likely to give a wrong classification, but having a distance value we can estimate
a confidence level and either refuse to give any prediction or bias the decision toward one of the classes (which
class depends on the task). In Figure 9 we also plot the distributions for the training and validation
sets; the validation distribution fits the training one well, which indicates that there was no overfitting. The same
conclusion is supported by the validation loss following the training loss during the training
process.</p>
      <p>Having such embeddings, we can apply a distance-based classification model to them. We find the
K-nearest neighbors classifier the best candidate for classification. K-fold cross-validation on the training set (with
K=5) gives us the best parameters for KNN: 7 neighbors and distance-weighted voting.
Evaluation of such a model on the validation set shows a significant improvement over
the traditional model. In particular, it gives us more than a 10% accuracy boost in comparison with
EfficientNet B2 followed by classification layers. We also evaluated top-5 accuracy and got roughly
the same boost.</p>
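      <p>The described parameter search maps onto a standard 5-fold grid search, e.g. with scikit-learn (a sketch; the embedding arrays are assumed to be precomputed):</p>
      <preformat>
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# train_embeddings, train_labels: precomputed stage-1 embeddings and their labels.
grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]},
    cv=5,
)
grid.fit(train_embeddings, train_labels)
print(grid.best_params_)  # 7 neighbors with distance weighting performed best in our setup
      </preformat>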
      <p>We also want to emphasize the gap between the training and validation accuracies. The traditional
model overfits the training data completely, while our embedding extractor model still avoids overfitting and
produces better performance. Note that we do not consider the K-nearest neighbors model to overfit,
since it memorizes the training data by definition.</p>
      <p>The key difference between the few-shot model and the traditional end-to-end model is the
semantics of the extracted embeddings. In a triplet loss based model, we can only find similar embeddings
in the database, while the features don't provide any class-specific representation. That's why such a
relatively small amount of data is enough to fit the model, while the traditional approach leads to
overfitting.</p>
    </sec>
    <sec id="sec-11">
      <title>7. Conclusions</title>
      <p>Classical approaches to deep learning image classification usually consist of two modules
implicitly tied together: a feature extraction module and embedding classification layers.
Training such models forces the feature extraction module to find class-specific embeddings.
Since there are no specific constraints on the extracted embeddings, they are not separable in general.</p>
      <p>In this work, we have tackled the issue of a large-scale classification task while dealing with
a small amount of data. The proposed two-stage classification model is very promising in terms of
class capacity, resistance to overfitting, and the ability to fit unseen classes. In particular, it gives us a
significant accuracy boost (more than 10% of top-1 accuracy) while avoiding overfitting. The distances
between the produced embeddings follow a normal distribution without any outliers except easy positive
pairs (very similar images, in simple words). This indicates that such a model may be easily extended to
predict novel classes as-is: it is enough to precalculate embeddings of the unseen classes, and such
a model is likely to classify unseen images correctly.</p>
      <p>
        The few-shot model also seems to be a promising approach for large-scale image classification in
terms of the number of internal parameters. Our model contains fewer than 9M parameters, while the
smallest model with a similar performance on ImageNet 1000-class classification contains up to 66M
parameters[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], which is more than 7 times bigger. So, our approach promises to be much faster than
traditional approaches with the same performance while requiring dramatically less data per class.
      </p>
      <p>We plan some additional steps for further research. In particular, we are going to
experiment with the embedding size, the triplet loss margin, etc. We believe that better embeddings may be
obtained by ensembling embeddings from different models.</p>
    </sec>
    <sec id="sec-12">
      <title>8. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Xie</surname>
            , Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and
            <given-names>Kaiming</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          .
          <article-title>"Aggregated residual transformations for deep neural networks."</article-title>
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          (
          <year>2017</year>
          ):
          <fpage>1492</fpage>
          -
          <lpage>1500</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Szegedy</surname>
            , Christian, Wei Liu, Yangqing Jia,
            <given-names>Pierre</given-names>
          </string-name>
          <string-name>
            <surname>Sermanet</surname>
            , Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and
            <given-names>Andrew</given-names>
          </string-name>
          <string-name>
            <surname>Rabinovich</surname>
          </string-name>
          .
          <article-title>"Going deeper with convolutions."</article-title>
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          (
          <year>2015</year>
          ):
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <surname>Mingxing</surname>
            , and
            <given-names>Quoc</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>"Efficientnet: Rethinking model scaling for convolutional neural networks."</article-title>
          <source>In International conference on machine learning</source>
          (
          <year>2019</year>
          ):
          <fpage>6105</fpage>
          -
          <lpage>6114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Zhong</surname>
            , Zhun, Liang Zheng, Guoliang Kang,
            <given-names>Shaozi</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>and Yi</given-names>
          </string-name>
          <string-name>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>"Random erasing data augmentation."</article-title>
          <source>In Proceedings of the AAAI conference on artificial intelligence</source>
          , vol.
          <volume>34</volume>
          , no.
          <volume>07</volume>
          (
          <year>2020</year>
          ):
          <fpage>13001</fpage>
          -
          <lpage>13008</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] DeVries, Terrance, and
          <string-name>
            <surname>Graham</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>"Improved regularization of convolutional neural networks with cutout</article-title>
          .
          <source>" arXiv preprint arXiv:1708.04552</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Chen</surname>
            , Pengguang, Shu Liu,
            <given-names>Hengshuang</given-names>
          </string-name>
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>and Jiaya</given-names>
          </string-name>
          <string-name>
            <surname>Jia</surname>
          </string-name>
          .
          <article-title>"Gridmask data augmentation." arXiv preprint arXiv:</article-title>
          <year>2001</year>
          .
          <volume>04086</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Ghiasi</surname>
          </string-name>
          , Golnaz,
          <string-name>
            <surname>Tsung-Yi Lin</surname>
          </string-name>
          , and
          <string-name>
            <surname>Quoc</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>"Dropblock: A regularization method for convolutional networks</article-title>
          .
          <source>" Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Geirhos</surname>
          </string-name>
          , Patricia Rubisch, Claudio Michaelis, Matthias Bethge,
          <article-title>Felix A Wichmann, and Wieland Brendel. “ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness”</article-title>
          <source>In International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Tonioni</surname>
            , Alessio,
            <given-names>Eugenio</given-names>
          </string-name>
          <string-name>
            <surname>Serra</surname>
          </string-name>
          , and Luigi Di Stefano.
          <article-title>"A deep learning pipeline for product recognition on store shelves."</article-title>
          <source>In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS)</source>
          (
          <year>2018</year>
          );
          <fpage>25</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Zhuang</surname>
            , Fuzhen, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and
            <given-names>Qing</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          .
          <article-title>"A comprehensive survey on transfer learning</article-title>
          .
          <source>" Proceedings of the IEEE 109, no. 1</source>
          (
          <year>2020</year>
          ):
          <fpage>43</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Khosla</surname>
            , Prannay, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and
            <given-names>Dilip</given-names>
          </string-name>
          <string-name>
            <surname>Krishnan</surname>
          </string-name>
          .
          <article-title>"Supervised contrastive learning</article-title>
          .
          <source>" Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          ):
          <fpage>18661</fpage>
          -
          <lpage>18673</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Qi</surname>
            , Ce, and
            <given-names>Fei</given-names>
          </string-name>
          <string-name>
            <surname>Su</surname>
          </string-name>
          .
          <article-title>"Contrastive-center loss for deep neural networks."</article-title>
          <source>In 2017 IEEE international conference on image processing (ICIP)</source>
          (
          <year>2017</year>
          ):
          <fpage>2851</fpage>
          -
          <lpage>2855</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Russakovsky</surname>
          </string-name>
          , Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al.
          <article-title>"Imagenet large scale visual recognition challenge."</article-title>
          <source>International journal of computer vision 115</source>
          , no.
          <issue>3</issue>
          (
          <year>2015</year>
          ):
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vicente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Agapito</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Batista</surname>
          </string-name>
          ,
          <article-title>"</article-title>
          <source>Reconstructing PASCAL VOC," 2014 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , (
          <year>2014</year>
          ):
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          , doi: 10.1109/CVPR.
          <year>2014</year>
          .
          <volume>13</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Xiang</given-names>
          </string-name>
          , Tianhan Wei, Yau Pun Chen,
          <string-name>
            <surname>Yu-Wing Tai</surname>
          </string-name>
          , and
          <string-name>
            <surname>Chi-Keung Tang</surname>
          </string-name>
          .
          <article-title>"Fss-1000: A 1000-class dataset for few-shot segmentation."</article-title>
          <source>In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          , pp.
          <fpage>2869</fpage>
          -
          <lpage>2878</lpage>
          .
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Xuan</surname>
            , Hong,
            <given-names>Abby</given-names>
          </string-name>
          <string-name>
            <surname>Stylianou</surname>
            , and
            <given-names>Robert</given-names>
          </string-name>
          <string-name>
            <surname>Pless</surname>
          </string-name>
          .
          <article-title>"Improved embeddings with easy positive triplet mining."</article-title>
          <source>In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          (
          <year>2020</year>
          ):
          <fpage>2474</fpage>
          -
          <lpage>2482</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Xie</surname>
          </string-name>
          , Qizhe,
          <string-name>
            <surname>Minh-Thang</surname>
            <given-names>Luong</given-names>
          </string-name>
          , Eduard Hovy, and
          <string-name>
            <surname>Quoc</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>"Self-training with noisy student improves imagenet classification."</article-title>
          <source>In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          (
          <year>2020</year>
          ):
          <fpage>10687</fpage>
          -
          <lpage>10698</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>