<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Application of Metric Learning to Large-scale Image Classification Task</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mykola</forename><surname>Baranov</surname></persName>
							<email>mykola.baranov@lnu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Ivan Franko National University of Lviv</orgName>
								<address>
									<addrLine>Universytetska St, 1, L&apos;vivs&apos;ka</addrLine>
									<postCode>79000</postCode>
									<settlement>Lviv</settlement>
									<region>oblast</region>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yuriy</forename><surname>Shcherbyna</surname></persName>
							<email>yuriy.shcherbyna@lnu.edu.ua</email>
							<affiliation key="aff0">
								<orgName type="institution">Ivan Franko National University of Lviv</orgName>
								<address>
									<addrLine>Universytetska St, 1, L&apos;vivs&apos;ka</addrLine>
									<postCode>79000</postCode>
									<settlement>Lviv</settlement>
									<region>oblast</region>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Conference on Computational Linguistics and Intelligent Systems</orgName>
								<address>
									<addrLine>May 12-13</addrLine>
									<postCode>2022</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Application of Metric Learning to Large-scale Image Classification Task</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">CB0ACE9207C3A7A9E6B83CB5E1970D54</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Few-shot learning</term>
					<term>metric learning</term>
					<term>distance learning</term>
					<term>deep learning</term>
					<term>computer vision</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Deep learning has introduced many successful approaches in supervised learning areas, including computer vision. Modern neural network-based models have achieved human-level accuracy. The one limitation that comes along with deep learning is the requirement of large-scale datasets to train such models. Despite large-scale datasets like ImageNet or OpenImages, a huge number of classes remain uncovered. Extending an existing model with one more class requires a lot of data collection and annotation. Few-shot learning approaches tackle the issue of large-scale dataset requirements. Most few-shot learning approaches address the problem of identity recognition (like face recognition), while novel object classification remains a challenging task. In this work, we build a metric learning based deep learning model driven by triplet loss. We explore how a triplet loss-driven model may be applicable to image recognition in the case of a lack of data. Our experiments show that such a model reaches up to 83% accuracy using only a few samples per class.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Deep learning models have recently achieved great success in various computer vision tasks like image classification, segmentation, object detection, etc. A lot of progress has been achieved due to increasing model capacity: researchers came up with models like ResNeXt <ref type="bibr" target="#b0">[1]</ref> or Inception <ref type="bibr" target="#b1">[2]</ref>. Some steps have also been made towards a tradeoff between model depth and parameter count: the EfficientNet family <ref type="bibr" target="#b2">[3]</ref> is the best choice when a balance between processing speed and accuracy is required. It has been proven that using rich large-scale datasets leads to good results in terms of key model metrics. This is natural, since deep learning models tend to generalize data. Providing a large amount of data leads to better generalization, while training the same model on a few samples will definitely lead to overfitting. There are numerous techniques for preventing overfitting (random erase <ref type="bibr" target="#b3">[4]</ref>, CutOut <ref type="bibr" target="#b4">[5]</ref>, grid mask <ref type="bibr" target="#b5">[6]</ref>, drop block <ref type="bibr" target="#b6">[7]</ref>, and others). Such techniques make it harder for the model to overfit specific image features by augmenting them, but they are almost useless when the number of training samples is tiny.</p><p>Synthetic data may be generated using generative adversarial networks <ref type="bibr" target="#b7">[8]</ref>. This approach is proven to generate realistic images in a controlled environment <ref type="bibr" target="#b8">[9]</ref>, but it fails to produce completely new views of a given object (for example, a side view instead of a front view).</p><p>Transfer learning is a widely used approach to fine-tune existing pretrained models on a small dataset. 
In computer vision, this is usually done by cutting off the head layers and replacing them with newly created ones. It is proven that fine-tuning only the last layers gives a significant improvement by transferring the knowledge of the base model to the tuned model <ref type="bibr" target="#b9">[10]</ref>.</p><p>The described research trend is extensive rather than intensive development in terms of model flexibility. It is impossible to extend the set of classes of a classification model without</p><p>• Model retraining • Extending a large-scale dataset Moreover, even satisfying the requirements described above does not guarantee successful model retraining. It is worth mentioning that many business cases of deep learning model applications do not accept manual work like data annotation. Others do not support handling computation resources on their own. A typical example of such a business case is intelligent cashier-less checkout in a supermarket.</p><p>Few-shot learning is a novel approach to deep learning. The main idea of this concept is to replace the classical supervised learning task. Instead of forcing the model to generalize the training dataset (or memorize it in case of overfitting), we train the model's ability to learn. It is natural that a baby is able to recognize a car after seeing only a few cars in its life. It is done by memorizing several samples of cars and then performing an intelligent comparison of new objects with the existing ones. This is exactly what metric learning does.</p><p>A lot of few-shot research works are focused on individual recognition rather than different object recognition (e.g. face recognition is the main focus of few-shot learning since all faces share the same shape but differ in details). In this paper, we move the focus of few-shot object classification from individual recognition to different object recognition. 
In this work, we explore the capacity of few-shot classification approaches as a replacement for traditional deep learning classification approaches. In summary, the contributions of this paper are:</p><p>1. Building a few-shot 1000-class classification deep learning pipeline 2. Exploring the strengths and weaknesses of our model on routine object classification 3. Comparing the results with traditional image classification approaches</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>Traditional classification approaches use softmax activation along with cross-entropy loss. That means that the i-th output neuron of the model is responsible for the probability of the i-th class. Obviously, such models deal with a constant number of classes.</p><p>Metric learning suggests another approach: instead of directly predicting the class of an input picture, we calculate the distance between picture pairs. In other words, we build a novel metric function that is able to calculate distances in picture space. Such a function should satisfy the following requirements:</p><p>1. The distance between pictures that belong to the same class should be minimal 2. The distance between pictures of different classes should be as high as possible 3. The distance between identical pictures should be zero Obviously, comparing pictures pixel-wise leads to poor results. Usually, such metrics are constructed from a deep learning backbone (feature extractor) and some predefined metric (such as L2 distance or cosine distance). In that setup, the backbone is the trainable part, which learns the embeddings of pictures.</p></div>
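As an illustrative sketch (not code from the paper), the learned metric can be seen as a fixed distance function applied to trainable embedding vectors; the vectors below stand in for backbone outputs:

```python
import numpy as np

def l2_distance(e1, e2):
    # Euclidean (L2) distance between two embedding vectors
    return float(np.linalg.norm(e1 - e2))

def cosine_distance(e1, e2):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones
    sim = np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2))
    return float(1.0 - sim)

e = np.array([0.5, 1.0, -0.25])
print(l2_distance(e, e))          # 0.0 (requirement 3: identical pictures)
print(cosine_distance(e, 2 * e))  # ~0.0 (cosine distance is scale-invariant)
```

Note that cosine distance ignores the embedding norm, while L2 distance does not; the choice of the predefined metric is part of the setup, and only the backbone is trained.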
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 1: Example of embedding learning</head><p>A simple loss that forces the model to learn such embeddings is the contrastive loss <ref type="bibr" target="#b10">[11]</ref>: L(Y, X_1, X_2) = (1 - Y) * (1/2) * D_w^2 + Y * (1/2) * max(0, m - D_w)^2, where Y corresponds to the pair label. It is equal to 1 if the pair contains images of different objects; otherwise, it is equal to 0. In the contrastive loss formula, Y acts as a switch by tying two different formulas together without an explicit conditional while keeping the formula differentiable, which allows us to use the contrastive loss in backpropagation. D_w here represents the distance between the two images in some arbitrary metric space. In a nutshell, optimizing the contrastive loss forces the model to increase the distance between images of different objects and to keep the distance small within images of the same object. The margin parameter m controls the maximum penalized distance between pairs of images. It allows the training to focus on pairs that fail to satisfy the loss rather than continuing to optimize already successful cases.</p><p>Further study showed that pairwise loss is not the best option. Triplet loss performs much better in a lot of cases <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13]</ref>. The idea of triplet loss is the same as in the contrastive loss, but it operates with three samples: 1. Anchor: a random sample from the dataset 2. Positive: a random sample of the same class as the anchor 3. Negative: a random sample of another class The loss is L = max(d(a, p) - d(a, n) + margin, 0), where d(a, p) stands for the distance between the anchor sample and the positive sample, and d(a, n) stands for the distance between the anchor sample and the negative sample. The target of triplet loss is to make the distance between different samples higher than the distance between similar samples, which implicitly defines our main target. The margin is used in order to prevent overfitting on easier samples (it plays a role similar to the margin of the contrastive loss).</p></div>
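The two losses above can be sketched directly from their formulas; this is a minimal illustration on scalar distances (the `m`/`margin` defaults are the usual convention, not values from the paper):

```python
def contrastive_loss(d, y, m=1.0):
    # y = 0 for a similar pair, y = 1 for a dissimilar pair (as in the text):
    # similar pairs are pulled together, dissimilar pairs pushed beyond margin m
    return (1 - y) * 0.5 * d ** 2 + y * 0.5 * max(0.0, m - d) ** 2

def triplet_loss(d_ap, d_an, margin=1.0):
    # penalize triplets where the positive is not closer than the negative by `margin`
    return max(d_ap - d_an + margin, 0.0)

print(contrastive_loss(0.0, 0))  # 0.0: similar pair at zero distance
print(contrastive_loss(2.0, 1))  # 0.0: dissimilar pair already beyond the margin
print(triplet_loss(0.2, 1.5))    # 0.0: triplet constraint already satisfied
print(triplet_loss(1.0, 1.2))    # ~0.8: negative not far enough from the anchor
```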
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2:</head><p>The Triplet Loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity <ref type="bibr" target="#b5">[6]</ref> However, selecting random triplets is not the best option. We will show in the next sections that it leads the optimization of the model to a local minimum. In <ref type="bibr" target="#b13">[14]</ref> several triplet mining strategies were proposed. Triplets may be classified into three groups:</p><p>1. Easy triplets: triplets which have a loss of 0 2. Hard triplets: triplets where the negative is closer to the anchor than the positive 3. Semi-hard triplets: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss The impact of these strategies will be shown in the next sections.</p><p>Center loss <ref type="bibr" target="#b11">[12]</ref> penalizes the distance between embeddings and their corresponding class centers in Euclidean space to achieve intra-class compactness. However, it has limited applications, since a lot of real-life scene images do not form dense clusters, so the centroid of a cluster may lie outside of the cluster, which breaks the training process. The formula of the center loss is provided on figure <ref type="figure" target="#fig_0">3</ref>. L_ct-c stands for the contrastive-center loss; x_i denotes the embedding of a training sample with dimension d; y_i stands for the label of x_i; c_yi represents the y_i-th class center of deep features, of the same dimension d. The hyperparameter δ is a constant used for numeric stability.</p></div>
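The three triplet groups listed above follow mechanically from the anchor-positive and anchor-negative distances; a small sketch (not the authors' code) of that classification:

```python
def triplet_category(d_ap, d_an, margin=1.0):
    """Classify a triplet by its anchor-positive (d_ap) and
    anchor-negative (d_an) distances, as in the three groups above."""
    if d_an < d_ap:
        return "hard"       # negative is closer to the anchor than the positive
    if d_an < d_ap + margin:
        return "semi-hard"  # positive is closer, but the triplet loss is still > 0
    return "easy"           # loss is already 0, nothing to learn from this triplet

print(triplet_category(1.0, 0.5))  # hard
print(triplet_category(1.0, 1.5))  # semi-hard
print(triplet_category(1.0, 2.5))  # easy
```

A semi-hard mining policy keeps only the middle category, which avoids both the uninformative easy triplets and the collapse-prone hardest ones.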
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods and Materials</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Dataset and task definition</head><p>There are a lot of few-shot learning techniques. One of the most popular is using a support set. Let's define an N-way-K-shot classification problem. In that setup, the support set contains K samples for each of N classes, so during the forward pass such a model can classify an input only among these N classes. There are a lot of papers and benchmarks tackling that problem.</p><p>It is natural that increasing the number of classes N will decrease the overall accuracy of the model. So, that approach leads to poor results when working with a huge number of classes (like miniImageNet <ref type="bibr" target="#b12">[13]</ref>).</p><p>In this work, we are going to build a model that classifies images into 1000 classes where only 10 images per class are available. Since we have a limited amount of data per class, traditional classification will lead to poor results.</p><p>Our work is mainly based on the FSS-1000 dataset <ref type="bibr" target="#b15">[15]</ref>. This dataset was specially collected for few-shot learning purposes. A lot of popular datasets (like ImageNet <ref type="bibr" target="#b12">[13]</ref>, PASCAL VOC <ref type="bibr" target="#b13">[14]</ref>, ILSVRC <ref type="bibr" target="#b16">[16]</ref>) introduce a high bias. It may be class balance, semantic balance, items-per-image balance, etc. The main purpose of the FSS-1000 dataset is to eliminate any bias from the data. Thus it contains 1000 classes, and each class contains 10 pictures. All pictures were collected across different search engines (Google, Yahoo, Bing, etc). Moreover, each picture has a constant resolution of 224x224 and only one class instance is present at a time. These properties ensure that our model will have no bias towards any specific properties of the images. In contrast, a model trained on ImageNet may be perfect at dog classification (because that dataset contains more than 200 dog breeds) but fail at car recognition. 
FSS-1000 provides us with instance segmentation masks. Since this work is focused on a classification task, we crop each image by the bounding box of its mask. We calculate the bounding box as the minimum and maximum coordinates of the mask along both the X and Y axes. Each cropped sample was resized to the standard resolution of 224x224 in order to match the expected input resolution of pretrained backbone models. In figure <ref type="figure" target="#fig_1">4</ref> we can observe that the best tradeoff between information loss and compression ratio is 175x175; however, the original size is required in order to prevent information loss.</p><p>Resizing of cropped samples introduces additional distortion. In figure <ref type="figure" target="#fig_2">5</ref> we can observe the level of such distortions. However, according to the exploration of the original aspect ratios, most of the samples have a 1:1 aspect ratio, which eliminates any distortion when resizing to 224x224. </p></div>
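The crop described above (min/max mask coordinates along both axes) can be sketched as follows; the toy image and mask are hypothetical, for illustration only:

```python
import numpy as np

def crop_by_mask(image, mask):
    """Crop `image` (H, W, C) to the bounding box of a binary `mask` (H, W):
    the minimum and maximum mask coordinates along both the Y and X axes."""
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max()
    x0, x1 = xs.min(), xs.max()
    return image[y0:y1 + 1, x0:x1 + 1]

img = np.zeros((8, 8, 3), dtype=np.uint8)
msk = np.zeros((8, 8), dtype=np.uint8)
msk[2:5, 3:7] = 1  # the object occupies rows 2..4 and columns 3..6
print(crop_by_mask(img, msk).shape)  # (3, 4, 3)
```

The resulting crop would then be resized to 224x224 before being fed to the backbone.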
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Methodology and setup</head><p>In our experiments we propose two deep learning pipelines. One is suitable for binary classification (i.e., deciding whether two given images are similar or not). The second one performs traditional image classification over 1000 classes. Both pipelines are quite similar; the main difference lies in the embedding processor module. Since the dataset we are working with is quite limited, we decided to use a model from the EfficientNet family. In our experiments, we used EfficientNet B2 <ref type="bibr" target="#b2">[3]</ref> as the main backbone of our model. We took weights pretrained on ImageNet <ref type="bibr" target="#b6">[7]</ref> as the initial weights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiment</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Dataset setup</head><p>All training and evaluations have been performed on the FSS-1000 dataset. We split the dataset randomly per class: we take 7 samples of each class for training, and the rest is used for validation purposes. This follows the traditional fixed-ratio dataset split (70% to 30%, i.e. 7 of 10 samples per class), but shuffling has been performed with respect to class balance. As a result, we obtain a perfectly balanced dataset with exactly the same number of samples per class in both the training and validation splits.</p></div>
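A minimal sketch of this per-class split (hypothetical helper, not the authors' code): shuffling happens within each class, so the 7/3 ratio holds for every class exactly:

```python
import random

def per_class_split(dataset, n_train=7, seed=42):
    """dataset: dict mapping class name -> list of 10 samples.
    Shuffle within each class so the split stays perfectly class-balanced."""
    rng = random.Random(seed)
    train, val = {}, {}
    for cls, samples in dataset.items():
        s = samples[:]          # copy so the original order is untouched
        rng.shuffle(s)
        train[cls], val[cls] = s[:n_train], s[n_train:]
    return train, val

toy = {f"c{i}": list(range(10)) for i in range(1000)}
tr, va = per_class_split(toy)
print(len(tr["c0"]), len(va["c0"]))  # 7 3
```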
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Embedding model training</head><p>We use an embedding size of 256 (i.e. the fully connected layer size is set to 256). In order to keep the pretrained backbone weights healthy, we start training with frozen backbone weights. After reaching a loss plateau, we unfreeze the backbone weights and continue training. Such a setup helps us to increase the final accuracy by ~2%.</p><p>Triplet loss is used as the main loss to optimize during training. We use a semi-hard triplet mining policy <ref type="bibr" target="#b16">[16]</ref>. We set the margin to 1 for the triplet loss, and L2 distance is used as the distance between embeddings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 8. Training loss graph of backbone model</head><p>We calculate distances between each pair of samples. The distribution of such distances is presented in Figure <ref type="figure" target="#fig_6">9</ref>. We calculate the best suitable threshold for the training set using the Otsu thresholding algorithm. Applying the obtained threshold (0.0048) to the validation set gives a 0.921 true positive rate and a 0.998 true negative rate. </p></div>
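Otsu's method picks the threshold that maximizes the between-class variance of the two groups it induces; a sketch of applying it to a 1-D distance distribution (the bimodal toy data below is synthetic, not the paper's distances):

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method on a 1-D sample: choose the cut that maximizes
    the between-class variance of the two resulting groups."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = p[:i].sum(), p[i:].sum()       # class weights
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (p[:i] * centers[:i]).sum() / w0  # class means
        mu1 = (p[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2        # between-class variance
        if var > best_var:
            best_var, best_t = var, centers[i]
    return best_t

rng = np.random.default_rng(0)
d = np.concatenate([rng.normal(0.2, 0.05, 1000),   # positive-pair distances
                    rng.normal(1.0, 0.15, 1000)])  # negative-pair distances
t = otsu_threshold(d)
print(0.3 < t < 0.9)  # the threshold falls between the two modes
```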
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>We use a KNN model with K=7. That number of nearest neighbors is chosen according to the number of training samples per class. We fit the KNN model to the training embeddings. It is very important to select a triplet mining policy when using triplet loss. Deep models tend to overfit training data (especially when training on small datasets). Using triplet loss allows us to increase the number of training items combinatorially (since each triplet combines three samples, there are a lot of different combinations). Despite the large number of different triplets, there is still a chance to overfit the model. This is caused by the fact that the model optimizes all distances between samples, even the already very good distances between easy samples. In order to prevent such behavior, a triplet mining policy is implemented.  As we can see, random triplet mining tends to generate very good embeddings for some classes, while the embeddings of the rest are poor. Such a setup may produce nice scores for binary classification tasks, but for 1000-class classification only 0.122 accuracy was obtained.</p></div>
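The classification stage can be sketched as a distance-weighted KNN vote over the training embeddings (a toy two-class illustration with hypothetical cluster data, not the paper's 1000-class setup):

```python
import numpy as np

def knn_predict(train_emb, train_labels, query, k=7):
    """Distance-weighted K-nearest-neighbors vote in embedding space."""
    d = np.linalg.norm(train_emb - query, axis=1)  # L2 distances to all train samples
    idx = np.argsort(d)[:k]                        # k nearest neighbors
    votes = {}
    for i in idx:
        w = 1.0 / (d[i] + 1e-8)                    # closer neighbors weigh more
        votes[train_labels[i]] = votes.get(train_labels[i], 0.0) + w
    return max(votes, key=votes.get)

rng = np.random.default_rng(1)
# two toy classes as tight clusters in a 16-dimensional embedding space
emb = np.vstack([rng.normal(0.0, 0.1, (7, 16)), rng.normal(1.0, 0.1, (7, 16))])
labels = ["cat"] * 7 + ["dog"] * 7
print(knn_predict(emb, labels, np.full(16, 0.95)))  # dog
```

With 7 training samples per class, K=7 lets a query match against the full support of its class, which is why that value is chosen.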
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Discussions</head><p>Traditional deep learning approaches come up with a model that is able to classify images end-to-end. Actually, such models implicitly consist of two parts: feature extraction and feature classification. The goal of the classification layers is to attribute input features to one of the classes. The role of the feature extraction module is to produce features that the classification layers are able to classify. Thus, combining these two modules in one model along with gradient descent optimization methods gives us an end-to-end trainable model. Since those modules are trained simultaneously, the feature extractor tries to produce features that are easy to classify, and there are no other constraints on such embeddings. Thus, the model may pay attention to a distinctive part of the object regardless of the real importance of that part (for example, the glass may be treated as the most important cue for classifying an object as a car, despite numerous other objects having glasses).</p><p>In contrast to those approaches, metric learning networks explicitly find separable embeddings. By separable here we mean that the model is forced to calculate features that are distinctive from the embeddings of other classes rather than tied to specific class labels. In other words, the model doesn't operate with class labels at all.</p><p>In Figure <ref type="figure" target="#fig_6">9</ref> we show how well our embeddings are separable. There is a tiny overlap between positive and negative distances which corresponds to hard negative and hard positive pairs. In this overlap, we are likely to give a wrong classification prediction, but having the distance value we can evaluate a confidence level and refuse to give any prediction, or make a bias toward one of the classes (which class depends on the task). In Figure <ref type="figure" target="#fig_6">9</ref> we also plot the distributions for the training and validation sets, and there is a good fit for validation. 
It indicates that there was no overfitting. The same conclusion is supported by observing the validation loss following the training loss during the training process.</p><p>Having such embeddings, we can apply a distance-based classification model to them. We find the K-nearest neighbors classifier the best candidate for classification. K-fold cross-validation on the training set (with K=5) gives us the best parameters for KNN: 7 neighbors and distance-weighted voting. Evaluation of such a model on the validation set shows a significant performance gain in comparison with the traditional model. In particular, it gives us more than a 10% accuracy boost in comparison with EfficientNet B2 followed by classification layers. We also evaluated top-5 accuracy and got roughly the same performance boost.</p><p>We also want to emphasize the gap between training and validation accuracies. The traditional model overfits the training data completely, while our embedding extractor model still avoids overfitting and produces better performance. Note that we do not consider the K-nearest neighbors model overfit, since it fits the training data by definition.</p><p>The key difference between the few-shot model and the traditional end-to-end model is the semantics of the extracted embeddings. In a triplet loss based model, we can only find similar embeddings in the database, while the features don't provide any class-specific representation. That's why such a relatively small amount of data is enough to fit the model, while the traditional approach leads to overfitting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>Classical approaches to deep learning image classification usually consist of two modules implicitly tied together: the feature extraction module and the embedding classification layers. Training such models makes the feature extraction module find class-specific embeddings. Since there are no specific constraints on the extracted embeddings, they are not separable in general.</p><p>In this work, we have tackled the issue of a large-scale classification task with a small amount of data. The proposed two-stage classification model is very promising in terms of class capacity, resistance to overfitting, and ability to fit unseen classes. In particular, it gives us a significant accuracy boost (more than 10% of top-1 accuracy) while avoiding overfitting. The distances between produced embeddings follow a normal distribution without any outliers except easy positive pairs (very similar images, in simple words). This indicates that such a model may be easily extended to predict novel classes as is: it is enough to precalculate embeddings of the unseen classes, and the model is likely to classify unseen images correctly.</p><p>The few-shot model also seems to be a promising approach for large-scale image classification in terms of the number of internal parameters. Our model contains less than 9M parameters. The smallest model with a similar performance on ImageNet 1000-class classification contains up to 66M parameters <ref type="bibr" target="#b17">[17]</ref> (which is more than 7 times bigger). So, our approach promises to be much faster than traditional approaches with the same performance, while requiring dramatically less data per class.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 3 .</head><label>3</label><figDesc>Figure 3. 
Example of the FSS-1000 dataset samples.</figDesc><graphic coords="4,73.50,508.24,451.50,68.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 4 .</head><label>4</label><figDesc>Figure 4. Visualization of cropped objects resolution objects in the FSS-1000 dataset.In that figure, we can observe that the best tradeoff between information loss and compression ratio is 175x175. However original size is required in order to prevent information loss.</figDesc><graphic coords="5,139.01,73.50,317.25,327.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 5 .</head><label>5</label><figDesc>Figure 5. Cropped and resized samples. Here we can observe a few aspect ratio distortions.</figDesc><graphic coords="5,73.50,506.32,451.50,159.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 6 .</head><label>6</label><figDesc>Figure 6. Diagram of proposed pipeline suitable for binary classification. The similarity estimation module compares the distance between embedding with a predefined threshold which has been extracted from the training dataset.</figDesc><graphic coords="6,73.50,115.89,451.50,90.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 7 .</head><label>7</label><figDesc>Figure 7. Diagram of embedding extraction module.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 7 .</head><label>7</label><figDesc>Figure 7. Diagram of the model suitable for a 1000-class classification task.</figDesc><graphic coords="6,73.50,515.17,451.50,75.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 9 .</head><label>9</label><figDesc>Figure 9. Embedding distance distributions of training and validation sets.</figDesc><graphic coords="7,73.50,385.29,451.50,84.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 9 .</head><label>9</label><figDesc>Figure 9. Class confusion matrix (left) and class distance heatmap (right)</figDesc><graphic coords="8,73.50,73.50,216.00,216.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 10 .</head><label>10</label><figDesc>Figure 10. Difference between obtained distances using semi-hard triplet mining policy (left) and random triplet mining (right).</figDesc><graphic coords="8,73.50,375.96,213.75,213.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="2,73.50,626.38,451.50,117.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc></figDesc><table><row><cell>Classification scores</cell><cell></cell><cell></cell></row><row><cell>Model</cell><cell>Accuracy train</cell><cell>Accuracy validation</cell></row><row><cell>Our (top 1)</cell><cell>0.944</cell><cell>0.836</cell></row><row><cell>Our (top 5)</cell><cell>0.999</cell><cell>0.957</cell></row><row><cell>EfficientNet B2 + softmax (top 1)</cell><cell>1.0</cell><cell>0.723</cell></row><row><cell>EfficientNet B2 + softmax (top 5)</cell><cell>1.0</cell><cell>0.914</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We plan some additional steps for further research. In particular, we are going to experiment with the embedding size, the triplet loss margin, etc. We believe that better embeddings may be obtained by ensembling embeddings from different models.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Aggregated residual transformations for deep neural networks</title>
		<author>
			<persName><forename type="first">Saining</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ross</forename><surname>Girshick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Dollár</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhuowen</forename><surname>Tu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kaiming</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1492" to="1500" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Going deeper with convolutions</title>
		<author>
			<persName><forename type="first">Christian</forename><surname>Szegedy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yangqing</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pierre</forename><surname>Sermanet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Scott</forename><surname>Reed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dragomir</forename><surname>Anguelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dumitru</forename><surname>Erhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vincent</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Rabinovich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognitio</title>
				<meeting>the IEEE conference on computer vision and pattern recognitio</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1" to="9" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Efficientnet: Rethinking model scaling for convolutional neural networks</title>
		<author>
			<persName><forename type="first">Mingxing</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International conference on machine learning</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="6105" to="6114" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Random erasing data augmentation</title>
		<author>
			<persName><forename type="first">Zhun</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liang</forename><surname>Zheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Guoliang</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shaozi</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yi</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI conference on artificial intelligence</title>
				<meeting>the AAAI conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="13001" to="13008" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Improved regularization of convolutional neural networks with cutout</title>
		<author>
			<persName><forename type="first">Terrance</forename><surname>Devries</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Graham</forename><forename type="middle">W</forename><surname>Taylor</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1708.04552</idno>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Gridmask data augmentation</title>
		<author>
			<persName><forename type="first">Pengguang</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shu</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hengshuang</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jiaya</forename><surname>Jia</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2001.04086</idno>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Dropblock: A regularization method for convolutional networks</title>
		<author>
			<persName><forename type="first">Golnaz</forename><surname>Ghiasi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tsung-Yi</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">ImageNet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness</title>
		<author>
			<persName><forename type="first">Robert</forename><surname>Geirhos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Patricia</forename><surname>Rubisch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Claudio</forename><surname>Michaelis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Matthias</forename><surname>Bethge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Felix</forename><forename type="middle">A</forename><surname>Wichmann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wieland</forename><surname>Brendel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A deep learning pipeline for product recognition on store shelves</title>
		<author>
			<persName><forename type="first">Alessio</forename><surname>Tonioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eugenio</forename><surname>Serra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luigi</forename><surname>Di Stefano</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Image Processing, Applications and Systems (IPAS)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="25" to="31" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">A comprehensive survey on transfer learning</title>
		<author>
			<persName><forename type="first">Fuzhen</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhiyuan</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Keyu</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dongbo</forename><surname>Xi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongchun</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hengshu</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hui</forename><surname>Xiong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qing</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE</title>
				<meeting>the IEEE</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">109</biblScope>
			<biblScope unit="page" from="43" to="76" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Supervised contrastive learning</title>
		<author>
			<persName><forename type="first">Prannay</forename><surname>Khosla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Piotr</forename><surname>Teterwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chen</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Sarna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yonglong</forename><surname>Tian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Phillip</forename><surname>Isola</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Aaron</forename><surname>Maschinot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ce</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dilip</forename><surname>Krishnan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="18661" to="18673" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Contrastive-center loss for deep neural networks</title>
		<author>
			<persName><forename type="first">Ce</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fei</forename><surname>Su</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE international conference on image processing (ICIP)</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2851" to="2855" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Imagenet large scale visual recognition challenge</title>
		<author>
			<persName><forename type="first">Olga</forename><surname>Russakovsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jia</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hao</forename><surname>Su</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jonathan</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sanjeev</forename><surname>Satheesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sean</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhiheng</forename><surname>Huang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of computer vision</title>
		<imprint>
			<biblScope unit="volume">115</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="211" to="252" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Reconstructing PASCAL VOC</title>
		<author>
			<persName><forename type="first">S</forename><surname>Vicente</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Carreira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Agapito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Batista</surname></persName>
		</author>
		<idno type="DOI">10.1109/CVPR.2014.13</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE Conference on Computer Vision and Pattern Recognition</title>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="41" to="48" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Fss-1000: A 1000-class dataset for few-shot segmentation</title>
		<author>
			<persName><forename type="first">Xiang</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tianhan</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yau Pun</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu-Wing</forename><surname>Tai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chi-Keung</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</title>
				<meeting>the IEEE/CVF Conference on Computer Vision and Pattern Recognition</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2869" to="2878" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Improved embeddings with easy positive triplet mining</title>
		<author>
			<persName><forename type="first">Hong</forename><surname>Xuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Abby</forename><surname>Stylianou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><surname>Pless</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</title>
				<meeting>the IEEE/CVF Winter Conference on Applications of Computer Vision</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="2474" to="2482" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Self-training with noisy student improves imagenet classification</title>
		<author>
			<persName><forename type="first">Qizhe</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Minh-Thang</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eduard</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Quoc</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</title>
				<meeting>the IEEE/CVF conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="10687" to="10698" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
