Application of Metric Learning to Large-scale Image Classification Task

Mykola Baranov1, Yuriy Shcherbyna1
1 Ivan Franko National University of Lviv, Universytetska St, 1, Lviv, L'vivs'ka oblast, 79000, Ukraine

Abstract
Deep learning has introduced many successful approaches in a number of supervised learning areas, including computer vision. Modern neural-network-based models have achieved human-level accuracy. The one limitation that has come along with deep learning is the requirement of large-scale datasets to train such models. Even with large-scale datasets like ImageNet or OpenImages, a huge number of classes remain uncovered, and extending an existing model with one more class requires a lot of data collection and annotation. Few-shot learning approaches tackle the issue of large-scale dataset requirements, but most of them address identity recognition (like face recognition), while novel object classification remains a challenging task. In this work, we build a metric-learning-based deep learning model trained with triplet loss. We explore how a triplet-loss-driven model may be applied to image recognition when data is scarce. Our experiments show that such a model reaches up to 83% accuracy using only a few samples per class.

Keywords
Few-shot learning, metric learning, distance learning, deep learning, computer vision

1. Introduction
Deep learning models have recently achieved great success in various computer vision tasks like image classification, segmentation, and object detection. A lot of progress has been achieved by increasing model capacity: researchers came up with models like ResNeXt[1] or Inception[2]. Some steps have been taken towards a tradeoff between model depth and parameter count; the EfficientNet[3] family is the best choice when a balance between processing speed and accuracy is required.
It has been proven that using rich large-scale datasets leads to good results in terms of model key performance indicators. This is natural, since deep learning models tend to generalize data. Providing a large amount of data leads to better generalization, while training the same model on a few samples will definitely lead to overfitting. There are numerous techniques for preventing overfitting (random erase[4], CutOut[5], grid mask[6], drop block[7], and others). Such techniques make it harder to overfit specific image features by augmenting them, but they are almost useless when the number of training samples is tiny. Synthetic data may also be generated using generative adversarial networks[8]. GANs have been proven to generate realistic images in a controlled environment[9], but they fail to produce completely new images of a given object (for example, a side view instead of a front view). Transfer learning is a widely used approach to fine-tune existing pretrained models on a small dataset. In computer vision, it is usually done by cutting off the head layers and replacing them with newly created ones. It has been proven that fine-tuning only the last layers gives a significant improvement by transferring knowledge of the base model to the tuned model[10]. The described research trend is extensive development rather than intensive in terms of model flexibility.

COLINS-2022: 6th International Conference on Computational Linguistics and Intelligent Systems, May 12–13, 2022, Gliwice, Poland.
EMAIL: mykola.baranov@lnu.edu.ua (M. Baranov); yuriy.shcherbyna@lnu.edu.ua (Y. Shcherbyna)
ORCID: 0000-0003-1509-2924 (M. Baranov); 0000-0002-4942-2787 (Y. Shcherbyna)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)
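As an illustration of the erasing family of augmentations mentioned above, here is a minimal random-erasing-style sketch in NumPy (the function name, parameters, and fill value are our own choices for illustration, not the reference implementation of [4]):

```python
import numpy as np

def random_erase(image, rng, area_frac=0.2):
    """Blank out a random square covering roughly `area_frac` of the image.

    image : 2-D (or H x W x C) NumPy array
    rng   : a numpy.random.Generator, e.g. np.random.default_rng(0)
    """
    h, w = image.shape[:2]
    # side lengths chosen so the erased square covers ~area_frac of the image
    eh = max(1, int(h * area_frac ** 0.5))
    ew = max(1, int(w * area_frac ** 0.5))
    y = rng.integers(0, h - eh + 1)
    x = rng.integers(0, w - ew + 1)
    out = image.copy()
    out[y:y + eh, x:x + ew] = 0  # erase with zeros; real variants use noise
    return out
```

As the surrounding text notes, such augmentations enlarge the effective training set but cannot substitute for genuinely new samples.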
It is impossible to extend a classification model with a new class without:
● Model retraining
● Extending a large-scale dataset

Moreover, even satisfying the requirements described above does not guarantee successful model retraining. It is worth mentioning that a lot of business cases of deep learning model applications do not accept manual work like data annotation. Others do not support handling computation resources on their own. A typical example of such a business case is intelligent cashier-less checkout in a supermarket.

Few-shot learning is a novel approach to deep learning. The main idea of this concept is to replace the classical supervised learning task. Instead of forcing the model to generalize the training dataset (or memorize it in case of overfitting), we train the model's ability to learn. It is natural that a baby is able to recognize a car after seeing only several cars in its life. This is done by memorizing several samples of cars and then performing an intelligent comparison of new objects with existing ones. This is exactly what metric learning does.

A lot of few-shot research works are focused on individual recognition rather than distinct object recognition (e.g., face recognition is the main focus of few-shot learning, since all faces share the same shape but differ in details). In this paper, we move the focus of few-shot object classification from individual recognition to distinct object recognition. We explore the capacity of few-shot classification approaches as a replacement for traditional deep learning classification approaches. In summary, the contributions of this paper are:
1. Build a few-shot 1000-class classification deep learning pipeline
2. Explore the strengths and weaknesses of our model on routine object classification
3. Compare results with traditional image classification approaches

2. Related works
Traditional classification approaches use softmax activation along with cross-entropy loss.
That means that the i-th output neuron of the model is responsible for the probability of the i-th class. Obviously, such models deal with a constant number of classes. Metric learning suggests taking another approach: instead of directly predicting the class of an input picture, we calculate the distance between picture pairs. In other words, we build a novel metric function that is able to calculate the distance in picture space. Such a function should satisfy the following requirements:
1. Distance between pictures that belong to the same class should be minimal
2. Distance between different pictures should be as high as possible
3. Distance between identical pictures should be zero

Obviously, comparing pictures pixel-wise leads to poor results. Usually, such metrics are constructed from a deep learning model backbone (feature extractor) and some predefined metric (such as L2 distance or cosine distance). In that setup, the backbone is the trainable part, which learns the embeddings of pictures.

Figure 1: Example of embedding learning

A simple loss that forces the model to learn such embeddings is the contrastive loss[11]:

L = (1 − Y) · ½ · Dw² + Y · ½ · max(0, m − Dw)²

where Y corresponds to the pair label. It is equal to 1 if the pair contains images of different objects; otherwise, it is equal to 0. In the contrastive loss formula, Y acts as a switcher, combining two different terms without an explicit switch while keeping the formula differentiable. This allows us to use the contrastive loss in backpropagation. Dw represents the distance between two images in some arbitrary metric space. In a nutshell, optimizing the contrastive loss forces the model to increase the distance between images of different objects and to keep the distance small within images of the same object. The margin parameter m controls the maximum penalized distance between pairs of images. It allows paying attention to pairs that fail to satisfy the loss rather than continuing to optimize successful cases.
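The loss above can be sketched in NumPy as follows (a minimal batch version; the helper name and vectorized form are ours):

```python
import numpy as np

def contrastive_loss(d_w, y, margin=1.0):
    """Contrastive loss over a batch of pairs.

    d_w    : array of distances D_w between paired embeddings
    y      : array of pair labels (1 = different objects, 0 = same object)
    margin : maximum distance m beyond which dissimilar pairs stop
             contributing to the loss
    """
    same = (1.0 - y) * 0.5 * d_w ** 2                    # pull similar pairs together
    diff = y * 0.5 * np.maximum(0.0, margin - d_w) ** 2  # push dissimilar pairs apart
    return np.mean(same + diff)
```

Note how a dissimilar pair already separated by more than the margin contributes zero loss, so optimization focuses on the pairs that still violate the constraint.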
Further study showed that pairwise loss is not the best option. Triplet loss performs much better in many cases[12,13]. The idea of triplet loss is the same as in contrastive loss, but it operates with three samples:
1. Anchor - a random sample from the dataset
2. Positive - a random sample of the same class as the anchor
3. Negative - a random sample of another class

L = max(d(a, p) − d(a, n) + m, 0)

where d(a, p) stands for the distance between the anchor sample and the positive sample, and d(a, n) stands for the distance between the anchor sample and the negative sample. The target of triplet loss is to make the distance between different samples higher than the distance between similar samples, which implicitly defines our main target. The margin m is used in order to prevent overfitting on easier samples (it plays a role similar to the margin of the contrastive loss).

Figure 2: The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity[6]

However, selecting random triplets is not the best option. We will show in the next sections that it leads the optimization of the model to a local minimum. In work[14], several strategies of triplet mining were proposed. Triplets may be classified into three groups:
1. Easy triplets: triplets which have a loss of 0
2. Hard triplets: triplets where the negative is closer to the anchor than the positive
3. Semi-hard triplets: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss

The impact of such strategies will be shown in the next sections. Center loss[12] penalizes the distance between embeddings and their corresponding class centers in Euclidean space to achieve intra-class compactness. However, it has limited applications, since a lot of real-life scene images do not form dense clusters, so the centroid of a cluster may lie outside of the cluster, which will break the training process.
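A minimal NumPy sketch of the triplet loss and the mining taxonomy above (function names are ours, not from [14]):

```python
import numpy as np

def triplet_loss(d_ap, d_an, margin=1.0):
    """L = max(d(a, p) - d(a, n) + margin, 0), element-wise over a batch."""
    return np.maximum(d_ap - d_an + margin, 0.0)

def triplet_category(d_ap, d_an, margin=1.0):
    """Classify one triplet by the mining taxonomy described above."""
    if d_an < d_ap:
        return "hard"         # negative closer to the anchor than the positive
    if d_an < d_ap + margin:  # negative farther than positive, but inside the margin
        return "semi-hard"
    return "easy"             # loss is already zero
```

A semi-hard mining policy keeps only the middle category in each batch, which is the strategy used in our experiments below.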
The contrastive-center loss can be written as:

Lct−c = ½ Σi ‖xi − cyi‖² / (Σj≠yi ‖xi − cj‖² + δ)

Lct−c stands for the contrastive-center loss; xi denotes the training sample embedding with dimension d; yi stands for the label of xi; cyi represents the yi-th class center of deep features, with the same dimension d. The hyperparameter δ is a constant used for numeric stability.

3. Methods and Materials
3.1 Dataset and task definition
There are a lot of few-shot learning techniques. One of the most popular uses a support set. Let us define an N-way-K-shot classification problem. In that setup, the support set contains samples of N classes, so during the forward pass such a model can classify only one class out of N. There are a lot of papers and benchmarks tackling that problem. Naturally, increasing the number of classes N decreases the overall accuracy of the model, so that approach leads to poor results when working with a huge number of classes (like MiniImageNet[13]). In this work, we are going to build a model that classifies images across 1000 classes where only 10 images per class are available. Since we have a limited amount of data per class, traditional classification will lead to poor results.

Our work is mainly based on the FSS-1000 dataset[15]. This dataset is specially collected for few-shot learning purposes. A lot of popular datasets (like ImageNet[13], PASCAL VOC[14], ILSVRC[16]) introduce a high bias. It may be class balance, semantic balance, items-per-image balance, etc. The main purpose of the FSS-1000 dataset is to eliminate any bias from the data. Thus it contains 1000 classes, and each class contains 10 pictures. All pictures were collected across different search engines (Google, Yahoo, Bing, etc.). Moreover, each picture has a constant resolution of 224×224 and only one class instance is present at a time. These properties ensure that our model will have no bias toward any specific properties of images.
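The N-way-K-shot support set defined in Section 3.1 can be sampled as follows (a minimal sketch; the dataset layout, a dict from class label to a list of samples, is an assumption for illustration):

```python
import random

def sample_support_set(dataset, n_way, k_shot, seed=0):
    """Sample an N-way-K-shot support set.

    dataset : dict mapping class label -> list of samples (assumed layout)
    Returns a dict with n_way classes and k_shot samples per class.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(dataset), n_way)          # pick N classes
    return {c: rng.sample(dataset[c], k_shot) for c in classes}  # K shots each
```

For FSS-1000 this corresponds to N=1000 and K at most 10, which is exactly the regime where support-set accuracy of classical N-way approaches degrades.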
In contrast, a model trained on ImageNet may be perfect at dog classification (because that dataset contains more than 200 dog breeds) but fail at car recognition.

Figure 3. Example of the FSS-1000 dataset samples.

FSS-1000 provides instance segmentation masks. Since this work is focused on a classification task, we crop each image by the bounding box of its mask. We calculate the bounding box as the minimum and maximum coordinates of the mask along both the X and Y axes. Each cropped sample was resized to the standard resolution of 224x224 in order to match the expected input resolution of pretrained backbone models.

Figure 4. Visualization of cropped object resolutions in the FSS-1000 dataset.

In that figure, we can observe that the best tradeoff between information loss and compression ratio is 175x175. However, the original size is required in order to prevent information loss. Resizing cropped samples introduces additional distortion. In figure 5, we can observe the level of such distortions. However, according to the original aspect ratio exploration, most of the samples have a 1:1 aspect ratio, which eliminates any distortion when resizing to 224x224.

Figure 5. Cropped and resized samples. Here we can observe a few aspect ratio distortions.

3.2 Methodology and setup
In our experiments we propose two deep learning pipelines. One is suitable for binary classification (i.e., deciding whether two given images are similar or not). The second one performs traditional image classification over 1000 classes. Both pipelines are quite similar; the main difference lies in the embedding processor module.

Figure 6. Diagram of the proposed pipeline suitable for binary classification. The similarity estimation module compares the distance between embeddings with a predefined threshold which has been extracted from the training dataset.

Since the dataset we are working with is quite limited, we decided to use a model from the EfficientNet family.
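The mask-based cropping step described above can be sketched as follows (a minimal NumPy version of the min/max-coordinate bounding box; the resize to 224x224 is omitted, and the function name is ours):

```python
import numpy as np

def crop_by_mask(image, mask):
    """Crop an image to the tight bounding box of a binary segmentation mask.

    The box is taken as the minimum and maximum mask coordinates along
    both the X and Y axes, as described in the text.
    """
    ys, xs = np.nonzero(mask)          # coordinates of all mask pixels
    y0, y1 = ys.min(), ys.max() + 1    # inclusive min, exclusive max
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1]
```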
In our experiments, we used EfficientNet B2[3] as the main backbone of our model. We took weights pretrained on ImageNet[7] as initial weights.

Figure 7. Diagram of the embedding extraction module. The fully connected layer is initialized randomly. L2 normalization is used in order to keep a meaningful scale of extracted embeddings.

For the 1000-class classification task we start from the pipeline defined above. First of all, we cut off the similarity estimation module and replace it with a traditional K-nearest neighbors model. Since the KNN model cannot be trained using gradient descent methods, we introduce a 2-stage training pipeline. In the first stage, we train the backbone model using triplet loss; here we may reuse the backbone trained for the binary classification task. In the second stage, we calculate an embedding for each sample in the training dataset. Afterward, the extracted embeddings are fed to the KNN model along with class labels. So, the backbone model extracts an embedding from the input image, and the KNN model predicts class labels based on the nearest neighbors.

Figure 7. Diagram of the model suitable for a 1000-class classification task.

4. Experiment
4.1 Dataset setup
All training and evaluations have been performed on the FSS-1000 dataset. We split the dataset randomly per class: we take 7 samples of each class for training, and the rest is used for validation. We follow the traditional approach of an 80% to 20% dataset split, but shuffling is performed with respect to class balance. As a result, we obtain a perfectly balanced dataset with exactly the same number of samples per class in both the training and validation splits.

4.2 Embedding model training
We use an embedding size of 256 (i.e., the fully connected layer size is set to 256). In order to keep the backbone weights healthy, we start training with frozen backbone weights. After reaching a loss plateau, we unfreeze the backbone weights and continue training.
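The two-stage inference described above (L2-normalized embeddings followed by a nearest-neighbor vote) can be sketched in NumPy; in the real pipeline the embeddings come from the EfficientNet backbone, which is stubbed out here, and the helper names are ours:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization: project embeddings onto the unit sphere."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def knn_predict(query, db_embeddings, db_labels, k=7):
    """Second-stage classifier: majority vote among the k nearest embeddings."""
    d = np.linalg.norm(db_embeddings - query, axis=1)  # L2 distances to the database
    nearest = db_labels[np.argsort(d)[:k]]             # labels of the k closest samples
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]
```

With 7 training samples per class, k=7 (as used in our experiments) lets a query match an entire class at once.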
This setup helps us increase final accuracy by ~2%. Triplet loss is used as the main loss to optimize during training. We use a semi-hard triplet mining policy[16]. We set the margin to 1 for triplet loss, and L2 distance is used as the distance between embeddings.

Figure 8. Training loss graph of the backbone model

We calculate distances between each pair of samples. The distribution of such distances is presented in Figure 9. We calculate the best suitable threshold for the training set using the Otsu thresholding algorithm. Applying the obtained threshold (0.0048) to the validation set gives us a 0.921 true positive rate and a 0.998 true negative rate.

Figure 9. Embedding distance distributions of training and validation sets.

5. Results
We use the KNN model with K=7. That number of nearest neighbors is chosen according to the number of training samples per class. We fit the KNN model to the training embeddings.

Table 1
Classification scores
Model                                  Accuracy train    Accuracy validation
Our (top 1)                            0.944             0.836
Our (top 5)                            0.999             0.957
EfficientNet B2 + softmax (top 1)      1.0               0.723
EfficientNet B2 + softmax (top 5)      1.0               0.914

It is very important to select a triplet mining policy when using triplet loss. Deep models tend to overfit training data (especially when training on small datasets). Using triplet loss allows us to increase the number of samples exponentially (by combining three samples there are a lot of different combinations). Despite the large number of different triplets, there is still a chance to overfit the model. This is caused by the fact that the model optimizes all distances between samples, even when the distances between easy samples are already very good. In order to prevent such behavior, a triplet mining policy is implemented.

Figure 9. Class confusion matrix (left) and class distance heatmap (right)

Figure 10 presents differences between naive triplet mining (random) and semi-hard triplet mining.

Figure 10.
Difference between obtained distances using the semi-hard triplet mining policy (left) and random triplet mining (right).

As we can see, random triplet mining tends to generate very good embeddings for some classes, but the embeddings of the rest are awful. Such a setup may produce nice scores for binary classification tasks, but for 1000-class classification, only 0.122 accuracy was obtained.

6. Discussions
Traditional deep learning approaches come up with a model that is able to classify images end-to-end. Actually, such models implicitly consist of two parts: feature extraction and feature classification. The goal of the classification layers is to attribute input features to one of the classes. The role of the feature extraction module is to produce features that the classification layers will be able to classify. Thus, combining these two modules in one model along with gradient descent optimization methods gives us an end-to-end trainable model. Since those modules are trained simultaneously, the feature extractor tries to produce features that are easy to classify. So, there are no constraints on such embeddings. Thus, the model pays attention to a unique part of the object regardless of the real importance of such a part (for example, the glass may be treated as the most important part for an object to be classified as a car, despite numerous other objects having glass).

In contrast to those approaches, siamese networks explicitly find separable embeddings. By separable here we mean that the model is forced to calculate features that are distinctive from the embeddings of other classes rather than tailored to fixed class labels. In other words, the model does not operate with class labels at all. In Figure 9, we show how well our embeddings are separable. There is a tiny overlap between positive and negative distances, which corresponds to hard negative and hard positive pairs.
Within this overlap, we are likely to give a wrong classification prediction, but having a distance value we can evaluate a confidence level and either refuse to give any prediction or bias the decision toward one of the classes (which class depends on the task). In Figure 9, we also plot the distributions for the training and validation sets; the good fit on validation indicates that there was no overfitting. The same conclusion is supported by observing the validation loss following the training loss during the training process.

Having such embeddings, we can apply a distance-based classification model to them. We find the K-nearest neighbors classifier the best candidate for classification. K-fold cross-validation on the training set (with K=5) gives us the best parameters for KNN: 7 neighbors and distance-weighted voting. Evaluation of such a model on the validation set shows a significant performance gain in comparison with the traditional model. In particular, it gives us more than a 10% accuracy boost in comparison with EfficientNet B2 followed by classification layers. We also evaluated top-5 accuracy and got a similar performance boost. We also want to emphasize the gap between training and validation accuracies: the traditional model overfits the training data completely, while our embedding extractor model still avoids overfitting and produces better performance. Note that we do not consider the K-nearest neighbors model overfit, since it overfits the training data by definition.

The key difference between the few-shot model and the traditional end-to-end model is the semantics of the extracted embeddings. In a triplet-loss-based model, we can only find a similar embedding in the database, while the features do not provide any class-specific representation. That is why such a relatively small amount of data is enough to fit the model, while the traditional approach leads to overfitting.

7.
Conclusions
Classical deep learning approaches to image classification usually consist of two modules implicitly tied together: the feature extraction module and the embedding classification layers. Training such models makes the feature extraction module find class-specific embeddings. Since there are no specific constraints on the extracted embeddings, they are not separable in general.

In this work, we have tackled the large-scale classification task while dealing with a small amount of data. The proposed two-stage classification model is very promising in terms of class capacity, resistance to overfitting, and the ability to fit unseen classes. In particular, it gives us a significant accuracy boost (more than 10% of top-1 accuracy) while avoiding overfitting. Distances between produced embeddings follow a normal distribution without any outliers except easy positive pairs (very similar images, in simple words). This indicates that such a model may be easily extended to predict novel classes as-is: it is enough to precompute embeddings of unseen classes, and such a model is likely to classify unseen images correctly.

The few-shot model also seems to be a promising approach for large-scale image classification in terms of the number of internal parameters. Our model contains less than 9M parameters. The smallest model with similar performance on ImageNet 1000-class classification contains up to 66M parameters[17], which is more than 7 times bigger. So, our approach promises to be much faster than traditional approaches with the same performance while requiring dramatically less data per class.

We plan additional steps for further research. In particular, we are going to experiment with embedding size, triplet loss margin, etc. We believe that better embeddings may be obtained by ensembling embeddings from different models.

8. References
[1] Xie, Saining, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He.
"Aggregated residual transformations for deep neural networks." In Proceedings of the IEEE conference on computer vision and pattern recognition (2017): 1492-1500.
[2] Szegedy, Christian, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with convolutions." In Proceedings of the IEEE conference on computer vision and pattern recognition (2015): 1-9.
[3] Tan, Mingxing, and Quoc Le. "Efficientnet: Rethinking model scaling for convolutional neural networks." In International conference on machine learning (2019): 6105-6114.
[4] Zhong, Zhun, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. "Random erasing data augmentation." In Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07 (2020): 13001-13008.
[5] DeVries, Terrance, and Graham W. Taylor. "Improved regularization of convolutional neural networks with cutout." arXiv preprint arXiv:1708.04552 (2017).
[6] Chen, Pengguang, Shu Liu, Hengshuang Zhao, and Jiaya Jia. "Gridmask data augmentation." arXiv preprint arXiv:2001.04086 (2020).
[7] Ghiasi, Golnaz, Tsung-Yi Lin, and Quoc V. Le. "Dropblock: A regularization method for convolutional networks." Advances in neural information processing systems 31 (2018).
[8] Geirhos, Robert, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness." In International Conference on Learning Representations (ICLR), 2019.
[9] Tonioni, Alessio, Eugenio Serra, and Luigi Di Stefano. "A deep learning pipeline for product recognition on store shelves." In 2018 IEEE International Conference on Image Processing, Applications and Systems (IPAS) (2018): 25-31.
[10] Zhuang, Fuzhen, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. "A comprehensive survey on transfer learning." Proceedings of the IEEE 109, no.
1 (2020): 43-76.
[11] Khosla, Prannay, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. "Supervised contrastive learning." Advances in Neural Information Processing Systems 33 (2020): 18661-18673.
[12] Qi, Ce, and Fei Su. "Contrastive-center loss for deep neural networks." In 2017 IEEE International Conference on Image Processing (ICIP) (2017): 2851-2855.
[13] Russakovsky, Olga, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang et al. "Imagenet large scale visual recognition challenge." International Journal of Computer Vision 115, no. 3 (2015): 211-252.
[14] Vicente, S., J. Carreira, L. Agapito, and J. Batista. "Reconstructing PASCAL VOC." 2014 IEEE Conference on Computer Vision and Pattern Recognition (2014): 41-48. doi: 10.1109/CVPR.2014.13.
[15] Li, Xiang, Tianhan Wei, Yau Pun Chen, Yu-Wing Tai, and Chi-Keung Tang. "Fss-1000: A 1000-class dataset for few-shot segmentation." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020): 2869-2878.
[16] Xuan, Hong, Abby Stylianou, and Robert Pless. "Improved embeddings with easy positive triplet mining." In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2020): 2474-2482.
[17] Xie, Qizhe, Minh-Thang Luong, Eduard Hovy, and Quoc V. Le. "Self-training with noisy student improves imagenet classification." In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (2020): 10687-10698.