
SEBD

1613-0073

Representation Learning⋆

Manuel Goyo

manuel.goyo@sansano.usm.cl 1

Giacomo Frisoni

giacomo.frisoni@unibo.it 0

Gianluca Moro

Claudio Sartori

claudio.sartori@unibo.it 0

Self Supervised Learning, Representation Learning, Triplet Loss, Negative Sampling

0 Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
1 Department of Informatics, Universidad Técnica Federico Santa María, Valparaíso, Chile

2024

32 23 26

Self-supervised representation learning extracts meaningful features from data without explicit supervision, building a space with desired properties. Contrastive learning has emerged as the predominant approach to clustering similar data points and separating dissimilar ones within the embedding space. Although creating different views of the same data (e.g., cropping, rotation) emphasizes similarities without labels, current methods struggle to define negative examples. Several algorithms only consider positive examples or integrate dissimilarity measures into their loss functions by computing average distances within the same batch. However, they do not capture nuanced differences effectively, risking collapsing data points in a single location. In this paper, we propose a novel technique, termed “Refined Triplet Sampling” (ReTSam), to generate synthetic negative vectors for contrastive learning. Mechanically, for each element in the batch, we identify its k-nearest neighbors and designate their centroid as a hard negative for a triplet loss methodology. We test ReTSam on two widely used image datasets, namely CIFAR-10 and SVHN, considering content-based image retrieval and classification tasks. Our findings demonstrate that, despite its simplicity, ReTSam not only promotes the learning of similarity but also significantly improves that of dissimilarity (with a +5% increase in Mean Average Precision on CIFAR-10), resulting in superior performance in practical scenarios.

CEUR ceur-ws.org

1. Introduction

In recent years, representation learning has become a crucial element in the development of modern AI agents, largely propelled by significant advancements in self-supervised learning (SSL). SSL is a paradigm in which representations are obtained through pre-training tasks on unlabeled data. These learned representations are then used in downstream tasks such as classification or content-based image retrieval. The attractiveness of SSL stems from its ability to leverage abundant and cost-effective unlabeled data, in certain instances surpassing its supervised counterpart [1, 2].

Many contrastive learning approaches hinge on two fundamental elements: similar (positive) pairs $(x, x^+)$ and dissimilar (negative) pairs $(x, x^-)$ of data points. The training objective, typically noise-contrastive estimation [3], directs the learned representation to map positive pairs to close locations and negative pairs to distant ones. Alternative objectives have also been explored [4]. The effectiveness of these methods relies on how the positive and negative pairs are formulated, as they cannot leverage genuine similarity information due to the absence of supervision.

Certain authors opt not to explicitly generate dissimilar data. Instead, they compute distances to all other data points [4] or to their closest neighbors [5], calculate the average of these similarities, and use it as a dissimilarity measure in a loss function. The drawback of this approach is that the average does not represent dissimilarity effectively. An alternative approach focuses solely on positive instances and implements diverse parameter updates [6, 7]. Nevertheless, this method fails to give the algorithm the capability to construct a robust decision boundary for discerning differences within the data, leading to overlaps between different categories. Some authors pursue explicit negatives by considering different views (augmentations) of each image to identify real negatives and discard false negatives [8], or by estimating a sample from the distribution over negative pairs [9]. This approach stems from metric learning settings, where “hard” (true negative) examples can expedite the correction of mistakes in the learning process [10, 11]. In representation learning, informative negative examples are intuitively those pairs that are mapped nearby but should be far apart. This concept is successfully applied in metric learning, where true pairs of dissimilar points are available, in contrast to unsupervised contrastive learning.

Our methodology hinges on the generation of a hard negative, inspired by the findings of Cai et al. (2020) [12], who assert that “... a small minority of negatives were both necessary and sufficient for the downstream task to reach full accuracy.” In light of this insight, we propose an approach centered around triplet loss. In this setup, the positive pairs are generated in a conventional manner, employing transformations that preserve semantic content. The negative element, however, is built from only the k nearest neighbors of the anchor among the remaining positives in the batch: the negative is the centroid of these neighbors. The centroid is a strong representation of the negative because it encapsulates information from all vectors in close proximity to the anchor.

Particularly, the main contributions of this work are as follows:
• We design a simple but effective sampling strategy based on similarity to create negative elements.
• We propose a general self-supervised training method based on triplet loss for representation learning.
• We are the first to evaluate state-of-the-art self-supervised algorithms in the context of Content-Based Image Retrieval (CBIR) in different datasets.
• Our experiments across two datasets demonstrate that our approach surpasses existing methods in both CBIR and classification tasks, as indicated by superior performance metrics such as Mean Average Precision (MAP) for CBIR and Accuracy, Recall, Precision, and F1 for classification.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 describes our proposed method. Section 4 presents the results of applying our method to different datasets. Section 5 presents conclusions and future work.

2. Related Works

2.1. Representation Learning

In the realm of unsupervised representation learning, approaches are predominantly categorized into generative and discriminative methods [13, 4]. Generative strategies construct a distribution over data and latent embeddings and use these embeddings as image representations. Techniques such as auto-encoding of images [14, 15] and adversarial learning [16] are commonly employed in generative methods. While these approaches provide comprehensive pixel-level representations, their computational demands can be significant, and generating highly detailed images may not be essential for effective representation learning. Discriminative methods, particularly contrastive methods [4, 6, 5], currently stand at the forefront, showing state-of-the-art performance in self-supervised learning. Some alternative methodologies opt for auxiliary handcrafted prediction tasks to guide representation learning, but their efficacy often falls short of contrastive methods. Noteworthy techniques include relative patch prediction [13, 17], colorizing grayscale images [18, 19], image inpainting [20], image jigsaw puzzles [21], image super-resolution [22], and geometric transformations [23, 24]. Despite the integration of well-structured architectures [25], these approaches consistently underperform contrastive methods [26, 27].

2.2. Contrastive Learning

Contrastive learning stands as a compelling alternative to the computationally intensive task of pixel-level image generation. Instead of creating images, contrastive learning aims to minimize the distance between representations of different views of the same image (positive pairs) and maximize the distance between representations of views from different images (negative pairs) [17, 28, 6]. Contrastive methods often capitalize on comparisons with multiple examples, and in some cases they are effective even without explicit negative examples [4, 5, 7]. Several noteworthy algorithms have been proposed for contrastive learning of visual representations. SimCLR [4], for instance, uses augmented views of other items in a minibatch as negative samples. MoCo [1, 26], on the other hand, incorporates a momentum-updated memory bank of old negative representations, enabling the use of large batches of negative samples. Huynh et al. [8] tackle a fundamental issue in contrastive learning: the mitigation of false negatives, which cause semantic information to be discarded and slow down convergence. They propose approaches to identify false negatives and introduce two strategies, false negative elimination and attraction, to mitigate their effects, supported by systematic evaluations. Robinson et al. [9] present an unsupervised method based on a simple distribution over hard negative pairs for contrastive representation learning. They construct this distribution under the assumption that the most useful negative samples are those that the embedding currently believes to be similar to the anchor. Another noteworthy approach to learning image representations is introduced in [29]. It computes the cross-correlation matrix between the outputs of two identical networks, which receive distorted versions of a sample. The objective is to make this cross-correlation matrix as similar to the identity matrix as possible. This ensures that the embedding vectors of the distorted versions of a sample become more similar to each other while reducing redundancy among the components of these vectors.
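To make the cross-correlation objective described above concrete, the following is a minimal NumPy sketch and not the reference implementation of [29]. The function name `barlow_twins_loss`, the off-diagonal weight `lam`, and the stabilizing constant `eps` are assumptions introduced here for illustration.

```python
import numpy as np

def barlow_twins_loss(z1, z2, lam=5e-3, eps=1e-12):
    """Cross-correlation objective between two batches of embeddings.

    z1, z2: arrays of shape (batch, dim), embeddings of two distorted views.
    lam: weight of the redundancy (off-diagonal) term; an assumed value.
    """
    n = z1.shape[0]
    # Standardize every embedding dimension over the batch.
    z1 = (z1 - z1.mean(axis=0)) / (z1.std(axis=0) + eps)
    z2 = (z2 - z2.mean(axis=0)) / (z2.std(axis=0) + eps)
    c = z1.T @ z2 / n                               # (dim, dim) cross-correlation
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)             # pull the diagonal toward 1
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)   # push the rest toward 0
    return on_diag + lam * off_diag
```

The diagonal term rewards agreement between the two views of each sample, while the off-diagonal term penalizes redundancy between embedding components.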

2.3. Triplet Loss Approach

The triplet loss approach, initially introduced by Ding et al. for person re-identification and independently adopted by Schroff et al. for face recognition [30, 10], has undergone substantial evolution and become an influential paradigm in contrastive learning. Building on the foundational concept of triplet loss, researchers have worked to improve the generation and selection of valuable triplets. Hermans et al. [31] contributed strategies to identify and leverage informative triplets, bolstering the robustness and effectiveness of the triplet loss methodology. Wang et al. [32] explored cross-batch triplet loss, with the objective of improving generalization and stabilizing the triplet loss approach by exploiting inter-batch relationships. Researchers have also adapted the triplet loss approach to weakly supervised scenarios: Wang et al. [33] explored methods to harness weak supervision signals and extend the triplet loss paradigm to settings where labeled data may be scarce. Turpault et al. [34] integrated unsupervised triplet loss-based learning into a self-supervised representation learning framework. Their variant obtains positive samples for triplets with unlabeled anchors by applying a transformation to the anchor. The negative sample for these triplets is then chosen as the sample in the training set that is closest to the anchor and distant from the positive sample. Wang et al. [5] introduced a truncated triplet loss, in which the negative pair is constructed by selecting a negative sample deputy from all negative samples. This choice aims to mitigate false negatives and prevent the model from over-clustering samples of the same actual categories into different clusters. Finally, Li et al. [35] introduce an algorithm called Trip-ROMA, based on a simple Triplet loss with a RandOm MApping (ROMA) strategy, which maps random samples into other spaces and requires these randomly projected samples to satisfy the same relationship indicated by the triplets. Their method combines the triplet-based loss with this random mapping.

3. Algorithm

We first present the motivation and then the algorithm.

3.1. Motivation

In recent years, the prominence of Self-Supervised Representation Learning has grown significantly, primarily driven by the challenges posed by the absence of labeled data. A prevalent strategy applies augmentations to generate different views of the same data, effectively emphasizing similar or closely related data points [36] (see Fig. 1).

However, a critical challenge emerges in creating dissimilar data, as failure to do so may lead to a collapsing solution where all data points cluster at a single location [13]. To address it, some authors calculate distances to all other data points [4] or to their closest neighbors [5], compute the average of these similarities, and use it as a dissimilarity measure in a loss function. Nevertheless, the average does not represent dissimilarity effectively, so this approach requires a large batch size. An alternative method considers only positive instances and implements diverse parameter updates [6, 7]. However, this method falls short in enabling the algorithm to construct a robust decision boundary for discriminating differences within the data, leading to overlaps between different categories. The crux of our motivation lies in selecting a robust representation of the negative within the data (a hard negative). This representation should effectively challenge the model in differentiating it from the positive. Leveraging the triplet loss approach, commonly employed in contrastive learning for SSL, becomes a natural choice, in the

For simplicity, we illustrate the triplet set $\{(x_i, x_i^+, x_i^-)\}_{i=1,\cdots,N}$ using one query data and one

$$\mathcal{L} = \sum_{i=1}^{N} \max\big(s(x_i, x_i^-) - s(x_i, x_i^+),\ m\big)$$

where $s$ is a similarity metric (e.g., cosine similarity or Euclidean), and $m$ is a margin determining whether to discard a triplet

Constructing triplets for each data point poses a significant challenge, particularly in determining how to establish negative pairs accurately (dog in Fig. 2). While positive pairs can be reliably generated, identifying negative pairs involves the use of hard negative samples (points that are challenging to distinguish from an anchor point). The key challenge lies in using hard negatives while remaining unsupervised, which precludes existing negative sampling strategies that rely on true similarity information.

3.2. Proposed Methodology

To overcome the challenge of creating dissimilar data and to enhance the effectiveness of the triplet loss approach, we draw inspiration from the work of Cai et al. [12]. Their findings suggest that only a small quantity of negatives is necessary for achieving full accuracy in downstream tasks. In our proposed method, we introduce a novel approach for generating negative values within a triplet set.

In this approach, the anchor represents one view of the data, and the positive is derived from the other view of the same data within a batch. Crucially, the negative is constructed by searching for the k nearest neighbors of the anchor among the positive ones. The negative value is obtained by calculating the centroid of these k vectors. This vector serves as a strong representation of the negative since it combines elements of the negative data with characteristics of the positive data, effectively building a hard negative. This is attributed to its ability to encapsulate information from all vectors in close proximity to the anchor. Consequently, the centroid is hard to differentiate from the anchor, which enhances the discriminative capability of the model.

Mathematically, the triplet loss is expressed as:

$$\mathcal{L}_1(a, p, n) = \max\big(\mathrm{sim}(a, n) - \mathrm{sim}(a, p) + m,\ 0\big) \quad (1)$$

Here, $a = f(t_1(x))$ represents the anchor and $p = f(t_2(x))$ represents the positive, with $f$ denoting an encoder neural network, $t_1, t_2$ drawn from the set $\mathcal{T}$ of augmentation transform techniques, $m$ the margin, and $\mathrm{sim}(\cdot, \cdot)$ a similarity measure between two vectors (cosine similarity by default). The $i$-th element of $n$ is computed as $n[i] = \mathrm{Centroid}(k\text{-nearest}(p_{-i}))$, where Centroid denotes the centroid function and $k\text{-nearest}(p_{-i})$ represents the $k$ elements of $p$ closest to $a_i$, excluding the $i$-th element.
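The negative construction and Eq. (1) can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the margin value of 0.5, the batch averaging, and the requirement that k be smaller than the batch size are all assumptions introduced here.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)

def centroid_negatives(anchors, positives, k):
    """For each anchor i, average the k positives most similar to it,
    excluding the positive at index i (its own augmented view)."""
    a, p = l2_normalize(anchors), l2_normalize(positives)
    sim = a @ p.T                               # (batch, batch) cosine similarities
    np.fill_diagonal(sim, -np.inf)              # never pick the anchor's own positive
    nearest = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar
    return positives[nearest].mean(axis=1)      # (batch, dim) centroids

def triplet_loss(anchors, positives, negatives, margin=0.5):
    """Hinge on sim(a, n) - sim(a, p) + margin, averaged over the batch."""
    a, p, n = map(l2_normalize, (anchors, positives, negatives))
    sim_ap = np.sum(a * p, axis=1)
    sim_an = np.sum(a * n, axis=1)
    return np.maximum(sim_an - sim_ap + margin, 0.0).mean()
```

In a training loop, `anchors` and `positives` would be the encoder outputs for the two augmented views of a batch, and the centroids would be recomputed at every step.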

Typically, the triplet loss is constrained by its sensitivity to the training triplets due to its reliance on a set margin [ 37 ]. Consequently, the cross-entropy loss serves as a more flexible alternative, resembling a softer version of the triplet loss with an adjustable margin [ 35 ]. This adaptation addresses the constraint of the triplet loss with a fixed margin.

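The cross-entropy term is:

$$\mathcal{L}_2(a, p, n) = -\log \frac{\exp(a^\top p)}{\exp(a^\top p) + \exp(a^\top n)} \quad (2)$$

Finally, the total loss function is defined as:

$$\mathcal{L}oss = \mathcal{L}_1(a, p, n) + \mathcal{L}_2(a, p, n) \quad (3)$$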
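The cross-entropy term of Eq. (2) equals $\log(1 + \exp(a^\top n - a^\top p))$, which is a smooth counterpart of the hinge in Eq. (1). The sketch below computes both terms and their sum as in Eq. (3). It assumes the embeddings are already L2-normalized (so the dot product is the cosine similarity), a margin of 0.5, and averaging over the batch; the function names are hypothetical.

```python
import numpy as np

def soft_triplet_loss(a, p, n):
    """Eq. (2) per sample, written as log(1 + exp(a.n - a.p)).
    np.logaddexp(0, x) evaluates log(1 + exp(x)) in a numerically stable way."""
    logit_p = np.sum(a * p, axis=1)
    logit_n = np.sum(a * n, axis=1)
    return np.logaddexp(0.0, logit_n - logit_p)

def hinge_triplet_loss(a, p, n, margin=0.5):
    """Eq. (1) per sample, with dot-product similarity on normalized vectors."""
    return np.maximum(np.sum(a * n, axis=1) - np.sum(a * p, axis=1) + margin, 0.0)

def total_loss(a, p, n, margin=0.5):
    """Eq. (3): sum of both terms, averaged over the batch (an assumption)."""
    return np.mean(hinge_triplet_loss(a, p, n, margin) + soft_triplet_loss(a, p, n))
```

When the hinge term is already zero for an easy triplet, the soft term still provides a small gradient, which is one way to read the "adjustable margin" remark above.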

This proposed solution addresses the limitations of existing methods by introducing a more effective way of constructing negative representations, thereby aiming to enhance overall performance in representation learning, particularly in Content-Based Image Retrieval and Classification tasks.

4. Main Results

4.1. Training

This section presents the protocols used to train our algorithm and our results.

Data Augmentation: One type of augmentation involves spatial/geometric transformation of data, such as cropping and resizing (with horizontal flipping), rotation [24], and cutout [38]. The other type of augmentation involves appearance transformation, such as color distortion (including color dropping, brightness, contrast, saturation, hue) [39, 40], Gaussian blur, and Sobel filtering.
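As an illustration of how two views of one image can be produced, the sketch below chains a simplified spatial augmentation with cutout. It is not the paper's pipeline: pad-and-crop is used here as a stand-in for crop-and-resize, appearance transformations are omitted, and the function names, padding of 4 pixels, and patch size of 8 pixels are assumptions.

```python
import numpy as np

def random_crop_flip(img, rng, pad=4):
    """Pad, take a random crop of the original size, flip with probability 0.5."""
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 2 * pad + 1, size=2)
    out = padded[top:top + h, left:left + w]
    return out[:, ::-1] if rng.random() < 0.5 else out

def cutout(img, rng, size=8):
    """Zero out one randomly placed square patch (clipped at the borders)."""
    h, w, _ = img.shape
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    out = img.copy()
    out[max(cy - size // 2, 0):cy + size // 2,
        max(cx - size // 2, 0):cx + size // 2] = 0
    return out

def two_views(img, rng):
    """Two independently augmented views of the same (H, W, C) image."""
    return tuple(cutout(random_crop_flip(img, rng), rng) for _ in range(2))
```

A call such as `two_views(image, np.random.default_rng(0))` returns the pair that would feed the anchor and positive branches.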

Algorithm: Our algorithm is based on [4]. The general schema is shown in Figure 3, and the final algorithm is given in Algorithm 1.

Datasets: We use two different datasets to validate the results. The CIFAR-10 dataset comprises 60,000 32x32 color images categorized into 10 classes, each containing 6,000 images. It is divided into 50,000 training images and 10,000 test images [41]. The SVHN (Street View House Numbers) dataset is a real-world image dataset specifically designed for developing machine learning and object recognition algorithms with minimal data preprocessing and formatting requirements. It consists of images containing digits, with 10 classes representing the digits 0 to 9. The dataset is split into 73,257 digits for training and 26,032 digits for testing [42].

Metrics:
• Mean Average Precision (MAP) is a crucial metric in image retrieval tasks, providing a comprehensive measure of a system’s effectiveness across multiple queries. It assesses the average precision at each relevant image’s position in the ranked list and computes the mean of these values. Relevant images are defined based on query relevance, and precision is calculated by dividing the number of relevant images retrieved up to a certain position by the total number of retrieved images up to that position. MAP@K, a variant of MAP where only the top K retrieved items are considered, is computed as:

$$\mathrm{MAP@}K = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{\sum_{k=1}^{K} \mathrm{Precision@}k(q) \cdot \mathrm{Relevance}(k)}{\min(K, |R_q|)}$$

where $|Q|$ is the total number of queries, $\mathrm{Precision@}k$ is the precision at position $k$ for query $q$, $\mathrm{Relevance}(k)$ is a binary indicator function that is 1 if the item at position $k$ is relevant and 0 otherwise, $|R_q|$ is the number of relevant items for query $q$, and $K$ is the cutoff rank.
• Accuracy, Recall, Precision, and F1-score are fundamental metrics for evaluating classification tasks. Accuracy measures the proportion of correctly classified instances among all instances, providing an overall assessment of the model’s performance. Recall quantifies the proportion of true positive instances correctly identified by the model among all actual positive instances. Precision measures the proportion of true positive instances among all instances predicted as positive. F1-score, the harmonic mean of precision and recall, balances the trade-off between the two in a single metric. Together, these metrics describe how effectively the classification model identifies instances belonging to different classes.
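The MAP@K definition above can be computed with a few lines of plain Python. The sketch follows the quantities named in the text (precision at each rank, a binary relevance flag, the number of relevant items per query, and the cutoff rank); the function names and the convention of returning 0 for a query with no relevant items are assumptions.

```python
def average_precision_at_k(relevance, num_relevant, k):
    """AP@K for one query.

    relevance: ranked list of 0/1 flags (1 = the retrieved item is relevant).
    num_relevant: total number of relevant items for the query.
    """
    if num_relevant == 0:
        return 0.0
    hits, score = 0, 0.0
    for rank, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / rank   # precision at this rank, counted only on hits
    return score / min(k, num_relevant)

def map_at_k(all_relevance, all_num_relevant, k):
    """Mean of AP@K over all queries."""
    aps = [average_precision_at_k(r, n, k)
           for r, n in zip(all_relevance, all_num_relevant)]
    return sum(aps) / len(aps)
```

For example, a query whose ranked results are relevant at ranks 1 and 3, with two relevant items overall and K = 3, scores (1/1 + 2/3) / 2 = 5/6.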

Evaluation: The evaluation was carried out using two different methods. Firstly, the CBIR method was employed, where the last output layer of the encoder was used to generate a feature vector for each image. Subsequently, the closest images in the training set were retrieved for each image in the test set, and the quality of the k nearest neighbors was measured with the Mean Average Precision at K (MAP@K) metric. Secondly, a linear evaluation was conducted. In this approach, a single linear layer was added on top of the encoder and trained to perform classification using the available labels while keeping the encoder weights frozen.
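The retrieval step of the CBIR evaluation can be sketched as below. It assumes cosine similarity between encoder features and treats a retrieved training image as relevant when it shares the query's class label; both choices, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def retrieve_relevance(test_feats, test_labels, train_feats, train_labels, k):
    """Rank training features by cosine similarity for each test feature and
    return a (num_test, k) 0/1 matrix: 1 where the retrieved label matches."""
    q = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    g = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    top = np.argsort(-(q @ g.T), axis=1)[:, :k]   # k most similar training items
    return (train_labels[top] == test_labels[:, None]).astype(int)
```

Each row of the returned matrix is a ranked relevance list, which is the input a MAP@K computation needs.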

Other protocols: Our encoder is based on the architecture from “Very Deep Convolutional Networks for Large-Scale Image Recognition” [43]. The batch size is 32 and the maximum number of epochs is 200. We use stochastic gradient descent with a learning rate of 0.6 and a cosine learning rate decay schedule. All details are given in the appendix.
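A cosine learning rate decay with the stated base rate and epoch budget can be written as follows. The sketch assumes a per-epoch update, no warm-up, and decay to a minimum of zero, none of which is specified in the text; `cosine_lr` is a hypothetical helper name.

```python
import math

def cosine_lr(epoch, base_lr=0.6, max_epochs=200, min_lr=0.0):
    """Cosine decay from base_lr at epoch 0 to min_lr at max_epochs."""
    progress = min(max(epoch / max_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the rate starts at 0.6, passes through 0.3 halfway through training, and reaches zero at epoch 200.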

4.2. Baselines

We compare our approach with four relevant state-of-the-art works in self-supervised learning.
• SimCLR [4]: is a straightforward framework for contrastive learning of visual representations. Two distinct data augmentation operators, $t \sim \mathcal{T}$ and $t' \sim \mathcal{T}$, are randomly selected from the same family of augmentations and applied to each data example, creating two correlated views. A base encoder network and a projection head are trained to maximize agreement using a contrastive loss. After completing the training, the projection head is discarded, and the encoder is employed to obtain a representation, denoted as h, for downstream tasks. Notably, SimCLR introduces a learnable nonlinear transformation between the representation and the contrastive loss, significantly enhancing the quality of the learned representations.
• SimSiam [6]: is a model designed to maximize the similarity between two augmentations of a single image while avoiding collapsing solutions. It uses two augmented views of the same image, processed by an identical encoder network (comprising a backbone and a projection MLP). A prediction MLP is applied to one side, while a stop-gradient operation is applied to the other side. The model’s objective is to maximize the similarity between both sides. Notably, SimSiam does not rely on negative pairs or a momentum encoder. The authors empirically demonstrate the existence of collapsing solutions and emphasize the critical role of the stop-gradient operation in preventing such occurrences. This suggests the presence of an underlying optimization problem different from conventional contrastive learning.
• BYOL [7]: is an approach to self-supervised image representation learning. It relies on two neural networks, referred to as online and target networks, that interact and learn from each other. Using an augmented view of an image, the online network is trained to predict the target network’s representation of the same image under a different augmented view. Concurrently, the target network is updated with a slow-moving average of the online network. The use of a slow-moving average of the online parameters as the target network encourages the encoding of increasing information within the online projection and mitigates the risk of collapsed solutions.
• BarlowTwins [29]: proposes an objective function that inherently avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample. The objective is to make this matrix as close to the identity matrix as possible. This approach ensures that the embedding vectors of distorted versions of a sample are similar while minimizing redundancy between the components of these vectors.
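The slow-moving average that BYOL uses to update its target network can be sketched as an exponential moving average over parameters. Here the parameters are represented as a plain dictionary of numbers for illustration; the function name and the decay value `tau=0.99` are assumptions, not values taken from the text.

```python
def ema_update(target, online, tau=0.99):
    """Move each target parameter a small step toward its online counterpart:
    target <- tau * target + (1 - tau) * online."""
    return {name: tau * target[name] + (1.0 - tau) * online[name]
            for name in target}
```

With `tau` close to 1 the target changes slowly, so the online network is always predicting a stable, slightly older version of itself.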

4.3. Preliminary Results

The tables report the performance of the Self-Supervised Learning (SSL) algorithms on Content-Based Image Retrieval (CBIR) and Linear Evaluation for the CIFAR-10 [41] and Street View House Numbers (SVHN) [42] datasets. For CBIR, Mean Average Precision (MAP) is computed at different values of k, the number of nearest neighbors sought in the retrieval process. Each row in Table 1 corresponds to a distinct SSL algorithm and shows its MAP at each value of k. Higher MAP values indicate a superior ability to retrieve relevant images in the CBIR task. For the Linear Evaluation, presented in Table 2, Accuracy, Recall, Precision, and F1-score are provided for each SSL algorithm. Together, the tables give a detailed breakdown of the effectiveness of each SSL algorithm in image retrieval and classification tasks.
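The classification metrics reported for the Linear Evaluation can be computed as in the following sketch. It assumes macro-averaging over classes for precision, recall, and F1, which the text does not specify, and a score of 0 for a class with an empty denominator; `classification_report` is a hypothetical helper, not a library function.

```python
def classification_report(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }
```

For labels `[0, 0, 1, 1]` and predictions `[0, 1, 1, 1]`, class 0 has precision 1 and recall 0.5, class 1 has precision 2/3 and recall 1, and the accuracy is 0.75.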

4.4. Analysis

Preliminary results reveal the effectiveness of our approach on two fundamental tasks: Content-Based Image Retrieval (CBIR) and Linear Evaluation.
• Content-Based Image Retrieval (CBIR): To evaluate the performance of our method on CBIR, the CIFAR-10 and SVHN datasets were used. Looking at Table 1:
– CIFAR-10: Our method significantly outperforms the baselines for different values of k. Compared to other state-of-the-art methods such as SimCLR, SimSiam, BYOL, and BarlowTwins, our approach demonstrates considerable improvement in mean average precision (MAP). We achieved a MAP of 0.7316 for k=1000, 0.8253 for k=100, 0.868 for k=10, and 0.924 for k=1, indicating a high capacity for image representation and retrieval in the latent space.
– SVHN: Although our algorithm shows notable improvement in MAP over SimCLR, BYOL, and BarlowTwins, it is outperformed by SimSiam. Our method achieves a MAP of 0.4315 for k=1000, 0.6004 for k=100, 0.6965 for k=10, and 0.805 for k=1. Despite not being the best on this dataset, our approach is still competitive and offers promising results.
• Linear Evaluation: To evaluate the generalization ability of the learned representations in a linear classification task, an evaluation was performed on CIFAR-10 and SVHN. Performance metrics include accuracy, recall, precision, and F1-score. Analyzing Table 2:
– CIFAR-10: Our method excels at this task, significantly outperforming other state-of-the-art approaches such as SimCLR, SimSiam, BYOL, and BarlowTwins. We achieved a classification accuracy of 93.22%, demonstrating the effectiveness of the learned representations in linear classification tasks on this dataset.
– SVHN: Our method also shows strong performance on the linear classification task for SVHN. Although SimSiam outperforms our approach on the CBIR task, our method outperforms both SimSiam and the other baselines in terms of classification accuracy, achieving an accuracy of 87.42%.

In summary, our results indicate that our approach performs strongly on the CBIR task on CIFAR-10 and is highly competitive on SVHN. Furthermore, it demonstrates strong generalization ability in linear classification tasks on both datasets. These findings support the effectiveness and promise of our method in feature extraction and representation of image data.

Algorithm 1: Algorithm

5. Conclusion

The landscape of Self-Supervised Representation Learning (SRL) has witnessed significant advancements, and this paper contributes to the field by addressing a crucial limitation in existing methods. Traditional approaches often focus on learning similarity without adequately capturing dissimilarity nuances, leading to suboptimal representations. Our proposed method, termed “Refined Triplet Sampling” (ReTSam), introduces a novel strategy for generating negative vectors in a batch, enhancing the triplet loss methodology for representation learning. The motivation behind our approach stems from the challenge of creating dissimilar data, a critical aspect of effective SRL. Existing methods, including those relying on the average as a measure of dissimilarity, fall short of providing robust negative representations. Our method tackles this limitation by constructing negative samples based on the k-nearest neighbors, significantly improving the model’s ability to differentiate dissimilar instances.

Experimental results in Content-Based Image Retrieval (CBIR) and Linear Evaluation show that our approach outperforms the other Self-Supervised Learning (SSL) baselines on CIFAR-10 in both tasks and on SVHN in Linear Evaluation, while remaining competitive with SimSiam on SVHN retrieval. The refined representations reach higher Mean Average Precision (MAP) values in CBIR on CIFAR-10, emphasizing the effectiveness of our method in retrieving relevant images. Linear Evaluation further underscores the versatility of our learned representations, which outperform the other algorithms in classification accuracy on both datasets.

Acknowledgments

This research received partial support through an agreement with Scotiabank and Federico Santa María Technical University, as well as via a scholarship for international visits provided by Federico Santa María Technical University and the National Agency for Research and Development (doctoral scholarship 2022/21221059).

feature embedding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012.
[12] T. T. Cai, J. Frankle, D. J. Schwab, A. S. Morcos, Are all negatives created equal in contrastive instance discrimination?, arXiv preprint arXiv:2010.06682 (2020).
[13] C. Doersch, A. Gupta, A. A. Efros, Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
[14] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103.
[15] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).
[17] C. Doersch, A. Zisserman, Multi-task self-supervised visual learning, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2051–2060.
[18] R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III 14, Springer, 2016, pp. 649–666.
[19] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, Springer, 2016, pp. 577–593.
[20] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. A. Efros, Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544.
[21] M. Noroozi, P. Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: European conference on computer vision, Springer, 2016, pp. 69–84.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
[23] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, Advances in neural information processing systems 27 (2014).
[24] S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, arXiv preprint arXiv:1803.07728 (2018).
[25] A. Kolesnikov, X. Zhai, L. Beyer, Revisiting self-supervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1920–1929.
[26] X. Chen, H. Fan, R. Girshick, K. He, Improved baselines with momentum contrastive learning, arXiv preprint arXiv:2003.04297 (2020).
[27] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, P. Isola, What makes for good views for contrastive learning?, Advances in neural information processing systems 33 (2020)

A. Implementation Details

A.1. Hardware Configuration

The experiments were carried out on a computer with the following specifications: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, 32GB of RAM, and a GeForce GTX 1080 Ti GPU.

A.2. Selected Hyperparameters

Table 3 lists all the hyperparameters used for our methods. Each one plays a distinct role in shaping the behavior and efficacy of the employed techniques, and we selected and tuned them to improve overall performance and robustness across the experimental settings and datasets.

Table 3. Selected hyperparameters: 200, 32, SGD, 15.
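To illustrate where hyperparameters such as a neighbourhood size and a margin would enter, the following is a minimal sketch of the idea summarized in the abstract: for each embedding in a batch, take its k nearest neighbours, use their centroid as a synthetic hard negative, and apply a triplet loss. It is not the authors' implementation. The function names `knn_centroid_negatives` and `triplet_loss`, the Euclidean distance, the margin form of the loss, and the toy values of `k` and `margin` are all assumptions made for illustration and are not taken from Table 3.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def knn_centroid_negatives(batch, k):
    """For each embedding, return the centroid of its k nearest neighbours.

    The embedding itself is excluded from its own neighbourhood.
    Assumes the batch holds at least two embeddings (illustrative only).
    """
    negatives = []
    for i, anchor in enumerate(batch):
        others = [v for j, v in enumerate(batch) if j != i]
        nearest = sorted(others, key=lambda v: euclidean(anchor, v))[:k]
        centroid = [
            sum(v[d] for v in nearest) / len(nearest)
            for d in range(len(anchor))
        ]
        negatives.append(centroid)
    return negatives


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss (assumed form): pull the positive closer
    to the anchor than the negative by at least `margin`."""
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Toy 2-D embeddings; in practice these would come from the encoder.
batch = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
# Hypothetical embeddings of augmented views of the same images.
positives = [[0.1, 0.0], [1.0, 0.1], [0.1, 1.0], [10.0, 10.1]]

negatives = knn_centroid_negatives(batch, k=2)
losses = [
    triplet_loss(a, p, n, margin=1.0)
    for a, p, n in zip(batch, positives, negatives)
]
print(negatives[0], losses)
```

In this toy batch, the isolated point at (10, 10) receives a distant centroid as its negative and therefore contributes zero loss, while the three clustered points receive nearby centroids and contribute a positive loss.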

A.3. Dataset details

Additional information about the datasets is presented in Table 4. Note that the two datasets are very different in nature: one consists of natural images, while the other is composed solely of numbers. Evaluating on both is essential for a comprehensive assessment of performance across different kinds of data.

A.4. Recovery Visualization

In this subsection, we present visual examples showcasing the recovery achieved by our method. These illustrations are depicted in Figures ??, ??, ??, ??, and ??. Through these images, we aim to demonstrate the effectiveness of our approach in accurately reconstructing the original content. Notably, our method preserves the semantic integrity of the images during the recovery process, retaining crucial visual details and structures.

A.5. Online Resources

For those interested in replicating our results, the code is available on GitHub at the following link:

GitHub Repository

This repository contains the resources and instructions needed to replicate our findings and to examine our methodology in more detail.

A.6. Future Work

Despite the advancements presented in this work in the domain of image retrieval and classification, several lines of research can further enrich our approach and explore its applicability in different visual contexts. Some areas of interest for future investigation are highlighted below:

• Exploration of Diversity in Image Datasets: To assess the robustness and generalization of our algorithm across different visual domains, we propose including additional datasets with images of diverse nature, such as medical images, satellite data, and texture images. Expanding the image domains will allow a more comprehensive evaluation of the algorithm's ability to adapt to a variety of visual contexts.
• Transfer Learning in Cross-Domain Scenarios: To extend our research on transfer learning, we suggest exploring cross-domain scenarios where the model is trained on one dataset and evaluated on another with different visual characteristics. This line of investigation will help assess the algorithm's ability to adapt to different visual styles and the transferability of learned representations across image domains.
• Exploration of Semi-Supervised Learning Techniques: To further improve the performance of the algorithm in image retrieval and classification tasks, we propose investigating semi-supervised learning techniques. This approach leverages both labeled and unlabeled data to train the model, which can be particularly useful when labeled datasets are scarce or expensive to obtain. Exploring semi-supervised strategies could open up new opportunities to enhance the efficiency and accuracy of the algorithm in computer vision tasks.

These research directions represent significant steps towards advancing our understanding of self-supervised algorithms in the field of computer vision and their application in a variety of visual domains and real-world scenarios.
