Refining Triplet Sampling for Improved Self-Supervised Representation Learning⋆

Manuel Goyo1,∗,†, Giacomo Frisoni2,†, Gianluca Moro2,† and Claudio Sartori2,∗,†
1 Department of Informatics, Universidad Técnica Federico Santa María, Valparaíso, Chile
2 Department of Computer Science and Engineering, University of Bologna, Bologna, Italy

Abstract
Self-supervised representation learning extracts meaningful features from data without explicit supervision, building a space with desired properties. Contrastive learning has emerged as the predominant approach to clustering similar data points and separating dissimilar ones within the embedding space. Although creating different views of the same data (e.g., cropping, rotation) emphasizes similarities without labels, current methods struggle to define negative examples. Several algorithms only consider positive examples or integrate dissimilarity measures into their loss functions by computing average distances within the same batch. However, they do not capture nuanced differences effectively, risking collapsing data points into a single location. In this paper, we propose a novel technique, termed "Refined Triplet Sampling" (ReTSam), to generate synthetic negative vectors for contrastive learning. Concretely, for each element in the batch, we identify its k-nearest neighbors and designate their centroid as a hard negative for a triplet loss methodology. We test ReTSam on two widely used image datasets, namely CIFAR-10 and SVHN, considering content-based image retrieval and classification tasks. Our findings demonstrate that, despite its simplicity, ReTSam not only promotes the learning of similarity but also significantly improves that of dissimilarity (with a +5% increase in Mean Average Precision on CIFAR-10), resulting in superior performance in practical scenarios.

Keywords
Self-Supervised Learning, Representation Learning, Triplet Loss, Negative Sampling

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
∗ Corresponding author. † These authors contributed equally.
manuel.goyo@sansano.usm.cl (M. Goyo); giacomo.frisoni@unibo.it (G. Frisoni); gianluca.moro@unibo.it (G. Moro); claudio.sartori@unibo.it (C. Sartori)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Lately, representation learning has become a crucial element in the development of modern AI agents, largely propelled by significant advancements in self-supervised learning (SSL). SSL is a paradigm where representations are obtained through pre-training tasks using unlabeled data, playing a pivotal role in contemporary AI. These acquired representations are then utilized in subsequent tasks like classification or content-based retrieval of images. Importantly, the attractiveness of SSL stems from its capability to leverage abundant and cost-effective unlabeled data, often surpassing its supervised counterpart in certain instances [1, 2].

Many contrastive learning approaches hinge on two fundamental elements: the concepts of similar (positive) pairs (x, x+) and dissimilar (negative) pairs (x, x−) of data points. The training objective, typically noise-contrastive estimation [3], directs the learned representation to map positive pairs to close locations and negative pairs to distant ones. Alternative objectives have also been explored [4].
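To make the notion of positive and negative pairs concrete, the sketch below shows a minimal noise-contrastive (InfoNCE-style) objective of the kind referenced above [3, 4]. It is an illustrative PyTorch assumption, not the objective proposed in this paper: each anchor embedding is pulled towards the embedding of another view of the same image and pushed away from the embeddings of all other images in the batch.

```python
# Minimal InfoNCE-style contrastive objective, included for illustration only;
# tensor and function names are assumptions, not this paper's implementation.
import torch
import torch.nn.functional as F

def info_nce_loss(z_anchor: torch.Tensor, z_positive: torch.Tensor,
                  temperature: float = 0.1) -> torch.Tensor:
    """z_anchor, z_positive: (batch, dim) embeddings of two views of the same images.
    For each anchor, its own positive is the target; every other positive in the
    batch acts as a negative, so similar pairs are pulled close and dissimilar
    pairs pushed apart."""
    z_anchor = F.normalize(z_anchor, dim=1)
    z_positive = F.normalize(z_positive, dim=1)
    logits = z_anchor @ z_positive.T / temperature          # (batch, batch) cosine similarities
    targets = torch.arange(z_anchor.size(0), device=z_anchor.device)
    return F.cross_entropy(logits, targets)                 # diagonal entries are the positives
```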
The effectiveness of these methods relies on how the positive and negative pairs are constructed, as they cannot leverage genuine similarity information due to the absence of supervision. Certain authors opt not to explicitly generate dissimilar data. Instead, they compute distances to all other data points [4] or their closest neighbors [5], calculate the average of these similarities, and use it as a dissimilarity measure in a loss function. However, the drawback of this approach lies in the inadequacy of the average to effectively represent dissimilarity. An alternative approach addresses the issue by focusing solely on positive instances and implementing diverse parameter updates [6, 7]. Nevertheless, this method fails to endow the algorithm with the capability to construct a robust decision boundary for effectively discerning differences within the data, leading to overlaps between different categories. Some authors pursue explicit negatives by considering different views (augmentations) for each image to identify real negatives and discard false negatives [8], or by estimating a sample from the distribution over negative pairs [9]. This approach stems from metric learning settings, where "hard" (true negative) examples can expedite the correction of mistakes in the learning process [10, 11]. In representation learning, informative negative examples are intuitively those pairs that are mapped nearby but should be far apart. This concept is successfully applied in metric learning, where true pairs of dissimilar points are available, in contrast to unsupervised contrastive learning.

Our methodology hinges on the generation of a hard negative, inspired by the findings of Cai et al. (2020) [12], who assert that "... a small minority of negatives were both necessary and sufficient for the downstream task to reach full accuracy." In light of this insight, we propose an approach centered around triplet loss. In this setup, the positive pairs are generated in a conventional manner, employing transformations that preserve semantic content. However, the negative element is crafted by considering only the k nearest neighbors of the anchor among the remaining positives in the batch; the negative is then derived by computing their centroid. This approach emphasizes that the centroid serves as an excellent representation of the negative, owing to its ability to encapsulate information from all vectors in close proximity to the anchor.

Particularly, the main contributions of this work are as follows:
• We design a simple but effective sampling strategy based on similarity to create negative elements.
• We propose a general self-supervised training method based on triplet loss for representation learning.
• We are the first to evaluate state-of-the-art self-supervised algorithms in the context of Content-Based Image Retrieval (CBIR) on different datasets.
• Our experiments across two datasets demonstrate that our approach surpasses existing methods in both Content-Based Image Retrieval (CBIR) and classification tasks, as indicated by superior performance metrics such as Mean Average Precision (MAP) for CBIR and Accuracy, Recall, Precision, and F1 for classification.

The rest of the paper is organized as follows. Section 2 presents a review of the work related to this approach. Section 3 describes our proposed method. Section 4 shows the results of applying our method to different datasets. Section 5 presents conclusions and future work.

2. Related Works
2.1. Representation Learning

In the realm of unsupervised representation learning, the approaches are predominantly categorized into generative and discriminative methods [13, 4]. Generative strategies involve constructing a distribution over data and latent embeddings, utilizing these embeddings as representations for images. Techniques such as auto-encoding of images [14, 15] and adversarial learning [16] are commonly employed in generative methods. While these approaches provide comprehensive pixel-level representations, the computational demands can be significant, and the generation of highly detailed images may not be essential for effective representation learning. Discriminative methods, particularly contrastive methods [4, 6, 5], currently stand at the forefront, showcasing state-of-the-art performance in self-supervised learning. Some alternative methodologies opt for auxiliary handcrafted prediction tasks to guide representation learning. However, their efficacy often falls short in comparison to contrastive methods. Noteworthy techniques, such as relative patch prediction [13, 17], colorizing grayscale images [18, 19], image inpainting [20], image jigsaw puzzles [21], image super-resolution [22], and geometric transformations [23, 24], have been explored for their utility. Despite the integration of well-structured architectures [25], these approaches consistently underperform when compared with contrastive methods [26, 27].

2.2. Contrastive Learning

Contrastive learning stands as a compelling alternative to the computationally intensive task of pixel-level image generation. Shifting its focus from image creation, contrastive learning aims to minimize the distance between representations of different views of the same image (positive pairs) and maximize the distance between representations of views from different images (negative pairs) [17, 28, 6]. Contrastive methods often capitalize on comparisons with multiple examples, and in some cases, they exhibit effectiveness even without explicit negative examples [4, 5, 7]. Several noteworthy algorithms have been proposed for contrastive learning of visual representations. SimCLR [4], for instance, utilizes augmented views of other items in a minibatch as negative samples. MoCo [1, 26], on the other hand, incorporates a momentum-updated memory bank of old negative representations, enabling the use of large batches of negative samples. Huynh et al. [8] tackle a fundamental issue in contrastive learning: the mitigation of false negatives. The introduction of false negatives poses challenges such as discarding semantic information and slow convergence. The authors propose novel approaches to identify false negatives, introducing two strategies, false negative elimination and attraction, to mitigate their effects. Their work involves systematic evaluations to comprehensively understand and address this issue. Robinson et al. [9] present an unsupervised method based on a simple distribution over hard negative pairs for contrastive representation learning. They construct this distribution over hard negatives with the assumption that the most useful negative samples are those that the embedding currently believes to be similar to the anchor. A noteworthy approach to learning image representations is introduced in [29]. This involves computing the cross-correlation matrix between the outputs of two identical networks, which receive distorted versions of a sample.
The objective is to make this cross-correlation matrix as similar to the identity matrix as possible. This ensures that the embedding vectors of the distorted versions of a sample become more similar to each other while reducing redundancy among the components of these vectors.

2.3. Triplet Loss Approach

The triplet loss approach, initially introduced by Ding et al. for person re-identification and independently adopted by Schroff et al. for face recognition [30, 10], has undergone substantial evolution, becoming a transformative paradigm in contrastive learning. Building upon the foundational concept of triplet loss, researchers have dedicated efforts to enhancing the generation and selection of valuable triplets. Hermans et al. [31] contributed significant strategies to identify and leverage informative triplets, thereby bolstering the robustness and effectiveness of the triplet loss methodology. Seeking further refinement, Wang et al. [32] delved into the application of cross-batch triplet loss, with the objective of augmenting generalization capabilities and stabilizing the triplet loss approach. This extension demonstrates a nuanced understanding of inter-batch relationships and their pivotal role in shaping the learning process. Furthermore, researchers have ventured into adapting the triplet loss approach to weakly supervised scenarios. Wang et al. [33] made notable contributions in this domain, exploring methods to harness weak supervision signals and extend the applicability of the triplet loss paradigm to scenarios where labeled data may be scarce. Turpault et al. [34] took a unique approach by integrating unsupervised triplet loss-based learning into a self-supervised representation learning framework. Their variant obtains positive samples for triplets with unlabeled anchors by applying a transformation to the anchor; the negative sample for these triplets is then chosen as the sample in the training set that is closest to the anchor and distant from the positive sample. Another noteworthy contribution to the triplet loss approach comes from Wang et al. [5], who introduced a truncated triplet loss methodology. In their approach, the negative pair is constructed by selecting a negative sample deputy from all negative samples. This strategic choice aims to mitigate false negatives and prevent the model from over-clustering samples of the same actual categories into different clusters. Finally, Li et al. [35] introduce Trip-ROMA, an algorithm based on a simple Triplet loss with a RandOm MApping (ROMA) strategy: samples are mapped into randomly projected spaces, and the projected samples are required to satisfy the same relationships indicated by the triplets; integrating the triplet-based loss with this random mapping yields their method.

3. Algorithm

We first present the motivation behind our approach and then describe the algorithm.

3.1. Motivation

In recent years, the prominence of Self-Supervised Representation Learning has grown significantly, primarily driven by the challenges posed by the absence of labeled data. A prevalent strategy involves applying augmentations to generate different views of the same data, effectively emphasizing similar or closely related data points [36] (see Fig. 1).
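Concretely, this two-view construction can be sketched as follows. The snippet uses standard torchvision transformations and a hypothetical two_views helper as an illustrative assumption; it is not the exact augmentation pipeline used in our experiments (described in Section 4.1).

```python
# Hypothetical two-view augmentation pipeline (torchvision); the concrete
# transformations and parameters are assumptions, not the exact ones used here.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(32),                # spatial/geometric transformation
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),      # appearance transformation
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])

def two_views(pil_images):
    """Apply two independently sampled augmentations tau1, tau2 to each image,
    yielding the anchor view and the positive view of the same underlying data."""
    view1 = torch.stack([augment(img) for img in pil_images])
    view2 = torch.stack([augment(img) for img in pil_images])
    return view1, view2
```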
Figure 1: Schema of the self-supervised learning approach with the augmentation strategy.

However, a critical challenge emerges in creating dissimilar data, as failure to do so may lead to a collapsing solution where all data points cluster at a single location [13]. Addressing the challenge of dissimilar data, some authors calculate distances to all other data points [4] or their closest neighbors [5], computing the average of these similarities and using it as a dissimilarity measure in a loss function. Nevertheless, the inadequacy of the average to effectively represent dissimilarity poses a drawback to this approach, which therefore requires a large batch size. An alternative method tackles the issue by solely considering positive instances and implementing diverse parameter updates [6, 7]. However, this method falls short in enabling the algorithm to construct a robust decision boundary for effectively discriminating differences within the data, leading to overlaps with different categories.

The crux of our motivation lies in selecting a robust representation of the negative within the data (a hard negative). This representation should effectively challenge the model in differentiating it from the positive. Leveraging the triplet loss approach, commonly employed in contrastive learning for SSL, becomes a natural choice; the schema is shown in Fig. 2. Triplet loss, introduced independently for various applications such as person re-identification and face recognition [30, 10], deals with sets comprising an anchor sample, a positive sample, and a negative sample. The loss function encourages the model to maximize the similarity between the anchor and positive samples while minimizing the similarity between the anchor and negative samples, subject to a margin constraint. For simplicity, we illustrate the triplet set (x_i, x_i^+, x_i^-)_{i=1,...,m} using one query and one sample. The triplet loss is defined as

\[
\mathcal{L} = \sum_{i=1}^{m} \max\left( \mathrm{sim}(x_i, x_i^{-}) - \mathrm{sim}(x_i, x_i^{+}),\; m \right)
\]

where sim is a similarity metric (e.g., cosine similarity or Euclidean distance), and m is a margin determining whether to discard a triplet.

Figure 2: Schema of the triplet loss approach.

Constructing triplets for each data point poses a significant challenge, particularly in determining how to establish negative pairs accurately (the dog in Fig. 2). While positive pairs can be reliably generated, identifying negative pairs involves the use of hard negative samples (points that are challenging to distinguish from an anchor point). The key challenge lies in utilizing hard negatives while remaining unsupervised, precluding the adoption of existing negative sampling strategies that rely on true similarity information.

3.2. Proposed Methodology

To overcome the challenge of creating dissimilar data and to enhance the effectiveness of the triplet loss approach, we draw inspiration from the work of Cai et al. [12]. Their findings suggest that only a small quantity of negatives is necessary for achieving full accuracy in downstream tasks. In our proposed method, we introduce a novel approach for generating negative values within a triplet set. In this approach, the anchor represents one view of the data, and the positive is derived from the other view of the same data within a batch. Crucially, the negative is constructed by searching for the k nearest neighbors of the anchor among the positive ones. The negative value is obtained by calculating the centroid of these k vectors.
This vector serves as an excellent representation of the negative since it combines elements of the negative data with characteristics of the positive data, effectively building a hard negative. This is attributed to its ability to encapsulate information from all vectors in close proximity to the anchor. Consequently, the centroid poses a challenge when differentiating it from the anchor, thereby enhancing the discriminative capability of the model. Mathematically, the triplet loss is expressed as:

\[
\mathcal{L}_1(z_a, z_p, z_n) = \max\left( \mathrm{sim}(z_a, z_n) - \mathrm{sim}(z_a, z_p) + m,\; 0 \right) \quad (1)
\]

Here, z_a = f(τ1(x)) represents the anchor and z_p = f(τ2(x)) represents the positive, where f denotes an encoder neural network, τ1 and τ2 are drawn from the set T of augmentation transformations, and sim(·) indicates a similarity measure between two vectors (cosine similarity by default). The i-th element of z_n is computed as z_n[i] = Centroid(k-nearest(z_p^{-i})), where Centroid denotes the centroid function and k-nearest(z_p^{-i}) denotes the k elements of z_p that are closest to the corresponding anchor, excluding the i-th element z_p[i].

Typically, the triplet loss is constrained by its sensitivity to the training triplets due to its reliance on a set margin [37]. Consequently, the cross-entropy loss serves as a more flexible alternative, resembling a softer version of the triplet loss with an adjustable margin [35]. This adaptation addresses the constraint of the triplet loss with a fixed margin:

\[
\mathcal{L}_2(z_a, z_p, z_n) = -\log \frac{\exp(z_a^{\top} z_p)}{\exp(z_a^{\top} z_p) + \exp(z_a^{\top} z_n)} \quad (2)
\]

Finally, the total loss function is defined as:

\[
\mathcal{L}_{\mathrm{total}} = \mathbb{E}_x\left( \mathcal{L}_1(z_a, z_p, z_n) + \alpha\, \mathcal{L}_2(z_a, z_p, z_n) \right) \quad (3)
\]

This proposed solution addresses the limitations of existing methods by introducing a more effective way of constructing negative representations, thereby aiming to enhance overall performance in representation learning, particularly in Content-Based Image Retrieval and Classification tasks.
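To make Equations (1)-(3) and the centroid-based negative construction concrete, the following sketch gives one possible PyTorch realization of this step. It reflects our reading of the formulas above; the function and variable names are illustrative assumptions rather than our released code, and the default margin and α follow the hyperparameters listed in the appendix.

```python
# Illustrative sketch of the ReTSam losses (Eqs. 1-3) and the centroid-based
# hard negative; names and minor details are assumptions, not the official code.
import torch
import torch.nn.functional as F

def retsam_negatives(z_a: torch.Tensor, z_p: torch.Tensor, k: int) -> torch.Tensor:
    """For each anchor z_a[i], find the k positives in the batch closest to it
    (excluding its own positive z_p[i]) and return their centroid as z_n[i]."""
    sim = F.normalize(z_a, dim=1) @ F.normalize(z_p, dim=1).T   # (B, B) cosine similarities
    sim.fill_diagonal_(float("-inf"))           # exclude each anchor's own positive
    idx = sim.topk(k, dim=1).indices            # (B, k) nearest positives per anchor
    return z_p[idx].mean(dim=1)                 # centroid of the k neighbors -> hard negative

def retsam_loss(z_a, z_p, z_n, margin: float = 0.6, alpha: float = 0.5) -> torch.Tensor:
    cos = F.cosine_similarity
    l1 = torch.clamp(cos(z_a, z_n) - cos(z_a, z_p) + margin, min=0)          # Eq. (1)
    logits = torch.stack([(z_a * z_p).sum(1), (z_a * z_n).sum(1)], dim=1)
    targets = torch.zeros(z_a.size(0), dtype=torch.long, device=z_a.device)
    l2 = F.cross_entropy(logits, targets, reduction="none")                  # Eq. (2)
    return (l1 + alpha * l2).mean()                                          # Eq. (3)
```

Masking the diagonal of the similarity matrix assumes that the i-th positive in the batch corresponds to the i-th anchor, which matches the batch construction described above.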
4. Main Results

We now present the protocols used to train our algorithm, followed by our results.

4.1. Training

Data Augmentation: One type of augmentation involves spatial/geometric transformation of the data, such as cropping and resizing (with horizontal flipping), rotation [24], and cutout [38]. The other type involves appearance transformation, such as color distortion (including color dropping, brightness, contrast, saturation, hue) [39, 40], Gaussian blur, and Sobel filtering.

Algorithm: Our algorithm is based on [4]. The general schema is shown in Figure 3, and the final algorithm can be found in Algorithm 1.

Datasets: We use two different datasets to validate the results. The CIFAR-10 dataset comprises 60,000 32x32 color images categorized into 10 classes, each containing 6,000 images. It is divided into 50,000 training images and 10,000 test images [41]. The SVHN (Street View House Numbers) dataset is a real-world image dataset specifically designed for developing machine learning and object recognition algorithms with minimal data preprocessing and formatting requirements. It consists of images containing digits, with 10 classes representing each digit from 0 to 9. The dataset is split into 73,257 digits for training and 26,032 digits for testing [42].

Figure 3: Schema of our algorithm.

Metrics:
• Mean Average Precision (MAP) is a crucial metric in image retrieval tasks, providing a comprehensive measure of a system's effectiveness across multiple queries. It assesses the average precision at each relevant image's position in the ranked list and computes the mean of these values. Relevant images are defined based on query relevance, and precision is calculated by dividing the number of relevant images retrieved up to a certain position by the total number of retrieved images up to that position. MAP@K, a variant of MAP where only the top K retrieved items are considered, is computed as

\[
\mathrm{MAP@K} = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{\sum_{k=1}^{K} \mathrm{Precision@}k_q \times \mathrm{Relevance}(k)}{\min(K, |R_q|)}
\]

where |Q| is the total number of queries, Precision@k_q is the precision at position k for query q, Relevance(k) is a binary indicator function that is 1 if the item at position k is relevant and 0 otherwise, |R_q| is the number of relevant items for query q, and K is the cutoff rank.
• Accuracy, Recall, Precision, and F1-score are fundamental metrics for evaluating classification tasks. Accuracy measures the proportion of correctly classified instances among all instances, providing an overall assessment of the model's performance. Recall quantifies the proportion of true positive instances correctly identified by the model among all actual positive instances. Precision measures the proportion of true positive instances among all instances predicted as positive, offering insights into the model's precision in positive predictions. F1-score, the harmonic mean of precision and recall, balances the trade-off between precision and recall, providing a single metric that reflects both measures' performance. These metrics collectively offer a comprehensive understanding of the classification model's effectiveness in correctly identifying instances belonging to different classes.

Evaluation: The evaluation was carried out using two different methods. Firstly, the CBIR protocol was employed, where the last output layer of the encoder was used to generate a feature vector for each image. Subsequently, the closest images in the training set were retrieved for each image in the test set, measuring the quality of the k nearest neighbors using the Mean Average Precision at K (MAP@K) metric. Secondly, a linear evaluation was conducted. In this approach, a single linear layer was added to the encoder, and the model was then retrained to perform classification using the available labels while keeping the encoder weights frozen.

Other protocols: Our encoder is based on the Very Deep Convolutional Networks for Large-Scale Image Recognition paper [43]. The batch size is 32, the maximum number of epochs is 200, and we use stochastic gradient descent with a learning rate of 0.6 and a cosine learning rate decay schedule. All details can be found in the appendix.

4.2. Baselines

We compare our approach with four relevant state-of-the-art self-supervised learning methods.
• SimCLR [4] is a straightforward framework for contrastive learning of visual representations. Two distinct data augmentation operators, τ ∼ 𝒯 and τ′ ∼ 𝒯, are randomly selected from the same family of augmentations and applied to each data example, creating two correlated views. A base encoder network and a projection head are trained to maximize agreement using a contrastive loss. After completing the training, the projection head is discarded, and the encoder is employed to obtain a representation, denoted as h, for downstream tasks. Notably, SimCLR introduces a learnable nonlinear transformation between the representation and the contrastive loss, significantly enhancing the quality of the learned representations.
• SimSiam [6] is a model designed to maximize the similarity between two augmentations of a single image while avoiding collapsing solutions. It utilizes two augmented views of the same image, processed by an identical encoder network (comprising a backbone and a projection MLP). A prediction MLP is applied to one side, while a stop-gradient operation is applied to the other side. The model's objective is to maximize the similarity between both sides. Notably, SimSiam does not rely on negative pairs or a momentum encoder. The authors empirically demonstrate the existence of collapsing solutions and emphasize the critical role of the stop-gradient operation in preventing such occurrences. This suggests the presence of an underlying optimization problem different from conventional contrastive learning.
• BYOL [7] is an approach to self-supervised image representation learning. It relies on two neural networks, referred to as the online and target networks, that interact and learn from each other. Using an augmented view of an image, the online network is trained to predict the target network's representation of the same image under a different augmented view. Concurrently, the target network is updated with a slow-moving average of the online network. The use of a slow-moving average of the online parameters as the target network encourages the encoding of increasing information within the online projection and mitigates the risk of collapsed solutions.
• BarlowTwins [29] proposes an objective function that inherently avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample. The objective is to make this matrix as close to the identity matrix as possible. This approach ensures that the embedding vectors of distorted versions of a sample are similar while minimizing redundancy between the components of these vectors.

4.3. Preliminary Results

The Tables below offer a comprehensive insight into the performance metrics concerning Content-Based Image Retrieval (CBIR) and Linear Evaluation across various Self-Supervised Learning (SSL) algorithms applied to the CIFAR-10 [41] and Street View House Numbers (SVHN) [42] datasets. In the context of CBIR, the precision metric, Mean Average Precision (MAP), is computed at different values of k, indicating the number of nearest neighbors sought in the retrieval process. Each row in Table 1 corresponds to a distinct SSL algorithm, with the MAP values at different k values displayed, showcasing the algorithm's performance in retrieving relevant images. Notably, higher MAP values indicate a superior ability to retrieve relevant images in the CBIR task. In the Linear Evaluation, presented in Table 2, performance metrics such as Accuracy, Recall, Precision, and F1-score are provided for each SSL algorithm. These Tables provide a detailed breakdown of the performance of each SSL algorithm under consideration, facilitating a nuanced understanding of their effectiveness in image retrieval and classification tasks.

Table 1: MAP results at different values of k (MAP@k).

CIFAR-10
Model              MAP@1000   MAP@100   MAP@10   MAP@1
SimCLR [4]         0.687      0.7732    0.8221   0.881
SimSiam [6]        0.691      0.8054    0.8475   0.904
BYOL [7]           0.6917     0.7832    0.8377   0.905
BarlowTwins [29]   0.4323     0.5753    0.6689   0.791
ReTSam             0.7316     0.8253    0.868    0.924

SVHN
Model              MAP@1000   MAP@100   MAP@10   MAP@1
SimCLR [4]         0.2593     0.3899    0.508    0.657
SimSiam [6]        0.5188     0.7177    0.812    0.874
BYOL [7]           0.3217     0.4636    0.584    0.715
BarlowTwins [29]   0.3758     0.5671    0.69     0.78
ReTSam             0.4315     0.6004    0.6965   0.805
4.4. Analysis

Preliminary results reveal the outstanding effectiveness of our approach on two fundamental tasks: Content-Based Image Retrieval (CBIR) and Linear Evaluation.

Table 2: Linear Evaluation results.

CIFAR-10
Model              Accuracy   Recall    Precision   F1
SimCLR [4]         0.9014     0.9014    0.9014      0.9016
SimSiam [6]        0.8587     0.8587    0.8692      0.8607
BYOL [7]           0.9028     0.9028    0.9027      0.9028
BarlowTwins [29]   0.8328     0.8328    0.8331      0.8328
ReTSam             0.9322     0.9322    0.9323      0.9322

SVHN
Model              Accuracy   Recall    Precision   F1
SimCLR [4]         0.8130     0.8130    0.8138      0.8127
SimSiam [6]        0.2233     0.2233    0.4027      0.1127
BYOL [7]           0.8090     0.8090    0.8102      0.8089
BarlowTwins [29]   0.8456     0.8456    0.8468      0.8457
ReTSam             0.8742     0.8742    0.8752      0.8743

• Content-Based Image Retrieval (CBIR): To evaluate the performance of our method on CBIR, the CIFAR-10 and SVHN datasets were used. Looking at Table 1:
  – CIFAR-10: Our method significantly outperforms the baselines for different values of k. Compared to other state-of-the-art methods such as SimCLR, SimSiam, BYOL, and BarlowTwins, our approach demonstrates considerable improvement in mean average precision (MAP). We achieved a MAP of 0.7316 for k=1000, 0.8253 for k=100, 0.868 for k=10, and 0.924 for k=1, indicating a high capacity for image representation and retrieval in the latent space.
  – SVHN: Although our algorithm shows notable improvement over baselines such as SimCLR, BYOL, and BarlowTwins in terms of MAP, it is outperformed by the SimSiam approach. Our method achieves a MAP of 0.4315 for k=1000, 0.6004 for k=100, 0.6965 for k=10, and 0.805 for k=1. Despite not being the best on this dataset, our approach is still competitive and offers promising results.
• Linear Evaluation: To evaluate the generalization ability of the learned representations in a linear classification task, an evaluation was performed on CIFAR-10 and SVHN. Performance metrics include accuracy, recall, precision, and F1-score. Analyzing Table 2:
  – CIFAR-10: Our method excels at this task, significantly outperforming other state-of-the-art approaches such as SimCLR, SimSiam, BYOL, and BarlowTwins. We achieved a classification accuracy of 93.22%, demonstrating the effectiveness of the learned representations in linear classification tasks on this dataset.
  – SVHN: Our method also shows impressive performance on the linear classification task for SVHN. Although SimSiam outperforms our approach on the CBIR task, our method outperforms both SimSiam and the other baselines in terms of classification accuracy, achieving an accuracy of 87.42%.

In summary, our results indicate that our approach has outstanding performance on the CBIR task on CIFAR-10, while being highly competitive on SVHN. Furthermore, it demonstrates exceptional generalization ability in linear classification tasks on both datasets. These findings support the effectiveness and promise of our method in feature extraction and representation of image data.

Algorithm 1
1: Input: Unlabeled dataset X
2: Output: Trained model
3: Initialize encoder network f
4: Define hyperparameters: α, margin m
5: while training not converged do
6:   for every batch x in X do
7:     Apply data transformations τ1 and τ2 to create τ1(x) and τ2(x)
8:     Compute embeddings z_a, z_p by applying f to τ1(x) and τ2(x), respectively
9:     Compute the distances between z_a and z_p and take the k nearest elements
10:    Exclude the first (nearest) element and compute the centroid z_n from the rest
11:    Calculate ℒ1 using Equation 1 with z_a, z_p, and z_n
12:    Calculate ℒ2 using Equation 2 with z_a, z_p, and z_n
13:    Calculate the total loss using Equation 3
14:    Update the model parameters using backpropagation
15:   end for
16: end while
17: Return: Trained model
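As a complement to Algorithm 1, the sketch below shows how a single training step could be wired together in PyTorch. It reuses the hypothetical two_views, retsam_negatives, and retsam_loss helpers sketched earlier, together with a generic encoder and optimizer; all of these names are illustrative assumptions rather than our released implementation, and the default values of k, the margin, and α follow Table 3 in the appendix.

```python
# Illustrative single training step following Algorithm 1 (steps 6-14);
# `encoder`, `optimizer`, and the helpers from the earlier sketches are assumed.
import torch.nn.functional as F

def train_step(encoder, optimizer, images, k=15, margin=0.6, alpha=0.5):
    v1, v2 = two_views(images)                    # step 7: two augmented views tau1(x), tau2(x)
    z_a = F.normalize(encoder(v1), dim=1)         # step 8: anchor embeddings
    z_p = F.normalize(encoder(v2), dim=1)         #         positive embeddings
    z_n = retsam_negatives(z_a, z_p, k)           # steps 9-10: k-NN among positives -> centroid
    loss = retsam_loss(z_a, z_p, z_n, margin, alpha)   # steps 11-13: Eqs. (1)-(3)
    optimizer.zero_grad()
    loss.backward()                               # step 14: backpropagation
    optimizer.step()
    return loss.item()
```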
5. Conclusion

The landscape of Self-Supervised Representation Learning (SRL) has witnessed significant advancements, and this paper contributes to the field by addressing a crucial limitation in existing methods. Traditional approaches often focus on learning similarity without adequately capturing dissimilarity nuances, leading to suboptimal representations. Our proposed method, termed "Refined Triplet Sampling" (ReTSam), introduces a novel strategy for generating negative vectors within a batch, enhancing the triplet loss methodology for representation learning. The motivation behind our approach stems from the challenge of creating dissimilar data, a critical aspect of effective SRL. Existing methods, including those relying on the average as a measure of dissimilarity, fall short of providing robust negative representations. Our method tackles this limitation by constructing negative samples based on the k-nearest neighbors, significantly improving the model's ability to differentiate dissimilar instances.

Experimental results, particularly in Content-Based Image Retrieval (CBIR) and Linear Evaluation, consistently demonstrate the superiority of our approach over other Self-Supervised Learning (SSL) methods (baselines). The refined representations showcase higher Mean Average Precision (MAP) values in CBIR, emphasizing the effectiveness of our method in retrieving relevant images. Linear Evaluation further underscores the versatility of our learned representations, which outperform the other algorithms in terms of Accuracy, Recall, Precision, and F1.

Acknowledgments

This research received partial support through an agreement with Scotiabank and Federico Santa María Technical University, as well as via a scholarship for international visits provided by Federico Santa María Technical University and the National Agency for Research and Development (doctoral scholarship 2022/21221059).

References

[1] K. He, H. Fan, Y. Wu, S. Xie, R. Girshick, Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 9729–9738. [2] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, G. E. Hinton, Big self-supervised models are strong semi-supervised learners, Advances in neural information processing systems 33 (2020) 22243–22255. [3] M. Gutmann, A. Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2010, pp. 297–304. [4] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A simple framework for contrastive learning of visual representations, in: International conference on machine learning, PMLR, 2020, pp. 1597–1607. [5] G. Wang, K. Wang, G. Wang, P. H. Torr, L. Lin, Solving inefficiency of self-supervised representation learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9505–9515. [6] X. Chen, K. He, Exploring simple siamese representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15750–15758. [7] J.-B.
Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar, et al., Bootstrap your own latent-a new approach to self-supervised learning, Advances in neural information processing systems 33 (2020) 21271–21284. [8] T. Huynh, S. Kornblith, M. R. Walter, M. Maire, M. Khademi, Boosting contrastive self- supervised learning with false negative cancellation, in: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp. 2785–2795. [9] J. Robinson, C.-Y. Chuang, S. Sra, S. Jegelka, Contrastive learning with hard negative samples, arXiv preprint arXiv:2010.04592 (2020). [10] F. Schroff, D. Kalenichenko, J. Philbin, Facenet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823. [11] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012. [12] T. T. Cai, J. Frankle, D. J. Schwab, A. S. Morcos, Are all negatives created equal in contrastive instance discrimination?, arXiv preprint arXiv:2010.06682 (2020). [13] C. Doersch, A. Gupta, A. A. Efros, Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430. [14] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103. [15] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013). [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014). [17] C. Doersch, A. Zisserman, Multi-task self-supervised visual learning, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2051–2060. [18] R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, Springer, 2016, pp. 649–666. [19] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic coloriza- tion, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, Springer, 2016, pp. 577–593. [20] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. A. Efros, Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544. [21] M. Noroozi, P. Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: European conference on computer vision, Springer, 2016, pp. 69–84. [22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690. [23] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. 
Brox, Discriminative unsupervised feature learning with convolutional neural networks, Advances in neural information processing systems 27 (2014). [24] S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, arXiv preprint arXiv:1803.07728 (2018). [25] A. Kolesnikov, X. Zhai, L. Beyer, Revisiting self-supervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1920–1929. [26] X. Chen, H. Fan, R. Girshick, K. He, Improved baselines with momentum contrastive learning, arXiv preprint arXiv:2003.04297 (2020). [27] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, P. Isola, What makes for good views for contrastive learning?, Advances in neural information processing systems 33 (2020) 6827–6839. [28] Z. Wu, Y. Xiong, S. X. Yu, D. Lin, Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 3733–3742. [29] J. Zbontar, L. Jing, I. Misra, Y. LeCun, S. Deny, Barlow twins: Self-supervised learning via redundancy reduction, in: International Conference on Machine Learning, PMLR, 2021, pp. 12310–12320. [30] S. Ding, L. Lin, G. Wang, H. Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 48 (2015) 2993–3003. [31] A. Hermans, L. Beyer, B. Leibe, In defense of the triplet loss for person re-identification, arXiv preprint arXiv:1703.07737 (2017). [32] X. Wang, H. Zhang, W. Huang, M. R. Scott, Cross-batch memory for embedding learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6388–6397. [33] G. Wang, G. Wang, X. Zhang, J. Lai, Z. Yu, L. Lin, Weakly supervised person re-id: Differentiable graphical learning and a new benchmark, IEEE Transactions on Neural Networks and Learning Systems 32 (2020) 2142–2156. [34] N. Turpault, R. Serizel, E. Vincent, Semi-supervised triplet loss based learning of ambient audio embeddings, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 760–764. [35] W. Li, X. Yang, M. Kong, L. Wang, J. Huo, Y. Gao, J. Luo, Trip-roma: Self-supervised learning with triplets and random mappings, Transactions on Machine Learning Research (2022). [36] R. Hadsell, S. Chopra, Y. LeCun, Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), volume 2, IEEE, 2006, pp. 1735–1742. [37] C.-Y. Wu, R. Manmatha, A. J. Smola, P. Krahenbuhl, Sampling matters in deep embedding learning, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2840–2848. [38] T. DeVries, G. W. Taylor, Improved regularization of convolutional neural networks with cutout, arXiv preprint arXiv:1708.04552 (2017). [39] A. G. Howard, Some improvements on deep convolutional neural network based image classification, arXiv preprint arXiv:1312.5402 (2013). [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9. [41] A. Krizhevsky, Learning multiple layers of features from tiny images, Technical Report, 2009. [42] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A. Y. 
Ng, Reading digits in natural images with unsupervised feature learning, in: NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL: http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf. [43] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).

A. Implementation Details

A.1. Hardware Configuration

The experiments were carried out on a computer with the following specifications: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, 32GB of RAM, and a GeForce GTX 1080 Ti GPU.

A.2. Selected Hyperparameters

Table 3 provides a comprehensive list of all the hyperparameters utilized for our methods. These hyperparameters are pivotal components in configuring and fine-tuning the performance of our methodologies. Each hyperparameter plays a distinct role in shaping the behavior and efficacy of the employed techniques. Through meticulous selection and optimization of these hyperparameters, we aim to enhance the overall performance and robustness of our methods across various experimental settings and datasets.

Table 3: Selected Hyperparameters
Hyperparameter    Selected Value
Learning Rate     0.06
Epochs            200
Batch Size        32
Decay Schedule    cosine learning rate
Optimizer         SGD
Encoder           VGG50(weights="imagenet")
k-neighbors       15
m (margin)        0.6
α                 0.5

A.3. Dataset Details

Additional information about the datasets is presented in Table 4. It is important to note that these two datasets are of very different natures: one consists of natural images, while the other is composed solely of numbers. The combination of both is essential for a comprehensive evaluation of performance across different kinds of data.

Table 4: Dataset Information
Type            Name                         Train   Test    N° Classes
Natural Image   CIFAR-10                     50000   10000   10
Numbers         Street View House Numbers    73257   26032   10

Figure 4: Class: Bird. (a) Query Image, (b) Recovery Images.

A.4. Recovery Visualization

In this subsection, we present visual examples showcasing the recovery achieved by our method. These illustrations are depicted in Figures 4, 5, 6, 7, and 8. Through these images, we aim to demonstrate the effectiveness of our approach in accurately reconstructing the original content. Notably, our method excels in preserving the semantic integrity of the images during the recovery process, thereby emphasizing its robust performance in retaining crucial visual details and structures.

A.5. Online Resources

For those interested in replicating our results, the code is available on GitHub at the following link: GitHub Repository. This repository contains the necessary resources and instructions to facilitate the replication of our findings. Feel free to explore and utilize the code to delve deeper into our methodology and validate the outcomes.

A.6. Future Work

Despite the advancements presented in this work in the domain of image retrieval and classification, there are several lines of research that can further enrich our approach and explore its applicability in different visual contexts. Some areas of interest for future investigations are highlighted below.

Figure 5: Class: Ship. (a) Query Image, (b) Recovery Images.

• Exploration of Diversity in Image Datasets: To assess the robustness and generalization of our algorithm across different visual domains, we propose the inclusion of additional datasets representing images of a diverse nature. This could involve datasets containing medical images, satellite data, texture images, among others.
Expanding the domains of images will allow for a more comprehensive evaluation of the algorithm's ability to adapt to a variety of visual contexts.
• Transfer Learning in Cross-Domain Scenarios: To extend our research on transfer learning, we suggest exploring cross-domain scenarios where the model is trained on one dataset and evaluated on another with different visual characteristics. This line of investigation will help assess the algorithm's adaptation capability to different visual styles and evaluate the transferability of learned representations across different image domains.
• Exploration of Semi-Supervised Learning Techniques: To further improve the performance of the algorithm in image retrieval and classification tasks, we propose investigating semi-supervised learning techniques. This approach leverages both labeled and unlabeled data to train the model, which can be particularly useful in scenarios where labeled datasets are scarce or expensive to obtain. Exploring semi-supervised strategies could open up new opportunities to enhance the efficiency and accuracy of the algorithm in computer vision tasks.

These research directions represent significant steps towards advancing our understanding of self-supervised algorithms in the field of computer vision and their application in a variety of visual domains and real-world scenarios.

Figure 6: Class: Horse. (a) Query Image, (b) Recovery Images.
Figure 7: Class: Frog. (a) Query Image, (b) Recovery Images.
Figure 8: Class: Airplane. (a) Query Image, (b) Recovery Images.