<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SEBD</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Representation Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Manuel Goyo</string-name>
          <email>manuel.goyo@sansano.usm.cl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giacomo Frisoni</string-name>
          <email>giacomo.frisoni@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianluca Moro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Claudio Sartori</string-name>
          <email>claudio.sartori@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Self Supervised Learning, Representation Learning, Triplet Loss, Negative Sampling</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, University of Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, Universidad Técnica Federico Santa María</institution>
          ,
          <addr-line>Valparaíso</addr-line>
          ,
          <country country="CL">Chile</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>32</volume>
      <fpage>23</fpage>
      <lpage>26</lpage>
      <abstract>
        <p>Self-supervised representation learning extracts meaningful features from data without explicit supervision, building a space with desired properties. Contrastive learning has emerged as the predominant approach to clustering similar data points and separating dissimilar ones within the embedding space. Although creating different views of the same data (e.g., cropping, rotation) emphasizes similarities without labels, current methods struggle to define negative examples. Several algorithms consider only positive examples or integrate dissimilarity measures into their loss functions by computing average distances within the same batch. However, they do not capture nuanced differences effectively and risk collapsing data points into a single location. In this paper, we propose a novel technique, termed “Refined Triplet Sampling” (ReTSam), to generate synthetic negative vectors for contrastive learning. Concretely, for each element in the batch, we identify its k-nearest neighbors and designate their centroid as a hard negative for a triplet loss methodology. We test ReTSam on two widely used image datasets, namely CIFAR-10 and SVHN, considering content-based image retrieval and classification tasks. Our findings demonstrate that, despite its simplicity, ReTSam not only promotes the learning of similarity but also significantly improves that of dissimilarity (with a +5% increase in Mean Average Precision on CIFAR-10), resulting in superior performance in practical scenarios.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Representation learning has recently become a crucial element in the development of modern AI
agents, largely propelled by significant advances in self-supervised learning (SSL). SSL is a
paradigm in which representations are obtained through pre-training tasks on unlabeled data.
These learned representations are then used
in downstream tasks such as classification or content-based image retrieval. Importantly, the
appeal of SSL stems from its ability to leverage abundant and cost-effective unlabeled
data, in some instances even surpassing its supervised counterpart [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Many
contrastive learning approaches hinge on two fundamental elements: similar
(positive) pairs <inline-formula><tex-math>(x, x^{+})</tex-math></inline-formula>
and dissimilar (negative) pairs <inline-formula><tex-math>(x, x^{-})</tex-math></inline-formula>
of data points. The training
objective, typically noise-contrastive estimation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], directs the learned representation to map
positive pairs to close locations and negative pairs to distant ones. Alternative objectives
have also been explored [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The effectiveness of these methods depends on how the
positive and negative pairs are constructed, since genuine similarity
information is unavailable without supervision. Certain authors opt not to explicitly generate
dissimilar data. Instead, they compute distances to all other data points [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or to their closest
neighbors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], average these similarities, and use the result as a dissimilarity measure
in a loss function. The drawback of this approach is that the
average does not represent dissimilarity effectively. An alternative approach addresses the issue
by focusing solely on positive instances and implementing diverse parameter updates [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
Nevertheless, this method does not give the algorithm the capability to construct a robust
decision boundary for discerning differences within the data, which leads to overlaps
between different categories. Some authors pursue explicit negatives by considering different views
(augmentations) of each image to identify real negatives and discard false negatives [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or by
estimating a sample from the distribution over negative pairs [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This approach stems from
metric learning settings, where “hard” (true negative) examples can expedite the correction
of mistakes in the learning process [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. In representation learning, informative negative
examples are intuitively those pairs that are mapped nearby but should be far apart. This
concept is successfully applied in metric learning, where true pairs of dissimilar points are
available, in contrast to unsupervised contrastive learning. Our methodology hinges on the
generation of a hard negative, inspired by the findings of Cai et al. (2020) [12], who assert
that “... a small minority of negatives were both necessary and sufficient for the downstream
task to reach full accuracy.” In light of this insight, we propose an approach centered on
triplet loss. In this setup, the positive pairs are generated in the conventional manner, employing
transformations that preserve semantic content. The negative element, however, is
crafted from the k nearest neighbors of the anchor among the remaining positives in the
batch, and is obtained by computing their centroid. The centroid
is a suitable representation of the negative because it
encapsulates information from all vectors in close proximity to the anchor.
      </p>
      <p>In particular, the main contributions of this work are as follows:
• We design a simple but effective sampling strategy based on similarity to create negative
elements.
• We propose a general self-supervised training method based on triplet loss for
representation learning.
• We are the first to evaluate state-of-the-art self-supervised algorithms in the context of
Content-Based Image Retrieval (CBIR) on different datasets.
• Our experiments across two datasets demonstrate that our approach surpasses existing
methods in both CBIR and classification tasks, as
indicated by superior performance metrics such as Mean Average Precision (MAP) for CBIR
and Accuracy, Recall, Precision, and F1 for classification.</p>
      <p>The rest of the paper is organized as follows. Section 2 reviews the work related
to this approach. Section 3 describes our proposed method. Section 4 presents the
results of applying our method to different datasets. Section 5 presents conclusions and
future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Works</title>
      <sec id="sec-3-1">
        <title>2.1. Representation Learning</title>
        <p>
          In unsupervised representation learning, approaches are predominantly
categorized into generative and discriminative methods [
          <xref ref-type="bibr" rid="ref4">13, 4</xref>
          ]. Generative strategies
construct a distribution over data and latent embeddings and use these embeddings as
image representations. Techniques such as image auto-encoding [14, 15] and adversarial
learning [16] are commonly employed in generative methods. While these approaches provide
comprehensive pixel-level representations, their computational demands can be significant, and
the generation of highly detailed images may not be essential for effective representation
learning. Discriminative methods, particularly contrastive methods [
          <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 6, 5</xref>
          ], currently stand at the
forefront, showing state-of-the-art performance in self-supervised learning. Some alternative
methodologies rely on auxiliary handcrafted prediction tasks to guide representation learning.
However, their efficacy often falls short of contrastive methods. Noteworthy
techniques include relative patch prediction [13, 17], colorizing grayscale images [18, 19],
image inpainting [20], image jigsaw puzzles [21], image super-resolution [22], and geometric
transformations [23, 24]. Despite the integration of
well-structured architectures [25], these approaches consistently underperform
contrastive methods [26, 27].
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Contrastive Learning</title>
        <p>
          Contrastive learning is a compelling alternative to the computationally intensive task of
pixel-level image generation. Rather than creating images, contrastive learning aims
to minimize the distance between representations of different views of the same image (positive
pairs) and to maximize the distance between representations of views from different images
(negative pairs) [
          <xref ref-type="bibr" rid="ref12 ref6">17, 28, 6</xref>
          ]. Contrastive methods often capitalize on comparisons with multiple
examples, and in some cases they are effective even without explicit negative examples
[
          <xref ref-type="bibr" rid="ref4 ref5 ref7">4, 5, 7</xref>
          ]. Several noteworthy algorithms have been proposed for contrastive learning of visual
representations. SimCLR [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], for instance, uses augmented views of other items in a minibatch
as negative samples. MoCo [
          <xref ref-type="bibr" rid="ref1">1, 26</xref>
          ], on the other hand, incorporates a momentum-updated
memory bank of old negative representations, enabling the use of large batches of negative
samples. Huynh et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] tackle a fundamental issue in contrastive learning: the mitigation
of false negatives. False negatives pose challenges such as discarding
semantic information and slow convergence. The authors propose approaches to identify
false negatives and introduce two strategies, false negative elimination and attraction, to mitigate
their effects, supported by systematic evaluations.
Robinson et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] present an unsupervised method based on a simple
distribution over hard negative pairs for contrastive representation learning. They construct
this distribution over hard negatives under the assumption that the most useful negative samples
are those that the embedding currently believes to be similar to the anchor. A noteworthy
approach to learning image representations is introduced by [
          <xref ref-type="bibr" rid="ref13">29</xref>
          ]. It computes the
cross-correlation matrix between the outputs of two identical networks, which receive distorted
versions of a sample. The objective is to make this cross-correlation matrix as similar to the
identity matrix as possible. This ensures that the embedding vectors of the distorted versions of
a sample become more similar to each other while reducing redundancy among the components
of these vectors.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>2.3. Triplet Loss Approach</title>
        <p>
          The triplet loss approach, initially introduced by Ding et al. for person re-identification and
independently adopted by Schroff et al. for face recognition [
          <xref ref-type="bibr" rid="ref10 ref14">30, 10</xref>
          ], has undergone substantial
evolution and has become an influential paradigm in contrastive learning. Building upon the
foundational concept of triplet loss, researchers have worked to improve the
generation and selection of valuable triplets. Hermans et al. [31] contributed strategies to
identify and leverage informative triplets, thereby improving the robustness and effectiveness
of the triplet loss methodology. Wang et al. [
          <xref ref-type="bibr" rid="ref16">32</xref>
          ] investigated the
application of cross-batch triplet loss, with the objective of improving generalization
and stabilizing the triplet loss approach. This extension exploits
inter-batch relationships and their role in shaping the learning process. Furthermore,
researchers have adapted the triplet loss approach to weakly supervised scenarios.
Wang et al. [
          <xref ref-type="bibr" rid="ref17">33</xref>
          ] explored methods to harness
weak supervision signals and extend the applicability of the triplet loss paradigm to scenarios
where labeled data may be scarce. Turpault et al. [34] integrated
unsupervised triplet loss-based learning into a self-supervised representation learning
framework. Their variant obtains positive samples for triplets with unlabeled anchors by
applying a transformation to the anchor. The negative sample for these triplets is then chosen
as the sample in the training set that is closest to the anchor and distant from the positive
sample. Another noteworthy contribution to the triplet loss approach comes from Wang et al.
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], who introduced a truncated triplet loss methodology. In their approach, the negative pair is
constructed by selecting a negative sample deputy from all negative samples. This
choice aims to mitigate false negatives and to prevent the model from over-clustering samples of
the same actual category into different clusters. Finally, Li et al. [
          <xref ref-type="bibr" rid="ref19">35</xref>
          ] introduce an algorithm
called Trip-ROMA, based on a simple Triplet loss with a RandOm MApping (ROMA) strategy,
which consists of mapping random samples into other spaces and requiring these randomly
projected samples to satisfy the same relationship indicated by the triplets. Their method
results from integrating the triplet-based loss with this random mapping.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Algorithm</title>
      <sec id="sec-4-1">
        <title>3.1. Motivation</title>
        <p>We first present the motivation and then describe the algorithm.</p>
        <p>
          In the past year, the prominence of Self-Supervised Representation Learning has
grown significantly, primarily driven by the challenges posed by the absence of labeled data. A
prevalent strategy involves applying augmentations to generate different views of the same
data, effectively emphasizing similar or closely related data points [
          <xref ref-type="bibr" rid="ref20">36</xref>
          ] (see Fig. 1).
        </p>
        <p>
          However, a critical challenge emerges in creating dissimilar data, as failure to do so may lead
to a collapsing solution where all data points cluster at a single location [13]. To address
this challenge, some authors calculate distances to all other data points [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] or to their
closest neighbors [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], averaging these similarities and using the result as a dissimilarity
measure in a loss function. Nevertheless, the average does not represent
dissimilarity effectively, so this approach requires a large batch size. An alternative
method tackles the issue by considering only positive instances and implementing diverse
parameter updates [
          <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
          ]. However, this method does not enable the algorithm to construct
a robust decision boundary for discriminating differences within the data, which leads
to overlaps between different categories. The crux of our motivation lies in selecting a robust
representation of the negative within the data (a hard negative). This representation should
challenge the model to differentiate it from the positive. The triplet loss
approach, commonly employed in contrastive learning for SSL, becomes a natural choice.
        </p>
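        <p>For simplicity, we illustrate the triplet loss on the triplet set
<inline-formula><tex-math>\{(x_i, x_i^{+}, x_i^{-})\}_{i=1,\cdots,N}</tex-math></inline-formula>,
where each triplet consists of one query sample, one positive, and one negative:
          <disp-formula><tex-math>\mathcal{L} = \sum_{i=1}^{N} \max\left( d(x_i, x_i^{-}) - d(x_i, x_i^{+}),\; m \right)</tex-math></disp-formula>
where <inline-formula><tex-math>d</tex-math></inline-formula> is a similarity metric (e.g., cosine similarity or one based on Euclidean distance), and
<inline-formula><tex-math>m</tex-math></inline-formula> is a margin
determining whether to discard a triplet.</p>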
        <p>Constructing triplets for each data point poses a significant challenge, particularly in
determining how to establish negative pairs accurately (the dog in Fig. 2). While positive pairs can be
reliably generated, identifying negative pairs involves the use of hard negative samples (points
that are difficult to distinguish from an anchor point). The key challenge lies in using
hard negatives while remaining unsupervised, which precludes existing negative
sampling strategies that rely on true similarity information.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Proposed Methodology</title>
        <p>
          To overcome the challenge of creating dissimilar data and to enhance the effectiveness of the
triplet loss approach, we draw inspiration from the work of Cai et al. [12]. Their findings suggest
that only a small number of negatives is necessary for achieving full accuracy in downstream
tasks. In our proposed method, we introduce a novel approach for generating the negative
element of a triplet.
        </p>
        <p>In this approach, the anchor represents one view of the data, and the positive is derived
from the other view of the same data within a batch. Crucially, the negative is constructed by
searching for the k nearest neighbors of the anchor among the positives of the other batch
elements. The negative is obtained by calculating the centroid of these k vectors. This vector is a
suitable representation of the negative because it combines elements of the negative data with
characteristics of the positive data, effectively building a hard negative: it
encapsulates information from all vectors in close proximity to the anchor.
Consequently, the centroid is difficult to differentiate from the anchor, which
enhances the discriminative capability of the model.</p>
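        <p>The construction of the synthetic negative described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (<monospace>centroid_negative</monospace>, <monospace>batch_negatives</monospace>) are hypothetical, embeddings are plain lists of floats, cosine similarity is used to rank neighbors, and no embedding is assumed to be the zero vector.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid_negative(anchor, positives, index, k):
    """Centroid of the k positives most similar to `anchor`.

    The anchor's own positive (position `index` in the batch) is excluded,
    so the centroid is built only from views of other samples.
    """
    candidates = [p for j, p in enumerate(positives) if j != index]
    # Rank the remaining positives by similarity to the anchor, best first.
    candidates.sort(key=lambda p: cosine_similarity(anchor, p), reverse=True)
    nearest = candidates[:k]
    dim = len(anchor)
    return [sum(p[d] for p in nearest) / len(nearest) for d in range(dim)]


def batch_negatives(anchors, positives, k):
    """One synthetic hard negative per anchor in the batch."""
    return [centroid_negative(a, positives, i, k) for i, a in enumerate(anchors)]


if __name__ == "__main__":
    anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    positives = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.9]]
    print(batch_negatives(anchors, positives, k=2))
```
        </preformat>
        <p>Sorting the whole batch per anchor is quadratic in the batch size, which is acceptable for small batches; a vectorized similarity matrix with a top-k selection would be the usual choice in a deep learning framework.</p>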
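        <p>Mathematically, the triplet loss is expressed as:
          <disp-formula><label>(1)</label><tex-math>\mathcal{L}_1(z_a, z_p, z_n) = \max\left( \operatorname{sim}(z_a, z_n) - \operatorname{sim}(z_a, z_p) + m,\; 0 \right)</tex-math></disp-formula></p>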
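        <p>Here, <inline-formula><tex-math>z_a = f(t_1(x))</tex-math></inline-formula> represents the anchor and
<inline-formula><tex-math>z_p = f(t_2(x))</tex-math></inline-formula> represents the positive, with
<inline-formula><tex-math>f</tex-math></inline-formula> denoting an encoder neural network and
<inline-formula><tex-math>t_1, t_2</tex-math></inline-formula> drawn from the set
<inline-formula><tex-math>\mathcal{T}</tex-math></inline-formula> of augmentation transformations;
<inline-formula><tex-math>\operatorname{sim}(\cdot,\cdot)</tex-math></inline-formula> is a similarity measure between two vectors (cosine similarity
by default), and <inline-formula><tex-math>m</tex-math></inline-formula> is the margin. The
<inline-formula><tex-math>i</tex-math></inline-formula>-th element of
<inline-formula><tex-math>z_n</tex-math></inline-formula> is computed as
<inline-formula><tex-math>z_n[i] = \mathrm{Centroid}(k\text{-nearest}(z_p^{-i}))</tex-math></inline-formula>, where
Centroid denotes the centroid function and
<inline-formula><tex-math>k\text{-nearest}(z_p^{-i})</tex-math></inline-formula> represents the
<inline-formula><tex-math>k</tex-math></inline-formula> elements closest to
<inline-formula><tex-math>z_a</tex-math></inline-formula>,
excluding the <inline-formula><tex-math>i</tex-math></inline-formula>-th element.</p>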
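        <p>As a small worked illustration of the similarity-based triplet loss with a margin, the sketch below evaluates it for a single triplet. It is a minimal example under assumptions, not the authors' code: the function name <monospace>triplet_loss</monospace> and the default margin of 0.5 are hypothetical, and cosine similarity is used as the similarity measure.</p>
        <preformat>
```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """max(sim(anchor, negative) - sim(anchor, positive) + margin, 0).

    The loss is zero once the positive is more similar to the anchor than
    the negative by at least `margin` (an illustrative default value).
    """
    gap = cosine_similarity(anchor, negative) - cosine_similarity(anchor, positive)
    return max(gap + margin, 0.0)


if __name__ == "__main__":
    # Well-separated triplet: contributes no loss.
    print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
    # Violating triplet: the negative coincides with the anchor.
    print(triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```
        </preformat>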
        <p>
          Typically, the triplet loss is constrained by its sensitivity to the training triplets due to its
reliance on a set margin [
          <xref ref-type="bibr" rid="ref21">37</xref>
          ]. Consequently, the cross-entropy loss serves as a more flexible
alternative, resembling a softer version of the triplet loss with an adjustable margin [
          <xref ref-type="bibr" rid="ref19">35</xref>
          ]. This
adaptation addresses the constraint of the triplet loss with a fixed margin.
        </p>
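        <p>Finally, the cross-entropy term and the total loss function are defined as:
          <disp-formula><label>(2)</label><tex-math>\mathcal{L}_2(z_a, z_p, z_n) = -\log \frac{\exp(z_a^{\top} z_p)}{\exp(z_a^{\top} z_p) + \exp(z_a^{\top} z_n)}</tex-math></disp-formula>
          <disp-formula><label>(3)</label><tex-math>\mathcal{L}_{\mathrm{oss}} = \mathbb{E}_{x}\left[ \mathcal{L}_1(z_a, z_p, z_n) + \mathcal{L}_2(z_a, z_p, z_n) \right]</tex-math></disp-formula></p>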
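        <p>To make the combination of the two loss terms concrete, the following sketch sums the margin-based triplet term and the softmax cross-entropy term for each triplet and aggregates over a batch. It is an illustration under assumptions rather than the authors' implementation: the function names and the margin of 0.5 are hypothetical, and the batch aggregate is assumed to be a plain mean.</p>
        <preformat>
```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def triplet_term(anchor, positive, negative, margin):
    """Hinge on the similarity gap between the negative and the positive."""
    gap = cosine_similarity(anchor, negative) - cosine_similarity(anchor, positive)
    return max(gap + margin, 0.0)


def cross_entropy_term(anchor, positive, negative):
    """-log(exp(a.p) / (exp(a.p) + exp(a.n))), i.e. log(1 + exp(a.n - a.p)).

    Written in the softplus form and evaluated in a numerically stable way.
    """
    x = dot(anchor, negative) - dot(anchor, positive)
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def total_loss(anchors, positives, negatives, margin=0.5):
    """Mean over the batch of (triplet term + cross-entropy term).

    Assumption: the aggregate over triplets is a plain batch mean.
    """
    terms = [
        triplet_term(a, p, n, margin) + cross_entropy_term(a, p, n)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    print(total_loss([[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 1.0]]))
```
        </preformat>
        <p>Unlike the hinge term, the cross-entropy term never reaches exactly zero, so it keeps providing a gradient even for triplets that already satisfy the margin.</p>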
        <p>The proposed solution addresses the limitations of existing methods by introducing a more
effective way of constructing negative representations, with the aim of improving overall
performance in representation learning, particularly in Content-Based Image Retrieval and
classification tasks.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Main Results</title>
      <sec id="sec-5-1">
        <title>4.1. Training</title>
        <p>
          We now present the protocols used to train our algorithm and to obtain our results.
Data Augmentation: One type of augmentation involves spatial/geometric transformation of
data, such as cropping and resizing (with horizontal flipping), rotation [24], and cutout [
          <xref ref-type="bibr" rid="ref22">38</xref>
          ].
The other type of augmentation involves appearance transformation, such as color distortion
(including color dropping, brightness, contrast, saturation, and hue) [39,
          <xref ref-type="bibr" rid="ref24">40</xref>
          ], Gaussian blur, and
Sobel filtering.
        </p>
        <p>
          Algorithm: Our algorithm is based on [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The general schema is shown in Figure 3, and
the final algorithm is given in Algorithm 1.
        </p>
        <p>
          Datasets: We use two different datasets to validate the results. The CIFAR-10 dataset
comprises 60,000 32x32 color images categorized into 10 classes, each containing 6,000 images.
It is divided into 50,000 training images and 10,000 test images [
          <xref ref-type="bibr" rid="ref25">41</xref>
          ]. The SVHN (Street
View House Numbers) dataset is a real-world image dataset designed for developing
machine learning and object recognition algorithms with minimal data preprocessing and
formatting requirements. It consists of images containing digits, with 10 classes representing
the digits from 0 to 9. The dataset is split into 73,257 digits for training and 26,032 digits for testing
[
          <xref ref-type="bibr" rid="ref26">42</xref>
          ].
        </p>
        <p>Metrics:
• Mean Average Precision (MAP) is a crucial metric in image retrieval tasks, providing a
comprehensive measure of a system’s effectiveness across multiple queries. It assesses
the average precision at each relevant image’s position in the ranked list and computes
the mean of these values. Relevant images are defined based on query relevance, and
precision is calculated by dividing the number of relevant images retrieved up to a certain
position by the total number of retrieved images up to that position. MAP@K,
a variant of MAP where only the top K retrieved items are considered, is computed as:
          <disp-formula><tex-math>\mathrm{MAP@}K = \frac{1}{|Q|} \sum_{q=1}^{|Q|} \frac{\sum_{k=1}^{K} \mathrm{Precision@}k(q) \cdot \mathrm{Relevance}(k)}{\min(K, |R_q|)}</tex-math></disp-formula>
where <inline-formula><tex-math>|Q|</tex-math></inline-formula> is the total number of queries, Precision@k is the precision at position
<inline-formula><tex-math>k</tex-math></inline-formula> for
query <inline-formula><tex-math>q</tex-math></inline-formula>, Relevance(k) is a binary indicator function that is 1 if the item at position
<inline-formula><tex-math>k</tex-math></inline-formula> is
relevant and 0 otherwise, <inline-formula><tex-math>|R_q|</tex-math></inline-formula> is the number of relevant items for query
<inline-formula><tex-math>q</tex-math></inline-formula>, and
<inline-formula><tex-math>K</tex-math></inline-formula> is the
cutoff rank.
• Accuracy, Recall, Precision, and F1-score are fundamental metrics for evaluating
classification tasks. Accuracy measures the proportion of correctly classified instances among all
instances, providing an overall assessment of the model’s performance. Recall quantifies
the proportion of true positive instances correctly identified by the model among all
actual positive instances. Precision measures the proportion of true positive instances
among all instances predicted as positive, indicating how reliable the model’s
positive predictions are. F1-score, the harmonic mean of precision and recall, balances
the trade-off between the two, providing a single metric that reflects both.
Together, these metrics offer a comprehensive understanding
of the classification model’s effectiveness in correctly identifying instances belonging to
different classes.</p>
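        <p>The MAP@K definition above can be checked on a toy example with the following sketch. It is a minimal illustration, not the evaluation code used for the reported results: the function names are hypothetical, each query is represented by a list of 0/1 relevance flags ordered from best to worst match, and a query with no relevant items is assumed to score 0.</p>
        <preformat>
```python
def average_precision_at_k(ranked_relevance, num_relevant, k):
    """AP@K for a single query.

    ranked_relevance: 0/1 flags of the retrieved items, best match first.
    num_relevant: total number of relevant items for this query.
    """
    if num_relevant == 0:
        return 0.0  # Assumption: queries without relevant items score 0.
    hits = 0
    score = 0.0
    for position, relevant in enumerate(ranked_relevance[:k], start=1):
        if relevant:
            hits += 1
            score += hits / position  # Precision at this position.
    return score / min(k, num_relevant)


def mean_average_precision_at_k(all_ranked, all_num_relevant, k):
    """Mean of AP@K over all queries."""
    scores = [
        average_precision_at_k(ranked, n, k)
        for ranked, n in zip(all_ranked, all_num_relevant)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    ranked = [[1, 0, 1], [0, 1, 0]]
    relevant_counts = [2, 1]
    print(mean_average_precision_at_k(ranked, relevant_counts, k=3))
```
        </preformat>
        <p>In the first toy query the relevant items sit at ranks 1 and 3, giving (1/1 + 2/3) / 2; in the second the single relevant item sits at rank 2, giving 1/2.</p>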
        <p>Evaluation: The evaluation was carried out using two different methods. First, for CBIR,
the last output layer of the encoder was used to generate a feature
vector for each image. The closest images in the training set were then retrieved for
each image in the test set, and the quality of the k nearest neighbors was measured with the
Mean Average Precision at K (MAP@K) metric. Second, a linear evaluation was conducted. In
this approach, a single linear layer was added to the encoder, and the model was then trained
to perform classification using the available labels while keeping the encoder weights frozen.</p>
        <p>
          Other protocols: Our encoder is based on the architecture from “Very Deep Convolutional Networks for
Large-Scale Image Recognition” [43]. The batch size is 32. Training runs for at most 200 epochs, using
stochastic gradient descent with a learning rate of 0.6 and a cosine learning rate decay schedule.
Full details are given in the appendix.
        </p>
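        <p>For reference, a cosine learning rate decay with the stated base learning rate of 0.6 and 200 epochs can be sketched as follows. This is an assumption-laden illustration: the function name is hypothetical, the rate is assumed to be updated once per epoch, to decay to a minimum of 0, and to use no warm-up, none of which is specified in the text.</p>
        <preformat>
```python
import math


def cosine_learning_rate(epoch, base_lr=0.6, max_epochs=200, min_lr=0.0):
    """Learning rate at `epoch` under cosine decay from base_lr to min_lr.

    min_lr=0.0 and per-epoch updates are illustrative assumptions.
    """
    # Clamp progress to [0, 1] so out-of-range epochs stay well defined.
    progress = min(max(epoch / max_epochs, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for epoch in (0, 50, 100, 150, 200):
        print(epoch, round(cosine_learning_rate(epoch), 4))
```
        </preformat>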
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Baselines</title>
        <p>
          We compare our approach with four relevant state-of-the-art self-supervised methods.
• SimCLR [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is a straightforward framework for contrastive learning of visual
representations. Two distinct data augmentation operators, <inline-formula><tex-math>t \sim \mathcal{T}</tex-math></inline-formula> and <inline-formula><tex-math>t' \sim \mathcal{T}</tex-math></inline-formula>, are randomly selected
from the same family of augmentations and applied to each data example, creating two
correlated views. A base encoder network and a projection head are trained to maximize
agreement using a contrastive loss. After completing the training, the projection head
is discarded, and the encoder is employed to obtain a representation, denoted as h, for
downstream tasks. Notably, SimCLR introduces a learnable nonlinear transformation
between the representation and the contrastive loss, significantly enhancing the quality
of the learned representations.
• SimSiam [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] is a model designed to maximize the similarity between two augmentations
of a single image while avoiding collapsing solutions. It utilizes two augmented views of
the same image, processed by an identical encoder network (comprising a backbone and a
projection MLP). A prediction MLP is applied to one side, while a stop-gradient operation
is applied to the other side. The model’s objective is to maximize the similarity between
both sides. Notably, SimSiam does not rely on negative pairs or a momentum encoder.
The authors empirically demonstrate the existence of collapsing solutions and emphasize
the critical role of the stop-gradient operation in preventing such occurrences. This
suggests the presence of an underlying optimization problem different from conventional
contrastive learning.
• BYOL [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] is an approach to self-supervised image representation learning. It relies on
two neural networks, referred to as online and target networks, that interact and learn
from each other. Using an augmented view of an image, the online network is trained to
predict the target network’s representation of the same image under a different augmented
view. Concurrently, the target network is updated with a slow-moving average of the
online network. The use of a slow-moving average of the online parameters as the target
network encourages the encoding of increasing information within the online projection
and mitigates the risk of collapsed solutions.
• Barlow Twins [
          <xref ref-type="bibr" rid="ref13">29</xref>
          ] proposes an objective function that inherently avoids collapse by
measuring the cross-correlation matrix between the outputs of two identical networks
fed with distorted versions of a sample. The objective is to make this matrix as close
to the identity matrix as possible. This approach ensures that the embedding vectors
of distorted versions of a sample are similar while minimizing redundancy between the
components of these vectors.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Preliminary Results</title>
        <p>
          Tables 1 and 2 report the performance metrics for
Content-Based Image Retrieval (CBIR) and Linear Evaluation across various Self-Supervised
Learning (SSL) algorithms applied to the CIFAR-10 [
          <xref ref-type="bibr" rid="ref25">41</xref>
          ] and Street View House Numbers
(SVHN) [
          <xref ref-type="bibr" rid="ref26">42</xref>
          ] datasets. For CBIR, Mean Average Precision (MAP) is
computed at different values of k, the number of nearest neighbors sought in the
retrieval process. Each row in Table 1 corresponds to a distinct SSL algorithm and shows its MAP
at the different values of k; higher MAP values indicate a better ability to retrieve relevant
images. For Linear Evaluation, Table 2 reports
Accuracy, Recall, Precision, and F1-score for each SSL algorithm.
Together, the tables give a detailed breakdown of the performance of each SSL algorithm under
consideration in image retrieval and
classification tasks.
        </p>
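        <p>To make the retrieval metric concrete, the following sketch computes MAP at k from embeddings and class labels. It is an illustration under assumptions rather than the exact evaluation code behind the tables: it assumes cosine similarity, excludes the query itself from the retrieved set, treats an image as relevant when it shares the query's label, and normalizes average precision by the number of relevant items among the k retrieved. The function name map_at_k is hypothetical.</p>
        <preformat><![CDATA[
```python
import numpy as np


def map_at_k(embeddings, labels, k):
    """Mean Average Precision over the k nearest neighbors of every item.

    embeddings: (n, dim) array; labels: (n,) array of class ids.
    Relevance = same label as the query (an assumption of this sketch).
    """
    n = embeddings.shape[0]
    k = min(k, n - 1)  # the query itself is never retrieved
    # Cosine similarity between all pairs of embeddings.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)
    # Indices of the k most similar items for every query.
    top_k = np.argsort(-sim, axis=1)[:, :k]
    relevant = labels[top_k] == labels[:, None]          # (n, k) booleans
    # Precision at each rank i = hits within the first i results / i.
    precision_at_i = np.cumsum(relevant, axis=1) / np.arange(1, k + 1)
    n_relevant = relevant.sum(axis=1)
    # Average precision: mean of precision@i over the relevant ranks.
    ap = (precision_at_i * relevant).sum(axis=1) / np.maximum(n_relevant, 1)
    return float(ap.mean())
```
]]></preformat>
        <p>Other conventions exist, for example normalizing by the total number of relevant items in the database, so values from this sketch are not directly comparable to the tables unless the same convention is used.</p>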
      </sec>
      <sec id="sec-5-4">
        <title>4.4. Analysis</title>
        <p>Preliminary results indicate that our approach is effective on two fundamental
tasks: Content-Based Image Retrieval (CBIR) and Linear Evaluation.
• Content-Based Image Retrieval (CBIR): To evaluate the performance of our method
on CBIR, the CIFAR-10 and SVHN datasets were used. Looking at Table 1:
– CIFAR-10: Our method outperforms the baselines for the different values of k.
Compared to other state-of-the-art methods such as SimCLR, SimSiam, BYOL, and
BarlowTwins, our approach shows a considerable improvement in Mean Average
Precision (MAP). We achieved a MAP of 0.7316 for k=1000, 0.8253 for k=100,
0.868 for k=10, and 0.924 for k=1, indicating a high capacity for image representation
and retrieval in the latent space.
– SVHN: Our algorithm improves on SimCLR, BYOL, and BarlowTwins in terms of MAP,
but it is outperformed by SimSiam. Our method achieves a MAP of 0.4315 for k=1000,
0.6004 for k=100, 0.6965 for k=10, and 0.805 for k=1. Although it is not the best
method on this dataset, our approach remains competitive.
• Linear Evaluation: To evaluate the generalization ability of the learned representations
in a linear classification task, an evaluation was performed on CIFAR-10 and SVHN.
Performance metrics include accuracy, recall, precision, and F1-score. Looking at
Table 2:
– CIFAR-10: Our method outperforms the other state-of-the-art approaches, namely
SimCLR, SimSiam, BYOL, and BarlowTwins. We achieved a classification accuracy of
93.22%, demonstrating the effectiveness of the learned representations in linear
classification on this dataset.
– SVHN: Our method also performs well on the linear classification task for SVHN.
Although SimSiam outperforms our approach on the CBIR task, our method
outperforms both SimSiam and the other baselines in terms of classification
accuracy, achieving an accuracy of 87.42%.</p>
        <p>In summary, our approach achieves the best CBIR performance on CIFAR-10 and is
highly competitive on SVHN. Furthermore, it achieves the highest linear classification
accuracy on both datasets. These findings support the effectiveness and promise of our
method for feature extraction and representation of image data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>The landscape of Self-Supervised Representation Learning (SRL) has witnessed significant
advancements, and this paper contributes to the field by addressing a crucial limitation in
existing methods. Traditional approaches often focus on learning similarity without adequately
capturing the nuances of dissimilarity, leading to suboptimal representations. Our proposed
method, termed “Refining Triplet Sampling”, introduces a novel strategy for generating negative
vectors in a batch, enhancing the triplet loss methodology for representation learning. The
motivation behind our approach stems from the challenge of creating dissimilar data, a critical
aspect of effective SRL. Existing methods, including those relying on the average as a measure
of dissimilarity, fall short of providing robust negative representations. Our method tackles
this limitation by constructing negative samples based on the k-nearest neighbors, significantly
improving the model’s ability to differentiate dissimilar instances.</p>
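      <p>As a rough illustration of how negatives built from k-nearest neighbors can enter a triplet loss, consider the sketch below. It is a hypothetical reading made for illustration only and should not be taken as the exact procedure of our method: it assumes squared Euclidean distances, forms each negative as the mean of the anchor's k nearest other anchors in the batch, and uses a margin of 1.0. The function name knn_negative_triplet_loss and all parameter names are assumptions.</p>
      <preformat><![CDATA[
```python
import numpy as np


def knn_negative_triplet_loss(anchors, positives, k=3, margin=1.0):
    """Triplet margin loss with negatives built from in-batch neighbors.

    anchors, positives: (n, dim) arrays; positives[i] is the augmented
    view of anchors[i]. The negative construction is an illustrative
    assumption, not a confirmed description of the paper's method.
    """
    n = anchors.shape[0]
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances between anchors, shape (n, n).
    diff = anchors[:, None, :] - anchors[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    np.fill_diagonal(dist, np.inf)  # an anchor is never its own neighbor
    # Indices of the k nearest other anchors for every anchor.
    nn_idx = np.argsort(dist, axis=1)[:, :k]
    # One synthetic negative per anchor: the mean of its k neighbors.
    negatives = anchors[nn_idx].mean(axis=1)
    d_pos = np.sum((anchors - positives) ** 2, axis=1)
    d_neg = np.sum((anchors - negatives) ** 2, axis=1)
    # Hinge: the positive must be closer than the negative by the margin.
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))
```
]]></preformat>
      <p>Because the neighbors are the closest other samples in the batch, negatives built this way lie near the anchor and therefore tend to be harder than a negative obtained by averaging the whole batch.</p>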
      <p>Experimental results in Content-Based Image Retrieval (CBIR) and Linear Evaluation show
that our approach outperforms the other Self-Supervised Learning (SSL) methods (baselines) in
most settings. The refined representations yield the highest Mean Average Precision (MAP)
values in CBIR on CIFAR-10 and remain competitive on SVHN, where only SimSiam performs
better, emphasizing the effectiveness of our method in retrieving relevant images. Linear
Evaluation further underscores the versatility of our learned representations, which
outperform the other algorithms in terms of Accuracy, Recall, Precision, and F1.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research received partial support through an agreement with Scotiabank and Federico
Santa María Technical University, as well as via a scholarship for international visits provided
by Federico Santa María Technical University and the National Agency for Research and
Development (doctoral scholarship 2022/21221059).</p>
      <p>[11] H. Oh Song, Y. Xiang, S. Jegelka, S. Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4004–4012.
[12] T. T. Cai, J. Frankle, D. J. Schwab, A. S. Morcos, Are all negatives created equal in contrastive instance discrimination?, arXiv preprint arXiv:2010.06682 (2020).
[13] C. Doersch, A. Gupta, A. A. Efros, Unsupervised visual representation learning by context prediction, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1422–1430.
[14] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th international conference on Machine learning, 2008, pp. 1096–1103.
[15] D. P. Kingma, M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv:1312.6114 (2013).
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27 (2014).
[17] C. Doersch, A. Zisserman, Multi-task self-supervised visual learning, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2051–2060.
[18] R. Zhang, P. Isola, A. A. Efros, Colorful image colorization, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III 14, Springer, 2016, pp. 649–666.
[19] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, Springer, 2016, pp. 577–593.
[20] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A. A. Efros, Context encoders: Feature learning by inpainting, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2536–2544.
[21] M. Noroozi, P. Favaro, Unsupervised learning of visual representations by solving jigsaw puzzles, in: European conference on computer vision, Springer, 2016, pp. 69–84.
[22] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4681–4690.
[23] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, T. Brox, Discriminative unsupervised feature learning with convolutional neural networks, Advances in neural information processing systems 27 (2014).
[24] S. Gidaris, P. Singh, N. Komodakis, Unsupervised representation learning by predicting image rotations, arXiv preprint arXiv:1803.07728 (2018).
[25] A. Kolesnikov, X. Zhai, L. Beyer, Revisiting self-supervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1920–1929.
[26] X. Chen, H. Fan, R. Girshick, K. He, Improved baselines with momentum contrastive learning, arXiv preprint arXiv:2003.04297 (2020).
[27] Y. Tian, C. Sun, B. Poole, D. Krishnan, C. Schmid, P. Isola, What makes for good views for contrastive learning?, Advances in neural information processing systems 33 (2020) 6827–6839.</p>
    </sec>
    <sec id="sec-8">
      <title>A. Implementation Details</title>
      <sec id="sec-8-1">
        <title>A.1. Hardware Configuration</title>
        <p>The experiments were carried out on a computer with the following specifications: Intel(R)
Core(TM) i7-8700K CPU @ 3.70GHz, 32GB of RAM, and a GeForce GTX 1080 Ti GPU.</p>
      </sec>
      <sec id="sec-8-2">
        <title>A.2. Selected Hyperparameters</title>
        <p>Table 3 lists all the hyperparameters used for our methods. Each hyperparameter shapes
the behavior of the training procedure, and the values were selected to improve performance
and robustness across the experimental settings and datasets considered.</p>
        <p>Table 3 values (hyperparameter names not recoverable): 200, 32, SGD, 15.</p>
      </sec>
      <sec id="sec-8-3">
        <title>A.3. Dataset details</title>
        <p>Additional information about the datasets is presented in Table 4. It is important to
note that the two datasets differ substantially in nature: one consists of natural images, while
the other is composed solely of images of house-number digits. Using both is essential for a
comprehensive evaluation of performance across different kinds of data.</p>
      </sec>
      <sec id="sec-8-4">
        <title>A.4. Recovery Visualization</title>
        <p>In this subsection, we present visual examples showcasing the recovery achieved by our method.
These illustrations are depicted in Figures ??, ??, ??, ??, and ??. Through these images, we
aim to demonstrate the effectiveness of our approach in accurately reconstructing the original
content. Notably, our method preserves the semantic integrity of the images during
the recovery process, retaining crucial visual details and structures.</p>
      </sec>
      <sec id="sec-8-5">
        <title>A.5. Online Resources</title>
        <p>For those interested in replicating our results, the code is available on GitHub at the following
link:</p>
        <p>GitHub Repository</p>
        <p>This repository contains the necessary resources and instructions to facilitate the replication
of our findings. Feel free to explore and utilize the code to delve deeper into our methodology
and validate the outcomes.</p>
      </sec>
      <sec id="sec-8-6">
        <title>A.6. Future Work</title>
        <p>Despite the advancements presented in this work in the domain of image retrieval and
classification, several lines of research could further enrich our approach and explore its
applicability in different visual contexts. Some areas of interest for future investigation are
highlighted below:
• Exploration of Diversity in Image Datasets: To assess the robustness and
generalization of our algorithm across different visual domains, we propose the inclusion of
additional datasets representing diverse kinds of images. These could include datasets
containing medical images, satellite data, or texture images, among others. Expanding the
image domains will allow for a more comprehensive evaluation of the algorithm’s
ability to adapt to a variety of visual contexts.
• Transfer Learning in Cross-Domain Scenarios: To extend our research on transfer
learning, we suggest exploring cross-domain scenarios where the model is trained on
one dataset and evaluated on another with different visual characteristics. This line of
investigation will help assess the algorithm’s ability to adapt to different visual
styles and evaluate the transferability of learned representations across different image
domains.
• Exploration of Semi-Supervised Learning Techniques: To further improve the
performance of the algorithm in image retrieval and classification tasks, we propose
investigating semi-supervised learning techniques. This approach leverages both labeled
and unlabeled data to train the model, which can be particularly useful in scenarios where
labeled datasets are scarce or expensive to obtain. Exploring semi-supervised strategies
could open up new opportunities to enhance the efficiency and accuracy of the algorithm
in computer vision tasks.</p>
        <p>These research directions represent significant steps towards advancing our understanding of
self-supervised algorithms in the field of computer vision and their application in a variety of
visual domains and real-world scenarios.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Girshick</surname>
          </string-name>
          ,
          <article-title>Momentum contrast for unsupervised visual representation learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>9729</fpage>
          -
          <lpage>9738</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>Big self-supervised models are strong semi-supervised learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>22243</fpage>
          -
          <lpage>22255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gutmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hyvärinen</surname>
          </string-name>
          ,
          <article-title>Noise-contrastive estimation: A new estimation principle for unnormalized statistical models</article-title>
          ,
          <source>in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>297</fpage>
          -
          <lpage>304</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,
          <article-title>A simple framework for contrastive learning of visual representations</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1597</fpage>
          -
          <lpage>1607</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Solving inefficiency of self-supervised representation learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>9505</fpage>
          -
          <lpage>9515</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Exploring simple siamese representation learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>15750</fpage>
          -
          <lpage>15758</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Grill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Strub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Altché</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tallec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Richemond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Buchatskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doersch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Avila Pires</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gheshlaghi Azar</surname>
          </string-name>
          , et al.,
          <article-title>Bootstrap your own latent-a new approach to self-supervised learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>21271</fpage>
          -
          <lpage>21284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Huynh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kornblith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Walter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Khademi</surname>
          </string-name>
          ,
          <article-title>Boosting contrastive self-supervised learning with false negative cancellation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF winter conference on applications of computer vision</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>2785</fpage>
          -
          <lpage>2795</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Robinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Chuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <article-title>Contrastive learning with hard negative samples</article-title>
          ,
          <source>arXiv preprint arXiv:2010.04592</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Schroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Philbin</surname>
          </string-name>
          ,
          <article-title>Facenet: A unified embedding for face recognition and clustering</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>815</fpage>
          -
          <lpage>823</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Oh Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jegelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
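          <article-title>Deep metric learning via lifted structured feature embedding</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4004</fpage>
          -
          <lpage>4012</lpage>
          .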
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Unsupervised feature learning via non-parametric instance discrimination</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>3733</fpage>
          -
          <lpage>3742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zbontar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jing</surname>
          </string-name>
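          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Misra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deny</surname>
          </string-name>
          ,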
          <article-title>Barlow twins: Self-supervised learning via redundancy reduction</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>12310</fpage>
          -
          <lpage>12320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chao</surname>
          </string-name>
          ,
          <article-title>Deep feature learning with relative distance comparison for person re-identification</article-title>
          ,
          <source>Pattern Recognition</source>
          <volume>48</volume>
          (
          <year>2015</year>
          )
          <fpage>2993</fpage>
          -
          <lpage>3003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hermans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Leibe</surname>
          </string-name>
          ,
          <article-title>In defense of the triplet loss for person re-identification</article-title>
          ,
          <source>arXiv preprint arXiv:1703.07737</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <article-title>Cross-batch memory for embedding learning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6388</fpage>
          -
          <lpage>6397</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Weakly supervised person re-id: Differentiable graphical learning and a new benchmark</article-title>
          ,
          <source>IEEE Transactions on Neural Networks and Learning Systems</source>
          <volume>32</volume>
          (
          <year>2020</year>
          )
          <fpage>2142</fpage>
          -
          <lpage>2156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>N.</given-names>
            <surname>Turpault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Serizel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Vincent</surname>
          </string-name>
          ,
          <article-title>Semi-supervised triplet loss based learning of ambient audio embeddings</article-title>
          ,
          <source>in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>760</fpage>
          -
          <lpage>764</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>Trip-roma: Self-supervised learning with triplets and random mappings</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>R.</given-names>
            <surname>Hadsell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chopra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>LeCun</surname>
          </string-name>
          ,
          <article-title>Dimensionality reduction by learning an invariant mapping</article-title>
          ,
          <source>in: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR'06)</source>
          , volume
          <volume>2</volume>
          , IEEE,
          <year>2006</year>
          , pp.
          <fpage>1735</fpage>
          -
          <lpage>1742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Manmatha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Smola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Krahenbuhl</surname>
          </string-name>
          ,
          <article-title>Sampling matters in deep embedding learning</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2840</fpage>
          -
          <lpage>2848</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [38]
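          <string-name>
            <given-names>T.</given-names>
            <surname>DeVries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Taylor</surname>
          </string-name>
          ,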
          <article-title>Improved regularization of convolutional neural networks with cutout</article-title>
          ,
          <source>arXiv preprint arXiv:1708.04552</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <article-title>Some improvements on deep convolutional neural network based image classification</article-title>
          ,
          <source>arXiv preprint arXiv:1312.5402</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>Learning multiple layers of features from tiny images</article-title>
          ,
          <source>Technical Report</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Netzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Coates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bissacco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Reading digits in natural images with unsupervised feature learning</article-title>
          ,
          <source>in: NIPS Workshop on Deep Learning and Unsupervised Feature Learning</source>
          ,
          <year>2011</year>
          . URL: http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>