<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Italian Information Retrieval Workshop</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>FashionSearch++: Improving Consumer-to-Shop Clothes Retrieval with Hard Negatives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Morelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcella Cornia</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rita Cucchiara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>3</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Consumer-to-shop clothes retrieval has recently emerged in computer vision and multimedia communities with the development of architectures that can find similar in-shop clothing images given a query photo. Due to its nature, the main challenge lies in the domain gap between user-acquired and in-shop images. In this paper, we follow the most recent successful research in this area employing convolutional neural networks as feature extractors and propose to enhance the training supervision through a modified triplet loss that takes into account hard negative examples. We test the proposed approach on the Street2Shop dataset, achieving results comparable to state-of-the-art solutions and demonstrating good generalization properties when dealing with different settings and clothing categories.</p>
      </abstract>
      <kwd-group>
        <kwd>consumer-to-shop clothes retrieval</kwd>
        <kwd>image retrieval</kwd>
        <kwd>computer vision</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The visual search of an image from a database of several items is becoming a fundamental
task for many different applications in the fields of information retrieval, computer vision,
and multimedia. Typically, the task consists in finding the most similar images to a given
query, which can be either another image [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] or a textual sentence [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3, 4, 5, 6, 7</xref>
        ]. While
text-based image retrieval can suffer from language constraints, image-based retrieval has no
such limitations. Due to the ability to find similar images given a target one, this task fits
perfectly with the great expansion of e-commerce and the need for customers to easily find what
they are looking for among a large number of products. In particular, in the fashion domain, the
ability for a customer to find an in-shop garment given a query photo is a remarkable feature.
      </p>
      <p>
        In the last few years, much research effort [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref8 ref9">8, 9, 10, 11, 12, 13</xref>
        ] has been spent on making
e-commerce customer experience more effective and enjoyable, resulting in different solutions
for clothes retrieval for both in-shop [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and consumer-to-shop [
        <xref ref-type="bibr" rid="ref14 ref15 ref8">14, 8, 15</xref>
        ] settings. Focusing
on consumer-to-shop clothes retrieval, the main challenge is given by the strong differences
between query and in-shop images. In fact, while query images are usually taken in the wild
and may exhibit low quality and lighting variations, in-shop images are usually high quality,
in front perspective, and shot in a controlled environment. Almost all recent fashion retrieval
works [
        <xref ref-type="bibr" rid="ref15">16, 17, 15, 18</xref>
        ] employ convolutional neural networks (CNNs) to encode images and a
supervised triplet loss function to train the overall architecture. In this paper, we follow this
line of research and propose to modify the standard hinge-based triplet loss function with the
integration of hard negatives [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] thus improving the generalization abilities of the networks and
increasing the final performance. Despite having been widely used to improve visual-semantic
embeddings [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4, 19, 20, 21</xref>
        ], this loss function has never been applied in the context of fashion
retrieval. Experimental results on a widely used dataset for consumer-to-shop clothes retrieval,
namely Street2Shop [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] demonstrate the effectiveness of this strategy, leading to better retrieval
results using different backbones and pooling strategies. Furthermore, we show that the
use of hard negative examples can significantly increase the final results on almost all categories
of clothing and accessories (e.g. bags, dresses, footwear, skirts, etc.) and achieve performance
comparable to state-of-the-art techniques.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Proposed Approach</title>
      <p>Given a query image of a fashion item and a corresponding in-shop image, these are fed through
a CNN followed by a pooling strategy to extract a 1D feature vector for each image. Then,
the obtained feature vectors can be compared through a similarity function that measures the
similarity between the two images. An overview of the proposed approach is shown in Fig. 1.
Extracting image features. Both query and in-shop images are processed through a CNN
that extracts a 3D tensor for each image of H × W × K dimensions, where H, W, and K are
respectively the output tensor height, width, and number of channels. The 3D tensor can be
seen as a set of 2D feature channel responses X = {X_k} where k = {1, . . . , K}, X_k is the 2D
tensor representing the responses of the k-th feature channel over the set Ω of spatial locations,
and X_k(p) is the response at a particular position p ∈ Ω.</p>
      <p>
        To obtain a single 1D feature vector for each image, we employ two different pooling functions:
a standard average pooling and R-MAC descriptors [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. While the average pooling is a
well-known pooling technique computed by averaging the set X of 2D tensors, R-MAC descriptors
are an aggregation of image region descriptors extracted through a rigid-grid mechanism over
X. Formally, considering a rectangular region ℛ ⊆ Ω = [1, W] × [1, H], each region feature
vector is defined as:
      </p>
      <p>f_ℛ = [f_ℛ,1 . . . f_ℛ,k . . . f_ℛ,K]^⊤,   (1)
where f_ℛ,k = max_{p ∈ ℛ} X_k(p) is the maximum activation of the k-th channel over ℛ. Each region
ℛ is detected through a square grid of variable dimensions applied at L different scales. After
extracting a feature vector for each region, they are processed using ℓ2-normalization, PCA, and
another ℓ2-normalization. Finally, the region feature vectors are summed and ℓ2-normalized to
form a single feature vector for each image.</p>
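      <p>As an illustration of the aggregation described above, the following sketch (not the authors' code; the function names, the simplified grid generation, and the omission of the PCA-whitening step are our assumptions) computes an R-MAC-style descriptor from a feature map:</p>

```python
import numpy as np

def l2n(v, eps=1e-8):
    # l2-normalize a 1D vector
    return v / (np.linalg.norm(v) + eps)

def rmac(X, scales=3):
    """Simplified R-MAC over a feature map X of shape (H, W, K).

    Square regions are sampled on a rigid grid at `scales` scales;
    each region is max-pooled per channel, l2-normalized, and the
    region vectors are summed and l2-normalized again. The PCA step
    of the original descriptor is omitted here (assumption).
    """
    H, W, K = X.shape
    out = np.zeros(K)
    for level in range(1, scales + 1):
        side = max(1, int(2 * min(H, W) / (level + 1)))  # region side length
        step = max(1, side // 2)  # roughly 50% overlap between regions
        for y in range(0, max(1, H - side + 1), step):
            for x in range(0, max(1, W - side + 1), step):
                region = X[y:y + side, x:x + side, :]
                f = region.reshape(-1, K).max(axis=0)  # max-pool per channel
                out += l2n(f)
    return l2n(out)
```

The exact grid layout and overlap differ in the original R-MAC formulation; only the pool-normalize-sum-normalize pattern is the point here.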
      <p>Training with hard negatives. Once the descriptors of the query and in-shop images are
obtained, they are compared using a similarity function. Note that the descriptor embedding
space is learned according to the loss function used in the backbone training phase. To extract
similar descriptors from similar images, a standard hinge-based triplet ranking loss is usually
employed and defined as:
ℒ(q, s) = Σ_ŝ [α − S(q, s) + S(q, ŝ)]_+   (2)</p>
      <p>where [x]_+ = max(x, 0) and S is a similarity function (i.e. the cosine similarity in our
experiments). In the equation above, (q, s) is a matching image pair composed of a user-generated
image q and a shop image s (such that s contains the same fashion item depicted in q),
while ŝ is a negative shop image with respect to q (such that ŝ contains a different fashion item).
The sum term in the equation requires that the difference in similarity between the matching
and the non-matching pair is higher than a margin α.</p>
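      <p>For concreteness, the summed hinge loss above can be sketched as follows (a minimal pure-Python illustration, not the training code; vectors are plain lists and the function names are our own):</p>

```python
import math

def cos_sim(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def triplet_sum_loss(q, s, negatives, alpha=0.1):
    # Summed hinge-based triplet ranking loss: the hinge
    # [alpha - S(q, s) + S(q, s_hat)]_+ accumulated over all
    # negative shop images s_hat.
    pos = cos_sim(q, s)
    return sum(max(0.0, alpha - pos + cos_sim(q, sn)) for sn in negatives)
```

In practice the negatives would be the other shop images in the mini-batch.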
      <p>
        As demonstrated in previous works [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], this loss function can be dominated by multiple
negatives with small violations. To avoid such behavior, we employ a modified version that
takes into consideration the hardest negative instead of the sum of all negative examples. In
practice, this is done by replacing the sum in Eq. 2 with a maximum, thus considering only the
most violating non-matching pair. Formally, we define the loss function as follows:
ℒ_max(q, s) = max_ŝ [α − S(q, s) + S(q, ŝ)]_+   (3)
where only the hardest negative shop image ŝ is taken into account.
      </p>
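      <p>The hardest-negative variant simply replaces the sum with a maximum over the negatives; a minimal pure-Python sketch (illustrative only; the helper cos_sim and the list-based vectors are our assumptions):</p>

```python
import math

def cos_sim(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def triplet_max_loss(q, s, negatives, alpha=0.1):
    # Hard-negative triplet loss: only the most violating
    # non-matching pair contributes to the loss.
    pos = cos_sim(q, s)
    return max(max(0.0, alpha - pos + cos_sim(q, sn)) for sn in negatives)
```

With the same inputs, this returns the single largest hinge term instead of their sum, so easy negatives with small violations no longer dilute the gradient.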
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Evaluation</title>
      <p>In this section, we evaluate the performance of our approach and describe the dataset and
implementation details used in our experiments.</p>
      <p>
        Dataset and implementation details. We train and test our model on Street2Shop [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
that contains 404,683 shop photos collected from 25 different online retailers and 20,357 user-generated
photos. Overall, the dataset is composed of a total of 39,479 image pairs, each
consisting of a user-generated photo and the corresponding shop image, from 11 different
clothing categories. User-generated photos are annotated with bounding boxes of fashion items
and can be associated with multiple views of the same fashion item.
      </p>
      <p>[Table 2 categories: Bags, Belts, Dresses, Eyewear, Footwear, Hats, Leggings, Outerwear, Pants, Skirts, Tops]</p>
      <p>To encode images, we use two different CNNs (i.e. ResNet-50 and ResNet-101 [22]) pre-trained
on ImageNet [23]. We resize and crop all images to 224 × 224 and obtain a 2048-dimensional
feature vector for each encoded image using both average pooling and R-MAC descriptors. In
the case of R-MAC, we extract region feature vectors at 3 different scales. To train all models,
we use Adam [24] as optimizer with a learning rate equal to 0.0001 decreased by a factor of 10
every 10 epochs. In all experiments, we use a batch size of 50 and a margin α equal to 0.1.
Experimental results. To evaluate the effectiveness of our approach, we report rank-based
performance metrics Acc@k (k = 1, 5, 10, 20) for consumer-to-shop clothes retrieval.
Specifically, Acc@k computes the percentage of test queries for which at least one correct result is
found among the top-k retrieved shop items. Table 1 shows the results using all shop images as
retrievable items on the Street2Shop test set, without filtering the images by category. We report
the retrieval performance of both ResNet-50 and ResNet-101 backbones while extracting image
feature vectors either using average pooling or R-MAC descriptors. We compare the results of
our approach, in which we finetune the backbone using the hinge-based triplet loss with hard
negatives, with those obtained by finetuning the CNNs with a standard triplet loss and those
extracted by using the CNNs pre-trained on ImageNet without finetuning. As can be seen,
finetuning the backbone leads to a noteworthy gain in performance on all considered settings.
Also, the modified triplet loss further improves the final performance using both ResNet-50 and
ResNet-101 as backbones and employing both pooling strategies.</p>
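      <p>The rank-based metric described above can be sketched as follows (an illustrative implementation; the function name and the input layout are our assumptions):</p>

```python
def acc_at_k(ranked_ids, relevant_ids, k):
    """Top-k retrieval accuracy: the fraction of queries for which
    at least one correct shop item appears among the top-k results.

    ranked_ids: one list per query of shop-item ids sorted by
    decreasing similarity; relevant_ids: one set per query of the
    ids of the correct (matching) shop items.
    """
    hits = sum(
        1 for ranking, rel in zip(ranked_ids, relevant_ids)
        if any(r in rel for r in ranking[:k])
    )
    return hits / len(ranked_ids)
```

For example, with two queries of which only the first has a correct item in its top 2, acc_at_k returns 0.5 for k = 2.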
      <p>In Table 2, we report the performance on each of the 11 clothing categories of the Street2Shop
dataset. These results are obtained by performing the retrieval on a subset of in-shop images,
filtered by query category. As can be noticed, the use of hard negatives generally increases
the network performance, leading to better results on almost all clothing categories. Finally,
Fig. 2 shows sample query images along with the corresponding top-3 shop images retrieved by
the ResNet-101 model using R-MAC descriptors and finetuned with and without the use of hard
negatives in the training loss function.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this work, we have tackled the task of consumer-to-shop clothes retrieval where the goal is to
find the most similar clothing item from a catalog of shop images using a user-generated photo
as query. To address the task, we have employed a CNN-based feature extraction network and
two pooling mechanisms to extract compact feature vectors from images and have proposed to
train the network with a modified hinge-based triplet ranking loss that takes into account hard
negative examples. Experiments, performed on the Street2Shop dataset, have shown that the
proposed loss function can effectively improve the retrieval results in all tested settings.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by YOOX NET-A-PORTER Group and the “SUPER
- Supercomputing Unified Platform” project (POR FESR 2014-2020 DGR 1383/2018 - CUP
E81F18000330007), co-funded by Emilia Romagna region.</p>
      <p>[15] (continued) reasoning networks on a similarity pyramid, in: Proceedings of the IEEE/CVF
International Conference on Computer Vision, 2019.
[16] X. Zhao, H. Qi, R. Luo, L. Davis, A Weakly Supervised Adaptive Triplet Loss for Deep Metric
Learning, in: Proceedings of the European Conference on Computer Vision Workshops,
2019.
[17] A. Chopra, A. Sinha, H. Gupta, M. Sarkar, K. Ayush, B. Krishnamurthy, Powering
robust fashion retrieval with information rich feature embeddings, in: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[18] A. D’Innocente, N. Garg, Y. Zhang, L. Bazzani, M. Donoser, Localized Triplet Loss for
Fine-Grained Fashion Image Retrieval, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition Workshops, 2021.
[19] L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, Aligning text and document illustrations:
towards visually explainable digital humanities, in: Proceedings of the International
Conference on Pattern Recognition, 2018.
[20] M. Stefanini, M. Cornia, L. Baraldi, M. Corsini, R. Cucchiara, Artpedia: A new
visual-semantic dataset with visual and contextual sentences in the artistic domain, in:
Proceedings of the International Conference on Image Analysis and Processing, 2019.
[21] M. Cornia, L. Baraldi, H. R. Tavakoli, R. Cucchiara, A unified cycle-consistent neural model
for text and image retrieval, Multimedia Tools and Applications 79 (2020) 25697–25721.
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016.
[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition
Challenge, International Journal of Computer Vision 115 (2015) 211–252.
[24] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the
International Conference on Learning Representations, 2015.
[25] X. Wang, Z. Sun, W. Zhang, Y. Zhou, Y.-G. Jiang, Matching user photos to online products
with robust deep features, in: Proceedings of the ACM International Conference on
Multimedia Retrieval, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Tolias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sicre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jégou</surname>
          </string-name>
          ,
          <article-title>Particular object retrieval with integral max-pooling of CNN activations</article-title>
          ,
          <source>in: Proceedings of the International Conference on Learning Representations</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gordo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Almazán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Revaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Larlus</surname>
          </string-name>
          ,
          <article-title>Deep image retrieval: Learning global representations for image search</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Faghri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fidler</surname>
          </string-name>
          ,
          <article-title>VSE++: Improving Visual-Semantic Embeddings with Hard Negatives</article-title>
          ,
          <source>in: Proceedings of the British Machine Vision Conference</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>K.-H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Stacked cross attention for image-text matching</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Tavakoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          ,
          <article-title>Towards cycle-consistent models for text and image retrieval</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision Workshops</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefanini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Corsini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          ,
          <article-title>Explaining digital humanities by aligning images and textual descriptions</article-title>
          ,
          <source>Pattern Recognition Letters</source>
          <volume>129</volume>
          (
          <year>2020</year>
          )
          <fpage>166</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Stefanini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baraldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          ,
          <article-title>A Novel Attention-based Aggregation Function to Combine Vision and Language</article-title>
          ,
          <source>in: Proceedings of the International Conference on Pattern Recognition</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>DeepFashion: Powering robust clothes recognition and retrieval with rich annotations</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <article-title>VITON: An Image-based Virtual Try-On Network</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Toward characteristic-preserving image-based virtual try-on network</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <article-title>DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Neuberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Borenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hilleli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Oks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alpert</surname>
          </string-name>
          ,
          <article-title>Image Based Virtual Try-On Network From Unpaired Data</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Fincato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Landi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cornia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cesari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cucchiara</surname>
          </string-name>
          ,
          <article-title>VITON-GT: An Image-based Virtual Try-On Model with Geometric Transformations</article-title>
          ,
          <source>in: Proceedings of the International Conference on Pattern Recognition</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hadi Kiapour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lazebnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. L.</given-names>
            <surname>Berg</surname>
          </string-name>
          ,
          <article-title>Where to buy it: Matching street clothing photos in online shops</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Fashion retrieval via graph reasoning networks on a similarity pyramid</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>