FashionSearch++: Improving Consumer-to-Shop Clothes Retrieval with Hard Negatives Davide Morelli1 , Marcella Cornia1 and Rita Cucchiara1 1 Department of Engineering “Enzo Ferrari”, University of Modena and Reggio Emilia, Modena, Italy Abstract Consumer-to-shop clothes retrieval has recently emerged in computer vision and multimedia commu- nities with the development of architectures that can find similar in-shop clothing images given a query photo. Due to its nature, the main challenge lies in the domain gap between user-acquired and in-shop images. In this paper, we follow the most recent successful research in this area employing convolu- tional neural networks as feature extractors and propose to enhance the training supervision through a modified triplet loss that takes into account hard negative examples. We test the proposed approach on the Street2Shop dataset, achieving results comparable to state-of-the-art solutions and demonstrating good generalization properties when dealing with different settings and clothing categories. Keywords consumer-to-shop clothes retrieval, image retrieval, computer vision 1. Introduction The visual search of an image from a database of several items is becoming a fundamental task for many different applications in the fields of information retrieval, computer vision, and multimedia. Typically, the task consists in finding the most similar images to a given query, which can be either another image [1, 2] or a textual sentence [3, 4, 5, 6, 7]. While text-based image retrieval can suffer from language constraints, image-based retrieval has no such limitations. Due to the ability to find similar images given a target one, this task fits perfectly with the great expansion of e-commerce and the need for customers to easily find what they are looking for among a large number of products. In particular, in the fashion domain, the ability for a customer to find an in-shop garment given a query photo is a remarkable feature. In the last few years, much research effort [8, 9, 10, 11, 12, 13] has been spent on making e-commerce customer experience more effective and enjoyable, resulting in different solutions for clothes retrieval for both in-shop [8] and consumer-to-shop [14, 8, 15] settings. Focusing on consumer-to-shop clothes retrieval, the main challenge is given by the strong differences between query and in-shop images. In fact, while query images are usually taken in the wild and may exhibit low quality and lighting variations, in-shop images are usually high quality, in front perspective, and shot in a controlled environment. Almost all recent fashion retrieval IIR 2021 – 11th Italian Information Retrieval Workshop, September 13–15, 2021, Bari, Italy " davide.morelli@unimore.it (D. Morelli); marcella.cornia@unimore.it (M. Cornia); rita.cucchiara@unimore.it (R. Cucchiara)  0000-0001-7918-6220 (D. Morelli); 0000-0001-9640-9385 (M. Cornia); 0000-0002-2239-283X (R. Cucchiara) © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) Query Image CNN Pooling Embedding Space Strategy shared Pooling query images Strategy shop images CNN Shop Image Figure 1: Overview of our approach for consumer-to-shop clothes retrieval. works [16, 17, 15, 18] employ convolution neural networks (CNNs) to encode images and a supervised triplet loss function to train the overall architecture. In this paper, we follow this line of research and propose to modify the standard hinge-based triplet loss function with the integration of hard negatives [3] thus improving the generalization abilities of the networks and increasing the final performance. Despite having been widely used to improve visual-semantic embeddings [3, 4, 19, 20, 21], this loss function has never been applied in the context of fashion retrieval. Experimental results on a widely used dataset for consumer-to-shop clothes retrieval, namely Street2Shop [14], demonstrate the effectiveness of this strategy leading to better retrieval results using both different backbones and pooling strategies. Furthermore, we show that the use of hard negative examples can significantly increase the final results on almost all categories of clothing and accessories (e.g. bags, dresses, footwear, skirts, etc.) and achieve performance comparable to state-of-the-art techniques. 2. Proposed Approach Given a query image of a fashion item and a corresponding in-shop image, these are fed through a CNN followed by a pooling strategy to extract a 1D feature vector for each image. Then, the obtained feature vectors can be compared through a similarity function that measures the similarity between the two images. An overview of the proposed approach is shown in Fig. 1. Extracting image features. Both query and in-shop images are processed through a CNN that extracts a 3D tensor for each image of 𝐻 × 𝑊 × 𝐷 dimensions, where 𝐻, 𝑊 , and 𝐶 are respectively the output tensor height, width, and number of channels. The 3D tensor can be seen as a set of 2D features channel responses 𝑋 = {𝑋𝑖 } where 𝑖 = {1, . . . , 𝐷}, 𝑋𝑖 is the 2D tensor representing the responses of the 𝑖-th feature channel over the set Ω of spatial locations, and 𝑋𝑖 (𝑝) is the response at a particular position 𝑝. To obtain a single 2D tensor for each image, we employ two different pooling functions: a standard average pooling and R-MAC descriptors [1]. While the average pooling is a well- known pooling technique computed by averaging the set 𝑋 of 2D tensors, R-MAC descriptors are an aggregation of image region descriptors extracted through a rigid-grid mechanism over 𝑋. Formally, considering a rectangular region ℛ ⊆ Ω = [1, 𝑊 ] × [1, 𝐻], each region feature vector is defined as: 𝑓𝑅 = [𝑓ℛ,1 ...𝑓ℛ,𝑖 ...𝑓ℛ,𝑘 ]⊤ , (1) where 𝑓ℛ,𝑖 = max𝑝∈ℛ 𝑋𝑖 (𝑝) is the maximum activation of the 𝑖-th channel of ℛ. Each region ℛ is detected through a square grid of variable dimensions applied at 𝐿 different scales. After extracting a feature vector for each region, they are processed using ℓ2 -normalization, PCA, and another ℓ2 -normalization. Finally, the region feature vectors are summed and ℓ2 -normalized to form a single feature vector for each image. Training with hard negatives. Once the descriptor of the query and in-shop images are obtained, they are compared using a similarity function. Note that the descriptors embedding space is learned according to the loss function used in the backbone training phase. To extract similar descriptors from similar images, a standard hinge-based triplet ranking loss is usually employed and defined as: ∑︁ 𝐿𝑆𝐻 (𝑎, 𝑏) = [𝛼 − 𝑠(𝑎, 𝑏) + 𝑠(𝑎, ^𝑏)]+ (2) 𝑝 ^𝑐 where [𝑥]+ = 𝑚𝑎𝑥(𝑥, 0) and 𝑠 is a similarity function (i.e. the cosine similarity in our experiments). In the equation above, (𝑎, 𝑏) is a matching image pair composed of a user- generated image 𝑎 and a shop image 𝑏 (such that 𝑏 contains the same fashion item depicted in 𝑎), while ^𝑏 is a negative shop image with respect to 𝑎 (such that 𝑏′ contains a different fashion item). The sum term in the equation requires that the difference in similarity between the matching and the non-matching pair is higher than a margin 𝛼. As demonstrated in previous works [3], this loss function can be dominated by multiple negatives with small violations. To avoid such behavior, we employ a modified version that takes into consideration the hardest negative instead of the sum of all negative examples. In practice, this is done by replacing the sum in Eq. 2 with maximum, thus considering only the most violating non-matching pair. Formally, we define the loss function as follow: 𝐿𝑀 𝐻 (𝑎, 𝑏) = max ′ [𝛼 − 𝑠(𝑎, 𝑏) + 𝑠(𝑎, 𝑏′ )]+ (3) 𝑏 where only the hardest negative shop image 𝑏′ is taken into account. 3. Experimental Evaluation In this section, we evaluate the performance of our approach and describe the dataset and implementation details used in our experiments. Dataset and implementation details. We train and test our model on Street2Shop [14] that contains 404,683 shop photos collected from 25 different online retailers and 20,357 user- generated photos. Overall, the dataset is composed of a total of 39,479 image pairs, each consisting of a user-generated photo and the corresponding shop image, from 11 different clothing categories. User-generated photos are annotated with bounding boxes of fashion items and can be associated with multiple views of the same fashion item. Table 1 Retrieval results on the Street2Shop test set using shop images from all categories as retrievable items. Results are reported in terms of 𝑅@𝐾 with 𝐾 = 1, 5, 10, 20. Average Pooling R-MAC R@1 R@5 R@10 R@20 R@1 R@5 R@10 R@20 ResNet-50 (Pre-trained) 3.7 8.2 10.7 13.7 5.9 11.1 13.9 17.8 ResNet-50 (Finetuned) 14.8 24.9 30.9 36.9 15.4 26.8 32.5 38.8 ResNet-50 (Finetuned with HN) 15.4 25.7 31.5 37.1 18.5 29.8 34.9 41.7 ResNet-101 (Pre-trained) 3.8 8.1 10.4 13.4 6.6 11.9 14.8 18.0 ResNet-101 (Finetuned) 15.4 25.3 30.6 37.0 15.0 25.9 32.3 38.8 ResNet-101 (Finetuned with HN) 17.4 28.0 33.8 39.8 23.6 36.0 42.4 48.6 Table 2 Retrieval results in terms of 𝑅@20 on the 11 categories of the Street2Shop dataset. Bags Belts Dresses Eyewear Footwear Hats Leggings Outerwear Pants Skirts Tops Wang et al. [25] 46.6 20.2 56.9 13.8 13.1 24.4 15.9 20.3 22.3 50.8 48.0 ResNet-50 + R-MAC Finetuned 52.5 38.1 52.4 36.2 28.9 29.2 42.7 36.0 34.8 71.6 45.4 Finetuned with HN 61.2 26.2 56.7 46.6 31.9 40.0 48.1 37.8 39.4 74.1 48.8 ResNet-101 + R-MAC Finetuned 54.0 35.7 53.2 20.7 29.9 29.2 45.9 33.8 40.9 74.4 46.0 Finetuned with HN 65.5 33.3 63.4 50.0 39.2 64.6 48.7 42.7 37.9 78.4 55.5 To encode images, we use two different CNNs (i.e. ResNet-50 and ResNet-101 [22]) pre-trained on ImageNet [23]. We resize and crop all images to 224 × 224 and obtain a 2048-dimensional feature vector for each encoded image using both average pooling and R-MAC descriptors. In the case of R-MAC, we extract region feature vectors at 3 different scales. To train all models, we use Adam [24] as optimizer with a learning rate equal to 0.0001 decreased by a factor of 10 every 10 epochs. In all experiments, we use a batch size of 50 and a margin 𝛼 equal to 0.1. Experimental results. To evaluate the effectiveness of our approach, we report rank-based performance metrics 𝑅@𝐾 (𝐾 = 1, 5, 10, 20) for consumer-to-shop clothes retrieval. Specif- ically, 𝑅@𝐾 computes the percentage of test queries for which at least one correct result is found among the top-𝐾 retrieved shop items. Table 1 shows the results using all shop images as retrievable items on the Street2Shop test set, without filtering the images by category. We report the retrieval performance of both ResNet-50 and ResNet-101 backbones while extracting image feature vectors either using average pooling or R-MAC descriptors. We compare the results of our approach, in which we finetune the backbone using the hinge-based triplet loss with hard negatives, with those obtained by finetuning the CNNs with a standard triplet loss and those extracted by using the CNNs pre-trained on ImageNet without finetuning. As it can be seen, finetuning the backbone leads to a noteworthy gain in performance on all considered settings. Also, the modified triplet loss further improves the final performance using both ResNet-50 and ResNet-101 as backbone and employing both pooling strategies. In Table 2, we report the performance on each of the 11 clothing categories of the Street2Shop ResNet-101 + R-MAC ResNet-101 + R-MAC Query Image (Finetuned) (Finetuned with HN) Figure 2: Top-3 retrieved results on sample query images from the Street2Shop test set. For each query, we show the top-3 results retrieved by the ResNet-101 model with R-MAC descriptors, finetuned on the dataset with and without the use of hard negatives during training. Correct and wrong retrieved elements are highlighted in green and red, respectively. dataset. These results are obtained by performing the retrieval on a subset of in-shop images, filtered by query category. As it can be noticed, the use of hard negatives generally increases the network performance, leading to better results on almost all clothing categories. Finally, Fig. 2 shows sample query images along with the corresponding top-3 shop images retrieved by the ResNet-101 model using R-MAC descriptors and finetuned with and without the use of hard negatives in the training loss function. 4. Conclusion In this work, we have tackled the task of consumer-to-shop clothes retrieval where the goal is to find the most similar clothing item from a catalog of shop images using a user-generated photo as query. To address the task, we have employed a CNN-based feature extraction network and two pooling mechanisms to extract compact feature vectors from images and have proposed to train the network with a modified hinge-based triplet ranking loss that takes into account hard negative examples. Experiments, performed on the Street2Shop dataset, have shown that the proposed loss function can effectively improve the retrieval results in all tested settings. Acknowledgments This work has been partially supported by YOOX NET-A-PORTER Group and the “SUPER - Supercomputing Unified Platform” project (POR FESR 2014-2020 DGR 1383/2018 - CUP E81F18000330007), co-funded by Emilia Romagna region. References [1] G. Tolias, R. Sicre, H. Jégou, Particular object retrieval with integral max-pooling of CNN activations, in: Proceedings of the International Conference on Learning Representations, 2016. [2] A. Gordo, J. Almazán, J. Revaud, D. Larlus, Deep image retrieval: Learning global repre- sentations for image search, in: Proceedings of the European Conference on Computer Vision, 2016. [3] F. Faghri, D. J. Fleet, J. R. Kiros, S. Fidler, VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, in: Proceedings of the British Machine Vision Conference, 2018. [4] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: Proceedings of the European Conference on Computer Vision, 2018. [5] M. Cornia, L. Baraldi, H. R. Tavakoli, R. Cucchiara, Towards cycle-consistent models for text and image retrieval, in: Proceedings of the European Conference on Computer Vision Workshops, 2018. [6] M. Cornia, M. Stefanini, L. Baraldi, M. Corsini, R. Cucchiara, Explaining digital humanities by aligning images and textual descriptions, Pattern Recognition Letters 129 (2020) 166– 172. [7] M. Stefanini, M. Cornia, L. Baraldi, R. Cucchiara, A Novel Attention-based Aggregation Function to Combine Vision and Language, in: Proceedings of the International Conference on Pattern Recognition, 2020. [8] Z. Liu, P. Luo, S. Qiu, X. Wang, X. Tang, DeepFashion: Powering robust clothes recognition and retrieval with rich annotations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016. [9] X. Han, Z. Wu, Z. Wu, R. Yu, L. S. Davis, VITON: An Image-based Virtual Try-On Network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018. [10] B. Wang, H. Zheng, X. Liang, Y. Chen, L. Lin, M. Yang, Toward characteristic-preserving image-based virtual try-on network, in: Proceedings of the European Conference on Computer Vision, 2018. [11] Y. Ge, R. Zhang, X. Wang, X. Tang, P. Luo, DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019. [12] A. Neuberger, E. Borenstein, B. Hilleli, E. Oks, S. Alpert, Image Based Virtual Try-On Network From Unpaired Data, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020. [13] M. Fincato, F. Landi, M. Cornia, F. Cesari, R. Cucchiara, VITON-GT: An Image-based Virtual Try-On Model with Geometric Transformations, in: Proceedings of the International Conference on Pattern Recognition, 2020. [14] M. Hadi Kiapour, X. Han, S. Lazebnik, A. C. Berg, T. L. Berg, Where to buy it: Matching street clothing photos in online shops, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2015. [15] Z. Kuang, Y. Gao, G. Li, P. Luo, Y. Chen, L. Lin, W. Zhang, Fashion retrieval via graph reasoning networks on a similarity pyramid, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. [16] X. Zhao, H. Qi, R. Luo, L. Davis, A Weakly Supervised Adaptive Triplet Loss for Deep Metric Learning, in: Proceedings of the European Conference on Computer Vision Workshops, 2019. [17] A. Chopra, A. Sinha, H. Gupta, M. Sarkar, K. Ayush, B. Krishnamurthy, Powering ro- bust fashion retrieval with information rich feature embeddings, in: Proceedings of the IEEE/CFV Conference on Computer Vision and Pattern Recognition Workshops, 2019. [18] A. D’Innocente, N. Garg, Y. Zhang, L. Bazzani, M. Donoser, Localized Triplet Loss for Fine-Grained Fashion Image Retrieval, in: Proceedings of the IEEE/CFV Conference on Computer Vision and Pattern Recognition Workshops, 2021. [19] L. Baraldi, M. Cornia, C. Grana, R. Cucchiara, Aligning text and document illustrations: towards visually explainable digital humanities, in: Proceedings of the International Conference on Pattern Recognition, 2018. [20] M. Stefanini, M. Cornia, L. Baraldi, M. Corsini, R. Cucchiara, Artpedia: A new visual- semantic dataset with visual and contextual sentences in the artistic domain, in: Proceed- ings of the International Conference on Image Analysis and Processing, 2019. [21] M. Cornia, L. Baraldi, H. R. Tavakoli, R. Cucchiara, A unified cycle-consistent neural model for text and image retrieval, Multimedia Tools and Applications 79 (2020) 25697–25721. [22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016. [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision 115 (2015) 211–252. [24] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: Proceedings of the International Conference on Learning Representations, 2015. [25] X. Wang, Z. Sun, W. Zhang, Y. Zhou, Y.-G. Jiang, Matching user photos to online products with robust deep features, in: Proceedings of the ACM International Conference on Multimedia Retrieval, 2016.