Closing the Gap Between Query and Database through Query Feature Transformation in C2C e-Commerce Visual Search

Takuma Yamaguchi, Kosuke Arase, Riku Togashi, Shunya Ueta
Mercari, Inc., Tokyo, Japan
{kumon,kosuke.arase,riktor,hurutoriya}@mercari.com

ABSTRACT
This paper introduces an image representation technique for visual search on a consumer-to-consumer (C2C) e-commerce website. Visual search at such websites cannot effectively close the gap between query images taken by users and database images. The proposed technique consists of extraction of a lightweight deep CNN-based feature vector and transformation of the query feature. Our quantitative and qualitative experiments using datasets from an online C2C marketplace with over one billion items show that this image representation technique with our query image feature transformation can improve users' visual search experience, particularly when searching for apparel items, without negative side effects on nonapparel items.

Figure 1: Query image and its visual search results among 100 million items. (a) Query Image (b) Baseline (c) Proposed

CCS CONCEPTS
• Information systems → Image search.

KEYWORDS
Content-based Image Retrieval, Deep Learning, e-Commerce

ACM Reference Format:
Takuma Yamaguchi, Kosuke Arase, Riku Togashi, Shunya Ueta. 2019. Closing the Gap Between Query and Database through Query Feature Transformation in C2C e-Commerce Visual Search. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 4 pages.

1 INTRODUCTION
The explosive increase of online photos, driven by social networking and e-commerce sites, has focused researchers' attention on visual search, also called content-based image retrieval [9–11]. Many newly posted photos are listed on consumer-to-consumer (C2C) e-commerce sites, where most sellers are not professional photographers or retailers; therefore, buyers are often stymied by the poor quality or limited quantity of item information and keywords. Moreover, buyers might not even know the correct keywords to use to find their desired items. In such a situation, image-based item searches may improve the user experience.

Algorithms for extracting image features based on deep convolutional neural networks (CNNs) [7, 8] and approximate nearest neighbor (ANN) search [2, 4] can be used to realize a simple visual search system. However, even if these simple systems can retrieve visually similar images, their results could be nonoptimal. C2C e-commerce site search algorithms tend to retrieve items listed by professional sellers even if more relevant items are listed by nonprofessional sellers, because the query images are often more visually similar to images taken by professionals than to those provided by nonprofessionals, especially in apparel categories. Specifically, fitted apparel images (Figure 1b) are likely to be retrieved in response to a fitted apparel query image (Figure 1a). In this paper, we call apparel "fitted" if it is pictured being worn by a model and "flat" if it is instead laid flat on a surface. Professional and nonprofessional sellers tend to upload fitted and flat apparel images, respectively. Searches that return many items listed by professional sellers can cause problems for C2C e-commerce sites, for example, by hurting buyer experience and discouraging nonprofessional sellers from listing items [1].

To manage these issues so as to retrieve more flat apparel items, we developed an image representation technique that closes the visual gap between fitted apparel query images and flat apparel images in a database. The technique consists of extracting features using a lightweight deep CNN and transforming query features; it enables the retrieval of flat apparel images (Figure 1c) from a fitted apparel query image (Figure 1a). Moreover, the feature transformation step can be applied to any query vector because it causes no significant side effects on flat apparel and nonapparel query vectors.
Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

Thus, additional information on whether the query image contains fitted apparel is not required before feature transformation. Our experiments demonstrate that more flat apparel images are correctly discovered through query feature transformation for a fitted apparel query image, without serious negative impacts on flat apparel and nonapparel query images.

2 RELATED WORK
Some e-commerce sites, such as Alibaba and eBay, have introduced visual search systems that enable users to search for products using images [9, 11]. These systems are basically composed of deep CNN-based feature extraction and a nearest neighbor search. A method of discovering more items relevant to a query image involves training a deep CNN model with a triplet loss; however, building and updating a dataset for training such models is infeasible for a marketplace with a massive and volatile inventory. Although implicit feedback, such as page views and click logs, allows for model training with a triplet loss even in such a marketplace [11], implicit feedback is available only after launching a visual search system into production. This paper proposes an image representation method with query feature transformation; this method closes the gap between a fitted apparel query vector and flat apparel database vectors without time-consuming human relevance assessments.

3 VISUAL SEARCH ARCHITECTURE
The proposed visual search architecture simply consists of image feature extraction, query feature transformation, and a nearest neighbor vector search. For C2C e-commerce sites specifically, the feature transformation closes the distance between a fitted apparel query vector and flat apparel database vectors. An approximate nearest neighbor (ANN) algorithm accomplishes the nearest neighbor search in a large database within a practical runtime.

3.1 Image Representation
3.1.1 Feature Extraction Model. For feature extraction, we adopted MobileNetV2 [6], which is a state-of-the-art lightweight CNN model. Sending query images at a large scale to an e-commerce visual search system from user devices can cause network traffic problems.

Algorithm 1 Generate Feature Transformation Vector
Input:
  h_{c,n}, c ∈ {1, ..., C}, n ∈ {1, ..., N_c}  # feature vectors of fitted apparel images
  l_{c,m}, c ∈ {1, ..., C}, m ∈ {1, ..., M_c}  # feature vectors of flat apparel images
  h_{c,n} and l_{c,m} represent the n-th and m-th feature vectors of category c, respectively.
Output: feature transformation vector ẑ
1: for c ← 1 to C do
2:   h̄_c ← Median(h_{c,1}, ..., h_{c,N_c})  # fitted apparel vector (median of the fitted apparel vectors)
3:   l̄_c ← Median(l_{c,1}, ..., l_{c,M_c})  # flat apparel vector (median of the flat apparel vectors)
4:   t_c ← h̄_c − l̄_c  # subtract the flat apparel vector from the fitted apparel vector
5:   z_c ← Maximum(t_c, 0)  # replace negative elements with 0
6:   ẑ_c ← z_c / ‖z_c‖_2  # L2 normalization; ẑ_c is the gap vector of category c
7: z ← Average(ẑ_1, ..., ẑ_C)  # average the gap vectors
8: ẑ ← z / ‖z‖_2  # L2 normalization
9: return ẑ  # feature transformation vector

Algorithm 2 Feature Transformation
Input:
  q,  # feature vector of the query image
  ẑ   # feature transformation vector
Output: transformed query vector p̂
1: q̂ ← q / ‖q‖_2  # L2 normalization
2: t ← q̂ − ẑ  # subtract the feature transformation vector from the query vector
3: p ← Maximum(t, 0)  # replace negative elements with 0
4: p̂ ← p / ‖p‖_2  # L2 normalization
5: return p̂  # transformed query vector
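Algorithms 1 and 2 can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' code; the function names and the representation of the per-category inputs as lists of arrays are our own assumptions.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # L2-normalize a vector, guarding against division by zero.
    return v / max(np.linalg.norm(v), eps)

def generate_transformation_vector(fitted, flat):
    """Algorithm 1. `fitted` and `flat` are lists with one entry per
    category; each entry is a (num_images, dim) array of feature vectors."""
    gap_vectors = []
    for h_c, l_c in zip(fitted, flat):
        h_bar = np.median(h_c, axis=0)         # fitted apparel (median) vector
        l_bar = np.median(l_c, axis=0)         # flat apparel (median) vector
        t_c = h_bar - l_bar                    # fitted minus flat
        z_c = np.maximum(t_c, 0)               # replace negative elements with 0
        gap_vectors.append(l2_normalize(z_c))  # gap vector of category c
    z = np.mean(gap_vectors, axis=0)           # average the gap vectors
    return l2_normalize(z)                     # feature transformation vector

def transform_query(q, z_hat):
    """Algorithm 2: shift a query vector toward the flat-apparel region."""
    t = l2_normalize(q) - z_hat                # subtract transformation vector
    p = np.maximum(t, 0)                       # replace negative elements with 0
    return l2_normalize(p)                     # transformed query vector
```

Because the underlying features come from ReLU activations and are nonnegative, zeroing the negative elements after the subtraction keeps the transformed query in the same nonnegative space as the database vectors.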
One solution to this network traffic issue is edge computing, through which image features are extracted on an edge device or a smart device. Such a lightweight extraction model works efficiently on an edge device and consumes only several megabytes of memory space.

We prepared a dataset consisting of images and their metadata collected from an online C2C marketplace with over one billion listings. The dataset has 9 million images belonging to 14,000 classes, which are combinations of item brands, textures, and categories – for example, Nike striped men's golf polo. Images from nonapparel categories, such as laptops, bikes, and toys, are included in the dataset.

One of the model's hyperparameters is a width multiplier [6]; for a given layer and width multiplier α, the number of output channels N becomes αN. The model was trained on the dataset with a width multiplier of 1.4. The output of the global average pooling layer was used as an image feature vector with 1,792 (1,280 × 1.4) dimensions. The feature vectors of the query and database images were then extracted using the same feature extractor.

3.1.2 Query Feature Transformation. Only the query feature vectors were calibrated using a feature transformation vector, which intuitively expresses a human feature vector, to close the gap between a fitted apparel query feature vector and flat apparel database feature vectors. The feature transformation vector was trained through Algorithm 1 with 80,040 images (54,145 of fitted apparel and 25,895 of flat apparel) belonging to 15 apparel categories, such as tops, jackets, pants, and hats. In the training step, a gap vector, which represents the difference between fitted and flat apparel feature vectors, was calculated for each category, and the feature transformation vector was computed by averaging the gap vectors. For a query, the transformation simply subtracts the feature transformation vector from the query image feature vector (Algorithm 2); its computation time is negligibly small.

The feature vector extracted from MobileNetV2 initially lacks negative value elements owing to the use of the rectified linear unit (ReLU) activation function [5]. Negative value elements in the feature vector space can therefore be treated as unnecessary; that is, they are replaced with zero in Algorithms 1 and 2, a step that is key to preventing side effects in query feature transformation. Even if the feature transformation designed for a fitted apparel query vector is applied to a flat apparel or nonapparel query vector, the essential feature is still preserved by removing negative value elements.

Table 1: Visual Search Results for Apparel Categories (mAP@100)

                     Flat Apparel         Fitted Apparel       Fitted Apparel (Cropped)
                     Baseline  Proposed   Baseline  Proposed   Baseline  Proposed
T-Shirts             0.844     0.895      0.004     0.376      0.042     0.542
Sweaters             0.926     0.967      0.002     0.456      0.053     0.670
Hoodies              0.942     0.977      0.053     0.691      0.211     0.756
Denim Jackets        0.982     0.993      0.004     0.778      0.041     0.850
Down Jackets         0.972     0.995      0.115     0.815      0.313     0.866
Jeans                0.878     0.822      0.001     0.381      0.095     0.737
Casual Pants         0.889     0.933      0.002     0.475      0.139     0.690
Knee-Length Skirts   0.718     0.732      0.000     0.081      0.090     0.257
Long Skirts          0.567     0.614      0.004     0.180      0.041     0.244
Dresses              0.847     0.922      0.001     0.226      0.018     0.254

3.2 Nearest Neighbor Search
In large-scale e-commerce, ANN searches outperform brute force in finding the nearest neighbors of a transformed query vector among the database vectors. ANN algorithms, such as IVFADC [2] and Rii [4], allow us to retrieve the nearest neighbors in a practical runtime. In our experiments, we used IVFADC to retrieve visually similar images from among 100 million images.
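For L2-normalized feature vectors, the exact (brute-force) variant of this nearest neighbor search reduces to a matrix product, since cosine similarity equals the dot product. A minimal sketch (the function name is our own; production systems would use an ANN index such as IVFADC instead of a full scan):

```python
import numpy as np

def search(query, database, k=5):
    """Brute-force nearest neighbor search by cosine similarity.
    query: (dim,) vector; database: (num_items, dim) matrix.
    Both are assumed L2-normalized, so cosine similarity is a dot product."""
    scores = database @ query        # cosine similarity to every database item
    top_k = np.argsort(-scores)[:k]  # indices of the k most similar items
    return top_k, scores[top_k]
```

At the scale of 100 million vectors this full scan is impractical; IVFADC [2] instead assigns database vectors to coarse cells (inverted lists) and compares compact product-quantized codes only within the cells visited for each query.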
4 EXPERIMENTS
We conducted experiments to evaluate the proposed method. For these experiments, we collected 20,000 images from a C2C marketplace: half of these images were of flat apparel and the remainder were fitted apparel images. The flat apparel images belong to ten categories, shown in the first column of Table 1. Fitted apparel images not belonging to the ten categories, such as jerseys and polo shirts, were also included. From the 20,000 images, 2,000 were used as query images; 100 images were randomly selected from the 10 categories for each of the flat and fitted apparel classes. The remaining 18,000 images were treated as database images. For fitted apparel queries, cropped images of the query objects were prepared manually from the original images to reduce the influence of the background.

The mean average precision at 100 (mAP@100), defined as follows, was used as an evaluation measure for each category:

    mAP@K = (1/N) Σ_{q=1}^{N} AP@K(q),

    AP@K = Σ_{k=1}^{K} (P@k · I(k)) / Σ_{k=1}^{K} I(k),    P@k = Σ_{n=1}^{k} I(n) / k,

    I(i) = 1 if the i-th item is flat apparel in the same category as the query, 0 otherwise,

where N is the number of query images, and AP@K and P@k indicate the average precision at K and the precision at k for each query, respectively. A retrieved item is recognized as correctly selected only when it is an image of flat apparel in the same category as the query.

For the baseline image representation, a vector from the global average pooling layer of MobileNetV2, described in Section 3.1, was used for query and database images. Our proposed method also uses the same feature extractor and transforms query vectors. Because the number of database images used in this experiment was relatively small, the nearest vectors were greedily retrieved using cosine similarity. Table 1 compares the mAP@100 of the baseline and proposed image representations for flat and fitted apparel query images. The results demonstrate a significant improvement for the fitted apparel queries in every category. Although query feature transformation was designed to close the gap between fitted and flat apparel vectors, it also positively influenced flat apparel queries. These results imply that our proposed method enables more essential features to be extracted from query images.

We also collected 100 million images belonging to over 1,000 item categories, including nonapparel images. For such large-scale data, ANN algorithms allow us to retrieve the nearest neighbors within a practical runtime. Figure 2 presents the visual search results with IVFADC [2] (code length per vector: 64 bytes; number of cells: 8,192; number of cells visited for each query: 64) from the 100 million images for fitted apparel and nonapparel queries. To demonstrate the versatility of the proposed method, the fitted apparel queries also contain images from a different dataset, ATR [3], which was originally used for a human parsing task. For fitted apparel queries, our proposed method retrieved a greater number of visually similar flat apparel items (1st–8th rows in Figure 2). In addition, no serious negative impact was observed for nonapparel queries: items visually similar to the query images were successfully retrieved (9th–13th rows in Figure 2). The runtimes of the image feature extraction method and the nearest-100 vector search were approximately 40 and 70 ms, respectively, using an 8-core 2.3 GHz CPU. By simultaneously processing multiple query images and/or using GPUs, the runtime per query could be made faster.

5 CONCLUSION
This paper proposed an image representation technique for visual search at C2C e-commerce sites. The proposed method, comprising deep CNN-based feature extraction and query feature transformation, significantly improves on conventional visual search methods for comparing images of fitted and flat apparel. Additionally, the proposed method did not negatively impact either flat apparel or nonapparel queries in any serious manner. The performance and total runtimes of our visual search system were practical in the experiments described, indicating that it can be successfully deployed to a major online C2C marketplace. After the system is widely used in production, further improvement is expected using real query images and implicit feedback.

Figure 2: Visual search results from 100 million images. The first column shows query images and the next seven columns show the results with the baseline image representation. The remaining columns show the results obtained using query feature transformation. Our method successfully retrieved more flat apparel images corresponding to the fitted apparel queries without negatively impacting nonapparel queries. (Row groups: fitted apparel; fitted apparel from the ATR dataset; nonapparel.)

REFERENCES
[1] A. Hagiu and R. Simon. 2016. Network Effects Aren't Enough. Harvard Business Review 94, 4 (April 2016), 65–71.
[2] H. Jegou, M. Douze, and C. Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (Jan. 2011), 117–128.
[3] X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, and S. Yan. 2015. Deep Human Parsing with Active Template Regression. IEEE Trans. Pattern Anal. Mach. Intell. 37, 12 (Dec. 2015), 2402–2414.
[4] Y. Matsui, R. Hinami, and S. Satoh. 2018. Reconfigurable Inverted Index. In Proceedings of the 26th ACM International Conference on Multimedia (MM '18). 1715–1723.
[5] V. Nair and G. E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML '10). 807–814.
[6] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR '18).
[7] X. Song, S. Jiang, and L. Herranz. 2017. Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold. Trans. Img. Proc. 26, 6 (June 2017), 2721–2735.
[8] A. Babenko and V. Lempitsky. 2015. Aggregating Local Deep Features for Image Retrieval. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15). 1269–1277.
[9] F. Yang, A. Kale, Y. Bubnov, L. Stein, Q. Wang, H. Kiapour, and R. Piramuthu. 2017. Visual Search at eBay. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). 2101–2110.
[10] A. Zhai, D. Kislyuk, Y. Jing, M. Feng, E. Tzeng, J. Donahue, Y. L. Du, and T. Darrell. 2017. Visual Discovery at Pinterest. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). 515–524.
[11] Y. Zhang, P. Pan, Y. Zheng, K. Zhao, Y. Zhang, X. Ren, and R. Jin. 2018. Visual Search at Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '18). 993–1001.
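As a worked example of the evaluation measure defined in Section 4, mAP@K can be implemented as follows. This is an illustrative sketch, not the authors' evaluation code; it assumes the per-query relevance indicators I(k) are given as boolean lists over the ranked results.

```python
def average_precision_at_k(relevant, k=100):
    """AP@K for one query. `relevant[i]` is True when the (i+1)-th retrieved
    item is flat apparel in the same category as the query (the I(k)
    indicator in the paper). Returns 0.0 when no relevant item is retrieved."""
    relevant = relevant[:k]
    hits = 0
    precision_sum = 0.0
    for i, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # P@i, accumulated only at relevant ranks
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at_k(results, k=100):
    """mAP@K: mean of AP@K over all queries; `results` is a list of
    per-query boolean relevance lists."""
    return sum(average_precision_at_k(r, k) for r in results) / len(results)
```

Note that the denominator of AP@K here is the number of relevant items retrieved within the top K (Σ I(k)), matching the formula above rather than the total number of relevant items in the database.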