
Closing the Gap Between Query and Database through Query Feature Transformation in C2C e-Commerce Visual Search

Takuma Yamaguchi

Kosuke Arase

Riku Togashi

Shunya Ueta

Mercari, Inc., Tokyo, Japan

{kumon, kosuke.arase, riktor, hurutoriya}@mercari.com

Content-based Image Retrieval, Deep Learning, e-Commerce

2019

This paper introduces an image representation technique for visual search on a consumer-to-consumer (C2C) e-commerce website. Visual searching at such websites cannot effectively close the gap between query images taken by users and database images. The proposed technique consists of extraction of a lightweight deep CNN-based feature vector and transformation of a query feature. Our quantitative and qualitative experiments using datasets from an online C2C marketplace with over one billion items show that this image representation technique with our query image feature transformation can improve users' visual search experience, particularly when searching for apparel items, without negative side effects on nonapparel items.

CCS CONCEPTS

• Information systems → Image search;

1 INTRODUCTION

The explosive increase of online photos, driven by social networking and e-commerce sites, has focused researchers' attention on visual search, also called content-based image retrieval [9–11]. Many newly posted photos are listed on consumer-to-consumer (C2C) e-commerce sites, where most sellers are not professional photographers or retailers; therefore, buyers are often stymied by the poor quality or limited quantity of item information and keywords. Moreover, buyers might not even know the correct keywords to use to find their desired items. In such a situation, image-based item searches may improve the user experience.

Algorithms for extracting image features based on deep convolutional neural networks (CNN) [7, 8] and approximate nearest neighbor (ANN) search [2, 4] can be used to realize a simple visual search system. However, even if these simple systems can retrieve visually similar images, their results could be nonoptimal. C2C e-commerce site search algorithms tend to extract items listed by professional sellers even if more relevant items are listed by nonprofessional sellers because the query images are often more visually similar to images taken by professionals than those provided by nonprofessionals, especially in apparel categories. Specifically, fitted apparel images (Figure 1b) are likely to be retrieved in response to a fitted apparel query image (Figure 1a). In this paper, we call apparel "fitted" if it is pictured being worn by a model and "flat" if it is instead laid flat on a surface. Professional and nonprofessional sellers tend to upload fitted and flat apparel images, respectively. Searches that return many items listed by professional sellers can cause problems for C2C e-commerce sites, for example, by hurting buyer experience and discouraging nonprofessional sellers from listing items [1].

To manage these issues so as to retrieve more flat apparel items, we developed an image representation technique that closes the visual gap between fitted apparel query images and flat apparel images in a database. The technique consists of extracting features using a lightweight deep CNN and transforming query features; it enables the retrieval of flat apparel images (Figure 1c) from a fitted apparel query image (Figure 1a). Moreover, the feature transformation step can be applied to any query vector because it causes no significant side effects to flat apparel and nonapparel query vectors. Thus, additional information of whether the query image contains fitted apparel is not required before feature transformation. Our experiments demonstrate that more flat apparel images are correctly discovered through query feature transformation for a fitted apparel query image without serious negative impacts on flat apparel and nonapparel query images.

2 RELATED WORK

Some e-commerce sites, such as Alibaba and eBay, have introduced visual search systems that enable users to search for products using images [9, 11]. These systems are basically composed of deep CNN-based feature extraction and a nearest neighbor search. A method of discovering more items relevant to a query image involves the training of a deep CNN model with a triplet loss; however, building and updating a dataset for training such models is infeasible for a massive and volatile inventory marketplace. Although implicit feedback, such as page views and click logs, allows for model training with a triplet loss even in such a marketplace [11], implicit feedback is available only after launching a visual search system into production. This paper proposes an image representation method with query feature transformation; this method closes the gap between a fitted apparel query vector and flat apparel database vectors without time-consuming human relevance assessments.

3 VISUAL SEARCH ARCHITECTURE

The proposed visual search architecture simply consists of image feature extraction, query feature transformation, and a nearest neighbor vector search. For C2C e-commerce sites specifically, this feature transformation closes the distance between a fitted apparel query vector and flat apparel database vectors. An approximate nearest neighbor (ANN) algorithm accomplishes the nearest neighbor search in a large database within a practical runtime.

3.1 Image Representation

3.1.1 Feature Extraction Model. For feature extraction, we adopted MobileNetV2 [6], which is a state-of-the-art lightweight CNN model. Sending query images at a large scale to an e-commerce visual search system from user devices can cause network traffic problems. One solution to this issue is edge computing, through which image features are extracted on an edge device or a smart device. Such a lightweight extraction model works efficiently in an edge device and consumes only several megabytes of memory space.

We prepared a dataset consisting of images and their metadata collected from an online C2C marketplace with over one billion listings. The dataset has 9 million images belonging to 14,000 classes, which are combinations of item brands, textures, and categories (for example, Nike striped men's golf polo). Images from nonapparel categories, such as laptops, bikes, and toys, are included in the dataset.

One of the model's hyperparameters is a width multiplier [6]; for a given layer and width multiplier α, the number of output channels N becomes αN. The model was trained on the dataset with a width multiplier of 1.4. The output of the global average pooling layer was used as an image feature vector that has 1,792 (1,280 × 1.4) dimensions. Then, the feature vectors of the query and database images were extracted using the same feature extractor.

3.1.2 Query Feature Transformation. Only the query feature vectors were calibrated using a feature transformation vector, which expresses a human feature vector intuitively, to close the gap between fitted apparel query feature vectors and flat apparel database feature vectors. The feature transformation vector was trained through Algorithm 1 with 80,040 images (54,145 of fitted apparel and 25,895 of flat apparel) belonging to 15 apparel categories, such as tops, jackets, pants, and hats. In the training step, a gap vector, which represents the difference between fitted and flat apparel feature vectors, was calculated for each category and the feature transformation vector was computed by averaging the gap vectors. For a query, the transformation simply subtracts the feature transformation vector from a query image feature vector (Algorithm 2); its computation time is negligibly small.

The feature vector extracted from MobileNetV2 initially lacks negative value elements owing to the use of the rectified linear unit (ReLU) activation function [5]. The negative value elements in the feature vector space can be treated as unnecessary; that is, such elements are replaced with zero in Algorithms 1 and 2, a step that is key to preventing side effects in query feature transformation. Even if the feature transformation designed for a fitted apparel query vector is applied to a flat apparel or nonapparel query vector, the essential feature is still preserved by removing negative value elements.

Algorithm 1 Generate Feature Transformation Vector
Input:
    h_{c,n}, c ∈ {1, ..., C}, n ∈ {1, ..., N_c}    # Feature vectors of fitted apparel images
    l_{c,m}, c ∈ {1, ..., C}, m ∈ {1, ..., M_c}    # Feature vectors of flat apparel images
    h_{c,n} and l_{c,m} represent the n-th and m-th feature vectors of category c, respectively.
Output:
    Feature transformation vector ẑ
1: for c ← 1 to C do
2:     h̄_c ← Median(h_{c,1}, ..., h_{c,N_c})    # Fitted apparel vector (median vector of the fitted apparel vectors)
3:     l̄_c ← Median(l_{c,1}, ..., l_{c,M_c})    # Flat apparel vector (median vector of the flat apparel vectors)
4:     t_c ← h̄_c − l̄_c    # Subtracting the flat apparel vector from the fitted apparel vector
5:     z_c ← Maximum(t_c, 0)    # Replacing negative elements with 0
6:     ẑ_c ← z_c / ‖z_c‖_2    # L2 normalization; ẑ_c is the gap vector of category c
7: z ← Average(ẑ_1, ..., ẑ_C)    # Averaging the gap vectors
8: ẑ ← z / ‖z‖_2    # L2 normalization
9: return ẑ    # Feature transformation vector

Algorithm 2 Feature Transformation
Input:
    q    # Feature vector of the query image
    ẑ    # Feature transformation vector
Output:
    Transformed query vector p̂
1: q̂ ← q / ‖q‖_2    # L2 normalization
2: t ← q̂ − ẑ    # Subtracting the feature transformation vector from the query vector
3: p ← Maximum(t, 0)    # Replacing negative elements with 0
4: p̂ ← p / ‖p‖_2    # L2 normalization
5: return p̂    # Transformed query vector

3.2 Nearest Neighbor Search

In large-scale e-commerce, ANN searches outperform brute force in finding the nearest neighbors of a transformed query vector from the database vectors. ANN algorithms, such as IVFADC [2] and Rii [4], allow us to retrieve the nearest neighbors in a practical runtime. In our experiments, we used IVFADC to retrieve visually similar images from among 100 million images.
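Algorithms 1 and 2 can be illustrated with a short, dependency-free Python sketch. It assumes that Median is taken element-wise over the vectors of a category and that an all-zero vector is left unchanged by L2 normalization. The function names and the toy three-dimensional vectors are hypothetical; they stand in for the 1,792-dimensional MobileNetV2 features, and a production system would more likely use an array library.

```python
import math
from statistics import median


def l2_normalize(vec):
    """Scale vec to unit L2 norm; an all-zero vector is returned unchanged (assumption)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def elementwise_median(vectors):
    """Median of each dimension across equal-length vectors (assumed meaning of Median)."""
    return [median(column) for column in zip(*vectors)]


def generate_transformation_vector(fitted_by_category, flat_by_category):
    """Algorithm 1: average the clipped, L2-normalized gap vector of every category."""
    gap_vectors = []
    for fitted, flat in zip(fitted_by_category, flat_by_category):
        fitted_median = elementwise_median(fitted)
        flat_median = elementwise_median(flat)
        # Fitted minus flat, with negative elements replaced by 0.
        gap = [max(h - f, 0.0) for h, f in zip(fitted_median, flat_median)]
        gap_vectors.append(l2_normalize(gap))
    mean_gap = [sum(column) / len(gap_vectors) for column in zip(*gap_vectors)]
    return l2_normalize(mean_gap)


def transform_query(query, transformation_vector):
    """Algorithm 2: normalize, subtract the transformation vector, clip at 0, renormalize."""
    q_hat = l2_normalize(query)
    clipped = [max(q - z, 0.0) for q, z in zip(q_hat, transformation_vector)]
    return l2_normalize(clipped)


# Toy data: one category, three images per class, three feature dimensions.
fitted = [[[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.9, 0.1, 1.0]]]
flat = [[[0.0, 1.0, 1.0], [0.0, 1.0, 1.0], [0.1, 0.9, 1.0]]]
z_hat = generate_transformation_vector(fitted, flat)
p_hat = transform_query([3.0, 0.0, 4.0], z_hat)
print(z_hat, p_hat)
```

In this toy example the gap vector is concentrated on the first dimension, so the transformation suppresses that dimension of the query, while clipping at zero leaves the remaining nonnegative components in place.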
4 EXPERIMENTS

We conducted experiments to evaluate the proposed method. For these experiments, we collected 20,000 images from a C2C marketplace: half of these images were those of flat apparel and the remaining were fitted apparel images. The flat apparel images belong to ten categories, shown in the first column of Table 1. Fitted apparel images not belonging to the ten categories, such as jerseys and polo shirts, were also included. From the 20,000 images, 2,000 were used as query images: 100 images were randomly selected from each of the 10 categories for each of the flat and fitted apparel classes. The remaining 18,000 images were treated as database images. For fitted apparel queries, cropped images of the query objects were prepared manually from the original images to reduce the influence of the background.

The mean average precision at 100 (mAP@100), defined as follows, was used as an evaluation measure for each category.

\[
\mathrm{mAP@}K = \frac{\sum_{q=1}^{N} \mathrm{AP@}K(q)}{N},
\]

where

\[
\mathrm{AP@}K = \frac{\sum_{k=1}^{K} \mathrm{P@}k \cdot I(k)}{\sum_{k=1}^{K} I(k)}, \qquad
\mathrm{P@}k = \frac{\sum_{n=1}^{k} I(n)}{k},
\]

\[
I(i) =
\begin{cases}
1 & \text{if the $i$-th item is flat apparel in the same category as the query,} \\
0 & \text{otherwise.}
\end{cases}
\]

N is the number of query images, and AP@K and P@k indicate the average precision at K and precision at k for each query, respectively. A retrieved item is recognized as correctly selected only when it is an image of flat apparel in the same category as the query.

For baseline image representation, a vector from the global average pooling layer of MobileNetV2, described in Section 3.1, was used for query and database images. Our proposed method also uses the same feature extractor and transforms query vectors. Because the number of database images used in this experiment was relatively small, the nearest vectors were greedily retrieved using cosine similarity. Table 1 compares the mAP@100 of the baseline and proposed image representations for flat and fitted apparel query images. The results demonstrate a significant improvement for the fitted apparel queries in every category. Although query feature transformation was designed to close the gap between fitted and flat apparel vectors, it also positively influenced flat apparel queries. These results imply that our proposed method enables more essential features to be extracted from query images.

We also collected 100 million images belonging to over 1,000 item categories, including nonapparel images. For such large-scale data, ANN algorithms allow us to retrieve the nearest neighbors within a practical runtime. Figure 2 presents the visual search results with IVFADC [2] (code length per vector: 64 bytes, number of cells: 8,192, number of cells visited for each query: 64), from the 100 million images for fitted apparel and nonapparel queries. To demonstrate the versatility of the proposed method, the fitted apparel queries also contain images from a different dataset, ATR [3], which was originally used for a human parsing task. For fitted apparel queries, our proposed method retrieved a greater number of visually similar flat apparel items (1st–8th rows in Figure 2). In addition, no serious negative impact was observed for nonapparel queries: visually similar items to the query images were successfully extracted (9th–13th rows in Figure 2). The runtimes of the image feature extraction method and the nearest-100 vector search were approximately 40 and 70 ms, respectively, using an 8-core 2.3 GHz CPU. By simultaneously processing multiple query images and/or using GPUs, the runtime per query could be made faster.

5 CONCLUSION

This paper proposed an image representation technique for visual search at C2C e-commerce sites. The proposed method, comprising a deep CNN-based feature extraction and query feature transformation, significantly improves conventional visual search methods for comparing images of fitted and flat apparel. Additionally, the proposed method did not negatively impact either flat apparel or nonapparel queries in a serious manner. The performance and total runtimes of our visual search system were practical in the experiments described, indicating that it can be successfully deployed to a major online C2C marketplace. After the system is widely used in production, further improvement is expected using real query images and implicit feedback.
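The small-scale evaluation protocol of Section 4 can be sketched in plain Python. The sketch ranks database vectors by cosine similarity with an exhaustive scan and computes mAP@K from binary relevance indicators I(k). It assumes that AP@K is normalized by the number of relevant items among the top K and is 0 when there are none; the function names and toy data are hypothetical and do not reproduce the paper's dataset or results.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def rank_by_cosine(query, database, k):
    """Indices of the k database vectors most similar to the query (exhaustive scan)."""
    order = sorted(range(len(database)),
                   key=lambda i: cosine_similarity(query, database[i]),
                   reverse=True)
    return order[:k]


def average_precision_at_k(relevance, k):
    """AP@K from ranked 0/1 indicators I(1), ..., I(K)."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # P@k, counted only at relevant ranks
    return precision_sum / hits if hits else 0.0


def mean_average_precision_at_k(relevance_lists, k):
    """mAP@K: mean of AP@K over all queries."""
    total = sum(average_precision_at_k(r, k) for r in relevance_lists)
    return total / len(relevance_lists)


# Toy data: three database vectors and the relevance lists of two queries.
database = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
top2 = rank_by_cosine([1.0, 0.0], database, k=2)
map_at_4 = mean_average_precision_at_k([[1, 0, 1, 0], [0, 1, 0, 0]], k=4)
print(top2, map_at_4)
```

For the first toy query the relevant items sit at ranks 1 and 3, giving AP@4 = (1 + 2/3) / 2; the second query has its only relevant item at rank 2, giving AP@4 = 1/2.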

[Figure panels: Query Image; Baseline Image Representation; Proposed Method]

REFERENCES

[1] Hagiu and Simon. 2016. Network Effects Aren't Enough. Harvard Business Review 94, 4 (April 2016), 65–71.

[2] Jegou, Douze, and Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (Jan. 2011), 117–128.

[3] Liang, Liu, Shen, Yang, L. Liu, Dong, Lin, and Yan. 2015. Deep Human Parsing with Active Template Regression. IEEE Trans. Pattern Anal. Mach. Intell. 37, 12 (Dec. 2015), 2402–2414.

[4] Matsui, Hinami, and Satoh. 2018. Reconfigurable Inverted Index. In Proceedings of the 26th ACM International Conference on Multimedia (MM '18). 1715–1723.

[5] Nair and G. E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML '10). 807–814.

[6] Sandler, Howard, Zhu, Zhmoginov, and Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '18).

[7] Song, Jiang, and Herranz. 2017. Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold. Trans. Img. Proc. 26, 6 (June 2017), 2721–2735.

[8] A. Babenko Yandex and Lempitsky. 2015. Aggregating Local Deep Features for Image Retrieval. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15). 1269–1277.

[9] Yang, Kale, Bubnov, Stein, Wang, Kiapour, and Piramuthu. 2017. Visual Search at eBay. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). 2101–2110.

[10] Zhai, Kislyuk, Jing, Feng, Tzeng, Donahue, Y. L. Du, and Darrell. 2017. Visual Discovery at Pinterest. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). 515–524.

[11] Zhang, P. Pan, Zheng, Zhao, Zhang, Ren, and Jin. 2018. Visual Search at Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '18). 993–1001.