Closing the Gap Between Query and Database through Query Feature Transformation in C2C e-Commerce Visual Search

Takuma Yamaguchi, Kosuke Arase, Riku Togashi, Shunya Ueta
Mercari, Inc., Tokyo, Japan
{kumon,kosuke.arase,riktor,hurutoriya}@mercari.com

ABSTRACT
This paper introduces an image representation technique for visual search on a consumer-to-consumer (C2C) e-commerce website. Visual search at such websites cannot effectively close the gap between query images taken by users and database images. The proposed technique consists of extraction of a lightweight deep CNN-based feature vector and transformation of the query feature. Our quantitative and qualitative experiments using datasets from an online C2C marketplace with over one billion items show that this image representation technique with our query image feature transformation can improve users' visual search experience, particularly when searching for apparel items, without negative side effects on nonapparel items.

Figure 1: Query image and its visual search results among 100 million items. (a) Query Image (b) Baseline (c) Proposed

CCS CONCEPTS
• Information systems → Image search.

KEYWORDS
Content-based Image Retrieval, Deep Learning, e-Commerce

ACM Reference Format:
Takuma Yamaguchi, Kosuke Arase, Riku Togashi, Shunya Ueta. 2019. Closing the Gap Between Query and Database through Query Feature Transformation in C2C e-Commerce Visual Search. In Proceedings of the SIGIR 2019 Workshop on eCommerce (SIGIR 2019 eCom), 4 pages.

1 INTRODUCTION
The explosive increase of online photos, driven by social networking and e-commerce sites, has focused researchers' attention on visual search, also called content-based image retrieval [9–11]. Many newly posted photos are listed on consumer-to-consumer (C2C) e-commerce sites, where most sellers are not professional photographers or retailers; therefore, buyers are often stymied by the poor quality or limited quantity of item information and keywords. Moreover, buyers might not even know the correct keywords to use to find their desired items. In such a situation, image-based item searches may improve the user experience.

Algorithms for extracting image features based on deep convolutional neural networks (CNNs) [7, 8] and approximate nearest neighbor (ANN) search [2, 4] can be used to realize a simple visual search system. However, even if these simple systems can retrieve visually similar images, their results could be nonoptimal. C2C e-commerce site search algorithms tend to retrieve items listed by professional sellers even if more relevant items are listed by nonprofessional sellers, because the query images are often more visually similar to images taken by professionals than to those provided by nonprofessionals, especially in apparel categories. Specifically, fitted apparel images (Figure 1b) are likely to be retrieved in response to a fitted apparel query image (Figure 1a). In this paper, we call apparel "fitted" if it is pictured being worn by a model and "flat" if it is instead laid flat on a surface. Professional and nonprofessional sellers tend to upload fitted and flat apparel images, respectively. Searches that return many items listed by professional sellers can cause problems for C2C e-commerce sites, for example, by hurting buyer experience and discouraging nonprofessional sellers from listing items [1].

To manage these issues so as to retrieve more flat apparel items, we developed an image representation technique that closes the visual gap between fitted apparel query images and flat apparel images in a database. The technique consists of extracting features using a lightweight deep CNN and transforming query features; it enables the retrieval of flat apparel images (Figure 1c) from a fitted apparel query image (Figure 1a). Moreover, the feature transformation step can be applied to any query vector because it causes no significant side effects on flat apparel and nonapparel query vectors.
Copyright © 2019 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, S. Kallumadi, U. Porwal, A. Trotman (eds.): Proceedings of the SIGIR 2019 eCom workshop, July 2019, Paris, France, published at http://ceur-ws.org

Thus, additional information on whether the query image contains fitted apparel is not required before feature transformation. Our experiments demonstrate that more flat apparel images are correctly discovered through query feature transformation for a fitted apparel query image, without serious negative impacts on flat apparel and nonapparel query images.

2 RELATED WORK
Some e-commerce sites, such as Alibaba and eBay, have introduced visual search systems that enable users to search for products using images [9, 11]. These systems are basically composed of deep CNN-based feature extraction and a nearest neighbor search. A method of discovering more items relevant to a query image involves training a deep CNN model with a triplet loss; however, building and updating a dataset for training such models is infeasible for a marketplace with a massive and volatile inventory. Although implicit feedback, such as page views and click logs, allows for model training with a triplet loss even in such a marketplace [11], implicit feedback is available only after launching a visual search system into production. This paper proposes an image representation method with query feature transformation; this method closes the gap between a fitted apparel query vector and flat apparel database vectors without time-consuming human relevance assessments.

3 VISUAL SEARCH ARCHITECTURE
The proposed visual search architecture simply consists of image feature extraction, query feature transformation, and a nearest neighbor vector search. For C2C e-commerce sites specifically, the feature transformation closes the distance between a fitted apparel query vector and flat apparel database vectors. An approximate nearest neighbor (ANN) algorithm accomplishes the nearest neighbor search in a large database within a practical runtime.

3.1 Image Representation
3.1.1 Feature Extraction Model. For feature extraction, we adopted MobileNetV2 [6], which is a state-of-the-art lightweight CNN model. Sending query images at a large scale to an e-commerce visual search system from user devices can cause network traffic problems.

Algorithm 1 Generate Feature Transformation Vector
Input:
  h_{c,n}, c ∈ {1, ..., C}, n ∈ {1, ..., N_c}  # feature vectors of fitted apparel images
  l_{c,m}, c ∈ {1, ..., C}, m ∈ {1, ..., M_c}  # feature vectors of flat apparel images
  h_{c,n} and l_{c,m} represent the n-th and m-th feature vectors of category c, respectively.
Output: feature transformation vector ẑ
1: for c ← 1 to C do
2:   h̄_c ← Median(h_{c,1}, ..., h_{c,N_c})  # fitted apparel vector (median of the fitted apparel vectors)
3:   l̄_c ← Median(l_{c,1}, ..., l_{c,M_c})  # flat apparel vector (median of the flat apparel vectors)
4:   t_c ← h̄_c − l̄_c  # subtract the flat apparel vector from the fitted apparel vector
5:   z_c ← Maximum(t_c, 0)  # replace negative elements with 0
6:   ẑ_c ← z_c / ‖z_c‖_2  # L2 normalization; ẑ_c is the gap vector of category c
7: z ← Average(ẑ_1, ..., ẑ_C)  # average the gap vectors
8: ẑ ← z / ‖z‖_2  # L2 normalization
9: return ẑ  # feature transformation vector

Algorithm 2 Feature Transformation
Input:
  q,  # feature vector of the query image
  ẑ   # feature transformation vector
Output: transformed query vector p̂
1: q̂ ← q / ‖q‖_2  # L2 normalization
2: t ← q̂ − ẑ  # subtract the feature transformation vector from the query vector
3: p ← Maximum(t, 0)  # replace negative elements with 0
4: p̂ ← p / ‖p‖_2  # L2 normalization
5: return p̂  # transformed query vector
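Algorithms 1 and 2 can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' code; the function names and the representation of the per-category inputs as lists of arrays are our own assumptions.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    # L2-normalize a vector, guarding against division by zero.
    return v / max(np.linalg.norm(v), eps)

def generate_transformation_vector(fitted, flat):
    """Algorithm 1. `fitted` and `flat` are lists with one entry per
    category; each entry is a (num_images, dim) array of feature vectors."""
    gap_vectors = []
    for h_c, l_c in zip(fitted, flat):
        h_bar = np.median(h_c, axis=0)         # fitted apparel (median) vector
        l_bar = np.median(l_c, axis=0)         # flat apparel (median) vector
        t_c = h_bar - l_bar                    # fitted minus flat
        z_c = np.maximum(t_c, 0)               # replace negative elements with 0
        gap_vectors.append(l2_normalize(z_c))  # gap vector of category c
    z = np.mean(gap_vectors, axis=0)           # average the gap vectors
    return l2_normalize(z)                     # feature transformation vector

def transform_query(q, z_hat):
    """Algorithm 2: shift a query vector toward the flat-apparel region."""
    t = l2_normalize(q) - z_hat                # subtract transformation vector
    p = np.maximum(t, 0)                       # replace negative elements with 0
    return l2_normalize(p)                     # transformed query vector
```

Because the underlying features come from ReLU activations and are nonnegative, zeroing the negative elements after the subtraction keeps the transformed query in the same nonnegative space as the database vectors.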
One solution to this network traffic issue is edge computing, through which image features are extracted on an edge device or a smart device. Such a lightweight extraction model works efficiently on an edge device and consumes only several megabytes of memory space.

We prepared a dataset consisting of images and their metadata collected from an online C2C marketplace with over one billion listings. The dataset has 9 million images belonging to 14,000 classes, which are combinations of item brands, textures, and categories – for example, Nike striped men's golf polo. Images from nonapparel categories, such as laptops, bikes, and toys, are included in the dataset.

One of the model's hyperparameters is a width multiplier [6]; for a given layer and width multiplier α, the number of output channels N becomes αN. The model was trained on the dataset with a width multiplier of 1.4. The output of the global average pooling layer was used as an image feature vector with 1,792 (1,280 × 1.4) dimensions. The feature vectors of the query and database images were then extracted using the same feature extractor.

3.1.2 Query Feature Transformation. Only the query feature vectors were calibrated using a feature transformation vector, which intuitively expresses a human feature vector, to close the gap between a fitted apparel query feature vector and flat apparel database feature vectors. The feature transformation vector was trained through Algorithm 1 with 80,040 images (54,145 of fitted apparel and 25,895 of flat apparel) belonging to 15 apparel categories, such as tops, jackets, pants, and hats. In the training step, a gap vector, which represents the difference between fitted and flat apparel feature vectors, was calculated for each category, and the feature transformation vector was computed by averaging the gap vectors. For a query, the transformation simply subtracts the feature transformation vector from the query image feature vector (Algorithm 2); its computation time is negligibly small.

The feature vector extracted from MobileNetV2 initially lacks negative value elements owing to the use of the rectified linear unit (ReLU) activation function [5]. Negative value elements in the feature vector space can therefore be treated as unnecessary; that is, they are replaced with zero in Algorithms 1 and 2, a step that is key to preventing side effects in query feature transformation. Even if the feature transformation designed for a fitted apparel query vector is applied to a flat apparel or nonapparel query vector, the essential feature is still preserved by removing negative value elements.

Table 1: Visual Search Results for Apparel Categories (mAP@100)

                     Flat Apparel         Fitted Apparel       Fitted Apparel (Cropped)
                     Baseline  Proposed   Baseline  Proposed   Baseline  Proposed
T-Shirts             0.844     0.895      0.004     0.376      0.042     0.542
Sweaters             0.926     0.967      0.002     0.456      0.053     0.670
Hoodies              0.942     0.977      0.053     0.691      0.211     0.756
Denim Jackets        0.982     0.993      0.004     0.778      0.041     0.850
Down Jackets         0.972     0.995      0.115     0.815      0.313     0.866
Jeans                0.878     0.822      0.001     0.381      0.095     0.737
Casual Pants         0.889     0.933      0.002     0.475      0.139     0.690
Knee-Length Skirts   0.718     0.732      0.000     0.081      0.090     0.257
Long Skirts          0.567     0.614      0.004     0.180      0.041     0.244
Dresses              0.847     0.922      0.001     0.226      0.018     0.254

3.2 Nearest Neighbor Search
In large-scale e-commerce, ANN searches outperform brute force in finding the nearest neighbors of a transformed query vector among the database vectors. ANN algorithms, such as IVFADC [2] and Rii [4], allow us to retrieve the nearest neighbors in a practical runtime. In our experiments, we used IVFADC to retrieve visually similar images from among 100 million images.
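For L2-normalized feature vectors, the exact (brute-force) variant of this nearest neighbor search reduces to a matrix product, since cosine similarity equals the dot product. A minimal sketch (the function name is our own; production systems would use an ANN index such as IVFADC instead of a full scan):

```python
import numpy as np

def search(query, database, k=5):
    """Brute-force nearest neighbor search by cosine similarity.
    query: (dim,) vector; database: (num_items, dim) matrix.
    Both are assumed L2-normalized, so cosine similarity is a dot product."""
    scores = database @ query        # cosine similarity to every database item
    top_k = np.argsort(-scores)[:k]  # indices of the k most similar items
    return top_k, scores[top_k]
```

At the scale of 100 million vectors this full scan is impractical; IVFADC [2] instead assigns database vectors to coarse cells (inverted lists) and compares compact product-quantized codes only within the cells visited for each query.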
4 EXPERIMENTS
We conducted experiments to evaluate the proposed method. For these experiments, we collected 20,000 images from a C2C marketplace: half of these images were of flat apparel and the remainder were fitted apparel images. The flat apparel images belong to ten categories, shown in the first column of Table 1. Fitted apparel images not belonging to the ten categories, such as jerseys and polo shirts, were also included. From the 20,000 images, 2,000 were used as query images; 100 images were randomly selected from the 10 categories for each of the flat and fitted apparel classes. The remaining 18,000 images were treated as database images. For fitted apparel queries, cropped images of the query objects were prepared manually from the original images to reduce the influence of the background.

The mean average precision at 100 (mAP@100), defined as follows, was used as an evaluation measure for each category:

    mAP@K = (1/N) Σ_{q=1}^{N} AP@K(q),

    AP@K = Σ_{k=1}^{K} (P@k · I(k)) / Σ_{k=1}^{K} I(k),    P@k = Σ_{n=1}^{k} I(n) / k,

    I(i) = 1 if the i-th item is flat apparel in the same category as the query, 0 otherwise,

where N is the number of query images, and AP@K and P@k indicate the average precision at K and the precision at k for each query, respectively. A retrieved item is recognized as correctly selected only when it is an image of flat apparel in the same category as the query.

For the baseline image representation, a vector from the global average pooling layer of MobileNetV2, described in Section 3.1, was used for query and database images. Our proposed method also uses the same feature extractor and transforms query vectors. Because the number of database images used in this experiment was relatively small, the nearest vectors were greedily retrieved using cosine similarity. Table 1 compares the mAP@100 of the baseline and proposed image representations for flat and fitted apparel query images. The results demonstrate a significant improvement for the fitted apparel queries in every category. Although query feature transformation was designed to close the gap between fitted and flat apparel vectors, it also positively influenced flat apparel queries. These results imply that our proposed method enables more essential features to be extracted from query images.

We also collected 100 million images belonging to over 1,000 item categories, including nonapparel images. For such large-scale data, ANN algorithms allow us to retrieve the nearest neighbors within a practical runtime. Figure 2 presents the visual search results with IVFADC [2] (code length per vector: 64 bytes; number of cells: 8,192; number of cells visited for each query: 64) from the 100 million images for fitted apparel and nonapparel queries. To demonstrate the versatility of the proposed method, the fitted apparel queries also contain images from a different dataset, ATR [3], which was originally used for a human parsing task. For fitted apparel queries, our proposed method retrieved a greater number of visually similar flat apparel items (1st–8th rows in Figure 2). In addition, no serious negative impact was observed for nonapparel queries: items visually similar to the query images were successfully retrieved (9th–13th rows in Figure 2). The runtimes of the image feature extraction method and the nearest-100 vector search were approximately 40 and 70 ms, respectively, using an 8-core 2.3 GHz CPU. By simultaneously processing multiple query images and/or using GPUs, the runtime per query could be made faster.

5 CONCLUSION
This paper proposed an image representation technique for visual search at C2C e-commerce sites. The proposed method, comprising deep CNN-based feature extraction and query feature transformation, significantly improves on conventional visual search methods for comparing images of fitted and flat apparel. Additionally, the proposed method did not negatively impact either flat apparel or nonapparel queries in any serious manner. The performance and total runtimes of our visual search system were practical in the experiments described, indicating that it can be successfully deployed to a major online C2C marketplace. After the system is widely used in production, further improvement is expected using real query images and implicit feedback.

Figure 2: Visual search results from 100 million images. The first column shows query images and the next seven columns show the results with the baseline image representation. The remaining columns show the results obtained using query feature transformation. Our method successfully retrieved more flat apparel images corresponding to the fitted apparel queries without negatively impacting nonapparel queries. (Row groups: fitted apparel; fitted apparel from the ATR dataset; nonapparel.)

REFERENCES
[1] A. Hagiu and R. Simon. 2016. Network Effects Aren't Enough. Harvard Business Review 94, 4 (April 2016), 65–71.
[2] H. Jegou, M. Douze, and C. Schmid. 2011. Product Quantization for Nearest Neighbor Search. IEEE Trans. Pattern Anal. Mach. Intell. 33, 1 (Jan. 2011), 117–128.
[3] X. Liang, S. Liu, X. Shen, J. Yang, L. Liu, J. Dong, L. Lin, and S. Yan. 2015. Deep Human Parsing with Active Template Regression. IEEE Trans. Pattern Anal. Mach. Intell. 37, 12 (Dec. 2015), 2402–2414.
[4] Y. Matsui, R. Hinami, and S. Satoh. 2018. Reconfigurable Inverted Index. In Proceedings of the 26th ACM International Conference on Multimedia (MM '18). 1715–1723.
[5] V. Nair and G. E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML '10). 807–814.
[6] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. 2018. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR '18).
[7] X. Song, S. Jiang, and L. Herranz. 2017. Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold. Trans. Img. Proc. 26, 6 (June 2017), 2721–2735.
[8] A. Babenko and V. Lempitsky. 2015. Aggregating Local Deep Features for Image Retrieval. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15). 1269–1277.
[9] F. Yang, A. Kale, Y. Bubnov, L. Stein, Q. Wang, H. Kiapour, and R. Piramuthu. 2017. Visual Search at eBay. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). 2101–2110.
[10] A. Zhai, D. Kislyuk, Y. Jing, M. Feng, E. Tzeng, J. Donahue, Y. L. Du, and T. Darrell. 2017. Visual Discovery at Pinterest. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion). 515–524.
[11] Y. Zhang, P. Pan, Y. Zheng, K. Zhao, Y. Zhang, X. Ren, and R. Jin. 2018. Visual Search at Alibaba. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '18). 993–1001.
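As a worked example of the evaluation measure defined in Section 4, mAP@K can be implemented as follows. This is an illustrative sketch, not the authors' evaluation code; it assumes the per-query relevance indicators I(k) are given as boolean lists over the ranked results.

```python
def average_precision_at_k(relevant, k=100):
    """AP@K for one query. `relevant[i]` is True when the (i+1)-th retrieved
    item is flat apparel in the same category as the query (the I(k)
    indicator in the paper). Returns 0.0 when no relevant item is retrieved."""
    relevant = relevant[:k]
    hits = 0
    precision_sum = 0.0
    for i, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            precision_sum += hits / i  # P@i, accumulated only at relevant ranks
    return precision_sum / hits if hits else 0.0

def mean_average_precision_at_k(results, k=100):
    """mAP@K: mean of AP@K over all queries; `results` is a list of
    per-query boolean relevance lists."""
    return sum(average_precision_at_k(r, k) for r in results) / len(results)
```

Note that the denominator of AP@K here is the number of relevant items retrieved within the top K (Σ I(k)), matching the formula above rather than the total number of relevant items in the database.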