<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Closing the Gap Between Query and Database through Query Feature Transformation in C2C e-Commerce Visual Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Takuma Yamaguchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kosuke Arase</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riku Togashi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shunya Ueta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Mercari, Inc.</institution>
          ,
          <addr-line>Tokyo, Japan</addr-line>
          <email>{kumon, kosuke.arase, riktor, hurutoriya}@mercari.com</email>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <abstract>
        <p>This paper introduces an image representation technique for visual search on a consumer-to-consumer (C2C) e-commerce website. Visual searching at such websites cannot effectively close the gap between query images taken by users and database images. The proposed technique consists of extraction of a lightweight deep CNN-based feature vector and transformation of a query feature. Our quantitative and qualitative experiments using datasets from an online C2C marketplace with over one billion items show that this image representation technique with our query image feature transformation can improve users' visual search experience, particularly when searching for apparel items, without negative side effects on nonapparel items.</p>
      </abstract>
      <kwd-group>
        <kwd>Content-based Image Retrieval</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>e-Commerce</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Image search.</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>The explosive increase of online photos, driven by social networking and e-commerce sites, has focused researchers' attention on visual search, also called content-based image retrieval [9–11]. Many newly posted photos are listed on consumer-to-consumer (C2C) e-commerce sites, where most sellers are not professional photographers or retailers; therefore, buyers are often stymied by the poor quality or limited quantity of item information and keywords. Moreover, buyers might not even know the correct keywords to use to find their desired items. In such a situation, image-based item searches may improve the user experience.</p>
      <p>Algorithms for extracting image features based on deep convolutional neural networks (CNNs) [7, 8] and approximate nearest neighbor (ANN) search [2, 4] can be used to realize a simple visual search system. However, even if these simple systems can retrieve visually similar images, their results could be nonoptimal. C2C e-commerce site search algorithms tend to extract items listed by professional sellers even if more relevant items are listed by nonprofessional sellers because the query images are often more visually similar to images taken by professionals than those provided by nonprofessionals, especially in apparel categories. Specifically, fitted apparel images (Figure 1b) are likely to be retrieved in response to a fitted apparel query image (Figure 1a). In this paper, we call apparel "fitted" if it is pictured being worn by a model and "flat" if it is instead laid flat on a surface. Professional and nonprofessional sellers tend to upload fitted and flat apparel images, respectively. Searches that return many items listed by professional sellers can cause problems for C2C e-commerce sites, for example, by hurting buyer experience and discouraging nonprofessional sellers from listing items [1].</p>
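      <p>A simple visual search system of the kind described above can be sketched as follows. This is only an illustration under stated assumptions: the toy three-dimensional vectors stand in for CNN features, an exhaustive cosine-similarity scan stands in for an ANN index, and the function names (l2_normalize, cosine_top_k) are hypothetical rather than taken from the paper.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_top_k(query, database, k):
    """Return indices of the k database vectors most similar to the query."""
    q = l2_normalize(query)
    scores = []
    for idx, vec in enumerate(database):
        d = l2_normalize(vec)
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    # Sort by descending similarity; ties keep the lower index first.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:k]]


# Toy "features" standing in for CNN outputs (hypothetical data).
database = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
nearest = cosine_top_k(query, database, k=2)
print(nearest)
```
]]></preformat>
      <p>Because such a system ranks purely by visual similarity, a query photographed in one style tends to pull in database images photographed in the same style, which is the behavior discussed above.</p>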
      <p>To manage these issues so as to retrieve more flat apparel items, we developed an image representation technique that closes the visual gap between fitted apparel query images and flat apparel images in a database. The technique consists of extracting features using a lightweight deep CNN and transforming query features; it enables the retrieval of flat apparel images (Figure 1c) from a fitted apparel query image (Figure 1a). Moreover, the feature transformation step can be applied to any query vector because it causes no significant side effects to flat apparel and nonapparel query vectors. Thus, additional information of whether the query image contains fitted apparel is not required before feature transformation. Our experiments demonstrate that more flat apparel images are correctly discovered through query feature transformation for a fitted apparel query image without serious negative impacts on flat apparel and nonapparel query images.</p>
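      <p>The query feature transformation summarized above, which the paper specifies as Algorithms 1 and 2, can be illustrated with a minimal sketch. The function names, the dictionary layout keyed by category, and the toy three-dimensional vectors are assumptions made for illustration; the features used in the paper have 1,792 dimensions and come from a trained CNN.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from statistics import median


def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def elementwise_median(vectors):
    """Median vector: the median of each dimension taken independently."""
    return [median(column) for column in zip(*vectors)]


def gap_vector(fitted, flat):
    """Per-category gap: median(fitted) - median(flat), clipped at 0, normalized."""
    h_bar = elementwise_median(fitted)
    l_bar = elementwise_median(flat)
    return l2_normalize([max(h - l, 0.0) for h, l in zip(h_bar, l_bar)])


def transformation_vector(fitted_by_cat, flat_by_cat):
    """Average the per-category gap vectors and L2-normalize the result."""
    gaps = [gap_vector(fitted_by_cat[c], flat_by_cat[c]) for c in fitted_by_cat]
    mean = [sum(column) / len(gaps) for column in zip(*gaps)]
    return l2_normalize(mean)


def transform_query(query, z_hat):
    """Subtract z_hat from the normalized query, clip at 0, renormalize."""
    q_hat = l2_normalize(query)
    return l2_normalize([max(q - z, 0.0) for q, z in zip(q_hat, z_hat)])


# Hypothetical toy data: dimension 0 responds to "worn by a model",
# dimensions 1 and 2 respond to the item itself.
fitted_by_cat = {
    "tops": [[2, 1, 0], [2, 1, 0], [3, 1, 0]],
    "pants": [[2, 0, 1], [2, 0, 1], [2, 0, 1]],
}
flat_by_cat = {
    "tops": [[0, 1, 0], [0, 1, 0], [0, 2, 0]],
    "pants": [[0, 0, 1], [0, 0, 1], [0, 0, 2]],
}
z_hat = transformation_vector(fitted_by_cat, flat_by_cat)
p_fitted = transform_query([2, 1, 0], z_hat)  # fitted "tops" query
p_flat = transform_query([0, 1, 0], z_hat)    # flat "tops" query
print(z_hat, p_fitted, p_flat)
```
]]></preformat>
      <p>In this toy setup the fitted query loses its "model" component and ends up pointing in the same direction as the flat query, while the flat query itself is left unchanged because clipping at zero discards the negative element created by the subtraction.</p>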
    </sec>
    <sec id="sec-3">
      <title>2 RELATED WORK</title>
      <p>Some e-commerce sites, such as Alibaba and eBay, have introduced visual search systems that enable users to search for products using images [9, 11]. These systems are basically composed of deep CNN-based feature extraction and a nearest neighbor search. A method of discovering more items relevant to a query image involves the training of a deep CNN model with a triplet loss; however, building and updating a dataset for training such models is infeasible for a massive and volatile inventory marketplace. Although implicit feedback, such as page views and click logs, allows for model training with a triplet loss even in such a marketplace [11], implicit feedback is available only after launching a visual search system into production. This paper proposes an image representation method with query feature transformation; this method closes the gap between a fitted apparel query vector and flat apparel database vectors without time-consuming human relevance assessments.</p>
    </sec>
    <sec>
      <title>3 VISUAL SEARCH ARCHITECTURE</title>
      <p>The proposed visual search architecture simply consists of image feature extraction, query feature transformation, and a nearest neighbor vector search. For C2C e-commerce sites specifically, this feature transformation closes the distance between a fitted apparel query vector and flat apparel database vectors. An approximate nearest neighbor (ANN) algorithm accomplishes the nearest neighbor search in a large database within a practical runtime.</p>
      <sec>
        <title>3.1 Image Representation</title>
        <sec>
          <title>3.1.1 Feature Extraction Model</title>
          <p>For feature extraction, we adopted MobileNetV2 [6], which is a state-of-the-art lightweight CNN model. Sending query images at a large scale to an e-commerce visual search system from user devices can cause network traffic problems. One solution to this issue is edge computing, through which image features are extracted on an edge device or a smart device. Such a lightweight extraction model works efficiently in an edge device and consumes only several megabytes of memory space.</p>
          <p>We prepared a dataset consisting of images and their metadata collected from an online C2C marketplace with over one billion listings. The dataset has 9 million images belonging to 14,000 classes, which are combinations of item brands, textures, and categories, for example, Nike striped men's golf polo. Images from nonapparel categories, such as laptops, bikes, and toys, are included in the dataset.</p>
          <p>One of the model's hyperparameters is a width multiplier [6]; for a given layer and width multiplier α, the number of output channels N becomes αN. The model was trained on the dataset with a width multiplier of 1.4. The output of the global average pooling layer was used as an image feature vector that has 1,792 (1,280 × 1.4) dimensions. Then, the feature vectors of the query and database images were extracted using the same feature extractor.</p>
        </sec>
        <sec>
          <title>3.1.2 Query Feature Transformation</title>
          <p>Only the query feature vectors were calibrated using a feature transformation vector, which expresses a human feature vector intuitively, to close the gap between a fitted apparel query feature vector and flat apparel database feature vectors. The feature transformation vector was trained through Algorithm 1 with 80,040 images (54,145 of fitted apparel and 25,895 of flat apparel) belonging to 15 apparel categories, such as tops, jackets, pants, and hats. In the training step, a gap vector, which represents the difference between fitted and flat apparel feature vectors, was calculated for each category and the feature transformation vector was computed by averaging the gap vectors. For a query, the transformation simply subtracts the feature transformation vector from a query image feature vector (Algorithm 2); its computation time is negligibly small.</p>
          <p>Algorithm 1: Generate Feature Transformation Vector</p>
          <preformat>Input:
  h_{c,n}, c ∈ {1, ..., C}, n ∈ {1, ..., N_c}   # Feature vectors of fitted apparel images
  l_{c,m}, c ∈ {1, ..., C}, m ∈ {1, ..., M_c}   # Feature vectors of flat apparel images
  h_{c,n} and l_{c,m} represent the n-th and m-th feature vectors of category c, respectively.
Output:
  Feature transformation vector ẑ
1: for c ← 1 to C do
2:   h̄_c ← Median(h_{c,1}, ..., h_{c,N_c})   # Fitted apparel vector (median vector of the fitted apparel vectors)
3:   l̄_c ← Median(l_{c,1}, ..., l_{c,M_c})   # Flat apparel vector (median vector of the flat apparel vectors)
4:   t_c ← h̄_c − l̄_c                         # Subtracting the flat apparel vector from the fitted apparel vector
5:   z_c ← Maximum(t_c, 0)                   # Replacing negative elements with 0
6:   ẑ_c ← z_c / ‖z_c‖_2                     # L2 normalization; ẑ_c is the gap vector of category c
7: z ← Average(ẑ_1, ..., ẑ_C)                # Averaging the gap vectors
8: ẑ ← z / ‖z‖_2                             # L2 normalization
9: return ẑ                                  # Feature transformation vector</preformat>
          <p>Algorithm 2: Feature Transformation</p>
          <preformat>Input:
  q   # Feature vector of the query image
  ẑ   # Feature transformation vector
Output:
  Transformed query vector p̂
1: q̂ ← q / ‖q‖_2        # L2 normalization
2: t ← q̂ − ẑ            # Subtracting the feature transformation vector from the query vector
3: p ← Maximum(t, 0)    # Replacing negative elements with 0
4: p̂ ← p / ‖p‖_2        # L2 normalization
5: return p̂             # Transformed query vector</preformat>
          <p>The feature vector extracted from MobileNetV2 initially lacks negative value elements owing to the use of the rectified linear unit (ReLU) activation function [5]. The negative value elements that arise in the feature vector space can therefore be treated as unnecessary; that is, they are replaced with zero in Algorithms 1 and 2, a step that is key to preventing side effects in query feature transformation. Even if the feature transformation designed for a fitted apparel query vector is applied to a flat apparel or nonapparel query vector, the essential feature is still preserved by removing negative value elements.</p>
        </sec>
      </sec>
      <sec>
        <title>3.2 Nearest Neighbor Search</title>
        <p>In large-scale e-commerce, ANN searches outperform brute-force search in finding the nearest neighbors of a transformed query vector from the database vectors. ANN algorithms, such as IVFADC [2] and Rii [4], allow us to retrieve the nearest neighbors in a practical runtime. In our experiments, we used IVFADC to retrieve visually similar images from among 100 million images.</p>
      </sec>
    </sec>
    <sec>
      <title>4 EXPERIMENTS</title>
      <p>We conducted experiments to evaluate the proposed method. For these experiments, we collected 20,000 images from a C2C marketplace: half of these images were those of flat apparel and the remaining were fitted apparel images. The flat apparel images belong to ten categories, shown in the first column of Table 1. Fitted apparel images not belonging to the ten categories, such as jerseys and polo shirts, were also included. From the 20,000 images, 2,000 were used as query images: 100 images were randomly selected from each of the 10 categories for each of the flat and fitted apparel classes. The remaining 18,000 images were treated as database images. For fitted apparel queries, cropped images of the query objects were prepared manually from the original images to reduce the influence of the background.</p>
      <p>The mean average precision at 100 (mAP@100), defined as follows, was used as an evaluation measure for each category.</p>
      <disp-formula><tex-math><![CDATA[\mathrm{mAP}@K = \frac{\sum_{q=1}^{N} \mathrm{AP}@K(q)}{N},]]></tex-math></disp-formula>
      <p>where</p>
      <disp-formula><tex-math><![CDATA[\mathrm{AP}@K(q) = \frac{\sum_{k=1}^{K} P@k \cdot I(k)}{\sum_{k=1}^{K} I(k)},]]></tex-math></disp-formula>
      <disp-formula><tex-math><![CDATA[P@k = \frac{\sum_{n=1}^{k} I(n)}{k},]]></tex-math></disp-formula>
      <disp-formula><tex-math><![CDATA[I(i) = \begin{cases} 1 & \text{if the $i$-th item is flat apparel in the same category as the query,} \\ 0 & \text{otherwise.} \end{cases}]]></tex-math></disp-formula>
      <p>N is the number of query images, and AP@K and P@k indicate the average precision at K and precision at k for each query, respectively. A retrieved item is recognized as correctly selected only when it is an image of flat apparel in the same category as the query.</p>
      <p>For baseline image representation, a vector from the global average pooling layer of MobileNetV2, described in Section 3.1, was used for query and database images. Our proposed method also uses the same feature extractor and transforms query vectors. Because the number of database images used in this experiment was relatively small, the nearest vectors were greedily retrieved using cosine similarity. Table 1 compares mAP@100 of the baseline and proposed image representations for flat and fitted apparel query images. The results demonstrate a significant improvement for the fitted apparel queries in every category. Although query feature transformation was designed to close the gap between fitted and flat apparel vectors, it also positively influenced flat apparel queries. These results imply that our proposed method enables more essential features to be extracted from query images.</p>
      <p>We also collected 100 million images belonging to over 1000 item categories, including nonapparel images. For such large-scale data, ANN algorithms allow us to retrieve the nearest neighbors within a practical runtime. Figure 2 presents the visual search results with IVFADC [2] (code length per vector: 64 bytes, number of cells: 8,192, number of cells visited for each query: 64), from the 100 million images for fitted apparel and nonapparel queries. To demonstrate the versatility of the proposed method, the fitted apparel queries also contain images from a different dataset, ATR [3], which was originally used for a human parsing task. For fitted apparel queries, our proposed method retrieved a greater number of visually similar flat apparel items (1st–8th rows in Figure 2). In addition, no serious negative impact was observed for nonapparel queries: visually similar items to the query images were successfully extracted (9th–13th rows in Figure 2). The runtimes of the image feature extraction method and the nearest-100 vector search were approximately 40 and 70 ms, respectively, using an 8-core 2.3 GHz CPU. By simultaneously processing multiple query images and/or using GPUs, the runtime per query could be made faster.</p>
    </sec>
    <sec id="sec-4">
      <title>5 CONCLUSION</title>
      <p>This paper proposed an image representation technique for visual search at C2C e-commerce sites. The proposed method, comprising a deep CNN-based feature extraction and query feature transformation, significantly improves conventional visual search methods for comparing images of fitted and flat apparel. Additionally, the proposed method did not negatively impact either flat apparel or nonapparel queries in a serious manner. The performance and total runtimes of our visual search system were practical in the experiments described, indicating that it can be successfully deployed to a major online C2C marketplace. After the system is widely used in production, further improvement is expected using real query images and implicit feedback.</p>
      <p>[Figure panels: Query Image; Baseline Image Representation; Proposed Method]</p>
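      <p>As a supplement to the evaluation described in Section 4, the mAP@K measure can be sketched as follows. The sketch assumes the standard form of AP@K, in which the precision values at relevant ranks are averaged over the number of relevant items retrieved in the top K; the function names, the 0/1 relevance lists, and the convention of returning 0 when nothing relevant is retrieved are illustrative assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_at_k(relevance, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(relevance[:k]) / k


def average_precision_at_k(relevance, k):
    """AP@K: mean of P@k over the relevant ranks within the top K."""
    top = relevance[:k]
    hits = sum(top)
    if hits == 0:
        return 0.0  # Assumed convention when no relevant item is retrieved.
    total = sum(precision_at_k(top, i + 1) for i, rel in enumerate(top) if rel)
    return total / hits


def mean_average_precision_at_k(relevance_lists, k):
    """mAP@K: AP@K averaged over all queries."""
    scores = [average_precision_at_k(r, k) for r in relevance_lists]
    return sum(scores) / len(scores)


# Hypothetical relevance judgments for two queries (1 = flat apparel in the
# same category as the query, 0 = anything else), in ranked order.
queries = [[1, 0, 1, 0], [0, 1, 0, 0]]
score = mean_average_precision_at_k(queries, k=4)
print(round(score, 6))
```
]]></preformat>
      <p>For the first query the relevant items sit at ranks 1 and 3, giving AP@4 = (1 + 2/3) / 2; for the second the single relevant item sits at rank 2, giving AP@4 = 1/2.</p>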
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] <string-name><given-names>A.</given-names> <surname>Hagiu</surname></string-name> and <string-name><given-names>R.</given-names> <surname>Simon</surname></string-name>. <year>2016</year>. <article-title>Network Effects Aren't Enough</article-title>. <source>Harvard Business Review</source> <volume>94</volume>, <issue>4</issue> (April 2016), <fpage>65</fpage>–<lpage>71</lpage>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] <string-name><given-names>H.</given-names> <surname>Jegou</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Douze</surname></string-name>, and <string-name><given-names>C.</given-names> <surname>Schmid</surname></string-name>. <year>2011</year>. <article-title>Product Quantization for Nearest Neighbor Search</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>33</volume>, <issue>1</issue> (Jan. 2011), <fpage>117</fpage>–<lpage>128</lpage>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] <string-name><given-names>X.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Shen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Liu</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Dong</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Lin</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Yan</surname></string-name>. <year>2015</year>. <article-title>Deep Human Parsing with Active Template Regression</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>37</volume>, <issue>12</issue> (Dec. 2015), <fpage>2402</fpage>–<lpage>2414</lpage>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] <string-name><given-names>Y.</given-names> <surname>Matsui</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Hinami</surname></string-name>, and <string-name><given-names>S.</given-names> <surname>Satoh</surname></string-name>. <year>2018</year>. <article-title>Reconfigurable Inverted Index</article-title>. In <source>Proceedings of the 26th ACM International Conference on Multimedia (MM '18)</source>. <fpage>1715</fpage>–<lpage>1723</lpage>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] <string-name><given-names>V.</given-names> <surname>Nair</surname></string-name> and <string-name><given-names>G. E.</given-names> <surname>Hinton</surname></string-name>. <year>2010</year>. <article-title>Rectified Linear Units Improve Restricted Boltzmann Machines</article-title>. In <source>Proceedings of the 27th International Conference on International Conference on Machine Learning (ICML'10)</source>. <fpage>807</fpage>–<lpage>814</lpage>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] <string-name><given-names>M.</given-names> <surname>Sandler</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Howard</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Zhu</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Zhmoginov</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Chen</surname></string-name>. <year>2018</year>. <article-title>MobileNetV2: Inverted Residuals and Linear Bottlenecks</article-title>. In <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '18)</source>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] <string-name><given-names>X.</given-names> <surname>Song</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Jiang</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Herranz</surname></string-name>. <year>2017</year>. <article-title>Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold</article-title>. <source>Trans. Img. Proc.</source> <volume>26</volume>, <issue>6</issue> (June 2017), <fpage>2721</fpage>–<lpage>2735</lpage>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] <string-name><given-names>A.</given-names> <surname>Babenko</surname></string-name> and <string-name><given-names>V.</given-names> <surname>Lempitsky</surname></string-name>. <year>2015</year>. <article-title>Aggregating Local Deep Features for Image Retrieval</article-title>. In <source>Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV '15)</source>. <fpage>1269</fpage>–<lpage>1277</lpage>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] <string-name><given-names>F.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>A.</given-names> <surname>Kale</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Bubnov</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Stein</surname></string-name>, <string-name><given-names>Q.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>H.</given-names> <surname>Kiapour</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Piramuthu</surname></string-name>. <year>2017</year>. <article-title>Visual Search at eBay</article-title>. In <source>Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17)</source>. <fpage>2101</fpage>–<lpage>2110</lpage>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] <string-name><given-names>A.</given-names> <surname>Zhai</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Kislyuk</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Jing</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Feng</surname></string-name>, <string-name><given-names>E.</given-names> <surname>Tzeng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Donahue</surname></string-name>, <string-name><given-names>Y. L.</given-names> <surname>Du</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>Darrell</surname></string-name>. <year>2017</year>. <article-title>Visual Discovery at Pinterest</article-title>. In <source>Proceedings of the 26th International Conference on World Wide Web Companion (WWW '17 Companion)</source>. <fpage>515</fpage>–<lpage>524</lpage>.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>P.</given-names> <surname>Pan</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zheng</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Zhao</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <string-name><given-names>X.</given-names> <surname>Ren</surname></string-name>, and <string-name><given-names>R.</given-names> <surname>Jin</surname></string-name>. <year>2018</year>. <article-title>Visual Search at Alibaba</article-title>. In <source>Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '18)</source>. <fpage>993</fpage>–<lpage>1001</lpage>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>