<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>SIGIR 2018 eCom Data Challenge</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Convolutional Neural Network and Bidirectional LSTM Based Taxonomy Classification Using External Dataset at SIGIR eCom Data Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hongwei Zhang</string-name>
          <email>hzhang@yahoo-corp.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aya Iwamoto</string-name>
          <email>ayiwamot@yahoo-corp.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fumihiko Takahashi</string-name>
          <email>ftakahas@yahoo-corp.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hiroaki Shiino</string-name>
          <email>hshiino@yahoo-corp.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shogo D. Suzuki</string-name>
          <email>shogosu@yahoo-corp.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yohei Iseki</string-name>
          <email>yiseki@yahoo-corp.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Yahoo Japan Corporation</institution>
          ,
          <addr-line>Tokyo</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <volume>4</volume>
      <issue>5</issue>
      <abstract>
        <p>In eCommerce websites, products are annotated with various metadata, such as a category, by human sellers. Automatic item categorization is useful for reducing the cost of this annotation and has been well researched. This paper describes how we won 2nd place (weighted F1 at Stage 1 and Stage 2 of 0.8421 and 0.8399) at SIGIR eCom DataChallenge 2018, whose goal is to predict each product's category from its title. We formulate the task as a simple classification problem over all leaf categories in the given dataset. The key features of our method are the combination of a Convolutional Neural Network and a Bidirectional LSTM, and the use of ad-hoc features from an external dataset (i.e. one not given in this contest). We also conduct an error analysis, which reveals some cases that are hard to predict accurately.</p>
      </abstract>
      <kwd-group>
        <kwd>Convolutional Neural Network</kwd>
        <kwd>Bidirectional LSTM</kwd>
        <kwd>External dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        In eCommerce websites, products are registered with metadata (e.g.
title, category, etc.) by human sellers. Annotating products with
such metadata is a hard job, and therefore automatic prediction
of metadata can reduce the cost [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In recent years, a number of
studies of automatic item categorization in eCommerce have been
published [
        <xref ref-type="bibr" rid="ref1 ref10 ref2 ref8 ref9">1, 2, 8–10</xref>
        ]. Das et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] reported that products are often
categorized incorrectly because product taxonomies are large.
Furthermore, if two different sellers annotate the same product, the
results may well differ. These difficulties produce a noisy
dataset, and therefore automatic item categorization is a difficult task.
      </p>
      <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).</p>
      <p>SIGIR 2018 eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).</p>
      <p>In SIGIR eCom DataChallenge 2018, Rakuten
Institute of Technology provides train and test datasets. The train
dataset is composed of product titles and category ID paths, and the
test dataset contains only product titles. The goal of participants
is to predict the category ID path for each product title in the
test dataset. This challenge is more difficult than previous item
categorization problems for the following two reasons: (1) The only
metadata of a product is its title; no other information (e.g. price, image,
etc.) is included. (2) The dataset contains category “ID” paths rather
than category “name” paths, which makes it difficult to use prior
knowledge of each category.</p>
      <p>
        This paper describes how we won 2nd place (weighted F1
at Stage 1 and Stage 2 of 0.8421 and 0.8399) at SIGIR eCom
DataChallenge 2018. We formulate the task as a simple classification
problem over all leaf categories in the given dataset. Our method
has two key features: (1) A Convolutional Neural
Network and a Bidirectional LSTM are used together. This combination
may be useful because the two models differ in structure. (2)
Amazon Product Data [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ], which contains product reviews and
metadata from Amazon, is used to generate ad-hoc features. As
described above, the products in the dataset given by this contest
have no metadata other than the title. Thus, it is useful to incorporate
metadata from the external dataset.
      </p>
      <p>In the rest of the paper, our system is described in detail in
Section 2. Section 3 presents an error analysis of our model. Finally,
we present the conclusion in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2 METHODS</title>
      <p>An overview of our system is given in Figure 1. A product title is
fed to the system and a category for the product is predicted by
the following procedure: (1) The product title is split into words, which
are then normalized. (2) Each word is converted into an embedding
vector. (3) The embedding vectors are input into the “Multi-kernel CNN
module” and the “BiLSTM module”. Each module outputs a flattened
vector. (4) Ad-hoc features are generated for each word. Each ad-hoc
feature is fed into a multi layer perceptron, which yields a flattened
vector. (5) The three vectors from steps (3) and (4) are concatenated into
a single flattened vector and passed into a final fully connected layer,
which outputs the probabilities of all leaf categories.</p>
      <p>In the rest of this section, we describe these procedures in
detail.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Preprocessing of a product title</title>
      <p>First, the input sequence (i.e. the product title) is split into words
on the space character. Then, symbol characters (e.g. %, #, etc.) are
removed from each word. Finally, each word is converted to
lowercase.</p>
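      <p>The three preprocessing steps can be sketched as follows. This is a minimal illustration, not the implementation used in the contest: the function name preprocess_title is hypothetical, and the set of symbol characters is assumed to be ASCII punctuation because the full set is not listed here.</p>
      <preformat><![CDATA[
```python
import string

# Assumption: "symbol characters" are taken to be ASCII punctuation.
SYMBOLS = set(string.punctuation)


def preprocess_title(title):
    """Split a title on spaces, strip symbol characters, and lowercase."""
    words = []
    for raw in title.split(" "):
        cleaned = "".join(ch for ch in raw if ch not in SYMBOLS).lower()
        if cleaned:  # skip tokens that consisted of symbols only
            words.append(cleaned)
    return words


print(preprocess_title("Apple iPhone-6 (16GB) 100% NEW #1"))
# ['apple', 'iphone6', '16gb', '100', 'new', '1']
```
]]></preformat>
      <p>Tokens that become empty after symbol removal are dropped in this sketch; whether the original system keeps them is not stated.</p>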
    </sec>
    <sec id="sec-4">
      <title>2.2 Generating embedding vectors</title>
      <p>
        We used word2vec as implemented in gensim [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to generate
skipthoughts embedding vectors. The word2vec settings are as follows.
• all words appearing in the train and test datasets are used
• the window size is 7
• hierarchical softmax is used for model training
• the negative sample size is 5
• the embedding vector size is 512
      </p>
      <p>Word embeddings are widely used in natural
language processing. In addition to word embeddings, we generated
embeddings of the following.</p>
      <p>• POS tags
• stemmed word
• lemma of a word
• hypernym of a word</p>
      <p>These embeddings are useful for handling words in the test
dataset that do not appear in the train dataset.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Training Modules</title>
      <p>
        We used a Convolutional Neural Network with multiple kernels
(Multi-kernel CNN) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and a Bidirectional LSTM with Soft Attention [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
as training modules.
      </p>
      <p>
        2.3.1 Multi-kernel CNN. Y. Kim [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a Convolutional
Neural Network based method for sentence classification.
We adopted this idea to predict categories and call the module
“Multi-kernel CNN”. An overview of the “Multi-kernel CNN” module is
given in Figure 2.
      </p>
      <p>The input of this module is the embedding vectors described in
Section 2.2, and the output is a vector whose elements correspond
to the probabilities of each leaf category. First, the input is passed into
one-dimensional convolutional layers, which output feature
maps. We used multiple convolutional layers with different kernel
sizes (e.g. 2, 3, 4 and 5). Next, the feature maps are flattened into a vector,
which is passed into a final fully connected layer.</p>
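      <p>To make the shape of the flattened vector concrete, the following is a small pure-Python sketch of one-dimensional convolutions with kernel sizes 2, 3, 4 and 5 followed by flattening. It is illustrative only: the helper names, the toy sizes, the random weights, the single filter per kernel size and the ReLU activation are all assumptions, and a real system would use a deep learning framework.</p>
      <preformat><![CDATA[
```python
import random


def conv1d(seq, kernel):
    """Valid 1-D convolution over a sequence of vectors.

    seq is a list of T vectors of size D, kernel is a list of k vectors
    of size D. Returns a feature map with T - k + 1 values.
    """
    k = len(kernel)
    out = []
    for start in range(len(seq) - k + 1):
        total = 0.0
        for offset in range(k):
            for x, w in zip(seq[start + offset], kernel[offset]):
                total += x * w
        out.append(max(total, 0.0))  # ReLU, an assumed activation
    return out


def multi_kernel_features(seq, kernels):
    """Apply every kernel and flatten all feature maps into one vector."""
    flat = []
    for kernel in kernels:
        flat.extend(conv1d(seq, kernel))
    return flat


rng = random.Random(0)
T, D = 6, 4  # toy title length and embedding size
seq = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(T)]
kernels = [
    [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(k)]
    for k in (2, 3, 4, 5)
]
features = multi_kernel_features(seq, kernels)
print(len(features))  # 5 + 4 + 3 + 2 = 14
```
]]></preformat>
      <p>Each kernel size k contributes T - k + 1 values, so wider kernels give shorter feature maps; titles shorter than the widest kernel would need padding, which this sketch does not handle.</p>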
      <p>
        2.3.2 Bidirectional LSTM with Soft Attention. In recent years,
Recurrent Neural Networks have been widely used in natural
language processing. In neural machine translation, the
attention mechanism has been reported to be an effective technique [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        We employed a bidirectional LSTM with Soft Attention [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to
predict categories. Note that the attention layer of a “sequence to
sequence” model takes the outputs of the LSTM layers on both the
input side and the output side; our model, however, is
“sequence to label”, so the attention layer takes only the outputs of
the input-side LSTM layers. An overview of the “Bidirectional LSTM” module is given in
Figure 3.
      </p>
      <p>The input and output are the same as those of the “Multi-kernel CNN” module.
First, the input is passed into a Bidirectional LSTM layer, which
outputs an encoded sequence. Then, the encoded sequence is passed
into a soft attention layer, which outputs a probability distribution
over all leaf categories.</p>
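      <p>The attention step in a “sequence to label” setting can be sketched as below. This is a simplified illustration under stated assumptions rather than the exact layer used: because there is no decoder state, each encoded time step is scored against a single query vector (assumed here to be a learned parameter), and the names attention_pool and classify as well as all numeric values are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(encoded, query):
    """Score every time step against the query vector, then return the
    attention weights and the weighted sum of the encoded states."""
    weights = softmax([dot(h, query) for h in encoded])
    dim = len(encoded[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoded)) for i in range(dim)]
    return weights, context


def classify(context, class_weights):
    """Linear layer followed by softmax over leaf categories."""
    return softmax([dot(context, w) for w in class_weights])


encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-in for BiLSTM outputs
query = [0.5, 0.5]                              # assumed learned query vector
class_weights = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]

weights, context = attention_pool(encoded, query)
probs = classify(context, class_weights)
print(weights)
print(probs)
```
]]></preformat>
      <p>In this toy input the third time step has the largest score, so it receives the largest attention weight, and the resulting probabilities sum to one over the three stand-in categories.</p>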
    </sec>
    <sec id="sec-6">
      <title>2.4 Ad-hoc features</title>
      <p>Ad-hoc features such as the length of a title help to improve
accuracy. We used the following ad-hoc features.</p>
      <p>• title length
• uppercase rate
• alphabet/non-alphabet/digit rate/count
• space character rate/count
• max length of words
• unique word rate
• number of filtered symbols
• number of words
• histogram of word length
• histogram of POS tags</p>
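      <p>Several of the listed features can be computed directly from the raw title, as in the sketch below. The exact definitions are not given here, so the choices in this example are assumptions: rates are normalized by the number of characters, the word-length histogram is capped at 5 buckets, and the function name adhoc_features is hypothetical. The POS tag histogram is omitted because it requires a tagger.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def adhoc_features(title, max_len_bucket=5):
    """Compute a few title-level features from the raw (unnormalized) title."""
    words = title.split()
    n_chars = len(title)
    n_words = len(words)
    safe_chars = max(n_chars, 1)  # avoid division by zero on empty titles
    # Word lengths above max_len_bucket share the last bucket (assumption).
    hist = Counter(min(len(w), max_len_bucket) for w in words)
    return {
        "title_length": n_chars,
        "uppercase_rate": sum(c.isupper() for c in title) / safe_chars,
        "alphabet_rate": sum(c.isalpha() for c in title) / safe_chars,
        "digit_rate": sum(c.isdigit() for c in title) / safe_chars,
        "space_count": title.count(" "),
        "max_word_length": max((len(w) for w in words), default=0),
        "unique_word_rate": len(set(words)) / max(n_words, 1),
        "num_words": n_words,
        "word_length_hist": [hist.get(i, 0) for i in range(1, max_len_bucket + 1)],
    }


feats = adhoc_features("USB Cable 2 Pack USB")
print(feats["title_length"], feats["num_words"], feats["word_length_hist"])
# 20 5 [1, 0, 2, 1, 1]
```
]]></preformat>
      <p>Features such as the uppercase rate have to be computed before the lowercasing step of Section 2.1, which is why this sketch works on the raw title.</p>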
      <p>
        In addition to the ad-hoc features above, metadata of products
from an external dataset, Amazon Product Data [
        <xref ref-type="bibr" rid="ref3 ref6">3, 6</xref>
        ], which
contains product reviews and metadata such as categories and prices,
is used as ad-hoc features.
      </p>
      <p>Figure 4 shows the procedure for generating ad-hoc features
from Amazon Product Data. (1) A model whose input is a
title and whose output is a category is trained on Amazon Product Data. (2) The
product titles of the datasets given by this contest and of Amazon
Product Data are fed into the model from (1), and embedding vectors
are obtained from its embedding layer. (3) For each product in the
dataset given by this contest, the prices and categories of the 20 most
similar products in Amazon Product Data are fetched. The similarity of
two products is measured by the Euclidean distance between their embedding
vectors from (2).</p>
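      <p>Step (3) is a nearest-neighbour lookup, which can be sketched with a brute-force search as below. The catalog entries, prices and category names are made-up toy values, the function name nearest_products is hypothetical, and a dataset of realistic size would likely need an approximate nearest-neighbour index instead of a linear scan.</p>
      <preformat><![CDATA[
```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_products(query_vec, catalog, k=20):
    """Return (price, category) of the k catalog entries closest to query_vec.

    catalog is a list of (embedding, price, category) tuples taken from the
    external dataset; results are ordered from nearest to farthest.
    """
    closest = heapq.nsmallest(
        k, catalog, key=lambda item: euclidean(query_vec, item[0])
    )
    return [(price, category) for _, price, category in closest]


# Toy external catalog (all values are illustrative).
catalog = [
    ([0.0, 0.0], 9.99, "Books"),
    ([1.0, 1.0], 24.50, "Electronics"),
    ([0.1, 0.2], 12.00, "Books"),
    ([5.0, 5.0], 199.00, "Appliances"),
]
neighbors = nearest_products([0.0, 0.1], catalog, k=2)
print(neighbors)  # [(9.99, 'Books'), (12.0, 'Books')]
```
]]></preformat>
      <p>The fetched prices and categories could then be aggregated (for example into category counts) before being fed to the multi layer perceptron; the aggregation actually used is not specified here.</p>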
    </sec>
    <sec id="sec-7">
      <title>2.5 Concatenating vectors from previous modules</title>
      <p>We obtain three flattened vectors from the “Multi-kernel CNN”, the
“Bidirectional LSTM” and the multi layer perceptron with ad-hoc features.
These vectors are concatenated into a single flattened vector,
which is passed into a fully connected layer whose output is the same
as that of the previous modules (i.e. probabilities of each leaf category).</p>
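      <p>The concatenation and the final fully connected layer amount to the following computation. This is a toy sketch: the vector sizes, weights and biases are illustrative values rather than trained parameters, and the function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fully_connected(vec, weights, bias):
    """One output per row of weights: w . vec + b."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def predict(cnn_vec, lstm_vec, adhoc_vec, weights, bias):
    merged = cnn_vec + lstm_vec + adhoc_vec  # list concatenation
    return softmax(fully_connected(merged, weights, bias))


cnn_vec, lstm_vec, adhoc_vec = [0.2, 0.7], [0.1], [1.0, 0.0]
weights = [  # 3 leaf categories x 5 concatenated inputs (toy values)
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0],
]
bias = [0.0, 0.0, 0.0]
probs = predict(cnn_vec, lstm_vec, adhoc_vec, weights, bias)
print(probs.index(max(probs)))  # 2
```
]]></preformat>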
    </sec>
    <sec id="sec-8">
      <title>2.6 Oversampling of “small” categories</title>
      <p>In training, we oversampled data by shuffling the words in titles for
categories that have fewer than 50 products. This mitigates the
imbalance among categories.</p>
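      <p>A possible form of this oversampling is sketched below. The threshold of 50 comes from the text, but how many shuffled copies are generated is not stated, so this example assumes that each small category is filled up to the threshold; the function name oversample and the toy data are hypothetical.</p>
      <preformat><![CDATA[
```python
import random
from collections import defaultdict


def oversample(examples, min_count=50, seed=0):
    """Add word-shuffled copies of titles for under-represented categories.

    examples is a list of (words, category) pairs. Every category with
    fewer than min_count examples is filled up to min_count (assumption).
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for words, category in examples:
        by_category[category].append(words)

    augmented = list(examples)
    for category, titles in by_category.items():
        # range() is empty for categories that are already large enough
        for i in range(min_count - len(titles)):
            shuffled = list(titles[i % len(titles)])
            rng.shuffle(shuffled)
            augmented.append((shuffled, category))
    return augmented


data = [(["red", "shoe", "size", "9"], "A")] * 3 + [(["blue", "hat"], "B")] * 60
balanced = oversample(data)
print(sum(1 for _, c in balanced if c == "A"))  # 50
print(sum(1 for _, c in balanced if c == "B"))  # 60
```
]]></preformat>
      <p>Shuffling keeps the bag of words of a title intact while changing the word order seen by the convolutional and recurrent modules.</p>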
    </sec>
    <sec id="sec-9">
      <title>3 ERROR ANALYSIS</title>
      <p>We split the train dataset given by this contest into two parts, a train
part and a validation part, to check the performance of the proposed system.</p>
    </sec>
    <sec id="sec-10">
      <title>3.1 Top level category prediction</title>
      <p>First, we show the accuracy of top level category prediction.
Figure 5 shows the confusion matrix of top level category
prediction. Products whose top level category is “1208”
are often misclassified as “4015”. Table 1 shows examples of
“1208” products misclassified as “4015”, and Table 2 shows products
that belong to the predicted categories. In the first and
second lines of Table 1, the true and predicted categories seem
to be similar. More specifically, “1208&gt;310&gt;1629&gt;1513&gt;3369” and
“4015&gt;4454&gt;473” both seem to be food categories. “1208&gt;546&gt;4262&gt;572”</p>
      <p>Figure 6: F1 score for each category.</p>
    </sec>
    <sec id="sec-11">
      <title>3.2 Difficult Categories</title>
      <p>In this section, difficult categories are explored. Figure 6 shows the F1
score for each category. Categories to which few products belong
are difficult to predict accurately.</p>
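      <p>Per-category F1 scores of the kind plotted in Figure 6 can be computed as in the sketch below. The function names and the toy labels are hypothetical, and the weighted F1 shown here assumes the usual support-weighted average of per-category F1; the precise definition used by the contest organizers is not restated in this paper.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def per_category_f1(y_true, y_pred):
    """F1 score of every category that appears in the labels or predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    scores = {}
    for cat in set(y_true) | set(y_pred):
        denom = 2 * tp[cat] + fp[cat] + fn[cat]
        scores[cat] = 2 * tp[cat] / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-category F1 weighted by the number of true examples."""
    scores = per_category_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * n for c, n in support.items()) / len(y_true)


y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
scores = per_category_f1(y_true, y_pred)
print(scores["a"])  # 0.8
print(weighted_f1(y_true, y_pred))
```
]]></preformat>
      <p>Sorting categories by their number of products and plotting these scores gives a view comparable to the one discussed in this section.</p>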
      <p>Furthermore, we focus on categories that are hard to predict
in categories close to the number of products (i.e. bottom of the</p>
    </sec>
    <sec id="sec-12">
      <title>4 CONCLUSION</title>
      <p>In this paper, we described how we tackled SIGIR eCom
DataChallenge 2018. Our proposed model combines a Convolutional Neural
Network and a Bidirectional LSTM. We also used Amazon Product
Data to generate ad-hoc features. The error analysis found that
two categories which are similar or share the same words in titles are
hard to distinguish. It also found that media categories are hard
to distinguish because the product titles in those categories do not carry
enough information. We believe that the high prediction accuracy came
from the proposed deep learning models and the external dataset, but
human prior knowledge of each category would be useful for achieving better
performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Pradipto</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yandi</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Di Fabbrizio</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ankur</given-names>
            <surname>Datta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Web-scale language-independent cataloging of noisy product listings for e-commerce</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>969</fpage>
          -
          <lpage>979</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jung-Woo</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hyuna</given-names>
            <surname>Pyo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jeonghee</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Large-scale item categorization in e-commerce using multiple recurrent neural networks</article-title>
          .
          <source>In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM</source>
          ,
          <fpage>107</fpage>
          -
          <lpage>115</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>In proceedings of the 25th international conference on world wide web. International World Wide Web Conferences Steering Committee</source>
          ,
          <fpage>507</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Minh-Thang</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hieu</given-names>
            <surname>Pham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Effective approaches to attention-based neural machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1508.04025</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Targett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qinfeng</given-names>
            <surname>Shi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anton</given-names>
            <surname>Van Den Hengel</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Radim</given-names>
            <surname>Řehůřek</surname>
          </string-name>
          and
          <string-name>
            <given-names>Petr</given-names>
            <surname>Sojka</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Software Framework for Topic Modelling with Large Corpora</article-title>
          .
          <source>In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA</source>
          , Valletta, Malta,
          <fpage>45</fpage>
          -
          <lpage>50</lpage>
          . http://is.muni.cz/publication/884893/en.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Dan</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jean-David</given-names>
            <surname>Ruvini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rajyashree</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Neel</given-names>
            <surname>Sundaresan</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>A study of smoothing algorithms for item categorization on e-commerce sites</article-title>
          .
          <source>Neurocomputing</source>
          <volume>92</volume>
          (
          <year>2012</year>
          ),
          <fpage>54</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Dan</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jean-David</given-names>
            <surname>Ruvini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Badrul</given-names>
            <surname>Sarwar</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Large-scale item categorization for e-commerce</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Information and knowledge management. ACM</source>
          ,
          <fpage>595</fpage>
          -
          <lpage>604</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Yandi</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aaron</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pradipto</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Giuseppe</given-names>
            <surname>Di Fabbrizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Keiji</given-names>
            <surname>Shinzato</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ankur</given-names>
            <surname>Datta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          , Vol.
          <volume>2</volume>
          .
          <fpage>663</fpage>
          -
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          2296&gt;2435&gt;1576 Scarlet Women The Quartet Alfred 00-21113 I Will Sing - Music Book
          2296&gt;2435&gt;3792 Nocturnes and Polonaises Complete Preludes and Etudes-Tableaux Blues Heaven
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          2296&gt;3597&gt;3064 Dowin In The Delta Free Beer Regina Belle - Believe in Me
          2296&gt;3597&gt;3956 New Edition Backyard - Skillet Suits-Season Three
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          2296&gt;3706&gt;1586 Newlyweds-Nick and Jessica Complete 2nd and 3rd Seasons Defiance-s3 [dvd] [3discs] (Universal) Log Horizon: Season 2 - Collection 2
          2296&gt;3706&gt;3437 Yu Yu Hakusho Season 3 Case Closed-Season 3-S.A.V.E.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>