=Paper=
{{Paper
|id=Vol-2319/ecom18DC_paper_13
|storemode=property
|title=Overview of the SIGIR 2018 eCom Rakuten Data Challenge
|pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_13.pdf
|volume=Vol-2319
|authors=Yiu-Chang Lin,Pradipto Das,Ankur Datta
|dblpUrl=https://dblp.org/rec/conf/sigir/LinDD18
}}
==Overview of the SIGIR 2018 eCom Rakuten Data Challenge==
<pdf width="1500px">https://ceur-ws.org/Vol-2319/ecom18DC_paper_13.pdf</pdf>
<pre>
         Overview of the SIGIR 2018 eCom Rakuten Data Challenge
                      Yiu-Chang Lin                                                       Pradipto Das                                    Ankur Datta
          Rakuten Institute of Technology                                    Rakuten Institute of Technology                    Rakuten Institute of Technology
             Boston, MA, 02110 - USA                                            Boston, MA, 02110 - USA                            Boston, MA, 02110 - USA
            yiuchang.lin@rakuten.com                                           pradipto.das@rakuten.com                           ankur.datta@rakuten.com

 ABSTRACT                                                                                             We hope that by making the data available to the participants, the
This paper presents an overview of the SIGIR 2018 eCom Rakuten                                        task will attract more research institutions and industry practition-
Data Challenge. In this data challenge, Rakuten.com has released                                      ers, who do not have the opportunity to contribute their ideas due
a sampling of one million product titles and the corresponding                                        to the lack of an actual commercial e-commerce catalog data. In a
anonymized category paths from their entire product catalog. Of                                       typical e-commerce setting, merchants are responsible for matching
these, 0.8 million of product titles and their corresponding category                                 their products with an existing category, which is a leaf node in the
paths are released as training data and 0.2 million of product titles                                 taxonomy tree. The problem of large scale taxonomy classification
are released as test data. The task is to predict the category, defined                               is thus immensely useful to help merchants upload their products
as a full path in the taxonomy tree as provided in the training set, of                               in the right places of an e-commerce platform catalog.
the product titles in the test set. The evaluation is divided into two
stages to measure system performance on a part of the test data                                       2     DATASET
and the entire test data. The different systems are evaluated using                                   Rakuten has released a sample of one million product listings, in-
weighted precision, recall and F1. In total, 26 teams have submitted                                  cluding the training (0.8 million) and test (0.2 million) set, consisting
28 systems with a top performance of 0.8513 weighted F1 score.                                        of product titles and their corresponding category paths that belong
                                                                                                      to a taxonomy tree to describe the varying degrees of generality
 1     INTRODUCTION                                                                                   of the items in the catalog. A product taxonomy is a tree-based
The SIGIR 2018 eCom Rakuten Data Challenge 1 is organized by                                          hierarchical representation of labels of the listings in a catalog.
Rakuten Institute of Technology, Boston2 (RIT-Boston), a                                                  Rakuten’s catalog of products is much larger than a million
dedicated R&D organization for the Rakuten group3 of companies.                                       listings. The larger catalog of product listings is first de-duplicated
The data challenge focuses on the task of large-scale taxonomy                                        using leaf node label and product title tuples as keys and then a
classification of product titles, where the goal is to predict each                                   random sampling of one million listings is performed. These set
product’s category, defined as a full path from root node to a leaf                                   of listings have been allowed to be published for this year’s data
node in the taxonomy tree provided in the training set. The cat-                                      challenge. The set of one million product listings may not cover the
aloging of product listings through taxonomy categorization is a                                      entire set of leaf nodes in Rakuten.com’s taxonomy, and so, any
fundamental problem for any e-commerce platform, with applica-                                        taxonomy generated from the category labels provided with the
tions ranging from basic data organization, personalized search                                       training set may yield a truncated taxonomy.
recommendations to query understanding and targeted campaign-                                             Further, to preserve anonymity of the taxonomy, the nodes in
ing. For instance, in the Rakuten.com catalog, “Dr. Martens Air                                       the actual taxonomy have been masked with random integers. The
Wair 1460 Mens Leather Ankle Boots” is categorized under the                                          class or the category label of a product listing is thus a path from the
“Clothing, Shoes & Accessories > Shoes > Men > Boots” leaf. How-                                      root of the taxonomy tree to a leaf node where the listing resides.
ever, manual and rule based approaches to categorization are not                                      In the left to right ordering of the nodes in such a path, the path
scalable since commercial product taxonomies are organized in tree                                    becomes more specific in describing the listing as it approaches the
structures with three to ten levels of depth and thousands of leaf                                    leaf level. The string representation of a path from root to leaf is
nodes. Advances in this area of research have been limited due to                                     henceforth dubbed as a “category-id-path”.
the lack of real data from actual commercial catalogs.                                                    The training data file is in a tab-separated values (TSV) format
   The challenge presents several interesting research aspects due                                    where each line contains a product title and its corresponding
to the intrinsic noisy nature of the product labels, the size of modern                               category-id-path. In the test set, only the product titles are provided
e-commerce catalogs, and the typical imbalanced data distribution.                                    and the objective of this data challenge is to predict the full category-
1 https://sigir-ecom.github.io/data-task.html
                                                                                                      id-path for each such title. Table 1 and table 2 show some examples
 2 https://rit.rakuten.co.jp/                                                                         of product titles from the training set and test set, respectively. The
 3 https://www.rakuten.com/                                                                           partitioning of the training and the test sets are obtained using
                                                                                                      category-wise stratified sampling of the one million listing dataset.
 Permission to make digital or hard copies of part or all of this work for personal or
Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
 classroom
In:            use is G.
    J. Degenhardt,     granted  withoutS.fee
                          Di Fabbrizio,      providedM.that
                                          Kallumadi,        copies
                                                         Kumar,     areLin,
                                                                 Y.-C.   notA.made  or distributed
                                                                               Trotman, H. Zhao
 for profit
(eds.):      or commercial
        Proceedings
published
                               advantage
                      of the SIGIR
           at http://ceur-ws.org
                                          andworkshop,
                                   2018 eCom   that copies bear2018,
                                                        12 July, this notice  andMichigan,
                                                                       Ann Arbor, the full citation
                                                                                            USA,      2.1    Data Characteristics
 on the first   page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).                                                      In the training set, there are 800, 000 product titles from 3, 008 leaf-
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA                                              level nodes. Product titles are unevenly distributed among these
© 2018 Copyright held by the owner/author(s).                                                         3, 008 categories. The top ten categories compose around 30% of
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
https://doi.org/10.1145/nnnnnnn.nnnnnnn                                                               the data set and the top forty categories compose around 50% of
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA                                                                                     Yiu-Chang Lin, Pradipto Das, and Ankur Datta


                                                  10 5                                                                  350000
                                                                                                                        300000
                                                  10 4
                                                                                                                        250000
                          Frequency (log scale)
                                                  10 3                                                                  200000


                                                                                                            Frequency
                                                  10 2                                                                  150000
                                                                                                                        100000
                                                  10 1
                                                                                                                           50000
                                                  10 0                                                                        0
                                                     10 0         10 1                   10 2        10 3                          1   2     3        4      5            6     7   8
                                                                              Rank (log scale)                                                     Category Depth


Figure 1: Left: Product title frequencies for 3,008 leaf nodes in the taxonomy tree. Both X and Y axis are in log scale. Rank
ranges from 1 to 3,008. Right: Product title frequencies for different lengths of category paths.

                                             80000                                                                         18000
                                             70000                                                                         16000
                                             60000                                                                         14000
                                                                                                                           12000
                                             50000
                        Frequency


                                                                                                               Frequency   10000
                                             40000
                                                                                                                            8000
                                             30000
                                                                                                                            6000
                                             20000                                                                          4000
                                             10000                                                                          2000
                                                         0                                                                     0
                                                             10          20           30        40     50                              50          100          150           200       250
                                                                              Title Word Length                                                  Title Character Length


Figure 2: Left: Word length distribution over all product titles in the training set. Right: Character length distribution over all
product titles in the training set.

                            Product Titles
Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A
                                                                     category-id-paths
                                                                       3292>114>1231
                                                                                                                        length is 68.44, with a maximum length of 255. Figure 2 shows the
              Ka-Bar Desert MULE Serrated Folding Knife              4238>321>753>3121                                  title length distributions at world-level and character-level, respec-
       5.11 TACTICAL 74280 Taclite TDU Pants, R/M, Dark Navy         4015>3285>1443>20
           Skechers 4lb S Grip Jogging Weight set of 2- Black       2075>945>2183>3863
                                                                                                                        tively.
                 Generations Small Side Table White              4015>3636>1319>1409>3606

 Table 1: Examples of product titles from the training set.                                                             3      EVALUATION
                                                                                                                    In this section, we briefly mention the evaluation criteria used to
                                 Product Titles                                                                     judge the different systems. Although we do not resort to a stricter
          Disc Brake Rotor-Advanced Technology Rear Raybestos 980368                                                statistical significance testing of different systems using confidence
       Coquette Neon Pink Ruffle Babydoll 7035 Neon Pink One Size Fits All                                          interval estimation from bootstrapped samples of the test set, it
      12V 7Ah (SPS Brand) APC NS3000RMT3U Replacement Battery ( 4 Pack)                                             will be useful to do so once this pilot task has been completed.
Honda Ridgeline 2006-08 Black Radio AM FM 6 Disc CD PN 39100-SJC-A100 Face 3TS1
            Frankford Arsenal Platinum Series Case Prep Essentials Kit
                                                                                                                    One motivation to not introduce such testing in this pilot task is
                                                                                                                    that in an e-commerce setting, even a 0.01% improvement means
    Table 2: Examples or product titles from the test set.
                                                                                                                    thousands of listings being assigned to their right places in the
                                                                                                                    catalog without human intervention.
the data set. The left hand side of Figure 1 shows the distribution
of product title frequency across all the leaf-level categories (sorted
by frequency) where the X and Y axes are both in log scale. Across                                                      3.1        Metrics
the 800, 000 product titles, the maximum depth of category level is                                                 The metrics that we have used to evaluate the different classifica-
8 and the average depth is 4.                                                                                       tion systems are based on weighted-precision, weighted-recall and
    The right hand side of Figure 1 shows the distribution of product                                               weighted-F1 for the test set of exact “category-id-path” match. In
title frequency across different depths of category-id-paths. The                                                   other words, partial path match does not count as a correct predic-
average word-level title length over all categories, after the surface                                              tion. We assume that there are a total of K classes, {c i |i = 1, 2, ..., K }
form of the title is tokenized by white space, is 10.93, with a maxi-                                               in the training set. The number of true instances for each class
mum length of 58 words. Similarly, the average character-level title                                                (support) is ni , and the total number of training instances is N =
Overview of the SIGIR 2018 eCom Rakuten Data Challenge                                 eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA


Figure 3: Left: Stage 1 leaderboard (evaluated on the first 20,000 test titles). Right: Stage 2 leaderboard (evaluated on all 200,000
test titles).

ÍK
  i=1 ni . After calculating the precision (Pi ), recall (R i ) and F1 (F 1i )   3.2     Leaderboard
scores for each class c i , the weighted-precision, weighted-recall              The leaderboard shows the weighted precision, recall and F1 scores,
and weighted-F1 are defined as follows:                                          upto four decimal digits of precision, for the latest submissions
                                                                                 from each participating team. The corresponding file submission
                                            K                                    time is shown as well so participants can refer the scores to which
                                            Õ ni
                          weiдhted-P =                Pi                  (1)    submission it belongs to. The leaderboard is sorted by the weighted
                                                  N
                                            i=1                                  F1 score. We choose not to use an off-the-shelf classifier, such as
                                                                                 a logistic regression model or some other standard classifier, as a
                                            K
                                            Õ ni                                 minimal baseline (lowerbound) in the leaderboard.
                          weiдhted-R =                Ri                  (2)
                                                  N
                                            i=1                                  3.3     Timeline
                                                                                       • Stage 1 (April 9 - June 23): Participants build and test models
                                            K
                                            Õ ni                                         on the training data. Each team can make at most three
                         weiдhted-F 1 =               F 1i                (3)
                                                  N                                      submissions per day in this stage. The leaderboard (Fig. 3
                                            i=1
                                                                                         left) only shows the model performance on a subset of the
   It can be shown that weighted-recall is actually equal to absolute                    test set, i.e., the first 20, 000 test product titles. Note that this
accuracy. Since the product distribution over the taxonomy tree                          subset is randomized since the entire test set is randomized
is highly imbalanced, weighted version of precision, recall, and F1                      using stratified sampling on the one million titles.
make much more sense than macro or micro version of precision,                         • Stage 2 (June 24): The leaderboard switches to Stage 2 on
recall, and F1 do. The evaluation script is also provided during the                     June 24 (Fig. 3 right) and shows the model performance
data challenge4 .                                                                        on the entire test set, i.e., all 200, 000 test product titles.
                                                                                         No submission is allowed once the leaderboard has been
4 https://github.com/sigir-ecom/dataChallenge                                            switched to this stage.
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA                                      Yiu-Chang Lin, Pradipto Das, and Ankur Datta


4   RIT-BOSTON BASELINE METHOD
The RITB-Baseline method uses the method of Joulin et al. [4],
which has been very popular recently for large-scale multi-label
text classification. The method, dubbed, fastText, uses lock free
asynchronous gradient descent optimization [8] of a log linear loss
function – it is actually a single layer perceptron with no non-
linearities in the activation functions. For details on the derivations
for the fastText model parameters and using a softmax loss func-
tion, see Section 7.
    Generally, in a production setting, we have seen gradient boosted
decision trees (GBDTs) to perform just as well as state-of-the-art
Convolutional Neural Networks (CNNs), which has been previously           Figure 4: Class imbalance measured as KL divergences of em-
reported in [1]. However, although, GBDTs have the added advan-           pirical distributions over data proportions in top level cate-
tage of incorporating intuitive feature engineering as well as decid-     gories from their respective uniform distributions.
ing feature importance, their training time on large datasets can be
prohibitively slow. Recent trends in e-commerce industry has thus
shifted to using fastText or other deep learning approaches for           a baseline hierarchical cascading classifier using fastText, how-
generic text classification problems, of which taxonomy classifica-       ever, for this dataset, we found out that a “flat” fastText classifier
tion of product titles is a specific instance. Deep learning techniques   performed better by at least one percentage point absolute. Ad-
are a continuously evolving family of models with varying degrees         ditionally, the “negative sampling” loss function has also shown
of architecture engineering, optimization algorithms and parameter        better performance than the typical “softmax” loss. However, fine
fine tuning methods and as such it is difficult to assign a baseline      tuning of parameters based on the negative sampling loss function
model using this family of methods.                                       can show signs of overfitting.
    We thus choose to use fastText as a quick way to solve the                For data preprocessing, we use the token normalizer mentioned
taxonomy classification problem. Further fastText apparently has          in [1]. We first tokenize using whitespace and lowercase the to-
less parameters to tune than comparable deep learning methods.            kens. Then for each token, we separate any adjoining punctuations
The reason for the choice of the word apparent will become clear          from numbers (decimal or otherwise). We replace all numbers with
once we explain the graphs in Fig. 5. The fastText method in [4]          a @NUMBER@ literal token and remove all punctuation tokens. We
employs many tricks, including hashing to a fixed vocabulary size         preserve alphanumeric tokens in the hope that sometimes longer
of n-grams, lock free asynchronous gradient descent optimization          alphanumeric strings encode model numbers that are unique to
[8], averaging of word representations as document representation         each category. For instance, the surface form of the title, “"Ana Silver
and others. Further, given a training set containing NV words from        Co Rainbow Moonstone Earrings 2 1/4"" (925 Sterling Silver) - Hand-
a vocabulary of V words and a number of passes over the data,             made Jewelry EARR346812"” gets converted to “ana silver co rainbow
i.e. epochs, equal to E, the learning rate, α, at time t is updated       moonstone earrings @NUMBER@ @NUMBER@ @NUMBER@ @NUMBER@ ster-
according to                                                              ling silver handmade jewelry earr34681” using our normalization
                                                                          scheme. For the baseline, we also do not remove any stop words or
                                            t                             perform any morphological operations on the resulting tokens. The
                         α (t ) = γ (1 −        )                  (4)
                                           NV E                           entire training dataset is de-duplicated using the (category label,
                                                                          title normal form) tuples.
where γ is a fixed parameter. Hence, unlike optimizing for the learn-         The plots in Fig. 5 confirms the non-deterministic nature of
ing rate or step size found in more principled convex optimization        fastText due to the use of asynchronous SGD. In fastText, each
techniques, fastText uses an heuristic estimate and a linear decre-       thread reads its own view of the data, creates a mini-batch of size
ment of the learning rate that allows it to calculate an estimated        one, and updates the parameter matrices without any locking mech-
time of completion for the training phase to end.                         anism. This type of lock free parameter updates have been shown to
   The high degree of class imbalance becomes apparent upon               be just as effective as their synchronous counterparts under the as-
observing the Kullback Leibler (KL) divergences of the empirical          sumptions of feature sparsity in the data [8]. The plots in Fig. 5 show
distribution over data points in the training set for each of the top     absolute accuracy results on our 10% Dev2 set for two different sce-
level categories, from their respective uniform distributions (see        narios, after training a bi-level fastText classifier on the training
Fig. 4). The RDC Challenge dataset may have exhibited this phe-           data that results after excluding the Dev1 and Dev2 sets. For the
nomenon due to the dataset selection process, however, in general,        first scenario, we run fastText for forty runs with the same param-
depending on common choices of e-commerce taxonomies for mid-             eter settings of dim=min(120,K/2), epoch=100, wordNgrams=3,
size organizations, KL divergences usually vary between 1.0 and           loss=softmax, thread=45, with K being the number of classes
2.0 [1].                                                                  for a particular subtree rooted at a classification node in the taxon-
   We initially split the 0.8M training dataset into a 10% Dev2 set, a    omy tree. For this scenario, we obtain the mean absolute accuracy
10% Dev1 set after excluding the Dev2 set, and the rest for training      to be 74.0908 with a standard deviation of 0.0741. For the second
and cross-validation. The splitting is done using stratified sampling.    scenario, we run fastText for six different runs with the same
Following a bi-level classification scheme in [1], we initially built     settings, except that the number of threads are set to 1, 2, 5,
Overview of the SIGIR 2018 eCom Rakuten Data Challenge                     eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA


Figure 5: Absolute accuracy results on our 10% Dev2 set for the following two scenarios. Left: Forty different runs of fastText
using the exact same settings. Right: Six different runs of fastText using the same settings except for the number of threads.


15, 30 and 45 respectively. The solid bars in the right plot of Fig.       • Team HSJX-ITEC-YU submitted one system (0.7790 F1
5 shows the number of threads and the corresponding accuracy                 score) that uses K Nearest Neighbors (KNN) classification
numbers are shown on the line graph above the solid bars.                    model and the Best Match (BM) 25 probabilistic information
    We can observe from the graph in the right of Fig. 5, that the triv-     retrieval model. The top k product title matches of the BM25
ially synchronous version of fastText, i.e. using just one thread,           model are then classified by the KNN model.
is the weakest of all models. This makes sense since fastText is           • Team JCWRY submitted one system (0.8295 F1 score) that
just a singe layer perceptron with no non-linearities. Accuracy im-          uses deep convolutional neural networks with oversampling,
proves by almost 70% when the number of threads is set to two, but,          threshold moving and error correct output coding to predict
plateaus off for the other settings of thread counts after that. For         product taxonomies. Their best accuracy is obtained through
our Dev2 set, we obtain the best accuracy using thirty threads and           an esemble of multiple networks, such as Kim-CNN [5] and
that is the number of threads we set for all of our fastText baseline        Zhang-CNN [9], trained on different extracted features in-
runs. The asynchronous SGD of fastText with a mini-batch size of             puts, inlcuding doc2vec [6], NER and POS features.
one, raises a legitimate concern – the number of threads apparently        • Team MKANEMAS submitted one system (0.8399 F1 score)
behaves as a parameter that regulates generalization performance             that formulates the task as a simple classification problem
and is machine architecture dependent. Needless to mention that              of all leaf categories in the given dataset. The key feature
this behavior is independent of the loss function used. In our ex-           of their system is the combination of a Convolutional Neu-
periments, the default setting of twelve threads have shown lower            ral Network and Bidirectional LSTM using ad-hoc features
performance than thirty threads for this task.                               generated from an external dataset (Amazon Product Data)
    After some minimal experiments on our Dev2 set, we finalized             [3][6].
the settings of our fastText baseline, RITB-Baseline (see Table            • Team mcskinner submitted one system (0.8513 F1 score)
3 and Fig. 3), to be the following: -dim 300 -minn 4 -maxn 10                using a straightforward network architecture and ensemble
-wordNgrams 3 -neg 10 -loss ns -epoch 3000 -thread 30.                       LSTM strategy to achieve competitive results. The positive
We have set a high number of epochs due to the availability of free          impact of tightening the connections between recurrent and
computational cycles in our servers.                                         output layers through the use of pooling layers is also demon-
    In the next section, we briefly highlight the methods used for the       strated. The author similarly provides practical details on
data challenge task by the teams who have submitted their system             their training methodology and algorithms for probability
description papers.                                                          calibration. The final solution is produced by a bidirectional
                                                                             ensemble of 6 LSTMs with Balanced Pooling View architec-
5    SYSTEM DESCRIPTIONS                                                     ture.
Below is a list of each team’s systems. Their team names, affilia-         • Team minimono submitted one system (0.7994 F1 score)
tion and system description paper title available on the workshop            based on word-level sequence-to-sequence neural networks
website can be seen in Table 3.                                              widely used in machine translation and automatic docu-
     • Team CorUmBc submitted one system (0.7690 F1 score)                   ment summarization. By treating taxonomy classification as
       based on a Bidirectional Long Short Term Memory Network               a translation problem from a description of a product to a
       (BiLSTM) to capture the context information for each word,            category path. The text of the product name is viewed as
       followed by a multi-head attention model to aggregate useful          the encoder input and the sequence of category name as the
       information from these words as the final representation of           decoder output.
       the product title. Their model adopts an end-to-end archi-          • Team neko submitted one system (0.8256 F1 score) that
       tecture without any hand-crafted features and regulated by            treats category prediction as a sequence generation task.
       various dropout techniques.                                           The authors built a word-level sequence-to-sequence model
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA                                       Yiu-Chang Lin, Pradipto Das, and Ankur Datta


  Team Name                         Affiliation                                                System Description Paper
 CorUmBc            University of Maryland Baltimore County         Large Scale Taxonomy Classification using BiLSTM with Self-Attention
 HSJX-ITEC-YU       York University & Trent University              A Best Match KNN-based Approach for Large-scale Product Categorization
                                                                    An Empirical Study of Using An Ensemble Model in E-commerce Taxonomy
 JCWRY              Individual Researchers
                                                                    Classification Challenge
                                                                    Convolutional Neural Network and Bidirectional LSTM Based Taxonomy
 MKANEMAS           Yahoo Japan Corporation
                                                                    Classification Using External Dataset at SIGIR eCom Data Challenge
 mcskinner          Individual Researcher                           Product Categorization with LSTMs and Balanced Pooling Views
 minimono           University of Tsukuba                           Encoder-Decoder neural networks for taxonomy classification
 neko               Rakuten Institute of Technology Singapore       Unconstrained Production Categorization with Sequence-to-Sequence Models
 RITB-Baseline      Rakuten Institute of Technology Boston          Overview of the SIGIR 2018 eCom Rakuten Data Challenge
 Tyken2018          Tohoku University                               Large-Scale Taxonomy Problem: a Mixed Machine Learning Approach
 Topsig             Queensland University of Technology             TopSig at the SIGIR’eCom 2018 Rakuten Data Challenge
 tiger              JD.com                                          Multi-level Deep Learning based E-commerce Product Categorization
 Uplab              Uplab SAS                                       Ecommerce Product Title Classification
 Waterloo           University of Waterloo                          Team Waterloo at the SIGIR E-Commerce Data Challenge
Table 3: List of participants who submitted system description papers. All submitted papers are available at: https://sigir-ecom.
github.io/accepted-papers.html.


      to generate non-constrained product categories that are not                category issue, sampling and data enhancement techniques
      limited to the labels from the training set. The final model               are used. They build eight sample datasets according to the
      is an ensemble of several attentional sequence-to-sequence                 category hierarchy and develop two classification algorithms
      models.                                                                    to build models for different levels and search paths using
    • Team Tyken2018 submitted one system (0.7509 F1 score)                      category trees.
      that builds upon a two-step architecture where the first step            • Team Uplab submitted three systems based on different
      classifies the input product title to the genre tree it belongs            classifier types, including single flat linear Support Vector
      to and the second step predicts its category path. For the                 Machines classifier (0.8366 F1 score), a top down esemble
      path prediction, two methods are proposed, original method                 which combines top-level and sub-level classifiers (0.8173 F1
      and actual method. They originally proposed a shallow feed-                score) and a CNN with pre-trained word embeddings (0.6509
      forward fully connected or deep neural network, i.e., 50-layer             F1 score). The authors found that TF-IDF with both bi-gram
      ResNet architecture [2] depending on how complex the tree                  and uni-gram features work best for categorization.
      structure is, but constructed a hierarchical style model in the          • Team Waterloo submitted one system (0.7781 F1 score) that
      end.                                                                       uses multinomial naive-bayes, logistic regression, batched
    • Team Topsig submitted one system (0.7941 F1 score) that                    gradient-descent and neural softmax classifiers retro-fitted
      utilizes random indexing to create topological signatures                  for the hierarchical classification objective. The authors eval-
      (TopSig) to categorize the product names in the provided                   uate different classification methods, including flat and hier-
      data sets. The authors make use of their open-source Top-                  archical classification, and implement a general framework
      Sig tool to generate dimensionality-reduced signatures and                 to tackle the task of taxonomy classification.
      search for these signatures. The signature generation ap-
      proach used by TopSig has been shown to be effective at
      different tasks. The approach can be regarded as a variation         6   RESULTS
      on LSH to approximate nearest neighbour search.                     Twenty six research groups/individuals participated in the data
    • Team tiger submitted one system (0.8379 F1 score) that com-         challenge. Each team submitted their predicted category-id-path
      bines machine learning, deep learning, and natural language         on the test set in the same TSV format as the training set (Table 1).
      processing to create a multi-level and multi-class deep learn-      We score the system submissions against the gold standard using
      ing tree method. This includes multiple models based on             weighted precision, recall and F1 as defined in Section 3. The final
      single-label and multi-level label predictions, as well as char-    results are summarized in Figure 3. The left hand side of Figure 3
      acteristics of the product tree structure. The training dataset     shows the leaderboard in Stage 1 (evaluated on the first 20,000 test
      and the test dataset are merged to pre-train word vectors           titles) and the right hand side shows stage 2 results (evaluated on
      to calculate semantic similarity. To address the unbalanced         the entire set of test titles).
Overview of the SIGIR 2018 eCom Rakuten Data Challenge                                            eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA


   In Stage 1, the top five teams are team mcskinner, team MKANE-                           [8] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild:
MAS, team tiger, team Uplab and team JCWRY with 0.8510,                                         A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Advances
                                                                                                in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L.
0.8421, 0.8404, 0.8375 and 0.8278 weighted-F1 scores, respectively.                             Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 693–701.
In Stage 2, the same five teams rank at top five with 0.8513, 0.8399,                       [9] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional
                                                                                                networks for text classification. In Advances in neural information processing
0.8379, 0.8366 and 0.8295 weighted-F1 scores. The rank does not                                 systems. 649–657.
change much between Stage 1 and Stage 2, except for team Uplab-
2, team Tyche and team HSJX-ITEC-YU. Table 4 shows the list
of teams whose ranks is higher in Stage 2 than in Stage 1.                                  7     APPENDIX
                                                                                            The loss function for fastText, mentioned in [4], is the following
    Team Name        ∆ Rank Stage 1 Rank Stage 2 Rank
                                                                                            log likelihood function:
    Uplab-2            +3         11            8
    Tyche              +1         13           12
                                                                                                            N                        N                             N
                                                                                                                               !
    HSJX-ITEC-YU       +5         16           21                                                           Ö                        Õ                             Õ
                                                                                                L = log          f (BAxn )tn =            tn log(f (BAxn )) =            Ln     (5)
Table 4: List of teams that rank higher in Stage 1 than in                                                 n=1                      n=1                            n=1
Stage 2.
                                                                                            where tn is the label of the nth training instance. The optimization
   With regards to F1 scores, the difference between Stage 1 and
                                                                                            problem on the training set is to find the optimal parameters A∗
Stage 2 is less than 0.01 except for team HSJX-ITEC-YU, whose
                                                                                            and B∗ by minimizing the negative log likelihood in Equ. 5.
score difference is more than 0.06. Table 5 shows the the list of top
twenty teams whose F1 score is higher in Stage 2 than in Stage 1.
                                                                                                                                         N
                                                                                                                                      1 Õ
                                                                                                                         arg min −     ×    Ln                                  (6)
    Team Name        Stage 1 F1 Stage 2 F1 Stage 2 Rank                                                                    A∗,B∗      N i=1
    mcskinner          0.8510     0.8513         1
    JCWRY              0.8278     0.8295         5
                                                                                               The parameter matrix A connects the input features to the units
    neko               0.8245     0.8256         6
                                                                                            in the hidden layer of dimension H . This matrix is a look-up ta-
    Uplab-2            0.8149     0.8173         8
                                                                                            ble over the words [4] and H is a parameter of the model. The
    Tyche              0.7976     0.8004        12
                                                                                            weight parameter matrix, B, connects the hidden units to the out-
    VanGuard           0.7871     0.7884        15
                                                                                            put units. The k th output unit of the nth instance contributes to
    HSJX-ITEC-YU       0.7176     0.7790        16
                                                                                            the conditional probability of the output yn,k , of the k th label be-
    Waterloo           0.7767     0.7781        17
    Sam-chan           0.7617     0.7666        19                                          ing 0 or 1. That is, for a particular nth instance (xn , tn ), we have,
                                                                                                  ÎK      t n,k
    Tyken2018          0.7431     0.7509        20                                          Ln = k=1     yn,k   where tn,k ∈ {0, 1}.
Table 5: List of the top twenty teams that have higher F1                                      From now on, let L = −Ln , so that we can clear up the notation.
                                                                                                                                                    ÎN
score in Stage 1 than in Stage 2.                                                           Also, the data is assumed to be i.i.d so that p(X) = n=1      p(xn ). Let
                                                                                            the input nodes be labeled as x i , i ∈ {1, ..., D}; the hidden layer
                                                                                            nodes be labeled as z j , j ∈ {1, ..., H } and the output layer nodes be
REFERENCES                                                                                  labeled as vk , k ∈ {1, ..., K }. The target for the input x = {xd } is a
[1] Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta.
    2017. Web-scale language-independent cataloging of noisy product listings for
                                                                                            one-hot K vector with the k th entry to be 1 if x belongs to class
    e-commerce. In Proceedings of the 15th Conference of the European Chapter of the        k. Let vk be the input to the k th output unit i.e. vk = BTk z with
    Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1. 969–979.            Í              
                                                                                            zh =     D A        x .
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual                       d =1 d,h d
    learning for image recognition. In Proceedings of the IEEE conference on computer
    vision and pattern recognition. 770–778.
                                                                                               Using the method of maximum likelihood estimation on the
[3] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual                 loss function L, where tk ∈ {0, 1} and yk = p(vk ) ∈ [0, 1], or,
    evolution of fashion trends with one-class collaborative filtering. In proceedings      equivalently, using gradient descent minimization on L, we can
    of the 25th international conference on world wide web. International World Wide
    Web Conferences Steering Committee, 507–517.                                            derive the updates for model parameter estimators.
[4] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag
    of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of
    the European Chapter of the Association for Computational Linguistics: Volume 2,        7.1      Parameter updates for the output layer
    Short Papers. Association for Computational Linguistics, 427–431.
[5] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv        We first begin by finding the updates to the B parameter from Equ. 6.
    preprint arXiv:1408.5882 (2014).                                                        For brevity, we drop the subscript n henceforth. Taking derivatives
[6] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel.
    2015. Image-based recommendations on styles and substitutes. In Proceedings
                                                                                            of the loss function at the output layer w.r.t. Bk,h , we obtain,
    of the 38th International ACM SIGIR Conference on Research and Development in
    Information Retrieval. ACM, 43–52.
[7] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Schein-
                                                                                                                          ∂L
                                                                                                                               = −(tk − yk )zh                                  (7)
    berg, and Martin Takac. 2018. SGD and Hogwild! Convergence Without the                                               ∂Bk,h
    Bounded Gradients Assumption. In Proceedings of the 35th International Conference
    on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and
    Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden,                                                        T               T
                                                                                            where, yk = p(vk ) = so f tmax(vk ) = e Bk z / kK′ =1 e Bk ′ z
                                                                                                                                          Í
    3747–3755.
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA                                                                    Yiu-Chang Lin, Pradipto Das, and Ankur Datta


  Derivation: For a particular nt h instance, we have                                                  Derivation: As in the previous section, for a particular nth
                                                                                                    instance, we have, using the chain rule of derivatives,
              K
                                                                                                                        K 
                                                                                                               ∂L             ∂L ∂vk       ∂zh
              Õ                                                                                                                         
      −L =           tk log yk
                                                                                                                       Õ
                                                                                                           −        =
              k =1                                                                                            ∂Ah,d           ∂vk ∂zh ∂Ah,d
                                                                                                                       k =1
               K                          K
                                                               !                                                                          Í             
                                                                                                                                        ∂ h=1H B
                                                                                                                                                   k,h zh ª ∂z
              Õ                           Õ
          =          tk vk − log                exp(v ) k′                                    (8)                      ÕK ©
                                                                                                                                                                 h
              k =1                        k ′ =1                                                                    =       ­(tk − yk )                    ®
                                                                                                                            ­                  ∂zh         ® ∂A
                                                                                                                                                               h,d
              ÕK                              ÕK
                                                                    !                                                  k =1
          =          tk BTk z − log                 exp(BTk z)
                                                                                                                            «                    Í        ¬    
                                                                                                                        K                      ∂    D A      x
              k =1                         k ′ =1
                                                                                                                       Õ                           d =1 h,d d
                                                                                                                    =        (tk − yk ) Bk,h
                                                                                                                                             
               K           H                          K                 H                                                                            ∂Ah,d
                                                                                        !!
                                                                                                                            k =1
              Õ            Õ                          Õ                 Õ
          =          tk          Bk,h zh − log                 exp            Bk,h zh         (9)
              k =1         h=1                        k ′ =1            h=1
                                                                                                    leading to,
                                                                                                                                    K
                                                                                                                                                    !
                                                                                                                            ∂L      Õ
  Using chain rule of derivatives,                                                                                               =−   (tk − yk )Bk,h xd                (13)
                                                                                                                           ∂Ah,d
                                                                                                                                          k =1
                               ∂L     ∂L ∂vk                                                          Thus for the Ah,d parameter estimator update, we have:
                                    =
                              ∂Bk,h   ∂vk ∂Bk,h
                                                                                                                  (t +1)      (t )                ∂L
                                                                                                              Ah,d = Ah,d + α × −
  For the first term, we have:                                                                                                                       (t )
                                                                                                                                                 ∂Ah,d
                                                                                                                                              K 
                             Í                                                                                                                                   !
                          ∂  log   K     exp(v   ′)                                                                           (t )
                                                                                                                                              Õ
                                                                                                                                                       (t )
                                                                                                                                                            
                                                                                                                                                              (t )
               ∂L                  k =1
                                    ′          k      ∂vk ′                                                                = Ah,d + α ×          tk − yk Bk,h xd       (14)
            −      = tk −
              ∂vk                   ∂vk ′             ∂vk                                                                                     k =1
                                exp(vk ′ )                                                             Note that the document or sentence representation in fastText
                   = tk −                      δ k′                                                 is obtained by an averaging of the vector representations of the
                          log kK′ =1 exp(vk ′ ) k
                              Í
                                                                                                    constituent n-grams. Thus, the gradients involving x are normalized
                          = t k − yk                                                                by the number of word and/or character n-grams in the input x.
  For the second term, we have:                                                                     7.3    Asynchronous Stochastic Gradient Descent
                                                                                                    The general expression for updating the estimators of a parameter
                              Í              
                            ∂    H B      z
                    ∂vk         h=1   k,h   h
                                                                                                    w, in a gradient descent optimization scheme is given by,
                         =
                   ∂Bk,h          ∂Bk,h
                                                                                                                                     w(t +1) = w(t ) + αu              (15)
                         = zh
                                                                                                    where α is the learning rate or step length and u is a descent direc-
                                                                                                    tion, which is set to − ∂w∂L . The fastText model relies on asyn-
leading to,                                                                                                                     (t )
                                                                                                    chronous stochastic gradient descent (ASGD) method of optimiza-
                               ∂L                                                                   tion following [8], where, several cores can update any component,
                                    = −(tk − yk )zh                                          (10)
                              ∂Bk,h                                                                   (t )
                                                                                                    wj , several times, but atomically, on the same iteration.
                                                                                                        Both ASGD and SGD depend on finding unbiased estimators of
  Thus for the Bk,h parameter update, we have:                                                      gradients. The key difference is that for the former, the gradient
                                                                                                    computation for parameter estimators is allowed to use stale val-
                     (t +1)        (t )                  ∂L
                 Bk,h = Bk,h + α × −                                                                ues of the previously updated parameter iterates. Convergence is
                                                             (t )
                                                       ∂Bk,h                                        shown to happen for certain scenarios, where the loss functions
                                 (t )
                                                    
                                                 (t ) (t )                                          are convex and the stochastic gradients are assumed to be bounded
                              = Bk,h + α × tk − yk zh                                        (11)   in expectation. In the original ASGD paper [8], this convergence
                                                                                                    is shown for a fixed step size. The convergence analysis for ASGD
where α is the learning rate that is updated as shown in Equ. 4.                                    in [7] is shown for the non-bounded gradient assumption and a
                                                                                                    continuously decreasing step size, which is the case for fastText.
7.2    Parameter updates for the input layer                                                            It is mentioned in [7], that the permutation of indices for the
Taking derivatives of the loss function at the output layer w.r.t.                                  parameter vector or matrix updates when there are inconsistent
Ah,d , we obtain,                                                                                   reads and writes by several cores is crucial for good performance.
                                                                                                    In fastText, the compute cores start iterating over the data from
                                                                                                    fixed positions in the training file and not randomly as in the case
                              K                                                                     of true SGD. This might be a reason why the solution computed
                                              !
                      ∂L      Õ
                           =−   (tk − yk )Bk,h xd                                            (12)   by fastText, is not stochastically optimal with regards to the as-
                     ∂Ah,d
                                          k=1                                                       sumptions mentioned in [7].

</pre>