=Paper=
{{Paper
|id=Vol-2319/ecom18DC_paper_13
|storemode=property
|title=Overview of the SIGIR 2018 eCom Rakuten Data Challenge
|pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_13.pdf
|volume=Vol-2319
|authors=Yiu-Chang Lin,Pradipto Das,Ankur Datta
|dblpUrl=https://dblp.org/rec/conf/sigir/LinDD18
}}
==Overview of the SIGIR 2018 eCom Rakuten Data Challenge==
Yiu-Chang Lin, Pradipto Das, Ankur Datta
Rakuten Institute of Technology
Boston, MA, 02110 - USA
yiuchang.lin@rakuten.com, pradipto.das@rakuten.com, ankur.datta@rakuten.com
ABSTRACT
This paper presents an overview of the SIGIR 2018 eCom Rakuten Data Challenge. In this data challenge, Rakuten.com has released a sample of one million product titles and the corresponding anonymized category paths from its entire product catalog. Of these, 0.8 million product titles and their corresponding category paths are released as training data, and 0.2 million product titles are released as test data. The task is to predict the category, defined as a full path in the taxonomy tree provided in the training set, of the product titles in the test set. The evaluation is divided into two stages to measure system performance on a part of the test data and on the entire test data. The different systems are evaluated using weighted precision, recall and F1. In total, 26 teams have submitted 28 systems, with a top performance of 0.8513 weighted F1 score.

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

1 INTRODUCTION
The SIGIR 2018 eCom Rakuten Data Challenge (https://sigir-ecom.github.io/data-task.html) is organized by Rakuten Institute of Technology, Boston (https://rit.rakuten.co.jp/), a dedicated R&D organization for the Rakuten group (https://www.rakuten.com/) of companies. The data challenge focuses on the task of large-scale taxonomy classification of product titles, where the goal is to predict each product's category, defined as a full path from the root node to a leaf node in the taxonomy tree provided in the training set. The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce platform, with applications ranging from basic data organization and personalized search recommendations to query understanding and targeted campaigning. For instance, in the Rakuten.com catalog, “Dr. Martens Air Wair 1460 Mens Leather Ankle Boots” is categorized under the “Clothing, Shoes & Accessories > Shoes > Men > Boots” leaf. However, manual and rule-based approaches to categorization are not scalable, since commercial product taxonomies are organized in tree structures with three to ten levels of depth and thousands of leaf nodes. Advances in this area of research have been limited due to the lack of real data from actual commercial catalogs.

The challenge presents several interesting research aspects due to the intrinsic noisy nature of the product labels, the size of modern e-commerce catalogs, and the typical imbalanced data distribution. We hope that by making the data available to the participants, the task will attract more research institutions and industry practitioners who have not had the opportunity to contribute their ideas due to the lack of actual commercial e-commerce catalog data. In a typical e-commerce setting, merchants are responsible for matching their products with an existing category, which is a leaf node in the taxonomy tree. The problem of large-scale taxonomy classification is thus immensely useful in helping merchants upload their products to the right places of an e-commerce platform catalog.

2 DATASET
Rakuten has released a sample of one million product listings, comprising a training set (0.8 million) and a test set (0.2 million), consisting of product titles and their corresponding category paths, which belong to a taxonomy tree that describes the varying degrees of generality of the items in the catalog. A product taxonomy is a tree-based hierarchical representation of the labels of the listings in a catalog.

Rakuten's catalog of products is much larger than a million listings. The larger catalog of product listings is first de-duplicated using (leaf node label, product title) tuples as keys, and then a random sample of one million listings is drawn. This set of listings has been approved for publication for this year's data challenge. The set of one million product listings may not cover the entire set of leaf nodes in Rakuten.com's taxonomy, and so, any taxonomy generated from the category labels provided with the training set may be truncated.

Further, to preserve the anonymity of the taxonomy, the nodes in the actual taxonomy have been masked with random integers. The class, or category label, of a product listing is thus a path from the root of the taxonomy tree to the leaf node where the listing resides. In the left-to-right ordering of the nodes in such a path, the path becomes more specific in describing the listing as it approaches the leaf level. The string representation of a path from root to leaf is henceforth dubbed a “category-id-path”.

The training data file is in tab-separated values (TSV) format, where each line contains a product title and its corresponding category-id-path. In the test set, only the product titles are provided, and the objective of this data challenge is to predict the full category-id-path for each such title. Table 1 and Table 2 show some examples of product titles from the training set and test set, respectively. The partitioning of the training and test sets is obtained using category-wise stratified sampling of the one million listing dataset.

2.1 Data Characteristics
In the training set, there are 800,000 product titles from 3,008 leaf-level nodes. Product titles are unevenly distributed among these 3,008 categories. The top ten categories compose around 30% of the data set and the top forty categories compose around 50% of the data set.
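The category and title-length statistics quoted in this subsection and plotted in Figures 1 and 2 can be reproduced with a short script over the released training file. The following is a minimal sketch; the file name train.tsv and the title-then-path column order are assumptions based on the TSV format described in Section 2, not part of any official challenge tooling.

```python
from collections import Counter

def dataset_stats(path="train.tsv"):
    """Recompute leaf-category counts, category-path depths and title lengths."""
    leaf_counts, depth_counts = Counter(), Counter()
    word_lengths, char_lengths = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            title, cat_path = parts[0], parts[-1]
            leaf_counts[cat_path] += 1                    # one class per full category-id-path
            depth_counts[len(cat_path.split(">"))] += 1   # depth of the category-id-path
            word_lengths.append(len(title.split()))       # whitespace-tokenized word length
            char_lengths.append(len(title))
    n = sum(leaf_counts.values())
    top10 = sum(c for _, c in leaf_counts.most_common(10))
    print("leaf categories:", len(leaf_counts))
    print("share of top-10 categories:", top10 / n)
    print("avg / max title length (words):", sum(word_lengths) / n, max(word_lengths))
    print("avg / max title length (chars):", sum(char_lengths) / n, max(char_lengths))
    print("depth distribution:", dict(sorted(depth_counts.items())))
```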
Figure 1: Left: Product title frequencies for 3,008 leaf nodes in the taxonomy tree. Both the X and Y axes are in log scale. Rank ranges from 1 to 3,008. Right: Product title frequencies for different lengths of category paths.
Figure 2: Left: Word length distribution over all product titles in the training set. Right: Character length distribution over all
product titles in the training set.
The left hand side of Figure 1 shows the distribution of product title frequency across all the leaf-level categories (sorted by frequency), where the X and Y axes are both in log scale. Across the 800,000 product titles, the maximum depth of a category path is 8 and the average depth is 4. The right hand side of Figure 1 shows the distribution of product title frequency across different depths of category-id-paths. The average word-level title length over all categories, after the surface form of the title is tokenized by white space, is 10.93, with a maximum length of 58 words. Similarly, the average character-level title length is 68.44, with a maximum length of 255. Figure 2 shows the title length distributions at word-level and character-level, respectively.

Product Titles | category-id-paths
Replacement Viewsonic VG710 LCD Monitor 48Watt AC Adapter 12V 4A | 3292>114>1231
Ka-Bar Desert MULE Serrated Folding Knife | 4238>321>753>3121
5.11 TACTICAL 74280 Taclite TDU Pants, R/M, Dark Navy | 4015>3285>1443>20
Skechers 4lb S Grip Jogging Weight set of 2- Black | 2075>945>2183>3863
Generations Small Side Table White | 4015>3636>1319>1409>3606
Table 1: Examples of product titles from the training set.

Product Titles
Disc Brake Rotor-Advanced Technology Rear Raybestos 980368
Coquette Neon Pink Ruffle Babydoll 7035 Neon Pink One Size Fits All
12V 7Ah (SPS Brand) APC NS3000RMT3U Replacement Battery (4 Pack)
Honda Ridgeline 2006-08 Black Radio AM FM 6 Disc CD PN 39100-SJC-A100 Face 3TS1
Frankford Arsenal Platinum Series Case Prep Essentials Kit
Table 2: Examples of product titles from the test set.

3 EVALUATION
In this section, we briefly describe the evaluation criteria used to judge the different systems. Although we do not resort to stricter statistical significance testing of the different systems using confidence interval estimation from bootstrapped samples of the test set, it will be useful to do so once this pilot task has been completed. One motivation for not introducing such testing in this pilot task is that in an e-commerce setting, even a 0.01% improvement means thousands of listings being assigned to their right places in the catalog without human intervention.

3.1 Metrics
The metrics that we have used to evaluate the different classification systems are based on weighted-precision, weighted-recall and weighted-F1 on the test set under exact “category-id-path” match. In other words, a partial path match does not count as a correct prediction. We assume that there are a total of K classes, {c_i | i = 1, 2, ..., K}, in the training set. The number of true instances for each class (support) is n_i, and the total number of training instances is N = \sum_{i=1}^{K} n_i.
Figure 3: Left: Stage 1 leaderboard (evaluated on the first 20,000 test titles). Right: Stage 2 leaderboard (evaluated on all 200,000
test titles).
After calculating the precision (P_i), recall (R_i) and F1 (F1_i) scores for each class c_i, the weighted-precision, weighted-recall and weighted-F1 are defined as follows:

weighted-P = \sum_{i=1}^{K} (n_i / N) P_i    (1)

weighted-R = \sum_{i=1}^{K} (n_i / N) R_i    (2)

weighted-F1 = \sum_{i=1}^{K} (n_i / N) F1_i    (3)

It can be shown that weighted-recall is actually equal to absolute accuracy. Since the product distribution over the taxonomy tree is highly imbalanced, the weighted versions of precision, recall, and F1 make much more sense than the macro or micro versions do. The evaluation script is also provided during the data challenge (https://github.com/sigir-ecom/dataChallenge); a small sketch of how these weighted metrics can be computed is given at the end of this section.

3.2 Leaderboard
The leaderboard shows the weighted precision, recall and F1 scores, up to four decimal digits of precision, for the latest submission from each participating team. The corresponding file submission time is shown as well, so participants can tell which submission the scores belong to. The leaderboard is sorted by the weighted F1 score. We choose not to use an off-the-shelf classifier, such as a logistic regression model or some other standard classifier, as a minimal baseline (lower bound) in the leaderboard.

3.3 Timeline
• Stage 1 (April 9 - June 23): Participants build and test models on the training data. Each team can make at most three submissions per day in this stage. The leaderboard (Fig. 3, left) only shows the model performance on a subset of the test set, i.e., the first 20,000 test product titles. Note that this subset is randomized, since the entire test set is randomized using stratified sampling on the one million titles.
• Stage 2 (June 24): The leaderboard switches to Stage 2 on June 24 (Fig. 3, right) and shows the model performance on the entire test set, i.e., all 200,000 test product titles. No submission is allowed once the leaderboard has been switched to this stage.
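As referenced in Section 3.1, the weighted metrics in Equations (1)-(3) can be computed with standard tooling. The sketch below uses scikit-learn's weighted averaging (which weights each class by its support in the gold labels) purely as an illustration; the authoritative scorer is the evaluation script released with the challenge.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def score_submission(gold_paths, predicted_paths):
    """Weighted precision/recall/F1 over exact category-id-path matches.
    gold_paths and predicted_paths are equal-length lists of path strings
    such as "3292>114>1231"; a partial path match counts as an error."""
    p, r, f1, _ = precision_recall_fscore_support(
        gold_paths, predicted_paths, average="weighted", zero_division=0)
    # As noted in Section 3.1, weighted recall coincides with absolute accuracy.
    acc = accuracy_score(gold_paths, predicted_paths)
    return {"weighted-P": p, "weighted-R": r, "weighted-F1": f1, "accuracy": acc}
```

For example, score_submission(["1>2", "1>2", "3>4"], ["1>2", "3>4", "3>4"]) yields a weighted recall equal to the accuracy of 2/3, illustrating the equivalence noted above.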
4 RIT-BOSTON BASELINE METHOD
The RITB-Baseline method uses the method of Joulin et al. [4], which has recently been very popular for large-scale multi-label text classification. The method, dubbed fastText, uses lock-free asynchronous gradient descent optimization [8] of a log-linear loss function; it is actually a single-layer perceptron with no non-linearities in the activation functions. For details on the derivations of the fastText model parameters under a softmax loss function, see Section 7.

Generally, in a production setting, we have seen gradient boosted decision trees (GBDTs) perform just as well as state-of-the-art Convolutional Neural Networks (CNNs), as previously reported in [1]. However, although GBDTs have the added advantage of incorporating intuitive feature engineering as well as deciding feature importance, their training time on large datasets can be prohibitively slow. Recent trends in the e-commerce industry have thus shifted to using fastText or other deep learning approaches for generic text classification problems, of which taxonomy classification of product titles is a specific instance. Deep learning techniques are a continuously evolving family of models with varying degrees of architecture engineering, optimization algorithms and parameter fine-tuning methods, and as such it is difficult to assign a baseline model using this family of methods.

We thus choose to use fastText as a quick way to solve the taxonomy classification problem. Further, fastText apparently has fewer parameters to tune than comparable deep learning methods. The reason for the choice of the word “apparently” will become clear once we explain the graphs in Fig. 5. The fastText method in [4] employs many tricks, including hashing to a fixed vocabulary size of n-grams, lock-free asynchronous gradient descent optimization [8], and averaging of word representations as the document representation, among others. Further, given a training set containing N_V words from a vocabulary of V words and a number of passes over the data, i.e. epochs, equal to E, the learning rate \alpha at time t is updated according to

\alpha(t) = \gamma \left( 1 - \frac{t}{N_V E} \right)    (4)

where \gamma is a fixed parameter. Hence, unlike optimizing for the learning rate or step size as found in more principled convex optimization techniques, fastText uses a heuristic estimate and a linear decrement of the learning rate, which allows it to calculate an estimated time of completion for the training phase.

The high degree of class imbalance becomes apparent upon observing the Kullback-Leibler (KL) divergences of the empirical distributions over data points in the training set for each of the top level categories, from their respective uniform distributions (see Fig. 4). The RDC Challenge dataset may have exhibited this phenomenon due to the dataset selection process; however, in general, depending on common choices of e-commerce taxonomies for mid-size organizations, KL divergences usually vary between 1.0 and 2.0 [1].

Figure 4: Class imbalance measured as KL divergences of empirical distributions over data proportions in top level categories from their respective uniform distributions.

We initially split the 0.8M training dataset into a 10% Dev2 set, a 10% Dev1 set after excluding the Dev2 set, and the rest for training and cross-validation. The splitting is done using stratified sampling. Following the bi-level classification scheme in [1], we initially built a baseline hierarchical cascading classifier using fastText; however, for this dataset, we found that a “flat” fastText classifier performed better by at least one percentage point absolute. Additionally, the “negative sampling” loss function has also shown better performance than the typical “softmax” loss. However, fine tuning of parameters based on the negative sampling loss function can show signs of overfitting.

For data preprocessing, we use the token normalizer mentioned in [1] (a minimal illustrative sketch is given below). We first tokenize using whitespace and lowercase the tokens. Then, for each token, we separate any adjoining punctuation from numbers (decimal or otherwise). We replace all numbers with a @NUMBER@ literal token and remove all punctuation tokens. We preserve alphanumeric tokens in the hope that longer alphanumeric strings sometimes encode model numbers that are unique to each category. For instance, the surface form of the title “Ana Silver Co Rainbow Moonstone Earrings 2 1/4" (925 Sterling Silver) - Handmade Jewelry EARR346812” gets converted to “ana silver co rainbow moonstone earrings @NUMBER@ @NUMBER@ @NUMBER@ @NUMBER@ sterling silver handmade jewelry earr346812” using our normalization scheme. For the baseline, we also do not remove any stop words or perform any morphological operations on the resulting tokens. The entire training dataset is de-duplicated using the (category label, title normal form) tuples.

The plots in Fig. 5 confirm the non-deterministic nature of fastText due to the use of asynchronous SGD. In fastText, each thread reads its own view of the data, creates a mini-batch of size one, and updates the parameter matrices without any locking mechanism. This type of lock-free parameter update has been shown to be just as effective as its synchronous counterpart under the assumption of feature sparsity in the data [8]. The plots in Fig. 5 show absolute accuracy results on our 10% Dev2 set for two different scenarios, after training a bi-level fastText classifier on the training data that results after excluding the Dev1 and Dev2 sets. For the first scenario, we run fastText for forty runs with the same parameter settings of dim=min(120,K/2), epoch=100, wordNgrams=3, loss=softmax, thread=45, with K being the number of classes for a particular subtree rooted at a classification node in the taxonomy tree. For this scenario, we obtain a mean absolute accuracy of 74.0908 with a standard deviation of 0.0741. For the second scenario, we run fastText for six different runs with the same settings, except that the number of threads is set to 1, 2, 5, 15, 30 and 45, respectively.
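The following is a minimal, illustrative re-implementation of the normalization scheme described above (whitespace tokenization, lowercasing, separating punctuation that adjoins numbers, mapping numbers to @NUMBER@, dropping pure punctuation tokens and keeping alphanumeric tokens). It is a sketch of the described behavior, not the authors' actual normalizer from [1], and edge cases may differ.

```python
import re

def normalize_title(title):
    """Approximate token normalizer for product titles (illustrative sketch)."""
    out = []
    for tok in title.lower().split():              # whitespace tokenize + lowercase
        # separate punctuation adjoining numbers, e.g. '1/4"' -> '1 / 4 "'
        tok = re.sub(r'(\d)([^\w\s])', r'\1 \2', tok)
        tok = re.sub(r'([^\w\s])(\d)', r'\1 \2', tok)
        for piece in tok.split():
            if re.fullmatch(r'\d+([.,/]\d+)?', piece):
                out.append('@NUMBER@')             # replace numbers (decimal or otherwise)
            elif re.fullmatch(r'[^\w\s]+', piece):
                continue                           # drop pure punctuation tokens
            else:
                out.append(piece)                  # keep alphanumeric tokens such as model numbers
    return ' '.join(out)
```

For instance, normalize_title("Skechers 4lb S Grip Jogging Weight set of 2- Black") yields "skechers 4lb s grip jogging weight set of @NUMBER@ black", following the behavior described in the text.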
Figure 5: Absolute accuracy results on our 10% Dev2 set for the following two scenarios. Left: Forty different runs of fastText
using the exact same settings. Right: Six different runs of fastText using the same settings except for the number of threads.
The solid bars in the right plot of Fig. 5 show the number of threads, and the corresponding accuracy numbers are shown on the line graph above the solid bars. We can observe from the graph on the right of Fig. 5 that the trivially synchronous version of fastText, i.e. using just one thread, is the weakest of all models. This makes sense since fastText is just a single layer perceptron with no non-linearities. Accuracy improves by almost 70% when the number of threads is set to two, but plateaus off for the other settings of thread counts after that. For our Dev2 set, we obtain the best accuracy using thirty threads, and that is the number of threads we set for all of our fastText baseline runs. The asynchronous SGD of fastText with a mini-batch size of one raises a legitimate concern: the number of threads apparently behaves as a parameter that regulates generalization performance and is machine architecture dependent. Needless to say, this behavior is independent of the loss function used. In our experiments, the default setting of twelve threads has shown lower performance than thirty threads for this task.

After some minimal experiments on our Dev2 set, we finalized the settings of our fastText baseline, RITB-Baseline (see Table 3 and Fig. 3), to be the following: -dim 300 -minn 4 -maxn 10 -wordNgrams 3 -neg 10 -loss ns -epoch 3000 -thread 30. We have set a high number of epochs due to the availability of free computational cycles on our servers. A minimal illustrative sketch of training with these settings through the fastText Python bindings is shown at the end of this section.

In the next section, we briefly highlight the methods used for the data challenge task by the teams who submitted their system description papers.
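As referenced above, the RITB-Baseline settings map directly onto fastText's supervised-training API. The sketch below uses the fastText Python bindings; the file names, the TSV column order, and the reuse of the normalize_title sketch from earlier are assumptions made for illustration, not the authors' released code.

```python
import fasttext  # pip install fasttext

def train_ritb_baseline(train_tsv="train.tsv", ft_file="rdc_train.ft"):
    """Train a flat classifier with the RITB-Baseline hyper-parameters quoted above."""
    with open(train_tsv, encoding="utf-8") as fin, \
         open(ft_file, "w", encoding="utf-8") as fout:
        for line in fin:
            parts = line.rstrip("\n").split("\t")
            title, cat_path = parts[0], parts[-1]
            # fastText supervised format: "__label__<class> <text>"
            fout.write(f"__label__{cat_path} {normalize_title(title)}\n")
    model = fasttext.train_supervised(
        input=ft_file, dim=300, minn=4, maxn=10, wordNgrams=3,
        neg=10, loss="ns", epoch=3000, thread=30)
    model.save_model("ritb_baseline.bin")
    return model
```

Passing loss="softmax" instead of "ns" in the same call would switch to the softmax objective analyzed in Section 7.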
5 SYSTEM DESCRIPTIONS
Below is a list of each team's systems. Their team names, affiliations and system description paper titles, available on the workshop website, can be seen in Table 3.
• Team CorUmBc submitted one system (0.7690 F1 score) based on a Bidirectional Long Short Term Memory Network (BiLSTM) to capture the context information for each word, followed by a multi-head attention model to aggregate useful information from these words as the final representation of the product title. Their model adopts an end-to-end architecture without any hand-crafted features and is regularized by various dropout techniques.
• Team HSJX-ITEC-YU submitted one system (0.7790 F1 score) that uses a K Nearest Neighbors (KNN) classification model and the Best Match (BM) 25 probabilistic information retrieval model. The top k product title matches of the BM25 model are then classified by the KNN model.
• Team JCWRY submitted one system (0.8295 F1 score) that uses deep convolutional neural networks with oversampling, threshold moving and error correcting output coding to predict product taxonomies. Their best accuracy is obtained through an ensemble of multiple networks, such as Kim-CNN [5] and Zhang-CNN [9], trained on different extracted feature inputs, including doc2vec [6], NER and POS features.
• Team MKANEMAS submitted one system (0.8399 F1 score) that formulates the task as a simple classification problem over all leaf categories in the given dataset. The key feature of their system is the combination of a Convolutional Neural Network and a Bidirectional LSTM using ad-hoc features generated from an external dataset (Amazon Product Data) [3][6].
• Team mcskinner submitted one system (0.8513 F1 score) using a straightforward network architecture and ensemble LSTM strategy to achieve competitive results. The positive impact of tightening the connections between recurrent and output layers through the use of pooling layers is also demonstrated. The author similarly provides practical details on their training methodology and algorithms for probability calibration. The final solution is produced by a bidirectional ensemble of 6 LSTMs with a Balanced Pooling View architecture.
• Team minimono submitted one system (0.7994 F1 score) based on word-level sequence-to-sequence neural networks widely used in machine translation and automatic document summarization, treating taxonomy classification as a translation problem from a description of a product to a category path. The text of the product name is viewed as the encoder input and the sequence of category names as the decoder output.
• Team neko submitted one system (0.8256 F1 score) that treats category prediction as a sequence generation task. The authors built a word-level sequence-to-sequence model to generate unconstrained product categories that are not limited to the labels from the training set. The final model is an ensemble of several attentional sequence-to-sequence models.
Team Name | Affiliation | System Description Paper
CorUmBc | University of Maryland Baltimore County | Large Scale Taxonomy Classification using BiLSTM with Self-Attention
HSJX-ITEC-YU | York University & Trent University | A Best Match KNN-based Approach for Large-scale Product Categorization
JCWRY | Individual Researchers | An Empirical Study of Using An Ensemble Model in E-commerce Taxonomy Classification Challenge
MKANEMAS | Yahoo Japan Corporation | Convolutional Neural Network and Bidirectional LSTM Based Taxonomy Classification Using External Dataset at SIGIR eCom Data Challenge
mcskinner | Individual Researcher | Product Categorization with LSTMs and Balanced Pooling Views
minimono | University of Tsukuba | Encoder-Decoder neural networks for taxonomy classification
neko | Rakuten Institute of Technology Singapore | Unconstrained Production Categorization with Sequence-to-Sequence Models
RITB-Baseline | Rakuten Institute of Technology Boston | Overview of the SIGIR 2018 eCom Rakuten Data Challenge
Tyken2018 | Tohoku University | Large-Scale Taxonomy Problem: a Mixed Machine Learning Approach
Topsig | Queensland University of Technology | TopSig at the SIGIR eCom 2018 Rakuten Data Challenge
tiger | JD.com | Multi-level Deep Learning based E-commerce Product Categorization
Uplab | Uplab SAS | Ecommerce Product Title Classification
Waterloo | University of Waterloo | Team Waterloo at the SIGIR E-Commerce Data Challenge
Table 3: List of participants who submitted system description papers. All submitted papers are available at: https://sigir-ecom.github.io/accepted-papers.html.
• Team Tyken2018 submitted one system (0.7509 F1 score) that builds upon a two-step architecture where the first step classifies the input product title into the genre tree it belongs to and the second step predicts its category path. For the path prediction, two methods are proposed, an original method and an actual method. They originally proposed a shallow feed-forward fully connected or deep neural network, i.e., a 50-layer ResNet architecture [2], depending on how complex the tree structure is, but constructed a hierarchical style model in the end.
• Team Topsig submitted one system (0.7941 F1 score) that utilizes random indexing to create topological signatures (TopSig) to categorize the product names in the provided data sets. The authors make use of their open-source TopSig tool to generate dimensionality-reduced signatures and search for these signatures. The signature generation approach used by TopSig has been shown to be effective at different tasks. The approach can be regarded as a variation on LSH to approximate nearest neighbour search.
• Team tiger submitted one system (0.8379 F1 score) that combines machine learning, deep learning, and natural language processing to create a multi-level and multi-class deep learning tree method. This includes multiple models based on single-label and multi-level label predictions, as well as characteristics of the product tree structure. The training dataset and the test dataset are merged to pre-train word vectors to calculate semantic similarity. To address the unbalanced category issue, sampling and data enhancement techniques are used. They build eight sample datasets according to the category hierarchy and develop two classification algorithms to build models for different levels and search paths using category trees.
• Team Uplab submitted three systems based on different classifier types, including a single flat linear Support Vector Machine classifier (0.8366 F1 score), a top-down ensemble which combines top-level and sub-level classifiers (0.8173 F1 score), and a CNN with pre-trained word embeddings (0.6509 F1 score). The authors found that TF-IDF with both bi-gram and uni-gram features works best for categorization.
• Team Waterloo submitted one system (0.7781 F1 score) that uses multinomial naive Bayes, logistic regression, batched gradient-descent and neural softmax classifiers retro-fitted for the hierarchical classification objective. The authors evaluate different classification methods, including flat and hierarchical classification, and implement a general framework to tackle the task of taxonomy classification.

6 RESULTS
Twenty-six research groups/individuals participated in the data challenge. Each team submitted their predicted category-id-paths on the test set in the same TSV format as the training set (Table 1). We score the system submissions against the gold standard using weighted precision, recall and F1 as defined in Section 3. The final results are summarized in Figure 3. The left hand side of Figure 3 shows the leaderboard in Stage 1 (evaluated on the first 20,000 test titles) and the right hand side shows Stage 2 results (evaluated on the entire set of test titles).
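For illustration, a prediction file in the submission format just described (one title and its predicted category-id-path per line, tab separated) can be produced from a trained fastText model as sketched below; the file and model names follow the earlier hypothetical sketches and are not part of the challenge tooling.

```python
import fasttext

def write_submission(test_file="test_titles.txt",
                     model_path="ritb_baseline.bin",
                     out_path="submission.tsv"):
    """Write one "title<TAB>predicted category-id-path" line per test title."""
    model = fasttext.load_model(model_path)
    with open(test_file, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            title = line.rstrip("\n")
            labels, _ = model.predict(title)          # top-1 predicted label
            cat_path = labels[0].replace("__label__", "", 1)
            fout.write(f"{title}\t{cat_path}\n")
```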
In Stage 1, the top five teams are team mcskinner, team MKANEMAS, team tiger, team Uplab and team JCWRY with 0.8510, 0.8421, 0.8404, 0.8375 and 0.8278 weighted-F1 scores, respectively. In Stage 2, the same five teams rank in the top five with 0.8513, 0.8399, 0.8379, 0.8366 and 0.8295 weighted-F1 scores. The rankings do not change much between Stage 1 and Stage 2, except for team Uplab-2, team Tyche and team HSJX-ITEC-YU. Table 4 shows the list of teams whose rank is higher in Stage 2 than in Stage 1.
Team Name | ∆ Rank | Stage 1 Rank | Stage 2 Rank
Uplab-2 | +3 | 11 | 8
Tyche | +1 | 13 | 12
HSJX-ITEC-YU | +5 | 21 | 16
Table 4: List of teams that rank higher in Stage 2 than in Stage 1.

With regards to F1 scores, the difference between Stage 1 and Stage 2 is less than 0.01, except for team HSJX-ITEC-YU, whose score difference is more than 0.06. Table 5 shows the list of the top twenty teams whose F1 score is higher in Stage 2 than in Stage 1.

Team Name | Stage 1 F1 | Stage 2 F1 | Stage 2 Rank
mcskinner | 0.8510 | 0.8513 | 1
JCWRY | 0.8278 | 0.8295 | 5
neko | 0.8245 | 0.8256 | 6
Uplab-2 | 0.8149 | 0.8173 | 8
Tyche | 0.7976 | 0.8004 | 12
VanGuard | 0.7871 | 0.7884 | 15
HSJX-ITEC-YU | 0.7176 | 0.7790 | 16
Waterloo | 0.7767 | 0.7781 | 17
Sam-chan | 0.7617 | 0.7666 | 19
Tyken2018 | 0.7431 | 0.7509 | 20
Table 5: List of the top twenty teams that have a higher F1 score in Stage 2 than in Stage 1.

REFERENCES
[1] Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Web-scale language-independent cataloging of noisy product listings for e-commerce. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1. 969–979.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[3] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 507–517.
[4] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Association for Computational Linguistics, 427–431.
[5] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).
[6] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.
[7] Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, and Martin Takac. 2018. SGD and Hogwild! Convergence Without the Bounded Gradients Assumption. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research), Jennifer Dy and Andreas Krause (Eds.), Vol. 80. PMLR, Stockholmsmässan, Stockholm Sweden, 3747–3755.
[8] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. 2011. Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 693–701.
[9] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems. 649–657.

7 APPENDIX
The loss function for fastText, mentioned in [4], is the following log likelihood function:

L = \log \prod_{n=1}^{N} f(BAx_n)^{t_n} = \sum_{n=1}^{N} t_n \log(f(BAx_n)) = \sum_{n=1}^{N} L_n    (5)

where t_n is the label of the nth training instance. The optimization problem on the training set is to find the optimal parameters A^* and B^* by minimizing the negative log likelihood in Equ. 5:

A^*, B^* = \arg\min_{A,B} -\frac{1}{N} \sum_{n=1}^{N} L_n    (6)

The parameter matrix A connects the input features to the units in the hidden layer of dimension H. This matrix is a look-up table over the words [4], and H is a parameter of the model. The weight parameter matrix B connects the hidden units to the output units. The kth output unit of the nth instance contributes to the conditional probability of the output y_{n,k}, of the kth label being 0 or 1. That is, for a particular nth instance (x_n, t_n), we have L_n = \prod_{k=1}^{K} y_{n,k}^{t_{n,k}}, where t_{n,k} ∈ {0, 1}.

From now on, let L = -L_n, so that we can clear up the notation. Also, the data is assumed to be i.i.d., so that p(X) = \prod_{n=1}^{N} p(x_n). Let the input nodes be labeled as x_i, i ∈ {1, ..., D}; the hidden layer nodes be labeled as z_j, j ∈ {1, ..., H}; and the output layer nodes be labeled as v_k, k ∈ {1, ..., K}. The target for the input x = {x_d} is a one-hot K vector with the kth entry equal to 1 if x belongs to class k. Let v_k be the input to the kth output unit, i.e. v_k = B_k^T z with z_h = \sum_{d=1}^{D} A_{d,h} x_d.

Using the method of maximum likelihood estimation on the loss function L, where t_k ∈ {0, 1} and y_k = p(v_k) ∈ [0, 1], or, equivalently, using gradient descent minimization on L, we can derive the updates for the model parameter estimators.

7.1 Parameter updates for the output layer
We first begin by finding the updates to the B parameter from Equ. 6. For brevity, we drop the subscript n henceforth. Taking derivatives of the loss function at the output layer w.r.t. B_{k,h}, we obtain

\frac{\partial L}{\partial B_{k,h}} = -(t_k - y_k) z_h    (7)

where y_k = p(v_k) = softmax(v_k) = e^{B_k^T z} / \sum_{k'=1}^{K} e^{B_{k'}^T z}.
Derivation: For a particular nth instance, we have

-L = \sum_{k=1}^{K} t_k \log y_k
   = \sum_{k=1}^{K} t_k \left( v_k - \log \sum_{k'=1}^{K} \exp(v_{k'}) \right)    (8)
   = \sum_{k=1}^{K} t_k \left( B_k^T z - \log \sum_{k'=1}^{K} \exp(B_{k'}^T z) \right)
   = \sum_{k=1}^{K} t_k \left( \sum_{h=1}^{H} B_{k,h} z_h - \log \sum_{k'=1}^{K} \exp\left( \sum_{h=1}^{H} B_{k',h} z_h \right) \right)    (9)

Using the chain rule of derivatives,

\frac{\partial L}{\partial B_{k,h}} = \frac{\partial L}{\partial v_k} \frac{\partial v_k}{\partial B_{k,h}}

For the first term, we have:

-\frac{\partial L}{\partial v_k} = \sum_{k'=1}^{K} t_{k'} \left( \frac{\partial v_{k'}}{\partial v_k} - \frac{\partial \log \sum_{k''=1}^{K} \exp(v_{k''})}{\partial v_k} \right)
                                 = \sum_{k'=1}^{K} t_{k'} \left( \delta_{k'k} - \frac{\exp(v_k)}{\sum_{k''=1}^{K} \exp(v_{k''})} \right)
                                 = t_k - y_k

For the second term, we have:

\frac{\partial v_k}{\partial B_{k,h}} = \frac{\partial \sum_{h'=1}^{H} B_{k,h'} z_{h'}}{\partial B_{k,h}} = z_h

leading to

\frac{\partial L}{\partial B_{k,h}} = -(t_k - y_k) z_h    (10)

Thus for the B_{k,h} parameter update, we have:

B_{k,h}^{(t+1)} = B_{k,h}^{(t)} + \alpha \times \left( -\frac{\partial L}{\partial B_{k,h}^{(t)}} \right) = B_{k,h}^{(t)} + \alpha \times \left( t_k - y_k^{(t)} \right) z_h^{(t)}    (11)

where \alpha is the learning rate that is updated as shown in Equ. 4.

7.2 Parameter updates for the input layer
Taking derivatives of the loss function at the output layer w.r.t. A_{h,d}, we obtain

\frac{\partial L}{\partial A_{h,d}} = -\left( \sum_{k=1}^{K} (t_k - y_k) B_{k,h} \right) x_d    (12)

Derivation: As in the previous section, for a particular nth instance, we have, using the chain rule of derivatives,

-\frac{\partial L}{\partial A_{h,d}} = \sum_{k=1}^{K} \frac{\partial L}{\partial v_k} \frac{\partial v_k}{\partial z_h} \frac{\partial z_h}{\partial A_{h,d}}
                                     = \sum_{k=1}^{K} (t_k - y_k) \left( \frac{\partial \sum_{h'=1}^{H} B_{k,h'} z_{h'}}{\partial z_h} \right) \frac{\partial z_h}{\partial A_{h,d}}
                                     = \sum_{k=1}^{K} (t_k - y_k) B_{k,h} \frac{\partial \sum_{d'=1}^{D} A_{h,d'} x_{d'}}{\partial A_{h,d}}

leading to

\frac{\partial L}{\partial A_{h,d}} = -\left( \sum_{k=1}^{K} (t_k - y_k) B_{k,h} \right) x_d    (13)

Thus for the A_{h,d} parameter estimator update, we have:

A_{h,d}^{(t+1)} = A_{h,d}^{(t)} + \alpha \times \left( -\frac{\partial L}{\partial A_{h,d}^{(t)}} \right) = A_{h,d}^{(t)} + \alpha \times \left( \sum_{k=1}^{K} \left( t_k - y_k^{(t)} \right) B_{k,h}^{(t)} \right) x_d    (14)

Note that the document or sentence representation in fastText is obtained by averaging the vector representations of the constituent n-grams. Thus, the gradients involving x are normalized by the number of word and/or character n-grams in the input x.

7.3 Asynchronous Stochastic Gradient Descent
The general expression for updating the estimators of a parameter w in a gradient descent optimization scheme is given by

w^{(t+1)} = w^{(t)} + \alpha u    (15)

where \alpha is the learning rate or step length and u is a descent direction, which is set to -\frac{\partial L}{\partial w^{(t)}}. The fastText model relies on the asynchronous stochastic gradient descent (ASGD) method of optimization following [8], where several cores can update any component w_j^{(t)} several times, but atomically, on the same iteration.

Both ASGD and SGD depend on finding unbiased estimators of gradients. The key difference is that for the former, the gradient computation for parameter estimators is allowed to use stale values of the previously updated parameter iterates. Convergence is shown to happen for certain scenarios, where the loss functions are convex and the stochastic gradients are assumed to be bounded in expectation. In the original ASGD paper [8], this convergence is shown for a fixed step size. The convergence analysis for ASGD in [7] is shown for the non-bounded gradient assumption and a continuously decreasing step size, which is the case for fastText.

It is mentioned in [7] that the permutation of indices for the parameter vector or matrix updates, when there are inconsistent reads and writes by several cores, is crucial for good performance. In fastText, the compute cores start iterating over the data from fixed positions in the training file and not randomly as in the case of true SGD. This might be a reason why the solution computed by fastText is not stochastically optimal with regards to the assumptions mentioned in [7].
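To make the derivation above concrete, the following minimal, single-threaded sketch applies the updates of Equations (11) and (14) with the learning-rate schedule of Equation (4). It is an illustration of the math only: there is no n-gram hashing, no averaging of n-gram vectors, and no asynchronous (Hogwild-style) updates, and the total number of SGD steps stands in for N_V E in Equation (4).

```python
import numpy as np

def train_flat_softmax(X, T, H=10, epochs=5, gamma=0.1, seed=0):
    """Single-threaded sketch of the updates in Equ. (11) and (14).
    X: (N, D) averaged feature vectors; T: (N, K) one-hot label matrix."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = T.shape[1]
    A = 0.01 * rng.standard_normal((D, H))    # input-to-hidden matrix, z_h = sum_d A_{d,h} x_d
    B = 0.01 * rng.standard_normal((H, K))    # hidden-to-output matrix, v_k = B_k^T z
    total_steps, step = N * epochs, 0
    for _ in range(epochs):
        for n in rng.permutation(N):          # true SGD shuffles; fastText threads read sequentially
            alpha = gamma * (1.0 - step / total_steps)   # Equ. (4), total steps standing in for N_V E
            x, t = X[n], T[n]
            z = x @ A                          # hidden layer (no non-linearity)
            v = z @ B
            y = np.exp(v - v.max()); y /= y.sum()        # softmax, y_k = p(v_k)
            grad_B = np.outer(z, t - y)        # from Equ. (10): -dL/dB_{k,h} = (t_k - y_k) z_h
            grad_A = np.outer(x, (t - y) @ B.T)          # from Equ. (13), for every (d, h)
            B += alpha * grad_B                # Equ. (11)
            A += alpha * grad_A                # Equ. (14)
            step += 1
    return A, B
```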