<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the SIGIR 2018 eCom Rakuten Data Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yiu-Chang Lin</string-name>
          <email>yiuchang.lin@rakuten.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pradipto Das</string-name>
          <email>pradipto.das@rakuten.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ankur Datta</string-name>
          <email>ankur.datta@rakuten.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rakuten Institute of Technology</institution>
          ,
          <addr-line>Boston, MA 02110</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>This paper presents an overview of the SIGIR 2018 eCom Rakuten Data Challenge. In this data challenge, Rakuten.com has released a sample of one million product titles and the corresponding anonymized category paths from its entire product catalog. Of these, 0.8 million product titles and their corresponding category paths are released as training data and 0.2 million product titles are released as test data. The task is to predict the category, defined as a full path in the taxonomy tree provided in the training set, of the product titles in the test set. The evaluation is divided into two stages to measure system performance on a part of the test data and on the entire test data. The different systems are evaluated using weighted precision, recall and F1. In total, 26 teams have submitted 28 systems, with a top performance of 0.8513 weighted F1 score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The SIGIR 2018 eCom Rakuten Data Challenge1 is organized by
Rakuten Institute of Technology, Boston2 (RIT-Boston), a
dedicated R&amp;D organization for the Rakuten group3 of companies.
The data challenge focuses on the task of large-scale taxonomy
classification of product titles, where the goal is to predict each
product’s category, defined as a full path from the root node to a leaf
node in the taxonomy tree provided in the training set. The
cataloging of product listings through taxonomy categorization is a
fundamental problem for any e-commerce platform, with
applications ranging from basic data organization and personalized search
recommendations to query understanding and targeted
campaigns. For instance, in the Rakuten.com catalog, “Dr. Martens Air
Wair 1460 Mens Leather Ankle Boots” is categorized under the
“Clothing, Shoes &amp; Accessories &gt; Shoes &gt; Men &gt; Boots” leaf.
However, manual and rule-based approaches to categorization are not
scalable, since commercial product taxonomies are organized in tree
structures with three to ten levels of depth and thousands of leaf
nodes. Advances in this area of research have been limited due to
the lack of real data from actual commercial catalogs.</p>
      <p>The challenge presents several interesting research aspects due
to the intrinsically noisy nature of the product labels, the size of modern
e-commerce catalogs, and the typical imbalanced data distribution.</p>
      <sec id="sec-1-1">
        <title>Footnotes: 1 https://sigir-ecom.github.io/data-task.html; 2 https://rit.rakuten.co.jp/; 3 https://www.rakuten.com/</title>
        <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).</p>
        <p>We hope that by making the data available to the participants, the
task will attract more research institutions and industry
practitioners who have not had the opportunity to contribute their ideas due
to the lack of actual commercial e-commerce catalog data. In a
typical e-commerce setting, merchants are responsible for matching
their products with an existing category, which is a leaf node in the
taxonomy tree. The problem of large-scale taxonomy classification
is thus immensely useful in helping merchants upload their products
to the right places in an e-commerce platform catalog.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 DATASET</title>
      <p>Rakuten has released a sample of one million product listings,
split into a training set (0.8 million) and a test set (0.2 million),
consisting of product titles and their corresponding category paths in
a taxonomy tree that describes the varying degrees of generality
of the items in the catalog. A product taxonomy is a tree-based
hierarchical representation of the labels of the listings in a catalog.</p>
      <p>Rakuten’s catalog of products is much larger than a million
listings. The larger catalog of product listings is first de-duplicated
using (leaf node label, product title) tuples as keys, and then a
random sample of one million listings is drawn. This set
of listings has been approved for publication for this year’s data
challenge. The set of one million product listings may not cover the
entire set of leaf nodes in Rakuten.com’s taxonomy, and so any
taxonomy generated from the category labels provided with the
training set may be a truncated taxonomy.</p>
      <p>Further, to preserve the anonymity of the taxonomy, the nodes in
the actual taxonomy have been masked with random integers. The
class or category label of a product listing is thus a path from the
root of the taxonomy tree to the leaf node where the listing resides.
In the left-to-right ordering of the nodes in such a path, the path
becomes more specific in describing the listing as it approaches the
leaf level. The string representation of a path from root to leaf is
henceforth dubbed a “category-id-path”.</p>
      <p>The training data file is in a tab-separated values (TSV) format
where each line contains a product title and its corresponding
category-id-path. In the test set, only the product titles are provided,
and the objective of this data challenge is to predict the full
category-id-path for each such title. Tables 1 and 2 show some examples
of product titles from the training set and test set, respectively. The
partitioning of the training and the test sets is obtained using
category-wise stratified sampling of the one million listing dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Characteristics</title>
      <p>In the training set, there are 800,000 product titles from 3,008
leaf-level nodes. Product titles are unevenly distributed among these
3,008 categories. The top ten categories compose around 30% of
the data set and the top forty categories compose around 50% of
the data set. The left hand side of Figure 1 shows the distribution
of product title frequency across all the leaf-level categories (sorted
by frequency) where the X and Y axes are both in log scale. Across
the 800,000 product titles, the maximum depth of category level is
8 and the average depth is 4.</p>
      <p>[Figure 1. Left: product title frequency vs. category rank (both axes in log scale). Right: product title frequency across category depths 1 to 8.]</p>
      <p>The right hand side of Figure 1 shows the distribution of product
title frequency across different depths of category-id-paths. The
average word-level title length over all categories, after the surface
form of the title is tokenized by white space, is 10.93, with a
maximum length of 58 words. Similarly, the average character-level title
length is 68.44, with a maximum length of 255. Figure 2 shows the
title length distributions at word-level and character-level,
respectively.</p>
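      <p>The length statistics above can be recomputed directly from the raw titles; a minimal sketch on illustrative data (the two example titles are not taken from the released files):</p>
      <p>
```python
# Sketch: word-level and character-level title length statistics, computed
# the same way as described in the text (whitespace tokenization of the
# surface form of the title).
def title_length_stats(titles):
    word_lens = [len(t.split()) for t in titles]
    char_lens = [len(t) for t in titles]
    avg = lambda xs: sum(xs) / len(xs)
    return avg(word_lens), max(word_lens), avg(char_lens), max(char_lens)

titles = ["dr martens air wair 1460 mens leather ankle boots",
          "ana silver co rainbow moonstone earrings"]
avg_words, max_words, avg_chars, max_chars = title_length_stats(titles)
```
      </p>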
    </sec>
    <sec id="sec-4">
      <title>3 EVALUATION</title>
      <p>In this section, we briefly describe the evaluation criteria used to
judge the different systems. Although we do not resort to stricter
statistical significance testing of the different systems using confidence
interval estimation from bootstrapped samples of the test set, it
will be useful to do so once this pilot task has been completed.
One motivation for not introducing such testing in this pilot task is
that in an e-commerce setting, even a 0.01% improvement means
thousands of listings being assigned to their right places in the
catalog without human intervention.</p>
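      <p>As a concrete illustration of the deferred significance testing, a bootstrap confidence interval for a system’s exact-match accuracy could be estimated as follows (the 0/1 correctness vector and sample sizes are illustrative, not taken from the challenge data):</p>
      <p>
```python
import random

# Bootstrap sketch: resample the test set with replacement and read off
# percentile confidence bounds for accuracy. `correct` is an illustrative
# 0/1 vector marking whether each predicted category-id-path matched gold.
def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

correct = [1] * 900 + [0] * 100          # e.g. 900 of 1000 titles correct
low, high = bootstrap_ci(correct)        # roughly a 95% interval around 0.90
```
      </p>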
    </sec>
    <sec id="sec-5">
      <title>3.1 Metrics</title>
      <p>The metrics that we have used to evaluate the different
classification systems are based on weighted-precision, weighted-recall and
weighted-F1 for the test set on exact “category-id-path” match. In
other words, a partial path match does not count as a correct
prediction. We assume that there are a total of K classes, {c_i | i = 1, 2, ..., K},
in the training set. The number of true instances for each class
(its support) is n_i, and the total number of training instances is
N = Σ_{i=1}^{K} n_i. After calculating the precision (P_i), recall (R_i) and F1 (F1_i)
scores for each class c_i, the weighted-precision, weighted-recall
and weighted-F1 are defined as follows:

weighted-P = Σ_{i=1}^{K} (n_i / N) P_i    (1)

weighted-R = Σ_{i=1}^{K} (n_i / N) R_i    (2)

weighted-F1 = Σ_{i=1}^{K} (n_i / N) F1_i    (3)</p>
      <p>It can be shown that weighted-recall is actually equal to absolute
accuracy. Since the product distribution over the taxonomy tree
is highly imbalanced, the weighted versions of precision, recall, and F1
make much more sense than the macro or micro versions of precision,
recall, and F1. The evaluation script is also provided during the
data challenge4.</p>
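      <p>The metrics in Equs. 1-3, and the fact that weighted-recall equals absolute accuracy, can be sketched as follows (the labels are illustrative category-id-paths, not actual challenge data):</p>
      <p>
```python
from collections import Counter

# Sketch of the weighted metrics: each class's P_i, R_i, F1_i is weighted
# by its support n_i / N in the gold labels (Equs. 1-3).
def weighted_prf(gold, pred):
    support = Counter(gold)
    n_total = len(gold)
    wp = wr = wf = 0.0
    for c, n_c in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        predicted_c = sum(1 for p in pred if p == c)
        prec = tp / predicted_c if predicted_c else 0.0
        rec = tp / n_c
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = n_c / n_total
        wp += w * prec
        wr += w * rec
        wf += w * f1
    return wp, wr, wf

gold = ["a>b", "a>b", "a>c", "d>e"]
pred = ["a>b", "a>c", "a>c", "d>e"]
wp, wr, wf = weighted_prf(gold, pred)

# weighted-recall equals overall accuracy on exact-path match
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
assert round(wr, 9) == round(accuracy, 9)
```
      </p>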
      <sec id="sec-5-1">
        <title>Footnote: 4 https://github.com/sigir-ecom/dataChallenge</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.2 Leaderboard</title>
      <p>The leaderboard shows the weighted precision, recall and F1 scores,
up to four decimal digits of precision, for the latest submissions
from each participating team. The corresponding file submission
time is shown as well, so participants can tell which
submission the scores belong to. The leaderboard is sorted by the weighted
F1 score. We choose not to use an off-the-shelf classifier, such as
a logistic regression model or some other standard classifier, as a
minimal baseline (lower bound) in the leaderboard.</p>
      <p>3.3 Timeline
• Stage 1 (April 9 - June 23): Participants build and test models
on the training data. Each team can make at most three
submissions per day in this stage. The leaderboard (Fig. 3
left) only shows the model performance on a subset of the
test set, i.e., the first 20,000 test product titles. Note that this
subset is randomized since the entire test set is randomized
using stratified sampling on the one million titles.
• Stage 2 (June 24): The leaderboard switches to Stage 2 on
June 24 (Fig. 3 right) and shows the model performance
on the entire test set, i.e., all 200,000 test product titles.
No submission is allowed once the leaderboard has been
switched to this stage.</p>
      <p>
        4 RIT-BOSTON BASELINE METHOD
The RITB-Baseline method uses the method of Joulin et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which has recently been very popular for large-scale multi-label
text classification. The method, dubbed fastText, uses lock-free
asynchronous gradient descent optimization [8] of a log-linear loss
function; it is effectively a single-layer perceptron with no
non-linearities in the activation functions. For details on the derivations
of the fastText model parameter updates using a softmax loss
function, see Section 7.
      </p>
      <p>
        Generally, in a production setting, we have seen gradient boosted
decision trees (GBDTs) to perform just as well as state-of-the-art
Convolutional Neural Networks (CNNs), which has been previously
reported in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, although GBDTs have the added
advantage of incorporating intuitive feature engineering as well as
determining feature importance, their training time on large datasets can be
prohibitively slow. Recent trends in the e-commerce industry have thus
shifted to using fastText or other deep learning approaches for
generic text classification problems, of which taxonomy
classification of product titles is a specific instance. Deep learning techniques
are a continuously evolving family of models with varying degrees
of architecture engineering, optimization algorithms and parameter
fine-tuning methods, and as such it is difficult to assign a baseline
model using this family of methods.
      </p>
      <p>
        We thus choose to use fastText as a quick way to solve the
taxonomy classification problem. Further, fastText apparently has
fewer parameters to tune than comparable deep learning methods.
The reason for the choice of the word “apparently” will become clear
once we explain the graphs in Fig. 5. The fastText method in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
employs many tricks, including hashing n-grams to a fixed vocabulary size,
lock-free asynchronous gradient descent optimization
[8], averaging of word representations as the document representation,
and others. Further, given a training set containing N_V word tokens from
a vocabulary of V words, and a number of passes over the data
(epochs) equal to E, the learning rate α at time step t is updated
according to

α(t) = γ (1 − t / (N_V · E))    (4)

where γ is a fixed parameter. Hence, unlike optimizing for the
learning rate or step size as found in more principled convex optimization
techniques, fastText uses a heuristic estimate and a linear
decrement of the learning rate, which allows it to calculate an estimated
time of completion for the training phase.
      </p>
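      <p>The linear decay in Equ. 4 is simple to state in code; a small sketch (the parameter values are illustrative, not fastText defaults):</p>
      <p>
```python
# Sketch of Equ. 4: alpha(t) decreases linearly from gamma to 0 over the
# N_V * E token updates that make up training.
def fasttext_lr(t, gamma, n_tokens, epochs):
    return gamma * (1.0 - t / (n_tokens * epochs))

gamma, n_tokens, epochs = 0.05, 1_000_000, 5
start_lr = fasttext_lr(0, gamma, n_tokens, epochs)            # = gamma
end_lr = fasttext_lr(n_tokens * epochs, gamma, n_tokens, epochs)  # = 0.0
```
      </p>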
      <p>
The high degree of class imbalance becomes apparent upon
observing the Kullback-Leibler (KL) divergences of the empirical
distributions over data points in the training set for each of the
top-level categories from their respective uniform distributions (see
Fig. 4). The RDC challenge dataset may have exhibited this
phenomenon due to the dataset selection process; however, in general,
for common choices of e-commerce taxonomies in
mid-size organizations, KL divergences usually vary between 1.0 and
2.0 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
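      <p>The imbalance diagnostic above is the KL divergence of the empirical class distribution from the uniform distribution over the same classes; a minimal sketch with hypothetical labels:</p>
      <p>
```python
import math
from collections import Counter

# KL(p || uniform) over k observed classes: sum_i p_i * log(p_i * k).
# A perfectly balanced distribution gives 0; skew increases the divergence.
def kl_from_uniform(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return sum((c / n) * math.log((c / n) * k) for c in counts.values())

balanced = ["a"] * 50 + ["b"] * 50
skewed = ["a"] * 98 + ["b"] * 2
kl_balanced = kl_from_uniform(balanced)   # 0.0
kl_skewed = kl_from_uniform(skewed)       # clearly positive
```
      </p>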
      <p>
        We initially split the 0.8M training dataset into a 10% Dev2 set, a
10% Dev1 set after excluding the Dev2 set, and the rest for training
and cross-validation. The splitting is done using stratified sampling.
Following a bi-level classification scheme in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we initially built
a baseline hierarchical cascading classifier using fastText;
however, for this dataset, we found that a “flat” fastText classifier
performed better by at least one percentage point absolute.
Additionally, the “negative sampling” loss function has also shown
better performance than the typical “softmax” loss. However, fine
tuning of parameters based on the negative sampling loss function
can show signs of overfitting.
      </p>
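      <p>The stratified splitting described above can be sketched as follows (the data, class names and split fraction are illustrative; the actual splitting tool used is not specified in the text):</p>
      <p>
```python
import random
from collections import defaultdict

# Stratified split sketch: hold out a fixed fraction of listings per
# category, so every split preserves the category distribution.
def stratified_split(items, labels, frac, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    held_out, rest = [], []
    for label, members in sorted(by_class.items()):
        rng.shuffle(members)
        cut = int(round(frac * len(members)))
        held_out += [(m, label) for m in members[:cut]]
        rest += [(m, label) for m in members[cut:]]
    return held_out, rest

items = list(range(100))
labels = ["cat-a"] * 70 + ["cat-b"] * 30
dev2, remainder = stratified_split(items, labels, frac=0.10)
```
      </p>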
      <p>
        For data preprocessing, we use the token normalizer mentioned
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We first tokenize using whitespace and lowercase the
tokens. Then, for each token, we separate any adjoining punctuation
from numbers (decimal or otherwise). We replace all numbers with
a @NUMBER@ literal token and remove all punctuation tokens. We
preserve alphanumeric tokens in the hope that longer
alphanumeric strings sometimes encode model numbers that are unique to
each category. For instance, the surface form of the title, “Ana Silver
Co Rainbow Moonstone Earrings 2 1/4" (925 Sterling Silver) -
Handmade Jewelry EARR346812”, gets converted to “ana silver co rainbow
moonstone earrings @NUMBER@ @NUMBER@ @NUMBER@ @NUMBER@
sterling silver handmade jewelry earr346812” using our normalization
scheme. For the baseline, we also do not remove any stop words or
perform any morphological operations on the resulting tokens. The
entire training dataset is de-duplicated using (category label,
title normal form) tuples.
      </p>
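      <p>A minimal sketch of this normalization scheme follows; the exact normalizer of [1] may differ in details:</p>
      <p>
```python
import re

NUM = re.compile(r"^\d+(\.\d+)?$")    # integer or decimal token
PUNCT = re.compile(r"^\W+$")          # pure punctuation token

def normalize_title(title):
    """Lowercase, tokenize on whitespace, split punctuation off numbers,
    map numbers to @NUMBER@, drop punctuation, keep alphanumerics."""
    out = []
    for tok in title.lower().split():
        # split each token into digit runs, alphanumeric runs, punctuation
        for piece in re.findall(r"\d+(?:\.\d+)?|[a-z0-9]+|\W+", tok):
            if NUM.match(piece):
                out.append("@NUMBER@")
            elif PUNCT.match(piece):
                continue
            else:
                out.append(piece)
    return " ".join(out)

title = ('Ana Silver Co Rainbow Moonstone Earrings 2 1/4" '
         '(925 Sterling Silver) - Handmade Jewelry EARR346812')
normalized = normalize_title(title)
```
      </p>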
      <p>The plots in Fig. 5 confirm the non-deterministic nature of
fastText due to the use of asynchronous SGD. In fastText, each
thread reads its own view of the data, creates a mini-batch of size
one, and updates the parameter matrices without any locking
mechanism. This type of lock-free parameter update has been shown to
be just as effective as its synchronous counterpart under the
assumption of feature sparsity in the data [8]. The plots in Fig. 5 show
absolute accuracy results on our 10% Dev2 set for two different
scenarios, after training a bi-level fastText classifier on the training
data that remains after excluding the Dev1 and Dev2 sets. For the
first scenario, we run fastText for forty runs with the same
parameter settings of dim=min(120,K/2), epoch=100, wordNgrams=3,
loss=softmax, thread=45, with K being the number of classes
for a particular subtree rooted at a classification node in the
taxonomy tree. For this scenario, we obtain a mean absolute accuracy
of 74.0908 with a standard deviation of 0.0741. For the second
scenario, we run fastText for six different runs with the same
settings, except that the number of threads is set to 1, 2, 5,
15, 30 and 45, respectively. The solid bars in the right plot of Fig.
5 show the number of threads, and the corresponding accuracy
numbers are shown on the line graph above the solid bars.</p>
      <p>We can observe from the graph on the right of Fig. 5 that the
trivially synchronous version of fastText, i.e. using just one thread,
is the weakest of all models. This makes sense since fastText is
just a single-layer perceptron with no non-linearities. Accuracy
improves by almost 70% when the number of threads is set to two, but
plateaus off for the other settings of thread counts after that. For
our Dev2 set, we obtain the best accuracy using thirty threads, and
that is the number of threads we set for all of our fastText baseline
runs. The asynchronous SGD of fastText with a mini-batch size of
one raises a legitimate concern: the number of threads apparently
behaves as a parameter that regulates generalization performance
and is machine-architecture dependent. Needless to say,
this behavior is independent of the loss function used. In our
experiments, the default setting of twelve threads has shown lower
performance than thirty threads for this task.</p>
      <p>After some minimal experiments on our Dev2 set, we finalized
the settings of our fastText baseline, RITB-Baseline (see Table
3 and Fig. 3), to be the following: -dim 300 -minn 4 -maxn 10
-wordNgrams 3 -neg 10 -loss ns -epoch 3000 -thread 30.
We have set a high number of epochs due to the availability of free
computational cycles in our servers.</p>
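      <p>For reproduction purposes, the training data must first be put into fastText’s supervised input format, where each line starts with a `__label__`-prefixed class token; a sketch follows (file names and the example category-id-path are illustrative, not actual challenge values):</p>
      <p>
```python
# Convert the challenge TSV (title TAB category-id-path) into fastText's
# supervised input format, one "__label__..." line per listing.
def to_fasttext_line(title, category_id_path):
    # fastText treats whitespace-separated tokens that start with
    # "__label__" as labels, so the path itself must contain no spaces.
    return "__label__" + category_id_path.replace(" ", "") + " " + title

def convert(tsv_path, out_path):
    with open(tsv_path) as src, open(out_path, "w") as dst:
        for line in src:
            title, path = line.rstrip("\n").split("\t")
            dst.write(to_fasttext_line(title, path) + "\n")

# The baseline settings quoted above then correspond to a command such as:
#   fasttext supervised -input train.ft -output ritb-baseline \
#     -dim 300 -minn 4 -maxn 10 -wordNgrams 3 -neg 10 -loss ns \
#     -epoch 3000 -thread 30
example = to_fasttext_line("ana silver co", "3625>3929>1274")
```
      </p>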
      <p>In the next section, we briefly highlight the methods used for the
data challenge task by the teams who have submitted their system
description papers.</p>
    </sec>
    <sec id="sec-7">
      <title>5 SYSTEM DESCRIPTIONS</title>
      <p>Below is a list of each team’s systems. Their team names,
affiliations and system description paper titles, available on the workshop
website, can be seen in Table 3.</p>
      <p>
        • Team CorUmBc submitted one system (0.7690 F1 score)
based on a Bidirectional Long Short Term Memory Network
(BiLSTM) to capture the context information for each word,
followed by a multi-head attention model to aggregate useful
information from these words as the final representation of
the product title. Their model adopts an end-to-end
architecture without any hand-crafted features and is regularized by
various dropout techniques.
• Team HSJX-ITEC-YU submitted one system (0.7790 F1
score) that uses a K Nearest Neighbors (KNN) classification
model and the Best Match 25 (BM25) probabilistic information
retrieval model. The top k product title matches of the BM25
model are then classified by the KNN model.
• Team JCWRY submitted one system (0.8295 F1 score) that
uses deep convolutional neural networks with oversampling,
threshold moving and error-correcting output codes to predict
product taxonomies. Their best accuracy is obtained through
an ensemble of multiple networks, such as Kim-CNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
Zhang-CNN [9], trained on different extracted feature
inputs, including doc2vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], NER and POS features.
• Team MKANEMAS submitted one system (0.8399 F1 score)
that formulates the task as a simple classification problem
of all leaf categories in the given dataset. The key feature
of their system is the combination of a Convolutional
Neural Network and Bidirectional LSTM using ad-hoc features
generated from an external dataset (Amazon Product Data)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
• Team mcskinner submitted one system (0.8513 F1 score)
using a straightforward network architecture and ensemble
LSTM strategy to achieve competitive results. The positive
impact of tightening the connections between recurrent and
output layers through the use of pooling layers is also
demonstrated. The author similarly provides practical details on
their training methodology and algorithms for probability
calibration. The final solution is produced by a bidirectional
ensemble of 6 LSTMs with Balanced Pooling View
architecture.
• Team minimono submitted one system (0.7994 F1 score)
based on the word-level sequence-to-sequence neural networks
widely used in machine translation and automatic
document summarization. It treats taxonomy classification as
a translation problem from a description of a product to a
category path: the text of the product name is viewed as
the encoder input and the sequence of category names as the
decoder output.
• Team neko submitted one system (0.8256 F1 score) that
treats category prediction as a sequence generation task.
The authors built a word-level sequence-to-sequence model
      </p>
      <sec id="sec-7-1">
        <title>Affiliation</title>
      </sec>
      <sec id="sec-7-2">
        <title>System Description Paper</title>
        <p>University of Maryland Baltimore County
Large Scale Taxonomy Classification using BiLSTM with Self-Attention
category issue, sampling and data enhancement techniques
are used. They build eight sample datasets according to the
category hierarchy and develop two classification algorithms
to build models for different levels and search paths using
category trees.
• Team Uplab submitted three systems based on different
classifier types, including a single flat linear Support Vector
Machine classifier (0.8366 F1 score), a top-down ensemble
which combines top-level and sub-level classifiers (0.8173 F1
score) and a CNN with pre-trained word embeddings (0.6509
F1 score). The authors found that TF-IDF with both bi-gram
and uni-gram features works best for categorization.
• Team Waterloo submitted one system (0.7781 F1 score) that
uses multinomial naive Bayes, logistic regression, batched
gradient-descent and neural softmax classifiers retro-fitted
for the hierarchical classification objective. The authors
evaluate different classification methods, including flat and
hierarchical classification, and implement a general framework
to tackle the task of taxonomy classification.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6 RESULTS</title>
      <p>Twenty-six research groups/individuals participated in the data
challenge. Each team submitted their predicted category-id-paths
on the test set in the same TSV format as the training set (Table 1).
We score the system submissions against the gold standard using
weighted precision, recall and F1 as defined in Section 3. The final
results are summarized in Figure 3. The left hand side of Figure 3
shows the leaderboard in Stage 1 (evaluated on the first 20,000 test
titles) and the right hand side shows the Stage 2 results (evaluated on
the entire set of test titles).
In Stage 1, the top five teams are team mcskinner, team
MKANEMAS, team tiger, team Uplab and team JCWRY with 0.8510,
0.8421, 0.8404, 0.8375 and 0.8278 weighted-F1 scores, respectively.
In Stage 2, the same five teams rank at the top five with 0.8513,
0.8399, 0.8379, 0.8366 and 0.8295 weighted-F1 scores, respectively. The
ranks do not change much between Stage 1 and Stage 2, except for team
Uplab2, team Tyche and team HSJX-ITEC-YU. Table 4 shows the list
of teams whose ranks are higher in Stage 2 than in Stage 1.</p>
    </sec>
    <sec id="sec-9">
      <title>7 APPENDIX</title>
      <p>
        The loss function for fastText, mentioned in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is the following
log likelihood function:

L(A, B) = Π_{n=1}^{N} L_n = Π_{n=1}^{N} Π_{k=1}^{K} y_{n,k}^{t_{n,k}}    (5)

where t_n is the label of the nth training instance. The optimization
problem on the training set is to find the optimal parameters A* and B*
by minimizing the negative log likelihood in Equ. 5:

(A*, B*) = arg min_{A,B} − (1/N) × Σ_{n=1}^{N} log L_n    (6)
         = arg min_{A,B} − (1/N) × Σ_{n=1}^{N} Σ_{k=1}^{K} t_{n,k} log y_{n,k}    (7)
      </p>
      <p>
        The parameter matrix A connects the input features to the units
in the hidden layer of dimension H . This matrix is a look-up
table over the words [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and H is a parameter of the model. The
weight parameter matrix, B, connects the hidden units to the
output units. The kth output unit of the nth instance contributes
the conditional probability y_{n,k} of the kth label
being 0 or 1. That is, for a particular nth instance (x_n, t_n), we have
L_n = Π_{k=1}^{K} y_{n,k}^{t_{n,k}}, where t_{n,k} ∈ {0, 1}.
      </p>
      <p>From now on, let L = − log L_n, so that we can clear up the notation.
Also, the data is assumed to be i.i.d., so that p(X) = Π_{n=1}^{N} p(x_n). Let
the input nodes be labeled as x_i, i ∈ {1, ..., D}; the hidden layer
nodes be labeled as z_j, j ∈ {1, ..., H}; and the output layer nodes be
labeled as v_k, k ∈ {1, ..., K}. The target for the input x = {x_d} is a
one-hot K vector with the kth entry equal to 1 if x belongs to class
k. Let v_k be the input to the kth output unit, i.e. v_k = B_k^T z with
z_h = Σ_{d=1}^{D} A_{d,h} x_d.</p>
      <p>
        Using the method of maximum likelihood estimation on the
loss function L, where t_k ∈ {0, 1} and y_k = p(v_k) ∈ [0, 1], or,
equivalently, using gradient descent minimization on L, we can
derive the updates for the model parameter estimators.
      </p>
      <p>7.1 Parameter updates for the output layer
We first begin by finding the updates to the
B parameter from Equ. 6.</p>
      <p>For brevity, we drop the subscript n henceforth. Taking derivatives
of the loss function at the output layer w.r.t. B_{k,h}, we obtain

∂L/∂B_{k,h} = −(t_k − y_k) z_h

where y_k = p(v_k) = softmax(v_k) = exp(B_k^T z) / Σ_{k′=1}^{K} exp(B_{k′}^T z).</p>
      <p>Derivation: For a particular nth instance, we have

−L = Σ_{k=1}^{K} t_k log y_k
   = Σ_{k=1}^{K} t_k v_k − log Σ_{k′=1}^{K} exp(v_{k′})
   = Σ_{k=1}^{K} t_k B_k^T z − log Σ_{k′=1}^{K} exp(Σ_{h=1}^{H} B_{k′,h} z_h).

Using the chain rule of derivatives,

∂L/∂B_{k,h} = (∂L/∂v_k) (∂v_k/∂B_{k,h}).

For the first term, we have:

−∂L/∂v_k = t_k − ∂ log(Σ_{k′=1}^{K} exp(v_{k′}))/∂v_k = t_k − exp(v_k)/Σ_{k′=1}^{K} exp(v_{k′}) = t_k − y_k.

For the second term, we have:

∂v_k/∂B_{k,h} = ∂(Σ_{h=1}^{H} B_{k,h} z_h)/∂B_{k,h} = z_h,

leading to the update B_{k,h} ← B_{k,h} + α (t_k − y_k) z_h,

where α is the learning rate that is updated as shown in Equ. 4.</p>
      <p>7.2 Parameter updates for the input layer
Taking derivatives of the loss function at the input layer w.r.t.
A_{h,d}, we obtain,</p>
      <p>Using the chain rule of derivatives,

−∂L/∂A_{h,d} = Σ_{k=1}^{K} (−∂L/∂v_k) (∂v_k/∂z_h) (∂z_h/∂A_{h,d}) = Σ_{k=1}^{K} (t_k − y_k) B_{k,h} x_d,

leading to the update A_{h,d} ← A_{h,d} + α Σ_{k=1}^{K} (t_k − y_k) B_{k,h} x_d.</p>
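      <p>The closed-form output-layer gradient, ∂L/∂B_{k,h} = −(t_k − y_k) z_h, can be verified numerically against finite differences; an illustrative check with small random matrices (dimensions are arbitrary):</p>
      <p>
```python
import numpy as np

# Numerical check of the softmax cross-entropy gradient derived above.
rng = np.random.default_rng(0)
D, H, K = 5, 4, 3
A = rng.normal(size=(D, H))     # input-to-hidden look-up / projection
B = rng.normal(size=(K, H))     # hidden-to-output weights
x = rng.normal(size=D)
t = np.zeros(K)
t[1] = 1.0                      # one-hot target

def loss(B):
    z = x @ A                               # z_h = sum_d A[d,h] x_d
    v = B @ z                               # output pre-activations
    y = np.exp(v - v.max())
    y = y / y.sum()                         # softmax
    return -np.sum(t * np.log(y)), y, z

L0, y, z = loss(B)
analytic = -np.outer(t - y, z)              # dL/dB[k,h] = -(t_k - y_k) z_h

eps = 1e-6                                  # forward finite differences
numeric = np.zeros_like(B)
for k in range(K):
    for h in range(H):
        Bp = B.copy()
        Bp[k, h] += eps
        numeric[k, h] = (loss(Bp)[0] - L0) / eps

grad_ok = np.allclose(analytic, numeric, atol=1e-4)
```
      </p>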
      <p>Note that the document or sentence representation in fastText
is obtained by an averaging of the vector representations of the
constituent n-grams. Thus, the gradients involving x are normalized
by the number of word and/or character n-grams in the input x.
7.3 Asynchronous Stochastic Gradient Descent
The general expression for updating the estimators of a parameter
w in a gradient descent optimization scheme is given by

w^{(t+1)} = w^{(t)} + α u,

where α is the learning rate or step length and u is a descent
direction, which is set to −∂L/∂w^{(t)}. The fastText model relies on the
asynchronous stochastic gradient descent (ASGD) method of
optimization following [8], where several cores can update any component
w_j^{(t)} several times, but atomically, on the same iteration.</p>
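      <p>The update rule above can be illustrated on a toy convex loss; this sketch uses L(w) = (w − 3)^2 purely for illustration, not fastText’s loss:</p>
      <p>
```python
# Gradient descent sketch: w(t+1) = w(t) + alpha * u, with the descent
# direction u = -dL/dw, minimizing L(w) = (w - 3)^2.
def gradient_descent(w0, alpha, steps):
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)    # dL/dw
        w = w + alpha * (-grad)   # step along the descent direction
    return w

w_star = gradient_descent(w0=0.0, alpha=0.1, steps=100)  # converges near 3
```
      </p>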
      <p>
        Both ASGD and SGD depend on finding unbiased estimators of
gradients. The key difference is that for the former, the gradient
computation for parameter estimators is allowed to use stale
values of the previously updated parameter iterates. Convergence is
shown to happen for certain scenarios, where the loss functions
are convex and the stochastic gradients are assumed to be bounded
in expectation. In the original ASGD paper [8], this convergence
is shown for a fixed step size. The convergence analysis for ASGD
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is shown for the non-bounded gradient assumption and a
continuously decreasing step size, which is the case for fastText.
      </p>
      <p>
        It is mentioned in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], that the permutation of indices for the
parameter vector or matrix updates when there are inconsistent
reads and writes by several cores is crucial for good performance.
In fastText, the compute cores start iterating over the data from
fixed positions in the training file and not randomly as in the case
of true SGD. This might be a reason why the solution computed
by fastText is not stochastically optimal with regard to the
assumptions mentioned in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Pradipto</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yandi</given-names>
            <surname>Xia</surname>
          </string-name>
          , Aaron Levine, Giuseppe Di Fabbrizio, and
          <string-name>
            <given-names>Ankur</given-names>
            <surname>Datta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Web-scale language-independent cataloging of noisy product listings for e-commerce</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>969</fpage>
          -
          <lpage>979</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiangyu</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee</source>
          ,
          <fpage>507</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Armand</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Bag of Tricks for Efficient Text Classification</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <fpage>427</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Targett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qinfeng</given-names>
            <surname>Shi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anton</given-names>
            <surname>Van Den Hengel</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Lam</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Phuong Ha</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marten</given-names>
            <surname>van Dijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Richtarik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Katya</given-names>
            <surname>Scheinberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Takac</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SGD and Hogwild! Convergence Without the Bounded Gradients Assumption</article-title>
          .
          <source>In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research)</source>
          ,
          <source>Jennifer Dy and Andreas Krause (Eds.)</source>
          , Vol.
          <volume>80</volume>
          . PMLR, Stockholmsmässan, Stockholm Sweden,
          <fpage>3747</fpage>
          -
          <lpage>3755</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>