<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Overview of the SIGIR 2018 eCom Rakuten Data Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yiu-Chang Lin</string-name>
          <email>yiuchang.lin@rakuten.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pradipto Das</string-name>
          <email>pradipto.das@rakuten.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ankur Datta</string-name>
          <email>ankur.datta@rakuten.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rakuten Institute of Technology</institution>
          ,
          <addr-line>Boston, MA 02110</addr-line>
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>This paper presents an overview of the SIGIR 2018 eCom Rakuten Data Challenge. In this data challenge, Rakuten.com has released a sample of one million product titles and the corresponding anonymized category paths from its entire product catalog. Of these, 0.8 million product titles and their corresponding category paths are released as training data and 0.2 million product titles are released as test data. The task is to predict the category, defined as a full path in the taxonomy tree provided in the training set, of the product titles in the test set. The evaluation is divided into two stages to measure system performance on a part of the test data and on the entire test data. The different systems are evaluated using weighted precision, recall and F1. In total, 26 teams have submitted 28 systems, with a top performance of 0.8513 weighted F1 score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>The SIGIR 2018 eCom Rakuten Data Challenge1 is organized by
Rakuten Institute of Technology, Boston2 (RIT-Boston), a
dedicated R&amp;D organization for the Rakuten group3 of companies.
The data challenge focuses on the task of large-scale taxonomy
classification of product titles, where the goal is to predict each
product’s category, defined as a full path from the root node to a leaf
node in the taxonomy tree provided in the training set. The
cataloging of product listings through taxonomy categorization is a
fundamental problem for any e-commerce platform, with
applications ranging from basic data organization and personalized search
recommendations to query understanding and targeted
campaigns. For instance, in the Rakuten.com catalog, “Dr. Martens Air
Wair 1460 Mens Leather Ankle Boots” is categorized under the
“Clothing, Shoes &amp; Accessories &gt; Shoes &gt; Men &gt; Boots” leaf.
However, manual and rule-based approaches to categorization are not
scalable, since commercial product taxonomies are organized in tree
structures with three to ten levels of depth and thousands of leaf
nodes. Advances in this area of research have been limited due to
the lack of real data from actual commercial catalogs.</p>
      <p>The challenge presents several interesting research aspects due
to the intrinsically noisy nature of the product labels, the size of modern
e-commerce catalogs, and the typical imbalanced data distribution.</p>
      <sec id="sec-1-1">
        <title>Footnotes: 1 https://sigir-ecom.github.io/data-task.html; 2 https://rit.rakuten.co.jp/; 3 https://www.rakuten.com/</title>
        <p>Permission to make digital or hard copies of part or all of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
For all other uses, contact the owner/author(s).
eCom Data Challenge, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).</p>
        <p>We hope that by making the data available to the participants, the
task will attract more research institutions and industry
practitioners who have not had the opportunity to contribute their ideas due
to the lack of actual commercial e-commerce catalog data. In a
typical e-commerce setting, merchants are responsible for matching
their products with an existing category, which is a leaf node in the
taxonomy tree. The problem of large-scale taxonomy classification
is thus immensely useful in helping merchants upload their products
to the right places in an e-commerce platform catalog.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 DATASET</title>
      <p>Rakuten has released a sample of one million product listings,
split into a training set (0.8 million) and a test set (0.2 million),
consisting of product titles and their corresponding category paths in
a taxonomy tree that describes the varying degrees of generality
of the items in the catalog. A product taxonomy is a tree-based
hierarchical representation of the labels of the listings in a catalog.</p>
      <p>Rakuten’s catalog of products is much larger than a million
listings. The larger catalog of product listings is first de-duplicated
using (leaf node label, product title) tuples as keys, and then a
random sample of one million listings is drawn. This set
of listings has been approved for publication for this year’s data
challenge. The set of one million product listings may not cover the
entire set of leaf nodes in Rakuten.com’s taxonomy, and so any
taxonomy generated from the category labels provided with the
training set may be a truncated taxonomy.</p>
      <p>Further, to preserve the anonymity of the taxonomy, the nodes in
the actual taxonomy have been masked with random integers. The
class or category label of a product listing is thus a path from the
root of the taxonomy tree to the leaf node where the listing resides.
In the left-to-right ordering of the nodes in such a path, the path
becomes more specific in describing the listing as it approaches the
leaf level. The string representation of a path from root to leaf is
henceforth dubbed a “category-id-path”.</p>
      <p>The training data file is in a tab-separated values (TSV) format
where each line contains a product title and its corresponding
category-id-path. In the test set, only the product titles are provided,
and the objective of this data challenge is to predict the full
category-id-path for each such title. Tables 1 and 2 show some examples
of product titles from the training set and test set, respectively. The
partitioning of the training and the test sets is obtained using
category-wise stratified sampling of the one million listing dataset.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Data Characteristics</title>
      <p>In the training set, there are 800,000 product titles from 3,008
leaf-level nodes. Product titles are unevenly distributed among these
3,008 categories. The top ten categories compose around 30% of
the data set and the top forty categories compose around 50% of
the data set. The left hand side of Figure 1 shows the distribution
of product title frequency across all the leaf-level categories (sorted
by frequency) where the X and Y axes are both in log scale. Across
the 800,000 product titles, the maximum depth of category level is
8 and the average depth is 4.</p>
      <p>[Figure 1. Left: product title frequency vs. category rank (both axes in log scale). Right: product title frequency across category depths 1 to 8.]</p>
      <p>The right hand side of Figure 1 shows the distribution of product
title frequency across different depths of category-id-paths. The
average word-level title length over all categories, after the surface
form of the title is tokenized by white space, is 10.93, with a
maximum length of 58 words. Similarly, the average character-level title
length is 68.44, with a maximum length of 255. Figure 2 shows the
title length distributions at word-level and character-level,
respectively.</p>
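      <p>The length statistics above can be recomputed directly from the raw titles; a minimal sketch on illustrative data (the two example titles are not taken from the released files):</p>
      <p>
```python
# Sketch: word-level and character-level title length statistics, computed
# the same way as described in the text (whitespace tokenization of the
# surface form of the title).
def title_length_stats(titles):
    word_lens = [len(t.split()) for t in titles]
    char_lens = [len(t) for t in titles]
    avg = lambda xs: sum(xs) / len(xs)
    return avg(word_lens), max(word_lens), avg(char_lens), max(char_lens)

titles = ["dr martens air wair 1460 mens leather ankle boots",
          "ana silver co rainbow moonstone earrings"]
avg_words, max_words, avg_chars, max_chars = title_length_stats(titles)
```
      </p>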
    </sec>
    <sec id="sec-4">
      <title>3 EVALUATION</title>
      <p>In this section, we briefly describe the evaluation criteria used to
judge the different systems. Although we do not resort to stricter
statistical significance testing of the different systems using confidence
interval estimation from bootstrapped samples of the test set, it
will be useful to do so once this pilot task has been completed.
One motivation for not introducing such testing in this pilot task is
that in an e-commerce setting, even a 0.01% improvement means
thousands of listings being assigned to their right places in the
catalog without human intervention.</p>
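      <p>As a concrete illustration of the deferred significance testing, a bootstrap confidence interval for a system’s exact-match accuracy could be estimated as follows (the 0/1 correctness vector and sample sizes are illustrative, not taken from the challenge data):</p>
      <p>
```python
import random

# Bootstrap sketch: resample the test set with replacement and read off
# percentile confidence bounds for accuracy. `correct` is an illustrative
# 0/1 vector marking whether each predicted category-id-path matched gold.
def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    n = len(correct)
    stats = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

correct = [1] * 900 + [0] * 100          # e.g. 900 of 1000 titles correct
low, high = bootstrap_ci(correct)        # roughly a 95% interval around 0.90
```
      </p>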
    </sec>
    <sec id="sec-5">
      <title>3.1 Metrics</title>
      <p>The metrics that we have used to evaluate the different
classification systems are based on weighted-precision, weighted-recall and
weighted-F1 for the test set on exact “category-id-path” match. In
other words, a partial path match does not count as a correct
prediction. We assume that there are a total of K classes, {c_i | i = 1, 2, ..., K},
in the training set. The number of true instances for each class
(its support) is n_i, and the total number of training instances is
N = Σ_{i=1}^{K} n_i. After calculating the precision (P_i), recall (R_i) and F1 (F1_i)
scores for each class c_i, the weighted-precision, weighted-recall
and weighted-F1 are defined as follows:

weighted-P = Σ_{i=1}^{K} (n_i / N) P_i    (1)

weighted-R = Σ_{i=1}^{K} (n_i / N) R_i    (2)

weighted-F1 = Σ_{i=1}^{K} (n_i / N) F1_i    (3)</p>
      <p>It can be shown that weighted-recall is actually equal to absolute
accuracy. Since the product distribution over the taxonomy tree
is highly imbalanced, the weighted versions of precision, recall, and F1
make much more sense than the macro or micro versions of precision,
recall, and F1. The evaluation script is also provided during the
data challenge4.</p>
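      <p>The metrics in Equs. 1-3, and the fact that weighted-recall equals absolute accuracy, can be sketched as follows (the labels are illustrative category-id-paths, not actual challenge data):</p>
      <p>
```python
from collections import Counter

# Sketch of the weighted metrics: each class's P_i, R_i, F1_i is weighted
# by its support n_i / N in the gold labels (Equs. 1-3).
def weighted_prf(gold, pred):
    support = Counter(gold)
    n_total = len(gold)
    wp = wr = wf = 0.0
    for c, n_c in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        predicted_c = sum(1 for p in pred if p == c)
        prec = tp / predicted_c if predicted_c else 0.0
        rec = tp / n_c
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = n_c / n_total
        wp += w * prec
        wr += w * rec
        wf += w * f1
    return wp, wr, wf

gold = ["a>b", "a>b", "a>c", "d>e"]
pred = ["a>b", "a>c", "a>c", "d>e"]
wp, wr, wf = weighted_prf(gold, pred)

# weighted-recall equals overall accuracy on exact-path match
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
assert round(wr, 9) == round(accuracy, 9)
```
      </p>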
      <sec id="sec-5-1">
        <title>Footnote: 4 https://github.com/sigir-ecom/dataChallenge</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>3.2 Leaderboard</title>
      <p>The leaderboard shows the weighted precision, recall and F1 scores,
up to four decimal digits of precision, for the latest submissions
from each participating team. The corresponding file submission
time is shown as well, so participants can tell which
submission the scores belong to. The leaderboard is sorted by the weighted
F1 score. We choose not to use an off-the-shelf classifier, such as
a logistic regression model or some other standard classifier, as a
minimal baseline (lower bound) in the leaderboard.</p>
      <p>3.3 Timeline
• Stage 1 (April 9 - June 23): Participants build and test models
on the training data. Each team can make at most three
submissions per day in this stage. The leaderboard (Fig. 3
left) only shows the model performance on a subset of the
test set, i.e., the first 20,000 test product titles. Note that this
subset is randomized since the entire test set is randomized
using stratified sampling on the one million titles.
• Stage 2 (June 24): The leaderboard switches to Stage 2 on
June 24 (Fig. 3 right) and shows the model performance
on the entire test set, i.e., all 200,000 test product titles.
No submission is allowed once the leaderboard has been
switched to this stage.</p>
      <p>
        4 RIT-BOSTON BASELINE METHOD
The RITB-Baseline method uses the method of Joulin et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
which has recently been very popular for large-scale multi-label
text classification. The method, dubbed fastText, uses lock-free
asynchronous gradient descent optimization [8] of a log-linear loss
function; it is effectively a single-layer perceptron with no
non-linearities in the activation functions. For details on the derivations
of the fastText model parameter updates using a softmax loss
function, see Section 7.
      </p>
      <p>
        Generally, in a production setting, we have seen gradient boosted
decision trees (GBDTs) to perform just as well as state-of-the-art
Convolutional Neural Networks (CNNs), which has been previously
reported in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, although GBDTs have the added
advantage of incorporating intuitive feature engineering as well as
determining feature importance, their training time on large datasets can be
prohibitively slow. Recent trends in the e-commerce industry have thus
shifted to using fastText or other deep learning approaches for
generic text classification problems, of which taxonomy
classification of product titles is a specific instance. Deep learning techniques
are a continuously evolving family of models with varying degrees
of architecture engineering, optimization algorithms and parameter
fine-tuning methods, and as such it is difficult to assign a baseline
model using this family of methods.
      </p>
      <p>
        We thus choose to use fastText as a quick way to solve the
taxonomy classification problem. Further, fastText apparently has
fewer parameters to tune than comparable deep learning methods.
The reason for the choice of the word “apparently” will become clear
once we explain the graphs in Fig. 5. The fastText method in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
employs many tricks, including hashing n-grams to a fixed vocabulary size,
lock-free asynchronous gradient descent optimization
[8], averaging of word representations as the document representation,
and others. Further, given a training set containing N_V word tokens from
a vocabulary of V words, and a number of passes over the data
(epochs) equal to E, the learning rate α at time step t is updated
according to

α(t) = γ (1 − t / (N_V · E))    (4)

where γ is a fixed parameter. Hence, unlike optimizing for the
learning rate or step size as found in more principled convex optimization
techniques, fastText uses a heuristic estimate and a linear
decrement of the learning rate, which allows it to calculate an estimated
time of completion for the training phase.
      </p>
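      <p>The linear decay in Equ. 4 is simple to state in code; a small sketch (the parameter values are illustrative, not fastText defaults):</p>
      <p>
```python
# Sketch of Equ. 4: alpha(t) decreases linearly from gamma to 0 over the
# N_V * E token updates that make up training.
def fasttext_lr(t, gamma, n_tokens, epochs):
    return gamma * (1.0 - t / (n_tokens * epochs))

gamma, n_tokens, epochs = 0.05, 1_000_000, 5
start_lr = fasttext_lr(0, gamma, n_tokens, epochs)            # = gamma
end_lr = fasttext_lr(n_tokens * epochs, gamma, n_tokens, epochs)  # = 0.0
```
      </p>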
      <p>
The high degree of class imbalance becomes apparent upon
observing the Kullback-Leibler (KL) divergences of the empirical
distributions over data points in the training set for each of the
top-level categories from their respective uniform distributions (see
Fig. 4). The RDC challenge dataset may have exhibited this
phenomenon due to the dataset selection process; however, in general,
for common choices of e-commerce taxonomies in
mid-size organizations, KL divergences usually vary between 1.0 and
2.0 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
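      <p>The imbalance diagnostic above is the KL divergence of the empirical class distribution from the uniform distribution over the same classes; a minimal sketch with hypothetical labels:</p>
      <p>
```python
import math
from collections import Counter

# KL(p || uniform) over k observed classes: sum_i p_i * log(p_i * k).
# A perfectly balanced distribution gives 0; skew increases the divergence.
def kl_from_uniform(labels):
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return sum((c / n) * math.log((c / n) * k) for c in counts.values())

balanced = ["a"] * 50 + ["b"] * 50
skewed = ["a"] * 98 + ["b"] * 2
kl_balanced = kl_from_uniform(balanced)   # 0.0
kl_skewed = kl_from_uniform(skewed)       # clearly positive
```
      </p>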
      <p>
        We initially split the 0.8M training dataset into a 10% Dev2 set, a
10% Dev1 set after excluding the Dev2 set, and the rest for training
and cross-validation. The splitting is done using stratified sampling.
Following a bi-level classification scheme in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we initially built
a baseline hierarchical cascading classifier using fastText;
however, for this dataset, we found that a “flat” fastText classifier
performed better by at least one percentage point absolute.
Additionally, the “negative sampling” loss function has also shown
better performance than the typical “softmax” loss. However, fine
tuning of parameters based on the negative sampling loss function
can show signs of overfitting.
      </p>
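      <p>The stratified splitting described above can be sketched as follows (the data, class names and split fraction are illustrative; the actual splitting tool used is not specified in the text):</p>
      <p>
```python
import random
from collections import defaultdict

# Stratified split sketch: hold out a fixed fraction of listings per
# category, so every split preserves the category distribution.
def stratified_split(items, labels, frac, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)
    held_out, rest = [], []
    for label, members in sorted(by_class.items()):
        rng.shuffle(members)
        cut = int(round(frac * len(members)))
        held_out += [(m, label) for m in members[:cut]]
        rest += [(m, label) for m in members[cut:]]
    return held_out, rest

items = list(range(100))
labels = ["cat-a"] * 70 + ["cat-b"] * 30
dev2, remainder = stratified_split(items, labels, frac=0.10)
```
      </p>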
      <p>
        For data preprocessing, we use the token normalizer mentioned
in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We first tokenize using whitespace and lowercase the
tokens. Then, for each token, we separate any adjoining punctuation
from numbers (decimal or otherwise). We replace all numbers with
a @NUMBER@ literal token and remove all punctuation tokens. We
preserve alphanumeric tokens in the hope that longer
alphanumeric strings sometimes encode model numbers that are unique to
each category. For instance, the surface form of the title, “Ana Silver
Co Rainbow Moonstone Earrings 2 1/4" (925 Sterling Silver) -
Handmade Jewelry EARR346812”, gets converted to “ana silver co rainbow
moonstone earrings @NUMBER@ @NUMBER@ @NUMBER@ @NUMBER@
sterling silver handmade jewelry earr346812” using our normalization
scheme. For the baseline, we also do not remove any stop words or
perform any morphological operations on the resulting tokens. The
entire training dataset is de-duplicated using (category label,
title normal form) tuples.
      </p>
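      <p>A minimal sketch of this normalization scheme follows; the exact normalizer of [1] may differ in details:</p>
      <p>
```python
import re

NUM = re.compile(r"^\d+(\.\d+)?$")    # integer or decimal token
PUNCT = re.compile(r"^\W+$")          # pure punctuation token

def normalize_title(title):
    """Lowercase, tokenize on whitespace, split punctuation off numbers,
    map numbers to @NUMBER@, drop punctuation, keep alphanumerics."""
    out = []
    for tok in title.lower().split():
        # split each token into digit runs, alphanumeric runs, punctuation
        for piece in re.findall(r"\d+(?:\.\d+)?|[a-z0-9]+|\W+", tok):
            if NUM.match(piece):
                out.append("@NUMBER@")
            elif PUNCT.match(piece):
                continue
            else:
                out.append(piece)
    return " ".join(out)

title = ('Ana Silver Co Rainbow Moonstone Earrings 2 1/4" '
         '(925 Sterling Silver) - Handmade Jewelry EARR346812')
normalized = normalize_title(title)
```
      </p>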
      <p>The plots in Fig. 5 confirm the non-deterministic nature of
fastText due to the use of asynchronous SGD. In fastText, each
thread reads its own view of the data, creates a mini-batch of size
one, and updates the parameter matrices without any locking
mechanism. This type of lock-free parameter update has been shown to
be just as effective as its synchronous counterpart under the
assumption of feature sparsity in the data [8]. The plots in Fig. 5 show
absolute accuracy results on our 10% Dev2 set for two different
scenarios, after training a bi-level fastText classifier on the training
data that remains after excluding the Dev1 and Dev2 sets. For the
first scenario, we run fastText for forty runs with the same
parameter settings of dim=min(120,K/2), epoch=100, wordNgrams=3,
loss=softmax, thread=45, with K being the number of classes
for a particular subtree rooted at a classification node in the
taxonomy tree. For this scenario, we obtain a mean absolute accuracy
of 74.0908 with a standard deviation of 0.0741. For the second
scenario, we run fastText for six different runs with the same
settings, except that the number of threads is set to 1, 2, 5,
15, 30 and 45, respectively. The solid bars in the right plot of Fig.
5 show the number of threads, and the corresponding accuracy
numbers are shown on the line graph above the solid bars.</p>
      <p>We can observe from the graph on the right of Fig. 5 that the
trivially synchronous version of fastText, i.e. using just one thread,
is the weakest of all models. This makes sense since fastText is
just a single-layer perceptron with no non-linearities. Accuracy
improves by almost 70% when the number of threads is set to two, but
plateaus off for the other settings of thread counts after that. For
our Dev2 set, we obtain the best accuracy using thirty threads, and
that is the number of threads we set for all of our fastText baseline
runs. The asynchronous SGD of fastText with a mini-batch size of
one raises a legitimate concern: the number of threads apparently
behaves as a parameter that regulates generalization performance
and is machine-architecture dependent. Needless to say,
this behavior is independent of the loss function used. In our
experiments, the default setting of twelve threads has shown lower
performance than thirty threads for this task.</p>
      <p>After some minimal experiments on our Dev2 set, we finalized
the settings of our fastText baseline, RITB-Baseline (see Table
3 and Fig. 3), to be the following: -dim 300 -minn 4 -maxn 10
-wordNgrams 3 -neg 10 -loss ns -epoch 3000 -thread 30.
We have set a high number of epochs due to the availability of free
computational cycles in our servers.</p>
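      <p>For reproduction purposes, the training data must first be put into fastText’s supervised input format, where each line starts with a `__label__`-prefixed class token; a sketch follows (file names and the example category-id-path are illustrative, not actual challenge values):</p>
      <p>
```python
# Convert the challenge TSV (title TAB category-id-path) into fastText's
# supervised input format, one "__label__..." line per listing.
def to_fasttext_line(title, category_id_path):
    # fastText treats whitespace-separated tokens that start with
    # "__label__" as labels, so the path itself must contain no spaces.
    return "__label__" + category_id_path.replace(" ", "") + " " + title

def convert(tsv_path, out_path):
    with open(tsv_path) as src, open(out_path, "w") as dst:
        for line in src:
            title, path = line.rstrip("\n").split("\t")
            dst.write(to_fasttext_line(title, path) + "\n")

# The baseline settings quoted above then correspond to a command such as:
#   fasttext supervised -input train.ft -output ritb-baseline \
#     -dim 300 -minn 4 -maxn 10 -wordNgrams 3 -neg 10 -loss ns \
#     -epoch 3000 -thread 30
example = to_fasttext_line("ana silver co", "3625>3929>1274")
```
      </p>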
      <p>In the next section, we briefly highlight the methods used for the
data challenge task by the teams who have submitted their system
description papers.</p>
    </sec>
    <sec id="sec-7">
      <title>5 SYSTEM DESCRIPTIONS</title>
      <p>Below is a list of each team’s systems. Their team names,
affiliations and system description paper titles, available on the workshop
website, can be seen in Table 3.</p>
      <p>
        • Team CorUmBc submitted one system (0.7690 F1 score)
based on a Bidirectional Long Short Term Memory Network
(BiLSTM) to capture the context information for each word,
followed by a multi-head attention model to aggregate useful
information from these words as the final representation of
the product title. Their model adopts an end-to-end
architecture without any hand-crafted features and is regularized by
various dropout techniques.
• Team HSJX-ITEC-YU submitted one system (0.7790 F1
score) that uses a K Nearest Neighbors (KNN) classification
model and the Best Match 25 (BM25) probabilistic information
retrieval model. The top k product title matches of the BM25
model are then classified by the KNN model.
• Team JCWRY submitted one system (0.8295 F1 score) that
uses deep convolutional neural networks with oversampling,
threshold moving and error-correcting output codes to predict
product taxonomies. Their best accuracy is obtained through
an ensemble of multiple networks, such as Kim-CNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
Zhang-CNN [9], trained on different extracted feature
inputs, including doc2vec [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], NER and POS features.
• Team MKANEMAS submitted one system (0.8399 F1 score)
that formulates the task as a simple classification problem
of all leaf categories in the given dataset. The key feature
of their system is the combination of a Convolutional
Neural Network and Bidirectional LSTM using ad-hoc features
generated from an external dataset (Amazon Product Data)
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
• Team mcskinner submitted one system (0.8513 F1 score)
using a straightforward network architecture and ensemble
LSTM strategy to achieve competitive results. The positive
impact of tightening the connections between recurrent and
output layers through the use of pooling layers is also
demonstrated. The author similarly provides practical details on
their training methodology and algorithms for probability
calibration. The final solution is produced by a bidirectional
ensemble of 6 LSTMs with Balanced Pooling View
architecture.
• Team minimono submitted one system (0.7994 F1 score)
based on the word-level sequence-to-sequence neural networks
widely used in machine translation and automatic
document summarization. It treats taxonomy classification as
a translation problem from a description of a product to a
category path: the text of the product name is viewed as
the encoder input and the sequence of category names as the
decoder output.
• Team neko submitted one system (0.8256 F1 score) that
treats category prediction as a sequence generation task.
The authors built a word-level sequence-to-sequence model
      </p>
      <sec id="sec-7-1">
        <title>Affiliation</title>
      </sec>
      <sec id="sec-7-2">
        <title>System Description Paper</title>
        <p>University of Maryland Baltimore County
Large Scale Taxonomy Classification using BiLSTM with Self-Attention
category issue, sampling and data enhancement techniques
are used. They build eight sample datasets according to the
category hierarchy and develop two classification algorithms
to build models for different levels and search paths using
category trees.
• Team Uplab submitted three systems based on different
classifier types, including a single flat linear Support Vector
Machine classifier (0.8366 F1 score), a top-down ensemble
which combines top-level and sub-level classifiers (0.8173 F1
score) and a CNN with pre-trained word embeddings (0.6509
F1 score). The authors found that TF-IDF with both bi-gram
and uni-gram features works best for categorization.
• Team Waterloo submitted one system (0.7781 F1 score) that
uses multinomial naive Bayes, logistic regression, batched
gradient-descent and neural softmax classifiers retro-fitted
for the hierarchical classification objective. The authors
evaluate different classification methods, including flat and
hierarchical classification, and implement a general framework
to tackle the task of taxonomy classification.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>6 RESULTS</title>
      <p>Twenty-six research groups/individuals participated in the data
challenge. Each team submitted their predicted category-id-paths
on the test set in the same TSV format as the training set (Table 1).
We score the system submissions against the gold standard using
weighted precision, recall and F1 as defined in Section 3. The final
results are summarized in Figure 3. The left hand side of Figure 3
shows the leaderboard in Stage 1 (evaluated on the first 20,000 test
titles) and the right hand side shows the Stage 2 results (evaluated on
the entire set of test titles).
In Stage 1, the top five teams are team mcskinner, team
MKANEMAS, team tiger, team Uplab and team JCWRY with 0.8510,
0.8421, 0.8404, 0.8375 and 0.8278 weighted-F1 scores, respectively.
In Stage 2, the same five teams rank at the top five with 0.8513,
0.8399, 0.8379, 0.8366 and 0.8295 weighted-F1 scores, respectively. The
ranks do not change much between Stage 1 and Stage 2, except for team
Uplab2, team Tyche and team HSJX-ITEC-YU. Table 4 shows the list
of teams whose ranks are higher in Stage 2 than in Stage 1.</p>
    </sec>
    <sec id="sec-9">
      <title>7 APPENDIX</title>
      <p>
        The loss function for fastText, mentioned in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], is the following
log likelihood function:

L(A, B) = Π_{n=1}^{N} L_n = Π_{n=1}^{N} Π_{k=1}^{K} y_{n,k}^{t_{n,k}}    (5)

where t_n is the label of the nth training instance. The optimization
problem on the training set is to find the optimal parameters A* and B*
by minimizing the negative log likelihood in Equ. 5:

(A*, B*) = arg min_{A,B} − (1/N) × Σ_{n=1}^{N} log L_n    (6)
         = arg min_{A,B} − (1/N) × Σ_{n=1}^{N} Σ_{k=1}^{K} t_{n,k} log y_{n,k}    (7)
      </p>
      <p>
        The parameter matrix A connects the input features to the units
in the hidden layer of dimension H . This matrix is a look-up
table over the words [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and H is a parameter of the model. The
weight parameter matrix, B, connects the hidden units to the
output units. The kth output unit of the nth instance contributes
the conditional probability y_{n,k} of the kth label
being 0 or 1. That is, for a particular nth instance (x_n, t_n), we have
L_n = Π_{k=1}^{K} y_{n,k}^{t_{n,k}}, where t_{n,k} ∈ {0, 1}.
      </p>
      <p>From now on, let L = − log L_n, so that we can clear up the notation.
Also, the data is assumed to be i.i.d., so that p(X) = Π_{n=1}^{N} p(x_n). Let
the input nodes be labeled as x_i, i ∈ {1, ..., D}; the hidden layer
nodes be labeled as z_j, j ∈ {1, ..., H}; and the output layer nodes be
labeled as v_k, k ∈ {1, ..., K}. The target for the input x = {x_d} is a
one-hot K vector with the kth entry equal to 1 if x belongs to class
k. Let v_k be the input to the kth output unit, i.e. v_k = B_k^T z with
z_h = Σ_{d=1}^{D} A_{d,h} x_d.</p>
      <p>
        Using the method of maximum likelihood estimation on the
loss function L, where t_k ∈ {0, 1} and y_k = p(v_k) ∈ [0, 1], or,
equivalently, using gradient descent minimization on L, we can
derive the updates for the model parameter estimators.
      </p>
      <p>7.1 Parameter updates for the output layer
We first begin by finding the updates to the
B parameter from Equ. 6.</p>
      <p>For brevity, we drop the subscript n henceforth. Taking derivatives
of the loss function at the output layer w.r.t. B_{k,h}, we obtain

∂L/∂B_{k,h} = −(t_k − y_k) z_h

where y_k = p(v_k) = softmax(v_k) = exp(B_k^T z) / Σ_{k′=1}^{K} exp(B_{k′}^T z).</p>
      <p>Derivation: For a particular nth instance, we have

−L = Σ_{k=1}^{K} t_k log y_k
   = Σ_{k=1}^{K} t_k v_k − log Σ_{k′=1}^{K} exp(v_{k′})
   = Σ_{k=1}^{K} t_k B_k^T z − log Σ_{k′=1}^{K} exp(Σ_{h=1}^{H} B_{k′,h} z_h).

Using the chain rule of derivatives,

∂L/∂B_{k,h} = (∂L/∂v_k) (∂v_k/∂B_{k,h}).

For the first term, we have:

−∂L/∂v_k = t_k − ∂ log(Σ_{k′=1}^{K} exp(v_{k′}))/∂v_k = t_k − exp(v_k)/Σ_{k′=1}^{K} exp(v_{k′}) = t_k − y_k.

For the second term, we have:

∂v_k/∂B_{k,h} = ∂(Σ_{h=1}^{H} B_{k,h} z_h)/∂B_{k,h} = z_h,

leading to the update B_{k,h} ← B_{k,h} + α (t_k − y_k) z_h,

where α is the learning rate that is updated as shown in Equ. 4.</p>
      <p>7.2 Parameter updates for the input layer
Taking derivatives of the loss function at the input layer w.r.t.
A_{h,d}, we obtain,</p>
      <p>Using the chain rule of derivatives,

−∂L/∂A_{h,d} = Σ_{k=1}^{K} (−∂L/∂v_k) (∂v_k/∂z_h) (∂z_h/∂A_{h,d}) = Σ_{k=1}^{K} (t_k − y_k) B_{k,h} x_d,

leading to the update A_{h,d} ← A_{h,d} + α Σ_{k=1}^{K} (t_k − y_k) B_{k,h} x_d.</p>
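      <p>The closed-form output-layer gradient, ∂L/∂B_{k,h} = −(t_k − y_k) z_h, can be verified numerically against finite differences; an illustrative check with small random matrices (dimensions are arbitrary):</p>
      <p>
```python
import numpy as np

# Numerical check of the softmax cross-entropy gradient derived above.
rng = np.random.default_rng(0)
D, H, K = 5, 4, 3
A = rng.normal(size=(D, H))     # input-to-hidden look-up / projection
B = rng.normal(size=(K, H))     # hidden-to-output weights
x = rng.normal(size=D)
t = np.zeros(K)
t[1] = 1.0                      # one-hot target

def loss(B):
    z = x @ A                               # z_h = sum_d A[d,h] x_d
    v = B @ z                               # output pre-activations
    y = np.exp(v - v.max())
    y = y / y.sum()                         # softmax
    return -np.sum(t * np.log(y)), y, z

L0, y, z = loss(B)
analytic = -np.outer(t - y, z)              # dL/dB[k,h] = -(t_k - y_k) z_h

eps = 1e-6                                  # forward finite differences
numeric = np.zeros_like(B)
for k in range(K):
    for h in range(H):
        Bp = B.copy()
        Bp[k, h] += eps
        numeric[k, h] = (loss(Bp)[0] - L0) / eps

grad_ok = np.allclose(analytic, numeric, atol=1e-4)
```
      </p>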
      <p>Note that the document or sentence representation in fastText
is obtained by an averaging of the vector representations of the
constituent n-grams. Thus, the gradients involving x are normalized
by the number of word and/or character n-grams in the input x.
7.3 Asynchronous Stochastic Gradient Descent
The general expression for updating the estimators of a parameter
w in a gradient descent optimization scheme is given by

w^{(t+1)} = w^{(t)} + α u,

where α is the learning rate or step length and u is a descent
direction, which is set to −∂L/∂w^{(t)}. The fastText model relies on the
asynchronous stochastic gradient descent (ASGD) method of
optimization following [8], where several cores can update any component
w_j^{(t)} several times, but atomically, on the same iteration.</p>
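      <p>The update rule above can be illustrated on a toy convex loss; this sketch uses L(w) = (w − 3)^2 purely for illustration, not fastText’s loss:</p>
      <p>
```python
# Gradient descent sketch: w(t+1) = w(t) + alpha * u, with the descent
# direction u = -dL/dw, minimizing L(w) = (w - 3)^2.
def gradient_descent(w0, alpha, steps):
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)    # dL/dw
        w = w + alpha * (-grad)   # step along the descent direction
    return w

w_star = gradient_descent(w0=0.0, alpha=0.1, steps=100)  # converges near 3
```
      </p>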
      <p>
        Both ASGD and SGD depend on finding unbiased estimators of
gradients. The key difference is that for the former, the gradient
computation for parameter estimators is allowed to use stale
values of the previously updated parameter iterates. Convergence is
shown to happen for certain scenarios, where the loss functions
are convex and the stochastic gradients are assumed to be bounded
in expectation. In the original ASGD paper [8], this convergence
is shown for a fixed step size. The convergence analysis for ASGD
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is shown for the non-bounded gradient assumption and a
continuously decreasing step size, which is the case for fastText.
      </p>
      <p>
        It is mentioned in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], that the permutation of indices for the
parameter vector or matrix updates when there are inconsistent
reads and writes by several cores is crucial for good performance.
In fastText, the compute cores start iterating over the data from
fixed positions in the training file and not randomly as in the case
of true SGD. This might be a reason why the solution computed
by fastText is not stochastically optimal with regard to the
assumptions mentioned in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Pradipto</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yandi</given-names>
            <surname>Xia</surname>
          </string-name>
          , Aaron Levine, Giuseppe Di Fabbrizio, and
          <string-name>
            <given-names>Ankur</given-names>
            <surname>Datta</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Web-scale language-independent cataloging of noisy product listings for e-commerce</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers</source>
          , Vol.
          <volume>1</volume>
          .
          <fpage>969</fpage>
          -
          <lpage>979</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xiangyu</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ruining</given-names>
            <surname>He</surname>
          </string-name>
          and
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering</article-title>
          .
          <source>In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee</source>
          ,
          <fpage>507</fpage>
          -
          <lpage>517</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Armand</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Edouard</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Piotr</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Bag of Tricks for Efficient Text Classification</article-title>
          .
          <source>In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers</source>
          .
          <source>Association for Computational Linguistics</source>
          ,
          <fpage>427</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Julian</given-names>
            <surname>McAuley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Targett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Qinfeng</given-names>
            <surname>Shi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anton</given-names>
            <surname>Van Den Hengel</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Image-based recommendations on styles and substitutes</article-title>
          .
          <source>In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM</source>
          ,
          <fpage>43</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Lam</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Phuong Ha</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marten</given-names>
            <surname>van Dijk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Richtarik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Katya</given-names>
            <surname>Scheinberg</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Martin</given-names>
            <surname>Takac</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>SGD and Hogwild! Convergence Without the Bounded Gradients Assumption</article-title>
          .
          <source>In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research)</source>
          ,
          <source>Jennifer Dy and Andreas Krause (Eds.)</source>
          , Vol.
          <volume>80</volume>
          . PMLR, Stockholmsmässan, Stockholm Sweden,
          <fpage>3747</fpage>
          -
          <lpage>3755</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>