<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-Scale Taxonomy Problem: a Mixed Machine Learning Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quentin Labernia</string-name>
          <email>quentin@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Toshiyuki Oike</string-name>
          <email>oike@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yashio Kabashima</string-name>
          <email>kabashima@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kohei Asano</string-name>
          <email>asano@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Takeshi Tokuyama</string-name>
          <email>tokuyama@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michimasa Irie</string-name>
          <email>irie@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jinhee Chun</string-name>
          <email>jinhee@dais.is.tohoku.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tohoku University</institution>
          ,
          <addr-line>GSIS</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>The Rakuten Data Challenge suggests tackling the Large-Scale Taxonomy Challenge. Given a large amount of product titles and the category paths leading to these products, we would like to predict the category path of a given product based only on its title. The provided paths are structured as a forest of 14 trees. The learning process is split into two steps: we first retrieve the tree the input belongs to and then handle the category path. We describe the data embedding, which represents an important task in this challenge, and then introduce the so-called two step architecture. The original idea is based on a deep neural network model. We also introduce the actual method as a modification of the second step, since the original second step is not efficient enough. This last technique makes use of multiple sets of random forest classifiers to navigate inside each tree.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine Learning</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Deep Neural Network</kwd>
        <kwd>Random Forest</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>As part of the SIGIR 2018 workshop, we tackle in this paper the
Rakuten Data Challenge. E-commerce websites handle a large
variety of data and their products are categorized in some way: when a
new product is given to the system, one has to compute its category.
For instance, an inkjet printer is an office product, an electronic
office good, and of course a kind of printer. The Rakuten Data Challenge
addresses this problem in the following terms: given only the
title of a product, try to predict its path through some categories
– as usually shown to the customer on most e-commerce websites.
Each path starts out from one of the root categories and reaches the
actual product’s category, going through some intermediate
levels. For example, the product ’Replacement Viewsonic VG710 LCD
Monitor 48Watt AC Adapter 12V 4A’ is associated with the path
3292&gt;114&gt;1231. We refer to this problem as large-scale taxonomy
classification.</p>
      <p>The Rakuten Data Challenge focuses on machine learning techniques,
as the amount of data is large – about one million instances – and
labels correspond to paths in a tree structure. Considering these
characteristics, it seems more natural to tackle this problem
using machine learning techniques rather than typical data mining
methodologies like FCA – Formal Concept Analysis – or logical
rule-based approaches. Such approaches indeed lead to high
computational complexity and could simply fail because of the inadequate
input and output data representation.</p>
      <p>Up to now, datasets which are representative of the practical
usage of e-commerce websites have not been made public, and the
Rakuten Data Challenge provides such a dataset. However, it brings
with it serious challenges. Among those, the large size of the dataset
represents one of the main issues. This justifies the large-scale
classification appellation. One indeed needs to handle around 0.8 million
instances, each of them consisting of a textual feature – the title of the
product – and a categories’ path description. The output label consists
of an ordered list of categories. We are dealing with a classification
problem and each distinct list is a different label. Considering the
test dataset, we come up with more than 3000 distinct list
configurations. Labels are also unbalanced with regard to the input instances,
and we put emphasis on that point in section 5. Some categories like
“4015” host a lot of products whereas some others do not. Handling
very small classes along with over-represented ones represents
another big issue. The last considered point is related to the noisy
semantics of each instance’s title. We are referring here to the fact
that the title of a product can include the name of the product or a
description of it which is more or less precise. Although two products
may share some words, it might be the case that they are actually very
different and so are not categorized under the same label. For instance,
a laptop named after a certain fruit and a book which
talks about this fruit both share common words. Getting rid of such
mistakes is a fundamental and necessary condition for our model
to predict labels in the most efficient way.</p>
      <p>In this paper, we introduce two methods to tackle the challenge.
The original method refers to our first idea, whose results are not
convincing. Hence, we also introduce the actual method, whose
differences with the original one are discussed in section 4.
This last method has been used to get the Rakuten Challenge results
shown in section 5.</p>
      <p>As for data preprocessing, we give a detailed explanation of how
we address most of the previously mentioned problems in a
logical way. Once the input and output data are formatted in the
right way, we define the architecture of the models we built to
perform the prediction. Both methods are introduced. Based on
this architecture, we outline the experiment strategy. Finally, we
compare the differences between the two strategies and provide critical
explanations of our work.</p>
    </sec>
    <sec id="sec-2">
      <title>DATA OVERVIEW</title>
      <p>The provided data is composed of one million instances, split
between the training dataset (80%) and test dataset (20%). A pair
(xi , yi ) denotes the instance of index i. A sample of data picked up
from the training dataset is shown in table 1.</p>
      <p>The input data – title xi – is composed of one feature formatted
as a string of UTF-8 characters. It corresponds to the title of the
product. A title xi has a specific length and potentially contains
special characters such as the trademark character.</p>
      <p>The taxonomy is a tree, as stated in the Rakuten Data Challenge
rules. After gathering all the nodes – each actually corresponds to
one category – we end up with a forest of 14 trees. Each tree corresponds
to a very general category of products. It is obviously possible to link
the root nodes of all the trees together so as to form a unique tree.
However, we chose not to perform such an operation since it makes
more sense to build specialized models, each managing one of these
general categories at a time. Indeed, the classification logic between
electronic products and make-up goods – for instance – could
potentially be totally different. As a result, we consider the existence of
T1, . . . , T14, the trees we build from the given learning dataset only.
A node of a tree Ti is a category c ∈ N. This category corresponds to
a node in the forest, which means that such a category c is unique
across all the trees. Notice that tree T14 is degenerate in the sense
that it contains only one node.</p>
      <p>The output data, or labels, are tuples of categories. One tuple
yi = (c1, . . . , cDk ) describes a path of categories – associated with
the instance xi –, that is: c1 is the root node of one tree Tk , c2 is one of
c1’s child nodes, and so on until we reach cDk , the end of the path.
After running an analysis on the provided datasets, we found that cDk
is actually always a leaf, thus we consider the following hypothesis
to always be true: for any instance (xi , yi ), the last component of
yi , cDk , is a leaf of a tree. A predicted tuple y˜i matches its ground
truth counterpart yi if and only if both tuples are equal – a perfect
matching between the two. Put another way, our goal is to find
a tuple with the right number of components, each of which
has to match the provided ground truth.</p>
      <p>We do not proceed to any data augmentation, either for data
processing or model usage. This paper focuses on getting the best
results while relying only on the title as input feature. We find it
more interesting to look at how we could obtain consistent results
with semantically noisy title features, as recalled in the introduction.
Let us fix an important notation used hereafter: any variable
written with a tilde is a value predicted by our models. The next
section introduces the data preprocessing based on the previously
stated hypothesis.</p>
    </sec>
    <sec id="sec-2-1">
      <title>DATA PREPROCESSING</title>
      <p>Let us first take a look at the input feature transformation process.
We describe two steps corresponding to the bag of words and word2vec
processes. Both representations are further used in section 4, so
we introduce them one at a time as follows.</p>
      <p>3.1.1 Bag of words. A title xi is a string of characters whose
encoding is UTF-8. We split the title into words using space
characters. The integer qi denotes the number of words in the title
xi . We define a binary vector representation of each word and thus
create a dictionary from the previously split words. The size ω of the
vocabulary corresponds to the number of distinct words kept from
the training dataset. We drop the least and “most” frequent words:
a word is dropped if it appears fewer than ten times, or if its
frequency is high and about the same across all the trees – such
words are not discriminative across different categories. When dealing with an
instance xi , any word which does not belong to the dictionary is
simply dropped. Any word wi,j – word j of the title xi – can then be
encoded into a binary vector of size ω. Notice that we do not
proceed to any lemmatization of the words. We refer to the
bag-of-words vector of the title xi as xi.</p>
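      <p>As an illustration, the dictionary construction and binary word encoding described above can be sketched as follows – a minimal sketch in Python with hypothetical function names; the cross-tree frequency filter is omitted for brevity:</p>
      <preformat>
```python
from collections import Counter

def build_vocab(titles, min_count=10):
    # Count word occurrences over the whole training set
    # (titles are split on space characters, as in the paper).
    counts = Counter(w for t in titles for w in t.split(" "))
    # Drop the least frequent words: keep a word only if it
    # appears at least `min_count` times (ten in the paper).
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(vocab)}

def encode_word(word, vocab):
    # One-hot binary vector of size ω = len(vocab);
    # out-of-vocabulary words are simply dropped (None).
    if word not in vocab:
        return None
    v = [0] * len(vocab)
    v[vocab[word]] = 1
    return v
```
      </preformat>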
      <p>
        3.1.2 word2vec. The next step is dimensionality reduction of
each word in the title xi . To do so, we use the well-known word2vec
technique[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We set each resulting word vector to be an element of R2000.
Since titles are not of fixed size, we need either to build a model
which can handle such variable-length inputs, or to transform the
data into a predefined fixed format. Since our words are embedded
in a sort of semantic space using word2vec, we state that it makes
sense to sum all the word vectors of one title to obtain a
significant meaning. Considering titles of Internet marketplaces,
word order usually does not matter: for instance, “red shoes”
and “shoes / red” should be semantically quite close.
Moreover, it gives us an elegant way to take into account all the
words of the title while producing a fixed final size. To recap, given
a title xi , we first embed its binary word vectors wi,1, . . . , wi,qi in
a semantic space and obtain ui,1, . . . , ui,qi . We then aggregate all
the words using the sum ti = Σ1⩽j⩽qi ui,j . We
feed the machine learning model with these ti .
      </p>
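      <p>The summation-based title embedding can be sketched as follows – a minimal sketch, assuming a precomputed word2vec table mapping each word to a fixed-size vector (2000 dimensions in the paper; 3 here for illustration):</p>
      <preformat>
```python
def embed_title(title, word_vectors):
    # Sum the word2vec vectors of all in-vocabulary words of the
    # title; out-of-vocabulary words are dropped. The result is a
    # fixed-size, word-order-invariant representation ti.
    dim = len(next(iter(word_vectors.values())))
    t = [0.0] * dim
    for w in title.split(" "):
        if w in word_vectors:
            for j, x in enumerate(word_vectors[w]):
                t[j] += x
    return t
```
      </preformat>
      <p>With such a representation, “red shoes” and “shoes / red” map to the same vector, reflecting the order-invariance argued above.</p>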
    </sec>
    <sec id="sec-3">
      <title>Output labels</title>
      <p>
        This section specifically concerns the original method. Labels
are originally formatted like “c1&gt;c2&gt;. . .&gt;cDk ” as mentioned in
section 2. However, such a format cannot be efficiently used as an output
label because it would have to be parsed. Each leaf is reached by
only one path, so it is theoretically possible to treat the
prediction task as multiclass classification – an instance is associated with
one and only one leaf, which also corresponds to only one path. By
doing so, however, we would not take any advantage of the tree structure
hypothesis. Also, some paths contain a very small number
of instances. Therefore, we consider the tree structure and suggest
another way of performing prediction. We choose to transform
the raw labels yi into binary vectors zi . Our goal is to describe
the path from the root node to one leaf by giving the direction
we take at each level. An easy way to proceed is to first sort all the
nodes by level, then encode which node we go through using a
binary vector. For instance, the path 1&gt;3&gt;8 shown in figure 1 could
be written as (0, 1, 0, 0, 0, 0, 1), where the first two components
encode level 1 and the remaining five encode level 2.
The first level group tells us to keep going through node 3
– within a level, we use the category integer representation to order
nodes. The first subgroup of the second level is filled with zeros
because it corresponds to the nodes reachable from node 2.
Since it is mandatory to go through the root node, we ignore it and
start out from the second level of the tree. If the tree is big, this implies
very large labels. In order to reduce the size of such labels, we use a
simple trick, a reworked idea of the anchor representation used for
instance in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This idea is used in the case of image recognition,
where the number of contained labels can differ between images.
      </p>
      <p>A label yi belongs to exactly one of the 14 trees, thus we only
consider the kth tree. The height of this tree is Dk as defined in section
2. We also recall that the depth of a node is its distance to the root
node. We define Lk,d to be the set of nodes of depth d and the set
Mk,d = {|C| | C ⊆ Lk,d ∧ every node in C has the same parent}
with d &gt; 0. The set Mk,d gathers the numbers of children of the
nodes in the level d − 1. The compressed vector zi is of length
ri = Σ1⩽d⩽Dk max Mk,d and is composed of Dk groups, the group
of level d having size Gk,d = max Mk,d .
Looking at figure 1 as an illustration, each group of components
has the size of the biggest box at the considered level. After
choosing a node g in the path, we can only reach its children child(g).
Therefore, the way we compress is by considering child(g) – one box
in figure 1 – instead of Lk,d – the whole layer. However, depending
on g, the number of elements in child(g) might differ, which is why
we encode a level d using Gk,d components – the biggest box in
one level. This ensures we can always encode child(g) while keeping
a fixed size. Let us fix an ordering for each child(g). Since we only
pick exactly one node per level, we set the nth component of the group to
1 if and only if we select the nth node of child(g); otherwise, we
put 0. That is, the previous 1&gt;3&gt;8 path of figure 1 can be rewritten
as (0, 1, 0, 1, 0), where the first group (level 1, Gk,1 = 2) is (0, 1)
and the second group (level 2, Gk,2 = 3) is (0, 1, 0).
By doing so, we drastically reduce the size of the labels and obtain
a fixed representation given the tree k: indeed, recall from section
2 that given any tree k, all its labels share the same length. Although
this representation depends on the tree we choose, it is not a
problem here since we build a specific model for each tree.</p>
      <p>We obviously need to first compute for each tree its level
representation, that is Gk,1, . . . , Gk,Dk . Then it is possible to transform
any yi into its corresponding zi . Conversely, it allows us to get the
raw format label y˜i back from the predicted z˜i .</p>
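      <p>The compressed encoding described above can be sketched as follows – a minimal sketch with hypothetical names, where `children` maps each node to the ordered list of its children (ordered by category integer, as in the text):</p>
      <preformat>
```python
def encode_path(path, children):
    # Encode a path (c1, ..., cD) from the root of one tree as the
    # compressed binary vector z: level d gets a group of size
    # G_d = max number of children among the nodes of depth d-1,
    # and the position of the chosen child within child(parent)
    # is set to 1.
    z = []
    level_nodes = [path[0]]  # nodes of the current depth
    for parent, node in zip(path, path[1:]):
        g = max(len(children.get(n, [])) for n in level_nodes)
        group = [0] * g
        group[children[parent].index(node)] = 1
        z.extend(group)
        # candidates for the next level are all children of this one
        level_nodes = [c for n in level_nodes for c in children.get(n, [])]
    return z
```
      </preformat>
      <p>On a tree where node 1 has children (2, 3), node 2 has three children, and node 3 has children (7, 8), the path 1&gt;3&gt;8 is encoded as (0, 1, 0, 1, 0), matching the worked example above.</p>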
    </sec>
    <sec id="sec-4">
      <title>MODELS DESCRIPTION</title>
      <p>This section describes the two different methods recalled in section 1:
our original idea along with the practical method. Both are based
on a two step model architecture, since we deal with the 14 trees
separately and do not gather them into a unique one. The process
is to first predict the tree Tk the current input belongs to, and then
find the entire path in that selected tree as the second step. The first
step model is denoted by A, while the second step ones are
written B1, . . . , B13. Step one is common to both methodologies and
only the latter part changes between the two. Sections 4.1 and 4.2
give the respective explanations for steps one and two.</p>
    </sec>
    <sec id="sec-5">
      <title>First step: classification over the trees</title>
      <p>Let us consider an instance xi . We feed the model A with ti – recall
that ti is the preprocessed title coming from xi . Our goal is to first
determine to which tree this instance belongs. For that purpose,
we define the model A to be a neural network whose output is a
class between 1 and 14. If the instance xi is classified as k ∈ ⟦1; 14⟧,
it means the instance is part of the tree Tk , and we thus further consider
the model Bk as the second stage process. We use a five-layer fully
connected neural network. Since we classify over 14 trees, the model’s
output size is also 14. The loss function is softmax cross entropy, since
any input title is associated with exactly one tree
– the softmax operation transforms the raw output into a
probability vector.</p>
    </sec>
    <sec id="sec-6">
      <title>Second step: prediction of the path</title>
      <p>4.2.1 Original Method. Each of the models B1, . . . , B13 is either
a shallow neural network or a deep neural network. All the models
are independent of each other. Depending on the characteristics of a
tree, we rely on different machine learning methods.</p>
      <p>First, when trees have simple structures, we use shallow
feed-forward fully connected neural networks, each composed
of three layers. The model outputs zi . A label zi contains multiple 1s,
so it does not represent a single probability distribution; rather, each
group gi,1, . . . , gi,Dk of zi is a probability distribution itself. We therefore
decompose the loss function in such a way that we apply the
softmax cross entropy over each group, and we finally aggregate
the resulting values using summation. If the groups of zi
are of respective lengths Gk,1, . . . , Gk,Dk ,
then the loss function Lk for the model Bk of tree Tk is given by:
Lk (z˜i , zi ) = Σ1⩽d⩽Dk cross entropy(softmax(g˜i,d ), gi,d )</p>
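      <p>The decomposed loss can be sketched numerically as follows – a minimal sketch in plain Python with hypothetical helper names; `raw_output` and `target` are the concatenated per-level groups:</p>
      <preformat>
```python
import math

def softmax(v):
    # numerically stable softmax over one group
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def level_loss(raw_output, target, group_sizes):
    # L_k = sum over levels d of cross_entropy(softmax(g~_d), g_d),
    # where the groups have sizes G_{k,1}, ..., G_{k,D_k}.
    loss, start = 0.0, 0
    for g in group_sizes:
        probs = softmax(raw_output[start:start + g])
        hot = target[start:start + g]
        loss += -sum(t * math.log(p) for t, p in zip(hot, probs))
        start += g
    return loss
```
      </preformat>
      <p>For a uniform raw output and the label (0, 1, 0, 1, 0) with group sizes (2, 3), the loss is log 2 + log 3.</p>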
      <p>
        Next, we consider hard tree structures, that is, trees whose labels
zi have a high dimension. For this purpose, we rely on deep neural
networks, inspired by image processing techniques. Since we deal
with fixed-size inputs, we do not consider recurrent networks and
instead focus on deep convolutional neural networks. Let us recall
that inputs ti are embedded in a 2000 dimensional semantic space
created using word2vec[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and then aggregating all the words in the
title using summation. We want to take advantage of convolution in
order to retrieve expressive features before applying classical fully
connected layers at the end of the network. To that purpose we
suggest the use of ResNet[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in its 50-layer flavor. Such an architecture
is based on residual blocks composed of three convolutional layers
and a shortcut link between the input and output of the block. A
more detailed view of the architecture is shown in figure 3. It allows us
to enjoy the expressiveness of a deep architecture and is known to
be easier to train than other deep convolutional models like VGG[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
or AlexNet[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The loss function is the same as the former shallow fully
connected neural networks’ one.
      </p>
      <p>4.2.2 Actual Method. As the second step, we construct hierarchical
models. Let us recall that step one provides us with the tree k the current
input belongs to. Put another way, we know the first category
c1 of the path yi = (c1, . . . , cDk ). In order to predict the full path,
we construct 13 hierarchical classifiers B1, . . . , B13 whose task is
to predict the rest of the path, c2 to cDk , one step at a time. Notice that
as for tree 14, the answer is obvious since T14 has only one
node.</p>
      <table-wrap id="tbl2">
        <label>Table 2</label>
        <caption>
          <p>(Original method) Gathered specifications for the neural network models. We use the notation Bs = B1, . . . , B13.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Hyperparameter</th><th>Value</th><th>Models</th></tr>
          </thead>
          <tbody>
            <tr><td>Model topology</td><td>ResNet50[<xref ref-type="bibr" rid="ref2">2</xref>] or Shallow NN</td><td>All</td></tr>
            <tr><td>Regularization</td><td>ℓ2-regularization</td><td>All</td></tr>
            <tr><td rowspan="2">Loss function</td><td>Softmax cross entropy</td><td>A</td></tr>
            <tr><td>∀k ∈ ⟦1, 13⟧, Lk</td><td>Bs</td></tr>
            <tr><td>Optimizer</td><td>Adam[<xref ref-type="bibr" rid="ref3">3</xref>]</td><td>All</td></tr>
            <tr><td>Batch scheme</td><td>Mini-batches (⩽ 32)</td><td>All</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <fig id="fig3">
        <caption>
          <p>Figure 3: ResNet50 architecture. The input goes through a convolutional layer with relu, then 16 residual bottleneck blocks – each composed of three convolutional layers with relu activations and a shortcut link – and finally a fully connected layer producing the output.</p>
        </caption>
      </fig>
      <sec id="sec-6-7">
        <title>Building the hierarchical model</title>
        <p>We build the second step model Bk of the tree k as follows. It is
based on a set of models Bk,n where n is a parent node of
the tree k. The goal of each classifier Bk,n is to output the child
node corresponding to the input title. Starting from the root node
of the tree k, we can navigate in the tree by predicting the nodes
of the path yi one at a time. Theoretically, this strategy seems
computationally expensive since the number of parent nodes
increases exponentially with the layers. However, the maximum
number of nodes is less than 1000, which means that we can afford
such a strategy.</p>
        <p>
          Each classifier Bk,n is fed with the bag-of-words title xi and outputs
the child node of n in the tree k which corresponds to our input.
We use random forest[
          <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
          ] algorithm with 150 trees – this
hyperparameter is fixed for all Bk,n . As for the implementation, we
rely on the open-source library scikit-learn (http://scikit-learn.org/).
Algorithm 1 (HierarchicalRandomForest) shows how to build the model Bk . When classifying
a title xi in tree k, we call PredictionProcedure(k, xi , ⊤k ) as
described in algorithm 2. The symbol ⊤k refers to the root node of the
tree k. This simple algorithm consists in starting from the root node
and going deeper one level at a time by choosing the corresponding
classifier inside the set Bk .
        </p>
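        <p>The prediction procedure described above can be sketched as follows – a minimal sketch where `classifiers` maps each parent node of the tree to any callable predictor returning one of its children (a trained random forest in the actual method):</p>
        <preformat>
```python
def predict_path(root, classifiers, title):
    # Start from the root node and go one level deeper at a time,
    # querying the classifier attached to the current node,
    # until a leaf (a node with no classifier) is reached.
    path = [root]
    node = root
    while node in classifiers:
        node = classifiers[node](title)
        path.append(node)
    return tuple(path)
```
        </preformat>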
      </sec>
      <sec id="sec-6-8">
        <title>Results</title>
        <p>In this section, we report results of the SIGIR Rakuten workshop
official evaluation. As stated in section 2, 0.8 million train data and
0.2 million test data are provided. As for the test data, the ground truth is
not given. A leaderboard system shows the results at each stage
of the evaluation process. The computed evaluation metrics are
weighted precision, recall, and f1 over exact CategoryId path
matches on the test set, and the ranking is based on the f1-score.
Table 3 shows the best result for each of the original and actual methods.
The Rakuten Data Challenge is split into two evaluation stages. The
first stage corresponds to the evaluation over a subset of the test
dataset and multiple submissions are possible. On the other hand,
the second stage consists of only one evaluation of the method over
the whole test dataset. Both original and actual methods have been
tested during the first stage while only the actual one is used for
the second stage. Table 3 summarizes these evaluations along with
their corresponding evaluation measures. The final ranking of the
method is 20th out of 28 teams.</p>
        <p>It is clear that our original method does not provide satisfactory
results. We suppose the main reasons lie in the complexity of the category
tree structure and the amount of data available for training
across all distinct paths. It has to be said that this original
method’s issue is overfitting. The first consideration is that, given
a tree, only a small amount of training data may be
available. Even for trees with a large amount of data,
the original method fails to provide a good generalization property.
Indeed, going deeper in the tree means having less data available for
training. Since deep neural networks are used in this context, deep
levels cannot be learned properly. This leads to good results
on the training set and a very poor generalization ability. On the
other hand, the actual method provides good performance because
it relies on decision trees – more precisely, the random forest algorithm
– combined with a simple bag-of-words embedding. Predicting
the children of each node of the tree independently with such a technique
gives us assurance that the learning process will lead to good and
comprehensible results even near the leaves.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>CONCLUSION</title>
      <p>In this work, we suggest a two step architecture which tackles the
large-scale taxonomy challenge as part of the Rakuten Data Challenge
at the SIGIR 2018 workshop. While putting the accent on machine
learning methodology, data preprocessing represents a major point
of attention. We create bags of words, then embed each word of
the vocabulary in a semantic space and aggregate the words of a title
using summation. The bag-of-words representation is used for the actual method’s
second step, while the word2vec embedding is used for the rest. We
build a two step architecture to predict the final path associated
with a newly seen instance. The first step model A chooses which
tree we have to consider – for instance the kth –; we then feed
the right second step model – Bk – to output the encoded path.
This second step is split into two different methods: original and
actual. Our original idea relies on a one-shot fixed-scheme encoding
of the whole category path. We rely on several machine learning
techniques: shallow feed-forward fully connected neural networks
and the ResNet architecture. As for the actual method, we make use
of a hierarchical classification model composed of random forest
models. This method provides good performance compared to the
former one, as shown in section 5.</p>
      <p>Because of the imbalanced classes and complex data structures,
our original strategy does not offer good performance. However,
we think that such a method becomes very efficient when having
enough samples and a uniform distribution over the classes. Since
this situation appears in the shallow levels of most of the trees, we
think it could be worthwhile to combine the original method with the
actual one in order to improve overall performance.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGEMENT</title>
      <p>This work was funded by the ImPACT Program of the Council for Science,
Technology and Innovation.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Leo</given-names>
            <surname>Breiman</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Random Forests</article-title>
          .
          <source>Machine Learning</source>
          <volume>45</volume>
          ,
          <volume>1</volume>
          (
          <issue>01</issue>
          <year>Oct 2001</year>
          ),
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          . DOI:http://dx.doi.org/10.1023/A:1010933404324
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          .
          <source>CoRR abs/1512.03385</source>
          (
          <year>2015</year>
          ). http://arxiv.org/abs/1512.03385
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>CoRR abs/1412.6980</source>
          (
          <year>2014</year>
          ). http://arxiv.org/abs/1412.6980
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          25,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pereira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bottou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          (Eds.). Curran Associates, Inc.,
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          . http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Andy</given-names>
            <surname>Liaw</surname>
          </string-name>
          , Matthew Wiener, and others.
          <year>2002</year>
          .
          <article-title>Classification and regression by randomForest</article-title>
          .
          <source>R News 2</source>
          ,
          <issue>3</issue>
          (
          <year>2002</year>
          ),
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Wei</given-names>
            <surname>Liu</surname>
          </string-name>
          , Dragomir Anguelov, Dumitru Erhan,
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Scott E.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Cheng-Yang</given-names>
            <surname>Fu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Alexander C.</given-names>
            <surname>Berg</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>SSD: Single Shot MultiBox Detector</article-title>
          .
          <source>CoRR abs/1512.02325</source>
          (
          <year>2015</year>
          ). http://arxiv.org/abs/1512.02325
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Ilya Sutskever, Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Distributed Representations of Words and Phrases and their Compositionality</article-title>
          .
          <source>CoRR abs/1310.4546</source>
          (
          <year>2013</year>
          ). http://arxiv.org/abs/1310.4546
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>CoRR abs/1409.1556</source>
          (
          <year>2014</year>
          ). http://arxiv.org/abs/1409.1556
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>