

Convolutional Neural Network and Bidirectional LSTM Based Taxonomy Classification Using External Dataset at SIGIR eCom Data Challenge

Hongwei Zhang, Yahoo Japan Corporation, Tokyo, Japan, hzhang@yahoo-corp.jp
Aya Iwamoto, Yahoo Japan Corporation, Tokyo, Japan, ayiwamot@yahoo-corp.jp
Fumihiko Takahashi, Yahoo Japan Corporation, Tokyo, Japan, ftakahas@yahoo-corp.jp
Hiroaki Shiino, Yahoo Japan Corporation, Tokyo, Japan, hshiino@yahoo-corp.jp
Shogo D. Suzuki, Yahoo Japan Corporation, Tokyo, Japan, shogosu@yahoo-corp.jp
Yohei Iseki, Yahoo Japan Corporation, Tokyo, Japan, yiseki@yahoo-corp.jp

Keywords: Convolutional Neural Network, Bidirectional LSTM, External dataset

2018

On eCommerce websites, human sellers annotate products with various metadata such as a category. Automatic item categorization reduces this annotation cost and has been well researched. This paper describes how we won 2nd place (weighted F1 of 0.8421 at Stage 1 and 0.8399 at Stage 2) at SIGIR eCom DataChallenge 2018, whose goal is to predict each product's category from its title. We formulate the task as a simple classification problem over all leaf categories in the given dataset. The key features of our method are the combination of a Convolutional Neural Network and a Bidirectional LSTM, and the use of ad-hoc features from an external dataset (i.e. one not provided in this contest). We also perform an error analysis, which reveals some cases that are hard to predict accurately.

1 INTRODUCTION

On eCommerce websites, human sellers register products with metadata (e.g. title, category). Annotating products with such metadata is hard work, and therefore automatic prediction of metadata can reduce the cost [2]. In recent years, a number of studies on automatic item categorization in eCommerce have been conducted [1, 2, 8–10]. Das et al. [1] reported that products are often categorized incorrectly because product taxonomies are large. Furthermore, if two different sellers annotate the same product with the same title, the results may differ. These difficulties produce noisy datasets, and therefore automatic item categorization is a difficult task.

In SIGIR eCom DataChallenge 2018, Rakuten Institute of Technology provides train and test datasets. The train dataset is composed of product titles and category ID paths, and the test dataset contains only product titles. The goal of participants is to predict the category ID path for each product title in the test dataset. This challenge is more difficult than previous item categorization problems for the following two reasons: (1) The only product metadata is the title; no other information (e.g. price, image) is included. (2) The dataset contains category “ID” paths rather than category “name” paths, which makes it difficult to use prior knowledge about each category.

This paper describes how we won 2nd place (weighted F1 of 0.8421 at Stage 1 and 0.8399 at Stage 2) at SIGIR eCom DataChallenge 2018. We formulate the task as a simple classification problem over all leaf categories in the given dataset. Our method has two key features: (1) A Convolutional Neural Network and a Bidirectional LSTM are used together. This combination may be useful because the two models differ in structure. (2) Amazon Product Data [3, 6], which contains product reviews and metadata from Amazon, is used to generate ad-hoc features. As described above, the products in the dataset given by this contest have no metadata other than the title. Thus, it is useful to incorporate metadata from the external dataset.

In the rest of the paper, the details of our system are described in Section 2. Section 3 describes the error analysis of our model. Finally, we present the conclusion in Section 4.

2 METHODS

An overview of our system is given in Figure 1. A product title is fed to the system and a category for the product is predicted by the following procedure: (1) The product title is split into words, which are then normalized. (2) Each word is converted into an embedding vector. (3) The embedding vectors are input into the “Multi-kernel CNN” module and the “BiLSTM” module. Each module outputs a flattened vector. (4) Ad-hoc features are generated for each word. The ad-hoc features are fed into a multilayer perceptron, which outputs a flattened vector. (5) The three vectors from steps (3) and (4) are concatenated into one flattened vector and passed into a final fully connected layer, which outputs probabilities for all leaf categories.

In the rest of this section, we describe these procedures in detail.

2.1 Preprocessing of a product title

First, the input sequence (i.e. the product title) is split into words on space characters. Then, symbol characters (e.g. %, #) are removed from each word. Finally, each word is converted to lowercase.
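The three preprocessing steps can be illustrated with a minimal Python sketch. The exact set of symbol characters is not given in the text, so `SYMBOLS` (ASCII punctuation) is an assumption, as are the function name and the choice to drop tokens that become empty.

```python
import string

# Assumption: "symbol characters" means ASCII punctuation; the exact set is not given in the text.
SYMBOLS = set(string.punctuation)


def preprocess_title(title):
    """Split a title on spaces, strip symbol characters from each word, and lowercase it."""
    words = []
    for raw in title.split(" "):
        cleaned = "".join(ch for ch in raw if ch not in SYMBOLS).lower()
        if cleaned:  # drop tokens that were empty or consisted only of symbols
            words.append(cleaned)
    return words
```

For example, under these assumptions the title `Defiance-s3 [dvd] [3discs] (Universal)` becomes the word list `defiances3`, `dvd`, `3discs`, `universal`.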

2.2 Generating embedding vectors

We used word2vec implemented by gensim [7] to generate skipthoughts embedding vectors. The word2vec settings are as follows.

• all words appearing in the train and test datasets are used
• window size is 7
• hierarchical softmax is used for model training
• negative sample size is 5
• embedding vector size is 512
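As a rough illustration, the listed settings could map onto gensim's `Word2Vec` keyword arguments as in the sketch below. This mapping is our assumption rather than the configuration actually used: `min_count=1` is one way to keep every word, `vector_size` is the gensim 4.x name (older releases call it `size`), and the architecture flag `sg` is left at the library default because the text does not specify it unambiguously. The helper name `train_word2vec` is hypothetical.

```python
# Hypothetical mapping of the listed settings onto gensim's Word2Vec keyword arguments.
WORD2VEC_PARAMS = {
    "vector_size": 512,  # embedding vector size (named `size` in gensim < 4.0)
    "window": 7,         # window size
    "hs": 1,             # enable hierarchical softmax
    "negative": 5,       # negative sample size
    "min_count": 1,      # assumption: keep every word in the train and test titles
}


def train_word2vec(tokenized_titles):
    """Train word2vec on a list of tokenized titles (each a list of words)."""
    # Imported lazily so the settings can be inspected without gensim installed.
    from gensim.models import Word2Vec

    return Word2Vec(sentences=tokenized_titles, **WORD2VEC_PARAMS)
```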

Word embeddings are commonly used in natural language processing. In addition to word embeddings, we generated embeddings of the following.

• POS tags
• stemmed word
• lemma of a word
• hypernym of a word

These embeddings are useful for unknown words in the test dataset.

2.3 Training Modules

We used a Convolutional Neural Network with multiple kernels (Multi-kernel CNN) [4] and a Bidirectional LSTM with Soft Attention [5] as training modules.

2.3.1 Multi-kernel CNN. Kim [4] proposed a Convolutional Neural Network based method for sentence classification. We adopted this idea to predict categories and call this module “Multi-kernel CNN”. An overview of the “Multi-kernel CNN” module is given in Figure 2.

The input of this module is the embedding vectors described in Section 2.2, and the output is a vector whose elements correspond to the probabilities of each leaf category. First, the input is passed into one-dimensional convolutional layers, which output feature maps. We used multiple convolutional layers that differ in kernel size (e.g. 2, 3, 4 and 5). Next, the feature maps are flattened into a vector, which is passed into a final fully connected layer.
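The convolution and flattening steps can be sketched in plain Python as follows. This is an illustrative toy, not our trained module: the filters are random, the ReLU activation is an assumption, and the helper names (`conv1d`, `multi_kernel_features`, `random_filters`) are hypothetical. The final fully connected layer is omitted.

```python
import random


def conv1d(seq, kernel):
    """Valid 1-D convolution over time.

    seq: T word vectors of dimension D; kernel: K weight vectors of dimension D.
    Returns a feature map with T - K + 1 entries (empty if T < K).
    """
    k = len(kernel)
    feature_map = []
    for t in range(len(seq) - k + 1):
        total = sum(x * w for i in range(k) for x, w in zip(seq[t + i], kernel[i]))
        feature_map.append(max(total, 0.0))  # ReLU activation (an assumption)
    return feature_map


def multi_kernel_features(seq, filters_by_size):
    """Apply every filter of every kernel size and flatten all feature maps into one vector."""
    flat = []
    for size in sorted(filters_by_size):
        for kernel in filters_by_size[size]:
            flat.extend(conv1d(seq, kernel))
    return flat


def random_filters(sizes, n_filters, dim, seed=0):
    """Build random (untrained) filters: {kernel size: [filter, ...]}."""
    rng = random.Random(seed)
    return {
        k: [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(k)]
            for _ in range(n_filters)]
        for k in sizes
    }
```

With a title of 6 words, kernel sizes 2, 3, 4 and 5, and 2 filters per size, the flattened vector has 2 × (5 + 4 + 3 + 2) = 28 elements.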

2.3.2 Bidirectional LSTM with Soft Attention. In recent years, Recurrent Neural Networks have been widely used in natural language processing. In the field of neural machine translation, the attention mechanism has been reported to be an effective technique [5].

We employed a Bidirectional LSTM with Soft Attention [5] to predict categories. Note that the attention layer of a “sequence to sequence” model accepts the outputs of the LSTM layers on both the input side and the output side; our model, however, is “sequence to label”, so the attention layer accepts only the outputs of the input-side LSTM layers. An overview of the “Bidirectional LSTM” module is given in Figure 3.

The input and the output are the same as those of the “Multi-kernel CNN” module. First, the input is passed into a Bidirectional LSTM layer, which outputs an encoded sequence. Then, the encoded sequence is passed into a soft attention layer, which outputs a probability distribution over all leaf categories.
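The attention pooling over the encoded sequence can be sketched as below. This covers only the weighting step and assumes the Bidirectional LSTM outputs are already computed. The dot-product scoring against a single learned vector (`query`) is one common form of soft attention and is our assumption here; the mapping from the pooled vector to category probabilities is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(encoded, query):
    """Pool an encoded sequence into one vector.

    encoded: T hidden vectors from the (input-side) LSTM layers.
    query: hypothetical learned scoring vector of the same dimension.
    Returns (attention weights, weighted sum of the hidden vectors).
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoded)) for d in range(dim)]
    return weights, context
```

Because only the input-side sequence is scored, no decoder state is involved, which matches the “sequence to label” setting described above.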

2.4 Ad-hoc features

Ad-hoc features such as the length of a title are useful for improving accuracy. The ad-hoc features are as follows.

• title length
• uppercase rate
• alphabet/non-alphabet/digits rate/count
• space character rate/count
• max length of words
• unique word rate
• number of filtered symbols
• number of words
• histogram of word length
• histogram of POS tags
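Several of these features can be computed directly from the raw title, as in the following sketch. The feature names, the use of whitespace splitting, and the choice of title length as the denominator for rates are our assumptions; the histograms, the POS-tag features and the filtered-symbol count are omitted.

```python
def adhoc_features(title):
    """Compute a subset of the ad-hoc features for one raw product title."""
    n = len(title)
    words = title.split()

    def rate(count):
        # Rates are taken relative to the title length (an assumption).
        return count / n if n else 0.0

    upper = sum(ch.isupper() for ch in title)
    alpha = sum(ch.isalpha() for ch in title)
    digits = sum(ch.isdigit() for ch in title)
    spaces = title.count(" ")
    return {
        "title_length": n,
        "uppercase_rate": rate(upper),
        "alphabet_rate": rate(alpha),
        "digit_rate": rate(digits),
        "digit_count": digits,
        "space_rate": rate(spaces),
        "space_count": spaces,
        "max_word_length": max((len(w) for w in words), default=0),
        "unique_word_rate": len(set(words)) / len(words) if words else 0.0,
        "num_words": len(words),
    }
```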

In addition to the ad-hoc features above, metadata of products from an external dataset, Amazon Product Data [3, 6], which contains product reviews and metadata such as categories and prices, is used as ad-hoc features.

Figure 4 shows the procedure for generating the ad-hoc features from Amazon Product Data. (1) A model whose input is a title and whose output is a category is trained on Amazon Product Data. (2) The product titles of the datasets given by this contest and of Amazon Product Data are fed into the model from (1), and embedding vectors are obtained from an embedding layer. (3) For each product in the dataset given by this contest, the prices and categories of the 20 most similar products in Amazon Product Data are fetched. The similarity of two products is measured by the Euclidean distance between their embedding vectors from (2).
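Step (3) amounts to a nearest-neighbor lookup, which can be sketched as a brute-force search. The catalog layout `(embedding, price, category)` and the function names are hypothetical, and the sketch assumes the embedding vectors from step (2) are already available; a large catalog would likely need an approximate index instead.

```python
import heapq
import math


def nearest_products(query_vec, catalog, k=20):
    """Return the k catalog entries closest to query_vec in Euclidean distance.

    catalog: list of (embedding, price, category) tuples (hypothetical layout).
    """
    return heapq.nsmallest(k, catalog, key=lambda item: math.dist(query_vec, item[0]))


def neighbor_features(query_vec, catalog, k=20):
    """Collect the prices and categories of the k most similar external products."""
    neighbors = nearest_products(query_vec, catalog, k)
    prices = [price for _, price, _ in neighbors]
    categories = [category for _, _, category in neighbors]
    return prices, categories
```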

2.5 Concat vectors from previous modules

We obtain three flattened vectors from the “Multi-kernel CNN” module, the “Bidirectional LSTM” module and the multilayer perceptron with ad-hoc features. These vectors are concatenated into one flattened vector, which is passed into a fully connected layer whose output is the same as that of the previous modules (i.e. probabilities of each leaf category).
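A minimal sketch of this final stage is given below, assuming a single fully connected layer followed by softmax. The weights and bias stand in for trained parameters and the function names are hypothetical.

```python
import math


def concat(*vectors):
    """Concatenate several flat vectors into one flat vector."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def dense_softmax(x, weights, bias):
    """Fully connected layer followed by softmax.

    weights: one row of len(x) coefficients per leaf category; bias: one value per category.
    Returns a probability for each leaf category.
    """
    logits = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```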

2.6 Oversampling on “small” categories

In training, we oversampled data by shuffling the words in titles for categories that have fewer than 50 products. This addresses the problem of category imbalance.
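The oversampling can be sketched as follows. The text fixes only the threshold of 50 and the word-shuffling operation; padding each small category up to exactly the threshold, and the function name, are our assumptions.

```python
import random
from collections import defaultdict


def oversample_small_categories(samples, threshold=50, seed=0):
    """Add word-shuffled copies of titles for categories with fewer than `threshold` products.

    samples: list of (words, category) pairs, where words is a list of tokens.
    Assumption: each small category is padded up to `threshold` examples.
    """
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for words, category in samples:
        by_category[category].append(words)

    augmented = list(samples)
    for category, titles in by_category.items():
        for _ in range(max(0, threshold - len(titles))):
            shuffled = list(rng.choice(titles))  # copy so the original title is untouched
            rng.shuffle(shuffled)
            augmented.append((shuffled, category))
    return augmented
```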

3 ERROR ANALYSIS

We split the train dataset given by this contest into two parts, a train part and a validation part, to check the performance of the proposed system. … …

3.1 Top level category prediction

First, we show the accuracy of top level category prediction. Figure 5 shows the confusion matrix of top level category prediction. Products whose top level category is “1208” are often misclassified as “4015”. Table 1 shows examples of “1208” products that are misclassified as “4015”, and Table 2 shows products that belong to the corresponding misclassified categories. In the first and second lines of Table 1, the true and predicted categories seem to be similar. More specifically, “1208>310>1629>1513>3369” and “4015>4454>473” both seem to be food categories. “1208>546>4262>572”

[Figure 6: F1 score for each category]

3.2 Difficult Categories

In this section, difficult categories are explored. Figure 6 shows the F1 score for each category. Categories to which few products belong are difficult to predict accurately.
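For reference, per-category F1 and its weighted average can be computed as in the sketch below. We assume the usual definition of weighted F1, in which each category's F1 is weighted by its number of true products; the official contest evaluation script is not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def per_category_f1(y_true, y_pred):
    """F1 score for every category that appears in y_true."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[t] += 1  # true category gains a false negative
    scores = {}
    for c in set(y_true):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Average of per-category F1, weighted by the number of true products per category."""
    support = Counter(y_true)
    scores = per_category_f1(y_true, y_pred)
    return sum(scores[c] * support[c] for c in support) / len(y_true)
```

Under this weighting, categories with few products contribute little to the overall score even when their individual F1 is low.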

Furthermore, we focus on categories that are hard to predict in categories close to the number of products (i.e. bottom of the

4 CONCLUSION

In this paper, we described how we tackled SIGIR eCom DataChallenge 2018. Our proposed model combines a Convolutional Neural Network and a Bidirectional LSTM. We also used Amazon Product Data to generate ad-hoc features. The error analysis found that two categories that are similar or share the same words in titles are hard to distinguish. It also found that media categories are hard to distinguish because the product titles in those categories do not contain enough information. We believe that the high prediction accuracy came from the proposed deep learning models and the external dataset, but human prior knowledge about each category would be useful for achieving better performance.

REFERENCES

[1] Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta. 2017. Web-scale language-independent cataloging of noisy product listings for e-commerce. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Vol. 1. 969-979.

[2] Jung-Woo Ha, Hyuna Pyo, and Jeonghee Kim. 2016. Large-scale item categorization in e-commerce using multiple recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 107-115.

[3] Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 507-517.

[4] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882 (2014).

[5] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).

[6] Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43-52.

[7] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45-50. http://is.muni.cz/publication/884893/en.

[8] Dan Shen, Jean-David Ruvini, Rajyashree Mukherjee, and Neel Sundaresan. 2012. A study of smoothing algorithms for item categorization on e-commerce sites. Neurocomputing 92 (2012), 54-60.

[9] Dan Shen, Jean-David Ruvini, and Badrul Sarwar. 2012. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 595-604.

[10] Yandi Xia, Aaron Levine, Pradipto Das, Giuseppe Di Fabbrizio, Keiji Shinzato, and Ankur Datta. 2017. Large-Scale Categorization of Japanese Product Titles Using Neural Attention Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Vol. 2. 663-668.

2296>2435>1576: Scarlet Women The Quartet Alfred 00-21113 I Will Sing - Music Book
2296>2435>3792: Nocturnes and Polonaises Complete Preludes and Etudes-Tableaux Blues Heaven
2296>3597>3064: Dowin In The Delta Free Beer Regina Belle - Believe in Me
2296>3597>3956: New Edition Backyard - Skillet Suits-Season Three
2296>3706>1586: Newlyweds-Nick and Jessica Complete 2nd and 3rd Seasons Defiance-s3 [dvd] [3discs] (Universal) Log Horizon: Season 2 - Collection 2
2296>3706>3437: Yu Hakusho Season 3 Case Closed-Season 3-S.A .V.E.