=Paper=
{{Paper
|id=Vol-2319/ecom18DC_paper_6
|storemode=property
|title=An Empirical Study of Using An Ensemble Model in e-Commerce Taxonomy Classification Challenge
|pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_6.pdf
|volume=Vol-2319
|authors=Yugang Jia,Xin Wang,Hanqing Cao,Boshu Ru,Tianzhong Yang
|dblpUrl=https://dblp.org/rec/conf/sigir/JiaWCRY18
}}
==An Empirical Study of Using An Ensemble Model in e-Commerce Taxonomy Classification Challenge==
Yugang Jia (Winchester, MA), yugang.jia@gmail.com
Xin Wang (Newton, MA), wangxin8588@gmail.com
Hanqing Cao (Mahwah, NJ), vauus@yahoo.com
Boshu Ru (Dept. of Software and Information Systems, UNC Charlotte, Charlotte, NC), boshu.ru@gmail.com
Tianzhong Yang (The University of Texas Health Science Center, Houston, TX), tianzhong.yang@gmail.com
ABSTRACT

In the Rakuten Data Challenge on Taxonomy Classification for eCommerce-scale Product Catalogs, we propose an approach based on deep convolutional neural networks to predict product taxonomies from product descriptions. The classification performance of the proposed system is further improved with oversampling, threshold moving, and error-correcting output coding. The best classification accuracy is obtained by ensembling multiple networks trained differently on multiple inputs that comprise various extracted features.

CCS CONCEPTS

• Computing methodologies → Machine learning;

KEYWORDS

Multi-class classification, Imbalanced classes, Word embedding, Convolutional neural networks, Error-correcting output coding

Copyright © 2018 by the paper's authors. Copying permitted for private and academic purposes. In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

1 INTRODUCTION

E-commerce sites provide millions of products that are continuously updated by merchants. The correct categorization of each product plays a crucial role in helping customers find the products that meet their needs, supporting product search, targeted advertising, personalized recommendation, and product clustering. However, due to the large scale of the catalog, manually categorizing every product is often infeasible and error prone. Large-scale automatic categorization is therefore in great demand.

2 RELATED WORK

Large-scale automatic categorization faces two main challenges. First, products are sparsely distributed over a large number of categories and the data distribution is far from uniform; such imbalanced data can substantially degrade categorization performance [5, 21]. Second, commercial product taxonomies are usually organized as trees with thousands of leaf nodes, which makes it difficult to exploit the correlations among a large number of taxonomy nodes in a hierarchical structure. Various classification methods, such as flat classification, cascade classification, and probabilistic cascading, have been deployed on large-scale taxonomies [1, 8, 13]. The problem nevertheless remains challenging because of the large data scale, data heterogeneity, and category skewness [24].

In light of these challenges, different methods have been proposed to achieve better classification performance. For example, Naive Bayes is effective and efficient for classifying text documents [15, 19], but it performs poorly when some categories are sparse [23]. Support vector machines (SVMs) have served as a well-established benchmark for classifying e-commerce products [9]. Chen and Warren proposed a multi-class SVM extended with margin re-scaling to optimize average revenue loss [4]. However, SVMs have been shown to require long training times and to work well only when the number of categories is small (fewer than five) [5]. Among deep learning approaches, a recurrent neural network (RNN) was proposed by Pyo and Ha to deal with multi-class classification on unbalanced data [8], in which the learnt word embedding depends on a recursive representation of the same initial feature space. In addition, convolutional neural networks (CNNs) achieve remarkable performance in sentence-level classification [10, 11, 27].
Recently, CNNs have been regarded as a replacement for the well-established SVM and logistic regression models, using pre-trained word vectors as inputs when training the CNN models [28].

Product categorization is a hierarchical multi-class classification problem, so a natural approach is hierarchical classification. However, hierarchical classification suffers from error propagation [25]. Kosmopoulos et al. proposed a probabilistic hierarchical classification approach that predicts the leaf categories by estimating the probability of each root-to-leaf path [13]. Cevahir and Murakami [3] mitigated the error propagation issue in hierarchical classification and achieved better results than flat models by incorporating a large number of features in leaf-category prediction.

To combat the imbalanced classification problem, cost-sensitive training appears to be an effective solution. Zhou and Liu [29] show empirically that only over-sampling and threshold moving are effective for training cost-sensitive neural networks. However, it becomes difficult to define misclassification costs when there are a large number of classes. A more recent study [18] supports similar claims.

Error-correcting output coding (ECOC) is another method that has been used in multi-class text classification to further improve a classifier's accuracy [6, 7]. The idea behind this approach is to encode each class label as a unique binary code of a fixed number of digits, so that redundancy is introduced into the transformed class labels that are then used to train a supervised machine learning model. Even if some errors occur in the prediction of a transformed label, we may still be able to recover the correct original label by choosing the label whose code is closest to the prediction. This approach reduces the size of the model output space when there are a large number of class labels, and at the same time it alleviates the class imbalance problem.

3 DATA CHARACTERISTICS

The SIGIR eCom Data Challenge is on large-scale taxonomy classification. The competition requires us to classify each product description accurately into one of the classes in the unlabeled testing data, using models developed from the labeled training data. The data are provided as a training set (800 thousand products) and a testing set (200 thousand products). The training data has one column containing the product description and one column containing the product category. There are 14 principal product categories represented by numbers, all but one of which have secondary categories ranging from 1 to 7 levels deep (1 to 8 levels in total), resulting in a total of 3008 classes. Each combination of principal and secondary categories forms a string of numbers that represents a class, e.g., "3625>4399>1598>3903" or "2296>3597>689".

There are many characteristics and potential challenges in this dataset. At the class level, there is large variation in the number of samples per class, resulting in a heavily unbalanced dataset (see Table 1). Out of the 3008 classes, 19 classes have only one product, and 1484 classes (49.3 percent) have fewer than 25 products, summing to 13,618 products, or 1.70 percent of all products. On the other hand, 12 classes have more than 10,000 products each, summing to 256,689 products, or 32.1 percent of all products. The top 3 largest classes contain 69,915, 30,146, and 25,481 products, respectively (see Table 2). The classes with many samples may embrace both the richness and the diversity of the data, which can result in inter-class distances being smaller than intra-class distances. The classes with only a handful of samples are expected to be hard to classify accurately in the testing data.

The product description is a mixture of letters, words, numbers, and other ASCII characters. At the product-description level, there appear to be at least the following challenges. The first challenge is that the difference between samples in the same class can be larger than that between samples belonging to different classes, to the extent of seeming mislabels. For example, both "Mont Blanc Mb Starwalker Men Eau De Toilette Edt 2.5Oz / 70Ml" (a perfume) and "Humminbird Pc11 Power Cord" (an electronic device) are in the category "3625>4399>1598>3903", while "Creed Green Irish Tweed Eau De Perfum For Men - Small Size 1oz (30ml)", also a perfume, is in category "3625>3005". The second challenge is that categories under the same parent category (or principal category) are correlated, which reinforces the challenge above. A third challenge is that one abbreviation can have different meanings; for example, "hp" could be "horsepower" or "Hewlett-Packard", and "mb" could be "megabyte" or "marble". The fourth challenge is the large variation in the number of words in each product description, ranging from 1 to dozens. While a short description might not provide sufficient information, such as the single-word product "Bonjour", a long description might include so much detail that the relevant information is hidden, such as "fosmon 2100mah dual port usb rapid car charger for apple iphone 5c/5s/5/4s/4, samsung galaxy note 3/2, s5/s4/mini/active/s3, lg g3/lg g2/ g2 mini, google nexus 5/4, blackberry z10/z30/q10, htc one (m8), motorola moto e/moto x/moto g, nokia lumia 1020".

Because of these data characteristics, we systematically design and implement data preprocessing that filters out noise and keeps relevant information, and we test various modeling strategies, as described in the next sections.
Table 1: Data Characteristics - Category Size

log10(Category Size)   Count   Category Size Range
0                      893     < 10
1                      1372    (10 - 100]
2                      623     (100 - 1,000]
3                      108     (1,000 - 10,000]
4                      12      > 10,000

Table 2: Top 3 Categories

Category                Number of Products
2199>4592>12            69,915
3292>3581>3145>2201     30,146
4015>2337>1458>40       25,481

4 PREPROCESSING AND FEATURIZATION

In this section we describe how the product descriptions are preprocessed and which features are extracted to train our model.

Preprocessing

As described above, the product descriptions contain noise. We mitigate the noise in a trial-and-error way in order to find a good signal-to-noise balance, which results in the following procedure. We first convert all letters to lower case, and replace special characters such as parentheses (except single hyphens) and repeated characters such as multiple hyphens or dots with spaces. We then unify physical units such as "in", "ft", "hp", "ml", and "oz" that follow a number into "nnnin", "nnnft", "nnnhp", "nnnml", and "nnnoz", respectively. For example, "3.4oz", "3.4 oz", or "3.4-oz" all become "nnnoz". Finally, we remove dashes (-), standalone numbers, and extra white space. As a result of this preprocessing, the number of distinct words is reduced from 670K to 160K.
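A minimal sketch of this normalization step, assuming regular expressions are sufficient for the patterns described above; the exact rules and the full unit list used by the authors are not given, so the UNITS list and regexes below are illustrative:

```python
import re

# Units that commonly follow a quantity in product descriptions (illustrative subset).
UNITS = ["in", "ft", "hp", "ml", "oz"]

def preprocess(text: str) -> str:
    """Normalize a product description following the steps described above."""
    text = text.lower()
    # Collapse a number followed by a unit (e.g. "3.4oz", "3.4 oz", "3.4-oz") into "nnnoz".
    for unit in UNITS:
        text = re.sub(r"\b\d+(?:\.\d+)?[\s-]?" + unit + r"\b", "nnn" + unit, text)
    # Replace special characters (except hyphens) and repeated hyphens or dots with spaces.
    text = re.sub(r"[^\w\s-]", " ", text)
    text = re.sub(r"[-.]{2,}", " ", text)
    # Remove standalone hyphens and standalone numbers, then squeeze whitespace.
    text = re.sub(r"(?<!\w)-(?!\w)", " ", text)
    text = re.sub(r"\b\d+\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Creed Green Irish Tweed Eau De Perfum For Men - Small Size 1oz (30ml)"))
# -> "creed green irish tweed eau de perfum for men small size nnnoz nnnml"
```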
Word-level and character-level embeddings for system input

We use three different sets of word embeddings to create vector representations of words, each of which is described in more detail below. The resulting vectors of each word in a product description are concatenated (i.e., column-bound) as word embedding features. As shown in Fig. 1, the word embeddings are concatenated as the input to train a CNN model. The three sets of word embeddings, generated by different methods, are used as three separate inputs.

Word embeddings pre-trained on Google News. The pre-trained word embeddings are trained on part of the Google News dataset (about 100 billion words) using the word2vec algorithm [20]. The representation of a word is learnt to be useful for predicting other words in the sentence. The model contains 300-dimensional vectors for 3 million words and phrases.

Word embeddings pre-trained on product descriptions. We also learn word embeddings from all product descriptions using the word2vec implementation in Gensim [22], an unsupervised learning model. The dimension of the word vectors is set to 50 and we train the word2vec model with the CBOW algorithm [20] for 5 epochs. In each epoch, the learning rate starts at 0.025 and decays linearly to 0.0001.
"nnnhp", "nnnml", and "nnnoz", respectively. For example, names might be seen more commonly in electronic products
"3.4oz", "3.4 oz" or "3.4-oz" would become "nnnoz". Finally, such as Apple and HP. In addition, individual words asso-
we remove dash (-), standalone numbers, and extra white ciated with the named entities might be rare and therefore
spaces. As a result of the preprocessing, distinct words are filtered by our word frequency requirement, but they could
reduced from 670K to 160K. be informative. For example, the word "Beethoven" occurs in
only one product description, but it is the name of a famous
musician and it carries strong information about what type
Word-level and character-level embeddings for of product it could be. Therefore, we use Stanford CoreNLP
system input package [17] to extract 23 types of named entities from the
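A minimal Keras sketch of such a trainable embedding layer, assuming integer-encoded tokens (which are equivalent to one-hot vectors multiplied by the embedding matrix); the vocabulary size and sequence length are placeholders rather than values from the paper, and the character-level variant follows the same pattern over a character vocabulary:

```python
import tensorflow as tf

VOCAB_SIZE = 160_000   # placeholder: roughly the number of distinct words after preprocessing
MAX_LEN = 40           # placeholder: padded description length in tokens

# The Embedding layer learns the 50-dimensional word vectors jointly with the classifier.
inputs = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=50)(inputs)
# "embedded" (shape: batch x MAX_LEN x 50) feeds the first convolutional layer of the network.
```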
Named entity and part-of-speech tag features

The appearance of named entities might be associated with certain product categories. For example, locations, landmarks, and famous people are often prevalent in categories such as branded perfume, books, and movies, while organization names such as Apple and HP might be seen more commonly in electronic products. In addition, individual words associated with named entities might be rare and therefore filtered out by our word-frequency requirement, yet they can still be informative. For example, the word "Beethoven" occurs in only one product description, but it is the name of a famous musician and carries strong information about what type of product it could be. Therefore, we use the Stanford CoreNLP package [17] to extract 23 types of named entities from the product descriptions, namely cause of death, city, country, criminal charge, date, duration, email, ideology, location, misc, money, nationality, number, ordinal, organization, percent, person, religion, set, state or province, time, title, and URL. For each product description, we count the number of words of each entity type and normalize the counts by the length of the sentence, resulting in a 23-dimensional feature vector representing the distribution of named entity types. Additionally, we generate a 36-dimensional feature vector for the distribution of part-of-speech tags, which are also identified by Stanford CoreNLP.
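A minimal sketch of turning per-token NER tags into the normalized 23-dimensional feature vector, assuming the tags have already been produced by a CoreNLP annotation step (the annotation call itself is omitted, and the label spellings in ENTITY_TYPES are our approximation of the 23 types above); the 36-dimensional POS feature vector can be built the same way from tag counts:

```python
from collections import Counter

# The 23 entity types listed above (CoreNLP label spellings may differ slightly).
ENTITY_TYPES = [
    "CAUSE_OF_DEATH", "CITY", "COUNTRY", "CRIMINAL_CHARGE", "DATE", "DURATION",
    "EMAIL", "IDEOLOGY", "LOCATION", "MISC", "MONEY", "NATIONALITY", "NUMBER",
    "ORDINAL", "ORGANIZATION", "PERCENT", "PERSON", "RELIGION", "SET",
    "STATE_OR_PROVINCE", "TIME", "TITLE", "URL",
]

def ner_feature_vector(ner_tags: list[str]) -> list[float]:
    """Normalized distribution of entity types over the tokens of one description."""
    counts = Counter(ner_tags)
    n_tokens = max(len(ner_tags), 1)
    return [counts.get(t, 0) / n_tokens for t in ENTITY_TYPES]

# Hypothetical tags for "Humminbird Pc11 Power Cord" (one ORGANIZATION token).
print(ner_feature_vector(["ORGANIZATION", "O", "O", "O"]))
```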
Figure 1: Illustration of the analysis system for product title categorization. Pre-processed raw text is fed, through supervised word embeddings, Google News word2vec embeddings, and supervised character-level embeddings, into Kim-CNN networks (plain, with oversampling, and with ECOC) and a Zhang-CNN; together with NER/POS and Doc2Vec features, their dense softmax/sigmoid outputs pass through threshold moving and maximum-likelihood decoding into a rule-based final ensemble.

Figure 2: Classifier performance (F1-score, recall and precision) as a function of category size, shown in panels (a), (b) and (c).
Creating sentence-level representations

We use the doc2vec algorithm proposed in "Distributed Representations of Sentences and Documents" by Le and Mikolov [14] to create another set of features for the product descriptions. The algorithm modifies word2vec into an unsupervised method for learning continuous representations of larger blocks of text, such as sentences, paragraphs, or entire documents. We represent each product description by a 50-dimensional feature vector and train the doc2vec model for 5 epochs.
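A minimal Gensim sketch of this step, again assuming tokenized descriptions and the Gensim 4 parameter names; the documents and tags are illustrative:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Tag each tokenized product description with its index (illustrative data).
docs = [
    TaggedDocument(["mont", "blanc", "starwalker", "eau", "de", "toilette"], [0]),
    TaggedDocument(["humminbird", "pc11", "power", "cord"], [1]),
]

# 50-dimensional document vectors trained for 5 epochs, as described above.
model = Doc2Vec(docs, vector_size=50, epochs=5, min_count=1)

print(model.dv[0].shape)                                              # (50,)
print(model.infer_vector(["creed", "green", "irish", "tweed"]).shape)  # (50,)
```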
5 PROPOSED MODEL FOR PRODUCT TAXONOMY CLASSIFICATION

The overall architecture of our system is shown in Fig. 1. We introduce the key components of the proposed model in the following sections. We train five models separately, each using multiple data inputs and a different setup, and we ensemble the predictions of these models to make the final predictions. All models are trained with the Adam algorithm [12] with a learning rate of 0.001.
Kim-CNN architecture

Several recent studies have examined CNN models for text classification tasks and reported that CNN-based models achieve outstanding performance [2, 11, 16]. We adopt the design of Kim's CNN model to extract informative patterns from the word embedding representation of the input data [11]. Kim's model contains parallel convolution filters of three different kernel sizes that highlight informative patterns within sub-areas of the input. The kernel size (k) of a filter determines the extent of the sub-area, which in the case of text is the number of consecutive words in a sentence. By mixing convolution filters of three consecutive sizes (k − 1, k, and k + 1), the network can learn patterns in the sentence at three different scales. This design is similar to combining n-gram features of different orders (e.g., unigram, bigram, and trigram). A max-pooling filter scans the outputs of the convolution filters and preserves only the maximum value in each area. This operation washes out information that is less relevant to the classification task and reduces the dimensionality of the features extracted by the convolution filters. The output is further processed by one layer of fully connected neurons to condense the output matrix.

Based on the Kim-CNN architecture proposed in [11], which uses kernel sizes of 3, 4, and 5, we insert one more kernel of size 1 to capture unigram features, and we apply a batch normalization layer followed by a dropout layer after the fully connected layer of the standard Kim-CNN; the output is then processed by a second fully connected layer to yield the final outputs.
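A minimal Keras sketch of this modified Kim-CNN, assuming 50-dimensional word-embedding inputs. The filter count, kernel sizes, dropout rate, and learning rate follow values stated elsewhere in the paper; the sequence length, the width of the first fully connected layer, and the use of global max pooling are simplifying assumptions:

```python
import tensorflow as tf

MAX_LEN, EMB_DIM, N_CLASSES = 40, 50, 3008   # sequence length is a placeholder

inputs = tf.keras.Input(shape=(MAX_LEN, EMB_DIM))

# Parallel convolution branches with kernel sizes 1, 3, 4, 5, each followed by max pooling.
branches = []
for k in (1, 3, 4, 5):
    x = tf.keras.layers.Conv1D(filters=1024, kernel_size=k, activation="relu")(inputs)
    branches.append(tf.keras.layers.GlobalMaxPooling1D()(x))

x = tf.keras.layers.Concatenate()(branches)
x = tf.keras.layers.Dense(1024, activation="relu")(x)   # first fully connected layer (width assumed)
x = tf.keras.layers.BatchNormalization()(x)             # added after the FC layer
x = tf.keras.layers.Dropout(0.5)(x)                     # dropout rate from the tuning section
outputs = tf.keras.layers.Dense(N_CLASSES, activation="softmax")(x)  # second FC layer

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy")
```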
Oversampling and threshold moving

We adopt an oversampling strategy to improve the model's performance on imbalanced data. When training one of the five networks shown in Fig. 1, we initially draw 256 examples from the training data to form one batch. We then add 768 more examples to the batch by duplicating examples whose classes have fewer than 5000 examples in the entire training set. Within the 768 duplicated examples, we further define three groups of classes according to class size in the training set: classes with 1000 - 5000 examples, classes with 100 - 1000 examples, and classes with fewer than 100 examples. These three groups are sampled in the proportion 1:2:4 to form the 768 oversampled examples, so that each batch finally contains 1024 examples.
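A minimal sketch of this batch-composition scheme, assuming class sizes have been precomputed; the exact group boundaries and the rounding of the 1:2:4 split are our reading, not spelled out in the text:

```python
import random
from collections import defaultdict

def build_batch(train_data, class_sizes, rng=random):
    """train_data: list of (text, label); class_sizes: dict label -> #examples in training set."""
    # Base batch: 256 randomly drawn examples.
    batch = rng.sample(train_data, 256)

    # Pools of examples whose classes fall into the three small-class groups.
    pools = defaultdict(list)
    for text, label in train_data:
        size = class_sizes[label]
        if 1000 <= size < 5000:
            pools["mid"].append((text, label))
        elif 100 <= size < 1000:
            pools["small"].append((text, label))
        elif size < 100:
            pools["tiny"].append((text, label))

    # Duplicate 768 more examples in the proportion 1:2:4 (mid : small : tiny).
    weights = {"mid": 1, "small": 2, "tiny": 4}
    total = sum(weights.values())
    for group, w in weights.items():
        n = 768 * w // total
        if pools[group]:
            batch += rng.choices(pools[group], k=n)

    return batch   # roughly 1024 examples per batch
```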
trained with upsampling; (2) the output of the other
and we apply a batch normalization layer followed by a
Kim-CNN trained with ECOC; (3) NER features.
dropout layer after the fully-connected layer of a standard
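A one-step numpy sketch of this adjustment as we read it (divide each class probability by its class size before taking the argmax); the probabilities and class sizes are toy values:

```python
import numpy as np

probs = np.array([[0.60, 0.30, 0.10]])        # model probabilities for 3 classes (toy example)
class_sizes = np.array([69915, 108, 12])      # training-set sizes of those classes

adjusted = probs / class_sizes                # threshold moving: penalize large classes
print(adjusted.argmax(axis=1))                # -> [2]; the raw argmax would have been 0
```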
Error correcting output coding

To leverage the hierarchy of the class labels and exploit label correlation, we also use error-correcting output coding [6, 7]. For our setting with a large number of classes, we create a unique binary code for each taxonomy node. The code length differs per taxonomy level; for example, the first-level taxonomy uses 3 digits while the second-level taxonomy uses 4 digits. We then concatenate the codes of all taxonomy nodes along the path to form the new output coding. The original taxonomy prediction problem is thus transformed into a multi-label learning problem over the encoded labels. The longest taxonomy path contains 8 taxonomy codes, and we pad the encoded labels of shorter paths with zeros.
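A minimal sketch of this encoding under our reading: each level gets a fixed-width binary code for the node at that level, codes along the path are concatenated, and shorter paths are zero-padded to 8 levels. Only the widths of the first two levels are stated above, so the remaining widths and the per-level code assignments are placeholders:

```python
# Placeholder code widths per taxonomy level; only the first two are given in the text.
LEVEL_BITS = [3, 4, 5, 5, 5, 5, 5, 5]   # 8 levels in total
MAX_LEVELS = 8

def encode_path(path, level_codes):
    """path: list of node ids, e.g. [3625, 4399, 1598, 3903].
    level_codes: per-level dicts mapping node id -> small integer code."""
    bits = []
    for level in range(MAX_LEVELS):
        width = LEVEL_BITS[level]
        if level < len(path):
            code = level_codes[level][path[level]]
        else:
            code = 0                       # zero padding for missing levels
        bits += [int(b) for b in format(code, f"0{width}b")]
    return bits                            # multi-label target vector

# Toy example: path "3625>4399" with hypothetical per-level code assignments.
level_codes = [{3625: 5}, {4399: 9}]
print(encode_path([3625, 4399], level_codes))
```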
Model ensembling

The models trained for ensembling are shown in Fig. 1 and are described as follows:

• Model 1: The last fully connected layer takes as input (1) the output of the Kim-CNN network; (2) the NER features; and (3) the doc2vec features.
• Model 2: The last fully connected layer takes as input (1) the output of one Kim-CNN model trained with oversampling; (2) the output of another Kim-CNN trained with ECOC; and (3) the NER features.
• Model 3: The last fully connected layer takes as input (1) the output of a Kim-CNN network trained with ECOC; (2) the NER features; and (3) the doc2vec features. The output of Model 3 is decoded from the predicted multi-label codes by first computing the likelihood of each individual class label and then assigning to the example the original class label with the highest likelihood.
• Model 4: We train a Kim-CNN model on character-level embeddings of the raw text. The last fully connected layer takes as input (1) the output of this Kim-CNN; (2) the NER features; and (3) the doc2vec features.
• Model 5: In contrast to Kim-CNN, which concatenates feature maps generated from different window sizes in parallel, we train a CNN with six layers stacked sequentially, referred to as Zhang-CNN [26]. The last fully connected layer takes as input (1) the output of the Zhang-CNN; (2) the NER features; and (3) the doc2vec features. We use hyperparameters similar to those proposed in the original paper, with fine tuning.

The output of each model is further adjusted by threshold moving, so each model provides two versions of predicted probabilities, with and without threshold moving. We then derive an ensembling procedure to combine the predictions of multiple models. First, we take the predictions from Models 1, 4, and 5 to form an initial candidate set containing 6 versions of predictions. We train Model 2 three times with random data shuffling and oversampling to provide 6 more versions and add them to the set. We also use 3 different ways of creating output codings and repeatedly train Model 3 to obtain 6 more versions. We observe that adding more than 6 versions from Model 2 or Model 3 does not further improve the overall performance. Finally, we apply majority voting over all versions of predicted labels in the candidate set to determine the final labels. When there is a tie, we choose the label with the highest averaged probability.
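A minimal sketch of this voting rule over the candidate set, assuming each candidate version is represented as a dict of class probabilities for one product (a simplification of the per-class probability arrays actually produced):

```python
from collections import Counter

def ensemble_vote(candidates):
    """candidates: list of dicts mapping class label -> predicted probability,
    one dict per prediction version in the candidate set."""
    top_labels = [max(c, key=c.get) for c in candidates]   # each version votes for its argmax
    votes = Counter(top_labels)
    best_count = max(votes.values())
    tied = [label for label, n in votes.items() if n == best_count]
    if len(tied) == 1:
        return tied[0]
    # Tie-break: pick the tied label with the highest average probability across versions.
    return max(tied, key=lambda lab: sum(c.get(lab, 0.0) for c in candidates) / len(candidates))

preds = [{"A": 0.7, "B": 0.3}, {"A": 0.4, "B": 0.6}, {"B": 0.8, "A": 0.2}]
print(ensemble_vote(preds))   # -> "B"
```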
6 PARAMETER TUNING AND ERROR ANALYSIS

For training the proposed model, hyper-parameters such as the number of filters per convolutional layer (from {256, 512, 1024}), the dropout rate (from {0.1, 0.2, 0.3, 0.4, 0.5}), and the regularization parameter of the convolutional and fully connected layers (from {1×10^-6, 3×10^-6, 1×10^-3, 3×10^-3, 1×10^-1}) are tuned on a hold-out set taken from the training data. We adopt the same combination of window sizes 3, 4, and 5 as used in Kim-CNN [11], and additionally add a new window spanning only one word, which helps to predict the taxonomy correctly for a few selected single words. We select the architecture with the best performance. The adopted hyper-parameters for the Kim-CNN model are the following:

• Number of filters: 1024
• Window sizes of Kim-CNN: {1, 3, 4, 5}
• Dropout rate: 0.5
• Regularization parameter: 3×10^-6

In the tuning process, we first fix the number of filters at 256 and tune the other parameters. We then increase the number of filters to 512 and 1024, and the latter shows better performance. We also notice that a higher dropout rate of 0.5 helps reach the best performance. In addition, our model is quite sensitive to the regularization parameter in terms of convergence speed: once we increase it from 3×10^-6 to 1×10^-3, the model converges much more slowly with no increase in performance.

Our error analysis shows that the classifier works well for categories with large sample sizes, for example more than 1000 cases, and less well for smaller categories. Figure 2 shows the performance on a random validation dataset (20 percent of the training data) using an ensemble of models trained with over-sampling, threshold moving, and error-correcting output coding. This result motivates adopting various sampling strategies that increase the sample size of the small categories.
7 RESULTS AND DISCUSSION

The models are first trained on 80% of the given 800K training samples and validated on the remaining 20% to tune the hyperparameters. The models are then retrained on all 800K samples with the fixed hyperparameters, following an early-stopping strategy, and tested on the 200K samples with unknown labels (Table 3). We achieve good performance, with an F1 score of 0.8295.

Table 3: Testing Results

Metric      Testing-Stage1   Testing-Stage2
Precision   0.8545           0.8528
Recall      0.8172           0.8172
F1          0.8278           0.8295

We observe that feature engineering is particularly important for further improving the performance of the CNN: as we added more relevant but somewhat different features, for example the NER features, the performance improved accordingly. We also observe considerable performance differences among the models. The models based on character-level embeddings perform worse than the others; however, they help to improve the overall performance of the ensemble model.

The threshold-moving method is very helpful for increasing the precision of the individual models, which is critical for the final ensembling. The oversampling and ECOC algorithms add additional randomness and improve the performance of the ensemble model to a certain extent.
ACKNOWLEDGMENTS

The authors would like to thank the organizer of the SIGIR 2018 eCom Data Challenge, the Rakuten Institute of Technology Boston (RIT-Boston), for their support.

REFERENCES

[1] Rohit Babbar, Ioannis Partalas, Eric Gaussier, and Massih R Amini. 2013. On flat versus hierarchical classification in large-scale taxonomies. In Advances in Neural Information Processing Systems. 1824–1832.
[2] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. 2016. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316 (2016).
[3] Ali Cevahir and Koji Murakami. 2016. Large-scale Multi-class and Hierarchical Product Categorization for an E-commerce Giant. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. 525–535.
[4] Jianfu Chen and David Warren. 2013. Cost-sensitive learning for large-scale hierarchical classification. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, 1351–1360.
[5] Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, and Ankur Datta. 2016. Large-scale taxonomy categorization for noisy product listings. In Big Data (Big Data), 2016 IEEE International Conference on. IEEE, 3885–3894.
[6] Thomas G. Dietterich and Ghulum Bakiri. 1995. Solving Multiclass Learning Problems via Error-correcting Output Codes. J. Artif. Int. Res. 2, 1 (Jan. 1995), 263–286.
[7] Rayid Ghani. 2000. Using Error-Correcting Codes for Text Classification. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML '00). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 303–310.
[8] Jung-Woo Ha, Hyuna Pyo, and Jeonghee Kim. 2016. Large-scale item categorization in e-commerce using multiple recurrent neural networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 107–115.
[9] Thorsten Joachims. 1998. Text categorization with support vector machines: Learning with many relevant features. In European conference on machine learning. Springer, 137–142.
[10] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188 (2014).
[11] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.
[12] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014). arXiv:1412.6980
[13] Aris Kosmopoulos, Georgios Paliouras, and Ion Androutsopoulos. 2015. Probabilistic cascading for large scale hierarchical classification. arXiv preprint arXiv:1505.02251 (2015).
[14] Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32 (ICML'14). JMLR.org, II–1188–II–1196.
[15] David D Lewis and Marc Ringuette. 1994. A comparison of two learning algorithms for text categorization. In Third annual symposium on document analysis and information retrieval, Vol. 33. 81–93.
[16] Jingzhou Liu, Wei-Cheng Chang, Yuexin Wu, and Yiming Yang. 2017. Deep Learning for Extreme Multi-label Text Classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 115–124.
[17] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations. 55–60.
[18] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2017. A systematic study of the class imbalance problem in convolutional neural networks. arXiv:1710.05381 (2017).
[19] Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, Vol. 752. Citeseer, 41–48.
[20] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and Their Compositionality. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'13). Curran Associates Inc., USA, 3111–3119.
[21] Michael Pazzani, Christopher Merz, Patrick Murphy, Kamal Ali, Timothy Hume, and Clifford Brunk. 1994. Reducing misclassification costs. In Machine Learning Proceedings 1994. Elsevier, 217–225.
[22] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50.
[23] Dan Shen, Jean-David Ruvini, Rajyashree Mukherjee, and Neel Sundaresan. 2012. A study of smoothing algorithms for item categorization on e-commerce sites. Neurocomputing 92 (2012), 54–60.
[24] Dan Shen, Jean David Ruvini, Manas Somaiya, and Neel Sundaresan. 2011. Item categorization in the e-commerce domain. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 1921–1924.
[25] Carlos N Silla and Alex A Freitas. 2011. A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery 22, 1-2 (2011), 31–72.
[26] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level Convolutional Networks for Text Classification. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1 (NIPS'15). MIT Press, Cambridge, MA, USA, 649–657.
[27] Ye Zhang, Stephen Roller, and Byron Wallace. 2016. MGNC-CNN: A simple approach to exploiting multiple word embeddings for sentence classification. arXiv preprint arXiv:1603.00968 (2016).
[28] Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).
[29] Zhi-Hua Zhou and Xu-Ying Liu. 2006. Training Cost-Sensitive Neural Networks with Methods Addressing the Class Imbalance Problem. IEEE Trans. on Knowl. and Data Eng. 18, 1 (Jan. 2006), 63–77.