<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ecommerce Product Title Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sylvain Goumy</string-name>
          <email>sylvain@uplab.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohamed-Amine Mejri</string-name>
          <email>Mohamed-amine@uplab.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CEO</institution>
          ,
          <addr-line>Uplab SAS, Lyon 69007</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Data Scientist</institution>
          ,
          <addr-line>Uplab SAS, Lyon 69007</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>E-commerce catalogs include a continuously growing number of products that are constantly updated. Each item in a catalog is characterized by several attributes and identified by a taxonomy label. Categorizing products with their taxonomy labels is fundamental to effectively searching and organizing listings in a catalog. However, manual and/or rule-based approaches to categorization do not scale. In this paper, we describe our work for the SIGIR eCom'18 Rakuten Data Challenge [1], which focuses on large-scale taxonomy classification. We first describe our data preprocessing. We then investigate a number of feature extraction techniques and observe that TF-IDF with both unigrams and bigrams works better for categorization than CNNs with word embeddings. Finally, we evaluate several models and find that Support Vector Machines yield the best results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>Computing methodologies → Machine learning</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>Product taxonomy categorization is a key factor in exposing
merchants’ products to potential online buyers. Most catalog
search engines use taxonomy labels to optimize query results
and match relevant listings to users’ preferences. In addition to
improving the quality of product search, good categorization
also plays a critical role in targeted advertising, personalized
recommendations and product clustering. However, there are
multiple challenges in achieving good product categorization:
- The class set is very large: machine learning applications
typically only have to predict between a few selected
classes (e.g. classifying an email as spam or not spam),
but in e-commerce there are often hundreds or
thousands of categories to distinguish. Training robust
models for these cases requires a particularly large
amount of training data.
- Product data is diverse and unbalanced.</p>
      <p>This naturally raises the question of whether machine
learning and natural language processing can successfully
handle these large-scale classification taxonomies. In this paper,
we explain our approach to this problem in the context of
the Rakuten Data Challenge.</p>
    </sec>
    <sec id="sec-3">
      <title>2 DATA PREPROCESSING</title>
      <p>
        After exploring the training dataset provided by Rakuten, we
took the product titles and ran them through a preprocessing
pipeline, mainly using the libraries ‘re’ (Regular Expressions)
and ‘nltk’ (Natural Language Toolkit), inspired by the
methodology proposed by Shubham Jain in his blog post [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Text Case:</title>
        <p>Lowercasing all letters.</p>
        <p>Special characters:
Removing punctuation and special characters.</p>
        <p>Stop words:
We initially removed the stop words (the, and, in, etc.) but then
decided to keep them, because removing them did not bring any
improvement. Indeed, product titles are not full sentences, and
each word seems to carry importance.</p>
        <p>Digits:
We noticed that product titles often contain many digits. We
started by removing them because we did not expect them to
have much predictive value, but we then tried another approach
that replaces the digits inside a word with 0, so that the
algorithm treats words like 12V and 9V as the same word: 0V.</p>
        <p>Rare Words:
Because they are so rare, the association between rare words and
other words is dominated by noise. We started by removing
words that occur fewer than 3 times in the text, then tried other
values (thresholds of 2 and 1 occurrences, and a percentage
approach: terms with document frequency below 10%), but
finally found that the best performance was obtained with the
threshold of 3 occurrences.</p>
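        <p>To make these steps concrete, the sketch below shows one way such a cleaning pass could be written with ‘re’ and a simple frequency count; the function names are ours and the threshold of 3 occurrences mirrors the description above, so this is an illustration rather than the exact challenge code:</p>
        <preformat>
import re
from collections import Counter

def clean_title(title):
    """Lowercase, strip punctuation and collapse digit runs to 0 (12v and 9v both become 0v)."""
    title = title.lower()
    title = re.sub(r"[^a-z0-9 ]", " ", title)   # drop punctuation / special characters
    title = re.sub(r"\d+", "0", title)          # 12v -> 0v, 9v -> 0v
    return title.split()

def drop_rare_words(tokenized_titles, min_count=3):
    """Remove tokens that appear fewer than min_count times over the whole corpus."""
    counts = Counter(tok for toks in tokenized_titles for tok in toks)
    return [[tok for tok in toks if counts[tok] >= min_count] for toks in tokenized_titles]
        </preformat>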
      </sec>
      <sec id="sec-3-2">
        <title>Tokenization:</title>
        <p>Tokenization refers to dividing the text into a sequence of
words or sentences (we used word_tokenize from nltk to do that).</p>
        <p>Stemming &amp; Lemmatizing:
Finding word stems removes variance coming from word
inflection (i.e. we want our model to know that laptops and
laptop refer to the same thing). We tried different stemmers
(Snowball, Regexp, Porter) from the nltk library as well as the
WordNet Lemmatizer, and we decided to keep the lemmatizer
as it gave better performance on this dataset.</p>
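        <p>For illustration, a tokenization and lemmatization pass with nltk could look like the following sketch (the actual pipeline may differ in details):</p>
        <preformat>
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download("punkt")     # tokenizer models
nltk.download("wordnet")   # lemmatizer dictionary

lemmatizer = WordNetLemmatizer()

def tokenize_and_lemmatize(title):
    """Split a cleaned title into tokens and reduce inflected forms (laptops -> laptop)."""
    return [lemmatizer.lemmatize(tok) for tok in word_tokenize(title)]

print(tokenize_and_lemmatize("refurbished laptops 0gb batteries"))
# ['refurbished', 'laptop', '0gb', 'battery']
        </preformat>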
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 FEATURE EXTRACTION</title>
      <p>After preprocessing the text samples, we want to convert them
into vectors of numbers, because this is the only input that
machine learning algorithms can work with. We tested several
methods to accomplish this:</p>
    </sec>
    <sec id="sec-5">
      <title>3.1 N-grams</title>
      <p>N-grams are combinations of multiple words used together.
The basic principle behind n-grams is that they capture groups of
words and not only individual words. For example, this allows us
to treat ‘tee shirt’ as a single entity instead of two entities ‘tee’
and ‘shirt’.</p>
      <p>N-grams with N=1 are called unigrams. Similarly, bigrams
(N=2), trigrams (N=3) and so on can also be used.</p>
      <p>Unigrams usually do not contain as much information as
bigrams and trigrams. That is why we used both
unigrams and bigrams to improve the model performance.</p>
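      <p>As a quick illustration of the token groups this produces (a sketch using the ngrams helper from nltk, which is one possible choice among many):</p>
      <preformat>
from nltk.util import ngrams

tokens = ["blue", "tee", "shirt", "size", "m"]
unigrams = list(ngrams(tokens, 1))  # [('blue',), ('tee',), ('shirt',), ('size',), ('m',)]
bigrams = list(ngrams(tokens, 2))   # [('blue', 'tee'), ('tee', 'shirt'), ('shirt', 'size'), ('size', 'm')]
      </preformat>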
    </sec>
    <sec id="sec-6">
      <title>TF-IDF (Term Frequency – Inverse</title>
    </sec>
    <sec id="sec-7">
      <title>Document Frequency)</title>
      <p>Similar to bag-of-words, but TF-IDF weighs word occurrences
in a text sample higher when the words are rare in the rest of the
dataset, since such words are likely to be more descriptive of the
sample. Furthermore, words with a high overall frequency in the
dataset can be excluded from the lexicon. As a result, both the
impact of non-informative words and the dimensionality of the
vector space are reduced.</p>
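      <p>In scikit-learn, these choices map directly onto TfidfVectorizer parameters. The sketch below is illustrative: the variable names and the exact parameter values are assumptions, not the tuned configuration:</p>
      <preformat>
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # use both unigrams and bigrams
    min_df=3,            # drop terms seen in fewer than 3 titles
    max_df=0.9,          # drop terms seen in more than 90% of titles
)
X_train = vectorizer.fit_transform(train_titles)  # sparse TF-IDF matrix
X_test = vectorizer.transform(test_titles)
      </preformat>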
    </sec>
    <sec id="sec-8">
      <title>3.3 Word Embedding</title>
      <p>Word embedding is the representation of text in the form of
vectors. The underlying idea is that similar words will have a
minimal distance between their vectors.</p>
      <p>Word embedding models require a lot of text, so we can either
train them on our own training data or use pre-trained word
vectors developed by Google, Wikipedia, etc.</p>
      <p>
        We used GloVe embeddings [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and specifically the
100-dimensional GloVe embeddings of 400k words computed on a
2014 dump of English Wikipedia. This was more complex to
compute, but it creates a low-dimensional text representation
that encodes subtle semantic similarities between words and is
easier for classifiers to train on. The implementation was
inspired by the Keras blog post about solving a text
classification problem using pre-trained word embeddings and a
convolutional neural network [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
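      <p>A condensed sketch of that approach, along the lines of the Keras blog post [<xref ref-type="bibr" rid="ref4">4</xref>]; the file name, layer sizes and helper variables (num_words, word_index, max_title_length, num_categories) are illustrative assumptions, not the exact setup used:</p>
      <preformat>
import numpy as np
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from keras.models import Sequential

EMBEDDING_DIM = 100

# build an index word -> vector from the pre-trained GloVe file
embeddings_index = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# embedding_matrix[i] holds the GloVe vector of the word with index i in our tokenizer
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

model = Sequential([
    Embedding(num_words, EMBEDDING_DIM, weights=[embedding_matrix],
              input_length=max_title_length, trainable=False),
    Conv1D(128, 5, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(num_categories, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
      </preformat>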
      <p>We achieved the best results with TF-IDF combined with
both unigrams and bigrams. Even though word embeddings
definitely outperform TF-IDF on tasks that involve complex
semantic relationships between text samples, they are overkill
for our use case, since product names are rather simplistic and
have barely any syntax in them.</p>
    </sec>
    <sec id="sec-9">
      <title>4 MODEL SELECTION</title>
      <p>
        After preprocessing and vectorization, we built our classifier. We
tested both accuracy and weighted-{precision, recall, F1} for a
range of machine learning models from the scikit-learn library,
such as Logistic Regression, K-Nearest Neighbors, Support Vector
Machines, etc. To do that, we used the scikit-learn Pipeline
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to combine our vectorizer and classifier.
This approach tries to leverage the hierarchical nature of the
product taxonomy by building a classifier that predicts only the
1st-level category (the TopClassifier), and then a sub-classifier for
each branch of the category tree (the SubClassifiers).
      </p>
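      <p>For example, one such vectorizer + classifier pipeline could be assembled as in the sketch below; LinearSVC stands in here for the SVM variant that was actually tuned, and the variable names are illustrative:</p>
      <preformat>
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=3)),
    ("svm", LinearSVC()),
])
clf.fit(train_titles, train_categories)
predicted = clf.predict(test_titles)
      </preformat>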
      <p>One benefit of this approach is that it greatly reduces the
number of distinct labels for each classifier, so each one needs
less memory for training. We were therefore able to train them
with a higher number of features, preserving the words that
appear only once or twice in the dataset, instead of limiting the
features to the words that appear at least 3 times as for the single
classifier, which improves the overall performance.</p>
      <p>ALGORITHM 1: Top-Down Classifier
top_category ← TopClassifier.predict(title)
if top_category = “92” then
    full_category ← “92”
else
    full_category ← SubClassifiers[top_category].predict(title)
end
return full_category</p>
      <p>Top-level category “92” is a special case: because it has no sub-categories,
it does not need a SubClassifier.</p>
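      <p>In Python, the same top-down prediction can be written as a small function; in this sketch, TopClassifier and SubClassifiers are assumed to be fitted pipelines as above, keyed by top-level category:</p>
      <preformat>
def predict_full_category(title):
    """Predict the top-level category first, then refine with the matching sub-classifier."""
    top_category = TopClassifier.predict([title])[0]
    if top_category == "92":
        # top-level category "92" has no sub-categories
        return "92"
    return SubClassifiers[top_category].predict([title])[0]
      </preformat>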
      <p>Here are the final performances of each classifier that we trained
to realize the Top-Down Classifier algorithm:</p>
    </sec>
    <sec id="sec-10">
      <title>5 RESULTS AND DISCUSSION</title>
    </sec>
    <sec id="sec-11">
      <title>5.1 Internal Evaluation</title>
      <p>We split the training dataset to keep 75% for our training
(600,000 products) and the remaining 25% (200,000 products) for
our evaluation and performance measurement.</p>
      <p>For the Rakuten Challenge, we submitted three very distinct
participations, including a CNN with GloVe pre-trained word
embeddings, and thus tested three different classification
approaches for e-commerce products.</p>
      <p>We were surprised to see that the most complex ones were not
the most performant. Actually, a state-of-the-art Support Vector
Machine, combined with a TF-IDF vectorizer and efficient data
preprocessing, proved to be the most powerful tool for this text
classification challenge.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Rakuten</given-names>
            <surname>Institute</surname>
          </string-name>
          of Technology.
          <year>2018</year>
          .
          <article-title>Rakuten Data Challenge: https://sigir-ecom.github.io/data-task</article-title>
          .html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher,
          <string-name>
            <given-names>Christopher D.</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          . GloVe: Global Vectors for Word Representation: https://nlp.stanford.edu/projects/glove/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>François</given-names>
            <surname>Chollet</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Using pre-trained word embeddings in a Keras model: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model</article-title>
          .html
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Scikit-learn developers</surname>
          </string-name>
          .
          <year>2017</year>
          . Pipeline: http://scikitlearn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Shubham</given-names>
            <surname>Jin</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Ultimate guide to deal with Text Data (using Python) - for Data Scientists</article-title>
          &amp; Engineers: https://www.analyticsvidhya.com/blog/2018/02/the-different
          <article-title>-methods-deal-text-datapredictive-python/</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>