<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TopSig at the SIGIR'eCom 2018 Rakuten Data Challenge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Timothy Chappell</string-name>
          <email>timothy.chappell@qut.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shlomo Geva</string-name>
          <email>s.geva@qut.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lawrence Buckingham</string-name>
          <email>l.buckingham@qut.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electrical Engineering and Computer Science, Queensland University of Technology</institution>
          ,
          <addr-line>Brisbane, Queensland</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>For the SIGIR 2018 eCom workshop Rakuten Data Challenge, we present an approach that uses random indexing to create topological signatures (TopSig) for categorising the product names in the provided data sets. This paper describes how this was accomplished.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Clustering and classification;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Traditional text search engines are based on the bag-of-words
approach, and are almost invariably implemented with inverted
files [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. There are, however, some information retrieval tasks that
are better handled by alternative techniques, such as locality
sensitive hashing (LSH). These are often referred to as approximate
nearest neighbour approaches and are common in the image and
speech domains, where objects are typically represented by
numerical quantities and where text-based approaches are not naturally
applicable. With LSH, the data can be handled more
conveniently by projecting from the original feature space, where the
natural representation of an object is inherently high dimensional
and of variable length, into a space with fixed dimensionality.
Similarity between the projected representations can then serve
as a proxy for similarity between the original objects. The fixed
dimensionality of the representation space, combined with favourable
properties of the metric in that space, permits the implementation
of highly efficient search and comparison algorithms.
      </p>
      <p>SIGIR 2018, July 2018, Ann Arbor, Michigan USA
© 2018 Association for Computing Machinery.</p>
      <p>Unlike text search engines, LSH-based search engines compare
entire object representations, rather than look for individual features
(key words) within objects. In this project we aim to use the LSH
approach for searching the Rakuten items text collection, taking
the entire-object comparison approach and treating each item
description as an object. We then project all items into a lower dimensional
metric space, where they can be efficiently processed.</p>
      <p>This experiment is concerned with classification rather than
retrieval. We use a conceptually simple approach:
• We first search the collection with the new product title and
identify its nearest neighbours in the reference data set.
• We then apply a simple post-search re-ranking of the top-k
nearest neighbours, and assign to the new item the category
of the highest ranked item from the top-k nearest.</p>
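      <p>The two-step procedure above can be sketched in a few lines of Python. The tiny 8-bit signatures and category labels here are purely illustrative, not drawn from the challenge data.</p>
      <preformat><![CDATA[
```python
# Sketch of the two-step approach: search for nearest neighbours by
# Hamming distance, then assign the closest neighbour's category.

def hamming(a, b):
    """Number of differing bits between two integer-packed signatures."""
    return bin(a ^ b).count("1")

def classify(query_sig, reference, k=3):
    """reference: list of (signature, category) pairs from the training set."""
    top_k = sorted(reference, key=lambda rc: hamming(query_sig, rc[0]))[:k]
    # Simplest rule: assign the category of the closest neighbour.
    return top_k[0][1]

reference = [(0b10110010, "Wine"), (0b10110011, "Wine"), (0b01001100, "Toys")]
print(classify(0b10110110, reference))  # nearest signature belongs to "Wine"
```
]]></preformat>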
      <p>In this paper we apply the Topological Signature (TopSig)
approach to this challenge to produce the initial top-k list.
</p>
    </sec>
    <sec id="sec-3">
      <title>TOPOLOGICAL SIGNATURES</title>
      <p>
        The concept of performing similarity computations within a model
in which documents exist as vectors is an old one [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], but it is still
considered fundamental to information retrieval theory today. The
naturally high dimensionality of text of any level of complexity
poses certain computational challenges; therefore it is desirable to
reduce this dimensionality where possible. Singular value
decomposition has been used successfully for this purpose [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; however, it
has also been shown that random indexing [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] can achieve similar
results with far less preprocessing time.
      </p>
      <p>
        For the purposes of this research, we have used our
open-source TopSig tool to generate dimensionality-reduced signatures
from the product titles in the provided data sets and to search these
signatures. The signature generation approach used by TopSig has
been shown to be effective at ad-hoc document retrieval tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ],
biological sequence clustering tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and image retrieval tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
among others. The approach can be regarded as a variation on the
LSH approach to approximate nearest neighbour search, and is
further described in the following section.
      </p>
    </sec>
    <sec id="sec-4">
      <title>SIGNATURE GENERATION</title>
      <p>The signature generation process involves transforming the plain
text product titles into fixed length binary strings (signatures).
Random indexing of text starts with the assignment of a fixed length
random signature to each term in the collection. These term
signatures can take the form of ternary vectors having components
in the set [+1, 0, −1], or they can be normally-distributed random
real valued unit vectors. To index a document, the signatures of
all terms appearing in the document are added together, and the
resulting vector is then squashed to yield a binary vector by taking
the sign bit of each vector component, as shown in Figure 1.
Figure 1: Signature generation.
1: for all document ∈ collection do
2:   for all term ∈ document do
3:     document vector ← document vector + hash(term)
4:   end for
5:   for all component ∈ document vector do
6:     if document vector[component] ≤ 0 then
7:       document signature[component] ← 0
8:     else
9:       document signature[component] ← 1
10:    end if
11:  end for
12: end for
The randomly generated
term signatures play a role analogous to the basis vectors of a
projection into the fixed-dimension metric space. When combined they
yield binary vectors with effectively random components which
nonetheless reflect the term content of the originating document.
This projection preserves topology in the sense that documents
containing similar terms tend to have similar signatures. This is
a special case of locality-sensitive hashing. Similarity calculations
between signatures are based on the Hamming distance (number of
differing bits between the signatures). By encoding and processing
signatures with bitwise operations, distance calculations can be
extremely fast.</p>
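      <p>As a rough illustration of the indexing loop in Figure 1, the following Python sketch hashes each term to a sparse ternary vector, sums the vectors, and takes the sign of each component. The width and density constants are illustrative only; the challenge runs described later used 2048-bit signatures.</p>
      <preformat><![CDATA[
```python
# Minimal random-indexing sketch: term vectors with components in
# {+1, 0, -1}, summed per document, then squashed to a binary signature.
import random

WIDTH = 64      # signature length in bits (illustrative; TopSig used 2048)
DENSITY = 8     # non-zero components per term vector (illustrative)

def term_vector(term):
    rng = random.Random(term)            # seeded by the term: repeatable
    vec = [0] * WIDTH
    for _ in range(DENSITY):
        vec[rng.randrange(WIDTH)] = rng.choice([+1, -1])
    return vec

def signature(terms):
    acc = [0] * WIDTH
    for term in terms:
        acc = [a + t for a, t in zip(acc, term_vector(term))]
    # Squash to binary: positive components become 1, the rest 0.
    return [1 if a > 0 else 0 for a in acc]

sig_a = signature(["red", "wine", "bottle"])
sig_b = signature(["red", "wine", "glass"])
sig_c = signature(["plastic", "toy", "truck"])
dist = lambda x, y: sum(p != q for p, q in zip(x, y))
# Titles sharing terms tend to be closer in Hamming distance:
print(dist(sig_a, sig_b), dist(sig_a, sig_c))
```
]]></preformat>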
      <p>TopSig is used to generate signatures from the product titles
provided with this challenge through the following process:
• The product titles are filtered, with non-alphabetical
characters replaced with spaces.
• The titles are tokenised along space boundaries into
individual terms.
• The individual terms are checked against a general term
‘stop list’ consisting of common sentence particles (‘a’, ‘the’
etc.) and terms matching terms in the stop list are discarded.
• The terms are stemmed: words ending in ‘s’, ‘es’ or ‘ies’ have
these endings removed.
• The terms are case-folded to ensure that the same signatures
are generated irrespective of the capitalisation of the terms.
• Finally, the terms are hashed to signatures through random
projection, then summed together to form a signature for
the entire title, in the process depicted in Figure 1.</p>
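      <p>The filtering, tokenisation, stop-listing, stemming and case-folding steps above might look as follows in Python. The stop list and the bare suffix-stripping rule are simplified stand-ins for TopSig's actual configuration.</p>
      <preformat><![CDATA[
```python
# Sketch of the title preprocessing pipeline described above.
import re

STOP_WORDS = {"a", "an", "the", "of", "and"}   # illustrative stop list

def stem(term):
    # Crude suffix stripping: drop a trailing 'ies', 'es' or 's'.
    for suffix in ("ies", "es", "s"):
        if term.endswith(suffix):
            return term[: -len(suffix)]
    return term

def preprocess(title):
    # 1. Replace non-alphabetical characters with spaces.
    title = re.sub(r"[^A-Za-z]", " ", title)
    # 2. Tokenise on spaces and case-fold; 3. drop stop words; 4. stem.
    terms = [t.lower() for t in title.split()]
    return [stem(t) for t in terms if t not in STOP_WORDS]

print(preprocess("2-Pack of Wine Glasses (Red)"))  # ['pack', 'wine', 'glass', 'red']
```
]]></preformat>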
      <p>The use of pseudo-random hashing in the signature generation
process creates signatures in which two completely unrelated
titles will tend to have approximately 50% of bits matching when
their signatures are compared. The presence of co-occurring terms
will increase the number of matching bits, while two titles
containing exactly the same terms will result in two precisely-matching
signatures. In this way, the number of term co-occurrences between
two titles can be approximated by comparing their signatures. It is
this relationship that forms the basis of TopSig’s signature search
approach.
Figure 2: Exhaustive signature search.
1: for all word ∈ query do
2:   query vector ← query vector + hash(word)
3: end for
4: for all component ∈ query vector do
5:   if query vector[component] ≤ 0 then
6:     query sig[component] ← 0
7:   else
8:     query sig[component] ← 1
9:   end if
10: end for
11: for all document sig ∈ collection do
12:   score[document] ← distance(document sig, query sig)
13: end for</p>
    </sec>
    <sec id="sec-5">
      <title>SIGNATURE SEARCHING</title>
      <p>Once the signatures for a collection have been created, the task of
searching them is simple. When the collections are small, such as in
the case of the Rakuten Data Challenge, the approach that leads to
the lowest potential for error is to leverage the high speed at which
Hamming distances between signatures can be computed and to
exhaustively search the database collection for the query signature.
The simplest manifestation of this approach is shown in Figure 2,
in which the query is first transformed into a signature, then the
resulting signature is compared against every other signature in the
database, accumulating a score for each one. These scores (lower
means closer to the query signature) can then be used, depending
on the needs of the particular domain, to determine the closest
results and handle them in whatever way is needed.</p>
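      <p>Packed into machine words, the exhaustive scan of Figure 2 reduces to an XOR and a popcount per signature. A sketch, with small Python integers standing in for bit-packed signatures and hypothetical document identifiers:</p>
      <preformat><![CDATA[
```python
# Exhaustive search: score every signature; lower scores are closer.

def hamming(a, b):
    # XOR leaves a 1 wherever the signatures disagree; count those bits.
    return bin(a ^ b).count("1")

def search(query_sig, collection):
    scores = [(hamming(doc_sig, query_sig), doc_id)
              for doc_id, doc_sig in collection]
    return sorted(scores)

collection = [("doc1", 0b1111000011110000),
              ("doc2", 0b1111000011110011),
              ("doc3", 0b0000111100001111)]
print(search(0b1111000011110001, collection))
# closest first: [(1, 'doc1'), (1, 'doc2'), (15, 'doc3')]
```
]]></preformat>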
      <p>One innovation introduced by the TopSig approach that produces
more stable results when searching is to take advantage of the fact
that we have access to the query at the time of searching, which
means we can produce a mask that excludes the bits that are not
part of the query signature (either positively or negatively).</p>
      <p>The process for this is as follows:
• During searching, when the signature is being generated
from the query terms, a second shadow signature (the mask
signature) is generated as well.
• Whenever a positive or negative value is stored in a term
signature, the mask signature receives a 1 in that position.
• Then, as the database is being exhaustively searched, during
the Hamming distance calculation both the query and
database signatures are masked against the mask signature with
a bitwise AND operation, causing the bits that are not part
of the mask to be excluded from the final result.</p>
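      <p>A sketch of the masked distance in Python; the mask, query and database signatures are small illustrative integers, not real TopSig output.</p>
      <preformat><![CDATA[
```python
# Masked Hamming distance: bits outside the mask (components the query
# never touched) are excluded via a bitwise AND before comparison.

def hamming(a, b):
    return bin(a ^ b).count("1")

def masked_distance(doc_sig, query_sig, mask_sig):
    # Bits that are not part of the query contribute nothing to the distance.
    return hamming(doc_sig & mask_sig, query_sig & mask_sig)

doc   = 0b10110110
query = 0b10100000
mask  = 0b11110000   # only the top four bits were set by query terms
print(hamming(doc, query), masked_distance(doc, query, mask))  # 3 1
```
]]></preformat>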
      <p>This refined approach is described in Figure 3. Note that, for this
to feature any improvement over traditional searching, the term
vectors produced by hashing need to be relatively sparse, with many
components having a value of 0 and therefore not contributing to
the overall signature.</p>
      <p>Due to the way signatures are generated, bits that are not
featured in the query signature will have a 50% chance of agreeing
with the database signatures. This results in a binomial
distribution of noise, centred on half the number of unset bits, affecting the
Hamming distance of all comparisons, which can be a significant
distortion if the query signature contains far fewer terms than the
database signature. Masking mitigates this effect by removing these
bits from the Hamming distance calculation entirely.
Figure 3: Masked signature search.
1: for all word ∈ query do
2:   query vector ← query vector + hash(word)
3: end for
4: for all component ∈ query vector do
5:   if query vector[component] ≤ 0 then
6:     query sig[component] ← 0
7:   else
8:     query sig[component] ← 1
9:   end if
10:  if query vector[component] = 0 then
11:    mask sig[component] ← 0
12:  else
13:    mask sig[component] ← 1
14:  end if
15: end for
16: for all document sig ∈ collection do
17:   score[document] ← distance(document sig ∧ mask sig, query sig ∧ mask sig)
18: end for</p>
    </sec>
    <sec id="sec-6">
      <title>DATA CHALLENGE</title>
      <p>The Rakuten data challenge provides a data set of 1,000,000 product
names from https://www.rakuten.com/, divided into a training set
of 800,000 products with category labels intact and a test set of
200,000 products with labels omitted. These labels are presented as
a sequence of numbers, where each number represents a category
at a particular level on the Rakuten.com website. For instance, the
1208&gt;546&gt;4262&gt;2775 category refers to the Food &amp; Beverage →
Beverages → Wine → United States category of products:
• 1208 → Food &amp; Beverage
• 546 → Beverages
• 4262 → Wine
• 2775 → United States
The goal in this challenge is to predict the labels in the test set.</p>
      <p>While the labels do form a hierarchy in this fashion, we found
that the supercategories in many cases covered too broad a range
of items, making predicting them quite difficult. As a result, our
attempts at hierarchically categorising products, by predicting the
category at each level and then searching within the subset of results
within that category to predict the next level, were less accurate
than simply predicting the label directly. The approach
we describe in the following section therefore does not make use of the
hierarchical structure of the categories, but instead predicts the
entire chain.
Figure 4: Top-k re-ranking.
1: MaxHD ← results[k − 1].distance
2: for all result ∈ results do
3:   LQ ← |query.terms|
4:   LR ← |result.terms|
5:   LI ← 2 × |query.terms ∧ result.terms|
6:   L ← (LI / LQ) × min{LQ / LR, LR / LQ}
7:   HD ← result.distance
8:   H ← exp(1.0 − (HD − MaxHD) / 128)
9:   score[result.class] ← score[result.class] + L × H
10: end for</p>
      <p>
To produce our attempt at this task, we first cleaned the data,
replacing all non-alphabetical characters with spaces, then used TopSig to
generate 2048-bit binary signatures for every title in the training
set. Then, to predict the categories of products in the test set, we
converted each test title to a 2048-bit signature with the same
approach. Following this, we performed an exhaustive search against
the training signatures with the test signature using the masked
search variant described in Figure 3, computing the Hamming
distance between the test signature and each training signature, and
recording the closest 10 signatures.
</p>
    </sec>
    <sec id="sec-7">
      <title>POST-PROCESSING</title>
      <p>While the initial signature search delivers the correct class in the
majority (≥ 77%) of cases, we find that in certain cases the top
result does not return the correct class, while the correct class is
represented well in the remaining top-k results. To handle this
possibility, we include a post-processing step that is applied to
rerank the top-k, pushing well-represented classes to the top if the
top result happens to be an outlier.</p>
      <p>During post-processing, we iterate over the top-k results and
accumulate a score for each class represented in that set based on
two factors, L and H.</p>
      <p>• L reflects the number of intersecting terms between the
query and each result.
• H is derived from the difference between the Hamming
distance of the current result and the last result in the top-k.
These two factors are then multiplied together to produce a score,
which is accumulated for each class represented in the top-k. The
scoring algorithm is illustrated in Figure 4.</p>
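      <p>The scoring loop can be sketched as follows. The exact forms of L and H, and the smoothing constant of 128 bits, are our reading of the algorithm and should be treated as assumptions, as are the toy term sets, distances and labels in the example.</p>
      <preformat><![CDATA[
```python
# Sketch of the top-k re-ranking score (Figure 4). The forms of L and H
# and the 128 constant are assumptions based on the description in text.
import math
from collections import defaultdict

def rerank(query_terms, results, k=10):
    """results: top-k list of (terms, hamming_distance, category),
    sorted by ascending Hamming distance."""
    max_hd = results[k - 1][1] if len(results) >= k else results[-1][1]
    score = defaultdict(float)
    for terms, hd, category in results:
        lq, lr = len(query_terms), len(terms)
        li = 2 * len(query_terms & terms)          # shared terms, counted twice
        l = (li / lq) * min(lq / lr, lr / lq)      # term-overlap factor
        h = math.exp(1.0 - (hd - max_hd) / 128.0)  # distance factor (assumed)
        score[category] += l * h
    return max(score, key=score.get)

q = {"red", "wine", "bottle"}
results = [({"red", "wine"}, 200, "Wine"),
           ({"red", "toy", "truck"}, 210, "Toys"),
           ({"wine", "bottle", "red"}, 215, "Wine")]
print(rerank(q, results, k=3))  # "Wine": well represented in the top-k
```
]]></preformat>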
      <p>Finally, the class with the highest score is predicted as the
outcome for this particular search.
</p>
    </sec>
    <sec id="sec-8">
      <title>EVALUATION</title>
      <p>Classification accuracy is measured by comparing the predicted
category for each product with the actual category. Results are
reported in terms of the following metrics:
• precision
• recall
• F1-score
Among these, the F1-score is the score ultimately used to compare
systems, although all values are reported for competing systems.</p>
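      <p>For concreteness, per-class precision, recall and F1 can be computed as below; the labels and predictions are invented for illustration, and the challenge's own averaging across classes is not reproduced here.</p>
      <preformat><![CDATA[
```python
# Per-class precision, recall and F1 from paired prediction/truth lists.

def precision_recall_f1(predicted, actual, label):
    tp = sum(p == label and a == label for p, a in zip(predicted, actual))
    fp = sum(p == label and a != label for p, a in zip(predicted, actual))
    fn = sum(p != label and a == label for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

predicted = ["Wine", "Wine", "Toys", "Wine"]
actual    = ["Wine", "Toys", "Toys", "Wine"]
print(precision_recall_f1(predicted, actual, "Wine"))
```
]]></preformat>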
    </sec>
    <sec id="sec-9">
      <title>RESULTS</title>
      <p>The approach described in this paper, consisting of a masked
signature search followed by the post-processing described above,
achieved the following precision, recall and F1-scores on the Stage 1
leaderboard in the Rakuten Data Challenge:
• Precision: 0.80
• Recall: 0.80
• F1-score: 0.80
On the Stage 2 leaderboard, similar results were achieved:
• Precision: 0.79
• Recall: 0.80
• F1-score: 0.79</p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION</title>
      <p>We have presented our contribution to the SIGIR’eCom 2018 Rakuten
Data Challenge. The results show that, with minimal customisation
and only a small amount of post-processing, TopSig’s signature
generation and search capabilities can be used as an effective
classification tool. Even a very basic signature search produces
reasonable results, with a little post-processing capable of
pushing the classification accuracy even higher.</p>
    </sec>
    <sec id="sec-11">
      <title>SOURCE CODE AND DATA</title>
      <p>The TopSig signature generation and search tool we used for this
challenge is available from http://www.topsig.org and is licensed
under the GNU GPLv3. The data is available from the organisers of
the SIGIR’eCom 2018 Rakuten Data Challenge (https://sigir-ecom.github.io/data-task.html).</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGEMENTS</title>
      <p>We would like to acknowledge the computational resources
provided by the Queensland University of Technology Science and
Engineering Faculty’s BigData Lab, and the role these resources
played in our attempt at this challenge.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Timothy</given-names>
            <surname>Chappell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Shlomo</given-names>
            <surname>Geva</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Hogan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>K-Means Clustering of Biological Sequences</article-title>
          .
          <source>In Proceedings of the 22nd Australasian Document Computing Symposium. ACM</source>
          ,
          <volume>2</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <year>1990</year>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>Journal of the American society for information science 41</source>
          ,
          <issue>6</issue>
          (
          <year>1990</year>
          ),
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Shlomo</given-names>
            <surname>Geva</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christopher M.</given-names>
            <surname>De Vries</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>TopSig: topology preserving document signatures</article-title>
          .
          <source>In Proceedings of the 20th ACM International Conference on Information and Knowledge Management. ACM</source>
          ,
          <fpage>333</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>An introduction to random indexing</article-title>
          .
          <source>In Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering (TKE)</source>
          , Vol.
          <volume>5</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Gerard</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Anita</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Chung-Shu</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <year>1975</year>
          .
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          <volume>18</volume>
          ,
          <issue>11</issue>
          (
          <year>1975</year>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Dinesha Chathurani</given-names>
            <surname>Nanayakkara Wasam Uluwitige</surname>
          </string-name>
          , Shlomo Geva, Timothy Chappell, and
          <string-name>
            <given-names>Vinod</given-names>
            <surname>Chandran</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Efficient content-based image retrieval based on multi-feature fusion</article-title>
          . (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moffat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Ramamohanarao</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Inverted files versus signature files for text indexing</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS) 23</source>
          ,
          <issue>4</issue>
          (
          <year>1998</year>
          ),
          <fpage>453</fpage>
          -
          <lpage>490</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>