=Paper= {{Paper |id=Vol-2319/ecom18DC_paper_11 |storemode=property |title=TopSig at the SIGIR'eCom 2018 Rakuten Data Challenge |pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_11.pdf |volume=Vol-2319 |authors=Timothy Chappell,Shlomo Geva,Lawrence Buckingham |dblpUrl=https://dblp.org/rec/conf/sigir/ChappellGB18 }} ==TopSig at the SIGIR'eCom 2018 Rakuten Data Challenge== https://ceur-ws.org/Vol-2319/ecom18DC_paper_11.pdf
              TopSig at the SIGIR’eCom 2018 Rakuten Data Challenge
                  Timothy Chappell                                                         Shlomo Geva                             Lawrence Buckingham
      School of Electrical Engineering and                                School of Electrical Engineering and                School of Electrical Engineering and
        Computer Science, Queensland                                        Computer Science, Queensland                        Computer Science, Queensland
           University of Technology                                            University of Technology                            University of Technology
             Brisbane, Queensland                                                Brisbane, Queensland                                Brisbane, Queensland
         timothy.chappell@qut.edu.au                                              s.geva@qut.edu.au                               l.buckingham@qut.edu.au

ABSTRACT                                                                                               Unlike text search engines, LSH based search engines compare en-
For the SIGIR 2018 eCom workshop Rakuden Data Challenge we                                             tire object representations, rather than look for individual features
present an approach utilising random indexing to create toplogical                                     within objects (key words). In this project we aim to use the LSH
signatures (TopSig) for the purposes of categorising the product                                       approach for searching the Rakuten items text collection, by taking
names in the provided data sets. This paper describes the approach                                     the entire object comparison approach, treating each item descrip-
used to accomplish this.                                                                               tion as an object. We then project all items into a lower dimensional
                                                                                                       metric space, where it can be efficiently processed.
CCS CONCEPTS                                                                                              This experiment is concerned with classification rather than
                                                                                                       retrieval. We use a conceptually simple approach:
• Information systems → Clustering and classification;
                                                                                                           • We first search the collection with the new product title and
KEYWORDS                                                                                                     identify its nearest neighbours in the reference data set.
                                                                                                           • We then apply a simple post-search re-ranking of the top-k
Signatures, hamming distance
                                                                                                             nearest neighbours, and assign to the new item the category
ACM Reference Format:                                                                                        of the highest ranked item from the top-k nearest.
Timothy Chappell, Shlomo Geva, and Lawrence Buckingham. 2018. TopSig
at the SIGIR’eCom 2018 Rakuten Data Challenge. In Proceedings of the
                                                                                                         In this paper we apply the Topological Signature (TopSig) ap-
41st International ACM SIGIR Conference on Research and Development in                                 proach to this challenge to produce the initial top-k list.
Information Retrieval (SIGIR 2018). ACM, New York, NY, USA, 4 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn                                                                2   TOPOLOGICAL SIGNATURES
                                                                                                       The concept of performing similarity computations within a model
1      INTRODUCTION                                                                                    in which documents exist as vectors is an old one [5], but still con-
Traditional text search engines are based on the bag-of-words                                          sidered a fundamental of information retrieval theory today. The
approach, and are almost invariably implemented with inverted                                          naturally high dimensionality of text of any level of complexity
files [7]. There are however some information retrieval tasks that                                     poses certain computational challenges; therefore it is desirable to
are better handled by alternative techniques, such as locality sen-                                    reduce this dimensionality where possible. Singular value decom-
sitive hashing (LSH). These are often referred to as approximate                                       position has been used successfully for this purpose [2], however it
nearest neighbour approaches and are common in the image or                                            has also been shown that random indexing [4] can achieve similar
speech domains, where objects are typically represented by numer-                                      results with far less preprocessing time.
ical quantities and where text based approaches are not naturally                                         For the purposes of this research, we have made use of our open-
applicable. With LSH, the data can be handled in a more conve-                                         source TopSig tool to generate dimensionality-reduced signatures
nient way by projecting from the original feature space, where the                                     from the product titles in the provided data sets and to search these
natural representation of an object is inherently high dimensional                                     signatures. The signature generation approach used by TopSig has
with variable length, into a space with fixed dimensionality. It then                                  been shown to be effective at ad-hoc document retrieval tasks [3],
becomes possible to use similarity between object representations                                      biological sequence clustering tasks [1] and image retrieval tasks [6]
as a proxy for calculation of original representation object similar-                                  among others. The approach can be regarded as a variation on the
ity. The fixed dimensionality of the representation space combined                                     LSH approach to approximate nearest neighbour search, and is
with favourable properties of the metric in that space permit the im-                                  further described in the following section.
plementation of highly efficient search and comparison algorithms.
                                                                                                       3   SIGNATURE GENERATION
Permission to make digital or hard copies of all or part of this work for personal or
Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
 classroom    use is granted    withoutS.fee providedM.that copies   areLin,
                                                                         notA.
                                                                             made    or distributed
                                                                                                       The signature generation process involves transforming the plain
In: J. Degenhardt,    G. Di Fabbrizio,     Kallumadi,    Kumar,   Y.-C.         Trotman,  H. Zhao
 for profit
(eds.):     or commercial
        Proceedings            advantage
                      of the SIGIR        andworkshop,
                                   2018 eCom   that copies
                                                        12 bear  this notice
                                                           July, 2018,        and Michigan,
                                                                        Ann Arbor, the full citation
                                                                                              USA,     text product titles into fixed length binary strings (signatures). Ran-
published  at http://ceur-ws.org
 on the first  page. Copyrights for components of this work owned by others than ACM                   dom indexing of text starts with the assignment of a fixed length
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a            random signature to each term in the collection. These term sig-
fee. Request permissions from permissions@acm.org.                                                     natures can take the form of ternary vectors having components
SIGIR 2018, July 2018, Ann Arbor, Michigan USA                                                         in the set [+1, 0, −1], or they can be normally-distributed random
© 2018 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00                                                           real valued unit vectors. To index a document, the signatures of
https://doi.org/10.1145/nnnnnnn.nnnnnnn                                                                all terms appearing in the document are added together, and the
SIGIR 2018, July 2018, Ann Arbor, Michigan USA                                     Timothy Chappell, Shlomo Geva, and Lawrence Buckingham

 1: for all document ∈ collection do                                          1: for all word ∈ query do
 2:   for all term ∈ document do                                              2:     query vector ← query vector + hash(word)
 3:     document vector ← document vector + hash(term)                        3: end for
 4:   end for                                                                 4: for all component ∈ query vector do
 5:   for all component ∈ document vector do                                  5:if query vector[component] ≤ 0 then
 6:     if document vector[component] ≤ 0 then                                6:   query sig[component] ← 0
 7:        document signature[component] ← 0                               7:   else
 8:     else                                                               8:      query sig[component] ← 1
 9:        document signature[component] ← 1                               9:   end if
10:     end if                                                            10: end for
11:   end for                                                             11: for all document sig ∈ collection do
12: end for                                                               12:   score[document] ← distance(document sig, query sig)
                                                                          13: end for
Figure 1: Pseudo-code algorithm for generating signatures
                                                                          Figure 2: Pseudo-code algorithm for exhaustively searching
                                                                          signatures (object←→object comparison)

resulting vector is then squashed to yield a binary vector by taking
the sign bit of each vector component. The randomly generated             4        SIGNATURE SEARCHING
term signatures play a role analogous to the basis vectors of a pro-      Once the signatures for a collection have been created, the task of
jection into the fixed-dimension metric space. When combined they         searching them is simple. When the collections are small, such as in
yield binary vectors with effectively random components which             the case of the Rakuten Data Challenge, the approach that leads to
nonetheless reflect the term content of the originating document.         the lowest potential for error is to leverage the high speed at which
This projection preserves topology in the sense that documents            Hamming distances between signatures can be computed and to
containing similar terms tend to have similar signatures. This is         exhaustively search the database collection for the query signature.
a special case of locality-sensitive hashing. Similarity calculations     The simplest manifestation of this approach is shown in Figure 2,
between signatures are based on the Hamming distance (number of           in which the query is first transformed into a signature, then the
differing bits between the signatures). By encoding and processing        resulting signature is compared against every other signature in the
signatures with bitwise operations distance calculations can be           database, accumulating a score for each one. These scores (lower
extremely fast.                                                           means closer to the query signature) can then be used, depending
   TopSig is used to generate signatures from the product titles          on the needs of the particular domain, to determine the closest
provided with this challenge through the following process:               results and handle them in whatever way is needed.
                                                                             One innovation introduced by the TopSig approach that produces
      • The product titles are filtered, with non-alphabetical charac-    more stable results when searching is to take advantage of the fact
        ters replaced with spaces.                                        that we have access to the query at the time of searching, which
      • The titles are tokenised along space boundaries into individ-     means we can produce a mask that excludes the bits that are not
        ual terms.                                                        part of the query signature (either positively or negatively).
      • The individual terms are checked against a general term              The process for this is as follows:
        ‘stop list’ consisting of common sentence particles (‘a’, ‘the’
                                                                               • During searching, when the signature is being generated
        etc.) and terms matching terms in the stop list are discarded.
                                                                                 from the query terms, a second shadow signature (the mask
      • The terms are stemmed- words ending in ‘s’, ‘es’ or ‘ies’ have
                                                                                 signature) is generated as well.
        these endings removed.
                                                                               • Whenever a positive or negative value is stored in a term
      • The terms are case-folded to ensure that the same signatures
                                                                                 signature, the mask signature receives a 1 in that position.
        are generated irrespective of the capitalisation of the terms.
                                                                               • Then, as the database is being exhaustively searched, during
      • Finally, the terms are hashed to signatures through random
                                                                                 the Hamming distance calculation both the query and data-
        projection, then summed together to form a signature for
                                                                                 base signatures are masked against the mask signature with
        the entire title, in the process depicted in Figure 1.
                                                                                 a bitwise AND operation, causing the bits that are not part
   The use of pseudo-random hashing in the signature generation                  of the mask to be excluded from the final result.
process creates signatures in which two completely unrelated ti-             This refined approach is described in Figure 3. Note that, for this
tles will tend to have approximately 50% of bits matching when            to feature any improvement over traditional searching, the term
their signatures are compared. The presence of co-occurring terms         vectors produced by hashing need to be relatively sparse, with many
will increase the number of matching bits, while the presence of          components having a value of 0 and therefore not contributing to
only co-occurrent terms will result in two precisely-matching sig-        the overall signature.
natures. In this way, the number of term co-occurrences between              Due to the way signatures are generated, bits that are not fea-
two titles can be approximated by comparing their signatures. It is       tured in the query signature will have a 50% chance of agreeing
this relationship that forms the basis of TopSig’s signature search       with the database signatures. This results in a binomial distribu-
approach.                                                                 tion of noise centred on half the number of unset bits affecting the
TopSig at the SIGIR’eCom 2018 Rakuten Data Challenge                                           SIGIR 2018, July 2018, Ann Arbor, Michigan USA

    1: for all word ∈ query do                                             1: MaxHD ← results[k − 1].distance
    2:query vector ← query vector + hash(word)                             2: for all result ∈ results do
 3: end for                                                                3:LQ ← | query.terms |
 4: for all component ∈ query vector do                                    4:LR ← | result.terms |
 5:   if query vector[component] ≤ 0 then                               5:   LI ← 2 |query.terms∧result.terms
                                                                                            n          o
                                                                                                              |
 6:      query sig[component] ← 0                                       6:   L ← LQLI min LQ , LR
                                                                                              LR LQ
 7:   else                                                              7:   HD ← result.distance
                                                                                                               
 8:      query sig[component] ← 1                                       8:   H ← exp 1.0 − H D−Max         HD
                                                                                                     128
 9:   end if
                                                                        9:   score[result.class] ← score[result.class] +L × H
10:   if query vector[component] , 0 then
                                                                       10: end for
11:      mask sig[component] ← 0
12:   else
                                                                       Figure 4: Pseudo-code algorithm for the scoring approach
13:      mask sig[component] ← 1
                                                                       applied to the top-k to generate scores for each class
14:   end if
15: end for
16: for all document sig ∈ collection do
                                                                       6        APPLYING TOPSIG TO THE CHALLENGE
17:   score[document] ← distance(document sig ∧ mask sig, query
      sig ∧ mask sig)                                                  To produce our attempt at this task, we first cleaned the data, replac-
18: end for                                                            ing all non-alphabetical characters with spaces, then used TopSig to
                                                                       generate 2048-bit binary signatures for every topic in the training
Figure 3: Pseudo-code algorithm for exhaustively searching             set. Then, to predict the categories of products in the test set, we
signatures (masked query←→object comparison)                           converted each test title to a 2048-bit signature with the same ap-
                                                                       proach. Following this, we performed an exhaustive search against
                                                                       the training signatures with the test signature using the masked
                                                                       search variant described in Figure 3, computing the Hamming dis-
                                                                       tance between the test signature and each training signature, and
Hamming distance of all comparisons, which can be a significant        recording the closest 10 signatures.
distortion if the query signature contains far fewer terms than the
database signature. Masking mitigates this effect by removing these    7        POST-PROCESSING
bits from the Hamming distance calculation entirely.                   While the initial signature search delivers the correct class in the
                                                                       majority (≥ 77%) of cases, we find that in certain cases the top
5        DATA CHALLENGE                                                result does not return the correct class, while the correct class is
                                                                       represented well in the remaining top-k results. To handle this
The Rakuten data challenge provides a data set of 1,000,000 product
                                                                       possibility, we include a post-processing step that is applied to re-
names from https://www.rakuten.com/, divided into a training set
                                                                       rank the top-k, pushing well-represented classes to the top if the
of 800,000 products with category labels intact and a test set of
                                                                       top result happens to be an outlier.
200,000 products with labels omitted. These labels are presented as
                                                                          During post-processing, we iterate over the top-k results and
a sequence of numbers, where each number represents a category
                                                                       accumulate a score for each class represented in that set based on
at a particular level on the Rakuten.com website. For instance, the
                                                                       two factors, L and H .
1208>546>4262>2775 category refers to the Food & Beverage →
Beverages → Wine → United States category of products:                          • L reflects the number of intersecting terms between the
                                                                                  query and each result.
         • 1208 → Food & Beverage                                               • H is derived from the difference between the Hamming dis-
         • 546 → Beverages                                                        tance of the current result and the last result in the top-k.
         • 4262 → Wine                                                 These two factors are then multiplied together to produce a score,
         • 2775 → United States                                        which is accumulated for each class represented in the top-k. The
The goal in this challenge is to predict the labels in the test set.   scoring algorithm is illustrated in Figure 4.
   While the labels do form a hierarchy in this fashion, we found         Finally, the class with the highest score is predicted as the out-
that the supercategories in many cases covered too broad a range       come for this particular search.
of items, making predicting them quite difficult. As a result, our
attempts at hierarchically categorising products by predicting the     8        EVALUATION
category at each level, then searching within the subset of results    Classification accuracy is measured by comparing the predicted
within that category to predict the next level was less accurate       category for each product with the actual category. Results are
than simply predicting the label directly. As a result, the approach   reported in terms of the following metrics:
we describe in the following section does not make use of the                   • precision
hierarchical structure of the categories, but instead predicts the              • recall
entire chain.                                                                   • F 1 -score
SIGIR 2018, July 2018, Ann Arbor, Michigan USA                                              Timothy Chappell, Shlomo Geva, and Lawrence Buckingham


Among these, the F 1 -score is the score ultimately used to compare
systems, although all values are reported for competing systems.

9     RESULTS
The approach described in this paper, consisting of a masked signa-
ture search followed by the post-processing described in section 7
achieved the following precision, recall and F 1 -scores on the Stage 1
leaderboard in the Rakuten Data Challenge:
     • Precision: 0.80
     • Recall: 0.80
     • F 1 -score: 0.80
   On the Stage 2 leaderboard, similar results were achieved:
     • Precision: 0.79
     • Recall: 0.80
     • F 1 -score: 0.79

10      CONCLUSION
We have presented our contribution to the SIGIR’eCom 2018 Rakuten
Data Challenge. The results show that, with minimal customisation
and only a small amount of post-processing, TopSig’s signature
generation and search capabilities can be used as an effective clas-
sification tool. Even a very basic signature search produces reason-
able results initially, with a little bit of post-processing capable of
pushing the classification accuracy even higher.

11      SOURCE CODE AND DATA
The TopSig signature generation and search tool we used for this
challenge is available from http://www.topsig.org and is licensed
under the GNU GPLv3. The data is available from the organisers of
the SIGIR’eCom 2018 Rakuten Data Challenge (https://sigir-ecom.
github.io/data-task.html).

12      ACKNOWLEDGEMENTS
We would like to acknowledge the computational resources pro-
vided by the Queensland University of Technology Science and
Engineering Faculty’s BigData Lab for the contribution these re-
sources played in our attempt at this challenge.

REFERENCES
[1] Timothy Chappell, Shlomo Geva, and James Hogan. 2017. K-Means Clustering of
    Biological Sequences. In Proceedings of the 22nd Australasian Document Computing
    Symposium. ACM, 2.
[2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. 1990.
    Indexing by latent semantic analysis. Journal of the American society for information
    science 41, 6 (1990), 391–407.
[3] Shlomo Geva and Christopher M De Vries. 2011. TopSig: topology preserving
    document signatures. In Proceedings of the 20th ACM International Conference on
    Information and Knowledge Management. ACM, 333–338.
[4] M. Sahlgren. 2005. An introduction to random indexing. In Methods and Ap-
    plications of Semantic Indexing Workshop at the 7th International Conference on
    Terminology and Knowledge Engineering, TKE, Vol. 5.
[5] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for
    automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
[6] Nanayakkara Wasam Uluwitige, Dinesha Chathurani, Shlomo Geva, Timothy
    Chappell, and Vinod Chandran. 2016. Efficient content-based image retrieval
    based on multi-feature fusion. (2016).
[7] J. Zobel, A. Moffat, and K. Ramamohanarao. 1998. Inverted files versus signature
    files for text indexing. ACM Transactions on Database Systems (TODS) 23, 4 (1998),
    453–490.