=Paper=
{{Paper
|id=Vol-2319/ecom18DC_paper_11
|storemode=property
|title=TopSig at the SIGIR'eCom 2018 Rakuten Data Challenge
|pdfUrl=https://ceur-ws.org/Vol-2319/ecom18DC_paper_11.pdf
|volume=Vol-2319
|authors=Timothy Chappell,Shlomo Geva,Lawrence Buckingham
|dblpUrl=https://dblp.org/rec/conf/sigir/ChappellGB18
}}
==TopSig at the SIGIR'eCom 2018 Rakuten Data Challenge==
TopSig at the SIGIR’eCom 2018 Rakuten Data Challenge
Timothy Chappell Shlomo Geva Lawrence Buckingham
School of Electrical Engineering and School of Electrical Engineering and School of Electrical Engineering and
Computer Science, Queensland Computer Science, Queensland Computer Science, Queensland
University of Technology University of Technology University of Technology
Brisbane, Queensland Brisbane, Queensland Brisbane, Queensland
timothy.chappell@qut.edu.au s.geva@qut.edu.au l.buckingham@qut.edu.au
ABSTRACT Unlike text search engines, LSH based search engines compare en-
For the SIGIR 2018 eCom workshop Rakuden Data Challenge we tire object representations, rather than look for individual features
present an approach utilising random indexing to create toplogical within objects (key words). In this project we aim to use the LSH
signatures (TopSig) for the purposes of categorising the product approach for searching the Rakuten items text collection, by taking
names in the provided data sets. This paper describes the approach the entire object comparison approach, treating each item descrip-
used to accomplish this. tion as an object. We then project all items into a lower dimensional
metric space, where it can be efficiently processed.
CCS CONCEPTS This experiment is concerned with classification rather than
retrieval. We use a conceptually simple approach:
• Information systems → Clustering and classification;
• We first search the collection with the new product title and
KEYWORDS identify its nearest neighbours in the reference data set.
• We then apply a simple post-search re-ranking of the top-k
Signatures, hamming distance
nearest neighbours, and assign to the new item the category
ACM Reference Format: of the highest ranked item from the top-k nearest.
Timothy Chappell, Shlomo Geva, and Lawrence Buckingham. 2018. TopSig
at the SIGIR’eCom 2018 Rakuten Data Challenge. In Proceedings of the
In this paper we apply the Topological Signature (TopSig) ap-
41st International ACM SIGIR Conference on Research and Development in proach to this challenge to produce the initial top-k list.
Information Retrieval (SIGIR 2018). ACM, New York, NY, USA, 4 pages.
https://doi.org/10.1145/nnnnnnn.nnnnnnn 2 TOPOLOGICAL SIGNATURES
The concept of performing similarity computations within a model
1 INTRODUCTION in which documents exist as vectors is an old one [5], but still con-
Traditional text search engines are based on the bag-of-words sidered a fundamental of information retrieval theory today. The
approach, and are almost invariably implemented with inverted naturally high dimensionality of text of any level of complexity
files [7]. There are however some information retrieval tasks that poses certain computational challenges; therefore it is desirable to
are better handled by alternative techniques, such as locality sen- reduce this dimensionality where possible. Singular value decom-
sitive hashing (LSH). These are often referred to as approximate position has been used successfully for this purpose [2], however it
nearest neighbour approaches and are common in the image or has also been shown that random indexing [4] can achieve similar
speech domains, where objects are typically represented by numer- results with far less preprocessing time.
ical quantities and where text based approaches are not naturally For the purposes of this research, we have made use of our open-
applicable. With LSH, the data can be handled in a more conve- source TopSig tool to generate dimensionality-reduced signatures
nient way by projecting from the original feature space, where the from the product titles in the provided data sets and to search these
natural representation of an object is inherently high dimensional signatures. The signature generation approach used by TopSig has
with variable length, into a space with fixed dimensionality. It then been shown to be effective at ad-hoc document retrieval tasks [3],
becomes possible to use similarity between object representations biological sequence clustering tasks [1] and image retrieval tasks [6]
as a proxy for calculation of original representation object similar- among others. The approach can be regarded as a variation on the
ity. The fixed dimensionality of the representation space combined LSH approach to approximate nearest neighbour search, and is
with favourable properties of the metric in that space permit the im- further described in the following section.
plementation of highly efficient search and comparison algorithms.
3 SIGNATURE GENERATION
Permission to make digital or hard copies of all or part of this work for personal or
Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
classroom use is granted withoutS.fee providedM.that copies areLin,
notA.
made or distributed
The signature generation process involves transforming the plain
In: J. Degenhardt, G. Di Fabbrizio, Kallumadi, Kumar, Y.-C. Trotman, H. Zhao
for profit
(eds.): or commercial
Proceedings advantage
of the SIGIR andworkshop,
2018 eCom that copies
12 bear this notice
July, 2018, and Michigan,
Ann Arbor, the full citation
USA, text product titles into fixed length binary strings (signatures). Ran-
published at http://ceur-ws.org
on the first page. Copyrights for components of this work owned by others than ACM dom indexing of text starts with the assignment of a fixed length
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a random signature to each term in the collection. These term sig-
fee. Request permissions from permissions@acm.org. natures can take the form of ternary vectors having components
SIGIR 2018, July 2018, Ann Arbor, Michigan USA in the set [+1, 0, −1], or they can be normally-distributed random
© 2018 Association for Computing Machinery.
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM. . . $15.00 real valued unit vectors. To index a document, the signatures of
https://doi.org/10.1145/nnnnnnn.nnnnnnn all terms appearing in the document are added together, and the
SIGIR 2018, July 2018, Ann Arbor, Michigan USA Timothy Chappell, Shlomo Geva, and Lawrence Buckingham
1: for all document ∈ collection do 1: for all word ∈ query do
2: for all term ∈ document do 2: query vector ← query vector + hash(word)
3: document vector ← document vector + hash(term) 3: end for
4: end for 4: for all component ∈ query vector do
5: for all component ∈ document vector do 5:if query vector[component] ≤ 0 then
6: if document vector[component] ≤ 0 then 6: query sig[component] ← 0
7: document signature[component] ← 0 7: else
8: else 8: query sig[component] ← 1
9: document signature[component] ← 1 9: end if
10: end if 10: end for
11: end for 11: for all document sig ∈ collection do
12: end for 12: score[document] ← distance(document sig, query sig)
13: end for
Figure 1: Pseudo-code algorithm for generating signatures
Figure 2: Pseudo-code algorithm for exhaustively searching
signatures (object←→object comparison)
resulting vector is then squashed to yield a binary vector by taking
the sign bit of each vector component. The randomly generated 4 SIGNATURE SEARCHING
term signatures play a role analogous to the basis vectors of a pro- Once the signatures for a collection have been created, the task of
jection into the fixed-dimension metric space. When combined they searching them is simple. When the collections are small, such as in
yield binary vectors with effectively random components which the case of the Rakuten Data Challenge, the approach that leads to
nonetheless reflect the term content of the originating document. the lowest potential for error is to leverage the high speed at which
This projection preserves topology in the sense that documents Hamming distances between signatures can be computed and to
containing similar terms tend to have similar signatures. This is exhaustively search the database collection for the query signature.
a special case of locality-sensitive hashing. Similarity calculations The simplest manifestation of this approach is shown in Figure 2,
between signatures are based on the Hamming distance (number of in which the query is first transformed into a signature, then the
differing bits between the signatures). By encoding and processing resulting signature is compared against every other signature in the
signatures with bitwise operations distance calculations can be database, accumulating a score for each one. These scores (lower
extremely fast. means closer to the query signature) can then be used, depending
TopSig is used to generate signatures from the product titles on the needs of the particular domain, to determine the closest
provided with this challenge through the following process: results and handle them in whatever way is needed.
One innovation introduced by the TopSig approach that produces
• The product titles are filtered, with non-alphabetical charac- more stable results when searching is to take advantage of the fact
ters replaced with spaces. that we have access to the query at the time of searching, which
• The titles are tokenised along space boundaries into individ- means we can produce a mask that excludes the bits that are not
ual terms. part of the query signature (either positively or negatively).
• The individual terms are checked against a general term The process for this is as follows:
‘stop list’ consisting of common sentence particles (‘a’, ‘the’
• During searching, when the signature is being generated
etc.) and terms matching terms in the stop list are discarded.
from the query terms, a second shadow signature (the mask
• The terms are stemmed- words ending in ‘s’, ‘es’ or ‘ies’ have
signature) is generated as well.
these endings removed.
• Whenever a positive or negative value is stored in a term
• The terms are case-folded to ensure that the same signatures
signature, the mask signature receives a 1 in that position.
are generated irrespective of the capitalisation of the terms.
• Then, as the database is being exhaustively searched, during
• Finally, the terms are hashed to signatures through random
the Hamming distance calculation both the query and data-
projection, then summed together to form a signature for
base signatures are masked against the mask signature with
the entire title, in the process depicted in Figure 1.
a bitwise AND operation, causing the bits that are not part
The use of pseudo-random hashing in the signature generation of the mask to be excluded from the final result.
process creates signatures in which two completely unrelated ti- This refined approach is described in Figure 3. Note that, for this
tles will tend to have approximately 50% of bits matching when to feature any improvement over traditional searching, the term
their signatures are compared. The presence of co-occurring terms vectors produced by hashing need to be relatively sparse, with many
will increase the number of matching bits, while the presence of components having a value of 0 and therefore not contributing to
only co-occurrent terms will result in two precisely-matching sig- the overall signature.
natures. In this way, the number of term co-occurrences between Due to the way signatures are generated, bits that are not fea-
two titles can be approximated by comparing their signatures. It is tured in the query signature will have a 50% chance of agreeing
this relationship that forms the basis of TopSig’s signature search with the database signatures. This results in a binomial distribu-
approach. tion of noise centred on half the number of unset bits affecting the
TopSig at the SIGIR’eCom 2018 Rakuten Data Challenge SIGIR 2018, July 2018, Ann Arbor, Michigan USA
1: for all word ∈ query do 1: MaxHD ← results[k − 1].distance
2:query vector ← query vector + hash(word) 2: for all result ∈ results do
3: end for 3:LQ ← | query.terms |
4: for all component ∈ query vector do 4:LR ← | result.terms |
5: if query vector[component] ≤ 0 then 5: LI ← 2 |query.terms∧result.terms
n o
|
6: query sig[component] ← 0 6: L ← LQLI min LQ , LR
LR LQ
7: else 7: HD ← result.distance
8: query sig[component] ← 1 8: H ← exp 1.0 − H D−Max HD
128
9: end if
9: score[result.class] ← score[result.class] +L × H
10: if query vector[component] , 0 then
10: end for
11: mask sig[component] ← 0
12: else
Figure 4: Pseudo-code algorithm for the scoring approach
13: mask sig[component] ← 1
applied to the top-k to generate scores for each class
14: end if
15: end for
16: for all document sig ∈ collection do
6 APPLYING TOPSIG TO THE CHALLENGE
17: score[document] ← distance(document sig ∧ mask sig, query
sig ∧ mask sig) To produce our attempt at this task, we first cleaned the data, replac-
18: end for ing all non-alphabetical characters with spaces, then used TopSig to
generate 2048-bit binary signatures for every topic in the training
Figure 3: Pseudo-code algorithm for exhaustively searching set. Then, to predict the categories of products in the test set, we
signatures (masked query←→object comparison) converted each test title to a 2048-bit signature with the same ap-
proach. Following this, we performed an exhaustive search against
the training signatures with the test signature using the masked
search variant described in Figure 3, computing the Hamming dis-
tance between the test signature and each training signature, and
Hamming distance of all comparisons, which can be a significant recording the closest 10 signatures.
distortion if the query signature contains far fewer terms than the
database signature. Masking mitigates this effect by removing these 7 POST-PROCESSING
bits from the Hamming distance calculation entirely. While the initial signature search delivers the correct class in the
majority (≥ 77%) of cases, we find that in certain cases the top
5 DATA CHALLENGE result does not return the correct class, while the correct class is
represented well in the remaining top-k results. To handle this
The Rakuten data challenge provides a data set of 1,000,000 product
possibility, we include a post-processing step that is applied to re-
names from https://www.rakuten.com/, divided into a training set
rank the top-k, pushing well-represented classes to the top if the
of 800,000 products with category labels intact and a test set of
top result happens to be an outlier.
200,000 products with labels omitted. These labels are presented as
During post-processing, we iterate over the top-k results and
a sequence of numbers, where each number represents a category
accumulate a score for each class represented in that set based on
at a particular level on the Rakuten.com website. For instance, the
two factors, L and H .
1208>546>4262>2775 category refers to the Food & Beverage →
Beverages → Wine → United States category of products: • L reflects the number of intersecting terms between the
query and each result.
• 1208 → Food & Beverage • H is derived from the difference between the Hamming dis-
• 546 → Beverages tance of the current result and the last result in the top-k.
• 4262 → Wine These two factors are then multiplied together to produce a score,
• 2775 → United States which is accumulated for each class represented in the top-k. The
The goal in this challenge is to predict the labels in the test set. scoring algorithm is illustrated in Figure 4.
While the labels do form a hierarchy in this fashion, we found Finally, the class with the highest score is predicted as the out-
that the supercategories in many cases covered too broad a range come for this particular search.
of items, making predicting them quite difficult. As a result, our
attempts at hierarchically categorising products by predicting the 8 EVALUATION
category at each level, then searching within the subset of results Classification accuracy is measured by comparing the predicted
within that category to predict the next level was less accurate category for each product with the actual category. Results are
than simply predicting the label directly. As a result, the approach reported in terms of the following metrics:
we describe in the following section does not make use of the • precision
hierarchical structure of the categories, but instead predicts the • recall
entire chain. • F 1 -score
SIGIR 2018, July 2018, Ann Arbor, Michigan USA Timothy Chappell, Shlomo Geva, and Lawrence Buckingham
Among these, the F 1 -score is the score ultimately used to compare
systems, although all values are reported for competing systems.
9 RESULTS
The approach described in this paper, consisting of a masked signa-
ture search followed by the post-processing described in section 7
achieved the following precision, recall and F 1 -scores on the Stage 1
leaderboard in the Rakuten Data Challenge:
• Precision: 0.80
• Recall: 0.80
• F 1 -score: 0.80
On the Stage 2 leaderboard, similar results were achieved:
• Precision: 0.79
• Recall: 0.80
• F 1 -score: 0.79
10 CONCLUSION
We have presented our contribution to the SIGIR’eCom 2018 Rakuten
Data Challenge. The results show that, with minimal customisation
and only a small amount of post-processing, TopSig’s signature
generation and search capabilities can be used as an effective clas-
sification tool. Even a very basic signature search produces reason-
able results initially, with a little bit of post-processing capable of
pushing the classification accuracy even higher.
11 SOURCE CODE AND DATA
The TopSig signature generation and search tool we used for this
challenge is available from http://www.topsig.org and is licensed
under the GNU GPLv3. The data is available from the organisers of
the SIGIR’eCom 2018 Rakuten Data Challenge (https://sigir-ecom.
github.io/data-task.html).
12 ACKNOWLEDGEMENTS
We would like to acknowledge the computational resources pro-
vided by the Queensland University of Technology Science and
Engineering Faculty’s BigData Lab for the contribution these re-
sources played in our attempt at this challenge.
REFERENCES
[1] Timothy Chappell, Shlomo Geva, and James Hogan. 2017. K-Means Clustering of
Biological Sequences. In Proceedings of the 22nd Australasian Document Computing
Symposium. ACM, 2.
[2] S. Deerwester, S.T. Dumais, G.W. Furnas, T.K. Landauer, and R. Harshman. 1990.
Indexing by latent semantic analysis. Journal of the American society for information
science 41, 6 (1990), 391–407.
[3] Shlomo Geva and Christopher M De Vries. 2011. TopSig: topology preserving
document signatures. In Proceedings of the 20th ACM International Conference on
Information and Knowledge Management. ACM, 333–338.
[4] M. Sahlgren. 2005. An introduction to random indexing. In Methods and Ap-
plications of Semantic Indexing Workshop at the 7th International Conference on
Terminology and Knowledge Engineering, TKE, Vol. 5.
[5] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model for
automatic indexing. Commun. ACM 18, 11 (1975), 613–620.
[6] Nanayakkara Wasam Uluwitige, Dinesha Chathurani, Shlomo Geva, Timothy
Chappell, and Vinod Chandran. 2016. Efficient content-based image retrieval
based on multi-feature fusion. (2016).
[7] J. Zobel, A. Moffat, and K. Ramamohanarao. 1998. Inverted files versus signature
files for text indexing. ACM Transactions on Database Systems (TODS) 23, 4 (1998),
453–490.