<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">TopSig at the SIGIR&apos;eCom 2018 Rakuten Data Challenge</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Timothy</forename><surname>Chappell</surname></persName>
							<email>timothy.chappell@qut.edu.au</email>
						</author>
						<author>
							<persName><forename type="first">Shlomo</forename><surname>Geva</surname></persName>
							<email>s.geva@qut.edu.au</email>
						</author>
						<author>
							<persName><forename type="first">Lawrence</forename><surname>Buckingham</surname></persName>
							<email>l.buckingham@qut.edu.au</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">School of Electrical Engineering and Computer Science</orgName>
								<orgName type="institution">Queensland University of Technology Brisbane</orgName>
								<address>
									<region>Queensland</region>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">School of Electrical Engineering and Computer Science</orgName>
								<orgName type="institution">Queensland University of Technology Brisbane</orgName>
								<address>
									<region>Queensland</region>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="department">School of Electrical Engineering and Computer Science</orgName>
								<orgName type="institution">Queensland University of Technology Brisbane</orgName>
								<address>
									<region>Queensland</region>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<address>
									<settlement>Ann Arbor</settlement>
									<region>Michigan</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff4">
								<address>
									<settlement>Ann Arbor</settlement>
									<region>Michigan</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">TopSig at the SIGIR&apos;eCom 2018 Rakuten Data Challenge</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">3BCBBA2D4975107D064FF0F132FC2F98</idno>
					<idno type="DOI">10.1145/nnnnnnn.nnnnnnn</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:31+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>Signatures, hamming distance</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For the SIGIR 2018 eCom workshop Rakuden Data Challenge we present an approach utilising random indexing to create toplogical signatures (TopSig) for the purposes of categorising the product names in the provided data sets. This paper describes the approach used to accomplish this.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Traditional text search engines are based on the bag-of-words approach, and are almost invariably implemented with inverted files <ref type="bibr" target="#b6">[7]</ref>. There are however some information retrieval tasks that are better handled by alternative techniques, such as locality sensitive hashing (LSH). These are often referred to as approximate nearest neighbour approaches and are common in the image or speech domains, where objects are typically represented by numerical quantities and where text based approaches are not naturally applicable. With LSH, the data can be handled in a more convenient way by projecting from the original feature space, where the natural representation of an object is inherently high dimensional with variable length, into a space with fixed dimensionality. It then becomes possible to use similarity between object representations as a proxy for calculation of original representation object similarity. The fixed dimensionality of the representation space combined with favourable properties of the metric in that space permit the implementation of highly efficient search and comparison algorithms.</p><p>Unlike text search engines, LSH based search engines compare entire object representations, rather than look for individual features within objects (key words). In this project we aim to use the LSH approach for searching the Rakuten items text collection, by taking the entire object comparison approach, treating each item description as an object. We then project all items into a lower dimensional metric space, where it can be efficiently processed.</p><p>This experiment is concerned with classification rather than retrieval. We use a conceptually simple approach:</p><p>• We first search the collection with the new product title and identify its nearest neighbours in the reference data set. • We then apply a simple post-search re-ranking of the top-k nearest neighbours, and assign to the new item the category of the highest ranked item from the top-k nearest.</p><p>In this paper we apply the Topological Signature (TopSig) approach to this challenge to produce the initial top-k list.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">TOPOLOGICAL SIGNATURES</head><p>The concept of performing similarity computations within a model in which documents exist as vectors is an old one <ref type="bibr" target="#b4">[5]</ref>, but still considered a fundamental of information retrieval theory today. The naturally high dimensionality of text of any level of complexity poses certain computational challenges; therefore it is desirable to reduce this dimensionality where possible. Singular value decomposition has been used successfully for this purpose <ref type="bibr" target="#b1">[2]</ref>, however it has also been shown that random indexing <ref type="bibr" target="#b3">[4]</ref> can achieve similar results with far less preprocessing time.</p><p>For the purposes of this research, we have made use of our opensource TopSig tool to generate dimensionality-reduced signatures from the product titles in the provided data sets and to search these signatures. The signature generation approach used by TopSig has been shown to be effective at ad-hoc document retrieval tasks <ref type="bibr" target="#b2">[3]</ref>, biological sequence clustering tasks <ref type="bibr" target="#b0">[1]</ref> and image retrieval tasks <ref type="bibr" target="#b5">[6]</ref> among others. The approach can be regarded as a variation on the LSH approach to approximate nearest neighbour search, and is further described in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">SIGNATURE GENERATION</head><p>The signature generation process involves transforming the plain text product titles into fixed length binary strings (signatures). Random indexing of text starts with the assignment of a fixed length random signature to each term in the collection. These term signatures can take the form of ternary vectors having components in the set [+1, 0, −1], or they can be normally-distributed random real valued unit vectors. To index a document, the signatures of all terms appearing in the document are added together, and the for all term ∈ document do end for 12: end for Figure <ref type="figure">1</ref>: Pseudo-code algorithm for generating signatures resulting vector is then squashed to yield a binary vector by taking the sign bit of each vector component. The randomly generated term signatures play a role analogous to the basis vectors of a projection into the fixed-dimension metric space. When combined they yield binary vectors with effectively random components which nonetheless reflect the term content of the originating document. This projection preserves topology in the sense that documents containing similar terms tend to have similar signatures. This is a special case of locality-sensitive hashing. Similarity calculations between signatures are based on the Hamming distance (number of differing bits between the signatures). By encoding and processing signatures with bitwise operations distance calculations can be extremely fast.</p><p>TopSig is used to generate signatures from the product titles provided with this challenge through the following process:</p><p>• The product titles are filtered, with non-alphabetical characters replaced with spaces. • The titles are tokenised along space boundaries into individual terms.</p><p>• The individual terms are checked against a general term 'stop list' consisting of common sentence particles ('a', 'the' etc.) and terms matching terms in the stop list are discarded. • The terms are stemmed-words ending in 's', 'es' or 'ies' have these endings removed. • The terms are case-folded to ensure that the same signatures are generated irrespective of the capitalisation of the terms. • Finally, the terms are hashed to signatures through random projection, then summed together to form a signature for the entire title, in the process depicted in Figure <ref type="figure">1</ref>.</p><p>The use of pseudo-random hashing in the signature generation process creates signatures in which two completely unrelated titles will tend to have approximately 50% of bits matching when their signatures are compared. The presence of co-occurring terms will increase the number of matching bits, while the presence of only co-occurrent terms will result in two precisely-matching signatures. In this way, the number of term co-occurrences between two titles can be approximated by comparing their signatures. It is this relationship that forms the basis of TopSig's signature search approach.</p><p>1: for all word ∈ query do 2: query vector ← query vector + hash(word) 3: end for 4: for all component ∈ query vector do </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SIGNATURE SEARCHING</head><p>Once the signatures for a collection have been created, the task of searching them is simple. When the collections are small, such as in the case of the Rakuten Data Challenge, the approach that leads to the lowest potential for error is to leverage the high speed at which Hamming distances between signatures can be computed and to exhaustively search the database collection for the query signature. The simplest manifestation of this approach is shown in Figure <ref type="figure" target="#fig_1">2</ref>, in which the query is first transformed into a signature, then the resulting signature is compared against every other signature in the database, accumulating a score for each one. These scores (lower means closer to the query signature) can then be used, depending on the needs of the particular domain, to determine the closest results and handle them in whatever way is needed.</p><p>One innovation introduced by the TopSig approach that produces more stable results when searching is to take advantage of the fact that we have access to the query at the time of searching, which means we can produce a mask that excludes the bits that are not part of the query signature (either positively or negatively).</p><p>The process for this is as follows:</p><p>• During searching, when the signature is being generated from the query terms, a second shadow signature (the mask signature) is generated as well. • Whenever a positive or negative value is stored in a term signature, the mask signature receives a 1 in that position. • Then, as the database is being exhaustively searched, during the Hamming distance calculation both the query and database signatures are masked against the mask signature with a bitwise AND operation, causing the bits that are not part of the mask to be excluded from the final result. This refined approach is described in Figure <ref type="figure" target="#fig_0">3</ref>. Note that, for this to feature any improvement over traditional searching, the term vectors produced by hashing need to be relatively sparse, with many components having a value of 0 and therefore not contributing to the overall signature.</p><p>Due to the way signatures are generated, bits that are not featured in the query signature will have a 50% chance of agreeing with the database signatures. This results in a binomial distribution of noise centred on half the number of unset bits affecting the 1: for all word ∈ query do 2: query vector ← query vector + hash(word) 3: end for 4: for all component ∈ query vector do Hamming distance of all comparisons, which can be a significant distortion if the query signature contains far fewer terms than the database signature. Masking mitigates this effect by removing these bits from the Hamming distance calculation entirely.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">DATA CHALLENGE</head><p>The Rakuten data challenge provides a data set of 1,000,000 product names from https://www.rakuten.com/, divided into a training set of 800,000 products with category labels intact and a test set of 200,000 products with labels omitted. These labels are presented as a sequence of numbers, where each number represents a category at a particular level on the Rakuten.com website. For instance, the 1208&gt;546&gt;4262&gt;2775 category refers to the Food &amp; Beverage → Beverages → Wine → United States category of products:</p><formula xml:id="formula_0">• 1208 → Food &amp; Beverage • 546 → Beverages • 4262 → Wine • 2775 → United States</formula><p>The goal in this challenge is to predict the labels in the test set.</p><p>While the labels do form a hierarchy in this fashion, we found that the supercategories in many cases covered too broad a range of items, making predicting them quite difficult. As a result, our attempts at hierarchically categorising products by predicting the category at each level, then searching within the subset of results within that category to predict the next level was less accurate than simply predicting the label directly. As a result, the approach we describe in the following section does not make use of the hierarchical structure of the categories, but instead predicts the entire chain. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">APPLYING TOPSIG TO THE CHALLENGE</head><p>To produce our attempt at this task, we first cleaned the data, replacing all non-alphabetical characters with spaces, then used TopSig to generate 2048-bit binary signatures for every topic in the training set. Then, to predict the categories of products in the test set, we converted each test title to a 2048-bit signature with the same approach. Following this, we performed an exhaustive search against the training signatures with the test signature using the masked search variant described in Figure <ref type="figure" target="#fig_0">3</ref>, computing the Hamming distance between the test signature and each training signature, and recording the closest 10 signatures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">POST-PROCESSING</head><p>While the initial signature search delivers the correct class in the majority (≥ 77%) of cases, we find that in certain cases the top result does not return the correct class, while the correct class is represented well in the remaining top-k results. To handle this possibility, we include a post-processing step that is applied to rerank the top-k, pushing well-represented classes to the top if the top result happens to be an outlier.</p><p>During post-processing, we iterate over the top-k results and accumulate a score for each class represented in that set based on two factors, L and H .</p><p>• L reflects the number of intersecting terms between the query and each result. • H is derived from the difference between the Hamming distance of the current result and the last result in the top-k.</p><p>These two factors are then multiplied together to produce a score, which is accumulated for each class represented in the top-k. The scoring algorithm is illustrated in Figure <ref type="figure">4</ref>. Finally, the class with the highest score is predicted as the outcome for this particular search.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">EVALUATION</head><p>Classification accuracy is measured by comparing the predicted category for each product with the actual category. Results are reported in terms of the following metrics:</p><formula xml:id="formula_1">• precision • recall • F 1 -score</formula></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>3 :</head><label>3</label><figDesc>document vector ← document vector + hash(term)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Pseudo-code algorithm for exhaustively searching signatures (object←→object comparison)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Pseudo-code algorithm for exhaustively searching signatures (masked query←→object comparison)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>1 : 4 : 5 :HFigure 4 :</head><label>1454</label><figDesc>Figure 4: Pseudo-code algorithm for the scoring approach applied to the top-k to generate scores for each class</figDesc></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="12">ACKNOWLEDGEMENTS</head><p>We would like to acknowledge the computational resources provided by the Queensland University of Technology Science and Engineering Faculty's BigData Lab for the contribution these resources played in our attempt at this challenge.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Among these, the F 1 -score is the score ultimately used to compare systems, although all values are reported for competing systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">RESULTS</head><p>The approach described in this paper, consisting of a masked signature search followed by the post-processing described in section 7 achieved the following precision, recall and F 1 -scores on the Stage 1 leaderboard in the Rakuten Data Challenge:</p><p>• Precision: 0.80 • Recall: 0.80</p><p>• F 1 -score: 0.80 On the Stage 2 leaderboard, similar results were achieved:</p><p>• Precision: 0.79</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10">CONCLUSION</head><p>We have presented our contribution to the SIGIR'eCom 2018 Rakuten Data Challenge. The results show that, with minimal customisation and only a small amount of post-processing, TopSig's signature generation and search capabilities can be used as an effective classification tool. Even a very basic signature search produces reasonable results initially, with a little bit of post-processing capable of pushing the classification accuracy even higher.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="11">SOURCE CODE AND DATA</head><p>The TopSig signature generation and search tool we used for this challenge is available from http://www.topsig.org and is licensed under the GNU GPLv3. The data is available from the organisers of the SIGIR'eCom 2018 Rakuten Data Challenge (https://sigir-ecom. github.io/data-task.html).</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">K-Means Clustering of Biological Sequences</title>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Chappell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shlomo</forename><surname>Geva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><surname>Hogan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd Australasian Document Computing Symposium</title>
				<meeting>the 22nd Australasian Document Computing Symposium</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Indexing by latent semantic analysis</title>
		<author>
			<persName><forename type="first">S</forename><surname>Deerwester</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">T</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">W</forename><surname>Furnas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">K</forename><surname>Landauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Harshman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the American society for information science</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="391" to="407" />
			<date type="published" when="1990">1990. 1990</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">TopSig: topology preserving document signatures</title>
		<author>
			<persName><forename type="first">Shlomo</forename><surname>Geva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christopher M De</forename><surname>Vries</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM International Conference on Information and Knowledge Management</title>
				<meeting>the 20th ACM International Conference on Information and Knowledge Management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="333" to="338" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">An introduction to random indexing</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sahlgren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering</title>
				<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">5</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A vector space model for automatic indexing</title>
		<author>
			<persName><forename type="first">Gerard</forename><surname>Salton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anita</forename><surname>Wong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chung-Shu</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Commun. ACM</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="page" from="613" to="620" />
			<date type="published" when="1975">1975. 1975</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Efficient content-based image retrieval based on multi-feature fusion</title>
		<author>
			<persName><forename type="first">Nanayakkara</forename><surname>Wasam Uluwitige</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dinesha</forename><surname>Chathurani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shlomo</forename><surname>Geva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Timothy</forename><surname>Chappell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Vinod</forename><surname>Chandran</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016. 2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Inverted files versus signature files for text indexing</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zobel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moffat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Ramamohanarao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Database Systems (TODS)</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="453" to="490" />
			<date type="published" when="1998">1998. 1998</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
