=Paper=
{{Paper
|id=Vol-1653/paper_21
|storemode=property
|title=Fast and Compact Hamming Distance index
|pdfUrl=https://ceur-ws.org/Vol-1653/paper_21.pdf
|volume=Vol-1653
|authors=Simon Gog,Rossano Venturini
|dblpUrl=https://dblp.org/rec/conf/iir/GogV16
}}
==Fast and Compact Hamming Distance index==
Fast and Compact Hamming Distance Index Simon Gog1 and Rossano Venturini2 1 Institute of Theoretical Informatics, Karlsruhe Institute of Technology 2 Department of Computer Science, University of Pisa and Istella Srl Abstract. This paper proposes new solutions for the approximate dic- tionary queries problem. These solutions combine the use of succinct data structures with an efficient representation of the keys to significantly re- duce the space usage of the state-of-the-art solutions without introducing any time penalty. Finally, by exploiting triangle inequality, we can also significantly speed up the query time of the existing solutions. Indexing large collections of objects for similarity search is a core task of many applications in databases, pattern recognition, and information retrieval (IR) (e.g., near-duplicate detection in a collection of web pages or content-based re- trieval in a collection of images). In a general framework, each object is modeled with a vector of features and the similarity of two objects is measured by means of a similarity function, e.g., the Jaccard and the cosine similarity, on their vec- tors. Sketching techniques are used to reduce the dimensionality of these objects by producing succinct sketches of the vectors, such that the similarity of objects can be estimated by computing the Hamming distance of their sketches [1]. The Approximate Dictionary Queries problem is formulated as follows. We are given a dictionary set of binary keys, a k-query Q asks for identifying all the keys in D which are at Hamming distance at most k from Q. While there exist several theoretical solutions which provide guarantees on their space and time complex- ities (see e.g., [2] and references therein), the most efficient solutions in practice [6, 5] are based on the multi-index approach proposed by Greene et al. [4]. In this extended abstract we shortly summarizes the work in [3]. This paper improves the performance of any multi-index based solution by introducing the use of succinct data structures together with a suitable representation of the keys, SIMD-based optimizations, and the use of the triangle inequality. An empirical evaluation shows that the time improvement of the proposed solutions is by a factor up to 3.4 while the space usage reduction ranges between 22% and 40%. References 1. M. Charikar. Similarity estimation techniques from rounding algorithms. In Proc. STOC, pages 380–388, 2002. 2. R. Cole, L.-A. Gottlieb, and M. Lewenstein. Dictionary matching and indexing with errors and don’t cares. In Proc. STOC, pages 91–100, 2004. 3. S. Gog and R. Venturini. Fast and compact Hamming distance index. In SIGIR, 2016. 4. D. H. Greene, M. Parnas, and F. F. Yao. Multi-index hashing for information retrieval. In Proc. FOCS, pages 722–731, 1994. 5. A. X. Liu, K. Shen, and E. Torng. Large scale hamming distance query processing. In Proc. ICDE, pages 553–564, 2011. 6. G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, pages 141–150, 2007.