8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10)

RankReduce – Processing K-Nearest Neighbor Queries on Top of MapReduce∗

Aleksandar Stupar, Sebastian Michel, Ralf Schenkel
Saarland University, Saarbrücken, Germany
astupar@mmci.uni-saarland.de, smichel@mmci.uni-saarland.de, schenkel@mmci.uni-saarland.de

ABSTRACT
We consider the problem of processing K-Nearest Neighbor (KNN) queries over large datasets where the index is jointly maintained by a set of machines in a computing cluster. The proposed RankReduce approach uses locality sensitive hashing (LSH) together with a MapReduce implementation, which by design is a perfect match, as the hashing principle of LSH can be smoothly integrated into the mapping phase of MapReduce. The LSH algorithm assigns similar objects to the same fragments in the distributed file system, which enables an effective selection of potential candidate neighbors that are then reduced to the set of K-Nearest Neighbors. We address problems arising from the different characteristics of MapReduce and LSH in order to achieve an efficient search process on the one hand and high LSH accuracy on the other hand. We discuss several pitfalls and describe in detail how to circumvent them. We evaluate RankReduce using both synthetic data and a dataset obtained from Flickr.com, demonstrating the suitability of the approach.

1. INTRODUCTION
With the success of the Web 2.0 and the widespread usage of cell phones and digital cameras, millions of pictures are being taken and uploaded to portals like Facebook or Flickr every day¹, accumulating to billions of images². Searching in these huge amounts of images becomes a challenging task. While there is an increasing trend to use social annotations, so-called tags, for image retrieval, next to the traditional image search à la Google/Bing/Yahoo which inspects the text around the web site holding the picture, there is a vital need to process similarity queries, where for a given query picture the K most similar pictures are returned based on low level features such as color, texture, and shape [9]. The big advantage of such low level features is that they are always available, whereas tags are usually extremely scarce and text around images can often be misleading. Approaches like the work by Taneva et al. [24] use both textual and low level image descriptors to increase the diversity of returned query results. There exist plenty of fundamental prior works [16, 5, 4] on how to index feature based representations of pictures or, more generally, high dimensional vectors in a way that allows for inspecting only a small subset of all vectors to find the most similar ones.

The increasing volume of high dimensional data, however, poses novel problems to traditional indexing mechanisms, which usually assume an in-memory index or optimize for local disk access. As a promising approach to process huge amounts of data on a multitude of machines in a cluster, MapReduce [10] has been proposed and continuously explored for many interesting application classes. In this paper, we investigate the usage of MapReduce for searching in high dimensional data. Although MapReduce was initially described in a generic and rather imprecise way in terms of implementation, implementations like Apache's Hadoop have proven to provide salient properties such as scalability, ease of use, and most notably robustness to node failures. This provides an excellent base to explore MapReduce for its suitability for large scale management of high dimensional data.
In this work, we propose RankReduce, an approach to implement locality sensitive hashing (LSH) [1, 8, 16], an established method for similarity search on high dimensional data, on top of the highly reliable and scalable MapReduce infrastructure. While this may seem to be straightforward at first glance, it poses interesting challenges to the integration: we face different characteristics of MapReduce and LSH which need to be harnessed at the same time to achieve both high accuracy and good performance. As MapReduce is usually used only to process large amounts of data in an offline fashion and not for query processing, we carefully investigate its suitability to handle user defined queries effectively, demonstrating interesting insights on how to tune LSH on top of MapReduce.

The remainder of the paper is structured as follows. Section 2 gives an overview of related work, Section 3 presents our framework and gives a brief introduction to LSH and MapReduce, Section 4 describes the way queries are processed, Section 5 presents an experimental evaluation, and Section 6 concludes the paper and gives an outlook on ongoing work.

∗This work has been supported by the Cluster of Excellence "Multimodal Computing and Interaction" (MMCI).
¹http://blog.facebook.com/blog.php?post=2406207130
²http://blog.flickr.net/en/2009/10/12/4000000000/

Copyright © 2010 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. LSDS-IR Workshop, July 2010. Geneva, Switzerland.

2. RELATED WORK
Processing K-Nearest Neighbor queries in high dimensional data has received a lot of attention by researchers in recent years. When the dimensionality increases, the relative difference between the distance to the closest and to the farthest neighbor shrinks rapidly for most datasets [6], which is known as the 'curse of dimensionality'. This problem has a direct impact on exact KNN query processing based on tree structures, such as the X-Tree [5] and the K-D tree [4], rendering these approaches applicable only to a rather small number of dimensions. A better suited approach for KNN processing in high dimensions is Locality Sensitive Hashing (LSH) [1, 8, 16]. It is based on the application of locality preserving hash functions which map, with high probability, close points from the high dimensional space to the same hash value (i.e., hash bucket). Being an approximate method, the performance of LSH highly depends on accurate parameter tuning [12, 3]. Work has also been done on decreasing the number of hash tables used for LSH, while preserving the precision, by probing multiple buckets per hash table [20].

[Figure 1: The RankReduce Framework]

In recent years, a number of distributed solutions have been proposed whose main emphasis was put on exploring loosely coupled distributed systems in the form of Peer-to-Peer (P2P) networks, such as [11, 13, 14, 17, 23]; cf. the work by Batko et al. [2] for a discussion of the suitability of different P2P approaches to distributed similarity search.

MapReduce is a framework for efficient and fault tolerant workload distribution in large clusters [10]. The motivation behind the design and development of MapReduce comes from Information Retrieval with its many computationally expensive, but embarrassingly parallel problems on large datasets. One of the most basic of those problems is inverted index construction, described in [21]. MapReduce has not yet been utilized for distributed processing of KNN queries. Some similarities with KNN processing can be found in recent work by Vernica et al. [22], which describes several approaches for computing set similarities on textual documents, but it does not address the issue of KNN query processing. There, the pairwise similarity is calculated only for documents with the same prefixes (prefix filtering), which can be considered as the LSH min-hashing technique. Lin [19] describes a MapReduce based implementation of pairwise similarity comparisons of text documents based on an inverted index.
3. RANKREDUCE FRAMEWORK
We address the problem of processing K-Nearest Neighbor queries in large datasets by implementing a distributed LSH based index within the MapReduce framework.

An LSH based index uses locality sensitive hash functions for indexing data. The salient property of these functions is that they map, with high probability, similar objects (represented in the d-dimensional vector space) to the same hash bucket, i.e., related objects are more likely to receive the same hash value than distant ones. The actual indexing builds several hash tables with different LSH functions to increase the probability of collision for close points. At query time, the KNN search is performed by hashing the query point to one bucket per hash table and then ranking all objects discovered in any of these buckets by their distance to the query point. The closest K points are returned as the final result.

In this work, we consider the family of LSH functions based on p-stable distributions [8], which are most suitable for l_p norms. In this case, for each data point v, the hashing scheme considers k independent hash functions of the form

    h_{a,B}(v) = \left\lfloor \frac{a \cdot v + B}{W} \right\rfloor    (1)

where a is a d-dimensional vector whose elements are chosen independently from a p-stable distribution, W \in \mathbb{R}, and B is chosen uniformly from [0, W]. Each hash function maps a d-dimensional data point onto the set of integers. With k such hash functions, the final result is a vector of length k of the form g(v) = (h_{a_1,B_1}(v), ..., h_{a_k,B_k}(v)).
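To make the hashing scheme concrete, the sketch below shows how the hash vector g(v) from Equation (1) could be computed in Java, the language of the Hadoop ecosystem our implementation builds on. The class name, the field layout, and the use of java.util.Random to draw Gaussian (2-stable, i.e., Euclidean distance) entries for a are illustrative assumptions rather than the exact code of RankReduce.

```java
import java.util.Random;

/** Sketch of one hash table's LSH function g(v) = (h_{a_1,B_1}(v), ..., h_{a_k,B_k}(v)),
 *  following Equation (1); assumes the Euclidean (l2) case, i.e. a 2-stable (Gaussian) distribution. */
public class LshFunction {
    private final double[][] a;   // k random projection vectors, each of dimension d
    private final double[] b;     // k offsets, drawn uniformly from [0, W]
    private final double w;       // bucket width W

    public LshFunction(int k, int d, double w, long seed) {
        this.w = w;
        this.a = new double[k][d];
        this.b = new double[k];
        Random rnd = new Random(seed);
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                a[i][j] = rnd.nextGaussian();   // entries of a ~ N(0,1), a 2-stable distribution
            }
            b[i] = rnd.nextDouble() * w;        // B chosen uniformly from [0, W]
        }
    }

    /** Computes h_{a_i,B_i}(v) = floor((a_i . v + B_i) / W) for i = 1..k. */
    public int[] hash(double[] v) {
        int[] g = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += a[i][j] * v[j];
            }
            g[i] = (int) Math.floor((dot + b[i]) / w);
        }
        return g;
    }
}
```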
In order to achieve high search accuracy, multiple hash tables need to be constructed. The work by Lv et al. [20] presents an approach to probe multiple buckets per hash table which, however, leads either to sampling a larger fraction of the dataset or to many fine grained accesses to small buckets. The latter causes a larger number of expensive random accesses to the underlying infrastructure, as we deal with file based indexes as opposed to in-memory accesses. Hence, we opted for using a single probe per hash table.

For maintaining the hash tables over a set of machines in a cluster, we employ MapReduce [10], which is designed for large scale parallel data processing. It is built on top of the distributed file system [15], which enables distributing the data over the cluster machines in a scalable and fault tolerant way. This tight integration of MapReduce with the distributed file system allows computation to be moved to where the data resides, eliminating network bandwidth bottlenecks caused by data shipping during query processing. Our implementation uses the open source software Hadoop³, maintained by the Apache Foundation, which provides a Java based implementation of both the MapReduce framework and the distributed file system (coined HDFS for Hadoop Distributed File System). In recent years, Hadoop has gained a lot of popularity in the open source community and is also part of many research efforts investigating large data processing.

³http://hadoop.apache.org/

MapReduce is a fairly simple programming model, based on two developer supplied functions: Map and Reduce. Both functions operate on key-value pairs. The Map function receives a key-value pair as input and emits multiple (or no) key-value pairs as output. The output from all Map functions is grouped by key, and for each such key all values are fed to the Reduce function, which then produces the final output from these values.

In the Hadoop implementation, the input data is grouped into so-called input splits (which often correspond to blocks in the distributed file system), and a number of so-called mapper processes call the Map function for each key-value pair in such an input split. Several mappers can run concurrently on each node in the cluster, and the mapper processes are in addition distributed over all nodes in the cluster. Ideally, a mapper is run on the same node where the input block resides, but this is not always possible due to workload imbalance. Similarly, after all mappers have finished, dedicated reducer processes are run on nodes in the cluster. Each reducer handles a fraction of the output key space, copies the corresponding key-value pairs from all mappers' outputs (in the so-called shuffle phase), sorts them by key, and feeds them to the Reduce function. The output of the reducers is usually considered the final result but can also be used as input for subsequent MapReduce jobs.
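The following skeleton illustrates this programming model in Hadoop's org.apache.hadoop.mapreduce API (introduced with version 0.20, the version we use). The concrete key and value types and the simple counting logic are placeholders chosen for illustration; they are not part of RankReduce itself.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map: one input key-value pair in, zero or more intermediate pairs out. */
class ExampleMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit any number of (key, value) pairs; the framework groups them by key
        context.write(key, new IntWritable(1));
    }
}

/** Reduce: all values sharing a key are handed over together after the shuffle and sort phase. */
class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // aggregate the grouped values and produce the final output
        }
        context.write(key, new IntWritable(sum));
    }
}
```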
Figure 1 shows an illustration of our LSH integration with MapReduce. Each hash table in the LSH index is mapped to one folder in HDFS. For each bucket in such a hash table, a corresponding file is created in this folder, where the file name is created by concatenating the hash values into a string, with ' ' as separator. This mapping of buckets to HDFS files enables fast lookup at query time and ensures that only data that is to be probed is read from HDFS. Placing the bucket in one file also enables block based sequential access to all vectors in one bucket, which is very important as the MapReduce framework is optimized for such block based rather than random access processing. Each of the buckets stores the complete feature vectors of all objects mapped to this bucket in a binary encoding.
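A possible way to derive the HDFS location of a bucket from a hash vector, following the folder-per-hash-table and file-per-bucket layout described above, is sketched below. The root directory name and the separator character used here are our own illustrative choices, not the exact ones of RankReduce.

```java
import org.apache.hadoop.fs.Path;

/** Sketch: derives the HDFS file of a bucket from a hash vector g(v).
 *  Layout as described above: one folder per hash table, one file per bucket,
 *  file name = concatenated hash values. */
public class BucketPaths {
    private static final char SEPARATOR = '_';   // assumed separator between hash values
    private final Path indexRoot;                // e.g. /lsh-index (illustrative root directory)

    public BucketPaths(Path indexRoot) {
        this.indexRoot = indexRoot;
    }

    public Path bucketFile(int hashTableId, int[] hashVector) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < hashVector.length; i++) {
            if (i > 0) {
                name.append(SEPARATOR);          // concatenate the k hash values into one string
            }
            name.append(hashVector[i]);
        }
        // one folder per hash table, one file per bucket inside that folder
        return new Path(new Path(indexRoot, "table-" + hashTableId), name.toString());
    }
}
```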
Indexing of new feature vectors into the LSH index in HDFS is easily done by appending them to the end of the appropriate bucket file. This can also be done in parallel with query processing as long as different buckets are affected; as HDFS does not include a transaction mechanism, appending entries to buckets that are currently being queried would be possible, but with unclear semantics for the running queries. As HDFS scales well with increasing cluster size, the resulting growth of the LSH index can easily be supported by adding more machines to the cluster.

While an LSH index stored in memory has no limitation on the number of buckets, too many files in HDFS can degrade its performance, especially if these files are much smaller than the block size (which defaults to 64MB). The number of buckets, and therefore the number of files in HDFS for the LSH index, is highly dependent on the setup of the LSH parameters, as choosing a bad combination of parameters can result in a large number of small files.

Inspired by in-memory indexes which can keep references from buckets to materialized feature vectors, we considered storing only feature vector ids in the buckets instead of the actual feature vectors, and retrieving the full vectors only on demand at query time. However, this approach would result in poor performance due to many random accesses to HDFS when retrieving the full vectors, so we decided to store complete feature vectors. This also needs to be taken into account when setting up the LSH parameters, as too many LSH hash tables can dramatically increase the index size, since each feature vector is materialized once per hash table.

4. QUERY PROCESSING
We implemented KNN query processing as a MapReduce job. Before starting this MapReduce job, the hash values for the query documents are calculated. These values are then used for selecting the buckets from the LSH index which are to be probed. The selected buckets are provided as input to the query processing MapReduce job, generating multiple input splits. The generated input splits are read by a custom implementation of the InputFormat class, which reads feature vectors stored in a binary format and provides them as the key part of the Map function input. Queries are distributed to the mappers either by putting them into the Distributed Cache or by putting them into an HDFS file with a high number of replicas. They are read once by the InputFormat implementation and reused as the value part of the Map function input between function invocations.

The input to the Map function therefore consists of the feature vector to be probed as the key and the list of queries as the value. The Map function computes the similarity of the feature vector with all query vectors. While a standard MapReduce implementation would now emit a result pair for each combination of feature vector and query vector, we employ an optimization that delays emitting results until all feature vectors in the input split have been processed. We then emit the final K-Nearest Neighbors for each query vector from this input split in the form of key-value pairs. Here, the query is the key, and a nearest neighbor together with its distance to the query vector is the value. To implement this delayed emitting, we store the currently best K-Nearest Neighbors for each query in memory, together with their distances from the query points. The results are emitted at the end of processing the input split in Hadoop's cleanup method⁴. The Reduce method then reads, for each query, the K-Nearest Neighbors from each mapper, sorts them by increasing distance, and emits the best K of them as the final result for this query.

⁴This feature was introduced in the most recent version 0.20; before, it was only possible to emit directly from the Map function.
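A condensed sketch of this delayed-emission mapper is shown below. For readability, feature vectors arrive here as comma-separated text keyed by their file offset (the real index uses a custom binary InputFormat that delivers the feature vector as the key and the query list as the value), and the queries are assumed to be loaded once in setup(); the loadQueries stub and the output encoding are illustrative assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of the query-processing mapper with delayed emission:
 *  output key = query id, output value = "distance \t vector offset". */
public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int K = 20;                 // number of neighbors kept per query
    private double[][] queries;                      // query vectors, loaded once per mapper
    private List<PriorityQueue<double[]>> topK;      // per query: max-heap of {distance, offset}

    @Override
    protected void setup(Context context) {
        queries = loadQueries();                     // stub, see below
        topK = new ArrayList<PriorityQueue<double[]>>();
        for (int q = 0; q < queries.length; q++) {
            // max-heap ordered by distance, so the currently worst neighbor is evicted first
            topK.add(new PriorityQueue<double[]>(K + 1, new Comparator<double[]>() {
                public int compare(double[] x, double[] y) { return Double.compare(y[0], x[0]); }
            }));
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        double[] v = parseVector(line.toString());
        for (int q = 0; q < queries.length; q++) {
            PriorityQueue<double[]> heap = topK.get(q);
            heap.add(new double[] { euclidean(v, queries[q]), offset.get() });
            if (heap.size() > K) {
                heap.poll();                         // keep only the K closest vectors seen so far
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // delayed emission: results are written only after the whole input split is processed
        for (int q = 0; q < queries.length; q++) {
            for (double[] n : topK.get(q)) {
                context.write(new Text("query-" + q), new Text(n[0] + "\t" + (long) n[1]));
            }
        }
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    private static double[] parseVector(String s) {
        String[] parts = s.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return v;
    }

    // Stub: in the full system the queries are read from the Distributed Cache or from a
    // highly replicated HDFS file; how exactly is outside the scope of this sketch.
    private static double[][] loadQueries() {
        return new double[0][];
    }
}
```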
The final sort in the reducer can even be executed within Hadoop instead of inside the Reduce method, as a subtask of sorting keys for the reducer. It is possible to apply a so-called Secondary Sort that allows, in our application, sorting not just the keys but also the values for the same key. Technically, this is implemented by replacing, for each (query, (neighbor, distance)) tuple that is emitted by a mapper, the key with a combined key consisting of the query and the distance. Keys are then sorted lexicographically, first by query and then by distance. For assigning tuples to a Reduce method, however, only the query part of the key is taken into account. The reducer then only needs to read the first K values for each key, which correspond to the K-Nearest Neighbors for that query.
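One way to realize such a secondary sort in Hadoop is sketched below: a composite key carries the query id and the distance, sorting considers both components, while partitioning and grouping look only at the query id. The class names and the omitted partitioner/grouping comparator are illustrative; only the Writable and job-configuration hooks are standard Hadoop API.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

/** Composite key (query id, distance) used for a secondary sort:
 *  sorting considers both fields, partitioning and grouping only the query id. */
public class QueryDistanceKey implements WritableComparable<QueryDistanceKey> {
    private String queryId;
    private double distance;

    public QueryDistanceKey() { }                        // required by Hadoop for deserialization

    public QueryDistanceKey(String queryId, double distance) {
        this.queryId = queryId;
        this.distance = distance;
    }

    public String getQueryId() { return queryId; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(queryId);
        out.writeDouble(distance);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        queryId = in.readUTF();
        distance = in.readDouble();
    }

    @Override
    public int compareTo(QueryDistanceKey other) {
        int byQuery = queryId.compareTo(other.queryId);  // primary order: query
        if (byQuery != 0) {
            return byQuery;
        }
        return Double.compare(distance, other.distance); // secondary order: increasing distance
    }
}

// Job setup (sketch): partition and group by query id only, so that each reduce call
// sees the values of one query already ordered by increasing distance:
//
//   job.setPartitionerClass(QueryIdPartitioner.class);             // hashes only getQueryId()
//   job.setGroupingComparatorClass(QueryIdGroupingComparator.class);
//
// QueryIdPartitioner and QueryIdGroupingComparator compare/hash the query id only;
// their straightforward definitions are omitted from this sketch.
```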
It is worth mentioning that because one feature vector is placed in multiple hash tables, the same vector can be evaluated more than once for the same query during processing. An alternative approach would be to have two MapReduce jobs for query processing instead of one, which would eliminate this kind of redundancy. The first MapReduce job would create a union of the buckets that need to be probed, and the second job would use this union as input to the similarity search. However, while this would possibly save redundant computations, it has the major drawback that the results from the first job need to be written to HDFS before the second job can start. As the overhead from multiple evaluations of the same feature vector has not been too large in our experimental evaluation (see Figure 4), we decided that it is better to probe slightly more data than to pay the additional I/O cost incurred by using two MapReduce jobs.

The approach can handle multiple queries at the same time in one MapReduce job. It is, however, not suitable for cases where the number of queries becomes very large, as the problem of KNN query processing then turns into the problem of set similarity joins [22].

5. EXPERIMENTAL EVALUATION
For our experiments we used Hadoop version 0.20.2 installed on three virtual machines running Debian GNU/Linux 5.0 (kernel version 2.6.30.10.1) as operating system. Each of the virtual machines was configured with a 200GB hard drive, 5GB main memory, and two processors. VMware Server version 2.0.2 was used for virtualization of all machines. The virtual machines were run on a single machine with an Intel Xeon CPU E5530 @ 2.4 GHz, 48GB main memory, 4TB of hard drive, and Microsoft Windows Server 2008 R2 x64 as operating system. We used a single machine Hadoop installation on these virtual machines as described later on.

Datasets
As the performance of the LSH based index is highly dependent on the data characteristics [12], we conducted an experimental evaluation both on randomly generated data (Synthetic Dataset) and on real world image data (Flickr Dataset):

Synthetic Dataset: For the synthetic dataset we used 32-dimensional randomly generated vectors. The synthetic dataset was built by first creating N independently generated vector instances drawn from the normal distribution N(0, 1) (independently for each dimension). Subsequently, we created m near duplicates for each of the N vectors, leading to an overall dataset size of m∗N vectors. The rationale behind using near duplicates is to make sure that the KNN retrieval is meaningful at all. We set m to 10 in the experiments and adapt N to the desired dataset size depending on the experiment. We generated 50 queries using the same procedure as for the original vectors.

Flickr Dataset: We used the 64-dimensional color structure feature vectors from crawled Flickr images provided by the CoPhIR data collection [7] as our real image dataset. We extracted the color structure feature vectors from the available MPEG-7 features and stored them in a binary format suitable for the experiments. As queries, we randomly selected 50 images from the rest of the CoPhIR data collection.

As the proximity measure we used the Euclidean distance. As LSH is an approximate method, we measure the effectiveness of the nearest neighbor search by its precision, which is the relative overlap of the true K-Nearest Neighbors with the K-Nearest Neighbors computed by our method.
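Spelled out as a formula (the set notation is ours, not the paper's): with T_q the set of true K-Nearest Neighbors of a query q and R_q the K results returned by RankReduce, the reported precision of a query is

    \mathrm{precision}(q) = \frac{|R_q \cap T_q|}{K}

Since both sets contain exactly K elements, this value coincides with the recall of the query.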
5.1 LSH Setup
Before starting the evaluation we needed to understand how to set up LSH and what consequences the setup has on index size and query performance. In our setup we consider the number of hash tables and the bucket size as the LSH parameters to be tuned. The bucket size can be changed either by changing the number of concatenated hash values or by changing the parameter W in Formula 1. Because W is a continuous variable and provides subtle control over the bucket size, we first fix the number of concatenated hash values and then vary the parameter W [12]. These two parameters together determine which subset of the data needs to be accessed to answer a query (one bucket per hash table). We varied the bucket size by varying the parameter W for different numbers of hash tables and then measured the probed data subset and the precision, shown in Figure 2 for the synthetic dataset and in Figure 3 for the Flickr dataset. These measurements were done using 50 KNN queries with K = 20 on both datasets, but with reduced dataset sizes of 100,000 indexed feature vectors. The results show that increasing the number of hash tables can decrease the data subset that needs to be probed to achieve a certain precision, resulting in less time needed for query execution.

[Figure 2: LSH characteristics on generated data (precision vs. data subset probed, for 4, 8, 16, and 32 hash tables).]
[Figure 3: LSH characteristics on picture data (precision vs. data subset probed, for 4, 8, 16, and 32 hash tables).]

Realizing that each new hash table creates another copy of the data and that we may have only limited storage available, we need to trade off storage cost against execution time. Additionally, when only a fixed subset of the data should be accessed, a larger number of hash tables results in a large number of small buckets, which is not a good scenario for HDFS (it puts additional pressure on Hadoop's name node, which manages all files). On one hand, we would like to increase the number of hash tables and decrease the probed data subset. On the other hand, we would like to use less storage space and a smaller number of files for storage and probing. Figure 3 shows that the number of hash tables has a smaller impact on precision in the case of real image data. Thus, as a general rule we suggest a smaller number of hash tables with larger bucket sizes, still set to satisfy the precision threshold. We therefore settle for a setup of four hash tables and a bucket size that allows us to reach at least 70% precision.

5.2 Evaluation
We evaluate our approach and compare it to a linear scan over all data, also implemented as a MapReduce job. As we did not have a real compute cluster at hand for running the experiments, we simulate the execution in a large cluster by running the mappers and reducers sequentially on our small setup. We measure the run times of their executions and the number of mappers started for each query job. To avoid the possible bottleneck of a hard drive shared between virtual machines [18], we run each experiment on a single machine Hadoop installation with one map task allowed per task tracker. This results in sequential execution of map tasks, so there is no concurrent access to a shared hard drive by multiple virtual machines.

Considering that the workload of the reducers is very small for both the linear scan and LSH, we only evaluate map execution times and the number of mappers run per query job. We measured the map execution times for all jobs and found that they are approximately constant, with an average value of 3.256 seconds per mapper and a standard deviation of 1.702 seconds. Taking into account that each mapper has approximately the same input data size, defined by the HDFS block size, an approximately constant mapper execution time is to be expected.

[Figure 4: Overhead without using the union, for 4 hash tables (data subset probed vs. bucket size parameter W, with and without the union).]

Measures of Interest
Because the execution time of the mappers is almost constant, the load of a query execution can be represented by the number of mappers per query. We measured the number of mappers per query and the precision for 50 KNN queries, with K=20, for both datasets, with 2GB, 4GB, and 8GB of indexed data (∼4000, ∼8000, and ∼16000 feature vectors for the real image dataset and ∼8000, ∼16000, and ∼32000 feature vectors for the synthetic dataset, respectively). The number of mappers per query for the synthetic dataset is shown in Figure 5: the number of mappers is about 3 times smaller for LSH than for the linear scan. Figure 6, which shows the number of mappers per query for the Flickr dataset, reveals that the difference between LSH and the linear scan is even bigger there: the number of mappers per query is 4 to 5 times smaller for LSH. The precision, shown in Figure 7, is above the threshold of 70% for 2GB and 4GB of indexed data on the generated data, but drops to 63.8% for 8GB. For the real image data, the precision is almost constant, varying slightly around 86%.

[Figure 5: Map tasks per query on generated data (LSH vs. linear scan, for 2, 4, and 8 GB of indexed data).]
[Figure 6: Map tasks per query on picture data (LSH vs. linear scan, for 2, 4, and 8 GB of indexed data).]
[Figure 7: LSH precision on generated and picture data (precision vs. indexed data size).]

6. CONCLUSION
In this work we described RankReduce, an approach for processing K-Nearest Neighbor (KNN) queries over large amounts of data. Instead of dealing with standard issues in distributed systems such as scalability and fault tolerance ourselves, we implement our solution with MapReduce, which provides these salient properties out of the box. The key idea of the presented approach is to use Locality Sensitive Hashing (LSH) in the Map phase of MapReduce to assign similar objects to the same files in the underlying distributed file system. While this seemed to be straightforward at first glance, there was a nontrivial conflict of opposing criteria and constraints caused by LSH and MapReduce which we had to solve to achieve accurate results with an acceptable query response time. We have demonstrated the suitability of our approach using both a synthetic and a real world dataset. Our presented approach to large scale data processing is, however, not limited to KNN search over images, but can be extended to a variety of other interesting applications, such as near duplicate detection, document classification, or document clustering.

As a first step in our future work, we plan to evaluate our approach on a real compute cluster, which we are currently building up, with large scale data in the order of several TB. We furthermore plan to extend our approach to video and music retrieval.
7. REFERENCES
[1] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006.
[2] Michal Batko, David Novak, Fabrizio Falchi, and Pavel Zezula. Scalability comparison of peer-to-peer similarity search structures. Future Generation Comp. Syst., 2008.
[3] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH forest: Self-tuning indexes for similarity search. In WWW, 2005.
[4] Jon Louis Bentley. K-d trees for semidynamic point sets. In Symposium on Computational Geometry, 1990.
[5] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, 1996.
[6] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, 1999.
[7] Paolo Bolettieri, Andrea Esuli, Fabrizio Falchi, Claudio Lucchese, Raffaele Perego, Tommaso Piccioli, and Fausto Rabitti. CoPhIR: A test collection for content-based image retrieval. CoRR, 2009.
[8] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.
[9] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Ze Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 2008.
[10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[11] Vlastislav Dohnal and Pavel Zezula. Similarity searching in structured and unstructured P2P networks. In QSHINE, 2009.
[12] Wei Dong, Zhe Wang, William Josephson, Moses Charikar, and Kai Li. Modeling LSH for performance tuning. In CIKM, 2008.
[13] Christos Doulkeridis, Kjetil Nørvåg, and Michalis Vazirgiannis. Peer-to-peer similarity search over widely distributed document collections. In LSDS-IR, 2008.
[14] Fabrizio Falchi, Claudio Gennaro, and Pavel Zezula. A content-addressable network for similarity search in metric spaces. In DBISP2P, 2005.
[15] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Oper. Syst. Rev., 2003.
[16] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[17] Parisa Haghani, Sebastian Michel, and Karl Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In EDBT, 2009.
[18] Shadi Ibrahim, Hai Jin, Lu Lu, Li Qi, Song Wu, and Xuanhua Shi. Evaluating MapReduce on virtual machines: The Hadoop case. In CloudCom, 2009.
[19] Jimmy J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In SIGIR, 2009.
[20] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB, 2007.
[21] Richard M. C. McCreadie, Craig Macdonald, and Iadh Ounis. On single-pass indexing with MapReduce. In SIGIR, 2009.
[22] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. 2010.
[23] Ozgur D. Sahin, Fatih Emekçi, Divyakant Agrawal, and Amr El Abbadi. Content-based similarity search over peer-to-peer systems. In DBISP2P, 2004.
[24] Bilyana Taneva, Mouna Kacimi, and Gerhard Weikum. Gathering and ranking photos of named entities with high precision, high recall, and diversity. In WSDM, 2010.