8th Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS-IR'10)

RankReduce – Processing K-Nearest Neighbor Queries on Top of MapReduce∗

Aleksandar Stupar, Sebastian Michel, Ralf Schenkel
Saarland University, Saarbrücken, Germany
astupar@mmci.uni-saarland.de, smichel@mmci.uni-saarland.de, schenkel@mmci.uni-saarland.de

ABSTRACT
We consider the problem of processing K-Nearest Neighbor (KNN) queries over large datasets where the index is jointly maintained by a set of machines in a computing cluster. The proposed RankReduce approach uses locality sensitive hashing (LSH) together with a MapReduce implementation, which by design is a perfect match, as the hashing principle of LSH can be smoothly integrated into the mapping phase of MapReduce. The LSH algorithm assigns similar objects to the same fragments in the distributed file system, which enables an effective selection of potential candidate neighbors that are then reduced to the set of K-Nearest Neighbors. We address problems arising from the different characteristics of MapReduce and LSH in order to achieve an efficient search process on the one hand and high LSH accuracy on the other hand. We discuss several pitfalls and describe in detail how to circumvent them. We evaluate RankReduce using both synthetic data and a dataset obtained from Flickr.com, demonstrating the suitability of the approach.

1. INTRODUCTION
With the success of the Web 2.0 and the widespread usage of cell phones and digital cameras, millions of pictures are being taken and uploaded to portals like Facebook or Flickr every day¹, accumulating to billions of images². Searching in these huge amounts of images becomes a challenging task. While there is an increasing trend to use social annotations, so-called tags, for image retrieval, next to the traditional image search à la Google/Bing/Yahoo which inspects the text around the web site holding the picture, there is a vital need to process similarity queries, where for a given query picture the K most similar pictures are returned based on low level features such as color, texture, and shape [9]. The big advantage of such low level features is that they are always available, whereas tags are usually extremely scarce and text around images can often be misleading. Approaches like the work by Taneva et al. [24] use both textual and low level image descriptors to increase the diversity of returned query results. There exist plenty of fundamental prior works [16, 5, 4] on how to index feature based representations of pictures or, more generally, high dimensional vectors in a way that allows for inspecting only a small subset of all vectors to find the most similar ones.

The increasing volume of high dimensional data, however, poses novel problems to traditional indexing mechanisms, which usually assume an in-memory index or optimize for local disk access. As a promising approach to process huge amounts of data on a multitude of machines in a cluster, MapReduce [10] has been proposed and continuously explored for many interesting application classes. In this paper, we investigate the usage of MapReduce for searching in high dimensional data. Although MapReduce was initially described in a generic and rather imprecise way in terms of implementation, implementations like Apache's Hadoop have proven to provide salient properties such as scalability, ease of use, and most notably robustness to node failures. This provides an excellent base to explore MapReduce for its suitability for large scale management of high dimensional data.
In this work, we propose RankReduce, an approach to implement locality sensitive hashing (LSH) [1, 8, 16], an established method for similarity search on high dimensional data, on top of the highly reliable and scalable MapReduce infrastructure. While this may seem to be straightforward at first glance, it poses interesting challenges to the integration: we face different characteristics of MapReduce and LSH which need to be harnessed at the same time to achieve both high accuracy and good performance. As MapReduce is usually used only to process large amounts of data in an offline fashion and not for query processing, we carefully investigate its suitability to handle user defined queries effectively, demonstrating interesting insights on how to tune LSH on top of MapReduce.

The remainder of the paper is structured as follows. Section 2 gives an overview of related work, Section 3 presents our framework and gives a brief introduction to LSH and MapReduce, Section 4 describes the way queries are processed, Section 5 presents an experimental evaluation, and Section 6 concludes the paper and gives an outlook on ongoing work.

∗This work has been supported by the Cluster of Excellence "Multimodal Computing and Interaction" (MMCI).
¹http://blog.facebook.com/blog.php?post=2406207130
²http://blog.flickr.net/en/2009/10/12/4000000000/

Copyright © 2010 for the individual papers by the papers' authors. Copying permitted only for private and academic purposes. This volume is published and copyrighted by its editors. LSDS-IR Workshop, July 2010. Geneva, Switzerland.

2. RELATED WORK
Processing K-Nearest Neighbor queries in high dimensional data has received a lot of attention by researchers in recent years. When the dimensionality increases, the relative difference between the distance to the closest and to the farthest neighbor shrinks rapidly for most datasets [6], which is known as the 'curse of dimensionality'. This problem has a direct impact on exact KNN query processing based on tree structures, such as the X-Tree [5] and the K-D tree [4], rendering these approaches applicable only to a rather small number of dimensions. A better suited approach for KNN processing in high dimensions is Locality Sensitive Hashing (LSH) [1, 8, 16]. It is based on the application of locality preserving hash functions which map, with high probability, close points from the high dimensional space to the same hash value (i.e., hash bucket). Being an approximate method, the performance of LSH highly depends on accurate parameter tuning [12, 3]. Work has also been done on decreasing the number of hash tables used for LSH, while preserving the precision, by probing multiple buckets per hash table [20].

[Figure 1: The RankReduce Framework]

In recent years, a number of distributed solutions have been proposed whose main emphasis was put on exploring loosely coupled distributed systems in the form of Peer-to-Peer (P2P) networks, such as [11, 13, 14, 17, 23]; cf. the work by Batko et al. [2] for a discussion of the suitability of different P2P approaches to distributed similarity search.

MapReduce is a framework for efficient and fault tolerant workload distribution in large clusters [10]. The motivation behind the design and development of MapReduce comes from Information Retrieval with its many computationally expensive, but embarrassingly parallel problems on large datasets. One of the most basic of those problems is inverted index construction, described in [21]. MapReduce has not yet been utilized for distributed processing of KNN queries. Some similarities with KNN processing can be found in recent work by Vernica et al. [22], which describes several approaches for computing set similarities on textual documents, but it does not address the issue of KNN query processing. There, the pairwise similarity is calculated only for documents with the same prefixes (prefix filtering), which can be considered as the LSH min-hashing technique. Lin [19] describes a MapReduce based implementation of pairwise similarity comparisons of text documents based on an inverted index.
3. RANKREDUCE FRAMEWORK
We address the problem of processing K-Nearest Neighbor queries in large datasets by implementing a distributed LSH based index within the MapReduce framework.

An LSH based index uses locality sensitive hash functions for indexing data. The salient property of these functions is that they map, with high probability, similar objects (represented in the d-dimensional vector space) to the same hash bucket, i.e., related objects are more likely to receive the same hash value than distant ones. The actual indexing builds several hash tables with different LSH functions to increase the probability of collision for close points. At query time, the KNN search is performed by hashing the query point to one bucket per hash table and then ranking all objects discovered in any of these buckets by their distance to the query point. The closest K points are returned as the final result.

In this work, we consider the family of LSH functions based on p-stable distributions [8], which are most suitable for l_p norms. In this case, for each data point v, the hashing scheme considers k independent hash functions of the form

    h_{a,B}(v) = \left\lfloor \frac{a \cdot v + B}{W} \right\rfloor    (1)

where a is a d-dimensional vector whose elements are chosen independently from a p-stable distribution, W \in \mathbb{R}, and B is chosen uniformly from [0, W]. Each hash function maps a d-dimensional data point onto the set of integers. With k such hash functions, the final result is a vector of length k of the form g(v) = (h_{a_1,B_1}(v), ..., h_{a_k,B_k}(v)).
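To make the hashing scheme concrete, the sketch below shows how the hash vector g(v) from Equation (1) could be computed in Java, the language of the Hadoop ecosystem our implementation builds on. The class name, the field layout, and the use of java.util.Random to draw Gaussian (2-stable, i.e., Euclidean distance) entries for a are illustrative assumptions rather than the exact code of RankReduce.

```java
import java.util.Random;

/** Sketch of one hash table's LSH function g(v) = (h_{a_1,B_1}(v), ..., h_{a_k,B_k}(v)),
 *  following Equation (1); assumes the Euclidean (l2) case, i.e. a 2-stable (Gaussian) distribution. */
public class LshFunction {
    private final double[][] a;   // k random projection vectors, each of dimension d
    private final double[] b;     // k offsets, drawn uniformly from [0, W]
    private final double w;       // bucket width W

    public LshFunction(int k, int d, double w, long seed) {
        this.w = w;
        this.a = new double[k][d];
        this.b = new double[k];
        Random rnd = new Random(seed);
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < d; j++) {
                a[i][j] = rnd.nextGaussian();   // entries of a ~ N(0,1), a 2-stable distribution
            }
            b[i] = rnd.nextDouble() * w;        // B chosen uniformly from [0, W]
        }
    }

    /** Computes h_{a_i,B_i}(v) = floor((a_i . v + B_i) / W) for i = 1..k. */
    public int[] hash(double[] v) {
        int[] g = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += a[i][j] * v[j];
            }
            g[i] = (int) Math.floor((dot + b[i]) / w);
        }
        return g;
    }
}
```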
In order to achieve high search accuracy, multiple hash tables need to be constructed. The work by Lv et al. [20] presents an approach to probe multiple buckets per hash table which, however, leads either to sampling a larger fraction of the dataset or to many fine grained accesses to small buckets. The latter causes a larger number of expensive random accesses to the underlying infrastructure, as we deal with file based indexes as opposed to in-memory accesses. Hence, we opted for using a single probe per hash table.

For maintaining the hash tables over a set of machines in a cluster, we employ MapReduce [10], which is designed for large scale parallel data processing. It is built on top of the distributed file system [15], which enables distributing the data over the cluster machines in a scalable and fault tolerant way. This tight integration of MapReduce with the distributed file system allows computation to be moved to where the data resides, eliminating network bandwidth bottlenecks caused by data shipping during query processing. Our implementation uses the open source software Hadoop³, maintained by the Apache Foundation, which provides a Java based implementation of both the MapReduce framework and the distributed file system (coined HDFS for Hadoop Distributed File System). In recent years, Hadoop has gained a lot of popularity in the open source community and is also part of many research efforts investigating large data processing.

³http://hadoop.apache.org/

MapReduce is a fairly simple programming model, based on two developer supplied functions: Map and Reduce. Both functions operate on key-value pairs. The Map function receives a key-value pair as input and emits multiple (or no) key-value pairs as output. The output from all Map functions is grouped by key, and for each such key all values are fed to the Reduce function, which then produces the final output from these values.

In the Hadoop implementation, the input data is grouped into so-called input splits (which often correspond to blocks in the distributed file system), and a number of so-called mapper processes call the Map function for each key-value pair in such an input split. Several mappers can run concurrently on each node in the cluster, and the mapper processes are in addition distributed over all nodes in the cluster. Ideally, a mapper is run on the same node where the input block resides, but this is not always possible due to workload imbalance. Similarly, after all mappers have finished, dedicated reducer processes are run on nodes in the cluster. Each reducer handles a fraction of the output key space, copies the corresponding key-value pairs from all mappers' outputs (in the so-called shuffle phase), sorts them by key, and feeds them to the Reduce function. The output of the reducers is usually considered the final result but can also be used as input for subsequent MapReduce jobs.
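The following skeleton illustrates this programming model in Hadoop's org.apache.hadoop.mapreduce API (introduced with version 0.20, the version we use). The concrete key and value types and the simple counting logic are placeholders chosen for illustration; they are not part of RankReduce itself.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Map: one input key-value pair in, zero or more intermediate pairs out. */
class ExampleMapper extends Mapper<Text, Text, Text, IntWritable> {
    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit any number of (key, value) pairs; the framework groups them by key
        context.write(key, new IntWritable(1));
    }
}

/** Reduce: all values sharing a key are handed over together after the shuffle and sort phase. */
class ExampleReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();   // aggregate the grouped values and produce the final output
        }
        context.write(key, new IntWritable(sum));
    }
}
```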
Figure 1 shows an illustration of our LSH integration with MapReduce. Each hash table in the LSH index is mapped to one folder in HDFS. For each bucket in such a hash table, a corresponding file is created in this folder, where the file name is created by concatenating the hash values into a string, with ' ' as separator. This mapping of buckets to HDFS files enables fast lookup at query time and ensures that only data that is to be probed is read from HDFS. Placing the bucket in one file also enables block based sequential access to all vectors in one bucket, which is very important as the MapReduce framework is optimized for such block based rather than random access processing. Each of the buckets stores the complete feature vectors of all objects mapped to this bucket in a binary encoding.
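A possible way to derive the HDFS location of a bucket from a hash vector, following the folder-per-hash-table and file-per-bucket layout described above, is sketched below. The root directory name and the separator character used here are our own illustrative choices, not the exact ones of RankReduce.

```java
import org.apache.hadoop.fs.Path;

/** Sketch: derives the HDFS file of a bucket from a hash vector g(v).
 *  Layout as described above: one folder per hash table, one file per bucket,
 *  file name = concatenated hash values. */
public class BucketPaths {
    private static final char SEPARATOR = '_';   // assumed separator between hash values
    private final Path indexRoot;                // e.g. /lsh-index (illustrative root directory)

    public BucketPaths(Path indexRoot) {
        this.indexRoot = indexRoot;
    }

    public Path bucketFile(int hashTableId, int[] hashVector) {
        StringBuilder name = new StringBuilder();
        for (int i = 0; i < hashVector.length; i++) {
            if (i > 0) {
                name.append(SEPARATOR);          // concatenate the k hash values into one string
            }
            name.append(hashVector[i]);
        }
        // one folder per hash table, one file per bucket inside that folder
        return new Path(new Path(indexRoot, "table-" + hashTableId), name.toString());
    }
}
```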
Indexing of new feature vectors into the LSH index in HDFS is easily done by appending them to the end of the appropriate bucket file. This can also be done in parallel with query processing as long as different buckets are affected; as HDFS does not include a transaction mechanism, appending entries to buckets that are currently being queried would be possible, but with unclear semantics for the running queries. As HDFS scales well with increasing cluster size, the resulting growth of the LSH index can easily be supported by adding more machines to the cluster.

While an LSH index stored in memory has no limitation on the number of buckets, too many files in HDFS can degrade its performance, especially if these files are much smaller than the block size (which defaults to 64MB). The number of buckets, and therefore the number of files in HDFS for the LSH index, is highly dependent on the setup of the LSH parameters, as choosing a bad combination of parameters can result in a large number of small files.

Inspired by in-memory indexes which can keep references from buckets to materialized feature vectors, we considered storing only feature vector ids in the buckets instead of the actual feature vectors, and retrieving the full vectors only on demand at query time. However, this approach would result in poor performance due to many random accesses to HDFS when retrieving the full vectors, so we decided to store complete feature vectors. This also needs to be taken into account when setting up the LSH parameters, as too many LSH hash tables can dramatically increase the index size, since each feature vector is materialized once per hash table.

4. QUERY PROCESSING
We implemented KNN query processing as a MapReduce job. Before starting this MapReduce job, the hash values for the query documents are calculated. These values are then used for selecting the buckets from the LSH index which are to be probed. The selected buckets are provided as input to the query processing MapReduce job, generating multiple input splits. The generated input splits are read by a custom implementation of the InputFormat class, which reads feature vectors stored in a binary format and provides them as the key part of the Map function input. Queries are distributed to the mappers either by putting them into the Distributed Cache or by putting them into an HDFS file with a high number of replicas. They are read once by the InputFormat implementation and reused as the value part of the Map function input between function invocations.

The input to the Map function therefore consists of the feature vector to be probed as the key and the list of queries as the value. The Map function computes the similarity of the feature vector with all query vectors. While a standard MapReduce implementation would now emit a result pair for each combination of feature vector and query vector, we employ an optimization that delays emitting results until all feature vectors in the input split have been processed. We then emit the final K-Nearest Neighbors for each query vector from this input split in the form of key-value pairs. Here, the query is the key, and a nearest neighbor together with its distance to the query vector is the value. To implement this delayed emitting, we store the currently best K-Nearest Neighbors for each query in memory, together with their distances from the query points. The results are emitted at the end of processing the input split in Hadoop's cleanup method⁴. The Reduce method then reads, for each query, the K-Nearest Neighbors from each mapper, sorts them by increasing distance, and emits the best K of them as the final result for this query.

⁴This feature was introduced in the most recent version 0.20; before, it was only possible to emit directly from the Map function.
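A condensed sketch of this delayed-emission mapper is shown below. For readability, feature vectors arrive here as comma-separated text keyed by their file offset (the real index uses a custom binary InputFormat that delivers the feature vector as the key and the query list as the value), and the queries are assumed to be loaded once in setup(); the loadQueries stub and the output encoding are illustrative assumptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/** Sketch of the query-processing mapper with delayed emission:
 *  output key = query id, output value = "distance \t vector offset". */
public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final int K = 20;                 // number of neighbors kept per query
    private double[][] queries;                      // query vectors, loaded once per mapper
    private List<PriorityQueue<double[]>> topK;      // per query: max-heap of {distance, offset}

    @Override
    protected void setup(Context context) {
        queries = loadQueries();                     // stub, see below
        topK = new ArrayList<PriorityQueue<double[]>>();
        for (int q = 0; q < queries.length; q++) {
            // max-heap ordered by distance, so the currently worst neighbor is evicted first
            topK.add(new PriorityQueue<double[]>(K + 1, new Comparator<double[]>() {
                public int compare(double[] x, double[] y) { return Double.compare(y[0], x[0]); }
            }));
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context) {
        double[] v = parseVector(line.toString());
        for (int q = 0; q < queries.length; q++) {
            PriorityQueue<double[]> heap = topK.get(q);
            heap.add(new double[] { euclidean(v, queries[q]), offset.get() });
            if (heap.size() > K) {
                heap.poll();                         // keep only the K closest vectors seen so far
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // delayed emission: results are written only after the whole input split is processed
        for (int q = 0; q < queries.length; q++) {
            for (double[] n : topK.get(q)) {
                context.write(new Text("query-" + q), new Text(n[0] + "\t" + (long) n[1]));
            }
        }
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    private static double[] parseVector(String s) {
        String[] parts = s.split(",");
        double[] v = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            v[i] = Double.parseDouble(parts[i]);
        }
        return v;
    }

    // Stub: in the full system the queries are read from the Distributed Cache or from a
    // highly replicated HDFS file; how exactly is outside the scope of this sketch.
    private static double[][] loadQueries() {
        return new double[0][];
    }
}
```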
The final sort in the reducer can even be executed within Hadoop instead of inside the Reduce method, as a subtask of sorting keys for the reducer. It is possible to apply a so-called Secondary Sort that allows, in our application, sorting not just the keys but also the values for the same key. Technically, this is implemented by replacing, for each (query, (neighbor, distance)) tuple that is emitted by a mapper, the key with a combined key consisting of the query and the distance. Keys are then sorted lexicographically, first by query and then by distance. For assigning tuples to a Reduce method, however, only the query part of the key is taken into account. The reducer then only needs to read the first K values for each key, which correspond to the K-Nearest Neighbors for that query.
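One way to realize such a secondary sort in Hadoop is sketched below: a composite key carries the query id and the distance, sorting considers both components, while partitioning and grouping look only at the query id. The class names and the omitted partitioner/grouping comparator are illustrative; only the Writable and job-configuration hooks are standard Hadoop API.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

/** Composite key (query id, distance) used for a secondary sort:
 *  sorting considers both fields, partitioning and grouping only the query id. */
public class QueryDistanceKey implements WritableComparable<QueryDistanceKey> {
    private String queryId;
    private double distance;

    public QueryDistanceKey() { }                        // required by Hadoop for deserialization

    public QueryDistanceKey(String queryId, double distance) {
        this.queryId = queryId;
        this.distance = distance;
    }

    public String getQueryId() { return queryId; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(queryId);
        out.writeDouble(distance);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        queryId = in.readUTF();
        distance = in.readDouble();
    }

    @Override
    public int compareTo(QueryDistanceKey other) {
        int byQuery = queryId.compareTo(other.queryId);  // primary order: query
        if (byQuery != 0) {
            return byQuery;
        }
        return Double.compare(distance, other.distance); // secondary order: increasing distance
    }
}

// Job setup (sketch): partition and group by query id only, so that each reduce call
// sees the values of one query already ordered by increasing distance:
//
//   job.setPartitionerClass(QueryIdPartitioner.class);             // hashes only getQueryId()
//   job.setGroupingComparatorClass(QueryIdGroupingComparator.class);
//
// QueryIdPartitioner and QueryIdGroupingComparator compare/hash the query id only;
// their straightforward definitions are omitted from this sketch.
```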
It is worth mentioning that because one feature vector is placed in multiple hash tables, the same vector can be evaluated more than once for the same query during processing. An alternative approach would be to have two MapReduce jobs for query processing instead of one, which would eliminate this kind of redundancy. The first MapReduce job would create a union of the buckets that need to be probed, and the second job would use this union as input to the similarity search. However, while this would possibly save redundant computations, it has the major drawback that the results from the first job need to be written to HDFS before the second job can start. As the overhead from multiple evaluations of the same feature vector has not been too large in our experimental evaluation (see Figure 4), we decided that it is better to probe slightly more data than to pay the additional I/O cost incurred by using two MapReduce jobs.

The approach can handle multiple queries at the same time in one MapReduce job. It is, however, not suitable for cases where the number of queries becomes very large, as the problem of KNN query processing then turns into the problem of set similarity joins [22].

5. EXPERIMENTAL EVALUATION
For our experiments we used Hadoop version 0.20.2 installed on three virtual machines running Debian GNU/Linux 5.0 (kernel version 2.6.30.10.1) as operating system. Each of the virtual machines was configured with a 200GB hard drive, 5GB main memory, and two processors. VMware Server version 2.0.2 was used for virtualization of all machines. The virtual machines were run on a single machine with an Intel Xeon CPU E5530 @ 2.4 GHz, 48GB main memory, 4TB of hard drive, and Microsoft Windows Server 2008 R2 x64 as operating system. We used a single machine Hadoop installation on these virtual machines as described later on.

Datasets
As the performance of the LSH based index is highly dependent on the data characteristics [12], we conducted an experimental evaluation both on randomly generated data (Synthetic Dataset) and on real world image data (Flickr Dataset):

Synthetic Dataset: For the synthetic dataset we used 32-dimensional randomly generated vectors. The synthetic dataset was built by first creating N independently generated vector instances drawn from the normal distribution N(0, 1) (independently for each dimension). Subsequently, we created m near duplicates for each of the N vectors, leading to an overall dataset size of m∗N vectors. The rationale behind using near duplicates is to make sure that the KNN retrieval is meaningful at all. We set m to 10 in the experiments and adapt N to the desired dataset size depending on the experiment. We generated 50 queries using the same procedure as for the original vectors.

Flickr Dataset: We used the 64-dimensional color structure feature vectors from crawled Flickr images provided by the CoPhIR data collection [7] as our real image dataset. We extracted the color structure feature vectors from the available MPEG-7 features and stored them in a binary format suitable for the experiments. As queries, we randomly selected 50 images from the rest of the CoPhIR data collection.

As the proximity measure we used the Euclidean distance. As LSH is an approximate method, we measure the effectiveness of the nearest neighbor search by its precision, which is the relative overlap of the true K-Nearest Neighbors with the K-Nearest Neighbors computed by our method.
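Spelled out as a formula (the set notation is ours, not the paper's): with T_q the set of true K-Nearest Neighbors of a query q and R_q the K results returned by RankReduce, the reported precision of a query is

    \mathrm{precision}(q) = \frac{|R_q \cap T_q|}{K}

Since both sets contain exactly K elements, this value coincides with the recall of the query.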
5.1 LSH Setup
Before starting the evaluation we needed to understand how to set up LSH and what consequences the setup has on index size and query performance. In our setup we consider the number of hash tables and the bucket size as the LSH parameters to be tuned. The bucket size can be changed either by changing the number of concatenated hash values or by changing the parameter W in Formula 1. Because W is a continuous variable and provides subtle control over the bucket size, we first fix the number of concatenated hash values and then vary the parameter W [12]. These two parameters together determine which subset of the data needs to be accessed to answer a query (one bucket per hash table). We varied the bucket size by varying the parameter W for different numbers of hash tables and then measured the probed data subset and the precision, shown in Figure 2 for the synthetic dataset and in Figure 3 for the Flickr dataset. These measurements were done using 50 KNN queries with K = 20 on both datasets, but with reduced dataset sizes of 100,000 indexed feature vectors. The results show that increasing the number of hash tables can decrease the data subset that needs to be probed to achieve a certain precision, resulting in less time needed for query execution.

[Figure 2: LSH characteristics on generated data (precision vs. data subset probed, for 4, 8, 16, and 32 hash tables).]
[Figure 3: LSH characteristics on picture data (precision vs. data subset probed, for 4, 8, 16, and 32 hash tables).]

Realizing that each new hash table creates another copy of the data and that we may have only limited storage available, we need to trade off storage cost against execution time. Additionally, when only a fixed subset of the data should be accessed, a larger number of hash tables results in a large number of small buckets, which is not a good scenario for HDFS (it puts additional pressure on Hadoop's name node, which manages all files). On one hand, we would like to increase the number of hash tables and decrease the probed data subset. On the other hand, we would like to use less storage space and a smaller number of files for storage and probing. Figure 3 shows that the number of hash tables has a smaller impact on precision in the case of real image data. Thus, as a general rule we suggest a smaller number of hash tables with larger bucket sizes, still set to satisfy the precision threshold. We therefore settle for a setup of four hash tables and a bucket size that allows us to reach at least 70% precision.

5.2 Evaluation
We evaluate our approach and compare it to a linear scan over all data, also implemented as a MapReduce job. As we did not have a real compute cluster at hand for running the experiments, we simulate the execution in a large cluster by running the mappers and reducers sequentially on our small setup. We measure the run times of their executions and the number of mappers started for each query job. To avoid the possible bottleneck of a hard drive shared between virtual machines [18], we run each experiment on a single machine Hadoop installation with one map task allowed per task tracker. This results in sequential execution of map tasks, so there is no concurrent access to a shared hard drive by multiple virtual machines.

Considering that the workload of the reducers is very small for both the linear scan and LSH, we only evaluate map execution times and the number of mappers run per query job. We measured the map execution times for all jobs and found that they are approximately constant, with an average value of 3.256 seconds per mapper and a standard deviation of 1.702 seconds. Taking into account that each mapper has approximately the same input data size, defined by the HDFS block size, an approximately constant mapper execution time is to be expected.

[Figure 4: Overhead without using the union, for 4 hash tables (data subset probed vs. bucket size parameter W, with and without the union).]

Measures of Interest
Because the execution time of the mappers is almost constant, the load of a query execution can be represented by the number of mappers per query. We measured the number of mappers per query and the precision for 50 KNN queries, with K=20, for both datasets, with 2GB, 4GB, and 8GB of indexed data (∼4000, ∼8000, and ∼16000 feature vectors for the real image dataset and ∼8000, ∼16000, and ∼32000 feature vectors for the synthetic dataset, respectively). The number of mappers per query for the synthetic dataset is shown in Figure 5: the number of mappers is about 3 times smaller for LSH than for the linear scan. Figure 6, which shows the number of mappers per query for the Flickr dataset, reveals that the difference between LSH and the linear scan is even bigger there: the number of mappers per query is 4 to 5 times smaller for LSH. The precision, shown in Figure 7, is above the threshold of 70% for 2GB and 4GB of indexed data on the generated data, but drops to 63.8% for 8GB. For the real image data, the precision is almost constant, varying slightly around 86%.

[Figure 5: Map tasks per query on generated data (LSH vs. linear scan, for 2, 4, and 8 GB of indexed data).]
[Figure 6: Map tasks per query on picture data (LSH vs. linear scan, for 2, 4, and 8 GB of indexed data).]
[Figure 7: LSH precision on generated and picture data (precision vs. indexed data size).]

6. CONCLUSION
In this work we described RankReduce, an approach for processing K-Nearest Neighbor (KNN) queries over large amounts of data. Instead of dealing with standard issues in distributed systems such as scalability and fault tolerance ourselves, we implement our solution with MapReduce, which provides these salient properties out of the box. The key idea of the presented approach is to use Locality Sensitive Hashing (LSH) in the Map phase of MapReduce to assign similar objects to the same files in the underlying distributed file system. While this seemed to be straightforward at first glance, there was a nontrivial conflict of opposing criteria and constraints caused by LSH and MapReduce which we had to solve to achieve accurate results with an acceptable query response time. We have demonstrated the suitability of our approach using both a synthetic and a real world dataset. Our presented approach to large scale data processing is, however, not limited to KNN search over images, but can be extended to a variety of other interesting applications, such as near duplicate detection, document classification, or document clustering.

As a first step in our future work, we plan to evaluate our approach on a real compute cluster, which we are currently building up, with large scale data in the order of several TB. We furthermore plan to extend our approach to video and music retrieval.
7. REFERENCES
[1] Alexandr Andoni and Piotr Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In FOCS, 2006.
[2] Michal Batko, David Novak, Fabrizio Falchi, and Pavel Zezula. Scalability comparison of peer-to-peer similarity search structures. Future Generation Comp. Syst., 2008.
[3] Mayank Bawa, Tyson Condie, and Prasanna Ganesan. LSH forest: Self-tuning indexes for similarity search. In WWW, 2005.
[4] Jon Louis Bentley. K-d trees for semidynamic point sets. In Symposium on Computational Geometry, 1990.
[5] Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. The X-tree: An index structure for high-dimensional data. In VLDB, 1996.
[6] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In ICDT, 1999.
[7] Paolo Bolettieri, Andrea Esuli, Fabrizio Falchi, Claudio Lucchese, Raffaele Perego, Tommaso Piccioli, and Fausto Rabitti. CoPhIR: A test collection for content-based image retrieval. CoRR, 2009.
[8] Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Symposium on Computational Geometry, 2004.
[9] Ritendra Datta, Dhiraj Joshi, Jia Li, and James Ze Wang. Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv., 2008.
[10] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[11] Vlastislav Dohnal and Pavel Zezula. Similarity searching in structured and unstructured P2P networks. In QSHINE, 2009.
[12] Wei Dong, Zhe Wang, William Josephson, Moses Charikar, and Kai Li. Modeling LSH for performance tuning. In CIKM, 2008.
[13] Christos Doulkeridis, Kjetil Nørvåg, and Michalis Vazirgiannis. Peer-to-peer similarity search over widely distributed document collections. In LSDS-IR, 2008.
[14] Fabrizio Falchi, Claudio Gennaro, and Pavel Zezula. A content-addressable network for similarity search in metric spaces. In DBISP2P, 2005.
[15] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. SIGOPS Oper. Syst. Rev., 2003.
[16] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999.
[17] Parisa Haghani, Sebastian Michel, and Karl Aberer. Distributed similarity search in high dimensions using locality sensitive hashing. In EDBT, 2009.
[18] Shadi Ibrahim, Hai Jin, Lu Lu, Li Qi, Song Wu, and Xuanhua Shi. Evaluating MapReduce on virtual machines: The Hadoop case. In CloudCom, 2009.
[19] Jimmy J. Lin. Brute force and indexed approaches to pairwise document similarity comparisons with MapReduce. In SIGIR, 2009.
[20] Qin Lv, William Josephson, Zhe Wang, Moses Charikar, and Kai Li. Multi-probe LSH: Efficient indexing for high-dimensional similarity search. In VLDB, 2007.
[21] Richard M. C. McCreadie, Craig Macdonald, and Iadh Ounis. On single-pass indexing with MapReduce. In SIGIR, 2009.
[22] Rares Vernica, Michael J. Carey, and Chen Li. Efficient parallel set-similarity joins using MapReduce. 2010.
[23] Ozgur D. Sahin, Fatih Emekçi, Divyakant Agrawal, and Amr El Abbadi. Content-based similarity search over peer-to-peer systems. In DBISP2P, 2004.
[24] Bilyana Taneva, Mouna Kacimi, and Gerhard Weikum. Gathering and ranking photos of named entities with high precision, high recall, and diversity. In WSDM, 2010.