<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scaling Out All Pairs Similarity Search with MapReduce</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Regular Paper</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Claudio Lucchese ISTI-CNR Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gianmarco De Francisci Morales IMT Institute for Advanced Studies Lucca</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ranieri Baraglia ISTI-CNR Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <fpage>25</fpage>
      <lpage>30</lpage>
      <abstract>
        <p>Given a collection of objects, the All Pairs Similarity Search problem involves discovering all those pairs of objects whose similarity is above a certain threshold. In this paper we focus on document collections which are characterized by a sparseness that allows effective pruning strategies. Our contribution is a new parallel algorithm within the MapReduce framework. The proposed algorithm is based on the inverted index approach and incorporates state-of-theart pruning techniques. This is the first work that explores the feasibility of index pruning in a MapReduce algorithm. We evaluate several heuristics aimed at reducing the communication costs and the load imbalance. The resulting algorithm gives exact results up to 5x faster than the current best known solution that employs MapReduce.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>The task of discovering similar objects within a given
collection is common to many real world applications and
machine learning problems. To mention a few, recommendation
and near duplicate detection are a typical examples.</p>
      <p>Item-based and user-based recommendation algorithms
require to find, respectively, similar objects to those of
interest to the user, or other users with similar tastes. Due the
the number of users and objects present in recommender
systems, e.g. Amazon, similarity scores are usually computed
off-line.</p>
      <p>Near duplicate detection is commonly performed as a
preprocessing step before building a document index. It may
be used to detect redundant documents, which can therefore
be removed, or it may be a hint for spam websites
exploiting content repurposing strategies. Near duplicate detection
finds application also in the area of copyright protection as a
tool for discovering plagiarism, for both text and multimedia
content.</p>
      <p>In this paper, we focus on document collections. The
reason is that documents are a particular kind of data that
Copyright c 2010 for the individual papers by the papers’ authors.
Copying permitted only for private and academic purposes. This volume is
published and copyrighted by its editors.</p>
      <p>LSDS-IR Workshop, July 2010. Geneva, Switzerland.
exhibits a significant sparseness: only a small subset of the
whole lexicon occurs in any given document. This sparsity
allows to exploit indexing strategies that reduce the
potentially quadratic number of candidate pairs to evaluate.</p>
      <p>Furthermore, we are interested in discovering only those
pairs of documents with high similarity. If two documents
are not similar, they usually do not contribute to any of the
applications we mentioned above. By setting a minimum
similarity threshold, we can also embed aggressive pruning
strategies.</p>
      <p>Finally, the size of the collection at hand poses new
interesting challenges. This is particularly relevant for
Webrelated collections, where the number of documents involved
is measured in billions. This implies an enormous number
of potential candidates.</p>
      <p>More formally, we address the all pair similarity search
problem applied to a collection D of documents. Let L be
the lexicon of the collection. Each document d is represented
as a |L|-dimensional vector, where d[i] denotes the number of
occurrences of the i-th term in the document d. We adopt
the cosine distance to measure the similarity between two
documents. Cosine distance is a commutative function, such
that cos(di, dj) = cos(dj, di).</p>
      <p>Definition 1. Given a collection D = {d1, . . . , dN } of
documents, and a minimum similarity threshold σ, the All
Pairs Similarity (APS) problem requires to discover all those
document pairs di, dj ∈ D, such that:
cos(di, dj) =</p>
      <p>P0≤t&lt;|L| di[t] · dj[t]
kdikkdjk
≥ σ</p>
      <p>We normalize vectors to unit-magnitude. In this special
case, the cosine distance becomes simply the dot product
between the two vectors, denoted as dot(di, dj).</p>
      <p>The main contribution of this work is a new distributed
algorithm that embeds state-of-the-art pruning techniques.
The algorithm is designed within the MapReduce
framework, with the aim of exploiting the aggregated computing
and storage capabilities of large clusters.</p>
      <p>The rest of this paper is organized as follows: in
Section 2 we introduce a few concepts needed for the
description of the proposed algorithm. We also describe the two
most relevant contributions to our work. Section 3
incrementally describes our proposed algorithm and heuristics,
highlighting the strengths and weaknesses for each strategy.
Section 4 presents the results of our experimental
evaluation. Finally, in Section 5 we summarize our contribution
and present some ideas for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
      <p>
        A serial solution. The most efficient serial solution to
the APS problem was introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The authors use
an inverted index of the document collection to compute
similarities. An inverted index stores an inverted list for
each term of the lexicon, i.e. the list of documents containing
it, together with the weight of the term in each document.
More formally, the inverted list of the term t is defined as
It = {hdi, di[t]i|di[t] &gt; 0}.
      </p>
      <p>It is evident that two documents with non-zero similarity
must occur in the same inverted list at least once. Therefore,
given a document di, by processing all the inverted lists It
such that di[t] &gt; 0, we can detect all those documents dj
that have at least one term in common with di, and therefore
similarity greater than 0.</p>
      <p>This is analogous to information retrieval systems, where
a query is submitted to the inverted index to retrieve
matching/similar documents. In this case, a full document is used
as a query.</p>
      <p>Actually, index construction is performed incrementally,
and simultaneously to the search process. The matching
and indexing phase are performed one after the other. The
current document is first used as a query to the current
index. Then it is indexed, and it will be taken into account to
answer subsequent document similarity queries. Each
document is thus matched only against its predecessors, and
input documents can be discarded once indexed.</p>
      <p>
        Usually the matching phase dominates the computation
because its complexity is quadratic with respect to the length
of the inverted lists. In order to speed-up the search
process, various techniques to prune the index have been
proposed [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ].
      </p>
      <p>
        We focus on the first technique proposed in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Let dˆ be
an artificial document such that dˆ[i] = maxd∈D d[i]. The
document dˆ is an upper-bounding pivot: given a document
di, if cos(di, dˆ) &lt; σ then there is no document dj ∈ D
being sufficiently similar to di. This special document dˆ is
exploited as follows.
      </p>
      <p>Before indexing the current document di, the largest b
such that P0≤t&lt;b di[t] · dˆ[t] &lt; σ is computed. The terms t &lt;
b of a document are stored in a remainder collection named
DR, and only the terms t ≥ b of the current document are
inserted into the inverted index. The pruned index provides
partial scores upon similarity queries.</p>
      <p>The authors prove that for each document di currently
being matched, their algorithm correctly generates all the
candidate pairs (di, dj) using only the indexed components
of each dj. For such documents the remainder portion of dj
is retrieved from DR to compute the final similarity score.</p>
      <p>Finally, the authors propose to leverage the possibility of
reordering the terms in the lexicon. By sorting the terms in
each document by frequency in descending order, such that
d[0] refers to the most frequent term, most of the pruning
will involve the longest lists.</p>
      <p>A parallel solution.</p>
      <p>
        When dealing with large datasets, e.g. collections of Web
documents, the costs of serial solutions are still not
acceptable. Furthermore, the index structure can easily outgrow
the available memory. The authors of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] propose a parallel
distributed solution based on the MapReduce framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>MapReduce is a distributed computing paradigm inspired
by concepts of functional languages. More specifically,
MapReduce is based on two higher order functions: Map and
Reduce. The Map function applies a User Defined Function
(UDF) to each key-value pair in the input, which is treated
as a list of independent records. The result is a second list of
intermediate key-value pairs. This list is sorted and grouped
by key, and used as input to the Reduce function. The
Reduce function applies a second UDF to every intermediate
key with all its associated values to produce the final result.</p>
      <p>The signatures of the functions that compose the phases
of a MapReduce computation are as follows:</p>
      <p>M ap : [hk1, v1i]
Reduce : {k2 : [v2]}
→
→
[hk2, v2i]
[hk3, v3i]
where curly braces “{ }00 square brackets “[ ]” and angle
brackets “h i” indicate respectively a map/dictionary, a list
and a tuple.</p>
      <p>
        The Map and Reduce function are purely functional and
thus without side effects. For this reason they are easily
parallelizable. Fault tolerance is easily achieved by just
reexecuting the failed function. MapReduce has become an
effective tool for the development of large-scale applications
running on thousand of machines, especially with the release
of the open source implementation Hadoop [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Hadoop is an open source MapReduce implementation
written in Java. Hadoop provides also a distributed file
system called HDFS, that is used as a source and sink for
MapReduce executions. HDFS deamons run on the same
machines that run the computations. Data is split among
the nodes and stored on local disks. Great emphasis is
placed on data locality: the scheduler tries to run mappers
(task executing the Map function) on the same nodes that
hold the input data. This helps to reduce network traffic.</p>
      <p>Mappers sort and write intermediate values on the local
disk. Each reducer (task executing the Reduce function)
pulls the data from various remote disks. Intermediate
keyvalue pairs are already partitioned and sorted by key by the
mappers, so the reducer just merge-sorts the different
partitions to bring the same keys together. This phase is called
shuffle, and is the most expensive in terms of I/O operations.
The MapReduce data flow is illustrarted in Figure 1.</p>
      <p>Shuffle
DFS
Input 1
Input 2
Input 3</p>
      <p>MAP
MAP
MAP
Partition &amp;</p>
      <p>Sort</p>
      <p>REDUCE</p>
      <p>REDUCE
Merge &amp;
Group</p>
      <p>DFS
Output 1
Output 2</p>
      <p>
        Since building and querying incrementally a single shared
index in parallel is not a scalable solution, a two phase
algorithm is proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In the first phase an inverted
index of the collection is built (indexing phase), and in the
second phase the similarity score is computed directly from
the index (similarity phase). Each phase is implemented as
a MapReduce execution.
      </p>
      <p>
        We describe this algorithm in more detail in the following
section. The algorithm is used throughout the paper as a
baseline for the evaluation of our proposed solutions. Indeed,
the authors of [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] propose an algorithm for computing the
similarity of every pair of documents. For this reason, we
add a final filtering phase that discards the documents that
do not satisfy the threshold.
      </p>
    </sec>
    <sec id="sec-3">
      <title>ALGORITHM</title>
      <p>
        In this section we describe the algorithm used to solve the
APS problem using the MapReduce framework. We start
from a basic algorithm and propose variations to reduce its
cost. The main idea we try to exploit is that many of the
pairs are not above the similarity threshold, so they can
be pruned early from the computation. This fact is already
exploited in state-of-the-art serial algorithms [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Our goal
is to embed these techniques into the MapReduce parallel
framework.
3.1
      </p>
    </sec>
    <sec id="sec-4">
      <title>Indexed Approach (Version 0)</title>
      <p>
        A simple solution to the pairwise document similarity
problem [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] can be expressed as two separate MapReduce
jobs:
1. Indexing: for each term in the document, the mapper
emits the term as the key, and a tuple consisting of
document ID and weight as the value, i.e. the tuple
hd, d[t]i. The MapReduce runtime automatically
handles the grouping by key of these tuples. The reducer
then writes them to disk to generate the inverted lists.
2. Similarity: for each inverted list It, the mapper emits
pairs of document IDs that are in the same list as keys.
There will be m×(m−1)/2 pairs where m = |It| is the
inverted list length. The mapper will associate to each
pair the product of the corresponding term weights.
Each value represents a single term’s contribution to
the final similarity score. The MapReduce runtime
sorts and groups the tuples and then the reducer sums
all the partial similarity scores for a pair to generate
the final similarity score.
      </p>
      <p>This approach is very easy to understand and implement,
but suffers from various problems. First, it generates and
evaluates all the pairs that have one feature in common, even
if only a small fraction of them are actually above the
similarity threshold. Second, the load is not evenly distributed.</p>
      <p>The reducers of the similarity phase can only start
after all the mappers have completed. The time to process
the longest inverted list dominates the pair generation
performed by the mappers. With real-world data, which follows
a Zipfian or Power-law distribution, this means that the
reducers usually have to wait for a single mapper to complete.
This problem is exacerbated by the quadratic nature of the
problem: a list twice as long takes about four times more to
be processed.</p>
      <p>
        A document frequency cut has been proposed to help
reducing the number of candidate pairs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This technique
removes the 1% most frequent terms from the computation.
The rationale behind this choice is that because these terms
are frequent, they do not help in discerning documents. The
main drawback of this approach is that the resulting
similarity score is not exact.
3.2
      </p>
    </sec>
    <sec id="sec-5">
      <title>Pruning (Version 1)</title>
      <p>To address the issues in the previous approach, we
employ the pruning technique described in Section 2. As a
result, during the indexing phase, a smaller pruned index
is produced. On the one hand, this reduces the number of
candidate pairs produced, and therefore the volume of data
handled during the MapReduce shuffle. On the other hand,
by sorting terms by their frequency, the pruning significantly
shortens the longest inverted lists. This decreases the cost
of producing a quadratic number of pairs from these lists.</p>
      <p>This pruning technique yields correct results when used in
conjunction with dynamic index building. However, it also
works when the index is built fully before matching, and
only the index is used to generate candidate pairs. To prove
this, we show that this approach generates every document
pair with similarity above the threshold.</p>
      <p>Let di, dj be two documents and let bi, bj be, respectively,
the first indexed features for each document. bi and bj are
the boundaries between the pruned and indexed part as
shown in Figure 2. Without losing generality, let bj bi
(recall that features are sorted in decreasing order of
frequency, so bj is less frequent than bi). We can compute the
similarity score as the sum of two parts:</p>
      <p>X
0≤t&lt;bj</p>
      <p>X
bj≤t&lt;|L|
dot(di, dj) =
di[t] · dj [t] +
di[t] · dj [t]</p>
      <p>While indexing, we keep an upper bound on the similarity
between the document and the rest of the input. This means
that ∀d˙ ∈ D, P0≤t&lt;bj d˙[t] · dj[t] &lt; σ. Thus, if the two
documents are above the similarity threshold dot(di, dj ) ≥ σ,
then it must be that Pbj≤t&lt;|L| di[t] · dj[t] &gt; 0. If this is
the case, then ∃ t bj | (di ∈ It ∧ dj ∈ It). Therefore, our
strategy will generate the pair (di, dj ) when scanning list It.</p>
      <p>bi
d</p>
      <p>i
d
j
0</p>
      <sec id="sec-5-1">
        <title>Pruned</title>
        <p>bj</p>
      </sec>
      <sec id="sec-5-2">
        <title>Indexed</title>
      </sec>
      <sec id="sec-5-3">
        <title>Pruned</title>
      </sec>
      <sec id="sec-5-4">
        <title>Indexed</title>
        <p>|L|</p>
        <p>The Reduce function in the similarity phase receives a
reduced number of candidate pairs, and computes a partial
similarity score. Due to index pruning, no partial scores
will be produced from the inverted lists {It | 0 ≤ t &lt; bj},
since these inverted lists will not contain both documents.
Therefore, the reducer will have to retrieve the original
documents, and compute the contribution up to term bj in order
to produce the exact similarity score.</p>
        <p>We chose to distribute the input locally on every node1.
The performance penalty of distributing the input collection
is acceptable for a small number of nodes, but can become
a bottleneck for large clusters. Furthermore, the input is
usually too big to be kept in memory, so we still have to
perform 2 random disk I/O per pair.</p>
        <p>Finally, to improve the load balancing we employ a
simple bucketing technique. During the indexing phase, we
randomly hash the inverted lists to different buckets. This
spreads the longest lists uniformly among the buckets. Each
bucket will be consumed by a different mapper in the
similarity phase. While more sophisticated strategies are possible,
we found that this one works well enough in practice.
3.3</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Flagging (Version 2)</title>
      <p>In order to avoid the distribution of the full document
collection, we propose a less aggressive pruning strategy. Our
second approach consists in flagging the index items instead
of pruning them. At the same time, the flagged parts of the
documents are written as a side effect file by the mappers
of the indexing phase. This “remainders” file is then
distributed to all the nodes, and made available to the reducers
of the similarity phase. The remainders file is normally just
a fraction of the size of the original input (typically 10%),
so distributing it is not a problem. During pair generation
in the similarity phase, a pair is emitted only if at least one
of the two index items is not flagged.</p>
      <p>Our pair generation strategy emits all the pairs for the
features from bi to |L|, so we just need to add the dot product
of the remainders. The remainders file is small enough to be
easily loaded in memory in one pass during the setup of the
reducer. Thus, for each pair we only need to perform two
in-memory lookups and compute their dot product. This
process involves no I/O, so it is faster than the previous
version.</p>
      <p>The main drawback of this version is that it generates
more pairs than version 1. This leads to unnecessary
evaluation of pairs and consequently to wasted effort.
3.4</p>
    </sec>
    <sec id="sec-7">
      <title>Using Secondary Sort (Version 3)</title>
      <p>This version tries to achieve the benefits of both previous
versions. Observe that for every pair (di, dj) one of the two
documents has been pruned up to a term that precedes the
other (remember that features are sorted according to their
frequency). Let this document be the LPD (Least Pruned
Document) of the pair. Let the other document be the MPD
(Most Pruned Document). In Figure 2, di is the LPD and
dj is the MPD.</p>
      <p>We use version 1 pair generation strategy and version 2
remainder distribution and loading. To generate the partial
scores we lack (from 0 to bj), we just need to perform the
dot product between the whole LPD and the remainder of
the MPD. The catch is to have access to the whole LPD
without doing random disk I/O and without keeping the
input in memory.</p>
      <p>Our proposed solution is to shuffle the input together with
the generated pairs and route the documents where they are
needed. In order to do that, we employ Hadoop’s Secondary
Sort feature. Normally, MapReduce sorts the intermediate
records by key before starting the reduce phase. Using
secondary sort we ask Hadoop to sort the records also by a
1using Hadoop’s Distributed Cache feature
secondary key while the grouping of values is still performed
only by primary key. Instead of using the whole pair as a
key, we use the LPD as the primary key and the MPD as
the secondary key.</p>
      <p>As a result, input values for the reducer are grouped by
LPD, and sorted by both LPD and MPD, so that partial
scores that belong to the same pair are adjacent. The LPD
document from the original input that we shuffled together
with the pairs is in the same group. In addition, we impose
the LPD document itself to sort before every other pair
using a fake minimum secondary key. This allows us to have
access to the document before iterating over the values, and
therefore to perform the dot products on the fly. This is a
representation of the input for the reduce of the similarity
phase:
hdii; h(di, dj), WiAj i; h(di, dj), WiBj i; h(di, dk), WiAki; . . .
| }
group b{yz key di
hdji; h(dj, dk), WjAki; h(dj, dk), WjBki; h(dj, dl), WjAl i; . . .
| }
group b{yz key dj</p>
      <p>First, we load the document di in memory. Then, for each
stripe of equal consecutive pairs (di, dj), we sum the partial
scores WiXj for each common term X. Finally, we compute
the dot product between the LPD and the remainder of the
MPD, which is already loaded in memory from the
remainders file. We repeat this cycle until there are no more values.
After that we can discard the LPD from memory and
proceed to the next key-values group.
4.</p>
    </sec>
    <sec id="sec-8">
      <title>EXPERIMENTAL RESULTS</title>
      <p>In this section we describe the performance evaluation of
the proposed algorithms. We ran the experiments on a
5node cluster. Each node is equipped with two Intel Xeon
E5520 CPUs clocked at 2.27GHz. Each CPU features 4 cores
and Hyper-Threading for a total of 40 virtual cores. Each
node has a 2TiB disk, 8GiB of RAM, and Gigabit Ethernet.</p>
      <p>On each node, we installed Ubuntu 9.10 Karmic, 64-bit
server edition, Sun JVM 1.6.0 20 HotSpot 64-bit server, and
Hadoop 0.20.1 from Cloudera (CDH2).</p>
      <p>We used one of the nodes to run Hadoop’s master daemons
(Namenode and JobTracker), and the rest were configured
as slaves running Datanode and TaskTracker daemons. Two
of the cores on each slave machine where reserved to run the
daemons, the rest were equally split among map and reduce
slots (7 each), for a total of 28 slots for each phase.</p>
      <p>We tuned Hadoop’s configuration in the following way:
we allocated 1GiB of memory to each daemon and 400MiB
to each task, we changed the HDFS block size to 256MiB
and the file buffer size to 128KiB. We also disabled
speculative execution and enabled JVM reuse and map output
compression.</p>
      <p>For each algorithm, we wrote an appropriate combiner to
reduce the shuffle size (a combiner is a reduce-like function
that runs inside the mapper to aggregate partial results). In
our case, the combiners perform the sums of partial scores in
the values, according to the same logic used in the reducer.
We also implemented raw comparators for every key value
used in the algorithms in order to get better performance
(raw comparators are used to compare keys during sorting
without deserializing them into objects).
# documents
# terms
# all pairs
# similar pairs
algorithm version
# evaluated pairs (M)
# partial scores (M)
index size (MB)
remainder size (MB)
running time (s)
avg. map time (s)
stdv. map time (%)
avg. reduce time (s)
stdv. reduce time (%)
We used different subsets of the TREC WT10G Web
corpus. The dataset has 1,692,096 english language documents.
The size of the entire uncompressed collection is around
10GiB.</p>
      <p>We performed a preprocessing step to prepare the data
for analysis. We parsed the dataset, removed stopwords,
performed stemming and vectorization of the input. We
extracted the lexicon and the maximum weight for each term.
We also sorted the features inside each document in
decreasing order of document frequency, as required by the pruning
strategy.
4.1</p>
    </sec>
    <sec id="sec-9">
      <title>Running Time</title>
      <p>We evaluated the running time of the different algorithm
versions while increasing the dataset size. For all the
algorithms, the indexing phase took always less than 1 minute
in the worst case. Thus we do not report indexing times,
but only similarity computation times, which dominate the
whole computation.</p>
      <p>We set the number of mappers to 50 and the number
of reducers to 28, so that the mappers finish in two waves
and all the reducers can run at the same time and start
copying and sorting the partial results while mappers are
still running. For all the experiments, we set the similarity
threshold to 0.9.</p>
      <p>Running time
70000 v0
v1
v2
60000 v3
50000
)
snd 40000
o
c
e
s
(
iem 30000
T
20000
10000
105000 20000 25000 30000 35000 40000 45000 50000 55000 60000 65000</p>
      <p>Number of vectors</p>
      <p>Figure 4 shows the effects of pruning. The maximum
length of the inverted lists is drastically reduced in version 3
compared to version 0. This explains their different running
times, as the algorithm is dominated by the traversal of the
longest inverted list. Figure 5 shows the effects of bucketing.
The load is evenly spread across all the mappers, so that the
time wasted waiting for the slowest mapper is minimized. It
is evident also from Table 1 that the standard deviation of
map running times is much lower when bucketing is enabled.</p>
      <p>On the largest input, version 3 is 5x faster than version 0,
2x faster than version 1 and 1.4x faster than version 2. This
is caused by the fact that version 3 does not access the disk
randomly like version 1 and evaluates less pairs than
version 2. Exact times are reported in Table 1.</p>
      <p>Version 3 outperforms all the others in almost all aspects.
The number of evaluated pairs and the number of partial
scores are the lowest, together with version 1. Version 3 has
also the lowest average map times. The standard deviation
3500
3000
2500
1000
500
0
of map times for versions 1, 2 and 3 is much lower than
version 0 thanks to bucketing.</p>
      <p>For average reduce time, things change with different
input sizes. For small inputs the overhead of version 3 does not
pay back, and version 2 has the best trade-off between
algorithm complexity and number of partial scores. For large
inputs the smaller number of partial scores of version 3 gives
it an edge over other versions. Version 1 is the slowest
because of disk access and version 0 also scales poorly because
of the large number of partial scores.</p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>The All Pairs Similarity Search problem is a challenging
problem that arises in many applications in the area of
information retrieval, such as recommender systems and near
duplicate detection. The size of Web-related problems
mandates the use of parallel approaches in order to achieve
reasonable computing times. In this</p>
      <p>We presented a novel exact algorithm for the APS
problem. The algorithm is based on the inverted index approach
and is developed within the MapReduce framework. To the
best of our knowledge, this is the first work to exploit well
known pruning techniques from the literature adapting them
to the MapReduce framework. We evaluated several
heuristics aimed at reducing the cost of the algorithm. Our
proposed approach runs up to 5x faster than the simple
algorithm based on inverted index.</p>
      <p>In our work we focused on scalability with respect to the
input size. We believe that the scalability of the algorithm
with respect to parallelism level deserves further
investigation. In addition, we believe that more aggressive pruning
techniques can be embedded in the algorithms. Adapting
these techniques to a parallel environment such as
MapReduce requires further study. We also want to investigate
the application of our algorithm to other kinds of real world
data, like social networks.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGEMENTS</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Apache</given-names>
            <surname>Software Foundation. Hadoop</surname>
          </string-name>
          :
          <article-title>A framework for running applications on large clusters built of commodity hardware</article-title>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Awekar</surname>
          </string-name>
          and
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Samatova</surname>
          </string-name>
          .
          <article-title>Fast Matching for All Pairs Similarity Search</article-title>
          .
          <source>In WI-IAT '09: Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology</source>
          , pages
          <fpage>295</fpage>
          -
          <lpage>300</lpage>
          . IEEE Computer Society,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Bayardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Scaling up all pairs similarity search</article-title>
          .
          <source>In WWW '07: Proceedings of the 16th international conference on World Wide Web</source>
          , pages
          <fpage>131</fpage>
          -
          <lpage>140</lpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <surname>S. Ghemawat.</surname>
          </string-name>
          <article-title>MapReduce: simplified data processing on large clusters</article-title>
          .
          <source>In OSDI '04: Proceedings of the 6th Symposium on Opearting Systems Design and Implementation</source>
          , pages
          <fpage>137</fpage>
          -
          <lpage>150</lpage>
          . USENIX Association,
          <year>December 2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>Pairwise document similarity in large collections with MapReduce</article-title>
          .
          <source>In HLT '08: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies</source>
          , pages
          <fpage>265</fpage>
          -
          <lpage>268</lpage>
          . Association for Computational Linguistics,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>