<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>PhD Workshop, August</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Distributed Similarity Joins on Big Textual Data: Toward a Robust Cost-Based Framework</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
<institution>Humboldt-Universität zu Berlin, Unter den Linden 6, 10099 Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>28</volume>
      <issue>2017</issue>
      <abstract>
<p>Motivated by increasing dataset sizes, various MapReduce-based similarity join algorithms have emerged. In our past work (to appear), we compared nine of the most prominent algorithms experimentally. Surprisingly, we found that their runtimes become prohibitively long for only moderately large datasets. There are two main reasons. First, data grouping and replication between Map and Reduce relies on input data characteristics such as word distribution. A skewed distribution, as is common for textual data, leads to data groups with very unequal computation costs, causing Straggling Reducer issues. Second, each Reduce instance has only limited main memory. Data spilling also leads to Straggling Reducers. In order to leverage parallelization, all approaches we investigated rely on high replication and hit this memory limit even with relatively small input data. In this work, we propose an initial approach toward a join framework that overcomes both of these issues. It includes a cost-based grouping and replication strategy which is robust against large data sizes and various data characteristics such as skew. Furthermore, we propose an addition to the MapReduce programming paradigm. It unblocks the Reduce execution by running Reducers on partial intermediate datasets, allowing for arbitrarily large intermediate datasets between Map and Reduce.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Similarity joins are an important operation for user
recommendations, near-duplicate detection, or plagiarism
detection. They compute similar pairs of objects, such as
strings, sets, multisets, or more complex structures.
Similarity is expressed by similarity (or distance) functions such
as Jaccard, Cosine, or edit distance. A naive approach to compute
a similarity self-join is to build the cross product over an
input dataset and filter out all non-similar pairs. This
approach has a quadratic runtime. In the literature, there
are various non-distributed non-parallelized approaches for
similarity joins based on a two-phase approach [
        <xref ref-type="bibr" rid="ref11 ref15 ref2 ref3 ref4">1, 2, 3, 10,
14</xref>
]. They compute a set of candidate pairs which is usually
orders of magnitude smaller than the cross product.
Subsequently, they verify whether the candidates are similar. We refer
to them as filter-and-verification approaches. Motivated by
increasing dataset sizes, MapReduce-based distributed
approaches have emerged [
        <xref ref-type="bibr" rid="ref6 ref13 ref14">5, 12, 13</xref>
        ]. We conducted an
extensive experimental study on nine current MapReduce-based
set similarity join algorithms on textual data (to appear).
There are two key findings. First, we compared the runtime
of the MapReduce join algorithms to the runtime of
competing non-distributed algorithms from the recent experimental
survey of Mann et al. [
        <xref ref-type="bibr" rid="ref12">11</xref>
        ]. The runtime of MapReduce join
algorithms on small datasets is inferior to the runtime of
non-distributed approaches. This is not surprising due to
the MapReduce overhead. The second finding is that none
of the approaches can compute the join on larger (or even
arbitrarily large) datasets. The runtimes increase so
drastically that we terminated the executions after a long timeout.
      </p>
      <p>
We identified two main reasons for these runtime issues
on large datasets. First, for every MapReduce-based
similarity join algorithm we investigated we found non-optimal
input datasets that lead to only a few running join Reduce
instances while all other instances were left idle. That is,
we often observed Straggling Reducers. Figure 1 shows the
compute instance usage of a non-optimal join execution on
a cluster of 48 compute instances. After roughly half the
execution time, only a few instances are used. The instance
usage is directly connected to data grouping and replication
between Map and Reduce. All algorithms under
investigation exploit and thus rely on certain data characteristics for
replication and grouping. The most relevant characteristics
are the global token frequency of the input dataset and the
number of tokens in each record of a dataset. Stop words,
which occur in a majority of records of a dataset, cause
skewed data groups within most join approaches we
investigated. As a second cause, we identified memory overload
within Reduce instances. All approaches heavily replicate
data to leverage parallelization. The original MapReduce
programming paradigm as introduced by Dean et al. [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ]
requires the Reduce instances to wait for the Map steps to
finish before the intermediate data groups are sorted and
grouped by key. When the Reduce buffers are filled, data
is spilled to disk, often causing high runtime penalties. The
use of Combiners is not possible for similarity joins, because
Reducers are stateful. This limitation is inherent to
standard MapReduce.
      </p>
      <p>In this paper, we propose an initial approach toward a
robust framework to compute distributed similarity joins.
It overcomes the Straggling Reducer issues and the input
dataset size limitation we experienced in our past
experiments. Our approach is twofold. First, we find a
grouping and replication strategy which distributes compute load
evenly over the existing compute instances. This is
challenging since it is not sufficient to generate data groups of equal
size. The runtime of a join computation within one group is
dependent on characteristics of the data in the group such
as record lengths. Second, we enable MapReduce to handle
large intermediate datasets by proposing an extension for
MapReduce which unblocks the Reduce execution based on
statistical information gathered in a preprocessing step.</p>
      <p>
        The idea of load balancing in MapReduce based on
statistics is not new. The TopCluster algorithm [
        <xref ref-type="bibr" rid="ref9">8</xref>
        ] is an online
approach which includes cardinality estimations at runtime.
Our approach on the other hand needs exact data statistics
in order to unblock the Reduce execution. These statistics
have to be collected before the join execution. Our approach
is comparable to the one by Kolb et al. [
        <xref ref-type="bibr" rid="ref10">9</xref>
        ], which involves a
preprocessing MR job to collect data statistics and a join job
which uses the statistics for an optimal data grouping and
replication. We extend this approach by using the knowledge
of the group sizes to unblock the Reduce execution.
Furthermore, we tailor the grouping and replication to the specific
problem of set similarity joins.
      </p>
      <p>The contributions of this paper are as follows:</p>
      <p>We propose a first approach toward a robust distributed
similarity join framework.</p>
      <p>We define a robust grouping and replication strategy
leading to evenly distributed compute loads amongst
the available compute nodes.</p>
      <p>We extend the MapReduce programming paradigm to
unblock Reduce execution to handle (potentially
arbitrarily) large datasets.</p>
      <p>The structure of the paper is as follows. In Section 2, we
give an overview of the similarity join problem and algorithmic
approaches, and motivate the need for research with the
runtime issues we experienced in our past experiments. In
Section 3, we introduce our approach for a robust join
framework and its interaction with an extension of MapReduce to
unblock Reduce execution. In Section 4, we conclude our
work and give an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. BACKGROUND</title>
      <p>Without loss of generality, we use the set similarity
self-join as a running example. Our framework can be applied
to other filter-and-verification-based similarity joins as well.
The set similarity join computes all pairs of similar sets
(s1, s2) within a set of records S. A similarity function
sim(s1, s2) expresses the similarity between two records. For
sets, there are similarity functions such as Jaccard, Cosine,
or Dice. The user chooses a threshold t at or above which two
sets are considered similar. Formally, given a set S,
a similarity function sim(s1, s2), and a similarity
threshold t, the set similarity join computes the set
{(s1, s2) ∈ S × S | sim(s1, s2) ≥ t, s1 ≠ s2}.</p>
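The definition above can be made concrete with a minimal quadratic baseline. The following Python sketch (function and variable names are our own, not from the paper) computes the Jaccard self-join naively:

```python
from itertools import combinations

def jaccard(s1, s2):
    """Jaccard similarity: |intersection| / |union| of two token sets."""
    return len(s1 & s2) / len(s1 | s2)

def naive_self_join(records, t):
    """Quadratic baseline: test every pair (s1, s2) with s1 != s2 once."""
    return [(i, j) for (i, s1), (j, s2) in combinations(enumerate(records), 2)
            if jaccard(s1, s2) >= t]

records = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}]
print(naive_self_join(records, 0.7))  # [(0, 1)]: Jaccard 3/4 >= 0.7
```

This is exactly the cross-product approach whose quadratic runtime motivates the filter-and-verification techniques below.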
      <p>
        A naive approach computes the similarity on all pairs
(s1; s2). Since it has a quadratic runtime, it is not
feasible even for small datasets. In the literature,
filter-and-verification approaches emerged. Their basic idea is to
generate an (inverted) index over all input records. For each
postings list, they compute the cross product (half of it in
the self-join case to be exact) and the union of all these
cross products. Each distinct record ID pair in the union is
a candidate pair, because the two records contain at least
one common token. These candidate pairs are further
verified to compute the end result. Sophisticated filtering
techniques keep the indexes and the number of candidate pairs
small. The most prominent filter is the prefix filter [
        <xref ref-type="bibr" rid="ref2 ref3 ref4">1, 2, 3</xref>
        ].
Given a record length, a similarity function, and a
similarity threshold, the prefix length is the minimum number of
tokens of a record which need to be indexed to guarantee an overlap of
at least one common token with every record similar to it.
      </p>
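For Jaccard with threshold t, a record of length l is commonly assigned the prefix length l − ⌈t·l⌉ + 1. The following Python sketch (our own naming; a strongly simplified version of the scheme in [1, 2, 3]) indexes only prefixes and collects candidate pairs:

```python
import math
from collections import defaultdict

def prefix_length(record_len, t):
    """Minimum number of tokens to index so that any record similar under
    Jaccard threshold t shares at least one indexed token (standard bound)."""
    return record_len - math.ceil(t * record_len) + 1

def candidates(records, t):
    """Index only each record's prefix (tokens sorted by a global order);
    records sharing an indexed token become candidate pairs."""
    index = defaultdict(list)          # token -> postings list of record ids
    cands = set()
    for rid, rec in enumerate(records):
        tokens = sorted(rec)           # global token order: numeric ids here
        for tok in tokens[:prefix_length(len(tokens), t)]:
            for other in index[tok]:
                cands.add((other, rid))
            index[tok].append(rid)
    return cands

print(candidates([{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}], 0.7))  # {(0, 1)}
```

The candidate pairs would still have to be verified, but the index and the cross products stay much smaller than over full records.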
      <p>
        Motivated by increasing dataset sizes, MapReduce-based
versions of the filter-and-verification approach emerged [
        <xref ref-type="bibr" rid="ref6 ref13 ref14">5,
12, 13</xref>
        ]. The main idea is identical to the non-distributed
approaches. It is to compute an inverted index, to
compute the cross product on each postings list, and to verify
the resulting candidate pairs. The inverted index is built
as follows. A Map step computes key-value pairs with a
token or a more complex signature as key. The MapReduce
framework groups key-value pairs with the same key to one
Reduce instance. This instance computes the cross product
on the postings list. Depending on the value of the key-value
pair (all tokens of the input record vs. only the record ID),
the verification takes place within the Reduce, or there are
further MapReduce steps to join the original records to the
candidate pairs for the verification.
      </p>
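As an illustration, the basic token-as-key scheme of [7] can be simulated in a few lines of Python (a single-process sketch; `map_phase`, `reduce_phase`, and the explicit `grouped` dictionary stand in for the framework's Map, Reduce, and shuffle steps):

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records):
    """Emit one (token, record_id) pair per token, as in the basic scheme."""
    for rid, rec in enumerate(records):
        for token in rec:
            yield token, rid

def reduce_phase(grouped, records, t):
    """Per postings list: cross product of record ids, then verify pairs."""
    result = set()
    for rids in grouped.values():
        for a, b in combinations(sorted(rids), 2):
            sa, sb = records[a], records[b]
            if len(sa & sb) / len(sa | sb) >= t:
                result.add((a, b))
    return result

records = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8, 9}]
grouped = defaultdict(list)
for token, rid in map_phase(records):   # the framework's shuffle/group step
    grouped[token].append(rid)
print(reduce_phase(grouped, records, 0.7))  # {(0, 1)}
```

Note that a token occurring in most records yields one very long postings list whose quadratic cross product dominates the runtime of its Reduce instance.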
      <p>
        The key generation of all algorithms known to us relies
on characteristics of the input data. In the most basic
algorithm [
        <xref ref-type="bibr" rid="ref8">7</xref>
        ], each token in the input record is used as key.
Obviously, the number of record groups is equal to the number of
distinct tokens in the input dataset. The size of each record
group depends on the global frequency of its key token. The
data replication is dependent on the record lengths. For
sufficiently large datasets with stop words (tokens which
occur in almost every record) and/or many long records, the
Straggling Reducer effect occurs. More sophisticated
approaches use a prefix filter, which reduces the number of
tokens for replication to a prefix, which is shorter than the
record length but still dependent on it. The use of such
filters shifts the Straggling Reducer issue to larger datasets
and/or datasets with longer records, but does not solve it
for arbitrarily large datasets.
      </p>
      <p>We expect the input of the similarity join to be text, which
is integer-tokenized by a preprocessing step. The
tokenization may include changing letter cases, stemming, or stop
word removal. Depending on the preprocessing, the
properties of input datasets vary by token distribution (stop words,
infrequent tokens), dictionary size, and record size. The
token distribution of textual data is usually Zipfian, which
means that there are a few very frequent tokens and many rare ones. This is a
challenge for approaches relying on token distribution.</p>
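A minimal sketch of such a preprocessing step (our own naming, assuming simple word tokenization with lowercasing and stop word removal) could look as follows:

```python
import re

def integer_tokenize(docs, stop_words=frozenset()):
    """Lowercase, split into words, drop stop words, and map every
    distinct token to an integer id; each document becomes a token set."""
    vocab = {}
    records = []
    for doc in docs:
        tokens = [w for w in re.findall(r"\w+", doc.lower()) if w not in stop_words]
        records.append({vocab.setdefault(tok, len(vocab)) for tok in tokens})
    return records, vocab

records, vocab = integer_tokenize(["The cat sat", "the CAT ran"], stop_words={"the"})
print(records)  # [{0, 1}, {0, 2}]
```

Whether stop words are removed here directly changes the token distribution, and with it the group sizes of the join approaches above.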
    </sec>
    <sec id="sec-3">
      <title>3. APPROACH</title>
      <p>In Figure 2, we illustrate the data flow of our framework.
The first step computes exact data statistics. It computes
record length frequencies and global token frequencies. These
statistics can be computed in linear time and are highly
parallelizable. Furthermore, it estimates runtime costs for the
join execution, based on data samples with differing
average record lengths. The second step computes the actual
join. Every Map instance obtains the statistics from the
first step via a setup function which is called once before the
input data is read. Based on these statistics, it determines
a suitable data grouping and replication and assigns keys
to its output accordingly. Each join Reducer also obtains
the statistics via the setup function. Using the statistics,
it can compute the exact size of each group and start
computing the join on this group once all data for it has
completely arrived. This can happen before all Mappers have
finished their execution. Note that this requires a change
in the original MapReduce. The Reduce-side shuffling
periodically counts the occurrences of each key in its input. It
triggers the execution of the first-order function once one of
the groups is complete. The Reducer can run any existing
state-of-the-art non-distributed similarity join.</p>
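The proposed unblocking can be sketched as follows (hypothetical Python; `UnblockedReducer`, `group_sizes`, and `join_fn` are our own names, and `join_fn` stands for any non-distributed similarity join run per group):

```python
from collections import defaultdict

class UnblockedReducer:
    """Sketch of the proposed MapReduce extension: the reducer knows each
    group's exact size from the statistics step and runs the first-order
    function as soon as a group is complete, without waiting for all
    Mappers to finish."""

    def __init__(self, group_sizes, join_fn):
        self.expected = group_sizes        # key -> exact group cardinality
        self.buffers = defaultdict(list)   # partially received groups
        self.join_fn = join_fn
        self.results = []

    def receive(self, key, value):
        """Called for every incoming key-value pair during shuffling."""
        self.buffers[key].append(value)
        if len(self.buffers[key]) == self.expected[key]:
            self.results.extend(self.join_fn(key, self.buffers.pop(key)))

reducer = UnblockedReducer({"a": 2, "b": 3}, lambda key, vals: [(key, sorted(vals))])
for kv in [("a", 1), ("b", 7), ("a", 2), ("b", 8), ("b", 9)]:
    reducer.receive(*kv)   # group "a" is joined after the third pair arrives
print(reducer.results)     # [('a', [1, 2]), ('b', [7, 8, 9])]
```

Because completed groups are released from the buffers immediately, the memory needed at any point is bounded by the incomplete groups only, rather than by the whole intermediate dataset.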
      <p>In the following, we describe how to find a
grouping and replication based on the statistics. We use the
Jaccard similarity function as an example, because it is the
most commonly used function in the literature. Our
framework is also applicable to any other set-based similarity
function. Jaccard is defined as the intersection divided by the
union of two records, |a ∩ b| / |a ∪ b|. Note that records with differing
lengths can be similar. Figure 3 shows this length
relationship for a similarity threshold of 0.7. For each record
length on the y-axis, it shows on the x-axis which record
lengths have to be considered as join candidates. Let us
assume the input has a length distribution as depicted in
Figure 4. In order to obtain data groups which can be
self-joined independently, we group together all records with the
same length and replicate each group to all larger length
groups it can be similar to. The resulting groups would be
{1}, {2}, {3, 4}, {4, 5, 6, 7}, {5, 6, 7, 8}, {6, 7, 8, 9, 10}, etc. Note
that these groups have very uneven cardinalities, for
example |{1}| = 8,000, |{6, 7, 8, 9, 10}| = 688,000, etc.</p>
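Using the standard Jaccard length filter (two records can only be similar if the shorter length is at least t times the longer), such groups can be derived directly from the length statistics. The Python sketch below uses our own naming and the bound ⌈t·l⌉, so its group boundaries may differ slightly from the example groups above:

```python
import math

def length_groups(max_len, t):
    """For each record length l, build the group of all lengths l' <= l
    that can still be Jaccard-similar to l, i.e. l' >= ceil(t * l).
    A record is replicated into every larger group that covers its length."""
    return {l: list(range(math.ceil(t * l), l + 1)) for l in range(1, max_len + 1)}

groups = length_groups(8, 0.7)
print(groups[4], groups[8])  # [3, 4] [6, 7, 8]
```

Each group can then be self-joined independently, at the price of replicating every record into all groups whose anchor length it may be similar to.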
      <p>In order to distribute the cardinalities evenly, we propose
to apply a hash-based grouping and replication on these
groups. Figure 5 shows an example for a hashing factor
of 4. A hashing function assigns each record to one of 4
groups. Each record is distributed 4 times, so that it joins
each other record in exactly one of the squares in the figure.
Note that there is a tradeoff related to the hashing factor.
If it is very low, there are only few large groups and the
replication is low. If it is high, there are many small groups
and the replication is high.</p>
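The squares of Figure 5 can be modeled as unordered bucket pairs (our own Python naming): a record in bucket b is sent to every pair-group containing b, so with hashing factor h it is replicated h times, and any two records meet in exactly one group.

```python
from itertools import combinations

def replicate(bucket, h):
    """The h pair-groups a record in `bucket` is sent to."""
    return [(min(bucket, x), max(bucket, x)) for x in range(h)]

def group_pairs(assignments, h):
    """assignments: record id -> bucket. Group (i, j) with i < j joins only
    cross-bucket pairs and (b, b) only within-bucket pairs, so every record
    pair is joined in exactly one group."""
    members = {}
    for rid, b in assignments.items():
        for g in replicate(b, h):
            members.setdefault(g, []).append(rid)
    pairs = set()
    for (i, j), rids in members.items():
        for a, c in combinations(sorted(rids), 2):
            if i == j or assignments[a] != assignments[c]:
                pairs.add((a, c))
    return pairs

print(group_pairs({0: 0, 1: 1, 2: 0, 3: 2}, 3))  # all 6 pairs of the 4 records
```

With h buckets there are h(h+1)/2 pair-groups, which makes the tradeoff explicit: a larger h gives smaller groups but h-fold replication.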
      <p>An even data distribution is not sufficient to prevent
Straggling Reducer effects. The cost of joining a partition with
long records is higher than the cost of joining a partition
with equally many short records. Let us assume that the
Reducer of the join step has a quadratic runtime, which
represents the worst case. The runtime costs of computing a
self-join on one group of records with cardinality groupSize
and with an average record length of avgRecLen can be
estimated with Equation 1, assuming that the tokens in the
records are sorted by a global token order allowing for a
merge-join. In Figure 6, we show a plot of this cost
estimation function (costs per group). It shows that the costs
grow quadratically with regard to the number of records in
the group and linearly with regard to the average length of
the records in the group. In order to avoid a Straggling
Reducer effect, our aim is to find a data grouping and
replication which at least limits the maximum compute costs over
all groups or, ideally, imposes equal computation costs for
each data group. In Figure 6, equal computation costs would
occur if all groups exhibited a combination of number of records
and average record lengths on an intersection of the graph
with a horizontal plane.
costs = (groupSize · (groupSize − 1) / 2) · 2 · avgRecLen (1)</p>
      <p>Our idea is to optimize the overall computation costs with
the hashing factor h as variable (Equation 2) and the
constraint that the computation cost of each group may not be
larger than the maximum cost threshold m, which ensures
that no Reducer gets overloaded.</p>
      <p>min X costs(group; h); costs(group; h)
h2N+ group
m
(2)</p>
      <p>The group-wise costs within this equation could either be
estimated by Equation 1 or it might use runtimes on sampled
data from the statistics MapReduce step.</p>
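Equations 1 and 2 can be combined into a simple search over the hashing factor. The sketch below is our own illustration: `costs` follows Equation 1, and the per-pair-group size 2·⌈groupSize/h⌉ is a rough upper bound we assume for one length group split over h buckets (for h = 1 the group stays whole):

```python
def costs(group_size, avg_rec_len):
    """Equation 1: number of record pairs times the merge-join cost
    2 * avgRecLen of comparing two sorted records."""
    return group_size * (group_size - 1) // 2 * 2 * avg_rec_len

def best_hash_factor(size, avg_len, m, h_max=64):
    """Equation 2 for a single length group: pick the hashing factor h with
    minimal estimated total cost such that no pair-group exceeds the cost
    threshold m (so no Reducer is overloaded). Returns (h, total) or None."""
    best = None
    for h in range(1, min(h_max, size) + 1):
        # each pair-group holds records of two buckets: ~2 * ceil(size / h)
        sub = size if h == 1 else 2 * ((size + h - 1) // h)
        c = costs(sub, avg_len)
        if c > m:
            continue                      # this h would overload a Reducer
        total = c * (h * (h + 1) // 2)    # number of pair-groups
        if best is None or total < best[1]:
            best = (h, total)
    return best

print(best_hash_factor(10, 5, 10**9))  # (1, 450): tiny group, no split needed
```

Swapping the analytic `costs` for empirical runtimes from the sampled statistics step only changes the cost function, not the search itself.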
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>
In this paper, we introduced a first approach toward a
distributed similarity join framework which is robust against
arbitrary input dataset sizes and data characteristics such
as skew. We plan to work it out in detail, implement it, and run
experiments with it. One crucial detail is to ensure that a
sufficient number of record groups becomes complete early. If
a Reduce instance collects only non-complete groups,
straggling will still occur. Another open detail is the choice of
the hash function for the join. Grouping and replication
strategies from existing MapReduce-based similarity join
approaches could be integrated in the proposed strategy.
Especially signature-creating approaches like MassJoin [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] and
sophisticated grouping strategies like MRGroupJoin [
        <xref ref-type="bibr" rid="ref7">6</xref>
        ]
using the pigeonhole principle are promising.
      </p>
      <p>In future experiments, we are especially interested in the
tradeoff between replication and group size. Furthermore, it
is interesting whether it pays off to use empirical runtime statistics
for the join costs or to simply estimate the runtime analytically.</p>
    </sec>
    <sec id="sec-5">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work was supported by the Humboldt Elsevier Advanced Data and Text (HEADT) Center.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Arasu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          .
          <article-title>Efficient exact set-similarity joins</article-title>
          .
          <source>In Proceedings of the 32nd international conference on Very large data bases</source>
          , pages
          <volume>918</volume>
          –
          <fpage>929</fpage>
          . VLDB Endowment,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Bayardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          .
          <article-title>Scaling up all pairs similarity search</article-title>
          .
          <source>In Proceedings of the 16th international conference on World Wide Web</source>
          , pages
          <volume>131</volume>
          –
          <fpage>140</fpage>
          . ACM,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          .
          <article-title>A primitive operator for similarity joins in data cleaning</article-title>
          .
          <source>In Data Engineering</source>
          ,
          <year>2006</year>
          .
          <source>ICDE'06. Proceedings of the 22nd International Conference on, pages 5–5</source>
          . IEEE,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghemawat</surname>
          </string-name>
          .
          <source>MapReduce: Simplified data processing on large clusters. OSDI</source>
          , pages
          <volume>137</volume>
          –
          <fpage>150</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>MassJoin: A MapReduce-based method for scalable string similarity joins</article-title>
          .
          <source>In Data Engineering (ICDE)</source>
          ,
          <year>2014</year>
          IEEE 30th International Conference on, pages
          <volume>340</volume>
          –
          <fpage>351</fpage>
          . IEEE,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>An efficient partition-based method for exact set similarity joins</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>9</volume>
          (
          <issue>4</issue>
          ):
          <volume>360</volume>
          –
          <fpage>371</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>Pairwise document similarity in large collections with MapReduce</article-title>
          .
          <source>In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers</source>
          , pages
          <volume>265</volume>
          –
          <fpage>268</fpage>
          . Association for Computational Linguistics,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gufler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Augsten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reiser</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          .
          <article-title>Load balancing in MapReduce based on scalable cardinality estimates</article-title>
          .
          <source>In Data Engineering (ICDE)</source>
          ,
          <year>2012</year>
          IEEE 28th International Conference on, pages
          <volume>522</volume>
          –
          <fpage>533</fpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Kolb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Thor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Load balancing for MapReduce-based entity resolution</article-title>
          .
          <source>In Data Engineering (ICDE)</source>
          ,
          <year>2012</year>
          IEEE 28th International Conference on, pages
          <volume>618</volume>
          –
          <fpage>629</fpage>
          . IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <article-title>Pass-join: A partition-based method for similarity joins</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <volume>253</volume>
          –
          <fpage>264</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Augsten</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Bouros</surname>
          </string-name>
          .
          <article-title>An empirical evaluation of set similarity join techniques</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>9</volume>
          (
          <issue>9</issue>
          ):
          <volume>636</volume>
          –
          <fpage>647</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Metwally</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          .
          <article-title>V-SMART-Join: A scalable MapReduce framework for all-pair similarity joins of multisets and vectors</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>5</volume>
          (
          <issue>8</issue>
          ):
          <volume>704</volume>
          –
          <fpage>715</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vernica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Carey</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Efficient parallel set-similarity joins using MapReduce</article-title>
          .
          <source>In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data</source>
          , pages
          <volume>495</volume>
          –
          <fpage>506</fpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. X.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Efficient similarity joins for near-duplicate detection</article-title>
          .
          <source>ACM Transactions on Database Systems (TODS)</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ):
          <fpage>15</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>