<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scaling Up Record-level Matching Rules</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Gagliardelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Record-level matching rules are chains of similarity join predicates on multiple attributes, employed to join records that refer to the same real-world object when an explicit foreign key is not available on the data sets at hand. They are widely employed by data scientists and practitioners who work with data lakes, open data, and data in the wild. In this work we present a novel technique that allows to efficiently execute record-level matching rules on parallel and distributed systems, and demonstrate its efficiency on a real-world data set.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Integration</kwd>
        <kwd>Entity Resolution</kwd>
        <kwd>Parallel similarity join</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Combining data sets that bear information about the same real-world objects is an everyday task for practitioners who work with structured and semi-structured data. Frequently (e.g., when dealing with data lakes or when integrating open data with proprietary data), data sets do not have explicit keys that can be used for a traditional equi-join [
        <xref ref-type="bibr" rid="ref12 ref13 ref7 ref9">12,13,7,9</xref>
        ]. When that happens, a common solution is to perform a similarity join [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], i.e., to join records whose attribute values have a similarity above a certain threshold, according to a given similarity measure, as in the following example:
Example 1. (Similarity Join) Given two product data sets, join all the record pairs with the Jaccard similarity of the product names above 0.8.
      </p>
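      <p>Example 1 can be sketched in a few lines of Python (a toy, quadratic illustration, not the authors' implementation; the product names are invented):

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

def similarity_join(left, right, threshold=0.8):
    """Join all record pairs whose tokenized names reach the Jaccard threshold."""
    results = []
    for i, name_a in enumerate(left):
        tokens_a = set(name_a.lower().split())
        for j, name_b in enumerate(right):
            tokens_b = set(name_b.lower().split())
            if jaccard(tokens_a, tokens_b) >= threshold:
                results.append((i, j))
    return results

products_a = ["Canon EOS 5D Mark IV", "Nikon D850 Body"]
products_b = ["canon eos 5d mark iv", "Sony A7 III"]
print(similarity_join(products_a, products_b))  # [(0, 0)]
```

The nested loops make the quadratic cost of the problem explicit; the techniques discussed in the rest of the paper exist precisely to avoid enumerating all pairs.</p>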
      <p>
        A plethora of algorithms have been proposed in the last decades to efficiently execute the similarity join considering a single attribute, i.e., attribute-level matching rules (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for a survey). At their core, all these algorithms try to prune the candidate pairs of records on the basis of a single-attribute predicate, to alleviate the quadratic complexity of the problem.
      </p>
      <p>Interestingly, only a few works have focused on studying how to execute record-level matching rules, i.e., the combination of multiple similarity join predicates on multiple attributes (see Section 2). Yet, this kind of rule permits to specify more flexible matching conditions, as in the following example:
Example 2. (Record-level matching rule) Given two product data sets, join all the record pairs that have a Jaccard similarity of the product names above 0.8, or that have a Jaccard similarity of the descriptions above 0.6 and an edit distance of the manufacturers lower than 3.
Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.</p>
      <p>
        Furthermore, record-level matching rules can be used to represent decision trees [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], hence they can be learned with machine learning algorithms when training data is available. As a matter of fact, a decision tree for binary classification (i.e., classification of matching/non-matching records) can be naturally represented with DNF (disjunctive normal form) predicates; the same consideration holds for a forest of trees.
      </p>
      <p>
        To the best of our knowledge, no techniques have been proposed to leverage distributed and parallel computing for scaling record-level matching rules. The benefit is twofold: (i) distributed computation allows scaling to large data sets that cannot be handled with a single machine; (ii) parallel execution reduces the execution times (3 times faster in our experiments). As a matter of fact, being able to efficiently execute similarity joins is crucial when time is a critical component, e.g., when users are involved in the process. For instance, in exploratory search in a data lake [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], users typically look for related data sets, and low latency in performing similarity joins is required for enabling the user's interactive exploration. Also, when debugging record-level matching rules, users typically try different configurations of similarity metrics, thresholds, and attributes. Hence, enabling fast execution of such rules can significantly save the user's time.
      </p>
      <p>Contribution. In this work we present a technique that is able to use different similarity measures to apply record-level matching rules efficiently. Moreover, we present an algorithm, RulER, to efficiently run record-level matching rules on MapReduce-like systems, to take full advantage of parallel and distributed computation.</p>
      <p>The rest of the paper is organized as follows: Section 2 provides the preliminaries. Then, in Section 3 we present our novel technique. Section 4 shows the experimental results. Finally, in Section 5 we draw the conclusions.</p>
      <p>2 Preliminaries and Related Work</p>
      <p>This section describes the fundamental concepts and the related work.</p>
      <p>2.1 Matching Rule</p>
      <p>
        We define a matching rule R as a disjunction (logical OR) of conjunctions (logical AND) of similarity join predicates on multiple attributes (i.e., at the record level). This design choice is driven by the fact that DNF matching rules are easy to read, and thus to debug, in practice. Moreover, DNFs can be employed to represent the trained model of a decision tree (or of a random forest), hence they are suitable for exploiting labeled data. In this work, we focus on how to scale DNF matching rules and we do not investigate how to generate good DNFs (i.e., decision trees/random forests) starting from training data, which is the focus of Ardalan et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; Ardalan et al. do not investigate the parallel and distributed execution of matching rules.
      </p>
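      <p>In practice, such a DNF rule can be represented as a list (the disjunction) of lists (the conjunctions) of predicates; a minimal Python sketch, with attributes and thresholds mirroring Example 2 and purely illustrative:

```python
# A DNF matching rule: an OR-list of AND-lists of (attribute, threshold, measure).
rule = [
    [("title", 0.9, "JS")],                                   # first conjunction
    [("description", 0.6, "JS"), ("manufacturer", 3, "ED")],  # second conjunction
]

def satisfies(pair_measures, rule):
    """pair_measures maps (attribute, measure) to the value computed for a record pair.
    An ED predicate holds when the distance does not exceed the threshold;
    a similarity predicate holds when the similarity reaches it."""
    def pred_ok(attr, threshold, measure):
        value = pair_measures[(attr, measure)]
        return threshold >= value if measure == "ED" else value >= threshold
    return any(all(pred_ok(*p) for p in conjunction) for conjunction in rule)

measures = {("title", "JS"): 0.5, ("description", "JS"): 0.7, ("manufacturer", "ED"): 2}
print(satisfies(measures, rule))  # True: the second conjunction holds
```

The same nested-list structure can hold the paths of a trained decision tree, one conjunction per matching leaf.</p>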
      <p>2.2 Set Similarity Join</p>
      <p>A record ri is considered as a set of elements identified by a unique identifier. Different techniques can be employed to generate the elements from the values of a record: for example, each word can be considered as a token, or it is possible to generate the n-grams, etc. Formally, given a collection of records, a similarity function sim, and a similarity threshold t, the goal of set similarity join is to find all the possible pairs of records ⟨ri, rj⟩ such that sim(ri, rj) ≥ t.</p>
      <p>
        A naïve solution to perform the set similarity join is to enumerate and compare every pair of records, but this process is highly inefficient and not feasible in the Big Data context. To reduce the complexity of the task, different approaches have been proposed in the literature [
        <xref ref-type="bibr" rid="ref15 ref16 ref3 ref4">4,3,16,15</xref>
        ]. All of them adopt a filter-verification approach: (1) first, an index is used to obtain a set of pre-candidates; (2) the pre-candidates are filtered using a set of pre-defined filters; (3) the resulting candidate pairs are probed with the similarity function to generate the final results.
      </p>
      <p>
        The most used filters are: the prefix filter, the length filter, and the positional filter. All these filters can be adapted to work with different similarity measures: Dice, Cosine, Jaccard similarity, edit distance, and overlap similarity [
        <xref ref-type="bibr" rid="ref10 ref15 ref16">10,15,16</xref>
        ].
Prefix filter A key technique to perform the set similarity join efficiently is the prefix filter [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. First of all, given a collection of records (i.e., sets of elements), their elements are sorted according to a global order O, usually the document frequency of the tokens (i.e., how many documents contain each token), a heuristic that helps to reduce the number of comparisons [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Then, for each sorted set, only the first elements are considered, i.e., the prefixes. A pair ⟨ri, rj⟩ can be safely pruned if their prefixes have no common elements. The prefix size depends on the similarity threshold and the similarity function. For example, the prefix filter for the overlap similarity is defined as follows: given two sets, ri and rj, and an overlap threshold t, if |ri ∩ rj| ≥ t, then there is at least one common token within the πi-prefix of ri and the πj-prefix of rj, where πi = |ri| − t + 1 and πj = |rj| − t + 1.
      </p>
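      <p>For the overlap case, the prefix filter can be sketched as follows (a toy version; the sets are assumed to be already sorted by the global order O, and the element values are invented):

```python
def overlap_prefix(r, t):
    """Prefix of a sorted set for overlap threshold t: its first |r| - t + 1 elements."""
    return r[: len(r) - t + 1]

def can_be_pruned(ri, rj, t):
    """True when the pair cannot reach overlap t, i.e., the prefixes are disjoint."""
    return set(overlap_prefix(ri, t)).isdisjoint(overlap_prefix(rj, t))

# The situation of Figure 1: t = 4, prefixes [a, b, c] and [d, e] share no token.
ri = ["a", "b", "c", "x", "y", "z"]   # |ri| = 6, prefix of 6 - 4 + 1 = 3 elements
rj = ["d", "e", "x", "y", "z"]        # |rj| = 5, prefix of 5 - 4 + 1 = 2 elements
print(can_be_pruned(ri, rj, 4))  # True: pruning is safe, since the overlap is only 3
```
</p>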
      <p>An example of how the prefix filter works is reported in Figure 1. The prefixes for overlap threshold t = 4 are highlighted in grey. Since the two prefixes do not share any token, the pair ⟨ri, rj⟩ can be pruned. The intuition behind this is that the 3 remaining tokens to check can provide at most an overlap of 3, which is not enough to reach the requested threshold t.</p>
      <p>Fig. 1. Prefix filter example with t = 4: the prefixes of ri = [a, b, c, ...] and rj = [d, e, ...] share no token, so the pair is pruned.</p>
      <p>
        Length filter A filter that is commonly used in conjunction with the prefix filter is the length filter [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Normalized similarity functions (e.g., Jaccard, Cosine, Dice, ED) depend on the set size, thus it is possible to exploit it to prune the pairs generated with the prefix filter. For the Jaccard similarity, the length filter is defined as follows: a set of elements r can reach Jaccard threshold t only with a set s of size lbr ≤ |s| ≤ ubr, where lbr = t · |r| and ubr = |r| / t; for example, if |r| = 10 and t = 0.8, then 8 ≤ |s| ≤ 12 is required.
      </p>
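      <p>The Jaccard length bounds are a one-liner; a sketch (the integer rounding of the bounds is our assumption, the formulas are from the text):

```python
import math

def jaccard_length_bounds(r_size, t):
    """Size range [lb, ub] that a set s must satisfy to possibly reach
    Jaccard threshold t with a set of r_size elements: lb = t * |r|, ub = |r| / t."""
    return math.ceil(t * r_size), math.floor(r_size / t)

print(jaccard_length_bounds(10, 0.8))  # (8, 12), as in the example above
```
</p>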
      <p>
        Positional filter The positional filter [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] reasons over the matching positions of the tokens in the prefix. Given a pair of sets of sorted elements, it checks the positions of their common tokens in the prefix; if the remaining tokens to check are not enough to reach the threshold, it prunes the pair. Since it needs to scan the tokens in the prefix, this filter is more expensive than the prefix and length filters, so it is usually applied only on the pairs that have already passed them.
      </p>
      <p>An example of how the positional filter works is provided in Figure 2. The pair ⟨ri, rj⟩ passes both the length and the prefix filters. The first match in ri occurs at position 1 (counting from 0), thus only 8 tokens of rj are left to match tokens of ri, and the pair can be filtered because it can never reach the requested threshold t = 9.</p>
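      <p>A simplified variant of the positional filter can be sketched as follows (it bounds the overlap from the first common prefix token only, whereas the full filter of [16] scans all of them; the token values are invented):

```python
def passes_positional_filter(ri, rj, prefix_i, prefix_j, t):
    """Bound the maximum overlap from the position of the first common prefix token.

    ri and rj are lists sorted by the global order; prefix_i and prefix_j are the
    prefix lengths. Returns False when the pair can be safely pruned."""
    positions_j = {tok: p for p, tok in enumerate(rj[:prefix_j])}
    for pos_i, tok in enumerate(ri[:prefix_i]):
        if tok in positions_j:
            # One match so far, plus at most the tokens that follow it in each set.
            ubound = 1 + min(len(ri) - pos_i - 1, len(rj) - positions_j[tok] - 1)
            return ubound >= t
    return True  # no common prefix token: the prefix filter handles this case

# The situation of Figure 2: the first match is at position 1 of both prefixes.
ri = ["a", "f"] + ["x%d" % i for i in range(8)]   # 10 tokens
rj = ["b", "f"] + ["y%d" % i for i in range(7)]   # 9 tokens
print(passes_positional_filter(ri, rj, 2, 2, 9))  # False: the bound misses t = 9
```
</p>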
      <p>Fig. 2. Positional filter example with t = 9: ri has 10 tokens and rj has 9, and their first common prefix token appears at position 1 of both; at most min(8, 9) = 8 tokens can still match, and 8 &lt; 9, so the pair is filtered.</p>
      <p>Prefix filter based set similarity join An example of how a prefix filter based set similarity join works is outlined in Figure 3. Starting from a document collection, the documents are transformed into sets of elements (e.g., tokens, n-grams, etc.) and sorted according to a global order (1). Then, using the prefixes (highlighted in gray), an inverted index is built, i.e., the prefix index (2). From the prefix index, a set of pre-candidate pairs is built (3), i.e., each pair of profiles that appear together in at least one entry of the prefix index. The pre-candidate pairs are filtered using different filters (e.g., length filter, positional filter, etc.) that are fast to compute and allow discarding the pairs that cannot reach the threshold (4). Finally, the pairs that pass all the filters (i.e., the candidate pairs) are probed with the similarity function, and only those that have a similarity above the threshold are retained (5).</p>
      <p>Fig. 3. Overview of a prefix filter based set similarity join: (1) elements generation and sorting, (2) prefix index building, (3) pre-candidates generation, (4) filtering, (5) verification.</p>
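      <p>The five steps can be sketched end-to-end for the Jaccard case (a simplified, single-machine illustration; the Jaccard prefix length |r| - ceil(t·|r|) + 1 and the sample profiles are our assumptions):

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def jaccard(a, b):
    return len(a.intersection(b)) / len(a.union(b))

def similarity_self_join(profiles, t):
    # (1) Tokenize and sort each set by the global order (ascending document frequency).
    token_sets = [set(p.lower().split()) for p in profiles]
    df = Counter(tok for s in token_sets for tok in s)
    sorted_sets = [sorted(s, key=lambda tok: (df[tok], tok)) for s in token_sets]
    # (2) Build the prefix index (Jaccard prefix length: |r| - ceil(t|r|) + 1).
    index = defaultdict(set)
    for pid, r in enumerate(sorted_sets):
        for tok in r[: len(r) - math.ceil(t * len(r)) + 1]:
            index[tok].add(pid)
    # (3) Pre-candidates: pairs that share at least one prefix-index entry.
    pre_candidates = {pair for pids in index.values()
                      for pair in combinations(sorted(pids), 2)}
    # (4) Length filter, then (5) verification with the similarity function.
    results = set()
    for i, j in pre_candidates:
        if t * len(sorted_sets[i]) > len(sorted_sets[j]):
            continue
        if t * len(sorted_sets[j]) > len(sorted_sets[i]):
            continue
        if jaccard(token_sets[i], token_sets[j]) >= t:
            results.add((i, j))
    return results

profiles = ["blue cotton shirt", "blue cotton shirt slim", "red wool sweater"]
print(similarity_self_join(profiles, 0.7))  # {(0, 1)}
```

Only the pair sharing a prefix token is ever verified; the pairs involving the third profile are never generated.</p>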
      <sec id="sec-1-4">
        <title>3 The RulER Approach</title>
        <p>
          In this section, we present our method RulER to efficiently scale record-level matching rules over big data sets. For the sake of presentation, the algorithm is presented in its self-join version; adapting it for joining two different data sets is straightforward.
Given a matching rule R, a naïve solution to execute it is to process each predicate as a single similarity join and then intersect/merge the obtained results according to the requirements. In particular, we adopted two algorithms to perform the similarity joins: PPJoin [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and EDJoin [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Both algorithms employ the prefix filter (see Section 2) to find candidate pairs. PPJoin is considered one of the best performing similarity join algorithms [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and its parallel implementation (i.e., Vernica Join) has been demonstrated to be one of the best performing for distributed computing [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It can work with different similarity measures such as Jaccard similarity, Dice, and Cosine. EDJoin adapts the PPJoin concepts to work with the edit distance. We adapted both algorithms to work on Spark as proposed in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for PPJoin (i.e., Vernica Join).
        </p>
        <sec id="sec-1-4-1">
          <title>Algorithm 1 PPJoin/EDJoin</title>
          <p>
            Input: R, the collection of records to join
Input: P, a predicate containing the join attribute, the threshold, and the elements pattern (i.e., tokens, n-grams, etc.)
Output: C, the pairs of records that can satisfy P
1: RT ← getSortedElements(R, P) //Transforms the records into sets of sorted elements (i.e., tokens, n-grams, etc.)
2: I ← buildPrefixIndex(RT, P) //Builds the prefix index
3: C ← flatMap Bi ∈ I //For each entry of the prefix index
4:   for each ⟨rj, rk⟩ ∈ Bi (rj ≠ rk) do
5:     if passLengthFilter(rj, rk, P) then
6:       if passPositionalFilter(rj, rk, P) then
7:         emit(⟨rj, rk⟩)
          </p>
          <p>The distributed algorithm to perform PPJoin/EDJoin is presented in Algorithm 1. First, the records are transformed into sets of sorted elements according to the predicate requirements (line 1). The prefix index (see Subsection 2.2) is built by generating an inverted index that groups all the records that share at least one token in their prefix (line 2). Then, the algorithm iterates over each entry Bi of the prefix index, probing each pair of records ⟨rj, rk⟩ ∈ Bi with the appropriate filters according to the predicate P (lines 3-7).</p>
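          <p>A single-machine Python analogue of the flatMap step of Algorithm 1, with the two filters left as stand-in predicates (in the Spark version each index entry Bi is processed in parallel):

```python
from itertools import combinations

def ppjoin_candidates(prefix_index, pass_length_filter, pass_positional_filter):
    """Iterate over the prefix-index entries Bi and emit the surviving pairs.

    prefix_index maps a prefix token to the ids of the records containing it;
    the two filter arguments are stand-ins for the real length/positional filters."""
    for Bi in prefix_index.values():                 # line 3: flatMap over the entries
        for rj, rk in combinations(sorted(Bi), 2):   # line 4: each pair with rj != rk
            if pass_length_filter(rj, rk):           # line 5
                if pass_positional_filter(rj, rk):   # line 6
                    yield (rj, rk)                   # line 7

index = {"a": {1, 2}, "b": {2, 3}}
accept = lambda rj, rk: True
print(list(ppjoin_candidates(index, accept, accept)))  # [(1, 2), (2, 3)]
```

Note that a pair sharing several prefix tokens would be emitted once per shared entry, which is exactly the duplicate problem discussed later for JoinChain.</p>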
          <p>Algorithm 2 outlines the baseline algorithm (i.e., JoinChain). A rule in DNF form is composed of different blocks Pi of predicates that are in logical OR; each of these blocks contains one or more predicates pj that are in logical AND. First, the algorithm iterates over the Pi blocks (line 2) and for each of them initializes a set of candidates CPi (line 3). Then, each simple predicate pj ∈ Pi is used to apply a similarity join on the record collection R according to its requirements (lines 6-9). The result of a Pi block is given by the intersection of the results provided by each similarity join applied with the predicates pj ∈ Pi (lines 10-13). The final candidate set is computed by merging the results of each Pi block (line 14). In the end, the candidates are verified with a verify function that ensures that all the predicates are respected (line 15).</p>
          <p>Algorithm 2 JoinChain
Input: R collection of records to join
Input: M matching rule in DNF form
Output: C, the pairs of records that satisfy M
1: C ← {}
2: for each Pi ∈ M do //For each block of predicates in OR
3:   CPi ← {} //Set of candidate pairs for Pi
4:   for each pj ∈ Pi do //For each single predicate in AND
5:     Cpj ← {} //Set of candidate pairs for pj
6:     if pj.type = ED then
7:       Cpj ← EDJoin(R, pj) //Get candidate pairs with EDJoin
8:     else
9:       Cpj ← PPJoin(R, pj) //Get candidate pairs with PPJoin
10:    if CPi.isEmpty then
11:      CPi ← Cpj
12:    else
13:      CPi ← CPi ∩ Cpj //Intersect the candidates with the previous ones
14:  C ← C ∪ CPi //Merge the candidates with the previous ones
15: return verify(C, M)</p>
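          <p>The intersect/merge logic of JoinChain (Algorithm 2, lines 10-14) reduces, once the per-predicate joins have run, to plain set operations; a sketch with the PPJoin/EDJoin results stubbed as a dict of pair sets (the predicate ids and pairs are invented):

```python
def join_chain(rule, candidates_of):
    """rule: an OR-list of AND-lists of predicate ids.
    candidates_of maps a predicate id to the candidate-pair set produced by
    running PPJoin/EDJoin for that predicate alone (stubbed here)."""
    C = set()
    for conjunction in rule:                 # blocks of predicates in OR
        C_Pi = None
        for p in conjunction:                # predicates in AND
            C_pj = candidates_of[p]
            # Intersect with the candidates of the previous predicates (lines 10-13).
            C_Pi = C_pj if C_Pi is None else C_Pi.intersection(C_pj)
        C.update(C_Pi)                       # merge the block results (line 14)
    return C                                 # the final verify step is omitted here

cands = {"C1": {(1, 2), (1, 3)}, "C2": {(1, 2)}, "C3": {(4, 5)}}
print(join_chain([["C1", "C2"], ["C3"]], cands))  # {(1, 2), (4, 5)}
```
</p>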
          <p>3.2 The RulER Algorithm</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Drawbacks of JoinChain</title>
          <p>
            Algorithm 2 has three main drawbacks:
(i) the intersect operation (line 13) is expensive in MapReduce-like systems, because it generates shuffles;
(ii) each predicate is checked independently of the others. For example, given a matching rule M = (C1 ∧ C2) ∨ (C3 ∧ C4), in which each Cx is a similarity join predicate (e.g., Jaccard similarity of the title ≥ 0.8), a pair ⟨ri, rj⟩ is probed with all the predicates even if it already fails/passes one of them: if the pair passes (C3 ∧ C4), it is not necessary to probe it with (C1 ∧ C2); if it fails C1, it is not necessary to probe it with C2;
(iii) Vernica Join [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], employed in the implementation of the JoinChain algorithm (Algorithm 2, lines 7 and 9), produces duplicates [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] that have to be removed: if a pair of records appears in more than one prefix-index entry, it is processed and emitted multiple times (Algorithm 1, lines 3-7).
          </p>
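          <p>Drawback (ii) is the absence of short-circuiting. Per candidate pair, a lazy DNF evaluation avoids it; a sketch in which the predicate computations are stubbed by a counter to make the savings visible:

```python
def evaluate_dnf(rule, check):
    """Short-circuit evaluation of a DNF rule for a single candidate pair:
    any()/all() stop as soon as the outcome of the rule is decided."""
    return any(all(check(p) for p in conjunction) for conjunction in rule)

evaluated = []
def check(p):
    evaluated.append(p)          # track which predicates are actually computed
    return p in {"C3", "C4"}     # this pair satisfies only C3 and C4

print(evaluate_dnf([["C1", "C2"], ["C3", "C4"]], check))  # True
print(evaluated)  # ['C1', 'C3', 'C4']: C2 is skipped since C1 already failed
```
</p>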
          <p>We solved these problems in our RulER algorithm. The main intuition of RulER is to exploit the prefix indexes (one prefix index for each predicate of the matching rule) to build a graph structure, which is then employed to iterate over the records (the nodes of the graph), efficiently applying the rules and keeping only the candidates (the edges of the graph) that pass the whole rule. In other words, RulER adopts a record-based parallelization approach, in contrast to the existing algorithms, which adopt a prefix-based parallelization approach on a single predicate at a time.</p>
          <p>
            The RulER matching rule execution algorithm is outlined in Algorithm 3. The algorithm takes as input a collection of records and a record-level matching rule M, and gives as output the set of record pairs that satisfy M. Recall that M is in DNF, i.e., it is composed of sets of predicates Pj in logical OR, and each set Pj contains predicates pk in logical AND. First of all, the values of the attributes are converted into sets of elements (line 1) according to the matching rule requirements (e.g., tokens, n-grams, etc.); then the prefix indexes are built to find the candidate pairs (line 2): one prefix index is needed for each predicate pk of the matching rule. The prefix indexes are sent in broadcast (line 3) to be available to each computational node (called worker). Then, each worker iterates over its portion of records (lines 5-6), and performs the following operations for each record ri. First, a set of candidates for ri is initialized as an empty set Cri (line 7). Second, for each set Pj, a set of candidates CPj is initialized as an empty set (lines 8-9), and for each pk ∈ Pj the candidates Cri,pk that can match with ri are extracted using the prefix indexes (lines 10-11). Third, the candidates Cri,pk are pruned by removing those that already passed one of the previous Pj sets of predicates (line 14), and those that did not pass previous pk ∈ Pj predicates (lines 15-16). Fourth, the retained candidates are probed with the other filters that further improve the efficiency of the overall process (e.g., length filter, positional filter, etc. [
            <xref ref-type="bibr" rid="ref15 ref16">16,15</xref>
            ]) according to the rule (line 18). Since pk is in logical AND with the previous predicates, only the candidates that pass the filters are kept. The resulting candidates from Pj are added to Cri (line 20). Finally, the candidates are verified (line 21).
          </p>
          <p>(On shuffle operations, see https://spark.apache.org/docs/2.1.0/programming-guide.html#shuffle-operations)</p>
          <p>Algorithm 3 RulER
Input: R collection of records to join
Input: M matching rule in DNF
Output: C, the pairs of records that satisfy M
1: RT ← getElements(R, M)
2: I ← buildPrefixIndexes(RT, M)
3: broadcast(I)
4: C ← {} //Candidate pairs
5: map partition part ∈ RT
6:   for each ri ∈ part do
7:     Cri ← {} //Candidates for ri
8:     for each Pj ∈ M do //For each set of predicates in logical OR
9:       CPj ← {} //Candidates that satisfy Pj
10:      for each pk ∈ Pj do //For each predicate in logical AND
11:        Cri,pk ← I(pk, ri) //Gets the candidates from the prefix index
12:        /*Removes candidates that already passed previous predicates in OR
13:        and those that did not pass previous predicates in AND*/
14:        Cri,pk ← Cri,pk − Cri
15:        if CPj ≠ ∅ then
16:          Cri,pk ← Cri,pk ∩ CPj
17:        /*Applies filters (length, positional, ...)*/
18:        CPj ← applyFilters(ri, Cri,pk, pk)
19:      //End of the loop over pk
20:    Cri ← Cri ∪ CPj //Adds the candidates that satisfy Pj
21: return verify(C, M)</p>
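          <p>The record-based parallelization of Algorithm 3 can be mimicked on a single machine as follows (a simplified sketch: the prefix indexes are plain dicts from predicate to record-to-candidates maps, and applyFilters/verify are stand-ins):

```python
def ruler_partition(records, rule, prefix_indexes, apply_filters, verify):
    """Process one partition of records as a worker does in Algorithm 3 (lines 5-21).

    prefix_indexes maps each predicate to a dict from record id to candidate ids;
    apply_filters and verify are stand-ins for the real filtering/verification."""
    C = set()
    for ri in records:                            # line 6
        C_ri = set()                              # line 7
        for Pj in rule:                           # line 8: conjunctions in OR
            C_Pj = set()                          # line 9
            for k, pk in enumerate(Pj):           # line 10: predicates in AND
                cands = set(prefix_indexes[pk].get(ri, set()))   # line 11
                cands = cands.difference(C_ri)    # line 14: passed a previous Pj
                if k > 0:
                    cands = cands.intersection(C_Pj)  # lines 15-16
                C_Pj = apply_filters(ri, cands, pk)   # line 18
            C_ri.update(C_Pj)                     # line 20
        C.update((ri, rj) for rj in C_ri)
    return verify(C)                              # line 21

indexes = {"C1": {1: {2, 3}}, "C2": {1: {2}}}
no_op = lambda ri, cands, pk: cands
print(ruler_partition([1], [["C1", "C2"]], indexes, no_op, lambda C: C))  # {(1, 2)}
```

Candidate 3 is discarded as soon as it misses the C2 index entry, without ever being verified, which is the short-circuiting that JoinChain lacks.</p>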
          <p>Given a matching rule R = (C1 ∧ C2 ∧ C3) ∨ (C4 ∧ C5), in which each Cx is a similarity join predicate (e.g., Jaccard similarity of the title ≥ 0.8), an example of how RulER executes R is outlined in Figure 4. First, a prefix index is built on the basis of the predicates expressed in the matching rule R. Then, the index is distributed to each worker. Each worker iterates over each record in its partition, extracting the possible candidates from the prefix index. The rules are applied to each candidate. If multiple conjunctions are in OR, it is possible to avoid computing the other conjunctions when one of them is verified, e.g., for ⟨r1, r2⟩ the conjunction (C1 ∧ C2 ∧ C3) is not evaluated since the pair already passes (C4 ∧ C5). Conversely, if multiple predicates are in AND, it is possible to avoid the computation when one of them fails, e.g., for the pair ⟨r1, r3⟩, C2 fails, so C3 does not have to be computed.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental Evaluation</title>
      <p>In this section we evaluate RulER with respect to JoinChain (see Section 3). In
particular, the experimental evaluation aims to answer the following questions:</p>
      <p>Fig. 4. Example of how RulER executes a matching rule: the master builds the prefix indexes (one per predicate C1-C5) and broadcasts them to the workers, which extract the candidates for their records and emit the matches (e.g., ⟨r1, r2⟩ and ⟨r1, r4⟩).</p>
      <p>Q 1: What is the performance of RulER in terms of execution time compared to JoinChain (i.e., the naïve solution)? (Section 4.2)
Q 2: How does RulER scale when varying the number of machines available for the record-level matching rule processing? (Section 4.3)</p>
      <p>4.1 Experimental Setup</p>
      <p>
        All the experiments are performed on a ten-node cluster; each node has two Intel Xeon E5-2670v2 2.50 GHz CPUs (20 cores per node) and 128 GB of RAM, running Ubuntu 14.04. All the software is implemented in Scala 2.11.8 and is available at [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To assess the performance of the state-of-the-art meta-blocking methods we reimplemented all of them for running on Apache Spark as well. We employ Apache Spark 2.1.0, running 3 executors on each node, reserving 30 GB of memory for the master node. We set the default parallelism to twice the number of cores, as suggested by best practice.
      </p>
      <p>Fig. 5. Execution times of RulER and JoinChain with the rules reported in Table 1.</p>
      <p>
        We employed the omdb data set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to evaluate the performance of RulER and JoinChain. It contains 2.3 million records about movies collected from omdbapi.com. This data set has a good variety of attributes (e.g., title, cast, director, writer, plot, etc.) on which it is possible to define different kinds of record-level matching rules. Moreover, it contains a sufficient number of records to make it suitable for testing the performance with Spark.
      </p>
      <p>The goal of our experiments is to compare the efficiency of the algorithms, not to find good matching rules. Moreover, with RulER it would be possible to define an order of application of the predicates that minimizes the execution time and the number of performed comparisons, but we do not explore this aspect in this work.</p>
      <p>4.2 RulER vs JoinChain</p>
      <p>The goal of this experiment is to compare the efficiency of RulER (Algorithm 3) and JoinChain (Algorithm 2). Both algorithms can be employed to apply a record-level matching rule M. In this experiment we apply the rules presented in Table 1 on the omdb data set. Since both algorithms use the same functions to extract the elements from the records, to generate the prefix indexes, and to verify the candidate pairs, we analyze here only the time required to perform the join operation. All the experiments are performed on a single node.</p>
      <p>Figure 5 reports the execution times of RulER and JoinChain with the rules reported in Table 1. RulER is always significantly faster than JoinChain: 24× with M1 (Figure 5(a)), 5× with M2 (Figure 5(b)), 7× with M3 (Figure 5(c)), and 27× with M4 (Figure 5(d)). Moreover, Figure 6 shows the number of comparisons performed by both algorithms. Also in this case, RulER works better than JoinChain, performing fewer comparisons: 21.6 × 10^6 fewer for M1 (Figure 6(a)), 23.0 × 10^6 fewer for M2 (Figure 6(b)), 3.8 × 10^6 fewer for M3 (Figure 6(c)), and 45.1 × 10^6 fewer for M4 (Figure 6(d)).</p>
      <p>We conclude that RulER is always faster than JoinChain.</p>
      <p>Table 1. The matching rules employed in the experiments (JS = Jaccard similarity threshold, ED = edit distance threshold), with the number of candidates and matches on the omdb data set.
Rule | Candidates | Matches
M1: (Title, 0.9, JS) ∧ (Cast, 0.8, JS) | 1725885 | 53023
M2: (Title, 0.9, JS) ∨ (Cast, 0.8, JS) | 990278774 | 253593213
M3: ((Cast, 0.9, JS) ∧ (Director, 0.9, JS) ∧ (Writer, 0.9, JS)) ∨ (Title, 0.9, JS) | 61987445 | 34798235
M4: (Director, 0.9, JS) ∧ (Title, 2, ED) | 1133134 | 252935</p>
      <p>Finally, we assess the scalability of RulER by varying the number of nodes in the cluster (1, 3, 5, 7 and 10 nodes). For this experiment, we apply the rules described in Table 1 on the omdb data set.</p>
      <p>[Figure: scalability results of RulER, reporting the speedup (panels (a)-(c)) and the execution time in minutes (panels (d)-(f)) when varying the number of nodes (1, 3, 5, 7, 10).]</p>
      <p>5 Conclusion</p>
      <p>In this work, we tackled the problem of executing record-level matching rules. We presented two solutions to perform them on parallel and distributed systems: a baseline one (i.e., JoinChain) implemented by using existing solutions, and a novel approach (i.e., RulER) that optimizes the execution of the rules. We conducted a thorough experimental evaluation, demonstrating the efficiency of the proposed approach, which always outperforms the baseline solution in terms of execution time and number of comparisons. In the future, we plan to extend our system to automatically find the optimal execution order of the predicates that compose a rule.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arasu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaushik</surname>
          </string-name>
          , R.:
          <article-title>Efficient exact set-similarity joins</article-title>
          .
          <source>In: Proceedings of the 32nd international conference on Very large data bases</source>
          . pp.
          <volume>918</volume>
          -
          <fpage>929</fpage>
          .
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ardalan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Smurf: self-service string matching using random forests</article-title>
          .
          <source>PVLDB</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <volume>278</volume>
          -
          <fpage>291</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bayardo</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Srikant</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          :
          <article-title>Scaling up all pairs similarity search</article-title>
          .
          <source>In: Proceedings of the 16th international conference on World Wide Web</source>
          . pp.
          <volume>131</volume>
          -
          <fpage>140</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaushik</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A primitive operator for similarity joins in data cleaning</article-title>
          .
          <source>In: 22nd International Conference on Data Engineering (ICDE'06)</source>
          . pp.
          <fpage>5</fpage>
          -
          <lpage>5</lpage>
          .
          IEEE
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>G. C.</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gokhale</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konda</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Magellan data repository</article-title>
          . https://sites.google.com/site/anhaidgroup/projects/data
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fier</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouros</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freytag</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Set similarity joins on MapReduce: An experimental survey</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>11</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1110</fpage>
          -
          <lpage>1122</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beneventano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SparkER: scaling entity resolution in Spark</article-title>
          .
          <source>In: EDBT 2019</source>
          . pp.
          <fpage>602</fpage>
          -
          <lpage>605</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SparkER: an entity resolution tool for Apache Spark</article-title>
          (
          <year>2017</year>
          ), https://github.com/Gaglia88/sparker
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>BigDedup: a big data integration toolkit for duplicate detection in industrial scenarios</article-title>
          .
          <source>In: 25th International Conference on Transdisciplinary Engineering (TE2018)</source>
          . vol.
          <volume>7</volume>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1023</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouros</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An empirical evaluation of set similarity join techniques</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>9</volume>
          (
          <issue>9</issue>
          ),
          <fpage>636</fpage>
          -
          <lpage>647</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nargesian</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pu</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arocena</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          :
          <article-title>Data lake management: challenges and opportunities</article-title>
          .
          <source>PVLDB</source>
          <volume>12</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1986</fpage>
          -
          <lpage>1989</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          :
          <article-title>Scaling entity resolution: A loosely schema-aware approach</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>83</volume>
          ,
          <fpage>145</fpage>
          -
          <lpage>165</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palpanas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Schema-agnostic progressive entity resolution</article-title>
          .
          <source>IEEE TKDE</source>
          <volume>31</volume>
          (
          <issue>6</issue>
          )
          ,
          <fpage>1208</fpage>
          -
          <lpage>1221</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Vernica</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carey</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Efficient parallel set-similarity joins using MapReduce</article-title>
          .
          <source>In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data</source>
          . pp.
          <fpage>495</fpage>
          -
          <lpage>506</lpage>
          .
          ACM
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Ed-join: an efficient algorithm for similarity joins with edit distance constraints</article-title>
          .
          <source>PVLDB</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>933</fpage>
          -
          <lpage>944</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Efficient similarity joins for near-duplicate detection</article-title>
          .
          <source>ACM TODS</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          )
          ,
          <fpage>15</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>