<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Scaling Up Record-level Matching Rules</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Luca</forename><surname>Gagliardelli</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
								<address>
									<settlement>Modena</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Simonini</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
								<address>
									<settlement>Modena</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Sonia</forename><surname>Bergamaschi</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">University of Modena and Reggio Emilia</orgName>
								<address>
									<settlement>Modena</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Scaling Up Record-level Matching Rules</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">7EC346F39151766F7F501D6532889D71</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T13:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Integration</term>
					<term>Entity Resolution</term>
					<term>Parallel similarity join</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Record-level matching rules are chains of similarity join predicates on multiple attributes, employed to join records that refer to the same real-world object when no explicit foreign key is available on the data sets at hand. They are widely employed by data scientists and practitioners who work with data lakes, open data, and data in the wild. In this work we present a novel technique to efficiently execute record-level matching rules on parallel and distributed systems, and we demonstrate its efficiency on a real-world data set.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Combining data sets that bear information about the same real-world objects is an everyday task for practitioners who work with structured and semi-structured data. Frequently (e.g., when dealing with data lakes or when integrating open data with proprietary data) data sets do not have explicit keys that can be used for a traditional equi-join <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b12">13,</ref><ref type="bibr" target="#b6">7,</ref><ref type="bibr" target="#b8">9]</ref>. When that happens, a common solution is to perform a similarity join <ref type="bibr" target="#b9">[10]</ref>, i.e., to join records whose attribute values have a similarity above a certain threshold, according to a given similarity measure, as in the following example: Example 1. (Similarity Join) Given two product data sets, join all the record pairs with the Jaccard similarity of the product names above 0.8.</p><p>Furthermore, record-level matching rules can be used to represent decision trees <ref type="bibr" target="#b1">[2]</ref>, and hence can be learned with machine learning algorithms when training data is available. As a matter of fact, a decision tree for binary classification (i.e., classification of matching/non-matching records) can be naturally represented with DNF (disjunctive normal form) predicates; the same consideration holds for a forest of trees.</p><p>To the best of our knowledge, no techniques have been proposed to leverage distributed and parallel computing for scaling record-level matching rules. The benefit is twofold: (i) distributed computation makes it possible to scale to large data sets that cannot be handled by a single machine; (ii) parallel execution reduces the execution times (3 times faster in our experiments). As a matter of fact, being able to efficiently execute similarity joins is crucial when time is a critical component, e.g., when users are involved in the process. 
For instance, in exploratory search over a data lake <ref type="bibr" target="#b10">[11]</ref>, users typically look for related data sets, and low latency in performing similarity joins is required to enable the user's interactive exploration. Also, when debugging record-level matching rules, users typically try different configurations of similarity metrics, thresholds, and attributes; hence, enabling fast execution of such rules can significantly save the user's time. Contribution. In this work we present a technique that can employ different similarity measures to apply record-level matching rules efficiently. Moreover, we present an algorithm, RulER, to efficiently run record-level matching rules on MapReduce-like systems, taking full advantage of parallel and distributed computation.</p><p>The rest of the paper is organized as follows: Section 2 provides the preliminaries. Then, in Section 3 we present our novel technique. Section 4 shows the experimental results. Finally, in Section 5 we draw the conclusions.</p></div>
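Example 1 can be made concrete with a short sketch. The following Python snippet (our illustration; the paper's implementation is in Scala on Spark, and all names here are hypothetical) performs a naïve nested-loop Jaccard similarity join on a product-name attribute:

```python
def jaccard(a, b):
    # Jaccard similarity of two token sets: |a ∩ b| / |a ∪ b|.
    return len(a & b) / len(a | b) if (a or b) else 1.0

def similarity_join(left, right, attribute, threshold):
    """Naive nested-loop similarity join: emit index pairs whose
    attribute token sets have Jaccard similarity >= threshold.
    Quadratic in the input sizes; Section 2.2 discusses the filters
    that avoid this cost."""
    pairs = []
    for i, r in enumerate(left):
        for j, s in enumerate(right):
            a = set(str(r[attribute]).lower().split())
            b = set(str(s[attribute]).lower().split())
            if jaccard(a, b) >= threshold:
                pairs.append((i, j))
    return pairs
```

For instance, joining two small product lists on "name" with threshold 0.8 returns only the pair whose names share (almost) all tokens.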
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries and Related Work</head><p>This section describes the fundamental concepts and the related work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Matching Rule</head><p>We define a matching rule R as a disjunction (logical OR) of conjunctions (logical AND) of similarity join predicates on multiple attributes (i.e., at the record level). This design choice is driven by the fact that DNF matching rules are easy to read, and thus to debug, in practice. Moreover, DNFs can be employed to represent the trained model of a decision tree (or of a random forest), and are hence suitable for exploiting labeled data. In this work, we focus on how to scale DNF matching rules; we do not investigate how to generate good DNFs (i.e., decision trees/random forests) starting from training data, which is the focus of Ardalan et al. <ref type="bibr" target="#b1">[2]</ref>. Note that Ardalan et al. do not investigate the parallel and distributed execution of matching rules.</p></div>
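As an illustration of the DNF structure just described, the following sketch (a hypothetical representation of ours, not the paper's API) encodes a matching rule as a list of conjunctions of (attribute, similarity, threshold) predicates and evaluates it on a record pair; Python's any()/all() naturally mirror the OR/AND semantics:

```python
def jaccard(a, b):
    # Jaccard similarity of two token sets.
    return len(a & b) / len(a | b) if (a or b) else 1.0

def tokens(record, attr):
    # Whitespace tokenization; other element generators (n-grams, ...) fit too.
    return set(str(record[attr]).lower().split())

def matches(rule, r, s):
    """True iff at least one conjunction has all its predicates satisfied.
    any()/all() short-circuit, mirroring the OR/AND structure of the DNF."""
    return any(
        all(sim(tokens(r, attr), tokens(s, attr)) >= t
            for attr, sim, t in conjunction)
        for conjunction in rule
    )
```

A trained decision tree for match/non-match classification maps onto this shape directly: each root-to-"match" path becomes one conjunction.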
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Set Similarity Join</head><p>A record r i , identified by a unique identifier, is considered as a set of elements. Different techniques can be employed to generate the elements from the values of a record: for example, each word can be considered as a token, or n-grams can be generated, etc. Formally, given a collection of records, a similarity function sim and a similarity threshold t, the goal of set similarity join is to find all the pairs of records r i , r j such that sim(r i , r j ) ≥ t.</p><p>A naïve solution to perform the set similarity join is to enumerate and compare every pair of records, but this process is highly inefficient and not feasible in the Big Data context. To reduce the task complexity, different approaches have been proposed in the literature <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b2">3,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b14">15]</ref>. All these approaches adopt a filter-verification scheme: (1) first, an index is used to obtain a set of pre-candidates; (2) the pre-candidates are filtered using a set of pre-defined filters; (3) the resulting candidate pairs are probed with the similarity function to generate the final results.</p><p>The most used filters are the prefix filter, the length filter, and the positional filter. All these filters can be adapted to work with different similarity measures: Dice, Cosine, Jaccard Similarity, Edit Distance, and Overlap Similarity <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>.</p><p>Prefix filter A key technique to perform the set similarity join efficiently is the prefix filter <ref type="bibr" target="#b3">[4]</ref>. 
First of all, given a collection of records (i.e., sets of elements), their elements are sorted according to a global order O, usually the document frequency of the tokens (i.e., how many documents contain each token), a heuristic that helps to reduce the number of comparisons <ref type="bibr" target="#b3">[4]</ref>. Then, for each sorted set, only the first π elements are considered, i.e., the prefix. A pair r i , r j can be safely pruned if their prefixes have no common elements. The prefix size depends on the similarity threshold and on the similarity function. For example, the prefix filter for the overlap similarity is defined as follows: given two sets r i and r j and an overlap threshold t, if |r i ∩ r j | ≥ t, then there is at least one common token within the π ri -prefix of r i and the π rj -prefix of r j , where π ri = |r i | − t + 1 and π rj = |r j | − t + 1.</p><p>An example of how the prefix filter works is reported in Figure <ref type="figure" target="#fig_0">1</ref>. The prefixes for overlap threshold t = 4 are highlighted in grey. Since the two prefixes do not share any token, the pair r i , r j can be pruned. The intuition is that the 3 remaining tokens to check can provide an overlap of at most 3, which is not enough to reach the requested threshold t. Length filter A filter that is commonly used in conjunction with the prefix filter is the length filter <ref type="bibr" target="#b0">[1]</ref>. Normalized similarity functions (e.g., Jaccard, Cosine, Dice, ED) depend on the set sizes, thus it is possible to exploit them to prune the pairs generated with the prefix filter. 
For the Jaccard Similarity the length filter is defined as follows: a set of elements r can reach Jaccard threshold t only with a set s of size lb r ≤ |s| ≤ ub r (lb r = t • |r|, ub r = |r|/t); for example, if |r| = 10 and t = 0.8, then 8 ≤ |s| ≤ 12 is required.</p><p>Positional filter The positional filter <ref type="bibr" target="#b15">[16]</ref> reasons over the matching positions of tokens in the prefix. Given a pair of sets of sorted elements, it checks the positions of their common tokens in the prefix; if the remaining tokens to check are not enough to reach the threshold, it prunes the pair. Since it needs to scan the tokens in the prefix, this filter is more expensive than the prefix and length filters, so usually it is applied only to the pairs that have already passed them.</p><p>An example of how the positional filter works is provided in Figure <ref type="figure" target="#fig_2">2</ref>. The pair r i , r j passes both the length and the prefix filters. The first match in r i occurs in position 1 (counting from 0), thus only 8 tokens of r j are left to match tokens of r i , and the pair can be filtered because it can never reach the requested threshold t = 9.  Prefix filter based set similarity join An example of how a prefix filter based set similarity join works is outlined in Figure <ref type="figure">3</ref>. Starting from a document collection, the documents are transformed into sets of elements (e.g., tokens, n-grams, etc.) and sorted according to a global order (1). Then, using the prefixes (highlighted in gray), an inverted index is built, i.e., the prefix index (2). From the prefix index, a set of pre-candidate pairs is built (3), i.e., each pair of profiles that appear together in at least one entry of the prefix index. The pre-candidate pairs are filtered using different filters (e.g., the length filter, the positional filter, etc.) that are fast to compute and allow discarding the pairs that cannot reach the threshold (4). 
Finally, the pairs that pass all the filters (i.e., the candidate pairs) are probed with the similarity function, and only those that have a similarity above the threshold are retained (5).</p></div>
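The filter-verification flow of Figure 3 can be sketched for a Jaccard self-join as follows. This is a simplified illustration under our own naming (the positional filter is omitted for brevity): records are sorted by ascending document frequency, probed through a prefix index, length-filtered, and finally verified.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    # Jaccard similarity of two token sets.
    return len(a & b) / len(a | b)

def ssjoin_jaccard(records, t):
    """Self set-similarity join with prefix + length filters and a final
    verification step; `records` is a list of non-empty token sets.
    Returns pairs (i, j), i < j, with jaccard(records[i], records[j]) >= t."""
    # (1) Global order: ascending document frequency (rare tokens first).
    df = defaultdict(int)
    for r in records:
        for tok in r:
            df[tok] += 1
    rank = {tok: k for k, tok in enumerate(sorted(df, key=lambda x: (df[x], x)))}
    sorted_recs = [sorted(r, key=rank.__getitem__) for r in records]

    # (2) Prefix index; (3)-(4) pre-candidates + length filter; (5) verify.
    index = defaultdict(list)   # token -> ids whose prefix contains it
    results = set()
    for i, r in enumerate(sorted_recs):
        n = len(r)
        # Jaccard prefix length: |r| - ceil(t * |r|) + 1
        prefix_len = n - math.ceil(t * n) + 1
        seen = set()            # avoid re-verifying duplicated pre-candidates
        for tok in r[:prefix_len]:
            for j in index[tok]:
                if j in seen:
                    continue
                seen.add(j)
                # Length filter: |s| must lie in [t * |r|, |r| / t].
                if not (t * n <= len(sorted_recs[j]) <= n / t):
                    continue
                # (5) Verification with the actual similarity function.
                if jaccard(records[i], records[j]) >= t:
                    results.add((j, i))
            index[tok].append(i)
    return results
```

Two identical 4-token records share their (length-1) prefixes and survive all filters, while an unrelated record is never even generated as a pre-candidate.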
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RulER</head><p>In this section, we present our method RulER to efficiently scale record-level matching rules over big data sets. For the sake of presentation, we describe the self-join version of the algorithm; adapting it for joining two different data sets is straightforward.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Baseline algorithm</head><p>Given a matching rule R, a naïve way to execute it is to process each predicate as a single similarity join and then intersect/merge the obtained results as required. In particular, we adopted two algorithms to perform the similarity joins: PPJoin <ref type="bibr" target="#b15">[16]</ref> and EDJoin <ref type="bibr" target="#b14">[15]</ref>. Both algorithms employ the prefix filter (see Section 2) to find candidate pairs. PPJoin is considered one of the best performing similarity join algorithms <ref type="bibr" target="#b9">[10]</ref>; moreover, its parallel implementation (i.e., Vernica Join) has been demonstrated to be one of the best performing</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 1 PPJoin/EDJoin</head><p>Input: R, collection of records to join Input: P, predicate that contains the join attribute, the threshold, and the elements pattern (i.e., tokens, n-grams, etc.) Output: C, the pairs of records that can satisfy P 1: RT ← getSortedElements(R, P) //Transforms the records into sets of sorted elements (i.e., tokens, n-grams, etc.) 2: I ← buildPrefixIndex(RT , P) //Builds the prefix index 3: C ← flatMap Bi ∈ I //For each entry of the prefix index 4:</p><p>for each rj, r k ∈ Bi (rj ≠ r k ) do 5:</p><p>if passLengthFilter(rj, r k , P) then 6:</p><p>if passPositionalFilter(rj, r k , P) then 7:</p><p>emit( rj, r k )</p><p>for distributed computing <ref type="bibr" target="#b5">[6]</ref>. It can work with different similarity measures such as Jaccard Similarity, Dice, and Cosine. EDJoin adapts the PPJoin concepts to work with the Edit Distance. We adapted both algorithms to work on Spark, as proposed in <ref type="bibr" target="#b13">[14]</ref> for PPJoin (i.e., Vernica Join).</p><p>The distributed algorithm to perform PPJoin/EDJoin is presented in Algorithm 1. First, the records are transformed into sets of sorted elements according to the predicate requirements (line 1). The prefix index (see Subsection 2.2) is built (line 2), generating an inverted index that groups all the records that share at least one token in their prefix. Then, the algorithm iterates over each entry B i of the prefix index, probing each pair of records r j , r k ∈ B i with the appropriate filters according to the predicate P.</p><p>Algorithm 2 outlines the baseline algorithm (i.e., JoinChain). A rule in DNF form is composed of different blocks P i of predicates that are in logical OR; each of these blocks contains one or more predicates p j that are in logical AND. First, the algorithm iterates over the P i blocks (line 2) and for each of them initializes a set of candidates C Pi (line 3). 
Then, each simple predicate p j ∈ P i is used to apply a similarity join on the record collection R according to its requirements (lines 6-9). The result of a P i block is given by the intersection of the results produced by each similarity join applied with a predicate p j ∈ P i (lines 10-13). The final candidate set is computed by merging the results of each P i block (line 14). Finally, the candidates are verified with a verify function that ensures that all the predicates are respected (line 15).</p></div>
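The set-algebra skeleton of Algorithm 2 can be sketched as follows; here `similarity_join` is a hypothetical stand-in for a full PPJoin/EDJoin run on one predicate, returning a set of record-id pairs:

```python
def join_chain(records, rule, similarity_join):
    """Baseline JoinChain sketch: one full similarity join per predicate,
    then set algebra. `rule` is a DNF list of conjunctions of predicates;
    `similarity_join(records, predicate)` returns a set of id pairs.
    Mirrors Algorithm 2: intersect within a conjunction (AND),
    union across conjunctions (OR)."""
    candidates = set()
    for conjunction in rule:                # blocks P_i in logical OR
        block = None
        for predicate in conjunction:       # predicates p_j in logical AND
            pairs = similarity_join(records, predicate)
            block = pairs if block is None else (block & pairs)
        if block:
            candidates |= block             # merge the P_i results
    return candidates
```

The intersection (`&`) and union (`|`) are exactly the operations that, on a MapReduce-like system, translate into the expensive shuffles discussed in Section 3.2.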
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2 JoinChain</head><p>Input: R collection of records to join Input: M matching rule in DNF form Output: M , the pairs of records that satisfy M 1: C ← {} 2: for each Pi ∈ M do //For each block of predicates in or 3: </p><formula xml:id="formula_0">CP i ← {} //Set</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">The RulER Algorithm</head><p>Algorithm 2 has three main drawbacks:</p><p>(i) the intersect operation (line 13) is expensive in MapReduce-like systems, because it generates data shuffling; (ii) each predicate is checked independently of the others. For example, given a matching rule M = (C1 ∧ C2) ∨ (C3 ∧ C4) in which each Cx is a similarity join predicate (e.g., Jaccard similarity of title ≥ 0.8), a pair r i , r j is probed with all the predicates even if it already fails/passes some of them. For example, if the pair passes (C3 ∧ C4) it is not necessary to probe it with (C1 ∧ C2); or, if it fails predicate C1, it is not necessary to probe it with C2. (iii) Vernica Join <ref type="bibr" target="#b13">[14]</ref>, employed in the implementation of the JoinChain algorithm (Algorithm 2, lines 7, 9), produces duplicates <ref type="bibr" target="#b5">[6]</ref> that have to be removed: if a pair of records appears in multiple prefix index entries, it is processed and emitted multiple times (Algorithm 1, lines 3-7).</p><p>We solved these problems in our RulER algorithm. The main intuition of RulER is to exploit the prefix indexes (one prefix index for each predicate of the matching rule) to build a graph structure, which is then employed to iterate over the records (the nodes of the graph), applying the rules efficiently and keeping only the candidates (the edges of the graph) that pass the whole rule. In other words, RulER adopts a record-based parallelization approach, in contrast to the existing algorithms, which adopt a prefix-based parallelization approach on a single predicate at a time.</p><p>The RulER matching rule execution algorithm is outlined in Algorithm 3. The algorithm takes as input a collection of records and a record-level matching rule M, and gives as output the set of record pairs that satisfy M. 
Recall that M is in DNF, i.e., it is composed of sets of predicates P j in logical or; each set P j contains predicates p k in logical and. First of all, the values of the attributes are converted into sets of elements (line 1) according to the matching rule requirements (e.g., n-grams, trigrams, tokens, etc.); then, the prefix indexes are built to find the candidate pairs (line 2); one prefix index is needed for each predicate p k of the matching rule. The prefix indexes are broadcast (line 3) so that they are available to each computational node (called worker). Then, each worker iterates over its portion of records (lines 5-6) and performs the following operations for each record r i . First, a set of candidates for r i is initialized as an empty set C ri (line 7). Second, for each set P j , a set of candidates C Pj is initialized as an empty set (lines 8-9), and for each p k ∈ P j the candidates C ri,p k that can match with r i are extracted using the prefix indexes (lines 10-11). Third, the candidates C ri,p k are pruned by removing those that already passed one of the previous P j sets of predicates (line 14), and those that did not pass the previous predicates p k ∈ P j (lines 15-16). Fourth, the retained candidates are probed with other filters that further improve the efficiency of the overall process (e.g., the length filter, the positional filter, etc. <ref type="bibr" target="#b15">[16,</ref><ref type="bibr" target="#b14">15]</ref>) according to the rule (line 18). Since p k is in logical and with the previous predicates, only the candidates that pass the filters are kept. The resulting candidates from P j are added to C ri (line 20). 
Finally, the candidates are verified (line 21).</p><p>Given a matching rule R = (C1 ∧ C2 ∧ C3) ∨ (C4 ∧ C5), in which each Cx is a similarity join predicate (e.g., Jaccard similarity of title ≥ 0.8), an example of how RulER executes R is outlined in Figure <ref type="figure" target="#fig_3">4</ref>. First, a prefix index is built on the basis of the predicates expressed in the matching rule R. Then, the index is distributed to each worker. Each worker iterates over each record in its partition, extracting the possible candidates from the prefix index. The rule is applied to each candidate. If multiple conjunctions are in or, it is possible to avoid computing the remaining ones when one of them is verified: e.g., for r1, r2 the conjunction (C1 ∧ C2 ∧ C3) is not evaluated, since the pair already passes (C4 ∧ C5). Conversely, when predicates are in and, it is possible to stop the computation when one of them fails: for example, for the pair r1, r3 , C2 fails, so C3 does not have to be computed.</p></div>
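The per-record evaluation with both short-circuits can be sketched as follows; `candidates_for` and `passes` are hypothetical stand-ins for the broadcast prefix-index lookup and the filter/verification step of Algorithm 3, and the distributed machinery is omitted:

```python
def ruler_evaluate(record_ids, rule, candidates_for, passes):
    """Record-based sketch of RulER's evaluation (Algorithm 3).
    `candidates_for(r, p)` plays the role of the broadcast prefix index
    for predicate p; `passes(r, s, p)` applies p's filters/verification."""
    results = set()
    for r in record_ids:
        matched = set()                  # C_ri: passed some earlier conjunction
        for conjunction in rule:
            surviving = None             # C_Pj: survivors of this conjunction
            for p in conjunction:
                # Prefix-index candidates, minus pairs that already passed a
                # previous conjunction (OR short-circuit, line 14).
                cand = candidates_for(r, p) - matched
                if surviving is not None:
                    cand &= surviving    # AND: must pass previous p_k (15-16)
                surviving = {s for s in cand if passes(r, s, p)}
                if not surviving:
                    break                # AND short-circuit: conjunction failed
            if surviving:
                matched |= surviving
        results |= {(r, s) for s in matched}
    return results
```

Note how a candidate that already satisfied one conjunction is subtracted before the next one is processed, and how an empty survivor set aborts the remaining predicates of a conjunction, matching the two skips illustrated in Figure 4.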
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Evaluation</head><p>In this section we evaluate RulER with respect to JoinChain (see Section 3). In particular, the experimental evaluation aims to answer the following questions: Q1: How does RulER perform with respect to JoinChain? (Section 4.2) Q2: How does RulER scale when varying the number of machines available for the record-level matching rule processing? (Section 4.3)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>All the experiments are performed on a ten-node cluster; each node has two Intel Xeon E5-2670v2 2.50 GHz CPUs (20 cores per node) and 128 GB of RAM, running Ubuntu 14.04. All the software is implemented in Scala 2.11.8 and available at <ref type="bibr" target="#b7">[8]</ref>.</p><p>To assess the performance of the state-of-the-art methods we reimplemented all of them for running on Apache Spark as well. We employ Apache Spark 2.1.0, running 3 executors on each node and reserving 30 GB of memory for the master node. We set the default parallelism to twice the number of cores, as suggested by best practice.</p><p>We employed the omdb data set <ref type="bibr" target="#b4">[5]</ref> to evaluate the performance of RulER and JoinChain. It contains 2.3 million records about movies collected from omdbapi.com. This data set has a good variety of attributes (e.g., title, cast, director, writer, plot, etc.) on which it is possible to define different kinds of record-level matching rules. Moreover, it contains a sufficient number of records to make it suitable for testing the performance with Spark.</p><p>The goal of our experiments is to compare the efficiency of the algorithms, not to find good matching rules. Moreover, with RulER it would be possible to define an order for the application of the predicates that minimizes the execution time and the number of performed comparisons, but we do not explore this aspect in this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">RulER vs JoinChain</head><p>The goal of this experiment is to compare the efficiency of RulER (Algorithm 3) and JoinChain (Algorithm 2). Both algorithms can be employed to apply a record-level matching rule M. In this experiment we apply the rules presented in Table <ref type="table" target="#tab_3">1</ref> on the omdb data set. Since both algorithms use the same functions to extract the elements from the records, to generate the prefix indexes, and to verify the candidate pairs, we analyze here only the time required to perform the join operation. All the experiments are performed on a single node.</p><p>Figure <ref type="figure" target="#fig_4">5</ref> reports the execution times of RulER and JoinChain with the rules reported in Table <ref type="table" target="#tab_3">1</ref>: RulER is always significantly faster than JoinChain, from 5× to 27× depending on the rule; moreover, it always performs fewer comparisons, as shown in Figure 6. We conclude that RulER is always faster than JoinChain. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">RulER scalability</head><p>Finally, we assess the scalability of RulER by varying the number of nodes in the cluster (1, 3, 5, 7, and 10 nodes). For this experiment we apply the rules described in Table <ref type="table" target="#tab_3">1</ref> on the omdb data set. Fig. <ref type="figure">7</ref>. Execution time and speedup of RulER with the rules reported in Table <ref type="table" target="#tab_3">1</ref>.</p><p>Figure <ref type="figure">7</ref> shows the scalability and the speedup of the whole process. For each rule, we observe at least a 50% reduction of the execution time from 1 to 3 nodes. Then, the execution times continuously decrease, reaching an overall speedup on 10 nodes of 5.0× for M 1 , 7.3× for M 2 , and 4.16× for M 3 . M 2 is the rule that benefits the most from the increase in worker nodes, because it performs more comparisons than the others, as shown in Table <ref type="table" target="#tab_3">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>In this work, we tackled the problem of efficiently executing record-level matching rules. We presented two solutions to perform them on parallel and distributed systems: a baseline (i.e., JoinChain) implemented by using existing solutions, and a novel approach (i.e., RulER) that optimizes the execution of the rules. We conducted a thorough experimental evaluation, demonstrating the efficiency of the proposed approach, which always outperforms the baseline solution in terms of execution time and number of comparisons. In the future, we plan to extend our system to automatically find the optimal execution order of the predicates that compose a rule.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Prefix filter.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. Positional filter example.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 4 .</head><label>4</label><figDesc>Fig. 4. RulER execution model: green cells represent rules that were executed and passed; red cells, rules that were executed and did not pass; grey cells, rules that were not executed.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Fig. 5 .</head><label>5</label><figDesc>Fig. 5. Execution times of RulER and JoinChain with the rules reported in Table1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>RulER is 24× faster with M 1 (Figure 5(a)), 5× faster with M 2 (Figure 5(b)), 7× faster with M 3 (Figure 5(c)), and 27× faster with M 4 (Figure 5(d)). Moreover, Figure 6 shows the number of comparisons performed by both algorithms. Also in this case, RulER works better than JoinChain, performing fewer comparisons: 21.6 • 10 6 fewer for M 1 (Figure 6(a)), 23.0 • 10 6 fewer for M 2 (Figure 6(b)), 3.8 • 10 6 fewer for M 3 (Figure 6(c)), and 45.1 • 10 6 fewer for M 4 (Figure 6(d)).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Pj do //For each predicate in logical and 11: Cr i ,p k ← I(p k , ri) //Gets the candidates from the prefix index 12: /*Removes candidates that already passed previous predicates in or 13: and those that did not pass previous predicates in and */ 14: Cr i ,p k ← Cr i ,p k − Cr i 15: if CP j = ∅ then 16: Cr i ,p k ← Cr i ,p k ∩ CP j</figDesc><table><row><cell cols="2">Algorithm 3 RulER</cell></row><row><cell cols="2">Input: R collection of records to join</cell></row><row><cell cols="2">Input: M matching rule in DNF</cell></row><row><cell cols="2">Output: C, the pairs of records that satisfy M</cell></row><row><cell cols="2">1: RT ← getElements(R, M)</cell></row><row><cell cols="2">2: I ← buildP ref ixIndexes(RT , M)</cell></row><row><cell cols="2">3: broadcast(I)</cell></row><row><cell cols="2">4: C ← {} //Candidate pairs</cell></row><row><cell cols="2">5: map partition part ∈ RT</cell></row><row><cell>6:</cell><cell>for each ri ∈ part do</cell></row><row><cell>7:</cell><cell>Cr i ← {} //Candidates for ri</cell></row><row><cell>8:</cell><cell>for each Pj ∈ M do //For each set of predicates in logical or</cell></row><row><cell>9:</cell><cell>CP j ← {} //Candidates that satisfy Pj</cell></row><row><cell cols="2">10: for each p k ∈ 17: /*Applies filters (length, positional, ...)*/</cell></row><row><cell>18:</cell><cell>CP</cell></row><row><cell cols="2">https://spark.apache.org/docs/2.1.0/programming-guide.html#</cell></row><row><cell cols="2">shuffle-operations</cell></row></table><note>j ← applyF ilters(ri, Cr i ,p k , p k ) 19: Cr i ← Cr i ∪ CP j 20: C.append(Cr i ) 21: return verif y(C, M)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 1 .</head><label>1</label><figDesc>Matching rules employed in the experiments. For each rule the number of candidate pairs obtained after the filters (i.e., prefix filter, length filter, positional filter, see Section 2) is reported, together with the number of final matching pairs. Fig. 6. Number of comparisons performed by RulER and JoinChain with the rules reported in Table 1.</figDesc><table><row><cell>Rule</cell></row></table><note>https://spark.apache.org/docs/latest/tuning.html</note></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Efficient exact set-similarity joins</title>
		<author>
			<persName><forename type="first">A</forename><surname>Arasu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ganti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaushik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 32nd international conference on Very large data bases</title>
				<meeting>the 32nd international conference on Very large data bases</meeting>
		<imprint>
			<publisher>VLDB Endowment</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="918" to="929" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Smurf: self-service string matching using random forests</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ardalan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Akella</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="278" to="291" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Scaling up all pairs similarity search</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Bayardo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th international conference on World Wide Web</title>
				<meeting>the 16th international conference on World Wide Web</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="131" to="140" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A primitive operator for similarity joins in data cleaning</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chaudhuri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ganti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kaushik</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">22nd International Conference on Data Engineering (ICDE&apos;06)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="5" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Das</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Doan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Suganthan G. C.</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gokhale</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Konda</surname></persName>
		</author>
		<ptr target="https://sites.google.com/site/anhaidgroup/projects/data" />
		<title level="m">The magellan data repository</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Set similarity joins on mapreduce: An experimental survey</title>
		<author>
			<persName><forename type="first">F</forename><surname>Fier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Augsten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bouros</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Leser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">C</forename><surname>Freytag</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1110" to="1122" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Sparker: Scaling entity resolution in spark</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gagliardelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Simonini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Beneventano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bergamaschi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">EDBT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="602" to="605" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Gagliardelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Simonini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bergamaschi</surname></persName>
		</author>
		<ptr target="https://github.com/Gaglia88/sparker" />
		<title level="m">Sparker: an entity resolution tool for apache spark</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Bigdedup: a big data integration toolkit for duplicate detection in industrial scenarios</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gagliardelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Simonini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bergamaschi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">25th International Conference on Transdisciplinary Engineering (TE2018)</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="1015" to="1023" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">An empirical evaluation of set similarity join techniques</title>
		<author>
			<persName><forename type="first">W</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Augsten</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bouros</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="636" to="647" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Data lake management: challenges and opportunities</title>
		<author>
			<persName><forename type="first">F</forename><surname>Nargesian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Pu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">C</forename><surname>Arocena</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1986" to="1989" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Scaling entity resolution: A loosely schema-aware approach</title>
		<author>
			<persName><forename type="first">G</forename><surname>Simonini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gagliardelli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bergamaschi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">V</forename><surname>Jagadish</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Inf. Syst</title>
		<imprint>
			<biblScope unit="volume">83</biblScope>
			<biblScope unit="page" from="145" to="165" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Schema-agnostic progressive entity resolution</title>
		<author>
			<persName><forename type="first">G</forename><surname>Simonini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Papadakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Palpanas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bergamaschi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE TKDE</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="1208" to="1221" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Efficient parallel set-similarity joins using mapreduce</title>
		<author>
			<persName><forename type="first">R</forename><surname>Vernica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Carey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2010 ACM SIGMOD International Conference on Management of data</title>
				<meeting>the 2010 ACM SIGMOD International Conference on Management of data</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="495" to="506" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Ed-join: an efficient algorithm for similarity joins with edit distance constraints</title>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PVLDB</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="933" to="944" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Efficient similarity joins for near-duplicate detection</title>
		<author>
			<persName><forename type="first">C</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM TODS</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page">15</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
