<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scaling Up Record-level Matching Rules</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Gagliardelli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Simonini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Bergamaschi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Record-level matching rules are chains of similarity join predicates on multiple attributes, employed to join records that refer to the same real-world object when an explicit foreign key is not available on the data sets at hand. They are widely employed by data scientists and practitioners who work with data lakes, open data, and data in the wild. In this work we present a novel technique that allows to efficiently execute record-level matching rules on parallel and distributed systems, and demonstrate its efficiency on a real-world data set.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Integration</kwd>
        <kwd>Entity Resolution</kwd>
        <kwd>Parallel similarity join</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Combining data sets that bear information about the same real-world objects is an everyday task for practitioners who work with structured and semi-structured data. Frequently (e.g., when dealing with data lakes or when integrating open data with proprietary data), data sets do not have explicit keys that can be used for a traditional equi-join [
        <xref ref-type="bibr" rid="ref12 ref13 ref7 ref9">12,13,7,9</xref>
        ]. When that happens, a common solution is to perform a similarity join [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], i.e., to join records whose attribute values have a similarity above a certain threshold, according to a given similarity measure, as in the following example:
Example 1. (Similarity Join) Given two product data sets, join all the record pairs with the Jaccard similarity of the product names above 0.8.
      </p>
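      <p>Example 1 can be sketched in a few lines of Python (a toy, quadratic illustration, not the authors' implementation; the product names are invented):

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

def similarity_join(left, right, threshold=0.8):
    """Join all record pairs whose tokenized names reach the Jaccard threshold."""
    results = []
    for i, name_a in enumerate(left):
        tokens_a = set(name_a.lower().split())
        for j, name_b in enumerate(right):
            tokens_b = set(name_b.lower().split())
            if jaccard(tokens_a, tokens_b) >= threshold:
                results.append((i, j))
    return results

products_a = ["Canon EOS 5D Mark IV", "Nikon D850 Body"]
products_b = ["canon eos 5d mark iv", "Sony A7 III"]
print(similarity_join(products_a, products_b))  # [(0, 0)]
```

The nested loops make the quadratic cost of the problem explicit; the techniques discussed in the rest of the paper exist precisely to avoid enumerating all pairs.</p>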
      <p>
        A plethora of algorithms have been proposed in the last decades to efficiently execute the similarity join considering a single attribute, i.e., attribute-level matching rules (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for a survey). At their core, all these algorithms try to prune the candidate pairs of records on the basis of a single-attribute predicate, to alleviate the quadratic complexity of the problem.
      </p>
      <p>Interestingly, only a few works have focused on studying how to execute record-level matching rules, i.e., the combination of multiple similarity join predicates on multiple attributes (see Section 2). Yet, this kind of rule permits to specify more flexible matching conditions, as in the following example:
Example 2. (Record-level matching rule) Given two product data sets, join all the record pairs that have a Jaccard similarity of the product names above 0.8, or that have a Jaccard similarity of the descriptions above 0.6 and an edit distance of the manufacturers lower than 3.
Copyright (c) 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). This volume is published and copyrighted by its editors. SEBD 2020, June 21-24, 2020, Villasimius, Italy.</p>
      <p>
        Furthermore, record-level matching rules can be used to represent decision trees [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], hence they can be learned with machine learning algorithms when training data is available. As a matter of fact, a decision tree for binary classification (i.e., classification of matching/non-matching records) can be naturally represented with DNF (disjunctive normal form) predicates; the same consideration holds for a forest of trees.
      </p>
      <p>
        To the best of our knowledge, no techniques have been proposed to leverage distributed and parallel computing for scaling record-level matching rules. The benefit is twofold: (i) distributed computation allows scaling to large data sets that cannot be handled with a single machine; (ii) parallel execution reduces the execution times (3 times faster in our experiments). As a matter of fact, being able to efficiently execute similarity joins is crucial when time is a critical component, e.g., when users are involved in the process. For instance, in exploratory search in a data lake [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], users typically look for related data sets, and low latency in performing similarity joins is required for enabling the user's interactive exploration. Also, when debugging record-level matching rules, users typically try different configurations of similarity metrics, thresholds, and attributes. Hence, enabling fast execution of such rules can significantly save the user's time.
      </p>
      <p>Contribution. In this work we present a technique that is able to use different similarity measures to apply record-level matching rules efficiently. Moreover, we present an algorithm, RulER, to efficiently run record-level matching rules on MapReduce-like systems, to take full advantage of parallel and distributed computation.</p>
      <p>The rest of the paper is organized as follows: Section 2 provides the preliminaries. Then, in Section 3 we present our novel technique. Section 4 shows the experimental results. Finally, in Section 5 we draw the conclusions.</p>
      <p>2 Preliminaries and Related Work</p>
      <p>This section describes the fundamental concepts and the related work.</p>
      <p>2.1 Matching Rule</p>
      <p>
        We define a matching rule R as a disjunction (logical OR) of conjunctions (logical AND) of similarity join predicates on multiple attributes (i.e., at the record level). This design choice is driven by the fact that DNF matching rules are easy to read, and thus to debug, in practice. Moreover, DNFs can be employed to represent the trained model of a decision tree (or of a random forest), hence they are suitable for exploiting labeled data. In this work, we focus on how to scale DNF matching rules and we do not investigate how to generate good DNFs (i.e., decision trees/random forests) starting from training data, which is the focus of Ardalan et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; Ardalan et al. do not investigate the parallel and distributed execution of matching rules.
      </p>
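      <p>In practice, such a DNF rule can be represented as a list (the disjunction) of lists (the conjunctions) of predicates; a minimal Python sketch, with attributes and thresholds mirroring Example 2 and purely illustrative:

```python
# A DNF matching rule: an OR-list of AND-lists of (attribute, threshold, measure).
rule = [
    [("title", 0.9, "JS")],                                   # first conjunction
    [("description", 0.6, "JS"), ("manufacturer", 3, "ED")],  # second conjunction
]

def satisfies(pair_measures, rule):
    """pair_measures maps (attribute, measure) to the value computed for a record pair.
    An ED predicate holds when the distance does not exceed the threshold;
    a similarity predicate holds when the similarity reaches it."""
    def pred_ok(attr, threshold, measure):
        value = pair_measures[(attr, measure)]
        return threshold >= value if measure == "ED" else value >= threshold
    return any(all(pred_ok(*p) for p in conjunction) for conjunction in rule)

measures = {("title", "JS"): 0.5, ("description", "JS"): 0.7, ("manufacturer", "ED"): 2}
print(satisfies(measures, rule))  # True: the second conjunction holds
```

The same nested-list structure can hold the paths of a trained decision tree, one conjunction per matching leaf.</p>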
      <p>2.2 Set Similarity Join</p>
      <p>A record ri is considered as a set of elements identified by a unique identifier. Different techniques can be employed to generate the elements from the values of a record: for example, each word can be considered as a token, or it is possible to generate the n-grams, etc. Formally, given a collection of records, a similarity function sim, and a similarity threshold t, the goal of set similarity join is to find all the possible pairs of records ⟨ri, rj⟩ such that sim(ri, rj) ≥ t.</p>
      <p>
        A naïve solution to perform the set similarity join is to enumerate and compare every pair of records, but this process is highly inefficient and not feasible in the Big Data context. To reduce the complexity of the task, different approaches have been proposed in the literature [
        <xref ref-type="bibr" rid="ref15 ref16 ref3 ref4">4,3,16,15</xref>
        ]. All of them adopt a filter-verification approach: (1) first, an index is used to obtain a set of pre-candidates; (2) the pre-candidates are filtered using a set of pre-defined filters; (3) the resulting candidate pairs are probed with the similarity function to generate the final results.
      </p>
      <p>
        The most used filters are: the prefix filter, the length filter, and the positional filter. All these filters can be adapted to work with different similarity measures: Dice, Cosine, Jaccard similarity, edit distance, and overlap similarity [
        <xref ref-type="bibr" rid="ref10 ref15 ref16">10,15,16</xref>
        ].
Prefix filter A key technique to perform the set similarity join efficiently is the prefix filter [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. First of all, given a collection of records (i.e., sets of elements), their elements are sorted according to a global order O, usually the document frequency of the tokens (i.e., how many documents contain each token), a heuristic that helps to reduce the number of comparisons [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Then, for each sorted set, only the first elements are considered, i.e., the prefixes. A pair ⟨ri, rj⟩ can be safely pruned if their prefixes have no common elements. The prefix size depends on the similarity threshold and the similarity function. For example, the prefix filter for the overlap similarity is defined as follows: given two sets, ri and rj, and an overlap threshold t, if |ri ∩ rj| ≥ t, then there is at least one common token within the πi-prefix of ri and the πj-prefix of rj, where πi = |ri| − t + 1 and πj = |rj| − t + 1.
      </p>
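      <p>For the overlap case, the prefix filter can be sketched as follows (a toy version; the sets are assumed to be already sorted by the global order O, and the element values are invented):

```python
def overlap_prefix(r, t):
    """Prefix of a sorted set for overlap threshold t: its first |r| - t + 1 elements."""
    return r[: len(r) - t + 1]

def can_be_pruned(ri, rj, t):
    """True when the pair cannot reach overlap t, i.e., the prefixes are disjoint."""
    return set(overlap_prefix(ri, t)).isdisjoint(overlap_prefix(rj, t))

# The situation of Figure 1: t = 4, prefixes [a, b, c] and [d, e] share no token.
ri = ["a", "b", "c", "x", "y", "z"]   # |ri| = 6, prefix of 6 - 4 + 1 = 3 elements
rj = ["d", "e", "x", "y", "z"]        # |rj| = 5, prefix of 5 - 4 + 1 = 2 elements
print(can_be_pruned(ri, rj, 4))  # True: pruning is safe, since the overlap is only 3
```
</p>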
      <p>An example of how the prefix filter works is reported in Figure 1. The prefixes for overlap threshold t = 4 are highlighted in grey. Since the two prefixes do not share any token, the pair ⟨ri, rj⟩ can be pruned. The intuition behind this is that the 3 remaining tokens to check can provide at most an overlap of 3, which is not enough to reach the requested threshold t.</p>
      <p>Fig. 1. Prefix filter example with t = 4: the prefixes of ri = [a, b, c, ...] and rj = [d, e, ...] share no token, so the pair is pruned.</p>
      <p>
        Length filter A filter that is commonly used in conjunction with the prefix filter is the length filter [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Normalized similarity functions (e.g., Jaccard, Cosine, Dice, ED) depend on the set size, thus it is possible to exploit it to prune the pairs generated with the prefix filter. For the Jaccard similarity, the length filter is defined as follows: a set of elements r can reach Jaccard threshold t only with a set s of size lbr ≤ |s| ≤ ubr, where lbr = t · |r| and ubr = |r| / t; for example, if |r| = 10 and t = 0.8, then 8 ≤ |s| ≤ 12 is required.
      </p>
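      <p>The Jaccard length bounds are a one-liner; a sketch (the integer rounding of the bounds is our assumption, the formulas are from the text):

```python
import math

def jaccard_length_bounds(r_size, t):
    """Size range [lb, ub] that a set s must satisfy to possibly reach
    Jaccard threshold t with a set of r_size elements: lb = t * |r|, ub = |r| / t."""
    return math.ceil(t * r_size), math.floor(r_size / t)

print(jaccard_length_bounds(10, 0.8))  # (8, 12), as in the example above
```
</p>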
      <p>
        Positional filter The positional filter [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] reasons over the matching positions of the tokens in the prefix. Given a pair of sets of sorted elements, it checks the positions of their common tokens in the prefix; if the remaining tokens to check are not enough to reach the threshold, it prunes the pair. Since it needs to scan the tokens in the prefix, this filter is more expensive than the prefix and length filters, so it is usually applied only on the pairs that have already passed them.
      </p>
      <p>An example of how the positional filter works is provided in Figure 2. The pair ⟨ri, rj⟩ passes both the length and the prefix filters. The first match in ri occurs at position 1 (counting from 0), thus only 8 tokens of rj are left to match tokens of ri, and the pair can be filtered because it can never reach the requested threshold t = 9.</p>
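      <p>A simplified variant of the positional filter can be sketched as follows (it bounds the overlap from the first common prefix token only, whereas the full filter of [16] scans all of them; the token values are invented):

```python
def passes_positional_filter(ri, rj, prefix_i, prefix_j, t):
    """Bound the maximum overlap from the position of the first common prefix token.

    ri and rj are lists sorted by the global order; prefix_i and prefix_j are the
    prefix lengths. Returns False when the pair can be safely pruned."""
    positions_j = {tok: p for p, tok in enumerate(rj[:prefix_j])}
    for pos_i, tok in enumerate(ri[:prefix_i]):
        if tok in positions_j:
            # One match so far, plus at most the tokens that follow it in each set.
            ubound = 1 + min(len(ri) - pos_i - 1, len(rj) - positions_j[tok] - 1)
            return ubound >= t
    return True  # no common prefix token: the prefix filter handles this case

# The situation of Figure 2: the first match is at position 1 of both prefixes.
ri = ["a", "f"] + ["x%d" % i for i in range(8)]   # 10 tokens
rj = ["b", "f"] + ["y%d" % i for i in range(7)]   # 9 tokens
print(passes_positional_filter(ri, rj, 2, 2, 9))  # False: the bound misses t = 9
```
</p>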
      <p>Fig. 2. Positional filter example with t = 9: ri has 10 tokens and rj has 9, and their first common prefix token appears at position 1 of both; at most min(8, 9) = 8 tokens can still match, and 8 &lt; 9, so the pair is filtered.</p>
      <p>Prefix filter based set similarity join An example of how a prefix filter based set similarity join works is outlined in Figure 3. Starting from a document collection, the documents are transformed into sets of elements (e.g., tokens, n-grams, etc.) and sorted according to a global order (1). Then, using the prefixes (highlighted in gray), an inverted index is built, i.e., the prefix index (2). From the prefix index, a set of pre-candidate pairs is built (3), i.e., each pair of profiles that appear together in at least one entry of the prefix index. The pre-candidate pairs are filtered using different filters (e.g., length filter, positional filter, etc.) that are fast to compute and allow discarding the pairs that cannot reach the threshold (4). Finally, the pairs that pass all the filters (i.e., the candidate pairs) are probed with the similarity function, and only those that have a similarity above the threshold are retained (5).</p>
      <p>Fig. 3. Overview of a prefix filter based set similarity join: (1) elements generation and sorting, (2) prefix index building, (3) pre-candidates generation, (4) filtering, (5) verification.</p>
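      <p>The five steps can be sketched end-to-end for the Jaccard case (a simplified, single-machine illustration; the Jaccard prefix length |r| - ceil(t·|r|) + 1 and the sample profiles are our assumptions):

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def jaccard(a, b):
    return len(a.intersection(b)) / len(a.union(b))

def similarity_self_join(profiles, t):
    # (1) Tokenize and sort each set by the global order (ascending document frequency).
    token_sets = [set(p.lower().split()) for p in profiles]
    df = Counter(tok for s in token_sets for tok in s)
    sorted_sets = [sorted(s, key=lambda tok: (df[tok], tok)) for s in token_sets]
    # (2) Build the prefix index (Jaccard prefix length: |r| - ceil(t|r|) + 1).
    index = defaultdict(set)
    for pid, r in enumerate(sorted_sets):
        for tok in r[: len(r) - math.ceil(t * len(r)) + 1]:
            index[tok].add(pid)
    # (3) Pre-candidates: pairs that share at least one prefix-index entry.
    pre_candidates = {pair for pids in index.values()
                      for pair in combinations(sorted(pids), 2)}
    # (4) Length filter, then (5) verification with the similarity function.
    results = set()
    for i, j in pre_candidates:
        if t * len(sorted_sets[i]) > len(sorted_sets[j]):
            continue
        if t * len(sorted_sets[j]) > len(sorted_sets[i]):
            continue
        if jaccard(token_sets[i], token_sets[j]) >= t:
            results.add((i, j))
    return results

profiles = ["blue cotton shirt", "blue cotton shirt slim", "red wool sweater"]
print(similarity_self_join(profiles, 0.7))  # {(0, 1)}
```

Only the pair sharing a prefix token is ever verified; the pairs involving the third profile are never generated.</p>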
      <sec id="sec-1-4">
        <title>3 The RulER Approach</title>
        <p>
          In this section, we present our method RulER to efficiently scale record-level matching rules over big data sets. For the sake of presentation, the algorithm is presented in its self-join version; adapting it for joining two different data sets is straightforward.
Given a matching rule R, a naïve solution to execute it is to process each predicate as a single similarity join and then intersect/merge the obtained results according to the requirements. In particular, we adopted two algorithms to perform the similarity joins: PPJoin [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] and EDJoin [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Both algorithms employ the prefix filter (see Section 2) to find candidate pairs. PPJoin is considered one of the best performing similarity join algorithms [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and its parallel implementation (i.e., Vernica Join) has been demonstrated to be one of the best performing for distributed computing [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. It can work with different similarity measures such as Jaccard similarity, Dice, and Cosine. EDJoin adapts the PPJoin concepts to work with the edit distance. We adapted both algorithms to work on Spark as proposed in [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for PPJoin (i.e., Vernica Join).
        </p>
        <sec id="sec-1-4-1">
          <title>Algorithm 1 PPJoin/EDJoin</title>
          <p>
            Input: R, the collection of records to join
Input: P, a predicate containing the join attribute, the threshold, and the elements pattern (i.e., tokens, n-grams, etc.)
Output: C, the pairs of records that can satisfy P
1: RT ← getSortedElements(R, P) //Transforms the records into sets of sorted elements (i.e., tokens, n-grams, etc.)
2: I ← buildPrefixIndex(RT, P) //Builds the prefix index
3: C ← flatMap Bi ∈ I //For each entry of the prefix index
4:   for each ⟨rj, rk⟩ ∈ Bi (rj ≠ rk) do
5:     if passLengthFilter(rj, rk, P) then
6:       if passPositionalFilter(rj, rk, P) then
7:         emit(⟨rj, rk⟩)
          </p>
          <p>The distributed algorithm to perform PPJoin/EDJoin is presented in Algorithm 1. First, the records are transformed into sets of sorted elements according to the predicate requirements (line 1). The prefix index (see Subsection 2.2) is built by generating an inverted index that groups all the records that share at least one token in their prefix (line 2). Then, the algorithm iterates over each entry Bi of the prefix index, probing each pair of records ⟨rj, rk⟩ ∈ Bi with the appropriate filters according to the predicate P (lines 3-7).</p>
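          <p>A single-machine Python analogue of the flatMap step of Algorithm 1, with the two filters left as stand-in predicates (in the Spark version each index entry Bi is processed in parallel):

```python
from itertools import combinations

def ppjoin_candidates(prefix_index, pass_length_filter, pass_positional_filter):
    """Iterate over the prefix-index entries Bi and emit the surviving pairs.

    prefix_index maps a prefix token to the ids of the records containing it;
    the two filter arguments are stand-ins for the real length/positional filters."""
    for Bi in prefix_index.values():                 # line 3: flatMap over the entries
        for rj, rk in combinations(sorted(Bi), 2):   # line 4: each pair with rj != rk
            if pass_length_filter(rj, rk):           # line 5
                if pass_positional_filter(rj, rk):   # line 6
                    yield (rj, rk)                   # line 7

index = {"a": {1, 2}, "b": {2, 3}}
accept = lambda rj, rk: True
print(list(ppjoin_candidates(index, accept, accept)))  # [(1, 2), (2, 3)]
```

Note that a pair sharing several prefix tokens would be emitted once per shared entry, which is exactly the duplicate problem discussed later for JoinChain.</p>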
          <p>Algorithm 2 outlines the baseline algorithm (i.e., JoinChain). A rule in DNF form is composed of different blocks Pi of predicates that are in logical OR; each of these blocks contains one or more predicates pj that are in logical AND. First, the algorithm iterates over the Pi blocks (line 2) and for each of them initializes a set of candidates CPi (line 3). Then, each simple predicate pj ∈ Pi is used to apply a similarity join on the record collection R according to its requirements (lines 6-9). The result of a Pi block is given by the intersection of the results provided by each similarity join applied with the predicates pj ∈ Pi (lines 10-13). The final candidate set is computed by merging the results of each Pi block (line 14). In the end, the candidates are verified with a verify function that ensures that all the predicates are respected (line 15).</p>
          <p>Algorithm 2 JoinChain
Input: R collection of records to join
Input: M matching rule in DNF form
Output: C, the pairs of records that satisfy M
1: C ← {}
2: for each Pi ∈ M do //For each block of predicates in OR
3:   CPi ← {} //Set of candidate pairs for Pi
4:   for each pj ∈ Pi do //For each single predicate in AND
5:     Cpj ← {} //Set of candidate pairs for pj
6:     if pj.type = ED then
7:       Cpj ← EDJoin(R, pj) //Get candidate pairs with EDJoin
8:     else
9:       Cpj ← PPJoin(R, pj) //Get candidate pairs with PPJoin
10:    if CPi.isEmpty then
11:      CPi ← Cpj
12:    else
13:      CPi ← CPi ∩ Cpj //Intersect the candidates with the previous ones
14:  C ← C ∪ CPi //Merge the candidates with the previous ones
15: return verify(C, M)</p>
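          <p>The intersect/merge logic of JoinChain (Algorithm 2, lines 10-14) reduces, once the per-predicate joins have run, to plain set operations; a sketch with the PPJoin/EDJoin results stubbed as a dict of pair sets (the predicate ids and pairs are invented):

```python
def join_chain(rule, candidates_of):
    """rule: an OR-list of AND-lists of predicate ids.
    candidates_of maps a predicate id to the candidate-pair set produced by
    running PPJoin/EDJoin for that predicate alone (stubbed here)."""
    C = set()
    for conjunction in rule:                 # blocks of predicates in OR
        C_Pi = None
        for p in conjunction:                # predicates in AND
            C_pj = candidates_of[p]
            # Intersect with the candidates of the previous predicates (lines 10-13).
            C_Pi = C_pj if C_Pi is None else C_Pi.intersection(C_pj)
        C.update(C_Pi)                       # merge the block results (line 14)
    return C                                 # the final verify step is omitted here

cands = {"C1": {(1, 2), (1, 3)}, "C2": {(1, 2)}, "C3": {(4, 5)}}
print(join_chain([["C1", "C2"], ["C3"]], cands))  # {(1, 2), (4, 5)}
```
</p>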
          <p>3.2 The RulER Algorithm</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Drawbacks of JoinChain</title>
          <p>
            Algorithm 2 has three main drawbacks:
(i) the intersect operation (line 13) is expensive in MapReduce-like systems, because it generates shuffles;
(ii) each predicate is checked independently of the others. For example, given a matching rule M = (C1 ∧ C2) ∨ (C3 ∧ C4), in which each Cx is a similarity join predicate (e.g., Jaccard similarity of the title ≥ 0.8), a pair ⟨ri, rj⟩ is probed with all the predicates even if it already fails/passes one of them: if the pair passes (C3 ∧ C4), it is not necessary to probe it with (C1 ∧ C2); if it fails C1, it is not necessary to probe it with C2;
(iii) Vernica Join [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], employed in the implementation of the JoinChain algorithm (Algorithm 2, lines 7 and 9), produces duplicates [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] that have to be removed: if a pair of records appears in more than one prefix-index entry, it is processed and emitted multiple times (Algorithm 1, lines 3-7).
          </p>
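          <p>Drawback (ii) is the absence of short-circuiting. Per candidate pair, a lazy DNF evaluation avoids it; a sketch in which the predicate computations are stubbed by a counter to make the savings visible:

```python
def evaluate_dnf(rule, check):
    """Short-circuit evaluation of a DNF rule for a single candidate pair:
    any()/all() stop as soon as the outcome of the rule is decided."""
    return any(all(check(p) for p in conjunction) for conjunction in rule)

evaluated = []
def check(p):
    evaluated.append(p)          # track which predicates are actually computed
    return p in {"C3", "C4"}     # this pair satisfies only C3 and C4

print(evaluate_dnf([["C1", "C2"], ["C3", "C4"]], check))  # True
print(evaluated)  # ['C1', 'C3', 'C4']: C2 is skipped since C1 already failed
```
</p>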
          <p>We solved these problems in our RulER algorithm. The main intuition of RulER is to exploit the prefix indexes (one prefix index for each predicate of the matching rule) to build a graph structure, which is then employed to iterate over the records (the nodes of the graph), efficiently applying the rules and keeping only the candidates (the edges of the graph) that pass the whole rule. In other words, RulER adopts a record-based parallelization approach, in contrast to the existing algorithms, which adopt a prefix-based parallelization approach on a single predicate at a time.</p>
          <p>
            The RulER matching rule execution algorithm is outlined in Algorithm 3. The algorithm takes as input a collection of records and a record-level matching rule M, and gives as output the set of record pairs that satisfy M. Recall that M is in DNF, i.e., it is composed of sets of predicates Pj in logical OR, and each set Pj contains predicates pk in logical AND. First of all, the values of the attributes are converted into sets of elements (line 1) according to the matching rule requirements (e.g., tokens, n-grams, etc.); then the prefix indexes are built to find the candidate pairs (line 2): one prefix index is needed for each predicate pk of the matching rule. The prefix indexes are sent in broadcast (line 3) to be available to each computational node (called worker). Then, each worker iterates over its portion of records (lines 5-6), and performs the following operations for each record ri. First, a set of candidates for ri is initialized as an empty set Cri (line 7). Second, for each set Pj, a set of candidates CPj is initialized as an empty set (lines 8-9), and for each pk ∈ Pj the candidates Cri,pk that can match with ri are extracted using the prefix indexes (lines 10-11). Third, the candidates Cri,pk are pruned by removing those that already passed one of the previous Pj sets of predicates (line 14), and those that did not pass previous pk ∈ Pj predicates (lines 15-16). Fourth, the retained candidates are probed with the other filters that further improve the efficiency of the overall process (e.g., length filter, positional filter, etc. [
            <xref ref-type="bibr" rid="ref15 ref16">16,15</xref>
            ]) according to the rule (line 18). Since pk is in logical AND with the previous predicates, only the candidates that pass the filters are kept. The resulting candidates from Pj are added to Cri (line 20). Finally, the candidates are verified (line 21).
          </p>
          <p>(On shuffle operations, see https://spark.apache.org/docs/2.1.0/programming-guide.html#shuffle-operations)</p>
          <p>Algorithm 3 RulER
Input: R collection of records to join
Input: M matching rule in DNF
Output: C, the pairs of records that satisfy M
1: RT ← getElements(R, M)
2: I ← buildPrefixIndexes(RT, M)
3: broadcast(I)
4: C ← {} //Candidate pairs
5: map partition part ∈ RT
6:   for each ri ∈ part do
7:     Cri ← {} //Candidates for ri
8:     for each Pj ∈ M do //For each set of predicates in logical OR
9:       CPj ← {} //Candidates that satisfy Pj
10:      for each pk ∈ Pj do //For each predicate in logical AND
11:        Cri,pk ← I(pk, ri) //Gets the candidates from the prefix index
12:        /*Removes candidates that already passed previous predicates in OR
13:        and those that did not pass previous predicates in AND*/
14:        Cri,pk ← Cri,pk − Cri
15:        if CPj ≠ ∅ then
16:          Cri,pk ← Cri,pk ∩ CPj
17:        /*Applies filters (length, positional, ...)*/
18:        CPj ← applyFilters(ri, Cri,pk, pk)
19:      //End of the loop over pk
20:    Cri ← Cri ∪ CPj //Adds the candidates that satisfy Pj
21: return verify(C, M)</p>
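          <p>The record-based parallelization of Algorithm 3 can be mimicked on a single machine as follows (a simplified sketch: the prefix indexes are plain dicts from predicate to record-to-candidates maps, and applyFilters/verify are stand-ins):

```python
def ruler_partition(records, rule, prefix_indexes, apply_filters, verify):
    """Process one partition of records as a worker does in Algorithm 3 (lines 5-21).

    prefix_indexes maps each predicate to a dict from record id to candidate ids;
    apply_filters and verify are stand-ins for the real filtering/verification."""
    C = set()
    for ri in records:                            # line 6
        C_ri = set()                              # line 7
        for Pj in rule:                           # line 8: conjunctions in OR
            C_Pj = set()                          # line 9
            for k, pk in enumerate(Pj):           # line 10: predicates in AND
                cands = set(prefix_indexes[pk].get(ri, set()))   # line 11
                cands = cands.difference(C_ri)    # line 14: passed a previous Pj
                if k > 0:
                    cands = cands.intersection(C_Pj)  # lines 15-16
                C_Pj = apply_filters(ri, cands, pk)   # line 18
            C_ri.update(C_Pj)                     # line 20
        C.update((ri, rj) for rj in C_ri)
    return verify(C)                              # line 21

indexes = {"C1": {1: {2, 3}}, "C2": {1: {2}}}
no_op = lambda ri, cands, pk: cands
print(ruler_partition([1], [["C1", "C2"]], indexes, no_op, lambda C: C))  # {(1, 2)}
```

Candidate 3 is discarded as soon as it misses the C2 index entry, without ever being verified, which is the short-circuiting that JoinChain lacks.</p>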
          <p>Given a matching rule R = (C1 ∧ C2 ∧ C3) ∨ (C4 ∧ C5), in which each Cx is a similarity join predicate (e.g., Jaccard similarity of the title ≥ 0.8), an example of how RulER executes R is outlined in Figure 4. First, a prefix index is built on the basis of the predicates expressed in the matching rule R. Then, the index is distributed to each worker. Each worker iterates over each record in its partition, extracting the possible candidates from the prefix index. The rules are applied to each candidate. If multiple conjunctions are in OR, it is possible to avoid computing the other conjunctions when one of them is verified, e.g., for ⟨r1, r2⟩ the conjunction (C1 ∧ C2 ∧ C3) is not evaluated since the pair already passes (C4 ∧ C5). Conversely, if multiple predicates are in AND, it is possible to avoid the computation when one of them fails, e.g., for the pair ⟨r1, r3⟩, C2 fails, so C3 does not have to be computed.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Experimental Evaluation</title>
      <p>In this section we evaluate RulER with respect to JoinChain (see Section 3). In
particular, the experimental evaluation aims to answer the following questions:</p>
      <p>Fig. 4. Example of how RulER executes a matching rule: the master builds the prefix indexes (one per predicate C1-C5) and broadcasts them to the workers, which extract the candidates for their records and emit the matches (e.g., ⟨r1, r2⟩ and ⟨r1, r4⟩).</p>
      <p>Q 1: What is the performance of RulER in terms of execution time compared to JoinChain (i.e., the naïve solution)? (Section 4.2)
Q 2: How does RulER scale when varying the number of machines available for the record-level matching rule processing? (Section 4.3)</p>
      <p>4.1 Experimental Setup</p>
      <p>
        All the experiments are performed on a ten-node cluster; each node has two Intel Xeon E5-2670v2 2.50 GHz CPUs (20 cores per node) and 128 GB of RAM, running Ubuntu 14.04. All the software is implemented in Scala 2.11.8 and is available at [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. To assess the performance of the state-of-the-art meta-blocking methods we reimplemented all of them for running on Apache Spark as well. We employ Apache Spark 2.1.0, running 3 executors on each node, reserving 30 GB of memory for the master node. We set the default parallelism to twice the number of cores, as suggested by best practice.
      </p>
      <p>Fig. 5. Execution times of RulER and JoinChain with the rules reported in Table 1.</p>
      <p>
        We employed the omdb data set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to evaluate the performance of RulER and JoinChain. It contains 2.3 million records about movies collected from omdbapi.com. This data set has a good variety of attributes (e.g., title, cast, director, writer, plot, etc.) on which it is possible to define different kinds of record-level matching rules. Moreover, it contains a sufficient number of records to make it suitable for testing the performance with Spark.
      </p>
      <p>The goal of our experiments is to compare the efficiency of the algorithms, not to find good matching rules. Moreover, with RulER it would be possible to define an order of application of the predicates that minimizes the execution time and the number of performed comparisons, but we do not explore this aspect in this work.</p>
      <p>4.2 RulER vs JoinChain</p>
      <p>The goal of this experiment is to compare the efficiency of RulER (Algorithm 3) and JoinChain (Algorithm 2). Both algorithms can be employed to apply a record-level matching rule M. In this experiment we apply the rules presented in Table 1 on the omdb data set. Since both algorithms use the same functions to extract the elements from the records, to generate the prefix indexes, and to verify the candidate pairs, we analyze here only the time required to perform the join operation. All the experiments are performed on a single node.</p>
      <p>Figure 5 reports the execution times of RulER and JoinChain with the rules reported in Table 1. RulER is always significantly faster than JoinChain: 24× with M1 (Figure 5(a)), 5× with M2 (Figure 5(b)), 7× with M3 (Figure 5(c)), and 27× with M4 (Figure 5(d)). Moreover, Figure 6 shows the number of comparisons performed by both algorithms. Also in this case, RulER works better than JoinChain, performing fewer comparisons: 21.6 × 10^6 fewer for M1 (Figure 6(a)), 23.0 × 10^6 fewer for M2 (Figure 6(b)), 3.8 × 10^6 fewer for M3 (Figure 6(c)), and 45.1 × 10^6 fewer for M4 (Figure 6(d)).</p>
      <p>We conclude that RulER is always faster than JoinChain.</p>
      <p>Table 1. The matching rules employed in the experiments (JS = Jaccard similarity threshold, ED = edit distance threshold), with the number of candidates and matches on the omdb data set.
Rule | Candidates | Matches
M1: (Title, 0.9, JS) ∧ (Cast, 0.8, JS) | 1725885 | 53023
M2: (Title, 0.9, JS) ∨ (Cast, 0.8, JS) | 990278774 | 253593213
M3: ((Cast, 0.9, JS) ∧ (Director, 0.9, JS) ∧ (Writer, 0.9, JS)) ∨ (Title, 0.9, JS) | 61987445 | 34798235
M4: (Director, 0.9, JS) ∧ (Title, 2, ED) | 1133134 | 252935</p>
      <p>Finally, we assess the scalability of RulER by varying the number of nodes in the cluster (1, 3, 5, 7 and 10 nodes). For this experiment, we apply the rules described in Table 1 on the omdb data set.</p>
      <p>[Figure: scalability results of RulER, reporting the speedup (panels (a)-(c)) and the execution time in minutes (panels (d)-(f)) when varying the number of nodes (1, 3, 5, 7, 10).]</p>
      <p>5 Conclusion</p>
      <p>In this work, we tackled the problem of executing record-level matching rules. We presented two solutions to perform them on parallel and distributed systems: a baseline one (i.e., JoinChain) implemented by using existing solutions, and a novel approach (i.e., RulER) that optimizes the execution of the rules. We conducted a thorough experimental evaluation, demonstrating the efficiency of the proposed approach, which always outperforms the baseline solution in terms of execution time and number of comparisons. In the future, we plan to extend our system to automatically find the optimal execution order of the predicates that compose a rule.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Arasu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaushik</surname>
          </string-name>
          , R.:
          <article-title>Efficient exact set-similarity joins</article-title>
          .
          <source>In: Proceedings of the 32nd international conference on Very large data bases</source>
          . pp.
          <volume>918</volume>
          -
          <fpage>929</fpage>
          .
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ardalan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Akella</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Smurf: self-service string matching using random forests</article-title>
          .
          <source>PVLDB</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <volume>278</volume>
          -
          <fpage>291</fpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Bayardo</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          , Ma,
          <string-name>
            <given-names>Y.</given-names>
            ,
            <surname>Srikant</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          :
          <article-title>Scaling up all pairs similarity search</article-title>
          .
          <source>In: Proceedings of the 16th international conference on World Wide Web</source>
          . pp.
          <volume>131</volume>
          -
          <fpage>140</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chaudhuri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganti</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kaushik</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A primitive operator for similarity joins in data cleaning</article-title>
          .
          <source>In: 22nd International Conference on Data Engineering (ICDE'06)</source>
          . pp.
          <fpage>5</fpage>
          -
          <lpage>5</lpage>
          .
          IEEE
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>G. C.</surname>
            ,
            <given-names>P.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gokhale</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Konda</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>The Magellan data repository</article-title>
          . https://sites.google.com/site/anhaidgroup/projects/data
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fier</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouros</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leser</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freytag</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          :
          <article-title>Set similarity joins on MapReduce: An experimental survey</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>11</volume>
          (
          <issue>10</issue>
          ),
          <fpage>1110</fpage>
          -
          <lpage>1122</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Beneventano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SparkER: scaling entity resolution in Spark</article-title>
          .
          <source>In: EDBT 2019</source>
          . pp.
          <fpage>602</fpage>
          -
          <lpage>605</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>SparkER: an entity resolution tool for Apache Spark</article-title>
          (
          <year>2017</year>
          ), https://github.com/Gaglia88/sparker
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>BigDedup: a big data integration toolkit for duplicate detection in industrial scenarios</article-title>
          .
          <source>In: 25th International Conference on Transdisciplinary Engineering (TE2018)</source>
          . vol.
          <volume>7</volume>
          , pp.
          <fpage>1015</fpage>
          -
          <lpage>1023</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouros</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>An empirical evaluation of set similarity join techniques</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>9</volume>
          (
          <issue>9</issue>
          ),
          <fpage>636</fpage>
          -
          <lpage>647</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Nargesian</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pu</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Arocena</surname>
            ,
            <given-names>P.C.</given-names>
          </string-name>
          :
          <article-title>Data lake management: challenges and opportunities</article-title>
          .
          <source>PVLDB</source>
          <volume>12</volume>
          (
          <issue>12</issue>
          ),
          <fpage>1986</fpage>
          -
          <lpage>1989</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gagliardelli</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jagadish</surname>
            ,
            <given-names>H.V.</given-names>
          </string-name>
          :
          <article-title>Scaling entity resolution: A loosely schema-aware approach</article-title>
          .
          <source>Inf. Syst</source>
          .
          <volume>83</volume>
          ,
          <fpage>145</fpage>
          -
          <lpage>165</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Simonini</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Papadakis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palpanas</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergamaschi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Schema-agnostic progressive entity resolution</article-title>
          .
          <source>IEEE TKDE</source>
          <volume>31</volume>
          (
          <issue>6</issue>
          )
          ,
          <fpage>1208</fpage>
          -
          <lpage>1221</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Vernica</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carey</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Efficient parallel set-similarity joins using MapReduce</article-title>
          .
          <source>In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data</source>
          . pp.
          <fpage>495</fpage>
          -
          <lpage>506</lpage>
          .
          ACM
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Ed-join: an efficient algorithm for similarity joins with edit distance constraints</article-title>
          .
          <source>PVLDB</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>933</fpage>
          -
          <lpage>944</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Efficient similarity joins for near-duplicate detection</article-title>
          .
          <source>ACM TODS</source>
          <volume>36</volume>
          (
          <issue>3</issue>
          )
          ,
          <fpage>15</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>