=Paper=
{{Paper
|id=Vol-3135/darliap_paper6
|storemode=property
|title=On Customer Data Deduplication: Lessons Learned from a R&D Project in the Financial Sector
|pdfUrl=https://ceur-ws.org/Vol-3135/darliap_paper6.pdf
|volume=Vol-3135
|authors=Pawel Boinski,Mariusz Sienkiewicz,Bartosz Bebel,Robert Wrembel,Dariusz Galezowski,Waldemar Graniszewski
|dblpUrl=https://dblp.org/rec/conf/edbt/BoinskiSBWGG22
}}
==On Customer Data Deduplication: Lessons Learned from a R&D Project in the Financial Sector==
Paweł Boiński (1), Mariusz Sienkiewicz (1), Bartosz Bębel (1), Robert Wrembel (1), Dariusz Gałęzowski (2), Waldemar Graniszewski (3)

(1) Poznan University of Technology, Poznań, Poland
(2) DAMA Poland, Warsaw, Poland
(3) Warsaw University of Technology, Warsaw, Poland

Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK. Contact: robert.wrembel@cs.put.poznan.pl (R. Wrembel), ORCID 0000-0001-6037-5718. © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Despite the fact that financial institutions (FIs) apply data governance strategies and use the most advanced state-of-the-art data management and data engineering software and systems to support their day-to-day businesses, their databases are not free from faulty data (dirty and duplicated). In this paper, we report some conclusions from an ongoing research and development project for a FI. The goal of this project is to integrate customers' data from multiple data sources - to clean, homogenize, and deduplicate them. In particular, this paper focuses on findings from developing the customers' data deduplication process.

Keywords: data quality, data cleaning, data deduplication pipeline

1. Introduction

Financial institutions (FIs) apply data governance strategies and use the most advanced state-of-the-art data management and data engineering software to manage the data collected by their day-to-day businesses. Unfortunately, the application of advanced technologies does not prevent the collection and storage of some faulty data - mainly erroneous, outdated, and duplicated, e.g., [1]. Such data mainly concern customers, both individuals and institutions. Duplicated and outdated data cause economic losses, increase customer dissatisfaction, and damage the reputation of an FI. For these reasons, the integration, cleaning, and deduplication of customers' data is one of the processes in data governance.

In the research literature, a base-line data deduplication pipeline has been proposed, e.g., [2, 3, 4]. It has become a standard pipeline for multiple data deduplication projects. The pipeline includes four basic tasks, namely: (1) blocking (a.k.a. indexing), which arranges records into groups, such that each group is likely to include duplicates, (2) block processing (a.k.a. filtering), which eliminates records that do not have to be compared, (3) entity matching (a.k.a. similarity computation), which computes similarity values between record pairs, and (4) entity clustering, which creates larger clusters of similar records.

In this paper, we outline our experience and findings from designing a deduplication pipeline for customers' data (Section 2). We discuss the approaches that are possible for each task in the pipeline and present the particular solutions that proved adequate for the addressed problem. Final conclusions are presented in Section 3. Notice that this paper presents findings from a real R&D project and, therefore, not all details can be revealed, as they are treated as company know-how.

2. Deduplication pipeline in the project

Inspired by the aforementioned base-line data deduplication pipeline (BLDDP), in the described project we apply an adjusted pipeline that suits the goals of our project. There are three basic differences between our pipeline and the BLDDP. First, in our pipeline we explicitly included all steps that we found to be crucial for the deduplication of customers' data, whereas in the BLDDP some steps are implicit. Second, the last task in our pipeline allows some groups of similar records (cliques) to be further merged, whereas the BLDDP, to the best of our knowledge, does not include this task. Third, our pipeline accepts dirty customers' data, whereas the BLDDP assumes that input data were cleaned beforehand.

Our pipeline includes the following tasks (cf. Figure 1), which are outlined in the remainder of the paper: [T1] selecting grouping attributes, [T2] selecting attributes used to compare record pairs, [T3] choosing a method for comparing records, [T4] selecting similarity measures for comparing values of attribute pairs, [T5] defining weights of attributes to compute record similarities and choosing similarity thresholds, [T6] building pairs of records having a high similarity value, [T7] building cliques of similar records, [T8] further merging cliques of similar records. [T1] realizes blocking in the BLDDP; [T3] realizes block processing and entity matching; [T2], [T4], and [T5] realize entity matching; [T6] and [T7] realize entity clustering.

2.1. Pipeline implementation environments

Tasks [T1] to [T6] were implemented in parallel in two alternative environments. The first one is a typical data engineering environment, based on the Oracle DBMS as data storage and the PL/SQL programming language for implementing the deduplication pipeline; this environment is the standard one in the FI running the project. The second environment is a typical data science environment, based on csv files as data storage and Python (Anaconda or Jupyter-lab, data science packages) for implementing the pipeline.

2.2. Tasks in the pipeline

2.2.1. T1: Grouping attributes

The main challenge in grouping records is to select a set of grouping attributes that allows the highest number of potentially duplicate records to be identified. Even though 14 different blocking methods have been proposed in the research literature [5], there is no single universal grouping method suitable for all application domains [6].

Inspired by [7], we proposed a method based on statistical characteristics of customer attributes: (1) the number of nulls to the number of not-null values of an attribute, (2) the number of duplicated values to the number of not-null values of an attribute, (3) the number of not-duplicated values to the number of not-null values of an attribute, (4) the ratio (not-null - distinct)/not-null of an attribute's values. These characteristics are computed for every attribute that is a candidate for grouping. Additionally, (1) the diversity of values of each attribute is modeled by means of the Gini index and (2) the size of a record group is penalized by means of a quadratic function with a negative value of coefficient a.

Notice that the initial set of potential grouping attributes was selected based on expert knowledge. It included 20 attributes. Next, the candidate attributes were processed by means of our method and their statistical characteristics were computed. The obtained ranking of attributes was verified by domain experts. Based on their input, the final set of attributes was selected for arranging (grouping) records.
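To illustrate task [T1], the following sketch shows how such per-attribute statistics (null ratio, duplicate ratio, value diversity via the Gini index, and a quadratic group-size penalty) can be computed with pandas. The way the project combines these characteristics into a single ranking score is not disclosed in the paper, so the scoring line below is only a hypothetical example.

```python
import pandas as pd

def gini_index(value_counts: pd.Series) -> float:
    """Gini index of one attribute's value distribution: 1 - sum of squared value shares."""
    p = value_counts / value_counts.sum()
    return float(1.0 - (p ** 2).sum())

def profile_attribute(col: pd.Series, a: float = -0.001) -> dict:
    """Statistical characteristics of one candidate grouping attribute."""
    not_null = col.dropna()
    n_not_null = max(len(not_null), 1)
    distinct = not_null.nunique()
    counts = not_null.value_counts()           # group sizes if we grouped by this attribute
    return {
        "null_ratio": col.isna().sum() / n_not_null,            # (1) nulls / not-null
        "dup_ratio": (len(not_null) - distinct) / n_not_null,   # (2) duplicated / not-null
        "distinct_ratio": distinct / n_not_null,                # (3) not-duplicated / not-null
        "gini": gini_index(counts) if len(counts) else 0.0,     # diversity of values
        "group_size_penalty": a * (counts.mean() ** 2 if len(counts) else 0.0),  # quadratic, a < 0
    }

def rank_grouping_candidates(df: pd.DataFrame, candidates: list) -> pd.DataFrame:
    stats = pd.DataFrame({c: profile_attribute(df[c]) for c in candidates}).T
    # Hypothetical combination of the characteristics into one ranking score:
    stats["score"] = stats["gini"] - stats["null_ratio"] + stats["group_size_penalty"]
    return stats.sort_values("score", ascending=False)
```

In the project, such a ranking over the 20 candidate attributes would then be reviewed by domain experts, as described above.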
2.2.2. T2: Attributes for comparing record pairs

Having ordered records by the attributes from the ranking obtained in task [T1], the next task is to select the attributes whose values will be compared in record pairs, in order to compute the similarity of the records in each pair. Potential candidates for comparison are attributes that: (1) are record identifiers, (2) do not include nulls, (3) include cleaned values, e.g., no typos and no additional erroneous characters, (4) include unified (homogenized) values, e.g., no abbreviations and the same acronyms used throughout the whole data set.

Unfortunately, in real cases such attributes often do not exist. As far as record identifiers are concerned, natural identifiers are typically used in FI applications, but their values are frequently artificially generated (in cases when a natural identifier cannot be used, i.e., a customer is not able to provide it). Thus, in some cases artificially generated IDs may have the same values as natural ones. Notice that the financial sector is strictly regulated by means of European law, national law, and recommendations issued by the institutions controlling the sector. As a consequence, procedures aiming at improving the quality of data in this sector are strictly controlled. For this reason, the possibilities of applying data cleaning processes are limited.

In the described project, the set of attributes selected for comparing record pairs is based on the aforementioned preferable attribute characteristics and on expert knowledge. The set includes 18 attributes describing individual customers (e.g., personal data and address components) and 24 attributes describing institutional customers (e.g., institution names, addresses, type of business run).

2.2.3. T3: A method for comparing records

Based on the ranking of grouping attributes obtained in task [T1] (cf. Section 2.2.1), records need to be arranged into groups. Next, in each group, records are compared in pairs. The literature proposes two popular techniques for grouping, namely hashing, e.g., [8, 9], and sorting - known as the sorted neighborhood method, e.g., [10, 11].

The sorted neighborhood method accepts one parameter, namely the size of a sliding window in which records are compared. The larger the window size, the more potential duplicates can be found, but the longer the record comparison takes, since more records have to be compared each time. Experimental evaluations of the window size in the literature discuss a typical size that ranges from 2 to 60 records [10].

The sorted neighborhood method is intuitive, has an acceptable computational complexity, and is available in one of the Python libraries. For this reason, it was applied in the project. We ran a series of experiments in order to determine the best window size. Our experiments showed that the size has to be adjusted experimentally for the particular data set being deduplicated. For comparing individual customers' records we used a window of 20 records, whereas for comparing institutional customers we used a variable window size with a maximum of 200 records.
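The paper relies on a (not named) Python library implementation of the sorted neighborhood method, so the following is only a minimal, self-contained sketch of the technique for a fixed window size; the column names passed as sort keys are hypothetical.

```python
import pandas as pd

def sorted_neighborhood_pairs(df: pd.DataFrame, sort_keys, window: int = 20):
    """Sort records by the grouping attributes and yield candidate record-id pairs
    whose positions in the sorted order differ by less than the window size."""
    ordered = df.sort_values(list(sort_keys)).index.to_list()
    for pos, left in enumerate(ordered):
        for right in ordered[pos + 1 : pos + window]:
            yield (left, right)

# Usage with the window of 20 records reported above for individual customers
# ("last_name" and "city" are hypothetical grouping attributes):
# candidate_pairs = list(sorted_neighborhood_pairs(customers, ["last_name", "city"], window=20))
```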
2.2.4. T4: Similarity measures for text attributes

The literature on data deduplication and similarity measures lists well over 30 different similarity measures for text data, e.g., [12, 13]. One may also find in the literature suggestions, supported by experimental evaluations, on the applicability of different measures to different kinds of text data, e.g., [14, 15, 16, 12].

In our project, we evaluated 44 measures available in Python packages. Some measures, e.g., Levenshtein, Jaro, and Jaro-Winkler, exist in a few different implementations (packages), thus we evaluated these implementations as well. The evaluation was run on three different real data sets, i.e., (1) customers' last names with an average length of 10.9 characters, (2) street names with an average length of 16 characters, and (3) institution names with an average length of 45.5 characters. Customers' names included 98% one-word names and 2% two-word names, which reflected the real distribution of such types of names in our customer population. All test data represented true positives, but with the typical real errors found by data profiling.

From the evaluation we drew the following conclusions:
• for short strings, like last names and street names (composed of 7 to 28 characters), the Overlap, Jaro-Winkler, and StrCmp95 similarity measures gave the highest similarity values;
• for long strings, like institution names (composed of 46 to 116 characters and up to 12 separate words), the Overlap, Sorensen, and StrCmp95 measures gave the highest similarity values, therefore they were recommended for comparing such kinds of data.
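As an illustration of such an evaluation, the sketch below scores a few of the mentioned measures (Overlap, Jaro-Winkler, StrCmp95, Sorensen, Levenshtein) on pairs known to be duplicates. The textdistance package is used here only as one possible source of these measures; the paper does not name the packages behind the 44 evaluated implementations, and the sample pairs are invented.

```python
import textdistance

MEASURES = {
    "jaro_winkler": textdistance.jaro_winkler,
    "overlap": textdistance.overlap,
    "sorensen": textdistance.sorensen,
    "strcmp95": textdistance.strcmp95,
    "levenshtein": textdistance.levenshtein,
}

def evaluate_measures(true_positive_pairs):
    """Average normalized similarity of each measure over pairs known to be duplicates."""
    return {
        name: sum(m.normalized_similarity(a, b) for a, b in true_positive_pairs)
              / len(true_positive_pairs)
        for name, m in MEASURES.items()
    }

# Invented examples: a dirty last-name pair and a dirty institution-name pair.
print(evaluate_measures([("Kowalski", "Kowalsky"),
                         ("Bank Przykladowy S.A.", "Bank Przykladowy SA")]))
```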
2.2.5. T5: Attribute weights and similarity thresholds

In task [T5], we applied an iterative process of tuning the weights of the attributes used to compute records' similarity, with the support of domain experts. Additionally, rules had to be defined to decide whether to compare the values of a given attribute. Let us assume that a pair of records r_m and r_n is compared to compute their similarity value. Some cases handled by the rules include:
• when r_m has a defined value of attribute A and the value of A of r_n is null, then the values of A cannot be compared; in this case, A is not considered for computing the record similarity of pair (r_m, r_n);
• when record IDs are compared but the value of the ID of r_m is artificially generated and the value of the ID of r_n is real, then such values cannot be compared; in this case, the ID is not considered for computing the record similarity of pair (r_m, r_n);
• if IDs can be compared (i.e., they include real values), then a binary similarity value of either 1 (IDs are equal) or 0 (IDs are not equal) is assigned to the compared pair of IDs.

On top of the rules, for institutional customers we used equal weights for each compared attribute, whereas for individual customers higher weights were set for the ID and last name attributes. These weights were set based on an iterative process of experimentation and evaluation of the results by domain experts.

Based on the weighted values of the similarities between pairs of individual attributes of records r_m and r_n, a total similarity of (r_m, r_n) was computed. Let us denote it as r_sim. Based on its value, a given pair of records was classified as either similar (a match), non-similar (a non-match), or undecided. For this kind of classification, so-called similarity thresholds had to be defined. Again, in practice, these thresholds are defined based on the analysis of the obtained record pairs and on the knowledge of domain experts [12, 17].

In our project we applied the same approach. Based on the knowledge of the FI experts, the lowest value of r_sim for similar records was set to 0.8 and the highest value for non-similar records was set to 0.6.
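A minimal sketch of how the comparison rules, attribute weights, and the two thresholds described above can be combined is given below. The attribute names and weight values are hypothetical examples; only the threshold values (0.8 and 0.6) come from the paper.

```python
THRESHOLD_SIMILAR = 0.8      # r_sim >= 0.8 -> similar (match)
THRESHOLD_NON_SIMILAR = 0.6  # r_sim <= 0.6 -> non-similar; otherwise undecided

WEIGHTS = {"id": 3.0, "last_name": 2.0, "first_name": 1.0, "street": 1.0}  # hypothetical

def record_similarity(attr_sims, id_comparable=True):
    """attr_sims: attribute -> similarity in [0, 1], or None when the attribute cannot be
    compared (null value). The ID is skipped when one value is artificial and the other real."""
    total, weight_sum = 0.0, 0.0
    for attr, sim in attr_sims.items():
        if sim is None:                      # rule: null value -> attribute not considered
            continue
        if attr == "id" and not id_comparable:
            continue                         # rule: artificial vs. real ID -> not comparable
        weight = WEIGHTS.get(attr, 1.0)
        total += weight * sim
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0

def classify(r_sim):
    if r_sim >= THRESHOLD_SIMILAR:
        return "match"
    if r_sim <= THRESHOLD_NON_SIMILAR:
        return "non-match"
    return "undecided"

# Example: the ID rule yields a binary similarity (1.0 here), the last name is null.
r_sim = record_similarity({"id": 1.0, "last_name": None, "first_name": 0.9, "street": 0.7})
print(r_sim, classify(r_sim))   # 0.92 match
```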
2.2.6. T6: Building pairs of similar records

The sorted neighborhood method produces pairs of similar records with: (1) their overall value of r_sim and (2) the similarity values for each compared attribute. These data are stored in a repository (cf. Section 2.1) and visualized in a spreadsheet for expert verification.

2.2.7. T7: Building cliques of similar records

Since pairs of similar records may form larger sets, to find such sets all similar pairs have to be combined into a graph, with records represented as nodes and labeled edges representing the similarities between records. In such a graph, a group of similar records forms a maximal clique. Thus, the problem of finding sets of similar records transforms into finding maximal cliques in a graph. In general, this is an NP-hard problem [18]. The problem becomes computationally less expensive for sparse graphs, e.g., [19, 20], which is the case for a graph created from similar records. A few fast algorithms have been developed for finding maximal cliques.

One of them is the Bron-Kerbosch algorithm [21], which we decided to use for the following reasons. First, it is frequently used in the community working on graph processing. Second, it is implemented in multiple programming languages, including Python. Third, its worst-case computational complexity is O(3^(N/3)), where N denotes the number of graph nodes. For sparse graphs the complexity is lower.

The algorithm was used for finding cliques in a graph composed of 2,228,580 customer nodes. This evaluation confirmed its efficiency in terms of processing time and its applicability to the deduplication problem (confirmed by the experts from the FI).
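The clique-building step can be sketched with networkx, whose find_cliques() function enumerates maximal cliques using a Bron-Kerbosch variant. The similar_pairs input is assumed to be the output of task [T6]: pairs of record identifiers with their r_sim values above the match threshold.

```python
import networkx as nx

def build_cliques(similar_pairs):
    """similar_pairs: iterable of (record_id_1, record_id_2, r_sim) tuples."""
    graph = nx.Graph()
    for r1, r2, r_sim in similar_pairs:
        graph.add_edge(r1, r2, weight=r_sim)   # records as nodes, similarity as edge label
    # Each maximal clique is a candidate group of records describing one customer.
    return [set(clique) for clique in nx.find_cliques(graph)]

print(build_cliques([(1, 2, 0.93), (2, 3, 0.88), (1, 3, 0.85), (7, 8, 0.91)]))
# -> [{1, 2, 3}, {7, 8}]  (the order of the cliques may vary)
```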
2.2.8. T8: Merging cliques of similar records

If the number of similar records is larger than the size of the sliding window in the sorted neighborhood method, then several cliques are created, all of which contain records that are similar to each other. Therefore, the final step is to merge cliques that include a certain number of common records (currently the Jaccard coefficient is used to decide which cliques to merge). We are also experimenting with a variable, automatically adjustable window size.
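The merging step can be sketched as follows: two cliques are merged when the Jaccard coefficient of their record sets reaches a threshold. The threshold value used below is a hypothetical example; the paper does not disclose the value applied in the project.

```python
def jaccard(c1: set, c2: set) -> float:
    """Jaccard coefficient of two sets of record identifiers."""
    return len(c1 & c2) / len(c1 | c2)

def merge_cliques(cliques, threshold: float = 0.5):
    merged = [set(c) for c in cliques]
    changed = True
    while changed:                      # repeat until no pair of cliques can be merged
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if jaccard(merged[i], merged[j]) >= threshold:
                    merged[i] |= merged[j]
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged

print(merge_cliques([{1, 2, 3}, {2, 3, 4}, {7, 8}], threshold=0.5))  # -> [{1, 2, 3, 4}, {7, 8}]
```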
3. Final observations

In this paper, we reported our experience from an R&D project for a FI on deduplicating customers' data. The project is ongoing (two out of four stages have already been realized). In the project, we adapted the standard deduplication pipeline from the literature to the particular characteristics of the data being deduplicated and to the project requirements. The whole pipeline was implemented and verified by domain experts. The results obtained so far were accepted by the FI.

It must be stressed that the reality of the discussed project differs from the one assumed in the research literature with respect to: (1) the assumption on the cleanness of the data being deduplicated, (2) the sizes of the deduplicated data sets, (3) the availability of tagged data for ML algorithms, and (4) the neglect of the data aging process. None of these assumptions holds in our project, as outlined in the following sections.

3.1. Data cleanness

The base-line data deduplication pipeline [22, 2, 3, 23, 4] assumes that the data delivered to the pipeline are clean (e.g., no null values, no spelling errors, homogenized full names and abbreviations). Unfortunately, this assumption cannot be guaranteed in real projects, especially in the financial sector. There exist typos, missing values, and inconsistent values in attributes storing personal data, institution names, and addresses. Moreover, not all natural IDs are reliable. By regulation, even known dirty customer data cannot be cleaned without the explicit permission of a customer. Getting such permissions from millions of customers in a finite time frame is impossible. For this reason, only simple cleaning is possible, like removing leading or trailing erroneous characters from customer addresses. As a consequence, in practice the deduplication pipeline has to be applied to data that have undergone only basic cleaning.

3.2. Data size

Most of the methods used in the base-line data deduplication pipeline were verified either on small real data sets, e.g., bibliographical data with 32,000 records [6, 24, 25, 17, 26], restaurants with 500 records [24, 25], movies with 5,000 records [26], or patients with 128,000 records [27], or on artificially generated data sets [6, 7].

In this paper, in contrast, we reported our experience with deduplicating customer data of much larger volumes, i.e.: (1) 2,228,580 records describing individual customers and (2) 1,185,290 records describing institutional customers. The final goal of the reported project is to apply the developed pipeline and techniques to a database storing more than 11 million customer records (since the project is ongoing, this stage will be run at the end of the project).

3.3. Tagged data for ML

Some tasks in the deduplication pipeline can be run with the support of machine learning (ML) techniques, e.g., blocking [28, 29], selecting similarity measures and thresholds [30], and matching similar records [31, 32]. If the pipeline applies ML to the entity matching task, it is assumed that there exists a set of training records tagged as true positives and true negatives. Unfortunately, in a large FI it is impossible to create such a set of training data because of the volume of data to be processed by the pipeline. For an original data set composed of several million customers, a training data set of a reasonable size should include at least several thousand tagged training records. In reality, such a large number of training records cannot be created manually by experts. For this reason, in practice, training data are frequently unavailable for ML algorithms.

In order to overcome this difficulty, unsupervised learning techniques are used, e.g., [33, 34]. Some publications report on applying active learning techniques to a deduplication process, e.g., [35, 36, 37, 38, 39], and this direction will also be investigated in the project. Currently we are experimenting with weakly supervised learning [40] with the support of the snorkel library.
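As an illustration of the weak-supervision direction mentioned above, the sketch below uses snorkel labeling functions to produce noisy match/non-match labels for candidate record pairs without manual tagging. The pair features (ids_equal, name_sim) and the rules are invented for the example and are not the project's actual labeling functions.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NON_MATCH, MATCH = -1, 0, 1

@labeling_function()
def lf_same_id(pair):
    return MATCH if pair.ids_equal else ABSTAIN          # equal real IDs -> likely a match

@labeling_function()
def lf_low_name_similarity(pair):
    return NON_MATCH if pair.name_sim < 0.3 else ABSTAIN  # very dissimilar names -> non-match

@labeling_function()
def lf_high_name_similarity(pair):
    return MATCH if pair.name_sim > 0.9 else ABSTAIN      # nearly identical names -> match

# Hypothetical candidate pairs produced by the earlier pipeline tasks.
pairs = pd.DataFrame({"ids_equal": [True, False, False],
                      "name_sim": [0.95, 0.20, 0.85]})

lfs = [lf_same_id, lf_low_name_similarity, lf_high_name_similarity]
L = PandasLFApplier(lfs=lfs).apply(df=pairs)

label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=200, seed=42)
weak_labels = label_model.predict(L=L)   # noisy match / non-match labels for training a matcher
```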
3.4. Data aging

An inherent feature of some types of data is their aging. For example, customers' last names, identification documents, different types of postal addresses, and contact data (phone numbers, emails) have this feature. Outdated data impact the possibility of discovering duplicate records. For this reason, it would be profitable for a deduplication process to know which pieces of the compared data are likely to be outdated.

Building data aging models has not been researched so far (the only approach addressing a related problem is [41], but in the context of temporal data). Including aging models in a deduplication pipeline also seems to be a totally unexplored field of research. In the last stage of our project we aim at developing data aging models based on ML algorithms.

3.5. Working with experts

While designing the deduplication pipeline and evaluating its results, we have benefited from the help of experts. Their knowledge was used to determine the initial set of attributes used for comparing records and to choose the similarity thresholds. The pipeline was tuned in an iterative way, each iteration being based on the input from the experts evaluating the obtained results.

In particular, grouping clients into cliques turned out to be very useful - it provided a holistic view of customers' representations and allowed duplicates to be clearly identified.

In general, the proposed deduplication approach allowed for the proper implementation of the FI business goals.

Acknowledgements. The work of Mariusz Sienkiewicz is supported by the Applied Doctorate grant no. DWD/4/24/2020 from the Polish Ministry of Education and Science.

References

[1] M. Sienkiewicz, R. Wrembel, Managing data in a big financial institution: Conclusions from a R&D project, in: Proc. of the Workshops of the EDBT/ICDT 2021 Joint Conference, volume 2841 of CEUR Workshop Proceedings, CEUR-WS.org, 2021.
[2] A. K. Elmagarmid, P. G. Ipeirotis, V. S. Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng. 19 (2007) 1–16.
[3] H. Köpcke, E. Rahm, Frameworks for entity matching: A comparison, Data & Knowledge Engineering 69 (2010) 197–210.
[4] G. Papadakis, L. Tsekouras, E. Thanos, G. Giannakopoulos, T. Palpanas, M. Koubarakis, Domain- and structure-agnostic end-to-end entity resolution with JedAI, SIGMOD Record 48 (2019) 30–36.
[5] A. Colyer, The morning paper on "An overview of end-to-end entity resolution for big data", https://blog.acolyer.org/2020/12/14/entity-resolution/, 2020.
[6] M. Bilenko, B. Kamath, R. J. Mooney, Adaptive blocking: Learning to scale up record linkage, in: IEEE Int. Conf. on Data Mining (ICDM), IEEE Computer Society, 2006, pp. 87–96.
[7] L. de Souza Silva, F. Murai, A. P. C. da Silva, M. M. Moro, Automatic identification of best attributes for indexing in data deduplication, in: A. Mendelzon Int. Workshop on Foundations of Data Management, volume 2100 of CEUR Workshop Proceedings, CEUR-WS.org, 2018.
[8] N. N. Dalvi, V. Rastogi, A. Dasgupta, A. D. Sarma, T. Sarlós, Optimal hashing schemes for entity matching, in: Int. World Wide Web Conf. (WWW), 2013, pp. 295–306.
[9] H. Kim, D. Lee, HARRA: fast iterative hashed record linkage for large-scale data collections, in: Int. Conf. on Extending Database Technology (EDBT), volume 426, ACM, 2010, pp. 525–536.
[10] M. A. Hernández, S. J. Stolfo, The merge/purge problem for large databases, in: ACM SIGMOD Int. Conf. on Management of Data, ACM Press, 1995, pp. 127–138.
[11] B. Ramadan, P. Christen, H. Liang, R. W. Gayler, Dynamic sorted neighborhood indexing for real-time entity resolution, ACM Journal of Data and Information Quality 6 (2015) 15:1–15:29.
[12] P. Christen, Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection, Data-Centric Systems and Applications, Springer, 2012.
[13] F. Naumann, Similarity measures, Hasso Plattner Institut, 2013.
[14] M. Alamuri, B. R. Surampudi, A. Negi, A survey of distance/similarity measures for categorical data, in: Int. Joint Conf. on Neural Networks (IJCNN), IEEE, 2014, pp. 1907–1914.
[15] S. Boriah, V. Chandola, V. Kumar, Similarity measures for categorical data: A comparative evaluation, in: SIAM Int. Conf. on Data Mining (SDM), SIAM, 2008, pp. 243–254.
[16] P. Christen, A comparison of personal name matching: Techniques and practical issues, in: Int. Conf. on Data Mining (ICDM), IEEE Computer Society, 2006, pp. 290–294.
[17] S. Sarawagi, A. Bhamidipaty, Interactive deduplication using active learning, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 269–278.
[18] Y. Ma, T. Tran, Typimatch: type-specific unsupervised learning of keys and key values for heterogeneous web data integration, in: ACM Int. Conf. on Web Search and Data Mining (WSDM), ACM, 2013, pp. 325–334.
[19] R. Carraghan, P. M. Pardalos, An exact algorithm for the maximum clique problem, Operations Research Letters 9 (1990) 375–382.
[20] D. R. Wood, An algorithm for finding a maximum clique in a graph, Operations Research Letters 21 (1997) 211–217.
[21] C. Bron, J. Kerbosch, Finding all cliques of an undirected graph (algorithm 457), Communications of the ACM 16 (1973) 575–576.
[22] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM Computing Surveys 53 (2021) 127:1–127:42.
[23] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM Computing Surveys 53 (2020) 31:1–31:42.
[24] W. W. Cohen, J. Richman, Learning to match and cluster large high-dimensional data sets for data integration, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2002, pp. 475–480.
[25] M. Kejriwal, D. P. Miranker, An unsupervised algorithm for learning blocking schemes, in: IEEE Int. Conf. on Data Mining, IEEE Computer Society, 2013, pp. 340–349.
[26] W. Shen, X. Li, A. Doan, Constraint-based entity matching, in: Nat. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conf., AAAI Press / The MIT Press, 2005, pp. 862–867.
[27] M. A. Hernández, S. J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem, Data Mining and Knowledge Discovery 2 (1998) 9–37.
[28] L. O. Evangelista, E. Cortez, A. S. da Silva, W. M. Jr., Adaptive and flexible blocking for record linkage tasks, Journal of Information and Data Management 1 (2010) 167–182.
[29] M. Michelson, C. A. Knoblock, Learning blocking schemes for record linkage, in: Nat. Conf. on Artificial Intelligence and Innovative Applications of Artificial Intelligence Conf., AAAI Press, 2006, pp. 440–445.
[30] M. Bilenko, R. J. Mooney, Adaptive duplicate detection using learnable string similarity measures, in: ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM, 2003, pp. 39–48.
[31] S. Mudgal, H. Li, T. Rekatsinas, A. Doan, Y. Park, G. Krishnan, R. Deep, E. Arcaute, V. Raghavendra, Deep learning for entity matching: A design space exploration, in: SIGMOD Int. Conf. on Management of Data, ACM, 2018, pp. 19–34.
[32] M. Paganelli, F. D. Buono, M. Pevarello, F. Guerra, M. Vincini, Automated machine learning for entity matching tasks, in: Int. Conf. on Extending Database Technology (EDBT), OpenProceedings.org, 2021, pp. 325–330.
[33] I. Bhattacharya, L. Getoor, A latent dirichlet model for unsupervised entity resolution, in: SIAM Int. Conf. on Data Mining, SIAM, 2006, pp. 47–58.
[34] M. Gheini, M. Kejriwal, Unsupervised product entity resolution using graph representation learning, in: SIGIR Workshop on eCommerce @ ACM SIGIR Int. Conf. on Research and Development in Information Retrieval, volume 2410, CEUR-WS.org, 2019.
[35] U. Brunner, K. Stockinger, Entity matching on unstructured data: An active learning approach, in: Swiss Conf. on Data Science (SDS), IEEE, 2019, pp. 97–102.
[36] X. Chen, Y. Xu, D. Broneske, G. C. Durand, R. Zoun, G. Saake, Heterogeneous committee-based active learning for entity resolution (HeALER), in: European Conf. on Advances in Databases and Information Systems (ADBIS), volume 11695 of LNCS, Springer, 2019, pp. 69–85.
[37] A. Jain, S. Sarawagi, P. Sen, Deep indexed active learning for matching heterogeneous entity representations, VLDB Endowment 15 (2021) 31–45.
[38] V. V. Meduri, L. Popa, P. Sen, M. Sarwat, A comprehensive benchmark framework for active learning methods in entity matching, in: SIGMOD Int. Conf. on Management of Data, ACM, 2020, pp. 1133–1147.
[39] M. Sariyar, A. Borg, K. Pommerening, Active learning strategies for the deduplication of electronic patient data using classification trees, Journal of Biomedical Informatics 45 (2012) 893–900.
[40] P. Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, A. Ouorou, From weakly supervised learning to biquality learning: an introduction, in: Int. Joint Conf. on Neural Networks (IJCNN), IEEE, 2021, pp. 1–10.
[41] A. Zakrzewska, D. A. Bader, Aging data in dynamic graphs: A comparative study, in: Int. Conf. on Advances in Social Networks Analysis and Mining (ASONAM), IEEE Computer Society, 2016, pp. 1055–1062.