FDup framework: A General-purpose solution for Efficient Entity Deduplication of Record Collections

Michele De Bonis1,*,†, Claudio Atzori1, Sandro La Bruzzo1 and Paolo Manghi1,2

1 Consiglio Nazionale delle Ricerche - Istituto di Scienza e Tecnologie dell'Informazione "A. Faedo" (ISTI-CNR), Pisa, Italy
2 OpenAIRE AMKE, Marousi (Athens), Greece

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
† These authors contributed equally.
Email: michele.debonis@isti.cnr.it (M. De Bonis); claudio.atzori@isti.cnr.it (C. Atzori); sandro.labruzzo@isti.cnr.it (S. La Bruzzo); paolo.manghi@openaire.eu (P. Manghi)
ORCID: 0000-0003-2347-6012 (M. De Bonis); 0000-0001-9613-6639 (C. Atzori); 0000-0003-2855-1245 (S. La Bruzzo); 0000-0001-7291-3210 (P. Manghi)

Abstract
Deduplication is a technique aimed at identifying and resolving duplicate metadata records in a collection, with a special focus on the performance of the approach. This paper describes FDup (Flat Collections Deduper), a general-purpose software framework supporting a complete deduplication workflow to manage big data record collections: metadata record data model definition, identification of candidate duplicates, and identification of duplicates. FDup brings two main innovations: first, it delivers a full deduplication framework in a single easy-to-use software package based on the Apache Spark Hadoop framework, where developers can customize the optimal and parallel workflow steps of blocking, sliding windows, and the similarity-matching function via an intuitive configuration file; second, it introduces a novel approach to improve performance, beyond the known techniques of "blocking" and "sliding window", through a smart similarity-matching function, T-match. T-match is engineered as a decision tree that drives the comparisons of the fields of two records as branches of predicates and allows for successful or unsuccessful early-exit strategies. The efficacy of the approach is proved by experiments performed over big data collections of metadata records in the OpenAIRE Graph, a well-known open-access knowledge base in scholarly communication.

Keywords
Data Disambiguation, Scholarly Communication, Deduplication

1. Background

Deduplication is a technique for the identification and purging of duplicate metadata records in large datasets, addressing challenges like efficiency and flexibility in the wider context of entity resolution [1, 2, 3, 4, 5]. As emerged from the surveys on this topic [6, 7], traditional methods use: (i) a preliminary blocking phase to group potentially equivalent records, (ii) sliding window techniques to manage the workload and perform pair-wise similarity matches, and (iii) a final transitive closure to create groups of duplicates. This paper briefly introduces FDup (Flat Collections Deduper), originally presented in [8] and built on Apache Spark, which enhances deduplication by offering customizable configurations and efficient similarity matching through a decision-tree-driven approach. FDup has been conceived as an evolution of the GDup framework [9] and is proven to improve its performance in handling flat record collections.
The framework is successfully deployed in the OpenAIRE Graph to manage over 300M records, demonstrating significant usability and performance benefits in real-world big data scenarios.

Outline: Section 2 formally presents the functional architecture of FDup, focusing on the requirements, the model adopted, and the technical implementation of the framework, providing an example of its usage in the OpenAIRE infrastructure. Section 3 describes the methods and techniques used to test both the framework's efficiency and usability, defining an innovative custom configuration for the deduplication. Section 4 provides experimental results and highlights how FDup outperforms traditional approaches in terms of time consumption. Section 5 concludes the paper and delves into possible future work and developments of the framework.

2. Software description

2.1. Architecture

FDup realizes the deduplication workflow shown in Figure 1. The workflow is intended to deduplicate large collections of records, processing them in four sequential phases:

• Collection import, to prepare the record collection for processing by mapping each record into a "flat" record with attributes (the use case of this paper is publication deduplication, using PIDs, title, and authors list);
• Candidate identification, to block the records to be matched by sorting them according to an orderField and scanning pairs using a sliding-window mechanism;
• Duplicates identification, to efficiently identify pairs of equivalent records via the T-match function;
• Duplicates grouping, to identify groups of equivalent records via transitive closure.

Figure 1: FDup deduplication workflow

The majority of deduplication frameworks in the literature encode record similarity-matching conditions via a similarity function of the form:

$$f([v_1, \ldots, v_k], [v'_1, \ldots, v'_k]) = \sum_{i=0}^{k} f_i(v_i, v'_i) \times w_i$$

where the $v_i$'s are the values of field $l_i$, the $f_i(v_i, v'_i)$ are comparators, i.e. functions measuring the "distance" of $v_i$ and $v'_i$ for the field $l_i$, and the $w_i$'s are the weights assigned to the comparators $f_i$, such that $\sum_{i=0}^{k} w_i = 1$. As a result, $f$ returns a value in a given range, e.g. $[0 \ldots 1]$, scoring the "distance" between two records. The records are considered equivalent if the distance measure exceeds a given threshold. For the example above, the similarity function PublicationWeightedMatch, created using the GDup framework in OpenAIRE, encodes both equivalence by identity and equivalence by value as follows:

$$\begin{aligned} PublicationWeightedMatch(r, r') ={}& jsonListMatch(r.PIDs, r'.PIDs) \times 0.5 \\ &+ TitleVersionMatch(r.title, r'.title) \times 0.1 \\ &+ AuthorsMatch(r.authors, r'.authors) \times 0.2 \\ &+ LevenshteinTitle(r.title, r'.title) \times 0.2 \end{aligned}$$

where jsonListMatch, applied to the PID field, returns 1 if the two records have at least one PID in common; TitleVersionMatch, applied to the titles, returns 1 if the two titles contain identical numbers or Roman numerals; LevenshteinTitle returns 1 if the two (normalized) titles have a Levenshtein similarity greater than 90%; and AuthorsMatch performs a "smart" matching of two lists of author name strings and returns 1 if they are 90% similar (the minimal equivalence threshold is computed over a manually validated ground truth of equivalent records). All comparators return 0 if their condition is not met.
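To ground the discussion, the following is a minimal sketch in Scala of such a weighted similarity function, mirroring the shape of PublicationWeightedMatch; the comparator bodies are simplified stand-ins (e.g. a plain Levenshtein similarity and a Jaccard overlap of author names), not FDup's actual library code.

```scala
// Sketch of a weighted similarity function in the style of
// PublicationWeightedMatch. Comparators are simplified stand-ins.
case class Publication(pids: Set[String], title: String, authors: List[String])

object WeightedMatchSketch {
  // 1.0 if the records share at least one PID, 0.0 otherwise.
  def jsonListMatch(a: Set[String], b: Set[String]): Double =
    if (a.intersect(b).nonEmpty) 1.0 else 0.0

  // 1.0 if the two titles contain the same set of numbers, 0.0 otherwise
  // (Roman numerals are omitted here for brevity).
  def titleVersionMatch(a: String, b: String): Double =
    if ("\\d+".r.findAllIn(a).toSet == "\\d+".r.findAllIn(b).toSet) 1.0 else 0.0

  // Classic dynamic-programming edit distance.
  private def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1)((i, j) =>
      if (i == 0) j else if (j == 0) i else 0)
    for (i <- 1 to a.length; j <- 1 to b.length)
      d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
        d(i - 1)(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1))
    d(a.length)(b.length)
  }

  // 1.0 if the normalized titles are more than 90% similar.
  def levenshteinTitle(a: String, b: String): Double = {
    val maxLen = math.max(math.max(a.length, b.length), 1)
    if (1.0 - levenshtein(a, b).toDouble / maxLen > 0.9) 1.0 else 0.0
  }

  // Crude stand-in for the "smart" authors match: Jaccard overlap >= 90%.
  def authorsMatch(a: List[String], b: List[String]): Double = {
    val (x, y) = (a.map(_.toLowerCase).toSet, b.map(_.toLowerCase).toSet)
    if (x.nonEmpty && y.nonEmpty &&
      x.intersect(y).size.toDouble / x.union(y).size >= 0.9) 1.0 else 0.0
  }

  // Every comparator is always evaluated, whatever the outcome: this is
  // precisely the cost that T-match's early exits avoid.
  def isDuplicate(r: Publication, s: Publication): Boolean =
    jsonListMatch(r.pids, s.pids) * 0.5 +
      titleVersionMatch(r.title, s.title) * 0.1 +
      authorsMatch(r.authors, s.authors) * 0.2 +
      levenshteinTitle(r.title, s.title) * 0.2 >= 0.5
}
```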
The minimal threshold for two records to be equivalent is 0.5, which can be reached by jsonListMatch alone or by combining the positive results of the three functions TitleVersionMatch, AuthorsMatch, and LevenshteinTitle. All $f_i$'s in PublicationWeightedMatch are always computed and require, on average, a constant execution time, regardless of whether the match succeeds or fails. Motivated by this observation, FDup introduces a similarity-match function, T-match, that returns an equivalence match by exploiting a decision tree that nests the comparator functions. Each tree node verifies a condition, which can be the result of combining one or more comparators, and introduces a positive (MATCH) or negative (NO_MATCH) exit strategy. If the exit strategy is not fired, T-match heads to the next node. An early exit skips the full traversal of the tree and can turn the result into a MATCH, i.e. a simRel relationship between the two records is drawn, or into a NO_MATCH, i.e. no relationship is drawn. By doing this, T-match saves time by avoiding unnecessary computations.

A T-match decision tree is formed by named nodes with outgoing edges. The core elements of a T-match node are the aggregation function, the list of comparators, and a threshold value. The aggregation function collects the output of the comparators and delivers an "aggregated" result based on one of the following functions: maximum, minimum, average, or weighted mean. The execution of a T-match node must end with a decision, which may be:

• positive, i.e. the result of the aggregation function is greater than or equal to the threshold value;
• negative, i.e. the result of the aggregation function is lower than the threshold;
• undefined, i.e. one of the comparators cannot be computed (e.g. absence of values); a node also bears a flag ignoreUndefined that ignores the undefined edge even if one of the values is absent.

For each decision, the node provides the name of the next node to be executed. By default, T-match provides two terminal nodes, MATCH and NO_MATCH, used to force a successful or unsuccessful early exit from the tree. The example in Figure 2 shows the function PublicationTreeMatch, which uses the same comparators but exploits a T-match decision tree. The individual matches are lined up by introducing MATCH conditions early in the process, i.e. equivalence by identity via PIDMatch, and then ordering NO_MATCH conditions by ascending execution time, i.e. equivalence by value via versionMatch, titleMatch, and authorsMatch.

Figure 2: T-match's decision tree for PublicationTreeMatch: (i) compute the jsonListMatch: a simRel relationship is drawn if there is at least 1 PID in common, otherwise proceed to the next node; (ii) compute the TitleVersionMatch: the computation is interrupted if the titles do not contain identical numbers, otherwise proceed to the next step; (iii) compute the LevenshteinTitle: the computation is interrupted if the Levenshtein similarity is lower than 90%, otherwise proceed to the next step; (iv) compute the AuthorsMatch: a simRel relationship is drawn if the authors' lists are at least 90% similar.

T-match allows for the definition of multiple paths, hence the simultaneous application of alternative similarity-match strategies in one single function. The experiments described in later sections will show that when the number of records is large, T-match significantly improves the overall performance of the deduplication process.
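To make the traversal concrete, here is a minimal sketch of a T-match-style node and evaluation loop; the type names, fields, and node wiring are illustrative assumptions, not FDup's actual classes or configuration syntax.

```scala
// Minimal sketch of a T-match-style decision tree with early exits.
// Names and structure are illustrative, not FDup's actual classes.
sealed trait Outcome
case object MATCH extends Outcome
case object NO_MATCH extends Outcome

// A comparator returns None when it cannot be computed (missing values).
case class Node[R](
  comparators: List[(R, R) => Option[Double]],
  aggregate: List[Double] => Double, // max, min, average, weighted mean, ...
  threshold: Double,
  onPositive: String,  // name of the next node, or "MATCH" / "NO_MATCH"
  onNegative: String,
  onUndefined: String,
  ignoreUndefined: Boolean = false
)

def evaluate[R](tree: Map[String, Node[R]], start: String, r: R, s: R): Outcome = {
  var current = start
  // Walk the tree until a terminal node is reached; reaching "MATCH" or
  // "NO_MATCH" before the last node is exactly the early-exit behavior.
  while (current != "MATCH" && current != "NO_MATCH") {
    val node = tree(current)
    val results = node.comparators.map(c => c(r, s))
    val defined = results.flatten
    current =
      if (defined.size < results.size && !node.ignoreUndefined) node.onUndefined
      else if (defined.nonEmpty && node.aggregate(defined) >= node.threshold) node.onPositive
      else node.onNegative
  }
  if (current == "MATCH") MATCH else NO_MATCH
}
```

A tree mirroring Figure 2 would chain four nodes (pidMatch → titleVersion → levenshteinTitle → authorsMatch), routing the positive edge of pidMatch straight to MATCH and the negative edges of the title checks straight to NO_MATCH, so the expensive authors comparison only runs for pairs that survive the cheaper checks.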
2.2. Implementation

FDup's software is structured in three modules, Pace_Core, Dedup_Workflow, and Configuration file, depicted in Figure 3. The framework is implemented in Java and Scala and is grounded on the Apache Spark framework, an open-source distributed general-purpose cluster-computing framework. FDup exploits Apache Spark to define parallel processing strategies for record collections that distribute the computation workload and reduce the execution time of the entire workflow. Scala is instead required to exploit the out-of-the-box library for the calculation of a "closed mesh" in GraphX (Apache Spark GraphX, https://spark.apache.org/graphx/). The software is published on Zenodo.org by [10]. The three modules implement the following aspects of FDup's architecture:

• Pace_Core includes the functions implementing the candidate identification phase (blocking and sliding window) and the T-match function, as well as the (extensible) libraries of comparators and clustering functions.
• Dedup_Workflow is the code required to build a deduplication workflow in the Apache Spark framework by assembling the functions in Pace_Core according to the comparators, clustering functions, and parameters specified in the Configuration file.
• Configuration file sets the parameters to configure the deduplication workflow steps, including the record data model, blocking and clustering conditions, and the T-match function strategy.

Figure 3: FDup software modules

The deduplication workflow is implemented as an Oozie workflow that encapsulates jobs executing the three steps depicted in Figure 3, to compute: (i) the similarity relations (SparkCreateSimRels), (ii) the merge relations (SparkCreateMergeRels), and (iii) the groups of duplicates and the related representative objects (SparkCreateDedupEntity). More specifically:

• SparkCreateSimRels: uses classes in the Pace_Core module to divide entities into blocks (clusters) and subsequently computes simRels according to the Configuration file settings for T-match;
• SparkCreateMergeRels: uses the GraphX library to process the simRels and close the meshes they form; for each connected component, a master record ID is chosen and mergeRels relationships are drawn between the master record and the connected records (a minimal sketch of this step follows the list);
• SparkCreateDedupEntity: uses the mergeRels to group connected records and create the representative objects.
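As an illustration of the mesh-closing step, the following is a minimal sketch of how SparkCreateMergeRels-like logic can be written with GraphX; the input shape (an RDD of record-ID pairs) and the master-selection policy (smallest ID) are simplifying assumptions, not FDup's actual implementation.

```scala
// Sketch of the mesh-closing step in the spirit of SparkCreateMergeRels.
// Assumes simRels are pairs of numeric record IDs; FDup's real code and
// master-selection policy may differ.
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object MergeRelsSketch {
  def mergeRels(simRels: RDD[(Long, Long)]): RDD[(Long, Long)] = {
    val edges = simRels.map { case (a, b) => Edge(a, b, 0) }
    val graph = Graph.fromEdges(edges, defaultValue = 0)

    // connectedComponents labels every vertex with the smallest vertex ID
    // of its component: each component is one group of duplicates.
    val byComponent: RDD[(Long, Long)] = graph.connectedComponents().vertices
      .map { case (vertexId, componentId) => (componentId, vertexId) }

    // Choose a master record per component and draw a mergeRel from the
    // master to every other member of the group.
    byComponent.groupByKey().flatMap { case (_, members) =>
      val master = members.min
      members.filter(_ != master).map(m => (master, m))
    }
  }
}
```

In this shape, SparkCreateDedupEntity's job reduces to grouping records by the master ID on the left-hand side of these pairs to build the representative objects.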
3. Experiments

The experiment aims to show the performance gain yielded by the proper configuration of T-match in a deduplication workflow, for the publication similarity-match example presented in Figure 2. To this aim, the experiment sets up two deduplication workflows with identical blocking and sliding-window settings but distinct similarity-matching configurations. Both configurations address the same similarity criteria but in opposite ways:

• PublicationTreeMatch configuration: a configuration that implements the PublicationTreeMatch decision tree illustrated in Figure 2, taking advantage of T-match and its early exits;
• PublicationWeightedMatch configuration: a configuration that implements the similarity match as the GDup (weighted mean) function PublicationWeightedMatch described in Section 2.1, by combining all comparators in one node whose final result is a MATCH or NO_MATCH decision.

Both configurations are based on the same settings for candidate identification and duplicate identification. In particular:

• The clustering functions used to extract keys from publication records are the LowercaseClustering on the DOI (e.g. a record produces a key equal to the lowercase DOI; the result is a set of clusters composed of publications with the same DOI) and the SuffixPrefix on the publication title (e.g. a record entitled "Framework for general-purpose deduplication" produces the key "orkgen"; the result is a set of clusters composed of publications with potentially equivalent titles); both functions are described in [8] (a minimal key-extraction sketch follows this list);
• the groupMaxSize is set to 200 (empirically) to avoid the creation of big clusters requiring a long execution time;
• the slidingWindowSize, which limits the number of comparisons inside a block, is set to 100 (empirically).
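For concreteness, here is a minimal sketch of the two blocking-key extractions above; the exact tokenization, stopword, and key-length rules of FDup's LowercaseClustering and SuffixPrefix are not spelled out in this paper, so the sketch simply follows the "orkgen" example literally (the suffix of a title word concatenated with the prefix of the following word).

```scala
// Sketch of blocking-key extraction in the spirit of LowercaseClustering
// and SuffixPrefix. Tokenization and key-length rules are assumptions
// reverse-engineered from the "orkgen" example, not FDup's actual code.
object ClusteringSketch {
  // Records with the same (lowercased) DOI end up in the same block.
  def lowercaseClustering(doi: String): String = doi.trim.toLowerCase

  // Keys built from the last `len` chars of a word and the first `len`
  // chars of the following word, e.g.
  // "Framework for general-purpose deduplication" -> "orkgen", ...
  def suffixPrefix(title: String, len: Int = 3): Seq[String] = {
    val words = title.toLowerCase.split("[^a-z0-9]+").filter(_.length > len)
    words.sliding(2).collect { case Array(a, b) => a.takeRight(len) + b.take(len) }.toSeq
  }
}
```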
Both the PublicationTreeMatch and the PublicationWeightedMatch configurations were run over the publication record collection published in [11]. The collection contains a set of 10M publications represented as JSON records extracted from the OpenAIRE Graph Dump [12]. In particular, publications have been selected from the Dump to form a dataset with a real-case duplication ratio of around 30% and an appropriate size to prove the substantial performance improvement yielded by the early-exit approach. Two tests were performed, comparing the performance of the configurations PublicationTreeMatch and PublicationWeightedMatch over the 10M and the 230M collections, respectively. The tests are intended to measure the added value of T-match in terms of performance gain, i.e. PublicationTreeMatch vs PublicationWeightedMatch execution times.

The tests were performed with the driver memory set to 4 GB, the number of executors to 32, the executor cores to 4, and the executor memory to 12 GB. Spark's dynamic allocation was disabled to ensure a fixed number of executors in the Spark environment, so as to avoid aleatory behavior. Moreover, since Spark's parallelization shows different execution times depending on both the distribution of the records across the executors and the cutting operations in the blocking phase, each test has been executed 10 times and the average time has been calculated. The execution time was measured in terms of the processing time required by SparkCreateSimRels, where the pair-wise comparisons are performed, and by SparkCreateMergeRels, where the groups of duplicates are generated. It was observed that SparkCreateSimRels is dominant, taking 70% of the overall processing time. As a consequence, for the sake of experiment evaluation, we: (i) reported and compared the time consumed by SparkCreateSimRels under the different tests to showcase the performance gain of T-match, and (ii) reported the results of SparkCreateMergeRels to ensure that the tests are sound, i.e. yield the same number of groups, or a small relative change percentage due to the aleatory behavior described above.

4. Results

The results of the tests on the 10M publication records dataset and the 230M full publication dataset are depicted in Figure 4 and Figure 5, respectively. The graphs show the average time consumption of the SparkCreateSimRels phase for each execution of the test.

Figure 4: 10M records test
Figure 5: 230M records test

The average time of the SparkCreateSimRels stage in the test performed over the 10M records dataset with the PublicationTreeMatch configuration is 750 seconds, while the PublicationWeightedMatch configuration consumes 1,536.4 seconds. The SparkCreateSimRels test on the 230M records dataset features an average time of 9,637.6 seconds for PublicationTreeMatch and of 15,224.5 seconds for PublicationWeightedMatch.

The results reported in Table 1 show that the two scenarios produced a comparable but not identical number of simRels, mergeRels, and connectedComponents, due to the aleatory behavior of Spark. Based on such results, it can be stated that the PublicationTreeMatch configuration overtakes the PublicationWeightedMatch configuration in terms of time consumption, improving performance by up to 50% in the first test and up to 37% in the second test.

Table 1
Average number of relations drawn by the deduplication workflow on 10M and 230M publication records

size | relation type       | TreeMatch     | WeightedMatch | relative change (%)
10M  | simRels             | 13,865,552    | 13,866,320    | 0.000055
10M  | mergeRels           | 5,247,252     | 5,247,585     | 0.000063
10M  | connectedComponents | 1,890,012     | 1,890,148     | 0.000071
10M  | pairwiseComparisons | 255,772,628   | 255,772,628   | 0.0
230M | simRels             | 172,510,072   | 172,511,772   | 0.0000098
230M | mergeRels           | 69,974,139    | 69,974,155    | 0.00000022
230M | connectedComponents | 25,250,036    | 25,250,143    | 0.0000042
230M | pairwiseComparisons | 3,650,733,202 | 3,650,733,202 | 0.0

5. Conclusions

This work presented FDup, a framework for the deduplication of record collections that allows: (i) to easily and flexibly configure the deduplication workflow depicted in Figure 1, and (ii) to add, on top of the known execution-time optimization techniques of clustering/blocking and sliding window, a new phase of similarity-match optimization. The framework allows customizing a deduplication workflow using a configuration file and a rich set of available libraries for comparators and clustering functions. The record collection data model can be adapted to any specific context, and the T-match function allows for the definition of smart and efficient similarity functions, which may combine multiple and complementary similarity strategies. The implementation using Spark contributes to the computation optimization thanks to the parallelization of the tasks in the clustering and similarity-matching phases. T-match gains further execution time by anticipating the execution of NO_MATCH decisions and postponing time-consuming decisions, such as the AuthorsMatch in the example. As proven by the reported experiments, the hypothesis is not only intuitively correct but brings, in some scenarios, substantial performance gains. When analyzing big data collections, time-saving is key for many reasons: the execution of experiments to improve a configuration, speeding up the generation of quality data in production systems, or saving time that can be "spent" to improve recall and precision by relaxing the clustering and sliding-window approaches, i.e. larger numbers of blocks and increased window size. On the other hand, time-saving depends on the ability to identify smart exit strategies applicable to a considerable percentage of the pair-wise comparisons. For example, if the publication record collection used for the experiments featured correct and corresponding PIDs for all records, the PublicationTreeMatch execution time would be further improved; on the contrary, if no PIDs were provided, the execution time would increase and get closer to that of PublicationWeightedMatch. The two functions would perform identically if the records featured no differences in the title, making AuthorsMatch the determinant of the final decision.

Acknowledgments

The work in this paper has been funded by the projects OpenAIRE-Nexus (grant agreement ID 101017452) and FAIRCORE4EOSC (grant agreement ID 101057264).

References

[1] L. Li, Entity resolution in big data era: Challenges and applications, in: C. Liu, L. Zou, J.
Li (Eds.), Database Systems for Advanced Applications, Springer International Publishing, Cham, 2018, pp. 114–117.
[2] M. Kejriwal, Entity resolution in a big data framework, in: B. Bonet, S. Koenig (Eds.), Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, AAAI Press, 2015, pp. 4243–4244. URL: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9294.
[3] E. Rahm, E. Peukert, Large scale entity resolution, in: S. Sakr, A. Y. Zomaya (Eds.), Encyclopedia of Big Data Technologies, Springer, 2019. URL: https://doi.org/10.1007/978-3-319-63962-8_4-1. doi:10.1007/978-3-319-63962-8_4-1.
[4] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, End-to-end entity resolution for big data: A survey, 2019. arXiv:1905.06397.
[5] M. Nentwig, M. Hartung, A. N. Ngomo, E. Rahm, A survey of current link discovery frameworks, Semantic Web 8 (2017) 419–436. URL: https://doi.org/10.3233/SW-150210. doi:10.3233/SW-150210.
[6] J. Paulo, J. Pereira, A survey and classification of storage deduplication systems, ACM Comput. Surv. 47 (2014). URL: https://doi.org/10.1145/2611778. doi:10.1145/2611778.
[7] A. Venish, K. Sankar, Framework of data deduplication: A survey, Indian Journal of Science and Technology 8 (2015). doi:10.17485/ijst/2015/v8i26/80754.
[8] M. De Bonis, P. Manghi, C. Atzori, FDup: a framework for general-purpose and efficient entity deduplication of record collections, PeerJ Computer Science 8 (2022) e1058.
[9] P. Manghi, C. Atzori, M. De Bonis, A. Bardi, Entity deduplication in big data graphs for scholarly communication, Data Technologies and Applications ahead-of-print (2020). doi:10.1108/DTA-09-2019-0163.
[10] M. De Bonis, C. Atzori, S. La Bruzzo, miconis/fdup: FDup v4.1.10, Zenodo, 2022. URL: https://doi.org/10.5281/zenodo.6011544. doi:10.5281/zenodo.6011544.
[11] M. De Bonis, 10Mi OpenAIRE publications dump, Zenodo, 2021. URL: https://doi.org/10.5281/zenodo.5347803. doi:10.5281/zenodo.5347803.
[12] P. Manghi, C. Atzori, A. Bardi, M. Baglioni, J. Schirrwagen, H. Dimitropoulos, S. La Bruzzo, I. Foufoulas, A. Löhden, A. Bäcker, A. Mannocci, M. Horst, P. Jacewicz, A. Czerniak, K. Kiatropoulou, A. Kokogiannaki, M. De Bonis, M. Artini, E. Ottonello, A. Lempesis, A. Ioannidis, N. Manola, P. Principe, OpenAIRE Research Graph Dump, Zenodo, 2021. URL: https://doi.org/10.5281/zenodo.4707307. doi:10.5281/zenodo.4707307.