=Paper=
{{Paper
|id=Vol-3741/paper24
|storemode=property
|title=Overlap-Based Duplicate Table Detection
|pdfUrl=https://ceur-ws.org/Vol-3741/paper24.pdf
|volume=Vol-3741
|authors=Luca Zecchini,Tobias Bleifuß,Giovanni Simonini,Sonia Bergamaschi,Felix Naumann
|dblpUrl=https://dblp.org/rec/conf/sebd/ZecchiniBSBN24
}}
==Overlap-Based Duplicate Table Detection==
Overlap-Based Duplicate Table Detection (Discussion Paper)

Luca Zecchini¹·*, Tobias Bleifuß², Giovanni Simonini¹, Sonia Bergamaschi¹, and Felix Naumann²

¹ University of Modena and Reggio Emilia, Italy
² Hasso Plattner Institute, University of Potsdam, Germany

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
Contact: luca.zecchini@unimore.it (L. Zecchini); tobias.bleifuss@hpi.de (T. Bleifuß); giovanni.simonini@unimore.it (G. Simonini); sonia.bergamaschi@unimore.it (S. Bergamaschi); felix.naumann@hpi.de (F. Naumann)
ORCID: 0000-0002-4856-0838 (L. Zecchini); 0009-0006-9517-7707 (T. Bleifuß); 0000-0002-3466-509X (G. Simonini); 0000-0001-8087-6587 (S. Bergamaschi); 0000-0002-4483-1389 (F. Naumann)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Both the Web and data lakes contain much redundant data in the form of largely overlapping pairs of tables. In many cases, this overlap is not accidental and provides meaningful information about the relatedness of the tables. In particular, we focus on the largest overlap between two tables, i.e., their largest common subtable. The largest overlap can help us discover multiple coexisting versions of the same table, which possibly differ in the completeness and correctness of the conveyed information. Automatically detecting these highly similar, duplicate tables would allow us to guarantee their consistency through data cleaning or change propagation, but also to eliminate redundancy to free up storage space or to save additional work for the editors. Unfortunately, detecting the largest overlap is a computationally challenging problem, since it requires carefully permuting columns and rows. We therefore introduce Sloth, our solution to efficiently detect the largest overlap between two tables. As we experimentally demonstrate on real-world datasets, Sloth is not only effective at this task, but also benefits multiple additional use cases, such as detecting potential copying across sources or automatically discovering candidate multi-column joins.

Keywords: Table Overlap, Table Matching, Related Tables

===1. Overlapping Tables===

The Web contains a huge amount of structured data in tabular form. In 2008, it was already possible to retrieve more than 14 billion tables [1], and nowadays more than 2 million tables coexist in the English version of Wikipedia alone [2].

Web tables often have a very dynamic existence. A particularly representative case is that of Wikipedia [2], where tables are frequently edited or updated, moved within their page or to another page, and copied to related pages or elsewhere, with frequent episodes of carelessness, conflicts among editors [3], and even vandalism [4]. Because of this dynamism and the heterogeneity of the community of Wikipedia editors, it can be very difficult to guarantee data quality, which is fundamental for an encyclopedia, whose content should always be correct, complete, and up to date.

Figure 1: Two example tables and their largest overlap. (a) A pair of tables about football teams and stadiums. (b) The largest overlap between the two tables.
Among the 2.13 million tables existing in Wikipedia at the time of our latest snapshot, we surprisingly discovered that about 6.5 million pairs of tables present an overlap covering at least half of the area of the smaller table, for an estimated redundancy of 63.49 MB. Even more surprisingly, we detected 5.9 million pairs of coexisting tables with identical content, highlighting the widespread diffusion of copy-and-paste practices in Wikipedia [5].

In particular, we focus on the largest overlap between two tables [5], i.e., their largest common rectangular subtable, as depicted in Figure 1. In many cases, this overlap is not accidental, but gives meaningful insights into the relatedness of the tables and the quality of their content. The nature of tabular data allows changing the order of columns and rows (Figure 1b), making the detection of the largest overlap computationally challenging.

The ability to detect the largest overlap between two tables, and in particular to retrieve pairs of highly similar tables, defined as duplicate or matching tables, can lead to several benefits, such as verifying the consistency of the information conveyed by the tables and pointing out cases of incompleteness or inconsistency. This is not only relevant for Web tables: every scenario where a table can be duplicated at a certain point in time, with the different copies then evolving independently, is prone to the emergence of inconsistencies. For instance, data scientists retrieve datasets from the enterprise's data lake, perform transformations (e.g., joins or wrangling) for their analysis, and then store the new datasets back into the data lake [6]. Depending on the context, a user might want to ensure the consistency of the information present in duplicate tables through data cleaning [7] and change propagation [8], or it might be more convenient to directly prevent the rise of inconsistencies by eliminating this redundancy. In fact, avoiding redundancy not only saves disk space, but also lightens the workload of editors, who can focus on a single consistent table instead of performing every edit multiple times, with the attendant risk of inconsistencies or missed updates.

Whatever the purpose, one must first detect such duplicate tables. While the existing literature widely recognizes the importance of discovering related tables [9, 10, 11] on the Web or in data lakes to enrich the conveyed information, proposing many approaches for the efficient detection of unionable tables [12, 13, 14] or joinable tables [15, 16, 17, 18, 19], the task of detecting duplicate tables has only been investigated in specific or limited scenarios (e.g., to detect the subsequent versions of a table throughout the history of a Wikipedia page [20], or restricted to the basic cases of perfect duplicates and row containment [21]).

To fill this gap, we recently proposed Sloth [5], a novel solution to determine the largest overlap between a given pair of tables, i.e., the maximal contiguous rectangular area of identical cells that can be achieved by reordering columns and rows of both tables. More formally, given a bijective attribute mapping M : X_M ⊆ X → Y_M ⊆ Y defined between two tables R(X) and S(Y), we refer to the table overlap O_M = R[X_M] ∩⁺ S[Y_M] as the intersection under bag semantics (i.e., allowing duplicates) between the bags of tuples obtained through the projection of R(X) on X_M and of S(Y) on Y_M.
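To make the definition concrete, the following minimal sketch (our own illustration, not the paper's code) computes the overlap area A_M for a fixed attribute mapping M, modeling tables as lists of dictionaries and bags of tuples as collections.Counter objects; all names are hypothetical.

<pre>
# Minimal sketch (not the paper's implementation) of the table overlap O_M under
# bag semantics for a fixed attribute mapping M. Tables are modeled as lists of
# dictionaries and bags of tuples as collections.Counter; all names are hypothetical.
from collections import Counter

def overlap_area(r_rows, s_rows, mapping):
    """mapping: dict from attributes of R to attributes of S (bijective on its keys)."""
    # Project each table onto the mapped attributes, keeping duplicates (bag semantics).
    r_proj = Counter(tuple(row[x] for x in mapping) for row in r_rows)
    s_proj = Counter(tuple(row[mapping[x]] for x in mapping) for row in s_rows)
    o_m = r_proj & s_proj                # multiset intersection: the overlap O_M
    height = sum(o_m.values())           # |O_M|: number of overlapping tuples
    return len(mapping) * height         # A_M = |X_M| * |O_M|

r = [{"Team": "Juventus", "City": "Turin"}, {"Team": "Torino", "City": "Turin"}]
s = [{"Club": "Juventus", "Location": "Turin"}, {"Club": "Inter", "Location": "Milan"}]
print(overlap_area(r, s, {"Team": "Club", "City": "Location"}))  # 2: one tuple of two cells
</pre>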
Having defined O as the set of all possible overlaps between the two tables and the overlap area A_M = |X_M| · |O_M| as the number of cells contained in the overlap O_M, the set of the largest overlaps O* = {O_M* ∈ O | A_M* ≥ A_M, ∀ O_M ∈ O} contains the overlaps with the maximum area, in most cases just one. All details of our formalization can be found in the full research paper [5].

At the core of Sloth lies the first algorithm designed to detect the largest overlap between two tables. First, our algorithm detects the pairs of attributes across the two tables that share some cell values. By combining these pairs of attributes, it is possible to obtain a complete overview of all potentially non-empty overlaps between the tables (i.e., the candidates to be the largest one) in the form of a lattice [22]. The combined pairs determine an upper bound for the area of the candidates. Thus, our algorithm exploits this bounding mechanism to prioritize candidates and detect the largest overlap as soon as possible, minimizing the number of candidates for which we need to compute the actual area. Since this task is computationally challenging, Sloth also provides a greedy variant of the algorithm, based on beam search [23, 24], to deal with critical pairs for which the exact technique cannot produce a result in a reasonable time. Section 2 provides an overview of both algorithms, which are described in detail in the full research paper [5].

Beyond the efficiency of Sloth and the quality of the results produced by the greedy algorithm, our experimental evaluation highlights multiple relevant real-world use cases, such as the detection of highly overlapping tables in Wikipedia, the recognition of potential copying across tables from different sources [25], and the automated discovery of candidate multi-column joins in a corpus of relational tables. An excerpt from the full research paper [5] is reported in Section 3, while Section 4 briefly compares our approach to the existing literature. Finally, Section 5 presents the future directions of our research and concludes the paper.

===2. Largest Overlap Detection===

Figure 2 illustrates the high-level design of Sloth. Implemented in Python, Sloth (https://github.com/dbmodena/sloth) takes as input two tables R(X) and S(Y) (Figure 2a) and returns the set of their largest overlaps O* (Figure 2d). The exact algorithm is our default choice (Figure 2b), with a timeout mechanism defined to spot critical cases that would not provide a result in a reasonable time, which activates its greedy variant (Figure 2c). Both the timeout and the parameter of the greedy algorithm (i.e., the beam width β) can be set by users according to their needs (e.g., faster computation or better accuracy). Additional parameters can define the minimum area Δ for the largest overlap to be considered relevant, as well as minimum/maximum width/height thresholds.

Figure 2: An overview of the Sloth workflow, from a pair of tables (a) to their largest overlap (d) through our detection algorithm (b) and its greedy variant for critical cases (c).

Our exact algorithm (Algorithm 1) needs to consider all mappings that can potentially determine the largest overlap, denoted as candidates. To identify the candidates, our algorithm first considers all possible single-attribute mappings, i.e., those mappings for which X_M is represented by a single attribute x, checking the area of the overlap between R[x] and S[M(x)].
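As an illustration of this single-attribute scan, here is a minimal sketch (a hypothetical find_seeds, not Sloth's actual findSeeds function) that reuses the same list-of-dicts table representation as in the previous sketch and returns the non-empty single-attribute mappings sorted by descending area:

<pre>
# Hypothetical sketch of the single-attribute scan: for every attribute pair (x, y),
# compute the bag intersection of column R[x] and column S[y] and keep it if non-empty.
from collections import Counter
from itertools import product

def find_seeds(r_rows, r_attrs, s_rows, s_attrs):
    seeds = []
    for x, y in product(r_attrs, s_attrs):
        col_r = Counter(row[x] for row in r_rows)
        col_s = Counter(row[y] for row in s_rows)
        area = sum((col_r & col_s).values())  # for a single column, area equals height
        if area > 0:
            seeds.append((area, x, y))
    # Seeds are returned sorted by descending area.
    return sorted(seeds, key=lambda seed: -seed[0])
</pre>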
We call seeds those single-attribute mappings whose area is greater than zero, and we collect them in a dedicated list (Line 2), sorted by descending area, as depicted in Figure 3b. All details about the invoked functions can be found in the full research paper [5]. Since every multi-attribute mapping is a combination of single-attribute mappings, the candidates can be regarded as combinations of the seeds and modeled as the nodes of a lattice, as depicted in Figure 3b, where every level n contains the combinations of n seeds. Due to the bijectivity of the mapping, some seeds cannot be combined (e.g., S0 and S4 in Figure 3b, which both consider the same attribute City from X), hence possibly producing a semilattice.

Moving up within the lattice increases the width of the overlap, but not necessarily its area, as its height may decrease when new columns are added. In particular, the seed with the minimum area (equal to its height) in the combination defines an upper bound for the height of the candidate, and therefore for its area. This bounding mechanism can be exploited both to prune the lattice and to prioritize the candidates based on their potential area. We therefore define the pruning threshold θ (Line 3), which always holds the maximum between Δ and the greatest actual area of a candidate known so far (initially the area of the first seed in the list, then possibly updated every time we verify a new candidate).

Our algorithm exploits the upper bound defined by the seeds to manage two priority queues (i.e., max-heap structures), aiming to minimize both the number of candidates that need to be materialized and the number of those whose actual area needs to be computed: (i) Levels (Line 4), containing the representations of the levels of the lattice, used to generate the candidates incrementally by decreasing potential area; (ii) Candidates (Line 5), containing the generated candidates, used to progressively verify their actual area and detect the largest overlap.

Algorithm 1: Largest overlap detection algorithm
<pre>
Input:  Two tables R(X) and S(Y); minimum area Δ (default 0); minimum/maximum width/height
Output: The set of the largest overlaps O*
 1  O* ← ∅                                      // largest overlaps
 2  Seeds ← findSeeds(R, S)
 3  θ ← max(Δ, Seeds[0].A)                      // pruning threshold
 4  Levels ← initLevels(Seeds)                  // priority queue to generate candidates
 5  Candidates ← maxHeap(∅, key = A)            // priority queue to verify candidates
 6  while Candidates ≠ ∅ or Levels ≠ ∅ do
 7      while Candidates.head().A < Levels.head().A do
 8          Levels, Candidates ← genCand(Levels, Candidates, Seeds)   // generate more candidates
 9      if Candidates ≠ ∅ then
10          topC ← Candidates.pop()             // top candidate
11          if topC.O ≠ ∅ then
12              O* ← O* ∪ topC.O                // largest overlap found!
13          else
14              Levels, Candidates ← verCand(R, S, topC, Levels, Candidates)   // verify candidate
15  return O*
</pre>

In particular, we iterate on the priority queues until both of them are emptied (Line 6), terminating early as soon as all largest overlaps are detected. At each iteration, we first ensure that at least one of the candidates with the largest potential overlap has been generated and inserted into Candidates (Lines 7-8); then we can check the candidate at the top of Candidates (Line 10). If it has already been verified (i.e., its actual overlap has already been computed), none of the other candidates (among both those already generated and those yet to be generated) can present a greater area, hence it is one of the largest overlaps and we can add it to the result set (Line 12); otherwise, we need to compute its overlap (and therefore its actual area) and reinsert it into the priority queue if it can still be part of the result set (Line 14).
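To make the bounding mechanism concrete, the following simplified sketch (our illustration, not Sloth's code) collapses the two priority queues of Algorithm 1 into a single heap over pre-generated candidates: a candidate built from k seeds can never be taller than its shortest seed, so k times that height is an upper bound for its area, and candidates are verified by decreasing bound until a verified candidate reaches the top of the heap.

<pre>
# Simplified illustration of the bounding mechanism (hypothetical names; candidates are
# assumed to be pre-generated, unlike in Algorithm 1, which generates them lazily).
import heapq
from collections import Counter

def actual_area(r_rows, s_rows, mapping):
    r_proj = Counter(tuple(row[x] for x in mapping) for row in r_rows)
    s_proj = Counter(tuple(row[mapping[x]] for x in mapping) for row in s_rows)
    return len(mapping) * sum((r_proj & s_proj).values())

def largest_overlap(r_rows, s_rows, candidates, seed_heights):
    """candidates: list of mappings (dict R-attr -> S-attr);
    seed_heights: for each candidate, the heights of the seeds it combines."""
    heap = []
    for idx, (mapping, heights) in enumerate(zip(candidates, seed_heights)):
        bound = len(mapping) * min(heights)            # upper bound for the area
        heapq.heappush(heap, (-bound, idx, False, mapping))
    while heap:
        neg_area, idx, verified, mapping = heapq.heappop(heap)
        if verified:
            return mapping, -neg_area                  # no other candidate can be larger
        area = actual_area(r_rows, s_rows, mapping)    # verify the top candidate
        if area > 0:
            heapq.heappush(heap, (-area, idx, True, mapping))
    return None, 0
</pre>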
Since our exact algorithm generates the candidates by combining the seeds, the detection of a very large number of seeds (e.g., for a pair of wide tables with some values repeated across several columns in both) may produce a huge lattice, sometimes making it impossible to generate the candidates in a reasonable amount of time. To obtain a result as close as possible to the exact largest overlap, we therefore designed a greedy variant inspired by beam search, a heuristic search algorithm that performs a breadth-first search in a tree by expanding only the β most promising nodes at each level, where the parameter β is denoted as beam width.

Our greedy algorithm traverses bottom-up the lattice generated from the seeds (a sketch follows below). At the beginning, we consider a maximum of β seeds with the greatest area. To find candidates for the second level, we combine each of them with every other seed and drop repeated and invalid combinations. After this generation step, we verify all new candidates and again select only the β candidates with the greatest actual area among them. For the third and every further level, we combine the selected candidates of the previous level with every seed that is not already part of the candidate, and then again verify their area and limit their number to β, while the pruning threshold θ is updated and can determine an early stopping.

Figure 3: Two input tables, the sorted list of detected seeds, and the lattice of valid candidates. (a) Two input tables R(X) and S(Y) about football teams. (b) The sorted list of detected seeds and the valid candidates in the semilattice generated from the seeds. For every node, we report the upper bounds for its area and its height (in round brackets).
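Under the same list-of-dicts assumptions as in the previous sketches, the greedy traversal can be outlined as follows; this is an illustrative simplification of Sloth's greedy algorithm, and for brevity it omits the pruning threshold θ and the early stopping.

<pre>
# Illustrative beam-search traversal of the seed lattice (hypothetical names; the
# pruning threshold and early stopping of the actual greedy algorithm are omitted).
from collections import Counter

def actual_area(r_rows, s_rows, mapping):
    r_proj = Counter(tuple(row[x] for x in mapping) for row in r_rows)
    s_proj = Counter(tuple(row[mapping[x]] for x in mapping) for row in s_rows)
    return len(mapping) * sum((r_proj & s_proj).values())

def greedy_largest_overlap(r_rows, s_rows, seeds, beam_width):
    """seeds: list of (area, x, y) single-attribute mappings, sorted by descending area."""
    # First level: at most beam_width seeds with the greatest area.
    beam = [({x: y}, area) for area, x, y in seeds[:beam_width]]
    best = max(beam, key=lambda cand: cand[1], default=({}, 0))
    while beam:
        next_level = {}
        for mapping, _ in beam:
            for _, x, y in seeds:
                # Discard extensions that would violate the bijectivity of the mapping.
                if x in mapping or y in mapping.values():
                    continue
                extended = {**mapping, x: y}
                key = frozenset(extended.items())      # drop repeated combinations
                if key not in next_level:
                    next_level[key] = (extended, actual_area(r_rows, s_rows, extended))
        # Verify the new candidates and keep only the beam_width largest non-empty ones.
        beam = sorted(next_level.values(), key=lambda cand: -cand[1])[:beam_width]
        beam = [cand for cand in beam if cand[1] > 0]
        if beam and beam[0][1] > best[1]:
            best = beam[0]
    return best
</pre>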
===3. Experimental Evaluation===

Our experiments, whose configurations and results are reported and discussed in detail in the full research paper [5], aim to evaluate the performance of Sloth (considering both the exact algorithm and its greedy variant) and its application to three real-world use cases.

In Table 1, we present the datasets used in our evaluation. We denote as wiki_history the Wikipedia table matching dataset from the IANVS project (https://hpi.de/naumann/projects/data-profiling-and-analytics/change-exploration.html), which captures the evolution of all 3.5M tables present in the English Wikipedia throughout its entire history (until September 1, 2019). We separately consider the most recent snapshot from this dataset, denoted as wiki_latest. Beyond the Wikipedia scenario, we employ uni_dwh [16], a real-world university data warehouse, and two datasets (https://lunadong.com/fusionDataSets.htm) reporting information about stock symbols and flights captured from different sources across multiple days [25], both in their original version (i.e., raw) and in the version obtained through schema alignment (i.e., clean).

Table 1: Statistics about the number and the size of the tables appearing in the used datasets.
{| class="wikitable"
! Dataset !! #D !! Min width (#columns) !! Max width !! Avg width !! Min height (#rows) !! Max height !! Avg height
|-
| wiki_history || 55.97M || 1 || 5694 || 5.92 || 1 || 17.38k || 26.63
|-
| wiki_latest || 2.13M || 1 || 883 || 5.23 || 1 || 4670 || 11.47
|-
| uni_dwh || 158 || 1 || 55 || 9.48 || 2 || 151.78k || 5604.79
|-
| stock_raw || 1.15k || 4 || 69 || 16.18 || 221 || 1000 || 987.58
|-
| stock_clean || 1.15k || 3 || 17 || 12.49 || 221 || 1000 || 987.58
|-
| flight_clean || 1.17k || 3 || 7 || 5.75 || 6 || 1309 || 662.17
|}

First, we evaluate the performance of Sloth on a representative subset of the wiki_history dataset and on the uni_dwh dataset, chosen to cover both the case of Web tables and that of relational database tables in our analysis. Our experiments [5] confirm that the main factor leading to timeouts is the number of seeds, which directly affects the size of the lattice, hence the number of candidates that potentially need to be generated. This aspect is strongly correlated with the width of the tables and the amount of repeated cell values across them. Further, we analyze how the different tasks (i.e., seed detection, candidate generation, and candidate verification) contribute to the runtime of both the exact and the greedy algorithm, and how the beam width affects the latter. We also show the impact of the table height on the seed detection runtime and how our similarity estimation significantly differs from two widely adopted similarity metrics based on set semantics, namely Jaccard similarity and overlap set similarity [17, 26]. Finally, we evaluate the accuracy of the greedy algorithm, showing that, even in the absence of approximation guarantees on the quality of its result, it is generally able to detect largest overlaps of the same area as those discovered by the exact algorithm.

Moving to the real-world use cases, we first use Sloth to quantify the amount of largely overlapping pairs of tables coexisting in Wikipedia, using wiki_latest. The results, reported in Table 2, highlight the redundancy of the information conveyed by Wikipedia tables and the diffusion of copy-and-paste practices in the encyclopedia.

Table 2: Types of overlapping tables in Wikipedia.
{| class="wikitable"
! Type !! #Table Pairs !! Estimated Size
|-
| Perfect duplicates || 5.91M (85.521%) || 39.55 MB
|-
| Containment || 351.24k (5.085%) || 6.05 MB
|-
| Additional rows || 59.75k (0.865%) || 4.35 MB
|-
| Additional columns || 289.97k (4.198%) || 1.91 MB
|-
| Additional rows and columns || 1.53k (0.022%) || 0.56 MB
|-
| Partial overlaps || 648.91k (9.394%) || 39.16 MB
|-
| ≥ 50% of the smaller table || 252.80k (3.660%) || 35.52 MB
|-
| < 50% of the smaller table || 396.12k (5.735%) || 6.60 MB
|-
| Total (≥ 50%) || 6.51M (94.265%) || 63.49 MB
|}

Then, we show that Sloth can be used to detect potential copying across multiple sources, allowing us to retrieve, with minimal effort and without the need for schema alignment, all clusters of sources with declared copying dependencies in the stock and flight datasets [25], also leading to the discovery of meaningful additional sources. Finally, we use Sloth to automatically detect candidate multi-column joins in uni_dwh, a task not supported by any of the existing solutions [5], as highlighted by the limitations of a simple adaptation of Josie [17] in such a scenario.

===4. Related Work===

Although a plethora of algorithms has been designed to efficiently discover related tables [9, 10, 11], especially unionable [12, 13, 14] and joinable ones [15, 16, 17, 18, 19], none of them can exactly compute the largest overlap between two tables. For instance, Josie [17] considers a join column from a query table and finds the top-k columns whose set of cell values presents the largest intersection with that of the join column.
Since it operates on single columns under set semantics, it cannot be used to detect the largest overlap. At most, an adaptation considering entire tables under bag semantics would produce an upper bound for the largest overlap, so it might be used to enhance scalability by passing only the most promising pairs to Sloth. Similarly, Mate [19] is the only system supporting the discovery of multi-column joins (using a dedicated hash function named XASH), but it requires a set of columns as input. Since the set that yields the largest overlap is not known up front, it cannot easily be employed as an alternative to Sloth.

As for previous approaches to duplicate table detection, Bleifuß et al. [20] aim to find matching tables across subsequent versions of a Wikipedia page. Their solution (i.e., a multi-stage matching process based on Jaccard similarity) is designed for one-to-one matches among a limited number of tables and exploits specific aspects such as the position of the tables inside the page, hence it is not generalizable. Koch et al. [21] instead use XASH and define two tables as duplicates if they contain the same set of tuples, tackling only the cases of perfect duplicates or row containment.

===5. Conclusion and Future Work===

We presented Sloth, a method to efficiently determine the largest overlap between two tables, enabling the detection of duplicate tables. This leads to several benefits both on the Web and in data lakes. For instance, it allows spotting and solving common data quality issues, such as inconsistent or incomplete information. Also, it helps eliminate redundancy to free up storage space or to save additional work for the editors, preventing the emergence of data quality problems. Through our experimental evaluation, we assessed the performance of Sloth in real-world scenarios, considering Web tables from Wikipedia and relational tables from a data warehouse, as well as use cases such as the detection of potential copying across multiple sources and the discovery of candidate multi-column joins in a corpus of relational tables.

Moving beyond the results presented in this paper, we plan to broaden our research in multiple directions. First, we want to design updatable indexes to enable overlap-based duplicate table detection at scale, allowing users to provide a table as a query and retrieve the top-k tables presenting the largest overlap with it. Then, we want to enable Sloth to detect not only the largest, but also the best overlap between two tables, defining quality metrics to capture how meaningful an overlap is. Finally, we plan to evaluate the use of table embeddings [27] to estimate the largest overlap and to analyze the impact of table deduplication on the performance of tabular language models [28], similarly to what has already been demonstrated for their textual counterparts [29].

===References===

[1] M. J. Cafarella, A. Halevy, D. Z. Wang, E. Wu, Y. Zhang, WebTables: Exploring the Power of Tables on the Web, Proceedings of the VLDB Endowment (PVLDB) 1 (2008) 538–549. doi:10.14778/1453856.1453916.

[2] T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava, The Secret Life of Wikipedia Tables, in: Proceedings of the Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA Data @ VLDB), 2021, pp. 20–26. URL: https://ceur-ws.org/Vol-2929/paper4.pdf.

[3] S. Bykau, F. Korn, D. Srivastava, Y. Velegrakis, Fine-Grained Controversy Detection in Wikipedia, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2015, pp. 1573–1584. doi:10.1109/ICDE.2015.7113426.

[4] M. Potthast, B. Stein, R. Gerling, Automatic Vandalism Detection in Wikipedia, in: Proceedings of the European Conference on Information Retrieval (ECIR), 2008, pp. 663–668. doi:10.1007/978-3-540-78646-7_75.

[5] L. Zecchini, T. Bleifuß, G. Simonini, S. Bergamaschi, F. Naumann, Determining the Largest Overlap between Tables, Proceedings of the ACM on Management of Data (PACMMOD) 2 (2024) 48:1–48:26. doi:10.1145/3639303.

[6] F. Nargesian, E. Zhu, R. J. Miller, K. Q. Pu, P. C. Arocena, Data Lake Management: Challenges and Opportunities, Proceedings of the VLDB Endowment (PVLDB) 12 (2019) 1986–1989. doi:10.14778/3352063.3352116.

[7] I. F. Ilyas, X. Chu, Data Cleaning, 2019. doi:10.1145/3310205.

[8] T. Bleifuß, L. Bornemann, T. Johnson, D. V. Kalashnikov, F. Naumann, D. Srivastava, Exploring Change: A New Dimension of Data Analytics, Proceedings of the VLDB Endowment (PVLDB) 12 (2018) 85–98. doi:10.14778/3282495.3282496.

[9] A. Das Sarma, L. Fang, N. Gupta, A. Halevy, H. Lee, F. Wu, R. Xin, C. Yu, Finding Related Tables, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2012, pp. 817–828. doi:10.1145/2213836.2213962.

[10] A. Bogatu, A. A. A. Fernandes, N. W. Paton, N. Konstantinou, Dataset Discovery in Data Lakes, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2020, pp. 709–720. doi:10.1109/ICDE48307.2020.00067.

[11] Y. Zhang, Z. G. Ives, Finding Related Tables in Data Lakes for Interactive Data Science, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2020, pp. 1951–1966. doi:10.1145/3318464.3389726.

[12] M. J. Cafarella, A. Halevy, N. Khoussainova, Data Integration for the Relational Web, Proceedings of the VLDB Endowment (PVLDB) 2 (2009) 1090–1101. doi:10.14778/1687627.1687750.

[13] O. Lehmberg, C. Bizer, Stitching Web Tables for Improving Matching Quality, Proceedings of the VLDB Endowment (PVLDB) 10 (2017) 1502–1513. doi:10.14778/3137628.3137657.

[14] F. Nargesian, E. Zhu, K. Q. Pu, R. J. Miller, Table Union Search on Open Data, Proceedings of the VLDB Endowment (PVLDB) 11 (2018) 813–825. doi:10.14778/3192965.3192973.

[15] E. Zhu, F. Nargesian, K. Q. Pu, R. J. Miller, LSH Ensemble: Internet-Scale Domain Search, Proceedings of the VLDB Endowment (PVLDB) 9 (2016) 1185–1196. doi:10.14778/2994509.2994534.

[16] R. Castro Fernandez, Z. Abedjan, F. Koko, G. Yuan, S. Madden, M. Stonebraker, Aurum: A Data Discovery System, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2018, pp. 1001–1012. doi:10.1109/ICDE.2018.00094.

[17] E. Zhu, D. Deng, F. Nargesian, R. J. Miller, JOSIE: Overlap Set Similarity Search for Finding Joinable Tables in Data Lakes, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2019, pp. 847–864. doi:10.1145/3299869.3300065.

[18] Y. Dong, K. Takeoka, C. Xiao, M. Oyamada, Efficient Joinable Table Discovery in Data Lakes: A High-Dimensional Similarity-Based Approach, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2021, pp. 456–467. doi:10.1109/ICDE51399.2021.00046.

[19] M. Esmailoghli, J. Quiané-Ruiz, Z. Abedjan, MATE: Multi-Attribute Table Extraction, Proceedings of the VLDB Endowment (PVLDB) 15 (2022) 1684–1696. doi:10.14778/3529337.3529353.

[20] T. Bleifuß, L. Bornemann, D. V. Kalashnikov, F. Naumann, D. Srivastava, Structured Object Matching across Web Page Revisions, in: Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2021, pp. 1284–1295. doi:10.1109/ICDE51399.2021.00115.

[21] M. Koch, M. Esmailoghli, S. Auer, Z. Abedjan, Duplicate Table Detection with Xash, in: Proceedings of the Conference on Database Systems for Business, Technology and Web (BTW), 2023, pp. 367–390. doi:10.18420/BTW2023-18.

[22] G. Birkhoff, Lattice Theory, 1940. doi:10.1090/coll/025.

[23] B. T. Lowerre, The HARPY Speech Recognition System, Ph.D. thesis, Carnegie Mellon University, 1976.

[24] X. Huang, J. Baker, R. Reddy, A Historical Perspective of Speech Recognition, Communications of the ACM (CACM) 57 (2014) 94–103. doi:10.1145/2500887.

[25] X. Li, X. L. Dong, K. Lyons, W. Meng, D. Srivastava, Truth Finding on the Deep Web: Is the Problem Solved?, Proceedings of the VLDB Endowment (PVLDB) 6 (2012) 97–108. doi:10.14778/2535568.2448943.

[26] D. Deng, Y. Tao, G. Li, Overlap Set Similarity Joins with Theoretical Guarantees, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2018, pp. 905–920. doi:10.1145/3183713.3183748.

[27] R. Cappuzzo, P. Papotti, S. Thirumuruganathan, Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks, in: Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2020, pp. 1335–1349. doi:10.1145/3318464.3389742.

[28] G. Badaro, M. Saeed, P. Papotti, Transformers for Tabular Data Representation: A Survey of Models and Applications, Transactions of the Association for Computational Linguistics (TACL) 11 (2023) 227–249. doi:10.1162/tacl_a_00544.

[29] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, N. Carlini, Deduplicating Training Data Makes Language Models Better, in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2022, pp. 8424–8445. doi:10.18653/v1/2022.acl-long.577.