pyJedAI: a Lightsaber for Link Discovery

Konstantinos Nikoletos¹, George Papadakis¹ and Manolis Koubarakis¹
¹ National & Kapodistrian University of Athens, Panepistimioupolis 15703, Ilisia, Athens, Greece

Woodstock'22: Symposium on the irreproducible science, June 07–11, 2022, Woodstock, NY
Email: sdi1700104@di.uoa.gr (K. Nikoletos); gpapadis@di.uoa.gr (G. Papadakis); koubarak@di.uoa.gr (M. Koubarakis)
Web: https://gpapadis.wordpress.com/ (G. Papadakis); https://cgi.di.uoa.gr/~koubarak/ (M. Koubarakis)
ORCID: 0000-0002-7298-9431 (G. Papadakis); 0000-0002-1954-8338 (M. Koubarakis)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Abstract

Link Discovery constitutes a crucial task for increasing the connections between data sources in the Linked Open Data Cloud. Part of this task is Entity Resolution (ER), which aims to identify owl:sameAs relations between different entity descriptions that pertain to the same real-world object. Due to its quadratic time complexity, ER is typically carried out in two steps: first, blocking restricts the computational cost to similar descriptions, and then, matching estimates the actual similarity between them. A plethora of techniques has been proposed for each step. To facilitate their use by researchers and practitioners, we present pyJedAI, an open-source library that leverages Python's data science ecosystem to build powerful end-to-end ER workflows. The purpose of this work is to demonstrate how this can be accomplished by expert and novice users in an intuitive, yet efficient and effective way.

Keywords

Link Discovery, Entity Resolution, Blocking, Matching

1. Introduction

At the core of the Semantic Web lies the Linked Open Data (LOD) Cloud, with its constantly increasing size: from 570 datasets in 2014 to 1,255 in 2020 [1]. Yet, the number of links between its datasets remains low: just 16,174 as of May 2020 [1]. This means that, on average, every dataset is connected to just 13 others, i.e., ~1% of all possible links.

To increase the connectivity between the LOD datasets, Link Discovery automatically detects relations between their entity descriptions [2, 3]. Entity Resolution (ER) is a subtask of Link Discovery that focuses on detecting owl:sameAs relations between entity descriptions that represent the same real-world object [4, 5, 6]. ER constitutes a non-trivial task, due to two challenges:

1. its quadratic time complexity, which prevents it from scaling to large volumes of data, and
2. the ambiguity in the entity descriptions.

The former challenge is addressed through blocking, which curtails the search space to highly similar descriptions, instead of considering all possible pairs [7]. The latter challenge is addressed through matching, which leverages similarity signals in order to categorize every pair of descriptions as matching or non-matching [8].

Numerous methods have been proposed for blocking and matching [9]. Yet, the available open-source ER tools offer very few of them, typically the ones proposed by their creators [3]. The largest variety of methods is implemented by JedAI [10]. However, JedAI, like most Link Discovery tools, constitutes an isolated system, implemented in Java, which cannot be easily extended with existing state-of-the-art techniques from other domains, like Deep Learning and Natural Language Processing (NLP).

To address this issue, we present pyJedAI, a new open-source system that implements the same methods as JedAI, but is capable of combining them with any package from Python's data science ecosystem. We have publicly released the source code of pyJedAI at https://github.com/Nikoletos-K/pyJedAI under Apache License 2.0, which supports both academic and commercial applications.

2. System Overview

pyJedAI addresses the following task: given a source and a target dataset, S and T, respectively, discover the set of links L = {(s, owl:sameAs, t) | s ∈ S ∧ t ∈ T}. Its architecture appears in Figure 1.

The first module is the data reader, which specifies the user input. pyJedAI supports both semi-structured and structured data as input. The former, which include SPARQL endpoints and RDF/OWL dumps, are read by RDFLib¹. The latter, which include relational databases as well as CSV and JSON files, are read by pandas². In this way, pyJedAI is able to interlink any combination of semi-structured and structured data sources, which is a unique feature.

The second step in pyJedAI's pipeline performs block building, a coarse-grained process that clusters together similar entities. The end result consists of a set of candidate pairs, which are examined analytically by the subsequent steps. pyJedAI implements the same established methods for similarity joins and blocking as JedAI, such as Standard Blocking and Sorted Neighborhood, but goes beyond all Link Discovery tools by incorporating recent, state-of-the-art libraries for nearest neighbor search like FALCONN³ and FAISS⁴. In the near future, we will also add support for DeepBlocker [11], the best performing Deep Learning-based blocking method, which, like all other block building methods, requires no labelled instances.

The next two workflow steps are optional, implementing the same established block and comparison cleaning methods as JedAI. Their goal is to significantly reduce the number of candidate pairs, increasing the overall time efficiency and scalability at a small cost in effectiveness, i.e., by sacrificing recall to an insignificant extent. All methods are efficiently implemented on top of Python's dictionaries, just like the block building ones.

The entity matching step estimates the actual similarity between the candidate pairs.
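
To make the notion of candidate pairs concrete, the following minimal sketch illustrates schema-agnostic Standard Blocking on top of plain Python dictionaries: every token that appears in both datasets defines a block, and only descriptions that share a block are compared. This is not pyJedAI's implementation or API; the function names (`tokenize`, `standard_blocking`, `candidate_pairs`) and the toy records are hypothetical and were invented for illustration.

```python
from collections import defaultdict
from itertools import product


def tokenize(description):
    """Lower-case all attribute values and split them into a set of tokens."""
    tokens = set()
    for value in description.values():
        tokens.update(str(value).lower().split())
    return tokens


def standard_blocking(source, target):
    """Build one block per token that appears in both datasets.

    Returns a dict: token -> (set of source ids, set of target ids).
    """
    blocks = defaultdict(lambda: (set(), set()))
    for s_id, description in source.items():
        for token in tokenize(description):
            blocks[token][0].add(s_id)
    for t_id, description in target.items():
        for token in tokenize(description):
            blocks[token][1].add(t_id)
    # Keep only blocks that can yield at least one cross-dataset comparison.
    return {tok: ids for tok, ids in blocks.items() if ids[0] and ids[1]}


def candidate_pairs(blocks):
    """Collect the distinct (source id, target id) pairs over all blocks."""
    pairs = set()
    for s_ids, t_ids in blocks.values():
        pairs.update(product(s_ids, t_ids))
    return pairs


# Hypothetical toy data, invented for illustration only.
SOURCE = {
    "s1": {"name": "Apple iPhone 13", "color": "blue"},
    "s2": {"name": "Samsung Galaxy S21", "color": "grey"},
}
TARGET = {
    "t1": {"title": "iPhone 13 by Apple"},
    "t2": {"title": "Galaxy S21 smartphone"},
    "t3": {"title": "Nokia 3310"},
}

BLOCKS = standard_blocking(SOURCE, TARGET)
PAIRS = candidate_pairs(BLOCKS)
print(sorted(PAIRS))
```

On this toy input, only 2 of the 6 possible cross-dataset pairs survive, namely (s1, t1) and (s2, t2); these are the pairs that the matching step would then score.
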
Unlike all other Link Discovery tools, which rely exclusively on string similarity measures like edit distance and the Jaccard coefficient [3], pyJedAI leverages advanced NLP techniques, such as pre-trained embeddings (e.g., word2vec, fastText and GloVe) and transformer language models (i.e., BERT and its variants) [12]. More specifically, pyJedAI supports packages like pypi:strsimpy⁵, Gensim⁶ and Hugging Face⁷. This unique feature boosts pyJedAI's accuracy to a significant extent, without requiring any labelled instances from the user.

¹ https://rdflib.dev
² https://pandas.pydata.org
³ https://falconn-lib.org
⁴ https://github.com/facebookresearch/faiss

[Figure 1: The architecture of pyJedAI. The dotted lines indicate optional steps. Pipeline stages shown: Input, Block Building, Block Cleaning, Comparison Cleaning, Entity Matching, Entity Clustering, Output.]

The last step performs entity clustering to further increase the accuracy. The relevant techniques consider the global information provided by the similarity scores of all candidate pairs in order to take local decisions for each pair of entity descriptions. pyJedAI offers the same established algorithms as JedAI, using NetworkX⁸ to ensure high time efficiency.

Finally, users are able to evaluate, visualize and store the results of the selected pipeline through the intuitive interface of Jupyter notebooks.
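
The matching, clustering and evaluation steps described above can be illustrated with a small, self-contained sketch. It scores candidate pairs with the Jaccard coefficient, then applies a greedy one-to-one assignment in the spirit of Unique Mapping Clustering, and finally computes precision and recall. It relies only on the standard library and does not reproduce pyJedAI's API or its NetworkX-based implementation; all function names, the threshold of 0.3 and the toy token sets are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def match(pairs, source_tokens, target_tokens):
    """Score every candidate pair with a similarity in [0, 1]."""
    return {
        (s, t): jaccard(source_tokens[s], target_tokens[t]) for s, t in pairs
    }


def unique_mapping_clustering(scores, threshold=0.3):
    """Greedy one-to-one assignment: visit pairs by descending similarity and
    keep a pair only if neither of its entities has been matched already."""
    matched_s, matched_t, links = set(), set(), []
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    for (s, t), sim in ranked:
        if sim < threshold:
            break  # all remaining pairs are below the threshold
        if s not in matched_s and t not in matched_t:
            links.append((s, t))
            matched_s.add(s)
            matched_t.add(t)
    return links


def precision_recall(links, ground_truth):
    """Effectiveness of the detected links against the known matches."""
    true_positives = len(set(links) & ground_truth)
    precision = true_positives / len(links) if links else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Hypothetical toy data, invented for illustration only.
SOURCE_TOKENS = {
    "s1": {"apple", "iphone", "13"},
    "s2": {"apple", "iphone", "12"},
}
TARGET_TOKENS = {
    "t1": {"apple", "iphone", "13", "blue"},
    "t2": {"iphone", "12"},
}
CANDIDATES = {("s1", "t1"), ("s1", "t2"), ("s2", "t1"), ("s2", "t2")}
GROUND_TRUTH = {("s1", "t1"), ("s2", "t2")}

SCORES = match(CANDIDATES, SOURCE_TOKENS, TARGET_TOKENS)
LINKS = unique_mapping_clustering(SCORES)
print(LINKS, precision_recall(LINKS, GROUND_TRUTH))
```

In this example the pair (s2, t1) has a similarity of 0.4, so a purely local threshold of 0.3 would accept it; the clustering step rejects it because both entities have a better-scoring partner, which is the kind of global reasoning that entity clustering contributes.
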
By exposing the entire pipeline through notebooks, pyJedAI is accessible to researchers and practitioners who are familiar with the data science ecosystem, regardless of their familiarity with ER and Link Discovery in general.

3. Demonstration

The purpose of our demonstration is to highlight pyJedAI's unique capabilities and ease of use. To this end, the user is merely asked to select the dataset(s) to be processed and the methods that will form the end-to-end workflow. For the former, the user can select any of the datasets for instance matching from the latest Ontology Alignment Evaluation Initiative (OAEI) [13], or any of the four established benchmark ER datasets⁹ in any of the supported data formats. Regarding method selection, no labelled instances are required by any of the implemented techniques; the user merely needs to call them in the correct order and to configure their parameters, if the default ones do not yield satisfactory performance. This is accomplished through a Jupyter notebook that contains detailed instructions for the user, lists all available methods per step and shows the status of every running method through a progress bar.¹⁰

⁵ https://github.com/luozhouyang/python-string-similarity
⁶ https://radimrehurek.com/gensim
⁷ https://huggingface.co
⁸ https://networkx.org

[Figure 2: Example of running pyJedAI through a Jupyter notebook.]

Special care has been taken to assess the performance of every workflow step along with the overall pipeline. Thus, a series of effectiveness and time efficiency measures is reported after every step. See Figure 2 for an example: command [18] shows the available methods for the optional step of Comparison Cleaning, command [19] applies one of them to the existing set of blocks, while showing its progress, and command [20] reports the performance of the cleaned set of blocks. Note that it is also possible to collectively report the performance of all tests executed in a session, so as to facilitate the comparison between different pipelines and configurations.

4. Conclusions

pyJedAI constitutes the sole open-source Link Discovery tool that is capable of exploiting the latest breakthroughs in Deep Learning and NLP techniques, which are publicly available through the Python data science ecosystem. This applies to both blocking and matching, thus ensuring high time efficiency, high scalability as well as high effectiveness, without requiring any labelled instances from the user. In the future, we intend to extend pyJedAI with more capabilities, such as Schema Matching through the Valentine system (https://github.com/delftdata/valentine-system).

⁹ https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution
¹⁰ https://nbviewer.org/github/Nikoletos-K/pyJedAI/blob/main/CleanCleanER-AbtBuy.ipynb

Acknowledgments

This work has received funding from the European Union's Horizon 2020 research and innovation programme under GA No 101016798 (AI4Copernicus), EU Horizon Europe GA No 101070122 (STELAR), and from the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the "First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment grant" (Project Number: HFRI-FM17-2351 GeoQA).

References

[1] The linked open data cloud, https://lod-cloud.net/#about, 2022.
[2] A. Ferrara, A. Nikolov, F. Scharffe, Data linking for the semantic web, Int. J. Semantic Web Inf. Syst. 7 (2011) 46–76.
[3] M. Nentwig, M. Hartung, A. N. Ngomo, E. Rahm, A survey of current link discovery frameworks, Semantic Web 8 (2017) 419–436.
[4] P. Christen, Data Matching, Springer, 2012.
[5] X. L. Dong, D. Srivastava, Big Data Integration, Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2015.
[6] V. Christophides, V. Efthymiou, K. Stefanidis, Entity Resolution in the Web of Data, Morgan & Claypool Publishers, 2015.
[7] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM CSUR 53 (2020) 31:1–31:42.
[8] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM CSUR 53 (2021) 127:1–127:42.
[9] G. Papadakis, E. Ioannou, E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan & Claypool Publishers, 2021.
[10] G. Papadakis, G. M. Mandilaras, L. Gagliardelli, G. Simonini, E. Thanos, G. Giannakopoulos, S. Bergamaschi, T. Palpanas, M. Koubarakis, Three-dimensional entity resolution with JedAI, Inf. Syst. 93 (2020) 101565.
[11] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, A. Doan, Deep learning for blocking in entity matching: A design space exploration, Proc. VLDB Endow. 14 (2021) 2459–2472.
[12] Q. Liu, M. J. Kusner, P. Blunsom, A survey on contextual embeddings, CoRR abs/2003.07278 (2020).
[13] M. A. N. Pour, A. Algergawy, et al., Results of the ontology alignment evaluation initiative 2021, in: Proceedings of the 16th International Workshop on Ontology Matching co-located with ISWC 2021, volume 3063, 2021, pp. 62–108.