
pyJedAI: a Lightsaber for Link Discovery

Konstantinos Nikoletos

George Papadakis

Manolis Koubarakis

National & Kapodistrian University of Athens, Panepistimioupolis 15703, Ilisia, Athens, Greece

Link Discovery constitutes a crucial task for increasing the connections between data sources in the Linked Open Data Cloud. Part of this task is Entity Resolution (ER), which aims to identify owl:sameAs relations between different entity descriptions that pertain to the same real-world object. Due to its quadratic time complexity, ER is typically carried out in two steps: first, blocking restricts the computational cost to similar descriptions, and then, matching estimates the actual similarity between them. A plethora of techniques has been proposed for each step. To facilitate their use by researchers and practitioners, we present pyJedAI, an open-source library that leverages Python's data science ecosystem to build powerful end-to-end ER workflows. The purpose of this work is to demonstrate how this can be accomplished by expert and novice users in an intuitive, yet efficient and effective way.

Keywords: Link Discovery, Entity Resolution, Blocking, Matching

1. Introduction

1. its quadratic time complexity, which cannot scale to large volumes of data, and

2. the ambiguity in the entity descriptions.

The former challenge is addressed through blocking, which curtails the search space to highly similar descriptions, instead of considering all possible pairs [7]. The latter challenge is addressed through matching, which leverages similarity signals in order to classify every pair of descriptions as matching or non-matching [8].
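To make the two steps concrete, the following is a minimal, standard-library sketch and not pyJedAI's API. It assumes a single collection of toy product descriptions, uses every token as a blocking key, and applies a Jaccard threshold of 0.5 for matching; the helper names, the data and the threshold are all illustrative assumptions.

```python
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens of a description."""
    return set(text.lower().split())


def token_blocking(descriptions):
    """Blocking: map every token to the ids of the descriptions that contain it."""
    blocks = {}
    for entity_id, text in descriptions.items():
        for token in tokens(text):
            blocks.setdefault(token, set()).add(entity_id)
    return blocks


def candidate_pairs(blocks):
    """Distinct pairs of ids that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Jaccard coefficient of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match(descriptions, pairs, threshold=0.5):
    """Matching: keep the candidate pairs whose similarity reaches the threshold."""
    return {
        (a, b)
        for a, b in pairs
        if jaccard(tokens(descriptions[a]), tokens(descriptions[b])) >= threshold
    }


# Toy data (assumption): four descriptions of two real-world products.
descriptions = {
    "e1": "Apple iPhone 13 128GB",
    "e2": "iPhone 13 Apple 128GB smartphone",
    "e3": "Samsung Galaxy S21",
    "e4": "Galaxy S21 Samsung 5G",
}
all_pairs = set(combinations(sorted(descriptions), 2))  # brute-force search space
pairs = candidate_pairs(token_blocking(descriptions))   # search space after blocking
matches = match(descriptions, pairs)
```

On this toy input, blocking leaves two of the six possible pairs for the matching step, which illustrates how the quadratic search space is curtailed before any similarity is computed.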

Numerous methods have been proposed for blocking and matching [9]. Yet, the available open-source ER tools offer very few of them, typically the ones proposed by their creators [3]. The largest variety of methods is implemented by JedAI [10]. However, JedAI, like most Link Discovery tools, constitutes an isolated system, implemented in Java, which cannot be easily extended with existing state-of-the-art techniques from other domains, like Deep Learning and Natural Language Processing (NLP). To address this issue, we present pyJedAI, a new open-source system that implements the same methods as JedAI, but is capable of combining them with any package from Python’s data science ecosystem. We have publicly released the source code of pyJedAI at https://github.com/Nikoletos-K/pyJedAI under Apache License V2.0, which supports both academic and commercial applications.

2. System Overview

pyJedAI addresses the following task:

Given a source and a target dataset, S and T, respectively, discover the set of links L = {(s, owl:sameAs, t) | s ∈ S ∧ t ∈ T}.

Its architecture appears in Figure 1. The first module is the data reader, which specifies the user input. pyJedAI supports both semi-structured and structured data as input. The former, which include SPARQL endpoints and RDF/OWL dumps, are read by RDFLib¹. The latter, which include relational databases as well as CSV and JSON files, are read by pandas². In this way, pyJedAI is able to interlink any combination of semi-structured and structured data sources, which is a unique feature.
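The idea of interlinking heterogeneous inputs can be sketched as follows. This is an illustration only: it swaps in Python's standard csv and json modules for pandas and RDFLib to stay dependency-free, and the function names, column names and records are hypothetical. Both readers normalize their input into the same shape, a dictionary from entity id to attribute-value pairs, so that later steps need not know the original format.

```python
import csv
import io
import json


def read_csv_profiles(handle, id_column):
    """Turn each CSV row into id -> {attribute: value}, dropping empty values."""
    profiles = {}
    for row in csv.DictReader(handle):
        entity_id = row.pop(id_column)
        profiles[entity_id] = {k: v for k, v in row.items() if v}
    return profiles


def read_json_profiles(handle, id_key):
    """Turn a JSON array of objects into id -> {attribute: value}."""
    profiles = {}
    for record in json.load(handle):
        record = dict(record)
        entity_id = str(record.pop(id_key))
        profiles[entity_id] = {
            k: str(v) for k, v in record.items() if v not in (None, "")
        }
    return profiles


# In-memory stand-ins for a CSV source and a JSON target (assumed toy data).
csv_source = io.StringIO("id,name,price\ns1,Apple iPhone 13,799\ns2,Galaxy S21,\n")
json_target = io.StringIO(
    '[{"uri": "t1", "title": "iPhone 13 by Apple", "brand": "Apple"}]'
)

source = read_csv_profiles(csv_source, id_column="id")
target = read_json_profiles(json_target, id_key="uri")
```

Note that the two datasets use different attribute names (name versus title), which is why schema-agnostic methods that operate on all attribute values are convenient in this setting.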

The second step in pyJedAI’s pipeline performs block building, a coarse-grained process that clusters together similar entities. The end result consists of a set of candidate pairs, which are examined analytically by the subsequent steps. pyJedAI implements the same established methods for similarity joins and blocking as JedAI, such as Standard Blocking and Sorted Neighborhood, but goes beyond all Link Discovery tools by incorporating recent, state-of-the-art libraries for nearest neighbor search like FALCONN³ and FAISS⁴. In the near future, we will also add support for DeepBlocker [11], the best performing blocking method that leverages Deep Learning without the need to provide any labelled instances, just like all other block building methods.
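As an illustration of block building, here is a simplified sketch of the Sorted Neighborhood idea, not pyJedAI's implementation. It assumes each entity is represented by a single, already chosen blocking key (hypothetical toy strings): entities from both datasets are sorted by key and only source-target pairs that fall inside a sliding window become candidates.

```python
def sorted_neighborhood(source, target, window=2):
    """Candidate (source_id, target_id) pairs from a sliding window over sorted keys.

    `source` and `target` map entity ids to blocking keys; `window` is the number
    of consecutive entries that are considered together.
    """
    entries = [(key, "S", eid) for eid, key in source.items()]
    entries += [(key, "T", eid) for eid, key in target.items()]
    entries.sort()

    candidates = set()
    for i, (_, side_i, id_i) in enumerate(entries):
        for _, side_j, id_j in entries[i + 1 : i + window]:
            if side_i != side_j:  # only cross-dataset pairs are candidates
                pair = (id_i, id_j) if side_i == "S" else (id_j, id_i)
                candidates.add(pair)
    return candidates


# Toy blocking keys (assumption): lowercased product names.
source = {"s1": "galaxy s21", "s2": "iphone 13"}
target = {"t1": "galaxy s21 5g", "t2": "iphone 13 pro", "t3": "pixel 6"}

narrow = sorted_neighborhood(source, target, window=2)
wide = sorted_neighborhood(source, target, window=3)
```

A larger window raises the chance of retaining true matches at the cost of more candidate pairs, which is the usual trade-off when configuring this family of methods.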

The next two workflow steps are optional, implementing the same established block and comparison cleaning methods as JedAI. Their goal is to significantly reduce the number of candidate pairs, increasing the overall time efficiency and scalability at a small cost in effectiveness, i.e., by sacrificing recall to an insignificant extent. All methods are efficiently implemented on top of Python’s dictionaries, just like the block building ones.
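The two cleaning steps can be sketched with plain dictionaries, in the spirit of the description above. This is a simplified illustration under stated assumptions, not pyJedAI's code: blocks are assumed to map a key to a pair of id sets (source ids, target ids), Block Purging uses a hypothetical fixed comparison limit, and the Weighted Edge Pruning variant weights each pair by its number of common blocks and keeps the pairs whose weight is at least the mean.

```python
from collections import Counter
from itertools import product


def block_purging(blocks, max_comparisons):
    """Block cleaning: drop blocks that entail more comparisons than the limit."""
    return {
        key: (s_ids, t_ids)
        for key, (s_ids, t_ids) in blocks.items()
        if len(s_ids) * len(t_ids) <= max_comparisons
    }


def weighted_edge_pruning(blocks):
    """Comparison cleaning: keep pairs whose common-block count reaches the mean."""
    weights = Counter()
    for s_ids, t_ids in blocks.values():
        weights.update(product(sorted(s_ids), sorted(t_ids)))
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: weight for pair, weight in weights.items() if weight >= mean}


# Toy blocks (assumption): key -> (source ids, target ids).
blocks = {
    "apple": ({"s1"}, {"t1"}),
    "iphone": ({"s1"}, {"t1"}),
    "galaxy": ({"s2"}, {"t2"}),
    "s21": ({"s2"}, {"t2"}),
    "5g": ({"s2"}, {"t1"}),
    "phone": ({"s1", "s2"}, {"t1", "t2", "t3"}),  # oversized, uninformative block
}

purged = block_purging(blocks, max_comparisons=4)
pruned = weighted_edge_pruning(purged)
```

Here the oversized block is discarded first, and the remaining pairs that share only one block are pruned, so only the two strongly co-occurring pairs reach the matching step.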

¹ https://rdflib.dev
² https://pandas.pydata.org
³ https://falconn-lib.org
⁴ https://github.com/facebookresearch/faiss

[Figure 1: The architecture of pyJedAI. Labels recoverable from the figure: Block Purging, Block Filtering; Comparison Cleaning (Weighted Edge Pruning, Weighted Node Pruning, Cardinality Edge Pruning, Cardinality Node Pruning, BLAST, ...); pypi:strsimpy; Unique Mapping Clustering, Markov Clustering, Kiraly Clustering, Correlation Clustering, Exact Clustering, ...; NetworkX; Output (Evaluation Measures, Visualization, Data Writing).]

The entity matching step estimates the actual similarity between the candidate pairs. Unlike all other Link Discovery tools, which rely exclusively on string similarity measures like edit distance and Jaccard coefficient [3], pyJedAI leverages the latest advanced NLP techniques, like pre-trained embeddings (e.g., word2vec, fastText and GloVe) and transformer language models (i.e., BERT and its variants) [12]. More specifically, pyJedAI supports packages like pypi:strsimpy⁵, Gensim⁶ and Hugging Face⁷. This unique feature boosts pyJedAI’s accuracy to a significant extent, without requiring any labelled instances from the user.
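The benefit of embedding-based matching over token overlap can be sketched with a toy example. The vectors below are made-up three-dimensional numbers standing in for a pre-trained model, and the helper names are hypothetical; in practice the vectors would come from a library such as Gensim or Hugging Face. Each description is embedded as the average of its token vectors and candidate pairs are scored with cosine similarity.

```python
import math

# Made-up 3-dimensional vectors (assumption), standing in for pre-trained embeddings.
TOY_VECTORS = {
    "laptop": (0.9, 0.1, 0.0),
    "notebook": (0.8, 0.2, 0.1),
    "phone": (0.1, 0.9, 0.0),
    "apple": (0.5, 0.5, 0.3),
}


def embed(text):
    """Average the vectors of the known tokens; zero vector if none is known."""
    vectors = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(dim) / len(vectors) for dim in zip(*vectors))


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def score_pairs(source, target, candidates):
    """Similarity score for every candidate (source_id, target_id) pair."""
    return {
        (s, t): cosine(embed(source[s]), embed(target[t])) for s, t in candidates
    }


source = {"s1": "Apple laptop", "s2": "Apple phone"}
target = {"t1": "apple notebook"}
scores = score_pairs(source, target, [("s1", "t1"), ("s2", "t1")])
```

Both candidate pairs share exactly one token with the target, so their Jaccard coefficients are identical, whereas the embedding-based score ranks the laptop/notebook pair clearly higher.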

The last step performs entity clustering to further increase the accuracy. The relevant techniques consider the global information provided by the similarity scores of all candidate pairs in order to take local decisions for each pair of entity descriptions. pyJedAI implements and offers the same established algorithms as JedAI, using NetworkX⁸ to ensure high time efficiency.
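One of the clustering algorithms named in this work, Unique Mapping Clustering, can be sketched as a greedy procedure. The version below is a plain-Python simplification rather than pyJedAI's NetworkX-based implementation, and the scores and the 0.5 threshold are assumptions: pairs are visited in decreasing order of similarity, and a pair becomes a link only if neither of its entities has already been linked.

```python
def unique_mapping_clustering(scores, threshold=0.5):
    """Greedy one-to-one linking of (source_id, target_id) pairs by similarity.

    Returns owl:sameAs triples, ordered from the most to the least similar pair.
    """
    matched_sources, matched_targets, links = set(), set(), []
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    for (s, t), score in ranked:
        if score < threshold:
            break  # all remaining pairs are below the threshold
        if s not in matched_sources and t not in matched_targets:
            matched_sources.add(s)
            matched_targets.add(t)
            links.append((s, "owl:sameAs", t))
    return links


# Toy similarity scores (assumption) for the candidate pairs.
scores = {
    ("s1", "t1"): 0.9,
    ("s1", "t2"): 0.8,
    ("s2", "t2"): 0.7,
    ("s2", "t1"): 0.6,
    ("s3", "t3"): 0.3,
}
links = unique_mapping_clustering(scores)
```

The pair (s1, t2) has a high score but is rejected because s1 is already linked to t1, which shows how the global view of all scores informs the decision for each individual pair.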

Finally, users are able to evaluate, visualize and store the results of the selected pipeline through the intuitive interface of Jupyter notebooks. In this way, pyJedAI facilitates its use by researchers and practitioners that are familiar with the data science ecosystem, regardless of their familiarity with ER and Link Discovery, in general.
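The evaluation of a pipeline's output can be sketched as follows, assuming the standard precision, recall and F1 measures computed against a ground truth of known matches. The function name and the toy link sets are illustrative and do not reflect pyJedAI's own evaluation interface.

```python
def evaluate(predicted, ground_truth):
    """Precision, recall and F1 of predicted links against the true matches."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy links (assumption): three predictions, four true matches, two in common.
predicted = {("s1", "t1"), ("s2", "t2"), ("s3", "t4")}
ground_truth = {("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")}
report = evaluate(predicted, ground_truth)
```

The same function can be applied to the candidate pairs produced after blocking or cleaning, in which case recall indicates how many true matches survived the reduction of the search space.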

3. Demonstration

The purpose of our demonstration is to highlight pyJedAI’s unique capabilities and ease-of-use. To this end, the user is merely asked to select the dataset(s) to be processed and the methods that will form the end-to-end workflow. For the former, the user can select any of the datasets for instance matching from the latest OAEI [13], or any of the four established benchmark ER datasets⁹ in any of the supported data formats. Regarding method selection, no labelled instances are required from any of the implemented techniques; the user merely needs to call them in the correct order and to configure their parameters, if the default ones do not yield satisfactory performance.

⁵ https://github.com/luozhouyang/python-string-similarity
⁶ https://radimrehurek.com/gensim
⁷ https://huggingface.co
⁸ https://networkx.org

This is accomplished through a Jupyter notebook that contains detailed instructions for the user, lists all available methods per step and shows the status of every running method through a progress bar.¹⁰ Special care has been taken to assess the performance of every workflow step along with the overall pipeline. Thus, a series of effectiveness and time efficiency measures is reported after every step. See Figure 2 for an example: command [18] shows the available methods for the optional step of Comparison Cleaning, command [19] applies one of them to the existing set of blocks, while showing its progress, and command [20] reports the performance of the cleaned set of blocks. Note that it is also possible to collectively report the performance of all tests executed in every session so as to facilitate the comparison between different pipelines and configurations.

⁹ https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution
¹⁰ https://nbviewer.org/github/Nikoletos-K/pyJedAI/blob/main/CleanCleanER-AbtBuy.ipynb

4. Conclusions

pyJedAI constitutes the sole open-source Link Discovery tool that is capable of exploiting the latest breakthroughs in Deep Learning and NLP techniques, which are publicly available through the Python data science ecosystem. This applies to both blocking and matching, thus ensuring high time efficiency, high scalability as well as high effectiveness, without requiring any labelled instances from the user. In the future, we intend to extend pyJedAI with more capabilities, such as Schema Matching through the Valentine system (https://github.com/delftdata/valentine-system).

Acknowledgments

This work has received funding from the European Union’s Horizon 2020 research and innovation programme under GA No 101016798 (AI4Copernicus), EU Horizon Europe GA No 101070122 (STELAR), and from the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment grant” (Project Number: HFRI-FM17-2351 GeoQA).

References

[1] The linked open data cloud, https://lod-cloud.net/#about, 2022.

[2] A. Ferrara, A. Nikolov, F. Scharffe, Data linking for the semantic web, Int. J. Semantic Web Inf. Syst. 7 (2011) 46-76.

[3] M. Nentwig, M. Hartung, A. N. Ngomo, E. Rahm, A survey of current link discovery frameworks, Semantic Web 8 (2017) 419-436.

[4] P. Christen, Data Matching, Springer, 2012.

[5] X. L. Dong, D. Srivastava, Big Data Integration, Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2015.

[6] V. Christophides, V. Efthymiou, K. Stefanidis, Entity Resolution in the Web of Data, Morgan & Claypool Publishers, 2015.

[7] G. Papadakis, D. Skoutas, E. Thanos, T. Palpanas, Blocking and filtering techniques for entity resolution: A survey, ACM CSUR 53 (2020) 31:1-31:42.

[8] V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, K. Stefanidis, An overview of end-to-end entity resolution for big data, ACM CSUR 53 (2021) 127:1-127:42.

[9] G. Papadakis, E. Ioannou, E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan & Claypool Publishers, 2021.

[10] G. Papadakis, G. M. Mandilaras, L. Gagliardelli, G. Simonini, E. Thanos, G. Giannakopoulos, S. Bergamaschi, T. Palpanas, M. Koubarakis, Three-dimensional entity resolution with JedAI, Inf. Syst. 93 (2020) 101565.

[11] S. Thirumuruganathan, H. Li, N. Tang, M. Ouzzani, Y. Govind, D. Paulsen, G. Fung, A. Doan, Deep learning for blocking in entity matching: A design space exploration, Proc. VLDB Endow. 14 (2021) 2459-2472.

[12] Q. Liu, M. J. Kusner, P. Blunsom, A survey on contextual embeddings, CoRR abs/2003.07278 (2020).

[13] M. A. N. Pour, A. Algergawy, et al., Results of the ontology alignment evaluation initiative 2021, in: Proceedings of the 16th International Workshop on Ontology Matching co-located with ISWC 2021, volume 3063, 2021, pp. 62-108.