<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>pyJedAI: a Lightsaber for Link Discovery</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Konstantinos Nikoletos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Papadakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manolis Koubarakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National &amp; Kapodistrian University of Athens</institution>
          ,
          <addr-line>Panepistimioupolis 15703, Ilisia, Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Link Discovery constitutes a crucial task for increasing the connections between data sources in the Linked Open Data Cloud. Part of this task is Entity Resolution (ER), which aims to identify owl:sameAs relations between different entity descriptions that pertain to the same real-world object. Due to its quadratic time complexity, ER is typically carried out in two steps: first, blocking restricts the computational cost to similar descriptions, and then, matching estimates the actual similarity between them. A plethora of techniques has been proposed for each step. To facilitate their use by researchers and practitioners, we present pyJedAI, an open-source library that leverages Python's data science ecosystem to build powerful end-to-end ER workflows. The purpose of this work is to demonstrate how this can be accomplished by expert and novice users in an intuitive, yet efficient and effective way.</p>
      </abstract>
      <kwd-group>
        <kwd>Link Discovery</kwd>
        <kwd>Entity Resolution</kwd>
        <kwd>Blocking</kwd>
        <kwd>Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
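      <p>Entity Resolution (ER) faces two main challenges:</p>
      <list list-type="order">
        <list-item>
          <p>its quadratic time complexity, which cannot scale to large volumes of data, and</p>
        </list-item>
        <list-item>
          <p>the ambiguity in the entity descriptions.</p>
        </list-item>
      </list>
      <sec id="sec-1-1">
        <p>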
          The former challenge is addressed through blocking, which curtails the search space to highly
similar descriptions, instead of considering all possible pairs [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. The latter challenge is addressed
through matching, which leverages similarity signals in order to categorize every pair of
descriptions into matching or non-matching ones [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          Numerous methods have been proposed for blocking and matching [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Yet, the available
open-source ER tools offer very few of them, typically the ones proposed by their creators [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
The largest variety of methods is implemented by JedAI [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. However, JedAI, like most Link
Discovery tools, constitutes an isolated system, implemented in Java, which cannot be easily
extended with existing state-of-the-art techniques from other domains, like Deep Learning
and Natural Language Processing (NLP). To address this issue, we present pyJedAI, a new
open-source system that implements the same methods as JedAI, but is capable of combining
them with any package from Python’s data science ecosystem. We have publicly released the
source code of pyJedAI at https://github.com/Nikoletos-K/pyJedAI under Apache License V2.0,
which supports both academic and commercial applications.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. System Overview</title>
      <p>pyJedAI addresses the following task:</p>
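      <p>Given a source and a target dataset, <inline-formula><tex-math>S</tex-math></inline-formula> and <inline-formula><tex-math>T</tex-math></inline-formula>, respectively, discover the set of links <inline-formula><tex-math>L = \{(s, \text{owl:sameAs}, t) \mid s \in S \wedge t \in T\}</tex-math></inline-formula>.</p>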
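      <p>To make the task concrete, the following minimal sketch enumerates every pair of a source and a target description, which is the quadratic baseline that the rest of the pipeline is meant to avoid. The function names, the toy descriptions, the token-level Jaccard rule and the 0.5 threshold are illustrative assumptions and not part of pyJedAI.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import product


def token_jaccard_match(a, b, threshold=0.5):
    """Hypothetical matching rule: token-level Jaccard similarity above a threshold."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return bool(union) and len(tokens_a & tokens_b) / len(union) >= threshold


def brute_force_links(source, target, is_match):
    """Return (s, 'owl:sameAs', t) triples by testing every source-target pair."""
    links = set()
    comparisons = 0
    for (s_id, s_desc), (t_id, t_desc) in product(source.items(), target.items()):
        comparisons += 1
        if is_match(s_desc, t_desc):
            links.add((s_id, "owl:sameAs", t_id))
    return links, comparisons


# Toy descriptions, invented for illustration only.
source = {"s1": "Apple iPhone 13 128GB", "s2": "Samsung Galaxy S21"}
target = {
    "t1": "iPhone 13 Apple 128GB",
    "t2": "Sony Bravia TV",
    "t3": "Galaxy S21 Samsung 5G",
}

links, comparisons = brute_force_links(source, target, token_jaccard_match)
```
]]></preformat>
      <p>On this toy input the baseline needs 2 x 3 = 6 comparisons to find two links; the number of comparisons grows with the product of the two dataset sizes, which motivates the blocking steps described next.</p>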
      <p>Its architecture appears in Figure 1. The first module is the data reader, which specifies the
user input. pyJedAI supports both semi-structured and structured data as input. The former,
which include SPARQL endpoints and RDF/OWL dumps, are read by RDFLib<sup>1</sup>. The latter, which
include relational databases as well as CSV and JSON files, are read by pandas<sup>2</sup>. In this way,
pyJedAI is able to interlink any combination of semi-structured and structured data sources,
which is a unique feature.</p>
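      <p>One way to picture how heterogeneous inputs can be interlinked is to normalize both kinds of data into the same attribute-value representation. The sketch below is only an illustration of that idea: it relies on the standard library rather than pandas and RDFLib to stay dependency-free, and the helper names (<monospace>profiles_from_csv</monospace>, <monospace>profiles_from_triples</monospace>) and the toy data are assumptions, not pyJedAI's API.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import csv
import io
from collections import defaultdict


def profiles_from_csv(text, id_column):
    """Structured input: one entity profile (attribute -> value) per CSV row."""
    profiles = {}
    for row in csv.DictReader(io.StringIO(text)):
        entity_id = row.pop(id_column)
        # Empty cells are dropped so that profiles only carry actual values.
        profiles[entity_id] = {key: value for key, value in row.items() if value}
    return profiles


def profiles_from_triples(triples):
    """Semi-structured input: group (subject, predicate, object) triples by subject."""
    grouped = defaultdict(dict)
    for subject, predicate, obj in triples:
        # Multi-valued predicates are concatenated into one textual value.
        previous = grouped[subject].get(predicate)
        grouped[subject][predicate] = obj if previous is None else previous + " " + obj
    return dict(grouped)


# Toy inputs, invented for illustration only.
csv_text = "id,name,price\n1,Apple iPhone 13,799\n2,Sony Bravia TV,\n"
triples = [
    ("ex:p1", "rdfs:label", "iPhone 13"),
    ("ex:p1", "ex:brand", "Apple"),
    ("ex:p2", "rdfs:label", "Galaxy S21"),
]

source = profiles_from_csv(csv_text, id_column="id")
target = profiles_from_triples(triples)
```
]]></preformat>
      <p>Once both sides are plain dictionaries of attribute-value pairs, the subsequent steps no longer need to know whether a profile originated from a table or from a graph.</p>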
      <p>
        The second step in pyJedAI’s pipeline performs block building, a coarse-grained process that
clusters together similar entities. The end result consists of a set of candidate pairs, which
are examined analytically by the subsequent steps. pyJedAI implements the same established
methods for similarity joins and blocking as JedAI, such as Standard Blocking and Sorted
Neighborhood, but goes beyond all Link Discovery tools by incorporating recent,
state-of-the-art libraries for nearest neighbor search like FALCONN<sup>3</sup> and FAISS<sup>4</sup>. In the near future, we
will also add support for DeepBlocker [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the best performing blocking method that leverages
Deep Learning without the need to provide any labelled instances – just like all other block
building methods.
      </p>
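      <p>As an illustration of block building, the sketch below follows a token-based, schema-agnostic reading of Standard Blocking: every token that appears in descriptions of both datasets defines a block, and only pairs that share a block become candidates. The function names and toy profiles are assumptions made for this example; pyJedAI's own implementation and parameters may differ.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def tokens(profile):
    """Lower-cased tokens from all attribute values of an entity profile."""
    return {tok for value in profile.values() for tok in value.lower().split()}


def standard_blocking(source, target):
    """One block per token; each block holds (source ids, target ids)."""
    blocks = defaultdict(lambda: (set(), set()))
    for entity_id, profile in source.items():
        for tok in tokens(profile):
            blocks[tok][0].add(entity_id)
    for entity_id, profile in target.items():
        for tok in tokens(profile):
            blocks[tok][1].add(entity_id)
    # Keep only blocks that contain entities from both datasets.
    return {key: sides for key, sides in blocks.items() if sides[0] and sides[1]}


def candidate_pairs(blocks):
    """Distinct source-target pairs that co-occur in at least one block."""
    return {(s, t) for src, tgt in blocks.values() for s in src for t in tgt}


# Toy profiles, invented for illustration only.
source = {
    "s1": {"name": "Apple iPhone 13", "storage": "128GB"},
    "s2": {"name": "Samsung Galaxy S21"},
}
target = {
    "t1": {"title": "iPhone 13 by Apple"},
    "t2": {"title": "Sony Bravia TV"},
    "t3": {"title": "Galaxy S21 5G"},
}

blocks = standard_blocking(source, target)
pairs = candidate_pairs(blocks)
```
]]></preformat>
      <p>In this toy example only 2 of the 6 possible pairs survive as candidates, and the blocks are held in an ordinary Python dictionary keyed by token.</p>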
      <p>The next two workflow steps are optional, implementing the same established block and
comparison cleaning methods as JedAI. Their goal is to significantly reduce the number of
candidate pairs, increasing the overall time efficiency and scalability at a small cost in effectiveness,
i.e., by sacrificing recall to an insignificant extent. All methods are efficiently implemented on
top of Python’s dictionaries, just like the block building ones.</p>
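      <p>The two optional cleaning steps can be sketched on top of plain dictionaries as follows. Block Purging is approximated here by a fixed cap on the comparisons per block, and comparison cleaning by a simple form of Weighted Edge Pruning that weights each pair by its number of common blocks and keeps the pairs reaching the average weight. The cap, the weighting scheme and all names are assumptions made for this example, not pyJedAI's actual configuration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def block_purging(blocks, max_comparisons):
    """Block cleaning: drop oversized blocks, which mostly add superfluous comparisons."""
    return {
        key: (src, tgt)
        for key, (src, tgt) in blocks.items()
        if len(src) * len(tgt) <= max_comparisons
    }


def weighted_edge_pruning(blocks):
    """Comparison cleaning: weight each pair by its number of common blocks
    and keep the pairs whose weight reaches the average weight."""
    weights = Counter()
    for src, tgt in blocks.values():
        for s in src:
            for t in tgt:
                weights[(s, t)] += 1
    if not weights:
        return {}
    average = sum(weights.values()) / len(weights)
    return {pair: weight for pair, weight in weights.items() if weight >= average}


# Toy blocks (block key -> source ids, target ids), invented for illustration.
blocks = {
    "apple": ({"s1"}, {"t1"}),
    "iphone": ({"s1"}, {"t1"}),
    "13": ({"s1"}, {"t1", "t2"}),
    "galaxy": ({"s2"}, {"t3"}),
    "s21": ({"s2"}, {"t3"}),
    "the": ({"s1", "s2"}, {"t1", "t2", "t3"}),
}

purged = block_purging(blocks, max_comparisons=4)
kept = weighted_edge_pruning(purged)
```
]]></preformat>
      <p>Here the stop-word block "the" is purged, and the weakly supported pair (s1, t2), which shares a single block, is pruned because the average weight is 2.</p>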
      <p>
        The entity matching step estimates the actual similarity between the candidate pairs. Unlike
all other Link Discovery tools, which rely exclusively on string similarity measures like edit
distance and Jaccard coefficient [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], pyJedAI leverages the latest advanced NLP techniques,
like pre-trained embeddings (e.g., word2vec, fastText and GloVe) and transformer language
models (i.e., BERT and its variants) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. More specifically, pyJedAI supports packages like pypi:strsimpy<sup>5</sup>, Gensim<sup>6</sup> and Hugging Face<sup>7</sup>. This unique feature boosts pyJedAI’s accuracy to
a significant extent, without requiring any labelled instances from the user.
      </p>
      <sec id="sec-2-1">
        <p>Footnotes: <sup>1</sup> https://rdflib.dev; <sup>2</sup> https://pandas.pydata.org; <sup>3</sup> https://falconn-lib.org; <sup>4</sup> https://github.com/facebookresearch/faiss</p>
        <p>[Figure 1: The architecture of pyJedAI. Labels recoverable from the figure: Block Purging, Block Filtering; Comparison Cleaning (Weighted Edge Pruning, Weighted Node Pruning, Cardinality Edge Pruning, Cardinality Node Pruning, BLAST); pypi:strsimpy; Unique Mapping Clustering, Markov Clustering, Kiraly Clustering, Correlation Clustering, Exact Clustering; NetworkX; Output (Evaluation Measures, Visualization, Data Writing).]</p>
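        <p>The entity matching step can be sketched as scoring every candidate pair with a pluggable similarity function. The example below uses token-level Jaccard similarity as a stand-in; the function names, the toy descriptions and the 0.45 threshold are assumptions made for this illustration and do not reflect pyJedAI's API.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two textual descriptions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def match(candidates, source, target, similarity=jaccard, threshold=0.45):
    """Score each candidate pair and keep those reaching the threshold."""
    scored = {}
    for s, t in candidates:
        score = similarity(source[s], target[t])
        if score >= threshold:
            scored[(s, t)] = score
    return scored


# Toy descriptions and candidate pairs, invented for illustration only.
source = {"s1": "Apple iPhone 13 128GB", "s2": "Samsung Galaxy S21"}
target = {
    "t1": "iPhone 13 by Apple",
    "t2": "iPhone 13 case",
    "t3": "Galaxy S21 5G",
}
candidates = [("s1", "t1"), ("s1", "t2"), ("s2", "t3")]

scored = match(candidates, source, target)
```
]]></preformat>
        <p>Because the similarity function is a parameter, the same loop could in principle be driven by a cosine similarity over embedding vectors instead of Jaccard; no labelled instances are involved in either case.</p>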
        <p>The last step performs entity clustering to further increase the accuracy. The relevant
techniques consider the global information provided by the similarity scores of all candidate pairs in
order to take local decisions for each pair of entity descriptions. pyJedAI implements and offers
the same established algorithms as JedAI, using NetworkX<sup>8</sup> to ensure high time efficiency.</p>
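        <p>As an example of how clustering uses all similarity scores to decide on individual pairs, the sketch below gives a minimal version of Unique Mapping Clustering: pairs are visited in decreasing order of similarity and a pair is accepted only if neither of its entities has been matched already. It is written in plain Python rather than on top of NetworkX, and the function name, the scores and the 0.5 threshold are assumptions made for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def unique_mapping_clustering(scored_pairs, threshold=0.5):
    """Greedy one-to-one assignment over pairs sorted by descending similarity."""
    matched_sources, matched_targets = set(), set()
    links = []
    # Ties are broken by the identifiers so that the result is deterministic.
    ordered = sorted(scored_pairs.items(), key=lambda item: (-item[1], item[0]))
    for (s, t), score in ordered:
        if score <= threshold:
            break  # all remaining pairs are at most as similar as this one
        if s in matched_sources or t in matched_targets:
            continue  # one of the two entities already has a better match
        matched_sources.add(s)
        matched_targets.add(t)
        links.append((s, t))
    return links


# Toy similarity scores, invented for illustration only.
scored_pairs = {
    ("s1", "t1"): 0.9,
    ("s1", "t2"): 0.8,
    ("s2", "t1"): 0.7,
    ("s2", "t2"): 0.6,
    ("s3", "t3"): 0.2,
}

links = unique_mapping_clustering(scored_pairs)
```
]]></preformat>
        <p>In this example (s1, t2) and (s2, t1) are rejected even though their scores exceed the threshold, because s1 and t1 are already linked to each other with a higher score.</p>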
        <p>Finally, users are able to evaluate, visualize and store the results of the selected pipeline
through the intuitive interface of Jupyter notebooks. In this way, pyJedAI facilitates its use by
researchers and practitioners that are familiar with the data science ecosystem, regardless of
their familiarity with ER and Link Discovery, in general.</p>
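        <p>The evaluation of a pipeline's output against a ground truth can be sketched as follows. Precision, recall and F1 are used here as commonly reported effectiveness measures; the function name and the toy links are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def evaluate(predicted, ground_truth):
    """Precision, recall and F1 of predicted links against known matches."""
    predicted, ground_truth = set(predicted), set(ground_truth)
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy links, invented for illustration only.
predicted = [("s1", "t1"), ("s2", "t2"), ("s3", "t4")]
ground_truth = [("s1", "t1"), ("s2", "t2"), ("s3", "t3"), ("s4", "t4")]

scores = evaluate(predicted, ground_truth)
```
]]></preformat>
        <p>For these toy links, two of the three predictions are correct and two of the four true matches are found, so precision is 2/3, recall is 1/2 and F1 is 4/7.</p>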
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Demonstration</title>
      <p>
        The purpose of our demonstration is to highlight pyJedAI’s unique capabilities and ease-of-use.
To this end, the user is merely asked to select the dataset(s) to be processed and the methods
that will form the end-to-end workflow. For the former, the user can select any of the datasets
for instance matching from the latest OAEI [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or any of the four established benchmark ER datasets<sup>9</sup> in any of the supported data formats. Regarding method selection, no labelled
instances are required by any of the implemented techniques; the user merely needs to call
them in the correct order and to configure their parameters, if the default ones do not yield
satisfactory performance.
      </p>
      <sec id="sec-3-1">
        <p>Footnotes: <sup>5</sup> https://github.com/luozhouyang/python-string-similarity; <sup>6</sup> https://radimrehurek.com/gensim; <sup>7</sup> https://huggingface.co; <sup>8</sup> https://networkx.org</p>
        <p>This is accomplished through a Jupyter notebook that contains detailed instructions for the
user, lists all available methods per step and shows the status of every running method through
a progress bar.<sup>10</sup> Special care has been taken to assess the performance of every workflow step
along with the overall pipeline. Thus, a series of effectiveness and time efficiency measures
is reported after every step. See Figure 2 for an example: command [18] shows the available
methods for the optional step of Comparison Cleaning, command [19] applies one of them to the
existing set of blocks, while showing its progress, and command [20] reports the performance
of the clean set of blocks. Note that it is also possible to collectively report the performance of
all tests executed in every session so as to facilitate the comparison between different pipelines
and configurations.</p>
        <p>Footnotes: <sup>9</sup> https://dbs.uni-leipzig.de/research/projects/object_matching/benchmark_datasets_for_entity_resolution; <sup>10</sup> https://nbviewer.org/github/Nikoletos-K/pyJedAI/blob/main/CleanCleanER-AbtBuy.ipynb</p>
      </sec>
    </sec>
    <sec>
      <title>4. Conclusions</title>
      <p>pyJedAI constitutes the sole open-source Link Discovery tool that is capable of exploiting the
latest breakthroughs in Deep Learning and NLP techniques, which are publicly available through
the Python data science ecosystem. This applies to both blocking and matching, thus ensuring
high time efficiency, high scalability as well as high effectiveness, without requiring any labelled
instances from the user. In the future, we intend to extend pyJedAI with more capabilities, such as
Schema Matching through the Valentine system (https://github.com/delftdata/valentine-system).</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work has received funding from the European Union’s Horizon 2020 research and
innovation programme under GA No 101016798 (AI4Copernicus), EU Horizon Europe GA No
101070122 (STELAR), and from the Hellenic Foundation for Research and Innovation (H.F.R.I.)
under the “First Call for H.F.R.I. Research Projects to support Faculty members and Researchers
and the procurement of high-cost research equipment grant” (Project Number: HFRI-FM17-2351
GeoQA).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>The linked open data cloud</article-title>
          , https://lod-cloud.net/#about,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Scharfe</surname>
          </string-name>
          ,
          <article-title>Data linking for the semantic web</article-title>
          ,
          <source>Int. J. Semantic Web Inf. Syst</source>
          .
          <volume>7</volume>
          (
          <year>2011</year>
          )
          <fpage>46</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nentwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hartung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          ,
          <article-title>A survey of current link discovery frameworks</article-title>
          ,
          <source>Semantic Web</source>
          <volume>8</volume>
          (
          <year>2017</year>
          )
          <fpage>419</fpage>
          -
          <lpage>436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          , Data Matching, Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X. L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <source>Big Data Integration, Synthesis Lectures on Data Management</source>
          , Morgan &amp; Claypool Publishers,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <article-title>Entity Resolution in the Web of Data</article-title>
          , Morgan &amp; Claypool Publishers,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Skoutas</surname>
          </string-name>
          , E. Thanos, T. Palpanas,
          <article-title>Blocking and filtering techniques for entity resolution: A survey</article-title>
          ,
          <source>ACM CSUR</source>
          <volume>53</volume>
          (
          <year>2020</year>
          )
          <fpage>31:1</fpage>
          -
          <lpage>31:42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          , G. Papadakis,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <article-title>An overview of end-to-end entity resolution for big data</article-title>
          ,
          <source>ACM CSUR</source>
          <volume>53</volume>
          (
          <year>2021</year>
          )
          <fpage>127:1</fpage>
          -
          <lpage>127:42</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ioannou</surname>
          </string-name>
          , E. Thanos, T. Palpanas, The Four Generations of Entity Resolution, Morgan &amp; Claypool Publishers,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Mandilaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gagliardelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Simonini</surname>
          </string-name>
          , E. Thanos, G. Giannakopoulos,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bergamaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>Three-dimensional entity resolution with jedai</article-title>
          ,
          <source>Inf. Syst</source>
          .
          <volume>93</volume>
          (
          <year>2020</year>
          )
          <fpage>101565</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Govind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paulsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <article-title>Deep learning for blocking in entity matching: A design space exploration</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>2459</fpage>
          -
          <lpage>2472</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kusner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <article-title>A survey on contextual embeddings</article-title>
          , CoRR abs/2003.07278 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M. A. N.</given-names>
            <surname>Pour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Algergawy</surname>
          </string-name>
          , et al.,
          <article-title>Results of the ontology alignment evaluation initiative 2021</article-title>
          ,
          <source>in: Proceedings of the 16th International Workshop on Ontology Matching co-located with ISWC</source>
          <year>2021</year>
          , volume
          <volume>3063</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>62</fpage>
          -
          <lpage>108</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>