<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>demo: Fairness-Aware and Explainable Entity Resolution</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nikolaos Fanourakis</string-name>
          <email>fanourakis@ics.forth.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christos Kontousias</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vasilis Efthymiou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vassilis Christophides</string-name>
          <email>Vassilis.Christophides@ensea.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Plexousakis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FORTH-ICS</institution>
          ,
          <addr-line>Heraklion</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lab. ETIS, CY Cergy Paris University</institution>
          ,
          <addr-line>ENSEA, CNRS UMR 8051</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Crete</institution>
          ,
          <addr-line>Heraklion</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <fpage>6</fpage>
      <lpage>10</lpage>
      <abstract>
        <p>Entity resolution (ER) is the problem of identifying references to the same real-world objects, in disparate data sources. It has been recently shown that ER results are prone to bias, related to both factual and structural characteristics of the input data. In this work, we demonstrate an extended version of FairER, an open-source ER framework, that receives either tabular or knowledge graph data as input and produces fair and explainable results. Demonstration scenarios showcase some of its capabilities, while a public demo and video are available.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Entity resolution (ER) is the task of identifying pieces of data that refer to the same real-world
entity (e.g., a person, an organization), scattered across disparate data sources and formats, e.g.,
in tables or in knowledge graphs (KGs). Recent works have shown that the results of ER may
be biased against some entity groups, based on some of their sensitive attributes [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] (e.g.,
gender, ethnicity), or even their structural representation in a KG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] (e.g., central vs long-tail
entities). In order to mitigate various forms of direct and indirect bias [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we need to understand
first how ER methods decide which entity pairs match [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        In this demo, we introduce a unified platform that combines and extends several works for
ER, namely FairER [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], extended to work not only on tabular, but also on KG data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], with a
sampling step, SUSIE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], that deals with various structural bias settings in KG embedding-based
ER algorithms [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Specifically, we are interested in statistical parity as the target fairness
measure [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and in a connectivity-based structural bias definition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Comparison with existing works. A few end-to-end ER frameworks exist, with the most
notable being Magellan [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] (now commercialized and closed-source), Dedupe<sup>1</sup>, and JedAI [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Compared to existing works, this is the first end-to-end ER framework to incorporate fairness
(both factual and structural) by design, offering parameterized sampling for structural bias,
while also allowing the provision of visual explanations (e.g., from MOJITO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]) for the matching
decisions and a novel, intuitive explanation model for the fairness aspect.
      </p>
      <p>†These authors contributed equally.</p>
      <p>Contributions. In summary, the contributions of this work are: (a) the first open-source,
unified platform for fair and explainable ER on tabular data and on KGs; (b) a visual explanation
method for the matching results that incorporates both similarity and fairness assessments; and
(c) a modular design, in which all modules of our platform are easily extendable to include
state-of-the-art methods, as well as various definitions of fairness.</p>
      <p>The source code of our system<sup>2</sup>, a running public demo<sup>3</sup>, and a video
demonstration<sup>4</sup> are available online.</p>
    </sec>
    <sec id="sec-3">
      <title>2. System Overview</title>
      <p>
        In this section, we provide background information from relevant works [
        <xref ref-type="bibr" rid="ref2 ref4 ref6">2, 4, 6</xref>
        ]. Then, we
provide an overview of our system’s pipeline, focusing on its most important components.
      </p>
      <sec id="sec-3-1">
        <title>2.1. Background</title>
        <p>
          FairER [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] defines fairness-aware ER as a ranking of entity pairs that maximizes the (cumulative)
match likelihood score in its top-<italic>k</italic> ranks, while satisfying a given (group) fairness condition,
such as statistical parity. The fairness condition determines whether the so-called protected pairs
are fairly represented in the ER results, compared to non-protected pairs.
        </p>
        <p>As a simplified example of how FairER works, consider two datasets to be matched, both
describing people, some of whom are convicted criminals (the protected group
in this example). If the fairness condition is statistical parity, requiring equal representation of
protected and non-protected group members, then FairER will create two priority queues of
candidate matching pairs (one for the convicted criminals and one for the rest of the people),
sort them in descending order of match likelihood (see Similarity Scoring in Section 2.2), and
alternately pick the most likely match from each queue (see Matching in Section 2.2).</p>
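        <p>The alternating selection in this example can be illustrated with a minimal sketch. It is not the FairER implementation: the function name fair_top_k, the is_protected predicate, and the toy pairs and scores are hypothetical, and the sketch assumes that the protected queue is visited first and that an exhausted queue is simply skipped.</p>
        <preformat><![CDATA[
```python
import heapq


def fair_top_k(candidates, is_protected, k):
    """Return up to k pairs, alternating between two priority queues.

    candidates: iterable of (pair, score) tuples.
    is_protected: hypothetical predicate telling whether a pair is protected.
    """
    # heapq is a min-heap, so scores are negated to pop the best pair first.
    # The running index breaks ties so that pairs are never compared directly.
    queues = {True: [], False: []}
    for idx, (pair, score) in enumerate(candidates):
        heapq.heappush(queues[is_protected(pair)], (-score, idx, pair))

    ranking = []
    turn = True  # assumption: start with the protected queue
    while len(ranking) < k and (queues[True] or queues[False]):
        # Fall back to the other queue when the current one is exhausted.
        source = queues[turn] if queues[turn] else queues[not turn]
        neg_score, _, pair = heapq.heappop(source)
        ranking.append((pair, -neg_score))
        turn = not turn
    return ranking


# Toy data: the two lowest-scoring pairs belong to the protected group.
candidates = [
    (("a1", "b1"), 0.95),
    (("a2", "b2"), 0.90),
    (("a3", "b3"), 0.60),
    (("a4", "b4"), 0.55),
]
protected_pairs = {("a3", "b3"), ("a4", "b4")}
result = fair_top_k(candidates, lambda p: p in protected_pairs, k=2)
print(result)
```
]]></preformat>
        <p>With k = 2, a ranking based on similarity alone would return only non-protected pairs, whereas the alternating selection returns one pair from each group.</p>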
        <p>
          Typically, the decision of whether an entity is protected or non-protected is given by the
values of a sensitive attribute. However, in a recent work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] we provided a more general,
graph-theoretic definition of structural bias, which can also play the role of fairness condition,
based on the observation that KG nodes that belong to large (above a size threshold) connected
components are more likely to be correctly matched. The same work [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] also introduces a
sampling algorithm, SUSIE, that can take representative samples of both small and big connected
components, and evaluate the robustness of ER methods to structural bias [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
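        <p>As a rough illustration of the connectivity-based definition, the following sketch labels each KG node by the size of its connected component. It is a simplification under stated assumptions, not the definition used in [4]: edges are treated as undirected, a component counts as big only if its size exceeds the threshold, nodes in small components are treated as the protected group, and the function name is hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict, deque


def protected_by_component_size(edges, size_threshold):
    """Map each node to True if its connected component is small."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    protected = {}
    for start in adjacency:
        if start in protected:
            continue
        # Breadth-first search collects the whole component of `start`.
        component = {start}
        frontier = deque([start])
        while frontier:
            node = frontier.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    frontier.append(neighbor)
        # Assumption: "big" means strictly above the size threshold.
        is_small = len(component) <= size_threshold
        for node in component:
            protected[node] = is_small
    return protected


# Toy KG with one component of four nodes and one of two nodes.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]
flags = protected_by_component_size(edges, size_threshold=2)
print(flags)
```
]]></preformat>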
      </sec>
      <sec id="sec-3-2">
        <title>2.2. System architecture</title>
        <p>The high-level system architecture is illustrated in Figure 1 and briefly described next. The
modular design of our architecture allows the seamless integration of new methods. Therefore,
the selection of methods already implemented is only indicative and orthogonal to our framework.</p>
        <p><sup>2</sup>https://github.com/vefthym/fairER
<sup>3</sup>https://isl.ics.forth.gr/fairER/
<sup>4</sup>https://youtu.be/DTrf9sbmCZE</p>
        <p>
          Sampling. The optional sampling component is responsible for selecting a subset of the
input KGs (not yet applicable to tabular data), with the desired parameters (e.g., sample size,
random walk jump probability, size threshold for considering a connected component as small
or big). The sampling component follows the implementation of SUSIE [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], but instead of
evaluating the robustness of ER methods, we use SUSIE in our framework to provide a fair
representation of small connected components in the input data of ER.
        </p>
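        <p>To give an intuition for the sample size and jump probability parameters, the sketch below implements a generic random walk with jumps over an adjacency dictionary. It is not SUSIE, which additionally controls how small and big connected components are represented in the sample; the function name, the seed parameter, and the toy graph are assumptions made for illustration.</p>
        <preformat><![CDATA[
```python
import random


def random_walk_sample(adjacency, sample_size, jump_probability, seed=0):
    """Sample nodes with a random walk that occasionally jumps.

    adjacency: dict mapping every node to an iterable of its neighbors.
    """
    if not 0.0 < jump_probability <= 1.0:
        # Without jumps, the walk could never leave a small component.
        raise ValueError("jump_probability must be in (0, 1]")
    if not adjacency:
        return set()
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    target = min(sample_size, len(nodes))
    if target <= 0:
        return set()

    current = rng.choice(nodes)
    sampled = {current}
    while len(sampled) < target:
        neighbors = sorted(adjacency[current])
        # Jump to a random node when stuck or with the given probability.
        if not neighbors or rng.random() < jump_probability:
            current = rng.choice(nodes)
        else:
            current = rng.choice(neighbors)
        sampled.add(current)
    return sampled


# Toy graph: a path of four nodes, a pair of nodes, and an isolated node.
graph = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"],
    "x": ["y"], "y": ["x"],
    "z": [],
}
sample = random_walk_sample(graph, sample_size=4, jump_probability=0.2)
print(sorted(sample))
```
]]></preformat>
        <p>A higher jump probability makes the walk leave its current component more often, so nodes from small components are more likely to enter the sample.</p>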
        <p>
          Similarity Scoring. For assessing the matching scores between entity pairs, i.e., the
likelihood that a pair is referring to the same real-world entity, we provide a wide selection of
probabilistic and embedding-based matching algorithms for tabular and KG data (e.g.,
Deepmatcher [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], BERT-INT [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], PARIS [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]). Most embedding-based methods rely on cosine
similarity on the entity embeddings that they create using pre-labeled training data (seed
alignment). The different ways to create such embeddings are orthogonal to FairER; for knowledge
graph embeddings, they are extensively described in our previous work [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
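        <p>The cosine-based scoring step can be sketched as follows. The two embedding dictionaries, their values, and the function names are hypothetical; how the embeddings are learned is outside the scope of this sketch, and a real system would typically restrict scoring to candidate pairs rather than compare all pairs.</p>
        <preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def score_pairs(embeddings_1, embeddings_2):
    """Score every cross-source entity pair by cosine similarity."""
    return {
        (e1, e2): cosine_similarity(v1, v2)
        for e1, v1 in embeddings_1.items()
        for e2, v2 in embeddings_2.items()
    }


# Hypothetical two-dimensional entity embeddings of two KGs.
kg1 = {"a": [1.0, 0.0]}
kg2 = {"x": [1.0, 0.0], "y": [0.0, 1.0]}
scores = score_pairs(kg1, kg2)
print(scores)
```
]]></preformat>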
        <p>
          Matching. Matching decisions [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] typically rely only on the similarity scores. However,
FairER also considers fairness constraints before returning its results. Our current
implementation uses an extension of the Unique Mapping Clustering algorithm [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] for the matching
decisions, and statistical parity as the fairness constraint [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. As mentioned before, FairER
extends the Unique Mapping Clustering algorithm to operate on two, instead of one,
priority queues, with each representing a different group. We note that FairER can be seamlessly
extended to also operate with more than two groups and, consequently, priority queues.
        </p>
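        <p>A minimal sketch of such a matching step, generalized to any number of groups, is given below. It assumes a round-robin visiting order over the group queues, a one-to-one constraint (each entity appears in at most one match), and an optional similarity threshold; the function name, the group_of function, and the toy data are hypothetical, and the sketch is not the FairER implementation.</p>
        <preformat><![CDATA[
```python
def fair_unique_mapping(scored_pairs, group_of, k, threshold=0.0):
    """Round-robin one-to-one matching over one queue per group.

    scored_pairs: iterable of ((left, right), score) tuples.
    group_of: hypothetical function mapping a pair to a sortable group label.
    """
    queues = {}
    for pair, score in scored_pairs:
        if score >= threshold:
            queues.setdefault(group_of(pair), []).append((score, pair))
    # Sort ascending so that list.pop() returns the highest-scoring pair.
    for queue in queues.values():
        queue.sort(key=lambda item: item[0])

    used_left, used_right, matches = set(), set(), []
    order = sorted(queues)  # fixed visiting order of the groups
    while len(matches) < k and any(queues.values()):
        for group in order:
            queue = queues[group]
            # Drop candidates that would break the one-to-one constraint.
            while queue and (queue[-1][1][0] in used_left
                             or queue[-1][1][1] in used_right):
                queue.pop()
            if queue and len(matches) < k:
                score, (left, right) = queue.pop()
                used_left.add(left)
                used_right.add(right)
                matches.append(((left, right), score))
    return matches


# Toy data: pairs involving entity "a3" form the protected group.
pairs = [
    (("a1", "b1"), 0.9),
    (("a1", "b2"), 0.8),
    (("a2", "b2"), 0.7),
    (("a3", "b3"), 0.6),
    (("a3", "b1"), 0.5),
]
group_of = lambda pair: "protected" if pair[0] == "a3" else "non-protected"
matches = fair_unique_mapping(pairs, group_of, k=3)
print(matches)
```
]]></preformat>
        <p>In this toy run, the pair ("a1", "b2") is discarded because "a1" is already matched, which is the effect of the one-to-one constraint.</p>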
        <p>In addition to providing a default criterion for splitting data into protected and non-protected
groups for some selected datasets, our platform also allows its users to define their own, custom
fairness criteria and test them with actual examples. The matching results can be viewed and
downloaded raw, using our API, but the users can also see the evaluation results in terms of
accuracy and fairness.</p>
        <p>
          Explanations. Our framework offers two types of explanations, depending on the input data
format: (a) explanations on the similarity scoring, relying on the deep matching explainability
tool MOJITO [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], and (b) explanations on the fairness of the returned matches. The former
assigns scores to schema attributes, identifying the most important ones for a match or non-match
decision. The latter also includes the similarity scores for both protected and non-protected matches, to
give a broader picture (i.e., some matches may be ranked higher than others, even if they have
lower similarity scores, in order to respect the fairness constraints). An example of the latter is
shown in Figure 2b and described in Scenario 1 of Section 3.
        </p>
        <p>Figure 2: (a) Dataset selection and evaluation results. (b) Explaining the selection of suggested matches.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Demonstration</title>
      <p>
        Here, we briefly describe two indicative demonstration scenarios, which are also covered in the
supplementary material, explaining how the ISWC attendees will interact with our system. In
what follows, we will focus more on the case of matching data coming from KGs, but most of
what is described next also applies to the tabular data case. One difference is the explanation
component, for which we utilize MOJITO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] in the tabular data case. Some of the features
described below are shown in Figure 2.
      </p>
      <p>Scenario 1 (Walk-through/Author-driven). An attendee selects one of our pre-loaded
tabular or KG datasets and clicks on “Fairness Conditions” to see the default fairness conditions,
based on which entities and entity pairs are considered as protected or non-protected. Then,
the attendee can run the matching algorithm and see the matching results from FairER, as well
as other baseline methods. Finally, the attendee can receive visual explanations of the matching
decisions. E.g., in Figure 2b, the user can see how protected and non-protected candidates are
ranked separately, as well as how they are ranked in the FairER results to respect the fairness
condition, along with their protected/non-protected color-coding and matching score.</p>
      <p>Scenario 2 (Usage/Attendee-driven). An ISWC attendee uploads a new dataset (that
they bring, or that we provide) and, after viewing a random entity from each of the two data sources,
enters the desired conditions for protected groups, for both individual entities and entity pairs (currently
supporting conjunctive and disjunctive conditions for pairs). They can even check the condition
for a single entity or a pair of entities that they manually type (which may not even exist in the
dataset), or load such an entity (pair) from a file. The attendee then clicks on “Dataset Statistics”
to check the number of protected and non-protected pairs, as well as the average similarity scores of
protected vs non-protected matches (and similarly for non-matches). The features described in
Scenario 1 are also applicable to a user-provided dataset.</p>
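      <p>The conjunctive and disjunctive pair conditions, together with the kind of numbers reported under “Dataset Statistics”, can be sketched as follows. This is an illustration under assumptions, not the platform's API: the function names, the gender attribute used as the sensitive attribute, and the toy entities and scores are all hypothetical.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def pair_condition(entity_is_protected, mode="or"):
    """Lift an entity-level predicate to pairs (conjunctive or disjunctive)."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    combine = all if mode == "and" else any
    return lambda pair: combine(entity_is_protected(e) for e in pair)


def dataset_statistics(scored_pairs, pair_is_protected):
    """Count the pairs of each group and average their similarity scores."""
    groups = {"protected": [], "non-protected": []}
    for pair, score in scored_pairs:
        key = "protected" if pair_is_protected(pair) else "non-protected"
        groups[key].append(score)
    return {
        key: {"count": len(s), "avg_score": mean(s) if s else None}
        for key, s in groups.items()
    }


# Hypothetical entities with a sensitive attribute.
alice = {"name": "Alice", "gender": "female"}
bob = {"name": "Bob", "gender": "male"}
carol = {"name": "Carol", "gender": "female"}
scored = [((alice, carol), 0.75), ((alice, bob), 0.5), ((bob, bob), 0.25)]


def is_female(entity):
    return entity["gender"] == "female"


stats_or = dataset_statistics(scored, pair_condition(is_female, "or"))
stats_and = dataset_statistics(scored, pair_condition(is_female, "and"))
print(stats_or)
print(stats_and)
```
]]></preformat>
      <p>The disjunctive condition marks a pair as protected if either entity is protected, so it yields at least as many protected pairs as the conjunctive one.</p>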
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>
        We have presented the first unified framework for end-to-end entity resolution (ER) that is fair
and explainable by design. This framework builds upon and largely extends works for direct
(FairER [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and indirect (SUSIE [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]) bias, while it reuses works on explainable ER (MOJITO [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]).
We plan to extend the implementation with more definitions of fairness, as well as design a
new explainability method that interprets not only the similarity-based matching decision, but
also the fairness aspect.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has received funding from the Hellenic Foundation for Research and Innovation
(HFRI) and the General Secretariat for Research and Technology (GSRT), under GA No 969.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Shahbazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Danevski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nargesian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Asudeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Through the fairness lens: Experimental analysis and evaluation of entity matching</article-title>
          ,
          <source>PVLDB</source>
          <volume>16</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <article-title>FairER: Entity resolution with fairness constraints</article-title>
          ,
          <source>in: CIKM</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Karakasidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          ,
          <article-title>Identifying bias in name matching tasks</article-title>
          ,
          <source>in: EDBT</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fanourakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kotzinos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitoura</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          ,
          <article-title>Structural bias in knowledge graphs for the entity alignment task</article-title>
          ,
          <source>in: ESWC</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Cicco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Firmani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Merialdo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Interpreting deep learning models for entity resolution: an experience report using LIME</article-title>
          ,
          <source>in: aiDM@SIGMOD</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Fanourakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kotzinos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <article-title>Knowledge graph embedding methods for entity alignment: experimental review</article-title>
          ,
          <source>Data Min. Knowl. Discov</source>
          .
          <volume>37</volume>
          (
          <year>2023</year>
          )
          <fpage>2070</fpage>
          -
          <lpage>2137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Konda</surname>
          </string-name>
          ,
          <string-name>
            <surname>P. S. G. C.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Govind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Paulsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chandrasekhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martinkus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Christie</surname>
          </string-name>
          ,
          <article-title>Magellan: toward building ecosystems of entity matching solutions</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>63</volume>
          (
          <year>2020</year>
          )
          <fpage>83</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tsekouras</surname>
          </string-name>
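          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Thanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannakopoulos</surname>
          </string-name>
          ,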
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>The return of JedAI: End-to-end entity resolution for structured and semi-structured data</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>11</volume>
          (
          <year>2018</year>
          )
          <fpage>1950</fpage>
          -
          <lpage>1953</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Akrami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A benchmarking study of embedding-based entity alignment for knowledge graphs</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>2326</fpage>
          -
          <lpage>2340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>BERT-INT: A BERT-based interaction model for knowledge graph alignment</article-title>
          ,
          <source>in: IJCAI</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Suchanek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abiteboul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Senellart</surname>
          </string-name>
          ,
          <article-title>PARIS: probabilistic alignment of relations, instances, and schema</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>5</volume>
          (
          <year>2011</year>
          )
          <fpage>157</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Thanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christen</surname>
          </string-name>
          ,
          <article-title>An analysis of one-to-one matching algorithms for entity resolution</article-title>
          ,
          <source>VLDB J</source>
          . (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>