<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Computational Humanities Research Conference, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Entity Matching in Digital Humanities Knowledge Graphs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Juriaan Baas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mehdi M. Dastani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ad J. Feelders</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Utrecht University</institution>
          ,
          <addr-line>Heidelberglaan 8, 3584 CS Utrecht</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <issue>4</issue>
      <fpage>7</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We propose a method for entity matching that takes into account the characteristic complex properties of decentralized cultural heritage data sources, where multiple data sources may contain duplicates within and between sources. We apply the proposed method to historical data from the Amsterdam City Archives using several clustering algorithms and evaluate the results against a partial ground truth. We also evaluate our method on a semi-synthetic data set for which we have a complete ground truth. The results show that the proposed method for entity matching performs well and is able to handle the complex properties of historical data sources.</p>
      </abstract>
      <kwd-group>
        <kwd>entity matching</kwd>
        <kwd>historical data</kwd>
        <kwd>knowledge graphs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>1. The KGs contain internal duplicates; these can be either due to error (a record was
mistakenly duplicated) or not (the same person married twice).</p>
      <p>2. There exist duplicate entities that occur between different KGs.</p>
      <p>3. Entities may have missing attributes.</p>
      <p>4. Attributes, such as dates, can be approximate.</p>
      <p>5. Attribute values can have errors; these can be in the original historical version or
introduced over time when the data is copied by hand, digitized, moved, etc. This is also a
potential source of duplicates within KGs.</p>
      <p>6. There is no standard for names, e.g. there is no official way to spell a certain name. The
use of shorthands for patronyms is common, e.g. the last names ’Jans’, ’Jansz’, ’Janz’,
’Janssen’ can all refer to the same person.</p>
      <p>7. Different KGs use different attributes to describe the same type of entities, e.g. persons.
These attributes can be highly correlated but not have identical meanings. For example,
one KG can use birth dates and another one baptism dates. We call these highly correlated
attributes proxy variables.</p>
      <p>8. There are many one-to-n and n-to-n relationships in the data, which makes the use of
a tabular structure very difficult. Our method is capable of handling these kinds of
relationships naturally.</p>
      <p>The final output of our method is a set of clusters, each claimed to correspond to one and
the same real life object. Furthermore, we assume that these clusters are themselves used
in downstream tasks. The application of transitive closure can sometimes yield very poor
performance if this is not taken into account during the clustering stage.</p>
      <p>Our contribution is the proposal of a comprehensive method that starts with a set of KGs
with the above mentioned characteristic properties. All relevant entities, such as all persons,
in the merged KGs are then embedded based on their local context. This context can also be
influenced by weights that are set in the configuration file to serve a specific task, such as entity
matching (e.g. names are more important than dates). When all entities are embedded, we
make use of an approximate nearest neighbor algorithm to efficiently find pairs of entities that
are likely duplicates. We call the pairs that pass a similarity threshold the candidate pairs.
Since ground truth is very limited, our method is unsupervised. Therefore, these candidate
pairs are then further refined with the use of unsupervised clustering algorithms, resulting in
a set of entity clusters, each predicted to represent a distinct object.</p>
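The workflow just described (embed entities, find approximate nearest neighbours, threshold similarities, then group) can be sketched end to end. This is an illustrative skeleton with toy vectors: brute-force cosine search stands in for the ANN index and for the learned embedding, and the transitive-closure grouping is only the baseline, not the authors' implementation.

```python
import numpy as np

def match_entities(embeddings, ids, theta=0.9, k=2):
    """Toy pipeline skeleton: neighbour search, similarity threshold,
    then transitive closure of the candidate pairs (the baseline)."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                            # cosine similarities
    np.fill_diagonal(sims, -np.inf)           # ignore self-pairs
    nbrs = np.argsort(-sims, axis=1)[:, :k]   # stand-in for ANN search
    candidates = {tuple(sorted((i, int(j))))
                  for i in range(len(ids)) for j in nbrs[i]
                  if sims[i, j] >= theta}
    parent = list(range(len(ids)))            # union-find for transitive closure
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for i, j in candidates:
        parent[find(i)] = find(j)
    clusters = {}
    for n in range(len(ids)):
        clusters.setdefault(find(n), []).append(ids[n])
    return list(clusters.values())
```

With two nearby vectors and one distant one, the two near-duplicates end up in the same cluster and the third forms a singleton.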
      <p>We experiment with a real life collection of KGs in the cultural heritage domain for which we
have a partial ground truth (gold standard), and a semi-synthetic KG that acts as an analog
that mirrors the graph structure and noise of the real-life data. We use the semi-synthetic
KG because it is much larger, and moreover since we have introduced duplicates ourselves,
we have a complete ground truth. For each case, we then embed all relevant nodes in a
50-dimensional Euclidean space. We chose reasonable values from Cochez et al. [11] for the
number of dimensions and the BCA hyperparameters to reduce the number of parameters that
need tuning. The settings used in the experiments can all be found in the repositories published
in section 6.0.1. Finally we apply several clustering algorithms, both heuristic and exact, and
compare their results to the transitive closure of the connected components formed by the
candidate pairs. We show that a correctly tuned embedding can achieve good performance,
even when no advanced clustering is used. However, in practice there is often no ground truth
available and therefore it is difficult to choose the similarity threshold that yields these good
results. To address this, we show that heuristic as well as exact clustering algorithms can be
used to repair clusters that are too large and achieve high performance for a much larger range
of similarity thresholds.</p>
      <p>The structure of this paper is as follows. First, we present other work related to our method
in section 2. Then in section 3, we give an overview of our methodology. Afterwards we present
the experimental setup in section 4, where we discuss the data and clustering algorithms. Next,
section 5 provides an analysis of the results of the experiments. Finally, we conclude the paper
with some future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        To position our work we make use of the taxonomy defined in the survey paper by Christophides
et al. [
        <xref ref-type="bibr" rid="ref9">10</xref>
]. Our method can best be placed in both the Multi-Source Entity Resolution (ER)
category (there exist duplicates between data sources) and the Dirty ER category (data sources
themselves contain duplicates). We therefore categorize our method as Multi-Dirty Source ER.
      </p>
      <p>
        Much work has been done on the problem of identifying entities in the data that refer to the
same real-world object, often using related terms such as: entity linking, entity disambiguation,
entity resolution, entity deduplication, entity alignment and record linkage. Many of these
terms are used in different circumstances or for related problems; for example, record linkage
is usually used in the context of matching records in one dataset with records in other datasets.
These works can be rule-based [
        <xref ref-type="bibr" rid="ref21 ref28">24, 31</xref>
        ], make use of word embeddings of tokens that appear
in the descriptions of entities [13, 34], or focus on scalability [
        <xref ref-type="bibr" rid="ref26 ref27">23, 29, 30</xref>
        ]. Some of these works
make use of structured (tabular) data and exploit extra information such as duplicate free
sources. Furthermore, it is often assumed that entities can be neatly described using a fixed
number of attributes, and that all entities either use (different names for) the same attributes,
or use different attributes that can be easily mapped.
      </p>
      <p>
        Most relevant to our setting are methods that make use of representation learning
techniques. In this case the term entity alignment is most often used. The key idea is to learn
embeddings of KGs, such that entities with similar neighbor structures in the KG have a
close representation in the embedding space. These techniques have their origins (for graphs
in general) in Node2Vec [
        <xref ref-type="bibr" rid="ref12">15</xref>
        ] and Deepwalk [
        <xref ref-type="bibr" rid="ref23">26</xref>
        ], and (for KGs) in TransE [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and its many
extensions, such as [
        <xref ref-type="bibr" rid="ref29 ref30">12, 32, 33</xref>
        ]. Furthermore, several techniques exist for embedding nodes in
the case of mapping two KGs that are in different languages, e.g. [
        <xref ref-type="bibr" rid="ref8">8, 9</xref>
        ]. In that case one can
exploit the fact that there can be at most two duplicate entities per real-world object.
      </p>
      <p>
        In light of other work on cultural heritage data sources, Raad et al. [
        <xref ref-type="bibr" rid="ref24">27</xref>
        ] summarize their
efforts to create a certificate linking method for Dutch civil certificates from the Zeeland region,
based on efficient string similarity computations. Furthermore, they propose a contextual
identity link [
        <xref ref-type="bibr" rid="ref25">28</xref>
        ], as they observe that the owl:sameAs link is often misused. They note that
the notion of identity can change under different contexts. For example, two pharmaceuticals
may be judged as equivalent when their names match under some conditions, while under other
conditions their chemical structure needs to be identical as well. Their solution is an algorithm
which detects the specific global contexts in which two entities are identical. Similarly, Idrissou
et al. [
        <xref ref-type="bibr" rid="ref16 ref17">20, 19</xref>
        ] have proposed a contextual identity link based on the use of related entities to
construct evidence for likely duplicate pairs. An example of evidence is that two entities may
co-occur in multiple records under similar names. Their method requires the user to specify
beforehand what is considered evidence and how entities should be matched. Koho et al. [
        <xref ref-type="bibr" rid="ref20">21</xref>
        ]
reconcile military historical persons in three registers, and use similarity measures between
attributes in both a deterministic rule-based method, based on a pre-defined handcrafted
formula, and a probabilistic method that makes use of supervised learning. In contrast to our
work, only precision is reported, as an exact recall cannot be calculated with a small manually
generated partial ground truth. Hendriks et al. [
        <xref ref-type="bibr" rid="ref15">18</xref>
        ] use data from the Amsterdam Notary
Archives and Dutch East India Company (VOC) and perform both named entity recognition
and record linkage with the help of supervised learning, where we use unsupervised learning.
      </p>
      <p>
        To create the embedding, we build on the work of Cochez et al. [11] and Baas et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], who
both use the Bookmark Coloring Algorithm (BCA) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to create a co-occurrence matrix from
the knowledge graph, and GloVe [
        <xref ref-type="bibr" rid="ref22">25</xref>
        ] to learn the embedding from the co-occurrence matrix.
Additionally, Baas et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] introduce a mechanism for merging multiple KGs based on user set
similarity rules and focus on graph traversal strategies to create entity contexts. Afterwards
they apply supervised learning to create a classifier which is able to identify duplicate pairs
of entities. In more recent work [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], they introduce the cluster editing (also called correlation
clustering) algorithm to create clusters using the aforementioned classifier to generate input
weights.
      </p>
      <p>Important differences with this work are that 1. We forego the supervised approach and
instead use an unsupervised method, as labeled examples are often very rare (or non-existent)
in historical research settings. 2. We apply a form of blocking and filtering to create clusters
without having to compute all pairwise similarities. 3. We modify GloVe to be more efficient
on large KGs, as explained in sections 3.2 and 3.3. 4. Since an exact solution of the cluster
editing problem is computationally infeasible for large instances, we also apply three heuristic
clustering algorithms. 5. We use an additional dataset (DBLP) to show that our method works
in multiple settings.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>We will first briefly formalize the entity matching objective in our setting. Given a starting set
of knowledge graphs {KG1, ..., KGn} with KGi = (Vi, Ei), our objective is to find the subset
of pairs of duplicate entities L, where</p>
      <p>V = ∪_{i=1}^{n} Vi, P = V × V, and L ⊆ P.</p>
      <p>It is understood in the field of entity matching that the quadratic growth of P presents a
major problem. Therefore most methods employ some sort of blocking and/or filtering methods
to reduce the number of pairs that have to be evaluated. We employ the embedding for this
purpose, as detailed in section 3.4. The workflow of our method is illustrated in figure 1, and
will be described in the rest of this section.</p>
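To make the scale of the problem concrete: counted as unordered pairs, |P| grows as |V|(|V| − 1)/2, so even modest graphs yield billions of pairs. The figures below are simple arithmetic, not measurements from the paper; the "after blocking" count assumes the k = 2 neighbour search used later.

```python
# |P| counted as unordered pairs |V|(|V|-1)/2 grows quadratically, which is
# why blocking/filtering (here: the embedding plus a nearest neighbour
# search) is needed before any pairwise comparison.
for n in (1_000, 10_000, 100_000):
    all_pairs = n * (n - 1) // 2
    blocked = n * 2            # roughly k * |V| pairs with k = 2 neighbours
    print(f"|V|={n:>7}: |P|={all_pairs:>13,} vs ~{blocked:,} after blocking")
```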
      <sec id="sec-3-1">
        <title>3.1. Merging KGs</title>
        <p>We start with a set of KGs, as seen in panel a of Figure 1. In our setting, the only way the
graphs can be compared and connected is through identical or similar values in their literal
nodes. Therefore, we perform two steps:</p>
        <p>1. All literal nodes with identical values are merged based on their predicate. Merging
solely on the value of a literal node would cause ambiguity as to what the value of that literal
node represents.</p>
        <p>2. Although graphs can be partly connected after applying the first step, we still face the
problem of noise present in the values of literal nodes, and, often, predicates associated
with literal nodes may only occur in one of the graphs. However, these predicates can
still be related if one can act as a proxy variable for the other, e.g. one graph contains
the bornOn predicate and another the baptisedOn predicate. We connect such nodes if
they satisfy a similarity criterion, such as a preferred distance between certain dates.
This technique can be used to deal with noise as well, e.g. to connect literal nodes that
represent similar names by using, for instance, Jaro-Winkler string similarity.</p>
        <p>The resulting merged graph, panel b in Figure 1, does not necessarily adhere to the RDF
standards, as we may have added edges between literal nodes. From this point on, we do not
treat the merged graph as a KG, but instead as an undirected weighted graph G = (V, E), where
the weights are either predefined (for predicates) or are created (for literal node similarities)
in step 2 above.</p>
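Step 2 can be sketched as follows. The predicate names, the 30-day date window and the 0.85 string threshold are assumptions for this example, and difflib's ratio is only a stand-in for the Jaro-Winkler similarity the text mentions.

```python
from datetime import date
from difflib import SequenceMatcher

def similar_names(a, b, threshold=0.85):
    # stand-in for Jaro-Winkler string similarity
    return SequenceMatcher(None, a, b).ratio() >= threshold

def proxy_dates(d1, d2, max_days=30):
    # e.g. bornOn in one graph acting as a proxy for baptisedOn in another
    return abs((d1 - d2).days) <= max_days

literals = [
    ("g1", "name", "Claesz, Jan"),
    ("g2", "name", "Claasz, Jan"),
    ("g1", "bornOn", date(1645, 9, 7)),
    ("g2", "baptisedOn", date(1645, 9, 9)),
]

edges = []  # similarity edges added to the merged graph
for i, (gi, pi, vi) in enumerate(literals):
    for gj, pj, vj in literals[i + 1:]:
        if pi == pj == "name" and similar_names(vi, vj):
            edges.append((vi, vj))
        elif {pi, pj} == {"bornOn", "baptisedOn"} and proxy_dates(vi, vj):
            edges.append((vi, vj))
```

Here the two name spellings and the born/baptised date pair each produce one similarity edge, connecting the two otherwise disjoint graphs.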
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Generating Entity Contexts</title>
        <p>
          We use an adaptation of the Bookmark Coloring Algorithm [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] (BCA) to generate a context for
each node in a graph. The general idea of BCA is to consider the context of a node in a graph
as an approximation of the personalized PageRank for that node, i.e., a set of probabilities
associated with nodes in the graph. A useful analogy is imagining dropping a unit amount of
paint on a given starting node. A fraction α of the paint remains on the node and the fraction
1 − α is distributed among neighboring nodes. When the amount of paint falls under ϵ the
paint is no longer distributed. This means that even in the case of loops, the algorithm will
eventually terminate after running out of paint. The value for ϵ can be tuned to get a smaller
or larger neighborhood. The fraction of paint which continues to flow can be adjusted with α.
        </p>
        <p>Instead of uniformly distributing paint over neighboring nodes, as standard BCA does, the
paint is distributed relative to the weight of edges. The weights on edges between literal
nodes are computed from the similarity of their values. All remaining weights are set in the
configuration file by the domain expert, or can be omitted, in which case they all equal 1.</p>
        <p>As mentioned before, we treat the edges in the merged graph as undirected. This is because
the directed structure of the graph can curb the ‘flow of paint’, especially when information
about duplicate entities is dominated by literal nodes, which only have incoming edges by
default. This will cause otherwise related entities to have no or very small overlapping context,
something we wish to avoid.</p>
        <p>Finally, we have increased the efficiency of calculating the entity contexts in two ways.
First, we perform early stopping by not processing nodes for which we know there will not be
enough paint to continue. This behaviour also introduces a limit on the minimum amount of
paint present on any given node reached by BCA, which increases numerical stability in the
embedding process. Secondly, we observe that we do not need to perform the BCA algorithm
for every single node. Instead, only the nodes we wish to embed have their context calculated,
where every node can potentially appear in a context, including those that are not embedded.
We call the nodes that have their context calculated focus nodes. This modification greatly
reduces the running time of the context generation stage, as usually only a fraction of the
nodes in the graph need to be embedded, e.g. only nodes of type person. Furthermore, by
still including each node in the potential context, we do not impair the context by removing
important co-occurrence information. This second modification does require us to modify the
GloVe algorithm slightly, detailed below.</p>
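The adapted BCA can be sketched as below: paint spreads from a focus node over the undirected weighted graph, a fraction α sticks, the rest flows proportionally to edge weights, and amounts below ϵ are discarded. The tiny graph, its weights and the parameter values are illustrative, not taken from the paper's configuration.

```python
def bca_context(graph, start, alpha=0.15, eps=1e-3):
    """graph: node -> {neighbour: weight}, treated as undirected.
    Returns node -> amount of paint, an approximate personalized PageRank."""
    context = {}
    wet = {start: 1.0}                      # paint still waiting to be spread
    while wet:
        node, paint = wet.popitem()
        if paint < eps:                     # early stopping: too little paint
            continue
        context[node] = context.get(node, 0.0) + alpha * paint
        total = sum(graph[node].values())
        for nbr, w in graph[node].items():  # weighted, not uniform, spreading
            wet[nbr] = wet.get(nbr, 0.0) + (1 - alpha) * paint * w / total
    return context

# A tiny merged graph: two person nodes sharing a name literal.
graph = {
    "p1": {"name": 2.0, "date": 1.0},
    "p2": {"name": 2.0},
    "name": {"p1": 2.0, "p2": 2.0},
    "date": {"p1": 1.0},
}
# Only focus nodes (here: persons) get a context; literals may appear in it.
contexts = {n: bca_context(graph, n) for n in ("p1", "p2")}
```

Note that the literal nodes still appear inside the contexts of the focus nodes, which is exactly the point of the focus-node modification.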
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Embedding nodes in the merged graph</title>
        <p>
          GloVe [
          <xref ref-type="bibr" rid="ref22">25</xref>
] has been established as an effective method for using a text corpus to embed
words in a vector space. We adapt it to embed nodes in a graph instead. GloVe takes as input
a co-occurrence matrix X, where in our case Xij signifies the amount of paint deposited by BCA on
node j when starting from node i. When BCA has not reached node j starting from node i, then
Xij = 0. Calculating node vectors can be achieved by minimizing the cost function
        </p>
        <p>cost = ∑_{i=1}^{N} ∑_{j=1}^{|V|} f(Xij) (bi + b̄j + wiᵀw̄j − log Xij)²  (1)</p>
        <p>where N is the number of focus nodes we wish to embed, wi is the vector for node i as a focus
node, and w̄j is the vector for node j as a context node. Likewise, bi and b̄j are the bias terms
of node i as a focus node, and node j as a context node respectively.</p>
        <p>In the original specification of GloVe, X is assumed to be a square matrix. However, as
mentioned in section 3.2, we do not perform BCA for every node in the merged graph. Therefore
X is no longer a square matrix; instead, there are N rows and |V| columns. This also means
that there are N vectors w and |V| vectors w̄, and the same goes for the bias terms. When the
change between iterations in the cost function (1) falls below a user-set threshold, we regard the
algorithm as converged. The final embedded vector for each focus node i is the average
of the vectors wi and w̄i. We use the term embedding for the set of all embedded vectors,
shown in panel c in Figure 1.</p>
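A minimal sketch of fitting objective (1) with a rectangular co-occurrence matrix (N focus rows, |V| context columns) by plain full-batch gradient descent. The dimensions, weighting function, learning rate and random data are illustrative assumptions; real implementations, GloVe's included, typically use AdaGrad and iterate only over nonzero entries.

```python
import numpy as np

rng = np.random.default_rng(0)
N, V, d = 3, 6, 4                         # focus nodes, all nodes, embedding dim
X = rng.uniform(0.05, 0.5, size=(N, V))   # stand-in for BCA paint amounts
f = np.minimum((X / 0.5) ** 0.75, 1.0)    # GloVe-style weighting, precomputed

W = rng.normal(0, 0.1, (N, d))            # focus-role vectors (N of them)
Wc = rng.normal(0, 0.1, (V, d))           # context-role vectors (|V| of them)
b, bc = np.zeros(N), np.zeros(V)
logX = np.log(X)

def cost():
    err = b[:, None] + bc[None, :] + W @ Wc.T - logX
    return float((f * err ** 2).sum())

before = cost()
lr = 0.02
for _ in range(2000):                     # full-batch gradient descent on (1)
    err = b[:, None] + bc[None, :] + W @ Wc.T - logX
    g = 2 * f * err
    W_grad, Wc_grad = g @ Wc, g.T @ W     # compute both before updating
    W -= lr * W_grad
    Wc -= lr * Wc_grad
    b -= lr * g.sum(axis=1)
    bc -= lr * g.sum(axis=0)
after = cost()

# Final vector for focus node i: average of its focus and context roles,
# assuming the first N columns of X index the focus nodes themselves.
emb = (W + Wc[:N]) / 2
```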
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Efficiently selecting likely duplicate pairs</title>
        <p>
          As stated in the entity matching objective, we wish to determine the subset L ⊆ P that
contains the duplicate pairs. However, the vast majority of pairs in P do not link duplicate
entities, so duplicate pairs are very rare [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We reduce the degree of rarity by creating two
subsets P1 ⊆ P and P2 = P − P1, where P1 is expected to be much smaller in size and to have
a much higher proportion of duplicate pairs. We construct P1 by, for every embedded entity,
taking its approximate k nearest neighbors, where from experience we have learned that k = 2
is a good value for a range of cluster sizes. This effectively acts as a blocking mechanism, as
most pairs in P are never considered. The problem of approximate nearest neighbor (ANN)
search has been well researched [
          <xref ref-type="bibr" rid="ref13 ref6">6, 16</xref>
          ] and calculating the approximate k nearest neighbors
for all entities can be achieved in much less than quadratic time, depending on the level of
approximation one is willing to tolerate. We note that when a near neighbor j is omitted in
the approximate result for node i, i.e. we miss the pair (i, j), then the pair (j, i) can still be
found when considering the approximate neighbors for node j. This reduces the impact of
errors in the approximate search results on our method. Next, all pairs in P1 are treated as
unordered, i.e. pairs (i, j) and (j, i) are merged. Finally, we calculate the cosine similarity sij
for all pairs in P1, and name all pairs where the similarity exceeds a threshold θ the candidate
pairs, shown in panel d in Figure 1.
        </p>
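The selection of candidate pairs can be sketched as follows. Brute-force cosine search stands in for a real ANN index, and the vectors, k = 2 and θ are toy values chosen for the example.

```python
import numpy as np

def candidate_pairs(emb, k=2, theta=0.8):
    """Return unordered pairs (i, j) from each entity's k nearest
    neighbours whose cosine similarity reaches the threshold theta."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)           # never pair a node with itself
    pairs = {}
    for i in range(len(emb)):
        for j in np.argsort(-sims[i])[:k]:    # stand-in for ANN search
            if sims[i, j] >= theta:
                # treat pairs as unordered: (i, j) and (j, i) merge
                pairs[(min(i, int(j)), max(i, int(j)))] = float(sims[i, j])
    return pairs

emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
pairs = candidate_pairs(emb, k=2, theta=0.95)
```

With these four vectors, only the two close pairs survive the threshold, illustrating how most of P is never materialized.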
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Clustering</title>
        <p>When the candidate pairs are treated as edges in an undirected graph, the resulting graph
typically consists of a number of connected components. Since only pairs of entities in the same
connected component are regarded as potential duplicates, further processing is performed
separately for each connected component C. First, C is modified by adding a weighted edge
for each possible pair of entities (i, j) ∈ C. We define the weight for the edge between entities i
and j as wij = sij − θ. Note that wij can be negative. We call the resulting weighted complete
graph C′, which is then used as input for a clustering algorithm. Note that for high values of
θ ≈ 1, there are fewer candidate pairs (and thus smaller connected components) than for lower
values of θ. We elaborate on the specifics of each clustering algorithm in section 4.3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Setup</title>
      <p>
For our experiments we have used two datasets: one is a set of KGs containing real historical
data from the cultural heritage domain; the other is a semi-synthetic KG designed to have
similar structural properties as the merge of the KGs from the cultural heritage domain. The
main reason to design and use a semi-synthetic KG is to have a complete ground truth to
validate the performance of the proposed approach. As mentioned in the introduction section,
the KGs from the cultural heritage domain in our project have only a partial ground truth,
obtained with manual validation by domain experts [
        <xref ref-type="bibr" rid="ref19">22</xref>
]. In this section, we first describe the
datasets used, i.e. SAA and DBLP. We then explain how the combined KGs of the SAA and the
DBLP KG are processed according to the methodology explained in section 3. Without
getting into the technical details of the application of our methodology in this experimental
setup, we only note that each knowledge graph is embedded into a separate 50-dimensional
space. Finally, the clustering methods used are explained in the rest of this section.
      </p>
      <sec id="sec-4-1">
        <title>4.1. City Archives of Amsterdam</title>
        <p>The City Archives of Amsterdam2 (in Dutch Stadsarchief Amsterdam, abbreviated SAA) is a
collection of registers, acts and books from the 16th century up to modern times. The original
data are in the form of handwritten books that have been scanned and digitised by hand.
Often more information is stored in the original form than was transcribed. Fully digitising
all information is an ongoing process, performed mostly by volunteers. For this project we
made use of a subset collected from three different registers: Burial, Prenuptial Marriage and
Baptism. The burial register does not describe who was buried where, but simply records the
act of someone declaring a deceased person. To this end, it mentions the date and place of
declaration and references two persons, one of whom is dead. Sadly, it does not tell us which
one of the two has died. The prenuptial marriage records tell us the church, religion and
names of those who are planning to get married. It also mentions names of persons involved in
previous marriages if applicable. The baptism register mentions the place and date of where a
child was baptised. It does not tell us the name of the child, only the names of the father and
mother. Lastly the above records were combined with a subset from the Ecartico 3 data set,
which is a comprehensive collection of biographical data about well-known people from the
Dutch Golden Age. Figure 2 shows the graph representations of a record in each register. Note
that these records can be linked to each other by sharing a literal node, in this case a name
or a date field, where the name of an individual is often written in different ways and many
dates are approximate. For our experiments we use a subset containing 12,517 (non-unique)
persons, and a partial ground truth of 1073 clusters.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Semi-synthetic DBLP-based KG</title>
        <p>Since we have only a partial ground truth available for the SAA dataset, and to validate
the performance of our approach, we have created a KG based on a 2017 RDF release4 of
DBLP,5 a well-known dataset among computer scientists, containing publication and author
information. This version of DBLP has already been disambiguated, that is, different persons
with the same name have unique URIs. We have reversed this by first taking a random sample
of persons and then creating new anonymous URIs for each author listed per publication.
In total, 76,397 new URIs are created, to be disambiguated into 51,515 clusters. These
anonymous author nodes are then assigned their original name, with the further introduction
of noise. Every character in each name instance has a 1% chance of deletion, random character
swap, or replacement with a random character. Certain characters (e.g. é becomes e)
have a 25% chance of alteration, and middle names are shortened to their initials with a 50%
probability. There are never more than 3 alterations per name instance. We chose these
particular percentages to create enough noise such that names can have multiple different,
but not unrecognisable, versions, as is the case in the SAA dataset. This type of attribute
error, if large enough, can decrease precision as identical entities will have very dissimilar
attributes. The final result is a KG with publication nodes, each with a title attribute and
one or more authors, each of which has a name attribute. The clustering objective is to group
together all author URIs that represent the same real life person, where we only use noisy
name information, the co-occurrence of other authors and their co-authors (who again only
have a name), and publication titles. This makes the DBLP KG share properties that are
analogous to the merged KGs of SAA, such as n-to-n relationships, noise in literal values, the
way entities can be related (through literal nodes) and similar cluster size distribution. Note
that the clustering objective, where we try to cluster together person nodes in records, remains
similar to that of SAA. Finally, we note that since we have a complete ground truth for DBLP,
we can compute reliable precision and recall values in our experiments.</p>
        <p>Footnotes: 2. https://archief.amsterdam/indexen 3. http://www.vondel.humanities.uva.nl/ecartico 4. http://data.linkeddatafragments.org/dblp 5. https://dblp.uni-trier.de</p>
        <p>[Figure 2: (a) Burial Record, (b) Prenuptial Marriage Record, (c) Baptism Record, (d) Ecartico Marriage Record; person nodes are linked through shared or similar name and date literals, e.g. ’Claesz, Jan’ / ’Claasz, Jan’ / ’Claeszoon, Jan’.]</p>
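The name-noising procedure can be sketched as below: per character a 1% chance of deletion, swap or random replacement, accented characters simplified with 25% probability, middle names shortened to an initial with 50% probability, and at most 3 alterations per name instance. The alphabet, the accent table and the order in which the rules are applied are illustrative assumptions.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
ACCENTS = {"é": "e", "ü": "u", "ö": "o"}   # illustrative subset

def noisy_name(name, rng, max_alterations=3):
    parts = name.split()
    if len(parts) > 2 and rng.random() < 0.5:        # shorten middle names
        parts[1:-1] = [p[0] + "." for p in parts[1:-1]]
    chars, altered = list(" ".join(parts)), 0
    i = 0
    while i < len(chars) and altered < max_alterations:
        c = chars[i]
        if c in ACCENTS and rng.random() < 0.25:     # é -> e etc.
            chars[i] = ACCENTS[c]
            altered += 1
        elif rng.random() < 0.01:                    # 1% per-character noise
            op = rng.choice(["delete", "swap", "replace"])
            if op == "delete":
                del chars[i]
                i -= 1
            elif op == "swap" and i + 1 < len(chars):
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
            else:
                chars[i] = rng.choice(ALPHABET)
            altered += 1
        i += 1
    return "".join(chars)

rng = random.Random(0)
variants = {noisy_name("Jan Willem Claesz", rng) for _ in range(200)}
```

Repeated draws produce several recognisable but distinct spellings of the same name, mimicking the variation observed in the SAA registers.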
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Clustering Algorithms</title>
        <p>
          We perform our experiments by first constructing the weighted complete graphs C′, explained
in Section 3.5, for both SAA and DBLP knowledge graphs. Then, four diferent clustering
algorithms are applied and compared, with transitive closure method as the baseline.
4.3.1. Transitive Closure:
This method, which is used as the baseline, generates the transitive closure of the set of entities
involved in a connected component. In this case, each connected component C′ is treated as
a single cluster, regardless of weights. For small similarity thresholds of θ ≈ 0.5, this results
in fewer but larger components, and thus clusters, while a large θ ≈ 1 results in many small
(singleton) clusters.
4.3.2. Center Clustering [
          <xref ref-type="bibr" rid="ref14">17</xref>
          ]:
This method first sorts all entity pairs in C′ in descending order by their weights. Then, for
each entity pair, both nodes are clustered according to the following rules: 1) If neither node
is assigned to a cluster, then assign one of the nodes as the center of a new cluster and add
the other node to that cluster. 2) If one of the nodes is the center of a cluster, and the other
node has no cluster, then add the other node to that cluster. 3) Otherwise, do nothing. The
center clustering algorithm has a tendency to create many small clusters based on strong links,
as the pairs with highest similarity are treated first, yielding a high precision but often at the
expense of recall. For each (sub)component, all possible entity pairs are calculated to supply
the algorithm with maximum information.
4.3.3. Merge-Center Clustering [
          <xref ref-type="bibr" rid="ref14">17</xref>
          ]:
This method adds another step to the center clustering algorithm: if one of the nodes
is a center node and the other node is already assigned to a different cluster, then the two
clusters are merged. This tends to result in fewer and larger clusters than center clustering.
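A hypothetical Python sketch of this variant (pairs as (i, j, weight) triples): the first three cases mirror center clustering, and the final case performs the extra merge when one of the two already-clustered nodes is a center.

```python
from itertools import count


def merge_center_clustering(pairs):
    """Center clustering plus a merge step: when a center node pairs
    with a node already in another cluster, the clusters are merged."""
    pairs = sorted(pairs, key=lambda p: p[2], reverse=True)
    cluster_of, is_center, clusters = {}, set(), {}
    new_id = count()
    for i, j, _ in pairs:
        ci, cj = cluster_of.get(i), cluster_of.get(j)
        if ci is None and cj is None:
            cid = next(new_id)                     # new cluster, i is center
            cluster_of[i] = cluster_of[j] = cid
            is_center.add(i)
            clusters[cid] = {i, j}
        elif ci is not None and cj is None and i in is_center:
            cluster_of[j] = ci
            clusters[ci].add(j)
        elif cj is not None and ci is None and j in is_center:
            cluster_of[i] = cj
            clusters[cj].add(i)
        elif ci is not None and cj is not None and ci != cj \
                and (i in is_center or j in is_center):
            # merge step: fold j's cluster into i's cluster
            for n in clusters.pop(cj):
                cluster_of[n] = ci
                clusters[ci].add(n)
    return list(clusters.values())
```

On the test input below, center clustering would leave {a, b} and {c, d} separate, while the merge step joins them through the pair (b, c), illustrating the fewer-but-larger tendency.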
4.3.4. Vote Clustering [
          <xref ref-type="bibr" rid="ref11">14</xref>
          ]:
This algorithm processes the nodes in C′ in arbitrary order. Each node i is assigned to the
cluster with the largest positive sum of weights with regard to i. If there are no clusters yet,
or no cluster has a positive sum, then a new cluster is created for i. For high values of θ ≈ 1,
this will cause most sums to be negative, thereby creating more clusters. Vote clustering has
been suggested as a heuristic alternative to the exact correlation clustering, explained next.
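The vote rule can be sketched as follows (a hypothetical Python sketch; `weights` is assumed to map unordered node pairs to similarity-derived weights, which may be negative):

```python
def vote_clustering(nodes, weights):
    """Assign each node to the existing cluster with the largest
    positive sum of pairwise weights; otherwise start a new cluster."""
    clusters = []
    for i in nodes:
        best, best_sum = None, 0.0
        for cluster in clusters:
            s = sum(weights.get(frozenset((i, j)), 0.0) for j in cluster)
            if s > best_sum:           # only strictly positive sums qualify
                best, best_sum = cluster, s
        if best is None:
            clusters.append({i})       # no positive vote: new cluster
        else:
            best.add(i)
    return clusters
```

Because each node casts its vote only over clusters formed so far, the result can depend on the processing order, which is why the text describes the order as arbitrary.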
4.3.5. Correlation Clustering [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]:
This method partitions the nodes of C′ = (V, E) into a number of disjoint subsets (clusters)
such that the sum of the inter-cluster positive weights minus the sum of the intra-cluster
negative weights is minimized. More formally, let Γ = {V1, V2, . . .} denote a partitioning of V
into disjoint subsets (clusters). Furthermore, let E+ = {(i, j) ∈ E : wij &gt; 0} denote the edges
with positive weight, and let E− = E \ E+ denote the edges with non-positive weight. Finally,
let intra(Γ) denote the collection of edges with both endpoints in the same partition (cluster),
and inter(Γ) the collection of edges with endpoints in different partitions (clusters). The
objective is to find the clustering Γ that minimizes the cost:
        </p>
        <p>cost(C′, Γ) = ∑_{(i,j) ∈ inter(Γ) ∩ E+} wij − ∑_{(i,j) ∈ intra(Γ) ∩ E−} wij</p>
        <p>
The intuition behind minimizing this cost function is to discourage assigning nodes
with positive similarity to different clusters and nodes with negative similarity to
the same cluster. Note that high values of θ cause most weights to be negative, so that
putting the corresponding nodes in the same cluster incurs a cost; this yields an
optimal solution with smaller clusters. Low values of θ result in fewer
but larger clusters. Due to the NP-hardness of the correlation clustering problem, we only
perform correlation clustering on components no larger than 50 nodes. Larger components are
processed with the vote algorithm.</p>
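The objective can be computed directly for a given partitioning. The sketch below is hypothetical (weights keyed by node pairs); it evaluates the cost of a candidate clustering rather than performing the NP-hard minimization itself:

```python
def correlation_cost(weights, clustering):
    """cost(C', Γ): positive weights cut between clusters, plus the
    absolute value of negative weights kept inside a cluster."""
    label = {n: k for k, cluster in enumerate(clustering) for n in cluster}
    cost = 0.0
    for (i, j), w in weights.items():
        same = label[i] == label[j]
        if w > 0 and not same:
            cost += w          # positive edge in inter(Γ) ∩ E+
        elif w <= 0 and same:
            cost -= w          # negative edge in intra(Γ) ∩ E−
    return cost
```

On a tiny example with one positive and two negative edges, splitting the negatively related node off achieves cost 0, while merging everything pays for both negative intra-cluster edges.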
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The domain of cultural heritage often makes use of KGs that have characteristics which make
them difficult to process, such as noise (misspelling of names), proxy variables, and n-to-n
relationships. Furthermore, most of the time there is either no or a very limited ground truth
available.</p>
      <p>[Figure: (a) F-1 score and (b) F-1/2 score over similarity thresholds 0.5–1.0.]</p>
      <p>We have shown that our method can handle the complexities of real KGs from the cultural
heritage domain and is able to yield a result with high precision while still having good recall.
The experiment with the DBLP KG shows that the results can be replicated on a diferent KG
with similar structural properties. Furthermore, we have shown that several diferent clustering
algorithms can be applied in our method.</p>
      <p>The use of an unsupervised method can be most appropriate when there is no or very limited
ground truth available. Moreover, in case of limited ground truth a supervised learner can work
if only a few features are used. Therefore, in the future we will experiment with introducing
supervised learning in combination with additional symbolic features, such as logical rules
about how certain properties relate, to create a system that combines both symbolic and
sub-symbolic information.
6.0.1. Reproducibility
We have published the source code6 for creating the embeddings and the scripts for running
the experiments online7, including the instructions on how to build the source code into a
standalone executable, the configuration files that were used to create the embeddings, and
the KGs used in the experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M. Al</given-names>
            <surname>Hasan</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Zaki</surname>
          </string-name>
          . “
          <article-title>A survey of link prediction in social networks”</article-title>
          .
          <source>In: Social network data analytics</source>
          . Springer,
          <year>2011</year>
          , pp.
          <fpage>243</fpage>
          -
          <lpage>275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dastani</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Feelders</surname>
          </string-name>
          . “
          <article-title>Exploiting Transitivity for Entity Matching”</article-title>
          .
          <source>In: ESWC2021 Poster and Demo Track</source>
          .
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Baas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dastani</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Feelders</surname>
          </string-name>
          . “
          <article-title>Tailored graph embeddings for entity alignment on historical data”</article-title>
          .
          <source>In: Proceedings of the 22nd International Conference on Information Integration and Web-based Applications &amp; Services</source>
          .
          <year>2020</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blum</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Chawla</surname>
          </string-name>
          . “
          <article-title>Correlation Clustering”</article-title>
          .
          <source>In: Mach. Learn</source>
          .
          <volume>56</volume>
          .1-3
          (
          <year>2004</year>
          ), pp.
          <fpage>89</fpage>
          -
          <lpage>113</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Berkhin</surname>
          </string-name>
          . “
          <article-title>Bookmark-coloring algorithm for personalized pagerank computing”</article-title>
          .
          <source>In: Internet Mathematics 3</source>
          .1 (
          <year>2006</year>
          ), pp.
          <fpage>41</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Bernhardsson</surname>
          </string-name>
          . Annoy at GitHub. https://github.com/spotify/annoy.
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia-Duran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Yakhnenko</surname>
          </string-name>
          . “
          <article-title>Translating embeddings for modeling multi-relational data”</article-title>
          .
          <source>In: Neural Information Processing Systems (NIPS)</source>
          .
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zaniolo</surname>
          </string-name>
          . “
          <article-title>Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment”</article-title>
          . In: arXiv preprint arXiv:1806.06478 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>V.</given-names>
            <surname>Christophides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Palpanas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Papadakis</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Stefanidis</surname>
          </string-name>
          .
          <article-title>“An Overview of End-to-End Entity Resolution for Big Data”</article-title>
          .
          <source>In: ACM Comput. Surv. 53.6</source>
          (
          <year>2020</year>
          ). doi: 10.1145/3418896.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ebraheem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thirumuruganathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Joty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouzzani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Tang</surname>
          </string-name>
          . “
          <article-title>Distributed representations of tuples for entity resolution”</article-title>
          .
          <source>In: Proceedings of the VLDB Endowment 11.11</source>
          (
          <year>2018</year>
          ), pp.
          <fpage>1454</fpage>
          -
          <lpage>1467</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Elsner</surname>
          </string-name>
          and
          <string-name>
            <given-names>W.</given-names>
            <surname>Schudy</surname>
          </string-name>
          . “
          <article-title>Bounding and comparing methods for correlation clustering beyond ILP”</article-title>
          .
          <source>In: Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing</source>
          .
          <year>2009</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          . “node2vec:
          <article-title>Scalable feature learning for networks”</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          .
          <year>2016</year>
          , pp.
          <fpage>855</fpage>
          -
          <lpage>864</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lindgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Simcha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chern</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          . “
          <article-title>Accelerating large-scale inference with anisotropic vector quantization”</article-title>
          .
          <source>In: International Conference on Machine Learning. PMLR</source>
          .
          <year>2020</year>
          , pp.
          <fpage>3887</fpage>
          -
          <lpage>3896</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          . “
          <article-title>Creating probabilistic databases from duplicated data”</article-title>
          .
          <source>In: The VLDB Journal 18.5</source>
          (
          <year>2009</year>
          ), pp.
          <fpage>1141</fpage>
          -
          <lpage>1166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hendriks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Groth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>van Erp</surname>
          </string-name>
          . “
          <article-title>Recognising and Linking Entities in Old Dutch Text: A Case Study on VOC Notary Records”</article-title>
          .
          <source>In: Proceedings of the International Conference Collect and Connect: Archives and Collections in a Digital Age</source>
          .
          <year>2021</year>
          , pp.
          <fpage>25</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zamborlini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Van Harmelen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Latronico</surname>
          </string-name>
          . “
          <article-title>Contextual entity disambiguation in domains with weak identity criteria: Disambiguating golden age amsterdamers”</article-title>
          .
          <source>In: Proceedings of the 10th International Conference on Knowledge Capture</source>
          .
          <year>2019</year>
          , pp.
          <fpage>259</fpage>
          -
          <lpage>262</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Idrissou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoekstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Van Harmelen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khalili</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Van Den Besselaar</surname>
          </string-name>
          .
          “
          <article-title>Is my: sameas the same as your: sameas? lenticular lenses for context-specific identity”</article-title>
          .
          <source>In: Proceedings of the Knowledge Capture Conference</source>
          .
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Koho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Leskinen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Hyvönen</surname>
          </string-name>
          .
          “
          <article-title>Integrating historical person registers as linked open data in the warsampo knowledge graph”</article-title>
          .
          <source>In: Semantic Systems. In the Era of Knowledge Graphs</source>
          .
          <source>SEMANTiCS</source>
          (
          <year>2020</year>
          ), pp.
          <fpage>118</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Latronico</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zamborlini</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissou</surname>
          </string-name>
          . “
          <article-title>AMSTERDAMERS: from the Golden Age to the Information Age via Lenticular Lenses”</article-title>
          .
          <source>In: Digital Humanities Benelux</source>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Nentwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Groß</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          “
          <article-title>Holistic entity clustering for linked data”</article-title>
          .
          <source>In: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW)</source>
          .
          <source>IEEE</source>
          .
          <year>2016</year>
          , pp.
          <fpage>194</fpage>
          -
          <lpage>201</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          . “
          <article-title>Limes-a time-efficient approach for large-scale link discovery on the web of data”</article-title>
          .
          <source>In: integration 15.3</source>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Manning</surname>
          </string-name>
          . “Glove:
          <article-title>Global vectors for word representation”</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          .
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>B.</given-names>
            <surname>Perozzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Skiena</surname>
          </string-name>
          . “Deepwalk:
          <article-title>Online learning of social representations”</article-title>
          .
          <source>In: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          .
          <year>2014</year>
          , pp.
          <fpage>701</fpage>
          -
          <lpage>710</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Raad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mourits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rijpma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zijdeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Mandemakers</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Meroño-Peñuela</surname>
          </string-name>
          .
          <article-title>“Linking Dutch civil certificates”</article-title>
          . In: (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Raad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pernelle</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Saïs</surname>
          </string-name>
          . “
          <article-title>Detection of contextual identity links in a knowledge base”</article-title>
          .
          <source>In: Proceedings of the knowledge capture conference</source>
          .
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saeedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nentwig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Peukert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          “
          <article-title>Scalable matching and clustering of entities with FAMER”</article-title>
          .
          <source>In: Complex Systems Informatics and Modeling Quarterly</source>
          <volume>16</volume>
          .16 (
          <year>2018</year>
          ), pp.
          <fpage>61</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Saeedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Peukert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          “
          <article-title>Using link features for entity clustering in knowledge graphs”</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          . Springer.
          <year>2018</year>
          , pp.
          <fpage>576</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.</given-names>
            <surname>Volz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaedke</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          .
          “
          <article-title>Silk-a link discovery framework for the web of data”</article-title>
          .
          <source>In: Ldow</source>
          .
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Feng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          . “
          <article-title>Knowledge graph embedding by translating on hyperplanes”</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . Vol.
          <volume>28</volume>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          . “
          <article-title>Collaborative knowledge base embedding for recommender systems”</article-title>
          .
          <source>In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining</source>
          .
          <year>2016</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>