=Paper=
{{Paper
|id=Vol-2126/paper8
|storemode=property
|title=Blocking Music Metadata from Heterogenous Data Sources
|pdfUrl=https://ceur-ws.org/Vol-2126/paper8.pdf
|volume=Vol-2126
|authors=Oliver Pabst,Udo W. Lipeck
|dblpUrl=https://dblp.org/rec/conf/gvd/PabstL18
}}
==Blocking Music Metadata from Heterogenous Data Sources==
Oliver Pabst (pabst@dbs.uni-hannover.de) and Udo W. Lipeck (ul@dbs.uni-hannover.de)
FG Datenbanken und Informationssysteme, Institut für Praktische Informatik, Leibniz Universität Hannover

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

ABSTRACT

Entity resolution or object matching describes the assignment of different objects to each other that describe the same object of the real world. It is used in a variety of technical systems, e.g. systems that fuse different data sources. Blocking is used in this context as an approach to reduce the total amount of comparisons by grouping similar objects in the same cluster and dissimilar objects in different clusters. As a result, only the objects of the same cluster have to be compared to each other. To deal with noise, for instance spelling errors, that can result from different heterogeneous data sources, various blocking approaches exist that may add or remove redundancy in the data.

In this paper we propose a system that utilizes a derivative of the standard blocking technique to compute correspondences between objects as starting points for a graph matching process. The blocking technique, which usually relies on identity of blocking keys derived from attributes, is modified to cope with heterogeneous source data with few attributes suitable for matching. A common criticism of standard blocking is low efficiency, since the block sizes are unbalanced with regard to the number of contained entities. We take precautions to keep the efficiency high by reducing the size and amount of large partitions.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous

General Terms: Blocking, matching, entity resolution

Keywords: Blocking, matching, entity resolution

1. INTRODUCTION

Merging two different relational databases is a difficult task. There are well known research approaches in the area of information integration where mappings between relational schemas are the groundwork before relational schemas can be matched and later fused. Whereas relational databases have to compute expensive joins over relations to gather data that are related, the graph data model is predestined to directly store relational data and process them object-wise. While we have to perform complex operations in relational systems to get the join partners of a specific tuple, in the graph data model we just have to query the adjacent edges and nodes of a specific node. Now if we want to use graphs to match entities, we do need some starting points, since testing all possibilities is infeasible.

As an origin we have two relational databases, Discogs (https://www.discogs.com) and MusicBrainz (https://musicbrainz.org), which collect and maintain music metadata from web communities and which we want to fuse into one database. Both databases are structurally as well as syntactically heterogeneous. To overcome this heterogeneity, we have defined an integrated schema and have transformed both databases into this schema. There are, however, very few attributes in the data that are suitable for an attribute-based matching process. For example, the position of a track on a release or the genre of a release are by no means suitable: genres occur frequently and possess only a small number of distinct values, i.e., a low entropy; and track positions are not comparable, since two tracks are not more similar if the difference of their positions is low. Additionally these data are afflicted by noise in the form of spelling mistakes and modifications by internal processes. No matter how a later object matching is implemented (its focus could be shifted from attribute-based to relational similarities [5]), the blocking before matching has to stay reliable even for few and 'noisy' attributes. Thus we need correspondences between blocks based on similar instead of identical blocking keys. These correspondences form partitions of matching candidates that can be processed in parallel.

During initial tests we have looked at different solutions to perform the blocking process directly on the database system that holds our data. [2] provides a list of possible approaches like q-gram indexing, which first seemed promising, as PostgreSQL already has indexing structures that support q-grams and matching operators in place. Unfortunately the approach did not perform well at all: if it is used to perform string matching between two different tables, a nested loop join is required to compute the string similarities. This leads to very high runtimes (days) for the calculations, despite the indexes. Additionally the database approach does not scale well, since operations are performed sequentially, not in parallel. Other indexing techniques like phonetic codes were considered, but also discarded, because they delivered no suitable results.

To deal with this problem, we have decided to modify the standard blocking technique, first presented by Fellegi and Sunter [3] and later picked up by [9] and [8], to cope with very few available attributes. With regards to performance we want to remain as efficient as standard blocking while decreasing the redundancy and increasing the quality. Our proposed method at first behaves like a redundancy-positive method, since similar (i.e. candidate) entity pairs may appear in multiple block correspondences, but finally it delivers a redundancy-free blocking, since duplicate entity pairs are filtered after the block building process.

This paper is structured as follows. In the next subsection we provide a brief overview of the various classes of blocking techniques. In section 2 we outline the steps of our envisioned process that takes two input data sources and computes starting points for a graph matching process. We explain our new blocking technique, outline the differences to standard blocking and explain our modifications. Furthermore, we describe the block matching which calculates the correspondences. We show which steps we take to handle noise in the data sources and how we compute block pairs that might be similar and have to be checked in a subsequent graph-matching process; that later process is not part of this paper. In section 3 we describe our experiences with the blocking process. Finally, in section 4 we present the conclusions and future work.

Related work

Blocking has been introduced as an important preparation of matching by, e.g., Christen [1]. Over time a variety of blocking techniques have evolved. Papadakis [7] categorises them into two classes: redundancy-free methods and methods manipulating redundancy. Redundancy in this case means that the same entity may occur in more than one block. The standard blocking proposed by Fellegi and Sunter [3] and later modified by Papadakis [8] assigns a blocking key to every entity and groups two entities in the same block if their blocking keys match exactly. This technique is efficient and free of redundancy, but it can lead to insufficient results, as it does not handle noise and missing values well.

The other class, methods manipulating redundancy, i.e. methods that may decrease or increase the redundancy, can roughly be separated into three categories: negative, neutral and positive redundancy. An example for negative redundancy is canopy clustering [6]: in every iteration an object is picked as a cluster center, points within a threshold distance are assigned to the cluster, and objects inside a smaller threshold distance are completely removed from the dataset. This is justified by the assumption that highly similar objects are most likely in the same block and will not match with other objects; by removing them from the dataset, these data cannot be assigned to a new cluster, thus the redundancy is decreased. Redundancy-neutral techniques comprise the same number of common blocks across all entity pairs. An example for this category is sorted neighbourhood blocking [4], as the sliding window is shared among all entity pairs of the dataset. Redundancy-positive techniques, however, share the concept that the similarity of two entities correlates with the number of blocks that contain both entities. An example is q-gram blocking, where two entities are most similar to each other if they share all q-grams, while they are least similar if they have no common q-gram.

2. PROCESS OVERVIEW

As described earlier, we want to compute possible correspondences between nodes of two different graphs to initialise a graph-matching process. To achieve this goal we have developed a process that takes data from two sources in the form s1: (id, string), s2: (id, string), which is sufficient to describe an entity. In our prototype, which currently works with music metadata, we use the names of artists (together with their local identifiers) to approximately identify real world objects in our data sources. Other attributes of artists are not given, but rich contexts of related tracks, labels, releases, etc.

These input data need to be preprocessed, since they are afflicted by noise, be it human mistakes like spelling errors which result in transposed characters or omitted strings, or modifications by internal processes of the systems that handle the metadata (by system we mean internal rules inside a processing system that affect the data as they are fed into the information system, processed and stored). The string content of an incoming record is split apart at blanks into tokens, and the tokens are then processed by applying a stop word list and regular expressions.

Figure 1: process overview (Input -> Preprocess (stop words, regular expressions) -> Block Building -> Block Matching -> Partition Creation -> Pair Filtering -> Output)

To reduce the number of comparisons needed to compute possible correspondences between both datasets, we use the standard blocking technique. We work with either of two different strategies to create blocks from the preprocessed input data. The split-at-blank approach keeps the tokens apart, which potentially leads to more than one block per entity. Alternatively, blanks are eliminated by the merge-at-blank approach, so that the incoming tokens are merged; with this strategy only one block per entity will be created. Of course, created blocks will collect further entities with the same split/merged tokens.

Then we have to match blocks of entities from our two sources. To avoid missing matches due to noise, we do not use exact matching to compute correspondences between blocks. Instead, blocks with same or similar labels shall be matched by applying a string similarity measure to the labels of all block pairs. Thus we can additionally apply a variable threshold that can be set appropriately to fit the noise of the data sources.

The resulting matches from the block matching are then used to create partitions. Each match is used to build a partition, which clusters the identifiers of similar block labels together according to the computed block matches; such a partition carries two lists that contain the identifiers participating in the block matching.

To reduce redundancy and thus increase efficiency, we apply a pair filtering after the block matching process. If we consider the strategy that splits the entity string at blanks into multiple tokens, it is obvious that an entity can be contained in multiple blocks. Since this applies to both data sources, it is clear that the same entity can participate in more than one candidate pair.

The resulting output is a set of partitions that contain all correspondence pairs of entities that we want to use as starting points for the later graph-matching process. There we will decide which entity pairs really match, in our application mainly by utilizing relational similarities. To save space, pairs are stored implicitly; they can be reconstructed by building the local cartesian product between the two stored identifier lists, while discarded pairs are stored in a blacklist. The whole process is depicted in figure 1, and its components are further described in the following.
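The preprocessing and the two block building strategies of the overview can be sketched as follows. This is a minimal Python sketch of our own (the system described here was implemented in Java); the stop word list and the noise patterns are illustrative assumptions, not the domain-specific lists actually used:

```python
import re

# Illustrative assumptions, not the authors' actual lists: a tiny domain
# stop word list and noise patterns for numeric suffixes and apostrophes.
STOP_WORDS = {"a", "to", "the", "dj"}
NOISE_PATTERNS = [r"\(\d+\)", r"['´’]"]

def preprocess(name):
    """Lower-case the string, strip noise via regular expressions,
    split at blanks and drop stop words."""
    cleaned = name.lower()
    for pattern in NOISE_PATTERNS:
        cleaned = re.sub(pattern, "", cleaned)
    return [token for token in cleaned.split() if token not in STOP_WORDS]

def build_blocks(entities, strategy="split"):
    """entities: {identifier: name}. Returns {block label: set of identifiers}.
    'split' keeps one block per remaining token (split-at-blank),
    'merge' fuses all tokens into a single label (merge-at-blank)."""
    blocks = {}
    for identifier, name in entities.items():
        tokens = preprocess(name)
        if not tokens:
            continue
        labels = tokens if strategy == "split" else ["".join(tokens)]
        for label in labels:
            blocks.setdefault(label, set()).add(identifier)
    return blocks
```

For the sample entities used in the paper (1: a day to remember, 2: green day (2), 3: fiddler's green), the split strategy yields the blocks day: {1, 2}, remember: {1}, green: {2, 3} and fiddlers: {3}, while the merge strategy yields dayremember: {1}, greenday: {2} and fiddlersgreen: {3}.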
2.1 Preprocess and Block Building

The proposed blocking technique is based on the method presented by Papadakis and Nejdl [8]. Strings are initially split at blanks into tokens. To each token of a split string a domain-specific stop word list is applied to omit frequent substrings that do not contribute to a matching decision (e.g. the). Afterwards the remaining tokens are either kept separated (split-at-blank strategy) or the blanks are eliminated and the remaining tokens are fused (merge-at-blank strategy).

In our approach the incoming entities, which consist of a unique identifier and a string, are initially preprocessed before being handled by the block building process, in order to reduce the quantity of noise in the data. Opposite to [8], due to the lack of suitable matching attributes, we do not use a unique identifier accompanied by a set of key-value pairs, but only a unique identifier and one associated attribute. A set of sample entities can be seen in figure 2.

Figure 2: input example (1: a day to remember, 2: green day (2), 3: fiddler's green)

As previously described, the strings are first separated into tokens using blanks as delimiters. A domain-specific stop word list with common expressions (e.g. the, a or, in the music domain, dj) is applied, and afterwards a set of regular expressions is used to eliminate disturbing suffixes like (1), (3) or ?. Consequently some tokens will be removed in this step.

In figure 3 (left) the result of the preprocessing can be seen. For the first entity the stop words a and to have been removed. The entities 2 and 3 were modified by regular expressions, removing the numerical suffix (2) from entity 2 and the apostrophe from entity 3.

Figure 3: cleansed sample data; (left) split-at-blank: "day remember", "green day", "fiddlers green"; (right) merge-at-blank: "dayremember", "greenday", "fiddlersgreen"

In figure 3 the results of both string handling strategies are depicted. On the left side, the remaining tokens are kept unchanged for the blocking step. The benefit of this approach is that big mistakes can be compensated, as we may still get a match via other blocks that contain the same entity. On the other hand this may lead to a large number of blocks; entities which consist of more than one token and spelling mistakes will increase the count even further. Additionally, this approach does not cover cases where blanks were omitted and tokens were mistakenly concatenated in the source data. With the second strategy all tokens of one entity will be collapsed into just one string, which results in just one block. While we are then able to handle mistakenly concatenated strings, big mistakes cannot be compensated by other blocks, but only by significantly reducing the similarity threshold required to assume a match.

Figure 4: resulting blocks; (left) split-at-blank: remember: {1}, day: {1, 2}, green: {2, 3}, fiddlers: {3}; (right) merge-at-blank: dayremember: {1}, greenday: {2}, fiddlersgreen: {3}

The result of the block building is shown in figure 4. The first strategy places the entities 1 and 2 in the same block for the token day and the entities 2 and 3 in the same block for the token green. For the desired matching of two data sources it is not required that similar objects of the same source are actually placed in the same block, since a deduplication is usually presumed beforehand.

Each block has a label that indicates the token it originated from and a list of identifiers of entities that contain the token. These labels are used in the block matching, but are irrelevant after the block matching is performed.

2.2 Block matching

After the block building we have to apply a matching between the block sets of both data sources to compute the possible correspondences between entities from both data sources. Since the sources are different, we have to handle syntactical heterogeneity, in other words noise. Therefore we must not use exact matching to determine correspondences between blocks from both sources. Because the block labels are stored as strings, we use the Jaro-Winkler string similarity to measure the similarity of block labels, and we need a threshold to determine whether two blocks represent possible entity correspondences or not. To compute the matching, we have to compare the block labels of both block sets. Comparing them all to each other is computationally very expensive, so we choose to sort the block labels in alphabetical order and apply a sliding window approach to reduce the number of comparisons.
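The similarity measure and the sorted sliding-window comparison can be sketched as follows. The Jaro-Winkler code is a textbook formulation, and the window handling (a fixed-size window slid over the merged, sorted label list, with an assumed window size of 4) is one plausible reading of the scheme rather than the authors' exact procedure:

```python
def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Textbook Jaro-Winkler similarity in [0, 1]; 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                   # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    mismatches, k = 0, 0                         # count transposed matches
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                mismatches += 1
            k += 1
    jaro = (m / len1 + m / len2 + (m - mismatches / 2) / m) / 3
    prefix = 0                                   # common prefix, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)

def match_blocks(labels1, labels2, threshold=0.9, window_size=4):
    """Sort the labels of both sources into one list and compare only
    cross-source labels that share a window of window_size entries."""
    tagged = sorted([(l, 1) for l in labels1] + [(l, 2) for l in labels2])
    matches = set()
    for start in range(len(tagged)):
        win = tagged[start:start + window_size]
        for la, sa in win:
            for lb, sb in win:
                if sa == 1 and sb == 2 and jaro_winkler(la, lb) >= threshold:
                    matches.add((la, lb))
    return matches
```

Applied to the example block labels of the running example with a threshold of 0.9, this sketch reproduces the correspondences of figure 6: day-day, green-green, green-grene, fiddlers-fidelers and remebmer-remember are matched, while offspring finds no partner.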
We hereby rely on the assumption that spelling mistakes are more likely to appear in the end part of a word; then a spelling error will not much affect the sorting position.

Let us assume that the block building has been applied on datasets similar to figure 2, using the split-at-blank strategy. The built blocks for both data sources are depicted in figure 5; arrows represent the correspondences between the blocks. The blocks on the left side have resulted from the artist names 1) A Day To Remebmer, 2) Green Day and 3) Fiddler's Green; we have noise in the form of two transposed characters in the token remebmer (b ↔ m). On the right side, the blocks are built from the artist names 5) The Offspring, 3) Grene Day, 7) A Day To Remember and 2) Fideler's Green; here we have more noise.

Figure 5: block matching result for two block sets using exact matching; source 1: remebmer: {1}, day: {1, 2}, green: {2, 3}, fiddlers: {3}; source 2: offspring: {5}, day: {3, 7}, grene: {3}, remember: {7}, fidelers: {2}, green: {2}

If we take the example in figure 5 and assume exact matching, two block matches (day and green) are considered, which would cover a few entity match candidates, like 1-7 (A Day To Remebmer, A Day To Remember) or 2-3 (Green Day, Grene Day).

In contrast, figure 6 depicts the results for a non-exact matching of the block labels using the Jaro-Winkler distance; similarities at a threshold of 0.9 or above are considered a block match. With this approach we also get an entity match candidate for the artist Fiddler's Green, as both tokens have matching partners on each side (fiddlers and fidelers, green and grene). With a threshold of 0.9, the artist The Offspring does not match with anything, as there is no fitting partner on the left side. Arguably this could change if we lowered the threshold strongly. In this example, though, the threshold is low enough to discover all reasonable matchings, but high enough to avoid unreasonable matches.

Figure 6: block matching result for two block sets using non-exact matching (threshold: 0.9); additional correspondences: remebmer ↔ remember, green ↔ grene, fiddlers ↔ fidelers

2.3 Partition building

For further computation, we utilize the existing block correspondences as partitions for our later entity matching. Thus we can load the data of all entities that are contained in both blocks in one in-memory step. The partitions consist of two separate lists that each contain the identifiers stored in one block. As aforementioned, each partition also carries a blacklist, introduced in the next subsection.

Figure 7: result partitions
  partition 1: s1-ids {1},    s2-ids {7},    blacklist ∅
  partition 2: s1-ids {1, 2}, s2-ids {3, 7}, blacklist ¬(1, 7)
  partition 3: s1-ids {2, 3}, s2-ids {3},    blacklist ¬(2, 3)
  partition 4: s1-ids {3},    s2-ids {2},    blacklist ∅
  partition 5: s1-ids {3},    s2-ids {2},    blacklist ¬(3, 2)

As can be seen in figure 7, the partitions are built according to the block matchings from figure 6. Partition 3, e.g., emerges from the matching of the blocks labeled green and grene. The identifiers are taken from the blocks and added to the identifier lists of the partition. Please note that, to save space, we do not actually store all pairs; instead we carry two lists along, each containing the identifiers from one source, and reconstruct the pairs later.

Additionally, two inverse indexes are created alongside the partitions (figure 8). The index on each source tells, for each identifier, in which partitions it is contained. These indexes will be used in the next step, the pair filtering.

Figure 8: inverse indexes; source 1: 1 → p1, p2; 2 → p2, p3; 3 → p3, p4, p5 / source 2: 2 → p4, p5; 3 → p2, p3; 7 → p1, p2
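Partition building with two identifier lists and inverse indexes, together with the blacklist-based duplicate filtering applied to them afterwards, might look as follows. This is a sketch under our own assumption that a duplicate pair is kept only in the first partition that contains it, as in the worked example of figure 7:

```python
from itertools import product

def build_inverse_indexes(partitions):
    """partitions: list of (s1_ids, s2_ids). One inverse index per source,
    mapping an identifier to the set of partitions that contain it."""
    index1, index2 = {}, {}
    for p, (s1_ids, s2_ids) in enumerate(partitions):
        for i in s1_ids:
            index1.setdefault(i, set()).add(p)
        for j in s2_ids:
            index2.setdefault(j, set()).add(p)
    return index1, index2

def pair_filter(partitions):
    """Returns one blacklist per partition. A pair occurring in several
    partitions stays in the first one and is blacklisted everywhere else."""
    index1, index2 = build_inverse_indexes(partitions)
    tested = set()                       # temporary blacklist of tested pairs
    blacklists = [set() for _ in partitions]
    for p, (s1_ids, s2_ids) in enumerate(partitions):
        for pair in product(s1_ids, s2_ids):     # local cartesian product
            if pair in tested:
                continue
            tested.add(pair)
            # intersection of both index lookups: all partitions with this pair
            shared = sorted(index1[pair[0]] & index2[pair[1]])
            for q in shared[1:]:         # keep the pair only in the first one
                blacklists[q].add(pair)
    return blacklists
```

For the five partitions of figure 7, this yields empty blacklists for partitions 1 and 4, ¬(1, 7) for partition 2, ¬(2, 3) for partition 3 and ¬(3, 2) for partition 5.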
2.4 Pair filtering

Once the partitions are built, we apply a filter to eliminate redundant pairs across the generated partitions. To achieve this, we utilize the inverse indexes that were created in the partition building step and a temporary blacklist, in order to avoid testing pairs for duplicates more than once. For each partition, we reconstruct all possible pairs using the cartesian product. For each pair, if it is not already contained in the temporary blacklist, we compose two sets by gathering its occurrences in partitions through lookups in the inverse indexes. Next, we calculate the intersection of both sets, i.e., all partitions that contain the current pair. We then add the pair to the temporary blacklist and to the blacklists of all partitions but the first.

The blacklists are used later, when the pairs have to be reconstructed using the in-memory cartesian product; blacklisted pairs will be rejected during the reconstruction process.

In our example we can see that three duplicate pairs exist across the five partitions: the pair (1, 7) exists in partitions 1 and 2, the pair (2, 3) exists in partitions 2 and 3, and the pair (3, 2) exists in partitions 4 and 5. Accordingly, the pair (1, 7) is added to the blacklist of partition 2, the pair (2, 3) is added to the blacklist of partition 3, and the pair (3, 2) is added to the blacklist of partition 5. Thereby partition 5 becomes empty and can consequently be removed.

The partitions, as depicted in figure 7, represent the output of our process and contain the correspondences that we use to initialise a matching process across the two graphs representing our data sources.

3. PRELIMINARY RESULTS

The blocking technique presented was successfully applied in a project at our institute, namely in the master thesis of Kroll [5], who investigated relational similarity in the context of a matching process for graph databases.

We have tested the blocking process with the database dumps of the Discogs and MusicBrainz projects. The Discogs dataset contains over 400 000 artists, while the MusicBrainz dataset comprises over 900 000 artists. After the preprocessing and the block building are applied on both datasets, we get different numbers of blocks for the two block building strategies. Utilizing the split-at-blank strategy, we obtain 242 900 blocks for the Discogs dataset and 435 563 blocks for the MusicBrainz dataset; this corresponds roughly to a reduction by half. With the merge-at-blank strategy, we receive 445 763 blocks for Discogs and 903 059 blocks for MusicBrainz; contrary to the other strategy, we see no reduction with regards to the block count.

In our experiments we examined the effectiveness of our pair filtering approach. Tables 1 and 2 show the results for both implemented strategies. The aim of these experiments was to investigate the benefit of pair filtering concerning the amount of detected duplicates. For measurement we counted the number of pairs immediately before and after applying the duplicate filtering. All experiments were conducted on a 3.4 GHz quad-core CPU with 32 gigabytes of main memory, running Linux 4.15 (Debian) and using Java 8u162.

Table 1: pair filtering results for split-strategy
  sim    # before       # after        duplicates (%)   runtime (hh:mm:ss)
  1.0    734.047.668    728.536.532    0.76             15:31
  0.99   745.763.364    739.682.616    0.82             19:28
  0.95   810.004.501    799.165.736    1.36             1:05:55
  0.9    872.870.493    858.427.029    1.68             3:23:52

In table 1 the results for pair filtering utilizing the split-at-blank strategy with various similarity thresholds are depicted. With a decreasing similarity threshold we can see that the number of pairs to test increases; yet the amount of duplicates is initially small and does not grow jointly with the number of pairs to be tested for possible duplicates.

When applying the merge-at-blank strategy, the number of pairs to test is obviously dramatically lower than with the split-at-blank strategy. Consequently, the required runtimes for pair filtering are significantly lower than the corresponding runtimes for the split-at-blank strategy, due to the lower pair count. With regards to the amount of duplicates detected by the pair filtering, the results for the merge-at-blank strategy in table 2 are similarly underwhelming: in this case, pair filtering yields less than one percent reduction of the total amount of pairs.

Table 2: pair filtering results for merge-strategy
  sim    # before     # after      duplicates (%)   runtime (mm:ss)
  1.0    1.873.697    1.867.733    0.32             00:03
  0.99   2.013.566    2.004.183    0.47             00:04
  0.95   3.607.298    3.574.736    0.91             00:11
  0.9    9.185.447    9.103.081    0.90             01:26

Obviously the chosen attributes, the artist names of the selected datasets, have a high entropy; thus the effect of a post-processing that applies a pair filtering step to eliminate duplicate pairs is small, as the effort spent for the post-processing bears no proportion to the savings. Especially for the split-at-blank strategy the increasing runtime with a lowered similarity threshold becomes disproportionate.

4. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a new blocking method as part of a process that allows the computation of starting points for the initialisation of a graph matching process. We took steps to adapt the standard blocking technique to datasets that have few attributes suitable for a matching. We also addressed the disadvantages of standard blocking by introducing stop word lists and regular expressions to reduce the size and amount of overly large blocks.

Kroll [5] has obtained very satisfactory quality results on the overall graph matching starting with our blocking method. Nevertheless, we have to thoroughly evaluate several key aspects of our technique. Foremost, we have to evaluate the quality of our approach with regards to accuracy, especially in comparison to competing techniques. Additional steps have to be taken for the evaluation of two
For measurement we counted core elements of our block matching step, the effect of the used similarity function and the optimal size of the sliding window. Concerning the result quality we want to explore our blocking technique with alternatives to the Jaro-Winkler distance and varying sizes of the sliding window and their ramifications on the quality. 5. REFERENCES [1] P. Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, 2012. [2] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24(9):1537–1555, 2012. [3] I. P. Fellegi and A. B. Sunter. A theory for record linkage. American Statistical Association, 64(328):1183–1210, 1969. [4] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In M. J. Carey and D. A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995., pages 127–138. ACM Press, 1995. [5] H. Kroll. Relationale Ähnlichkeit im Matching-Prozess für Graphdatenbanken. Master’s thesis, Leibniz University Hannover, 2017. [6] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, and I. Parsa, editors, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, MA, USA, August 20-23, 2000, pages 169–178. ACM, 2000. [7] G. Papadakis. Blocking techniques for efficient entity resolution over large, highly heterogeneous information spaces. PhD thesis, Leibniz University Hannover, 2013. [8] G. Papadakis and W. Nejdl. Efficient entity resolution methods for heterogeneous information spaces. In S. Abiteboul, K. Böhm, C. Koch, and K. 
Tan, editors, Workshops Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 304–307. IEEE Computer Society, 2011. [9] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2):6, 2006.