=Paper=
{{Paper
|id=Vol-2126/paper8
|storemode=property
|title=Blocking Music Metadata from Heterogenous Data Sources
|pdfUrl=https://ceur-ws.org/Vol-2126/paper8.pdf
|volume=Vol-2126
|authors=Oliver Pabst,Udo W. Lipeck
|dblpUrl=https://dblp.org/rec/conf/gvd/PabstL18
}}
==Blocking Music Metadata from Heterogenous Data Sources==
Oliver Pabst (pabst@dbs.uni-hannover.de) and Udo W. Lipeck (ul@dbs.uni-hannover.de)
FG Datenbanken und Informationssysteme, Institut für Praktische Informatik, Leibniz Universität Hannover

30th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 22.05.2018 - 25.05.2018, Wuppertal, Germany. Copyright is held by the author/owner(s).

ABSTRACT

Entity resolution or object matching describes the assignment of different objects to each other that describe the same object of the real world. It is used in a variety of technical systems, e.g. systems that fuse different data sources. Blocking is used in this context as an approach to reduce the total amount of comparisons by grouping similar objects in the same cluster and dissimilar objects in different clusters. As a result, only the objects of the same cluster have to be compared to each other. To deal with noise, for instance spelling errors, that can result from different heterogeneous data sources, various blocking approaches exist that may add or remove redundancy in the data.

In this paper we propose a system that utilizes a derivative of the standard blocking technique to compute correspondences between objects as starting points for a graph matching process. The blocking technique, which usually relies on identity of blocking keys derived from attributes, is modified to cope with heterogeneous source data with few attributes suitable for matching. A common criticism of standard blocking is low efficiency, since the block sizes are unbalanced with regard to the number of contained entities. We take precautions to keep the efficiency high by reducing the size and amount of large partitions.

Categories and Subject Descriptors: H.4 [Information Systems Applications]: Miscellaneous

General Terms: Blocking, matching, entity resolution

Keywords: Blocking, matching, entity resolution

1. INTRODUCTION

Merging two different relational databases is a difficult task. There are well known research approaches in the area of information integration where mappings between relational schemas are the groundwork before relational schemas can be matched and later fused. Whereas relational databases have to compute expensive joins over relations to gather data that are related, the graph data model is predestined to directly store relational data and process them object-wise. While we have to perform complex operations in relational systems to get the join partners of a specific tuple, in the graph data model we just have to query the adjacent edges and nodes of a specific node. Now if we want to use graphs to match entities, we do need some starting points, since testing all possibilities is infeasible.

As an origin we have two relational databases, Discogs (https://www.discogs.com) and MusicBrainz (https://musicbrainz.org), which collect and maintain music metadata from web communities and which we want to fuse into one database. Both databases are structurally as well as syntactically heterogeneous. To overcome this heterogeneity, we have defined an integrated schema and have transformed both databases into this schema. There are, however, very few attributes in the data that are suitable for an attribute-based matching process. For example, the position of a track on a release or the genre of a release are by no means suitable: genres occur frequently and possess only a small number of distinct values, i.e., a low entropy; and track positions are not comparable, since two tracks are not more similar if the difference of their positions is low. Additionally these data are afflicted by noise in the form of spelling mistakes and modifications by internal processes. No matter how a later object matching is implemented (its focus could be shifted from attribute-based to relational similarities [5]), the blocking before matching has to stay reliable even for few and 'noisy' attributes. Thus we need correspondences between blocks based on similar instead of identical blocking keys. These correspondences form partitions of matching candidates that can be processed in parallel.

During initial tests we have looked at different solutions to perform the blocking process directly on the database system that holds our data. [2] provides a list of possible approaches like q-gram indexing, which first seemed promising, as PostgreSQL already has indexing structures that support q-grams and matching operators in place. Unfortunately the approach did not perform well at all: if it is used to perform string matching between two different tables, a nested loop join is required to compute the string similarities. This leads to very high runtimes (days) for the calculations, despite the indexes. Additionally the database approach does not scale well, since operations are performed sequentially, not in parallel. Other indexing techniques like phonetic codes were considered, but also discarded, because they delivered no suitable results.

To deal with this problem, we have decided to modify the standard blocking technique, first presented by Fellegi and Sunter [3] and later picked up by [9] and [8], to cope with very few available attributes. With regards to performance we want to remain as efficient as standard blocking while decreasing the redundancy and increasing the quality. Our proposed method at first behaves like a redundancy-positive method, since similar (i.e. candidate) entity pairs may appear in multiple block correspondences, but finally it delivers a redundancy-free blocking, since duplicate entity pairs are filtered after the block building process.

This paper is structured as follows. In the next subsection we provide a brief overview of the various classes of blocking techniques. In section 2 we outline the steps of our envisioned process that takes two input data sources and computes starting points for a graph matching process. We explain our new blocking technique, outline the differences to standard blocking and explain our modifications. Furthermore, we describe the block matching which calculates the correspondences. We show which steps we take to handle noise in the data sources and how we compute block pairs that might be similar and have to be checked in a subsequent graph-matching process; that later process is not part of this paper. In section 3 we describe our experiences with the blocking process. Finally, in section 4 we present the conclusions and future work.

Related work

Blocking has been introduced as an important preparation of matching by, e.g., Christen [1]. Over time a variety of blocking techniques have evolved. Papadakis [7] categorises them into two classes: redundancy-free methods and methods manipulating redundancy. Redundancy in this case means that the same entity may occur in more than one block. The standard blocking proposed by Fellegi and Sunter [3] and later modified by Papadakis [8] assigns a blocking key to every entity and groups two entities in the same block if their blocking keys match exactly. This technique is efficient and free of redundancy, but it can lead to insufficient results, as it does not handle noise and missing values well.

The other class, methods manipulating redundancy, i.e. methods that may decrease or increase the redundancy, can roughly be separated into three categories: negative, neutral and positive redundancy. An example for negative redundancy is canopy clustering [6]: in every iteration an object is picked as a cluster center, points within a threshold distance are assigned to the cluster, and objects inside a smaller threshold distance are completely removed from the dataset. This is justified by the assumption that highly similar objects are most likely in the same block and will not match with other objects; by removing them from the dataset, these data cannot be assigned to a new cluster, thus the redundancy is decreased. Redundancy-neutral techniques comprise the same number of common blocks across all entity pairs. An example for this category is sorted neighbourhood blocking [4], as the sliding window is shared among all entity pairs of the dataset. Redundancy-positive techniques, however, share the concept that the similarity of two entities correlates with the number of blocks that contain both entities. An example is q-gram blocking, where two entities are most similar to each other if they share all q-grams, while they are least similar if they have no common q-gram.

2. PROCESS OVERVIEW

As described earlier, we want to compute possible correspondences between nodes of two different graphs to initialise a graph-matching process. To achieve this goal we have developed a process that takes data from two sources in the form s1: (id, string), s2: (id, string), which is sufficient to describe an entity. In our prototype, which currently works with music metadata, we use the names of artists (together with their local identifiers) to approximately identify real world objects in our data sources. Other attributes of artists are not given, but rich contexts of related tracks, labels, releases, etc.

These input data need to be preprocessed, since they are afflicted by noise, be it human mistakes like spelling errors which result in transposed characters or omitted strings, or modifications by internal processes of the systems that handle the metadata (by system we mean internal rules inside a processing system that affect the data as they are fed into the information system, processed and stored). The string content of an incoming record is split apart at blanks into tokens, and the tokens are then processed by applying a stop word list and regular expressions.

Figure 1: process overview (Input -> Preprocess (stop words, regular expressions) -> Block Building -> Block Matching -> Partition Creation -> Pair Filtering -> Output)

To reduce the number of comparisons needed to compute possible correspondences between both datasets, we use the standard blocking technique. We work with either of two different strategies to create blocks from the preprocessed input data. The split-at-blank approach keeps the tokens apart, which potentially leads to more than one block per entity. Alternatively, blanks are eliminated by the merge-at-blank approach, so that the incoming tokens are merged; with this strategy only one block per entity will be created. Of course, created blocks will collect further entities with the same split/merged tokens.

Then we have to match blocks of entities from our two sources. To avoid missing matches due to noise, we do not use exact matching to compute correspondences between blocks. Instead, blocks with same or similar labels shall be matched by applying a string similarity measure to the labels of all block pairs. Thus we can additionally apply a variable threshold that can be set appropriately to fit the noise of the data sources.

The resulting matches from the block matching are then used to create partitions. Each match is used to build a partition, which clusters the identifiers of similar block labels together according to the computed block matches; such a partition carries two lists that contain the identifiers participating in the block matching.

To reduce redundancy and thus increase efficiency, we apply a pair filtering after the block matching process. If we consider the strategy that splits the entity string at blanks into multiple tokens, it is obvious that an entity can be contained in multiple blocks. Since this applies to both data sources, it is clear that the same entity can participate in more than one candidate pair.

The resulting output is a set of partitions that contain all correspondence pairs of entities that we want to use as starting points for the later graph-matching process. There we will decide which entity pairs really match, in our application mainly by utilizing relational similarities. To save space, pairs are stored implicitly; they can be reconstructed by building the local cartesian product between the two stored identifier lists, while discarded pairs are stored in a blacklist. The whole process is depicted in figure 1, and its components are further described in the following.
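The preprocessing and the two block building strategies of the overview can be sketched as follows. This is a minimal Python sketch of our own (the system described here was implemented in Java); the stop word list and the noise patterns are illustrative assumptions, not the domain-specific lists actually used:

```python
import re

# Illustrative assumptions, not the authors' actual lists: a tiny domain
# stop word list and noise patterns for numeric suffixes and apostrophes.
STOP_WORDS = {"a", "to", "the", "dj"}
NOISE_PATTERNS = [r"\(\d+\)", r"['´’]"]

def preprocess(name):
    """Lower-case the string, strip noise via regular expressions,
    split at blanks and drop stop words."""
    cleaned = name.lower()
    for pattern in NOISE_PATTERNS:
        cleaned = re.sub(pattern, "", cleaned)
    return [token for token in cleaned.split() if token not in STOP_WORDS]

def build_blocks(entities, strategy="split"):
    """entities: {identifier: name}. Returns {block label: set of identifiers}.
    'split' keeps one block per remaining token (split-at-blank),
    'merge' fuses all tokens into a single label (merge-at-blank)."""
    blocks = {}
    for identifier, name in entities.items():
        tokens = preprocess(name)
        if not tokens:
            continue
        labels = tokens if strategy == "split" else ["".join(tokens)]
        for label in labels:
            blocks.setdefault(label, set()).add(identifier)
    return blocks
```

For the sample entities used in the paper (1: a day to remember, 2: green day (2), 3: fiddler's green), the split strategy yields the blocks day: {1, 2}, remember: {1}, green: {2, 3} and fiddlers: {3}, while the merge strategy yields dayremember: {1}, greenday: {2} and fiddlersgreen: {3}.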
2.1 Preprocess and Block Building

The proposed blocking technique is based on the method presented by Papadakis and Nejdl [8]. Strings are initially split at blanks into tokens. To each token of a split string a domain-specific stop word list is applied to omit frequent substrings that do not contribute to a matching decision (e.g. the). Afterwards the remaining tokens are either kept separated (split-at-blank strategy) or the blanks are eliminated and the remaining tokens are fused (merge-at-blank strategy).

In our approach the incoming entities, which consist of a unique identifier and a string, are initially preprocessed before being handled by the block building process, in order to reduce the quantity of noise in the data. Opposite to [8], due to the lack of suitable matching attributes, we do not use a unique identifier accompanied by a set of key-value pairs, but only a unique identifier and one associated attribute. A set of sample entities can be seen in figure 2.

Figure 2: input example (1: a day to remember, 2: green day (2), 3: fiddler's green)

As previously described, the strings are first separated into tokens using blanks as delimiters. A domain-specific stop word list with common expressions (e.g. the, a or, in the music domain, dj) is applied, and afterwards a set of regular expressions is used to eliminate disturbing suffixes like (1), (3) or ?. Consequently some tokens will be removed in this step.

In figure 3 (left) the result of the preprocessing can be seen. For the first entity the stop words a and to have been removed. The entities 2 and 3 were modified by regular expressions, removing the numerical suffix (2) from entity 2 and the apostrophe from entity 3.

Figure 3: cleansed sample data; (left) split-at-blank: "day remember", "green day", "fiddlers green"; (right) merge-at-blank: "dayremember", "greenday", "fiddlersgreen"

In figure 3 the results of both string handling strategies are depicted. On the left side, the remaining tokens are kept unchanged for the blocking step. The benefit of this approach is that big mistakes can be compensated, as we may still get a match via other blocks that contain the same entity. On the other hand this may lead to a large number of blocks; entities which consist of more than one token and spelling mistakes will increase the count even further. Additionally, this approach does not cover cases where blanks were omitted and tokens were mistakenly concatenated in the source data. With the second strategy all tokens of one entity will be collapsed into just one string, which results in just one block. While we are then able to handle mistakenly concatenated strings, big mistakes cannot be compensated by other blocks, but only by significantly reducing the similarity threshold required to assume a match.

Figure 4: resulting blocks; (left) split-at-blank: remember: {1}, day: {1, 2}, green: {2, 3}, fiddlers: {3}; (right) merge-at-blank: dayremember: {1}, greenday: {2}, fiddlersgreen: {3}

The result of the block building is shown in figure 4. The first strategy places the entities 1 and 2 in the same block for the token day and the entities 2 and 3 in the same block for the token green. For the desired matching of two data sources it is not required that similar objects of the same source are actually placed in the same block, since a deduplication is usually presumed beforehand.

Each block has a label that indicates the token it originated from and a list of identifiers of entities that contain the token. These labels are used in the block matching, but are irrelevant after the block matching is performed.

2.2 Block matching

After the block building we have to apply a matching between the block sets of both data sources to compute the possible correspondences between entities from both data sources. Since the sources are different, we have to handle syntactical heterogeneity, in other words noise. Therefore we must not use exact matching to determine correspondences between blocks from both sources. Because the block labels are stored as strings, we use the Jaro-Winkler string similarity to measure the similarity of block labels, and we need a threshold to determine whether two blocks represent possible entity correspondences or not. To compute the matching, we have to compare the block labels of both block sets. Comparing them all to each other is computationally very expensive, so we choose to sort the block labels in alphabetical order and apply a sliding window approach to reduce the number of comparisons.
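The similarity measure and the sorted sliding-window comparison can be sketched as follows. The Jaro-Winkler code is a textbook formulation, and the window handling (a fixed-size window slid over the merged, sorted label list, with an assumed window size of 4) is one plausible reading of the scheme rather than the authors' exact procedure:

```python
def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Textbook Jaro-Winkler similarity in [0, 1]; 1.0 means identical."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    m = 0
    for i, c in enumerate(s1):                   # count matching characters
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    mismatches, k = 0, 0                         # count transposed matches
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                mismatches += 1
            k += 1
    jaro = (m / len1 + m / len2 + (m - mismatches / 2) / m) / 3
    prefix = 0                                   # common prefix, capped at 4
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)

def match_blocks(labels1, labels2, threshold=0.9, window_size=4):
    """Sort the labels of both sources into one list and compare only
    cross-source labels that share a window of window_size entries."""
    tagged = sorted([(l, 1) for l in labels1] + [(l, 2) for l in labels2])
    matches = set()
    for start in range(len(tagged)):
        win = tagged[start:start + window_size]
        for la, sa in win:
            for lb, sb in win:
                if sa == 1 and sb == 2 and jaro_winkler(la, lb) >= threshold:
                    matches.add((la, lb))
    return matches
```

Applied to the example block labels of the running example with a threshold of 0.9, this sketch reproduces the correspondences of figure 6: day-day, green-green, green-grene, fiddlers-fidelers and remebmer-remember are matched, while offspring finds no partner.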
We hereby rely on the assumption that spelling mistakes are more likely to appear in the end part of a word; then a spelling error will not much affect the sorting position.

Let us assume that the block building has been applied on datasets similar to figure 2, using the split-at-blank strategy. The built blocks for both data sources are depicted in figure 5; arrows represent the correspondences between the blocks. The blocks on the left side have resulted from the artist names 1) A Day To Remebmer, 2) Green Day and 3) Fiddler's Green; we have noise in the form of two transposed characters in the token remebmer (b ↔ m). On the right side, the blocks are built from the artist names 5) The Offspring, 3) Grene Day, 7) A Day To Remember and 2) Fideler's Green; here we have more noise.

Figure 5: block matching result for two block sets using exact matching; source 1: remebmer: {1}, day: {1, 2}, green: {2, 3}, fiddlers: {3}; source 2: offspring: {5}, day: {3, 7}, grene: {3}, remember: {7}, fidelers: {2}, green: {2}

If we take the example in figure 5 and assume exact matching, two block matches (day and green) are considered, which would cover a few entity match candidates, like 1-7 (A Day To Remebmer, A Day To Remember) or 2-3 (Green Day, Grene Day).

In contrast, figure 6 depicts the results for a non-exact matching of the block labels using the Jaro-Winkler distance; similarities at a threshold of 0.9 or above are considered a block match. With this approach we also get an entity match candidate for the artist Fiddler's Green, as both tokens have matching partners on each side (fiddlers and fidelers, green and grene). With a threshold of 0.9, the artist The Offspring does not match with anything, as there is no fitting partner on the left side. Arguably this could change if we lowered the threshold strongly. In this example, though, the threshold is low enough to discover all reasonable matchings, but high enough to avoid unreasonable matches.

Figure 6: block matching result for two block sets using non-exact matching (threshold: 0.9); additional correspondences: remebmer ↔ remember, green ↔ grene, fiddlers ↔ fidelers

2.3 Partition building

For further computation, we utilize the existing block correspondences as partitions for our later entity matching. Thus we can load the data of all entities that are contained in both blocks in one in-memory step. The partitions consist of two separate lists that each contain the identifiers stored in one block. As aforementioned, each partition also carries a blacklist, introduced in the next subsection.

Figure 7: result partitions
  partition 1: s1-ids {1},    s2-ids {7},    blacklist ∅
  partition 2: s1-ids {1, 2}, s2-ids {3, 7}, blacklist ¬(1, 7)
  partition 3: s1-ids {2, 3}, s2-ids {3},    blacklist ¬(2, 3)
  partition 4: s1-ids {3},    s2-ids {2},    blacklist ∅
  partition 5: s1-ids {3},    s2-ids {2},    blacklist ¬(3, 2)

As can be seen in figure 7, the partitions are built according to the block matchings from figure 6. Partition 3, e.g., emerges from the matching of the blocks labeled green and grene. The identifiers are taken from the blocks and added to the identifier lists of the partition. Please note that, to save space, we do not actually store all pairs; instead we carry two lists along, each containing the identifiers from one source, and reconstruct the pairs later.

Additionally, two inverse indexes are created alongside the partitions (figure 8). The index on each source tells, for each identifier, in which partitions it is contained. These indexes will be used in the next step, the pair filtering.

Figure 8: inverse indexes; source 1: 1 → p1, p2; 2 → p2, p3; 3 → p3, p4, p5 / source 2: 2 → p4, p5; 3 → p2, p3; 7 → p1, p2
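Partition building with two identifier lists and inverse indexes, together with the blacklist-based duplicate filtering applied to them afterwards, might look as follows. This is a sketch under our own assumption that a duplicate pair is kept only in the first partition that contains it, as in the worked example of figure 7:

```python
from itertools import product

def build_inverse_indexes(partitions):
    """partitions: list of (s1_ids, s2_ids). One inverse index per source,
    mapping an identifier to the set of partitions that contain it."""
    index1, index2 = {}, {}
    for p, (s1_ids, s2_ids) in enumerate(partitions):
        for i in s1_ids:
            index1.setdefault(i, set()).add(p)
        for j in s2_ids:
            index2.setdefault(j, set()).add(p)
    return index1, index2

def pair_filter(partitions):
    """Returns one blacklist per partition. A pair occurring in several
    partitions stays in the first one and is blacklisted everywhere else."""
    index1, index2 = build_inverse_indexes(partitions)
    tested = set()                       # temporary blacklist of tested pairs
    blacklists = [set() for _ in partitions]
    for p, (s1_ids, s2_ids) in enumerate(partitions):
        for pair in product(s1_ids, s2_ids):     # local cartesian product
            if pair in tested:
                continue
            tested.add(pair)
            # intersection of both index lookups: all partitions with this pair
            shared = sorted(index1[pair[0]] & index2[pair[1]])
            for q in shared[1:]:         # keep the pair only in the first one
                blacklists[q].add(pair)
    return blacklists
```

For the five partitions of figure 7, this yields empty blacklists for partitions 1 and 4, ¬(1, 7) for partition 2, ¬(2, 3) for partition 3 and ¬(3, 2) for partition 5.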
2.4 Pair filtering

Once the partitions are built, we apply a filter to eliminate redundant pairs across the generated partitions. To achieve this, we utilize the inverse indexes that were created in the partition building step and a temporary blacklist, in order to avoid testing pairs for duplicates more than once. For each partition, we reconstruct all possible pairs using the cartesian product. For each pair, if it is not already contained in the temporary blacklist, we compose two sets by gathering its occurrences in partitions through lookups in the inverse indexes. Next, we calculate the intersection of both sets, i.e., all partitions that contain the current pair. We then add the pair to the temporary blacklist and to the blacklists of all partitions but the first.

The blacklists are used later, when the pairs have to be reconstructed using the in-memory cartesian product; blacklisted pairs will be rejected during the reconstruction process.

In our example we can see that three duplicate pairs exist across the five partitions: the pair (1, 7) exists in partitions 1 and 2, the pair (2, 3) exists in partitions 2 and 3, and the pair (3, 2) exists in partitions 4 and 5. Accordingly, the pair (1, 7) is added to the blacklist of partition 2, the pair (2, 3) is added to the blacklist of partition 3, and the pair (3, 2) is added to the blacklist of partition 5. Thereby partition 5 becomes empty and can consequently be removed.

The partitions, as depicted in figure 7, represent the output of our process and contain the correspondences that we use to initialise a matching process across the two graphs representing our data sources.

3. PRELIMINARY RESULTS

The blocking technique presented was successfully applied in a project at our institute, namely in the master thesis of Kroll [5], who investigated relational similarity in the context of a matching process for graph databases.

We have tested the blocking process with the database dumps of the Discogs and MusicBrainz projects. The Discogs dataset contains over 400 000 artists, while the MusicBrainz dataset comprises over 900 000 artists. After the preprocessing and the block building are applied on both datasets, we get different numbers of blocks for the two block building strategies. Utilizing the split-at-blank strategy, we obtain 242 900 blocks for the Discogs dataset and 435 563 blocks for the MusicBrainz dataset; this corresponds roughly to a reduction by half. With the merge-at-blank strategy, we receive 445 763 blocks for Discogs and 903 059 blocks for MusicBrainz; contrary to the other strategy, we see no reduction with regards to the block count.

In our experiments we examined the effectiveness of our pair filtering approach. Tables 1 and 2 show the results for both implemented strategies. The aim of these experiments was to investigate the benefit of pair filtering concerning the amount of detected duplicates. For measurement we counted the number of pairs immediately before and after applying the duplicate filtering. All experiments were conducted on a 3.4 GHz quad-core CPU with 32 gigabytes of main memory, running Linux 4.15 (Debian) and using Java 8u162.

Table 1: pair filtering results for split-strategy
  sim    # before       # after        duplicates (%)   runtime (hh:mm:ss)
  1.0    734.047.668    728.536.532    0.76             15:31
  0.99   745.763.364    739.682.616    0.82             19:28
  0.95   810.004.501    799.165.736    1.36             1:05:55
  0.9    872.870.493    858.427.029    1.68             3:23:52

In table 1 the results for pair filtering utilizing the split-at-blank strategy with various similarity thresholds are depicted. With a decreasing similarity threshold we can see that the number of pairs to test increases; yet the amount of duplicates is initially small and does not grow jointly with the number of pairs to be tested for possible duplicates.

When applying the merge-at-blank strategy, the number of pairs to test is obviously dramatically lower than with the split-at-blank strategy. Consequently, the required runtimes for pair filtering are significantly lower than the corresponding runtimes for the split-at-blank strategy, due to the lower pair count. With regards to the amount of duplicates detected by the pair filtering, the results for the merge-at-blank strategy in table 2 are similarly underwhelming: in this case, pair filtering yields less than one percent reduction of the total amount of pairs.

Table 2: pair filtering results for merge-strategy
  sim    # before     # after      duplicates (%)   runtime (mm:ss)
  1.0    1.873.697    1.867.733    0.32             00:03
  0.99   2.013.566    2.004.183    0.47             00:04
  0.95   3.607.298    3.574.736    0.91             00:11
  0.9    9.185.447    9.103.081    0.90             01:26

Obviously the chosen attributes, the artist names of the selected datasets, have a high entropy; thus the effect of a post-processing that applies a pair filtering step to eliminate duplicate pairs is small, as the effort spent for the post-processing bears no proportion to the savings. Especially for the split-at-blank strategy the increasing runtime with a lowered similarity threshold becomes disproportionate.

4. CONCLUSIONS AND FUTURE WORK

In this paper we have presented a new blocking method as part of a process that allows the computation of starting points for the initialisation of a graph matching process. We took steps to adapt the standard blocking technique to datasets that have few attributes suitable for a matching. We also addressed the disadvantages of standard blocking by introducing stop word lists and regular expressions to reduce the size and amount of overly large blocks.

Kroll [5] has obtained very satisfactory quality results on the overall graph matching starting with our blocking method. Nevertheless, we have to thoroughly evaluate several key aspects of our technique. Foremost, we have to evaluate the quality of our approach with regards to accuracy, especially in comparison to competing techniques. Additional steps have to be taken for the evaluation of two
For measurement we counted core elements of our block matching step, the effect of the used similarity function and the optimal size of the sliding window. Concerning the result quality we want to explore our blocking technique with alternatives to the Jaro-Winkler distance and varying sizes of the sliding window and their ramifications on the quality. 5. REFERENCES [1] P. Christen. Data Matching - Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Data-Centric Systems and Applications. Springer, 2012. [2] P. Christen. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng., 24(9):1537–1555, 2012. [3] I. P. Fellegi and A. B. Sunter. A theory for record linkage. American Statistical Association, 64(328):1183–1210, 1969. [4] M. A. Hernández and S. J. Stolfo. The merge/purge problem for large databases. In M. J. Carey and D. A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995., pages 127–138. ACM Press, 1995. [5] H. Kroll. Relationale Ähnlichkeit im Matching-Prozess für Graphdatenbanken. Master’s thesis, Leibniz University Hannover, 2017. [6] A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, and I. Parsa, editors, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, Boston, MA, USA, August 20-23, 2000, pages 169–178. ACM, 2000. [7] G. Papadakis. Blocking techniques for efficient entity resolution over large, highly heterogeneous information spaces. PhD thesis, Leibniz University Hannover, 2013. [8] G. Papadakis and W. Nejdl. Efficient entity resolution methods for heterogeneous information spaces. In S. Abiteboul, K. Böhm, C. Koch, and K. 
Tan, editors, Workshops Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 304–307. IEEE Computer Society, 2011. [9] J. Zobel and A. Moffat. Inverted files for text search engines. ACM Comput. Surv., 38(2):6, 2006.