Automatic Interlinking of Music Datasets on the Semantic Web Yves Raimond, Christopher Sutton and Mark Sandler Centre for Digital Music Queen Mary, University of London {yves.raimond,chris.sutton,mark.sandler}@elec.qmul.ac.uk ABSTRACT in a Creative Commons label is linked to its location in the In this paper, we describe current efforts towards interlink- Geonames dataset, and to the corresponding resource in an ing music-related datasets on the Web. We first explain editorial database1 : some initial interlinking experiences, and the poor results obtained by taking a naı̈ve approach. We then detail a par- ticular interlinking algorithm, taking into account both the foaf:based_near ; similarities of web resources and of their neighbours. We owl:sameAs . link a Creative Commons music dataset to an editorial one, and to link a personal music collection to corresponding web Then, you may access the actual audio content from the Cre- identifiers. The latter provides a user with personally mean- ative Commons label, some extra information such as the ingful entry points for exploring the web of data, and we birth dates of the members of this band from the editorial conclude by describing some concrete tools built to generate dataset, and detailed geographic information (latitude, lon- and use such links. gitude, hierarchy of geographical features) from the Geon- ames dataset. Categories and Subject Descriptors H.4 [Information Systems Applications]: Miscellaneous For small datasets published manually (such as an individ- ual’s FOAF file), it is possible to create such links manu- General Terms ally. However, doing so for large datasets is impractical: we Linked Data need a way to automatically detect the overlapping parts of heterogeneous datasets. In this paper, we detail a few algo- Keywords rithms that have been developed, implemented and practi- Semantic-Web,Linked Data,Music cally deployed to interlink different music-related datasets. We mainly focus on the most sophisticated one, applicable in a Linked Data context, and taking into account not only 1. INTRODUCTION the similarities of single resources but also the similarities The Linking Open Data community project [3] aims at pub- of their neighbours. We evaluate how this algorithm per- lishing and interlinking open datasets by following simple forms when applied to link a real-world Creative Commons rules [2] for linking data. The publication step can to some dataset to an editorial one. We also show how a personal extent be automated, using tools such as D2R or Open- music collection can be treated as one such dataset, enabling Link Virtuoso (which both allow relational databases to be a user to benefit from the growing body of knowledge on the published as linked data) or P2R (which allows SWI-Prolog Semantic Web in a personally meaningful way. knowledge bases to be published in the same way). All these tools handle declarative mappings from a given data struc- We define the mapping problem as follows. We con- ture to corresponding web resources and associated RDF sider two RDF datasets D1 and D2 , respectively describ- descriptions. Once this publication is achieved, we still have ing a number of web resources ri and si . We consider to create links to other datasets, in order for a user agent the problem of matching resources—finding resources sy to navigate from one to another. A typical example of such in our target dataset, D2 , which identify the same ob- interlinking would be the following one, where a music band ject as a resource rx in our seed dataset, D1 . For ex- ample, we want to find sy = http://zitgist.com/music/artist/ 0781a3f3-645c-45d1-a84f-76b4e4decf6d for rx = http://dbtune.org/ jamendo/artist/5. All mapping problems in our context can be reduced to this one, even literal expansion where a resource rx is linked to a literal l and we are looking for a resource sy corresponding to l—for example, expanding “Moselle, France” into http://sws.geonames.org/2991627/. In these cases, 1 LDOW’08, April 28, 2008, Beijing, China We use the Turtle notation throughout the paper, with the namespaces defined in § 8 a simple transformation (creating a blank node between rx and l) is enough to return ourselves to a resource matching problem. 2. NAIVE INTERLINKING In this section, we describe some naı̈ve approaches to tack- ling this resource matching problem, and identify their fail- ings. 2.1 Simple literal lookups Most datasets provide a literal search facility, either through a dedicated Web service (eg. Geonames, Musicbrainz), or through a SPARQL end-point on which we can use filters on literals, or in some cases built-in literal matching functional- Figure 1: Example datasets—we here try to match ele- ity. So one solution to link resources from our two datasets ments from A and B would be to first issue the following query on D1 : UNION {?r p:wikiPageUsesTemplate SELECT ?l } WHERE { ?p ?l } FILTER (isLiteral(?l)) } } and use the bindings of ?l to issue a literal search on D2 . We can then try to map the resulting resources to r. This approach was used to link the musical works and the artists in the BBC John Peel sessions dataset to correspond- We used this kind of approach to link the Jamendo dataset ing resources in DBpedia. Constraints on the target re- to the Geonames one2 . Jamendo provides information about sources were manually defined, and queries taking into ac- the location of artists, in the form of a literal string (such as count the seed literal and these constraints were issued to “Moselle, France”). We use this to query the Geonames web the DBpedia SPARQL end-point. service, and get back the corresponding Geonames resource. In practice, we found that the literal strings provided by However, even with restrictions on the nature of target re- Jamendo are specific enough for this approach to work well. sources, a literal may not be discriminating enough. For ex- In the two instances when more than one candidate was ample, the resource http://dbtune.org/jamendo/artist/5 is linked returned for a location string, no link was created. to the literal “Both”. Searching the Musicbrainz dataset for “Both” while restricting ourselves to artists gives us two re- 2.2 Extended literal lookups sults. Likewise, when looking for the Nirvana version of the Although this simple approach can be suitable for interlink- song “Love Buzz” in the DBpedia dataset, we have to dis- ing particular datasets in cases where a literal string reliably ambiguate between the original song and the Nirvana cover. provides sufficient disambiguation it is unlikely to discrim- There is therefore a need for a more sophisticated algorithm, inate suitably in most cases. For example, when trying to capable of handling such disambiguation. apply it to link musical works in the BBC John Peel Sessions to resources in the DBpedia dataset [1], we come across the 3. GRAPH MATCHING literal “Violet”3 . The actual song which the algorithm should An intuitive approach to disambiguate two artists with the link to is just one of the sixteen results of the corresponding same name would be to check the titles of their releases, literal lookup. and see if they match the titles of the releases in our seed dataset. If by any chance they have releases with the same One solution is to add constraints on the resulting resources. title, we can check their track titles, and disambiguate using For example, by using the DBpedia links to Yago [10], we them, and so on. In this section, we develop this idea and can restrict our resources to be of a specific Yago type, or we give a formal specification of such an algorithm. can restrict it to be linked to a particular infobox URI (spec- ifying a template for structured data on the corresponding Wikipedia page). This leads us to the following SPARQL 3.1 Offline graph matching query: Consider the two datasets illustrated in fig. 1, with our seed dataset on the left containing a single artist with the name “Both”, and the target dataset on the right containing two artists named “Both”. We model the two datasets as graphs, PREFIX p: SELECT ?r with each edge represented as a triple (s, p, o). WHERE { ?r ?p "Violet"@en. As a first step, we compute initial similarity values between { all pairs of resources (s1 , s2 ) and (o1 , o2 ) such that (s1 , p, o1 ) ∈ {?r a } D1 and (s2 , p, o2 ) ∈ D2 . Such similarity values might be 2 All links to mentioned datasets are available in § 7 calculated by a string similarity algorithm (such as the ones 3 For the resource http://dbtune.org/bbc/peel/work/1498 described in section 2 of [12]) comparing literals directly at- issue a query (through a SPARQL end-point or custom web Table 1: Initial similarity map service) to D2 which involves l and constraints over what we Resource 1 Resource 2 Initial Similarity are looking for. This gives us a list of resources. 1 4 1 1 7 1 For each sk in this list, we access Gsk . Now, for all pos- 2 5 0.9 sible graph mappings MGr :Gsk ,i , we compute a measure as 3 6 0.8 defined in § 3.1. If we can make a clear decision now (ie. 2 6 0.1 there is just one measure above our decision threshold), we 3 5 0.1 terminate and choose the corresponding graph mapping. If 2 8 0.1 not, we look for object properties p such that (r, p, o) ∈ Gr 3 8 0.1 and (sk , p, o0 ) ∈ Gsk , and we obtain Go and G0o . 4 Then, we update our possible graph mappings and the associated measures. We iterate this process until we can make a deci- Table 2: Possible graph mappings—(x, y) ≡ {x sion (we have one unique mapping with measure above the owl:sameAs y}, and associated measures threshold), or until we can’t go any further (no unexplored Graphs Mapping Measures object properties). Practically, we also limit the maximum G1 G2 MG1:G2 a = {(1, 4), (2, 5), (3, 6)} 0.9 number of iterations the algorithm may perform. G1 G2 MG1:G2 b = {(1, 4), (2, 6), (3, 5)} 0.4 G1 G3 MG1:G3 a = {(1, 7), (2, 8)} 0.55 In our example, we first dereference http://dbtune.org/ G1 G3 MG1:G3 b = {(1, 7), (3, 8)} 0.55 jamendo/artist/5. We get access to the following facts: our URI is identifying a musical band, it is called “Both”, and it made 5 two things: http://dbtune.org/jamendo/record/174 tached to these resources. In our example, this produces the and http://dbtune.org/jamendo/record/33. We now look for results in table 1. an artist named “Both” in the Musicbrainz dataset, through the Musicbrainz web service6 . This gives Next, we construct a graph similarity measure. In our ex- us back two URIs: http://zitgist.com/music/artist/ ample, we consider the possible graph mappings in table 2. 5f9f2dfb-76f0-4872-ad7d-f9d84a908cb5 and http://zitgist.com/ Then, we associate a measure with such mappings: we sum music/artist/0781a3f3-645c-45d1-a84f-76b4e4decf6d. We derefer- the similarity values s associated with each pair (x, y) and ence them: the first one identifies an artist named “Both” we normalise it by the number of pairs in the mapping. In which made two things, and the second one also an artist our example, the resulting measures are in table 2. named “Both” which made one thing. Finally, we choose the mapping whose similarity measure is We now consider two possible graph mappings (correspond- the highest, optionally thresholding to avoid making map- ing to the two potential matches of our artist resource), with pings between graphs which are too dissimilar. In our ex- two measures, both equal to 1. We continue looking for fur- ample, we choose MG1:G2 a. ther clues, as there is not yet any way to disambiguate be- tween the two. We take the object property occurring in our 3.2 Linked Data context three graphs, f oaf :made, and dereference all the objects of this property that we currently know about. Our starting re- Now, we apply this algorithm in a Linked Data [2] context, source made two records, named “Simple Exercice” and “En where we discover our graphs as we go: the main idea being attendant d’aller sur Mars”. The first matching artist in the that we update the graph mappings and their measures as Musicbrainz dataset made two records, named “Simple ex- we update our local Semantic Web cache. This is necessary ercice” and “En attendant d’aller sur Mars...”. The second because in general the size of the datasets D1 and D2 will matching artist made one record, named “The Inevitable be prohibitive; if not for loading both datasets, then for Phyllis”. We now update the possible graph mappings, and computing all the possible graph mappings between them. reach the results in table 2. Now, we have one mapping identifiably better than the others, with graph similarity Our starting point is now a single URI r in a dataset D1 . measure 0.9. We choose it, and hence derive the following We try to find the corresponding s in a dataset D2 , as well statements: as mappings of resources in the neighbourhood of r. We illustrate this using the same example as the one in § 3.1: we want to map the URI http://dbtune.org/jamendo/artist/5 to the corresponding Musicbrainz URI. As part of the same owl:sameAs . for this artist. owl:sameAs . In the following, we will use Named Graphs [5], in order to owl:sameAs track the provenance of a particular graph, and we let Gx . 4 The first thing we do is to retrieve Gr , and extract a suitable We could additionally consider triples of the form (o, p, r) label l for r (using dc:title or f oaf :name properties, etc). and (o0 , p, sk ). 5 Now, we need to access some potential candidates for s. To Captured through the f oaf :made predicate 6 do that, we use the same approach as described in § 2.2. We See http://wiki.musicbrainz.org/XMLWebService Having chosen a mapping we could go further, to also derive 0 Ok,r,p is the list of all o such that (r0 , p, o) ∈ Gr0 such statements for the tracks in these two albums. An- 0 S Foreach Objmapk,i ∈ r combinations(Ok,r,p , Ok,r,p ): other possible extension of this algorithm is to perform lit- simk,i = (measurek + measure(Objmapk,i ))/(|Mk | + eral lookups in D2 at each step (therefore providing new |Objmapk,i |) possible graph mappings each time). This helps ensure we If no simk,i , fail still find the correct mapping in the case that our initial If simk,i > threshold for exactly one {k, i} pair, return append(Mk , Objmapk,i ) literal lookup does not include the correct resource among Else, N ewM appings is the list of all append(Mk , Objmapk,i ), its results. For example, the correct target artist might be and return propagate(N ewM appings) (if the maximum number listed as having a different name in D2 as in D1 , such that of recursions is not reached, otherwise fail) they do not feature in the results of our initial literal lookup. However, we might have some clues about who this artist is Now, we apply this pseudocode to our earlier “Both” from the names of the albums they produced, and so per- example (r = 1 = http://dbtune.org/jamendo/artist/5), with a threshold of 0.8. This makes us go through the following forming additional literal lookups (on the album titles) may steps: allow us to find the correct artist and hence the correct map- ping. Such a variant of this algorithm is implemented in the lookup(r) = {4, 7} GNAT software described in § 4.2. M1 = {(1, 4)}, sim1 = 1 M2 = {(1, 7)}, sim2 = 1 M appings = {{(1, 4)}, {(1, 7)}} 3.3 Algorithm definition The algorithm described above can be expressed in the propagate({{(1, 4)}, {(1, 7)}}) following pseudo-code. We assume the existence of a k = 1, p = f oaf :made 0 O1,1,f oaf :made = {2, 3}, O1,4,f function string similarity(x, y) and define the following oaf :made = {5, 6} Objmap1,1 = {(2, 5), (3, 6)}, Objmap1,2 = {(2, 6), (3, 5)} additional functions: sim1,1 = 0.9, sim1,2 = 0.4 k = 2, p = f oaf :made function similarity(x, y) : 0 O2,1,f oaf :made = {2, 3}, O2,7,f oaf :made = {8} Extract a suitable label lx for x in Gx Objmap2,1 = {(2, 8)}, Objmap2,2 = {(3, 8)} Extract a suitable label ly for y in Gy sim2,1 = 0.55, sim2,2 = 0.55 Return string similarity(lx , ly ) Now, sim1,1 is the only simk,i above the threshold, we therefore choose {(1, 4), (2, 5), (3, 6)} as our mapping, which corresponds to the RDF code in § 3.2. function lookup(x) : Extract a suitable label lx for x in Gx Of course, several heuristics could be added to this pseudo-code, Perform a search for lx on D2 in order to improve the scalability of the algorithm. In practice, Return the set of resources retrieved from the search we associate weights to properties, in order to start from the most informative one (f oaf :made, for example). function measure(M ) : Foreach (ri , rj ) ∈ M 4. EXPERIMENTS simi,j = similarity(ri , rj ) In this section, we detail two experiments using this algo- P Return i,j simi,j rithm, and their respective evaluations. The first one deals with the automatic interlinking of two online music datasets. function combinations(O1 , O2 ) : The second one deals with the linking of a personal music Return all possible combinations of collection towards corresponding web identifiers. elements of O1 and elements of O2 e.g. combinations({1, 2}, {3, 4}) = {{(1, 3), (2, 4)}, {(1, 4), (2, 3)}} 4.1 Linking two overlapping web datasets In this section, we focus on a concrete interlinking which has Our starting point is a URI r in D1 and a decision been achieved using this algorithm, between two overlapping threshold threshold. Our mapping pseudo-code is then web datasets: Jamendo and Musicbrainz. We implemented defined as: this algorithm7 in SWI-Prolog [11], with only one lookup on the Musicbrainz end-point, at the artist level. Our algo- rithm derived 10944 similarity statements for artist, record, and track resources so far, which allows us to get detailed ed- Foreach sk ∈ lookup(r) itorial information from Musicbrainz, and the actual audio Mk = {(r, sk )} content, as well as tags, from the Jamendo dataset. measurek = measure(Mk ) simk = measurek /|Mk | We focus our evaluation on artist resources. As we perform If simk > threshold for exactly one k, return Mk only one lookup, at the artist level, no tracks or records Else, M appings is the list of all Mk , and return propagate(M appings) can be matched if the artist is not. In order to evaluate the quality of the interlinking, we take a random sample function propagate(M appings) : from the Jamendo dataset: we first collect every single artist Foreach Mk ∈ M appings: URI8 , and we randomly select 60 from among them. Then, measurek = measure(Mk ) 7 Foreach p s.t. (∃(r, r0 ) ∈ Mk , ∃(r, p, o) ∈ Gr , The source code of all the software mentioned ∃(r0 , p, o0 ) ∈ Gr0 and ∀M ap ∈ M appings, (o, o0 ) ∈ / M ap): is available as part of the motools project: Foreach (r, r0 ) ∈ Mk : http://sourceforge.net/projects/motools 8 Ok,r,p is the list of all o such that (r, p, o) ∈ Gr The results of such a SPARQL query are available at Table 3: Evaluation of the Jamendo/Musicbrainz interlinking Link derived Link not derived Correct 5 53 Incorrect 0 2 we run our mapping algorithm and manually check whether the mappings are correct. Each tested resource therefore falls into one of the following categories: • An owl:sameAs link is derived : correct (same artist in the Jamendo and in the Musicbrainz datasets) ; • An owl:sameAs link is derived : incorrect (different Figure 2: Constructing a graph from local music meta- artists) ; data • No link is derived : correct (there is a corresponding artist in the Musicbrainz dataset) ; statements making the links between local audio files and the remote manifestation identifiers. The fingerprinting func- • No link is derived : incorrect (no corresponding artists tionality can be useful when the metadata available is par- in the Musicbrainz dataset). ticularly poor, but since it is highly dependent on the fin- gerprinting service chosen, we concern ourselves here solely with the metadata-based approach. The results9 , in terms of how many resources fall within each of the above defined categories, are shown in table 3. In our All modern audio encodings allow for the inclusion of meta- test dataset, the disambiguation was needed in 16 cases. data “tags” alongside audio data in a file, such that each For example, one of the artist resources was named “Hair”, audio file can include the kind of editorial information con- which matches four resources within Musicbrainz, none of tained in the data sets described above. We can therefore them being the same band. The first case that failed is due consider a reasonably well tagged personal music collection to an implementation mistake, failing to normalise the graph to be just another music data set, apply the algorithm de- similarity measures correctly when the target graph is bigger scribed in § 3, and hence link each local audio file to a cor- than the seed one (in this case, the artist had two releases responding resource on the Semantic Web. GNAT uses the on Musicbrainz, and just one on Jamendo). The second case variant discussed at the end of § 3.2 which maps artists, al- that failed is due to the fact that the Musicbrainz RDF is bums and tracks, performing literal lookups at each stage. outdated (the artist does not exist in the RDF dump, but For each local audio file, a simple seed graph is constructed does exist in the Musicbrainz database). based on the artist, album, title and track number specified in the file’s ID3 tag (see fig. 2 for an example). It proceeds 4.2 Linking personal music collections as set out in § 3.3 until a single best mapping is found. After Personal music collections can also be a part of the web processing a directory of files in this way, GNAT outputs an of data. The Music Ontology [9] makes the same distinc- RDF file providing mo:available_as links from URIs in the tion as FRBR between manifestations (all physical objects Zitgist publication of MusicBrainz data to the local audio that bear the same characteristics, eg. a particular album) files. For example : and items (a concrete entity, eg. my copy of the album on CD). A manifestation and a corresponding item are linked through a predicate mo:available_as. Therefore, given a mo:available_as set of audio files in a personal music collection, it is possible to keep track of the set of statements linking this collection to identifiers elsewhere in the Semantic Web which denote the corresponding manifestations. These statements provide Using this algorithm rather than one of the more naı̈ve ap- a set of entry points to the Semantic Web, allowing access proaches should allow GNAT to be robust to various inac- to information such as the birth date of the artists responsi- curacies in the local files’ metadata. We evaluated GNAT’s ble for items in the collection, geographical locations of the behaviour in the face of such problems by taking a correctly- recordings, etc. tagged MP3 file of the Beatles track “I want to hold your hand” and artificially introducing various mistakes. The Mu- GNAT is an implementation of automatic linking from a sicBrainz dataset lists no less than 25 releases for this track personal audio collection to the Musicbrainz dataset—it uses by The Beatles, and dozens of artists with songs of the same audio fingerprinting and available metadata to find corre- name. The correct set of metadata is shown in table 4 and sponding dereferencable identifiers, and then outputs RDF results are shown in table 5. http://dbtune.org:2105/sparql/?query=select%20distinct%20%3Fa %20where%20%7B%3Fa%20a%20mo%3AMusicArtist%7D We can see that GNAT performs well in the face of inaccu- 9 rate metadata. The release chosen when the album field is The detailed results, resource by resource, are available at http://moustaki.org/resources/results.txt missing or set to a random string is arguably correct—it is Table 5: Evaluation of GNAT’s linkage on particular cases Change made Resulting mapping Artist field missing Correct Artist set to random string Correct Artist set to “Beetles” Correct Artist set to “Al Green” Mapped to Al Green’s cover version Album field missing Mapped to correct track on the release Album set to random string “The Capitol Albums, Volume 1 (disc 1: Album set to “Meat the Beatles” Meet the Beatles!)” Track set to random string Correct Track set to “I Wanna Hold Your Hand” Correct can prioritise links which use properties from the Music On- Table 4: Base (correct) metadata for GNAT evalu- tology (or other specified namespaces). This helps ensure ation that relevant information is prioritised over less obviously- Artist “The Beatles” useful information. Album “Meet The Beatles!” Title “I Want To Hold Your Hand” Information relating to resources of interest may also be re- Track Num 1 trieved based on specific rules. For example, if we know that the SBSimilarity service provides “similar track” infor- mation about tracks by appending their Musicbrainz ID to the same CD, released as part of a box set. One would hope a given prefix10 , we can specify a rule in GNARQL for gen- that the release “Meet the Beatles!” would be chosen in the erating rdfs:seeAlso links from known tracks to the corre- case of the album being misspelled. sponding documents in the SBSimilarity namespace. These rule-derived links will then be followed as part of the crawl- With a practical implementation some trade-offs must be ing strategy, and their information added to the local RDF made. In the case of setting the artist to “Al Green”, we store. have two conflicting pieces of information (artist and album) and our implementation here chooses the mapping for which Naturally, the information held in GNARQL’s store need the artist matches. A more sophisticated version of GNAT not come solely from GNAT. In fact, any RDF data in the might consider other tracks in the current directory to es- designated directory tree will be loaded. In our research tablish that the track most likely comes from the Beatles group, this means that Chord Ontology11 transcriptions are release rather than Al Green’s release. held alongside the MusicBrainz data retrieved from Zitgist and social tag data retrieved from various websites. Such links from a user’s own files to information on the Se- mantic Web could be a significant step towards making data To enable applications to take advantage of this aggregated on the Web available and relevant to people, but a user agent data, GNARQL provides a SPARQL endpoint. This frees must act on such links before they are directly useful to a end-user applications from the need to themselves maintain human. In the next section we describe a companion tool to a database relating to the user’s music collection, and al- GNAT, designed to exploit the links GNAT produces. lows multiple applications to benefit from a single store of aggregated information. 4.3 Use-cases Based on datasets available today, some example queries For the links between a user’s files and Semantic Web re- such user interfaces might pass on to GNARQL include sources to be useful, an application must have some infor- “Find tracks which are performances of works by Russian mation about the resources. The GNARQL program in the composers around the turn of the twentieth century”, “Find motools project is beginning to explore some of the pos- me cover versions of rock songs in non-rock genres” or po- sibilities in this direction. The program loads in all the tentially (by using information linked from a user’s FOAF owl:sameAs links produced by GNAT, dereferences the cor- file) “Find me gigs by the artists I play frequently which fit responding URIs, and then aggregates additional informa- with my vacation schedule”. tion about those resources. To experiment with browsing the data aggregated by The basic mechanism for aggregating additional information GNARQL, we have developed a prototype user interface, is to crawl outwards from the given resource, dereferencing based heavily on the /facet program described in [6]. This linked resources and adding their descriptions to the local provides a web browser-based interface for exploring the RDF store. In the simplest case, this crawling can be un- aggregated information, and performing simple facet-based guided, simply following all links regardless of the properties used, and performing a breadth-first traversal of the Seman- 10 eg. http://isophonics.net/music/signal/280b7fae-724e-4a6d-8e91- tic Web from all known resources. 6fe3f0a2bdad provides extra information about http://zitgist.com/music/signal/280b7fae-724e-4a6d-8e91-6fe3f0a2 More sophisticated crawling strategies may lead to better bdad 11 aggregation for a user’s purposes. For example, GNARQL See http://purl.org/ontology/chord/ queries. The map functionality allows the results of such young, more work is required to fully exploit the function- queries to be plotted geographically, as shown in fig. 3. ality GNARQL is beginning to exhibit. 5. FUTURE WORK 6. CONCLUSION 5.1 Work on the interlinking algorithm In this paper, we described several different methods for in- The algorithm proposed here has several limitations. Firstly, terlinking Semantic Web datasets. We mentioned two naı̈ve it doesn’t specify any heuristics to use if the ontologies differ. approaches, leading to the construction of a more elaborate In this case,we would need a different methodology, perhaps algorithm, which takes into account not only the similarity inspired by the approach described in [8]. Secondly, the al- of the resources themselves but also the similarity of their gorithm is designed for the case where a meaningful similar- neighbours. Its main advantages are to provide a best-effort ity measure between pairs of individual resources is available mapping, without any need for a learning step (for which (here, using string similarity of labels attached to them with we would have to manually interlink some resources), and to certain predicates). In a linking scenario where there is no work in a linked data environment, where new resources are particularly good similarity measure on individual resources discovered as we get through the mapping process. We de- and the graph structure is therefore the most salient factor scribed two implementations of this algorithm. The first one for correct linking, another algorithm may be more appropri- was used to interlink artists, records, and tracks in two on- ate. In this case, Melnik’s “similarity flooding” [7] approach line music datasets: Jamendo and Musicbrainz, and the sec- could be used, which prioritises graph structure rather than ond one allows any user to link their personal music collec- node similarity, relying on a post-processing stage to filter tion to corresponding identifiers in the Musicbrainz dataset. out unsuitable mappings. We evaluated these two implementations separately. Based on these observations, further work developing the Creating links between heterogeneous datasets can dramat- interlinking algorithm could provide a framework for inter- ically enhance the usefulness of each. Using such links, a linking a wider variety of datasets. Semantic Web user agent can jump from a band within the Jamendo dataset to the corresponding resource in the Mu- 5.2 Work on implementations sicbrainz one, to the corresponding resource in DBpedia, Currently, the GNAT tool implements two distinct ap- to its approximate geographic location, to the famous com- proaches to finding manifestation URIs for local audio files. posers born in that city, etc. However, if it is possible to One uses just the available metadata, and this approach is define such links manually for small datasets, it is impossi- described in § 4.2, above. Since audio metadata in personal ble for large ones. We need methodologies to discover them collections is frequently incomplete, inaccurate, or missing in an automated way. The techniques presented here are entirely, this may not be sufficient. The other approach far from perfect, but represent some initial efforts in this therefore exploits audio fingerprinting [4] to try to identify direction. the track, and then if there is remaining ambiguity the local metadata is used to choose a single URI. 7. DATASETS The following datasets are mentioned throughout the paper: Ideally, we could use the fingerprint of an audio file as just another piece of information about the track, incorporat- Jamendo on DBTune: http://dbtune.org/jamendo/ ing it into our graph mapping approach. In practice, the BBC John Peel sessions: http://dbtune.org/bbc/peel/ main fingerprinting service available with a large support- SBSimilarity: http://www.isophonics.net/SBSimilarity ing database, MusicIP’s MusicDNS service12 , is relatively Musicbrainz RDF: http://zitgist.com/music/ opaque. Fingerprinting a track either returns a PUID, which DBpedia: http://dbpedia.org/ can be used to perform a search on MusicBrainz, or returns Geonames: http://geonames.org/ no results. It therefore provides only a boolean test for sim- ilarity, and some hidden decisions have been made by the MusicDNS service using any available metadata. As a re- sult, there is no obvious way to uniformly combine finger- 8. NAMESPACES print information and local metadata in a graph mapping We use the following namespaces throughout our RDF ex- approach. amples: A fingerprinting service which exposed the server-side database and the actual process of matching a fingerprint @prefix mo: . to a database entry could allow for some more sophisticated @prefix foaf: . linkage between personal audio collections and the Semantic @prefix owl: . Web. We have some hopes that Last.fm’s recently launched fingerprinting service might gather a large database and make it freely available in a flexible way. 9. ACKNOWLEDGEMENTS The authors acknowledge the support of both the Cen- tre For Digital Music and the Department of Computer Although the core of the GNARQL tool is in place, more Science at Queen Mary University of London for the stu- work is required to explore different approaches to crawl- dentship for Yves Raimond. This work has been partially ing. Also, since Semantic Web user interfaces are relatively supported by the EPSRC-funded ICT project OMRAS-2 12 See http://www.musicip.com/dns/ (EP/E017614/1). Figure 3: Navigating a personal audio collection using aggregated Semantic Web data 10. REFERENCES [9] Yves Raimond, Samer Abdallah, Mark Sandler, and [1] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, Frederick Giasson. The music ontology. In Proceedings R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a of the International Conference on Music Information web of open data. In Proceedings of the International Retrieval, pages 417–422, September 2007. Available Semantic Web Conference, Busan, Korea, November at http://ismir2007.ismir.net/proceedings/ 11-15 2007. ISMIR2007_p417_raimond.pdf. Last accessed January [2] Tim Berners-Lee. Linked data. World wide web design 2008. issues, July 2006. Available at [10] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard http://www.w3.org/DesignIssues/LinkedData.html. Weikum. Yago - a core of semantic knowledge. In 16th Last accessed September 2007. international World Wide Web conference, 2007. [3] Chris Bizer, Tom Heath, Danny Ayers, and Yves Available at http://www.mpi-inf.mpg.de/ Raimond. Interlinking open data on the web. In ~suchanek/publications/www2007.pdf. Last accessed Demonstrations Track, 4th European Semantic Web january 2008. Conference, Innsbruck, Austria, 2007. Available at [11] Jan Wielemaker, Zhisheng Huang, and Lourens http://www.eswc2007.org/pdf/demo-pdf/ Van Der Meij. SWI-Prolog and the web. Theory and LinkingOpenData.pdf. Last accessed September 2007. Practice of Logic Programming, 2003. Available at [4] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A http://hcs.science.uva.nl/projects/SWI-Prolog/ review of audio fingerprinting. The Journal of VLSI articles/TPLP-plweb.pdf. Last accessed September Signal Processing, 41(3):271 – 284, 2005. 2007. [5] Jeremy J. Carroll, Christian Bizer, Pat Hayes, and [12] W. Winkler. Advanced methods for record linkage. Patrick Stickler. Named graphs. Journal of Web Technical report, Statistical Research Division, Semantics, 2005. Washington, DC: U.S. Bureau of the Census., 1994. [6] Michiel Hildebrand, Jacco van Ossenbruggen, and Available at Lynda Hardman. The Semantic Web - ISWC 2006, http://citeseer.ist.psu.edu/254560.html. Last volume 4273/2006 of Lecture Notes in Computer accessed January 2008. Science, chapter /facet: A Browser for Heterogeneous Semantic Web Repositories, pages 272–285. Springer Berlin / Heidelberg, 2006. [7] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity flooding: a versatile graph matching algorithm and itsapplication to schema matching. In Proceedings of the 18th International Conference on Data Engineering, pages 117–128, San Jose, CA, USA, February-March 2002. [8] M. Neiling. Data fusion with record linkage. In I. Schmitt, C. Turker, E. Hildebrandt, and M. Hoding, editors, Proceedings of the Workshop ’Foederierte Datenbanken’, Aachen, 1998. Available at http://citeseer.ist.psu.edu/189652.html. Last accessed January 2008.