=Paper=
{{Paper
|id=Vol-369/paper-19
|storemode=property
|title=Automatic Interlinking of Music Datasets on the Semantic Web
|pdfUrl=https://ceur-ws.org/Vol-369/paper18.pdf
|volume=Vol-369
|dblpUrl=https://dblp.org/rec/conf/www/RaimondSS08
}}
==Automatic Interlinking of Music Datasets on the Semantic Web==
Automatic Interlinking of Music Datasets
on the Semantic Web
Yves Raimond, Christopher Sutton and Mark Sandler
Centre for Digital Music
Queen Mary, University of London
{yves.raimond,chris.sutton,mark.sandler}@elec.qmul.ac.uk
ABSTRACT in a Creative Commons label is linked to its location in the
In this paper, we describe current efforts towards interlink- Geonames dataset, and to the corresponding resource in an
ing music-related datasets on the Web. We first explain editorial database1 :
some initial interlinking experiences, and the poor results
obtained by taking a naı̈ve approach. We then detail a par-
ticular interlinking algorithm, taking into account both the foaf:based_near ;
similarities of web resources and of their neighbours. We owl:sameAs .
link a Creative Commons music dataset to an editorial one,
and to link a personal music collection to corresponding web
Then, you may access the actual audio content from the Cre-
identifiers. The latter provides a user with personally mean-
ative Commons label, some extra information such as the
ingful entry points for exploring the web of data, and we
birth dates of the members of this band from the editorial
conclude by describing some concrete tools built to generate
dataset, and detailed geographic information (latitude, lon-
and use such links.
gitude, hierarchy of geographical features) from the Geon-
ames dataset.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous For small datasets published manually (such as an individ-
ual’s FOAF file), it is possible to create such links manu-
General Terms ally. However, doing so for large datasets is impractical: we
Linked Data need a way to automatically detect the overlapping parts of
heterogeneous datasets. In this paper, we detail a few algo-
Keywords rithms that have been developed, implemented and practi-
Semantic-Web,Linked Data,Music cally deployed to interlink different music-related datasets.
We mainly focus on the most sophisticated one, applicable
in a Linked Data context, and taking into account not only
1. INTRODUCTION the similarities of single resources but also the similarities
The Linking Open Data community project [3] aims at pub- of their neighbours. We evaluate how this algorithm per-
lishing and interlinking open datasets by following simple forms when applied to link a real-world Creative Commons
rules [2] for linking data. The publication step can to some dataset to an editorial one. We also show how a personal
extent be automated, using tools such as D2R or Open- music collection can be treated as one such dataset, enabling
Link Virtuoso (which both allow relational databases to be a user to benefit from the growing body of knowledge on the
published as linked data) or P2R (which allows SWI-Prolog Semantic Web in a personally meaningful way.
knowledge bases to be published in the same way). All these
tools handle declarative mappings from a given data struc- We define the mapping problem as follows. We con-
ture to corresponding web resources and associated RDF sider two RDF datasets D1 and D2 , respectively describ-
descriptions. Once this publication is achieved, we still have ing a number of web resources ri and si . We consider
to create links to other datasets, in order for a user agent the problem of matching resources—finding resources sy
to navigate from one to another. A typical example of such in our target dataset, D2 , which identify the same ob-
interlinking would be the following one, where a music band ject as a resource rx in our seed dataset, D1 . For ex-
ample, we want to find sy = http://zitgist.com/music/artist/
0781a3f3-645c-45d1-a84f-76b4e4decf6d for rx = http://dbtune.org/
jamendo/artist/5. All mapping problems in our context can be
reduced to this one, even literal expansion where a resource
rx is linked to a literal l and we are looking for a resource
sy corresponding to l—for example, expanding “Moselle,
France” into http://sws.geonames.org/2991627/. In these cases,
1
LDOW’08, April 28, 2008, Beijing, China We use the Turtle notation throughout the paper, with the
namespaces defined in § 8
a simple transformation (creating a blank node between rx
and l) is enough to return ourselves to a resource matching
problem.
2. NAIVE INTERLINKING
In this section, we describe some naı̈ve approaches to tack-
ling this resource matching problem, and identify their fail-
ings.
2.1 Simple literal lookups
Most datasets provide a literal search facility, either through
a dedicated Web service (eg. Geonames, Musicbrainz), or
through a SPARQL end-point on which we can use filters on
literals, or in some cases built-in literal matching functional- Figure 1: Example datasets—we here try to match ele-
ity. So one solution to link resources from our two datasets ments from A and B
would be to first issue the following query on D1 :
UNION
{?r p:wikiPageUsesTemplate
SELECT ?l }
WHERE { ?p ?l } FILTER (isLiteral(?l)) }
}
and use the bindings of ?l to issue a literal search on D2 .
We can then try to map the resulting resources to r. This approach was used to link the musical works and the
artists in the BBC John Peel sessions dataset to correspond-
We used this kind of approach to link the Jamendo dataset ing resources in DBpedia. Constraints on the target re-
to the Geonames one2 . Jamendo provides information about sources were manually defined, and queries taking into ac-
the location of artists, in the form of a literal string (such as count the seed literal and these constraints were issued to
“Moselle, France”). We use this to query the Geonames web the DBpedia SPARQL end-point.
service, and get back the corresponding Geonames resource.
In practice, we found that the literal strings provided by However, even with restrictions on the nature of target re-
Jamendo are specific enough for this approach to work well. sources, a literal may not be discriminating enough. For ex-
In the two instances when more than one candidate was ample, the resource http://dbtune.org/jamendo/artist/5 is linked
returned for a location string, no link was created. to the literal “Both”. Searching the Musicbrainz dataset for
“Both” while restricting ourselves to artists gives us two re-
2.2 Extended literal lookups sults. Likewise, when looking for the Nirvana version of the
Although this simple approach can be suitable for interlink- song “Love Buzz” in the DBpedia dataset, we have to dis-
ing particular datasets in cases where a literal string reliably ambiguate between the original song and the Nirvana cover.
provides sufficient disambiguation it is unlikely to discrim- There is therefore a need for a more sophisticated algorithm,
inate suitably in most cases. For example, when trying to capable of handling such disambiguation.
apply it to link musical works in the BBC John Peel Sessions
to resources in the DBpedia dataset [1], we come across the 3. GRAPH MATCHING
literal “Violet”3 . The actual song which the algorithm should
An intuitive approach to disambiguate two artists with the
link to is just one of the sixteen results of the corresponding
same name would be to check the titles of their releases,
literal lookup.
and see if they match the titles of the releases in our seed
dataset. If by any chance they have releases with the same
One solution is to add constraints on the resulting resources.
title, we can check their track titles, and disambiguate using
For example, by using the DBpedia links to Yago [10], we
them, and so on. In this section, we develop this idea and
can restrict our resources to be of a specific Yago type, or we
give a formal specification of such an algorithm.
can restrict it to be linked to a particular infobox URI (spec-
ifying a template for structured data on the corresponding
Wikipedia page). This leads us to the following SPARQL 3.1 Offline graph matching
query: Consider the two datasets illustrated in fig. 1, with our seed
dataset on the left containing a single artist with the name
“Both”, and the target dataset on the right containing two
artists named “Both”. We model the two datasets as graphs,
PREFIX p:
SELECT ?r
with each edge represented as a triple (s, p, o).
WHERE
{ ?r ?p "Violet"@en. As a first step, we compute initial similarity values between
{ all pairs of resources (s1 , s2 ) and (o1 , o2 ) such that (s1 , p, o1 ) ∈
{?r a }
D1 and (s2 , p, o2 ) ∈ D2 . Such similarity values might be
2
All links to mentioned datasets are available in § 7 calculated by a string similarity algorithm (such as the ones
3
For the resource http://dbtune.org/bbc/peel/work/1498 described in section 2 of [12]) comparing literals directly at-
issue a query (through a SPARQL end-point or custom web
Table 1: Initial similarity map service) to D2 which involves l and constraints over what we
Resource 1 Resource 2 Initial Similarity are looking for. This gives us a list of resources.
1 4 1
1 7 1 For each sk in this list, we access Gsk . Now, for all pos-
2 5 0.9 sible graph mappings MGr :Gsk ,i , we compute a measure as
3 6 0.8 defined in § 3.1. If we can make a clear decision now (ie.
2 6 0.1 there is just one measure above our decision threshold), we
3 5 0.1 terminate and choose the corresponding graph mapping. If
2 8 0.1 not, we look for object properties p such that (r, p, o) ∈ Gr
3 8 0.1 and (sk , p, o0 ) ∈ Gsk , and we obtain Go and G0o . 4 Then,
we update our possible graph mappings and the associated
measures. We iterate this process until we can make a deci-
Table 2: Possible graph mappings—(x, y) ≡ {x sion (we have one unique mapping with measure above the
owl:sameAs y}, and associated measures threshold), or until we can’t go any further (no unexplored
Graphs Mapping Measures object properties). Practically, we also limit the maximum
G1 G2 MG1:G2 a = {(1, 4), (2, 5), (3, 6)} 0.9 number of iterations the algorithm may perform.
G1 G2 MG1:G2 b = {(1, 4), (2, 6), (3, 5)} 0.4
G1 G3 MG1:G3 a = {(1, 7), (2, 8)} 0.55 In our example, we first dereference http://dbtune.org/
G1 G3 MG1:G3 b = {(1, 7), (3, 8)} 0.55 jamendo/artist/5. We get access to the following facts: our
URI is identifying a musical band, it is called “Both”, and
it made 5 two things: http://dbtune.org/jamendo/record/174
tached to these resources. In our example, this produces the and http://dbtune.org/jamendo/record/33. We now look for
results in table 1. an artist named “Both” in the Musicbrainz dataset,
through the Musicbrainz web service6 . This gives
Next, we construct a graph similarity measure. In our ex- us back two URIs: http://zitgist.com/music/artist/
ample, we consider the possible graph mappings in table 2. 5f9f2dfb-76f0-4872-ad7d-f9d84a908cb5 and http://zitgist.com/
Then, we associate a measure with such mappings: we sum music/artist/0781a3f3-645c-45d1-a84f-76b4e4decf6d. We derefer-
the similarity values s associated with each pair (x, y) and ence them: the first one identifies an artist named “Both”
we normalise it by the number of pairs in the mapping. In which made two things, and the second one also an artist
our example, the resulting measures are in table 2. named “Both” which made one thing.
Finally, we choose the mapping whose similarity measure is We now consider two possible graph mappings (correspond-
the highest, optionally thresholding to avoid making map- ing to the two potential matches of our artist resource), with
pings between graphs which are too dissimilar. In our ex- two measures, both equal to 1. We continue looking for fur-
ample, we choose MG1:G2 a. ther clues, as there is not yet any way to disambiguate be-
tween the two. We take the object property occurring in our
3.2 Linked Data context three graphs, f oaf :made, and dereference all the objects of
this property that we currently know about. Our starting re-
Now, we apply this algorithm in a Linked Data [2] context,
source made two records, named “Simple Exercice” and “En
where we discover our graphs as we go: the main idea being
attendant d’aller sur Mars”. The first matching artist in the
that we update the graph mappings and their measures as
Musicbrainz dataset made two records, named “Simple ex-
we update our local Semantic Web cache. This is necessary
ercice” and “En attendant d’aller sur Mars...”. The second
because in general the size of the datasets D1 and D2 will
matching artist made one record, named “The Inevitable
be prohibitive; if not for loading both datasets, then for
Phyllis”. We now update the possible graph mappings, and
computing all the possible graph mappings between them.
reach the results in table 2. Now, we have one mapping
identifiably better than the others, with graph similarity
Our starting point is now a single URI r in a dataset D1 .
measure 0.9. We choose it, and hence derive the following
We try to find the corresponding s in a dataset D2 , as well
statements:
as mappings of resources in the neighbourhood of r. We
illustrate this using the same example as the one in § 3.1:
we want to map the URI http://dbtune.org/jamendo/artist/5 to
the corresponding Musicbrainz URI. As part of the same owl:sameAs
.
for this artist. owl:sameAs
.
In the following, we will use Named Graphs [5], in order to owl:sameAs
track the provenance of a particular graph, and we let Gx .
4
The first thing we do is to retrieve Gr , and extract a suitable We could additionally consider triples of the form (o, p, r)
label l for r (using dc:title or f oaf :name properties, etc). and (o0 , p, sk ).
5
Now, we need to access some potential candidates for s. To Captured through the f oaf :made predicate
6
do that, we use the same approach as described in § 2.2. We See http://wiki.musicbrainz.org/XMLWebService
Having chosen a mapping we could go further, to also derive 0
Ok,r,p is the list of all o such that (r0 , p, o) ∈ Gr0
such statements for the tracks in these two albums. An- 0
S
Foreach Objmapk,i ∈ r combinations(Ok,r,p , Ok,r,p ):
other possible extension of this algorithm is to perform lit- simk,i = (measurek + measure(Objmapk,i ))/(|Mk | +
eral lookups in D2 at each step (therefore providing new |Objmapk,i |)
possible graph mappings each time). This helps ensure we If no simk,i , fail
still find the correct mapping in the case that our initial If simk,i > threshold for exactly one {k, i} pair, return
append(Mk , Objmapk,i )
literal lookup does not include the correct resource among Else, N ewM appings is the list of all append(Mk , Objmapk,i ),
its results. For example, the correct target artist might be and return propagate(N ewM appings) (if the maximum number
listed as having a different name in D2 as in D1 , such that of recursions is not reached, otherwise fail)
they do not feature in the results of our initial literal lookup.
However, we might have some clues about who this artist is Now, we apply this pseudocode to our earlier “Both”
from the names of the albums they produced, and so per- example (r = 1 = http://dbtune.org/jamendo/artist/5), with a
threshold of 0.8. This makes us go through the following
forming additional literal lookups (on the album titles) may steps:
allow us to find the correct artist and hence the correct map-
ping. Such a variant of this algorithm is implemented in the lookup(r) = {4, 7}
GNAT software described in § 4.2. M1 = {(1, 4)}, sim1 = 1
M2 = {(1, 7)}, sim2 = 1
M appings = {{(1, 4)}, {(1, 7)}}
3.3 Algorithm definition
The algorithm described above can be expressed in the propagate({{(1, 4)}, {(1, 7)}})
following pseudo-code. We assume the existence of a k = 1, p = f oaf :made
0
O1,1,f oaf :made = {2, 3}, O1,4,f
function string similarity(x, y) and define the following oaf :made = {5, 6}
Objmap1,1 = {(2, 5), (3, 6)}, Objmap1,2 = {(2, 6), (3, 5)}
additional functions: sim1,1 = 0.9, sim1,2 = 0.4
k = 2, p = f oaf :made
function similarity(x, y) : 0
O2,1,f oaf :made = {2, 3}, O2,7,f oaf :made = {8}
Extract a suitable label lx for x in Gx Objmap2,1 = {(2, 8)}, Objmap2,2 = {(3, 8)}
Extract a suitable label ly for y in Gy sim2,1 = 0.55, sim2,2 = 0.55
Return string similarity(lx , ly ) Now, sim1,1 is the only simk,i above the threshold, we
therefore choose {(1, 4), (2, 5), (3, 6)} as our mapping, which
corresponds to the RDF code in § 3.2.
function lookup(x) :
Extract a suitable label lx for x in Gx Of course, several heuristics could be added to this pseudo-code,
Perform a search for lx on D2 in order to improve the scalability of the algorithm. In practice,
Return the set of resources retrieved from the search we associate weights to properties, in order to start from the
most informative one (f oaf :made, for example).
function measure(M ) :
Foreach (ri , rj ) ∈ M 4. EXPERIMENTS
simi,j = similarity(ri , rj ) In this section, we detail two experiments using this algo-
P
Return i,j simi,j rithm, and their respective evaluations. The first one deals
with the automatic interlinking of two online music datasets.
function combinations(O1 , O2 ) : The second one deals with the linking of a personal music
Return all possible combinations of collection towards corresponding web identifiers.
elements of O1 and elements of O2
e.g. combinations({1, 2}, {3, 4}) =
{{(1, 3), (2, 4)}, {(1, 4), (2, 3)}}
4.1 Linking two overlapping web datasets
In this section, we focus on a concrete interlinking which has
Our starting point is a URI r in D1 and a decision been achieved using this algorithm, between two overlapping
threshold threshold. Our mapping pseudo-code is then web datasets: Jamendo and Musicbrainz. We implemented
defined as: this algorithm7 in SWI-Prolog [11], with only one lookup
on the Musicbrainz end-point, at the artist level. Our algo-
rithm derived 10944 similarity statements for artist, record,
and track resources so far, which allows us to get detailed ed-
Foreach sk ∈ lookup(r) itorial information from Musicbrainz, and the actual audio
Mk = {(r, sk )} content, as well as tags, from the Jamendo dataset.
measurek = measure(Mk )
simk = measurek /|Mk | We focus our evaluation on artist resources. As we perform
If simk > threshold for exactly one k, return Mk only one lookup, at the artist level, no tracks or records
Else, M appings is the list of all Mk , and return
propagate(M appings)
can be matched if the artist is not. In order to evaluate
the quality of the interlinking, we take a random sample
function propagate(M appings) : from the Jamendo dataset: we first collect every single artist
Foreach Mk ∈ M appings: URI8 , and we randomly select 60 from among them. Then,
measurek = measure(Mk ) 7
Foreach p s.t. (∃(r, r0 ) ∈ Mk , ∃(r, p, o) ∈ Gr , The source code of all the software mentioned
∃(r0 , p, o0 ) ∈ Gr0 and ∀M ap ∈ M appings, (o, o0 ) ∈ / M ap): is available as part of the motools project:
Foreach (r, r0 ) ∈ Mk : http://sourceforge.net/projects/motools
8
Ok,r,p is the list of all o such that (r, p, o) ∈ Gr The results of such a SPARQL query are available at
Table 3: Evaluation of the Jamendo/Musicbrainz
interlinking
Link derived Link not derived
Correct 5 53
Incorrect 0 2
we run our mapping algorithm and manually check whether
the mappings are correct. Each tested resource therefore
falls into one of the following categories:
• An owl:sameAs link is derived : correct (same artist
in the Jamendo and in the Musicbrainz datasets) ;
• An owl:sameAs link is derived : incorrect (different Figure 2: Constructing a graph from local music meta-
artists) ; data
• No link is derived : correct (there is a corresponding
artist in the Musicbrainz dataset) ; statements making the links between local audio files and the
remote manifestation identifiers. The fingerprinting func-
• No link is derived : incorrect (no corresponding artists tionality can be useful when the metadata available is par-
in the Musicbrainz dataset). ticularly poor, but since it is highly dependent on the fin-
gerprinting service chosen, we concern ourselves here solely
with the metadata-based approach.
The results9 , in terms of how many resources fall within each
of the above defined categories, are shown in table 3. In our
All modern audio encodings allow for the inclusion of meta-
test dataset, the disambiguation was needed in 16 cases.
data “tags” alongside audio data in a file, such that each
For example, one of the artist resources was named “Hair”,
audio file can include the kind of editorial information con-
which matches four resources within Musicbrainz, none of
tained in the data sets described above. We can therefore
them being the same band. The first case that failed is due
consider a reasonably well tagged personal music collection
to an implementation mistake, failing to normalise the graph
to be just another music data set, apply the algorithm de-
similarity measures correctly when the target graph is bigger
scribed in § 3, and hence link each local audio file to a cor-
than the seed one (in this case, the artist had two releases
responding resource on the Semantic Web. GNAT uses the
on Musicbrainz, and just one on Jamendo). The second case
variant discussed at the end of § 3.2 which maps artists, al-
that failed is due to the fact that the Musicbrainz RDF is
bums and tracks, performing literal lookups at each stage.
outdated (the artist does not exist in the RDF dump, but
For each local audio file, a simple seed graph is constructed
does exist in the Musicbrainz database).
based on the artist, album, title and track number specified
in the file’s ID3 tag (see fig. 2 for an example). It proceeds
4.2 Linking personal music collections as set out in § 3.3 until a single best mapping is found. After
Personal music collections can also be a part of the web processing a directory of files in this way, GNAT outputs an
of data. The Music Ontology [9] makes the same distinc- RDF file providing mo:available_as links from URIs in the
tion as FRBR between manifestations (all physical objects Zitgist publication of MusicBrainz data to the local audio
that bear the same characteristics, eg. a particular album) files. For example :
and items (a concrete entity, eg. my copy of the album on
CD). A manifestation and a corresponding item are linked
through a predicate mo:available_as. Therefore, given a mo:available_as
set of audio files in a personal music collection, it is possible
to keep track of the set of statements linking this collection
to identifiers elsewhere in the Semantic Web which denote
the corresponding manifestations. These statements provide Using this algorithm rather than one of the more naı̈ve ap-
a set of entry points to the Semantic Web, allowing access proaches should allow GNAT to be robust to various inac-
to information such as the birth date of the artists responsi- curacies in the local files’ metadata. We evaluated GNAT’s
ble for items in the collection, geographical locations of the behaviour in the face of such problems by taking a correctly-
recordings, etc. tagged MP3 file of the Beatles track “I want to hold your
hand” and artificially introducing various mistakes. The Mu-
GNAT is an implementation of automatic linking from a sicBrainz dataset lists no less than 25 releases for this track
personal audio collection to the Musicbrainz dataset—it uses by The Beatles, and dozens of artists with songs of the same
audio fingerprinting and available metadata to find corre- name. The correct set of metadata is shown in table 4 and
sponding dereferencable identifiers, and then outputs RDF results are shown in table 5.
http://dbtune.org:2105/sparql/?query=select%20distinct%20%3Fa
%20where%20%7B%3Fa%20a%20mo%3AMusicArtist%7D We can see that GNAT performs well in the face of inaccu-
9 rate metadata. The release chosen when the album field is
The detailed results, resource by resource, are available at
http://moustaki.org/resources/results.txt missing or set to a random string is arguably correct—it is
Table 5: Evaluation of GNAT’s linkage on particular cases
Change made Resulting mapping
Artist field missing Correct
Artist set to random string Correct
Artist set to “Beetles” Correct
Artist set to “Al Green” Mapped to Al Green’s cover version
Album field missing Mapped to correct track on the release
Album set to random string “The Capitol Albums, Volume 1 (disc 1:
Album set to “Meat the Beatles” Meet the Beatles!)”
Track set to random string Correct
Track set to “I Wanna Hold Your Hand” Correct
can prioritise links which use properties from the Music On-
Table 4: Base (correct) metadata for GNAT evalu- tology (or other specified namespaces). This helps ensure
ation that relevant information is prioritised over less obviously-
Artist “The Beatles”
useful information.
Album “Meet The Beatles!”
Title “I Want To Hold Your Hand”
Information relating to resources of interest may also be re-
Track Num 1
trieved based on specific rules. For example, if we know
that the SBSimilarity service provides “similar track” infor-
mation about tracks by appending their Musicbrainz ID to
the same CD, released as part of a box set. One would hope a given prefix10 , we can specify a rule in GNARQL for gen-
that the release “Meet the Beatles!” would be chosen in the erating rdfs:seeAlso links from known tracks to the corre-
case of the album being misspelled. sponding documents in the SBSimilarity namespace. These
rule-derived links will then be followed as part of the crawl-
With a practical implementation some trade-offs must be ing strategy, and their information added to the local RDF
made. In the case of setting the artist to “Al Green”, we store.
have two conflicting pieces of information (artist and album)
and our implementation here chooses the mapping for which Naturally, the information held in GNARQL’s store need
the artist matches. A more sophisticated version of GNAT not come solely from GNAT. In fact, any RDF data in the
might consider other tracks in the current directory to es- designated directory tree will be loaded. In our research
tablish that the track most likely comes from the Beatles group, this means that Chord Ontology11 transcriptions are
release rather than Al Green’s release. held alongside the MusicBrainz data retrieved from Zitgist
and social tag data retrieved from various websites.
Such links from a user’s own files to information on the Se-
mantic Web could be a significant step towards making data To enable applications to take advantage of this aggregated
on the Web available and relevant to people, but a user agent data, GNARQL provides a SPARQL endpoint. This frees
must act on such links before they are directly useful to a end-user applications from the need to themselves maintain
human. In the next section we describe a companion tool to a database relating to the user’s music collection, and al-
GNAT, designed to exploit the links GNAT produces. lows multiple applications to benefit from a single store of
aggregated information.
4.3 Use-cases Based on datasets available today, some example queries
For the links between a user’s files and Semantic Web re- such user interfaces might pass on to GNARQL include
sources to be useful, an application must have some infor- “Find tracks which are performances of works by Russian
mation about the resources. The GNARQL program in the composers around the turn of the twentieth century”, “Find
motools project is beginning to explore some of the pos- me cover versions of rock songs in non-rock genres” or po-
sibilities in this direction. The program loads in all the tentially (by using information linked from a user’s FOAF
owl:sameAs links produced by GNAT, dereferences the cor- file) “Find me gigs by the artists I play frequently which fit
responding URIs, and then aggregates additional informa- with my vacation schedule”.
tion about those resources.
To experiment with browsing the data aggregated by
The basic mechanism for aggregating additional information GNARQL, we have developed a prototype user interface,
is to crawl outwards from the given resource, dereferencing based heavily on the /facet program described in [6]. This
linked resources and adding their descriptions to the local provides a web browser-based interface for exploring the
RDF store. In the simplest case, this crawling can be un- aggregated information, and performing simple facet-based
guided, simply following all links regardless of the properties
used, and performing a breadth-first traversal of the Seman- 10
eg. http://isophonics.net/music/signal/280b7fae-724e-4a6d-8e91-
tic Web from all known resources. 6fe3f0a2bdad provides extra information about
http://zitgist.com/music/signal/280b7fae-724e-4a6d-8e91-6fe3f0a2
More sophisticated crawling strategies may lead to better bdad
11
aggregation for a user’s purposes. For example, GNARQL See http://purl.org/ontology/chord/
queries. The map functionality allows the results of such young, more work is required to fully exploit the function-
queries to be plotted geographically, as shown in fig. 3. ality GNARQL is beginning to exhibit.
5. FUTURE WORK 6. CONCLUSION
5.1 Work on the interlinking algorithm In this paper, we described several different methods for in-
The algorithm proposed here has several limitations. Firstly, terlinking Semantic Web datasets. We mentioned two naı̈ve
it doesn’t specify any heuristics to use if the ontologies differ. approaches, leading to the construction of a more elaborate
In this case,we would need a different methodology, perhaps algorithm, which takes into account not only the similarity
inspired by the approach described in [8]. Secondly, the al- of the resources themselves but also the similarity of their
gorithm is designed for the case where a meaningful similar- neighbours. Its main advantages are to provide a best-effort
ity measure between pairs of individual resources is available mapping, without any need for a learning step (for which
(here, using string similarity of labels attached to them with we would have to manually interlink some resources), and to
certain predicates). In a linking scenario where there is no work in a linked data environment, where new resources are
particularly good similarity measure on individual resources discovered as we get through the mapping process. We de-
and the graph structure is therefore the most salient factor scribed two implementations of this algorithm. The first one
for correct linking, another algorithm may be more appropri- was used to interlink artists, records, and tracks in two on-
ate. In this case, Melnik’s “similarity flooding” [7] approach line music datasets: Jamendo and Musicbrainz, and the sec-
could be used, which prioritises graph structure rather than ond one allows any user to link their personal music collec-
node similarity, relying on a post-processing stage to filter tion to corresponding identifiers in the Musicbrainz dataset.
out unsuitable mappings. We evaluated these two implementations separately.
Based on these observations, further work developing the Creating links between heterogeneous datasets can dramat-
interlinking algorithm could provide a framework for inter- ically enhance the usefulness of each. Using such links, a
linking a wider variety of datasets. Semantic Web user agent can jump from a band within the
Jamendo dataset to the corresponding resource in the Mu-
5.2 Work on implementations sicbrainz one, to the corresponding resource in DBpedia,
Currently, the GNAT tool implements two distinct ap- to its approximate geographic location, to the famous com-
proaches to finding manifestation URIs for local audio files. posers born in that city, etc. However, if it is possible to
One uses just the available metadata, and this approach is define such links manually for small datasets, it is impossi-
described in § 4.2, above. Since audio metadata in personal ble for large ones. We need methodologies to discover them
collections is frequently incomplete, inaccurate, or missing in an automated way. The techniques presented here are
entirely, this may not be sufficient. The other approach far from perfect, but represent some initial efforts in this
therefore exploits audio fingerprinting [4] to try to identify direction.
the track, and then if there is remaining ambiguity the local
metadata is used to choose a single URI.
7. DATASETS
The following datasets are mentioned throughout the paper:
Ideally, we could use the fingerprint of an audio file as just
another piece of information about the track, incorporat-
Jamendo on DBTune: http://dbtune.org/jamendo/
ing it into our graph mapping approach. In practice, the
BBC John Peel sessions: http://dbtune.org/bbc/peel/
main fingerprinting service available with a large support-
SBSimilarity: http://www.isophonics.net/SBSimilarity
ing database, MusicIP’s MusicDNS service12 , is relatively
Musicbrainz RDF: http://zitgist.com/music/
opaque. Fingerprinting a track either returns a PUID, which
DBpedia: http://dbpedia.org/
can be used to perform a search on MusicBrainz, or returns
Geonames: http://geonames.org/
no results. It therefore provides only a boolean test for sim-
ilarity, and some hidden decisions have been made by the
MusicDNS service using any available metadata. As a re-
sult, there is no obvious way to uniformly combine finger- 8. NAMESPACES
print information and local metadata in a graph mapping We use the following namespaces throughout our RDF ex-
approach. amples:
A fingerprinting service which exposed the server-side
database and the actual process of matching a fingerprint @prefix mo: .
to a database entry could allow for some more sophisticated @prefix foaf: .
linkage between personal audio collections and the Semantic @prefix owl: .
Web. We have some hopes that Last.fm’s recently launched
fingerprinting service might gather a large database and
make it freely available in a flexible way.
9. ACKNOWLEDGEMENTS
The authors acknowledge the support of both the Cen-
tre For Digital Music and the Department of Computer
Although the core of the GNARQL tool is in place, more
Science at Queen Mary University of London for the stu-
work is required to explore different approaches to crawl-
dentship for Yves Raimond. This work has been partially
ing. Also, since Semantic Web user interfaces are relatively
supported by the EPSRC-funded ICT project OMRAS-2
12
See http://www.musicip.com/dns/ (EP/E017614/1).
Figure 3: Navigating a personal audio collection using aggregated Semantic Web data
10. REFERENCES [9] Yves Raimond, Samer Abdallah, Mark Sandler, and
[1] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov, Frederick Giasson. The music ontology. In Proceedings
R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a of the International Conference on Music Information
web of open data. In Proceedings of the International Retrieval, pages 417–422, September 2007. Available
Semantic Web Conference, Busan, Korea, November at http://ismir2007.ismir.net/proceedings/
11-15 2007. ISMIR2007_p417_raimond.pdf. Last accessed January
[2] Tim Berners-Lee. Linked data. World wide web design 2008.
issues, July 2006. Available at [10] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard
http://www.w3.org/DesignIssues/LinkedData.html. Weikum. Yago - a core of semantic knowledge. In 16th
Last accessed September 2007. international World Wide Web conference, 2007.
[3] Chris Bizer, Tom Heath, Danny Ayers, and Yves Available at http://www.mpi-inf.mpg.de/
Raimond. Interlinking open data on the web. In ~suchanek/publications/www2007.pdf. Last accessed
Demonstrations Track, 4th European Semantic Web january 2008.
Conference, Innsbruck, Austria, 2007. Available at [11] Jan Wielemaker, Zhisheng Huang, and Lourens
http://www.eswc2007.org/pdf/demo-pdf/ Van Der Meij. SWI-Prolog and the web. Theory and
LinkingOpenData.pdf. Last accessed September 2007. Practice of Logic Programming, 2003. Available at
[4] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A http://hcs.science.uva.nl/projects/SWI-Prolog/
review of audio fingerprinting. The Journal of VLSI articles/TPLP-plweb.pdf. Last accessed September
Signal Processing, 41(3):271 – 284, 2005. 2007.
[5] Jeremy J. Carroll, Christian Bizer, Pat Hayes, and [12] W. Winkler. Advanced methods for record linkage.
Patrick Stickler. Named graphs. Journal of Web Technical report, Statistical Research Division,
Semantics, 2005. Washington, DC: U.S. Bureau of the Census., 1994.
[6] Michiel Hildebrand, Jacco van Ossenbruggen, and Available at
Lynda Hardman. The Semantic Web - ISWC 2006, http://citeseer.ist.psu.edu/254560.html. Last
volume 4273/2006 of Lecture Notes in Computer accessed January 2008.
Science, chapter /facet: A Browser for Heterogeneous
Semantic Web Repositories, pages 272–285. Springer
Berlin / Heidelberg, 2006.
[7] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity
flooding: a versatile graph matching algorithm and
itsapplication to schema matching. In Proceedings of
the 18th International Conference on Data
Engineering, pages 117–128, San Jose, CA, USA,
February-March 2002.
[8] M. Neiling. Data fusion with record linkage. In
I. Schmitt, C. Turker, E. Hildebrandt, and M. Hoding,
editors, Proceedings of the Workshop ’Foederierte
Datenbanken’, Aachen, 1998. Available at
http://citeseer.ist.psu.edu/189652.html. Last
accessed January 2008.