<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Interlinking of Music Datasets on the Semantic Web</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Yves Raimond, Christopher Sutton and Mark Sandler Centre for Digital Music Queen Mary, University of London</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe current e orts towards interlinking music-related datasets on the Web. We rst explain some initial interlinking experiences, and the poor results obtained by taking a nave approach. We then detail a particular interlinking algorithm, taking into account both the similarities of web resources and of their neighbours. We detail the application of this algorithm in two contexts: to link a Creative Commons music dataset to an editorial one, and to link a personal music collection to corresponding web identi ers. The latter provides a user with personally meaningful entry points for exploring the web of data, and we conclude by describing some concrete tools built to generate and use such links.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Semantic-Web</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Music</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The Linking Open Data community project [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] aims at
publishing and interlinking open datasets by following simple
rules [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for linking data. The publication step can to some
extent be automated, using tools such as D2R or
OpenLink Virtuoso (which both allow relational databases to be
published as linked data) or P2R (which allows SWI-Prolog
knowledge bases to be published in the same way). All these
tools handle declarative mappings from a given data
structure to corresponding web resources and associated RDF
descriptions. Once this publication is achieved, we still have
to create links to other datasets, in order for a user agent
to navigate from one to another. A typical example of such
interlinking would be the following one, where a music band
in a Creative Commons label is linked to its location in the
Geonames dataset, and to the corresponding resource in an
editorial database1:
&lt;http://dbtune.org/jamendo/artist/5&gt;
foaf:based_near &lt;http://sws.geonames.org/2991627/&gt; ;
owl:sameAs &lt;http://zitgist.com/music/artist/
      </p>
      <p>0781a3f3-645c-45d1-a84f-76b4e4decf6d&gt;.</p>
      <p>Then, you may access the actual audio content from the
Creative Commons label, some extra information such as the
birth dates of the members of this band from the editorial
dataset, and detailed geographic information (latitude,
longitude, hierarchy of geographical features) from the
Geonames dataset.</p>
      <p>For small datasets published manually (such as an
individual's FOAF le), it is possible to create such links
manually. However, doing so for large datasets is impractical: we
need a way to automatically detect the overlapping parts of
heterogeneous datasets. In this paper, we detail a few
algorithms that have been developed, implemented and
practically deployed to interlink di erent music-related datasets.
We mainly focus on the most sophisticated one, applicable
in a Linked Data context, and taking into account not only
the similarities of single resources but also the similarities
of their neighbours. We evaluate how this algorithm
performs when applied to link a real-world Creative Commons
dataset to an editorial one. We also show how a personal
music collection can be treated as one such dataset, enabling
a user to bene t from the growing body of knowledge on the
Semantic Web in a personally meaningful way.</p>
      <p>We de ne the mapping problem as follows. We
consider two RDF datasets D1 and D2, respectively
describing a number of web resources ri and si. We consider
the problem of matching resources| nding resources sy
in our target dataset, D2, which identify the same
object as a resource rx in our seed dataset, D1. For
example, we want to nd sy = http://zitgist.com/music/artist/
0781a3f3-645c-45d1-a84f-76b4e4decf6d for rx = http://dbtune.org/
jamendo/artist/5. All mapping problems in our context can be
reduced to this one, even literal expansion where a resource
rx is linked to a literal l and we are looking for a resource
sy corresponding to l|for example, expanding \Moselle,
France" into http://sws.geonames.org/2991627/. In these cases,
1We use the Turtle notation throughout the paper, with the
namespaces de ned in x 8
a simple transformation (creating a blank node between rx
and l) is enough to return ourselves to a resource matching
problem.</p>
    </sec>
    <sec id="sec-2">
      <title>2. NAIVE INTERLINKING</title>
      <p>In this section, we describe some nave approaches to
tackling this resource matching problem, and identify their
failings.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Simple literal lookups</title>
      <p>Most datasets provide a literal search facility, either through
a dedicated Web service (eg. Geonames, Musicbrainz), or
through a SPARQL end-point on which we can use lters on
literals, or in some cases built-in literal matching
functionality. So one solution to link resources from our two datasets
would be to rst issue the following query on D1:
SELECT ?l
WHERE { &lt;r&gt; ?p ?l } FILTER (isLiteral(?l))
and use the bindings of ?l to issue a literal search on D2.
We can then try to map the resulting resources to r.
We used this kind of approach to link the Jamendo dataset
to the Geonames one2. Jamendo provides information about
the location of artists, in the form of a literal string (such as
\Moselle, France"). We use this to query the Geonames web
service, and get back the corresponding Geonames resource.
In practice, we found that the literal strings provided by
Jamendo are speci c enough for this approach to work well.
In the two instances when more than one candidate was
returned for a location string, no link was created.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Extended literal lookups</title>
      <p>
        Although this simple approach can be suitable for
interlinking particular datasets in cases where a literal string reliably
provides su cient disambiguation it is unlikely to
discriminate suitably in most cases. For example, when trying to
apply it to link musical works in the BBC John Peel Sessions
to resources in the DBpedia dataset [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we come across the
literal \Violet"3. The actual song which the algorithm should
link to is just one of the sixteen results of the corresponding
literal lookup.
      </p>
      <p>
        One solution is to add constraints on the resulting resources.
For example, by using the DBpedia links to Yago [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], we
can restrict our resources to be of a speci c Yago type, or we
can restrict it to be linked to a particular infobox URI
(specifying a template for structured data on the corresponding
Wikipedia page). This leads us to the following SPARQL
query:
PREFIX p: &lt;http://dbpedia.org/property/&gt;
SELECT ?r
WHERE
{ ?r ?p "Violet"@en.
      </p>
      <p>{</p>
      <p>{?r a &lt;http://dbpedia.org/class/yago/Song107048000&gt;}
2All links to mentioned datasets are available in x 7
3For the resource http://dbtune.org/bbc/peel/work/1498
This approach was used to link the musical works and the
artists in the BBC John Peel sessions dataset to
corresponding resources in DBpedia. Constraints on the target
resources were manually de ned, and queries taking into
account the seed literal and these constraints were issued to
the DBpedia SPARQL end-point.</p>
      <p>However, even with restrictions on the nature of target
resources, a literal may not be discriminating enough. For
example, the resource http://dbtune.org/jamendo/artist/5 is linked
to the literal \Both". Searching the Musicbrainz dataset for
\Both" while restricting ourselves to artists gives us two
results. Likewise, when looking for the Nirvana version of the
song \Love Buzz" in the DBpedia dataset, we have to
disambiguate between the original song and the Nirvana cover.
There is therefore a need for a more sophisticated algorithm,
capable of handling such disambiguation.</p>
    </sec>
    <sec id="sec-5">
      <title>3. GRAPH MATCHING</title>
      <p>An intuitive approach to disambiguate two artists with the
same name would be to check the titles of their releases,
and see if they match the titles of the releases in our seed
dataset. If by any chance they have releases with the same
title, we can check their track titles, and disambiguate using
them, and so on. In this section, we develop this idea and
give a formal speci cation of such an algorithm.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Offline graph matching</title>
      <p>Consider the two datasets illustrated in g. 1, with our seed
dataset on the left containing a single artist with the name
\Both", and the target dataset on the right containing two
artists named \Both". We model the two datasets as graphs,
with each edge represented as a triple (s; p; o).</p>
      <p>
        As a rst step, we compute initial similarity values between
all pairs of resources (s1; s2) and (o1; o2) such that (s1; p; o1) 2
D1 and (s2; p; o2) 2 D2. Such similarity values might be
calculated by a string similarity algorithm (such as the ones
described in section 2 of [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) comparing literals directly
attached to these resources. In our example, this produces the
results in table 1.
      </p>
      <p>Next, we construct a graph similarity measure. In our
example, we consider the possible graph mappings in table 2.
Then, we associate a measure with such mappings: we sum
the similarity values s associated with each pair (x; y) and
we normalise it by the number of pairs in the mapping. In
our example, the resulting measures are in table 2.
Finally, we choose the mapping whose similarity measure is
the highest, optionally thresholding to avoid making
mappings between graphs which are too dissimilar. In our
example, we choose MG1:G2a.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Linked Data context</title>
      <p>
        Now, we apply this algorithm in a Linked Data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] context,
where we discover our graphs as we go: the main idea being
that we update the graph mappings and their measures as
we update our local Semantic Web cache. This is necessary
because in general the size of the datasets D1 and D2 will
be prohibitive; if not for loading both datasets, then for
computing all the possible graph mappings between them.
Our starting point is now a single URI r in a dataset D1.
We try to nd the corresponding s in a dataset D2, as well
as mappings of resources in the neighbourhood of r. We
illustrate this using the same example as the one in x 3:1:
we want to map the URI http://dbtune.org/jamendo/artist/5 to
the corresponding Musicbrainz URI. As part of the same
process we wish to map corresponding albums and tracks
for this artist.
      </p>
      <p>
        In the following, we will use Named Graphs [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in order to
track the provenance of a particular graph, and we let Gx
denote the graph retrieved when dereferencing x.
The rst thing we do is to retrieve Gr, and extract a suitable
label l for r (using dc:title or f oaf :name properties, etc).
Now, we need to access some potential candidates for s. To
do that, we use the same approach as described in x 2:2. We
issue a query (through a SPARQL end-point or custom web
service) to D2 which involves l and constraints over what we
are looking for. This gives us a list of resources.
For each sk in this list, we access Gsk . Now, for all
possible graph mappings MGr:Gsk ;i, we compute a measure as
de ned in x 3:1. If we can make a clear decision now (ie.
there is just one measure above our decision threshold), we
terminate and choose the corresponding graph mapping. If
not, we look for object properties p such that (r; p; o) 2 Gr
and (sk; p; o0) 2 Gsk , and we obtain Go and G0o. 4 Then,
we update our possible graph mappings and the associated
measures. We iterate this process until we can make a
decision (we have one unique mapping with measure above the
threshold), or until we can't go any further (no unexplored
object properties). Practically, we also limit the maximum
number of iterations the algorithm may perform.
In our example, we rst dereference http://dbtune.org/
jamendo/artist/5. We get access to the following facts: our
URI is identifying a musical band, it is called \Both", and
it made5 two things: http://dbtune.org/jamendo/record/174
and http://dbtune.org/jamendo/record/33. We now look for
an artist named \Both" in the Musicbrainz dataset,
through the Musicbrainz web service6. This gives
us back two URIs: http://zitgist.com/music/artist/
5f9f2dfb-76f0-4872-ad7d-f9d84a908cb5 and http://zitgist.com/
music/artist/0781a3f3-645c-45d1-a84f-76b4e4decf6d. We
dereference them: the rst one identi es an artist named \Both"
which made two things, and the second one also an artist
named \Both" which made one thing.
      </p>
      <p>We now consider two possible graph mappings
(corresponding to the two potential matches of our artist resource), with
two measures, both equal to 1. We continue looking for
further clues, as there is not yet any way to disambiguate
between the two. We take the object property occurring in our
three graphs, f oaf :made, and dereference all the objects of
this property that we currently know about. Our starting
resource made two records, named \Simple Exercice" and \En
attendant d'aller sur Mars". The rst matching artist in the
Musicbrainz dataset made two records, named \Simple
exercice" and \En attendant d'aller sur Mars...". The second
matching artist made one record, named \The Inevitable
Phyllis". We now update the possible graph mappings, and
reach the results in table 2. Now, we have one mapping
identi ably better than the others, with graph similarity
measure 0:9. We choose it, and hence derive the following
statements:
&lt;http://dbtune.org/jamendo/artist/5&gt; owl:sameAs
&lt;http://zitgist.com/music/artist/</p>
      <p>0781a3f3-645c-45d1-a84f-76b4e4decf6d&gt;.
&lt;http://dbtune.org/jamendo/record/174&gt; owl:sameAs
&lt;http://zitgist.com/music/record/</p>
      <p>3042765f-67ba-49ef-ab28-45805fabef4a&gt;.
&lt;http://dbtune.org/jamendo/record/33&gt; owl:sameAs
&lt;http://zitgist.com/music/record/</p>
      <p>fade0242-e1f0-457b-99de-d9fe0c8cbd57&gt;.
4We could additionally consider triples of the form (o; p; r)
and (o0; p; sk).
5Captured through the f oaf :made predicate
6See http://wiki.musicbrainz.org/XMLWebService
Having chosen a mapping we could go further, to also derive
such statements for the tracks in these two albums.
Another possible extension of this algorithm is to perform
literal lookups in D2 at each step (therefore providing new
possible graph mappings each time). This helps ensure we
still nd the correct mapping in the case that our initial
literal lookup does not include the correct resource among
its results. For example, the correct target artist might be
listed as having a di erent name in D2 as in D1, such that
they do not feature in the results of our initial literal lookup.
However, we might have some clues about who this artist is
from the names of the albums they produced, and so
performing additional literal lookups (on the album titles) may
allow us to nd the correct artist and hence the correct
mapping. Such a variant of this algorithm is implemented in the
GNAT software described in x 4:2.</p>
    </sec>
    <sec id="sec-8">
      <title>3.3 Algorithm definition</title>
      <p>The algorithm described above can be expressed in the
following pseudo-code. We assume the existence of a
function string similarity(x; y) and de ne the following
additional functions:
function similarity(x; y) :</p>
      <p>Extract a suitable label lx for x in Gx
Extract a suitable label ly for y in Gy</p>
      <p>Return string similarity(lx; ly)
function lookup(x) :</p>
      <p>Extract a suitable label lx for x in Gx
Perform a search for lx on D2</p>
      <p>Return the set of resources retrieved from the search
function measure(M ) :</p>
      <p>Foreach (ri; rj ) 2 M</p>
      <p>simi;j = similarity(ri; rj )</p>
      <p>Return Pi;j simi;j
function combinations(O1; O2) :</p>
      <p>Return all possible combinations of
elements of O1 and elements of O2
e.g. combinations(f1; 2g; f3; 4g) =</p>
      <p>ff(1; 3); (2; 4)g; f(1; 4); (2; 3)gg
Our starting point is a URI r in D1 and a decision
threshold threshold. Our mapping pseudo-code is then
de ned as:</p>
      <sec id="sec-8-1">
        <title>Foreach sk 2 lookup(r)</title>
        <p>Mk = f(r; sk)g
measurek = measure(Mk)
simk = measurek=jMkj
If simk &gt; threshold for exactly one k, return Mk
Else, M appings is the list of all Mk, and return
propagate(M appings)
function propagate(M appings) :</p>
        <p>Foreach Mk 2 M appings:
measurek = measure(Mk)</p>
        <p>Foreach p s:t: (9(r; r0) 2 Mk; 9(r; p; o) 2
9(r0; p; o0) 2 Gr0 and 8M ap 2 M appings; (o; o0) 2= M ap):
Foreach (r; r0) 2 Mk:</p>
        <p>Ok;r;p is the list of all o such that (r; p; o) 2 Gr
Gr,</p>
      </sec>
      <sec id="sec-8-2">
        <title>Ok0;r;p is the list of all o such that (r0; p; o) 2 Gr0</title>
        <p>Foreach Objmapk;i 2 Sr combinations(Ok;r;p; Ok0;r;p):
simk;i = (measurek + measure(Objmapk;i))=(jMkj +
jObjmapk;ij)</p>
        <p>If no simk;i, fail</p>
        <p>If simk;i &gt; threshold for exactly one fk; ig pair, return
append(Mk; Objmapk;i)</p>
        <p>Else, N ewM appings is the list of all append(Mk; Objmapk;i),
and return propagate(N ewM appings) (if the maximum number
of recursions is not reached, otherwise fail)
Now, we apply this pseudocode to our earlier \Both"
example (r = 1 = http://dbtune.org/jamendo/artist/5), with a
threshold of 0:8. This makes us go through the following
steps:
lookup(r) = f4; 7g
M1 = f(1; 4)g, sim1 = 1
M2 = f(1; 7)g, sim2 = 1
M appings = ff(1; 4)g; f(1; 7)gg
propagate(ff(1; 4)g; f(1; 7)gg)
k = 1, p = f oaf :made</p>
        <p>O1;1;foaf:made = f2; 3g, O10;4;foaf:made = f5; 6g
Objmap1;1 = f(2; 5); (3; 6)g, Objmap1;2 = f(2; 6); (3; 5)g
sim1;1 = 0:9, sim1;2 = 0:4
k = 2, p = f oaf :made</p>
        <p>O2;1;foaf:made = f2; 3g, O20;7;foaf:made = f8g
Objmap2;1 = f(2; 8)g, Objmap2;2 = f(3; 8)g
sim2;1 = 0:55, sim2;2 = 0:55</p>
        <p>Now, sim1;1 is the only simk;i above the threshold, we
therefore choose f(1; 4); (2; 5); (3; 6)g as our mapping, which
corresponds to the RDF code in x 3:2.</p>
        <p>Of course, several heuristics could be added to this pseudo-code,
in order to improve the scalability of the algorithm. In practice,
we associate weights to properties, in order to start from the
most informative one (f oaf :made, for example).</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. EXPERIMENTS</title>
      <p>In this section, we detail two experiments using this
algorithm, and their respective evaluations. The rst one deals
with the automatic interlinking of two online music datasets.
The second one deals with the linking of a personal music
collection towards corresponding web identi ers.</p>
    </sec>
    <sec id="sec-10">
      <title>4.1 Linking two overlapping web datasets</title>
      <p>
        In this section, we focus on a concrete interlinking which has
been achieved using this algorithm, between two overlapping
web datasets: Jamendo and Musicbrainz. We implemented
this algorithm7 in SWI-Prolog [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], with only one lookup
on the Musicbrainz end-point, at the artist level. Our
algorithm derived 10944 similarity statements for artist, record,
and track resources so far, which allows us to get detailed
editorial information from Musicbrainz, and the actual audio
content, as well as tags, from the Jamendo dataset.
We focus our evaluation on artist resources. As we perform
only one lookup, at the artist level, no tracks or records
can be matched if the artist is not. In order to evaluate
the quality of the interlinking, we take a random sample
from the Jamendo dataset: we rst collect every single artist
URI8, and we randomly select 60 from among them. Then,
7The source code of all the software mentioned
is available as part of the motools project:
http://sourceforge.net/projects/motools
8The results of such a SPARQL query are available at
we run our mapping algorithm and manually check whether
the mappings are correct. Each tested resource therefore
falls into one of the following categories:
      </p>
      <p>An owl:sameAs link is derived : correct (same artist
in the Jamendo and in the Musicbrainz datasets) ;
An owl:sameAs link is derived : incorrect (di erent
artists) ;
No link is derived : correct (there is a corresponding
artist in the Musicbrainz dataset) ;
No link is derived : incorrect (no corresponding artists
in the Musicbrainz dataset).</p>
      <p>The results9, in terms of how many resources fall within each
of the above de ned categories, are shown in table 3. In our
test dataset, the disambiguation was needed in 16 cases.
For example, one of the artist resources was named \Hair",
which matches four resources within Musicbrainz, none of
them being the same band. The rst case that failed is due
to an implementation mistake, failing to normalise the graph
similarity measures correctly when the target graph is bigger
than the seed one (in this case, the artist had two releases
on Musicbrainz, and just one on Jamendo). The second case
that failed is due to the fact that the Musicbrainz RDF is
outdated (the artist does not exist in the RDF dump, but
does exist in the Musicbrainz database).</p>
    </sec>
    <sec id="sec-11">
      <title>4.2 Linking personal music collections</title>
      <p>
        Personal music collections can also be a part of the web
of data. The Music Ontology [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] makes the same
distinction as FRBR between manifestations (all physical objects
that bear the same characteristics, eg. a particular album)
and items (a concrete entity, eg. my copy of the album on
CD). A manifestation and a corresponding item are linked
through a predicate mo:available_as. Therefore, given a
set of audio les in a personal music collection, it is possible
to keep track of the set of statements linking this collection
to identi ers elsewhere in the Semantic Web which denote
the corresponding manifestations. These statements provide
a set of entry points to the Semantic Web, allowing access
to information such as the birth date of the artists
responsible for items in the collection, geographical locations of the
recordings, etc.
      </p>
      <p>GNAT is an implementation of automatic linking from a
personal audio collection to the Musicbrainz dataset|it uses
audio ngerprinting and available metadata to nd
corresponding dereferencable identi ers, and then outputs RDF
http://dbtune.org:2105/sparql/?query=select%20distinct%20%3Fa
%20where%20%7B%3Fa%20a%20mo%3AMusicArtist%7D
9The detailed results, resource by resource, are available at
http://moustaki.org/resources/results.txt
statements making the links between local audio les and the
remote manifestation identi ers. The ngerprinting
functionality can be useful when the metadata available is
particularly poor, but since it is highly dependent on the
ngerprinting service chosen, we concern ourselves here solely
with the metadata-based approach.</p>
      <p>All modern audio encodings allow for the inclusion of
metadata \tags" alongside audio data in a le, such that each
audio le can include the kind of editorial information
contained in the data sets described above. We can therefore
consider a reasonably well tagged personal music collection
to be just another music data set, apply the algorithm
described in x 3, and hence link each local audio le to a
corresponding resource on the Semantic Web. GNAT uses the
variant discussed at the end of x 3:2 which maps artists,
albums and tracks, performing literal lookups at each stage.
For each local audio le, a simple seed graph is constructed
based on the artist, album, title and track number speci ed
in the le's ID3 tag (see g. 2 for an example). It proceeds
as set out in x 3:3 until a single best mapping is found. After
processing a directory of les in this way, GNAT outputs an
RDF le providing mo:available_as links from URIs in the
Zitgist publication of MusicBrainz data to the local audio
les. For example :
&lt;http://zitgist.com/music/track/
1adfecb7-875f-4203-b3b1-8e2e643f94a2&gt; mo:available_as
&lt;file:///mnt/music/Artists/Nirvana/Bleach/track5.mp3&gt;
Using this algorithm rather than one of the more nave
approaches should allow GNAT to be robust to various
inaccuracies in the local les' metadata. We evaluated GNAT's
behaviour in the face of such problems by taking a
correctlytagged MP3 le of the Beatles track \I want to hold your
hand" and arti cially introducing various mistakes. The
MusicBrainz dataset lists no less than 25 releases for this track
by The Beatles, and dozens of artists with songs of the same
name. The correct set of metadata is shown in table 4 and
results are shown in table 5.</p>
      <p>We can see that GNAT performs well in the face of
inaccurate metadata. The release chosen when the album eld is
missing or set to a random string is arguably correct|it is
the same CD, released as part of a box set. One would hope
that the release \Meet the Beatles!" would be chosen in the
case of the album being misspelled.</p>
      <p>With a practical implementation some trade-o s must be
made. In the case of setting the artist to \Al Green", we
have two con icting pieces of information (artist and album)
and our implementation here chooses the mapping for which
the artist matches. A more sophisticated version of GNAT
might consider other tracks in the current directory to
establish that the track most likely comes from the Beatles
release rather than Al Green's release.</p>
      <p>Such links from a user's own les to information on the
Semantic Web could be a signi cant step towards making data
on the Web available and relevant to people, but a user agent
must act on such links before they are directly useful to a
human. In the next section we describe a companion tool to
GNAT, designed to exploit the links GNAT produces.</p>
    </sec>
    <sec id="sec-12">
      <title>4.3 Use-cases</title>
      <p>For the links between a user's les and Semantic Web
resources to be useful, an application must have some
information about the resources. The GNARQL program in the
motools project is beginning to explore some of the
possibilities in this direction. The program loads in all the
owl:sameAs links produced by GNAT, dereferences the
corresponding URIs, and then aggregates additional
information about those resources.</p>
      <p>The basic mechanism for aggregating additional information
is to crawl outwards from the given resource, dereferencing
linked resources and adding their descriptions to the local
RDF store. In the simplest case, this crawling can be
unguided, simply following all links regardless of the properties
used, and performing a breadth- rst traversal of the
Semantic Web from all known resources.</p>
      <p>More sophisticated crawling strategies may lead to better
aggregation for a user's purposes. For example, GNARQL
can prioritise links which use properties from the Music
Ontology (or other speci ed namespaces). This helps ensure
that relevant information is prioritised over less
obviouslyuseful information.</p>
      <p>Information relating to resources of interest may also be
retrieved based on speci c rules. For example, if we know
that the SBSimilarity service provides \similar track"
information about tracks by appending their Musicbrainz ID to
a given pre x10 , we can specify a rule in GNARQL for
generating rdfs:seeAlso links from known tracks to the
corresponding documents in the SBSimilarity namespace. These
rule-derived links will then be followed as part of the
crawling strategy, and their information added to the local RDF
store.</p>
      <p>Naturally, the information held in GNARQL's store need
not come solely from GNAT. In fact, any RDF data in the
designated directory tree will be loaded. In our research
group, this means that Chord Ontology11 transcriptions are
held alongside the MusicBrainz data retrieved from Zitgist
and social tag data retrieved from various websites.
To enable applications to take advantage of this aggregated
data, GNARQL provides a SPARQL endpoint. This frees
end-user applications from the need to themselves maintain
a database relating to the user's music collection, and
allows multiple applications to bene t from a single store of
aggregated information.</p>
      <p>Based on datasets available today, some example queries
such user interfaces might pass on to GNARQL include
\Find tracks which are performances of works by Russian
composers around the turn of the twentieth century", \Find
me cover versions of rock songs in non-rock genres" or
potentially (by using information linked from a user's FOAF
le) \Find me gigs by the artists I play frequently which t
with my vacation schedule".</p>
      <p>
        To experiment with browsing the data aggregated by
GNARQL, we have developed a prototype user interface,
based heavily on the /facet program described in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This
provides a web browser-based interface for exploring the
aggregated information, and performing simple facet-based
10eg.
http://isophonics.net/music/signal/280b7fae-724e-4a6d-8e916fe3f0a2bdad provides extra information about
http://zitgist.com/music/signal/280b7fae-724e-4a6d-8e91-6fe3f0a2
bdad
11See http://purl.org/ontology/chord/
queries. The map functionality allows the results of such
queries to be plotted geographically, as shown in g. 3.
young, more work is required to fully exploit the
functionality GNARQL is beginning to exhibit.
      </p>
    </sec>
    <sec id="sec-13">
      <title>5. FUTURE WORK</title>
    </sec>
    <sec id="sec-14">
      <title>5.1 Work on the interlinking algorithm</title>
      <p>
        The algorithm proposed here has several limitations. Firstly,
it doesn't specify any heuristics to use if the ontologies di er.
In this case,we would need a di erent methodology, perhaps
inspired by the approach described in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Secondly, the
algorithm is designed for the case where a meaningful
similarity measure between pairs of individual resources is available
(here, using string similarity of labels attached to them with
certain predicates). In a linking scenario where there is no
particularly good similarity measure on individual resources
and the graph structure is therefore the most salient factor
for correct linking, another algorithm may be more
appropriate. In this case, Melnik's \similarity ooding" [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] approach
could be used, which prioritises graph structure rather than
node similarity, relying on a post-processing stage to lter
out unsuitable mappings.
      </p>
      <p>Based on these observations, further work developing the
interlinking algorithm could provide a framework for
interlinking a wider variety of datasets.</p>
    </sec>
    <sec id="sec-15">
      <title>5.2 Work on implementations</title>
      <p>
        Currently, the GNAT tool implements two distinct
approaches to nding manifestation URIs for local audio les.
One uses just the available metadata, and this approach is
described in x 4:2, above. Since audio metadata in personal
collections is frequently incomplete, inaccurate, or missing
entirely, this may not be su cient. The other approach
therefore exploits audio ngerprinting [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to try to identify
the track, and then if there is remaining ambiguity the local
metadata is used to choose a single URI.
      </p>
      <p>Ideally, we could use the ngerprint of an audio le as just
another piece of information about the track,
incorporating it into our graph mapping approach. In practice, the
main ngerprinting service available with a large
supporting database, MusicIP's MusicDNS service12, is relatively
opaque. Fingerprinting a track either returns a PUID, which
can be used to perform a search on MusicBrainz, or returns
no results. It therefore provides only a boolean test for
similarity, and some hidden decisions have been made by the
MusicDNS service using any available metadata. As a
result, there is no obvious way to uniformly combine
ngerprint information and local metadata in a graph mapping
approach.</p>
      <p>A ngerprinting service which exposed the server-side
database and the actual process of matching a ngerprint
to a database entry could allow for some more sophisticated
linkage between personal audio collections and the Semantic
Web. We have some hopes that Last.fm's recently launched
ngerprinting service might gather a large database and
make it freely available in a exible way.</p>
      <p>Although the core of the GNARQL tool is in place, more
work is required to explore di erent approaches to
crawling. Also, since Semantic Web user interfaces are relatively
12See http://www.musicip.com/dns/</p>
    </sec>
    <sec id="sec-16">
      <title>6. CONCLUSION</title>
      <p>In this paper, we described several di erent methods for
interlinking Semantic Web datasets. We mentioned two nave
approaches, leading to the construction of a more elaborate
algorithm, which takes into account not only the similarity
of the resources themselves but also the similarity of their
neighbours. Its main advantages are to provide a best-e ort
mapping, without any need for a learning step (for which
we would have to manually interlink some resources), and to
work in a linked data environment, where new resources are
discovered as we get through the mapping process. We
described two implementations of this algorithm. The rst one
was used to interlink artists, records, and tracks in two
online music datasets: Jamendo and Musicbrainz, and the
second one allows any user to link their personal music
collection to corresponding identi ers in the Musicbrainz dataset.
We evaluated these two implementations separately.
Creating links between heterogeneous datasets can
dramatically enhance the usefulness of each. Using such links, a
Semantic Web user agent can jump from a band within the
Jamendo dataset to the corresponding resource in the
Musicbrainz one, to the corresponding resource in DBpedia,
to its approximate geographic location, to the famous
composers born in that city, etc. However, if it is possible to
de ne such links manually for small datasets, it is
impossible for large ones. We need methodologies to discover them
in an automated way. The techniques presented here are
far from perfect, but represent some initial e orts in this
direction.</p>
    </sec>
    <sec id="sec-17">
      <title>7. DATASETS</title>
      <p>The following datasets are mentioned throughout the paper:
Jamendo on DBTune: http://dbtune.org/jamendo/
BBC John Peel sessions: http://dbtune.org/bbc/peel/
SBSimilarity: http://www.isophonics.net/SBSimilarity
Musicbrainz RDF: http://zitgist.com/music/
DBpedia: http://dbpedia.org/
Geonames: http://geonames.org/</p>
    </sec>
    <sec id="sec-18">
      <title>8. NAMESPACES</title>
      <p>We use the following namespaces throughout our RDF
examples:
@prefix mo: &lt;http://purl.org/ontology/mo/&gt;.
@prefix foaf: &lt;http://xmlns.com/foaf/0.1/&gt;.
@prefix owl: &lt;http://www.w3.org/2002/07/owl#&gt;.</p>
    </sec>
    <sec id="sec-19">
      <title>9. ACKNOWLEDGEMENTS</title>
      <p>The authors acknowledge the support of both the
Centre For Digital Music and the Department of Computer
Science at Queen Mary University of London for the
studentship for Yves Raimond. This work has been partially
supported by the EPSRC-funded ICT project OMRAS-2
(EP/E017614/1).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , G. Kobilarov,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ives</surname>
          </string-name>
          .
          <article-title>Dbpedia: A nucleus for a web of open data</article-title>
          .
          <source>In Proceedings of the International Semantic Web Conference</source>
          , Busan, Korea, November
          <volume>11</volume>
          -15
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Tim</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked data</article-title>
          .
          <source>World wide web design issues</source>
          ,
          <year>July 2006</year>
          . Available at http://www.w3.org/DesignIssues/LinkedData.html.
          <source>Last accessed September</source>
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Bizer</surname>
          </string-name>
          , Tom Heath,
          <string-name>
            <given-names>Danny</given-names>
            <surname>Ayers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Yves</given-names>
            <surname>Raimond</surname>
          </string-name>
          .
          <article-title>Interlinking open data on the web</article-title>
          .
          <source>In Demonstrations Track, 4th European Semantic Web Conference</source>
          , Innsbruck, Austria,
          <year>2007</year>
          . Available at http://www.eswc2007.org/pdf/demo-pdf/ LinkingOpenData.pdf.
          <source>Last accessed September</source>
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cano</surname>
          </string-name>
          , E. Batlle,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kalker</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Haitsma</surname>
          </string-name>
          .
          <article-title>A review of audio ngerprinting</article-title>
          .
          <source>The Journal of VLSI Signal Processing</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ):
          <volume>271</volume>
          {
          <fpage>284</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Jeremy</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Carroll</surname>
            , Christian Bizer, Pat Hayes, and
            <given-names>Patrick</given-names>
          </string-name>
          <string-name>
            <surname>Stickler</surname>
          </string-name>
          .
          <article-title>Named graphs</article-title>
          .
          <source>Journal of Web Semantics</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Michiel</given-names>
            <surname>Hildebrand</surname>
          </string-name>
          , Jacco van Ossenbruggen,
          <string-name>
            <given-names>and Lynda</given-names>
            <surname>Hardman</surname>
          </string-name>
          .
          <source>The Semantic Web - ISWC</source>
          <year>2006</year>
          , volume
          <volume>4273</volume>
          /2006 of Lecture Notes in Computer Science, chapter
          <article-title>/facet: A Browser for Heterogeneous Semantic Web Repositories</article-title>
          , pages
          <volume>272</volume>
          {
          <fpage>285</fpage>
          . Springer Berlin / Heidelberg,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Melnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and E.</given-names>
            <surname>Rahm</surname>
          </string-name>
          .
          <article-title>Similarity ooding: a versatile graph matching algorithm and itsapplication to schema matching</article-title>
          .
          <source>In Proceedings of the 18th International Conference on Data Engineering</source>
          , pages
          <volume>117</volume>
          {
          <fpage>128</fpage>
          , San Jose, CA, USA, February-March
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Neiling</surname>
          </string-name>
          .
          <article-title>Data fusion with record linkage</article-title>
          . In I. Schmitt,
          <string-name>
            <given-names>C.</given-names>
            <surname>Turker</surname>
          </string-name>
          , E. Hildebrandt, and M. Hoding, editors,
          <source>Proceedings of the Workshop 'Foederierte Datenbanken', Aachen</source>
          ,
          <year>1998</year>
          . Available at http://citeseer.ist.psu.edu/189652.html.
          <source>Last accessed January</source>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yves</given-names>
            <surname>Raimond</surname>
          </string-name>
          , Samer Abdallah,
          <string-name>
            <given-names>Mark</given-names>
            <surname>Sandler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frederick</given-names>
            <surname>Giasson</surname>
          </string-name>
          .
          <article-title>The music ontology</article-title>
          .
          <source>In Proceedings of the International Conference on Music Information Retrieval</source>
          , pages
          <volume>417</volume>
          {
          <fpage>422</fpage>
          ,
          <year>September 2007</year>
          . Available at http://ismir2007.ismir.
          <source>net/proceedings/ ISMIR2007_p417_raimond.pdf. Last accessed January</source>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Fabian</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Suchanek</surname>
            , Gjergji Kasneci, and
            <given-names>Gerhard</given-names>
          </string-name>
          <string-name>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Yago - a core of semantic knowledge</article-title>
          .
          <source>In 16th international World Wide Web conference</source>
          ,
          <year>2007</year>
          . Available at http://www.mpi-inf.mpg.de/ ~suchanek/publications/www2007.pdf. Last accessed january
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jan</surname>
            <given-names>Wielemaker</given-names>
          </string-name>
          , Zhisheng Huang,
          <string-name>
            <surname>and Lourens Van Der Meij.</surname>
          </string-name>
          <article-title>SWI-Prolog and the web</article-title>
          .
          <source>Theory and Practice of Logic Programming</source>
          ,
          <year>2003</year>
          . Available at http://hcs.science.uva.nl/projects/SWI-Prolog/ articles/TPLP-plweb.
          <source>pdf. Last accessed September</source>
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Winkler</surname>
          </string-name>
          .
          <article-title>Advanced methods for record linkage</article-title>
          .
          <source>Technical report</source>
          , Statistical Research Division, Washington, DC: U.S.
          <article-title>Bureau of the Census</article-title>
          .,
          <year>1994</year>
          . Available at http://citeseer.ist.psu.edu/254560.html.
          <source>Last accessed January</source>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>