Automatic Interlinking of Music Datasets
                               on the Semantic Web

                                Yves Raimond, Christopher Sutton and Mark Sandler
                                                 Centre for Digital Music
                                             Queen Mary, University of London
                           {yves.raimond,chris.sutton,mark.sandler}@elec.qmul.ac.uk


ABSTRACT                                                         in a Creative Commons label is linked to its location in the
In this paper, we describe current efforts towards interlink-    Geonames dataset, and to the corresponding resource in an
ing music-related datasets on the Web. We first explain          editorial database1 :
some initial interlinking experiences, and the poor results
obtained by taking a naı̈ve approach. We then detail a par-
                                                                 <http://dbtune.org/jamendo/artist/5>
ticular interlinking algorithm, taking into account both the       foaf:based_near <http://sws.geonames.org/2991627/> ;
similarities of web resources and of their neighbours. We          owl:sameAs <http://zitgist.com/music/artist/
detail the application of this algorithm in two contexts: to                     0781a3f3-645c-45d1-a84f-76b4e4decf6d>.
link a Creative Commons music dataset to an editorial one,
and to link a personal music collection to corresponding web
                                                                 Then, you may access the actual audio content from the Cre-
identifiers. The latter provides a user with personally mean-
                                                                 ative Commons label, some extra information such as the
ingful entry points for exploring the web of data, and we
                                                                 birth dates of the members of this band from the editorial
conclude by describing some concrete tools built to generate
                                                                 dataset, and detailed geographic information (latitude, lon-
and use such links.
                                                                 gitude, hierarchy of geographical features) from the Geon-
                                                                 ames dataset.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous            For small datasets published manually (such as an individ-
                                                                 ual’s FOAF file), it is possible to create such links manu-
General Terms                                                    ally. However, doing so for large datasets is impractical: we
Linked Data                                                      need a way to automatically detect the overlapping parts of
                                                                 heterogeneous datasets. In this paper, we detail a few algo-
Keywords                                                         rithms that have been developed, implemented and practi-
Semantic-Web,Linked Data,Music                                   cally deployed to interlink different music-related datasets.
                                                                 We mainly focus on the most sophisticated one, applicable
                                                                 in a Linked Data context, and taking into account not only
1.    INTRODUCTION                                               the similarities of single resources but also the similarities
The Linking Open Data community project [3] aims at pub-         of their neighbours. We evaluate how this algorithm per-
lishing and interlinking open datasets by following simple       forms when applied to link a real-world Creative Commons
rules [2] for linking data. The publication step can to some     dataset to an editorial one. We also show how a personal
extent be automated, using tools such as D2R or Open-            music collection can be treated as one such dataset, enabling
Link Virtuoso (which both allow relational databases to be       a user to benefit from the growing body of knowledge on the
published as linked data) or P2R (which allows SWI-Prolog        Semantic Web in a personally meaningful way.
knowledge bases to be published in the same way). All these
tools handle declarative mappings from a given data struc-       We define the mapping problem as follows. We con-
ture to corresponding web resources and associated RDF           sider two RDF datasets D1 and D2 , respectively describ-
descriptions. Once this publication is achieved, we still have   ing a number of web resources ri and si . We consider
to create links to other datasets, in order for a user agent     the problem of matching resources—finding resources sy
to navigate from one to another. A typical example of such       in our target dataset, D2 , which identify the same ob-
interlinking would be the following one, where a music band      ject as a resource rx in our seed dataset, D1 . For ex-
                                                                 ample, we want to find sy = http://zitgist.com/music/artist/
                                                                 0781a3f3-645c-45d1-a84f-76b4e4decf6d for rx = http://dbtune.org/
                                                                 jamendo/artist/5. All mapping problems in our context can be
                                                                 reduced to this one, even literal expansion where a resource
                                                                 rx is linked to a literal l and we are looking for a resource
                                                                 sy corresponding to l—for example, expanding “Moselle,
                                                                 France” into http://sws.geonames.org/2991627/. In these cases,
                                                                 1
LDOW’08, April 28, 2008, Beijing, China                            We use the Turtle notation throughout the paper, with the
                                                                 namespaces defined in § 8
a simple transformation (creating a blank node between rx
and l) is enough to return ourselves to a resource matching
problem.

2.      NAIVE INTERLINKING
In this section, we describe some naı̈ve approaches to tack-
ling this resource matching problem, and identify their fail-
ings.

2.1      Simple literal lookups
Most datasets provide a literal search facility, either through
a dedicated Web service (eg. Geonames, Musicbrainz), or
through a SPARQL end-point on which we can use filters on
literals, or in some cases built-in literal matching functional-    Figure 1: Example datasets—we here try to match ele-
ity. So one solution to link resources from our two datasets        ments from A and B
would be to first issue the following query on D1 :
                                                                            UNION
                                                                            {?r p:wikiPageUsesTemplate
SELECT ?l                                                                     <http://dbpedia.org/resource/Template:single_infobox>}
WHERE { <r> ?p ?l } FILTER (isLiteral(?l))                              }
                                                                    }

and use the bindings of ?l to issue a literal search on D2 .
We can then try to map the resulting resources to r.                This approach was used to link the musical works and the
                                                                    artists in the BBC John Peel sessions dataset to correspond-
We used this kind of approach to link the Jamendo dataset           ing resources in DBpedia. Constraints on the target re-
to the Geonames one2 . Jamendo provides information about           sources were manually defined, and queries taking into ac-
the location of artists, in the form of a literal string (such as   count the seed literal and these constraints were issued to
“Moselle, France”). We use this to query the Geonames web           the DBpedia SPARQL end-point.
service, and get back the corresponding Geonames resource.
In practice, we found that the literal strings provided by          However, even with restrictions on the nature of target re-
Jamendo are specific enough for this approach to work well.         sources, a literal may not be discriminating enough. For ex-
In the two instances when more than one candidate was               ample, the resource http://dbtune.org/jamendo/artist/5 is linked
returned for a location string, no link was created.                to the literal “Both”. Searching the Musicbrainz dataset for
                                                                    “Both” while restricting ourselves to artists gives us two re-
2.2      Extended literal lookups                                   sults. Likewise, when looking for the Nirvana version of the
Although this simple approach can be suitable for interlink-        song “Love Buzz” in the DBpedia dataset, we have to dis-
ing particular datasets in cases where a literal string reliably    ambiguate between the original song and the Nirvana cover.
provides sufficient disambiguation it is unlikely to discrim-       There is therefore a need for a more sophisticated algorithm,
inate suitably in most cases. For example, when trying to           capable of handling such disambiguation.
apply it to link musical works in the BBC John Peel Sessions
to resources in the DBpedia dataset [1], we come across the         3.       GRAPH MATCHING
literal “Violet”3 . The actual song which the algorithm should
                                                                    An intuitive approach to disambiguate two artists with the
link to is just one of the sixteen results of the corresponding
                                                                    same name would be to check the titles of their releases,
literal lookup.
                                                                    and see if they match the titles of the releases in our seed
                                                                    dataset. If by any chance they have releases with the same
One solution is to add constraints on the resulting resources.
                                                                    title, we can check their track titles, and disambiguate using
For example, by using the DBpedia links to Yago [10], we
                                                                    them, and so on. In this section, we develop this idea and
can restrict our resources to be of a specific Yago type, or we
                                                                    give a formal specification of such an algorithm.
can restrict it to be linked to a particular infobox URI (spec-
ifying a template for structured data on the corresponding
Wikipedia page). This leads us to the following SPARQL              3.1        Offline graph matching
query:                                                              Consider the two datasets illustrated in fig. 1, with our seed
                                                                    dataset on the left containing a single artist with the name
                                                                    “Both”, and the target dataset on the right containing two
                                                                    artists named “Both”. We model the two datasets as graphs,
PREFIX p: <http://dbpedia.org/property/>
SELECT ?r
                                                                    with each edge represented as a triple (s, p, o).
WHERE
{ ?r ?p "Violet"@en.                                                As a first step, we compute initial similarity values between
  {                                                                 all pairs of resources (s1 , s2 ) and (o1 , o2 ) such that (s1 , p, o1 ) ∈
    {?r a <http://dbpedia.org/class/yago/Song107048000>}
                                                                    D1 and (s2 , p, o2 ) ∈ D2 . Such similarity values might be
2
    All links to mentioned datasets are available in § 7            calculated by a string similarity algorithm (such as the ones
3
    For the resource http://dbtune.org/bbc/peel/work/1498           described in section 2 of [12]) comparing literals directly at-
                                                                   issue a query (through a SPARQL end-point or custom web
            Table 1: Initial similarity map                        service) to D2 which involves l and constraints over what we
        Resource 1 Resource 2    Initial Similarity                are looking for. This gives us a list of resources.
        1           4            1
        1           7            1                                 For each sk in this list, we access Gsk . Now, for all pos-
        2           5            0.9                               sible graph mappings MGr :Gsk ,i , we compute a measure as
        3           6            0.8                               defined in § 3.1. If we can make a clear decision now (ie.
        2           6            0.1                               there is just one measure above our decision threshold), we
        3           5            0.1                               terminate and choose the corresponding graph mapping. If
        2           8            0.1                               not, we look for object properties p such that (r, p, o) ∈ Gr
        3           8            0.1                               and (sk , p, o0 ) ∈ Gsk , and we obtain Go and G0o . 4 Then,
                                                                   we update our possible graph mappings and the associated
                                                                   measures. We iterate this process until we can make a deci-
Table 2: Possible graph mappings—(x, y) ≡ {x                       sion (we have one unique mapping with measure above the
owl:sameAs y}, and associated measures                             threshold), or until we can’t go any further (no unexplored
     Graphs   Mapping                         Measures             object properties). Practically, we also limit the maximum
    G1 G2 MG1:G2 a = {(1, 4), (2, 5), (3, 6)} 0.9                  number of iterations the algorithm may perform.
    G1 G2 MG1:G2 b = {(1, 4), (2, 6), (3, 5)} 0.4
    G1 G3 MG1:G3 a = {(1, 7), (2, 8)}         0.55                 In our example, we first dereference http://dbtune.org/
    G1 G3 MG1:G3 b = {(1, 7), (3, 8)}         0.55                 jamendo/artist/5. We get access to the following facts: our
                                                                   URI is identifying a musical band, it is called “Both”, and
                                                                   it made 5 two things: http://dbtune.org/jamendo/record/174
tached to these resources. In our example, this produces the       and http://dbtune.org/jamendo/record/33. We now look for
results in table 1.                                                an artist named “Both” in the Musicbrainz dataset,
                                                                   through the Musicbrainz web service6 .               This gives
Next, we construct a graph similarity measure. In our ex-          us back two URIs:                http://zitgist.com/music/artist/
ample, we consider the possible graph mappings in table 2.         5f9f2dfb-76f0-4872-ad7d-f9d84a908cb5 and http://zitgist.com/
Then, we associate a measure with such mappings: we sum            music/artist/0781a3f3-645c-45d1-a84f-76b4e4decf6d. We derefer-
the similarity values s associated with each pair (x, y) and       ence them: the first one identifies an artist named “Both”
we normalise it by the number of pairs in the mapping. In          which made two things, and the second one also an artist
our example, the resulting measures are in table 2.                named “Both” which made one thing.

Finally, we choose the mapping whose similarity measure is         We now consider two possible graph mappings (correspond-
the highest, optionally thresholding to avoid making map-          ing to the two potential matches of our artist resource), with
pings between graphs which are too dissimilar. In our ex-          two measures, both equal to 1. We continue looking for fur-
ample, we choose MG1:G2 a.                                         ther clues, as there is not yet any way to disambiguate be-
                                                                   tween the two. We take the object property occurring in our
3.2    Linked Data context                                         three graphs, f oaf :made, and dereference all the objects of
                                                                   this property that we currently know about. Our starting re-
Now, we apply this algorithm in a Linked Data [2] context,
                                                                   source made two records, named “Simple Exercice” and “En
where we discover our graphs as we go: the main idea being
                                                                   attendant d’aller sur Mars”. The first matching artist in the
that we update the graph mappings and their measures as
                                                                   Musicbrainz dataset made two records, named “Simple ex-
we update our local Semantic Web cache. This is necessary
                                                                   ercice” and “En attendant d’aller sur Mars...”. The second
because in general the size of the datasets D1 and D2 will
                                                                   matching artist made one record, named “The Inevitable
be prohibitive; if not for loading both datasets, then for
                                                                   Phyllis”. We now update the possible graph mappings, and
computing all the possible graph mappings between them.
                                                                   reach the results in table 2. Now, we have one mapping
                                                                   identifiably better than the others, with graph similarity
Our starting point is now a single URI r in a dataset D1 .
                                                                   measure 0.9. We choose it, and hence derive the following
We try to find the corresponding s in a dataset D2 , as well
                                                                   statements:
as mappings of resources in the neighbourhood of r. We
illustrate this using the same example as the one in § 3.1:
we want to map the URI http://dbtune.org/jamendo/artist/5 to
the corresponding Musicbrainz URI. As part of the same             <http://dbtune.org/jamendo/artist/5> owl:sameAs
                                                                              <http://zitgist.com/music/artist/
process we wish to map corresponding albums and tracks                          0781a3f3-645c-45d1-a84f-76b4e4decf6d>.
for this artist.                                                   <http://dbtune.org/jamendo/record/174> owl:sameAs
                                                                              <http://zitgist.com/music/record/
                                                                                3042765f-67ba-49ef-ab28-45805fabef4a>.
In the following, we will use Named Graphs [5], in order to        <http://dbtune.org/jamendo/record/33> owl:sameAs
track the provenance of a particular graph, and we let Gx                     <http://zitgist.com/music/record/
denote the graph retrieved when dereferencing x.                                fade0242-e1f0-457b-99de-d9fe0c8cbd57>.

                                                                   4
The first thing we do is to retrieve Gr , and extract a suitable     We could additionally consider triples of the form (o, p, r)
label l for r (using dc:title or f oaf :name properties, etc).     and (o0 , p, sk ).
                                                                   5
Now, we need to access some potential candidates for s. To           Captured through the f oaf :made predicate
                                                                   6
do that, we use the same approach as described in § 2.2. We          See http://wiki.musicbrainz.org/XMLWebService
Having chosen a mapping we could go further, to also derive                              0
                                                                                       Ok,r,p  is the list of all o such that (r0 , p, o) ∈ Gr0
such statements for the tracks in these two albums. An-                                                                                      0
                                                                                                              S
                                                                                    Foreach Objmapk,i ∈ r combinations(Ok,r,p , Ok,r,p          ):
other possible extension of this algorithm is to perform lit-                          simk,i = (measurek + measure(Objmapk,i ))/(|Mk | +
eral lookups in D2 at each step (therefore providing new                    |Objmapk,i |)
possible graph mappings each time). This helps ensure we                       If no simk,i , fail
still find the correct mapping in the case that our initial                     If simk,i > threshold for exactly one {k, i} pair, return
                                                                            append(Mk , Objmapk,i )
literal lookup does not include the correct resource among                     Else, N ewM appings is the list of all append(Mk , Objmapk,i ),
its results. For example, the correct target artist might be                and return propagate(N ewM appings) (if the maximum number
listed as having a different name in D2 as in D1 , such that                of recursions is not reached, otherwise fail)
they do not feature in the results of our initial literal lookup.
However, we might have some clues about who this artist is                  Now, we apply this pseudocode to our earlier “Both”
from the names of the albums they produced, and so per-                     example (r = 1 = http://dbtune.org/jamendo/artist/5), with a
                                                                            threshold of 0.8. This makes us go through the following
forming additional literal lookups (on the album titles) may                steps:
allow us to find the correct artist and hence the correct map-
ping. Such a variant of this algorithm is implemented in the                lookup(r) = {4, 7}
GNAT software described in § 4.2.                                           M1 = {(1, 4)}, sim1 = 1
                                                                            M2 = {(1, 7)}, sim2 = 1
                                                                            M appings = {{(1, 4)}, {(1, 7)}}
3.3     Algorithm definition
The algorithm described above can be expressed in the                       propagate({{(1, 4)}, {(1, 7)}})
following pseudo-code. We assume the existence of a                            k = 1, p = f oaf :made
                                                                                                              0
                                                                                 O1,1,f oaf :made = {2, 3}, O1,4,f
function string similarity(x, y) and define the following                                                          oaf :made = {5, 6}
                                                                                 Objmap1,1 = {(2, 5), (3, 6)}, Objmap1,2 = {(2, 6), (3, 5)}
additional functions:                                                            sim1,1 = 0.9, sim1,2 = 0.4
                                                                               k = 2, p = f oaf :made
function similarity(x, y) :                                                                                   0
                                                                                 O2,1,f oaf :made = {2, 3}, O2,7,f oaf :made = {8}
   Extract a suitable label lx for x in Gx                                       Objmap2,1 = {(2, 8)}, Objmap2,2 = {(3, 8)}
   Extract a suitable label ly for y in Gy                                       sim2,1 = 0.55, sim2,2 = 0.55
   Return string similarity(lx , ly )                                           Now, sim1,1 is the only simk,i above the threshold, we
                                                                            therefore choose {(1, 4), (2, 5), (3, 6)} as our mapping, which
                                                                            corresponds to the RDF code in § 3.2.
function lookup(x) :
   Extract a suitable label lx for x in Gx                                  Of course, several heuristics could be added to this pseudo-code,
   Perform a search for lx on D2                                            in order to improve the scalability of the algorithm. In practice,
   Return the set of resources retrieved from the search                    we associate weights to properties, in order to start from the
                                                                            most informative one (f oaf :made, for example).
function measure(M ) :
   Foreach (ri , rj ) ∈ M                                                   4.     EXPERIMENTS
     simi,j = similarity(ri , rj )                                          In this section, we detail two experiments using this algo-
          P
   Return i,j simi,j                                                        rithm, and their respective evaluations. The first one deals
                                                                            with the automatic interlinking of two online music datasets.
function combinations(O1 , O2 ) :                                           The second one deals with the linking of a personal music
   Return all possible combinations of                                      collection towards corresponding web identifiers.
   elements of O1 and elements of O2
   e.g. combinations({1, 2}, {3, 4}) =
            {{(1, 3), (2, 4)}, {(1, 4), (2, 3)}}
                                                                            4.1     Linking two overlapping web datasets
                                                                            In this section, we focus on a concrete interlinking which has
Our starting point is a URI r in D1 and a decision                          been achieved using this algorithm, between two overlapping
threshold threshold. Our mapping pseudo-code is then                        web datasets: Jamendo and Musicbrainz. We implemented
defined as:                                                                 this algorithm7 in SWI-Prolog [11], with only one lookup
                                                                            on the Musicbrainz end-point, at the artist level. Our algo-
                                                                            rithm derived 10944 similarity statements for artist, record,
                                                                            and track resources so far, which allows us to get detailed ed-
Foreach sk ∈ lookup(r)                                                      itorial information from Musicbrainz, and the actual audio
    Mk = {(r, sk )}                                                         content, as well as tags, from the Jamendo dataset.
    measurek = measure(Mk )
    simk = measurek /|Mk |                                                  We focus our evaluation on artist resources. As we perform
If simk > threshold for exactly one k, return Mk                            only one lookup, at the artist level, no tracks or records
Else, M appings is the list of all Mk , and                     return
propagate(M appings)
                                                                            can be matched if the artist is not. In order to evaluate
                                                                            the quality of the interlinking, we take a random sample
function propagate(M appings) :                                             from the Jamendo dataset: we first collect every single artist
   Foreach Mk ∈ M appings:                                                  URI8 , and we randomly select 60 from among them. Then,
       measurek = measure(Mk )                                              7
        Foreach p s.t. (∃(r, r0 ) ∈ Mk , ∃(r, p, o) ∈                Gr ,     The source       code of all the software mentioned
∃(r0 , p, o0 ) ∈ Gr0 and ∀M ap ∈ M appings, (o, o0 ) ∈   / M ap):           is available       as part of the motools project:
          Foreach (r, r0 ) ∈ Mk :                                           http://sourceforge.net/projects/motools
                                                                            8
              Ok,r,p is the list of all o such that (r, p, o) ∈ Gr              The results of such a SPARQL query are available at
Table 3: Evaluation of the Jamendo/Musicbrainz
interlinking
                 Link derived Link not derived
       Correct   5            53
       Incorrect 0            2


we run our mapping algorithm and manually check whether
the mappings are correct. Each tested resource therefore
falls into one of the following categories:


      • An owl:sameAs link is derived : correct (same artist
        in the Jamendo and in the Musicbrainz datasets) ;
      • An owl:sameAs link is derived : incorrect (different        Figure 2: Constructing a graph from local music meta-
        artists) ;                                                  data

      • No link is derived : correct (there is a corresponding
        artist in the Musicbrainz dataset) ;                        statements making the links between local audio files and the
                                                                    remote manifestation identifiers. The fingerprinting func-
      • No link is derived : incorrect (no corresponding artists    tionality can be useful when the metadata available is par-
        in the Musicbrainz dataset).                                ticularly poor, but since it is highly dependent on the fin-
                                                                    gerprinting service chosen, we concern ourselves here solely
                                                                    with the metadata-based approach.
The results9 , in terms of how many resources fall within each
of the above defined categories, are shown in table 3. In our
                                                                    All modern audio encodings allow for the inclusion of meta-
test dataset, the disambiguation was needed in 16 cases.
                                                                    data “tags” alongside audio data in a file, such that each
For example, one of the artist resources was named “Hair”,
                                                                    audio file can include the kind of editorial information con-
which matches four resources within Musicbrainz, none of
                                                                    tained in the data sets described above. We can therefore
them being the same band. The first case that failed is due
                                                                    consider a reasonably well tagged personal music collection
to an implementation mistake, failing to normalise the graph
                                                                    to be just another music data set, apply the algorithm de-
similarity measures correctly when the target graph is bigger
                                                                    scribed in § 3, and hence link each local audio file to a cor-
than the seed one (in this case, the artist had two releases
                                                                    responding resource on the Semantic Web. GNAT uses the
on Musicbrainz, and just one on Jamendo). The second case
                                                                    variant discussed at the end of § 3.2 which maps artists, al-
that failed is due to the fact that the Musicbrainz RDF is
                                                                    bums and tracks, performing literal lookups at each stage.
outdated (the artist does not exist in the RDF dump, but
                                                                    For each local audio file, a simple seed graph is constructed
does exist in the Musicbrainz database).
                                                                    based on the artist, album, title and track number specified
                                                                    in the file’s ID3 tag (see fig. 2 for an example). It proceeds
4.2      Linking personal music collections                         as set out in § 3.3 until a single best mapping is found. After
Personal music collections can also be a part of the web            processing a directory of files in this way, GNAT outputs an
of data. The Music Ontology [9] makes the same distinc-             RDF file providing mo:available_as links from URIs in the
tion as FRBR between manifestations (all physical objects           Zitgist publication of MusicBrainz data to the local audio
that bear the same characteristics, eg. a particular album)         files. For example :
and items (a concrete entity, eg. my copy of the album on
CD). A manifestation and a corresponding item are linked
through a predicate mo:available_as. Therefore, given a             <http://zitgist.com/music/track/
                                                                     1adfecb7-875f-4203-b3b1-8e2e643f94a2> mo:available_as
set of audio files in a personal music collection, it is possible       <file:///mnt/music/Artists/Nirvana/Bleach/track5.mp3>
to keep track of the set of statements linking this collection
to identifiers elsewhere in the Semantic Web which denote
the corresponding manifestations. These statements provide          Using this algorithm rather than one of the more naı̈ve ap-
a set of entry points to the Semantic Web, allowing access          proaches should allow GNAT to be robust to various inac-
to information such as the birth date of the artists responsi-      curacies in the local files’ metadata. We evaluated GNAT’s
ble for items in the collection, geographical locations of the      behaviour in the face of such problems by taking a correctly-
recordings, etc.                                                    tagged MP3 file of the Beatles track “I want to hold your
                                                                    hand” and artificially introducing various mistakes. The Mu-
GNAT is an implementation of automatic linking from a               sicBrainz dataset lists no less than 25 releases for this track
personal audio collection to the Musicbrainz dataset—it uses        by The Beatles, and dozens of artists with songs of the same
audio fingerprinting and available metadata to find corre-          name. The correct set of metadata is shown in table 4 and
sponding dereferencable identifiers, and then outputs RDF           results are shown in table 5.
http://dbtune.org:2105/sparql/?query=select%20distinct%20%3Fa
%20where%20%7B%3Fa%20a%20mo%3AMusicArtist%7D                        We can see that GNAT performs well in the face of inaccu-
9                                                                   rate metadata. The release chosen when the album field is
    The detailed results, resource by resource, are available at
http://moustaki.org/resources/results.txt                           missing or set to a random string is arguably correct—it is
                            Table 5: Evaluation of GNAT’s linkage on particular cases
                        Change made                   Resulting mapping
                        Artist field missing                      Correct
                        Artist set to random string               Correct
                        Artist set to “Beetles”                   Correct
                        Artist set to “Al Green”                  Mapped to Al Green’s cover version
                        Album field missing                       Mapped to correct track on the release
                        Album set to random string                “The Capitol Albums, Volume 1 (disc 1:
                        Album set to “Meat the Beatles”           Meet the Beatles!)”
                        Track set to random string                Correct
                        Track set to “I Wanna Hold Your Hand”     Correct


                                                                       can prioritise links which use properties from the Music On-
Table 4: Base (correct) metadata for GNAT evalu-                       tology (or other specified namespaces). This helps ensure
ation                                                                  that relevant information is prioritised over less obviously-
       Artist     “The Beatles”
                                                                       useful information.
       Album      “Meet The Beatles!”
       Title      “I Want To Hold Your Hand”
                                                                       Information relating to resources of interest may also be re-
       Track Num 1
                                                                       trieved based on specific rules. For example, if we know
                                                                       that the SBSimilarity service provides “similar track” infor-
                                                                       mation about tracks by appending their Musicbrainz ID to
the same CD, released as part of a box set. One would hope             a given prefix10 , we can specify a rule in GNARQL for gen-
that the release “Meet the Beatles!” would be chosen in the            erating rdfs:seeAlso links from known tracks to the corre-
case of the album being misspelled.                                    sponding documents in the SBSimilarity namespace. These
                                                                       rule-derived links will then be followed as part of the crawl-
With a practical implementation some trade-offs must be                ing strategy, and their information added to the local RDF
made. In the case of setting the artist to “Al Green”, we              store.
have two conflicting pieces of information (artist and album)
and our implementation here chooses the mapping for which              Naturally, the information held in GNARQL’s store need
the artist matches. A more sophisticated version of GNAT               not come solely from GNAT. In fact, any RDF data in the
might consider other tracks in the current directory to es-            designated directory tree will be loaded. In our research
tablish that the track most likely comes from the Beatles              group, this means that Chord Ontology11 transcriptions are
release rather than Al Green’s release.                                held alongside the MusicBrainz data retrieved from Zitgist
                                                                       and social tag data retrieved from various websites.
Such links from a user’s own files to information on the Se-
mantic Web could be a significant step towards making data             To enable applications to take advantage of this aggregated
on the Web available and relevant to people, but a user agent          data, GNARQL provides a SPARQL endpoint. This frees
must act on such links before they are directly useful to a            end-user applications from the need to themselves maintain
human. In the next section we describe a companion tool to             a database relating to the user’s music collection, and al-
GNAT, designed to exploit the links GNAT produces.                     lows multiple applications to benefit from a single store of
                                                                       aggregated information.
4.3    Use-cases                                                       Based on datasets available today, some example queries
For the links between a user’s files and Semantic Web re-              such user interfaces might pass on to GNARQL include
sources to be useful, an application must have some infor-             “Find tracks which are performances of works by Russian
mation about the resources. The GNARQL program in the                  composers around the turn of the twentieth century”, “Find
motools project is beginning to explore some of the pos-               me cover versions of rock songs in non-rock genres” or po-
sibilities in this direction. The program loads in all the             tentially (by using information linked from a user’s FOAF
owl:sameAs links produced by GNAT, dereferences the cor-               file) “Find me gigs by the artists I play frequently which fit
responding URIs, and then aggregates additional informa-               with my vacation schedule”.
tion about those resources.
                                                                       To experiment with browsing the data aggregated by
The basic mechanism for aggregating additional information             GNARQL, we have developed a prototype user interface,
is to crawl outwards from the given resource, dereferencing            based heavily on the /facet program described in [6]. This
linked resources and adding their descriptions to the local            provides a web browser-based interface for exploring the
RDF store. In the simplest case, this crawling can be un-              aggregated information, and performing simple facet-based
guided, simply following all links regardless of the properties
used, and performing a breadth-first traversal of the Seman-          10
                                                                         eg. http://isophonics.net/music/signal/280b7fae-724e-4a6d-8e91-
tic Web from all known resources.                                      6fe3f0a2bdad     provides      extra     information       about
                                                                       http://zitgist.com/music/signal/280b7fae-724e-4a6d-8e91-6fe3f0a2
More sophisticated crawling strategies may lead to better              bdad
                                                                      11
aggregation for a user’s purposes. For example, GNARQL                   See http://purl.org/ontology/chord/
queries. The map functionality allows the results of such            young, more work is required to fully exploit the function-
queries to be plotted geographically, as shown in fig. 3.            ality GNARQL is beginning to exhibit.

5. FUTURE WORK                                                       6.   CONCLUSION
5.1 Work on the interlinking algorithm                               In this paper, we described several different methods for in-
The algorithm proposed here has several limitations. Firstly,        terlinking Semantic Web datasets. We mentioned two naı̈ve
it doesn’t specify any heuristics to use if the ontologies differ.   approaches, leading to the construction of a more elaborate
In this case,we would need a different methodology, perhaps          algorithm, which takes into account not only the similarity
inspired by the approach described in [8]. Secondly, the al-         of the resources themselves but also the similarity of their
gorithm is designed for the case where a meaningful similar-         neighbours. Its main advantages are to provide a best-effort
ity measure between pairs of individual resources is available       mapping, without any need for a learning step (for which
(here, using string similarity of labels attached to them with       we would have to manually interlink some resources), and to
certain predicates). In a linking scenario where there is no         work in a linked data environment, where new resources are
particularly good similarity measure on individual resources         discovered as we get through the mapping process. We de-
and the graph structure is therefore the most salient factor         scribed two implementations of this algorithm. The first one
for correct linking, another algorithm may be more appropri-         was used to interlink artists, records, and tracks in two on-
ate. In this case, Melnik’s “similarity flooding” [7] approach       line music datasets: Jamendo and Musicbrainz, and the sec-
could be used, which prioritises graph structure rather than         ond one allows any user to link their personal music collec-
node similarity, relying on a post-processing stage to filter        tion to corresponding identifiers in the Musicbrainz dataset.
out unsuitable mappings.                                             We evaluated these two implementations separately.

Based on these observations, further work developing the             Creating links between heterogeneous datasets can dramat-
interlinking algorithm could provide a framework for inter-          ically enhance the usefulness of each. Using such links, a
linking a wider variety of datasets.                                 Semantic Web user agent can jump from a band within the
                                                                     Jamendo dataset to the corresponding resource in the Mu-
5.2      Work on implementations                                     sicbrainz one, to the corresponding resource in DBpedia,
Currently, the GNAT tool implements two distinct ap-                 to its approximate geographic location, to the famous com-
proaches to finding manifestation URIs for local audio files.        posers born in that city, etc. However, if it is possible to
One uses just the available metadata, and this approach is           define such links manually for small datasets, it is impossi-
described in § 4.2, above. Since audio metadata in personal          ble for large ones. We need methodologies to discover them
collections is frequently incomplete, inaccurate, or missing         in an automated way. The techniques presented here are
entirely, this may not be sufficient. The other approach             far from perfect, but represent some initial efforts in this
therefore exploits audio fingerprinting [4] to try to identify       direction.
the track, and then if there is remaining ambiguity the local
metadata is used to choose a single URI.
                                                                     7.   DATASETS
                                                                     The following datasets are mentioned throughout the paper:
Ideally, we could use the fingerprint of an audio file as just
another piece of information about the track, incorporat-
                                                                     Jamendo on DBTune: http://dbtune.org/jamendo/
ing it into our graph mapping approach. In practice, the
                                                                     BBC John Peel sessions: http://dbtune.org/bbc/peel/
main fingerprinting service available with a large support-
                                                                     SBSimilarity: http://www.isophonics.net/SBSimilarity
ing database, MusicIP’s MusicDNS service12 , is relatively
                                                                     Musicbrainz RDF: http://zitgist.com/music/
opaque. Fingerprinting a track either returns a PUID, which
                                                                     DBpedia: http://dbpedia.org/
can be used to perform a search on MusicBrainz, or returns
                                                                     Geonames: http://geonames.org/
no results. It therefore provides only a boolean test for sim-
ilarity, and some hidden decisions have been made by the
MusicDNS service using any available metadata. As a re-
sult, there is no obvious way to uniformly combine finger-           8.   NAMESPACES
print information and local metadata in a graph mapping              We use the following namespaces throughout our RDF ex-
approach.                                                            amples:

A fingerprinting service which exposed the server-side
database and the actual process of matching a fingerprint            @prefix mo: <http://purl.org/ontology/mo/>.
to a database entry could allow for some more sophisticated          @prefix foaf: <http://xmlns.com/foaf/0.1/>.
linkage between personal audio collections and the Semantic          @prefix owl: <http://www.w3.org/2002/07/owl#>.
Web. We have some hopes that Last.fm’s recently launched
fingerprinting service might gather a large database and
make it freely available in a flexible way.
                                                                     9.   ACKNOWLEDGEMENTS
                                                                     The authors acknowledge the support of both the Cen-
                                                                     tre For Digital Music and the Department of Computer
Although the core of the GNARQL tool is in place, more
                                                                     Science at Queen Mary University of London for the stu-
work is required to explore different approaches to crawl-
                                                                     dentship for Yves Raimond. This work has been partially
ing. Also, since Semantic Web user interfaces are relatively
                                                                     supported by the EPSRC-funded ICT project OMRAS-2
12
     See http://www.musicip.com/dns/                                 (EP/E017614/1).
              Figure 3: Navigating a personal audio collection using aggregated Semantic Web data


10.   REFERENCES                                             [9] Yves Raimond, Samer Abdallah, Mark Sandler, and
[1] S. Auer, C. Bizer, J. Lehmann, G. Kobilarov,                 Frederick Giasson. The music ontology. In Proceedings
    R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a           of the International Conference on Music Information
    web of open data. In Proceedings of the International        Retrieval, pages 417–422, September 2007. Available
    Semantic Web Conference, Busan, Korea, November              at http://ismir2007.ismir.net/proceedings/
    11-15 2007.                                                  ISMIR2007_p417_raimond.pdf. Last accessed January
[2] Tim Berners-Lee. Linked data. World wide web design          2008.
    issues, July 2006. Available at                         [10] Fabian M. Suchanek, Gjergji Kasneci, and Gerhard
    http://www.w3.org/DesignIssues/LinkedData.html.              Weikum. Yago - a core of semantic knowledge. In 16th
    Last accessed September 2007.                                international World Wide Web conference, 2007.
[3] Chris Bizer, Tom Heath, Danny Ayers, and Yves                Available at http://www.mpi-inf.mpg.de/
    Raimond. Interlinking open data on the web. In               ~suchanek/publications/www2007.pdf. Last accessed
    Demonstrations Track, 4th European Semantic Web              january 2008.
    Conference, Innsbruck, Austria, 2007. Available at      [11] Jan Wielemaker, Zhisheng Huang, and Lourens
    http://www.eswc2007.org/pdf/demo-pdf/                        Van Der Meij. SWI-Prolog and the web. Theory and
    LinkingOpenData.pdf. Last accessed September 2007.           Practice of Logic Programming, 2003. Available at
[4] P. Cano, E. Batlle, T. Kalker, and J. Haitsma. A             http://hcs.science.uva.nl/projects/SWI-Prolog/
    review of audio fingerprinting. The Journal of VLSI          articles/TPLP-plweb.pdf. Last accessed September
    Signal Processing, 41(3):271 – 284, 2005.                    2007.
[5] Jeremy J. Carroll, Christian Bizer, Pat Hayes, and      [12] W. Winkler. Advanced methods for record linkage.
    Patrick Stickler. Named graphs. Journal of Web               Technical report, Statistical Research Division,
    Semantics, 2005.                                             Washington, DC: U.S. Bureau of the Census., 1994.
[6] Michiel Hildebrand, Jacco van Ossenbruggen, and              Available at
    Lynda Hardman. The Semantic Web - ISWC 2006,                 http://citeseer.ist.psu.edu/254560.html. Last
    volume 4273/2006 of Lecture Notes in Computer                accessed January 2008.
    Science, chapter /facet: A Browser for Heterogeneous
    Semantic Web Repositories, pages 272–285. Springer
    Berlin / Heidelberg, 2006.
[7] S. Melnik, H. Garcia-Molina, and E. Rahm. Similarity
    flooding: a versatile graph matching algorithm and
    itsapplication to schema matching. In Proceedings of
    the 18th International Conference on Data
    Engineering, pages 117–128, San Jose, CA, USA,
    February-March 2002.
[8] M. Neiling. Data fusion with record linkage. In
    I. Schmitt, C. Turker, E. Hildebrandt, and M. Hoding,
    editors, Proceedings of the Workshop ’Foederierte
    Datenbanken’, Aachen, 1998. Available at
    http://citeseer.ist.psu.edu/189652.html. Last
    accessed January 2008.