<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pasquale Lisena</string-name>
          <email>pasquale.lisena@eurecom.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphaël Troncy</string-name>
          <email>raphael.troncy@eurecom.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>EURECOM</institution>
          ,
          <addr-line>Sophia Antipolis</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>The aim of this tutorial is first to provide in-depth explanations of DOREMUS, a model for describing music metadata. We will then demonstrate how real data coming from music libraries can be converted to this model by presenting the whole DOREMUS tool chain. Finally, we will illustrate how the DOREMUS data can be used for query answering and consumed through various applications, including an exploratory search engine and music recommender systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology</kwd>
        <kwd>Music Metadata</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Recommender System</kwd>
        <kwd>Graph Embeddings</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Ontologies; Recommender systems;
Semantic web description languages; Music retrieval;</p>
    </sec>
    <sec id="sec-2">
      <title>1 INTRODUCTION</title>
      <p>Music information can be very complex. Describing a classical
masterpiece in all its forms (the composition, the score, the various
publications, a performance, a recording, the derivative works, etc.)
is a demanding activity. An even more challenging task consists in
describing jazz and ethnic music, for which the performance plays a
central role, the music is generally not written and the authorship is
not well defined. In the context of the DOREMUS research project<sup>1</sup>,
we develop tools and methods to manage music catalogues on the
web using semantic web technologies.</p>
      <p>In this tutorial, we show strategies and tools for managing music
knowledge. In Section 2, we present the DOREMUS model for
describing music, together with music-specific controlled
vocabularies. In Section 3, we present tools for converting music
datasets, taking as examples those coming from the rich musical
archives of three leading cultural institutions in France – the
Bibliothèque Nationale de France (BnF), the Philharmonie de Paris (PP)
and Radio France (RF) – which describe musical works, publications,
performances and concerts. We demonstrate the expressiveness
of the model by showing how complex music-specific queries can be
answered. Finally, we describe strategies for data visualisation and
recommendation in Section 4.</p>
    </sec>
    <sec id="sec-3">
      <title>2 A MUSIC DATA MODEL</title>
      <p>
        Among music ontologies, the best-known example is the
Music Ontology [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which provides a set of music-specific classes and
properties for describing musical works, performances and tracks.
      </p>
    </sec>
    <sec id="sec-4">
      <title>2.1 The DOREMUS Ontology</title>
      <p>
        The DOREMUS model<sup>2</sup> is an extension of FRBRoo, a model for describing
cultural objects [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], applied to the specific domain of music. It
is a dynamic model, in which the abstract intention of the author
(called Work) exists only through an Event (i.e. the composition
event) that realises it in a distinct series of choices called Expression.
This Work-Expression-Event triplet can also describe different parts
of the life of a work, such as the Performance, the Publication or the
creation of a derivative Work, each one incorporating the expression
from which it derives.
      </p>
      <p>
        On top of the FRBRoo original classes and properties, specific
ones have been added in order to describe aspects of a work that
are specifically related to music, such as the musical key, the genre,
the tempo, the medium of performance (MoP), etc. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Each triplet contains information that can both
stand on its own and be linked to the other entities. For
a classical work, we will have a triplet for the composition,
one for each performance event, one for every manifestation (e.g.
the score), etc., all connected in the graph. A jazz improvisation,
which consists in the extemporaneous creation of a new work, will
have only the triplet made of the Performance Work, Performance
Expression and Performance Creation: it lacks the moment of
composition and the writing of the score, which are almost mandatory for
classical music, and it does not need to be attached to any other
entity. It is considered a work per se. All the Work entities of each
triplet are then connected to a Complex Work, a class whose
purpose is to gather all the representations, both the
conceptual and the sensory ones (manifestations), of the same creative
idea.</p>
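      <p>To make the triplet structure more concrete, the following minimal Python sketch represents triplets as plain objects grouped under a Complex Work. The class names Triplet and ComplexWork, the kind labels and the placeholder strings are our own illustrative assumptions; they are not identifiers of the DOREMUS ontology, and a real graph would hold URIs instead.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Triplet:
    """One Work-Expression-Event triplet (illustrative, not ontology terms)."""
    kind: str        # e.g. "composition", "performance", "publication"
    work: str        # the abstract intention
    expression: str  # the series of choices realising the work
    event: str       # the event through which the work exists


@dataclass
class ComplexWork:
    """Gathers all the triplets that stem from the same creative idea."""
    title: str
    triplets: List[Triplet] = field(default_factory=list)

    def add(self, kind: str) -> Triplet:
        # Placeholder labels only: a real graph would hold URIs here.
        triplet = Triplet(kind, f"{kind} work", f"{kind} expression", f"{kind} event")
        self.triplets.append(triplet)
        return triplet

    def kinds(self) -> List[str]:
        return [t.kind for t in self.triplets]


# A classical work: composition, performance and publication triplets.
sonata = ComplexWork("Sonata for piano and cello n. 1")
for kind in ("composition", "performance", "publication"):
    sonata.add(kind)

# A jazz improvisation: the performance triplet is the only one.
improvisation = ComplexWork("Improvisation")
improvisation.add("performance")
```
]]></preformat>
      <p>In this sketch the classical sonata carries three triplets, while the improvisation carries only the performance one, mirroring the two cases described above.</p>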
      <p>The result is a model that is quite complex and
hard to adopt, but that offers very detailed expressiveness.
The graph depicted in Figure 1 shows a real example from our data:
Beethoven’s Sonata for piano and cello n. 1<sup>3</sup>.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Music Controlled Vocabularies</title>
      <p>
        A large number of the properties involved in music
description are expected to hold values that are shared among
different entities: different compositions can have “sonata” as genre,
different performers can play a “bassoon”, different authors can have
“composer” or “lyricist” as function. These labels can be expressed
in multiple languages or in alternative forms (e.g. “sax” and
“saxophone”, or the French keys “Do majeur” and “Ut majeur”), making
reconciliation hard. Our choice is to use controlled vocabularies
for those common concepts. A controlled vocabulary is a thematic
thesaurus of entities, each one identified by a URI. We
use SKOS [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as representation model, which allows us to specify
for each concept the preferred and the alternative labels in multiple
languages, to define a hierarchy between the concepts (so that
“violin” is a narrower concept with respect to “string”), and to add
comments and notes for describing the entity and helping the
annotation activity. Each concept becomes a common node in the musical
graph that can connect a musical work to another, an author to a
performer, etc.
        <fn><label>2</label><p>http://data.doremus.org/ontology/</p></fn>
        <fn><label>3</label><p>http://data.doremus.org/expression/614925f2-1da7-39c1-8fb7-4866b1d39fc7</p></fn>
      </p>
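      <p>As an illustration of how such a vocabulary can be used for reconciliation, the sketch below models a concept with preferred labels, alternative labels and a broader concept, and builds an index from every label to the concept URI. It is a simplified stand-in for SKOS written with plain Python objects; the Concept class, the build_label_index function and the example.org URIs are hypothetical.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    uri: str
    pref_labels: Dict[str, str]                       # language -> preferred label
    alt_labels: List[str] = field(default_factory=list)
    broader: Optional[str] = None                     # URI of the broader concept


def build_label_index(concepts):
    """Map every lower-cased label (preferred or alternative) to its URI."""
    index = {}
    for concept in concepts:
        for label in list(concept.pref_labels.values()) + concept.alt_labels:
            index[label.strip().lower()] = concept.uri
    return index


# Hypothetical URIs, for illustration only.
STRING = Concept("http://example.org/vocab/string", {"en": "string", "fr": "cordes"})
VIOLIN = Concept(
    "http://example.org/vocab/violin",
    {"en": "violin", "fr": "violon"},
    broader=STRING.uri,
)
SAX = Concept(
    "http://example.org/vocab/saxophone",
    {"en": "saxophone"},
    alt_labels=["sax"],
)

INDEX = build_label_index([STRING, VIOLIN, SAX])
```
]]></preformat>
      <p>With this index, “sax” and “saxophone” resolve to the same URI, and the broader link lets an application climb from “violin” to “string”.</p>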
      <p>Different kinds of vocabularies are required for describing music.
Some of them are already available on the web: this is the case
of MIMO<sup>4</sup> for describing musical instruments, or RAMEAU<sup>5</sup> for
musical genres, ethnic groups, etc. Some others are not published
in a suitable format for the Web of Data, or the published version
is not as complete as other formats that are available to libraries
or in online sources: this happens with the vocabularies published
by the International Association of Music Libraries (IAML)<sup>6</sup>, which
were published after the start of the project and for which
we sometimes provide more details (labels, languages, etc.). Finally,
there are also vocabularies that did not exist at all and
that we generated on the basis of real data coming from the partners,
enriched by an editorial process that also involved librarians. As
a result, we collected, implemented and published 15 controlled
vocabularies belonging to 6 different categories<sup>7</sup>.
        <fn><label>4</label><p>http://www.mimo-db.eu/</p></fn>
        <fn><label>5</label><p>http://rameau.bnf.fr/</p></fn>
        <fn><label>6</label><p>http://iflastandards.info/ns/unimarc/</p></fn></p>
    </sec>
    <sec id="sec-6">
      <title>3 DATA CONVERSION</title>
      <p>
        Both the French National Library (BnF) and the Philharmonie de Paris
use the MARC format for representing music metadata.
The flat structure of MARC, which consists of a succession of fields
and subfields (Figure 2), reflects its original purpose of converting printed
or handwritten records into a computer-readable form. Although MARC is a
standard, its adoption is restricted to the library world, so that
serialization to other formats (usually XML) is needed for actual
use. MARC fields are also not labeled explicitly but encoded with
numbers, so that a manual is needed to
decipher the content. The semantics of these fields and subfields is
not trivial: a subfield can change its meaning depending on the field
under which it is found, and on the particular variant of MARC
(UNIMARC or INTERMARC). A field or subfield can contain
information about different entities, such as the first performance and the
first publication combined in the same notes field, without a
clear separation. Often, the information is represented as
free text [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        The benefits of moving from MARC to an RDF-based solution
consist in interoperability and integration among libraries
and with third-party actors, with the possibility of realizing smart
federated search [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. In order to achieve these goals, two tasks
are necessary: data conversion and data linking.
        <fn><label>7</label><p>https://github.com/DOREMUS-ANR/knowledge-base/tree/master/vocabularies</p></fn>
      </p>
    </sec>
    <sec id="sec-7">
      <title>3.1 From MARC to RDF</title>
      <p>
        For the conversion task, we rely on marc2rdf<sup>8</sup>, an open source
prototype we developed for the automatic conversion of MARC
bibliographic records to RDF using the DOREMUS ontology [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The
conversion process relies on explicit expert-defined transfer rules
(or mappings) that indicate where in the MARC file to look for each
kind of information, providing the corresponding property path in
the model as well as useful examples that illustrate each transfer
rule, as shown in Figure 3. The role of these rules goes beyond
simply documenting the MARC records: they also embed
information on some librarian practices in the formalisation of
the content (format of dates, agreements on the syntax of textual
fields, default values if the information is absent).
      </p>
      <p>The converter is composed of different modules that work in
succession. First, a file parser reads the MARC file and makes the
content accessible by field and subfield number. We implemented a
converting module for both the INTERMARC and UNIMARC
variants. Then, the converter builds the RDF graph by reading the fields and assigning
their content to the DOREMUS property suggested in the transfer
rules.</p>
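      <p>The first two steps can be sketched as follows. This is not the marc2rdf code: it assumes a simplified textual dump in which each line holds a three-digit field tag followed by subfields introduced by a dollar sign, and the sample record, the rule table and the property names are invented for illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict


def parse_record(text):
    """Parse a simplified MARC dump into {field: [{subfield: value}, ...]}.

    Assumed (hypothetical) line layout: "<tag> $<code> <value> $<code> <value>".
    """
    record = defaultdict(list)
    for line in text.strip().splitlines():
        tag, _, rest = line.strip().partition(" ")
        subfields = {}
        for chunk in rest.split("$")[1:]:
            if not chunk:
                continue  # skip an empty subfield marker
            subfields[chunk[0]] = chunk[1:].strip()
        record[tag].append(subfields)
    return dict(record)


def apply_rules(record, rules):
    """Turn (field, subfield) -> property rules into (property, value) pairs."""
    pairs = []
    for (tag, code), prop in rules.items():
        for occurrence in record.get(tag, []):
            if code in occurrence:
                pairs.append((prop, occurrence[code]))
    return pairs


# Invented sample record and transfer rules (placeholder property names).
SAMPLE = """
144 $a Sonates $b Violoncelle, piano
600 $a 1re éd. : Vienne : Artaria, 1797
"""
RULES = {("144", "a"): "title", ("600", "a"): "note"}

record = parse_record(SAMPLE)
pairs = apply_rules(record, RULES)
```
]]></preformat>
      <p>Keeping the rules in a table separate from the parser reflects the idea that the mappings are defined by experts and can evolve independently of the code.</p>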
      <p>Then the free-text interpreter extracts further information from
the plain text fields, which include editorial notes. This amounts to
knowledge-aware parsing, since we search the string for exactly
the information we want to instantiate in the model (e.g. the
MoP from the casting notes, or the date and the publisher from
the first publication note). The parsing is realized through
empirically defined regular expressions, which we plan to complement with
Named Entity Recognition techniques in future work. Finally, the
string2vocabulary component performs an automatic mapping of
string literals to URIs coming from controlled vocabularies. All
variants of a concept label are considered in order to deal with potential
differences in naming terms. As an additional feature, this component
is able to recognise and correct some noise that is present in the
source MARC file: this is the case of musical keys declared as genres,
or of opus number fields that actually contain a catalog number
and vice versa. These cases and other typos and mistakes have been
identified thanks to the conversion process and the visualization of
the converted data, supporting the source institutions in their work
of constantly updating and correcting their data.</p>
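      <p>The two steps described above can be illustrated with a small sketch. The regular expression assumes a hypothetical layout for the first publication note, and the vocabulary lookup is a simplified stand-in for the string2vocabulary component rather than its actual code; the URIs and labels are invented.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata

# Assumed (hypothetical) note layout: "1re éd. : <place> : <publisher>, <year>"
PUBLICATION_NOTE = re.compile(
    r"1re éd\.\s*:\s*(?P<place>[^:]+?)\s*:\s*(?P<publisher>[^,]+),\s*(?P<year>\d{4})"
)


def parse_publication_note(note):
    """Return place, publisher and year found in a first publication note."""
    match = PUBLICATION_NOTE.search(note)
    return match.groupdict() if match else None


def normalise(label):
    """Lower-case, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", label)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def string2vocabulary(literal, vocabulary):
    """Map a literal to a concept URI, trying every label variant."""
    wanted = normalise(literal)
    for uri, labels in vocabulary.items():
        if wanted in {normalise(label) for label in labels}:
            return uri
    return None


# Invented URIs and label variants.
KEYS = {
    "http://example.org/key/c_major": ["Do majeur", "Ut majeur", "C major"],
    "http://example.org/key/d_minor": ["Ré mineur", "D minor"],
}

parsed = parse_publication_note("1re éd. : Vienne : Artaria, 1797")
key_uri = string2vocabulary("ut  MAJEUR", KEYS)
```
]]></preformat>
      <p>Because the lookup only needs a table of labels per vocabulary, the same function can also be run against the key vocabulary on a value found in a genre field, which is one possible way of detecting the misplaced values mentioned above.</p>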
    </sec>
    <sec id="sec-8">
      <title>3.2 Dealing With Heterogeneous Formats</title>
      <p>Apart from MARC, we are converting other source databases (in XML)
that are too specific to be handled by a single converter.
Therefore, we developed ad hoc software that follows a generic workflow:
parse the input file and collect the required information, create
the graph structure in RDF, run the string2vocabulary module
described previously. This procedure creates different graphs, one for
each source. Those source databases are complementary but also
contain overlaps (e.g. two databases that describe the same work
or the same performance with complementary metadata). We have
started to automatically interlink the datasets, so that the resulting
knowledge graph provides a richer description of each work.
        <fn><label>8</label><p>https://github.com/DOREMUS-ANR/marc2rdf</p></fn></p>
      <p>Before the beginning of the project, a list of questions was
collected from experts of the partner institutions<sup>9</sup>. These questions
reflect real needs of the institutions and reveal problems that they
face daily when selecting information from the database
(e.g. for concert organisation or broadcast programming) or when
supporting librarian and musicologist studies. They can be related to
practical use cases (the search for all the scores that suit a particular
formation), to musicological topics (the music of a certain region in
a particular historical period), to interesting statistics (the works
usually performed or published together), or to curious connections
between works, performances or artists. Most of the questions are
very specific and complex, so that it is very hard to find their answers
by simply querying the search engines currently available on the
web. We have grouped these questions in categories, according to
the DOREMUS classes involved in the question.</p>
      <p>Table 1 provides an overview of how many queries we can
currently write for each category. The implementation of recordings,
scores and performances, which is still work in progress, together with the
interconnection to the LOD repositories, is one important reason
why some questions have not yet been translated into SPARQL
and others return no results.
        <fn><label>9</label><p>https://github.com/DOREMUS-ANR/knowledge-base/tree/master/query-examples</p></fn></p>
    </sec>
    <sec id="sec-9">
      <title>4 EXPLORATION AND RECOMMENDATION</title>
      <p>We consider exploration and recommendation as two sides of the
same coin. With the former, we let users browse the datasets,
discover connections on their own and understand how we build the
knowledge. With recommendation, we take this
responsibility off the user, with the purpose of presenting what they need at a
particular moment.</p>
    </sec>
    <sec id="sec-10">
      <title>4.1 Visualizing the Complexity</title>
      <p>We developed the first version of Overture, a web prototype of
an exploratory search engine for DOREMUS data. The application
makes requests directly to our SPARQL endpoint<sup>10</sup> and presents
the results in a web user interface.</p>
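      <p>As a rough idea of what such a request looks like, the sketch below builds a GET URL following the SPARQL 1.1 Protocol for the endpoint http://data.doremus.org/sparql. The query is deliberately generic and relies only on rdf:type; the function names are our own, and this is not the code of Overture.</p>
      <preformat><![CDATA[
```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://data.doremus.org/sparql"

# Generic query: counts the instances of each class, without using
# any DOREMUS-specific class or property.
QUERY = """
SELECT ?class (COUNT(?entity) AS ?total)
WHERE { ?entity a ?class }
GROUP BY ?class
ORDER BY DESC(?total)
LIMIT 10
"""


def build_request_url(endpoint, query):
    """Build a SPARQL 1.1 Protocol GET URL for the given query."""
    return endpoint + "?" + urlencode({"query": query.strip()})


def run_query(endpoint, query):
    """Send the query and return the result bindings (needs network access)."""
    request = Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return json.load(response)["results"]["bindings"]


url = build_request_url(ENDPOINT, QUERY)
```
]]></preformat>
      <p>Calling run_query(ENDPOINT, QUERY) would send the request; whether the endpoint is reachable and which classes it returns depend on the current state of the service.</p>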
      <p>At the top of the user interface, the navigation bar allows the user
to navigate between the main concepts of the DOREMUS model:
expression, performance, score, recording, artist. The challenge is
to give the end user a complete view of the data of each class
and to let them understand how the classes are connected to each
other. We keep as example Beethoven’s Sonata for piano and cello n. 1<sup>11</sup>.
Aside from the different versions of the title, the composer and a
textual description, the page provides details on the information we
have about the work, such as the musical key, the genres, the intended
MoP and the opus number. When these values come from a controlled
vocabulary, a link is presented in order to search for expressions
that share the same value (for example, the same genre or the same
musical key). A timeline shows the most important events related
to the work (the composition, the premiere, the first publication).
Other performances and publications can be represented below. The
background is a portrait of the composer that comes from DBpedia.
It is retrieved thanks to the presence of
owl:sameAs links in the DOREMUS database. These links come in part from the International
Standard Name Identifier (ISNI) service<sup>12</sup>, and in part from an
interlinking realised by matching the artist name, birth date and death
date across the different datasets.</p>
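      <p>The matching on name and dates can be sketched as follows. The record layout, the local URI and the exact normalisation are assumptions made for illustration; the actual interlinking may use additional evidence.</p>
      <preformat><![CDATA[
```python
import unicodedata


def key(artist):
    """Comparison key: accent-free lower-case name plus birth and death years."""
    name = unicodedata.normalize("NFKD", artist["name"])
    name = "".join(c for c in name if not unicodedata.combining(c)).lower()
    return (" ".join(name.split()), artist["birth"], artist["death"])


def interlink(local_artists, external_artists):
    """Return (local URI, external URI) pairs usable as owl:sameAs links."""
    external_index = {key(a): a["uri"] for a in external_artists}
    return [
        (a["uri"], external_index[key(a)])
        for a in local_artists
        if key(a) in external_index
    ]


# Hypothetical local record and two external candidates.
LOCAL = [
    {"uri": "http://example.org/artist/1", "name": "Ludwig van Beethoven",
     "birth": 1770, "death": 1827},
]
EXTERNAL = [
    {"uri": "http://dbpedia.org/resource/Ludwig_van_Beethoven",
     "name": "Ludwig  van BEETHOVEN", "birth": 1770, "death": 1827},
    {"uri": "http://dbpedia.org/resource/Wolfgang_Amadeus_Mozart",
     "name": "Wolfgang Amadeus Mozart", "birth": 1756, "death": 1791},
]

links = interlink(LOCAL, EXTERNAL)
```
]]></preformat>
      <p>Requiring the dates to agree in addition to the name reduces the risk of linking two different artists who happen to share a name.</p>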
    </sec>
    <sec id="sec-12">
      <title>4.2 Music Recommendation Using Graph Embeddings</title>
      <p>
        What should we suggest to a user listening to Beethoven? Similar
musicians should share some features with the German composer: the
period, similar properties of the compositions (genre, key, casting)
or a similar instrument played (the piano itself, or the
harpsichord, which is in the same family). But how can we define a similarity
measure that takes these concepts into account? We propose a
solution based on graph embeddings generated at different levels:
(1) For simple features (e.g. genre, key, instrument), we
compute an embedding for each term by applying node2vec [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] on
two sub-graphs: the one of the controlled vocabularies and
the one corresponding to the usage of their values in the
DOREMUS dataset;
(2) For complex features (e.g. artist), we generate the
embedding by combining the corresponding feature
embeddings. In the case of artists, we generate a vector
composed of the period (mapped to the interval [0, 1]) and the averages
of the vectors of the genre, key and casting (instrument) of their
compositions, together with that of the played instrument,
after having reduced their dimensionality;
(3) Finally, for the work, we again combine simple and complex
feature embeddings, following the same rules.
        <fn><label>10</label><p>http://data.doremus.org/sparql</p></fn>
        <fn><label>11</label><p>http://overture.doremus.org/expression/614925f2-1da7-39c1-8fb7-4866b1d39fc7</p></fn>
        <fn><label>12</label><p>The ISNI database contains authority information about people involved in creative
processes (e.g. artists). It is managed by the ISNI Quality Team, of which the BnF is a
member, and artist records in the BnF database generally contain an ISNI reference.</p></fn>
      </p>
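      <p>Step (2) can be sketched as follows for an artist, assuming that the simple feature embeddings have already been computed and reduced. The two-dimensional toy vectors, the time span used to scale the period and the function names are illustrative assumptions, and every feature group is assumed to be non-empty.</p>
      <preformat><![CDATA[
```python
from statistics import mean


def average(vectors):
    """Component-wise mean of equally sized vectors."""
    return [mean(component) for component in zip(*vectors)]


def scale_period(year, first_year, last_year):
    """Map a year to [0, 1] within the chosen time span."""
    return (year - first_year) / (last_year - first_year)


def artist_embedding(year, genres, keys, castings, instruments, span=(1500, 2000)):
    """Concatenate the scaled period with the averaged feature vectors."""
    vector = [scale_period(year, *span)]
    for group in (genres, keys, castings, instruments):
        vector.extend(average(group))
    return vector


# Toy vectors standing in for already reduced node2vec embeddings.
beethoven = artist_embedding(
    1770,
    genres=[[1.0, 0.0], [0.0, 1.0]],
    keys=[[0.5, 0.5]],
    castings=[[0.25, 0.75], [0.75, 0.25]],
    instruments=[[1.0, 1.0]],
)
```
]]></preformat>
      <p>Since the artist vector is only a concatenation of averages, it can be recomputed cheaply whenever new compositions are added, without running node2vec again.</p>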
      <p>
        Using graph embeddings reduces the similarity problem to computing the
inverse of a Euclidean distance. If some properties are missing, we
apply a penalisation computed as the percentage of features missing in
the target vector with respect to the seed one [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
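      <p>One possible reading of this measure is sketched below. The form 1 / (1 + d) for the inverse of the distance and the multiplicative penalty are our assumptions for the sake of a runnable example; the exact formulation is the one given in the cited work.</p>
      <preformat><![CDATA[
```python
import math


def similarity(seed, target):
    """Similarity in [0, 1], penalised for features missing in the target.

    `seed` and `target` map feature names to vectors; None marks a
    missing feature.
    """
    seed_features = [name for name in seed if seed[name] is not None]
    shared = [name for name in seed_features if target.get(name) is not None]
    if not shared:
        return 0.0
    squared = sum(
        (a - b) ** 2
        for name in shared
        for a, b in zip(seed[name], target[name])
    )
    base = 1.0 / (1.0 + math.sqrt(squared))     # assumed inverse-distance form
    penalty = (len(seed_features) - len(shared)) / len(seed_features)
    return base * (1.0 - penalty)               # assumed multiplicative penalty


seed = {"genre": [1.0, 0.0], "key": [0.0, 1.0]}
same = {"genre": [1.0, 0.0], "key": [0.0, 1.0]}
no_key = {"genre": [1.0, 0.0], "key": None}
far = {"genre": [0.0, 0.0], "key": [0.0, 1.0]}
```
]]></preformat>
      <p>In this toy setting, an identical target scores 1.0, while a target that matches on genre but lacks the key is penalised by half.</p>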
      <p>The biggest advantage of this method is that the embedding
computation is required only for the simple features: each
embedding is re-used in different combinations. Because different weights
can be assigned to each property in order to tune the
recommendation, we plan to experiment with neural networks in order
to discover the best weighting strategy.</p>
    </sec>
    <sec id="sec-13">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work has been partially supported by the French National
Research Agency (ANR) within the DOREMUS Project, under grant
number ANR-14-CE24-0020.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Getaneh</given-names>
            <surname>Alemu</surname>
          </string-name>
          , Brett Stevens,
          <string-name>
            <given-names>Penny</given-names>
            <surname>Ross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jane</given-names>
            <surname>Chandler</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Linked Data for libraries: Benefits of a conceptual shift from library-specific record structures to RDF-based data models</article-title>
          .
          <source>New Library World</source>
          <volume>113</volume>
          ,
          <issue>11/12</issue>
          (
          <year>2012</year>
          ),
          <fpage>549</fpage>
          -
          <lpage>570</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Gillian</given-names>
            <surname>Byrne</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lisa</given-names>
            <surname>Goddard</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The strongest link: Libraries and linked data</article-title>
          .
          <source>D-Lib Magazine</source>
          <volume>16</volume>
          ,
          <issue>11</issue>
          (
          <year>2010</year>
          ),
          <fpage>5</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Pierre</given-names>
            <surname>Choffé</surname>
          </string-name>
          and
          <string-name>
            <given-names>Françoise</given-names>
            <surname>Leresche</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>DOREMUS: Connecting Sources, Enriching Catalogues and User Experience</article-title>
          .
          <source>In 24th IFLA World Library and Information Congress</source>
          . Columbus, USA.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Doerr</surname>
          </string-name>
          , Chryssoula Bekiari, and Patrick LeBoeuf.
          <year>2008</year>
          .
          <article-title>FRBRoo: a conceptual model for performing arts</article-title>
          .
          <source>In CIDOC Annual Conference</source>
          . Athens, Greece,
          <fpage>6</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Aditya</given-names>
            <surname>Grover</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jure</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>node2vec: Scalable Feature Learning for Networks</article-title>
          .
          <source>In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          . San Francisco, USA.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Pasquale</given-names>
            <surname>Lisena</surname>
          </string-name>
          , Manel Achichi, Eva Fernandez, Konstantin Todorov, and
          <string-name>
            <given-names>Raphaël</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Exploring Linked Classical Music Catalogs with OVERTURE</article-title>
          .
          <source>In 15th International Semantic Web Conference (ISWC)</source>
          . Kobe, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Pasquale</given-names>
            <surname>Lisena</surname>
          </string-name>
          and
          <string-name>
            <given-names>Raphaël</given-names>
            <surname>Troncy</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Combining Music Specific Embeddings for Computing Artist Similarity</article-title>
          .
          <source>In 18th International Conference on Music Information Retrieval (ISMIR)</source>
          ,
          <article-title>Late-Breaking Demo Track</article-title>
          . Suzhou, China.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Alistair</given-names>
            <surname>Miles</surname>
          </string-name>
          and
          <string-name>
            <given-names>José R.</given-names>
            <surname>Pérez-Agüera</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>SKOS: Simple knowledge organisation for the web</article-title>
          .
          <source>Cataloging &amp; Classification Quarterly</source>
          <volume>43</volume>
          ,
          <issue>3-4</issue>
          (
          <year>2007</year>
          ),
          <fpage>69</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Yves</given-names>
            <surname>Raimond</surname>
          </string-name>
          , Samer A. Abdallah,
          <string-name>
            <given-names>Mark B.</given-names>
            <surname>Sandler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Frederick</given-names>
            <surname>Giasson</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>The Music Ontology</article-title>
          .
          <source>In 15th International Conference on Music Information Retrieval (ISMIR)</source>
          .
          <fpage>417</fpage>
          -
          <lpage>422</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Roy</given-names>
            <surname>Tennant</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>MARC must die</article-title>
          .
          <source>Library Journal</source>
          <volume>127</volume>
          ,
          <issue>17</issue>
          (
          <year>2002</year>
          ),
          <fpage>26</fpage>
          -
          <lpage>27</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>