=Paper=
{{Paper
|id=Vol-2065/paper14
|storemode=property
|title=DOing REusable MUSical Data (DOREMUS)
|pdfUrl=https://ceur-ws.org/Vol-2065/paper14.pdf
|volume=Vol-2065
|authors=Pasquale Lisena,Raphaël Troncy
|dblpUrl=https://dblp.org/rec/conf/kcap/LisenaT17
}}
==DOing REusable MUSical Data (DOREMUS)==
Pasquale Lisena (EURECOM, Sophia Antipolis, France, pasquale.lisena@eurecom.fr) and Raphaël Troncy (EURECOM, Sophia Antipolis, France, raphael.troncy@eurecom.fr)

K-CAP2017 Workshops and Tutorials Proceedings, 2017. ©2017 Copyright held by the owner/author(s).

===ABSTRACT===
The aim of this tutorial is first to provide in-depth explanations of DOREMUS, a model for describing music metadata. We will demonstrate how real data coming from musical libraries can be converted to this model by presenting the whole DOREMUS tool chain. We will illustrate how the DOREMUS data can be used for query answering and consumed through various applications, including an exploratory search engine and music recommender systems.

===CCS CONCEPTS===
• Information systems → Ontologies; Recommender systems; Semantic web description languages; Music retrieval;

===KEYWORDS===
Ontology, Music Metadata, Linked Data, Recommender System, Graph Embeddings

===1 INTRODUCTION===
Music information can be very complex. Describing a classical masterpiece in all its forms (the composition, the score, the various publications, a performance, a recording, the derivative works, etc.) is a complex activity. An even more challenging task consists in describing jazz and ethnic music, for which the performance plays a central role, the music is generally not written and the authorship is not well defined. In the context of the DOREMUS research project (http://www.doremus.org), we develop tools and methods to manage music catalogues on the web using semantic web technologies.

In this tutorial, we show strategies and tools for managing music knowledge. In Section 2, we present the DOREMUS model for describing music, together with music-specific controlled vocabularies. In Section 3, we present tools for converting music datasets, taking as examples the ones coming from the rich musical archives of three leading cultural institutions in France, namely the Bibliothèque Nationale de France (BnF), the Philharmonie de Paris (PP) and Radio France (RF), which describe musical works, publications, performances and concerts. We demonstrate the expressiveness of the model by showing how complex music-specific queries can be answered. Finally, we describe strategies for data visualisation and recommendation in Section 4.

===2 A MUSIC DATA MODEL===
Among the music ontologies, the best-known example is the Music Ontology [9], which provides a set of music-specific classes and properties for describing musical works, performances and tracks, together with fragments of them. The further need to exploit the music knowledge coming from libraries led to the definition of a new ontology.

====2.1 The DOREMUS Ontology====
The DOREMUS model (http://data.doremus.org/ontology/) is an extension of FRBRoo, a model for describing cultural objects [4], applied to the specific domain of music. This is a dynamic model, in which the abstract intention of the author (called Work) exists only through an Event (i.e. the composition event) that realises it in a distinct series of choices called Expression. This Work-Expression-Event triplet can also describe different parts of the life of a work, like the Performance, the Publication or the creation of a derivative Work, each one incorporating the expression from which it comes.

On top of the original FRBRoo classes and properties, specific ones have been added in order to describe aspects of a work that are specifically related to music, such as the musical key, the genre, the tempo, the medium of performance (MoP), etc. [3].

Each triplet contains information that can both live autonomously and be linked to the other entities. For a classical work, we will have a triplet for the composition, one for each performance event, one for every manifestation (i.e. the score), etc., all connected in the graph. A jazz improvisation, which consists in an extemporaneous creation of a new work, will have only the triplet for the Performance Work, Performance Expression and Performance Creation: there is no moment of composition or writing of the score, both almost mandatory for classical music, and no need to attach it to any other entity. It is considered a work per se. All the Work entities of each triplet are then connected to a Complex Work, a class whose objective is to collect together all the representations of the same creative idea, both the conceptual ones and the sensory ones (manifestations).

The result is a model that is quite complex and hard to adopt, but that offers a very detailed expressiveness. The graph depicted in Figure 1 shows a real example from our data: Beethoven’s Sonata for piano and cello n.1 (http://data.doremus.org/expression/614925f2-1da7-39c1-8fb7-4866b1d39fc7).

Figure 1: Beethoven’s Sonata for piano and cello n.1 represented as a graph using the DOREMUS ontology

====2.2 Music Controlled Vocabularies====
A large number of properties involved in the music description are supposed to contain values that are shared among different entities: different compositions can have “sonata” as genre, different performers can play a “bassoon”, different authors can have “composer” or “lyricist” as function. These labels can be expressed in multiple languages or in alternative forms (e.g. “sax” and “saxophone”, or the French keys “Do majeur” and “Ut majeur”), making reconciliation hard. Our choice is to use controlled vocabularies for those common concepts. A controlled vocabulary is a thematic thesaurus of entities, each one being again identified with a URI. We use SKOS [8] as representation model, which allows one to specify for each concept the preferred and the alternative labels in multiple languages, to define a hierarchy between the concepts (so that “violin” is a narrower concept with respect to “string”), and to add comments and notes for describing the entity and helping the annotation activity. Each concept becomes a common node in the musical graph that can connect a musical work to another, an author to a performer, etc.

Different kinds of vocabularies are required for describing music. Some of them are already available on the web: this is the case of MIMO (http://www.mimo-db.eu/) for describing musical instruments, or RAMEAU (http://rameau.bnf.fr/) for musical genres, ethnic groups, etc. Some others are not published in a suitable format for the Web of Data, or the published version is not as complete as other formats that are available to libraries or in online sources: this happens with the vocabularies published by the International Association of Music Libraries (IAML) (http://iflastandards.info/ns/unimarc/), which were published after the start of the project and for which we sometimes provide more details (labels, languages, etc.). Finally, there is also the case of vocabularies that do not exist at all and that we generate on the basis of real data coming from the partners, enriched by an editorial process that also involved librarians. As a result, we collected, implemented and published 15 controlled vocabularies belonging to 6 different categories (https://github.com/DOREMUS-ANR/knowledge-base/tree/master/vocabularies).

===3 DATA CONVERSION===
Both the French National Library (BnF) and the Philharmonie de Paris make use of the MARC format for representing the music metadata. The flat structure of MARC, which consists in a succession of fields and subfields (Figure 2), reflects its purpose of converting printed or handwritten records into a computer form. Although MARC is a standard, its adoption is restricted to the library world, so that serialization to other formats (usually XML) is needed for actual use. MARC fields are also not labeled explicitly, but encoded with numbers, with the consequence that a manual is required for deciphering the content. The semantics of these fields and subfields is not trivial: a subfield can change its meaning depending on the field under which it is found, and on the particular variant of MARC (UNIMARC and INTERMARC). A field or subfield can contain information about different entities, like the first performance and the first publication combined in the same notes field, without a clear separation. Often, the information is represented in the form of free text [10].

Figure 2: An excerpt of a UNIMARC record.

The benefits of moving from MARC to an RDF-based solution consist in the interoperability and the integration among libraries and with third party actors, with the possibility of realizing smart federated search [1, 2]. In order to achieve these goals, two tasks are necessary: data conversion and data linking.

====3.1 From MARC to RDF====
For the conversion task, we rely on marc2rdf (https://github.com/DOREMUS-ANR/marc2rdf), an open source prototype we developed for the automatic conversion of MARC bibliographic records to RDF using the DOREMUS ontology [6]. The conversion process relies on explicit expert-defined transfer rules (or mappings) that indicate where in the MARC file to look for what kind of information, providing the corresponding property path in the model as well as useful examples that illustrate each transfer rule, as shown in Figure 3. The role of these rules goes beyond being simple documentation for the MARC records: they also embed information on some librarian practices in the formalisation of the content (format of dates, agreements on the syntax of textual fields, default values if the information is absent).

Figure 3: Example of mapping rules describing the opus number and sub-number of a work

The converter is composed of different modules that work in succession. First, a file parser reads the MARC file and makes the content accessible by field and subfield number. We implemented a converting module for both the INTERMARC and UNIMARC variants. Then, it builds the RDF graph by reading the fields and assigning their content to the DOREMUS property suggested in the transfer rules.

Then the free-text interpreter extracts further information from the plain text fields, which include editorial notes. This amounts to knowledge-aware parsing, since we search in the string exactly the information we want to instantiate from the model (e.g. the MoP from the casting notes, or the date and the publisher from the first publication note). The parsing is realized through empirically defined regular expressions, which will be supported by Named Entity Recognition techniques as future work. Finally, the string2vocabulary component performs an automatic mapping of string literals to URIs coming from controlled vocabularies. All variants of a concept label are considered in order to deal with potential differences in naming terms. As an additional feature, this component is able to recognise and correct some noise that is present in the source MARC file: this is the case of musical keys declared as genre, or fields for the opus number that actually contain a catalog number and vice-versa. These cases and other typos and mistakes have been identified thanks to the conversion process and the visualization of the converted data, supporting the source institutions in their work of constantly updating and correcting their data.

====3.2 Dealing With Heterogeneous Formats====
Apart from MARC, we are converting other source bases (in XML) that are too specific to be handled by a single converter. Therefore, we developed ad hoc software that follows a generic workflow: parse the input file and collect the required information, create the graph structure in RDF, run the string2vocabulary module described previously. This procedure creates different graphs, one for each source. Those source databases are complementary but also contain overlaps (e.g. two databases that describe the same work or the same performance with complementary metadata). We have started to automatically interlink the datasets, so that the resulting knowledge graph provides a richer description of each work.

====3.3 Answering complex queries====
Before the beginning of the project, a list of questions was collected from experts of the partner institutions (https://github.com/DOREMUS-ANR/knowledge-base/tree/master/query-examples). These questions reflect real needs of the institutions and reveal problems that they face daily in the task of selecting information from the database (e.g. concert organisation or broadcast programming) or in supporting librarian and musicologist studies. They can be related to practical use cases (the search of all the scores that suit a particular formation), to musicological topics (the music of a certain region in a particular historical period), to interesting statistics (the works usually performed or published together), or to curious connections between works, performances or artists. Most of the questions are very specific and complex, so that it is very hard to find their answer by simply querying the search engines currently available on the web. We have grouped these questions in categories, according to the DOREMUS classes involved in the question.

Table 1 provides an overview of how many queries we can currently write for each category. The modelling of recordings, scores and performances is still work in progress; this, along with the interconnection to the LOD repositories, is one important reason why some questions have not yet been translated into SPARQL and others return no results.

{| class="wikitable"
|+ Table 1: For each category of questions, we provide the ratio of the number of converted queries to the number of questions
! Category !! Queries / Questions
|-
| A. Works || 23 / 29
|-
| B. Artists || 1 / 3
|-
| C. Performances || 6 / 9
|-
| D. Recordings || 0 / 11
|-
| E. Publications || 0 / 5
|}

===4 EXPLORATION AND RECOMMENDATION===
We consider exploration and recommendation as two sides of the same coin. With the first one, we let users browse the datasets, discover connections on their own, and understand how we build the knowledge. Through recommendation, we remove this responsibility from users with the purpose of presenting what they need at a particular moment.

====4.1 Visualizing the Complexity====
We developed the first version of Overture, a web prototype of an exploratory search engine for DOREMUS data. The application makes requests directly to our SPARQL endpoint (http://data.doremus.org/sparql) and presents the information in a dedicated user interface.

At the top of the user interface, the navigation bar allows the user to navigate between the main concepts of the DOREMUS model: expression, performance, score, recording, artist. The challenge is to give the final user a complete vision of the data of each class and to let them understand how the classes are connected to each other. We keep as example Beethoven’s Sonata for piano and cello n.1 (http://overture.doremus.org/expression/614925f2-1da7-39c1-8fb7-4866b1d39fc7). Aside from the different versions of the title, the composer and a textual description, the page provides details on the information we have about the work, like the musical key, the genres, the intended MoP, the opus number. When these values come from a controlled vocabulary, a link is presented in order to search for expressions that share the same value (for example, the same genre or the same musical key). A timeline shows the most important events related to the work (the composition, the premiere, the first publication). Other performances and publications can be represented below. The background is a portrait of the composer that comes from DBpedia. It is retrieved thanks to the presence in the DOREMUS database of owl:sameAs links. These links come in part from the International Standard Name Identifier (ISNI) service, and in part from an interlinking realised by matching the artist name, birth date and death date in the different datasets. (The ISNI database contains authority information about people involved in creative processes, e.g. artists. It is managed by the ISNI Quality Team, of which the BnF is a member, and artist records in the BnF database generally contain an ISNI reference.)

====4.2 Music Recommendation Using Graph Embeddings====
What should we suggest to a user listening to Beethoven? Similar musicians should share some features with the German composer: the period, similar properties of the compositions (genre, key, casting) or a similar instrument played (the piano itself, or also the harpsichord, which is in the same family). But how can we define a similarity measure that takes these concepts into account? We propose a solution based on graph embeddings generated at different levels:

# For simple features (e.g. genre, key, instrument), we compute an embedding for each term by applying node2vec [5] on two sub-graphs: the one of the controlled vocabularies and the one corresponding to the usage of their values in the DOREMUS dataset;
# For complex features (e.g. artist), we generate the embedding by combining the corresponding feature embeddings. In the case of artists, we generate a vector composed of the period (mapped in [0, 1]) and the averages of the vectors of the genre, key and casting (instrument) of the artist's compositions, together with the one of the played instrument, after having reduced their dimensionality;
# Finally, for the work, we again combine simple and complex feature embeddings, following the same rules.

Using graph embeddings reduces the similarity problem to computing the inverse of a Euclidean distance. If some properties are missing, we apply a penalisation computed as the percentage of features missing in the target vector with respect to the seed one [7].

The biggest advantage of this method is that the embedding computation is required only for the simple features: each embedding is re-used in different combinations. Because different weights can be assigned to each property in order to tune the recommendation, we plan to experiment with neural networks in order to discover the best weighting strategy.

===ACKNOWLEDGMENTS===
This work has been partially supported by the French National Research Agency (ANR) within the DOREMUS Project, under grant number ANR-14-CE24-0020.

===REFERENCES===
[1] Getaneh Alemu, Brett Stevens, Penny Ross, and Jane Chandler. 2012. Linked Data for libraries: Benefits of a conceptual shift from library-specific record structures to RDF-based data models. New Library World 113, 11/12 (2012), 549–570.

[2] Gillian Byrne and Lisa Goddard. 2010. The strongest link: Libraries and linked data. D-Lib Magazine 16, 11 (2010), 5.

[3] Pierre Choffé and Françoise Leresche. 2016. DOREMUS: Connecting Sources, Enriching Catalogues and User Experience. In 24th IFLA World Library and Information Congress. Columbus, USA.

[4] Martin Doerr, Chryssoula Bekiari, and Patrick LeBoeuf. 2008. FRBRoo: a conceptual model for performing arts. In CIDOC Annual Conference. Athens, Greece, 6–18.

[5] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, USA.

[6] Pasquale Lisena, Manel Achichi, Eva Fernandez, Konstantin Todorov, and Raphaël Troncy. 2016. Exploring Linked Classical Music Catalogs with OVERTURE. In 15th International Semantic Web Conference (ISWC). Kobe, Japan.

[7] Pasquale Lisena and Raphaël Troncy. 2017. Combining Music Specific Embeddings for Computing Artist Similarity. In 18th International Conference on Music Information Retrieval (ISMIR), Late-Breaking Demo Track. Suzhou, China.

[8] Alistair Miles and José R Pérez-Agüera. 2007. SKOS: Simple knowledge organisation for the web. Cataloging & Classification Quarterly 43, 3-4 (2007), 69–83.

[9] Yves Raimond, Samer A. Abdallah, Mark B. Sandler, and Frederick Giasson. 2007. The Music Ontology. In 15th International Conference on Music Information Retrieval (ISMIR). 417–422.

[10] Roy Tennant. 2002. MARC must die. Library Journal 127, 17 (2002), 26–27.
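
The similarity computation described in Section 4.2 can be illustrated with a short Python sketch. It is not the authors' implementation: the function names (<code>mean_vector</code>, <code>similarity</code>), the feature dictionaries and the toy vectors are hypothetical. Two choices are assumptions, since the text only speaks of the reverse of a Euclidean distance and of a penalisation based on the percentage of missing features: the similarity is taken as 1 / (1 + distance), and the penalisation is applied as a multiplicative factor. The node2vec step and the dimensionality reduction are left out, so the embeddings are taken as given.

```python
import math


def mean_vector(vectors):
    """Average a list of equally sized embeddings; None if the list is empty."""
    if not vectors:
        return None
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def similarity(seed, target):
    """Similarity between two entities described by per-feature embeddings.

    seed and target map a feature name to a vector, or to None when the
    feature is missing. Assumptions of this sketch: similarity is
    1 / (1 + distance) over the features present on both sides, scaled
    down by the share of seed features missing in the target.
    """
    seed_features = [name for name, vec in seed.items() if vec is not None]
    shared = [name for name in seed_features if target.get(name) is not None]
    if not shared:
        return 0.0
    u = [x for name in shared for x in seed[name]]
    v = [x for name in shared for x in target[name]]
    base = 1.0 / (1.0 + euclidean(u, v))
    missing_ratio = 1.0 - len(shared) / len(seed_features)
    return base * (1.0 - missing_ratio)


# Hypothetical artist vectors: period mapped in [0, 1], plus the average of
# the genre embeddings of the artist's compositions and a key embedding.
seed_artist = {
    "period": [0.40],
    "genre": mean_vector([[0.2, 0.8], [0.4, 0.6]]),
    "key": [0.1, 0.9],
}
candidate = {
    "period": [0.45],
    "genre": mean_vector([[0.3, 0.7]]),
    "key": None,  # missing feature, triggers the penalisation
}
score = similarity(seed_artist, candidate)
```

Keeping one vector per feature, rather than a single concatenated vector, makes it straightforward to detect missing features and would also allow per-property weights to be added later, as the weighting strategy mentioned in Section 4.2 suggests.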