A multimodal approach to bridge the Music Semantic Gap

Òscar Celma, Perfecto Herrera, and Xavier Serra

Òscar Celma, Perfecto Herrera, and Xavier Serra are with the Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain.

Abstract— In this paper we present the music information plane and the different levels of information extraction that exist in the musical domain. Based on this approach we propose a way to overcome the existing semantic gap in the music field. Our approach is twofold: we propose a set of music descriptors that can be extracted automatically from the audio signal, and a top-down multimodal approach that adds explicit and formal semantics to these annotations. We believe that merging both approaches (bottom-up and top-down) can overcome the existing semantic gap in the musical domain.

Index Terms— Semantic Gap, Music Information Retrieval, Multimodal Processing

I. INTRODUCTION

In recent years the typical music consumption behaviour has changed dramatically. Personal music collections have grown, favoured by technological improvements in networks, storage, portability of devices and Internet services. The amount and availability of songs has de-emphasized their value: it is often the case that users own many music files that they have listened to only once, or even never. It seems reasonable to think that by providing listeners with efficient ways to impose a personalized order on their collections, and with ways to explore the hidden "treasures" inside them, the value of those collections will increase drastically.

Besides, on the digital music distribution front, there is a need to find ways of improving music retrieval effectiveness. Artist, title, and genre keywords might not be the only criteria to help music consumers find music they like. This is currently achieved mainly by using cultural or editorial metadata ("this artist is somehow related with that one") or by exploiting existing purchasing behaviour data ("since you bought this artist, you might also want to buy this one, as other customers with a profile similar to yours did"). A largely unexplored (and potentially interesting) alternative is using semantic descriptors automatically extracted from the music audio files. These descriptors can be applied, for example, to organize a listener's collection, recommend new music, or generate playlists. In the past twenty years, the signal processing and computer music communities have developed a wealth of techniques and technologies to describe audio and music content at the lowest (or close-to-signal) level of representation. However, the gap between these low-level descriptors and the concepts that music listeners use to relate with music collections (the so-called "semantic gap") is still, to a large extent, waiting to be bridged.

II. THE MUSIC INFORMATION PLANE

Due to the inherent complexity of describing multimedia objects, a layered approach with different levels of granularity is needed when designing an ontology for a particular domain. Depending on the requirements, one might choose the appropriate level of abstraction. In the multimedia field and, in particular, in the music field, we foresee three levels of abstraction: low-level (physical and basic semantic) features, mid-level semantic features, and human understanding and reasoning. The first level includes physical features of the objects, such as the sampling rate of an audio file, as well as some basic features like the spectral centroid of an audio frame, or even the predominant chord in a sequential list of frames. A higher level of abstraction aims at describing concepts such as a guitar solo, or the tonality information (e.g. key and mode) of a music title. Finally, the reasoning level uses inference methods and semantic rules to retrieve, for instance, several audio files with similar guitar solos over the same key.

Similarly, we describe the music information plane along two dimensions. One dimension takes into account the different media types that serve as input data. The other dimension is the level of abstraction in the information extraction process applied to this data (see Fig. 1). The input media types include data coming from audio (music recordings), text (lyrics, editorial text, press releases, etc.) and image (video clips, CD covers, printed scores, etc.). For each media type, in turn, there are different levels of information extraction. The lowest level is located at the signal features. This level lies far away from what an end-user might find meaningful; nevertheless, it is the basis that allows us to describe the content and to produce more elaborate descriptions of the media objects. This level includes basic audio features such as energy, frequency or mel-frequency cepstral coefficients, as well as basic natural language processing for the text media. At the mid-level (the content objects level), the information extraction process and the elements described are closer to the end-user. This level includes descriptions of musical concepts (e.g. rhythm, harmony, melody), or named entity recognition for text information. Finally, the highest level, Human Knowledge, includes information tightly related to human beings interacting with music knowledge.

[Fig. 1. The music information plane and its semantic gap between content objects and human understanding.]

III. PUSHING THE CURRENT LIMITS

The main problem, then, is how to push automatic media-based descriptions up to the level of human understanding. We believe that this cannot be achieved if we focus on only one direction (say, a bottom-up approach). For many years Signal Processing has been the main discipline used to automatically generate music descriptors. More recently Statistical Modeling, Machine Learning, Music Theory and Web Mining technologies (to name a few) have also been used to push up the semantic level of music descriptors. However, we believe that the current approaches to automatic music description, which are mainly bottom-up, will not allow us to bridge the semantic gap. Thus, we need an important shift in our approach. The music description problem will not be solved by focusing on the audio signals alone; a Multimodal Processing approach is needed. We also need top-down approaches based on Ontologies, Reasoning Rules, Music Cognition, or even Computational Neuroscience and Computational Musicology.

Regarding ontologies and basic reasoning rules, in [1] we have proposed a general multimedia ontology based on MPEG-7 and described in the OWL1 language, which allows us to formally describe the automatic annotations extracted from the audio (and, obviously, more general descriptions of multimedia assets). The approach contributes a complete and automatic mapping of the whole MPEG-7 standard to OWL. It is based on an XML Schema to OWL mapping that tries to be as transparent as possible. This mapping is complemented with a mapping from XML metadata instances to RDF, which completes a tool set to transfer metadata from the XML domain to the Semantic Web.

Once all the multimedia metadata —not only automatic acoustic annotations from audio files, but editorial and cultural data too [2]— has been integrated in a common framework (in our case, the MPEG-7 OWL ontology), we can benefit from the now explicit semantics. Based on this framework, we foresee several uses of the ontology to help the process of automatic music annotation, such as propagating music annotations based on audio similarity, or detecting inconsistencies in editorial metadata.

IV. CONCLUSIONS

We have presented the music information plane and the existing semantic gap between the content object level and human understanding. We foresee that a mixed approach (both bottom-up and top-down) can help to reduce the existing semantic gap in the music field.

Moreover, we are now witnessing an explosion of practical applications coming out of Music Information Retrieval research: Music Identification systems, Music Recommenders, Playlist Generators, Music Search Engines, Music Discovery and Personalization systems, and this is just the beginning2. At this stage, we might be closer to bridging the semantic gap in music than in any other multimedia knowledge domain. Music was a key factor in taking the Internet from its text-centered origins to being a complete multimedia environment. Music might do the same for the Semantic Web.

ACKNOWLEDGMENT

The reported research has been funded by the EU-FP6-IST-507142 project SIMAC (Semantic Interaction with Music Audio Contents). Additional information can be found at the project website: http://www.semanticaudio.org.

REFERENCES

[1] Garcia, R. and Celma, O., "Semantic Integration and Retrieval of Multimedia Metadata," Proceedings of the 4th International Semantic Web Conference, Galway, Ireland, 2005.

[2] Pachet, F., "Knowledge Management and Musical Metadata," Encyclopedia of Knowledge Management, 2005.

1 http://www.w3.org/2004/OWL/
2 A detailed list of MIR systems is available at http://mirsystems.info/
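As a concrete illustration of the low-level signal features discussed in Section II, the following sketch computes the spectral centroid of a single audio frame. This is our own minimal example, not part of the SIMAC toolchain; the frame is synthetic, and a plain-Python DFT is used for self-containment (a real extractor would use an FFT library):

```python
import math

def spectral_centroid(frame, sample_rate):
    """Spectral centroid: the magnitude-weighted mean of the DFT bin
    frequencies, a rough correlate of perceived 'brightness'."""
    n = len(frame)
    weighted = 0.0
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        re = sum(frame[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(frame[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        magnitude = math.hypot(re, im)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0

# Synthetic frame: a pure 1000 Hz sine, 256 samples at 8 kHz.
sr = 8000
frame = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(256)]
print(round(spectral_centroid(frame, sr), 1))  # → 1000.0
```

A pure tone places all spectral energy in one bin, so the centroid coincides with its frequency; for real music frames the centroid summarizes where the spectral mass sits.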
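The annotation-propagation usage foreseen at the end of Section III can be sketched as a nearest-neighbour vote over low-level descriptor vectors: an unannotated track inherits the majority label of its acoustically closest annotated neighbours. Everything below (the function, the 2-D descriptor space, the labels) is a hypothetical toy, not the authors' implementation:

```python
import math
from collections import Counter

def propagate_annotation(annotated, query, k=3):
    """Label a new track with the majority annotation of its k closest
    annotated tracks, using Euclidean distance between descriptor
    vectors as a toy stand-in for real audio similarity."""
    by_distance = sorted(annotated, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D descriptor space with two annotated clusters.
annotated = [
    ((0.1, 0.2), "guitar solo"),
    ((0.2, 0.1), "guitar solo"),
    ((0.9, 0.8), "piano solo"),
    ((0.8, 0.9), "piano solo"),
]
print(propagate_annotation(annotated, (0.15, 0.15)))  # → guitar solo
```

The same neighbourhood structure supports the second foreseen usage: a track whose editorial metadata disagrees with the consensus label of its acoustic neighbours is a candidate inconsistency to flag.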