=Paper=
{{Paper
|id=Vol-233/paper-11
|storemode=property
|title=A Multimodal Approach to Bridge the Music Semantic Gap
|pdfUrl=https://ceur-ws.org/Vol-233/p23.pdf
|volume=Vol-233
|dblpUrl=https://dblp.org/rec/conf/samt/CelmaHS06
}}
==A Multimodal Approach to Bridge the Music Semantic Gap==
Òscar Celma, Perfecto Herrera, and Xavier Serra
Òscar Celma, Perfecto Herrera, and Xavier Serra are with the Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain.

Abstract— In this paper we present the music information plane and the different levels of information extraction that exist in the musical domain. Based on this approach we propose a way to overcome the existing semantic gap in the music field. Our approach is twofold: we propose a set of music descriptors that can be automatically extracted from the audio signals, and a top-down multimodal approach that adds explicit and formal semantics to these annotations. We believe that merging both approaches (bottom-up and top-down) can overcome the existing semantic gap in the musical domain.

Index Terms— Semantic Gap, Music Information Retrieval, Multimodal Processing

I. INTRODUCTION

In recent years the typical music consumption behaviour has changed dramatically. Personal music collections have grown, favoured by technological improvements in networks, storage, portability of devices and Internet services. The amount and availability of songs has de-emphasized their value: it is usually the case that users own many music files that they have listened to only once, or even never. It seems reasonable to think that by providing listeners with efficient ways to impose a personalized order on their collections, and with ways to explore the hidden "treasures" inside them, the value of their collections will increase drastically.

Besides, on the digital music distribution front, there is a need to find ways of improving music retrieval effectiveness. Artist, title, and genre keywords might not be the only criteria that help music consumers find music they like. Retrieval is currently achieved mainly by using cultural or editorial metadata ("this artist is somehow related to that one") or by exploiting existing purchasing behaviour data ("since you bought this artist, you might also want to buy this one, as other customers with a profile similar to yours did"). A largely unexplored (and potentially interesting) alternative is to use semantic descriptors automatically extracted from the music audio files. These descriptors can be applied, for example, to organize a listener's collection, recommend new music, or generate playlists. In the past twenty years, the signal processing and computer music communities have developed a wealth of techniques and technologies to describe audio and music content at the lowest (or close-to-signal) level of representation. However, the gap between these low-level descriptors and the concepts that music listeners use to relate to music collections (the so-called "semantic gap") is still, to a large extent, waiting to be bridged.

II. THE MUSIC INFORMATION PLANE

Due to the inherent complexity of describing multimedia objects, a layered approach with different levels of granularity is needed when designing an ontology for a particular domain. Depending on the requirements, one might choose the appropriate level of abstraction. In the multimedia field and, in particular, in the music field, we foresee three levels of abstraction: low-level (physical and basic semantic) features, mid-level semantic features, and human understanding and reasoning. The first level includes physical features of the objects, such as the sampling rate of an audio file, as well as some basic features like the spectral centroid of an audio frame, or even the predominant chord in a sequential list of frames. A higher level of abstraction aims at describing concepts such as a guitar solo, or the tonality information (e.g. key and mode) of a music title. Finally, the reasoning level uses inference methods and semantic rules to retrieve, for instance, several audio files with similar guitar solos over the same key.
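To make the lowest layer concrete, here is a minimal sketch of one of the basic features named above, the spectral centroid of an audio frame, using plain NumPy; the frame length, window and sampling rate are illustrative choices, not values prescribed in this paper.

```python
import numpy as np

def spectral_centroid(frame, sample_rate):
    """Spectral centroid of a single audio frame: the magnitude-weighted
    mean of the frequencies present in the frame, in Hz."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    if spectrum.sum() == 0.0:  # silent frame: centroid is undefined
        return 0.0
    return float(np.sum(freqs * spectrum) / np.sum(spectrum))

# A pure 440 Hz tone should yield a centroid close to 440 Hz.
sr = 44100
t = np.arange(2048) / sr
print(spectral_centroid(np.sin(2 * np.pi * 440.0 * t), sr))
```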
Similarly, we describe the music information plane in two dimensions. One dimension takes into account the different media types that serve as input data. The other dimension is the level of abstraction in the information extraction process applied to this data (see Fig. 1). The input media types include data coming from audio (music recordings), text (lyrics, editorial text, press releases, etc.) and image (video clips, CD covers, printed scores, etc.). On the other side, for each media type there are different levels of information extraction. The lowest level operates on signal features. This level lies far away from what an end-user might find meaningful; nevertheless, it is the basis that allows us to describe the content and to produce more elaborate descriptions of the media objects. It includes basic audio features, such as energy, frequency or mel-frequency cepstral coefficients, as well as basic natural language processing for the text media. At the mid-level (the content objects level), the information extraction process and the elements described are closer to the end-user. This level includes the description of musical concepts (e.g. rhythm, harmony, melody), or named entity recognition for text information. Finally, the highest level, Human Knowledge, includes information tightly related to human beings interacting with music knowledge.
Fig. 1. The music information plane and its semantic gap between content objects and human understanding.
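The reasoning level described above can be pictured as querying annotations that have been lifted into a formal representation. The sketch below, assuming rdflib and a purely illustrative vocabulary (these property names belong to neither MPEG-7 nor our ontology), retrieves pairs of audio files annotated with a guitar solo over the same key:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Illustrative vocabulary; not the actual MPEG-7 OWL terms of [1].
MUS = Namespace("http://example.org/music#")

g = Graph()
for uri, key, solo in [
    ("http://example.org/tracks/1", "D minor", "guitar"),
    ("http://example.org/tracks/2", "D minor", "guitar"),
    ("http://example.org/tracks/3", "A major", "guitar"),
]:
    track = URIRef(uri)
    g.add((track, MUS.key, Literal(key)))
    g.add((track, MUS.solo, Literal(solo)))

# A simple semantic rule as a SPARQL query:
# pairs of tracks sharing a key and a guitar solo.
query = """
SELECT ?a ?b WHERE {
    ?a <http://example.org/music#key> ?k ;
       <http://example.org/music#solo> "guitar" .
    ?b <http://example.org/music#key> ?k ;
       <http://example.org#/music#solo> "guitar" .
    FILTER (STR(?a) < STR(?b))
}
"""
for a, b in g.query(query):
    print(a, "shares a key and a guitar solo with", b)
```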
III. PUSHING THE CURRENT LIMITS

The main problem, then, is how to push automatic media-based descriptions up to the level of human understanding. We believe that this cannot be achieved if we focus on only one direction (say, a bottom-up approach). For many years Signal Processing has been the main discipline used to automatically generate music descriptors. More recently, Statistical Modeling, Machine Learning, Music Theory and Web Mining technologies (to name a few) have also been used to push up the semantic level of music descriptors. Even so, we believe that the current approaches to automatic music description, which are mainly bottom-up, will not allow us to bridge the semantic gap. Thus, we need an important shift in our approach. The music description problem will not be solved by focusing on the audio signals alone; a Multimodal Processing approach is needed. We also need top-down approaches based on Ontologies, Reasoning Rules, Music Cognition, or even Computational Neuroscience and Computational Musicology.
Regarding ontologies and basic reasoning rules, in [1] we have proposed a general multimedia ontology based on MPEG-7, described in the OWL language (http://www.w3.org/2004/OWL/), that allows us to formally describe the automatic annotations extracted from the audio (and, obviously, more general descriptions of multimedia assets). The approach contributes a complete and automatic mapping of the whole MPEG-7 standard to OWL. It is based on an XML Schema to OWL mapping that tries to be as transparent as possible. This mapping is complemented with a second one, from XML metadata instances to RDF, which completes a tool set to transfer metadata from the XML domain to the Semantic Web.
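To make the instance-level half of this tool set concrete, here is a minimal sketch of mapping a flat XML metadata instance to RDF with rdflib; the namespace, the element tags and the one-triple-per-element rule are illustrative assumptions, not the actual MPEG-7-to-OWL mapping of [1].

```python
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespace standing in for the MPEG-7 OWL ontology of [1].
MP7 = Namespace("http://example.org/mpeg7owl#")

def xml_instance_to_rdf(xml_string, track_uri):
    """Map a flat XML metadata instance to RDF: one triple per XML
    element, reusing the element tag as the property name."""
    graph = Graph()
    graph.bind("mp7", MP7)
    track = URIRef(track_uri)
    for element in ET.fromstring(xml_string):
        graph.add((track, MP7[element.tag], Literal(element.text)))
    return graph

# Toy MPEG-7-like instance; the tags are illustrative, not standard.
xml_doc = "<Audio><Key>D minor</Key><Tempo>126</Tempo></Audio>"
g = xml_instance_to_rdf(xml_doc, "http://example.org/tracks/42")
print(g.serialize(format="turtle"))
```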
Once all the multimedia metadata (not only the automatic acoustic annotations from the audio files, but editorial and cultural data too [2]) has been integrated into a common framework (that is, in our case, the MPEG-7 OWL ontology), we can benefit from the now explicit semantics. Based on this framework, we foresee several uses of the ontology that help the process of automatic music annotation, such as propagating annotations based on audio similarity, or detecting inconsistencies in editorial metadata.
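The first of these uses can be sketched as follows: an annotation is propagated from labelled tracks to an unlabelled one via its nearest neighbour in some audio descriptor space. The descriptor vectors, the Euclidean metric and the single-neighbour rule are all simplifying assumptions made for illustration.

```python
import numpy as np

def propagate_annotation(labelled, unlabelled_vec):
    """Return the annotation of the labelled track whose descriptor
    vector is closest (Euclidean distance) to the unlabelled one."""
    best_label, best_dist = None, np.inf
    for vec, label in labelled:
        dist = np.linalg.norm(np.asarray(vec) - np.asarray(unlabelled_vec))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Toy 2-D descriptor vectors (in practice, e.g. MFCC statistics).
labelled = [((0.1, 0.9), "guitar solo"), ((0.8, 0.2), "piano intro")]
print(propagate_annotation(labelled, (0.2, 0.8)))  # -> "guitar solo"
```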
IV. CONCLUSIONS

We have presented the music information plane and the semantic gap that exists between the content object level and human understanding. We foresee that a mixed approach (both bottom-up and top-down) can help to reduce the existing semantic gap in the music field.

Moreover, we are now witnessing an explosion of practical applications coming out of Music Information Retrieval research: Music Identification systems, Music Recommenders, Playlist Generators, Music Search Engines, and Music Discovery and Personalization systems, and this is just the beginning (a detailed list of MIR systems is available at http://mirsystems.info/).

At this stage, we might be closer to bridging the semantic gap in music than in any other multimedia knowledge domain. Music was a key factor in taking the Internet from its text-centered origins to being a complete multimedia environment. Music might do the same for the Semantic Web.

ACKNOWLEDGMENT

The reported research has been funded by the EU-FP6-IST-507142 project SIMAC (Semantic Interaction with Music Audio Contents). Additional information can be found at the project website, http://www.semanticaudio.org.

REFERENCES

[1] R. Garcia and O. Celma, "Semantic Integration and Retrieval of Multimedia Metadata," in Proceedings of the 4th International Semantic Web Conference, Galway, Ireland, 2005.
[2] F. Pachet, "Knowledge Management and Musical Metadata," in Encyclopedia of Knowledge Management, 2005.