A multimodal approach to bridge the Music Semantic Gap

Òscar Celma, Perfecto Herrera, and Xavier Serra

Òscar Celma, Perfecto Herrera, and Xavier Serra are with the Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain.

Abstract— In this paper we present the music information plane and the different levels of information extraction that exist in the musical domain. Based on this approach we propose a way to overcome the existing semantic gap in the music field. Our approach is twofold: we propose a set of music descriptors that can be extracted automatically from the audio signal, and a top-down multimodal approach that adds explicit and formal semantics to these annotations. We believe that merging both approaches (bottom-up and top-down) can overcome the existing semantic gap in the musical domain.

Index Terms— Semantic Gap, Music Information Retrieval, Multimodal Processing

I. INTRODUCTION

In recent years the typical music consumption behaviour has changed dramatically. Personal music collections have grown, favoured by technological improvements in networks, storage, portability of devices and Internet services. The amount and availability of songs has de-emphasized their value: it is often the case that users own many music files that they have listened to only once, or even never. It seems reasonable to think that by providing listeners with efficient ways to impose a personalized order on their collections, and with ways to explore the hidden "treasures" inside them, the value of those collections will increase drastically.

Besides, on the digital music distribution front, there is a need to find ways of improving music retrieval effectiveness. Artist, title, and genre keywords might not be the only criteria to help music consumers find music they like. This is currently achieved mainly by using cultural or editorial metadata ("this artist is somehow related with that one") or by exploiting existing purchasing behaviour data ("since you bought this artist, you might also want to buy this one, as other customers with a profile similar to yours did"). A largely unexplored (and potentially interesting) alternative is using semantic descriptors automatically extracted from the music audio files. These descriptors can be applied, for example, to organize a listener's collection, recommend new music, or generate playlists. In the past twenty years, the signal processing and computer music communities have developed a wealth of techniques and technologies to describe audio and music content at the lowest (or close-to-signal) level of representation. However, the gap between these low-level descriptors and the concepts that music listeners use to relate with music collections (the so-called "semantic gap") is still, to a large extent, waiting to be bridged.

II. THE MUSIC INFORMATION PLANE

Due to the inherent complexity of describing multimedia objects, a layered approach with different levels of granularity is needed when designing an ontology for a particular domain. Depending on the requirements, one might choose the appropriate level of abstraction. In the multimedia field and, in particular, in the music field, we foresee three levels of abstraction: low-level (physical and basic semantic) features, mid-level semantic features, and human understanding and reasoning. The first level includes physical features of the objects, such as the sampling rate of an audio file, as well as some basic features like the spectral centroid of an audio frame, or even the predominant chord in a sequential list of frames. A higher level of abstraction aims at describing concepts such as a guitar solo, or the tonality information (e.g. key and mode) of a music title. Finally, the reasoning level uses inference methods and semantic rules to retrieve, for instance, several audio files with similar guitar solos over the same key.

Similarly, we describe the music information plane along two dimensions. One dimension takes into account the different media types that serve as input data. The other dimension is the level of abstraction in the information extraction process applied to this data (see Fig. 1). The input media types include data coming from audio (music recordings), text (lyrics, editorial text, press releases, etc.) and image (video clips, CD covers, printed scores, etc.). For each media type, in turn, there are different levels of information extraction. The lowest level is located at the signal features. This level lies far away from what an end-user might find meaningful; nevertheless, it is the basis that allows us to describe the content and to produce more elaborate descriptions of the media objects. This level includes basic audio features such as energy, frequency or mel-frequency cepstral coefficients, as well as basic natural language processing for the text media. At the mid-level (the content objects level), the information extraction process and the elements described are closer to the end-user. This level includes descriptions of musical concepts (e.g. rhythm, harmony, melody), or named entity recognition for text information. Finally, the highest level, Human Knowledge, includes information tightly related to human beings interacting with music knowledge.

[Fig. 1. The music information plane and its semantic gap between content objects and human understanding.]

III. PUSHING THE CURRENT LIMITS

The main problem, then, is how to push automatic media-based descriptions up to the level of human understanding. We believe that this cannot be achieved if we focus on only one direction (say, a bottom-up approach). For many years Signal Processing has been the main discipline used to automatically generate music descriptors. More recently Statistical Modeling, Machine Learning, Music Theory and Web Mining technologies (to name a few) have also been used to push up the semantic level of music descriptors. However, we believe that the current approaches to automatic music description, which are mainly bottom-up, will not allow us to bridge the semantic gap. Thus, we need an important shift in our approach. The music description problem will not be solved by focusing on the audio signals alone; a Multimodal Processing approach is needed. We also need top-down approaches based on Ontologies, Reasoning Rules, Music Cognition, or even Computational Neuroscience and Computational Musicology.

Regarding ontologies and basic reasoning rules, in [1] we have proposed a general multimedia ontology based on MPEG-7 and described in the OWL1 language, which allows us to formally describe the automatic annotations extracted from the audio (and, obviously, more general descriptions of multimedia assets). The approach contributes a complete and automatic mapping of the whole MPEG-7 standard to OWL. It is based on an XML Schema to OWL mapping that tries to be as transparent as possible. This mapping is complemented with a mapping from XML metadata instances to RDF, which completes a tool set to transfer metadata from the XML domain to the Semantic Web.

Once all the multimedia metadata —not only automatic acoustic annotations from audio files, but editorial and cultural data too [2]— has been integrated in a common framework (in our case, the MPEG-7 OWL ontology), we can benefit from the now explicit semantics. Based on this framework, we foresee several uses of the ontology to help the process of automatic music annotation, such as propagating music annotations based on audio similarity, or detecting inconsistencies in editorial metadata.

IV. CONCLUSIONS

We have presented the music information plane and the existing semantic gap between the content object level and human understanding. We foresee that a mixed approach (both bottom-up and top-down) can help to reduce the existing semantic gap in the music field.

Moreover, we are now witnessing an explosion of practical applications coming out of Music Information Retrieval research: Music Identification systems, Music Recommenders, Playlist Generators, Music Search Engines, Music Discovery and Personalization systems, and this is just the beginning2. At this stage, we might be closer to bridging the semantic gap in music than in any other multimedia knowledge domain. Music was a key factor in taking the Internet from its text-centered origins to being a complete multimedia environment. Music might do the same for the Semantic Web.

ACKNOWLEDGMENT

The reported research has been funded by the EU-FP6-IST-507142 project SIMAC (Semantic Interaction with Music Audio Contents). Additional information can be found at the project website: http://www.semanticaudio.org.

REFERENCES

[1] Garcia, R. and Celma, O., "Semantic Integration and Retrieval of Multimedia Metadata," Proceedings of the 4th International Semantic Web Conference, Galway, Ireland, 2005.

[2] Pachet, F., "Knowledge Management and Musical Metadata," Encyclopedia of Knowledge Management, 2005.

1 http://www.w3.org/2004/OWL/
2 A detailed list of MIR systems is available at http://mirsystems.info/
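As a concrete illustration of the low-level signal features discussed in Section II, the following sketch computes the spectral centroid of a single audio frame. This is our own minimal example, not part of the SIMAC toolchain; the frame is synthetic, and a plain-Python DFT is used for self-containment (a real extractor would use an FFT library):

```python
import math

def spectral_centroid(frame, sample_rate):
    """Spectral centroid: the magnitude-weighted mean of the DFT bin
    frequencies, a rough correlate of perceived 'brightness'."""
    n = len(frame)
    weighted = 0.0
    total = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        re = sum(frame[i] * math.cos(2 * math.pi * k * i / n) for i in range(n))
        im = -sum(frame[i] * math.sin(2 * math.pi * k * i / n) for i in range(n))
        magnitude = math.hypot(re, im)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0

# Synthetic frame: a pure 1000 Hz sine, 256 samples at 8 kHz.
sr = 8000
frame = [math.sin(2 * math.pi * 1000 * i / sr) for i in range(256)]
print(round(spectral_centroid(frame, sr), 1))  # → 1000.0
```

A pure tone places all spectral energy in one bin, so the centroid coincides with its frequency; for real music frames the centroid summarizes where the spectral mass sits.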
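The annotation-propagation usage foreseen at the end of Section III can be sketched as a nearest-neighbour vote over low-level descriptor vectors: an unannotated track inherits the majority label of its acoustically closest annotated neighbours. Everything below (the function, the 2-D descriptor space, the labels) is a hypothetical toy, not the authors' implementation:

```python
import math
from collections import Counter

def propagate_annotation(annotated, query, k=3):
    """Label a new track with the majority annotation of its k closest
    annotated tracks, using Euclidean distance between descriptor
    vectors as a toy stand-in for real audio similarity."""
    by_distance = sorted(annotated, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D descriptor space with two annotated clusters.
annotated = [
    ((0.1, 0.2), "guitar solo"),
    ((0.2, 0.1), "guitar solo"),
    ((0.9, 0.8), "piano solo"),
    ((0.8, 0.9), "piano solo"),
]
print(propagate_annotation(annotated, (0.15, 0.15)))  # → guitar solo
```

The same neighbourhood structure supports the second foreseen usage: a track whose editorial metadata disagrees with the consensus label of its acoustic neighbours is a candidate inconsistency to flag.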