<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title/>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name><given-names>Valentin</given-names> <surname>Emiya</surname></string-name>
          <email>valentin.emiya@enst.fr</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Roland</given-names> <surname>Badeau</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Adrien</given-names> <surname>Daniel</surname></string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Bertrand</given-names> <surname>David</surname></string-name>
        </contrib>
        <aff>TELECOM ParisTech (ENST), CNRS LTCI, rue Barrault, Paris cedex, France</aff>
      </contrib-group>
      <abstract>
        <p>The automatic transcription of music is a key task in the field of information retrieval in audio signals. This paper summarizes recent work on the automatic transcription of piano music. The first contribution is a full transcription system that analyzes an input recording and provides a MIDI file containing the estimated notes. The multipitch estimation stage of this system is based on a method which is detailed separately. Finally, we report some advances in the evaluation of the resulting transcriptions, in order to obtain relevant metrics from a perceptual point of view.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>Automatic transcription of music refers to the analysis of the
recording of a musical piece in order to extract the part of its
contents related to notes: pitches, onset times, duration, and
sometimes higher-level features like rhythm patterns, key and
time signatures, etc. As a process for information extraction,
the automatic transcription of music is a key task in the field
of Music Information Retrieval (MIR) in which it is not only
a target application, making it possible to implement some
audio-to-score and audio-to-MIDI algorithms, but also a basic
component for other applications such as indexing and
classification tasks, query by humming and other kinds of similarity
analysis, or score alignment and following. While systems
have been proposed for about thirty years [1–4],
the automatic transcription of music is still a very active
research field, giving rise to new approaches and increasingly
satisfying results.</p>
      <p>This paper is focusing on the automatic transcription of
piano music and its concomitant tasks. From a musical and a
state-of-the-art point of view, while the piano is a widely-used
instrument within the field of western music, while pieces for
piano solo are so numerous, the piano stands among the most
difficult musical instruments to be transcribed automatically
(e.g. see [5]). This may be due to its large fundamental
frequency (F0) range and to the virtuosity of pieces for piano,
causing fast and compact groups of notes and high polyphony</p>
      <p>The research leading to this paper was supported by the European
Commission under contract FP6-027026-K-SPACE.
levels, but also to intrinsic characteristics like the
inharmonicity or the beats occurring in its sounds. This thus motivates
our choice to investigate transcription methods specific to
piano music.</p>
      <p>This paper is structured as follows. A full system for
automatic transcription of piano is introduced in Section 2. It
includes a method for the estimation of simultaneous pitches
detailed in Section 3. Finally, we raise the question of the
evaluation of the resulting transcriptions and propose some
enhancements to the usual evaluation metrics in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>2. AUTOMATIC TRANSCRIPTION SYSTEM</title>
      <p>The input of our transcription system [6] is a monaural
recording of a piece of piano music sampled at 22 kHz. This signal
is considered as a sum of notes and noise, and is observed and
analyzed in 93 ms overlapping frames. Each note is modeled
by a sum of sinusoids, the so-called partials, with frequencies
distributed according to the inharmonicity law:

f_h = h f_0 √(1 + β h²)    (1)

where f_0 is the fundamental frequency, which identifies the
note, β is the inharmonicity coefficient and h is the partial
order. This pseudo-harmonic, piano-specific distribution is
due to the stiffness of piano strings: β values are typically
around 10⁻³, depending on the note, whereas the spectra of
musical instruments like winds or bowed strings are perfectly
harmonic (i.e. β = 0). Besides, two additional assumptions
help characterize piano sounds: the onsets of notes are
percussive (there are no tied notes, unlike with other
instruments) and frequency modulations are not significant
(no glissando, no vibrato).</p>
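      <p>As a quick illustration of eq. (1), the following sketch computes the partial frequencies of an inharmonic string and compares them with the purely harmonic case; the f0 of C4 (about 261.6 Hz) and β = 10⁻³ are illustrative values, not parameters taken from the system.</p>
      <preformat>
import math

def partial_frequencies(f0, beta, n_partials):
    """Partial frequencies under the inharmonicity law of eq. (1):
    f_h = h * f0 * sqrt(1 + beta * h**2)."""
    return [h * f0 * math.sqrt(1.0 + beta * h**2)
            for h in range(1, n_partials + 1)]

# A C4 string (f0 about 261.6 Hz) with a typical inharmonicity coefficient:
for h, fh in enumerate(partial_frequencies(261.6, 1e-3, 8), start=1):
    print(f"partial {h}: {fh:7.1f} Hz (harmonic: {h * 261.6:7.1f} Hz)")
      </preformat>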
      <p>In this context, we adopt the following strategy to perform
the transcription task:
• Onset detection: the signal is segmented according to
the appearance of new events (i.e. notes).
• F0 candidate selection: after each onset, a superset of
likely notes is estimated in order to reduce the
subsequent computational load and to estimate accurate values of
F0’s and inharmonicity coefficients using eq. (1).
• HMM tracking of most likely notes: between two
consecutive onsets, one hidden Markov model (HMM) is
used to select the most likely path among all
combinations of note candidates, each combination being a
hidden state (see Fig. 1 and the sketch below).
• MIDI output: a MIDI file is generated as the output of
the system (audio examples are available on the authors’
web site at http://perso.enst.fr/~emiya/EUSIPCO08/).</p>
      <p>[Fig. 1: hidden states of the HMM for a two-note example (silence, C4, E4, C4E4), repeated over the 1st, 2nd, 3rd, ..., last frames between two consecutive onsets.]</p>
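      <p>The note tracking step is a standard Viterbi decoding over these combination states. The sketch below is a generic Viterbi implementation under assumed inputs, not the authors’ code: the toy transition and observation log-probabilities are invented, whereas in the real system the observation likelihoods come from the multipitch estimator of Section 3.</p>
      <preformat>
import math

def viterbi(states, log_trans, log_obs, n_frames):
    """Most likely state path; log_trans[r][s] is the transition
    log-probability and log_obs[t][s] the frame log-likelihood."""
    best = {s: log_obs[0][s] for s in states}  # best path ending in s
    back = []                                  # backpointers per frame
    for t in range(1, n_frames):
        prev, best, ptr = best, {}, {}
        for s in states:
            p, arg = max((prev[r] + log_trans[r][s], r) for r in states)
            best[s], ptr[s] = p + log_obs[t][s], arg
        back.append(ptr)
    s = max(best, key=best.get)                # best final state
    path = [s]
    for ptr in reversed(back):                 # backtrack
        s = ptr[s]
        path.append(s)
    return list(reversed(path))

# Toy run: two states, three frames, uniform transitions (invented numbers).
states = ["silence", "C4"]
lt = {r: {s: math.log(0.5) for s in states} for r in states}
lo = [{"silence": math.log(0.8), "C4": math.log(0.2)},
      {"silence": math.log(0.3), "C4": math.log(0.7)},
      {"silence": math.log(0.4), "C4": math.log(0.6)}]
print(viterbi(states, lt, lo, 3))  # ['silence', 'C4', 'C4']
      </preformat>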
    </sec>
    <sec id="sec-2">
      <title>3. MULTIPITCH ESTIMATION</title>
      <p>Specific work has been dedicated to the development
of the multipitch estimation method [7]. It has then been
embedded in the HMM tracking of the above transcription
system, as a means to estimate the likelihood of any observed
frame in any given state.</p>
      <p>The main idea consists in finding parametric models for
the spectral envelopes of notes and for the noise. By using
a low-order autoregressive (AR) model, the spectral
smoothness principle [8] is implemented, allowing us to deal with the
variability of piano spectra. Besides, the parametric aspect
makes it possible to derive an estimator for the amplitudes of
partials in the case of overlapping spectra. The noise is
modeled by a moving-average (MA) process, which is a more
fitting model for audio signals than the commonly-chosen white
noise, and is more discriminating than an AR model when
attempting to analyze a sinusoid+noise mixture. The resulting
multipitch estimation technique provides the likelihood of
the data given a set of possible simultaneous notes and is
robust for estimating the polyphony level (i.e. the number of
simultaneous notes).</p>
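      <p>A minimal sketch of the envelope idea follows: a low-order AR model fitted by the autocorrelation (Yule-Walker) method yields a smooth all-pole spectral envelope, in the spirit of the spectral smoothness principle. This is a generic textbook fit under invented signal parameters, not the authors’ estimator, and the MA noise model is omitted for brevity.</p>
      <preformat>
import numpy as np

def ar_envelope(x, order, n_fft=1024):
    """Fit AR coefficients via the Yule-Walker equations and return the
    magnitude of the all-pole envelope on n_fft//2 + 1 frequency bins."""
    n = len(x)
    r = np.correlate(x, x, mode="full")[n - 1:n + order]  # lags 0..order
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                # AR coefficients
    # Evaluate 1 / |A(e^{jw})| with A(z) = 1 - sum_k a_k z^{-k}
    denom = np.fft.rfft(np.concatenate(([1.0], -a)), n=n_fft)
    return 1.0 / np.abs(denom)

# Toy frame at 22 kHz: two partials plus weak noise (illustrative values).
fs = 22050
t = np.arange(2048) / fs
x = np.sin(2 * np.pi * 262 * t) + 0.5 * np.sin(2 * np.pi * 525 * t)
x = x + 0.01 * np.random.default_rng(0).standard_normal(len(t))
print(ar_envelope(x, order=8).shape)  # (513,): smooth envelope over [0, fs/2]
      </preformat>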
    </sec>
    <sec id="sec-4">
      <title>4. EVALUATION OF TRANSCRIPTIONS</title>
      <p>Usually, transcription systems are evaluated by generating a
set of transcriptions, by classifying the notes among correct
estimations (or true positives, TP), false alarms (or false
positives, FP) and missing notes (or false negatives, FN), and by
using a metric like the so-called F-measure, defined by

f ≜ (2 × recall × precision) / (recall + precision) = #TP / (#TP + ½ #FN + ½ #FP)    (2)

in order to give a score to each algorithm. However, in that
expression, each error has an equal weight. In a perception
test [<xref ref-type="bibr" rid="ref1">9</xref>], we showed that errors can be divided into several
classes, each of them having its own perceptual weight. A
perceptually-based version of the usual metrics can thus be
defined, such as the following perceptual F-measure:

#TP / (#TP + Σ_{i=1}^{6} α_i w_i #E_i)    (3)

with perceptual weights α_i for typical errors i (octave, fifth,
other intervals, deletion, duration, onset) and time-related
weights w_i.</p>
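      <p>In code, both metrics reduce to a few lines. The sketch below implements eq. (2) and eq. (3); the error counts and the weights α_i and w_i passed in the example are placeholders, since the actual perceptual weights are those estimated from the listening test of [9].</p>
      <preformat>
def f_measure(tp, fp, fn):
    """Eq. (2): f = #TP / (#TP + 0.5 #FN + 0.5 #FP), the harmonic mean
    of precision and recall."""
    return tp / (tp + 0.5 * fn + 0.5 * fp)

def perceptual_f_measure(tp, errors, alpha, w):
    """Eq. (3): #TP / (#TP + sum_i alpha_i w_i #E_i) over the six error
    classes (octave, fifth, other intervals, deletion, duration, onset)."""
    assert len(errors) == len(alpha) == len(w) == 6
    return tp / (tp + sum(a * wi * e for a, wi, e in zip(alpha, w, errors)))

print(f_measure(tp=80, fp=10, fn=20))              # 0.842...
print(perceptual_f_measure(80, [5, 2, 3, 10, 6, 4],
                           alpha=[1.0] * 6, w=[1.0] * 6))  # placeholder weights
      </preformat>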
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>[2] A.T. Cemgil, Bayesian Music Transcription, Ph.D. thesis, SNN, Radboud University Nijmegen, the Netherlands, 2004.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Marolt, “A connectionist approach to automatic transcription of polyphonic piano music,” IEEE Trans. on Multimedia, vol. 6, no. 3, pp. 439–449, 2004.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] G. Peeters, “Music pitch representation by periodicity measures based on combined temporal and spectral representations,” in Proc. of ICASSP 2006, Toulouse, France, May 2006, vol. 5, pp. 53–56.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] V. Emiya, R. Badeau, and B. David, “Automatic transcription of piano music based on HMM tracking of jointly-estimated pitches,” in Proc. of EUSIPCO, Lausanne, Switzerland, Aug. 2008.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of inharmonic sounds in colored noise,” in Proc. of DAFx, Bordeaux, France, Sept. 2007, pp. 93–98.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] A.P. Klapuri, “Multiple fundamental frequency estimation based on harmonicity and spectral smoothness,” IEEE Trans. on Speech and Audio Processing, vol. 11, no. 6, pp. 804–816, Nov. 2003.</mixed-citation>
      </ref>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Daniel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Emiya</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>David</surname>
          </string-name>
          , “
          <article-title>Perceptually-based evaluation of the errors usually made when automatically transcribing music</article-title>
          ,” in
          <source>Proc. of ISMIR</source>
          , Philadelphia, Pennsylvania, USA, Sept.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>