=Paper= {{Paper |id=Vol-379/paper-9 |storemode=property |title=Automatic Beat-Synchronous Generation of Music Lead Sheets |pdfUrl=https://ceur-ws.org/Vol-379/paper17.pdf |volume=Vol-379 }}
          AUTOMATIC BEAT-SYNCHRONOUS GENERATION OF MUSIC LEAD SHEETS

DURRIEU Jean-Louis, durrieu@enst.fr
TELECOM ParisTech - TSI / LTCI
46 rue Barrault, F-75634 Paris Cedex 13, France

WEIL Jan, weil@nue.tu-berlin.de
TUB - Communication Systems Group
Einsteinufer 17, 10587 Berlin, Germany


                            ABSTRACT

Most popular music scores are written in a specific format, the lead sheet format. It summarizes a song by representing the notes of the main melody along with the chord sequence, together with other cues such as style, tempo and time signature. This sort of representation is very common in jazz and pop music, where the accompaniment playing the chord sequence is usually improvised. The aim of our study is to bring together two techniques, a chord detection system and a lead melody transcriber, in order to produce a lead sheet. In addition to the issues inherent to each problem, we also need to address tempo estimation and time signature estimation and, based on these estimates, time quantification of both the chord sequence and the melody line. We propose a tempo tracker that aligns the beats to the audio, and adapt the chord detection and melody extraction systems so as to take this new piece of information into account. Future work includes cover song detection based on the lead sheet representation, query-by-similarity applications, and so on.

                       1. INTRODUCTION

The lead-sheet format in music is well known among jazz and rock players. It consists of the main melody along with the chords of the accompaniment. It can also include more information such as the style of the song, the lyrics, the structure, etc.

Here we are interested in combining two existing systems, a chord detection algorithm and a melody extractor, in order to obtain such a representation. However, we were missing temporal information such as tempo and time signature. Tempo estimation is a well-studied problem and we base our system on previous work [1]. We also designed a method that aligns the beats to the data. Time signature estimation is still an open problem, for which we propose some general directions.

Some improvements can also come from fusing the results of the different algorithms. We propose some such improvements, but we expect that further studies could unravel even more of these correlated results.

This document is organized as follows: first, we present the proposed beat tracking and pulse alignment algorithms. Then we explain the chord and melody estimation modules. Thereafter, a short evaluation of these tasks is proposed. A short insight into what can be done for the time signature is also presented before the conclusion. At last, we conclude with some future work and perspectives.

(This work was partly supported by the European Commission under contract FP6-027026-K-SPACE.)

                  2. BEAT TRACKING MODULE

In order to produce musically relevant lead sheets, we need to determine the temporal structure of the song, i.e. the tempo, dealt with in this section, and the time signature, dealt with in section 5. To do so, the tempo is first estimated on 10 s long frames, with a 0.5 s hopsize. This estimation is based on a detection function proposed in [1]. From this function, an auto-correlation function (ACF) is computed for each window, which gives us an "ACF map". A Viterbi algorithm allows us to find the optimal tempo path, with a trade-off between the smoothness of the tempo variation and the maxima of the ACF map. We also output an estimate of the "tatum", which is assumed to be the smallest time unit of the song and is used for the melody quantification part.

We tackle the beat/tatum location problem using a dynamic programming approach. We use the same hopsize for the windows as previously, but their length is at least 10 beats/tatums per window, i.e. 10 times the maximum time lag between two beats. For each window, an impulse comb is generated with a period corresponding to the estimated tempo. The cross-correlation between the comb and the data in the window is stored in a matrix, the maxima of which give the time lags or "phases" necessary to align the combs to the data. In order to avoid off-beat problems, which are common in rock and jazz music, we designed a Viterbi algorithm that smoothes the variability of the pulse locations. Instead of smoothing the path "horizontally" in this phase matrix, it takes the tempo changes into account and favors phase locations where they are expected to occur according to the previous window.

At last, for each window, we place the pulses according to the estimated phase, taking care of possible "double" pulses by choosing a location between them where the onset detection function is at a local maximum.
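The comb alignment step above can be sketched in a few lines. This is an illustrative reconstruction under simplifying assumptions, not the authors' implementation: the names `onset_env`, `period` and `n_pulses` are hypothetical, the onset detection function is taken as a plain list of frame values, and the Viterbi smoothing across windows is omitted.

```python
def align_comb(onset_env, period, n_pulses=10):
    """Find the phase (lag, in frames) that best aligns an impulse comb
    of the given period with an onset detection function, as in the
    cross-correlation step described above (single window, no smoothing)."""
    best_phase, best_score = 0, float("-inf")
    for lag in range(period):
        # Correlation of the comb with the window at this lag: only the
        # comb's spike positions contribute, so it reduces to a sum of
        # onset values sampled every `period` frames, starting at `lag`.
        score = sum(onset_env[lag + k * period] for k in range(n_pulses))
        if score > best_score:
            best_phase, best_score = lag, score
    # Pulse locations in this window follow from the phase and the period.
    beats = [best_phase + k * period for k in range(n_pulses)]
    return best_phase, beats

# Toy example: a synthetic onset envelope with a peak every 20 frames,
# offset by 7 frames; the recovered phase should be that offset.
env = [0.0] * 300
for i in range(7, 300, 20):
    env[i] = 1.0
phase, beats = align_comb(env, period=20)  # phase = 7, beats = 7, 27, 47, ...
```

In practice the score matrix over all windows and lags would feed the Viterbi stage described above, which penalizes phase jumps that are inconsistent with the estimated tempo changes between consecutive windows.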
               3. DETECTION MODULES

3.1. Chord sequence detection

The chord detection method we developed is close to the system introduced in [3]: the chosen features are the tonal centroids, derived from the chroma vectors. A Hidden Markov Model (HMM) is assumed for the chord sequence. As in [3], we assume the transition probabilities depend only on the interval between the chords. Further studies should aim at using key-specific HMMs, in order to estimate the main key at the same time.

In order to integrate beat information, either we compute the features within segments obtained thanks to the beat locations given in section 2, or we constrain the Viterbi decoding of the chord sequence: within each segment, the state (i.e. the chord) is assumed constant. However, neither of these two solutions has given better results so far.

3.2. Main melody extraction

The main melody transcription module is based on the leading melody estimation of [2]. A source-filter model catches F0 candidates for each frame, and the main melody is computed thanks to a Viterbi smoothing algorithm, accomplishing a trade-off between the energy and the frequency proximity of consecutive candidates. This system only outputs a frame-wise sequence of frequencies in Hz. In order to obtain the desired sequence of temporally quantified notes (i.e. on the Western music scale), we use the tatum estimation of section 2. It provides segments on which we can decide which note was intended. Most pop music singers do not have strong vibrato, which makes this task rather straightforward in those cases: a simple decision such as taking the mean or the median of the output frequency sequence within each segment gives satisfying results. More studies on vibrato estimation may be useful in order to deal with classical music.

As pointed out in section 3.1, the algorithm can also separate the singing voice from the background music. This output can then be used as a pre-processing step for other tasks such as chord detection or multi-F0 estimation.

                     4. EVALUATION

The evaluation of such a transcription system as a whole is not clear yet. However, we can evaluate the different modules separately.

The chord detection was tested on a database of Beatles songs along with MIDI-synthesized ones. The recognition recall varies from 65% to 70%.

As for the main melody extractor, as stated in [2], it performs among the state-of-the-art systems, with 78% frame-wise recall on the pitched frames. One of the main drawbacks of this module for now is the lack of silence detection in vocal activity. As such, we observe a significant drop in the results when taking non-vocal frames into account, with 65% global recall. It also leads to spurious notes in the transcription. In order to avoid these, some heuristics can be applied, e.g. penalizing segments in which the melody varies too deeply.

                5. ABOUT TIME SIGNATURE

As discussed in the previous sections, we also need an estimate of the time signature of the song. This signature is a fraction: the denominator gives the musical unit related to the beat, while the numerator tells how many of these units there are in one measure.

We propose the following direction for future work on the topic: there are usually two trends for choosing the denominator. The song either has binary rhythmic patterns or ternary ones. In the first case, one can usually assume the unit to be the quarter note, with symbol 4; in the other case, it is often chosen as the eighth note, with symbol 8. As a first approximation, one can assign either of these two denominators. The tatum-to-beat ratio may give some insight as to which of them it should be.

Assuming that the chord changes mainly occur on the beats, and more specifically on the downbeats, at the beginning of measures, the numerator could be inferred from the harmonic structure. More evidence is needed for this last assumption, but it should give a rather straightforward way of estimating the time signature of the analyzed song.

                     6. CONCLUSIONS

In this study, we have found that each system can take advantage of the beat/tatum estimation, especially in the quantification step. This seems to produce musically relevant material. The result is not yet complete and we still need to estimate the time signature. This feature is closely related to the beat-to-tatum ratio, but also, we believe, to the melodic and harmonic structure. Further studies aim at designing a robust way of estimating the time signature as well as the overall structure of the musical piece, which would for example help avoid repetitions in the output lead sheet.

                     7. REFERENCES

[1] M. Alonso, G. Richard, and B. David. Extracting note onsets from musical recordings. ICME, 2005.

[2] J.-L. Durrieu, G. Richard, and B. David. Singer melody extraction in polyphonic signals using source separation methods. ICASSP, 2008.

[3] K. Lee and M. Slaney. Acoustic chord transcription and key extraction from audio using key-dependent HMMs trained on synthesized audio. IEEE Trans. on ASLP, 2008.