MedleyAssistant – A system for personalized music medley creation

Zhengshan Shi (CCRMA, Stanford University, Stanford, USA) kittyshi@ccrma.stanford.edu
Gautham J. Mysore (Adobe Research, San Francisco, USA) gmysore@adobe.com

ABSTRACT
In this paper, we present MedleyAssistant, a system to assist in the creation of music medleys from segments of existing music. Our goal is to make medley creation more accessible to novices, while still allowing them to express their own creative style. Our system addresses two key challenges in medley creation: determining which segments of music sound natural when transitioning to which other segments of music, and determining specific transition points between two given segments of music. This constrains the problem so that medleys created with our system tend to sound natural while allowing the user to be creative with music selection. We also provide a music visualization that helps users understand the musical principles of medley creation.

ACM Classification Keywords
H.5.5. Sound and Music Computing: Methodologies and Techniques; I.5.5. Implementation: Interactive systems

Author Keywords
music medley; creative MIR; personalized music creation.

INTRODUCTION AND MOTIVATION
A music medley is a piece of music that is composed or arranged from a series of songs or musical segments. In a high quality medley, each segment tends to flow naturally into the next, and the transitions typically sound seamless. Medleys provide a way to create new variations of music starting from existing music. They can be used for music playback by itself or as backing tracks for media such as videos and video games. They provide a way to customize a piece of music so that different sections of the media correspond to different segments of music.

Manual creation of high quality medleys can be a challenging task and typically requires a background in music and audio editing. Medleys are often created by musicians and DJs. The typical sequence of steps to create a medley is as follows:

1. Select a number of candidate musical segments from various pieces of music.
2. Determine a musically natural sequence of segments for the proposed medley from the above candidate set of segments.
3. Determine the exact transition points from a given segment to the following segment, crop the segments accordingly, and use a crossfade to stitch the segments together. This step is crucial for a seamless transition between segments.
4. Adjust tempos or keys when necessary by using traditional audio editing tools.

Steps 2 and 4 above require a keen musical ear or a background in music theory. Step 3 requires a certain amount of skill in audio editing. All three of these steps can be quite tedious. Step 1, however, requires less of a prior background in music and audio editing and can simply be based on music preference.

We present MedleyAssistant, a system to help people easily create personalized music medleys with little or no background in music and audio editing. Our system assists in step 2 and automates step 3. It allows users to be creative with song selection in step 1. Moreover, it visualizes certain musical features to help guide users through these steps and to help them better understand the underlying musical principles. We believe that this can make medley creation a more accessible process for novices, and allow experts to speed up their workflow.

RELATED WORK
Recent advances in Music Information Retrieval (MIR) techniques have given rise to intelligent musical interfaces [3, 9], making certain aspects of music creation more accessible to non-experts. This includes applications such as an automatic DJ [4], a song mixing tool [2], an automatic mashup system [1], and a loop creation system [11]. All of these applications help reuse parts of existing music to create new music.

To the best of our knowledge, the work that is most closely related to our proposed medley creation system is Music Cut and Paste [5, 6], a personalized music-cut-and-paste system, which is also used to create medleys. The key difference is that this system only allows users to specify the sequence of segments in terms of vocal and instrumental sections, whereas our system provides the user with significantly more flexibility in terms of choosing and adjusting segments.

©2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. MILC ’18, March 11, 2018, Tokyo, Japan

SYSTEM OVERVIEW
In this section, we describe the workflow and interface of MedleyAssistant. A user creates a medley using our system as follows:

1. The user provides a number of candidate songs based on their preference in music.
2. Our system provides a visualization of these songs, as shown in Figure 1. Specifically, it visualizes the chord structure and tempo. This visualization can help guide the user in the subsequent steps.
3. The user chooses the first segment of the medley based on the music that they would like to be the introduction. This forms the foundation of the medley because it in turn dictates subsequent segments. The user can choose this first segment by listening as well as by using the visualization as a guide. The beginning and end of the selection snap to the nearest beat.
4. Based on the first segment, the system assists in choosing the second segment. It estimates multiple potential candidates for the second segment (over all songs) and highlights them, as shown in Figure 2. The opacity of each gray box indicates a confidence level of how good the segment will sound. The system attempts to choose candidate segments that will musically flow well from the first segment based on chord structure, tempo, and timbre. The chosen segments will be 16 beats or shorter. We use a relatively small size so that the user can adjust the length to whatever they desire in the next step. The user can choose to use one of these segments, and the system will then perform the low-level edits to concatenate the two segments. However, the user can alternatively choose any part of any song (even if it is not highlighted) and use that as the second segment.
5. After choosing a segment, the user can drag the segment's ending boundary to adjust the length of the segment.
6. The user chooses the next segment as outlined in the previous two steps and continues this process until the medley is complete.

Figure 1. MedleyAssistant Interface. For each song, we visualize chords (as colored blocks on the waveform) and tempo (as a color density bar below the waveform).

The interface in our system is realized as a web application using wavesurfer1, an API built on top of Web Audio API2 and HTML5 Canvas.

Visualization
We visualize the chord structure of each song by overlaying the waveform on colored blocks, where each color corresponds to a different chord. We use only major and minor chords, so we have a total of 24 chords. The color scheme we use in visualizing each chord is inspired by Alexander Scriabin’s “Clavier à lumières” (“keyboard with lights”) [10], a keyboard instrument with colors assigned to different keys. For example, we use intense red for C and orange for G. We utilize this color scheme because related chords receive similar colors, so a desirable chord progression, such as C to G, is reflected in the proximity of the corresponding RGB values. The color mapping of the chords is shown in a chord visualization colormap in Figure 1.

We visualize tempo below the waveform, with a bar representing the rhythmic density of the music as a function of time. For the rhythmic feature, we first extract the dynamic tempos of the song, and then group the dynamic tempos into different sections. We use color interpolation to indicate the tempo density (i.e., a deeper color for a faster tempo).
1 https://wavesurfer-js.org/
2 https://www.w3.org/TR/webaudio/

As shown in Figure 1, our system visualizes both the chord structure and the tempo of each song. This visualization provides a tool for users to better understand the segments highlighted by the system, helps users choose new segments that are not highlighted, and can serve as an educational tool to better understand the music-theoretical principles of medley creation.

Our system demonstrates the visualization of two features (chord structure and tempo), but this can be extended to various other features. We think that different kinds of musical features could be used to assist in different styles of medley creation.

Figure 2. When a query segment (the gray section in the first song) is selected, a menu with a segment selection option appears at the bottom of the song so that the user can add the segment into the medley editor (bottom row). The system then calculates and suggests potential next segments to the user (the gray boxes in the second and the third song).

ALGORITHMS
In this section, we describe the algorithms that we use for visualization and automatic segment selection (as described in steps 2 and 4 of the workflow in the previous section).

We start with a pre-processing step in which we extract features at each beat of each song. These features are used both in the visualization and in the computation of the cost function described below.

Given a specific segment of the medley, which we refer to as the query segment, the goal of our algorithm is to estimate potential candidates for the next segment. We select the next candidate segment based on the following criteria:

1. Acoustically smooth and seamless transitions between consecutive segments. We compute this smoothness based on a cost function between the query segment and every other segment in every song, using a sliding window.
2. Consecutive segments should conform to a harmonic progression specified by music theory rules. From the acoustically smooth candidates (as determined by the previous step), we select a subset of candidate segments based on a harmonic progression factor.

Feature Extraction
We first estimate beat locations in each song by applying beat tracking on the onset envelope of the audio signal. At every beat of every song, we compute timbral features (Mel-frequency cepstral coefficients), signal energy level (root mean square), and harmonic features (chroma vectors). We also compute tempo over a window around each beat. We compute these features using librosa [8]. Additionally, we estimate the chords in each song using the chordino plugin based on NNLS Chroma [7], which we run in the VamPy host. These estimated chords are used in the visualization under the waveform.

Acoustic Smoothness
Given the query segment S1 of length N and a potential candidate segment S2 of length M = 16, our goal is to compute the optimal transition point from S1 to S2. We do this by computing a cost function between four-beat windows sliding over both S1 and S2. We define the optimal transition point as the location of the sliding windows with the lowest cost, which we refer to as the highest acoustic smoothness. We consider four-beat sliding windows starting from beat N/4 to beat N − 3 of S1, and from beat 1 to beat 3M/4 of S2. We do not consider the beginning of S1 and the ending of S2, in order to ensure that the resulting combination of S1 and S2 retains at least a part of each segment. Our cost function over a four-beat sliding window of S1 and S2 is:

C(S1_i, S2_j) = α ∑_{k=0..3} Dc(C_S1[i+k], C_S2[j+k])
              + (β/σm) ∑_{k=0..3} Dm(M_S1[i+k], M_S2[j+k])
              + (γ/σr) Dr(R_S1[i−4 : i−1], R_S2[j : j+3])
              + (δ/σt) Dt(S1, S2)                                  (1)

where Dc denotes the cosine distance of the chroma features, Dm denotes the Euclidean distance of the timbre features, Dr denotes the difference of root mean square energy, and Dt denotes the tempo difference. α, β, γ, and δ are tuning factors, and σm, σr, and σt are the corresponding standard deviations. Figure 3 illustrates the computation of the cost function.

We compute our tuning factors a priori, as follows. The goal is to find tuning factors that are indicative of acoustic smoothness. By definition, the transitions between consecutive segments of a given song are maximally smooth. Therefore, we compute the cost function between all consecutive segments of a number of songs using a number of different combinations of tuning factors (each tuning factor can vary from 0 to 1.0). We choose the combination of tuning factors that on average yields the lowest cost.

When we compute the cost function for segments S1 and S2, we choose the optimal transition point between S1 and S2 based on the i and j that yield the lowest cost. Given the optimal i and j, the optimal transition between the segments is to go from beat i − 1 of S1 to beat j of S2.

The query segment S1 has a cost (associated with the optimal transition point) with respect to each segment S2 of each song. We choose all of the segments S2 that have a cost under a threshold as candidate segments for the next step.
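The acoustic smoothness search described above can be sketched as follows. This is a simplified reading of Eq. (1): the feature arrays are assumed to be beat-synchronous (one row or value per beat), indices are adapted to 0-based Python, and the tuning factors and standard deviations default to 1. The helper names (`transition_cost`, `best_transition`) are ours, not part of the system.

```python
import numpy as np

def cosine_dist(a, b):
    """Cosine distance between two feature vectors (Dc in Eq. 1)."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def transition_cost(f1, f2, i, j,
                    alpha=1.0, beta=1.0, gamma=1.0, delta=1.0,
                    sig_m=1.0, sig_r=1.0, sig_t=1.0):
    """Cost of going from beat i of S1 to beat j of S2 over a four-beat
    window. f1 and f2 are (chroma, mfcc, rms, tempo) tuples."""
    chroma1, mfcc1, rms1, tempo1 = f1
    chroma2, mfcc2, rms2, tempo2 = f2
    c = sum(cosine_dist(chroma1[i + k], chroma2[j + k]) for k in range(4))
    m = sum(np.linalg.norm(mfcc1[i + k] - mfcc2[j + k]) for k in range(4))
    # RMS difference: the four beats leading into the transition in S1
    # versus the four beats leading out of it in S2.
    r = abs(np.sum(rms1[i - 4:i]) - np.sum(rms2[j:j + 4]))
    t = abs(tempo1 - tempo2)
    return alpha * c + beta * m / sig_m + gamma * r / sig_r + delta * t / sig_t

def best_transition(f1, f2):
    """Scan the window positions allowed by the paper (roughly beat N/4
    to N-3 of S1, beat 1 to 3M/4 of S2) for the lowest-cost (i, j)."""
    N, M = len(f1[0]), len(f2[0])
    best = (None, None, float("inf"))
    for i in range(max(4, N // 4), N - 3):
        for j in range(0, min(3 * M // 4, M - 3)):
            cost = transition_cost(f1, f2, i, j)
            if cost < best[2]:
                best = (i, j, cost)
    return best
```

In the full system the tuning factors and per-feature standard deviations would be the a-priori values described above rather than the defaults of 1 used in this sketch.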
Figure 3. Illustration of the computation of the cost function over a four-beat sliding window (dashed rectangle) between beat i in segment S1 and beat j in segment S2.

Harmonic Progression Factor
After obtaining a set of candidate segments with acoustically smooth transition points from S1, we determine which of these candidates yield a music-theoretically valid harmonic progression when transitioning from S1.

Given candidate beat b1 in S1 that connects to beat b2 in S2, we analyze the four-beat harmonic progression from S1[b1 − 1 : b1] to S2[b2 : b2 + 1], namely the last two beats of the transition point in S1 and the first two beats of the transition point in S2, as well as the two-beat harmonic progression from S1[b1] to S2[b2]. We assign a score Qb1,b2 as:

Q_{b1,b2} = P5th(S1[b1], S2[b2]) + Ppop(S1[b1−1 : b1], S2[b2 : b2+1])   (2)

where P5th is the chord transition probability from the last beat of the transition point in S1 to the first beat of the transition point in S2, based on the circle of fifths, and Ppop is the four-beat harmonic transition probability from the last two beats of the transition point in S1 to the first two beats of the transition point in S2, trained on the SALAMI dataset [12].

We choose all segments with a score Qb1,b2 above a threshold as candidate segments, and these are displayed as gray boxes in the interface, as shown in Figure 2. The confidence value of a given segment is based on its score Qb1,b2 and is mapped to the opacity of the corresponding gray box in the interface.

CONCLUSION
We present MedleyAssistant, an interactive music medley creation system that enables users to create personalized music medleys. Our informal pilot study showed that our interface makes medley creation significantly easier for novices. We believe that it could be a useful tool for experts as well, as it could help them create medleys more quickly.

REFERENCES
1. Matthew EP Davies, Philippe Hamel, Kazuyoshi Yoshii, and Masataka Goto. 2014. AutoMashUpper: Automatic creation of multi-song music mashups. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 12 (2014), 1726–1737.
2. Tatsunori Hirai, Hironori Doi, and Shigeo Morishima. 2015. MusicMixer: Computer-aided DJ system based on an automatic song mixing. In Proceedings of the 12th International Conference on Advances in Computer Entertainment Technology. ACM, 41.
3. Eric J Humphrey, Douglas Turnbull, and Tom Collins. 2013. A brief review of creative MIR. In Proceedings of the International Conference on Music Information Retrieval, Late Breaking Demo.
4. Hiromi Ishizaki, Keiichiro Hoashi, and Yasuhiro Takishima. 2009. Full-Automatic DJ mixing system with optimal tempo adjustment based on measurement function of user discomfort. In ISMIR. 135–140.
5. Yin-Tzu Lin, I-Ting Liu, Jyh-Shing Roger Jang, and Ja-Ling Wu. 2015. Audio musical dice game: A user-preference-aware medley generating system. TOMCCAP 11 (2015), 52:1–52:24.
6. I-Ting Liu, Yin-Tzu Lin, and Ja-Ling Wu. 2013. Music Cut and Paste: A personalized musical medley generating system. In ISMIR. 463–468.
7. Matthias Mauch and Simon Dixon. 2010. Approximate Note Transcription for the Improved Identification of Difficult Chords. In ISMIR. 135–140.
8. Brian McFee, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference. 18–25.
9. Markus Schedl, Emilia Gómez, Julián Urbano, and others. 2014. Music information retrieval: Recent developments and applications. Foundations and Trends® in Information Retrieval 8, 2-3 (2014), 127–261.
10. Aleksandr Scriabin and Leonid Sabaneev. 1913. Prométhée, le poème du feu pour grand orchestre et piano avec orgue, choeurs et clavier à lumières. Op. 60. Transcription pour 2 pianos à quatre mains par L. Sabaneiew. (1913).
11. Zhengshan Shi and Gautham J Mysore. 2018. LoopMaker: Automatic creation of music loops for pre-recorded music. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM.
12. Jordan Bennett Louis Smith, John Ashley Burgoyne, Ichiro Fujinaga, David De Roure, and J Stephen Downie. 2011. Design and creation of a large-scale database of structural annotations. In ISMIR, Vol. 11. 555–560.
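The harmonic progression score of Eq. (2) can be sketched as follows. The probability models here are toy stand-ins: the paper derives P5th from the circle of fifths and trains Ppop on the SALAMI dataset, so the linear circle-of-fifths falloff, the dictionary lookup, and the helper names below are illustrative assumptions only.

```python
# Chord roots in circle-of-fifths order (chords reduced to their roots
# for this sketch).
CIRCLE_OF_FIFTHS = ["C", "G", "D", "A", "E", "B", "F#", "C#",
                    "G#", "D#", "A#", "F"]

def p5th(chord_a, chord_b):
    """Toy chord-transition probability: the closer two roots lie on the
    circle of fifths, the more probable the transition."""
    ia = CIRCLE_OF_FIFTHS.index(chord_a)
    ib = CIRCLE_OF_FIFTHS.index(chord_b)
    steps = min((ia - ib) % 12, (ib - ia) % 12)  # circular distance, 0..6
    return 1.0 - steps / 6.0

def q_score(s1_chords, s2_chords, p_pop):
    """Q = P5th(last beat of S1, first beat of S2)
         + Ppop(last two beats of S1, first two beats of S2),
    with Ppop given as a lookup table of learned probabilities."""
    return (p5th(s1_chords[-1], s2_chords[0])
            + p_pop.get((tuple(s1_chords[-2:]), tuple(s2_chords[:2])), 0.0))

def select_candidates(query_chords, candidates, p_pop, threshold=1.0):
    """Keep candidates whose score clears the threshold; in the interface
    the score is then mapped to the opacity of the gray box."""
    scored = [(q_score(query_chords, c, p_pop), c) for c in candidates]
    return [(q, c) for q, c in scored if q > threshold]
```

For example, a C-to-C continuation scores the maximum P5th of 1.0, while a C-to-F# jump (a tritone, the far side of the circle) scores 0.0 under this toy model.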