Curating Generative Raw Audio Music with D.O.M.E.

CJ Carr, Dadabots, Boston, USA (emperorcj@gmail.com)
Zack Zukowski, Dadabots, Los Angeles, USA (thedadabot@gmail.com)

ABSTRACT
With the creation of neural synthesis systems which output raw audio, it has become possible to generate dozens of hours of music. While not a perfect imitation of the original training data, the quality of neural synthesis can provide an artist with many variations of musical ideas. However, it is tedious for an artist to explore the full musical range and select interesting material when searching through the output. We needed a faster human curation tool, and we built it. DOME is the Disproportionately-Oversized Music Explorer. A PCA-component, k-means-clustered, RasterFairy-quantized t-SNE grid is used to navigate clusters of similar audio clips. Color mappings of spectral and chroma data assist the user by enriching the visual representation with meaningful features. Care is taken in the visualizations to help the user quickly develop an intuition for the similarity and range of sound in the rendered audio. This turns the time-consuming task of previewing hours of audio into something which can be done at a glance.

CCS CONCEPTS
• Human-centered computing → Visualization systems and tools; • Applied computing → Sound and music computing; • Computing methodologies → Machine learning.

KEYWORDS
audio clustering, t-SNE, PCA, k-means, generative music, visualization tool

ACM Reference Format:
CJ Carr and Zack Zukowski. 2019. Curating Generative Raw Audio Music with D.O.M.E.. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019. ACM, New York, NY, USA, 4 pages.

1 MOTIVATION
With the creation of neural synthesis systems which output raw audio, it has been getting easier to generate dozens of hours of music in a specific style. [1] describes a music production process where this generated audio is curated and arranged into albums. However, much of what these networks generate tends to fall into similar patterns, and it can be tedious to discover sections of musical interest when searching through the output. We have felt a need for faster human curation tools.

Digital audio workstations such as Ableton Live supply the user with only an amplitude visualization, which makes searching difficult. Others, such as Audacity, add a spectrogram visualization, but offer no method of clustering similar audio together.

Neural synthesizers include architectures based on convolutional neural networks (WaveNet [22]), recurrent neural networks (SampleRNN [11], WaveRNN [4]), and flow-based networks (WaveGlow [15]). This family of tools is often utilized as a vocoder component in end-to-end text-to-speech models [15] and can be appropriated for music synthesis [2, 25].

2 RELATED WORK
Self-organizing maps (SOMs) have been created for the purpose of organizing large libraries of audio, using clustering techniques to build grids of audio [16]. These interfaces are useful for seeing the similarity between artists or genres. Some systems have used non-technical-looking designs to encourage a feeling of exploration by stylizing the clustered datapoints into island-like [6, 14] and galaxy-like [19] environments. Our prototype adds to this grid approach by introducing visualizations of the audio which allow the user to leverage visual cues from spectral, chroma, and other metadata features.

Unlike [24], which uses text or thumbnails to represent audio, and unlike [6, 14], which use a histogram of critical bands, we choose mel-spectrograms in order to maintain local detail along the time axis.

3 METHOD

3.1 Audio Preparation
Our tool works well on audio datasets up to dozens of hours in total length, where individual audio files are typically 1 to 20 minutes long. Larger datasets of 100+ hours have not been tested.

In one example, we start with the entire wav output of a SampleRNN experiment, as described in [25], where the training data is the album 'Time Death' by earth metal band (((::ofthesun::))) and the output is generated music in the style of the training data. At each epoch during training, and after training, wav files are generated, each 1 to 5 minutes long, for a total of 10 hours of audio.

In another example, we worked with breakcore/electronic artist Drumcorps to train a separate net on each of the stems (guitar, voice, drums, synth) of his newest unreleased album, as well as on the combined master recording. For each net, at each epoch, and after training, wav files were generated, each 1 to 5 minutes long, totaling 50 hours.
3.2 Analysis
Three types of analysis are performed: a mel-spectrogram rolloff visualization, a source-separation-pre-processed chromagram visualization, and a PCA fingerprint used for clustering and nearest-neighbor sorting. Each audio file is loaded into our analyzer, which uses scikit-learn and the librosa library [9]. The STFT is computed at a 44.1 kHz sample rate with an FFT size of 1024, a window size of 1024, and a hop size of 256. The power spectrogram is made by squaring the absolute value of the STFT, discarding the phase. Frequencies are weighted according to A-weighting, a perceptual weighting scheme, and power is converted to dB.
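As a minimal sketch, this analysis pass can be reproduced with librosa roughly as follows; the function name and parameter plumbing here are ours, not DOME's:

```python
import librosa
import numpy as np

def analyze_power_db(path, sr=44100, n_fft=1024, hop=256):
    """dB power spectrogram with A-weighting, per the parameters above."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, win_length=n_fft, hop_length=hop)
    power = np.abs(stft) ** 2  # square the magnitude, discarding phase
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    # perceptual_weighting applies A-weighting and the dB conversion in one step
    return librosa.perceptual_weighting(power, freqs)
```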
3.3 Color
We follow the usability guidelines of memorability, informational delivery, and distinguishability [24] through color choices which make it easier to distinguish between different pieces of audio. It is important to choose the color mappings with care. Rainbow palettes are prone to illusions and to misrepresenting variance. However, in the case of chromagrams, because of the shared circular symmetry between pitch and hue, a rainbow palette is useful. Another consideration is the difficulty a color-blind person might have with some palettes. Since 1 in 12 men, and a little over 4% of the whole population, are color blind [23], we have included a key command that rotates hues to accommodate incomplete color blindness.

3.4 Rolloff Spectrogram
For the rolloff visualization (Figure 1), the spectrograms are scaled to 23 mel bins. Spectral rolloff is calculated for each frame and colored accordingly. We find rolloff to be a useful measure of bassiness versus trebliness. When looking at a spectrogram colored this way at a distance, the eye clearly sees long-term structure and spots similar-sounding sections in a collection. We also include the option to vertically reflect the spectrogram along the time axis, which seems to assist the eye in seeing structure, perhaps because of the vertical symmetry, or perhaps because users are conditioned to see waveforms this way.

Figure 1: Rolloff visualization mode. Frequency is on the y-axis, time is on the x-axis. The spectrogram is vertically reflected, and colored reddish when bass frequencies are dominant and blue for treble. Notice a quiet section on the left, followed by a rhythmic section to the right.

Figure 2: The same wav from Figure 1 in chroma visualization mode. Pitches are on the y-axis, time is on the x-axis. Notice it starts with a drone which holds out a 5th dyad between G (cyan) and D (yellow). The drone is comparatively quiet, but normalization of the chromagram makes it stand out visually.

3.5 Chromagram
For the chromagram visualization (Figure 2), the STFT is pre-processed with harmonic-percussive source separation (HPSS) [3]. A large margin of 10 is applied to single out the harmonic component, on which the chroma values are calculated; the percussive and residual components are discarded. The chromagram is colored according to the rainbow, such that the same pitches have the same color. To make true pitches pop out, and to avoid the case where white noise looks colorful, a norm of 1 is used: any chroma value under 0.5 is desaturated, and values between 0.5 and 1.0 are progressively saturated.

Figure 3: Chroma visualization with (top) and without (bottom) pre-processing with harmonic source separation using HPSS. Notice how removal of non-pitched audio de-noises the chromagram.
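A sketch of this pre-processing chain in librosa, using the margin and norm values from the text; the saturation ramp is our reading of the desaturation rule, and the names are ours:

```python
import librosa
import numpy as np

def harmonic_chroma(y, sr=44100, n_fft=1024, hop=256):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    # margin=10 keeps only strongly harmonic bins; the percussive and
    # residual components are discarded.
    harmonic, _ = librosa.decompose.hpss(S, margin=10.0)
    chroma = librosa.feature.chroma_stft(S=np.abs(harmonic) ** 2, sr=sr, norm=1)
    # Chroma below 0.5 is drawn desaturated; 0.5..1.0 ramps to full
    # saturation, so broadband noise does not look colorful.
    saturation = np.clip((chroma - 0.5) / 0.5, 0.0, 1.0)
    return chroma, saturation
```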
3.6 Metadata Visualization
More advanced visualization is possible if the generative model supplements the audio with metadata.

In Figure 4 we display the current epoch and iteration of the model which generated the audio, along with parameters such as beam_width [10].

In Figure 5 we display the change in temperature [18] over time. Temperature is a parameter of autoregressive sequence generation which regulates stochasticity when sampling the multinomial distribution during inference (see the sketch after the figure captions below). A lower temperature often causes the model to get stuck in repetitions.

In Figure 6 we display local conditioning [8, 17] over time. If we condition the model during training using a one-hot vector of size n for each of the n songs in the training data, then during generation we can change the value of this conditioning vector over time, which influences the audio to sound more like those particular songs.

Figure 4: Visualizing epoch, iteration, and beam_width metadata.

Figure 5: Visualizing change in temperature over time during inference.

Figure 6: Visualizing change in local conditioning over time during inference.
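Temperature sampling itself is standard autoregressive machinery rather than part of DOME; as a sketch, with `logits` standing in for a model's unnormalized outputs at one timestep:

```python
import numpy as np

def sample_step(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Dividing the logits by the temperature before the softmax sharpens
    # the distribution when t < 1 (prone to repetition) and flattens it
    # when t > 1 (more surprise, eventually noise).
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)
```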
3.7 Fingerprinting
For fingerprinting, the entire dataset is iteratively loaded, segmented into ten-second chunks with five-second hops, and converted to dB mel-spectrograms. Dimensionality reduction is performed using Incremental PCA, yielding 10 components per chunk. The learned PCA basis functions are shown in Figure 7.

Figure 7: PCA basis functions for each of the 10 components. The y-axis represents the 23 mel-frequency bands and the x-axis represents STFT frames over time.
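A sketch of this fingerprinting pass, assuming flattened (mel band x frame) chunks as suggested by Figure 7; the chunking helper and its names are ours:

```python
import librosa
import numpy as np
from sklearn.decomposition import IncrementalPCA

def fingerprint(paths, sr=44100, n_mels=23, chunk_s=10.0, hop_s=5.0):
    chunks = []
    for path in paths:  # load files one at a time rather than all at once
        y, _ = librosa.load(path, sr=sr, mono=True)
        size, step = int(chunk_s * sr), int(hop_s * sr)
        if len(y) < size:
            continue  # skip files shorter than one chunk
        for start in range(0, len(y) - size + 1, step):
            mel = librosa.feature.melspectrogram(
                y=y[start:start + size], sr=sr,
                n_fft=1024, hop_length=256, n_mels=n_mels)
            chunks.append(librosa.power_to_db(mel).ravel())
    X = np.vstack(chunks)
    # 10 components per chunk; partial_fit per file would bound memory further
    return IncrementalPCA(n_components=10).fit_transform(X)
```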
3.8 Grid Building Process
The grid interface is 10x10. To fit thousands of chunks to this grid, k-means clustering is applied to the PCA components with k=100. The cluster centroids are then spaced using t-SNE [7] and quantized into the final 10x10 grid using RasterFairy [5].

This process is intended to allow for the inspection of single audio chunks. t-SNE visualizations tend to contain many overlapping points, which makes it hard to preview a single example. t-SNE interfaces work well with short audio samples like those found in the Infinite Drum Machine [20], but interface issues arise with larger sections of music. RasterFairy, on the other hand, spreads these points out to cover an entire 2D grid.
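Under those choices, the grid construction reduces to a few library calls. This sketch follows RasterFairy's documented usage; the exact return shape of `transformPointCloud2D` may differ across versions, and the function name `build_grid` is ours:

```python
import rasterfairy
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def build_grid(fingerprints, side=10):
    # One gridcell per cluster centroid (k = 100 for a 10x10 grid).
    km = KMeans(n_clusters=side * side, random_state=0).fit(fingerprints)
    # Spread the centroids in 2D with t-SNE, then snap them to the grid.
    xy = TSNE(n_components=2, random_state=0).fit_transform(km.cluster_centers_)
    grid_xy, _ = rasterfairy.transformPointCloud2D(xy, target=(side, side))
    return km.labels_, grid_xy
```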
3.9 Exploratory Interface Design
In grid view, each gridcell represents a cluster centroid and randomly displays one of the six chunks nearest to that centroid. Clicking a gridcell sorts the chunks by their distance to that cell's centroid. In list view, the full audio files which contain those chunks are listed. The user scrolls down in list view and can seek to any position in the audio by clicking. Highlighted sections can be exported as wav files. At the end of every audio file, the player finds a new file to play. This autoplay ability enables a continuous listening experience that we find useful while passively auditioning renderings.

Figures 8 and 9 show what the app looks like with the rolloff and chroma visualizations. The left side of the app is grid view; the right side is list view.

Figure 8: DOME in spectrogram rolloff mode. The left side is grid view, an overview of the diversity of sound which has been loaded. Clicking a gridcell changes the list view on the right, sorting the audio by closeness to that gridcell.

Figure 9: DOME in chromagram mode. At a glance, for example, the user can see melodic vs non-melodic sections, drones, melodies, and occurrences of the note G (cyan).

3.10 Crowd Curation
Our prototype works in all modern web browsers. This environment was chosen for cross-device compatibility; our intention is to make curation more accessible for the purpose of crowd-sourcing. Export functionality allows the end user to rapidly collect variations of sonic elements and facilitates crowd-sourced collaboration.

4 INFORMAL EVALUATION
Informally, we gave three producers, experienced in sampling and curating music, the task of curating "interesting" pieces of music from 10 hours of SampleRNN-generated audio. They first used their preferred method of curation (which included "load all 10 hours into Ableton Live and look at the raw waveform" and "hunt around with the MacOS Finder and hope for the best"), then later used DOME. They self-reported that the curation process was between 5x and 20x faster with DOME.

Producer Drumcorps reported: "DOME was quite helpful in this project. Being able to scan through the audio content visually made it much easier to pick out useful and interesting sounds. After a short time browsing, I can get a sense of what a specific type of sound might look like, and I can start to find what I'm looking for much faster. I found it good to build a little directory of my favorite clips from DOME, and work from there, rather than the usual 'hunt around with the MacOS Finder and hope for the best' method. The two methods are incomparable, the differences are night and day, and the results end up different as well. With the Finder, I start out listening to files, but then often get frustrated and just select a few files at random, then choose the best one from those. With DOME I end up finding a wide variety quickly - and then can choose further work from a more informed position - it gives me a larger sample size. There's only so much listening one can do in a day, and if you need to listen to samples in real-time to determine which you're going to use, that's less time available for getting down to the actual composition work. Some samples contain wild shifts and interesting artefacts within them - you'll see this in DOME right away and be able to listen to that piece immediately. With the Finder, it's listening and hoping. There's a place for randomness and hidden surprises too - but I find that when I'm trying to get something done quickly, DOME is most helpful."

5 FUTURE WORK
The use of PCA for fingerprinting is limiting. While it is helpful at clustering similar audio textures together, it is not powerful and precise enough to distinguish nuances in sound. In the future, for fingerprints, we could use embeddings from a trained deep-net audio classifier.

Analysis with librosa was slow: a 10-hour dataset takes a few hours to analyze on a MacBook Pro. A C-compiled analyzer, or a distributed process (using cloud compute or AWS Lambda), would be more efficient.

Additional audio visualizations could include the fingerprint embeddings over time, and annotations of the audio used to prime the sequence before generation.

With the addition of a composing feature, the end user could arrange music by sticking curated sections together. With the addition of an upvoting feature, the crowd could further curate their favorite sections and arrangements.

6 CONCLUSION
Steady progress has been made on fast generative raw audio with neural synthesis. With the advent of audio style transfer [12, 13, 21], one could render all possible permutations of style transfers, yet would still need a good way to explore the output. Digital audio workstations such as Ableton Live are not fit for this task. We designed an interface to minimize the time and effort required to listen to hours of similar audio clips. Care was taken in the visualizations to aid the user. This turned the time-consuming task of previewing hours of audio into something which can be done at a glance. We believe self-organizing interfaces like ours will become more important as large directories of generated audio are rendered with faster inference speeds and greater parallelization.

A demo of this tool will be available online at dadabots.com/dome.

REFERENCES
[1] CJ Carr and Zack Zukowski. 2018. Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands. MUME (2018). arXiv:1811.06633 http://arxiv.org/abs/1811.06633
[2] Sander Dieleman, Aäron van den Oord, and Karen Simonyan. 2018. The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale. CoRR abs/1806.10474 (2018). arXiv:1806.10474 http://arxiv.org/abs/1806.10474
[3] Jonathan Driedger, Meinard Müller, and Sascha Disch. 2014. Extending Harmonic-Percussive Separation of Audio Signals. ISMIR (2014).
[4] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. CoRR abs/1802.08435 (2018). arXiv:1802.08435 http://arxiv.org/abs/1802.08435
[5] Mario Klingemann. 2015. RasterFairy. https://github.com/Quasimondo/RasterFairy
[6] Peter Knees, Markus Schedl, Tim Pohle, and Gerhard Widmer. 2006. An Innovative Three-dimensional User Interface for Exploring Music Collections Enriched. In Proceedings of the 14th ACM International Conference on Multimedia (MM '06). ACM, New York, NY, USA, 17-24. https://doi.org/10.1145/1180639.1180652
[7] L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9 (2008), 2579-2605.
[8] Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, and Brian Kulis. 2018. Conditioning Deep Generative Raw Audio Models for Structured Automatic Music. CoRR abs/1806.09905 (2018). arXiv:1806.09905 http://arxiv.org/abs/1806.09905
[9] Brian McFee, Matt McVicar, Stefan Balke, Carl Thomé, Vincent Lostanlen, Colin Raffel, Dana Lee, Oriol Nieto, Eric Battenberg, Dan Ellis, Ryuichi Yamamoto, Josh Moore, WZY, Rachel Bittner, Keunwoo Choi, Pius Friesch, Fabian-Robert Stöter, Matt Vollrath, Siddhartha Kumar, nehz, Simon Waloschek, Seth, Rimvydas Naktinis, Douglas Repetto, Curtis "Fjord" Hawthorne, CJ Carr, João Felipe Santos, Jackie Wu, Erik, and Adrian Holovaty. 2018. librosa/librosa: 0.6.2. https://doi.org/10.5281/zenodo.1342708
[10] M.F. Medress, F.S. Cooper, J.W. Forgie, C.C. Green, D.H. Klatt, M.H. O'Malley, E.P. Neuburg, A. Newell, D.R. Reddy, B. Ritea, J.E. Shoup-Hummel, D.E. Walker, and W.A. Woods. 1977. Speech Understanding Systems: Report of a Steering Committee. Artificial Intelligence 9, 3 (1977), 307-316. https://doi.org/10.1016/0004-3702(77)90026-1
[11] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron C. Courville, and Yoshua Bengio. 2016. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. CoRR abs/1612.07837 (2016). arXiv:1612.07837 http://arxiv.org/abs/1612.07837
[12] Parag K. Mital. 2017. Time Domain Neural Audio Style Transfer. CoRR abs/1711.11160 (2017). arXiv:1711.11160 http://arxiv.org/abs/1711.11160
[13] Noam Mor, Lior Wolf, Adam Polyak, and Yaniv Taigman. 2018. A Universal Music Translation Network. CoRR abs/1805.07848 (2018). arXiv:1805.07848 http://arxiv.org/abs/1805.07848
[14] Elias Pampalk, Simon Dixon, and Gerhard Widmer. 2004. Exploring Music Collections by Browsing Different Views. Comput. Music J. 28, 2 (June 2004), 49-62. https://doi.org/10.1162/014892604323112248
[15] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2018. WaveGlow: A Flow-based Generative Network for Speech Synthesis. CoRR abs/1811.00002 (2018). arXiv:1811.00002 http://arxiv.org/abs/1811.00002
[16] Andreas Rauber, Elias Pampalk, and Dieter Merkl. 2002. Using Psycho-Acoustic Models and Self-Organizing Maps to Create Hierarchical Structuring of Music by Sound Similarity. (2002).
[17] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2017. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. CoRR abs/1712.05884 (2017). arXiv:1712.05884 http://arxiv.org/abs/1712.05884
[18] Robin Sloan. 2018. Expressive Temperature. https://www.robinsloan.com/expressive-temperature/
[19] Sebastian Stober. 2010. MusicGalaxy: An Adaptive User-Interface for Exploratory Music Retrieval.
[20] Manny Tan and Kyle McDonald. 2017. Infinite Drum Machine. https://experiments.withgoogle.com/ai/drum-machine
[21] Dmitry Ulyanov. 2016. Audio Texture Synthesis and Style Transfer. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/
[22] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. CoRR abs/1609.03499 (2016). arXiv:1609.03499 http://arxiv.org/abs/1609.03499
[23] Fernanda Viégas and Martin Wattenberg. 2018. Visualization for Machine Learning (NeurIPS 2018 Tutorial). https://www.youtube.com/watch?v=ze08gwVPaXk
[24] Kazuyoshi Yoshii and Masataka Goto. 2008. Music Thumbnailer: Visualizing Musical Pieces in Thumbnail Images Based on Acoustic Features. In ISMIR.
[25] Zack Zukowski and CJ Carr. 2017. Generating Black Metal and Math Rock: Beyond Bach, Beethoven, and Beatles. NIPS Workshop on Machine Learning for Creativity and Design (2017). arXiv:1811.06639 http://arxiv.org/abs/1811.06639