Curating Generative Raw Audio Music with D.O.M.E.

CJ Carr, Dadabots, Boston, USA (emperorcj@gmail.com)
Zack Zukowski, Dadabots, Los Angeles, USA (thedadabot@gmail.com)

ABSTRACT
With the creation of neural synthesis systems which output raw audio, it has become possible to generate dozens of hours of music. While not a perfect imitation of the original training data, the quality of neural synthesis can provide an artist with many variations of musical ideas. However, it is tedious for an artist to explore the full musical range and select interesting material when searching through the output. We needed a faster human curation tool, and we built it. DOME is the Disproportionately-Oversized Music Explorer. A PCA-component, k-means-clustered, RasterFairy-quantized t-SNE grid is used to navigate clusters of similar audio clips. Color mappings of spectral and chroma data assist the user by enriching the visual representation with meaningful features. Care is taken in the visualizations to help the user quickly develop an intuition for the similarity and range of sound in the rendered audio. This turns the time-consuming task of previewing hours of audio into something which can be done at a glance.

CCS CONCEPTS
• Human-centered computing → Visualization systems and tools; • Applied computing → Sound and music computing; • Computing methodologies → Machine learning.

KEYWORDS
audio clustering, t-SNE, PCA, k-means, generative music, visualization tool

ACM Reference Format:
CJ Carr and Zack Zukowski. 2019. Curating Generative Raw Audio Music with D.O.M.E.. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019. ACM, New York, NY, USA, 4 pages.

1 MOTIVATION
With the creation of neural synthesis systems which output raw audio, it has been getting easier to generate dozens of hours of music in a specific style. [1] describes a music production process where this generated audio is curated and arranged into albums. However, much of what these networks generate tends to fall into similar patterns, and it can be tedious to discover sections of musical interest when searching through the output. We have felt a need for faster human curation tools.

Digital audio workstations such as Ableton Live supply the user with only an amplitude visualization, which makes searching difficult. Others, such as Audacity, add a spectrogram visualization, but offer no method of clustering similar audio together.

Neural synthesizers include architectures based on convolutional neural networks (WaveNet [22]), recurrent neural networks (SampleRNN [11], WaveRNN [4]), and flow-based networks (WaveGlow [15]). This family of tools is often utilized as a vocoder component in end-to-end text-to-speech models [15] and can be appropriated for music synthesis [2, 25].

2 RELATED WORK
Self-organizing maps (SOMs) have been created for the purpose of organizing large libraries of audio, using clustering techniques to build grids of audio [16]. These interfaces are useful for seeing the similarity between artists or genres. Some systems have used non-technical-looking designs to encourage a feeling of exploration by stylizing the clustered datapoints into island-like [6, 14] and galaxy-like [19] environments. Our prototype adds to this grid approach by introducing visualizations of the audio which allow the user to leverage visual cues from spectral, chroma, and other metadata features.

Unlike [24], which uses text or thumbnails to represent audio, and unlike [6, 14], which use a histogram of critical bands, we choose mel-spectrograms in order to maintain local detail along the time axis.

3 METHOD

3.1 Audio Preparation
Our tool works well on audio datasets up to dozens of hours in total length, where individual audio files are typically 1 to 20 minutes long. Larger datasets of 100+ hours have not been tested.

In one example, we start with the entire wav output of a SampleRNN experiment, as described in [25], where the training data is the album 'Time Death' by earth metal band (((::ofthesun::))) and the output is generated music in the style of the training data. At each epoch during training, and after training, wav files are generated, each 1 to 5 minutes long, for a total of 10 hours of audio.

In another example, we worked with breakcore/electronic artist Drumcorps to train a separate net on each of the stems (guitar, voice, drums, synth) of his newest unreleased album, as well as on the combined master recording. For each net, at each epoch, and after training, wav files were generated, each 1 to 5 minutes long, totaling 50 hours.
3.2 Analysis
Three types of analysis are performed: a mel-spectrogram rolloff visualization, a source-separation-pre-processed chromagram visualization, and a PCA fingerprint used for clustering and nearest-neighbor sorting. Each audio file is loaded into our analyzer, which uses scikit-learn and the librosa library [9]. The STFT is computed at a 44.1 kHz sample rate with an FFT size of 1024, a window size of 1024, and a hop size of 256. The power spectrogram is made by squaring the absolute value of the STFT, discarding the phase. Frequencies are weighted according to A-weighting, a perceptual weighting scheme, and power is converted to dB.
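As a minimal sketch, this analysis pass can be reproduced with librosa roughly as follows; the function name and parameter plumbing here are ours, not DOME's:

```python
import librosa
import numpy as np

def analyze_power_db(path, sr=44100, n_fft=1024, hop=256):
    """dB power spectrogram with A-weighting, per the parameters above."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, win_length=n_fft, hop_length=hop)
    power = np.abs(stft) ** 2  # square the magnitude, discarding phase
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    # perceptual_weighting applies A-weighting and the dB conversion in one step
    return librosa.perceptual_weighting(power, freqs)
```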
3.3 Color
We follow the usability guidelines of memorability, informational delivery, and distinguishability [24] through color choices which make it easier to distinguish between different pieces of audio. It is important to choose the color mappings with care. Rainbow palettes are prone to illusions and to misrepresenting variance. However, in the case of chromagrams, because of the shared circular symmetry between pitch and hue, a rainbow palette is useful. Another consideration is the difficulty a color-blind person might have with some palettes. Since 1 in 12 men, and a little over 4% of the whole population, are color blind [23], we have included a key command that rotates hues to accommodate incomplete color blindness.

3.4 Rolloff Spectrogram
For the rolloff visualization (Figure 1), the spectrograms are scaled to 23 mel bins. Spectral rolloff is calculated for each frame and colored accordingly. We find rolloff to be a useful measure of bassiness versus trebliness. When looking at a spectrogram colored this way at a distance, the eye clearly sees long-term structure and spots similar-sounding sections in a collection. We also include the option to vertically reflect the spectrogram along the time axis, which seems to assist the eye in seeing structure, perhaps because of the vertical symmetry, or perhaps because users are conditioned to see waveforms this way.

Figure 1: Rolloff visualization mode. Frequency is on the y-axis, time is on the x-axis. The spectrogram is vertically reflected, and colored reddish when bass frequencies are dominant and blue for treble. Notice a quiet section on the left, followed by a rhythmic section to the right.

Figure 2: The same wav from Figure 1 in chroma visualization mode. Pitches are on the y-axis, time is on the x-axis. Notice it starts with a drone which holds out a 5th dyad between G (cyan) and D (yellow). The drone is comparatively quiet, but normalization of the chromagram makes it stand out visually.

3.5 Chromagram
For the chromagram visualization (Figure 2), the STFT is pre-processed with harmonic-percussive source separation (HPSS) [3]. A large margin of 10 is applied to single out the harmonic component, on which the chroma values are calculated; the percussive and residual components are discarded. The chromagram is colored according to the rainbow, such that the same pitches have the same color. To make true pitches pop out, and to avoid the case where white noise looks colorful, a norm of 1 is used: any chroma value under 0.5 is desaturated, and values between 0.5 and 1.0 are progressively saturated.

Figure 3: Chroma visualization with (top) and without (bottom) pre-processing with harmonic source separation using HPSS. Notice how removal of non-pitched audio de-noises the chromagram.
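A sketch of this pre-processing chain in librosa, using the margin and norm values from the text; the saturation ramp is our reading of the desaturation rule, and the names are ours:

```python
import librosa
import numpy as np

def harmonic_chroma(y, sr=44100, n_fft=1024, hop=256):
    S = librosa.stft(y, n_fft=n_fft, hop_length=hop)
    # margin=10 keeps only strongly harmonic bins; the percussive and
    # residual components are discarded.
    harmonic, _ = librosa.decompose.hpss(S, margin=10.0)
    chroma = librosa.feature.chroma_stft(S=np.abs(harmonic) ** 2, sr=sr, norm=1)
    # Chroma below 0.5 is drawn desaturated; 0.5..1.0 ramps to full
    # saturation, so broadband noise does not look colorful.
    saturation = np.clip((chroma - 0.5) / 0.5, 0.0, 1.0)
    return chroma, saturation
```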
3.6 Metadata Visualization
More advanced visualization is possible if the generative model supplements the audio with metadata.

In Figure 4 we display the current epoch and iteration of the model which generated the audio, along with parameters such as beam_width [10].

In Figure 5 we display the change in temperature [18] over time. Temperature is a parameter of autoregressive sequence generation which regulates stochasticity when sampling the multinomial distribution during inference (see the sketch after the figure captions below). A lower temperature often causes the model to get stuck in repetitions.

In Figure 6 we display local conditioning [8, 17] over time. If we condition the model during training using a one-hot vector of size n for each of the n songs in the training data, then during generation we can change the value of this conditioning vector over time, which influences the audio to sound more like those particular songs.

Figure 4: Visualizing epoch, iteration, and beam_width metadata.

Figure 5: Visualizing change in temperature over time during inference.

Figure 6: Visualizing change in local conditioning over time during inference.
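Temperature sampling itself is standard autoregressive machinery rather than part of DOME; as a sketch, with `logits` standing in for a model's unnormalized outputs at one timestep:

```python
import numpy as np

def sample_step(logits, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    # Dividing the logits by the temperature before the softmax sharpens
    # the distribution when t < 1 (prone to repetition) and flattens it
    # when t > 1 (more surprise, eventually noise).
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return rng.choice(len(p), p=p)
```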
3.7 Fingerprinting
For fingerprinting, the entire dataset is iteratively loaded, segmented into ten-second chunks with five-second hops, and converted to dB mel-spectrograms. Dimensionality reduction is performed using Incremental PCA, yielding 10 components per chunk. The learned PCA basis functions are shown in Figure 7.

Figure 7: PCA basis functions for each of the 10 components. The y-axis represents the 23 mel-frequency bands and the x-axis represents STFT frames over time.
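A sketch of this fingerprinting pass, assuming flattened (mel band x frame) chunks as suggested by Figure 7; the chunking helper and its names are ours:

```python
import librosa
import numpy as np
from sklearn.decomposition import IncrementalPCA

def fingerprint(paths, sr=44100, n_mels=23, chunk_s=10.0, hop_s=5.0):
    chunks = []
    for path in paths:  # load files one at a time rather than all at once
        y, _ = librosa.load(path, sr=sr, mono=True)
        size, step = int(chunk_s * sr), int(hop_s * sr)
        if len(y) < size:
            continue  # skip files shorter than one chunk
        for start in range(0, len(y) - size + 1, step):
            mel = librosa.feature.melspectrogram(
                y=y[start:start + size], sr=sr,
                n_fft=1024, hop_length=256, n_mels=n_mels)
            chunks.append(librosa.power_to_db(mel).ravel())
    X = np.vstack(chunks)
    # 10 components per chunk; partial_fit per file would bound memory further
    return IncrementalPCA(n_components=10).fit_transform(X)
```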
3.8 Grid Building Process
The grid interface is 10x10. To fit thousands of chunks to this grid, k-means clustering is applied to the PCA components with k=100. The cluster centroids are then spaced using t-SNE [7] and quantized into the final 10x10 grid using RasterFairy [5].

This process is intended to allow for the inspection of single audio chunks. t-SNE visualizations tend to contain many overlapping points, which makes it hard to preview a single example. t-SNE interfaces work well with short audio samples like those found in the Infinite Drum Machine [20], but interface issues arise with larger sections of music. RasterFairy, on the other hand, spreads these points out to cover an entire 2D grid.
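Under those choices, the grid construction reduces to a few library calls. This sketch follows RasterFairy's documented usage; the exact return shape of `transformPointCloud2D` may differ across versions, and the function name `build_grid` is ours:

```python
import rasterfairy
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

def build_grid(fingerprints, side=10):
    # One gridcell per cluster centroid (k = 100 for a 10x10 grid).
    km = KMeans(n_clusters=side * side, random_state=0).fit(fingerprints)
    # Spread the centroids in 2D with t-SNE, then snap them to the grid.
    xy = TSNE(n_components=2, random_state=0).fit_transform(km.cluster_centers_)
    grid_xy, _ = rasterfairy.transformPointCloud2D(xy, target=(side, side))
    return km.labels_, grid_xy
```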
3.9 Exploratory Interface Design
In grid view, each gridcell represents a cluster centroid and randomly displays one of the six chunks nearest to that centroid. Clicking a gridcell sorts the chunks by their distance to that cell's centroid. In list view, the full audio files which contain those chunks are listed. The user scrolls down in list view and can seek to any position in the audio by clicking. Highlighted sections can be exported as wav files. At the end of every audio file, the player finds a new file to play. This autoplay ability enables a continuous listening experience that we find useful while passively auditioning renderings.

Figures 8 and 9 show what the app looks like with the rolloff and chroma visualizations. The left side of the app is grid view; the right side is list view.

Figure 8: DOME in spectrogram rolloff mode. The left side is grid view, an overview of the diversity of sound which has been loaded. Clicking a gridcell changes the list view on the right, sorting the audio by closeness to that gridcell.

Figure 9: DOME in chromagram mode. At a glance, for example, the user can see melodic vs non-melodic sections, drones, melodies, and occurrences of the note G (cyan).

3.10 Crowd Curation
Our prototype works in all modern web browsers. This environment was chosen for cross-device compatibility; our intention is to make curation more accessible for the purpose of crowd-sourcing. Export functionality allows the end user to rapidly collect variations of sonic elements and facilitates crowd-sourced collaboration.

4 INFORMAL EVALUATION
Informally, we gave three producers, experienced in sampling and curating music, the task of curating "interesting" pieces of music from 10 hours of SampleRNN-generated audio. They first used their preferred method of curation (which included "load all 10 hours into Ableton Live and look at the raw waveform" and "hunt around with the MacOS Finder and hope for the best"), then later used DOME. They self-reported that the curation process was between 5x and 20x faster with DOME.

Producer Drumcorps reported: "DOME was quite helpful in this project. Being able to scan through the audio content visually made it much easier to pick out useful and interesting sounds. After a short time browsing, I can get a sense of what a specific type of sound might look like, and I can start to find what I'm looking for much faster. I found it good to build a little directory of my favorite clips from DOME, and work from there, rather than the usual 'hunt around with the MacOS Finder and hope for the best' method. The two methods are incomparable, the differences are night and day, and the results end up different as well. With the Finder, I start out listening to files, but then often get frustrated and just select a few files at random, then choose the best one from those. With DOME I end up finding a wide variety quickly - and then can choose further work from a more informed position - it gives me a larger sample size. There's only so much listening one can do in a day, and if you need to listen to samples in real-time to determine which you're going to use, that's less time available for getting down to the actual composition work. Some samples contain wild shifts and interesting artefacts within them - you'll see this in DOME right away and be able to listen to that piece immediately. With the Finder, it's listening and hoping. There's a place for randomness and hidden surprises too - but I find that when I'm trying to get something done quickly, DOME is most helpful."

5 FUTURE WORK
The use of PCA for fingerprinting is limiting. While it is helpful at clustering similar audio textures together, it is not powerful and precise enough to distinguish nuances in sound. In the future, for fingerprints, we could use embeddings from a trained deep-net audio classifier.

Analysis with librosa was slow: a 10-hour dataset takes a few hours to analyze on a MacBook Pro. A C-compiled analyzer, or a distributed process (using cloud compute or AWS Lambda), would be more efficient.

Additional audio visualizations could include the fingerprint embeddings over time, and annotations of the audio used to prime the sequence before generation.

With the addition of a composing feature, the end user could arrange music by sticking curated sections together. With the addition of an upvoting feature, the crowd could further curate their favorite sections and arrangements.

6 CONCLUSION
Steady progress has been made on fast generative raw audio with neural synthesis. With the advent of audio style transfer [12, 13, 21], one could render all possible permutations of style transfers, yet would still need a good way to explore the output. Digital audio workstations such as Ableton Live are not fit for this task. We designed an interface to minimize the time and effort required to listen to hours of similar audio clips. Care was taken in the visualizations to aid the user. This turned the time-consuming task of previewing hours of audio into something which can be done at a glance. We believe self-organizing interfaces like ours will become more important as large directories of generated audio are rendered with faster inference speeds and greater parallelization.

A demo of this tool will be available online at dadabots.com/dome.

REFERENCES
[1] CJ Carr and Zack Zukowski. 2018. Generating Albums with SampleRNN to Imitate Metal, Rock, and Punk Bands. MUME (2018). arXiv:1811.06633 http://arxiv.org/abs/1811.06633
[2] Sander Dieleman, Aäron van den Oord, and Karen Simonyan. 2018. The Challenge of Realistic Music Generation: Modelling Raw Audio at Scale. CoRR abs/1806.10474 (2018). arXiv:1806.10474 http://arxiv.org/abs/1806.10474
[3] Jonathan Driedger, Meinard Müller, and Sascha Disch. 2014. Extending Harmonic-Percussive Separation of Audio Signals. ISMIR (2014).
[4] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. 2018. Efficient Neural Audio Synthesis. CoRR abs/1802.08435 (2018). arXiv:1802.08435 http://arxiv.org/abs/1802.08435
[5] Mario Klingemann. 2015. RasterFairy. https://github.com/Quasimondo/RasterFairy
[6] Peter Knees, Markus Schedl, Tim Pohle, and Gerhard Widmer. 2006. An Innovative Three-dimensional User Interface for Exploring Music Collections Enriched. In Proceedings of the 14th ACM International Conference on Multimedia (MM '06). ACM, New York, NY, USA, 17-24. https://doi.org/10.1145/1180639.1180652
[7] L.J.P. van der Maaten and G.E. Hinton. 2008. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9 (2008), 2579-2605.
[8] Rachel Manzelli, Vijay Thakkar, Ali Siahkamari, and Brian Kulis. 2018. Conditioning Deep Generative Raw Audio Models for Structured Automatic Music. CoRR abs/1806.09905 (2018). arXiv:1806.09905 http://arxiv.org/abs/1806.09905
[9] Brian McFee, Matt McVicar, Stefan Balke, Carl Thomé, Vincent Lostanlen, Colin Raffel, Dana Lee, Oriol Nieto, Eric Battenberg, Dan Ellis, Ryuichi Yamamoto, Josh Moore, WZY, Rachel Bittner, Keunwoo Choi, Pius Friesch, Fabian-Robert Stöter, Matt Vollrath, Siddhartha Kumar, nehz, Simon Waloschek, Seth, Rimvydas Naktinis, Douglas Repetto, Curtis "Fjord" Hawthorne, CJ Carr, João Felipe Santos, Jackie Wu, Erik, and Adrian Holovaty. 2018. librosa/librosa: 0.6.2. https://doi.org/10.5281/zenodo.1342708
[10] M.F. Medress, F.S. Cooper, J.W. Forgie, C.C. Green, D.H. Klatt, M.H. O'Malley, E.P. Neuburg, A. Newell, D.R. Reddy, B. Ritea, J.E. Shoup-Hummel, D.E. Walker, and W.A. Woods. 1977. Speech Understanding Systems: Report of a Steering Committee. Artificial Intelligence 9, 3 (1977), 307-316. https://doi.org/10.1016/0004-3702(77)90026-1
[11] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, Rithesh Kumar, Shubham Jain, Jose Sotelo, Aaron C. Courville, and Yoshua Bengio. 2016. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. CoRR abs/1612.07837 (2016). arXiv:1612.07837 http://arxiv.org/abs/1612.07837
[12] Parag K. Mital. 2017. Time Domain Neural Audio Style Transfer. CoRR abs/1711.11160 (2017). arXiv:1711.11160 http://arxiv.org/abs/1711.11160
[13] Noam Mor, Lior Wolf, Adam Polyak, and Yaniv Taigman. 2018. A Universal Music Translation Network. CoRR abs/1805.07848 (2018). arXiv:1805.07848 http://arxiv.org/abs/1805.07848
[14] Elias Pampalk, Simon Dixon, and Gerhard Widmer. 2004. Exploring Music Collections by Browsing Different Views. Comput. Music J. 28, 2 (June 2004), 49-62. https://doi.org/10.1162/014892604323112248
[15] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2018. WaveGlow: A Flow-based Generative Network for Speech Synthesis. CoRR abs/1811.00002 (2018). arXiv:1811.00002 http://arxiv.org/abs/1811.00002
[16] Andreas Rauber, Elias Pampalk, and Dieter Merkl. 2002. Using Psycho-Acoustic Models and Self-Organizing Maps to Create Hierarchical Structuring of Music by Sound Similarity. (2002).
[17] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, R. J. Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu. 2017. Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions. CoRR abs/1712.05884 (2017). arXiv:1712.05884 http://arxiv.org/abs/1712.05884
[18] Robin Sloan. 2018. Expressive Temperature. https://www.robinsloan.com/expressive-temperature/
[19] Sebastian Stober. 2010. MusicGalaxy: An Adaptive User-Interface for Exploratory Music Retrieval.
[20] Manny Tan and Kyle McDonald. 2017. Infinite Drum Machine. https://experiments.withgoogle.com/ai/drum-machine
[21] Dmitry Ulyanov. 2016. Audio Texture Synthesis and Style Transfer. https://dmitryulyanov.github.io/audio-texture-synthesis-and-style-transfer/
[22] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. 2016. WaveNet: A Generative Model for Raw Audio. CoRR abs/1609.03499 (2016). arXiv:1609.03499 http://arxiv.org/abs/1609.03499
[23] Fernanda Viégas and Martin Wattenberg. 2018. Visualization for Machine Learning (NeurIPS 2018 Tutorial). https://www.youtube.com/watch?v=ze08gwVPaXk
[24] Kazuyoshi Yoshii and Masataka Goto. 2008. Music Thumbnailer: Visualizing Musical Pieces in Thumbnail Images Based on Acoustic Features. In ISMIR.
[25] Zack Zukowski and CJ Carr. 2017. Generating Black Metal and Math Rock: Beyond Bach, Beethoven, and Beatles. NIPS Workshop on Machine Learning for Creativity and Design (2017). arXiv:1811.06639 http://arxiv.org/abs/1811.06639