=Paper=
{{Paper
|id=Vol-2327/MILC6
|storemode=property
|title=A Minimal Template for Interactive Web-based Demonstrations of Musical Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-MILC-6.pdf
|volume=Vol-2327
|authors=Vibert Thio,Hao-Min Liu,Yin-Cheng Yeh,Yi-Hsuan Yang
|dblpUrl=https://dblp.org/rec/conf/iui/ThioLYY19
}}
==A Minimal Template for Interactive Web-based Demonstrations of Musical Machine Learning==
Vibert Thio, Hao-Min Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang
Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
{vibertthio, paul115236, ycyeh, yang}@citi.sinica.edu.tw

ABSTRACT
New machine learning algorithms are being developed to solve problems in different areas, including music. Intuitive, accessible, and understandable demonstrations of the newly built models could help attract the attention of people from different disciplines and evoke discussions. However, we notice that it has not been a common practice for researchers working on musical machine learning to demonstrate their models in an interactive way. To address this issue, we present in this paper a template that is specifically designed to demonstrate symbolic musical machine learning models on the web. The template comes with a small codebase, is open source, and is meant to be easy for any practitioner to use to implement their own demonstrations. Moreover, its modular design facilitates the reuse of the musical components and accelerates the implementation. We use the template to build interactive demonstrations of four exemplary music generation models. We show that the built-in interactivity and real-time audio rendering of the browser make the demonstrations easier to understand and to play with. It also helps researchers to gain insights into different models and to A/B test them.

ACM Classification Keywords
D.2.2 Design Tools and Techniques: Modules and interfaces; H.5.2 User Interfaces: Prototyping; H.5.5 Sound and Music Computing: Systems

Author Keywords
Musical interface; web; latent space; deep learning

ACM Reference Format
Vibert Thio, Hao-Min Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang. 2019. A Minimal Template for Interactive Web-based Demonstrations of Musical Machine Learning. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

Copyright © 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

INTRODUCTION
Recent years have witnessed great progress in applying machine learning (ML) to music-related problems, such as thumbnailing [10], music generation [5, 7, 21], and style transfer [15]. To demonstrate the results of such musical machine learning models, researchers usually put the audio output on the accompanying project websites. This method worked well in the early days. However, as the ML models themselves are getting more complicated, some concepts of the algorithms may not be clearly expressed with only static sounds.

In the neighboring field of computer vision, many interactive demonstrations of ML models have been developed recently. Famous examples include DeepDream [18], image/video style transfer [8, 23], and DCGAN [19]. These interactive demos provoke active discussions and positive anticipation about the technology. Nevertheless, demonstrating musical machine learning models is not as easy as in the case of computer vision, because it involves audio rendering (i.e., we cannot simply use images for demonstration). The Web Audio API, a high-level JavaScript API for processing and synthesizing audio in web applications, was published only in 2011, which is relatively recent compared to WebGL and other features of the browser. Furthermore, interactivity is needed to improve understandability and create engaging experiences.

Musical machine learning is gaining increasing attention. We believe that if more people from other fields, such as art and music, start to appreciate the new models of musical machine learning, it will be easier to create an active community and to stimulate new ideas to improve the technology.

The goal of this paper is to fulfill this need by building and sharing with the community a template that is designed to demonstrate ML models for symbolic-domain music processing and generation in an interactive way. The template is open source and available on GitHub (https://github.com/vibertthio/musical-ml-web-demo-minimal-template).

RELATED WORKS

Audio Rendering in Python
When it comes to testing or interacting with musical machine learning models, the output of the models must be rendered as audio files or streams to be listened to by humans. Most researchers in the field nowadays use Python as the programming language for model implementation because of the powerful ML and statistical packages built around it. For example, librosa [17] is a Python package often used for audio and signal processing. It includes functions for spectral analysis, display, tempo detection, structural analysis, and output. Many interactive demonstrations are built with librosa on the Jupyter Notebook. However, a major drawback of this approach is that the audio files have to be sent over the Internet for demonstration, which can be slow depending on the network connection bandwidth.

Another widely used Python package is pretty_midi [20], which is designed for the manipulation of symbolic-domain data such as Musical Instrument Digital Interface (MIDI) data. It could be used as a tool to render the symbolic output of a musical machine learning model, such as a melody generation model [24]. The problem is that after getting the result as a MIDI file, the user still has to put it into a digital audio workstation (DAW) to synthesize the audio waveform from the MIDI. For a better listening experience, the researcher still has to synthesize the audio files offline and then send the audio files over the Internet for demonstration.
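To make this offline workflow concrete, the short sketch below (illustrative only, not code from the paper) assumes a melody model that outputs (pitch, start, end) tuples; pretty_midi writes them to a MIDI file, and the audio still has to be synthesized separately, for example with pretty_midi's FluidSynth binding, before it can be put online.

```python
# Illustrative sketch only (not from the paper): turning symbolic model output
# into a MIDI file, and optionally audio, with pretty_midi.
import pretty_midi

# Hypothetical model output: (pitch, start_sec, end_sec) triples of a melody.
notes = [(60, 0.0, 0.5), (62, 0.5, 1.0), (64, 1.0, 2.0)]

pm = pretty_midi.PrettyMIDI()
piano = pretty_midi.Instrument(program=0)  # acoustic grand piano
for pitch, start, end in notes:
    piano.notes.append(
        pretty_midi.Note(velocity=100, pitch=pitch, start=start, end=end))
pm.instruments.append(piano)

pm.write('melody.mid')           # symbolic output: still needs synthesis
audio = pm.fluidsynth(fs=44100)  # offline rendering; needs FluidSynth installed
```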
Different from the prior works, we propose to use Tone.js, a JavaScript framework, for rendering MIDI files into audio directly in the browser on the client side. This turns out to be a much more efficient way to demonstrate a symbolic musical ML model. It also helps build an interactive demo.

Interactive Musical Machine Learning
Similar to our work, Vogl et al. [7] introduced an interactive app for drum pattern generation based on ML. They used a generative adversarial network (GAN) [9] as the generative model, which is trained on a MIDI dataset. The user interface consists of a classic sequencer with an x/y pad that controls the complexity and the loudness of the generated drum pattern. Additionally, controls for genre and swing are said to be provided. However, neither the demo nor its source code can be found online currently. It is not clear whether the app is built for iOS, Android, or the Web.

Closely related to our project is the MusicVAE model [21] presented by Magenta, Google Brain Team. MusicVAE is a generative recurrent variational autoencoder (VAE) model that can generate melodies and drum beats. Importantly, the authors also released a JavaScript package called Magenta.js (https://github.com/tensorflow/magenta-js/) [22] to make their models more accessible. They also provide some pre-trained models of MusicVAE along with other ones. There are several interactive demos using the package, as can be found on their website [16]. Most of them are well designed, user-friendly, and extremely helpful for understanding the models. Yet, the major drawback is that the codebase of the project is a monolithic one [11] and is therefore quite big (see footnote 1). Users may not easily modify the code for customization. For example, because Magenta uses TensorFlow as the backbone deep learning framework, it is hard for PyTorch users to use Magenta.js.

Footnote 1: A monolithic repo is defined in [11] as "a model of source code organization where engineers have broad access to source code, a shared set of tooling, and a single set of common dependencies. This standardization and level of access are enabled by having a single, shared repo that stores the source code for all the projects in an organization." This is the case of the Google Magenta project.

TEMPLATE DESIGN
In this paper, we present a minimal template, which is simple, flexible, and designed for interactive demonstration of symbolic musical machine learning models on the web.

We choose the web as the platform for the demonstrations for several reasons. First, it is convenient, as the user only has to open a browser or click a hyperlink to play with the models. Second, it is inherently interactive. The system can utilize the many forms of interaction available in the browser to create the specific user experience.

Requirements
In the design process, we have prioritized some crucial qualities. First, we made the structure of the design as simple as possible. In most cases, the demo is for a proof of concept rather than to showcase a ready-to-sell product. Hence, we desire that a person with basic knowledge of Python and JavaScript could understand our template within a short period of time, so that the template can serve as a minimal starting point.

Second, the codebase should be small, so that transplanting a new model into the template is easier. Moreover, a small codebase also makes it easier to debug.

Third, the audio rendering must be interactive and real-time. The demonstrations must be responsive to inputs from the user, so that the user can understand the model by observing how it works in several different ways. As for researchers, if the result can be rendered instantly, it is easier to A/B test different designs of models or parameters.

Finally, we want the components of the template to be modular, so that they can be reused and recombined easily. Such components may include, e.g., chord progression, pianoroll, drum pattern, and sliders. Practitioners can build their own demonstrations based on these components.

System Architecture
As shown in Figure 1, the system consists of three parts: a musical machine learning model, a server, and a client. When a user opens the URL of the demonstration site, the client program is loaded into the browser and renders the basic interface. The client program then sends a request to the server to fetch the data. The server program parses the request, calls the corresponding function of the model to produce the output, and sends it back to the client, which renders the audio.

Figure 1. The schematic diagram of the proposed template.

Server
We used Flask (http://flask.pocoo.org/), a lightweight web application framework, to build the server. Flask only provides the essential core functions for building a web server, such as handling representational state transfer (REST) requests from the client. Therefore, we can build the server without any redundant elements and focus on the function of the model. As a result, the server template code has only about 150 lines, excluding the model implementation part.
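The paper does not list the template's actual routes, but a minimal Flask server in this spirit could look like the sketch below; the endpoint name and the decode_drum_pattern stub are hypothetical placeholders for the model function, and the server simply returns the symbolic result as JSON for the client to render.

```python
# Minimal sketch of a Flask server in the spirit of the template.
# The route name and the model interface are illustrative placeholders.
from flask import Flask, jsonify, request

app = Flask(__name__)

def decode_drum_pattern(latent):
    """Placeholder for the actual ML model: latent vector -> pianoroll."""
    return [[0] * 9 for _ in range(96)]  # 96 time steps x 9 drums, all rests

@app.route('/api/drum_pattern', methods=['POST'])
def drum_pattern():
    payload = request.get_json() or {}
    pattern = decode_drum_pattern(payload.get('latent', []))
    # Send the symbolic result back; the client renders it as audio (e.g. with Tone.js).
    return jsonify({'pattern': pattern})

if __name__ == '__main__':
    app.run(port=5000)
```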
Client
Several technologies have been developed to render real-time audio output since the Web Audio API was released [2]. Tone.js (https://tonejs.github.io/) is such a framework for creating interactive music in the browser. It provides simple workflows for synthesizing audio from oscillators, organizing audio samples, a musician-friendly API, and timeline scheduling. It makes the development of real-time rendering from the output of the ML model much easier. Moreover, the HTML canvas element, added in the HTML5 standard, can be used to draw dynamic graphics with JavaScript [1]. Thus, we use the JavaScript canvas together with Tone.js to create the audio and visual experience coherently.

The modularization is taken care of in the design of the interface (see Figures 2–5). The layout of the user interfaces is implemented as a grid system. This speeds up the design process because it simplifies the choices for the positions of the elements and the margins between them. The recurring elements, such as the pianoroll and the drum pattern display, are implemented based on object-oriented principles, so they can be reused easily. See Table 1 for a summary.

Table 1. Some of the modules are reused by more than one demonstration. For example, three of the demonstrations use the "editable pianoroll" module. In our implementation, we reuse the modules to reduce the development effort. Therefore, it is useful and convenient to develop new demos with these modules. ** indicates a feature that has not been implemented yet.

Latent Inspector: audio rendering (sample); editable pianoroll (drum); editable latent vector (circular); **radio panel (genre selection)
Song Mixer: audio rendering (synthesize); editable pianoroll (melody) × 3; chord visualization (text); radio panel (interpolation selection)
Comp It: audio rendering (synthesize); editable pianoroll (melody); chord visualization (text, function, circle of fifths)
Tuning Turing: audio rendering (sample); waveform visualization

DEMONSTRATIONS DESIGN
We built four different demonstrations based on the proposed template. We call them 'Latent Inspector,' 'Song Mixer,' 'Comp It,' and 'Tuning Turing.' Each of them was designed to serve one exact purpose and to demonstrate a single idea based on the underlying musical machine learning model. The classes of the models are not limited to certain ones. For instance, the first two demonstrate musical machine learning models based on the VAE [13], whereas the last two are mainly based on a recurrent neural network (RNN). The types of instruments are also different: the first one is about percussion and the other three are about melody. This is designed deliberately to show the general-purpose nature of the template. We aim to make the models more understandable and interesting by adding interactivity, interface design, and visual effects.

Latent Inspector with DrumVAE
DrumVAE is an original work. It uses a VAE for generating one-bar drum patterns. Drum patterns are represented using the pianoroll format [6] with 96 time steps per bar. The model compresses (or encodes) the drum patterns into a latent space via a bidirectional gated recurrent unit (BGRU) neural network [4]. The outputs of the BGRUs are used as the mean and variance of a Gaussian distribution, and a latent vector is sampled from this Gaussian distribution. We apply a similar but reversed structure in the decoder and pass the latent vector into it to reconstruct the drum patterns. The model is trained on one-bar drum patterns collected from the Lakh Pianoroll Dataset [5], considering the following nine drums: kick drum, snare drum, closed/open hi-hat, low/mid/high toms, crash cymbal, and ride cymbal.
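For readers unfamiliar with the VAE machinery, the sampling step described above can be written in a few lines; the sketch below uses NumPy only, with the BGRU encoder and decoder stubbed out, and the shapes follow the text (a 96 x 9 one-bar drum pianoroll and a latent vector of dimension N = 32).

```python
# Sketch of the VAE sampling step described above (NumPy only; the BGRU
# encoder/decoder are stubs). Shapes follow the text: 96 time steps x 9 drums,
# latent dimension N = 32.
import numpy as np

N_LATENT = 32
drum_pattern = np.zeros((96, 9))  # one-bar binary pianoroll

def encode(pattern):
    """Stub for the BGRU encoder: returns mean and log-variance vectors."""
    mu = np.zeros(N_LATENT)
    log_var = np.zeros(N_LATENT)
    return mu, log_var

mu, log_var = encode(drum_pattern)
# Reparameterization: sample z ~ N(mu, sigma^2) via z = mu + sigma * eps.
eps = np.random.randn(N_LATENT)
z = mu + np.exp(0.5 * log_var) * eps  # the latent vector visualized in Figure 2

# A decoder with the reverse structure would map z back to a 96 x 9 pattern.
```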
Figure 2. The Latent Inspector. Left: a snapshot of the demonstration. The upper half is an editable drum pattern display, whereas the bottom half is a graph showing the N-dimensional latent vector (here N = 32) of the VAE model. Each vertex of the graph can be adjusted independently, to change the latent vector and accordingly the drum pattern. This is exemplified by the snapshot shown on the right. Although not shown in the figure, we can also click on the drum pattern display to modify the drum pattern directly, which would also change the latent vector accordingly.

The Latent Inspector, shown in Figure 2, lets the user modify the latent vector of a drum pattern displayed in the browser to find out how the drum pattern will change correspondingly. Conversely, the user can also modify the drum pattern to observe the changes in the latent vector.

X/Y pads are used in other works to explore the latent space. Yet, the dimension of the latent vectors used in practice is usually larger than two. As a result, we designed a circular diagram that can represent high-dimensional data. As shown in Figure 2, the latent vector of our DrumVAE model has dimension N = 32. Since the effect of every dimension should be symmetrical in the latent vector of DrumVAE, using a circular diagram eliminates the end points of a line chart.

It is possible to further improve the UI by adding conditional functionalities, to give each vertex some musical or semantic meaning. While this can be a future direction, we argue that the current design is also interesting: for musicians, it is sometimes more interesting to have a bunch of knobs of unknown functionality to play with.

The demo website of Latent Inspector can be found at http://vibertthio.com/drum-vae-client/public/.

Song Mixer with LeadSheetVAE
LeadSheetVAE is another model we recently developed [14]. It is also based on a VAE, but it is designed to deal with lead sheets instead of drum patterns. A lead sheet is composed of a melody line and a sequence of chord labels [14]. We consider four-bar lead sheets here. Melody lines and chord sequences are represented using one-hot vectors and chroma vectors, respectively. The model resembles the structure of DrumVAE, but the main difference is that at the end of the encoder the outputs of the two BGRUs (one for melody and one for chords) are concatenated and passed through a few dense layers to calculate the mean and variance of the Gaussian distribution. In the decoder, we apply two unidirectional GRUs to reconstruct the melody lines and chord sequences. The model is trained on the TheoryTab dataset [14] with 31,229 four-bar segments of lead sheets featuring different genres. LeadSheetVAE can generate new lead sheets from scratch, but we use it for generating interpolations here.

Figure 3. The Song Mixer. The top and bottom panels display the melody and the chords of the first and the second song. In the middle is the interpolation between the two songs. We add visual aids to guide the user to interact with the app. For example, the top panel is highlighted in this figure to invite the user to listen to the first song, before the second song and then the interpolations.

The Song Mixer, shown in Figure 3, takes two existing lead sheets as input and shows the interpolations between them generated by LeadSheetVAE. Similarly, a user can modify the melody or chords using the upper panel, or choose other lead sheets from our dataset, to see how this affects the interpolation.

The aim of this demo is to make the interpolation understandable. Therefore, we build interactive guidance with visual cues throughout the process to make sure the user grasps the idea of lead sheet interpolation. The demo website of Song Mixer can be found at http://vibertthio.com/leadsheet-vae-client/.

Evaluating the quality of interpolations generated by VAE models in general (not limited to music-related ones), and by many other generative models, has been known to be difficult. A core reason is that there is no ground truth for such interpolations. Song Mixer makes it easy to assess the result of musical interpolations. Moreover, with the proposed template, it is easy to extend Song Mixer to show the interpolations produced by two different models side by side and in sync in the middle of the UI. This facilitates A/B testing the two models with a user study.
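Such interpolations are commonly produced by blending the two latent codes and decoding each intermediate point. The sketch below illustrates the idea with plain NumPy and stubbed encode/decode functions; it is not the actual LeadSheetVAE implementation (see [14] for the model details).

```python
# Sketch of latent-space interpolation between two lead sheets (NumPy only;
# encode/decode stand in for the trained LeadSheetVAE encoder and decoder).
import numpy as np

def encode(lead_sheet):
    """Stub: lead sheet -> latent mean vector."""
    return np.random.randn(32)

def decode(z):
    """Stub: latent vector -> lead sheet (melody + chords)."""
    return {'melody': [], 'chords': []}

z_a, z_b = encode('song_a'), encode('song_b')
steps = 8
interpolations = [
    decode((1 - t) * z_a + t * z_b)  # linear blend of the two latent codes
    for t in np.linspace(0.0, 1.0, steps)
]
# The Song Mixer UI plays such intermediate lead sheets in its middle panel.
```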
Comp It & Tuning Turing with MTRNNHarmonizer
Finally, MTRNNHarmonizer is another new model that we recently developed (footnote 2). It is an RNN-based model for adding chords to harmonize a given melody. In other words, given a melody line, the model produces a chord sequence to make it a lead sheet. The model is special in that it takes a multi-task learning framework to predict not only the chord label but also the chord's functional harmony, for a given segment of melody (half a bar in our implementation). Taking the functional harmony into account makes the model less sensitive to the imbalance of different chords in the training data. Furthermore, the resulting chord progression can have a phrasing that better matches the given melody line.

Footnote 2: More details of the model will be provided in a forthcoming paper.
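The multi-task idea can be pictured as one recurrent network with two output heads trained with a joint loss over chord labels and chord functions. The following PyTorch sketch uses made-up dimensions, class counts, and names, and is not the authors' MTRNNHarmonizer; it is only meant to make that structure concrete.

```python
# Illustrative multi-task harmonizer skeleton (PyTorch; sizes and names are
# made up, not the authors' MTRNNHarmonizer). One GRU reads the melody per
# half-bar segment; two heads predict the chord label and its harmonic function.
import torch
import torch.nn as nn

class TwoHeadHarmonizer(nn.Module):
    def __init__(self, melody_dim=128, hidden=256, n_chords=48, n_functions=3):
        super().__init__()
        self.rnn = nn.GRU(melody_dim, hidden, batch_first=True)
        self.chord_head = nn.Linear(hidden, n_chords)        # chord label
        self.function_head = nn.Linear(hidden, n_functions)  # functional-harmony class

    def forward(self, melody_segments):
        h, _ = self.rnn(melody_segments)  # (batch, segments, hidden)
        return self.chord_head(h), self.function_head(h)

# Joint loss: weighted sum of the two classification losses.
model = TwoHeadHarmonizer()
x = torch.randn(2, 8, 128)                 # 2 songs, 8 half-bar segments each
chord_logits, func_logits = model(x)
chord_target = torch.randint(0, 48, (2, 8))
func_target = torch.randint(0, 3, (2, 8))
ce = nn.CrossEntropyLoss()
loss = ce(chord_logits.transpose(1, 2), chord_target) + \
       0.5 * ce(func_logits.transpose(1, 2), func_target)
```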
Figure 4. Comp It. The upper half is the editable melody and the chords predicted by the underlying melody harmonization model. In the lower left is a graph showing which class the current chord belongs to. In the lower right is a graph showing the position of the current chord on the so-called circle of fifths. The current melody note and chord being played are marked in red.

Similar to the two aforementioned demos, Comp It allows a user to modify the melody displayed in the browser to find out how this alters the chord progression correspondingly. Furthermore, as shown in Figure 4, we add a triangular graph and an animated circle-of-fifths [12] graph to visualize the changes between different chord classes. The triangular graph displays the class of the chord being played, covering tonic, dominant, and sub-dominant. The circle-of-fifths graph, on the other hand, organizes the chords in a way that reflects the "harmonic distance" between chords [3]. These two graphs make it easier to study the chord progression generated by the melody harmonization model, which is MTRNNHarmonizer here but could be another model in other implementations.

Figure 5. Tuning Turing. Two different kinds of harmonization for a single melody are rendered on the page. The player has to pick out the one generated by the algorithm and submit the result after choosing with the mouse.

Furthermore, we made a simple Turing game for the model, called "Tuning Turing." As shown in Figure 5, the player has to pick out the harmonization generated by the model from two music clips. There are both a "practice mode" and a "challenge mode." The former has six fixed levels; in the latter, the player can keep playing until giving three wrong answers.

The demo websites of Comp It and Tuning Turing can be found at http://vibertthio.com/m2c-client/ and http://vibertthio.com/tuning-turing/ respectively.

AVAILABILITY
Supplementary resources, including the open source code, will be available at the GitHub repos (https://github.com/vibertthio), including the template (https://github.com/vibertthio/musical-ml-web-demo-minimal-template), the interfaces, and the ML models.

CONCLUSION
This paper presents an open-source template for creating interactive demonstrations of musical machine learning on the web, along with four exemplary demonstrations. The architecture of the template is meant to be simple and the codebase small, so that other practitioners can implement their models with it within a short time. The modular design makes the musical components reusable. The interactivity and real-time audio rendering of the browser make the demonstrations easier to understand and to play with. However, we have so far only elaborated the qualitative aspects of the project, without quantitative analysis. For future work, we will run user studies to validate the effectiveness of these projects. With more intuitive, accessible, and understandable demonstrations of the new models, we hope new people might be brought together to form a larger community and to stimulate new ideas.

REFERENCES
1. 2006. Canvas API. MDN Web Docs. https://developer.mozilla.org/en-US/docs/Web/API/Canvas_API.
2. 2011. Web Audio API. W3C. https://www.w3.org/TR/2011/WD-webaudio-20111215/.
3. Juan Bello and Jeremy Pickens. 2005. A robust mid-level representation for harmonic content in music signals. In Proc. Int. Soc. Music Information Retrieval Conf.
4. Kyunghyun Cho and others. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR abs/1406.1078. http://arxiv.org/abs/1406.1078.
5. Hao-Wen Dong and others. 2018a. MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Proc. AAAI Conf. Artificial Intelligence.
6. Hao-Wen Dong, Wen-Yi Hsiao, and Yi-Hsuan Yang. 2018b. Pypianoroll: Open source Python package for handling multitrack pianoroll. In Proc. Int. Soc. Music Information Retrieval Conf., Late-breaking paper. https://github.com/salu133445/pypianoroll.
7. Hamid Eghbal-zadeh and others. 2018. A GAN based drum pattern generation UI prototype. In Proc. Int. Soc. Music Information Retrieval Conf., Late Breaking and Demo Papers.
8. Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2015. A neural algorithm of artistic style. https://arxiv.org/abs/1508.06576.
9. Ian J. Goodfellow and others. 2014. Generative adversarial nets. In Proc. Advances in Neural Information Processing Systems, 2672–2680.
10. Yu-Siang Huang, Szu-Yu Chou, and Yi-Hsuan Yang. 2018. Pop music highlighter: Marking the emotion keypoints. Transactions of the International Society for Music Information Retrieval 1, 1 (2018), 68–78.
11. Ciera Jaspan and others. 2018. Advantages and disadvantages of a monolithic codebase. In Proc. Int. Conf. Software Engineering.
12. Claudia R. Jensen. 1992. A theoretical work of late seventeenth-century Muscovy: Nikolai Diletskii's "Grammatika" and the earliest circle of fifths. J. American Musicological Society 45, 2 (1992), 305–331.
13. Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In Proc. Int. Conf. Learning Representations.
14. Hao-Min Liu, Meng-Hsuan Wu, and Yi-Hsuan Yang. 2018. Lead sheet generation and arrangement via a hybrid generative model. In Proc. Int. Soc. Music Information Retrieval Conf., Late Breaking and Demo Papers.
15. Chien-Yu Lu and others. 2019. Play as You Like: Timbre-enhanced multi-modal music style transfer. In Proc. AAAI Conf. Artificial Intelligence.
16. Google Brain Magenta. 2018. Demos. Magenta Blog. https://magenta.tensorflow.org/demos.
17. Brian McFee and others. 2015. librosa: Audio and music signal analysis in Python. In Proc. 14th Python in Science Conf., 18–25.
18. Alexander Mordvintsev, Christopher Olah, and Mike Tyka. 2015. Inceptionism: Going deeper into neural networks. Google AI Blog. https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.
19. Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. https://arxiv.org/abs/1511.06434.
20. Colin Raffel and Daniel P. W. Ellis. 2014. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In Proc. Int. Soc. Music Information Retrieval Conf., Late Breaking and Demo Papers.
21. Adam Roberts and others. 2018a. A hierarchical latent vector model for learning long-term structure in music. https://arxiv.org/abs/1803.05428.
22. Adam Roberts, Curtis Hawthorne, and Ian Simon. 2018b. Magenta.js: A JavaScript API for Augmenting Creativity with Deep Learning. https://ai.google/research/pubs/pub47115.
23. Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox. 2018. Artistic style transfer for videos and spherical images. Int. J. Computer Vision (2018). http://lmb.informatik.uni-freiburg.de/Publications/2018/RDB18.
24. Ian Simon and Sageev Oore. 2017. Performance RNN: Generating music with expressive timing and dynamics. Magenta Blog. https://magenta.tensorflow.org/performance-rnn.