=Paper=
{{Paper
|id=Vol-2327/MILC5
|storemode=property
|title=Towards a Hybrid Recommendation System for a Sound Library
|pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-MILC-5.pdf
|volume=Vol-2327
|authors=Jason Smith,Dillon Weeks,Mikhail Jacob,Jason Freeman,Brian Magerko
|dblpUrl=https://dblp.org/rec/conf/iui/0005WJFM19
}}
==Towards a Hybrid Recommendation System for a Sound Library==
Jason Smith (jsmith775@gatech.edu), Center for Music Technology; Dillon Weeks (dweeks7@gatech.edu), School of Interactive Computing; Mikhail Jacob (mikhail.jacob@gatech.edu), School of Interactive Computing; Jason Freeman (jason.freeman@gatech.edu), Center for Music Technology; Brian Magerko (magerko@gatech.edu), School of Literature, Media, and Communication. Georgia Institute of Technology, Atlanta, GA.

ABSTRACT

Recommendation systems are widespread in music distribution and discovery services but far less common in music production software such as EarSketch, an online learning environment that engages learners in writing code to create music. The EarSketch interface contains a sound library that learners can access through a browser pane. The current implementation of the sound browser includes basic search and filtering functionality but no mechanism for sound discovery, such as a recommendation system. As a result, users have historically selected a small subset of sounds with high frequency, leading to lower compositional diversity. In this paper, we propose a recommendation system for the EarSketch sound browser which uses collaborative filtering and audio features to suggest sounds.

CCS CONCEPTS

• Human-centered computing → User interface design; • Applied computing → Sound and music computing.

KEYWORDS

recommendation systems, interface design, music

ACM Reference Format:
Jason Smith, Dillon Weeks, Mikhail Jacob, Jason Freeman, and Brian Magerko. 2019. Towards a Hybrid Recommendation System for a Sound Library. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

1 INTRODUCTION

EarSketch [7] is an online environment for learning computer programming and audio loop-based music composition. Students write JavaScript or Python scripts to algorithmically generate musical compositions. The user interface borrows design cues from both integrated development environments (IDEs) and digital audio workstation (DAW) software, combining a code editor and console with a multi-track audio timeline and sound browser. EarSketch has primarily been used in high school and college computer science classrooms, with over 300,000 users to date [5].

In previous research in EarSketch classrooms, significant relationships have been found between student perceptions of authenticity – including their desire to share personally expressive work with others – and student attitudes towards computing [9]. Exploration of a larger number of musical ideas – including the sounds that form the building blocks of student compositions in EarSketch – may magnify a student's capacity to create personally expressive compositions.

EarSketch contains a library of over 3,500 sounds for students to use in their compositions. The sounds were created by musicians Richard Devine and Young Guru specifically for EarSketch and consist of multi-measure audio loops that are separated by instrument and span over 20 popular musical genres. However, a statistical analysis of scripts written by users showed that the vast majority of user projects used only a small subset of the sound library. Feedback from EarSketch users (see the interviews section) showed that their lack of exploration was primarily the result of the difficulty of finding sounds that appealed to them.
We propose, therefore, that providing users with an easier mechanism for exploring the sound library will enable them to find and use audio loops that spur further musical creativity and personal expression, while ultimately furthering their learning about music and coding through EarSketch.

IUI Workshops'19, March 20, 2019, Los Angeles, USA. Copyright ©2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

Figure 1: View of EarSketch browser interface.

We have explored the addition of a recommendation (or recommender) system, after conducting user studies, as a method of encouraging users to explore more of the EarSketch sound library in their scripts. Recommendation systems are widespread in music distribution and discovery platforms (where they operate at the song level) but far less common in music production workflows (where they could operate at the sound clip level). Recommendation systems suggest content to users that is most likely to appeal to them based on profiles of their preferences, as well as content that they would most likely find novel, diverse, and unexpectedly useful (serendipitous) [1]. EarSketch could use such a recommendation system to automatically search through its sound library to find relevant sounds that encourage the user to explore novel, diverse, and serendipitous regions of the sound library.

Recommendation generation techniques include collaborative filtering, content-based filtering, and hybrid techniques. Collaborative filtering [1] involves comparing the current user to previous users in order to generate recommendations from what similar users in the past selected (for example [13]). Content-based filtering compares inherent properties of content to recommend items, such as with the use of audio feature-based deep learning [3] and the calculation of short-sample similarity metrics [14]. We can use a hybrid approach that combines both techniques to generate recommendations.

Some previous recommendation systems for sounds employed the Freesound sample library [4]. These projects used feature similarity calculations without co-usage statistics [12] or textual metadata to augment recommendations [11]. The proposed system for EarSketch differs from these examples by combining only audio similarity and co-usage to generate recommendations, reserving genre labels for manual user filtering.

Figure 2: The EarSketch sounds used the highest number of times in 20,000 user scripts (highest 1,000 shown for legibility), showing under-utilization of the majority of the library.

In this article, we present our initial research on a recommendation system for discovering new sounds for use in EarSketch. The main contributions discussed are:

• An initial user-centered design process for systematically understanding how best to add an audio loop recommendation system into the EarSketch environment, including the way users currently use the sound browser, the challenges to using it successfully, the kinds of recommendations users desire, and the best way to present users with recommendations.
• The initial application of a hybrid (collaborative and content-based filtering) recommendation system for sounds in a digital audio workstation, in contrast to song recommendation systems. This is a first step towards improving user exploration of the EarSketch sound library according to the user requirements and design principles arising from the initial user-centered design process.
• A proposed methodology for evaluating both the success of the recommendation system in providing users with relevant, novel, diverse, and serendipitous recommendations [1] and the relative importance of the different factors used to generate recommendations, as well as the usability of the sound browser with the recommendation system added.

The remainder of the paper describes the details of the user-centered design process for adding a recommendation system to the EarSketch sound browser and the initial prototype of the hybrid recommendation system resulting from that design process. The paper concludes by discussing the planned evaluation methodology, limitations of the current prototype, and future work.

2 USER RESEARCH AND INTERFACE DESIGN

An initial user study was conducted in order to gain a systematic understanding of how best to add a recommendation system into the EarSketch sound browser. This included understanding the different ways that users used the sound browser, the challenges they faced in using it successfully, the kinds of recommendations users desired, and the best ways to present recommendations to users. The study resulted in a set of requirements for the recommendation system and a redesign of the sound browser interface integrating the generated recommendations.

2.1 Initial Design

The sound browser experience prior to the addition of a recommendation system included sound folders that consisted of a title and a list of sounds corresponding to that title. For example, the sound folder titled "DUBSTEP 140 BPM DUBBASS WOBBLE" included a list of "DUBSTEP BASS WOBBLE" sounds underneath it, followed by other sound folders and their associated sounds. This list was navigated via scrolling, and sounds were distributed across multiple pages within the browser. The user had the ability to favorite and preview these sounds from within the browser as well. The user also had the ability to discover sounds in the library via text search from the search bar, along with the functionality to filter these sounds by artist, instrument, and genre.

2.2 Interviews and Survey of EarSketch Students

Four qualitative interviews were conducted with undergraduate students in an introductory programming course at a four-year college to explore current EarSketch users' challenges, behaviors, and interactions with the sound browser. This was done to identify the best opportunities for the recommendation system to fit their needs. These interviews were used to gather qualitative data such as reported behaviors, motivations behind those behaviors, and opportunities for future designs and a recommendation integration.

A quantitative survey was sent out to the same undergraduate class and received 55 responses. The survey was used to determine the prevalence of identified behaviors and preferences. Participants reported being more inclined to use the Instrument and Genre filters than the Artist filter. In addition, users expressed their desire for Key and Beats-Per-Minute (BPM) filters. This suggested the need to prioritize recommendations based on instruments, genres, keys, and BPM in the future.

Users reported that it was hard to discover groups of sounds they considered to be a good recommendation. They considered strong recommendations to be sounds that they liked that also fit in their script (relevant) and that they had not heard before (novel) or were not expecting (serendipitous). Discovering sounds similar to previously used sounds was of lesser importance to them. This confirmed that those users desired recommendations in accordance with the recommendation system goals defined by [1].
3 HYBRID RECOMMENDATION SYSTEM

A set of design principles arose as a result of these user studies. Recommendations were to be relevant, novel, diverse, and serendipitous. Additionally, users were interested in getting recommendations in the interface separated into different categories (e.g. "Sounds That Fit Your Tastes" and "Discover Different Kinds of Sounds"). Users were also interested in getting recommendations matching semantic features of the sounds in their work-in-progress compositions (e.g. instrument, genre, key, and BPM).

Figure 3: Original sound browser design prior to research activities (left) and sound browser design after research activities (right). Includes a like/dislike functionality, collapsible sound folders, new recommended sound folders with gold text to distinguish them as recommendations, and the addition of Key and BPM filters.

The initial recommendation system we have developed does not yet support the entire set of user requirements that were illuminated by the user studies. It does combine collaborative filtering (using a statistical analysis of sound usage in past user scripts) and content-based filtering (using extracted audio features) to increase the relevance and novelty of the generated recommendations. Recommendations are generated as follows:

(1) The algorithm takes in one or more sounds as its input. This input is the set of sounds that are already a part of a user's work-in-progress script/composition.
(2) The algorithm then generates a first list of sounds from the EarSketch sound library (the co-usage list) that have commonly been used in the past with the input sounds in scripts by any user.
(3) The algorithm then uses audio features of the sounds in the co-usage list to create a second list containing other sounds in the sound library that are acoustically similar to the sounds in the co-usage list (the similarity list).
(4) The algorithm removes sounds from the similarity list that have been commonly used with sounds in the co-usage list.
(5) Finally, the algorithm chooses sounds from the similarity list to present to the user as recommendations.

The co-usage list is an example of collaborative filtering (see the collaborative filtering section) and adds relevance to the generated recommendations by ensuring that recommendations are compatible with the set of sounds in the user's work-in-progress script/composition. The usage of the similarity list (rather than just the co-usage list) is an example of content-based filtering (see the content filtering section). The removal of sounds from the similarity list (that are commonly used with the co-usage list) adds novelty to the recommendations. The approach described here attempts to address diversity and serendipity of the generated recommendations, but explicit measures to ensure and evaluate these qualities are planned for future work (see the future work section).

3.1 Collaborative Filtering

The input to the collaborative filtering is the collection of sounds already being used in an active script at the time of recommendation generation. We take an item-based approach involving only an analysis of previous co-usage between sounds [13]. We take this approach to impose minimal collection of user information, such as user demographics and profile usage history, protecting EarSketch's primarily school-aged user base and conforming with its privacy policy [5]. The system returns a co-usage list of sounds in order of co-usage frequency. This co-usage is calculated using a sample set of 20,000 user scripts. Any sounds that are also in the input list are excluded to ensure that commonly co-used input sounds do not simply recommend each other.

3.2 Content-based Filtering

We compare two audio features to find sounds acoustically similar to the items in the co-usage list. These recommendations are the final output of the system. Recommended sounds are chosen based on their similarity to the most commonly co-used sounds, comparing two properties of the audio signal: Short-Time Fourier Transform features and Mel-Frequency Cepstral Coefficients. The sounds are compared using the euclidean distance between their feature vectors, taken from the first 2 seconds of 48000 sample rate audio with a 1024-point Hann window and normalized for tempo.

Short-Time Fourier Transform Features: D_STFT is the euclidean distance between the spectral density of two sounds, calculated using the librosa STFT function [8]. This function allows us to evaluate time-based similarities between sounds, and recommend sounds with similar function in a rhythmic context.

Mel-Frequency Cepstral Coefficients: D_MFCC is the euclidean distance between the short-term power spectrum of two sounds, using the librosa MFCC function [8][10]. This compares sounds in terms of temporally-independent energy, and acts as a proxy for genre or instrument groupings.

Both features were chosen due to their common usage in music information retrieval [6].

3.3 Recommendation Algorithms

This design aims to generate recommendations of sounds that are serendipitous to the user by not having high co-usage, and relevant through acoustical similarity to sounds that do. Diversity in recommendations is possible by including a high number of co-used sounds of a variety of styles. The multiple stages of randomness in both models, while not guaranteeing novelty, allow different recommendations to be generated for the same combinations of inputs.

N represents an arbitrary factor limiting the number of results gathered at different steps in the algorithms, and will be empirically determined during evaluation. The value of each variable labeled N in the sections below can be manipulated separately. This includes the lengths of the list of final recommendations, the co-usage list, and the similarity list.

The initial prototype of the recommender system is designed for use in standalone offline applications in addition to integration with the main EarSketch browser. Two recommendation algorithms were developed: one for live, real-time recommendation calculations and the other for faster server-side calculations. The first model, the dynamic model, conducts all calculations offline using pre-computed audio features to generate a list of recommendations for any combination of sounds. The static model, intended for online use, combines pre-computed lists of recommendations for individual sounds to generate a single recommendation list.

3.3.1 Dynamic. The most commonly used sounds in conjunction with any of the input sounds parsed from a user script are found collectively using the collaborative filtering paradigm described in the collaborative filtering section. Each commonly co-used sound is then compared to all other sounds in the EarSketch library, and a recommendation score for each is generated by the following equation:

S = D_STFT^(-1) + D_MFCC^(-1) + U    (1)

where D_STFT is the normalized STFT euclidean distance, D_MFCC is the normalized MFCC euclidean distance, and U is the normalized co-usage.

Additionally, the STFT and MFCC distances from the original input samples are added to or subtracted from the final recommendation score. This is to generate recommendations that are either acoustically similar to or different from the ones already found in the user script at the time of recommendation. The sounds with the highest N recommendation scores are stored and joined together in a single similarity list.
A random selection of N recommendations is chosen from the highest N normalized recommendation scores in the master list, with higher priority given to the highest-scoring recommendations through fitness proportionate selection [2].

3.3.2 Static. The static model differs from the dynamic model in that it uses a pre-computed list of similarity lists generated for each individual sound in EarSketch, in order to make the recommendation algorithm less computationally intensive for server-side deployment. The lists for any combination of input sounds are joined together into a master list, and any duplicate sounds have their recommendation scores added and balanced by a factor of the square root of the number of lists. This method of balancing assigns higher value to the strongest recommendations without drowning out the others, and is another scalable parameter that will be evaluated in future work (see the future work section). A random selection of N recommendations is chosen with higher priority given to the highest-scoring recommendations, as with the dynamic model.

Figure 4: Program flow of the Dynamic recommendation system model, following the analysis of input samples to generate co-usage, similarity, and final recommendation lists.

4 FUTURE WORK

This algorithm is at an exploratory stage of development, and we plan to expand it, along with the interface design, with respect to current limitations and information gained from user testing.

4.1 Recommendation System

The recommendation generation process will be modified to improve how it explicitly addresses its goals of relevance, novelty, diversity, and serendipity. Recommendation relevance will be improved by adding semantic metadata tags to the sounds, like instrument, genre, key, and BPM, and using those parameters (in addition to co-usage statistics and feature similarity) to select sounds. Novelty will be explicitly optimized for by measuring the distance between sounds in the lists and ensuring that recommendations are intentionally selected to be different from previously generated recommendations by some threshold novelty value N. Additionally, the calculations between audio features will be performed with operations and statistical measures other than euclidean distance, and will incorporate higher-level features such as rhythm. Similarly, for a threshold diversity value D, recommendations would be chosen by adding sounds to a candidate set such that each new addition is at least D distance from every other item already in the set. Serendipity will be explicitly optimized for by first collecting data on recommendations that are relevant but have low co-usage frequencies (indicating that they are rarely used together). Finally, each of the four recommendation generation goals will be weighted in order to tailor recommendations to different situations or different recommendation folders.

4.2 Proposed Evaluation

4.2.1 Recommendation System. Participants in a user study will empirically refine the various iterations of the recommendation system using different output-limiting values of N and different relative weightings of D_MFCC and D_STFT. Additionally, they will be asked to choose sounds from the recommendation system and rate them in terms of relevance, novelty, diversity, and serendipity [13] for a combination of input sounds. The sounds they choose will be represented by the recommendation scores generated by each system iteration, in order to evaluate the weightings independently. Additionally, qualitative questions will reveal user opinions on other design aspects, like how many recommendations users want to see at once.

4.2.2 Interface Redesign. The current redesign has not been properly tested in a real-world scenario, so potential usability issues may arise with the navigation, language, and recommendation types. We will conduct moderated usability testing and record users' sessions interacting with a high-fidelity prototype while a researcher prompts them with tasks to complete. This testing will provide more information regarding EarSketch users' perceptions of a 'good' recommendation and how users will actually utilize these recommendations. As we move toward understanding how to recommend sounds to our users and better facilitate the exploration and discovery of sounds within EarSketch, our near-term goal is to iterate and improve on the proposed EarSketch redesign to accommodate recommendations.

REFERENCES

[1] Charu C. Aggarwal et al. 2016. Recommender Systems. Springer.
[2] Thomas Bäck. 1996. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, Inc., New York, NY, USA.
[3] S. Chang, A. Abdul, J. Chen, and H. Liao. 2018. A Personalized Music Recommendation System using Convolutional Neural Networks Approach. In IEEE International Conference on Applied System Invention (ICASI). IEEE, 47–49. https://doi.org/10.1109/ICASI.2018.8394293
[4] Bram de Jong. 2005. Freesound. https://freesound.org
[5] Brian Magerko and Jason Freeman. 2011. EarSketch. http://earsketch.gatech.edu/landing/
[6] Alexander Lerch. 2012. An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics (1st ed.). Wiley-IEEE Press.
[7] Brian Magerko, Jason Freeman, Tom McKlin, Mike Reilly, Elise Livingston, Scott McCoid, and Andrea Crews-Brown. 2016. EarSketch: A STEAM-based approach for underrepresented populations in high school computer science education. ACM Transactions on Computing Education (TOCE) 16, 4 (2016), 14.
[8] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference. 18–25.
[9] Tom McKlin, Brian Magerko, Taneisha Lee, Dana Wanzer, Doug Edwards, and Jason Freeman. 2018. Authenticity and Personal Creativity: How EarSketch Affects Student Persistence. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education. ACM, 987–992.
[10] Paul Mermelstein. 1976. Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence 116 (1976), 374–388.
[11] Sergio Oramas, V. C. Ostuni, T. Di Noia, Xavier Serra, and E. Di Sciascio. 2016. Sound and Music Recommendation with Knowledge Graphs. ACM Transactions on Intelligent Systems and Technology (TIST) 8 (2016), 1–21. https://doi.org/10.1145/2926718
[12] Gerard Roma and Xavier Serra. 2015. Music performance by discovering community loops. In Proceedings of the Web Audio Conference (WAC), Paris.
[13] E. Shakirova. 2017. Collaborative Filtering for Music Recommender System. In IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus). IEEE, St. Petersburg, Russia, 548–550. https://doi.org/10.1109/EIConRus.2017.7910613
[14] Kai Siedenburg and Daniel Müllensiefen. 2017. Modeling Timbre Similarity of Short Music Clips. Frontiers in Psychology 8 (2017). https://doi.org/10.3389/fpsyg.2017.00639