The Impact of Salient Musical Features in a Hybrid Recommendation System for a Sound Library

Jason Brent Smith, Ashvala Vinay and Jason Freeman
Georgia Tech Center for Music Technology, 840 McMillan Street NW, Atlanta, Georgia, USA, 30308

Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia

Abstract

EarSketch is an online learning environment that teaches coding and music concepts through the computational manipulation of sounds selected from a large sound library. It features sound recommendations based on acoustic similarity and co-usage with a user's current sound selection in order to encourage exploration of the library. However, students have reported that the recommended sounds do not complement their current projects in two areas: musical key and rhythm. We aim to improve the relevance of these recommendations through the inclusion of these two musically related features. This paper describes the addition of key signature and beat extraction to the EarSketch sound recommendation model in order to improve the musical compatibility of the recommendations with the sounds in a user's project. Additionally, we present an analysis of the effects of these new recommendation strategies on user exploration and usage of the recommended sounds. The results of this analysis suggest that the addition of explicitly musically relevant attributes increases the coverage of the sound library among both the sound recommendations and the sounds selected by users. This reflects the importance of including multiple musical attributes when building recommendation systems for creative and open-ended musical systems.

1. Introduction

EarSketch [1] is a computational music remixing environment designed to teach music and computing concepts through the process of writing code to creatively manipulate audio loops. It is a web application that contains a code editor for students to write Python or JavaScript code using a custom API, and a Digital Audio Workstation for them to view and listen to the musical output produced by their code.

Previous analysis of EarSketch users revealed that a sense of creative ownership and expression for their work has been linked to intentions to persist in computer science education [2]. To this end, EarSketch was designed with the goal of being authentic to industry tools in terms of music production, musical content, and computing languages. It achieves this with the design of its interface and API as well as with the inclusion of a large sound library for students to explore and find sounds that are personally expressive and meaningful to them.

EarSketch contains a library of over 4,500 sounds produced by professional artists such as sound designer Richard Devine and hip-hop producer/DJ Young Guru, with additional stems from popular musicians such as Alicia Keys, Ciara, Common, Dakota Bear, Irizzary y Caraballo, Jayli Wolf, Khalid, Milk + Sizz, Pharrell Williams, and Samian. Users are able to search for sounds by name, filter by artist, genre, instrument, or key signature, and mark them as favorites for future use (see Figure 1), and they can preview or copy these sounds into their code as constants.

Figure 1: View of the EarSketch Sound Browser interface (top), with example recommendations (bottom).
A previous analysis of 20,000 user-created scripts showed that fewer than 200 library sounds were used in more than 1% of scripts and that fewer than 20 sounds were used in more than 10% of scripts. It was hypothesized that this was due to difficulty in navigating the sound browser, as users reported that it was hard to discover groups of sounds relevant to their current work. In order to address this under-utilization of the sound library and to promote compositional diversity among its users' projects, a recommendation system was added to EarSketch [3].

Figure 2: View of the EarSketch interface, with Sound Browser (left), Digital Audio Workstation (top), Code Editor (bottom), and Curriculum (right).

Diversity and coverage, measures of how different a set of recommendations are from each other and how much of the set of available options is being recommended, are common design goals of recommendation systems [4]. Recommendation systems that present diverse compositional material are prevalent in the music production platforms with which EarSketch aligns its design goals.

The EarSketch sound recommendation system was designed to assist in the process of navigating the sound library by presenting relevant, novel sounds for users to include in their code. By giving users more easily accessible sound options that match the content of their in-progress compositions, the system aims to improve the variety of sounds that users preview and copy into their scripts. It uses collaborative filtering [5] and acoustic similarity metrics to minimize or maximize co-usage and similarity scores in various combinations, generating recommendation scores that drive different recommendation types such as "Songs that Fit Your Script" or "Others Like You Used These".
Combining multiple recommendation strategies increased user exploration and sound usage, and users preferred different types of recommendations when freely creating a unique project than when matching sounds to others outside the context of EarSketch [6].

While the initial recommendation system, a hybrid model using collaborative filtering and content-based similarity metrics, improved the number of sounds explored by users, users have reported a lack of musical cohesion between recommended sounds once they have already included contrasting elements in a project, as well as a lack of sound suggestions that facilitate specific compositional ideas such as creating a new section of a song. This work aims to improve the recommendation system's impact on sound exploration and usage by adding two additional musical features as inputs: key signature and beat similarity. These features are musically motivated in that, unlike the existing system's use of the Short-Time Fourier Transform, they use explicit, human-understandable labels grounded in music theory. Although EarSketch does not include western music notation by design, each tonal sound was originally composed with a major or minor key signature in mind. As such, by adding explicit key labels [7] to sounds, the overall key of a user's current project can be estimated and sounds in that key can have their recommendation scores increased. In addition to tonal similarity, the system can prioritize recommendations that are rhythmically consistent with a user's project [8]. Beat detection is performed by generating a numerical vector representing the rhythm of each sound in the sound library, then computing the distance between two sounds' vectors and factoring it into their pairwise recommendation scores.

By adding the above features, we aim to answer the following question:

• How does the addition of salient musical features in the EarSketch sound recommendation system impact the diversity of sounds recommended and used in student projects?

The contributions of this work include the augmentation of a hybrid recommendation system, combining collaborative filtering with multiple aspects of feature-based audio similarity, and the evaluation of sound recommendations in a creative, open-ended task. The rest of this paper details the process of adding the musically motivated features of key signature and beat similarity to the EarSketch recommendation system (the dataset, the key signature and beat similarity extraction, and how they were incorporated into the recommender), followed by the methodology and analysis of an evaluation of these recommendations based on aggregate statistics of users on the EarSketch website.

2. Implementation

The EarSketch web client continuously monitors the sounds included in a user's project as they edit their code. Once a change is detected, the recommendation system generates a set of recommendations using the newly stored list of sounds as input [6]. The output is presented to users as a list of recommendations in the sound browser (Figure 1). This section discusses the implementations of key signature estimation and beat similarity calculation, as well as their addition to the existing EarSketch recommendation algorithm at the time of recommendation generation.

2.1. Key signature and beat extraction

In order to extract key signatures for the clips in the sound library, we used Essentia [9], a popular software package for music information retrieval. It implements several key profiles to estimate the key signature of a given sound, such as "edmm" [10], a profile generally suited to estimating key signatures of electronic music, and "braw" [11], a more general key signature estimation profile. In addition to the key signature, Essentia's key signature estimator also produces a strength score indicating how strongly the annotated key signature is present in the sample.

Identifying the best key signature profile in Essentia was done using an annotated subset of the library. For each profile, we compared the predicted key signatures for the subset against the ground truth annotations. The "edmm" profile stood out as the best profile since it predicted the largest number of correct annotations. Therefore, it was used to compute the key signatures for the dataset where key signatures were appropriate.¹

¹ We excluded purely percussive sounds and short, single-shot examples, e.g., snare samples.

Beats were extracted using librosa's [12] beat track prediction method. The method takes an audio signal and predicts its tempo and beat track; details of the method can be found in Daniel Ellis' paper [13], whose dynamic programming approach is the implementation used by librosa. The beat track prediction provided by librosa is a series of timestamps indicating where a beat might be. We take these timestamps and construct an audio signal with a click at each timestamp. For the sake of computational and space efficiency when computing scores, we downsampled the signal from 44100 Hz to 100 Hz.

In the paper detailing the implementation of the beat tracker, the dynamic programming approach is shown to achieve 93.4% accuracy on the MIREX beat tracking dataset [13]. Since we did not have ground truth annotations, we verified the beat predictions internally with an informal subjective evaluation, using generated click tracks on a random subset of the sound library. Using 5 sets of 16 sounds at a time, testers from the EarSketch development team rated the implementation as appropriately matching their perception for each example.
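As a concrete illustration of this offline feature-extraction step, the sketch below estimates a clip's key with Essentia's "edmm" profile and builds a low-rate beat vector from librosa's beat tracker. This is a minimal sketch rather than the production EarSketch code: the file paths, the fixed comparison length, marking beats directly on a 100 Hz grid (instead of literally rendering and resampling a click track), and the availability of the profileType parameter on KeyExtractor in the installed Essentia version are all assumptions.

```python
# Offline feature extraction for each library sound: key signature via Essentia
# and a 100 Hz binary beat vector via librosa. Illustrative sketch only.
import essentia.standard as es
import librosa
import numpy as np

BEAT_RATE = 100        # Hz, downsampled rate used for beat vectors
VECTOR_SECONDS = 8     # assumed common length so vectors can be compared pairwise

def extract_key(path):
    """Estimate (key, scale, strength) with the 'edmm' key profile."""
    audio = es.MonoLoader(filename=path)()                 # mono audio at 44100 Hz
    key, scale, strength = es.KeyExtractor(profileType='edmm')(audio)
    return key, scale, strength

def extract_beat_vector(path):
    """Return a binary vector at BEAT_RATE Hz with ones at predicted beat times."""
    y, sr = librosa.load(path, sr=44100)
    tempo, beat_times = librosa.beat.beat_track(y=y, sr=sr, units='time')
    vec = np.zeros(VECTOR_SECONDS * BEAT_RATE, dtype=np.int8)
    idx = np.round(beat_times * BEAT_RATE).astype(int)
    vec[idx[idx < len(vec)]] = 1                           # mark beats on the 100 Hz grid
    return vec
```

These per-sound results would then be stored alongside the library so that, as described below, no audio processing is needed at recommendation time.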
2.2. Recommendations

The previous algorithm used to recommend sounds to users was described in [3]. In short, each sound in the library was assigned a score

$S = \mathcal{D}_{STFT}^{-1} + \mathcal{D}_{MFCC}^{-1} + \mathcal{U}$    (1)

where $\mathcal{D}_{STFT}$ and $\mathcal{D}_{MFCC}$ are acoustic feature distances between a given sound and every sound in the library, and $\mathcal{U}$ is the co-usage score, i.e., a score indicating how often two sounds were used together.

In order to add key signatures into our algorithm, we compute the key signature of the project, $\mathcal{K}_{proj}$, as the most frequent key signature label across all sound clips in a project. For a given sound clip $S$ and its corresponding key signature $\mathcal{K}_S$, we compute its key signature score $\mathcal{K}$ as

$\mathcal{K} = \begin{cases} 1, & \text{if } \mathcal{K}_S \in \{\mathcal{K}_{proj},\ \mathrm{relative}(\mathcal{K}_{proj})\} \\ 0, & \text{otherwise} \end{cases}$    (2)

where $\mathcal{K}$ is set to 1 if the clip's key signature matches the project's key signature or has a relative major/minor relationship with it.

To add a beat similarity score, we compute the Hamming distance [14] between two given beat tracks; its inverse is denoted $\mathcal{D}_{hamm}^{-1}$. We assume that users might select a set of samples with varying attributes, for example genre or instrumentation, that happen to share a consistent rhythmic structure. Hamming distances have been shown by Toussaint [14] to be a good measure of rhythmic similarity. Given that EarSketch time-stretches samples to match a specified tempo, we wanted a similarity measure that is tempo invariant and focused primarily on differences in how the rhythms themselves are performed.

Adding key and beat information to the system was done as an addition to the score $S$ described in Equation 1:

$S = \mathcal{D}_{STFT}^{-1} + \mathcal{D}_{MFCC}^{-1} + \mathcal{U} + \mathcal{K} + \mathcal{D}_{hamm}^{-1}$    (3)

Like the co-usage and acoustic similarity scores in the initial version of the recommendation system [6], the key signature estimation and beat extraction processes are performed offline for the whole sound library. Their results are deployed to the EarSketch web client, where they are retrieved for individual sound-sound pairs and used in real-time recommendations. This allows for fast recommendations without the need for heavy audio processing while users are editing a project.
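A compact sketch of how these terms could be combined per Equations (2) and (3) is shown below. The acoustic distances and the co-usage score are assumed to be precomputed lookups, and the function names, the relative-key helper, the sharp-only pitch spellings, and the guard against zero distances are illustrative assumptions rather than the actual EarSketch implementation.

```python
# Scoring sketch for Equations (2) and (3); all names are illustrative.
from collections import Counter
from scipy.spatial.distance import hamming

PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def relative(key):
    """Relative major/minor of a (tonic, scale) pair, assuming sharp spellings."""
    tonic, scale = key
    i = PITCHES.index(tonic)
    if scale == 'major':
        return (PITCHES[(i + 9) % 12], 'minor')   # ('C', 'major') -> ('A', 'minor')
    return (PITCHES[(i + 3) % 12], 'major')       # ('A', 'minor') -> ('C', 'major')

def project_key(clip_keys):
    """K_proj: the most frequent key label among the clips already in the project."""
    return Counter(clip_keys).most_common(1)[0][0]

def key_score(clip_key, proj_key):
    """Equation (2): 1 if the clip is in the project key or its relative key."""
    return 1.0 if clip_key in (proj_key, relative(proj_key)) else 0.0

def recommendation_score(d_stft, d_mfcc, co_usage, clip_key, proj_key,
                         beat_a, beat_b, eps=1e-6):
    """Equation (3): inverse distances plus co-usage, key, and inverse beat distance."""
    d_hamm = hamming(beat_a, beat_b)              # fraction of 100 Hz cells that differ
    return (1.0 / max(d_stft, eps) + 1.0 / max(d_mfcc, eps) + co_usage
            + key_score(clip_key, proj_key) + 1.0 / max(d_hamm, eps))
```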
3. Results

As with a previous evaluation of the recommendation system [6], the impact of this recommender was measured through statistical analysis of the sounds recommended to and added to projects by EarSketch users. The key signature and beat similarity changes were added to the EarSketch website in October 2022. Using an analytics engine, the actions of 103,828 users before the update were recorded between July and September 2022, and the actions of 133,349 users after the update were recorded between October and December 2022.

During each session for a given user, each unique recommendation made is stored as a separate recommendation event. A separate event, recommendationUsed, is stored when the student uses a recommended sound in their project, either by copying it directly from the sound browser interface or by writing the name of the sound into their code. The usage density of a sound is the ratio of recommendationUsed to recommendation events for that individual sound constant. We determine coverage of the sound library from the distribution of unique recommendations made, as well as the rate at which these recommendations are used in student projects.

Figure 3: The distribution of how frequently a sound was recommended across the entire library before and after the addition of key signature and beat similarity to the recommendation system. The figure is scaled by the number of unique user sessions in the two time periods.

Figure 4: The distribution of how often recommended sounds were added to projects before and after the addition of key signature and beat similarity to the recommendation system. The figure is scaled by the number of unique user sessions in the two time periods.

The analysis of this data suggests that the inclusion of musically driven features further improves the diversity of sounds suggested by our hybrid recommendation system. Figure 3 depicts the frequency density of each sound based on the number of times that sound was recommended. The higher average and lower skew of the distribution after the addition of key signature and beat similarity indicate that more sounds from the EarSketch library are likely to be recommended. In the period prior to the update, a given sound was recommended an average of 1790 times (1.72% of sessions); following the update, a given sound was recommended an average of 3240 times (2.42% of sessions). Using a two-sample t-test, we note that the difference in recommendation frequency across the entire library was statistically significant (p < 0.05).

When comparing the usage of recommendations across both periods, we measured the frequency distribution of recommendationUsed events, i.e., the unique instances of previously recommended sounds being used in projects (see Figure 4). A two-sample t-test shows a statistically significant increase in recommendation usage frequency following the update (p < 0.05). On average, a recommended sound was used in 0.94% of the times it was recommended following the update, compared to an average of 0.73% prior to the update.
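The sketch below illustrates this aggregate analysis: per-sound recommendation counts and usage densities for the two periods, compared with a two-sample t-test. The event-log schema (columns for sound, event, and period) and the file name are assumptions made purely for illustration; the actual EarSketch analytics format differs.

```python
# Aggregate analysis sketch: per-sound recommendation counts and usage rates,
# compared across periods with a two-sample t-test. Schema is hypothetical.
import pandas as pd
from scipy.stats import ttest_ind

def per_sound_counts(events: pd.DataFrame, period: str) -> pd.Series:
    """Number of 'recommendation' events per sound constant in one period."""
    mask = (events['period'] == period) & (events['event'] == 'recommendation')
    return events[mask].groupby('sound').size()

def usage_rate(events: pd.DataFrame, period: str) -> pd.Series:
    """recommendationUsed / recommendation per sound, i.e. the usage density."""
    sub = events[events['period'] == period]
    used = sub[sub['event'] == 'recommendationUsed'].groupby('sound').size()
    recs = sub[sub['event'] == 'recommendation'].groupby('sound').size()
    return (used / recs).fillna(0.0)

events = pd.read_csv('recommendation_events.csv')    # hypothetical log export
before = per_sound_counts(events, 'before')
after = per_sound_counts(events, 'after')
print(ttest_ind(before, after, equal_var=False))      # shift in recommendation frequency
print(usage_rate(events, 'before').mean(), usage_rate(events, 'after').mean())
```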
We also observed that the inclusion of rhythmic features coincides with a noticeable uptick in the usage of percussive loops, with a larger proportion of used recommendations being percussive. We investigated the top 10 sounds recommended and used during both periods and found that 6 of the 10 most frequently used recommendations after the update are categorized as purely percussive sounds. In the period prior to the update, there was no majority among the instruments in the most used recommendations.

4. Discussion and Future Work

We implemented key signature and beat extraction in the EarSketch sound recommendation system to improve the diversity and coverage of the sounds being recommended to users and to make more musically relevant suggestions for a student's project. We analyzed the two periods of data above to identify trends in usage before and after the addition of these two musical features.

In our results, we were able to demonstrate that the inclusion of these features improves the diversity and coverage of recommended sounds. By comparing the distributions of unique recommendations per sound before and after the change, we found that recommendations were more evenly distributed across the sound library after the change. This may be because the algorithm picks up on more sounds that are pertinent to a given user's project. Additionally, there was a statistically significant increase in how often students elected to use a recommendation. This could be attributed to the prominence of beat similarity in the recommendation algorithm providing sounds that stylistically match a user's current sounds and as such present more viable options to try in a given project.

We noticed a shift in the types of recommended sounds that are most frequently used across the two periods. Following the introduction of our updated algorithm, we found that a majority of the most used recommendations were percussive or primarily rhythmic. We believe that this is an artifact of how the key signatures and rhythmic similarities of sounds are weighted in the recommendation process. We speculate that students largely seek rhythmic sounds at the beginning stages of the song-creation process. Given that the weighting for pitched sounds necessitates the existence of a key signature, the recommendation algorithm skews heavily towards rhythmic sounds at the start of a new project. Additionally, users with developed projects may prefer recommendations that do not clash with their current selections, such as percussion samples without a key signature. In order to understand this behavior better, we need a more in-depth user study of how recommendation behavior influences the song creation process for students. By rating users' understanding of and satisfaction with recommendations with and without the musical features in a controlled setting, we can determine how effective these features are and what visual design changes are necessary to enhance the effectiveness of musically informed recommendations.

In conclusion, this analysis of the impact that salient musical features have on EarSketch users reveals multiple insights for the design of recommendation systems and other creative systems. The use of recommendation density to compare groups shows how artifact analysis, even in its simplest form, can represent trends in user interaction with a creative musical assistant. The significant change in the density of unique sound recommendations shows the effectiveness of multimodal domain knowledge on recommendation generation. As the EarSketch recommendation system either minimizes or maximizes co-usage scores as well as acoustic similarity [3], the addition of features to multiple types of recommendations shows the importance of understanding task specifications when discussing recommendations for a creative system.

5. Acknowledgments

This material is based upon work supported by the National Science Foundation under Award No. 1814083. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. EarSketch is available online at https://earsketch.gatech.edu.
References

[1] B. Magerko, J. Freeman, T. McKlin, M. Reilly, E. Livingston, S. McCoid, A. Crews-Brown, EarSketch: A STEAM-based approach for underrepresented populations in high school computer science education, ACM Transactions on Computing Education 16 (2016) 1–25. doi:10.1145/2886418.
[2] T. McKlin, B. Magerko, T. Lee, D. Wanzer, D. Edwards, J. Freeman, Authenticity and personal creativity: How EarSketch affects student persistence, in: Proceedings of the 49th ACM Technical Symposium on Computer Science Education, 2018, pp. 987–992. doi:10.1145/3159450.3159523.
[3] J. Smith, D. Weeks, M. Jacob, J. Freeman, B. Magerko, Towards a Hybrid Recommendation System for a Sound Library, in: Joint Proceedings of the ACM IUI 2019 Workshops, CEUR-WS, 2019.
[4] C. C. Aggarwal, Recommender Systems, Springer International Publishing, 2016. URL: http://link.springer.com/10.1007/978-3-319-29659-3. doi:10.1007/978-3-319-29659-3.
[5] J. B. Schafer, D. Frankowski, J. Herlocker, S. Sen, Collaborative Filtering Recommender Systems, in: P. Brusilovsky, A. Kobsa, W. Nejdl (Eds.), The Adaptive Web: Methods and Strategies of Web Personalization, Lecture Notes in Computer Science, Springer, 2007, pp. 291–324. URL: https://doi.org/10.1007/978-3-540-72079-9_9. doi:10.1007/978-3-540-72079-9_9.
[6] J. Smith, M. Jacob, J. Freeman, B. Magerko, T. McKlin, Combining collaborative and content filtering in a recommendation system for a web-based DAW, in: A. Xambó, S. R. Martín, G. Roma (Eds.), Proceedings of the International Web Audio Conference, WAC '19, NTNU, Trondheim, Norway, 2019, pp. 53–58.
[7] M.-K. Shan, F.-F. Kuo, M.-F. Chiang, S.-Y. Lee, Emotion-based music recommendation by affinity discovery from film music, Expert Systems with Applications 36 (2009) 7666–7674. doi:10.1016/j.eswa.2008.09.042.
[8] X. Wang, Y. Wang, Improving content-based and hybrid music recommendation using deep learning, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 627–636. doi:10.1145/2647868.2654940.
[9] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, X. Serra, ESSENTIA: An open-source library for sound and music analysis, in: Proceedings of the 21st ACM International Conference on Multimedia, MM '13, Association for Computing Machinery, 2013, pp. 855–858. URL: https://doi.org/10.1145/2502081.2502229. doi:10.1145/2502081.2502229.
[10] Á. Faraldo, E. Gómez, S. Jordà, P. Herrera, Key Estimation in Electronic Dance Music, in: N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, G. Silvello (Eds.), Advances in Information Retrieval, volume 9626 of Lecture Notes in Computer Science, Springer International Publishing, 2016, pp. 335–347. URL: http://link.springer.com/10.1007/978-3-319-30671-1_25. doi:10.1007/978-3-319-30671-1_25.
[11] Á. Faraldo, S. Jordà, P. Herrera, A Multi-Profile Method for Key Estimation in EDM, in: Proceedings of the Conference on Semantic Audio, 2017, p. 7. URL: https://doi.org/10.5281/zenodo.3855499. doi:10.5281/zenodo.3855499.
[12] B. McFee, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee, O. Nieto, D. Ellis, J. Mason, E. Battenberg, S. Seyfarth, R. Yamamoto, viktorandreevichmorozov, K. Choi, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, A. Weiss, D. Hereñú, F.-R. Stöter, L. Nickel, P. Friesch, M. Vollrath, T. Kim, librosa/librosa: 0.9.2, 2022. URL: https://doi.org/10.5281/zenodo.6759664. doi:10.5281/zenodo.6759664.
[13] D. P. W. Ellis, Beat tracking by dynamic programming, Journal of New Music Research 36 (2007) 51–60. URL: https://doi.org/10.1080/09298210701653344. doi:10.1080/09298210701653344.
[14] G. T. Toussaint, A comparison of rhythmic similarity measures, in: Proceedings of the 5th International Conference on Music Information Retrieval, ISMIR, Barcelona, Spain, 2004. URL: https://doi.org/10.5281/zenodo.1416812. doi:10.5281/zenodo.1416812.