The Impact of Salient Musical Features in a Hybrid Recommendation System for a Sound Library

Jason Brent Smith, Ashvala Vinay and Jason Freeman
Georgia Tech Center for Music Technology, 840 McMillan Street NW, Atlanta, Georgia, USA, 30308

Joint Proceedings of the ACM IUI Workshops 2023, March 2023, Sydney, Australia

Abstract

EarSketch is an online learning environment that teaches coding and music concepts through the computational manipulation of sounds selected from a large sound library. It features sound recommendations based on acoustic similarity and co-usage with a user's current sound selection in order to encourage exploration of the library. However, students have reported that the recommended sounds do not complement their current projects in two areas: musical key and rhythm. We aim to improve the relevance of these recommendations through the inclusion of these two musically related features. This paper describes the addition of key signature and beat extraction to the EarSketch sound recommendation model in order to improve the musical compatibility of the recommendations with the sounds in a user's project. Additionally, we present an analysis of the effects of these new recommendation strategies on user exploration and usage of the recommended sounds. The results of this analysis suggest that the addition of explicitly musically relevant attributes increases the coverage of the sound library among both the sound recommendations and the sounds selected by users. This reflects the importance of including multiple musical attributes when building recommendation systems for creative and open-ended musical systems.

1. Introduction

EarSketch [1] is a computational music remixing environment designed to teach music and computing concepts through the process of writing code to creatively manipulate audio loops. It is a web application that contains a code editor for students to write Python or JavaScript code using a custom API, and a Digital Audio Workstation for them to view and listen to the musical output produced by their code.

Previous analysis of EarSketch users revealed that a sense of creative ownership and expression for their work has been linked to intentions to persist in computer science education [2]. To this end, EarSketch was designed with the goal of being authentic to industry tools in terms of music production, musical content, and computing languages. It achieves this with the design of its interface and API as well as with the inclusion of a large sound library for students to explore and find sounds that are personally expressive and meaningful to them.

EarSketch contains a library of over 4,500 sounds produced by professional artists such as sound designer Richard Devine and hip-hop producer/DJ Young Guru, with additional stems from popular musicians such as Alicia Keys, Ciara, Common, Dakota Bear, Irizzary y Caraballo, Jayli Wolf, Khalid, Milk + Sizz, Pharrell Williams, and Samian. Users are able to search for sounds by name, filter by artist, genre, instrument, or key signature, and mark them as favorites for future use (see Figure 1), and they can preview or copy these sounds into their code as constants.

Figure 1: View of the EarSketch Sound Browser interface (top), with example recommendations (bottom).
A previous analysis of 20,000 user-created scripts showed that fewer than 200 library sounds were used in more than 1% of scripts and that fewer than 20 sounds were used in more than 10% of scripts. It was hypothesized that this was due to difficulty in navigating the sound browser, as users reported that it was hard to discover groups of sounds relevant to their current work. In order to address this under-utilization of the sound library and to promote compositional diversity among its users' projects, a recommendation system was added to EarSketch [3].

Figure 2: View of the EarSketch interface, with Sound Browser (left), Digital Audio Workstation (top), Code Editor (bottom), and Curriculum (right).

Diversity and coverage, measures of how different a set of recommendations are from each other and how much of the set of available options is being recommended, are common design goals of recommendation systems [4]. Recommendation systems that present diverse compositional material are prevalent in the music production platforms with which EarSketch aligns its design goals.

The EarSketch sound recommendation system was designed to assist in the process of navigating the sound library by presenting relevant, novel sounds for users to include in their code. By giving users more easily accessible sound options that match the content of their in-progress compositions, the system aims to improve the variety of sounds that users preview and copy into their scripts. It uses collaborative filtering [5] and acoustic similarity metrics to minimize or maximize co-usage and similarity scores in various combinations, generating recommendation scores that drive different recommendation types such as "Songs that Fit Your Script" or "Others Like You Used These".
Combining multiple recommendation strategies increased user exploration and sound usage, and users preferred different types of recommendations when freely creating a unique project than when matching sounds to others outside the context of EarSketch [6].

While the initial recommendation system, a hybrid model using collaborative filtering and content-based similarity metrics, improved the number of sounds explored by users, users have reported a lack of musical cohesion between recommended sounds once they have already included contrasting elements in a project, as well as a lack of sound suggestions that facilitate specific compositional ideas such as creating a new section of a song. This work aims to improve the recommendation system's impact on sound exploration and usage by adding two additional musical features as inputs: key signature and beat similarity. These features are musically motivated in that, unlike the existing system's use of the Short-Time Fourier Transform, they use explicit, human-understandable labels grounded in music theory. Although EarSketch does not include western music notation by design, each tonal sound was originally composed with a major or minor key signature in mind. As such, by adding explicit key labels [7] to sounds, the overall key of a user's current project can be estimated and sounds in that key can have their recommendation scores increased. In addition to tonal similarity, the system can prioritize recommendations that are rhythmically consistent with a user's project [8]. Beat detection is performed by generating a numerical vector representing the rhythm of each sound in the sound library, then computing the distance between two sounds' vectors and factoring it into their pairwise recommendation scores.

By adding the above features, we aim to answer the following question:

• How does the addition of salient musical features in the EarSketch sound recommendation system impact the diversity of sounds recommended and used in student projects?

The contributions of this work include the augmentation of a hybrid recommendation system, combining collaborative filtering with multiple aspects of feature-based audio similarity, and the evaluation of sound recommendations in a creative, open-ended task. The rest of this paper details the process of adding the musically motivated features of key signature and beat similarity to the EarSketch recommendation system (the dataset, the key signature and beat similarity extraction, and how they were incorporated into the recommender), followed by the methodology and analysis of an evaluation of these recommendations based on aggregate statistics of users on the EarSketch website.

2. Implementation

The EarSketch web client continuously monitors the sounds included in a user's project as they edit their code. Once a change is detected, the recommendation system generates a set of recommendations using the newly stored list of sounds as input [6]. The output is presented to users as a list of recommendations in the sound browser (Figure 1). This section discusses the implementations of key signature estimation and beat similarity calculation, as well as their addition to the existing EarSketch recommendation algorithm at the time of recommendation generation.

2.1. Key signature and beat extraction

In order to extract key signatures for the clips in the sound library, we used Essentia [9], a popular software package for music information retrieval. It implements several key profiles to estimate the key signature of a given sound, such as "edmm" [10], a profile generally suited to estimating key signatures of electronic music, and "braw" [11], a more general key signature estimation profile. In addition to the key signature, Essentia's key signature estimator also produces a strength score indicating how strongly the annotated key signature is present in the sample.

Identifying the best key signature profile in Essentia was done using an annotated subset of the library. For each profile, we compared the predicted key signatures for the subset against the ground truth annotations. The "edmm" profile stood out as the best profile since it predicted the largest number of correct annotations. Therefore, it was used to compute the key signatures for the dataset where key signatures were appropriate.¹

¹ We excluded purely percussive sounds and short, single-shot examples, e.g., snare samples.

Beats were extracted using librosa's [12] beat track prediction method. The method takes an audio signal and predicts its tempo and beat track; details of the method can be found in Daniel Ellis' paper [13], whose dynamic programming approach is the implementation used by librosa. The beat track prediction provided by librosa is a series of timestamps indicating where a beat might be. We take these timestamps and construct an audio signal with a click at each timestamp. For the sake of computational and space efficiency when computing scores, we downsampled the signal from 44100 Hz to 100 Hz.

In the paper detailing the implementation of the beat tracker, the dynamic programming approach is shown to achieve 93.4% accuracy on the MIREX beat tracking dataset [13]. Since we did not have ground truth annotations, we verified the beat predictions internally with an informal subjective evaluation, using generated click tracks on a random subset of the sound library. Using 5 sets of 16 sounds at a time, testers from the EarSketch development team rated the implementation as appropriately matching their perception for each example.
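As a concrete illustration of this offline feature-extraction step, the sketch below estimates a clip's key with Essentia's "edmm" profile and builds a low-rate beat vector from librosa's beat tracker. This is a minimal sketch rather than the production EarSketch code: the file paths, the fixed comparison length, marking beats directly on a 100 Hz grid (instead of literally rendering and resampling a click track), and the availability of the profileType parameter on KeyExtractor in the installed Essentia version are all assumptions.

```python
# Offline feature extraction for each library sound: key signature via Essentia
# and a 100 Hz binary beat vector via librosa. Illustrative sketch only.
import essentia.standard as es
import librosa
import numpy as np

BEAT_RATE = 100        # Hz, downsampled rate used for beat vectors
VECTOR_SECONDS = 8     # assumed common length so vectors can be compared pairwise

def extract_key(path):
    """Estimate (key, scale, strength) with the 'edmm' key profile."""
    audio = es.MonoLoader(filename=path)()                 # mono audio at 44100 Hz
    key, scale, strength = es.KeyExtractor(profileType='edmm')(audio)
    return key, scale, strength

def extract_beat_vector(path):
    """Return a binary vector at BEAT_RATE Hz with ones at predicted beat times."""
    y, sr = librosa.load(path, sr=44100)
    tempo, beat_times = librosa.beat.beat_track(y=y, sr=sr, units='time')
    vec = np.zeros(VECTOR_SECONDS * BEAT_RATE, dtype=np.int8)
    idx = np.round(beat_times * BEAT_RATE).astype(int)
    vec[idx[idx < len(vec)]] = 1                           # mark beats on the 100 Hz grid
    return vec
```

These per-sound results would then be stored alongside the library so that, as described below, no audio processing is needed at recommendation time.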
2.2. Recommendations

The previous algorithm used to recommend sounds to users was described in [3]. In short, each sound in the library was assigned a score

$S = \mathcal{D}_{STFT}^{-1} + \mathcal{D}_{MFCC}^{-1} + \mathcal{U}$    (1)

where $\mathcal{D}_{STFT}$ and $\mathcal{D}_{MFCC}$ are acoustic feature distances between a given sound and every sound in the library, and $\mathcal{U}$ is the co-usage score, i.e., a score indicating how often two sounds were used together.

In order to add key signatures into our algorithm, we compute the key signature of the project, $\mathcal{K}_{proj}$, as the most frequent key signature label across all sound clips in a project. For a given sound clip $S$ and its corresponding key signature $\mathcal{K}_S$, we compute its key signature score $\mathcal{K}$ as

$\mathcal{K} = \begin{cases} 1, & \text{if } \mathcal{K}_S \in \{\mathcal{K}_{proj},\ \mathrm{relative}(\mathcal{K}_{proj})\} \\ 0, & \text{otherwise} \end{cases}$    (2)

where $\mathcal{K}$ is set to 1 if the clip's key signature matches the project's key signature or has a relative major/minor relationship with it.

To add a beat similarity score, we compute the Hamming distance [14] between two given beat tracks; its inverse is denoted $\mathcal{D}_{hamm}^{-1}$. We assume that users might select a set of samples with varying attributes, for example genre or instrumentation, that happen to share a consistent rhythmic structure. Hamming distances have been shown by Toussaint [14] to be a good measure of rhythmic similarity. Given that EarSketch time-stretches samples to match a specified tempo, we wanted a similarity measure that is tempo invariant and focused primarily on differences in how the rhythms themselves are performed.

Adding key and beat information to the system was done as an addition to the score $S$ described in Equation 1:

$S = \mathcal{D}_{STFT}^{-1} + \mathcal{D}_{MFCC}^{-1} + \mathcal{U} + \mathcal{K} + \mathcal{D}_{hamm}^{-1}$    (3)

Like the co-usage and acoustic similarity scores in the initial version of the recommendation system [6], the key signature estimation and beat extraction processes are performed offline for the whole sound library. Their results are deployed to the EarSketch web client, where they are retrieved for individual sound-sound pairs and used in real-time recommendations. This allows for fast recommendations without the need for heavy audio processing while users are editing a project.
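A compact sketch of how these terms could be combined per Equations (2) and (3) is shown below. The acoustic distances and the co-usage score are assumed to be precomputed lookups, and the function names, the relative-key helper, the sharp-only pitch spellings, and the guard against zero distances are illustrative assumptions rather than the actual EarSketch implementation.

```python
# Scoring sketch for Equations (2) and (3); all names are illustrative.
from collections import Counter
from scipy.spatial.distance import hamming

PITCHES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def relative(key):
    """Relative major/minor of a (tonic, scale) pair, assuming sharp spellings."""
    tonic, scale = key
    i = PITCHES.index(tonic)
    if scale == 'major':
        return (PITCHES[(i + 9) % 12], 'minor')   # ('C', 'major') -> ('A', 'minor')
    return (PITCHES[(i + 3) % 12], 'major')       # ('A', 'minor') -> ('C', 'major')

def project_key(clip_keys):
    """K_proj: the most frequent key label among the clips already in the project."""
    return Counter(clip_keys).most_common(1)[0][0]

def key_score(clip_key, proj_key):
    """Equation (2): 1 if the clip is in the project key or its relative key."""
    return 1.0 if clip_key in (proj_key, relative(proj_key)) else 0.0

def recommendation_score(d_stft, d_mfcc, co_usage, clip_key, proj_key,
                         beat_a, beat_b, eps=1e-6):
    """Equation (3): inverse distances plus co-usage, key, and inverse beat distance."""
    d_hamm = hamming(beat_a, beat_b)              # fraction of 100 Hz cells that differ
    return (1.0 / max(d_stft, eps) + 1.0 / max(d_mfcc, eps) + co_usage
            + key_score(clip_key, proj_key) + 1.0 / max(d_hamm, eps))
```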
3. Results

As with a previous evaluation of the recommendation system [6], the impact of this recommender was measured through statistical analysis of the sounds recommended to and added to projects by EarSketch users. The key signature and beat similarity changes were added to the EarSketch website in October 2022. Using an analytics engine, the actions of 103,828 users before the update were recorded between July and September 2022, and the actions of 133,349 users after the update were recorded between October and December 2022.

During each session for a given user, each unique recommendation made is stored as a separate recommendation event. A separate event, recommendationUsed, is stored when the student uses a recommended sound in their project, either by copying it directly from the sound browser interface or by writing the name of the sound into their code. The usage density of a sound is the ratio of recommendationUsed to recommendation events for that individual sound constant. We determine coverage of the sound library from the distribution of unique recommendations made, as well as the rate at which these recommendations are used in student projects.

Figure 3: The distribution of how frequently a sound was recommended across the entire library before and after the addition of key signature and beat similarity to the recommendation system. The figure is scaled by the number of unique user sessions in the two time periods.

Figure 4: The distribution of how often recommended sounds were added to projects before and after the addition of key signature and beat similarity to the recommendation system. The figure is scaled by the number of unique user sessions in the two time periods.

The analysis of this data suggests that the inclusion of musically driven features further improves the diversity of sounds suggested by our hybrid recommendation system. Figure 3 depicts the frequency density of each sound based on the number of times that sound was recommended. The higher average and lower skew of the distribution after the addition of key signature and beat similarity indicate that more sounds from the EarSketch library are likely to be recommended. In the period prior to the update, a given sound was recommended an average of 1790 times (1.72% of sessions); following the update, a given sound was recommended an average of 3240 times (2.42% of sessions). Using a two-sample t-test, we note that the difference in recommendation frequency across the entire library was statistically significant (p < 0.05).

When comparing the usage of recommendations across both periods, we measured the frequency distribution of recommendationUsed events, i.e., the unique instances of previously recommended sounds being used in projects (see Figure 4). A two-sample t-test shows a statistically significant increase in recommendation usage frequency following the update (p < 0.05). On average, a recommended sound was used in 0.94% of the times it was recommended following the update, compared to an average of 0.73% prior to the update.
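The sketch below illustrates this aggregate analysis: per-sound recommendation counts and usage densities for the two periods, compared with a two-sample t-test. The event-log schema (columns for sound, event, and period) and the file name are assumptions made purely for illustration; the actual EarSketch analytics format differs.

```python
# Aggregate analysis sketch: per-sound recommendation counts and usage rates,
# compared across periods with a two-sample t-test. Schema is hypothetical.
import pandas as pd
from scipy.stats import ttest_ind

def per_sound_counts(events: pd.DataFrame, period: str) -> pd.Series:
    """Number of 'recommendation' events per sound constant in one period."""
    mask = (events['period'] == period) & (events['event'] == 'recommendation')
    return events[mask].groupby('sound').size()

def usage_rate(events: pd.DataFrame, period: str) -> pd.Series:
    """recommendationUsed / recommendation per sound, i.e. the usage density."""
    sub = events[events['period'] == period]
    used = sub[sub['event'] == 'recommendationUsed'].groupby('sound').size()
    recs = sub[sub['event'] == 'recommendation'].groupby('sound').size()
    return (used / recs).fillna(0.0)

events = pd.read_csv('recommendation_events.csv')    # hypothetical log export
before = per_sound_counts(events, 'before')
after = per_sound_counts(events, 'after')
print(ttest_ind(before, after, equal_var=False))      # shift in recommendation frequency
print(usage_rate(events, 'before').mean(), usage_rate(events, 'after').mean())
```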
We also observed that the inclusion of rhythmic features coincides with a noticeable uptick in the usage of percussive loops, with a larger proportion of used recommendations being percussive. We investigated the top 10 sounds recommended and used during both periods and found that 6 of the 10 most frequently used recommendations after the update are categorized as purely percussive sounds. In the period prior to the update, there was no majority among the instruments in the most used recommendations.

4. Discussion and Future Work

We implemented key signature and beat extraction in the EarSketch sound recommendation system to improve the diversity and coverage of the sounds being recommended to users and to make more musically relevant suggestions for a student's project. We analyzed the two periods of data above to identify trends in usage before and after the addition of these two musical features.

In our results, we were able to demonstrate that the inclusion of these features improves the diversity and coverage of recommended sounds. By comparing the distributions of unique recommendations per sound before and after the change, we found that recommendations were more evenly distributed across the sound library after the change. This may be because the algorithm picks up on more sounds that are pertinent to a given user's project. Additionally, there was a statistically significant increase in how often students elected to use a recommendation. This could be attributed to the prominence of beat similarity in the recommendation algorithm providing sounds that stylistically match a user's current sounds and as such present more viable options to try in a given project.

We noticed a shift in the types of recommended sounds that are most frequently used across the two periods. Following the introduction of our updated algorithm, we found that a majority of the most used recommendations were percussive or primarily rhythmic. We believe that this is an artifact of how the key signatures and rhythmic similarities of sounds are weighted in the recommendation process. We speculate that students largely seek rhythmic sounds at the beginning stages of the song-creation process. Given that the weighting for pitched sounds necessitates the existence of a key signature, the recommendation algorithm skews heavily towards rhythmic sounds at the start of a new project. Additionally, users with developed projects may prefer recommendations that do not clash with their current selections, such as percussion samples without a key signature. In order to understand this behavior better, we need a more in-depth user study of how recommendation behavior influences the song creation process for students. By rating users' understanding of and satisfaction with recommendations with and without the musical features in a controlled setting, we can determine how effective these features are and what visual design changes are necessary to enhance the effectiveness of musically informed recommendations.

In conclusion, this analysis of the impact that salient musical features have on EarSketch users reveals multiple insights for the design of recommendation systems and other creative systems. The use of recommendation density to compare groups shows how artifact analysis, even in its simplest form, can represent trends in user interaction with a creative musical assistant. The significant change in the density of unique sound recommendations shows the effectiveness of multimodal domain knowledge on recommendation generation. As the EarSketch recommendation system either minimizes or maximizes co-usage scores as well as acoustic similarity [3], the addition of features to multiple types of recommendations shows the importance of understanding task specifications when discussing recommendations for a creative system.

5. Acknowledgments

This material is based upon work supported by the National Science Foundation under Award No. 1814083. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. EarSketch is available online at https://earsketch.gatech.edu.
References

[1] B. Magerko, J. Freeman, T. McKlin, M. Reilly, E. Livingston, S. McCoid, A. Crews-Brown, EarSketch: A STEAM-based approach for underrepresented populations in high school computer science education, ACM Transactions on Computing Education 16 (2016) 1–25. doi:10.1145/2886418.
[2] T. McKlin, B. Magerko, T. Lee, D. Wanzer, D. Edwards, J. Freeman, Authenticity and personal creativity: How EarSketch affects student persistence, in: Proceedings of the 49th ACM Technical Symposium on Computer Science Education, 2018, pp. 987–992. doi:10.1145/3159450.3159523.
[3] J. Smith, D. Weeks, M. Jacob, J. Freeman, B. Magerko, Towards a Hybrid Recommendation System for a Sound Library, in: Joint Proceedings of the ACM IUI 2019 Workshops, CEUR-WS, 2019.
[4] C. C. Aggarwal, Recommender Systems, Springer International Publishing, 2016. URL: http://link.springer.com/10.1007/978-3-319-29659-3. doi:10.1007/978-3-319-29659-3.
[5] J. B. Schafer, D. Frankowski, J. Herlocker, S. Sen, Collaborative Filtering Recommender Systems, in: P. Brusilovsky, A. Kobsa, W. Nejdl (Eds.), The Adaptive Web: Methods and Strategies of Web Personalization, Lecture Notes in Computer Science, Springer, 2007, pp. 291–324. URL: https://doi.org/10.1007/978-3-540-72079-9_9. doi:10.1007/978-3-540-72079-9_9.
[6] J. Smith, M. Jacob, J. Freeman, B. Magerko, T. McKlin, Combining collaborative and content filtering in a recommendation system for a web-based DAW, in: A. Xambó, S. R. Martín, G. Roma (Eds.), Proceedings of the International Web Audio Conference, WAC '19, NTNU, Trondheim, Norway, 2019, pp. 53–58.
[7] M.-K. Shan, F.-F. Kuo, M.-F. Chiang, S.-Y. Lee, Emotion-based music recommendation by affinity discovery from film music, Expert Systems with Applications 36 (2009) 7666–7674. doi:10.1016/j.eswa.2008.09.042.
[8] X. Wang, Y. Wang, Improving content-based and hybrid music recommendation using deep learning, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 627–636. doi:10.1145/2647868.2654940.
[9] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, X. Serra, ESSENTIA: An open-source library for sound and music analysis, in: Proceedings of the 21st ACM International Conference on Multimedia, MM '13, Association for Computing Machinery, 2013, pp. 855–858. URL: https://doi.org/10.1145/2502081.2502229. doi:10.1145/2502081.2502229.
[10] Á. Faraldo, E. Gómez, S. Jordà, P. Herrera, Key Estimation in Electronic Dance Music, in: N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, G. Silvello (Eds.), Advances in Information Retrieval, volume 9626 of Lecture Notes in Computer Science, Springer International Publishing, 2016, pp. 335–347. URL: http://link.springer.com/10.1007/978-3-319-30671-1_25. doi:10.1007/978-3-319-30671-1_25.
[11] Á. Faraldo, S. Jordà, P. Herrera, A Multi-Profile Method for Key Estimation in EDM, in: Proceedings of the Conference on Semantic Audio, 2017, p. 7. URL: https://doi.org/10.5281/zenodo.3855499. doi:10.5281/zenodo.3855499.
[12] B. McFee, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee, O. Nieto, D. Ellis, J. Mason, E. Battenberg, S. Seyfarth, R. Yamamoto, viktorandreevichmorozov, K. Choi, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, A. Weiss, D. Hereñú, F.-R. Stöter, L. Nickel, P. Friesch, M. Vollrath, T. Kim, librosa/librosa: 0.9.2, 2022. URL: https://doi.org/10.5281/zenodo.6759664. doi:10.5281/zenodo.6759664.
[13] D. P. W. Ellis, Beat tracking by dynamic programming, Journal of New Music Research 36 (2007) 51–60. URL: https://doi.org/10.1080/09298210701653344. doi:10.1080/09298210701653344.
[14] G. T. Toussaint, A comparison of rhythmic similarity measures, in: Proceedings of the 5th International Conference on Music Information Retrieval, ISMIR, Barcelona, Spain, 2004. URL: https://doi.org/10.5281/zenodo.1416812. doi:10.5281/zenodo.1416812.