The Impact of Salient Musical Features in a Hybrid
Recommendation System for a Sound Library
Jason Brent Smith1 , Ashvala Vinay1 and Jason Freeman1
1 Georgia Tech Center for Music Technology, 840 McMillan Street NW, Atlanta, Georgia, USA, 30308
jsmith775@gatech.edu (J. B. Smith); ashvala@gatech.edu (A. Vinay); jason.freeman@gatech.edu (J. Freeman)


Abstract

EarSketch is an online learning environment that teaches coding and music concepts through the computational manipulation of sounds selected from a large sound library. It features sound recommendations based on acoustic similarity and co-usage with a user's current sound selection in order to encourage exploration of the library. However, students have reported that the recommended sounds do not complement their current projects in two areas: musical key and rhythm. We aim to improve the relevance of these recommendations through the inclusion of these two musically related features. This paper describes the addition of key signature and beat extraction to the EarSketch sound recommendation model in order to improve the musical compatibility of the recommendations with the sounds in a user's project. Additionally, we present an analysis of the effects of these new recommendation strategies on user exploration and usage of the recommended sounds. The results of this analysis suggest that the addition of explicitly musically relevant attributes increases the coverage of the sound library among both the sound recommendations and the sounds selected by users. This reflects the importance of including multiple musical attributes when building recommendation systems for creative and open-ended musical systems.


1. Introduction

EarSketch [1] is a computational music remixing environment designed to teach music and computing concepts through the process of writing code to creatively manipulate audio loops. It is a web application that contains a code editor for students to write Python or JavaScript code using a custom API, and a Digital Audio Workstation for them to view and listen to the musical output produced by their code.

Previous analysis of EarSketch users revealed that a sense of creative ownership of and expression in their work has been linked to intentions to persist in computer science education [2]. To this end, EarSketch was designed with the goal of being authentic to industry tools in terms of music production, musical content, and computing languages. It achieves this with the design of its interface and API as well as with the inclusion of a large sound library for students to explore and find sounds that are personally expressive and meaningful to them.

EarSketch contains a library of over 4,500 sounds produced by professional artists such as sound designer Richard Devine and hip-hop producer/DJ Young Guru, as well as stems from popular musicians such as Alicia Keys, Ciara, Common, Dakota Bear, Irizzary y Caraballo, Jayli Wolf, Khalid, Milk + Sizz, Pharrell Williams, and Samian. Users are able to search for sounds by name; filter by artist, genre, instrument, or key signature; mark sounds as favorites for future use (see Fig. 1); and preview or copy sounds into their code as constants.
Figure 1: View of EarSketch Sound Browser interface (top), with example recommendations (bottom).

Figure 2: View of the EarSketch interface, with Sound Browser (left), Digital Audio Workstation (top), Code Editor (bottom), and Curriculum (right).

A previous analysis of 20,000 user-created scripts showed that fewer than 200 library sounds were used in over 1% of scripts, and fewer than 20 sounds were used in over 10% of scripts. It was hypothesized that this was due to difficulty in navigating the sound browser, as users reported that it was hard to discover groups of sounds relevant to their current work. In order to address this under-utilization of the sound library and to promote compositional diversity among its users' projects, a recommendation system was added to EarSketch [3].

Diversity and coverage, measures of how different a set of recommendations are from each other and how much of the set of available options is being recommended, are common design goals of recommendation systems [4]. Recommendation systems that present diverse compositional material are prevalent in the music production platforms with which EarSketch aligns its design goals. The EarSketch sound recommendation system was designed to assist in the process of navigating the sound library by presenting relevant, novel sounds for users to include in their code. By giving users more easily accessible sound options that match the content of their in-progress compositions, the system aims to improve the variety of sounds that users preview and copy into their scripts. It uses collaborative filtering [5] and acoustic similarity metrics, minimizing or maximizing co-usage and similarity scores in various combinations to generate scores for different recommendation types such as "Songs that Fit Your Script" or "Others Like You Used These". Combining multiple recommendation strategies increased user exploration and sound usage, and users preferred different types of recommendations when freely creating a unique project than when matching sounds to others outside the context of EarSketch [6].

While the initial recommendation system, a hybrid model using collaborative filtering and content-based similarity metrics, improved the number of sounds explored by users, users reported a lack of musical cohesion between recommended sounds once they had already included contrasting elements in a project, as well as a lack of sound suggestions that facilitated specific compositional ideas such as creating a new section of a song. This work aims to improve the recommendation system's impact on sound exploration and usage by adding two additional musical features as inputs: key signature and beat similarity. These features are musically motivated in that, unlike the existing system's use of the Short-Time Fourier Transform, they use explicit human-understandable labels grounded in music theory. Although EarSketch does not include western music notation by design, each tonal sound was originally composed with a major or minor key signature in mind. As such, by adding explicit key labels [7] to sounds, the overall key of a user's current project can be estimated, and sounds in that key can have their recommendation scores increased. In addition to tonal similarity, the system can prioritize recommendations that are rhythmically consistent with a user's project [8]. Beat detection is performed by generating a numerical vector representing the rhythm of each sound in the sound library, then computing the distance between two sounds' vectors and factoring it into their pairwise recommendation scores.
By adding the above features, we aim to answer the following question:

   • How does the addition of salient musical features in the EarSketch sound recommendation system impact the diversity of sounds recommended and used in student projects?

The contributions of this work include the augmentation of a hybrid recommendation system, combining collaborative filtering with multiple aspects of feature-based audio similarity, and the evaluation of sound recommendations in a creative, open-ended task. The rest of this paper details the process of adding the musically motivated features of key signature and beat similarity to the EarSketch recommendation system (the dataset, the key signature and beat similarity extraction, and how they were incorporated into the recommender), followed by the methodology and analysis of an evaluation of these recommendations on aggregate usage statistics from the EarSketch website.

2. Implementation

The EarSketch web client continuously monitors the sounds included in a user's project as they edit their code. Once a change is detected, the recommendation system generates a set of recommendations using the newly stored list of sounds as input [6]. The output is presented to users as a list of recommendations in the sound browser (Figure 1). This section discusses the implementations of key signature estimation and beat similarity calculation, as well as their addition to the existing EarSketch recommendation algorithm at the time of generation.

2.1. Key signature and beat extraction

In order to extract key signatures for the clips in the sound library, we used Essentia [9], a popular software package for music information retrieval. It implements several key profiles to estimate the key signature of a given sound, such as "edmm" [10], a profile that is generally suited to estimating key signatures in electronic music, and "braw" [11], a more general key estimation profile. In addition to the key signature, Essentia's key estimator also produces a strength score indicating how strongly the annotated key signature is present in the sample.

Identifying the best key signature profile in Essentia was done using an annotated subset of the library. For each profile, we compared the predicted key signatures for the subset against the ground truth annotations. The "edmm" profile stood out as the best since it predicted the largest number of correct annotations. Therefore, it was used to compute the key signatures for the dataset where key signatures were appropriate (we excluded purely percussive sounds and short, single-shot examples, e.g., snare samples).
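As a minimal illustration of this step (not the deployed EarSketch pipeline; the file path is hypothetical, and parameter names follow Essentia's documented KeyExtractor interface), key estimation with a chosen profile can be sketched as:

```python
# Sketch: estimating a key label and a strength score with Essentia's
# KeyExtractor, using the "edmm" profile selected above. The file path
# is hypothetical.
import essentia.standard as es

def estimate_key(path: str, profile: str = "edmm"):
    audio = es.MonoLoader(filename=path, sampleRate=44100)()
    key, scale, strength = es.KeyExtractor(profileType=profile)(audio)
    return key, scale, strength  # e.g. ("A", "minor", 0.82)
```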
Beats were extracted using librosa's [12] beat track prediction method, which takes an audio signal and predicts its tempo and beat track. Details of the method can be found in Daniel Ellis' paper [13], which describes the algorithm implemented by librosa. The beat track prediction provided by librosa is a series of timestamps indicating where a beat might be. We take these timestamps and construct an audio signal with a click at each of them. For the sake of computational and space efficiency when computing scores, we downsampled this signal from 44100 Hz to 100 Hz.
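A rough sketch of this extraction, assuming librosa's beat tracker; for brevity the 100 Hz click vector is constructed directly from the predicted beat times rather than by rendering a 44100 Hz click track and downsampling it:

```python
# Sketch: predicting beat times with librosa and encoding them as a
# 100 Hz binary click vector (a simplification of the render-then-
# downsample process described above).
import librosa
import numpy as np

def beat_vector(path: str, rate: int = 100) -> np.ndarray:
    y, sr = librosa.load(path, sr=44100)
    tempo, beat_times = librosa.beat.beat_track(y=y, sr=sr, units="time")
    length = int(np.ceil(len(y) / sr * rate))
    vec = np.zeros(length, dtype=np.int8)
    idx = np.minimum((beat_times * rate).astype(int), length - 1)
    vec[idx] = 1  # a "click" at each predicted beat
    return vec
```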
In the paper detailing the implementation of the beat tracker, it is shown that the dynamic programming approach achieves 93.4% accuracy on the MIREX beat tracking dataset [13]. Since we did not have ground truth annotations, we manually verified the beat predictions internally through an informal subjective evaluation, using generated click tracks on a random subset of the sound library. Using 5 sets of 16 sounds at a time, testers from the EarSketch development team rated the implementation as appropriately matching their perception for each example.

2.2. Recommendations

The previous algorithm to recommend sounds to users was described in [3]. In short, sounds in the library were assigned a score $S$

$$S = \mathcal{D}_{\mathrm{STFT}}^{-1} + \mathcal{D}_{\mathrm{MFCC}}^{-1} + \mathcal{U} \quad (1)$$

where $\mathcal{D}_{\mathrm{STFT}}$ and $\mathcal{D}_{\mathrm{MFCC}}$ are acoustic feature distances between a given sound and every sound in the library, and $\mathcal{U}$ is the co-usage score, i.e., a score indicating how often two sounds were used together.

In order to add key signatures to our algorithm, we compute the key signature of the project $\mathcal{K}_{\mathrm{proj}}$ as the most frequent key signature label across all sound clips in a project. For a given sound clip $S$ with corresponding key signature $\mathcal{K}_S$, we compute its key signature score $\mathcal{K}$ as

$$\mathcal{K} = \begin{cases} 1, & \text{if } \mathcal{K}_S \in \{\mathcal{K}_{\mathrm{proj}}, \mathrm{relative}(\mathcal{K}_{\mathrm{proj}})\} \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

i.e., $\mathcal{K}$ is set to 1 if the clip's key signature matches the project's key signature or has a relative major/minor relationship with it.
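A small sketch of this score (the key-label format and the relative-key helper are our illustrative assumptions, not the EarSketch implementation):

```python
# Sketch of the key signature score K in Equation 2. Key labels are
# assumed to be strings such as "C major" or "A minor"; relative_key()
# is a hypothetical helper for the relative major/minor relationship.
from collections import Counter

PITCHES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def relative_key(key: str) -> str:
    tonic, mode = key.split()
    i = PITCHES.index(tonic)
    if mode == "major":  # relative minor lies 3 semitones below the tonic
        return f"{PITCHES[(i + 9) % 12]} minor"
    return f"{PITCHES[(i + 3) % 12]} major"

def key_score(clip_key: str, project_keys: list[str]) -> int:
    # K_proj: the most frequent key label across the project's clips.
    k_proj = Counter(project_keys).most_common(1)[0][0]
    return 1 if clip_key in {k_proj, relative_key(k_proj)} else 0
```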
To add a beat similarity score, we compute the Hamming distance [14] between two given beat tracks, the inverse of which is denoted $\mathcal{D}_{\mathrm{hamm}}^{-1}$. We assume that users might select a set of samples with varying attributes, for example genre or instrumentation, that happen to have a consistent rhythmic structure. Hamming distance has been shown by Toussaint [14] to be a good measure of rhythmic similarity. Given that EarSketch time-stretches samples to match a specified tempo, we wanted a similarity measure that is tempo invariant and focused primarily on the difference in how the rhythms themselves are performed.

Adding key and beat information to the system was done as an addition to the score $S$ described in Equation 1:

$$S = \mathcal{D}_{\mathrm{STFT}}^{-1} + \mathcal{D}_{\mathrm{MFCC}}^{-1} + \mathcal{U} + \mathcal{K} + \mathcal{D}_{\mathrm{hamm}}^{-1} \quad (3)$$
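Putting the pieces together, a sketch of the combined score in Equation 3 (names are hypothetical; the small epsilon guarding each inversion is our assumption, since Equation 3 leaves the handling of zero distances unspecified):

```python
# Sketch of the combined recommendation score in Equation 3. The feature
# distances and co-usage score are assumed to be precomputed offline.
import numpy as np

EPS = 1e-6  # our guard against division by zero for identical sounds

def hamming_distance(a: np.ndarray, b: np.ndarray) -> float:
    n = min(len(a), len(b))  # compare the overlapping region
    return float(np.sum(a[:n] != b[:n]))

def score(stft_dist: float, mfcc_dist: float, cousage: float,
          key: int, beats_a: np.ndarray, beats_b: np.ndarray) -> float:
    return (1.0 / (stft_dist + EPS)      # D_STFT^-1
            + 1.0 / (mfcc_dist + EPS)    # D_MFCC^-1
            + cousage                    # U
            + key                        # K (0 or 1)
            + 1.0 / (hamming_distance(beats_a, beats_b) + EPS))  # D_hamm^-1
```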
Like the co-usage and acoustic similarity scores in the initial version of the recommendation system [6], the key signature estimation and beat extraction processes are performed offline for the whole sound library. Their results are deployed to the EarSketch web client, where they are retrieved for individual sound-sound pairs and used in real-time recommendations. This allows for faster recommendations without requiring heavy audio processing while users are editing a project.

3. Results

As with a previous evaluation of the recommendation system [6], the impact of this recommender was measured through statistical analysis of the sounds recommended to and added to projects by EarSketch users. The key signature and beat similarity recommendation changes were added to the EarSketch website in October 2022. Using an analytics engine, the actions of 103,828 users before the update were recorded between July and September 2022, and the actions of 133,349 users after the update were recorded between October and December 2022.

During each session for a given user, each unique recommendation made is stored as a separate recommendation event. A separate event, recommendationUsed, is stored when the student uses a recommended sound in their project, either by copying it directly from the sound browser interface or by writing the name of the sound into their code. The usage density of a recommended sound is the ratio of recommendationUsed to recommendation events for each individual sound constant. We determine coverage of the sound library by the distribution of unique recommendations made, as well as the rate at which these recommendations are used in student projects.

Figure 3: The distribution of how frequently a sound was recommended across the entire library before and after the addition of key signature and beat similarity to the recommendation system. The figure is scaled by the number of unique user sessions in the two time periods.

Figure 4: The distribution of how recommended sounds were added to projects before and after the addition of key signature and beat similarity to the recommendation system. The figure is scaled by the number of unique user sessions in the two time periods.

The analysis of this data suggests that the inclusion of musically driven features further improves the diversity of sounds suggested by our hybrid recommendation system. Fig 3 depicts the frequency density of each sound based on the number of times that sound was recommended. The higher average and lower skew of the distribution after the addition of key signature and beat similarity indicate that a broader set of sounds from the EarSketch library is likely to be recommended. In the period prior to the update, a given sound was recommended an average of 1790 times (1.72% of sessions). Comparatively, following the update a given sound was recommended an average of 3240 times (2.42% of sessions). Using a two-sample t-test, we note that the difference in recommendation frequency across the entire library was statistically significant (p < 0.05).

When comparing the usage of recommendations across both periods, we measured the frequency distribution of recommendationUsed events, i.e., the unique instances of previously recommended sounds being used in projects (see Fig 4). A two-sample t-test shows a statistically significant increase in recommendation usage frequency following the update (p < 0.05). On average, a recommended sound was used in 0.94% of the instances in which it was recommended following the update, compared to 0.73% prior to the update.
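As a rough sketch of this aggregate analysis (the event-log schema and column names are hypothetical, not the analytics engine's actual format):

```python
# Sketch: per-sound usage densities and the two-sample t-test described
# above. The DataFrame columns ("sound", "event") are hypothetical.
import pandas as pd
from scipy.stats import ttest_ind

def per_sound_counts(events: pd.DataFrame, event_type: str) -> pd.Series:
    return events[events["event"] == event_type].groupby("sound").size()

def usage_density(events: pd.DataFrame) -> pd.Series:
    made = per_sound_counts(events, "recommendation")
    used = per_sound_counts(events, "recommendationUsed")
    return used.reindex(made.index, fill_value=0) / made

# before_events and after_events would hold the two periods' logs:
# t, p = ttest_ind(per_sound_counts(before_events, "recommendation"),
#                  per_sound_counts(after_events, "recommendation"))
```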
We also observed that the inclusion of rhythmic features coincides with a noticeable uptick in the usage of percussive loops, with a larger proportion of used recommended sounds being percussive. We investigated the top 10 sounds recommended and used during both periods and found that 6 of the 10 most frequently used recommendations after the update are categorized as purely percussive sounds. In the period prior to the update, there was no majority among the instruments in the most used recommendations.

4. Discussion and Future Work

We implemented key signature and beat extraction in the EarSketch sound recommendation system to improve the diversity and coverage of the sounds being recommended to users and to make more musically relevant suggestions for a student's project. We analyzed the two periods of data above to identify trends in usage before and after the addition of these two musical features.

In our results, we were able to demonstrate that the inclusion of these features improves the diversity and coverage of recommended sounds. By comparing the distributions of unique recommendations per sound before and after the change, we found that recommendations were more evenly distributed across the sound library after the change. This may be because the algorithm picks up on more sounds that are pertinent to a given user's project. Additionally, there was a statistically significant increase in how often students elected to use a recommendation. This could be attributed to the prominence of beat similarity in the recommendation algorithm, which provides sounds that stylistically match a user's current sounds and as such presents more viable options to try in a given project.

We noticed a shift in the types of recommended sounds that are most frequently used across the two periods. Following the introduction of our updated algorithm, we found that a majority of the most used recommendations were percussive or primarily rhythmic. We believe that this is an artifact of how the key signatures and rhythmic similarities of sounds are weighted in the recommendation process. We speculate that students are largely seeking rhythmic sounds at the beginning stages of the song-creating process. Given that the weighting for pitched sounds necessitates the existence of a key signature, the recommendation algorithm skews heavily towards rhythmic sounds at the start of a new project. Additionally, users with developed projects may prefer recommendations that do not clash with their current selections, such as percussion samples without a key signature. In order to understand this behavior better, we need a more in-depth user study of how recommendation behavior influences the song creation process for students. By rating users' understanding of and satisfaction with recommendations, with and without the musical features, in a controlled setting, we can determine how effective these features are and what visual design changes are necessary to enhance the effectiveness of musically informed recommendations.

In conclusion, this analysis of the impact that salient musical features have on EarSketch users reveals multiple insights for the design of recommendation systems and other creative systems. The use of recommendation density to compare groups shows how artifact analysis, even in its simplest form, can represent trends in user interaction with a creative musical assistant. The significant change in the density of unique sound recommendations shows the effectiveness of multimodal domain knowledge in recommendation generation. As the EarSketch recommendation system either minimizes or maximizes co-usage scores as well as acoustic similarity [3], the addition of features to multiple types of recommendations shows the importance of understanding task specifications when designing recommendations for a creative system.

5. Acknowledgments

This material is based upon work supported by National Science Foundation Award No. 1814083. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. EarSketch is available online at https://earsketch.gatech.edu.

References

 [1] B. Magerko, J. Freeman, T. McKlin, M. Reilly, E. Livingston, S. McCoid, A. Crews-Brown, EarSketch: A STEAM-based approach for underrepresented populations in high school computer science education, ACM Transactions on Computing Education 16 (2016) 1–25. doi:10.1145/2886418.
 [2] T. McKlin, B. Magerko, T. Lee, D. Wanzer, D. Edwards, J. Freeman, Authenticity and personal creativity: How EarSketch affects student persistence, in: Proceedings of the 49th ACM Technical Symposium on Computer Science Education, 2018, pp. 987–992. doi:10.1145/3159450.3159523.
 [3] J. Smith, D. Weeks, M. Jacob, J. Freeman, B. Magerko, Towards a Hybrid Recommendation System for a Sound Library, in: Joint Proceedings of the ACM IUI 2019 Workshops, CEUR-WS, 2019.
 [4] C. C. Aggarwal, Recommender Systems, Springer International Publishing, 2016. URL: http://link.springer.com/10.1007/978-3-319-29659-3. doi:10.1007/978-3-319-29659-3.
 [5] J. B. Schafer, D. Frankowski, J. Herlocker, S. Sen, Collaborative Filtering Recommender Systems, in: P. Brusilovsky, A. Kobsa, W. Nejdl (Eds.), The Adaptive Web: Methods and Strategies of Web Personalization, Lecture Notes in Computer Science, Springer, 2007, pp. 291–324. URL: https://doi.org/10.1007/978-3-540-72079-9_9. doi:10.1007/978-3-540-72079-9_9.
 [6] J. Smith, M. Jacob, J. Freeman, B. Magerko, T. McKlin, Combining collaborative and content filtering in a recommendation system for a web-based DAW, in: A. Xambó, S. R. Martín, G. Roma (Eds.), Proceedings of the International Web Audio Conference, WAC '19, NTNU, Trondheim, Norway, 2019, pp. 53–58.
 [7] M.-K. Shan, F.-F. Kuo, M.-F. Chiang, S.-Y. Lee, Emotion-based music recommendation by affinity discovery from film music, Expert Systems with Applications 36 (2009) 7666–7674. doi:10.1016/j.eswa.2008.09.042.
 [8] X. Wang, Y. Wang, Improving content-based and hybrid music recommendation using deep learning, in: Proceedings of the 22nd ACM International Conference on Multimedia, 2014, pp. 627–636. doi:10.1145/2647868.2654940.
 [9] D. Bogdanov, N. Wack, E. Gómez, S. Gulati, P. Herrera, O. Mayor, G. Roma, J. Salamon, J. Zapata, X. Serra, ESSENTIA: An open-source library for sound and music analysis, in: Proceedings of the 21st ACM International Conference on Multimedia, MM '13, Association for Computing Machinery, 2013, pp. 855–858. URL: https://doi.org/10.1145/2502081.2502229. doi:10.1145/2502081.2502229.
[10] Á. Faraldo, E. Gómez, S. Jordà, P. Herrera, Key Estimation in Electronic Dance Music, in: N. Ferro, F. Crestani, M.-F. Moens, J. Mothe, F. Silvestri, G. M. Di Nunzio, C. Hauff, G. Silvello (Eds.), Advances in Information Retrieval, volume 9626 of Lecture Notes in Computer Science, Springer International Publishing, 2016, pp. 335–347. URL: http://link.springer.com/10.1007/978-3-319-30671-1_25. doi:10.1007/978-3-319-30671-1_25.
[11] Á. Faraldo, S. Jordà, P. Herrera, A Multi-Profile Method for Key Estimation in EDM, in: Proceedings of the Conference on Semantic Audio, 2017, p. 7. URL: https://doi.org/10.5281/zenodo.3855499. doi:10.5281/zenodo.3855499.
[12] B. McFee, A. Metsai, M. McVicar, S. Balke, C. Thomé, C. Raffel, F. Zalkow, A. Malek, Dana, K. Lee, O. Nieto, D. Ellis, J. Mason, E. Battenberg, S. Seyfarth, R. Yamamoto, viktorandreevichmorozov, K. Choi, J. Moore, R. Bittner, S. Hidaka, Z. Wei, nullmightybofo, A. Weiss, D. Hereñú, F.-R. Stöter, L. Nickel, P. Friesch, M. Vollrath, T. Kim, librosa/librosa: 0.9.2, 2022. URL: https://doi.org/10.5281/zenodo.6759664. doi:10.5281/zenodo.6759664.
[13] D. P. W. Ellis, Beat tracking by dynamic programming, Journal of New Music Research 36 (2007) 51–60. URL: https://doi.org/10.1080/09298210701653344. doi:10.1080/09298210701653344.
[14] G. T. Toussaint, A comparison of rhythmic similarity measures, in: Proceedings of the 5th International Conference on Music Information Retrieval, ISMIR, Barcelona, Spain, 2004. URL: https://doi.org/10.5281/zenodo.1416812. doi:10.5281/zenodo.1416812.