=Paper= {{Paper |id=Vol-2327/MILC5 |storemode=property |title=Towards a Hybrid Recommendation System for a Sound Library |pdfUrl=https://ceur-ws.org/Vol-2327/IUI19WS-MILC-5.pdf |volume=Vol-2327 |authors=Jason Smith,Dillon Weeks,Mikhail Jacob,Jason Freeman,Brian Magerko |dblpUrl=https://dblp.org/rec/conf/iui/0005WJFM19 }} ==Towards a Hybrid Recommendation System for a Sound Library== https://ceur-ws.org/Vol-2327/IUI19WS-MILC-5.pdf
Towards a Hybrid Recommendation System for a Sound Library

Jason Smith (jsmith775@gatech.edu), Center for Music Technology, Georgia Institute of Technology, Atlanta, GA
Dillon Weeks (dweeks7@gatech.edu), School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA
Mikhail Jacob (mikhail.jacob@gatech.edu), School of Interactive Computing, Georgia Institute of Technology, Atlanta, GA
Jason Freeman (jason.freeman@gatech.edu), Center for Music Technology, Georgia Institute of Technology, Atlanta, GA
Brian Magerko (magerko@gatech.edu), School of Literature, Media, and Communication, Georgia Institute of Technology, Atlanta, GA
ABSTRACT
Recommendation systems are widespread in music distribution and discovery services but far less common in music production software such as EarSketch, an online learning environment that engages learners in writing code to create music. The EarSketch interface contains a sound library that learners can access through a browser pane. The current implementation of the sound browser includes basic search and filtering functionality but no mechanism for sound discovery, such as a recommendation system. As a result, users have historically selected a small subset of sounds with high frequency, leading to lower compositional diversity. In this paper, we propose a recommendation system for the EarSketch sound browser which uses collaborative filtering and audio features to suggest sounds.

CCS CONCEPTS
• Human-centered computing → User interface design; • Applied computing → Sound and music computing.

KEYWORDS
recommendation systems, interface design, music

ACM Reference Format:
Jason Smith, Dillon Weeks, Mikhail Jacob, Jason Freeman, and Brian Magerko. 2019. Towards a Hybrid Recommendation System for a Sound Library. In Joint Proceedings of the ACM IUI 2019 Workshops, Los Angeles, USA, March 20, 2019, 6 pages.

IUI Workshops’19, March 20, 2019, Los Angeles, USA
Copyright ©2019 for the individual papers by the papers’ authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.

1 INTRODUCTION
EarSketch [7] is an online environment for learning computer programming and audio loop-based music composition. Students write JavaScript or Python scripts to algorithmically generate musical compositions. The user interface borrows design cues from both integrated development environments (IDEs) and digital audio workstation (DAW) software, combining a code editor and console with a multi-track audio timeline and sound browser. EarSketch has primarily been used in high school and college computer science classrooms, with over 300,000 users to date [5].

In previous research in EarSketch classrooms, significant relationships have been found between student perceptions of authenticity – including their desire to share personally expressive work with others – and student attitudes towards computing [9]. Exploration of a larger number of musical ideas – including the sounds that form the building blocks of student compositions in EarSketch – may magnify a student's capacity to create personally expressive compositions.

EarSketch contains a library of over 3,500 sounds for students to use in their compositions. The sounds were created by musicians Richard Devine and Young Guru specifically for EarSketch and consist of multi-measure audio loops that are separated by instrument and span over 20 popular musical genres. However, a statistical analysis of scripts written by users showed that the vast majority of user projects used only a small subset of the sound library. Feedback from EarSketch users (see Section 2.2) showed that their lack of exploration was primarily the result of the difficulty of finding sounds that appealed to them. We propose, therefore, that providing users with an easier mechanism for exploring the sound library will enable them to find and use audio loops that spur further musical creativity and personal expression, while ultimately furthering their learning about music and coding through EarSketch.

After conducting user studies, we have explored the addition of a recommendation (or recommender) system as a method of encouraging users to explore more of the EarSketch sound library in their scripts. Recommendation systems are widespread in music distribution and discovery platforms (where they operate at the song level) but far less common in music production workflows (where they could operate at the sound clip level). Recommendation




                                             Figure 1: View of EarSketch browser interface.


Figure 2: The EarSketch sounds used the highest number of times in 20,000 user scripts (highest 1,000 shown for legibility), showing under-utilization of the majority of the library.

systems suggest content to users that is most likely to appeal to them based on profiles of their preferences, as well as content that they would most likely find novel, diverse, and unexpectedly useful (serendipitous) [1]. EarSketch could use such a recommendation system to automatically search its sound library for relevant sounds that encourage the user to explore novel, diverse, and serendipitous regions of the library.

Recommendation generation techniques include collaborative filtering, content-based filtering, and hybrid techniques. Collaborative filtering [1] involves comparing the current user to previous users in order to generate recommendations from what similar users selected in the past (for example, [13]). Content-based filtering compares inherent properties of the content itself to recommend items, for example using audio feature-based deep learning [3] or short-sample similarity metrics [14]. A hybrid approach combines both techniques to generate recommendations.

Some previous recommendation systems for sounds have employed the Freesound sample library [4]. These projects used feature similarity calculations without co-usage statistics [12] or used textual metadata to augment recommendations [11]. The proposed system for EarSketch differs from these examples by combining only audio similarity and co-usage to generate recommendations, reserving genre labels for manual user filtering.

In this article, we present our initial research on a recommendation system for discovering new sounds for use in EarSketch. The main contributions discussed are:

• An initial user-centered design process for systematically understanding how best to add an audio loop recommendation system to the EarSketch environment, including the ways users currently use the sound browser, the challenges to using it successfully, the kinds of recommendations users desire, and the best way to present users with recommendations.
• The initial application of a hybrid (collaborative and content-based filtering) recommendation system for sounds in a digital audio workstation, in contrast to song-level recommendation systems. This is a first step towards improving user exploration of the EarSketch sound library according to the user requirements and design principles arising from the initial user-centered design process.


• A proposed methodology for evaluating both the success of the recommendation system in providing users with relevant, novel, diverse, and serendipitous recommendations [1] and the relative importance of the different factors used to generate recommendations, as well as the usability of the sound browser with the recommendation system added.

The remainder of the paper describes the details of the user-centered design process for adding a recommendation system to the EarSketch sound browser and the initial prototype of the hybrid recommendation system resulting from that design process. The paper concludes by discussing the planned evaluation methodology, limitations of the current prototype, and future work.

2 USER RESEARCH AND INTERFACE DESIGN
An initial user study was conducted in order to gain a systematic understanding of how best to add a recommendation system to the EarSketch sound browser. This included understanding the different ways that users used the sound browser, the challenges they faced in using it successfully, the kinds of recommendations users desired, and the best ways to present recommendations to users. The study resulted in a set of requirements for the recommendation system and a redesign of the sound browser interface integrating the generated recommendations.

Figure 3: Original sound browser design prior to research activities (left) and sound browser design after research activities (right). The redesign includes like/dislike functionality, collapsible sound folders, new recommended sound folders with gold text to distinguish them as recommendations, and the addition of Key and BPM filters.

2.1 Initial Design
The sound browser experience prior to the addition of a recommendation system included sound folders that consisted of a title and a list of sounds corresponding to that title. For example, the sound folder titled "DUBSTEP 140 BPM DUBBASS WOBBLE" included a list of "DUBSTEP BASS WOBBLE" sounds underneath it, followed by other sound folders and their associated sounds. This list was navigated via scrolling, and sounds were distributed across multiple pages within the browser. The user could also favorite and preview sounds from within the browser, discover sounds in the library via text search from the search bar, and filter sounds by artist, instrument, and genre.

2.2 Interviews and Survey of EarSketch Students
Four qualitative interviews were conducted with undergraduate students in an introductory programming course at a four-year college to explore current EarSketch users' challenges, behaviors, and interactions with the sound browser. This was done to identify the best opportunities for the recommendation system to fit their needs. The interviews were used to gather qualitative data such as reported behaviors, the motivations behind those behaviors, and opportunities for future designs and recommendation integration. A quantitative survey was sent to the same undergraduate class and received 55 responses. The survey was used to determine the prevalence of the identified behaviors and preferences.

Participants reported being more inclined to use the Instrument and Genre filters than the Artist filter. In addition, users expressed a desire for Key and Beats-Per-Minute (BPM) filters. This suggested the need to prioritize recommendations based on instruments, genres, keys, and BPM in the future.

Users reported that it was hard to discover groups of sounds they considered to be good recommendations. They considered strong recommendations to be sounds that they liked that also fit in their script (relevant) and that they had not heard before (novel) or were not expecting (serendipitous). Discovering sounds similar to previously used sounds was of lesser importance to them. This confirmed that those users desired recommendations in accordance with the recommendation system goals defined by [1].

3 HYBRID RECOMMENDATION SYSTEM
A set of design principles arose as a result of these user studies. Recommendations were to be relevant, novel, diverse, and serendipitous. Additionally, users were interested in receiving recommendations in the interface separated into different categories (e.g. "Sounds That Fit Your Tastes" and "Discover Different Kinds of Sounds") and in receiving recommendations matching semantic features of the sounds in their work-in-progress compositions (e.g. instrument, genre, key, and BPM).

The initial recommendation system we have developed does not yet support the entire set of user requirements illuminated by the user studies. It does combine collaborative filtering (using a statistical analysis of sound usage in past user scripts) and content-based filtering (using extracted audio features) to increase the relevance and novelty of the generated recommendations.
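To make the collaborative-filtering half of this combination concrete, the item-based co-usage statistic described in Section 3.1 can be sketched as follows. This is a minimal illustration rather than the EarSketch implementation; the function names and toy sound names are our own.

```python
from collections import Counter
from itertools import combinations

def co_usage_counts(scripts):
    """Count how often each pair of sounds appears together across scripts.

    `scripts` is an iterable of sets of sound names, one set per user script
    (in EarSketch these would be parsed from the 20,000-script sample).
    """
    counts = Counter()
    for sounds in scripts:
        for pair in combinations(sorted(sounds), 2):
            counts[pair] += 1
    return counts

def co_usage_list(counts, inputs, n):
    """Return up to n sounds most frequently co-used with the input sounds,
    excluding the inputs themselves so they do not simply recommend each other."""
    totals = Counter()
    for (a, b), c in counts.items():
        if a in inputs and b not in inputs:
            totals[b] += c
        elif b in inputs and a not in inputs:
            totals[a] += c
    return [sound for sound, _ in totals.most_common(n)]

# Toy example with hypothetical sound names:
scripts = [
    {"HIPHOP_BASS_1", "HIPHOP_DRUM_1"},
    {"HIPHOP_BASS_1", "HIPHOP_DRUM_1", "DUBSTEP_WOBBLE_1"},
    {"HIPHOP_BASS_1", "ROCK_GUITAR_1"},
]
counts = co_usage_counts(scripts)
recs = co_usage_list(counts, {"HIPHOP_BASS_1"}, 2)  # HIPHOP_DRUM_1 ranks first
```

Because only pairwise co-usage between sounds is stored, no per-user profile or demographic data is needed, matching the privacy constraint described in Section 3.1.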
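For the content-based half (Section 3.2), the paper measures euclidean distance between STFT and MFCC feature vectors extracted with librosa; the extraction step is omitted here. A numpy sketch of the distance-and-normalization step on pre-extracted feature vectors, with hypothetical data, might look like:

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))

def normalized_distances(query, library):
    """Distances from one sound's feature vector to each library vector,
    scaled into [0, 1] by the maximum distance. The same normalization is
    assumed for both the STFT and MFCC distance terms of the scoring step."""
    d = np.array([euclidean(query, v) for v in library])
    peak = d.max()
    return d / peak if peak > 0 else d

# Hypothetical 2-D feature vectors for three library sounds:
library = [[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]]
norms = normalized_distances([0.0, 0.0], library)  # distances 5, 10, 0 scale to 0.5, 1.0, 0.0
```

In the full pipeline the vectors would come from the first 2 seconds of each loop, and the resulting normalized distances feed directly into the inverse-distance terms of Equation (1).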


    Recommendations are generated as follows:
    (1) The algorithm takes one or more sounds as its input. This input is the set of sounds that are already part of a user's work-in-progress script/composition.
    (2) The algorithm then generates a first list of sounds from the EarSketch sound library (the co-usage list) that have commonly been used in the past with the input sounds in scripts by any user.
    (3) The algorithm then uses audio features of the sounds in the co-usage list to create a second list containing other sounds in the sound library that are acoustically similar to the sounds in the co-usage list (the similarity list).
    (4) The algorithm removes sounds from the similarity list that have been commonly used with sounds in the co-usage list.
    (5) Finally, the algorithm chooses sounds from the similarity list to present to the user as recommendations.

The co-usage list is an example of collaborative filtering (see Section 3.1) and adds relevance to the generated recommendations by ensuring that recommendations are compatible with the set of sounds in the user's work-in-progress script/composition. The use of the similarity list (rather than just the co-usage list) is an example of content-based filtering (see Section 3.2). The removal of sounds from the similarity list that are commonly used with the co-usage list adds novelty to the recommendations. The approach described here attempts to address the diversity and serendipity of the generated recommendations, but explicit measures to ensure and evaluate these qualities are planned for future work.

3.1 Collaborative Filtering
The input to the collaborative filtering is the collection of sounds already being used in an active script at the time of recommendation generation. We take an item-based approach involving only an analysis of previous co-usage between sounds [13]. We take this approach to minimize the collection of user information, such as user demographics and profile usage history, protecting EarSketch's primarily school-aged user base and conforming with its privacy policy [5]. The system returns a co-usage list of sounds in order of co-usage frequency, calculated using a sample set of 20,000 user scripts. Any sounds that are also in the input list are excluded to ensure that commonly co-used input sounds do not simply recommend each other.

3.2 Content-based Filtering
We compare two audio features to find sounds acoustically similar to the items in the co-usage list. These recommendations are the final output of the system. Recommended sounds are chosen based on their similarity to the most commonly co-used sounds, comparing two properties of the audio signal: Short-Time Fourier Transform features and Mel-Frequency Cepstral Coefficients. The sounds are compared using the euclidean distance between their feature vectors, taken from the first 2 seconds of 48000 Hz sample rate audio with a 1024-point Hann window and normalized for tempo.

Short-Time Fourier Transform Features: D_STFT is the euclidean distance between the spectral densities of two sounds, calculated using the librosa STFT function [8]. This feature allows us to evaluate time-based similarities between sounds and recommend sounds with a similar function in a rhythmic context.

Mel-Frequency Cepstral Coefficients: D_MFCC is the euclidean distance between the short-term power spectra of two sounds, using the librosa MFCC function [8][10]. This feature compares sounds in terms of temporally-independent energy and acts as a proxy for genre or instrument groupings.

Both features were chosen due to their common usage in music information retrieval [6].

3.3 Recommendation Algorithms
This design aims to generate recommendations of sounds that are serendipitous to the user by not having high co-usage, and relevant through acoustic similarity to sounds that do. Diversity in recommendations is made possible by including a high number of co-used sounds of a variety of styles. The multiple stages of randomness in both models, while not guaranteeing novelty, allow different recommendations to be generated for the same combinations of inputs.

N represents an arbitrary factor limiting the number of results gathered at different steps in the algorithms, and will be empirically determined during evaluation. The value of each variable labeled N in the sections below can be manipulated separately. This includes the lengths of the final recommendation list, the co-usage list, and the similarity list.

The initial prototype of the recommender system is designed for use in standalone offline applications in addition to integration with the main EarSketch browser. Two recommendation algorithms were developed: one for live, real-time recommendation calculations and the other for faster server-side calculations. The first model, the dynamic model, conducts all calculations offline using pre-computed audio features to generate a list of recommendations for any combination of sounds. The static model, intended for online use, combines pre-computed lists of recommendations for individual sounds to generate a single recommendation list.

3.3.1 Dynamic. The most commonly used sounds in conjunction with any of the input sounds parsed from a user script are found collectively using the collaborative filtering paradigm of Section 3.1. Each commonly co-used sound is then compared to all other sounds in the EarSketch library, and a recommendation score for each is generated by the following equation:

S = D_STFT^(-1) + D_MFCC^(-1) + U    (1)

where D_STFT is the normalized STFT euclidean distance, D_MFCC is the normalized MFCC euclidean distance, and U is the normalized co-usage.

Additionally, the STFT and MFCC distances from the original input samples are added to or subtracted from the final recommendation score. This is done to generate recommendations that are either acoustically similar to or different from the sounds already found in the user script at the time of recommendation. The sounds with the highest N recommendation scores are stored and joined together in a single similarity list. A random selection of N recommendations is chosen


from the highest N normalized recommendation scores in the master list, with higher priority given to the highest-scoring recommendations through fitness proportionate selection [2].

3.3.2 Static. The static model differs from the dynamic model in that it uses a pre-computed list of similarity lists generated for each individual sound in EarSketch, in order to make the recommendation algorithm less computationally intensive for server-side deployment. The lists for any combination of input sounds are joined together into a master list, and any duplicate sounds have their recommendation scores added and balanced by a factor of the square root of the number of lists. This method of balancing assigns higher value to the strongest recommendations without drowning out the others, and is another scalable parameter that will be evaluated in future work. A random selection of N recommendations is chosen with higher priority given to the highest-scoring recommendations, as with the dynamic model.

Figure 4: Program flow of the Dynamic recommendation system model, following the analysis of input samples to generate co-usage, similarity, and final recommendation lists.

the calculations between audio features will be performed with operations and statistical measures other than euclidean distance, and will incorporate higher-level features such as rhythm. Similarly, for a threshold diversity value D, recommendations would be chosen by adding sounds to a candidate set such that each new addition is at least D distance from every other item already in the set. Serendipity will be explicitly optimized for by searching for recommendations that are relevant but have low co-usage frequencies (indicating that they are rarely used together). Finally, each of the four recommendation generation goals will be weighted in order to tailor recommendations to different situations or different recommendation folders.

4.2 Proposed Evaluation
4.2.1 Recommendation System. Participants in a user study will empirically refine the various iterations of the recommendation system using different output-limiting values of N and different relative weightings of D_MFCC and D_STFT. Additionally, they will be asked to choose sounds from the recommendation system and rate them in terms of relevance, novelty, diversity, and serendipity [13] for a combination of input sounds. The sounds they choose will be represented by the recommendation scores generated by each system iteration, in order to evaluate the weightings independently. Qualitative questions will also reveal user opinions on other design aspects, such as how many recommendations users want to see at once.

4.2.2 Interface Redesign. The current redesign has not been properly tested in a real-world scenario, so potential usability issues may arise with the navigation, language, and recommendation types. We will conduct moderated usability testing and record users' sessions interacting with a high-fidelity prototype while a researcher prompts them with tasks to complete. This testing will provide more information regarding EarSketch users' perceptions of a 'good' recommendation and how users will actually utilize these recommendations. As we move toward understanding how to recommend sounds to our users and better facilitate the exploration and discovery of sounds within EarSketch, our near-term goal is to iterate and improve on the proposed EarSketch redesign.
                                                                         to accommodate recommendations.

4     FUTURE WORK
This algorithm is an exploratory stage of development and we plan
to expand it along with the interface design with respect to current
limitations information gained from user testing.

4.1    Recommendation System
The recommendation generation process will be modified to im-
prove how it explicitly addresses its goals of relevance, novelty,
diversity, and serendipity. Recommendation relevance will be im-
proved by adding semantic metadata tags to the sounds, like instru-
ment, genre, key, and BPM, and using those parameters (in addition
to co-usage statistics and feature similarity) to select sounds. Nov-
elty will be explicitly optimized for by measuring the distance
between sounds in the lists and ensuring that recommendations
are intentionally selected to be different from previously generated
recommendations by some threshold novelty value N. Additionally,
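To make the scoring pipeline concrete, the following Python sketch illustrates one way the steps described above could work: a weighted combination of the MFCC and STFT distances (the relative weighting that the proposed study would tune), per-list recommendation scores summed and balanced by the square root of the number of lists, and a weighted random draw of N sounds that favors the highest scores. The function names and the linear form of the distance combination are our illustrative assumptions, not the paper's implementation.

```python
import math
import random

def combined_distance(d_mfcc, d_stft, w_mfcc=0.5, w_stft=0.5):
    # Hypothetical linear weighting of the two similarity distances;
    # the relative weights are the parameters the user study would tune.
    return w_mfcc * d_mfcc + w_stft * d_stft

def combine_lists(score_lists):
    # Sum each sound's scores across the recommendation lists, then
    # balance by the square root of the number of lists so the strongest
    # recommendations dominate without drowning out the others.
    combined = {}
    for scores in score_lists:
        for sound, score in scores.items():
            combined[sound] = combined.get(sound, 0.0) + score
    norm = math.sqrt(len(score_lists))
    return {sound: total / norm for sound, total in combined.items()}

def pick_recommendations(combined, n, rng=random):
    # Draw n distinct sounds at random, weighted toward higher scores,
    # mirroring the prioritized random selection described in the text.
    sounds = list(combined)
    weights = [combined[s] for s in sounds]
    picked = []
    while sounds and len(picked) < n:
        choice = rng.choices(sounds, weights=weights, k=1)[0]
        i = sounds.index(choice)
        sounds.pop(i)
        weights.pop(i)
        picked.append(choice)
    return picked
```

Because the draw is random but score-weighted, repeated calls surface mostly strong recommendations while still letting lower-scored sounds appear occasionally, which is how the design aims to balance relevance against novelty and serendipity.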
IUI Workshops’19, March 20, 2019, Los Angeles, USA                                                                                              Smith and Weeks, et al.


REFERENCES
 [1] Charu C. Aggarwal et al. 2016. Recommender Systems. Springer.
 [2] Thomas Bäck. 1996. Evolutionary Algorithms in Theory and Practice: Evolution Strategies, Evolutionary Programming, Genetic Algorithms. Oxford University Press, Inc., New York, NY, USA.
 [3] S. Chang, A. Abdul, J. Chen, and H. Liao. 2018. A Personalized Music Recommendation System using Convolutional Neural Networks Approach. In IEEE International Conference on Applied System Invention (ICASI). IEEE, 47–49. https://doi.org/10.1109/ICASI.2018.8394293
 [4] Bram de Jong. 2005. Freesound. https://freesound.org
 [5] Jason Freeman and Brian Magerko. 2011. EarSketch. http://earsketch.gatech.edu/landing/
 [6] Alexander Lerch. 2012. An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics (1st ed.). Wiley-IEEE Press.
 [7] Brian Magerko, Jason Freeman, Tom McKlin, Mike Reilly, Elise Livingston, Scott McCoid, and Andrea Crews-Brown. 2016. EarSketch: A STEAM-based approach for underrepresented populations in high school computer science education. ACM Transactions on Computing Education (TOCE) 16, 4 (2016), 14.
 [8] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference. 18–25.
 [9] Tom McKlin, Brian Magerko, Taneisha Lee, Dana Wanzer, Doug Edwards, and Jason Freeman. 2018. Authenticity and Personal Creativity: How EarSketch Affects Student Persistence. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education. ACM, 987–992.
[10] Paul Mermelstein. 1976. Distance measures for speech recognition, psychological and instrumental. Pattern Recognition and Artificial Intelligence 116 (1976), 374–388.
[11] Sergio Oramas, V. C. Ostuni, T. Di Noia, Xavier Serra, and E. Di Sciascio. 2016. Sound and Music Recommendation with Knowledge Graphs. ACM Transactions on Intelligent Systems and Technology (TIST) 8 (2016), 1–21. https://doi.org/10.1145/2926718
[12] Gerard Roma and Xavier Serra. 2015. Music performance by discovering community loops. In Proceedings of the Web Audio Conference (WAC), Paris.
[13] E. Shakirova. 2017. Collaborative Filtering for Music Recommender System. In IEEE Conference of Russian Young Researchers in Electrical and Electronic Engineering (EIConRus). IEEE, St. Petersburg, Russia, 548–550. https://doi.org/10.1109/EIConRus.2017.7910613
[14] Kai Siedenburg and Daniel Müllensiefen. 2017. Modeling Timbre Similarity of Short Music Clips. Frontiers in Psychology 8 (April 2017), 36–44. https://doi.org/10.3389/fpsyg.2017.00639