<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EmoMusic: A New Fun and Interactive Way to Listen to Music</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aman Shukla</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gus Xia</string-name>
          <email>gxia@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Sydney, Australia</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University Shanghai</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern music platforms like Spotify allow users to interact with music through different interactive tools, from creating a playlist to liking or skipping a song. Some video-sharing interfaces, such as Niconico and Bilibili, let users view and add overlaid commentary on videos in a synchronized fashion, creating a sense of shared watching experience. However, the integration of emoticons with real-time music has been rare. In this work, we propose additional channels for interacting with music through emoticons. Emoticons have been widely accepted as a medium of communication and expression, especially in text: they convey more information than the matching text while taking less space. We aim to integrate emoticons into an interactive music-listening interface. We believe that an emoticon representation of music allows for a finer granularity in representing emotions and provides users with additional options to interact with music. We propose to build an interface which presents an emoticon representation of music alongside basic music player functionalities.</p>
      </abstract>
      <kwd-group>
        <kwd>Music learning interface</kwd>
        <kwd>user interface</kwd>
        <kwd>music emotion retrieval</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent years have seen a wide application of machine learning to sentiment analysis [1, 2, 3] and music interaction [4, 5]. Despite this progress, we are yet to see an interface that combines music and emoticons. Emoticons have played a major part in the sentiment analysis domain, especially in understanding emotions from text or tweets. Emoticons have also been widely integrated into text messages and lend meaningful value in determining context in text and natural language processing applications [6]. In our work, we build an emoji-informed interface through which the emotion of the music is displayed in real time, and users can also input their own emoticons associated with a music piece. First, we develop a back-end machine learning system that decodes emoticons from audio by using lyrics as a proxy. Second, we design an interface which simultaneously displays audio properties with their corresponding emoticons. This interface is built on top of the back-end ML system, as it uses the output emoticons from the system to display with the music. Finally, we extend the interface to enable users to interact with the music they are listening to via emoticons in real time. This additional feedback from user interaction is then used to retrain our model and improve the performance of our machine learning system.</p>
      <p>Our design is different from SmartVideoRanking [7] and MusicCommentator [8], both of which estimate emotions from time-synchronized comments posted by users rather than from the metadata of the audio itself.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Methodology</title>
      <p>Our system is designed for interactive demonstration of musical emotions through machine learning and emoticons, and it contains three parts. First, we prepare a new dataset with matching emoticon labels for each song by leveraging the underlying time-annotated lyric representation. Second, we train a machine learning model via supervised learning, representing audio signals as spectrograms and using the generated emoticon symbols as target labels. Finally, we integrate this system into our interactive display, which uses the music-emoticon pair as a starting point and enables users to interact with music by selecting their preferred emoticons. We dive into the details of each of these parts below and present an outline of the system in Figure 1.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1. Dataset Creation</title>
      <p>
        We have used the DALI dataset [<xref ref-type="bibr" rid="ref9">9</xref>], a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics, along with other metadata. Although the dataset contains rich information, for our setup we have only considered English songs and paragraph-level lyrical annotations. The decision to use paragraph-level annotations stemmed from our analysis, in which we found that this level of context was necessary to derive a sentiment from the audio. The analysis rests on the assumption that music segments and their annotated lyrics share the same emotional content, which can be effectively represented by emoticons. The lyrics are passed to the fine-tuned DeepMoji [<xref ref-type="bibr" rid="ref10">10</xref>] model to extract emoticon labels for the piece. We fine-tune the output classes by eliminating music-based symbols, as they do not represent any emotion. This subprocess is represented in Figure 1, titled Labeling. Finally, from this process we generate an audio-emoticon pairing which serves as the basis for our supervised learning algorithm.
      </p>
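      <p>As a rough illustration of this labeling step (not the actual implementation), the sketch below pairs each paragraph-level lyric segment with the most probable non-music emoticon returned by an emoji classifier. The function predict_emoji_scores is a hypothetical stand-in for the fine-tuned DeepMoji model, and MUSIC_EMOJIS is an illustrative list of the music-based symbols we discard; neither name comes from the paper or the DeepMoji code base.</p>
      <preformat>
# Sketch of the labeling subprocess (Figure 1, "Labeling").
MUSIC_EMOJIS = {"musical_note", "notes", "musical_score"}  # illustrative

def label_paragraphs(paragraphs, predict_emoji_scores):
    """paragraphs: list of dicts with 'start', 'end' (seconds) and 'lyrics'.
    Returns (start, end, emoji) triples used as supervision targets."""
    labeled = []
    for para in paragraphs:
        scores = predict_emoji_scores(para["lyrics"])  # {emoji_name: prob}
        # Keep the most probable emoji that is not a music symbol.
        for emoji, _ in sorted(scores.items(), key=lambda kv: -kv[1]):
            if emoji not in MUSIC_EMOJIS:
                labeled.append((para["start"], para["end"], emoji))
                break
    return labeled
      </preformat>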
    </sec>
    <sec id="sec-3">
      <title>2.2. Musical Machine Learning Model</title>
      <p>We first transform audio signals into spectrograms via the short-time Fourier transform. It has been shown that spectrograms offer a rich representation of audio [11]. We then use the spectrogram-emoticon pairs to train our model. Since we represent audio signals as spectrograms, i.e., as images, we can leverage transfer learning from image models to extract information.</p>
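      <p>A minimal sketch of this preprocessing step, assuming librosa for the short-time Fourier transform; the paper does not name a specific implementation, and the window and hop sizes below are illustrative:</p>
      <preformat>
import librosa
import numpy as np

def audio_to_spectrogram(path, sr=22050, n_fft=2048, hop_length=512):
    """Load an audio segment and return a log-magnitude spectrogram.
    Parameter values are illustrative, not the paper's settings."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(stft)                              # magnitude spectrogram
    return librosa.amplitude_to_db(magnitude, ref=np.max) # log scale for CNN input
      </preformat>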
      <sec id="sec-3-1">
        <title>2.2.1. Transfer Learning</title>
        <p>Transfer learning in audio has mainly focused on pretraining a model on a large corpus of audio data. We follow an approach similar to [12], where we leverage transfer learning but shift the focus from audio datasets to image datasets. We train DenseNet [13] and ResNet [14], convolutional neural network (CNN) architectures pretrained on the ImageNet [15] dataset.</p>
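        <p>A minimal sketch of this setup, assuming PyTorch and torchvision; the framework and the exact DenseNet/ResNet variants are not stated in the paper, so DenseNet-121 and ResNet-50 serve only as examples:</p>
        <preformat>
import torch
from torchvision import models

# ImageNet-pretrained backbones; spectrograms are treated as images
# (replicated to 3 channels) so the pretrained filters can be reused.
densenet = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

spec = torch.rand(4, 1, 224, 224)    # dummy batch of spectrogram "images"
spec3 = spec.repeat(1, 3, 1, 1)      # 1 channel replicated to 3 channels
features = densenet.features(spec3)  # convolutional feature maps
        </preformat>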
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2. Fine-Tuning</title>
        <p>Both DenseNet and ResNet are fine-tuned to predict 62 emoticons. To accomplish this, we add a fully connected layer followed by a sigmoid layer to obtain class probabilities. This subprocess is represented in Figure 1, titled Training.</p>
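        <p>One possible realization of this prediction head, again assuming PyTorch; only the 62-way sigmoid output comes from the text, while the backbone choice and layer sizes are illustrative:</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOJIS = 62  # number of emoticon classes predicted by the model

class EmojiNet(nn.Module):
    """ImageNet-pretrained backbone with a fully connected + sigmoid head."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()          # drop the 1000-way ImageNet head
        self.backbone = backbone
        self.head = nn.Sequential(nn.Linear(in_features, NUM_EMOJIS),
                                  nn.Sigmoid())

    def forward(self, spec_images):
        return self.head(self.backbone(spec_images))  # per-class probabilities

model = EmojiNet()
probs = model(torch.rand(2, 3, 224, 224))  # dummy spectrogram batch, output (2, 62)
        </preformat>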
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Interface Design</title>
      <p>[Figure 2: Interface with real-time emoticon representation of audio alongside user-enabled reactions.]</p>
      <p>In our proposed integrated music player design, we aim to build a web-based music player that includes basic features like Play/Pause, Next/Previous, Playlist, etc. When a song is playing, we intend to display the corresponding audio waveform, the current emoticon representation (derived from the paragraph-based annotation), and a song-level emoticon representation. In addition, we enable the user to interact with the music by selecting emoticons. We propose to build a two-level interaction: with a song and with real-time audio playback. The emoticon icon opens a pop-up for users to input their selection of emoticons. The pop-up icon on the audio waveform section captures user interaction for real-time feedback, while the pop-up icon embedded in the player (horizontally next to the song title) provides song-level emoticon feedback. From these user interactions, we intend to improve the model's performance by re-training our model. A visualization of the music player is shown in Figure 2.</p>
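      <p>As a rough sketch of how these reactions could be logged for later re-training (the record fields and file format below are our assumptions, not part of the described interface):</p>
      <preformat>
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class EmojiReaction:
    """One user reaction; the fields are illustrative, not the paper's schema."""
    song_id: str
    emoji: str
    position_sec: Optional[float]  # None for song-level feedback
    timestamp: float

def log_reaction(song_id, emoji, position_sec=None, path="reactions.jsonl"):
    """Append a reaction; the log is later joined with spectrogram segments
    to form additional (spectrogram, emoticon) pairs for re-training."""
    reaction = EmojiReaction(song_id, emoji, position_sec, time.time())
    with open(path, "a") as f:
        f.write(json.dumps(asdict(reaction)) + "\n")

# Real-time (waveform pop-up) and song-level (player pop-up) feedback:
log_reaction("track_042", "heart_eyes", position_sec=83.5)
log_reaction("track_042", "smile")
      </preformat>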
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Meseguer-Brocal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohen-Hadria</surname>
          </string-name>
          , G. Peeters,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Dali: A large dataset of synchronized audio</article-title>
          , lyrics [1]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Cáceres</surname>
          </string-name>
          ,
          <article-title>Applying data min- and notes, automatically created using teacher-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Cyber-Physical</surname>
            <given-names>Multi-Agent</given-names>
          </string-name>
          <string-name>
            <surname>Systems</surname>
          </string-name>
          .
          <source>The PAAMS of the 19th International Society for Music Infor-</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Collection - 15th International</surname>
            <given-names>Conference</given-names>
          </string-name>
          , PAAMS mation Retrieval Conference, ISMIR, Paris, France
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          2017, Springer International Publishing, Cham, (
          <year>2018</year>
          )
          <fpage>431</fpage>
          -
          <lpage>437</lpage>
          . doi:
          <volume>10</volume>
          .5281/ZENODO.1492443.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <year>2018</year>
          , pp.
          <fpage>198</fpage>
          -
          <lpage>205</lpage>
          . [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Felbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mislove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Rahwan</surname>
          </string-name>
          , [2]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Biancofiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Di</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Di</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nar- S. Lehmann</surname>
          </string-name>
          ,
          <article-title>Using millions of emoji occurrences</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>ceedings of the 37th ACM/SIGAPP Symposium ceedings of the 2017 Conference on Empirical</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>on Applied Computing</source>
          , SAC '22,
          <article-title>Association for Methods in Natural Language Processing</article-title>
          , Asso-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Computing</given-names>
            <surname>Machinery</surname>
          </string-name>
          , New York, NY, USA,
          <year>2022</year>
          , ciation for Computational Linguistics,
          <year>2017</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          p.
          <fpage>696</fpage>
          -
          <lpage>703</lpage>
          . URL: https://doi.org/10.1145/3477314. https://doi.org/10.18653%2Fv1%
          <fpage>2Fd17</fpage>
          -
          <lpage>1169</lpage>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          3507092. doi:
          <volume>10</volume>
          .1145/3477314.3507092. 18653/v1/d17-
          <fpage>1169</fpage>
          . [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tarufi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Downing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heaton</surname>
          </string-name>
          , Indi- [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wyse</surname>
          </string-name>
          ,
          <article-title>Audio spectrogram representations for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>The Influence of Externally Oriented Thinking</article-title>
          ,
          <source>CoRR abs/1706</source>
          .09559 (
          <year>2017</year>
          ). URL: http://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>Music Perception</source>
          <volume>34</volume>
          (
          <year>2017</year>
          )
          <fpage>253</fpage>
          -
          <lpage>266</lpage>
          . URL: https:// abs/1706.09559. arXiv:
          <volume>1706</volume>
          .
          <fpage>09559</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          doi.org/10.1525/mp.
          <year>2017</year>
          .
          <volume>34</volume>
          .3.253. doi:
          <volume>10</volume>
          .1525/mp. [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Palanisamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Singhania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          , Rethink-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <year>2017</year>
          .
          <volume>34</volume>
          .3.253.
          <article-title>ing CNN models for audio classification</article-title>
          ,
          <source>CoRR</source>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weeks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magerko</surname>
          </string-name>
          , abs/
          <year>2007</year>
          .11154 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>Towards a hybrid recommendation system for a 2007.11154</article-title>
          . arXiv:
          <year>2007</year>
          .11154.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>sound library</article-title>
          , in: C.
          <string-name>
            <surname>Trattner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Parra</surname>
            , N. Riche [13]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K. Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , Densely
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          (Eds.),
          <source>Joint Proceedings of the ACM IUI</source>
          <year>2019</year>
          <article-title>Work- connected convolutional networks</article-title>
          ,
          <source>CoRR</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>shops co-located with the 24th</article-title>
          <source>ACM Conference on abs/1608</source>
          .06993 (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Intelligent</given-names>
            <surname>User</surname>
          </string-name>
          <article-title>Interfaces (ACM IUI</article-title>
          <year>2019</year>
          ), Los An-
          <volume>1608</volume>
          .06993. arXiv:
          <volume>1608</volume>
          .
          <fpage>06993</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>geles</surname>
          </string-name>
          , USA, March
          <volume>20</volume>
          ,
          <year>2019</year>
          , volume
          <volume>2327</volume>
          <source>of CEUR</source>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep resid-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Workshop</given-names>
            <surname>Proceedings</surname>
          </string-name>
          , CEUR-WS.org,
          <year>2019</year>
          . URL:
          <article-title>ual learning for image recognition</article-title>
          ,
          <source>CoRR</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2327</volume>
          /
          <fpage>IUI19WS</fpage>
          -MILC-
          <article-title>5</article-title>
          .pdf.
          <source>abs/1512</source>
          .03385 (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/ [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Thio</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , A mini-
          <volume>1512</volume>
          .03385. arXiv:
          <volume>1512</volume>
          .
          <fpage>03385</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>mal template for interactive web-based demon-</article-title>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          abs/
          <year>1902</year>
          .03722 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/ database, in: 2009 IEEE Conference on Computer
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <year>1902</year>
          .03722. arXiv:
          <year>1902</year>
          .03722.
          <article-title>Vision and Pattern Recognition</article-title>
          , IEEE,
          <year>2009</year>
          , pp. [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thebault-Spieker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          , I. John- 248-
          <fpage>255</fpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2009</year>
          .
          <volume>5206848</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <source>ence on Web and Social Media</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>259</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>view/14757</source>
          . doi:
          <volume>10</volume>
          .1609/icwsm.v10i1.
          <fpage>14757</fpage>
          . [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tsukuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Masahiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goto</surname>
          </string-name>
          , Smartvideo-
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>time-synchronized comments</article-title>
          ,
          <source>in: 2016 IEEE 16th</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>shops (ICDMW)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>960</fpage>
          -
          <lpage>969</lpage>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>ICDMW.</surname>
          </string-name>
          <year>2016</year>
          .
          <volume>0140</volume>
          . [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yoshii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goto</surname>
          </string-name>
          , Musiccommentator: Generating
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>tainment Computing - ICEC 2009</source>
          , Springer Berlin
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Heidelberg</surname>
          </string-name>
          , Berlin, Heidelberg,
          <year>2009</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>