EmoMusic: A New Fun and Interactive Way to Listen to Music

Aman Shukla (New York University, New York, United States; as14034@nyu.edu)
Gus Xia (New York University Shanghai, Shanghai, China; gxia@nyu.edu)

Abstract

Modern music platforms like Spotify let users interact with music through different tools, from creating playlists to liking or skipping a song. A prominent feature of such platforms is allowing users to react to music via likes and skips. Some video-sharing interfaces such as Niconico and Bilibili let users view and add overlaid commentary on videos in a synchronized fashion, creating a sense of shared watching experience. However, the integration of emoticons with real-time music has been rare. In this work, we propose additional channels for interacting with music through emoticons. Emoticons have been widely accepted and integrated as a medium of communication and expression, especially in text. They often convey more information than the matching text while remaining compact. We aim to integrate emoticons into an interactive music-listening interface. We believe that an emoticon representation of music allows for finer granularity in representing emotions and provides users with additional options to interact with music. We propose to build an interface that presents an emoticon representation of music alongside basic music player functionality.

Keywords

Music learning interface, user interface, music emotion retrieval, deep learning

1. Introduction

Recent years have witnessed rapid progress in applying machine learning to sentiment analysis [1, 2, 3] and music interaction [4, 5]. Despite this progress, we are yet to see an interface that combines music and emoticons. Emoticons have played a major part in the sentiment analysis domain, especially in understanding emotions from text or tweets. They have also been widely integrated into text messages and lend meaningful value in determining context in text and natural language processing applications [6]. In our work, we build an emoji-informed interface through which the emotion of the music is displayed in real time and users can also input the emoticons they associate with a music piece. First, we develop a back-end machine learning system that decodes emoticons from audio by using lyrics as a proxy. Second, we design an interface that simultaneously displays audio properties with their corresponding emoticons; it is built on top of the back-end ML system and uses the system's output emoticons for display with the music. Finally, we extend the interface to let users interact with the music they are listening to via emoticons in real time. This additional feedback from user interaction is then used to retrain our model and improve the performance of the machine learning system.

Our design differs from SmartVideoRanking [7] and MusicCommentator [8], both of which estimate emotions from time-synchronized comments left by users rather than from the audio itself.

2. Methodology

Our system is designed for interactive demonstration of musical emotions through machine learning and emoticons, and it consists of three parts. First, we prepare a fresh dataset with matching emoticon labels for each song by leveraging the underlying time-annotated lyric representation. Second, we train a machine learning model via supervised learning, representing the audio signals as spectrograms and using the generated emoticon symbols as target labels. Finally, we integrate this system into our interactive display, which uses the music-emoticon pairs as a starting point and enables users to interact with the music by selecting their preferred emoticons. We describe each of these steps below and present an outline of the system in Figure 1.

Figure 1: ML-based back-end system showing data preparation (Labeling), model training (Training), and inference (Inference).

2.1. Dataset Creation

We use the DALI dataset [9], a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics, along with other metadata. Although the dataset contains rich information, for our setup we consider only English songs and paragraph-level lyrical annotations. The decision to use paragraph-level annotations stemmed from our analysis, in which we found that this broader context was necessary to derive a sentiment from the audio. The analysis was done under the assumption that music segments and their annotated lyrics share the same emotional content, which can be effectively represented by emoticons. The lyrics are passed to the fine-tuned DeepMoji [10] transformer to extract emoticon labels for the piece. We refine the output classes by eliminating music-based symbols, as they do not represent any emotion. This subprocess is shown in Figure 1 under Labeling. From this process we generate audio-emoticon pairs, which serve as the basis for our supervised learning algorithm.
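To make the labeling step concrete, the sketch below shows one way the paragraph-to-emoticon mapping could be implemented. It is illustrative only: the `predict` callable stands in for the fine-tuned DeepMoji model, the `Paragraph` loader format stands in for DALI's time-aligned annotations, and the filtered symbol set is an assumption rather than the authors' actual class list.

```python
# Illustrative sketch of the Labeling step (Figure 1). The `predict` callable and
# the Paragraph tuple format are hypothetical placeholders for the fine-tuned
# DeepMoji classifier and the DALI paragraph loader; they are not real APIs.
from typing import Callable, Dict, List, Tuple

# Emoji classes assumed to carry no emotional meaning and therefore removed.
MUSIC_SYMBOLS = {"musical_note", "multiple_musical_notes", "microphone"}

Paragraph = Tuple[float, float, str]   # (start_sec, end_sec, lyric text)
Label = Tuple[float, float, str]       # (start_sec, end_sec, emoji class)

def label_song(paragraphs: List[Paragraph],
               predict: Callable[[str], Dict[str, float]]) -> List[Label]:
    """Map each time-aligned lyric paragraph to its most likely emoticon label."""
    labels: List[Label] = []
    for start, end, text in paragraphs:
        scores = predict(text)  # emoji class -> probability
        scores = {e: p for e, p in scores.items() if e not in MUSIC_SYMBOLS}
        labels.append((start, end, max(scores, key=scores.get)))
    return labels
```

The resulting (start, end, emoji) triples, paired with the corresponding audio segments, form the supervised training data described above.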
2.2. Musical Machine Learning Model

We first transform the audio signals into spectrograms via the short-time Fourier transform. Prior work has shown that spectrograms offer a rich representation of audio [11]. We then use the spectrogram-emoticon pairs to train our model. Since we represent audio signals as spectrograms, we can leverage transfer learning to extract information.

2.2.1. Transfer Learning

Transfer learning in audio has mainly focused on pretraining a model on a large corpus of audio datasets. We follow an approach similar to [12], where we leverage transfer learning but shift the focus from audio datasets to image datasets. We use DenseNet [13] and ResNet [14], convolutional neural network (CNN) architectures pretrained on the ImageNet [15] dataset.

2.2.2. Fine-Tuning

Both DenseNet and ResNet are fine-tuned to predict 62 emoticons. To accomplish this, we add a fully connected layer followed by a sigmoid layer to obtain class probabilities. This subprocess is shown in Figure 1 under Training.
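The following PyTorch sketch illustrates the setup described above under stated assumptions: an STFT spectrogram front end, an ImageNet-pretrained backbone (ResNet-18 stands in for the ResNet/DenseNet variants) with its head replaced by a 62-way fully connected layer, and a multi-label objective in which the sigmoid is folded into the loss. This is not the authors' code; the sample rate handling, FFT size, and optimizer settings are illustrative.

```python
import torch
import torch.nn as nn
import torchaudio
import torchvision

N_CLASSES = 62  # emoticon classes after removing music-based symbols

# Waveform -> log-magnitude STFT spectrogram, treated as a single-channel image.
spectrogram = nn.Sequential(
    torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=512),
    torchaudio.transforms.AmplitudeToDB(),
)

# ImageNet-pretrained backbone; the final layer is replaced with a 62-way head.
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, N_CLASSES)

criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy for multi-label targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(waveform: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """One optimization step; waveform is (batch, 1, samples), targets is a (batch, 62) multi-hot tensor."""
    spec = spectrogram(waveform)        # (batch, 1, freq, time)
    spec = spec.expand(-1, 3, -1, -1)   # replicate to 3 channels to reuse ImageNet weights
    loss = criterion(model(spec), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

At inference time, the same spectrogram front end is applied and the per-class sigmoid probabilities are ranked to choose the emoticon displayed in the interface.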
3. Interface Design

In our proposed integrated music player, we aim to build a web-based player that includes basic features such as Play/Pause, Next/Previous, and playlists. When a song is playing, we intend to display the corresponding audio waveform, the current emoticon representation (derived from the paragraph-based annotation), and a song-level emoticon representation. In addition, we enable the user to interact with the music by selecting emoticons. We propose a two-level interaction: with the song as a whole and with the real-time audio playback. The emoticon icon opens a pop-up in which users input their selection of emoticons. The pop-up icon on the audio waveform section captures user interaction as real-time feedback, while the pop-up icon embedded in the player (horizontally next to the song title) provides song-level emoticon feedback. From these user interactions, we intend to improve the model's performance by retraining it. A visualization of the music player is shown in Figure 2.

Figure 2: Interface with real-time emoticon representation of audio alongside user-enabled reactions (song-level emoji, paragraph-level emoji, and pop-up reaction input).
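As a sketch of how the two-level feedback could reach the back end for retraining, the snippet below shows a minimal Flask endpoint that records both song-level reactions and timestamped waveform reactions to an append-only log. The route name, JSON fields, and storage format are assumptions for illustration, not part of the proposed system.

```python
# Minimal sketch (not the authors' implementation) of collecting the two-level
# emoticon feedback described above for later retraining. Endpoint name and
# JSON schema are assumed.
import json
from datetime import datetime, timezone
from flask import Flask, jsonify, request

app = Flask(__name__)
FEEDBACK_LOG = "user_reactions.jsonl"  # appended here, consumed by a retraining job

@app.route("/reaction", methods=["POST"])
def record_reaction():
    """Store a song-level or paragraph-level (timestamped) emoji reaction."""
    payload = request.get_json(force=True)
    record = {
        "song_id": payload["song_id"],
        "emoji": payload["emoji"],
        # "position" is present only for real-time (waveform) reactions,
        # and absent for song-level reactions from the player header.
        "position_seconds": payload.get("position"),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)
```

A periodic job could then fold these records back into the training set as additional (audio segment, emoticon) pairs, which is how we intend to retrain and improve the model.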
References

[1] L. M. Gómez, M. N. Cáceres, Applying data mining for sentiment analysis in music, in: Trends in Cyber-Physical Multi-Agent Systems. The PAAMS Collection - 15th International Conference, PAAMS 2017, Springer International Publishing, Cham, 2018, pp. 198–205.
[2] G. M. Biancofiore, T. Di Noia, E. Di Sciascio, F. Narducci, P. Pastore, Aspect based sentiment analysis in music: A case study with Spotify, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, SAC '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 696–703. doi:10.1145/3477314.3507092.
[3] L. Taruffi, R. Allen, J. Downing, P. Heaton, Individual differences in music-perceived emotions: The influence of externally oriented thinking, Music Perception 34 (2017) 253–266. doi:10.1525/mp.2017.34.3.253.
[4] J. Smith, D. Weeks, M. Jacob, J. Freeman, B. Magerko, Towards a hybrid recommendation system for a sound library, in: C. Trattner, D. Parra, N. Riche (Eds.), Joint Proceedings of the ACM IUI 2019 Workshops co-located with the 24th ACM Conference on Intelligent User Interfaces (ACM IUI 2019), Los Angeles, USA, March 20, 2019, volume 2327 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2327/IUI19WS-MILC-5.pdf.
[5] V. Thio, H. Liu, Y. Yeh, Y. Yang, A minimal template for interactive web-based demonstrations of musical machine learning, CoRR abs/1902.03722 (2019). URL: http://arxiv.org/abs/1902.03722.
[6] H. Miller, J. Thebault-Spieker, S. Chang, I. Johnson, L. Terveen, B. Hecht, "Blissfully happy" or "ready to fight": Varying interpretations of emoji, Proceedings of the International AAAI Conference on Web and Social Media 10 (2021) 259–268. doi:10.1609/icwsm.v10i1.14757.
[7] K. Tsukuda, M. Hamasaki, M. Goto, SmartVideoRanking: Video search by mining emotions from time-synchronized comments, in: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, pp. 960–969. doi:10.1109/ICDMW.2016.0140.
[8] K. Yoshii, M. Goto, MusicCommentator: Generating comments synchronized with musical audio signals by a joint probabilistic model of acoustic and textual features, in: S. Natkin, J. Dupire (Eds.), Entertainment Computing – ICEC 2009, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 85–97.
[9] G. Meseguer-Brocal, A. Cohen-Hadria, G. Peeters, DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm, in: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, Paris, France, 2018, pp. 431–437. doi:10.5281/ZENODO.1492443.
[10] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, S. Lehmann, Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2017. doi:10.18653/v1/d17-1169.
[11] L. Wyse, Audio spectrogram representations for processing with convolutional neural networks, CoRR abs/1706.09559 (2017). URL: http://arxiv.org/abs/1706.09559.
[12] K. Palanisamy, D. Singhania, A. Yao, Rethinking CNN models for audio classification, CoRR abs/2007.11154 (2020). URL: https://arxiv.org/abs/2007.11154.
[13] G. Huang, Z. Liu, K. Q. Weinberger, Densely connected convolutional networks, CoRR abs/1608.06993 (2016). URL: http://arxiv.org/abs/1608.06993.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/1512.03385 (2015). URL: http://arxiv.org/abs/1512.03385.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.