EmoMusic: A New Fun and Interactive Way to Listen to Music

Aman Shukla (New York University, New York, United States; as14034@nyu.edu)
Gus Xia (New York University Shanghai, Shanghai, China; gxia@nyu.edu)

Abstract

Modern music platforms like Spotify let users interact with music through different tools, from creating playlists to liking or skipping a song. A prominent feature of such platforms is allowing users to react to music via likes and skips. Some video-sharing interfaces such as Niconico and Bilibili let users view and add overlaid commentary on videos in a synchronized fashion, creating a sense of shared watching experience. However, the integration of emoticons with real-time music has been rare. In this work, we propose additional channels for interacting with music through emoticons. Emoticons have been widely accepted and integrated as a medium of communication and expression, especially in text. They often convey more information than the matching text while remaining compact. We aim to integrate emoticons into an interactive music-listening interface. We believe that an emoticon representation of music allows for finer granularity in representing emotions and provides users with additional options to interact with music. We propose to build an interface that presents an emoticon representation of music alongside basic music player functionality.

Keywords

Music learning interface, user interface, music emotion retrieval, deep learning

1. Introduction

Recent years have witnessed rapid progress in applying machine learning to sentiment analysis [1, 2, 3] and music interaction [4, 5]. Despite this progress, we are yet to see an interface that combines music and emoticons. Emoticons have played a major part in the sentiment analysis domain, especially in understanding emotions from text or tweets. They have also been widely integrated into text messages and lend meaningful value in determining context in text and natural language processing applications [6]. In our work, we build an emoji-informed interface through which the emotion of the music is displayed in real time and users can also input the emoticons they associate with a music piece. First, we develop a back-end machine learning system that decodes emoticons from audio by using lyrics as a proxy. Second, we design an interface that simultaneously displays audio properties with their corresponding emoticons; it is built on top of the back-end ML system and uses the system's output emoticons for display with the music. Finally, we extend the interface to let users interact with the music they are listening to via emoticons in real time. This additional feedback from user interaction is then used to retrain our model and improve the performance of the machine learning system.

Our design differs from SmartVideoRanking [7] and MusicCommentator [8], both of which estimate emotions from time-synchronized comments left by users rather than from the audio itself.

2. Methodology

Our system is designed for interactive demonstration of musical emotions through machine learning and emoticons, and it consists of three parts. First, we prepare a fresh dataset with matching emoticon labels for each song by leveraging the underlying time-annotated lyric representation. Second, we train a machine learning model via supervised learning, representing the audio signals as spectrograms and using the generated emoticon symbols as target labels. Finally, we integrate this system into our interactive display, which uses the music-emoticon pairs as a starting point and enables users to interact with the music by selecting their preferred emoticons. We describe each of these steps below and present an outline of the system in Figure 1.

Figure 1: ML-based back-end system showing data preparation (Labeling), model training (Training), and inference (Inference).

2.1. Dataset Creation

We use the DALI dataset [9], a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics, along with other metadata. Although the dataset contains rich information, for our setup we consider only English songs and paragraph-level lyrical annotations. The decision to use paragraph-level annotations stemmed from our analysis, in which we found that this broader context was necessary to derive a sentiment from the audio. The analysis was done under the assumption that music segments and their annotated lyrics share the same emotional content, which can be effectively represented by emoticons. The lyrics are passed to the fine-tuned DeepMoji [10] transformer to extract emoticon labels for the piece. We refine the output classes by eliminating music-based symbols, as they do not represent any emotion. This subprocess is shown in Figure 1 under Labeling. From this process we generate audio-emoticon pairs, which serve as the basis for our supervised learning algorithm.
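To make the labeling step concrete, the sketch below shows one way the paragraph-to-emoticon mapping could be implemented. It is illustrative only: the `predict` callable stands in for the fine-tuned DeepMoji model, the `Paragraph` loader format stands in for DALI's time-aligned annotations, and the filtered symbol set is an assumption rather than the authors' actual class list.

```python
# Illustrative sketch of the Labeling step (Figure 1). The `predict` callable and
# the Paragraph tuple format are hypothetical placeholders for the fine-tuned
# DeepMoji classifier and the DALI paragraph loader; they are not real APIs.
from typing import Callable, Dict, List, Tuple

# Emoji classes assumed to carry no emotional meaning and therefore removed.
MUSIC_SYMBOLS = {"musical_note", "multiple_musical_notes", "microphone"}

Paragraph = Tuple[float, float, str]   # (start_sec, end_sec, lyric text)
Label = Tuple[float, float, str]       # (start_sec, end_sec, emoji class)

def label_song(paragraphs: List[Paragraph],
               predict: Callable[[str], Dict[str, float]]) -> List[Label]:
    """Map each time-aligned lyric paragraph to its most likely emoticon label."""
    labels: List[Label] = []
    for start, end, text in paragraphs:
        scores = predict(text)  # emoji class -> probability
        scores = {e: p for e, p in scores.items() if e not in MUSIC_SYMBOLS}
        labels.append((start, end, max(scores, key=scores.get)))
    return labels
```

The resulting (start, end, emoji) triples, paired with the corresponding audio segments, form the supervised training data described above.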
2.2. Musical Machine Learning Model

We first transform the audio signals into spectrograms via the short-time Fourier transform. Prior work has shown that spectrograms offer a rich representation of audio [11]. We then use the spectrogram-emoticon pairs to train our model. Since we represent audio signals as spectrograms, we can leverage transfer learning to extract information.

2.2.1. Transfer Learning

Transfer learning in audio has mainly focused on pretraining a model on a large corpus of audio datasets. We follow an approach similar to [12], where we leverage transfer learning but shift the focus from audio datasets to image datasets. We use DenseNet [13] and ResNet [14], convolutional neural network (CNN) architectures pretrained on the ImageNet [15] dataset.

2.2.2. Fine-Tuning

Both DenseNet and ResNet are fine-tuned to predict 62 emoticons. To accomplish this, we add a fully connected layer followed by a sigmoid layer to obtain class probabilities. This subprocess is shown in Figure 1 under Training.
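The following PyTorch sketch illustrates the setup described above under stated assumptions: an STFT spectrogram front end, an ImageNet-pretrained backbone (ResNet-18 stands in for the ResNet/DenseNet variants) with its head replaced by a 62-way fully connected layer, and a multi-label objective in which the sigmoid is folded into the loss. This is not the authors' code; the sample rate handling, FFT size, and optimizer settings are illustrative.

```python
import torch
import torch.nn as nn
import torchaudio
import torchvision

N_CLASSES = 62  # emoticon classes after removing music-based symbols

# Waveform -> log-magnitude STFT spectrogram, treated as a single-channel image.
spectrogram = nn.Sequential(
    torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=512),
    torchaudio.transforms.AmplitudeToDB(),
)

# ImageNet-pretrained backbone; the final layer is replaced with a 62-way head.
model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, N_CLASSES)

criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy for multi-label targets
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(waveform: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """One optimization step; waveform is (batch, 1, samples), targets is a (batch, 62) multi-hot tensor."""
    spec = spectrogram(waveform)        # (batch, 1, freq, time)
    spec = spec.expand(-1, 3, -1, -1)   # replicate to 3 channels to reuse ImageNet weights
    loss = criterion(model(spec), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```

At inference time, the same spectrogram front end is applied and the per-class sigmoid probabilities are ranked to choose the emoticon displayed in the interface.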
3. Interface Design

In our proposed integrated music player, we aim to build a web-based player that includes basic features such as Play/Pause, Next/Previous, and playlists. When a song is playing, we intend to display the corresponding audio waveform, the current emoticon representation (derived from the paragraph-based annotation), and a song-level emoticon representation. In addition, we enable the user to interact with the music by selecting emoticons. We propose a two-level interaction: with the song as a whole and with the real-time audio playback. The emoticon icon opens a pop-up in which users input their selection of emoticons. The pop-up icon on the audio waveform section captures user interaction as real-time feedback, while the pop-up icon embedded in the player (horizontally next to the song title) provides song-level emoticon feedback. From these user interactions, we intend to improve the model's performance by retraining it. A visualization of the music player is shown in Figure 2.

Figure 2: Interface with real-time emoticon representation of audio alongside user-enabled reactions (song-level emoji, paragraph-level emoji, and pop-up reaction input).
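As a sketch of how the two-level feedback could reach the back end for retraining, the snippet below shows a minimal Flask endpoint that records both song-level reactions and timestamped waveform reactions to an append-only log. The route name, JSON fields, and storage format are assumptions for illustration, not part of the proposed system.

```python
# Minimal sketch (not the authors' implementation) of collecting the two-level
# emoticon feedback described above for later retraining. Endpoint name and
# JSON schema are assumed.
import json
from datetime import datetime, timezone
from flask import Flask, jsonify, request

app = Flask(__name__)
FEEDBACK_LOG = "user_reactions.jsonl"  # appended here, consumed by a retraining job

@app.route("/reaction", methods=["POST"])
def record_reaction():
    """Store a song-level or paragraph-level (timestamped) emoji reaction."""
    payload = request.get_json(force=True)
    record = {
        "song_id": payload["song_id"],
        "emoji": payload["emoji"],
        # "position" is present only for real-time (waveform) reactions,
        # and absent for song-level reactions from the player header.
        "position_seconds": payload.get("position"),
        "received_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(FEEDBACK_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run(debug=True)
```

A periodic job could then fold these records back into the training set as additional (audio segment, emoticon) pairs, which is how we intend to retrain and improve the model.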
References

[1] L. M. Gómez, M. N. Cáceres, Applying data mining for sentiment analysis in music, in: Trends in Cyber-Physical Multi-Agent Systems. The PAAMS Collection - 15th International Conference, PAAMS 2017, Springer International Publishing, Cham, 2018, pp. 198–205.
[2] G. M. Biancofiore, T. Di Noia, E. Di Sciascio, F. Narducci, P. Pastore, Aspect based sentiment analysis in music: A case study with Spotify, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, SAC '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 696–703. doi:10.1145/3477314.3507092.
[3] L. Taruffi, R. Allen, J. Downing, P. Heaton, Individual differences in music-perceived emotions: The influence of externally oriented thinking, Music Perception 34 (2017) 253–266. doi:10.1525/mp.2017.34.3.253.
[4] J. Smith, D. Weeks, M. Jacob, J. Freeman, B. Magerko, Towards a hybrid recommendation system for a sound library, in: C. Trattner, D. Parra, N. Riche (Eds.), Joint Proceedings of the ACM IUI 2019 Workshops co-located with the 24th ACM Conference on Intelligent User Interfaces (ACM IUI 2019), Los Angeles, USA, March 20, 2019, volume 2327 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2327/IUI19WS-MILC-5.pdf.
[5] V. Thio, H. Liu, Y. Yeh, Y. Yang, A minimal template for interactive web-based demonstrations of musical machine learning, CoRR abs/1902.03722 (2019). URL: http://arxiv.org/abs/1902.03722.
[6] H. Miller, J. Thebault-Spieker, S. Chang, I. Johnson, L. Terveen, B. Hecht, "Blissfully happy" or "ready to fight": Varying interpretations of emoji, Proceedings of the International AAAI Conference on Web and Social Media 10 (2021) 259–268. doi:10.1609/icwsm.v10i1.14757.
[7] K. Tsukuda, M. Hamasaki, M. Goto, SmartVideoRanking: Video search by mining emotions from time-synchronized comments, in: 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW), 2016, pp. 960–969. doi:10.1109/ICDMW.2016.0140.
[8] K. Yoshii, M. Goto, MusicCommentator: Generating comments synchronized with musical audio signals by a joint probabilistic model of acoustic and textual features, in: S. Natkin, J. Dupire (Eds.), Entertainment Computing – ICEC 2009, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 85–97.
[9] G. Meseguer-Brocal, A. Cohen-Hadria, G. Peeters, DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm, in: Proceedings of the 19th International Society for Music Information Retrieval Conference, ISMIR, Paris, France, 2018, pp. 431–437. doi:10.5281/ZENODO.1492443.
[10] B. Felbo, A. Mislove, A. Søgaard, I. Rahwan, S. Lehmann, Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2017. doi:10.18653/v1/d17-1169.
[11] L. Wyse, Audio spectrogram representations for processing with convolutional neural networks, CoRR abs/1706.09559 (2017). URL: http://arxiv.org/abs/1706.09559.
[12] K. Palanisamy, D. Singhania, A. Yao, Rethinking CNN models for audio classification, CoRR abs/2007.11154 (2020). URL: https://arxiv.org/abs/2007.11154.
[13] G. Huang, Z. Liu, K. Q. Weinberger, Densely connected convolutional networks, CoRR abs/1608.06993 (2016). URL: http://arxiv.org/abs/1608.06993.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/1512.03385 (2015). URL: http://arxiv.org/abs/1512.03385.
[15] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 248–255. doi:10.1109/CVPR.2009.5206848.