<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EmoMusic: A New Fun and Interactive Way to Listen to Music</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aman Shukla</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gus Xia</string-name>
          <email>gxia@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Sydney, Australia</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University Shanghai</institution>
          ,
          <addr-line>Shanghai</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Modern music platforms like Spotify allow users to interact with music through different interactive tools, from creating a playlist to liking or skipping a song. Some video-sharing interfaces, such as Niconico and Bilibili, let users view and add overlaid commentary on videos in a synchronized fashion, creating a sense of shared watching experience. However, the integration of emoticons with real-time music has been rare. In this work, we propose additional channels for interacting with music through emoticons. Emoticons have been widely accepted as a medium of communication and expression, especially in text: they convey more information than the matching text while taking less space. We aim to integrate emoticons into an interactive music-listening interface. We believe that an emoticon representation of music allows for a finer granularity in representing emotions and provides users with additional options to interact with music. We propose to build an interface which presents an emoticon representation of music alongside basic music player functionalities.</p>
      </abstract>
      <kwd-group>
        <kwd>Music learning interface</kwd>
        <kwd>user interface</kwd>
        <kwd>music emotion retrieval</kwd>
        <kwd>deep learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Recent years have seen a wide application of machine learning to sentiment analysis [1, 2, 3] and music interaction [4, 5]. Despite this progress, we are yet to see an interface that combines music and emoticons. Emoticons have played a major part in the sentiment analysis domain, especially in understanding emotions from text or tweets. Emoticons have also been widely integrated into text messages and lend meaningful value in determining context in text and natural language processing applications [6]. In our work, we build an emoji-informed interface through which the emotion of the music is displayed in real time, and users can also input their own emoticons associated with a music piece. First, we develop a back-end machine learning system that decodes emoticons from audio by using lyrics as a proxy. Second, we design an interface which simultaneously displays audio properties with their corresponding emoticons. This interface is built on top of the back-end ML system, as it uses the output emoticons from the system to display with the music. Finally, we extend the interface to enable users to interact with the music they are listening to via emoticons in real time. This additional feedback from user interaction is then used to retrain our model and improve the performance of our machine learning system.</p>
      <p>Our design is different from SmartVideoRanking [7] and MusicCommentator [8], both of which estimate emotions from time-synchronized comments posted by users rather than from the metadata of the audio itself.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Methodology</title>
      <p>Our system is designed for interactive demonstration of musical emotions through machine learning and emoticons, and it contains three parts. First, we prepare a new dataset with matching emoticon labels for each song by leveraging the underlying time-annotated lyric representation. Second, we train a machine learning model via supervised learning, representing audio signals as spectrograms and using the generated emoticon symbols as target labels. Finally, we integrate this system into our interactive display, which uses the music-emoticon pair as a starting point and enables users to interact with music by selecting their preferred emoticons. We dive into the details of each of these parts below and present an outline of the system in Figure 1.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1. Dataset Creation</title>
      <p>
        We have used the DALI dataset [<xref ref-type="bibr" rid="ref9">9</xref>], a large and rich multimodal dataset containing 5358 audio tracks with their time-aligned vocal melody notes and lyrics, along with other metadata. Although the dataset contains rich information, for our setup we have only considered English songs and paragraph-level lyrical annotations. The decision to use paragraph-level annotations stemmed from our analysis, in which we found that this level of context was necessary to derive a sentiment from the audio. The analysis rests on the assumption that music segments and their annotated lyrics share the same emotional content, which can be effectively represented by emoticons. The lyrics are passed to the fine-tuned DeepMoji [<xref ref-type="bibr" rid="ref10">10</xref>] model to extract emoticon labels for the piece. We fine-tune the output classes by eliminating music-based symbols, as they do not represent any emotion. This subprocess is represented in Figure 1, titled Labeling. Finally, from this process we generate an audio-emoticon pairing which serves as the basis for our supervised learning algorithm.
      </p>
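      <p>As a rough illustration of this labeling step (not the actual implementation), the sketch below pairs each paragraph-level lyric segment with the most probable non-music emoticon returned by an emoji classifier. The function predict_emoji_scores is a hypothetical stand-in for the fine-tuned DeepMoji model, and MUSIC_EMOJIS is an illustrative list of the music-based symbols we discard; neither name comes from the paper or the DeepMoji code base.</p>
      <preformat>
# Sketch of the labeling subprocess (Figure 1, "Labeling").
MUSIC_EMOJIS = {"musical_note", "notes", "musical_score"}  # illustrative

def label_paragraphs(paragraphs, predict_emoji_scores):
    """paragraphs: list of dicts with 'start', 'end' (seconds) and 'lyrics'.
    Returns (start, end, emoji) triples used as supervision targets."""
    labeled = []
    for para in paragraphs:
        scores = predict_emoji_scores(para["lyrics"])  # {emoji_name: prob}
        # Keep the most probable emoji that is not a music symbol.
        for emoji, _ in sorted(scores.items(), key=lambda kv: -kv[1]):
            if emoji not in MUSIC_EMOJIS:
                labeled.append((para["start"], para["end"], emoji))
                break
    return labeled
      </preformat>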
    </sec>
    <sec id="sec-3">
      <title>2.2. Musical Machine Learning Model</title>
      <p>We first transform audio signals into spectrograms via the short-time Fourier transform. It has been shown that spectrograms offer a rich representation of audio [11]. We then use the spectrogram-emoticon pairs to train our model. Since we represent audio signals as spectrograms, i.e., as images, we can leverage transfer learning from image models to extract information.</p>
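      <p>A minimal sketch of this preprocessing step, assuming librosa for the short-time Fourier transform; the paper does not name a specific implementation, and the window and hop sizes below are illustrative:</p>
      <preformat>
import librosa
import numpy as np

def audio_to_spectrogram(path, sr=22050, n_fft=2048, hop_length=512):
    """Load an audio segment and return a log-magnitude spectrogram.
    Parameter values are illustrative, not the paper's settings."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    magnitude = np.abs(stft)                              # magnitude spectrogram
    return librosa.amplitude_to_db(magnitude, ref=np.max) # log scale for CNN input
      </preformat>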
      <sec id="sec-3-1">
        <title>2.2.1. Transfer Learning</title>
        <p>Transfer learning in audio has mainly focused on pretraining a model on a large corpus of audio data. We follow an approach similar to [12], where we leverage transfer learning but shift the focus from audio datasets to image datasets. We train DenseNet [13] and ResNet [14], convolutional neural network (CNN) architectures pretrained on the ImageNet [15] dataset.</p>
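        <p>A minimal sketch of this setup, assuming PyTorch and torchvision; the framework and the exact DenseNet/ResNet variants are not stated in the paper, so DenseNet-121 and ResNet-50 serve only as examples:</p>
        <preformat>
import torch
from torchvision import models

# ImageNet-pretrained backbones; spectrograms are treated as images
# (replicated to 3 channels) so the pretrained filters can be reused.
densenet = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

spec = torch.rand(4, 1, 224, 224)    # dummy batch of spectrogram "images"
spec3 = spec.repeat(1, 3, 1, 1)      # 1 channel replicated to 3 channels
features = densenet.features(spec3)  # convolutional feature maps
        </preformat>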
      </sec>
      <sec id="sec-3-2">
        <title>2.2.2. Fine-Tuning</title>
        <p>Both DenseNet and ResNet are fine-tuned to predict 62 emoticons. To accomplish this, we add a fully connected layer followed by a sigmoid layer to obtain class probabilities. This subprocess is represented in Figure 1, titled Training.</p>
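        <p>One possible realization of this prediction head, again assuming PyTorch; only the 62-way sigmoid output comes from the text, while the backbone choice and layer sizes are illustrative:</p>
        <preformat>
import torch
import torch.nn as nn
from torchvision import models

NUM_EMOJIS = 62  # number of emoticon classes predicted by the model

class EmojiNet(nn.Module):
    """ImageNet-pretrained backbone with a fully connected + sigmoid head."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        in_features = backbone.fc.in_features
        backbone.fc = nn.Identity()          # drop the 1000-way ImageNet head
        self.backbone = backbone
        self.head = nn.Sequential(nn.Linear(in_features, NUM_EMOJIS),
                                  nn.Sigmoid())

    def forward(self, spec_images):
        return self.head(self.backbone(spec_images))  # per-class probabilities

model = EmojiNet()
probs = model(torch.rand(2, 3, 224, 224))  # dummy spectrogram batch, output (2, 62)
        </preformat>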
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Interface Design</title>
      <p>[Figure 2: Interface with real-time emoticon representation of audio alongside user-enabled reactions.]</p>
      <p>In our proposed integrated music player design, we aim to build a web-based music player that includes basic features like Play/Pause, Next/Previous, Playlist, etc. When a song is playing, we intend to display the corresponding audio waveform, the current emoticon representation (derived from the paragraph-based annotation), and a song-level emoticon representation. In addition, we enable the user to interact with the music by selecting emoticons. We propose to build a two-level interaction: with a song and with real-time audio playback. The emoticon icon opens a pop-up for users to input their selection of emoticons. The pop-up icon on the audio waveform section captures user interaction for real-time feedback, while the pop-up icon embedded in the player (horizontally next to the song title) provides song-level emoticon feedback. From these user interactions, we intend to improve the model's performance by re-training our model. A visualization of the music player is shown in Figure 2.</p>
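      <p>As a rough sketch of how these reactions could be logged for later re-training (the record fields and file format below are our assumptions, not part of the described interface):</p>
      <preformat>
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class EmojiReaction:
    """One user reaction; the fields are illustrative, not the paper's schema."""
    song_id: str
    emoji: str
    position_sec: Optional[float]  # None for song-level feedback
    timestamp: float

def log_reaction(song_id, emoji, position_sec=None, path="reactions.jsonl"):
    """Append a reaction; the log is later joined with spectrogram segments
    to form additional (spectrogram, emoticon) pairs for re-training."""
    reaction = EmojiReaction(song_id, emoji, position_sec, time.time())
    with open(path, "a") as f:
        f.write(json.dumps(asdict(reaction)) + "\n")

# Real-time (waveform pop-up) and song-level (player pop-up) feedback:
log_reaction("track_042", "heart_eyes", position_sec=83.5)
log_reaction("track_042", "smile")
      </preformat>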
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Meseguer-Brocal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohen-Hadria</surname>
          </string-name>
          , G. Peeters,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Dali: A large dataset of synchronized audio</article-title>
          , lyrics [1]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Gómez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Cáceres</surname>
          </string-name>
          ,
          <article-title>Applying data min- and notes, automatically created using teacher-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Cyber-Physical</surname>
            <given-names>Multi-Agent</given-names>
          </string-name>
          <string-name>
            <surname>Systems</surname>
          </string-name>
          .
          <source>The PAAMS of the 19th International Society for Music Infor-</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Collection - 15th International</surname>
            <given-names>Conference</given-names>
          </string-name>
          , PAAMS mation Retrieval Conference, ISMIR, Paris, France
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          2017, Springer International Publishing, Cham, (
          <year>2018</year>
          )
          <fpage>431</fpage>
          -
          <lpage>437</lpage>
          . doi:
          <volume>10</volume>
          .5281/ZENODO.1492443.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <year>2018</year>
          , pp.
          <fpage>198</fpage>
          -
          <lpage>205</lpage>
          . [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Felbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mislove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Rahwan</surname>
          </string-name>
          , [2]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Biancofiore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Di</given-names>
            <surname>Noia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Di</given-names>
            <surname>Sciascio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nar- S. Lehmann</surname>
          </string-name>
          ,
          <article-title>Using millions of emoji occurrences</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>ceedings of the 37th ACM/SIGAPP Symposium ceedings of the 2017 Conference on Empirical</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>on Applied Computing</source>
          , SAC '22,
          <article-title>Association for Methods in Natural Language Processing</article-title>
          , Asso-
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Computing</given-names>
            <surname>Machinery</surname>
          </string-name>
          , New York, NY, USA,
          <year>2022</year>
          , ciation for Computational Linguistics,
          <year>2017</year>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          p.
          <fpage>696</fpage>
          -
          <lpage>703</lpage>
          . URL: https://doi.org/10.1145/3477314. https://doi.org/10.18653%2Fv1%
          <fpage>2Fd17</fpage>
          -
          <lpage>1169</lpage>
          . doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          3507092. doi:
          <volume>10</volume>
          .1145/3477314.3507092. 18653/v1/d17-
          <fpage>1169</fpage>
          . [3]
          <string-name>
            <given-names>L.</given-names>
            <surname>Tarufi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Downing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Heaton</surname>
          </string-name>
          , Indi- [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wyse</surname>
          </string-name>
          ,
          <article-title>Audio spectrogram representations for</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <article-title>The Influence of Externally Oriented Thinking</article-title>
          ,
          <source>CoRR abs/1706</source>
          .09559 (
          <year>2017</year>
          ). URL: http://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <source>Music Perception</source>
          <volume>34</volume>
          (
          <year>2017</year>
          )
          <fpage>253</fpage>
          -
          <lpage>266</lpage>
          . URL: https:// abs/1706.09559. arXiv:
          <volume>1706</volume>
          .
          <fpage>09559</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          doi.org/10.1525/mp.
          <year>2017</year>
          .
          <volume>34</volume>
          .3.253. doi:
          <volume>10</volume>
          .1525/mp. [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Palanisamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Singhania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yao</surname>
          </string-name>
          , Rethink-
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <year>2017</year>
          .
          <volume>34</volume>
          .3.253.
          <article-title>ing CNN models for audio classification</article-title>
          ,
          <source>CoRR</source>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weeks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freeman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Magerko</surname>
          </string-name>
          , abs/
          <year>2007</year>
          .11154 (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <article-title>Towards a hybrid recommendation system for a 2007.11154</article-title>
          . arXiv:
          <year>2007</year>
          .11154.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <article-title>sound library</article-title>
          , in: C.
          <string-name>
            <surname>Trattner</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Parra</surname>
            , N. Riche [13]
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>K. Q.</given-names>
          </string-name>
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , Densely
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          (Eds.),
          <source>Joint Proceedings of the ACM IUI</source>
          <year>2019</year>
          <article-title>Work- connected convolutional networks</article-title>
          ,
          <source>CoRR</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <article-title>shops co-located with the 24th</article-title>
          <source>ACM Conference on abs/1608</source>
          .06993 (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Intelligent</given-names>
            <surname>User</surname>
          </string-name>
          <article-title>Interfaces (ACM IUI</article-title>
          <year>2019</year>
          ), Los An-
          <volume>1608</volume>
          .06993. arXiv:
          <volume>1608</volume>
          .
          <fpage>06993</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>geles</surname>
          </string-name>
          , USA, March
          <volume>20</volume>
          ,
          <year>2019</year>
          , volume
          <volume>2327</volume>
          <source>of CEUR</source>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          , Deep resid-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Workshop</given-names>
            <surname>Proceedings</surname>
          </string-name>
          , CEUR-WS.org,
          <year>2019</year>
          . URL:
          <article-title>ual learning for image recognition</article-title>
          ,
          <source>CoRR</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2327</volume>
          /
          <fpage>IUI19WS</fpage>
          -MILC-
          <article-title>5</article-title>
          .pdf.
          <source>abs/1512</source>
          .03385 (
          <year>2015</year>
          ). URL: http://arxiv.org/abs/ [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Thio</surname>
          </string-name>
          , H. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , A mini-
          <volume>1512</volume>
          .03385. arXiv:
          <volume>1512</volume>
          .
          <fpage>03385</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <article-title>mal template for interactive web-based demon-</article-title>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>L</surname>
          </string-name>
          . Fei-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          abs/
          <year>1902</year>
          .03722 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/ database, in: 2009 IEEE Conference on Computer
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <year>1902</year>
          .03722. arXiv:
          <year>1902</year>
          .03722.
          <article-title>Vision and Pattern Recognition</article-title>
          , IEEE,
          <year>2009</year>
          , pp. [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thebault-Spieker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chang</surname>
          </string-name>
          , I. John- 248-
          <fpage>255</fpage>
          . doi:
          <volume>10</volume>
          .1109/CVPR.
          <year>2009</year>
          .
          <volume>5206848</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <source>ence on Web and Social Media</source>
          <volume>10</volume>
          (
          <year>2021</year>
          )
          <fpage>259</fpage>
          -
          <lpage>268</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <source>view/14757</source>
          . doi:
          <volume>10</volume>
          .1609/icwsm.v10i1.
          <fpage>14757</fpage>
          . [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Tsukuda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Masahiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goto</surname>
          </string-name>
          , Smartvideo-
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>time-synchronized comments</article-title>
          ,
          <source>in: 2016 IEEE 16th</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <source>shops (ICDMW)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>960</fpage>
          -
          <lpage>969</lpage>
          . doi:
          <volume>10</volume>
          .1109/
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <string-name>
            <surname>ICDMW.</surname>
          </string-name>
          <year>2016</year>
          .
          <volume>0140</volume>
          . [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yoshii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Goto</surname>
          </string-name>
          , Musiccommentator: Generating
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <source>tainment Computing - ICEC 2009</source>
          , Springer Berlin
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <string-name>
            <surname>Heidelberg</surname>
          </string-name>
          , Berlin, Heidelberg,
          <year>2009</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>