<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Network Analysis and Mining</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/FTC.2016.7821783</article-id>
      <title-group>
        <article-title>The Melodies of an Image: Exploring Music Recommendations Based on an Image's Content and Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adwaita Janardhan Jadhav</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ishmeet Kaur</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>11</volume>
      <issue>2021</issue>
      <fpage>1361</fpage>
      <lpage>1364</lpage>
      <abstract>
<p>Music recommendation systems are widely used in industry and have traditionally been based on user behavior and emerging trends. Concurrently, advances in sentiment analysis now allow complex emotional understanding of both text and images. However, a significant gap exists in integrating these areas for a comprehensive user experience. This paper positions a novel approach to address this gap, proposing an image-to-song recommendation system that utilizes the emotional relation between visuals and music.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Music recommendation has been a focal point in both research and industry (such as entertainment and social media), with systems evolving to deliver personalized playlists based on user behaviors, preferences, and broader trends. As the dynamics of user interaction shift towards visual platforms, there is an expanding research area in understanding the interplay between images and music. With machine learning models becoming more sophisticated at extracting moods from text and image data, sentiment analysis has demonstrated the potential to extract and interpret a range of emotions. Yet the convergence of music recommendation with image sentiment remains relatively unexplored, representing a gap between these advancing fields. Addressing this, our paper positions a novel approach that merges these domains to introduce an image-to-song recommendation system.</p>
      <sec id="sec-1-1">
        <title>2. The Need for Image-Based Music Recommendation</title>
        <p>Recommendation systems often rely on user behavior, such as the number of times a song is played, the songs a user reacts to on social media posts, or the types of images a user likes [1]. While music and image recommendations typically follow separate historical trends, there is value in combining them for purposes like social media post backgrounds or movie soundtracks. Existing research mainly focuses on recommending music from either image content [2], like objects, or context [1], such as facial expressions, without merging the two approaches.</p>
        <p>Music recommendations can benefit from considering both an image's subject and its emotional context [3]. For instance, a picture of a girl celebrating her birthday by cutting a cake might lead to a 'happy' song suggestion based on mood alone (even if the song is a wedding song), or to a dance-party track because balloons are visible. A more effective system would combine these approaches, ensuring the song choice matches both the image's content and its feeling. Hence, there is a demand for a music recommendation system that understands both aspects of an image.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Proposed Approach: Mood Melody Mapper (MMM)</title>
        <p>This paper proposes a proof of concept of a novel method called Mood Melody Mapper (MMM) to recommend music that maps seamlessly to the content, context, and sentiment of a given image. We explore a dynamic intersection of visual content and auditory experience to enhance user engagement through personalized music recommendations. Using Natural Language Processing (NLP), this method combines techniques such as image description generation and sentiment analysis of music lyrics to identify the mood. Each subsection below details one block of the system architecture in Figure 1, discussing the underlying algorithms and implementation specifics.</p>
        <sec id="sec-1-2-0">
          <title>3.1. Image Text Description Generation</title>
          <p>The first block of MMM (Block 1 in Figure 1) generates a text description of the given user input image. We propose to use an existing encoder-decoder model pair as the base model. The encoder processes an input image and compresses its information into a fixed-length representation, from which the decoder produces an output sequence: the text description of the image.</p>
          <p>Encoders such as CNNs (VGG16 or ResNet152) [4, 5] and decoders (LSTM, transformer, GRU) [6, 7] are used here. For example, the ResNet152 encoder, pretrained on ImageNet, is modified by removing its softmax layer, producing fixed-length vector embeddings from images, which are then used by the decoder to produce the text description. Training such an encoder-decoder model leverages datasets like the VizWiz-Caption dataset [8], which offers image-caption pairs. Given that images contain multiple objects and features, an attention mechanism like soft attention [9] is incorporated into the decoder to ensure nuanced image details are considered during caption generation.</p>
        </sec>
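<p>As a concrete illustration of the soft-attention step just described (a minimal sketch, not the authors' implementation), the following NumPy snippet computes attention weights over a grid of image region features and the resulting context vector. The dimensions, the random inputs, and the alignment matrix <monospace>W</monospace> are illustrative assumptions; in a real model they are learned.</p>

```python
import numpy as np

def soft_attention(region_feats, hidden, W):
    """Soft attention: weight image regions by relevance to the decoder state.

    region_feats: (num_regions, feat_dim) image region embeddings from the encoder
    hidden:       (hidden_dim,) current decoder hidden state
    W:            (feat_dim, hidden_dim) alignment matrix (learned; random here)
    """
    scores = region_feats @ W @ hidden       # (num_regions,) alignment scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # attention weights sum to 1
    context = weights @ region_feats         # weighted sum of region features
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 2048))        # e.g., a flattened 7x7 ResNet feature map
h = rng.normal(size=512)
W = rng.normal(size=(2048, 512)) * 0.01
ctx, w = soft_attention(regions, h, W)
```

At each decoding step the decoder would consume <monospace>ctx</monospace> alongside the previously generated word, so regions with higher weights contribute more to the next caption token.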
        <sec id="sec-1-2-2">
          <title>3.2. Song Recommendation Based on Cosine Similarity</title>
          <p>The initial set of song recommendations is derived using cosine similarity [10], comparing the text description generated in Block 1 to the lyrics of each song. In Block 2 of Figure 1, we represent both the text description and the song lyrics as vectors and calculate the cosine similarity between them. The system then recommends the top songs (say, the top 10) with the highest cosine similarity values.</p>
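<p>To make Block 2 concrete, here is a minimal self-contained sketch. The song titles, lyrics, and the tiny stop-word list are made up, and simple bag-of-words counts stand in for the Doc2Vec or BERT vectors an actual system would use.</p>

```python
import math
import string
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "on", "to"}  # tiny illustrative list

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, drop stop words."""
    tokens = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return [t for t in tokens if t not in STOP_WORDS]

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(description, lyrics_by_song, top_n=10):
    """Rank songs by cosine similarity between the image description and their lyrics."""
    d = Counter(preprocess(description))
    scored = [(cosine_similarity(d, Counter(preprocess(lyr))), song)
              for song, lyr in lyrics_by_song.items()]
    scored.sort(reverse=True)
    return [song for _, song in scored[:top_n]]

songs = {  # hypothetical catalog
    "Party Lights": "balloons and cake we dance all night",
    "Quiet Rain": "grey skies and the quiet rain falls",
}
print(recommend("a girl cutting a birthday cake with balloons", songs, top_n=1))
# → ['Party Lights']
```

Swapping the Counter vectors for Doc2Vec or BERT embeddings changes only the vectorization step; the similarity ranking stays the same.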
          <p>To compute the cosine similarity here, both the song lyrics and the image text description go through preprocessing (tokenization, removal of stop words and punctuation), followed by conversion to vectors using methods like Doc2Vec [11] or BERT [12].</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>3.3. Sentiment Classification of the Image Text Description</title>
          <p>In Block 3 of Figure 1, the second batch of song recommendations is derived from the emotions conveyed in the image. Here we determine the emotional tone behind the image description text, distinguishing between positive, negative, and neutral tones, and further categorizing it into more specific emotions like happiness or worry [13, 14, 15]. A machine learning model, trained on a labeled sentiment dataset, assigns sentiment labels to the image descriptions. Sentiment analysis can be performed with models like LSTM [6] (suited to longer text sequences), CNN [16] (capturing local patterns), BERT [12] (capturing context in both directions), or a hybrid approach. Datasets like the SMILE Twitter dataset [17] can be used to train such a model.</p>
        </sec>
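<p>The classification step can be sketched with a toy model. The snippet below trains a tiny multinomial naive Bayes classifier (a deliberately simple stand-in for the LSTM/CNN/BERT options named above) on a handful of made-up labeled descriptions; a real system would train on a dataset such as SMILE.</p>

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial naive Bayes over unigrams -- a stand-in for the heavier
    sentiment models named in the text, trained on a toy labeled dataset."""

    def fit(self, texts, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for w in text.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            total = sum(self.word_counts[c].values())
            for w in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a class
                lp += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy labeled sentiment data standing in for a real dataset
train_texts = [
    "a happy girl celebrating her birthday with cake",
    "friends smiling and dancing at a joyful party",
    "a sad man crying alone in the rain",
    "a gloomy empty street on a grey day",
]
train_labels = ["positive", "positive", "negative", "negative"]
clf = TinyNaiveBayes().fit(train_texts, train_labels)
print(clf.predict("children smiling at a birthday party"))  # → positive
```

The predicted label (here, a coarse positive/negative tone) is what Block 4 then maps onto the music-side emotion categories.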
      </sec>
      <sec id="sec-1-3">
        <title>3.4. Mapping Sentiments</title>
        <p>Sentiment analysis is then conducted on the song lyrics, though a pre-categorized music dataset can also be employed. In Block 4 of Figure 1, an emotion mapping between the image-text and music-text categories is executed, as their categorizations may differ, as illustrated in Figure 2. Based on this mapping, songs are recommended that align with the desired emotion. In complex mapping scenarios, techniques like transfer learning [18] or Canonical Correlation Analysis (CCA) [19] can be used.</p>
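<p>In the simple case, the Block 4 mapping can begin as a lookup table before resorting to transfer learning or CCA. The emotion and mood categories below are hypothetical placeholders (Figure 2 defines the actual ones), but they show the shape of the mapping:</p>

```python
# Hypothetical mapping from image-side emotion labels to music-side mood tags;
# the real categories come from the datasets used on each side (cf. Figure 2).
IMAGE_TO_MUSIC_MOOD = {
    "happiness": ["upbeat", "celebratory"],
    "sadness": ["melancholic", "slow"],
    "worry": ["calm", "soothing"],
    "neutral": ["ambient"],
}

def map_emotion(image_emotion, default=("ambient",)):
    """Return the music moods aligned with an image emotion, with a fallback."""
    return IMAGE_TO_MUSIC_MOOD.get(image_emotion, list(default))

print(map_emotion("happiness"))  # → ['upbeat', 'celebratory']
```

Learned approaches replace this table when the two label sets do not align one-to-one.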
      </sec>
      <sec id="sec-1-4">
        <title>3.5. Song Recommendation Aggregation</title>
        <p>The song recommendations from Block 2 and Block 5 are then aggregated to produce the final recommendation list in Block 6 of Figure 1. The aggregation can be simple, such as a union or weighted union of the two lists. A deduplication filter and a user feedback loop can also be applied when combining the lists. Metadata integration can further refine the rankings using additional data about each song (such as user ratings or play counts) where available; for instance, songs with higher user ratings might get a boost in the combined list.</p>
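<p>One possible realization of Block 6 (a sketch, not the paper's fixed design) is a weighted union scored by reciprocal rank, with deduplication falling out of the score dictionary and an optional metadata boost from user ratings. The song titles below are hypothetical:</p>

```python
def aggregate(content_recs, emotion_recs, ratings=None, w_content=0.5, w_emotion=0.5):
    """Weighted union of two ranked lists with deduplication.

    Each list contributes weight / rank per song; a song appearing in both
    lists accumulates both contributions (so duplicates merge rather than
    repeat). An optional ratings dict (song -> 0..5) boosts the score.
    """
    scores = {}
    for weight, recs in ((w_content, content_recs), (w_emotion, emotion_recs)):
        for rank, song in enumerate(recs, start=1):
            scores[song] = scores.get(song, 0.0) + weight / rank
    if ratings:
        for song in scores:
            scores[song] *= 1.0 + 0.1 * ratings.get(song, 0)  # metadata boost
    return sorted(scores, key=scores.get, reverse=True)

block2 = ["Party Lights", "Cake by the Lake", "Balloon Waltz"]  # hypothetical lists
block5 = ["Good Vibes", "Party Lights"]
print(aggregate(block2, block5))
# → ['Party Lights', 'Good Vibes', 'Cake by the Lake', 'Balloon Waltz']
```

"Party Lights" ranks first because it appears in both input lists, which is exactly the agreement signal the weighted union is meant to reward.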
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Method</title>
      <p>A systematic method is needed to evaluate and select the models used in the different components highlighted in Figure 1. This section provides some details about the proposed evaluation methods:</p>
    </sec>
    <sec id="sec-2">
      <title>5. Applications</title>
      <p>There are various applications of the image-based music recommendation system:</p>
      <p>1. Social Media Platforms: Adding background music to social media posts using image-based music recommendations can increase user engagement, since the combination of image and music can attract users' attention [23].</p>
      <p>2. Soundtrack Generation: In apps that create digital photo albums, a suitable soundtrack can be added based on the images in the album. For example, an album of pictures spanning childhood to old age can have a soundtrack that transitions with the images rather than a single static memories track [3].</p>
      <p>3. Event Playlists: Users can upload event images, such as parties or road trips, and the system can curate playlists specific to the event's theme [3].</p>
      <p>4. Interior Decor Music: Images of home interiors and decor preferences can yield music recommendations that align with the user's design aesthetic [24].</p>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and Future Vision</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>