<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Network Analysis and Mining</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1109/FTC.2016.7821783</article-id>
      <title-group>
        <article-title>The Melodies of an Image: Exploring Music Recommendations Based on an Image's Content and Context</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Adwaita Janardhan Jadhav</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ishmeet Kaur</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>11</volume>
      <issue>2021</issue>
      <fpage>1361</fpage>
      <lpage>1364</lpage>
      <abstract>
<p>Music recommendation systems are widely used in industry and have traditionally been based on user behavior and emerging trends. Concurrently, advances in sentiment analysis now allow complex emotional understanding of both text and images. However, a significant gap exists in integrating these areas for a comprehensive user experience. This paper positions a novel approach to address this gap, proposing an image-to-song recommendation system that utilizes the emotional relation between visuals and music.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Music recommendation has been a focal point in both research and industry (such as entertainment and social media), with systems evolving to deliver personalized playlists based on user behaviors, preferences, and broader trends. As the dynamics of user interaction shift towards visual platforms, there is an expanding research area in understanding the interplay between images and music. With machine learning models becoming more sophisticated at extracting moods from text and image data, sentiment analysis has demonstrated the potential to extract and interpret a range of emotions. Yet the convergence of music recommendation with image sentiment remains relatively unexplored, representing a gap between these advancing fields. Addressing this, our paper positions a novel approach that merges these domains to introduce an image-to-song recommendation system.</p>
      <sec id="sec-1-1">
        <title>2. The Need for Image-Based Music Recommendation</title>
        <p>Recommendation systems often rely on user behavior, such as the number of times a song is played, the songs a user reacts to on social media posts, or the types of images a user likes [1]. While music and image recommendations typically follow separate historical trends, there is value in combining them for purposes like social media post backgrounds or movie soundtracks. Existing research mainly focuses on recommending music from either image content [2], like objects, or context [1], such as facial expressions, without merging the two approaches.</p>
        <p>Music recommendations can benefit from considering both an image's subject and its emotional context [3]. For instance, a picture of a girl celebrating her birthday by cutting a cake might lead to a 'happy' song suggestion based on mood alone (even if the song is a wedding song), or to a dance-party track because balloons are visible. A more effective system would combine these approaches, ensuring the song choice matches both the image's content and its feeling. Hence, there is a demand for a music recommendation system that understands both aspects of an image.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3. Proposed Approach: Mood Melody Mapper (MMM)</title>
        <p>This paper proposes a proof of concept of a novel method called Mood Melody Mapper (MMM) to recommend music that maps seamlessly to the content, context, and sentiment of a given image. We explore a dynamic intersection of visual content and auditory experience to enhance user engagement through personalized music recommendations. Using Natural Language Processing (NLP), this method combines techniques such as image description generation and sentiment analysis of music lyrics to identify the mood. Each subsection below details one block of the system architecture in Figure 1, discussing the underlying algorithms and implementation specifics.</p>
        <sec id="sec-1-2-0">
          <title>3.1. Image Text Description Generation</title>
          <p>The first block of MMM (Block 1 in Figure 1) generates a text description of the given user input image. We propose to use an existing encoder-decoder model pair as the base model. The encoder processes an input image and compresses its information into a fixed-length representation, from which the decoder produces an output sequence: the text description of the image.</p>
          <p>Encoders such as CNNs (VGG16 or ResNet152) [4, 5] and decoders (LSTM, transformer, GRU) [6, 7] are used here. For example, the ResNet152 encoder, pretrained on ImageNet, is modified by removing its softmax layer, producing fixed-length vector embeddings from images, which are then used by the decoder to produce the text description. Training such an encoder-decoder model leverages datasets like the VizWiz-Caption dataset [8], which offers image-caption pairs. Given that images contain multiple objects and features, an attention mechanism like soft attention [9] is incorporated into the decoder to ensure nuanced image details are considered during caption generation.</p>
        </sec>
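<p>As a concrete illustration of the soft-attention step just described (a minimal sketch, not the authors' implementation), the following NumPy snippet computes attention weights over a grid of image region features and the resulting context vector. The dimensions, the random inputs, and the alignment matrix <monospace>W</monospace> are illustrative assumptions; in a real model they are learned.</p>

```python
import numpy as np

def soft_attention(region_feats, hidden, W):
    """Soft attention: weight image regions by relevance to the decoder state.

    region_feats: (num_regions, feat_dim) image region embeddings from the encoder
    hidden:       (hidden_dim,) current decoder hidden state
    W:            (feat_dim, hidden_dim) alignment matrix (learned; random here)
    """
    scores = region_feats @ W @ hidden       # (num_regions,) alignment scores
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # attention weights sum to 1
    context = weights @ region_feats         # weighted sum of region features
    return context, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 2048))        # e.g., a flattened 7x7 ResNet feature map
h = rng.normal(size=512)
W = rng.normal(size=(2048, 512)) * 0.01
ctx, w = soft_attention(regions, h, W)
```

At each decoding step the decoder would consume <monospace>ctx</monospace> alongside the previously generated word, so regions with higher weights contribute more to the next caption token.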
        <sec id="sec-1-2-2">
          <title>3.2. Song Recommendation Based on Cosine Similarity</title>
          <p>The initial set of song recommendations is derived using cosine similarity [10], comparing the text description generated in Block 1 to the lyrics of each song. In Block 2 of Figure 1, we represent both the text description and the song lyrics as vectors and calculate the cosine similarity between them. The system then recommends the top songs (say, the top 10) with the highest cosine similarity values.</p>
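<p>To make Block 2 concrete, here is a minimal self-contained sketch. The song titles, lyrics, and the tiny stop-word list are made up, and simple bag-of-words counts stand in for the Doc2Vec or BERT vectors an actual system would use.</p>

```python
import math
import string
from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "of", "and", "in", "on", "to"}  # tiny illustrative list

def preprocess(text):
    """Tokenize, lowercase, strip punctuation, drop stop words."""
    tokens = text.lower().translate(str.maketrans("", "", string.punctuation)).split()
    return [t for t in tokens if t not in STOP_WORDS]

def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(description, lyrics_by_song, top_n=10):
    """Rank songs by cosine similarity between the image description and their lyrics."""
    d = Counter(preprocess(description))
    scored = [(cosine_similarity(d, Counter(preprocess(lyr))), song)
              for song, lyr in lyrics_by_song.items()]
    scored.sort(reverse=True)
    return [song for _, song in scored[:top_n]]

songs = {  # hypothetical catalog
    "Party Lights": "balloons and cake we dance all night",
    "Quiet Rain": "grey skies and the quiet rain falls",
}
print(recommend("a girl cutting a birthday cake with balloons", songs, top_n=1))
# → ['Party Lights']
```

Swapping the Counter vectors for Doc2Vec or BERT embeddings changes only the vectorization step; the similarity ranking stays the same.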
          <p>To compute the cosine similarity here, both the song lyrics and the image text description go through preprocessing (tokenization, removal of stop words and punctuation), followed by conversion to vectors using methods like Doc2Vec [11] or BERT [12].</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>3.3. Sentiment Classification of the Image Text Description</title>
          <p>In Block 3 of Figure 1, the second batch of song recommendations is derived from the emotions conveyed in the image. Here we determine the emotional tone behind the image description text, distinguishing between positive, negative, and neutral tones, and further categorizing it into more specific emotions like happiness or worry [13, 14, 15]. A machine learning model, trained on a labeled sentiment dataset, assigns sentiment labels to the image descriptions. Sentiment analysis can be performed with models like LSTM [6] (suited to longer text sequences), CNN [16] (capturing local patterns), BERT [12] (capturing context in both directions), or a hybrid approach. Datasets like the SMILE Twitter dataset [17] can be used to train such a model.</p>
        </sec>
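<p>The classification step can be sketched with a toy model. The snippet below trains a tiny multinomial naive Bayes classifier (a deliberately simple stand-in for the LSTM/CNN/BERT options named above) on a handful of made-up labeled descriptions; a real system would train on a dataset such as SMILE.</p>

```python
import math
from collections import Counter

class TinyNaiveBayes:
    """Multinomial naive Bayes over unigrams -- a stand-in for the heavier
    sentiment models named in the text, trained on a toy labeled dataset."""

    def fit(self, texts, labels):
        self.classes = set(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for w in text.lower().split():
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        best, best_lp = None, -math.inf
        for c in self.classes:
            lp = math.log(self.class_counts[c] / sum(self.class_counts.values()))
            total = sum(self.word_counts[c].values())
            for w in text.lower().split():
                # Laplace smoothing so unseen words don't zero out a class
                lp += math.log((self.word_counts[c][w] + 1) / (total + len(self.vocab)))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy labeled sentiment data standing in for a real dataset
train_texts = [
    "a happy girl celebrating her birthday with cake",
    "friends smiling and dancing at a joyful party",
    "a sad man crying alone in the rain",
    "a gloomy empty street on a grey day",
]
train_labels = ["positive", "positive", "negative", "negative"]
clf = TinyNaiveBayes().fit(train_texts, train_labels)
print(clf.predict("children smiling at a birthday party"))  # → positive
```

The predicted label (here, a coarse positive/negative tone) is what Block 4 then maps onto the music-side emotion categories.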
      </sec>
      <sec id="sec-1-3">
        <title>3.4. Mapping Sentiments</title>
        <p>Sentiment analysis is then conducted on the song lyrics, though a pre-categorized music dataset can also be employed. In Block 4 of Figure 1, an emotion mapping between the image-text and music-text categories is executed, as their categorizations may differ, as illustrated in Figure 2. Based on this mapping, songs are recommended that align with the desired emotion. In complex mapping scenarios, techniques like transfer learning [18] or Canonical Correlation Analysis (CCA) [19] can be used.</p>
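<p>In the simple case, the Block 4 mapping can begin as a lookup table before resorting to transfer learning or CCA. The emotion and mood categories below are hypothetical placeholders (Figure 2 defines the actual ones), but they show the shape of the mapping:</p>

```python
# Hypothetical mapping from image-side emotion labels to music-side mood tags;
# the real categories come from the datasets used on each side (cf. Figure 2).
IMAGE_TO_MUSIC_MOOD = {
    "happiness": ["upbeat", "celebratory"],
    "sadness": ["melancholic", "slow"],
    "worry": ["calm", "soothing"],
    "neutral": ["ambient"],
}

def map_emotion(image_emotion, default=("ambient",)):
    """Return the music moods aligned with an image emotion, with a fallback."""
    return IMAGE_TO_MUSIC_MOOD.get(image_emotion, list(default))

print(map_emotion("happiness"))  # → ['upbeat', 'celebratory']
```

Learned approaches replace this table when the two label sets do not align one-to-one.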
      </sec>
      <sec id="sec-1-4">
        <title>3.5. Song Recommendation Aggregation</title>
        <p>The song recommendations from Block 2 and Block 5 are then aggregated to produce the final recommendation list in Block 6 of Figure 1. The aggregation can be simple, such as a union or weighted union of the two lists. A deduplication filter and a user feedback loop can also be applied when combining the lists. Metadata integration can further refine the rankings using additional data about each song (such as user ratings or play counts) where available; for instance, songs with higher user ratings might get a boost in the combined list.</p>
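<p>One possible realization of Block 6 (a sketch, not the paper's fixed design) is a weighted union scored by reciprocal rank, with deduplication falling out of the score dictionary and an optional metadata boost from user ratings. The song titles below are hypothetical:</p>

```python
def aggregate(content_recs, emotion_recs, ratings=None, w_content=0.5, w_emotion=0.5):
    """Weighted union of two ranked lists with deduplication.

    Each list contributes weight / rank per song; a song appearing in both
    lists accumulates both contributions (so duplicates merge rather than
    repeat). An optional ratings dict (song -> 0..5) boosts the score.
    """
    scores = {}
    for weight, recs in ((w_content, content_recs), (w_emotion, emotion_recs)):
        for rank, song in enumerate(recs, start=1):
            scores[song] = scores.get(song, 0.0) + weight / rank
    if ratings:
        for song in scores:
            scores[song] *= 1.0 + 0.1 * ratings.get(song, 0)  # metadata boost
    return sorted(scores, key=scores.get, reverse=True)

block2 = ["Party Lights", "Cake by the Lake", "Balloon Waltz"]  # hypothetical lists
block5 = ["Good Vibes", "Party Lights"]
print(aggregate(block2, block5))
# → ['Party Lights', 'Good Vibes', 'Cake by the Lake', 'Balloon Waltz']
```

"Party Lights" ranks first because it appears in both input lists, which is exactly the agreement signal the weighted union is meant to reward.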
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation Method</title>
      <p>A systematic method is needed to evaluate and select the models used in the different components highlighted in Figure 1. This section provides some details about the proposed evaluation methods:</p>
    </sec>
    <sec id="sec-2">
      <title>5. Applications</title>
      <p>There are various applications of the image-based music recommendation system:</p>
      <p>1. Social Media Platforms: Adding background music to social media posts using image-based music recommendations can increase user engagement, since the combination of image and music can attract users' attention [23].</p>
      <p>2. Soundtrack Generation: In apps that create digital photo albums, a suitable soundtrack can be added based on the images in the album. For example, an album of pictures spanning childhood to old age can have a soundtrack that transitions with the images rather than a single static memories track [3].</p>
      <p>3. Event Playlists: Users can upload event images, such as parties or road trips, and the system can curate playlists specific to the event's theme [3].</p>
      <p>4. Interior Decor Music: Images of home interiors and decor preferences can yield music recommendations that align with the user's design aesthetic [24].</p>
    </sec>
    <sec id="sec-3">
      <title>6. Conclusion and Future Vision</title>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>