<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Talking to Your Recs: Multimodal Embeddings For Recommendation and Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergio Oramas</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andres Ferraro</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alvaro Sarasua</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabien Gouyon</string-name>
        </contrib>
        <aff>SiriusXM, Oakland</aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) excel at understanding complex natural language requests, and even providing recommendations, but they often rely on incomplete or outdated data with respect to platform catalogs. Training or fine-tuning LLMs for these custom catalogs is both costly and challenging. To address this, we propose a method that leverages pre-trained large text embedding models to generate embeddings from catalog descriptions, enriched with multimodal content such as audio or images, as well as collaborative filtering data, using contrastive learning. The resulting enriched embeddings are well-suited for both recommendation and textual search tasks, enabling applications like filtered recommendations, playlist continuation, and playlist generation from text. We evaluate our method through experiments on item recommendation and retrieval, using a real-world music streaming dataset. Our results show substantial improvements in recommendation performance and competitive retrieval performance when compared to off-the-shelf text embeddings and traditional search baselines. We also validate our approach on a public movie dataset, demonstrating its generalizability. Our findings highlight the potential of enhancing language models with additional information and the versatility of our method across diverse domains and applications, all without the need for fine-tuning or training multimodal LLMs from scratch, thereby reducing computational costs.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Multimodal Recommendation</kwd>
        <kwd>Retrieval</kwd>
        <kwd>Content-aware Recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) are quickly expanding to new applications and showing beneficial uses
in multiple domains, especially in music. LLMs showcase an exceptional understanding of language,
and are starting to be leveraged in conversational recommender systems and applications such as the
so-called AI-generated playlists [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, these large models are trained with general information
and therefore need to be adapted when applied in a specific context [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The disparity between the
knowledge used to train these general-purpose models and the unique internal knowledge and entity
catalogues of a company, coupled with the need for continuous knowledge updates, renders these
models inadequate for off-the-shelf use in real-world music recommendation applications. Moreover, the
expense of fine-tuning or retraining these models to align with an in-house catalog remains significantly
high. Approaches like Retrieval Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] present a promising solution to this
challenge without the need for model fine-tuning or retraining. This is achieved by conducting a vector
search on a corpus of in-house document embeddings. These documents, which usually include textual
descriptions of entities, are transformed into embeddings by feeding them into text encoders from large
language models optimized for semantic similarity. However, these document embeddings may not
fully capture a company’s internal knowledge, such as user feedback or content descriptors, making
them suitable for retrieval but less effective for recommendations. In this work, we aim to address
this issue by combining internal knowledge from various modalities with text embeddings, thereby
enhancing the recommendation capabilities of these text embeddings, without the need to retrain or
fine-tune the large language models used to generate them.
      </p>
      <p>
        Adapting LLMs specifically to recommendation tasks has gained recent attention. A review can
be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Examples of recent work include methods to leverage LLMs for generating
descriptions used for recommendation [
        <xref ref-type="bibr" rid="ref5">5</xref>
          ], or an analysis of the performance of prompting LLMs for
recommendation compared with state-of-the-art recommenders [
        <xref ref-type="bibr" rid="ref6">6</xref>
          ]. However, recent literature suggests
that the recommendation performance of off-the-shelf LLMs is generally suboptimal, and that further
research is needed to adapt them to state-of-the-art recommendation methods and data [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ].
      </p>
      <p>
        Successful recommendation methods typically leverage multiple modalities of data: user
collaborative data, content features, or both. These are areas where LLMs trained on general information
may still be deficient. Recommendation with multiple modalities (in particular, including
content-based methods) has typically been applied with the goal of alleviating cold-start and sparsity issues
(e.g. [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]), and also to provide explainable recommendations (e.g. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]). In particular, the specific
domain of music retrieval and recommendation has shown to be a particularly rich playground for
exploring the worth of diverse modalities in addition to collaborative filtering (e.g. audio descriptors [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ],
document similarity [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], or graphs of musical connections [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]). Recent work demonstrated
that pre-trained models covering several modalities can be successfully combined for music retrieval
and recommendation [
        <xref ref-type="bibr" rid="ref16">16</xref>
          ], and that combining different sources and types of data is particularly
promising for mitigating current music recommendation limitations [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Research in the music domain
showed that combining diverse modalities for recommendation can be done in a variety of ways, from
simple concatenation [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], to predicting one modality from another [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], or multimodal contrastive
learning [
        <xref ref-type="bibr" rid="ref20">20</xref>
          ]. Metric learning has also proved, in other domains, to be an effective method for combining
heterogeneous and complementary information (users’ feedback data, audio, image, or text) to improve
the quality of the recommendations (e.g. [
        <xref ref-type="bibr" rid="ref21 ref22 ref23">21, 22, 23</xref>
        ]).
      </p>
      <p>In this work, we examine what we believe is a timely and relatively novel hypothesis: that combining
text embeddings derived from item descriptions with data from multiple modalities can improve the
recommendation performance of those embeddings while retaining comparable performance at
retrieval.</p>
      <p>
        In the remainder of this paper, we evaluate an approach that combines text embeddings derived from
item descriptions with audio or image, and collaborative filtering data. Item descriptions are created
by either inserting tags into a predefined template or by instructing an LLM to generate a description
based on these tags —similarly to previous work [
        <xref ref-type="bibr" rid="ref24 ref25 ref26">24, 25, 26</xref>
          ]. Text embeddings are then generated by
passing these descriptions through an off-the-shelf pre-trained text embedding model. Embeddings from
the different modalities are then integrated into a shared latent space using contrastive learning, and
multimodal embeddings are obtained by averaging the projections of the different modality embeddings
in this shared space, following the methodology described in [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Our evaluation includes a series
of experiments involving item recommendation and textual search tasks, employing two datasets: a
proprietary dataset from the music domain and an open dataset from the movie domain.
      </p>
      <p>In summary, this work introduces a novel method that enhances pre-trained text embeddings with
multimodal content, such as audio, images, and collaborative filtering data, through contrastive learning.
This approach significantly improves recommendation performance while maintaining competitive
retrieval capabilities, all without the need for fine-tuning or retraining large language models. The
method demonstrates robust generalizability across domains, as evidenced by its successful application
to both music and movie datasets. Additionally, it opens up versatile possibilities for multimodal
retrieval and personalized search, offering a cost-effective solution for improving recommendation
systems in diverse contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <sec id="sec-2-1">
        <title>2.1. Multimodal Contrastive Method</title>
        <p>
          The proposed method combines information from diverse item modalities, following a similar
methodology to the one described in [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], but replacing the tag embedding encoder with a text embedding
encoder. Our model receives three pre-trained input embeddings, each representing a distinct
modality, and returns a new multimodal embedding. One of the input embeddings represents item
textual descriptions encoded with a pre-trained text embedding model, another represents user
Collaborative Filtering (CF) information obtained through matrix factorization, and the third represents
content features related to the application domain —audio or images— (an illustration is provided in
Figure 1; see below for more details on input embedding computation).
        </p>
        <p>
          For the proposed model we apply a contrastive learning loss based on InfoNCE [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]. Specifically, we
define the contrastive loss between two modalities, u and v, as:
ℒ_{u,v} = ∑_{i=1}^{2M} −log [ Ξ(z_i, z_ĩ, τ) / ∑_{j=1}^{2M} ⊮[j≠i] Ξ(z_i, z_j, τ) ],
where M is the batch size, τ is the temperature parameter, and z_1, …, z_{2M} denote the projections of
the M batch items in modalities u and v.
        </p>
        <p>We define Ξ(a, b, τ) = exp((cos(a, b) − 1)/τ), based on the cosine similarity. The positive index ĩ is
defined as i + M if i ≤ M, and i − M otherwise. This loss function attempts to minimize the distance between the
representations of the modalities of the same item while maximizing the distance between any representation
of modalities from other items.</p>
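        <p>For concreteness, the pairwise loss can be sketched in a few lines of numpy (a minimal sketch; the function and variable names are ours, not part of the method’s implementation):</p>

```python
import numpy as np

def xi(a, b, tau):
    """Similarity kernel: exp((cos(a, b) - 1) / tau)."""
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.exp((cos - 1.0) / tau)

def contrastive_loss(u, v, tau=0.1):
    """Pairwise contrastive loss between two modalities.

    u, v: arrays of shape (M, d) holding the projected embeddings of the
    same M batch items in each modality; row i of u and row i of v form
    the positive pair for item i.
    """
    m = len(u)
    z = np.concatenate([u, v])           # 2M projections in the shared space
    total = 0.0
    for i in range(2 * m):
        pos = (i + m) % (2 * m)          # index of the positive pair
        num = xi(z[i], z[pos], tau)
        den = sum(xi(z[i], z[j], tau) for j in range(2 * m) if j != i)
        total += -np.log(num / den)
    return total
```

        <p>Aligning the rows of the two modality matrices (positives identical) should yield a lower loss than misaligning them, since each anchor’s true counterpart then sits in the numerator rather than only in the denominator.</p>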
        <p>
          We employ three encoders, each dedicated to a specific modality, to generate three representations
within our shared space for every item (see Figure 1). Each encoder is a simple feed-forward network
with one or two dense layers and ReLU activation. During training, we learn the parameters of these
encoders by minimizing the cumulative pairwise losses between modalities¹, as in [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. The objective
function, denoted as ℒtot, comprises the sum of losses from all pairwise combinations: ℒAudio-Text, ℒAudio-CF, and ℒText-CF.
        </p>
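        <p>The encoder architecture described in footnote 1 can be sketched as follows (a forward-pass-only numpy sketch with randomly initialized weights; training and the 0.3 dropout are omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

class ModalityEncoder:
    """Feed-forward projection: one 256-unit ReLU hidden layer, 200-d output."""

    def __init__(self, in_dim, hidden=256, out_dim=200):
        # Random initialization only; the trained encoders also use dropout 0.3.
        self.w1 = rng.normal(scale=in_dim ** -0.5, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=hidden ** -0.5, size=(hidden, out_dim))
        self.b2 = np.zeros(out_dim)

    def __call__(self, x):
        h = np.maximum(x @ self.w1 + self.b1, 0.0)   # ReLU
        return h @ self.w2 + self.b2
```

        <p>One such encoder is instantiated per modality, and all are optimized jointly under ℒtot.</p>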
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Multimodal embeddings for Recommendation and Retrieval</title>
        <p>Once the model is trained with the contrastive method and we want to use it for inference, we obtain the
multimodal embedding by averaging the output of each internal encoder for a given item. Following this
procedure, we compute a multimodal embedding for every item, and then we can use these embeddings
either for recommendation or retrieval.</p>
        <p>For recommendation, the multimodal embeddings can be used as item features in any content-based
or hybrid recommendation approach. For retrieval, given a text query, we first compute the query
embedding by passing the text through a pre-trained text embedding model, followed by projecting it
using our model’s text encoder. We then use the obtained embedding to perform a nearest neighbour
search on the space of all the multimodal item embeddings (see Figure 2).</p>
        <p>¹All the encoders in the evaluated models had one hidden layer of 256 units, an output layer of 200 units,
a dropout of 0.3, and a batch size of 512.</p>
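        <p>Inference and retrieval can be sketched as follows (illustrative names; the encoder arguments stand in for our trained modality encoders):</p>

```python
import numpy as np

def multimodal_embedding(encoders, inputs):
    """Average each modality's projection in the shared space (inference)."""
    return np.mean([encoders[m](inputs[m]) for m in encoders], axis=0)

def retrieve(query_emb, text_encoder, item_embs, k=10):
    """Project the query with the text encoder, then run cosine k-NN over items."""
    q = text_encoder(query_emb)
    q = q / np.linalg.norm(q)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ q                  # cosine similarity per item
    return np.argsort(-scores)[:k]      # indices of the top-k items
```

        <p>In practice the nearest-neighbour search over a large catalog would use an approximate index, but the scoring is the same cosine similarity.</p>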
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental setup</title>
      <p>Our primary focus is to evaluate our method in the music domain, but to test its efficacy across various
scenarios and domains, we employ two distinct datasets: a dataset with information about artists
collected from a music streaming platform, which contains high-quality single-modality embeddings
pretrained on industry-scale data, and a publicly available dataset from the movies domain. Subsequently,
we conduct recommendation and retrieval experiments on each dataset.</p>
      <sec id="sec-3-1">
        <title>3.1. Music dataset and experiments</title>
        <p>
          To evaluate our approach we collected tags and pre-computed collaborative and audio embeddings of
artists from a music streaming platform. Collaborative embeddings come from an internal process of
matrix factorization of explicit feedback, whereas audio embeddings come from an internal machine
learning method [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. The details of these two collections of embeddings are beyond the scope of this
paper; we simply use them as input to our method, but they were computed with industry-scale, high-quality
data. The tags of an artist were manually annotated by curators and provide a rich description
of the artist’s genre, mood, location, decade, and musicological characteristics.
        </p>
        <p>We generate two types of descriptions for each music artist using their respective sets of tags. The first
type (the template-based description) is created by employing a template that organizes the tags into categorized
lists of words. A second version of the descriptions (the LLM-generated description) is created by tasking an
LLM (Claude 3 Haiku) to enhance the semantic expressiveness of the template-based description. The LLM can
also incorporate its own knowledge about the described item, if available (see examples in Table 1).
We generated these alternative descriptions to examine whether a text exhibiting a smoother natural
language flow can aid the model in any of the tasks.</p>
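        <p>As an illustration, a template-based description could be assembled as follows (the tag categories and wording here are our assumptions; the paper’s exact template is not shown):</p>

```python
def template_description(artist, tags):
    """Organize an artist's curated tags into categorized lists of words."""
    parts = [artist + "."]
    for category in ("genre", "mood", "location", "decade"):
        values = tags.get(category, [])
        if values:
            parts.append(category.capitalize() + ": " + ", ".join(values) + ".")
    return " ".join(parts)
```

        <p>The resulting string is what gets fed to the pre-trained text embedding model, or handed to the LLM for semantic enrichment.</p>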
        <p>In the remainder of this work, we refer to the model trained with the contrastive method on
template-based descriptions as the template model, and to the model trained on descriptions generated
from the tags with an LLM as the LLM model. We also combine template-based descriptions with
LLM-generated descriptions for training the model, and we refer to this as the template+LLM model.
In this latter approach, the model is trained with two text embeddings per item: one derived from the
template-based descriptions and the other from the LLM-generated ones.</p>
        <p>We train our contrastive model with the three modality embeddings —CF, text, and audio— of 31,605
artists for the different types of descriptions. As described in Section 2.1, the training objective is
self-supervised and not directly related to the two downstream tasks we evaluate; thus, our
model is not optimized for either task.</p>
        <p>
          The recommendation task is evaluated in an item-to-item recommendation scenario. For this
evaluation, we used the publicly available OLGA dataset [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], used for artist similarity and artist-to-artist
recommendation. From this dataset we collected a ground truth of 6,537 artists and a list of positive
recommendations for each artist. We then evaluate our embeddings by retrieving the k-NN of
each artist and comparing them with the ground truth.
        </p>
        <p>
          For the retrieval task, we randomly select sets of 1, 2, or 3 tags for which, given a query, there are at
least 30 positive results in the dataset. We combine each set of tags with a query template and then
utilize an LLM (Anthropic’s Claude v3 Haiku model [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]) to enhance the syntactic variability of the
queries (see Table 2 for some examples).
        </p>
        <p>
          We experimented with various prompts for the LLM until
we achieved a set of queries that were both satisfactory and semantically rich. Following this method,
we create a dataset of 3000 queries to evaluate the retrieval task. The ground truth for each query
comprises items associated with the corresponding tags used to generate that query.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Movie dataset and experiments</title>
        <p>
          To corroborate our findings in a different domain, we use the user-movie ratings from the
Movielens-25M dataset² [31] combined with movie tags from the Movielens Tag Genome Dataset 2021³ [32] and
image embeddings released in [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. These image embeddings were generated from the posters of the
movies with the CLIP pre-trained model [33].
        </p>
        <p>The CF embeddings for this dataset are obtained using weighted matrix factorization [34] based on
the Movielens-25M ratings, with hyperparameters tuned via random search. Ratings with a value higher
than 3 are considered positive during the evaluation. We reserve 10% of users for the evaluation of the
recommendation task and use the rest to compute the item CF embeddings.</p>
        <p>Text embeddings are created in the same way as in the music dataset, by creating two types of
descriptions (template-based and LLM-generated) using the tags and then feeding them into a pre-trained text
embedding model (see examples in Table 3). We then train our contrastive model with the three
modality embeddings —CF, text, and image— of 59,040 items, training one model for each of the item
description types.</p>
        <p>The recommendation task is also evaluated using a naive k-NN approach. Our intention in both
datasets is to compare different features in a recommendation setting, not to compare different
recommendation approaches. To generate recommendations, we randomly select 50% of each test user’s
positively rated items and use them to compute recommendations. For this, we calculate the average
embedding of the selected items and search for the top 200 nearest neighbors. We then compare these
recommendations with the remaining items that the users have positively rated.</p>
        <p>²https://grouplens.org/datasets/movielens/25m/
³https://grouplens.org/datasets/movielens/tag-genome-2021</p>
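        <p>This evaluation procedure can be sketched as follows (a simplified numpy sketch; the function and metric names are illustrative):</p>

```python
import numpy as np

def recommend(item_embs, seed_items, k=200):
    """Average the seed items' embeddings, then return the top-k nearest items."""
    profile = item_embs[seed_items].mean(axis=0)
    profile = profile / np.linalg.norm(profile)
    items = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    scores = items @ profile
    scores[seed_items] = -np.inf        # never re-recommend the seed items
    return np.argsort(-scores)[:k]

def hit_rate(recommended, held_out):
    """Fraction of held-out positives found among the recommendations."""
    return len(set(recommended).intersection(held_out)) / len(held_out)
```

        <p>Each test user contributes one seed set (half of their positives) and one held-out set (the other half), and the same procedure runs unchanged over CF, text, image, or multimodal item embeddings.</p>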
        <p>The retrieval task is evaluated in a manner similar to the music experiment. Using the available
tagging data, we created an evaluation dataset consisting of 704 queries following the same methodology.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Baselines</title>
        <p>We compare the performance of our multimodal embeddings against off-the-shelf text embedding
models, which enable both recommendation and retrieval tasks. We selected the best-performing text
embedding models available on Hugging Face at the time of writing, testing
WhereIsAI/UAE-Large-V1 [35] and intfloat/e5-large-v2 [36]. We only report results of the
proposed model trained with WhereIsAI/UAE-Large-V1 embeddings, as it provided the best performance
of the two. In addition, we compare against the individual input modality embeddings
—CF, Audio, and Image— in the recommendation task only, as these cannot be used directly for
retrieval. Finally, we compare retrieval performance with the standard lexical search baseline
BM25 [37].</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Recommendation</title>
        <p>Examining the recommendation performance across both datasets (see Table 4), we observe that the
proposed model outperforms both the text embedding models and the individual modality embeddings.</p>
        <p>We also observe that there are no significant differences in recommendation performance between
LLM-generated and template-based descriptions. However, we notice a slight improvement in recommendation
performance when combining LLM-generated and template-based descriptions of the items while training the
proposed model. Combining multiple descriptions of the items generated with multiple methods seems
to work as an up-sampling technique, which may suggest potential benefits for learning a better space, as
showcased in both datasets. A deeper study is necessary to understand the reasons behind this, but the
LLM may have introduced some internal knowledge into the LLM-generated descriptions that is not present
in the template-based ones, while at the same time omitting some tags present in the template-based descriptions.
Therefore, the combination may provide a more complete description of the item. It is noteworthy that
achieving similar or higher performance with the proposed method compared to the CF baseline is
particularly relevant. This indicates that the model effectively captures the collaborative information
and successfully integrates it with data from other modalities.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Retrieval</title>
        <p>Looking at the results for the retrieval task in both datasets, we observe that the text embedding
models perform slightly better than the proposed models. However, we observe a good performance of
the proposed method overall, considering that it enables natural language queries
while improving recommendation capabilities. The two text embedding models tested show
different performance depending on the dataset: intfloat/e5-large-v2 is better on the Music dataset and
WhereIsAI/UAE-Large-V1 is better on the Movielens dataset. All the embedding-based approaches,
including the proposed model and the off-the-shelf text embedding models, outperform the lexical search baseline
BM25. This implies that semantic search in embedding spaces behaves better than traditional lexical
search for complex tag-based queries like those in our evaluation dataset.</p>
        <p>We also observe that the proposed method performs closer to the text embedding models (while still
slightly worse) in the music domain than in the movies domain. This can be attributed to differences
in input feature quality. For instance, the Music dataset includes high-quality manual annotations
from music experts, providing more comprehensive descriptions than those created from Movielens
tags. Additionally, CF and audio embeddings significantly outperform text embeddings in the music
recommendation task, whereas CF and image embeddings are on par with text embeddings in the
movie recommendation task. This highlights the disparity in input embedding quality between the two
datasets, which seems to be a key factor in the multimodal model’s performance.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Further applications</title>
        <p>In this work, we analyzed the characteristics of the embeddings generated by our approach. Results
demonstrate its capabilities for text retrieval, while at the same time, showcasing an improved
organization of items based on similarity — as evidenced in the k-NN recommendation evaluations. This implies
that the top-k results of a text query using our model exhibit higher similarity among items than those
retrieved by off-the-shelf text embedding models. Based on these results, we see that using our model
would be particularly useful for playlist generation from text, where the objective is not only to satisfy
a query, but also to provide a coherent listening experience.</p>
        <p>In addition, the proposed model offers versatile possibilities beyond text retrieval, supporting queries
from various modalities. For instance, it can handle queries using an audio piece, or a combination of
audio and text, or a collaborative embedding representing a user profile alongside a text query, thus
enabling personalized search.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we introduced a method based on contrastive learning that enhances off-the-shelf text
embeddings with multimodal information, avoiding the need for retraining or fine-tuning an LLM. We
assessed its effectiveness in recommendation and retrieval tasks across various domains.</p>
      <p>In our evaluations, we found that the proposed multimodal embeddings outperform baselines in
the recommendation task in both music and movies domains, while enabling natural language search
—which is not directly possible for collaborative filtering, audio or image features alone. While our
model demonstrates competitive performance in retrieval tasks, it slightly trails behind off-the-shelf text
embeddings. The robust performance of these text embedding models underscores their versatility in
various applications and domains, yielding robust item representations beneficial not only for retrieval
but also recommendation. However, our findings indicate that their recommendation capabilities
significantly improve when integrated with collaborative filtering and content-based embeddings.</p>
      <p>Our experiments also demonstrate that combining different types of textual item descriptions
enhances performance in both tasks. Exploring the reasons behind this improvement and investigating
how additional types of descriptions can further boost performance is a promising area for future
research. Future work could also involve expanding experiments to diverse datasets and domains,
comparing with more text embedding models, and exploring diverse modalities. Moreover, this model
opens up a myriad of possibilities for playlist creation, multimodal retrieval, and personalized search
that are worth exploring. Lastly, given the reliance on large internet-trained models, careful analysis
of potential biases and risks in recommendations is essential.</p>
      <p>[30] Anthropic, The Claude 3 model family: Opus, Sonnet, Haiku, Papers With Code (2024). URL: https://paperswithcode.com/paper/the-claude-3-model-family-opus-sonnet-haiku.
[31] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015). URL: https://doi.org/10.1145/2827872. doi:10.1145/2827872.
[32] J. Vig, S. Sen, J. Riedl, The tag genome: Encoding community knowledge to support novel interaction, ACM Trans. Interact. Intell. Syst. 2 (2012). URL: https://doi.org/10.1145/2362394.2362395. doi:10.1145/2362394.2362395.
[33] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[34] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), 2008, pp. 263–272.
[35] X. Li, J. Li, Angle-optimized text embeddings, arXiv preprint arXiv:2309.12871 (2023).
[36] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text embeddings by weakly-supervised contrastive pre-training, arXiv preprint arXiv:2212.03533 (2022).
[37] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          ,
          <article-title>Deezer's new AI playlist producer challenges Spotify, Amazon, YouTube Music to a DJ battle</article-title>
          , TechRadar (
          <year>2024</year>
          ). URL: https://www.techradar.com/computing/artificial-intelligence/deezers-new-ai-playlist-producer-challenges-spotify-amazon-youtube-music-to-a-dj-battle.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Naveed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. U.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saqib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anwar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Usman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Barnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mian</surname>
          </string-name>
          ,
          <article-title>A comprehensive overview of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.06435</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
, <string-name><given-names>W.-t.</given-names> <surname>Yih</surname></string-name>, <string-name><given-names>T.</given-names> <surname>Rocktäschel</surname></string-name>,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
<year>2021</year>. arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
<string-name><given-names>Y.</given-names> <surname>Zhang</surname></string-name>, <article-title>Tutorial on large language models for recommendation</article-title>, <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>, RecSys '23, Association for Computing Machinery, New York, NY, USA, <year>2023</year>, p. <fpage>1281</fpage>-<lpage>1283</lpage>. URL: https://doi.org/10.1145/3604915.3609494. doi:10.1145/3604915.3609494.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Acharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Onoe</surname>
          </string-name>
          ,
          <article-title>Llm based generation of item-description for recommendation system</article-title>
          ,
<source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>, RecSys '23, Association for Computing Machinery, New York, NY, USA, <year>2023</year>, p. <fpage>1204</fpage>-<lpage>1207</lpage>. URL: https://doi.org/10.1145/3604915.3610647. doi:10.1145/3604915.3610647.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
<string-name><given-names>D.</given-names> <surname>Di Palma</surname></string-name>, <article-title>Retrieval-augmented recommender system: Enhancing recommender systems with large language models</article-title>, <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>, RecSys '23, Association for Computing Machinery, New York, NY, USA, <year>2023</year>, p. <fpage>1369</fpage>-<lpage>1373</lpage>. URL: https://doi.org/10.1145/3604915.3608889. doi:10.1145/3604915.3608889.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>He</surname>
          </string-name>
          ,
<article-title>TallRec: An effective and efficient tuning framework to align large language model with recommendation</article-title>
          ,
          <source>in: Proceedings of the 17th ACM Conference on Recommender Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1007</fpage>
          -
          <lpage>1014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Su</surname>
          </string-name>
, <string-name><given-names>S.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Wang</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Yin</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Huang</surname></string-name>
          ,
          <article-title>Representation learning with large language models for recommendation</article-title>
          ,
<source>in: Proceedings of the ACM on Web Conference 2024</source>, <year>2024</year>. URL: http://dx.doi.org/10.1145/3589334.3645458.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Multimodal meta-learning for cold-start sequential recommendation</article-title>
          ,
          <source>in: Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management, CIKM '22</source>
, Association for Computing Machinery, New York, NY, USA, <year>2022</year>, p. <fpage>3421</fpage>-<lpage>3430</lpage>. URL: https://doi.org/10.1145/3511808.3557101. doi:10.1145/3511808.3557101.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Ai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <article-title>Joint representation learning for top-n recommendation with heterogeneous information sources</article-title>
          ,
          <source>in: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management</source>
, CIKM '17, Association for Computing Machinery, New York, NY, USA, <year>2017</year>, p. <fpage>1449</fpage>-<lpage>1458</lpage>. URL: https://doi.org/10.1145/3132847.3132892. doi:10.1145/3132847.3132892.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
<article-title>PEVAE: A hierarchical VAE for personalized explainable recommendation</article-title>, <source>in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>, SIGIR '22, Association for Computing Machinery, New York, NY, USA, <year>2022</year>, p. <fpage>692</fpage>-<lpage>702</lpage>. URL: https://doi.org/10.1145/3477495.3532039. doi:10.1145/3477495.3532039.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pohle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schnitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knees</surname>
          </string-name>
, <string-name><given-names>G.</given-names> <surname>Widmer</surname></string-name>, <article-title>On rhythm and general music similarity</article-title>,
          <source>in: ISMIR</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hauger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbano</surname>
          </string-name>
          ,
          <article-title>Harvesting microblogs for contextual music similarity estimation: a co-occurrence-based framework</article-title>
          ,
          <source>Multimedia Systems</source>
          <volume>20</volume>
          (
          <year>2014</year>
          )
          <fpage>693</fpage>
          -
          <lpage>705</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sordo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Espinosa-Anke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <article-title>A semantic-based approach for artist similarity</article-title>
          ,
          <source>in: ISMIR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>F.</given-names>
            <surname>Korzeniowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gouyon</surname>
          </string-name>
          ,
          <article-title>Artist similarity for everyone: A graph neural network approach</article-title>
          ,
          <source>Transactions of the International Society for Music Information Retrieval</source>
          <volume>5</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Won</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gouyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Serra</surname>
          </string-name>
          ,
          <article-title>Multimodal metric learning for tag-based music retrieval</article-title>
, <source>in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>591</fpage>
          -
          <lpage>595</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <article-title>Music cold-start and long-tail recommendation: Bias in deep representations</article-title>
          ,
          <source>in: Proceedings of the 13th ACM Conference on Recommender Systems</source>
          ,
          <year>2019</year>
          , p.
          <fpage>586</fpage>
          -
          <lpage>590</lpage>
          . URL: https://doi.org/10.1145/3298689.3347052.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nieto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Barbieri</surname>
          </string-name>
          ,
<string-name><given-names>X.</given-names> <surname>Serra</surname></string-name>, <article-title>Multi-label music genre classification from audio, text, and images using deep features</article-title>
          ,
          <source>in: ISMIR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
<string-name><given-names>A.</given-names> <surname>Van den Oord</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Dieleman</surname></string-name>, <string-name><given-names>B.</given-names> <surname>Schrauwen</surname></string-name>
          ,
          <article-title>Deep content-based music recommendation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Oramas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gouyon</surname>
          </string-name>
          ,
          <article-title>Contrastive learning for cross-modal artist retrieval</article-title>
          ,
          <source>in: ISMIR</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2308.06556.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
, <string-name><given-names>Z.</given-names> <surname>Cheng</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Sun</surname></string-name>,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
<string-name><given-names>M.</given-names> <surname>Kankanhalli</surname></string-name>, <article-title>User diverse preference modeling by multimodal attentive metric learning</article-title>, <source>in: Proceedings of the 27th ACM International Conference on Multimedia, MM '19</source>, Association for Computing Machinery, New York, NY, USA, <year>2019</year>, p. <fpage>1526</fpage>-<lpage>1534</lpage>. URL: https://doi.org/10.1145/3343031.3350953. doi:10.1145/3343031.3350953.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
<string-name><given-names>C.-K.</given-names> <surname>Hsieh</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Yang</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Cui</surname></string-name>, <string-name><given-names>T.-Y.</given-names> <surname>Lin</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Belongie</surname></string-name>, <string-name><given-names>D.</given-names> <surname>Estrin</surname></string-name>
          ,
          <article-title>Collaborative metric learning</article-title>
          ,
          <source>in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE</source>
          ,
          <year>2017</year>
          , p.
          <fpage>193</fpage>
          -
          <lpage>201</lpage>
. URL: https://doi.org/10.1145/3038912.3052639. doi:10.1145/3038912.3052639.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>I.</given-names>
            <surname>Avas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Allein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Laenen</surname>
          </string-name>
          ,
<string-name><given-names>M.-F.</given-names> <surname>Moens</surname></string-name>
          ,
<article-title>Align MacridVAE: Multimodal alignment for disentangled recommendations</article-title>
          ,
          <source>in: European Conference on Information Retrieval</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Doh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nam</surname>
          </string-name>
, <article-title>LP-MusicCaps: LLM-based pseudo music captioning</article-title>
          ,
          <source>in: ISMIR</source>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2307.16372.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>McKee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Salamon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sivic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Language-guided music recommendation for video via prompt analogies</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>14784</fpage>
          -
          <lpage>14793</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Durand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Stoller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Bittner</surname>
          </string-name>
          ,
<article-title>LLark: A multimodal instruction-following language model for music</article-title>
          ,
          <source>in: Forty-first International Conference on Machine Learning</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
<string-name><given-names>A.</given-names> <surname>van den Oord</surname></string-name>, <string-name><given-names>Y.</given-names> <surname>Li</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Vinyals</surname></string-name>, <article-title>Representation learning with contrastive predictive coding</article-title>, <source>arXiv preprint arXiv:1807.03748</source> (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Favory</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Drossos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdanov</surname>
          </string-name>
          ,
          <article-title>Enriched music representations with multiple cross-modal contrastive learning</article-title>
          ,
          <source>IEEE Signal Processing Letters</source>
          <volume>28</volume>
          (
          <year>2021</year>
          )
          <fpage>733</fpage>
          -
          <lpage>737</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
<string-name><given-names>M. C.</given-names> <surname>McCallum</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Korzeniowski</surname></string-name>, <string-name><given-names>S.</given-names> <surname>Oramas</surname></string-name>, <string-name><given-names>F.</given-names> <surname>Gouyon</surname></string-name>, <string-name><given-names>A. F.</given-names> <surname>Ehmann</surname></string-name>
          ,
          <article-title>Supervised and unsupervised learning of audio representations for music understanding</article-title>
          ,
          <source>in: ISMIR</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
<string-name><surname>Anthropic</surname></string-name>, <article-title>The Claude 3 model family: Opus, Sonnet, Haiku</article-title>. URL: https://paperswithcode.com/paper/the-claude-3-model-family-opus-
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>